Computers Are Learning To “Think” By Listening To MIDI Music

Scientists at Google and elsewhere are turning to the 30-year-old digital music standard MIDI to teach neural networks how to write music.

[Photo: Flickr user Sigmadp2j]

In May, Google research scientist Douglas Eck left his Silicon Valley office to spend a few days at Moogfest, a gathering for music, art, and technology enthusiasts deep in North Carolina’s Smoky Mountains. Eck told the festival’s music-savvy attendees about his team’s new ideas about how to teach computers to help musicians write music–generate harmonies, create transitions in a song, and elaborate on a recurring theme. Someday, the machine could learn to write a song all on its own.


Eck hadn’t come to the festival–which was inspired by the legendary creator of the Moog synthesizer and peopled with musicians and electronic music nerds–simply to introduce his team’s challenging project. For machines to “learn” how to create art and music, he and his colleagues need users to feed them tons of data, using MIDI, a format more often associated with dinky video game sounds than with complex machine learning.

An example of a composition in MIDI.

Researchers have been experimenting with AI-generated music for years. Scientists at Sony’s Computer Science Laboratory in France recently released what some have called the first AI-generated pop songs, composed from their in-house AI algorithms (although they were arranged by a human musician, who also wrote their lyrics). Their AI platform, FlowMachines, has also composed jazz and classical scores in the past using MIDI. Eck’s talk at Moogfest was a prelude to a Google research program called Magenta, which aims to write code that can learn how to generate art, starting with music.

A song cowritten by Sony’s FlowMachines algorithm.

Listening to and making music is worth pursuing because, researchers say, both activities can help intelligent systems achieve the holy grail of intelligence: cognition. Just as computers are starting to evolve from simply reading text to understanding speech, computers might start to regularly interpret and generate their own music.

“You can learn an awful lot about language by studying text. MIDI gives us the musical equivalent. The more we understand about music creation and music perception, the more we’ll understand general, important aspects of communication and cognition,” says Eck, now a research scientist on Google’s Magenta project.


From Crashing Computers To Making Them More Creative

As synthesizers gained popularity in the 1970s and 1980s, engineers started to experiment with ways to get their electronic instruments to communicate with each other. The result was the Musical Instrument Digital Interface, or MIDI, which the music industry adopted as a technical standard in 1983 after its creators, Dave Smith and Ikutaro Kakehashi, made it royalty-free, offering up the idea for the world to use.

“In hindsight, I think it was the right thing to do,” Smith told Fortune in 2013. “We wanted to be sure we had 100% participation, so we decided not to charge any other companies that wanted to use it.”

Personal computers soon evolved to read and store MIDI files, which reduce high-level, abstract pieces of music into machine-readable data in a very compact format (a song stored in a 4 MB MP3 file would be a mere few hundred kilobytes in MIDI). MIDI would become standard on electronic instruments, from keyboards and drum machines to MIDI guitar controllers and electronic drum kits. Music composed through MIDI has powered the rise of dance, techno, house, and drum and bass music, and its sound can be heard in most television and film scores.

MIDI is the symbolic representation of music, just like text is the symbolic representation of speech. “MIDI itself doesn’t contain sound—it’s just instructions,” says Jonathan Lee, a musician who specializes in MIDI.

One MIDI link can carry up to 16 channels of data, indicating information about things like musical notation, pitch and velocity, volume, vibrato, audio panning, cues, and tempo. Instruments can also retrieve sounds from a set of prerecorded sounds, called SoundFonts, which are stored in a separate file. The format provides a wide latitude for musicians, allowing novices to compose sophisticated arrangements and more experienced talents to build complex orchestrations.
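To make the "just instructions" idea concrete, here is a minimal Python sketch of how one such instruction can be decoded. It relies on the standard layout of a MIDI channel voice message: a status byte whose upper four bits give the message type and whose lower four bits give the channel, followed by two data bytes in the range 0 to 127. The names `ChannelMessage`, `KINDS`, and `decode` are made up for this illustration and handle only three message types; this is not a full MIDI parser.

```python
from typing import NamedTuple


class ChannelMessage(NamedTuple):
    kind: str      # "note_on", "note_off", or "control_change"
    channel: int   # 1-16, the numbering musicians usually see
    data1: int     # note number, or controller number
    data2: int     # velocity, or controller value


# Upper four bits of the status byte for three common message types.
# (Illustrative subset only; real MIDI defines more.)
KINDS = {0x8: "note_off", 0x9: "note_on", 0xB: "control_change"}


def decode(raw: bytes) -> ChannelMessage:
    """Decode one 3-byte MIDI channel voice message."""
    if len(raw) != 3:
        raise ValueError("expected exactly 3 bytes")
    status, data1, data2 = raw
    # Status bytes have the top bit set; data bytes never do.
    if status < 0x80 or data1 > 0x7F or data2 > 0x7F:
        raise ValueError("malformed message")
    kind = KINDS.get(status >> 4)
    if kind is None:
        raise ValueError("message type not handled in this sketch")
    channel = (status & 0x0F) + 1  # lower four bits: 16 possible channels
    return ChannelMessage(kind, channel, data1, data2)


# Note-on, channel 1, middle C (note 60), velocity 100.
note = decode(bytes([0x90, 60, 100]))

# Control change on channel 10: controller 7 is channel volume.
volume = decode(bytes([0xB9, 7, 127]))
```

Because the channel fits in four bits, a single link tops out at 16 channels. Nothing in the three bytes describes a waveform; the receiving instrument decides what the note actually sounds like.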


A MIDI composition.

Though digital instruments still use the 30-year-old, five-pin MIDI connection, every modern computer and even the Chrome browser has the ability to receive instructions from MIDI devices via a USB adapter. The format has come a long way since 1990s Geocities pages and games like Doom, thanks to better computing power, digital samplers, and recent movements like “Black MIDI,” in which MIDI musicians like Lee saturate a digital musical score with so many notes, typically in the thousands or millions, that little white peers through.

The most popular song that Lee has ever released—1.6 million views on his YouTube channel, TheSuperMarioBros2—contains 7.6 million musical notes. Playing it on YouTube sounds like Philip Glass on acid; playing it through MIDI software like Piano From Above or Synthesia would likely crash your computer.

“Ninety percent of computers wouldn’t be able to play it without giving up,” says Lee.

An example of Black MIDI

Lee, a 17-year-old Houston native, said he burned through the RAM and CPUs of both of his parents’ laptops by experimenting with these Black MIDIs. He eventually bought himself a gaming-grade computer that could withstand his experimentation.

Lee believes that Black MIDIs, with their densely complex set of computer instructions, could push engineers to create software that relies less on RAM and more on CPU power, for example. This could help prevent computers from crashing during periods of heavy processing.

Deep Learning With Music

As learning material, MIDI files are a computer scientist’s dream. Unlike audio recordings, they are small, available in troves on the internet, and royalty-free, providing a resource that can be used to train AI machines virtually without limit.


The state of the art in training computers is deep learning, a form of machine learning that uses neural networks, a method of storing information that loosely approximates the information processing of the brain and nervous system. In computer vision, where deep learning has become the standard machine learning technique, scientists understand how a neural network learns when the computer knows what shapes to look for in an image. You can see this process in reverse in the Deep Dream algorithm. Google engineers Alexander Mordvintsev, Christopher Olah, and Mike Tyka used the company’s image-recognition software to “hallucinate” images out of everyday scenes, based on the system’s memory of other images it had found online.

The Deep Dream algorithm reverses the image recognition process, “seeing” images in the patterns of other images. Music algorithms, fed with MIDI music and other inputs, can write songs through an analogous process.

What perplexes scientists more is whether, and how, computers can perceive something that is more subjective, like music genres, chords, and moods. Listening to music can help computers reach this higher-level cognitive step.

In July, a team of scientists from Queen Mary University of London reported they had trained a neural network to determine musical genres with 75% accuracy by feeding it with 6,600 songs in three genres: ballad, dance, and hip-hop. Then they tore apart the layers of the computer’s neural network in order to see what the network learned at each layer when the scientists exposed it to songs from Bach and Eminem. The researchers saw that the computer started to detect basic patterns, like percussive instruments, at the neural network’s lower levels, and more abstract concepts, like harmonic patterns, at the highest level.

Rather than using MIDI representations or other kinds of music notation, the researchers fed their learning algorithms 80,000 samples of raw audio signals extracted from the 8,000 songs. That decision may reflect the limitations of MIDI and other digital representations for teaching machines to learn the nuances of analog music.


“Something like MIDI has a lot of potential in terms of modeling classical elements of music such as harmony, or rhythm, or structure and form,” says Eric Humphrey, a former doctoral researcher at the Music and Audio Research Laboratory at New York University, who is now a senior machine learning researcher at Spotify. “But what is really interesting is that MIDI isn’t necessarily good at modeling timbre or production effects.” Among other things, that means that “a lot of popular, modern music is not well-coded by MIDI.”


“In terms of the expressivity and the nuance of the performance, that’s that human intangible that ends up being a little bit absent, in the same way that you might have a Boston accent versus one from Texas or Minnesota,” says Humphrey.

Rather than meditating on what might get lost in the art form, Google has already started to build new deep learning models to generate music. Over the summer, Project Magenta researcher Anna Huang designed a neural network to write in new voice sections in Bach chorales, whose original voice sections she had sporadically deleted. Huang and the team initially considered using the techniques of computer speech generation to finish the middle of a song if a musician had already written the beginning and the end.

But the researchers saw two problems with reusing the machine learning models used for speech generation. First, music is polyphonic; several instruments and voices play at once. In speech, the computer essentially has to learn the pattern of only one person talking. Second, musicians might not write scores in a linear fashion but, instead, opt to go back and fill in gaps as they compose. Spoken languages, on the other hand, build ideas in a logical sequence.

To address the first problem, the researchers took a cue from the field of image recognition. They found a machine learning model that taught computers to reconstruct the blanks in an image, a method called “inpainting.” They reasoned that if computers could simultaneously handle the three RGB values in a picture, then their new model could treat each voice the way an image model treats a separate RGB value. To address the second problem, they decided to write an algorithm that let the computer generate notes in a random order, rather than sequentially.

The team trained its computers with several MIDI Bach chorales containing soprano, alto, tenor, and bass voice parts that they randomly cut out at different points in the pieces. At any given time during the modified sections, the computer would “hear” one to three voices. Then, the researchers tested what the computer learned by gradually taking out each voice part in a single Bach chorale until none were left. The team left the computer’s 28-layer neural network to generate new voices from each previous voice it had generated.
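The masking step can be pictured with a toy sketch in Python. This is not Magenta's code. It assumes a chorale is stored as four rows of MIDI pitch numbers, one row per voice and one column per time step, much as an image has one grid per color channel. The function `mask_voices`, the `MASK` marker, and the example pitches are all invented for illustration, and erasing voices independently at every time step is a simplification of whatever cutting scheme the team actually used.

```python
import random

MASK = None  # hypothetical marker for an erased note


def mask_voices(roll, rng, keep_min=1, keep_max=3):
    """Return a copy of `roll` in which, at each time step,
    only keep_min..keep_max voices remain visible."""
    n_voices = len(roll)
    n_steps = len(roll[0])
    masked = [list(voice) for voice in roll]  # leave the original untouched
    for t in range(n_steps):
        n_keep = rng.randint(keep_min, keep_max)        # inclusive bounds
        kept = set(rng.sample(range(n_voices), n_keep)) # voices left audible
        for v in range(n_voices):
            if v not in kept:
                masked[v][t] = MASK
    return masked


# Made-up four-voice fragment: rows are voices, columns are time steps.
roll = [
    [72, 72, 74, 76],
    [67, 67, 67, 67],
    [64, 62, 62, 60],
    [48, 47, 43, 48],
]

masked = mask_voices(roll, random.Random(0))
```

A model trained on pairs like `masked` and `roll` is asked to fill in the `MASK` cells from the notes that remain, which is the musical counterpart of inpainting the missing patch of a picture.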

Music generated by Google’s Magenta.

In the end, Google researchers were pleased with the overall aesthetic of the computer’s new work. Analyzing these Bach chorales taught Google that it could teach a computer to resolve musical dissonance, lean toward a preferred set of harmonies, and learn the musical key.


But their model only digitally approximated a few real-world musical style points. For one thing, their model did not account for the natural limits in pitch range that specialized vocalists, like a soprano or alto, would have. At certain points, the computer inflected the pitch in a voice line to stay in line with the musical key. The team is working on new ways to better code these human aspects into its machine learning model.

Music generated by Google’s Magenta.

To do that, it will need more music to “teach” from. Besides producing new research useful for artificial intelligence more broadly, the Magenta engineers are also enthusiastic about growing their collaboration with the music community.

In August, the team issued an updated interface between musicians and TensorFlow, Google’s open-source AI software. The new release allows musicians to connect Google’s AI model to their own synthesizers and MIDI controllers to make AI-generated music in real time. Software developers, meanwhile, can connect their own AI models in lieu of the one from Google, in the hopes that injecting non-Google ideas into the Magenta community will trigger more experimentation.

Lee, meanwhile, continues to make his own brand of Black MIDIs and publish them on YouTube. His MIDI compositions weave novel “note art,” like swirls, letters, and even Morse code, into the visual effects of the musical score. Some are more mathematical in nature–one video, “Pi,” contains exactly 3.141592 million musical notes and is 3:14 long, and another one, “Fractal Images,” describes a mathematical fractal called the Mandelbrot set.

Lee’s MIDI-composed “Pi”

When told that Google was looking for people to contribute MIDI files to its new AI project, Lee was eager to contribute. He plans to canvass the entire Black MIDI community to upload their files to the project. If those super-dense MIDI files don’t crash computers, perhaps they could teach them a few things about how to write their own Black MIDI songs. “We’re going to flood them with good content,” he says.

This story was originally published with the headline, “The Music That Inspires Computers To Write Their Own Songs”


About the author

I write about science and technology in the global marketplace, with a bent towards women in STEM. My work has appeared elsewhere in Quartz, Fortune, and Science, among others. I'm based in Amsterdam. Follow me on Twitter @tinamirtha.