He spent about 60 hours using Frequency to craft an upmix of the 1951 mono R&B hit “The Glory of Love,” by the vocal group the Five Keys, using the app to carefully target the separate vocalists and spread their voices across the stereo spectrum. With a final polish by disco legend Tom Moulton, Kissel’s friend, it became one of the first spectrally edited upmixes to get released on a commercial album, in 2005. Several labels soon began releasing collections of upmixed mono-to-stereo hits, sometimes licensed, sometimes in the public domain, and sometimes in between.
French software company Audionamix started building professional demixing software to help users pull apart tracks on their own, which made this suite of techniques more accessible. In 2007 the company unveiled a major achievement in upmixing, bringing vintage Édith Piaf recordings from mono to theater-ready surround sound for the biopic La Vie en Rose. In 2009, the company opened a Hollywood office to continue courting film, television, and commercial work.
Other times, their projects involved focused demixing. When the British online lending company Sunny wanted to use the song “Sunny” by late American R&B singer Bobby Hebb in a commercial, it found that one of the song’s original vocals interrupted the ad’s narration. With Audionamix’s help, the pesky vocals got zapped from existence. The French company also offers a service—originally called “music disassociation,” but now rebranded slightly less ominously as “music removal”—in which old television series and movies are scrubbed of music that might be too expensive to license, so they can be released in the latest format, be it DVD or streaming. According to Nicolas Cattaneo, a researcher at Audionamix, “This is the first thing that began to be really usable,” at least commercially. (Scholars studying music in old films and television shows should probably rely on releases from before 2009 or so if they want to make sure they’re hearing the original soundtracks.)
AudioSourceRE and Audionamix’s Xtrax Stems are among the first consumer-facing software options for automated demixing. Feed a song into Xtrax, for example, and the software spits out tracks for vocals, bass, drums, and “other,” that last term doing heavy lifting for the range of sounds heard in most music. Eventually, perhaps, a one-size-fits-all application will truly and instantly demix a recording in full; until then, it’s one track at a time, and it’s turning into an art form of its own.
What the Ear Can Hear
At Abbey Road, James Clarke began to chip away at his demixing project in earnest around 2010. In his research, he came across a paper written in the ’70s on a technique used to break video signals into component images, such as faces and backgrounds. The paper reminded him of his time as a master’s student in physics, working with spectrograms that show the changing frequencies of a signal over time.
Spectrograms could visualize signals, but the technique described in the paper—called non-negative matrix factorization—was a way of processing the information. If this new technique worked for video signals, it could work for audio signals too, Clarke thought. “I started looking at how instruments made up a spectrogram,” he says. “I could start to recognize, ‘That’s what a drum looks like, that looks like a vocal, that looks like a bass guitar.’” About a year later, he produced a piece of software that could do a convincing job of breaking apart audio by its frequencies. His first big breakthrough can be heard on the 2016 remaster of the Beatles’ Live at the Hollywood Bowl, the band’s sole official live album. The original LP, released in 1977, is hard to listen to because of the high-pitched shrieks of the crowd.
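The core idea behind non-negative matrix factorization is compact enough to sketch. The toy example below is my own illustration, not Abbey Road's code: it factors a made-up, spectrogram-like matrix into spectral templates (what a drum or bass "looks like" in frequency) and time activations (when each one sounds), using the classic Lee-Seung multiplicative updates. Real demixers apply the same idea to STFT magnitudes with far more components.

```python
import numpy as np

def nmf(V, n_components, n_iter=500, seed=0):
    """Factor a non-negative matrix V (freq x time) into
    W (spectral templates) and H (activations over time)
    using Lee-Seung multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, n_components)) + 1e-9
    H = rng.random((n_components, n_time)) + 1e-9
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Toy "spectrogram": two sources with distinct spectral shapes
# that switch on and off over time.
drum = np.outer([1.0, 0.1, 0.0, 0.0], [1, 0, 1, 0, 1, 0])
bass = np.outer([0.0, 0.0, 0.2, 1.0], [0, 1, 1, 0, 0, 1])
V = drum + bass

W, H = nmf(V, n_components=2)
V_hat = W @ H  # reconstruction from the two learned components
```

Each column of `W` is a recurring spectral shape, and each row of `H` says when it is active; masking the mixture with one component's contribution is what pulls that "instrument" out of the mix.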
After unsuccessfully trying to reduce the noise of the crowd, Clarke finally had a “serendipity moment.” Rather than treating the howling fans as noise in the signal that needed to be scrubbed out, he decided to model the fans as another instrument in the mix. By identifying the crowd as its own individual voice, Clarke was able to tame the Beatlemaniacs, isolating them and moving them to the background. That, then, moved the four musicians to the sonic foreground.
Clarke became a go-to industry expert on upmixing. He helped rescue the 38-CD Grammy-nominated Woodstock–Back to the Garden: The Definitive 50th Anniversary Archive, which aimed to assemble every single performance from the 1969 mega-festival. (Disclosure: I contributed liner notes to the set.) At one point during some of the festival’s heaviest rain, sitar virtuoso Ravi Shankar took to the stage. The biggest problem with the recording of the performance wasn’t the rain, however, but that Shankar’s then-producer absconded with the multitrack tapes. After listening to them back in the studio, Shankar deemed them unusable and released a faked-in-the-studio At the Woodstock Festival LP instead, with not a note from Woodstock itself. The original festival multitracks disappeared long ago, leaving future reissue producers nothing but a damaged-sounding mono recording off the concert soundboard.
Using only this monaural recording, Clarke was able to separate the sitar master’s instrument from the rain, the sonic crud, and the tabla player sitting a few feet away. The result was “both completely authentic and accurate,” with bits of ambiance still in the mix, says the box set’s coproducer, Andy Zax.
“The possibilities upmixing gives us to reclaim the unreclaimable are really exciting,” Zax says. Some might see the technique as akin to colorizing classic black-and-white movies. “There’s always that tension. You want to be reconstructive, and you don’t really want to impose your will on it. So that’s the challenge.”
Heading for the Deep End
Around the time Clarke finished working on the Beatles’ Hollywood Bowl project, he and other researchers were coming up against a wall. Their techniques could handle fairly simple patterns, but they couldn’t keep up with instruments with lots of vibrato—the subtle changes in pitch that characterize some instruments and the human voice. The engineers realized they needed a new approach. “That’s what led toward deep learning,” says Derry Fitzgerald, the founder and chief technology officer of AudioSourceRE, a music software company.
Fitzgerald was a lifelong Beach Boys fan; some of the mono-to-stereo upmixes he did of their work, for the fun of it, got tapped for official releases starting in 2012. Like Clarke, Fitzgerald had found his way to non-negative matrix factorization. And, like Clarke, he’d reached the limits of what he could do with it. “It got to a point where the amount of hours I spent tweaking the code was very, very time-consuming,” he says. “I thought there had to be a better way.”
The nearly parallel move to AI by Fitzgerald, James Clarke, and others echoed Clarke’s original instinct that if the human ear can naturally separate the sounds of instruments from one another, it should also be possible to model that same separation by machine. “I started researching deep learning to get more of a neural network approach to it,” Clarke says.
He started experimenting with a specific goal in mind: pulling out George Harrison’s guitar from the early Beatles hit “She Loves You.” On the original recording, the instruments and vocals were all laid on a single track, which makes it nearly impossible to manipulate.
Clarke started building an algorithm and trained it on every version of the song he could find—radio sessions, live versions, even renditions by tribute bands. “There were quite a few different ones, so plenty of examples to understand how the track should sound,” Clarke says. Using spectrograms, he now also knew how the track should look. The algorithm broke up the audio into individual stems, one for each instrument, but Clarke only had eyes and ears for Harrison’s Gretsch Chet Atkins Country Gentleman guitar.
Over nine months, Clarke sifted through the guitar part a few seconds at a time, virtually hand-cleaning the track phrase by phrase. He listened for stray audio artifacts from other instruments and used spectral editing software to find and eliminate them. For the final step, he set out to recapture the track’s original ambience. That part was easy. As an Abbey Road employee, he could book time in the vaunted Studio Two, where “She Loves You” was originally recorded. He played his track into the room through the in-house speakers and recorded it anew, to capture some of the subtleties of the room’s well-preserved acoustics. In August 2018, Clarke showed off his AI demixing work publicly for the first time.
The occasion was a sold-out lecture series that offered a rare chance for fans to step inside Studio Two, where the Beatles, Pink Floyd, and plenty of others recorded. Visitors were invited to re-create the clattering E-major chord that ends “A Day in the Life” by playing the studio’s pianos at the same time. The audience also received a glimpse of the future.
In front of a packed audience, Clarke played the Beatles’ original 1963 recording of “She Loves You.” Then, to pin-drop silence, he played what should have been impossible: the same recording with everything removed except for Harrison’s guitar.
Three days later, excerpts of Clarke’s demo made their way onto the web. The truthers quickly descended. Disbelieving audiophiles started trashing Clarke in online forums. “I think it’s a shame that the demonstration to show how good this new technology is happens to be false,” a user who went by Beatlebug wrote.
“It’s kind of sad that Abbey Road has to mislead people like that,” RingoStarr39 posted in the same thread.
Beatlebug, RingoStarr39, and others insisted that the audio segment in Clarke’s lecture was an easier-to-isolate bit from a later German version of the song, “Sie Liebt Dich,” recorded in stereo. They insisted that James Clarke was a charlatan.
But Clarke had merely demonstrated a proof of concept. Perfecting Harrison’s guitar track of “She Loves You” took him approximately 200 hours. He hadn’t even attempted to isolate John Lennon’s guitar. “Not a viable option for projects,” he admits. It was far from automated. But it could be done. And it would be.
Up, Up, and Away
The dam broke fully in 2019, when French streaming service Deezer released an open source code library called Spleeter that allowed both casual and professional programmers to build tools for demixing and upmixing. Anybody comfortable enough with their computer’s command-line interface could download and install software from GitHub, select an audio file of their favorite song, and generate their own set of isolated stems. People started putting the code library to creative use. When tech blogger Andy Baio played around with it, he was delighted to discover how easy it now was to create mashups, such as when he crossed the Friends theme and Billy Joel’s “We Didn’t Start the Fire.” “Nobody should have this kind of power,” he tweeted.
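In practice, the workflow is only a couple of commands. A rough sketch, following Spleeter's documented CLI (the exact flag syntax has varied between versions, so check the project's README):

```shell
# Install Spleeter (a Python package; pulls in TensorFlow)
pip install spleeter

# Split a song into four stems -- vocals, drums, bass, and "other" --
# written as separate WAV files under output/song/
spleeter separate -p spleeter:4stems -o output song.mp3
```

The `-p` flag picks a pretrained model; `spleeter:2stems` (vocals/accompaniment) and `spleeter:5stems` (which adds piano) are the other stock options.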
The first generation of users are demixing and upmixing in creative ways. Some musicians are removing one instrument from a song to create tracks they can practice along to or to generate source material for new music. Podcast producers are cleaning up dialog recorded in noisy environments. Hobbyists are using iPad apps and free sites to create their own mixes or make any song karaoke-ready. Several streaming services in Japan now offer vocal removal in officially licensed form, including Spotify’s SingAlong, where listeners can turn down a song’s vocals, and Line Music, which promises real-time source separation.
Alongside established players such as Audionamix and James Clarke, the newest company offering professional demixing services is the California-based startup Audioshake.
The company will soon launch a service where music rights holders—both musicians and labels—can upload their tracks to the cloud and, within minutes, download high-quality stems ready for licensing in film, broadcasting, video games, and elsewhere. Audioshake claims best-in-field ratings for drums, bass, and vocals, according to benchmarks established by the Signal Separation Evaluation Campaign, an organization made up of audio researchers who track the progress of demixing techniques.
But Audioshake is also the first company to figure out how to automatically isolate guitars—or, more precisely, a single guitar. The company is tight-lipped about how it achieved this. “We refined the architecture of our deep-learning network to be specially tailored to the harmonics and timbre of the guitar,” says company AI researcher Fabian-Robert Stöter. Basically, when a user uploads a track to Audioshake, a layer in the company’s algorithm converts the song’s waveform into a numerical representation that makes it easier for the AI model to figure out where a guitar ends and everything else begins.
To see it work, I was invited to upload some songs. Within a few minutes, the company’s software was able to pull apart a track of a rock band playing in guitar-bass-drums-vocals power trio format. A track by Talking Heads’ original lineup came back with David Byrne’s 12-string acoustic guitar separated (with minimal artifacts) alongside tracks of Tina Weymouth’s bass and Chris Frantz’s drums. It works equally well on other songs in that exact guitar/bass/drums/vocals configuration. But music is huge, and the power trio format is a tight set of parameters.
Outside those parameters is the unclaimed frontier of demixing. The original recording of “She Loves You” comes back from Audioshake with Lennon’s and Harrison’s guitars sounding like jangling ghosts. James Clarke’s manual work still can’t be matched by a machine. That said, Audioshake does what couldn’t be done only a few years ago, pointing to a future in which machines will recognize more instruments. Some of those frontiers, though, might prove unbreachable. For virtually all producers since the ’60s, a recording studio has been the place to combine unusual instruments and generate wondrous new sounds (and literal overtones) explicitly designed to blend together in the listener’s ears.
But what if the artifacts turn out to be art? If a demixing attempt gone awry sounds cool to the right producer, it might become the basis for fantastic new music. Think Cher turning Auto-Tune into a pop trend with “Believe.” As archival producer Andy Zax put it, “Some 16-year-old making hip hop records on a PlayStation is going to figure out some genius use of this thing and create a sound world we’ve never heard before.”
For now, plenty of experimentation is happening in far-flung fan forums, with unofficial upmixes of many equally unofficial recordings. Some fans have been exploring a subgenre that might be called upfakes, fusing, say, George Harrison’s original 1968 demo for “Sour Milk Sea” with the backing track from a more recent recording by another musician. (Fans are understandably jittery about copyright claims and generally only post their work with quickly expiring links.)
As for Clarke, he is still working on the exact AI methodology to pull apart a mono Beatles vocal track. He’s also started an independent company called Audio Research Group to work as a demixer-for-hire. Lately he’s been helping to create a set of tracks for a band that lost all its master tapes and has only its LPs.
Even for Clarke, though, many recordings can’t be pulled apart, especially if the instruments are close in frequency or a recording is particularly compressed, as on a radio broadcast or many audience-sourced live recordings. He once tried to demix a 1991 R.E.M. tape from London. “There’s just not enough from a spectral point of view, it’s so squashed,” Clarke says. “You get really fuzzy results.” For now, some blurry aspects of the past are going to stay blurry. But some are going to sound brighter than ever.
Let us know what you think about this article. Submit a letter to the editor at mail@WIRED.com.