If you're a drummer, say, and you want to learn the drum track on Led Zeppelin's Whole Lotta Love, you might want to hear the drum track without the rest of the mix. Then you might want to play along with the existing track minus drums.
So far, your choices are to search online for the masters (or stems) or to fiddle about with filters to try to isolate the part you want or don't want.
Instead, I'm going to play around in TensorFlow and see if I can get it to separate drums from the mix (or, successively, each other part) to deconstruct the track.
I'll need lots of diverse training data, and I'll need to learn lots about how to go about it - looks like LSTM will be in there somehow.
But ultimately - and this is the idea (all that ^^^ was to show that this isn't just WIBNI) - a software tool/app into which you feed a mixed track, and out of which you get isolated +/- instrumental/vocal parts.
Can also do Karaoke. -- Frankx, Sep 19 2019
Wikipedia: Signal separation https://en.wikipedi...i/Signal_separation
Mentioned in my anno. Wikipedia seems confused about what to call this thing. [notexactly, Sep 26 2019]
Contact the EffaBeeEye, or maybe the SeeAyeEh (not CSIS, they're bumblefooks)... I'm sure (!) one/most/all of the Five Eyes have baked this, for espying purposes, generation of fayk newz and similar. -- Sgt Teacup, Sep 19 2019
Thanks for the tips! I've seen some attempts that use the Fourier transform as the input to an RNN, but the resultant audio samples seem pretty poor - lots of digital artefacts. And I've read that the FT also loses the phase information, which I think might be important in reconstructing the waveform. -- Frankx, Sep 19 2019
I would think working with/removing voice would be the hardest, due to the huge variability of voices. However, instruments are far more "fixed" in their output, so you could go "sounds like an A on piano", so subtract a "piano A waveform" (for the duration of the note) from the track, and so on. Would require a huge database of instrument data. Also, you can generally find at least some form of sheet music; so that would give you a starting point, even if it's just the basic guitar chords or something. -- neutrinos_shadow, Sep 19 2019
// Karaoke //
Death's too good for your sort; you must be made to suffer !
Bring forth the Wicker Man ! Prepare the bonfire ! Fetch hither oil, kindling and faggots* ! Slay them, slay them all, spare not even the children lest the Evil persist ! <Incoherent raving />
*A sort of small meatball, traditionally served hot with gravy, and an ideal accompaniment to baked potatoes which can be conveniently roasted in the embers of the purifying bonfire. -- 8th of 7, Sep 19 2019
//Death's too good//
Agree! I wasn't suggesting it should be used for Karaoke - just that if it worked, it probably could be. -- Frankx, Sep 20 2019
Too late; a mob of peasants with scythes, pitchforks and flaming torches is already assembling outside your door. -- 8th of 7, Sep 20 2019
This problem, in general, is signal separation or source separation. [link] In a case where you don't have any info on which components in the combined signal came from which source, it's blind signal or source separation.
Wikipedia doesn't mention using a neural net for that task (though its "flowchart of BSS" looks suspiciously like one), but I'd be surprised if nobody's tried it before. Maybe do some Google Scholar searching for something like [ (signal separation|source separation|bss) (rnn|lstm|neural net) ]?
// I've worked on this in an amateurish way for over a decade. I got as far as removing a single vocalist from an instrument backing to create a karaoke version and the isolated vocals. Some media player plugins can do this in real time. //
That's typically done, IIRC, by comparing the left and right channels. Vocals are usually mixed to the center, while other instruments are not, so vocals can be isolated or removed by adding/subtracting the two channels. Having those two channels and the knowledge of how the mixing is usually done makes that a sighted signal separation problem, I think. -- notexactly, Sep 26 2019
Given any waveform volume, a selective memory is needed to decode. -- wjt, Sep 28 2019
So yes, source separation can be done using a combination of frequency-based and spatial-based filtering. All currently available products do that. There has been some research into using neural networks (CNN) to do this task, but most results have been poor. Most attempts have focussed on an architecture to extract a single source type (usually vocals) from a mix, and these approaches often use a Fourier transform (STFT) as the input to the NN, i.e. the input signal is a moment-by-moment frequency-amplitude plot.
Vocals have a characteristic harmonic distribution, which makes them amenable to this approach, and still the results are poor.
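For reference, the left/right channel trick notexactly describes above is about the simplest spatial filter there is, and can be sketched in a few lines. This is a toy illustration only, assuming a perfectly centered vocal and a hard-panned guitar; the function name `split_mid_side` and all sample values are made up for the example, and real mixes (stereo reverb, partly panned parts) will not cancel this cleanly.

```python
import math

def split_mid_side(left, right):
    """Return (mid, side) for two equal-length lists of channel samples."""
    if len(left) != len(right):
        raise ValueError("channels must have the same length")
    # Averaging the channels keeps whatever is common to both (the center).
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    # Subtracting them cancels whatever is identical in both channels.
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

# Toy mix (hypothetical values): a "vocal" panned dead center
# and a "guitar" panned hard left.
n = 8
vocal = [math.sin(2 * math.pi * k / n) for k in range(n)]
guitar = [0.5 * math.cos(2 * math.pi * k / n) for k in range(n)]
left = [v + g for v, g in zip(vocal, guitar)]
right = list(vocal)

mid, side = split_mid_side(left, right)
# side now holds only the guitar (at half level): the vocal has cancelled.
# mid still holds the vocal plus half of the guitar.
```

Note that the subtraction removes the centered part cleanly, but the sum does not isolate it: everything else is still in there at reduced level, which is presumably why the karaoke direction is the easier one.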
And yet, for a human, if you look at the raw waveform of a mixed piece of music you can very easily recognise, for instance, a drum beat or a bassline. So there must be characteristic properties of the waveform that can be recognised. I'm thinking (not that I have an f*ing clue) an LSTM network - because I've read that they can extract time-series patterns - but possibly a classifier network beforehand - so "there's drums in this" or "there's guitar in this". -- Frankx, Sep 30 2019
// for a human, if you look at the raw waveform of a mixed piece of music you can very easily recognise, for instance, a drum beat or a bassline. //
I feel like a human, or a computer, could easily recognize those things in the spectrogram as well, though I haven't tried.
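For anyone following along: a spectrogram is the magnitude of a short-time Fourier transform (STFT). The transform itself keeps the phase; it is taking the magnitude that throws it away, which is the loss Frankx mentions earlier. Below is a minimal sketch using only the standard library. The naive DFT loop, the frame length of 16 and the hop of 8 are choices made for this illustration (in practice an FFT library would be used), and `stft` is a hypothetical helper name.

```python
import cmath
import math

def stft(samples, frame_len=16, hop=8):
    """Naive STFT: a list of frames, each a list of complex frequency bins."""
    # Hann window tapers each frame so its edges leak less into other bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = [samples[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):  # non-negative frequencies only
            bins.append(sum(
                chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)))
        frames.append(bins)
    return frames

# Toy signal: a sine with exactly 4 cycles per 16-sample frame.
signal = [math.sin(2 * math.pi * 4 * n / 16) for n in range(64)]
frames = stft(signal)

# The spectrogram: what a magnitude-only model gets to see.
magnitude = [[abs(c) for c in frame] for frame in frames]
# The phase: what such a model never sees.
phase = [[cmath.phase(c) for c in frame] for frame in frames]

# Magnitude and phase together rebuild every complex bin.
rebuilt = [[cmath.rect(m, p) for m, p in zip(mf, pf)]
           for mf, pf in zip(magnitude, phase)]
```

A model that predicts only a new magnitude has to borrow or estimate the phase from somewhere before the waveform can be rebuilt, which may be one source of the artefacts described above.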
// I'm thinking [...] an LSTM network - because I've read that they can extract time-series patterns //
Well, RNNs in general are commonly used for that type of problem, but LSTM is a popular type that seems to be effective for a lot of applications. IIRC, everyone uses a slightly different LSTM topology, but they all seem to be about equally effective.
// but possibly a classifier network beforehand - so "there's drums in this" or "there's guitar in this" //
Good idea. I'm imagining this to mean running a CNN on the overall spectrogram of the audio file. Maybe even have it output what time intervals each instrument is found in. Then feed that info into the LSTM? A disadvantage of this is that it would need the whole audio file to analyze before the other model can get started, if it relies on data from this classifier, so you couldn't use this on audio being recorded or streamed in real time.
I was wondering if you could make a neural net that's like a CNN in the frequency dimension but like an RNN in the time dimension. The convolution process (i.e. sliding the mapping of the limited number of CNN inputs along until the whole axis is covered) seems like it would interfere with the recurrence process by modifying the NN's state too much between iterations, but maybe there's a simple solution to that like encapsulating the convolutional stage's outputs in one vector and making that the input to the recurrent stage. But I think if the convolutional stage only had access to one column of pixels in the spectrogram it wouldn't be very effective, so maybe make a buffer to hold the last N columns of the spectrogram to give it some context to interpret the newest one more accurately, and then just train it to ignore empty columns in the buffer at the beginning of a run. Actually, I would again be surprised if nobody's tried this in some form already. -- notexactly, Oct 01 2019
Yeah, that's probably a better way to do it. I'm not really familiar with windowing. Time to pray to the carbonyl sulfide gods. -- notexactly, Oct 01 2019
Aside: I've always thought the brain's semantic encoding of a stimulus also has a physical connection. How do I put this: the semantic coding is not purely abstracted. It is a cascade of flow rather than having an A=1, B=2 ... isolating bridge, which coding does. -- wjt, Oct 05 2019
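The buffer notexactly proposes above (hold the last N spectrogram columns, with empty columns at the start of a run) can be sketched without any neural net at all. This covers only the buffering step, under assumptions: `context_windows` and `n_context` are names invented for this example, "empty" is taken to mean all zeros, and the convolutional and recurrent stages that would consume each window are left out.

```python
from collections import deque

def context_windows(columns, n_context=4):
    """For each spectrogram column, yield the last n_context columns, oldest first.

    Slots with no history yet hold all-zero columns, which the network
    would have to be trained to ignore.
    """
    if not columns:
        return
    empty = [0.0] * len(columns[0])
    # A deque with maxlen drops its oldest entry on every append.
    buffer = deque([empty] * n_context, maxlen=n_context)
    for column in columns:
        buffer.append(list(column))
        yield [list(c) for c in buffer]

# Toy spectrogram: 5 time steps x 3 frequency bins (made-up values).
spectrogram = [[float(t + 1)] * 3 for t in range(5)]
windows = list(context_windows(spectrogram, n_context=4))
# windows[0] is three empty columns followed by the first real one;
# windows[4] is the four most recent columns.
```

Because each window depends only on past columns, this would also work on audio arriving in real time, unlike a classifier that needs the whole file first.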