h a l f b a k e r y
We don't have enough art & classy shit around here.
If you're a drummer, say, and you want to learn the drum
track on Led Zeppelin's Whole Lotta Love, you might want
to hear the drum track without the rest of the mix. Then
you might want to play along with the existing track minus
drums.
So far, your choices are to search online for the masters
(or stems) or to fiddle about with filters to try and isolate the
part you want or don't want.
Instead, I'm going to play around in TensorFlow and see if I
can get it to separate drums from the mix (or, successively,
each other part) to deconstruct the track.
I'll need lots of diverse training data, and I'll need to learn
lots about how to go about it - looks like LSTM will be in
there somehow.
But ultimately, and this is the idea (all that ^^^ was to
show that this isn't just WIBNI), a software tool/app into
which you feed a mixed track, and out of which you get
isolated +-instrumental/vocal parts.
Can also do Karaoke.
Wikipedia: Signal separation
https://en.wikipedi...i/Signal_separation Mentioned in my anno. Wikipedia seems confused about what to call this thing. [notexactly, Sep 26 2019]
|
|
Contact the EffaBeeEye, or maybe the SeeAyeEh (not CSIS, they're bumblefooks)... I'm sure (!) one/most/all of the Five Eyes have baked this, for espying purposes, generation of fayk newz and similar. |
|
|
Thanks for the tips! I've seen some attempts done
using the Fourier transform as the input to an RNN,
but the resultant audio samples seem pretty poor -
lots of digital artefacts. And I've read that the FT
also loses the phase information, which I think
might be important in reconstructing the
waveform. |
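On the phase point, a small sketch may help. As far as I understand it, the transform itself keeps the phase; it goes missing when only the magnitude (the usual spectrogram picture) is passed on. The code below uses a naive DFT in plain Python on a made-up test frame (the frame length `N` and the two sine components are assumptions for illustration only) and compares reconstruction with and without phase.

```python
import cmath
import math

def dft(frame):
    """Naive discrete Fourier transform of a list of real samples."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT; returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [(sum(c * cmath.exp(2j * math.pi * k * t / n)
                 for k, c in enumerate(spectrum)) / n).real
            for t in range(n)]

# Hypothetical test frame: a sine plus a phase-shifted third harmonic.
N = 32
frame = [math.sin(2 * math.pi * t / N)
         + 0.5 * math.sin(2 * math.pi * 3 * t / N + 1.0)
         for t in range(N)]

spectrum = dft(frame)

# Reconstruction from the full complex spectrum (magnitude and phase).
with_phase = idft(spectrum)

# Reconstruction from magnitudes only, i.e. every phase set to zero.
magnitude_only = idft([abs(c) for c in spectrum])

err_with_phase = max(abs(a - b) for a, b in zip(frame, with_phase))
err_magnitude_only = max(abs(a - b) for a, b in zip(frame, magnitude_only))

print("max error with phase:    ", err_with_phase)
print("max error magnitude only:", err_magnitude_only)
```

With phase kept, the frame comes back to within rounding error; with magnitudes only, the sines turn into cosines and the waveform is visibly different, which would fit with the artefacts described above.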
|
|
I would think working with/removing voice would be the
hardest, due to the huge variability of voices. However,
instruments are far more "fixed" in their output, so you
could go "sounds like an A on piano", so subtract a "piano A
waveform" (for the duration of the note) from the track, and
so on. Would require a huge database of instrument data.
Also, you can generally find at least some form of sheet
music; so that would give you a starting point, even if it's
just the basic guitar chords or something. |
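The "subtract a piano A" idea can be sketched roughly as follows. This is a toy under generous assumptions: the "database entry" `piano_a` is just a pure 440 Hz sine, it is already perfectly aligned in time and phase with the note in the mix, and the only unknown is its loudness, which is estimated by a least-squares fit. The names `best_gain` and `subtract_template`, and all the signal values, are hypothetical.

```python
import math

def best_gain(mix, template):
    """Least-squares gain g that minimises sum((mix - g * template) ** 2)."""
    num = sum(m * t for m, t in zip(mix, template))
    den = sum(t * t for t in template)
    return num / den if den else 0.0

def subtract_template(mix, template):
    """Remove the best-fitting scaled copy of template from mix."""
    g = best_gain(mix, template)
    residual = [m - g * t for m, t in zip(mix, template)]
    return residual, g

RATE = 8000   # samples per second (assumed)
N = 800       # 0.1 s of audio

# Stand-in for a stored "piano A" waveform: a plain 440 Hz sine.
piano_a = [math.sin(2 * math.pi * 440 * t / RATE) for t in range(N)]

# Everything else in the track, here a quieter 330 Hz tone.
other = [0.3 * math.sin(2 * math.pi * 330 * t / RATE) for t in range(N)]

# The mix contains the piano note at 0.8 of its stored level.
mix = [0.8 * p + o for p, o in zip(piano_a, other)]

residual, gain = subtract_template(mix, piano_a)
print("estimated gain:", gain)
```

In this toy case the gain comes out at about 0.8 and the residual is the 330 Hz tone. A real piano note has harmonics, a decay envelope and an unknown start time, so the alignment step that is skipped here is probably where most of the difficulty would lie.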
|
|
Death's too good for your sort; you must be made to suffer ! |
|
|
Bring forth the Wicker Man ! Prepare the bonfire ! Fetch hither oil, kindling and faggots* ! Slay them, slay them all, spare not even the children lest the Evil persist ! <Incoherent raving /> |
|
|
*A sort of small meatball, traditionally served hot with gravy, and an ideal accompaniment to baked potatoes which can be conveniently roasted in the embers of the purifying bonfire. |
|
|
Agree! I wasn't suggesting it should be used for Karaoke -
just that if it worked, it probably could be. |
|
|
Too late; a mob of peasants with scythes, pitchforks and flaming torches is already assembling outside your door. |
|
|
This problem, in general, is signal separation or source separation. [link] In a case where you
don't have any info on which components in the combined signal came from which source, it's
blind signal or source separation. |
|
|
Wikipedia doesn't mention using a neural net for that task (though its "flowchart of BSS" looks
suspiciously like one), but I'd be surprised if nobody's tried it before. Maybe do some Google
Scholar searching for something like [ (signal separation|source separation|bss) (rnn|lstm|neural
net) ]? |
|
|
// I've worked on this in an amateurish way for over a decade. I got as far as removing a single
vocalist from an instrument backing to create a karaoke version and the isolated vocals. Some
media player plugins can do this in real time. // |
|
|
That's typically done, IIRC, by comparing the left and right channels. Vocals are usually mixed
to the center, while other instruments are not, so vocals can be isolated or removed by
adding/subtracting the two channels. Having those two channels and the knowledge of how the
mixing is usually done makes that a sighted signal separation problem, I think. |
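The left/right trick can be sketched in a few lines. The sample values below are invented, and the sketch assumes the vocal is mixed identically into both channels while the other instrument is panned hard left; `split_mid_side` is a hypothetical helper name.

```python
def split_mid_side(left, right):
    """Return (mid, side): half the sum and half the difference of the channels."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

# Made-up sample values for illustration.
vocal = [0.5, -0.25, 0.75, 0.0]    # mixed to the centre: same in both channels
guitar = [0.1, 0.2, -0.3, 0.4]     # panned hard left: left channel only

left = [v + g for v, g in zip(vocal, guitar)]
right = list(vocal)

mid, side = split_mid_side(left, right)

# The centred vocal cancels in the difference, leaving half the guitar.
print("side:", side)
# The sum keeps the full vocal but also half the guitar.
print("mid: ", mid)
```

Subtracting removes anything that is identical in both channels, which is the karaoke case. Adding only reinforces the centre relative to the sides, so the sum still contains the panned instruments at reduced level; isolating the vocal cleanly needs more than this.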
|
|
Given any waveform volume, a selective memory is needed to decode. |
|
|
So yes, source separation can be done using a
combination of frequency-based and spatial-based
filtering. All currently available products do that.
There has been some research into using neural
networks (CNNs) to do this task, but most results
have been poor. Most attempts have focussed on
an architecture to extract a single source type
(usually vocals) from a mix, and these approaches
often use a short-time Fourier transform (STFT) as the input
to the NN, i.e. the input signal is a moment-by-moment
frequency-amplitude plot. |
|
|
Vocals have a characteristic harmonic distribution,
which makes them amenable to this approach, and
still the results are poor. |
|
|
And yet, for a human, if you look at the raw
waveform of a mixed piece of music you can very
easily recognise, for instance, a drum beat or a
bassline. So there must be characteristic
properties of the waveform that can be
recognised. I'm thinking (not that I have an f*ing
clue) an LSTM network - because I've read that they
can extract time-series patterns - but possibly a
classifier network beforehand - so "there's drums
in this" or "there's guitar in this" |
|
|
// for a human, if you look at the raw waveform of a mixed piece of music you can very easily recognise,
for instance, a drum beat or a bassline. // |
|
|
I feel like a human, or a computer, could easily recognize those things in the spectrogram as well, though I
haven't tried. |
|
|
// I'm thinking [...] an LSTM network - because I've read that they can extract time-series patterns // |
|
|
Well, RNNs in general are commonly used for that type of problem, but LSTM is a popular type that seems
to be effective for a lot of applications. IIRC, everyone uses a slightly different LSTM topology, but they all
seem to be about equally effective. |
|
|
// but possibly a classifier network beforehand - so "there's drums in this" or "there's guitar in this" // |
|
|
Good idea. I'm imagining this to mean running a CNN on the overall spectrogram of the audio file. Maybe
even have it output what time intervals each instrument is found in. Then feed that info into the LSTM? A
disadvantage of this is that it would need the whole audio file to analyze before the other model can get
started, if it relies on data from this classifier, so you couldn't use this on audio being recorded or streamed
in real time. |
|
|
I was wondering if you could make a neural net that's like a CNN in the frequency dimension but like an
RNN in the time dimension. The convolution process (i.e. sliding the mapping of the limited number of
CNN inputs along until the whole axis is covered) seems like it would interfere with the recurrence process
by modifying the NN's state too much between iterations, but maybe there's a simple solution to that like
encapsulating the convolutional stage's outputs in one vector and making that the input to the recurrent
stage. But I think if the convolutional stage only had access to one column of pixels in the spectrogram it
wouldn't be very effective, so maybe make a buffer to hold the last N columns of the spectrogram to give
it some context to interpret the newest one more accurately, and then just train it to ignore empty
columns in the buffer at the beginning of a run. Actually, I would again be surprised if nobody's tried this
in some form already. |
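The "buffer of the last N columns" part can be sketched on its own, leaving out both neural stages. This is only the bookkeeping: a fixed-length window that starts out filled with empty (all-zero) columns and slides forward one spectrogram column at a time. The class name `ColumnBuffer` and the tiny sizes are hypothetical; each returned window is what would be handed to the convolutional stage, whose output vector would then go to the recurrent stage.

```python
from collections import deque

class ColumnBuffer:
    """Holds the last n_columns spectrogram columns, zero-padded at the start."""

    def __init__(self, n_columns, n_bins):
        # Start with empty columns so the window always has the same shape.
        self.columns = deque(
            ([0.0] * n_bins for _ in range(n_columns)),
            maxlen=n_columns,
        )

    def push(self, column):
        """Add the newest column and return the window, oldest column first."""
        self.columns.append(list(column))  # the oldest column drops off
        return [list(c) for c in self.columns]

# Toy run: a window of 3 columns, 2 frequency bins per column.
buf = ColumnBuffer(n_columns=3, n_bins=2)

first_window = buf.push([1.0, 2.0])
buf.push([3.0, 4.0])
buf.push([5.0, 6.0])
later_window = buf.push([7.0, 8.0])

print(first_window)   # two empty columns, then the first real one
print(later_window)   # the three most recent columns
```

Because the window only ever looks backwards, this arrangement would still work on audio arriving in real time, unlike a classifier that needs the whole file first.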
|
|
Yeah, that's probably a better way to do it. I'm not really
familiar with windowing. Time to pray to the carbonyl
sulfide gods. |
|
|
Aside: I've always thought the brain's semantic encoding of a stimulus also has a physical connection. How do I put this: the semantic coding is not purely abstracted. It is a cascade of flow rather than having an A=1, B=2 ... isolating bridge, which coding does. |
|