
The biggest issue I have is this claim: "To do this we developed an entirely new system using convolutional neural networks to turn a few seconds of audio into a unique 'fingerprint.'"

Why did you pick a neural network? What mathematical properties does a neural network have that make it appealing for this problem? How were the networks trained? Backpropagation? It doesn't converge, and worse, learning weights for a new batch can make the network forget previous batches. That isn't a desirable property of neural networks or backpropagation. You probably layered a lot of heuristics on top, fine. But how do you know the weights you ended up with will always work in practice? Given an arbitrary track, can you encode it? What about growing the database? Does the neural network get updated for new songs, or do you use the same network to fingerprint new songs and update the database?

Here's how I would have done it:

A song file is just a sequence of amplitudes. I would do some kind of piecewise interpolation with trig functions. Trig functions have very desirable properties: they are continuous everywhere and infinitely differentiable. Moreover, a sine-basis decomposition can reconstruct the original signal very well. This is great, because now you can draw on DSP and Fourier analysis. So take the entire song and do a continuous-time discrete cosine transform with a block size of 32. Now compute the squared norm of all feature vectors, sort them, eliminate vectors that fall within a 1e-3 radius of each other (they are too similar to one another; there's no point in keeping them), and store only the top 25% of feature vectors by squared norm. The 25% cutoff threshold and the 1e-3 similarity radius are heuristics, and adjustable parameters.
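A minimal numpy sketch of the scheme above. The function names and the hand-rolled orthonormal DCT-II are mine; the block size of 32, the 1e-3 similarity radius, and the 25% cutoff are the parameters proposed in the text:

```python
import numpy as np

def dct_ortho(blocks):
    """Orthonormal DCT-II along the last axis."""
    n = blocks.shape[-1]
    j = np.arange(n)
    # C[k, j] = cos(pi * (2j + 1) * k / (2n))
    C = np.cos(np.pi * (2 * j[None, :] + 1) * j[:, None] / (2 * n))
    X = blocks @ C.T * np.sqrt(2.0 / n)
    X[..., 0] /= np.sqrt(2.0)  # DC term gets an extra 1/sqrt(2)
    return X

def fingerprint(samples, block=32, keep_frac=0.25, radius=1e-3):
    """Block DCT, drop near-duplicate vectors, keep the top fraction
    by squared norm. A sketch of the proposal, not a real system."""
    n = len(samples) - len(samples) % block
    feats = dct_ortho(samples[:n].reshape(-1, block))
    # sort by squared norm, descending
    feats = feats[np.argsort(-(feats ** 2).sum(axis=1))]
    # greedily drop any vector within `radius` (L2) of one already kept
    kept = []
    for f in feats:
        if all(np.linalg.norm(f - k) > radius for k in kept):
            kept.append(f)
    kept = np.array(kept)
    # keep only the top fraction (list is already sorted by norm)
    return kept[: max(1, int(keep_frac * len(kept)))]
```

The greedy dedup is quadratic in the number of blocks, which is fine per song; a real index would use a spatial structure instead.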

Now you have a database. For a new song, repeat the procedure and get a feature vector for every 32-sample interval. There are probably DSP techniques that give a better similarity measure, but for now we'll just use the L2 norm of the difference. Do a nearest-neighbour search in your database for all feature vectors, and rank the results by hits. I could run all of this on a computer from the 2000s, which is crappier than a modern phone, and have the entire backend run on equally crappy hardware too. Everything here is fully deterministic: updating the DB is incredibly fast, the CTDCT is super fast, there are no questions of convergence, and no need for training. You could probably improve accuracy and speed with more DSP, doing the nearest-neighbour search separately on voice, bass, instrumental, etc. features.
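The lookup step can be sketched the same way. The `db` layout (a dict mapping song id to an array of stored feature vectors) and the function name are hypothetical; the L2 distance and hit-count ranking are as described above:

```python
import numpy as np

def rank_matches(query_feats, db, top_k=3):
    """For each query feature vector, find its L2-nearest neighbour
    across all songs and count one hit for the winning song.
    `db` is {song_id: array of shape (n_vectors, block)}."""
    hits = {}
    for q in query_feats:
        best_song, best_dist = None, np.inf
        for song_id, feats in db.items():
            d = np.linalg.norm(feats - q, axis=1).min()  # closest vector in this song
            if d < best_dist:
                best_song, best_dist = song_id, d
        hits[best_song] = hits.get(best_song, 0) + 1
    # rank songs by number of hits, descending
    return sorted(hits.items(), key=lambda kv: -kv[1])[:top_k]
```

The brute-force scan over every stored vector is the simplest possible nearest-neighbour search; a k-d tree or LSH index would be the obvious replacement at scale.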

How would it compare to your neural network in practice? No idea, but I imagine it would be very competitive. The big benefit is that you have only three parameters (similarity radius, cutoff threshold, and block size). This seems very easy to benchmark against; it should take about a week to implement. I'm not sure about the compression of the fingerprint, however. Not sure how much space 1,000,000 songs would take (probably 25% of the original, since that was our cutoff). You could probably borrow from psychoacoustics to build a better database and get a more compressed representation. Another alternative would be to downsample the song to 64 kbps beforehand.



I agree with spaced-out. A neural net can capture all those smaller eigenvectors in the signal that are routinely thrown away during traditional feature engineering, like the one you describe. When the number of training samples grows big enough, those factors with marginal contribution become significant and allow higher accuracy in prediction or classification than is possible when curating features manually.

Deep nets are here to stay. They're just not magic bullets that solve all problems equally well, especially ones where training data is minimal.


> A neural net can capture all those smaller eigenvectors in the signal that are routinely thrown away during traditional feature engineering

What on earth are you talking about?

>Deep nets are here to stay.

Maybe in Silicon Valley, for consumer products like Snapchat and Siri. They won't work for industrial problems.


You'll never be able to develop features with the heuristic methods you described that will work as well as the features learned by a neural net.


Huh, a quick Google search gave me: https://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf

This is a paper from Shazam, from 2003, and it's essentially what I proposed; there is no training. Shazam works pretty well. It doesn't even go into the mathematical considerations I did.

>You'll never be able to develop features with the heuristic methods you described that will work as well as the features learned by a neural net.

False.




