How does Shazam work to recognize a song?

The logo of Shazam

So, you want to know how Shazam works? What is Shazam, you may ask? Let’s say you’re in a bar and they play a song that you like, but you don’t know its name. Shazam can help you find out the name of that song.

Shazam lets you record up to 15 seconds of the song you are hearing, and then it tells you everything you want to know about that song: the artist, the name of the song, the album, links to YouTube, links to buy the song on iTunes, you name it.

Shazam was first offered in the UK as a phone service over the GSM network, but it is now available worldwide for a large number of phone brands. But this is not the wow part of the app. The most amazing thing is the conditions in which it can detect the song it “hears”: it can recognize a song recorded in heavy background noise (like the crowded bar I mentioned earlier) and even when the recorded sound quality is very low (it can run over a cellular phone network). It is so robust that it can distinguish a song when two songs are playing simultaneously, or when the song is playing in the background behind a radio DJ.

So, how does it manage to do this? Avery Li-Chun Wang, chief scientist and co-founder of Shazam, published a paper that explains just that. Long story short, Shazam has a database of song fingerprints generated from their spectrograms. When you record a sample with the app, it generates a fingerprint for the recorded sample in the same way it did for all the songs in its database. Then it searches the database for a match to the sample’s fingerprint.

In the next lines I will try to explain in simpler terms what I understood from that paper (please correct me if I’m wrong :D):

So, how do they fingerprint a song?

Let’s start with some vocabulary, explained for everyone:

First, they generate a spectrogram of the song. The spectrogram is a three-dimensional graph: the horizontal (X) axis is time, the vertical (Y) axis is frequency, and the third dimension is represented by color intensity and shows the amplitude of a given frequency. So basically, a dot on the graph represents the volume of a certain sound at a certain time in the song. A darker point means that the specific sound (frequency) is louder than a lighter one.

Storing the full spectrogram of every song in the database would occupy an enormous amount of space, considering that the Shazam database has more than 8 million songs. So instead, they store only the most intense sounds in the song, the times at which they appear and the frequencies at which they occur.

So a spectrogram for a song will be transformed from this:

The initial spectrogram

into this:

The simplified spectrogram

Notice that the darker spots in the first image (the spectrogram) match the crosses in the second image.
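The transformation above (spectrogram, then keeping only the loudest points) can be sketched in a few lines of Python. This is a rough illustration, not Shazam’s actual code: the window size, hop and threshold below are arbitrary choices of mine, and a real implementation would pick peaks per frequency band rather than with a single global threshold.

```python
import numpy as np

def spectrogram(signal, window_size=1024, hop=512):
    """Magnitude spectrogram via a simple short-time Fourier transform."""
    frames = [signal[i:i + window_size] * np.hanning(window_size)
              for i in range(0, len(signal) - window_size + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))  # rows: time, columns: frequency

def peaks(spec):
    """Keep only the loud points: local maxima above a global threshold.

    Returns (frame, frequency_bin) pairs -- the crosses of the
    simplified spectrogram.
    """
    threshold = spec.mean() + 2 * spec.std()
    constellation = []
    for t in range(1, spec.shape[0] - 1):
        for f in range(1, spec.shape[1] - 1):
            neighbourhood = spec[t - 1:t + 2, f - 1:f + 2]
            if spec[t, f] >= threshold and spec[t, f] == neighbourhood.max():
                constellation.append((t, f))
    return constellation

# Toy signal: two pure tones, so the surviving peaks land on two frequency bins.
sr = 8000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)
constellation = peaks(spectrogram(signal))
```

For this toy signal, the constellation only contains points near the frequency bins of the two tones, which is exactly the “crosses” picture above.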

To store this in the database in a way that is efficient to search for a match (easy to index), they choose some of the points within the simplified spectrogram (called “anchor points”) and zones in their vicinity (called “target zones”).

Pairing the anchor point with points in a target zone

Now, for each point in the target zone, they will create a hash that is the aggregation of the following: the frequency at which the anchor point is located (f1), the frequency at which the point in the target zone is located (f2), and the time difference (t2-t1) between the time when the target-zone point occurs in the song (t2) and the time when the anchor point occurs (t1). To simplify: hash = (f1, f2, t2-t1). The anchor time t1 is stored alongside the hash (together with the song’s ID), but it is not part of the hash itself; if it were, a sample recorded from the middle of a song could never match.

How the hash is calculated

After this, every hash generated in this way is stored in the database, together with its t1 and the ID of the song it came from.
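A minimal sketch of this indexing step, with my own simplifications: I use a Python tuple as the hash key and an arbitrary fan-out of 5, whereas the paper packs f1, f2 and t2-t1 into a single 32-bit integer and tunes the target-zone size.

```python
def fingerprint(peaks, song_id, fan_out=5):
    """Index a song's (time, frequency) peaks by combinatorial hashes.

    Each peak acts as an anchor and is paired with up to `fan_out`
    later peaks (its target zone). The key is (f1, f2, t2 - t1); the
    anchor time t1 and the song ID go into the value, not the key.
    """
    points = sorted(peaks)  # sorted by time
    index = {}
    for i, (t1, f1) in enumerate(points):
        for t2, f2 in points[i + 1:i + 1 + fan_out]:
            index.setdefault((f1, f2, t2 - t1), []).append((t1, song_id))
    return index

# Three toy peaks, given as (time, frequency) pairs.
index = fingerprint([(0, 10), (1, 20), (3, 15)], song_id="song A", fan_out=2)
```

With these toy peaks, `index[(10, 20, 1)]` is `[(0, "song A")]`: the anchor at frequency 10 paired with the peak at frequency 20 that occurs 1 time unit later.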

So, how do they find the song based on the recorded sample?

Well, they first apply the same fingerprinting process to the recorded sample. Then, each hash generated from the sample is searched for a match in the database.

If a match is found, you will have the time of the hash in the sample (th1), the time of the hash in the database song (th2) and, implicitly, the ID of the song for which the hash matched. Basically, th1 is the time from the beginning of the sample to the sample hash, and th2 is the time from the beginning of the song to the song hash.

Now, they draw a new graph, called a scatter graph. On the horizontal (X) axis it has the time in the database song, and on the vertical (Y) axis the time in the recorded sample. For each matching hash, th2 is marked on the X axis and th1 on the Y axis, and the point of intersection of the two occurrence times is marked with a small circle. The magic happens now: if the graph contains a lot of (th1, th2) pairs from the same song, a diagonal line will form. The idea behind the formation of that line is simple: the peaks (the small crosses in the simplified spectrogram) appear in the recorded sample at the same rate as in the database song, so when you pair these times, the coordinates on the scatter graph grow steadily (towards the top right) as time passes on both axes.

Scatter graph of a non-matching run


Scatter graph of a matching run

Finally, they calculate the difference dth = th2 - th1 for every matching hash and plot the differences in a histogram. If the sample really matches the song, a lot of the dths will have the same value, because subtracting th1 from th2 gives the offset at which the sample was recorded within the original song (the distance between a point in the original song and the same point in the recorded sample). This shows up as a peak in the histogram, which confirms a match.
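The matching stage (which is what both the diagonal on the scatter graph and the histogram peak detect) can be sketched like this. The toy database, the hash keys and the scoring are all made up for illustration; a real system would also require a minimum score before declaring a match.

```python
from collections import Counter

def best_match(sample_index, db_index):
    """Histogram dth = th2 - th1 per song; the tallest bin wins.

    sample_index: hash key -> list of sample times th1
    db_index:     hash key -> list of (th2, song_id) pairs
    """
    histogram = Counter()  # (song_id, offset) -> number of agreeing hashes
    for key, sample_times in sample_index.items():
        for th2, song_id in db_index.get(key, []):
            for th1 in sample_times:
                histogram[(song_id, th2 - th1)] += 1
    if not histogram:
        return None, 0
    (song_id, _offset), score = histogram.most_common(1)[0]
    return song_id, score

# Toy database: hash keys with the times and songs in which they occur.
db = {
    "h1": [(5, "song A")],
    "h2": [(6, "song A"), (2, "song B")],
    "h3": [(0, "song B")],
}
# A sample whose hashes h1 and h2 both occur 5 time units later in song A.
sample = {"h1": [0], "h2": [1]}
song, score = best_match(sample, db)
```

Here `song` is "song A" with a `score` of 2: both sample hashes agree on the same offset of 5, which is exactly the histogram-peak (or diagonal-line) condition described above, while song B gets only a single, unsupported offset.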

Histogram of a non-matching run


Histogram of a matching run

So, I hope this helped you all understand how a wonderful application like this works. If you have questions, think I’ve omitted something, or spot something wrong in this article, please leave a comment and we will go from there.

Comments

31 Responses to “How does Shazam work to recognize a song ?”
  1. Jonathan says:

    Do you know if anyone has used Shazam to ID bird songs in an Android or iPhone app?

    • Shazam itself, I doubt it. Even the technology behind it, I don’t think it can be adapted to such a purpose, except in the case in which a bird sings exactly the same way twice. There are more flexible algorithms, like the one behind the SoundHound application, which I believe is more suitable for this kind of purpose. With SoundHound it’s possible, for example, to hum a song, and it will tell you what song it is. The recording is not required to be exactly the same piece of music as the original track in the database.

  2. Mahesh says:

    Hi, thanks for this wonderful article. Though its kinda a old, I just happen to read it only today. Its really amazing to know how this works. But I do have few questions. What exactly does the app (stored in my mobile) send to the server? Is it the fingerprint of the song I record? If yes, any idea abt the data usage while sending the recorded sample in the form of fingerprints? Also, while receiving, is the song info received like just a push message?

    • I read the specs a long time ago and I don’t remember all the details, so take the following statements with a grain of salt.

      The paper only describes the server-side part of the app, but if I were to guess, I believe it sends only the simplified spectrogram (the fingerprint, as I believe you call it). I don’t believe it takes that much processing to obtain it from the original sample, though I’ve been wrong before :P

      On the second part, I believe I can answer with a degree of certainty when I say: no. The song info is sent as the response to the web request that was made with the fingerprint / recorded sample.

      Hope it helped.

      • lol says:

        LOL, “simplified spectrogram”... well, a more accurate name is a fingerprint, or acoustic fingerprint if you want. The amount of info your phone sends is most likely less than 1/30 of the size of the recording you send. I have a different fp, and for 50 seconds of audio it is about 30 kb.

  3. Random Bloke says:

    Spectrograms? Or teams of offshore workers in India & China quickly making best guesses…hmm.

  4. Shazam has a pretty amazing algorithm … I was searching for something like this in order to undertsand the way it works under the hood … nice article that I may put to good use …

  5. alvaro says:

    Thanks, really interesting article!

  6. phil says:

    It’s a great app, but it’s ruining my pub quiz.

  7. Paula says:

    Can Shazan steal other information from our phones and send it to somewere with that sample of music.
    If yes it can be used in thousands of phones world wild!
    It is free in most of cases and the Head Office is large to run a free thing…
    Think about it.

    • In theory it’s possible; in practice, I don’t believe that’s the case, especially on the iPhone, where programs are sandboxed pretty well. And even if it could extract some “top secret” info (the phone book comes to mind), it could be traced by users and would put the company in a position that just makes no sense for it to be in.

      This kind of thinking is good and constructive sometimes, but in most cases it just makes no sense.

      • Paula says:

        Thank you by your opinion but read the Term and Condicions and you will find the company well defended…

      • enamecn says:

        Can you tell me the structure of Shazam’s database? How do I put all the fingerprints together: should each song’s fingerprints go into a list and each list into the database, or should all fingerprints go directly into the database? Please help me, thank you!

        • CM says:

          @enamecn: It’s amazing the number of things like this we can find everywhere: “Great! So you know (or have a grasp of) how it works. Now please tell me ASAP how I can do the same with the least effort and the lowest cost, so I can get rich by tomorrow.” Come on! Do you really think that will happen?

  8. CM says:

    Very interesting post, thank you. I’m dealing with some image processing, will now look at it to see how it can be adapted. In case you have any input you would like to share, I’m all ears. And, again, thanks for sharing.

  9. VikingDad says:

    Brilliant article, many thanks!

    Hats off to the designer of the Shazam app though, some exceptionally brilliant thought went into that.
    Used the app for years and despite being amazed at it, now I know how it works its even more impressive!

    • Thank you!
      I couldn’t agree more. I was expecting some weird voodoo math with huge two-line formulas, and although behind the scenes there may be one or two when it comes to digital signal processing, the concept is relatively straightforward. Brilliant technology!

  10. gergana says:

    Hi, I am creating an app similar to Shazam, with the difference that instead of matching one song against another, the app will recognize sound recorded from the environment and search the database of songs for a match with this sound. Could you give me some idea of how it would look as a graph?

  11. radione says:

    This is a triumph of technology, I was literally taken by surprise when I first used it. I push science to the limits in my everyday work, challenging the harsh conditions in space with mathematics and engineering, but this wonderful piece of programming just filled my eyes with tears. I’m out of words…

  12. Avanti Shrikumar says:

    Nice article! A correction: “Each hash is also associated with the time offset from the beginning of the respective file to its anchor point, though the absolute time is not a part of the hash itself” (basically, t1 is the value, and f1,f2 and t2-t1 make up the key of the hash. Reading your article made it sound like t1 was used to calculate the key, which did not make sense as you would never be able to match the sample hashes to the song hashes since you would not know where in the song the sample came from!)

  13. Chucks says:

    Nice article fam…. thanks!!!

  14. Dima says:

    Hi
    Can anyone explain me how anchor points and their respective target areas are chosen? From what I understand the algorithm will work only if most of anchor points in the sample are also anchor points in the original record.
    Thanks.

  15. T munce says:

    Any chance the same technique could be used for video in the near future?

  16. sruti says:

    nice post


