You are sitting in a restaurant. A song comes on. You know you know it, but the name escapes you. You pull out your phone, open Shazam, hold it up for a few seconds, and there it is — title, artist, album, streaming link. The whole thing takes less time than it takes to say “what is this song?”
It feels like a trick. It is not. It is mathematics.
Shazam was built on a core algorithmic insight developed by its co-founder Avery Wang in the early 2000s, at a time when the idea of reliably identifying a song from a short, noisy audio clip was considered computationally impractical. Wang disagreed. What he developed — a system built on spectrograms, combinatorial hashing, and probabilistic matching — became one of the most elegant applications of applied mathematics in consumer technology. It is worth understanding how it actually works.
Step One: Turning Sound Into a Picture
When you hold your phone up to a speaker, Shazam is not recording the song and comparing it to a library of recordings. That would be computationally catastrophic — matching raw audio waveforms against millions of tracks in milliseconds is not a problem any system could reasonably solve at scale.
Instead, Shazam first converts the incoming audio into a spectrogram.
A spectrogram is a visual representation of a sound signal. The horizontal axis represents time. The vertical axis represents frequency — essentially, which pitches are present at each moment. The brightness or intensity at any given point on the spectrogram represents the amplitude, or loudness, of a particular frequency at a particular moment in time.
To produce this, Shazam applies the Fast Fourier Transform (FFT), one of the most important algorithms in the history of computing. The FFT takes a time-domain signal — the raw waveform of incoming audio — and converts it into the frequency domain, decomposing it into its constituent frequencies. This transformation is applied repeatedly across short overlapping windows of the audio, stitching together a time-frequency map of the sound.
The result is a spectrogram: a dense time-frequency map of the sound. Once it is reduced to its strongest points (the next step), it looks something like a star field, a two-dimensional scatter of dots marking which frequencies dominate at which moments.
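The short-time FFT described above can be sketched in a few lines. This is a minimal illustration, not Shazam's implementation: the window length, overlap, and Hann taper are common textbook choices, and Shazam's actual parameters are not public.

```python
import numpy as np

def spectrogram(signal, sample_rate, window_ms=30, overlap=0.5):
    # Slice the signal into short overlapping windows and FFT each one.
    win_len = int(sample_rate * window_ms / 1000)
    hop = int(win_len * (1 - overlap))
    window = np.hanning(win_len)  # taper each slice to reduce spectral leakage
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = signal[start:start + win_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))  # amplitude per frequency bin
    return np.array(frames)  # shape: (time_steps, frequency_bins)

# A pure 440 Hz tone should light up a single horizontal band near 440 Hz.
sr = 8000
t = np.arange(sr) / sr  # one second of audio
spec = spectrogram(np.sin(2 * np.pi * 440 * t), sr)
freqs = np.fft.rfftfreq(int(sr * 0.03), 1 / sr)
print(freqs[spec[0].argmax()])  # the FFT bin nearest 440 Hz
```

Each row of the returned array is one moment in time; each column is one frequency bin. That grid is the picture the rest of the pipeline works on.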
What Shazam does next is where it gets clever.
Step Two: Finding the Peaks
A spectrogram of a real-world audio clip is noisy. There is ambient sound, reverb, background conversation, the compression artifacts of a phone microphone. If Shazam tried to match the entire spectrogram against its database, the noise would make reliable matching nearly impossible.
So instead of using the whole spectrogram, Shazam extracts only the local peaks — the points in the spectrogram where the amplitude is significantly higher than the surrounding frequencies at that moment in time. These peaks correspond roughly to the dominant, loudest frequencies in each time window.
The insight here is powerful. The loudest frequencies in a song — the ones that dominate at each moment — tend to be consistent regardless of how the song is being recorded or reproduced. Whether the song is playing through a high-end speaker system or a tinny laptop, whether you are in a quiet room or a noisy bar, the dominant frequency peaks remain relatively stable. Background noise, by contrast, spreads its energy across many frequencies at relatively low amplitudes. The peaks survive the noise. The noise does not survive the peaks.
This gives Shazam a sparse, robust fingerprint of the audio — a constellation of points in time-frequency space that represents the song’s essential character, stripped of environmental noise.
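Peak extraction can be approximated with a simple local-maximum filter over the spectrogram. The neighborhood radius and amplitude floor below are illustrative stand-ins; Shazam's actual peak-picking criteria are not public.

```python
import numpy as np

def extract_peaks(spec, radius=4, min_amplitude=0.1):
    # A point survives only if it dominates its local time-frequency
    # neighborhood and clears an absolute loudness floor.
    T, F = spec.shape
    peaks = []
    for ti in range(T):
        for fi in range(F):
            v = spec[ti, fi]
            if v < min_amplitude:
                continue
            t0, t1 = max(0, ti - radius), min(T, ti + radius + 1)
            f0, f1 = max(0, fi - radius), min(F, fi + radius + 1)
            if v >= spec[t0:t1, f0:f1].max():
                peaks.append((ti, fi))  # (time index, frequency bin)
    return peaks

# Two strong peaks buried in low-level broadband noise: only the peaks survive.
rng = np.random.default_rng(0)
spec = rng.random((30, 30)) * 0.05  # noise floor, spread thin across all bins
spec[10, 5] = 1.0                   # dominant peak one
spec[20, 25] = 0.8                  # dominant peak two
print(extract_peaks(spec))          # → [(10, 5), (20, 25)]
```

The noise vanishes precisely because its energy is spread thin, exactly the property the article describes.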
Step Three: The Combinatorial Hash — The Core Invention
The peak constellation by itself is not enough. You still need a way to quickly search a database of millions of fingerprints and find a match. This is where Avery Wang’s core contribution comes in: the combinatorial hash.
Rather than treating each peak individually, Shazam pairs peaks together in a specific way. For each peak — call it the anchor point — Shazam looks at nearby peaks that fall within a defined time window ahead of the anchor. For each such pair, it generates a hash using three pieces of information:
- The frequency of the anchor peak
- The frequency of the paired peak
- The time difference between them
These three values are combined into a single integer — a hash. This hash is what gets stored in Shazam’s database and what gets generated from your incoming audio clip during a search.
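Under illustrative assumptions, the pairing and packing step can be sketched as follows. The bit widths, fan-out limit, and pairing window here are hypothetical choices for the sketch, not Shazam's published parameters.

```python
def hash_peaks(peaks, fan_out=5, max_dt=64):
    # For each anchor peak, pair it with up to `fan_out` peaks occurring
    # within `max_dt` time steps after it, and pack (anchor frequency,
    # paired frequency, time delta) into one integer.
    peaks = sorted(peaks)  # order by time, then frequency
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt == 0:
                continue  # skip simultaneous peaks
            if dt > max_dt or paired == fan_out:
                break
            # 10 bits per frequency bin, 12 bits for the time delta
            hashes.append(((f1 << 22) | (f2 << 12) | dt, t1))
            paired += 1
    return hashes

# A toy constellation of three peaks yields three anchor/pair hashes.
constellation = [(0, 100), (3, 200), (10, 150)]
for h, anchor_time in hash_peaks(constellation):
    print(hex(h), anchor_time)
```

Each hash is stored alongside the time of its anchor peak, which the matching stage will need.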
The genius of this approach is threefold.
First, it is compact. Each hash is a small integer, not a large waveform. Millions of them can be stored efficiently and looked up almost instantly.
Second, it is noise-resistant. Because the hashes are derived from the relationships between peaks rather than their absolute positions, small distortions from noise or recording quality affect the exact positions of peaks slightly but tend not to destroy the relationship between two peaks that are strongly present in the signal. Enough of the hashes survive intact to allow matching.
Third, it is time-invariant. The hash encodes the time difference between an anchor and its paired peak, not their absolute positions in time. This means it does not matter where in a song you started recording. A hash generated from the chorus at 1:30 matches a hash generated from the same moment in the database regardless of when in your clip that moment appears.
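The time-invariance property falls straight out of the hash construction. Using the same illustrative bit packing as above, the same pair of peaks hashes identically no matter where in the clip it was captured:

```python
def pair_hash(anchor, paired):
    # Only the two frequencies and the time *difference* enter the hash;
    # absolute time is deliberately discarded.
    (t1, f1), (t2, f2) = anchor, paired
    return (f1 << 22) | (f2 << 12) | (t2 - t1)

# The same pair of peaks, captured 90 time steps later in a clip,
# produces the identical hash.
early = pair_hash((0, 100), (3, 200))
late = pair_hash((90, 100), (93, 200))
print(early == late)  # → True
```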
Step Four: The Database and the Search
Shazam pre-processes every song in its catalogue using exactly this method — FFT, peak extraction, combinatorial hashing — and stores the resulting hashes in a massive lookup table. Each hash maps to a list of (song ID, timestamp) pairs indicating which songs contain that hash, and at what point in those songs.
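The lookup table described above is an inverted index. A minimal sketch, assuming fingerprints shaped as (hash, anchor time) pairs:

```python
from collections import defaultdict

def build_index(catalogue):
    # catalogue maps song_id -> list of (hash, anchor_time) fingerprints.
    # The index inverts this: hash -> every (song_id, anchor_time) where
    # that hash occurs, so each lookup is a single dictionary access.
    index = defaultdict(list)
    for song_id, fingerprints in catalogue.items():
        for h, t in fingerprints:
            index[h].append((song_id, t))
    return index

catalogue = {"song_a": [(0x1A2B, 12), (0x3C4D, 40)],
             "song_b": [(0x1A2B, 7)]}
index = build_index(catalogue)
print(index[0x1A2B])  # → [('song_a', 12), ('song_b', 7)]
```

Note that one hash can, and routinely does, point at many songs, which is why the matching stage cannot rely on single hits.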
When you hold up your phone, Shazam generates hashes from your audio clip using the same process and fires them against the database. For each hash, it retrieves a list of candidate songs.
Here is the critical part: a single matching hash proves nothing. Popular frequency combinations appear in many songs. What Shazam needs is coherent matching — hashes that not only appear in the same song, but appear at consistent time offsets from each other, matching the temporal structure of the original track.
This is solved through a process analogous to histogram voting. For each candidate song, Shazam plots the time offsets between hashes in your clip and the corresponding hashes in the database. If the song is a genuine match, these offsets will cluster tightly around a single value — the point in the song where your clip begins. Random coincidental hash matches, by contrast, produce offsets scattered randomly across the full song duration.
A strong cluster of coherent time offsets constitutes a match. No cluster, no match. The threshold is calibrated to be high enough to eliminate false positives but low enough to tolerate the inevitable hash losses that come from noise.
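The histogram-voting idea can be sketched as a counter over (song, offset) pairs, where the offset is the song's timestamp minus the clip's timestamp for a matching hash. The vote threshold here is an illustrative stand-in for Shazam's calibrated one.

```python
from collections import Counter

def best_match(index, clip_hashes, min_votes=5):
    # Each matching hash votes for (song, song_time - clip_time). A genuine
    # match piles its votes onto a single offset; chance collisions scatter.
    votes = Counter()
    for h, clip_t in clip_hashes:
        for song_id, song_t in index.get(h, []):
            votes[(song_id, song_t - clip_t)] += 1
    if not votes:
        return None
    (song_id, offset), count = votes.most_common(1)[0]
    return (song_id, offset) if count >= min_votes else None

# Six clip hashes line up with song "x" at a constant offset of 10;
# a single stray collision with song "y" is outvoted.
index = {h: [("x", h + 10)] for h in range(6)}
index[0].append(("y", 99))
clip = [(h, h) for h in range(6)]
print(best_match(index, clip))  # → ('x', 10)
```

The returned offset is itself useful: it says where in the song the clip begins.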
Why This Works in Practice
The combinatorial hash approach has several properties that make it extraordinarily well-suited to real-world conditions.
Scalability. The hash lookup is essentially an indexed database search — O(1) per hash query. Even with a catalogue of tens of millions of songs, the matching process runs in milliseconds. Brute-force waveform comparison would take hours.
Noise tolerance. Lab tests of Shazam’s underlying algorithm have shown matching accuracy above 90% even when the signal-to-noise ratio is extremely poor — scenarios where the background noise is nearly as loud as the music itself. This is a direct consequence of the peak extraction step, which acts as a robust filter against noise.
Partial matching. Because the hash encodes time differences rather than absolute positions, Shazam can match any contiguous segment of a song — intro, verse, chorus, bridge. You do not have to start from the beginning.
Sensitivity to pitch shift and tempo change. This one is a limitation worth noting, not a strength. Standard Shazam hashing is sensitive to pitch shifts and significant tempo changes, because the FFT-derived frequencies and time differences are absolute values. This is one reason why a song played at a slightly different pitch (as some DJs do) can occasionally confuse Shazam. Advanced versions of the system apply additional normalisation steps to handle these edge cases.
The Scale of the Problem They Solved
It is worth pausing to appreciate the engineering context in which this was built. Avery Wang developed the core algorithm around 2000 and 2001. The paper documenting it — An Industrial Strength Audio Search Algorithm — was presented at the International Conference on Music Information Retrieval (ISMIR) in 2003. At the time, the idea of a service that could identify a song from a noisy clip in seconds, running on a mobile device over a cellular connection, was not a product category that existed. The iPhone would not arrive for another four years.
Wang’s insight was to reframe the problem. Instead of asking “how do I match audio signals?”, he asked “how do I extract the minimum information necessary to uniquely identify a song, in a form that is fast to search and robust to noise?” The answer — sparse peaks, combinatorial hashes, coherent time-offset matching — solved all three constraints simultaneously.
What Shazam Actually Is
Shazam is often described as an audio recognition app. That is technically accurate but undersells what it is. It is a demonstration that the right mathematical abstraction can make an apparently intractable problem not just solvable but trivially fast.
The spectrogram reduces a continuous audio signal to a manageable 2D representation. The peak extraction filters out noise by exploiting the physics of how loud sounds dominate frequency spectra. The combinatorial hash converts a spatial pattern into a searchable integer. The coherent offset test converts probabilistic evidence into a reliable binary decision.
Each step is individually well-understood mathematics. The insight was knowing which steps to combine, in which order, and why.
That is what engineering at its best looks like — not the invention of new mathematics, but the construction of a pipeline where existing mathematics solves a problem that previously seemed unsolvable.
The next time Shazam names a song in four seconds, that is what it did.