Whale calls are finely tuned to travel hundreds of miles through water, but a growing amount of human-made underwater noise is confusing whales. In response, a Whale Detection Challenge was held last year, and within three days a machine learning expert leaped ahead of the competition by turning the two-second sample clips into 2-D spectrogram images. What began as an audio-recognition problem found a deep-learning image-recognition solution.
The machine learning expert, Daniel Nouri, based his spectrogram-analyzing algorithm on the image-recognition model popularized by the University of Toronto’s Alex Krizhevsky, who used a convolutional neural network (convnet) that exploits spatial correlations in images. Nouri translated the whale-song audio clips into spectrograms, which plot frequency in hertz (y-axis) against time (x-axis) and show amplitude as the intensity of each point.
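The clip-to-spectrogram step can be sketched in a few lines of Python. This is only an illustration of the general technique, not Nouri's actual pipeline, which the article does not describe: the use of NumPy, the 2,000 Hz sample rate, the frame length, the hop size, and the synthetic 150 Hz tone standing in for a whale call are all assumptions.

```python
import numpy as np

def spectrogram(signal, sample_rate, frame_len=256, hop=128):
    """Return (freqs_hz, times_s, magnitudes) for a 1-D audio signal.

    magnitudes has one row per frequency bin (y-axis) and one column
    per time frame (x-axis), so it can be treated as a 2-D image.
    """
    window = np.hanning(frame_len)  # taper each frame to reduce spectral leakage
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # FFT of each frame; transpose so rows are frequencies and columns are times
    magnitudes = np.abs(np.fft.rfft(frames, axis=1)).T
    freqs_hz = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    times_s = (np.arange(n_frames) * hop + frame_len / 2) / sample_rate
    return freqs_hz, times_s, magnitudes

# Hypothetical two-second clip: a pure 150 Hz tone sampled at 2,000 Hz (assumed values)
sample_rate = 2000
t = np.arange(2 * sample_rate) / sample_rate
clip = np.sin(2 * np.pi * 150 * t)

freqs, times, mags = spectrogram(clip, sample_rate)
peak_hz = freqs[mags.mean(axis=1).argmax()]  # strongest frequency bin overall
print(mags.shape, peak_hz)
```

The resulting array is what an image-recognition model would receive as input: a grid in which bright horizontal or sweeping bands mark the frequencies present at each moment.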
Nouri’s success was apparent after a presentation at DCLDE 2013, a conference that explores methods to track marine mammals using passive acoustics. Previous detection algorithms have been quoted as running 1 to 10 times faster than real time, but Nouri’s method harnessed a single NVIDIA GTX 580 graphics card to detect and classify sound 700 times faster than real time, parsing one year of audio recordings in around 12 hours.
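The arithmetic behind those speed figures is easy to check. The short sketch below uses only the numbers quoted above; the function name is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours of audio in one year

def processing_hours(audio_hours: float, speedup: float) -> float:
    """Wall-clock hours needed to process audio at `speedup` times real time."""
    return audio_hours / speedup

for speedup in (1, 10, 700):
    hours = processing_hours(HOURS_PER_YEAR, speedup)
    print(f"{speedup:>3}x real time: {hours:8.1f} h ({hours / 24:6.1f} days)")
```

At 700 times real time, a year of recordings takes about 12.5 hours, in line with the "around 12 hours" figure. At 10 times real time the same archive would take 876 hours, or roughly five weeks.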
Of course, it couldn’t be that easy. Even humans get confused by audio samples, often labeling segments as "unsure" and contradicting labels from other human detectors, which makes programmers wary of using sound samples with bad signal-to-noise ratios (SNR). Like any antibody system, though, lack of exposure to high-noise samples means that an algorithm is only useful in low-noise situations. And since the underwater distance between receiver and source is murky, datasets are far from universally accepted.
So Nouri learned when he tried to apply his spectrogram-analyzing model to a dataset from researchers at Oregon State University and found that his model was entirely inaccurate. Nouri found that the DCLDE dataset he’d based his model on came from a two-stage detection process, with humans operating at the first detection bottleneck. Thus, Nouri had to scrap all the data he’d used to build his algorithm and feed it new data, laboriously feeding the examples the model had mislabeled back in to retrain it (a process called active learning).
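The retraining loop described here can be illustrated with a deliberately tiny toy. This is a loose sketch of the idea, not Nouri's method: the one-number "feature" per clip, the threshold classifier, the `oracle` function standing in for an expert's review, and the 0.6 decision boundary are all hypothetical.

```python
import random

def oracle(x):
    """Stand-in for an expert's label: 1 = whale call, 0 = noise (boundary assumed)."""
    return 1 if x > 0.6 else 0

def train(examples):
    """Fit a 1-D threshold halfway between the highest noise and lowest call example."""
    highest_noise = max(x for x, y in examples if y == 0)
    lowest_call = min(x for x, y in examples if y == 1)
    return (highest_noise + lowest_call) / 2

def predict(threshold, x):
    """Classify a clip as a call (1) if its feature exceeds the threshold."""
    return 1 if x > threshold else 0

rng = random.Random(0)
pool = [rng.random() for _ in range(200)]  # unlabeled clips, reduced to one toy feature
labeled = [(0.1, 0), (0.9, 1)]             # small hand-labeled seed set
threshold = train(labeled)
history = []                               # number of mistakes found in each round

for _ in range(10):
    # The expert reviews the model's output and flags what it got wrong
    mistakes = [x for x in pool if predict(threshold, x) != oracle(x)]
    history.append(len(mistakes))
    if not mistakes:
        break
    # Feed the corrected examples back in and retrain
    labeled += [(x, oracle(x)) for x in mistakes]
    threshold = train(labeled)

print(history, threshold)
```

Each pass adds exactly the examples the current model gets wrong, so the labeling effort goes where the model is weakest rather than being spread evenly across the whole dataset.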
Two thousand examples later, Nouri had a model that the OSU researchers confirmed matched their readings—but unlike the OSU model, which uses human detection in the latter two of its three stages, Nouri’s is all machine-made. Voilà! A noise-enduring bioacoustic-to-image analysis system exists, and it has all the potential to efficiently track the songs of these majestic oceangoing leviathans.