BirdCLEF is a bird-call recognition contest that runs yearly on Kaggle. (CLEF stands for Conference and Labs of the Evaluation Forum, originally the Cross-Language Evaluation Forum.)
Last year, I participated in BirdCLEF because the domain was native Hawaiian birds, so even though I didn't know much of anything about audio ML, I had a leg up in local knowledge and I thought my knowledge of sequence-prediction ML would be of use. Much to my surprise, I learned that the primary technique for audio prediction ML (at least in this contest) is to convert the audio file into spectrograms and then use image classification ML on those!
So I did that and, as is normal with Kaggle competitions, there was a tremendous amount of clustering around a preferred architecture (which usually means one that had a good write-up from previous Kaggle comps). When that happens, the preprocessing and augmentation side of model building is generally what separates the leaders from the also-rans. With bird calls, it's probably not a surprise that pretraining on a wider variety of birds was important. OK, fair enough: I was an also-ran.
But I do love birds, and when this year's BirdCLEF started, now focused on endangered birds in a region of India, I thought I might do something a little more audio-centric. Specifically, I wanted to explore the Conformer architecture, which combines convolutional and transformer approaches.
Convolution is a "sliding window" approach: extract small local features, keep the most prominent ones, slide a window over those to get slightly larger features, and so on. It dramatically cuts down on the size needed to process, let's say, an image, but it also seems intuitively like a reasonable thing to do to an audio file, which is just the momentary amplitude of the sound at the sampling-rate resolution.
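Here's a toy sketch of the idea (assuming PyTorch; the channel counts, kernel sizes, and strides are arbitrary choices of mine): a stack of strided 1-D convolutions that repeatedly extracts local features and shrinks the time axis.

```python
import torch
import torch.nn as nn

# A toy stack of strided 1-D convolutions over raw audio: each layer looks at a
# small local window and downsamples, so the sequence gets shorter as features get richer.
conv_stack = nn.Sequential(
    nn.Conv1d(in_channels=1, out_channels=32, kernel_size=10, stride=5),
    nn.ReLU(),
    nn.Conv1d(32, 64, kernel_size=8, stride=4),
    nn.ReLU(),
    nn.Conv1d(64, 128, kernel_size=4, stride=2),
    nn.ReLU(),
)

waveform = torch.randn(1, 1, 32_000)   # one second of fake audio at 32 kHz
features = conv_stack(waveform)
print(features.shape)                  # (1, 128, 798): 32,000 samples squeezed to ~800 frames
```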
Transformers and other "multi-headed attention" architectures, on the other hand, can "attend to" distant features just as well as nearby ones, up to the limits of the context size. The downside is that attention blocks are very large and expensive to train. Large language models work with vocabularies of tens of thousands and context windows in the low thousands (although with new techniques, context windows have grown incredibly, with Claude 3 having a context window of 200K tokens). But raw audio has, say, 32,000 amplitudes per second, so a 50-second clip, a reasonable length for identifying a bird call, is 1.6 million samples. You can't just slap that into a transformer and hope for the best.
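Some quick arithmetic on why:

```python
# Back-of-the-envelope: raw-audio sequence length vs. what self-attention can handle.
sample_rate = 32_000                    # samples per second (the contest audio's native rate)
clip_seconds = 50
seq_len = sample_rate * clip_seconds    # 1,600,000 amplitudes
attention_entries = seq_len ** 2        # self-attention scales quadratically with sequence length
print(f"{seq_len:,} samples -> {attention_entries:,} attention entries per head per layer")
```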
That's not far from the reason why "convert it to a spectrogram" is a reasonable preprocessing step! A spectrogram is essentially a visualization of a series of fast Fourier transforms over short windows of the signal, giving you frequency and power as pixels that you can then tackle with image-recognition ML architectures: often convolution, but you can use a "vision transformer" if you have the GPU memory. It's a reasonable technique, but you should think about how much data you're throwing away or compressing with the FFT window and the image resolution itself (pretrained image-recognition models typically use surprisingly low resolutions, while intuitively, bird songs cram a lot of signal into brief durations).
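As a rough illustration (assuming torchaudio; the window and bin counts are arbitrary choices, not anything the contest prescribes), here's how much a mel spectrogram compresses things:

```python
import torch
import torchaudio

# Turn one second of (fake) 32 kHz audio into a mel-spectrogram "image".
waveform = torch.randn(1, 32_000)
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=32_000,
    n_fft=1024,        # FFT window: trades time resolution for frequency resolution
    hop_length=512,    # stride between windows
    n_mels=128,        # number of frequency bins kept
)
spec = to_mel(waveform)
print(spec.shape)      # (1, 128, 63): 32,000 amplitudes squeezed into a 128 x 63 grid
```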
But you can see why the Conformer architecture is intriguing: convolution seems like a good idea for features that range from single amplitudes to some significant portion of a second, and transformers seem like a good idea for features that extend over the duration of a call or song.
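To make that concrete, here's a rough sketch of a single Conformer block (assuming PyTorch; plain multi-head attention stands in for the paper's relative positional encoding, and the dimensions are illustrative): half-step feed-forward, self-attention, a convolution module, another half-step feed-forward, and a final layer norm.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """One Conformer block, roughly following Gulati et al.: half-step FFN ->
    self-attention -> convolution module -> half-step FFN -> LayerNorm,
    all with residual connections."""

    def __init__(self, dim=256, heads=4, conv_kernel=31, ff_mult=4, dropout=0.1):
        super().__init__()
        def feed_forward():
            return nn.Sequential(
                nn.LayerNorm(dim),
                nn.Linear(dim, dim * ff_mult), nn.SiLU(), nn.Dropout(dropout),
                nn.Linear(dim * ff_mult, dim), nn.Dropout(dropout),
            )
        self.ff1 = feed_forward()
        self.ff2 = feed_forward()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        # Convolution module: pointwise conv + GLU, depthwise conv, norm, swish, pointwise conv.
        self.conv_norm = nn.LayerNorm(dim)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, 2 * dim, kernel_size=1),
            nn.GLU(dim=1),
            nn.Conv1d(dim, dim, kernel_size=conv_kernel, padding=conv_kernel // 2, groups=dim),
            nn.BatchNorm1d(dim),
            nn.SiLU(),
            nn.Conv1d(dim, dim, kernel_size=1),
            nn.Dropout(dropout),
        )
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (batch, time, dim)
        x = x + 0.5 * self.ff1(x)              # "macaron" half-step residual
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        c = self.conv_norm(x).transpose(1, 2)  # Conv1d wants (batch, dim, time)
        x = x + self.conv(c).transpose(1, 2)
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)

block = ConformerBlock()
frames = torch.randn(2, 200, 256)   # e.g. 200 downsampled audio frames
print(block(frames).shape)          # (2, 200, 256)
```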
The Gulati et al. paper referenced above is nice and clear and has enough information to recreate the architecture from scratch. But doing so as a purely recreational endeavor, as an experimental approach to a contest that's only a couple of months long, in a domain where I don't expect to do more work, and where I'd probably have to spend several hundred dollars in cloud charges to train up a model from scratch... not so appealing.
So I searched for Conformer implementations and found [the Wav2Vec2-Conformer model by Meta] on Hugging Face. Naturally, it's focused on speech recognition, not animal calls. Still, I hoped that I could fine-tune my way forward.
That hope dimmed when I discovered that there was only one pretrained model available and it was trained on 16 kHz speech. That means it can't represent audio features above 8 kHz (the Nyquist limit). 8 kHz is high, but I had assumed that bird calls would require 32 kHz sampling (which is the native resolution of the contest audio). I was surprised to read in this paper that birds are primarily sensitive in the 1 kHz to 5 kHz range. Hope rekindled!
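Using the 16 kHz model would mean downsampling the contest audio first; assuming torchaudio, that's roughly:

```python
import torch
import torchaudio.functional as F

clip_32k = torch.randn(1, 32_000 * 5)   # five seconds of fake contest audio at 32 kHz
clip_16k = F.resample(clip_32k, orig_freq=32_000, new_freq=16_000)
# Everything above 8 kHz is gone now (Nyquist), which should be fine for most bird calls.
print(clip_16k.shape)                   # (1, 80000)
```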
It's tremendously easy to use Hugging Face models. It's literally `Wav2Vec2ConformerModel.from_pretrained` and then you slap a classification head on it. OK, but as it turns out, the Wav2Vec2 "family" of models uses a "feature extractor" to preprocess audio files into features appropriate for the model. Honestly, I wasn't (and still am not) too clear on what exactly that does. Is this the "Wav2Vec" part, in the same way that Word2Vec is really just a way to get the embeddings your model works with? If that's true, and the feature extractor is trained to extract human-speech features, I was probably back at square one in terms of needing to train from scratch. Hope dimming again.
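For reference, the loading-plus-head step I'm talking about looks roughly like this. It's a sketch rather than a tested pipeline: the checkpoint name, the mean-pooling head, and the class count are all illustrative choices of mine.

```python
import torch
import torch.nn as nn
from transformers import AutoFeatureExtractor, Wav2Vec2ConformerModel

# Illustrative checkpoint name for the 16 kHz speech model discussed here.
checkpoint = "facebook/wav2vec2-conformer-rope-large-960h-ft"
extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
backbone = Wav2Vec2ConformerModel.from_pretrained(checkpoint)

# "Slap a classification head on it": mean-pool the hidden states, then a linear layer.
num_bird_classes = 200                       # placeholder: however many species are in the contest
head = nn.Linear(backbone.config.hidden_size, num_bird_classes)

audio = torch.randn(16_000)                  # one second of fake 16 kHz audio
inputs = extractor(audio.numpy(), sampling_rate=16_000, return_tensors="pt")
hidden = backbone(**inputs).last_hidden_state    # (batch, frames, hidden_size)
logits = head(hidden.mean(dim=1))                # (batch, num_bird_classes)
```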
What finally extinguished the hope was running `torchinfo`'s `summary` on the downloaded module. It's just too big to fit in my local GPU's RAM.
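That check is just something like this (same illustrative checkpoint name as above):

```python
from torchinfo import summary
from transformers import Wav2Vec2ConformerModel

backbone = Wav2Vec2ConformerModel.from_pretrained(
    "facebook/wav2vec2-conformer-rope-large-960h-ft"   # illustrative checkpoint name
)
# Prints per-layer output shapes and the total parameter count for a 10-second, 16 kHz clip.
summary(backbone, input_size=(1, 10 * 16_000))
```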
Moving forward would require either:
- an expensive cloud GPU or, more likely, even more expensive multi-GPU cloud-based training; or
- writing my own (smaller) implementation and training it from scratch
Nothing beats implementing it yourself as a route to understanding an architecture, and if I thought there was much of a chance of me ever doing more ML audio processing, I'd go for it. But between my focus on low K-shot reidentification of animals and the limited time available for the BirdCLEF contest, it doesn't seem like a productive use of my time. However, if you happen to have a grant for animal-call recognition, I think it would be fantastic to build a foundation model for researchers. Call recognition is a much easier task than continuous speech recognition and I would expect a call-recognizer could be done with a small implementation of this architecture.
TL;DR: The Conformer is an intriguing architecture that makes intuitive sense for audio processing. It combines convolutional and attention blocks, allowing direct processing of audio samples with relatively low memory requirements. This would seem to be a natural fit for animal-call classification. Unfortunately, there don't appear to be any pretrained models other than for speech recognition at this time. While implementing the architecture seems straightforward, training from scratch is likely to require thousands of dollars of cloud GPU time.