As we launch our new “Bioacoustic AI” research project, I thought it a good moment to look over the state of the art in machine learning methods for animal sound. This short article gives you an update on interesting machine learning developments I have seen recently.

Our project is wider than just “machine learning” – we cover electronic devices, animal behaviour, and ecology too – but I’ll focus here on the ML.

Most of what I want to say about deep learning for bioacoustics is contained in the overview I wrote in 2022. I don’t need to repeat all of that – I suggest you read that paper for a more complete treatment, especially if you’re new to this area. This blog article focuses on what has happened since then in machine learning for animal sounds.

Deep learning is getting easier: Off-the-shelf CNNs

I wrote in 2022 that it was becoming common to use an “off-the-shelf” CNN (convolutional neural network) rather than designing your own, and this trend certainly continues. This year I have increasingly seen projects choosing the “EfficientNet” CNN – it seems to be a robust and efficient design. ResNet and MobileNet are also popular.

I also wrote about the increasing use of pre-trained networks. This also continues to be very widespread, and it’s a good thing. One particular benefit of pre-trained networks is that they can dramatically shorten the training time needed for a given task, which may be an important factor in the carbon footprint of this type of research.
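
To make this concrete, here is a minimal sketch (not taken from any particular paper) of how a pretrained, off-the-shelf CNN might be reused for spectrogram classification, assuming the timm and torchaudio libraries; the class count, sample rate and audio below are placeholders.

```python
# Minimal sketch (not from any particular paper): reuse an off-the-shelf,
# pretrained CNN for spectrogram classification. Class count and audio are placeholders.
import torch
import torchaudio
import timm

NUM_SPECIES = 20  # hypothetical number of target classes

# Pretrained EfficientNet via timm; in_chans=1 because we feed single-channel spectrograms.
model = timm.create_model("efficientnet_b0", pretrained=True,
                          in_chans=1, num_classes=NUM_SPECIES)

# One way pretraining shortens training: freeze the backbone, train only the classifier head.
for name, param in model.named_parameters():
    if "classifier" not in name:
        param.requires_grad = False

# Standard mel-spectrogram front end.
melspec = torchaudio.transforms.MelSpectrogram(sample_rate=32000, n_mels=128)
to_db = torchaudio.transforms.AmplitudeToDB()

waveform = torch.randn(1, 32000 * 5)          # placeholder: 5 s of audio at 32 kHz
spec = to_db(melspec(waveform)).unsqueeze(0)  # shape (batch, 1, n_mels, time)
logits = model(spec)                          # shape (batch, NUM_SPECIES)
```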

BirdNET is a popular birdsong classification CNN model, and various authors have explored its use directly for classification, as well as its use as a pretrained network for other tasks. In the first category, I recommend the 2023 paper “BirdNET: applications, performance, pitfalls and future opportunities”, which looks at the algorithm from a user’s perspective and across various studies that have used it. In the second category, the paper by my colleague Burooj Ghani on “Global birdsong embeddings” looks at big pretrained models, especially bird classifiers, reused for other tasks such as classifying non-bird species. They argue that models pretrained on diverse birdsong are particularly powerful – for them, both BirdNET and Perch are consistently good performers.

These pre-trained networks are often used to provide “embeddings” – meaning, to transform an input into some kind of representation that we can then use as if it were any other set of numerical “features”. Embeddings are everywhere. I’ll come back to that topic.
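
As an illustration of the embeddings-as-features workflow, here is a hedged sketch in which a generic pretrained CNN (again via timm) stands in for a dedicated bioacoustic embedding model such as BirdNET or Perch, whose own APIs differ; a simple scikit-learn classifier is then trained on top of the resulting vectors, with the data being placeholders.

```python
# Sketch of the "embeddings as features" workflow. A generic pretrained CNN
# (via timm) stands in here for a dedicated bioacoustic embedding model such as
# BirdNET or Perch; the point is that the embedding is just a feature vector.
import numpy as np
import torch
import timm
from sklearn.linear_model import LogisticRegression

# num_classes=0 makes timm return pooled embeddings rather than class logits.
embedder = timm.create_model("efficientnet_b0", pretrained=True,
                             in_chans=1, num_classes=0).eval()

def embed(spectrogram: torch.Tensor) -> np.ndarray:
    """Map one (1, n_mels, time) spectrogram to a fixed-length embedding vector."""
    with torch.no_grad():
        return embedder(spectrogram.unsqueeze(0)).squeeze(0).numpy()

# Placeholder data: in practice, spectrograms of labelled clips for the new task.
specs = [torch.randn(1, 128, 400) for _ in range(32)]
labels = np.random.randint(0, 2, size=32)

X = np.stack([embed(s) for s in specs])
clf = LogisticRegression(max_iter=1000).fit(X, labels)  # shallow classifier on top
```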

Another aspect of “off-the-shelf” is making algorithms easier to use. ANIMAL-SPOT (paper/code) is one example of a framework that aims to make training a bioacoustic ML system hassle-free. It uses a standard CNN (ResNet-18) and applies a pre-designed training setup to a user-supplied set of audio files.

CNNs versus Transformers: Which to use?

There is no final “consensus” about whether Transformers will take over from CNNs as the best default deep learning architecture for audio. In both cases, the spectrogram is still the most common input data, even though “raw” waveform approaches are possible (and interesting!) for both. However, it’s increasingly common to see good results coming from spectrogram-based Transformers. (What about spectrogram-free transformers? It could happen…)
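
For a flavour of the spectrogram-based Transformer approach, here is a small sketch using the Audio Spectrogram Transformer (AST) as packaged in the Hugging Face transformers library; the AudioSet-trained checkpoint named below is purely illustrative, not a bioacoustics model.

```python
# Sketch of a spectrogram-based Transformer: the Audio Spectrogram Transformer (AST)
# via the Hugging Face transformers library. The AudioSet-trained checkpoint named
# below is just an illustration, not a bioacoustics-specific model.
import torch
from transformers import ASTFeatureExtractor, ASTForAudioClassification

ckpt = "MIT/ast-finetuned-audioset-10-10-0.4593"
extractor = ASTFeatureExtractor.from_pretrained(ckpt)   # waveform -> log-mel spectrogram
model = ASTForAudioClassification.from_pretrained(ckpt)

waveform = torch.randn(16000 * 10)                      # placeholder: 10 s at 16 kHz
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                     # one score per AudioSet class
print(model.config.id2label[int(logits.argmax(-1))])
```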

Few-shot learning shows us how to use many datasets

We have been running the few-shot bioacoustics challenge for a few years now, and it has given us a lot of insights and ideas. I recommend our 2023 journal article about few-shot learning which gives the most complete discussion and analysis.

Some lessons we have learnt from this include:

  • Few-shot learning and “meta-learning” are good directions to investigate because they are one more step towards generalised multi-dataset learning. We now live in a world of many diverse open datasets. The above-mentioned “pretrained” paradigm is a way to make use of two different datasets (typically one big, one small). The few-shot paradigm is a way to make use of lots of small datasets. The next step is to learn from diverse datasets of diverse sizes and shapes.
  • Sound event duration is an important but overlooked factor. Bioacoustic sound events can be as short as 10 ms or as long as 10 s (or more), and most ML systems do not cope well with that scale of variation.
  • How to set up the deep learning training objective? There are lots of ways to make it work – including fine-tuning, prototypical networks, transductive inference, and dynamic weighting. There’s no single best method so far, and more to explore. (For one concrete example, prototypical networks, see the sketch after this list.)
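
To make one of those options concrete, here is a minimal sketch of the core computation in prototypical networks; the embedding network that would produce these vectors is omitted, and the shapes are purely illustrative.

```python
# Minimal sketch of prototypical networks, one of the few-shot options above.
# Each class prototype is the mean embedding of its few labelled "support" clips;
# query clips are scored by (negative squared) distance to each prototype.
import torch

def prototypical_logits(support_emb: torch.Tensor,  # (n_classes, k_shot, dim)
                        query_emb: torch.Tensor     # (n_query, dim)
                        ) -> torch.Tensor:
    prototypes = support_emb.mean(dim=1)             # (n_classes, dim)
    dists = torch.cdist(query_emb, prototypes) ** 2  # (n_query, n_classes)
    return -dists                                    # higher score = closer prototype

# Toy usage: a "5-way 3-shot" episode with 8 query sounds and 64-dim embeddings,
# as would be produced by some embedding network (omitted here).
support = torch.randn(5, 3, 64)
queries = torch.randn(8, 64)
predictions = prototypical_logits(support, queries).argmax(dim=1)
```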

Self-supervised learning

“Self-supervised learning” (SSL) is a popular term in ML, and it refers to a type of unsupervised learning (i.e. you don’t need any labels for your data). The SSL trick is to make the system predict some aspect of the data itself, e.g. to predict missing pieces. That way, the power of supervised learning is used to learn unsupervised representations.
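
As a toy illustration of that trick, the sketch below masks part of each (unlabelled) spectrogram and trains a small placeholder encoder-decoder to reconstruct the hidden regions; a real SSL system would use a much larger model and a more carefully chosen prediction task.

```python
# Toy illustration of the SSL trick: hide part of each (unlabelled) spectrogram
# and train a small placeholder encoder-decoder to reconstruct the hidden part.
# A real system would use a far larger model and a more carefully chosen task.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
decoder = nn.Conv2d(16, 1, 3, padding=1)
optimiser = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

specs = torch.randn(8, 1, 128, 256)              # batch of unlabelled spectrograms
mask = (torch.rand(8, 1, 1, 256) > 0.3).float()  # keep ~70% of time frames, hide the rest

reconstruction = decoder(encoder(specs * mask))
# Loss is measured only on the hidden regions: predicting the missing pieces.
loss = ((reconstruction - specs) ** 2 * (1 - mask)).mean()
loss.backward()
optimiser.step()
```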

I’ve seen this used in bioacoustics, a little. However, it works best in domains where you can access a large amount of diverse but unlabelled data, and in bioacoustics such large volumes of data are not so common. I think we have yet to see this explored thoroughly. There are SSL pre-trained models such as Microsoft’s BEATs, but it hasn’t yet been shown whether they’re better than the other models I’ve mentioned. Plus, I have questions about what would be an ideal training task for bioacoustic SSL. … Something to discuss.

Multimodal ML: sound, image, text, all together?

One other big trend in deep learning recently is multimodal ML – i.e. not just text, or audio, or images, but multiple types of input together. Of course, the general idea is not new – “let’s use sound AND image to make a prediction” – but the recent trend is partly driven by the use of “embeddings”: since we now have well-established pipelines to turn an image into a vector, a text sentence into a vector, or a sound into a vector, we can join those pipelines together and even train an end-to-end algorithm made of these joined pipelines.
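
To show what joining these pipelines together can look like, here is a hedged sketch of a CLIP/CLAP-style contrastive objective that aligns audio and text embeddings in a shared space; the embedding vectors and projection layers are placeholders standing in for real per-modality pipelines.

```python
# Sketch of joining embedding pipelines, CLIP/CLAP-style: project audio and text
# embeddings into a shared space and train them to agree on matching pairs with a
# contrastive (InfoNCE) loss. The embeddings and projections here are placeholders.
import torch
import torch.nn.functional as F

audio_emb = torch.randn(16, 512)        # e.g. from an audio embedding pipeline
text_emb = torch.randn(16, 768)         # paired text descriptions, from a text pipeline

audio_proj = torch.nn.Linear(512, 256)  # small projection heads into a shared space
text_proj = torch.nn.Linear(768, 256)

a = F.normalize(audio_proj(audio_emb), dim=-1)
t = F.normalize(text_proj(text_emb), dim=-1)

logits = a @ t.T / 0.07                 # pairwise similarities, temperature 0.07
targets = torch.arange(len(a))          # matching audio/text pairs lie on the diagonal
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
loss.backward()
```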

I know that multimodal ML is starting to make a difference in music tech – for example using music audio together with text descriptions of a musical style. Will this make a big difference in bioacoustic AI too? Possibly! The datasets aren’t fully assembled, but the info certainly exists out there – we have nature photos, nature guidebooks, taxonomies and ontologies – this will surely get developed.

What about LLMs and generative AI?

I don’t need to remind you that in 2022 and 2023 large language models (LLMs) such as ChatGPT generated a lot of attention (!) and led to some innovative ideas about how to use machine learning. Does this affect what we do in Bioacoustic AI?

Perhaps not directly, since we’re not working on generative AI. However, generative AI can be used, for example, to generate extra training data for a classifier when there are not enough “real” examples. A recent paper called “ECOGEN: Bird sounds generation using deep learning” does exactly this for bioacoustic classification.

At a more mundane level, I’m sure we’ll benefit from a lot of the large-scale data engineering innovations that are needed to train in such large-data regimes. For example, “federated learning” is one way to split the training task over many computers; and self-supervised learning (see above) is seeing a lot of development in the world of LLMs.

Beyond birds and cetaceans: insect sound and more

Most bioacoustic ML work so far focusses on birds and cetaceans (such as whales and dolphins), partly because of the available datasets. This is starting to change. In particular I want to mention insect sounds, which we have recently been focussing on.

Insects are crucial in so many ways to our ecosystems, and there is a lot of concern and uncertainty about their decline. Although not all insects are perfect for acoustic monitoring, it’s still worth developing this option for often-heard insects such as crickets, grasshoppers, cicadas, bees and wasps.

One challenge is building up the available datasets. We’ve been working with the Xeno-canto project, as well as sound recordists (Baudewijn Ode, Ed Baker), to increase the availability of species-labelled insect sounds. Working with me, Marius Faiß published a dataset derived from this work, InsectSet66, along with new work on a classifier, “Adaptive representations of sound for automatic insect recognition”. “Watch this space” for more on the subject of insect sounds…

Distance as a factor

One of the big gaps in the “standard recipe” for acoustic species classification is that the distance between the animal and the microphone is often ignored. It can have a big impact on the classification performance: see Somervuo et al (2023). Plus, distance is an important factor in statistical models that try to estimate animal populations from their detections: see Wang et al (2023). These two papers address very different aspects of distance, a topic that should be more commonly analysed in bioacoustic ML.

Plus so much more

There’s lots that I have left out of this update. This is of course just one person’s perspective, and I’d love to hear from others about the developments they’ve found interesting recently. For example, I haven’t referred to the explosion in datasets, the innovative open hardware coming out, or new discoveries about animal behaviour… I’ve limited myself to the machine learning methods. In the Bioacoustic AI project we’ll be bringing all these developments together, when our doctoral candidates start their work in 2024!

Written by

Dan Stowell
