How I Failed At Kaggle Happywhale

I just deleted my intermediate data and models for Kaggle’s Happywhale competition. I did terribly, never getting much above pure random guessing. Which was frustrating, because it’s a problem in Machine Learning that I’m very interested in (and want to do more work in).

Happywhale, Sad Human

The problem is animal re-identification. Given a photographic catalog of individual Bottlenose Dolphins (say) and a target photograph of a bottlenose dolphin, identify the particular dolphin in the target photograph or, if it’s a new individual, say that.

In the case of Bottlenose Dolphins, the characteristics you hang a re-identification on are the nicks and scars on the dorsal fin. In the case of Humpback Whales, it’s the markings on the underside of the flukes. In the case of Great White Sharks, it’s the zig-zag of the line between the gray back and the white belly. In the case of leopards, it’s spots. And so forth.

Two humpback whale flukes

Modern Machine Learning techniques are a good match for this problem. A deep neural net can map from a picture to a point in a high-dimensional space (say, 30 dimensions). Well, if you can make it such that two photographs of the same individual are near each other in 30-dimensional space, while photographs of different dolphins are farther apart, there you have it! Just calculate the 30-dimensional location of your target photo, find the nearest photo in your catalog and, if that distance is below some threshold, say “Why it’s good ol’ Flipper!” If nothing in the catalog is that close, you call it a new individual.
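A minimal sketch of that lookup, assuming the embeddings already exist. The function names, the toy 3-dimensional vectors (standing in for 30 dimensions), the individual names and the threshold value are all made up for illustration; this is not the code used in the competition.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify(target, catalog, threshold):
    """Return the name attached to the nearest catalog embedding,
    or None (meaning "new individual") if nothing is within the threshold."""
    best_name, best_dist = None, math.inf
    for name, embedding in catalog:
        d = euclidean(target, embedding)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist < threshold else None

# Hypothetical catalog: (individual name, embedding of one photo).
catalog = [
    ("Flipper", [0.0, 0.0, 1.0]),
    ("Flipper", [0.1, 0.0, 0.9]),
    ("Echo", [1.0, 1.0, 0.0]),
]

print(identify([0.05, 0.0, 1.0], catalog, threshold=0.5))  # Flipper
print(identify([5.0, 5.0, 5.0], catalog, threshold=0.5))   # None
```

Picking the threshold is its own problem: too tight and every photo is a "new individual," too loose and nothing ever is.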

How hard could this embedding be?

When an n-dimensional feature vector has this characteristic that “related inputs are near each other,” it’s called an embedding. The problem of Happywhale (and related challenges) is to generate embeddings that successfully re-identify individuals.

With ample data for each category/individual, effective embeddings are easy to generate. With a dataset like Fashion-MNIST you have 6,000 training examples for each of the 10 classes. To recognize any one input, you might use and train an architecture such as this:

MNIST architecture

(Image from https://www.kaggle.com/code/cdeotte/how-to-choose-cnn-architecture-mnist/notebook)

In a case like this, you just use the second-to-last layer as an embedding! It works (trust me)!

But in general, if you take a classifier model and just grab the penultimate layer, the embedding isn’t great. Objects of the same category are near each other, but they’re not super localized:

unclustered

(Image from https://www.kaggle.com/competitions/shopee-product-matching/discussion/226279)

Instead, you use a loss function that is specifically tailored to generate tight clusters in high-dimensional space:

Arcface clusters

There are several loss functions that make good embeddings. A popular one is ArcFace.
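To give a feel for what ArcFace does, here is a rough pure-Python sketch for a single training example. The idea: normalize the embedding and one "center" vector per class, measure the angle between them, add a margin to the angle for the true class, then apply ordinary softmax cross-entropy. The function names and the `scale` and `margin` values are illustrative assumptions, and real implementations (in a deep learning framework, with extra handling for angles that wrap past 180 degrees) differ in the details.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def arcface_loss(embedding, class_centers, label, scale=30.0, margin=0.5):
    """ArcFace-style loss for one example.

    class_centers holds one weight vector per class; label is the index
    of the true class. scale and margin are illustrative values only.
    """
    x = normalize(embedding)
    logits = []
    for j, center in enumerate(class_centers):
        cos_theta = sum(a * b for a, b in zip(x, normalize(center)))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
        if j == label:
            # Additive angular margin: make the true class look "further"
            # than it is, so training has to pull it in tighter.
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)
    # Numerically stable softmax cross-entropy.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum_exp - logits[label]

centers = [[1.0, 0.0], [0.0, 1.0]]   # two hypothetical classes
sample = [1.0, 0.8]                  # leans toward class 0, but not by much
print(arcface_loss(sample, centers, label=0, margin=0.0))  # plain normalized softmax
print(arcface_loss(sample, centers, label=0, margin=0.5))  # larger: the margin penalizes loose fits
```

The margin is what buys the tight clusters: a sample that merely lands on the right side of the decision boundary still gets a large loss until it sits close to its class center.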

Low k-shot in the dark

Having good locality with your embeddings is especially important when you don’t have a lot of photographs of the same individual. When you have just a few photographs of each individual, you face the problem of low k-shot classification, where k is the number of examples available per class.

(Tangent: image classification is the problem of putting an image into a known set of categories. For instance, “this is a photo of a whale” vs “this is a photo of a cat.” Or, “this is a photo of a humpback whale” vs “this is a photo of a blue whale.” Individual re-identification is just image classification with a huge number of categories (individual names) and, generally, low k-shot.)

But whales! I ❤️ whales!

But now, the Happywhale Competition this year had images from 28 different species, with over 15K individuals! Some photos are distant images of the backs of humpback whales and some are closeups of spinner dolphin dorsal fins.

For a baseline model, I just threw all the training data into the pot, trained it overnight, and tried that. Well, I wasn’t surprised that I had terrible results: generating the transform that accurately locates photos in one of 15K compartments is slow work!

I was confused that, even in the early days of the contest, the leaderboard was filling up with people getting 60-70% accuracy. My first baseline had about 12% accuracy!

Simple! Let’s try complexity!

I quickly (maybe hastily) concluded that the way forward was many models. Instead of throwing all the data into one model and trying to generate 15K tight clusters, I’d use multiple stages:

I trained a model to find a Region Of Interest in the photo, i.e., to crop the image tightly around the whale
I trained a model to classify the camera viewpoint (I had 6 categories: anterior, posterior, port full, starboard full, port dorsal fin, starboard dorsal fin)
I trained a model to identify the species or species-type in the photo (by species-type I mean I had categories such as “blackfish” that included several species of small toothed whales)

How complex was my multimodel approach?

multimodel architecture

sigh

Realize that this architecture requires (number of viewpoints) × (number of species) separate embedding models, one for every viewpoint/species combination!
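To put a number on it, a quick back-of-the-envelope calculation. The six viewpoints are the ones listed above; using 28 (the species count in the competition data) is an assumption that gives an upper bound, since grouping species into species-types would shrink it by an amount this post doesn't specify.

```python
VIEWPOINTS = [
    "anterior", "posterior",
    "port full", "starboard full",
    "port dorsal fin", "starboard dorsal fin",
]

def embedding_models_needed(n_viewpoints, n_species_groups):
    """One embedding model per (viewpoint, species group) pair."""
    return n_viewpoints * n_species_groups

# Assumption: one group per species, i.e. the worst case.
print(embedding_models_needed(len(VIEWPOINTS), 28))  # 168
```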

Which might have been tractable had I developed the ability to generate good embeddings for a particular viewpoint and species fairly rapidly! It would have been too much for me to do locally, but I was planning on building an Azure ML Pipeline, spending some money to spin up a half dozen training machines, and basking in glory. All I had to do was develop some kick-ass embedding code.

And that. Just. Didn’t. Work.

Am I getting close? Who knows?

The big problem with embeddings and the idea that “every individual has an area in high-dimensional space” is that when you generate a point in that high-dimensional space, it’s hard to say if it’s near or far from where you ultimately want it.

For one thing, “distance” can mean two common things when working with embeddings:

Euclidean distance, which is the straight-line distance in space between the two points
Cosine distance, which depends only on the angle between the two points as seen from the origin (one minus the cosine of that angle), ignoring how far out each point is
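The difference between the two is easy to see on toy vectors. A small sketch, with function names of my own choosing and cosine distance defined as one minus cosine similarity (a common convention, though not the only one):

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cos(angle between a and b); ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

a = [1.0, 0.0]
b = [3.0, 0.0]   # same direction as a, three times as long
c = [0.0, 1.0]   # same length as a, at a right angle

print(euclidean_distance(a, b), cosine_distance(a, b))  # 2.0 0.0
print(euclidean_distance(a, c), cosine_distance(a, c))  # ~1.414 1.0
```

By Euclidean distance `b` is farther from `a` than `c` is; by cosine distance `b` is identical to `a` and `c` is the distant one. If the embeddings are normalized to unit length first, the two measures rank neighbors the same way.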

Since this is a story of failure, I’m not going to try to justify why I was biased towards Cosine distance. But in the end, I tried almost everything with both. (And got crappy results.)

No matter which distance measure you choose, at the moment you generate an embedding for your target photo, what do you compare it to? Eventually you want it to be “near other photos of the same individual” but in the moment, what do you know?

Overwhelmed by triplets

What you can do is: instead of just the target photo, you have a triplet of photos:

the target photo,
another photo of the same individual (a positive match), and
a photo you know is not the same individual (a negative match)

Generate embeddings for each image in the triplet. Now, early in the training process these will presumably spread out all over your n-dimensional space. But you know that you’d rather have the positive match close and the negative match far. So you can change the weights to further that goal.

And that, dear friends, is called triplet loss and it’s been my world for the past month.
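In its usual form, triplet loss is zero once the negative is farther from the target (the "anchor") than the positive by at least some margin, and positive otherwise. A minimal sketch using Euclidean distance; the margin value and the toy 2-dimensional embeddings are assumptions for illustration:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

anchor = [0.0, 0.0]
near = [0.1, 0.0]
far = [1.0, 0.0]

# Positive is close, negative is far: nothing left to learn from this triplet.
print(triplet_loss(anchor, positive=near, negative=far))
# Positive is far, negative is close: a large loss, so the weights get pushed.
print(triplet_loss(anchor, positive=far, negative=near))
```

In a real training loop the three embeddings come out of the same network, and the gradient of this loss is what nudges the weights.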

Again, since this is a story of failure, I shouldn’t lecture on “tips and tricks,” but I’ll just point out one obvious challenge with triplet loss: most negative matches are going to be pretty obvious. Similarly, since cetacean photography often has images taken 1/10th of a second after the previous one, and some individuals have obvious features (often mutilated fin edges), some positive matches are also obvious. Obvious triplets teach the model almost nothing. What you want to train on are “the positive match that looks the least like this photo” and “the negative match that looks the most like this photo.” Trying to find those photos is called triplet mining and, yeesh, lemme’ tell ya’. Or, rather, what can I tell you? It didn’t work for me.
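For concreteness, here is a sketch of the "hardest" variant of triplet mining within one batch: for a given anchor, pick the same-individual photo whose embedding is farthest away and the different-individual photo whose embedding is closest. The function name and toy data are hypothetical, and this illustrates the general technique rather than reconstructing the code that failed for me.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mine_hard_triplet(anchor_idx, embeddings, labels):
    """Return (anchor, hardest positive, hardest negative) as indices,
    or None if the batch has no usable positive or negative."""
    anchor = embeddings[anchor_idx]
    positives = [i for i, lab in enumerate(labels)
                 if lab == labels[anchor_idx] and i != anchor_idx]
    negatives = [i for i, lab in enumerate(labels)
                 if lab != labels[anchor_idx]]
    if not positives or not negatives:
        return None
    # Hardest positive: same individual, but farthest away.
    hardest_pos = max(positives, key=lambda i: euclidean(anchor, embeddings[i]))
    # Hardest negative: different individual, but closest.
    hardest_neg = min(negatives, key=lambda i: euclidean(anchor, embeddings[i]))
    return anchor_idx, hardest_pos, hardest_neg

embeddings = [[0.0, 0.0], [0.1, 0.0], [2.0, 0.0], [0.5, 0.0], [5.0, 0.0]]
labels = ["A", "A", "A", "B", "B"]

print(mine_hard_triplet(0, embeddings, labels))  # (0, 2, 3)
```

Note the catch: "hardest" is measured with the current, still-bad embeddings, so the mined triplets change as training goes on, and mislabeled or near-duplicate photos tend to float to the top.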

tl;dr: I spent a month trying to apply an embeddings-based approach to cetacean re-identification and never had good results. I tried many standard things: Siamese networks, Euclidean distance, Cosine distance, soft and hard triplet mining, etc. In all cases, I would get good results with simplified datasets (Fashion-MNIST, a hand-picked set of individual cetaceans), but when I applied the same approach to the Happywhale training data, I had very poor results.