ML Proofs of Concept Are Hard

[Image: cartoon showing a fragile tower of dependencies]

One reason it's hard to make a business case for a Machine Learning project is that virtually any non-trivial task requires, from day one of the proof of concept, a fairly elaborate data-preparation pipeline and, in most cases, multiple models.

For instance, a project I'm considering pursuing needs three ML models in a pipeline. Each model is a known quantity; what remains is the considerable work of building the pipeline and training. And to really evaluate whether the project is worth pursuing, I need an end-to-end proof of concept. It doesn't have to handle any corner cases, but it does have to go from input to output.

I just spent the entire weekend yak-shaving my way to the very first elements of the pipeline. Why? Because the biggest lie in Machine Learning is "it's all Python." Virtually every framework and non-trivial library depends on a bunch of C/C++ extensions, and building them is a #$@&%! pain.

Now, when I'm all done, I should be able to write a Dockerfile that captures the state of my machine, but (a) that's a manual, error-prone process, and (b) it doesn't make the POC happen any quicker.
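For what it's worth, "capture the state of my machine" usually amounts to a Dockerfile along these lines. This is a minimal sketch under assumptions: the base image, the `build-essential` package, and `run_pipeline.py` are illustrative stand-ins, not the actual stack from this project:

```dockerfile
# Illustrative sketch only: base image, packages, and entrypoint are assumptions.
# Pin the base image tag so the system toolchain and libc don't drift.
FROM python:3.11-slim

# Build tools for packages that compile C/C++ extensions from source.
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Pin exact versions (e.g. the output of `pip freeze`) so the build reproduces.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
# Hypothetical entrypoint for the pipeline.
CMD ["python", "run_pipeline.py"]
```

Even then, this only records the environment after the fact; it doesn't spare you the initial yak-shaving.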