Machine Learning for Non-Coders: A Half-Day of Reading

The best orientation to machine learning (ML) I could find is Machine Learning Explained from MIT Sloan. It gives a good overall orientation, even if it, like all texts, suffers from underestimating the speed at which ML capabilities are evolving. I don’t think any of the texts I suggest give a good sense of how, in 2022, the popular use of the field shifted dramatically towards “generative ML”: first, the “diffusion models” that have been so successful generating imagery from prompts (DALL-E, Stable Diffusion, Midjourney, etc.), and then, towards the end of the year and into this one, how ChatGPT re-attracted the public eye to “Large Language Models” (LLMs) and, having passed some threshold of competence, triggered a sudden rush to integrate them into search engines.

It’s actually a striking feature of this search for texts: it is basically impossible for anyone to write accurately about “popular/important” uses of ML, because the integration of these technologies into existing domains (search) and their application in essentially new domains (competent art generated from text prompts) defies prediction. Similarly, I could find no texts that start with the mechanics of what’s going on inside the computer (“neural nets start with the ‘perceptron,’ which sums its inputs and outputs a 1 if the sum is greater than a threshold and a 0 otherwise”) and end up with the powerful systems that are capturing the public imagination (this year: diffusion models and LLMs. Next year: ???). (The video tutorial by Jay Alammar discussed below is the closest I could find.)
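To make that perceptron description concrete, here is a minimal sketch in Python. The inputs, weights, and threshold are invented for illustration; a real network wires together thousands to billions of these units.

```python
# A minimal perceptron: sum the weighted inputs, output 1 if the sum
# clears a threshold, else 0. The inputs, weights, and threshold here
# are invented for illustration.
inputs = [1.0, 0.0, 1.0]
weights = [0.5, 0.4, 0.25]
threshold = 0.5

weighted_sum = sum(x * w for x, w in zip(inputs, weights))
output = 1 if weighted_sum > threshold else 0
print(weighted_sum, output)  # 0.75 1
```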

After the Sloan orientation piece, the next piece I suggest is Gradient Boosted Decision Trees-Explained. Decision trees are actually more common than deep learning in regimes with high regulatory oversight (with the usual caveat about the speed with which things change). Modern decision trees are generated statistically from the data inputs, but the outputs (binary questions that do the best statistical job of subdividing the problem) are more amenable to interpretation. (Although they’ll still be generated from statistical correlations, not concepts or first principles! For instance, “Does the day of the week start with an ‘S’?,” not “Is it the weekend?” This can interfere with explainability.)
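To make the day-of-the-week example concrete, here is a sketch using scikit-learn (my illustration, not from the article) on an invented toy dataset. Notice that the learned rule is a numeric split, a statistical stand-in for “is it the weekend?”, not the concept itself.

```python
# Sketch: fit a decision tree on an invented toy dataset where the
# target is "is it a busy day?" and the only feature is the day of
# the week (0=Monday ... 6=Sunday). The tree learns a numeric split
# like "day_of_week <= 4.5", not the concept "weekend".
from sklearn.tree import DecisionTreeClassifier, export_text

days = [[0], [1], [2], [3], [4], [5], [6]]  # Mon..Sun
busy = [1, 1, 1, 1, 1, 0, 0]                # weekdays busy, weekend not

tree = DecisionTreeClassifier(max_depth=1).fit(days, busy)
print(export_text(tree, feature_names=["day_of_week"]))
# |--- day_of_week <= 4.50
# |   |--- class: 1
# |--- day_of_week >  4.50
# |   |--- class: 0
```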

The Yildirim decision trees article discusses gradients in the paragraphs following the sentence “The loss function is used to detect the residuals….” Gradients are really the secret sauce of statistical machine learning:

  1. If you know which scenarios your current model gets wrong; and

  2. You can tell “if I tweak one component of my model by such-and-such an amount, this scenario would have been correct or at least less wrong;” and

  3. You have a computer that can do that for all your scenarios and all your components, over and over again; then

  4. You have a good chance of your model “learning” how to make good predictions across a wide variety of scenarios (the sketch after this list shows the loop in miniature).
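Here is that loop as a minimal Python sketch. The data and model are invented for illustration: the true relationship is y = 3x, and the model starts with a bad guess for its single component (the weight).

```python
# The four steps above as code: nudge one model component ("weight")
# in the direction that makes each scenario less wrong, over and over.
# Toy data where the true relationship is y = 3 * x.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # (x, y) scenarios
weight = 0.0          # the model: predict y = weight * x
learning_rate = 0.05  # "such-and-such an amount"

for step in range(100):
    for x, y in data:
        error = weight * x - y              # 1: how wrong is this scenario?
        gradient = 2 * error * x            # 2: which tweak makes it less wrong?
        weight -= learning_rate * gradient  # 3: apply the tweak, repeatedly

print(round(weight, 3))  # 4: ~3.0 -- the model has "learned" the pattern
```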

Once statistical machine learning has been introduced with boosted decision trees, it’s time to move on to artificial neural networks and deep learning. I can’t bring myself to recommend any introductory text that doesn’t emphasize “biologically inspired,” rather than “…like the human brain does it.” I can’t emphasize enough that artificial neural networks are bags of numbers and multiplications and sums and blah-blah-blah. That’s it. They embody logic and construction and pattern-recognition in purely implicit statistically-generated data flows.

I think the article A Beginner’s Guide to Neural Networks and Deep Learning is good. Again it suffers from being a little left behind by the speed with which technologies are being employed, but I think it walks through the fundamentals of artificial neural networks well. It correctly says that “Deep Learning” just means any neural network with more than 3 layers: when I was doing this stuff in the late 80s, it was proved (the “universal approximation” results, usually credited to George Cybenko and to Kurt Hornik and colleagues) that 3 layers were sufficient to approximate any function a neural network could compute. They might not be efficient, but the calculations could be made with the existing computers. So 3-layer neural networks absolutely dominated until about 2010.

The Pathmind article explains the important characteristic that in “deep” neural networks, the layers near the inputs detect low-level patterns (edges, spots, etc.), layers further from the input use those low-level detections to add more abstraction (textures, color gradients, etc.), and layers near the output are detecting quite abstract features (“Ford hubcaps vs Chevrolet hubcaps,” etc.).
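As a sketch of what that layer stacking looks like in code, here is a small image network in PyTorch (my illustration, not Pathmind’s). The comments describe what such layers tend to learn in practice; no real trained network is this tidy.

```python
# A small "deep" image network: layers stacked from input to output.
# The comments describe what layers at each depth *tend* to learn.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, 3), nn.ReLU(),   # near the input: edges, spots
    nn.Conv2d(16, 32, 3), nn.ReLU(),  # middle layers: textures, gradients
    nn.Conv2d(32, 64, 3), nn.ReLU(),  # deeper: parts of objects
    nn.Flatten(),
    nn.LazyLinear(2),                 # near the output: "Ford vs Chevy hubcap"
)

image = torch.randn(1, 3, 32, 32)  # one random 32x32 RGB "image"
print(net(image).shape)            # torch.Size([1, 2])
```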

That brings us to the Alammar video The Narrated Transformer Language Model I mentioned before, which does a very good job of breaking down the “transformer” architecture that underlies essentially all of today’s Large Language Models. There’s a big gap between the introductory article and this, but aside from a few “I covered this in a previous video” moments, I think Alammar is very clear. I think it’s very important for motivated learners to get to this level. It’s important to appreciate that the architecture is still statistical and very mechanical, with no explicit symbolic or higher-level reasoning.
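To underline how mechanical it is, here is the arithmetic heart of a transformer, scaled dot-product attention, as a NumPy sketch. The shapes and values are invented; real models do exactly this kind of multiplying and weighted summing, just at enormous scale.

```python
# Scaled dot-product attention in plain NumPy: every token scores its
# relevance to every other token, the scores are normalized with a
# softmax, and the output is a weighted sum. Just sums and products.
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # how relevant is each token to each other?
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax
    return weights @ V             # weighted sum of the value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 tokens, 8-dimensional representations
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(attention(Q, K, V).shape)  # (4, 8)
```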

With this technical orientation in hand, I think it’s easier to wrestle with the existence of “capability overhang,” as nodded to in ChatGPT proves AI is finally mainstream — and things are only going to get weirder, though its degree is still surprising, as you can see in the (partial!) enumeration of 137 emergent abilities of large language models.

Tangent: Once you get to this level of reading, it’s natural to start questioning whether there’s a ghost in the machine, whether these machines are beginning to flirt with human-like reasoning and consciousness. I’ve read a lot of theory of mind stuff over the decades and I’ve never read any cognitive scientist or philosopher suggesting a transformer-like architecture prior to the computer architecture being defined in a breakout 2017 paper ([Attention Is All You Need](https://arxiv.org/abs/1706.03762), included here for completeness but not recommended for generalist reading). Even if there is a pattern-based language-generating aspect to the stream of consciousness (I’m sympathetic to this view), it’s based on something other than statistically grinding through the corpus of the Internet. And if the way we develop cognition is fundamentally different, how likely is it that we just happen to have converged upon the same final process? And even if it is the case that LLMs can be said to have knowledge of the world, aren’t they exemplars of What Mary Didn’t Know? I’ll force myself to stop this tangent now.

Just as strange women lying in ponds is no basis for a system of government, statistical correlations in Internet-scraped datasets are no basis for ethical reasoning. The field is rife with examples of systems that exhibit racial and gender bias, as discussed in Exploring gender biases in ML and AI academic research through systematic literature review. The important paper On the Dangers of Stochastic Parrots presents a number of risks associated with Large Language Models in particular. Joseph Redmon, who developed a leading framework for detecting objects in images, “stopped doing [computer vision] research because I saw the impact my work was having. I loved the work but the military applications and privacy concerns eventually became impossible to ignore.”

As a final note: I wrote this in a word processor (http://lex.page) where I could actually use GPT-3 to generate completions. As happenstance would have it, apparently I had enough opinions to do without LLM help.