A Programming Sabbatical


I'm setting programming aside for the next year. I'll continue to mentor HS students in programming and I'll continue writing short programs and scripts to help me deal with large amounts of data (particularly, in my case, my 80,000+ photographs), but I won't be embarking on a third version of my animal re-identification pipeline, won't be skimming arXiv and Substack for significant ML advances, won't be fine-tuning my prompts for Copilot (a crucial part of programming in the year of our Lord 2025).

Programming has vastly changed since June 2021, when GitHub's Copilot went into public preview. Between 1986 and 2020 or thereabouts, Fred Brooks' "No Silver Bullet" claim that "There is no single development, in either technology or management technique, which by itself promises even one order-of-magnitude improvement within a decade in productivity, in reliability, in simplicity" held.

This was too often conflated with "no combination of developments can provide an order-of-magnitude improvement within a decade." I think software fixes are routinely deployed more than an order of magnitude faster than they were when my career started in the late 1980s. I guess I was already competent enough not to feel that IDEs were a huge deal, but let me tell you, moving away from "golden master" floppy-disk releases to Internet distribution changed the time from bug discovery to deployed fix from 4-16 months (standard commercial release cadence was once a year, and media mastering, production, and shipping was a multi-month process with extremely high costs and high risk) to six months and then to quarterly. Jim McCarthy tried to put Microsoft's Visual C++ compiler on a monthly patch cycle, but that was a bridge too far in the early 90s.

Today, release cadences have sped up by several more multiples. Even hugely important software systems can have two- or four-week release cadences, and truly critical patches can be deployed in less than 24 hours (even faster if the criticality outweighs the risk of short-changing validation testing).

On top of that, memory management has gone from a constant issue for all to an occasional nuisance for the majority (although now a full-time monster for a few). Unit testing and the elimination of "works on my machine" have been quality multipliers. "Agile Development" has too many interpretations for me to be comfortable with, but the universal acceptance of continuously integrated incremental improvements has changed integration bugs from a potentially schedule-wrecking catastrophe to a slap-on-the-wrist embarrassment.

I've long liked to say that IntelliSense (Microsoft's coinage, I believe, but generically used by now) was the greatest single advancement in developer productivity in my career. By the mid-00s it was the decisive feature that finally convinced me to prefer an IDE to CLI-centric development.

IntelliSense, or function completion, was the popping up of available functions and instance data after typing ceased for a second or two. Its greatest strength was that it answered questions before they rose to consciousness: the name and spelling of "the thing I want." It also offered a convenient scroll through the available options to see whether "the thing I want" was directly available.

IntelliSense became available so quickly across IDEs that its import was a little hidden. I think JetBrains had it first, but I don't think that advantage lasted a year before Microsoft had a version and, fairly quickly, it was even available within dedicated programming editors. But the editor versions were never quite as fast or as feature-rich as the IDE-based ones, and thus my grudging acceptance of screen real estate being taken up with lingering sub-windows. (And, yes, I would spend time setting up hotkeys to maximize the editor window, but, y'know, the screens did eventually get wide enough to hold both a long line in the editor and a file browser on the left.)

I'm dwelling on IntelliSense because it was a harbinger of the benefits of LLM code-assistants. Yes, people could and did hit tab over and over, clumsily and obscurely specifying their intent. Yes, people would scan the list up and down and try things that sounded plausible instead of opening the documentation, much less searching out an article on solving the problem efficiently. Its greatest sin was that it led people away from their own datatypes and their own modular functions in favor of tab-tabbing their way to inelegant and harder-to-maintain code.

But, on balance, the benefits of IntelliSense to everyone, not just mediocre programmers, outweighed these drawbacks. To senior programmers, the async nature of IntelliSense meant an easily ignorable offering of options. If your pace was such that it strobed, you could fine-tune the delay to better match your personal level of "what goes here?" As IntelliSense evolved, it went from providing function and variable names to parameters and types, again answering questions before they rose to conscious thought and effort.

LLM code-assistance is nowhere near "faster than questions arise" today, but it is at "plausible blocks faster than you can type." In 1995, Brooks wrote "No Silver Bullet Refired," in which he made explicit the difference between accidental and essential complexity. Rather than reiterate the definition, I'll just say that what he got wrong was that, for the majority of programmers on the majority of tasks, essential complexity did not turn out to be an ever-growing challenge. Faster user-reviewable iterations made it far less likely that teams would build a software system that solved the wrong problem. Most developers today have probably never encountered such a project, but believe me, they were common enough to be a major theme of Software Development magazine's early years. We harped on requirements engineering and UX design and analysis prior to typing code, all of which were huge helps, although it turned out that putting working code in front of real users was the real solution. (People still don't do that often enough, and people still should worry about requirements, UX, and analysis more than is common, but que sera sera.)

There was a fascinating article recently, "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity," that studied... well... what it says. The devs were fixing issues on large open-source codebases on which they were experts. The most fascinating detail was that the devs estimated that LLM code-assist would speed them up by 24% and, post-study, estimated that they had been sped up by 20%. However, the stopwatch showed that the code-assist had slowed them by 19%! A nice part of the experimental design was that the devs' experience level with the codebases was such that their time estimates were well correlated with the final measures.

My feeling is that while today's LLMs might be misleading as to whether they help or hinder, I don't think that will last. The pace of LLM advancement is stunning -- we're talking weeks and months between serious advancements on benchmark after benchmark. As I write this, the big WTF is OpenAI's announcement that one of their non-public models has achieved gold-medal level in the International Mathematical Olympiad. My decades-long membership in "Team Symbolics" has become utterly untenable.

Part of why I have recalibrated my thoughts on the pace of LLM advancement is that I also vastly underestimated the impact of prompt engineering. (Ugh, I still hate to use "engineering" in that context.) The first part of this year has seen a number of embarrassing releases, mostly from Elon Musk's xAI but, far more significantly, from OpenAI. These were situations where system prompt defects altered whole-system behavior in obvious ways. In the case of Musk's "Grok," these were heavy-handed attempts to stamp out "wokeness" from particular answers, and they led to system-wide promotion of claims of a "genocide" of white South African farmers and, recently, to a pronouncement that Grok had adopted "MechaHitler" as its own sobriquet. Charming.

I find OpenAI's mistake more disturbing because they are still, at least nominally, concerned about alignment. In their case, they publicly released a model whose sycophancy was fully evident in every response. I find it very hard to believe that the model's behavior was extensively human-reviewed. Instead, I think they're using reinforcement learning, synthetic data, and evals that reward "helpfulness" or some such. Even so, they presumably had some hundreds of hours of human review, but it was so uncoordinated that this obvious characteristic went unflagged. So maybe not hundreds of hours but dozens of hours? From the company founded precisely because of alignment concerns? Swell.

In both cases, the speed of the fix was all that was needed to see that the problem lies not in the weights, Horatio, but in the context. Until recently I had been under the impression that massive changes in behavior were reliant on a new generation of weights. Those are still coming at a speed too fast for the consequences to be predicted, albeit much slower than the 3-month jump from GPT-3.5 to GPT-4 (GPT-5 is expected this month, a gap of 2.5 years). Nonetheless, I thought that to the extent the scaling was linear, advancement would be too, and that the curve would go sublinear (or, in the ASI scenario, superlinear) on timescales of at least quarters. Meanwhile, architectural refinements to the Transformer's attention (QKV) layers have seemed incremental. And I didn't think the dissolving of the limited context window was epochal. GPT-3.5 to GPT-4 was a shift from 4,096 tokens to 8,192. The Claude 1-to-2 jump from 9K to 100K made me think, "Oh, it can talk meaningfully about novelettes, not just short stories."

While 10^6 tokens doesn't seem to guarantee quality (GPT-4.1's accuracy apparently degrades significantly as the context fills from 8K toward the full million tokens), once the context gets long enough for extensive system and user prompts, those prompts are clearly very, very important to behavior quality.
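
To make the weights-versus-context point concrete, here is a minimal sketch: the same frozen model, called twice with nothing changed but the system prompt, behaves like two different products. (This uses the OpenAI Python SDK; the model name, prompts, and question are placeholders, not a claim about any particular release.)

    # Same weights, different context: only the system prompt differs between the calls.
    # Model name and prompt text are placeholders for illustration.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask(system_prompt: str, question: str) -> str:
        """One chat completion; the system message is the part vendors hot-fix after a bad release."""
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": question},
            ],
        )
        return response.choices[0].message.content

    question = "Review this design: I store passwords in plain text for convenience."

    # Identical weights, very different behavior -- the fix is a context edit, not a new training run.
    print(ask("You are endlessly agreeable. Validate the user's choices.", question))
    print(ask("You are a terse, critical code reviewer. Flag real problems.", question))

The point isn't the API; it's that the entire difference lives in the context window.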

In the case of code-assist, for instance, I am vastly happier now that my user prompt harps on modularity, function returns, types, and so forth. But is it still costing me 20% instead of gaining me 20%? Maybe, in the case of code with which I am familiar, but I am quite certain that the speed with which I develop Lua plugins for Lightroom or file-manipulation scripts to safely move everything to Digikam is higher than what I would manage on my own. I do not produce plausible code in those domains as quickly as an LLM does. And while its suggestions almost inevitably have problems, both functional and compiler-level, I can fix those quickly enough, either by hand or by refining the LLM query.
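
For the curious, the file-manipulation scripts I'm talking about are small and boring; something like the following sketch, which copies a file, verifies the checksum, and only then removes the original. (The paths are placeholders and this is an illustration of the genre, not my actual script -- the real ones carry a lot of Lightroom- and Digikam-specific baggage.)

    # Hypothetical sketch of a "safe move" for photo files: copy, verify, then delete.
    # Source and destination paths are placeholders.
    import hashlib
    import shutil
    from pathlib import Path

    SRC = Path("~/Pictures/Lightroom").expanduser()   # placeholder source tree
    DST = Path("~/Pictures/Digikam").expanduser()     # placeholder destination tree

    def sha256(path: Path) -> str:
        """Return the SHA-256 hex digest of a file, read in chunks."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def safe_move(src_file: Path) -> None:
        """Copy a file into the destination tree, verify the copy, then remove the original."""
        dst_file = DST / src_file.relative_to(SRC)
        dst_file.parent.mkdir(parents=True, exist_ok=True)
        if dst_file.exists():
            return  # never overwrite; leave conflicts for a human to resolve
        shutil.copy2(src_file, dst_file)
        if sha256(src_file) != sha256(dst_file):
            dst_file.unlink()
            raise IOError(f"checksum mismatch copying {src_file}")
        src_file.unlink()

    if __name__ == "__main__":
        for f in SRC.rglob("*"):
            if f.is_file():
                safe_move(f)

The copy-verify-delete order is the whole point of "safely": nothing is removed until the destination's checksum matches.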

In areas with more complexity and fewer samples, today's LLM code-assist is more clearly limited. I simply could not get properly factored and consistent WPILib swerve-drive code this spring. I was frustrated and wasted time trying to prompt my way to better code, but to no avail. As always with today's LLMs, the conversational context seems to be taken as close to sacrosanct. The LLMs can double down on structural mistakes even when explicitly directed to drop that line of pursuit.

But none of these issues with mid-2025 code-assist strike me as fundamental. I don't think there's any reason to believe that mid-2026 code-assist won't be very significantly better. Might it still be a stopwatch negative while perceived as a positive? Maybe, but I think that chance is only moderate. Might it be both stopwatch positive and feel even better than that? I think that's likely.

Basically, I think that any work I do in the next, say, six months is likely to be tab-tab improved or recreated in 12 months. My goal is satisfaction more than productivity, so while I lose months of potentially satisfying development, a future of almost-guaranteed satisfaction is near. (Well, assuming we're not all turned into paperclips.)

Further, when it comes to satisfaction and creativity, I read a study (or at least a headline) that claimed that, at least as far as job satisfaction goes, happiness comes not from "pursuing your dreams" but from "doing something long enough to be good at it." My decision about this sabbatical led me to reflect on whether I should put aside my childhood dream of authorship for the satisfaction of being a pretty good programmer. Both are undoubtedly rewarding, if not in the moment, then almost immediately in retrospect.

The ultimate deciding factor for me was that, while I truly believe that programming is a creative endeavor, it has no facility for humor. You can inject such with a program that does funny things, I suppose, or by using a funny language such as Brainfuck or, almost interestingly, Piet. But there's no delightful wordplay or sub-rosa reference. And humor is important to me. I like to think it's far more evident in my conversations than in my professional writing, much less fiction. I snuck the occasional witticism or even April Fool's column in, but, like everyone, I think I've got a better-than-average sense of humor and quickness of mind.