LLMs are bland

LLMs are bland. Two new papers, Art or Artifice? Large Language Models and the False Promise of Creativity and A Robot Walks into a Bar: Can Language Models Serve as Creativity Support Tools for Comedy? An Evaluation of LLMs’ Humour Alignment with Comedians, reinforce the findings of a more rigorous paper from a few months ago, Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text: given a context, the output of LLMs is non-creative.

If you prompt an LLM for “a journal entry from an astrophysicist who is a capybara,” or for “a sea shanty about two ginger cats named Kevin and Bob,” you’ll get a creation that is:

  1. Surprisingly competent; and
  2. “Outside the distribution” of texts on which the LLM is trained.

We’ve all more or less internalized (1), but (2) is really at the core of why there’s such an AI boom in terms of corporate interest.

It’s very unlikely that the LLM’s training set contains even a single autobiography of a capybara astrophysicist. The odds of a sea shanty about cats may not be zero, but they’re low. So you might reasonably feel that the output of the LLM is “creative.” And if you’re a company that pays employees premium salaries to be creative, you might expect (and they might fear) that their work could be done cheaper by an LLM in the near future.

But what these papers show, and what supports an impression I’ve formed in this past year-and-a-half post-GPT-3.5, is that yes, LLMs can generate outside the distribution of their training data, but they aren’t creative within that new topic area. As a matter of fact, they produce notably bland, uncreative work within it.

The “Art or Artifice” paper referred to above is the clearest on this:

  1. LLMs were prompted with the premises of short stories that had appeared in the New Yorker.
  2. The LLMs’ outputs and the New Yorker stories were evaluated by literary folks (professors of creative writing and such).

The result:

LLMs did terribly when ranked on “creative” things like plot and character development.

The comedy paper finds that, while the LLMs may have helped structure or frame a comedic set, they didn’t write funny jokes. Jokes, of course, rely on subversion of expectation aka “creative viewpoints.”

The “Spotting LLMs” paper puts it more rigorously, using a measure called perplexity. Perplexity is a measure of how surprising a sequence of words is (formally, it’s the exponential of the cross-entropy loss). An autobiographical sketch of an astrophysicist capybara will have high perplexity because few existing texts contain both “Hubble constant” and “South American rodent”; it’s the prompt that drives the perplexity up. Given the same prompt, LLMs characteristically produce lower-perplexity sequences than humans do.
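The definition is compact enough to sketch in a few lines of Python. This is a toy illustration, not the Binoculars method itself: the per-token probabilities below are invented for the example, where a real detector would get them from a language model.

```python
import math

def perplexity(token_probs):
    # Perplexity is the exponential of the cross-entropy loss:
    # exp of the average negative log-probability per token.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Made-up per-token probabilities, not from a real model: a bland
# continuation is assigned high probability at each step, while a
# surprising one is assigned low probability at each step.
bland      = [0.9, 0.8, 0.85, 0.9]
surprising = [0.2, 0.05, 0.10, 0.15]

print(perplexity(bland))       # close to 1: unsurprising text
print(perplexity(surprising))  # much higher: surprising text
```

A sequence the model finds perfectly predictable has perplexity 1; the more the text surprises the model, the higher the number climbs.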

Put another way: if you gave the astrophysicist capybara task to a creative writing class, the LLM would produce the most boring response.

Caveat: There’s a parameter used with LLMs called “temperature” which modifies the randomness of the next word. A temperature of 0 means “The cat in the” will almost certainly be followed by “hat,” while a higher temperature makes “window” or “cupboard” plausible next words. With a temperature near 1, the LLM might even generate “the cat in the lamborghini.” Such a temperature would hurt LLM performance on question-answering tasks but might improve performance on “creative” tasks. None of the three papers explored the impact of temperature. I suspect they all just used the default settings on their LLMs.
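Mechanically, temperature just divides the model’s raw scores (logits) before the softmax. Here’s a minimal sketch; the words and logit values for completing “The cat in the” are invented for illustration, not taken from any actual model.

```python
import math

def temperature_probs(logits, temperature):
    # Turn raw logits into next-token probabilities at a given
    # temperature. T = 0 is greedy decoding (all mass on the argmax);
    # higher T flattens the distribution toward uniform.
    if temperature == 0:
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical logits for completions of "The cat in the ...":
words  = ["hat", "window", "cupboard", "lamborghini"]
logits = [5.0, 2.0, 1.5, -1.0]

for t in (0, 0.5, 1.0, 2.0):
    probs = temperature_probs(logits, t)
    print(t, [f"{w}={p:.2f}" for w, p in zip(words, probs)])
```

At low temperature nearly all the probability piles onto “hat”; as the temperature rises, “lamborghini” gets a real chance of being sampled.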

The big story here is that the near-term threat of LLMs to creative fields is overstated. You are not going to get ML-generated movies or TV shows or airport novels.

Not yet.