How examples undermine GPT-4

by Crispy Chicken

GPT-3 is evidence that our ability to create models that have emergent linguistic skills we didn’t plan for outpaces our ability to design models that have desired linguistic skills.

It can do lots of interesting things. It can tell you the tone of a movie review, in a quantifiably accurate way. It can sort-of-kind-of write a sophomore-who-wasn’t-paying-attention essay. It can have interesting behavior elicited from it in all kinds of cases, as long as you’re not annoyed that it doesn’t work a lot of the time, and that you have no idea in advance where it won’t work.

I don’t mean to be a downer: GPT-3 is better than any other Language Model, even two years on, and it has forced lots of doubters to admit that Language Models do indeed learn lots of skills and patterns simply by predicting a document one word at a time. But for every complex generation task, the problem is reliability. We don’t trust people that speak nonsense 90% of the time, even if we trust people that can hide in vagaries appropriately 99% of the time.^1

Sarah Perry talks about how trust undermines science. I would like to argue that examples of GPT-3’s abilities are undermining the eventual interpretation of GPT-4.

Examples of GPT-3 behavior are usually cherry-picked: people sample from the same prompt about 10 times and show you the best answer.

“Hey, Crispy, what’s sampling?”

Glad you asked! GPT-3 doesn’t actually “say” anything: it assigns a probability to every possible string of “tokens” (kind of like words, but sometimes smaller parts of a word, so that GPT-3 doesn’t need an explicit representation of every word). When people “ask” GPT-3 something, they’re really just asking for a weighted dice roll of what word will come next, where the weights have been determined by GPT-3.
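If the dice metaphor feels abstract, here is a toy sketch of a single roll. The four candidate tokens and their probabilities are made up for illustration; a real model assigns weights to tens of thousands of tokens, and this is not how GPT-3 is actually implemented.

```python
import random

# Hypothetical next-token probabilities for some prompt (invented numbers).
NEXT_TOKEN_PROBS = {" great": 0.5, " fine": 0.3, " terrible": 0.15, " purple": 0.05}

def sample_next_token(probs, rng):
    """Roll a weighted die: each token is picked in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the rolls are repeatable
rolls = [sample_next_token(NEXT_TOKEN_PROBS, rng) for _ in range(10)]
print(rolls)
```

Same prompt, same weights, different rolls: that is all “asking again” amounts to.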

While GPT-3’s performance is impressive, the fact that we can continually re-roll such dice gives lots of wiggle room.
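To put a rough number on that wiggle room: suppose, purely as an assumption, that each sample is independently “good” with some fixed probability. Then the chance that at least one of n samples is good is 1 - (1 - p)^n. The function name and the 10% figure below are mine, not measurements of GPT-3.

```python
def chance_of_at_least_one_hit(p_good, n_samples):
    """Probability that at least one of n independent samples is good."""
    return 1 - (1 - p_good) ** n_samples

# A model that is good only 10% of the time, sampled 10 times.
best_of_ten = chance_of_at_least_one_hit(0.10, 10)
print(f"{best_of_ten:.2f}")  # about 0.65
```

So a model that fails nine times out of ten still hands you a showable answer about two times out of three, if you let yourself roll ten times.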

Lots of people attempt to preempt this worry by saying “This is the first ‘answer’ I got when I ‘asked’ GPT-3!”

But this simply opens up the door for meta-cherry-picking, which OpenAI has openly admitted to doing in their presentation of information:


(from the OpenAI blog)

Don’t worry, everyone does this, OpenAI is just being way more honest about it!

Meta-cherry-picking is simple:

  1. Try prompting GPT-3.
  2. If you get a cool output, you’re done!
  3. If you don’t get a cool output go back to step 1.
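The loop above is easy to simulate. In this sketch a random draw stands in for “a human looked at the output and found it cool”, and the success rates are invented. The point is only that the loop almost always ends with a cool example, whether the underlying rate is 50% or 5%; the only thing that changes is how many tries nobody tells you about.

```python
import random

def meta_cherry_pick(true_success_rate, rng, max_tries=1000):
    """Try prompts until one output looks cool; return the number of tries."""
    for attempt in range(1, max_tries + 1):
        # Stand-in for a human judging the output of a fresh prompt.
        if rng.random() < true_success_rate:
            return attempt
    return None  # gave up without a cool output

rng = random.Random(0)
summary = {}
for rate in (0.5, 0.05):
    tries = [meta_cherry_pick(rate, rng) for _ in range(2000)]
    found = [t for t in tries if t is not None]
    summary[rate] = {
        "share_with_cool_example": len(found) / len(tries),
        "average_tries": sum(found) / len(found),
    }
    print(rate, summary[rate])
```

Both rates end up with a cool example essentially every time, and the published example looks the same either way. The number of tries is exactly the part that gets left out.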

A lot of the time people want to show that GPT-3 is capable of doing a certain hard-to-define human thing like “responding to social cues”. Easy: just run through a list of social cues off the top of your head with meta-cherry-picking, and then use whatever passes the filter as evidence of being able to use social cues.

GPT-3 is special, because a simpler model wouldn’t produce enough things that pass this filter to be worth anyone’s time. But the examples and their popularization dynamics, both in public communications and formal academic writing, undermine GPT-4, because people will think the things that it’s capable of were already solved. Indeed, if you look at the examples of text generation in academic papers from five years ago, the examples look a little bit worse, but aren’t nearly bad enough to explain the underlying progress that has actually been happening. That’s because the same cherry-picking was already happening back then.

There’s not much we can do to stop it, but if you want to see the space more clearly, you need to understand that it’s happening. In the end it all comes back to “Take no one’s word for it.”


(from: Wikimedia)

To their great credit, OpenAI has made it easy to do that by providing a public, graphical playground for prompting GPT-3: https://openai.com/api/

If you want to find out what kind of reliability is missing that makes GPT-3 fall short of a sophomore, find out for yourself. The rest of the information ecosystem is playing a game that is more or less impossible to unwind without direct acquaintance. Many such cases, I suppose.