KataGo and The Hook
When I was a kid, I played this game called Backyard Football. You draft teams of children, come up with your own plays, and play football against a CPU or a human. I only ever played the CPU, and pretty quickly I only ever played one way. One of the plays I invented was “The Hook”, where one wide receiver runs off at a 45-degree angle while everyone else gets gummed up in the center. Trying it out, I quickly noticed that something about The Hook short-circuited the poor CPU and it would never, ever see the receiver on the hook. So I’d play The Hook and whip the ball to Pablo Sanchez, who would have no defenders near him and inevitably run it in for a touchdown. I would end every single game over a hundred points up and was the undefeated master of Backyard Sports, at least on my own machine.1
Turns out this kind of thing still works! Here’s a paper about adversarial policies designed to beat KataGo, a best-in-class Go AI:
We attack the state-of-the-art Go-playing AI system KataGo by training adversarial policies against it, achieving a >97% win rate against KataGo running at superhuman settings. Our adversaries do not win by playing Go well. Instead, they trick KataGo into making serious blunders. Our attack transfers zero-shot to other superhuman Go-playing AIs, and is comprehensible to the extent that human experts can implement it without algorithmic assistance to consistently beat superhuman AIs. The core vulnerability uncovered by our attack persists even in KataGo agents adversarially trained to defend against our attack.
I use the term “predator models” for these adversarial policies. They don’t beat their prey models by playing the game better than they do, in the same way that cheetahs don’t hunt zebras by eating the grass super quickly so the zebras starve to death. They attack the weak points, the arbitrary leverage the model used to get so good in the first place. One predator model just tricks KataGo into passing so it loses. Another one “wins by first coaxing KataGo into creating a large group of stones in a circular pattern, and then exploiting a weakness in KataGo’s network which allows the adversary to capture the group.” KataGo learns by self-play, but it never coaxed itself into creating large groups of stones in a circular pattern, so it never learned how to play against that attack. When KataGo’s training run was modified to include some games exploiting this weakness, it gained a brief immunity to the predator model. But a predator model trained on this “immunized” KataGo was able to rebound from a 0% win rate back to 47%.
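The training setup behind this is worth seeing in miniature: the victim is frozen, and the adversary optimizes only against that one fixed opponent, so it converges on blind spots rather than on general skill. Here’s a deliberately tiny sketch of that dynamic in Python. Everything in it is made up for illustration: a ten-opening “game”, a victim that blunders on exactly one opening it never produced in self-play (the analogue of the circular groups), and an epsilon-greedy bandit standing in for the actual reinforcement learning the paper uses.

```python
import random

random.seed(0)

# Hypothetical toy victim: answers openings 0-9 near-perfectly, except
# opening 7, a line it never generated during self-play and so blunders on.
BLIND_SPOT = 7

def victim_wins(opening: int) -> bool:
    p_win = 0.05 if opening == BLIND_SPOT else 0.95
    return random.random() < p_win

def train_predator(episodes: int = 5000, eps: float = 0.1) -> list[float]:
    """Epsilon-greedy bandit: the 'predator' only ever sees games against
    the frozen victim, so it learns exploit rates, not the game itself."""
    plays = [0] * 10
    wins = [0] * 10
    for _ in range(episodes):
        if random.random() < eps:
            a = random.randrange(10)  # explore a random opening
        else:  # exploit: best observed win rate, optimistic about unseen arms
            a = max(range(10), key=lambda i: wins[i] / plays[i] if plays[i] else 1.0)
        plays[a] += 1
        if not victim_wins(a):  # the predator wins when the victim loses
            wins[a] += 1
    return [wins[i] / plays[i] if plays[i] else 0.0 for i in range(10)]

rates = train_predator()
print(f"preferred opening: {rates.index(max(rates))}")  # expected: 7
print(f"win rate there: {max(rates):.0%}")              # roughly 95%
```

Note what the predator never has to learn: its win rate on the other nine openings stays terrible, and it doesn’t care, because the frozen victim is the only opponent it will ever be scored against.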
The most important thing to understand about predator models is this diagram from the paper:
KataGo didn’t learn how to defend itself from predators, and is helpless against them; the predator model didn’t learn how to play Go, and is helpless against Go players. If I had dipped my toes into Backyard Football multiplayer running my little predator model, The Hook would only have worked once before they started leaving some defenders on Pablo, and I would have lost badly — even against players not capable of beating the CPU themselves. Finding a single ungoverned state in a hyper-optimized book of strategies is a lot faster than writing that book of strategies, which makes predator models energy-efficient killing machines. But then you don’t have the book of strategies! So the cheetah ain’t eating grass and Pablo’s got no clue how to dodge a block.
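The cleanest way I know to state the diagram’s point: strength stops being one number. If you sketch the three matchups as win rates (only the 97% figure comes from the paper’s abstract; the other two numbers are stand-ins for “wins comfortably”), you get a cycle, not a ladder:

```python
# Illustrative win rates for the three matchups. Only 0.97 is from the
# paper; "human" here means a strong human player with no special prep.
WIN_RATE = {
    ("predator", "katago"): 0.97,  # exploits the blind spot
    ("katago", "human"): 0.99,     # superhuman at actual Go
    ("human", "predator"): 0.90,   # the predator can't really play Go
}

def beats(a: str, b: str) -> bool:
    if (a, b) in WIN_RATE:
        return WIN_RATE[(a, b)] > 0.5
    return WIN_RATE[(b, a)] <= 0.5

players = ["predator", "katago", "human"]
for a in players:
    for b in players:
        if a != b and beats(a, b):
            print(f"{a} beats {b}")
# predator beats katago, katago beats human, human beats predator.
```

A rock-paper-scissors cycle like this can’t be summarized by a single rating per player; any one-dimensional ranking of the three has to put somebody above an opponent who reliably beats them.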
Each year, the energy we put into AI grows dramatically, both in literal kilowatt-hours and in our conceptions, affordances, and imagination. Understanding this adversarial dynamic is going to become more and more important for anyone who wants the leverage to move anything sufficiently big and computational. Take it from me, the world’s greatest football coach.
No, seriously, seasons and seasons of just throwing The Hook and getting my guaranteed touchdown over and over. Nowadays I’m big into roguelikes and look down on the games where you just grind endlessly to make a number go up; I wonder if it’s because I got it all out of my system when I was ten.↩︎