Adversarial asymmetry

by Collin Lysford

A comment from RIP DCB on the original draft of The Mongolian meta:

i don’t know much about large language modeling, but i would guess part of your point is that there’s a strategic advantage in being the adversary who can extrapolate your enemy’s structural flaws when those flaws are so site-dependent?

This is certainly true, as far as it goes. The Street View employee acting as a predator model doesn’t need to come up with the whole meta to start inverting it. But there’s a deeper and more profound lesson here, so I want to take the time to clearly parcel it out.

One way to tell where you are in Mongolia is the mountains. They are giant, striking, and effectively impossible to move. Your position relative to the mountains is exactly what it means to be at a certain location. Breaking this correlation would take a massive amount of energy: you’d need to level the mountains themselves.

Imagine this correlation on a scale, and at the other end of that scale is the position of the spare tire in the car doing the mapping. That has absolutely nothing to do with where you are in Mongolia. The spare tire sits in some spot or another, and it only correlates with location because you took many pictures in the same place with the tire wherever you happened to throw it that trip. If you take a new picture of that spot, you’ll just throw the spare tire wherever. It would actually take more energy to keep that correlation intact, because you’d have to remember how you did it before.

When you’re trying to find the geographical location of a photo using elements in that photo, some of those elements are tightly coupled to location in the world (it would take years of industry or many high-yield bombs to change a mountain’s relationship to a location in Mongolia). Other elements aren’t coupled at all in the world, only in the dataset (you can put your spare tire wherever you want as you drive through Mongolia). As a predator model, your goal is to take a picture in Mongolia that gets incorrectly classified by the meta. If everything in your picture were as invariant as the mountains, you’d be stuck. If the meta had an account of variation that properly privileged the mountains and shut its eyes to the other stuff, you’d be stuck. But the meta is trying to get all the juice it can out of the data there is right now, and that means it’s using information with respect to its predictive power, not its invariance. So you can take every single measure that’s totally uncorrelated with location in the world and change it to the exact opposite of what the meta expects, for free.
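To make this concrete, here’s a toy sketch (every name and feature below is hypothetical, not from any real meta): a classifier that scores locations by how often each feature value co-occurred with each location in training. It weights the spare tire exactly as heavily as the mountain, because in the training data both predict location perfectly — and that’s exactly what the predator exploits.

```python
from collections import Counter, defaultdict

def train(photos):
    """Count how often each (feature, value) pair co-occurs with each location."""
    counts = defaultdict(Counter)
    for features, location in photos:
        for key, value in features.items():
            counts[(key, value)][location] += 1
    return counts

def predict(counts, features):
    """Score each location by summing co-occurrence counts: every feature
    gets a vote sized by its predictive power in the training data."""
    scores = Counter()
    for key, value in features.items():
        scores.update(counts[(key, value)])
    return scores.most_common(1)[0][0]

# Hypothetical training set: one invariant feature (the mountain) and three
# features that correlate with location only by accident of the mapping trip.
west = {"mountain": "flat ridge", "tire": "left", "ornament": "dice", "antenna": "bare"}
east = {"mountain": "sharp peak", "tire": "right", "ornament": "none", "antenna": "flag"}
photos = [(dict(west), "west")] * 10 + [(dict(east), "east")] * 10

model = train(photos)

# The predator photo: taken in the west (the mountain can't lie), but with
# every arbitrary feature flipped to what the model expects of the east.
predator = {"mountain": "flat ridge", "tire": "right", "ornament": "none", "antenna": "flag"}
print(predict(model, predator))  # "east": the spurious features outvote the mountain
```

Because the three arbitrary features outvote the one invariant feature, staging them is enough to drag the prediction east, at no cost to the adversary.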

Note that if the initial dataset were bigger, the meta would be forced to adapt, becoming less strictly predictive but more robust to variation. Because the spare tire really is arbitrary with respect to location, there will be pictures of the spare tire in different places at different locations. You’ll get a clue that it doesn’t bind to location as strictly as “spare tire in a net means Western Mongolia.” But unless your pictures have the infinite permutations all mapped, with every single possible spare tire orientation present at every single location, there will still be some correlation, some combos that happen to exist and some that don’t. And the predator model will still be free to take those spurious correlations and reverse them.
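Here’s a toy counting classifier (hypothetical names throughout) trained on a bigger dataset where every spurious feature now varies at both locations. No single one of them “means” a location anymore, but with finite photos each still leans one way by accident of sampling, and the predator can stage all of the residual leans at once.

```python
from collections import Counter, defaultdict

def train(photos):
    """Count how often each (feature, value) pair co-occurs with each location."""
    counts = defaultdict(Counter)
    for features, location in photos:
        for key, value in features.items():
            counts[(key, value)][location] += 1
    return counts

def predict(counts, features):
    """Score each location by summing co-occurrence counts for every feature."""
    scores = Counter()
    for key, value in features.items():
        scores.update(counts[(key, value)])
    return scores.most_common(1)[0][0]

def make_photos(location, mountain, spurious, n=12):
    """n photos at one location: the mountain never changes, while each
    spurious feature shows its common value 8 times and its rare value 4."""
    photos = []
    for i in range(n):
        f = {"mountain": mountain}
        for name, (common, rare) in spurious.items():
            f[name] = rare if i % 3 == 0 else common
        photos.append((f, location))
    return photos

# Hypothetical spurious features, each now merely *leaning* one way per location.
west_spurious = {"tire": ("left", "right"), "ornament": ("dice", "none"),
                 "antenna": ("bare", "flag"), "mud": ("caked", "clean")}
east_spurious = {k: (rare, common) for k, (common, rare) in west_spurious.items()}

photos = (make_photos("west", "flat ridge", west_spurious)
          + make_photos("east", "sharp peak", east_spurious))
model = train(photos)

# No single spurious feature decides the call anymore, but each still leans
# 8-to-4. A western photo staged with every rare value at once reverses all
# the residual correlations together and outvotes the mountain.
predator = {"mountain": "flat ridge", "tire": "right", "ornament": "none",
            "antenna": "flag", "mud": "clean"}
print(predict(model, predator))  # still "east", despite the mountain
```

Each feature’s vote has shrunk from 10-to-0 to 8-to-4, so the meta really is more robust — but the residue is still there to be exploited, and only the full infinite permutation table would make it vanish.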

This is why even ultra-powerful AIs can be easily snowed by dedicated adversaries. Physically existing and trying to move stuff gives us strong indications of which correlations are extremely difficult to change, which are dead-simple to change, which are linked by social convention but could be reversed by a sufficiently eccentric person, and which are linked now but could change dramatically later. I am not saying these distinctions are impossible for AIs to ever learn; I am saying that learning them necessarily diminishes their predictive power over the data that exists today, because it entails throwing away the parts of the meta that won’t survive change. So an AI that is trying to optimize its score does so exactly by considering every single bit of the meta, and thus becomes more susceptible to predator models. It doesn’t need to survive change, so it isn’t trying to. The strategy to build a living thing that will endure is altogether different.