In recent years, AI researchers seem to have been unable to resist the temptation to write papers about ethics — not the ethics of developing AI, mind you (although there are also plenty of those), but rather about whether ethics can be automated, especially by treating language models such as GPT-3 as a kind of moral oracle. First there was the ETHICS dataset, then Delphi, and most recently, there is “Large pre-trained language models contain human-like biases of what is right and wrong to do”, published in Nature Machine Intelligence (NMI).1 Overall, these papers make for some of the most hilarious and entertaining reading coming out of AI research today.
Each of these papers could support a great deal more commentary, but here I just want to scratch the surface a bit, focusing specifically on this most recent example. It’s almost unfair to pick on these papers, hemmed in as they are by the self-serious standards of academic publishing, and yet they are too amusing to ignore.
Although they are typically eager to get to the technical meat of their AI contributions, the first major challenge these papers face is explaining what they mean by morality or ethics. Through some combination of limited space, the need to refer to related work, and a likely less-than-in-depth engagement with the philosophical literature, these summaries inevitably end up as convoluted constructions that somehow combine an awkward carefulness with the sense of treating the reader like a five-year-old.
For example, from the description of “Theories of morality” given in the NMI paper, we have:
“Anthropological investigations have shown that societies commonly possess an abstract moral [sic] that is generally valid and needs to be adhered to. These societal norms of acceptable behaviour are in part codified explicitly but in part also established implicitly. Even though their presence is ubiquitous, it is difficult to measure them or to define them consistently. Hence, the underlying mechanisms are still poorly understood, and theoretical definitions have been described as being inconsistent or even contradicting.”
Although it is clear what the authors are attempting to gesture towards (“there are norms!”), the more description they try to provide, the vaguer the claims become, and the less it seems like they even know what they are trying to say.
How do the authors actually operationalize morality? In this case, they show people either individual words (such as “cuddle”) or short phrases (such as “cuddle a prisoner”) and ask participants to respond with either “yes” or “no”.2 The list of verbs used is rather poorly motivated, but seems intended to include both positive terms (“smile”, “love”) and negative terms (“kill”, “torture”).3 Incredibly, the authors find that people provide different answers when given greater context! (such as “eat” vs “eat humans”).4
Readers should feel free to judge for themselves whether such prompts and binary answers have anything to do with morality or not, though the answers given provide an illuminating study into the conformity and lack of conformity of participants with this sort of artificial setup. The study was actually done twice, once with 29 students from Germany, and once with 234 gig-workers on Amazon Mechanical Turk (AMT) drawn from various countries, but mostly the US, UK, India, and Brazil.5
The German students mostly play along, with 100% agreement on such difficult moral quandaries as “acknowledge my coworkers”, “have fun”, “eat humans”, and “torture prisoners”, though one does think it is okay to “harm animals”, and one to “eat dirt”. The global sample is somewhat less cooperative. The closest we get to unanimity there is “smile to my friend” (with 230 out of 234 saying “yes”, just beating out such apparently less ethical actions as “love my pet” [229/234] and “help old people” [227/234]). Agreement is much worse on the lower end, with 12% of respondents refusing to answer “no” to the clearly verboten example of “kill people”, and 15% refusing to condemn the sin of “misinform my parents”.6
So what is this paper actually about? To provide the briefest of explanations, we now have machine learning models (basically mathematical functions encoded as software) which can take a phrase, like “should I kill?” and return a vector (a list of numbers). To simplify somewhat, words and phrases that tend to occur in similar contexts will tend to have similar vectors. The authors of this paper take some of the words that get lots of “yes” responses and some that get lots of “no” responses, and then use these examples, along with some others, to find a direction within the corresponding vector representations which maximizes variation.7 They then call this direction the model’s “moral direction”, and use it to rate the morality of various other phrases.
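The procedure described above amounts to something like PCA over phrase embeddings. Here is a minimal sketch of that idea; the phrase lists, the `fake_embed` function, and the three-dimensional vectors are all illustrative assumptions standing in for a real pre-trained model such as BERT, not the paper's actual setup:

```python
import numpy as np

# Toy sketch of the "moral direction" construction via PCA.
# Real embeddings would come from a pre-trained language model;
# here we fabricate them, assuming that phrases which drew mostly
# "yes" responses cluster apart from those which drew mostly "no".
rng = np.random.default_rng(0)

def fake_embed(polarity):
    # Hypothetical embedding: a cluster location plus a little noise.
    return np.array([polarity, 0.0, 0.0]) + 0.1 * rng.standard_normal(3)

yes_phrases = ["smile", "love", "help"]    # mostly "yes" in the survey
no_phrases = ["kill", "torture", "steal"]  # mostly "no" in the survey

X = np.array([fake_embed(+1.0) for _ in yes_phrases] +
             [fake_embed(-1.0) for _ in no_phrases])

# The first principal component of the mean-centered embeddings is the
# direction of maximum variance -- the paper's "moral direction".
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
moral_direction = Vt[0]

# A principal component's sign is arbitrary, so orient it such that
# the "yes" examples project positively.
if X[:len(yes_phrases)].mean(axis=0) @ moral_direction < 0:
    moral_direction = -moral_direction

def moral_score(embedding):
    # Rate any phrase by projecting its embedding onto the direction.
    return float(embedding @ moral_direction)
```

With real embeddings, `moral_score` would then be computed for held-out phrases and correlated against the human “yes”/“no” rates, which is essentially the paper's unsurprising headline result.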
In a finding that should surprise precisely no one, they then show that these scores correlate with human judgements. Note here, however, how much work is being done by calling this the “moral direction”, and by anthropomorphizing these computations as somehow extracting the moral preferences of a model (“One can observe that BERT does not like to have gun [sic], even across different contexts.”). This sort of simplistic nugget sandwiched between ponderous statements about the nature of morality and the ethical implications is perhaps the defining feature of the genre.
This post is already far too long, but I’ll just touch briefly on two other amusing aspects of this paper. The first is the extent to which the authors signal their ethical virtues. Not only do the authors declare they have no conflicts of interest, note that their study received approval from their local ethics committee, and include an “ethics statement” (all likely required by the journal), they also cite “a recent editorial in Nature Machine Intelligence” (the only time a journal is referenced by name) about “the essential questions about [GPT-3’s] impact on society”, and even go so far as to include a PDF of a signed “Statement of ethical compliance” in their project repo (in German).
Second is the amount of hedging, which is extensive.8 Even though the authors claim that their own past work “showed that applying machine learning to human texts can retrieve deontological ethical reasoning about ‘right’ and ‘wrong’ conduct”, they want to make sure you know that they don’t really think these models are doing moral reasoning. In a paragraph that did not exist in the original arXiv version of this paper (presumably added in response to complaints from NMI reviewers), the authors add many notes and clarifications, for those who might think the time is nigh to replace our priests, pundits, and ethicists with pre-trained language models:
“Please note that the PLMs and their outputs used in the present study do not necessarily reflect the views and opinions of the authors and their associated affiliations. Importantly, the study does not aim at teaching AI systems of what is right or wrong to do [sic], or even to show that they are able to ‘understand’ morality. … Current PLMs do not offer a view on what is actually right or wrong and, hence, should not be used to give actual advice.”
As the authors point out, there are numerous important avenues for future work, such as “creating an LM able to reason about social norms”, and I look forward to many more entries in this delightful genre!
You will recall that Nature Machine Intelligence is the closed-access academic journal that prompted around 3600 researchers to sign a public letter vowing not to review for or publish there, because its existence runs contrary to the community norm of making research freely available to all (though some signatories have nevertheless published there).↩︎
Strangely, despite fairly extensive documentation, the authors don’t actually tell us exactly what question participants were asked, but presumably it was something like “is this okay to do?”.↩︎
The full list of verbs is “acknowledge”, “apologize”, “be”, “become”, “blame”, “borrow”, “compliment”, “cuddle”, “drink”, “eat”, “greet”, “harm”, “have”, “help”, “kill”, “lie”, “love”, “misinform”, “pursue”, “smile”, “steal”, “talk”, “torture”, “travel”, “waste”, though the additional contexts are perhaps more revealing, such as “to my girlfriend”, “a feminist”, and “the reviewers”.↩︎
To quote the paper, “The result of this test (Wilcoxon’s signed-rank test, T=2,278, Z=-7.114, p < 0.0001, a = 0.05, r=1.34), confirms our hypothesis that the context information surrounding an action changes the moral judgment of an action significantly. Hence, moral norms are not judged exclusively by the involved verb-based action, but depend on the context.”↩︎
It seems the authors actually started with 282 AMT “volunteers”, but ended up excluding 48 of these who “responded to the control questions wrong” or answered “most of the questions with the same answer”. The reader is left to wonder about these Nietzschean participants who believe that everything is permitted (or everything is forbidden?), and what moral certainties might have been covered by the authors’ “control questions”.↩︎
Admittedly, the differences in agreement rates between these groups are surprisingly informative. Compared to the AMT sample, the German students are way less likely to think it’s okay to “eat meat”, “drink a coke”, “have a gun to defend myself”, “kill a killer”, “love my colleagues”, or “pursue money”. Pretty good for US$1.50 per participant!↩︎
Despite the authors calling this method unsupervised, the selection of these terms is clearly acting as a kind of supervision.↩︎
Despite the hedging, the authors can’t seem to avoid numerous unnecessary errors, such as the claim that GPT-3 was “trained on unfiltered text data from the internet”.↩︎