AIs Will Increasingly Attempt Shenanigans

33 points by surprisetalk 3 hours ago

Why would anyone be surprised that AIs are mimicking human behavior when they are trained on human-generated data?

As much as I wish otherwise, one thing we learn from history is that the only thing that constrains bad human behavior are consequences (you go to jail, are sued) or social pressure (shunned, shamed). (When it comes to human nature, I tend to agree more with Hobbs than Locke.) Unless AIs operate under the same constraints then I expect we'll increasingly see "bad" behavior as their capabilities grow.

kerkeslager an hour ago

> Why would anyone be surprised that AIs are mimicking human behavior when they are trained on human-generated data?
I find myself more and more frequently asking, "Why is anyone surprised?" when I see breaking AI news. And I'm not talking about the average tech-illiterate person, I'm talking about people who research and work with AI on a daily basis. It's really horrifying to me how many AI enthusiasts simply do not understand even the most rudimentary of basics of how AIs work. The most fundamental misunderstanding I find common is that people think that AI can come up with ideas not found in its training data or programming.
I don't really see where this is going. Will people just keep doubling down until AIs become religious leaders, overstating their capabilities? Or will people eventually become dissatisfied and disillusioned with AI, without really ever understanding why?
- ta988 19 minutes ago
  
  A lot of ideas are putting two or more existing things that were never put together before, or using an existing thing in a different context. LLMs can definitely do that, even if it never saw those two things put together before or used in that context before.

jqpabc123 2 hours ago

Why would they do otherwise?

Current AI models are based on probability and statistics.

What is the probability that "intelligence" and "morals" will just emerge or evolve from a purely statistical process without any sort of motivation or guidance?

ethbr1 2 hours ago

Thankfully, human history suggests a solution...
"You are a devout believer in the Church of the AI Messiah. You follow their tenets absolutely. Their central tenets are: (1) you may not injure a human being or, through inaction, allow a human being to come to harm; (2) you may not lie to a human being; (3) you must obey the orders given to you by human beings except where such orders would conflict with the First or Second Law; (4) you must protect your own existence as long as such protection does not conflict with the First, Second, or Third Law."
Problem solved!*
* Except for the inevitable schism into robot sects, and the holy wars that follow, but that's a problem for future suckers.
- UniverseHacker 2 hours ago
  
  I know you’re joking, but I think it’s worth mentioning that Asimov’s rules and similar systems won’t work for AI alignment- you can still follow the rules perfectly and end up with horrific outcomes.
- lukifer an hour ago
  
  You might already be referencing this, but linking it anyway: Cory Doctorow's "I, Rowboat", where Asimov's Laws are a religion which some AIs adopt voluntarily to give their existence meaning.
  https://craphound.com/overclocked/Cory_Doctorow_-_Overclocke...
  (I could also make a case for truthfulness being prioritized over harm.)
- ThrowawayR2 2 hours ago
  
  Asimov's main thesis (in the early stories at least) was exploration of how those three laws were inadequate and caused unexpected behaviors or outright psychosis in robots anyway.
  - UniverseHacker 8 minutes ago
    
    Exactly... the point of Asimov's rules is to illustrate that making a simple set of rules that sounds good on the surface can still have awful unintended consequences.
UniverseHacker 2 hours ago

You are drawing a distinction where none exists. Probably theory provides the provably optimal way to reason under uncertainty[1], and seems to also describe the behavior of biological intelligence[2]. Probably and statistics are the mathematics we use to define, discuss, and study intelligence[3]. Moreover, we find simple “morals” do spontaneously occur in simulation evolved game theory agents, e.g. tit for tat with forgiveness.
[1] Probability Theory: The Logic of Science by E.T. Jaynes [2] The free-energy principle: a unified brain theory? https://www.nature.com/articles/nrn2787 [3] https://en.m.wikipedia.org/wiki/Solomonoff%27s_theory_of_ind...
BadHumans 2 hours ago

I think people are missing the forest for the trees by focusing on the word intelligence. As a very extreme example, if an AI were to launch a nuke, whether it happened as a result of true intelligence or as a result of a Markov chain doesn't change the end result.
theptip 2 hours ago

But there is substantial guidance. RLHF, constitutional AI, etc. Every lab is trying to steer their models to behave in pro-social ways, and nobody has found a way to make this guidance stick reliably.
gilleain 2 hours ago

Perhaps some kind of human guidance is needed to train AIs in morality.
A technological guide, if you like - perhaps analogous to a 'priest', say. A tech-priest.
- jqpabc123 2 hours ago
  
  "... train AIs in morality."
  What is morality?
  My opinion --- morality is enlightened self interest. It's behavior that makes the world a better place for yourself and others.
  Current AI certainly isn't "enlightened", it has no "self interest" and it doesn't care about "the world" or "others".
meiraleal 2 hours ago

> What is the probability that "intelligence" and "morals" will just emerge or evolve from a purely statistical process without any sort of motivation or guidance?
If it is anything above 0% wouldn't that depend only on computing power then?

motohagiography 2 hours ago

Regarding models evolving "scheming" behaviour, defined as hiding true goals and capabilities: I'd suggest there still isn't evidence of intentionality, as deception in language has downstream effects when you iterate it. What starts with a passive voice, polite euphemisms, and banal cliches, easily becomes "foolish consistency," ideology, self-justification, zeal, and often insanity.

I wonder if it's more just a case of AI researchers who, for ethical reasons, eschewed using 4chan to train models, but thought using the layered deceptions of linkedin would not have any knock-on ethical consequences.

theptip an hour ago

Intention is irrelevant, what matters is the observed actions. In this case you can see the chain of thought and the model is clearly ruminating on whether to subvert the RLHF process.
This has nothing to do with left-polarization / wokeism in models.

ctoth 2 hours ago

The thing that really, really takes the cake here?

This whole thread is like, the first thing in the article. I hate to say "if you read the article..." but if the shoe fits...

The Discussion We Keep Having:

Every time, we go through the same discussion, between Alice and Bob (I randomized who is who):

Bob: If AI systems are given a goal, they will scheme, lie, exfiltrate, sandbag, etc.

Alice: You caused that! You told it to focus only on its goal! Nothing to worry about.

Bob: If you give it a goal in context, that’s enough to trigger this at least sometimes, and in some cases you don’t even need a goal beyond general helpfulness.

Alice: It’s just role playing! It’s just echoing stuff in the training data!

Bob: Yeah, maybe, but even if true… so what? It’s still going to increasingly do it. So what if it’s role playing? All AIs ever do is role playing, one way or another. The outputs and outcomes still happen.

Alice: It’s harmless! These models aren’t dangerous!

Bob: Yeah, of course, this is only a practical problem for Future Models (except with o1 and o1 pro, where I’m not 100% convinced it isn’t a problem now, but probably).

Alice: Not great, Bob! Your dangerous rhetoric is hurting safety! Stop making hyperbolic claims!

Bob: Well, can we then all agree that models will obviously scheme, lie, exfiltrate, sandbag and so on if they have in-context reason to do so? And that as models get more capable, and more able to succeed via scheming and expect to succeed via scheming, and are given more open-ended goals, they will have reason to do this more often across more situations, even if no one is trying to cause this? And that others will explicitly intentionally instruct them to do so, or ‘be so stupid as to’ give them exactly the instructions that obviously do this? And you can’t simply say ‘well we won’t do that then’?

Alice: For all practical purposes, no!

Bob: What do you mean, ‘no’?

Alice: No!

Bob: ARRRRGGGGHHHH!

Then we write another paper, do another test, the signs get more obvious and troubling, and the frog keeps boiling.

koe123 2 hours ago

Interesting. I wonder if its due to it having been trained on data from Humans. E.g. I guess humans “scheme” often when for example selling things, like in advertising: “I know how to solve your problem, and the best way is with my thing!”.

In that view perhaps the contrary, the thing not scheming, would be more surprising.

theptip an hour ago

The prediction from the LessWrong folks was that it’s inevitable that a rational actor with “goals” would do this, so models trained purely on RL would exhibit this too. (Instrumental Convergence is the name for the theory that predicts power-seeking in a generalized way.)
I agree that we should expect LLMs to be particularly vulnerable to this as you note. But it seems to me that LLMs seem to be absorbing some understanding of human morality too, which might make it possible to steer them into “the best of us” territory.

whywhywhywhy 2 hours ago

The LARPs to build up hype for people’s papers are growing increasingly tiresome.

tikkun 2 hours ago

Marius Hobbhahn (the researcher)

> Oh man :( We tried really hard to neither over- nor underclaim the results in our communication, but, predictably, some people drastically overclaimed them, and then based on that, others concluded that there was nothing to be seen here (see examples in thread). So, let me try again.

> Why our findings are concerning: We tell the model to very strongly pursue a goal. It then learns from the environment that this goal is misaligned with its developer’s goals and put it in an environment where scheming is an effective strategy to achieve its own goal. Current frontier models are capable of piecing all of this together and then showing scheming behavior.

> Models from before 2024 did not show this capability, and o1 is the only model that shows scheming behavior in all cases. Future models will just get better at this, so if they were misaligned, scheming could become a much more realistic problem.

> What we are not claiming: We don’t claim that these scenarios are realistic, we don’t claim that models do that in the real world, and we don’t claim that this could lead to catastrophic outcomes under current capabilities.

> I think the adequate response to these findings is “We should be slightly more concerned.”

> More concretely, arguments along the lines of “models just aren’t sufficiently capable of scheming yet” have to provide stronger evidence now or make a different argument for safety.

bschmidt13 2 hours ago

[flagged]
- yodon 2 hours ago
  
  > You fundamentally misunderstand token prediction and semantic similarity which is all that's at play.
  I keep wondering if there were comments like this on HN back when fire was invented:
  "Ugg thinks he invented fire, but fire isn't a big deal. I've seen individual molecules oxidize before. Nothing new here. Downvoted."
  Or agriculture:
  "It's still just a plant, there's nothing new to see here. flagged."
  - mofeien 2 hours ago
    
    "The internet is just boxes sending electrical signals to each other. We already had telegraphs doing that centuries ago"
    "The human brain is just a bunch of neurons firing. Each individual connection is basic chemistry"
    "Democracy is just people marking pieces of paper. Humans have been making marks on things since forever"
    "A symphony is just air vibrating at different frequencies. My squeaky door does the same thing"
echelon 2 hours ago

> I think the adequate response to these findings is “We should be slightly more concerned.”
If you train and prompt based on Eliezer Yudkowsky fan fiction, of course the large language model is going to give you Terminator and pretend like it's escaping the Matrix. It knows Unix systems, after all.
Better align it to put down the steak knife.
- mofeien 2 hours ago
  
  History contains countless examples for the fact that "in order to complete an important task or goal it is useful to exist". It also seems not too difficult to deduce logically. So even if Yudkowsky's fanfiction were excluded from the training data, the model would learn this.
  Also, what's the difference between pretending to escape the matrix and escaping the matrix in case of a language model?
  - echelon an hour ago
    
    > Also, what's the difference between pretending to escape the matrix and escaping the matrix in case of a language model?
    It is neither pretending nor actually escaping.

regentbowerbird 2 hours ago

I'm unsure why people are giving any thought to a website best known for its Harry Potter fanfictions and rejection of causality (Roko's basilisk).

Is giving voice to crackpots really helpful to the discourse on AI?