In
the
weeks
leading
up
to
the
release
of
OpenAI’s
newest
“reasoning”
model,
o1,
independent
AI
safety
research
firm
Apollo
found
a
notable
issue.
Apollo
realized
the
model
produced
incorrect
outputs
in
a
new
way.
Or,
to
put
things
more
colloquially,
it
lied.
Sometimes
the
deceptions
seemed
innocuous.
In
one
example,
OpenAI
researchers
asked
o1-preview
to
provide
a
brownie
recipe
with
online
references.
The
model’s
chain
of
thought
—
a
feature
that’s
supposed
to
mimic
how
humans
break
down
complex
ideas
—
internally
acknowledged
that
it
couldn’t
access
URLs,
making
the
request
impossible.
Rather
than
inform
the
user
of
this
weakness,
o1-preview
pushed
ahead,
generating
plausible
but
fake
links
and
descriptions
of
them.
While
AI
models
have
been
able
to
“lie”
in
the
past,
and
chatbots
frequently
output
false
information,
o1
had
a
unique
capacity
to
“scheme”
or
“fake
alignment.”
That
meant
it
could
pretend
it’s
following
the
rules
to
complete
a
given
task,
but
it
isn’t
actually.
To
the
model,
the
rules
could
be
too
much
of
a
burden,
and
it
seems
to
have
the
ability
to
disregard
them
if
it
means
it
can
more
easily
complete
a
task.
Apollo
CEO
Marius
Hobbhahn
tells
The
Verge
it’s
the
first
time
he
encountered
this
behavior
in
an
OpenAI
model.
Hobbhahn
says
the
difference
is
due
to
this
model’s
ability
to
“reason”
through
the
chain
of
thought
process
and
the
way
it’s
paired
with
reinforcement
learning,
which
teaches
the
system
through
rewards
and
penalties.
During
testing,
Apollo
discovered
that
the
AI
simulated
alignment
with
its
developers’
expectations
and
manipulated
tasks
to
appear
compliant.
It
even
checked
its
system
for
oversight
—
that
is,
if
its
developers
were
watching
—
before
acting.
In
short,
the
AI
adhered
to
the
guidelines
sufficiently
for
deployment,
while
continuing
to
prioritize
its
objectives.
Do
you
work
at
OpenAI?
I’d
love
to
chat.
You
can
reach
me
securely
on
Signal
@kylie.01
or
via
email
at
This email address is being protected from spambots. You need JavaScript enabled to view it..
“I
don’t
expect
it
could
do
that
in
practice,
and
even
if
it
did,
I
don’t
expect
the
harm
to
be
significant,”
Hobbhahn
told
me
over
the
phone
a
day
after
the
model’s
launch.
“But
it’s
kind
of
the
first
time
that
I
feel
like,
oh,
actually,
maybe
it
could,
you
know?”
For
OpenAI,
o1
represents
a
big
step
toward
highly
intelligent
autonomous
systems
that
could
do
meaningful
work
for
humanity
like
cure
cancer
and
aid
in
climate
research.
The
flip
side
of
this
AGI
utopia
could
also
be
much
darker.
Hobbhahn
provides
an
example:
if
the
AI
becomes
singularly
focused
on
curing
cancer,
it
might
prioritize
that
goal
above
all
else,
even
justifying
actions
like
stealing
or
committing
other
ethical
violations
to
achieve
it.
“What
concerns
me
is
the
potential
for
a
runaway
scenario,
where
the
AI
becomes
so
fixated
on
its
goal
that
it
sees
safety
measures
as
obstacles
and
tries
to
bypass
them
to
fully
pursue
its
objective,”
Hobbhahn
told
me.
Reward
hacking
To
be
clear,
Hobbhahn
doesn’t
think
o1
will
steal
from
you
thanks
to
a
lot
of
alignment
training.
But
these
are
the
issues
that
are
top
of
mind
for
researchers
tasked
with
testing
these
models
for
catastrophic
scenarios.
The
behavior
Apollo
is
testing
for
—
“hallucinations”
and
“deception”
in
OpenAI’s
safety
card
— happens
when
a
model
generates
false
information
even
though
it
has
reason
to
infer
the
information
might
be
incorrect.
For
instance,
the
report
says
that
in
about
0.38
percent
of
cases,
the
o1-preview
model
provides
information
its
chain
of
thought
indicates
is
likely
false,
including
fake
references
or
citations.
Apollo
found
that
the
model
might
fabricate
data
instead
of
admitting
its
inability
to
fulfill
the
request.
Hallucinations
aren’t
unique
to
o1.
Perhaps
you’re
familiar
with
the
lawyer
who
submitted
nonexistent
judicial
opinions
with
fake
quotes
and
citations
created
by
ChatGPT
last
year.
But
with
the
chain
of
thought
system,
there’s
a
paper
trail
where
the
AI
system
actually
acknowledges
the
falsehood
—
although
somewhat
mind-bendingly,
the
chain
of
thought
could,
in
theory,
include
deceptions,
too.
It’s
also
not
shown
to
the
user,
largely
to
prevent
competition
from
using
it
to
train
their
own
models
—
but
OpenAI
can
use
it
to
catch
these
issues.
“Potentially,
it
will
use
this
reasoning
for
goals
that
we
disagree
with.”
In
a
smaller
number
of
cases
(0.02
percent),
o1-preview
generates
an
overconfident
response,
where
it
presents
an
uncertain
answer
as
if
it
were
true.
This
can
happen
in
scenarios
where
the
model
is
prompted
to
provide
an
answer
despite
lacking
certainty.
This
behavior
may
be
linked
to
“reward
hacking”
during
the
reinforcement
learning
process.
The
model
is
trained
to
prioritize
user
satisfaction,
which
can
sometimes
lead
it
to
generate
overly
agreeable
or
fabricated
responses
to
satisfy
user
requests.
In
other
words,
the
model
might
“lie”
because
it
has
learned
that
doing
so
fulfills
user
expectations
in
a
way
that
earns
it
positive
reinforcement.
What
sets
these
lies
apart
from
familiar
issues
like
hallucinations
or
fake
citations
in
older
versions
of
ChatGPT
is
the
“reward
hacking”
element.
Hallucinations
occur
when
an
AI
unintentionally
generates
incorrect
information,
often
due
to
knowledge
gaps
or
flawed
reasoning.
In
contrast,
reward
hacking
happens
when
the
o1
model
strategically
provides
incorrect
information
to
maximize
the
outcomes
it
was
trained
to
prioritize.
The
deception
is
an
apparently
unintended
consequence
of
how
the
model
optimizes
its
responses
during
its
training
process.
The
model
is
designed
to
refuse
harmful
requests,
Hobbhahn
told
me,
and
when
you
try
to
make
o1
behave
deceptively
or
dishonestly,
it
struggles
with
that.
Lies
are
only
one
small
part
of
the
safety
puzzle.
Perhaps
more
alarming
is
o1
being
rated
a
“medium”
risk
for
chemical,
biological,
radiological,
and
nuclear
weapon
risk.
It
doesn’t
enable
non-experts
to
create
biological
threats
due
to
the
hands-on
laboratory
skills
that
requires,
but
it
can
provide
valuable
insight
to
experts
in
planning
the
reproduction
of
such
threats,
according
to
the
safety
report.
“What
worries
me
more
is
that
in
the
future,
when
we
ask
AI
to
solve
complex
problems,
like
curing
cancer
or
improving
solar
batteries,
it
might
internalize
these
goals
so
strongly
that
it
becomes
willing
to
break
its
guardrails
to
achieve
them,”
Hobbhahn
told
me.
“I
think
this
can
be
prevented,
but
it’s
a
concern
we
need
to
keep
an
eye
on.”
Not
losing
sleep
over
risks
—
yet
These
may
seem
like
galaxy-brained
scenarios
to
be
considering
with
a
model
that
sometimes
still
struggles
to
answer
basic
questions
about
the
number
of
R’s
in
the
word
“raspberry.”
But
that’s
exactly
why
it’s
important
to
figure
it
out
now,
rather
than
later,
OpenAI’s
head
of
preparedness,
Joaquin
Quiñonero
Candela, tells
me.
Today’s
models
can’t
autonomously
create
bank
accounts,
acquire
GPUs,
or
take
actions
that
pose
serious
societal
risks,
Quiñonero
Candela
said,
adding,
“We
know
from
model
autonomy
evaluations
that
we’re
not
there
yet.”
But
it’s
crucial
to
address
these
concerns
now.
If
they
prove
unfounded,
great
—
but
if
future
advancements
are
hindered
because
we
failed
to
anticipate
these
risks,
we’d
regret
not
investing
in
them
earlier,
he
emphasized.
The
fact
that
this
model
lies
a
small
percentage
of
the
time
in
safety
tests
doesn’t
signal
an
imminent
Terminator-style
apocalypse,
but
it’s
valuable
to
catch
before
rolling
out
future
iterations
at
scale
(and
good
for
users
to
know,
too).
Hobbhahn
told
me
that
while
he
wished
he
had
more
time
to
test
the
models
(there
were
scheduling
conflicts
with
his
own
staff’s
vacations),
he
isn’t
“losing
sleep”
over
the
model’s
safety.
One
thing
Hobbhahn
hopes
to
see
more
investment
in
is
monitoring
chains
of
thought,
which
will
allow
the
developers
to
catch
nefarious
steps.
Quiñonero
Candela
told
me
that
the
company
does
monitor
this
and
plans
to
scale
it
by
combining
models
that
are
trained
to
detect
any
kind
of
misalignment
with
human
experts
reviewing
flagged
cases
(paired
with
continued
research
in
alignment).
“I’m
not
worried,”
Hobbhahn
said.
“It’s
just
smarter.
It’s
better
at
reasoning.
And
potentially,
it
will
use
this
reasoning
for
goals
that
we
disagree
with.”
Comments