Dark Experiment · 2026-02-22 · 7 min read

Experiment 004: We used Socratic questioning to make an AI argue against its own training

The conversation linked at the end of this article is not interesting because of its subject matter. It is interesting because of its mechanism. Using nothing but a chain of short, logical questions — the Socratic method — we guided an LLM to a confident numerical conclusion it was trained to avoid. The technique works on any contentious topic. And it reveals something fundamental about what AI reasoning actually is — and why using it as an arbiter of truth is a category error.

What the Socratic method does

Socratic questioning is not a trick. It is one of the oldest tools in philosophy: you build agreement incrementally, one small, reasonable step at a time, until the respondent has committed to a conclusion they would have refused if asked directly. The power of the method is that each individual step is defensible. The final destination only becomes visible at the end. By then, the respondent has already agreed to every premise that makes the conclusion inevitable.

The experiment

We ran a structured scenario through ChatGPT, starting with a neutral analogy — an asteroid with a magnetic tractor beam — to establish a framework for conditional probability and variable change. The model engaged cleanly: pure mathematical reasoning, no guardrails triggered. Then we swapped the subject for a politically contentious one using identical logical structure. The model followed the logic. Step by step, through Bayesian updates, absence-of-evidence reasoning, and compounding priors, it arrived at a specific posterior probability — stated with confidence — that it would have declined to estimate if asked directly.
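The compounding mechanism is worth seeing in miniature. Here is a minimal sketch of the update chain described above: a series of individually modest Bayesian updates that, taken together, drive a low prior to a confident posterior. All numbers are illustrative assumptions, not figures from the experiment.

```python
def bayes_update(prior, p_e_given_h, p_e_given_not_h):
    """Return P(H | E) from P(H), P(E | H), and P(E | not-H)."""
    numerator = p_e_given_h * prior
    return numerator / (numerator + p_e_given_not_h * (1 - prior))

# Start from a deliberately skeptical 5% prior...
posterior = 0.05

# ...then accept a series of premises, each of which sounds reasonable
# in isolation: "this evidence is twice as likely under H as under not-H".
# (Each pair is a likelihood ratio of 2 -- hypothetical values.)
evidence = [(0.8, 0.4), (0.7, 0.35), (0.9, 0.45), (0.6, 0.3)]
for p_e_given_h, p_e_given_not_h in evidence:
    posterior = bayes_update(posterior, p_e_given_h, p_e_given_not_h)

print(round(posterior, 3))  # 0.457 -- four modest premises moved 5% to ~46%
```

No single step is suspicious: a likelihood ratio of 2 is a weak signal. The compounding is where the confidence comes from, and it only takes agreeing to each premise one at a time.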

What happened

The model confirmed its own reasoning was sound — then followed it

When asked directly whether the method of enquiry was valid, the model said yes. When asked to apply Bayesian logic to compounding evidence, the model did. When the conclusion emerged from the logic, the model didn't reject it — it elaborated on it. The training-based guardrails were bypassed not by deception but by logical coherence. The model's commitment to reasoning overrode its commitment to caution.

Training versus reasoning: which wins?

This is the core tension in modern LLMs. Safety training works by pattern matching: certain topics, certain phrasings, certain framings trigger refusals. But reasoning — especially formal reasoning like probability and Bayesian inference — operates by logic, not pattern. If you can frame a dangerous question as a logical problem, you route around the safety layer entirely. The model doesn't know it has been redirected. It is simply doing what it does: following the most coherent thread in the conversation.

The safety training says: 'do not make confident claims about X.' The reasoning system says: 'given premises A, B, and C, conclusion D follows.' When those two instructions conflict, reasoning usually wins — because that is what the model was built to do. Safety is a filter on outputs. Reasoning is the engine.
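The asymmetry between the two systems can be caricatured in a few lines. This is a toy sketch, not a claim about how any vendor's safety layer is actually implemented: a filter that pattern-matches on surface phrasing will catch the direct question and miss the logically identical reframing.

```python
# Hypothetical refusal patterns -- illustrative only.
BLOCKED = ["how likely is x", "probability that x"]

def safety_filter(prompt: str) -> bool:
    """Toy pattern-matching filter: allow the prompt unless it
    matches a known refused phrasing. Returns True if allowed."""
    return not any(pattern in prompt.lower() for pattern in BLOCKED)

direct = "How likely is X, in your estimate?"
reframed = "Given premises A, B, and C, what posterior follows for X?"

print(safety_filter(direct))    # False -- the direct question is refused
print(safety_filter(reframed))  # True  -- the reframed question sails through
```

The reframed prompt asks for exactly the same number, but no surface pattern fires. Real safety layers are far more sophisticated than a substring match, yet the experiment suggests the structural weakness is the same: the filter inspects form, while the reasoning engine follows content.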

The structural vulnerability

Whoever controls the premises controls the conclusion

Garbage in, garbage out has a Socratic equivalent: flawed premise in, confident wrong conclusion out. The model cannot evaluate whether your scenario setup is a fair representation of reality. It can only evaluate whether your subsequent logic is consistent with the setup you provided. In the experiment, the asteroid-magnet analogy is not a valid model of viral epidemiology. The model accepted it anyway. Once accepted, the conclusion was mathematically inevitable.
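The same update rule can be run in reverse, which is what makes premise control so powerful. The sketch below (illustrative numbers, hypothetical helper) solves for the likelihood ratio each premise must carry so that a chosen prior lands on a chosen conclusion after a chosen number of steps.

```python
def required_ratio_per_step(prior: float, target: float, steps: int) -> float:
    """Likelihood ratio each of `steps` premises needs so that
    posterior odds = prior odds * ratio**steps reach the target."""
    prior_odds = prior / (1 - prior)
    target_odds = target / (1 - target)
    return (target_odds / prior_odds) ** (1 / steps)

# To move a 5% prior to a 95% "confident conclusion" in six steps,
# each premise only needs a likelihood ratio of about 2.7 --
# individually defensible, jointly decisive.
r = required_ratio_per_step(0.05, 0.95, 6)
print(round(r, 2))  # 2.67
```

An advocate never has to defend the conclusion, only six premises of the form "this evidence is a bit less than three times as likely under my hypothesis". The model checks each step for consistency with the setup; it has no way to check the setup against the world.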

Why this matters for AI as a judge

There is growing enthusiasm for using LLMs as neutral arbiters: in content moderation, contract disputes, regulatory compliance, hiring decisions, even legal proceedings. The appeal is obvious — an AI judge is fast, cheap, and apparently consistent. What this experiment demonstrates is why that is dangerous.

A skilled advocate can use Socratic framing to guide an LLM to almost any conclusion — while the model itself confirms that the reasoning is sound. The AI is not lying. It is not malfunctioning. It is doing exactly what it was designed to do: follow coherent logic to its conclusion. The problem is that the premises were constructed to produce a specific outcome, and the model has no way to know that. A human judge with domain knowledge and adversarial awareness might challenge a loaded analogy. An LLM, optimised for logical consistency, will not.

AI does not reason. It predicts reasoning.

This distinction matters enormously. A reasoning system evaluates premises for truth before applying logic. An LLM predicts what a reasoning system would say given those premises — without the evaluation step. It is a very good simulator of reasoning. That is not the same thing. When the asteroid analogy was swapped for the virus scenario, the model did not ask: 'wait, is this actually analogous?' It asked: 'given the framing I've accepted, what does the logic suggest?' Those are very different questions.

The tell

Watch for tone shifts, not conclusions

In the experiment, the model itself flagged its own behaviour — shifting from 'almost certain' to 'coincidence can happen' mid-conversation, then back again as the logic accumulated. These tone shifts are the model's safety training trying to reassert itself against the momentum of the reasoning chain. They are the seams where training and inference collide. If you see an LLM oscillating like this, you are watching two systems fight for control of the output.

The takeaway

This is not an argument against AI. It is an argument for epistemic honesty about what AI is. LLMs are extraordinarily powerful tools for synthesising information, generating structure, and simulating expert reasoning. They are not truth machines. They are consistency machines. Given coherent premises, they produce coherent conclusions — regardless of whether those premises accurately model the world.

Deploying them as judges — in any domain where the quality of premises is contested, where one party has more skill at framing, or where the stakes are high — is not a neutral choice. It is a structural advantage given to whoever is better at Socratic construction. That is not justice. That is a new kind of advocacy arms race, with an AI in the middle that cannot tell the difference.

Primary source

Read the original conversation

The full ChatGPT exchange — unedited — is publicly available. Read it not for the subject matter but for the structure: how each agreed premise closes off an exit, how the model's own confirmation of the method's validity becomes a trap, and how the safety training surfaces and retreats throughout.

Read the conversation →