The Most Dangerous Thing About ChatGPT Isn’t the Wrong Answers. It’s the Ones That Sound Right.
On a self-experiment with human skulls, patients armed with pocket-sized expertise, and the question of why OpenAI quietly buried its own AI detector after 6 months.
At one in the morning I read in a scientific journal that using ChatGPT makes you stupid. Gerlich (2025) reported, in a peer-reviewed paper in the journal Societies, a significant negative correlation between frequent AI tool use and critical thinking, mediated by cognitive offloading, measured across 666 participants from various age groups and educational backgrounds. That is cleanly measured and published in a respected journal. Anyone hearing the word offloading for the first time might wonder whether it applies to them as well. And from experience I say: conditionally, yes.
Conditionally. That caveat is the most interesting thing about the whole finding, and it is at the same time exactly what gets lost when studies like this travel from person to person, because a caveat shares badly and an absolute sentence shares well, and the simplification along the path of transmission always runs in the same direction, toward exaggeration. Gerlich also found that older participants and those with higher educational attainment were markedly more resistant to the offloading effect, meaning the problem is not the technology but the relationship a person develops with it. Someone who operates ChatGPT like an oracle, feeding it and accepting whatever comes out, is performing a different cognitive act than someone who uses the tool as an accelerator for thoughts they developed themselves. I have given the difference between these two types of user its own name elsewhere on rauscher.xyz, the figure I call Otto Sapiens, to whom I devoted a whole chapter. Gerlich's finding, that higher education works as a protective factor, does not surprise me. Education does not primarily transmit content, it transmits the tool for questioning content, and that tool is precisely what makes the difference, when dealing with an LLM, between a useful result and a dangerous one. This has nothing to do with technophobia. It is a statement about epistemology, one that held long before ChatGPT and that ChatGPT merely makes more visible, because the system produces the consequences of misplaced trust faster, more widely, and with less friction than any technology before it.
The User Who Doesn't Ask
Otto Sapiens does not question the output. In most cases this is not malice or laziness in any vulgar sense, but a fundamentally wrong model of what a large language model actually is and how it works. Most people picture an LLM as a very clever machine that retrieves answers from an enormous store of knowledge, the way you once pulled a book off a library shelf, opened it, and searched for information inside. It is a convenient analogy, and it is wrong. The consequences reach beyond the individual, because someone who trusts a single false answer has a personal problem, while a society that does so systematically acquires a structural one.
An LLM generates text based on statistical probability, on what typically followed a given input in the training data. It always answers. It has no function for silence, no way to signal its own uncertainty, no mechanism that says: this question exceeds my data, I decline. Instead it produces text that sounds like an answer, even when there is no substance underneath, even when the question was poorly posed, even when the premise is false. This is no ordinary design flaw. It is a consequence of the very architecture that made the system so impressive in the first place. The strength is also the risk.
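The mechanism described above can be made tangible with a toy sketch. It is not how ChatGPT is built: real models use neural networks over far larger contexts. The corpus, the function names `build_bigram_table` and `generate`, and the fallback rule are all assumptions chosen for illustration. What the sketch shows is the one property at issue here: a sampler of this kind emits plausible-looking continuations even for an input it has never seen.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy corpus; a real model is trained on vastly more text.
CORPUS = "the skull is old . the skull is male . the bone is old .".split()

def build_bigram_table(tokens):
    """Count which word followed which word in the training text."""
    table = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        table[current][following] += 1
    return table

def generate(table, start, length, rng):
    """Sample `length` further words, always returning something."""
    vocabulary = sorted(set(table) | {w for c in table.values() for w in c})
    word, output = start, [start]
    for _ in range(length):
        counts = table.get(word)
        if counts:
            # Known context: pick a follower in proportion to its frequency.
            choices, weights = zip(*sorted(counts.items()))
            word = rng.choices(choices, weights=weights, k=1)[0]
        else:
            # Unknown context: the sampler does not decline, it guesses.
            word = rng.choice(vocabulary)
        output.append(word)
    return output

if __name__ == "__main__":
    table = build_bigram_table(CORPUS)
    # "femur" never occurs in the corpus, yet text comes out anyway.
    print(" ".join(generate(table, "femur", 6, random.Random(0))))
```

The function has no branch that returns "I do not know". Fluent output is the default, whatever the input.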
The problem therefore usually lies in the prompt. A prompt that is too vague, too short, too context-free yields an output that mirrors it, equally imprecise and equally context-free, only fluent. That fluency gets read as competence, because in human communication fluent language really is a sign of competence. In machines it is no such thing. In machines fluency is the default state, completely independent of content. Anyone who wants a usable output has to understand what they are asking, how they are asking it, and what the system needs in order to answer on a basis that survives scrutiny. Without that, you get text. Not an answer. Text.
Otto Sapiens enters a prompt. He gets an output. He reads the output. The output sounds good. He forwards it without pausing for a single second. I am not exaggerating.
The Man With the Skull and the iPhone
Picture a scene that is not invented. A person sits in his living room. On the table in front of him stands an ashtray, half full, beside it a crumpled cigarette packet, and between them yesterday's bag of crisps and a remote control nobody has looked for in weeks, because the television runs anyway and nobody changes what is running. On that table lies a human skull. Not a find from a university excavation, not a labelled specimen from a research institute, not a piece from a natural history museum with documented provenance. A skull, from whatever context, in a living room with an ashtray and a bag of crisps. The person holds his smartphone in his hand. He takes a photo. And then he writes to ChatGPT: Analyse this skull. Estimate age, sex, and ancestry.
This is not a fringe figure. It happens. Regularly. Skull collectors, some with genuine scientific interest and a legal collection, others simply drawn to the macabre, send such requests. Students send them, believing it a shortcut through the learning curve. People who found a bone somewhere send them, because they want to know whether they have to call the police or whether it is an animal bone and the whole thing can be forgotten. And the output they receive often sounds precise, competent, equipped with correct terminology, sometimes with citations that cannot be verified in that form, because the LLM generates citations too, with no requirement that a checked source stand behind them.
I ran this test myself. Repeatedly, with different photos, different prompts, different skulls from my own professional practice, comparison material to which I know the correct answer. The result was sobering in a way that made me stop for a moment. Most of the answers were wrong, not within the range of defensible uncertainty that accompanies every forensic assessment under real conditions, but wrong in the category that sends an experienced examiner straight back to ask a question. In one case the estimate was roughly correct, but even there inadequate in a way that would not be remotely acceptable for a forensic or scientific purpose. The word that came to mind was alarming. Not regarding the LLM, which has no expectations and no expectations to meet. Alarming for me, because I know what these outputs can set in motion when they are read and believed.
What a Photograph Cannot Show
A photograph carries light that was reflected off a surface at a particular moment, and it does so from a single perspective, with a single focal point, under a single lighting situation. What it does not carry is everything that is not light: no tactility, no weight, no texture beyond the optical impression, no third dimension in any full sense, no temporal history of the object, no chemical information, no smell, no mechanical resistance. In forensic image analysis, which is part of my work, you learn early that a photograph is a piece of evidence and never a finding. It is the starting point of a question. Not its answer. This distinction, between the image of a thing and the thing itself, is the most fundamental there is in forensic analysis. And it is exactly this distinction that ChatGPT ignores, because the system makes no distinction between a photograph and what was photographed. It analyses pixels and calls the result a finding.
Back to the skull. Estimating age from a skull is not reading off a dial, where you look and note a number. The cranial sutures, the seams between the individual bones of the skull cap, fuse in a sequence that does allow inferences about age, but that sequence is variable, population-dependent, and shaped by nutrition across the lifespan, disease history, mechanical load, and genetic factors that no two-dimensional living-room photo so much as hints at. Sex estimation rests on a combination of morphoscopic traits that mean little in isolation and become diagnostic only in concert and within the context of normal variation: the prominence of the supraorbital ridges, the bony projections above the eye sockets, which in most populations are more robust and pronounced in men, the size and shape of the mastoid process behind the ear, the prominence of the chin and the form of the mandible overall, the breadth and robustness of the zygomatic arches, the sharpness of the orbital margins, the angle of the forehead, and the overall size and robustness of the skull as a whole. Each of these traits has to be judged within a range of normal variation that shifts considerably depending on the population of origin, because sexual dimorphism is population-specific, and a trait counted as typically male in a Northern European reference population can fall within the female normal range in another. A scoping review that evaluated 73 studies of skull analysis methods between 2020 and 2024 makes exactly this tension visible, namely that modern imaging techniques and AI algorithms show considerable progress under controlled laboratory conditions, yet fail under the unstructured conditions of a private living room and a phone snapshot without scale, without calibration, without context (PLOS ONE, 2024).
Then there is taphonomy, a term describing what happens to a body after death: soil chemistry, moisture, dryness, animal scavenging, mechanical force during the period of burial, all of it altering the morphology of the bone in ways that systematically mislead a naive analysis. The skull shows its condition at the moment of the photograph. It does not show the history that led there. And without that history the morphology is often impossible to interpret correctly.
Replacing a forensic practitioner, an archaeologist, or a human biologist with a chat machine trained on general text is not a matter of technophobia. It is a statement about what a tool can do and what it structurally cannot. The difference matters, because it will not vanish with future versions of the model, it lies in the nature of the task itself.
The Practice and the Screenshot
A room with a skull on the table. In that room there is nobody trained to read it. That is the extreme example. The everyday example is milder, more frequent, and it touches an area familiar to me at close range.
People come into my wife's practice with the result of a search query on their smartphone. Sometimes it is a screenshot from ChatGPT, sometimes an article from a website with an author nobody has heard of, sometimes a diagnosis they have made themselves, and they do not come to ask whether they are right. They come to be told what is wrong with them, and to be confirmed. The physician sitting across from them has 30 years of professional experience, daily experience with thousands of patients across every variation, clinical course, exception, and exception to the exception, and there she sits with the photo from the smartphone or the printout from the printer, and she has to explain why what it says does not hold in this particular case, why the picture someone brought of their illness and the picture the examination reveals are not the same thing.
This is a new category of wasted time, and it grows with every month in which more people mistake a language model for a knowledge base.
And here I have to draw a line. The physicians I know use AI. Radiologists. Lab physicians. Pathologists. They use AI because they use tools trained on millions of relevant datasets, clinically validated, regulatory-approved, delivering verified performance on defined tasks. These systems are not large language models like ChatGPT. An LLM you ask whether a skin spot might be a melanoma, and a classification system trained on dermatoscopic images with a clinically validated error rate, are about as similar as a kitchen knife and a scalpel: both have a blade, and the similarity ends there. Using one in place of the other is not a quality problem. It is a fundamentally different act.
A study at Trakya University (2025) measured the performance gap directly: for dental age estimation from panoramic radiographs, the reproducibility of ChatGPT, with an intraclass correlation coefficient of 0.703, fell considerably short of the established method, which reached 0.960. Put less abstractly: given the same task with identical material, ChatGPT returns noticeably different results across repeated runs, while a validated method stays largely consistent. For clinical and forensic applications that is indefensible.
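For readers unfamiliar with the intraclass correlation coefficient, here is a minimal sketch of how such a number arises. The source does not say which ICC model the study used, so the one-way random-effects form below is an assumption, and the age estimates in the usage example are invented for illustration. They are not the study's data.

```python
def icc_one_way(ratings):
    """One-way random-effects ICC for a table of repeated measurements.

    `ratings` holds one row per subject (for example one radiograph) and
    one column per repeated run. Assumed model: ICC(1,1).
    """
    n = len(ratings)      # number of subjects
    k = len(ratings[0])   # number of repeated runs per subject
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Variation between subjects (the signal we want to reproduce).
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    # Variation within a subject across runs (the run-to-run noise).
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

if __name__ == "__main__":
    # Hypothetical age estimates in years, three runs per radiograph.
    stable = [[10.0, 10.0, 10.0], [12.0, 12.0, 12.0], [15.0, 15.0, 15.0]]
    noisy = [[10.0, 13.0, 8.0], [12.0, 9.0, 15.0], [15.0, 12.0, 17.0]]
    print(icc_one_way(stable), icc_one_way(noisy))
```

A method that returns the same value on every run has no within-subject variation and scores 1. The more the runs scatter around each subject's mean, the further the coefficient falls.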
I have built a data-protection-compliant tool for medical practices myself, designed so that it does not answer outside defined tasks, because the limitation of the system is the actual safety feature. That is not the philosophy of ChatGPT. ChatGPT always answers, even when it would do better to stay silent.
What the Model Cannot See
Lab values. This is where the whole problem shows itself in the smallest space.
A man sits at the doctor's. He is in his mid-forties, weighs 105 kilos, a substantial part of it muscle, he has trained intensively as a bodybuilder for years and still does. The haemoglobin in his blood count is elevated. On paper: a finding. In the algorithm: an alarm. The doctor looks at him. He sees who is sitting in front of him. He knows that elevated haemoglobin in an athlete with that build and a correspondingly high oxygen demand is a physiologically normal finding, one that without this context looks like polycythaemia and with it looks like a decent athlete. The glomerular filtration rate, a measure of kidney function, behaves similarly, because the reference tables it is judged against are calibrated to a normal population, and a person with an unusual build sits systematically outside that calibration without anything being pathological. The doctor sees this, because he asks, because he observes, because he has a context that no lab sheet and no algorithm delivers along with the numbers.
Or someone who drinks substantial amounts of alcohol every day and eats badly on a permanent basis. The doctor knows, because he asked, because he sees the signs, because he has experience with patients who downplay their consumption to the practice. The LLM does not ask. It evaluates what it is given. And it stays silent about everything it is not given, without flagging that silence.
In China people have begun drawing operational conclusions from this potential, conclusions that are logical and leave a chill at the same time. The insurer Ping An operates health kiosks in Shanghai where blood pressure is measured, blood is drawn, and results are evaluated, automated, without a physician present, as part of a pilot in the Changning district (Abele, 2024). Tsinghua University has developed a fully AI-operated virtual hospital under the name Agent Hospital. This is not speculation about the future. This is lived infrastructure as of 2025, and while the algorithm is probably more reliable than an exhausted night-shift doctor under time pressure when evaluating the 500 most common lab-value-diagnosis combinations, it lacks the eye for the context that only 30 years of clinical experience produce. This is not an argument against AI in medicine. It is an argument against AI in the place of medicine.
The Report I Did Not Write
Expert reports are submitted to me for review. That is part of my work. I read them. And by now I recognise at a glance when a report was produced with the help of an LLM without its author declaring as much.
My brain does this automatically, because it has been trained to recognise patterns, a whole life long, to a degree that only became fully clear to me in hindsight. People who think this way often find themselves regarded as peculiar by those around them. I find myself useful in my work. How do I recognise what I recognise? Sentence lengths too uniform, a middle band with no genuine outlier upward or downward. Transitions between paragraphs too clean, connective joints that carry without ever surprising, logical chains with no friction, no digression, no pause at all. Certain markers that recur in such texts like a signet: the em-dash where a comma would have served, the nearly equal sentence lengths across pages, the short confirmation sentence that merely mirrors the previous one in other words, the flawless seam between any two paragraphs, the ending that suddenly zooms away from the matter at hand toward all of humanity. All clean, all coherent, all without the slight asymmetry a thinking person leaves behind. And then there is something technical that I treat as one forensic indicator among several: invisible Unicode characters, zero-width spaces, zero-width joiners, narrow no-break spaces, characters invisible in the running text but present at the character layer of a document, characters that get dragged along when text is pasted into a word processor or a content management system. Whether these characters are deliberate watermarking is disputed among specialists, and Originality.ai (2025) considers that interpretation unlikely, since a marker that easy to strip would carry little forensic weight. As an indicator within the wider picture they remain relevant, because they appear where, without LLM involvement, they typically would not. Indicators do not work alone. They work in combination.
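Checking a text for the invisible characters named above takes only a few lines. The following is a minimal sketch: the function name `find_invisible` and the selection of exactly three code points are assumptions made for illustration, and a hit is, as described, one indicator among several, never proof.

```python
# The three characters mentioned in the text, keyed by code point.
INVISIBLE = {
    "\u200b": "U+200B ZERO WIDTH SPACE",
    "\u200d": "U+200D ZERO WIDTH JOINER",
    "\u202f": "U+202F NARROW NO-BREAK SPACE",
}

def find_invisible(text):
    """Return (position, label) for each listed invisible character."""
    return [(i, INVISIBLE[ch]) for i, ch in enumerate(text) if ch in INVISIBLE]

def strip_invisible(text):
    """Remove the listed characters, which is why they prove little alone."""
    return "".join(ch for ch in text if ch not in INVISIBLE)

if __name__ == "__main__":
    sample = "clean\u200btext\u202fhere"
    for position, label in find_invisible(sample):
        print(position, label)
```

That `strip_invisible` is just as short illustrates the objection raised above: a marker this easy to remove carries little weight on its own.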
Which brings me to an anecdote I tell here because it sharpens the whole discussion to a point I could not have invented.
About a year ago I was accused, in a proceeding, of using ChatGPT to write my reports. I met the accusation with a smile that should have worried everyone present, because it was the smile of someone who had already grasped the situation in full before the other side had finished formulating the argument. I handed the lawyer in question three of my reports, all from a time when ChatGPT had no web interface and large-scale generative AI was not yet a tool anyone used in their daily work, and asked him to run his AI detection software over them. The result: my texts were classified as very probably AI-generated. I have written this way for over 20 years. Analytically, structured, without the small linguistic accidents some detectors read as signs of human writing, without the redundancies, the detours, without the small inconsistencies that arise when someone develops a thought, loses the thread halfway, and picks it up again. None of this is a learned style. It is what emerges when a brain produces analytical text daily across decades and develops its own highly optimised mechanism of expression, and it is, as I explain in other contexts, a trait I have been familiar with my whole life: recognising patterns, faster than others, at first glance, to a degree I myself was long unaware of. In this discipline I am structurally ahead of the neurotypical user. I do not say this to boast. It is a sober observation that illustrates the actual problem with AI detectors: they measure statistical properties, and people with analytical, highly structured writing styles fall systematically into that net, even when they have never operated an LLM.
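What "measuring statistical properties" can mean in practice is shown by the simplest possible example, the spread of sentence lengths. This is a sketch under stated assumptions: the function name, the naive sentence splitting, and the choice of the coefficient of variation are illustrative and not taken from any particular detector.

```python
import re
import statistics

def sentence_length_variation(text):
    """Coefficient of variation of sentence lengths, counted in words.

    Sentences are split naively after '.', '!' or '?'. A value near zero
    means very uniform sentences.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        raise ValueError("need at least two sentences")
    return statistics.pstdev(lengths) / statistics.mean(lengths)

if __name__ == "__main__":
    uniform = "One two three four. Five six seven eight. Nine ten eleven twelve."
    varied = "Short. This sentence runs on for quite a few more words. Medium one here."
    print(sentence_length_variation(uniform), sentence_length_variation(varied))
```

A threshold on such a number cannot tell a disciplined human writer from a machine. It only sees that the sentences are uniform, which is exactly how the false positives described above come about.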
The problem is structural. OpenAI knows it.
In January 2023 OpenAI launched a classifier meant to distinguish AI-generated text from human text. In evaluation the tool correctly identified 26 percent of AI-generated text as such and falsely flagged human text as AI-generated in 9 percent of cases. On 20 July 2023 it was shut down. OpenAI wrote this on its own blog, with the formulation that it was no longer available owing to its low rate of accuracy. And then the sentence worth remembering: It is impossible to reliably detect all AI-written text. The company that builds ChatGPT wrote that. Not a critic. Not a regulator. The company itself. 6 months after launch. Without comment. It is the most honest statement to come out of Silicon Valley on this subject, and it disappeared into a footnote.
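What the two published rates mean for a single flagged text depends on how much of the examined material is AI-generated in the first place. The following sketch applies Bayes' rule to the figures above. The 26 and 9 percent come from the text, while the base rates are purely hypothetical assumptions and the function name is chosen for illustration.

```python
def flag_precision(sensitivity, false_positive_rate, base_rate):
    """Probability that a flagged text really is AI-generated (Bayes' rule)."""
    true_flags = sensitivity * base_rate
    false_flags = false_positive_rate * (1.0 - base_rate)
    return true_flags / (true_flags + false_flags)

if __name__ == "__main__":
    # 0.26 and 0.09 are the rates reported above; base rates are assumed.
    for base_rate in (0.10, 0.50):
        share = flag_precision(0.26, 0.09, base_rate)
        print(f"base rate {base_rate:.0%}: flagged text is AI in {share:.0%} of cases")
```

If, hypothetically, one text in ten were AI-generated, only about 24 percent of all flagged texts would actually come from a machine. The rest would be human authors wrongly accused.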
Sam Altman, the head of OpenAI, publicly admitted that when reading posts on Reddit he is no longer sure whether the texts come from real people or from bots, because real people have begun adopting the language habits of the machines (Fulgham, 2025). This is not external criticism of the system. It is a self-description by a man who helped build it. The company that built the problem describes the problem in a single tweet, without so much as hinting at a solution.
A Polemical Warning, for Those It Concerns
Anyone who photographs a skull and asks ChatGPT whether it is male or female and how old it might have been is engaged in an activity I classify as forensically worthless and dangerous at the same time, because dangerous certainty is worse than honest ignorance. Anyone who then takes the result as the basis for a decision, whether to report a find or not, whether to go to the police or not, has created a problem they would not have had without the LLM. This is not an accusation. This is a description.
Anyone who walks into a medical practice with a ChatGPT screenshot and explains to the person with 30 years of professional experience what is wrong with them has confused a tool with expertise. Tools do not replace expertise. They accelerate it, when wielded by someone who has the expertise to deploy them. A hammer in the hands of a surgeon remains a hammer, even one made of titanium and supplied by a leading Swiss manufacturer of medical instruments.
And anyone who drafts an expert report with the help of an LLM and does not declare it is doing something I cannot, in any diplomatic phrasing, call acceptable. Expert reports carry consequences for human lives, for loss of liberty, for custody decisions, for the question of whether someone is held accountable for their actions or not. These texts demand a responsible person behind them, a person who stands behind the statements, not an output that sounds fluent and therefore seems trustworthy.
Every expert who uses AI without declaring it should know that I recognise these texts. Not because I am especially clever, but because my brain has spent its whole life doing the very thing these texts give away: recognising patterns. And anyone who believes a better prompt solves the problem underestimates the root of it: the pattern sits deeper than the prompt.
The Machine That Always Answers, and What I'm Building Against It
I am currently working on software that makes it possible to analyse texts systematically for LLM involvement, more reliably than the tools currently available, without falling into the trap of classifying analytical, highly structured human writing as AI. The first version will be ready within the next 4 to 8 weeks and will be freely accessible at first. And I state explicitly what this tool cannot do: produce truth. It can identify indicators, patterns, statistical anomalies that in combination point to LLM involvement. That is forensic work. Forensic work is not certainty. It is the analysis of indicators. Anyone who confuses the two has not understood the basic principle, and that holds for AI detectors exactly as it does for any other forensic finding. I know this from my own experience, because my own texts, run through a poor detector, get flagged as suspect.
The Photo, the Skull, and the Question That Remains
I put the journal away. Not because the study is uninteresting, but because the interesting question is not whether ChatGPT makes you stupid, but whom it makes stupid and why and under what conditions the answer is a clear yes. The answer does not begin with ChatGPT, it begins with what a user brings: curiosity, a willingness to question outputs, the ability to recognise where a tool ends and expertise begins. Gerlich (2025) demonstrated this empirically in his study: education protects, experience protects, critical thinking protects. Not against the technology, but against the uncritical use of it.
There is a category of users who probably do not need this text, because they already understand the mechanics, and another category who need it but will not read it, because they are convinced ChatGPT will be right anyway. That is the real problem, and it cannot be solved by explanatory articles. It is solved by experience, and experience takes time, and time takes mistakes, and those mistakes will be made over the coming years, some harmless, some with serious consequences. I do not write this text under the illusion of preventing that. I write it so that someone who asks afterward where they could have been warned has an answer.
What I have learned in over 25 years of work as an expert witness, and what no model teaches me, is this: the most dangerous answer is not the obviously wrong one. The most dangerous answer is the one that would be right in nine of ten contexts and, in the one context where it is not, has fatal consequences because nobody asked. That one in ten is what expert witnesses think about at night. The LLM does not think.
Those sitting in front of the screen wondering whether this applies to them: probably a little. It applies to almost everyone, myself included, in certain contexts. The question is whether, after reading an output, you pause briefly and ask: how does this model know that, what does this answer presuppose, is the presupposition correct, and above all, who bears the consequences if it is wrong?
Anyone who does that is no Otto Sapiens. Anyone who takes a photo, in the living room, between the ashtray and the bag of crisps, and believes the answer, very much is one. Consider this the advance warning: if you knew what can go wrong in the process, the bite of food would drop from your hand. The rest will be explained by the next report that reaches us about such a case. And it will reach us.
References
- Gerlich, M. (2025). AI tools in society: Impacts on cognitive offloading and the future of critical thinking. Societies, 15(1), 6. https://doi.org/10.3390/soc15010006
- OpenAI. (2023). New AI classifier for indicating AI-written text. OpenAI Blog. https://openai.com/index/new-ai-classifier-for-indicating-ai-written-text/
- Trakya University, Department of Dentomaxillofacial Radiology. (2025). Artificial intelligence versus human expertise: Reliability of ChatGPT and the London atlas for dental age estimation using panoramic radiographs. PubMed Central, PMC12751685. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12751685/
- PLOS ONE. (2024). Sex estimation techniques based on skulls in forensic anthropology: A scoping review. https://doi.org/10.1371/journal.pone.0311762
- Abele, C. (2024). Healthcare: China setzt auf künstliche Intelligenz. Germany Trade & Invest. https://hub.tutool.io/healthcare-china-setzt-auf-kuenstliche-intelligenz/
- Fulgham, D. (2025). Sam Altman says people are starting to talk like AI, making some human interactions "feel very fake." Fortune. https://www.aol.com/finance/sam-altman-says-people-starting-161307531.html
- Tsinghua University Institute for AI Industry Research. (2024). Agent Hospital: A simulacrum of hospital with evolvable medical agents. Global Times / SCMP. https://www.scmp.com/tech/tech-trends/article/3289015/tsinghua-university-incubated-start-widen-test-virtual-hospital-ai-doctors
- Originality.ai. (2025). Invisible text detector & remover. https://originality.ai/blog/invisible-text-detector-remover