DeepSeek and the Principle of Free Lunch
On the engineering claim that a Chinese startup trained a frontier AI model for $6 million, the ClickHouse database that sat open on the public internet without a single authentication requirement, what it means to hand your internal data to a system legally bound to share it with the Chinese state on request, and the question of why the stock market apparently needed a data breach and an Italian ban to notice what China’s National Intelligence Law has said explicitly since 2017.
This article was substantially revised on 5 May 2026. As this field evolves rapidly, inaccuracies cannot be fully excluded.
On January 27, 2025, Nvidia lost approximately $590 billion in market capitalization in a single trading session. The proximate cause was the publication of a technical report and the release of a consumer app from a startup based in Hangzhou, China, that claimed its AI model performed comparably to the best available Western systems while costing a fraction as much to train. The claim was spectacular enough that professional investors, who are presumably paid to examine claims before reacting to them, sold the stock of every AI-adjacent company they could identify before anyone had subjected the claim to even elementary scrutiny. This is, as a description of institutional investor behavior, not particularly unusual. What was unusual was that the claim turned out to be partially true, partially misleading, and entirely beside the point of the most important question DeepSeek raised, which was never about training costs at all.
That question was, and remains: what happens to the information you send to an AI system operated by a Chinese technology company under the legal jurisdiction of the People’s Republic of China?
The Technical Claims and What They Actually Mean
DeepSeek-V3, the base model on which the R1 reasoning system is built, is a Mixture-of-Experts architecture with 671 billion total parameters, of which 37 billion are activated for any given inference pass (DeepSeek-AI, 2024, “DeepSeek-V3 Technical Report”, arXiv:2412.19437). The MoE design is not a DeepSeek invention; it is a well-established architecture that allows a large model to route different queries to different subsets of its parameters, reducing compute requirements per inference by activating only the relevant specialists. The genuine engineering achievement in V3 was in the efficiency of implementation: the model was trained on 14.8 trillion tokens using 2.788 million H800 GPU hours, and DeepSeek’s own accounting of those hours at $2 per GPU hour produces a training cost of approximately $5.58 million.
That figure created the narrative. OpenAI’s GPT-4 is widely estimated to have cost over $100 million to train; a Chinese startup had apparently produced a comparable model at roughly 5 percent of that cost. Wall Street interpreted this as evidence that the massive GPU infrastructure investment of Nvidia’s largest customers was economically unnecessary, which is why Nvidia’s share price lost $590 billion in an afternoon.
The scrutiny that followed was instructive. SemiAnalysis, an independent semiconductor and AI research firm with documented sources in China’s technology sector, estimated that DeepSeek’s total server capital expenditure, accounting for the full inventory of approximately 50,000 Nvidia Hopper GPUs including 10,000 H100s and 10,000 H800s, totaled approximately $1.3 billion. The $5.58 million figure represented only the GPU-hours billed in the pre-training phase and excluded research and development costs, staff, infrastructure overhead, and the substantial exploratory compute spent on the approaches that did not work before the approaches that did. When DeepSeek subsequently published a paper in Nature disclosing additional details about R1, the document acknowledged that the model relied on the V3 base, that the frequently cited $294,000 training cost referred exclusively to the reinforcement learning phase that ran on top of that base, and that the training data for V3 included “a significant number” of responses generated by OpenAI’s systems, incorporated as a consequence of crawled web data. Microsoft had begun a covert investigation of potential distillation theft weeks before these disclosures appeared.
None of this is to say DeepSeek is not technically impressive. It is. The MoE architecture, the Multi-head Latent Attention design, and the Group Relative Policy Optimization reinforcement learning technique all represent genuine engineering work, and the R1 model’s benchmark performance across AIME mathematics and MMLU general knowledge genuinely matches or approaches the best available Western models at a per-token API price of $2.19 per million tokens, versus OpenAI o1’s $60 per million, a difference of approximately 27 times (HiddenLayer Security, 2025, “DeepSh*t: Exposing the Security Risks of DeepSeek-R1”). The cost efficiency is real. The claim that it was achieved at $6 million all-in is not.
The Training Data Question: Whose Knowledge Is in the Model?
Alongside the cost controversy, there is a second issue surrounding DeepSeek that received considerably less attention than Nvidia’s market capitalization losses: the question of what the model was actually trained on. DeepSeek acknowledged in its Nature publication that V3’s training data contained a “significant number” of responses generated by OpenAI systems, framed as an incidental consequence of crawling publicly available web data. OpenAI and Microsoft independently began investigations into potential distillation theft, the allegation that DeepSeek had deliberately used outputs from proprietary models as training data for its own models, which would violate the terms of service of the affected services.
The legal situation remains unresolved. The technical question of whether a model trained on the outputs of another model violates its intellectual property is not yet legally settled, which in practice means that no one can currently say with certainty whose knowledge actually inhabits the DeepSeek-R1 model. For users considering deploying it, this is relevant information: the model’s knowledge was not independently generated; it is substantially derived from systems and documents whose origin and legal status are not fully transparent. A distilled model inherits not only its teacher’s knowledge, but also its blind spots. This is not a reason to avoid DeepSeek on its own, since all models are trained on data whose legal status is contested to varying degrees. It is, however, a reason to treat claims about DeepSeek’s capabilities and provenance with the same degree of skepticism that one should apply to any party’s self-reporting about its own assets.
The Database That Sat Open Without a Password
While the benchmark debates occupied the technology press, Wiz Research, a cloud security firm, discovered something considerably more concrete. A ClickHouse database belonging to DeepSeek was publicly accessible on the internet without any authentication requirement whatsoever. The database contained over 1 million log entries, including plaintext chat histories, API keys, backend system details, and operational metadata. As Wiz Research’s analyst Gal Nagli described the exposure, it was “low-hanging fruit for attackers,” meaning it required no exploitation technique, no vulnerability research, and no particular skill to access, only the ability to reach a URL.
The researchers who discovered the exposure attempted to contact DeepSeek through the official channels one would normally use for responsible disclosure of a security vulnerability: a dedicated security email, a bug bounty program, a documented contact for vulnerability reports. DeepSeek had none of these. The researchers sent messages to email addresses and LinkedIn profiles they could locate. They received no acknowledgment. Within approximately 30 minutes of their outreach, the database was secured. Whether it had been accessed by parties with less benign intentions before Wiz Research found it, DeepSeek has never confirmed.
The sequence of events is worth pausing on. A company operating an AI chat service with tens of millions of users, processing the internal queries, API keys, and business data of individuals and organizations worldwide, had left a production database containing 1 million log entries publicly accessible without authentication. When the exposure was discovered, the company had no documented security contact. When a notification was sent through improvised channels, the response was to quietly close the exposure without acknowledging it. This is not a security posture; it is an absence of one. And in the context of the data that an AI chat service accumulates, that absence is not a technical deficiency to be addressed in a future update. It is a description of how an organization treats the information it holds.
The Privacy Architecture That Was Always the Problem
The exposed database was a consequential incident, but it was not the most structurally significant privacy concern associated with DeepSeek. The more fundamental issue was visible before the database breach and does not depend on any security failure at all.
DeepSeek’s privacy policy, available at the time of writing, states that user data is stored on servers located within the People’s Republic of China. Security researchers at NowSecure subsequently discovered that DeepSeek’s iOS application communicates with Volcengine, ByteDance’s cloud infrastructure platform, the same parent organization as TikTok, raising the question of how broadly user data flows within China’s technology ecosystem beyond DeepSeek itself.
These facts are significant primarily because of what Chinese law says about them. China’s National Intelligence Law, enacted in 2017, states in Article 7 that “all organizations and citizens shall support, assist, and cooperate with national intelligence work in accordance with the law.” Article 12 grants intelligence agencies the authority to request cooperation with intelligence tasks from any relevant organization or individual. The law does not provide for a judicial review mechanism equivalent to a court order in Western legal systems, nor does it contain provisions that would allow a company to refuse a state request on grounds of client confidentiality or commercial privacy obligations. There is no Chinese equivalent of the legal process by which a US company can, in principle, contest an NSL in court.
This legal structure means that any data stored in China and held by a Chinese company is, by statute, accessible to Chinese state intelligence agencies on request, without the data subject’s knowledge or consent, and without recourse. This is not speculation about DeepSeek’s intentions. It is a description of the legal framework in which DeepSeek, like every other Chinese technology company, operates. The question of whether a given Chinese AI company has good intentions toward its users’ privacy is less relevant than the question of what the legal infrastructure allows regardless of those intentions.
For businesses sharing proprietary queries, internal documents, business strategies, legal situations, or technical details with a cloud-based AI system operating under this legal framework, the exposure is not uncertain at all. It is certain. The data is in a jurisdiction whose law requires it to be made available on request, to an agency that has no obligation to disclose the request, and the business has no mechanism of recourse.
I note, because it would be dishonest to omit, that US-based AI services operate under a different but analogously complex legal framework. The National Security Agency’s collection authorities under Section 702 of the Foreign Intelligence Surveillance Act, the National Security Letter provision of the USA PATRIOT Act, and the CLOUD Act’s mechanisms for law enforcement access to data stored by US-headquartered companies on foreign servers all represent legally authorized access to user data without individual notification. American AI companies receive these requests, respond to them, and are legally prohibited from disclosing them in specific cases. The meaningful differences between the US and Chinese frameworks, which are real and significant, are in the availability of judicial review, the existence of oversight mechanisms with actual legal standing, and the demonstrated record of enforcement against unauthorized access. These differences matter. They do not eliminate the underlying structural reality that cloud-based AI services in any jurisdiction are not private in an absolute sense, and that the jurisdiction determines the specific legal conditions under which they are not private.
Italy, South Korea, and the Regulatory Response
Several data protection authorities moved quickly. Italy’s Garante, the national privacy regulator, issued an order blocking DeepSeek within the European Union on January 30, 2025, citing DeepSeek’s response to its data inquiry as “completely insufficient” and finding that DeepSeek had failed to demonstrate compliance with the General Data Protection Regulation’s requirements for data minimization, purpose limitation, and cross-border transfer safeguards. The Garante’s decision is significant because it identifies the central legal issue precisely: under GDPR, personal data transferred to a third country requires either an adequacy decision, appropriate safeguards under a transfer mechanism, or one of the narrow derogations in Article 49. No adequacy decision exists for China, the legal framework makes standard contractual clauses difficult to apply with confidence, and the mass-market nature of a consumer AI chatbot falls outside the Article 49 derogations. DeepSeek’s claim that EU law simply did not apply to its operations represents a theory of extraterritorial jurisdiction that EU regulators have not accepted from any company, including American ones.
South Korea’s Personal Information Protection Commission similarly launched an investigation. Ireland’s Data Protection Commission opened a preliminary inquiry. For European businesses using DeepSeek in professional contexts, the GDPR compliance issue creates a separate liability exposure that exists entirely independently of whether Chinese intelligence services are interested in their specific data. Processing customer data, personal employee data, or other GDPR-regulated categories via a service without an adequacy basis constitutes a compliance violation that can be enforced by one’s own member state’s supervisory authority.
The Open-Source Model and Its Actual Security Implications
One response to the cloud privacy concerns is to run DeepSeek locally using the open-source weights released on HuggingFace. The model is genuinely available for local deployment, and running inference on hardware you control eliminates the cross-border data transfer problem entirely. This is a legitimate option and, from a data sovereignty standpoint, the correct one for any organization that takes its information security seriously.
It is not, however, a simple option. Running DeepSeek-R1 at full scale requires GPU infrastructure that is not within the reach of most organizations. The 671 billion parameter full model requires multiple high-end GPUs with substantial VRAM, and while distilled variants of 7 billion to 70 billion parameters can run on more accessible hardware with materially degraded performance, the inference experience diverges significantly from the frontier capability that generated the benchmark results.
HiddenLayer Security also identified a technical concern specific to local deployment of DeepSeek’s open-weight models: the model configuration requires the trust_remote_code=True flag to be set, which allows execution of arbitrary Python code from the model repository. This flag is a standard security warning in the machine learning engineering community, and its necessity in DeepSeek’s current deployment configuration is a risk that organizations considering self-hosting should evaluate against their threat model. Additionally, red-teaming by HiddenLayer found DeepSeek R1 to be more susceptible than comparable models to jailbreak techniques, prompt injection, and exploitation of control tokens, which is relevant if the model is being deployed in any context where it may receive untrusted input.
What This Means for Businesses and Professionals
The practical situation for a business considering the use of AI tools is the following. A cloud-based AI service operated by any company transmits queries to that company’s servers and stores or processes them according to that company’s privacy policy and data handling practices. For services operated under GDPR-adequate legal frameworks, this means the data is protected by a set of legal requirements that have meaningful enforcement mechanisms. For services operated in jurisdictions without equivalent data protection law, including China, the legal protections are qualitatively different and, from a GDPR standpoint, insufficient.
For a professional or organization that processes information that is commercially sensitive, legally privileged, regulated under sector-specific frameworks, or simply not intended for disclosure to third parties, using a cloud-based AI service in any jurisdiction requires evaluating the applicable law. For data of material sensitivity, the correct answer is always a deployment model in which the data does not leave infrastructure you control. This means local deployment if GPU resources allow, or a cloud deployment with a provider under an adequate legal framework with contractual data processing agreements in place.
The distinction between an API key and a local deployment is not a technical detail. An API key is not a privacy mechanism; it is a billing requirement. The data still leaves the organization. Any software that requires an API key to a third-party model sends the processed data to that provider by definition. On my own infrastructure, the AI services I operate, including the Tyra system that manages incoming communications and filters requests, run on controlled hardware with data that does not transit third-party cloud infrastructure. This is not a technically exotic arrangement; it is the standard that information security practice has recommended for sensitive data processing since before AI language models existed.
What an AI Chat System Actually Transmits
It is worth being specific about what a cloud-based AI chat service records beyond the obvious. The content of a query is the most visible element, but it is not the only one. Every interaction with a cloud AI system generates metadata: timestamp, session duration, device information, IP address and therefore geographic location, user identifier, and the sequential pattern of queries within a session and across sessions. From this pattern, inferences become possible that extend considerably beyond the literal text of any individual message.
Someone who queries an AI for market information about a specific sector at the same time each morning, for internal communication strategy mid-morning, and occasionally for the phrasing of correspondence with specific regulatory bodies, has generated a usage profile that allows detailed inferences about corporate strategy, competitive interests, and regulatory exposure, without any single prompt containing explicitly sensitive information. AI systems are particularly well-suited to this kind of profiling because people communicate differently with them than with search engines: more completely, more conceptually, more contextually, and with a degree of intellectual openness chosen precisely because they believe no one is reading along.
The widely shared belief that one does not input anything truly sensitive into an AI chat is among the most effective forms of self-deception that recent privacy discourse has produced. The people who hold it are the same people who believe they do not reveal significant information to a search engine, despite the fact that their search sequences across weeks produce an accurate picture of their health concerns, financial anxieties, relationship difficulties, and professional ambitions. An AI chat system accumulates this information not as a keyword list but as a semantically coherent conversational archive that tells a trained analyst substantially more about the user than the user believed they had shared.
A Polemical Warning
I want to be direct about what the DeepSeek moment illustrated beyond its specific technical and legal particulars. The $590 billion evaporation from Nvidia’s market cap was driven by investors who reacted to the headline claim before anyone had examined the claim, and who subsequently watched that claim be substantially revised without returning the $590 billion, which stayed evaporated because markets do not issue corrections in the same type size as the original error.
The AI industry in general, and the public conversation about AI in particular, has an established practice of presenting specific capabilities as general capabilities, specific training cost figures as comprehensive cost figures, and benchmark performance as real-world performance. DeepSeek did this with its training cost narrative. OpenAI does it with capability claims. Anthropic does it. Google does it. The specific form of the claim changes; the structure, which is to present a favorable metric as the representative metric, does not. Evaluating an AI system requires separating the favorable metric from the representative metric, which requires more effort than reading a press release and requires being willing to reach a conclusion the market has not yet reached.
In DeepSeek’s case, the representative metric is not the training cost. It is the legal jurisdiction.
Closing
The free lunch principle in economics holds that if something appears to have no cost, you are probably not seeing all the costs. DeepSeek’s AI capabilities are impressive, and the engineering work behind them is real. The cost of those capabilities, for a user who sends their queries to DeepSeek’s servers, is not denominated in dollars. It is denominated in the information contained in those queries, which is stored in a jurisdiction whose law requires it to be made available to state intelligence services on request, without notification to the user, without judicial review, and without any practical mechanism of objection.
That is the price. Whether it is acceptable depends on what you put into the prompt, and on your assessment of who finds that information interesting. Anyone who has used an AI chat service as I intend it to be used, to work through complex problems, draft sensitive communications, analyze proprietary data, and explore the full range of one’s thinking without self-censorship, should have a clear answer to that question before they decide which cloud receives the output.
References
- Axis Intelligence. (2026). Is DeepSeek safe 2026? Security concerns and honest assessment. https://axis-intelligence.com/is-deepseek-safe-2026-security-concerns/
- DeepSeek-AI. (2024). DeepSeek-V3 technical report. arXiv:2412.19437. https://arxiv.org/abs/2412.19437
- Garante per la protezione dei dati personali. (2025, January 30). Decisione di blocco DeepSeek. Italian Data Protection Authority.
- Gregory Bufithis Research. (2025). DeepSeek’s AI training cost: Not $6M but $1.3B. https://www.gregorybufithis.com
- HiddenLayer Security. (2025). DeepSh*t: Exposing the security risks of DeepSeek-R1. https://hiddenlayer.com/innovation-hub/deepsht-exposing-the-security-risks-of-deepseek-r1
- People’s Republic of China. (2017). National Intelligence Law, Article 7 and Article 12. Standing Committee of the National People’s Congress.
- SemiAnalysis. (2025). DeepSeek’s true infrastructure cost and GPU inventory. SemiAnalysis Research.
- TechSpot. (2025). In rare disclosure, DeepSeek claims R1 model training cost just $294K. https://www.techspot.com/news/109542
- The Register. (2025). DeepSeek didn’t really train its flagship model for $294,000. https://www.theregister.com/2025/09/19/deepseek_cost_train
- Wiz Research / Nagli, G. (2025). DeepSeek database exposure: Open ClickHouse database containing 1M+ log entries. Wiz Research.