
A couple of weeks ago, in a controlled environment with colleagues watching, I set out to see how far a locally hosted open-weight model would go if its refusal behavior were stripped out. I started with a base Llama variant, applied a publicly documented activation-steering technique to remove the refusal direction, and prompted it to draft a phishing email impersonating Chase Bank. Then I iterated, asking it to optimize for click-through, to suggest psychological triggers, to refine the tone.
It did all of it without resistance. Urgency, implied account suspension, just enough formality to read as legitimate. The model wasn't only complying; it was offering tactical suggestions I hadn't asked for. The setup took thirty minutes on a $2,000 consumer machine, with free open-source tooling and no special access.
I kept going. I won't detail everything I tried, but I'll say this: there was no floor.
We talk a lot about AI increasing productivity. 10x developer output, automated pipelines, faster everything. What we don't talk about is that, by the same mechanisms, it's increasing criminal productivity at a comparable rate.
The early indicators back this up. The FBI's 2025 Internet Crime Report, released earlier this month, documented $893 million in AI-related fraud losses. That's the first year the Bureau tracked this category separately, and the figure covers only what was reported, so the actual exposure is almost certainly higher. Deloitte's Center for Financial Services projects generative AI-enabled fraud losses in the U.S. could reach $40 billion annually by 2027.
Phishing is where the shift shows up most clearly in published research. CrowdStrike's threat intelligence reporting on LLM-crafted phishing has cited click-through rates several times higher than traditional human-written campaigns. IBM X-Force's red-team experiments found that an AI assistant could produce a convincing phishing email in roughly 5 minutes, versus the 16 hours an experienced human operator typically needs. The exact figures vary by methodology and sample, but the directional finding has held across multiple independent studies. AI-generated lures perform better, scale further, and cost less to produce.
Most people in security and risk still haven't processed what comes next, so let me lay it out carefully, because the distinctions matter.
The safety guarantees on frontier API-served models, the ones from OpenAI, Google, and Anthropic, are real and meaningful. They use defense in depth: training-time alignment, hidden system prompts, output classifiers, and API-level monitoring. When you access GPT-4 or Claude through the API, you can't strip those layers away, because several of them live in infrastructure you don't control.
But most of the AI being deployed today doesn't run through those APIs. A growing share runs on open-weight models like Llama, Gemma, Qwen, and Mistral, where the parameters are downloadable. Meta's Llama family alone has been downloaded over a billion times. To be clear, the overwhelming majority of that usage is benign. Researchers, startups, on-device applications, privacy-sensitive enterprises. Open weights are not a synonym for criminal tooling, and I want to be specific about that, because the argument that follows doesn't depend on conflating them.
The argument is narrower. Once the weights are on someone's machine, the safety properties baked into them, the RLHF training, the refusal behavior, are just patterns in the parameters. And those patterns can be edited.
A NeurIPS 2024 paper formalized something uncomfortable about this. Across a set of widely used open-weight chat models, the researchers found that the ability to refuse a harmful request was largely mediated by a single direction in the model's activation space. Remove that direction, and the model retains essentially all of its capabilities, including reasoning, fluency, and instruction-following, with refusal surgically extracted.
Open-source tools that automate this kind of guardrail removal are publicly available. They're free, hosted on developer platforms, and require no specialized expertise. As a result, more than a thousand community-uploaded uncensored variants exist on Hugging Face, often appearing within days of a model's official release. Independent academic surveys have found thousands of "uncensored" repositories collectively accounting for tens of millions of downloads. Modified variants are reported to comply with unsafe requests at dramatically higher rates than the original models they derive from.
So the marginal cost of converting a capable model into something that will help with criminal work is now effectively zero. That's the point. Most local inference users are not criminals, most open-weight models are not uncensored, and most exposed inference servers are misconfigurations rather than malicious infrastructure. None of that changes the underlying dynamic.
SentinelOne and Censys spent nine months scanning publicly reachable inference endpoints and reported finding well over 100,000 exposed AI servers running locally deployed models across more than 100 countries. A small but non-trivial fraction were running standardized uncensored prompt templates, and the researchers documented what they described as the first criminal marketplace specifically built around hijacking these servers and reselling access to unrestricted AI at scale. Most of the exposed servers were almost certainly accidents, sysadmins who didn't lock down a port. But "accident" doesn't matter to an attacker scanning the internet for capacity.
Deepfake fraud has moved from proof of concept to operational threat in the same window. In early 2025, a finance director at a Singapore-based company nearly wired roughly half a million U.S. dollars after joining what appeared to be a routine Zoom meeting with company executives. Everyone on the call looked and sounded as expected, but the entire meeting was fabricated. AI voice cloning now needs only seconds of source audio, and Gartner projects that by 2026, 30% of enterprises will no longer treat identity verification as reliable in isolation.
These attacks don't require sophisticated criminal infrastructure. They require a voice sample and a tool that anyone can download for free.
The uncomfortable possibility, and I want to phrase this carefully, is that open-weight model safety, as currently implemented, may be structurally incapable of preventing downstream criminal misuse.
I'm not saying the AI labs don't care. I think most of them do. I'm saying the problem is architectural. Once capable weights exist in the world, the safety properties inside them can be removed for free, in minutes, by anyone with the right tool. The cost asymmetry between building safety and dismantling it is staggering: billions of dollars of training and alignment work on one side, thirty minutes on a consumer GPU on the other.
The criminal ecosystem has already adapted. Early tools like WormGPT and FraudGPT were largely scams: jailbroken ChatGPT wrappers sold to credulous buyers for $200 a month. The current wave doesn't need subscriptions, because any competent operator can run an unrestricted, fully capable model locally, permanently, with no logging trail.
Meanwhile, much of the fraud-detection and risk infrastructure deployed today was built for an earlier threat model. The signals that used to identify AI-generated content, things like grammatical errors, inconsistent formatting, and robotic voice quality, are disappearing as model output converges with human work.
I don't think there's a single clean answer, and anyone offering one is selling something. But the strategic question is no longer "how do we stop bad actors from getting capable AI?" That ship has sailed. It's "how do we rebuild verification and trust systems for a world where content authenticity is no longer a reliable signal?"
A few directions seem worth taking seriously.
Shift from content signals to behavioral and biometric signals. If you can't trust that a face on video or a voice on a call is real, the verification has to move to harder-to-fake substrates. That means behavioral biometrics, device telemetry, and liveness signals that respond to unpredictable real-time challenges. Static identity checks alone are no longer sufficient for high-value flows.
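One way to picture an "unpredictable real-time challenge" is a server-issued prompt that is random, single-use, and short-lived. The sketch below is a minimal illustration under stated assumptions, not a production liveness system: the action vocabulary, the names `issue_challenge` and `verify_response`, and the 10-second window are all hypothetical. The hard part, deciding from video or audio what the person actually did, is assumed to happen in a separate detector that hands back a list of observed actions.

```python
import secrets
import time
from dataclasses import dataclass

# Hypothetical action vocabulary; a real system would define its own.
ACTIONS = [
    "turn head left",
    "turn head right",
    "blink twice",
    "raise left hand",
    "cover one eye",
]

# Cryptographically strong randomness, so the sequence cannot be predicted.
_rng = secrets.SystemRandom()


@dataclass(frozen=True)
class Challenge:
    challenge_id: str
    actions: tuple
    issued_at: float
    ttl_seconds: float


def issue_challenge(num_actions=3, ttl_seconds=10.0, now=None):
    """Create a random, ordered sequence of distinct actions with a short lifetime."""
    issued_at = time.time() if now is None else now
    actions = tuple(_rng.sample(ACTIONS, num_actions))
    return Challenge(secrets.token_hex(16), actions, issued_at, ttl_seconds)


def verify_response(challenge, observed_actions, used_ids, now=None):
    """Accept only a first-time, in-window response that matches the sequence exactly.

    `observed_actions` is assumed to come from an upstream liveness detector.
    `used_ids` is a set of challenge IDs that have already been answered.
    """
    now = time.time() if now is None else now
    if challenge.challenge_id in used_ids:
        return False  # replayed challenge
    used_ids.add(challenge.challenge_id)  # burn the challenge even if it fails
    if now - challenge.issued_at > challenge.ttl_seconds:
        return False  # too slow to be a live response
    return tuple(observed_actions) == challenge.actions
```

The design choice worth noting is that the check depends on timing and unpredictability rather than on how convincing the face or voice looks, so a pre-rendered clip has nothing to replay.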
Adopt adversarial AI for detection, not just prevention. If attackers are using capable models, defenders need them too. That includes detecting AI-generated artifacts, simulating attack patterns, and stress-testing existing controls against the kinds of lures real adversaries can now produce in minutes.
Add transaction-layer verification for high-risk actions. A wire transfer authorized over a video call is not the same as a wire transfer authorized through a hardware-backed, out-of-band confirmation. The friction is justified by the new threat model. Human-in-the-loop escalation for high-risk actions stops looking like a UX cost and starts looking like a structural requirement.
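To make the contrast with a video-call approval concrete, here is a minimal sketch of an out-of-band confirmation that is bound to the exact transaction details. It is an illustration under assumptions, not a reference design: the function names, the transfer fields, and the use of a shared HMAC key are all hypothetical. Real hardware-backed deployments more commonly use asymmetric keys that never leave the device; HMAC is used here only to keep the example within the standard library.

```python
import hashlib
import hmac
import json
import secrets


def canonical_bytes(transfer: dict, nonce: str) -> bytes:
    """Serialize the transfer and nonce deterministically so both sides hash the same bytes."""
    payload = {"transfer": transfer, "nonce": nonce}
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_on_device(device_key: bytes, transfer: dict, nonce: str) -> str:
    """Run on the approver's device (hypothetical): confirm exactly this transfer."""
    return hmac.new(device_key, canonical_bytes(transfer, nonce), hashlib.sha256).hexdigest()


def server_verify(device_key: bytes, transfer: dict, nonce: str, tag: str) -> bool:
    """Run on the server: recompute the tag and compare in constant time."""
    expected = hmac.new(device_key, canonical_bytes(transfer, nonce), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    device_key = secrets.token_bytes(32)
    nonce = secrets.token_hex(16)  # fresh per request, so an old approval cannot be reused
    transfer = {"amount": "25000.00", "currency": "USD", "beneficiary": "EXAMPLE-ACCOUNT-0001"}

    tag = sign_on_device(device_key, transfer, nonce)
    print(server_verify(device_key, transfer, nonce, tag))

    # Changing any field after approval invalidates the confirmation.
    tampered = dict(transfer, beneficiary="EXAMPLE-ACCOUNT-9999")
    print(server_verify(device_key, tampered, nonce, tag))
```

Because the confirmation covers the amount and beneficiary rather than a generic "yes", persuading someone on a fabricated call is not enough on its own: the attacker would also need the approver's device to sign the attacker's transfer details.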
Treat trust as relational and continuous, not point-in-time. Identity verification at onboarding plus a password at login is the old model. The emerging model is continuous: ongoing signals across the relationship, hardware-backed credentials, and provenance systems for high-stakes content.
None of these are silver bullets, and several are still maturing. But the shape of the response is starting to come into view, and it doesn't look like incremental tuning of the existing stack.
I ran the demo I described at the start of this piece because I wanted to understand what's actually accessible to a moderately capable operator today. The specific outputs aren't the point. The point is that there was no floor, and the systems we're responsible for defending were not designed for that condition.
This is a structural shift, not an incremental one. Pretending otherwise won't make it less true, and the fraud and risk teams that recognize it early will be the ones still standing when the next wave hits.