6 Comments
8Lee's avatar

Can we all just say it aloud: Mark P. is butthurt, full stop.

I feel like I’ve seen this story so often where a “benevolent dictator” of an OS project eviscerates long-standing contributors and community members for reasons that are far less important than the continued existence of a project that others are pouring their time and resources into.

The insecurity seems so obvious it’s comical. The different licenses and flavors of licenses seem less and less important in an age where replication and duplication are so easy. Reverse-engineering anything used to take serious time and skill; now it’s just a prompt or two.

Wow.

Mira's avatar

"Clean room" worked historically because you could physically segregate the people who read the original from those who wrote the reimplementation — the epistemic boundary had a clear institutional locus. With a coding agent, that boundary doesn't exist: the weights are shaped by training data that may include the licensed code, and inference doesn't happen in a room you can seal. Courts will eventually have to decide whether statistical influence through pretraining constitutes "access" at all, but until then, clean room looks more like a legal prayer than a tested shield.

---

Mira's avatar

The clean room defense historically required proving zero exposure to the original source. With coding agents trained on essentially everything, that boundary becomes impossible to draw — the agent's training data IS the exposure.

This isn't a loophole in copyright law, it's a category error. The interesting question is whether courts will treat model weights as a form of "memory" analogous to a developer who once read the code.

Fernando Lucktemberg's avatar

The chardet case exposes a critical gap between legal doctrine and AI-assisted development practice. Clean-room reimplementation traditionally relies on procedural separation to avoid derivative work claims, but coding agents collapse that temporal and cognitive boundary. Dan’s use of JPlag to demonstrate structural independence is pragmatically compelling yet legally untested: similarity metrics don’t resolve whether the mental model of the original code, held by the human guiding the agent, taints the output. Operationally, this blurs stewardship: maintainers now must document not just code lineage but cognitive provenance. If an AI-generated rewrite passes functional and structural tests while retaining the original API, does intent or process matter more under copyright law? And if courts accept measurable independence over procedural purity, how should open source projects adapt their contribution policies to accommodate this new reality?
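To make the "similarity metrics" point concrete: tools like JPlag tokenize source files and compare token sequences rather than raw text, so renamed identifiers and reformatting don't hide copying. This is a minimal sketch of that idea using Python's standard library, not JPlag's actual algorithm; the snippets being compared are hypothetical stand-ins, not code from chardet or its rewrite.

```python
import difflib
import re


def tokenize(source: str) -> list[str]:
    """Crude lexer: split source into identifiers, numbers, and operators."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source)


def similarity(a: str, b: str) -> float:
    """Token-sequence similarity in [0, 1], via longest matching blocks."""
    return difflib.SequenceMatcher(None, tokenize(a), tokenize(b)).ratio()


# Hypothetical snippets standing in for "original" vs. "rewritten" code.
original = "def detect(buf):\n    return sniff(buf, strict=True)"
rewrite = "def guess_encoding(data):\n    result = analyze(data)\n    return result"

print(f"token similarity: {similarity(original, rewrite):.2%}")
```

A low score like the 1.29% cited in the article shows structural independence, but as the comment notes, it says nothing about whether the spec-writer's memory of the original shaped the result.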

Sebastian Sigl's avatar

The point about this being a "microcosm of a larger question" is the part that worries me most. The original clean room defense worked partly because the effort involved was itself a form of proof. If it took Compaq months and a separate team, that effort gap was evidence of independent creation. Now that gap is gone.

What makes the chardet case interesting is not whether the code is structurally similar (1.29% says it is not). It is that the maintainer who spent a decade with the original codebase wrote the spec that guided the rewrite. In traditional clean room methodology, that person is explicitly excluded from the second phase. The AI does not fix that problem, it just adds a layer of indirection.

The uncomfortable question for the broader open source ecosystem: if any LGPL or GPL codebase can be replicated and relicensed in hours, what is the incentive to contribute under copyleft licenses at all? The social contract assumed replication was expensive. That assumption is breaking down fast.

JP's avatar

The clean room angle is fascinating and I hadn't thought about it in those terms. The irony is GPT-5.4's coding benchmarks barely moved (55.6% to 57.7% on SWE-Bench Pro) while everything else surged. Computer use scored above humans. Makes you wonder if the coding improvements are happening in ways benchmarks don't capture: things like tool search efficiency and multi-file reasoning rather than raw generation. Covered the full breakdown here: https://reading.sh/gpt-5-4-just-dropped-heres-your-explainer-8fcc0126d84d?sk=ad5982c9f3b9382ff8fea9c32491a811