When You Ship AI, You Ship a Theory of Your Organization

Every AI product ships an implicit theory of your organization. Most ship that theory by accident.

Adoption is, by every available measure, record-breaking. Stanford HAI’s AI Index 2025 reports that 78% of organizations were using AI in at least one business function by the end of 2024, up from 55% a year earlier; the share using generative AI more than doubled, from 33% to 71% [1]. Microsoft’s Work Trend Index 2025 declares “the year the Frontier Firm is born.” The term denotes a company that has moved beyond AI experimentation and begun rebuilding around hybrid teams of humans and AI agents. The report finds that 82% of leaders treat 2025 as a pivotal moment for restructuring and that 81% expect agents to be integrated within the next 12–18 months [2].

Against that surge, the MIT NANDA initiative’s State of AI in Business 2025 — the GenAI Divide report — finds that 95% of enterprise generative-AI initiatives deliver no measurable impact on the profit-and-loss statement, despite enterprise GenAI spending in the tens of billions across the analyzed cohort [3]. The report names the proximate cause a “learning gap” — the inability of organizations to integrate AI into workflows, structures, and culture. The blunt version: the technology is working; the organization is not absorbing it.

The standard explanations — bad data, talent shortages, the slow grind of change management — describe symptoms, not the cause. The cause is conceptual. AI deployment is not a technology-integration problem; it is a structural reorganization of the firm. Every AI product is, in part, an organizational intervention. It ships not just code and model weights but an implicit theory of how work should be reorganized around it. Most products ship without that theory being made explicit, which is why most fail to land. The question this essay is built around: when we build and deploy AI-supported products and services, do we understand the organizational changes their use implies — and is the product itself designed in anticipation of them?

The Ladder and the Toolbox

Two framings dominate the current conversation about enterprise AI, and both quietly evade this question.

The first is the AI maturity model, sold in some version by every major strategy and technology consultancy: typically a four- or five-stage ladder from initial awareness through enterprise-wide transformation, scored across capabilities like data, talent, technology, governance, and strategy. It treats organizational AI capability as a stock to be built rather than a flow to be redesigned. It hides political economy — whose authority shrinks, whose tacit expertise gets devalued, which decision rights migrate from humans to systems — because those questions do not have rungs. It implies that more is always better, when in many functions the appropriate depth of AI integration is in fact shallow.

A useful counterweight: Daron Acemoglu’s The Simple Macroeconomics of AI (NBER, 2024) uses a task-based model, calibrated on existing estimates of which tasks AI can perform and what cost savings it delivers, to project AI’s contribution to total factor productivity (TFP, the share of output growth not explained by additional labor or capital, conventionally read as the contribution of technology and know-how). The headline estimate is a gain of no more than 0.71% over ten years; once the estimate is adjusted for harder-to-learn tasks, where early evidence is likely to overstate the savings, the predicted gain does not exceed 0.55% [4]. Whether or not one accepts his numbers, the framework forces a question the maturity-model literature evades: where exactly is the value supposed to land, and through what mechanism? “Level 5: Transformational” is not a mechanism.

The second framing is the “just ship the tools” view — the engineer- and product-led mirror image. Pick the right model, build the right copilot, ship the integration, measure success on technical KPIs.

Both framings share a single evasion. They treat the organization as a passive substrate that AI is deployed into, rather than as a living system that AI deployment reorganizes. The maturity ladder hides this by abstracting away the work; the ship-the-tools view hides it by abstracting away the workers. Neither sees that the moment AI becomes load-bearing in any workflow, the organization has changed shape.

AI as a First-Class Participant

The shift this essay turns on is the move from AI as a tool — something people use to do their work better — to AI as a first-class participant in workflows. The term needs care.

Two operational senses, both underway in 2025. The first is agentic: AI systems literally take actions in a workflow — auto-approving expenses, routing tickets, generating code, classifying applications, drafting reports that go out with only a glance for review. The system acts; humans approve, or are too busy to. The second is load-bearing input: AI outputs are so embedded in human decisions that the decisions cannot be reconstructed without them. The analyst’s first-cut analysis is the language model’s draft. The product manager’s user-research synthesis is the model’s summary. Strip the AI out and the workflow does not degrade gracefully; it stops.

This is not an ontological claim about AI agency, intention, or moral standing. It is operational: when AI is first-class in either sense, the workflow and the organization that hosts it have changed shape, even if nobody set out to change them.

That the shift is now consensus rather than contrarian is visible from two unlikely-aligned sources. Microsoft’s Work Trend Index 2025 openly proposes the dissolution of functional org charts into “Work Charts” — fluid structures re-formed around goals rather than departments — and predicts every worker will become an “agent boss” [2]. Meanwhile, the European Union’s AI Act (Regulation 2024/1689) does not literally call AI a first-class participant, but its operative obligations only cohere if AI is being treated as something that acts and whose outputs are load-bearing in the workflow. The Act constructs the deployer — the entity “using an AI system under its authority” (Art. 3(4)) — as a distinct regulated category with its own duties under Art. 26, separate from the provider that developed the system: the pattern usually reserved for operators of consequential things, not for users of tools. Art. 14 then requires high-risk AI systems to be designed so that natural persons can “disregard, override or reverse the output” of the system, intervene through “a ‘stop’ button or a similar procedure,” and remain aware of “automation bias” — the tendency to over-rely on AI output. The verbs override, reverse, stop and the explicit codification of automation bias as a legal-design concern only make sense if the AI is producing outputs that would otherwise be acted on by default. Art. 26(2) closes the chain: “Deployers shall assign human oversight to natural persons who have the necessary competence, training and authority, as well as the necessary support” — an obligation that presupposes the AI is doing something that genuinely needs overseeing [5]. When platform vendors and regulators converge on the same description of AI’s role, the only remaining question is whether organizations design for the shift deliberately, or by accident.

AI as a First-Class Participant: the crew of the Discovery analysing their cooperation with the AI HAL 9000 in Stanley Kubrick’s iconic 2001: A Space Odyssey (1968). Anyone who has seen this cinematic masterpiece knows that it didn’t turn out well.

The Disengaged Worker

Two organizational patterns make the system effect visible. Both are normally narrated as stories about individuals — a bad employee, an undisciplined team. Both are better understood structurally.

Consider the disengaged worker: an employee unmotivated, poorly compensated, bored, or soured on the culture, whose work was already getting their minimum effort. Now they are given an AI copilot. The casual diagnosis: disengaged employee plus generative AI equals a stream of unverified, hallucinated, prototypical output flowing into the organization with nobody paying attention. Correct as far as it goes, but its moral framing is misleading. The disengaged worker is a symptom, not the cause. The cause is a workflow designed without forcing functions for verification at the points where AI is unreliable.

The supporting literature has been converging for fifteen years. Two related but distinct concepts from the human-factors literature describe how operators fail around reliable automation. Automation complacency is a monitoring failure: the operator stops attending to a usually-correct automated system, and its rare errors slip through unnoticed. Automation bias is a decision failure: the operator notices the automation’s output and accepts it as authoritative without checking it against other available evidence, producing both errors of omission (missing things the automation didn’t flag) and errors of commission (following the automation’s recommendation against contrary information). The canonical synthesis is Parasuraman and Manzey’s 2010 paper Complacency and Bias in Human Use of Automation in Human Factors, which argues that both share a single attentional substrate and that both intensify under low engagement, time pressure, fatigue, and high workload, which is exactly the profile of the disengaged worker [6]. The generative-AI version is sharper: Lee, Sarkar and colleagues at Microsoft Research, in their CHI 2025 study of 319 knowledge workers across 936 first-hand GenAI use cases, found that higher confidence in the AI’s output was associated with less critical engagement, while higher self-confidence in the user’s own judgment was associated with more critical engagement [7].

The structural conclusion: any workflow that allows AI output to ship without a forcing function for verification, calibrated to the jagged frontier of the specific AI in use (the uneven boundary between tasks it handles reliably and tasks it does not), has built this failure mode into the system. The intervention is not a motivation poster or a sternly worded policy. It is workflow redesign with verification gates where they actually matter. The disengaged worker exposes the failure faster, but every worker is eventually disengaged for some hour of some day.
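A verification gate of this kind can be sketched in a few lines. What follows is a minimal illustration, not a reference design: the task names, the reliability figures, and the 0.90 threshold are all hypothetical, and in practice the reliability map would have to come from evaluating the specific model on the specific workflow. The one design choice worth noticing is that unknown tasks fail closed.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-task reliability estimates for one model in one workflow.
# Real values would come from evaluation, not from a hard-coded table.
RELIABILITY = {
    "summarize_meeting": 0.97,
    "classify_application": 0.85,
    "draft_contract_clause": 0.70,
}
VERIFY_BELOW = 0.90  # assumed threshold: below this, a human check is mandatory


@dataclass
class AIOutput:
    task: str
    text: str
    verified_by: Optional[str] = None  # the named human who checked it, if any


def needs_verification(output: AIOutput) -> bool:
    """Tasks missing from the map count as unreliable, so the gate fails closed."""
    return RELIABILITY.get(output.task, 0.0) < VERIFY_BELOW


def can_ship(output: AIOutput) -> bool:
    """Output ships only if no gate applies or a named human has verified it."""
    return not needs_verification(output) or output.verified_by is not None


draft = AIOutput("draft_contract_clause", "The supplier shall ...")
print(can_ship(draft))        # blocked until someone signs for it
draft.verified_by = "j.doe"
print(can_ship(draft))        # ships once a named reviewer is recorded
```

The gate does not depend on anyone's motivation. It changes what the workflow permits, and it puts a name on every check in exactly the places where the model is least reliable.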

When Teams Collapse Into Their Tools

The second pattern is harder to see, because it does not present as failure. It presents as smoothness.

Take a team of three or four colleagues who begin using AI-supported tools without redesigning how the team coordinates. Within weeks the working medium has shifted. One drafts with a language model; the second summarizes with another; the third responds with a third. Documents move faster, meetings shorten, surface output looks polished. Something has quietly happened to the team’s collective cognition, and it does not show up on any dashboard. Call this machine-to-machine (M2M) correspondence: a coordination pattern in which substantive exchange between humans is increasingly mediated by AI on both ends, so that what was a conversation between three people becomes a handshake between three AI-augmented endpoints, with humans reduced to gatekeepers waving model-generated text through.

The theoretical lineage runs back to the sociotechnical systems tradition, founded at London’s Tavistock Institute in the postwar decades, which insists that an organization is a joint system of social and technical elements that must be designed together. Trist and Bamforth’s 1951 coal-mining longwall study, the founding empirical case, showed that introducing more advanced mining technology while leaving the team’s coordination structure unchanged destroyed more value than the technology produced [8]. Albert Cherns formalized the resulting discipline in 1976 [9]. Two of his principles speak to current AI rollouts. Joint optimization: the social and technical subsystems must be designed together — neither can be optimized in isolation without degrading the whole. Minimal critical specification: do not over-determine how work gets done while under-determining how the team coordinates around the new tools. Most enterprise AI rollouts violate both, optimizing the technical subsystem while leaving the social one to absorb the consequences.

Three contemporary forces compound the failure.

The first is what I will call epistemic homogenization — a label, not yet a literature, for an effect widely observed but not yet formally named. When every team member’s first draft is generated by the same base model on substantially overlapping training distributions, the variance that used to exist between drafts collapses toward the model’s mode. The team’s collective cognition loses range. Disagreements that previously surfaced in messy human drafts — the ones that pointed at edge cases, raised unspoken assumptions, or revealed expertise gaps — get smoothed away before they can be examined. The team appears to agree more; it actually thinks together less. The effect is adjacent to but distinct from algorithmic monoculture (Kleinberg & Raghavan, 2021 [10]), which describes how many decision-makers using the same algorithm produce correlated and collectively harmful outcomes even when each individual decision is locally optimal, and from outcome homogenization (Bommasani et al., 2022 [11]), which describes how the population of outcomes narrows when many actors rely on the same foundation model — both operate at the population level; what is at stake here is the team-cognition analogue, where the variance lost is within a working group rather than across decision-makers.

The second is what I will call Conway’s Law inversion. Melvin Conway’s 1968 observation — the canonical reference in software architecture — was that the systems an organization designs mirror the communication structures of that organization [12]. The inversion now visible: organizational communication structures are deforming to mirror the capabilities and interfaces of the AI systems embedded in them. Channels restructure around what the model can ingest and summarize. Decisions are formatted for what the agent can act on. The human coordination architecture quietly adapts to the AI’s interface rather than the AI being adapted to the organization’s needs.

The third is the loss of tacit knowledge. Michael Polanyi’s 1966 formulation, now known as Polanyi’s paradox, captured this in a line: we know more than we can tell. Tacit knowledge is what experts have and cannot fully articulate: how a senior engineer feels that a design is fragile before they can explain why, how a clinician’s gestalt detects something off in a patient that the chart does not capture. It does not travel through documents; it travels through messy human exchange: apprenticeship, code review, hallway conversations, the half-articulated objections in early drafts. When language models intermediate that exchange, the transferable surface is cleaned up and the tacit content is filtered out. The empirical wrinkle: Brynjolfsson, Li and Raymond’s 2023 NBER study of 5,179 customer-support agents using a GPT-based assistant found AI raised resolved-issues-per-hour by 14% on average, with gains concentrated almost entirely among novices; the assistant was essentially disseminating senior agents’ recorded best-practice patterns [13]. Good news for short-run productivity, with a longer-run risk: if AI mediation between senior and junior workers is good enough that the direct apprenticeship channel withers, the tacit knowledge the assistant currently distills stops being newly generated.

All three patterns combine into a cultural drift that Westrum’s typology (Ron Westrum, 1988, canonically BMJ Quality & Safety 2004) names precisely [14]. Westrum classifies cultures by how information flows: generative (performance-oriented, free flow, failure treated as learning), bureaucratic (rule-oriented, compartmentalized, novelty as problem), pathological (power-oriented, information withheld, messengers shot). Language-model intermediation pushes teams gently toward the bureaucratic — formatted, polished, smoothed for the model’s preferred surface — and, where blame lands on the nearest human, further toward the pathological. Generative cultures do not survive AI by accident; they survive it by deliberate design.

Who Carries the Blame, and Who Learns the Craft

Two further patterns compound the first two, and both confirm that the organizational failure mode of AI is structural rather than individual.

The first concerns accountability. Madeleine Clare Elish’s 2019 paper Moral Crumple Zones: Cautionary Tales in Human-Robot Interaction in Engaging Science, Technology, and Society coined a phrase that has hardened into standard usage. A moral crumple zone is the region of a sociotechnical system in which moral and legal responsibility for failure is absorbed by a human operator who had limited real control over the system’s behavior, just as the crumple zone of a car absorbs the physical impact of a crash. The car’s crumple zone protects the human; the moral crumple zone protects the technological system, at the expense of the nearest human operator [15]. Elish built the concept from a reading of high-profile automated-system accidents in aviation, nuclear power, and early self-driving cars. The pattern now applies directly to enterprise AI. When an AI agent participates in a consequential decision and the outcome goes wrong, organizations that have not explicitly redesigned their accountability structures will, by default, push the blame to the most junior visible human. That default is both unjust, because the operator could not have prevented the failure, and toxic, because it teaches the rest of the workforce a single lesson: stay as far from the AI as you can.

This is why the EU AI Act’s human-oversight regime (Articles 14 and 26) matters more than the compliance industry has admitted. The articles force organizations to make their accountability architecture explicit before the bad event occurs, rather than improvise it under regulator and media pressure afterward. Compliance with the letter of the law is the floor; the real work is designing internal accountability such that the human in the oversight role has the competence, authority, time, and information actually to oversee — not merely to sign off as the moral crumple zone in waiting.
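Making that architecture explicit can start as something as plain as a record per overseen decision. The sketch below is an illustrative data structure, not a compliance tool: it reduces the four preconditions named above (competence, authority, time, information) to yes/no flags, which is a deliberate simplification, and every field name is hypothetical.

```python
from dataclasses import dataclass
from typing import List

# The four preconditions named in the paragraph above, reduced to flags.
PRECONDITIONS = ("competence", "authority", "time", "information")


@dataclass
class OversightAssignment:
    decision: str      # the AI-supported decision being overseen
    overseer: str      # the named natural person assigned to it
    competence: bool   # understands the system's limits in this workflow
    authority: bool    # can actually override, reverse, or stop the system
    time: bool         # review time is budgeted, not assumed
    information: bool  # sees the inputs and evidence behind the output


def oversight_gaps(assignment: OversightAssignment) -> List[str]:
    """List the preconditions this assignment fails to meet."""
    return [name for name in PRECONDITIONS if not getattr(assignment, name)]


signoff_only = OversightAssignment(
    decision="expense auto-approval",
    overseer="junior analyst",
    competence=True,
    authority=False,
    time=False,
    information=True,
)
print(oversight_gaps(signoff_only))
```

An assignment with a non-empty gap list describes a person who can be blamed but cannot intervene, which is the crumple zone written down before the accident rather than discovered after it.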

Pieter Bruegel the Elder, The Parable of the Blind (1568), Museo di Capodimonte, Naples.
Public domain, via Wikimedia Commons.

The second compound pattern is again older than the contemporary AI literature. Lisanne Bainbridge described it in 1983 in Ironies of Automation in the journal Automatica: automation typically eliminates the easy, routine cases operators used to handle, leaving only the hard, novel, ambiguous cases for the human — while simultaneously atrophying the skills needed to handle those hard cases, because skills are maintained by practice on the easy ones [16]. The AI version is more pointed than Bainbridge’s original. AI is known to be unreliable in jagged ways; yet organizations are now building workforces and processes that can no longer function without AI assistance, and whose remaining humans have less and less practice that would let them recognize when the AI is wrong. The apprenticeship pipeline — the long, repetitive, often unglamorous work through which juniors used to internalize the craft — is precisely the work AI is best at absorbing today.

The same AI assistant that raises novice productivity by disseminating expert patterns can, on a longer time horizon, hollow out the channel through which the next generation of experts is formed. The two outcomes are not contradictory; they are the same dynamic measured over different time scales and under different workflow designs. Which one dominates is a design decision.

The Theory You Ship

Return to the anchoring question: when we build and ship AI-supported products and services, do we understand the organizational changes their use implies — and is the product itself designed in anticipation of them?

We can answer the second half more precisely. Every AI product is, in part, an organizational intervention. It ships not just code and weights but an implicit theory of how work should be reorganized around it: who reviews what, who is accountable when, which decisions migrate from humans to systems, what skills become load-bearing and what skills decay, how teams coordinate through the new tool, how information flows or fails to. That theory ships either way — explicitly or implicitly, deliberately or by accident. Only the explicit theory can be debated, refined, tested, and held to account.

A piece of macroeconomic theory frames the stakes. The Productivity J-Curve, formalized by Brynjolfsson, Rock and Syverson in 2021 in the American Economic Journal: Macroeconomics, describes a recurring pattern in the diffusion of general-purpose technologies — economists’ term for technologies, like the steam engine, electricity, or the computer, that affect a wide range of industries and require significant complementary changes in how production is organized. When a new general-purpose technology arrives, organizations must make large intangible investments — in process redesign, retraining, restructuring, new measurement systems — long before the technology produces measurable output gains. Because those investments are expensed rather than capitalized in standard accounting, measured productivity falls before it rises [17]. The bottom of the J is where the invisible work is being done; the rise is the harvest. The MIT NANDA finding and the J-Curve are the same story from different angles. The 95% of enterprise GenAI initiatives delivering no measurable P&L impact are not simply failed deployments — some are organizations sitting at the bottom of the J without realizing it, because they have not done the intangible work; others have done some of it, but ship AI products carrying implicit theories nobody designed, defends, or knows how to revise.
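The shape of the curve follows from simple arithmetic. The toy model below is not the model in Brynjolfsson, Rock and Syverson's paper; it is a deliberately minimal sketch in which a firm diverts a share of its labor to intangible work that produces no measured output now but adds to measured output later. The investment shares and the payoff rate are hypothetical.

```python
def measured_productivity(invest_shares, payoff_rate, labor=100.0):
    """Return measured output per unit of labor for each period.

    invest_shares: fraction of labor diverted each period to intangible work
                   (process redesign, retraining); it is expensed, so it
                   yields no measured output in that period.
    payoff_rate:   measured output generated each period by one unit of
                   accumulated intangible capital (hypothetical value).
    """
    intangible = 0.0
    series = []
    for share in invest_shares:
        # Only labor left on ordinary production, plus returns on past
        # intangible investment, shows up as measured output.
        output = (1.0 - share) * labor + payoff_rate * intangible
        series.append(output / labor)
        intangible += share * labor  # this period's investment pays off later
    return series


# One baseline period, three periods of intangible investment, then the harvest.
path = measured_productivity([0.0, 0.2, 0.2, 0.2, 0.0, 0.0], payoff_rate=0.5)
print([round(p, 2) for p in path])
```

With these numbers, measured productivity drops below its baseline as soon as the investment starts, recovers while the investment is still under way, and ends above the baseline once the diverted labor returns to production. A firm that reads only the dip cannot tell a failed deployment from an investment in progress.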

For the C-suite reader: stop asking “How mature is our AI?” The ladder it implies is the wrong shape. Start asking what theory of our organization is encoded in the AI we are shipping and buying, and whether it is the theory we would defend if we made it explicit. If the answer is that nobody knows, the organization has already begun to be reshaped by an unowned theory.

For the technical leader: ship the theory, not just the system. The implicit theory ships either way; only the explicit one is debatable. The work of making it explicit — saying out loud who is supposed to verify what, where the jagged frontier of this model lives in this workflow, how the team is supposed to coordinate around the new agent, what the apprenticeship pipeline becomes when juniors no longer do the work that built their predecessors — is what distinguishes the AI products that land from the ones that decorate the failure rate.

The AI transformation organizations are now being told they need is not a technology adoption. It is a structural reorganization of the firm, taking place at the speed of model releases and on the terms of whatever theory the products and the regulators happen to be carrying. Naming that theory, and choosing it deliberately, is the work that the next several years will reward.

A note on method

This essay was co-written with Anthropic’s Claude (Opus 4.6) and researched with Google NotebookLM. Argument, structure, and editorial decisions are mine; the AI assisted with research synthesis, drafting, and reference verification. All cited sources were independently fetched and verified.

Author

Goran S. Milovanović, PhD, Datakolektiv
