Jobs and AI: Chains of work
Solving the puzzle of jobs and tasks
Today we’re hosting a very interesting guest post from Dr James Ransom, Honorary Senior Research Fellow at UCL Institute of Education. He brings a thorough, analytical lens to how AI is impacting (or will impact) jobs, looking at a lot of research and stitching it together with lessons learnt from history. You can catch more of his work at Reskilled.
Enjoy!
What is a chain?
As AI advances, it will stitch together multiple tasks, reshaping the role of humans in jobs. The problem: if you leave the loop, you won’t get back in again.
You are an advertising specialist, and you’ve been called into your boss’s office.
“Right, I need you to own the full campaign lifecycle for the SmartBrew Pro launch – starting Monday, you’ve got two weeks to vacuum up every scrap of consumer data we can lay our hands on, then a week to interpret it all and tell me where the connected kettle market is heading, then about two weeks max where you’ll need to simultaneously advise Sarah’s team on strategy and build out the campaign plan, and finally three weeks to get scripts, copy, storyboards and media bookings nailed down with the agency. I think that’s it, I gotta join a call, keep me posted.”
What you probably won’t think about (especially as you’re going to be pretty busy for the next eight weeks) is that this is a chain of work. You have five tasks to complete, and they build on each other, albeit with some overlap in the middle.
You might consider how AI can speed this up, especially as your boss is pretty bad at spotting AI-generated output. Perhaps ChatGPT can sketch out a campaign plan. If you remember to feed it the right documents, Copilot should be able to help schedule some meetings, draft agendas and summarise next steps.
Most discussion of LLM tools and agents takes a similar approach. Think tank reports look at how AI is replacing or augmenting tasks, the building blocks of jobs. Academic work helps us understand the dynamics of what happens when you automate tasks, but the approach is usually atomistic: the relationship between the tasks is missing.
In real work, tasks are not separate or discrete. They run into each other, overlapping in a messy flow of projects and activities and responsibilities. Why would this be any different for the LLMs of the future? What happens if you skip all the intermediate steps and just request the final deliverable – “produce a complete advertising campaign for the SmartBrew Pro”?
The atomistic fallacy
Most serious attempts to understand AI’s impact on work share a common approach: break jobs into tasks, assess which tasks AI can handle or support, and add up the results. This is sensible, and has produced useful insights. Frey & Osborne’s landmark 2013 study estimated the probability of computerisation for 702 occupations. More recent work by Eloundou and colleagues mapped GPT-4’s capabilities against task descriptions in the US occupational database. Felten and colleagues built exposure scores based on AI capabilities mapped by the Electronic Frontier Foundation; similar approaches have been adopted by organisations from the IMF to the ILO. (I’ve also created my own measures for India and the UK.) In each case, the logic is the same: examine tasks one by one, score them, and aggregate at the occupation level.
The problem is that real work doesn’t present itself as a neat list of separable tasks. Our advertising specialist isn’t completing task (a), filing it away, and moving cleanly on to task (b). She is running several in parallel, feeding the output of one into the next, making judgment calls about when something is good enough to build on. The tasks form a chain, and the connections between them matter as much as the tasks themselves.
Recent academic work has started to recognise this. Autor & Thompson argue that occupations ‘bundle a range of tasks of different expertise levels’, and that the overall composition of those bundles shapes what automation does to a role. Gans & Goldfarb show that automating tasks in bundles is often more economically rational than doing so one at a time. ‘A method that classifies tasks one-by-one can miss situations where automation arrives as a bundle – for example, when a platform or integrated system automates multiple steps at once’, they conclude.
But even bundling doesn’t quite capture what’s coming. A bundle is a cluster, whereas a chain is a sequence. Once AI can reliably execute a sequence of dependent tasks – where each step builds on the last – something qualitatively different happens. Our advertising specialist isn’t just removed from one task: she’s removed from the entire process.
To see what’s coming, we need to go back in time.
A lesson from the assembly line
The assembly line usually gets the credit for the explosion of productivity at Henry Ford’s Highland Park factory in 1913. But, as David Hounshell describes in his landmark study From the American System to Mass Production, 1800-1932, there were major bottlenecks to overcome first – above all, reliability.
For most of the nineteenth century, ‘interchangeable parts’ were an aspiration: at the Singer sewing machine factory, as late as the 1880s, components were still being hand-fitted together by skilled workers who filed, adjusted and coaxed each piece into place. Every item needed a fitter. The US armouries at Springfield and Harpers Ferry had cracked interchangeability for muskets, but at enormous unit cost, and only at a scale of thousands, rather than hundreds of thousands or millions. Singer simply couldn’t afford that level of precision for a consumer product.
Ford solved the reliability problem by investing heavily in precision machining until parts were fully interchangeable, no fitting required. Only then did the assembly line make sense. Once reliability was sorted, speed followed: after April 1913, chassis assembly time fell from over 12 hours to 93 minutes. As Ford himself put it, ‘in mass production there are no fitters’.
Today’s LLMs have yet to overcome their reliability issues. Their outputs require a fitter – a human who checks, adjusts, corrects and decides whether the result is good enough to use. We are at the Singer stage: producing impressive work, but with someone leaning over every piece. As reliability improves, the pressure to chain tasks together without human intervention will become economically irresistible. What does that look like?
When the machines start talking
In January 2025, a video from an ElevenLabs hackathon went viral. Two AI voice agents, set up to negotiate a hotel booking with each other, quickly abandoned English and switched to a stream of rapid, garbled noise – a machine-to-machine protocol that was far more efficient than human language. The project was called Gibberlink, and some comments on the video suggest it unsettled people. “This is actually terrifying”, said one. “That’s… actually kinda scary”, another agreed. “This will be the last sound many of us hear”, said a third.
When AI agents operate in a chain, they need to pass information between steps. They can do this in English (or your human language of choice) – and for now, mostly they do – because the task is handed off to a human. But natural language is wildly inefficient for machine-to-machine communication. It’s verbose, ambiguous, and expensive: you’re paying for tokens on both ends, to generate and to comprehend. A structured format like JSON, or something more compressed still, strips out the redundancy and lets the chain move faster.[1]
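To make the difference concrete, here is a minimal sketch in Python of the same handoff expressed both ways. Everything in it is illustrative – the field names, the prose, and the payload are mine, not any particular agent framework’s schema:

```python
import json

# Hypothetical handoff between two steps in the SmartBrew Pro chain.
# Field names are illustrative; no specific agent framework is implied.

# A natural-language handoff: verbose, ambiguous, and paid for twice -
# once to generate, once for the next model to parse.
prose_handoff = (
    "Hi! I've finished the consumer data pull. Overall, interest in "
    "connected kettles is up, especially among urban 25-34s, and the "
    "strongest channel last quarter looks to have been short-form video."
)

# The same information as a structured payload: denser, unambiguous,
# and trivially machine-checkable at the next step.
structured_handoff = {
    "step": "consumer_data_analysis",
    "product": "SmartBrew Pro",
    "key_segment": {"age": "25-34", "location": "urban"},
    "interest_trend": "rising",
    "top_channel": "short_form_video",
}

print(len(prose_handoff), "characters of prose")
print(len(json.dumps(structured_handoff)), "characters of JSON")
```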
Faster chains are great for efficiency, but they bring accountability concerns. If the chain communicates in formats humans can’t easily read, how do you audit it? One answer is to force every step to produce a human-readable output – a natural language summary of what it did and why. This sounds reasonable until you think about what you’re actually getting. LLMs are, by design, systems that produce plausible-sounding text. A reasoning trace or a summary of an intermediate step might look like a faithful account of the model’s process. But there’s no guarantee it is. It may be a post-hoc rationalisation – a tidy narrative wrapped around a process that was nothing of the sort.[2]
And would a manager who is broadly happy with the chain’s final output want to slow it down so a human can step into the loop and try to audit each step?
For basic transactional auditing – did the right data get passed? was the output in the correct format? – structured logs are fine. But for the kind of judgment-heavy, ambiguous work that characterises most knowledge work, the audit problem is largely unsolved. And as chains get longer, it gets worse, not better.
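For that transactional layer, here is a sketch of what a per-step audit record might look like – a simple hash-and-timestamp design of my own, not any existing standard. Note that the `summary` field is exactly the generated text the previous paragraph warns about: readable, but not verifiable:

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(step_name: str, payload: dict, summary: str) -> dict:
    """Build a transactional audit entry for one chain step.

    The hash and timestamp support the 'did the right data get passed?'
    checks. The summary is generated text and, as noted above, may be a
    post-hoc rationalisation rather than a faithful trace.
    """
    body = json.dumps(payload, sort_keys=True)
    return {
        "step": step_name,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "payload_sha256": hashlib.sha256(body.encode()).hexdigest(),
        "payload": payload,
        "summary": summary,  # human-readable, but not verifiable
    }

log = [audit_record(
    "consumer_data_analysis",
    {"segment": "urban 25-34", "trend": "rising"},
    "Pulled last quarter's data and identified the strongest segment.",
)]
print(json.dumps(log, indent=2))
```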
Toyota’s corrective
Ford’s assembly line is an imperfect analogy for modern work. Whilst technically brilliant, it was designed for high volume and zero variation: millions of identical Model T cars. Knowledge work is the opposite – high variation, with bespoke outputs, and highly context-dependent. A legal brief is not a chassis, and a marketing campaign is not a musket. (As an aside, Ford faced a ‘wrenching nightmare’ in the years that followed, according to Hounshell, as he tried to increase the flexibility of the plant and introduce variety).
Toyota faced a version of this problem in post-war Japan. The market wanted small quantities of many varieties of car, under conditions of low overall demand – circumstances closer to a modern consulting firm than to Ford’s Highland Park. The Toyota Production System, developed by Taiichi Ohno from the 1950s onwards (and articulated in a book of the same name), offers four principles that may help us better understand how AI chains could function.
One operator, many machines. Before Toyota, the assumption was one worker per machine. Ohno rearranged the factory so that a single operator could attend to three or four machines, intervening only when one stopped. There was understandable resistance – craftsmen didn’t want to become generalists – but it doubled and tripled productivity. The parallel for knowledge work is a human overseeing several concurrent AI chains across different functions: marketing, analysis and finance, for example.
Autonomation: automation with a human touch. Toyota’s machines were designed to detect problems and stop themselves automatically. Any worker could pull the andon cord to halt the entire line if they spotted a defect. The equivalent for an LLM chain would be built-in checkpoints – steps where the system self-evaluates and flags anomalies rather than blindly passing flawed output downstream. A human need not watch every step, but monitors the flow of work, tweaks the process, and responds when the system signals something is wrong. (A minimal sketch of how this might look in code follows these four principles.)
Kanban as meta-information. Toyota’s kanban cards were packaged in vinyl envelopes and carried only the production information needed so the earlier and later processes could talk to each other. In a multi-LLM chain, this maps onto disciplined context management, passing only the relevant outputs and specifications between steps, not flooding each model with the entire history.
Pull, not push. Ford pushed products through the system based on forecasts. Toyota pulled them through based on actual demand. An AI chain triggered by a specific prompt (produce this campaign, for this client, to meet this goal) is inherently a pull system. The output is bespoke, drawn through the process by the request, not mass-produced and distributed.
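As flagged above, here is a minimal sketch of how two of these principles – autonomation and kanban – might look in an LLM chain runner. The structure and names (`AndonStop`, `run_chain`, the step dictionaries) are hypothetical, and the lambdas stand in for model calls:

```python
class AndonStop(Exception):
    """Raised when a step's self-check fails: the line halts."""

def run_chain(steps, context):
    for step in steps:
        # Kanban: hand each step only the keys it declares it needs,
        # not the whole accumulated history.
        inputs = {key: context[key] for key in step["needs"]}
        output = step["run"](inputs)

        # Autonomation: the step evaluates its own output and pulls the
        # andon cord rather than passing a defect downstream.
        if not step["check"](output):
            raise AndonStop(f"step {step['name']!r} flagged its own output")

        context.update(output)
    return context

# Illustrative steps for the SmartBrew Pro chain.
steps = [
    {
        "name": "trend_analysis",
        "needs": ["raw_data"],
        "run": lambda d: {"trend": "rising" if d["raw_data"] else "unknown"},
        "check": lambda out: out["trend"] != "unknown",
    },
    {
        "name": "campaign_plan",
        "needs": ["trend"],
        "run": lambda d: {"plan": f"lean into the {d['trend']} kettle trend"},
        "check": lambda out: len(out["plan"]) > 0,
    },
]

result = run_chain(steps, {"raw_data": [1, 2, 3]})
print(result["plan"])  # 'lean into the rising kettle trend'
```

The point of the sketch is the shape, not the detail: defects stop the line instead of flowing downstream, and each step sees only the context it declared it needs.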
Ford concentrated thinking within management. Toyota discovered this was economically inferior in complex, variable environments. Kaizen – continuous improvement driven by the workers themselves – outperformed top-down optimisation because the people closest to the work understood it best. If we build LLM chains as rigid sequences – lacking feedback loops, with no capacity for self-correction – we may get impressive throughput on routine work while creating brittleness at the edges.
What this all means
Today AI is mostly augmenting, rather than replacing, tasks. Humans are in-the-loop. But my sense is that, for many tasks, augmentation is a pit stop on a journey where full automation is the final destination. And as we chain these tasks together, humans begin to slip outside the loop: instead of steering it from within, they oversee it.
This means work is reorganised around several functions, and these do not necessarily sit with the same person: designing the chain (what agents, in what order, with what instructions and context), overseeing the implementation of the chain (and tweaking as needed), and judging the output (is this campaign any good? does it meet the client’s needs? would I stake my reputation on it?). The first two of these in particular are perhaps what the World Economic Forum and others mean by ‘agent orchestrators’.[3]
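A rough sketch of how those three functions might separate in practice: the chain designer produces a spec, the overseer watches the run, and the judge evaluates the output. All names here are hypothetical, and the model calls are simulated:

```python
# A hypothetical separation of the three functions described above.

# 1. Designing the chain: which agents, in what order, with what instructions.
chain_spec = [
    {"agent": "researcher", "instruction": "gather consumer data on connected kettles"},
    {"agent": "analyst", "instruction": "interpret the data and identify market trends"},
    {"agent": "strategist", "instruction": "draft the SmartBrew Pro campaign plan"},
]

# 2. Overseeing the run: watch each step and tweak as needed (here, just logging).
def oversee(step: dict, output: str) -> None:
    print(f"[overseer] {step['agent']} finished: {output}")

# 3. Judging the output: is this good enough to stake a reputation on?
def judge(final_output: str) -> bool:
    return "SmartBrew" in final_output  # a stand-in for human judgment

def execute(spec: list) -> str:
    output = ""
    for step in spec:
        # A real system would call a model here; we simulate the result.
        output = f"{step['agent']} output for: {step['instruction']}"
        oversee(step, output)
    return output

final = execute(chain_spec)
print("approved" if judge(final) else "sent back")
```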
METR’s measure shows LLMs completing ever more complex tasks over time, with a doubling time of around 7 months. The chart, from 6 February 2026, shows the task duration at which an agent is predicted to succeed 80% of the time (the 50%-success version shows longer durations). Note the logarithmic Y axis. Task length is measured in human-expert completion time.
We’re on a trajectory of rapid improvement, but there is a caveat. METR’s research shows a steady increase in the ability of models to handle longer and more complex tasks.[4] However, the success thresholds – models are assessed on whether they can complete a task successfully 50% or 80% of the time – tell us that reliability remains an issue.
For Ford, interchangeable parts were the prerequisite for reliability, and he had solved this by 1913. The Toyota system emphasises a continual reduction in defective goods – a continual increase in reliability. For both, solving reliability meant the whole process ran smoothly.
What does this all mean? I think there are four key implications.
1. It means that for now we will have humans in the loop, shepherding tasks as part of the chain of work. When, or if, we increase the reliability of AI systems, those humans disengage from the loop.[5] Until then, audit is difficult to scale: a five-step chain with human review at each stage is manageable, but a fifty-step chain is not.
2. It means that new roles (chain designer, process auditor) are real but scarce. Ohno’s principle of one operator, many machines applies: the human who once executed a five-task chain becomes the human who monitors five chains running concurrently. But this does not necessarily imply one-fifth the headcount: it depends on what other tasks are left after automation.
3. It means that management skills are, as Ethan Mollick puts it, an AI superpower. I’d summarise this in two concepts. The first is compound engineering: most of your thinking takes place before you start work. Before, you might have spent 20% of your time planning and documenting what you want, and 80% executing. This now flips. Second, like any effective manager, you ask, ‘what do you need so you can do your job better?’. (This is captured nicely by the term context engineering).
4. It means a repeat of deskilling and reskilling cycles. Craftsmen who hand-fitted Singer sewing machines were displaced by assembly line workers, who were eventually displaced in turn. But workers in Toyota’s factories had to build up a broader range of skills. As Taiichi Ohno put it, applying human intelligence to machines was the only way to make machines work for people.
Back to the boss’s office
Let’s return to that meeting: your boss wants the SmartBrew Pro campaign, and you’ve got eight weeks. Only now it’s two years later, and the agency has deployed a chain of work, pulling through everything from marketing materials to consumer data analysis.
You write the brief on Monday afternoon. On Tuesday morning, there’s a full campaign to review, and media slots to approve. You didn’t collect the data, you didn’t interpret the trends, and you didn’t draft the scripts. You judged the brief going in, and you’ll judge the output coming out. Everything in-between happened without you.
(Quick aside: the five tasks chained together are actual tasks for 2431: Advertising and Marketing Professional, as per the International Standard Classification of Occupations (ISCO). There are nine tasks in total.)
There’s a question nobody in your office is asking yet: when you’ve been outside the loop for the entire process, how well can you actually judge what comes out the other end? The advertising specialist who spent years learning the craft – who could spot an unconvincing headline or a lazy script because she’d written hundreds – is precisely the person qualified to audit the chain. But she’s also the person the chain is designed to replace. And the junior who never built that expertise? They’ll inherit the auditor role without the knowledge to do it well.
If you leave the loop, you won’t get back in again. That’s the lesson of every previous wave of automation, from the Springfield Armoury to Highland Park to Toyota City. AI can already do some of your tasks. What we need to ask is whether it can chain them together, and what’s left of your job when it does.
[1] JSON, or JavaScript Object Notation, is at least a human-readable format, albeit not the way most humans would naturally communicate to each other. Other formats, such as vector embeddings, are not human-readable.
[2] Of course, this same problem applies when you’re only using one LLM for one task. But it is magnified when (a) tasks get longer, (b) are chained together (and call on external APIs), and (c) you can no longer micromanage the outputs. For example, the overall chain might produce outputs that no individual model ‘intended’, and debugging this becomes very complex.
[3] To stretch our car analogy, manufacturing roles all have their equivalent in the chain. Product designers describe and define the outcome; process engineers are the architects of the chain; the foreperson or supervisor monitors for failures and exceptions; the line worker is an LLM executing a task; and the quality inspector is in charge of validation and verification, and could be a human or (if the human is just spot-checking) another LLM.
[4] METR’s study focuses on software engineering tasks. OpenAI’s GDPval looks at a broader range of tasks. Quite separately, OpenClaw demonstrates what AI delegation could look like, albeit in a form that would give any IT manager nightmares.
[5] The METR study does hint at another option, and one that was not feasible for car manufacturers: running a chain of work in parallel five or ten times, and selecting the best outcome.