The idea behind backpropagation was not missing. It kept moving forward through time under different names, in different fields, waiting for AI to finally recognize it.
I The Idea That Kept Passing By
There is a funny kind of tragedy in the history of artificial intelligence: sometimes the missing idea is not missing at all. It is moving forward through time in plain sight, appearing in another field, under another name, written for another audience, waiting for the right people to recognize it.
Backpropagation is one of those ideas. Today, it is the workhorse behind deep learning: the procedure that lets a neural network discover how each of its internal knobs contributed to an error, then nudge them in a better direction. The model makes a guess going forward; backpropagation sends the blame backward.
But the idea did not arrive in one clean thunderclap, and almost nobody who found a piece of it was trying to invent future AI. They were solving their own problems: equations, flight paths, rounding errors, behavioral prediction, brain-like hardware, cognitive representation.
That is what makes the story so strange. Backpropagation was not repeatedly discovered by people trying to build the same thing. It kept appearing as the practical answer to different questions: if a final outcome is wrong, how do earlier choices need to change?
The Problem Backprop Solves
A neural network is a stack of transformations. Numbers go in, layers transform them, and an answer comes out. If the answer is wrong, training means asking a deceptively hard question: which weights caused the mistake, and by how much?
For a tiny network, you might perturb each weight one at a time. Turn this knob slightly. Did the error improve? Turn the next knob. Try again. But this becomes absurd as networks grow: a model with a million weights would need a million separate forward passes just to estimate a single update. Backpropagation calculates all those sensitivities in one backward pass.
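To make the cost concrete, here is a minimal sketch of that knob-twiddling approach, usually called finite differences. The loss function and all names are illustrative, not any historical formulation:

```python
import numpy as np

def loss(weights, x, y):
    """One forward pass: a single linear layer with squared error.
    Stands in for any network; only the weights matter here."""
    prediction = x @ weights
    return (prediction - y) ** 2

def finite_difference_gradient(weights, x, y, eps=1e-6):
    """Estimate each weight's blame by nudging weights one at a
    time: one extra forward pass per weight."""
    base = loss(weights, x, y)
    grad = np.zeros_like(weights)
    for i in range(len(weights)):
        nudged = weights.copy()
        nudged[i] += eps
        grad[i] = (loss(nudged, x, y) - base) / eps
    return grad

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.1, -0.2, 0.3])
print(finite_difference_gradient(w, x, y=1.0))  # 3 weights, 3 extra passes
```

The loop is the problem: its length is the number of weights.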
Backpropagation does that by applying the chain rule in reverse. Instead of asking each parameter separately how guilty it is, the algorithm sends an error signal backward through the system and reuses intermediate calculations along the way.
This is the quiet miracle: one forward pass to make a prediction, one backward pass to assign responsibility.
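Here is that two-pass structure as a minimal sketch, assuming a toy two-layer network with a tanh hidden layer and a squared-error loss (the architecture and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # input
W1 = rng.normal(size=(4, 3))    # first layer weights
W2 = rng.normal(size=(1, 4))    # second layer weights
target = np.array([1.0])

# Forward pass: make a prediction, keeping intermediates for later reuse.
h_pre = W1 @ x                  # pre-activation
h = np.tanh(h_pre)              # hidden activation
y = W2 @ h                      # output
error = y - target
print("loss:", 0.5 * float(error @ error))

# Backward pass: chain rule in reverse; one sweep assigns blame everywhere.
d_y = error                                  # dL/dy
d_W2 = np.outer(d_y, h)                      # dL/dW2 reuses the stored h
d_h = W2.T @ d_y                             # blame flowing into the hidden layer
d_h_pre = d_h * (1 - np.tanh(h_pre) ** 2)    # through tanh's derivative
d_W1 = np.outer(d_h_pre, x)                  # dL/dW1 reuses the stored x

# One gradient step: nudge every knob against its blame.
lr = 0.1
W1 -= lr * d_W1
W2 -= lr * d_W2
```

One backward sweep produces d_W1 and d_W2 together, at roughly the cost of a second forward pass; the finite-difference loop above would need a separate forward pass for every entry of W1 and W2.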
Cauchy, Kelley, and the control-theory route
In 1847, Augustin-Louis Cauchy was not thinking about machines that learn. He was trying to solve systems of equations. His method of steepest descent asked a beautifully general question: if you want a quantity to get smaller, which direction should you move?
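In modern notation, which Cauchy did not use, his answer is the update rule deep learning still runs today:

$$x_{k+1} = x_k - \eta \, \nabla f(x_k)$$

Move against the gradient $\nabla f(x_k)$, the direction of steepest increase, scaled by a step size $\eta$. Replace $x$ with a vector of network weights and $f$ with a training loss, and this is the ancestor of every gradient step in modern training; backpropagation's whole job is to compute that gradient efficiently.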
Henry J. Kelley brought the story much closer to backpropagation in 1960 with "Gradient Theory of Optimal Flight Paths". Kelley's problem was concrete and aeronautical: how should an aircraft or missile adjust its path to optimize performance? He was not training neural networks. He was asking how choices made earlier in a trajectory affect the final objective.
That practical flight-control problem forced the same kind of reasoning deep learning would later need. To improve the path, Kelley needed sensitivities flowing backward from the goal to earlier control decisions. In modern language, this looks like a continuous-time ancestor of backpropagation: not because Kelley was secretly doing AI, but because the structure of the problem demanded backward credit assignment.
Arthur Bryson followed with related gradient methods in 1961, and Bryson and Yu-Chi Ho later codified the adjoint method in "Applied Optimal Control". Control engineers had a cousin of backpropagation because rockets, aircraft, and dynamic systems also need blame to flow backward from outcome to cause. AI researchers mostly did not notice, because the idea was wearing flight-control clothing.
II Wrong Place, Wrong Time
The thesis about arithmetic bookkeeping
In 1970, Seppo Linnainmaa was not trying to make computers recognize cats, translate languages, or beat humans at Go. He was studying a quieter problem: how rounding errors accumulate when a computer performs a long calculation. Numerical algorithms are chains of operations, and each tiny local error can ripple through the rest of the computation.
To understand that ripple, Linnainmaa described reverse-mode automatic differentiation in a master's thesis at the University of Helsinki. The computational essence of backpropagation was there: represent a calculation as steps, then run derivatives backward through those steps. The future of deep learning was hiding inside arithmetic bookkeeping.
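The idea fits in a few lines. Here is a minimal sketch of a reverse-mode "tape" over scalar operations; the class and method names are illustrative, and this is nowhere near Linnainmaa's notation. Record each operation as it runs, then walk the record backward, accumulating derivatives:

```python
class Var:
    """A scalar that records how it was computed, so derivatives
    can later be run backward through the recorded steps."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (parent Var, local derivative)
        self.grad = 0.0

    def __mul__(self, other):
        return Var(self.value * other.value,
                   parents=((self, other.value), (other, self.value)))

    def __add__(self, other):
        return Var(self.value + other.value,
                   parents=((self, 1.0), (other, 1.0)))

    def backward(self):
        """Reverse-mode sweep: push the output's sensitivity back
        through every recorded operation via the chain rule."""
        self.grad = 1.0
        stack = [self]
        while stack:
            node = stack.pop()
            for parent, local_derivative in node.parents:
                parent.grad += node.grad * local_derivative
                stack.append(parent)

# Note: this naive sweep suffices here because the reused variable x
# is a leaf; a full implementation walks nodes in reverse topological order.
x = Var(2.0)
y = Var(3.0)
z = x * y + x          # z = xy + x
z.backward()
print(x.grad, y.grad)  # dz/dx = y + 1 = 4.0, dz/dy = x = 2.0
```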
The thesis was written in Finnish. An English-language journal paper followed in 1976, but the AI community was not reading numerical-analysis literature during the AI Winter. Nobody noticed.
The timing is almost unbearably close: Minsky and Papert published "Perceptrons" in 1969, helping send neural-network research into retreat. Just one year later, Linnainmaa published the algorithmic core that would eventually help prove multilayer learning possible. By then, nobody was listening.
Even the name had an odd prehistory. In 1962, Frank Rosenblatt had used the phrase "back-propagating error correction" while working on perceptrons. But his perceptrons used discrete outputs, so there was no useful gradient to follow. It was like having a road sign for a bridge that had not yet been built.
The Harvard thesis nobody heard
Paul Werbos was not writing a deep-learning manifesto in 1974. His PhD thesis, "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences", came out of prediction, economics, dynamic systems, and the social sciences. He wanted better tools for modeling systems whose internal causes were tangled together.
That meant he needed a way to compute how a final prediction error depended on earlier internal quantities. His answer was "ordered derivatives": a general method for computing gradients through multilayer systems via the chain rule. It was backpropagation in everything but the name and the future audience.
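The flavor of the construction, paraphrased rather than quoted from the thesis: if a computation produces quantities $z_1, \dots, z_n$ in order and ends in an error $E$, the ordered derivative of $E$ with respect to an earlier $z_i$ adds the direct effect to every indirect effect routed through later quantities:

$$\frac{\partial^{+} E}{\partial z_i} = \frac{\partial E}{\partial z_i} + \sum_{j>i} \frac{\partial^{+} E}{\partial z_j}\,\frac{\partial z_j}{\partial z_i}$$

Evaluated from $z_n$ back down to $z_1$, the recursion reuses every later sensitivity. That is a backward pass, which is exactly what backpropagation performs layer by layer.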
The origin story is wonderfully strange. Werbos later described developing the idea while trying, among other things, to mathematize Freud's notion of a "flow of psychic energy". Modern deep learning, in one of its ancestral branches, passes through a Harvard dissertation trying to turn psychoanalytic metaphor into equations. History has a sense of humor, even when peer review does not.
The thesis was filed, catalogued, and largely forgotten. Harvard was the right institution, but economics and behavioral forecasting were the wrong wrapper for a neural-network audience. The idea had prestige nearby, but no direct path to the people who needed it.
The almost-famous report
David Parker was much closer to the modern neural-network story, but even his framing was not "let us launch the deep-learning era". His project was called "Learning-Logic", and the subtitle of his 1985 MIT technical report was almost science-fictional: "Casting the Cortex of the Human Brain in Silicon". He was thinking about brain-like computation, learning hardware, and architectures that could adapt.
Parker had an invention-report version in 1982, and the 1985 technical report put the method very close to the neural-network formulation. But the signal still did not propagate far enough. Technical reports mattered, but they did not travel like journal papers. The field was thawing, but not yet hungry enough.
The paper everyone heard
Rumelhart, Hinton, and Williams were not optimizing flight paths or tracing rounding errors. They were trying to answer a cognitive and computational question: how can a neural network learn useful internal representations rather than merely memorizing input-output associations?
"Learning representations by back-propagating errors" was published in "Nature" in October 1986. It showed, concretely and convincingly, that multilayer neural networks could learn internal representations: hidden layers were not a liability, but the whole point.
The packaging was everything. "Nature" reached everyone, the AI Spring was beginning, hardware was becoming adequate, and the community was hungry for exactly this result.
The algorithm was not entirely new. What was new was the match between problem, audience, demonstration, and moment. In the earlier stories, the technique had solved local problems in other worlds. In 1986, it finally solved a problem the AI community recognized as its own.
III Why It Finally Became Visible
Truth was not enough
If by backpropagation we mean reverse-mode differentiation through a computational graph, Linnainmaa is central. If we mean adjoint sensitivity methods, Kelley, Bryson, and control theory were already close. If we mean applying the idea to adaptive systems and neural networks, Werbos has a powerful claim. If we mean a neural-network learning algorithm in recognizable modern clothing, Parker gets close. If we mean the paper that made the AI community take notice, Rumelhart, Hinton, and Williams win the cultural history.
Backpropagation became famous because correctness finally met legibility, timing, distribution, and working demonstrations. NETtalk learned to convert English text into speech-like pronunciation in 1987 and became a minor media celebrity. ALVINN used backpropagation to help a neural network drive a vehicle in 1989. LeNet applied it to handwritten zip-code recognition. TD-Gammon used it in a reinforcement-learning system that reached top-level backgammon.
Recognition did not land evenly. Werbos later received the 1995 IEEE Neural Network Pioneer Award for his role in the roots of backpropagation, and Hinton later shared the 2024 Nobel Prize in Physics for foundational work in machine learning. Kelley, Linnainmaa, and Parker remain much less visible in the public version of the story, which is part of the point: priority and fame are not the same signal.
The joke hidden in the title
Backpropagation is an algorithm for sending information backward so earlier parts of a system can learn from later errors. But the history of backpropagation itself failed to backpropagate. Credit did not flow cleanly backward through the chain of predecessors.
The idea moved forward instead: from optimization to flight control, from rounding errors to economics, from technical reports to "Nature", until the world finally had the right architecture to receive it.
Not a single lightning strike. A signal trying to find a path.
Want to learn more about the evolution of backpropagation through time? Explore the timeline.