The dream was simple: encode the rules, and the machine thinks. The probabilists arrived with something messier, harder to explain, and far more powerful.
I The God Complex
Early AI researchers were, by temperament, logicians. The founding assumption of an entire generation: figure out the rules of thought, write them down, and the machine will think. If you know enough rules, you can answer anything. Uncertainty is just ignorance, and ignorance is just a problem you haven't solved yet.
This was not a fringe view. It was the mainstream. Boolean Logic, Frege's notation, the Turing Machine, the Logic Theorist: the tools of AI's first era were all about finding truth. Approximations were not judged and found wanting. They were never considered at all.
The expert systems of the 1970s and 1980s were the apex. MYCIN diagnosed bacterial infections from 600 rules. XCON configured computer systems based on 10,000 conditions. They worked beautifully inside those conditions, and broke spectacularly outside them. When they broke, the proposed solution was more rules. The logic was impeccable: the system failed because reality wasn't fully encoded yet. You just needed to finish encoding it.
Certainty was the whole point. A medical expert system was supposed to deliver the diagnosis, not a shrug with decimals attached. The idea that uncertainty could be represented, calibrated, and useful, that "probably" might be a form of intelligence rather than a failure of it, took a very long time to arrive.
II The Dead Minister's Footnote
In 1763, a paper was published in the Philosophical Transactions of the Royal Society. The author, Thomas Bayes, was already dead. His friend Richard Price found the manuscript, cleaned it up, and submitted it to the journal. The paper's title was "An Essay towards Solving a Problem in the Doctrine of Chances".
It was, quietly, one of the most subversive documents in the history of science.
Bayes' theorem does not tell you what is true. It tells you how to update what you already believe when new evidence arrives. It formalizes the idea that belief comes in degrees, that evidence shifts those degrees, and that certainty is just one extreme of a long continuum. You start with a guess — a prior — and revise it.
The concrete problem Bayes had in mind is worth a moment. Imagine you are placed in a dark room in front of a billiard table you cannot see. Someone rolls a ball across it, and the ball stops somewhere. You have no idea where. They then roll a second ball, and tell you only whether it stopped to the left or the right of the first. Then they do it again. And again. Each report shifts your estimate of where the first ball is — you never see it directly, you only accumulate indirect evidence, and each piece of evidence updates your belief. You never reach certainty. You just get closer, probabilistically, with each trial.
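The billiard-table update can be sketched in a few lines. What follows is a minimal grid approximation, not Bayes' own derivation: the table is scaled to the interval from 0 to 1, the prior is assumed uniform, and the function names, the hidden position of 0.7, and the 200 trials are all made up for illustration.

```python
import random

def billiard_posterior(reports, grid_size=101):
    """Grid approximation of the belief about the first ball's position.

    reports: list of "L" / "R", meaning a later ball stopped to the left
    or right of the first ball. Positions are scaled to [0, 1].
    """
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    posterior = [1.0 / grid_size] * grid_size  # uniform prior: no idea where
    for report in reports:
        # If the first ball sits at p, a uniformly rolled ball stops to its
        # left with probability p and to its right with probability 1 - p.
        likelihood = [p if report == "L" else 1.0 - p for p in grid]
        unnormalized = [b * l for b, l in zip(posterior, likelihood)]
        total = sum(unnormalized)
        posterior = [u / total for u in unnormalized]
    return grid, posterior

def posterior_mean(grid, posterior):
    return sum(p * w for p, w in zip(grid, posterior))

# Hypothetical experiment: the first ball is at 0.7, hidden from the observer.
rng = random.Random(0)
true_position = 0.7
reports = ["L" if rng.random() < true_position else "R" for _ in range(200)]
grid, posterior = billiard_posterior(reports)
estimate = posterior_mean(grid, posterior)
```

Each report multiplies the current belief by a likelihood and renormalizes. The estimate drifts toward the hidden position without ever collapsing onto a single point.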
This was, philosophically, a problem. The frequentists, who dominated statistics for most of the 19th and 20th centuries, refused to accept it. Probability, in their view, was a property of physical processes — not a state of mind. You could not just assign a probability to a question about where something is and start calculating. That was subjective. That was opinion dressed up as math.
The debate raged for nearly two centuries. It is possible that Bayes was partly motivated by David Hume's argument that miracles are never rationally believable — because, Hume had claimed, it is always more probable that the testimony is mistaken than that the impossible occurred. Bayes never answered this directly. He just quietly built the mathematical framework that would let you think about it properly. Price, who published the essay posthumously, understood the theological stakes. Bayes, apparently, just found the problem interesting — and then died, leaving everyone else to argue about it.
III The Tools That Did Not Wait
The philosophical dispute over Bayes did not stop the practical mathematicians from building probabilistic tools. They just did not call them Bayesian, and mostly did not care about the dispute. The story from here is not that Bayesianism won cleanly. It is that uncertainty, estimation, sampling, and approximation kept turning out to be more useful than exact answers.
In 1801, Carl Friedrich Gauss needed to find a lost asteroid. Ceres had been spotted, then swallowed by the sun's glare, and the question was where it would reappear. His solution, which he published in 1809: fit a model to noisy measurements by minimizing the sum of squared errors. Not an exact calculation, but an acknowledgment that measurements are imperfect and you work with what you have. The Method of Least Squares became the foundation of linear regression, and linear regression became the foundation of almost everything that followed in machine learning. Gauss found the asteroid. The method has outlasted the discovery by two centuries and counting.
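For a sense of what minimizing the sum of squared errors means in practice, here is a toy sketch. It fits a straight line, which is far simpler than the orbit Gauss was actually fitting, and the data points and function names are invented for illustration.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_error(xs, ys, slope, intercept):
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Made-up noisy measurements scattered around y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
slope, intercept = fit_line(xs, ys)
best_error = sum_squared_error(xs, ys, slope, intercept)
```

No line passes through every point, and the method does not pretend otherwise. It returns the line whose total squared miss is smallest.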
In 1913, Andrey Markov was analyzing a poem: specifically, the sequence of vowels and consonants in Pushkin's Eugene Onegin. Applying the chains he had introduced in 1906, he modeled each letter as depending probabilistically on the one before it, and on nothing further back. Not a deterministic rule. A probability. Markov Chains became the backbone of Hidden Markov Models, which dominated speech recognition for twenty years, which turned into the sequence models that eventually became modern language processing. Pushkin would probably have had thoughts about this. We can only estimate what they were.
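Markov's counting exercise is easy to imitate. The sketch below is an assumption-laden miniature: it uses a short English phrase instead of Pushkin, treats only a, e, i, o, u as vowels, and the function names are illustrative.

```python
from collections import Counter

VOWELS = set("aeiou")

def letter_classes(text):
    """Map each alphabetic character to 'V' (vowel) or 'C' (consonant)."""
    return ["V" if ch in VOWELS else "C" for ch in text.lower() if ch.isalpha()]

def transition_probabilities(states):
    """Estimate P(next class | current class) from adjacent pairs."""
    pair_counts = Counter(zip(states, states[1:]))
    from_counts = Counter(states[:-1])
    return {
        (current, nxt): pair_counts[(current, nxt)] / from_counts[current]
        for current in from_counts
        for nxt in ("V", "C")
    }

sample = "banana bandana"  # stand-in text, not Eugene Onegin
probs = transition_probabilities(letter_classes(sample))
# probs[("C", "V")] is the estimated chance that a vowel follows a consonant.
```

The whole model is a table of four numbers. Everything the chain knows about the next letter class is read off from the current one.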
In 1946, Stanisław Ulam was recovering from illness and playing cards. He wondered what the probability of winning a patience game was — and realized that enumerating every possible game was impossible, but playing it randomly many times and counting wins was perfectly tractable. From an afternoon of solitaire, Monte Carlo methods: use randomness to approximate answers too hard to compute exactly. His colleague Nicholas Metropolis suggested naming the method after the casino. The name was somewhat ironic for a technique built on principled mathematics rather than casino luck, but it stuck.
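Ulam's trick can be sketched with a much smaller card question than solitaire: how often does a five-card hand contain at least one ace? The question is chosen here only because it has an exact answer to compare against. The trial count, seed, and function name are assumptions.

```python
import random
from math import comb

def estimate_ace_probability(trials, seed=0):
    """Deal random five-card hands and count how many contain an ace."""
    rng = random.Random(seed)
    deck = [rank for rank in range(13) for _ in range(4)]  # rank 0 stands for the ace
    hits = 0
    for _ in range(trials):
        hand = rng.sample(deck, 5)  # five distinct cards from the deck
        if 0 in hand:
            hits += 1
    return hits / trials

exact = 1 - comb(48, 5) / comb(52, 5)  # one minus the chance of an ace-free hand
estimate = estimate_ace_probability(20_000)
```

For this question the exact answer is a one-liner, so the simulation is redundant. For a full game of patience there is no such one-liner, and the simulation is all you have.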
None of these people framed their work as a philosophical challenge to the certainty ideal. They were solving real-life problems. Probabilistic methods solved them. The philosophers found out later.
IV The Probabilist Who Got Crushed
In 1958, Frank Rosenblatt published "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain". The title is not a throwaway. Rosenblatt was building something explicitly probabilistic — a model where noise and randomness were features, not bugs. The brain, he argued, did not compute exactly. It estimated. It guessed, and then adjusted.
This was the right framing. It was also the wrong decade.
In 1969, Minsky and Papert published "Perceptrons" — a precise mathematical demonstration of what single-layer networks could not do. The result was technically accurate and strategically devastating: funding dried up, researchers pivoted, and neural networks entered the first AI Winter. The certainty-seekers got their decade back. Expert systems bloomed. Logic programming returned. Probability, in neural network form, retreated.
What Rosenblatt had started would lie dormant for twenty years, waiting for hardware and data to catch up with the idea. The missing piece for training multi-layer networks, the reverse-mode differentiation that underlies backpropagation, was incidentally already sitting in Seppo Linnainmaa's Finnish master's thesis from 1970. Nobody read it. It was in Finnish.
V The Quiet Return
The turnaround did not come with a manifesto. It came with a graph.
In 1985, Judea Pearl introduced Bayesian Networks: directed graphs where nodes are variables and edges represent probabilistic dependencies. You feed in evidence — a symptom, a sensor reading, a test result — and the network propagates updated beliefs through to everything connected. The key shift: instead of requiring the system to know the exact state of the world, you let it maintain a probability distribution over possible states. The machine is not certain. It is calibrated — and when new evidence arrives, the confidence updates automatically.
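A three-node toy network shows the mechanics. This is a hedged sketch, not Pearl's propagation algorithm: it computes the posterior by brute-force enumeration, the structure (a disease that influences a symptom and a test result) is invented, and every probability in it is made up.

```python
# Hypothetical network: Disease -> Symptom, Disease -> Test.
P_DISEASE = 0.01
P_SYMPTOM_GIVEN = {True: 0.90, False: 0.10}  # P(symptom | disease status)
P_TEST_GIVEN = {True: 0.95, False: 0.05}     # P(positive test | disease status)

def posterior_disease(symptom=None, test=None):
    """P(disease | evidence). Unobserved nodes (None) are marginalized out."""
    weights = {}
    for disease in (True, False):
        weight = P_DISEASE if disease else 1.0 - P_DISEASE
        if symptom is not None:
            p = P_SYMPTOM_GIVEN[disease]
            weight *= p if symptom else 1.0 - p
        if test is not None:
            p = P_TEST_GIVEN[disease]
            weight *= p if test else 1.0 - p
        weights[disease] = weight
    return weights[True] / (weights[True] + weights[False])

before = posterior_disease()
after_symptom = posterior_disease(symptom=True)
after_both = posterior_disease(symptom=True, test=True)
```

With these made-up numbers the belief climbs from 1% to about 8% after the symptom and to about 63% after the positive test. At no point does the network claim to know.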
Pearl had done what the philosophers could not: turned the Bayesian intuition into a practical engineering tool. Medical diagnosis, robotics, language parsing — all became tractable under uncertainty in ways they could not be under exact inference. He would win the Turing Award in 2011. This is generally how philosophy ends: someone turns it into a diagram, and the diagram into code.
The EM Algorithm, formalized in 1977 by Dempster, Laird, and Rubin, ran on the same logic. You don't know all the parameters, so you guess, improve, and guess again, iterating until the result stabilizes. Each step is guaranteed not to lower the likelihood of the data, which sounds like a description of adolescence but is in fact the property that makes the algorithm settle, though only on a local optimum and not necessarily the best one. Hidden Markov Models ran on EM. So did Gaussian Mixture Models, and a dozen other workhorses of applied statistics.
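The guess-improve-guess loop can be sketched for the simplest mixture model: two one-dimensional Gaussians. The data, the starting guesses, and the small variance floor are assumptions made for illustration. The floor is a guard against degenerate data and is not part of textbook EM.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, weights, means, variances):
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )

def em_two_gaussians(data, steps=20):
    weights, means, variances = [0.5, 0.5], [min(data), max(data)], [1.0, 1.0]
    history = [log_likelihood(data, weights, means, variances)]
    for _ in range(steps):
        # E-step: how responsible is each component for each point?
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate parameters from the soft assignments.
        for k in range(2):
            mass = sum(r[k] for r in resp)
            weights[k] = mass / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / mass
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / mass
            variances[k] = max(spread, 1e-6)  # floor: an assumption, see lead-in
        history.append(log_likelihood(data, weights, means, variances))
    return weights, means, variances, history

# Made-up data: two clumps, one near -2 and one near +2.
data = [-2.1, -1.9, -2.0, -2.3, -1.7, 2.0, 2.2, 1.8, 1.9, 2.1]
weights, means, variances, history = em_two_gaussians(data)
```

The `history` list records the log-likelihood after every step, which makes the "never worse" property something you can inspect rather than take on faith.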
Certainty was not the goal anymore. Calibrated confidence was.
VI Two Cultures, One Winner
In 2001, Leo Breiman — a statistician, publishing in a statistics journal, addressing an audience of statisticians — argued that statisticians had spent fifty years doing statistics wrong. The journal published it. The statisticians were not impressed.
Breiman called it The Two Cultures. The first culture built explicit parametric models of the data-generating process, because understanding why felt more scientific than just predicting what. The second culture treated the world as a black box and optimized purely for prediction — accepting uncertainty about the mechanism, as long as the model was right about the outcome. Breiman argued the first culture had dominated academia at enormous cost, smothering approaches that actually worked on real problems.
The algorithmic culture, as he called it, is not Bayesian in any narrow sense. It is the culture of acknowledged uncertainty. You don't assume you know what the process is. You let the data tell you. You accept that your model is an approximation, optimize for prediction, and use held-out test data to check whether it works. This is not a philosophical surrender. It is a pragmatic upgrade.
Breiman's contemporary contribution, Random Forests, was the cleanest expression of this philosophy: train hundreds of decision trees, each on a slightly different random sample of data and features, then average their votes. No single tree is trustworthy. Together, they are remarkably robust, following a logic first sketched by Condorcet in 1785, who proved mathematically that many imperfect independent voters, each right more often than not, converge toward truth when aggregated by majority. Condorcet had been trying to formalize democracy. Breiman had discovered it was also a machine learning theorem. Condorcet himself was arrested during the French Revolution and died in prison in 1794, which suggests that while majority aggregation converges on truth in theory, the confidence interval on whose votes count remained wide.
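Condorcet's result can be checked with a short calculation. The sketch below computes the exact probability that a majority of independent voters is right. It assumes an odd number of voters and full independence, which real trees in a forest only approximate, since they are trained on overlapping data. The function name is illustrative.

```python
from math import comb

def majority_accuracy(n_voters, p_correct):
    """Probability that a strict majority of n independent voters is right.

    Assumes n_voters is odd and each voter is right with probability
    p_correct, independently of the others.
    """
    needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct ** k * (1.0 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )

one_voter = majority_accuracy(1, 0.6)
three_voters = majority_accuracy(3, 0.6)
crowd = majority_accuracy(101, 0.6)
bad_crowd = majority_accuracy(101, 0.4)  # voters slightly worse than chance
```

The theorem cuts both ways. Voters who are slightly better than a coin flip become a near-certain majority, and voters who are slightly worse become a near-certain mistake.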
AdaBoost, introduced in 1997 by Freund and Schapire, amplified the same insight: build a strong classifier by combining a sequence of weak ones, each focused on the examples the previous one got wrong. The weakness is not the enemy. It is the raw material.
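A miniature version of the idea, under stated assumptions: the weak learners are one-threshold "stumps" on a single feature, and the six data points are invented so that no single stump can get them all right. The function names are illustrative, and this is a sketch of the reweighting scheme, not a reproduction of the original paper's pseudocode.

```python
import math

def stump_predict(x, threshold, polarity):
    return polarity if x < threshold else -polarity

def best_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    thresholds = [x - 0.5 for x in xs] + [max(xs) + 0.5]
    best = None
    for threshold in thresholds:
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-12), 1.0 - 1e-12)  # avoid log(0)
        alpha = 0.5 * math.log((1.0 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Misclassified points gain weight, so the next stump focuses on them.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def ensemble_predict(ensemble, x):
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score > 0 else -1

# Made-up labels: positive at both ends, negative in the middle.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
single = adaboost(xs, ys, rounds=1)
boosted = adaboost(xs, ys, rounds=3)
```

One stump can only cut the line once, so it has to misclassify something. Three reweighted stumps, none of them right on its own, vote their way to the correct answer on this toy set.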
The "Two Cultures" essay was written when neural networks were still unfashionable. Breiman thought he was writing a corrective. What he was actually writing was a prophecy.
VII The Machine That Guesses
Today's large language models do not retrieve answers. They sample them.
At the bottom of every ChatGPT-style system is a probability distribution: given everything so far, how likely is each possible next token? The model does not simply take the most likely token every time, because greedy deterministic decoding tends to produce flat, repetitive text. It draws a sample, weighted by probability. Temperature, top-p, top-k: all of these are controls on how to sample from uncertainty. When a language model says "Certainly!" it is, technically, doing the opposite.
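The three knobs can be sketched on a four-word vocabulary. The vocabulary, the scores ("logits"), and the function names are invented for illustration, and production systems differ in details such as the order in which filters are applied.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities. Lower temperature sharpens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def filter_top_k(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, set(order[:k]))

def filter_top_p(probs, top_p):
    """Keep the smallest set of top tokens whose total probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return renormalize(probs, keep)

def sample_token(vocab, logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    probs = softmax(logits, temperature)
    if top_k is not None:
        probs = filter_top_k(probs, top_k)
    if top_p is not None:
        probs = filter_top_p(probs, top_p)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical next-token scores.
vocab = ["the", "a", "one", "zebra"]
logits = [2.0, 1.0, 0.5, -1.0]
token = sample_token(vocab, logits, temperature=0.8, top_k=3, rng=random.Random(0))
```

Setting `top_k=1` recovers greedy decoding as a special case. Everything else is a choice about how much of the distribution's tail the sampler is allowed to visit.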
The neural networks that produce these distributions are trained by stochastic gradient descent — an algorithm whose theoretical foundations trace to Robbins and Monro in 1951. Stochastic gradient descent does not compute the exact gradient of the loss. It estimates it from a random sample, takes a step, and tries again. Approximate, noisy, probabilistic — and, at scale, extremely effective.
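Here is a minimal sketch of that loop on the smallest possible model, a straight line. The data is made up and deliberately noiseless so that the result is easy to check. The learning rate, batch size, step count, and function name are assumptions.

```python
import random

def sgd_fit_line(xs, ys, steps=2000, batch_size=4, lr=0.1, seed=0):
    """Fit y = w * x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = range(len(xs))
    for _ in range(steps):
        # Estimate the gradient from a small random batch, not the full data.
        batch = rng.sample(indices, batch_size)
        grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / batch_size
        grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / batch_size
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up data lying exactly on y = 3x + 1.
xs = [i / 10 for i in range(-10, 11)]
ys = [3 * x + 1 for x in xs]
w, b = sgd_fit_line(xs, ys)
```

Each step sees only four of the twenty-one points, so each gradient is a noisy estimate. The steps still add up to the right answer.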
Dropout, the training trick that randomly zeros out neurons during each pass, is deliberate noise injection: the model learns to work without any particular feature because it never knows which features will be present. At test time, the full network runs with a matching rescaling, which approximates averaging over the many thinned networks sampled during training. This is not a workaround. It is a design choice rooted in the same probabilistic philosophy Pearl brought to AI and Rosenblatt had in the perceptron.
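The mechanism fits in one function. This sketch uses the "inverted" variant common in current libraries, where surviving activations are scaled up during training so that nothing needs rescaling at test time. The function name and the numbers are illustrative.

```python
import random

def dropout(activations, drop_prob, training, rng):
    """Inverted dropout on a list of activations.

    During training each value is zeroed with probability drop_prob and the
    survivors are scaled by 1 / (1 - drop_prob), which keeps the expected
    value unchanged. At test time the input passes through untouched.
    """
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)
layer = [0.5, 1.0, 2.0, 4.0]                      # hypothetical activations
noisy = dropout(layer, 0.5, training=True, rng=rng)
clean = dropout(layer, 0.5, training=False, rng=rng)

# Averaging many noisy passes of a single activation recovers its value.
passes = [dropout([1.0], 0.5, training=True, rng=rng)[0] for _ in range(20_000)]
average = sum(passes) / len(passes)
```

Any single training pass sees a damaged network. The average over passes is the undamaged one, which is why the test-time network can stand in for the whole ensemble.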
The certainty machine became, step by step, a probabilistic one. Not because anyone voted to abandon the dream — but because every tool that worked was, under the hood, built on uncertainty.
VIII The Bill That's Still Open
In 2018, Judea Pearl — the same Judea Pearl who introduced Bayesian Networks — published The Book of Why, and made an uncomfortable argument: modern machine learning is extremely good at finding correlations, and extremely poor at finding causes.
The machines have learned to be uncertain about predictions. They have not learned to be uncertain about the world in the right way. A model trained on health data will learn that people who carry lighters get lung cancer more often than people who don't. It has found a correlation. It has not learned that smoking causes cancer. It has no idea what would happen if you took the lighter away. The policy implication this suggests — confiscate the lighters — is exactly the kind of thing you get when a pattern-matcher is mistaken for a reasoner.
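The lighter example can be simulated. In this sketch smoking drives both lighter-carrying and cancer, and lighters do nothing on their own. All the rates are invented for illustration, as are the function names. The "intervention" is modeled by handing out lighters on a coin flip, regardless of smoking.

```python
import random

def cancer_rate_by_lighter(n, intervene, seed=0):
    """Return {carries_lighter: observed cancer rate} for a simulated population."""
    rng = random.Random(seed)
    counts = {True: [0, 0], False: [0, 0]}  # lighter -> [people, cancer cases]
    for _ in range(n):
        smoker = rng.random() < 0.30
        if intervene:
            lighter = rng.random() < 0.50  # assigned by coin flip, ignoring smoking
        else:
            lighter = rng.random() < (0.90 if smoker else 0.05)
        cancer = rng.random() < (0.15 if smoker else 0.01)  # depends on smoking only
        counts[lighter][0] += 1
        counts[lighter][1] += cancer
    return {lighter: cases / people for lighter, (people, cases) in counts.items()}

observed = cancer_rate_by_lighter(100_000, intervene=False)
intervened = cancer_rate_by_lighter(100_000, intervene=True)
observed_gap = observed[True] - observed[False]
intervened_gap = intervened[True] - intervened[False]
```

In the observational data, lighter carriers have far more cancer. Once lighters are handed out at random, the gap all but disappears. A model that only ever saw the first dataset has no way to predict the second.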
The distinction matters whenever the system is deployed somewhere it was not trained — whenever an intervention changes the distribution, whenever a counterfactual matters. These are exactly the conditions under which AI is expected to be most useful.
The Ladder of Causation goes: observation (seeing correlations), intervention (changing the world), counterfactual (imagining alternatives). Modern deep learning, for all its probabilistic sophistication, sits almost entirely on the first rung.
Probabilistic thinking got AI past the brittleness of rule-based systems. It gave machines a way to handle noise, update beliefs, and produce calibrated predictions. It did not give them understanding. The uncertain machine has learned to hedge its outputs beautifully. Whether it knows what it's talking about is still open. Probably.