Noise: A Flaw in Human Judgment
Daniel Kahneman, Olivier Sibony, and Cass R. Sunstein
New York: Little, Brown Spark, 2021
Decades of research into human decision-making have shown tendencies toward cognitive biases and irrationalities, resulting in decisions that work against personal and organizational interests.
While people attempt to predict the future across a range of domains, empirical studies have long found that accuracy is often very low, and r (the correlation coefficient) tends to be small (.21 across a range of soft-science studies).
Daniel Kahneman, Olivier Sibony, and Cass R. Sunstein’s Noise: A Flaw in Human Judgment (2021) contributes to and builds on this literature. It reads as a follow-on to Kahneman’s Thinking, Fast and Slow (2011), which showed people tending to think too fast (with System 1 thinking) in some cases and failing to shift to System 2 thinking for more analytical depth when necessary.
They write about common denial of “objective ignorance,” or the unknowability of so much in a complex world. They write:
People who believe themselves capable of an impossibly high level of predictive accuracy are not just overconfident. They don’t merely deny the risk of noise and bias in their judgments. Nor do they simply deem themselves superior to other mortals. They also believe in the predictability of events that are in fact unpredictable, implicitly denying the reality of uncertainty. In the terms we have used here, this attitude amounts to a denial of ignorance. (p. 145)
People underestimate the difficulty of future projection, especially in the longer term, and they overestimate their own abilities.
Introduction: On Target, Off Target
Noise… opens with comparative visuals of targets, with various scatters of x’s indicating where various teams landed their respective shots. If the bull’s-eyes are the correct answers, anything off the bull’s-eyes is error. Closeness to center is desirable when dead-center itself is not achieved.
They describe human judgment “as measurement in which the instrument is a human mind. Implicit in the notion of measurement is the goal of accuracy—to approach truth and minimize error. The goal of judgment is not to impress, not to take a stand, not to persuade” (p. 39). They clarify further: “Judgment is not a synonym for thinking, and making accurate judgments is not a synonym for having good judgment” (p. 40). While people aim for accuracy, perfection itself “is never achieved even in scientific measurement, much less in judgment. There is always some error, some of which is bias and some of which is noise” (p. 40). In contexts of professional judgment, judgments apply to “questions of fact or computation on the one hand and matters of taste or opinion on the other. They are defined by the expectation of bounded disagreement…Exactly how much disagreement is acceptable in a judgment is itself a judgment call and depends on the difficulty of the problem” (p. 44). Where the target is one score or fact, though, anything outside of that number is erroneous.
The co-authors write:
Some judgments are biased; they are systematically off target. Other judgments are noisy, as people who are expected to agree end up at very different points around the target. (Kahneman, Sibony, & Sunstein, 2021, p. 4).
Systematic bias is calculated as an average leaning toward one area over another. Noisiness is about variance, with scatter occurring. These scenarios evoke contexts where there are known, objectively observable targets. For example, a projection of the future may later be objectively assessed and scored (such as against the amount of profits or losses from sales). There are numerous real-world domains where decisions are noisy: medicine, child custody decision-making, future forecasting, asylum decisions, personnel decisions, bail, forensic science, patents, and others (pp. 6 – 7). Given that errors compound and lead to follow-on errors, individuals and organizations that can head off such mistakes stand to achieve important gains from the avoidance of “costly errors” (p. 55). They write: “The different errors add up; they do not cancel out” (p. 55). They continue:
…in professional judgments of all kinds, whenever accuracy is the goal, bias and noise play the same role in the calculation of overall error. In some cases, the larger contributor will be bias; in other cases, it will be noise (and these cases are more common than one might expect). But in every case, a reduction of noise has the same impact on overall error as does a reduction of bias by the same amount. For that reason, the measurement and reduction of noise should have the same high priority as the measurement and reduction of bias. (pp. 55 – 56)
This book offers an instrument and approach to conduct “noise audits” in organizations to lessen noise in decision-making.
Measuring Bias and Noise in Human Decision-making
With patterned and repeat decision-making, assessments of bias and noise may be calculated fairly easily. One data visualization shows the true value of the target, a Gaussian curve showing the bias in the guesses or estimates (the amount, the direction, the intensity), and standard deviations of those guesses off-true as indications of noise. Truth then is achieved with zero error. A common tool used to measure overall error is the “mean squared error” as conceptualized by Carl Friedrich Gauss in the late 1700s. MSE involves the “average of the squares of the individual errors of measurement” (p. 59), and the best estimate is “the one that minimizes the overall error of the available measurements” (p. 60). This approach is also known as the least squares method and is widely used today.
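The mean-squared-error idea above can be sketched in a few lines. The following is a toy illustration (the measurements are invented, not from the book): it computes the MSE of an estimate against a set of measurements and checks that the arithmetic mean minimizes it, which is the core of Gauss’s least squares insight.

```python
def mse(estimate, measurements):
    """Average of the squared differences between an estimate and each measurement."""
    return sum((m - estimate) ** 2 for m in measurements) / len(measurements)

# Hypothetical repeated measurements of the same quantity
measurements = [9.8, 10.1, 10.4, 9.9, 10.3]
mean = sum(measurements) / len(measurements)

# MSE at the mean vs. MSE at nearby candidate estimates: the mean wins
for candidate in [mean - 0.2, mean, mean + 0.2]:
    print(f"estimate={candidate:.2f}  MSE={mse(candidate, measurements):.4f}")
```

Shifting the estimate in either direction away from the mean increases the MSE, which is why the mean is the least-squares best estimate of the true value.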
They write:
This difference between bias and noise is essential for the practical purpose of improving judgments. It may seem paradoxical to claim that we can improve judgments when we cannot verify whether they are right. But we can—if we start by measuring noise. Regardless of whether the goal of judgment is just accuracy or a more complex trade-off between values, noise is undesirable and often measurable. (pp. 53 – 54)
A true value is not necessary to measure noise. They write: “All we need to measure noise is multiple judgments of the same problem. We do not need to know a true value” (p. 53). A later identification of a true value may be used for further analysis; then the amount of error, or “the difference between the judgment and the outcome” (p. 49), may be captured.
Figure 1: Amorphous
Exploring Error Equations
The authors offer some error equations:
The role of bias and noise in error is easily summarized in two expressions that we will call the error equations. The first of these equations decomposes the error in a single measurement into the two components with which you are now familiar: bias—the average error—and a residual ‘noisy error.’ The noisy error is positive when the error is larger than the bias, negative when it is smaller. (p. 62)
That is: Error in a single measurement = bias + noisy error. And: Overall error (MSE) = bias² + noise². The error equation “does not apply to evaluative judgments, however, because the concept of error, which depends on the existence of a true value, is far more difficult to apply” (p. 67).
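The second error equation can be verified numerically. The judgments below are illustrative values (not the book’s data): bias is the average error, noise is the standard deviation of the errors around that average, and their squares sum exactly to the MSE.

```python
true_value = 100.0
judgments = [104.0, 98.0, 103.0, 101.0, 99.0]  # hypothetical judgments of one case

errors = [j - true_value for j in judgments]
bias = sum(errors) / len(errors)                                # average error
noise_sq = sum((e - bias) ** 2 for e in errors) / len(errors)   # variance of errors (noise squared)
overall = sum(e ** 2 for e in errors) / len(errors)             # overall error (MSE)

print(f"bias^2 + noise^2 = {bias**2 + noise_sq:.4f}")
print(f"MSE              = {overall:.4f}")  # the two quantities match
```

The decomposition holds for any set of judgments of a verifiable quantity, which is why reducing noise by some amount cuts overall error just as much as reducing bias by the same amount.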
Bias and noise, while initially presented as independent (p. 65), are later described as somewhat interrelated.
“Level errors” are present “in any judgment task” (p. 73) and come from differences in the average level of judgments by different judges. Pattern errors emerge from particular evaluative judgments (and tendencies) by particular judges as residual deviations. They write:
If you wrote down these pattern errors in each cell of the table, you would find that they add up to zero for every judge (row) and that they also add up to zero for every case (column). However, the pattern errors do not cancel out in their contribution to noise, because the values in all cells are squared for the computation of noise. (p. 75)
Pattern noise (variability) “reflects a complex pattern in the attitudes of judges to particular cases” (p. 75). Pattern noise is also known as “judge x case interaction” or “judge-by-case” noise (p. 76), and it “is pervasive” (p. 76). This noise comes from variability in judges’ responses “to particular cases” (p. 78). Variability is expected both between judges and within judges.
About Occasion Noise
Measuring “occasion noise” is challenging:
Measuring occasion noise is not easy—for much the same reason that its existence, once established, often surprises us. When people form a carefully considered professional opinion, they associate it with the reasons that justify their point of view. If pressed to explain their judgment, they will usually defend it with arguments that they find convincing. And if they are presented with the same problem a second time and recognize it, they will reproduce the earlier answer both to minimize effort and maintain consistency. (p. 81)
For individuals making important decisions, how can they strengthen their decision-making and lower within-subject noise? How can people increase their own test-retest reliability (or internal consistency)? In a study, Edward Vul and Harold Pashler hypothesized a wisdom-of-crowds phenomenon for individuals. They suggested that the averaging of two answers would “be more accurate than either of the answers on its own” (p. 83). In their research, they found that a person’s first guess “was closer to the truth than the second, but the best estimate came from averaging the two guesses” (p. 83). Individuals may benefit from generating multiple guesses:
As Vul and Pashler put it, ‘You can gain about 1/10th as much from asking yourself the same question twice as you can from getting a second opinion from someone else.’ This is not a large improvement. But you can make the effect much larger by waiting to make a second guess. (p. 84)
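A small simulation (illustrative, not Vul and Pashler’s data or design) shows why averaging helps: with two noisy, unbiased guesses at the same true value, the average of the pair has roughly half the squared error of either guess alone. Note the simplifying assumption that the two guesses are independent; within-person guesses are correlated, which is why the real-world gain from a second self-guess is only a fraction of a true second opinion.

```python
import random

random.seed(42)
true_value = 100.0
trials = 10_000

err_first = err_second = err_avg = 0.0
for _ in range(trials):
    g1 = random.gauss(true_value, 10)   # first guess: unbiased but noisy
    g2 = random.gauss(true_value, 10)   # second, independent guess (an assumption)
    err_first += (g1 - true_value) ** 2
    err_second += (g2 - true_value) ** 2
    err_avg += ((g1 + g2) / 2 - true_value) ** 2

print("MSE of first guess :", round(err_first / trials, 1))
print("MSE of second guess:", round(err_second / trials, 1))
print("MSE of average     :", round(err_avg / trials, 1))  # roughly half of either
```

Waiting between guesses, as the co-authors suggest, reduces the correlation between them and moves the real gain closer to this idealized case.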
People may be aware of their mood and how mood affects decision-making. Good moods lead people to “accept our first impressions as true without challenging them” (p. 87); they also make people more cooperative and reciprocating. The order in which information is given can affect perception. Human fatigue can affect decision-making. The weather can also affect human perception and decision-making (p. 91).
The presence of group dynamics can result in a high level of variance and noise in decision-making:
Groups can go in all sorts of directions, depending in part on factors that should be irrelevant. Who speaks first, who speaks last, who speaks with confidence, who is wearing black, who is seated next to whom, who smiles or frowns or gestures at the right moment—all these factors, and many more, affect outcomes. (p. 94)
Crowds can be positive or negative:
There are ‘wise crowds,’ whose mean judgment is close to the correct answer, but there are also crowds that follow tyrants, that fuel market bubbles, that believe in magic, or that are under the sway of a shared illusion. Minor differences can lead one group toward a firm yes and an essentially identical group toward an emphatic no. (p. 94)
Those who speak first towards a decision stand to sway the outcome (p. 97).
Figure 2: Noise
About One-off Judgments
In real life, some decisions are singular and may be made only once. The researchers cite former President Barack Obama’s decision-making during the Ebola outbreak in West Africa in 2014. His decision-making was based on information provided by various experts, and the known information was incomplete. There was not “a prepackaged response” (p. 35). Any number of serendipitous factors affected or could have affected the decision-making:
If the same facts had been presented in a slightly different manner, would the conversation have unfolded the same way? If the key players had been in a different mood or had been meeting during a snowstorm, would the final decision have been identical? Seen in this light, the singular decision does not seem so determined. (p. 37)
How easy is it to calculate bias and noise in singular (vs. repeated) judgments? After all, many important decisions in life may be rarer events: “how to handle an apparently unique business opportunity, whether to launch a whole new product, how to deal with a pandemic, whether to hire someone who just doesn’t meet the standard profile” (p. 12). Noise… suggests several ways to think about this challenge. One is the variance in within-subject decision-making, with people changing their minds, sometimes without apparent reason. [This is such a widely known issue that qualitative data analysts record all coding and thoughts so as to capture the trajectory of thinking. Variability in judgments can be positive in many cases, such as when diversities of ideas are desirable.]
The researchers conclude that noise exists in singular decisions even if the amount cannot be directly measured:
…we cannot measure noise in a singular decision, but if we think counterfactually, we know for sure that noise is there. Just as the shooter’s unsteady hand implies that a single shot could have landed somewhere else, noise in the decision makers and in the decision-making process implies that the singular decision could have been different. (p. 37)
In many senses, governance can be strengthened through proper processes that help ensure that when particular decisions have to be made, the right ones are. Where clear answers are not known, they write:
Focusing on the process of judgment, rather than its outcome, makes it possible to evaluate the quality of judgments that are not verifiable, such as judgments about fictitious problems or long-term forecasts. (p. 50)
As one recent case-in-point, history will judge how various countries dealt with the SARS-CoV-2 / COVID-19 pandemic, along a wide range of dimensions.
AI and Exploring Big Data for Broken-Legs and Other Anomalies
Several seminal research studies compare three core decision-making approaches—clinical judgment (human decision-making), mechanical prediction (the use of simple models and simple rules, including multiple regression), and artificial intelligence (AI) pattern identification from big data—and find that the worst is clinical judgment. Mechanical judgment, by replacing the individual with a rules-based model, both “eliminates your subtlety,…and eliminates your pattern noise” (p. 120). Even so, many human decision-makers experience “algorithm aversion” (p. 135) and trust their own “instincts” over rules, even though algorithms result in improved outcomes.
If simple rules sit at one end, at the other end are more complex machine learning models, which may capture data patterns not otherwise available. For example, the “broken-leg principle” may surface in big datasets. The concept is that minute combinations of data details may capture fresh insights: someone who saw the doctor for a broken leg on the day of a scheduled movie night will not likely show up for the movie later that day. That dynamic may involve a fresh signal from the data.
Even as simple rules do better than human judgment, there are real limits to how good predictive judgments can be, given “objective ignorance” (p. 138) in a complex world. They write:
A “percent concordant” (PC) of 80% roughly corresponds to a correlation of .80. This level of predictive power is rarely achieved in the real world. In the field of personnel selection, a recent review found that the performance of human judges does not come close. On average they achieve a predictive correlation of .28 (PC = 59%). (p. 139)
In a complex world (as described by chaos theory), minor events “can have large consequences” (p. 141). There is “a large amount of objective ignorance in the prediction of human behavior” (p. 143).
The presence of “objective ignorance” (known unknowns, unknown unknowns) means the following is true: “Models are consistently better than people, but not much better” (pp. 142 – 143). Mechanical decision-making is superior in a “massive and consistent” way to human clinical judgment, but “the performance gap is not large” (p. 143). Even as people have “little useful information” that they can use to predict the future, they are still making “bold predictions” (p. 148).
The researchers write:
An extensive review of research in social psychology, covering 25,000 studies and involving 8 million subjects over one hundred years, concluded that ‘social psychological effects typically yield a value of r [correlation coefficient] equal to .21.’ (p. 151)
People fall into fallacious thinking by assuming that fate or inevitability exists: “When we give in to this feeling of inevitability, we lose sight of how easily things could have been different—how, at each fork in the road, fate could have taken a different path” (p. 154). Another explanation reads: “In the valley of the normal, events unfold…they appear normal in hindsight, although they were not expected, and although we could not have predicted them” (p. 155). People weave narratives to explain the world and confuse their stories with an external reality.
Risks of Self-Satisfaction
Human decision-making is impoverished in part by human “cognitive bias” and brain wiring.
Too often, people go with an “internal signal of judgment completion,” which makes them feel personally satisfied and overconfident even when their judgment has no real tie to the evidence (pp. 48 – 49). Too often, people go with System 1 “fast” thinking vs. System 2 “statistical” and “slow” thinking. The latter category of thinking “begins with ensembles and considers individual cases as instances of broader categories” (p. 157).
People often think in sloppy ways. Instead of answering difficult questions, people shift to answering easier ones instead. Two examples offered include the following:
“Is nuclear energy necessary? Do I recoil at the word nuclear?”
“Am I satisfied with my life as a whole? What is my mood right now?” (p. 168)
People tend to go to easy-to-access ideas based on an availability heuristic (p. 167). Many are not aware of their own mental shortcuts. Multiple cited studies show that people may make quantitative decisions based on arbitrary numbers. Indeed, people may be swayed unconsciously.
People may weave narrative threads because of a preference towards “excessive coherence” (p. 171). An “affect heuristic” affects decision-making based on internal feelings (p. 170) instead of cognitive evaluation. Some go to stereotypes to evaluate cases, often engaging in inappropriate matching of information (p. 177).
Interventions to Improve Human Decision-making
The researchers offer a model that decomposes error (measured as MSE) into bias, level noise, occasion noise, and stable pattern noise (p. 211). They write:
Noise is inherently statistical: it becomes visible only when we think statistically about an ensemble of similar judgments. Indeed, it then becomes hard to miss: it is the variability in the backward-looking statistics about sentencing decisions and underwriting premiums. It is the range of possibilities when you and others consider how to predict a future outcome. It is the scatter of the hits on the target. Causally, noise is nowhere; statistically, it is everywhere. (p. 219)
Decision hygiene requires awareness of the noise in the environment and ways to quiet it. Prediction markets may be designed to enable a wider range of predictions, converging toward a true value (when functioning perfectly).
They describe the use of “Brier scores”:
Brier scores reward both good calibration and good resolution. To produce a good score, you have not only to be right on average (i.e., well calibrated) but also to be willing to take a stand and differentiate among forecasts (i.e., have high resolution). Brier scores are based on the logic of mean squared errors, and lower scores are better: a score of 0 would be perfect. (p. 265)
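A minimal Brier-score sketch for binary events makes the calibration-and-resolution point concrete. The forecasters and outcomes below are invented for illustration: each forecast is a probability, each outcome is 0 or 1, and the score is the mean squared difference (lower is better, 0 is perfect). A forecaster who always says 50% is safely hedged but scores worse than one who takes a stand and is usually right.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between probability forecasts and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

outcomes = [1, 0, 1, 1, 0]                  # what actually happened (hypothetical)

bold = [0.9, 0.1, 0.8, 0.9, 0.2]            # well calibrated AND high resolution
hedged = [0.5, 0.5, 0.5, 0.5, 0.5]          # never takes a stand

print("bold forecaster  :", brier_score(bold, outcomes))
print("hedged forecaster:", brier_score(hedged, outcomes))  # always 0.25
```

Because the score is built on mean squared error, it penalizes both systematic miscalibration and the refusal to differentiate among cases, matching the two qualities the co-authors say the score rewards.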
Indeed, some strategies to reduce noise “might introduce errors of their own” and might even occasionally “produce systematic bias” (p. 327). As such, interventions need to be assessed for their effects, both positive and negative. Noise reduction may be “not possible” in some cases given “extreme circumstances” (p. 329).
This book shares insights about so-called “superforecasters,” those in the top 2% who are able to forecast with accuracy. Based on a profile, such individuals tend to be intelligent; are good at “taking the outside view, and they care a lot about base rates” (p. 266); invest effort into research (p. 267); engage in “careful thought and self-criticism”; collect others’ perspectives; update information in a “relentless” way; and train continuously to improve (p. 268). They avoid random errors and noise. They resist “premature intuitions” (p. 373). Superforecasters are evaluated for their performance accuracy.
Figure 3: Super Forecasting
Daniel Kahneman, Olivier Sibony, and Cass R. Sunstein’s Noise: A Flaw in Human Judgment (2021) makes a cogent case for identifying sources of bias and noise in human decision-making and applying mitigations for more efficacious and rational decision-making and, ultimately, a more efficient and fairer world. The book is eminently readable and well explained, with summary points at the conclusion of each chapter.
In the Appendices, the researchers offer information on how to run a “noise audit,” and they offer an instrument for this audit (pp. 388 – 389).
About the Author
Shalin Hai-Jew works as an instructional designer / researcher at Kansas State University. Her email is firstname.lastname@example.org.