Is it fair for a judge to increase a defendant’s prison time on the basis of an algorithmic score that predicts the likelihood that he will commit future crimes? Many states now say yes, even when the algorithms they use for this purpose have a high error rate, a secret design, and a demonstrable racial bias. The former federal judge Katherine Forrest, in her short but incisive When Machines Can Be Judge, Jury, and Executioner, says this is both unfair and irrational.1
One might think that the very notion of a defendant having his prison time determined not just by the crime of which he was convicted, but also by a prediction that he will commit other crimes in the future, would be troubling on its face. Such “incapacitation”—depriving the defendant of the capacity to commit future crimes—is usually defended on the grounds that it protects the public and is justifiable as long as the sentence is still within the limits set by the legislature for the crime. But the reality is that the defendant is receiving enhanced punishment for crimes he hasn’t committed, and that seems wrong.
Nonetheless, Congress and state legislatures have long treated incapacitation as a legitimate goal of sentencing. For example, the primary federal statute setting forth the “factors to be considered in imposing a sentence” (18 U.S.C. sec. 3553, enacted in 1984) provides, among other things, that “the court, in determining the particular sentence to be imposed, shall consider…the need for the sentence imposed…to protect the public from further crimes of the defendant.”
How is the likelihood of “further crimes of the defendant” to be determined? Until recently, most judges simply looked at a defendant’s age and criminal history. If he had committed numerous crimes in the past, the presumption was that he was likely to commit further crimes in the future, which could mean locking him up for the statutory maximum term of imprisonment permitted for his crime. In my experience, however, few judges felt comfortable following this theory of incapacitation to its logical conclusion of a maximum sentence—especially if the defendant’s present and prior crimes were nonviolent, or if there was a realistic possibility of rehabilitation (as in the case of many people addicted to drugs), or if he had simply aged out (numerous studies show that the vast majority of adolescents who engage in criminal behavior desist from crime as they mature). But in any individual case, all of this was educated guesswork at best.
Nevertheless Forrest, during her years on the bench (when she was known as a tough sentencer), felt obligated by the existing legal requirements to base her sentences in part on what she describes in her book as “my personal assessment of the individual’s…likelihood of recidivating, and the extent to which any recidivism would harm the community or those around him.” She freely admits that she does not know, even today, whether those assessments were accurate. She therefore concludes that if the law requires the prediction of future risk, good artificial intelligence (AI) could theoretically help a judge make such predictions. But, she argues, the current algorithms used for this purpose are so deficient that they provide only the illusion of accuracy and fairness.
While the term “artificial intelligence” covers a wide spectrum of software, we are concerned here with computer programs that not only scrutinize vast quantities of data but also adjust how they proceed on the basis of the data they analyze. What they “learn” from the data, and how they react to it, is dictated by a preprogrammed set of instructions called algorithms. The classic analogy likens a simple algorithm to a recipe, which tells you what ingredients to use, in what order, how to cook them, and how to adjust the cooking depending on how the dish is turning out. The “designer” of this algorithm is the author of the recipe, the recipe is the equivalent of software, and you, your food, and your oven are the hardware.
Most computer algorithms are, of course, far more complex than this. But the terms “artificial intelligence” and “algorithm” tend to conceal the importance of the human designer of the program. It is the designer who determines what kinds of data will be input into the system and from what sources they will be drawn. It is the designer who determines what weights will be given to different inputs and how the program will adjust to them. And it is the designer who determines how all this will be applied to whatever the algorithm is meant to analyze.
In some fields, there are well-established standards governing the designer’s choices, frequently based on scientific studies. But in the case of programs designed to predict recidivism, no such standards exist, so there is a large element of subjectivity in the choices the designer makes. And unless those choices are transparent, what is already highly subjective becomes utterly mysterious. As Forrest puts it, “This form of AI results in the algorithm becoming a black box.”
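To make the point concrete, here is a deliberately simplified sketch, in Python, of what a risk-scoring program of this kind can amount to once the technical trappings are stripped away. The inputs, weights, and cutoff below are all hypothetical and chosen purely for illustration; they are not drawn from COMPAS or from any actual tool, whose real choices remain hidden.

```python
# A toy recidivism "risk score." Every numbered choice below belongs to the
# designer, not to the data, and none of it is visible to a defendant if the
# design is kept secret. All inputs, weights, and cutoffs are hypothetical.

def risk_score(defendant: dict) -> str:
    # Choice 1: which inputs count at all (a hypothetical selection).
    age = defendant["age"]
    prior_convictions = defendant["prior_convictions"]
    employed = defendant["employed"]

    # Choice 2: how each input is weighted (hypothetical weights).
    score = 0.0
    score += 2.0 * prior_convictions      # prior record counts heavily
    score += 1.5 if age < 25 else 0.0     # youth treated as a risk factor
    score -= 1.0 if employed else 0.0     # employment treated as protective

    # Choice 3: where the line between "low" and "high" risk is drawn.
    return "high risk" if score >= 3.0 else "low risk"

print(risk_score({"age": 22, "prior_convictions": 2, "employed": True}))   # high risk
print(risk_score({"age": 40, "prior_convictions": 1, "employed": False}))  # low risk
```

A real tool layers statistical machinery on top of something like this, but the three kinds of choices are the same, and keeping them secret is what turns the program into what Forrest calls a black box.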
The most commonly used algorithm for predicting recidivism is a privately produced product called COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), made by a company called Northpointe (doing business under the name Equivant). For competitive reasons, COMPAS’s design is largely kept secret, and several courts have upheld that secrecy even against criminal defendants seeking to challenge the design.
Perhaps the use of a mysterious “black box” algorithm might be justified if it somehow produced accurate and fair results. But according to Forrest, computer programs like COMPAS that are used to predict recidivism provide neither. Studies suggest they have an error rate of between 30 and 40 percent, mostly in the form of wrong predictions that defendants will commit more crimes in the future. In other words, out of every ten defendants who these algorithms predict will recidivate, three to four will not. To be sure, no one knows if judges who don’t use such programs are any better at predicting recidivism (though one study, mentioned below, finds that even a random sample of laypeople is as good as the most frequently used algorithm). But the use of such programs supplies a scientific façade to these assessments that the large error rate belies.
Furthermore, as Forrest notes, these programs “are materially worse at predicting recidivism and violence for Black defendants than for white ones.” Exactly why this is so is the subject of debate, which is considerably more difficult to resolve because so much of the design of these programs is kept secret. But their greater unreliability in the case of defendants of color is not in dispute, and thus their use is likely to enhance the racially discriminatory tendencies of our criminal justice system.
Perhaps the leading case addressing the use of these algorithms in sentencing is State v. Loomis, a 2016 decision of the Supreme Court of Wisconsin. Eric Loomis, who had a prior criminal record, was charged with being the driver in a drive-by shooting, a charge he consistently denied. The state allowed him to enter into a plea bargain in which he pled guilty to two much less severe charges: operating a vehicle without the owner’s consent and attempting to flee a traffic officer. However, the state reserved the right to apprise the sentencing judge of the evidence that it believed showed that Loomis had been the driver in the drive-by shooting. And the judge was also provided with a presentence investigation report prepared by the court’s probation office that, using the COMPAS algorithm, scored Loomis as a high risk for future violent crimes.
At sentencing, the judge stated:
You’re identified, through the COMPAS assessment, as an individual who is at high risk to the community. In terms of weighing the various factors…the risk assessment tools that have been utilized suggest that you’re extremely high risk to reoffend.
The judge then sentenced Loomis to a combined six years of imprisonment for the two nonviolent crimes to which he had pled guilty.
On appeal, first to an intermediate appellate court and then to the Wisconsin Supreme Court, Loomis focused much of his argument on the refusal of the state, the probation office, and the sentencing court to provide him with even the most basic information about how COMPAS compiled his recidivism score. In response, the Wisconsin Supreme Court said:
Northpointe, Inc., the developer of COMPAS, considers COMPAS a proprietary instrument and a trade secret. Accordingly, it does not disclose how the risk scores are determined or how the factors are weighed. Loomis asserts that because COMPAS does not disclose this information, he has been denied information which the [sentencing] court considered at sentencing…. [However], although Loomis cannot review and challenge how the COMPAS algorithm calculates risk, he can at least review and challenge the resulting risk scores set forth in the report [to the judge].
For the court, that was good enough to deny Loomis’s appeal, but to me it seems like a complete non sequitur. Without knowing how the algorithm is designed, what inputs it receives, and how they are weighed, how can one possibly challenge the resulting risk scores in any given case?
While COMPAS provides a multipage user’s guide, that guide, Forrest notes, “does not disclose what machine learning it is based on, or its algorithm, precise inputs, weightings, or data sets used.” And as the Loomis case illustrates, the manufacturer of COMPAS has for the most part successfully resisted disclosing such specifics on the ground that they are legally protected “trade secrets.” As a result, we know very little about COMPAS’s designers, what choices they made, what tests were undertaken to establish COMPAS’s reliability, to what extent its methodology was independently scrutinized, what its error rate is, whether its reported error rate is accurate, whether it employs consistent standards and methods, and whether it is generally accepted in the scientific community.
This means that its admissibility at trial would be doubtful under either of the standards used for determining admissibility of expert testimony (known as the Daubert standard and the Frye standard). But COMPAS is not used at a trial. It is used for purposes like setting bail and determining sentences, where no such rigor is mandated by current law.
What we do know about COMPAS’s design, moreover, is far from reassuring. According to Northpointe, COMPAS ultimately rests on certain sociological theories of recidivism—the “Social Learning” theory, the “Sub-Culture” theory, the “Control/Restraint” theory, the “Criminal Opportunity” theory, and the “Social Strain” theory. Even a brief review of the sociological literature discloses that many of these theories are controversial, most have been only modestly tested, with mixed results, and several are inconsistent with one another. Indeed, as several peer reviews of them note, the theories are constantly being revised to account for unexpected results—a classic indication that they are not sufficiently reliable to meet the legal standards for admissibility.
Nor are we given any clue as to which of these theories the designers of COMPAS applied in their choices of data sets, algorithmic responses to those data sets, etc. But we do know that even the validation studies that COMPAS itself has chosen to disclose show an error rate of between 29 and 37 percent in predicting future violent behavior and an error rate of between 27 and 31 percent in predicting future nonviolent recidivism. In other words, according to its own disclosures (which might well be biased in its favor), COMPAS gets it wrong about one third of the time.
Worse still, COMPAS gets it wrong more often with Black defendants than with white ones. Even the court in Loomis felt obliged to note:
A recent analysis of COMPAS’s recidivism scores based upon data from 10,000 criminal defendants in Broward County, Florida, concluded that black defendants “were far more likely than white defendants to be incorrectly judged to be at a higher risk of recidivism.” Likewise, white defendants were more likely than black defendants to be incorrectly flagged as low risk.
It is, of course, impossible to determine why this is so as long as COMPAS keeps secret how its algorithm is designed. Such secrecy has not prevented Northpointe from challenging the results of the independent (and transparent) studies confirming such racial bias. Forrest, who devotes an entire chapter of her book to analyzing this debate in detail, concludes in the end that there is “no longer any real debate that COMPAS and other AI assessment tools produce different results for Black versus white offenders.” But this did not stop the Wisconsin Supreme Court, and other courts since, from allowing COMPAS scores to be used by judges as a factor in sentencing.
To be sure, the decision in Loomis tried to have it both ways by saying that even though “COMPAS can be helpful to assess an offender’s risk to public safety,” it is not to be used “to determine the severity of the sentence.” The best I can make out of this Janus-like formulation is that a sentencing judge in Wisconsin, confronted with a COMPAS score indicating a high risk of recidivism, can treat this (dubious) information as part of a sentencing mix but cannot expressly make it the factor that “determines” the length of the sentence. That may satisfy the Wisconsin Supreme Court, but to this trial judge it appears to be nothing but camouflage. If a sentencing judge, unaware of how unreliable COMPAS really is, is told that this “evidence-based” instrument has scored the defendant as a high recidivism risk, it is unrealistic to suppose that she will not give substantial weight to that score in determining how much of the defendant’s sentence should be weighted toward incapacitation.
Of course, one might argue that there is still a use for COMPAS if judges assessing recidivism without COMPAS get it wrong even more often than COMPAS does. But while there is no real way to study this with certainty, a well-received study by Julia Dressel and Hany Farid (professors of computer science at Dartmouth) strongly suggests otherwise. The study showed that a random sample of everyday people from a popular online website “are as accurate and fair as COMPAS at predicting recidivism.”2 And while the databases COMPAS uses are largely shrouded in secrecy, the study also found that “although COMPAS may use up to 137 features to make a prediction, the same predictive accuracy can be achieved with only two features,” namely, age and prior criminal history—the same two features that judges have traditionally used to predict recidivism.
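The force of that finding is easier to see with a small sketch. The code below fits a bare two-feature classifier of the general kind Dressel and Farid describe, a logistic regression on age and number of prior convictions. The records and labels here are invented for illustration only; they are not the study’s data, and the model is not COMPAS.

```python
# Illustrative only: a two-feature recidivism predictor (age, prior convictions)
# fit with logistic regression. The training data below are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical records: [age, prior_convictions]
X = np.array([
    [19, 4], [22, 3], [25, 5], [30, 1], [35, 0],
    [41, 2], [48, 0], [52, 1], [23, 6], [60, 0],
])
# 1 = rearrested within two years, 0 = not (again, invented labels)
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])

model = LogisticRegression().fit(X, y)

# Score a new, hypothetical defendant: 24 years old, 2 prior convictions.
print(model.predict([[24, 2]]))        # predicted class (0 or 1)
print(model.predict_proba([[24, 2]]))  # predicted probabilities
```

The point is not that such a model should be used, but that, according to the study, something this simple matches COMPAS’s predictive accuracy, which is roughly what a judge looking at age and criminal history has always done informally.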
COMPAS and other such recidivism-predicting algorithms are used mostly in state (as opposed to federal) courts. This is partly because the National Center for State Courts has, since 2015, encouraged such use in order to make bail and sentencing determinations more “data-driven”—a mantra that entirely begs the question of whether these tools are accurate, unbiased, and reliable. But even the federal government uses an algorithm it has developed, called Post Conviction Risk Assessment (PCRA), for such limited purposes as determining which defendants should be subject to special scrutiny while on probation.
Nonetheless, Forrest notes, “studies of PCRA demonstrate that its current algorithmic design results in race discrepancies: predictive accuracy is higher for white than Black offenders.” So why are we still using it, even for such limited purposes?
More broadly, the fundamental question remains: Even if these algorithms could be made much more accurate and less biased than they currently are, should they be used in the criminal justice system in determining whom to lock up and for how long? My own view is that increasing a defendant’s sentence of imprisonment on the basis of hypothesized future crimes is fundamentally unfair. But if “incapacitation” should be taken into consideration, I worry that much better algorithms than we currently have will perversely cause judges to place undue emphasis on incapacitation, at the expense of alternatives to prison that might serve to make defendants better citizens and less likely to commit future crimes.