Can big data fine-tune prison sentences? At their best, risk-assessment algorithms check decision-maker biases and reduce delays. But under a veneer of neutrality, automated risk scores may themselves hide biases, further entrenching inequality. The U.S. Supreme Court recently refused to take a stand on risk-assessment scores, leaving us to speculate about how exactly constitutional protections apply to these complex proprietary technologies.
Risk assessment tools aim to divert low-risk offenders from hard jail time. Prioritizing data over intuition, evidence-based research seeks to eliminate subjectivity and bias and to focus on what works to reduce criminality. Concretely, risk assessment software compiles information on offenders and sorts them into risk groups. In the U.S., such tools are used at various stages of the criminal justice system, from bail to sentencing to parole release. Some states even require them. Relying on statistics, the controversial software essentially approximates the likelihood that people with a history similar to the offender’s will reoffend.
COMPAS, short for Correctional Offender Management Profiling for Alternative Sanctions, is a popular evidence-based risk assessment tool measuring offenders’ treatment needs and recidivism risk. Among other factors, its algorithm considers prior arrest history, parole violations, family history and social support. Problematically, some factors, such as growing up in a single-parent household and employment status, act as indirect proxies for race and class. COMPAS also measures “criminal attitudes” with questions such as:
– To get ahead in life you must always put yourself first
– When things are stolen from rich people they won’t miss the stuff because insurance will cover the costs
– The law doesn’t help average people
Subsequent validation studies reveal a mixed record. Researchers have questioned the underlying simplification of the logistic regression model. ProPublica stresses the tool’s disparate impact on minorities, with more false positives for black than for white defendants. In other words, a larger percentage of black offenders who will not go on to commit another crime are deemed high risk, depriving them of diversion opportunities and relegating them to hard time behind bars. Northpointe Inc. (the company behind COMPAS, now known as Equivant) retorts that this is inevitable because black defendants have a higher overall recidivism rate. It also argues that its algorithm is fair because a score of, say, 7 means the same likelihood of re-offense regardless of race. Competing, mutually exclusive definitions of fairness underscore the challenge of crafting fair algorithms.
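The tension between the two positions can be illustrated with a little arithmetic. The sketch below is not COMPAS’s model and uses made-up numbers. It assumes two hypothetical groups for which a high-risk label is equally reliable (same positive predictive value) and equally likely to miss actual reoffenders (same false negative rate). Under those assumptions, the false positive rate follows from each group’s base rate of recidivism, so a higher base rate mechanically produces more false positives.

```python
def implied_false_positive_rate(base_rate, ppv, fnr):
    """False positive rate implied by a group's base rate, once the
    positive predictive value (ppv) and false negative rate (fnr) are fixed.

    Derived from the definitions of the confusion-matrix rates:
        FPR = base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# Hypothetical figures chosen for illustration only (not COMPAS data).
PPV = 0.6   # share of "high risk" labels that turn out correct, same for both groups
FNR = 0.3   # share of actual reoffenders labelled low risk, same for both groups

fpr_higher_base_rate = implied_false_positive_rate(base_rate=0.5, ppv=PPV, fnr=FNR)
fpr_lower_base_rate = implied_false_positive_rate(base_rate=0.4, ppv=PPV, fnr=FNR)

print(f"FPR, group with 50% base rate: {fpr_higher_base_rate:.3f}")
print(f"FPR, group with 40% base rate: {fpr_lower_base_rate:.3f}")
```

With these illustrative figures, the group with the higher base rate ends up with a false positive rate of roughly 47 percent against roughly 31 percent for the other group, even though a high-risk label is equally reliable in both. Equalizing the false positive rates would mean giving up one of the other two equalities.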
As scholars debate algorithmic fairness, lawyers take the constitutionality of risk scores to court. In 2004, the ACLU contested the constitutionality of using Virginia’s Sex Offender Risk Assessment to guide the sentencing of sex offenders. The succinct decision did not address the issue. In 2010, the Indiana Supreme Court affirmed a sentence that considered the results of another risk assessment tool, unimpressed by concerns over its scientific reliability.
More recently, the Wisconsin Supreme Court affirmed the consideration of COMPAS at sentencing in Loomis v. Wisconsin. Eric Loomis pleaded guilty to attempting to flee a traffic officer and operating a vehicle without the owner’s consent. The dismissed charges of first-degree recklessly endangering safety, possession of a firearm by a felon and possession of a short-barreled shotgun or rifle were read in at sentencing. The Circuit Court sentenced him to five years of imprisonment and six years of extended supervision. It then denied a post-conviction motion invoking due process arguments for resentencing. The Wisconsin Court of Appeals certified a question for the Wisconsin Supreme Court to address COMPAS-related due process issues. The latter affirmed the post-conviction decision. In the final chapter, the U.S. Supreme Court denied the defendant’s petition for a writ of certiorari, declining to hear the case on the merits. The Supreme Court of Wisconsin’s decision on the certification thus represents the latest development in the discussion about risk-assessment software at sentencing. Although its clout is limited, the reasoning highlights several shortcomings in analyzing computer-generated risk scores.
The Court of Appeals certified the following question:
Whether the use of a COMPAS risk assessment at sentencing violates a defendant’s right to due process, either because the proprietary nature of COMPAS prevents defendants from challenging the COMPAS assessment’s scientific validity, or because COMPAS assessments take gender into account.
The Wisconsin Supreme Court instead answers a different set of questions:
(1) it [COMPAS] violates a defendant’s right to be sentenced based upon accurate information, in part because the proprietary nature of COMPAS prevents him from assessing its accuracy;
(2) it violates a defendant’s right to an individualized sentence; and
(3) it improperly uses gendered assessments in sentencing.
Reframing the first question in narrower terms severely undercuts meaningful discussion about due process in the context of third-party proprietary software. The court shifts from a general question about opaque software to a specific challenge on accuracy. On the latter point, the Court determines that the defendant can review and challenge the input questions as well as the output risk score’s accuracy. Distinguishing from case law where the defendant had no opportunity to refute confidential information, it deems the opportunity to review the input and output of the algorithm congruent with due process principles.
Analyzing the score’s input and output avoids the germane due process issue. The Court reasons that recidivism scores are based on public data, affording the defendant sufficient opportunity to verify the accuracy of the original information and the final output score. But focusing on input and output accuracy sidesteps the key issue: does outsourcing risk evaluation to private companies jeopardize due process when trade secrets shield proprietary algorithms from scrutiny? Given the opaque scientific validity of proprietary algorithmic risk assessments, the Gordian knot lies in the methodology underpinning the score. Indeed, the pressing concern is how COMPAS got from the questionnaire answers (the input) to the risk score (the output).
The company has invoked trade secret protection to avoid explaining its statistical methodology and the weight of each input factor. Access to the raw data and output does little to provide insight into the process for reaching the result. Just as defendants can cross-examine an expert’s methodology and credentials to test the probative value of the prosecution’s evidence, meaningful protection entitles them to probe the logistic regression analysis, the size and demographics of the sample population, and the protocols for correcting adverse findings. Those are but some ways to ascertain the scientific validity of the software.
Some states have taken a different route, sharing their in-house methodology. For instance, Ohio, Indiana and Missouri explain how each element factors into the final score. The predictive policing software HunchLab also favors transparency. So far, there is little evidence that end users game the system or that companies free ride on these open tools.
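A toy example shows why input and output alone reveal so little. The sketch below assumes a bare logistic regression with three hypothetical yes/no questionnaire items and invented weights; nothing here reflects COMPAS’s actual factors or coefficients. Two models that weigh the answers very differently return the same score for one defendant, so checking that defendant’s answers and score cannot tell the models apart.

```python
import math


def risk_probability(weights, bias, answers):
    """Logistic regression: weighted sum of the answers squashed into (0, 1)."""
    z = bias + sum(w * a for w, a in zip(weights, answers))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical items: [prior arrest, unemployed, unstable housing], coded 1 = yes, 0 = no.
# Both weight sets are invented for illustration.
MODEL_A = ([1.0, 0.0, 0.5], -1.5)   # ignores employment entirely
MODEL_B = ([0.5, 0.5, 0.5], -1.5)   # weighs all three items equally

defendant = [1, 1, 1]          # answers "yes" to all three items
second_defendant = [1, 0, 1]   # same, except currently employed

score_a = risk_probability(*MODEL_A, defendant)
score_b = risk_probability(*MODEL_B, defendant)
second_a = risk_probability(*MODEL_A, second_defendant)
second_b = risk_probability(*MODEL_B, second_defendant)

print(f"First defendant:  model A {score_a:.3f}, model B {score_b:.3f}")
print(f"Second defendant: model A {second_a:.3f}, model B {second_b:.3f}")
```

Both models give the first defendant a probability of 0.5, yet they disagree on the second defendant (0.5 versus about 0.38) because only one of them counts unemployment. Which factors drive a score, and by how much, only becomes visible once the weights themselves are disclosed.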
At any rate, a rigid application of due process protections developed for analog evidence is ill suited to computer-mediated evidence. The traditional protection prevents reliance on false information at sentencing without an opportunity to correct it. By the Court’s own recognition, the protection is context-sensitive: “[d]ue process is flexible and calls for such procedural protections as the particular situation demands”.
The Court fails to apply its general statement on context-sensitive due process to the particular features of algorithmic risk assessment. A contextual approach would have led it to abandon its analysis of input and output and to focus on the process. Interestingly, the U.S. brief before the Supreme Court recognized this issue:
[A] court’s use of a risk assessment based on an undisclosed scoring methodology creates at least the possibility not only of scoring error, but of a flawed actuarial approach that a defendant cannot effectively counter through other types of evidence.
Transposing the spirit of due process to an algorithmic context requires disclosure of how the end result was achieved. Going back to the foundations of the State’s obligation to disclose evidence, the information asymmetry between the State and the defendant undergirds this safeguard. Without information about how the software computes risk, a defendant’s ability to challenge the score’s accuracy is symbolic at best. In that sense, it was open to the Court to apply the case law cited above with the necessary adjustments.
Worryingly, neither the prosecutor nor the judge grasps the intricacies of risk assessment software. It appears that the parties proceed on a leap of faith. But a more searching stance towards algorithmic evidence at sentencing faces two hurdles.
The first is a general issue with the admissibility of evidence at sentencing. Evidentiary rules do not apply at sentencing. The case law distinguishes between robust standards at the guilt-finding stage, where fact-finders decide whether the government has proved a defendant’s guilt beyond a reasonable doubt, and a more flexible approach for admitting facts at the sentencing stage. Simply put, assessing the defendant’s character for the purpose of sentencing justifies using all available information. Therefore, defendants cannot challenge evidence for compliance with the Daubert standard. In a Daubert-type inquiry, the judge tests the scientific validity of the underlying methodology on a balance of probabilities. In other words, the judge will admit evidence only if she finds it more likely than not to be scientifically sound.
A lower evidentiary threshold at sentencing hardly comports with the stakes for the defendant. From a functional perspective, the right not to be deprived of liberty except by due process of law is most acute at the sentencing stage. A guilty verdict is abstract; time served is real. That is even more so considering the structural pressures promoting guilty pleas.
Moreover, using all available information for sentencing in the name of flexibility does little to justify admitting unsound risk scores. Unlike direct information such as prior criminal history, family ties or employment history, a risk score is derived information, itself the result of a micro-adjudication analogous to a psychiatric diagnosis. In that sense, the problematic admissibility of derived evidence predates computer-generated risk assessments.
The second hurdle pertains to cognitive bias towards empirical scores. Neat numerical risk scores lend a patina of neutrality to the assessments. Research suggests that irrespective of its scientific validity, empirical evidence colors subjective evaluations:
Individuals tend to weigh purportedly expert empirical assessments more heavily than nonempirical evidence — which might create a bias in favor of COMPAS assessments over an offender’s own narrative. Research suggests that it is challenging and unusual for individuals to defy algorithmic recommendations. Behavioral economists use the term “anchoring” to describe the common phenomenon in which individuals draw upon an available piece of evidence — no matter its weakness — when making subsequent decisions.
Discretionary sentencing is particularly vulnerable to this type of distortion, both at the level of admissibility and probative value of evidence.
Albeit unsatisfactorily, the Court attempts to mitigate the uncertainty surrounding COMPAS’s scientific validity. It requires pre-sentencing investigation reports to include the following caveats:
(1) the proprietary nature of COMPAS has been invoked to prevent disclosure of information relating to how factors are weighed or how risk scores are to be determined;
(2) risk assessment compares defendants to a national sample, but no cross-validation study for a Wisconsin population has yet been completed;
(3) some studies of COMPAS risk assessment scores have raised questions about whether they disproportionately classify minority offenders as having a higher risk of recidivism; and
(4) risk assessment tools must be constantly monitored and re-normed for accuracy due to changing populations and subpopulations.
Merely stating these warnings does little to ensure due process. The Court missed an opportunity to answer concerns about opaque process and scientific validity. In the past, courts have authoritatively addressed the role of other technologies in the criminal justice system, ranging from polygraphs to DNA tests. Granted, the facts of this particular case may not be the most conducive; the partially available record suggests that the judge alluded to the risk score in passing. However, the score’s admissibility is a question of law severable from the particular facts on the record. For the time being, more litigants will be sentenced with risk assessments taken at face value.
As for the requirement that a sentence be individualized, the bench opines that risk scores provide a fuller picture of the defendant by identifying whether he belongs to a group of high risk offenders. While the Court is clear that a risk score does not predict individual re-offense risk, it fails to follow through on the implications of considering group generalizations at sentencing.
The Court’s shy caution about the dangers of group approximation falls short of addressing group-based suspicion. The Court clarifies that COMPAS does not predict future individual behavior:
[B]ecause COMPAS risk assessment scores are based on group data, they are able to identify groups of high-risk offenders——not a particular high-risk individual. Accordingly, a circuit court is expected to consider this caution as it weighs all of the factors that are relevant to sentencing an individual defendant.
It is worth noting in passing that sentencing does not purport to preemptively punish future crimes. At any rate, the Court is quick to forget its own caveat as it describes permissible COMPAS use:
COMPAS can be helpful to assess an offender’s risk to public safety and can inform the decision of whether the risk of re-offense presented by the offender can be safely managed and effectively reduced through community supervision and services.
The phrase “the risk of re-offense presented by the offender” reveals a slippage from the general likelihood for similar defendants to the specific risk of individual re-offense, exemplifying the cognitive bias explained above.
Moreover, the Court does not appear cognizant of problematic group metrics. COMPAS predicts the likelihood of future rearrests, not future criminal activity. Because minorities are more heavily policed, high-risk groups are disproportionately made up of minorities. Insofar as the score reinforces group stereotypes rather than assessing individuals, why would the court consider it at all?
Belonging to a statistically high-risk group means the odds are stacked against a defendant. But it does not follow that she should bear the burden of her circumstances. In fact, considering static factors like family criminality and a single-parent upbringing essentially punishes disenfranchised populations for systemic inequality. High-risk classification leads to longer sentences, which in turn increase the likelihood of re-offense. In that sense, risk scores thwart the possibility for outliers to rise above their situation. Instead of amplifying group discrimination with harsher sentences, risk scores could assist service delivery upstream, targeting high-risk groups with better schools, housing and social services before criminality manifests itself as a symptom of vulnerability.
Relatedly, risk scores stratify society. Including static factors over which defendants have no agency restricts future opportunities in ways that hardly comport with American ideals. These deterministic tools thwart the prospect of better circumstances in return for compliance with parole conditions and good behavior while incarcerated. What is an inmate’s incentive for good behavior when his background will score him into a high-risk group, thus preventing early parole release? Sorting offenders by social status fuels cynicism about equal opportunities, legitimates antisocial behavior and ultimately justifies checking out of the system altogether. Absent the prospect of mobility, social peace further dwindles as unrest simmers.
Finally, the Court endorses COMPAS’s consideration of gender. In its view, including gender for statistical norming, a concept it hardly explains, promotes accuracy and does not infringe the due process right not to be sentenced on the basis of gender. Gender ensures accuracy given the “statistical evidence that men, on average, have higher recidivism and violent crime rates compared to women”. The Court goes on to state that “if the inclusion of gender promotes accuracy, it serves the interests of institutions and defendants, rather than a discriminatory purpose”. Yet accuracy and discrimination are not mutually exclusive. While this case is framed in due process terms, equal protection arguments may gain traction in further litigation.
For argument’s sake, let’s replace gender with race. If statistical evidence suggests that blacks reoffend more than whites, should COMPAS factor in race? While both race and gender are protected attributes, this example is more shocking because it involves a historically subordinated group scoring higher on the risk scale. Protected groups may indeed exhibit higher recidivism rates that correlate with past inequality. Promoting anti-discrimination may therefore require accuracy trade-offs and innovative de-biasing solutions.
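Since the decision leaves norming largely unexplained, a generic illustration may help. In psychometrics, norming usually means reporting a raw score relative to a reference group, for instance as a decile. The sketch below assumes that meaning and uses invented reference scores; whether COMPAS norms by gender in exactly this way is not disclosed in the sources discussed here.

```python
from bisect import bisect_right


def decile_within_norm_group(raw_score, norm_group_scores):
    """Decile (1 to 10) of a raw score relative to a reference ("norm") group.

    The decile is the share of the norm group scoring at or below the raw
    score, rounded up to the next tenth.
    """
    ranked = sorted(norm_group_scores)
    at_or_below = bisect_right(ranked, raw_score)
    # Integer ceiling of 10 * at_or_below / len(ranked), avoiding float rounding.
    decile = -(-10 * at_or_below // len(ranked))
    return max(1, decile)


# Invented reference groups: one whose raw scores run higher, one lower.
HIGHER_SCORING_NORM = list(range(10, 30))   # hypothetical raw scores 10..29
LOWER_SCORING_NORM = list(range(0, 20))     # hypothetical raw scores 0..19

raw_score = 15
decile_vs_higher = decile_within_norm_group(raw_score, HIGHER_SCORING_NORM)
decile_vs_lower = decile_within_norm_group(raw_score, LOWER_SCORING_NORM)

print(f"Raw score {raw_score}: decile {decile_vs_higher} against the higher-scoring group")
print(f"Raw score {raw_score}: decile {decile_vs_lower} against the lower-scoring group")
```

In this toy setup, the same raw score of 15 lands in decile 3 against the higher-scoring reference group and decile 8 against the lower-scoring one. The choice of norm group therefore changes the number a judge sees, which is why the question of who is compared with whom is not a purely technical one.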
Have the cake and eat it too
At the outset, the Court states that risk scores can’t determine sentences. However, the scores can corroborate other findings, as is the case here. The sentencing judge made a passing observation about the score, reinforcing her assessment of other relevant factors. The partially available transcript suggests the score was indeed mentioned in one phrase after elaborating on three other factors. But when exactly do courts cross the threshold from consideration to reliance? How can reviewing courts retrospectively evaluate if a factor was determinative or simply considered? Abrahamson J.’s concurring opinion stresses the importance of sufficient reasons, which help reviewing courts get a sense of the impact of risk scores in the overall reasoning. In any case, the difference between considering (acceptable) and relying (unacceptable) on a risk score likely lies on a spectrum rather than a clear-cut binary distinction. Absent workable criteria, this uncertainty is bound to generate further litigation.
Furthermore, courts have structural incentives to rely on risk scores. Pre-processed information appeals to first instance courts plagued by taxing caseloads. From a pragmatic standpoint, structural pressures for increased efficiency likely increase reliance on risk scores. In that sense, the scores optimize incarceration rather than participate in a more transformative effort to decarcerate.
The majority denied Northpointe Inc. the opportunity to intervene. Abrahamson J. deplores this decision in her concurring opinion:
[T]his court’s lack of understanding of COMPAS was a significant problem in the instant case. At oral argument, the court repeatedly questioned both the State’s and defendant’s counsel about how COMPAS works. Few answers were available.
Given COMPAS’s mixed reviews, the court needed “all the help it could get” to anchor its legal analysis in a sound understanding of the tool. The company might have disclosed information about its methodology, alleviating due process concerns. In the alternative, its trade secret arguments could have been tested in adversarial proceedings.
Looking ahead, risk scores are here to stay. User-friendly bar charts promise to alleviate crushing caseloads for judges, corrections officers and prison wardens in the U.S. and beyond. In Canada, the pressure for expedient justice has become more salient since the Jordan decision set presumptive ceilings on the time it may take to bring an accused to trial. Streamlining criminal justice is certainly laudable, but not at the expense of individualized, fair sentences.
Risk scores are also spreading to other enforcement areas. While predictive policing is well established, it may reach a new level as it combines with facial recognition to produce real-time risk scores assisting law enforcement on the ground. Homeland Security is also seeking to outsource an Extreme Vetting Initiative to automate immigration processing. What’s more, emerging machine learning algorithms further complicate the opacity issue. Yet criminal law remains a prime area for litigating risk scores, since it affords relatively robust constitutional protections. Activists, scholars and lawyers should remain alert to “promising” appellate case law. It is only a matter of time (spent behind bars, for some) before the Supreme Court deems the issue ripe for a determination on the merits.
 Sonja B. Starr, Evidence-Based Sentencing and the Scientific Rationalization of Discrimination, 66 Stan. L. Rev. 803 (2014), 813 note 29.
 Tracy L. Fass, et al., The LSI-R and the Compas: Validation Data on Two Risk-Needs Tools, 35 Crim. Justice & Behavior 1095, 1100-01 (2008); Jennifer L. Skeem & Jennifer Eno Louden, Assessment of Evidence on the Quality of Correctional Offender Management Profiling for Alternative Sanctions (COMPAS), Prepared for the California Department of Corrections and Rehabilitation (CDCR) (2007); Sharon Lansing, New York State COMPAS-Probation Risk and Need Assessment Study: Examining the Recidivism Scale’s Effectiveness and Predictive Accuracy, N.Y. Div. Crim. Justice Servs., Office of Justice Research & Performance (Sept. 2012); Danielle Keats Citron & Frank Pasquale, The Scored Society: Due Process for Automated Predictions, 89 Wash. L. Rev. 1 (2014).
 Wisconsin supra, note 3 (par. 34).
 Ibid. par. 53.
 Ibid. par. 53-55.
 Michael A. Wolff, Missouri’s Information-Based Discretionary Sentencing System, 4 Ohio St. J. Crim. L. 95, 113 (2006).
 Gardner v. Florida, 430 U.S. 349, 351 (1977) (plurality opinion); State v. Skaff, 152 Wis. 2d 48, 53, 447 N.W.2d 84 (Ct. App. 1989).
 Wisconsin supra, note 3, note 22. Citing Schweiker v. McClure, 456 U.S. 188, 200 (1982).
 State v. Straszkowski, 2008 WI 65; State v. Arredondo, 2004 WI App 7; Wis. Stat. § (Rule) 911.01(4)(c).
 Daubert v. Merrell Dow Pharm., Inc., 509 U.S. 579 (1993).
 Wisconsin, supra, note 5, par. 74.
 Ibid., par. 31.
 Ibid., par. 90.
 Ibid., par. 78.
 Ibid., par. 83.
 See Solon Barocas & Andrew D. Selbst, Big Data’s Disparate Impact, 104 Calif. L. Rev. 671, 691 (2016): some proxies for race also reflect actual work performance, because of historical disparities.
 Wisconsin supra, note 3, par. 17, 44, 88.
 Ibid., par. 106-107.
 Cecelia Klingele, The Promises and Perils of Evidence-Based Corrections, 91 Notre Dame L. Rev. 537, 554 (2015).
 Wisconsin supra, note 3, par. 133.
 Ibid., par. 136.
 R. v. Jordan, [2016] 1 SCR 631, 2016 SCC 27.