Spain has become reliant on an algorithm to score how likely a domestic violence victim may be abused again and what protection to provide — sometimes leading to fatal consequences.
I really have a hard time deciding if that is the scandal the article makes it out to be (although there is some backpedaling going on). The crucial point is: 8% of the decisions turn out to be wrong or misjudged. The article seems to want us to think that the use of the algorithm is to blame. Yet, is it? Is there evidence that a human would have judged those cases differently?
Is there evidence that the algorithm does a worse job than humans? If not, then the article devolves into blatant fearmongering, and the message turns from “algorithm is to blame for deaths” into “algorithm unable to predict the future in 100% of cases”, which of course it can’t…
The article mentions that one woman (Stefany González Escarraman) went for a restraining order the day after the system deemed her at “low risk” and the judge denied it referring to the VioGen score.
One was Stefany González Escarraman, a 26-year-old living near Seville. In 2016, she went to the police after her husband punched her in the face and choked her. He threw objects at her, including a kitchen ladle that hit their 3-year-old child. After police interviewed Ms. Escarraman for about five hours, VioGén determined she had a negligible risk of being abused again.
The next day, Ms. Escarraman, who had a swollen black eye, went to court for a restraining order against her husband. Judges can serve as a check on the VioGén system, with the ability to intervene in cases and provide protective measures. In Ms. Escarraman’s case, the judge denied a restraining order, citing VioGén’s risk score and her husband’s lack of criminal history.
About a month later, Ms. Escarraman was stabbed by her husband multiple times in the heart in front of their children.
It also says:
Spanish police are trained to overrule VioGén’s recommendations depending on the evidence, but accept the risk scores about 95 percent of the time, officials said. Judges can also use the results when considering requests for restraining orders and other protective measures.
You could argue that the problem isn’t so much the algorithm itself as it is the level of reliance upon it. The algorithm isn’t unproblematic though. The fact that it just spits out a simple score: “negligible”, “low”, “medium”, “high”, “extreme” is, IMO, an indicator that someone’s trying to conflate far too many factors into a single dimension. I have a really hard time believing that anyone knowledgeable in criminal psychology and/or domestic abuse would agree that 35 yes or no questions would be anywhere near sufficient to evaluate the risk of repeated abuse. (I know nothing about domestic abuse or criminal psychology, so I could be completely wrong.)
Apart from that, I also find this highly problematic:
[The] victims interviewed by The Times rarely knew about the role the algorithm played in their cases. The government also has not released comprehensive data about the system’s effectiveness and has refused to make the algorithm available for outside audit.
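To make the "too many factors in a single dimension" worry concrete, here is a toy sketch. It is not VioGén: the article says the algorithm has not been opened to outside audit, so the questions, weights and thresholds below are all made up for illustration.

```python
# Toy illustration only. VioGén's real questions, weights and thresholds
# are not public; every name and number here is hypothetical.
WEIGHTS = {
    "strangulation": 5,
    "threats": 2,
    "jealousy": 1,
    "unemployed": 1,
    "alcohol": 1,
}

# (inclusive upper bound of the summed score, label)
BANDS = [(2, "negligible"), (4, "low"), (6, "medium"), (8, "high")]


def risk_label(answers):
    """Sum the weights of every 'yes' answer and map the total to one label."""
    score = sum(WEIGHTS[question] for question, yes in answers.items() if yes)
    for upper_bound, label in BANDS:
        if score <= upper_bound:
            return label
    return "extreme"


# Two very different situations...
case_a = {"strangulation": True}
case_b = {"threats": True, "jealousy": True, "unemployed": True, "alcohol": True}

# ...end up with the same single label.
print(risk_label(case_a), risk_label(case_b))
```

Both cases sum to the same total, so whoever only sees the label cannot tell a single severe indicator from several milder ones.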
The article mentions that one woman (Stefany González Escarraman) went for a restraining order the day after the system deemed her at “low risk” and the judge denied it referring to the VioGen score.
The judge should be in jail for that. And if the judge thinks the “system” can do his job, then he should quit, as he is clearly useless.
Since 2007, about 0.03 percent of Spain’s 814,000 reported victims of gender violence have been killed after being assessed by VioGén, the ministry said. During that time, repeat attacks have fallen to roughly 15 percent of all gender violence cases from 40 percent, according to government figures.
“If it weren’t for this, we would have more homicides and gender-based violence,” said Juan José López Ossorio, a psychologist who helped create VioGén and works for the Interior Ministry.
So no, not a scandal, it seems it is helping, but perhaps could be better. At least that’s my read.
The crucial point is: 8% of the decisions turn out to be wrong or misjudged.
The article says:
Yet roughly 8 percent of women who the algorithm found to be at negligible risk and 14 percent at low risk have reported being harmed again, according to Spain’s Interior Ministry, which oversees the system.
Granted, neither “negligible” nor “low risk” means “no risk”, but I think 8% and 14% are far too high for those categories.
Furthermore, there’s this crucial bit:
At least 247 women have also been killed by their current or former partner since 2007 after being assessed by VioGén, according to government figures. While that is a tiny fraction of gender violence cases, it points to the algorithm’s flaws. The New York Times found that in a judicial review of 98 of those homicides, 55 of the slain women were scored by VioGén as negligible or low risk for repeat abuse.
So in the 98 murders they reviewed, the algorithm put more than 50% of them at negligible or low risk for repeat abuse. That’s a fucking coin flip!
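For anyone who wants to check the arithmetic, here is a quick sketch using only figures quoted from the article in this thread (55 of 98 reviewed homicides, 247 deaths, 814,000 reported victims). The variable names are my own.

```python
# Figures as quoted from the article in this thread.
homicides_reviewed = 98
scored_negligible_or_low = 55
killed_after_assessment = 247
reported_victims = 814_000

# Share of the reviewed homicides where the victim had been scored negligible or low.
share = scored_negligible_or_low / homicides_reviewed

# Fraction of all reported victims who were killed after being assessed.
fraction = killed_after_assessment / reported_victims

print(f"negligible/low among reviewed homicides: {share:.1%}")
print(f"killed after assessment: {fraction:.3%}")
```

The first number is a share of the reviewed homicides, not the chance that a negligible or low score ends in a homicide. The second lines up with the "about 0.03 percent" the ministry gave.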
You’ll get that result without an algorithm as well, unfortunately. A domestic violence interview often doesn’t get you the truth of what happened, because the victim is often economically and emotionally dependent on their partner. It’s helpful to have an algorithm that makes you ask the right questions, but there’s still no way I know of to get the right answers to those questions from a victim 100 percent of the time.
Odd. I replied to this comment, but now my reply is gone. Gonna try again and type up as much as I can remember.
Regardless, an algorithm expecting binary answers will obviously not take para- and extralinguistic cues into account. That extra 50 ms hesitation, the downwards glance and the voice cracking when answering “no” to “has he ever tried to strangle you before?” has a reasonable chance to get picked up by a human, but when reducing it to something that the algorithm can handle, it’s just a simple “no”. Humans are really good at picking up on such cues, even if they aren’t consciously aware that they’re doing it, but if said humans are preoccupied with staring into a computer screen in order to input the answers to the questionnaire, then there’s a much higher chance that they’ll miss them too. I honestly only see negatives here.
It’s helpful to have an algorithm that makes you ask the right questions […]
Arguably a piece of paper could solve that problem.
Seriously. 55 victims out of the 98 homicide cases sampled were deemed at negligible or low risk. If a non-algorithm-assisted department presented those numbers, I’d expect them to be looking for new jobs real fast.
I think beyond that it’s purely the failure of the interviewer and not the tool. I think getting rid of the tool will just leave you with shitty interviewers and back to the same situation as you had before.
I’ve given plenty of algorithm-driven assessments myself, though mine are generally much shorter and the weights on the questions much simpler (plus I know the actual reasons behind the weight of my questions and why I’m asking them). You can always intervene when someone’s lying and redirect them, and you can override the algorithm, just as this Spanish policy allows. Lazy judges and police will exist without the tool.
It might be helpful for the tool to include a label that the interviewer thinks the result is unreliable due to the evasiveness of the interviewee, if only to show where the problems are coming from.
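A rough sketch of what such a label could look like, purely as an assumption about how one might model it. The `Assessment` class and its field names are hypothetical and have nothing to do with how VioGén actually stores results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Assessment:
    """Hypothetical record: the tool's label plus the interviewer's own doubt."""

    risk_label: str
    interviewer_doubts_answers: bool = False
    interviewer_note: Optional[str] = None

    def summary(self) -> str:
        # Surface the doubt next to the score so a judge sees both together.
        if self.interviewer_doubts_answers:
            note = self.interviewer_note or "no note given"
            return f"{self.risk_label} (flagged unreliable: {note})"
        return self.risk_label


flagged = Assessment("low", interviewer_doubts_answers=True,
                     interviewer_note="evasive answers")
print(flagged.summary())
```

Keeping the flag separate from the score means the score itself stays untouched, while anyone reading the result downstream can see that the person in the room did not trust the inputs.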
From those quotes, it looks like Idiocracy.
i could say a lot in response to your comment about the benefits and shortcomings of algorithms (or put another way, screening tools or assessments), but i’m tired.
i will just point out this, for anyone reading.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2573025/
i am exceedingly troubled that something which is commonly regarded as indicating very high risk when working with victims of domestic violence was ignored in the cited case (disclaimer - i haven’t read the article). if the algorithm fails to consider history of strangulation, it’s garbage. if the user of the algorithm did not include that information (and it was disclosed to them), or keyed it incorrectly, they made an egregious error or omission.
i suppose, without getting into it, i would add - 35 questions (ie established statistical risk factors) is a good amount. large categories are fine. no screening tool is totally accurate, because we can’t predict the future or have total and complete understanding of complex situations. tools are only useful to people trained to use them and with accurate data and inputs. screening tools and algorithms must find a balance between accurate capture and avoiding false positives.
Could a human have judged it better? Maybe not. I think a better question to ask is, “Should anyone be sent back into a violent domestic situation with no additional protection, no matter the calculated risk?” And as someone who has been on the receiving end of that conversation and later narrowly escaped a total-family-annihilation situation, I would say no…no one should be told that, even though they were in a terrifying, life-threatening situation, they will not be provided protection, and no further steps will be taken to keep them from being injured again, or from being killed next time. But even without algorithms, that happens constantly…the only thing the algorithm accomplishes is that the investigator / social worker / etc doesn’t have to have any kind of personal connection with the victim, so they don’t have to feel some kind of way for giving an innocent person a death sentence because they were just doing what the computer told them to.
Final thought: When you pair this practice with the ongoing conversation around the legality of women seeking divorce without their husband’s consent, you have a terrifying and consistently deadly situation.
Yep. The ones who manage to slip notes to their veterinarian to help them get away are the exception.
Reading stuff like this makes me sick. All is not well with the world.
This even works for people pulling the trigger. Following orders, sed lex dura lex, et cetera ad infinitum.
Yep! For all the psych nerds, it’s pretty much a direct lift of the Milgram Shock Experiment
Thank you, this is why I came to the Fediverse from Reddit.
An algorithm is never to blame. Some pencil-necked desk jockey decided the criteria for getting help that were used to create the algorithm; the blame is entirely on them.
That said, I doubt it would make any difference if a human was in the loop. An algorithm is still an algorithm, even if it’s applied by a human. We usually just call that a “policy” though. People were being murdered by the paper sea for decades before we started calling it “algorithms”.
It reminds me of the debate around self driving cars. Tesla has a flawed implementation of self driving tech, that’s trying to gather all the information it needs through camera inputs vs using multiple sensor types. This doesn’t always work, and has led to some questionable crashes where it definitely looks like a human driver could have avoided the crash.
However, even with Tesla’s flawed self-driving, they’re supposed to have far fewer wrecks than human drivers. According to Tesla’s safety report, Teslas in self-driving mode average 5-6 million miles per accident vs 1-1.5 million miles for Tesla drivers not using self-driving (the US average is 500-750k miles per accident).
So a system like this doesn’t have to be perfect to do a far better job than people can, but that doesn’t mean it won’t feel terrible for the unlucky people who things go poorly for.
Wow Tesla said that Tesla was safe!?!? This changes everything.
That report fails to take into account that the Cybertruck is already a wreck when it rolls off the assembly line.
Unfortunately, this is bad statistics.
The Teslas in self-driving mode tend to be used on main roads, and most accidents per mile happen on the small side streets. People are also much safer where Teslas are driven than these statistics suggest.
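Here is a small sketch of why the road mix matters. All numbers are invented for illustration and are not real Tesla or US figures. It assumes self-driving and manual driving are exactly equally safe on each road type and differ only in where the miles are driven.

```python
# Hypothetical numbers, chosen only to illustrate the confounding effect.
# Same per-road safety for both driving modes.
MILES_PER_ACCIDENT = {"highway": 4_000_000, "side_street": 500_000}


def aggregate_miles_per_accident(mileage_share):
    """Overall miles per accident for a given split of miles across road types."""
    accidents_per_mile = sum(
        share / MILES_PER_ACCIDENT[road] for road, share in mileage_share.items()
    )
    return 1 / accidents_per_mile


# Assumed mileage splits: self-driving mostly on highways, manual mostly elsewhere.
self_driving = aggregate_miles_per_accident({"highway": 0.9, "side_street": 0.1})
manual = aggregate_miles_per_accident({"highway": 0.4, "side_street": 0.6})

print(f"self-driving: {self_driving:,.0f} miles per accident")
print(f"manual:       {manual:,.0f} miles per accident")
```

With these made-up splits, the self-driving figure comes out roughly three times better even though, by construction, neither mode is safer on any given road. That is the sense in which a headline miles-per-accident comparison can mislead.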
There’s not much concrete data I can find on accident rates on highways vs non-highways. You would expect side-street accidents to have lower fatality rates though, and wrecks at highway speeds to have much higher fatality rates. From what I see, a government investigation into how safe autopilot is determined there were 13 deaths, which is a very low number given the billions of miles driven with autopilot on (3 billion+ in 2020, probably 5-10 billion now? Just guessing here since I can’t find a newer number).
But yeah, there are so many factors with driving that it’s hard to get an exact idea. Rural roads have the highest fatality rates (making up to 90% of accident fatalities in some states), and it’s not hard to imagine that Teslas are less popular in rural communities (although they seem to be pretty popular where I live).
But rural roads are also a perfect use case for autopilot: generally easy driving conditions where most deaths happen due to speeding and the driver not paying attention. Increased adoption of self-driving cars in rural communities would probably save a lot of lives.
Here’s another quote further down:
So no, not a scandal, it seems it is helping, but perhaps could be better. At least that’s my read.
It implies that a human would have been worse. Or at least that an average human would be worse, the ones making the decision.
The article is not about how the AI is responsible for the death. It’s likely that the woman would have died in the counterfactual.
The question is not “how effective is AI?” The question is whether life-or-death decisions should be made by an electrified Oracle at Delphi. You must answer this question before “is AI effective” becomes relevant.
If somebody was adjudicating traffic court with Tarot cards, would you ask: well how accurate are the cards compared to a judge?
Decisions should be made by whoever or whatever is most effective. That’s not even a debate. If the tarot cards were right more often than the judge, fire the judge and get me a deck. Because the judge is clearly ineffective.
You can’t privilege an approach just because it sounds more reasonable. It also has to BE more reasonable. It’s crazy to say “I’m happy being wrong because I’m more comfortable with the process”.
The trick of course is to find fair ways to measure effectiveness accurately and make sure it’s repeatable. That’s a rabbit hole of challenges.
The judge can bear legal responsibility. It’s a feedback loop - somebody should be responsible for failures. We live in a society. If that somebody is not the side causing failures, things will get bad.
With a deck of cards, it would have to be decided how the responsibility is distributed among the party replacing humans with it, the company producing the cards, and those interpreting the results.
Your point is valid regardless but the article mentions nothing about AI. (“Algorithm” doesn’t mean “AI”.)
My impression from the article is more that they’re not doing any kind of garbage-in assessment: nobody is making sure they’re getting answers about the right person (eg: some women date more than one guy) and some women don’t feel safe giving accurate answers to the police, and there aren’t good failsafes available for when it’s wrong; you’re forced to hire legal counsel and pursue a change via the courts.
That, and their action for low-risk cases is all wrong. The stakes are too high to not give someone help, regardless of the risk level.