Friday, 5 May 2017

The man who drew targets around the bullet holes

Bad Stats - Bad Science by Ben Goldacre
We cannot read statistics unless they are "translated" into natural language for us.
Humbly accepting this truth is the mark of any sensible person.
Take the "cholesterol miracles"…
… Let’s say the risk of having a heart attack in your fifties is 50 per cent higher if you have high cholesterol. That sounds pretty bad. Let’s say the extra risk of having a heart attack if you have high cholesterol is only 2 per cent. That sounds OK to me. But they’re the same (hypothetical) figures…
We cannot read statistics because the human brain does not grasp the real meaning of probabilities and risk factors. That is how we are built; there is no point insisting otherwise.
Newspapers exploit this, carefully avoiding presenting the same results as natural frequencies, that is, as absolute numbers, something our brain grasps far better…
… Out of a hundred men in their fifties with normal cholesterol, four will be expected to have a heart attack; whereas out of a hundred men with high cholesterol, six will be expected to have a heart attack. That’s two extra heart attacks per hundred. Those are called ‘natural frequencies’. Natural frequencies are readily understandable, because instead of using probabilities, or percentages, or anything even slightly technical or difficult, they use concrete numbers, just like the ones you use every day to check if you’ve lost a kid on a coach trip, or got the right change in a shop. Lots of people have argued that we evolved to reason and do maths with concrete numbers like these, and not with probabilities, so we find them more intuitive…
Risk. Between absolute and relative there is an ocean of difference, and newspapers play on it…
… you could have a 50 per cent increase in risk (the ‘relative risk increase’); or a 2 per cent increase in risk (the ‘absolute risk increase’); or, let me ram it home, the easy one, the informative one, an extra two heart attacks for every hundred men, the natural frequency…
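The three presentations Goldacre contrasts are pure arithmetic, and can be checked in a few lines of Python - a minimal sketch using the hypothetical 4-in-100 vs 6-in-100 cholesterol figures from the quote above:

```python
# Convert one risk difference between the three presentations Goldacre
# contrasts: relative risk increase, absolute risk increase, and
# natural frequencies (extra cases per 100 people).

def risk_presentations(baseline, exposed, population=100):
    """Return the three ways of stating the same risk difference."""
    relative_increase = (exposed - baseline) / baseline   # "50% higher risk"
    absolute_increase = exposed - baseline                # "2% extra risk"
    extra_cases = (exposed - baseline) * population       # natural frequency
    return relative_increase, absolute_increase, extra_cases

# 4 in 100 men with normal cholesterol, 6 in 100 with high cholesterol
rel, abs_, extra = risk_presentations(0.04, 0.06)
print(f"relative risk increase: {rel:.0%}")             # 50%
print(f"absolute risk increase: {abs_:.0%}")            # 2%
print(f"extra heart attacks per 100 men: {extra:.0f}")  # 2
```

The same underlying numbers, and yet "50 per cent higher" sounds alarming while "two extra per hundred" does not.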
Red meat causes cancer. Obviously. But by how much? The professor on TV is expected to impress the audience without lying. What does he do? Here is a typical exchange with the anchorman…
… Try this, on bowel cancer, from the Today programme on Radio 4: ‘A bigger risk meaning what, Professor Bingham?’ ‘A third higher risk.’ ‘That sounds an awful lot, a third higher risk; what are we talking about in terms of numbers here?’ ‘A difference … of around about twenty people per year.’ ‘So it’s still a small number?’ ‘Umm … per 10,000…’…
Painkillers and heart attacks
… The reports were based on a study that had observed participants over four years, and the results suggested, using natural frequencies, that you would expect one extra heart attack for every 1,005 people taking ibuprofen…
Here is how the media reported the link - again the trick of quoting the relative risk rather than the absolute one…
… ‘British research revealed that patients taking ibuprofen to treat arthritis face a 24 per cent increased risk of suffering a heart attack.’ Feel the fear. Almost everyone reported the relative risk increases…
And researchers are no slouches at dramatizing either. Sometimes they chase the limelight more avidly than a showgirl.
***
H.G. Wells predicted that statistics would be the linchpin of the civilization to come. Right. But he also predicted that we would learn to interpret them correctly. Wrong, wrong, wrong…
… Over a hundred years ago, H.G. Wells said that statistical thinking would one day be as important as the ability to read and write in a modern technological society. I disagree; probabilistic reasoning is difficult for everyone, but everyone understands normal numbers…
***
Let us take an example: did you know that the cannabis in circulation today is far more potent than it once was? The news…
… The Independent was in favour of legalising cannabis for many years, but in March 2007 it decided to change its stance. One option would have been simply to explain this as a change of heart, or a reconsideration of the moral issues. Instead it was decorated with science—as cowardly zealots have done from eugenics through to prohibition—and justified with a fictitious change in the facts… Twice in this story we are told that cannabis is twenty-five times stronger than it was a decade ago… The data from the Laboratory of the Government Chemist goes from 1975 to 1989. Cannabis resin pootles around between 6 per cent and 10 per cent THC, herbal between 4 per cent and 6 per cent. There is no clear trend. The Forensic Science Service data then takes over to produce the more modern figures, showing not much change in resin, and domestically produced indoor herbal cannabis doubling in potency from 6 per cent to around 12 or 14 per cent. (2003–05 data in table under references)…. The rising trend of cannabis potency is gradual, fairly unspectacular, and driven largely by the increased availability of domestic, intensively grown indoor herbal cannabis…. ‘Twenty-five times stronger’, remember. Repeatedly, and on the front page. If you were in the mood to quibble with the Independent’s moral and political reasoning, as well as its evident and shameless venality, you could argue that intensive indoor cultivation of a plant which grows perfectly well outdoors is the cannabis industry’s reaction to the product’s illegality itself… In the mid-1980s, during Ronald Reagan’s ‘war on drugs’ and Zammo’s ‘Just say no’ campaign on Grange Hill, American campaigners were claiming that cannabis was fourteen times stronger than in 1970. Which sets you thinking. If it was fourteen times stronger in 1986 than in 1970, and it’s twenty-five times stronger today than at the beginning of the 1990s, does that mean it’s now 350 times stronger than in 1970? 
That’s not even a crystal in a plant pot. It’s impossible…
It recalls another story: the flood of cocaine supposedly about to hit our cities (March 2006). The article…
… ‘Use of the addictive drug by children doubles in a year,’ said the subheading. Was this true?…
The data came from government sources.
But the source itself seemed to play things down in its commentary, speaking of "no increase". Luckily our proud investigative journalist had smelled a rat: he had discovered that cocaine users had in fact doubled!…
… If you read the press release for the government survey on which the story is based, it reports ‘almost no change in patterns of drug use, drinking or smoking since 2000’. But this was a government press release, and journalists are paid to investigate…
The source document
… You can download the full document online. It’s a survey of 9,000 children, aged eleven to fifteen, in 305 schools. The three-page summary said, again, that there was no change in prevalence of drug use. If you look at the full report you will find the raw data tables: when asked whether they had used cocaine in the past year, 1 per cent said yes in 2004, and 2 per cent said yes in 2005. So the newspapers were right: it doubled? No. Almost all the figures given were 1 per cent or 2 per cent…
So: in 2004, 1 per cent of respondents said they had used cocaine; in 2005, 2 per cent. Can we really call that a doubling?
And that is before we account for the rounding that got lost along the way…
… The actual figures were 1.4 per cent for 2004, and 1.9 per cent for 2005, not 1 per cent and 2 per cent…
Let us translate everything into terms of risk, both relative and absolute…
… What we now have is a relative risk increase of 35.7 per cent, or an absolute risk increase of 0.5 per cent. Using the real numbers, out of 9,000 kids we have about forty-five more saying ‘Yes’ to the question ‘Did you take cocaine in the past year?’ Presented with a small increase like this, you have to think: is it statistically significant?…
Even so, the increase would appear to be statistically significant. So why did the people who compiled the statistics say there was no increase at all? Why?
Let us start from the beginning: what is statistical significance?…
… It’s just a way of expressing the likelihood that the result you got was attributable merely to chance. Sometimes you might throw ‘heads’ five times in a row, with a completely normal coin, especially if you kept tossing it for long enough… The standard cut-off point for statistical significance is a p-value of 0.05, which is just another way of saying, ‘If I did this experiment a hundred times, I’d expect a spurious positive result on five occasions, just by chance.’…
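The coin-toss intuition is easy to check by simulation. A Python sketch (the seed, the 100-toss run length and the 10,000 trials are arbitrary choices of mine, not figures from the book):

```python
import random

# How often does a perfectly fair coin produce five heads in a row
# somewhere in a run of 100 tosses? Streaks happen by chance alone.
random.seed(42)

def has_streak(n_tosses, streak=5):
    """True if a run of `streak` consecutive heads occurs."""
    run = 0
    for _ in range(n_tosses):
        run = run + 1 if random.random() < 0.5 else 0
        if run >= streak:
            return True
    return False

trials = 10_000
hits = sum(has_streak(100) for _ in range(trials))
print(f"P(five heads in a row within 100 tosses) ~ {hits / trials:.2f}")
```

The answer hovers around four chances in five: a "suspicious" streak in a modest run of data is the rule, not the exception.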
But beware: statistical significance testing assumes that the observed cases are independent, which is never entirely true in the real world. Schoolchildren's behaviour, for instance, is shaped by many shared factors (fads, events, trends…), so on real-world survey data the canonical 5 per cent false-positive rate of the naive test cannot be taken at face value…
… To ‘data mine’, taking it out of its real-world context, and saying it is significant, is misleading. The statistical test for significance assumes that every data point is independent, but here the data is ‘clustered’, as statisticians say. They are not data points, they are real children, in 305 schools. They hang out together, they copy each other, they buy drugs from each other, there are crazes, epidemics, group interactions… The increase of forty-five kids taking cocaine could have been a massive epidemic of cocaine use in one school…
The result needs correcting. Statisticians call this "correcting for clustering" - a way of discounting the dependence built into the data points…
… As statisticians would say, you must ‘correct for clustering’. This is done with clever maths which makes everyone’s head hurt. All you need to know is that the reasons why you must ‘correct for clustering’ are transparent, obvious and easy, as we have just seen… When you correct for clustering, you greatly reduce the significance of the results…
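The "clever maths" is out of scope here, but the standard Kish design-effect formula gives a feel for how brutally clustering can shrink an effective sample. Only the 9,000 children and 305 schools come from the survey; the intra-cluster correlation below is an invented value, purely for illustration:

```python
# Rough sketch of why clustering deflates significance: with correlated
# pupils inside each school, 9,000 children are "worth" far fewer than
# 9,000 independent data points.

n, clusters = 9000, 305
m = n / clusters                   # average cluster size (~29.5 per school)
icc = 0.05                         # ASSUMED intra-cluster correlation

design_effect = 1 + (m - 1) * icc  # Kish design effect
n_effective = n / design_effect    # equivalent number of independent points

print(f"average cluster size:   {m:.1f}")
print(f"design effect:          {design_effect:.2f}")
print(f"effective sample size:  {n_effective:.0f}")
```

Even a modest within-school correlation more than halves the effective sample, which is why significance drops so sharply once you correct.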
What survives this correction?
Very little, especially since, in this case, a further correction is needed.
When you test many relationships, you can in principle cherry-pick the ones that suit you. The more comparisons you run, the greater the chance that some come out positive by sheer luck, and the temptation is to keep those and discard the rest. The scientific method, by contrast, requires you to form hypotheses from a model first and then test them. Trawling the data to build your hypotheses is not the proper way to proceed…
… Will our increase in cocaine use, already down from ‘doubled’ to ‘35.7 per cent’, even survive? No. Because there is a final problem with this data: there is so much of it to choose from. There are dozens of data points in the report: on solvents, cigarettes, ketamine, cannabis, and so on. It is standard practice in research that we only accept a finding as significant if it has a p-value of 0.05 or less. But as we said, a p-value of 0.05 means that for every hundred comparisons you do, five will be positive by chance alone. From this report you could have done dozens of comparisons, and some of them would indeed have shown increases in usage—but by chance alone, and the cocaine figure could be one of those…
An analogy: if I roll the dice long enough, I can then cherry-pick runs of sixes after the fact to "prove" the outcomes are not random…
… If you roll a pair of dice often enough, you will get a double six three times in a row on many occasions…
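This too can be checked by simulation - a Python sketch (the seed is arbitrary) that keeps rolling a pair of dice until a double six turns up three times in a row:

```python
import random

# Roll a pair of fair dice until a double six appears three times in a
# row: given enough rolls, the "amazing" streak arrives by chance alone.
random.seed(1)

rolls, run, first_hit = 0, 0, None
while first_hit is None:
    rolls += 1
    pair = (random.randint(1, 6), random.randint(1, 6))
    run = run + 1 if pair == (6, 6) else 0
    if run == 3:
        first_hit = rolls

print(f"three double sixes in a row after {first_hit} rolls")
```

On average it takes tens of thousands of rolls, but it always happens eventually - the streak itself proves nothing.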
The survey in question contains a myriad of comparisons between the most disparate variables. It is, in other words, a study that leads researchers into temptation. In cases like this you must apply a correction for multiple comparisons - the "Bonferroni correction" being the standard methodological safeguard…
… This is why statisticians do a ‘correction for multiple comparisons’, a correction for ‘rolling the dice’ lots of times. This, like correcting for clustering, is particularly brutal on the data, and often reduces the significance of findings dramatically…
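A minimal sketch of the multiple-comparisons arithmetic; the figure of 30 comparisons is my own assumption, standing in for the report's "dozens":

```python
# With many comparisons at the usual 0.05 threshold, spurious positives
# become near-certain; Bonferroni shrinks the per-test threshold so the
# overall (familywise) false-positive rate stays at 5%.
alpha = 0.05
n_comparisons = 30   # ASSUMED: "dozens" of drug/age/gender breakdowns

bonferroni_threshold = alpha / n_comparisons
# Chance of at least one spurious "significant" result if you keep
# using 0.05 for each of the 30 comparisons:
familywise_error = 1 - (1 - alpha) ** n_comparisons

print(f"corrected per-test threshold: {bonferroni_threshold:.4f}")  # 0.0017
print(f"chance of >=1 false positive at 0.05: {familywise_error:.0%}")
```

With 30 comparisons at the naive threshold you would expect a spurious "finding" about four times out of five - which is why the correction is, as Goldacre says, brutal on the data.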
After this last correction, nothing remains of the "doubling" trumpeted by our diligent investigative journalist.
The nerds who compiled the survey, besides reading the jump from 1 per cent to 2 per cent correctly, knew all about the clustering correction and the Bonferroni correction. That is why they concluded that "there was no increase in cocaine use" - for that reason, and not to keep an "inconvenient" truth from the public.
***
But the most glaring plague of statistics is badly chosen samples…
… There are also some perfectly simple ways to generate ridiculous statistics, and two common favourites are to select an unusual sample group, and to ask them a stupid question. Let’s say 70 per cent of all women want Prince Charles to be told to stop interfering in public life. Oh, hang on—70 per cent of all women who visit my website want Prince Charles to be told to stop interfering in public life…
An example: doctors' willingness to perform abortions
… Telegraph in the last days of 2007. ‘Doctors Say No to Abortions in their Surgeries’ was the headline. ‘Family doctors are threatening a revolt against government plans to allow them to perform abortions in their surgeries… ‘Four out of five GPs do not want to carry out terminations even though the idea is being tested in NHS pilot schemes, a survey has revealed.’…
The source of the story…
… It was an online vote on a doctors’ chat site that produced this major news story. Here is the question, and the options given:   ‘GPs should carry out abortions in their surgeries’ Strongly agree, agree, don’t know, disagree, strongly disagree…
First: doubts about how the question was phrased
… Is that ‘should’ as in ‘should’? As in ‘ought to’?… Are they just saying no because they’re grumbling about more work and low morale? More than that, what exactly does ‘abortion’ mean here?…
***
Another exemplary case. Do you know how many murders are committed by people with psychiatric problems?…
… In 2006, after a major government report, the media reported that one murder a week is committed by someone with psychiatric problems. Psychiatrists should do better, the newspapers told us, and prevent more of these murders. All of us would agree…
Couldn't we stop them first? Couldn't we detain the most dangerous individuals?
Anyone who proposes solutions of that kind has not grasped the concepts of base rate and false positive…
… the blood test for HIV has a very high ‘sensitivity’, at 0.999. That means that if you do have the virus, there is a 99.9 per cent chance that the blood test will be positive. They would also say the test has a high ‘specificity’ of 0.9999—so, if you are not infected, there is a 99.99 per cent chance that the test will be negative. What a smashing blood test.* But if you look at it from the perspective of the person being tested, the maths gets slightly counterintuitive. Because weirdly, the meaning, the predictive value, of an individual’s positive or negative test is changed in different situations, depending on the background rarity of the event that the test is trying to detect. The rarer the event in your population, the worse your test becomes, even though it is the same test. This is easier to understand with concrete figures. Let’s say the HIV infection rate among high-risk men in a particular area is 1.5 per cent. We use our excellent blood test on 10,000 of these men, and we can expect 151 positive blood results overall: 150 will be our truly HIV-positive men, who will get true positive blood tests; and one will be the one false positive we could expect from having 10,000 HIV-negative men being given a test that is wrong one time in 10,000. So, if you get a positive HIV blood test result, in these circumstances your chances of being truly HIV positive are 150 out of 151. It’s a highly predictive test. Let’s now use the same test where the background HIV infection rate in the population is about one in 10,000. If we test 10,000 people, we can expect two positive blood results overall. One from the person who really is HIV positive; and the one false positive that we could expect, again, from having 10,000 HIV-negative men being tested with a test that is wrong one time in 10,000. Suddenly, when the background rate of an event is rare, even our previously brilliant blood test becomes a bit rubbish. 
For the two men with a positive HIV blood test result, in this population where only one in 10,000 has HIV, it’s only 50:50 odds on whether they really are HIV positive…
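Goldacre's two HIV scenarios boil down to a single formula, the positive predictive value of the test. A sketch in Python, using the sensitivity, specificity and prevalence figures from the quote:

```python
# The same test's predictive value collapses when the condition is rare
# in the population being tested.

def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(truly infected | positive test)."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

high_risk = positive_predictive_value(0.999, 0.9999, 0.015)    # 1.5% prevalence
low_risk = positive_predictive_value(0.999, 0.9999, 0.0001)    # 1 in 10,000

print(f"high-risk group:      {high_risk:.3f}")   # ~0.993 (Goldacre's 150/151)
print(f"general population:   {low_risk:.2f}")    # ~0.50  (50:50 odds)
```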
Psychiatric screening for dangerousness produces a high rate of false positives, compounded by base rates that are in any case quite low. Detaining people under such uncertainty would be absurd…
… Let’s think about violence. The best predictive tool for psychiatric violence has a ‘sensitivity’ of 0.75, and a ‘specificity’ of 0.75. It’s tougher to be accurate when predicting an event in humans, with human minds and changing human…
Just do the maths…
… Let’s say 5 per cent of patients seen by a community mental health team will be involved in a violent event in a year. Using the same maths as we did for the HIV tests, your ‘0.75’ predictive tool would be wrong eighty-six times out of a hundred. For serious violence, occurring at 1 per cent a year, with our best ‘0.75’ tool, you inaccurately finger your potential perpetrator ninety-seven times out of a hundred. Will you preventively detain ninety-seven people to prevent three violent events?…
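The same base-rate arithmetic, applied in Python to the "0.75" tool, reproduces both of Goldacre's figures:

```python
# With sensitivity 0.75 and specificity 0.75, what fraction of the
# people flagged as dangerous are false alarms?

def false_alarm_rate(sensitivity, specificity, prevalence):
    """Fraction of positive flags that point at the wrong person."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return false_positives / (true_positives + false_positives)

print(f"5% base rate: wrong {false_alarm_rate(0.75, 0.75, 0.05):.0%} of the time")  # 86%
print(f"1% base rate: wrong {false_alarm_rate(0.75, 0.75, 0.01):.0%} of the time")  # 97%
```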
Locking up ninety-seven people to prevent three violent events is a bit much. Or isn't it?
***
The Clark case
… In 1999 solicitor Sally Clark was put on trial for murdering her two babies…
The evidence of her guilt…
… At her trial, Professor Sir Roy Meadow, an expert in parents who harm their children, was called to give expert evidence. Meadow famously quoted ‘one in seventy-three million’ as the chance of two children in the same family dying of Sudden Infant Death Syndrome (SIDS)….
Far too improbable that two children should both die of natural causes: she must have murdered them!
What is wrong with this reasoning?
First of all, an illegitimate assumption of independence: these events are not independent, so the probability of their happening "together" cannot be obtained by simply multiplying the individual probabilities…
… The figure of ‘one in seventy-three million’ itself is iffy, as everyone now accepts. It was calculated as 8,543 × 8,543, as if the chances of two SIDS episodes in this one family were independent of each other. This feels wrong from the outset, and anyone can see why: there might be environmental or genetic factors at play, both of which would be shared by the two babies…
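The multiplication behind the headline figure is trivial to reproduce, which is exactly the problem - the arithmetic is easy, the independence assumption behind it is wrong:

```python
# Meadow's "one in seventy-three million" was obtained by squaring the
# single-SIDS odds, as if the two deaths in one family were independent.
single_sids_odds = 8543            # 1 in 8,543 for a family like the Clarks

naive_double_odds = single_sids_odds ** 2
print(f"1 in {naive_double_odds:,}")   # 1 in 72,982,849 - the "73 million"
```

Shared genetic or environmental factors make the second death far more likely than the first taken alone, so squaring understates the true probability of double SIDS.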
Then there is the "prosecutor's fallacy", which weighs only the improbability of innocence. And the improbability of guilt - where did that go?…
… Many press reports at the time stated that one in seventy-three million was the likelihood that the deaths of Sally Clark’s two children were accidental: that is, the likelihood that she was innocent… Once this rare event has occurred, the jury needs to weigh up two competing explanations for the babies’ deaths: double SIDS or double murder. Under normal circumstances—before any babies have died—double SIDS is very unlikely, and so is double murder… If we really wanted to play statistics, we would need to know which is relatively more rare, double SIDS or double murder. People have tried to calculate the relative risks of these two events, and one paper says it comes out at around 2:1 in favour of double SIDS… the rarity of double SIDS is irrelevant, because double murder is rare too…
***
With hindsight, no event can be called surprising. Richard Feynman on the subject…
… You know, the most amazing thing happened to me tonight. I was coming here, on the way to the lecture, and I came in through the parking lot. And you won’t believe what happened. I saw a car with the license plate ARW 357. Can you imagine? Of all the millions of license plates in the state, what was the chance that I would see that particular one tonight? Amazing… Richard Feynman…
Here is the case of the killer nurse: too many deaths during her shifts…
… A nurse called Lucia de Berk has been in prison for six years in Holland, convicted of seven counts of murder and three of attempted murder. An unusually large number of people died when she was on shift, and that, essentially, along with some very weak circumstantial evidence, is the substance of the case against her… The judgement was largely based on a figure of ‘one in 342 million against’….
Careful, though: never trust "predictions" made after the fact. Predictions are made beforehand: an event is startling only if it was specifically predicted in advance…
… It’s only weird and startling when something very, very specific and unlikely happens if you have specifically predicted it beforehand…
The man who drew targets around the bullet holes…
… Imagine I am standing near a large wooden barn with an enormous machine gun. I place a blindfold over my eyes and—laughing maniacally—I fire off many thousands and thousands of bullets into the side of the barn. I then drop the gun, walk over to the wall, examine it closely for some time, all over, pacing up and down. I find one spot where there are three bullet holes close to each other, then draw a target around them, announcing proudly that I am an excellent marksman…
Hypotheses first, then evidence. That is how science works…
… a cardinal rule of any research involving statistics: you cannot find your hypothesis in your results…
The dangers of retrospective investigation
… To collect more data, the investigators went back to the wards to see if they could find more suspicious deaths. But all the people who were asked to remember ‘suspicious incidents’ knew that they were being asked because Lucia might be a serial killer. There was a high risk that ‘an incident was suspicious’ became synonymous with ‘Lucia was present’…
Let us be clear: some phenomena simply cannot be tested prospectively, so formulating hypotheses after the fact can still be worthwhile - it is all we have. Think of the anthropic principle. Still, we must remain fully aware of the difference between the rigorous scientific method and this way of proceeding.