Southern Network News (Reporter Zhu Qilin): Kyushu Publishing House recently published the popular science book "Numbers Are Not Honest at All: Seeing Through the Data Mystery in Complex Information".
Daily life is full of statistics such as proportions, prevalence rates, and risk values. Many of these numbers, whether explicit or implicit, can distort the truth when they are misread or misused, so "statistical awareness" is indispensable for understanding them and making reasonable judgments. The book walks readers through 22 common mistakes and tricks involving numbers. It covers how they arise when assessing speed and importance or accuracy and rankings, what biases can creep in when numbers are collected and presented, how misleading claims such as "looking at a screen before bed will kill you" come about, and what guidelines responsible, trustworthy statistical reporting should follow.
About the authors. Tom Chivers is a science writer who worked for The Telegraph, Buzzfeed and others before going freelance in 2018. In 2018, he won the Royal Statistical Society's "Journalism 'Statistical' Merit Award". In 2017, he won the American Psychological Association (APA) Award, and he has been shortlisted for the British Science Author Award and the British Science Writing Journalism Award.
David Chivers is an associate professor of economics at Durham University Business School and a former lecturer at the University of Oxford. His work has been published in a number of leading academic journals. His research areas include inequality, growth, and development.
Book excerpt: Numbers can also be misleading.
It is easy to lie with statistics, but it is easier to lie without them. (Attributed to the statistician Frederick Mosteller)
The coronavirus disease has taught the world a costly crash course in statistical concepts. People suddenly find themselves having to understand what an exponential curve is, infection fatality versus case fatality rates, false positives vs. false negatives, uncertainty intervals. Some of these concepts are obviously complex, but even those that feel like they should be simple – such as the number of people who have died from the virus – are actually difficult to grasp. In this chapter, we'll look at how a seemingly straightforward number can be unexpectedly misleading.
Early on, one number that all of us had to get to grips with was the "R value". In December 2019, probably not even two in 50 people knew what the R value was, but by the end of March 2020, mainstream news reports were mentioning it with barely any explanation. However, because numbers can go wrong in subtle ways, well-intentioned attempts to inform readers about changes in the R value ended up causing misunderstandings.
A reminder: R is the "reproduction number" of something. It can be applied to anything that spreads or reproduces: memes, humans, yawning, new technologies, etc. In infectious disease epidemiology, the R value is the average number of people infected by one person who has the disease. If a disease has an R value of 5, then on average each infected person infects five other people.
Of course, it's not that simple, because it's an average. With 100 infected people, an R value of 5 could mean that each of them infects exactly 5 people. It could also mean that 99 of them infect no one at all while the remaining one infects 500 people, or anything in between.
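To make the point about averages concrete, here is a small sketch in Python. The two lists are simply the two extreme scenarios described above, and the variable names are made up for illustration.

```python
from statistics import mean

# Two hypothetical groups of 100 infected people, as described in the text.
even_spread = [5] * 100              # every person infects exactly 5 others
one_super_spreader = [0] * 99 + [500]  # 99 infect nobody, one infects 500

r_even = mean(even_spread)
r_super = mean(one_super_spreader)

# Both groups have the same average, although the spread is very different.
print(r_even, r_super)
```

The average alone cannot tell these two situations apart.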
It won't stay the same either. In the early stages of a new disease outbreak, when no one in the population is immune to the pathogen and there may not be any countermeasures (such as social distancing or mask-wearing), the R value can be very different from what it is later on. During disease outbreaks, one of the goals of public health policy is to reduce the R value through vaccination or behavioural changes, because if the R value is greater than 1, the disease will spread exponentially, and if it is less than 1, the disease will gradually disappear.
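The difference between an R value above 1 and one below 1 can be sketched with a deliberately simplified model. It assumes, purely for illustration, that the R value stays constant and that infections happen in neat "generations"; the function name and the starting figure of 100 cases are hypothetical.

```python
def cases_in_generation(initial_cases, r_value, generations):
    """Cases in a given generation, assuming a constant R value (a simplification)."""
    return initial_cases * r_value ** generations

# R = 2: each generation is twice as large as the one before.
growing = [cases_in_generation(100, 2, g) for g in range(5)]

# R = 0.5: each generation is half as large as the one before.
shrinking = [cases_in_generation(100, 0.5, g) for g in range(5)]

print(growing)
print(shrinking)
```

With R above 1 the numbers multiply with every generation, and with R below 1 they shrink towards zero.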
You might nonetheless think that, for viruses, one simple rule survives all these complications: the higher the R value, the worse. So you probably wouldn't have been surprised when a warning came out of the UK in May 2020 that "the R value of the virus may have picked up" due to a "surge in infections in nursing homes".
But as you might have expected, things are a little more complicated.
From 2000 to 2013, the median real wage ("real" meaning inflation-adjusted) in the United States rose by about 1%. A rising median wage sounds like a good thing. However, if you break the population into smaller subgroups, you will find some surprises. For those who didn't finish high school, the median wage fell by 7.9%; for high school graduates, it fell by 4.7%; for those who went to college but did not earn a degree, it fell by 7.6%; and for those who earned a college degree, it fell by 1.2%.
The median wage fell in every educational subgroup: for those who finished high school and those who did not, for those who finished college and those who did not. Yet the median wage of the population as a whole rose.
What's going on?
It turns out that while the median wage of people with a college degree fell, the number of people in this subgroup increased significantly. As a result, the overall median moved in a surprising direction. This phenomenon is called "Simpson's paradox", after the British codebreaker and statistician Edward H. Simpson, who described it in 1951. It occurs with arithmetic means as well as medians, but for now we will stick with the median.
Let's say the whole population is 11 people. Three dropped out of high school and earn £5 a year; three completed high school and earn £10 a year; three dropped out of college and earn £15 a year; and two earned a bachelor's degree and earn £20 a year. The median wage for the population as a whole (i.e. the wage of the person in the middle when everyone is lined up in order) is £10.
Then, one year, there is a big push for more people to finish high school and college. At the same time, the wage in each subgroup falls by £1. Now there are two high school dropouts earning £4 a year; two high school graduates earning £9; two college dropouts earning £14; and five bachelor's graduates earning £19. The median fell in each subgroup, but rose from £10 to £14 for the population as a whole. Between 2000 and 2013, something similar happened in the real U.S. economy, only with larger numbers.
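The 11-person thought experiment is small enough to check by hand, but it can also be written out as a short Python sketch. The dictionaries map each subgroup's wage to its headcount, using the figures from the text; the helper name `expand` is made up for illustration.

```python
from statistics import median

def expand(groups):
    """Turn {wage: headcount} into a flat list with one wage per person."""
    return [wage for wage, count in groups.items() for _ in range(count)]

# Wage in pounds per year -> number of people, before and after the change.
before = {5: 3, 10: 3, 15: 3, 20: 2}
after = {4: 2, 9: 2, 14: 2, 19: 5}

median_before = median(expand(before))
median_after = median(expand(after))

# Every subgroup's wage fell by 1 pound, yet the overall median went up.
print(median_before, median_after)
```

The overall median rises only because people moved into the better-paid subgroups, not because anyone in a given subgroup earns more.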
This phenomenon is surprisingly widespread. For example, black Americans are more likely to smoke than white Americans. But when you control for education level, you find that within each education subgroup, black Americans are less likely to smoke than white Americans. This is because more highly educated subgroups generally smoke less, and black Americans are underrepresented in those subgroups.
There is also a famous example. In September 1973, 8,000 men and 4,000 women applied for graduate school at the University of California, Berkeley. Of these, 44 percent of male applicants were admitted, while only 35 percent of female applicants were accepted.
But if you take a closer look at the data, you'll notice that in almost every department at the university, female applicants had a higher probability of admission. Eighty-two percent of women who applied to the most popular department were admitted, compared to only 62 percent of male applicants. The second most popular department admitted 68% of female applicants and 65% of male applicants.
The reality is that the departments women applied to were often more competitive. For example, the most popular department received 933 applications, of which 108 were from women. It admitted 82% of female applicants and 62% of male applicants. Meanwhile, the sixth most popular department received 714 applications, of which 341 were from women. It admitted only 7% of female applicants and 6% of male applicants.
But if the data from the two departments are combined, there are a total of 449 female applicants and 1,198 male applicants. Of the women, 111 were admitted, an acceptance rate of 25%; of the men, 533 were admitted, an acceptance rate of 44%.
Once again, looked at separately, each of the two departments admitted women at a higher rate; but when the two departments are combined, women's admission rate is lower.
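The two-department comparison can be reproduced roughly in Python. This is a sketch under one stated assumption: the numbers of admitted applicants are estimated from the rounded percentages given in the text, so the totals may differ by a few people from the counts quoted above. The department labels and the function name are made up for illustration.

```python
# (number of applicants, admission rate) per department, from the text.
# Male applicant counts are total applications minus applications from women.
women = {"most popular": (108, 0.82), "sixth most popular": (341, 0.07)}
men = {"most popular": (933 - 108, 0.62), "sixth most popular": (714 - 341, 0.06)}

def combined_rate(departments):
    """Overall admission rate across departments (admitted counts are estimates)."""
    applicants = sum(n for n, _ in departments.values())
    admitted = sum(n * rate for n, rate in departments.values())
    return admitted / applicants

women_rate = combined_rate(women)
men_rate = combined_rate(men)

# Women do better in each department, but worse once the two are combined.
print(f"women: {women_rate:.1%}, men: {men_rate:.1%}")
```

The reversal comes from where people applied: most of the women applied to the department that rejected almost everyone, while most of the men applied to the one that admitted the majority.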
What should we make of this outcome? It depends. In the U.S. wage example, you might say that the overall median is more informative, because the median individual wage in the U.S. has risen (since more Americans have completed high school and college). In the Berkeley example, you might say that women were more likely than men to be accepted whichever department they applied to. But you could also point out that for those without a high school diploma, the situation has got worse; and you could point out that the departments women want to apply to are clearly under-resourced, since they can admit only very few applicants. The problem is that when Simpson's paradox comes along, you can use the same data to tell diametrically opposed stories, depending on which political point you want to make. The honest approach is to point out that Simpson's paradox is at work.
Let's go back to the r-value of the coronavirus. If the r-value is elevated, it means that the virus is spreading to more people, which is not a good thing. However, there is no doubt that the reality is more complicated. There are two barely related "epidemics" that are spreading at the same time: the spread of the disease in nursing homes and hospitals is different from that in the wider community.
Because the exact numbers have not been released, we don't know the details. But we can do another thought experiment similar to the previous one. Suppose 100 people in nursing homes and 100 people in the general community have the disease. On average, each case in the community passes the disease on to 2 people, while each case in a nursing home passes it on to 3 people. The R value (the average number of people each carrier infects) is 2.5.
Then we go into lockdown. The number of infected people drops, and so does the R value. But, crucially, R drops more in the community than in nursing homes. There are now 90 infected people in nursing homes, each of whom passes the disease on to an average of 2.9 people, while there are 10 infected people in the community, each of whom infects an average of 1 person.
Now the R value is 2.71: ((90 × 2.9) + (10 × 1)) / 100 = 2.71. The R value has gone up! But in fact, it decreased in both subgroups.
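The overall R value in this thought experiment is a weighted average of the two subgroup values, weighted by how many infected people are in each subgroup. A minimal Python sketch of that calculation, using the made-up figures from the thought experiment (the function name is hypothetical):

```python
def overall_r(groups):
    """Weighted average R value; groups is a list of (infected_count, r_value)."""
    total_infected = sum(count for count, _ in groups)
    return sum(count * r for count, r in groups) / total_infected

# Before lockdown: 100 cases in nursing homes (R = 3), 100 in the community (R = 2).
r_before = overall_r([(100, 3.0), (100, 2.0)])

# After lockdown: 90 cases in nursing homes (R = 2.9), 10 in the community (R = 1).
r_after = overall_r([(90, 2.9), (10, 1.0)])

# R fell in both subgroups, but the overall figure rose.
print(f"before: {r_before:.2f}, after: {r_after:.2f}")
```

The overall figure rises because nursing homes, where R is higher, now account for 90% of the cases instead of 50%.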
What should you make of this? Again, the answer is not necessarily obvious. Maybe you care more about the overall R value, because the two epidemics aren't really unrelated. But the answer is certainly not as simple as "if the R value rises, that's bad".
When you try to understand an individual or a subgroup by looking at the average for a whole group, you run into the "ecological fallacy". This is a broader problem, and Simpson's paradox is one example of it. The ecological fallacy may be more prevalent than you might think. It's important for readers and journalists to understand that the numbers in headlines can obscure more complex truths. To understand what these numbers mean, you may need to analyze them more carefully.