- Author: [[David Spiegelhalter]]
- Full Title: The Art of Statistics
- [[Statistics]] [[Types of variables in statistics]] [[Data science]] [[The PPDAC Cycle of Data Science]] #[[Load testing is a data science]] #[[Presentation/Mine/Rise of the Datasaurus: using data science in load testing]] [[Describing the spread of a data distribution]]
- [[The Art of Statistics (lit)]]
- ### [x] Highlights first synced by [[Readwise]] [[2020-09-05]]
- `We can think of this type of iterative, exploratory work as ‘forensic’ statistics, and in this case it was literally true. There is no mathematics, no theory, just a search for patterns that might lead to more interesting questions.` ([Location 247](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=247))
- statistics are always to some extent constructed on the basis of judgements, and it would be an obvious delusion to think the full complexity of personal experience can be unambiguously coded and put into a spreadsheet or other software. ([Location 286](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=286))
- Data has two main limitations as a source of such knowledge. First, it is almost always an imperfect measure of what we are really interested in: ([Location 289](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=289))
- Second, anything we choose to measure will differ from place to place, from person to person, from time to time, and the problem is to extract meaningful insights from all this apparently random variability. ([Location 291](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=291))
- Far from freeing us from the need for statistical skills, bigger data and the rise in the number and complexity of scientific studies makes it even more difficult to draw appropriate conclusions. More data means that we need to be even more aware of what the evidence is actually worth. ([Location 317](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=317))
- All these insights can be brought together under the term data literacy, which describes the ability to not only carry out statistical analysis on real-world problems, but also to understand and critique any conclusions drawn by others on the basis of statistics. ([Location 323](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=323))
- The first stage of the cycle is specifying a Problem; statistical inquiry always starts with a question, ([Location 333](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=333))
- Finally, the key to good statistical science is drawing appropriate Conclusions that fully acknowledge the limitations in the evidence, and communicating them clearly, ([Location 348](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=348))
- Any conclusions generally raise more questions, and so the cycle starts over again, ([Location 349](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=349))
- The PPDAC cycle provides a convenient framework: Problem—Plan—Data—Analysis—Conclusion and communication. ([Location 384](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=384))
- `A table can be considered as a type of graphic, and requires careful design choices of colour, font and language to ensure engagement and readability. The audience’s emotional response to the table may also be influenced by the choice of which columns to display.` ([Location 432](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=432))
- `Table 1.1 shows the results in terms of both survivors and deaths, but in the US mortality rates from child heart surgery are reported, while the UK provides survival rates. This is known as negative or positive framing, and its overall effect on how we feel is intuitive and well-documented: ‘5% mortality’ sounds worse than ‘95% survival’. Reporting the actual number of deaths as well as the percentage can also increase the impression of risk, as this total might then be imagined as a crowd of real people. `([Location 434](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=434))
- `Note the two tricks used to manipulate the impact of this statistic: convert from a positive to a negative frame, and then turn a percentage into actual numbers of people. Ideally both positive and negative frames should be presented if we want to provide impartial information, although the order of columns might still influence how the table is interpreted.` ([Location 447](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=447))
- `Choosing the start of the axis therefore presents a dilemma. Alberto Cairo, author of influential books on data visualization, suggests you should always begin with a ‘logical and meaningful baseline’,` ([Location 460](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=460))
- A variable is defined as any measurement that can take on different values in different circumstances; it’s a very useful shorthand term for all the types of observations that comprise data. ([Location 474](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=474))
- Categorical variables are measures that can take on two or more categories, which may be • Unordered categories: such as a person’s country of origin, the colour of a car, or the hospital in which an operation takes place. • Ordered categories: such as the rank of military personnel. • Numbers that have been grouped: such as levels of obesity, which is often defined in terms of thresholds for the body mass index (BMI). When it comes to presenting categorical data, pie charts allow an impression of the size of each category relative to the whole pie, but are often visually confusing, especially if they attempt to show too many categories in the same chart, or use a three-dimensional representation that distorts areas. ([Location 477](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=477))
- Comparisons are better based on height or length alone in a bar chart. ([Location 486](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=486))
- They concluded that, in the normal run of things, around 6 in every 100 people who do not eat bacon daily would be expected to get bowel cancer in their lifetime. If 100 similar people ate a bacon sandwich every single day of their lives, then according to the IARC report we would expect that 18% more would get bowel cancer, which means a rise from 6 to 7 cases out of 100. That is one extra case of bowel cancer in all those 100 lifetime bacon-eaters, which does not sound so impressive as the relative risk (an 18% increase), and might serve to put this hazard into perspective. We need to distinguish what is actually dangerous from what sounds frightening. ([Location 512](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=512))
- This bacon sandwich example illustrates the advantage of communicating risks using expected frequencies: instead of discussing percentages or probabilities, we just ask, ‘What does this mean for 100 (or 1,000) people?’ Psychological studies have shown that this technique improves understanding: in fact communicating only that this additional meat-eating led to an ‘18% increased risk’ could be considered manipulative, since we know this phrasing gives an exaggerated impression of the importance of the hazard. ([Location 517](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=517))
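- Note (mine, not a highlight): the bacon sandwich arithmetic above can be sketched in a few lines. The function name `expected_cases` and its parameters are my own labels; the only inputs taken from the highlight are the baseline of 6 in 100 and the 18% relative increase.

```python
def expected_cases(baseline_per_100: float, relative_risk: float, people: int = 100):
    """Return expected cases (unexposed, exposed) among `people` individuals."""
    unexposed = baseline_per_100 / 100 * people
    exposed = unexposed * relative_risk  # relative risk scales the baseline
    return unexposed, exposed


# Figures from the highlight: 6 in 100 baseline, 18% relative increase.
unexposed, exposed = expected_cases(baseline_per_100=6, relative_risk=1.18)
extra = exposed - unexposed

print(f"Without daily bacon: about {unexposed:.0f} in 100")
print(f"With daily bacon:    about {exposed:.0f} in 100")
print(f"Extra cases:         about {extra:.0f} in 100")
```

- The same 18% relative increase would mean far more extra cases if the baseline were higher, which is why the absolute numbers need to be reported alongside it.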
- using multiple ‘1 in…’ statements is not recommended, as many people find them difficult to compare. For example, when asked the question, ‘Which is the bigger risk, 1 in 100, 1 in 10 or 1 in 1,000?’, around a quarter of people answered incorrectly: ([Location 533](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=533))
- ### [x] New highlights added [[2020-09-08]] at 9:29 AM
- The Daily Mail misinterpreted this odds ratio of 1.18 as a relative risk, and produced a headline claiming statins ‘raises risk by up to 20 per cent’, which is a serious misrepresentation of what the study actually found. But not all the blame can be placed on the journalists: the abstract of the paper mentioned only the odds ratio without mentioning that this corresponded to a difference between absolute risks of 85% vs 87%. This highlights the danger of using odds ratios in anything but a scientific context, and the advantage of always reporting absolute risks as the quantity that is relevant for an audience. ([Location 554](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=554))
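- Note (mine, not a highlight): a quick check of why an odds ratio of 1.18 is not an 18% increase in risk. The two absolute risks (85% and 87%) come from the highlight; the variable names `risk_a` and `risk_b` are neutral labels of my own.

```python
def odds(p: float) -> float:
    """Convert a probability into odds, p / (1 - p)."""
    return p / (1 - p)


risk_a, risk_b = 0.85, 0.87  # absolute risks quoted in the highlight

odds_ratio = odds(risk_b) / odds(risk_a)
relative_risk = risk_b / risk_a
absolute_difference = risk_b - risk_a

print(f"Odds ratio:          {odds_ratio:.2f}")
print(f"Relative risk:       {relative_risk:.2f}")
print(f"Absolute difference: {absolute_difference * 100:.0f} percentage points")
```

- When the underlying risks are this high, the odds ratio sits far from the relative risk, so reading one as the other inflates the apparent effect.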
- Summary • Binary variables are yes/no questions, sets of which can be summarized as proportions. • Positive or negative framing of proportions can change their emotional impact. • Relative risks tend to convey an exaggerated importance, and absolute risks should be provided for clarity. • Expected frequencies promote understanding and an appropriate sense of importance. • Odds ratios arise from scientific studies but should not be used for general communication. • Graphics need to be chosen with care and awareness of their impact. ([Location 563](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=563))
- Galton titled his letter ‘Vox Populi’ (voice of the people), but this process of decision-making is now better known as the wisdom of crowds. ([Location 580](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=580))
- `But there is a problem with all these charts. The pattern of the points means all the attention is focused on the extremely high guesses, with the bulk of the numbers being squeezed into the left-hand end. Can we present the data in a more informative way? We could throw away the extremely high values as ridiculous (and when we originally analysed this data I rather arbitrarily excluded everything above 9,000). Alternatively we could transform the data in a way that reduces the impact of these extremes, say by plotting it on what is called a logarithmic scale, where the space between 100 and 1,000 is the same as the space between 1,000 and 10,000. `([Location 610](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=610))
- Variables which are recorded as numbers come in different varieties: • Count variables: where measurements are restricted to the integers 0, 1, 2… For example, the number of homicides each year, or guesses at the number of jelly beans in a jar. • Continuous variables: measurements that can be made, at least in principle, to arbitrary precision. For example, height and weight, each of which might vary both between people and from time to time. ([Location 619](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=619))
- `Mean-averages can be highly misleading when the raw data do not form a symmetric pattern around a central value but instead are skewed towards one side like the jelly-bean guesses, typically with a large group of standard cases but with a tail of a few either very high (for example, income) or low (for example, legs) values.` ([Location 641](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=641))
- Unfortunately, when an ‘average’ is reported in the media, it is often unclear whether this should be interpreted as the mean or median. ([Location 646](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=646))
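- Note (mine, not a highlight): a small illustration of how the mean and median part ways on skewed data. The income figures are hypothetical numbers I made up, chosen to have one long right tail.

```python
from statistics import mean, median

# Hypothetical incomes in thousands: mostly similar, with one very high value.
incomes = [20, 22, 25, 27, 30, 32, 35, 400]

mean_income = mean(incomes)
median_income = median(incomes)

print(f"Mean:   {mean_income}")
print(f"Median: {median_income}")
```

- Seven of the eight values sit below the mean here, so reporting it as 'the average' without saying which average would mislead.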
- Describing the Spread of a Data Distribution `It is not enough to give a single summary for a distribution—we need to have an idea of the spread, sometimes known as the variability.` ([Location 661](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=661))
- The range is a natural choice, but is clearly very sensitive to extreme values such as the apparently bizarre guess of 31,337 beans. In contrast the inter-quartile range (IQR) is unaffected by extremes. This is the distance between the 25th and 75th percentiles of the data and so contains the ‘central half’ of the numbers, in this case between 1,109 and 2,599 beans: the central ‘box’ of the box-and-whisker plots shown above covers the inter-quartile range. Finally the standard deviation is a widely used measure of spread. It is the most technically complex measure, but is only really appropriate for well-behaved symmetric data since it is also unduly influenced by outlying values. ([Location 665](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=665))
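- Note (mine, not a highlight): a sketch comparing the three measures of spread with and without one extreme value. The `typical` guesses are hypothetical; only the 31,337 guess comes from the highlight. I use `statistics.quantiles` with its default interpolation, which is one of several quartile conventions.

```python
from statistics import quantiles, stdev


def spread(data):
    """Return range, inter-quartile range and standard deviation of `data`."""
    q1, _, q3 = quantiles(data, n=4)  # 25th, 50th and 75th percentiles
    return {"range": max(data) - min(data), "iqr": q3 - q1, "sd": stdev(data)}


typical = [1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600]  # hypothetical
with_outlier = typical + [31337]  # the extreme guess mentioned in the book

base = spread(typical)
extreme = spread(with_outlier)

for name in base:
    print(f"{name:>5}: {base[name]:>10.1f} -> {extreme[name]:>10.1f}")
```

- In this small sample the IQR shifts only slightly (the quartiles are interpolated between neighbouring points), while the range and standard deviation grow by more than a factor of ten.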
- Large collections of numerical data are routinely summarized and communicated using a few statistics of location and spread, and the sexual-partner example has shown that these can take us a long way in grasping an overall pattern. However, there is no substitute for simply looking at data properly, and the next example shows that a good visualization is particularly valuable when we want to grasp the pattern in a large and complex set of numbers. ([Location 712](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=712))
- `It is convenient to use a single number to summarize a steadily increasing or decreasing relationship between the pairs of numbers shown on a scatter-plot. This is generally chosen to be the Pearson correlation coefficient, an idea originally proposed by Francis Galton but formally published in 1895 by Karl Pearson, one of the founders of modern statistics. A Pearson correlation runs between −1 and 1, and expresses how close to a straight line the dots or data-points fall. A correlation of 1 occurs if all the points lie on a straight line going upwards, while a correlation of −1 occurs if all the points lie on a straight line going downwards. A correlation near 0 can come from a random scatter of points, or any other pattern in which there is no systematic trend upwards or downwards, some examples of which are shown in Figure 2.6.` ([Location 738](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=738))
- `An alternative measure is called Spearman’s rank correlation after English psychologist Charles Spearman (who developed the idea of an underlying general intelligence), and depends only on the ranks of the data rather than their specific values. This means it can be near 1 or −1 if the points are close to a line that steadily increases or decreases, even if this line is not straight.` ([Location 747](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=747))
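- Note (mine, not a highlight): a minimal sketch of the Pearson/Spearman difference on a relationship that always increases but is not a straight line. The helper names and the data (`y = x**4`) are my own, and `ranks` assumes there are no tied values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: how close the points fall to a straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """Rank values from 1 (smallest) upwards; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


xs = list(range(1, 11))
ys = [x ** 4 for x in xs]  # steadily increasing, but strongly curved

print(f"Pearson:  {pearson(xs, ys):.3f}")
print(f"Spearman: {spearman(xs, ys):.3f}")
```

- Spearman reaches 1 because the ordering is perfectly preserved, while Pearson falls short of 1 because the points do not lie on a straight line.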
- Correlation coefficients are simply summaries of association, and cannot be used to conclude that there is definitely an underlying relationship between volume and survival rates, let alone why one might exist. In many applications the x-axis represents a quantity known as the independent variable, and interest focuses on its influence on the dependent variable plotted on the y-axis. But, as we shall explore further in Chapter 4 on causation, this presupposes the direction in which the influence might lie. ([Location 760](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=760))
- grouping the countries by continents allows us to immediately detect both general clusters and outlying cases. It is always valuable to split data according to a factor—here the continents—that explains some of the overall variability. ([Location 780](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=780))
- `Alberto Cairo has identified four common features of a good data visualization: 1. It contains reliable information. 2. The design has been chosen so that relevant patterns become noticeable. 3. It is presented in an attractive manner, but appearance should not get in the way of honesty, clarity and depth. 4. When appropriate, it is organized in a way that enables some exploration.` ([Location 797](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=797))
- This chapter has focused on summarizing and communicating data in an open and non-manipulative way; we do not want to influence our audiences’ emotions and attitudes, or convince them of a certain perspective. We just want to tell it how it is, or at least how it seems to be, and while we cannot ever claim to tell the absolute truth, we can at least try to be as truthful as possible. ([Location 818](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=818))
- `Even more advanced are dynamic graphics, in which movement can be used to reveal patterns in the changes over time. The master of this technique was Hans Rosling, whose TED talks and videos set a new standard of storytelling with statistics, for example by showing the relationship between changing wealth and health through the animated movement of bubbles representing each country’s progress from 1800 to the present day. Rosling used his graphics to try to correct misconceptions about the distinction between ‘developed’ and ‘undeveloped’ countries, with the dynamic plots revealing that, over time, almost all countries moved steadily along a common path towards greater health and prosperity.` ([Location 849](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=849))
- Summary • A variety of statistics can be used to summarize the empirical distribution of data-points, including measures of location and spread. • Skewed data distributions are common, and some summary statistics are very sensitive to outlying values. • Data summaries always hide some detail, and care is required so that important information is not lost. • Single sets of numbers can be visualized in strip-charts, box-and-whisker plots and histograms. • Consider transformations to better reveal patterns, and use the eye to detect patterns, outliers, similarities and clusters. • Look at pairs of numbers as scatter-plots, and time-series as line-graphs. • When exploring data, a primary aim is to find factors that explain the overall variation. • Graphics can be both interactive and animated. • Infographics highlight interesting features and can guide the viewer through a story, but should be used with awareness of their purpose and their impact. ([Location 860](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=860))
- Going from our sample (Stage 2) to the study population (Stage 3) is perhaps the most challenging step. We first need to be confident that the people asked to take part in the survey are a random sample from those who are eligible: ([Location 893](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=893))
- `we need to ask ourselves the question, ‘Learn about what?’ And this requires us to confront the challenging idea of **inductive inference**. Many people have a vague idea of deduction, thanks to Sherlock Holmes using deductive reasoning when he coolly announces that a suspect must have committed a crime. In real life deduction is the process of using the rules of cold logic to work from general premises to particular conclusions. If the law of the country is that cars should drive on the right, then we can deduce that on any particular occasion it is best to drive on the right. But induction works the other way, in taking particular instances and trying to work out general conclusions. For example, suppose we don’t know the customs in a community about kissing female friends on the cheek, and we have to try to work it out by observing whether people kiss once, twice, three times, or not at all. The crucial distinction is that deduction is logically certain, whereas induction is generally uncertain`. ([Location 910](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=910))
- * `To get the logarithm of a number x, we find the power of 10 that gives x, so that, for example, the logarithm of 1,000 is 3, since 10^3 = 1,000. Logarithmic transformations are particularly appropriate when it is reasonable to assume people are making ‘relative’ rather than ‘absolute’ errors, for example because we would expect people to get the answer wrong by a relative factor, say 20% in either direction, rather than being, say, 200 beans off the true count regardless of whether they are guessing a low or high value. `([Location 4726](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=4726))
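- Note (mine, not a highlight): a short check of the two properties of the base-10 logarithm described above. The function `log_error` and the true counts 500 and 5,000 are hypothetical choices of mine; the 20% factor is the one used in the footnote.

```python
from math import log10

# The logarithm of 1,000 is the power of 10 that gives 1,000.
log_of_thousand = log10(1_000)
print(log_of_thousand)

# On a logarithmic scale, 100 -> 1,000 takes the same space as 1,000 -> 10,000.
gap_low = log10(1_000) - log10(100)
gap_high = log10(10_000) - log10(1_000)
print(gap_low, gap_high)


def log_error(true_count: float, factor: float = 1.2) -> float:
    """Distance on the log scale between a guess that is `factor` times too high
    and the true count."""
    return log10(true_count * factor) - log10(true_count)


# A 20% overestimate is the same distance on the log scale for any true count.
print(log_error(500), log_error(5_000))
```

- This is why a log scale suits guesses that are wrong by a relative factor: the error looks the same size wherever it falls on the axis.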
- ### [x] New highlights added [[2020-09-14]] at 6:37 AM
- the adequacy of the sex survey depends on people giving the same or very similar answers to the same question each time they are asked, and this should not depend on the style of the interviewer or the vagaries of the respondent’s mood or memory. ([Location 933](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=933))
- The quality of the survey also requires the interviewees to be honest when they report their sexual activity, ([Location 935](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=935))
- A survey would not be valid if the questions were biased in favour of a particular response. ([Location 937](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=937))
- For example, in 2017 budget airline Ryanair announced that 92% of their passengers were satisfied with their flight experience. It turned out that their satisfaction survey only permitted the answers, ‘Excellent, very good, good, fair, OK’. ([Location 938](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=938))
- The responses to questions can also be influenced by what has been asked beforehand, a process known as priming. ([Location 947](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=947))
- internal validity: does the sample we observe accurately reflect what is going on in the group we are actually studying? This is where we come to the crucial way of avoiding bias: random sampling. ([Location 953](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=953))
- George Gallup, who essentially invented the idea of the opinion poll in the 1930s, came up with a fine analogy for the value of random sampling. He said that if you have cooked a large pan of soup, you do not need to eat it all to find out if it needs more seasoning. You can just taste a spoonful, provided you have given it a good stir. ([Location 960](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=960))
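- Note (mine, not a highlight): Gallup's soup analogy as a small simulation. The 'saltiness' population, the sample sizes and the seed are all hypothetical; sorting the population stands in for an unstirred pan where the salt has settled at one end.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical pan of soup: 100,000 'saltiness' values, left unstirred (sorted).
population = sorted(rng.uniform(0, 100) for _ in range(100_000))

stirred_spoonful = rng.sample(population, 1_000)  # random sample
unstirred_spoonful = population[-1_000:]          # scooped from one end only

print(f"Whole pan:          {mean(population):.1f}")
print(f"Stirred spoonful:   {mean(stirred_spoonful):.1f}")
print(f"Unstirred spoonful: {mean(unstirred_spoonful):.1f}")
```

- Both spoonfuls are the same size; only the random one tells you about the whole pan.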
- Going from study population (Stage 3) to target population (Stage 4): finally, even with perfect measurement and a meticulous random sample, the results may still not reflect what we wanted to investigate in the first place if we have not been able to ask the people in whom we are particularly interested. We want our study to have external validity. ([Location 973](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=973))
- When we have all the data, it is straightforward to produce statistics that describe what has been measured. But when we want to use the data to draw broader conclusions about what is going on around us, then the quality of the data becomes paramount, and we need to be alert to the kind of systematic biases that can jeopardize the reliability of any claims. ([Location 1005](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=1005))
- ### [x] New highlights added [[2020-12-20]] at 9:15 AM
- data distribution—the pattern the data makes, sometimes known as the empirical or sample distribution. Next we must tackle the concept of a population distribution—the pattern in the whole group of interest. ([Location 1014](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=1014))
- > Theory shows that the [[Normal distribution]] can be expected to occur for phenomena that are driven by large numbers of small influences, for example a complex physical trait that is not influenced by just a few genes. ([Location 1027](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=1027))
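- Note (mine, not a highlight): a rough simulation of a trait built from many small influences. The choice of 50 uniform influences per individual, the 10,000 individuals and the seed are arbitrary assumptions of mine. For a normal distribution, about 68% of values lie within one standard deviation of the mean and about 95% within two.

```python
import random
from statistics import mean, stdev

rng = random.Random(1)  # fixed seed so the sketch is repeatable


def trait(n_influences: int = 50) -> float:
    """A hypothetical trait: the sum of many small, independent influences."""
    return sum(rng.uniform(-1, 1) for _ in range(n_influences))


values = [trait() for _ in range(10_000)]
m, s = mean(values), stdev(values)

within_1sd = sum(abs(v - m) <= s for v in values) / len(values)
within_2sd = sum(abs(v - m) <= 2 * s for v in values) / len(values)

print(f"Within 1 SD: {within_1sd:.1%}")
print(f"Within 2 SD: {within_2sd:.1%}")
```

- Each individual influence here is uniform, not bell-shaped; the bell shape emerges only from adding many of them together.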
- terms such as mean and standard deviation are known as statistics when describing a set of data, and parameters when describing a population. ([Location 1043](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=1043))
- a population can be thought of as a physical group of individuals, but also as providing the probability distribution for a random observation. ([Location 1068](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=1068))
- Summary • Inductive inference requires working from our data, through study sample and study population, to a target population. • Problems and biases can crop up at each stage of this path. • The best way to proceed from sample to study population is to have drawn a random sample. • A population can be thought of as a group of individuals, but also as providing the probability distribution for a random observation drawn from that population. • Populations can be summarized using parameters that mirror the summary statistics of sample data. • Often data does not arise as a sample from a literal population. When we have all the data there is, then we can imagine it drawn from a metaphorical population of events that could have occurred, but didn’t. ([Location 1108](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=1108))
- A proper medical trial should ideally obey the following principles: 1. Controls: ([Location 1179](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=1179))
- 2. Allocation of treatment: ([Location 1183](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=1183))
- The best way to ensure this is by randomly allocating participants to be treated or not, and then seeing what happens to them—this is known as a randomized controlled trial (RCT). ([Location 1184](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=1184))
- 3. People should be counted in the groups to which they were allocated: ([Location 1189](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=1189))
- 4. If possible, people should not even know which group they are in: ([Location 1196](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=1196))
- 5. Groups should be treated equally: ([Location 1198](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=1198))
- 6. If possible, those assessing the final outcomes should not know which group the subjects are in: ([Location 1202](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=1202))
- 7. Measure everyone: ([Location 1203](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=1203))
- 8. Don’t rely on a single study: ([Location 1210](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=1210))
- 9. Review the evidence systematically: ([Location 1211](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=1211))
- make sure to include every study that has been done, and so create what is known as a systematic review. The results may then be formally combined in a meta-analysis. ([Location 1212](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=1212))
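- Note (mine, not a highlight): principle 2 above (random allocation of treatment) can be sketched as a shuffle followed by a split. The function `randomize`, the participant IDs and the seed are hypothetical; real trials use more careful schemes than this simple two-way split.

```python
import random


def randomize(participants, seed=None):
    """Randomly allocate participants to treatment or control (simple 50/50 split)."""
    rng = random.Random(seed)
    shuffled = list(participants)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


participants = [f"P{i:02d}" for i in range(1, 21)]  # hypothetical participant IDs
groups = randomize(participants, seed=2020)

print("Treatment:", sorted(groups["treatment"]))
print("Control:  ", sorted(groups["control"]))
```

- Because allocation depends only on the shuffle, nothing about a participant can influence which group they end up in.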