- Tags:: [[Statistics]] [[Load Testing]] #[[Load testing is a data science]]
- Source: [[The Art of Statistics]]
- Author: [[David Spiegelhalter]]
- Date Created: [[2020-09-07]]
- Statistics can be used in two ways:
- To visualize data and the connections between data points that support our hypothesis; that is, we start from a conclusion and marshal the relevant data to prove or disprove it.
- However, there is also a more exploratory side to statistics, forensic statistics, in which we start with the data and see what conclusions we can draw from it. This is the type of statistics we do when we don't know what we're looking for.
- "`We can think of this type of iterative, exploratory work as ‘forensic’ statistics, and in this case it was literally true. There is no mathematics, no theory, just a search for patterns that might lead to more interesting questions.` ([Location 247](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=247))"
- [[Load testing is a data science]]
- In load testing, we end up with a lot of data. We know in general that we want to draw conclusions about the performance of the application, but there is no single metric called "performance": many of the metrics we typically gather say something about it, which is why we often don't know exactly what we're looking for.
- This is why it's so important to learn how to manipulate data, so that we are free to mix it, combine it, change its form, and examine its relationship to other data, all without compromising its integrity (see the sketch below).
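- A minimal sketch of this kind of manipulation, assuming pandas and a hypothetical per-request export (the file and column names are made up, not from the book):
```python
import pandas as pd

# Hypothetical raw load-test export, one row per request.
# File name and columns ("transaction", "response_time_ms") are assumptions.
df = pd.read_csv("loadtest_results.csv")

# Reshape the same data several ways without altering the underlying values:
# an overall summary with the percentiles that matter for latency...
print(df["response_time_ms"].describe(percentiles=[0.5, 0.9, 0.95, 0.99]))

# ...and the same numbers regrouped by transaction to expose slow endpoints.
print(df.groupby("transaction")["response_time_ms"].median().sort_values())
```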
- "Data is imperfect (because data gathering methods are imperfect)"
- There are always elements of data that are created on the basis of human judgment and emotion, so data can never be fully trusted.
- Data is a snapshot at a given moment in time and place, so it is difficult to make inferences about issues that affect other times and places.
- [[The PPDAC Cycle of Data Science]]
- Problem
- Plan
- Data
- Analysis
- Conclusion
- The way data is communicated in a chart can easily mislead others, so we need to make sure that the message we send reflects the facts. Decisions we think of as purely cosmetic can cause our data to be misinterpreted by people who aren't statisticians. ^73b378
- "`A table can be considered as a type of graphic, and requires careful design choices of colour, font and language to ensure engagement and readability. The audience’s emotional response to the table may also be influenced by the choice of which columns to display.` ([Location 432](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=432))"
- "`Table 1.1 shows the results in terms of both survivors and deaths, but in the US mortality rates from child heart surgery are reported, while the UK provides survival rates. This is known as negative or positive framing, and its overall effect on how we feel is intuitive and well-documented: ‘5% mortality’ sounds worse than ‘95% survival’. Reporting the actual number of deaths as well as the percentage can also increase the impression of risk, as this total might then be imagined as a crowd of real people. `([Location 434](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=434))"
- "`Note the two tricks used to manipulate the impact of this statistic: convert from a positive to a negative frame, and then turn a percentage into actual numbers of people. Ideally both positive and negative frames should be presented if we want to provide impartial information, although the order of columns might still influence how the table is interpreted.` ([Location 447](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=447))"
- "`Choosing the start of the axis therefore presents a dilemma. Alberto Cairo, author of influential books on data visualization,3 suggests you should always begin with a ‘logical and meaningful baseline’,` ([Location 460](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=460))"
- Binary variables are answers to yes/no questions
- Categorical variables are measures that take two or more categories:
- Unordered categories (country)
- Ordered categories (rank)
- Grouped numbers (BMI); a small sketch of all three kinds follows below
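- A pandas sketch of these variable types (the rows and BMI bands are illustrative, not from the book):
```python
import pandas as pd

# Illustrative rows; the values are made up.
df = pd.DataFrame({
    "country": ["UK", "US", "UK", "FR"],       # unordered categories
    "rank": ["low", "high", "medium", "low"],  # ordered categories
    "bmi": [17.9, 22.4, 27.5, 31.0],           # continuous values to be grouped
})

df["country"] = pd.Categorical(df["country"])  # no inherent ordering
df["rank"] = pd.Categorical(
    df["rank"], categories=["low", "medium", "high"], ordered=True)

# Grouped numbers: bin the continuous measure into conventional BMI bands.
df["bmi_band"] = pd.cut(df["bmi"], bins=[0, 18.5, 25, 30, 100],
                        labels=["underweight", "normal", "overweight", "obese"])
print(df)
```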
- Pie charts
- Can be good for categorical variables, but only when there are few categories; otherwise they become visually confusing.
- Beware of software that renders pie charts in 3D; slices at the front can look bigger than they really are. 2D is better for pie charts.
- Advantage: showing a rough overview of the whole
- Disadvantage: not great for comparisons
- Bar charts
- Great for comparisons; the sketch below contrasts the two
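- A matplotlib sketch contrasting pie and bar for the same data (the request counts are made up):
```python
import matplotlib.pyplot as plt

# Illustrative category counts (made-up data) shown both ways.
labels = ["GET /home", "GET /search", "POST /order", "GET /help"]
counts = [420, 310, 180, 90]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

# Pie: gives a rough sense of each share of the whole (kept 2D).
ax1.pie(counts, labels=labels, autopct="%1.0f%%")
ax1.set_title("Share of requests")

# Bar: makes category-to-category comparison much easier.
ax2.bar(labels, counts)
ax2.set_title("Requests per endpoint")
ax2.tick_params(axis="x", rotation=30)

plt.tight_layout()
plt.show()
```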
- Text
- Relative vs absolute percentages
- Always use absolute percentages (X is 80% of the whole) rather than relative ones (X is 18% higher __than Y__), because relative percentages are harder to understand and are frequently reported as if they were absolute percentages anyway.
- "1 in x people develop cancer".
- Advantage: Easy to relate to on its own
- Disadvantage: Hard to compare with other "1 in Y" statements because mental arithmetic is involved (see the sketch below).
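- A tiny worked sketch (made-up rates) of how the same change reads in each framing:
```python
# Illustrative rates only: the same change described three ways.
baseline, observed = 0.10, 0.118   # e.g. error rates of 10% and 11.8%

absolute_pts = (observed - baseline) * 100   # 1.8 percentage points
relative = (observed - baseline) / baseline  # 0.18 -> "18% higher than baseline"
one_in_x = 1 / observed                      # ~8.5 -> "about 1 in 8"

print(f"+{absolute_pts:.1f} percentage points, "
      f"+{relative:.0%} relative to baseline, "
      f"roughly 1 in {one_in_x:.0f}")
```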
- Dealing with outliers
- Plotting the data on a logarithmic scale reduces the impact of outliers (sketched after the quotes below). [[Requires Research]]
- "`But there is a problem with all these charts. The pattern of the points means all the attention is focused on the extremely high guesses, with the bulk of the numbers being squeezed into the left-hand end. Can we present the data in a more informative way? We could throw away the extremely high values as ridiculous (and when we originally analysed this data I rather arbitrarily excluded everything above 9,000). Alternatively we could transform the data in a way that reduces the impact of these extremes, say by plotting it on what is called a logarithmic scale, where the space between 100 and 1,000 is the same as the space between 1,000 and 10,000. `([Location 610](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=610))"
- "* `To get the logarithm of a number x, we find the power of 10 that gives x, so that, for example, the logarithm of 1,000 is 3, since 103 = 1,000. Logarithmic transformations are particularly appropriate when it is reasonable to assume people are making ‘relative’ rather than ‘absolute’ errors, for example because we would expect people to get the answer wrong by a relative factor, say 20% in either direction, rather than being, say, 200 beans off the true count regardless of whether they are guessing a low or high value. `([Location 4726](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=4726))"
- Variables recorded as numbers can be:
- Count variables: recorded as integers (ex: number of homicides)
- Continuous variables
- recorded as decimals and are generally more precise
- However, continuous variables are also just a snapshot of a current state (ex: a person's weight depends on when the measurement was taken)
- Graphing basics ^5207ac
- X is usually the "independent variable", the presumed cause, and Y is the "dependent variable" that we examine to see how strongly X affects it.
- However, this convention means we have to decide in advance which variable affects the other, which is easy to get wrong.
- The problem with averages
- The "average" measurement differs depending on which of the following was actually used to calculate it:
- Mean
- Median
- Mode
- Averages are often used as shorthand to summarize the findings of a study, but they're not enough on their own: we should also look at the spread of the data. Outliers can shift the averages substantially and distort how the data is perceived (see the sketch after the quote below).
- "`Mean-averages can be highly misleading when the raw data do not form a symmetric pattern around a central value but instead are skewed towards one side like the jelly-bean guesses, typically with a large group of standard cases but with a tail of a few either very high (for example, income) or low (for example, legs) values.` ([Location 641](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=641))"
- Describing the spread of a data distribution
- "Describing the Spread of a Data Distribution `It is not enough to give a single summary for a distribution—we need to have an idea of the spread, sometimes known as the variability.` ([Location 661](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=661))"
- Range (1-12,300)
- +: Describes the minimum and maximum
- -: Exaggerates the importance of outliers
- Inter-quartile range [[Requires Research]]
- Distance between the 25th and 75th percentiles, i.e. the middle half of the data
- +: Deals with outliers better
- Generally used with the box-and-whisker plots
- Standard deviation [[Requires Research]]
- Example: a mean of 100 with a standard deviation of 24
- -: Still heavily influenced by outliers (a sketch computing all three spread measures follows below)
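- A sketch computing all three spread measures on the same invented sample as above:
```python
import numpy as np

# Same illustrative skewed sample as in the averages sketch.
data = np.array([95, 100, 102, 98, 105, 99, 101, 97, 100, 4_000])

data_range = data.max() - data.min()   # exaggerated by the single outlier
q25, q75 = np.percentile(data, [25, 75])
iqr = q75 - q25                        # middle half, robust to the outlier
std = data.std(ddof=1)                 # sample standard deviation, still inflated

print(f"range: {data_range}, IQR: {iqr:.1f}, std dev: {std:.1f}")
```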
- Describing a relationship between variables
- Pearson correlation coefficient (between -1 and 1, where -1 is inversely related and 1 is directly related)
- "`It is convenient to use a single number to summarize a steadily increasing or decreasing relationship between the pairs of numbers shown on a scatter-plot. This is generally chosen to be the Pearson correlation coefficient, an idea originally proposed by Francis Galton but formally published in 1895 by Karl Pearson, one of the founders of modern statistics.* A Pearson correlation runs between −1 and 1, and expresses how close to a straight line the dots or data-points fall. A correlation of 1 occurs if all the points lie on a straight line going upwards, while a correlation of −1 occurs if all the points lie on a straight line going downwards. A correlation near 0 can come from a random scatter of points, or any other pattern in which there is no systematic trend upwards or downwards, some examples of which are shown in Figure 2.6. `([Location 738](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=738))"
- Spearman's rank correlation
- Like the Pearson correlation coefficient, but computed on ranks instead of the actual values, which dampens the impact of outliers (see the sketch below)
- "`An alternative measure is called Spearman’s rank correlation after English psychologist Charles Spearman (who developed the idea of an underlying general intelligence), and depends only on the ranks of the data rather than their specific values. This means it can be near 1 or −1 if the points are close to a line that steadily increases or decreases, even if this line is not straight`; ([Location 747](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=747))"
- When analyzing data:
- Collection:
- Is it a large enough sample?
- Was it collected in a way that it could accurately describe the target population?
- Was the sample random?
- Look for relationships between variables.
- Transform the data to explain variations (see the sketch after this list):
- categorizing by transaction, load generator region, and systems involved
- using logarithmic scale
- Communicate data in a non-biased and non-manipulative way.
- Describe it: Create a summary including averages AND the spread.
- Pay attention to what's on the X and Y axes, the scale of the axes, and the colors that will highlight certain things over others
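- A sketch of these transformation steps in pandas (the file and column names are hypothetical load-test fields, not from the book):
```python
import numpy as np
import pandas as pd

# Hypothetical load-test results; column names are assumptions.
df = pd.read_csv("loadtest_results.csv")  # transaction, region, response_time_ms

# Transform: log response times so that extreme outliers dominate less.
df["log_rt_ms"] = np.log10(df["response_time_ms"])

# Categorize, then describe with averages AND spread, per transaction and region.
summary = (df.groupby(["transaction", "region"])["response_time_ms"]
             .agg(["median", "mean", "std", "count"]))
print(summary)
```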
- [[Data Visualization]]
- A good data visualization contains reliable information, is designed so that relevant patterns become noticeable, is attractive without sacrificing honesty and clarity, and, when appropriate, invites exploration.
- "`Alberto Cairo has identified four common features of a good data visualization: 1. It contains reliable information. 2. The design has been chosen so that relevant patterns become noticeable. 3. It is presented in an attractive manner, but appearance should not get in the way of honesty, clarity and depth. 4. When appropriate, it is organized in a way that enables some exploration.` The ([Location 797](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=797))"
- "`Even more advanced are dynamic graphics, in which movement can be used to reveal patterns in the changes over time. The master of this technique was Hans Rosling, whose TED talks and videos set a new standard of storytelling with statistics, for example by showing the relationship between changing wealth and health through the animated movement of bubbles representing each country’s progress from 1800 to the present day. Rosling used his graphics to try to correct misconceptions about the distinction between ‘developed’ and ‘undeveloped’ countries, with the dynamic plots revealing that, over time, almost all countries moved steadily along a common path towards greater health and prosperity.` ([Location 849](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=849))" ^be1752
- Inductive vs deductive reasoning
- Deductive reasoning is strictly logical: it begins with a premise and then draws particular conclusions that follow logically from it.
- Inductive reasoning should still be logical, but unlike deduction it is generally uncertain: it begins with the data and looks for general conclusions and premises that can be drawn from it.
- Load testing is generally inductive.
- "`we need to ask ourselves the question, ‘Learn about what?’ And this requires us to confront the challenging idea of **inductive inference**. Many people have a vague idea of deduction, thanks to Sherlock Holmes using deductive reasoning when he coolly announces that a suspect must have committed a crime. In real life deduction is the process of using the rules of cold logic to work from general premises to particular conclusions. If the law of the country is that cars should drive on the right, then we can deduce that on any particular occasion it is best to drive on the right. But induction works the other way, in taking particular instances and trying to work out general conclusions. For example, suppose we don’t know the customs in a community about kissing female friends on the cheek, and we have to try to work it out by observing whether people kiss once, twice, three times, or not at all. The crucial distinction is that deduction is logically certain, whereas induction is generally uncertain`. ([Location 910](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=910))"