- Last Updated:
- [[2021-01-26]]
- [[2021-01-22]]
- [[2020-12-28]]
- Describing the spread of data is a step beyond simply stating summary statistics, because datasets with seemingly identical summary statistics can have significantly different spreads and implications. (See [[The Datasaurus Dozen]])
- Note that in [[Data science]], a __data distribution__ refers to the distribution of the empirical or sample data, as compared to the __population distribution__ of the target group (in load testing, the population would be production data).
- Metrics are called [[Statistics]] when they describe sample data, and parameters when they describe a population.
- # Metrics describing spread
- Visualizing data is a given, but here are some metrics that describe the data spread.
- [[Data Range]] (1-12,300)
- "+: Describes the minimum and maximum"
- "-: Exaggerates the importance of outliers"
- [[Inter-quartile range]]
- "Distance between the 25th and 75th percentile, or the middle half of the data"
- "+: Deals with outliers better"
- "Generally used with the box-and-whisker plots"
- [[Standard Deviation]]
- "Mean of 100, with a std. dev. of 24"
- "-: Still heavily influenced by outliers"
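- A minimal sketch of the three spread metrics above, using only Python's standard `statistics` module. The sample values and the `spread_summary` helper are made up for illustration, and the `"inclusive"` quartile method is one assumption among several common conventions. The point is to show how a single outlier inflates the range and standard deviation while the inter-quartile range barely moves.

```python
import statistics


def spread_summary(values):
    """Return the range, inter-quartile range, and sample standard deviation."""
    ordered = sorted(values)
    # quantiles(n=4) returns the three cut points: Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "range": ordered[-1] - ordered[0],    # max minus min
        "iqr": q3 - q1,                       # width of the middle half
        "stdev": statistics.stdev(ordered),   # sample standard deviation
    }


# Hypothetical sample data: the second list adds one extreme value.
clean = [10, 12, 13, 15, 16, 18, 20]
with_outlier = clean + [500]

clean_summary = spread_summary(clean)
outlier_summary = spread_summary(with_outlier)

for name, summary in (("clean", clean_summary), ("with outlier", outlier_summary)):
    print(name, summary)
```

- Comparing the two printed summaries, the range and standard deviation jump by a large factor once the outlier is added, while the inter-quartile range changes only slightly.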
- # Describing a relationship between variables
- [[Pearson correlation coefficient]] (between -1 and 1, where -1 is a perfect inverse linear relationship, 1 is a perfect direct linear relationship, and values near 0 indicate no linear trend)
- ""`It is convenient to use a single number to summarize a steadily increasing or decreasing relationship between the pairs of numbers shown on a scatter-plot. This is generally chosen to be the Pearson correlation coefficient, an idea originally proposed by Francis Galton but formally published in 1895 by Karl Pearson, one of the founders of modern statistics.* A Pearson correlation runs between −1 and 1, and expresses how close to a straight line the dots or data-points fall. A correlation of 1 occurs if all the points lie on a straight line going upwards, while a correlation of −1 occurs if all the points lie on a straight line going downwards. A correlation near 0 can come from a random scatter of points, or any other pattern in which there is no systematic trend upwards or downwards, some examples of which are shown in Figure 2.6. `([Location 738](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=738))""
- [[Spearman's rank correlation]]
- "Like Pearson correlation coefficient, but using ranks instead of the actual values to dampen outlier impact"
- ""`An alternative measure is called Spearman’s rank correlation after English psychologist Charles Spearman (who developed the idea of an underlying general intelligence), and depends only on the ranks of the data rather than their specific values. This means it can be near 1 or −1 if the points are close to a line that steadily increases or decreases, even if this line is not straight`; ([Location 747](https://readwise.io/to_kindle?action=open&asin=B07N6D73FZ&location=747))""
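- A rough sketch contrasting the two correlation measures above, written from scratch so it has no dependencies. The function names `pearson`, `ranks`, and `spearman` and the sample data are illustrative assumptions, not a library API. Spearman is computed here as the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: how close the points fall to a straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical data: y rises steadily with x, but along a curve.
xs = [1, 2, 3, 4, 5, 6]
ys = [x ** 3 for x in xs]

r_pearson = pearson(xs, ys)
r_spearman = spearman(xs, ys)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

- Because the relationship is steadily increasing but not straight, Spearman comes out at 1 while Pearson stays below 1, which matches the distinction drawn in the highlight above.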
- # Visualizing Spread
- Normal distribution chart