.. _3-3-measures-of-variability-range-variance-and-SD: .. raw:: html

Measures of Variability - Range, Variance, and Standard Deviation ==================================================================================== When describing a dataset, knowing where the center lies tells only half the story. Two datasets might share the same mean or median but look entirely different when plotted. This is because they differ in how widely the values are dispersed—their **variability** or **spread**. Measures of spread help us quantify this dispersion, providing a more complete picture of our data's characteristics. .. admonition:: Road Map 🧭 :class: important * Calculate and interpret the **range** as a simple spread measure. * Develop the concept of **deviations** from the mean. * Define and compute the **variance** and **standard deviation**. The Need for Measures of Variability --------------------------------------- When using only measures of central tendency (mean, median, or mode), we often lose important information about the data's distribution. For example: * Two countries might have the same mean family income, but one could have both greater wealth and greater poverty than the other. * Two classes might have the same average test score, but one might have consistent performance while the other has extreme high and low scores. * Two manufacturing processes might produce parts with the same average size, but one might have much tighter tolerances than the other. Consider the visualization below, which shows two distributions with identical means but different spreads: .. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter3/different-spreads-same-mean.jpg :alt: Two distributions with same mean but different spreads :align: center :width: 60% Two distributions with the same mean but different spreads To fully characterize these distributions, we need measures that quantify the dispersion of values around the center. Sample Range: The Simplest Measure ------------------------------------- The **sample range** is the most basic measure of spread—simply the difference between the maximum and minimum values in a dataset: .. math:: \text{Range} = \text{Maximum} - \text{Minimum} .. admonition:: Example 💡: Continuing with the pet counts data :class: note We continue to use the **pet counts data from Part 1 of Section 3.2.4**: .. math:: \{4, 8, 7, 9, 4, 3, 5, 1, 4\}. The range is :math:`9 - 1 = 8`. .. code-block:: r # Creating the dataset num_pets <- c(4, 8, 7, 9, 4, 3, 5, 1, 4) range_pets <- max(num_pets) - min(num_pets) range_pets # Returns 8 Limitations of Sample Range ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ While the range is easy to calculate and understand, it has significant limitations: * It depends only on the two most extreme values, ignoring all other observations. * It is highly sensitive to outliers. * Two very different distributions can have identical ranges. To illustrate this limitation, consider three different datasets that all have the same range and central tendencies: .. flat-table:: Data sets with the same range and mean but different distributions :width: 100% :header-rows: 0 :align: center * - **Set 1** - -15 - -1 - -0.5 - 0 - 0.5 - 1 - 15 * - **Set 2** - -15 - -3 - -1 - 0 - 1 - 3 - 15 * - **Set 3** - -15 - -10 - -5 - 0 - 5 - 10 - 15 .. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter3/same-range-different-distributions.jpg :alt: Three datasets with same range but different distributions :align: center :figwidth: 100% All three datasets have a range of 30 (from -15 to 15) and a mean of 0, but their distributions are clearly different. Set 1 has most values concentrated near the center with only a few extreme points, Set 2 is less concentrated, and Set 3 has values more evenly distributed across the range. This example demonstrates why we need measures that consider the dispersion of all values in the dataset, not just the extremes. Sample Variance and Sample Standard Deviation ------------------------------------------------ Deviations from the Mean ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To better measure spread, we need to consider how far each data point lies from a central value, typically the mean. This distance is called a **deviation**. For each observation :math:`x_i, i=1,\cdots,n`, its deviation from the sample mean is: .. math:: d_i = x_i - \bar{x} Deviations in Pet Counts Data ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Let's calculate the deviations for our pet counts data set using :math:`\bar{x} = 5`. The values are recorded in the third column of :numref:`sq-dev-table`. .. _sq-dev-table: .. flat-table:: Deviations and Squared Deviations from the Sample Mean :header-rows: 2 :align: center :width: 80% * - Object - Pet Counts Data - Deviation - Squared deviation * - Formula - :math:`x_i` - :math:`x_i - \bar{x}` - :math:`(x_i - \bar{x})^2` * - :rspan:`8` Value - :math:`1` - :math:`1-5=-4` - :math:`(1-5)^2=16` * - :math:`3` - :math:`3-5=-2` - :math:`(3-5)^2=4` * - :math:`4` - :math:`4-5=-1` - :math:`(4-5)^2=1` * - :math:`4` - :math:`4-5=-1` - :math:`(4-5)^2=1` * - :math:`4` - :math:`4-5=-1` - :math:`(4-5)^2=1` * - :math:`5` - :math:`5-5=0` - :math:`(5-5)^2=0` * - :math:`7` - :math:`7-5=2` - :math:`(7-5)^2=4` * - :math:`8` - :math:`8-5=3` - :math:`(8-5)^2=9` * - :math:`9` - :math:`9-5=4` - :math:`(9-5)^2=16` * - **Sum** - :math:`\sum_{i=1}^n x_i = n\bar{x} = 45` - :math:`\sum_{i=1}^n (x_i -\bar{x})=` :math:`\sum_{i=1}^n x_i -n\bar{x} = 0` - :math:`\sum_{i=1}^n (x_i -\bar{x})^2=52` From the final row of :numref:`sq-dev-table`, we note an important property of deviations; **they always sum to zero**. This makes it impossible for their average to serve as meaningful summary for a data set. Instead, we use the squared deviations so that only the magnitudes influence the summary, not their signs. See the right most column of :numref:`sq-dev-table` for the squared deviations of the pet counts data. .. admonition:: If signs are an issue, why not take absolute values? :class: important Indeed, variability metrics which use *absolute deviations* exist. However, those that use **squared deviations** are far more widely adopted because of their powerful theoretical properties. We will explore these properties throughout the semester. Sample Variance, :math:`s^2` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We compute the **sample variance**, denoted by :math:`s^2`, by taking the sum of all squared deviations, then dividing it by :math:`n-1`: .. math:: s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 The sample variance represents the *average* squared deviation from the mean, although we divide by :math:`n-1` rather than :math:`n`. While the full theoretical explanation is beyond the scope of this coures, this adjustment is made to correct for bias in the estimation. .. admonition:: Example 💡: Computing the Sample Variance :class: note Let's calculate the variance for the pet counts example. Most hard work has already been done in :numref:`sq-dev-table`. We take the sum of the final column, then divide by :math:`n-1`. .. math:: s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \frac{52}{9-1} = 6.5 Using R, .. code-block:: r num_pets <- c(4, 8, 7, 9, 4, 3, 5, 1, 4) var(num_pets) # Returns 6.5 Sample Standard Deviation, :math:`s` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ While the sample variance is mathematically useful, it has a practical drawback--it's expressed in the squared scale of the original units, making interpretations difficult. We return to the original units by taking the positive square root of the sample variance and call this the **sample standard deviation**: .. math:: s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}. .. admonition:: Example 💡: Computing the sample standard deviation :class: note For the pet counts example, .. math:: s = \sqrt{6.5} \approx 2.55. On average, the number of pets deviates from the mean by about 2.55 pets. .. code-block:: r num_pets <- c(4, 8, 7, 9, 4, 3, 5, 1, 4) sd(num_pets) # Returns 2.55 Properties of Variance and Standard Deviation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #. They are always non-negative. #. They equal zero only when all data values are identical. #. They increase as the spread of the data increases. #. The two measures always increase and decrease together. Revisiting the Three Datasets with a Shared Range --------------------------------------------------- Let's return to our three datasets with identical ranges and means but different distributions: .. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter3/same-range-different-distributions.jpg :alt: Three datasets with same range but different distributions :align: center :figwidth: 100% .. flat-table:: Data sets with the same range and mean but different distributions :width: 100% :header-rows: 1 :align: center * - Data set - :cspan:`6` Data values - Variance * - **1** - -15 - -1 - -0.5 - 0 - 0.5 - 1 - 15 - 75.42 * - **2** - -15 - -3 - -1 - 0 - 1 - 3 - 15 - 78.33 * - **3** - -15 - -10 - -5 - 0 - 5 - 10 - 15 - 116.67 Although all three datasets have the same sample range (30) and sample mean (0), their variances differ substantially. Set 1, with most points concentrated near the center, has the smallest variance. Set 3, with points more spread out, has the largest variance. This illustrates how variance and standard deviation capture differences in distribution that the range misses. Impact of extreme values ------------------------------ Let us compute the sample range, variance, and standard deviation for the *updated* pet counts data from Part 2 of Section 3.2.4. .. flat-table:: :width: 90% :align: center :header-rows: 1 * - :cspan:`8` New pet counts data * - 1 - 3 - 4 - 4 - 4 - 5 - 7 - 8 - 9 → 19 1. **Sample range** .. math:: 19 - 1 = 18. 2. **Sample variance and sample standard deviation** In Part 2 of Section 3.2.4, we computed the new sample mean as :math:`\bar{x}=6.11`. We must re-compute the squared deviations for all data points using this new value: .. flat-table:: Deviations and Squared Deviations from the Sample Mean :header-rows: 2 :align: center :width: 80% * - Object - Updated Pet Counts Data - Updated Squared deviation * - Formula - :math:`x_i` - :math:`(x_i - \bar{x})^2` * - :rspan:`8` Value - :math:`1` - :math:`(1-6.11)^2=26.12` * - :math:`3` - :math:`(3-6.11)^2=9.68` * - :math:`4` - :math:`(4-6.11)^2=4.46` * - :math:`4` - :math:`(4-6.11)^2=4.46` * - :math:`4` - :math:`(4-6.11)^2=4.46` * - :math:`5` - :math:`(5-6.11)^2=1.23` * - :math:`7` - :math:`(7-6.11)^2=0.79` * - :math:`8` - :math:`(8-6.11)^2=3.57` * - :math:`19` - :math:`(9-6.11)^2=166.12` * - **Sum** - :math:`\sum_{i=1}^n x_i = n\bar{x} = 55` - :math:`\sum_{i=1}^n (x_i -\bar{x})^2=220.8889` Then the sample variance is .. math:: s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2 = \frac{220.8889}{9} = 27.6111, and the sample standard deviation is .. math:: s = \sqrt{s^2} = \sqrt{27.6111} = 5.2546. .. flat-table:: :width: 80% :align: center :widths: 2 2 1 2 :header-rows: 2 * - :cspan:`3` How did the measures of spread change? * - Measure - Before update - → - After update * - Sample range - 8 - → - 18 * - Sample variance - 6.5 - → - 27.6111 * - Sample standard deviation - 2.55 - → - 5.2546 We note that all three measures increased in value after an extreme value of 19 was added to the data set. Between the sample range and sample standard deviation—both measured on the same scale as the data—the impact was weaker on the standard deviation. This is because the standard deviation incorporates all data points in its calculation, whereas the sample range depends only on the extremes. The increase in the sample variance is the most dramatic, but this is because it is computed on the squared scale. Bringing It All Together -------------------------- .. admonition:: Key Takeaways 📝 :class: important 1. Central tendency measures alone don't fully describe a dataset; we also need measures of spread. 2. The **range** (max - min) is the simplest spread measure but depends only on the extreme values, which often have the least representative power of the data. 3. **Deviations** from the mean always sum to zero. Therefore, we construct a measure of spread with **squared deviations**. 4. The **sample variance** (:math:`s^2`) is the *average* squared deviation from the mean. 5. The **sample standard deviation** (:math:`s`) is the square root of the sample variance, returning to the original units of measurement. 6. The three measures are sensitive to extreme values. Exercises ~~~~~~~~~~~~~~ #. **Conceptual Understanding**: Two statistics classes have the same mean score of 75 on an exam. * Class A has a standard deviation of 5 points. * Class B has a standard deviation of 15 points. Explain what this tells you about the score distributions in each class. Which class had more consistent performance? #. **Calculating Spread**: For the dataset {15, 18, 22, 24, 30, 34, 35, 35, 38, 42, 48}: a) Calculate the range. b) Calculate the variance and standard deviation. c) Interpret the standard deviation in the context of the data. #. **Comparing Datasets**: Datasets X and Y both have a mean of 50, but X has a range of 20 and Y has a range of 40. a) What can you conclude about their relative spread? b) Is it possible for X to have a larger standard deviation than Y? Explain. #. **Challenge Problem**: Create three different datasets, each with 5 values, that have: a) The same sample mean of 10 but different standard deviations. Describe your strategy. b) The same standard deviation of 2 but different means. Describe your strategy.