.. _3-3-measures-of-variability-range-variance-and-SD:
Measures of Variability - Range, Variance, and Standard Deviation
====================================================================================
When describing a dataset, knowing where the center lies tells only half the story.
Two datasets might share the same mean or median but look entirely different when plotted.
This is because they differ in how widely the values are dispersed—their **variability** or
**spread**. Measures of spread help us quantify this dispersion, providing a more complete
picture of our data's characteristics.
.. admonition:: Road Map 🧭
:class: important
* Calculate and interpret the **range** as a simple spread measure.
* Develop the concept of **deviations** from the mean.
* Define and compute the **variance** and **standard deviation**.
The Need for Measures of Variability
---------------------------------------
When using only measures of central tendency (mean, median, or mode),
we often lose important information about the data's distribution. For example:
* Two countries might have the same mean family income, but one could have both greater
wealth and greater poverty than the other.
* Two classes might have the same average test score, but one might have consistent
performance while the other has extreme high and low scores.
* Two manufacturing processes might produce parts with the same average size,
but one might have much tighter tolerances than the other.
Consider the visualization below, which shows two distributions with identical means
but different spreads:
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter3/different-spreads-same-mean.jpg
:alt: Two distributions with same mean but different spreads
:align: center
:width: 60%
Two distributions with the same mean but different spreads
To fully characterize these distributions, we need measures that quantify the
dispersion of values around the center.
Sample Range: The Simplest Measure
-------------------------------------
The **sample range** is the most basic measure of spread—simply the difference between
the maximum and minimum values in a dataset:
.. math::
\text{Range} = \text{Maximum} - \text{Minimum}
.. admonition:: Example 💡: Continuing with the pet counts data
:class: note
We continue to use the **pet counts data from Part 1 of Section 3.2.4**:
.. math:: \{4, 8, 7, 9, 4, 3, 5, 1, 4\}.
The range is :math:`9 - 1 = 8`.
.. code-block:: r
# Creating the dataset
num_pets <- c(4, 8, 7, 9, 4, 3, 5, 1, 4)
range_pets <- max(num_pets) - min(num_pets)
range_pets # Returns 8
Limitations of Sample Range
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
While the range is easy to calculate and understand, it has significant limitations:
* It depends only on the two most extreme values, ignoring all other observations.
* It is highly sensitive to outliers.
* Two very different distributions can have identical ranges.
To illustrate this limitation, consider three different datasets that all have the
same range and central tendencies:
.. flat-table:: Data sets with the same range and mean but different distributions
:width: 100%
:header-rows: 0
:align: center
* - **Set 1**
- -15
- -1
- -0.5
- 0
- 0.5
- 1
- 15
* - **Set 2**
- -15
- -3
- -1
- 0
- 1
- 3
- 15
* - **Set 3**
- -15
- -10
- -5
- 0
- 5
- 10
- 15
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter3/same-range-different-distributions.jpg
:alt: Three datasets with same range but different distributions
:align: center
:figwidth: 100%
All three datasets have a range of 30 (from -15 to 15) and a mean of 0, but their
distributions are clearly different. Set 1 has most values concentrated near the
center with only a few extreme points, Set 2 is less concentrated, and Set 3 has
values more evenly distributed across the range.
This example demonstrates why we need measures that consider the dispersion of all
values in the dataset, not just the extremes.
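The claim that the three sets share a range and a mean can be checked directly in R. The following is a minimal sketch; the variable names (``set1``, ``sets``, ``ranges``, ``means``) are chosen here for illustration only.

.. code-block:: r

   # The three data sets from the table above
   set1 <- c(-15, -1, -0.5, 0, 0.5, 1, 15)
   set2 <- c(-15, -3, -1, 0, 1, 3, 15)
   set3 <- c(-15, -10, -5, 0, 5, 10, 15)
   sets <- list(set1 = set1, set2 = set2, set3 = set3)

   # Range (maximum minus minimum) and mean of each set
   ranges <- sapply(sets, function(x) max(x) - min(x))
   means <- sapply(sets, mean)

   ranges  # 30 for every set
   means   # 0 for every set

Neither summary distinguishes the three sets, even though their values are arranged very differently.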
Sample Variance and Sample Standard Deviation
------------------------------------------------
Deviations from the Mean
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To better measure spread, we need to consider how far each data point lies
from a central value, typically the mean. This distance is called a **deviation**.
For each observation :math:`x_i`, :math:`i=1,\ldots,n`, its deviation from the sample mean is:
.. math::
d_i = x_i - \bar{x}
Deviations in Pet Counts Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Let's calculate the deviations for our pet counts data set
using :math:`\bar{x} = 5`. The values are recorded in the third column of
:numref:`sq-dev-table`.
.. _sq-dev-table:
.. flat-table:: Deviations and Squared Deviations from the Sample Mean
:header-rows: 2
:align: center
:width: 80%
* - Object
- Pet Counts Data
- Deviation
- Squared deviation
* - Formula
- :math:`x_i`
- :math:`x_i - \bar{x}`
- :math:`(x_i - \bar{x})^2`
* - :rspan:`8` Value
- :math:`1`
- :math:`1-5=-4`
- :math:`(1-5)^2=16`
* - :math:`3`
- :math:`3-5=-2`
- :math:`(3-5)^2=4`
* - :math:`4`
- :math:`4-5=-1`
- :math:`(4-5)^2=1`
* - :math:`4`
- :math:`4-5=-1`
- :math:`(4-5)^2=1`
* - :math:`4`
- :math:`4-5=-1`
- :math:`(4-5)^2=1`
* - :math:`5`
- :math:`5-5=0`
- :math:`(5-5)^2=0`
* - :math:`7`
- :math:`7-5=2`
- :math:`(7-5)^2=4`
* - :math:`8`
- :math:`8-5=3`
- :math:`(8-5)^2=9`
* - :math:`9`
- :math:`9-5=4`
- :math:`(9-5)^2=16`
* - **Sum**
- :math:`\sum_{i=1}^n x_i = n\bar{x} = 45`
- :math:`\sum_{i=1}^n (x_i -\bar{x})=`
:math:`\sum_{i=1}^n x_i -n\bar{x} = 0`
- :math:`\sum_{i=1}^n (x_i -\bar{x})^2=52`
From the final row of :numref:`sq-dev-table`, we note an important property of deviations:
**they always sum to zero**. Their average is therefore always zero as well, so it cannot
serve as a meaningful summary of spread for a data set.
Instead, we use the squared deviations so that only the magnitudes influence the summary,
not their signs. See the rightmost column of :numref:`sq-dev-table` for the squared deviations
of the pet counts data.
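The deviation columns of :numref:`sq-dev-table` can be reproduced in R. This is a small sketch in which ``x_bar``, ``deviations``, and ``squared_deviations`` are names introduced here for illustration.

.. code-block:: r

   num_pets <- c(4, 8, 7, 9, 4, 3, 5, 1, 4)

   # Sample mean of the pet counts
   x_bar <- mean(num_pets)              # 5

   # Deviation of each observation from the mean
   deviations <- num_pets - x_bar
   sum(deviations)                      # 0

   # Squaring removes the signs, so the sum no longer cancels
   squared_deviations <- deviations^2
   sum(squared_deviations)              # 52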
.. admonition:: If signs are an issue, why not take absolute values?
:class: important
Indeed, variability metrics which use *absolute deviations* exist.
However, those that use **squared deviations** are far more widely
adopted because of their powerful theoretical properties.
We will explore these properties throughout the semester.
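For comparison, one such absolute-deviation metric is the mean absolute deviation, the average of :math:`|x_i - \bar{x}|`. The sketch below computes it by hand for the pet counts data; the name ``mean_abs_dev`` is introduced here for illustration and is not a built-in R function.

.. code-block:: r

   num_pets <- c(4, 8, 7, 9, 4, 3, 5, 1, 4)

   # Average distance from the mean, ignoring the sign of each deviation
   mean_abs_dev <- mean(abs(num_pets - mean(num_pets)))
   mean_abs_dev  # 2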
Sample Variance, :math:`s^2`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We compute the **sample variance**, denoted by :math:`s^2`,
by taking the sum of all squared deviations, then
dividing it by :math:`n-1`:
.. math::
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2
The sample variance represents the *average* squared deviation from the mean,
although we divide by :math:`n-1` rather than :math:`n`. While the
full theoretical explanation is beyond the scope of this course,
this adjustment is made to correct for bias in the estimation.
.. admonition:: Example 💡: Computing the Sample Variance
:class: note
Let's calculate the variance for the pet counts example.
Most of the hard work has already been done in :numref:`sq-dev-table`. We take the
sum of the final column, then divide by :math:`n-1`.
.. math::
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \frac{52}{9-1} = 6.5
Using R,
.. code-block:: r
num_pets <- c(4, 8, 7, 9, 4, 3, 5, 1, 4)
var(num_pets) # Returns 6.5
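To see that ``var()`` uses the :math:`n-1` denominator from the formula for :math:`s^2`, the same value can be built up by hand. In this sketch, ``n`` and ``ss`` are names chosen here for illustration.

.. code-block:: r

   num_pets <- c(4, 8, 7, 9, 4, 3, 5, 1, 4)
   n <- length(num_pets)

   # Sum of squared deviations from the sample mean
   ss <- sum((num_pets - mean(num_pets))^2)  # 52

   ss / (n - 1)  # 6.5, the same value as var(num_pets)
   ss / n        # about 5.78, smaller because the divisor is larger

Dividing by :math:`n` instead of :math:`n-1` always gives a slightly smaller value, and the gap shrinks as the sample size grows.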
Sample Standard Deviation, :math:`s`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
While the sample variance is mathematically useful, it has a practical drawback: it is expressed in
squared units of the original data, making interpretation difficult.
We return to the original units by taking the positive square root of the sample variance and
call this the **sample standard deviation**:
.. math::
s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}.
.. admonition:: Example 💡: Computing the sample standard deviation
:class: note
For the pet counts example,
.. math::
s = \sqrt{6.5} \approx 2.55.
Roughly speaking, a typical pet count differs from the mean by about 2.55 pets.
.. code-block:: r
num_pets <- c(4, 8, 7, 9, 4, 3, 5, 1, 4)
sd(num_pets) # Returns approximately 2.55
Properties of Variance and Standard Deviation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#. They are always non-negative.
#. They equal zero only when all data values are identical.
#. They increase as the spread of the data increases.
#. The two measures always increase and decrease together.
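These properties can be illustrated with a few small vectors. The data below (``constant``, ``narrow``, ``wide``) are made up for this sketch and do not come from the pet counts example.

.. code-block:: r

   # All values identical: no spread, so both measures are zero
   constant <- c(7, 7, 7, 7, 7)
   var(constant)  # 0
   sd(constant)   # 0

   # Same mean (5), but the second vector is more spread out
   narrow <- c(4, 5, 6)
   wide <- c(0, 5, 10)

   var(narrow)  # 1
   var(wide)    # 25
   sd(narrow)   # 1
   sd(wide)     # 5

The more dispersed vector has both the larger variance and the larger standard deviation, as expected when one measure is the square root of the other.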
Revisiting the Three Datasets with a Shared Range
---------------------------------------------------
Let's return to our three datasets with identical ranges and means but different
distributions:
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter3/same-range-different-distributions.jpg
:alt: Three datasets with same range but different distributions
:align: center
:figwidth: 100%
.. flat-table:: Data sets with the same range and mean but different distributions
:width: 100%
:header-rows: 1
:align: center
* - Data set
- :cspan:`6` Data values
- Variance
* - **1**
- -15
- -1
- -0.5
- 0
- 0.5
- 1
- 15
- 75.42
* - **2**
- -15
- -3
- -1
- 0
- 1
- 3
- 15
- 78.33
* - **3**
- -15
- -10
- -5
- 0
- 5
- 10
- 15
- 116.67
Although all three datasets have the same sample range (30) and sample mean (0), their
variances differ substantially. Set 1, with most points concentrated near
the center, has the smallest variance. Set 3, with points more spread out,
has the largest variance. This illustrates how variance and standard deviation
capture differences in distribution that the range misses.
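The variances in the table can be reproduced with ``var()``. This is a minimal sketch; the list name ``sets`` and the result names are chosen here for illustration.

.. code-block:: r

   sets <- list(
     set1 = c(-15, -1, -0.5, 0, 0.5, 1, 15),
     set2 = c(-15, -3, -1, 0, 1, 3, 15),
     set3 = c(-15, -10, -5, 0, 5, 10, 15)
   )

   # Sample variance of each set, rounded to two decimal places
   variances <- sapply(sets, var)
   round(variances, 2)        # 75.42, 78.33, 116.67

   # Standard deviations, back on the original scale
   std_devs <- sqrt(variances)
   round(std_devs, 2)         # 8.68, 8.85, 10.80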
Impact of Extreme Values
------------------------------
Let us compute the sample range, variance, and standard deviation for the
*updated* pet counts data from Part 2 of Section 3.2.4.
.. flat-table::
:width: 90%
:align: center
:header-rows: 1
* - :cspan:`8` New pet counts data
* - 1
- 3
- 4
- 4
- 4
- 5
- 7
- 8
- 9 → 19
1. **Sample range**
.. math:: 19 - 1 = 18.
2. **Sample variance and sample standard deviation**
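In Part 2 of Section 3.2.4, we computed the new sample mean as :math:`\bar{x}=55/9 \approx 6.11`.
We must re-compute the squared deviations for all data points using this new value.
The results below are computed with the unrounded mean and then rounded to two decimal places: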
.. flat-table:: Squared Deviations from the Updated Sample Mean
:header-rows: 2
:align: center
:width: 80%
* - Object
- Updated Pet Counts Data
- Updated Squared deviation
* - Formula
- :math:`x_i`
- :math:`(x_i - \bar{x})^2`
* - :rspan:`8` Value
- :math:`1`
- :math:`(1-6.11)^2=26.12`
* - :math:`3`
- :math:`(3-6.11)^2=9.68`
* - :math:`4`
- :math:`(4-6.11)^2=4.46`
* - :math:`4`
- :math:`(4-6.11)^2=4.46`
* - :math:`4`
- :math:`(4-6.11)^2=4.46`
* - :math:`5`
- :math:`(5-6.11)^2=1.23`
* - :math:`7`
- :math:`(7-6.11)^2=0.79`
* - :math:`8`
- :math:`(8-6.11)^2=3.57`
* - :math:`19`
- :math:`(19-6.11)^2=166.12`
* - **Sum**
- :math:`\sum_{i=1}^n x_i = n\bar{x} = 55`
- :math:`\sum_{i=1}^n (x_i -\bar{x})^2=220.8889`
Then the sample variance is
.. math:: s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2 = \frac{220.8889}{9-1} \approx 27.6111,
and the sample standard deviation is
.. math:: s = \sqrt{s^2} = \sqrt{27.6111} \approx 5.2546.
.. flat-table::
:width: 80%
:align: center
:widths: 2 2 1 2
:header-rows: 2
* - :cspan:`3` How did the measures of spread change?
* - Measure
- Before update
- →
- After update
* - Sample range
- 8
- →
- 18
* - Sample variance
- 6.5
- →
- 27.6111
* - Sample standard deviation
- 2.55
- →
- 5.2546
We note that all three measures increased after the largest value was changed from 9 to 19.
Of the sample range and the sample standard deviation, both of which are measured on the same scale as the data,
the standard deviation was affected less in relative terms: the range grew by a factor of 2.25 (from 8 to 18),
while the standard deviation grew by a factor of about 2.06 (from 2.55 to 5.2546). This is because the standard
deviation incorporates all data points in its calculation, whereas the sample range depends only on the extremes.
The increase in the sample variance is the most dramatic (a factor of about 4.25), but this is because it is
computed on the squared scale.
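The updated values can be checked in R. In this sketch, ``num_pets_new`` is a name introduced here for the updated data, in which the largest value 9 has been replaced by 19.

.. code-block:: r

   num_pets_new <- c(1, 3, 4, 4, 4, 5, 7, 8, 19)

   # Sample range
   range_new <- max(num_pets_new) - min(num_pets_new)
   range_new             # 18

   # Sample variance and standard deviation (both use n - 1)
   var_new <- var(num_pets_new)
   sd_new <- sd(num_pets_new)
   round(var_new, 4)     # 27.6111
   round(sd_new, 4)      # 5.2546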
Bringing It All Together
--------------------------
.. admonition:: Key Takeaways 📝
:class: important
1. Central tendency measures alone don't fully describe a dataset;
we also need measures of spread.
2. The **range** (max - min) is the simplest spread measure but depends
only on the extreme values, which are often the least representative
points in the data.
3. **Deviations** from the mean always sum to zero.
Therefore, we construct a measure of spread with **squared deviations**.
4. The **sample variance** (:math:`s^2`) is the *average* squared deviation from the mean, computed with :math:`n-1` in the denominator.
5. The **sample standard deviation** (:math:`s`) is the square root of the sample variance,
returning to the original units of measurement.
6. All three measures (range, variance, and standard deviation) are sensitive to extreme values.
Exercises
~~~~~~~~~~~~~~
#. **Conceptual Understanding**: Two statistics classes have the same mean score
of 75 on an exam.
* Class A has a standard deviation of 5 points.
* Class B has a standard deviation of 15 points.
Explain what this tells you about the score distributions in each class.
Which class had more consistent performance?
#. **Calculating Spread**: For the dataset {15, 18, 22, 24, 30, 34, 35, 35, 38, 42, 48}:
a) Calculate the range.
b) Calculate the variance and standard deviation.
c) Interpret the standard deviation in the context of the data.
#. **Comparing Datasets**: Datasets X and Y both have a mean of 50, but X has a
range of 20 and Y has a range of 40.
a) What can you conclude about their relative spread?
b) Is it possible for X to have a larger standard deviation than Y? Explain.
#. **Challenge Problem**: Create three different datasets, each with 5 values, that have:
a) The same sample mean of 10 but different standard deviations. Describe your
strategy.
b) The same standard deviation of 2 but different means. Describe your
strategy.