
3.3. Measures of Variability - Range, Variance, and Standard Deviation

When describing a dataset, knowing where the center lies tells only half the story. Two datasets might share the same mean or median but look entirely different when plotted. This may be because they differ in how widely the values are dispersed. Measures of spread help us quantify this dispersion, providing a more complete picture of our data’s characteristics.

Road Map 🧭

Calculate and interpret the range as a simple spread measure.
Develop the concept of deviations from the mean.
Define and compute the variance and standard deviation.

3.3.1. The Need for Measures of Variability

When using only measures of central tendency, we often lose important information about the data’s distribution. For example:

Two countries might have the same mean family income, but one could have both greater wealth and greater poverty than the other.
Two classes might have the same average test score, but one might have consistent performance while the other has extreme high and low scores.
Two manufacturing processes might produce parts with the same average size, but one might have much tighter tolerances than the other.

Consider the visualization below, which shows two distributions with identical means but different spreads:

Fig. 3.3 Two distributions with the same mean but different spreads

To fully characterize these distributions, we need measures that quantify the dispersion of values around the center.
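As a small numeric illustration of the test-score scenario above, the sketch below uses two made-up score vectors. The names `class_a` and `class_b` and their values are hypothetical and do not come from the text; they are chosen only so that both means equal 75.

```r
# Hypothetical exam scores for two classes (illustrative values only)
class_a <- c(73, 74, 75, 76, 77)   # scores packed tightly around the center
class_b <- c(55, 65, 75, 85, 95)   # scores scattered widely around the center

mean_a <- mean(class_a)
mean_b <- mean(class_b)

mean_a  # 75
mean_b  # 75
```

The means are identical, so a measure of center alone cannot tell these two classes apart.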

3.3.2. Sample Range: The Simplest Measure

The sample range is the most basic measure of spread—simply the difference between the maximum and minimum values in a dataset:

\[\text{Range} = \text{Maximum} - \text{Minimum}\]

Example 💡: Continuing with the pet counts data

We continue to use the pet counts data from Part 1 of Section 3.2.4:

\[\{4, 8, 7, 9, 4, 3, 5, 1, 4\}.\]

The range is \(9 - 1 = 8\).

# Creating the dataset
num_pets <- c(4, 8, 7, 9, 4, 3, 5, 1, 4)

range_pets <- max(num_pets) - min(num_pets)
range_pets  # Returns 8
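One point worth noting when working in R: the base function `range()` returns the minimum and maximum as a pair, not their difference. A minimal sketch of getting the sample range from it (the name `min_max` is introduced here for illustration):

```r
num_pets <- c(4, 8, 7, 9, 4, 3, 5, 1, 4)

min_max <- range(num_pets)       # vector holding the minimum and the maximum
min_max                          # 1 9

range_pets <- diff(min_max)      # maximum minus minimum
range_pets                       # 8
```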

Limitations of Sample Range

While the range is easy to calculate and understand, it has significant limitations:

It depends only on the two most extreme values, ignoring all other observations.
It is highly sensitive to outliers.
Two very different distributions can have identical ranges.

To illustrate this limitation, consider three different data sets that all have the same range and central tendencies:

Table 3.1 Data sets with the same range and mean but different distributions
Set 1	-15	-1	-0.5	0.5	1	15
Set 2	-15	-3	-1	1	3	15
Set 3	-15	-10	-5	5	10	15

Three datasets with same range but different distributions

All three datasets have a range of 30 (from -15 to 15) and a mean of 0, but their distributions are clearly different. Set 1 has most values concentrated near the center with only a few extreme points, Set 2 is less concentrated, and Set 3 has values more evenly distributed across the range.

This example demonstrates why we need measures that consider the dispersion of all values in the dataset, not just the extremes.
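The shared range and mean in Table 3.1 can be checked with a short sketch. The object names (`data_sets`, `set1`, and so on) are chosen here for illustration.

```r
# The three data sets from Table 3.1
data_sets <- list(
  set1 = c(-15, -1, -0.5, 0.5, 1, 15),
  set2 = c(-15, -3, -1, 1, 3, 15),
  set3 = c(-15, -10, -5, 5, 10, 15)
)

# Sample range (max - min) of each set
ranges <- sapply(data_sets, function(x) max(x) - min(x))
ranges  # 30 for every set

# Sample mean of each set
means <- sapply(data_sets, mean)
means   # 0 for every set
```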

3.3.3. Sample Variance and Sample Standard Deviation

Deviations from the Mean

To better measure the spread, we need to consider how far each data point lies from a central value, typically the mean. This distance is called a deviation.

For each observation \(x_i\), its deviation from the sample mean is defined as:

\[d_i = x_i - \bar{x}\]

Deviations in Pet Counts Data

Let’s calculate the deviations for our pet counts data set using \(\bar{x} = 5\). The values are recorded in Table 3.2.

Table 3.2 Deviations and Squared Deviations from the Sample Mean
Object	Pet Counts Data	Deviation	Squared deviation
Formula	\(x_i\)	\(x_i - \bar{x}\)	\((x_i - \bar{x})^2\)
Value	\(1\)	\(1-5=-4\)	\((1-5)^2=16\)
	\(3\)	\(3-5=-2\)	\((3-5)^2=4\)
	\(4\)	\(4-5=-1\)	\((4-5)^2=1\)
	\(4\)	\(4-5=-1\)	\((4-5)^2=1\)
	\(4\)	\(4-5=-1\)	\((4-5)^2=1\)
	\(5\)	\(5-5=0\)	\((5-5)^2=0\)
	\(7\)	\(7-5=2\)	\((7-5)^2=4\)
	\(8\)	\(8-5=3\)	\((8-5)^2=9\)
	\(9\)	\(9-5=4\)	\((9-5)^2=16\)
Sum	\(\sum_{i=1}^n x_i = n\bar{x} = 45\)	\(\sum_{i=1}^n (x_i -\bar{x})=\) \(\sum_{i=1}^n x_i -n\bar{x} = 0\)	\(\sum_{i=1}^n (x_i -\bar{x})^2=52\)

From the final row of Table 3.2, we note an important property of deviations: they always sum to zero. This makes it impossible for their average to serve as a meaningful summary for a data set. Instead, we use the squared deviations so that only the magnitudes influence the summary, not their signs. See the rightmost column of Table 3.2 for the squared deviations of the pet counts data.
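The sums in the final row of Table 3.2 can be reproduced in R. This is a minimal sketch; `x_bar` and `deviations` are names introduced here for illustration.

```r
num_pets <- c(4, 8, 7, 9, 4, 3, 5, 1, 4)

x_bar <- mean(num_pets)          # sample mean
deviations <- num_pets - x_bar   # one deviation per observation

x_bar                # 5
sum(deviations)      # 0: positive and negative deviations cancel
sum(deviations^2)    # 52: squaring removes the signs
```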

If signs are an issue, why not take absolute values?

Indeed, variability metrics which use absolute deviations exist. However, those that use squared deviations are far more widely adopted because of their powerful theoretical properties. We will explore these properties throughout the semester.
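For comparison, one such metric is the mean absolute deviation, sketched below for the pet counts data. The name `mean_abs_dev` is introduced here for illustration. Note that R's built-in `mad()` computes a different statistic (a scaled median absolute deviation), so it is not used here.

```r
num_pets <- c(4, 8, 7, 9, 4, 3, 5, 1, 4)

# Average of the absolute deviations from the sample mean
mean_abs_dev <- mean(abs(num_pets - mean(num_pets)))
mean_abs_dev  # 2
```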

Sample Variance, \(s^2\)

We compute the sample variance, denoted by \(s^2\), by taking the sum of all squared deviations, then dividing it by \(n-1\):

\[s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2\]

The sample variance is roughly the average squared deviation from the mean, except that we divide by \(n-1\) rather than \(n\). While the full theoretical explanation is not covered yet, this adjustment corrects a bias: on average, dividing by \(n\) would underestimate the population variance.

Example 💡: Computing the Sample Variance

Let’s calculate the variance for the pet counts example. Most of the hard work has already been done in Table 3.2. We take the sum of the final column, then divide by \(n-1\).

\[s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \frac{52}{9-1} = 6.5\]

Using R,

num_pets <- c(4, 8, 7, 9, 4, 3, 5, 1, 4)
var(num_pets)  # Returns 6.5
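To connect `var()` with the formula, the sketch below computes the same quantity step by step and also shows what dividing by \(n\) would give instead. The names `n` and `sum_sq` are introduced here for illustration.

```r
num_pets <- c(4, 8, 7, 9, 4, 3, 5, 1, 4)

n <- length(num_pets)                         # 9 observations
sum_sq <- sum((num_pets - mean(num_pets))^2)  # sum of squared deviations, 52

manual_var <- sum_sq / (n - 1)
manual_var      # 6.5, the same value var(num_pets) returns

sum_sq / n      # about 5.78, smaller because the divisor is n instead of n - 1
```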

Sample Standard Deviation, \(s\)

While the sample variance is mathematically useful, it has a practical drawback: it is expressed in the square of the original unit, which makes it hard to interpret.

We return to the original unit by taking the positive square root of the sample variance. This value is called the sample standard deviation:

\[s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}.\]

Example 💡: Computing the Sample Standard Deviation

For the pet counts example,

\[s = \sqrt{6.5} \approx 2.55.\]

Roughly speaking, a typical pet count lies about 2.55 pets away from the mean.

num_pets <- c(4, 8, 7, 9, 4, 3, 5, 1, 4)
sd(num_pets)  # Returns 2.54951, approximately 2.55
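As a quick check of the relationship between the two measures, the sketch below confirms that `sd()` agrees with the square root of `var()`. The name `s_from_var` is introduced here for illustration.

```r
num_pets <- c(4, 8, 7, 9, 4, 3, 5, 1, 4)

s_from_var <- sqrt(var(num_pets))   # square root of the sample variance

round(s_from_var, 2)                     # 2.55
all.equal(s_from_var, sd(num_pets))      # TRUE
```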

Properties of Variance and Standard Deviation

They are always non-negative.
They equal zero only when all data values are identical.
They increase as the spread of the data increases.
The two measures always increase and decrease together.
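These properties can be seen on small made-up vectors. The data below are hypothetical and chosen only to keep the arithmetic easy: `wide` has the same mean as `narrow`, but its values sit twice as far from that mean.

```r
# Hypothetical illustrative data (not from the text)
constant <- c(5, 5, 5, 5)   # all values identical
narrow   <- c(4, 5, 6)      # deviations -1, 0, 1
wide     <- c(3, 5, 7)      # deviations -2, 0, 2

var(constant)   # 0
sd(constant)    # 0

var(narrow)     # 1
sd(narrow)      # 1

var(wide)       # 4
sd(wide)        # 2
```

Both measures are zero for the constant data, and both grow when the values move farther from the mean.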

3.3.4. Revisiting the Three Datasets with a Shared Range

Let’s return to our three datasets:

Table 3.3 Data sets with the same range and mean but different distributions
Data set	Data values						Variance
1	-15	-1	-0.5	0.5	1	15	90.5
2	-15	-3	-1	1	3	15	94
3	-15	-10	-5	5	10	15	140

Although all three datasets have the same sample range (30) and sample mean (0), their variances differ substantially. Set 1, with most points concentrated near the center, has the smallest variance. Set 3, with points more spread out, has the largest variance. This illustrates how variance and standard deviation capture differences in distribution that the range misses.
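The comparison can be reproduced in R. This is a sketch with object names chosen here for illustration. Note that `var()` uses the \(n-1\) divisor from Section 3.3.3, which gives 90.5, 94, and 140. Dividing the same sums of squares by \(n\) instead gives about 75.42, 78.33, and 116.67. The ordering of the three sets is the same either way.

```r
data_sets <- list(
  set1 = c(-15, -1, -0.5, 0.5, 1, 15),
  set2 = c(-15, -3, -1, 1, 3, 15),
  set3 = c(-15, -10, -5, 5, 10, 15)
)

# Sample variance with divisor n - 1
sample_var <- sapply(data_sets, var)
sample_var    # 90.5, 94, 140

# Same sums of squared deviations, divided by n instead
divide_by_n <- sapply(data_sets, function(x) mean((x - mean(x))^2))
round(divide_by_n, 2)   # 75.42, 78.33, 116.67
```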

3.3.5. Impact of Extreme Values

Let us compute the sample range, variance, and standard deviation for the updated pet counts data from Part 2 of Section 3.2.4.

New pet counts data
1	3	4	4	4	5	7	8	9 → 19

Sample range

\[19 - 1 = 18.\]

Sample variance and sample standard deviation

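In Part 2 of Section 3.2.4, we computed the new sample mean as \(\bar{x}=55/9\approx 6.11\). We must re-compute the squared deviations for all data points using this new value. The entries below are computed with the unrounded mean and then rounded to two decimal places: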

Table 3.4 Squared Deviations from the Updated Sample Mean
Object	Updated Pet Counts Data	Updated Squared deviation
Formula	\(x_i\)	\((x_i - \bar{x})^2\)
Value	\(1\)	\((1-6.11)^2=26.12\)
	\(3\)	\((3-6.11)^2=9.68\)
	\(4\)	\((4-6.11)^2=4.46\)
	\(4\)	\((4-6.11)^2=4.46\)
	\(4\)	\((4-6.11)^2=4.46\)
	\(5\)	\((5-6.11)^2=1.23\)
	\(7\)	\((7-6.11)^2=0.79\)
	\(8\)	\((8-6.11)^2=3.57\)
	\(19\)	\((19-6.11)^2=166.12\)
Sum	\(\sum_{i=1}^n x_i = n\bar{x} = 55\)	\(\sum_{i=1}^n (x_i -\bar{x})^2=220.8889\)

Then the sample variance is

\[s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2 = \frac{220.8889}{9-1} \approx 27.6111,\]

and the sample standard deviation is

\[s = \sqrt{s^2} = \sqrt{27.6111} \approx 5.2546.\]

How did the measures of spread change?
Measure	Before update	→	After update
Sample range	8	→	18
Sample variance	6.5	→	27.6111
Sample standard deviation	2.55	→	5.2546

All three measures increased after the largest value, 9, was replaced by the extreme value 19. Between the sample range and the sample standard deviation, both of which are on the data’s original scale, the impact was weaker on the standard deviation: the range grew by a factor of 2.25 (from 8 to 18), while the standard deviation grew by a factor of about 2.06 (from 2.55 to 5.25). This is because the standard deviation incorporates all data points in its calculation, whereas the sample range depends only on the extremes. The increase in the sample variance is the most dramatic, but this is because it is computed on the squared scale.
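The before and after values can be reproduced with a short sketch. The name `num_pets_new` is introduced here for the updated data and is not from the original text.

```r
num_pets     <- c(4, 8, 7, 9, 4, 3, 5, 1, 4)
num_pets_new <- c(1, 3, 4, 4, 4, 5, 7, 8, 19)   # the 9 replaced by 19

range_new <- max(num_pets_new) - min(num_pets_new)
var_new   <- var(num_pets_new)
sd_new    <- sd(num_pets_new)

range_new   # 18
var_new     # about 27.6111
sd_new      # about 5.2546

# Growth factor of each measure relative to the original data
range_new / (max(num_pets) - min(num_pets))   # 2.25
sd_new / sd(num_pets)                         # about 2.06
var_new / var(num_pets)                       # about 4.25
```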

3.3.6. Bringing It All Together

Key Takeaways 📝

Central tendency measures alone don’t fully describe a dataset; we also need measures of spread.
The range (max - min) is the simplest spread measure but depends only on the two extreme values, which are often the least representative observations in the data.
Deviations from the mean always sum to zero. Therefore, we construct a measure of spread with squared deviations.
The sample variance (\(s^2\)) is the sum of squared deviations from the mean divided by \(n-1\), which is roughly the average squared deviation.
The sample standard deviation (\(s\)) is the square root of the sample variance, returning to the original units of measurement.
The three measures are sensitive to extreme values.

Exercises

Conceptual Understanding: Two statistics classes have the same mean score of 75 on an exam.
- Class A has a standard deviation of 5 points.
- Class B has a standard deviation of 15 points.
Explain what this tells you about the score distributions in each class. Which class had more consistent performance?
Calculating Spread: For the dataset {15, 18, 22, 24, 30, 34, 35, 35, 38, 42, 48}:
1. Calculate the range.
2. Calculate the variance and standard deviation.
3. Interpret the standard deviation in the context of the data.
Comparing Datasets: Datasets X and Y both have a mean of 50, but X has a range of 20 and Y has a range of 40.
1. What can you conclude about their relative spread?
2. Is it possible for X to have a larger standard deviation than Y? Explain.
Challenge Problem: Create three different datasets, each with 5 values, that have:
1. The same sample mean of 10 but different standard deviations. Describe your strategy.
2. The same standard deviation of 2 but different means. Describe your strategy.