
3.4. Measures of Variability - Interquartile Range and Five-Number Summary

While the sample variance and standard deviation provide useful measures of spread, they can be heavily influenced by extreme values. In situations where data is skewed or contains unusual observations, we need measures that are less sensitive to individual points. This is where the interquartile range (IQR) and five-number summary come into play—measures that focus only on the middle portion of the data.

Road Map 🧭

Understand quartiles and percentiles as ways to divide ordered data.
Calculate the interquartile range (IQR) as a robust measure of spread.
Identify explicit points using the upper and lower fences.
Create and interpret modified box plots.
Distinguish between explicit points and real outliers.

3.4.1. Preliminaries: Percentiles and Quartiles

Sample Percentiles

For a number \(p\) between 0 and 100, the \(p\)-th sample percentile of a variable is the value such that \(p\%\) of observations are less than or equal to that value. For example:

The 10th percentile is a value such that 10% of all observations are less than or equal to it.
The 99th percentile is a value such that 99% of all observations are less than or equal to it.
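
As a quick check of this definition, the sketch below uses R's quantile() function with the months-to-promotion data from the example later in this section. The variable names are illustrative only. Because a sample is finite, the proportion of observations at or below a sample percentile is usually only approximately \(p\%\).

```r
# Months-to-promotion data from the example later in this section
promotion <- c(7, 12, 14, 14, 14, 18, 21, 22, 23, 24,
               25, 34, 34, 37, 47, 49, 64, 100, 150)

# 25th sample percentile, using R's default quantile method
p25 <- quantile(promotion, probs = 0.25, names = FALSE)

# Proportion of observations less than or equal to that value
prop_at_or_below <- mean(promotion <= p25)

p25               # the percentile itself
prop_at_or_below  # close to, but not exactly, 0.25 when n = 19
```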

Sample Quartiles

Sample quartiles are three special percentiles which divide the data into four equally sized parts, each containing approximately 25% of the data points. They consist of

\(Q_1\) (first quartile): The 25th percentile
\(Q_2\) (second quartile): The 50th percentile. This is also the sample median.
\(Q_3\) (third quartile): The 75th percentile

Fig. 3.4 A distribution divided into four equal parts by the quartiles

Calculating Sample Quartiles

Sample quartiles can be found by “computing three different medians.” The specific steps are as follows:

Compute the sample median of the whole data set. This is the second quartile, or \(Q_2\).
The data is now divided into two equally sized subsets: one from the minimum to the median, and one from the median to the maximum. If the median coincides with a data point (that is, when \(n\) is odd), include it in both subsets. Compute the sample median of the first subset. This is the first quartile, \(Q_1\), for the whole data.
Likewise, compute the median for the second subset. This is the third quartile, \(Q_3\), for the whole data.

Example 💡: Computing the Sample Quartiles

The table below displays the time to promotion, in months, for 19 randomly sampled software engineers at an IT firm. Compute the sample quartiles, both by hand and using R.

7	12	14	14	14	18	21	22	23	24
25	34	34	37	47	49	64	100	150

Calculating the sample quartiles by hand

\(n=19\). The sample median of the data set is the 10th smallest value. \(Q_2 = 24\).
Let us now consider the first half of the data, from 7 to 24. There are 10 (even) data points in the subset. Therefore, we take the average between the 5th and the 6th data points as the subset’s sample median. \(Q_1 = \frac{14+18}{2} = 16\).
Repeating Step 2 for the second subset (24 to 150), we find \(Q_3 = \frac{37+47}{2} = 42\).

Confirm that the four sections created by \(Q_1, Q_2\) and \(Q_3\) are equally sized.
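
The hand calculation above can be sketched in R as follows. This is only an illustration of the "three medians" steps: median_of_halves() is a hypothetical helper written for this example, not a base R function, and it assumes the convention stated earlier of including the median in both halves when \(n\) is odd.

```r
# Hypothetical helper: quartiles by the "median of halves" hand method.
# When n is odd, the middle data point is included in both halves.
median_of_halves <- function(x) {
  x <- sort(x)
  n <- length(x)
  half <- ceiling(n / 2)          # number of points in each half
  lower <- x[1:half]              # minimum up to the median
  upper <- x[(n - half + 1):n]    # median up to the maximum
  c(Q1 = median(lower), Q2 = median(x), Q3 = median(upper))
}

promotion <- c(7, 12, 14, 14, 14, 18, 21, 22, 23, 24,
               25, 34, 34, 37, 47, 49, 64, 100, 150)

hand_quartiles <- median_of_halves(promotion)
hand_quartiles
```

For this data set the helper reproduces the hand results \(Q_1 = 16\), \(Q_2 = 24\), and \(Q_3 = 42\).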

Warning

Quantile definitions: R’s quantile() uses the Hyndman and Fan type 7 method by default, which is a different method from the simple “median of halves” hand calculation shown above. The two agree for this data set but can give different answers for others. For all homework and assessments, you must use R’s approach with type = 7 (the default), and your answers should match R’s output.

Calculating the sample quartiles in R

In R, the quantile() function is used to calculate percentiles. (Quantiles are simply percentiles expressed on a 0-1 scale instead of 0-100.) By default, it returns the five-number summary (minimum, Q₁, median, Q₃, maximum):
# Example dataset
x <- c(7, 12, 14, 14, 14, 18, 21, 22, 23, 24, 25, 34, 34, 37, 47, 49, 64, 100, 150)

# Get quartiles (five-number summary)
quantile(x)

# Output:
#    0%   25%   50%   75%  100%
#   7.0  16.0  24.0  42.0 150.0
To access quartiles, you can use the following syntax:
# Get Q1 (25th percentile)
quantile(x)["25%"]

# Get Q3 (75th percentile)
quantile(x)["75%"]
You can also request specific percentiles other than the default:
# Calculate the 10th, 20th, 30th percentiles
quantile(x, probs = c(0.1, 0.2, 0.3))

Brief discussion: quantile methods in R

R provides nine quantile definitions. They differ only in how they place the percentile between sorted data values.

Two families. Types 1–3 are stepwise choices that pick actual data points, with different tie handling. Types 4–9 use linear interpolation between data points.
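Textbook “median of halves.” This is Tukey’s hinges, used by fivenum() and by base R’s boxplot(). It often matches R’s default quantiles, but not always. Note that ggplot2’s geom_boxplot() computes its box from quantile-based quartiles rather than hinges, so the two kinds of box plot can differ slightly for small samples.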
R default. quantile(..., type = 7) is the default. It interpolates to give smooth percentiles and is standard in R and S/S-PLUS.
Other common choices. Type 5 is popular in hydrology, type 6 is used by Minitab and SPSS, type 8 targets median-unbiasedness across distributions, and type 9 targets normal-theory unbiasedness.
Practical impact. Differences can be noticeable for small samples or skewed data. For large n the methods tend to agree.
Course policy. For homework and assessments use quantile(x) and report answers that match R’s default.
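
To see how much the definitions can differ on a small sample, the following sketch computes the first quartile of the promotion data under each of the nine types. It relies only on the type and names arguments of quantile(); the object name q1_by_type is illustrative.

```r
promotion <- c(7, 12, 14, 14, 14, 18, 21, 22, 23, 24,
               25, 34, 34, 37, 47, 49, 64, 100, 150)

# First quartile under each of R's nine quantile definitions
q1_by_type <- sapply(1:9, function(t) {
  quantile(promotion, probs = 0.25, type = t, names = FALSE)
})
names(q1_by_type) <- paste0("type", 1:9)

q1_by_type  # the type7 entry is the value R reports by default
```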

3.4.2. The Interquartile Range (IQR)

The interquartile range (IQR) represents the spread of the data using the width of the middle 50%. It is calculated as the difference between the third quartile (\(Q_3\)) and the first quartile (\(Q_1\)):

\[\text{IQR} = Q_3 - Q_1.\]

Example 💡: Calculating the IQR

For the number of months to promotion data, compute the IQR.

From the previous example, \(Q_3 = 42\) and \(Q_1 = 16\). Then

\[\text{IQR} = 42 - 16 = 26.\]

This tells us that the middle 50% of the data spans 26 months, giving us a sense of how spread out the typical cases are.

R provides a built-in function to calculate the IQR.

# Calculate IQR
IQR(x)

# Alternative calculation
q <- quantile(x)
q["75%"] - q["25%"]

3.4.3. Five-Number Summary

The five-number summary provides a concise overview of a dataset’s distribution by reporting the sample quartiles, together with the data’s minimum and maximum. This summary gives a comprehensive picture of the center (\(Q_2\)), spread (\(Q_1\) and \(Q_3\)), and extremes (min and max) of the data.

See the first code block of Example 💡: Computing the Sample Quartiles for an instance of a five-number summary.
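
As a brief sketch, base R offers more than one way to obtain these five numbers. fivenum() uses Tukey's hinges, while quantile() and summary() use the default type 7 quartiles. The three agree for the promotion data, but they may differ slightly for other samples.

```r
x <- c(7, 12, 14, 14, 14, 18, 21, 22, 23, 24,
       25, 34, 34, 37, 47, 49, 64, 100, 150)

five_hinges <- fivenum(x)     # min, lower hinge, median, upper hinge, max
five_type7  <- quantile(x)    # min, Q1, median, Q3, max (type 7)

five_hinges
five_type7
summary(x)                    # same quartiles as quantile(), plus the mean
```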

3.4.4. Identifying Explicit Points with Fences

One important application of the IQR is to identify potential outliers or explicit points in the data. We use what are called the IQR rules to establish “fences” beyond which observations are flagged for further inspection.

Inner fences are computed using the 1.5 IQR rule:
- Lower inner fence = \(Q_1 - 1.5 \times \text{IQR}\)
- Upper inner fence = \(Q_3 + 1.5 \times \text{IQR}\)

Outer fences are computed using the 3 IQR rule:
- Lower outer fence = \(Q_1 - 3 \times \text{IQR}\)
- Upper outer fence = \(Q_3 + 3 \times \text{IQR}\)

Points that fall between the inner and outer fences are considered mild explicit points, while those beyond the outer fences are considered extreme explicit points.

Example 💡: Identifying the Explicit Points

For the number of months to promotion data, identify the inner and outer fences, and identify any mild and extreme explicit points.

7	12	14	14	14	18	21	22	23	24
25	34	34	37	47	49	64	100	150

From the previous example, \(Q_1 = 16\), \(Q_3 = 42\), and IQR = 26.

Lower inner fence: \(16 - (1.5)(26) = -23\)
Upper inner fence: \(42 + (1.5)(26) = 81\)
Lower outer fence: \(16 - (3)(26) = -62\)
Upper outer fence: \(42 + (3)(26) = 120\)

Since the value 100 exceeds the upper inner fence (81) but falls below the upper outer fence (120), it is classified as a mild explicit point. The value 150 exceeds the upper outer fence (120), making it an extreme explicit point. There are no lower explicit points in the data.
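
The same classification can be sketched in R. The object names below (inner_fence, outer_fence, mild, extreme) are illustrative and not part of any package; the quartiles come from quantile() with its default method.

```r
promotion <- c(7, 12, 14, 14, 14, 18, 21, 22, 23, 24,
               25, 34, 34, 37, 47, 49, 64, 100, 150)

# Quartiles and IQR (default type 7)
q   <- quantile(promotion, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Inner (1.5 IQR) and outer (3 IQR) fences
inner_fence <- c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)
outer_fence <- c(lower = q[1] - 3 * iqr,   upper = q[2] + 3 * iqr)

# Flag observations beyond each pair of fences
beyond_inner <- promotion < inner_fence["lower"] | promotion > inner_fence["upper"]
beyond_outer <- promotion < outer_fence["lower"] | promotion > outer_fence["upper"]

mild    <- promotion[beyond_inner & !beyond_outer]  # between inner and outer fences
extreme <- promotion[beyond_outer]                  # beyond the outer fences

inner_fence
outer_fence
mild
extreme
```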

3.4.5. Modified Box Plots

A modified box plot is a visual representation of the five-number summary and the explicit points identified by the 1.5 IQR rule. It consists of

A box that spans from Q₁ to Q₃, representing the middle 50% of the data,
A line inside the box marking the median (Q₂),
Whiskers extending from the box to the most extreme data points that are not classified as explicit points, and
Explicit points as dots beyond the whiskers.

Fig. 3.5 A modified box plot with labeled components

The dot inside the box represents the sample mean of the data. While it is not a formal component of a box plot, we often include it for a more comprehensive view of the data distribution.

Why do we call it modified?

The box plots introduced in this course are a modified version of the basic form, which does not account for explicit points. You are not expected to know or use the basic version. Whenever we refer to a box plot, we always mean the modified version.

Example 💡: Creating a Modified Box Plot

For the number of months to promotion data set, draw a modified box plot both by hand and using R. Then interpret the data’s distribution.

Drawing a modified box plot by hand

Draw a horizontal or vertical axis that covers the full range of the data.
Draw a line at Q₁ and another at Q₃, then draw a box that uses these lines as two of its sides.
Draw a line across the box at the median (Q₂).
Draw whiskers from the box to the most extreme data points that are NOT beyond the inner fences.
Plot individual points for observations beyond the inner fences.
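
Before drawing by hand, it can help to check where the whiskers should end. One possible sketch uses base R's boxplot.stats(), which applies the 1.5 IQR rule through its coef argument. Note that it builds the box from Tukey's hinges rather than type 7 quartiles; the two coincide for the promotion data but may not for other samples.

```r
promotion <- c(7, 12, 14, 14, 14, 18, 21, 22, 23, 24,
               25, 34, 34, 37, 47, 49, 64, 100, 150)

bs <- boxplot.stats(promotion, coef = 1.5)

bs$stats  # lower whisker end, lower hinge, median, upper hinge, upper whisker end
bs$out    # observations plotted as individual points beyond the whiskers
```

Here the whiskers end at 7 and 64, the most extreme observations inside the inner fences, and the points 100 and 150 are plotted individually.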

Drawing a modified box plot using R

# Import the graphing package
library(ggplot2)

# Create data vector
promotion <- c(7, 12, 14, 14, 14, 18, 21, 22, 23, 24, 25, 34, 34, 37, 47, 49, 64, 100, 150)

# Change format to data frame
promotion_df <- data.frame(months=promotion)

# Graph
ggplot(promotion_df, aes(y="", x=months)) +  # months on x gives a horizontal plot; swap x and y for a vertical one
   stat_boxplot(geom="errorbar") +  # adds end caps to the whiskers
   geom_boxplot() +
   ggtitle("Boxplot of Months to Promotion") +
   stat_summary(fun = mean, col = "black", geom = "point", size = 3)

The code above returns Fig. 3.5.

3.4.6. Reading Box Plots Beyond the Five-Number Summary

Skewness

While box plots offer a less detailed view of the data distribution than histograms, they are effective for quickly identifying skewness. Let us compare the histograms and the box plots drawn for the same data sets:

Fig. 3.6 Histograms and box plots of skewed data sets

Recall that each of the four sections defined by the sample quartiles contains approximately the same number of data points. Therefore, if one whisker is much longer than the box or the other whisker, it suggests that the data is more spread out in that section of the distribution.

Limitations

Box plots are efficient for identifying symmetry or skewness, but they are limited in the level of detail they provide on the shape of the distribution. Most importantly, we cannot

determine whether the data is normal (bell-shaped), or
identify the number of modes in the data

through a box plot. See the examples below—each with a distinct distribution, yet producing very similar box plots.

Fig. 3.7 Box plots of different symmetric distributions

To gain a detailed understanding of the shape of the data distribution, we must use more refined graphical tools such as histograms.

Explicit Points vs. Real Outliers

When interpreting a modified box plot, it’s crucial to understand that not all explicit points are real outliers.

Explicit points are observations flagged by statistical criteria like the 1.5 IQR rule. They are points that mathematically deviate from the pattern established by the majority of the data.
Real outliers are explicit points that, upon investigation, truly deviate from the underlying pattern of the data. They may represent errors, anomalies, or genuinely unusual cases.

When a data distribution is strongly skewed, for example, values on the longer tail may be identified as explicit points by the 1.5 IQR rule, although they are simply conforming to the underlying distribution. See Fig. 3.6.

When a data point is flagged as explicit, it should be inspected more carefully to determine if:

it represents an error in measurement, recording, or data entry,
it is an observation from a different population or process,
or it is too far from the main body of the box plot to be considered part of the same distribution.

In these cases, the explicit points are considered true outliers.

Example 💡: Interpreting a Box Plot

Interpret the box plot of the number of months to promotion data.

The middle 50% of engineers were promoted between 16 and 42 months after hiring.
The median time to promotion was 24 months.
Two engineers (100 and 150 months) are identified as explicit points.
Since the right half of the data is more spread out than the left half (longer right whisker, greater distance between Q2 and Q3 than Q1 and Q2), the data set is right-skewed.
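
The comparison in the last point can be checked numerically. The sketch below measures both halves of the box and both whiskers for the promotion data; the object names are illustrative.

```r
promotion <- c(7, 12, 14, 14, 14, 18, 21, 22, 23, 24,
               25, 34, 34, 37, 47, 49, 64, 100, 150)

q   <- quantile(promotion, probs = c(0.25, 0.5, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# Observations inside the inner fences determine where the whiskers end
inside <- promotion[promotion >= q[1] - 1.5 * iqr &
                    promotion <= q[3] + 1.5 * iqr]

lower_box     <- q[2] - q[1]          # distance from Q1 to the median
upper_box     <- q[3] - q[2]          # distance from the median to Q3
lower_whisker <- q[1] - min(inside)   # length of the lower whisker
upper_whisker <- max(inside) - q[3]   # length of the upper whisker

c(lower_box = lower_box, upper_box = upper_box,
  lower_whisker = lower_whisker, upper_whisker = upper_whisker)
```

Both the upper half of the box (18 versus 8) and the upper whisker (22 versus 9) are more than twice as long as their lower counterparts, which is consistent with a right-skewed distribution.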

The two explicit points are very far from the main body of the plot with gaps larger than the IQR itself, so we must investigate the possibility of them being real outliers. They might represent engineers who:

Were hired with specialized skills not requiring management roles.
Chose to remain in technical roles longer.
Faced unusual circumstances affecting their promotion timeline.
Were erroneously included in the dataset.

3.4.7. Bringing It All Together

Key Takeaways 📝

Sample quartiles divide the data into four equal parts.
The interquartile range (IQR) measures the spread of the middle 50% of the data.
The five-number summary provides a robust overview of the data distribution.
Use the 1.5 × IQR rule to identify explicit points that may be potential outliers.
Not all explicit points are real outliers; investigate them thoroughly before drawing conclusions.
Modified box plots visualize the five-number summary and highlight explicit points.

In the next section, we’ll discuss how to choose the most appropriate measures of center and spread for different types of data distributions.

Exercises

Quartile Calculation: For the dataset {5, 7, 9, 12, 15, 17, 19, 22, 25, 27, 30, 33, 38, 42, 51}: a) Calculate the five-number summary. b) Determine the IQR. c) Find the inner and outer fences. d) Identify any explicit points.
Real vs. Explicit: Suppose you are analyzing the daily commute times (in minutes) for employees at a company, and you find that most values are between 15 and 45 minutes, but there are a few values over 90 minutes that are flagged as explicit points by the 1.5 IQR rule.
1. What additional information would you want to gather to determine if these are real outliers?
2. Give two examples of circumstances where these explicit points might NOT be considered real outliers.
3. Give two examples of circumstances where these explicit points SHOULD be considered real outliers.