3.4. Measures of Variability - Interquartile Range and Five-Number Summary

While the sample variance and standard deviation provide useful measures of spread, they can be heavily influenced by extreme values. When data is skewed or contains unusual observations, we need measures that are less sensitive to individual points. This is where the interquartile range (IQR) and the five-number summary come into play: the IQR measures spread using only the middle half of the data, and the five-number summary describes the distribution through five positional values.

Road Map 🧭

  • Understand quartiles and percentiles as ways to divide ordered data.

  • Calculate the interquartile range (IQR) as a robust measure of spread.

  • Identify explicit points using the upper and lower fences.

  • Create and interpret modified box plots.

  • Distinguish between explicit points and real outliers.

3.4.1. Preliminaries: Percentiles and Quartiles

Sample Percentiles

For a number \(p\) between 0 and 100, the \(p\)-th sample percentile of a variable is the value such that \(p\%\) of observations are less than or equal to that value. For example:

  • The 10th percentile is a value such that 10% of all observations are less than or equal to it.

  • The 99th percentile is a value such that 99% of all observations are less than or equal to it.
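As a rough illustration of this definition, the sketch below uses a made-up vector of the integers 1 to 100 (hypothetical data, not taken from this section) and checks what fraction of the observations lie at or below the value that R reports as the 10th percentile.

```r
# Hypothetical data: the integers 1 through 100
scores <- 1:100

# 10th sample percentile, using R's default quantile method
p10 <- quantile(scores, probs = 0.10, names = FALSE)

# Fraction of observations less than or equal to that value
frac_at_or_below <- mean(scores <= p10)
frac_at_or_below
```

With small samples or tied values the fraction is only approximately \(p\%\), so the definition is best read as a guide rather than an exact count.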

Sample Quartiles

Sample quartiles are three special percentiles which divide the data into four equally sized parts, each containing approximately 25% of the data points. They consist of

  • \(Q_1\) (first quartile): The 25th percentile

  • \(Q_2\) (second quartile): The 50th percentile. This is also the sample median.

  • \(Q_3\) (third quartile): The 75th percentile

Fig. 3.4 A distribution divided into four equal parts by the quartiles

Calculating Sample Quartiles

Sample quartiles can be found by “computing three different medians.” The specific steps are as follows:

  1. Compute the sample median of the whole data set. This is the second quartile, or \(Q_2\).

  2. The data is now divided into two equally sized subsets: from the minimum to the median, and from the median to the maximum. If the median is itself a data point (that is, when \(n\) is odd), include it in both subsets. Compute the sample median of the first subset. This is the first quartile, \(Q_1\), for the whole data.

  3. Likewise, compute the median for the second subset. This is the third quartile, \(Q_3\), for the whole data.

Example 💡: Computing the Sample Quartiles

The table below displays the time to promotion, in months, for 19 randomly sampled software engineers at an IT firm. Compute the sample quartiles, both by hand and using R.

7    12   14   14   14   18   21   22   23   24
25   34   34   37   47   49   64   100  150

Calculating the sample quartiles by hand

  1. \(n=19\). The sample median of the data set is the 10th smallest value. \(Q_2 = 24\).

  2. Let us now consider the first half of the data, from 7 to 24. There are 10 (even) data points in the subset. Therefore, we take the average between the 5th and the 6th data points as the subset’s sample median. \(Q_1 = \frac{14+18}{2} = 16\).

  3. Repeating Step 2 for the second subset (24 to 150), we find \(Q_3 = \frac{37+47}{2} = 42\).

Confirm that the four sections created by \(Q_1, Q_2\) and \(Q_3\) are equally sized.
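The three-median procedure above can be sketched as a small R function. The name quartiles_by_halves is a hypothetical helper introduced only for illustration; it is not part of base R or of the course materials.

```r
# Hypothetical helper: quartiles via the "three medians" hand procedure
quartiles_by_halves <- function(x) {
  x <- sort(x)
  n <- length(x)
  half <- ceiling(n / 2)          # subset size; the median is shared when n is odd
  lower <- x[1:half]              # minimum up to the median
  upper <- x[(n - half + 1):n]    # median up to the maximum
  c(Q1 = median(lower), Q2 = median(x), Q3 = median(upper))
}

promotion <- c(7, 12, 14, 14, 14, 18, 21, 22, 23, 24, 25, 34, 34, 37, 47, 49, 64, 100, 150)
hand_quartiles <- quartiles_by_halves(promotion)
hand_quartiles
```

For these 19 values the function reproduces the hand results \(Q_1 = 16\), \(Q_2 = 24\), and \(Q_3 = 42\).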

Warning

Quantile definitions: R’s quantile() uses the Hyndman and Fan type 7 method by default, which does not always give the same result as the simple “median of halves” hand calculation shown above. For all homework and assessments, you must use R’s approach with type = 7 (the default), and your answers should match R’s output.

Calculating the sample quartiles in R

In R, the quantile() function calculates percentiles (quantiles are simply percentiles expressed on a 0–1 scale instead of 0–100). When no probs argument is given, it returns the five-number summary (minimum, Q₁, median, Q₃, maximum):

# Example dataset
x <- c(7, 12, 14, 14, 14, 18, 21, 22, 23, 24, 25, 34, 34, 37, 47, 49, 64, 100, 150)

# Get quartiles (five-number summary)
quantile(x)

# Output:
#   0%  25%  50%  75% 100%
#    7   16   24   42  150

To access quartiles, you can use the following syntax:

# Get Q1 (25th percentile)
quantile(x)["25%"]

# Get Q3 (75th percentile)
quantile(x)["75%"]

You can also request specific percentiles other than the default:

# Calculate the 10th, 20th, 30th percentiles
quantile(x, probs = c(0.1, 0.2, 0.3))

Brief discussion: quantile methods in R

R provides nine quantile definitions. They differ only in how they place the percentile between sorted data values.

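  • Two families. Types 1–3 are discontinuous (stepwise) definitions that differ in how they handle the jump points; type 2, for example, averages the two neighboring data values at a jump. Types 4–9 are continuous and use linear interpolation between data points.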

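  • Textbook “median of halves.” This is Tukey’s hinges, used by fivenum() and by base R’s boxplot(). ggplot2’s geom_boxplot() instead takes the 25th and 75th percentiles from quantile(), so the two kinds of box plot can differ slightly in small samples. The hinges match the hand method shown above and often, but not always, agree with R’s default quantile().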

  • R default. quantile(..., type = 7) is the default. It interpolates to give smooth percentiles and is standard in R and S/S-PLUS.

  • Other common choices. Type 6 is popular in hydrology, type 8 targets median-unbiasedness across distributions, and type 9 targets normal-theory unbiasedness.

  • Practical impact. Differences can be noticeable for small samples or skewed data. For large n the methods tend to agree.

  • Course policy. For homework and assessments use quantile(x) and report answers that match R’s default.
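To see how the definitions can disagree, the sketch below computes the first quartile of a small made-up sample (the integers 1 to 10, chosen only for illustration) in three ways: R's default type 7, type 6, and Tukey's lower hinge from fivenum().

```r
# Hypothetical small sample with an even number of observations
y <- 1:10

q1_type7 <- quantile(y, probs = 0.25, type = 7, names = FALSE)  # R default
q1_type6 <- quantile(y, probs = 0.25, type = 6, names = FALSE)
q1_hinge <- fivenum(y)[2]   # Tukey's lower hinge ("median of halves")

c(type7 = q1_type7, type6 = q1_type6, hinge = q1_hinge)
```

The three methods give 3.25, 2.75, and 3 respectively, which is why the course fixes a single convention.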

3.4.2. The Interquartile Range (IQR)

The interquartile range (IQR) represents the spread of the data using the width of the middle 50%. It is calculated as the difference between the third quartile (\(Q_3\)) and the first quartile (\(Q_1\)):

\[\text{IQR} = Q_3 - Q_1.\]

Example 💡: Calculating the IQR

For the number of months to promotion data, compute the IQR.

From the previous example, \(Q_3 = 42\) and \(Q_1 = 16\). Then

\[\text{IQR} = 42 - 16 = 26.\]

This tells us that the middle 50% of the data spans 26 months, giving us a sense of how spread out the typical cases are.

R provides a built-in function to calculate the IQR.

# Calculate IQR
IQR(x)

# Alternative calculation
q <- quantile(x)
q["75%"] - q["25%"]
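The claim that the IQR is less sensitive to extreme values than the standard deviation can be checked informally. The sketch below drops the single largest observation (150 months) and compares how much each measure shrinks; the variable names are illustrative, and removing the point is done here only to show sensitivity, not as a recommended practice.

```r
x <- c(7, 12, 14, 14, 14, 18, 21, 22, 23, 24, 25, 34, 34, 37, 47, 49, 64, 100, 150)

# Same data without the single largest value
x_without_max <- x[x != max(x)]

# Ratio of each measure after removal to its value before removal;
# a ratio closer to 1 means the measure was less affected
sd_ratio  <- sd(x_without_max) / sd(x)
iqr_ratio <- IQR(x_without_max) / IQR(x)

c(sd_ratio = sd_ratio, iqr_ratio = iqr_ratio)
```

The standard deviation falls by a noticeably larger fraction than the IQR does when that one point is removed.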

3.4.3. Five-Number Summary

The five-number summary provides a concise overview of a dataset’s distribution by reporting the sample quartiles, together with the data’s minimum and maximum. This summary gives a comprehensive picture of the center (\(Q_2\)), spread (\(Q_1\) and \(Q_3\)), and extremes (min and max) of the data.

See the first code block of Example 💡: Computing the Sample Quartiles for an instance of a five-number summary.

3.4.4. Identifying Explicit Points with Fences

One important application of the IQR is to identify explicit points, that is, potential outliers, in the data. We use the so-called IQR rules to establish “fences” beyond which observations are flagged for further inspection.

  • Inner fences are computed using the 1.5 IQR rule:

    • Lower inner fence = \(Q_1 - 1.5 \times \text{IQR}\)

    • Upper inner fence = \(Q_3 + 1.5 \times \text{IQR}\)

  • Outer fences are computed using the 3 IQR rule:

    • Lower outer fence = \(Q_1 - 3 \times \text{IQR}\)

    • Upper outer fence = \(Q_3 + 3 \times \text{IQR}\)

Points that fall between the inner and outer fences are considered mild explicit points, while those beyond the outer fences are considered extreme explicit points.

Example 💡: Identifying the Explicit Points

For the number of months to promotion data, identify the inner and outer fences, and identify any mild and extreme explicit points.

7    12   14   14   14   18   21   22   23   24
25   34   34   37   47   49   64   100  150

From the previous example, \(Q_1 = 16\), \(Q_3 = 42\), and IQR = 26.

  • Lower inner fence: \(16 - (1.5)(26) = -23\)

  • Upper inner fence: \(42 + (1.5)(26) = 81\)

  • Lower outer fence: \(16 - (3)(26) = -62\)

  • Upper outer fence: \(42 + (3)(26) = 120\)

Since the value 100 exceeds the upper inner fence (81) but falls below the upper outer fence (120), it is classified as a mild explicit point. The value 150 exceeds the upper outer fence (120), making it an extreme explicit point. There are no lower explicit points in the data.
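The fence calculation and classification can be reproduced in R. This is a minimal sketch; the variable names (inner_fences, outer_fences, and so on) are illustrative choices rather than anything defined by the course.

```r
promotion <- c(7, 12, 14, 14, 14, 18, 21, 22, 23, 24, 25, 34, 34, 37, 47, 49, 64, 100, 150)

q1  <- quantile(promotion, 0.25, names = FALSE)
q3  <- quantile(promotion, 0.75, names = FALSE)
iqr <- q3 - q1

# Inner fences (1.5 IQR rule) and outer fences (3 IQR rule)
inner_fences <- c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)
outer_fences <- c(lower = q1 - 3 * iqr,   upper = q3 + 3 * iqr)

# Flag observations beyond each pair of fences
beyond_inner <- promotion < inner_fences["lower"] | promotion > inner_fences["upper"]
beyond_outer <- promotion < outer_fences["lower"] | promotion > outer_fences["upper"]

mild_points    <- promotion[beyond_inner & !beyond_outer]  # between inner and outer fences
extreme_points <- promotion[beyond_outer]                  # beyond the outer fences

inner_fences
outer_fences
mild_points
extreme_points
```

The results agree with the hand calculation: inner fences at −23 and 81, outer fences at −62 and 120, with 100 flagged as mild and 150 as extreme.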

3.4.5. Modified Box Plots

A modified box plot is a visual representation of the five-number summary and the explicit points identified by the 1.5 IQR rule. It consists of

  1. A box that spans from Q₁ to Q₃, representing the middle 50% of the data,

  2. A line inside the box marking the median (Q₂),

  3. Whiskers extending from the box to the most extreme data points that are not classified as explicit points, and

  4. Explicit points as dots beyond the whiskers.

Fig. 3.5 A modified box plot with labeled components

The dot inside the box represents the sample mean of the data. While it is not a formal component of a box plot, we often include it for a more comprehensive view of the data distribution.

Why do we call it modified?

The box plots introduced in this course are a modified version of the basic form, which does not account for explicit points. You are not expected to know or use the basic version. Whenever we refer to a box plot, we always mean the modified version.

Example 💡: Creating a Modified Box Plot

For the number of months to promotion data set, draw a modified box plot both by hand and using R. Then interpret the data’s distribution.

Drawing a modified box plot by hand

  1. Draw a horizontal or vertical axis that covers the full range of the data.

  2. Draw a line at Q₁ and another at Q₃, then draw a box that uses these two lines as its opposite sides.

  3. Draw a line across the box at the median (Q₂).

  4. Draw whiskers from the box to the most extreme data points that are NOT beyond the inner fences.

  5. Plot individual points for observations beyond the inner fences.
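Before drawing by hand, it can help to have R report where the whiskers end and which points are plotted individually. The sketch below uses base R's boxplot.stats(), which works from Tukey's hinges rather than quantile(); for this data set the hinges coincide with the quartiles found earlier.

```r
promotion <- c(7, 12, 14, 14, 14, 18, 21, 22, 23, 24, 25, 34, 34, 37, 47, 49, 64, 100, 150)

# boxplot.stats() applies the 1.5 IQR rule by default (coef = 1.5)
bp <- boxplot.stats(promotion)

bp$stats  # lower whisker end, lower hinge, median, upper hinge, upper whisker end
bp$out    # observations beyond the inner fences, drawn as individual points
```

Here the whiskers end at 7 and 64, the box spans 16 to 42 with the median at 24, and the points 100 and 150 are drawn separately.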

Drawing a modified box plot using R

# Import the graphing package
library(ggplot2)

# Create data vector
promotion <- c(7, 12, 14, 14, 14, 18, 21, 22, 23, 24, 25, 34, 34, 37, 47, 49, 64, 100, 150)

# Change format to data frame
promotion_df <- data.frame(months=promotion)

# Graph: months on the x-axis gives a horizontal box plot
# (swap x and y in aes() to make the plot vertical)
ggplot(promotion_df, aes(y = "", x = months)) +
   stat_boxplot(geom = "errorbar") +  # adds caps to the whisker ends
   geom_boxplot() +
   ggtitle("Boxplot of Months to Promotion") +
   stat_summary(fun = mean, col = "black", geom = "point", size = 3)  # marks the sample mean

The code above returns Fig. 3.5.

3.4.6. Reading box plots beyond the five number summary

Skewness

While box plots offer a less detailed view of the data distribution than histograms, they are effective for quickly identifying skewness. Let us compare the histograms and the box plots drawn for the same data sets:

Fig. 3.6 Histograms and box plots of skewed data sets

Recall that each of the four sections defined by the sample quartiles contains approximately the same number of data points. Therefore, if one section (a whisker or one half of the box) is much longer than the others, the data is more spread out in that part of the distribution. A long right whisker together with a wide upper half of the box, for example, suggests right skew.

Limitations

Box plots are efficient for identifying symmetry or skewness, but they are limited in the level of detail they provide on the shape of the distribution. Most importantly, we cannot

  • determine whether the data is normal (bell-shaped), or

  • identify the number of modes in the data

through a box plot. See the examples below—each with a distinct distribution, yet producing very similar box plots.

Fig. 3.7 Box plots of different symmetric distributions

To gain a detailed understanding of the shape of the data distribution, we must use more refined graphical tools such as histograms.

Explicit Points vs. Real Outliers

When interpreting a modified box plot, it’s crucial to understand that not all explicit points are real outliers.

  • Explicit points are observations flagged by statistical criteria like the 1.5 IQR rule. They are points that mathematically deviate from the pattern established by the majority of the data.

  • Real outliers are explicit points that, upon investigation, truly deviate from the underlying pattern of the data. They may represent errors, anomalies, or genuinely unusual cases.

When a data distribution is strongly skewed, for example, values on the longer tail may be identified as explicit points by the 1.5 IQR rule, although they are simply conforming to the underlying distribution. See Fig. 3.6.
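One way to make this concrete is to ask how often a value drawn from a known distribution would land outside that distribution's own inner fences. The sketch below does this with population quartiles for a normal and an exponential distribution; flag_probability is a hypothetical helper, and the two distributions are chosen only as examples of a symmetric and a strongly right-skewed shape.

```r
# Hypothetical helper: probability that a draw from a distribution falls
# outside its own inner fences, computed from population quartiles.
# qfun is the quantile function and pfun the cumulative distribution function.
flag_probability <- function(qfun, pfun) {
  q1  <- qfun(0.25)
  q3  <- qfun(0.75)
  iqr <- q3 - q1
  pfun(q1 - 1.5 * iqr) + (1 - pfun(q3 + 1.5 * iqr))
}

p_normal      <- flag_probability(qnorm, pnorm)  # symmetric, bell-shaped
p_exponential <- flag_probability(qexp, pexp)    # strongly right-skewed

round(c(normal = p_normal, exponential = p_exponential), 4)
```

Under these assumptions roughly 0.7% of normal values are flagged, compared with roughly 4.8% of exponential values, even though every one of those values follows the underlying distribution exactly.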

When a data point is flagged as explicit, it should be inspected more carefully to determine if:

  1. it represents an error in measurement, recording, or data entry,

  2. it is an observation from a different population or process,

  3. or it lies too far from the main body of the data to be considered part of the same distribution.

In these cases, the explicit points are considered true outliers.

Example 💡: Interpreting a box plot

Interpret the box plot of the number of months to promotion data.

Box plot of months to promotion data

  • The middle 50% of engineers were promoted between 16 and 42 months after hiring.

  • The median time to promotion was 24 months.

  • Two engineers (100 and 150 months) are identified as explicit points.

  • Since the right half of the data is more spread out than the left half (longer right whisker, greater distance between Q2 and Q3 than Q1 and Q2), the data set is right-skewed.

The two explicit points are very far from the main body of the plot with gaps larger than the IQR itself, so we must investigate the possibility of them being real outliers. They might represent engineers who:

  • Were hired with specialized skills not requiring management roles.

  • Chose to remain in technical roles longer.

  • Faced unusual circumstances affecting their promotion timeline.

  • Were erroneously included in the dataset.

3.4.7. Bringing It All Together

Key Takeaways 📝

  1. Sample quartiles divide the ordered data into four parts of approximately equal size.

  2. The interquartile range (IQR) measures the spread of the middle 50% of the data.

  3. The five-number summary provides a concise overview of the data distribution.

  4. Use the 1.5 × IQR rule to identify explicit points that may be potential outliers.

  5. Not all explicit points are real outliers - investigate them thoroughly before drawing conclusions.

  6. Modified box plots visualize the five-number summary and highlight explicit points.

In the next section, we’ll discuss how to choose the most appropriate measures of center and spread for different types of data distributions.

3.4.8. Exercises

These exercises build your skills in calculating and interpreting quartiles, the interquartile range (IQR), five-number summaries, and modified box plots.

Exercise 1: Calculating Quartiles and IQR

A network engineer measures the latency (in milliseconds) for 15 data packets:

\[\{12, 15, 18, 19, 22, 24, 25, 27, 29, 31, 34, 38, 42, 48, 55\}\]

  1. Find the five-number summary (Min, Q₁, Q₂, Q₃, Max).

  2. Calculate the interquartile range (IQR).

  3. Interpret the IQR in context: “The middle 50% of packet latencies span ___ milliseconds.”

  4. Verify your calculations using R.

  5. What percentage of packets had latencies between Q₁ and Q₃?

Solution

Part (a): Five-Number Summary

The data is already sorted with \(n = 15\).

Step 1: Find Q₂ (median)

Since n = 15 is odd, Q₂ is the \(\frac{15+1}{2} = 8\)th value.

\(Q_2 = 27\) ms

Step 2: Find Q₁

Consider the lower half: {12, 15, 18, 19, 22, 24, 25, 27} (8 values, including the median)

Q₁ is the median of this subset: average of 4th and 5th values.

\(Q_1 = \frac{19 + 22}{2} = 20.5\) ms

Step 3: Find Q₃

Consider the upper half: {27, 29, 31, 34, 38, 42, 48, 55} (8 values, including the median)

Q₃ is the median of this subset: average of 4th and 5th values.

\(Q_3 = \frac{34 + 38}{2} = 36\) ms

Five-Number Summary:

  • Minimum: 12 ms

  • Q₁: 20.5 ms

  • Q₂ (Median): 27 ms

  • Q₃: 36 ms

  • Maximum: 55 ms

Part (b): Interquartile Range

\[\text{IQR} = Q_3 - Q_1 = 36 - 20.5 = 15.5 \text{ ms}\]

Part (c): Interpretation

“The middle 50% of packet latencies span 15.5 milliseconds.”

This means the typical variation in latency (excluding the fastest and slowest 25% of packets) is about 15.5 ms.

Part (d): R Verification

latency <- c(12, 15, 18, 19, 22, 24, 25, 27, 29, 31, 34, 38, 42, 48, 55)

# Five-number summary
quantile(latency)
#   0%  25%  50%  75% 100%
# 12.0 20.5 27.0 36.0 55.0

# IQR
IQR(latency)  # Returns 15.5

Part (e): Percentage Between Q₁ and Q₃

By definition, approximately 50% of the data falls between Q₁ and Q₃. The interquartile range is the width of this “middle half” of the data.


Exercise 2: Identifying Explicit Points

A quality control engineer measures the tensile strength (in MPa) of 20 steel samples:

\[\{415, 422, 428, 431, 435, 438, 441, 445, 448, 452, 455, 458, 462, 468, 472, 478, 485, 495, 520, 580\}\]

  1. Calculate Q₁, Q₃, and the IQR.

  2. Calculate the inner fences (using the 1.5 × IQR rule).

  3. Calculate the outer fences (using the 3 × IQR rule).

  4. Identify any mild explicit points and extreme explicit points.

  5. If you removed the explicit points and recalculated, would the fences change? Explain why this could create problems.

Solution

Part (a): Quartiles and IQR

Data is sorted with \(n = 20\).

Q₂ (median): Average of 10th and 11th values = \(\frac{452 + 455}{2} = 453.5\) MPa

Q₁: Median of lower half {415, 422, 428, 431, 435, 438, 441, 445, 448, 452}

Average of 5th and 6th values = \(\frac{435 + 438}{2} = 436.5\) MPa

Q₃: Median of upper half {455, 458, 462, 468, 472, 478, 485, 495, 520, 580}

Average of 5th and 6th values = \(\frac{472 + 478}{2} = 475\) MPa

IQR = Q₃ − Q₁ = 475 − 436.5 = 38.5 MPa

Part (b): Inner Fences (1.5 × IQR)

\[\text{Lower inner fence} = Q_1 - 1.5 \times \text{IQR} = 436.5 - 1.5(38.5) = 436.5 - 57.75 = 378.75 \text{ MPa}\]
\[\text{Upper inner fence} = Q_3 + 1.5 \times \text{IQR} = 475 + 1.5(38.5) = 475 + 57.75 = 532.75 \text{ MPa}\]

Part (c): Outer Fences (3 × IQR)

\[\text{Lower outer fence} = Q_1 - 3 \times \text{IQR} = 436.5 - 3(38.5) = 436.5 - 115.5 = 321 \text{ MPa}\]
\[\text{Upper outer fence} = Q_3 + 3 \times \text{IQR} = 475 + 3(38.5) = 475 + 115.5 = 590.5 \text{ MPa}\]

Part (d): Identifying Explicit Points

Check each value against the fences:

  • Mild explicit points: Values between inner and outer fences

    • 580 MPa: Above upper inner fence (532.75) but below upper outer fence (590.5) → Mild explicit point

  • Extreme explicit points: Values beyond outer fences

    • None (580 < 590.5 and all values > 321)

  • 520 MPa is below 532.75, so it is NOT an explicit point

Summary: One mild explicit point (580 MPa), no extreme explicit points.
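As a cross-check, the sketch below recomputes the upper fences in R twice: once from Tukey's hinges via fivenum(), which reproduce the hand method used in this solution, and once from R's default quantile(). Because \(n = 20\) is even, the two give slightly different quartiles. The helper upper_fences is hypothetical and introduced only for this comparison.

```r
strength <- c(415, 422, 428, 431, 435, 438, 441, 445, 448, 452,
              455, 458, 462, 468, 472, 478, 485, 495, 520, 580)

# Hand method ("median of halves") corresponds to Tukey's hinges
hinges <- fivenum(strength)[c(2, 4)]

# R's default (type 7) quartiles
q_default <- quantile(strength, probs = c(0.25, 0.75), names = FALSE)

# Hypothetical helper: upper inner and outer fences from a (Q1, Q3) pair
upper_fences <- function(q) {
  iqr <- q[2] - q[1]
  c(inner = q[2] + 1.5 * iqr, outer = q[2] + 3 * iqr)
}

fences_hand    <- upper_fences(hinges)
fences_default <- upper_fences(q_default)

rbind(hand = fences_hand, default = fences_default)
```

The hand method gives upper fences of 532.75 and 590.5, while the default quantile() gives 527.875 and 582.25. Under either convention 520 is not flagged and 580 is a mild explicit point, so the conclusion does not change here.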

Part (e): Removing Explicit Points and Recalculating

Yes, the fences would change if we removed 580 and recalculated.

Why this is problematic:

  1. Moving target: After removing 580, the new Q₃ and IQR would decrease, potentially flagging 520 as a new explicit point

  2. Iterative deletion: This could lead to repeatedly removing points, shrinking the dataset artificially

  3. Data manipulation: Systematically removing explicit points without justification biases the analysis

  4. Loss of information: Extreme values may represent real phenomena worth studying

Best practice: Calculate fences once using the original data, then investigate flagged points rather than automatically removing them.
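The moving-target effect can be seen by recomputing the upper inner fence after deleting 580. This sketch uses R's default quantile(), so the "before" fence differs slightly from the hand-computed 532.75; upper_inner_fence is a hypothetical helper name.

```r
strength <- c(415, 422, 428, 431, 435, 438, 441, 445, 448, 452,
              455, 458, 462, 468, 472, 478, 485, 495, 520, 580)

# Hypothetical helper: upper inner fence under the 1.5 IQR rule
upper_inner_fence <- function(v) {
  q <- quantile(v, probs = c(0.25, 0.75), names = FALSE)
  q[2] + 1.5 * (q[2] - q[1])
}

fence_before <- upper_inner_fence(strength)                   # all 20 samples
fence_after  <- upper_inner_fence(strength[strength != 580])  # 580 removed

c(before = fence_before, after = fence_after)
```

The fence drops from 527.875 to 520.25, leaving 520 MPa only a quarter of a unit inside it. One more deletion of this kind could easily push an ordinary observation over the line.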


Exercise 3: Interpreting Box Plots

The box plots below compare response times (in seconds) for two different web servers.

Fig. 3.8 Response times for Server A and Server B

  1. For each server, estimate the five-number summary from the box plot.

  2. Which server has a higher median response time?

  3. Which server shows more variability in response times? Justify using the IQR.

  4. Server A shows two explicit points. Are these necessarily errors that should be removed? Explain.

  5. Based on the box plots, which server appears to have a right-skewed distribution? How can you tell?

  6. A manager wants to choose the server with more consistent performance. Which would you recommend and why?

Solution

Part (a): Estimated Five-Number Summaries

Note: Exact values will depend on the figure; these are representative estimates.

Server A:

  • Min (lower whisker): ~0.8 sec

  • Q₁: ~1.2 sec

  • Q₂ (median): ~1.5 sec

  • Q₃: ~2.0 sec

  • Max (excluding explicit points): ~2.8 sec

  • Explicit points: ~3.5 sec and ~4.2 sec

Server B:

  • Min: ~1.0 sec

  • Q₁: ~1.8 sec

  • Q₂ (median): ~2.5 sec

  • Q₃: ~3.2 sec

  • Max: ~4.0 sec

  • No explicit points

Part (b): Higher Median

Server B has a higher median response time (~2.5 sec vs ~1.5 sec).

Part (c): More Variability

Server B shows more variability.

  • Server A IQR ≈ 2.0 − 1.2 = 0.8 sec

  • Server B IQR ≈ 3.2 − 1.8 = 1.4 sec

Server B’s IQR is nearly twice as large, indicating the middle 50% of its response times are more spread out.

Part (d): Are Explicit Points Errors?

Not necessarily. The explicit points on Server A could represent:

  • Legitimate slow responses: Heavy load periods, complex queries, or network congestion

  • Real system behavior: Occasional garbage collection or resource contention

  • Edge cases: Valid but unusual requests

These points should be investigated rather than automatically removed. They may reveal important system behavior under stress.

Part (e): Skewness

Server A appears right-skewed:

  • The right whisker is longer than the left whisker

  • The distance from Q₂ to Q₃ is greater than Q₁ to Q₂

  • There are explicit points on the high end only

Server B appears approximately symmetric or slightly right-skewed:

  • The box and whiskers are more balanced

  • No explicit points flagged

Part (f): Recommendation

Recommend Server A for more consistent performance.

Reasoning:

  • Server A has a smaller IQR (0.8 vs 1.4 sec), meaning more consistent typical performance

  • Server A has a lower median (1.5 vs 2.5 sec), meaning faster typical responses

  • The explicit points on Server A are rare occurrences and can be monitored

  • For most requests, Server A provides faster, more predictable response times

Caveat: If the explicit points on Server A occur frequently or during critical operations, this recommendation might change.


Exercise 4: Creating and Interpreting Box Plots

A pharmaceutical researcher measures the time (in hours) for a drug to take effect in 25 patients:

effect_time <- c(1.2, 1.5, 1.8, 2.0, 2.1, 2.3, 2.4, 2.5, 2.6, 2.7,
                 2.8, 2.9, 3.0, 3.1, 3.2, 3.4, 3.5, 3.7, 3.9, 4.2,
                 4.5, 5.0, 5.8, 8.5, 12.0)

  1. Calculate the five-number summary and IQR.

  2. Determine the inner fences and identify any explicit points.

  3. Write R code to create a modified box plot with whisker caps and a mean point.

  4. Describe the shape of the distribution based on the box plot.

  5. Two patients had effect times of 8.5 and 12.0 hours. The researcher suspects these might be non-responders or patients who didn’t follow dosing instructions. Should these values be excluded from the analysis? Discuss.

Solution

Part (a): Five-Number Summary and IQR

Using R:

effect_time <- c(1.2, 1.5, 1.8, 2.0, 2.1, 2.3, 2.4, 2.5, 2.6, 2.7,
                 2.8, 2.9, 3.0, 3.1, 3.2, 3.4, 3.5, 3.7, 3.9, 4.2,
                 4.5, 5.0, 5.8, 8.5, 12.0)

quantile(effect_time)
#   0%  25%  50%  75% 100%
#  1.2  2.4  3.0  3.9 12.0

IQR(effect_time)  # 1.5

Five-Number Summary:

  • Min: 1.2 hours

  • Q₁: 2.4 hours

  • Q₂: 3.0 hours

  • Q₃: 3.9 hours

  • Max: 12.0 hours

IQR = 3.9 − 2.4 = 1.5 hours

With \(n = 25\), the quartiles fall exactly on the 7th, 13th, and 19th ordered values, so the hand method and R agree.

Part (b): Inner Fences and Explicit Points

\[\text{Lower inner fence} = 2.4 - 1.5(1.5) = 2.4 - 2.25 = 0.15 \text{ hours}\]
\[\text{Upper inner fence} = 3.9 + 1.5(1.5) = 3.9 + 2.25 = 6.15 \text{ hours}\]

Explicit points (values above 6.15):

  • 8.5 hours: Check against the upper outer fence, 3.9 + 3(1.5) = 8.4. Since 8.5 > 8.4, this is an extreme explicit point, although only barely.

  • 12.0 hours: Since 12.0 > 8.4, this is also an extreme explicit point.

Part (c): R Code for Box Plot

library(ggplot2)

effect_df <- data.frame(time = effect_time)

ggplot(effect_df, aes(x = "", y = time)) +
  stat_boxplot(geom = "errorbar", width = 0.2) +  # Whisker caps
  geom_boxplot(fill = "lightblue", width = 0.4) +
  stat_summary(fun = mean, geom = "point",
               color = "black", size = 3) +  # Mean point
  ggtitle("Drug Effect Time Distribution") +
  ylab("Time to Effect (hours)") +
  xlab("") +
  theme_minimal()

Part (d): Shape of Distribution

The distribution is right-skewed (positively skewed):

  • The upper whisker is much longer than the lower whisker

  • The distance from Q₂ to Q₃ (3.9 − 3.0 = 0.9 hours) is greater than the distance from Q₁ to Q₂ (3.0 − 2.4 = 0.6 hours)

  • There are explicit points only on the upper end

  • The mean would be greater than the median due to the high values

This pattern is common for time-to-effect data: most patients respond within a typical range, but some take much longer.
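The skewness indicators listed above can also be checked numerically. The following is a small sketch using R's default quantile(); the names lower_gap and upper_gap are illustrative.

```r
effect_time <- c(1.2, 1.5, 1.8, 2.0, 2.1, 2.3, 2.4, 2.5, 2.6, 2.7,
                 2.8, 2.9, 3.0, 3.1, 3.2, 3.4, 3.5, 3.7, 3.9, 4.2,
                 4.5, 5.0, 5.8, 8.5, 12.0)

q <- quantile(effect_time, probs = c(0.25, 0.5, 0.75), names = FALSE)

lower_gap <- q[2] - q[1]   # distance from Q1 to the median
upper_gap <- q[3] - q[2]   # distance from the median to Q3

c(lower_gap = lower_gap,
  upper_gap = upper_gap,
  mean      = mean(effect_time),
  median    = median(effect_time))
```

A larger upper gap and a mean above the median are both consistent with a right-skewed distribution.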

Part (e): Should Explicit Points Be Excluded?

This requires careful consideration:

Arguments for exclusion:

  • If investigation confirms non-compliance (missed doses, wrong timing)

  • If the patients are genuinely non-responders (different population)

  • If there was a data recording error

Arguments against exclusion:

  • These may represent real variation in drug response

  • Some patients naturally metabolize drugs more slowly

  • Excluding them could underestimate variability and mislead clinical expectations

  • Regulatory agencies require reporting all patient outcomes

Recommendation:

  1. Investigate the specific circumstances for these two patients

  2. Report results both ways: full dataset and with exclusions (sensitivity analysis)

  3. Document any exclusions with clear justification

  4. Never exclude solely because values are statistical outliers

If the 8.5 and 12.0 hour patients had legitimate reasons for slow response (e.g., genetic factors, concurrent medications), they should remain in the analysis as they represent real patient outcomes.


Exercise 5: Comparing Box Plots Across Groups

A manufacturing company produces ball bearings at three different facilities. The diameters (in mm) of samples from each facility are summarized below:

Facility   Min (mm)   Q₁ (mm)   Median (mm)   Q₃ (mm)   Max (mm)   Explicit Points (mm)
A          9.92       9.97      10.00         10.03     10.08      None
B          9.85       9.94      10.01         10.06     10.12      10.25
C          9.88       9.96      9.99          10.04     10.15      None

The target diameter is 10.00 mm.

  1. Calculate the IQR for each facility.

  2. Which facility produces the most consistent bearings? Which produces the least consistent?

  3. Facility B has an explicit point at 10.25 mm. Calculate whether this would be classified as a mild or extreme explicit point.

  4. Which facility is best centered on the target of 10.00 mm?

  5. If you could only choose one facility based on this data, which would you recommend for high-precision applications? Justify your answer.

Solution

Part (a): IQR for Each Facility

  • Facility A: IQR = 10.03 − 9.97 = 0.06 mm

  • Facility B: IQR = 10.06 − 9.94 = 0.12 mm

  • Facility C: IQR = 10.04 − 9.96 = 0.08 mm

Part (b): Consistency Comparison

  • Most consistent: Facility A (smallest IQR = 0.06 mm)

  • Least consistent: Facility B (largest IQR = 0.12 mm)

Facility A’s middle 50% of bearings vary by only 0.06 mm, while Facility B’s vary by twice that amount (0.12 mm).

Part (c): Classification of Facility B’s Explicit Point

For Facility B:

  • Q₁ = 9.94, Q₃ = 10.06, IQR = 0.12

Inner fences:

  • Upper inner fence = 10.06 + 1.5(0.12) = 10.06 + 0.18 = 10.24 mm

Outer fences:

  • Upper outer fence = 10.06 + 3(0.12) = 10.06 + 0.36 = 10.42 mm

Since 10.25 > 10.24 (upper inner fence) but 10.25 < 10.42 (upper outer fence):

10.25 mm is a mild explicit point.
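When only summary statistics are available, as in this exercise, the same classification can be scripted from \(Q_1\) and \(Q_3\) alone. The function classify_point below is a hypothetical helper written for this illustration; it treats a value lying exactly on a fence as not beyond it.

```r
# Hypothetical helper: classify one value using the 1.5 IQR and 3 IQR rules
classify_point <- function(value, q1, q3) {
  iqr <- q3 - q1
  beyond_inner <- value < q1 - 1.5 * iqr || value > q3 + 1.5 * iqr
  beyond_outer <- value < q1 - 3 * iqr   || value > q3 + 3 * iqr
  if (beyond_outer) {
    "extreme"
  } else if (beyond_inner) {
    "mild"
  } else {
    "not flagged"
  }
}

# Facility B: Q1 = 9.94 mm, Q3 = 10.06 mm
facility_b_result <- classify_point(10.25, q1 = 9.94, q3 = 10.06)
facility_b_result
```

The call returns "mild", in agreement with the calculation above.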

Part (d): Best Centered on Target

Target = 10.00 mm. Compare medians:

  • Facility A: Median = 10.00 mm (exactly on target)

  • Facility B: Median = 10.01 mm (0.01 mm above)

  • Facility C: Median = 9.99 mm (0.01 mm below)

Facility A is best centered, with its median exactly at the target.

Part (e): Recommendation for High-Precision Applications

Recommend Facility A.

Justification:

  1. Smallest IQR (0.06 mm): Most consistent production

  2. Median exactly on target (10.00 mm): Best centered

  3. No explicit points: No anomalous measurements

  4. Tightest range (9.92 to 10.08 = 0.16 mm): Even extreme values are close to target

For high-precision applications, consistency is crucial. Facility A provides both the best accuracy (centered on target) and precision (low variability).

Facility C would be a reasonable second choice (IQR = 0.08, median only 0.01 mm off target), but Facility B should be avoided for precision work due to its high variability and explicit points.


Exercise 6: Explicit Points vs. Real Outliers

A data analyst examines daily sales data (in thousands of dollars) for a retail store over 60 days. The modified box plot shows several explicit points on the high end.

Fig. 3.9 Daily sales showing explicit points at high values

The five-number summary is: Min = $12K, Q₁ = $25K, Median = $32K, Q₃ = $42K, Max = $95K.

Three values are flagged as explicit points: $72K, $85K, and $95K.

  1. Verify that these three values are correctly identified as explicit points using the 1.5 × IQR rule.

  2. The store manager explains that the $72K day was Black Friday, the $85K day was the week before Christmas, and the $95K day was a special promotion event. Are these real outliers or expected behavior? Explain.

  3. If these explicit points are removed, the new median becomes $30K and the new Q₃ becomes $38K. Discuss whether removing them is appropriate.

  4. How would you recommend presenting this sales data in a report? Should you report the summary statistics with or without these high-sales days?

  5. The analyst notices that the distribution appears right-skewed. Is this expected for retail sales data? Why or why not?

Solution

Part (a): Verifying Explicit Points

Given: Q₁ = $25K, Q₃ = $42K

IQR = 42 − 25 = $17K

Upper inner fence = Q₃ + 1.5 × IQR = 42 + 1.5(17) = 42 + 25.5 = $67.5K

Upper outer fence = Q₃ + 3 × IQR = 42 + 3(17) = 42 + 51 = $93K

Classification:

  • $72K: Above $67.5K (inner fence), below $93K (outer fence) → Mild explicit point

  • $85K: Above $67.5K, below $93K → Mild explicit point

  • $95K: Above $93K (outer fence) → Extreme explicit point

All three are correctly identified.

Part (b): Real Outliers or Expected Behavior?

These are NOT real outliers—they are expected behavior.

  • Black Friday ($72K): Known annual shopping event with predictably high sales

  • Week before Christmas ($85K): Peak retail season, expected to be high

  • Special promotion ($95K): Planned event with anticipated sales surge

These represent legitimate business events, not errors or anomalies. They are:

  • Predictable and recurring (holidays occur every year)

  • Explainable by known factors

  • Part of the natural retail sales cycle

  • Important for business planning and revenue projections

The 1.5 × IQR rule flagged them because they deviate from “typical” days, but “typical” isn’t the full picture for seasonal businesses.

Part (c): Should They Be Removed?

No, removing them would be inappropriate.

Reasons:

  1. They represent real revenue: The store actually earned this money

  2. They affect annual totals: Removing them understates true business performance

  3. They’re predictable: Not random errors but expected events

  4. Misleading summaries: Reporting median of $30K instead of $32K misrepresents typical performance by ignoring important sales days

  5. Business planning: These days may contribute 15-20% of annual revenue—critical for staffing, inventory, and cash flow planning

Part (d): Reporting Recommendations

Report both perspectives:

  1. Overall summary (including high-sales days):

    • Median: $32K

    • IQR: $17K

    • Note: “Distribution includes seasonal peaks”

  2. “Typical day” summary (excluding known events):

    • Median: $30K

    • IQR: lower value

    • Note: “Represents non-promotional days”

  3. Separate the special events:

    • “Black Friday: $72K”

    • “Holiday week: $85K”

    • “Special promotion: $95K”

  4. Visualize appropriately:

    • Show the box plot with explicit points labeled by event type

    • Consider separate box plots for “regular days” vs “special events”

This approach provides a complete picture without hiding important revenue sources.

Part (e): Is Right-Skewness Expected?

Yes, right-skewness is expected for retail sales data.

Reasons:

  1. Natural floor: Sales can’t go below $0, but there’s no upper limit

  2. Typical clustering: Most days have “normal” sales in a moderate range

  3. Occasional spikes: Holidays, promotions, and special events create a long right tail

  4. Economic pattern: Consumer spending follows this pattern across most retail contexts

This is why retail businesses often report median sales rather than mean—the mean would be inflated by high-sales days and may not represent a “typical” day well.
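
To make the mean-versus-median point concrete, here is a minimal sketch. The daily sales figures are hypothetical, invented for illustration; they are not the exercise's dataset, although the three high values are the ones discussed above.

```python
from statistics import mean, median

# Hypothetical daily sales in $K (NOT the exercise's data): moderate days only.
regular_days = [22, 25, 28, 30, 31, 33, 36, 40]

# Same days plus the three high-sales days, which form a long right tail.
with_peaks = regular_days + [72, 85, 95]

for label, sales in (("regular days", regular_days), ("with peaks", with_peaks)):
    print(f"{label}: median = {median(sales):.1f}, mean = {mean(sales):.1f}")
```

With these made-up numbers the median moves only from 30.5 to 33, while the mean rises from about 30.6 to about 45.2, which is the kind of inflation the paragraph above describes.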


3.4.9. Additional Practice Problems

True/False Questions (1 point each)

  1. The interquartile range (IQR) measures the spread of the entire dataset.

    Ⓣ or Ⓕ

  2. Q₂ is another name for the sample median.

    Ⓣ or Ⓕ

  3. All explicit points identified by the 1.5 × IQR rule are errors that should be removed.

    Ⓣ or Ⓕ

  4. In a modified box plot, the whiskers always extend to the minimum and maximum values.

    Ⓣ or Ⓕ

  5. If a distribution is right-skewed, the right whisker of its box plot will typically be longer than the left whisker.

    Ⓣ or Ⓕ

  6. The five-number summary includes the mean.

    Ⓣ or Ⓕ

Multiple Choice Questions (2 points each)

  1. For a dataset with Q₁ = 20 and Q₃ = 50, what is the upper inner fence?

    Ⓐ 65

    Ⓑ 80

    Ⓒ 95

    Ⓓ 140

  2. A value falls above the upper inner fence but below the upper outer fence. This value is classified as:

    Ⓐ A normal observation

    Ⓑ A mild explicit point

    Ⓒ An extreme explicit point

    Ⓓ An error that must be removed

  3. Two box plots have the same median, but Box Plot A has a much larger IQR than Box Plot B. Which statement is correct?

    Ⓐ Dataset A has a larger range

    Ⓑ Dataset A has more variability in the middle 50% of observations

    Ⓒ Dataset A has more explicit points

    Ⓓ Dataset A has a larger mean

  4. Which of the following CANNOT be determined from a box plot alone?

    Ⓐ Whether the distribution is skewed

    Ⓑ The approximate IQR

    Ⓒ Whether explicit points exist

    Ⓓ Whether the distribution is bimodal

Answers to Practice Problems

True/False Answers:

  1. False — The IQR measures the spread of the middle 50% of the data, not the entire dataset.

  2. True — Q₂ (the second quartile) is the 50th percentile, which is the definition of the median.

  3. False — Explicit points should be investigated, not automatically removed. They may be legitimate unusual values, not errors.

  4. False — In a modified box plot, whiskers extend to the most extreme values that are NOT explicit points. Explicit points are shown as individual dots beyond the whiskers.

  5. True — In a right-skewed distribution, values extend further to the right, making the right whisker longer. There may also be explicit points on the right side.

  6. False — The five-number summary includes: Minimum, Q₁, Median (Q₂), Q₃, and Maximum. The mean is NOT part of the five-number summary (though it’s often added to box plots as a reference point).
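
True/False answer 4 can be illustrated numerically. The sketch below is one possible reading of the "three medians" procedure described earlier in this section (the median is shared by both halves when it falls on a data point); the dataset and the helper names `quartiles` and `modified_boxplot_parts` are hypothetical.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) by taking the median of the whole data and of each half.

    When n is odd, the median is a data point and is included in both halves.
    """
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2              # half size; halves overlap by one value when n is odd
    return median(xs[:half]), median(xs), median(xs[n - half:])

def modified_boxplot_parts(data):
    """Find where the whiskers end and which values are drawn as separate points."""
    q1, _, q3 = quartiles(data)
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    flagged = sorted(x for x in data if x < lower_fence or x > upper_fence)
    # Whiskers stop at the most extreme values that are NOT flagged.
    return {"whiskers": (min(inside), max(inside)), "flagged": flagged}

# Hypothetical data with one unusually large value.
data = [12, 15, 17, 18, 20, 21, 23, 24, 26, 48]
parts = modified_boxplot_parts(data)
print(quartiles(data), parts)
```

Here the maximum is 48, but the upper fence is 24 + 1.5(7) = 34.5, so the right whisker stops at 26 and 48 is drawn as an individual point.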

Multiple Choice Answers:

  1. Ⓒ — IQR = Q₃ − Q₁ = 50 − 20 = 30. Upper inner fence = Q₃ + 1.5 × IQR = 50 + 1.5(30) = 50 + 45 = 95.

  2. Ⓑ — Values between the inner and outer fences are classified as mild explicit points. Values beyond the outer fences are extreme explicit points.

  3. Ⓑ — The IQR specifically measures the spread of the middle 50% of observations. A larger IQR means more variability in the middle 50%. The other options cannot be determined from IQR alone.

  4. Ⓓ — Box plots show quartiles and explicit points but cannot reveal modality (whether data is unimodal, bimodal, etc.). The same box plot could represent very different distributional shapes. A histogram is needed to identify modes.
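
The last answer can be demonstrated with two small hypothetical datasets that share a five-number summary but have different shapes. This is a sketch under stated assumptions: the data are invented, and `statistics.quantiles` with `method="inclusive"` is used as a stand-in for the section's quartile procedure, with which it agrees for these 17-value datasets because the quartile positions land exactly on data points.

```python
from collections import Counter
from statistics import quantiles

def five_number(data):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return (min(data), q1, q2, q3, max(data))

# Hypothetical data: values pile up around 8 (one peak).
one_peak = [0, 2, 3, 4, 4, 7, 8, 8, 8, 8, 8, 9, 12, 12, 13, 14, 16]

# Hypothetical data: values pile up around 4 and around 12 (two peaks).
two_peaks = [0, 4, 4, 4, 4, 4, 5, 5, 8, 11, 11, 12, 12, 12, 12, 12, 16]

print(five_number(one_peak))
print(five_number(two_peaks))

# Frequency counts reveal what the box plot hides.
print(Counter(one_peak).most_common(2))
print(Counter(two_peaks).most_common(2))
```

Both datasets give the summary (0, 4, 8, 12, 16) and neither has a value beyond the fences, so their modified box plots would look the same, yet the frequency counts show one cluster in the first dataset and two in the second.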