
3.5. Choosing the Right Measure & Comparing Measures Across Data Sets

Now that we’ve explored different measures of central tendency and spread, we need to determine which measures are most appropriate for different types of data. The effectiveness of each measure depends on the distribution shape and the presence of outliers.

Road Map 🧭

Understand what it means for a measure to be resistant.
Examine how skewness affects different measures of center and spread.
Learn which measures to choose when the data is skewed or outliers are present.
Understand why non-resistant measures are often favored over the resistant ones.
Explore standardization for comparing observations across different datasets.
See how side-by-side box plots can compare distributions across groups.

3.5.1. Resistant and Non-Resistant Measures

One of the most important considerations when choosing summary statistics is whether they are resistant to extreme values. When extreme values are present, some measures no longer represent the bulk of the data set well.

A resistant measure is one that is not strongly affected by extreme values.
A non-resistant measure is a measure that is significantly influenced by them.

Skewed distributions provide an excellent way to understand the concept of resistance. Let’s examine what happens to our measures of center and spread when data is skewed.

Effect of Skewness on Numerical Summary Measures

Comparison of negatively and positively skewed distributions — Fig. 3.8 Skewed distributions

Observation 1: Sample mean is pulled towards the tail

In Fig. 3.8, notice how the sample means (solid dots) are pulled in the direction of the tail, while the medians remain firmly in the middle of the ordered data.

Observation 2: Many points in the tail are marked as explicit points

When data is skewed, box plots often flag many points as “explicit” in the direction of the tail. However, these flagged points don’t necessarily represent true outliers. They may simply be part of the natural tail behavior of the distribution.

Recall that real outliers typically show a clear gap from the rest of the data. Flagged points should be investigated further to determine whether they are real outliers, but many of the explicit points in Fig. 3.8 clearly are not.
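To make Observation 2 concrete, here is a minimal sketch on made-up right-skewed data. It assumes the common convention of flagging points that lie more than 1.5 × IQR beyond the quartiles, and it uses the "inclusive" quartile method from Python's `statistics` module. Both are assumptions for illustration, and plotting software may compute quartiles slightly differently.

```python
from statistics import quantiles

# Hypothetical right-skewed toy data (not the data behind Fig. 3.8).
data = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 8, 11, 15, 22]

# Quartiles; the "inclusive" method is one of several conventions,
# so other software may report slightly different values.
q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed flagging rule: beyond 1.5 * IQR from the nearest quartile.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
flagged = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"fences = ({lower_fence}, {upper_fence})")
print(f"flagged = {flagged}")
```

Three values (11, 15, 22) are flagged here, yet the upper values 8, 11, 15, 22 thin out gradually rather than sitting apart from the rest after one clear gap. This is the tail behavior described above, not necessarily a set of true outliers.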

3.5.2. Choosing the Right Pair of Measures

Sample Mean vs. Sample Median

As we can see from the skewed distributions in Fig. 3.8,

The sample mean (\(\bar{x}\)) is non-resistant. This is because it gives equal weight to each observation, including the extreme values.
The sample median (\(\tilde{x}\)) is resistant because it depends only on the value(s) in the middle of the ordered data. Making the largest or smallest observations more extreme does not move it.
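A small numerical check can illustrate the difference. The sketch below uses made-up numbers and Python's standard `statistics` module; it replaces the largest value with an extreme one and recomputes both measures.

```python
from statistics import mean, median

# Hypothetical toy data: five ordinary values.
base = [10, 12, 13, 15, 20]
# Same data with the largest value replaced by an extreme one.
with_extreme = [10, 12, 13, 15, 200]

mean_base, median_base = mean(base), median(base)
mean_extreme, median_extreme = mean(with_extreme), median(with_extreme)

print(f"base:         mean = {mean_base}, median = {median_base}")
print(f"with extreme: mean = {mean_extreme}, median = {median_extreme}")
```

The mean moves from 14 to 50, while the median stays at 13 in both cases.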

Sample Variance (Standard Deviation) vs. IQR

Similarly, the measures of spread also differ in their resistance to outliers.

The sample variance (\(s^2\)) and sample standard deviation (\(s\)) are non-resistant because
- their computation depends on the sample mean, which is already non-resistant.
- their computation involves squaring the distances between data points and the sample mean, which amplifies the effect of extreme values.
The interquartile range (IQR) is resistant because it excludes the extreme values from computation by only considering the middle 50% of the data.
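The same kind of check works for the measures of spread. This is a minimal sketch on made-up data; the quartiles use the "inclusive" method from Python's `statistics` module, which is one of several conventions and may differ slightly from the one used elsewhere in the course.

```python
from statistics import quantiles, stdev

def iqr(values):
    """Interquartile range, Q3 - Q1, using the 'inclusive' quartile method."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical toy data, and the same data with one extreme value.
base = [2, 4, 5, 6, 7, 8, 9, 10, 12]
with_extreme = [2, 4, 5, 6, 7, 8, 9, 10, 120]

sd_base, sd_extreme = stdev(base), stdev(with_extreme)
iqr_base, iqr_extreme = iqr(base), iqr(with_extreme)

print(f"base:         s = {sd_base:.2f}, IQR = {iqr_base}")
print(f"with extreme: s = {sd_extreme:.2f}, IQR = {iqr_extreme}")
```

In this example the standard deviation grows more than tenfold, while the IQR is 4 for both data sets because the middle 50% of the values is unchanged.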

Why Use Non-Resistant Measures At All?

\(\bar{x}\) and \(s\) (or \(s^2\)) are still favored over \(\tilde{x}\) and IQR when skewness and outliers are not a concern, because they carry important theoretical properties related to normality. These properties will form the foundation of many inference methods we develop later in the semester.

Summary

Property of data distribution	Resistance required?	Measure of center	Measure of variability
Skewed or has outliers	Yes	Sample median	IQR
Reasonably symmetric and has no outliers	No	Sample mean	Sample variance (sd)

3.5.3. Comparing Measures Across Data Sets

A. Side-by-Side Box Plots for Group Comparisons

When comparing a quantitative variable across categories, side-by-side box plots provide an excellent visualization tool.

Side-by-side box plots of heights by voice type — Fig. 3.9 Heights of singers in the New York Choral Society (1979) by voice type

This example shows the heights of singers in the New York Choral Society, categorized by voice type (soprano, alto, tenor, and bass). Some key observations are:

Singers with lower voice ranges tend to be taller.
There is substantial overlap in heights between tenors and basses, and between sopranos and altos.
Each voice type shows a different distribution shape of heights. For example, the distribution is negatively skewed for the soprano singers, while it is fairly symmetric for the altos except for a single outlier.
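A side-by-side box plot is essentially a picture of each group's five-number summary on a shared axis. The sketch below computes those summaries for two groups. The heights are made up for illustration (they are not the choral society data), `five_number_summary` is a hypothetical helper, and the "inclusive" quartile method is an assumed convention.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, q3, max(values))

# Hypothetical heights in inches (NOT the actual choral society data).
heights = {
    "soprano": [60, 62, 63, 64, 65, 65, 66, 66, 67],
    "bass": [66, 68, 69, 70, 71, 72, 72, 74, 75],
}

summaries = {group: five_number_summary(v) for group, v in heights.items()}
for group, summary in summaries.items():
    print(group, summary)
```

Comparing the summaries group by group (for example, the two medians, or whether the boxes from Q1 to Q3 overlap) is the numerical counterpart of reading the plot.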

B. Standardization: Comparing Apples to Oranges

Often, we need to compare observations from different variables that use different scales or units. Standardization allows us to convert values to a common scale, making direct comparisons possible.

Suppose we have a data set \(x_1, x_2, \cdots, x_n\), with a sample mean \(\bar{x}\) and sample standard deviation \(s\). Then for each \(x_i\), its standardized value \(z_i\) is calculated as:

\[z_i = \frac{x_i - \bar{x}}{s}.\]

The standardized values tell us how many standard deviations the data points are above or below the mean:

Positive \(z_i\) indicates that \(x_i\) is above the sample mean.
Negative \(z_i\) indicates that \(x_i\) is below the sample mean.
\(z_i=0\) indicates that \(x_i\) is equal to the sample mean.
\(z_i = -2.25\) means the observation \(x_i\) is 2.25 standard deviations below the sample mean.
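The formula can be applied in a few lines of code. This is a minimal sketch on made-up data; `standardize` is a hypothetical helper name, and it uses the sample standard deviation (with the \(n-1\) denominator) from Python's `statistics` module.

```python
from statistics import mean, stdev

def standardize(values):
    """Return the z-score of each value, using the sample mean and the
    sample standard deviation (n - 1 denominator)."""
    x_bar = mean(values)
    s = stdev(values)
    return [(x - x_bar) / s for x in values]

# Hypothetical toy data with sample mean 10 and sample sd 2.
data = [7, 8, 9, 9, 10, 10, 12, 12, 13]
z = standardize(data)

for x, z_i in zip(data, z):
    print(f"x = {x:2d}  ->  z = {z_i:+.2f}")
```

For instance, the value 7 is three units below the mean of 10, and with \(s = 2\) that is 1.5 standard deviations below, so its standardized value is \(-1.5\).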

Properties and Uses of Standardized Data

After standardization, the new data set \(z_1, z_2, \cdots, z_n\) always has a sample mean of 0 and a sample standard deviation of 1 (this can be verified through a few algebraic steps). In other words, standardization adjusts any data set so that each value is measured relative to a common center and scale. This leads to several important properties:

Standardized values are unitless (the original units cancel out in the calculation), so they provide an immediate sense of distance from the center. For example, a data point with \(z=-3\) lies three standard deviations below the sample mean, which is most likely very far from the central mass of the data.
They allow for direct comparison of observations from different distributions.
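The claim that standardized data have a sample mean of 0 and a sample standard deviation of 1 can be checked directly. The sketch below assumes \(s > 0\) and the \(n-1\) denominator for the sample variance; \(\bar{z}\) and \(s_z\) are notation introduced here for the sample mean and sample standard deviation of the standardized values.

\[\bar{z} = \frac{1}{n}\sum_{i=1}^{n} z_i = \frac{1}{n}\sum_{i=1}^{n} \frac{x_i - \bar{x}}{s} = \frac{1}{s}\left(\frac{1}{n}\sum_{i=1}^{n} x_i - \bar{x}\right) = \frac{1}{s}(\bar{x} - \bar{x}) = 0.\]

\[s_z^2 = \frac{1}{n-1}\sum_{i=1}^{n} (z_i - \bar{z})^2 = \frac{1}{n-1}\sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{s^2} = \frac{1}{s^2}\cdot\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{s^2}{s^2} = 1.\]

Taking the square root gives \(s_z = 1\).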

3.5.4. Bringing It All Together

Key Takeaways 📝

Resistant measures (median, IQR) are not strongly affected by outliers and skewness and should be used when data is skewed or contains outliers.
Non-resistant measures (mean, standard deviation) are influenced by extreme values but have better statistical properties for symmetric distributions without outliers.
The sample mean is typically pulled in the direction of the tail in skewed distributions, while the sample median remains representative of the center.
Side-by-side box plots allow us to compare distributions of a quantitative variable across categories of a categorical variable.
Standardized observations enable comparisons across different distributions.

Exercises

Resistant vs. Non-Resistant Measures: For each of the following datasets, state whether you would use the mean and standard deviation or the median and IQR as summary measures, and explain why:
1. Annual salaries of employees at a small company with one extremely high-paid CEO
2. Heights of randomly selected adult males from a normally distributed population
3. Times to completion of a task with a few extremely slow performances
Standardization: A student took tests in both math and English. She scored 85 on the math test (mean = 75, sd = 5) and 88 on the English test (mean = 80, sd = 10).
1. Calculate the z-score for each test result.
2. On which test did she perform better relative to her peers?
3. Explain why we can’t compare the raw scores directly.
Skewed Distributions: Using the dataset {3, 4, 5, 5, 6, 7, 8, 15, 22}:
1. Create a dot plot or histogram and describe the shape of the distribution.
2. Calculate the mean, median, standard deviation, and IQR.
3. Explain how the relationship between the mean and median confirms your assessment of the distribution shape.
4. Which measures would you recommend for summarizing this dataset?
Side-by-Side Box Plot Interpretation: A researcher collects data on the recovery time (in days) for patients using three different treatments:
- Treatment A: {5, 7, 8, 9, 10, 12, 14}
- Treatment B: {3, 4, 5, 6, 7, 8, 20}
- Treatment C: {8, 9, 10, 11, 12, 13, 14}
1. Create side-by-side box plots for the three treatments.
2. Which treatment appears to have the shortest typical recovery time?
3. Which treatment has the most consistent results?
4. Are there any potential outliers? If so, how might they affect your assessment?