3.5. Choosing the Right Measure & Comparing Measures Across Data Sets

Now that we’ve explored different measures of central tendency and spread, we need to determine which measures are most appropriate for different types of data. The effectiveness of each measure depends on the distribution shape and the presence of outliers.

Road Map 🧭

  • Understand what it means for a measure to be resistant.

  • Examine how skewness affects different measures of center and spread.

  • Learn which measures to choose when dealing with outliers.

  • Explore standardization for comparing observations across different datasets.

  • See how side-by-side boxplots can compare distributions across groups.

3.5.1. Resistant and Non-Resistant Measures

One of the most important considerations when choosing summary statistics is whether they are resistant to outliers and extreme values. This is because under the presence of extreme values, certain measures lose their representative power for the data set.

  • A resistant measure is one that isn’t strongly affected by extreme values in the dataset,

  • A non-resistant measure can be heavily influenced by them.

Skewed distributions provide an excellent way to understand the concept of resistance. Let’s examine what happens to our measures of center and spread when data is skewed.

Comparison of negatively and positively skewed distributions

Fig. 3.8 Comparison of negatively and positively skewed distributions with box plots and measures of center

In the figure above, we can see:

  1. Negatively Skewed Distribution (left): The data is bunched toward the higher end with a tail extending to the left.

  2. Positively Skewed Distribution (right): The data is bunched toward the lower end with a tail extending to the right.

Effect of skewness on numerical summary measures

1. Sample mean is pulled towards the tail

Notice how the mean (solid dot) is pulled in the direction of the tail, while the median remains firmly in the middle of the ordered data.

2. Many points in the tail are marked as explicit points

When data is skewed, box plots often flag many points as “explicit” in the direction of the tail.

However, these flagged points don’t necessarily represent true outliers. They may simply be part of the natural tail behavior of the distribution.

Recall that real outliers typically show a clear gap between them and the rest of the data. These points should be investigated more thoroughly to determine if they are real outliers, but it is evident that many explicit points on Fig. 3.8 are not.

3.5.2. Choosing the Right Pair of measures

Sample Mean vs. Sample Median

As we can see from the skewed distributions,

  • The sample mean (x̄) is non-resistant because it’s calculated using all data values and gives equal weight to each observation, including outliers. As a result, it is pulled towards values which are far from the main trend of the data.

  • The sample median (x̃) is resistant because it depends only on the order (ranks) of most of the data, not the exact values.

Sample Variance (Standard Deviation) vs. IQR

Similarly, our measures of spread also differ in their resistance to outliers:

  • The sample variance (s²) and standard deviation (s) are non-resistant because: - They depend on the distance from each point to the sample mean, which is already non-resistant. - They square these distances, giving even more weight to extreme values.

  • The interquartile range (IQR) is resistant because: - It only considers the middle 50% of the data - It’s not affected by extreme values in either tail

Summary

Property of data distribution

Resistance required?

Measure of center

Measure of variability

Skewed or has outliers

Yes

Sample median

IQR

Reasonably symmetric and has no outliers

No

Sample mean

Sample variance (sd)

3.5.3. Comparing Measures Among Data Sets

Side-by-Side Box Plots for Group Comparisons

When comparing a quantitative variable across categories, side-by-side box plots provide an excellent visualization tool.

Side-by-side box plots of heights by voice type

Fig. 3.9 Heights of singers in the New York Choral Society (1979) by voice type

This example shows the heights of singers in the New York Choral Society, categorized by voice type (soprano, alto, tenor, and bass). The box plots allow us to compare the distributions and see patterns:

Example : Plotting and interpret a side-by-side box plot

Interpretation

  • Tenors and basses (typically male) tend to be taller than sopranos and altos (typically female)

  • There is some overlap in heights between altos and tenors

  • Each voice type shows a different distribution of heights

This code produces a plot similar to the one shown in the figure above. Let’s break down what each part of the code does:

  1. We first organize our data in a data frame with two columns: Voice (the categorical variable) and Height (the quantitative variable)

  2. We create a custom function convert_in_to_ft to display heights in a more readable format

  3. We use ggplot2 to create the visualization: - stat_boxplot(geom = “errorbar”) adds the whisker ends - geom_boxplot(fill = “purple”) creates the box plots with purple fill - stat_summary() adds dots representing the means - Various theming options control the appearance of text and labels

Standardization: Comparing Apples to Oranges with Z-Scores

Often, we need to compare observations from different variables that use different scales or units. Standardization allows us to convert values to a common scale, making direct comparisons possible.

The z-score (or standardized value) for an observation is calculated as:

\[z_i = \frac{x_i - \bar{x}}{s}\]

Where: * x_i is the original observation * x̄ is the sample mean * s is the sample standard deviation

Z-score formula and explanation

Fig. 3.10 Z-score formula and properties

Z-scores tell us how many standard deviations an observation is from the mean:

  • Positive z-scores indicate values above the mean

  • Negative z-scores indicate values below the mean

  • A z-score of 0 indicates a value equal to the mean

For example, z = -2.25 means the observation is 2.25 standard deviations below the mean.

Properties and Uses of Z-Scores

Z-scores have several important properties:

  1. They express relative standing in terms of standard deviations from the mean

  2. They allow for direct comparison of observations from different distributions

  3. They are unitless measures (the original units cancel out in the calculation)

  4. The sum of all z-scores in a dataset is always 0

Z-scores are particularly useful when:

  • Comparing performances across different measures (e.g., comparing a student’s standing in both math and reading)

  • Identifying unusual observations (values with z-scores above 3 or below -3 are often considered unusual)

  • Creating standardized scales for assessment or evaluation

Computing Z-Scores in R

Calculating z-scores in R is straightforward. You can use the built-in scale() function or calculate them manually:

Listing 3.1 Computing z-scores in R
# Sample data
x <- c(4, 8, 7, 9, 4, 3, 5, 1, 4)

# Method 1: Using the scale() function
z_scores <- scale(x)

# Method 2: Manual calculation
z_scores_manual <- (x - mean(x)) / sd(x)

# Print the results
print(z_scores)
print(z_scores_manual)

# Verify that the sum of z-scores is (approximately) zero
sum(z_scores)

# Example of comparing values from different distributions
# Test scores from two different exams
math_scores <- c(78, 85, 92, 65, 70, 88, 75)
english_scores <- c(82, 90, 88, 72, 68, 85, 80)

# Calculate z-scores for both distributions
math_z <- scale(math_scores)
english_z <- scale(english_scores)

# Create a data frame to compare a student's performance on both tests
# For example, comparing the performance of student #2
comparison <- data.frame(
  Subject = c("Math", "English"),
  Raw_Score = c(math_scores[2], english_scores[2]),
  Z_Score = c(math_z[2], english_z[2])
)

print(comparison)

This code demonstrates how to calculate z-scores and how they can be used to compare performance across different distributions. The output shows that even though the raw scores might differ, the z-scores provide a standardized measure that allows for direct comparison.

3.5.4. Bringing It All Together

Key Takeaways 📝

  1. Resistant measures (median, IQR) are not strongly affected by outliers and skewness and should be used when data is skewed or contains outliers.

  2. Non-resistant measures (mean, standard deviation) are influenced by extreme values but have better statistical properties for symmetric distributions without outliers.

  3. The mean is always pulled in the direction of the tail in skewed distributions, while the median remains representative of the center.

  4. Side-by-side box plots allow us to compare distributions of a quantitative variable across categories of a categorical variable.

  5. Z-scores standardize observations by expressing them in terms of standard deviations from the mean, enabling comparisons across different distributions.

  6. When analyzing data, always consider the shape of the distribution and the presence of outliers when choosing summary measures.

Looking Ahead: From Description to Inference

The numerical summary measures we’ve explored in this chapter are fundamental tools for descriptive statistics—they help us characterize and understand our sample data. However, in future chapters, these measures will also serve as the foundation for statistical inference.

In particular, the concepts of resistant and non-resistant measures connect to important properties of statistical estimators:

  • Unbiasedness: Does the estimator target the correct population parameter on average?

  • Consistency: Does the estimator converge to the true parameter as sample size increases?

  • Efficiency: How much variability does the estimator have?

  • Robustness: How well does the estimator perform when assumptions are violated?

The sample mean is an unbiased estimator of the population mean, but it lacks robustness to outliers. The sample median, while not always the most efficient estimator, provides robustness when dealing with skewed distributions or data with outliers.

As we transition from descriptive statistics to probability in the next chapter, and then to statistical inference, keep these properties in mind. The choices we make about which measures to use will impact the validity and reliability of our statistical conclusions.

In the next chapter, we’ll begin exploring probability concepts that provide the theoretical foundation for statistical inference.

Exercises

  1. Resistant vs. Non-Resistant Measures: For each of the following datasets, state whether you would use the mean and standard deviation or the median and IQR as summary measures, and explain why: a) Annual salaries of employees at a small company with one extremely high-paid CEO b) Heights of randomly selected adult males from a normally distributed population c) Times to completion of a task with a few extremely slow performances

  2. Interpreting Z-Scores: A student took tests in both math and English. She scored 85 on the math test (mean = 75, sd = 5) and 88 on the English test (mean = 80, sd = 10). a) Calculate the z-score for each test result b) On which test did she perform better relative to her peers? c) Explain why we can’t compare the raw scores directly

  3. Skewed Distributions: Using the dataset {3, 4, 5, 5, 6, 7, 8, 15, 22}: a) Create a dot plot or histogram and describe the shape of the distribution b) Calculate the mean, median, standard deviation, and IQR c) Explain how the relationship between the mean and median confirms your assessment of the distribution shape d) Which measures would you recommend for summarizing this dataset?

  4. Side-by-Side Box Plot Interpretation: A researcher collects data on the recovery time (in days) for patients using three different treatments: * Treatment A: {5, 7, 8, 9, 10, 12, 14} * Treatment B: {3, 4, 5, 6, 7, 8, 20} * Treatment C: {8, 9, 10, 11, 12, 13, 14}

    1. Create side-by-side box plots for the three treatments

    2. Which treatment appears to have the shortest typical recovery time?

    3. Which treatment has the most consistent results?

    4. Are there any potential outliers? If so, how might they affect your assessment?

  5. Z-Score Application: The heights of adult female volleyball players have a mean of 180 cm with a standard deviation of 5 cm. The heights of adult female gymnasts have a mean of 155 cm with a standard deviation of 6 cm. a) If a woman is 172 cm tall, calculate her z-score relative to both the volleyball player and gymnast populations b) Relative to their respective sports, would this woman be considered tall or short? c) Explain how z-scores help us make this comparison despite the different means and standard deviations

3.5.5. Appendix: Code for Creating a Side-by-Side Box Plot

Let’s first walk through how to create side-by-side box plots in R using the New York Choral Society height data. First, we’ll need to structure our data properly. Here’s the dataset:

Listing 3.2 Singer heights data (in inches)
# Singer heights data (in inches)
singer_data <- data.frame(
Voice = c(rep("Soprano", 36), rep("Alto", 35), rep("Tenor", 21), rep("Bass", 39)),
Height = c(
   # Soprano heights
   64, 62, 66, 65, 60, 61, 65, 66, 65, 63, 67, 65, 62, 65, 68, 65, 63, 65, 62, 65, 66, 62, 65, 63, 65, 66, 65, 62, 65, 66, 65, 61, 65, 66, 65, 62,
   # Alto heights
   65, 62, 68, 67, 67, 63, 67, 66, 63, 72, 62, 61, 66, 64, 60, 61, 66, 66, 66, 62, 70, 65, 64, 63, 65, 69, 61, 66, 65, 61, 63, 64, 67, 66, 68,
   # Tenor heights
   69, 72, 71, 66, 76, 74, 71, 66, 68, 67, 70, 65, 72, 70, 68, 73, 66, 68, 67, 64, 0,
   # Bass heights
   72, 70, 72, 69, 73, 71, 72, 68, 68, 71, 66, 68, 71, 73, 73, 70, 68, 70, 75, 68, 71, 70, 74, 70, 75, 75, 69, 72, 71, 70, 71, 68, 70, 75, 72, 66, 72, 70, 69
)
)

# Remove any zero values (missing data)
singer_data <- subset(singer_data, Height > 0)

Now we can create the box plots using ggplot2:


caption:

Creating side-by-side box plots with ggplot2

library(ggplot2)

# Create a function to convert heights in inches to feet and inches format for labels convert_in_to_ft <- function(values) {

feet <- floor(values / 12) inches <- floor(values %% 12) value <- sprintf(“%sft %sin”, feet, inches) return(value)

}

# Create the side-by-side box plots ggplot(singer_data, aes(x = Voice, y = Height)) +

stat_boxplot(geom = “errorbar”) + # Add whisker ends geom_boxplot(fill = “purple”) + # Create box plots with purple fill theme(text = element_text(size = 24, face = “bold”),

axis.text = element_text(size = 20, face = “bold”)) +

stat_summary(fun = mean, col = “black”, geom = “point”, size = 3) + # Add mean points scale_y_continuous(label = convert_in_to_ft, # Custom y-axis labels

breaks = c(60, 65, 70, 75)) +

xlab(“”) + # No x-axis label needed ylab(“Height”) # y-axis label