2.4. Exploring Quantitative Distributions: Modality, Skewness & Outliers

Histograms reveal patterns that help us interpret the overall shape of a data distribution. In this lesson, we will develop a systematic approach to reading histograms by focusing on three key characteristics: modality, skewness, and outliers.

Road Map 🧭

Identify patterns in histograms by examining their shape, center, and spread.
Distinguish between unimodal, bimodal, and multimodal distributions.
Recognize symmetric versus skewed distributions.
Learn to spot potential outliers that may require further investigation.

2.4.1. Looking for Patterns in Histograms

Fig. 2.10 Histogram of the furnace data set (see Section 2.3)

Once we generate a histogram, we begin our data-reading process by noting the basic properties such as the center and spread of the data.

Center: Where is the bulk of the data located? Is there one central location or multiple?
Spread: How widely dispersed are the values around these central locations?

According to Fig. 2.10, the center of the furnace data set is located around 10, and it spans a range between 0 and 20, approximately.
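The two questions above can also be answered numerically. The following is a minimal sketch using a small made-up vector; the values in `x` are hypothetical stand-ins and are not the furnace data, which are not reproduced here.

```r
# Hypothetical measurements (NOT the furnace data)
x <- c(2, 5, 7, 8, 9, 10, 10, 11, 12, 13, 15, 18)

center <- median(x)   # middle of the sorted values
spread <- range(x)    # smallest and largest observed values
middle <- IQR(x)      # width of the middle 50% of the data

center
spread
middle
```

The median and range are only one possible choice of summaries; a histogram remains the quickest way to see whether a single "center" is meaningful at all.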

After covering the basics, we move on to classify the shape of the data distribution based on a few key characteristics:

Modality
Skewness
Presence of any potential outliers

Let us discuss each of these characteristics in detail.

2.4.2. Modality: Counting the Peaks

Modality tells us whether our data is concentrated around a single location or multiple locations.

A mode is a local peak in a histogram or density curve. Counting them helps us understand whether the data represents one homogeneous population or multiple distinct subgroups.

Unimodal distributions have a single peak, suggesting that the data is coming from a single population centered around a unique location. Most textbook examples show unimodal distributions in the classic “bell curve” shape just like Fig. 2.10, but a unimodal distribution can take many forms—it only needs to have one primary peak.

Bimodal distributions show two peaks in their histograms. The existence of two centers might indicate that the data was sampled from two different populations with distinct characteristics.

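Multimodal distributions have more than two peaks in their histograms, potentially indicating that the data was collected from several different populations. (Some texts use the term for any distribution with two or more peaks, so that bimodal is a special case.)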
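To make the idea of "counting the peaks" concrete, here is a rough sketch that counts local maxima in a vector of histogram bin counts. The function name `count_peaks` and the two count vectors are hypothetical and chosen only for illustration. In practice, a raw count like this is sensitive to the binning and should be read together with the plot.

```r
# Count local peaks in a vector of bin counts (hypothetical helper)
count_peaks <- function(counts) {
  padded <- c(-Inf, counts, -Inf)   # lets the first and last bins qualify as peaks
  slope  <- sign(diff(padded))      # +1 where counts rise, -1 where they fall
  slope  <- slope[slope != 0]       # treat a flat top as a single peak
  sum(diff(slope) == -2)            # a peak is a rise followed by a fall
}

unimodal_counts <- c(2, 5, 9, 5, 2)
bimodal_counts  <- c(1, 4, 9, 4, 2, 5, 8, 3, 1)

count_peaks(unimodal_counts)
count_peaks(bimodal_counts)
```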

Caution 🛑

Sometimes, apparent multimodality in a histogram can be an artifact of using too many bins rather than a genuine feature of the data. If the bin count is too large, you’ll tend to see several modes because individual data points are highlighted rather than the overall summary. This is why choosing an appropriate bin size is crucial when creating histograms.
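The effect of the bin count can be checked without drawing anything. The sketch below bins the eruption lengths from R's built-in `faithful` dataset three different ways and counts the empty bins in each case. The helper `bin_counts` is a hypothetical name, and the choices of 5 and 100 bins are arbitrary.

```r
data(faithful)
x <- faithful$eruptions

# Hypothetical helper: counts for exactly k equal-width bins spanning the data
bin_counts <- function(x, k) {
  breaks <- seq(min(x), max(x), length.out = k + 1)
  hist(x, breaks = breaks, plot = FALSE)$counts
}

few      <- bin_counts(x, 5)                           # very coarse
moderate <- bin_counts(x, round(sqrt(length(x))) + 2)  # square-root rule plus two
many     <- bin_counts(x, 100)                         # very fine

# Number of empty bins under each choice
c(few = sum(few == 0), moderate = sum(moderate == 0), many = sum(many == 0))
```

With a very fine binning, many bins hold no observations at all, so the histogram shows isolated spikes instead of a summary of the overall shape.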

Example 💡: Old Faithful Eruption Data

Let’s examine the eruption data of the Old Faithful geyser in Yellowstone National Park. This built-in R dataset (faithful) records two variables for each observation: the duration of an eruption and the waiting time until the next eruption, both in minutes. Let us plot the histograms of both variables.

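library(ggplot2)
data(faithful)  # Load the built-in Old Faithful dataset

# Number of bins: square root of the sample size, plus two
nbins <- round(sqrt(nrow(faithful))) + 2

# Plot eruption times on the density scale so that the histogram
# and the overlaid curves share the same y-axis
ggplot(faithful, aes(x = eruptions)) +
  geom_histogram(aes(y = after_stat(density)), bins = nbins,
                 fill = "dodgerblue", colour = "black", linewidth = 1) +
  geom_density(colour = "red", linewidth = 1.2) +  # kernel density estimate
  # Normal curve with the sample mean and standard deviation, for comparison
  stat_function(fun = dnorm, colour = "blue", linewidth = 1.2,
                args = list(mean = mean(faithful$eruptions),
                            sd = sd(faithful$eruptions))) +
  labs(title = "Old Faithful eruption lengths – bimodal distribution",
       x = "Eruption time (minutes)", y = "Density")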

# Plot waiting times using your own code

Fig. 2.11 Length of eruption

The histogram of lengths of eruption shows two distinct peaks—one around 2 minutes and another around 4.5 minutes. This indicates that eruptions tend to fall into two categories: shorter eruptions and longer eruptions, with relatively few eruptions of intermediate length.

Fig. 2.12 Waiting duration

The histogram of waiting times also reveals two modes—shorter waiting periods and longer ones, with a dip in the middle region around 60 minutes.

The bimodality in both variables reflects the physical processes within the geyser. Short eruptions tend to be followed by shorter waiting times, while long eruptions precede longer waiting intervals. The red density curves capture this bimodal pattern beautifully. In contrast, the blue normal curves clearly fail to capture the two-peaked nature of the data, showing how misleading it would be to treat this data as a simple unimodal distribution.

This example illustrates an important principle: when you encounter multimodality, it often signals the presence of distinct subgroups that should be analyzed separately rather than averaged together.
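As a rough numerical check of this principle, the sketch below splits the eruptions into "short" and "long" groups and compares the group averages with the overall averages. The 3-minute cutoff is an assumption, chosen because it falls in the sparse region between the two peaks; it is not a value prescribed by the data.

```r
data(faithful)

# Strength of the linear association between eruption length and waiting time
r <- cor(faithful$eruptions, faithful$waiting)

# Label each eruption; the 3-minute cutoff is an assumption
faithful$type <- ifelse(faithful$eruptions < 3, "short", "long")

# Averages over all observations, ignoring the two groups
overall_means <- colMeans(faithful[, c("eruptions", "waiting")])

# Averages within each group
group_means <- aggregate(cbind(eruptions, waiting) ~ type,
                         data = faithful, FUN = mean)

r
overall_means
group_means
```

The overall average eruption length lies between the two group averages and describes neither group well, which is why the subgroups are better summarized separately.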

2.4.3. Symmetry and Skewness

We call a data set symmetric if its distribution is reasonably balanced around a central value. When the data set is not symmetric, we call it skewed. A data set can be right (positively) skewed or left (negatively) skewed depending on the direction of the longer tail:

If the tail stretches toward higher values, the distribution is right (or positively) skewed.
If the tail stretches toward lower values, the distribution is left (or negatively) skewed.

Fig. 2.13 Histograms with different degrees and directions of skewness
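The direction of skewness can also be checked numerically. The sketch below uses the moment-based sample skewness, which is one common convention among several. The function `skewness` is defined here because base R does not provide one, and the three small samples are made up for illustration.

```r
# Moment-based sample skewness (helper defined here, not a base R function)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 4, 10)  # long tail toward higher values
left_skewed  <- -right_skewed                  # mirror image: tail toward lower values

skewness(symmetric)      # zero for a perfectly balanced sample
skewness(right_skewed)   # positive
skewness(left_skewed)    # negative

# In this right-skewed sample, the long tail pulls the mean above the median
mean(right_skewed) > median(right_skewed)
```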

2.4.4. Identifying Potential Outliers

When examining a histogram, we also look for potential outliers—observations that deviate significantly from the overall pattern of the data. When encountered, they should not immediately be discarded but investigated thoroughly. This is important because:

They may represent errors in data collection or entry.
They could be valid but unusual observations that provide valuable insights.
They can significantly influence statistical measures like the mean.
They may indicate violations of assumptions in statistical inference procedures.

When R creates a histogram, it automatically determines an appropriate range for the axes to ensure all data points are included in the plot. If there are values far from the majority of the data, the plotting range will be extended to include them, creating visible gaps in the histogram where no observations fall.

Fig. 2.14 Histogram with potential outliers marked with red circles
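The gap pattern described above can be located programmatically. The following is a minimal sketch on a made-up sample with one isolated high value. It assumes the gap lies on the right side of the histogram, and it only flags values for investigation rather than deciding whether they are errors.

```r
# Hypothetical sample: nine values near 10 and one far away at 25
x <- c(8, 9, 9, 10, 10, 10, 11, 11, 12, 25)

# Bin the data without drawing: bins of width 2 from 6 to 26
h <- hist(x, breaks = seq(6, 26, by = 2), plot = FALSE)
h$counts

# Empty bins mark the gap between the bulk of the data and the isolated value
empty_bins <- which(h$counts == 0)

# Values beyond the right edge of the last empty bin are flagged for a closer look
gap_end <- h$breaks[max(empty_bins) + 1]
flagged <- x[x > gap_end]
flagged

# Influence on the mean: compare with and without the flagged value
mean(x)
mean(x[x <= gap_end])
```

Here a single flagged value moves the mean noticeably while the median stays the same, which illustrates why potential outliers deserve a careful look before any summary is reported.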

2.4.5. Bringing It All Together

Key Takeaways 📝

Histograms reveal the shape, center, spread, and potential outliers of quantitative data.
Potential outliers require careful investigation—they may represent errors or important observations.
Understanding the shape of a data distribution helps us choose appropriate statistical tools and avoid misleading conclusions.