2.4. Exploring Quantitative Distributions: Modality, Shape & Outliers
Seeing the Story in the Shape
After creating a histogram, we can use it to understand the properties of our quantitative variables more deeply. Histograms reveal patterns that help us interpret the general shape of our data distribution. Is the data concentrated around one central location or multiple locations? Does it spread out symmetrically, or is it pulled in one direction? Are there any unusual values that deviate from the overall pattern? In this lesson, we’ll develop your eye for identifying modality, skewness, and outliers - three key characteristics that help diagnose data quality and guide your choice of statistical models.
Road Map 🧭
Identify overall patterns in histograms, examining shape, center, and spread
Distinguish between unimodal, bimodal, and multimodal distributions
Recognize symmetric versus skewed distributions
Learn to spot potential outliers that may require further investigation
2.4.1. Looking for Patterns in Histograms
When examining a histogram, we’re looking for an overall pattern that summarizes our data visually. This pattern helps us understand the general behavior of our measurements, while any deviations from that pattern may represent important anomalies worth investigating further.
The key characteristics we’re looking for include:
Shape: Is the distribution symmetrical around a central location, or is it pulled in one direction?
Center: Where is the bulk of the data located? Is there one central location or multiple?
Spread: How widely dispersed are the values around these central locations?
Deviations: Are there any observations that stand out from the overall pattern?
By systematically examining these aspects of our histogram, we can gain valuable insights about our data before proceeding to more formal statistical analysis.
2.4.2. Modality: Counting the Peaks
One of the first characteristics we want to identify in our data is its modality. Modality tells us about the central tendencies of our data - specifically, whether our data is concentrated around a single location or multiple locations.
A mode is a local peak in a histogram or density curve. The number of modes helps us understand whether our data might represent one homogeneous population or multiple distinct subgroups.
Unimodal distributions have a single peak, suggesting that our data is coming from a single population centered around a unique location. Most textbook examples show unimodal distributions, often in the classic “bell curve” shape, but a unimodal distribution can take many forms - it simply needs to have one primary peak.
Bimodal distributions have two distinct peaks, suggesting that there might be another variable besides our quantitative variable that could partition the data into two separate groups. These groups might represent sampling from two different populations that could be characterized by distinct categories.
Multimodal distributions have multiple peaks beyond two, potentially indicating that we’re sampling from several different populations. For a quantitative variable with a multimodal distribution, there may be a categorical variable that divides the population into several different groups.
A word of caution: sometimes apparent multimodality in a histogram can be an artifact of using too fine a bin size rather than a genuine feature of the data. If you increase the bin size too much (creating too many bins), you’ll tend to see several modes simply because you’re focusing on individual data points rather than summarizing the information effectively. This is why choosing an appropriate bin size is crucial when creating histograms.
Example 💡 – Old Faithful Eruption Data
Let’s examine real data from the Old Faithful geyser in Yellowstone National Park.

This built-in R dataset contains measurements on both the waiting time between eruptions and the duration of eruptions themselves.
When we plot histograms of these variables, we observe striking bimodal patterns in both:
library(ggplot2)
data(faithful) # Load Old Faithful dataset
nbins <- round(sqrt(nrow(faithful))) + 2
# Plot eruption times
ggplot(faithful, aes(eruptions)) +
geom_histogram(bins = nbins, fill = "dodgerblue", colour = "black", linewidth = 1) +
geom_density(colour = "red", linewidth = 1.2) +
labs(title = "Old Faithful eruption lengths – bimodal distribution",
x = "Eruption time (minutes)", y = "Frequency")
# Plot waiting times (code would be similar)
The eruption time histogram shows two distinct peaks - one around 2 minutes and another around 4.5 minutes. This indicates that eruptions tend to fall into two categories: shorter eruptions and longer eruptions, with relatively few eruptions of intermediate length.
Similarly, the waiting time histogram reveals two modes - shorter waiting periods and longer ones, with a dip in the middle region around 60 minutes. This pattern tells us something about the frequency of eruptions: if you’ve waited about an hour without seeing an eruption, you’ll likely need to wait another 30-60 minutes.
The bimodality in both variables isn’t just a statistical curiosity—it reflects physical processes within the geyser. Short eruptions tend to be followed by shorter waiting times, while long eruptions precede longer waiting intervals.
When we overlay the red density curve, it captures this bimodal pattern beautifully. In contrast, the blue curve (representing a normal distribution with the same mean and standard deviation) clearly fails to capture the two-peaked nature of the data, showing how misleading it would be to treat this data as a simple unimodal distribution.
This example illustrates an important principle: when you encounter multimodality, it often signals the presence of distinct subgroups that should be analyzed separately rather than averaged together.
2.4.3. Skewness: Symmetry and Direction
Another important property of distributions is their symmetry or skewness. A symmetric distribution has roughly equal shapes on both sides when divided at its center. However, many real-world distributions are pulled or “skewed” in one direction or another.
When examining a histogram, we need to determine whether the data is:
Symmetric: Evenly balanced around a central value, with similar tails on both sides
Right-skewed (or positively skewed): The majority of data is concentrated on the left side, with the tail stretching to the right
Left-skewed (or negatively skewed): The majority of data is concentrated on the right side, with the tail stretching to the left
The direction of the tail determines the direction of skewness. If the tail stretches toward higher values (to the right), the distribution is right-skewed or positively skewed. If the tail stretches toward lower values (to the left), the distribution is left-skewed or negatively skewed.
Identifying skewness is important because it affects:
Which measures of central tendency are most appropriate (mean vs. median)
Which statistical procedures we can use in later analysis
Whether we might need to transform our data for certain statistical tests
We’ll prefer symmetric data for many statistical inference procedures we’ll study later in the course, particularly when working with the normal distribution. When data is skewed, transformations like logarithms can sometimes help make the distribution more symmetric.
Example 💡 – Black Cherry Trees Data

Let’s examine the built-in trees dataset in R, which contains measurements on 31 felled black cherry trees, including their height (in feet), diameter (in inches), and volume (in cubic feet).
When we create histograms for these variables, we can practice identifying their skewness:
library(ggplot2)
data(trees)
# Height histogram
ggplot(trees, aes(Height)) +
geom_histogram(bins = 8, fill = "lightblue", colour = "black") +
labs(title = "Black Cherry tree heights",
x = "Height (feet)", y = "Frequency")
# Similar code for Girth (diameter) and Volume histograms
Looking at the height histogram, the distribution appears roughly symmetric, though it might be pulled slightly to the left (negative skew). Most trees are concentrated in the middle height range, with a fairly balanced spread in both directions.
The diameter (girth) histogram shows a positive skew, with most trees having smaller diameters and fewer trees with larger diameters. The distribution is noticeably pulled to the right.
The volume histogram also displays a positive skew. Although there appears to be a jump in one of the bins, this is likely due to random sampling variation rather than representing a true bimodal pattern, especially considering the small sample size of only 31 trees.
With such a small sample size, it’s important to be cautious about drawing definitive conclusions about the underlying population distributions. If we were to observe more trees, we would likely confirm that both diameter and volume follow positively skewed distributions, while height is more symmetric or slightly negatively skewed.
2.4.4. Identifying Potential Outliers
When examining a histogram, we also want to look for outliers - data points that deviate significantly from the overall pattern of the data. Histograms can help us identify these potential outliers visually.
When R creates a histogram, it automatically determines an appropriate range for the axes to ensure all data points are included in the plot. If there are values far from the majority of the data, the plotting range will be extended to include them, creating visible gaps in the histogram where no observations fall.
Outliers are data points that lie far from the general pattern of the distribution. They are important to investigate because:
They may represent errors in data collection or entry
They could be valid but unusual observations that provide valuable insights
They can significantly influence statistical measures like the mean
They may indicate violations of assumptions in statistical inference procedures
When we spot potential outliers, we should not immediately discard them. Instead, we need to investigate why they differ from the rest of the data. They might represent important characteristics of our population or they might be erroneous entries that need correction.
Example 💡 – Outliers in Histograms

When examining a histogram, gaps in the distribution can signal the presence of potential outliers. In this example histogram, we can see that most of the data is concentrated in the central region, but the plotting range extends much further to include a few distant points.
Notice how there are bins far from the main cluster of data, with gaps (empty bins) in between. These isolated bins represent observations that are separated from the main body of data. These points should be flagged for further investigation.
If we were to zoom in on just the central region of the data, we would get a clearer picture of the main distribution’s shape, but we would miss these important outlying observations. This is why it’s crucial to examine the full range of our data before focusing on the central tendency.
When addressing potential outliers, we should:
Verify that they are not data entry errors or measurement mistakes
Consider whether they represent genuine observations of interest
Assess their impact on our statistical summaries and analysis
Make informed decisions about how to handle them based on our investigation
In subsequent chapters, we’ll learn more formal methods for identifying outliers, but visual inspection through histograms provides an excellent first step in recognizing unusual observations.
2.4.5. Using R and ggplot2 for Histograms
Throughout this chapter, we’ve been using R to create our histograms. Specifically, we’re using a package called ggplot2, which provides a powerful and flexible system for creating statistical graphics.
The ggplot2 package is based on the concept of a “grammar of graphics” - treating graphical components as building blocks that can be combined to create complex visualizations. This approach might feel different if you’re used to other plotting systems, but it becomes intuitive with practice.
The basic structure of a ggplot2 histogram involves:
Loading the ggplot2 library: library(ggplot2)
Creating a plotting frame with data: ggplot(data, aes(x = variable))
Adding a histogram layer: + geom_histogram()
Customizing with additional components: + labs(), + theme(), etc.
Components are connected with the + symbol, building the plot piece by piece:
library(ggplot2)
data(faithful) # Load built-in dataset
ggplot(faithful, aes(x = eruptions)) +
geom_histogram(bins = 10, fill = "skyblue", color = "black") +
labs(title = "Old Faithful Eruption Times",
x = "Eruption duration (minutes)",
y = "Frequency")
This code creates a histogram of eruption durations from the Old Faithful dataset, with 10 bins, sky blue fill, black outlines, and appropriate labels.
We can enhance our histograms with density curves using geom_density():
ggplot(faithful, aes(eruptions)) +
geom_histogram(aes(y = after_stat(density)), bins = 10,
fill = "skyblue", color = "black") +
geom_density(color = "red", linewidth = 1) +
labs(title = "Old Faithful Eruption Times with Density Curve",
x = "Eruption duration (minutes)",
y = "Density")
This approach to creating graphics may take some getting used to, but it provides a consistent and powerful framework for data visualization that we’ll continue to use throughout the course.
2.4.6. Bringing It All Together
When examining a histogram, we should systematically assess these key characteristics:
- Overall pattern:
What is the general shape of the distribution?
Is there one central location (unimodal), two central locations (bimodal), or several (multimodal)?
- Symmetry:
Is the distribution symmetric around a central value?
Is it skewed to the right (positive skew) or left (negative skew)?
- Potential anomalies:
Are there any observations that fall far from the main pattern?
Do these potential outliers need further investigation?
These shape assessments aren’t just academic exercises. They provide crucial guidance for selecting appropriate statistical measures and can reveal important insights about our data. Additionally, when we begin statistical inference later in the course, understanding the shape of our distributions will help us determine whether our assumptions are valid for the statistical procedures we wish to apply.
Key Takeaways 📝
Histograms reveal the shape, center, spread, and potential outliers in our quantitative data.
Multiple peaks (bimodal or multimodal patterns) suggest distinct subgroups that may require separate analysis.
The symmetry or skewness of a distribution affects which statistical methods are most appropriate.
Potential outliers require careful investigation - they may represent errors or important observations.
Understanding distribution shapes helps us choose appropriate statistical tools and avoid misleading conclusions.
Exercises
Shape Classification. For each of the following datasets available in R, create a histogram and classify the distribution as unimodal, bimodal, or multimodal, and as symmetric, right-skewed, or left-skewed: rivers (lengths of major rivers), faithful$waiting (waiting times between eruptions), and cars$dist (stopping distances).
Old Faithful Analysis. Using the faithful dataset, create histograms for both eruption duration and waiting time. Describe the modality and skewness of each. How might these patterns relate to the physical behavior of the geyser?
Outlier Identification. Create a histogram of the trees$Volume data. Are there any potential outliers? How can you tell visually from the histogram? What steps would you take to investigate these observations further?
Distribution Comparison. Generate data from three different distributions: normal (rnorm(100, 50, 10)), uniform (runif(100, 0, 100)), and exponential (rexp(100, 0.05)). Create histograms for each and compare their shapes, noting differences in modality, symmetry, and tail behavior.
Real Data Exploration. Find a dataset of interest to you (either from R’s built-in datasets or an external source) that contains at least one quantitative variable. Create a well-formatted histogram using ggplot2, and write a complete description of the distribution’s shape, including modality, symmetry/skewness, and any potential outliers. What might these characteristics tell you about the underlying process that generated the data?