2.3. Tools for Numerical (Quantitative) Data
Numerical data offers richer visualization possibilities than categorical data because it contains information about the distances between values. We are now interested in the overall shape of the distribution, where the numbers are clustered, and how far they spread. A histogram answers all three at a glance.
Road Map 🧭
Visualize numerical variables with histograms.
Understand the impact of choosing different numbers of classes/bins for a histogram.
Recognize the difference between a bar graph and a histogram.
2.3.1. Building Your First Histogram
A histogram first divides the number line spanning the range of a variable into adjacent intervals of equal width. Over each interval, a bar is drawn whose height equals the count (or relative frequency) of the data points belonging to the interval. Each interval is called a bin, and it is up to the user to select how many bins a histogram uses.
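To make this concrete, here is a minimal base-R sketch (using a made-up sample) that carries out the two steps by hand: split the range into equal-width intervals, then count the data points in each.
# Made-up sample of 12 measurements
x <- c(2.1, 3.4, 3.9, 4.2, 4.6, 5.0, 5.1, 5.8, 6.3, 6.9, 7.4, 8.8)
# Step 1: divide the range into 4 adjacent intervals of equal width
breaks <- seq(min(x), max(x), length.out = 5)
# Step 2: count how many data points fall into each bin
table(cut(x, breaks, include.lowest = TRUE))
The bar over each interval is then drawn with height equal to these counts.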
Let us begin our exploration of histograms with a simple example based on
a built-in data set in R, InsectSprays. First import the data set by running:
library(ggplot2)
data(InsectSprays)
View(InsectSprays)
The code will open a separate window showing the complete table. The first few rows are:
| (index) | count | spray |
|---|---|---|
| 1 | 10 | A |
| 2 | 7 | A |
| \(\vdots\) | \(\vdots\) | \(\vdots\) |
This data set reports the insect counts in agricultural experimental units treated with six different insecticides. There are 72 rows in the data set.
Use ggplot2 to draw a histogram:
# rule of thumb: max(round(sqrt(n)) + 2, 5) bins
n_obs <- nrow(InsectSprays)  # 72
n_bins <- max(round(sqrt(n_obs)) + 2, 5)
ggplot(InsectSprays, aes(x = count)) +
geom_histogram(bins = n_bins, colour = "black", fill = "skyblue", linewidth = 1.2) +
labs(title = "Distribution of insect counts (Beall, 1942)",
x = "Number of insects", y = "Frequency")
Fig. 2.13 Histogram of InsectSprays dataset
We make a number of observations:
Each bar represents the number of observations falling within a range of insect counts.
This histogram uses 10 bins. This means that the histogram divided the data’s range into 10 intervals of equal length.
The histogram does a good job of describing the overall distribution of the data, while not being overly detailed.
2.3.2. Determining the Number of Bins
How Does the Bin Count Change a Histogram?
Although we are free to choose any number of bins for a histogram,
this choice significantly affects the quality of data representation.
To illustrate this impact, we plot the same data set
four times using different bin counts. The data used in this example can be
downloaded here: furnace.csv (if a new
browser tab opens instead, press Ctrl/Cmd+S to save the file).
library(ggplot2)
furnace <- read.csv("furnace.csv") # replace "furnace.csv" with your file location
for (bins in c(6, 10, 15, 30)) {
p <- ggplot(furnace, aes(Consumption)) +
geom_histogram(bins = bins, colour = "black", fill = "darkgreen") +
labs(title = paste(bins, "bins"), x = "BTU", y = "Frequency")
print(p)
}
Fig. 2.14 Furnace BTU histograms with different numbers of bins
We observe a clear trend in Fig. 2.14:
6 bins: Oversimplifies the data, hiding important features
10 bins: Balances detail and clarity, revealing the general, slightly right-skewed shape
15 bins: Shows more granular structure and begins to display some potentially random fluctuation
30 bins: Too detailed, resulting in a jagged appearance dominated by sampling variability
In this case, ten to fifteen bins reveal the overall trend without drowning the eye in high-frequency jitter.
The Rule of Thumb
Bin count is a Goldilocks choice: with too few bins, the histogram hides detail, and with too many, noise makes trends difficult to observe. At the same time, there is no single correct choice. We usually test a few candidate values in a reasonable range, then make a final choice based on how the resulting histograms look.
The rule of thumb for a starting point (not the final correct answer) is as follows; a small helper function implementing it is sketched after this list:
Find how many rows there are in the data set. Denote the row count with \(n\).
Compute
\[b = \max(\operatorname{round}(\sqrt{n}) + 2,\, 5).\]
Here, \(\operatorname{round}(\cdot)\) denotes rounding to the nearest integer.
If \(b > 30\), start your tests with a value between \(20\) and \(30\). That is, a bin count above the \(20\)–\(30\) range is often too high, even when the data set is fairly large.
If \(b \leq 30\), start your tests at \(b\).
Find the “best-looking” histogram among your candidates. You may use the furnace example (Fig. 2.14) as a guideline.
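The rule can be packaged as a small R helper. This is only a sketch: the function name and the choice of \(25\) as the large-\(n\) starting value are ours, since the rule only says to start somewhere between \(20\) and \(30\).
# Rule-of-thumb starting bin count; 25 is one pick from the 20-30 range
suggest_bins <- function(n) {
  b <- max(round(sqrt(n)) + 2, 5)
  if (b > 30) b <- 25
  b
}
suggest_bins(72)    # 10, matching the InsectSprays example
suggest_bins(1000)  # 25 rather than 34, since b exceeded 30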
Example 💡: Do the Previous Examples Follow the Rule of Thumb?
The InsectSprays data set has \(n=72\) rows. By the rule of thumb, \(b=\max(\operatorname{round}(\sqrt{72})+2,\, 5) = 10\), which is exactly how many bins were used in Fig. 2.13.
There are \(n=90\) observations in the furnace dataset (Fig. 2.14). This gives \(b=\max(\operatorname{round}(\sqrt{90}) + 2,\, 5) = 11\), which agrees with our previous conclusion that 10 to 15 bins represent the data best.
2.3.3. Enhancing Histograms – Density & Normal Overlay
Normality
We will learn in later chapters that normality is a desirable characteristic in a data set that allows us to apply a range of useful theoretical tools. Normality is usually characterized by a bell shape in the data distribution: unimodal, symmetric, and with tails that taper off at an appropriate rate.
Assessing Normality
Whenever we graph a histogram, one of our goals is to assess how close to normal the data is. For this purpose, we use two additional assisting tools:
A smooth curve outlining the data’s shape, also called the kernel density (red)
A true normal curve that shares the same center and width as the data (blue)
library(ggplot2)
furnace <- read.csv("furnace.csv") # replace with your path
xbar <- mean(furnace$Consumption)
s <- sd(furnace$Consumption)
bins <- max(round(sqrt(nrow(furnace))) + 2, 5)  # rule of thumb
ggplot(furnace, aes(Consumption)) +
geom_histogram(aes(y = after_stat(density)), bins = bins,
fill = "lightblue", colour = "black") +
geom_density(colour = "red", linewidth = 1.2) +
stat_function(fun = dnorm, args = list(mean = xbar, sd = s),
colour = "blue", linewidth = 1.2) +
labs(title = "Residential furnace energy consumption",
y = "Density", x = "BTU")
Fig. 2.15 Histogram with a kernel density (red) and a normal curve (blue)
Based on how close the two curves are, we assess whether the data is sufficiently normal or deviates from normality. Although these overlays are not part of a histogram by definition, we will always include them in this course whenever a histogram is drawn.
2.3.4. Bar Graph or Histogram?
For discrete numerical data with few unique values, a bar graph may not present any major visual difference from a histogram. In some cases, the data can even be treated as ordinal categorical data and displayed with a bar graph. Fig. 2.16 is one such example:
Fig. 2.16 Numerical variable with a small set of possible values and its bar graph
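A bar graph like the one in Fig. 2.16 can be produced by converting the variable to a factor. This is a minimal sketch with a made-up discrete variable (the data behind Fig. 2.16 is not shown here):
library(ggplot2)
# Made-up numerical variable with only five possible values
ratings <- data.frame(stars = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5))
ggplot(ratings, aes(x = factor(stars))) +
  geom_bar(fill = "skyblue", colour = "black") +
  labs(x = "Rating (stars)", y = "Count")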
In general, however, bar graphs and histograms differ clearly in their usage and appearance.
Comparison of Bar Graphs and Histograms

| Feature | Bar graphs | Histograms |
|---|---|---|
| Variable type | Categorical variables; numerical variables with few possible values are sometimes converted to a categorical variable | Numerical variables, especially with many different possible values |
| Marks on \(x\)-axis | All points contributing to a bar have the same value. The center of the bar is marked with this value. | All points contributing to a bar belong to the same interval but may have different values. Either the interval endpoints or the interval midpoints are marked. |
| Gaps between bars | There are no in-between values among categories. There may be gaps between the bars to reflect this. | The intervals are always adjacent. Gaps indicate absence of data points in the corresponding interval(s). |
2.3.5. Bringing It All Together
Key Takeaways 📝
Histograms turn numbers into a shape.
Use \(\max(\operatorname{round}(\sqrt{n})+2,\, 5)\) bins as a starting point, then adjust by eye.
Overlay a normal curve and a smooth trend line to easily assess deviation from normality.
2.3.6. Exercises
These exercises develop your skills in visualizing numerical data using histograms, selecting appropriate bin counts, and assessing normality.
Exercise 1: Determining the Number of Bins
For each dataset below, apply the rule of thumb to determine the starting point for the number of bins in a histogram.
Rule of Thumb Reminder:
\(b = \max(\operatorname{round}(\sqrt{n}) + 2, 5)\)
If \(b > 30\), start testing with values in the 20-30 range.
A dataset of CPU temperatures with n = 49 observations.
A dataset of response times with n = 200 observations.
A dataset of tensile strength measurements with n = 25 observations.
A dataset of network latency values with n = 1000 observations.
A dataset of battery voltages with n = 16 observations.
For each dataset above, if your initial histogram looked too “jagged” (noisy), would you increase or decrease the number of bins? Why?
Solution
Part (a): n = 49
\(b = \max(\operatorname{round}(\sqrt{49}) + 2, 5) = \max(7 + 2, 5) = \max(9, 5) = \mathbf{9}\) bins
Part (b): n = 200
\(\sqrt{200} \approx 14.14\), so \(\operatorname{round}(\sqrt{200}) = 14\)
\(b = \max(14 + 2, 5) = \max(16, 5) = \mathbf{16}\) bins
Part (c): n = 25
\(b = \max(\operatorname{round}(\sqrt{25}) + 2, 5) = \max(5 + 2, 5) = \max(7, 5) = \mathbf{7}\) bins
Part (d): n = 1000
\(\sqrt{1000} \approx 31.62\), so \(\operatorname{round}(\sqrt{1000}) = 32\)
\(b = \max(32 + 2, 5) = \max(34, 5) = 34\)
Since \(b > 30\), we apply the guideline: start testing with 20-30 bins (e.g., start at 25).
Part (e): n = 16
\(b = \max(\operatorname{round}(\sqrt{16}) + 2, 5) = \max(4 + 2, 5) = \max(6, 5) = \mathbf{6}\) bins
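Parts (a)–(e) can be checked with one line of R:
# Apply the rule of thumb to each sample size at once
sapply(c(49, 200, 25, 1000, 16), function(n) max(round(sqrt(n)) + 2, 5))
# [1]  9 16  7 34  6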
Part (f): Adjusting for Jagged Histograms
If the histogram looks too jagged (noisy, with erratic up-and-down patterns), you should decrease the number of bins.
Reasoning:
Too many bins means each bin contains fewer observations
Small counts per bin are more susceptible to random variation
Wider bins (fewer bins) smooth out this noise by aggregating more data points
The goal is to reveal the overall shape/trend without being distracted by sampling variability
Exercise 2: Interpreting Histogram Shapes
A manufacturing engineer collects data on the thickness of semiconductor wafers (in micrometers). Four different histograms are created using different bin counts from the same dataset of n = 100 measurements.
Fig. 2.17 Effect of bin count on histogram appearance (same data, different bins)
| Bins | Description |
|---|---|
| 5 | Shows a single, very wide bar dominating the center with small bars on each end |
| 12 | Shows a clear bell-shaped pattern, symmetric around 500 μm, with smooth transitions |
| 25 | Shows a generally bell-shaped pattern but with noticeable irregularities and some empty bins |
| 50 | Shows an extremely jagged pattern with many bars of height 0, 1, or 2; no clear overall shape |
Which bin count best represents the data? Justify your answer.
What does the rule of thumb suggest for n = 100?
The 5-bin histogram hides important information. What might we miss by using too few bins?
The 50-bin histogram shows many empty bins. Why is this problematic for understanding the distribution?
Based on the 12-bin description, what can you conclude about the normality of the wafer thickness data?
Solution
Part (a): Best Bin Count
12 bins best represents the data.
Justification:
Shows a clear, interpretable shape (bell-shaped, symmetric)
Smooth transitions between adjacent bins suggest genuine pattern, not noise
Neither oversimplified (like 5 bins) nor dominated by random variation (like 50 bins)
Reveals the center (~500 μm) and spread clearly
Part (b): Rule of Thumb for n = 100
\(b = \max(\operatorname{round}(\sqrt{100}) + 2, 5) = \max(10 + 2, 5) = \mathbf{12}\) bins
This matches the optimal choice from Part (a)!
Part (c): What Too Few Bins Hides
With only 5 bins, we might miss:
The symmetry of the distribution (hard to assess with so few bars)
Any secondary modes or unusual features in the distribution
The precise location of the center
The shape of the tails—whether they taper gradually or drop off sharply
Any outliers that might be present
The distribution might appear normal when it actually has skewness or other features.
Part (d): Problems with Many Empty Bins
Empty bins in a 50-bin histogram are problematic because:
They suggest gaps in the data that don’t truly exist—the underlying distribution is continuous
The jagged appearance is caused by sampling variability, not real features
It’s difficult to perceive the overall shape when the eye is distracted by noise
With n = 100 and 50 bins, the average count per bin is only 2—far too few for stable estimates
Part (e): Normality Assessment
Based on the 12-bin description (“clear bell-shaped pattern, symmetric around 500 μm, with smooth transitions”), the wafer thickness data appears to be approximately normal.
Evidence of normality:
Bell-shaped: Characteristic of normal distributions
Symmetric: Normal distributions are perfectly symmetric
Smooth transitions: Suggests the distribution follows a regular pattern without unusual features
A formal assessment would compare the histogram to a normal overlay curve.
Exercise 3: Histogram vs. Bar Graph
Determine whether a histogram or bar graph would be more appropriate for each variable. Justify your choice.
The number of cores in servers at a data center (1, 2, 4, 8, 16, 32, or 64 cores).
The execution time of a sorting algorithm measured in milliseconds (continuous range from 0.5 to 150 ms).
Customer satisfaction ratings (1, 2, 3, 4, or 5 stars).
The weight of packages processed by a shipping facility (ranging from 0.1 kg to 50 kg).
The number of defects found per circuit board (0, 1, 2, 3, … up to 12 in the dataset).
A quality engineer has data on the exact diameter of 500 ball bearings, measured to 0.001 mm precision. Should they use a histogram or bar graph? Explain.
Solution
Part (a): Server Cores
Bar graph is more appropriate.
Only 7 possible values (1, 2, 4, 8, 16, 32, 64)
The values are discrete and represent specific configurations
No meaningful “in-between” values exist (you can’t have 3 cores in this context)
The gaps between values are not uniform (jumps are powers of 2)
Part (b): Sorting Algorithm Execution Time
Histogram is appropriate.
Time is a continuous numerical variable
Many different possible values across a continuous range
We want to see the shape of the distribution (e.g., is it skewed? bimodal?)
Binning into intervals reveals patterns in the overall distribution
Part (c): Customer Satisfaction Ratings
Bar graph is more appropriate.
Only 5 possible values (1, 2, 3, 4, 5)
These are ordinal categories (the intervals between ratings may not be equal)
As discussed in Chapter 2.1, star ratings are typically treated as categorical ordinal
Each bar represents a specific rating level, not an interval
Part (d): Package Weight
Histogram is appropriate.
Weight is a continuous numerical variable
Wide range of possible values (0.1 to 50 kg)
We want to understand the distribution shape (e.g., are most packages light or heavy?)
Binning allows us to see patterns across the continuous range
Part (e): Defects per Circuit Board
Either could work, but context matters.
Bar graph: If we want to show the exact count at each defect level (0, 1, 2, …, 12)
Histogram: If we want to emphasize the overall shape of the distribution
With 13 possible values (0 through 12), this is borderline. A bar graph preserves exact information; a histogram would treat adjacent values as potentially groupable.
Recommendation: Use a bar graph if precision matters; use a histogram if visualizing the overall shape is more important.
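A minimal sketch contrasting the two choices, using a hypothetical vector of defect counts:
library(ggplot2)
# Hypothetical defect counts per circuit board
boards <- data.frame(defects = c(0, 0, 0, 1, 1, 1, 1, 2, 2, 2,
                                 3, 3, 4, 4, 5, 6, 7, 9, 12))
# Bar graph: one bar per exact defect count
ggplot(boards, aes(x = factor(defects))) +
  geom_bar(fill = "skyblue", colour = "black") +
  labs(x = "Defects per board", y = "Count")
# Histogram: adjacent counts may be grouped into a shared bin
ggplot(boards, aes(x = defects)) +
  geom_histogram(bins = 7, fill = "skyblue", colour = "black") +
  labs(x = "Defects per board", y = "Frequency")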
Part (f): Ball Bearing Diameters
Histogram is the clear choice.
Diameter is a continuous measurement
With 0.001 mm precision and 500 bearings, there could be hundreds of unique values
A bar graph with hundreds of bars would be unreadable
Histograms bin the continuous values into manageable intervals
This reveals the distribution shape, center, and spread—essential for quality control
Exercise 4: Reading Histograms with Overlays
A biomedical engineer measures the response time of a blood glucose sensor (in seconds). The data for n = 64 measurements is provided below.
# Blood glucose sensor response times (seconds)
response_time <- c(2.8, 3.1, 3.2, 3.3, 3.4, 3.5, 3.5, 3.6, 3.6, 3.7,
3.7, 3.8, 3.8, 3.8, 3.9, 3.9, 3.9, 4.0, 4.0, 4.0,
4.0, 4.1, 4.1, 4.1, 4.1, 4.2, 4.2, 4.2, 4.3, 4.3,
4.3, 4.4, 4.4, 4.5, 4.5, 4.5, 4.6, 4.6, 4.7, 4.7,
4.8, 4.9, 5.0, 5.0, 5.1, 5.2, 5.3, 5.4, 5.5, 5.6,
5.7, 5.8, 5.9, 6.0, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6,
6.7, 6.8, 6.9, 7.0)
sensor_data <- data.frame(response_time)
Create a histogram with the standard course format: density scaling, kernel density curve (red), and normal curve overlay (blue). Use the rule of thumb for the number of bins.
What is the approximate center (typical value) of the response times? What is the approximate range?
Does the data appear to follow a normal distribution? Explain by comparing the red and blue curves in your histogram.
In what way does the actual distribution differ from a normal distribution?
Why do we use aes(y = after_stat(density)) instead of just counting frequencies when adding these overlay curves?
Solution
Part (a): R Code for Histogram
library(ggplot2)
# Calculate summary statistics
xbar <- mean(sensor_data$response_time)
s <- sd(sensor_data$response_time)
bins <- max(round(sqrt(nrow(sensor_data))) + 2, 5) # = 10
# Create histogram with overlays
ggplot(sensor_data, aes(x = response_time)) +
geom_histogram(aes(y = after_stat(density)),
bins = bins, fill = "grey", col = "black") +
geom_density(col = "red", linewidth = 1) +
stat_function(fun = dnorm,
args = list(mean = xbar, sd = s),
col = "blue", linewidth = 1) +
ggtitle("Blood Glucose Sensor Response Times") +
xlab("Response Time (seconds)") +
ylab("Density") +
theme_minimal()
Number of bins: \(b = \max(\operatorname{round}(\sqrt{64}) + 2, 5) = \max(8 + 2, 5) = 10\)
Fig. 2.18 Histogram of sensor response times showing right-skewed distribution
Part (b): Center and Range
Center: The approximate center is around 4.0-4.5 seconds (where the histogram bars are tallest). The calculated mean is approximately 4.7 seconds.
Range: The range is 7.0 - 2.8 = 4.2 seconds (from minimum to maximum value).
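Both values are easy to verify in R:
mean(sensor_data$response_time)         # 4.6875
diff(range(sensor_data$response_time))  # 4.2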
Part (c): Normality Assessment
No, the data does NOT appear to follow a normal distribution.
Evidence from comparing the curves:
The red curve (kernel density, actual data shape) is asymmetric—it peaks to the left of center and has a longer right tail
The blue curve (normal distribution) is symmetric
The two curves diverge noticeably, especially in the tails
The red curve extends further right than the blue curve would predict
If the data were normal, the red and blue curves would closely overlap.
Part (d): How the Distribution Differs from Normal
The distribution is right-skewed (positively skewed).
Specific differences:
Right tail is longer: The red curve shows more density in the 5.5-7.0 second range than the blue normal curve
Left tail is shorter: Fewer observations on the low end than a normal distribution would predict
Peak is shifted left: The mode (most frequent values) is around 4.0-4.2 seconds, while the mean is pulled higher (~4.7 seconds) by the right tail
Interpretation: While most sensors respond quickly (around 4 seconds), some have unusually long response times, creating the right-skewed pattern.
Part (e): Why Use Density Scaling
We use aes(y = after_stat(density)) to put the histogram on a density scale rather than a frequency (count) scale.
Reasons:
Overlay compatibility: The kernel density (red) and normal curve (blue) are probability density functions—their y-axis is density, not counts
Scale matching: If the histogram showed raw counts (e.g., bars with heights 5, 10, 15), the overlay curves would appear as flat lines near zero (densities are typically small numbers like 0.1, 0.3)
Comparability: Density scaling ensures the histogram bars and overlay curves use the same y-axis scale
Area interpretation: On a density scale, the total area under the histogram equals 1, matching probability density functions
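To see the scale mismatch for yourself, redraw the Part (a) histogram on a count scale, reusing xbar and s from Part (a). The bars now reach double-digit heights while the normal density never exceeds 1, so the blue curve hugs the x-axis:
ggplot(sensor_data, aes(x = response_time)) +
  geom_histogram(bins = 10, fill = "grey", col = "black") +
  stat_function(fun = dnorm, args = list(mean = xbar, sd = s),
                col = "blue", linewidth = 1)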
Exercise 5: Constructing Histograms in R
A mechanical engineer tests the fuel efficiency (in miles per gallon) of 81 vehicles. Write R code to create a properly formatted histogram following course standards.
The data is stored in a data frame called vehicles with a column named mpg.
Calculate the appropriate starting number of bins using the rule of thumb.
Write the complete R code to create a histogram with:
Density scaling on the y-axis
A kernel density curve in red
A normal curve overlay in blue
Appropriate title and axis labels
The engineer runs your code and observes that the red curve has two distinct peaks (bimodal) while the blue curve has only one peak. What does this tell us about the data?
Suggest a possible explanation for why fuel efficiency data might be bimodal. How might the engineer investigate further?
Solution
Part (a): Number of Bins
\(n = 81\)
\(b = \max(\operatorname{round}(\sqrt{81}) + 2, 5) = \max(9 + 2, 5) = \max(11, 5) = \mathbf{11}\) bins
Part (b): Complete R Code
library(ggplot2)
# Calculate summary statistics
xbar <- mean(vehicles$mpg)
s <- sd(vehicles$mpg)
bins <- max(round(sqrt(nrow(vehicles))) + 2, 5) # = 11
# Create histogram with overlays
ggplot(vehicles, aes(x = mpg)) +
geom_histogram(aes(y = after_stat(density)),
bins = bins, fill = "grey", col = "black") +
geom_density(col = "red", linewidth = 1) +
stat_function(fun = dnorm,
args = list(mean = xbar, sd = s),
col = "blue", linewidth = 1) +
ggtitle("Distribution of Vehicle Fuel Efficiency") +
xlab("Fuel Efficiency (mpg)") +
ylab("Density") +
theme_minimal()
Fig. 2.19 Bimodal distribution reveals two subgroups in the data
Part (c): Interpretation of Bimodal Pattern
The bimodal pattern (two peaks in the red curve, one in the blue) tells us:
The data is NOT normally distributed
There appear to be two distinct subgroups in the data
The normal curve (blue) is a poor fit—it assumes one central tendency
Averaging across the two groups produces a misleading “middle” peak
The data likely comes from a mixture of two populations with different typical fuel efficiencies.
Part (d): Possible Explanation and Investigation
Possible explanations for bimodal fuel efficiency:
Vehicle type: One group might be sedans/compact cars (higher mpg), another might be trucks/SUVs (lower mpg)
Engine type: Hybrid/electric vehicles vs. traditional gasoline engines
Transmission: Manual vs. automatic
Model year: Older vs. newer vehicles with different efficiency standards
How to investigate:
Create faceted histograms by a categorical variable (e.g., vehicle type):
ggplot(vehicles, aes(x = mpg)) +
  geom_histogram(aes(y = after_stat(density)), bins = 11) +
  facet_wrap(~ vehicle_type)
Fig. 2.20 Faceted histograms reveal the source of bimodality
Color-code the histogram by group to see if the modes correspond to different categories
If no categorical variable is recorded, consider whether the data should be separated before analysis
Exercise 6: Comprehensive Histogram Analysis
A quality control engineer at an aerospace company measures the tensile strength (in MPa) of 120 titanium alloy specimens. The data is provided below.
# Tensile strength measurements (MPa) - n = 120
tensile_strength <- c(
867, 879, 885, 891, 893, 896, 899, 901, 903, 905,
908, 910, 912, 914, 916, 917, 919, 920, 921, 923,
924, 925, 927, 928, 929, 930, 931, 932, 933, 934,
935, 936, 937, 938, 939, 940, 941, 941, 942, 943,
944, 945, 945, 946, 947, 948, 948, 949, 950, 950,
951, 951, 952, 952, 953, 954, 954, 955, 956, 957,
957, 958, 959, 960, 960, 961, 962, 963, 964, 965,
966, 967, 968, 969, 970, 971, 972, 973, 974, 975,
976, 977, 978, 979, 980, 982, 983, 985, 986, 988,
989, 991, 993, 995, 997, 999, 1001, 1003, 1006, 1008,
1010, 1013, 1015, 1017, 1019, 1021, 1023, 1025, 1028, 1030,
1033, 1036, 1038, 1040, 1042, 1044, 1046, 1048, 1050, 1052
)
titanium_data <- data.frame(tensile_strength)
Create a histogram following course standards (density scaling, red kernel density, blue normal overlay). Use the rule of thumb for bins. What do you observe about the agreement between the red and blue curves?
The engineer notices a few high values above 1040 MPa. Should these automatically be considered outliers? Explain.
Calculate the expected number of bins using the rule of thumb.
If the engineer wanted to determine whether the tensile strength meets the specification of μ = 950 MPa, what type of statistical analysis might be appropriate? (You’ll learn this in later chapters.)
The same engineer tests specimens from a second supplier and creates a faceted histogram (one panel per supplier). What would this visualization allow them to compare?
Solution
Part (a): Histogram and Normality Assessment
library(ggplot2)
# Calculate summary statistics
xbar <- mean(titanium_data$tensile_strength)
s <- sd(titanium_data$tensile_strength)
bins <- max(round(sqrt(nrow(titanium_data))) + 2, 5) # = 13
# Create histogram with overlays
ggplot(titanium_data, aes(x = tensile_strength)) +
geom_histogram(aes(y = after_stat(density)),
bins = bins, fill = "grey", col = "black") +
geom_density(col = "red", linewidth = 1) +
stat_function(fun = dnorm,
args = list(mean = xbar, sd = s),
col = "blue", linewidth = 1) +
ggtitle("Titanium Alloy Tensile Strength") +
xlab("Tensile Strength (MPa)") +
ylab("Density") +
theme_minimal()
Fig. 2.21 Titanium tensile strength showing approximately normal distribution
Observation: The kernel density (red) and normal curve (blue) show reasonably close agreement, indicating the data is approximately normally distributed. Both curves are roughly symmetric and bell-shaped with similar centers around 950-960 MPa.
Part (b): Are the High Values Outliers?
Not automatically. The values above 1040 MPa should not be assumed to be outliers just because they are in the tail.
Reasons:
Normal distributions have tails—some extreme values are expected
With n = 120 observations, we expect some values to be 2+ standard deviations from the mean
The data was constructed to be approximately normal; extreme values are part of that distribution
To properly assess outliers:
Use formal methods like the 1.5×IQR rule (Chapter 2.4)
Check if values are more than 3 standard deviations from the mean
Investigate whether these specimens had unusual manufacturing conditions
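As a preview, the 3-standard-deviation check is a short sketch in R (the 1.5×IQR rule is covered in Chapter 2.4):
xbar <- mean(titanium_data$tensile_strength)
s <- sd(titanium_data$tensile_strength)
# Values more than 3 standard deviations from the mean, if any
titanium_data$tensile_strength[abs(titanium_data$tensile_strength - xbar) > 3 * s]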
Part (c): Expected Number of Bins
\(n = 120\)
\(\sqrt{120} \approx 10.95\), so \(\operatorname{round}(\sqrt{120}) = 11\)
\(b = \max(11 + 2, 5) = \max(13, 5) = \mathbf{13}\) bins
Part (d): Appropriate Statistical Analysis
To determine if the tensile strength meets the specification μ = 950 MPa, the engineer could use:
Confidence interval for the mean (Chapter 9): Construct an interval estimate for the true mean tensile strength and check if 950 MPa falls within it
Hypothesis test (Chapter 10): Test H₀: μ = 950 vs. Hₐ: μ ≠ 950
Since the data appears approximately normal, a one-sample t-test or t-confidence interval would be appropriate (assuming σ is unknown).
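For reference, the corresponding R call would look like this (the details are covered in Chapter 10):
# One-sample t-test of H0: mu = 950 against a two-sided alternative
t.test(titanium_data$tensile_strength, mu = 950)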
Part (e): Faceted Histogram Comparisons
A faceted histogram (one panel per supplier) would allow the engineer to compare:
Center: Do both suppliers produce specimens with similar average tensile strength?
Spread: Is one supplier more consistent (less variable) than the other?
Shape: Are both distributions approximately normal, or does one show skewness or bimodality?
Range: Do the specimens from both suppliers fall within acceptable limits?
Outliers: Does one supplier have more extreme values?
Fig. 2.22 Faceted histograms comparing two suppliers
This visual comparison would inform decisions about supplier quality and consistency before formal statistical tests are performed.
2.3.7. Additional Practice Problems
True/False Questions (1 point each)
A histogram groups data into intervals (bins), while a bar graph displays discrete categories.
Ⓣ or Ⓕ
If a histogram appears too smooth and oversimplified, the solution is to decrease the number of bins.
Ⓣ or Ⓕ
The rule of thumb for the number of bins in a histogram is \(\max(\operatorname{round}(\sqrt{n}) + 2, 5)\).
Ⓣ or Ⓕ
When assessing normality, we want the kernel density curve (red) and the normal curve (blue) to be as different as possible.
Ⓣ or Ⓕ
In a histogram, gaps between bars always indicate missing data in that range.
Ⓣ or Ⓕ
Using
aes(y = after_stat(density))in ggplot2 scales the histogram so that the total area of all bars equals 1.Ⓣ or Ⓕ
Multiple Choice Questions (2 points each)
A dataset has n = 400 observations. Using the rule of thumb, what is the starting point for the number of histogram bins?
Ⓐ 18
Ⓑ 20
Ⓒ 22
Ⓓ 25
An engineer creates a histogram with 50 bins for a dataset of 100 observations. The histogram appears very jagged with many bars of height 0 or 1. What should the engineer do?
Ⓐ Add more bins to show finer detail
Ⓑ Decrease the number of bins to reduce noise
Ⓒ Remove the outliers causing the jagged appearance
Ⓓ Switch to a pie chart for better visualization
In a histogram with kernel density (red) and normal curve (blue) overlays, the red curve is noticeably higher than the blue curve on the left side and lower on the right side. This indicates:
Ⓐ The data is approximately normal
Ⓑ The data is right-skewed (positively skewed)
Ⓒ The data is left-skewed (negatively skewed)
Ⓓ The wrong number of bins was used
Which of the following variables would be LEAST appropriate for a histogram?
Ⓐ The weight of packages at a shipping facility (ranging from 0.5 to 75 kg)
Ⓑ The number of bedrooms in houses (1, 2, 3, 4, or 5)
Ⓒ The duration of phone calls at a customer service center (ranging from 10 seconds to 45 minutes)
Ⓓ The temperature readings from a sensor over 24 hours (recorded every minute)
Answers to Practice Problems
True/False Answers:
True — Histograms bin continuous data into intervals; bar graphs display counts for discrete categories. This is a fundamental distinction between the two visualization types.
False — If a histogram appears too smooth/oversimplified, you should increase the number of bins to reveal more detail. Decreasing bins would make it even more simplified.
True — This is the rule of thumb presented in the chapter for determining a starting point for bin selection.
False — When assessing normality, we want the curves to be as similar as possible. Close agreement between the red (actual shape) and blue (theoretical normal) curves indicates the data is approximately normal.
False — In a histogram with bins covering a continuous range, gaps indicate no observations fell within that interval—not that data is “missing.” Unlike bar graphs where gaps may be stylistic, histogram gaps represent genuine absence of data in those ranges.
True — Density scaling ensures the histogram represents a probability density, where the total area under the histogram equals 1. This allows proper comparison with overlay density curves.
Multiple Choice Answers:
Ⓒ — \(\sqrt{400} = 20\), so \(b = \max(20 + 2, 5) = 22\) bins.
Ⓑ — With only 100 observations spread across 50 bins, each bin averages only 2 data points, leading to high sampling variability (jagged appearance). Reducing the number of bins will smooth out this noise and reveal the underlying distribution shape.
Ⓒ — When the red curve is higher on the LEFT and lower on the RIGHT compared to the symmetric blue normal curve, this indicates left-skew (negatively skewed). The distribution has a longer tail extending to the left. (If the pattern were reversed—higher on right, lower on left—it would be right-skewed.)
Ⓑ — The number of bedrooms has only 5 discrete possible values (1, 2, 3, 4, 5), making it better suited for a bar graph. The other options all represent continuous numerical variables with many possible values, ideal for histograms.