2.3. Tools for Numerical (Quantitative) Data
Numerical data offers richer visualization possibilities than categorical data because it contains information about the distances between values. We are now interested in the overall shape of the distribution, where the numbers are clustered, and how far they spread. A histogram answers all three at a glance.
Road Map 🧭
Visualize numerical variables with histograms.
Understand the impact of choosing different numbers of classes/bins for a histogram.
Recognize the difference between a bar graph and a histogram.
2.3.1. Building Your First Histogram
A histogram first divides the number line spanning the range of a variable into adjacent intervals of equal width. Over each interval, a bar is drawn whose height equals the count (or relative frequency) of the data points belonging to the interval. Each interval is called a bin, and it is up to the user to select how many bins a histogram uses.
Let us begin our exploration of histograms with a simple example based on InsectSprays, a data set built into R. First load the data set by running:
library(ggplot2)
data(InsectSprays)
View(InsectSprays)
The code will open a separate window of the complete table. The first few rows are:
(index) | count | spray |
---|---|---|
1 | 10 | A |
2 | 7 | A |
\(\vdots\) | \(\vdots\) | \(\vdots\) |
This data set reports the insect counts in agricultural experimental units treated with six different insecticides. There are 72 rows in the data set.
Use ggplot2 to draw a histogram:
# rule of thumb: round(sqrt(n)) + 2 bins
n_obs <- nrow(InsectSprays) #72
n_bins <- round(sqrt(n_obs)) + 2
ggplot(InsectSprays, aes(x = count)) +
  geom_histogram(bins = n_bins, colour = "black", fill = "skyblue", linewidth = 1.2) +
  labs(title = "Distribution of insect counts (Beall, 1942)",
       x = "Number of insects", y = "Frequency")

Fig. 2.6 Histogram of InsectSprays dataset
We make a number of observations:
Each bar represents the number of observations falling within a range of insect counts.
This histogram uses 10 bins. This means that the histogram divided the data’s range into 10 intervals of equal length.
The histogram does a good job of describing the overall distribution of the data, while not being overly detailed.
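To make the definition concrete, the binning can be reproduced by hand. The following is a minimal sketch in base R, assuming 10 equal-width intervals that span exactly from the smallest to the largest count; ggplot2 may place its bin edges slightly differently, so the numbers need not match Fig. 2.6 bar for bar.

```r
# Bin the insect counts by hand, following the definition of a histogram.
x <- InsectSprays$count                  # built-in data set, 72 values

n_bins <- 10
# 11 equally spaced edges produce 10 adjacent intervals covering the range
edges <- seq(min(x), max(x), length.out = n_bins + 1)

# Assign each value to an interval; include.lowest keeps the minimum in bin 1
bin_of <- cut(x, breaks = edges, include.lowest = TRUE)

counts   <- table(bin_of)                # bar heights as frequencies
rel_freq <- counts / length(x)           # bar heights as relative frequencies

print(counts)
print(round(rel_freq, 3))
```

Every observation falls into exactly one interval, so the frequencies add up to the number of rows and the relative frequencies add up to 1.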
2.3.2. Determining the Number of Bins
How Does the Bin Count Change a Histogram?
Although we are free to choose any number of bins for a histogram, this choice significantly affects the quality of data representation. To illustrate this impact, we plot the same data set four times using different bin counts. The data used in this example can be downloaded here: furnace.csv (if a new browser tab opens instead, press Ctrl/Cmd+S to save the file).
library(ggplot2)
furnace <- read.csv("furnace.csv") # replace "furnace.csv" with the path to your downloaded file
# draw one histogram per candidate bin count
for (bins in c(6, 10, 15, 30)) {
  p <- ggplot(furnace, aes(Consumption)) +
    geom_histogram(bins = bins, colour = "black", fill = "darkgreen") +
    labs(title = paste(bins, "bins"), x = "BTU", y = "Frequency")
  print(p) # ggplot objects must be printed explicitly inside a loop
}

Fig. 2.7 Furnace BTU histograms with different numbers of bins
We observe a clear trend in Fig. 2.7:
6 bins: Oversimplifies the data, hiding important features
10 bins: Balances detail and clarity, revealing the general, slightly right-skewed shape
15 bins: Shows more granular structure and begins to display some potentially random fluctuation
30 bins: Too detailed, resulting in a jagged appearance dominated by sampling variability
In this case, ten to fifteen bins reveal the overall trend without drowning the eye in high-frequency jitter.
The Rule of Thumb
Bin count is a Goldilocks choice: with too few bins, the histogram hides detail; with too many, noise makes the trend hard to see. There is no single correct choice. We usually test a few candidate values in a reasonable range, then make a final choice based on how the resulting histograms look.
The rule of thumb for a starting point (not the final correct answer) is as follows:
Find how many rows there are in the data set. Denote the row count with \(n\).
Compute
\[b = \bigl\lceil \sqrt{n} \bigr\rceil + 2.\]
Here, \(\lceil\cdot\rceil\) denotes the ceiling function, which outputs the smallest integer greater than or equal to the input (for example, \(\lceil 3.56\rceil = 4\)).
If \(b > 30\), start your tests with a value between \(20\) and \(30\). More than 20 to 30 bins is usually too many, even if the data set is very large.
If \(b \leq 30\), start your tests at \(b\).
Find the “best-looking” histogram among your candidates. You may use the furnace example (Fig. 2.7) as a guideline.
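The first steps of the rule can be sketched as a small function. The name `suggest_bins` and its `cap` argument are hypothetical helpers introduced here for illustration, not part of R or ggplot2. Returning the cap itself when \(b > 30\) is also an assumption, since the rule only asks for a starting value between 20 and 30.

```r
# Hypothetical helper: starting bin count from the rule of thumb.
suggest_bins <- function(n, cap = 30) {
  stopifnot(is.numeric(n), length(n) == 1, n >= 1)
  b <- ceiling(sqrt(n)) + 2   # b = ceiling(sqrt(n)) + 2
  # Assumption: for large n, start at the cap (any value in 20-30 would do)
  min(b, cap)
}

# Starting point for the built-in InsectSprays data
print(suggest_bins(nrow(InsectSprays)))

# A very large data set is capped rather than given dozens of bins
print(suggest_bins(5000))
```

The plotting code in this section calls `round()` instead of `ceiling()`, which can give a starting value one smaller. Either way, the result is only a starting point for the visual comparison in the last step.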
Example 💡: Do the Previous Examples Follow the Rule of Thumb?
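The InsectSprays data set has \(n=72\) rows. Since \(\sqrt{72}\approx 8.49\), the rule of thumb gives \(b=\lceil \sqrt{72} \rceil + 2 = 9 + 2 = 11\). Fig. 2.6 uses 10 bins because its code rounds \(\sqrt{n}\) to the nearest integer (round(sqrt(n)) + 2) instead of rounding up. The two versions differ by at most one bin, which is well within the range we would test anyway.
There are \(n=90\) observations in the furnace data set (Fig. 2.7). Since \(\sqrt{90}\approx 9.49\), this gives \(b=\lceil \sqrt{90} \rceil + 2 = 10 + 2 = 12\). This result agrees with our previous conclusion that 10 to 15 bins best represent the data.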
2.3.3. Enhancing Histograms – Density & Normal Overlay
Normality
We will learn in later chapters that normality is a desirable characteristic in a data set because it allows us to use a range of useful theoretical tools. Normality is usually characterized by a bell shape in the data distribution: unimodal, symmetric, and with tails that taper off at an appropriate rate.
Assessing Normality
Whenever we draw a histogram, we also want to judge how close to normal the data is. For this purpose, we overlay two additional curves:
A smooth curve outlining the data’s shape, called the kernel density estimate (red)
A normal curve with the same mean and standard deviation as the data (blue)
library(ggplot2)
furnace <- read.csv("furnace.csv") # replace with your path
xbar <- mean(furnace$Consumption) # center of the normal curve
s <- sd(furnace$Consumption) # width of the normal curve
bins <- round(sqrt(nrow(furnace))) + 2
ggplot(furnace, aes(Consumption)) +
  # plot densities instead of counts so the bars and curves share one scale
  geom_histogram(aes(y = after_stat(density)), bins = bins,
                 fill = "lightblue", colour = "black") +
  geom_density(colour = "red", linewidth = 1.2) +
  stat_function(fun = dnorm, args = list(mean = xbar, sd = s),
                colour = "blue", linewidth = 1.2) +
  labs(title = "Residential furnace energy consumption",
       y = "Density", x = "BTU")

Fig. 2.8 Histogram with a kernel density (red) and a normal curve (blue)
Based on how close the two curves are, we assess whether the data is approximately normal or deviates from normality. Although the curves are not part of a histogram by definition, we will always include them in this course whenever a histogram is drawn.
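The overlay code plots the histogram on the density scale rather than the count scale. The sketch below, which uses base R and the built-in InsectSprays counts as a stand-in for the furnace data, suggests why: density curves enclose a total area of 1, and only the density-scaled bars do the same.

```r
# Compare the total bar area on the count scale and on the density scale.
x <- InsectSprays$count                      # built-in data set, 72 values

# plot = FALSE returns the bin edges and heights without drawing anything
h <- hist(x, plot = FALSE)

bin_width    <- diff(h$breaks)               # width of each bin
area_counts  <- sum(h$counts  * bin_width)   # bar area when heights are counts
area_density <- sum(h$density * bin_width)   # bar area when heights are densities

print(c(counts = area_counts, density = area_density))
```

With count heights, the bars would tower over any density curve drawn on the same axes, so the comparison between the histogram and the two curves only makes sense after rescaling.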
2.3.4. Bar Graph or Histogram?
For discrete numerical data with few unique values, a bar graph may look almost identical to a histogram. In some cases, the data can even be treated as ordinal categorical data and displayed with a bar graph. Fig. 2.9 is one such example:

Fig. 2.9 Numerical variable with a small set of possible values and its bar graph
In general, however, bar graphs and histograms differ clearly in their usage and appearance.
Comparison of Bar Graphs and Histograms

Feature | Bar graphs | Histograms |
---|---|---|
Variable type | Categorical variables; numerical variables with few possible values are sometimes converted to a categorical variable | Numerical variables, especially with many different possible values |
Marks on \(x\)-axis | All points contributing to a bar have the same value. The center of the bar is marked with this value. | All points contributing to a bar belong to the same interval but may have different values. Either, |
Gaps between bars | There are no in-between values among categories. There may be gaps between the bars to reflect this. | The intervals are always adjacent. Gaps indicate absence of data points in the corresponding interval(s). |
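The contrast in the table can be checked numerically. This is a minimal sketch in base R using the built-in InsectSprays data; the choice of 5 bins is arbitrary and only serves the illustration.

```r
# Bar graph: one bar per distinct category, so every point in a bar shares a value
bar_heights <- table(InsectSprays$spray)

# Histogram: one bar per interval, which may pool several distinct values
x <- InsectSprays$count
edges  <- seq(min(x), max(x), length.out = 6)   # 6 edges give 5 equal-width bins
bin_of <- cut(x, breaks = edges, include.lowest = TRUE)
hist_heights <- table(bin_of)

# Number of different insect counts that land in each histogram bar
distinct_per_bin <- tapply(x, bin_of, function(v) length(unique(v)))

print(bar_heights)
print(hist_heights)
print(distinct_per_bin)
```

Each spray bar collects rows with one identical label, while each histogram bar groups several different counts that happen to fall in the same interval.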
2.3.5. Bringing It All Together
Key Takeaways 📝
Histograms turn numbers into a shape.
Use \(\lceil \sqrt{n} \rceil + 2\) bins as a starting point, then adjust by eye.
Overlay a normal curve and a kernel density curve to assess deviation from normality.