
2.2. Tools for Categorical (Qualitative) Data

Numbers alone cannot convey how popular a department is, how balanced a survey sample appears, or whether two demographic variables interact. In this lesson, you will learn two plots, pie charts and bar graphs, that turn categorical tallies into instant insights.

Road Map 🧭

Build frequency tables for categorical variables using three different metrics: frequency, relative frequency, and percentage.
Visualize a frequency table with pie charts and bar graphs. Learn when one is preferred over the other.

2.2.1. The distribution of a categorical variable

The first stage of understanding the distribution of a categorical variable is to construct a table listing every category together with its count. We will use the famous 1973 UC Berkeley Graduate Admissions data set for illustration. This data set ships with R as part of the built-in datasets package, so no download is needed. Run:

# Load required packages

# If not installed already, install the package first by running
# install.packages("(package_name)")
# e.g. if ggplot2 is not installed, run
# install.packages("ggplot2")

library(ggplot2)
library(dplyr)
library(scales)

# Load built‑in data
data("UCBAdmissions")
df <- as.data.frame(UCBAdmissions)  %>% arrange(Dept, desc(Gender))

View(df)

Important Note

Each code block must be run AFTER any previously presented code blocks in the same section. If you want to copy and paste the whole code as a single chunk, go to the appendix at the bottom of the page.

The first rows look like the following:

Table 2.1 1973 UC Berkeley Graduate Admissions Data
Row	Admit	Gender	Dept	Freq
1	Admitted	Male	A	512
2	Rejected	Male	A	313
3	Admitted	Female	A	89
4	Rejected	Female	A	19
5	Admitted	Male	B	353
\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)

The combination of the first three columns shows the distinct categories that an observation can belong to. In this dataset, there are

two admission statuses (“Admitted” and “Rejected”),

two genders (“Male” and “Female”), and

six departments (“A” through “F”).

Therefore, the dataset has a total of \(2 \times 2 \times 6 = 24\) different categories. These categories are also called classes or labels.
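As a quick check on this count, the sketch below inspects the built-in table directly. It uses only base R; the variable names `n_levels` and `n_classes` are illustrative choices and do not appear elsewhere in the lesson.

```r
# UCBAdmissions is a 3-way contingency table: Admit x Gender x Dept
n_levels <- dim(UCBAdmissions)                      # number of levels per variable
names(n_levels) <- names(dimnames(UCBAdmissions))   # label each count
print(n_levels)

# Number of distinct classes = product of the numbers of levels
n_classes <- prod(n_levels)
print(n_classes)

# The data-frame form has one row per class
print(nrow(as.data.frame(UCBAdmissions)))
```

Both the product of the level counts and the number of rows should come out as 24.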

The frequencies in the rightmost column of Table 2.1 show the counts for each category. However, frequencies alone make it difficult to assess the relative size of each category. For example:

Does the class of [“Admitted”, “Male”, “Dept A”] make up a large proportion of the entire data set?

Is the class of [“Rejected”, “Female”, “Dept A”] one of the smallest?

For an objective picture, we must take the total count into consideration. We use two new metrics for this purpose:

Relative frequency (proportion): The fraction of the count out of the total, computed by

\[\text{relative frequency} = \dfrac{\text{frequency}}{\text{total count}}\]

Percentage: The relative frequency multiplied by 100%, computed by

\[\text{percentage} = \dfrac{\text{frequency}}{\text{total count}} \times 100\%\]
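To make the two formulas concrete, here is a small sketch that applies them to the two classes asked about above. The counts are read from the built-in table (they match Table 2.1), and the total is obtained by summing every cell. The variable names are illustrative only.

```r
total <- sum(UCBAdmissions)   # total number of applicants in the data set

# Counts for the two classes, read directly from the built-in table
freq <- c(
  admitted_male_A   = UCBAdmissions["Admitted", "Male",   "A"],
  rejected_female_A = UCBAdmissions["Rejected", "Female", "A"]
)

rel_freq <- freq / total      # relative frequency = frequency / total count
perc     <- rel_freq * 100    # percentage = relative frequency * 100

print(total)
print(round(rel_freq, 4))
print(round(perc, 2))
```

With a total of 4526 applicants, the first class holds roughly 11.3% of the data set and the second about 0.42%.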

Using the new metrics, let us create a full frequency table.

#Create the column of relative frequency
df$Rel_Freq <- df$Freq / sum(df$Freq)
#Create the column of percentage
df$Perc <- df$Rel_Freq * 100
View(df)

Now we see an extended table:

Table 2.2 Extended 1973 UC Berkeley Admissions Data
Row	Admit	Gender	Dept	Freq	Rel_Freq	Perc
1	Admitted	Male	A	512	0.113	11.3
2	Rejected	Male	A	313	0.069	6.9
\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)

It is also possible to create extended frequency tables for various combinations of the three individual categorical variables. Let’s try creating a table displaying the counts of admitted students, categorized by departments.

# Take the subset of the data which only involves "Admitted" category.
admitted <- df[df$Admit == "Admitted", ]

# Frequency table of admitted students by department
df_by_dept <- admitted %>% group_by(Dept) %>% summarise(Freq=sum(Freq))
df_by_dept$Rel_Freq <- df_by_dept$Freq / sum(df_by_dept$Freq)
df_by_dept$Perc <- df_by_dept$Rel_Freq * 100

Admitted Applicants by Department
Department Label	Frequency	Relative Frequency	Percentage
A	601	0.342	34.2
B	370	0.211	21.1
C	322	0.183	18.3
D	269	0.153	15.3
E	147	0.0838	8.38
F	46	0.0262	2.62

Note that relative frequencies always fall between 0 and 1 and sum to 1. Likewise, the percentages always range from 0 to 100 and sum to 100. They provide a standardized representation of the counts and allow comparisons between different variables that share the same list of categories, even if their totals differ.
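These two properties can be checked numerically. The sketch below rebuilds the admitted-by-department counts with base R only, so it runs without dplyr, and confirms that the proportions sum to 1 and the percentages to 100. `prop.table()` is a base R function that divides each entry by the total; the variable names are illustrative.

```r
# Gender x Dept matrix of admitted applicants, taken from the built-in table
admitted_counts <- UCBAdmissions["Admitted", , ]

# Add the two genders together to get one count per department
freq_by_dept <- colSums(admitted_counts)

rel_freq_by_dept <- prop.table(freq_by_dept)   # each count / sum of counts
perc_by_dept     <- 100 * rel_freq_by_dept

print(freq_by_dept)
print(round(rel_freq_by_dept, 3))

# All three checks should print TRUE
print(all(rel_freq_by_dept >= 0 & rel_freq_by_dept <= 1))
print(isTRUE(all.equal(sum(rel_freq_by_dept), 1)))
print(isTRUE(all.equal(sum(perc_by_dept), 100)))
```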

2.2.2. Pie charts

A pie chart represents a categorical variable as a sliced circle, where the slices are sized proportionally to the counts, relative frequencies, or percentages. Note that the outcome will be visually identical regardless of the chosen metric.
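The reason the metric does not matter is that each slice's central angle depends only on the category's share of the total. The sketch below computes the angles three ways for the admission status of female applicants; it uses base R only, and the variable names are illustrative.

```r
# Admission counts for female applicants, summed over departments
counts   <- rowSums(UCBAdmissions[, "Female", ])
rel_freq <- counts / sum(counts)
perc     <- 100 * rel_freq

# Central angle of each slice (in degrees), computed from each metric
angle_from_counts   <- 360 * counts   / sum(counts)
angle_from_rel_freq <- 360 * rel_freq / sum(rel_freq)
angle_from_perc     <- 360 * perc     / sum(perc)

print(round(angle_from_counts, 1))

# Both comparisons should print TRUE: the angles are identical
print(isTRUE(all.equal(angle_from_counts, angle_from_rel_freq)))
print(isTRUE(all.equal(angle_from_counts, angle_from_perc)))
```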

Pie charts are best when you need to emphasize that the categories make up a complete whole, and when your main goal is to compare the relative sizes of the categories within a single dataset.

Let us draw the pie charts of the admission status variable, for each gender. We begin by creating the corresponding extended frequency tables:

# Only the code for the female case is shown for conciseness. Try creating
# the code for the other case using this as a template.

# Take the subset of the data which only involves the "Female" category.
female <- df[df$Gender == "Female", ]

# Frequency table of admission status for female applicants
admit_female <- female %>% group_by(Admit) %>% summarise(Freq=sum(Freq))
admit_female$Rel_Freq <- admit_female$Freq / sum(admit_female$Freq)
admit_female$Perc <- admit_female$Rel_Freq * 100

Admission for Female Applicants
Label	Frequency	Relative Frequency	Percentage
Admitted	557	0.304	30.4
Rejected	1278	0.696	69.6

Admission for Male Applicants
Label	Frequency	Relative Frequency	Percentage
Admitted	1198	0.445	44.5
Rejected	1493	0.555	55.5
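As a cross-check on the two tables above, the same numbers can be obtained in one step with base R. This is an alternative route, not the lesson's dplyr approach: `margin.table()` collapses the built-in table over departments, and `prop.table(..., margin = 2)` computes proportions within each gender. The variable names are illustrative.

```r
# Collapse over departments: Admit x Gender table of counts
admit_by_gender <- margin.table(UCBAdmissions, margin = c(1, 2))
print(admit_by_gender)

# Proportions within each gender (each column sums to 1)
rel_freq_by_gender <- prop.table(admit_by_gender, margin = 2)
print(round(rel_freq_by_gender, 3))

# The same proportions expressed as percentages
print(round(100 * rel_freq_by_gender, 1))
```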

Using the tables above, create pie charts:

# Pie chart for female applicants
ggplot(admit_female, aes(x = "", y = Freq, fill = Admit)) +
  # linewidth sets the border thickness (ggplot2 >= 3.4.0; older versions use size)
  geom_bar(stat = "identity", width = 1, colour = "black", linewidth = 1.25) +
  coord_polar(theta = "y", start = 0) +
  geom_text(aes(label = percent(Rel_Freq)),
            position = position_stack(vjust = 0.5), size = 5) +
  theme_void() +
  ggtitle("Acceptance Rate of Female Applicants")

Fig. 2.3 1973 UC Berkeley graduate admissions, by gender

Pie charts are effective for an intuitive presentation of the variable composition, especially when there are only a few categories or when the imbalance among the proportions needs to be emphasized.

The pie charts in Fig. 2.3 display the distributions of graduate admissions for female and male applicants at UC Berkeley in 1973. They seem to suggest that the likelihood of acceptance differed substantially between genders: about 44.5% of male applicants were admitted, compared with about 30.4% of female applicants. We now proceed to the next section to explore this from another perspective.

2.2.3. Bar graphs

A bar graph draws one bar per category, with the height of each bar proportional to the value for that category. The heights may represent counts, relative frequencies, or percentages.

Bar graphs offer several advantages over pie charts:

Pie charts lose their simplicity when there are more than a few categories. In contrast, bar graphs handle many categories more effectively.
Bar graphs allow more precise comparisons of relative sizes, especially when frequencies are similar.
When observations can belong to multiple categories, it is incorrect to suggest that the frequencies form a whole, since their total may exceed 100%. In such cases, bar graphs are more appropriate, as they do not imply that the parts sum to a whole.
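The last point can be illustrated with a small made-up example. The data below are hypothetical and are not part of the Berkeley data set: five survey respondents each list every language they use, so one respondent can count toward several categories. The sketch uses base R only.

```r
# HYPOTHETICAL multi-select survey responses: one list element per respondent
responses <- list(
  c("R", "Python"),
  c("R"),
  c("Python", "SQL"),
  c("R", "Python", "SQL"),
  c("SQL")
)
n_respondents <- length(responses)

# Count how many respondents selected each language
counts <- table(unlist(responses))

# Percentage of respondents per language; these need not sum to 100
perc <- 100 * counts / n_respondents
print(perc)
print(sum(perc))

# A bar graph shows each percentage without implying parts of a whole
barplot(perc, ylab = "Percentage of respondents", ylim = c(0, 100))
```

Here each language is used by 60% of respondents, so the percentages total 180%. A pie chart of these values would misrepresent them as shares of a single whole.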

To demonstrate the strength of bar graphs in handling many categories, let us plot Table 2.1, which contains 24 different categories.

df$Dept_Gender <- paste(df$Dept, df$Gender, sep="-")

ggplot(df, aes(x = Dept_Gender, y = Freq, fill = Admit)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +
  theme_minimal() +
  labs(title = "Dodged bar graph of frequencies",
       x = "Department-Gender",
       y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))

Fig. 2.4 Bar graph of frequencies, UC Berkeley data set

Contrary to our first impression from the pie charts (Fig. 2.3), we begin to suspect that the acceptance rates are comparable between the two genders within each department.

To dig deeper into our suspicion, let us draw another bar graph, where each bar has a height corresponding to the relative frequency of an admission result within a single department, for a single gender. In addition, we will stack the bars so that the composition of “Admitted” vs “Rejected” is emphasized within each Department-Gender category.

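# Attach the total number of applicants in each Department-Gender group to
# every row, then divide. group_by() + mutate() matches totals to rows by
# group, so the result does not depend on the row order of df.
df <- df %>%
  group_by(Dept_Gender) %>%
  mutate(Dept_Gender_total = sum(Freq),
         Dept_Gender_Rel_Freq = Freq / Dept_Gender_total) %>%
  ungroup()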


ggplot(df, aes(x = Dept_Gender, y = Dept_Gender_Rel_Freq, fill = Admit)) +
  geom_bar(stat = "identity", position="stack", width=0.5) +
  theme_minimal() +
  labs(title = "Stacked bar graph of relative frequencies by Dept and Gender",
      x = "Department-Gender",
      y = "Relative Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))

Fig. 2.5 Bar graph of relative frequencies of “Admitted” vs “Rejected” by Dept-Gender, UC Berkeley data set

Our suspicion is confirmed. Indeed, the two genders have comparable acceptance rates within departments. In four of the six departments (A, B, D, and F), the rate is actually higher for female applicants!
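The same conclusion can be read off numerically. The sketch below uses base R only and works from the built-in table rather than from `df`; `prop.table()` with `margin = c(2, 3)` computes proportions within each Gender-Department cell. The variable names are illustrative.

```r
# Proportion of each admission status within every Gender x Dept cell
within_cell <- prop.table(UCBAdmissions, margin = c(2, 3))

# Keep only the acceptance rates: a Gender x Dept matrix
accept_rate <- within_cell["Admitted", , ]
print(round(accept_rate, 2))

# Departments where the female rate exceeds the male rate
female_higher <- accept_rate["Female", ] > accept_rate["Male", ]
print(names(which(female_higher)))
print(sum(female_higher))
```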

We covered two key techniques in drawing a bar graph through the UC Berkeley example.

Dodging bars side‑by‑side lets us compare groups across categories (Fig. 2.4).
Stacking bars emphasizes composition within each category (Fig. 2.5).
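As a variation on stacking, ggplot2 also offers `position = "fill"`, which stacks the bars and rescales each one to a total height of 1, so the relative frequencies do not have to be computed by hand. The sketch below is an alternative and is not the code used to produce Fig. 2.5. The facet layout (one panel per gender) and the object names `ucb` and `p_fill` are illustrative choices.

```r
library(ggplot2)

# Built-in data in long form: one row per Admit x Gender x Dept class
ucb <- as.data.frame(UCBAdmissions)

p_fill <- ggplot(ucb, aes(x = Dept, y = Freq, fill = Admit)) +
  # position = "fill" stacks the bars and normalizes each one to height 1
  geom_col(position = "fill", width = 0.5) +
  facet_wrap(~ Gender) +
  theme_minimal() +
  labs(title = "Admission status within each department, by gender",
       x = "Department",
       y = "Relative Frequency")

print(p_fill)
```

Placing the two genders in separate panels keeps the department axis short, at the cost of making side-by-side comparison within one department slightly less direct than in Fig. 2.5.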

Remark - What’s behind the contradiction?

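The pie charts and the bar graphs appear to convey conflicting messages, even though they are based on the same data set. This discrepancy arises because the two genders applied to the departments in very different proportions: most applications to Departments A and B, which had the highest acceptance rates, came from men, while women applied mainly to departments with much lower acceptance rates. Aggregating over departments therefore lowers the overall acceptance rate for women, even though their within-department rates are comparable to, and often higher than, those for men.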

This highlights the importance of examining a data set carefully at multiple levels of categorization before drawing conclusions. In fact, this situation illustrates a well-known and frequently occurring statistical phenomenon called Simpson’s Paradox.

Feel free to explore this fascinating topic further on your own!
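One possible starting point for that exploration is to look at where each gender applied. The sketch below uses base R only; the variable names are illustrative, and the choice to single out Departments A and B is ours, made because they turn out to have the highest overall acceptance rates.

```r
# Applicants per Gender x Dept (admitted + rejected)
applicants <- margin.table(UCBAdmissions, margin = c(2, 3))
print(applicants)

# Overall acceptance rate of each department (both genders combined)
dept_rate <- colSums(UCBAdmissions["Admitted", , ]) / colSums(applicants)
print(round(dept_rate, 2))

# Share of each gender's applications that went to Departments A or B,
# the two departments with the highest acceptance rates
share_AB <- rowSums(applicants[, c("A", "B")]) / rowSums(applicants)
print(round(share_AB, 3))
```

Roughly half of the applications from men went to Departments A or B, compared with well under a tenth of the applications from women, which is the imbalance that drives the aggregate gap.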

2.2.4. Pie Chart or Bar Graph?

Choosing between pie charts and bar graphs depends on your data and the story you want to tell:

Bar graphs handle many categories comfortably; a pie chart with more than five slices becomes hard to read.
Exact comparisons across categories are easier in a bar graph because the common baseline (zero) guides the eye.
Bar graphs work well when observations can belong to multiple categories.
If your takeaway is “X accounts for one‑third of the total,” a pie slice delivers that message immediately.
Pie charts emphasize the part-to-whole relationship and are ideal when your data represents 100% of something.

2.2.5. Bringing It All Together

Key Takeaways 📝

The distribution of a categorical variable is first organized into a table of categories (also called labels or classes) with their counts, proportions, or percentages.
Pie charts emphasize part-to-whole relationships; bar graphs emphasize category comparisons.
Choose dodged or stacked bar graphs based on the message you want to convey. Dodged bar graphs allow precise comparison of heights; stacked bar graphs focus on showing the composition within a category.
Examine categorical data from multiple perspectives to avoid misleading interpretations.

Appendix: All Code in One Stack

# Load required packages
# If not installed already, install the package first by running
# install.packages("(package_name)")
# e.g. if ggplot2 is not installed, run
# install.packages("ggplot2")

library(ggplot2)
library(dplyr)
library(scales)

# Load built‑in data
data("UCBAdmissions")
df <- as.data.frame(UCBAdmissions) %>% arrange(Dept, desc(Gender))

#View(df)

###########
#Create the column of relative frequency
df$Rel_Freq <- df$Freq / sum(df$Freq)
#Create the column of percentage
df$Perc <- df$Rel_Freq * 100
View(df)

###########
# Take the subset of the data which only involves "Admitted" category.
admitted <- df[df$Admit == "Admitted", ]

# Frequency table of admitted student by department
df_by_dept <- admitted %>% group_by(Dept) %>% summarise(Freq=sum(Freq))
df_by_dept$Rel_Freq <- df_by_dept$Freq / sum(df_by_dept$Freq)
df_by_dept$Perc <- df_by_dept$Rel_Freq * 100

###########
# Only the code for the female case is shown for conciseness. Try creating
# the code for the other case using this as a template.

# Take the subset of the data which only involves "Female" category.
female <- df[df$Gender == "Female", ]

# Frequency table of admission status for female applicants
admit_female <- female %>% group_by(Admit) %>% summarise(Freq=sum(Freq))
admit_female$Rel_Freq <- admit_female$Freq / sum(admit_female$Freq)
admit_female$Perc <- admit_female$Rel_Freq * 100

###########
# Pie chart for female applicants
ggplot(admit_female, aes(x = "", y = Freq, fill = Admit)) +
  # linewidth sets the border thickness (ggplot2 >= 3.4.0; older versions use size)
  geom_bar(stat = "identity", width = 1, colour = "black", linewidth = 1.25) +
  coord_polar(theta = "y", start = 0) +
  geom_text(aes(label = percent(Rel_Freq)),
            position = position_stack(vjust = 0.5), size = 5) +
  theme_void() +
  ggtitle("Acceptance Rate of Female Applicants")

############
# Bar graph of frequencies
df$Dept_Gender <- paste(df$Dept, df$Gender, sep="-")

ggplot(df, aes(x = Dept_Gender, y = Freq, fill = Admit)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +
  theme_minimal() +
  labs(title = "Dodged bar graph of frequencies",
       x = "Department-Gender",
       y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))

#################
# Stacked bar graph of relative frequencies
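# Attach the total number of applicants in each Department-Gender group to
# every row, then divide. group_by() + mutate() matches totals to rows by
# group, so the result does not depend on the row order of df.
df <- df %>%
  group_by(Dept_Gender) %>%
  mutate(Dept_Gender_total = sum(Freq),
         Dept_Gender_Rel_Freq = Freq / Dept_Gender_total) %>%
  ungroup()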


ggplot(df, aes(x = Dept_Gender, y = Dept_Gender_Rel_Freq, fill = Admit)) +
  geom_bar(stat = "identity", position="stack", width=0.5) +
  theme_minimal() +
  labs(title = "Stacked bar graph of relative frequencies by Dept and Gender",
      x = "Department-Gender",
      y = "Relative Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))