2.2. Tools for Categorical (Qualitative) Data
Numbers alone cannot convey how popular a department is, how balanced a survey sample appears, or whether two demographic variables interact. A simple count table answers “how many,” but only a picture answers “so what?” In this lesson, you will learn two plots, pie charts and bar graphs, that turn categorical tallies into instant insights. We will also practice building the underlying frequency tables so that your R code is always one line away from a clear graph.
Road Map 🧭
Build frequency tables for categorical variables using three different metrics: frequency, relative frequency, and percentage.
Visualize a frequency table with pie charts and bar graphs. Learn when one is preferred over the other.
2.2.1. The distribution of a categorical variable
The first stage of understanding the distribution of a categorical variable is to construct a table listing every category together with its count. We will use the famous 1973 UC Berkeley Graduate Admissions data set for illustration. This data set ships with base R (in the datasets package), so it is available in any R session, including RStudio. Run:
# Load required packages
# If not installed already, install the package first by running
# install.packages("(package_name)")
# e.g. if ggplot2 is not installed, run
# install.packages("ggplot2")
library(ggplot2)
library(dplyr)
library(scales)
# Load built‑in data
data("UCBAdmissions")
df <- as.data.frame(UCBAdmissions) %>% arrange(Dept, desc(Gender))
View(df)
Important Note
Each code block must be run AFTER any previously presented code blocks in the same section. If you want to copy and paste the whole code as a single chunk, go to the appendix at the bottom of the page.
The first few rows will look like the following:

| | Admit | Gender | Dept | Freq |
|---|---|---|---|---|
| 1 | Admitted | Male | A | 512 |
| 2 | Rejected | Male | A | 313 |
| 3 | Admitted | Female | A | 89 |
| 4 | Rejected | Female | A | 19 |
| 5 | Admitted | Male | B | 353 |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
The combination of the first three columns shows the distinct categories that an observation belongs to. In this dataset, there are
two admission statuses (“Admitted” and “Rejected”),
two genders (“Male” and “Female”), and
six departments (“A” through “F”).
Therefore, the dataset has a total of \(2 \times 2 \times 6 = 24\) different categories. These categories are also called classes or labels.
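As a quick check of this count, the number of categories can be read directly off the built-in table. The sketch below uses only base R; the names `levels_per_variable` and `n_categories` are chosen here for illustration and are not part of the lesson's code.

```r
# UCBAdmissions is a 3-dimensional contingency table from the datasets package
levels_per_variable <- sapply(dimnames(UCBAdmissions), length)
levels_per_variable          # Admit = 2, Gender = 2, Dept = 6

# Every combination of levels is one category
n_categories <- prod(levels_per_variable)
n_categories                 # 24

# The long-format data frame has one row per category
nrow(as.data.frame(UCBAdmissions))   # 24
```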
The frequencies in the rightmost column of Table 2.1 show the counts for each category. However, frequencies alone make it difficult to assess the relative size of each category. For example:
Does the class of [“Admitted”, “Male”, “Dept A”] take up a large proportion of the entire set of observations?
Is the class of [“Rejected”, “Female”, “Dept A”] one of the smallest?
For an objective picture, we must take the total counts into consideration. We use two new metrics for this purpose:
Relative frequency (proportion): The fraction of the count out of the total, computed by
\(\text{relative frequency} = \dfrac{\text{frequency}}{\text{total count}}\)
Percentage: The relative frequency multiplied by 100%, computed by
\(\text{percentage} = \dfrac{\text{frequency}}{\text{total count}} \times 100\%\)
Using the new metrics, let us create a full frequency table.
#Create the column of relative frequency
df$Rel_Freq <- df$Freq / sum(df$Freq)
#Create the column of percentage
df$Perc <- df$Rel_Freq * 100
View(df)
Now we see an extended table:
| | Admit | Gender | Dept | Freq | Rel_Freq | Perc |
|---|---|---|---|---|---|---|
| 1 | Admitted | Male | A | 512 | 0.113 | 11.3 |
| 2 | Rejected | Male | A | 313 | 0.069 | 6.9 |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
It is also possible to create extended frequency tables for various combinations of the three individual categorical variables. Let’s try creating a table displaying the counts of admitted students categorized by departments.
# Take the subset of the data which only involves the "Admitted" category.
admitted <- df[df$Admit == "Admitted", ]
# Frequency table of admitted students by department
df_by_dept <- admitted %>% group_by(Dept) %>% summarise(Freq = sum(Freq))
df_by_dept$Rel_Freq <- df_by_dept$Freq / sum(df_by_dept$Freq)
df_by_dept$Perc <- df_by_dept$Rel_Freq * 100
Admitted Applicants by Department

| Department Label | Frequency | Relative Frequency | Percentage |
|---|---|---|---|
| A | 601 | 0.342 | 34.2 |
| B | 370 | 0.211 | 21.1 |
| C | 322 | 0.183 | 18.3 |
| D | 269 | 0.153 | 15.3 |
| E | 147 | 0.0838 | 8.38 |
| F | 46 | 0.0262 | 2.62 |
Note that relative frequencies always fall between 0 and 1 and sum to 1. Likewise, the percentages always range from 0 to 100 and sum to 100. They provide a standardized representation of the counts and allow comparisons between different variables that share the same list of categories, even if their totals differ.
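The same three columns are built every time, so the steps can be wrapped in a small function. The following is a minimal sketch in base R; `freq_table` is a hypothetical helper introduced here for illustration, not a function from the lesson or from any package.

```r
# Hypothetical helper: builds an extended frequency table
# from a named vector of counts.
freq_table <- function(counts) {
  rel_freq <- as.vector(counts) / sum(counts)
  data.frame(
    Label    = names(counts),
    Freq     = as.vector(counts),
    Rel_Freq = rel_freq,
    Perc     = rel_freq * 100
  )
}

# Admitted applicants per department, summed over gender
admitted_by_dept <- colSums(UCBAdmissions["Admitted", , ])
dept_table <- freq_table(admitted_by_dept)
dept_table

# The two properties stated above
range(dept_table$Rel_Freq)   # both ends lie between 0 and 1
sum(dept_table$Rel_Freq)     # 1
sum(dept_table$Perc)         # 100
```

Because the helper takes any named vector of counts, it could be reused for the gender-specific tables later in this section.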
2.2.2. Pie charts
A pie chart represents a categorical variable as a sliced circle, where the slices are sized proportionally to the counts, relative frequencies or percentages. Note that the outcome will be identical regardless of the chosen metric.
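To see why the metric does not matter, note that each slice's central angle is its share of the total times 360 degrees, and that share is the same for all three metrics. The sketch below checks this in base R; `slice_angle` is a hypothetical helper written for this illustration.

```r
# Admitted applicants per department, in all three metrics
freq     <- colSums(UCBAdmissions["Admitted", , ])
rel_freq <- freq / sum(freq)
perc     <- rel_freq * 100

# Central angle of each slice, in degrees: share of the total times 360
slice_angle <- function(x) 360 * x / sum(x)

angles_from_freq <- slice_angle(freq)
angles_from_rel  <- slice_angle(rel_freq)
angles_from_perc <- slice_angle(perc)

round(angles_from_freq, 1)
all.equal(angles_from_freq, angles_from_rel)    # TRUE
all.equal(angles_from_freq, angles_from_perc)   # TRUE
```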
Pie charts are best when you need to emphasize that the categories make up a complete whole, and if your main goal is to compare the relative sizes of the labels within a single dataset.
Let us draw the pie charts of the admission status variable, for each gender. We begin by creating the corresponding extended frequency tables:
# Only the code for the female case is shown for conciseness. Try creating
# the code for the other case using this as a template.
# Take the subset of the data which only involves the "Female" category.
female <- df[df$Gender == "Female", ]
# Frequency table of admission status for female applicants
admit_female <- female %>% group_by(Admit) %>% summarise(Freq=sum(Freq))
admit_female$Rel_Freq <- admit_female$Freq / sum(admit_female$Freq)
admit_female$Perc <- admit_female$Rel_Freq * 100
Admission for Female Applicants

| Label | Frequency | Relative Frequency | Percentage |
|---|---|---|---|
| Admitted | 557 | 0.304 | 30.4 |
| Rejected | 1278 | 0.696 | 69.6 |
Admission for Male Applicants

| Label | Frequency | Relative Frequency | Percentage |
|---|---|---|---|
| Admitted | 1198 | 0.445 | 44.5 |
| Rejected | 1493 | 0.555 | 55.5 |
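As a cross-check on the two tables above, the same numbers can be obtained in one step from the original contingency table. This is a base R sketch that does not rely on the `df` object; `admit_by_gender` and `rel_by_gender` are names chosen here for illustration.

```r
# Admit x Gender counts, summed over departments
admit_by_gender <- margin.table(UCBAdmissions, margin = c(1, 2))
admit_by_gender

# Relative frequencies within each gender (each column sums to 1)
rel_by_gender <- prop.table(admit_by_gender, margin = 2)
round(rel_by_gender, 3)

# Percentages, rounded to one decimal place
round(100 * rel_by_gender, 1)
```

Because the two genders have different totals (1835 and 2691 applicants), it is the relative frequencies, not the raw counts, that make the two columns comparable.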
Using the tables above, create pie charts:
# Pie chart for female applicants
# linewidth sets the slice border thickness (ggplot2 >= 3.4.0;
# older versions use size = 1.25 instead)
ggplot(admit_female, aes(x = "", y = Freq, fill = Admit)) +
  geom_bar(stat = "identity", width = 1, colour = "black", linewidth = 1.25) +
  coord_polar(theta = "y", start = 0) +
  geom_text(aes(label = percent(Rel_Freq)),
            position = position_stack(vjust = 0.5), size = 5) +
  theme_void() +
  ggtitle("Acceptance Rate of Female Applicants")

Fig. 2.2 1973 UC Berkeley graduate admissions, by gender
Pie charts are effective for an intuitive presentation of the variable composition, especially when there are only a few categories or when the imbalance among the proportions needs to be emphasized.
The pie charts in Fig. 2.2 display the distributions of graduate admissions for female and male applicants at UC Berkeley in 1973. They seem to suggest that there was a substantial difference in the likelihood of acceptance between genders. We now proceed to the next section to explore this from another perspective.
2.2.3. Bar graphs
A bar graph draws one bar per category with the height proportional to its frequency. Bars may represent counts, relative frequencies, or percentages.
Bar graphs offer several advantages over pie charts:
Pie charts lose their simplicity when there are more than a few categories. In contrast, bar graphs handle many categories more effectively.
Bar graphs allow more precise comparisons of relative sizes, especially when the frequencies are similar.
When observations can belong to multiple categories, it is incorrect to suggest that the frequencies form a whole, since their total may exceed 100%. In such cases, bar graphs are more appropriate, as they do not imply that the parts sum to a whole.
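The last point can be illustrated with a small multi-select example. The numbers below are entirely made up for this sketch (a hypothetical survey in which each respondent may tick several languages), and base R's `barplot()` is used instead of ggplot2 only to keep the block free of package dependencies.

```r
# Hypothetical multi-select survey: 100 respondents could tick any number
# of programming languages they use. These numbers are made up.
n_respondents <- 100
uses <- c(R = 62, Python = 71, SQL = 48, Julia = 9)

# Percentage of respondents who ticked each language
perc <- 100 * uses / n_respondents
perc
sum(perc)   # 190: well over 100, so the categories are not parts of one whole

# A bar graph shows each percentage against a common baseline
barplot(perc, ylim = c(0, 100), ylab = "Percentage of respondents",
        main = "Languages used (hypothetical data)")
```

A pie chart of these values would silently rescale them to sum to 100%, which would misrepresent the data.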
To demonstrate the strength of bar graphs in handling many categories, let us plot Table 2.1, which contains 24 different categories.
# Combine department and gender into one label, e.g. "A-Male"
df$Dept_Gender <- paste(df$Dept, df$Gender, sep = "-")
ggplot(df, aes(x = Dept_Gender, y = Freq, fill = Admit)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +
  theme_minimal() +
  labs(title = "Dodged bar graph of frequencies",
       x = "Department-Gender",
       y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))

Fig. 2.3 Bar graph of frequencies, UC Berkeley data set
Contrary to our first impression from the pie charts (Fig. 2.2), we begin to suspect that the acceptance rates are comparable between the two genders within a department.
To dig deeper into our suspicion, let us draw another bar graph, where each bar has a height corresponding to the relative frequency of admission results within a single department, for a single gender. In addition, we will stack the bars so that the composition of “Admitted” vs “Rejected” is emphasized within each Department-Gender category.
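# Total number of applicants in each Department-Gender group
Dept_Gender_total <- as.vector((df %>% group_by(Dept_Gender) %>% summarise(Sum = sum(Freq)))$Sum)
# NOTE: rep(..., each = 2) assumes df is still sorted as by arrange() in the
# first code block, so that the two rows of each group are adjacent and the
# groups appear in the same order as summarise() returns them.
df$Dept_Gender_total <- rep(Dept_Gender_total, each = 2)
# Relative frequency of each admission status within its group
df$Dept_Gender_Rel_Freq <- df$Freq / df$Dept_Gender_total
ggplot(df, aes(x = Dept_Gender, y = Dept_Gender_Rel_Freq, fill = Admit)) +
  geom_bar(stat = "identity", position = "stack", width = 0.5) +
  theme_minimal() +
  labs(title = "Stacked bar graph of relative frequencies by Dept and Gender",
       x = "Department-Gender",
       y = "Relative Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))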

Fig. 2.4 Bar graph of relative frequencies of “Admitted” vs “Rejected” by Dept-Gender, UC Berkeley data set
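The group totals in the code above are attached to the rows by position, which only works while the data frame keeps its sort order. One possible order-independent alternative is base R's `ave()`, sketched below on a fresh copy of the data. The name `adm` is hypothetical and is used so that the lesson's `df` object is left untouched.

```r
# Fresh, unsorted copy of the data so this sketch runs on its own
adm <- as.data.frame(UCBAdmissions)
adm$Dept_Gender <- paste(adm$Dept, adm$Gender, sep = "-")

# ave() returns, for every row, the sum of Freq within that row's group,
# so the result does not depend on how the rows are sorted
adm$Dept_Gender_total    <- ave(adm$Freq, adm$Dept_Gender, FUN = sum)
adm$Dept_Gender_Rel_Freq <- adm$Freq / adm$Dept_Gender_total

# Within each Department-Gender group the relative frequencies sum to 1
tapply(adm$Dept_Gender_Rel_Freq, adm$Dept_Gender, sum)
```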
Our suspicion is confirmed. Indeed, the two genders have comparable acceptance rates within departments. In four of the six departments, the rate is actually higher for female applicants!
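The claim about four of the six departments can be verified numerically rather than read off the graph. The following base R sketch computes the admission rate for every department and gender; `within_cell`, `admit_rate`, and `female_higher` are names chosen here for illustration.

```r
# Proportions of Admitted/Rejected within each Gender x Dept cell
within_cell <- prop.table(UCBAdmissions, margin = c(2, 3))

# Keep the "Admitted" slice: rows are genders, columns are departments
admit_rate <- within_cell["Admitted", , ]
round(admit_rate, 3)

# Departments where the female rate exceeds the male rate
female_higher <- admit_rate["Female", ] > admit_rate["Male", ]
names(which(female_higher))   # "A" "B" "D" "F"
sum(female_higher)            # 4
```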
We covered two key techniques in drawing a bar graph through the UC Berkeley example.
Dodging bars side‑by‑side lets us compare groups across categories.
Stacking bars emphasizes composition within each category.
Remark - What’s behind the contradiction?
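The pie charts and bar graphs we generated appear to convey conflicting messages, even though they are based on the same data set. This discrepancy arises because the departments differ in both size and selectivity. Departments A and B admitted a large share of their applicants, and most of those applicants were male, while most female applicants applied to departments with lower admission rates.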
This highlights the importance of examining a data set carefully at multiple levels of categorization before drawing conclusions. In fact, this situation illustrates a well-known and frequently occurring statistical phenomenon called Simpson’s Paradox.
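One way to see the mechanism is to note that each gender's overall admission rate is a weighted average of the departmental rates, with weights equal to the share of that gender's applications sent to each department. The sketch below checks this in base R; the variable names are chosen here for illustration.

```r
# Applications per gender and department (summed over admission status)
applications <- margin.table(UCBAdmissions, margin = c(2, 3))

# Share of each gender's applications that went to each department
app_share <- prop.table(applications, margin = 1)
round(app_share, 3)

# Admission rate within each Gender x Dept cell
admit_rate <- prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ]

# Overall rate per gender = sum over departments of (share x rate)
aggregate_rate <- rowSums(app_share * admit_rate)
round(aggregate_rate, 3)   # Male 0.445, Female 0.304
```

The departmental rates are similar for both genders, but the weights in `app_share` are very different, and that difference alone produces the gap seen in the pie charts.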
Feel free to explore this fascinating topic further on your own!
2.2.4. Pie Chart or Bar Graph?
Choosing between pie charts and bar graphs depends on your data and the story you want to tell:
Bar graphs handle many categories comfortably; a pie chart with more than five slices becomes hard to read.
Exact comparisons across categories are easier in a bar graph because the common baseline (zero) guides the eye.
If your takeaway is “X accounts for one‑third of the total,” a pie slice delivers that message immediately.
Bar graphs work well when observations can belong to multiple categories.
Pie charts emphasize the part-to-whole relationship and are ideal when your data represents 100% of something.
2.2.5. Bringing It All Together
Key Takeaways 📝
The distribution of a categorical variable is first organized into a table of categories (also called labels or classes) with their counts, proportions, or percentages.
Pie charts emphasize part-to-whole relationships; bar graphs emphasize category comparisons.
Choose dodged or stacked bar graphs based on the message you want to convey. Dodged bar graphs allow precise comparison of heights; stacked bar graphs focus on showing the composition within a category.
Examine categorical data from multiple perspectives to avoid misleading interpretations.
Appendix: All Code in One Stack
# Load required packages
# If not installed already, install the package first by running
# install.packages("(package_name)")
# e.g. if ggplot2 is not installed, run
# install.packages("ggplot2")
library(ggplot2)
library(dplyr)
library(scales)
# Load built‑in data
data("UCBAdmissions")
df <- as.data.frame(UCBAdmissions) %>% arrange(Dept, desc(Gender))
#View(df)
###########
#Create the column of relative frequency
df$Rel_Freq <- df$Freq / sum(df$Freq)
#Create the column of percentage
df$Perc <- df$Rel_Freq * 100
View(df)
###########
# Take the subset of the data which only involves "Admitted" category.
admitted <- df[df$Admit == "Admitted", ]
# Frequency table of admitted student by department
df_by_dept <- admitted %>% group_by(Dept) %>% summarise(Freq=sum(Freq))
df_by_dept$Rel_Freq <- df_by_dept$Freq / sum(df_by_dept$Freq)
df_by_dept$Perc <- df_by_dept$Rel_Freq * 100
###########
# Only the code for the female case is shown for conciseness. Try creating
# the code for the other case using this as a template.
# Take the subset of the data which only involves "Female" category.
female <- df[df$Gender == "Female", ]
# Frequency table of admission status for female applicants
admit_female <- female %>% group_by(Admit) %>% summarise(Freq=sum(Freq))
admit_female$Rel_Freq <- admit_female$Freq / sum(admit_female$Freq)
admit_female$Perc <- admit_female$Rel_Freq * 100
###########
# Pie chart for female applicants
# linewidth sets the slice border thickness (ggplot2 >= 3.4.0;
# older versions use size = 1.25 instead)
ggplot(admit_female, aes(x = "", y = Freq, fill = Admit)) +
  geom_bar(stat = "identity", width = 1, colour = "black", linewidth = 1.25) +
  coord_polar(theta = "y", start = 0) +
  geom_text(aes(label = percent(Rel_Freq)),
            position = position_stack(vjust = 0.5), size = 5) +
  theme_void() +
  ggtitle("Acceptance Rate of Female Applicants")
############
# Bar graph of frequencies
df$Dept_Gender <- paste(df$Dept, df$Gender, sep = "-")
ggplot(df, aes(x = Dept_Gender, y = Freq, fill = Admit)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +
  theme_minimal() +
  labs(title = "Dodged bar graph of frequencies",
       x = "Department-Gender",
       y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))
#################
# Stacked bar graph of relative frequencies
# Total number of applicants in each Department-Gender group
Dept_Gender_total <- as.vector((df %>% group_by(Dept_Gender) %>% summarise(Sum = sum(Freq)))$Sum)
# NOTE: rep(..., each = 2) assumes df is still sorted as by arrange() above,
# so that the two rows of each group are adjacent and the groups appear in
# the same order as summarise() returns them.
df$Dept_Gender_total <- rep(Dept_Gender_total, each = 2)
df$Dept_Gender_Rel_Freq <- df$Freq / df$Dept_Gender_total
ggplot(df, aes(x = Dept_Gender, y = Dept_Gender_Rel_Freq, fill = Admit)) +
  geom_bar(stat = "identity", position = "stack", width = 0.5) +
  theme_minimal() +
  labs(title = "Stacked bar graph of relative frequencies by Dept and Gender",
       x = "Department-Gender",
       y = "Relative Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))