.. _2-2-tools-for-categorical-qualitative-data:
Tools for Categorical (Qualitative) Data
==================================================================
Numbers alone cannot convey how *popular* a department is, how *balanced* a survey sample appears,
or whether two demographic variables *interact*. A simple *count table* answers "*how many*,"
but only a picture answers "*so what*?" In this lesson, you will learn two plots,
**pie charts** and **bar graphs**, that turn categorical tallies into instant insights. We will
also practice building the underlying frequency tables so that your R code is always one line
away from a clear graph.
.. admonition:: Road Map 🧭
:class: tip
* Build **frequency tables** for categorical variables using three different metrics: **frequency,
relative frequency**, and **percentage**.
* Visualize a frequency table with **pie charts** and **bar graphs**. Learn when
one is preferred over the other.
The distribution of a categorical variable
---------------------------------------------
The first stage of understanding the **distribution** of a categorical variable is
to construct a table listing every category together with its count. We will use the
famous 1973 UC Berkeley Graduate Admissions data set for illustration. This data set ships with
base R (in the ``datasets`` package), so it is available in RStudio without any download. Run:
.. _import-data:
.. code-block:: r
# Load required packages
# If not installed already, install the package first by running
# install.packages("(package_name)")
# e.g. if ggplot2 is not installed, run
# install.packages("ggplot2")
library(ggplot2)
library(dplyr)
library(scales)
# Load built‑in data
data("UCBAdmissions")
df <- as.data.frame(UCBAdmissions) %>% arrange(Dept, desc(Gender))
View(df)
.. admonition:: Important Note
:class: error
Each code block must be run AFTER any previously presented code blocks in
the same section. If you want to copy and paste the whole code as a single chunk,
go to the appendix at the bottom of the page.
The first few rows look like the following:
.. _UCB-default-table:
.. flat-table:: 1973 UC Berkeley Graduate Admissions Data
:header-rows: 1
:widths: 1 1 1 1 1
:width: 50%
:align: center
* -
- **Admit**
- **Gender**
- **Dept**
- **Freq**
* - 1
- Admitted
- Male
- A
- 512
* - 2
- Rejected
- Male
- A
- 313
* - 3
- Admitted
- Female
- A
- 89
* - 4
- Rejected
- Female
- A
- 19
* - 5
- Admitted
- Male
- B
- 353
* - :math:`\vdots`
- :math:`\vdots`
- :math:`\vdots`
- :math:`\vdots`
- :math:`\vdots`
The combination of the first three columns shows the distinct categories that an observation
belongs to. In this dataset, there are
- two admission statuses ("Admitted" and "Rejected"),
- two genders ("Male" and "Female"), and
- six departments ("A" through "F").
Therefore, the dataset has a total of :math:`2 \times 2 \times 6 = 24` different categories.
These categories are also called **classes** or **labels**.
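The count of 24 can be confirmed directly from the built-in object. The following is a minimal sketch in base R; the names ``n_levels`` and ``n_categories`` are illustrative and do not appear elsewhere in this lesson.

.. code-block:: r

   # UCBAdmissions is a 3-dimensional contingency table: Admit x Gender x Dept
   data("UCBAdmissions")

   # Number of levels of each categorical variable
   n_levels <- dim(UCBAdmissions)
   names(n_levels) <- names(dimnames(UCBAdmissions))
   print(n_levels)

   # Total number of distinct categories (combinations of levels)
   n_categories <- prod(n_levels)
   print(n_categories)

   # Converting the table to a data frame yields one row per category
   nrow(as.data.frame(UCBAdmissions))
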
The frequencies in the rightmost column of :numref:`UCB-default-table` show the
counts for each category. However, frequencies alone make it difficult to assess
the relative size. For example:
- Does the class of ["Admitted", "Male", "Dept A"] take up a large proportion
of the entire set of observations?
- Is the class of ["Rejected", "Female", "Dept A"]
one of the smallest?
For an objective picture, we must take the total counts into
consideration. We use two new metrics for this purpose:
* **Relative frequency (proportion)**: The fraction of the count out of the total, computed by
:math:`\text{relative frequency} = \dfrac{\text{frequency}}{\text{total count}}`
* **Percentage**: The relative frequency multiplied by 100%, computed by
:math:`\text{percentage} = \dfrac{\text{frequency}}{\text{total count}} \times 100\%`
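As a worked example, take the first row of :numref:`UCB-default-table`, with a frequency of 512. The total count, obtained by summing the frequencies of all 24 categories, is 4526, so

.. math::

   \text{relative frequency} = \dfrac{512}{4526} \approx 0.113,
   \qquad
   \text{percentage} = \dfrac{512}{4526} \times 100\% \approx 11.3\%
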
Using the new metrics, let us create a full frequency table.
.. code-block:: r
#Create the column of relative frequency
df$Rel_Freq <- df$Freq / sum(df$Freq)
#Create the column of percentage
df$Perc <- df$Rel_Freq * 100
View(df)
Now we see an extended table:
.. flat-table:: Extended 1973 UC Berkeley Admissions Data
:header-rows: 1
:widths: 1 1 1 1 1 1 1
:width: 70%
:align: center
* -
- **Admit**
- **Gender**
- **Dept**
- **Freq**
- **Rel_Freq**
- **Perc**
* - 1
- Admitted
- Male
- A
- 512
- 0.113
- 11.3
* - 2
- Rejected
- Male
- A
- 313
- 0.069
- 6.9
* - :math:`\vdots`
- :math:`\vdots`
- :math:`\vdots`
- :math:`\vdots`
- :math:`\vdots`
- :math:`\vdots`
- :math:`\vdots`
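The values shown in the table are rounded for display, while the columns stored in ``df`` keep full precision. Below is a minimal sketch of one way to format such values with base R's ``round()`` and ``sprintf()``. The vector names ``freq``, ``total``, and ``rel_freq`` are illustrative and are not part of the lesson's code.

.. code-block:: r

   # First two frequencies from the table above
   freq  <- c(512, 313)
   total <- 4526  # sum of all 24 frequencies in UCBAdmissions

   rel_freq <- freq / total

   # Round the proportions to three decimal places for display
   round(rel_freq, 3)

   # Format percentages with one decimal place and a percent sign
   sprintf("%.1f%%", 100 * rel_freq)
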
It is also possible to create extended frequency tables for various
combinations of the three individual categorical variables.
Let's try creating a table displaying the counts of admitted students
categorized by departments.
.. code-block:: r
# Take the subset of the data which only involves "Admitted" category.
admitted <- df[df$Admit == "Admitted", ]
# Frequency table of admitted students by department
df_by_dept <- admitted %>% group_by(Dept) %>% summarise(Freq=sum(Freq))
df_by_dept$Rel_Freq <- df_by_dept$Freq / sum(df_by_dept$Freq)
df_by_dept$Perc <- df_by_dept$Rel_Freq * 100
.. flat-table::
:header-rows: 2
:widths: 1 1 1 1
:width: 80%
:align: center
* - :cspan:`3` Admitted Applicants by Department
* - Department Label
- Frequency
- Relative Frequency
- Percentage
* - **A**
- 601
- 0.342
- 34.2
* - **B**
- 370
- 0.211
- 21.1
* - **C**
- 322
- 0.183
- 18.3
* - **D**
- 269
- 0.153
- 15.3
* - **E**
- 147
- 0.0838
- 8.38
* - **F**
- 46
- 0.0262
- 2.62
Note that relative frequencies always **fall between 0 and 1** and **sum to 1**. Likewise,
the percentages always range **from 0 to 100** and **sum to 100**. They provide a standardized
representation of the counts and allow comparisons between different variables that
share the same list of categories, even if their totals differ.
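To see why this standardization helps, the sketch below compares how admitted and rejected applicants are distributed across departments, two groups with different totals. It uses base R's ``margin.table()`` and ``prop.table()`` as an alternative to the ``dplyr`` approach above; the object names ``admit_by_dept`` and ``rel_by_dept`` are illustrative.

.. code-block:: r

   data("UCBAdmissions")

   # Collapse over Gender: a 2 x 6 table of counts, Admit (rows) by Dept (columns)
   admit_by_dept <- margin.table(UCBAdmissions, margin = c(1, 3))

   # The two rows have different totals (number admitted vs. number rejected)
   rowSums(admit_by_dept)

   # Relative frequencies within each row: every row now sums to 1
   rel_by_dept <- prop.table(admit_by_dept, margin = 1)
   round(rel_by_dept, 3)
   rowSums(rel_by_dept)

Because each row is expressed on the same 0-to-1 scale, the two distributions can be compared department by department even though the raw totals differ.
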
Pie charts
-----------------------------------
A **pie chart** represents a categorical variable as a sliced circle,
where the slices are sized proportionally to the counts, relative frequencies
or percentages. Note that the outcome will be identical regardless of the
chosen metric.
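The reason is that the angle of each slice depends only on its share of the total. The sketch below checks this for the admission counts of female applicants in the UC Berkeley data; the helper ``slice_angles()`` is a hypothetical function written for this illustration, not part of any package.

.. code-block:: r

   # Admission counts for female applicants (from the UCBAdmissions data)
   freq     <- c(Admitted = 557, Rejected = 1278)
   rel_freq <- freq / sum(freq)
   perc     <- 100 * rel_freq

   # Central angle (in degrees) of each slice = 360 * share of the total
   slice_angles <- function(x) 360 * x / sum(x)

   slice_angles(freq)
   slice_angles(rel_freq)
   slice_angles(perc)

   # All three metrics give the same angles (up to floating-point error)
   all.equal(slice_angles(freq), slice_angles(rel_freq))
   all.equal(slice_angles(freq), slice_angles(perc))
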
Pie charts are best when you need to emphasize that the categories make up a
complete whole, and if your main goal is to compare the relative sizes of
the labels within a single dataset.
Let us draw the pie charts of the admission status variable, for each gender.
We begin by creating the corresponding extended frequency tables:
.. code-block:: r
# Only the code for the female case is shown for conciseness. Try creating
# the code for the other case using this as a template.
# Take the subset of the data which only involves the "Female" category.
female <- df[df$Gender == "Female", ]
# Frequency table of admission status for female applicants
admit_female <- female %>% group_by(Admit) %>% summarise(Freq=sum(Freq))
admit_female$Rel_Freq <- admit_female$Freq / sum(admit_female$Freq)
admit_female$Perc <- admit_female$Rel_Freq * 100
.. flat-table::
:header-rows: 2
:widths: 1 1 1 1
:width: 80%
:align: center
* - :cspan:`3` Admission for Female Applicants
* - Label
- Frequency
- Relative Frequency
- Percentage
* - **Admitted**
- 557
- 0.304
- 30.4
* - **Rejected**
- 1278
- 0.696
- 69.6
.. flat-table::
:header-rows: 2
:widths: 1 1 1 1
:width: 80%
:align: center
* - :cspan:`3` Admission for Male Applicants
* - Label
- Frequency
- Relative Frequency
- Percentage
* - **Admitted**
- 1198
- 0.445
- 44.5
* - **Rejected**
- 1493
- 0.555
- 55.5
Using the tables above, create pie charts:
.. code-block:: r
#Pie chart for female applicants
ggplot(admit_female, aes(x = "", y = Freq, fill = Admit)) +
geom_bar(stat = "identity", width = 1, colour = "black", size = 1.25) +
coord_polar(theta = "y", start = 0) +
geom_text(aes(label = percent(Rel_Freq)),
position = position_stack(vjust = 0.5), size=5)+
theme_void()+
ggtitle("Acceptance Rate of Female Applicants")
.. _pie-male-female:
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter2/UCB-acceptance-by-gender.png
:align: center
:figwidth: 90%
:alt: UCB admission by gender
1973 UC Berkeley graduate admissions, by gender
Pie charts are effective for an intuitive presentation of
the variable composition, especially when **there are only a few categories**
or when the **imbalance among the proportions** needs to be emphasized.
The pie charts in :numref:`pie-male-female` display the distributions
of graduate admissions for female and male applicants at UC Berkeley
in 1973. They *seem* to suggest that there was a significant difference
in the likelihood of acceptance between genders.
We now proceed to the next section to explore this from another perspective.
Bar graphs
--------------------------
A **bar graph** draws one bar per category, with the height of each bar proportional to
the category's frequency. Bars may represent counts, relative frequencies, or percentages.
Bar graphs offer several advantages over pie charts:
* Pie charts lose their simplicity when there are more than a few categories.
In contrast, bar graphs handle many categories more effectively.
* They allow **exact comparisons** of relative sizes, especially when
frequencies are of similar sizes.
* When observations can belong to multiple categories,
it is incorrect to suggest that the frequencies form a whole,
since their total may exceed 100%. In such cases, bar graphs are more appropriate,
as they do not imply that the parts sum to a whole.
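The last point can be illustrated with a small multi-select survey. The data below are entirely hypothetical (five made-up respondents who may each pick several tools) and serve only to show how percentages can total more than 100.

.. code-block:: r

   # Hypothetical multi-select survey: 5 respondents, each may pick several tools
   responses <- list(
     c("R", "Python"),
     c("R"),
     c("Python", "SQL"),
     c("R", "Python", "SQL"),
     c("SQL")
   )

   # Count how many respondents selected each tool
   counts <- table(unlist(responses))

   # Percentage of respondents (not of selections) choosing each tool
   perc <- 100 * counts / length(responses)
   perc

   # The percentages add up to more than 100, so they are not parts of one whole
   sum(perc)

   # A bar graph displays them without implying a whole
   barplot(perc, ylab = "Percentage of respondents")
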
To demonstrate the strength of bar graphs in handling many categories, let us plot
:numref:`UCB-default-table`, which contains 24 different categories.
.. code-block:: r
df$Dept_Gender <- paste(df$Dept, df$Gender, sep="-")
ggplot(df, aes(x = Dept_Gender, y = Freq, fill = Admit)) +
geom_bar(stat = "identity", position="dodge", width=0.7) +
theme_minimal() +
labs(title = "Dodged bar graph of frequencies",
x = "Department-Gender",
y = "Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter2/UCB-Freq.png
:figwidth: 90%
:align: center
:alt: Bar graph of frequencies, UC Berkeley data set
Bar graph of frequencies, UC Berkeley data set
Unlike our first impression through the pie charts (:numref:`pie-male-female`),
we begin to suspect that the acceptance rates are comparable
between the two genders within a department.
To dig deeper into our suspicion, let us draw another bar graph, where
each bar has a height corresponding to the relative frequency of an admission result
within a single department, for a single gender. In addition, we will *stack* the
bars so that the composition of "Admitted" vs "Rejected" is emphasized within
each Department-Gender category.
.. code-block:: r
# Total applicants within each Department-Gender group, aligned row by row
# (ave() matches groups by value, so the result does not depend on row order)
df$Dept_Gender_total <- ave(df$Freq, df$Dept_Gender, FUN = sum)
df$Dept_Gender_Rel_Freq <- df$Freq / df$Dept_Gender_total
ggplot(df, aes(x = Dept_Gender, y = Dept_Gender_Rel_Freq, fill = Admit)) +
geom_bar(stat = "identity", position="stack", width=0.5) +
theme_minimal() +
labs(title = "Stacked bar graph of relative frequencies by Dept and Gender",
x = "Department-Gender",
y = "Relative Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter2/UCB-Rel-Freq-Dept-Gender.png
:figwidth: 90%
:align: center
:alt: Stacked bar graph of relative frequencies by department and gender, UC Berkeley data set
Bar graph of relative frequencies of "Admitted" vs "Rejected" by Dept-Gender,
UC Berkeley data set
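As an aside, ``ggplot2`` can compute the within-bar proportions itself. The sketch below is an alternative to the approach above, not a replacement for it: it passes raw counts and uses ``position = "fill"``, which rescales every stacked bar to a total height of 1. The names ``ucb`` and ``p_fill`` are illustrative.

.. code-block:: r

   library(ggplot2)

   data("UCBAdmissions")
   ucb <- as.data.frame(UCBAdmissions)
   ucb$Dept_Gender <- paste(ucb$Dept, ucb$Gender, sep = "-")

   # position = "fill" rescales each stacked bar to a total height of 1,
   # so the y-axis shows within-bar proportions computed from the raw counts
   p_fill <- ggplot(ucb, aes(x = Dept_Gender, y = Freq, fill = Admit)) +
     geom_col(position = "fill", width = 0.5) +
     theme_minimal() +
     labs(x = "Department-Gender", y = "Relative Frequency") +
     theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))

   print(p_fill)

Computing the proportions by hand, as done earlier, keeps them available as a column for tables and labels; ``position = "fill"`` is convenient when only the picture is needed.
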
Our suspicion is confirmed. Indeed, the two genders have comparable acceptance rates
within departments. In four of the six departments, the rate is actually higher
for female applicants!
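The count of four departments can be checked numerically. The following is a minimal sketch in base R that works on the original ``UCBAdmissions`` table; the names ``rates`` and ``female_higher`` are illustrative.

.. code-block:: r

   data("UCBAdmissions")

   # Proportions within each Gender x Dept cell; keep the "Admitted" slice.
   # The result is a 2 x 6 matrix of acceptance rates (Gender by Dept).
   rates <- prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ]
   round(rates, 3)

   # Departments where the female acceptance rate exceeds the male rate
   female_higher <- rates["Female", ] > rates["Male", ]
   names(female_higher)[female_higher]
   sum(female_higher)
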
We covered two key techniques for drawing a bar graph through the UC Berkeley example:
- **Dodging** bars side‑by‑side lets us *compare* groups across categories.
- **Stacking** bars emphasizes *composition* within each category.
.. admonition:: Remark - What's behind the contradiction?
:class: important
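The pie charts and bar graphs we generated appear to convey **conflicting messages**,
even though they are based on the same data set. This discrepancy arises because
applicants were not spread evenly across departments: the departments with the highest
acceptance rates (A and B) drew most of their applicants from men, while women applied
in larger numbers to departments with lower acceptance rates.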
This highlights the importance of **examining a data set carefully at multiple
levels of categorization** before drawing conclusions. In fact, this situation
illustrates a well-known and frequently occurring statistical phenomenon called
**Simpson’s Paradox**.
Feel free to explore this fascinating topic further on your own!
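One way to look behind the contradiction is to place the aggregated rates next to the number of applicants per department. The sketch below uses base R only; the object names ``overall``, ``applicants``, and ``dept_rate`` are illustrative.

.. code-block:: r

   data("UCBAdmissions")

   # Overall acceptance rate by gender, ignoring department
   overall <- prop.table(margin.table(UCBAdmissions, margin = c(1, 2)), margin = 2)
   round(overall["Admitted", ], 3)

   # Number of applicants of each gender in each department
   applicants <- margin.table(UCBAdmissions, margin = c(2, 3))
   applicants

   # Acceptance rate of each department, ignoring gender
   dept_rate <- prop.table(margin.table(UCBAdmissions, margin = c(1, 3)), margin = 2)
   round(dept_rate["Admitted", ], 3)

Reading the last two outputs together shows which departments were easiest to enter and which gender supplied most of their applicants.
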
Pie Chart or Bar Graph?
--------------------------------------------------
Choosing between pie charts and bar graphs depends on your data and the story you want to tell:
* Bar graphs handle many categories comfortably; a pie chart with more than five slices becomes hard to read.
* Exact comparisons across categories are easier in a bar graph because the common baseline (zero) guides the eye.
* If your takeaway is "X accounts for one‑third of the total," a pie slice delivers that message immediately.
* Bar graphs work well when observations can belong to multiple categories.
* Pie charts emphasize the part-to-whole relationship and are ideal when your data represents 100% of something.
Bringing It All Together
--------------------------
.. admonition:: Key Takeaways 📝
:class: important
#. The distribution of a categorical variable is first organized into a table of
**categories** (also called **labels** or **classes**) with their counts,
proportions, or percentages.
#. Pie charts emphasize **part-to-whole** relationships; bar graphs emphasize **category comparisons**.
#. Choose **dodged** or **stacked** bar graphs based on the message you want to convey.
Dodged bar graphs allow precise comparison of heights; stacked bar graphs
focus on showing the composition within a category.
#. Examine categorical data from multiple perspectives to avoid misleading interpretations.
Appendix: All Code in One Stack
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: r
# Load required packages
# If not installed already, install the package first by running
# install.packages("(package_name)")
# e.g. if ggplot2 is not installed, run
# install.packages("ggplot2")
library(ggplot2)
library(dplyr)
library(scales)
# Load built‑in data
data("UCBAdmissions")
df <- as.data.frame(UCBAdmissions) %>% arrange(Dept, desc(Gender))
#View(df)
###########
#Create the column of relative frequency
df$Rel_Freq <- df$Freq / sum(df$Freq)
#Create the column of percentage
df$Perc <- df$Rel_Freq * 100
View(df)
###########
# Take the subset of the data which only involves "Admitted" category.
admitted <- df[df$Admit == "Admitted", ]
# Frequency table of admitted students by department
df_by_dept <- admitted %>% group_by(Dept) %>% summarise(Freq=sum(Freq))
df_by_dept$Rel_Freq <- df_by_dept$Freq / sum(df_by_dept$Freq)
df_by_dept$Perc <- df_by_dept$Rel_Freq * 100
###########
# Only the code for the female case is shown for conciseness. Try creating
# the code for the other case using this as a template.
# Take the subset of the data which only involves "Female" category.
female <- df[df$Gender == "Female", ]
# Frequency table of admission status for female applicants
admit_female <- female %>% group_by(Admit) %>% summarise(Freq=sum(Freq))
admit_female$Rel_Freq <- admit_female$Freq / sum(admit_female$Freq)
admit_female$Perc <- admit_female$Rel_Freq * 100
###########
#Pie chart for female applicants
ggplot(admit_female, aes(x = "", y = Freq, fill = Admit)) +
geom_bar(stat = "identity", width = 1, colour = "black", size = 1.25) +
coord_polar(theta = "y", start = 0) +
geom_text(aes(label = percent(Rel_Freq)),
position = position_stack(vjust = 0.5), size=5)+
theme_void()+
ggtitle("Acceptance Rate of Female Applicants")
############
# Bar graph of frequencies
df$Dept_Gender <- paste(df$Dept, df$Gender, sep="-")
ggplot(df, aes(x = Dept_Gender, y = Freq, fill = Admit)) +
geom_bar(stat = "identity", position="dodge", width=0.7) +
theme_minimal() +
labs(title = "Dodged bar graph of frequencies",
x = "Department-Gender",
y = "Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))
#################
# Stacked bar graph of relative frequencies
# Total applicants within each Department-Gender group, aligned row by row
# (ave() matches groups by value, so the result does not depend on row order)
df$Dept_Gender_total <- ave(df$Freq, df$Dept_Gender, FUN = sum)
df$Dept_Gender_Rel_Freq <- df$Freq / df$Dept_Gender_total
ggplot(df, aes(x = Dept_Gender, y = Dept_Gender_Rel_Freq, fill = Admit)) +
geom_bar(stat = "identity", position="stack", width=0.5) +
theme_minimal() +
labs(title = "Stacked bar graph of relative frequencies by Dept and Gender",
x = "Department-Gender",
y = "Relative Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))