.. _2-2-tools-for-categorical-qualitative-data: .. raw:: html
Tools for Categorical (Qualitative) Data ================================================================== Numbers alone cannot convey how *popular* a department is, how *balanced* a survey sample appears, or whether two demographic variables *interact*. A simple *count table* answers "*how many*," but only a picture answers "*so what*?" In this lesson, you will learn two plots- **pie charts** and **bar graphs**—that turn categorical tallies into instant insights. We will also practice building the underlying frequency tables so that your R code is always one line away from a clear graph. .. admonition:: Road Map 🧭 :class: tip * Build **frequency tables** for categorical variables using three different metrics: **frequency, relative frequency**, and **percentage**. * Visualize a frequency table with **pie charts** and **bar graphs**. Learn when one is preferred over the other. The distribution of a categorical variable --------------------------------------------- The first stage of understanding the **distribution** of a categorical variable is to construct a table listing every category together with its count. We will use the famous 1973 UC Berkeley Graduate Admissions data set for illustration. This data set is available by default on RStudio. Run: .. _import-data: .. code-block:: r # Load required packages # If not installed already, install the package first by running # install.packages("(package_name)") # e.g. if ggplot2 is not installed, run # install.packages("ggplot2") library(ggplot2) library(dplyr) library(scales) # Load built‑in data data("UCBAdmissions") df <- as.data.frame(UCBAdmissions) %>% arrange(Dept, desc(Gender)) View(df) .. admonition:: Important Note :class: error Each code block must be run AFTER any previously presented code blocks in the same section. If you want to copy and paste the whole code as a single chunk, go to the appendix at the bottom of the page. The first rows will look like following: .. _UCB-default-table: .. flat-table:: 1973 UC Berkeley Graduate Admissions Data :header-rows: 1 :widths: 1 1 1 1 1 :width: 50% :align: center * - - **Admit** - **Gender** - **Dept** - **Freq** * - 1 - Admitted - Male - A - 512 * - 2 - Rejected - Male - A - 313 * - 3 - Admitted - Female - A - 89 * - 3 - Rejected - Female - A - 19 * - 5 - Admitted - Male - B - 353 * - :math:`\vdots` - :math:`\vdots` - :math:`\vdots` - :math:`\vdots` - :math:`\vdots` The combination of the first three columns shows the distinct categories that an observation belongs to. In this dataset, there are - two admission statuses ("Admitted" and "Rejected"), - two genders ("Male" and "Female"), and - six departments ("A" through "F"). Therefore, the dataset has a total of :math:`2 * 2 * 6 = 24` different categories. These categories are also called **classes** or **labels**. The frequencies in the right most column of :numref:`UCB-default-table` show the counts for each category. However, frequencies alone make it difficult to assess the relative size. For example: - Does the class of ["Admitted", "Male", "Dept A"] take up a large proportion of the entire set of observations? - Is the class of ["Rejected", "Female", "Dept A"] one of the smallest? For an objective picture, we must take the total counts into consideration. We use two new metrics for this purpose: * **Relative frequency (proportion)**: The fraction of the count out of the total, computed by :math:`relative\text{ }frequency = \dfrac{frequency}{total\text{ }count}` * **Percentage**: The relative frequency multiplied by 100%, computed by :math:`percentage = \dfrac{frequency}{total\text{ }count} * 100\%` Using the new metrics, let us create a full frequency table. .. code-block:: r #Create the column of relative frequency df$Rel_Freq <- df$Freq / sum(df$Freq) #Create the column of percentage df$Perc <- df$Rel_Freq * 100 View(df) Now we see an extended table: .. flat-table:: Extended 1973 UC Berkeley Admissions Data :header-rows: 2 :widths: 1 1 1 1 1 1 1 :width: 70% :align: center * - - **Admit** - **Gender** - **Dept** - **Freq** - **Rel_Freq** - **Perc** * - 1 - Admitted - Male - A - 512 - 0.113 - 11.3 * - 2 - Rejected - Male - A - 313 - 0.069 - 6.9 * - :math:`\vdots` - :math:`\vdots` - :math:`\vdots` - :math:`\vdots` - :math:`\vdots` - :math:`\vdots` - :math:`\vdots` It is also possible to create extended frequency tables for various combinations of the three individual categorical variables. Let's try creating a table displaying the counts of admitted students categorized by departments. .. code-block:: r # Take the subset of the data which only involves "Admitted" category. admitted <- df[df$Admit == "Admitted", ] # Frequency table of admitted student by department df_by_dept <- admitted %>% group_by(Dept) %>% summarise(Freq=sum(Freq)) df_by_dept$Rel_Freq <- df_by_dept$Freq / sum(df_by_dept$Freq) df_by_dept$Perc <- df_by_dept$Rel_Freq * 100 .. flat-table:: :header-rows: 2 :widths: 1 1 1 1 :width: 80% :align: center * - :cspan:`3` Admitted Applicants by Department * - Department Label - Frequency - Relative Frequency - Percentage * - **A** - 601 - 0.342 - 34.2 * - **B** - 307 - 0.211 - 21.1 * - **C** - 322 - 0.183 - 18.3 * - **D** - 269 - 0.153 - 15.3 * - **E** - 147 - 0.0838 - 8.38 * - **F** - 46 - 0.0262 - 2.62 Note that relative frequencies always **fall between 0 and 1** and **sum to 1**. Likewise, the percentages always range **from 0 to 100** and **sum to 100**. They provide a standardized representation of the counts and allow comparisons between different variables that share the same list of categories, even if their totals differ. Pie charts ----------------------------------- A **pie chart** represents a categorical variable as a sliced circle, where the slices are sized proportionally to the counts, relative frequencies or percentages. Note that the outcome will be identical regardless of the chosen metric. Pie charts are best when you need to emphasize that the categories make up a complete whole, and if your main goal is to compare the relative sizes of the labels within a single dataset. Let us draw the pie charts of the admission status variable, for each gender. We begin by creating the corresponding extended frequency tables: .. code-block:: r # Only the code for the female case is shown for conciseness. Try creating # the code for the other case using this as a template. # Take the subset of the data which only involves the "Female" category. female <- df[df$Gender == "Female", ] # Frequency table of admitted student by department admit_female <- female %>% group_by(Admit) %>% summarise(Freq=sum(Freq)) admit_female$Rel_Freq <- admit_female$Freq / sum(admit_female$Freq) admit_female$Perc <- admit_female$Rel_Freq * 100 .. flat-table:: :header-rows: 2 :widths: 1 1 1 1 :width: 80% :align: center * - :cspan:`3` Admission for Female Applicants * - Label - Frequency - Relative Frequency - Percentage * - **Admitted** - 557 - 0.304 - 30.4 * - **Rejected** - 1278 - 0.696 - 69.6 .. flat-table:: :header-rows: 2 :widths: 1 1 1 1 :width: 80% :align: center * - :cspan:`3` Admission for Male Applicants * - Label - Frequency - Relative Frequency - Percentage * - **Admitted** - 1198 - 0.445 - 44.5 * - **Rejected** - 1493 - 0.555 - 55.5 Using the tables above, create pie charts: .. code-block:: r #Pie chart for female applicants ggplot(admit_female, aes(x = "", y = Freq, fill = Admit)) + geom_bar(stat = "identity", width = 1, colour = "black", size = 1.25) + coord_polar(theta = "y", start = 0) + geom_text(aes(label = percent(Rel_Freq)), position = position_stack(vjust = 0.5), size=5)+ theme_void()+ ggtitle("Acceptance Rate of Female Applicants") .. _pie-male-female: .. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter2/UCB-acceptance-by-gender.png :align: center :figwidth: 90% :alt: UCB admission by gender 1973 UC Berkeley graduate admissions, by gender Pie charts are effective for an intuitive presentation of the variable composition, especially when **there are only a few categories** or when the **imbalance among the proportions** needs to be emphasized. The pie charts in :numref:`pie-male-female` display the distributions of graduate admissions for female and male applicants at UC Berkeley in 1973. They *seem* to suggest that there was a significant difference in the likelihood of acceptance between genders. We now proceed to the next section to explore this from another perspective. Bar graphs -------------------------- A **bar graph** draws one bar per category with the height proportional to its frequency. Bars may represent counts, relative counts, or percentages. Bar graphs offer several advantages over pie charts: * Pie charts lose their simplicity when there are more than a few categories. In contrast, bar graphs handle many categories more effectively. * They allow **exact comparisons** of relative sizes, especially when frequencies are of similar sizes. * When observations can belong to multiple categories, it is incorrect to suggest that the frequencies form a whole - since their total may exceed 100%. In such cases, bar graphs are more appropriate, as they do not imply that the parts sum to a whole. To demonstrate the strength of bar graphs in handling many categories, let us plot :numref:`UCB-default-table`, which contains 24 different categories. .. code-block:: r df$Dept_Gender <- paste(df$Dept, df$Gender, sep="-") ggplot(df, aes(x = Dept_Gender, y = Freq, fill = Admit)) + geom_bar(stat = "identity", position="dodge", width=0.7) + theme_minimal() + labs(title = "Dodged bar graph of frequencies", x = "Department-Gender", y = "Frequency") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) .. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter2/UCB-Freq.png :figwidth: 90% :align: center :alt: Bar graph of frequencies, UC Berkeley data set Bar graph of frequencies, UC Berkeley data set Unlike our first impression through the pie charts (:numref:`pie-male-female`), we begin to suspect that the acceptance rates are comparable between the two genders within a department. To dig deeper into our suspicion, let us draw another bar graph, where each bar has a height corresponding to the relative frequency of admission results within a single department, for a single gender. In addition, we will *stack* the bars so that the composition of "Accepted" vs "Rejected" is emphasized within each Department-Gender category. .. code-block:: r Dept_Gender_total <- as.vector((df %>% group_by(Dept_Gender) %>% summarise(Sum=sum(Freq)))$Sum) df$Dept_Gender_total <- rep(Dept_Gender_total, each=2) df$Dept_Gender_Rel_Freq <- df$Freq/df$Dept_Gender_total ggplot(df, aes(x = Dept_Gender, y = Dept_Gender_Rel_Freq, fill = Admit)) + geom_bar(stat = "identity", position="stack", width=0.5) + theme_minimal() + labs(title = "Stacked bar graph of relative frequencies by Dept and Gender", x = "Department-Gender", y = "Relative Frequency") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) .. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter2/UCB-Rel-Freq-Dept-Gender.png :figwidth: 90% :align: center :alt: Bar graph of relative frequencies of "Accepted" vs "Rejected" by Dept-Gender, UC Berkeley data set Our suspicion is comfirmed. Indeed, the two genders have comparable acceptance rates within departments. In four of the six departments, the rate is actually higher for female students! We covered two key techniques in drawing a bar graph through the UC Berkeley example. - **Dodging** bars side‑by‑side lets us *compare* groups across categories. - **Stacking** bars emphasizes *composition* within each category. .. admonition:: Remark - What's behind the contradiction? :class: important The pie charts and bar graphs we generated appear to convey **conflicting messages**, even though they are based on the same data set. This discrepancy arises because certain departments had a disproportionately large number of applicants—most of whom were male. This highlights the importance of **examining a data set carefully at multiple levels of categorization** before drawing conclusions. In fact, this situation illustrates a well-known and frequently occurring statistical phenomenon called **Simpson’s Paradox**. Feel free to explore this fascinating topic further on your own! Pie Chart or Bar Graph? -------------------------------------------------- Choosing between pie charts and bar graphs depends on your data and the story you want to tell: * Bar graphs handle many categories comfortably; a pie chart with more than five slices becomes hard to read. * Exact comparisons across categories are easier in a bar graph because the common baseline (zero) guides the eye. * If your takeaway is "X accounts for one‑third of the total," a pie slice delivers that message immediately. * Bar graphs work well when observations can belong to multiple categories. * Pie charts emphasize the part-to-whole relationship and are ideal when your data represents 100% of something. Bringing It All Together -------------------------- .. admonition:: Key Takeaways 📝 :class: important #. The distribution of a categorical variable is first organized into a table of **categories** (also called **labels** or **classes**) with their counts, proportions, or percentages. #. Pie charts emphasize **part of whole**; bar graphs emphasize **category comparisons**. #. Choose **dodged** or **stacked** bar graphs based on the message you want to convey. Dodged bar graphs allow precise comparison of heights; stacked bar graphs focuses on showing the composition within a category. #. Examine categorical data from multiple perspectives to avoid misleading interpretations. Appendix: All Code in One Stack ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: r # Load required packages # If not installed already, install the package first by running # install.packages("(package_name)") # e.g. if ggplot2 is not installed, run # install.packages("ggplot2") library(ggplot2) library(tidyverse) library(scales) # Load built‑in data data("UCBAdmissions") df <- as.data.frame(UCBAdmissions) %>% arrange(Dept, desc(Gender)) #View(df) ########### #Create the column of relative frequency df$Rel_Freq <- df$Freq / sum(df$Freq) #Create the column of percentage df$Perc <- df$Rel_Freq * 100 View(df) ########### # Take the subset of the data which only involves "Admitted" category. admitted <- df[df$Admit == "Admitted", ] # Frequency table of admitted student by department df_by_dept <- admitted %>% group_by(Dept) %>% summarise(Freq=sum(Freq)) df_by_dept$Rel_Freq <- df_by_dept$Freq / sum(df_by_dept$Freq) df_by_dept$Perc <- df_by_dept$Rel_Freq * 100 ########### # Only the code for the female case is shown for conciseness. Try creating # the code for the other case using this as a template. # Take the subset of the data which only involves "Female" category. female <- df[df$Gender == "Female", ] # Frequency table of admitted student by department admit_female <- female %>% group_by(Admit) %>% summarise(Freq=sum(Freq)) admit_female$Rel_Freq <- admit_female$Freq / sum(admit_female$Freq) admit_female$Perc <- admit_female$Rel_Freq * 100 ########### #Pie chart for female applicants ggplot(admit_female, aes(x = "", y = Freq, fill = Admit)) + geom_bar(stat = "identity", width = 1, colour = "black", size = 1.25) + coord_polar(theta = "y", start = 0) + geom_text(aes(label = percent(Rel_Freq)), position = position_stack(vjust = 0.5), size=5)+ theme_void()+ ggtitle("Acceptance Rate of Female Applicants") ############ # Bar graph of frequencies df$Dept_Gender <- paste(df$Dept, df$Gender, sep="-") ggplot(df, aes(x = Dept_Gender, y = Freq, fill = Admit)) + geom_bar(stat = "identity", position="dodge", width=0.7) + theme_minimal() + labs(title = "Dodged bar graph of frequencies", x = "Department-Gender", y = "Frequency") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) ################# # Stacked bar graph of relative frequencies Dept_Gender_total <- as.vector((df %>% group_by(Dept_Gender) %>% summarise(Sum=sum(Freq)))$Sum) df$Dept_Gender_total <- rep(Dept_Gender_total, each=2) df$Dept_Gender_Rel_Freq <- df$Freq/df$Dept_Gender_total ggplot(df, aes(x = Dept_Gender, y = Dept_Gender_Rel_Freq, fill = Admit)) + geom_bar(stat = "identity", position="stack", width=0.5) + theme_minimal() + labs(title = "Stacked bar graph of relative frequencies by Dept and Gender", x = "Department-Gender", y = "Relative Frequency") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))