.. _2-1-understanding-the-structure-of-data-set: .. raw:: html

Data Set Structure and Variable Types =========================================================================== Imagine being handed a spreadsheet containing thousands of numbers—sales figures, temperature readings, or survey responses. Without organization and visualization, these numbers remain just that: a collection of digits offering little immediate insight. A raw data file is essentially a *laundry list* of values. This chapter introduces you to the basic vocabulary of structured data sets and demonstrates how tables, pie charts, bar graphs, and histograms transform data values into intuitve patterns. We begin by establishing a common language for organizing and describing a data set. .. admonition:: Road Map 🧭 :class: important * Define **case** and **variable**, and understand how they are organized in a rectangular data set. * Define **categorical (qualitative)** and **numerical (quantitative)** variables and their further divisions. * Understand that each variable type requires a **different set of summarizing tools** to reflect its unique structure. Understanding the Structure of a Data Set --------------------------------------------- .. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter2/cases-variables.png :alt: cases and variables in a rectangular spreadsheet :align: center :figwidth: 60% Cases and variables in a rectangular spreadsheet Before delving into visualization techniques, we need to understand how data is organized. A **data set** is typically arranged as a rectangular array where - *rows* represent **cases** or **observations** (individual entities such as people, cities, or products), and - *columns* represent **variables** (specific attributes of each case). When examining any data set, always begin by asking three fundamental questions: * **Who?** - What cases does the data describe? How many cases are there? * **What?** - Which variables are being measured, with what units, and on what scale? How many variables are there? * **Why?** - What question motivated the data, and are the variables appropriate for answering that question? These questions may seem simple, but they provide essential context for any statistical analysis. Without understanding who or what is being measured, we risk misinterpreting results or drawing inappropriate conclusions. Variable Classification ------------------------------------------------------- Variables come in different types, each with its own properties and appropriate summarizing methods. The flowchart below illustrates the primary classification system: .. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter2/variable-types-flow-chart.png :alt: Flow-chart of variable types :figwidth: 60% :align: center Categorical (qualitative) Variables ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A categorical variable describes an attribute which can be classified into distinct categories. Typicallly, these categories are distinguished by names or labels but cannot be measured on a numerical scale. It can be further divided into: * **Nominal variables**: categories with no inherent order (e.g., fruit types, car brands, hair colors) * **Ordinal variables**: categories with a meaningful order but uneven or unmeasurable spacing between values (e.g., education levels, survey responses on a scale from "strongly disagree" to "strongly agree") Numerical (quantitative) variables ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A numerical variable record measurements or counts of an attribute which can be expressed on a numerical scale. It can be subdivided into: * **Discrete variables**: Values that can be counted as separate, distinct points (typically whole numbers) with no possible values between consecutive units (e.g., number of books on a shelf, number of students in a classroom) * **Continuous variables**: Values that can take any value within a range, limited only by measurement precision (e.g., height, weight, temperature, time) .. admonition:: A Tip 🔎: When confused bewteen discrete and continuous...🤔 :class: error Take any single value from the numerical variable. Then ask, **Can I clearly identify the "next" (or "previous") point?** - If yes, then **discrete** - If no, then **continuous** For example: - If there are three students in a classroom, what is the "next" value? Without question, it's four. → discrete - If the current temperature is 81 degrees Fahrenheit, what is the "next" value it can be? 82? 81.1? 81.00001? Not clear. → continuous Quantitavie variables can also be divided into interval and ratio scales: * **Interval scales** have equal distances between values, but no true zero point (e.g., Celsius temperature—0°C doesn't mean "no temperature") * **Ratio scales** have equal distances between values and a meaningful zero that represents the absence of the quantity (e.g., height, weight, time) .. Summarizing Categorical Variables ------------------------------------ How do we make sense of categorical data? The first step is usually to create a distribution showing categories with their **count**, **relative frequency**, or **percentage**. Two standard visual representations are particularly effective for categorical data: pie charts and bar graphs. Let's examine real examples of each visualization type: Pie Chart Example – UC Berkeley Graduate Applications ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter2/ucb-applications-pie.png :alt: Pie chart of Berkeley graduate applications, 1973 :align: center This pie chart displays the distribution of graduate applications across different departments at UC Berkeley in 1973. Each slice represents a department, with the size proportional to the percentage of total applications. The largest segments immediately draw attention to the most popular departments. The R code used to create this visualization is included below for those interested in the technical implementation: .. code-block:: r #----------------------------------------------- # PIE CHART – Applications by department #----------------------------------------------- library(ggplot2) adjust_angle <- function(percentage, angle){ ifelse(percentage > 10, 0, ifelse(angle < -45, angle + 90, angle)) } data("UCBAdmissions") df <- as.data.frame(UCBAdmissions) # --- All applications applications_summary <- aggregate(Freq ~ Dept, data = df, sum) applications_total <- sum(applications_summary$Freq) applications_summary$Percentage <- (applications_summary$Freq / applications_total) * 100 applications_summary$angle <- cumsum(c(0, head(applications_summary$Freq, -1))) + applications_summary$Freq/2 applications_summary$percentage_angle <- applications_summary$angle / sum(applications_summary$Freq) * 360 applications_summary$text_angle <- adjust_angle(applications_summary$Percentage, -90 + applications_summary$percentage_angle) ggplot(applications_summary, aes(x = "", y = Freq, fill = Dept)) + geom_bar(stat = "identity", width = 1, colour = "black", size = 1.25) + coord_polar(theta = "y", start = 0) + labs(title = "Applications by Department at UC Berkeley in 1973", fill = "Department") + theme(legend.position = "right", plot.title = element_text(size = 20, face = "bold")) + geom_text(aes(label = sprintf("%.1f%%", Percentage), angle = text_angle), position = position_stack(vjust = 0.5), fontface = "bold", size = 6, colour = "white") Bar Graph Example – Student Club Simulation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter2/student-clubs-bar.png :alt: Bar graph of simulated club membership counts :align: center This bar graph shows a simulation of 450 students deciding whether to join each of 10 different clubs. The height of each bar represents the membership count for each club. Unlike pie charts, bar graphs excel at comparing exact values across categories and work well even with many categories. The code below demonstrates how this simulation was conducted and visualized: .. code-block:: r # Simulate 450 students deciding independently whether to join each of 10 clubs library(ggplot2) set.seed(42) num_people <- 450 num_clubs <- 10 prob_join <- 0.10 people_clubs <- matrix(rbinom(num_people * num_clubs, size = 1, prob = prob_join), nrow = num_people, ncol = num_clubs) club_names <- c("Eco Warriors", "Tech Innovators", "Art Enthusiasts", "Sports League", "Debate Society", "Music Band", "Theatre Group", "Literature Circle", "Chess Club", "Science Forum") df_people_clubs <- as.data.frame(people_clubs) names(df_people_clubs) <- club_names club_totals <- colSums(df_people_clubs) total_memberships <- sum(club_totals) club_percentages <- (club_totals / total_memberships) * 100 df_plot <- data.frame(Club = names(club_percentages), Percentage = club_percentages, Members = club_totals) ggplot(df_plot, aes(x = Club, y = Members, fill = Club)) + geom_bar(stat = "identity") + theme_minimal() + labs(title = "Membership in Student Clubs", x = "Student Club", y = "Number of Members") + theme(axis.text.x = element_text(angle = 45, hjust = 1, face = "bold", size = 14)) Summarizing Quantitative Variables ------------------------------------- While pie charts and bar graphs work well for categorical data, they aren't suitable for quantitative variables where we need to understand the distribution across a numeric range. For quantitative data, the **histogram** serves as our primary visualization tool. A histogram divides the number line into equal-width *bins* and creates a bar for each bin, with the height proportional to the count (or density) of observations falling within that range. This simple yet powerful visualization reveals patterns like central tendency, spread, modality, skewness, and potential outliers at a glance. Choosing the number of bins involves both art and science. A useful rule of thumb is :math:`\lceil \sqrt{n} + 2 \rceil`, where *n* is the number of observations. However, you should always let the visual clarity of the resulting picture be your ultimate guide. Too few bins might obscure important patterns, while too many can create visual noise from random variation. Discrete Example – Agricultural Insect Counts ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter2/insect-count-hist.png :alt: Histogram of insect counts from Beall (1942) :align: center This histogram displays the distribution of insect counts from an agricultural study. Since the data is discrete (we can only count whole insects), the histogram shows clear steps. The shape reveals important information about typical insect density and the variation among observations. .. code-block:: r library(ggplot2) data(InsectSprays) n <- nrow(InsectSprays) nbins <- round(max(sqrt(n) + 1, 5)) ggplot(InsectSprays, aes(x = count)) + geom_histogram(bins = nbins, fill = "blue", colour = "black", linewidth = 2) + labs(title = "Distribution of Insect Counts", x = "Insect Count", y = "Frequency") Continuous Example – Residential Furnace Energy Use ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter2/furnace-density.png :alt: Histogram and density curve for BTU consumption :align: center This example shows energy consumption (in BTUs) for residential furnaces. Since energy usage is a continuous variable, we can overlay a smooth density curve to better visualize the underlying distribution. The example also includes a normal distribution curve (in blue) for comparison, highlighting any deviations from normality in the actual data. .. code-block:: r library(ggplot2) furnace <- read.csv("furnace.txt") # replace with local path xbar <- mean(furnace$Consumption) s <- sd(furnace$Consumption) n_bins <- max(round(sqrt(nrow(furnace)) + 2), 5) ggplot(furnace, aes(Consumption)) + geom_histogram(aes(y = after_stat(density)), bins = n_bins, fill = "purple", colour = "black", size = 2) + geom_density(colour = "red", linewidth = 2) + stat_function(fun = dnorm, args = list(mean = xbar, sd = s), colour = "blue", linewidth = 2) + xlab("BTU") + ylab("Density") + xlim(0, 20) Describing Shape, Center, Spread & Outliers --------------------------------------------- When examining any histogram, train yourself to observe and comment on four key characteristics: * **Shape** – Is the distribution unimodal (one peak), bimodal (two peaks), or multimodal (multiple peaks)? Is it symmetric around a central value, or skewed to the left or right? * **Center** – Where is the bulk of the data located? This gives us an intuitive sense of typical values before formally calculating means or medians. * **Spread** – How wide is the distribution? Are values tightly clustered or widely dispersed? We'll later formalize this with range, interquartile range (IQR), and standard deviation. * **Outliers** – Are there any observations that fall notably outside the main pattern? These may represent errors, special cases, or important anomalies worth investigating. These four characteristics provide a complete qualitative description of any distribution and should become second nature when you look at histograms or other distribution visualizations. .. admonition:: Example 💡: Variable Type Depends on Both Nature and Context When classifying a variable, its definition given by the data collector is just as important as its naturally occurring properities. Take the **final exam grades** of an imaginary course, MATH 1234, for example. - Researcher 1 records the data as percentage scores after a curve has been applied (Variable 1). - Researcher 2 records the data as belonging to one of the intervals 0%-60%, 60%-70%, 70%-80%, then 80%-100% (Variable 2). Although both variables come from the same source, they are now of distinct types: - **Variable 1** is a **continuous numerical variable** on an **interval scale**. A grade of 0% has lost its meaning, and a student with 80% did not perform twice as better as one with 40%, due to the curve. - **Variable 2** is an **ordinal categorical variable**. Although a natural order exists among the intervals, no meaningful arithmetic oprations are possible between them, indicating that they are **not** numerical values. In general cases where no experimental details are given, you may focus on the variable's natural properties. However, when additional context is available, it must be factored into how you classify it. Bringing It All Together -------------------------- In subsequent sections, we'll discuss the appropriate visualizing tools for each type of variables. .. admonition:: Key Takeaways 📝 :class: important 1. Always identify the *who*, *what* and *why* before graphing to provide essential context. 2. Be able to classify a variable by considering both its natural properties and its specific usage in an experiment. Exercises ~~~~~~~~~~~~~ 1. For each of the following variables state whether it is categorical nominal, categorical ordinal, discrete numerical or continuous numerical. For numerical variables, also determine whether they are on a ratio scale or an interval scale: - Blood type - SAT score - Number of pets owned - Daily rainfall