2.1. Data Set Structure and Variable Types

Imagine being handed a spreadsheet containing thousands of numbers—sales figures, temperature readings, or survey responses. Without organization and visualization, these numbers remain just that: a collection of digits offering little immediate insight.

A raw data file is essentially a laundry list of values. This chapter introduces the basic vocabulary of structured data sets and demonstrates how tables, pie charts, bar graphs, and histograms transform data values into intuitive patterns. We begin by establishing a common language for organizing and describing a data set.

Road Map 🧭

  • Define case and variable, and understand how they are organized in a rectangular data set.

  • Define categorical (qualitative) and numerical (quantitative) variables and their further divisions.

  • Understand that each variable type requires a different set of summarizing tools to reflect its unique structure.

2.1.1. Understanding the Structure of a Data Set


Fig. 2.1 Cases and variables in a rectangular spreadsheet

Before delving into visualization techniques, we need to understand how data is organized. A data set is typically arranged as a rectangular array where

  • rows represent cases or observations (individual entities such as people, cities, or products), and

  • columns represent variables (specific attributes of each case).
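The rows-and-columns layout can be sketched in a few lines of Python. This is only an illustration: the products, prices, and variable names below are made up for the example and are not taken from any real data set.

```python
# Hypothetical data set: each dictionary is one case (a row),
# and each key is a variable (a column).
data_set = [
    {"product": "notebook", "price_usd": 3.50, "units_sold": 120},
    {"product": "pen", "price_usd": 1.25, "units_sold": 340},
    {"product": "backpack", "price_usd": 29.99, "units_sold": 45},
]

n_cases = len(data_set)                 # number of rows
variables = list(data_set[0].keys())    # column names
n_variables = len(variables)            # number of columns

# "Rectangular" means every case records a value for the same variables.
is_rectangular = all(list(case.keys()) == variables for case in data_set)

print(f"{n_cases} cases, {n_variables} variables: {variables}")
print("Rectangular:", is_rectangular)
```

Counting the rows and listing the column names gives the number of cases and the variables recorded for each one.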

When examining any data set, always begin by asking three fundamental questions:

  • Who? - What cases does the data describe? How many cases are there?

  • What? - Which variables are being measured, with what units, and on what scale? How many variables are there?

  • Why? - What question motivated the data, and are the variables appropriate for answering that question?

These questions may seem simple, but they provide essential context for any statistical analysis. Without understanding who or what is being measured, we risk misinterpreting results or drawing inappropriate conclusions.

2.1.2. Variable Classification

Variables come in different types, each with its own properties and appropriate summarizing methods. The flowchart below illustrates the primary classification system:

Flow-chart of variable types

Categorical (qualitative) Variables

A categorical variable describes an attribute that can be classified into distinct categories. Typically, these categories are distinguished by names or labels and cannot be measured on a numerical scale. Categorical variables can be further divided into:

  • Nominal variables: categories with no inherent order (e.g., fruit types, car brands, hair colors)

  • Ordinal variables: categories with a meaningful order but uneven or unmeasurable spacing between values (e.g., education levels, survey responses on a scale from “strongly disagree” to “strongly agree”)
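The difference between the two subtypes shows up in which operations make sense. The sketch below uses made-up responses and assumes a five-level agreement scale (the text names only its two endpoints): sorting is meaningful for the ordinal variable, while counting is the natural summary for the nominal one.

```python
from collections import Counter

# Assumed five-level scale; the order is part of the variable's definition.
LEVELS = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
rank = {level: position for position, level in enumerate(LEVELS)}

# Hypothetical survey responses (ordinal): sorting by rank is meaningful.
responses = ["agree", "strongly disagree", "neutral", "strongly agree", "agree"]
ordered = sorted(responses, key=rank.get)

# Hypothetical hair colors (nominal): no order exists, so we only count.
hair_colors = ["brown", "black", "brown", "blond"]
counts = Counter(hair_colors)

print(ordered)
print(counts)
```

Note that the ranks 0 to 4 only encode order. The gap between "neutral" and "agree" is not assumed to equal the gap between "agree" and "strongly agree", so averaging the ranks would not be justified.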

Numerical (quantitative) Variables

A numerical variable records measurements or counts of an attribute that can be expressed on a numerical scale. Numerical variables can be subdivided into:

  • Discrete variables: Values that can be counted as separate, distinct points (typically whole numbers) with no possible values between consecutive units (e.g., number of books on a shelf, number of students in a classroom)

  • Continuous variables: Values that can take any value within a range, limited only by measurement precision (e.g., height, weight, temperature, time)

A Tip 🔎: When confused between discrete and continuous…🤔

Take any single value of the numerical variable. Then ask: can I clearly identify the “next” (or “previous”) value?
  • If yes, then discrete

  • If no, then continuous

For example:
  • If there are three students in a classroom, what is the “next” value? Without question, it’s four. → discrete

  • If the current temperature is 81 degrees Fahrenheit, what is the “next” value it can be? 82? 81.1? 81.00001? Not clear. → continuous
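The “next value” test can be mimicked in code. This is a rough sketch with hypothetical helper names (`next_count`, `value_between`). Keep in mind that a computer stores decimals with finite precision, so the loop only illustrates the idea rather than proving it.

```python
def next_count(n: int) -> int:
    """For a count, the next possible value is unambiguous."""
    return n + 1


def value_between(low: float, high: float) -> float:
    """For a continuous quantity, another value lies between any two values."""
    return (low + high) / 2


# Discrete: three students, so the next possible value is four.
print(next_count(3))

# Continuous: every proposed "next" temperature after 81 can be undercut.
candidate = 82.0
for _ in range(5):
    candidate = value_between(81.0, candidate)
    print(candidate)
```

Each pass through the loop produces a temperature closer to 81 than the previous candidate, which is why no single “next” value can be named.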

Quantitative variables can also be divided into interval and ratio scales:

  • Interval scales have equal distances between values, but no true zero point (e.g., Celsius temperature—0°C doesn’t mean “no temperature”)

  • Ratio scales have equal distances between values and a meaningful zero that represents the absence of the quantity (e.g., height, weight, time)
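A short calculation shows why ratios are not meaningful on an interval scale. The temperatures below are arbitrary illustration values, and the Kelvin scale (which has a true zero) is brought in only for comparison.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert Celsius (interval scale) to Kelvin (ratio scale)."""
    return celsius + 273.15


warm_c, cool_c = 20.0, 10.0  # hypothetical temperatures in °C

# Differences are meaningful on both scales and agree with each other.
difference_c = warm_c - cool_c
difference_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# The Celsius ratio depends on the arbitrary zero point, so it is misleading.
naive_ratio = warm_c / cool_c
# On the Kelvin scale, zero means absence, so this ratio is interpretable.
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(difference_c, round(difference_k, 2))
print(naive_ratio, round(kelvin_ratio, 3))
```

The Celsius ratio suggests that 20°C is “twice as warm” as 10°C, but on a scale with a true zero the ratio is only about 1.035. The 10-degree difference, by contrast, is the same on both scales.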

Example 💡: Variable Type Depends on Both Nature and Context

When classifying a variable, the definition given by the data collector is just as important as the variable's naturally occurring properties. Take the final exam grades of an imaginary course, MATH 1234, for example.

  • Researcher 1 records the data as percentage scores after a curve has been applied (Variable 1).

  • Researcher 2 records the data as belonging to one of the intervals 0%-60%, 60%-70%, 70%-80%, or 80%-100% (Variable 2).

Although both variables come from the same source, they are now of distinct types:

  • Variable 1 is a continuous numerical variable on an interval scale. Because of the curve, a grade of 0% no longer represents a true zero, and a student with 80% did not necessarily perform twice as well as one with 40%.

  • Variable 2 is an ordinal categorical variable. Although a natural order exists among the intervals, no meaningful arithmetic operations are possible between them, indicating that they are not numerical values.

In general cases where no experimental details are given, you may focus on the variable’s natural properties. However, when additional context is available, it must be factored into how you classify it.
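The step from Variable 1 to Variable 2 can be sketched as a small recoding function. The scores are hypothetical, and the handling of boundary values is an assumption: the text does not say whether a score of exactly 60% belongs to 0%-60% or 60%-70%, so this sketch places it in the upper interval.

```python
def to_interval(score: float) -> str:
    """Recode a percentage score (Variable 1) as an interval label (Variable 2).

    Assumption: a score on a shared boundary goes to the upper interval.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score < 60:
        return "0%-60%"
    if score < 70:
        return "60%-70%"
    if score < 80:
        return "70%-80%"
    return "80%-100%"


scores = [55.0, 60.0, 72.5, 91.0]  # hypothetical Variable 1 values
categories = [to_interval(s) for s in scores]

print(categories)
```

After recoding, the values are labels rather than numbers. They can still be ranked, but subtracting or averaging them no longer makes sense, which is what makes Variable 2 ordinal.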

2.1.3. Bringing It All Together

In subsequent sections, we’ll discuss the appropriate visualization tools for each type of variable.

Key Takeaways 📝

  1. Always identify the who, what and why before graphing to provide essential context.

  2. Be able to classify a variable by considering both its natural properties and its specific usage in an experiment.

Exercises

  1. For each of the following variables, state whether it is categorical nominal, categorical ordinal, discrete numerical, or continuous numerical. For numerical variables, also determine whether they are on a ratio scale or an interval scale:

    • Blood type

    • SAT score

    • Number of pets owned

    • Daily rainfall