2.1. Data Set Structure and Variable Types

Imagine being handed a spreadsheet containing thousands of numbers—sales figures, temperature readings, or survey responses. Without organization and visualization, these numbers remain just that: a collection of digits offering little immediate insight.

A raw data file is essentially a laundry list of values. This chapter introduces you to the basic vocabulary of structured data sets and demonstrates how tables, pie charts, bar graphs, and histograms transform data values into intuitive patterns. We begin by establishing a common language for organizing and describing a data set.

Road Map 🧭

  • Define case and variable, and understand how they are organized in a rectangular data set.

  • Define categorical (qualitative) and numerical (quantitative) variables and learn their further divisions.

  • Understand that each variable type requires a different set of summarizing tools to reflect its unique structure.

2.1.1. Understanding the Structure of a Data Set

Fig. 2.1 Cases and variables in a rectangular spreadsheet

Before delving into visualization techniques, we need to understand how data is organized. A data set is typically arranged as a rectangular array where

  • rows represent cases or observations (individual entities such as people, cities, or products), and

  • columns represent variables (specific attributes of each case).
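As a minimal sketch of this layout, the Python snippet below stores a tiny data set as a list of rows and counts its cases and variables. The city labels, variable names, and values are all made up for illustration.

```python
# A tiny, hypothetical data set: each dictionary is one case (row),
# and each key is one variable (column). All values are invented.
data = [
    {"city": "City A", "population": 12000, "avg_temp_f": 52.3},
    {"city": "City B", "population": 48500, "avg_temp_f": 61.8},
    {"city": "City C", "population": 7300, "avg_temp_f": 47.1},
]

n_cases = len(data)                # number of rows
variables = list(data[0].keys())   # column names, taken from the first row
n_variables = len(variables)

# The data set is rectangular when every case records the same variables.
is_rectangular = all(list(row.keys()) == variables for row in data)

print(n_cases, "cases;", n_variables, "variables:", variables)
print("rectangular:", is_rectangular)
```

Storing each case as a dictionary keeps the variable names attached to the values, which makes the row and column roles easy to see in a small example.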

When examining any data set, always begin by asking the three fundamental questions:

  • Who? - What cases does the data describe? How many cases are there?

  • What? - Which variables are being measured, with what units, and on what scale? How many variables are there?

  • Why? - What question motivated the data, and are the variables appropriate for answering that question?

These questions may seem simple, but they provide essential context for any statistical analysis. Without understanding their answers, we risk misinterpreting results or drawing inappropriate conclusions.

2.1.2. Variable Classification

Variables come in different types, each with its own properties and appropriate summarizing methods. Fig. 2.2 illustrates the primary classification system:

Fig. 2.2 Variable types flow chart

Categorical (Qualitative) Variables

A categorical variable describes an attribute that can be classified into distinct categories. Typically, these categories are distinguished by names or labels and cannot be measured on a numerical scale. Categorical variables can be further divided into:

  • Nominal variables: categories with no inherent order (e.g., fruit types, car brands, hair colors)

  • Ordinal variables: categories with a meaningful order but uneven or unmeasurable spacing between values (e.g., education levels, survey responses on a scale from “strongly disagree” to “strongly agree”)
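The difference between the two subtypes can be sketched in a few lines of Python. The response and fruit values below are hypothetical, and the `LEVELS` list is simply one way to write down the order of an ordinal variable's categories.

```python
# Ordinal: the categories have a defined order, written here from lowest
# to highest. The responses are hypothetical.
LEVELS = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
responses = ["agree", "strongly disagree", "neutral", "agree"]

# Sorting an ordinal variable follows the category order,
# not alphabetical order.
sorted_responses = sorted(responses, key=LEVELS.index)

# Nominal: no order is implied, so counting how often each category
# occurs is the natural summary. The fruit values are hypothetical.
fruits = ["apple", "kiwi", "apple", "banana"]
fruit_counts = {fruit: fruits.count(fruit) for fruit in set(fruits)}

print(sorted_responses)
print(fruit_counts)
```

Note that the positions in `LEVELS` only encode order. Subtracting or averaging them would assume equal spacing between categories, which an ordinal variable does not guarantee.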

Numerical (Quantitative) Variables

A numerical variable records measurements or counts that can be expressed on a numerical scale. It can be subdivided into:

  • Discrete variables: variables whose possible outcomes are separate, distinct points with no possible values between consecutive units (e.g., number of books on a shelf, number of students in a classroom)

  • Continuous variables: variables that can take any value within a range, limited only by measurement precision (e.g., height, weight, temperature, time)

A Tip 🔎: When confused between discrete and continuous…🤔

Take any single value from the numerical variable. Then ask, can I clearly identify the “next” (or “previous”) point?

  • If yes, then discrete

  • If no, then continuous

For example:

  • If there are three students in a classroom now, what is the “next” possible value for the student count? Without question, it’s four. → discrete

  • If the current temperature is 81 degrees Fahrenheit, what is the “next” value it can be? 82? 81.1? 81.00001? Not clear. → continuous

A quantitative variable can also be classified as measured on an interval scale or a ratio scale:

Comparison of quantitative variables on interval vs ratio scale

| | Interval scale | Ratio scale |
|---|---|---|
| True zero | Has no true zero point. Its “zero” value does not imply the absence of the quantity being measured. | Has a meaningful zero that represents the absence of the quantity. |
| Ratios | Ratios of values are not meaningful. | Ratios of values are meaningful. |
| Examples | Celsius temperature: 0 degrees Celsius does not imply the absence of temperature. SAT score: if two students scored 800 and 1600 on the SAT, it does not mean that the second student did twice as well. | Height, weight, time, etc. If two animal species have average heights of 1 foot and 2 feet, respectively, then the second species is twice as tall as the first, on average. |
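One way to see why ratios fail on an interval scale is to change units and check whether the ratio survives. The sketch below uses two illustrative temperatures (10 and 20 degrees Celsius, chosen arbitrarily) and the 1-foot and 2-foot heights from the comparison above.

```python
def celsius_to_fahrenheit(c):
    """Convert a Celsius temperature to Fahrenheit."""
    return c * 9 / 5 + 32

# Interval scale: the "ratio" of two temperatures changes with the unit,
# because each unit places its zero at a different, arbitrary point.
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# Ratio scale: the ratio of two heights is the same in every unit,
# because zero means "no height" in all of them.
INCHES_PER_FOOT = 12
ratio_feet = 2 / 1
ratio_inches = (2 * INCHES_PER_FOOT) / (1 * INCHES_PER_FOOT)

print(ratio_celsius, ratio_fahrenheit)  # the two ratios disagree
print(ratio_feet, ratio_inches)         # the two ratios agree
```

The same pair of temperatures is 50 and 68 degrees Fahrenheit, so "twice as hot" in Celsius becomes a factor of 1.36 in Fahrenheit. The height ratio stays at 2 regardless of unit.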

Example 💡: Variable Type Depends on Both Nature and Context

When classifying a variable, its definition given by the data collector is just as important as its naturally occurring properties. Take the final exam grades of an imaginary course, MATH 1234, for example.

  • Researcher 1 records the data as percentage scores after a curve has been applied (Variable 1).

  • Researcher 2 records the data as belonging to one of the intervals 0%-60%, 60%-70%, 70%-80%, then 80%-100% (Variable 2).

Although both variables come from the same source, they are now of distinct types:

  • Variable 1 is a continuous numerical variable on an interval scale. Because of the curve, a grade of 0% no longer represents a true zero. Likewise, a student with 80% did not necessarily perform twice as well as one with 40%.

  • Variable 2 is an ordinal categorical variable. Although a natural order exists among the intervals, no meaningful arithmetic operations are possible between them, indicating that they are not numerical values.

In general cases where no experimental details are given, you may focus on the variable’s natural properties. However, when additional context is available, it must be factored into how you classify it.
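The relationship between the two researchers' variables can be sketched as a simple recoding step. The function name `to_grade_interval` and the scores are hypothetical. The text does not say which interval a boundary score such as exactly 60% belongs to, so this sketch assumes it goes to the higher interval.

```python
def to_grade_interval(score):
    """Map a percentage score (0-100) to one of Researcher 2's intervals.

    Assumption: a score exactly on a boundary (e.g. 60) is placed in the
    higher interval; the original description leaves this unspecified.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score < 60:
        return "0%-60%"
    if score < 70:
        return "60%-70%"
    if score < 80:
        return "70%-80%"
    return "80%-100%"

# Variable 1: numerical percentage scores (hypothetical values).
scores = [40.0, 65.5, 80.0, 92.25]

# Variable 2: the same grades recorded as ordered categories.
intervals = [to_grade_interval(s) for s in scores]

print(intervals)
```

After the recoding, the labels can still be ranked, but the exact scores cannot be recovered from them, which is why Variable 2 is treated as ordinal rather than numerical.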

2.1.3. Bringing It All Together

Key Takeaways 📝

  1. Always identify the who, what, and why of a data set.

  2. Be able to classify a variable by considering both its natural properties and its specific usage in an experiment.

Exercises

  1. For each of the following variables, state whether it is categorical nominal, categorical ordinal, discrete numerical, or continuous numerical. For numerical variables, also determine whether they are on a ratio scale or an interval scale:

    • Blood type

    • SAT score

    • Number of pets owned

    • Daily rainfall