3.1. Introduction to Numerical Summaries: Notation and Terminology

Numerical summaries provide a concise way to convey a data set’s key characteristics. While visual summaries give a holistic view, numerical summaries offer precise, compact descriptions that are easy to compare, analyze, and report. We begin this chapter by introducing the language needed to discuss these numerical measures in depth.

Road Map 🧭

Establish the notation we’ll use throughout the course.
Distinguish between population parameters and sample statistics.

3.1.1. Notation: The Language of Statistics

Before diving into summary measures, we need a consistent language to describe our data. Statisticians use specific notation to represent variables, observations, and statistical measures.

Variables and Observations

We denote variables with lowercase letters. The most common choices are \(x, y,\) and \(z\), but we can use other letters if we have more than three variables or if others are more appropriate for the context.

Sample Size

The lowercase letter \(n\) indicates the number of observations in a sample.

Individual Data Points

We represent an observation from a variable with the variable name subscripted by an index. For example, if variable \(x\) has \(n\) observations, their values can be denoted with

\[x_1,\; x_2,\; x_3,\;\ldots,\; x_n\]

The subscripts prevent confusion when individual observations are discussed. The order of data collection doesn’t usually matter when assigning subscripts; what’s important is that each observation is referenced by a unique index in a consistent manner.

When we refer to a single arbitrary observation, we often use the letter \(i\) as the index. For example, to express the sum of all data points, we write \(\sum_{i=1}^n x_i\).

Example 1 💡: Putting the Notation to Use

Suppose a data set of weight measurements, in pounds, is collected from randomly selected patients in Hospital A. The data set is:

\[\{301, 202, 101, 125, 131\}.\]

Use appropriate notation to describe the data set.

First, let \(w\) denote the variable of weight measurements. Since the data set contains five observations, \(n = 5\). Also,
- \(w_1 = 301\)
- \(w_2 = 202\)
- \(w_3 = 101\)
- \(w_4 = 125\)
- \(w_5 = 131\)

Write the formula for the arithmetic average (the sum of all values, divided by the number of values) in concise notation.

The formula can be written as

\[\frac{1}{n}\sum_{i=1}^n w_i\]
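
The notation in Example 1 can be mirrored in a few lines of Python. This is only a minimal sketch: the helper name `obs` is hypothetical, introduced here to translate the 1-based subscripts used in the text into Python's 0-based list positions.

```python
# Hospital A weights in pounds, copied from Example 1.
w = [301, 202, 101, 125, 131]

n = len(w)  # sample size n


def obs(i):
    """Return w_i, where i is the 1-based subscript used in the text."""
    return w[i - 1]  # Python lists start at position 0


# Sum of w_i for i = 1, ..., n, then divide by n.
total = sum(obs(i) for i in range(1, n + 1))
average = total / n

print(n, obs(1), obs(5), total, average)  # prints: 5 301 131 860 172.0
```

The loop index `i` plays the same role as the index in \(\sum_{i=1}^n w_i\): it stands for a single arbitrary observation.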

Multiple Groups

Sometimes we measure the same variable across different groups. In these cases, we use double subscript notation:

\[\bigl\{\,y_{11}, y_{12}, \ldots, y_{1n_1}\bigr\},\; \bigl\{\,y_{21}, y_{22}, \ldots, y_{2n_2}\bigr\},\; \ldots,\; \bigl\{\,y_{I1}, y_{I2}, \ldots, y_{In_I}\bigr\}\]

Here,

The first subscript (\(1, 2, \ldots, I\)) indicates the group.
The second subscript indicates the observation within a group.
Each group can have a different sample size, so we differentiate group sample sizes with subscripts: \(n_1, n_2, \ldots, n_I\).

Example 2 💡: Continue Practicing Notation

Suppose Hospital B and Hospital C also collected weight data from their randomly selected patients.

Hospital B: \(\{132, 215, 140, 149, 270, 192, 105\}\)
Hospital C: \(\{166, 128, 199\}\)

Use appropriate notation to represent the combined variable of weight measurements from Hospitals A, B, and C.

We still use \(w\) to denote the weight variable.
- Use group index 1 for Hospital A. Then \(n_1 = 5\), and
  
  \[w_{11}=301,\; w_{12}=202,\; \ldots,\; w_{15}=131.\]
- Use group index 2 for Hospital B. Then \(n_2 = 7\), and
  
  \[w_{21}=132,\; w_{22}=215,\; \ldots,\; w_{27}=105.\]
- Use group index 3 for Hospital C. Then \(n_3 = 3\), and
  
  \[w_{31}=166,\; w_{32}=128,\; w_{33}=199.\]

We won’t use the double-subscript notation often in the early chapters, but it will become important when we study multi-sample methods in Chapters 11 and 12.
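
One way to picture the double-subscript notation is as a list of lists, one inner list per group. The sketch below uses the data from Examples 1 and 2; the names `groups`, `w`, `num_groups`, and `sizes` are hypothetical and chosen only for illustration.

```python
# Group 1 = Hospital A, group 2 = Hospital B, group 3 = Hospital C.
groups = [
    [301, 202, 101, 125, 131],            # Hospital A
    [132, 215, 140, 149, 270, 192, 105],  # Hospital B
    [166, 128, 199],                      # Hospital C
]


def w(i, j):
    """Return w_ij: observation j within group i (both subscripts 1-based)."""
    return groups[i - 1][j - 1]


num_groups = len(groups)            # I, the number of groups
sizes = [len(g) for g in groups]    # n_1, n_2, ..., n_I

print(num_groups, sizes, w(2, 7), w(3, 3), sum(sizes))
# prints: 3 [5, 7, 3] 105 199 15
```

Because each inner list has its own length, the group sample sizes \(n_1, n_2, n_3\) do not need to match.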

3.1.2. Parameter vs Statistic

Recall from Section 1.2:

A population is the complete set of individuals we are interested in studying.
A sample is a subset of the population that we observe and measure.

These definitions lead to another key distinction that will remain central throughout the course: population parameters vs. sample statistics.

A sample statistic, or simply a statistic, is any quantity we compute from observed data to describe the sample.
Often, a statistic is also used as an estimate of a “true” quantity that characterizes the population. This population-level quantity is called a parameter.

To clearly indicate which of the two we are referring to, we use different families of symbols. By convention, we write a population parameter with a Greek letter and a sample statistic with a Latin letter. See the table below for some important examples:

Population Parameter			Sample Statistic
Name	Notation	How to read	Name	Notation	How to read
Population Mean	\(\mu\)	“mu”	Sample Mean	\(\bar{x}\)	“x bar”
Population Median	\(\tilde{\mu}\)	“mu tilde”	Sample Median	\(\tilde{x}\)	“x tilde”
Population Variance	\(\sigma^{2}\)	“sigma squared”	Sample Variance	\(s^{2}\)	“s squared”
Population Standard Deviation	\(\sigma\)	“sigma”	Sample Standard Deviation	\(s\)	“s”
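
The distinction can be made concrete with a small sketch that uses Python's standard `statistics` module. As a loud assumption for illustration only, the 15 pooled hospital weights are treated here as if they were an entire population; in practice a population is rarely fully observed. The variable names simply echo the symbols in the table. Note that `statistics.pvariance` divides by the number of values, while `statistics.variance` divides by one less than the number of values; this section does not define either formula.

```python
import random
import statistics

# ASSUMPTION: pretend these 15 weights are the whole population.
population = [301, 202, 101, 125, 131,
              132, 215, 140, 149, 270, 192, 105,
              166, 128, 199]

# Population parameters (Greek letters): computed from every member.
mu = statistics.mean(population)           # population mean
mu_tilde = statistics.median(population)   # population median
sigma2 = statistics.pvariance(population)  # population variance

# Draw a sample of n = 5 without replacement (seeded so reruns match).
rng = random.Random(0)
sample = rng.sample(population, 5)

# Sample statistics (Latin letters): computed from the sample only.
x_bar = statistics.mean(sample)      # sample mean
x_tilde = statistics.median(sample)  # sample median
s2 = statistics.variance(sample)     # sample variance

print("parameters:", mu, mu_tilde, sigma2)
print("statistics:", x_bar, x_tilde, s2)
```

The parameters stay fixed no matter which sample is drawn, whereas the statistics would generally change if a different sample were selected.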

Avoid the common mistake ‼️

Since most metrics have both a population and a sample version, always be sure to specify which one you mean by using the full expression.

❌	✔
The mean is larger than the median.	The sample mean is larger than the sample median.
The variance is unknown, so we must use data to estimate it.	The population variance is unknown, so we must use data to estimate it.

3.1.3. Bringing It All Together

Key Takeaways 📝

Consistent notation is essential for clear communication in statistics.
Subscripts are used to identify specific observations by their indices.
Usually, Greek letters denote population parameters; Latin letters denote sample statistics.