3.1. Introduction to Numerical Summaries: Notation and Terminology

In addition to our visual tools for understanding data, we can condense information into single numerical values that represent key characteristics of our dataset. While histograms and other plots provide a holistic view, numerical summaries offer precise, compact descriptions that can be easily compared, analyzed, and reported.

Road Map 🧭

  • Establish the notation we’ll use throughout the course.

  • Distinguish between population parameters and sample statistics.

3.1.1. Notation: The Language of Statistics

Before diving into summary measures, we need a consistent language to describe our data. Statisticians use specific notation to represent variables, observations, and statistical measures.

Variables and Observations

We denote variables with lowercase letters. The most common choices are \(x, y,\) and \(z\), but we might use other letters if we have more than three variables or if others are more appropriate for the context.

Sample Size

The lowercase letter \(n\) indicates how many observations are in our sample.

Individual Data Points

We represent individual observations for a variable with the variable name and a numerical subscript. If a variable \(x\) has \(n\) observations, their values can be denoted with

\[x_1,\; x_2,\; x_3,\;\ldots,\; x_n\]

The subscript indicates which observation we’re referring to. The order in which the data is collected doesn’t usually matter; what’s important is that we reference each observation with a consistent index.

When we refer to an arbitrary observation, we often use the letter \(i\) as the index. For example, to express the sum of all data points, we write \(\sum_{i=1}^n x_i\).
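
Written out term by term, the summation symbol is simply shorthand for adding every observation from the first to the last:

\[\sum_{i=1}^n x_i = x_1 + x_2 + x_3 + \cdots + x_n\]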

Example 1 💡: Putting the notations to use

Suppose a data set of weight measurements, in pounds, is collected from randomly selected patients in Hospital A. The data set is:

\[\{301, 202, 101, 125, 131\}.\]

  1. Use appropriate notation to describe the data set.

    First, let \(w\) denote the variable of weight measurements. Since the data set contains five observations, \(n = 5\). Also,

    • \(w_1 = 301\)

    • \(w_2 = 202\)

    • \(w_3 = 101\)

    • \(w_4 = 125\)

    • \(w_5 = 131\)

  2. Write the formula of the arithmetic average (the sum of all values, divided by the number of values) in concise notation.

    The formula can be written as

    \[\frac{1}{n}\sum_{i=1}^n w_i\]
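
The notation in Example 1 can also be checked numerically. The following is a minimal Python sketch, not part of the original example; the list name `w` and the helper `obs` are illustrative choices. Python lists are indexed from 0, so the observation \(w_i\) corresponds to `w[i - 1]`.

```python
# Hospital A weights (pounds) from Example 1.
w = [301, 202, 101, 125, 131]

# Sample size n is the number of observations.
n = len(w)


def obs(i):
    """Return w_i using the 1-based subscript from the text (hypothetical helper)."""
    return w[i - 1]


# Arithmetic average: (1/n) * sum of w_i for i = 1, ..., n
average = sum(obs(i) for i in range(1, n + 1)) / n

print(n)        # 5
print(obs(3))   # 101
print(average)  # 172.0
```

Here the sum of the five weights is 860, so the average is \(860 / 5 = 172\) pounds.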

Multiple Groups

Sometimes we measure the same variable across different groups. In these cases, we use double subscript notation:

\[\bigl\{\,y_{11}, y_{12}, \ldots, y_{1n_1}\bigr\},\; \bigl\{\,y_{21}, y_{22}, \ldots, y_{2n_2}\bigr\},\; \ldots,\; \bigl\{\,y_{I1}, y_{I2}, \ldots, y_{In_I}\bigr\}\]

Here:

  • The first subscript (\(1, 2, \ldots, I\)) indicates the group.

  • The second subscript indicates the observation within that group.

  • Each group can have a different sample size, so we differentiate group sample sizes with subscripts: \(n_1, n_2, \ldots, n_I\).

Example 2 💡: Continue practicing notation

(Continued from Example 1)

Suppose Hospital B and Hospital C also collected weight data from their randomly selected patients.

  • Hospital B: {132, 215, 140, 149, 270, 192, 105}

  • Hospital C: {166, 128, 199}

  1. Use appropriate notation to represent the combined variable of weight measurements from Hospitals A, B, and C.

    We still use \(w\) to denote the weight variable.

    • Use group index 1 for Hospital A. Then \(n_1 = 5\), and

      \[w_{11}=301,\; w_{12}=202,\; \ldots,\; w_{15}=131.\]

    • Use group index 2 for Hospital B. Then \(n_2 = 7\), and

      \[w_{21}=132,\; w_{22}=215,\; \ldots,\; w_{27}=105.\]

    • Use group index 3 for Hospital C. Then \(n_3 = 3\), and

      \[w_{31}=166,\; w_{32}=128,\; w_{33}=199.\]
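
One way to mirror the double-subscript notation in code is a dictionary keyed by group index. This is only a sketch under assumed names: the dictionary `w`, the helper `obs`, and `group_sizes` are illustrative and do not come from the text.

```python
# Weights (pounds) keyed by group index:
# 1 = Hospital A, 2 = Hospital B, 3 = Hospital C.
w = {
    1: [301, 202, 101, 125, 131],
    2: [132, 215, 140, 149, 270, 192, 105],
    3: [166, 128, 199],
}


def obs(i, j):
    """Return w_{ij}: observation j of group i, both 1-based (hypothetical helper)."""
    return w[i][j - 1]


# Group sample sizes n_1, n_2, n_3.
group_sizes = {i: len(values) for i, values in w.items()}

print(group_sizes)  # {1: 5, 2: 7, 3: 3}
print(obs(2, 7))    # 105
print(obs(3, 2))    # 128
```

The first argument of `obs` plays the role of the group subscript and the second the within-group subscript, so `obs(2, 7)` reads as \(w_{27}\).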

We won’t use the double-subscript notation often in the early chapters, but it will become important when we study multi-sample methods in Chapters 11 and 12.

3.1.2. Parameter vs Statistic

Recall from Section 1.2:

  • A population is the complete set of individuals we’re interested in studying.

  • A sample is the subset of the population that we observe and measure.

These definitions lead to another key distinction that will remain central throughout the course: population parameters vs. sample statistics.

  • A sample statistic, or simply a statistic, is any quantity we compute from observed data to describe the sample.

  • Often, a statistic is also used as an estimate for a corresponding “true” quantity that characterizes the population. This population-level quantity is called the parameter.

To clearly indicate which of the two we are referring to, we use different families of symbols. We write a population parameter with a Greek letter and a sample statistic with a Latin letter.

| Population Parameter | Notation | How to read | Sample Statistic | Notation | How to read |
|---|---|---|---|---|---|
| Population Mean | \(\mu\) | “mu” | Sample Mean | \(\bar{x}\) | “x bar” |
| Population Median | \(\tilde{\mu}\) | “mu tilde” | Sample Median | \(\tilde{x}\) | “x tilde” |
| Population Variance | \(\sigma^{2}\) | “sigma squared” | Sample Variance | \(s^{2}\) | “s squared” |
| Population Standard Deviation | \(\sigma\) | “sigma” | Sample Standard Deviation | \(s\) | “s” |
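
To make the sample side of this notation concrete, the sketch below computes the four sample statistics for the Hospital A weights from Example 1 using Python's standard `statistics` module. The variable names (`w_bar`, `w_tilde`, `s_squared`, `s`) are illustrative assumptions chosen to match the symbols; the formulas behind these functions are not derived in this section.

```python
import statistics

# Hospital A weights (pounds) from Example 1, treated as a sample.
w = [301, 202, 101, 125, 131]

w_bar = statistics.mean(w)          # sample mean, "w bar"
w_tilde = statistics.median(w)      # sample median, "w tilde"
s_squared = statistics.variance(w)  # sample variance, "s squared"
s = statistics.stdev(w)             # sample standard deviation, "s"

print(f"sample mean     = {w_bar:.1f}")      # sample mean     = 172.0
print(f"sample median   = {w_tilde:.1f}")    # sample median   = 131.0
print(f"sample variance = {s_squared:.1f}")  # sample variance = 6618.0
print(f"sample std dev  = {s:.2f}")          # sample std dev  = 81.35
```

These values describe only the five sampled patients. The corresponding parameters for all Hospital A patients (\(\mu\), \(\tilde{\mu}\), \(\sigma^{2}\), \(\sigma\)) remain unknown. The same module also provides `statistics.pvariance` and `statistics.pstdev`, which treat the data as an entire population.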

Avoid the common mistake ‼️

Since most measures have both a population version and a sample version, always specify which one you mean by using the full name.

  • Vague: “The mean is larger than the median.”

  • Precise: “The sample mean is larger than the sample median.”

  • Vague: “The variance is unknown, so we must use data to estimate it.”

  • Precise: “The population variance is unknown, so we must use data to estimate it.”

3.1.3. Bringing It All Together

These foundational concepts will serve as building blocks for the rest of Chapter 3, where we’ll dive deeper into specific summary measures and learn how to calculate, interpret, and apply them to real data.

Key Takeaways 📝

  1. Consistent notation is essential for clear communication in statistics.

  2. Subscripts are used to identify specific observations by their indices.

  3. Usually, Greek letters denote population parameters; Latin letters denote sample statistics.