3.1. Introduction to Numerical Summaries: Notation and Terminology

Numerical summaries provide a concise way to convey a data set’s key characteristics. While visual summaries give a holistic view, numerical summaries offer precise, compact descriptions that are easy to compare, analyze, and report. We begin this chapter by introducing the language needed to discuss these numerical measures in depth.

Road Map 🧭

  • Establish the notation we’ll use throughout the course.

  • Distinguish between population parameters and sample statistics.

3.1.1. Notation: The Language of Statistics

Before diving into summary measures, we need a consistent language to describe our data. Statisticians use specific notation to represent variables, observations, and statistical measures.

Variables and Observations

We denote variables with lowercase letters. The most common choices are \(x, y,\) and \(z\), but we can use other letters if we have more than three variables or if others are more appropriate for the context.

Sample Size

The lowercase letter \(n\) indicates the number of observations in a sample.

Individual Data Points

We represent an observation from a variable with the variable name subscripted by an index. For example, if variable \(x\) has \(n\) observations, their values can be denoted with

\[x_1,\; x_2,\; x_3,\;\ldots,\; x_n\]

The subscripts prevent confusion when individual observations are discussed. The order of data collection usually doesn’t matter when assigning subscripts; what matters is that each observation is referenced by a unique index in a consistent manner.

When we refer to a single arbitrary observation, we often use the letter \(i\) as the index. For example, to express the sum of all data points, we write \(\sum_{i=1}^n x_i\).
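
As a brief illustration of how the index \(i\) works, the summation can be expanded term by term. The second expression assumes a hypothetical sample of size \(n = 3\), chosen only for illustration:

\[\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n, \qquad \text{e.g.}\quad \sum_{i=1}^{3} x_i = x_1 + x_2 + x_3\]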

Example 1 💡: Putting the Notation to Use

Suppose a data set of weight measurements is collected, in pounds, from randomly selected patients in Hospital A. The data set is:

\[\{301, 202, 101, 125, 131\}.\]

  1. Use appropriate notation to describe the data set.

    First, let \(w\) denote the variable of weight measurements. Since the data set contains five observations, \(n = 5\). Also,

    • \(w_1 = 301\)

    • \(w_2 = 202\)

    • \(w_3 = 101\)

    • \(w_4 = 125\)

    • \(w_5 = 131\)

  2. Write the formula for the arithmetic average (the sum of all values, divided by the number of values) in concise notation.

    The formula can be written as

    \[\frac{1}{n}\sum_{i=1}^n w_i\]
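
    The notation in Example 1 maps directly onto a list in code. The following is a minimal sketch in Python, not part of the original example; the variable names are illustrative. Note that Python indexes from 0, so \(w_3\) in the text corresponds to `w[2]`.

```python
# Weights (in pounds) from Example 1; the list name `w` mirrors the variable w.
w = [301, 202, 101, 125, 131]

n = len(w)            # sample size n
w_3 = w[3 - 1]        # w_3 in the text is w[2] in Python (0-based indexing)
total = sum(w)        # the sum of w_i for i = 1, ..., n
average = total / n   # (1/n) times the sum of w_i

print(n, w_3, total, average)
```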

Multiple Groups

Sometimes we measure the same variable across different groups. In these cases, we use double subscript notation:

\[\bigl\{\,y_{11}, y_{12}, \ldots, y_{1n_1}\bigr\},\; \bigl\{\,y_{21}, y_{22}, \ldots, y_{2n_2}\bigr\},\; \ldots,\; \bigl\{\,y_{I1}, y_{I2}, \ldots, y_{In_I}\bigr\}\]

Here,

  • The first subscript (\(1, 2, \ldots, I\)) indicates the group.

  • The second subscript indicates the observation within a group.

  • Each group can have a different sample size, so we differentiate group sample sizes with subscripts: \(n_1, n_2, \ldots, n_I\).

Example 2 💡: Continue Practicing Notation

Suppose Hospital B and Hospital C also collected weight data from their randomly selected patients.

  • Hospital B: \(\{132, 215, 140, 149, 270, 192, 105\}\)

  • Hospital C: \(\{166, 128, 199\}\)

  1. Use appropriate notation to represent the combined variable of weight measurements from Hospitals A, B, and C.

    We still use \(w\) to denote the weight variable.

    • Use group index 1 for Hospital A. Then \(n_1 = 5\), and

      \[w_{11}=301,\; w_{12}=202,\; \ldots,\; w_{15}=131.\]

    • Use group index 2 for Hospital B. Then \(n_2 = 7\), and

      \[w_{21}=132,\; w_{22}=215,\; \ldots,\; w_{27}=105.\]

    • Use group index 3 for Hospital C. Then \(n_3 = 3\), and

      \[w_{31}=166,\; w_{32}=128,\; w_{33}=199.\]

We won’t use the double-subscript notation often in the early chapters, but it will become important when we study multi-sample methods in Chapters 11 and 12.
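
The double-subscript notation from Example 2 can be mirrored in code with a mapping from group index to that group's observations. This is only a sketch under that assumption; the helper `obs` is a hypothetical name introduced here to translate the 1-based subscripts of the text into Python's 0-based indexing.

```python
# Weights from Examples 1 and 2, keyed by group index (1 = A, 2 = B, 3 = C).
w = {
    1: [301, 202, 101, 125, 131],
    2: [132, 215, 140, 149, 270, 192, 105],
    3: [166, 128, 199],
}


def obs(i, j):
    """Return w_ij: observation j (1-based) within group i."""
    return w[i][j - 1]


# Group sample sizes n_1, n_2, n_3
n = {i: len(values) for i, values in w.items()}

print(n)           # group sizes
print(obs(2, 7))   # w_27
print(obs(3, 2))   # w_32
```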

3.1.2. Parameter vs Statistic

Recall from Section 1.2:

  • A population is the complete set of individuals we are interested in studying.

  • A sample is a subset of the population that we observe and measure.

These definitions lead to another key distinction that will remain central throughout the course: population parameters vs. sample statistics.

  • A sample statistic, or simply a statistic, is any quantity we compute from observed data to describe the sample.

  • Often, a statistic is also used as an estimate for a “true” quantity that characterizes the population. This population-level quantity is called a parameter.

To clearly indicate which of the two we are referring to, we use different families of symbols. We write a population parameter with a Greek letter and a sample statistic with a Latin letter. See the table below for some important examples:

| Population Parameter | Notation | How to read | Sample Statistic | Notation | How to read |
|---|---|---|---|---|---|
| Population Mean | \(\mu\) | “mu” | Sample Mean | \(\bar{x}\) | “x bar” |
| Population Median | \(\tilde{\mu}\) | “mu tilde” | Sample Median | \(\tilde{x}\) | “x tilde” |
| Population Variance | \(\sigma^{2}\) | “sigma squared” | Sample Variance | \(s^{2}\) | “s squared” |
| Population Standard Deviation | \(\sigma\) | “sigma” | Sample Standard Deviation | \(s\) | “s” |
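
The sample-side quantities in the table above can be computed from data. The sketch below uses Python's standard `statistics` module on the Hospital A weights from Example 1; the variable names are illustrative. The module provides separate functions for the sample variance (`variance`, which divides by \(n - 1\)) and the population variance (`pvariance`, which divides by \(n\)), echoing the parameter vs. statistic distinction. The population version would only be appropriate if the data made up the entire population of interest, which is not the case for this sample.

```python
import statistics

w = [301, 202, 101, 125, 131]  # Hospital A sample from Example 1

x_bar = statistics.mean(w)       # sample mean
x_tilde = statistics.median(w)   # sample median
s2 = statistics.variance(w)      # sample variance (divides by n - 1)
s = statistics.stdev(w)          # sample standard deviation

# Shown for contrast only: this treats the five values as a whole population.
sigma2 = statistics.pvariance(w)  # population variance (divides by n)

print(x_bar, x_tilde, s2, s, sigma2)
```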

Avoid the common mistake ‼️

Since most metrics have both a population and a sample version, always be sure to specify which one you mean by using the full expression.

Ambiguous: The mean is larger than the median.

Clear: The sample mean is larger than the sample median.

Ambiguous: The variance is unknown, so we must use data to estimate it.

Clear: The population variance is unknown, so we must use data to estimate it.

3.1.3. Bringing It All Together

Key Takeaways 📝

  1. Consistent notation is essential for clear communication in statistics.

  2. Subscripts are used to identify specific observations by their indices.

  3. Usually, Greek letters denote population parameters; Latin letters denote sample statistics.

3.1.4. Exercises

These exercises build your fluency with statistical notation and the critical distinction between population parameters and sample statistics.

Exercise 1: Basic Notation Practice

A quality control engineer measures the resistance (in ohms) of 6 randomly selected resistors from a production batch. The measurements are:

\[\{47.2, 46.8, 47.5, 47.1, 46.9, 47.3\}\]

  1. What is the sample size? Use appropriate notation.

  2. Let \(r\) denote the resistance variable. Write out the notation for each individual observation (\(r_1, r_2, \ldots\)).

  3. Write the formula for the sum of all resistance measurements using summation notation.

  4. Write the formula for the sample mean resistance using summation notation.

  5. If the engineer wanted to denote the true average resistance of all resistors in the production batch, what symbol should be used? Why?

Solution

Part (a): Sample Size

\(n = 6\)

There are 6 observations in the sample.

Part (b): Individual Observations

  • \(r_1 = 47.2\)

  • \(r_2 = 46.8\)

  • \(r_3 = 47.5\)

  • \(r_4 = 47.1\)

  • \(r_5 = 46.9\)

  • \(r_6 = 47.3\)

Part (c): Sum Using Summation Notation

\[\sum_{i=1}^{n} r_i = \sum_{i=1}^{6} r_i = r_1 + r_2 + r_3 + r_4 + r_5 + r_6\]

Part (d): Sample Mean Formula

\[\bar{r} = \frac{1}{n}\sum_{i=1}^{n} r_i = \frac{1}{6}\sum_{i=1}^{6} r_i\]

Part (e): True Average Resistance

The symbol \(\mu\) (or \(\mu_r\) to specify the variable) should be used.

Reason: The “true average resistance of all resistors in the production batch” refers to the population mean—a population parameter. Population parameters are denoted with Greek letters, while sample statistics use Latin letters. Since we’re referring to the entire population (all resistors in the batch), not just our sample of 6, we use the Greek letter mu (\(\mu\)).
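
The formulas in Parts (c) and (d) can be evaluated numerically. The following is a minimal sketch; the numeric values are not part of the original solution. It uses `math.fsum` from Python's standard library to limit floating-point error when adding decimals.

```python
import math

r = [47.2, 46.8, 47.5, 47.1, 46.9, 47.3]  # resistances in ohms

n = len(r)
total = math.fsum(r)   # the sum of r_i for i = 1, ..., n
r_bar = total / n      # sample mean

print(n, round(total, 4), round(r_bar, 4))
```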


Exercise 2: Interpreting Summation Expressions

Given a dataset with \(n = 4\) observations: \(x_1 = 10, x_2 = 15, x_3 = 12, x_4 = 8\).

Evaluate each of the following expressions:

  1. \(\displaystyle\sum_{i=1}^{4} x_i\)

  2. \(\displaystyle\sum_{i=1}^{4} x_i^2\)

  3. \(\displaystyle\left(\sum_{i=1}^{4} x_i\right)^2\)

  4. \(\displaystyle\sum_{i=1}^{4} (x_i - 11)\)

  5. Are your answers to parts (b) and (c) the same? Explain why this distinction matters.

Solution

Part (a): \(\sum_{i=1}^{4} x_i\)

\[\sum_{i=1}^{4} x_i = x_1 + x_2 + x_3 + x_4 = 10 + 15 + 12 + 8 = \mathbf{45}\]

Part (b): \(\sum_{i=1}^{4} x_i^2\)

This means: square each value first, then sum.

\[\sum_{i=1}^{4} x_i^2 = x_1^2 + x_2^2 + x_3^2 + x_4^2 = 10^2 + 15^2 + 12^2 + 8^2 = 100 + 225 + 144 + 64 = \mathbf{533}\]

Part (c): \(\left(\sum_{i=1}^{4} x_i\right)^2\)

This means: sum all values first, then square the result.

\[\left(\sum_{i=1}^{4} x_i\right)^2 = (10 + 15 + 12 + 8)^2 = 45^2 = \mathbf{2025}\]

Part (d): \(\sum_{i=1}^{4} (x_i - 11)\)

\[\sum_{i=1}^{4} (x_i - 11) = (10-11) + (15-11) + (12-11) + (8-11) = -1 + 4 + 1 + (-3) = \mathbf{1}\]

Part (e): Why (b) ≠ (c)

No, the answers are different: 533 ≠ 2025.

  • \(\sum x_i^2\) means “sum of squares” — square each value, then add

  • \((\sum x_i)^2\) means “square of the sum” — add all values, then square the total

This distinction is critical in statistics because variance calculations involve both expressions:

\[s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]\]

Confusing these terms leads to incorrect calculations of variance and standard deviation.
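
As a quick check, the two expressions and the variance formula above can be evaluated on the data from this exercise. This sketch compares the shortcut formula against `statistics.variance` from Python's standard library, which also divides by \(n - 1\).

```python
import statistics

x = [10, 15, 12, 8]
n = len(x)

sum_of_squares = sum(xi ** 2 for xi in x)  # square each value, then add
square_of_sum = sum(x) ** 2                # add all values, then square

# Shortcut formula for the sample variance, as written above
s2_shortcut = (sum_of_squares - square_of_sum / n) / (n - 1)

# Library value for comparison
s2_library = statistics.variance(x)

print(sum_of_squares, square_of_sum, s2_shortcut, s2_library)
```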


Exercise 3: Parameter vs. Statistic Identification

For each description below, determine whether the quantity described is a population parameter or a sample statistic. Then provide the appropriate symbol.

  1. The average GPA of all 45,000 students currently enrolled at Purdue University.

  2. The average GPA calculated from a random sample of 200 Purdue students.

  3. The standard deviation of tensile strength for 50 steel specimens tested in a lab.

  4. The true proportion of all manufactured chips that are defective.

  5. The median response time calculated from 1,000 recorded API calls to a server.

  6. The variance of fuel efficiency across all vehicles of a particular model ever produced.

Solution

Part (a): Average GPA of ALL 45,000 Students

Population parameter: \(\mu\)

This refers to the entire population of Purdue students. The population mean is denoted with the Greek letter mu.

Part (b): Average GPA from Sample of 200 Students

Sample statistic: \(\bar{x}\)

This is calculated from a subset (sample) of the population. The sample mean is denoted with “x-bar.”

Part (c): Standard Deviation of 50 Steel Specimens

Sample statistic: \(s\)

Unless these 50 specimens represent ALL steel specimens of interest (unlikely), this is a sample. The sample standard deviation uses the Latin letter s.

Part (d): True Proportion of ALL Defective Chips

Population parameter: \(p\) (or \(\pi\) in some textbooks)

The words “true” and “all” indicate this is the population proportion: the actual defect rate across all manufactured chips, not an estimate from a sample.

Part (e): Median Response Time from 1,000 API Calls

Sample statistic: \(\tilde{x}\)

The 1,000 recorded calls are a sample of all possible API calls. The sample median is denoted with “x-tilde.”

Part (f): Variance of Fuel Efficiency for ALL Vehicles of a Model

Population parameter: \(\sigma^2\)

“All vehicles of a particular model ever produced” defines the entire population. The population variance uses sigma squared.


Exercise 4: Correcting Notation Mistakes

A student writes the following statements in their statistics report. Identify and correct any notation or terminology errors.

  1. “The mean of our sample was μ = 75.3.”

  2. “We calculated s² = 12.4 for the population variance.”

  3. “The standard deviation is larger than expected.”

  4. “Using summation notation: \(\sum_{i=1}^{10} = 250\).”

  5. “The variance was σ² = 16, calculated from our 30 data points.”

Solution

Part (a): Error in Mean Notation

Error: Using μ (Greek letter) for a sample quantity.

Correction: “The mean of our sample was \(\bar{x} = 75.3\).”

Since this is calculated from a sample, we must use the Latin letter notation (x-bar), not the Greek letter μ, which is reserved for the population mean.

Part (b): Error in Variance Notation

Error: Using s² (Latin letter) for a population quantity.

Correction: “We calculated \(\sigma^2 = 12.4\) for the population variance.”

Alternatively, if the student actually calculated from sample data: “We calculated \(s^2 = 12.4\) for the sample variance.”

The notation and the description must match.

Part (c): Ambiguous Terminology

Error: Not specifying whether this is the population or sample standard deviation.

Correction: “The sample standard deviation is larger than expected.” OR “The population standard deviation is larger than expected.”

Always specify which version you mean to avoid ambiguity.

Part (d): Incomplete Summation Notation

Error: The summation is missing the variable being summed.

Correction: “\(\sum_{i=1}^{10} x_i = 250\).”

The notation must specify what is being summed (in this case, \(x_i\)).

Part (e): Mismatched Notation and Data Source

Error: Using σ² (population notation) when the value was calculated from sample data.

Correction: “The variance was \(s^2 = 16\), calculated from our 30 data points.”

When we calculate from observed data (a sample), we must use sample notation. We can only know σ² if we have information about the entire population, which is rare.


Exercise 5: Multiple Groups Notation

A pharmaceutical company tests a new drug at three different dosage levels. The improvement scores (on a 0-100 scale) for patients in each group are:

  • Low dose (Group 1): 12, 15, 18, 14

  • Medium dose (Group 2): 25, 28, 22, 30, 27

  • High dose (Group 3): 35, 42, 38, 40, 45, 41

  1. Write the sample sizes for each group using appropriate notation.

  2. Using double-subscript notation, write the notation for the third observation in Group 2.

  3. Write the formula for the mean improvement score in Group 1 using summation notation.

  4. Write an expression for the total number of patients across all three groups.

  5. Write the formula for the overall mean improvement score across all patients (hint: you’ll need to account for different group sizes).

Solution

Part (a): Sample Sizes

  • \(n_1 = 4\) (Low dose)

  • \(n_2 = 5\) (Medium dose)

  • \(n_3 = 6\) (High dose)

Part (b): Third Observation in Group 2

\(y_{23} = 22\)

The first subscript (2) indicates Group 2 (Medium dose), and the second subscript (3) indicates the third observation within that group.

Part (c): Mean for Group 1

\[\bar{y}_1 = \frac{1}{n_1}\sum_{j=1}^{n_1} y_{1j} = \frac{1}{4}\sum_{j=1}^{4} y_{1j} = \frac{y_{11} + y_{12} + y_{13} + y_{14}}{4}\]

Calculating: \(\bar{y}_1 = \frac{12 + 15 + 18 + 14}{4} = \frac{59}{4} = 14.75\)

Part (d): Total Number of Patients

\[N = n_1 + n_2 + n_3 = \sum_{i=1}^{3} n_i = 4 + 5 + 6 = 15\]

(We often use capital \(N\) for the total sample size when combining groups.)

Part (e): Overall Mean

The overall mean must weight each group by its size:

\[\bar{y} = \frac{1}{N}\sum_{i=1}^{3}\sum_{j=1}^{n_i} y_{ij} = \frac{\sum_{i=1}^{3} n_i \bar{y}_i}{N}\]

Or equivalently:

\[\bar{y} = \frac{n_1\bar{y}_1 + n_2\bar{y}_2 + n_3\bar{y}_3}{n_1 + n_2 + n_3}\]

Calculating the group means:

  • \(\bar{y}_1 = 59/4 = 14.75\)

  • \(\bar{y}_2 = 132/5 = 26.4\)

  • \(\bar{y}_3 = 241/6 \approx 40.1667\)

Overall mean:

\[\bar{y} = \frac{4(14.75) + 5(26.4) + 6(40.166667)}{15} = \frac{59 + 132 + 241}{15} = \frac{432}{15} = 28.8\]
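
Both routes to the overall mean can be checked in code. This is a sketch, assuming the groups are stored in a dictionary keyed by group index; the variable names are illustrative. It carries unrounded group means through the weighted calculation.

```python
# Improvement scores from Exercise 5, keyed by group index.
y = {
    1: [12, 15, 18, 14],          # low dose
    2: [25, 28, 22, 30, 27],      # medium dose
    3: [35, 42, 38, 40, 45, 41],  # high dose
}

n = {i: len(scores) for i, scores in y.items()}  # n_1, n_2, n_3
N = sum(n.values())                              # total sample size
group_mean = {i: sum(scores) / n[i] for i, scores in y.items()}

# Route 1: double sum over all observations, divided by N
overall_direct = sum(sum(scores) for scores in y.values()) / N

# Route 2: group means weighted by group size
overall_weighted = sum(n[i] * group_mean[i] for i in y) / N

print(N, group_mean, overall_direct, overall_weighted)
```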

Note

Precision and rounding: To avoid rounding drift, keep full precision in intermediate calculations. Final numeric answers should be reported to at least 4 decimal places (unless the question states otherwise). Do not round early (for example, rounding group means, standard errors, or test statistics and then reusing those rounded values). Instead, carry unrounded values through the computation (ideally keeping 6 or more decimals internally) and round only at the final step.


Exercise 6: Conceptual Understanding

Answer the following conceptual questions about parameters and statistics.

  1. Explain why we typically cannot know the exact value of a population parameter.

  2. If we collect a sample and calculate \(\bar{x} = 50\), can we conclude that \(\mu = 50\)? Why or why not?

  3. A researcher claims: “I surveyed every single employee in my company (all 150 of them) and found the average satisfaction score to be 7.2.” Should this be reported as \(\bar{x} = 7.2\) or \(\mu = 7.2\)? Explain.

  4. Why do statisticians use different symbols (Greek vs. Latin letters) to distinguish parameters from statistics?

  5. In the context of statistical inference, what is the relationship between sample statistics and population parameters?

Solution

Part (a): Why Parameters Are Typically Unknown

We typically cannot know the exact value of a population parameter because:

  1. Populations are often very large or infinite — measuring every individual is impractical or impossible

  2. Cost and time constraints — complete enumeration is usually too expensive

  3. Destructive testing — some measurements destroy the item (e.g., testing battery life until failure)

  4. Dynamic populations — populations may change while we’re measuring them

Instead, we use sample statistics to estimate population parameters.

Part (b): Can We Conclude μ = 50?

No, we cannot conclude that \(\mu = 50\) just because \(\bar{x} = 50\).

Reasons:

  • The sample mean \(\bar{x}\) is an estimate of \(\mu\), not the exact value

  • Different samples would likely produce different values of \(\bar{x}\)

  • Sampling variability means our estimate has uncertainty

  • We can only say that \(\bar{x} = 50\) is our best point estimate of \(\mu\)

Later chapters will introduce confidence intervals and hypothesis tests to quantify this uncertainty.
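
Sampling variability can be seen in a small simulation. The sketch below is illustrative only: the population is hypothetical (10,000 simulated values), and the sample size of 25 and the fixed seed are arbitrary choices. Because we construct the population ourselves, \(\mu\) is known here, which is rarely true in practice.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population (an assumption made for illustration only)
population = [random.gauss(50, 10) for _ in range(10_000)]
mu = statistics.mean(population)  # known only because we built the population

# Draw five independent samples of size 25 and compute x-bar for each
sample_means = [
    statistics.mean(random.sample(population, 25))
    for _ in range(5)
]

print(f"mu = {mu:.4f}")
for k, x_bar in enumerate(sample_means, start=1):
    print(f"sample {k}: x_bar = {x_bar:.4f}")
```

Each run of the loop gives a different \(\bar{x}\), all scattered around the single fixed value of \(\mu\).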

Part (c): Census vs. Sample

This should be reported as \(\mu = 7.2\) (population parameter).

Reason: The researcher surveyed every single employee in the company. This is a census, not a sample. When you measure the entire population of interest, the calculated value IS the population parameter, not an estimate of it.

Key consideration: The answer depends on how you define the population. If the population is “all current employees,” then 7.2 is \(\mu\). If the population is “all employees past, present, and future,” then it would be \(\bar{x}\).

Part (d): Why Different Symbols?

Statisticians use different symbols to:

  1. Prevent confusion — clearly distinguish between known (calculated) and unknown (true) quantities

  2. Communicate precision — readers immediately know whether a value is exact or estimated

  3. Maintain mathematical rigor — formulas involving parameters vs. statistics often differ (e.g., dividing by \(n\) vs. \(n-1\))

  4. Guide interpretation — helps researchers avoid overstating conclusions

Part (e): Relationship in Statistical Inference

In statistical inference:

  • Sample statistics are used to estimate population parameters

  • Statistics are observable and calculable from data

  • Parameters are fixed but typically unknown, because the full population is rarely observed

  • The goal of inference is to draw conclusions about parameters based on statistics

  • Methods like confidence intervals quantify how close we believe our statistic is to the parameter

  • Hypothesis tests assess whether data are consistent with hypothesized parameter values

The entire framework of statistical inference rests on this relationship: using what we can measure (statistics) to learn about what we want to know (parameters).


3.1.5. Additional Practice Problems

True/False Questions (1 point each)

  1. The symbol \(\bar{x}\) represents the population mean.

    Ⓣ or Ⓕ

  2. If \(n = 5\) and the observations are \(x_1, x_2, x_3, x_4, x_5\), then \(\sum_{i=1}^{5} x_i\) represents the sum of all observations.

    Ⓣ or Ⓕ

  3. Greek letters are used to denote sample statistics.

    Ⓣ or Ⓕ

  4. The expression \(\sum_{i=1}^{n} x_i^2\) is equivalent to \(\left(\sum_{i=1}^{n} x_i\right)^2\).

    Ⓣ or Ⓕ

  5. A sample statistic is used to estimate a population parameter.

    Ⓣ or Ⓕ

  6. The notation \(s^2\) represents the population variance.

    Ⓣ or Ⓕ

Multiple Choice Questions (2 points each)

  1. A researcher calculates the average height of 100 randomly selected adults to be 68.5 inches. Which notation correctly represents this value?

    Ⓐ \(\mu = 68.5\)

    Ⓑ \(\bar{x} = 68.5\)

    Ⓒ \(\sigma = 68.5\)

    Ⓓ \(s = 68.5\)

  2. In double-subscript notation, \(y_{34}\) represents:

    Ⓐ The product of the 3rd and 4th observations

    Ⓑ The 34th observation in the dataset

    Ⓒ The 4th observation in the 3rd group

    Ⓓ The 3rd observation in the 4th group

  3. Which of the following correctly matches the symbol with its description?

    Ⓐ \(\sigma\): sample standard deviation

    Ⓑ \(\tilde{x}\): population median

    Ⓒ \(\mu\): population mean

    Ⓓ \(s^2\): population variance

  4. Given \(x_1 = 2, x_2 = 4, x_3 = 6\), what is \(\sum_{i=1}^{3}(x_i - 4)^2\)?

    Ⓐ 0

    Ⓑ 8

    Ⓒ 12

    Ⓓ 144

Answers to Practice Problems

True/False Answers:

  1. False: \(\bar{x}\) represents the sample mean. The population mean is denoted \(\mu\).

  2. True: This is the correct interpretation of summation notation: sum all values from \(i = 1\) to \(i = 5\).

  3. False: Greek letters denote population parameters. Latin letters denote sample statistics.

  4. False: These are different operations. \(\sum x_i^2\) is the “sum of squares” (square each, then add). \((\sum x_i)^2\) is the “square of the sum” (add all, then square).

  5. True: This is a fundamental concept in statistical inference: statistics estimate parameters.

  6. False: \(s^2\) represents the sample variance. The population variance is \(\sigma^2\).

Multiple Choice Answers:

  1. Ⓑ (\(\bar{x} = 68.5\)): Since this is calculated from a sample of 100 adults, it’s a sample mean, denoted \(\bar{x}\).

  2. Ⓒ: In double-subscript notation, the first subscript indicates the group and the second indicates the observation within that group. So \(y_{34}\) is the 4th observation in the 3rd group.

  3. Ⓒ: \(\mu\) correctly represents the population mean. The other options are incorrect: \(\sigma\) is population (not sample) standard deviation; \(\tilde{x}\) is sample (not population) median; \(s^2\) is sample (not population) variance.

  4. Ⓑ (8): Calculate step by step:

    • \((x_1 - 4)^2 = (2-4)^2 = (-2)^2 = 4\)

    • \((x_2 - 4)^2 = (4-4)^2 = 0^2 = 0\)

    • \((x_3 - 4)^2 = (6-4)^2 = 2^2 = 4\)

    • Sum: \(4 + 0 + 4 = 8\)
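
The same step-by-step calculation can be written as a short sketch in code; the variable names are illustrative.

```python
x = [2, 4, 6]
center = 4  # the constant subtracted inside the sum

squared_deviations = [(xi - center) ** 2 for xi in x]  # (x_i - 4)^2 for each i
total = sum(squared_deviations)

print(squared_deviations, total)
```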