.. _3-1-intro-numerical-summaries: .. raw:: html
Introduction to Numerical Summaries: Notation and Terminology =============================================================== In addition to our visual tools for understanding data, we can condense information into single numerical values that represent key characteristics of our dataset. While histograms and other plots provide a holistic view, numerical summaries offer precise, compact descriptions that can be easily compared, analyzed, and reported. .. admonition:: Road Map 🧭 :class: important * Establish the **notation** we'll use throughout the course. * Distinguish between **population parameters** and **sample statistics**. Notation: The Language of Statistics -------------------------------------- Before diving into summary measures, we need a consistent **language** to describe our data. Statisticians use specific notation to represent variables, observations, and statistical measures. Variables and Observations ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We denote **variables with lowercase letters**. The most common choices are :math:`x, y,` and :math:`z`, but we might use other letters if we have more than three variables or if others are more appropriate for the context. Sample Size ~~~~~~~~~~~~~~~ The lowercase letter :math:`n` indicates how many observations are in our sample. Individual Data Points ~~~~~~~~~~~~~~~~~~~~~~~~~~ We represent individual observations for a variable with the variable name and a numerical subscript. If a variable :math:`x` has :math:`n` observations, their values can be denoted with .. math:: x_1,\; x_2,\; x_3,\;\ldots,\; x_n The subscript indicates which observation we're referring to. The order in which the data is collected doesn't usually matter; what's important is that we reference each observation with a consistent index. When we refer to an arbitrary observation, we often use the letter :math:`i` as the index. For example, to express the sum of all data points, we write :math:`\sum_{i=1}^n x_i`. .. admonition:: Example 1 💡: Putting the notations to use :class: note Suppose a data set of weight measurements is collected, in pounds, from randomly selected patients in Hospital A. The dataset is: .. math:: \{301, 202, 101, 125, 131\}. #. Use appropriate notation to describe the data set. First, let :math:`w` denote the variable of weight measurements. Since the data set contains five observations, :math:`n = 5`. Also, * :math:`w_1 = 301` * :math:`w_2 = 202` * :math:`w_3 = 101` * :math:`w_4 = 125` * :math:`w_5 = 131` #. Write the **formula** of the arithmetic average (the sum of all values, divided by the number of values) in concise notation. The formula can be written as .. math:: \frac{1}{n}\sum_{i=1}^n w_i Multiple Groups ~~~~~~~~~~~~~~~~~~~ Sometimes we measure the same variable across different groups. In these cases, we use double subscript notation: .. math:: \bigl\{\,y_{11}, y_{12}, \ldots, y_{1n_1}\bigr\},\; \bigl\{\,y_{21}, y_{22}, \ldots, y_{2n_2}\bigr\},\; \ldots,\; \bigl\{\,y_{I1}, y_{I2}, \ldots, y_{In_I}\bigr\} Here: * The first subscript (:math:`1, 2, \cdots, I`) indicates the group. * The second subscript indicates the observation within that group. * Each group can have a different sample size, so we differentiate group sample sizes with subscripts: :math:`n_1, n_2, \cdots, n_I`. .. admonition:: Example 2 💡: Continue practicing notation :class: note (Continued from Example 1) Suppose **Hospital B** and **Hospital C** also collected weight data from their randomly selected patients. * Hospital B: {132, 215, 140, 149, 270, 192, 105} * Hospital C: {166, 128, 199} #. Use appropriate notation to represent the **combined variable** of weight measurements from Hospitals A, B, and C. We still use :math:`w` to denote the weight variable. * Use group index of 1 for Hospital A. Then :math:`n_1 = 5`, and .. math:: w_{11}=301, w_{12}=202, \cdots, w_{15}=131. * Use group index 2 for Hospital B. :math:`n_2 = 7`, and .. math:: w_{21}=132, w_{22}=215, \cdots, w_{27}=105. * Use group index 3 for Hospital C. :math:`n_3 = 3`, and .. math:: w_{31}=166, w_{32}=128, w_{33}=199. We won't use the double-subscript notation often in the early chapters, but it will become important when we study multi-sample methods in Chapters 11 and 12. Parameter vs Statistic --------------------------------------------------- Recall from **Section 1.2**: * A **population** is the complete set of individuals we're interested in studying. * A **sample** is the subset of the population that we observe and measure. These definitions lead to another key distinction that will remain central throughout the course: **population paramter** vs **sample statistics**. * A **sample statistic**, or simply a **statistic**, is any quantity we compute from observed data to **describe the sample**. * Often, a statistic is also used as an estimate for a corresponding "true" quantity that charaterizes the population. This population-level quantity is called the **parameter**. To clearly indicate which of the two we are referring to, we use different families of symbols. We write a population parameter with a *Greek letter* and a sample statistic with a *Latin letter*. .. flat-table:: :widths: 1 1 1 1 1 1 :width: 90% :align: center :header-rows: 2 * - :cspan:`2` Population Parameter - :cspan:`2` Sample Statistic * - Name - Notation - How to read - Name - Notation - How to read * - Population Mean - :math:`\mu` - "mu" - Sample Mean - :math:`\bar{x}` - "x bar" * - Population Median - :math:`\tilde{\mu}` - "mu tilde" - Sample Median - :math:`\tilde{x}` - "x tilde" * - Population Variance - :math:`\sigma^{2}` - "sigma squared" - Sample Variance - :math:`s^{2}` - "s squared" * - Population Standard Deviation - :math:`\sigma` - "sigma" - Sample Standard Deviation - :math:`s` - "s" .. admonition:: Avoid the common mistake ‼️ :class: danger Since most metrics have both a population and a sample version, always be sure to specify which one you mean by using the full expression. .. flat-table:: :width: 100% * - ❌ - ✔ * - The **mean** is larger than the **median**. - The **sample mean** is larger than the **sample median**. * - The **variance** is unknown, so we must use data to estimate it. - The **population variance** is unknown, so we must use data to estimate it. Bringing It All Together --------------------------- These foundational concepts will serve as building blocks for the rest of Chapter 3, where we'll dive deeper into specific summary measures and learn how to calculate, interpret, and apply them to real data. .. admonition:: Key Takeaways 📝 :class: important #. **Consistent notation is essential** for clear communication in statistics. #. Subscripts are used to identify specific observations by their indices. #. Usually, Greek letters denote population parameters; Latin letters denote sample statistics.