.. _3-4-measures-of-variability-IQR-and-5-number-summary: .. raw:: html

Measures of Variability - Interquartile Range and Five-Number Summary ========================================================================== While the sample variance and standard deviation provide useful measures of spread, they can be heavily influenced by extreme values. In situations where data is skewed or contains unusual observations, we need measures that are less sensitive to individual points. This is where the interquartile range (IQR) and five-number summary come into play—measures that focus on the middle portion of the data. .. admonition:: Road Map 🧭 :class: important • Understand quartiles and percentiles as ways to divide ordered data. • Calculate the **interquartile range (IQR)** as a robust measure of spread. • Identify **explicit points** using **the upper and lower fences**. • Create and interpret **modified box plots**. • Distinguish between **explicit points** and **real outliers**. Preliminaries: Percentiles and Quartiles ---------------------------------------------- Sample Percentiles ~~~~~~~~~~~~~~~~~~~~~~~ For a number :math:`p` between 0 and 100, the :math:`p` th **sample percentile** of a variable is the value such that :math:`p\%` of observations are less than or equal to that value. For example: * The 10th percentile is a value such that 10% of observations are less than or equal to it. * The 99h percentile is a value such that 99% of observations are less than or equal to it. Sample Quartiles ~~~~~~~~~~~~~~~~~~~~~ **Sample quartiles** are the three special percentiles which divide the data into four equally sized parts, each containing approximately 25% of the data points. They consist of * :math:`Q_1` **(first quartile)**: The 25th percentile * :math:`Q_2` **(second quartile)**: The 50th percentile. This is also the sample median. * :math:`Q_3` **(third quartile)**: The 75th percentile .. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter3/quartiles.png :alt: Distribution divided by quartiles :align: center :figwidth: 80% *A distribution divided into four equal parts by the quartiles* Calculating Sample Quartiles ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Sample quartiles can be found by "computing three different medians." The specific steps are as follows: 1. Compute the sample median of the whole data set. This is the second quartile, or :math:`Q_2`. 2. The data is now divided into two equally sized subsets, from the minimum to the median, then from the median to the maximum. Include the median in both subsets if it is **on** a data point. Compute the sample median of the first subset. This is the first quartile, :math:`Q_1`, for the whole data. 3. Likewise, compute the median for the second subset. This is the third quartile, :math:`Q_3`, for the whole data. .. admonition:: Example 💡: Computing the sample quartiles :class: note The table below displays the time to promotion, in months, for 19 randomly sampled software engineers at an IT firm. Compute the sample quartiles, both by hand and using R. .. flat-table:: :align: center :width: 100% * - 7 - 12 - 14 - 14 - 14 - 18 - 21 - 22 - 23 - 24 * - 25 - 34 - 34 - 37 - 47 - 49 - 64 - 100 - 150 - **Calculating the sample quartiles by hand** 1. :math:`n=19`. The sample median of the data set is the 10th smallest value. :math:`Q_2 = 24`. 2. Let us now consider the first half of the data, from 7 to 24. There are 10 (even) data points in the subset. Therefore, we take the average between the 5th and the 6th data points as the subset's sample median. :math:`Q_1 = \frac{14+18}{2} = 16`. 3. Repeating Step 2 for the second subset (24 to 150), we find :math:`Q_3 = \frac{37+47}{2} = 42`. Confirm that the four sections created by :math:`Q_1, Q_2` and :math:`Q_3` are equally sized. **Calculating the sample quartiles in R** In R, the `quantile()` function is used to calculate percentiles (quantiles are simply percentiles on a 0-1 scale instead of 0-100.) By default, it returns the **five-number summary** (minimum, Q₁, median, Q₃, maximum): .. code-block:: r # Example dataset x <- c(7, 12, 14, 14, 14, 18, 21, 22, 23, 24, 25, 34, 34, 37, 47, 49, 64, 100, 150) # Get quartiles (five-number summary) quantile(x) # Output: # 0% 25% 50% 75% 100% # 7.0 16.0 24.0 42.0 150.0 To access quartiles, you can use the following syntax: .. code-block:: r # Get Q1 (25th percentile) quantile(x)["25%"] # Get Q3 (75th percentile) quantile(x)["75%"] You can also request specific percentiles other than the default: .. code-block:: r # Calculate the 10th, 20th, 30th percentiles quantile(x, probs = c(0.1, 0.2, 0.3)) The Interquartile Range (IQR) -------------------------------- The **interquartile range (IQR)** represents the spread of the data using **the width of the middle 50%**. It is calculated as the difference between the third quartile (:math:`Q_3`) and the first quartile (:math:`Q_1`): .. math:: \text{IQR} = Q_3 - Q_1. .. Properties of IQR ~~~~~~~~~~~~~~~~~~~~~~ It is resistant to values at the extreme ends since it focuses only on the middle 50% of the data. .. admonition:: Example 💡: Calculating the IQR :class: note For the *number of months to promotion* data, compute the IQR. From the previous example, :math:`Q_3 = 42` and :math:`Q_1 = 16`. Then .. math:: \text{IQR} = 42 - 16 = 26. This tells us that the middle 50% of the data spans 26 months, giving us a sense of the typical spread for time to promotion. R provides a built-in function to calculate the IQR. .. code-block:: r # Calculate IQR IQR(x) # Alternative calculation q <- quantile(x) q["75%"] - q["25%"] Five-Number Summary --------------------- The **five-number summary** provides a concise overview of a dataset's distribution by reporting the sample quartiles, together with the data's minimum and maximum. This summary gives a **comprehensive picture of the center** (:math:`Q_2`), **spread** (:math:`Q_1` and :math:`Q_3`), **and extremes** (min and max) of the data. See the first code block of the previous example for an instance of a five-number summary. Identifying Explicit Points with Fences ----------------------------------------- One important application of the IQR is to identify potential outliers or "explicit points" in the data. We use what's called the **IQR fules** to establish "fences" beyond which observations are flagged for further inspection. * **Inner fences** are computed using the **1.5 × IQR rule**: * Lower inner fence = Q₁ - 1.5 × IQR * Upper inner fence = Q₃ + 1.5 × IQR * **Outer fences** are computed using the **3 × IQR rule**: * Lower outer fence = Q₁ - 3 × IQR * Upper outer fence = Q₃ + 3 × IQR Points that fall between the inner and outer fences are considered **mild explicit points**, while those beyond the outer fences are **extreme explicit points**. .. admonition:: Example 💡: Identifying the explicit points :class: note For the *number of months to promotion* data, identify the inner and outer fences, and identify any mild and extreme explicit points. .. flat-table:: :align: center :width: 100% * - 7 - 12 - 14 - 14 - 14 - 18 - 21 - 22 - 23 - 24 * - 25 - 34 - 34 - 37 - 47 - 49 - 64 - 100 - 150 - From the previous exmaple, IQR = 26. * Lower inner fence: 16 - 1.5 × 26 = -23 * Upper inner fence: 42 + 1.5 × 26 = 81 * Lower outer fence: 16 - 3 × 26 = -62 * Upper outer fence: 42 + 3 × 26 = 120 Since the value 100 exceeds the upper inner fence (81) but falls below the upper outer fence (120), it is classified as a mild explicit point. The value 150 exceeds the upper outer fence (120), making it an extreme explicit point. There are no lower explicit points in the data. Modified Box Plots -------------------- A **modified box plot** is a visual representation of **the five-number summary** and the explicit points identified by the **1.5 x IQR** rule. It consists of 1. A **box** that spans from Q₁ to Q₃, representing the middle 50% of the data, 2. A **line** inside the box marking the median (Q₂), 3. **Whiskers** extending from the box to the most extreme data points that are not explicit points, and 4. **Explicit points** as dots beyond the whiskers. .. _box-plot: .. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter3/labeled_boxplot.png :alt: Modified box plot with components labeled :figwidth: 80% A modified box plot with labeled components The dot inside the box represents the sample mean of the data. While it is not a formal component of a boxplot, we often include it for a more comprehensive view of the data distribution. .. admonition:: Why do we call it *modified*? :class: important The box plots introduced in this course are a modified version of the basic form, which does not account for explicit points. You are not expected to know or use the basic version. Whenever we refer to a box plot, we always mean the modified version. .. admonition:: Example 💡: Creating a modified box plot :class: note For the *number of months to promotion* data set, draw a modified box plot both by hand and using R. Then interpret the data's distribution. **Drawing a modified box plot by hand** #. Draw a horizontal or vertical axis that covers the full range of the data. #. Draw a line thruough Q₁ and Q₃ and draw a box which uses them as its two sides. #. Draw a line across the box at the median (Q₂). #. Draw whiskers from the box to the most extreme data points that are NOT beyond the inner fences. #. Plot individual points for observations beyond the inner fences. **Drawing a modified box plot using R** .. code-block:: r # Import the graphing package library(ggplot2) # Create data vector promotion <- c(7, 12, 14, 14, 14, 18, 21, 22, 23, 24, 25, 34, 34, 37, 47, 49, 64, 100, 150) # Change format to data frame promotion_df <- data.frame(months=promotion) # Graph ggplot(pro_df, aes(y="", x=months)) + # flip x and y to make the plot vertical stat_boxplot(geom="errorbar") + #formats the whiskers geom_boxplot() + ggtitle("Boxplot of Months to Promotion") + stat_summary(fun = mean, col = "black", geom = "point", size = 3) The code above returns :numref:`box-plot`. Reading box plots beyond the five number summary ---------------------------------------------------- Skewness ~~~~~~~~~~~~~~~~ While box plots offer a less detailed view of the data distribution than histograms, they are effective for quickly identifying skewness. Let us compare the histograms and the box plots drawn for the same data sets: .. _skewed-box-plots: .. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter3/box_plot_vs_histogram.png :alt: Comparison of histograms and box plots :align: center :figwidth: 100% Histograms and box plots of skewed data sets Recall that each of the four sections defined by the sample quartiles contains approximately the same number of data points. Therefore, if one whisker is much longer than the box or the other whisker, it suggests that the data is more spread out in that section of the distribution. Limitations ~~~~~~~~~~~~~~~~~~~~~~ Box plots are efficient for identifying symmetry or asymmetry, but they are limited in the level of detail they provide on the shape of the distribution. Most importantly, we **cannot** * determine whether the data is normal (bell-shaped), or * identify the number of modes in the data through a box plot. See the examples below—each with a distinct distribution, yet producing very similar box plots. .. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter3/symmetric-box-plots.png :alt: Box plots of different symmetric distributions :align: center :figwidth: 100% Box plots of different symmetric distributions To gain a detailed understanding of the **shape** of the data distribution, we must use more refined graphical tools such as histograms. Explicit Points vs. Real Outliers ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When interpreting a modified box plot, it's crucial to understand that **not all explicit points are real outliers**. * **Explicit points** are observations flagged by statistical criteria like the 1.5 × IQR rule. They are points that mathematically deviate from the pattern established by the majority of the data. * **Real outliers** are explicit points that, upon investigation, truly deviate from the underlying pattern of the data. They may represent errors, anomalies, or genuinely unusual cases. When a data distribution is strongly skewed, for example, values on the longer tail may be identified as explicit points by the 1.5 × IQR rule, although they are simply conforming to the underlying distribution. See :numref:`skewed-box-plots`. When a data point is flagged as explicit, it should be inspected more carefully to determine if: #. it represents an error in measurement, recording, or data entry, #. it is an observation from a different population or process, #. or **very far**from the main body of the box plot to be considered part of the same distribution. In these cases, the explicit points are true outliers because they deviate from the data's overall trend. .. admonition:: Example💡: Interpreting a box plot Interpret the box plot of teh *number of months to promotion* data. .. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter3/box-plot-unlabeled.png :alt: Box plot of months to promotion data :align: center :figwidth: 60% * The middle 50% of engineers were promoted between 16 and 42 months after hiring. * The median time to promotion was 24 months. * Two engineers (100 and 150 months) are identified as explicit points. * Since the right half of the data is more spread out than the left half (longer right whisker, greater distance between Q2 and Q3 than Q1 and Q2), the data set is right-skewed. The two explicit points are very far from the main body of the plot with gaps larger than IQR, so we must investigate the possibility of them being real outliers. They might represent engineers who: * Were hired with specialized skills not requiring management roles. * Chose to remain in technical roles longer. * Faced unusual circumstances affecting their promotion timeline. * Were erroneously included in the dataset. .. Choosing Between Spread Measures --------------------------------- When deciding whether to use standard deviation or IQR as your primary measure of spread, consider: * Use **standard deviation** when: * Data is approximately normally distributed * There are no significant outliers * You need to perform calculations that specifically require standard deviation * Use **interquartile range** when: * Data is skewed * There are potential outliers * You want a robust measure that isn't heavily influenced by extreme values In practice, it's often valuable to report both measures, as they provide complementary information about the spread of your data. Bringing It All Together -------------------------- .. admonition:: Key Takeaways 📝 :class: important #. **Sample quartiles** divide the data into four equal parts. #. The **interquartile range (IQR)** measures the spread of the middle 50% of the data. #. The **five-number summary** provides a robust overview of the data distribution. #. Use the 1.5 × IQR rule to identify **explicit points** that may be potential outliers. #. Not all **explicit points** are **real outliers** - investigate them thoroughly before drawing conclusions. #. **Modified box plots** visualize the five-number summary and highlight explicit points. In the next section, we'll discuss how to choose the most appropriate measures of center and spread for different types of data distributions. Exercises ~~~~~~~~~~~~~~ 1. **Quartile Calculation**: For the dataset {5, 7, 9, 12, 15, 17, 19, 22, 25, 27, 30, 33, 38, 42, 51}: a) Calculate the five-number summary. b) Determine the IQR. c) Find the inner and outer fences. d) Identify any explicit points. 2. **Box Plot Interpretation**: Consider the following box plot of annual salaries (in thousands of dollars) for three departments in a company: .. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter3/department-boxplots.png :alt: Box plots of salaries by department :align: center :width: 80% a) Which department has the highest median salary? b) Which department has the greatest spread of salaries? c) Are there any explicit points? If so, in which department(s)? d) What other observations can you make about the salary distributions? #. **Real vs. Explicit**: Suppose you are analyzing the daily commute times (in minutes) for employees at a company, and you find that most values are between 15 and 45 minutes, but there are a few values over 90 minutes that are flagged as explicit points by the 1.5 × IQR rule. a) What additional information would you want to gather to determine if these are real outliers? b) Give two examples of circumstances where these explicit points might NOT be considered real outliers c) Give two examples of circumstances where these explicit points SHOULD be considered real outliers