.. _worksheet10:

Worksheet 10: Checking Normality and Introduction to Sampling Distributions
============================================================================

.. admonition:: Learning Objectives 🎯
   :class: info

   • Assess normality using visual methods (histograms, QQ-plots)
   • Apply numerical methods (Empirical Rule, IQR-to-SD ratio)
   • Manually construct and interpret QQ-plots
   • Understand the concept of sampling distributions
   • Explore how sample statistics behave across repeated samples
   • Use simulation to investigate sampling distributions empirically

Part 1: Checking Normality
--------------------------

Many statistical inference methods (e.g., confidence intervals, t-tests, ANOVA, regression) assume either that the data come from a population that is approximately Normally distributed, or that the estimators we use are approximately Normally distributed. But how do we check this assumption in practice?

From the Chapter 6 slides, we learned several ways to assess Normality:

1. **Visual Inspection**

   • Histograms with a Normal curve and kernel density curve overlaid
   • Normal probability plots (also called QQ-plots), which assess how closely the data follow a theoretical Normal distribution

2. **Numerical Checks**

   • Backward Empirical Rule: evaluating the proportion of data within 1, 2, and 3 standard deviations of the mean
   • IQR-to-standard-deviation ratio: checking whether the ratio is approximately 1.34 for Normal data

3. **(Outside the Scope of This Class)** Formal goodness-of-fit tests (e.g., the Shapiro–Wilk and Kolmogorov–Smirnov tests)

4. **Interpretation**

   • Compare these checks to the expected properties of truly Normal data
   • Make a judgment call about whether the Normal assumption is tenable

In this worksheet, you will use both visual and numerical methods to assess Normality using R's built-in ``mtcars`` dataset, focusing on the miles-per-gallon (``mpg``) variable. By the end of this worksheet, you should be able to:

1. Visually assess Normality using histograms and QQ-plots.
2. Manually construct a QQ-plot and compare it to the one generated by R's ``stat_qq()``.
3. Apply numerical methods, such as the Empirical Rule and the IQR-to-standard-deviation ratio, to further evaluate Normality.
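Where does the 1.34 benchmark come from? For a Normal distribution, the quartiles sit roughly :math:`0.674` standard deviations on either side of the mean, so :math:`\text{IQR} \approx 1.349\,\sigma` and :math:`\text{IQR}/\sigma \approx 1.34`. Here is a quick sketch you can run to verify this; it is not part of the worksheet questions:

.. code-block:: r

   # IQR of the standard Normal: distance between the 75th and 25th percentiles
   qnorm(0.75) - qnorm(0.25)   # approximately 1.349

   # The ratio is scale-invariant: for N(mu, sigma), IQR = 1.349 * sigma
   sigma <- 5
   (qnorm(0.75, sd = sigma) - qnorm(0.25, sd = sigma)) / sigma   # still 1.349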
**Question 1:** We will begin by computing key summary statistics for the ``mpg`` variable in the ``mtcars`` dataset. Load the ``mtcars`` dataset in R.

a) For the ``mpg`` variable, compute and record the following statistics:

   • Sample size (:math:`n`):
   • Sample mean (:math:`\bar{x}`):
   • Sample median (:math:`\tilde{x}`):
   • Sample standard deviation (:math:`s`):
   • Interquartile range (IQR):

b) Create a histogram of the ``mpg`` variable with the kernel density curve and Normal curve overlaid on top of the graph. Answer the following:

   • Does the histogram appear approximately bell-shaped?
   • Is there noticeable skewness, or are there outliers?
   • Would you say this data is approximately Normal based purely on this graph?

c) To better understand QQ-plots, you will construct one manually, step by step.

   • Order the data from smallest to largest and store it as ``sorted.mpg`` in the ``mtcars`` dataset.

     .. math:: x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(n)}

   • For each :math:`i = 1, \ldots, n`, compute the theoretical quantiles of the standard Normal distribution and save them into the data frame as a variable named ``theoretical.quantiles``.

     .. math:: z_i = \Phi^{-1}\left(\frac{i - 0.5}{n}\right)

     where :math:`\Phi^{-1}` denotes the inverse cumulative distribution function of the standard Normal distribution, which corresponds to finding quantiles with the ``qnorm`` function in R. You are essentially solving for :math:`z_i` such that :math:`P(Z \leq z_i) = \frac{i - 0.5}{n}` for :math:`i = 1, 2, \ldots, n`.

   • The empirical CDF places the :math:`i^{\text{th}}` ordered data point at a cumulative probability of :math:`i/n`, mapping it to the corresponding quantile of a Normal distribution.

   • However, this discrete ranking artificially pushes points too far into the tails; in particular, the largest observation would map to :math:`\Phi^{-1}(n/n) = \Phi^{-1}(1) = \infty`. Using :math:`i - 0.5` instead centers each data point in its expected quantile interval, making the comparison to the theoretical quantiles more accurate.

   • Compute the reference line and generate points along it using your ``theoretical.quantiles``. The slope of the line is the sample standard deviation, and the intercept is the sample mean of the ``mpg`` variable.

   • Construct your manual QQ-plot using the code below. The ``geom_point()`` and ``geom_line()`` layers from ggplot2 plot ordered pairs of points and a fitted line, respectively, and will be useful later when we reach the topic of linear regression.

   .. code-block:: r

      ggplot(mtcars, aes(x = theoretical.quantiles, y = sorted.mpg)) +
        geom_point(color = "black") +  # Manual QQ-points
        geom_line(aes(y = best.fit), color = "black",
                  linetype = "solid", linewidth = 1.5) +
        labs(title = "Manually Constructed QQ-Plot",
             x = "Theoretical Normal Quantiles",
             y = "Sample MPG Data") +
        theme_minimal()

   • Generate a QQ-plot automatically from the ``mpg`` variable using the ``stat_qq()`` and ``geom_abline()`` layers with the code below, and answer the questions that follow.

   .. code-block:: r

      ggplot(mtcars, aes(sample = mpg)) +
        stat_qq() +
        geom_abline(slope = s, intercept = xbar) +
        labs(title = "Automatically Constructed QQ-Plot",
             x = "Theoretical Normal Quantiles",
             y = "Sample MPG Data") +
        theme_minimal()

   • Compare your manual QQ-plot to the one generated using ``stat_qq()``. Do both plots look similar?
   • Does the data follow the reference line closely?
   • Are there any clear deviations, such as curved patterns or outliers?
   • Based on the QQ-plots, would you conclude that ``mpg`` follows a Normal distribution?

d) Beyond visualization, we use numerical checks to confirm Normality.

   i. For a Normally distributed dataset:

      • 68% of values should be within 1 standard deviation of the mean.
      • 95% should be within 2 standard deviations.
      • 99.7% should be within 3 standard deviations.

      Compute the proportions of ``mpg`` values that fall within these ranges. How close are your proportions to 0.68, 0.95, and 0.997? Are there notable deviations? What might explain them?

   ii. For a Normal distribution, the ratio of the IQR to the standard deviation should be approximately 1.34. Compute the IQR-to-SD ratio. What does this value suggest about the Normality of the data?

e) Based on your entire analysis of the data, draw a conclusion about its distribution: does it adhere strongly to a Normal distribution?
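If you would like an additional cross-check outside of ggplot2, base R's ``qqnorm()`` and ``qqline()`` produce a comparable plot in two lines. This is a minimal optional sketch, not part of the worksheet questions:

.. code-block:: r

   # Base-R QQ-plot of mpg with a reference line
   qqnorm(mtcars$mpg, main = "Base R QQ-Plot of MPG")
   qqline(mtcars$mpg)

Note that ``qqline()`` draws its line through the sample and theoretical quartiles rather than using the sample mean and standard deviation, so it may differ slightly from the reference line constructed above.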
**R Code for Normality Assessment:**

.. code-block:: r

   # Load necessary libraries
   library(ggplot2)

   # Part a: Summary statistics
   n <- length(mtcars$mpg)
   xbar <- mean(mtcars$mpg)
   xmed <- median(mtcars$mpg)
   s <- sd(mtcars$mpg)
   iqr <- IQR(mtcars$mpg)

   cat("Sample size n =", n, "\n")
   cat("Sample mean =", xbar, "\n")
   cat("Sample median =", xmed, "\n")
   cat("Sample SD =", s, "\n")
   cat("IQR =", iqr, "\n")

   # Part b: Histogram with kernel density and Normal curve overlays
   ggplot(mtcars, aes(x = mpg)) +
     geom_histogram(aes(y = after_stat(density)), bins = 10,
                    fill = "lightblue", color = "black", alpha = 0.7) +
     geom_density(color = "red", linewidth = 1.2) +
     stat_function(fun = dnorm, args = list(mean = xbar, sd = s),
                   color = "blue", linewidth = 1.2) +
     labs(title = "MPG Distribution with Normal Overlay",
          x = "Miles per Gallon", y = "Density") +
     theme_minimal()

   # Part c: Manual QQ-plot construction
   mtcars$sorted.mpg <- sort(mtcars$mpg)
   i <- 1:n
   mtcars$theoretical.quantiles <- qnorm((i - 0.5) / n)

   # Best-fit reference line: y = xbar + s * z
   mtcars$best.fit <- xbar + s * mtcars$theoretical.quantiles

   # Part d: Numerical checks
   # Empirical Rule
   within_1sd <- sum(abs(mtcars$mpg - xbar) <= s) / n
   within_2sd <- sum(abs(mtcars$mpg - xbar) <= 2 * s) / n
   within_3sd <- sum(abs(mtcars$mpg - xbar) <= 3 * s) / n

   cat("\nEmpirical Rule Check:\n")
   cat("Within 1 SD:", round(within_1sd, 3), "(expect 0.68)\n")
   cat("Within 2 SD:", round(within_2sd, 3), "(expect 0.95)\n")
   cat("Within 3 SD:", round(within_3sd, 3), "(expect 0.997)\n")

   # IQR-to-SD ratio
   iqr_sd_ratio <- iqr / s
   cat("\nIQR-to-SD ratio:", round(iqr_sd_ratio, 3), "(expect 1.34)\n")

Part 2: Introduction to Sampling Distributions
----------------------------------------------

In many statistical procedures, the underlying data do not need to follow a Normal distribution. Instead, what matters is whether the estimators used in statistical inference have distributions that are well understood (and in many cases Normal). In this section, you will explore the concept of sampling distributions by investigating how different estimators behave across repeated samples from different populations. Rather than relying on theoretical results, you will use simulation to observe these distributions empirically.

**What is a Sampling Distribution?**

- A single sample of size :math:`n` provides one estimate of a population parameter, such as the sample mean :math:`\bar{x}` as an estimator of the population mean :math:`\mu`, or the sample standard deviation :math:`s` as an estimator of the population standard deviation :math:`\sigma`.
- If we repeatedly take samples of the same size :math:`n` from the same population and compute the statistic each time, the values will vary. This variation is known as **sampling variability**; a minimal illustration appears below.
- The **sampling distribution** of an estimator is the probability distribution of that statistic across all possible samples of the same size.
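To make sampling variability concrete, here is a minimal sketch (using the :math:`N(\mu = 10, \sigma = 5)` population that appears in Question 2): five samples of the same size yield five different sample means.

.. code-block:: r

   set.seed(123)  # for reproducibility

   # Five independent samples of size 10 from N(mu = 10, sigma = 5),
   # each reduced to its sample mean: every draw gives a different estimate
   replicate(5, mean(rnorm(10, mean = 10, sd = 5)))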
**Why is this Important?**

- Understanding how an estimator behaves across repeated samples allows us to quantify uncertainty and make statistical inferences about the population.
- Even if the underlying data do not follow a Normal distribution, the distribution of an estimator may still be well behaved, making statistical inference reliable.
- The properties of different estimators, such as bias and variability, can be observed directly through simulation and will be discussed in more detail when we begin the chapters on statistical inference.

**What Influences the Sampling Distribution?**

The shape and variability of a sampling distribution depend on several factors:

- **Sample size** affects variability, with larger samples producing more stable estimates.
- **The population distribution** influences the shape of the sampling distribution, especially for small samples, though some estimators stabilize under certain conditions.
- **The choice of estimator** matters, as different statistics (e.g., sample mean vs. sample median) have distinct properties; see the sketch after Question 2 for a quick comparison.
- **Independence vs. dependence** also impacts behavior, as many statistical results assume independent observations.

Understanding these factors helps explain why different estimators behave the way they do and allows us to reliably choose the appropriate inference procedure.

**Question 2:** In this exercise, you will investigate how the sample mean and sample sum behave when drawing repeated samples from a Normally distributed population. This will help you understand how their sampling distributions depend on the sample size and the population parameters.

Consider a population where individual observations follow a Normal distribution with mean :math:`\mu = 10` and population standard deviation :math:`\sigma = 5`.

a) Sampling and Computing Statistics:

   • Draw 1500 simple random samples (SRS) of size :math:`n = 2` from the population.
   • For each of the 1500 samples of size :math:`n = 2`, compute both:

     - the sample mean: :math:`\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i`
     - the sample sum: :math:`S_n = \sum_{i=1}^n x_i`

     In other words, obtain a vector of sample means and a vector of sample sums:

     .. math::

        \begin{pmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_{1499} \\ \bar{x}_{1500} \end{pmatrix}
        \quad
        \begin{pmatrix} S_n^1 \\ \vdots \\ S_n^{1499} \\ S_n^{1500} \end{pmatrix}

   • Plot a histogram of the 1500 sample means and overlay the smooth kernel density and a Normal density whose mean and standard deviation are the sample mean and sample standard deviation of the 1500 simulated means. See the tutorial for Computer Assignment #4 for a fuller explanation.

b) Repeat a) for :math:`n = 10` and :math:`n = 100`. Answer the following questions:

   i. What do you notice about the spread of these distributions as :math:`n` increases?
   ii. Is the sampling distribution of each estimator Normal for each sample size :math:`n`?
   iii. How do the means and standard deviations of these sampling distributions relate to the original population parameters?
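The estimator-choice point from the factors list above can be seen directly by simulation. The following is a minimal sketch, reusing the :math:`N(\mu = 10, \sigma = 5)` population from Question 2: both the sample mean and the sample median are centered near :math:`\mu`, but for Normal data the median is noticeably more variable (its standard error is roughly :math:`1.25 \cdot \sigma/\sqrt{n}` for large :math:`n`).

.. code-block:: r

   set.seed(42)  # for reproducibility

   # 1500 samples of size n = 25; record one estimator per simulated sample
   means   <- replicate(1500, mean(rnorm(25, mean = 10, sd = 5)))
   medians <- replicate(1500, median(rnorm(25, mean = 10, sd = 5)))

   # Both estimators are centered near mu = 10 ...
   c(mean(means), mean(medians))

   # ... but the median's sampling distribution is wider than the mean's
   c(sd(means), sd(medians))   # compare to sigma / sqrt(25) = 1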
**R Code for Sampling Distribution Simulation:**

.. code-block:: r

   # Set population parameters
   mu <- 10
   sigma <- 5
   n_sims <- 1500

   # Function to simulate the sampling distributions of the mean and sum
   simulate_sampling_dist <- function(n, mu, sigma, n_sims) {
     sample_means <- numeric(n_sims)
     sample_sums <- numeric(n_sims)

     for (i in 1:n_sims) {
       samp <- rnorm(n, mean = mu, sd = sigma)  # one SRS of size n
       sample_means[i] <- mean(samp)
       sample_sums[i] <- sum(samp)
     }

     # Data frame for plotting
     data.frame(sample_means = sample_means,
                sample_sums = sample_sums,
                n = n)
   }

   # Simulate for different sample sizes
   results_n2 <- simulate_sampling_dist(2, mu, sigma, n_sims)
   results_n10 <- simulate_sampling_dist(10, mu, sigma, n_sims)
   results_n100 <- simulate_sampling_dist(100, mu, sigma, n_sims)

   # Plot the sampling distribution of the mean for n = 2
   mean_of_means <- mean(results_n2$sample_means)
   sd_of_means <- sd(results_n2$sample_means)

   ggplot(results_n2, aes(x = sample_means)) +
     geom_histogram(aes(y = after_stat(density)), bins = 30,
                    fill = "lightblue", color = "black", alpha = 0.7) +
     geom_density(color = "red", linewidth = 1.2) +
     stat_function(fun = dnorm,
                   args = list(mean = mean_of_means, sd = sd_of_means),
                   color = "blue", linewidth = 1.2) +
     labs(title = paste("Sampling Distribution of Mean (n =", 2, ")"),
          x = "Sample Mean", y = "Density") +
     theme_minimal()
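One way to check your answer to part b(iii): for a Normal population, the sample mean has standard deviation :math:`\sigma/\sqrt{n}`, so the simulated standard deviations should shrink accordingly as :math:`n` grows. A quick sketch using the ``results_n2``, ``results_n10``, and ``results_n100`` data frames created above:

.. code-block:: r

   # Empirical SDs of the simulated sample means for n = 2, 10, 100 ...
   sapply(list(results_n2, results_n10, results_n100),
          function(res) sd(res$sample_means))

   # ... should be close to the theoretical values sigma / sqrt(n)
   sigma / sqrt(c(2, 10, 100))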
**Question 3:** A research team is developing an AI inference system deployed on a cloud-based platform. When a request is sent to the AI model, it processes the input and returns a prediction. Due to various factors, such as hardware fluctuations and workload balancing, the response time for each request varies. However, past system logs indicate that the inference time per request follows a Normal distribution with a mean of 250 milliseconds and a standard deviation of 40 milliseconds.

To ensure the system meets real-time processing requirements, the team analyzes the sampling distribution of the average response time when processing multiple requests in batches. The system processes requests in batches of 25, meaning that every recorded batch has 25 independent inference times:

.. math:: X_1, X_2, \ldots, X_{25} \overset{\text{i.i.d.}}{\sim} N(\mu = 250, \sigma = 40)

a) Determine the distribution of the sample mean :math:`\bar{X}` (the average inference time for a batch of 25 requests).

b) Compute the probability that a randomly selected batch of 25 requests has an average inference time exceeding 260 milliseconds. Write out the full probability statement and show the steps for calculating this probability, but use R to obtain the numerical answer.

To improve the analysis, the team aggregates data across four cloud regions, each running a separate inference pipeline. They track the sum of the total inference times from these four regions, where each region processes a batch of 25 requests independently. Let :math:`Y_i = X_1^{(i)} + X_2^{(i)} + \cdots + X_{25}^{(i)}` denote the total processing time for cloud region :math:`i = 1, 2, 3, 4`; then :math:`S_4 = Y_1 + Y_2 + Y_3 + Y_4`.

c) Determine the sampling distribution of :math:`S_4`, the total inference time for the four regions.

d) Compute the probability that the total inference time across the four regions exceeds 25,600 milliseconds.

The cloud-based AI service processes 100 requests across the four regions, with total inference time :math:`S_4`. The service earns $0.05 per request and incurs a computational cost of $0.0001 per millisecond. The profit random variable :math:`V` is defined as:

.. math:: V = 5 - 0.0001 \cdot S_4

e) Determine the distribution of the profit :math:`V`, including its mean and standard deviation.

f) Compute the probability that the AI service operates at a loss, :math:`\{V < 0\}`.

g) Determine the bottom 10th percentile of the profit (the value :math:`v_{0.1}` such that :math:`P(V \leq v_{0.1}) = 0.1`).

**R Code for Application Problem:**

.. code-block:: r

   # Parameters
   mu_request <- 250
   sigma_request <- 40
   n_batch <- 25
   n_regions <- 4

   # Part a: Distribution of the sample mean
   mu_xbar <- mu_request
   sigma_xbar <- sigma_request / sqrt(n_batch)
   cat("Distribution of X̄ ~ N(", mu_xbar, ",", sigma_xbar, ")\n")

   # Part b: P(X̄ > 260)
   prob_exceed_260 <- pnorm(260, mean = mu_xbar, sd = sigma_xbar,
                            lower.tail = FALSE)
   cat("P(X̄ > 260) =", prob_exceed_260, "\n")

   # Part c: Distribution of S4
   # Each Yi ~ N(25 * 250, sqrt(25) * 40)
   mu_Yi <- n_batch * mu_request
   sigma_Yi <- sqrt(n_batch) * sigma_request

   # S4 = Y1 + Y2 + Y3 + Y4
   mu_S4 <- n_regions * mu_Yi
   sigma_S4 <- sqrt(n_regions) * sigma_Yi
   cat("\nDistribution of S4 ~ N(", mu_S4, ",", sigma_S4, ")\n")

   # Part d: P(S4 > 25600)
   prob_S4_exceed <- pnorm(25600, mean = mu_S4, sd = sigma_S4,
                           lower.tail = FALSE)
   cat("P(S4 > 25600) =", prob_S4_exceed, "\n")

   # Part e: Distribution of profit V = 5 - 0.0001 * S4
   # (a linear function of a Normal variable is Normal)
   mu_V <- 5 - 0.0001 * mu_S4
   sigma_V <- 0.0001 * sigma_S4
   cat("\nDistribution of V ~ N(", mu_V, ",", sigma_V, ")\n")

   # Part f: P(V < 0)
   prob_loss <- pnorm(0, mean = mu_V, sd = sigma_V)
   cat("P(V < 0) =", prob_loss, "\n")

   # Part g: 10th percentile of profit
   v_10 <- qnorm(0.1, mean = mu_V, sd = sigma_V)
   cat("10th percentile of profit =", v_10, "\n")

Key Takeaways
-------------

.. admonition:: Summary 📝
   :class: important

   **Checking Normality:**

   • Use **visual methods**: histograms with a Normal overlay, QQ-plots
   • Apply **numerical checks**: Empirical Rule (68-95-99.7), IQR/SD ≈ 1.34
   • QQ-plots compare the ordered data to theoretical Normal quantiles
   • Points should follow the reference line if the data are Normal

   **Sampling Distributions:**

   • **Sampling distribution**: the distribution of a statistic across all possible samples
   • Even if the data are not Normal, sample means may be approximately Normal
   • **Larger samples** → less variability in the sampling distribution
   • For Normal populations: :math:`\bar{X} \sim N(\mu, \sigma/\sqrt{n})`
   • **Sums** of Normal variables are also Normal, with adjusted parameters
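As an optional cross-check on Question 3, the analytic answers can also be verified by simulating :math:`S_4` directly. This is a minimal sketch, not part of the worksheet questions:

.. code-block:: r

   set.seed(7)  # for reproducibility

   # Simulate S4 = total of 100 i.i.d. N(250, 40) inference times, many times
   S4 <- replicate(100000, sum(rnorm(100, mean = 250, sd = 40)))
   V <- 5 - 0.0001 * S4  # corresponding profits

   # Monte Carlo estimates, to compare against the pnorm()/qnorm() answers
   mean(S4 > 25600)   # approx P(S4 > 25600)
   mean(V < 0)        # approx P(V < 0), essentially zero here
   quantile(V, 0.1)   # approx 10th percentile of profit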