Final Exam — Fall 2025: Worked Solutions

Exam Information

Course: STAT 350 — Introduction to Statistics
Semester: Fall 2025
Total Points: 150 + 15 (Extra Credit) = 165
Time Allowed: 60 minutes
Coverage: Cumulative (Chapters 1–13); primary emphasis on Chapters 12–13, with Chapters 1–7 weighted more heavily than Chapters 8–11 among the earlier material

Problem	Total Possible	Topic
Problem 1 (True/False, 2 pts each)	20	Poisson Independence, Conditional Probability, Card Sampling, E[X·Y] for Dependent RVs, Variance of Linear Combinations, ANOVA Pairs, Tukey HSD Timing, Normal Test Statistic, Residual Patterns, R² Interpretation
Problem 2 (Multiple Choice, 3 pts each)	18	Conditional Probability from Bar Charts, PMF Normalization, CI Duality, Regression Diagnostics, F-statistic Range, ANOVA F Interpretation
Problem 3	18	Venn Diagrams, Intersection and Union of Events
Problem 4	31	Uniform Distribution, Custom PDF (Haoyu), Expected Value, Counterclockwise Symmetry
Problem 5	40	One-Way ANOVA, Pooled SD, Combined SD, Tukey HSD
Problem 6	38	Simple Linear Regression
Total	150 (+ 15 Extra Credit)

—

Problem 1 — True/False (20 points, 2 pts each)

Question 1.1 (2 pts)

Let $X$ be a Poisson random variable with $E[X] = \mu$ (the average per-hour rate). Suppose two Poisson random variables, $Y$ and $Z$, are defined on two three-hour periods and share the per-hour rate of $X$. Then $Y$ and $Z$ must be independent and identically distributed $\text{Poisson}(3\mu)$.

Question 1.2 (2 pts)

Let $A_1, A_2, \ldots, A_n$ and $B$ be events from a sample space $\Omega$ where $A_1, A_2, \ldots, A_n$ form a partition of $\Omega$ and $P(B) > 0$. Then it must follow that $\sum_{i=1}^{n} P(A_i \mid B) = 1$.

Question 1.3 (2 pts)

A special deck of cards contains eight cards: for each number 1, 2, 3, and 4 there is exactly one red card and one black card (so the cards are 1R, 1B, 2R, 2B, 3R, 3B, 4R, 4B). Two cards are drawn at random without replacement, in order. Let $C$ denote the event that the first card drawn is red, and let $D$ denote the event that the second card drawn is either a 1 or a 2. Events $C$ and $D$ are independent.
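One way to check the independence claim is brute-force enumeration. The sketch below lists all 56 ordered draws and compares $P(C \cap D)$ with $P(C)\,P(D)$; the card encoding and the helper name `prob` are illustrative assumptions, not part of the exam.

```python
from fractions import Fraction
from itertools import permutations

# Deck encoded as (number, color) pairs: 1R, 1B, ..., 4R, 4B.
deck = [(n, c) for n in (1, 2, 3, 4) for c in ("R", "B")]

# All ordered draws of two distinct cards (8 * 7 = 56 equally likely outcomes).
draws = list(permutations(deck, 2))

def prob(event):
    """Probability of an event given as a predicate on (first, second)."""
    return Fraction(sum(1 for d in draws if event(*d)), len(draws))

p_c = prob(lambda first, second: first[1] == "R")        # first card is red
p_d = prob(lambda first, second: second[0] in (1, 2))    # second card is a 1 or a 2
p_cd = prob(lambda first, second: first[1] == "R" and second[0] in (1, 2))

print(p_c, p_d, p_cd, p_cd == p_c * p_d)  # 1/2 1/2 1/4 True
```

Under this enumeration $P(C \cap D) = 1/4 = P(C)\,P(D)$, which is consistent with the statement being true.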

Question 1.4 (2 pts)

On each day, a factory may be in a high-stress state ($Y = 1$, probability $1/4$) or normal operating mode ($Y = 0$, probability $3/4$). Let $X$ be the number of machine breakdowns, where $X \mid Y = 1 \sim \text{Poisson}(10)$ and $X \mid Y = 0 \sim \text{Poisson}(6)$. Define $V = X \cdot (1 - Y)$. Since $V = X \cdot (1 - Y)$, it follows that $E[V] = E[X] \cdot (1 - E[Y]) = 21/4$.
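The product rule $E[X(1-Y)] = E[X]\,(1 - E[Y])$ requires $X$ and $Y$ to be uncorrelated, which is not the case here because the distribution of $X$ depends on $Y$. A minimal sketch, using the law of total expectation with the values from the question (variable names are illustrative), compares the two calculations:

```python
from fractions import Fraction

p_y = {1: Fraction(1, 4), 0: Fraction(3, 4)}   # P(Y = y)
mean_x_given_y = {1: 10, 0: 6}                 # E[X | Y = y], the Poisson means

# Law of total expectation: E[V] = sum_y P(Y = y) * E[X | Y = y] * (1 - y)
e_v = sum(p_y[y] * mean_x_given_y[y] * (1 - y) for y in p_y)

# The shortcut used in the statement: E[X] * (1 - E[Y])
e_x = sum(p_y[y] * mean_x_given_y[y] for y in p_y)
e_y = sum(p_y[y] * y for y in p_y)
shortcut = e_x * (1 - e_y)

print(e_v, shortcut)  # 9/2 21/4
```

The conditional calculation gives $E[V] = 9/2$, not $21/4$, so the statement appears to be false.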

Question 1.5 (2 pts)

$X \sim N(\mu_X = 120, \sigma_X = 15)$ and $Y \sim N(\mu_Y = 160, \sigma_Y = 20)$ are independent. The planning index is $I = 0.3X + 0.7Y + 50$. The variance of $I$ satisfies $\text{Var}(I) = 0.3 \cdot \text{Var}(X) + 0.7 \cdot \text{Var}(Y) = 347.5$.
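For independent $X$ and $Y$, $\text{Var}(aX + bY + c) = a^2\,\text{Var}(X) + b^2\,\text{Var}(Y)$: the coefficients are squared and the constant drops out. The short sketch below (variable names are illustrative) contrasts that rule with the expression in the statement:

```python
var_x = 15 ** 2   # Var(X) = sigma_X^2 = 225
var_y = 20 ** 2   # Var(Y) = sigma_Y^2 = 400
a, b = 0.3, 0.7   # coefficients of X and Y; the constant 50 does not affect variance

correct = a**2 * var_x + b**2 * var_y   # coefficients squared
claimed = a * var_x + b * var_y         # expression from the statement

print(round(correct, 2), round(claimed, 2))  # 216.25 347.5
```

The squared-coefficient rule gives 216.25, so the value 347.5 in the statement does not match.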

Question 1.6 (2 pts)

In a one-way ANOVA, a Tukey HSD output reports results for 105 unique pairwise comparisons among the group means. The single factor variable in the ANOVA model must have exactly 15 levels.
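With $k$ groups there are $\binom{k}{2} = k(k-1)/2$ unique pairwise comparisons. A quick sketch that searches for the $k$ matching 105 comparisons (the search range up to 100 is an arbitrary assumption):

```python
from math import comb

# C(k, 2) is strictly increasing in k, so at most one k can match.
k = next(k for k in range(2, 100) if comb(k, 2) == 105)
print(k)  # 15
```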

Question 1.7 (2 pts)

A one-way ANOVA is conducted, all assumptions are found reasonable, the F-test statistic is approximately 1 with a large p-value, and the null hypothesis of equal treatment means is not rejected. In this situation, the appropriate next step is to carry out a Tukey multiple comparison procedure to determine which specific treatment means differ.

Question 1.8 (2 pts)

Consider the simple linear regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where the $\varepsilon_i$ are i.i.d. normal with variance $\sigma^2$. Under $H_0\colon \beta_1 = 0$, if $\sigma^2$ is known, the test statistic $\hat{\beta}_1 \big/ \sqrt{\sigma^2 / S_{XX}}$ follows a standard normal distribution.

Question 1.9 (2 pts)

In a simple linear regression setting, a residual plot shows points randomly scattered around the horizontal line at 0 with roughly the same vertical spread for all values of $x$, and no clear curvature or funnel shape. This residual pattern provides reasonable support for both the linearity assumption and the constant variance (homoscedasticity) assumption.

Question 1.10 (2 pts)

In a simple linear regression analysis, the sample coefficient of determination $R^2$ is found to be very close to 1. From this, we can conclude that a large proportion of the variability in the response variable is explained by its linear relationship with the explanatory variable, and that the relationship between them is in fact linear.

—

Problem 2 — Multiple Choice (18 points, 3 pts each)

Question 2.1 (3 pts)

Three elevators in the Math building malfunction during the day (8 am–7 pm) but tend to work normally at night (7 pm–8 am). Let $X \in \{\text{day}, \text{night}\}$ and $Y$ = number of working elevators. The conditional distributions of $Y$ given $X$ are displayed in the bar graphs below.

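Two side-by-side bar graphs of the conditional distribution of $Y$ given $X$. Day graph: P(Y=0 | day)=0.01, P(Y=1 | day)=0.05, P(Y=2 | day)=0.34, P(Y=3 | day)=0.60. Night graph: P(Y=0 | night)=0.01, P(Y=1 | night)=0.05, P(Y=2 | night)=0.14, P(Y=3 | night)=0.80.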

Which of the following probabilities is computed correctly?

1. $P(Y \leq 1) = 0.12$
2. $P(\{X = \text{day}\} \cap \{Y = 3\}) = 0.6$
3. $P(X = \text{day}) = P(X = \text{night}) = 1$
4. $P(Y > 1 \mid X = \text{night}) = 0.94$
5. All of the above
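Because the bars are conditional probabilities, each graph should sum to 1 on its own, and only statements conditioned on $X$ can be read off directly. The sketch below stores the bar heights in a dictionary (the name `pmf` is an illustrative choice) and evaluates the relevant sums exactly:

```python
from fractions import Fraction as F

# Conditional pmfs P(Y = y | X = x), read off the two bar graphs
pmf = {
    "day":   {0: F("0.01"), 1: F("0.05"), 2: F("0.34"), 3: F("0.60")},
    "night": {0: F("0.01"), 1: F("0.05"), 2: F("0.14"), 3: F("0.80")},
}

# Each conditional distribution should sum to 1
totals = {x: sum(p.values()) for x, p in pmf.items()}

# P(Y <= 1 | X = x) for each time of day
p_le1 = {x: sum(p for y, p in d.items() if y <= 1) for x, d in pmf.items()}

# P(Y > 1 | X = night)
p_gt1_night = sum(p for y, p in pmf["night"].items() if y > 1)

print(float(p_le1["day"]), float(p_le1["night"]), float(p_gt1_night))  # 0.06 0.06 0.94
```

Since $P(Y \leq 1 \mid X)$ is 0.06 under both conditions, the law of total probability gives $P(Y \leq 1) = 0.06$ whatever $P(X = \text{day})$ is, rather than 0.12. The conditional probability $P(Y > 1 \mid X = \text{night}) = 0.94$ matches the corresponding option.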

Question 2.2 (3 pts)

Let $X$ be a discrete random variable with support $\{0, 1, 2, 3\}$. Find the constant $k$ that makes the following a valid pmf:

\[\begin{split}f_X(x) = \begin{cases} 0.45, & x = 0 \\ k/x, & x = 1, 2, 3 \\ 0, & \text{otherwise} \end{cases}\end{split}\]

1. $k = 0.45$
2. $k = 0.30$
3. $k = 0.5455$
4. $k = 0.9124$
5. $k = 0.5006$
6. $k = \infty$
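A valid pmf must sum to 1, so $0.45 + k\,(1 + 1/2 + 1/3) = 1$. A minimal sketch that solves this with exact fractions (variable names are illustrative):

```python
from fractions import Fraction

p0 = Fraction(45, 100)                             # f_X(0)
weights = sum(Fraction(1, x) for x in (1, 2, 3))   # 1 + 1/2 + 1/3 = 11/6
k = (1 - p0) / weights                             # solve p0 + k * weights = 1

print(k, float(k))  # 3/10 0.3
```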

Question 2.3 (3 pts)

A simple linear regression is fit to relate product price ($x$) to weekly sales ($Y$). All standard assumptions are satisfied. A 95% confidence interval for $\beta_1$ is $(-12.4,\,-4.1)$. At $\alpha = 0.05$, what is the correct conclusion about the population linear association?

1. There is no evidence of a linear association because 0 is not in the interval.
2. There is evidence of a linear association because 0 is not in the interval.
3. There is evidence that changing price will cause weekly sales to decrease, because 0 is not in the interval.
4. We can only conclude that there is a linear association in this particular sample; a CI cannot draw conclusions about the population.
5. We cannot draw any conclusion about linear association without also knowing the p-value.

Question 2.4 (3 pts)

A researcher fits a simple linear regression model. Which plot is most appropriate for checking the normality assumption of the errors?

1. Scatterplot of $y$ versus $x$.
2. Plot of residuals versus $x$ values.
3. Normal probability plot (QQ-plot) of the residuals.
4. Normal probability plot (QQ-plot) of the response.
5. Histogram of the response.
6. Histogram of the explanatory variable.

Question 2.5 (3 pts)

The ANOVA F-statistic always falls within which range?

1. The positive real numbers $(0, \infty)$.
2. The real numbers $(-\infty, \infty)$.
3. The negative real numbers $(-\infty, 0)$.
4. The real numbers between 0 and 1.
5. None of the above.

Question 2.6 (3 pts)

A researcher compares four insomnia drugs using one-way ANOVA by randomly assigning seniors to one of the four treatments. Which statements correctly describe the relationship between between-drug variation, within-drug variation, and the F-statistic?

1. The ANOVA F-statistic is the ratio of between-group variation to within-group variation.
2. A large F occurs when differences among the four drug sample means are large relative to the typical person-to-person variability within each drug group.
3. The within-drug variation sets the noise level for judging whether observed differences among drug sample means look unusually large.
4. An F-statistic near 1 indicates that between-drug variation is about the same size as within-drug variation, so the drug means do not stand out beyond background variability.
5. All of the above.
6. None of the above.

—

Problem 3 Setup

Events $E$, $F$, and $G$ belong to the same sample space $S$, each with a non-zero probability.

The Venn diagram below shows the probability of each region:

Three-circle Venn diagram with regions labeled: outside all circles = 0.18, E only = 0.15, E intersect F (not G) = 0.20, E intersect F intersect G = 0.32, F intersect G (not E) = 0.05, G only = 0.10. The F-only and E-intersect-G-only regions have probability 0.

Part (a) uses four Venn diagrams labeled A–D, shown on the exam. The shaded regions correspond to the probability expressions in the matching table.

Problem 3 — Venn Diagrams and Set Theory (18 points)

Question 3a (8 pts)

Match each Venn diagram (A, B, C, or D) with the probability statement that correctly represents its colored region.

Four Venn diagrams labeled A, B, C, D, each showing three overlapping circles E, F, G. Diagram A shades the region where E and G overlap. Diagram B shades the region of F only (the part of F outside both E and G). Diagram C shades the part of F and G that falls outside E. Diagram D shades the overlapping regions of F with E and with G.

Table 20 Venn Diagram Matching
Row	Notation	Venn Diagram Letter
i	$P(E' \cap F \cap G')$	B
ii	$P(E \cap G)$	A
iii	$P\!\bigl(F \cap \bigl((E \cap F) \cup G\bigr)\bigr)$	D
iv	$P\!\bigl(E' \cap (F \cup G)\bigr)$	C

Question 3b (10 pts)

Using the Venn diagram probabilities, define events $A$, $B$, $C$, and $D$ as the colored regions of diagrams A, B, C, and D from part (a) (rows ii, i, iv, and iii of the matching table, respectively).

(4 pts) Compute $P(A \cap B \cap C \cap D)$.
(6 pts) Compute $P(A \cup B \cup C \cup D)$.
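Both probabilities can be checked by summing the Venn regions that belong to all four events, or to at least one of them. The sketch below encodes each region by its membership in $(E, F, G)$ and each event by the set expression matched to it in part (a); the encoding is an illustrative assumption.

```python
from fractions import Fraction as F

# Probability of each Venn region, keyed by (in E, in F, in G)
regions = {
    (0, 0, 0): F("0.18"), (1, 0, 0): F("0.15"), (1, 1, 0): F("0.20"),
    (1, 1, 1): F("0.32"), (0, 1, 1): F("0.05"), (0, 0, 1): F("0.10"),
    (0, 1, 0): F(0), (1, 0, 1): F(0),
}

# Shaded regions of diagrams A-D, written as membership tests
events = {
    "A": lambda e, f, g: e and g,                      # E ∩ G
    "B": lambda e, f, g: (not e) and f and (not g),    # E' ∩ F ∩ G'
    "C": lambda e, f, g: (not e) and (f or g),         # E' ∩ (F ∪ G)
    "D": lambda e, f, g: f and ((e and f) or g),       # F ∩ ((E ∩ F) ∪ G)
}

p_all = sum(p for r, p in regions.items() if all(ev(*r) for ev in events.values()))
p_any = sum(p for r, p in regions.items() if any(ev(*r) for ev in events.values()))

print(float(p_all), float(p_any))  # 0.0 0.67
```

No region lies in both $A$ (inside $E$) and $B$ (outside $E$), so the intersection is empty. The union covers everything except the "outside all circles" and "E only" regions, giving $1 - 0.18 - 0.15 = 0.67$.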

—

Problem 4 Setup

Dr. Reese asked two TAs, Zhenghao and Haoyu, to walk clockwise around a rectangular path (length 30 m, width 20 m, total perimeter 100 m), each starting and ending at Dr. Reese’s position. An observation time is chosen uniformly at random from each TA’s lap.

$Z$ = distance Zhenghao has traveled from Dr. Reese at the observation time.
$H$ = distance Haoyu has traveled from Dr. Reese at the observation time.

Rectangular grid map of an exam room. The rectangle is 30 m wide and 20 m tall. Dr. Reese's position (a smiley face) is at the bottom-right corner. The left side is labeled "Walk up" and the right side is labeled "Walk down". The clockwise path goes along the bottom (horizontal, 30 m), up the left side (20 m), across the top (30 m), and down the right side (20 m). Student seats are labeled with grid coordinates throughout the room.

Speed profile for Haoyu (going clockwise from Dr. Reese):

Table 21 Path Segments
Segment	Distance (m)	Direction	Relative speed
Bottom (horizontal)	[0, 30]	Flat	Baseline
Left side	[30, 50]	Walk up	Half speed (twice as slow)
Top (horizontal)	[50, 80]	Flat	Baseline
Right side	[80, 100]	Walk down	Double speed (twice as fast)

Problem 4 — Uniform and Custom Distributions (31 points)

Question 4a (4 pts)

Determine the distribution of $Z$ and its parameter(s).

Question 4b (4 pts)

What is the probability that Zhenghao is walking up the path at the observation time?
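Assuming Zhenghao walks at constant speed (the setup gives a speed profile only for Haoyu), a uniformly chosen observation time corresponds to a uniformly distributed position along the 100 m lap. Under that assumption, and taking "walking up" to mean the left-side segment $[30, 50]$ from the path table, a sketch of the calculation is:

\[Z \sim \text{Uniform}(0, 100), \qquad P(30 < Z \leq 50) = \frac{50 - 30}{100 - 0} = 0.2\]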

Question 4c (10 pts)

Based on Haoyu’s speed, the pdf of $H$ is:

\[\begin{split}f_H(x) = \begin{cases} k, & 0 < x \leq 30 \\ 2k, & 30 < x \leq 50 \\ k, & 50 < x \leq 80 \\ k/2, & 80 < x \leq 100 \\ 0, & \text{otherwise} \end{cases}\end{split}\]

Determine the value of $k$ that makes $f_H(x)$ a valid pdf.
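A pdf must integrate to 1. Because $f_H$ is piecewise constant, the integral reduces to a sum of (segment length) × (density), which can be sketched as:

\[\begin{split}\int_0^{100} f_H(x)\,dx = 30k + 20(2k) + 30k + 20\left(\tfrac{k}{2}\right) = 110k = 1 \quad\Longrightarrow\quad k = \tfrac{1}{110} \approx 0.0091\end{split}\]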

Question 4d (6 pts)

What is the probability that Haoyu has traveled between 40 m and 70 m from Dr. Reese?
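For a piecewise-constant pdf, an interval probability is the sum of density × overlap length over the segments. The sketch below assumes the normalizing constant $k = 1/110$ (from $30k + 40k + 30k + 10k = 1$); the helper `prob_between` is a hypothetical name introduced for illustration.

```python
from fractions import Fraction

k = Fraction(1, 110)   # assumed normalizing constant: 30k + 40k + 30k + 10k = 1

# (lower bound, upper bound, density) for each segment of Haoyu's lap
segments = [(0, 30, k), (30, 50, 2 * k), (50, 80, k), (80, 100, k / 2)]

def prob_between(a, b):
    """P(a < H <= b): density times overlap length, summed over segments."""
    return sum(d * max(0, min(b, hi) - max(a, lo)) for lo, hi, d in segments)

p = prob_between(40, 70)
print(p, round(float(p), 4))  # 4/11 0.3636
```

The interval $(40, 70)$ overlaps the left side for 10 m at density $2k$ and the top for 20 m at density $k$, so the probability is $40k = 4/11$.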

Question 4e (4 pts)

Compare the average travel distance of Zhenghao and Haoyu. Who travels farther on average?
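The two expected values can be compared directly. This sketch assumes $Z \sim \text{Uniform}(0, 100)$ (constant walking speed) and $k = 1/110$ for Haoyu's pdf; both are taken from the earlier parts rather than stated in this question.

```python
from fractions import Fraction

k = Fraction(1, 110)   # assumed normalizing constant for f_H
segments = [(0, 30, k), (30, 50, 2 * k), (50, 80, k), (80, 100, k / 2)]

# E[H] = sum over segments of d * integral of x dx = d * (hi^2 - lo^2) / 2
e_h = sum(d * Fraction(hi**2 - lo**2, 2) for lo, hi, d in segments)

# Zhenghao: assumed Uniform(0, 100), so E[Z] is the midpoint of the interval
e_z = Fraction(0 + 100, 2)

print(float(e_z), round(float(e_h), 2))  # 50.0 44.55
```

Under these assumptions $E[Z] = 50$ m and $E[H] = 490/11 \approx 44.55$ m, so Zhenghao's distance is larger on average. Haoyu spends proportionally more time on the slow uphill stretch early in the lap, which pulls his mean distance down.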

Question 4f (3 pts)

Suppose Zhenghao and Haoyu now walk counterclockwise instead of clockwise. Which of the following quantities will change?

1. The expected value of $Z$
2. The expected value of $H$
3. The variance of $Z$
4. The variance of $H$
5. All of the above

—

Problem 5 Setup

A data analyst compares four statistical techniques — Regression, ANOVA, Taguchi methods, and Structural Equation Modeling (SEM) — in terms of their mean effectiveness at capturing hidden correlations between depression indices and human health indicators. Each technique is applied with $m$ replications, giving total sample size $n = 4m$. The ANOVA is conducted at $\alpha = 0.05$.

Table 22 ANOVA Table
Source	df	SS	MS	F	$\Pr(>F)$
Treatment	3	931.46	310.4867	12.5411	7.465e-07
Error	84	2079.63	24.7575
Total	87	3011.09

Table 23 Summary Statistics by Technique
	Regression	ANOVA	Taguchi	SEM
$\bar{x}_i$	54.4	45.1	83.4	74.3
$s_i$	5.3	6.3	3.8	4.1

Problem 5 — One-Way ANOVA (40 points)

Question 5a (6 pts)

Determine the value of $m$ (replications per technique).
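The replication count follows from the degrees of freedom in the ANOVA table. A sketch of the arithmetic, using the Total row ($n - 1$) and confirming with the Error row ($n - 4$):

\[n - 1 = 87 \;\Longrightarrow\; n = 88, \qquad n - 4 = 84 \;\checkmark, \qquad m = \frac{n}{4} = \frac{88}{4} = 22\]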

Question 5b (2 pts)

Check whether the constant variance assumption is satisfied using the summary statistics.
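A common rule of thumb treats the constant variance assumption as reasonable when the largest sample standard deviation is at most twice the smallest. That threshold is an assumption here, since the question does not state which check is intended. A minimal sketch:

```python
# Sample standard deviations from the summary statistics table
s = {"Regression": 5.3, "ANOVA": 6.3, "Taguchi": 3.8, "SEM": 4.1}

ratio = max(s.values()) / min(s.values())   # largest SD over smallest SD
print(round(ratio, 3), ratio <= 2)  # 1.658 True
```

A ratio of about 1.66 falls below 2, so under this rule of thumb the assumption looks reasonable.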

Question 5c (5 pts)

Using the ANOVA output and assuming all four techniques share the same error variance, compute the pooled estimate of the error standard deviation.

Question 5d (5 pts)

Now ignore technique labels and treat all $n = 88$ measurements as one combined sample. Compute the standard deviation of this combined sample.
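The pooled and combined standard deviations come from different rows of the ANOVA table: $\sqrt{\text{MSE}}$ for the pooled estimate and $\sqrt{\text{SST}/(n-1)}$ for the combined sample. The sketch below computes both and cross-checks the MSE against the group standard deviations, which works as a plain average only because the group sizes are equal.

```python
from math import sqrt

# Values copied from the ANOVA table
mse = 24.7575                      # MS Error
ss_total, df_total = 3011.09, 87   # Total row

s_pooled = sqrt(mse)                     # spread around each technique's own mean
s_combined = sqrt(ss_total / df_total)   # spread around the single overall mean

# Cross-check: with equal group sizes, MSE is the plain average of the s_i^2
s_i = [5.3, 6.3, 3.8, 4.1]
mse_from_sds = sum(s**2 for s in s_i) / len(s_i)

print(round(s_pooled, 3), round(s_combined, 3), round(mse_from_sds, 4))
# 4.976 5.883 24.7575
```

The combined value is larger because it also absorbs the variation between technique means.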

Question 5e (3 pts)

Which statement best describes the difference between the standard deviations in parts (c) and (d)?

1. Both measure exactly the same quantity, just computed differently.
2. Part (c) measures variability between technique means; part (d) measures variability within each technique.
3. Part (c) measures how much all observations vary around a single overall mean; part (d) measures how much observations vary around their own technique’s mean.
4. Part (c) measures how much observations vary around their own technique’s mean (assuming common error variance); part (d) measures how much all observations vary around a single overall mean when technique labels are ignored.

Question 5f (2 pts)

Provide the first two steps of the four-step hypothesis testing procedure.

Question 5g (5 pts)

Based on the ANOVA results, state your formal decision and write a conclusion in context.

Question 5h (6 pts)

The Tukey HSD results at family-wise error rate 5% are shown below. Draw the graphical underline display and identify which technique has the highest mean effectiveness.

Table 24 Tukey HSD Results
Comparison	Significant?
Regression − ANOVA	No
Regression − SEM	No
Regression − Taguchi	Yes
ANOVA − SEM	Yes
ANOVA − Taguchi	Yes
SEM − Taguchi	Yes

Solution

Underline display (groups connected by the same underline are not significantly different at the 5% family-wise level):

Tukey HSD underline display. Four group means are shown in ascending order: 45.1 ANOVA, 54.4 Regression, 74.3 SEM, 83.4 Taguchi. Two overlapping underlines are drawn: one connecting ANOVA and Regression, and a second connecting Regression and SEM. Taguchi has no underline connecting it to any other group.

Two overlapping underlines are needed: one connecting ANOVA–Regression, and a second connecting Regression–SEM. ANOVA and SEM are significantly different from each other and cannot share an underline.

More precisely, with sample means 45.1 (ANOVA), 54.4 (Regression), 74.3 (SEM), 83.4 (Taguchi):

Table 25 Tukey HSD Groupings
Technique	$\bar{x}_i$	Significantly different from
ANOVA	45.1	SEM, Taguchi; not from Regression
Regression	54.4	Taguchi; not from ANOVA or SEM
SEM	74.3	ANOVA, Taguchi; not from Regression
Taguchi	83.4	All others (ANOVA, Regression, SEM)

Conclusion: The Taguchi method has the highest mean effectiveness at the population level. It has the largest sample mean ($\bar{x}_{\text{Taguchi}} = 83.4$) and is statistically significantly different from all other techniques at the 5% family-wise error level.
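The underline display can also be derived mechanically from the Tukey decisions: sort the groups by sample mean, then keep every maximal run of adjacent groups in which no pair is significantly different. The sketch below is one illustrative way to do this; the data structures and the helper `homogeneous` are assumptions, not part of the exam.

```python
# Sample means and Tukey decisions copied from the tables above
means = {"Regression": 54.4, "ANOVA": 45.1, "Taguchi": 83.4, "SEM": 74.3}
significant = {
    frozenset(("Regression", "ANOVA")): False,
    frozenset(("Regression", "SEM")): False,
    frozenset(("Regression", "Taguchi")): True,
    frozenset(("ANOVA", "SEM")): True,
    frozenset(("ANOVA", "Taguchi")): True,
    frozenset(("SEM", "Taguchi")): True,
}

order = sorted(means, key=means.get)   # ascending by sample mean

def homogeneous(run):
    """True if no pair inside the run is significantly different."""
    return not any(significant[frozenset((a, b))]
                   for i, a in enumerate(run) for b in run[i + 1:])

# Candidate underlines: runs of 2+ adjacent groups with no significant pair
runs = [order[i:j]
        for i in range(len(order))
        for j in range(i + 2, len(order) + 1)
        if homogeneous(order[i:j])]

# Keep only maximal runs (not strictly contained in a longer run)
underlines = [r for r in runs if not any(set(r) < set(q) for q in runs)]
print(underlines)  # [['ANOVA', 'Regression'], ['Regression', 'SEM']]
```

The output reproduces the two overlapping underlines, with Taguchi appearing in neither.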

—

Problem 6 Setup

A marketing analyst studies the relationship between weekly online advertising spending ($x$, in hundreds of dollars) and weekly sales ($Y$, in thousands of dollars) over 8 weeks.

Table 26 Advertising Spending and Sales Data
Variable	1	2	3	4	5	6	7	8
Ad spending ($x$, hundreds of dollars)	2	4	6	8	10	12	14	16
Sales ($Y$, thousands of dollars)	6	7	9	9	12	9	13	12

Summary statistics: $\sum x_i = 72$, $\sum x_i^2 = 816$, $\sum y_i = 77$, $\sum y_i^2 = 785$, $\sum x_i y_i = 768$.

The scatter plot and regression diagnostics below were produced from the data.

Scatterplot of Sales (thousands of dollars, y-axis) versus Ad Spending (hundreds of dollars, x-axis). The eight data points show a generally increasing trend from lower-left to upper-right, consistent with a positive linear association.

Normal QQ-plot of the regression residuals. Sample quantiles are on the y-axis and theoretical normal quantiles on the x-axis. The points follow the red reference line closely, with minor deviations, suggesting the normality assumption is roughly satisfied.

Residual plot showing residuals (y-axis) versus Ad Spending (x-axis). The residuals are scattered randomly around the horizontal reference line at zero with no clear fan shape or curvature, supporting the linearity and constant variance assumptions.

Problem 6 — Simple Linear Regression (38 points)

Question 6a (10 pts)

Compute the least-squares regression line for predicting weekly sales from weekly advertising spending.
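The slope and intercept follow from the summary statistics via $S_{XX} = \sum x_i^2 - (\sum x_i)^2/n$, $S_{XY} = \sum x_i y_i - (\sum x_i)(\sum y_i)/n$, $b_1 = S_{XY}/S_{XX}$, and $b_0 = \bar{y} - b_1 \bar{x}$. A minimal sketch with illustrative variable names:

```python
n = 8
sum_x, sum_x2, sum_y, sum_xy = 72, 816, 77, 768   # given summary statistics

s_xx = sum_x2 - sum_x**2 / n         # 816 - 648 = 168
s_xy = sum_xy - sum_x * sum_y / n    # 768 - 693 = 75

b1 = s_xy / s_xx                     # slope
b0 = sum_y / n - b1 * sum_x / n      # intercept: y-bar minus b1 * x-bar

print(round(b0, 4), round(b1, 4))  # 5.6071 0.4464
```

This gives the fitted line $\hat{y} \approx 5.6071 + 0.4464\,x$, with $x$ in hundreds of dollars and $\hat{y}$ in thousands of dollars.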

Question 6b (7 pts)

Complete all missing entries in the ANOVA table below.

Table 27 Regression ANOVA Table
Source	df	SS	MS	F	$\Pr(>F)$
Model	1	33.48	33.48	19.3336	0.0046
Error	6	10.39	1.7317
Total	7	43.87
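The table entries can be reproduced from the summary statistics using $\text{SST} = S_{YY}$, $\text{SSR} = S_{XY}^2 / S_{XX}$, and $\text{SSE} = \text{SST} - \text{SSR}$. The sketch below carries full precision throughout, so its values may differ from the table in the last digits where the table works from rounded intermediate values.

```python
n = 8
sum_x, sum_x2, sum_y, sum_y2, sum_xy = 72, 816, 77, 785, 768

s_xx = sum_x2 - sum_x**2 / n
s_xy = sum_xy - sum_x * sum_y / n

sst = sum_y2 - sum_y**2 / n    # total SS, df = n - 1 = 7
ssr = s_xy**2 / s_xx           # model SS, df = 1
sse = sst - ssr                # error SS, df = n - 2 = 6
mse = sse / (n - 2)            # mean square error
f_stat = ssr / mse             # MS Model / MS Error (MS Model = SSR since df = 1)

print(f"SSR={ssr:.4f} SSE={sse:.4f} SST={sst:.4f} MSE={mse:.4f} F={f_stat:.4f}")
```

At full precision MSE is about 1.732 and F is about 19.33, in line with the tabulated 1.7317 and 19.3336.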

Question 6c (4 pts)

Compute $R^2$ and interpret it in context.

Question 6d (4 pts)

Compute the Pearson correlation coefficient $r$ and interpret it in context.
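In simple linear regression, $r = S_{XY} / \sqrt{S_{XX}\,S_{YY}}$ and $R^2 = r^2 = \text{SSR}/\text{SST}$, with the sign of $r$ matching the sign of the slope. A short sketch from the summary statistics (variable names are illustrative):

```python
from math import sqrt

n = 8
sum_x, sum_x2, sum_y, sum_y2, sum_xy = 72, 816, 77, 785, 768

s_xx = sum_x2 - sum_x**2 / n
s_yy = sum_y2 - sum_y**2 / n
s_xy = sum_xy - sum_x * sum_y / n

r = s_xy / sqrt(s_xx * s_yy)   # positive because S_XY > 0
r_squared = r**2               # equals SSR / SST in simple linear regression

print(round(r_squared, 4), round(r, 4))  # 0.7631 0.8736
```

Roughly 76% of the variation in weekly sales is explained by the linear relationship with advertising spending, and $r \approx 0.87$ suggests a fairly strong positive linear association in this sample.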

Question 6e (4 pts)

Compute $s$, the estimate of $\sigma$ (the standard deviation of the error terms).

Question 6f (4 pts)

To test $H_0\colon \beta_1 = 0$ versus $H_a\colon \beta_1 \neq 0$, what is the value of the $t$-test statistic?
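The estimate of $\sigma$ is $s = \sqrt{\text{MSE}}$, and the slope test statistic is $t = b_1 / (s / \sqrt{S_{XX}})$. The sketch below computes both from the summary statistics at full precision (variable names are illustrative):

```python
from math import sqrt

n = 8
sum_x, sum_x2, sum_y, sum_y2, sum_xy = 72, 816, 77, 785, 768

s_xx = sum_x2 - sum_x**2 / n
s_xy = sum_xy - sum_x * sum_y / n
sse = (sum_y2 - sum_y**2 / n) - s_xy**2 / s_xx   # SST - SSR

s = sqrt(sse / (n - 2))    # estimate of sigma, with n - 2 = 6 df
b1 = s_xy / s_xx           # fitted slope
se_b1 = s / sqrt(s_xx)     # standard error of the slope
t_stat = b1 / se_b1

print(round(s, 4), round(t_stat, 3))  # 1.3161 4.397
```

As a consistency check, $t^2 \approx 19.33$ matches the F-statistic in the regression ANOVA table, as expected with a single predictor.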

Question 6g (5 pts)

Is there a significant linear association between advertising spending and sales at $\alpha = 0.01$? State the hypotheses and provide a formal conclusion.
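The decision compares the two-sided p-value of the slope test (6 degrees of freedom) with $\alpha = 0.01$. The ANOVA table already reports $\Pr(>F) = 0.0046$. As an independent check, the sketch below integrates the Student $t$ density numerically with Simpson's rule, using only the standard library; the helper names `t_pdf` and `two_sided_p` are hypothetical.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """P(|T| > |t|) via Simpson's rule on [0, |t|]; steps must be even."""
    t = abs(t)
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3    # approximates P(0 < T < t)
    return 1 - 2 * central

t_stat = sqrt(19.3336)         # t^2 = F in simple linear regression
p_value = two_sided_p(t_stat, df=6)

print(round(p_value, 4), p_value < 0.01)  # 0.0046 True
```

With $H_0\colon \beta_1 = 0$ versus $H_a\colon \beta_1 \neq 0$, a p-value near 0.0046 is below 0.01, so the data would lead to rejecting $H_0$ and concluding there is evidence of a linear association between advertising spending and sales.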