Final Exam — Fall 2024: Worked Solutions

Exam Information

Course: STAT 350 — Introduction to Statistics
Semester: Fall 2024
Total Points: 150 + 15 (Extra Credit) = 165
Time Allowed: 120 minutes
Coverage: Cumulative (Chapters 1–13); primary emphasis on Chapters 12–13, with Chapters 1–7 weighted more heavily than Chapters 8–11 among the earlier material

Problem	Total Possible	Topic
Problem 1 (True/False, 2 pts each)	18	IQR, CLT/Normal Approximation, Exponential, Uniform, Paired Design, ANOVA Comparisons, Confidence Bounds, Influential Points, Prediction Intervals
Problem 2 (Multiple Choice, 3 pts each)	15	Venn Diagrams/Probability, Skewed Distributions, Expected Value, ANOVA–t Relationship, SLR Diagnostics
Problem 3	16	Power Analysis (z-test, Paired Design)
Problem 4	30	Exponential Distribution, Binomial Distribution
Problem 5	42	One-Way ANOVA, Tukey HSD
Problem 6	44	Simple Linear Regression
Total	150 (+ 15 Extra Credit)

—

Problem 1 — True/False (18 points, 2 pts each)

Question 1.1 (2 pts)

Given a dataset that contains multiple real outliers, the best measure of spread for this data would be the interquartile range (IQR).

Question 1.2 (2 pts)

Let $X$ follow a binomial distribution with fixed $p$ and a sufficiently large number of trials $n$. The estimator $\hat{p} = X/n$, representing the sample proportion of successes, is derived from the random variable $X$, which is the sum of $n$ independent and identically distributed Bernoulli trials. The sampling distribution of $\hat{p} = X/n$ is approximately normal.

Question 1.3 (2 pts)

The time it takes for a customer to complete a transaction at a store follows an exponential distribution with a rate of $\lambda = 1.4$ transactions per minute. The probability that a transaction lasts less than 1 minute is greater than the probability that a transaction lasts between 1 and 3 minutes.
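
As a quick numerical check of the two probabilities compared in Question 1.3, the sketch below evaluates the exponential CDF $F(t) = 1 - e^{-\lambda t}$ at the stated rate. The helper name `exp_cdf` is only illustrative.

```python
import math

rate = 1.4  # lambda from the question, in transactions per minute


def exp_cdf(t, lam):
    """Return P(T <= t) for an Exponential(lam) random variable."""
    return 1.0 - math.exp(-lam * t)


p_under_1 = exp_cdf(1, rate)                     # P(T < 1)
p_1_to_3 = exp_cdf(3, rate) - exp_cdf(1, rate)   # P(1 < T < 3)

print(f"P(T < 1)     = {p_under_1:.4f}")
print(f"P(1 < T < 3) = {p_1_to_3:.4f}")
```

Under these assumptions the first probability is roughly 0.75 and the second roughly 0.23, so the comparison in the statement can be read off directly.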

Question 1.4 (2 pts)

The time it takes for a customer to complete a transaction at a store is uniformly distributed between 2 and 10 minutes. The probability that a transaction lasts between 1 and 6 minutes is greater than the probability that a transaction lasts between 6 and 10 minutes.
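
The comparison in Question 1.4 can be checked with a small sketch. The helper `uniform_prob` is an illustrative name; it clips each interval to the support $[2, 10]$ before measuring its length, which matters because part of the interval from 1 to 6 lies outside the support.

```python
def uniform_prob(lo, hi, a=2.0, b=10.0):
    """Return P(lo < X < hi) for X ~ Uniform(a, b), clipping to the support."""
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0.0) / (b - a)


p_1_to_6 = uniform_prob(1, 6)    # only [2, 6] carries probability
p_6_to_10 = uniform_prob(6, 10)

print(p_1_to_6, p_6_to_10)
```

Both intervals cover four of the eight minutes in the support, so the two probabilities are equal rather than one exceeding the other.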

Question 1.5 (2 pts)

A research team investigates whether consuming a spoonful of apple cider vinegar before meals prevents blood sugar spikes. They select 30 pairs of identical twins, randomly assigning each twin in a pair to one of two groups. In this scenario, a two-sample independent procedure is appropriate to compare the groups.

Question 1.6 (2 pts)

In a one-way ANOVA analysis for a factor with nine levels, the F-test resulted in rejection of the null hypothesis. If all possible pairs of levels are to be compared, the Multiple Comparisons step would involve 72 paired comparisons.
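
The number of pairwise comparisons among $k$ levels is $\binom{k}{2} = k(k-1)/2$, since each unordered pair is compared once. A one-line check for $k = 9$:

```python
import math

k = 9  # number of factor levels in the question
n_pairs = math.comb(k, 2)  # unordered pairs of levels

print(n_pairs)
```

The value 72 in the statement corresponds to $k(k-1)$, which counts each pair twice.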

Question 1.7 (2 pts)

The 98% lower confidence bound on the true slope of a simple linear regression line, $\beta_1$, gives the value of −0.43. Then there is 0.98 probability that $\beta_1$ is greater than −0.43.

Question 1.8 (2 pts)

In simple linear regression, all influential points must be outliers.

Question 1.9 (2 pts)

In simple linear regression, prediction intervals are wider than confidence intervals for the mean response at the same value of the predictor variable.

—

Problem 2 — Multiple Choice (15 points, 3 pts each)

Question 2.1 (3 pts)

Select the expression that does NOT correctly represent the probability of the colored area in the Venn diagram shown below.

Venn diagram with three sets A, B, and C. Sets A and B overlap; C is disjoint. The shaded region is the part of A that does not overlap B, plus all of C.

1. $P(A \cup B) - P(B) + P(C)$
2. $P(A) - P(A \cap B) + P(C)$
3. $P(A) - P(B) + P(C)$
4. $P(A \cup C) - P(A \cap B)$
5. $P(A \cup B \cup C) - P(B)$

Question 2.2 (3 pts)

The Fréchet distribution is a heavily skewed, right-tailed continuous distribution used for modeling extreme events such as earthquake magnitudes, daily rainfall totals, and large insurance claims. Which of the following statements is TRUE about the Fréchet distribution?

1. The mean is the largest among the measures of central tendency, followed by the mode and the median.
2. A small sample size is adequate to apply the central limit theorem to the distribution of the sample mean.
3. For samples from this distribution, the median and variance are recommended measures of central tendency and spread, respectively.
4. The interquartile range (IQR) is preferred for describing the spread of a population with a Fréchet distribution because it is less sensitive to outliers.
5. None of the above statements are TRUE for the Fréchet distribution.

Question 2.3 (3 pts)

Suppose $X$ is a random variable with $E[2^X] = 16$, $\text{Var}(X) = 32$, and $E[3X + 2] = 8$. Let a new random variable $Y$ be defined as $Y = 2^X - \frac{1}{4}X^2$. What is $E[Y]$?

1. 0
2. 7
3. 8
4. 36
5. None of the above
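
One way to organize the arithmetic for Question 2.3 is sketched below. It relies only on linearity of expectation, $E[3X + 2] = 3E[X] + 2$, and on the identity $E[X^2] = \text{Var}(X) + (E[X])^2$. The variable names are illustrative.

```python
# Quantities given in the question.
e_2_pow_x = 16.0      # E[2^X]
var_x = 32.0          # Var(X)
e_3x_plus_2 = 8.0     # E[3X + 2]

# Linearity: E[3X + 2] = 3 E[X] + 2, so solve for E[X].
e_x = (e_3x_plus_2 - 2.0) / 3.0

# Second moment from the variance identity.
e_x_sq = var_x + e_x ** 2

# Y = 2^X - X^2 / 4, so E[Y] = E[2^X] - E[X^2] / 4.
e_y = e_2_pow_x - 0.25 * e_x_sq

print(e_x, e_x_sq, e_y)
```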

Question 2.4 (3 pts)

An ANOVA F-test was performed on a dataset with two treatment levels ($k = 2$), resulting in a test statistic of $F_{\text{TS}} = 3.28$. If the same dataset is used for a hypothesis test on the difference of means of the two levels, then $t_{\text{TS}}^2 = 3.28$ only if:

1. The null value, $\Delta_0$, is 0.
2. The hypothesis test is a two-tailed $t$-test.
3. The observations within and across the two levels are assumed to be independent.
4. The population variances are assumed to be equal, and the pooled variance estimate is used to construct the test statistic.
5. All of the above must hold simultaneously.
6. We cannot be certain because $F_{\text{TS}}$ and $t_{\text{TS}}$ are test statistics for two different hypothesis tests, each with distinct assumptions and interpretations.

Question 2.5 (3 pts)

Which of the following statements is true for simple linear regression?

1. All diagnostic plots rely on the residuals/errors.
2. The assumption that the response is a simple random sample (SRS) for each fixed value of the explanatory variable is easy to verify.
3. A scatter plot can be used to assess both linearity and normality.
4. A scatter plot can be used to assess both homogeneity of variance and linearity.
5. All of the above are true for simple linear regression.

—

Problem 3 Setup

A dietitian is studying the effectiveness of a new dietary supplement on weight loss. To evaluate its impact, the dietitian measures the weight of 44 individuals before and after a 4-week regimen with the supplement.

The difference in weight ($d$ = weight after − weight before) is calculated for each participant. The dietitian wants to establish whether the supplement results in a decrease in weight. The dietitian would like the test to have at least 90% power to detect an average weight decrease of 6 lbs, which is deemed an important reduction.

The test is conducted at a significance level of $\alpha = 0.05$, with hypotheses:

\[H_0\colon \mu_d \geq 0 \qquad H_a\colon \mu_d < 0\]

For this paired test, the standard deviation of the differences, $\sigma_d$, is assumed to be known with a value of 12 lbs.

Problem 3 — Power Analysis (16 points)

Question 3a (3 pts)

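An intern claims that the purple shaded region in the power graph below represents the power of the test to detect the alternative $\mu_{d_a} = -6$. Select the correct option for what the purple shaded region actually represents.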

1. The intern is correct; it is the power of the test in detecting an alternative $\mu_{d_a} = -6$.
2. The intern is wrong; it is in fact the probability of Type I error.
3. The intern is wrong; it is in fact the probability of Type II error.
4. The intern is wrong, and it is none of the above options.

Power analysis graph showing two overlapping bell curves. The left curve is centered at -6 (the alternative distribution) and the right curve is centered at 0 (the null distribution). A vertical dashed line marks d_cutoff. The purple shaded region is the area under the null (right) curve to the right of d_cutoff.

Question 3b (13 pts)

Using the R output below, calculate the power of the test to detect an average weight decrease of 6 lbs. Determine whether the sample size is sufficient to meet the dietitian’s requirement for at least 90% power. Provide a detailed explanation of your calculations, including all steps and reasoning, to receive full credit. This includes computing $\bar{d}_{\text{cutoff}}$ and writing out full probability statements.

> qnorm(p = 0.05, lower.tail = FALSE)
[1] 1.644854
> qt(p = 0.05, df = 43, lower.tail = FALSE)
[1] 1.681071
> qnorm(p = 0.1, lower.tail = FALSE)
[1] 1.281552
> qt(p = 0.1, df = 43, lower.tail = FALSE)
[1] 1.301552
> pnorm(1.671771, lower.tail = TRUE)
[1] 0.9527153
> pnorm(1.671771, lower.tail = FALSE)
[1] 0.04728474
> pnorm(-2.975653, lower.tail = TRUE)
[1] 0.001461827
> pnorm(-2.975653, lower.tail = FALSE)
[1] 0.9985382
> pt(1.671771, df = 43, lower.tail = TRUE)
[1] 0.9490839
> pt(1.671771, df = 43, lower.tail = FALSE)
[1] 0.05091613
> pt(-2.975653, df = 43, lower.tail = TRUE)
[1] 0.002391402
> pt(-2.975653, df = 43, lower.tail = FALSE)
[1] 0.9976086
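
The power calculation in Question 3b can be cross-checked numerically. The sketch below is written in Python with the standard library rather than R, and assumes the $z$-procedure applies because $\sigma_d$ is treated as known. It computes the cutoff $\bar{d}_{\text{cutoff}} = \mu_0 - z_{\alpha}\,\sigma_d/\sqrt{n}$ and then the power $P(\bar{D} < \bar{d}_{\text{cutoff}} \mid \mu_d = -6)$.

```python
from math import sqrt
from statistics import NormalDist

# Values from the problem setup.
n, sigma_d, alpha = 44, 12.0, 0.05
mu_null, mu_alt = 0.0, -6.0

std_normal = NormalDist()          # standard normal, mean 0 and sd 1
se = sigma_d / sqrt(n)             # standard error of the mean difference

# Lower-tailed test: reject H0 when d-bar falls below the cutoff.
z_alpha = std_normal.inv_cdf(1 - alpha)
d_cutoff = mu_null - z_alpha * se

# Power: probability of rejecting when the true mean difference is mu_alt.
z_power = (d_cutoff - mu_alt) / se
power = std_normal.cdf(z_power)

print(f"d_cutoff = {d_cutoff:.6f}")
print(f"z        = {z_power:.6f}")
print(f"power    = {power:.7f}")
```

The standardized value matches the argument 1.671771 in the R output, and the resulting power of about 0.953 exceeds the 90% requirement, so a sample of 44 appears sufficient under these assumptions.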

—

Problem 4 Setup

Halin is a new student in STAT 350 and has never coded in R before. To improve her skills, she spends at most 40 minutes each weekday doing R self-study. Her daily workflow is as follows:

- The time it takes until she runs into an error, denoted by $T$, follows an Exponential distribution with an average time of 25 minutes.
- If $T > 40$, she does not run into an error during her study session that day.
- If $T \leq 40$, she encounters an error and attempts to debug it:
  - Debugging succeeds with probability 0.7, after which she feels happy and ends her study session.
  - Debugging fails with probability 0.3, and she immediately stops and goes to office hours for help.
- Each day’s workflow is independent of other days.

Problem 4 — Exponential Distribution and Binomial (30 points)

Question 4a (8 pts)

What is the probability that Halin will complete her study session on a given weekday without encountering an error?

Question 4b (12 pts)

On a given weekday, what is the probability that Halin does not need to go to office hours?

Question 4c (10 pts)

Suppose Halin has continued her independent study for 20 weekdays. Use the information from the previous question to determine the probability that Halin will visit office hours exactly 5 times over the 20 weekdays.
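
The three parts of Problem 4 chain together, and the sketch below follows that chain. It assumes the exponential rate is $\lambda = 1/25$ per minute (the reciprocal of the stated mean) and that the number of office-hour visits over 20 independent weekdays is binomial. Variable names are illustrative.

```python
from math import comb, exp

mean_time = 25.0      # average minutes until an error
session = 40.0        # maximum minutes of study per day
p_debug_fails = 0.3   # probability that debugging fails

# (a) No error during the session: P(T > 40) = exp(-40 / 25).
p_no_error = exp(-session / mean_time)

# (b) Office hours happen only if an error occurs AND debugging fails.
p_office = (1 - p_no_error) * p_debug_fails
p_no_office = 1 - p_office

# (c) Visits over 20 independent weekdays ~ Binomial(20, p_office).
n_days, visits = 20, 5
p_exactly_5 = (
    comb(n_days, visits) * p_office ** visits * (1 - p_office) ** (n_days - visits)
)

print(f"(a) P(no error)        = {p_no_error:.4f}")
print(f"(b) P(no office hours) = {p_no_office:.4f}")
print(f"(c) P(exactly 5)       = {p_exactly_5:.4f}")
```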

—

Problem 5 Setup

A clothing retail company aims to boost profit during the upcoming holiday season, and its marketing team has decided to use four advertising strategies: an Email Ad Campaign, a Direct Mail Ad Campaign, a Social Media Ad Campaign, and an AI-Powered Ad Campaign. To test the effectiveness of these strategies, 180 loyal customers were randomly divided into four groups of 45 customers each. Each group was exposed to one advertising strategy, and their purchase amounts for the year were recorded.

Table 11 Summary Statistics by Ad Campaign
	Email	Direct Mail	Social Media	AI-Powered
$n_i$	45	45	45	45
$\bar{x}_i$	449.45	450.39	453.42	455.72
$s_i^2$	68.16	104.6542	112.3643	147.7942

Problem 5 — One-Way ANOVA (42 points)

Question 5a (3 pts)

Which of the following assumptions is NOT required to perform one-way ANOVA? Assume a factor has $k$ levels.

1. Population variances are equal across the $k$ groups.
2. An independent sample is randomly drawn from each of the $k$ groups.
3. Observations within each group are independent of observations in other groups.
4. The sample sizes from each of the $k$ groups are the same.
5. The sample means are normally distributed for each of the $k$ groups.
6. All of the above assumptions are required.

Question 5b (2 pts)

Determine whether the homogeneity of variance assumption is valid or invalid. Mathematically support your answer.
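
A common rule of thumb treats the equal-variance assumption as reasonable when the largest sample standard deviation is at most twice the smallest. The sketch below applies that rule to the sample variances in Table 11. The threshold of 2 is an assumption here, not something stated in the problem.

```python
from math import sqrt

# Sample variances from Table 11.
variances = {
    "Email": 68.16,
    "Direct Mail": 104.6542,
    "Social Media": 112.3643,
    "AI-Powered": 147.7942,
}

sds = {name: sqrt(v) for name, v in variances.items()}
ratio = max(sds.values()) / min(sds.values())

# Assumed rule of thumb: ratio of largest to smallest SD should not exceed 2.
assumption_ok = ratio <= 2

print(f"max SD / min SD = {ratio:.4f}, assumption reasonable: {assumption_ok}")
```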

Question 5c (4 pts)

Clearly identify the factor of interest, specify how many levels this factor has, and describe what the quantitative response variable measures. Using this information, define the parameters of interest and state the null hypothesis and alternative hypothesis.

Question 5d (12 pts)

Complete the ANOVA table. Clearly show your work.

Table 12 One-Way ANOVA Table
Source	df	SS	MS	F
Factor (Ad Campaign)	3	1111.9185	370.6395	3.4241
Error	176	19050.80	108.2432
Total	179	20162.72
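
The entries of the ANOVA table can be reproduced from the summary statistics in Table 11. The sketch below uses $SS_{\text{Factor}} = \sum n_i(\bar{x}_i - \bar{x})^2$ and $SS_{\text{Error}} = \sum (n_i - 1)s_i^2$. Variable names are illustrative.

```python
# Summary statistics from Table 11: (group size, sample mean, sample variance).
groups = {
    "Email": (45, 449.45, 68.16),
    "Direct Mail": (45, 450.39, 104.6542),
    "Social Media": (45, 453.42, 112.3643),
    "AI-Powered": (45, 455.72, 147.7942),
}

k = len(groups)
n_total = sum(n for n, _, _ in groups.values())
grand_mean = sum(n * mean for n, mean, _ in groups.values()) / n_total

# Sums of squares and degrees of freedom.
ss_factor = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups.values())
ss_error = sum((n - 1) * var for n, _, var in groups.values())
df_factor, df_error = k - 1, n_total - k

# Mean squares and the F statistic.
ms_factor = ss_factor / df_factor
ms_error = ss_error / df_error
f_stat = ms_factor / ms_error

print(f"Factor: df={df_factor}, SS={ss_factor:.4f}, MS={ms_factor:.4f}, F={f_stat:.4f}")
print(f"Error:  df={df_error}, SS={ss_error:.2f}, MS={ms_error:.4f}")
print(f"Total:  df={n_total - 1}, SS={ss_factor + ss_error:.2f}")
```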

Question 5e (7 pts)

The $p$-value was found to be 0.0185. Test your hypotheses at a significance level of $\alpha = 0.05$. Provide the formal decision and interpret the conclusion in the context of the problem. You may assume all assumptions are valid.

Question 5f (3 pts)

Based on your conclusion, determine whether you should proceed to conduct a Tukey HSD test.

1. Conduct the Tukey HSD test because it can identify specific pairs of means that are significantly different when the ANOVA results show a significant difference.
2. Do not conduct the Tukey HSD test because the ANOVA results indicate that the population means are not significantly different.
3. There is insufficient information provided to decide whether a Tukey HSD test should be conducted.

Question 5g (3 pts)

Regardless of your conclusion for part (f), the researchers decided to conduct a Tukey HSD with an overall significance level of 5%. Let $\text{df}_E$ denote the correct degrees of freedom for error. Choose the correct Tukey parameter.

1. qtukey(0.95, nmeans = 3, df = dfe, lower.tail=TRUE) = 3.342793
2. qtukey(0.95/2, nmeans = 4, df = dfe, lower.tail=TRUE) = 1.925889
3. qtukey(0.95/2, nmeans = 3, df = dfe, lower.tail=TRUE) = 1.533567
4. qtukey(0.95, nmeans = 4, df = dfe, lower.tail=TRUE) = 3.66811

Question 5h (8 pts)

Using the summary information and the Tukey parameter above, construct a 95% confidence interval for the true difference in the average yearly amount spent by customers exposed to the AI-powered Ad Campaign and those exposed to the Direct Mail Ad Campaign. Based on the confidence interval, determine whether there is statistically significant evidence of a difference between these two groups.
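
A sketch of the Tukey interval for Question 5h follows. It assumes the critical value 3.66811, the `qtukey` result listed for four means at the 0.95 level, together with the mean square error 108.2432 from the ANOVA table. The interval has the form $(\bar{x}_i - \bar{x}_j) \pm \frac{q}{\sqrt{2}}\sqrt{MS_E\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$.

```python
from math import sqrt

q_crit = 3.66811        # assumed Tukey critical value (4 means, 176 error df)
ms_error = 108.2432     # mean square error from the ANOVA table
n_ai = n_dm = 45        # group sizes
mean_ai, mean_dm = 455.72, 450.39

diff = mean_ai - mean_dm
margin = (q_crit / sqrt(2)) * sqrt(ms_error * (1 / n_ai + 1 / n_dm))
lower, upper = diff - margin, diff + margin

# The difference is significant only if the interval excludes zero.
contains_zero = lower < 0 < upper

print(f"difference = {diff:.2f}, margin = {margin:.4f}")
print(f"95% CI: ({lower:.4f}, {upper:.4f}), contains zero: {contains_zero}")
```

Under these assumptions the interval runs from slightly below zero to about 11, so it does not indicate a statistically significant difference between the two campaigns.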

—

Problem 6 Setup

A STAT 350 student sought to explore the relationship between the cost of a meal for a single diner ($x$) and the tip amount offered ($Y$). The student selected nine specific meal costs and, for each cost, recorded a randomly selected tip amount from diners at a restaurant in Greater Lafayette.

Table 13 Meal Cost and Tip Data
Observation	1	2	3	4	5	6	7	8	9
Meal cost $x$ (dollars)	33.85	31.24	26.82	38.54	33.97	36.44	30.13	29.65	32.76
Tip $Y$ (dollars)	5.35	6.48	7.52	3.63	4.75	4.23	5.87	6.55	4.86

Problem 6 — Simple Linear Regression (44 points)

Question 6a (5 pts)

Describe the relationship between meal cost and the tip amount based on the scatter plot below.

Scatter plot of tip amount (y-axis, ranging from about 3.5 to 7.5 dollars) versus meal cost (x-axis, ranging from about 27 to 39 dollars). The nine data points show a downward-sloping pattern, indicating that larger meal costs are associated with smaller tip amounts.

Question 6b (15 pts)

The following summary statistics were computed from the data above:

\[n = 9, \quad \bar{x} = 32.600, \quad \bar{y} = 5.4711\]

\[\sum_{i=1}^{9} x_i y_i = 1570.9020, \quad \sum_{i=1}^{9} x_i^2 = 9668.3960, \quad \sum_{i=1}^{9} y_i^2 = 281.7746\]

Determine the slope ($b_1$) of the least-squares regression line.
Determine the intercept ($b_0$) of the least-squares regression line.
Write out the equation of the regression line.
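
The least-squares estimates can be computed from the summary statistics with $S_{xy} = \sum x_i y_i - n\bar{x}\bar{y}$, $S_{xx} = \sum x_i^2 - n\bar{x}^2$, $b_1 = S_{xy}/S_{xx}$, and $b_0 = \bar{y} - b_1\bar{x}$. The sketch below uses the rounded means as given, so the last digits may differ slightly from a fit on the raw data.

```python
# Summary statistics given in the question.
n = 9
x_bar, y_bar = 32.600, 5.4711
sum_xy = 1570.9020
sum_x_sq = 9668.3960

# Corrected sums of cross-products and squares.
s_xy = sum_xy - n * x_bar * y_bar
s_xx = sum_x_sq - n * x_bar ** 2

b1 = s_xy / s_xx            # slope
b0 = y_bar - b1 * x_bar     # intercept

print(f"S_xy = {s_xy:.4f}, S_xx = {s_xx:.4f}")
print(f"y-hat = {b0:.4f} + ({b1:.4f}) x")
```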

Question 6c (6 pts)

List the assumptions of simple linear regression that can be evaluated using diagnostic plots. For each assumption, specify all the diagnostic plots that can be used to assess it. To receive full credit, you must include all relevant plots for each assumption.

Question 6d (18 pts)

The following output was obtained using RStudio for the tip–meal cost data. You may assume that all assumptions have been met for simple linear regression.

Residual standard error: 0.3783 on 7 degrees of freedom
Multiple R-squared:  0.9191,    Adjusted R-squared:  0.9075
F-statistic: 79.49 on 1 and 7 DF,  p-value: 4.534e-05

Interpret the coefficient of determination, $R^2$, given in the output above.
Use the output above and your results in (b) to compute the Pearson correlation coefficient ($r$).
For the test on the slope $\beta_1$: use the output above to calculate the $t$-test statistic, and specify the associated degrees of freedom.
Perform a four-step hypothesis test at $\alpha = 0.01$ using the F-test procedure to determine whether there is a significant linear association between meal cost and tip offered.
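
The numerical pieces of Question 6d can be sketched from the printed output alone. In simple linear regression $r^2 = R^2$ and $t^2 = F$, with $r$ and $t$ taking the sign of the slope. The negative sign below is an assumption based on the downward pattern described for the scatter plot.

```python
from math import sqrt

# Values read from the R output.
r_squared = 0.9191
f_stat = 79.49
df_error = 7
p_value = 4.534e-05
alpha = 0.01

# Assumption: the fitted slope is negative (downward-sloping scatter plot).
sign = -1.0

r = sign * sqrt(r_squared)      # Pearson correlation coefficient
t_stat = sign * sqrt(f_stat)    # t statistic for the slope, df = df_error

# F-test decision: reject H0 (beta_1 = 0) when the p-value is below alpha.
reject_null = p_value < alpha

print(f"r = {r:.4f}")
print(f"t = {t_stat:.4f} on {df_error} df")
print(f"reject H0 at alpha = {alpha}: {reject_null}")
```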