13.1. Introduction to Linear Regression

So far, we have studied inference methods that describe a single population or the relationship between a quantitative variable and a categorical variable. Regression analysis, in contrast, examines relationships between two quantitative variables. While many core statistical ideas parallel the methods from previous chapters, one major shift occurs—we can now describe the relationship using a functional form, represented by a trendline.

After a brief introduction to general regression analysis, we narrow our focus to variables that exhibit a linear association.

Road Map

  • Express the core ideas of previous inference methods in model form.

  • Build the general regression model for two quantitative variables. Understand its components and see how it extends the modeling ideas from earlier methods.

  • Graphically assess the association of two quantitative variables using scatter plots.

13.1.1. The Evolution of Our Statistical Journey

Before diving into linear regression, let us reflect on our journey through statistical inference. Each major phase has built systematically toward the culminating topic of linear regression.

Model for Single Population Inference

We began with the fundamental problem of inferring an unknown population mean \(\mu\) from sample data. The corresponding model can be written as:

\[X_i = \mu + \varepsilon_i,\]

where \(\varepsilon_i\) represent iid errors with \(E(\varepsilon_i)=0\) and \(\text{Var}(\varepsilon_i) = \sigma^2\) for \(i = 1, 2, \ldots, n\). This model captures the essential idea that each observation consists of an underlying mean plus random variation around that mean.
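This decomposition is easy to see in a short simulation. The sketch below uses hypothetical values \(\mu = 10\), \(\sigma = 2\), and \(n = 500\); the sample mean then serves as an estimate of the underlying \(\mu\).

```r
# Minimal sketch (hypothetical values): simulate X_i = mu + eps_i
set.seed(1)
mu <- 10; sigma <- 2; n <- 500
eps <- rnorm(n, mean = 0, sd = sigma)  # iid errors with E(eps) = 0, Var(eps) = sigma^2
x <- mu + eps                          # each observation = underlying mean + random variation
mean(x)                                # sample mean, a point estimate of the unknown mu
```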

Two-Population Models

A. Independent Two-Sample Inference

We then extended our methods to the comparison of two population means, handling both independent and dependent sampling scenarios.

For independent two-sample inference, the assumptions can be expressed using the model:

\[\begin{split}&X_{Ai} = \mu_A + \varepsilon_{Ai} \\ &X_{Bi} = \mu_B + \varepsilon_{Bi},\end{split}\]

where

  • \(\mu_A\) and \(\mu_B\) are the unknown population means,

  • \(\varepsilon_{Ai}\) are iid with \(E(\varepsilon_{Ai})=0\) and \(\text{Var}(\varepsilon_{Ai}) = \sigma^2_A\) for all \(i=1,\ldots,n_A\),

  • \(\varepsilon_{Bi}\) are iid with \(E(\varepsilon_{Bi})=0\) and \(\text{Var}(\varepsilon_{Bi}) = \sigma^2_B\) for all \(i=1,\ldots,n_B\), and

  • error terms of Population A are independent from error terms of Population B.
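A quick simulation from this two-sample model shows how the independent samples arise and feed into a two-sample comparison. The parameters (\(\mu_A = 5\), \(\mu_B = 7\), unequal standard deviations) are made up for illustration.

```r
# Sketch with hypothetical parameters: independent samples from the two-population model
set.seed(2)
n_A <- 40; n_B <- 50
x_A <- 5 + rnorm(n_A, mean = 0, sd = 1.0)  # X_Ai = mu_A + eps_Ai
x_B <- 7 + rnorm(n_B, mean = 0, sd = 1.5)  # X_Bi = mu_B + eps_Bi, independent of x_A
t.test(x_A, x_B)                           # Welch two-sample t-test of mu_A = mu_B
```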

B. Paired Two-Sample Inference

For paired samples, the difference is modeled directly:

\[D_i = X_{Ai} - X_{Bi} = (\mu_A - \mu_B) + \varepsilon_i\]

where \(\varepsilon_i\) are iid with \(E(\varepsilon_i)=0\) and \(\text{Var}(\varepsilon_i)=\sigma^2_D\).
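As a sketch with made-up numbers, the paired model can be simulated by generating dependent pairs and reducing them to differences:

```r
# Hypothetical sketch: paired observations reduced to differences D_i
set.seed(3)
n <- 30
x_A <- rnorm(n, mean = 12, sd = 2)
x_B <- x_A - 1.5 + rnorm(n, mean = 0, sd = 1)  # B tends to run about 1.5 below A
d <- x_A - x_B                                  # D_i = (mu_A - mu_B) + eps_i
t.test(d)                                       # one-sample t-test on the differences
```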

The ANOVA Model

ANOVA extended the modeling ideas of the independent two-sample analysis to \(k\) samples. Each observation \(X_{ij}\) is assumed to satisfy:

\[X_{ij} = \mu_i + \varepsilon_{ij},\]

where

  • \(\mu_i\) is the true mean for group \(i\), and

  • \(\varepsilon_{ij}\) are iid errors with \(E(\varepsilon_{ij})=0\) and \(\text{Var}(\varepsilon_{ij}) = \sigma^2\) for all possible pairs \((i,j)\).
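The ANOVA model can be sketched the same way; here \(k = 3\) hypothetical group means share a common error standard deviation:

```r
# Hypothetical sketch: X_ij = mu_i + eps_ij for k = 3 groups
set.seed(4)
means <- c(10, 12, 15)                                     # mu_1, mu_2, mu_3 (made up)
grp <- factor(rep(1:3, each = 20))
x <- means[as.integer(grp)] + rnorm(60, mean = 0, sd = 2)  # common sigma across groups
summary(aov(x ~ grp))                                      # one-way ANOVA table
```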


13.1.2. The Regression Framework

Throughout our progression, we consistently worked with a single quantitative variable—either on its own or in connection with a categorical factor variable that divides data into groups.

In regression analysis, we study the relationship between two quantitative variables. Because both variables carry numerical order and magnitude, our interest expands: we now examine not only whether an association exists but also the functional form that characterizes how the two variables relate.

The General Regression Model

Our new modeling framework can be expressed as:

\[Y = g(X) + \varepsilon.\]

This simple equation contains profound ideas:

  • The response variable \(Y\) (also called the dependent variable) represents the outcome to be understood and predicted.

  • The explanatory variable \(X\) (also called the independent variable) represents the variable that may explain, influence, or predict changes in the response variable.

  • The regression function \(g(X)\) defines the systematic relationship between the explanatory and response variables. This function captures the average behavior of how \(Y\) changes with \(X\).

  • The error term \(\varepsilon\) represents unexplained variation—everything about \(Y\) that cannot be explained by the functional relationship with \(X\).
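A small simulation makes the decomposition concrete. The regression function \(g(x) = 2 + 3x\) and the error standard deviation below are assumptions chosen purely for illustration:

```r
# Sketch of Y = g(X) + eps with an assumed linear regression function
set.seed(5)
x <- runif(100, min = 0, max = 10)
g <- function(x) 2 + 3 * x           # systematic part: average behavior of Y given X
eps <- rnorm(100, mean = 0, sd = 4)  # unexplained variation
y <- g(x) + eps
head(cbind(x, y))                    # observed (x_i, y_i) pairs
```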

Functional Association Does Not Guarantee Causality

Two variables are said to be associated if changes in one variable are accompanied by systematic changes in the other variable.

Causation makes a stronger claim that one variable brings about changes in the other. Establishing causation requires careful experimental design and advanced analytical techniques that allow the causal argument to be statistically rigorous.

‼️ Regression analyses covered in this course can establish association, but not causation.


13.1.3. Preliminary Assessment of Linear Relationship Through Scatter Plots

Before mathematically constructing regression analysis, let us examine the association between two quantitative variables graphically. Scatter plots are the primary tool for this stage.

The Anatomy of a Scatter Plot

Fig. 13.1 An example of a scatter plot

A scatter plot consists of:

  • Horizontal X-axis whose range includes all \(x_i\) values in the data set

  • Vertical Y-axis whose range includes all observed \(y_i\) values

  • Points at coordinates \((x_i, y_i)\), each representing an observed pair

The assessment of a scatter plot consists of three main steps:

  • Step 1: Check whether there is any relationship between the two variables, and if yes, identify the form of the relationship (linear, curved, etc.).

Once the relationship is confirmed to be linear, proceed to:

  • Step 2: Assess the direction and strength of the linear relationship.

  • Step 3: Check if any horizontal or vertical outliers exist.

Step 1: The Functional Form

During this stage, we visualize a curve that best summarizes the trend created by the data points. Depending on its functional form, we classify the association as linear, exponential, polynomial, clustered, etc. Fig. 13.1 shows a scatter plot with a linear trend.

See Fig. 13.2 for scatter plots exhibiting trends of different functional forms.

Fig. 13.2 Scatter plots showing trends with exponential, polynomial, and sinusoidal forms

Other possible forms are:

  • Threshold or breakpoint patterns: The relationship changes character at certain values, requiring different functional forms in different regions.

  • Clustered form: Points group into distinct clusters rather than following a smooth pattern. This suggests the presence of subgroups or categories within the data.

  • No pattern: Points appear randomly scattered with no discernible relationship. This suggests that the explanatory variable provides little or no information about the response variable.
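These forms are easy to generate for practice. The sketch below simulates three of them with base R graphics; all parameters are made up.

```r
# Hypothetical sketch: scatter plots with different functional forms
set.seed(6)
x <- runif(200, min = 0, max = 4)
par(mfrow = c(1, 3))                                   # three panels side by side
plot(x, 2 + 3 * x + rnorm(200), main = "Linear")
plot(x, exp(x) + rnorm(200, sd = 3), main = "Exponential")
plot(x, rnorm(200), main = "No pattern")               # X carries no information about Y
```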

Step 2: Direction and Strength of a Linear Relationship

Once a linear form is identified, we characterize the association further with its direction and strength.

Fig. 13.3 Scatter plots with different directions of linear association

Positive linear association is indicated by an upward trend in the scatter plot. As the explanatory variable \(X\) increases, the response variable \(Y\) tends to increase as well.

Negative linear association is indicated by a downward trend, with the variables moving in “opposite” directions.

Fig. 13.4 Strength of linear association increases from left to right

Strength of a linear relationship is indicated on a scatter plot by how closely the points gather around the best-fit line. We say that \(X\) and \(Y\) have a deterministic (perfect) linear association when the data points lie exactly on a straight line (the rightmost plot in Fig. 13.4).

Exception: A Perfect Horizontal Line

When the summary line is horizontal, we consider the two variables to be unassociated, even if the points form a perfect line. This may seem to contradict our earlier discussion, but recall that two variables are associated if knowledge of one variable gives us additional information about the other. When the trend is a flat line, knowing the value of \(X\) gives no additional information about the likely location of \(Y\).
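This can be checked numerically. In the sketch below (made-up values), \(Y\) is generated without any reference to \(X\), and the fitted slope lands near zero:

```r
# Sketch: Y independent of X, so the best-fit line is essentially flat
set.seed(7)
x <- runif(100, min = 0, max = 10)
y <- rnorm(100, mean = 5, sd = 1)  # generated without using x at all
coef(lm(y ~ x))["x"]               # fitted slope, close to zero
```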

Step 3: Outliers and Influential Points

There are two types of outliers in regression analysis:

  • \(X\)-outliers deviate horizontally from other \(X\) values (Fig. 13.6).

  • \(Y\)-outliers show a greater vertical distance from the trendline than other data points (Fig. 13.5).

Note the key distinction: \(Y\)-outliers are determined by their distance from the associational trend with \(X\), not from other \(Y\) values.

Fig. 13.5 Y-outlier is circled in red

We further define an influential point as an observation that has a large impact on the fitted regression line. Removing this point would substantially change the slope, the intercept, or both. We also say that such points have high leverage.

Fig. 13.6 X-outliers

In Fig. 13.5 and Fig. 13.6, the red trend lines summarize all data points, while the blue trend lines summarize the data with outliers removed. These graphs provide important takeaways:

  • It is problematic if a few outliers have high leverage, as they can distort the general trend.

  • Between the two types, \(X\)-outliers are generally more influential than \(Y\)-outliers, often “pulling” the best fit line toward them.

  • Not all outliers are influential, and not all influential points are outliers.
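The pull of a high-leverage point can be demonstrated directly. The data below are simulated, with one artificial \(X\)-outlier placed far to the right and off the trend:

```r
# Sketch: one X-outlier changes the fitted slope substantially
set.seed(8)
x <- runif(30, min = 0, max = 5)
y <- 1 + 2 * x + rnorm(30, sd = 1)   # true slope is 2
x_out <- c(x, 15); y_out <- c(y, 5)  # add a point far right of the others, off-trend
coef(lm(y ~ x))["x"]                 # slope near 2 without the outlier
coef(lm(y_out ~ x_out))["x_out"]     # slope pulled toward the high-leverage point
```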

Example💡: Car Engine Performance 🚘

Automotive engineers collected data on eight four-cylinder vehicles that are considered to be among the most fuel-efficient in 2006. For each vehicle, they measured:

  • The total displacement of the engine, measured as cylinder volume (liters)

  • The power output of the engine, measured in horsepower (hp)

See the complete dataset below:

2006 Fuel-efficient vehicle data

| Obs # | Vehicle        | Cylinder Volume (L) | Horsepower (hp) |
|-------|----------------|---------------------|-----------------|
| 1     | Honda Civic    | 1.8                 | 100             |
| 2     | Toyota Prius   | 1.5                 | 96              |
| 3     | VW Golf        | 2.0                 | 115             |
| 4     | VW Beetle      | 2.4                 | 150             |
| 5     | Toyota Corolla | 1.8                 | 126             |
| 6     | VW Jetta       | 2.5                 | 150             |
| 7     | Mini Cooper    | 1.6                 | 118             |
| 8     | Toyota Yaris   | 1.5                 | 106             |

Q1: Which variable should be explanatory and which should be response?

From an engineering perspective, the physical size of the engine largely determines its potential power output. Larger engines generally have the capacity to produce more power, though other factors like engine design and tuning also matter. Therefore, we use cylinder volume as the explanatory variable \(X\) and the power output as the response \(Y\).

Q2: Create a Scatter Plot

  1. Save the data set in the data.frame format:

car_efficiency <- data.frame(
  hp = c(100, 96, 115, 150, 126, 150, 118, 106),
  cylinder_volume = c(1.8, 1.5, 2.0, 2.4, 1.8, 2.5, 1.6, 1.5)
)
  2. Use ggplot2 to make a scatter plot with a fitted line:

library(ggplot2)

ggplot(car_efficiency, aes(x = cylinder_volume, y = hp)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "black", linewidth = 1) +
  labs(
    title = "2006 Fuel Efficiency",
    x = "Cylinder Volume (L)",
    y = "Horsepower (hp)"
  ) +
  theme_minimal()

The resulting plot is:

Fig. 13.7 Scatter plot of car engine performance data set

Q3: Identify the form of the relationship between the total displacement and the power output.

The points roughly follow a linear pattern. We don’t see curvature, clustering, or other non-linear patterns.

Q4: If the form is linear, state the direction and strength of the linear relationship. Are there any outliers? If there are, are the outliers influential?

  • The linear association is positive—as the cylinder volume increases, the power output tends to increase.

  • The strength is moderate—most points cluster reasonably close to the apparent trend line, though there is some scatter.

  • There are no obvious outliers or influential points. All data points fall within reasonable ranges in both directions and follow the general pattern.
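The visual impressions above can be backed by a numeric summary: the sample correlation quantifies the direction and strength of the linear association. The data frame is repeated here so the snippet runs on its own.

```r
# Sample correlation for the car data (data frame repeated for self-containment)
car_efficiency <- data.frame(
  hp = c(100, 96, 115, 150, 126, 150, 118, 106),
  cylinder_volume = c(1.8, 1.5, 2.0, 2.4, 1.8, 2.5, 1.6, 1.5)
)
cor(car_efficiency$cylinder_volume, car_efficiency$hp)  # positive and fairly strong
```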

13.1.4. Bringing It All Together

Key Takeaways 📝

  1. The regression model \(Y = g(X) + \varepsilon\) decomposes observations into systematic relationships plus unexplained variation.

  2. Most regression analyses can only establish association; causation requires well-designed experiments and advanced analysis methods.

  3. Scatter plots are the primary tool for assessing form, direction, and strength of bivariate relationships.

  4. There are two types of outliers in regression analysis. \(X\)-outliers lie far from most \(X\) values. \(Y\)-outliers lie far from the trend line of their association with \(X\).

  5. Outliers and influential points require special attention because they can dramatically affect fitted models and conclusions. \(X\)-outliers are more prone to being influential than \(Y\)-outliers.