13.1. Introduction to Linear Regression: Correlation and Scatter Plots
We’ve reached the final major topic of STAT 350: linear regression. Throughout the semester, we’ve studied relationships where one variable was quantitative and the other was categorical—comparing means across different groups. Now we’re moving to examine relationships between two quantitative variables, opening up an entirely new world of statistical modeling and inference.
Road Map
Problem we will solve – How to study and model relationships between two quantitative variables, distinguishing between association and causation while using scatter plots as our primary exploratory tool
Tools we’ll learn – The regression modeling framework \(Y = g(X) + \varepsilon\), scatter plot construction and interpretation, identification of form, direction, and strength in relationships
How it fits – This extends our statistical modeling from categorical explanatory variables (ANOVA) to quantitative explanatory variables, providing the foundation for prediction, correlation analysis, and understanding functional relationships
13.1.1. The Evolution of Our Statistical Journey
Before diving into linear regression, let’s reflect on the remarkable journey we’ve taken through statistical inference. Each major phase has built systematically toward this culminating topic.
From Single Populations to Multiple Relationships

Fig. 13.1 Review of single population inference methods showing the progression from basic hypothesis testing to confidence intervals
Single Population Inference: We began with the fundamental question of making inferences about an unknown population mean \(\mu\) based on sample data. Our model was elegantly simple:
where \(\varepsilon_i \sim N(0, \sigma^2)\) and \(i = 1, 2, \ldots, n\). This model captured the essential idea that each observation consists of an underlying mean plus random variation around that mean.
We developed the complete statistical inference framework:
Point estimation: \(\bar{X}\) as our best guess for \(\mu\)
Confidence intervals: \(\bar{x} \pm t_{\alpha/2, n-1} \frac{s}{\sqrt{n}}\)
Hypothesis testing: Test statistics comparing sample evidence to null hypotheses
Two Population Comparisons: We then extended our methods to compare two population means, handling both independent and dependent sampling scenarios.

Fig. 13.2 Review of two population inference methods including independent and paired procedures
For independent samples, our models became:
For paired samples, we modeled the differences directly:
where \(\varepsilon_i \sim N(0, \sigma_D^2)\).
Multiple Population Analysis: ANOVA extended these ideas to simultaneously compare \(k\) population means using the framework:

Fig. 13.3 Review of one-way ANOVA showing the decomposition of variance and F-test structure
where \(\varepsilon_{ij} \sim N(0, \sigma^2)\) and the F-test compares between-group to within-group variability.
Throughout this progression, we consistently worked with one quantitative variable and one categorical factor variable (with 1, 2, or \(k\) levels respectively).
The New Frontier: Two Quantitative Variables
Linear regression represents a fundamental paradigm shift. Instead of comparing groups defined by categorical variables, we’re now studying functional relationships between two quantitative variables. This opens up possibilities for:
Prediction: Using values of one variable to predict values of another
Understanding relationships: Quantifying how changes in one variable relate to changes in another
Model building: Creating mathematical descriptions of real-world phenomena
13.1.2. The Regression Framework: From Groups to Functions

Fig. 13.4 The general regression model showing how Y relates to X through a function plus error
The transition from group comparisons to regression involves a conceptual leap in how we think about our models. Instead of discrete group means, we now have a continuous functional relationship.
The General Regression Model
Our new modeling framework can be expressed as:
This deceptively simple equation contains profound ideas:
The Response Variable \(Y\) (also called the dependent variable) represents the outcome we’re trying to understand, predict, or explain. This is what we measure and observe varying across our study.
The Explanatory Variable \(X\) (also called the independent variable) represents the variable we believe may explain, influence, or predict changes in the response variable. We choose which variable plays this role based on our research questions and theoretical understanding.
The Regression Function \(g(X)\) defines the systematic relationship between the explanatory and response variables. This function captures the average tendency—how \(Y\) tends to change as \(X\) changes.
The Error Term \(\varepsilon\) represents unexplained variation—everything about \(Y\) that cannot be explained by the functional relationship with \(X\). This includes measurement error, other unmeasured variables, and inherent randomness in the phenomena we’re studying.
Understanding Association vs. Causation

Fig. 13.5 The crucial distinction between association and causation in regression analysis
This distinction is fundamental to proper interpretation of regression results.
Association means that two quantitative variables are related—changes in one variable are accompanied by systematic changes in the other variable. We can observe association through our data and statistical analysis.
Causation goes much further, claiming that one variable actually brings about changes in the other variable. Establishing causation requires much more than statistical analysis—it requires well-designed experiments with proper controls.
Why This Matters: Most regression analyses can only establish association, not causation. Just because we model \(Y\) as a function of \(X\) doesn’t mean \(X\) causes \(Y\). The modeling choice often reflects:
Practical considerations (which variable is easier to measure or control)
Temporal ordering (which variable comes first in time)
Research questions (which variable we want to predict)
Theoretical frameworks from the subject area
The Gold Standard: Establishing causation requires well-designed controlled experiments where:
The explanatory variable is deliberately manipulated by the researcher
Other potential influencing factors are controlled or randomized
Subjects are randomly assigned to different treatment levels
Temporal ordering clearly establishes cause before effect
Since most observational studies cannot meet these requirements, we focus on identifying and quantifying associations while being careful not to claim causation.
13.1.3. Scatter Plots: Visualizing Bivariate Relationships

Fig. 13.6 Definition and basic structure of scatter plots for two quantitative variables
Just as we used histograms and boxplots to explore single variables, scatter plots serve as our primary tool for exploring relationships between two quantitative variables.
The Anatomy of a Scatter Plot
A scatter plot displays the relationship between two quantitative variables by plotting data points in a two-dimensional coordinate system:
Horizontal axis (X-axis): Explanatory variable (independent variable)
Vertical axis (Y-axis): Response variable (dependent variable)
Each point: Represents one observation with coordinates \((x_i, y_i)\)
Critical First Step: Before creating any scatter plot, we must decide which variable should be explanatory and which should be response. This decision should be based on:
Research questions (what are we trying to predict or explain?)
Theoretical understanding (what might influence what?)
Practical considerations (what can be controlled or measured first?)
Temporal ordering (what comes first in time?)
13.1.4. A Detailed Example: Car Engine Performance

Fig. 13.7 Dataset of eight fuel-efficient four-cylinder vehicles from 2006 with cylinder volume and horsepower measurements
Let’s work through a comprehensive example using real data from a 2006 study of fuel-efficient vehicles.
Research Context: Automotive engineers collected data on eight four-cylinder vehicles judged to be among the most fuel-efficient in 2006. For each vehicle, they measured: - Cylinder volume (in liters): Total displacement of the engine - Horsepower (hp): Power output of the engine
Variable Selection Decision: Which variable should be explanatory and which should be response?
From an engineering perspective, the physical size of the engine (cylinder volume) largely determines its potential power output (horsepower). Larger engines generally have the capacity to produce more power, though other factors like engine design and tuning also matter. Therefore:
Explanatory variable (X): Cylinder volume (liters)
Response variable (Y): Horsepower (hp)
The Data:
Obs # |
Vehicle |
Cylinder Volume (L) |
Horsepower (hp) |
---|---|---|---|
1 |
Honda Civic |
1.8 |
100 |
2 |
Toyota Prius |
1.5 |
96 |
3 |
VW Golf |
2.0 |
115 |
4 |
VW Beetle |
2.4 |
150 |
5 |
Toyota Corolla |
1.8 |
126 |
6 |
VW Jetta |
2.5 |
150 |
7 |
Mini Cooper |
1.6 |
118 |
8 |
Toyota Yaris |
1.5 |
106 |
Constructing a Scatter Plot Manually
Step 1: Determine axis ranges - Cylinder volume ranges from 1.5L to 2.5L, so we’ll use 1.25L to 2.75L for good margins - Horsepower ranges from 96 hp to 150 hp, so we’ll use 90 hp to 160 hp
Step 2: Plot each observation Starting with the Honda Civic at coordinates (1.8, 100): - Find 1.8 on the horizontal axis - Move vertically to 100 on the vertical axis - Place a point at the intersection
Repeat this process for all eight vehicles to create the complete scatter plot.

Fig. 13.8 Manual construction of the scatter plot showing each vehicle plotted as a point
Creating Scatter Plots in R
For practical data analysis, we use software to create professional scatter plots:
# Create the dataset
car_efficiency <- data.frame(
hp = c(100, 96, 115, 150, 126, 150, 118, 106),
cylinder_volume = c(1.8, 1.5, 2.0, 2.4, 1.8, 2.5, 1.6, 1.5)
)
# Basic scatter plot
ggplot(car_efficiency, aes(x = cylinder_volume, y = hp)) +
geom_point() +
labs(
title = "2006 Fuel Efficiency Study",
x = "Explanatory Variable (X = Cylinder Volume in Liters)",
y = "Response Variable (Y = Horsepower)"
)
Adding a fitted line:
# Scatter plot with regression line
ggplot(car_efficiency, aes(x = cylinder_volume, y = hp)) +
geom_point(size = 3, color = "blue") +
geom_smooth(method = "lm", se = FALSE, color = "black", size = 1) +
labs(
title = "2006 Fuel Efficiency Study",
x = "Cylinder Volume (L)",
y = "Horse Power (hp)"
) +
theme_minimal()

Fig. 13.9 Professional scatter plot created in R with a fitted regression line
13.1.5. Information Conveyed by Scatter Plots

Fig. 13.10 Overview of the key information that scatter plots reveal about bivariate relationships
Scatter plots are remarkably rich sources of information about the relationship between two quantitative variables. They help us assess multiple aspects simultaneously:
Primary Functions: 1. Pattern identification: Is there a relationship between the variables? 2. Form assessment: What type of relationship exists (linear, curved, etc.)? 3. Direction evaluation: How do the variables relate (positive, negative, no association)? 4. Strength measurement: How closely do the points follow the pattern? 5. Outlier detection: Are there unusual observations that don’t fit the pattern?
Form: The Shape of the Relationship

Fig. 13.11 Categories of relationship forms that can be identified in scatter plots
Form refers to the general shape or pattern of the association between the two variables. This fundamental assessment determines what types of analyses are appropriate.
Linear Form: Points roughly follow a straight line pattern. This is the ideal scenario for simple linear regression, which we’ll focus on in this course.
Non-linear (Curved) Form: Points follow some type of curved pattern. Examples include:

Fig. 13.12 Examples of different non-linear relationships including exponential, polynomial, logarithmic, and sinusoidal patterns
Exponential relationships: Growth starts slowly then accelerates rapidly
Polynomial relationships: Parabolic or higher-order curves with multiple direction changes
Logarithmic relationships: Rapid initial growth that levels off
Sinusoidal relationships: Cyclical or wave-like patterns
Clustered Form: Points group into distinct clusters rather than following a smooth pattern. This suggests the presence of subgroups or categories within the data.
No Pattern: Points appear randomly scattered with no discernible relationship. This suggests the explanatory variable provides little or no information about the response variable.
Threshold or Breakpoint Patterns: The relationship changes character at certain values, requiring different functional forms in different regions.
For This Course: We’ll focus primarily on linear relationships. When we encounter non-linear patterns, we’ll discuss potential transformations that might linearize the relationship, but advanced non-linear modeling is beyond our scope.
Direction: The Nature of Linear Relationships

Fig. 13.13 Explanation of positive and negative associations in linear relationships
When we’ve identified a linear form, we can characterize the direction of the relationship:
Positive Association (Direct Relationship):
Indicated by an upward trend in the scatter plot
As the explanatory variable \(X\) increases, the response variable \(Y\) tends to increase on average
The variables move “together” in the same direction
Example: As cylinder volume increases, horsepower tends to increase
Negative Association (Inverse Relationship):
Indicated by a downward trend in the scatter plot
As the explanatory variable \(X\) increases, the response variable \(Y\) tends to decrease on average
The variables move in “opposite” directions
Example: As car weight increases, fuel efficiency tends to decrease

Fig. 13.14 Multiple examples showing the spectrum from strong positive to strong negative linear associations
Important Note: Direction only applies when we have an identifiable linear pattern. For non-linear relationships, the direction might change across different regions of the relationship.
Strength: How Closely Points Follow the Pattern

Fig. 13.15 Explanation of strength as a measure of how tightly points cluster around the best-fitting line
Strength measures how closely the observed points cluster around the underlying pattern. For linear relationships, this means how tightly the points cluster around the best-fitting straight line.
Strong Association:
Points cluster very closely around the line
Little scatter or deviation from the linear pattern
The linear relationship explains most of the variation in the response variable
High predictive value
Moderate Association:
Points show clear linear trend but with noticeable scatter
The relationship is evident but not overwhelmingly strong
Moderate predictive value
Weak Association:
Points show a slight linear trend but with substantial scatter
The relationship is barely detectable
Low predictive value
May not be practically useful for prediction
No Association:
Points appear randomly scattered
No discernible pattern or trend
The explanatory variable provides no useful information about the response variable
Perfect Association (rarely seen in real data):
All points lie exactly on a straight line
No random variation around the pattern
Deterministic relationship
Visual Assessment: We can estimate strength by imagining how close the points come to lying on a straight line. The closer they cluster, the stronger the association.
13.1.6. Outliers and Influential Points in Bivariate Data

Fig. 13.16 Definition of outliers and influential points in the context of two quantitative variables
When we move from single variables to relationships between variables, the concepts of outliers and unusual observations become more complex and more important.
Understanding Outliers in Two Dimensions
Outlier: An observation that lies outside the overall pattern of the relationship. Unlike single-variable outliers, bivariate outliers must be assessed relative to the relationship pattern, not just the individual variable distributions.
An observation can be an outlier in several ways:
X-direction outlier: Unusual value of the explanatory variable
Y-direction outlier: Unusual value of the response variable
Pattern outlier: Falls far from the trend line even if both X and Y values are reasonable individually
Combined outlier: Unusual in multiple dimensions simultaneously
Influential Points: Impact on the Fitted Line
Influential Point: An observation that has a large impact on the fitted regression line. Removing this point would substantially change the slope, intercept, or both.
Key Insight: Not all outliers are influential, and not all influential points are outliers.
High Leverage: Points with extreme X-values have the potential to be influential because they can “pull” the regression line toward them. However, they’re only actually influential if they don’t follow the general pattern established by the other points.
Examples of Different Scenarios:

Fig. 13.17 Examples showing non-influential outliers vs highly influential points using the car efficiency data
Scenario 1: Non-Influential Outlier
Imagine adding a go-kart with 0.15L cylinder volume and 34 hp to our car dataset:
This point is an outlier (very different from other vehicles)
It’s not influential because it follows the same general linear trend
The fitted line barely changes when this point is included
Scenario 2: Highly Influential Point
Imagine adding a “Mario-Go-Kart with star power” with 0.15L cylinder volume and 100 hp:
This point dramatically violates the established pattern
Despite the small engine, it has extremely high power
Including this point substantially changes the fitted line’s slope and intercept
This single point has disproportionate influence on our conclusions
Practical Implications:
Always examine scatter plots for potentially influential points
Investigate unusual observations—are they data entry errors, measurement mistakes, or genuinely different phenomena?
Consider analyzing data both with and without influential points to assess sensitivity
Document decisions about handling unusual observations
13.1.7. Interpreting Our Car Efficiency Example
Let’s apply our scatter plot analysis framework to the car efficiency data:
Form Assessment: The points roughly follow a straight line pattern, suggesting a linear relationship is appropriate. We don’t see curvature, clustering, or other non-linear patterns.
Direction Assessment: Clear positive association—as cylinder volume increases, horsepower tends to increase. This matches our engineering intuition that larger engines generally produce more power.
Strength Assessment: Moderate to strong linear association. Most points cluster reasonably close to the apparent trend line, though there’s some scatter. The relationship appears useful for prediction but not perfect.
Outlier Assessment: No obvious outliers in this dataset. All vehicles fall within reasonable ranges for both variables and follow the general pattern.
Functional Form Conclusion: A linear regression model appears appropriate for this relationship. We could reasonably model: Horsepower = \(\beta_0 + \beta_1 \times\) Cylinder Volume + Error
13.1.8. When Linear Models Are Not Appropriate
It’s crucial to recognize when our linear regression tools won’t work well:
Strong Non-Linear Patterns: If the scatter plot shows clear curvature, exponential growth, or other non-linear patterns, linear regression will provide poor fits and misleading conclusions.
Multiple Clusters: If points group into distinct clusters, we might need separate models for each cluster or classification methods rather than regression.
No Relationship: If points appear randomly scattered, there’s no relationship to model.
Extreme Outliers or Influential Points: These can dominate the analysis and lead to misleading conclusions.
Alternative Approaches:
Transformations: Sometimes transforming variables (log, square root, etc.) can linearize relationships
Non-linear regression: More advanced techniques for curved relationships
Classification methods: For clustered data
Multiple regression: When we need multiple explanatory variables
13.1.9. Bringing It All Together
This visual analysis of scatter plots provides the foundation for everything we’ll do in linear regression. Just as we learned to interpret histograms before developing confidence intervals for means, we need to master scatter plot interpretation before developing regression inference procedures.
What’s Coming Next:
Simple linear regression model: Formalizing \(Y = \beta_0 + \beta_1 X + \varepsilon\)
Least squares estimation: Finding the “best” line through the data
Correlation coefficients: Quantifying the strength of linear relationships
Regression inference: Hypothesis tests and confidence intervals for regression parameters
Prediction and prediction intervals: Using our models to make predictions with appropriate uncertainty quantification
The Bigger Picture: Linear regression extends all our familiar statistical concepts:
Point estimation: Estimating slope and intercept parameters
Confidence intervals: For individual parameters and predictions
Hypothesis testing: Testing whether relationships exist
Model assumptions: Extending normality and independence concepts
Diagnostic methods: Checking whether our models are appropriate
Just as ANOVA extended two-sample t-tests to multiple groups, linear regression extends our modeling framework from categorical to quantitative explanatory variables, opening up vast new possibilities for understanding and predicting relationships in the world around us.
Key Takeaways 📝
Linear regression represents a paradigm shift from comparing groups defined by categorical variables to modeling functional relationships between quantitative variables.
The regression framework \(Y = g(X) + \varepsilon\) decomposes observations into systematic relationships plus unexplained variation, extending our familiar modeling approach.
Association and causation are fundamentally different: Most regression analyses can only establish association; causation requires well-designed controlled experiments.
Variable role assignment matters: We must thoughtfully decide which variable serves as explanatory and which as response based on research questions and theoretical understanding.
Scatter plots are our primary exploratory tool for assessing form, direction, and strength of bivariate relationships before formal modeling.
Form assessment determines model appropriateness: Linear, non-linear, clustered, or no pattern each require different analytical approaches.
Direction and strength characterize linear relationships: Positive/negative direction and strong/moderate/weak strength guide interpretation and prediction utility.
Outliers and influential points require special attention in bivariate data because they can dramatically affect fitted models and conclusions.
Visual analysis guides formal analysis: Just as histograms informed our understanding of single variables, scatter plots inform our approach to relationship modeling.
Linear regression extends familiar concepts: Point estimation, confidence intervals, hypothesis testing, and model assumptions all have regression analogs we’ll develop.
Exercises
Variable Role Assignment: For each research scenario, identify which variable should be explanatory (X) and which should be response (Y), and justify your choice:
A study of house prices and square footage of living space
Research on student GPA and hours spent studying per week
Investigation of plant height and amount of fertilizer applied
Analysis of blood pressure and age in adults
Study of company stock prices and quarterly earnings
Scatter Plot Interpretation: Examine each described scatter plot and assess the form, direction, and strength:
Points cluster tightly around a line that slopes upward from left to right
Points show a clear upward trend but with considerable scatter around the apparent line
Points form a cloud with no apparent pattern or trend
Points follow a curved pattern that rises steeply at first then levels off
Points show a clear downward trend with minimal scatter
Association vs. Causation: For each statement, determine whether it describes association, causation, or neither, and explain your reasoning:
“Students who study more hours per week tend to have higher GPAs”
“Increasing the temperature causes water to evaporate faster”
“Ice cream sales and swimming pool accidents both increase in summer”
“Taking aspirin reduces the risk of heart attack in controlled clinical trials”
“Taller people tend to weigh more than shorter people”
Outlier and Influence Analysis: Consider a dataset relating years of education to annual income. Describe how each of these observations might affect the analysis:
A person with 12 years of education earning $200,000 per year
A person with 20 years of education earning $40,000 per year
A person with 8 years of education earning $30,000 per year
A person with 16 years of education earning $60,000 per year
Data Collection and Design: A researcher wants to study the relationship between daily exercise time and weight loss over a 6-month period.
How should the researcher assign variable roles?
What type of study design would be needed to establish causation?
What confounding variables might affect this relationship?
How might the researcher control for these confounding factors?
R Implementation: Using the car efficiency dataset provided in the chapter:
Create a basic scatter plot with appropriate labels
Add a fitted line to visualize the trend
Identify which vehicle appears most unusual relative to the overall pattern
Predict what would happen to the fitted line if the Toyota Yaris were removed
Form Recognition: For each described relationship, suggest an appropriate functional form:
Bacteria population growth in ideal conditions over time
Learning curve showing skill improvement with practice (rapid initial gains that slow down)
Relationship between car speed and fuel efficiency (optimal speed in the middle range)
Depreciation of car value over time
Relationship between hours of sleep and test performance (too little or too much both harmful)
Real-World Applications: Choose a topic from your field of study or personal interest and:
Identify two quantitative variables that might be related
Hypothesize the form, direction, and strength of their relationship
Discuss whether you expect the relationship to be causal or merely associational
Identify potential confounding variables that might affect the relationship
Suggest how you might collect data to study this relationship