Linear Regression Analysis serves as a fundamental statistical technique for understanding the relationship between a dependent variable and one or more independent variables. This method is extensively used for prediction, trend analysis, and exploring how variables interconnect. It helps in determining the extent to which independent variables are associated with a dependent variable, making it a key tool in both research and practical applications.

The base R package includes functions such as lm() for conducting linear regression analysis and summary() for obtaining detailed reports on the model. These reports include regression coefficients, the R-squared value, and statistical tests of the model parameters. These functions are integral for assessing the model’s fit to the data, testing hypotheses about the relationships among variables, and making informed predictions. lm() is R’s primary function for fitting linear models. It estimates the regression coefficients that define the best-fitting line (or hyperplane, in the case of multiple regression) based on the least squares method. While we will use lm() for simple linear regression, lm() also handles multiple regression, facilitating the examination of linear relationships among several explanatory variables and the response.

ggplot2 will also be used for creating diagnostic plots, analyzing residuals, and identifying influential data points, which are essential for verifying model assumptions and overall reliability. For specific insights into the confidence of model estimates, the confint() function is key. It provides confidence intervals for model parameters, offering a range within which the true parameter values are expected to lie. For added analysis, the cor() function is useful for assessing the correlation between two quantitative variables, the predict() function generates predictions and confidence intervals for new data points, and anova() can be used to obtain the Analysis of Variance table. These features enrich the analysis process, offering a comprehensive suite of tools for linear regression.

Introduction to Simple Linear Regression

The lm() function in R is used for fitting linear models. Its general usage is:

lm(formula, data, subset, weights, na.action,
   method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
   singular.ok = TRUE, contrasts = NULL, offset, ...)
  • formula: An object of class “formula”: a symbolic description of the model to be fitted. The typical form is response ~ predictors, detailing how the response variable depends on one or more predictors.
  • data: Optionally, a data frame, list, or environment containing the variables in the model. If not found in data, the variables are taken from the environment.
  • subset: Optionally, a vector specifying a subset of observations to be used in the fitting process.
  • weights: Optionally, a vector of weights for weighted least squares fitting. If non-NULL, it enables minimizing the sum of weighted squared residuals.
  • na.action: Specifies what should happen when the data contain NAs. Options include na.omit and na.exclude, among others.

In practice, you’ll often use lm() with just the first two arguments, formula and data, especially for straightforward simple linear regression analyses:

fit <- lm(response ~ predictor1 + predictor2, data = my_data)
summary(fit)

This formula indicates that the response variable is modeled as a function of predictor1 and predictor2. The summary() function then provides a detailed report of the model.

What Does summary() Do?

The summary() function, when applied to a linear model object created by lm(), yields a comprehensive overview of the regression analysis. This summary encapsulates several key components crucial for interpreting the model’s effectiveness and the relationships between variables (each component can also be extracted in code, as sketched after this list):

  • Coefficients: A table listing the regression coefficients for each predictor, including the intercept. This section also provides the standard error for each coefficient, the t-value, and the p-value, helping to assess the statistical significance of each predictor.
  • Residuals: Statistics of the residuals, such as the minimum, maximum, median, and quartiles, offering insights into the distribution and potential skewness of the model errors.
  • R-squared (\(R^2\)): A statistic that measures the proportion of variability in the dependent variable that can be explained by the model. It ranges from 0 to 1, with higher values indicating a better fit of the model to the data.
  • Adjusted R-squared: Adjusted for the number of predictors in the model, providing a more accurate measure when comparing models with different numbers of predictors.
  • F-statistic: A measure to assess the overall significance of the model. It compares the model with no predictors (intercept only) to the specified model, testing the null hypothesis that all regression coefficients are equal to zero.
  • \(p\)-value (Pr(>F)): The p-value associated with the F statistic. It checks if the explained variance in the dependent variable by the model is significantly greater than the unexplained variance, under the null hypothesis that all regression coefficients (except the intercept) are equal to zero.

Scatter plot

A scatter plot is a fundamental tool in statistical analysis for visualizing the relationship between two continuous variables. It displays the paired values of the two variables, one on each axis, allowing for a quick look at a possible linear or non-linear relationship between them. In R, the ggplot2 package offers versatile capabilities for creating scatter plots, enhancing them with regression lines, smooth lines (to capture non-linear trends), and grouping by categories if needed. Here’s a simple guide on creating a scatter plot using ggplot2.

Assuming you have a dataset data with two continuous variables varX (independent variable) and varY (dependent variable), you can create a scatter plot to explore their relationship.

ggplot(data, aes(x = varX, y = varY)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  ggtitle("Scatter Plot of varY vs. varX") +
  xlab("varX") +
  ylab("varY")
  • geom_point(): This function adds points to the plot, with the position of each point determined by the values of varX and varY for each observation in the dataset.
  • geom_smooth(): Adds a smoothed line to the scatter plot to help visualize the trend between varX and varY. By specifying the method (e.g., method = lm for linear regression), it fits a model to the data and plots the resulting line/curve. The se parameter controls whether a confidence interval around the smooth is shown (se = FALSE omits the confidence interval).

Sample Correlation

The cor() function in R calculates the correlation coefficient between two variables, quantifying the degree to which they are related. The correlation coefficient is a numerical measure that can range from -1 to 1, where values closer to -1 or 1 indicate a stronger negative or positive linear relationship, respectively, and values near 0 suggest little to no linear relationship.

cor(x, y = NULL, use = "everything",
    method = c("pearson", "kendall", "spearman"))
  • x, y: Vectors, matrices, or data frames of variables to compute the correlation. If y is NULL, x must be a matrix or data frame where the correlation will be calculated for all pairs of columns.
  • use: Determines how missing values (NAs) are handled. Options include “everything” (NAs propagate to the result), “all.obs” (an error is raised if any values are missing), “complete.obs” (only complete cases are used), “na.or.complete” (like “complete.obs”, but returns NA rather than an error when there are no complete cases), and “pairwise.complete.obs” (each pair of variables is computed from the observations complete for that pair).
  • method: Specifies the method of correlation:
    • pearson: Measures the linear relationship between two continuous variables. It assumes the data are linearly related. The Pearson correlation is the one we use in this course; its formula is shown after this list.
    • kendall: Uses Kendall’s tau coefficient, a non-parametric measure. It assesses how well the relationship between two variables can be described using a monotonic function.
    • spearman: Similar to Kendall’s tau, Spearman’s rho is designed to measure the strength of a monotonic relationship between two variables.
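For reference, the sample Pearson correlation coefficient between variables \(x\) and \(y\) is

\[ r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}, \]

which is the quantity returned by cor() with the default method = "pearson".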

Residual plots

Residual plots are an essential diagnostic tool in regression analysis, designed to evaluate the fit of a regression model by plotting the residuals (the differences between observed and predicted values) against the explanatory variable(s). These plots are useful in detecting issues like non-linearity, heteroscedasticity (non-constant variance of residuals), and outliers. The ggplot2 package in R can be utilized to produce residual plots.

Assuming your dataset data includes the explanatory variable varX and the dependent variable varY, and you have fitted a linear model fit, you can generate a residual plot in the following manner:

fit <- lm(varY ~ varX, data = data)

data$resids <- fit$residuals

# Generate the residual plot
ggplot(data, aes(x = varX, y = resids)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "solid", color = "blue") +
  ggtitle("Residual Plot vs. Explanatory Variable") +
  xlab("Explanatory Variable (varX)") +
  ylab("Residuals")
  • geom_point(): This function plots the residuals for each value of the explanatory variable, enabling the visualization of any patterns in the residual distribution.
  • geom_hline(yintercept = 0): Adds a horizontal reference line at zero, aiding in the assessment of whether residuals are evenly dispersed around the zero line; this serves as a visual gauge for model fit.

Explanatory Variable vs. Residuals: Plotting residuals against the explanatory variable is a common way to check the model’s assumptions, including the linearity of the relationship between the explanatory and dependent variables and the constancy of the residual variance (homoscedasticity). These plots are a straightforward yet powerful method for visually inspecting the assumptions of linear regression models: by examining the patterns in the residuals, researchers can judge the adequacy of the model fit and identify whether any modifications are needed to improve it.

Histograms and QQ plots (quantile-quantile plots) also play an important role in the analysis of residuals. Both are used to examine the distribution of residuals and to check for deviations from normality, which is a key assumption of linear regression models. As you are familiar with histograms and QQ plots, we will not discuss them here.

Confidence Intervals for Model Parameters

The confint() function in R is primarily used to calculate confidence intervals for the parameters of a fitted model. This is particularly useful in linear regression models, where it generates intervals for the regression coefficients, including both the slope(s) and intercept.

How It Works

Confidence intervals provide a range of values for each parameter. These ranges are constructed to contain the true parameter value with a certain confidence level, often 95%. The calculation of these intervals is based on the estimate of the parameter, its standard error, and the critical value from the t-distribution corresponding to the desired level of confidence.

confint(object, parm, level = 0.95)
  • object: This is the fitted model object from which confidence intervals for parameters are to be computed. The object could be the result of a wide range of model fitting functions, reflecting the generic nature of confint().
  • parm: Specifies which model parameters to compute confidence intervals for. It can be a vector of parameter names or indices. If omitted, intervals for all parameters in the model are calculated.
  • level: Defines the confidence level for the interval calculation. The default of \(0.95\) indicates \(95\%\) confidence intervals, standard in many statistical analyses.

For linear model (lm) objects, confint() constructs confidence intervals based on t-critical values, reflecting the distribution of coefficients under the assumption of normality or t-distribution for smaller samples.

Assuming your dataset data includes the explanatory variable varX and the dependent variable varY, and you have fitted a linear model fit, you can construct confidence intervals for the model parameters in the following manner:

fit <- lm(varY ~ varX, data = data)
confint(fit, level = 0.95)
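As a check on the t-based construction described above, the interval for the slope can be reproduced by hand from the coefficient table. A minimal sketch, using the same hypothetical varX and fit as above:

est <- coef(summary(fit))["varX", "Estimate"]     # slope estimate
se  <- coef(summary(fit))["varX", "Std. Error"]   # standard error of the slope
df  <- df.residual(fit)                           # residual degrees of freedom (n - 2)
est + c(-1, 1) * qt(0.975, df) * se               # should match the confint() row for varX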

Confidence Intervals for the Mean Response at a Point and Prediction Intervals

The predict() function is designed to make predictions based on a fitted model. In the context of linear regression, it can generate predicted values of the dependent variable for new observations of the independent variables. It can also provide confidence intervals or prediction intervals for these predicted values.

predict(object, newdata, se.fit = FALSE, scale = NULL, df = Inf,
        interval = "none", level = 0.95, type = "response",
        terms = NULL, na.action = na.pass,
        pred.var = res.var/weights, weights = 1)
  • object: A linear model object created using lm(). This object contains the coefficients and structure of the model derived from previously analyzed data.
  • newdata: Optionally, a data frame of new observations. This data frame must have columns that match the explanatory variables used in the model. If newdata is not provided, the function will return fitted values for the original data used to fit the model.
  • se.fit: Indicates whether standard errors of the predictions should be returned. This is useful for constructing confidence intervals around the predicted values.
  • interval: Specifies the type of interval to compute for the predictions. Options include “none”, “confidence” for confidence intervals around the mean prediction, and “prediction” for intervals that cover future observations.
    • Confidence Intervals: For each predicted value, a confidence interval indicates the range within which the true mean response is expected to fall, given the predictor values.
    • Prediction Intervals: Offers a range within which future observations are likely to fall, accounting for the error in estimating the mean response and the natural variability of the response variable.
  • level: Sets the confidence level for the confidence or prediction intervals.
  • pred.var: Allows for customization of the variance used in prediction intervals, particularly relevant when predicting future observations.

Assuming your dataset data includes the explanatory variable varX and the dependent variable varY, you have fitted a linear model fit, and C is the confidence coefficient, you can construct confidence intervals for the mean response and prediction intervals for new observations in the following manner:

fit <- lm(varY ~ varX, data = data)
new_data <- data.frame(varX = c(1, 2, 3, 4))
# For mean response
predict(fit, newdata = new_data, interval = "confidence", level = C)
# For individual predictions
predict(fit, newdata = new_data, interval = "prediction", level = C)

Example: Job Stress and Locus of Control

Job Stress and Locus of Control. Many factors, such as the type of job, education level, and job experience, can affect the stress felt by workers on the job. Locus of control (LOC) is a term in psychology that describes the extent to which a person believes he or she is in control of the events that influence his or her life. Is feeling “more in control” associated with less job stress? A recent study examined the relationship between LOC and several work-related behavioral measures among certified public accountants in Taiwan. LOC was assessed using a questionnaire that asked respondents to select one of two options for each of 23 items. Scores ranged from \(0\) to \(23\). Individuals with low LOC believe that their own behavior and attributes determine their rewards in life. Those with high LOC believe that these rewards are beyond their control. We will consider a random sample of 100 accountants.

CA 9a: Exploring and Modeling the Association between LOC and Stress

loc.df <- read.csv("Data/loc.csv")

Scatterplot of LOC versus STRESS

Visualize the relationship between LOC and job stress using a scatter plot. This step is crucial for getting a preliminary sense of the nature of the relationship between the variables.

ggplot(loc.df, aes(x=LOC, y=STRESS))+ 
  geom_point(size = 3) + 
  geom_smooth(method = lm, se = FALSE, lwd = 2) +
  ggtitle("Relationship between Stress and LOC") + 
  xlab("Locus") + 
  ylab("Stress")
## `geom_smooth()` using formula = 'y ~ x'

After recognizing that there appears to be some linear association, we can measure the correlation.

Sample Pearson Correlation

cor(loc.df$LOC, loc.df$STRESS)
## [1] 0.3122765

The correlation coefficient between Stress and LOC is \(0.3122765\). This suggests a weak but non-negligible positive association between Stress and LOC.

Fit Least-Squares Regression Line

fit <- lm(STRESS ~ LOC, data = loc.df) 
summary(fit)
## 
## Call:
## lm(formula = STRESS ~ LOC, data = loc.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.04704 -0.33806  0.02169  0.30798  1.06715 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.25550    0.14691  15.353  < 2e-16 ***
## LOC          0.03991    0.01226   3.254  0.00156 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4513 on 98 degrees of freedom
## Multiple R-squared:  0.09752,    Adjusted R-squared:  0.08831 
## F-statistic: 10.59 on 1 and 98 DF,  p-value: 0.001562

The equation of the least-squares regression line is found using the coefficients from the Estimate column of the output above. The intercept \(b_0=2.25550\) and the slope \(b_1=0.03991\), resulting in the least-squares regression line: \[\hat{y}=2.25550+0.03991x_{\text{LOC}}\].

The mean squared error is the square of the residual standard error: MSE \(=0.4513^2=0.2036717\).

The coefficient of determination \(R^2\) is indicated by the Multiple R-squared value \(R^2=0.09752\).
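These quantities can also be pulled directly from the fitted object rather than read off the printed summary. A short check using the fit object above:

coef(fit)                # intercept b0 and slope b1
summary(fit)$sigma^2     # MSE = (residual standard error)^2 = 0.4513^2
summary(fit)$r.squared   # coefficient of determination R^2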

ANOVA Table

The ANOVA table can provide more details regarding the sums of squares, degrees of freedom, mean squares, F-test statistic, and \(p\)-value. We can also confirm the MSE and \(R^2\) values from the output.

anova(fit)
## Analysis of Variance Table
## 
## Response: STRESS
##           Df  Sum Sq Mean Sq F value   Pr(>F)   
## LOC        1  2.1565 2.15651  10.589 0.001562 **
## Residuals 98 19.9578 0.20365                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The mean squared error MSE is given directly in the table, in row Residuals and column Mean Sq: MSE \(=0.20365\).

The coefficient of determination \(R^2\) can be calculated from the sums of squares: \(R^2=\frac{\text{SSR}}{\text{SST}}=\frac{\text{SSR}}{\text{SSR}+\text{SSE}}=\frac{2.1565}{2.1565+19.9578}=0.09752\).
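The same two quantities can be recomputed from the stored ANOVA table by indexing its rows and columns; a short sketch:

aov.tab <- anova(fit)
aov.tab["Residuals", "Mean Sq"]                        # MSE
aov.tab["LOC", "Sum Sq"] / sum(aov.tab[, "Sum Sq"])    # R^2 = SSR / SST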

CA 9b: Diagnostics: Checking Model Assumptions and Model Tests

Checking Linearity and Constant Variance Assumptions: Scatterplot & Residual Plot

Visualize the relationship between LOC and job stress using a scatter plot.

ggplot(loc.df, aes(x=LOC, y=STRESS))+ 
  geom_point(size = 3) + 
  geom_smooth(method = lm, se = FALSE, lwd = 2) +
  ggtitle("Relationship between Stress and LOC") + 
  xlab("Locus") + 
  ylab("Stress")
## `geom_smooth()` using formula = 'y ~ x'

The scatter plot indicates a linear association, as the points increase in a positive linear trend. There might be minor violations of constant variance, as the variability looks smaller at the ends than at the center. However, this may simply be due to fewer observations at the far ends. No outliers are present in the scatter plot.

fit <- lm(STRESS ~ LOC, data = loc.df) 
loc.df$resids <- fit$residuals

ggplot(loc.df, aes(x=LOC, y=resids))+ 
  geom_point(size = 3) + 
  geom_hline(yintercept = 0, linetype = "solid", color = "blue", lwd = 2) +
  ggtitle("Residual Plot") + 
  xlab("Locus") + 
  ylab("Residuals")

The residual plot shows no patterns, and the points occur randomly above and below the \(x\)-axis, indicating a linear association. There might be minor violations of constant variance, as the variability looks smaller at the ends than at the center; however, this may simply be due to fewer observations at the far ends, which makes the true variability harder to assess. No outliers are present in the residual plot.

Checking Normality Assumption: Histogram and QQ-plot

n_bins <- round(max(sqrt(nrow(loc.df))+2, 5))
xbar.resids <- mean(loc.df$resids) 
s.resids <- sd(loc.df$resids)

ggplot(loc.df, aes(x = resids)) +
  geom_histogram(aes(y = after_stat(density)), bins = n_bins, fill = "grey", col = "black") + 
  geom_density(col = "red", lwd = 1) + 
  stat_function(fun = dnorm, args = list(mean = xbar.resids, sd = s.resids), col="blue", lwd = 1) + 
  ggtitle("Histogram of Residuals")

ggplot(loc.df, aes(sample=resids)) + 
  stat_qq() + 
  geom_abline(slope = s.resids, intercept = xbar.resids) + 
  ggtitle("QQ Plot of Residuals")

The residuals appear to be normal: on the normal probability plot the points are close to the line without systematic deviation. The histogram reveals a symmetric, unimodal pattern, and the red curve (estimated kernel density) and the blue curve (estimated normal density) on the histogram are close to each other.

Assuming that we have an SRS, two of the assumptions are met: a linear relationship (scatterplot and residual plot) and normality of the residuals (histogram and normal probability plot of the residuals). Although the constant standard deviation of the residuals (scatterplot and residual plot) is questionable, we will continue the analysis, keeping in mind that the assumptions for linear regression analysis are not fully met.

Confidence Intervals for Parameters

Since the assumptions are reasonable and it appears there is a linear association, we will proceed to provide confidence intervals for the parameters of the model at the \(95\%\) level.

confint(fit, level = 0.95)
##                  2.5 %     97.5 %
## (Intercept) 1.96395317 2.54704023
## LOC         0.01557099 0.06424615

slope \(95\%\) CI: \((0.01557099, 0.06424615)\)

We are \(95\%\) confident that the population slope of Stress versus LOC is covered by the interval from \(0.01557099\) to \(0.06424615\). Note that the slope is indicated by the name of the x variable in the output, which is LOC in this example.

y-intercept \(95\%\) CI: \((1.96395317, 2.54704023)\)

We are \(95\%\) confident that the population y-intercept of Stress versus LOC is covered by the interval from \(1.96395317\) to \(2.54704023\).

Model Hypothesis Test F-test

Since it appears there is a linear association between Stress and LOC, we will perform a formal test of significance. We can use the overall model F-test or a test of whether the slope is \(0\).

## 
## Call:
## lm(formula = STRESS ~ LOC, data = loc.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.04704 -0.33806  0.02169  0.30798  1.06715 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.25550    0.14691  15.353  < 2e-16 ***
## LOC          0.03991    0.01226   3.254  0.00156 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4513 on 98 degrees of freedom
## Multiple R-squared:  0.09752,    Adjusted R-squared:  0.08831 
## F-statistic: 10.59 on 1 and 98 DF,  p-value: 0.001562

Step 1: Definition of the terms: not required

Step 2: State the hypotheses:

  • \(H_0\): There is no association between Stress and LOC.
  • \(H_a\): There is an association between Stress and LOC.

Step 3: Find the Test Statistic, p-value, report df

  • \(F_{\text{ts}} = 10.59\)
  • \(\text{df}_1 = 1\)
  • \(\text{df}_2 = 98\)
  • \(p\text{-value} = 0.001562\)

Step 4: Conclusion:

Since \(p\text{-value}=0.001562 \le 0.05=\alpha\), we should reject \(H_0\).

The data provide evidence (\(p\text{-value}=0.001562\)) to support the claim that there is an association between Stress and LOC.
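The reported p-value can be verified from the F statistic and its degrees of freedom using the F distribution; a quick check:

pf(10.59, df1 = 1, df2 = 98, lower.tail = FALSE)   # approximately 0.00156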

Model Hypothesis Test of Slope

Step 1: Definition of the terms

  • \(\beta_1\) is the population slope of STRESS versus LOC.

Step 2: State the hypotheses

  • \(H_0: \beta_1=0\)
  • \(H_a: \beta_1\neq 0\)

Step 3: Find the Test Statistic, p-value, report DF

  • \(t_{\text{ts}} = 3.254\)
  • \(\text{df} = 98\)
  • \(p\)-value \(= 0.00156\)

Step 4: Conclusion:

Since \(p\text{-value}=0.00156 \le 0.05=\alpha\), we should reject \(H_0\).

The data provide evidence (\(p\text{-value}=0.00156\)) to support the claim that there is an association between STRESS and LOC.

OR

The data provide evidence (\(p\text{-value}=0.00156\)) to support the claim that the slope of STRESS versus LOC is not zero.

It is important to note that, in simple linear regression, both tests provide the same result; a quick numerical check is shown below.
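In simple linear regression, the t statistic for the slope squared equals the F statistic, and the two p-values agree:

3.254^2                                      # approximately 10.59, the F statistic
2 * pt(3.254, df = 98, lower.tail = FALSE)   # approximately 0.00156, the two-sided p-value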

CA 9c: Utilizing the Model for New Values

A psychologist decides to use the model to predict the stress level of a client who works as a certified public accountant. After the client completed the series of questions for measuring LOC it was found that the score was \(12\). Obtain a \(95\%\) confidence interval for the mean response at this point LOC\(=12\) and a \(95\%\) prediction interval around the prediction at this new value of LOC.

Confidence Interval for the Mean Response at a Point

newdata <- data.frame(LOC = 12) 
predict(fit, newdata, interval = "confidence", level = 0.95)
##      fit      lwr      upr
## 1 2.7344 2.643662 2.825137

The confidence interval for the mean response is: \((2.643662, 2.825137)\)

We are \(95\%\) confident that the population mean response of Stress when LOC is \(12\) is covered by the interval \((2.643662, 2.825137)\).

Prediction Interval for a New Point

newdata <- data.frame(LOC = 12) 
predict(fit, newdata, interval = "prediction", level = 0.95)
##      fit      lwr      upr
## 1 2.7344 1.834271 3.634528

The prediction interval is: \((1.834271, 3.634528)\)

We are \(95\%\) confident that the new measurement of Stress when LOC is \(12\) will be covered by the interval \((1.834271, 3.634528)\).

It is important to observe that the prediction interval is larger than the confidence interval for the mean at the same point. If you were interested in the average stress level at an LOC of \(12\), you would look at the confidence interval for the mean at that point. However, if you wanted to predict the stress level of the next accountant with a locus of control of \(12\), you would need to look at the prediction interval. Without more familiarity with the typical range of these stress values, it is hard to judge whether this interval is wide or narrow.
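As a final sanity check, the fitted value of \(2.7344\) reported by predict() can be reproduced directly from the estimated coefficients:

b <- coef(fit)
b[1] + b[2] * 12   # 2.25550 + 0.03991 * 12 = 2.7344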