Linear Regression Analysis serves as a fundamental statistical technique for understanding the relationship between a dependent variable and one or more independent variables. This method is extensively used for prediction, trend analysis, and exploring how variables interconnect. It helps in determining the extent to which independent variables are associated with a dependent variable, making it a key tool in both research and practical applications.
The base R package includes functions such as lm() for conducting linear regression analysis and summary() for obtaining detailed reports on the model. These reports include the regression coefficients, the R-squared value, and statistical tests of the model parameters. These functions are integral for assessing the model’s fit to the data, testing hypotheses about the relationships among variables, and making informed predictions. lm() is R’s primary function for fitting linear models. It estimates the regression coefficients that define the best-fitting line, or hyperplane in the case of multiple regression, based on the least squares method. While we will use lm() for simple linear regression, it also handles multiple regression, facilitating the examination of linear relationships between several explanatory variables and the response.
ggplot2 will also be used for creating diagnostic plots, analyzing residuals, and identifying influential data points, which are essential for verifying model assumptions and overall reliability. For insight into the precision of the model estimates, the confint() function is key: it provides confidence intervals for the model parameters, offering a range within which the true parameter values are expected to lie. In addition, the cor() function is useful for assessing the correlation between two quantitative variables, the predict() function generates predictions and confidence intervals for new data points, and anova() can be used to obtain the Analysis of Variance table. Together, these features offer a comprehensive suite of tools for linear regression.
The lm() function in R is used to fit linear models. Its full signature is:
lm(formula, data, subset, weights, na.action,
method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
singular.ok = TRUE, contrasts = NULL, offset, ...)
In practice, you’ll often use lm() with just the first two arguments, formula and data, especially for straightforward simple linear regression analyses:
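A sketch of such a call (the object name fit is arbitrary; response, predictor1, predictor2, and data are placeholders for your own variables and data frame):
# Fit the model, then request the detailed report
fit <- lm(response ~ predictor1 + predictor2, data = data)
summary(fit)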
This formula indicates that the response variable is modeled as a function of predictor1 and predictor2. The summary() function then provides a detailed report of the model.
What Does summary() Do?
The summary() function, when applied to a linear model object created by lm(), yields a comprehensive overview of the regression analysis. This summary includes several key components crucial for interpreting the model’s effectiveness and the relationships between variables: the model call, a five-number summary of the residuals, the coefficient estimates with their standard errors, t values, and p-values, the residual standard error, the R-squared values, and the overall F-test.
A scatter plot is a fundamental tool in statistical analysis for visualizing the relationship between two continuous variables. It displays the values of the two variables for a set of data, one variable on each axis, allowing for a quick look at a possible linear or non-linear relationship between them. In R, the ggplot2 package offers versatile capabilities for creating scatter plots, enhancing them with regression lines, smooth lines (to capture non-linear trends), and grouping by categories if needed. Here’s a simple guide on creating a scatter plot using ggplot2.
Assuming you have a dataset data with two continuous variables varX (independent variable) and varY (dependent variable), you can create a scatter plot to explore their relationship.
ggplot(data, aes(x = varX, y = varY)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
ggtitle("Scatter Plot of varY vs. varX") +
xlab("varX") +
ylab("varY")
The cor() function in R calculates the correlation coefficient between two variables, quantifying the degree to which they are related. The correlation coefficient is a numerical measure that can range from -1 to 1, where values closer to -1 or 1 indicate a stronger negative or positive linear relationship, respectively, and values near 0 suggest little to no linear relationship.
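As a quick sketch, assuming the same hypothetical data frame data with the continuous variables varX and varY used above:
# Pearson correlation coefficient (the default method) between varX and varY
cor(data$varX, data$varY)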
Residual plots are an essential diagnostic tool in regression analysis, designed to evaluate the fit of a regression model by plotting the residuals (the differences between observed and predicted values) against the explanatory variable(s). These plots are useful in detecting issues like non-linearity, heteroscedasticity (non-constant variance of residuals), and outliers. The ggplot2 package in R can be utilized to produce residual plots.
Assuming your dataset data includes the explanatory variable varX and the dependent variable varY, and you have fitted a linear model fit, you can generate a residual plot in the following manner:
fit <- lm(varY ~ varX, data = data)
data$resids <- fit$residuals
# Generate the residual plot
ggplot(data, aes(x = varX, y = resids)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "solid", color = "blue") +
ggtitle("Residual Plot vs. Explanatory Variable") +
xlab("Explanatory Variable (varX)") +
ylab("Residuals")
Explanatory Variable vs. Residuals: Plotting residuals against the explanatory variable is a common method for checking the model’s assumptions. This includes verifying the linearity of the relationship between the explanatory and dependent variables and ensuring that the variance of the residuals is constant (homoscedasticity). Residual plots against the explanatory variable(s) are a straightforward yet powerful method for visually inspecting the assumptions of linear regression models. By analyzing the patterns of the residuals, researchers can determine the adequacy of the model fit and identify whether any modifications are necessary to enhance model accuracy.
Histograms and QQ plots (quantile-quantile plots) also play an important role in the analysis of residuals. Both are used to examine the distribution of the residuals and to check for deviations from normality, which is a key assumption of linear regression models. As you are already familiar with histograms and QQ plots, we will not discuss them here.
The confint() function in R is primarily used to calculate confidence intervals for the parameters of a fitted model. This is particularly useful in linear regression models, where it generates intervals for the regression coefficients, including both the slope(s) and intercept.
How It Works
Confidence intervals provide a range of values for each parameter. These ranges are constructed to contain the true parameter value with a certain confidence level, often 95%. The calculation of these intervals is based on the estimate of the parameter, its standard error, and the critical value from the t-distribution corresponding to the desired level of confidence.
For linear model (lm) objects, confint() constructs confidence intervals using critical values from the t-distribution, which is the distribution the coefficient estimates follow under the normality assumption.
Assuming your dataset data includes the explanatory variable varX and the dependent variable varY, and you have fitted a linear model fit, you can construct confidence intervals for the model parameters in the following manner:
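The following is a sketch (fit is the lm object from above; the level argument defaults to 0.95):
# 95% confidence intervals for the intercept and the slope
confint(fit, level = 0.95)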
The predict() function is designed to make predictions based on a fitted model. In the context of linear regression, it can generate predicted values of the dependent variable for new observations of the independent variables. It can also provide confidence intervals or prediction intervals for these predicted values.
predict(object, newdata, se.fit = FALSE, scale = NULL, df = Inf,
interval = "none", level = 0.95, type = "response",
terms = NULL, na.action = na.pass,
pred.var = res.var/weights, weights = 1)
Assuming your dataset data includes the explanatory variable varX and the dependent variable varY, you have fitted a linear model fit, and C is the confidence coefficient, you can obtain predictions and the corresponding confidence or prediction intervals at new values of varX in the following manner:
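The following is a sketch; the new values of varX (here 10 and 20) and the object names new.data and C are hypothetical placeholders:
C <- 0.95                                 # confidence coefficient (illustrative value)
new.data <- data.frame(varX = c(10, 20))  # hypothetical new observations of varX
# Confidence intervals for the mean response at the new values
predict(fit, newdata = new.data, interval = "confidence", level = C)
# Prediction intervals for new individual responses at the new values
predict(fit, newdata = new.data, interval = "prediction", level = C)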
Job Stress and Locus of Control. Many factors, such as the type of job, education level, and job experience, can affect the stress felt by workers on the job. Locus of control (LOC) is a term in psychology that describes the extent to which a person believes he or she is in control of the events that influence his or her life. Is feeling “more in control” associated with less job stress? A recent study examined the relationship between LOC and several work-related behavioral measures among certified public accountants in Taiwan. LOC was assessed using a questionnaire that asked respondents to select one of two options for each of 23 items. Scores ranged from \(0\) to \(23\). Individuals with low LOC believe that their own behavior and attributes determine their rewards in life. Those with high LOC believe that these rewards are beyond their control. We will consider a random sample of 100 accountants.
Visualize the relationship between LOC and job stress using a scatter plot. This step is crucial for getting a preliminary sense of the nature of the relationship between the variables.
ggplot(loc.df, aes(x=LOC, y=STRESS))+
geom_point(size = 3) +
geom_smooth(method = lm, se = FALSE, lwd = 2) +
ggtitle("Relationship between Stress and LOC") +
xlab("Locus") +
ylab("Stress")
## `geom_smooth()` using formula = 'y ~ x'
After recognizing that there appears to be some linear association, we can measure the correlation.
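A sketch of the call that produces the output below, using the STRESS and LOC columns of loc.df:
cor(loc.df$LOC, loc.df$STRESS)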
## [1] 0.3122765
The correlation coefficient between Stress and LOC is \(0.3122765\). This suggests a weak but non-negligible association between Stress and LOC.
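Next, we fit the simple linear regression model and request its summary; the call below is a sketch consistent with the output that follows:
fit <- lm(STRESS ~ LOC, data = loc.df)
summary(fit)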
##
## Call:
## lm(formula = STRESS ~ LOC, data = loc.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.04704 -0.33806 0.02169 0.30798 1.06715
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.25550 0.14691 15.353 < 2e-16 ***
## LOC 0.03991 0.01226 3.254 0.00156 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4513 on 98 degrees of freedom
## Multiple R-squared: 0.09752, Adjusted R-squared: 0.08831
## F-statistic: 10.59 on 1 and 98 DF, p-value: 0.001562
The equation of the least-squares regression line is found using the coefficients from the Estimate column of the output above. The intercept \(b_0=2.25550\) and the slope \(b_1=0.03991\), resulting in the least-squares regression line: \[\hat{y}=2.25550+0.03991x_{\text{LOC}}\].
The mean squared error MSE is the square of the residual standard error: MSE \(=0.4513^2=0.2036717\).
The coefficient of determination \(R^2\) is indicated by the Multiple R-squared value \(R^2=0.09752\).
The ANOVA table can provide more details regarding the sums of squares, degrees of freedom, mean squares, F-test statistic, and \(p\)-value. We can also confirm the MSE and \(R^2\) values from the output.
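A sketch of the call (fit is the fitted lm object):
anova(fit)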
## Analysis of Variance Table
##
## Response: STRESS
## Df Sum Sq Mean Sq F value Pr(>F)
## LOC 1 2.1565 2.15651 10.589 0.001562 **
## Residuals 98 19.9578 0.20365
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The mean squared error MSE is given directly in the table, in the Residuals row and the Mean Sq column: MSE\(=0.20365\).
The coefficient of determination \(R^2\) can be calculated from the sums of squares: \(R^2=\frac{\text{SSR}}{\text{SST}}=\frac{\text{SSR}}{\text{SSR}+\text{SSE}}=\frac{2.1565}{2.1565+19.9578}=0.09752\).
To check the model assumptions, we revisit the scatter plot of LOC and job stress.
ggplot(loc.df, aes(x=LOC, y=STRESS))+
geom_point(size = 3) +
geom_smooth(method = lm, se = FALSE, lwd = 2) +
ggtitle("Relationship between Stress and LOC") +
xlab("Locus") +
ylab("Stress")
## `geom_smooth()` using formula = 'y ~ x'
The scatter plot indicates a linear association as the points increase in a positive linear trend. There might be minor violations of constant variance as the variability looks smaller at the ends than at the center; however, this may simply be due to fewer observations at the far ends. No outliers are present in the scatter plot.
fit <- lm(STRESS ~ LOC, data = loc.df)
loc.df$resids <- fit$residuals
ggplot(loc.df, aes(x=LOC, y=resids))+
geom_point(size = 3) +
geom_hline(yintercept = 0, linetype = "solid", color = "blue", lwd = 2) +
ggtitle("Residual Plot") +
xlab("Locus") +
ylab("Residuals")
The residual plot has no patterns and the points occur randomly above and below the \(x\)-axis, indicating a linear association. There might be minor violations of constant variance as the variability looks smaller at the ends than at the center; however, this may simply be because there are only a few observations at the lower and higher ranges of LOC, making it harder to assess the true variability. No outliers are present in the residual plot.
n_bins <- round(max(sqrt(nrow(loc.df))+2, 5))
xbar.resids <- mean(loc.df$resids)
s.resids <- sd(loc.df$resids)
ggplot(loc.df, aes(x = resids)) +
geom_histogram(aes(y = after_stat(density)), bins = n_bins, fill = "grey", col = "black") +
geom_density(col = "red", lwd = 1) +
stat_function(fun = dnorm, args = list(mean = xbar.resids, sd = s.resids), col="blue", lwd = 1) +
ggtitle("Histogram of Residuals")
ggplot(loc.df, aes(sample=resids)) +
stat_qq() +
geom_abline(slope = s.resids, intercept = xbar.resids) +
ggtitle("QQ Plot of Residuals")
It looks like the residuals are normal because on the normal probability plot the points are close to the line without systematic deviation. The histogram reveals a symmetric, unimodal pattern, and the red curve (estimated kernel density) and the blue curve (estimated normal density) on the histogram seem to be close.
Assuming that we have an SRS, two of the assumptions are met: a linear relationship (scatterplot and residual plot) and normality of the residuals (histogram and normal probability plot of the residuals). The constant standard deviation of the residuals (scatterplot and residual plot) is questionable, but as noted above this may simply reflect the sparse data at the extremes, so we will continue the analysis while keeping this caveat in mind.
Since the assumptions are reasonable and it appears there is a linear association, we will proceed to provide confidence intervals for the parameters of the model at the \(95\%\) level.
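A sketch of the call producing the intervals below:
confint(fit, level = 0.95)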
## 2.5 % 97.5 %
## (Intercept) 1.96395317 2.54704023
## LOC 0.01557099 0.06424615
slope \(95\%\) CI: \((0.01557099, 0.06424615)\)
We are \(95\%\) confident that the population slope of Stress versus LOC is covered by the interval from \(0.01557099\) to \(0.06424615\). Note that the slope is labeled with the name of the x variable in the output, which is LOC in this example.
y-intercept \(95\%\) CI: \((1.96395317, 2.54704023)\)
We are \(95\%\) confident that the population y-intercept of Stress versus LOC is covered by the interval from \(1.96395317\) to \(2.54704023\).
Since it appears there is a linear association between Stress and LOC, we will perform a formal test of significance. We can use an overall model F-test or a test for the slope being \(0\).
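The slope t-test and the overall F-statistic both appear in the summary() output; a sketch of the call:
summary(fit)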
##
## Call:
## lm(formula = STRESS ~ LOC, data = loc.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.04704 -0.33806 0.02169 0.30798 1.06715
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.25550 0.14691 15.353 < 2e-16 ***
## LOC 0.03991 0.01226 3.254 0.00156 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4513 on 98 degrees of freedom
## Multiple R-squared: 0.09752, Adjusted R-squared: 0.08831
## F-statistic: 10.59 on 1 and 98 DF, p-value: 0.001562
Step 1: Definition of the terms: not required
Step 2: State the hypotheses: \(H_0\): there is no linear association between Stress and LOC versus \(H_a\): there is a linear association between Stress and LOC.
Step 3: Find the Test Statistic, p-value, report df: \(F=10.59\) with \(1\) and \(98\) degrees of freedom, \(p\text{-value}=0.001562\).
Step 4: Conclusion:
Since \(p\text{-value}=0.001562 \leq 0.05=\alpha\), we reject \(H_0\).
The data provide evidence (\(p\text{-value}=0.001562\)) to support the claim that there is an association between Stress and LOC.
Step 1: Definition of the terms: let \(\beta_1\) denote the population slope of Stress versus LOC.
Step 2: State the hypotheses: \(H_0: \beta_1 = 0\) versus \(H_a: \beta_1 \neq 0\).
Step 3: Find the Test Statistic, p-value, report DF: \(t=3.254\) with df \(=98\), \(p\text{-value}=0.00156\).
Step 4: Conclusion:
Since \(p\text{-value}=0.00156 \leq 0.05=\alpha\), we reject \(H_0\).
The data provide evidence (\(p\text{-value}=0.00156\)) to support the claim that there is an association between STRESS and LOC.
OR
The data provide evidence (\(p\text{-value}=0.00156\)) to support the claim that the slope of STRESS versus LOC is not zero.
It is important to note that both tests provide the same result; in simple linear regression the F statistic is the square of the t statistic for the slope (\(3.254^2 \approx 10.59\)).
A psychologist decides to use the model to predict the stress level of a client who works as a certified public accountant. After the client completed the series of questions for measuring LOC it was found that the score was \(12\). Obtain a \(95\%\) confidence interval for the mean response at this point LOC\(=12\) and a \(95\%\) prediction interval around the prediction at this new value of LOC.
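A sketch of the call producing the confidence interval for the mean response (the object name new.obs is arbitrary):
new.obs <- data.frame(LOC = 12)
predict(fit, newdata = new.obs, interval = "confidence", level = 0.95)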
## fit lwr upr
## 1 2.7344 2.643662 2.825137
The confidence interval for the mean response is: \((2.643662, 2.825137)\)
We are \(95\%\) confident that the population mean response of Stress when LOC is \(12\) is covered by the interval \((2.643662, 2.825137)\).
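A sketch of the corresponding call for the prediction interval:
predict(fit, newdata = new.obs, interval = "prediction", level = 0.95)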
## fit lwr upr
## 1 2.7344 1.834271 3.634528
The prediction interval is: \((1.834271, 3.634528)\)
We are \(95\%\) confident that the new measurement of Stress when LOC is \(12\) will be covered by the interval \((1.834271, 3.634528)\).
It is important to observe that the prediction interval is wider than the confidence interval for the mean at the same point. If you were interested in the average stress level at an LOC of \(12\), you would use the confidence interval for the mean at that point. However, if you wanted to predict the stress level of the next accountant with a locus of control of \(12\), you would need the prediction interval. Without more familiarity with the typical range of these stress scores, it is hard to judge whether this interval is wide in practical terms.