In this R demo session, we’ll explore the concept of simple linear regression. Simple linear regression is a statistical technique that allows us to model and investigate the relationship between two quantitative variables.
For this demo, we are using an old dataset called
readwrite2.csv
. The purpose of the analysis is to predict
writing scores from students’ reading scores.
###Import Data and Build the First Regression Model
Let’s first import the data and then proceed with building our simple linear regression model.
# Import the data
data_readwrite <- read.csv("readwrite2.csv")
# Preview the data
head(data_readwrite)
## id reading writing level
## 1 3070 18 21 3
## 2 1306 20 15 3
## 3 83 14 14 2
## 4 2486 16 20 4
## 5 1938 14 15 2
## 6 397 20 18 4
# Let's build our first linear regression model
unstandarized_model <- lm(writing ~ reading, data = data_readwrite)
# Summary of the regression model
summary(unstandarized_model)
##
## Call:
## lm(formula = writing ~ reading, data = data_readwrite)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.2832 -2.3738 0.1721 2.6262 11.6273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.37157 0.40338 13.32 <2e-16 ***
## reading 0.63646 0.02156 29.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.561 on 1995 degrees of freedom
## Multiple R-squared: 0.304, Adjusted R-squared: 0.3036
## F-statistic: 871.2 on 1 and 1995 DF, p-value: < 2.2e-16
Intercept
The estimated intercept is 5.37157. In the context of this research, this means that when the reading score is zero, the predicted writing score would be approximately 5.37. Like the example you provided, the null hypothesis for the intercept is that it is zero. With a t-value of 13.32 and a p-value of < 2e-16, the intercept is significantly different from zero. However, an intercept with a reading score of zero may not hold much practical meaning in the context of our study, as it’s unlikely for a student to have a reading score of zero.
Regression Coefficient
The estimated regression coefficient for the reading variable is 0.63646, with a standard error (SE) of 0.02156. The SE can be thought of as a measure of the variability in the estimate for the regression coefficient, reflecting how it might vary across different samples from the population.
The t-value for the regression coefficient is 29.52, and the p-value is < 2e-16. This p-value is much smaller than the commonly used alpha level of 0.05, indicating that the reading score is a significant predictor of the writing score.
Logic of Null Hypothesis Testing for Regression Coefficient
Step 1. Identify test statistic: For the regression coefficient, the test statistic is the t-statistic, which in our case is 29.52.
Step 2. Assume the null hypothesis is true:
Step 3. Identify sampling distribution of test statistic: As per
statistical theory, the sampling distribution is a t-distribution with
degrees of freedom df = N−p−1
, where N is the sample size
and p is the number of predictors. Here, df=1995.
Step 4. Compare observed test statistic to the sampling distribution: The observed t-statistic is 29.52. The probability of obtaining such an extreme t-statistic, under the null hypothesis, is extremely low (< 2e-16).
Step 5. Compare to cutoff: The p-value is much less than a = 0.05 α=0.05, implying that we reject the null hypothesis. The regression coefficient is statistically significant.
R2 and F-Statistic
The R2 value is 0.304, which means that approximately 30.4% of the variance in writing scores can be explained by reading scores. This is a reasonable amount of explained variance for a single predictor.
The F-statistic is 871.2 with a p-value of < 2.2e-16. The null hypothesis for the F-statistic is that reading scores do not predict writing scores. Given the extremely low p-value, we reject this null hypothesis. This reinforces the finding from the t-test on the regression coefficient, indicating that reading scores do predict writing scores significantly.
Continuing from the discussion of the original regression model, let’s now transition to examining the standardized regression model. Standardizing the variables before conducting a regression analysis can offer unique insights, particularly in terms of effect size and direct comparisons among different predictors.
# Standardize the variables and fit the standardized linear regression model
standardized_model <- lm(scale(writing) ~ scale(reading), data = data_readwrite)
summary(standardized_model)
##
## Call:
## lm(formula = scale(writing) ~ scale(reading), data = data_readwrite)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.87855 -0.55630 0.04032 0.61545 2.72484
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.728e-16 1.867e-02 0.00 1
## scale(reading) 5.513e-01 1.868e-02 29.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8345 on 1995 degrees of freedom
## Multiple R-squared: 0.304, Adjusted R-squared: 0.3036
## F-statistic: 871.2 on 1 and 1995 DF, p-value: < 2.2e-16
Intercept
In the standardized model, the intercept is expected to be zero, or very close to zero. In your output, it is 3.728e-16, which is virtually zero. This is expected because all variables have been centered around their mean.
Regression Coefficient
The standardized regression coefficient for reading is approximately 0.5513. This means that for every one standard deviation increase in reading scores, we can expect an approximately 0.5513 standard deviation increase in writing scores. This coefficient reflects the strength of the relationship in standard deviation units, making it easier to compare with other predictors or across different datasets.
Intercept
Unstandardized: In the unstandardized model, the intercept had a substantive interpretation, representing the predicted writing score when the reading score was zero. Its p-value and t-value indicated statistical significance.
Standardized: In the standardized model, the intercept is virtually zero and loses this substantive interpretation. The p-value is 1, and the t-value is 0, confirming that the intercept is not statistically significant in the standardized model.
Regression Coefficient
Unstandardized: The coefficient was in the original units of the variable, indicating the change in writing scores for a one-point increase in reading scores. Its t-value was 29.52, and the p-value was < 2e-16, indicating statistical significance. The SE was 0.02156.
Standardized: In the standardized model, the coefficient is in terms of standard deviations. It also has a t-value of 29.52 and a p-value of < 2e-16, again confirming statistical significance. The SE in the standardized model is 0.01868, which, while in different units, still gives an idea of the precision of the estimate.
R2 and F-Statistic
Both models have the same R2 value of 0.304 and F-statistic of 871.2, indicating that reading is a significant predictor of writing. The p-value associated with the F-statistic is also the same in both models, < 2.2e-16, further confirming the significance of the model.
Description: The observations should be independent from one another for valid results.
Check: The independence of observations is often an aspect of study design rather than something that can be statistically tested. Make sure to carefully consider how the data was collected and if the design ensures independence.
Description: The dependent variable should be numeric, at least interval-scaled if not ratio-scaled.
Check: This is generally a feature of the dataset and should be verified before the analysis.
Description: The relationship between the dependent and independent variables should be linear.
Check: Plot the data and fit a linear model to visually inspect linearity.
library(ggplot2)
# Scatterplot
scatter <- ggplot(data_readwrite, aes(x=reading, y=writing)) +
geom_point() +
geom_smooth(method="lm", color="blue", se=FALSE) +
geom_smooth(method="loess", color="red", se=FALSE) +
labs(title = "Scatterplot with Linear and Loess Lines",
x = "Reading Scores",
y = "Writing Scores")
scatter
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Given that the linear line (represented in blue) and the LOESS line (represented in red) closely align with each other, we can conclude that the relationship between reading and writing scores is linear.
Description: The errors should vary constantly across levels of the independent variable.
Check: Use ncvTest from the car package to test this, and/or plot residuals.
The ncvTest()
function performs the Breusch-Pagan test
for non-constant variance (homoscedasticity) in the residuals. A
significant p-value (usually < 0.05) suggests that the assumption of
homoscedasticity is violated, meaning the error variance changes across
levels of the independent variable.
# install if necessary
# install.packages("car")
library(car)
## Loading required package: carData
# Non-constant Error Variance Test
ncvTest(unstandarized_model)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 12.28884, Df = 1, p = 0.00045567
The Non-constant Variance Score Test (Breusch-Pagan test) yielded a p-value of 0.0005, indicating a violation of the homoscedasticity assumption. This suggests that the error variance is not constant across levels of the reading scores.
Note: The Breusch-Pagan test can be sensitive to large sample sizes. Although our p-value indicates non-constant variance, it’s important to also examine residual plots to evaluate the practical significance of this finding.
par(mfrow=c(2,2))
plot(unstandarized_model)
The first plot depicts residuals versus fitted values. Residuals are measured as follows:
residual = (observed y) – (model-predicted y)
To evaluate the assumption of homoscedasticity, we examine the residual plot to ensure that the residuals are randomly and evenly dispersed around the y = 0 line. Based on the visual inspection of the plot, there is no discernible pattern, suggesting that the assumption of homoscedasticity is not violated.
Description: The residuals should be normally distributed and centered around zero.
Check: Use Shapiro-Wilk normality test and plot histograms and Q-Q plots.
shapiro.test(unstandarized_model$residuals)
##
## Shapiro-Wilk normality test
##
## data: unstandarized_model$residuals
## W = 0.99556, p-value = 1.209e-05
# Create a histogram of residuals
ggplot(data.frame(residuals = unstandarized_model$residuals), aes(x = residuals)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 1, fill = "white", color = "black", alpha = 0.7) +
geom_density(alpha = 0.2, fill = "#FF6666") +
ggtitle("Histogram of Residuals") +
theme_minimal()
Despite the significant result from the Shapiro-Wilk test, it’s important to note that this test becomes highly sensitive with large sample sizes, such as ours (N = 1997). Therefore, it may detect even trivial departures from normality. Based on the Q-Q plot—where the data points closely align with the reference line—and the histogram, which exhibits a bell-shaped curve, we can conclude that the residuals of the regression model are approximately normally distributed.
Outliers and influential observations can have a substantial impact on the results and interpretation of a regression analysis. Therefore, it’s crucial to identify and evaluate their influence on your model.
Cook’s Distance Cook’s distance is a measure that helps to identify influential data points in a regression analysis. The most common cutoff value for considering a point to be influential is 4/(n−p−1), where n is the sample size and p is the number of predictors.
# Calculate Cook's distance
cooks_dist <- cooks.distance(unstandarized_model)
# Plot
plot(cooks_dist, pch="*", cex=1.2, main="Cook's Distance")
abline(h = 4/(length(cooks_dist)-1-1), col="red")
Dfbeta
Dfbetas measure the individual impact of a specific observation on the estimated regression coefficients. High Dfbeta values indicate that the point is influential.
#install.packages("olsrr")
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_plot_cooksd_chart(unstandarized_model)
Cook’s d is a way of looking at outliers or influential observations within regression. The red line is the most common cutoff. Here, Cook’s d suggests a lot of potential outliers.
Another good diagnostic in olsrr is dfbeta, which looks at the influence of individual observations on the estimated values for the parameters in the model. In other words, it quantifies how much the estimated parameter would change if the observation were deleted.
ols_plot_dfbetas(unstandarized_model)
There are two primary parameters in the model, the intercept and regression coefficient for reading scores. There are several observations that are highly influential, and those overlap substantially with the outliers identified with Cook’s d.
Here is another similar plot, which quantifies how much the predicted value for a single observation would change if that observation were not used in parameter estimation.
ols_plot_dffits(unstandarized_model)
This graph shows how big each individual’s residual is, standardized and compared to a typical cutoff. Similar to Cook’s d.
ols_plot_resid_stand(unstandarized_model)
You can plot some of these simultaneously. This is a pretty good one- it looks at leverage (influential observations) and standardized residuals (outliers) together. Observations can be outliers, influential, both, or neither.
ols_plot_resid_lev(unstandarized_model)
Here there are a bunch of potential outliers, a bunch of influential observations, and some are both.
Which diagnostics should you use? No rule. Pick some you like. They say somewhat different things in somewhat different ways.
My favorite? I think dfbetas is the most intuitive way of looking at influential observations, and I find a scatter plot to be the most straightforward first step for identifying potential outliers. This is not saying you should make the some choices.
Overall, the diagnostic tests indicate some issues with the data. To address this, we will consider removing all observations with standardized residuals greater than 2. As a precaution, we’ll create a new dataset to work with, ensuring the integrity of the original data remains intact. This is a recommended best practice.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::recode() masks car::recode()
## ✖ purrr::some() masks car::some()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data_new <- data_readwrite %>%
mutate(std_residuals = rstandard(unstandarized_model)) %>%
mutate(outlier = case_when(abs(std_residuals) > 2 ~ TRUE,
TRUE ~ FALSE)) %>%
filter(!outlier)
Now rerun the regression using the new data.
standarized_model_new <- lm(scale(writing) ~ scale(reading), data = data_new)
summary(standarized_model_new)
##
## Call:
## lm(formula = scale(writing) ~ scale(reading), data = data_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.80092 -0.56064 0.02125 0.58771 1.78234
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.984e-16 1.788e-02 0.0 1
## scale(reading) 6.224e-01 1.789e-02 34.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7829 on 1915 degrees of freedom
## Multiple R-squared: 0.3873, Adjusted R-squared: 0.387
## F-statistic: 1211 on 1 and 1915 DF, p-value: < 2.2e-16
Coefficients (standardized value: beta)
Residual Standard Error
R2 Values
F-Statistic and p-value
Residuals
Overall, removing the outliers has led to a model that appears to be a better fit for the data, explaining more variance and producing more reliable estimates.