Introduction

In this R demo session, we’ll explore the concept of simple linear regression. Simple linear regression is a statistical technique that allows us to model and investigate the relationship between two quantitative variables.

For this demo, we are using an old dataset called readwrite2.csv. The purpose of the analysis is to predict writing scores from students’ reading scores.

Unstandardized Linear Regression

###Import Data and Build the First Regression Model

Let’s first import the data and then proceed with building our simple linear regression model.

# Import the data
data_readwrite <- read.csv("readwrite2.csv")

# Preview the data
head(data_readwrite)
##     id reading writing level
## 1 3070      18      21     3
## 2 1306      20      15     3
## 3   83      14      14     2
## 4 2486      16      20     4
## 5 1938      14      15     2
## 6  397      20      18     4
# Let's build our first linear regression model
unstandarized_model <- lm(writing ~ reading, data = data_readwrite)

# Summary of the regression model
summary(unstandarized_model)
## 
## Call:
## lm(formula = writing ~ reading, data = data_readwrite)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.2832  -2.3738   0.1721   2.6262  11.6273 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.37157    0.40338   13.32   <2e-16 ***
## reading      0.63646    0.02156   29.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.561 on 1995 degrees of freedom
## Multiple R-squared:  0.304,  Adjusted R-squared:  0.3036 
## F-statistic: 871.2 on 1 and 1995 DF,  p-value: < 2.2e-16

Understanding the Regression Model Results

Intercept

The estimated intercept is 5.37157. In the context of this research, this means that when the reading score is zero, the predicted writing score would be approximately 5.37. Like the example you provided, the null hypothesis for the intercept is that it is zero. With a t-value of 13.32 and a p-value of < 2e-16, the intercept is significantly different from zero. However, an intercept with a reading score of zero may not hold much practical meaning in the context of our study, as it’s unlikely for a student to have a reading score of zero.

Regression Coefficient

The estimated regression coefficient for the reading variable is 0.63646, with a standard error (SE) of 0.02156. The SE can be thought of as a measure of the variability in the estimate for the regression coefficient, reflecting how it might vary across different samples from the population.

The t-value for the regression coefficient is 29.52, and the p-value is < 2e-16. This p-value is much smaller than the commonly used alpha level of 0.05, indicating that the reading score is a significant predictor of the writing score.

Logic of Null Hypothesis Testing for Regression Coefficient

Step 1. Identify test statistic: For the regression coefficient, the test statistic is the t-statistic, which in our case is 29.52.

Step 2. Assume the null hypothesis is true:

  • H0: b(reading) = 0 (Reading does not predict Writing)
  • H1: b(reading) != 0 (Reading predicts Writing)

Step 3. Identify sampling distribution of test statistic: As per statistical theory, the sampling distribution is a t-distribution with degrees of freedom df = N−p−1, where N is the sample size and p is the number of predictors. Here, df=1995.

Step 4. Compare observed test statistic to the sampling distribution: The observed t-statistic is 29.52. The probability of obtaining such an extreme t-statistic, under the null hypothesis, is extremely low (< 2e-16).

Step 5. Compare to cutoff: The p-value is much less than a = 0.05 α=0.05, implying that we reject the null hypothesis. The regression coefficient is statistically significant.

R2 and F-Statistic

The R2 value is 0.304, which means that approximately 30.4% of the variance in writing scores can be explained by reading scores. This is a reasonable amount of explained variance for a single predictor.

The F-statistic is 871.2 with a p-value of < 2.2e-16. The null hypothesis for the F-statistic is that reading scores do not predict writing scores. Given the extremely low p-value, we reject this null hypothesis. This reinforces the finding from the t-test on the regression coefficient, indicating that reading scores do predict writing scores significantly.

Standardized Regression Model

Continuing from the discussion of the original regression model, let’s now transition to examining the standardized regression model. Standardizing the variables before conducting a regression analysis can offer unique insights, particularly in terms of effect size and direct comparisons among different predictors.

# Standardize the variables and fit the standardized linear regression model
standardized_model <- lm(scale(writing) ~ scale(reading), data = data_readwrite)

summary(standardized_model)
## 
## Call:
## lm(formula = scale(writing) ~ scale(reading), data = data_readwrite)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.87855 -0.55630  0.04032  0.61545  2.72484 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.728e-16  1.867e-02    0.00        1    
## scale(reading) 5.513e-01  1.868e-02   29.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8345 on 1995 degrees of freedom
## Multiple R-squared:  0.304,  Adjusted R-squared:  0.3036 
## F-statistic: 871.2 on 1 and 1995 DF,  p-value: < 2.2e-16

Interpretation of the Standardized Model

Intercept

In the standardized model, the intercept is expected to be zero, or very close to zero. In your output, it is 3.728e-16, which is virtually zero. This is expected because all variables have been centered around their mean.

Regression Coefficient

The standardized regression coefficient for reading is approximately 0.5513. This means that for every one standard deviation increase in reading scores, we can expect an approximately 0.5513 standard deviation increase in writing scores. This coefficient reflects the strength of the relationship in standard deviation units, making it easier to compare with other predictors or across different datasets.

Comparisons with the Unstandardized Model

Intercept

  • Unstandardized: In the unstandardized model, the intercept had a substantive interpretation, representing the predicted writing score when the reading score was zero. Its p-value and t-value indicated statistical significance.

  • Standardized: In the standardized model, the intercept is virtually zero and loses this substantive interpretation. The p-value is 1, and the t-value is 0, confirming that the intercept is not statistically significant in the standardized model.

Regression Coefficient

  • Unstandardized: The coefficient was in the original units of the variable, indicating the change in writing scores for a one-point increase in reading scores. Its t-value was 29.52, and the p-value was < 2e-16, indicating statistical significance. The SE was 0.02156.

  • Standardized: In the standardized model, the coefficient is in terms of standard deviations. It also has a t-value of 29.52 and a p-value of < 2e-16, again confirming statistical significance. The SE in the standardized model is 0.01868, which, while in different units, still gives an idea of the precision of the estimate.

R2 and F-Statistic

Both models have the same R2 value of 0.304 and F-statistic of 871.2, indicating that reading is a significant predictor of writing. The p-value associated with the F-statistic is also the same in both models, < 2.2e-16, further confirming the significance of the model.

Assumptions

1. Observations are Independent from One Another

Description: The observations should be independent from one another for valid results.

Check: The independence of observations is often an aspect of study design rather than something that can be statistically tested. Make sure to carefully consider how the data was collected and if the design ensures independence.

2. The Dependent Variable is Numeric

Description: The dependent variable should be numeric, at least interval-scaled if not ratio-scaled.

Check: This is generally a feature of the dataset and should be verified before the analysis.

3. The Relationship is Linear

Description: The relationship between the dependent and independent variables should be linear.

Check: Plot the data and fit a linear model to visually inspect linearity.

library(ggplot2)

# Scatterplot
scatter <- ggplot(data_readwrite, aes(x=reading, y=writing)) +
  geom_point() +
  geom_smooth(method="lm", color="blue", se=FALSE) +
  geom_smooth(method="loess", color="red", se=FALSE) +
  labs(title = "Scatterplot with Linear and Loess Lines",
       x = "Reading Scores", 
       y = "Writing Scores")
scatter
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Given that the linear line (represented in blue) and the LOESS line (represented in red) closely align with each other, we can conclude that the relationship between reading and writing scores is linear.

4. Homoscedasticity of Errors

Description: The errors should vary constantly across levels of the independent variable.

Check: Use ncvTest from the car package to test this, and/or plot residuals.

The ncvTest() function performs the Breusch-Pagan test for non-constant variance (homoscedasticity) in the residuals. A significant p-value (usually < 0.05) suggests that the assumption of homoscedasticity is violated, meaning the error variance changes across levels of the independent variable.

# install if necessary
# install.packages("car")
library(car)
## Loading required package: carData
# Non-constant Error Variance Test
ncvTest(unstandarized_model)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 12.28884, Df = 1, p = 0.00045567

The Non-constant Variance Score Test (Breusch-Pagan test) yielded a p-value of 0.0005, indicating a violation of the homoscedasticity assumption. This suggests that the error variance is not constant across levels of the reading scores.

Note: The Breusch-Pagan test can be sensitive to large sample sizes. Although our p-value indicates non-constant variance, it’s important to also examine residual plots to evaluate the practical significance of this finding.

par(mfrow=c(2,2))
plot(unstandarized_model)

The first plot depicts residuals versus fitted values. Residuals are measured as follows:

residual = (observed y) – (model-predicted y)

To evaluate the assumption of homoscedasticity, we examine the residual plot to ensure that the residuals are randomly and evenly dispersed around the y = 0 line. Based on the visual inspection of the plot, there is no discernible pattern, suggesting that the assumption of homoscedasticity is not violated.

5. Residuals are Normally Distributed

Description: The residuals should be normally distributed and centered around zero.

Check: Use Shapiro-Wilk normality test and plot histograms and Q-Q plots.

shapiro.test(unstandarized_model$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  unstandarized_model$residuals
## W = 0.99556, p-value = 1.209e-05
# Create a histogram of residuals
ggplot(data.frame(residuals = unstandarized_model$residuals), aes(x = residuals)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1, fill = "white", color = "black", alpha = 0.7) +
  geom_density(alpha = 0.2, fill = "#FF6666") +
  ggtitle("Histogram of Residuals") +
  theme_minimal()

Despite the significant result from the Shapiro-Wilk test, it’s important to note that this test becomes highly sensitive with large sample sizes, such as ours (N = 1997). Therefore, it may detect even trivial departures from normality. Based on the Q-Q plot—where the data points closely align with the reference line—and the histogram, which exhibits a bell-shaped curve, we can conclude that the residuals of the regression model are approximately normally distributed.

6. Outliers and Influential Observations

Outliers and influential observations can have a substantial impact on the results and interpretation of a regression analysis. Therefore, it’s crucial to identify and evaluate their influence on your model.

Cook’s Distance Cook’s distance is a measure that helps to identify influential data points in a regression analysis. The most common cutoff value for considering a point to be influential is 4/(n−p−1), where n is the sample size and p is the number of predictors.

# Calculate Cook's distance
cooks_dist <- cooks.distance(unstandarized_model)

# Plot
plot(cooks_dist, pch="*", cex=1.2, main="Cook's Distance")
abline(h = 4/(length(cooks_dist)-1-1), col="red")

Dfbeta

Dfbetas measure the individual impact of a specific observation on the estimated regression coefficients. High Dfbeta values indicate that the point is influential.

#install.packages("olsrr")
library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_plot_cooksd_chart(unstandarized_model)

Cook’s d is a way of looking at outliers or influential observations within regression. The red line is the most common cutoff. Here, Cook’s d suggests a lot of potential outliers.

Another good diagnostic in olsrr is dfbeta, which looks at the influence of individual observations on the estimated values for the parameters in the model. In other words, it quantifies how much the estimated parameter would change if the observation were deleted.

ols_plot_dfbetas(unstandarized_model)

There are two primary parameters in the model, the intercept and regression coefficient for reading scores. There are several observations that are highly influential, and those overlap substantially with the outliers identified with Cook’s d.

Here is another similar plot, which quantifies how much the predicted value for a single observation would change if that observation were not used in parameter estimation.

ols_plot_dffits(unstandarized_model)

This graph shows how big each individual’s residual is, standardized and compared to a typical cutoff. Similar to Cook’s d.

ols_plot_resid_stand(unstandarized_model)

You can plot some of these simultaneously. This is a pretty good one- it looks at leverage (influential observations) and standardized residuals (outliers) together. Observations can be outliers, influential, both, or neither.

ols_plot_resid_lev(unstandarized_model)

Here there are a bunch of potential outliers, a bunch of influential observations, and some are both.

Which diagnostics should you use? No rule. Pick some you like. They say somewhat different things in somewhat different ways.

My favorite? I think dfbetas is the most intuitive way of looking at influential observations, and I find a scatter plot to be the most straightforward first step for identifying potential outliers. This is not saying you should make the some choices.

Overall, the diagnostic tests indicate some issues with the data. To address this, we will consider removing all observations with standardized residuals greater than 2. As a precaution, we’ll create a new dataset to work with, ensuring the integrity of the original data remains intact. This is a recommended best practice.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::recode() masks car::recode()
## ✖ purrr::some()   masks car::some()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data_new <- data_readwrite %>%
  mutate(std_residuals = rstandard(unstandarized_model)) %>%
  mutate(outlier = case_when(abs(std_residuals) > 2 ~ TRUE,
                             TRUE ~ FALSE)) %>%
  filter(!outlier)

Now rerun the regression using the new data.

standarized_model_new <- lm(scale(writing) ~ scale(reading), data = data_new)
summary(standarized_model_new)
## 
## Call:
## lm(formula = scale(writing) ~ scale(reading), data = data_new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.80092 -0.56064  0.02125  0.58771  1.78234 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -5.984e-16  1.788e-02     0.0        1    
## scale(reading)  6.224e-01  1.789e-02    34.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7829 on 1915 degrees of freedom
## Multiple R-squared:  0.3873, Adjusted R-squared:  0.387 
## F-statistic:  1211 on 1 and 1915 DF,  p-value: < 2.2e-16

Coefficients (standardized value: beta)

  • Original Model: The beta value for reading was 0.5513.
  • New Model: The beta value for reading is now 0.6224.
  • Interpretation: The increase in the beta value suggests that the relationship between reading and writing has strengthened after the outliers are removed.

Residual Standard Error

  • Original Model: 0.8345
  • New Model: 0.7829
  • Interpretation: The decrease in residual standard error suggests that the new model is a better fit for the data.

R2 Values

  • Original Model: R2 = 0.304
  • New Model: R2 = 0.3873
  • Interpretation: The R2 value has increased, indicating that the new model explains more of the variance in the dependent variable.

F-Statistic and p-value

  • Original Model: F-statistic = 871.2, p-value < 2.2e-16
  • New Model: F-statistic = 1211, p-value < 2.2e-16
  • Interpretation: The F-statistic has increased, reinforcing the idea that the model is a better fit. The p-value remains extremely low, reaffirming the statistical significance of the relationship between reading and writing scores.

Residuals

  • Original Model: The residuals ranged from about -2.88 to 2.72.
  • New Model: The residuals range from about -1.80 to 1.78.
  • Interpretation: The narrower range of residuals in the new model suggests that the fit has improved.

Overall, removing the outliers has led to a model that appears to be a better fit for the data, explaining more variance and producing more reliable estimates.