Regression analysis is a statistical tool used to
explain the relationship between a response (dependent, outcome)
variable and one or more predictor (independent) variables.
Examples of relationships between independent and dependent
variables:
Linear Regression Model: In a linear regression
model, it is assumed that the relationship between variables can be
explained with a linear function.
Simple Regression Model: A simple regression model involves one response variable and a single predictor variable.
Data in a simple regression model are observed in pairs \((x_i, y_i)\), where each index \(i\) represents one case or unit of observation in the
data.
When plotting these pairs against the \(X\) and \(Y\) axes, the goal is to fit the line that best describes the data.
Linear Function and Parameters:
Example Linear Function:
The provided figure shows a plot of the linear function \(y = f(x) = 1 + 0.5 \times x\).
- The intercept is equal to 1.
- The slope of the line is equal to 0.5.
Interpretation of Intercept and Slope: The intercept is the value of \(y\) when \(x = 0\), and the slope is the change in \(y\) for a one-unit increase in \(x\). The R code below plots this function together with a few example data points:
intercept <- 1
slope <- 0.5
x <- seq(0, 10, length.out = 100)
y <- intercept + slope * x
plot(x, y, type = "l", col = "blue", lwd = 2, xlab = "X", ylab = "Y", main = "Linear Regression Example")
points(c(1, 2, 3, 4, 5), c(1.5, 2, 2.5, 3, 3.5), col = "red", pch = 16)
text(2, 5, paste("Intercept =", intercept), pos = 3, col = "black")
text(2, 4.5, paste("Slope =", slope), pos = 3, col = "black")
legend("topright", legend = c("Linear Function", "Observed Data"), col = c("blue", "red"), lty = 1, pch = c(NA, 16))
Representation of Simple Linear Model: The simple linear regression model is written as \(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\), where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\epsilon_i\) is the error term.
\(R^2\) is the proportion of the total variation in the response that can be explained by the model.
\(R^2\) is between 0 and 1, with higher values indicating a better fit. \(R^2\) is calculated as
\(R^2 = \frac{SS_{reg}}{TSS} = 1 - \frac{RSS}{TSS}\),
where \(SS_{reg}\) is the regression sum of squares, \(RSS\) is the residual sum of squares, and \(TSS\) is the total sum of squares.
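As a minimal sketch of this formula (using simulated data and a hypothetical fitted model, not objects used elsewhere in this text), the three quantities can be computed directly in R and compared with the value reported by summary():
#simulated example: compute R-squared from sums of squares (hypothetical data)
set.seed(1)
x <- 1:20
y <- 1 + 0.5 * x + rnorm(20)
fit <- lm(y ~ x)
TSS <- sum((y - mean(y))^2)    #total sum of squares
RSS <- sum(residuals(fit)^2)   #residual sum of squares
SSreg <- TSS - RSS             #regression sum of squares
c(SSreg / TSS, 1 - RSS / TSS, summary(fit)$r.squared)  #all three agree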
We can read the data into R using the read.csv() function (or any other relevant import function). To check the class of an object we use the class() function. To get the names of variables in a data.frame we use names(), and to get the number of rows and columns of a data.frame we use dim(). The str() function returns the structure of the variables, and to get summary statistics we use summary(). We use the plot() function to visualize the relationship between the predictor and response variables. To run a simple regression model we use the lm() function. The lm() function takes the form lm(response ~ predictor, data = dataset), where response is the name of the response variable, predictor is the name of the predictor variable, and dataset is the name of the data frame containing the data. To extract the estimated coefficients of the model we use the coef() function, to obtain predictions we use the predict() function, and to visualize the model fit we use the plot() function.
# Load the data into RStudio
data <- read.csv("E:/JORIELYN/REGRESSION ANA/elemapi2v2.csv")
#returns structure of variables in the data frame
str(data)
## 'data.frame': 400 obs. of 24 variables:
## $ snum : int 906 889 887 876 888 4284 4271 2910 2899 2887 ...
## $ dnum : int 41 41 41 41 41 98 98 108 108 108 ...
## $ api00 : int 693 570 546 571 478 858 918 831 860 737 ...
## $ api99 : int 600 501 472 487 425 844 864 791 838 703 ...
## $ growth : int 93 69 74 84 53 14 54 40 22 34 ...
## $ meals : int 67 92 97 90 89 10 5 2 5 29 ...
## $ ell : int 9 21 29 27 30 3 2 3 6 15 ...
## $ yr_rnd : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mobility: int 11 33 36 27 44 10 16 44 10 17 ...
## $ acs_k3 : int 16 15 17 20 18 20 19 20 20 21 ...
## $ acs_46 : int 22 32 25 30 31 33 28 31 30 29 ...
## $ not_hsg : int 0 0 0 36 50 1 1 0 2 8 ...
## $ hsg : int 0 0 0 45 50 8 4 4 9 25 ...
## $ some_col: int 0 0 0 9 0 24 18 16 15 34 ...
## $ col_grad: int 0 0 0 9 0 36 34 50 42 27 ...
## $ grad_sch: int 0 0 0 0 0 31 43 30 33 7 ...
## $ avg_ed : num NA NA NA 1.91 1.5 ...
## $ full : int 76 79 68 87 87 100 100 96 100 96 ...
## $ emer : int 24 19 29 11 13 0 0 2 0 7 ...
## $ enroll : int 247 463 395 418 520 343 303 1513 660 362 ...
## $ mealcat : int 2 3 3 3 3 1 1 1 1 1 ...
## $ collcat : int 1 1 1 1 1 2 2 2 2 3 ...
## $ abv_hsg : int 100 100 100 64 50 99 99 100 98 92 ...
## $ lgenroll: num 2.39 2.67 2.6 2.62 2.72 ...
#Summary statistics of variables in data
summary(data)
## snum dnum api00 api99
## Min. : 58 Min. : 41.0 Min. :369.0 Min. :333.0
## 1st Qu.:1720 1st Qu.:395.0 1st Qu.:523.8 1st Qu.:484.8
## Median :3008 Median :401.0 Median :643.0 Median :602.0
## Mean :2867 Mean :457.7 Mean :647.6 Mean :610.2
## 3rd Qu.:4198 3rd Qu.:630.0 3rd Qu.:762.2 3rd Qu.:731.2
## Max. :6072 Max. :796.0 Max. :940.0 Max. :917.0
##
## growth meals ell yr_rnd
## Min. :-69.00 Min. : 0.00 Min. : 0.00 Min. :0.00
## 1st Qu.: 19.00 1st Qu.: 31.00 1st Qu.: 9.75 1st Qu.:0.00
## Median : 36.00 Median : 67.50 Median :25.00 Median :0.00
## Mean : 37.41 Mean : 60.31 Mean :31.45 Mean :0.23
## 3rd Qu.: 53.25 3rd Qu.: 90.00 3rd Qu.:50.25 3rd Qu.:0.00
## Max. :134.00 Max. :100.00 Max. :91.00 Max. :1.00
##
## mobility acs_k3 acs_46 not_hsg
## Min. : 2.00 Min. :14.00 Min. :20.00 Min. : 0.00
## 1st Qu.:13.00 1st Qu.:18.00 1st Qu.:27.00 1st Qu.: 4.00
## Median :17.00 Median :19.00 Median :29.00 Median : 14.00
## Mean :18.25 Mean :19.16 Mean :29.69 Mean : 21.25
## 3rd Qu.:22.00 3rd Qu.:20.00 3rd Qu.:31.00 3rd Qu.: 34.00
## Max. :47.00 Max. :25.00 Max. :50.00 Max. :100.00
## NA's :1 NA's :2 NA's :3
## hsg some_col col_grad grad_sch
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.000
## 1st Qu.: 17.00 1st Qu.:12.00 1st Qu.: 7.0 1st Qu.: 1.000
## Median : 26.00 Median :19.00 Median : 16.0 Median : 4.000
## Mean : 26.02 Mean :19.71 Mean : 19.7 Mean : 8.637
## 3rd Qu.: 34.00 3rd Qu.:28.00 3rd Qu.: 30.0 3rd Qu.:10.000
## Max. :100.00 Max. :67.00 Max. :100.0 Max. :67.000
##
## avg_ed full emer enroll
## Min. :1.000 Min. : 37.00 Min. : 0.00 Min. : 130.0
## 1st Qu.:2.070 1st Qu.: 76.00 1st Qu.: 3.00 1st Qu.: 320.0
## Median :2.600 Median : 88.00 Median :10.00 Median : 435.0
## Mean :2.668 Mean : 84.55 Mean :12.66 Mean : 483.5
## 3rd Qu.:3.220 3rd Qu.: 97.00 3rd Qu.:19.00 3rd Qu.: 608.0
## Max. :4.620 Max. :100.00 Max. :59.00 Max. :1570.0
## NA's :19
## mealcat collcat abv_hsg lgenroll
## Min. :1.000 Min. :1.00 Min. : 0.00 Min. :2.114
## 1st Qu.:1.000 1st Qu.:1.00 1st Qu.: 66.00 1st Qu.:2.505
## Median :2.000 Median :2.00 Median : 86.00 Median :2.638
## Mean :2.015 Mean :2.02 Mean : 78.75 Mean :2.640
## 3rd Qu.:3.000 3rd Qu.:3.00 3rd Qu.: 96.00 3rd Qu.:2.784
## Max. :3.000 Max. :3.00 Max. :100.00 Max. :3.196
##
#Simple regression model of DV api00 and IV enroll
m1 <- lm(api00 ~ enroll, data = data)
class(m1)
## [1] "lm"
print(m1) #Prints coefficients of the model
##
## Call:
## lm(formula = api00 ~ enroll, data = data)
##
## Coefficients:
## (Intercept) enroll
## 744.2514 -0.1999
#Prints summary of the model
summary(m1)
##
## Call:
## lm(formula = api00 ~ enroll, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -285.50 -112.55 -6.70 95.06 389.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 744.25141 15.93308 46.711 < 2e-16 ***
## enroll -0.19987 0.02985 -6.695 7.34e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 135 on 398 degrees of freedom
## Multiple R-squared: 0.1012, Adjusted R-squared: 0.09898
## F-statistic: 44.83 on 1 and 398 DF, p-value: 7.339e-11
#extract the model coefficients with the coefficients() function
coefficients(m1)
## (Intercept) enroll
## 744.2514142 -0.1998674
anova(m1) #anova table
## Analysis of Variance Table
##
## Response: api00
## Df Sum Sq Mean Sq F value Pr(>F)
## enroll 1 817326 817326 44.829 7.339e-11 ***
## Residuals 398 7256346 18232
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
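As a quick check of the \(R^2\) formula above, the sums of squares from this ANOVA table reproduce the Multiple R-squared reported by summary(m1) (values copied from the table, so only approximate to the printed precision):
#verify R-squared from the ANOVA sums of squares
SSreg <- 817326   #sum of squares for enroll
RSS <- 7256346    #residual sum of squares
SSreg / (SSreg + RSS)  #~0.1012, the Multiple R-squared of m1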
#first we create a new data.frame containing all predictors of the model
new.data <- data.frame(enroll = c(500, 600, 700))
predict(m1, newdata = new.data)
## 1 2 3
## 644.3177 624.3309 604.3442
plot(api00 ~ enroll, data = data) #scatter plot of api00 vs. enroll
abline(m1, col = "blue") #add regression line to the scatter plot
R uses the object class factor to store and manipulate categorical
variables. The lm() function automatically treats variables of
type character as factors, but it is always safer to change
the variable’s class to factor before the analysis.
The function factor() is used to encode a variable
(a vector) as a factor.
#we transform variable meancat to a factor variable
data$mealcat_F <- factor(data$mealcat)
#now the class of mealcat_F is factor
str(data$mealcat_F)
## Factor w/ 3 levels "1","2","3": 2 3 3 3 3 1 1 1 1 1 ...
#we transform variable yr_rnd to a factor variable
data$yr_rnd_F <- factor(data$yr_rnd)
#levels of factor variable yr_rnd_F
levels(data$yr_rnd_F)
## [1] "0" "1"
The lm() function by default generates dummy (indicator) variables for the categories of the factor, using the first level as the reference. A dummy variable takes only the values 0 and 1: it equals 1 if the observation is in that category and 0 otherwise (the model matrix shown after the summary below makes this coding explicit).
#regression of api00 with yr_rnd_F
m2 <- lm(api00 ~ yr_rnd_F, data = data)
#summary of the model
summary(m2)
##
## Call:
## lm(formula = api00 ~ yr_rnd_F, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -273.539 -95.662 0.967 103.341 297.967
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 684.54 7.14 95.88 <2e-16 ***
## yr_rnd_F1 -160.51 14.89 -10.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 125.3 on 398 degrees of freedom
## Multiple R-squared: 0.226, Adjusted R-squared: 0.2241
## F-statistic: 116.2 on 1 and 398 DF, p-value: < 2.2e-16
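To make the dummy coding visible, one can inspect the model matrix that lm() built for m2 (model.matrix() is base R; this check is an addition to the original text):
#first rows of the design matrix: an intercept column and the 0/1 dummy for yr_rnd_F
head(model.matrix(m2))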
#scatter plot api00 against yr_rnd
plot(api00 ~ yr_rnd, data = data)
abline(m2) # add regression line to the plot
#regression model of api00 against categorical variable mealcat_F with 3 levels
m3 <- lm(api00 ~ mealcat_F, data = data)
summary(m3)
##
## Call:
## lm(formula = api00 ~ mealcat_F, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -253.394 -47.883 0.282 52.282 185.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 805.718 6.169 130.60 <2e-16 ***
## mealcat_F2 -166.324 8.708 -19.10 <2e-16 ***
## mealcat_F3 -301.338 8.629 -34.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 70.61 on 397 degrees of freedom
## Multiple R-squared: 0.7548, Adjusted R-squared: 0.7536
## F-statistic: 611.1 on 2 and 397 DF, p-value: < 2.2e-16
#aggregate the mean of api00 for each category in mealcat_F
aggregate(api00 ~ mealcat_F, FUN=mean, data = data)
## mealcat_F api00
## 1 1 805.7176
## 2 2 639.3939
## 3 3 504.3796
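These group means line up with the coefficients of m3: the intercept is the mean of the reference group (mealcat_F = 1), and each remaining coefficient is the difference between a group mean and that reference mean. A quick check:
#reconstruct the group means from the coefficients of m3
coef(m3)["(Intercept)"]                           #805.72, mean of group 1 (reference)
coef(m3)["(Intercept)"] + coef(m3)["mealcat_F2"]  #805.72 - 166.32 = 639.39, mean of group 2
coef(m3)["(Intercept)"] + coef(m3)["mealcat_F3"]  #805.72 - 301.34 = 504.38, mean of group 3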
#relevel factor mealcat_F and make group "3" as the reference group
data$mealcat_F <- relevel(data$mealcat_F, ref = "3")
m4 <- lm(api00 ~ mealcat_F, data = data)
summary(m4)
##
## Call:
## lm(formula = api00 ~ mealcat_F, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -253.394 -47.883 0.282 52.282 185.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 504.380 6.033 83.61 <2e-16 ***
## mealcat_F1 301.338 8.629 34.92 <2e-16 ***
## mealcat_F2 135.014 8.612 15.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 70.61 on 397 degrees of freedom
## Multiple R-squared: 0.7548, Adjusted R-squared: 0.7536
## F-statistic: 611.1 on 2 and 397 DF, p-value: < 2.2e-16
#scatter plot api00 against yr_rnd_F
plot(api00 ~ yr_rnd_F, data = data)
#using aggregate to find the mean for each group of year school round
aggregate(api00 ~ yr_rnd_F, FUN=mean, data = data)
## yr_rnd_F api00
## 1 0 684.5390
## 2 1 524.0326
#t test for equality of mean api00 between the two groups (year-round vs. not year-round)
#with equal variance assumption
t.test(api00 ~ yr_rnd_F, data = data, var.equal = TRUE)
##
## Two Sample t-test
##
## data: api00 by yr_rnd_F
## t = 10.782, df = 398, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 131.2390 189.7737
## sample estimates:
## mean in group 0 mean in group 1
## 684.5390 524.0326
#anova table
anova(m2)
## Analysis of Variance Table
##
## Response: api00
## Df Sum Sq Mean Sq F value Pr(>F)
## yr_rnd_F 1 1825001 1825001 116.24 < 2.2e-16 ***
## Residuals 398 6248671 15700
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
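Note the equivalence of these results: the t statistic from the two-sample t-test (10.782) matches the t value of yr_rnd_F1 in m2 up to sign, and its square is the F statistic in the ANOVA table:
#for a single two-level predictor, the squared t statistic equals the F statistic
10.782^2  #~116.25, matching (up to rounding) the F value of 116.24 above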
#multiple regression model of DV api00 and IVs enroll, meals, and full
m5 <- lm(api00 ~ enroll + meals + full, data = data)
summary(m5) #summary of model m5
##
## Call:
## lm(formula = api00 ~ enroll + meals + full, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -181.721 -40.802 1.129 39.983 158.774
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 801.82983 26.42660 30.342 < 2e-16 ***
## enroll -0.05146 0.01384 -3.719 0.000229 ***
## meals -3.65973 0.10880 -33.639 < 2e-16 ***
## full 1.08109 0.23945 4.515 8.37e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.73 on 396 degrees of freedom
## Multiple R-squared: 0.8308, Adjusted R-squared: 0.8295
## F-statistic: 648.2 on 3 and 396 DF, p-value: < 2.2e-16
anova(m5) #anova table of model m5
## Analysis of Variance Table
##
## Response: api00
## Df Sum Sq Mean Sq F value Pr(>F)
## enroll 1 817326 817326 236.947 < 2.2e-16 ***
## meals 1 5820066 5820066 1687.263 < 2.2e-16 ***
## full 1 70313 70313 20.384 8.369e-06 ***
## Residuals 396 1365967 3449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To compare predictors measured on different scales, we can standardize (center and scale) each variable before fitting the model; the resulting coefficients are standardized regression coefficients. In R we use the function scale() to do this for each
variable.
#Standardized regression model
m5.sd <- lm(scale(api00) ~ scale(enroll) + scale(meals) + scale(full), data = data)
summary(m5.sd) #coefficients are standardized
##
## Call:
## lm(formula = scale(api00) ~ scale(enroll) + scale(meals) + scale(full),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27749 -0.28683 0.00793 0.28108 1.11617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.454e-16 2.064e-02 0.000 1.000000
## scale(enroll) -8.191e-02 2.203e-02 -3.719 0.000229 ***
## scale(meals) -8.210e-01 2.441e-02 -33.639 < 2e-16 ***
## scale(full) 1.136e-01 2.517e-02 4.515 8.37e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4129 on 396 degrees of freedom
## Multiple R-squared: 0.8308, Adjusted R-squared: 0.8295
## F-statistic: 648.2 on 3 and 396 DF, p-value: < 2.2e-16
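As a sanity check (an addition to the original text), each standardized coefficient should equal the corresponding unstandardized coefficient from m5 multiplied by the ratio of that predictor's standard deviation to the standard deviation of api00, since these variables have no missing values:
#reproduce the standardized coefficients from m5 and the sample standard deviations
coef(m5)[-1] * sapply(data[c("enroll", "meals", "full")], sd) / sd(data$api00)
#expected: about -0.082, -0.821, 0.114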
#multiple regression model of DV api00 and IVs enroll, meals, and enroll:meals
m6 <- lm(api00 ~ enroll + meals + enroll:meals , data = data)
summary(m6) #summary of model m6
##
## Call:
## lm(formula = api00 ~ enroll + meals + enroll:meals, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -211.186 -38.834 -1.137 38.997 163.713
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.833e+02 1.665e+01 53.034 <2e-16 ***
## enroll 8.835e-04 3.362e-02 0.026 0.9790
## meals -3.425e+00 2.344e-01 -14.614 <2e-16 ***
## enroll:meals -9.537e-04 4.292e-04 -2.222 0.0269 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59.85 on 396 degrees of freedom
## Multiple R-squared: 0.8243, Adjusted R-squared: 0.823
## F-statistic: 619.3 on 3 and 396 DF, p-value: < 2.2e-16
Regression model with interaction between two categorical predictors:
#regression of api00 on yr_rnd_F, mealcat_F, and their interaction
m7 <- lm(api00 ~ yr_rnd_F * mealcat_F , data = data)
summary(m7) #summary of model m7
##
## Call:
## lm(formula = api00 ~ yr_rnd_F * mealcat_F, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -207.533 -50.764 -1.843 48.874 179.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 521.493 8.414 61.978 < 2e-16 ***
## yr_rnd_F1 -33.493 11.771 -2.845 0.00467 **
## mealcat_F1 288.193 10.443 27.597 < 2e-16 ***
## mealcat_F2 123.781 10.552 11.731 < 2e-16 ***
## yr_rnd_F1:mealcat_F1 -40.764 29.231 -1.395 0.16394
## yr_rnd_F1:mealcat_F2 -18.248 22.256 -0.820 0.41278
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 68.87 on 394 degrees of freedom
## Multiple R-squared: 0.7685, Adjusted R-squared: 0.7656
## F-statistic: 261.6 on 5 and 394 DF, p-value: < 2.2e-16
#running the model without interaction terms
m0 <- lm(api00 ~ yr_rnd_F + mealcat_F , data = data)
#run F-test using anova function
anova(m0, m7)
## Analysis of Variance Table
##
## Model 1: api00 ~ yr_rnd_F + mealcat_F
## Model 2: api00 ~ yr_rnd_F * mealcat_F
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 396 1879528
## 2 394 1868944 2 10584 1.1156 0.3288
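The F statistic in this comparison can be reproduced by hand from the two residual sums of squares (values copied from the table above):
#F = (change in RSS / change in df) / (RSS of larger model / its residual df)
((1879528 - 1868944) / 2) / (1868944 / 394)  #~1.116, matching the table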
#regression of api00 on yr_rnd_F, enroll, and their interaction
m8 <- lm(api00 ~ yr_rnd_F * enroll , data = data)
summary(m8) #summary of model m8
##
## Call:
## lm(formula = api00 ~ yr_rnd_F * enroll, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -274.043 -94.781 0.417 97.666 309.573
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 682.068556 18.797436 36.285 <2e-16 ***
## yr_rnd_F1 -74.858236 48.224281 -1.552 0.1214
## enroll 0.006021 0.042396 0.142 0.8871
## yr_rnd_F1:enroll -0.120218 0.072075 -1.668 0.0961 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 125 on 396 degrees of freedom
## Multiple R-squared: 0.2335, Adjusted R-squared: 0.2277
## F-statistic: 40.21 on 3 and 396 DF, p-value: < 2.2e-16
#First get the range of continuous predictor
range(data$enroll)
## [1] 130 1570
#make a sequence of enroll values over its range (130 to 1570) in increments of 5
new.enroll <- seq(130, 1570, 5)
#repeat each level of yr_rnd_F once for every value of new.enroll
new.yr_rnd <- rep(levels(data$yr_rnd_F), each = length(new.enroll))
#create new data
new.data.plot <- data.frame(enroll = rep(new.enroll, times = 2), yr_rnd_F = new.yr_rnd)
#predict api00
new.data.plot$predicted_api00 <- predict(m8, newdata = new.data.plot)
#use ggplot2 to plot the predicted values
library(ggplot2)
ggplot(new.data.plot, aes(x = enroll, y = predicted_api00, colour = yr_rnd_F)) +
geom_line(lwd=1.2)
Assumptions of OLS regression:
Homogeneity of variance (homoscedasticity): constant error variance
Linearity: linear relationships between predictors and outcome
Independence: no correlation between errors
Normality: errors are normally distributed
Model specification: correct inclusions and exclusions of variables
Issues of concern:
Influence: individual observations impacting coefficients heavily
Collinearity: highly correlated predictors causing estimation problems
Tools and methods:
Graphical methods and numerical tests for diagnostics
R’s built-in stats package and additional packages like “car”, “alr4”, and “faraway”
Instructions for installing and loading packages in RStudio:
# install packages for Regression Diagnostics
#install.packages("car")
#install.packages("alr4")
#install.packages("faraway")
#loading packages into working environment
library(car)
library(alr4)
library(faraway)
Three types of unusual data:
Outliers: Data points with large residuals, potentially indicating sample peculiarities or data errors.
Leverage: Data points with extreme values on a predictor variable, influencing the estimate of regression coefficients.
Influence: Data points that, when removed, significantly change the estimates of coefficients, a combination of leverage and outlierness.
Identifying unusual data:
#multiple regression model of DV api00 and IVs enroll, meals, and full (this redefines m2)
m2 <- lm(api00 ~ enroll + meals + full, data = data)
scatterplotMatrix(~ api00 + enroll + meals + full, data = data)
res.std <- rstandard(m2) #standardized residuals stored in vector res.std
#plot standardized residuals on the y axis; the x axis is the observation index (row number)
plot(res.std, ylab="Standardized Residual", ylim=c(-3.5,3.5))
#add horizontal lines 3 and -3 to identify extreme values
abline(h =c(-3,0,3), lty = 2)
#find out which data points fall outside the 3-standard-deviation cutoff
#index contains the row numbers of those points
index <- which(res.std > 3 | res.std < -3)
#add the school number next to points with extreme values
text(index-20, res.std[index] , labels = data$snum[index])
We can run a test statistic to check whether we have outliers using the function
outlierTest(m2)
from the package “car”.
#Bonferroni p-values for testing outliers
outlierTest(m2)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 226 -3.151186 0.00175 0.70001
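The Bonferroni p-value here is simply the unadjusted p-value multiplied by the number of observations tested (400 schools), up to rounding of the printed values:
#Bonferroni correction: unadjusted p-value times the number of observations
0.00175 * 400  #= 0.70, matching the Bonferroni p reported above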
#a vector containing the diagonal of the 'hat' matrix
h <- influence(m2)$hat
#half normal plot of leverage from package faraway
halfnorm(influence(m2)$hat, ylab = "leverage")
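A common rule of thumb (an assumption added here, not stated in the original text) flags observations whose leverage exceeds 2(k+1)/n, where k is the number of predictors:
#rule-of-thumb leverage cutoff: 2*(k+1)/n
k <- length(coef(m2)) - 1               #number of predictors in m2
lev.cutoff <- 2 * (k + 1) / nrow(data)  #= 0.02 here
which(h > lev.cutoff)                   #row numbers of high-leverage observations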
Cook’s distance is a measure of influence. A point with a large Cook’s distance is considered a highly influential point.
#the cutoff value for Cook's distance
cutoff <- 4/((nrow(data)-length(m2$coefficients)-2))
plot(m2, which = 4, cook.levels = cutoff)
We can use the influencePlot()
function in the package “car” to
identify influential points.
#cook's distance, studentized residuals, and leverage in the same plot
influencePlot(m2, main="Influence Plot",
sub="Circle size is proportional to Cook's Distance" )
## StudRes Hat CookD
## 8 0.18718812 0.08016299 7.652779e-04
## 93 2.76307269 0.02940688 5.687488e-02
## 210 0.03127861 0.06083329 1.588292e-05
## 226 -3.15118603 0.01417076 3.489753e-02
## 346 -2.83932062 0.00412967 8.211170e-03
Also, the infIndexPlot()
function gives us a series of
diagnostic plots that we need to investigate influential points.
#4 diagnostic plots to identify influential points
infIndexPlot(m2)
One of the main assumptions of ordinary least squares (OLS) regression is homogeneity of variance of the residuals; heteroscedasticity is a violation of this assumption.
There are graphical and non-graphical methods for detecting heteroscedasticity. A commonly used graphical method is to plot the residuals versus fitted (predicted) values.
#residual vs. fitted value plot for Homoscedasticity
plot(m2$resid ~ m2$fitted.values)
#add horizontal line at 0
abline(h = 0, lty = 2)
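Beyond the plot, a non-graphical check (an addition here, assuming the car package loaded earlier) is the score test for non-constant error variance:
#score test for non-constant error variance, from package car
ncvTest(m2)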
To check linearity, the residuals should be plotted against the fitted values as
well as against each of the predictors. If any of these plots shows a systematic shape,
the linear model is not appropriate and nonlinear terms may
need to be added. In the package car, the function residualPlots()
produces these plots.
#residual vs. fitted value and all predictors plus test for curvature
residualPlots(m2)
## Test stat Pr(>|Test stat|)
## enroll 0.0111 0.9911
## meals -0.6238 0.5331
## full 1.1565 0.2482
## Tukey test -0.8411 0.4003
To check the independence of errors, a simple visual check is to plot the residuals against a time or ordering variable. Here the data are not a time series, so we use the school number as an ordering variable for the data.
#plotting the trend of residuals and school number
plot(m2$resid ~ data$snum) #residual plot vs. school id
Normality Misconception:
OLS regression does not require the outcome or the predictors to be normally distributed; normality of the residuals is needed only for valid hypothesis testing (the t-tests and F-test).
Furthermore, there is no assumption or requirement that the predictor variables be normally distributed. If this were the case, then we would not be able to use dummy-coded variables in our models.
In addition, because of large-sample theory, if we have a large enough sample size we do not even need the residuals to be normally distributed. However, for small sample sizes the normality assumption is required.
To check normality we use a normal Q-Q plot of the residuals.
qqnorm(m2$resid) #Normal Quantile to Quantile plot
qqline(m2$resid)
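As a supplementary numeric check (an addition, not part of the original text), the Shapiro-Wilk test can be applied to the residuals; with 400 observations it is valid, though, as noted above, normality matters mainly for small samples:
#Shapiro-Wilk normality test on the residuals
shapiro.test(m2$resid)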
Multicollinearity occurs when two or more predictor variables are highly correlated (near perfect linear relationships).
When multicollinearity exists, the estimates (regression coefficients) become unstable and their standard errors (uncertainty) get inflated. This makes it difficult to interpret the results and draw reliable conclusions.
VIF (Variance Inflation Factor): A numerical measure used to quantify the degree of multicollinearity. As a rule of thumb, a VIF value greater than 10 indicates potential problems.
Tolerance: Calculated as 1/VIF, a value below 0.1 suggests a problematic level of collinearity.
In R we can use the function vif()
from package car or
faraway. Since there are two packages in our environment that include
this function, we use car::vif() to tell R that we want to run this
function from package car:
car::vif(m2) #variance inflation factor
## enroll meals full
## 1.135733 1.394279 1.482305
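Tolerance, as defined above, is just the reciprocal of VIF, so it can be computed directly from the same output:
#tolerance is 1/VIF; values below 0.1 suggest problematic collinearity
1 / car::vif(m2)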
Model specification error: These occur when relevant variables are omitted or irrelevant ones are included in the model.
Consequences:
Omitting relevant variables: Their influence gets wrongly attributed to other variables, inflating the error term and distorting coefficients.
Including irrelevant variables: They compete for explanation, leading to inaccurate coefficient estimates.
Identifying relevant variables: Various methods and statistical tests exist, including:
Added variable plot: Visualizes the contribution of a variable by plotting the residuals of the model without that variable against the residuals from regressing that variable on the remaining predictors.
Slope interpretation: If the slope of the points in the added-variable plot is not close to zero, the excluded variable likely contributes to the model.
Added variable plots can also help identify influential data points.
avPlots(m2) #added variable plots from package "car"