Simple Regression

Introduction

Regression analysis is a statistical tool used to explain the relationship between a response (dependent, outcome) variable and one or more predictor (independent) variables.

Examples of relationships between independent and dependent variables:

  • Employee efficiency can be related to years of training, educational background, and age.
  • The time until a headache clears with a pain killer may be related to dosage, age, and gender.
  • The number of votes for a presidential candidate may be related to gender, city average income, and state.
  • Student academic performance in schools may be related to factors like poverty level, number of students, teacher credentials, and class size.

Linear Regression Model: In a linear regression model, it is assumed that the relationship between variables can be explained with a linear function.

Simple Regression Model: A simple regression model involves one response variable and a single predictor variable.

Linear function

Data in a simple regression model are observed in pairs, with

  • \(X:\) representing the predictor variable \((x_1, x_2, \dots, x_n)\); and
  • \(Y:\) representing the response variable \((y_1, y_2,\dots, y_n)\).

Each index represents one case or unit of observation in the data.

When plotting these pairs against the \(X\) and \(Y\) axes, the goal is to fit the best line to the data.

Linear Function and Parameters:

  • A linear function is determined by the intercept and the slope of the line.

Example Linear Function:

The figure below shows a plot of the linear function \(y = f(x) = 1 + 0.5x\).

  • The intercept is equal to 1.
  • The slope of the line is equal to 0.5.

Interpretation of Intercept and Slope:

  • The intercept represents the value of \(y\) when \(x = 0\).
  • The slope represents the increase in \(y\) for a one-unit increase in \(x\).
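
The short R code below draws this line, adds a few illustrative points that lie on it, and annotates the intercept and slope:
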
intercept <- 1   #intercept of the line
slope <- 0.5     #slope of the line

#x values over which to draw the line
x <- seq(0, 10, length.out = 100)
y <- intercept + slope * x

#draw the linear function
plot(x, y, type = "l", col = "blue", lwd = 2, xlab = "X", ylab = "Y", main = "Linear Regression Example")

#add a few illustrative data points that lie on the line
points(c(1, 2, 3, 4, 5), c(1.5, 2, 2.5, 3, 3.5), col = "red", pch = 16)

#annotate the intercept and slope
text(2, 5, paste("Intercept =", intercept), pos = 3, col = "black")
text(2, 4.5, paste("Slope =", slope), pos = 3, col = "black")

legend("topright", legend = c("Linear Function", "Observed Data"), col = c("blue", "red"), lty = c(1, NA), pch = c(NA, 16))

Linear regression as a statistical model

Representation of Simple Linear Model:

  • The simple linear model is represented as
    \(Y=\beta_0 + \beta_1 X + \epsilon\)
    .
    • \(\beta_0\) is the intercept
    • \(\beta_1\) is the slope
    • \(\epsilon\) is the error term.
  • The regression model has two parts: a linear function and an error term.
  • \(\beta_0\) and \(\beta_1\) are model parameters, estimated from the data.
  • The linear function predicts the mean value of the outcome \((E(Y))\) for a given predictor variable.
  • The mean of \(Y\) conditional on \(X=x\) is written as \(E(Y|X=x)= \beta_0 + \beta_1 x\) in a simple linear regression.
  • The error term \((\epsilon)\) represents unexplained uncertainty, is a random variable with mean zero, and has an unknown variance \((\sigma^2)\).
  • The error term is often assumed to have a normal distribution, though it’s not mandatory.
  • In a simple regression model, individual units \((i=1,2,\dots,n)\) have pairs of observations \((x_i,y_i)\).
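
To make this concrete, here is a minimal simulation sketch (with assumed values \(\beta_0 = 1\), \(\beta_1 = 0.5\), \(\sigma = 0.3\)) that generates data from the model and recovers the parameters with lm():

#minimal simulation sketch: assumed true values beta0 = 1, beta1 = 0.5, sigma = 0.3
set.seed(123)
n <- 100
x <- runif(n, 0, 10)                  #predictor values
eps <- rnorm(n, mean = 0, sd = 0.3)   #error term with mean zero
y <- 1 + 0.5 * x + eps                #response generated from the linear model
sim <- data.frame(x = x, y = y)
fit <- lm(y ~ x, data = sim)          #estimate beta0 and beta1 from the data
coef(fit)                             #estimates should be close to 1 and 0.5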

Decomposing the sum of squares and ANOVA

  • \(R^2\) is the proportion of the total variation that can be explained by the model.

  • The total variation in \(Y\) decomposes into a part explained by the regression and a residual part: \(TSS = SSreg + RSS\); the short R sketch after this list verifies the identity.

  • \(R^2\) is between 0 and 1, with higher values indicating a better fit. It is calculated as

    \(R^2 = \frac{SSreg}{TSS} = 1 - \frac{RSS}{TSS}\)

    • SSreg is the regression sum of squares
    • TSS is the total sum of squares
    • RSS is the residual sum of squares
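
As a quick check of these definitions, the following short sketch (using R's built-in cars data set as an assumed example) computes TSS, SSreg, and RSS directly and verifies that both expressions give the same \(R^2\) reported by summary():

#sketch using the built-in cars data set (assumed example, not the course data)
fit <- lm(dist ~ speed, data = cars)
TSS <- sum((cars$dist - mean(cars$dist))^2)      #total sum of squares
RSS <- sum(residuals(fit)^2)                     #residual sum of squares
SSreg <- sum((fitted(fit) - mean(cars$dist))^2)  #regression sum of squares
c(SSreg / TSS, 1 - RSS / TSS, summary(fit)$r.squared)  #all three values agree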

Running the regression model in RStudio

To run a regression model in RStudio, we can follow these steps:

  • Load the data into RStudio using the read.csv() function or any other relevant function.
  • An object’s type can be determined with the class() function. Use names() to get the variable names of a data.frame and dim() to get its number of rows and columns.
  • To inspect the structure of the data, use the str() function; for summary statistics, use summary().
  • Create a scatter plot of the data using the plot() function to visualize the relationship between the predictor and response variables.
  • Fit a regression model to the data using the lm() function. The lm() function takes the form lm(response ~ predictor, data = dataset), where response is the name of the response variable, predictor is the name of the predictor variable, and dataset is the name of the data frame containing the data.
  • Extract the coefficients of the model using the coef() function.
  • We can use function anova() to extract the sums of squares of the model.
  • Calculate the predicted values of the response variable using the predict() function.
  • Plot the predicted values against the actual values using the plot() function.

Example code and the corresponding output:

# Load the data into RStudio
data <- read.csv("E:/JORIELYN/REGRESSION ANA/elemapi2v2.csv")

#returns structure of variables in the data frame 
str(data)      
## 'data.frame':    400 obs. of  24 variables:
##  $ snum    : int  906 889 887 876 888 4284 4271 2910 2899 2887 ...
##  $ dnum    : int  41 41 41 41 41 98 98 108 108 108 ...
##  $ api00   : int  693 570 546 571 478 858 918 831 860 737 ...
##  $ api99   : int  600 501 472 487 425 844 864 791 838 703 ...
##  $ growth  : int  93 69 74 84 53 14 54 40 22 34 ...
##  $ meals   : int  67 92 97 90 89 10 5 2 5 29 ...
##  $ ell     : int  9 21 29 27 30 3 2 3 6 15 ...
##  $ yr_rnd  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mobility: int  11 33 36 27 44 10 16 44 10 17 ...
##  $ acs_k3  : int  16 15 17 20 18 20 19 20 20 21 ...
##  $ acs_46  : int  22 32 25 30 31 33 28 31 30 29 ...
##  $ not_hsg : int  0 0 0 36 50 1 1 0 2 8 ...
##  $ hsg     : int  0 0 0 45 50 8 4 4 9 25 ...
##  $ some_col: int  0 0 0 9 0 24 18 16 15 34 ...
##  $ col_grad: int  0 0 0 9 0 36 34 50 42 27 ...
##  $ grad_sch: int  0 0 0 0 0 31 43 30 33 7 ...
##  $ avg_ed  : num  NA NA NA 1.91 1.5 ...
##  $ full    : int  76 79 68 87 87 100 100 96 100 96 ...
##  $ emer    : int  24 19 29 11 13 0 0 2 0 7 ...
##  $ enroll  : int  247 463 395 418 520 343 303 1513 660 362 ...
##  $ mealcat : int  2 3 3 3 3 1 1 1 1 1 ...
##  $ collcat : int  1 1 1 1 1 2 2 2 2 3 ...
##  $ abv_hsg : int  100 100 100 64 50 99 99 100 98 92 ...
##  $ lgenroll: num  2.39 2.67 2.6 2.62 2.72 ...
#Summary statistics of variables in data
summary(data)  
##       snum           dnum           api00           api99      
##  Min.   :  58   Min.   : 41.0   Min.   :369.0   Min.   :333.0  
##  1st Qu.:1720   1st Qu.:395.0   1st Qu.:523.8   1st Qu.:484.8  
##  Median :3008   Median :401.0   Median :643.0   Median :602.0  
##  Mean   :2867   Mean   :457.7   Mean   :647.6   Mean   :610.2  
##  3rd Qu.:4198   3rd Qu.:630.0   3rd Qu.:762.2   3rd Qu.:731.2  
##  Max.   :6072   Max.   :796.0   Max.   :940.0   Max.   :917.0  
##                                                                
##      growth           meals             ell            yr_rnd    
##  Min.   :-69.00   Min.   :  0.00   Min.   : 0.00   Min.   :0.00  
##  1st Qu.: 19.00   1st Qu.: 31.00   1st Qu.: 9.75   1st Qu.:0.00  
##  Median : 36.00   Median : 67.50   Median :25.00   Median :0.00  
##  Mean   : 37.41   Mean   : 60.31   Mean   :31.45   Mean   :0.23  
##  3rd Qu.: 53.25   3rd Qu.: 90.00   3rd Qu.:50.25   3rd Qu.:0.00  
##  Max.   :134.00   Max.   :100.00   Max.   :91.00   Max.   :1.00  
##                                                                  
##     mobility         acs_k3          acs_46         not_hsg      
##  Min.   : 2.00   Min.   :14.00   Min.   :20.00   Min.   :  0.00  
##  1st Qu.:13.00   1st Qu.:18.00   1st Qu.:27.00   1st Qu.:  4.00  
##  Median :17.00   Median :19.00   Median :29.00   Median : 14.00  
##  Mean   :18.25   Mean   :19.16   Mean   :29.69   Mean   : 21.25  
##  3rd Qu.:22.00   3rd Qu.:20.00   3rd Qu.:31.00   3rd Qu.: 34.00  
##  Max.   :47.00   Max.   :25.00   Max.   :50.00   Max.   :100.00  
##  NA's   :1       NA's   :2       NA's   :3                       
##       hsg            some_col        col_grad        grad_sch     
##  Min.   :  0.00   Min.   : 0.00   Min.   :  0.0   Min.   : 0.000  
##  1st Qu.: 17.00   1st Qu.:12.00   1st Qu.:  7.0   1st Qu.: 1.000  
##  Median : 26.00   Median :19.00   Median : 16.0   Median : 4.000  
##  Mean   : 26.02   Mean   :19.71   Mean   : 19.7   Mean   : 8.637  
##  3rd Qu.: 34.00   3rd Qu.:28.00   3rd Qu.: 30.0   3rd Qu.:10.000  
##  Max.   :100.00   Max.   :67.00   Max.   :100.0   Max.   :67.000  
##                                                                   
##      avg_ed           full             emer           enroll      
##  Min.   :1.000   Min.   : 37.00   Min.   : 0.00   Min.   : 130.0  
##  1st Qu.:2.070   1st Qu.: 76.00   1st Qu.: 3.00   1st Qu.: 320.0  
##  Median :2.600   Median : 88.00   Median :10.00   Median : 435.0  
##  Mean   :2.668   Mean   : 84.55   Mean   :12.66   Mean   : 483.5  
##  3rd Qu.:3.220   3rd Qu.: 97.00   3rd Qu.:19.00   3rd Qu.: 608.0  
##  Max.   :4.620   Max.   :100.00   Max.   :59.00   Max.   :1570.0  
##  NA's   :19                                                       
##     mealcat         collcat        abv_hsg          lgenroll    
##  Min.   :1.000   Min.   :1.00   Min.   :  0.00   Min.   :2.114  
##  1st Qu.:1.000   1st Qu.:1.00   1st Qu.: 66.00   1st Qu.:2.505  
##  Median :2.000   Median :2.00   Median : 86.00   Median :2.638  
##  Mean   :2.015   Mean   :2.02   Mean   : 78.75   Mean   :2.640  
##  3rd Qu.:3.000   3rd Qu.:3.00   3rd Qu.: 96.00   3rd Qu.:2.784  
##  Max.   :3.000   Max.   :3.00   Max.   :100.00   Max.   :3.196  
## 
#Simple regression model of DV api00 and IV enroll
m1 <- lm(api00 ~ enroll, data = data)
class(m1)
## [1] "lm"
print(m1)    #Prints coefficients of the model
## 
## Call:
## lm(formula = api00 ~ enroll, data = data)
## 
## Coefficients:
## (Intercept)       enroll  
##    744.2514      -0.1999
#Prints summary of the model
summary(m1)
## 
## Call:
## lm(formula = api00 ~ enroll, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -285.50 -112.55   -6.70   95.06  389.15 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 744.25141   15.93308  46.711  < 2e-16 ***
## enroll       -0.19987    0.02985  -6.695 7.34e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 135 on 398 degrees of freedom
## Multiple R-squared:  0.1012, Adjusted R-squared:  0.09898 
## F-statistic: 44.83 on 1 and 398 DF,  p-value: 7.339e-11
#extract coefficients of the model with the coefficients() function
coefficients(m1)
## (Intercept)      enroll 
## 744.2514142  -0.1998674
anova(m1)  #anova table
## Analysis of Variance Table
## 
## Response: api00
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## enroll      1  817326  817326  44.829 7.339e-11 ***
## Residuals 398 7256346   18232                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#first we need to create a new data.frame containing all predictors of the model
new.data <- data.frame(enroll = c(500, 600, 700))
predict(m1, newdata = new.data)
##        1        2        3 
## 644.3177 624.3309 604.3442
plot(api00 ~ enroll, data = data) #scatter plot of api00 vs. enroll
abline(m1, col = "blue")       #add regression line to the scatter plot
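
The step list above also suggests plotting predicted against actual values; a minimal sketch of that check for m1 (api00 and enroll have no missing values here, so the fitted values line up with the full data) is:

#plot predicted (fitted) api00 against observed api00
plot(fitted(m1), data$api00, xlab = "Predicted api00", ylab = "Observed api00")
abline(0, 1, lty = 2)   #reference line where predicted equals observed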

Factor and dummy variable

#we transform variable mealcat to a factor variable 
data$mealcat_F <- factor(data$mealcat)
#now the class of mealcat_F is factor
str(data$mealcat_F)
##  Factor w/ 3 levels "1","2","3": 2 3 3 3 3 1 1 1 1 1 ...
#we transform variable yr_rnd to a factor variable 
data$yr_rnd_F <- factor(data$yr_rnd)
#levels of factor variable yr_rnd_F
levels(data$yr_rnd_F)
## [1] "0" "1"

Regression of api00 on yr_rnd, a two-level categorical variable

#regression of api00 with yr_rnd_F
m2 <- lm(api00 ~ yr_rnd_F, data = data)
#summary of the model
summary(m2)
## 
## Call:
## lm(formula = api00 ~ yr_rnd_F, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -273.539  -95.662    0.967  103.341  297.967 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   684.54       7.14   95.88   <2e-16 ***
## yr_rnd_F1    -160.51      14.89  -10.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 125.3 on 398 degrees of freedom
## Multiple R-squared:  0.226,  Adjusted R-squared:  0.2241 
## F-statistic: 116.2 on 1 and 398 DF,  p-value: < 2.2e-16
#scatter plot api00 against yr_rnd
plot(api00 ~ yr_rnd, data = data)
abline(m2)  # add regression line to the plot

Regression of api00 on mealcat, a three-level categorical variable

#regression model of api00 against categorical variable mealcat_F with 3 levels
m3 <- lm(api00 ~ mealcat_F, data = data)
summary(m3)
## 
## Call:
## lm(formula = api00 ~ mealcat_F, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -253.394  -47.883    0.282   52.282  185.620 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  805.718      6.169  130.60   <2e-16 ***
## mealcat_F2  -166.324      8.708  -19.10   <2e-16 ***
## mealcat_F3  -301.338      8.629  -34.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 70.61 on 397 degrees of freedom
## Multiple R-squared:  0.7548, Adjusted R-squared:  0.7536 
## F-statistic: 611.1 on 2 and 397 DF,  p-value: < 2.2e-16
#aggregate the mean of api00 for each category in mealcat_F
aggregate(api00 ~ mealcat_F, FUN=mean, data = data)
##   mealcat_F    api00
## 1         1 805.7176
## 2         2 639.3939
## 3         3 504.3796
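
Comparing these group means with the coefficients of m3: the intercept (805.72) is the mean of api00 in the reference group (mealcat_F = 1), and each coefficient is the difference between a group mean and the reference mean, for example 639.39 - 805.72 = -166.32 for mealcat_F2.
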
#relevel factor mealcat_F and make group "3" as the reference group
data$mealcat_F <- relevel(data$mealcat_F, ref = "3")
m4 <- lm(api00 ~ mealcat_F, data = data)
summary(m4)
## 
## Call:
## lm(formula = api00 ~ mealcat_F, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -253.394  -47.883    0.282   52.282  185.620 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  504.380      6.033   83.61   <2e-16 ***
## mealcat_F1   301.338      8.629   34.92   <2e-16 ***
## mealcat_F2   135.014      8.612   15.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 70.61 on 397 degrees of freedom
## Multiple R-squared:  0.7548, Adjusted R-squared:  0.7536 
## F-statistic: 611.1 on 2 and 397 DF,  p-value: < 2.2e-16

Regression analysis with a single dummy variable and t-test (Optional)

#scatter plot api00 against yr_rnd_F
plot(api00 ~ yr_rnd_F, data = data)

#use aggregate to find the mean of api00 for each level of year-round status
aggregate(api00 ~ yr_rnd_F, FUN=mean, data = data)
##   yr_rnd_F    api00
## 1        0 684.5390
## 2        1 524.0326
#t-test for equality of the mean of api00 between year-round and non-year-round schools
#assuming equal variances
t.test(api00 ~ yr_rnd_F, data = data, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  api00 by yr_rnd_F
## t = 10.782, df = 398, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  131.2390 189.7737
## sample estimates:
## mean in group 0 mean in group 1 
##        684.5390        524.0326
#anova table
anova(m2)
## Analysis of Variance Table
## 
## Response: api00
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## yr_rnd_F    1 1825001 1825001  116.24 < 2.2e-16 ***
## Residuals 398 6248671   15700                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
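
Note that the dummy-variable regression, the equal-variance t-test, and the one-way ANOVA are equivalent here: the F statistic (116.24) is the square of the t statistic (10.782^2 ≈ 116.25), and all three approaches give the same p-value.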

Multiple Regression

Adding more predictors to a simple regression model

#multiple regression model of DV api00 and IVs enroll, meals, and full
m5 <- lm(api00 ~  enroll + meals + full, data = data)
summary(m5)   #summary of model m5
## 
## Call:
## lm(formula = api00 ~ enroll + meals + full, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -181.721  -40.802    1.129   39.983  158.774 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 801.82983   26.42660  30.342  < 2e-16 ***
## enroll       -0.05146    0.01384  -3.719 0.000229 ***
## meals        -3.65973    0.10880 -33.639  < 2e-16 ***
## full          1.08109    0.23945   4.515 8.37e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 58.73 on 396 degrees of freedom
## Multiple R-squared:  0.8308, Adjusted R-squared:  0.8295 
## F-statistic: 648.2 on 3 and 396 DF,  p-value: < 2.2e-16
anova(m5)     #anova table of model m5 
## Analysis of Variance Table
## 
## Response: api00
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## enroll      1  817326  817326  236.947 < 2.2e-16 ***
## meals       1 5820066 5820066 1687.263 < 2.2e-16 ***
## full        1   70313   70313   20.384 8.369e-06 ***
## Residuals 396 1365967    3449                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Standardized regression coefficients

Standardized coefficients are obtained by standardizing (centering and scaling) the outcome and the predictors before fitting the model. In R we use the scale() function to do this for each variable.

#Standardized regression model
m5.sd <- lm(scale(api00) ~  scale(enroll) + scale(meals) + scale(full), data = data)
summary(m5.sd) #coefficients are standardized
## 
## Call:
## lm(formula = scale(api00) ~ scale(enroll) + scale(meals) + scale(full), 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27749 -0.28683  0.00793  0.28108  1.11617 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.454e-16  2.064e-02   0.000 1.000000    
## scale(enroll) -8.191e-02  2.203e-02  -3.719 0.000229 ***
## scale(meals)  -8.210e-01  2.441e-02 -33.639  < 2e-16 ***
## scale(full)    1.136e-01  2.517e-02   4.515 8.37e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4129 on 396 degrees of freedom
## Multiple R-squared:  0.8308, Adjusted R-squared:  0.8295 
## F-statistic: 648.2 on 3 and 396 DF,  p-value: < 2.2e-16
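
Equivalently, a standardized slope can be obtained from the unstandardized one by rescaling with the standard deviations of the predictor and the outcome; a minimal sketch for the enroll coefficient of m5:

#standardized slope = unstandardized slope * sd(x) / sd(y)
coef(m5)["enroll"] * sd(data$enroll) / sd(data$api00)
#this should match the scale(enroll) coefficient above (about -0.082)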

Regression models with two-way interactions

Regression model with interaction between two continuous predictors

#multiple regression model of DV api00 and IVs enroll, meals, and enroll:meals
m6 <- lm(api00 ~  enroll + meals + enroll:meals , data = data)
summary(m6)   #summary of model m6
## 
## Call:
## lm(formula = api00 ~ enroll + meals + enroll:meals, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -211.186  -38.834   -1.137   38.997  163.713 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.833e+02  1.665e+01  53.034   <2e-16 ***
## enroll        8.835e-04  3.362e-02   0.026   0.9790    
## meals        -3.425e+00  2.344e-01 -14.614   <2e-16 ***
## enroll:meals -9.537e-04  4.292e-04  -2.222   0.0269 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 59.85 on 396 degrees of freedom
## Multiple R-squared:  0.8243, Adjusted R-squared:  0.823 
## F-statistic: 619.3 on 3 and 396 DF,  p-value: < 2.2e-16

Regression model with interaction between two categorical predictors

#regression model of DV api00 on categorical IVs yr_rnd_F and mealcat_F and their interaction
m7 <- lm(api00 ~  yr_rnd_F * mealcat_F , data = data)
summary(m7)   #summary of model m7
## 
## Call:
## lm(formula = api00 ~ yr_rnd_F * mealcat_F, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -207.533  -50.764   -1.843   48.874  179.000 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           521.493      8.414  61.978  < 2e-16 ***
## yr_rnd_F1             -33.493     11.771  -2.845  0.00467 ** 
## mealcat_F1            288.193     10.443  27.597  < 2e-16 ***
## mealcat_F2            123.781     10.552  11.731  < 2e-16 ***
## yr_rnd_F1:mealcat_F1  -40.764     29.231  -1.395  0.16394    
## yr_rnd_F1:mealcat_F2  -18.248     22.256  -0.820  0.41278    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68.87 on 394 degrees of freedom
## Multiple R-squared:  0.7685, Adjusted R-squared:  0.7656 
## F-statistic: 261.6 on 5 and 394 DF,  p-value: < 2.2e-16
#running the model without interaction terms
m0 <- lm(api00 ~  yr_rnd_F + mealcat_F , data = data)
#run F-test using anova function
anova(m0, m7)
## Analysis of Variance Table
## 
## Model 1: api00 ~ yr_rnd_F + mealcat_F
## Model 2: api00 ~ yr_rnd_F * mealcat_F
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1    396 1879528                           
## 2    394 1868944  2     10584 1.1156 0.3288
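
Because this F test is not significant (p = 0.33), the interaction terms do not significantly improve the fit, and the simpler additive model m0 is preferred here.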

Regression model with interaction between a continuous predictor and a categorical predictor

#regression model of DV api00 on yr_rnd_F, enroll, and their interaction
m8 <- lm(api00 ~  yr_rnd_F * enroll , data = data)
summary(m8)   #summary of model m8
## 
## Call:
## lm(formula = api00 ~ yr_rnd_F * enroll, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -274.043  -94.781    0.417   97.666  309.573 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      682.068556  18.797436  36.285   <2e-16 ***
## yr_rnd_F1        -74.858236  48.224281  -1.552   0.1214    
## enroll             0.006021   0.042396   0.142   0.8871    
## yr_rnd_F1:enroll  -0.120218   0.072075  -1.668   0.0961 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 125 on 396 degrees of freedom
## Multiple R-squared:  0.2335, Adjusted R-squared:  0.2277 
## F-statistic: 40.21 on 3 and 396 DF,  p-value: < 2.2e-16

Plotting the interaction between a continuous predictor and a categorical predictor

#First get the range of continuous predictor
range(data$enroll)
## [1]  130 1570
#make a sequence of enroll values covering its range 130-1570, in increments of 5
new.enroll <- seq(130, 1570, 5)
#Because yr_rnd_F has two levels, repeat each level once for every value in the enroll sequence.
new.yr_rnd <- rep(levels(data$yr_rnd_F), each = length(new.enroll))
#create new data
new.data.plot <- data.frame(enroll = rep(new.enroll, times = 2),  yr_rnd_F = new.yr_rnd)
#predict api00
new.data.plot$predicted_api00 <- predict(m8, newdata = new.data.plot)  
#use ggplot2 to plot the predicted values for each group
library(ggplot2)
ggplot(new.data.plot, aes(x = enroll, y = predicted_api00, colour = yr_rnd_F)) +
  geom_line(lwd=1.2) 

Summary

  • In this part we discussed the basics of performing simple and multiple regression in R, how to interpret the output, and some related commands.

Regression Diagnostics

Introduction

Assumptions of OLS regression:

  • Homogeneity of variance (homoscedasticity): constant error variance

  • Linearity: linear relationships between predictors and outcome

  • Independence: no correlation between errors

  • Normality: errors are normally distributed

  • Model specification: correct inclusions and exclusions of variables

Issues of concern:

  • Influence: individual observations impacting coefficients heavily

  • Collinearity: highly correlated predictors causing estimation problems

Tools and methods:

  • Graphical methods and numerical tests for diagnostics

  • R’s built-in stats package and additional packages like “car”, “alr4”, and “faraway”

Instructions for installing and loading packages in RStudio:

# install packages for Regression Diagnostics

#install.packages("car") 
#install.packages("alr4")
#install.packages("faraway")

#loading packages into working environment
library(car)
library(alr4)
library(faraway)

Unusual and Influential data

Three types of unusual data:

  • Outliers: Data points with large residuals, potentially indicating sample peculiarities or data errors.

  • Leverage: Data points with extreme values on a predictor variable, influencing the estimate of regression coefficients.

  • Influence: Data points that, when removed, significantly change the estimates of coefficients, a combination of leverage and outlierness.

Identifying unusual data:

  • Scatterplot matrix: Visualize relationships between variables to identify potential outliers.
#multiple regression model of DV api00 and IVs enroll, meals, and full
m2 <- lm(api00 ~  enroll + meals + full, data = data)
scatterplotMatrix(~ api00 + enroll + meals + full, data = data)

  • Studentized residuals: Identify outliers by how far their residuals fall from zero, in standard-deviation units. In R the rstandard() function computes (internally) studentized residuals, also known as standardized residuals.
res.std <- rstandard(m2) #studentized residuals stored in vector res.std 
#plot Standardized residual in y axis. X axis will be the index or row names
plot(res.std, ylab="Standardized Residual", ylim=c(-3.5,3.5))
#add horizontal lines 3 and -3 to identify extreme values
abline(h =c(-3,0,3), lty = 2)
#find which data points fall outside the 3 standard deviation cut-off
#index holds the row numbers of those points
index <- which(res.std > 3 | res.std < -3)
#add the school number next to the points with extreme values
text(index-20, res.std[index] , labels = data$snum[index])

We can run a formal test for outliers using the function outlierTest(m2) from the package “car”.

#Bonferroni p-values for testing outliers
outlierTest(m2)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##      rstudent unadjusted p-value Bonferroni p
## 226 -3.151186            0.00175      0.70001
#a vector containing the diagonal of the 'hat' matrix
h <- influence(m2)$hat
#half-normal plot of leverage, from package faraway
halfnorm(h, ylab = "leverage")

Cook’s distance is a measure of influence. A point with a large Cook’s distance is considered a highly influential point.

#the cut-off value for Cook's distance
cutoff <- 4/((nrow(data)-length(m2$coefficients)-2))
plot(m2, which = 4, cook.levels = cutoff)

We can use the influencePlot() function in package “car” to identify influential points.

#cook's distance, studentized residuals, and leverage in the same plot
influencePlot(m2, main="Influence Plot", 
              sub="Circle size is proportional to Cook's Distance" )

##         StudRes        Hat        CookD
## 8    0.18718812 0.08016299 7.652779e-04
## 93   2.76307269 0.02940688 5.687488e-02
## 210  0.03127861 0.06083329 1.588292e-05
## 226 -3.15118603 0.01417076 3.489753e-02
## 346 -2.83932062 0.00412967 8.211170e-03

Also, the infIndexPlot() function gives us the series of index plots we need to investigate influential points.

#4 diagnostic plots to identify influential points
infIndexPlot(m2)

Checking Homoscedasticity

One of the main assumptions of ordinary least squares (OLS) regression is homogeneity of variance (homoscedasticity) of the residuals; heteroscedasticity is a violation of this assumption.

There are graphical and non-graphical methods for detecting heteroscedasticity. A commonly used graphical method is to plot the residuals versus fitted (predicted) values.

#residual vs. fitted value plot for Homoscedasticity
plot(m2$resid ~ m2$fitted.values)
#add a horizontal line at 0
abline(h = 0, lty = 2)
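
A commonly used non-graphical check is a score test for non-constant error variance; a hedged sketch using the ncvTest() function from package “car” (output not shown):

#score test for non-constant variance from package "car"
car::ncvTest(m2)
#a small p-value would indicate evidence of heteroscedasticity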

Checking Linearity

To check linearity, residuals should be plotted against the fitted values as well as against each predictor. If any of these plots shows a systematic shape, the linear model is not appropriate and nonlinear terms may need to be added. In package car, the function residualPlots() produces these plots.

#residual vs. fitted value and all predictors plus test for curvature
residualPlots(m2)

##            Test stat Pr(>|Test stat|)
## enroll        0.0111           0.9911
## meals        -0.6238           0.5331
## full          1.1565           0.2482
## Tukey test   -0.8411           0.4003

Issues of Independence

A simple visual check is to plot the residuals against a time variable. Here the data are not a time series, so we use the school number as an ordering variable for the data.

#plotting the trend of residuals and school number
plot(m2$resid ~ data$snum)   #residual plot vs. school id 
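
When the observations do have a natural ordering (for example, over time), a numerical check is the Durbin-Watson test for serially correlated errors from package “car”; shown here only as a sketch, since these data are cross-sectional:

#Durbin-Watson test for autocorrelated errors from package "car"
car::durbinWatsonTest(m2)
#a statistic near 2 suggests no first-order autocorrelation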

Checking Normality of Residuals

Normality Misconception:

  • Misconception: Many researchers believe that normality of residuals is necessary for valid regression analysis.
  • Clarification: Normality is only necessary for valid hypothesis testing, specifically for p-values associated with t-tests and F-tests.
  • Impact: Violation of normality primarily affects the accuracy of p-values, not the estimates of coefficients themselves.

OLS regression only requires:

  • Identically distributed residuals: Errors should have the same distribution across all observations.
  • Independently distributed residuals: Errors should not be correlated with each other.

Furthermore, there is no assumption or requirement that the predictor variables be normally distributed. If that were the case, then we would not be able to use dummy-coded variables in our models.

In addition, by large-sample theory, if the sample size is large enough the residuals do not even need to be normally distributed. For small sample sizes, however, the normality assumption is required for valid inference.

To check normality we use a normal Q-Q plot of the residuals.

qqnorm(m2$resid)  #Normal Quantile to Quantile plot
qqline(m2$resid) 
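
A numerical complement to the Q-Q plot is a normality test on the residuals, for example the Shapiro-Wilk test from base R (a sketch; as noted above, with 400 observations normality is not critical):

#Shapiro-Wilk test of normality of the residuals
shapiro.test(residuals(m2))
#a small p-value indicates a departure from normality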

Checking for Multicollinearity

  • Multicollinearity occurs when two or more predictor variables are highly correlated (near perfect linear relationships).

  • When multicollinearity exists, the estimates (regression coefficients) become unstable and their standard errors (uncertainty) get inflated. This makes it difficult to interpret the results and draw reliable conclusions.

  • VIF (Variance Inflation Factor): A numerical measure used to quantify the degree of multicollinearity. As a rule of thumb, a VIF value greater than 10 indicates potential problems.

  • Tolerance: Calculated as 1/VIF, a value below 0.1 suggests a problematic level of collinearity.

In R we can use the function vif() from package car or faraway. Since both packages in our environment include this function, we use car::vif() to tell R to run the version from package car:

car::vif(m2)  #variance inflation factor
##   enroll    meals     full 
## 1.135733 1.394279 1.482305
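
Tolerance, as defined above, can be computed directly from these VIF values:

1 / car::vif(m2)   #tolerance = 1/VIF
#from the VIFs above: about 0.88, 0.72, and 0.67, all well above the 0.1 threshold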

Model Specification

  • Model specification errors: These occur when relevant variables are omitted from the model or irrelevant ones are included.

  • Consequences:

    • Omitting relevant variables: Their influence gets wrongly attributed to other variables, inflating the error term and distorting coefficients.

    • Including irrelevant variables: They compete for explanation, leading to inaccurate coefficient estimates.

  • Identifying relevant variables: Various methods and statistical tests exist, including:

    • Added variable plot: Visualizes the impact of excluding a variable by plotting residuals of models with and without it.

    • Slope interpretation: If the slope between residuals is not close to zero, the excluded variable likely contributes to the model.

  • Added variable plots can also help identify influential data points.

avPlots(m2) #added variable plots from package "car"