Introduction to Regression in R

Simple Regression

[Regression analysis is a statistical tool used to model the relationship between a response (dependent, outcome) variable and one or more predictor (independent) variables.]

  • [In a linear regression model we assume this relationship can be explained with a linear function.]

  • [A simple regression model has one response and a single predictor.]

  • [The data in a simple regression model are observed in pairs. For example:] \(X\): the predictor; \(x_1,x_2,…,x_n\) are \(n\) observations from \(X\). \(Y\): the response; \(y_1,y_2,…,y_n\) are \(n\) observations from \(Y\). Each index represents one case or unit of observation in the data.

Linear regression as a statistical model

In real-life situations, we usually cannot explain \(Y\) as an exact function of \(X\). There is always some error or noise in the dependent variable that cannot be explained perfectly by its relationship with the independent variables. We call this the error term of the model. We write the formula for a simple linear model as \(Y=β_0+β_1X+ϵ\), where \(β_0\) is the intercept, \(β_1\) is the slope of the model, and \(ϵ\) is the error term.

  • [The regression model has two parts: a linear function and an error term.]
  • [\(β_0\) and \(β_1\) are model parameters, estimated from the data.]
  • [The linear function predicts the mean value of the outcome \((E(Y))\) for a given predictor variable.]
  • [The mean of \(Y\) conditional on \(X=x\) is written as \(E(Y|X=x)=β_0+β_1x\) in a simple linear regression.]
  • [The error term \((ϵ)\) represents unexplained uncertainty; it is a random variable with mean zero and unknown variance \((σ^2)\).]
  • [The error term is often assumed to have a normal distribution, though this assumption is not mandatory.]
  • [In a simple regression model, individual units \((i=1,2,…,n)\) have pairs of observations \((x_i,y_i)\).]

Decomposing the sum of squares and ANOVA

The total sum of squares SST decomposes into the sum of squares explained by the regression (SSreg) and the residual sum of squares (RSS), so that SST = SSreg + RSS. The ratio of SSreg to SST is called \(R^2\). \(R^2\) is the proportion of the total variation that can be explained by the model.

\(R^2=\frac{SSreg}{SST}=1-\frac{RSS}{SST}\)

In some texts RSS is also denoted SSE, the sum of squared errors.
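As a quick arithmetic check, \(R^2\) can be computed directly from the sums of squares; a minimal sketch in R, using the values that appear in the anova() output of the enrollment model later in this section:

# R-squared from the sum-of-squares decomposition
# (values taken from the anova() output shown later in this section)
ssreg <- 817326          # sum of squares explained by the regression
rss   <- 7256346         # residual sum of squares
sst   <- ssreg + rss     # total sum of squares
ssreg / sst              # 0.1012, matching the Multiple R-squared reported below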

Running the regression model in RStudio

Example:
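The output that follows is consistent with loading the data and inspecting it with the commands below; the file name elemapi2.csv is an assumption (the variables match the UCLA elementary-school API data), so adjust the path to your own copy.

# load the data set (hypothetical file name -- point this at your own copy)
data <- read.csv("elemapi2.csv")

class(data)    # class of the object
names(data)    # variable names
dim(data)      # number of rows and columns
str(data)      # type and first few values of each variable
summary(data)  # descriptive statistics for every variable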

[1] "data.frame"
 [1] "snum"     "dnum"     "api00"    "api99"    "growth"   "meals"   
 [7] "ell"      "yr_rnd"   "mobility" "acs_k3"   "acs_46"   "not_hsg" 
[13] "hsg"      "some_col" "col_grad" "grad_sch" "avg_ed"   "full"    
[19] "emer"     "enroll"   "mealcat"  "collcat"  "abv_hsg"  "lgenroll"
[1] 400  24
'data.frame':   400 obs. of  24 variables:
 $ snum    : int  906 889 887 876 888 4284 4271 2910 2899 2887 ...
 $ dnum    : int  41 41 41 41 41 98 98 108 108 108 ...
 $ api00   : int  693 570 546 571 478 858 918 831 860 737 ...
 $ api99   : int  600 501 472 487 425 844 864 791 838 703 ...
 $ growth  : int  93 69 74 84 53 14 54 40 22 34 ...
 $ meals   : int  67 92 97 90 89 10 5 2 5 29 ...
 $ ell     : int  9 21 29 27 30 3 2 3 6 15 ...
 $ yr_rnd  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mobility: int  11 33 36 27 44 10 16 44 10 17 ...
 $ acs_k3  : int  16 15 17 20 18 20 19 20 20 21 ...
 $ acs_46  : int  22 32 25 30 31 33 28 31 30 29 ...
 $ not_hsg : int  0 0 0 36 50 1 1 0 2 8 ...
 $ hsg     : int  0 0 0 45 50 8 4 4 9 25 ...
 $ some_col: int  0 0 0 9 0 24 18 16 15 34 ...
 $ col_grad: int  0 0 0 9 0 36 34 50 42 27 ...
 $ grad_sch: int  0 0 0 0 0 31 43 30 33 7 ...
 $ avg_ed  : num  NA NA NA 1.91 1.5 ...
 $ full    : int  76 79 68 87 87 100 100 96 100 96 ...
 $ emer    : int  24 19 29 11 13 0 0 2 0 7 ...
 $ enroll  : int  247 463 395 418 520 343 303 1513 660 362 ...
 $ mealcat : int  2 3 3 3 3 1 1 1 1 1 ...
 $ collcat : int  1 1 1 1 1 2 2 2 2 3 ...
 $ abv_hsg : int  100 100 100 64 50 99 99 100 98 92 ...
 $ lgenroll: num  2.39 2.67 2.6 2.62 2.72 ...
      snum           dnum           api00           api99      
 Min.   :  58   Min.   : 41.0   Min.   :369.0   Min.   :333.0  
 1st Qu.:1720   1st Qu.:395.0   1st Qu.:523.8   1st Qu.:484.8  
 Median :3008   Median :401.0   Median :643.0   Median :602.0  
 Mean   :2867   Mean   :457.7   Mean   :647.6   Mean   :610.2  
 3rd Qu.:4198   3rd Qu.:630.0   3rd Qu.:762.2   3rd Qu.:731.2  
 Max.   :6072   Max.   :796.0   Max.   :940.0   Max.   :917.0  
                                                               
     growth           meals             ell            yr_rnd    
 Min.   :-69.00   Min.   :  0.00   Min.   : 0.00   Min.   :0.00  
 1st Qu.: 19.00   1st Qu.: 31.00   1st Qu.: 9.75   1st Qu.:0.00  
 Median : 36.00   Median : 67.50   Median :25.00   Median :0.00  
 Mean   : 37.41   Mean   : 60.31   Mean   :31.45   Mean   :0.23  
 3rd Qu.: 53.25   3rd Qu.: 90.00   3rd Qu.:50.25   3rd Qu.:0.00  
 Max.   :134.00   Max.   :100.00   Max.   :91.00   Max.   :1.00  
                                                                 
    mobility         acs_k3          acs_46         not_hsg      
 Min.   : 2.00   Min.   :14.00   Min.   :20.00   Min.   :  0.00  
 1st Qu.:13.00   1st Qu.:18.00   1st Qu.:27.00   1st Qu.:  4.00  
 Median :17.00   Median :19.00   Median :29.00   Median : 14.00  
 Mean   :18.25   Mean   :19.16   Mean   :29.69   Mean   : 21.25  
 3rd Qu.:22.00   3rd Qu.:20.00   3rd Qu.:31.00   3rd Qu.: 34.00  
 Max.   :47.00   Max.   :25.00   Max.   :50.00   Max.   :100.00  
 NA's   :1       NA's   :2       NA's   :3                       
      hsg            some_col        col_grad        grad_sch     
 Min.   :  0.00   Min.   : 0.00   Min.   :  0.0   Min.   : 0.000  
 1st Qu.: 17.00   1st Qu.:12.00   1st Qu.:  7.0   1st Qu.: 1.000  
 Median : 26.00   Median :19.00   Median : 16.0   Median : 4.000  
 Mean   : 26.02   Mean   :19.71   Mean   : 19.7   Mean   : 8.637  
 3rd Qu.: 34.00   3rd Qu.:28.00   3rd Qu.: 30.0   3rd Qu.:10.000  
 Max.   :100.00   Max.   :67.00   Max.   :100.0   Max.   :67.000  
                                                                  
     avg_ed           full             emer           enroll      
 Min.   :1.000   Min.   : 37.00   Min.   : 0.00   Min.   : 130.0  
 1st Qu.:2.070   1st Qu.: 76.00   1st Qu.: 3.00   1st Qu.: 320.0  
 Median :2.600   Median : 88.00   Median :10.00   Median : 435.0  
 Mean   :2.668   Mean   : 84.55   Mean   :12.66   Mean   : 483.5  
 3rd Qu.:3.220   3rd Qu.: 97.00   3rd Qu.:19.00   3rd Qu.: 608.0  
 Max.   :4.620   Max.   :100.00   Max.   :59.00   Max.   :1570.0  
 NA's   :19                                                       
    mealcat         collcat        abv_hsg          lgenroll    
 Min.   :1.000   Min.   :1.00   Min.   :  0.00   Min.   :2.114  
 1st Qu.:1.000   1st Qu.:1.00   1st Qu.: 66.00   1st Qu.:2.505  
 Median :2.000   Median :2.00   Median : 86.00   Median :2.638  
 Mean   :2.015   Mean   :2.02   Mean   : 78.75   Mean   :2.640  
 3rd Qu.:3.000   3rd Qu.:3.00   3rd Qu.: 96.00   3rd Qu.:2.784  
 Max.   :3.000   Max.   :3.00   Max.   :100.00   Max.   :3.196  
                                                                
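To fit the simple regression of api00 on enroll we use lm(); a sketch of the presumed commands (the model name m1 is an assumption used throughout this section):

m1 <- lm(api00 ~ enroll, data = data)  # fit the simple linear regression
class(m1)                              # the fitted object has class "lm"
m1                                     # print the call and the coefficients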
[1] "lm"

Call:
lm(formula = api00 ~ enroll, data = data)

Coefficients:
(Intercept)       enroll  
   744.2514      -0.1999  

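summary() prints the full coefficient table together with residual statistics and measures of fit:

summary(m1)  # coefficient tests, residual standard error, R-squared, F-test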
Call:
lm(formula = api00 ~ enroll, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-285.50 -112.55   -6.70   95.06  389.15 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 744.25141   15.93308  46.711  < 2e-16 ***
enroll       -0.19987    0.02985  -6.695 7.34e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 135 on 398 degrees of freedom
Multiple R-squared:  0.1012,    Adjusted R-squared:  0.09898 
F-statistic: 44.83 on 1 and 398 DF,  p-value: 7.339e-11
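coef() extracts just the estimated coefficients from the fitted model:

coef(m1)  # intercept and slope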
(Intercept)      enroll 
744.2514142  -0.1998674 
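anova() prints the analysis-of-variance table for the fitted model:

anova(m1)  # decomposition of the total sum of squares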
Analysis of Variance Table

Response: api00
           Df  Sum Sq Mean Sq F value    Pr(>F)    
enroll      1  817326  817326  44.829 7.339e-11 ***
Residuals 398 7256346   18232                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
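The predicted values below are consistent with evaluating the fitted line at enrollments of 500, 600, and 700 (for example, 744.2514 − 0.1999 × 500 ≈ 644.32); a sketch:

new.data <- data.frame(enroll = c(500, 600, 700))  # hypothetical enrollments
predict(m1, newdata = new.data)                    # predicted api00 values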
       1        2        3 
644.3177 624.3309 604.3442 

Regression with categorical variable as predictor

  • [R uses the object class factor to store and manipulate categorical variables. The lm function automatically treats variables of type character as factors, but it is always safer to change the variable’s class to factor before the analysis.]

  • [The function factor() is used to encode a variable (a vector) as a factor.]
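The output below is consistent with creating factor versions of yr_rnd and mealcat and inspecting them; the variable names yr_rnd_F and mealcat_F are taken from the model output that follows:

data$yr_rnd_F  <- factor(data$yr_rnd)   # 0/1 indicator stored as a factor
data$mealcat_F <- factor(data$mealcat)  # three-category factor
str(data$mealcat_F)                     # factor with levels "1", "2", "3"
levels(data$yr_rnd_F)                   # levels "0" and "1"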

 Factor w/ 3 levels "1","2","3": 2 3 3 3 3 1 1 1 1 1 ...
[1] "0" "1"
  • [In R, when we include a factor as a predictor in the model, the lm function by default generates a dummy variable for each category of the factor except the reference (first) category. A dummy variable is a variable that takes the values 0 and 1: if an observation is in that category the dummy variable is 1, and otherwise it is 0.]
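A sketch of the model behind the next output, regressing api00 on the factor yr_rnd_F (the model name m2 is an assumption):

m2 <- lm(api00 ~ yr_rnd_F, data = data)  # one dummy variable for level "1"
summary(m2)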

Call:
lm(formula = api00 ~ yr_rnd_F, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-273.539  -95.662    0.967  103.341  297.967 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   684.54       7.14   95.88   <2e-16 ***
yr_rnd_F1    -160.51      14.89  -10.78   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 125.3 on 398 degrees of freedom
Multiple R-squared:  0.226, Adjusted R-squared:  0.2241 
F-statistic: 116.2 on 1 and 398 DF,  p-value: < 2.2e-16
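The same idea with the three-level factor mealcat_F, which enters the model as two dummy variables (a sketch; the name m3 is an assumption):

m3 <- lm(api00 ~ mealcat_F, data = data)  # dummies for levels "2" and "3"
summary(m3)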


Call:
lm(formula = api00 ~ mealcat_F, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-253.394  -47.883    0.282   52.282  185.620 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  805.718      6.169  130.60   <2e-16 ***
mealcat_F2  -166.324      8.708  -19.10   <2e-16 ***
mealcat_F3  -301.338      8.629  -34.92   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 70.61 on 397 degrees of freedom
Multiple R-squared:  0.7548,    Adjusted R-squared:  0.7536 
F-statistic: 611.1 on 2 and 397 DF,  p-value: < 2.2e-16
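The group means below can be reproduced with aggregate(); note that the intercept (805.718) equals the mean of the reference group, and each dummy coefficient is that group's difference from the reference:

aggregate(api00 ~ mealcat_F, data = data, FUN = mean)  # mean api00 per category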
  mealcat_F    api00
1         1 805.7176
2         2 639.3939
3         3 504.3796
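relevel() changes the reference category of a factor. Refitting with category 3 as the reference reproduces the next output (a sketch):

data$mealcat_F <- relevel(data$mealcat_F, ref = "3")  # make "3" the reference
m3 <- lm(api00 ~ mealcat_F, data = data)              # refit with new reference
summary(m3)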

Call:
lm(formula = api00 ~ mealcat_F, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-253.394  -47.883    0.282   52.282  185.620 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  504.380      6.033   83.61   <2e-16 ***
mealcat_F1   301.338      8.629   34.92   <2e-16 ***
mealcat_F2   135.014      8.612   15.68   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 70.61 on 397 degrees of freedom
Multiple R-squared:  0.7548,    Adjusted R-squared:  0.7536 
F-statistic: 611.1 on 2 and 397 DF,  p-value: < 2.2e-16
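The group means for yr_rnd_F, shown next, can be obtained the same way:

aggregate(api00 ~ yr_rnd_F, data = data, FUN = mean)  # mean api00 per group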

  yr_rnd_F    api00
1        0 684.5390
2        1 524.0326
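With a single binary predictor, the regression is equivalent to a two-sample t-test assuming equal variances; a sketch of the test behind the next output:

t.test(api00 ~ yr_rnd_F, data = data, var.equal = TRUE)  # pooled-variance t-test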

    Two Sample t-test

data:  api00 by yr_rnd_F
t = 10.782, df = 398, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 131.2390 189.7737
sample estimates:
mean in group 0 mean in group 1 
       684.5390        524.0326 
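The ANOVA table for the yr_rnd_F model tells the same story: the F statistic 116.24 is the square of the t statistic above (10.782² ≈ 116.25). A sketch, assuming the model was saved as m2:

anova(m2)  # one-way ANOVA for the year-round indicator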
Analysis of Variance Table

Response: api00
           Df  Sum Sq Mean Sq F value    Pr(>F)    
yr_rnd_F    1 1825001 1825001  116.24 < 2.2e-16 ***
Residuals 398 6248671   15700                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple Regression
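A sketch of the multiple regression behind the following output, with three continuous predictors (the model name m4 is an assumption):

m4 <- lm(api00 ~ enroll + meals + full, data = data)  # three predictors
summary(m4)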


Call:
lm(formula = api00 ~ enroll + meals + full, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-181.721  -40.802    1.129   39.983  158.774 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 801.82983   26.42660  30.342  < 2e-16 ***
enroll       -0.05146    0.01384  -3.719 0.000229 ***
meals        -3.65973    0.10880 -33.639  < 2e-16 ***
full          1.08109    0.23945   4.515 8.37e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 58.73 on 396 degrees of freedom
Multiple R-squared:  0.8308,    Adjusted R-squared:  0.8295 
F-statistic: 648.2 on 3 and 396 DF,  p-value: < 2.2e-16
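The sequential (Type I) ANOVA table for this model adds the predictors in the order they appear in the formula:

anova(m4)  # sums of squares for enroll, then meals, then full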
Analysis of Variance Table

Response: api00
           Df  Sum Sq Mean Sq  F value    Pr(>F)    
enroll      1  817326  817326  236.947 < 2.2e-16 ***
meals       1 5820066 5820066 1687.263 < 2.2e-16 ***
full        1   70313   70313   20.384 8.369e-06 ***
Residuals 396 1365967    3449                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
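Standardized (beta) coefficients can be obtained by fitting the model to centered and scaled variables, as in the next output (a sketch):

m4.std <- lm(scale(api00) ~ scale(enroll) + scale(meals) + scale(full),
             data = data)  # every variable standardized to mean 0, sd 1
summary(m4.std)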

Call:
lm(formula = scale(api00) ~ scale(enroll) + scale(meals) + scale(full), 
    data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.27749 -0.28683  0.00793  0.28108  1.11617 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    4.454e-16  2.064e-02   0.000 1.000000    
scale(enroll) -8.191e-02  2.203e-02  -3.719 0.000229 ***
scale(meals)  -8.210e-01  2.441e-02 -33.639  < 2e-16 ***
scale(full)    1.136e-01  2.517e-02   4.515 8.37e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4129 on 396 degrees of freedom
Multiple R-squared:  0.8308,    Adjusted R-squared:  0.8295 
F-statistic: 648.2 on 3 and 396 DF,  p-value: < 2.2e-16

Regression model with two-way interaction

Regression model with interaction between two continuous predictors
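A sketch of the model behind the next output, adding the product term enroll:meals to the two main effects (the name m5 is an assumption):

m5 <- lm(api00 ~ enroll + meals + enroll:meals, data = data)
summary(m5)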


Call:
lm(formula = api00 ~ enroll + meals + enroll:meals, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-211.186  -38.834   -1.137   38.997  163.713 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8.833e+02  1.665e+01  53.034   <2e-16 ***
enroll        8.835e-04  3.362e-02   0.026   0.9790    
meals        -3.425e+00  2.344e-01 -14.614   <2e-16 ***
enroll:meals -9.537e-04  4.292e-04  -2.222   0.0269 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 59.85 on 396 degrees of freedom
Multiple R-squared:  0.8243,    Adjusted R-squared:  0.823 
F-statistic: 619.3 on 3 and 396 DF,  p-value: < 2.2e-16

Regression model with interaction between two categorical predictors
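The * operator includes both main effects and their interaction; a sketch of the model behind the next output (the name m6 is an assumption):

m6 <- lm(api00 ~ yr_rnd_F * mealcat_F, data = data)  # main effects + interaction
summary(m6)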


Call:
lm(formula = api00 ~ yr_rnd_F * mealcat_F, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-207.533  -50.764   -1.843   48.874  179.000 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           521.493      8.414  61.978  < 2e-16 ***
yr_rnd_F1             -33.493     11.771  -2.845  0.00467 ** 
mealcat_F1            288.193     10.443  27.597  < 2e-16 ***
mealcat_F2            123.781     10.552  11.731  < 2e-16 ***
yr_rnd_F1:mealcat_F1  -40.764     29.231  -1.395  0.16394    
yr_rnd_F1:mealcat_F2  -18.248     22.256  -0.820  0.41278    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 68.87 on 394 degrees of freedom
Multiple R-squared:  0.7685,    Adjusted R-squared:  0.7656 
F-statistic: 261.6 on 5 and 394 DF,  p-value: < 2.2e-16
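The interaction can be tested by comparing the additive model with the interaction model using anova(); a sketch:

m6.main <- lm(api00 ~ yr_rnd_F + mealcat_F, data = data)  # additive model
anova(m6.main, m6)  # joint F test for the two interaction terms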
Analysis of Variance Table

Model 1: api00 ~ yr_rnd_F + mealcat_F
Model 2: api00 ~ yr_rnd_F * mealcat_F
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1    396 1879528                           
2    394 1868944  2     10584 1.1156 0.3288

Regression model with interaction between a continuous predictor and a categorical predictor
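A sketch of the model with an interaction between the categorical yr_rnd_F and the continuous enroll (the name m7 is an assumption):

m7 <- lm(api00 ~ yr_rnd_F * enroll, data = data)
summary(m7)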


Call:
lm(formula = api00 ~ yr_rnd_F * enroll, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-274.043  -94.781    0.417   97.666  309.573 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      682.068556  18.797436  36.285   <2e-16 ***
yr_rnd_F1        -74.858236  48.224281  -1.552   0.1214    
enroll             0.006021   0.042396   0.142   0.8871    
yr_rnd_F1:enroll  -0.120218   0.072075  -1.668   0.0961 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 125 on 396 degrees of freedom
Multiple R-squared:  0.2335,    Adjusted R-squared:  0.2277 
F-statistic: 40.21 on 3 and 396 DF,  p-value: < 2.2e-16

Plotting the interaction between a continuous predictor and a categorical predictor
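One way to visualize the interaction is to draw the fitted line for each group over the observed range of enroll; the output below shows that range. A minimal sketch, assuming the interaction model m7 from above:

range(data$enroll)  # observed enrollment range

enroll.grid <- seq(130, 1570, length.out = 50)
nd0 <- data.frame(enroll = enroll.grid,
                  yr_rnd_F = factor("0", levels = levels(data$yr_rnd_F)))
nd1 <- data.frame(enroll = enroll.grid,
                  yr_rnd_F = factor("1", levels = levels(data$yr_rnd_F)))
plot(data$enroll, data$api00, xlab = "enroll", ylab = "api00")
lines(enroll.grid, predict(m7, newdata = nd0), lty = 1)  # non-year-round schools
lines(enroll.grid, predict(m7, newdata = nd1), lty = 2)  # year-round schools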

[1]  130 1570

Regression Diagnostics

Regression diagnostics check how well the data meet the assumptions of OLS regression, such as:

  • [Homogeneity of variance (homoscedasticity): the error variance should be constant.]
  • [Linearity: the relationships between the predictors and the outcome variable should be linear.]
  • [Independence: the errors associated with one observation are not correlated with the errors of any other observation.]
  • [Normality: the errors should be normally distributed. Technically, normality is necessary only for the hypothesis tests to be valid.]
  • [Model specification: the model should be properly specified, including all relevant variables and excluding irrelevant ones.]

  • [Two further issues can arise during the analysis: influence, where individual observations exert undue influence on the coefficients, and collinearity, where predictors that are highly collinear (i.e., linearly related) cause problems in estimating the regression coefficients.]

  • [There are three ways in which an observation can be unusual:]

    • [Outliers: in linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity, a data entry error, or another problem.]

    • [Leverage: An observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an observation deviates from the mean of that variable. These leverage points can have an effect on the estimate of regression coefficients.]

    • [Influence: An observation is said to be influential if removing the observation substantially changes the estimate of coefficients. Influence can be thought of as the product of leverage and outlierness.]

  • [Studentized residuals: identify outliers based on their distance from the mean in standard-deviation units. In R, the rstandard() function computes (internally) studentized residuals, also called standardized residuals; rstudent() computes externally studentized residuals.]
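The Bonferroni outlier test below matches the output of outlierTest() from the car package; a sketch, assuming the diagnostics are run on the multiple-regression model m4:

res.std <- rstandard(m4)  # internally studentized (standardized) residuals
car::outlierTest(m4)      # Bonferroni test based on the largest |rstudent|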

No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
     rstudent unadjusted p-value Bonferroni p
226 -3.151186            0.00175      0.70001

Cook’s distance is a measure of influence. A point with a large Cook’s distance is considered a potentially influential point.
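The table below matches the output of influencePlot() from car, which flags points with large studentized residuals, hat values, or Cook’s distance (a sketch, again assuming model m4):

car::influencePlot(m4)  # bubble plot; returns StudRes, Hat, and CookD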

        StudRes        Hat        CookD
8    0.18718812 0.08016299 7.652779e-04
93   2.76307269 0.02940688 5.687488e-02
210  0.03127861 0.06083329 1.588292e-05
226 -3.15118603 0.01417076 3.489753e-02
346 -2.83932062 0.00412967 8.211170e-03

Checking Homoscedasticity
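A common check is the scale-location plot together with a score test for non-constant error variance; a sketch using base plot() and car::ncvTest(), assuming model m4:

plot(m4, which = 3)  # scale-location plot: residual spread vs. fitted values
car::ncvTest(m4)     # score test for non-constant variance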

Checking Linearity
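The curvature tests below match the output of residualPlots() from car, which plots residuals against each predictor and reports Tukey's test for added curvature (a sketch):

car::residualPlots(m4)  # residual-vs-predictor plots plus curvature tests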

           Test stat Pr(>|Test stat|)
enroll        0.0111           0.9911
meals        -0.6238           0.5331
full          1.1565           0.2482
Tukey test   -0.8411           0.4003

Issues of Independence
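When the observations have a natural time or sequence ordering, correlated errors can be checked with a Durbin-Watson test; a sketch using car::durbinWatsonTest():

car::durbinWatsonTest(m4)  # tests first-order autocorrelation of the residuals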

Checking Normality of Residuals

Many researchers believe that multiple regression requires normality. This is not the case. Normality of residuals is required only for valid hypothesis testing; that is, the normality assumption assures that the p-values for the t-tests and the F-test will be valid. Normality is not required in order to obtain unbiased estimates of the regression coefficients. OLS regression merely requires that the residuals (errors) be independently and identically distributed. Furthermore, there is no assumption or requirement that the predictor variables be normally distributed. If there were, we would not be able to use dummy-coded variables in our models.

Furthermore, by large-sample theory, if we have a large enough sample size we do not even need the residuals to be normally distributed.

For small sample sizes, however, the normality assumption is required.

To check normality we use a normal Q-Q plot of the residuals.
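A sketch of the Q-Q plot in base R (car::qqPlot() produces a version with a confidence envelope):

res <- residuals(m4)  # raw residuals from the fitted model
qqnorm(res)           # residual quantiles vs. theoretical normal quantiles
qqline(res)           # reference line; points should fall close to it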

Checking for Multicollinearity

In R we can use the function vif() from either the car or the faraway package. Since both packages in our environment include this function, we write car::vif() to tell R to run the version from the car package.
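A sketch, assuming the multiple-regression model m4:

car::vif(m4)  # variance inflation factors; values near 1 indicate little collinearity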

  enroll    meals     full 
1.135733 1.394279 1.482305 

Model Specification

Model specification errors occur when relevant variables are omitted from the model or irrelevant variables are included in it.

Added-variable plots are also useful for detecting influential points.
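A sketch of the added-variable plots for model m4:

car::avPlots(m4)  # one plot per predictor, adjusting for the other predictors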