Regression analysis is a statistical tool used to model a response (dependent, outcome) variable as a function of one or more predictor (independent) variables.
In a linear regression model we assume this relationship can be described by a linear function.
A simple regression model has one response and a single predictor.
The data in a simple regression model are observed in pairs. For example:
\(X\): the predictor, with \(n\) observations \(x_1, x_2, \ldots, x_n\).
\(Y\): the response, with \(n\) observations \(y_1, y_2, \ldots, y_n\).
Each index represents one case or unit of observation in the data.
In real-life situations we usually cannot explain \(Y\) as an exact function of \(X\): there is always some error or noise in the dependent variable that cannot be explained by the relationship between the independent and dependent variables. We call this the error term of the model. We write the simple linear model as: \(Y = \beta_0 + \beta_1 X + \epsilon\) where \(\beta_0\) is the intercept, \(\beta_1\) is the slope of the model, and \(\epsilon\) is the error term.
The ratio of \(SS_{reg}\) (the regression sum of squares) to \(SST\) (the total sum of squares) is called \(R^2\): the proportion of the total variation in the response that is explained by the model.
\(R^2 = \frac{SS_{reg}}{SST} = 1 - \frac{RSS}{SST}\)
where \(RSS\) is the residual sum of squares, also denoted \(SSE\) (sum of squared errors) in some texts.
Example:
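The example uses a school-level dataset; the variable names below match the elemapi2 data used in UCLA's regression seminars (an assumption, since the loading step is not shown here). Assuming the data have been read into a data frame named `data`, we first inspect the object's class, variable names, and dimensions:

```r
class(data)  # type of the object: a data.frame
names(data)  # variable names
dim(data)    # 400 rows (schools) by 24 columns (variables)
```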
[1] "data.frame"
[1] "snum" "dnum" "api00" "api99" "growth" "meals"
[7] "ell" "yr_rnd" "mobility" "acs_k3" "acs_46" "not_hsg"
[13] "hsg" "some_col" "col_grad" "grad_sch" "avg_ed" "full"
[19] "emer" "enroll" "mealcat" "collcat" "abv_hsg" "lgenroll"
[1] 400 24
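`str()` gives a compact overview of each variable's storage type and its first few values:

```r
str(data)
```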
'data.frame': 400 obs. of 24 variables:
$ snum : int 906 889 887 876 888 4284 4271 2910 2899 2887 ...
$ dnum : int 41 41 41 41 41 98 98 108 108 108 ...
$ api00 : int 693 570 546 571 478 858 918 831 860 737 ...
$ api99 : int 600 501 472 487 425 844 864 791 838 703 ...
$ growth : int 93 69 74 84 53 14 54 40 22 34 ...
$ meals : int 67 92 97 90 89 10 5 2 5 29 ...
$ ell : int 9 21 29 27 30 3 2 3 6 15 ...
$ yr_rnd : int 0 0 0 0 0 0 0 0 0 0 ...
$ mobility: int 11 33 36 27 44 10 16 44 10 17 ...
$ acs_k3 : int 16 15 17 20 18 20 19 20 20 21 ...
$ acs_46 : int 22 32 25 30 31 33 28 31 30 29 ...
$ not_hsg : int 0 0 0 36 50 1 1 0 2 8 ...
$ hsg : int 0 0 0 45 50 8 4 4 9 25 ...
$ some_col: int 0 0 0 9 0 24 18 16 15 34 ...
$ col_grad: int 0 0 0 9 0 36 34 50 42 27 ...
$ grad_sch: int 0 0 0 0 0 31 43 30 33 7 ...
$ avg_ed : num NA NA NA 1.91 1.5 ...
$ full : int 76 79 68 87 87 100 100 96 100 96 ...
$ emer : int 24 19 29 11 13 0 0 2 0 7 ...
$ enroll : int 247 463 395 418 520 343 303 1513 660 362 ...
$ mealcat : int 2 3 3 3 3 1 1 1 1 1 ...
$ collcat : int 1 1 1 1 1 2 2 2 2 3 ...
$ abv_hsg : int 100 100 100 64 50 99 99 100 98 92 ...
$ lgenroll: num 2.39 2.67 2.6 2.62 2.72 ...
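`summary()` reports the minimum, quartiles, mean, and maximum of each numeric variable, along with counts of missing values (NA's):

```r
summary(data)
```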
snum dnum api00 api99
Min. : 58 Min. : 41.0 Min. :369.0 Min. :333.0
1st Qu.:1720 1st Qu.:395.0 1st Qu.:523.8 1st Qu.:484.8
Median :3008 Median :401.0 Median :643.0 Median :602.0
Mean :2867 Mean :457.7 Mean :647.6 Mean :610.2
3rd Qu.:4198 3rd Qu.:630.0 3rd Qu.:762.2 3rd Qu.:731.2
Max. :6072 Max. :796.0 Max. :940.0 Max. :917.0
growth meals ell yr_rnd
Min. :-69.00 Min. : 0.00 Min. : 0.00 Min. :0.00
1st Qu.: 19.00 1st Qu.: 31.00 1st Qu.: 9.75 1st Qu.:0.00
Median : 36.00 Median : 67.50 Median :25.00 Median :0.00
Mean : 37.41 Mean : 60.31 Mean :31.45 Mean :0.23
3rd Qu.: 53.25 3rd Qu.: 90.00 3rd Qu.:50.25 3rd Qu.:0.00
Max. :134.00 Max. :100.00 Max. :91.00 Max. :1.00
mobility acs_k3 acs_46 not_hsg
Min. : 2.00 Min. :14.00 Min. :20.00 Min. : 0.00
1st Qu.:13.00 1st Qu.:18.00 1st Qu.:27.00 1st Qu.: 4.00
Median :17.00 Median :19.00 Median :29.00 Median : 14.00
Mean :18.25 Mean :19.16 Mean :29.69 Mean : 21.25
3rd Qu.:22.00 3rd Qu.:20.00 3rd Qu.:31.00 3rd Qu.: 34.00
Max. :47.00 Max. :25.00 Max. :50.00 Max. :100.00
NA's :1 NA's :2 NA's :3
hsg some_col col_grad grad_sch
Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.000
1st Qu.: 17.00 1st Qu.:12.00 1st Qu.: 7.0 1st Qu.: 1.000
Median : 26.00 Median :19.00 Median : 16.0 Median : 4.000
Mean : 26.02 Mean :19.71 Mean : 19.7 Mean : 8.637
3rd Qu.: 34.00 3rd Qu.:28.00 3rd Qu.: 30.0 3rd Qu.:10.000
Max. :100.00 Max. :67.00 Max. :100.0 Max. :67.000
avg_ed full emer enroll
Min. :1.000 Min. : 37.00 Min. : 0.00 Min. : 130.0
1st Qu.:2.070 1st Qu.: 76.00 1st Qu.: 3.00 1st Qu.: 320.0
Median :2.600 Median : 88.00 Median :10.00 Median : 435.0
Mean :2.668 Mean : 84.55 Mean :12.66 Mean : 483.5
3rd Qu.:3.220 3rd Qu.: 97.00 3rd Qu.:19.00 3rd Qu.: 608.0
Max. :4.620 Max. :100.00 Max. :59.00 Max. :1570.0
NA's :19
mealcat collcat abv_hsg lgenroll
Min. :1.000 Min. :1.00 Min. : 0.00 Min. :2.114
1st Qu.:1.000 1st Qu.:1.00 1st Qu.: 66.00 1st Qu.:2.505
Median :2.000 Median :2.00 Median : 86.00 Median :2.638
Mean :2.015 Mean :2.02 Mean : 78.75 Mean :2.640
3rd Qu.:3.000 3rd Qu.:3.00 3rd Qu.: 96.00 3rd Qu.:2.784
Max. :3.000 Max. :3.00 Max. :100.00 Max. :3.196
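We fit the simple regression of `api00` (the school's academic performance index in 2000) on `enroll` (number of students) with `lm()`; the object name `m1` is our choice. The fitted model has class `"lm"`, and printing it shows the call and the coefficient estimates:

```r
m1 <- lm(api00 ~ enroll, data = data)
class(m1)
m1
```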
[1] "lm"
Call:
lm(formula = api00 ~ enroll, data = data)
Coefficients:
(Intercept) enroll
744.2514 -0.1999
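`summary()` of the fitted model adds the residual distribution, standard errors, t tests, \(R^2\), and the overall F test:

```r
summary(m1)
```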
Call:
lm(formula = api00 ~ enroll, data = data)
Residuals:
Min 1Q Median 3Q Max
-285.50 -112.55 -6.70 95.06 389.15
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 744.25141 15.93308 46.711 < 2e-16 ***
enroll -0.19987 0.02985 -6.695 7.34e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 135 on 398 degrees of freedom
Multiple R-squared: 0.1012, Adjusted R-squared: 0.09898
F-statistic: 44.83 on 1 and 398 DF, p-value: 7.339e-11
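The coefficient vector alone can be extracted with `coef()`:

```r
coef(m1)
```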
(Intercept) enroll
744.2514142 -0.1998674
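`anova()` gives the analysis-of-variance decomposition of the model. Note that its sums of squares reproduce \(R^2\): \(817326/(817326+7256346) \approx 0.1012\).

```r
anova(m1)
```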
Analysis of Variance Table
Response: api00
Df Sum Sq Mean Sq F value Pr(>F)
enroll 1 817326 817326 44.829 7.339e-11 ***
Residuals 398 7256346 18232
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
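`predict()` applies the fitted equation to new predictor values; enrollments of 500, 600, and 700 students reproduce the predictions shown below (e.g., \(744.2514 - 0.1999 \times 500 \approx 644.32\)):

```r
predict(m1, newdata = data.frame(enroll = c(500, 600, 700)))
```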
1 2 3
644.3177 624.3309 604.3442
R uses the object class factor to store and manipulate categorical variables. The lm() function automatically treats character variables as factors, but it is always safer to convert a variable's class to factor before the analysis.
The function factor() is used to encode a variable (a vector) as a factor.
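Here we convert `yr_rnd` (year-round school indicator) and `mealcat` (free-meals percentage in three categories) to factors; the `_F` suffix in the new names is our convention, matching the model output below:

```r
data$yr_rnd_F  <- factor(data$yr_rnd)
data$mealcat_F <- factor(data$mealcat)
str(data$mealcat_F)    # factor with 3 levels "1", "2", "3"
levels(data$yr_rnd_F)  # levels "0" and "1"
```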
Factor w/ 3 levels "1","2","3": 2 3 3 3 3 1 1 1 1 1 ...
[1] "0" "1"
Call:
lm(formula = api00 ~ yr_rnd_F, data = data)
Residuals:
Min 1Q Median 3Q Max
-273.539 -95.662 0.967 103.341 297.967
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 684.54 7.14 95.88 <2e-16 ***
yr_rnd_F1 -160.51 14.89 -10.78 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 125.3 on 398 degrees of freedom
Multiple R-squared: 0.226, Adjusted R-squared: 0.2241
F-statistic: 116.2 on 1 and 398 DF, p-value: < 2.2e-16
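With the three-level factor `mealcat_F`, R creates two dummy variables against the reference level (level "1" by default):

```r
m3 <- lm(api00 ~ mealcat_F, data = data)
summary(m3)
```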
Call:
lm(formula = api00 ~ mealcat_F, data = data)
Residuals:
Min 1Q Median 3Q Max
-253.394 -47.883 0.282 52.282 185.620
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 805.718 6.169 130.60 <2e-16 ***
mealcat_F2 -166.324 8.708 -19.10 <2e-16 ***
mealcat_F3 -301.338 8.629 -34.92 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 70.61 on 397 degrees of freedom
Multiple R-squared: 0.7548, Adjusted R-squared: 0.7536
F-statistic: 611.1 on 2 and 397 DF, p-value: < 2.2e-16
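The intercept (805.72) is the mean of `api00` in the reference category, and each dummy coefficient is that group's difference from the reference mean; the group means confirm this:

```r
aggregate(api00 ~ mealcat_F, FUN = mean, data = data)
```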
mealcat_F api00
1 1 805.7176
2 2 639.3939
3 3 504.3796
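`relevel()` changes the reference category; refitting with group 3 as the baseline expresses the same fit from a different reference point:

```r
data$mealcat_F <- relevel(data$mealcat_F, ref = "3")
summary(lm(api00 ~ mealcat_F, data = data))
```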
Call:
lm(formula = api00 ~ mealcat_F, data = data)
Residuals:
Min 1Q Median 3Q Max
-253.394 -47.883 0.282 52.282 185.620
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 504.380 6.033 83.61 <2e-16 ***
mealcat_F1 301.338 8.629 34.92 <2e-16 ***
mealcat_F2 135.014 8.612 15.68 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 70.61 on 397 degrees of freedom
Multiple R-squared: 0.7548, Adjusted R-squared: 0.7536
F-statistic: 611.1 on 2 and 397 DF, p-value: < 2.2e-16
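The same correspondence holds for `yr_rnd_F`: the intercept (684.54) is the mean for non-year-round schools, and adding the slope (-160.51) gives the year-round mean:

```r
aggregate(api00 ~ yr_rnd_F, FUN = mean, data = data)
```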
yr_rnd_F api00
1 0 684.5390
2 1 524.0326
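A simple regression on a binary factor is equivalent to a two-sample t test with equal variances; the t statistic (10.78) matches the coefficient's t value:

```r
t.test(api00 ~ yr_rnd_F, data = data, var.equal = TRUE)
```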
Two Sample t-test
data: api00 by yr_rnd_F
t = 10.782, df = 398, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
131.2390 189.7737
sample estimates:
mean in group 0 mean in group 1
684.5390 524.0326
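Equivalently, the one-way ANOVA F statistic (116.24) is that t statistic squared:

```r
anova(m2)
```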
Analysis of Variance Table
Response: api00
Df Sum Sq Mean Sq F value Pr(>F)
yr_rnd_F 1 1825001 1825001 116.24 < 2.2e-16 ***
Residuals 398 6248671 15700
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
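Adding `meals` (percent receiving free meals, a poverty proxy) and `full` (percent fully credentialed teachers) gives a multiple regression; we call the model `m4`:

```r
m4 <- lm(api00 ~ enroll + meals + full, data = data)
summary(m4)
```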
Call:
lm(formula = api00 ~ enroll + meals + full, data = data)
Residuals:
Min 1Q Median 3Q Max
-181.721 -40.802 1.129 39.983 158.774
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 801.82983 26.42660 30.342 < 2e-16 ***
enroll -0.05146 0.01384 -3.719 0.000229 ***
meals -3.65973 0.10880 -33.639 < 2e-16 ***
full 1.08109 0.23945 4.515 8.37e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 58.73 on 396 degrees of freedom
Multiple R-squared: 0.8308, Adjusted R-squared: 0.8295
F-statistic: 648.2 on 3 and 396 DF, p-value: < 2.2e-16
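The sequential (Type I) ANOVA table decomposes the model sum of squares term by term, in the order the predictors enter the formula:

```r
anova(m4)
```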
Analysis of Variance Table
Response: api00
Df Sum Sq Mean Sq F value Pr(>F)
enroll 1 817326 817326 236.947 < 2.2e-16 ***
meals 1 5820066 5820066 1687.263 < 2.2e-16 ***
full 1 70313 70313 20.384 8.369e-06 ***
Residuals 396 1365967 3449
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
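Standardized (beta) coefficients come from refitting the model with every variable scaled to mean 0 and standard deviation 1; `scale()` does this inline in the formula:

```r
summary(lm(scale(api00) ~ scale(enroll) + scale(meals) + scale(full), data = data))
```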
Call:
lm(formula = scale(api00) ~ scale(enroll) + scale(meals) + scale(full),
data = data)
Residuals:
Min 1Q Median 3Q Max
-1.27749 -0.28683 0.00793 0.28108 1.11617
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.454e-16 2.064e-02 0.000 1.000000
scale(enroll) -8.191e-02 2.203e-02 -3.719 0.000229 ***
scale(meals) -8.210e-01 2.441e-02 -33.639 < 2e-16 ***
scale(full) 1.136e-01 2.517e-02 4.515 8.37e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4129 on 396 degrees of freedom
Multiple R-squared: 0.8308, Adjusted R-squared: 0.8295
F-statistic: 648.2 on 3 and 396 DF, p-value: < 2.2e-16
Regression model with interaction between two continuous predictors
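In a formula, `enroll:meals` adds the product of the two predictors (writing `enroll * meals` would expand to the same model):

```r
m5 <- lm(api00 ~ enroll + meals + enroll:meals, data = data)
summary(m5)
```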
Call:
lm(formula = api00 ~ enroll + meals + enroll:meals, data = data)
Residuals:
Min 1Q Median 3Q Max
-211.186 -38.834 -1.137 38.997 163.713
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.833e+02 1.665e+01 53.034 <2e-16 ***
enroll 8.835e-04 3.362e-02 0.026 0.9790
meals -3.425e+00 2.344e-01 -14.614 <2e-16 ***
enroll:meals -9.537e-04 4.292e-04 -2.222 0.0269 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 59.85 on 396 degrees of freedom
Multiple R-squared: 0.8243, Adjusted R-squared: 0.823
F-statistic: 619.3 on 3 and 396 DF, p-value: < 2.2e-16
Regression model with interaction between two categorical predictors
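A `*` between two factors includes both main effects and all of their interaction dummies:

```r
m6 <- lm(api00 ~ yr_rnd_F * mealcat_F, data = data)
summary(m6)
```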
Call:
lm(formula = api00 ~ yr_rnd_F * mealcat_F, data = data)
Residuals:
Min 1Q Median 3Q Max
-207.533 -50.764 -1.843 48.874 179.000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 521.493 8.414 61.978 < 2e-16 ***
yr_rnd_F1 -33.493 11.771 -2.845 0.00467 **
mealcat_F1 288.193 10.443 27.597 < 2e-16 ***
mealcat_F2 123.781 10.552 11.731 < 2e-16 ***
yr_rnd_F1:mealcat_F1 -40.764 29.231 -1.395 0.16394
yr_rnd_F1:mealcat_F2 -18.248 22.256 -0.820 0.41278
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 68.87 on 394 degrees of freedom
Multiple R-squared: 0.7685, Adjusted R-squared: 0.7656
F-statistic: 261.6 on 5 and 394 DF, p-value: < 2.2e-16
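To test whether the interaction terms improve the fit, compare the additive and interaction models with `anova()`; the F test below (p = 0.33) shows no significant improvement:

```r
anova(lm(api00 ~ yr_rnd_F + mealcat_F, data = data), m6)
```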
Analysis of Variance Table
Model 1: api00 ~ yr_rnd_F + mealcat_F
Model 2: api00 ~ yr_rnd_F * mealcat_F
Res.Df RSS Df Sum of Sq F Pr(>F)
1 396 1879528
2 394 1868944 2 10584 1.1156 0.3288
Regression model with interaction between a continuous predictor and a categorical predictor
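Crossing a factor with a continuous predictor gives each group its own intercept and slope; `yr_rnd_F1:enroll` is the difference in the `enroll` slope for year-round schools:

```r
m7 <- lm(api00 ~ yr_rnd_F * enroll, data = data)
summary(m7)
```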
Call:
lm(formula = api00 ~ yr_rnd_F * enroll, data = data)
Residuals:
Min 1Q Median 3Q Max
-274.043 -94.781 0.417 97.666 309.573
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 682.068556 18.797436 36.285 <2e-16 ***
yr_rnd_F1 -74.858236 48.224281 -1.552 0.1214
enroll 0.006021 0.042396 0.142 0.8871
yr_rnd_F1:enroll -0.120218 0.072075 -1.668 0.0961 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 125 on 396 degrees of freedom
Multiple R-squared: 0.2335, Adjusted R-squared: 0.2277
F-statistic: 40.21 on 3 and 396 DF, p-value: < 2.2e-16
Plotting the interaction between a continuous predictor and a categorical predictor
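We first find the observed span of `enroll` to set the plotting range:

```r
range(data$enroll)
```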
[1] 130 1570
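One way to draw the two fitted lines over the data (a base-graphics sketch; the grid and color choices are ours):

```r
enroll_grid <- seq(130, 1570, length.out = 100)
lvls <- levels(data$yr_rnd_F)
pred0 <- predict(m7, newdata = data.frame(enroll = enroll_grid,
                                          yr_rnd_F = factor("0", levels = lvls)))
pred1 <- predict(m7, newdata = data.frame(enroll = enroll_grid,
                                          yr_rnd_F = factor("1", levels = lvls)))
plot(data$enroll, data$api00, xlab = "enroll", ylab = "api00")
lines(enroll_grid, pred0, col = "blue")  # non-year-round schools
lines(enroll_grid, pred1, col = "red")   # year-round schools
legend("topright", legend = c("yr_rnd = 0", "yr_rnd = 1"),
       col = c("blue", "red"), lty = 1)
```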
Regression diagnostics check how well the data meet the assumptions of OLS regression, such as:
* Homogeneity of variance (homoscedasticity): the error variance should be constant.
* Linearity: the relationships between the predictors and the outcome variable should be linear.
* Independence: the errors associated with one observation are not correlated with the errors of any other observation.
* Normality: the errors should be normally distributed. Technically, normality is necessary only for the hypothesis tests to be valid.
* Model specification: the model should be properly specified (including all relevant variables and excluding irrelevant ones).
Two further issues can arise during the analysis: influence, where individual observations exert undue influence on the estimated coefficients, and collinearity, where predictors that are highly collinear (i.e., nearly linearly related) cause problems in estimating the regression coefficients.
There are three ways that an observation can be unusual:
- Outliers: in linear regression, an outlier is an observation with a large residual; that is, an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity, a data entry error, or another problem.
- Leverage: an observation with an extreme value on a predictor variable is called a point with high leverage. Leverage measures how far an observation deviates from the mean of that variable. High-leverage points can affect the estimates of the regression coefficients.
- Influence: an observation is said to be influential if removing it substantially changes the coefficient estimates. Influence can be thought of as the product of leverage and outlierness.
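The `outlierTest()` function in the car package tests the largest studentized residual with a Bonferroni correction; applied here to the multiple regression model `m4`:

```r
library(car)
outlierTest(m4)
```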
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
rstudent unadjusted p-value Bonferroni p
226 -3.151186 0.00175 0.70001
Cook's distance is a measure of influence; a point with a large Cook's distance is considered a highly influential point.
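car's `influencePlot()` plots studentized residuals against hat (leverage) values, sizing each point by Cook's distance, and returns the most noteworthy observations (the exact columns returned may vary with the car version):

```r
influencePlot(m4)
```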
StudRes Hat CookD
8 0.18718812 0.08016299 7.652779e-04
93 2.76307269 0.02940688 5.687488e-02
210 0.03127861 0.06083329 1.588292e-05
226 -3.15118603 0.01417076 3.489753e-02
346 -2.83932062 0.00412967 8.211170e-03
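car's `residualPlots()` draws residuals against each predictor and against the fitted values, and prints curvature tests (a squared-term t test per predictor plus Tukey's test of nonadditivity); the non-significant p-values below support linearity:

```r
residualPlots(m4)
```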
Test stat Pr(>|Test stat|)
enroll 0.0111 0.9911
meals -0.6238 0.5331
full 1.1565 0.2482
Tukey test -0.8411 0.4003
Many researchers believe that multiple regression requires normality. This is not the case: normality of the residuals is required only for valid hypothesis testing; that is, the normality assumption ensures that the p-values for the t tests and the F test are valid. Normality is not required to obtain unbiased estimates of the regression coefficients. OLS regression merely requires that the residuals (errors) be independently and identically distributed. Furthermore, there is no assumption or requirement that the predictor variables be normally distributed; if there were, we could not use dummy-coded variables in our models.
Moreover, by large-sample theory, with a large enough sample the residuals do not even need to be normally distributed; for small samples, however, the normality assumption is required.
To assess normality we use a normal Q-Q plot of the residuals.
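A base-R sketch of the Q-Q plot; points falling near the reference line support normality of the residuals:

```r
res <- residuals(m4)
qqnorm(res)  # ordered residuals vs. theoretical normal quantiles
qqline(res)  # reference line through the quartiles
```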
To quantify collinearity we can use the function vif() from the package car or faraway. Since both packages in our environment include this function, we write `car::vif()` to tell R to run the version from the car package.
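Applied to the multiple regression model:

```r
car::vif(m4)  # values near 1 indicate little collinearity
```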
enroll meals full
1.135733 1.394279 1.482305
Model specification errors occur when relevant variables are omitted from the model or irrelevant variables are included.
Added-variable plots are also useful for detecting influential points.
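In car, `avPlots()` draws one added-variable (partial regression) plot per predictor in the model:

```r
car::avPlots(m4)
```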