Remember that your data files and script of the R Markdown should be in the same working directory. Even if you change the working directory in a code chunk of R Markdown, the working directory will be reset to the original working directory as the code chunk gets executed. By defalut the working directory of a R Markdown script is the directory where you have just saved the script.
setwd("~/Linear Models")
getwd()
## [1] "C:/Users/Ankit/Documents/Linear Models"
COL <- read.csv2("./COL.csv")
p<-4
n<-dim(COL)[1]
library(car)
## Warning: package 'car' was built under R version 3.4.2
Suppose, one wants to perform a linear regression to model the cholesterol level as a function of weight (P) ,height (H) and age (E). So, We will create here the linear model where we are trying to predict the ‘Colestrol level (C)’ based on the variables Weight(P), Age (E) and Height(H) using the data set COL. The code written below accomplishes this task.
mod<-lm(C~P+E+H,COL)
We can now see the summary of the model. The summary can be interpreted as follows-
summary(mod)
##
## Call:
## lm(formula = C ~ P + E + H, data = COL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -74.608 -22.137 1.888 21.156 65.410
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 490.9978 35.0517 14.008 < 2e-16 ***
## P 10.3773 0.7365 14.090 < 2e-16 ***
## E -13.0195 3.8530 -3.379 0.00105 **
## H -5.0989 0.7227 -7.055 2.68e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.11 on 96 degrees of freedom
## Multiple R-squared: 0.8101, Adjusted R-squared: 0.8041
## F-statistic: 136.5 on 3 and 96 DF, p-value: < 2.2e-16
confint(mod,level=0.99)
## 0.5 % 99.5 %
## (Intercept) 398.881272 583.114304
## P 8.441792 12.312821
## E -23.145228 -2.893732
## H -6.998311 -3.199551
anova(mod)
## Analysis of Variance Table
##
## Response: C
## Df Sum Sq Mean Sq F value Pr(>F)
## P 1 62396 62396 68.826 6.686e-13 ***
## E 1 263670 263670 290.841 < 2.2e-16 ***
## H 1 45123 45123 49.773 2.676e-10 ***
## Residuals 96 87031 907
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(C~H+P+E,COL))
## Analysis of Variance Table
##
## Response: C
## Df Sum Sq Mean Sq F value Pr(>F)
## H 1 171564 171564 189.244 < 2.2e-16 ***
## P 1 189273 189273 208.778 < 2.2e-16 ***
## E 1 10351 10351 11.418 0.001052 **
## Residuals 96 87031 907
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(mod,ty=3)
## Anova Table (Type III tests)
##
## Response: C
## Sum Sq Df F value Pr(>F)
## (Intercept) 177887 1 196.219 < 2.2e-16 ***
## P 179985 1 198.533 < 2.2e-16 ***
## E 10351 1 11.418 0.001052 **
## H 45123 1 49.773 2.676e-10 ***
## Residuals 87031 96
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(predict(mod),resid(mod))
abline(h=0,lty=2)
# Diagnostico: OUTLIERS (rstudent)
plot(rstandard(mod))
abline(h=c(-2,0,2),lty=2)
plot(rstudent(mod),main="rstudent")
abline(h=c(-2,0,2),lty=2)
# Diagnostico: LEVERAGE
plot(hatvalues(mod))
abline(h=c(0,2*mean(hatvalues(mod))),lty=2)
# Diagnosis: INFLUENCE (dffits)
plot(cooks.distance(mod))
abline(h=c(0,4/n),lty=2)
plot(dffits(mod),main="dffits")
abline(h=c(-2*sqrt(p/n),0,2*sqrt(p/n)),lty=2)
oldpar <- par( mfrow=c(2,2))
plot(mod,ask=F)
par(oldpar)
We can check for the condition of Multicollinearity using the VIF function which corresponds to the Variance Inflation factor
vif(mod)
## P E H
## 9.489406 20.904776 31.695499
summary(mod)
##
## Call:
## lm(formula = C ~ P + E + H, data = COL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -74.608 -22.137 1.888 21.156 65.410
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 490.9978 35.0517 14.008 < 2e-16 ***
## P 10.3773 0.7365 14.090 < 2e-16 ***
## E -13.0195 3.8530 -3.379 0.00105 **
## H -5.0989 0.7227 -7.055 2.68e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.11 on 96 degrees of freedom
## Multiple R-squared: 0.8101, Adjusted R-squared: 0.8041
## F-statistic: 136.5 on 3 and 96 DF, p-value: < 2.2e-16
confint(mod,level=0.99)
## 0.5 % 99.5 %
## (Intercept) 398.881272 583.114304
## P 8.441792 12.312821
## E -23.145228 -2.893732
## H -6.998311 -3.199551
(C0<-data.frame(cbind(P=c(65,75,65),E=c(15,15,12),H=c(150,150,150)), row.names=1:3))
## P E H
## 1 65 15 150
## 2 75 15 150
## 3 65 12 150
predict(mod, C0, interval="confidence", level=.95, se.fit=T)
## $fit
## fit lwr upr
## 1 205.3908 199.1668 211.6148
## 2 309.1639 294.6188 323.7089
## 3 244.4492 219.8210 269.0774
##
## $se.fit
## 1 2 3
## 3.135539 7.327533 12.407261
##
## $df
## [1] 96
##
## $residual.scale
## [1] 30.1094
predict(mod, C0, interval="prediction", level=.95, se.fit=F)
## fit lwr upr
## 1 205.3908 145.3009 265.4807
## 2 309.1639 247.6528 370.6749
## 3 244.4492 179.8071 309.0914
anova(mod)
## Analysis of Variance Table
##
## Response: C
## Df Sum Sq Mean Sq F value Pr(>F)
## P 1 62396 62396 68.826 6.686e-13 ***
## E 1 263670 263670 290.841 < 2.2e-16 ***
## H 1 45123 45123 49.773 2.676e-10 ***
## Residuals 96 87031 907
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(mod,ty=3)
## Anova Table (Type III tests)
##
## Response: C
## Sum Sq Df F value Pr(>F)
## (Intercept) 177887 1 196.219 < 2.2e-16 ***
## P 179985 1 198.533 < 2.2e-16 ***
## E 10351 1 11.418 0.001052 **
## H 45123 1 49.773 2.676e-10 ***
## Residuals 87031 96
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm(C~I(P-mean(P))+I(E-mean(E))+I(H-mean(H)),COL))
##
## Call:
## lm(formula = C ~ I(P - mean(P)) + I(E - mean(E)) + I(H - mean(H)),
## data = COL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -74.608 -22.137 1.888 21.156 65.410
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 218.1550 3.0109 72.454 < 2e-16 ***
## I(P - mean(P)) 10.3773 0.7365 14.090 < 2e-16 ***
## I(E - mean(E)) -13.0195 3.8530 -3.379 0.00105 **
## I(H - mean(H)) -5.0989 0.7227 -7.055 2.68e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.11 on 96 degrees of freedom
## Multiple R-squared: 0.8101, Adjusted R-squared: 0.8041
## F-statistic: 136.5 on 3 and 96 DF, p-value: < 2.2e-16
summary(lm(C~I(P-65)+I(E-15)+I(H-150),COL))
##
## Call:
## lm(formula = C ~ I(P - 65) + I(E - 15) + I(H - 150), data = COL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -74.608 -22.137 1.888 21.156 65.410
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 205.3908 3.1355 65.504 < 2e-16 ***
## I(P - 65) 10.3773 0.7365 14.090 < 2e-16 ***
## I(E - 15) -13.0195 3.8530 -3.379 0.00105 **
## I(H - 150) -5.0989 0.7227 -7.055 2.68e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.11 on 96 degrees of freedom
## Multiple R-squared: 0.8101, Adjusted R-squared: 0.8041
## F-statistic: 136.5 on 3 and 96 DF, p-value: < 2.2e-16
summary(mod2<-lm(C~I(P-0.5*H+10)+E+H,COL))
##
## Call:
## lm(formula = C ~ I(P - 0.5 * H + 10) + E + H, data = COL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -74.608 -22.137 1.888 21.156 65.410
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 387.22473 33.69605 11.492 < 2e-16 ***
## I(P - 0.5 * H + 10) 10.37731 0.73649 14.090 < 2e-16 ***
## E -13.01948 3.85300 -3.379 0.00105 **
## H 0.08972 0.58736 0.153 0.87891
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.11 on 96 degrees of freedom
## Multiple R-squared: 0.8101, Adjusted R-squared: 0.8041
## F-statistic: 136.5 on 3 and 96 DF, p-value: < 2.2e-16
vif(mod2)
## I(P - 0.5 * H + 10) E H
## 1.009937 20.904776 20.933520
summary(mod3<-lm(C~I(P-0.5*H+10)+E,COL))
##
## Call:
## lm(formula = C ~ I(P - 0.5 * H + 10) + E, data = COL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -74.286 -22.638 1.755 20.935 66.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 391.9885 12.6975 30.87 <2e-16 ***
## I(P - 0.5 * H + 10) 10.3882 0.7294 14.24 <2e-16 ***
## E -12.4452 0.8387 -14.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.96 on 97 degrees of freedom
## Multiple R-squared: 0.81, Adjusted R-squared: 0.8061
## F-statistic: 206.8 on 2 and 97 DF, p-value: < 2.2e-16
vif(mod3)
## I(P - 0.5 * H + 10) E
## 1.000527 1.000527