Before you proceed:

Remember that your data files and script of the R Markdown should be in the same working directory. Even if you change the working directory in a code chunk of R Markdown, the working directory will be reset to the original working directory as the code chunk gets executed. By defalut the working directory of a R Markdown script is the directory where you have just saved the script.

setwd("~/Linear Models")
getwd()
## [1] "C:/Users/Ankit/Documents/Linear Models"
COL <- read.csv2("./COL.csv")
p<-4
n<-dim(COL)[1]
library(car)
## Warning: package 'car' was built under R version 3.4.2

Suppose, one wants to perform a linear regression to model the cholesterol level as a function of weight (P) ,height (H) and age (E). So, We will create here the linear model where we are trying to predict the ‘Colestrol level (C)’ based on the variables Weight(P), Age (E) and Height(H) using the data set COL. The code written below accomplishes this task.

mod<-lm(C~P+E+H,COL)

We can now see the summary of the model. The summary can be interpreted as follows-

summary(mod)
## 
## Call:
## lm(formula = C ~ P + E + H, data = COL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -74.608 -22.137   1.888  21.156  65.410 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 490.9978    35.0517  14.008  < 2e-16 ***
## P            10.3773     0.7365  14.090  < 2e-16 ***
## E           -13.0195     3.8530  -3.379  0.00105 ** 
## H            -5.0989     0.7227  -7.055 2.68e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.11 on 96 degrees of freedom
## Multiple R-squared:  0.8101, Adjusted R-squared:  0.8041 
## F-statistic: 136.5 on 3 and 96 DF,  p-value: < 2.2e-16

Optional calculations: Confidence intervals of the parameters

confint(mod,level=0.99)
##                  0.5 %     99.5 %
## (Intercept) 398.881272 583.114304
## P             8.441792  12.312821
## E           -23.145228  -2.893732
## H            -6.998311  -3.199551

Optional calculations: SS1 Test of the parameters with default orderings

anova(mod)
## Analysis of Variance Table
## 
## Response: C
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## P          1  62396   62396  68.826 6.686e-13 ***
## E          1 263670  263670 290.841 < 2.2e-16 ***
## H          1  45123   45123  49.773 2.676e-10 ***
## Residuals 96  87031     907                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(C~H+P+E,COL))
## Analysis of Variance Table
## 
## Response: C
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## H          1 171564  171564 189.244 < 2.2e-16 ***
## P          1 189273  189273 208.778 < 2.2e-16 ***
## E          1  10351   10351  11.418  0.001052 ** 
## Residuals 96  87031     907                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note: SS3, the tests (F) always coincide with those of the summary (t), F = t ^ 2

Anova(mod,ty=3)
## Anova Table (Type III tests)
## 
## Response: C
##             Sum Sq Df F value    Pr(>F)    
## (Intercept) 177887  1 196.219 < 2.2e-16 ***
## P           179985  1 198.533 < 2.2e-16 ***
## E            10351  1  11.418  0.001052 ** 
## H            45123  1  49.773 2.676e-10 ***
## Residuals    87031 96                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Diagnosticos de R

oldpar <- par( mfrow=c(2,2))
plot(mod,ask=F)

par(oldpar)

Diagnostics: Collinearity

We can check for the condition of Multicollinearity using the VIF function which corresponds to the Variance Inflation factor

vif(mod)
##         P         E         H 
##  9.489406 20.904776 31.695499

We remember MODEL SUMMARY

summary(mod)
## 
## Call:
## lm(formula = C ~ P + E + H, data = COL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -74.608 -22.137   1.888  21.156  65.410 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 490.9978    35.0517  14.008  < 2e-16 ***
## P            10.3773     0.7365  14.090  < 2e-16 ***
## E           -13.0195     3.8530  -3.379  0.00105 ** 
## H            -5.0989     0.7227  -7.055 2.68e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.11 on 96 degrees of freedom
## Multiple R-squared:  0.8101, Adjusted R-squared:  0.8041 
## F-statistic: 136.5 on 3 and 96 DF,  p-value: < 2.2e-16

Optional calculations: Confidence intervals of the parameters

confint(mod,level=0.99)
##                  0.5 %     99.5 %
## (Intercept) 398.881272 583.114304
## P             8.441792  12.312821
## E           -23.145228  -2.893732
## H            -6.998311  -3.199551

Optional calculations: For some predetermined cases IC of E (Y)

(C0<-data.frame(cbind(P=c(65,75,65),E=c(15,15,12),H=c(150,150,150)), row.names=1:3))
##    P  E   H
## 1 65 15 150
## 2 75 15 150
## 3 65 12 150
predict(mod, C0, interval="confidence", level=.95, se.fit=T)
## $fit
##        fit      lwr      upr
## 1 205.3908 199.1668 211.6148
## 2 309.1639 294.6188 323.7089
## 3 244.4492 219.8210 269.0774
## 
## $se.fit
##         1         2         3 
##  3.135539  7.327533 12.407261 
## 
## $df
## [1] 96
## 
## $residual.scale
## [1] 30.1094

Optional calculations: For some predetermined cases I: Y Prison

predict(mod, C0, interval="prediction", level=.95, se.fit=F)
##        fit      lwr      upr
## 1 205.3908 145.3009 265.4807
## 2 309.1639 247.6528 370.6749
## 3 244.4492 179.8071 309.0914

Optional calculations: SS1 Test of the parameters with default orderings

anova(mod)
## Analysis of Variance Table
## 
## Response: C
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## P          1  62396   62396  68.826 6.686e-13 ***
## E          1 263670  263670 290.841 < 2.2e-16 ***
## H          1  45123   45123  49.773 2.676e-10 ***
## Residuals 96  87031     907                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note: SS3, the tests (F) coincide with those of the summary (t), F = t ^ 2

Anova(mod,ty=3)
## Anova Table (Type III tests)
## 
## Response: C
##             Sum Sq Df F value    Pr(>F)    
## (Intercept) 177887  1 196.219 < 2.2e-16 ***
## P           179985  1 198.533 < 2.2e-16 ***
## E            10351  1  11.418  0.001052 ** 
## H            45123  1  49.773 2.676e-10 ***
## Residuals    87031 96                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Linear paths in the independent variables:

center the data, only canvia the independent term

summary(lm(C~I(P-mean(P))+I(E-mean(E))+I(H-mean(H)),COL))
## 
## Call:
## lm(formula = C ~ I(P - mean(P)) + I(E - mean(E)) + I(H - mean(H)), 
##     data = COL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -74.608 -22.137   1.888  21.156  65.410 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    218.1550     3.0109  72.454  < 2e-16 ***
## I(P - mean(P))  10.3773     0.7365  14.090  < 2e-16 ***
## I(E - mean(E)) -13.0195     3.8530  -3.379  0.00105 ** 
## I(H - mean(H))  -5.0989     0.7227  -7.055 2.68e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.11 on 96 degrees of freedom
## Multiple R-squared:  0.8101, Adjusted R-squared:  0.8041 
## F-statistic: 136.5 on 3 and 96 DF,  p-value: < 2.2e-16

or, if P = 65, E = 15 and H = 150 are close to the means:

summary(lm(C~I(P-65)+I(E-15)+I(H-150),COL))
## 
## Call:
## lm(formula = C ~ I(P - 65) + I(E - 15) + I(H - 150), data = COL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -74.608 -22.137   1.888  21.156  65.410 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 205.3908     3.1355  65.504  < 2e-16 ***
## I(P - 65)    10.3773     0.7365  14.090  < 2e-16 ***
## I(E - 15)   -13.0195     3.8530  -3.379  0.00105 ** 
## I(H - 150)   -5.0989     0.7227  -7.055 2.68e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.11 on 96 degrees of freedom
## Multiple R-squared:  0.8101, Adjusted R-squared:  0.8041 
## F-statistic: 136.5 on 3 and 96 DF,  p-value: < 2.2e-16

Linear paths in the independent variables:

canvas in some variable, for example, excess weight,

weight pattern 0.5 * H-10, EP = P- (0.5 * H-10)

summary(mod2<-lm(C~I(P-0.5*H+10)+E+H,COL))
## 
## Call:
## lm(formula = C ~ I(P - 0.5 * H + 10) + E + H, data = COL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -74.608 -22.137   1.888  21.156  65.410 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         387.22473   33.69605  11.492  < 2e-16 ***
## I(P - 0.5 * H + 10)  10.37731    0.73649  14.090  < 2e-16 ***
## E                   -13.01948    3.85300  -3.379  0.00105 ** 
## H                     0.08972    0.58736   0.153  0.87891    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.11 on 96 degrees of freedom
## Multiple R-squared:  0.8101, Adjusted R-squared:  0.8041 
## F-statistic: 136.5 on 3 and 96 DF,  p-value: < 2.2e-16
vif(mod2)
## I(P - 0.5 * H + 10)                   E                   H 
##            1.009937           20.904776           20.933520

Note: Only canvia some parameters and collinearity

Linear paths in the independent variables:

remove some independent variable not significant and / or with a lot of colinealidat

for example H, if excess weight is already used

summary(mod3<-lm(C~I(P-0.5*H+10)+E,COL))
## 
## Call:
## lm(formula = C ~ I(P - 0.5 * H + 10) + E, data = COL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -74.286 -22.638   1.755  20.935  66.244 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         391.9885    12.6975   30.87   <2e-16 ***
## I(P - 0.5 * H + 10)  10.3882     0.7294   14.24   <2e-16 ***
## E                   -12.4452     0.8387  -14.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.96 on 97 degrees of freedom
## Multiple R-squared:   0.81,  Adjusted R-squared:  0.8061 
## F-statistic: 206.8 on 2 and 97 DF,  p-value: < 2.2e-16
vif(mod3)
## I(P - 0.5 * H + 10)                   E 
##            1.000527            1.000527