Regression Models in R

Kayvan Nejabati Zenouz

Dec 05, 2018


1 Intended Learning Outcomes

1.1 This documents helps you get a quick glance at …

  • Advanced regression models (using R very very quickly)

    • Linear models lmO
    • Non-parametric loessO and smooth.splineO
    • Generalised linear models glmO
    • Generalised Addetive models gamO
    • Generalised linear models for location, scale, and shape gamlssO
  • Various ways of checking your models

  • Model selection

  • Neural networks vs linear models

2 Introduction

  • We start working with STATLAB.csv
  • We shall use a subset
Data<-data.frame(famib=Main$famib, famit=Main$famit, sex=as.factor(Main$sex))

3 Basic Linear Models

(linmod<-lm(famit~famib, data=Data))

Call:
lm(formula = famit ~ famib, data = Data)

Coefficients:
(Intercept)        famib  
    100.031        0.657  

3.1 Summary

summary(linmod)

Call:
lm(formula = famit ~ famib, data = Data)

Residuals:
    Min      1Q  Median      3Q     Max 
-91.311 -36.062  -8.762  32.448 169.409 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 100.0306    11.8130   8.468 3.31e-14 ***
famib         0.6570     0.1587   4.140 6.01e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 51.58 on 138 degrees of freedom
Multiple R-squared:  0.1105,    Adjusted R-squared:  0.104 
F-statistic: 17.14 on 1 and 138 DF,  p-value: 0.00006013

3.2 ANOVA

anova(linmod)
Analysis of Variance Table

Response: famit
           Df Sum Sq Mean Sq F value     Pr(>F)    
famib       1  45600   45600   17.14 0.00006013 ***
Residuals 138 367137    2660                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.3 Plots

  • Plot data with the regression line
plot(famit ~ famib, data = Data, col = "red", pch = 19)
abline(linmod, col = "blue", lwd = 2)

  • Model checking plots
par(mfrow = c(2, 2))
plot(linmod, col = "blue", lwd = 2, pch = 19)

  • Residuals apparently OK, but the Q-Q plot shows a bit of a divergence near the tails

  • Preform shapiro-wilk test

shapiro.test(linmod$residuals)

    Shapiro-Wilk normality test

data:  linmod$residuals
W = 0.97538, p-value = 0.01241
  • The residuals fails to pass the normality test
p<- ggplot(data=Data, aes(x=famit)) + 
  geom_histogram(aes(y =..density..), position="identity",
                 breaks=seq(min(Data$famit)-1, max(Data$famit)+1, by = 20), 
                 alpha=.5, color="light blue", fill="red") + 
               geom_density(alpha=.3, fill="brown", col="black") + 
  labs(title="Histogram of famit")
ggplotly(p)
  • The response variable does not have a normal shape histogram

  • We shall look at non-parametric models shortly

3.4 Model Plots with ggplot

p<- ggplot(data = Data, aes(x = famib, y = famit)) + 
     geom_point(color='red') +
     geom_smooth(method = "lm", se = TRUE)
ggplotly(p)
  • Model checking with autoplot()
library(ggfortify)
autoplot(linmod, colour="red", CI=TRUE, pval=TRUE, plotTable=TRUE, title="Model checking plots")

3.5 Models with More than one Explanatory Variables

p<- ggplot(data = Data, aes(x = famib, y = famit, color=sex)) + 
     geom_point() +
     geom_smooth(method = "lm", se = TRUE)
ggplotly(p)
(linmod1<-lm(famit~famib+sex, data=Data))

Call:
lm(formula = famit ~ famib + sex, data = Data)

Coefficients:
(Intercept)        famib         sex2  
    94.8289       0.6711       8.7064  
summary(linmod1)

Call:
lm(formula = famit ~ famib + sex, data = Data)

Residuals:
    Min      1Q  Median      3Q     Max 
-95.378 -32.596  -7.268  31.908 164.780 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  94.8289    12.9202   7.340 1.72e-11 ***
famib         0.6711     0.1593   4.212 4.56e-05 ***
sex2          8.7064     8.7570   0.994    0.322    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 51.58 on 137 degrees of freedom
Multiple R-squared:  0.1169,    Adjusted R-squared:  0.104 
F-statistic: 9.064 on 2 and 137 DF,  p-value: 0.000201
anova(linmod1)
Analysis of Variance Table

Response: famit
           Df Sum Sq Mean Sq F value     Pr(>F)    
famib       1  45600   45600 17.1388 0.00006039 ***
sex         1   2630    2630  0.9885     0.3219    
Residuals 137 364507    2661                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.6 Comparing Models

anova(linmod,linmod1)
Analysis of Variance Table

Model 1: famit ~ famib
Model 2: famit ~ famib + sex
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    138 367137                           
2    137 364507  1      2630 0.9885 0.3219

3.7 Modelling with Polynomials

(linmod1<-lm(famit~famib+I(famib^2)+I(famib^3), data=Data))

Call:
lm(formula = famit ~ famib + I(famib^2) + I(famib^3), data = Data)

Coefficients:
(Intercept)        famib   I(famib^2)   I(famib^3)  
43.29121857   2.97336055  -0.02761250   0.00009696  
summary(linmod1)

Call:
lm(formula = famit ~ famib + I(famib^2) + I(famib^3), data = Data)

Residuals:
    Min      1Q  Median      3Q     Max 
-90.974 -36.443  -7.469  31.488 167.917 

Coefficients:
               Estimate  Std. Error t value Pr(>|t|)
(Intercept) 43.29121857 54.06485200   0.801    0.425
famib        2.97336055  2.08264280   1.428    0.156
I(famib^2)  -0.02761250  0.02434115  -1.134    0.259
I(famib^3)   0.00009696  0.00008606   1.127    0.262

Residual standard error: 51.71 on 136 degrees of freedom
Multiple R-squared:  0.1188,    Adjusted R-squared:  0.09939 
F-statistic: 6.113 on 3 and 136 DF,  p-value: 0.0006228

3.8 Modelling with Interaction

(linmod1<-lm(famit~famib+sex+famib*sex, data=Data))

Call:
lm(formula = famit ~ famib + sex + famib * sex, data = Data)

Coefficients:
(Intercept)        famib         sex2   famib:sex2  
    88.2815       0.7626      24.9666      -0.2372  
summary(linmod1)

Call:
lm(formula = famit ~ famib + sex + famib * sex, data = Data)

Residuals:
    Min      1Q  Median      3Q     Max 
-99.264 -32.680  -5.625  32.370 166.721 

Coefficients:
            Estimate Std. Error t value   Pr(>|t|)    
(Intercept)  88.2815    15.7938   5.590 0.00000012 ***
famib         0.7626     0.2037   3.744   0.000266 ***
sex2         24.9666    24.1301   1.035   0.302661    
famib:sex2   -0.2372     0.3279  -0.723   0.470707    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 51.67 on 136 degrees of freedom
Multiple R-squared:  0.1202,    Adjusted R-squared:  0.1008 
F-statistic: 6.196 on 3 and 136 DF,  p-value: 0.0005615

4 Ggplot with Loess

  • Locally Estimated Scatterplot Smoothing is a non-parametric regression method
p<- ggplot(data = Data, aes(x = famib, y = famit)) + 
     geom_point(color='red') +
     geom_smooth(method = "loess", se = TRUE)
ggplotly(p)

5 Function loess()

loess.model <- loess(famit ~ famib, data = Data, span=0.5, degree=1) # 50% smoothing span and degree 1 polynomial
summary(loess.model)
Call:
loess(formula = famit ~ famib, data = Data, span = 0.5, degree = 1)

Number of Observations: 140 
Equivalent Number of Parameters: 4.73 
Residual Standard Error: 51.05 
Trace of smoother matrix: 5.56  (exact)

Control settings:
  span     :  0.5 
  degree   :  1 
  family   :  gaussian
  surface  :  interpolate     cell = 0.2
  normalize:  TRUE
 parametric:  FALSE
drop.square:  FALSE 
loess.model2 <- loess(famit ~ famib, data = Data, span=0.9, degree=2)
summary(loess.model2)
Call:
loess(formula = famit ~ famib, data = Data, span = 0.9, degree = 2)

Number of Observations: 140 
Equivalent Number of Parameters: 4.36 
Residual Standard Error: 51.03 
Trace of smoother matrix: 4.75  (exact)

Control settings:
  span     :  0.9 
  degree   :  2 
  family   :  gaussian
  surface  :  interpolate     cell = 0.2
  normalize:  TRUE
 parametric:  FALSE
drop.square:  FALSE 

5.1 Create plots

plot(famit ~ famib, data = Data, col="pink", pch=19)
hat1 <- predict(loess.model)
lines(Data$famib[order(Data$famib)], hat1[order(Data$famib)], col="red", lwd=2)
hat2 <- predict(loess.model2)
lines(Data$famib[order(Data$famib)], hat2[order(Data$famib)], col="blue", lwd=2)
legend(20, 350, legend=c("span=0.5, degree=1", "span=0.9, degree=2"),col=c("red", "blue"),lty=1, cex=0.8,
       title="Line types", text.font=4, bg='lightblue')

6 Function smooth.spline()

  • Fits a cubic smoothing spline to the data
spline.model <- smooth.spline(Data$famit ~ Data$famib, df=7)
spline.model
Call:
smooth.spline(x = Data$famit ~ Data$famib, df = 7)

Smoothing Parameter  spar= 0.7922674  lambda= 0.001018668 (12 iterations)
Equivalent Degrees of Freedom (Df): 6.999176
Penalized Criterion (RSS): 147331.1
GCV: 2689.888

6.1 Compare fitting

plot(famit ~ famib, data = Data, col="pink", pch=19)
abline(linmod, col="blue", lwd=2)
lines(Data$famib[order(Data$famib)], hat1[order(Data$famib)], col="red", lwd=2)
lines(spline.model, col="green", lwd=2)
legend(20, 350, legend=c("linear model", "loess span=0.5, degree=1", "smooth spline, df=7"),col=c("blue", "red", "green"),lty=1, cex=0.8,
       title="Line types", text.font=4, bg='lightblue')

7 Function glm() and gam()

  • Generalised linear models allows modelling data coming from a distribution belonging to the exponential family
Gmodel<-glm(famit ~ famib, data=Data, family=Gamma())
summary(Gmodel)

Call:
glm(formula = famit ~ famib, family = Gamma(), data = Data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0736  -0.2830  -0.0493   0.2172   0.8766  

Coefficients:
                Estimate   Std. Error t value  Pr(>|t|)    
(Intercept)  0.008817606  0.000520140  16.952   < 2e-16 ***
famib       -0.000026780  0.000006262  -4.276 0.0000352 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.1209083)

    Null deviance: 19.751  on 139  degrees of freedom
Residual deviance: 17.793  on 138  degrees of freedom
AIC: 1492.1

Number of Fisher Scoring iterations: 5
autoplot(Gmodel, colour="red", CI=TRUE, pval=TRUE, plotTable=TRUE, title="Model checking plots")

  • Generalised additive modelling allows for non-parametric modelling library(gam)
Gamodel<-gam(famit ~ s(famib, df = 5), data=Data, family=gaussian())
summary(Gamodel)

Call: gam(formula = famit ~ s(famib, df = 5), family = gaussian(), 
    data = Data)
Deviance Residuals:
   Min     1Q Median     3Q    Max 
-99.05 -37.12  -4.34  31.84 167.00 

(Dispersion Parameter for gaussian family taken to be 2565.676)

    Null Deviance: 412737 on 139 degrees of freedom
Residual Deviance: 343800.7 on 134 degrees of freedom
AIC: 1504.167 

Number of Local Scoring Iterations: 2 

Anova for Parametric Effects
                  Df Sum Sq Mean Sq F value     Pr(>F)    
s(famib, df = 5)   1  45600   45600  17.773 0.00004548 ***
Residuals        134 343801    2566                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anova for Nonparametric Effects
                 Npar Df Npar F   Pr(F)  
(Intercept)                              
s(famib, df = 5)       4 2.2739 0.06456 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(Gamodel,se = TRUE, col="red", lwd=2) 

8 Function gamlss()

  • Generalised Additive Models for Location, Scale, and Shape allows modelling for skewed data as well as many more distribution library(gamlss)
Gaml<-gamlss(famit ~ famib, data=Data, family=NO()) # Normal distribution
GAMLSS-RS iteration 1: Global Deviance = 1499.361 
GAMLSS-RS iteration 2: Global Deviance = 1499.361 
summary(Gaml)
******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = famit ~ famib, family = NO(), data = Data) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 100.0306    11.7284   8.529 2.44e-14 ***
famib         0.6570     0.1576   4.170 5.37e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.93592    0.05976   65.86   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
No. of observations in the fit:  140 
Degrees of Freedom for the fit:  3
      Residual Deg. of Freedom:  137 
                      at cycle:  2 
 
Global Deviance:     1499.361 
            AIC:     1505.361 
            SBC:     1514.186 
******************************************************************
plot(Gaml)

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  1.00614e-16 
                       variance   =  1.007194 
               coef. of skewness  =  0.5688844 
               coef. of kurtosis  =  3.224292 
Filliben correlation coefficient  =  0.9880375 
******************************************************************
shapiro.test(Gaml$residuals)

    Shapiro-Wilk normality test

data:  Gaml$residuals
W = 0.97538, p-value = 0.01241
  • Problem with residuals

9 Gamlss with Generalised Guassian Inverse

Gaml<-gamlss(famit ~ famib, data=Data, family=GIG()) # Generalised Guassian Inverse
GAMLSS-RS iteration 1: Global Deviance = 1489.07 
GAMLSS-RS iteration 2: Global Deviance = 1487.675 
GAMLSS-RS iteration 3: Global Deviance = 1487.005 
GAMLSS-RS iteration 4: Global Deviance = 1486.604 
GAMLSS-RS iteration 5: Global Deviance = 1486.324 
GAMLSS-RS iteration 6: Global Deviance = 1486.115 
GAMLSS-RS iteration 7: Global Deviance = 1485.951 
GAMLSS-RS iteration 8: Global Deviance = 1485.822 
GAMLSS-RS iteration 9: Global Deviance = 1485.718 
GAMLSS-RS iteration 10: Global Deviance = 1485.633 
GAMLSS-RS iteration 11: Global Deviance = 1485.578 
GAMLSS-RS iteration 12: Global Deviance = 1485.578 
summary(Gaml)
******************************************************************
Family:  c("GIG", "Generalised Inverse Gaussian") 

Call:  gamlss(formula = famit ~ famib, family = GIG(), data = Data) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  log
Mu Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.672463   0.081517  57.319  < 2e-16 ***
famib       0.004342   0.001095   3.963 0.000119 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.917     93.331   0.021    0.984

------------------------------------------------------------------
Nu link function:  identity 
Nu Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8.0556     0.9437   8.536 2.44e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
No. of observations in the fit:  140 
Degrees of Freedom for the fit:  4
      Residual Deg. of Freedom:  136 
                      at cycle:  12 
 
Global Deviance:     1485.578 
            AIC:     1493.578 
            SBC:     1505.344 
******************************************************************
plot(Gaml)

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.0007077885 
                       variance   =  1.006735 
               coef. of skewness  =  -0.1802651 
               coef. of kurtosis  =  2.911699 
Filliben correlation coefficient  =  0.9972805 
******************************************************************
shapiro.test(Gaml$residuals)

    Shapiro-Wilk normality test

data:  Gaml$residuals
W = 0.99436, p-value = 0.8607
  • Small problem with the residuals

10 Gamlss with Generalised Gamma

Gaml<-gamlss(famit ~ famib, data=Data, family=GG()) # Generalised Gamma Dsitribution
GAMLSS-RS iteration 1: Global Deviance = 1485.289 
GAMLSS-RS iteration 2: Global Deviance = 1484.983 
GAMLSS-RS iteration 3: Global Deviance = 1484.853 
GAMLSS-RS iteration 4: Global Deviance = 1484.796 
GAMLSS-RS iteration 5: Global Deviance = 1484.77 
GAMLSS-RS iteration 6: Global Deviance = 1484.759 
GAMLSS-RS iteration 7: Global Deviance = 1484.753 
GAMLSS-RS iteration 8: Global Deviance = 1484.751 
GAMLSS-RS iteration 9: Global Deviance = 1484.75 
GAMLSS-RS iteration 10: Global Deviance = 1484.749 
summary(Gaml)
******************************************************************
Family:  c("GG", "generalised Gamma Lopatatsidis-Green") 

Call:  gamlss(formula = famit ~ famib, family = GG(), data = Data) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  log
Mu Coefficients:
            Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 4.698616   0.082202  57.159   < 2e-16 ***
famib       0.004439   0.001083   4.099 0.0000711 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.07366    0.06868  -15.63   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Nu link function:  identity 
Nu Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   1.5778     0.6844   2.306   0.0227 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
No. of observations in the fit:  140 
Degrees of Freedom for the fit:  4
      Residual Deg. of Freedom:  136 
                      at cycle:  10 
 
Global Deviance:     1484.749 
            AIC:     1492.749 
            SBC:     1504.516 
******************************************************************
plot(Gaml)

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.0009268872 
                       variance   =  1.006849 
               coef. of skewness  =  -0.002718162 
               coef. of kurtosis  =  2.819451 
Filliben correlation coefficient  =  0.9983572 
******************************************************************
shapiro.test(Gaml$residuals)

    Shapiro-Wilk normality test

data:  Gaml$residuals
W = 0.99649, p-value = 0.9844
  • Problems solved

11 Binomial Modelling

Data from https://www.kaggle.com/kemical/kickstarter-projects

## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
  • Use library(dplyr)
CountS<-count(Main,state, sort = T)
names(CountS)[2]<-"Count"
p<- ggplot(CountS, aes(x = reorder(state,-Count), y = Count)) + 
     geom_bar(stat = "identity", fill = "red", colour = "light blue", width = 0.7, 
                 alpha=.5) +
     xlab('Category')+ ylab('Number of Cases')+
     theme(axis.text.x=element_text(angle=-45, hjust=0.001)) # "angle" will tilt the labels 
ggplotly(p)
CountCS <- count(Main, main_category, state, sort = F)
names(CountCS)[3] <- "Count"

p <- ggplot(CountCS, aes(x = reorder(main_category, -Count), y = Count, fill = state)) + 
    geom_bar(stat = "identity", width = 0.7) + xlab("Category") + ylab("Number of Cases") + 
    theme(axis.text.x = element_text(angle = -90, hjust = 0.001))  # 'angle' will tilt the labels
ggplotly(p)

11.1 Modelling

Main$Successful<-ifelse(Main$state=="successful"|Main$state=="live",1,0)
TS<-sample(1:nrow(Main), round(nrow(Main)/5))
MainTr<-Main[-TS,]
MainTe<-Main[TS,]
model <- glm(Successful ~main_category ,family=binomial(link='logit'),data=MainTr)
summary(model)

Call:
glm(formula = Successful ~ main_category, family = binomial(link = "logit"), 
    data = MainTr)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4059  -0.9513  -0.7667   1.3237   1.7688  

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)               -0.33760    0.01351  -24.99   <2e-16 ***
main_categoryComics        0.51680    0.02542   20.33   <2e-16 ***
main_categoryCrafts       -0.76518    0.03067  -24.95   <2e-16 ***
main_categoryDance         0.86021    0.04017   21.41   <2e-16 ***
main_categoryDesign       -0.23390    0.01904  -12.28   <2e-16 ***
main_categoryFashion      -0.73639    0.02175  -33.86   <2e-16 ***
main_categoryFilm & Video -0.16564    0.01632  -10.15   <2e-16 ***
main_categoryFood         -0.73314    0.02118  -34.61   <2e-16 ***
main_categoryGames        -0.22054    0.01831  -12.04   <2e-16 ***
main_categoryJournalism   -0.93050    0.04131  -22.53   <2e-16 ***
main_categoryMusic         0.22343    0.01670   13.38   <2e-16 ***
main_categoryPhotography  -0.45066    0.02698  -16.70   <2e-16 ***
main_categoryPublishing   -0.43116    0.01810  -23.82   <2e-16 ***
main_categoryTechnology   -0.99202    0.02034  -48.77   <2e-16 ***
main_categoryTheater       0.76118    0.02579   29.51   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 396332  on 302928  degrees of freedom
Residual deviance: 383740  on 302914  degrees of freedom
AIC: 383770

Number of Fisher Scoring iterations: 4
anova(model, test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: Successful

Terms added sequentially (first to last)

              Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                         302928     396332              
main_category 14    12592    302914     383740 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

11.2 Checking the Fit

  • The usual plots may not be useful
autoplot(model, colour = "red", CI = TRUE, pval = TRUE, plotTable = TRUE, title = "Model checking plots")

11.3 Prediction Error

  • For checking the binomial model we may consider other methods
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
pR2(model)
##              llh          llhNull               G2         McFadden 
## -191869.99454571 -198165.79105116   12591.59301090       0.03177035 
##             r2ML             r2CU 
##       0.04071413       0.05579346
fitted.results <- predict(model, newdata = select(MainTe, main_category), type = "response")
DiffD <- data.frame(Diff = MainTe$Successful - fitted.results)
mean(DiffD$Diff)
## [1] -0.0009951888
p <- ggplot(data = DiffD, aes(x = Diff)) + geom_histogram(aes(y = ..density..), 
    position = "identity", bins = 15, alpha = 0.5, col = "red", fill = "white") + 
    geom_density(alpha = 0.3, fill = "#FF6666") + labs(title = "Histogram of famit")
ggplotly(p)

11.4 Accuracy

fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != MainTe$Successful)
print(paste('Accuracy ',round(1-misClasificError,2),'%', sep = ''))
[1] "Accuracy 0.65%"

11.5 Performance with library(ROCR)

pr <- ROCR::prediction(fitted.results, MainTe$Successful)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, col = "red", pch = 19, lwd = 2)

12 Stepwise Regression library(olsrr)

12.1 Dataframe

## Graphs

ggPoints(data = Data.fr, mapping = aes(x = Height, y = Weight, colour = Fac), 
    method = "lm", interactive = TRUE)

12.2 Linear Regression

model <- lm(Weight ~ ., data = Data.fr)
summary(model)

Call:
lm(formula = Weight ~ ., data = Data.fr)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.3884 -2.7489 -0.6242  2.3178  8.5864 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -25.47398   36.46371  -0.699   0.4855    
Height        9.32140    3.60675   2.584   0.0103 *  
Pre          -0.12786    0.06159  -2.076   0.0390 *  
FacA.2      -11.36542    0.74355 -15.285   <2e-16 ***
DisD.2        0.71256    0.62631   1.138   0.2564    
DisD.3        0.81611    0.67930   1.201   0.2308    
DisD.4        1.23373    0.82184   1.501   0.1346    
Day           0.08625    0.72441   0.119   0.9053    
Err           1.00248    0.01052  95.272   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.312 on 241 degrees of freedom
Multiple R-squared:  0.9838,    Adjusted R-squared:  0.9833 
F-statistic:  1833 on 8 and 241 DF,  p-value: < 2.2e-16
autoplot(model, colour="red", CI=TRUE, pval=TRUE, plotTable=TRUE, title="Model checking plots")

shapiro.test(model$residuals)

    Shapiro-Wilk normality test

data:  model$residuals
W = 0.92068, p-value = 2.767e-10

12.3 Default Stepwise Regression With olsrr

  • Variables with \(p\) value less than pent will enter into the model

  • Variables with \(p\) more than prem will be removed from the model

ols_step_both_p(model, pent = 0.1, prem = 0.3, details =TRUE)
Stepwise Selection Method   
---------------------------

Candidate Terms: 

1. Height 
2. Pre 
3. Fac 
4. Dis 
5. Day 
6. Err 

We are selecting variables based on p value...


Stepwise Selection: Step 1 

- Err added 

                         Model Summary                          
---------------------------------------------------------------
R                       0.845       RMSE                13.713 
R-Squared               0.715       Coef. Var           37.449 
Adj. R-Squared          0.714       MSE                188.043 
Pred R-Squared          0.710       MAE                 11.325 
---------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 

                                 ANOVA                                   
------------------------------------------------------------------------
                  Sum of                                                
                 Squares         DF    Mean Square       F         Sig. 
------------------------------------------------------------------------
Regression    116856.198          1     116856.198    621.432    0.0000 
Residual       46634.737        248        188.043                      
Total         163490.936        249                                     
------------------------------------------------------------------------

                                  Parameter Estimates                                    
----------------------------------------------------------------------------------------
      model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
----------------------------------------------------------------------------------------
(Intercept)    38.010         0.869                 43.736    0.000    36.298    39.721 
        Err     1.074         0.043        0.845    24.929    0.000     0.990     1.159 
----------------------------------------------------------------------------------------



Stepwise Selection: Step 2 

- Height added 

                        Model Summary                          
--------------------------------------------------------------
R                       0.973       RMSE                5.987 
R-Squared               0.946       Coef. Var          16.349 
Adj. R-Squared          0.945       MSE                35.840 
Pred R-Squared          0.945       MAE                 5.223 
--------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 

                                  ANOVA                                   
-------------------------------------------------------------------------
                  Sum of                                                 
                 Squares         DF    Mean Square       F          Sig. 
-------------------------------------------------------------------------
Regression    154638.542          2      77319.271    2157.367    0.0000 
Residual        8852.394        247         35.840                       
Total         163490.936        249                                      
-------------------------------------------------------------------------

                                    Parameter Estimates                                      
--------------------------------------------------------------------------------------------
      model       Beta    Std. Error    Std. Beta       t        Sig       lower      upper 
--------------------------------------------------------------------------------------------
(Intercept)    -26.039         2.009                 -12.962    0.000    -29.995    -22.082 
        Err      1.013         0.019        0.797     53.594    0.000      0.976      1.051 
     Height      8.529         0.263        0.483     32.469    0.000      8.012      9.046 
--------------------------------------------------------------------------------------------

                        Model Summary                          
--------------------------------------------------------------
R                       0.973       RMSE                5.987 
R-Squared               0.946       Coef. Var          16.349 
Adj. R-Squared          0.945       MSE                35.840 
Pred R-Squared          0.945       MAE                 5.223 
--------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 

                                  ANOVA                                   
-------------------------------------------------------------------------
                  Sum of                                                 
                 Squares         DF    Mean Square       F          Sig. 
-------------------------------------------------------------------------
Regression    154638.542          2      77319.271    2157.367    0.0000 
Residual        8852.394        247         35.840                       
Total         163490.936        249                                      
-------------------------------------------------------------------------

                                    Parameter Estimates                                      
--------------------------------------------------------------------------------------------
      model       Beta    Std. Error    Std. Beta       t        Sig       lower      upper 
--------------------------------------------------------------------------------------------
(Intercept)    -26.039         2.009                 -12.962    0.000    -29.995    -22.082 
        Err      1.013         0.019        0.797     53.594    0.000      0.976      1.051 
     Height      8.529         0.263        0.483     32.469    0.000      8.012      9.046 
--------------------------------------------------------------------------------------------



Stepwise Selection: Step 3 

- Fac added 

                        Model Summary                          
--------------------------------------------------------------
R                       0.992       RMSE                3.309 
R-Squared               0.984       Coef. Var           9.036 
Adj. R-Squared          0.983       MSE                10.948 
Pred R-Squared          0.983       MAE                 2.801 
--------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 

                                  ANOVA                                   
-------------------------------------------------------------------------
                  Sum of                                                 
                 Squares         DF    Mean Square       F          Sig. 
-------------------------------------------------------------------------
Regression    160797.703          3      53599.234    4895.757    0.0000 
Residual        2693.233        246         10.948                       
Total         163490.936        249                                      
-------------------------------------------------------------------------

                                    Parameter Estimates                                      
--------------------------------------------------------------------------------------------
      model       Beta    Std. Error    Std. Beta       t        Sig       lower      upper 
--------------------------------------------------------------------------------------------
(Intercept)    -22.845         1.118                 -20.427    0.000    -25.048    -20.642 
        Err      1.005         0.010        0.790     96.061    0.000      0.984      1.025 
     Height      8.871         0.146        0.502     60.801    0.000      8.583      9.158 
     FacA.2    -10.084         0.425       -0.195    -23.719    0.000    -10.922     -9.247 
--------------------------------------------------------------------------------------------

                        Model Summary                          
--------------------------------------------------------------
R                       0.992       RMSE                3.309 
R-Squared               0.984       Coef. Var           9.036 
Adj. R-Squared          0.983       MSE                10.948 
Pred R-Squared          0.983       MAE                 2.801 
--------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 

                                  ANOVA                                   
-------------------------------------------------------------------------
                  Sum of                                                 
                 Squares         DF    Mean Square       F          Sig. 
-------------------------------------------------------------------------
Regression    160797.703          3      53599.234    4895.757    0.0000 
Residual        2693.233        246         10.948                       
Total         163490.936        249                                      
-------------------------------------------------------------------------

                                    Parameter Estimates                                      
--------------------------------------------------------------------------------------------
      model       Beta    Std. Error    Std. Beta       t        Sig       lower      upper 
--------------------------------------------------------------------------------------------
(Intercept)    -22.845         1.118                 -20.427    0.000    -25.048    -20.642 
        Err      1.005         0.010        0.790     96.061    0.000      0.984      1.025 
     Height      8.871         0.146        0.502     60.801    0.000      8.583      9.158 
     FacA.2    -10.084         0.425       -0.195    -23.719    0.000    -10.922     -9.247 
--------------------------------------------------------------------------------------------



No more variables to be added/removed.


Final Model Output 
------------------

                        Model Summary                          
--------------------------------------------------------------
R                       0.992       RMSE                3.309 
R-Squared               0.984       Coef. Var           9.036 
Adj. R-Squared          0.983       MSE                10.948 
Pred R-Squared          0.983       MAE                 2.801 
--------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 

                                  ANOVA                                   
-------------------------------------------------------------------------
                  Sum of                                                 
                 Squares         DF    Mean Square       F          Sig. 
-------------------------------------------------------------------------
Regression    160797.703          3      53599.234    4895.757    0.0000 
Residual        2693.233        246         10.948                       
Total         163490.936        249                                      
-------------------------------------------------------------------------

                                    Parameter Estimates                                      
--------------------------------------------------------------------------------------------
      model       Beta    Std. Error    Std. Beta       t        Sig       lower      upper 
--------------------------------------------------------------------------------------------
(Intercept)    -22.845         1.118                 -20.427    0.000    -25.048    -20.642 
        Err      1.005         0.010        0.790     96.061    0.000      0.984      1.025 
     Height      8.871         0.146        0.502     60.801    0.000      8.583      9.158 
     FacA.2    -10.084         0.425       -0.195    -23.719    0.000    -10.922     -9.247 
--------------------------------------------------------------------------------------------

                               Stepwise Selection Summary                                 
-----------------------------------------------------------------------------------------
                     Added/                   Adj.                                           
Step    Variable    Removed     R-Square    R-Square      C(p)          AIC        RMSE      
-----------------------------------------------------------------------------------------
   1      Err       addition       0.715       0.714    4005.5300    2022.6293    13.7129    
   2     Height     addition       0.946       0.945     563.0430    1609.2148     5.9866    
   3      Fac       addition       0.984       0.983       3.5330    1313.7284     3.3088    
-----------------------------------------------------------------------------------------

12.4 Wide Model

  • Variables with \(p\) value less than pent=1 will enter into the model

  • Variables with \(p\) more than prem=1 will be removed from the model

ols_step_both_p(model, pent = 1, prem = 1, details = FALSE)
Stepwise Selection Method   
---------------------------

Candidate Terms: 

1. Height 
2. Pre 
3. Fac 
4. Dis 
5. Day 
6. Err 

We are selecting variables based on p value...

Variables Entered/Removed: 

- Err added 
- Height added 
- Fac added 
- Pre added 
- Dis added 
- Day added 


Final Model Output 
------------------

                        Model Summary                          
--------------------------------------------------------------
R                       0.992       RMSE                3.312 
R-Squared               0.984       Coef. Var           9.045 
Adj. R-Squared          0.983       MSE                10.969 
Pred R-Squared          0.983       MAE                 2.760 
--------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 

                                  ANOVA                                   
-------------------------------------------------------------------------
                  Sum of                                                 
                 Squares         DF    Mean Square       F          Sig. 
-------------------------------------------------------------------------
Regression    160847.424          8      20105.928    1832.989    0.0000 
Residual        2643.512        241         10.969                       
Total         163490.936        249                                      
-------------------------------------------------------------------------

                                    Parameter Estimates                                     
-------------------------------------------------------------------------------------------
      model       Beta    Std. Error    Std. Beta       t        Sig       lower     upper 
-------------------------------------------------------------------------------------------
(Intercept)    -25.474        36.464                  -0.699    0.485    -97.302    46.354 
        Err      1.002         0.011        0.789     95.272    0.000      0.982     1.023 
     Height      9.321         3.607        0.528      2.584    0.010      2.217    16.426 
     FacA.2    -11.365         0.744       -0.220    -15.285    0.000    -12.830    -9.901 
        Pre     -0.128         0.062       -0.034     -2.076    0.039     -0.249    -0.007 
     DisD.2      0.713         0.626        0.012      1.138    0.256     -0.521     1.946 
     DisD.3      0.816         0.679        0.013      1.201    0.231     -0.522     2.154 
     DisD.4      1.234         0.822        0.021      1.501    0.135     -0.385     2.853 
        Day      0.086         0.724        0.024      0.119    0.905     -1.341     1.513 
-------------------------------------------------------------------------------------------

                               Stepwise Selection Summary                                 
-----------------------------------------------------------------------------------------
                     Added/                   Adj.                                           
Step    Variable    Removed     R-Square    R-Square      C(p)          AIC        RMSE      
-----------------------------------------------------------------------------------------
   1      Err       addition       0.715       0.714    4005.5300    2022.6293    13.7129    
   2     Height     addition       0.946       0.945     563.0430    1609.2148     5.9866    
   3      Fac       addition       0.984       0.983       3.5330    1313.7284     3.3088    
   4      Pre       addition       0.984       0.983       3.4740    1313.6235     3.3016    
   5      Dis       addition       0.984       0.983       3.0140    1317.0846     3.3052    
   6      Day       addition       0.984       0.983       5.0000    1319.0699     3.3119    
-----------------------------------------------------------------------------------------

12.5 Very Narrow Model

  • Variables with \(p\) value less than pent= 10^{-150} will enter into the model

  • Variables with \(p\) more than prem= 0.5 will be removed from the model

  • Smallest possible number in R is .Machine$double.xmin which is 2.225073910^{-308}

ols_step_both_p(model, pent = 10^(-150), prem = 0.5, details = FALSE)
Stepwise Selection Method   
---------------------------

Candidate Terms: 

1. Height 
2. Pre 
3. Fac 
4. Dis 
5. Day 
6. Err 

We are selecting variables based on p value...

Variables Entered/Removed: 

- Err added 

No more variables to be added/removed.


Final Model Output 
------------------

                         Model Summary                          
---------------------------------------------------------------
R                       0.845       RMSE                13.713 
R-Squared               0.715       Coef. Var           37.449 
Adj. R-Squared          0.714       MSE                188.043 
Pred R-Squared          0.710       MAE                 11.325 
---------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 

                                 ANOVA                                   
------------------------------------------------------------------------
                  Sum of                                                
                 Squares         DF    Mean Square       F         Sig. 
------------------------------------------------------------------------
Regression    116856.198          1     116856.198    621.432    0.0000 
Residual       46634.737        248        188.043                      
Total         163490.936        249                                     
------------------------------------------------------------------------

                                  Parameter Estimates                                    
----------------------------------------------------------------------------------------
      model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
----------------------------------------------------------------------------------------
(Intercept)    38.010         0.869                 43.736    0.000    36.298    39.721 
        Err     1.074         0.043        0.845    24.929    0.000     0.990     1.159 
----------------------------------------------------------------------------------------

                               Stepwise Selection Summary                                 
-----------------------------------------------------------------------------------------
                     Added/                   Adj.                                           
Step    Variable    Removed     R-Square    R-Square      C(p)          AIC        RMSE      
-----------------------------------------------------------------------------------------
   1      Err       addition       0.715       0.714    4005.5300    2022.6293    13.7129    
-----------------------------------------------------------------------------------------

13 Relative Importance library(relaimpo)

# Calculate Relative Importance for Each Predictor

calc.relimp(model,type=c("lmg","last","first"),
   rela=TRUE)
Response variable: Weight 
Total response variance: 656.5901 
Analysis based on 250 observations 

8 Regressors: 
Some regressors combined in groups: 
        Group  Dis : DisD.2 DisD.3 DisD.4 

 Relative importance of 6 (groups of) regressors assessed: 
 Dis Height Pre Fac Day Err 
 
Proportion of variance explained by model: 98.38%
Metrics are normalized to sum to 100% (rela=TRUE). 

Relative importance metrics: 

               lmg           last        first
Dis    0.001595766 0.000263138186 0.0003205256
Height 0.150713910 0.000716370913 0.2289530014
Pre    0.007474442 0.000462176589 0.0053172473
Fac    0.029149724 0.025058770079 0.0203080482
Day    0.149484405 0.000001520218 0.2275498091
Err    0.661581754 0.973498024015 0.5175513684

Average coefficients for different model sizes: 

           1group     2groups     3groups      4groups      5groups
Height  9.9271262  12.4585380  13.8021218  13.63350255  12.03185259
Pre     0.3188422   0.2976070   0.2157637   0.07908466  -0.06262494
Fac    -8.6555234 -10.0418802 -11.1093732 -11.97687392 -12.29966957
DisD.2  0.5128425   0.2309913   0.1408727   0.36166228   0.69564134
DisD.3 -0.6224845  -0.5992575  -0.4245367   0.06597999   0.64790985
DisD.4 -0.8760861  -1.3050856  -1.3468825  -0.56998870   0.61965646
Day    -1.9879969  -1.0098538  -0.2827304   0.12297230   0.22513025
Err     1.0744190   1.0499086   1.0303313   1.01603189   1.00635427
            6groups
Height   9.32140490
Pre     -0.12785520
Fac    -11.36541593
DisD.2   0.71255929
DisD.3   0.81610944
DisD.4   1.23373058
Day      0.08624536
Err      1.00247867
# Bootstrap Measures of Relative Importance (1000 samples) 
boot <- boot.relimp(model, b = 100, type = c("lmg", 
  "last", "first"), rank = TRUE, 
  diff = TRUE, rela = TRUE)
booteval.relimp(boot) # print result
Response variable: Weight 
Total response variance: 656.5901 
Analysis based on 250 observations 

8 Regressors: 
Some regressors combined in groups: 
        Group  Dis : DisD.2 DisD.3 DisD.4 

 Relative importance of 6 (groups of) regressors assessed: 
 Dis Height Pre Fac Day Err 
 
Proportion of variance explained by model: 98.38%
Metrics are normalized to sum to 100% (rela=TRUE). 

Relative importance metrics: 

               lmg           last        first
Dis    0.001595766 0.000263138186 0.0003205256
Height 0.150713910 0.000716370913 0.2289530014
Pre    0.007474442 0.000462176589 0.0053172473
Fac    0.029149724 0.025058770079 0.0203080482
Day    0.149484405 0.000001520218 0.2275498091
Err    0.661581754 0.973498024015 0.5175513684

Average coefficients for different model sizes: 

           1group     2groups     3groups      4groups      5groups
Height  9.9271262  12.4585380  13.8021218  13.63350255  12.03185259
Pre     0.3188422   0.2976070   0.2157637   0.07908466  -0.06262494
Fac    -8.6555234 -10.0418802 -11.1093732 -11.97687392 -12.29966957
DisD.2  0.5128425   0.2309913   0.1408727   0.36166228   0.69564134
DisD.3 -0.6224845  -0.5992575  -0.4245367   0.06597999   0.64790985
DisD.4 -0.8760861  -1.3050856  -1.3468825  -0.56998870   0.61965646
Day    -1.9879969  -1.0098538  -0.2827304   0.12297230   0.22513025
Err     1.0744190   1.0499086   1.0303313   1.01603189   1.00635427
            6groups
Height   9.32140490
Pre     -0.12785520
Fac    -11.36541593
DisD.2   0.71255929
DisD.3   0.81610944
DisD.4   1.23373058
Day      0.08624536
Err      1.00247867

 
 Confidence interval information ( 100 bootstrap replicates, bty= perc ): 
Relative Contributions with confidence intervals: 
 
                               Lower  Upper
             percentage 0.95   0.95   0.95  
Dis.lmg      0.0016     ____EF 0.0023 0.0178
Height.lmg   0.1507     _BC___ 0.1145 0.1870
Pre.lmg      0.0075     ____EF 0.0047 0.0221
Fac.lmg      0.0291     ___D__ 0.0142 0.0540
Day.lmg      0.1495     _BC___ 0.1136 0.1854
Err.lmg      0.6616     A_____ 0.5800 0.7226
                                            
Dis.last     0.0003     __CDEF 0.0000 0.0018
Height.last  0.0007     __CDEF 0.0001 0.0025
Pre.last     0.0005     __CDEF 0.0000 0.0017
Fac.last     0.0251     _B____ 0.0176 0.0384
Day.last     0.0000     ___DEF 0.0000 0.0006
Err.last     0.9735     A_____ 0.9565 0.9811
                                            
Dis.first    0.0003     ___DEF 0.0008 0.0261
Height.first 0.2290     _BC___ 0.1720 0.2681
Pre.first    0.0053     ___DEF 0.0000 0.0380
Fac.first    0.0203     ___DEF 0.0000 0.0628
Day.first    0.2275     _BC___ 0.1701 0.2660
Err.first    0.5176     A_____ 0.4330 0.5965

Letters indicate the ranks covered by bootstrap CIs. 
(Rank bootstrap confidence intervals always obtained by percentile method) 
CAUTION: Bootstrap confidence intervals can be somewhat liberal. 

 
 Differences between Relative Contributions: 
 
                                 Lower   Upper
                 difference 0.95 0.95    0.95   
Dis-Height.lmg   -0.1491     *   -0.1834 -0.1011
Dis-Pre.lmg      -0.0059         -0.0158  0.0080
Dis-Fac.lmg      -0.0276     *   -0.0499 -0.0062
Dis-Day.lmg      -0.1479     *   -0.1812 -0.1001
Dis-Err.lmg      -0.6600     *   -0.7183 -0.5722
Height-Pre.lmg    0.1432     *    0.1004  0.1798
Height-Fac.lmg    0.1216     *    0.0736  0.1681
Height-Day.lmg    0.0012         -0.0012  0.0035
Height-Err.lmg   -0.5109     *   -0.6104 -0.4008
Pre-Fac.lmg      -0.0217     *   -0.0397 -0.0074
Pre-Day.lmg      -0.1420     *   -0.1781 -0.0992
Pre-Err.lmg      -0.6541     *   -0.7133 -0.5663
Fac-Day.lmg      -0.1203     *   -0.1670 -0.0728
Fac-Err.lmg      -0.6324     *   -0.7015 -0.5327
Day-Err.lmg      -0.5121     *   -0.6108 -0.4027
                                                
Dis-Height.last  -0.0005         -0.0017  0.0012
Dis-Pre.last     -0.0002         -0.0011  0.0011
Dis-Fac.last     -0.0248     *   -0.0376 -0.0173
Dis-Day.last      0.0003         -0.0002  0.0017
Dis-Err.last     -0.9732     *   -0.9808 -0.9558
Height-Pre.last   0.0003         -0.0012  0.0015
Height-Fac.last  -0.0243     *   -0.0360 -0.0167
Height-Day.last   0.0007         -0.0002  0.0019
Height-Err.last  -0.9728     *   -0.9804 -0.9543
Pre-Fac.last     -0.0246     *   -0.0367 -0.0174
Pre-Day.last      0.0005         -0.0002  0.0016
Pre-Err.last     -0.9730     *   -0.9809 -0.9548
Fac-Day.last      0.0251     *    0.0175  0.0379
Fac-Err.last     -0.9484     *   -0.9635 -0.9179
Day-Err.last     -0.9735     *   -0.9810 -0.9561
                                                
Dis-Height.first -0.2286     *   -0.2636 -0.1521
Dis-Pre.first    -0.0050         -0.0263  0.0205
Dis-Fac.first    -0.0200         -0.0603  0.0174
Dis-Day.first    -0.2272     *   -0.2619 -0.1510
Dis-Err.first    -0.5172     *   -0.5924 -0.4096
Height-Pre.first  0.2236     *    0.1447  0.2660
Height-Fac.first  0.2086     *    0.1317  0.2615
Height-Day.first  0.0014         -0.0021  0.0045
Height-Err.first -0.2886     *   -0.4263 -0.1757
Pre-Fac.first    -0.0150         -0.0427  0.0079
Pre-Day.first    -0.2222     *   -0.2641 -0.1429
Pre-Err.first    -0.5122     *   -0.5834 -0.4101
Fac-Day.first    -0.2072     *   -0.2603 -0.1307
Fac-Err.first    -0.4972     *   -0.5721 -0.3831
Day-Err.first    -0.2900     *   -0.4271 -0.1779

* indicates that CI for difference does not include 0. 
CAUTION: Bootstrap confidence intervals can be somewhat liberal. 
plot(booteval.relimp(boot,sort=TRUE)) # plot result

14 Artificial Neural Networks in R

  • We shall work with the dataframe
  • Create a training and test
Main<-Data
TS<-sample(1:nrow(Main), round(nrow(Main)/5))
train<-Main[-TS,]
test<-Main[TS,]

14.1 Linear Model on Data

model <- glm(Weight ~., data=train)
summary(model)

Call:
glm(formula = Weight ~ ., data = train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-11.6230   -3.6645    0.0582    3.3049   13.0276  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.74076   59.04151  -0.063    0.950    
Height       5.69671    5.83728   0.976    0.330    
Pre          0.48497    0.04988   9.723   <2e-16 ***
Day         -0.61044    1.17242  -0.521    0.603    
Err          1.01651    0.01753  57.989   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 23.68618)

    Null deviance: 129393.2  on 199  degrees of freedom
Residual deviance:   4618.8  on 195  degrees of freedom
AIC: 1207.5

Number of Fisher Scoring iterations: 2
pr.model <- predict(model,test)
MSE.lm <- sum((pr.model - test$Weight)^2)/nrow(test)

14.2 Scale Data

maxs <- apply(Main, 2, max) 
mins <- apply(Main, 2, min)

scaled <- as.data.frame(scale(Main, center = mins, scale = maxs - mins))

train_ <- scaled[-TS,]
test_ <- scaled[TS,]

14.3 Neural Network with library(neuralnet)

n <- names(train_)
f <- as.formula(paste("Weight ~", paste(n[!n %in% "Weight"], collapse = " + ")))
nn <- neuralnet(f,data=train_,hidden=c(3,2),linear.output=T)

14.4 Plot the Best Layer

plot(nn, rep="best")

14.5 Test the Network

pr.nn <- compute(nn,test_[,-2])
pr.nn_ <- pr.nn$net.result*(max(Main$Weight)-min(Main$Weight))+min(Main$Weight)
test.r <- (test_$Weight)*(max(Main$Weight)-min(Main$Weight))+min(Main$Weight)
MSE.nn <- sum((test.r - pr.nn_)^2)/nrow(test_)
print(paste(MSE.lm,MSE.nn))
[1] "32.2173825421678 16.2889592065124"
par(mfrow=c(1,2))
plot(test$Weight,pr.nn_,col='red',main='Real vs predicted NN',pch=19,cex=0.7)
abline(0,1,lwd=2)
legend('bottomright',legend='NN',pch=19,col='red', bty='n')
plot(test$Weight,pr.model,col='blue',main='Real vs predicted lm',pch=19, cex=0.7)
abline(0,1,lwd=2)
legend('bottomright',legend='LM',pch=19,col='blue', bty='n', cex=.95)

plot(test$Weight,pr.nn_,col='red',main='Real vs predicted NN',pch=19)
points(test$Weight,pr.model,col='blue',pch=19)
abline(0,1,lwd=2)
legend('bottomright',legend=c('NN','LM'),pch=19,col=c('red','blue'))

14.6 Use package library(NeuralNetTools) to Check for Importance

nn <- neuralnet(f,data=train_,hidden=4,linear.output=T)
garson(nn)

lekprofile(nn)

olden(nn)

plotnet(nn)