Predictive Modeling

Predictive modeling, i.e., regression and its siblings, is the main focus of analytics. After all, most of us spend much of our lives thinking about what may come next.

The following is a reliable way of setting the working directory for all code chunks in an R Markdown or R Notebook document.

knitr::opts_knit$set(root.dir = normalizePath("C:\\Users\\Ken\\Google Drive\\Courses\\M733\\Data Files\\Firestone"))

We’ll use the handy Firestone Canada data for illustration.
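The output below shows the first 20 respondents and the number of missing values in each variable. The code that produced it isn't echoed in this document, so the following is only a sketch: the file name passed to read.csv() is an assumption, but head() and colSums(is.na()) reproduce this kind of preview.

# a sketch of how the data might have been read and previewed
# (the file name "firestone.csv" is an assumption)
data <- read.csv("firestone.csv", stringsAsFactors = TRUE)
head(data, 20)        # first 20 respondents
colSums(is.na(data))  # count of missing values per variable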

##    RID v088   v045 v075 v076 v077 v078 v079 v080 v081 v082 v083 v084 v085
## 1    1    3  First    5    5    5    3    5    5    4    5    5    5    3
## 2    2   NA   <NA>   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## 3    3    3   <NA>    5    5    5    4    5    3    4    5    5    4    5
## 4    4    1  First    3   NA    1   NA   NA    1   NA   NA   NA   NA    5
## 5    5    5  First    5    5    5    3    5    5    5    4    5    5    3
## 6    6    4  First    5   NA   NA   NA    5   NA   NA    5   NA   NA   NA
## 7    7    4  First    5    5   NA    5    5   NA   NA   NA   NA    5   NA
## 8    8    4  First    5    5    5    5    5    3    2    5    5    5    4
## 9    9    3  First    4    4    3    3    4    3    4    4    5    4    4
## 10  10    5 Repeat    4    5    2    5    5    3    4    5    5    3    3
## 11  11    3 Repeat    5    5    3    5    5    2    5    5    2    5   NA
## 12  12    4 Repeat    4    5    4    5    4    3    5    5    5    4    5
## 13  13    5 Repeat    5    3    5    5    5    2    5    3    5    5    5
## 14  14    1 Repeat    1    5    1    1    1    1    1    4    4    4    5
## 15  15    4 Repeat    5    5    4    5    5    4    2    5    5    5   NA
## 16  16    4 Repeat    5   NA   NA    5    5   NA    5    5   NA   NA    5
## 17  17    1 Repeat    2    5    2    3    1    1    4    5    4    4    5
## 18  18    5 Repeat    5    5    5    5    5    5    5    5    5    5    5
## 19  19   NA Repeat    2   NA    3    5    5    3   NA    5    1    3    5
## 20  20    3 Repeat    4    5    2    1    5    5    5    5    1    4   NA
##    v086 v087 Occupation Shifts Education Province Gender Age Marital
## 1     3    5          3      2         4        3   Male   7       1
## 2    NA   NA          1      2         8        2 Female   8       1
## 3     5    5         11     NA         2        2 Female   7       1
## 4    NA   NA          6      1         4        2 Female   7       5
## 5     5    4         12      2         6        2   Male   3       2
## 6    NA    5          2      2         5        2   Male   6       1
## 7     5    5          5      2         3        2 Female   8       1
## 8     5    5         12      2         6        2   Male   4       1
## 9     5    4          7      1         5        2 Female   5       2
## 10    3    5         12      1         6        2   Male   4       1
## 11    4    5          3      2         5        2 Female   2       2
## 12    5    5         11     NA         3        2 Female  10       1
## 13    2    5          2      1         9        2 Female   4       1
## 14    5    1         11      1         6        2 Female  10       1
## 15    5    5          3      2         7        2   Male   4       1
## 16    5    5          5      2         5        2   Male   2       2
## 17    4    3          7      2         6        2 Female   7       1
## 18    5    5         12      2         8        2 Female   8       1
## 19    4    5         12      2         5        2 Female   6       2
## 20    5    2          7      2         5        2 Female   8       1
##    Income
## 1      NA
## 2       8
## 3       4
## 4       4
## 5       5
## 6      NA
## 7      NA
## 8       3
## 9       4
## 10      5
## 11      6
## 12      3
## 13      6
## 14      4
## 15      6
## 16      7
## 17      7
## 18      7
## 19      8
## 20      7
##        RID       v088       v045       v075       v076       v077 
##          0         41         43         61         92        173 
##       v078       v079       v080       v081       v082       v083 
##         67         97        135        108         79        133 
##       v084       v085       v086       v087 Occupation     Shifts 
##        107        291        117         77          8        140 
##  Education   Province     Gender        Age    Marital     Income 
##          7         32         32          7         35         92

Missing Values

Now let's look at a summary of the missing values and zeros for each variable, first in the original variable order and then sorted by the percentage of missing values.
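The columns in this summary (q_zeros, p_zeros, q_na, p_na, type, unique) match the output of df_status() from the 'funModeling' package, so a sketch along these lines would reproduce both tables; sorting the returned data frame by p_na to get the second table is an assumption about how it was produced.

# a sketch assuming the summary came from funModeling::df_status()
library(funModeling)
status <- df_status(data)       # summary in original variable order
status[order(-status$p_na), ]   # sorted by percentage of missing values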

##      variable q_zeros p_zeros q_na  p_na    type unique
## 1         RID       0    0.00    0  0.00 integer   1036
## 2        v088       0    0.00   41  3.96 integer      5
## 3        v045       0    0.00   43  4.15  factor      2
## 4        v075       0    0.00   61  5.89 integer      5
## 5        v076       0    0.00   92  8.88 integer      5
## 6        v077       0    0.00  173 16.70 integer      5
## 7        v078       0    0.00   67  6.47 integer      5
## 8        v079       0    0.00   97  9.36 integer      5
## 9        v080       0    0.00  135 13.03 integer      5
## 10       v081       0    0.00  108 10.42 integer      5
## 11       v082       0    0.00   79  7.63 integer      5
## 12       v083       0    0.00  133 12.84 integer      5
## 13       v084       0    0.00  107 10.33 integer      5
## 14       v085       0    0.00  291 28.09 integer      5
## 15       v086       0    0.00  117 11.29 integer      5
## 16       v087       0    0.00   77  7.43 integer      5
## 17 Occupation       0    0.00    8  0.77 integer     13
## 18     Shifts       0    0.00  140 13.51 integer      2
## 19  Education       3    0.29    7  0.68 integer     15
## 20   Province       0    0.00   32  3.09 integer      5
## 21     Gender       0    0.00   32  3.09  factor      2
## 22        Age       1    0.10    7  0.68 integer     13
## 23    Marital       0    0.00   35  3.38 integer      5
## 24     Income       0    0.00   92  8.88 integer      8
##      variable q_zeros p_zeros q_na  p_na    type unique
## 14       v085       0    0.00  291 28.09 integer      5
## 6        v077       0    0.00  173 16.70 integer      5
## 18     Shifts       0    0.00  140 13.51 integer      2
## 9        v080       0    0.00  135 13.03 integer      5
## 12       v083       0    0.00  133 12.84 integer      5
## 15       v086       0    0.00  117 11.29 integer      5
## 10       v081       0    0.00  108 10.42 integer      5
## 13       v084       0    0.00  107 10.33 integer      5
## 8        v079       0    0.00   97  9.36 integer      5
## 5        v076       0    0.00   92  8.88 integer      5
## 24     Income       0    0.00   92  8.88 integer      8
## 11       v082       0    0.00   79  7.63 integer      5
## 16       v087       0    0.00   77  7.43 integer      5
## 7        v078       0    0.00   67  6.47 integer      5
## 4        v075       0    0.00   61  5.89 integer      5
## 3        v045       0    0.00   43  4.15  factor      2
## 2        v088       0    0.00   41  3.96 integer      5
## 23    Marital       0    0.00   35  3.38 integer      5
## 20   Province       0    0.00   32  3.09 integer      5
## 21     Gender       0    0.00   32  3.09  factor      2
## 17 Occupation       0    0.00    8  0.77 integer     13
## 19  Education       3    0.29    7  0.68 integer     15
## 22        Age       1    0.10    7  0.68 integer     13
## 1         RID       0    0.00    0  0.00 integer   1036

Missing values can be imputed in many ways.
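For instance, a quick-and-dirty single imputation simply fills every missing value with the column median. The sketch below uses only base R and is shown purely for contrast; it is not the method we use next.

# a simple single-imputation alternative (for contrast only):
# fill each numeric column's missing values with that column's median
simple <- data
for (v in names(simple)) {
  if (is.numeric(simple[[v]])) {
    simple[[v]][is.na(simple[[v]])] <- median(simple[[v]], na.rm = TRUE)
  }
}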

A much more thorough and well-respected approach is multiple imputation with the ‘mice’ package.

mice performs multiple imputation: by default it produces 5 completed datasets, reflecting the fact that we really don’t know what value a person would have provided as an answer, so the best we can do is give a probabilistic answer.

But remember Donald Rubin’s famous quote, “Whatever you do with missing values is wrong, but some things are less wrong than others.”

mice provides for analysis across all 5 imputed datasets: with() fits the model in each completed dataset (the imputations are stored below in ‘tempData’), and pool() combines the results. Read the mice documentation to find out what it really does and how you can use all of its facilities.

Below, we load mice (it needs to be installed first), use it to impute 5 datasets with the pmm (predictive mean matching) method, and then run the regression across all 5 datasets.

require(mice)
## Loading required package: mice
## Loading required package: Rcpp
## mice 2.25 2015-11-09
head(data)
##   RID v088  v045 v075 v076 v077 v078 v079 v080 v081 v082 v083 v084 v085
## 1   1    3 First    5    5    5    3    5    5    4    5    5    5    3
## 2   2   NA  <NA>   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## 3   3    3  <NA>    5    5    5    4    5    3    4    5    5    4    5
## 4   4    1 First    3   NA    1   NA   NA    1   NA   NA   NA   NA    5
## 5   5    5 First    5    5    5    3    5    5    5    4    5    5    3
## 6   6    4 First    5   NA   NA   NA    5   NA   NA    5   NA   NA   NA
##   v086 v087 Occupation Shifts Education Province Gender Age Marital Income
## 1    3    5          3      2         4        3   Male   7       1     NA
## 2   NA   NA          1      2         8        2 Female   8       1      8
## 3    5    5         11     NA         2        2 Female   7       1      4
## 4   NA   NA          6      1         4        2 Female   7       5      4
## 5    5    4         12      2         6        2   Male   3       2      5
## 6   NA    5          2      2         5        2   Male   6       1     NA
data <- data[-c(1, 3, 17:24)]  # drop RID, v045, and the demographic columns (Occupation through Income)
require(psych)
## Loading required package: psych
## 
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
## 
##     describe
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describe(data)
##      vars   n mean   sd median trimmed  mad min max range  skew kurtosis
## v088    1 995 3.92 1.15      4    4.11 1.48   1   5     4 -1.05     0.35
## v075    2 975 4.17 1.18      5    4.40 0.00   1   5     4 -1.42     0.94
## v076    3 944 4.41 0.89      5    4.58 0.00   1   5     4 -1.61     2.44
## v077    4 863 3.96 1.43      5    4.19 0.00   1   5     4 -1.02    -0.51
## v078    5 969 4.39 1.07      5    4.65 0.00   1   5     4 -1.88     2.67
## v079    6 939 4.25 1.04      5    4.43 0.00   1   5     4 -1.38     1.33
## v080    7 901 3.49 1.27      3    3.58 1.48   1   5     4 -0.31    -0.99
## v081    8 928 3.80 1.19      4    3.95 1.48   1   5     4 -0.75    -0.32
## v082    9 957 4.35 0.90      5    4.51 0.00   1   5     4 -1.54     2.33
## v083   10 903 3.97 1.38      5    4.21 0.00   1   5     4 -1.09    -0.21
## v084   11 929 4.02 1.11      4    4.20 1.48   1   5     4 -1.02     0.35
## v085   12 745 3.66 1.00      3    3.64 0.00   1   5     4  0.09    -0.64
## v086   13 919 4.19 0.97      4    4.33 1.48   1   5     4 -1.17     1.01
## v087   14 959 4.26 1.01      5    4.46 0.00   1   5     4 -1.42     1.41
##        se
## v088 0.04
## v075 0.04
## v076 0.03
## v077 0.05
## v078 0.03
## v079 0.03
## v080 0.04
## v081 0.04
## v082 0.03
## v083 0.05
## v084 0.04
## v085 0.04
## v086 0.03
## v087 0.03
tempData <- mice(data, m = 5, maxit = 50, meth = 'pmm', seed = 500, print = FALSE)
modelFit <- with(tempData, lm(v088 ~ v075 + v079 + v080 + v083 + v087))
pool(modelFit)
## Call: pool(object = modelFit)
## 
## Pooled coefficients:
##  (Intercept)         v075         v079         v080         v083 
## -0.002033553  0.395689717  0.179898354  0.117699697  0.044043468 
##         v087 
##  0.216320014 
## 
## Fraction of information about the coefficients missing due to nonresponse: 
## (Intercept)        v075        v079        v080        v083        v087 
##  0.22108338  0.09805023  0.29346152  0.36742707  0.17402429  0.20684079
summary(pool(modelFit))
##                      est         se           t        df     Pr(>|t|)
## (Intercept) -0.002033553 0.17099950 -0.01189216  86.55886 9.905390e-01
## v075         0.395689717 0.02853647 13.86610380 312.38182 0.000000e+00
## v079         0.179898354 0.03695089  4.86857997  52.24903 1.079632e-05
## v080         0.117699697 0.02696218  4.36536358  34.50866 1.096515e-04
## v083         0.044043468 0.02132799  2.06505497 130.27929 4.090095e-02
## v087         0.216320014 0.03482794  6.21110598  97.16507 1.304118e-08
##                    lo 95      hi 95 nmis        fmi     lambda
## (Intercept) -0.341938001 0.33787089   NA 0.22108338 0.20329154
## v075         0.339541717 0.45183772   61 0.09805023 0.09229399
## v079         0.105759384 0.25403732   97 0.29346152 0.26692438
## v080         0.062935699 0.17246370  135 0.36742707 0.33179785
## v083         0.001849444 0.08623749  133 0.17402429 0.16144080
## v087         0.147197681 0.28544235   77 0.20684079 0.19068108

Sometimes you’ll want to use the imputed data outside of mice, in packages or applications that don’t support pooling across all 5 datasets. In that case, you can select one of the 5 completed datasets with complete().

# if you want to use the data outside of mice, you need to reference 1 of the 5 datasets
completedData <- complete(tempData,1) # using imputed dataset 1
complete1<- as.data.frame(completedData)
is.data.frame(complete1)
## [1] TRUE
fit1<- lm(v088 ~ v075 + v079 + v080 + v083 + v087, data=complete1)
summary(fit1)
## 
## Call:
## lm(formula = v088 ~ v075 + v079 + v080 + v083 + v087, data = complete1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6741 -0.5522  0.2482  0.5263  2.5539 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.02504    0.15386   0.163   0.8707    
## v075         0.40193    0.02733  14.704  < 2e-16 ***
## v079         0.18432    0.03196   5.768 1.06e-08 ***
## v080         0.12262    0.02221   5.520 4.28e-08 ***
## v083         0.03888    0.01959   1.985   0.0474 *  
## v087         0.19760    0.03126   6.322 3.85e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8573 on 1030 degrees of freedom
## Multiple R-squared:  0.4517, Adjusted R-squared:  0.449 
## F-statistic: 169.7 on 5 and 1030 DF,  p-value: < 2.2e-16

Here, we repeat the regression using imputed dataset 2.

# this time we reference imputed dataset 2 of the 5
completedData <- complete(tempData,2) # using imputed dataset 2
complete2<- as.data.frame(completedData)
is.data.frame(complete2)
## [1] TRUE
fit2 <- lm(v088 ~ v075 + v079 + v080 + v083 + v087, data=complete2)
summary(fit2)
## 
## Call:
## lm(formula = v088 ~ v075 + v079 + v080 + v083 + v087, data = complete1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6741 -0.5522  0.2482  0.5263  2.5539 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.02504    0.15386   0.163   0.8707    
## v075         0.40193    0.02733  14.704  < 2e-16 ***
## v079         0.18432    0.03196   5.768 1.06e-08 ***
## v080         0.12262    0.02221   5.520 4.28e-08 ***
## v083         0.03888    0.01959   1.985   0.0474 *  
## v087         0.19760    0.03126   6.322 3.85e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8573 on 1030 degrees of freedom
## Multiple R-squared:  0.4517, Adjusted R-squared:  0.449 
## F-statistic: 169.7 on 5 and 1030 DF,  p-value: < 2.2e-16

‘texreg’ provides the facility to display the results of one or more regressions side by side for comparison. It’s very handy.

#install.packages("texreg")
require(texreg)
## Loading required package: texreg
## Version:  1.36.18
## Date:     2016-10-22
## Author:   Philip Leifeld (University of Glasgow)
## 
## Please cite the JSS article in your publications -- see citation("texreg").
screenreg(list(fit1, fit2), custom.model.names = c("Fit1", "Fit2"))
## 
## =====================================
##              Fit1         Fit2       
## -------------------------------------
## (Intercept)     0.03         0.03    
##                (0.15)       (0.15)   
## v075            0.40 ***     0.40 ***
##                (0.03)       (0.03)   
## v079            0.18 ***     0.18 ***
##                (0.03)       (0.03)   
## v080            0.12 ***     0.12 ***
##                (0.02)       (0.02)   
## v083            0.04 *       0.04 *  
##                (0.02)       (0.02)   
## v087            0.20 ***     0.20 ***
##                (0.03)       (0.03)   
## -------------------------------------
## R^2             0.45         0.45    
## Adj. R^2        0.45         0.45    
## Num. obs.    1036         1036       
## RMSE            0.86         0.86    
## =====================================
## *** p < 0.001, ** p < 0.01, * p < 0.05

Below, we’ll predict the values of v088, the criterion variable, from the regression (fit1) and imputed dataset 1. We’ll also get the correlation between the predicted values of v088 and the actual values of v088.

predicted <- predict(fit1, newdata=complete1)
actual <- complete1$v088 # the actual values of v088 for the testing sample
length(actual)
## [1] 1036
length(predicted)
## [1] 1036
cor(actual,predicted) # correlation coefficient
## [1] 0.6720871
cor(actual,predicted)^2 # coefficient of determination, R2
## [1] 0.4517011
cor.test(actual,predicted) # hypothesis test for correlation
## 
##  Pearson's product-moment correlation
## 
## data:  actual and predicted
## t = 29.186, df = 1034, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6372671 0.7041686
## sample estimates:
##       cor 
## 0.6720871
pool.r.squared(modelFit)
##           est     lo 95     hi 95        fmi
## R^2 0.4552127 0.4088597 0.5000378 0.03682078
pool.r.squared(modelFit,adjusted=TRUE)
##              est    lo 95     hi 95        fmi
## adj R^2 0.452568 0.406136 0.4974979 0.03703019

And here is a very basic scatter plot of the actual versus predicted values, using ‘plotly’ to produce an interactive plot.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:Hmisc':
## 
##     subplot
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = ~actual, y = ~predicted, type = "scatter", mode = "markers",
        text = ~paste("Predicted: ", predicted))
# frame <- data.frame(x = actual, y = predicted)
# library('WVPlots') # see https://github.com/WinVector/WVPlots
# WVPlots::ScatterHist(frame, "x", "y", title="Example Fit")
# WVPlots::ScatterHist(frame, "x", "y", smoothmethod="lm", title="Example Linear Fit", annot_size=2)