Predictive modeling, i.e., regression and its siblings, is the main focus of analytics. It seems that most of us spend much of our lives thinking about what may be next.
The following is the best way of setting up the working directory in RMarkdown and Rnotebook.
knitr::opts_knit$set(root.dir = normalizePath("C:\\Users\\Ken\\Google Drive\\Courses\\M733\\Data Files\\Firestone"))
We’ll use the handy Firestone Canada data for illustration.
## RID v088 v045 v075 v076 v077 v078 v079 v080 v081 v082 v083 v084 v085
## 1 1 3 First 5 5 5 3 5 5 4 5 5 5 3
## 2 2 NA <NA> NA NA NA NA NA NA NA NA NA NA NA
## 3 3 3 <NA> 5 5 5 4 5 3 4 5 5 4 5
## 4 4 1 First 3 NA 1 NA NA 1 NA NA NA NA 5
## 5 5 5 First 5 5 5 3 5 5 5 4 5 5 3
## 6 6 4 First 5 NA NA NA 5 NA NA 5 NA NA NA
## 7 7 4 First 5 5 NA 5 5 NA NA NA NA 5 NA
## 8 8 4 First 5 5 5 5 5 3 2 5 5 5 4
## 9 9 3 First 4 4 3 3 4 3 4 4 5 4 4
## 10 10 5 Repeat 4 5 2 5 5 3 4 5 5 3 3
## 11 11 3 Repeat 5 5 3 5 5 2 5 5 2 5 NA
## 12 12 4 Repeat 4 5 4 5 4 3 5 5 5 4 5
## 13 13 5 Repeat 5 3 5 5 5 2 5 3 5 5 5
## 14 14 1 Repeat 1 5 1 1 1 1 1 4 4 4 5
## 15 15 4 Repeat 5 5 4 5 5 4 2 5 5 5 NA
## 16 16 4 Repeat 5 NA NA 5 5 NA 5 5 NA NA 5
## 17 17 1 Repeat 2 5 2 3 1 1 4 5 4 4 5
## 18 18 5 Repeat 5 5 5 5 5 5 5 5 5 5 5
## 19 19 NA Repeat 2 NA 3 5 5 3 NA 5 1 3 5
## 20 20 3 Repeat 4 5 2 1 5 5 5 5 1 4 NA
## v086 v087 Occupation Shifts Education Province Gender Age Marital
## 1 3 5 3 2 4 3 Male 7 1
## 2 NA NA 1 2 8 2 Female 8 1
## 3 5 5 11 NA 2 2 Female 7 1
## 4 NA NA 6 1 4 2 Female 7 5
## 5 5 4 12 2 6 2 Male 3 2
## 6 NA 5 2 2 5 2 Male 6 1
## 7 5 5 5 2 3 2 Female 8 1
## 8 5 5 12 2 6 2 Male 4 1
## 9 5 4 7 1 5 2 Female 5 2
## 10 3 5 12 1 6 2 Male 4 1
## 11 4 5 3 2 5 2 Female 2 2
## 12 5 5 11 NA 3 2 Female 10 1
## 13 2 5 2 1 9 2 Female 4 1
## 14 5 1 11 1 6 2 Female 10 1
## 15 5 5 3 2 7 2 Male 4 1
## 16 5 5 5 2 5 2 Male 2 2
## 17 4 3 7 2 6 2 Female 7 1
## 18 5 5 12 2 8 2 Female 8 1
## 19 4 5 12 2 5 2 Female 6 2
## 20 5 2 7 2 5 2 Female 8 1
## Income
## 1 NA
## 2 8
## 3 4
## 4 4
## 5 5
## 6 NA
## 7 NA
## 8 3
## 9 4
## 10 5
## 11 6
## 12 3
## 13 6
## 14 4
## 15 6
## 16 7
## 17 7
## 18 7
## 19 8
## 20 7
## RID v088 v045 v075 v076 v077
## 0 41 43 61 92 173
## v078 v079 v080 v081 v082 v083
## 67 97 135 108 79 133
## v084 v085 v086 v087 Occupation Shifts
## 107 291 117 77 8 140
## Education Province Gender Age Marital Income
## 7 32 32 7 35 92
And now looking at a summary of the missing values and zeros.
## variable q_zeros p_zeros q_na p_na type unique
## 1 RID 0 0.00 0 0.00 integer 1036
## 2 v088 0 0.00 41 3.96 integer 5
## 3 v045 0 0.00 43 4.15 factor 2
## 4 v075 0 0.00 61 5.89 integer 5
## 5 v076 0 0.00 92 8.88 integer 5
## 6 v077 0 0.00 173 16.70 integer 5
## 7 v078 0 0.00 67 6.47 integer 5
## 8 v079 0 0.00 97 9.36 integer 5
## 9 v080 0 0.00 135 13.03 integer 5
## 10 v081 0 0.00 108 10.42 integer 5
## 11 v082 0 0.00 79 7.63 integer 5
## 12 v083 0 0.00 133 12.84 integer 5
## 13 v084 0 0.00 107 10.33 integer 5
## 14 v085 0 0.00 291 28.09 integer 5
## 15 v086 0 0.00 117 11.29 integer 5
## 16 v087 0 0.00 77 7.43 integer 5
## 17 Occupation 0 0.00 8 0.77 integer 13
## 18 Shifts 0 0.00 140 13.51 integer 2
## 19 Education 3 0.29 7 0.68 integer 15
## 20 Province 0 0.00 32 3.09 integer 5
## 21 Gender 0 0.00 32 3.09 factor 2
## 22 Age 1 0.10 7 0.68 integer 13
## 23 Marital 0 0.00 35 3.38 integer 5
## 24 Income 0 0.00 92 8.88 integer 8
## variable q_zeros p_zeros q_na p_na type unique
## 14 v085 0 0.00 291 28.09 integer 5
## 6 v077 0 0.00 173 16.70 integer 5
## 18 Shifts 0 0.00 140 13.51 integer 2
## 9 v080 0 0.00 135 13.03 integer 5
## 12 v083 0 0.00 133 12.84 integer 5
## 15 v086 0 0.00 117 11.29 integer 5
## 10 v081 0 0.00 108 10.42 integer 5
## 13 v084 0 0.00 107 10.33 integer 5
## 8 v079 0 0.00 97 9.36 integer 5
## 5 v076 0 0.00 92 8.88 integer 5
## 24 Income 0 0.00 92 8.88 integer 8
## 11 v082 0 0.00 79 7.63 integer 5
## 16 v087 0 0.00 77 7.43 integer 5
## 7 v078 0 0.00 67 6.47 integer 5
## 4 v075 0 0.00 61 5.89 integer 5
## 3 v045 0 0.00 43 4.15 factor 2
## 2 v088 0 0.00 41 3.96 integer 5
## 23 Marital 0 0.00 35 3.38 integer 5
## 20 Province 0 0.00 32 3.09 integer 5
## 21 Gender 0 0.00 32 3.09 factor 2
## 17 Occupation 0 0.00 8 0.77 integer 13
## 19 Education 3 0.29 7 0.68 integer 15
## 22 Age 1 0.10 7 0.68 integer 13
## 1 RID 0 0.00 0 0.00 integer 1036
Missing values can be imputed in many ways.
The ‘mice’ package is a thorough and respected process.
This is multiple imputation, so the result is 5 (default) datasets to reflect that we really don’t know what value a person might have provided as an answer and the best we can do is to provide a probabilistic answer.
But remember Dr. David Rubin’s famous quote, “Whatever you do with missing values is wrong, but some things are less wrong than others.”
mice provides for analysis using all 5 datasets using ‘tempData’. Google for “mice” and find out what it really does and how you can use all of its facilities.
Below, we’re accessing mice (it needs to be installed first), then using mice to impute 5 datasets using the pmm methods. After that, we’re doing a regression using all 5 datasets.
require(mice)
## Loading required package: mice
## Loading required package: Rcpp
## mice 2.25 2015-11-09
head(data)
## RID v088 v045 v075 v076 v077 v078 v079 v080 v081 v082 v083 v084 v085
## 1 1 3 First 5 5 5 3 5 5 4 5 5 5 3
## 2 2 NA <NA> NA NA NA NA NA NA NA NA NA NA NA
## 3 3 3 <NA> 5 5 5 4 5 3 4 5 5 4 5
## 4 4 1 First 3 NA 1 NA NA 1 NA NA NA NA 5
## 5 5 5 First 5 5 5 3 5 5 5 4 5 5 3
## 6 6 4 First 5 NA NA NA 5 NA NA 5 NA NA NA
## v086 v087 Occupation Shifts Education Province Gender Age Marital Income
## 1 3 5 3 2 4 3 Male 7 1 NA
## 2 NA NA 1 2 8 2 Female 8 1 8
## 3 5 5 11 NA 2 2 Female 7 1 4
## 4 NA NA 6 1 4 2 Female 7 5 4
## 5 5 4 12 2 6 2 Male 3 2 5
## 6 NA 5 2 2 5 2 Male 6 1 NA
data <- data[-c(1,3, 17:24)]
require(psych)
## Loading required package: psych
##
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
##
## describe
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describe(data)
## vars n mean sd median trimmed mad min max range skew kurtosis
## v088 1 995 3.92 1.15 4 4.11 1.48 1 5 4 -1.05 0.35
## v075 2 975 4.17 1.18 5 4.40 0.00 1 5 4 -1.42 0.94
## v076 3 944 4.41 0.89 5 4.58 0.00 1 5 4 -1.61 2.44
## v077 4 863 3.96 1.43 5 4.19 0.00 1 5 4 -1.02 -0.51
## v078 5 969 4.39 1.07 5 4.65 0.00 1 5 4 -1.88 2.67
## v079 6 939 4.25 1.04 5 4.43 0.00 1 5 4 -1.38 1.33
## v080 7 901 3.49 1.27 3 3.58 1.48 1 5 4 -0.31 -0.99
## v081 8 928 3.80 1.19 4 3.95 1.48 1 5 4 -0.75 -0.32
## v082 9 957 4.35 0.90 5 4.51 0.00 1 5 4 -1.54 2.33
## v083 10 903 3.97 1.38 5 4.21 0.00 1 5 4 -1.09 -0.21
## v084 11 929 4.02 1.11 4 4.20 1.48 1 5 4 -1.02 0.35
## v085 12 745 3.66 1.00 3 3.64 0.00 1 5 4 0.09 -0.64
## v086 13 919 4.19 0.97 4 4.33 1.48 1 5 4 -1.17 1.01
## v087 14 959 4.26 1.01 5 4.46 0.00 1 5 4 -1.42 1.41
## se
## v088 0.04
## v075 0.04
## v076 0.03
## v077 0.05
## v078 0.03
## v079 0.03
## v080 0.04
## v081 0.04
## v082 0.03
## v083 0.05
## v084 0.04
## v085 0.04
## v086 0.03
## v087 0.03
tempData <- mice(data, m=5,maxit=50,meth='pmm',seed=500,print=FALSE)
modelFit <- with(tempData,lm(v088 ~ v075 + v079 + v080 + v083 + v087))
pool(modelFit)
## Call: pool(object = modelFit)
##
## Pooled coefficients:
## (Intercept) v075 v079 v080 v083
## -0.002033553 0.395689717 0.179898354 0.117699697 0.044043468
## v087
## 0.216320014
##
## Fraction of information about the coefficients missing due to nonresponse:
## (Intercept) v075 v079 v080 v083 v087
## 0.22108338 0.09805023 0.29346152 0.36742707 0.17402429 0.20684079
summary(pool(modelFit))
## est se t df Pr(>|t|)
## (Intercept) -0.002033553 0.17099950 -0.01189216 86.55886 9.905390e-01
## v075 0.395689717 0.02853647 13.86610380 312.38182 0.000000e+00
## v079 0.179898354 0.03695089 4.86857997 52.24903 1.079632e-05
## v080 0.117699697 0.02696218 4.36536358 34.50866 1.096515e-04
## v083 0.044043468 0.02132799 2.06505497 130.27929 4.090095e-02
## v087 0.216320014 0.03482794 6.21110598 97.16507 1.304118e-08
## lo 95 hi 95 nmis fmi lambda
## (Intercept) -0.341938001 0.33787089 NA 0.22108338 0.20329154
## v075 0.339541717 0.45183772 61 0.09805023 0.09229399
## v079 0.105759384 0.25403732 97 0.29346152 0.26692438
## v080 0.062935699 0.17246370 135 0.36742707 0.33179785
## v083 0.001849444 0.08623749 133 0.17402429 0.16144080
## v087 0.147197681 0.28544235 77 0.20684079 0.19068108
Sometimes, you’ll want to use the imputed data outside of mice in packages or applications that don’t provide for the ‘pooling’ of all 5 datasets. Then, you might consider selecting one of those 5 datasets for use.
# if you want to use the data outside of mice, you need to reference 1 of the 5 datasets
completedData <- complete(tempData,1) # using imputed dataset 1
complete1<- as.data.frame(completedData)
is.data.frame(complete1)
## [1] TRUE
fit1<- lm(v088 ~ v075 + v079 + v080 + v083 + v087, data=complete1)
summary(fit1)
##
## Call:
## lm(formula = v088 ~ v075 + v079 + v080 + v083 + v087, data = complete1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6741 -0.5522 0.2482 0.5263 2.5539
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.02504 0.15386 0.163 0.8707
## v075 0.40193 0.02733 14.704 < 2e-16 ***
## v079 0.18432 0.03196 5.768 1.06e-08 ***
## v080 0.12262 0.02221 5.520 4.28e-08 ***
## v083 0.03888 0.01959 1.985 0.0474 *
## v087 0.19760 0.03126 6.322 3.85e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8573 on 1030 degrees of freedom
## Multiple R-squared: 0.4517, Adjusted R-squared: 0.449
## F-statistic: 169.7 on 5 and 1030 DF, p-value: < 2.2e-16
Here, we’re using dataset 2.
# if you want to use the data outside of mice, you need to reference 2 of the 5 datasets
completedData <- complete(tempData,2) # using imputed dataset 2
complete2<- as.data.frame(completedData)
is.data.frame(complete2)
## [1] TRUE
fit2<- lm(v088 ~ v075 + v079 + v080 + v083 + v087, data=complete1)
summary(fit2)
##
## Call:
## lm(formula = v088 ~ v075 + v079 + v080 + v083 + v087, data = complete1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6741 -0.5522 0.2482 0.5263 2.5539
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.02504 0.15386 0.163 0.8707
## v075 0.40193 0.02733 14.704 < 2e-16 ***
## v079 0.18432 0.03196 5.768 1.06e-08 ***
## v080 0.12262 0.02221 5.520 4.28e-08 ***
## v083 0.03888 0.01959 1.985 0.0474 *
## v087 0.19760 0.03126 6.322 3.85e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8573 on 1030 degrees of freedom
## Multiple R-squared: 0.4517, Adjusted R-squared: 0.449
## F-statistic: 169.7 on 5 and 1030 DF, p-value: < 2.2e-16
‘texreg’ provides the facility to stack the results of 1 or more regressions next to each other for comparison purposes. It’s very handy.
#install.packages("texreg")
require(texreg)
## Loading required package: texreg
## Version: 1.36.18
## Date: 2016-10-22
## Author: Philip Leifeld (University of Glasgow)
##
## Please cite the JSS article in your publications -- see citation("texreg").
screenreg(list( fit1, fit2),custom.model.names=c("Fit1", "Fit2"))
##
## =====================================
## Fit1 Fit2
## -------------------------------------
## (Intercept) 0.03 0.03
## (0.15) (0.15)
## v075 0.40 *** 0.40 ***
## (0.03) (0.03)
## v079 0.18 *** 0.18 ***
## (0.03) (0.03)
## v080 0.12 *** 0.12 ***
## (0.02) (0.02)
## v083 0.04 * 0.04 *
## (0.02) (0.02)
## v087 0.20 *** 0.20 ***
## (0.03) (0.03)
## -------------------------------------
## R^2 0.45 0.45
## Adj. R^2 0.45 0.45
## Num. obs. 1036 1036
## RMSE 0.86 0.86
## =====================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
Below, we’ll predict the values of v088, the criterion variable, based on the regression (fit1) and the imputed dataset 1. We’ll also get the correlation between the predicted values of v088 anbd the actual values of v088.
predicted <- predict(fit1, newdata=complete1)
actual <- complete1$v088 # the actual values of v088 for the testing sample
length(actual)
## [1] 1036
length(predicted)
## [1] 1036
cor(actual,predicted) # correlation coefficient
## [1] 0.6720871
cor(actual,predicted)^2 # coefficient of determination, R2
## [1] 0.4517011
cor.test(actual,predicted) # hypothesis test for correlation
##
## Pearson's product-moment correlation
##
## data: actual and predicted
## t = 29.186, df = 1034, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6372671 0.7041686
## sample estimates:
## cor
## 0.6720871
pool.r.squared(modelFit)
## est lo 95 hi 95 fmi
## R^2 0.4552127 0.4088597 0.5000378 0.03682078
pool.r.squared(modelFit,adjusted=TRUE)
## est lo 95 hi 95 fmi
## adj R^2 0.452568 0.406136 0.4974979 0.03703019
An here is a very basic scatter plot using ‘plotly’ to produce interactive plots.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:Hmisc':
##
## subplot
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(complete1, x = actual, y = predicted, type="scatter", mode = "markers",
text = ~paste("Predicted: ", predicted))
# x<- actual
# actual
# y<- predicted
# library('WVPlots') # see https://github.com/WinVector/WVPlots
# WVPlots::ScatterHist(complete1, "x", "y", title="Example Fit" )
# WVPlots::ScatterHist(complete1, "x", "y", smoothmethod="lm", title="Example Linear Fit", annot_size=2)