A charitable organization wishes to develop a machine learning model to improve the cost-effectiveness of its direct marketing campaigns to previous donors. According to its recent mailing records, the typical overall response rate is 10%. Among those who respond (donate) to the mailing, the average donation is $14.50. Each mailing costs $2.00 to produce and send; the mailing includes a gift of personalized address labels and an assortment of cards and envelopes. It is not cost-effective to mail everyone because the expected profit from each mailing is $14.50 x 0.10 - $2.00 = -$0.55.
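As a quick sanity check of that arithmetic (a sketch; the variable names are ours, with values taken from the mailing records above):
avg_donation  <- 14.50  # average gift among responders
response_rate <- 0.10   # typical overall response rate
mailing_cost  <- 2.00   # cost to produce and send one mailing
avg_donation * response_rate - mailing_cost  # -0.55: blanket mailing loses money
mailing_cost / avg_donation                  # ~0.138: break-even response probability for targeting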
We would like to develop a classification model, using data from the most recent campaign, that effectively captures likely donors so that the expected net profit is maximized. The test sample has the more typical 10% response rate.
We would also like to build a prediction model to predict expected gift amounts from donors – the data for this will consist of the records for donors only.
End Goal: Maximize the profits (expected revenue less the costs) on the marketing campaign
The entire dataset consists of 3984 training observations, 2018 validation observations, and 2007 test observations. Weighted sampling has been used, over-representing the responders so that the training and validation samples have approximately equal numbers of donors and non-donors.
# load the data
# charity <- read.csv(file.choose()) # load the "charity.csv" file
charity <- read.csv("D:/Boston College/MS AE Courses/Spring 2018 - Big Data Econometrics/Projects/Final Group Project/charity.csv")
library(psych)
describe(charity)
## vars n mean sd median trimmed mad min max
## ID 1 8009 4005.00 2312.14 4005.00 4005.00 2968.17 1.00 8009.00
## reg1 2 8009 0.20 0.40 0.00 0.13 0.00 0.00 1.00
## reg2 3 8009 0.32 0.47 0.00 0.27 0.00 0.00 1.00
## reg3 4 8009 0.13 0.34 0.00 0.04 0.00 0.00 1.00
## reg4 5 8009 0.14 0.35 0.00 0.05 0.00 0.00 1.00
## home 6 8009 0.87 0.34 1.00 0.96 0.00 0.00 1.00
## chld 7 8009 1.72 1.40 2.00 1.61 1.48 0.00 5.00
## hinc 8 8009 3.91 1.47 4.00 3.89 1.48 1.00 7.00
## genf 9 8009 0.61 0.49 1.00 0.63 0.00 0.00 1.00
## wrat 10 8009 6.91 2.43 8.00 7.35 1.48 0.00 9.00
## avhv 11 8009 182.65 72.72 169.00 174.55 59.30 48.00 710.00
## incm 12 8009 43.47 24.71 38.00 40.19 19.27 3.00 287.00
## inca 13 8009 56.43 24.82 51.00 53.46 19.27 12.00 305.00
## plow 14 8009 14.23 13.41 10.00 12.24 11.86 0.00 87.00
## npro 15 8009 60.03 30.35 58.00 58.88 34.10 2.00 164.00
## tgif 16 8009 113.07 85.48 89.00 99.77 47.44 23.00 2057.00
## lgif 17 8009 22.94 29.95 16.00 17.43 10.38 3.00 681.00
## rgif 18 8009 15.66 12.43 12.00 13.66 8.90 1.00 173.00
## tdon 19 8009 18.86 5.78 18.00 18.31 4.45 5.00 40.00
## tlag 20 8009 6.36 3.70 5.00 5.70 2.97 1.00 34.00
## agif 21 8009 11.68 6.57 10.23 10.83 5.50 1.29 72.27
## donr 22 6002 0.50 0.50 0.00 0.50 0.00 0.00 1.00
## damt 23 6002 7.21 7.36 0.00 6.85 0.00 0.00 27.00
## part* 24 8009 2.00 0.71 2.00 2.00 1.48 1.00 3.00
## range skew kurtosis se
## ID 8008.00 0.00 -1.20 25.84
## reg1 1.00 1.50 0.24 0.00
## reg2 1.00 0.78 -1.40 0.01
## reg3 1.00 2.15 2.63 0.00
## reg4 1.00 2.08 2.33 0.00
## home 1.00 -2.16 2.64 0.00
## chld 5.00 0.27 -0.80 0.02
## hinc 6.00 0.01 -0.09 0.02
## genf 1.00 -0.43 -1.81 0.01
## wrat 9.00 -1.35 0.79 0.03
## avhv 662.00 1.54 4.49 0.81
## incm 284.00 2.05 8.31 0.28
## inca 293.00 1.94 7.87 0.28
## plow 87.00 1.36 1.89 0.15
## npro 162.00 0.31 -0.62 0.34
## tgif 2034.00 6.55 107.52 0.96
## lgif 678.00 7.81 110.38 0.33
## rgif 172.00 2.63 13.92 0.14
## tdon 35.00 1.10 2.12 0.06
## tlag 33.00 2.42 8.41 0.04
## agif 70.98 1.78 6.02 0.07
## donr 1.00 0.00 -2.00 0.01
## damt 27.00 0.12 -1.83 0.10
## part* 2.00 0.00 -1.01 0.01
# describe(log(charity$tdon))
colSums(sapply(charity, is.na))
## ID reg1 reg2 reg3 reg4 home chld hinc genf wrat avhv incm inca plow npro
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## tgif lgif rgif tdon tlag agif donr damt part
## 0 0 0 0 0 0 2007 2007 0
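As a quick check, the partition sizes described earlier can be confirmed directly:
table(charity$part)
# expected counts: test 2007, train 3984, valid 2018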
There are 22 variables (excluding the ID and sample-partition fields) and 8,009 observations in the initial dataset, with no missing data among the predictors; the 2,007 missing values in donr and damt correspond to the test set, where the outcomes are withheld. The observations reflect prior mailing data, including whether the mailing was successful (a donation was received or not) and the amount of the donation, along with numerous other attributes that may or may not be helpful in determining who to send future donation requests to in order to maximize donations. Categorical and numerical variables are separated below:
Categorical: reg1, reg2, reg3, reg4 [4 dummy variables for 5 geographic regions], home [homeowner or not], wrat [wealth rating 0-9], hinc [household income 1-7], genf [gender].
Numeric: chld, avhv, incm, inca, plow, npro, tgif, lgif, rgif, tdon, tlag, agif.
#summary(charity[charity$part=="train",])
charity.train <- charity[charity$part=="train",]
cat_Vars<-c("wrat","reg1","reg2","reg3","reg4","chld","home","genf","donr")
fact_Vars<-c("ID","wrat","reg1","reg2","reg3","reg4","chld","home","genf","donr","part")
library(pastecs)
stat.desc(charity.train[,!(colnames(charity) %in% fact_Vars)], basic = FALSE, norm = TRUE)
## hinc avhv incm inca
## median 4.000000e+00 1.710000e+02 3.900000e+01 5.200000e+01
## mean 3.946285e+00 1.851842e+02 4.428840e+01 5.713554e+01
## SE.mean 2.213459e-02 1.183481e+00 3.988590e-01 4.004511e-01
## CI.mean.0.95 4.339619e-02 2.320286e+00 7.819869e-01 7.851082e-01
## var 1.951922e+00 5.580102e+03 6.338086e+02 6.388784e+02
## std.dev 1.397112e+00 7.470008e+01 2.517556e+01 2.527604e+01
## coef.var 3.540322e-01 4.033825e-01 5.684458e-01 4.423874e-01
## skewness -9.722617e-04 1.547448e+00 1.999558e+00 1.866605e+00
## skew.2SE -1.253143e-02 1.994497e+01 2.577219e+01 2.405857e+01
## kurtosis 1.278076e-01 4.368741e+00 8.100369e+00 6.950879e+00
## kurt.2SE 8.238593e-01 2.816129e+01 5.221570e+01 4.480598e+01
## normtest.W 9.131411e-01 9.015639e-01 8.610426e-01 8.762568e-01
## normtest.p 5.278763e-43 5.628907e-45 1.199436e-50 1.060381e-48
## plow npro tgif lgif
## median 1.000000e+01 6.000000e+01 9.100000e+01 1.500000e+01
## mean 1.373268e+01 6.162927e+01 1.167447e+02 2.319302e+01
## SE.mean 2.074478e-01 4.806751e-01 1.353938e+00 4.955428e-01
## CI.mean.0.95 4.067137e-01 9.423923e-01 2.654476e+00 9.715412e-01
## var 1.714497e+02 9.204975e+02 7.303260e+03 9.783215e+02
## std.dev 1.309388e+01 3.033970e+01 8.545911e+01 3.127813e+01
## coef.var 9.534833e-01 4.922937e-01 7.320169e-01 1.348601e+00
## skewness 1.380889e+00 2.794947e-01 5.051670e+00 7.870416e+00
## skew.2SE 1.779820e+01 3.602393e+00 6.511071e+01 1.014414e+02
## kurtosis 1.933290e+00 -6.152998e-01 7.028005e+01 1.053568e+02
## kurt.2SE 1.246216e+01 -3.966277e+00 4.530314e+02 6.791391e+02
## normtest.W 8.686586e-01 9.814236e-01 7.040989e-01 4.608955e-01
## normtest.p 1.071742e-49 1.847981e-22 2.654817e-64 2.469441e-76
## rgif tdon tlag agif
## median 1.200000e+01 1.800000e+01 5.000000e+00 1.022000e+01
## mean 1.554669e+01 1.881476e+01 6.301958e+00 1.165924e+01
## SE.mean 1.913403e-01 8.844124e-02 5.700427e-02 1.015743e-01
## CI.mean.0.95 3.751342e-01 1.733943e-01 1.117603e-01 1.991425e-01
## var 1.458587e+02 3.116226e+01 1.294596e+01 4.110429e+01
## std.dev 1.207720e+01 5.582317e+00 3.598049e+00 6.411263e+00
## coef.var 7.768342e-01 2.966988e-01 5.709415e-01 5.498869e-01
## skewness 2.652387e+00 1.126548e+00 2.414226e+00 1.643531e+00
## skew.2SE 3.418647e+01 1.452002e+01 3.111683e+01 2.118338e+01
## kurtosis 1.569152e+01 2.422921e+00 8.450899e+00 4.949063e+00
## kurt.2SE 1.011489e+02 1.561836e+01 5.447524e+01 3.190210e+01
## normtest.W 7.988907e-01 9.229075e-01 7.721709e-01 8.843321e-01
## normtest.p 4.074392e-57 3.628989e-41 2.170872e-59 1.380742e-47
## damt
## median 1.000000e+01
## mean 7.260040e+00
## SE.mean 1.168954e-01
## CI.mean.0.95 2.291804e-01
## var 5.443952e+01
## std.dev 7.378314e+00
## coef.var 1.016291e+00
## skewness 1.026073e-01
## skew.2SE 1.322500e+00
## kurtosis -1.844542e+00
## kurt.2SE -1.189008e+01
## normtest.W 7.412550e-01
## normtest.p 9.344744e-62
The skew.2SE and kurt.2SE statistics above are the skewness and kurtosis values divided by twice their standard errors. If the absolute value of either statistic exceeds 1, the corresponding skewness or kurtosis is significant (at p < .05), suggesting strong potential for non-normality.
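To make the construction concrete, here is a minimal sketch for a single variable, assuming the common large-sample approximation SE(skew) = sqrt(6/n); pastecs uses a slightly different small-sample formula, so values may differ a little:
n <- nrow(charity.train)
se.skew <- sqrt(6 / n)                           # approximate standard error of skewness
psych::skew(charity.train$avhv) / (2 * se.skew)  # |value| > 1 => significant at p < .05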
library(purrr)
library(tidyr)
library(ggplot2)
# DENSITY PLOTS TO CHECK NORMALITY OF NUMERIC VARIABLES
charity.train[,-1] %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
# stat_density() + # Fills in the density graph/area.
facet_wrap(~ key, scales = "free") +
geom_density()
# BAR GRAPHS TO CHECK THE BALANCE OF CATEGORICAL/FACTOR VARIABLES
charity.train[,cat_Vars] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~key, scales = "free") +
geom_bar()
# ggplot(charity.train, aes(donr)) +
# facet_grid(genf~.)
# geom_bar(stat="identity")
Let's try to answer some questions. The npro and tdon variables are closest to normal; this means most of our data will need to be transformed. We can see that agif, avhv, home, inca, incm, lgif, plow, rgif, tgif, tlag and wrat are highly skewed. Several other variables have moderate skewness.
boxplot(charity.train[-1])
oth_Vars<- c("hinc","tgif","agif","tlag","tdon","damt")
outl<-names(charity.train) %in% c(fact_Vars,oth_Vars)
boxplot(charity.train[!outl])
The variables in this second boxplot may need to be treated for outliers.
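One common treatment (not applied in this analysis) is to cap extreme values at the Tukey fences; a sketch with a hypothetical helper:
# hypothetical winsorizing helper: cap values at the 1.5 x IQR fences
cap_outliers <- function(x) {
  qs <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- qs[2] - qs[1]
  pmin(pmax(x, qs[1] - 1.5 * iqr), qs[2] + 1.5 * iqr)
}
# e.g. charity.train$avhv.capped <- cap_outliers(charity.train$avhv)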
Let's pre-process the data based on the behaviour observed in the exploration section above.
# predictor transformations
# charity.t <- charity
# charity.t$avhv <- log(charity.t$avhv)
# boxplot(charity.t$avhv)
# set up data for analysis
data.train <- charity.train
x.train <- data.train[,2:21]
y.train.donr <- data.train[,22] # donr
y.train.damt <- data.train[y.train.donr==1,23] # damt for observations with donr=1
# y.train.damt <- data.train["donr"==1,23] # damt for observations with donr=1
n.train.donr <- length(y.train.donr) # 3984
n.train.damt <- length(y.train.damt) # 1995
data.valid <- charity[charity$part=="valid",]
x.valid <- data.valid[,2:21]
y.valid.donr <- data.valid[,22] # donr
y.valid.damt <- data.valid[y.valid.donr==1,23] # damt for observations with donr=1
n.valid.donr <- length(y.valid.donr) # 2018
n.valid.damt <- length(y.valid.damt) # 999
data.test <- charity[charity$part=="test",]
n.test <- dim(data.test)[1] # 2007
x.test <- data.test[,2:21]
# standardize to have zero mean and unit sd
x.train.mean <- apply(x.train, 2, mean)
x.train.sd <- apply(x.train, 2, sd)
# performing the standardization at this step
x.train.std <- t((t(x.train)-x.train.mean)/x.train.sd)
apply(x.train.std, 2, mean) # check zero mean
## reg1 reg2 reg3 reg4 home
## 2.151811e-17 -2.526099e-17 3.693258e-17 -6.017778e-17 -9.663428e-18
## chld hinc genf wrat avhv
## -2.051129e-17 -1.463197e-17 4.465563e-17 -1.062688e-16 1.816327e-17
## incm inca plow npro tgif
## -1.335981e-16 8.260581e-17 4.848389e-17 -6.655491e-17 2.532023e-17
## lgif rgif tdon tlag agif
## 2.669731e-17 5.429393e-18 -1.835154e-16 1.010477e-16 -1.258480e-16
apply(x.train.std, 2, sd) # check unit sd
## reg1 reg2 reg3 reg4 home chld hinc genf wrat avhv incm inca plow npro tgif
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## lgif rgif tdon tlag agif
## 1 1 1 1 1
data.train.std.donr <- data.frame(x.train.std, donr=y.train.donr) # to classify donr
data.train.std.damt <- data.frame(x.train.std[y.train.donr==1,], damt=y.train.damt) # to predict damt when donr=1
x.valid.std <- t((t(x.valid)-x.train.mean)/x.train.sd) # standardize using training mean and sd
data.valid.std.donr <- data.frame(x.valid.std, donr=y.valid.donr) # to classify donr
data.valid.std.damt <- data.frame(x.valid.std[y.valid.donr==1,], damt=y.valid.damt) # to predict damt when donr=1
x.test.std <- t((t(x.test)-x.train.mean)/x.train.sd) # standardize using training mean and sd
data.test.std <- data.frame(x.test.std)
We have separated the train, validation and test datasets. We also standardized all of the columns to avoid scaling issues that affect some statistical methods.
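As a sanity check (not part of the original pipeline), the same standardization can be written with base R's scale(), reusing the training means and standard deviations everywhere; it should agree with the t()-based version above:
x.train.std.alt <- scale(x.train, center = x.train.mean, scale = x.train.sd)
all.equal(x.train.std, x.train.std.alt, check.attributes = FALSE)  # expect TRUE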
Before actually starting to model, here is a quick look at identifying significant variables.
cor(data.train.std.donr[, !(colnames(data.train.std.donr) %in% fact_Vars)])
## hinc avhv incm inca plow
## hinc 1.000000000 0.017689974 0.012054151 0.0131884781 -0.006398353
## avhv 0.017689974 1.000000000 0.722519341 0.8408769868 -0.625136632
## incm 0.012054151 0.722519341 1.000000000 0.8729018527 -0.655455152
## inca 0.013188478 0.840876987 0.872901853 1.0000000000 -0.634632891
## plow -0.006398353 -0.625136632 -0.655455152 -0.6346328906 1.000000000
## npro 0.023097994 -0.002658239 0.017073640 0.0182656423 -0.011667686
## tgif 0.011496846 0.018886905 0.041677073 0.0354444121 -0.017615631
## lgif -0.009139114 -0.011299427 0.006731373 -0.0075395084 0.005928332
## rgif -0.002380876 -0.003325955 0.002748777 -0.0053741219 -0.013643922
## tdon 0.014723148 -0.008495970 -0.013859730 -0.0162847340 0.016682061
## tlag -0.030735162 -0.004848666 -0.021305788 -0.0169451119 0.021362168
## agif -0.004664742 -0.002257410 0.010328347 -0.0001531168 -0.013683353
## npro tgif lgif rgif tdon
## hinc 0.0230979942 0.01149685 -0.009139114 -0.002380876 0.0147231478
## avhv -0.0026582392 0.01888690 -0.011299427 -0.003325955 -0.0084959698
## incm 0.0170736403 0.04167707 0.006731373 0.002748777 -0.0138597299
## inca 0.0182656423 0.03544441 -0.007539508 -0.005374122 -0.0162847340
## plow -0.0116676863 -0.01761563 0.005928332 -0.013643922 0.0166820612
## npro 1.0000000000 0.72664798 -0.001328901 -0.012489364 0.0009863842
## tgif 0.7266479833 1.00000000 0.173434853 0.073608064 -0.0113031199
## lgif -0.0013289011 0.17343485 1.000000000 0.696059825 0.0036342686
## rgif -0.0124893642 0.07360806 0.696059825 1.000000000 0.0063324838
## tdon 0.0009863842 -0.01130312 0.003634269 0.006332484 1.0000000000
## tlag 0.0057474624 -0.01172427 0.016816132 0.012158261 -0.0061519123
## agif 0.0022454370 0.05577226 0.609630157 0.705341391 0.0079658768
## tlag agif
## hinc -0.030735162 -0.0046647419
## avhv -0.004848666 -0.0022574096
## incm -0.021305788 0.0103283470
## inca -0.016945112 -0.0001531168
## plow 0.021362168 -0.0136833534
## npro 0.005747462 0.0022454370
## tgif -0.011724267 0.0557722567
## lgif 0.016816132 0.6096301573
## rgif 0.012158261 0.7053413905
## tdon -0.006151912 0.0079658768
## tlag 1.000000000 0.0299336658
## agif 0.029933666 1.0000000000
cor(data.train.std.donr[,c("incm","inca","plow")])
## incm inca plow
## incm 1.0000000 0.8729019 -0.6554552
## inca 0.8729019 1.0000000 -0.6346329
## plow -0.6554552 -0.6346329 1.0000000
cor(data.train.std.donr[,c("npro","tgif")])
## npro tgif
## npro 1.000000 0.726648
## tgif 0.726648 1.000000
cor(data.train.std.donr[,c("lgif","rgif","agif")])
## lgif rgif agif
## lgif 1.0000000 0.6960598 0.6096302
## rgif 0.6960598 1.0000000 0.7053414
## agif 0.6096302 0.7053414 1.0000000
We see that, as expected, there are high correlations among inca, incm and plow; similarly high correlation between npro and tgif; and among lgif, rgif and agif.
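One way to flag such redundant predictors automatically is caret's findCorrelation (a sketch; caret is loaded later for the modeling, and the 0.7 cutoff is our choice):
num.cors <- cor(data.train.std.donr[, !(colnames(data.train.std.donr) %in% fact_Vars)])
caret::findCorrelation(num.cors, cutoff = 0.7, names = TRUE)  # candidates to drop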
chk<-lm(damt~.,data=data.train.std.damt)
summary(chk)
##
## Call:
## lm(formula = damt ~ ., data = data.train.std.damt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4624 -0.7966 -0.1533 0.5999 9.1086
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.18934 0.04735 299.660 < 2e-16 ***
## reg1 -0.03923 0.03962 -0.990 0.32217
## reg2 -0.07434 0.04294 -1.731 0.08352 .
## reg3 0.32690 0.04041 8.089 1.04e-15 ***
## reg4 0.63517 0.04158 15.275 < 2e-16 ***
## home 0.23834 0.06073 3.925 8.99e-05 ***
## chld -0.60477 0.03794 -15.939 < 2e-16 ***
## hinc 0.50143 0.03984 12.587 < 2e-16 ***
## genf -0.06318 0.02850 -2.217 0.02675 *
## wrat -0.00109 0.04150 -0.026 0.97905
## avhv -0.04815 0.05136 -0.937 0.34864
## incm 0.29408 0.05845 5.031 5.32e-07 ***
## inca 0.04726 0.07161 0.660 0.50936
## plow 0.24829 0.04341 5.719 1.23e-08 ***
## npro 0.13613 0.04442 3.065 0.00221 **
## tgif 0.05965 0.04603 1.296 0.19517
## lgif -0.05501 0.03843 -1.431 0.15251
## rgif 0.51685 0.04387 11.783 < 2e-16 ***
## tdon 0.07254 0.03493 2.077 0.03796 *
## tlag 0.02206 0.03366 0.655 0.51229
## agif 0.67139 0.04048 16.585 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.273 on 1974 degrees of freedom
## Multiple R-squared: 0.5722, Adjusted R-squared: 0.5679
## F-statistic: 132 on 20 and 1974 DF, p-value: < 2.2e-16
So, the significant variables worth keeping in the model would be: reg3, reg4, home, chld, hinc, genf, incm, plow, npro, rgif, tdon, agif.
chk2<-lm(damt ~ reg3+reg4+home+chld+hinc+genf+incm+plow+npro+rgif+tdon+agif, data=data.train.std.damt )
summary(chk2)
##
## Call:
## lm(formula = damt ~ reg3 + reg4 + home + chld + hinc + genf +
## incm + plow + npro + rgif + tdon + agif, data = data.train.std.damt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3693 -0.8003 -0.1494 0.6141 9.1133
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.15939 0.04207 336.556 < 2e-16 ***
## reg3 0.36159 0.03334 10.845 < 2e-16 ***
## reg4 0.67092 0.03431 19.556 < 2e-16 ***
## home 0.24795 0.06032 4.110 4.11e-05 ***
## chld -0.62278 0.03612 -17.242 < 2e-16 ***
## hinc 0.49969 0.03969 12.589 < 2e-16 ***
## genf -0.06388 0.02846 -2.244 0.0249 *
## incm 0.31516 0.03459 9.110 < 2e-16 ***
## plow 0.25773 0.04143 6.220 6.03e-10 ***
## npro 0.18506 0.02871 6.445 1.45e-10 ***
## rgif 0.49120 0.03816 12.873 < 2e-16 ***
## tdon 0.07140 0.03490 2.046 0.0409 *
## agif 0.65710 0.03940 16.677 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.273 on 1982 degrees of freedom
## Multiple R-squared: 0.5706, Adjusted R-squared: 0.568
## F-statistic: 219.5 on 12 and 1982 DF, p-value: < 2.2e-16
# R-sq=57.06
chk3<-lm(damt ~ reg3+reg4+home+chld+hinc+incm+plow+npro+rgif+agif, data=data.train.std.damt )
summary(chk3)
##
## Call:
## lm(formula = damt ~ reg3 + reg4 + home + chld + hinc + incm +
## plow + npro + rgif + agif, data = data.train.std.damt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4191 -0.7948 -0.1510 0.6149 9.0839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.14803 0.04182 338.279 < 2e-16 ***
## reg3 0.35818 0.03338 10.731 < 2e-16 ***
## reg4 0.66861 0.03436 19.458 < 2e-16 ***
## home 0.25117 0.06042 4.157 3.36e-05 ***
## chld -0.62972 0.03611 -17.438 < 2e-16 ***
## hinc 0.50125 0.03973 12.616 < 2e-16 ***
## incm 0.31665 0.03465 9.138 < 2e-16 ***
## plow 0.25782 0.04150 6.212 6.37e-10 ***
## npro 0.18564 0.02876 6.455 1.36e-10 ***
## rgif 0.49263 0.03822 12.888 < 2e-16 ***
## agif 0.65524 0.03946 16.605 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.275 on 1984 degrees of freedom
## Multiple R-squared: 0.5685, Adjusted R-squared: 0.5663
## F-statistic: 261.4 on 10 and 1984 DF, p-value: < 2.2e-16
# genf+sqrt(tdon)+
# R-sq=56.85
chk4<-lm(damt ~ reg3+reg4+chld+hinc+log(incm)+log(rgif)+agif, data=data.train.std.damt )
## Warning in log(incm): NaNs produced
## Warning in log(rgif): NaNs produced
summary(chk4)
##
## Call:
## lm(formula = damt ~ reg3 + reg4 + chld + hinc + log(incm) + log(rgif) +
## agif, data = data.train.std.damt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9129 -0.7318 -0.1961 0.5439 7.8487
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.25314 0.11219 135.956 < 2e-16 ***
## reg3 0.47541 0.07625 6.235 1.36e-09 ***
## reg4 0.61516 0.07257 8.476 7.56e-16 ***
## chld -0.66103 0.07568 -8.734 < 2e-16 ***
## hinc 0.37056 0.09670 3.832 0.000152 ***
## log(incm) 0.18383 0.05555 3.309 0.001038 **
## log(rgif) 0.12424 0.05914 2.101 0.036412 *
## agif 0.58948 0.06926 8.511 5.91e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.194 on 334 degrees of freedom
## (1653 observations deleted due to missingness)
## Multiple R-squared: 0.5041, Adjusted R-squared: 0.4937
## F-statistic: 48.5 on 7 and 334 DF, p-value: < 2.2e-16
#R-sq=50.41
Note that taking log() of the standardized predictors, which include negative values, produces NaNs (hence the warnings above and the 1,653 dropped observations); any log transformation should be applied before standardization.
We began with a few simple approaches and moved on to using some advanced ones as shown below:
classify_donr<- function(modelName,Vars,k) {
library(caret)
if (is.null(Vars)) {
cl.eval.formula<-eval(parse(text= paste("donr ~.")))
} else {
cl.p.list<-Vars[1]
for (i in 2:length(Vars))
cl.p.list<- paste(cl.p.list,Vars[i],sep = "+")
cl.eval.formula<-eval(parse(text= paste("donr ~",cl.p.list)))
}
if (modelName == "lda") {
library(MASS)
model<-lda(eval(cl.eval.formula),data.train.std.donr)
post.valid <- predict(model, data.valid.std.donr)$posterior[,2] # 2018 posterior probabilities
} else if (modelName == "qda") {
library(MASS)
model<-qda(eval(cl.eval.formula),data.train.std.donr)
post.valid <- predict(model, data.valid.std.donr)$posterior[,2] # 2018 posterior probabilities
} else if (modelName == "logistic") {
model<-glm(eval(cl.eval.formula),data.train.std.donr,family="binomial")
post.valid <- predict(model, data.frame(x.valid.std), type="response") # 2018 probabilities
} else if (modelName == "knn") {
library(class)
post.valid<-knn(data.train.std.donr[,Vars],data.valid.std.donr[,Vars],y.train.donr,k)
} else if (modelName == "decision.tree") {
library(rpart)
model<-rpart(eval(cl.eval.formula),data.train.std.donr,method="class")
post.valid<-predict(model,data.valid.std.donr)[,2]
} else if (modelName == "svm") {
library(e1071)
model<-svm(eval(cl.eval.formula),data.train.std.donr,kernel="polynomial")
post.valid<-predict(model,data.valid.std.donr)
} else if (modelName == "naive.bayes") {
library(e1071)
model<-naiveBayes(eval(cl.eval.formula),data.train.std.donr)
post.valid<-predict(model,data.valid.std.donr)
} else if (modelName == "gam") {
library(gam)
model<-gam(donr~.,data.train.std.donr, family = "gaussian") # note: uses all predictors regardless of Vars
post.valid<-predict(model,data.valid.std.donr)
} else if (modelName == "randomforest") {
library(randomForest)
model<-randomForest(as.factor(donr)~.,data.train.std.donr,mtry=16,ntree=500)
post.valid<-predict(model,data.valid.std.donr)
}
# Profit calculations using average donation = $14.50 and mailing cost = $2
profit <- cumsum(14.5*y.valid.donr[order(post.valid, decreasing=T)]-2)
plot(profit, main=modelName) # see how profits change as more mailings are made
n.mail.valid <- which.max(profit) # number of mailings that maximizes profits
c(n.mail.valid, max(profit)) # report number of mailings and maximum profit
if (!(modelName %in% c("knn","randomforest"))) {
cutoff <- sort(post.valid, decreasing=T)[n.mail.valid+1] # set cutoff based on n.mail.valid
pred.valid.donr <- ifelse(post.valid>cutoff, 1, 0) # mail to everyone above the cutoff
class_table<-table(pred.valid.donr, y.valid.donr) # classification table
} else {
class_table<-table(post.valid, y.valid.donr) # classification table
}
print(class_table)
print(confusionMatrix(class_table,positive="1")$overall["Accuracy"])
return(max(profit))
}
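To make the profit-and-cutoff logic inside classify_donr concrete, here is a toy illustration with hypothetical posterior scores (not data from the project):
post  <- c(0.9, 0.8, 0.3, 0.2, 0.1)  # hypothetical scores
donor <- c(1, 1, 0, 1, 0)            # hypothetical outcomes
profit <- cumsum(14.5 * donor[order(post, decreasing = TRUE)] - 2)
profit             # 12.5 25.0 23.0 35.5 33.5
which.max(profit)  # 4: mail only the four highest-scoring prospects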
# Function calls
Vars= c("reg4","home" ,"chld" ,"hinc","I(hinc^2)")
Vars2=c("reg3","reg4","home", "chld", "hinc", "genf", "incm", "plow", "npro", "rgif", "tdon", "agif")
Vars3=c("reg3","reg4","home", "chld", "hinc","I(hinc^2)", "genf", "incm", "plow", "npro", "rgif", "tdon", "agif")
paste("Maximum profit earned with LDA model is: ", classify_donr("lda",Vars2))
## y.valid.donr
## pred.valid.donr 0 1
## 0 489 31
## 1 530 968
## Accuracy
## 0.722002
## [1] "Maximum profit earned with LDA model is: 11040"
Vars1=c("reg3","reg4","home", "chld", "hinc", "genf", "incm", "plow", "npro", "rgif", "tdon")
paste("Maximum profit earned with QDA model is: ", classify_donr("qda",Vars1))
## y.valid.donr
## pred.valid.donr 0 1
## 0 562 37
## 1 457 962
## Accuracy
## 0.7552032
## [1] "Maximum profit earned with QDA model is: 11111"
Vars2=c("reg3","reg4","home", "chld", "hinc", "genf", "incm", "plow", "npro", "rgif", "tdon", "agif")
paste("Maximum profit earned with logistic model is: ", classify_donr("logistic",Vars2))
## y.valid.donr
## pred.valid.donr 0 1
## 0 389 12
## 1 630 987
## Accuracy
## 0.6818632
## [1] "Maximum profit earned with logistic model is: 11077.5"
Vars<-NULL
paste("Maximum profit earned with logistic model is: ", classify_donr("logistic",Vars))
## y.valid.donr
## pred.valid.donr 0 1
## 0 612 20
## 1 407 979
## Accuracy
## 0.7884044
## [1] "Maximum profit earned with logistic model is: 11423.5"
#Vars2=c("reg3","reg4","home", "chld", "hinc", "genf", "incm", "plow", "npro", "rgif", "tdon", "agif")
#paste("Maximum profit earned with logistic model is: ", classify_donr("logistic",Vars3))
Vars.lo=c("reg1","reg2","home","chld","wrat","incm","plow","npro","tgif","tdon","tlag")
paste("Maximum profit earned with logistic model is: ", classify_donr("logistic",Vars.lo))
## y.valid.donr
## pred.valid.donr 0 1
## 0 610 21
## 1 409 978
## Accuracy
## 0.7869177
## [1] "Maximum profit earned with logistic model is: 11407"
# knn selects columns directly, so a new column must be created if a transformation is needed.
Vars2=c("reg3","reg4","home", "chld", "hinc", "genf", "incm", "plow", "npro", "rgif", "tdon", "agif")
Vars1=c("reg3","reg4","home", "chld", "hinc", "genf", "incm", "plow", "npro", "rgif", "tdon")
paste("Maximum profit earned with knn model is: ", classify_donr("knn",Vars1,5))
## y.valid.donr
## post.valid 0 1
## 0 728 128
## 1 291 871
## Accuracy
## 0.7923687
## [1] "Maximum profit earned with knn model is: 10538.5"
paste("Maximum profit earned with knn model is: ", classify_donr("knn",Vars2,15))
## y.valid.donr
## post.valid 0 1
## 0 733 115
## 1 286 884
## Accuracy
## 0.8012884
## [1] "Maximum profit earned with knn model is: 10549"
paste("Maximum profit earned with knn model is: ", classify_donr("decision.tree",Vars2))
## y.valid.donr
## pred.valid.donr 0 1
## 0 667 73
## 1 352 926
## Accuracy
## 0.7893954
## [1] "Maximum profit earned with knn model is: 11116"
paste("Maximum profit earned with knn model is: ", classify_donr("svm",Vars2))
## y.valid.donr
## pred.valid.donr 0 1
## 0 555 48
## 1 464 951
## Accuracy
## 0.7462834
## [1] "Maximum profit earned with knn model is: 10959.5"
Vars4=c("reg4","home", "chld", "hinc","I(hinc^2)", "genf", "incm", "plow", "npro", "tdon", "agif","tgif")
# "I(sqrt(tdon))"
paste("Maximum profit earned with knn model is: ", classify_donr("svm",Vars4))
## y.valid.donr
## pred.valid.donr 0 1
## 0 475 23
## 1 544 976
## Accuracy
## 0.7190287
## [1] "Maximum profit earned with knn model is: 11112"
# data.train.std.donr$log.tgif<-log(data.train.std.donr$tgif);
# data.train.std.donr$log.agif<-log(data.train.std.donr$agif)
# data.train.std.donr$log.incm<-log(data.train.std.donr$incm)
# data.train.std.donr$sqrt.tdon<-sqrt(data.train.std.donr$tdon)
paste("Maximum profit earned with GAM model is: ", classify_donr("gam",Vars2))
## y.valid.donr
## pred.valid.donr 0 1
## 0 534 12
## 1 485 987
## Accuracy
## 0.7537166
## [1] "Maximum profit earned with GAM model is: 11367.5"
Vars2=c("reg3","reg4","home", "chld", "hinc", "genf", "incm", "plow", "npro", "rgif", "tdon", "agif")
# Vars2=c("reg3","reg4","home", "chld", "hinc", "genf", "incm", "plow", "npro", "rgif", "tdon", "agif","log.tgif","log.agif","log.incm","sqrt.tdon")
paste("Maximum profit earned with knn model is: ", classify_donr("randomforest",Vars2))
## y.valid.donr
## post.valid 0 1
## 0 888 89
## 1 131 910
## Accuracy
## 0.8909812
## [1] "Maximum profit earned with knn model is: 11113"
#reg best subset
#Helps develop a better sense of the variables/predictors. Here best subset selection is assessed with validation MSE and BIC.
#An OLS model is then fit; this test suggests the 10-variable solution is best.
library(leaps)
#best subset selection over the full set of 20 predictors (subsets up to 19 variables)
regfit.full=regsubsets(damt ~ reg1 + reg2 + reg3 + reg4 + home + chld + hinc + genf + wrat +
avhv + incm + inca + plow + npro + tgif + lgif + rgif + tdon + tlag + agif,
data = data.train.std.damt, nvmax=19)
reg.summary=summary(regfit.full)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(damt ~ reg1 + reg2 + reg3 + reg4 + home +
## chld + hinc + genf + wrat + avhv + incm + inca + plow + npro +
## tgif + lgif + rgif + tdon + tlag + agif, data = data.train.std.damt,
## nvmax = 19)
## 20 Variables (and intercept)
## Forced in Forced out
## reg1 FALSE FALSE
## reg2 FALSE FALSE
## reg3 FALSE FALSE
## reg4 FALSE FALSE
## home FALSE FALSE
## chld FALSE FALSE
## hinc FALSE FALSE
## genf FALSE FALSE
## wrat FALSE FALSE
## avhv FALSE FALSE
## incm FALSE FALSE
## inca FALSE FALSE
## plow FALSE FALSE
## npro FALSE FALSE
## tgif FALSE FALSE
## lgif FALSE FALSE
## rgif FALSE FALSE
## tdon FALSE FALSE
## tlag FALSE FALSE
## agif FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: exhaustive
## reg1 reg2 reg3 reg4 home chld hinc genf wrat avhv incm inca plow
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " "*" " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " "*" " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " "*" " " "*" "*" " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " "*" " " "*" "*" " " " " " " " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " "*" "*" " " " " " " " " " " " "
## 7 ( 1 ) " " " " "*" "*" " " "*" "*" " " " " " " "*" " " " "
## 8 ( 1 ) " " " " "*" "*" " " "*" "*" " " " " " " "*" " " "*"
## 9 ( 1 ) " " " " "*" "*" " " "*" "*" " " " " " " "*" " " "*"
## 10 ( 1 ) " " " " "*" "*" "*" "*" "*" " " " " " " "*" " " "*"
## 11 ( 1 ) " " " " "*" "*" "*" "*" "*" "*" " " " " "*" " " "*"
## 12 ( 1 ) " " " " "*" "*" "*" "*" "*" "*" " " " " "*" " " "*"
## 13 ( 1 ) " " "*" "*" "*" "*" "*" "*" "*" " " " " "*" " " "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" " " " " "*" " " "*"
## 15 ( 1 ) " " "*" "*" "*" "*" "*" "*" "*" " " " " "*" " " "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" " " " " "*" " " "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*"
## npro tgif lgif rgif tdon tlag agif
## 1 ( 1 ) " " " " " " " " " " " " "*"
## 2 ( 1 ) " " " " " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " " " " " "*"
## 4 ( 1 ) " " " " " " " " " " " " "*"
## 5 ( 1 ) " " " " " " "*" " " " " "*"
## 6 ( 1 ) " " " " " " "*" " " " " "*"
## 7 ( 1 ) " " " " " " "*" " " " " "*"
## 8 ( 1 ) " " " " " " "*" " " " " "*"
## 9 ( 1 ) "*" " " " " "*" " " " " "*"
## 10 ( 1 ) "*" " " " " "*" " " " " "*"
## 11 ( 1 ) "*" " " " " "*" " " " " "*"
## 12 ( 1 ) "*" " " " " "*" "*" " " "*"
## 13 ( 1 ) "*" " " " " "*" "*" " " "*"
## 14 ( 1 ) "*" " " " " "*" "*" " " "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" " " "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" " " "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" " " "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" " " "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
names(reg.summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
#show r squared
reg.summary$rsq
## [1] 0.2718288 0.3745233 0.4406249 0.4760923 0.5102874 0.5376199 0.5471627
## [8] 0.5562790 0.5647500 0.5685088 0.5696678 0.5705746 0.5710447 0.5713117
## [15] 0.5716925 0.5719149 0.5720135 0.5721088 0.5722019
#plot
par(mfrow=c(2,2))
plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l")
which.min(reg.summary$rss)
## [1] 19
plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l")
#mark the model with the largest adjusted R^2 (16 variables)
which.max(reg.summary$adjr2) #16
## [1] 16
points(16,reg.summary$adjr2[16], col="red",cex=2,pch=20)
#plot Cp and BIC, and mark the smallest value of each
plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp",type='l')
which.min(reg.summary$cp) #13
## [1] 13
points(13,reg.summary$cp[13],col="red",cex=2,pch=20)
which.min(reg.summary$bic) #10
## [1] 10
plot(reg.summary$bic,xlab="Number of Variables",ylab="BIC",type='l')
points(10,reg.summary$bic[10],col="red",cex=2,pch=20)
plot(reg.summary$rsq,xlab="Number of Variables",ylab="RSq",type='l')
which.max(reg.summary$rsq)
## [1] 19
par(mfrow=c(1,1))
plot(regfit.full,scale="r2")
plot(regfit.full,scale="adjr2")
plot(regfit.full,scale="Cp")
plot(regfit.full,scale="bic")
#see coefficients for the 6-variable model
coef(regfit.full,6)
## (Intercept) reg3 reg4 chld hinc rgif
## 14.2931010 0.3736171 0.6889575 -0.5731359 0.5028017 0.4766170
## agif
## 0.6545193
#Choosing Among the Models Using the validation Set Approach
library(leaps)
regfit.best= regsubsets(damt ~ reg1 + reg2 + reg3 + reg4 + home + chld + hinc + genf + wrat +
avhv + incm + inca + plow + npro + tgif + lgif + rgif + tdon + tlag + agif,
data=data.train.std.damt,nvmax=19)
test.mat=model.matrix(damt ~ reg1 + reg2 + reg3 + reg4 + home + chld + hinc + genf + wrat +
avhv + incm + inca + plow + npro + tgif + lgif + rgif + tdon + tlag + agif,
data=data.valid.std.damt)
val.errors=rep(NA,19)
for(i in 1:19){
coefi=coef(regfit.best,id=i)
pred=test.mat[,names(coefi)]%*%coefi
val.errors[i]=mean((data.valid.std.damt$damt-pred)^2)
}
val.errors
## [1] 3.008661 2.499722 2.273558 2.149864 2.055742 1.980266 1.953428
## [8] 1.905913 1.870818 1.857947 1.868510 1.870988 1.859918 1.858703
## [15] 1.862669 1.861431 1.865514 1.867564 1.866597
which.min(val.errors)
## [1] 10
coef(regfit.best,10)
## (Intercept) reg3 reg4 home chld hinc
## 14.1480296 0.3581844 0.6686119 0.2511736 -0.6297229 0.5012535
## incm plow npro rgif agif
## 0.3166469 0.2578157 0.1856372 0.4926266 0.6552395
#best subset selection indicates the 10-variable solution is best
lm <- lm(damt ~ reg3 + reg4 + home + chld + hinc + incm + plow + npro + rgif + agif,
data=data.train.std.damt)
pred.valid.val.er <-predict(lm,newdata = data.valid.std.damt)
mean((y.valid.damt - pred.valid.val.er)^2)
## [1] 1.857947
#MSE 1.857947
For the prediction task, let's again begin with a few simple approaches and move on to advanced ones:
##### PREDICTION MODELING ######
attach(data.train.std.damt)
plot(tlag, damt)
# Least squares regression
model.ls1 <- lm(damt ~ reg1 + reg2 + reg3 + reg4 + home + chld + hinc + genf + wrat +
avhv + incm + inca + plow + npro + tgif + lgif + rgif + tdon + tlag + agif,
data.train.std.damt)
library(car)
library(ISLR)
vif(model.ls1)
## reg1 reg2 reg3 reg4 home chld hinc genf
## 2.085632 2.520273 1.509729 1.520960 1.034332 1.190075 1.010603 1.007176
## wrat avhv incm inca plow npro tgif lgif
## 1.023573 3.598009 4.895007 7.473890 1.884816 2.436902 2.567655 2.298598
## rgif tdon tlag agif
## 2.627160 1.009821 1.016425 2.092438
summary(model.ls1)$adj.r.squared
## [1] 0.5678677
pred.valid.ls1 <- predict(model.ls1, newdata = data.valid.std.damt) # validation predictions
mean((y.valid.damt - pred.valid.ls1)^2) # mean prediction error
## [1] 1.866657
# MSE 1.866657
model.ls1 <- lm(damt ~ reg1 + reg2 + reg3 + reg4 + home + chld + hinc + genf +
avhv + incm + inca + plow + npro + tgif + lgif + rgif + tdon + tlag + agif,
data.train.std.damt)
summary(model.ls1)
##
## Call:
## lm(formula = damt ~ reg1 + reg2 + reg3 + reg4 + home + chld +
## hinc + genf + avhv + incm + inca + plow + npro + tgif + lgif +
## rgif + tdon + tlag + agif, data = data.train.std.damt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4627 -0.7969 -0.1528 0.6007 9.1082
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.18897 0.04521 313.837 < 2e-16 ***
## reg1 -0.03920 0.03959 -0.990 0.3222
## reg2 -0.07430 0.04290 -1.732 0.0834 .
## reg3 0.32690 0.04040 8.091 1.02e-15 ***
## reg4 0.63516 0.04157 15.279 < 2e-16 ***
## home 0.23837 0.06071 3.926 8.92e-05 ***
## chld -0.60490 0.03762 -16.080 < 2e-16 ***
## hinc 0.50148 0.03979 12.604 < 2e-16 ***
## genf -0.06317 0.02849 -2.217 0.0267 *
## avhv -0.04810 0.05131 -0.937 0.3487
## incm 0.29411 0.05843 5.034 5.24e-07 ***
## inca 0.04722 0.07158 0.660 0.5095
## plow 0.24829 0.04340 5.721 1.22e-08 ***
## npro 0.13615 0.04440 3.066 0.0022 **
## tgif 0.05965 0.04602 1.296 0.1951
## lgif -0.05502 0.03842 -1.432 0.1523
## rgif 0.51683 0.04385 11.787 < 2e-16 ***
## tdon 0.07255 0.03492 2.077 0.0379 *
## tlag 0.02206 0.03365 0.655 0.5123
## agif 0.67140 0.04047 16.592 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.272 on 1975 degrees of freedom
## Multiple R-squared: 0.5722, Adjusted R-squared: 0.5681
## F-statistic: 139 on 19 and 1975 DF, p-value: < 2.2e-16
pred.valid.ls1 <- predict(model.ls1, newdata = data.valid.std.damt) # validation predictions
mean((y.valid.damt - pred.valid.ls1)^2) # mean prediction error
## [1] 1.866597
# MSE 1.866597
##########################################################
##########################################################
#reg best subset
#Helps develop a better sense of the variables/predictors. Here best subset selection is assessed with validation MSE and BIC.
#An OLS model is then fit; this test suggests the 10-variable solution is best.
library(leaps)
regfit.full=regsubsets(damt ~ reg1 + reg2 + reg3 + reg4 + home + chld + hinc + genf + wrat +
avhv + incm + inca + plow + npro + tgif + lgif + rgif + tdon + tlag + agif,
data = data.train.std.damt)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(damt ~ reg1 + reg2 + reg3 + reg4 + home +
## chld + hinc + genf + wrat + avhv + incm + inca + plow + npro +
## tgif + lgif + rgif + tdon + tlag + agif, data = data.train.std.damt)
## 20 Variables (and intercept)
## Forced in Forced out
## reg1 FALSE FALSE
## reg2 FALSE FALSE
## reg3 FALSE FALSE
## reg4 FALSE FALSE
## home FALSE FALSE
## chld FALSE FALSE
## hinc FALSE FALSE
## genf FALSE FALSE
## wrat FALSE FALSE
## avhv FALSE FALSE
## incm FALSE FALSE
## inca FALSE FALSE
## plow FALSE FALSE
## npro FALSE FALSE
## tgif FALSE FALSE
## lgif FALSE FALSE
## rgif FALSE FALSE
## tdon FALSE FALSE
## tlag FALSE FALSE
## agif FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## reg1 reg2 reg3 reg4 home chld hinc genf wrat avhv incm inca plow
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " "*" " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " "*" " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " "*" " " "*" "*" " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " "*" " " "*" "*" " " " " " " " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " "*" "*" " " " " " " " " " " " "
## 7 ( 1 ) " " " " "*" "*" " " "*" "*" " " " " " " "*" " " " "
## 8 ( 1 ) " " " " "*" "*" " " "*" "*" " " " " " " "*" " " "*"
## npro tgif lgif rgif tdon tlag agif
## 1 ( 1 ) " " " " " " " " " " " " "*"
## 2 ( 1 ) " " " " " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " " " " " "*"
## 4 ( 1 ) " " " " " " " " " " " " "*"
## 5 ( 1 ) " " " " " " "*" " " " " "*"
## 6 ( 1 ) " " " " " " "*" " " " " "*"
## 7 ( 1 ) " " " " " " "*" " " " " "*"
## 8 ( 1 ) " " " " " " "*" " " " " "*"
#best subset selection for the full 20-variable model
regfit.full=regsubsets(damt ~ reg1 + reg2 + reg3 + reg4 + home + chld + hinc + genf + wrat +
avhv + incm + inca + plow + npro + tgif + lgif + rgif + tdon + tlag + agif, data = data.train.std.damt, nvmax=20)
reg.summary=summary(regfit.full)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(damt ~ reg1 + reg2 + reg3 + reg4 + home +
## chld + hinc + genf + wrat + avhv + incm + inca + plow + npro +
## tgif + lgif + rgif + tdon + tlag + agif, data = data.train.std.damt,
## nvmax = 20)
## 20 Variables (and intercept)
## Forced in Forced out
## reg1 FALSE FALSE
## reg2 FALSE FALSE
## reg3 FALSE FALSE
## reg4 FALSE FALSE
## home FALSE FALSE
## chld FALSE FALSE
## hinc FALSE FALSE
## genf FALSE FALSE
## wrat FALSE FALSE
## avhv FALSE FALSE
## incm FALSE FALSE
## inca FALSE FALSE
## plow FALSE FALSE
## npro FALSE FALSE
## tgif FALSE FALSE
## lgif FALSE FALSE
## rgif FALSE FALSE
## tdon FALSE FALSE
## tlag FALSE FALSE
## agif FALSE FALSE
## 1 subsets of each size up to 20
## Selection Algorithm: exhaustive
## reg1 reg2 reg3 reg4 home chld hinc genf wrat avhv incm inca plow
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " "*" " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " "*" " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " "*" " " "*" "*" " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " "*" " " "*" "*" " " " " " " " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " "*" "*" " " " " " " " " " " " "
## 7 ( 1 ) " " " " "*" "*" " " "*" "*" " " " " " " "*" " " " "
## 8 ( 1 ) " " " " "*" "*" " " "*" "*" " " " " " " "*" " " "*"
## 9 ( 1 ) " " " " "*" "*" " " "*" "*" " " " " " " "*" " " "*"
## 10 ( 1 ) " " " " "*" "*" "*" "*" "*" " " " " " " "*" " " "*"
## 11 ( 1 ) " " " " "*" "*" "*" "*" "*" "*" " " " " "*" " " "*"
## 12 ( 1 ) " " " " "*" "*" "*" "*" "*" "*" " " " " "*" " " "*"
## 13 ( 1 ) " " "*" "*" "*" "*" "*" "*" "*" " " " " "*" " " "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" " " " " "*" " " "*"
## 15 ( 1 ) " " "*" "*" "*" "*" "*" "*" "*" " " " " "*" " " "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" " " " " "*" " " "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*"
## 20 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## npro tgif lgif rgif tdon tlag agif
## 1 ( 1 ) " " " " " " " " " " " " "*"
## 2 ( 1 ) " " " " " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " " " " " "*"
## 4 ( 1 ) " " " " " " " " " " " " "*"
## 5 ( 1 ) " " " " " " "*" " " " " "*"
## 6 ( 1 ) " " " " " " "*" " " " " "*"
## 7 ( 1 ) " " " " " " "*" " " " " "*"
## 8 ( 1 ) " " " " " " "*" " " " " "*"
## 9 ( 1 ) "*" " " " " "*" " " " " "*"
## 10 ( 1 ) "*" " " " " "*" " " " " "*"
## 11 ( 1 ) "*" " " " " "*" " " " " "*"
## 12 ( 1 ) "*" " " " " "*" "*" " " "*"
## 13 ( 1 ) "*" " " " " "*" "*" " " "*"
## 14 ( 1 ) "*" " " " " "*" "*" " " "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" " " "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" " " "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" " " "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" " " "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 20 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
names(reg.summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
#show r squared
reg.summary$rsq
## [1] 0.2718288 0.3745233 0.4406249 0.4760923 0.5102874 0.5376199 0.5471627
## [8] 0.5562790 0.5647500 0.5685088 0.5696678 0.5705746 0.5710447 0.5713117
## [15] 0.5716925 0.5719149 0.5720135 0.5721088 0.5722019 0.5722020
which.max(reg.summary$rsq)
## [1] 20
#plot
par(mfrow=c(2,2))
plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l")
which.min(reg.summary$rss)
## [1] 20
points(20,reg.summary$rss[20], col="red",cex=2,pch=20)
plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l")
which.max(reg.summary$adjr2) #16
## [1] 16
points(16,reg.summary$adjr2[16], col="red",cex=2,pch=20)
#plot Cp and BIC, and show smallest value of each
plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp",type='l')
which.min(reg.summary$cp) #13
## [1] 13
points(13,reg.summary$cp[13],col="red",cex=2,pch=20)
which.min(reg.summary$bic)
## [1] 10
plot(reg.summary$bic,xlab="Number of Variables",ylab="BIC",type='l')
points(10,reg.summary$bic[10],col="red",cex=2,pch=20)
plot(regfit.full,scale="r2")
plot(regfit.full,scale="adjr2")
plot(regfit.full,scale="Cp")
plot(regfit.full,scale="bic")
#see coefficients for the 6-variable model
coef(regfit.full,6)
## (Intercept) reg3 reg4 chld hinc rgif
## 14.2931010 0.3736171 0.6889575 -0.5731359 0.5028017 0.4766170
## agif
## 0.6545193
#Choosing Among the Models Using the validation Set Approach
library(leaps)
regfit.best= regsubsets(damt ~ reg1 + reg2 + reg3 + reg4 + home + chld + hinc + genf + wrat +
avhv + incm + inca + plow + npro + tgif + lgif + rgif + tdon + tlag + agif,
data=data.train.std.damt,nvmax=20)
test.mat=model.matrix(damt ~ reg1 + reg2 + reg3 + reg4 + home + chld + hinc + genf + wrat +
avhv + incm + inca + plow + npro + tgif + lgif + rgif + tdon + tlag + agif,
data=data.valid.std.damt)
val.errors=rep(NA,20)
for(i in 1:20){
coefi=coef(regfit.best,id=i)
pred=test.mat[,names(coefi)]%*%coefi
val.errors[i]=mean((data.valid.std.damt$damt-pred)^2)
}
par(mfrow=c(1,1))
plot(val.errors)
which.min(val.errors)
## [1] 10
coef(regfit.best,10)
## (Intercept) reg3 reg4 home chld hinc
## 14.1480296 0.3581844 0.6686119 0.2511736 -0.6297229 0.5012535
## incm plow npro rgif agif
## 0.3166469 0.2578157 0.1856372 0.4926266 0.6552395
#best subset selection indicates the 10-variable solution is best
lm <- lm(damt ~ reg3 + reg4 + home + chld + hinc +
incm + plow + npro + rgif + agif,
data=data.train.std.damt)
summary(lm)
##
## Call:
## lm(formula = damt ~ reg3 + reg4 + home + chld + hinc + incm +
## plow + npro + rgif + agif, data = data.train.std.damt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4191 -0.7948 -0.1510 0.6149 9.0839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.14803 0.04182 338.279 < 2e-16 ***
## reg3 0.35818 0.03338 10.731 < 2e-16 ***
## reg4 0.66861 0.03436 19.458 < 2e-16 ***
## home 0.25117 0.06042 4.157 3.36e-05 ***
## chld -0.62972 0.03611 -17.438 < 2e-16 ***
## hinc 0.50125 0.03973 12.616 < 2e-16 ***
## incm 0.31665 0.03465 9.138 < 2e-16 ***
## plow 0.25782 0.04150 6.212 6.37e-10 ***
## npro 0.18564 0.02876 6.455 1.36e-10 ***
## rgif 0.49263 0.03822 12.888 < 2e-16 ***
## agif 0.65524 0.03946 16.605 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.275 on 1984 degrees of freedom
## Multiple R-squared: 0.5685, Adjusted R-squared: 0.5663
## F-statistic: 261.4 on 10 and 1984 DF, p-value: < 2.2e-16
pred.valid.ls1 <-predict(lm,newdata = data.valid.std.damt)
mean((y.valid.damt - pred.valid.ls1)^2)
## [1] 1.857947
#MSE 1.857947
#for comparison, fit the 15-variable model
coef(regfit.best,15)
## (Intercept) reg2 reg3 reg4 home chld
## 14.17536619 -0.04429253 0.34502115 0.65534029 0.24264951 -0.61172393
## hinc genf incm plow npro tgif
## 0.50238983 -0.06204784 0.30981710 0.25905308 0.13625361 0.06262015
## lgif rgif tdon agif
## -0.05554278 0.51707167 0.07226416 0.67105728
lm2 <- lm(damt ~ reg2 + reg3 + reg4 + home + chld + hinc +
genf + incm + plow + npro + tgif + lgif + rgif + tdon + agif,
data=data.train.std.damt)
summary(lm2)
##
## Call:
## lm(formula = damt ~ reg2 + reg3 + reg4 + home + chld + hinc +
## genf + incm + plow + npro + tgif + lgif + rgif + tdon + agif,
## data = data.train.std.damt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4085 -0.8029 -0.1559 0.6016 9.1253
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.17537 0.04378 323.773 < 2e-16 ***
## reg2 -0.04429 0.03099 -1.429 0.15308
## reg3 0.34502 0.03507 9.837 < 2e-16 ***
## reg4 0.65534 0.03605 18.179 < 2e-16 ***
## home 0.24265 0.06057 4.006 6.40e-05 ***
## chld -0.61172 0.03724 -16.427 < 2e-16 ***
## hinc 0.50239 0.03973 12.646 < 2e-16 ***
## genf -0.06205 0.02846 -2.180 0.02937 *
## incm 0.30982 0.03474 8.918 < 2e-16 ***
## plow 0.25905 0.04144 6.251 4.97e-10 ***
## npro 0.13625 0.04430 3.076 0.00213 **
## tgif 0.06262 0.04589 1.364 0.17258
## lgif -0.05554 0.03838 -1.447 0.14796
## rgif 0.51707 0.04382 11.800 < 2e-16 ***
## tdon 0.07226 0.03490 2.071 0.03851 *
## agif 0.67106 0.04042 16.601 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.272 on 1979 degrees of freedom
## Multiple R-squared: 0.5717, Adjusted R-squared: 0.5684
## F-statistic: 176.1 on 15 and 1979 DF, p-value: < 2.2e-16
pred.valid.ls1 <-predict(lm2,newdata = data.valid.std.damt)
mean((y.valid.damt - pred.valid.ls1)^2)
## [1] 1.862669
#MSE 1.862669
##########################################################
#bagging & random forests
#Bagging is also used to reduce variance; the main idea is to create an aggregate fitted value from a large number of bootstrap samples.
#Random forests were used to lower the variance of our models: averaging over a large number of trees reduces variance.
#Random forests tend to have good predictive accuracy; bagged trees can be highly correlated because every split considers all predictors, whereas random forests decorrelate the trees by sampling predictors at each split.
par(mfrow=c(1,1))
library(randomForest)
set.seed(1)
bag.train=randomForest(damt ~ reg1 + reg2 + reg3 + reg4 + home + chld + hinc + genf + wrat +
avhv + incm + inca + plow + npro + tgif + lgif + rgif + tdon + tlag + agif,
data=data.train.std.damt, mtry=20,importance=TRUE)
bag.train
##
## Call:
## randomForest(formula = damt ~ reg1 + reg2 + reg3 + reg4 + home + chld + hinc + genf + wrat + avhv + incm + inca + plow + npro + tgif + lgif + rgif + tdon + tlag + agif, data = data.train.std.damt, mtry = 20, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 20
##
## Mean of squared residuals: 1.49039
## % Var explained: 60.22
yhat.bag = predict(bag.train,newdata=data.valid.std.damt)
mean((yhat.bag-y.valid.damt)^2)
## [1] 1.704569
#MSE is 1.704569
set.seed(1)
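# note: with mtry=5 (fewer than all 20 predictors) this is actually a random forest, not pure bagging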
bag.train=randomForest(damt ~ reg1 + reg2 + reg3 + reg4 + home + chld + hinc + genf + wrat +
avhv + incm + inca + plow + npro + tgif + lgif + rgif + tdon + tlag + agif,
data=data.train.std.damt,mtry=5,ntree=500)
bag.train
##
## Call:
## randomForest(formula = damt ~ reg1 + reg2 + reg3 + reg4 + home + chld + hinc + genf + wrat + avhv + incm + inca + plow + npro + tgif + lgif + rgif + tdon + tlag + agif, data = data.train.std.damt, mtry = 5, ntree = 500)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 1.46694
## % Var explained: 60.85
yhat.bag = predict(bag.train,newdata=data.valid.std.damt)
mean((yhat.bag-y.valid.damt)^2)
## [1] 1.666061
#MSE is 1.666061
#rf (note: the mtry=20 fit below uses all predictors at each split, i.e. bagging again; the mtry=6 fit that follows is a true random forest)
set.seed(1)
rf.train=randomForest(damt ~ reg1 + reg2 + reg3 + reg4 + home + chld + hinc + genf + wrat +
avhv + incm + inca + plow + npro + tgif + lgif + rgif + tdon + tlag + agif,
data=data.train.std.damt,mtry=20,importance=TRUE)
summary(rf.train)
## Length Class Mode
## call 5 -none- call
## type 1 -none- character
## predicted 1995 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 1995 -none- numeric
## importance 40 -none- numeric
## importanceSD 20 -none- numeric
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 1995 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
############################################################################################
yhat.rf = predict(rf.train,newdata=data.valid.std.damt)
mean((yhat.rf-y.valid.damt)^2)
## [1] 1.704569
#MSE is 1.704569
set.seed(1)
rf.train=randomForest(damt ~ reg1 + reg2 + reg3 + reg4 + home + chld + hinc + genf + wrat +
avhv + incm + inca + plow + npro + tgif + lgif + rgif + tdon + tlag + agif,
data=data.train.std.damt,mtry=6,importance=TRUE)
rf.train
##
## Call:
## randomForest(formula = damt ~ reg1 + reg2 + reg3 + reg4 + home + chld + hinc + genf + wrat + avhv + incm + inca + plow + npro + tgif + lgif + rgif + tdon + tlag + agif, data = data.train.std.damt, mtry = 6, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 1.474894
## % Var explained: 60.64
yhat.rf = predict(rf.train,newdata=data.valid.std.damt)
mean((yhat.rf-y.valid.damt)^2)
## [1] 1.671982
#MSE is 1.671982
importance(rf.train)
## %IncMSE IncNodePurity
## reg1 7.2280784 44.45009
## reg2 16.3909736 132.51364
## reg3 24.0810706 132.44441
## reg4 53.5370068 552.30862
## home -2.1931766 32.05791
## chld 42.0330987 446.91936
## hinc 26.3149084 253.25895
## genf 1.4776464 35.56429
## wrat 15.4837034 236.81797
## avhv 10.3693870 245.01113
## incm 16.5898662 232.84778
## inca 13.3095035 231.96122
## plow 15.4128626 218.55115
## npro 7.5747497 254.77054
## tgif 11.2501446 287.17436
## lgif 32.3203019 1230.03658
## rgif 32.3219934 1196.17365
## tdon 2.3002735 188.94821
## tlag 0.4946951 154.32186
## agif 29.5088895 1091.17330
varImpPlot(rf.train)
##########################################################
#Boosting
#Boosting is used here for prediction; trees are fit sequentially and the model learns slowly.
#The shrinkage parameter controls how much each tree, fit to the current residuals, contributes.
#Boosting is fairly resistant to overfitting, though we question whether the data is too noisy for it.
#Each tree improves on the fit of the previous trees instead of averaging independent trees.
library(gbm)
set.seed(1)
boost.train=gbm(damt ~ reg1 + reg2 + reg3 + reg4 + home + chld + hinc + genf + wrat +
avhv + incm + inca + plow + npro + tgif + lgif + rgif + tdon + tlag + agif,
data=data.train.std.damt,distribution="gaussian",n.trees=5000,interaction.depth=4)
boost.train
## gbm(formula = damt ~ reg1 + reg2 + reg3 + reg4 + home + chld +
## hinc + genf + wrat + avhv + incm + inca + plow + npro + tgif +
## lgif + rgif + tdon + tlag + agif, distribution = "gaussian",
## data = data.train.std.damt, n.trees = 5000, interaction.depth = 4)
## A gradient boosted model with gaussian loss function.
## 5000 iterations were performed.
## There were 20 predictors of which 20 had non-zero influence.
summary(boost.train)
## var rel.inf
## rgif rgif 22.70454717
## lgif lgif 18.40383645
## agif agif 14.98414235
## reg4 reg4 12.35632592
## chld chld 9.36316407
## hinc hinc 5.37490349
## wrat wrat 4.77411339
## reg3 reg3 3.24508847
## tgif tgif 2.20712068
## reg2 reg2 1.57964770
## incm incm 1.55355820
## plow plow 1.52958617
## inca inca 0.50986021
## avhv avhv 0.41301774
## home home 0.35321219
## npro npro 0.24843695
## reg1 reg1 0.20225816
## tdon tdon 0.14077195
## tlag tlag 0.04336967
## genf genf 0.01303904
yhat.boost=predict(boost.train,newdata=data.valid.std.damt,n.trees=5000)
mean((yhat.boost-y.valid.damt)^2)
## [1] 1.539772
#MSE 1.539772
##########################################################
#shrinkage parameter
set.seed(1)
boost.train=gbm(damt ~ reg1 + reg2 + reg3 + reg4 + home + chld + hinc + genf + wrat +
avhv + incm + inca + plow + npro + tgif + lgif + rgif + tdon + tlag + agif,
data=data.train.std.damt,distribution="gaussian",n.trees=100,interaction.depth=4,shrinkage=0.1,verbose=F)
boost.train
## gbm(formula = damt ~ reg1 + reg2 + reg3 + reg4 + home + chld +
## hinc + genf + wrat + avhv + incm + inca + plow + npro + tgif +
## lgif + rgif + tdon + tlag + agif, distribution = "gaussian",
## data = data.train.std.damt, n.trees = 100, interaction.depth = 4,
## shrinkage = 0.1, verbose = F)
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## There were 20 predictors of which 20 had non-zero influence.
yhat.boost=predict(boost.train,newdata=data.valid.std.damt,n.trees=100)
mean((yhat.boost-y.valid.damt)^2)
## [1] 1.403496
#MSE with λ = 0.1: 1.403496
set.seed(1)
boost.train=gbm(damt ~ reg1 + reg2 + reg3 + reg4 + home + chld + hinc + genf + wrat +
avhv + incm + inca + plow + npro + tgif + lgif + rgif + tdon + tlag + agif,
data=data.train.std.damt,distribution="gaussian",n.trees=5000,interaction.depth=4,shrinkage=0.01,verbose=F)
yhat.boost=predict(boost.train,newdata=data.valid.std.damt,n.trees=5000)
mean((yhat.boost-y.valid.damt)^2)
## [1] 1.436947
#λ = 0.01 (5000 trees): validation MSE = 1.436947
set.seed(1)
boost.train=gbm(damt ~ reg1 + reg2 + reg3 + reg4 + home + chld + hinc + genf + wrat +
avhv + incm + inca + plow + npro + tgif + lgif + rgif + tdon + tlag + agif,
data=data.train.std.damt,distribution="gaussian",n.trees=5000,interaction.depth=4,shrinkage=0.001,verbose=F)
yhat.boost=predict(boost.train,newdata=data.valid.std.damt,n.trees=5000)
mean((yhat.boost-y.valid.damt)^2)
## [1] 1.539772
#λ = 0.001 (5000 trees): validation MSE = 1.539772, identical to the first run above (gbm's default shrinkage is 0.001)
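The three copy-pasted shrinkage runs above can be collapsed into a single loop. A minimal sketch (the grid is the three values already tried; n.trees is held at 5000 throughout for simplicity, whereas in practice smaller λ pairs with more trees):
shrink.grid <- c(0.1, 0.01, 0.001) # candidate shrinkage values
sapply(shrink.grid, function(s) {
  set.seed(1)
  fit <- gbm(damt ~ ., data=data.train.std.damt, distribution="gaussian",
             n.trees=5000, interaction.depth=4, shrinkage=s, verbose=FALSE)
  yhat <- predict(fit, newdata=data.valid.std.damt, n.trees=5000)
  mean((yhat-y.valid.damt)^2) # validation MSE for this shrinkage value
})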
##########################################################
#Ridge and lasso are shrinkage (regularization) methods: coefficients are shrunk
#toward zero relative to their OLS estimates, reducing variance. The lasso can
#shrink coefficients exactly to zero, so it also performs variable selection.
#Cross-validation is used to choose the tuning parameter λ for each.
#ridge
x <- model.matrix(damt ~ ., data=data.train.std.damt) # the "-damt" in the original formula was redundant: with damt on the LHS, "." already excludes it
y <- data.train.std.damt$damt
xv <- model.matrix(damt ~ ., data=data.valid.std.damt)
yv <- data.valid.std.damt$damt
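Note that model.matrix() adds its own intercept column, which is why a duplicated "(Intercept)" row appears in the glmnet coefficient output below. A common alternative (shown here as a sketch only, since the [1:22,] indexing below assumes the current 22-column x) is to drop that column and let glmnet supply the only intercept:
# x <- model.matrix(damt ~ ., data=data.train.std.damt)[, -1] # drop model.matrix's intercept column
# xv <- model.matrix(damt ~ ., data=data.valid.std.damt)[, -1]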
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.4.2
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
##
## expand
## Loaded glmnet 2.0-13
library(Matrix)
library(foreach)
grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(x,y,alpha=0,lambda=grid)
dim(coef(ridge.mod))
## [1] 22 100
#use cross-validation to choose the tuning parameter λ
set.seed(1)
cv.out=cv.glmnet(x,y,alpha=0)
plot(cv.out)
bestlam=cv.out$lambda.min
bestlam
## [1] 0.1107589
# value of λ with the smallest CV error is 0.1107589
# validation MSE associated with this λ:
ridge.pred=predict(ridge.mod,s=bestlam,newx=xv)
summary(ridge.pred)
## 1
## Min. :11.59
## 1st Qu.:13.44
## Median :14.24
## Mean :14.48
## 3rd Qu.:15.32
## Max. :22.38
mean((ridge.pred-y.valid.damt)^2)
## [1] 1.871521
#validation MSE is 1.871521
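cv.glmnet also reports lambda.1se, the largest λ whose CV error is within one standard error of the minimum; it gives a more heavily regularized fit. A quick sketch (not part of the original analysis) of the corresponding validation MSE:
ridge.pred.1se <- predict(ridge.mod, s=cv.out$lambda.1se, newx=xv)
mean((ridge.pred.1se-y.valid.damt)^2) # typically somewhat higher than at lambda.min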
out=glmnet(x,y,alpha=0)
predict(out,type="coefficients",s=bestlam)[1:22,]
## (Intercept) (Intercept) reg1 reg2 reg3
## 14.220903329 0.000000000 -0.065591832 -0.110979989 0.291323216
## reg4 home chld hinc genf
## 0.584169803 0.218437039 -0.565948718 0.475460019 -0.059148148
## wrat avhv incm inca plow
## -0.006779739 -0.040793388 0.235250507 0.065377878 0.213706721
## npro tgif lgif rgif tdon
## 0.130470955 0.052011346 -0.007159546 0.483580973 0.071663395
## tlag agif
## 0.025396670 0.627843018
##########################################################
#lasso
par(mfrow=c(1,2))
lasso.mod=glmnet(x,y,alpha=1,lambda=grid)
plot(lasso.mod)
set.seed(1)
cv.out=cv.glmnet(x,y,alpha=1)
plot(cv.out)
bestlam=cv.out$lambda.min
bestlam
## [1] 0.00877745
set.seed(1)
lasso.pred=predict(lasso.mod,s=bestlam,newx=xv)
head(lasso.pred) # remaining validation predictions omitted for brevity
##           1
## 5  15.36131
## 14 15.41950
## 15 16.01580
## 18 14.34510
## 21 14.80533
## 25 12.87134
mean((lasso.pred-y.valid.damt)^2)
## [1] 1.85981
#validation MSE is 1.85981, slightly better than ridge
#at this λ the lasso zeros out wrat and inca (plus the duplicated intercept column from model.matrix); avhv is kept but is essentially zero
set.seed(1)
out=glmnet(x,y,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:22,]
lasso.coef
## (Intercept) (Intercept) reg1 reg2 reg3
## 14.194530356 0.000000000 -0.027292544 -0.065641610 0.319922718
## reg4 home chld hinc genf
## 0.630135530 0.217520541 -0.595569736 0.488308561 -0.052798906
## wrat avhv incm inca plow
## 0.000000000 -0.001084543 0.282511967 0.000000000 0.226566028
## npro tgif lgif rgif tdon
## 0.138616732 0.043587827 -0.017263891 0.490316143 0.061421249
## tlag agif
## 0.011348016 0.655682444
lasso.coef[lasso.coef!=0]
## (Intercept) reg1 reg2 reg3 reg4
## 14.194530356 -0.027292544 -0.065641610 0.319922718 0.630135530
## home chld hinc genf avhv
## 0.217520541 -0.595569736 0.488308561 -0.052798906 -0.001084543
## incm plow npro tgif lgif
## 0.282511967 0.226566028 0.138616732 0.043587827 -0.017263891
## rgif tdon tlag agif
## 0.490316143 0.061421249 0.011348016 0.655682444
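The selected variables can also be pulled programmatically rather than read off the printout. A small sketch (drops the intercept rows and exact zeros):
cf <- coef(out, s=bestlam) # sparse coefficient matrix at the chosen λ
sel <- rownames(cf)[as.vector(cf) != 0]
setdiff(sel, "(Intercept)") # lasso-selected predictors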
##########################################################
##########################################################
#Refit the models on the reduced predictor set chosen by variable selection: reg3, reg4, home, chld, hinc, genf, incm, plow, npro, rgif, tdon, agif
#boost
library(gbm)
set.seed(1)
boost.train=gbm(damt ~ reg3 + reg4 + home + chld + hinc + genf +
incm + plow + npro + rgif + tdon + agif,
data=data.train.std.damt,distribution="gaussian",n.trees=5000,interaction.depth=4)
summary(boost.train)
## var rel.inf
## rgif rgif 33.74003372
## agif agif 24.52498244
## reg4 reg4 13.66595186
## chld chld 10.70024403
## hinc hinc 5.70935433
## reg3 reg3 4.30371245
## incm incm 2.23250999
## plow plow 2.05007691
## npro npro 1.98411189
## home home 0.58169060
## tdon tdon 0.45219066
## genf genf 0.05514113
boost.train
## gbm(formula = damt ~ reg3 + reg4 + home + chld + hinc + genf +
## incm + plow + npro + rgif + tdon + agif, distribution = "gaussian",
## data = data.train.std.damt, n.trees = 5000, interaction.depth = 4)
## A gradient boosted model with gaussian loss function.
## 5000 iterations were performed.
## There were 12 predictors of which 12 had non-zero influence.
yhat.boost=predict(boost.train,newdata=data.valid.std.damt,n.trees=5000)
mean((yhat.boost-y.valid.damt)^2)
## [1] 1.727708
#default shrinkage: validation MSE = 1.727708
#tune the shrinkage parameter
set.seed(1)
boost.train=gbm(damt ~ reg3 + reg4 + home + chld + hinc + genf +
incm + plow + npro + rgif + tdon + agif,
data=data.train.std.damt,distribution="gaussian",n.trees=100,interaction.depth=4,shrinkage=0.1,verbose=F)
yhat.boost=predict(boost.train,newdata=data.valid.std.damt,n.trees=100)
mean((yhat.boost-y.valid.damt)^2)
## [1] 1.661383
#λ = 0.1 (100 trees): validation MSE = 1.661383
set.seed(1)
boost.train=gbm(damt ~ reg3 + reg4 + home + chld + hinc + genf +
incm + plow + npro + rgif + tdon + agif,data=data.train.std.damt,distribution="gaussian",n.trees=5000,interaction.depth=4,shrinkage=0.01,verbose=F)
yhat.boost=predict(boost.train,newdata=data.valid.std.damt,n.trees=5000)
mean((yhat.boost-y.valid.damt)^2)
## [1] 1.678381
#λ = 0.01 (5000 trees): validation MSE = 1.678381
#random forest on the reduced set; mtry = 12 with 12 predictors amounts to bagging
set.seed(1)
rf.train=randomForest(damt ~ reg3 + reg4 + home + chld + hinc + genf +
incm + plow + npro + rgif + tdon + agif,
data=data.train.std.damt,mtry=12,importance=TRUE)
rf.train
##
## Call:
## randomForest(formula = damt ~ reg3 + reg4 + home + chld + hinc + genf + incm + plow + npro + rgif + tdon + agif, data = data.train.std.damt, mtry = 12, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 12
##
## Mean of squared residuals: 1.648163
## % Var explained: 56.01
yhat.rf = predict(rf.train,newdata=data.valid.std.damt)
mean((yhat.rf-y.valid.damt)^2)
## [1] 1.809058
#MSE is 1.809058
#random forest with mtry = 10
set.seed(1)
rf.train=randomForest(damt ~ reg3 + reg4 + home + chld + hinc + genf +
incm + plow + npro + rgif + tdon + agif,
data=data.train.std.damt,mtry=10,importance=TRUE)
yhat.rf = predict(rf.train,newdata=data.valid.std.damt)
mean((yhat.rf-y.valid.damt)^2)
## [1] 1.808452
#MSE is 1.808452
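Only mtry = 12 and mtry = 10 were tried above. A sketch (not run in the original analysis) of tuning over the full range of mtry values on the reduced predictor set:
sapply(1:12, function(m) {
  set.seed(1)
  fit <- randomForest(damt ~ reg3 + reg4 + home + chld + hinc + genf +
                        incm + plow + npro + rgif + tdon + agif,
                      data=data.train.std.damt, mtry=m)
  mean((predict(fit, newdata=data.valid.std.damt)-y.valid.damt)^2) # validation MSE per mtry
})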
# Results
##### CLASSIFICATION
# Random forest (mtry=5) has minimum mean prediction error in the validation sample
# final.cl.model<-randomForest(as.factor(donr)~.,data.train.std.donr,mtry=5,ntree=500)
# chat.test<-predict(final.cl.model,data.test.std)
# However, the full logistic regression yields the maximum profit, so it is used for the final classification
# select model.log1 since it has the maximum profit in the validation sample
final.cl.model<-glm(donr~.,data.train.std.donr,family="binomial")
post.valid<- predict(final.cl.model, data.valid.std.donr, type="response") # 2018 probabilities
post.test <- predict(final.cl.model, data.test.std, type="response") # 2007 probabilities
# Weighted sampling adjustment
# Validation accuracy with the selected model is 79.7%, roughly 80%
# The optimal validation mailing rate (the rate that maximizes validation profit,
# computed below) is 1386/2018 ≈ 0.687; the report's classification model table
# lists the corresponding maximum profits (e.g., 11,468 for the SVM)
# a. Adjust the mailing rate for oversampling: 0.687/(0.5/0.1) = 0.137.
# b. Adjust the "non-mailing rate": (1 - 0.687)/((1 - 0.5)/(1 - 0.1)) = 0.564.
# c. Rescale so the two sum to one: 0.137/(0.137 + 0.564) = 0.196.
# d. The optimal test mailing rate is therefore about 0.196.
print("Mailing rate for profit calculation (adjusted for weighted sampling) on the test data is about 0.196, or 19.6%.")
## [1] "Mailing rate for profit calculation (adjusted for weighted sampling) on the test data is about 0.196, or 19.6%."
profit <- cumsum(14.5*y.valid.donr[order(post.valid, decreasing=T)]-2)
plot(profit) # see how profits change as more mailings are made
n.mail.valid <- which.max(profit) # number of mailings that maximizes profits
c(n.mail.valid, max(profit)) # report number of mailings and maximum profit
## [1] 1386.0 11423.5
cutoff <- sort(post.valid, decreasing=T)[n.mail.valid+1] # set cutoff based on n.mail.valid
pred.valid.donr <- ifelse(post.valid>cutoff, 1, 0) # mail to everyone above the cutoff
class_table<-table(pred.valid.donr, y.valid.donr) # classification table
# Oversampling adjustment for calculating number of mailings for test set
n.mail.valid <- which.max(profit)
tr.rate <- .1 # typical response rate is .1
vr.rate <- .5 # whereas validation response rate is .5
adj.test.1 <- (n.mail.valid/n.valid.donr)/(vr.rate/tr.rate) # adjustment for mail yes
adj.test.0 <- ((n.valid.donr-n.mail.valid)/n.valid.donr)/((1-vr.rate)/(1-tr.rate)) # adjustment for mail no
adj.test <- adj.test.1/(adj.test.1+adj.test.0) # scale into a proportion
n.mail.test <- round(n.test*adj.test, 0) # calculate number of mailings for test set
cutoff.test <- sort(post.test, decreasing=T)[n.mail.test+1] # set cutoff based on n.mail.test
chat.test <- ifelse(post.test>cutoff.test, 1, 0) # mail to everyone above the cutoff
table(chat.test)
## chat.test
## 0 1
## 1614 393
# QC
length(chat.test) # check length = 2007
## [1] 2007
chat.test[1:10] # check this consists of 0s and 1s
## 3 4 9 16 20 22 23 27 28 29
## 0 1 0 0 0 1 0 0 0 0
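As a further sanity check (sketch), the realized test mailing rate should match the weighted-sampling-adjusted proportion computed above:
mean(chat.test) # 393/2007, approximately 0.196
n.mail.test/n.test # should equal adj.test up to rounding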
# Filtering the final data for prediction
data.donr.test<- data.test.std[chat.test==1, ]
##### PREDICTION
# select the boosting model (n.trees=100, interaction.depth=4, shrinkage=0.1) since it has the lowest mean prediction error (validation MSE = 1.403) in the validation sample
library(gbm)
set.seed(1)
final.pr.model=gbm(damt ~ reg1 + reg2 + reg3 + reg4 + home + chld + hinc + genf + wrat +
avhv + incm + inca + plow + npro + tgif + lgif + rgif + tdon + tlag + agif,
data=data.train.std.damt,distribution="gaussian",n.trees=100,interaction.depth=4,
shrinkage=0.1,verbose=F)
# yhat.test <- predict(final.pr.model, newdata = data.donr.test, n.trees = 100) # predictions for the mailed subset only
yhat.test <- predict(final.pr.model, newdata = data.test.std, n.trees = 100) # predict for all 2007 test cases so yhat aligns row-by-row with chat
length(yhat.test) # check length = 2007
## [1] 2007
yhat.test[1:10] # check this consists of plausible predictions of damt
## [1] 16.37610 14.33904 15.90360 10.58305 14.93737 16.13470 16.19798
## [8] 14.58182 11.90483 16.44422
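A back-of-envelope sketch of net revenue for the selected mailings; this is optimistic, since it assumes every mailed individual donates the model's predicted amount:
sum(yhat.test[chat.test == 1]) - 2*sum(chat.test == 1) # predicted gifts less the $2 mailing cost for the 393 mailings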
# FINAL RESULTS
# Save final results for both classification and regression
ip <- data.frame(chat=chat.test, yhat=yhat.test) # data frame with two variables: chat and yhat
write.csv(ip, file="ABC.csv", row.names=FALSE) # use your initials for the file name
# submit the csv file in Canvas for evaluation based on actual test donr and damt values