library('tidyverse')
library('caret')
library('modelr')
set.seed(303)
The table below displays catalog-spending data for the first few of 200 randomly selected individuals from a very large (over 20,000 households) data base.1 The variable of particular interest is catalog spending as measured by the Spending Ratio (SpendRat). All of the catalog variables are represented by indicator variables; either the consumer bought and the variable is coded as 1 or the consumer didn’t buy and the variable is coded as 0. The other variables can be viewed as indexes for measuring assets, liquidity, and spending.
catalog<-read_csv('C:/Users/email/Downloads/catalog.csv')
catalog$CollGifts<-as_factor(catalog$CollGifts)
catalog$BricMortar<-as_factor(catalog$BricMortar)
catalog$MarthaHome<-as_factor(catalog$MarthaHome)
catalog$SunAds<-as_factor(catalog$SunAds)
catalog$ThemeColl<-as_factor(catalog$ThemeColl)
catalog$CustDec<-as_factor(catalog$CustDec)
catalog$RetailKids<-as_factor(catalog$RetailKids)
catalog$TeenWr<-as_factor(catalog$TeenWr)
catalog$Carlovers<-as_factor(catalog$Carlovers)
catalog$CountryColl<-as_factor(catalog$CountryColl)
str(catalog)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 200 obs. of 21 variables:
$ SpendRat : num 11.8 16.8 11.4 31.3 1.9 ...
$ Age : num 0 35 46 41 46 46 46 56 48 54 ...
$ LenRes : num 2 3 9 2 7 15 16 31 8 8 ...
$ Income : num 3 5 5 2 9 5 4 6 5 5 ...
$ TotAsset : num 122 195 123 117 493 138 162 117 119 50 ...
$ SecAssets : num 27 36 24 25 105 27 25 27 23 10 ...
$ ShortLiq : num 225 220 200 222 310 340 230 300 250 200 ...
$ LongLiq : num 422 420 420 419 500 450 430 440 430 420 ...
$ WlthIdx : num 286 430 290 279 520 440 360 400 360 230 ...
$ SpendVol : num 503 690 600 543 680 440 690 500 610 660 ...
$ SpenVel : num 285 570 280 308 100 50 180 10 0 0 ...
$ CollGifts : Factor w/ 2 levels "0","1": 2 1 2 2 1 1 2 2 2 1 ...
$ BricMortar : Factor w/ 2 levels "0","1": 1 2 1 1 2 2 1 2 1 2 ...
$ MarthaHome : Factor w/ 2 levels "0","1": 1 2 1 1 2 2 1 2 2 1 ...
$ SunAds : Factor w/ 2 levels "0","1": 2 1 2 2 1 1 2 1 1 1 ...
$ ThemeColl : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 2 1 ...
$ CustDec : Factor w/ 2 levels "0","1": 2 2 2 1 2 2 1 2 2 1 ...
$ RetailKids : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 2 1 1 ...
$ TeenWr : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 2 1 2 ...
$ Carlovers : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 2 1 1 ...
$ CountryColl: Factor w/ 2 levels "0","1": 2 1 2 2 1 1 2 1 2 1 ...
- attr(*, "spec")=
.. cols(
.. SpendRat = [32mcol_double()[39m,
.. Age = [32mcol_double()[39m,
.. LenRes = [32mcol_double()[39m,
.. Income = [32mcol_double()[39m,
.. TotAsset = [32mcol_double()[39m,
.. SecAssets = [32mcol_double()[39m,
.. ShortLiq = [32mcol_double()[39m,
.. LongLiq = [32mcol_double()[39m,
.. WlthIdx = [32mcol_double()[39m,
.. SpendVol = [32mcol_double()[39m,
.. SpenVel = [32mcol_double()[39m,
.. CollGifts = [32mcol_double()[39m,
.. BricMortar = [32mcol_double()[39m,
.. MarthaHome = [32mcol_double()[39m,
.. SunAds = [32mcol_double()[39m,
.. ThemeColl = [32mcol_double()[39m,
.. CustDec = [32mcol_double()[39m,
.. RetailKids = [32mcol_double()[39m,
.. TeenWr = [32mcol_double()[39m,
.. Carlovers = [32mcol_double()[39m,
.. CountryColl = [32mcol_double()[39m
.. )
The goal of this section is to explore the data set and get it ready for analysis. There are no missing values in the data set, but there are some incorrect entries that must be identified and removed before completing the analysis. Income is coded as an ordinal value, ranging from 1 to 12. Age can be regarded as quantitative, and any value less than 18 is invalid. Length of residence (LenRes) is a value ranging from zero to someone’s age. LenRes should not be higher than Age. You should create a simple 1-2 paragraph summary of this section. Be sure to fully explain the reasoning behind transforming any columns and removing any rows. Campbell told me to is not sufficient. Justify why it makes sense not to include any rows whose age is less than 18 or why we shouldn’t use rows in which length of residence is larger than age.
cat.clean<-filter(catalog,Age>=18,LenRes<=Age)
Provide a basic summary of the cleaned data set. Include a table of univariate statistics to summarize each variable. Choose meaningful summary statistics for each type of variable. You should also include a basic summary of the catalog spending (SpendRat) including an appropriate graphical display.
dim(cat.clean)
[1] 184 21
summary(cat.clean)
SpendRat Age LenRes Income TotAsset
Min. : 0.080 Min. :21.00 Min. : 0.00 Min. :1.000 Min. : 5.00
1st Qu.: 6.077 1st Qu.:44.75 1st Qu.: 8.00 1st Qu.:4.000 1st Qu.: 94.75
Median : 18.805 Median :53.00 Median :11.00 Median :5.000 Median :150.00
Mean : 43.792 Mean :54.71 Mean :14.58 Mean :4.473 Mean :184.67
3rd Qu.: 50.273 3rd Qu.:63.00 3rd Qu.:19.00 3rd Qu.:5.000 3rd Qu.:222.50
Max. :401.420 Max. :89.00 Max. :46.00 Max. :9.000 Max. :999.00
SecAssets ShortLiq LongLiq WlthIdx SpendVol SpenVel
Min. : 0.0 Min. :160.0 Min. :400.0 Min. : 90.0 Min. : 0.0 Min. : 0.0
1st Qu.: 19.0 1st Qu.:210.0 1st Qu.:420.0 1st Qu.:300.0 1st Qu.:532.0 1st Qu.: 40.0
Median : 28.0 Median :230.0 Median :430.0 Median :360.0 Median :610.0 Median :160.0
Mean : 40.9 Mean :240.6 Mean :439.5 Mean :367.1 Mean :568.4 Mean :219.5
3rd Qu.: 42.0 3rd Qu.:260.0 3rd Qu.:440.0 3rd Qu.:430.0 3rd Qu.:670.0 3rd Qu.:310.0
Max. :999.0 Max. :999.0 Max. :999.0 Max. :880.0 Max. :780.0 Max. :999.0
CollGifts BricMortar MarthaHome SunAds ThemeColl CustDec RetailKids TeenWr Carlovers
0:94 0:131 0:117 0:105 0:111 0:120 0:119 0:89 0:133
1:90 1: 53 1: 67 1: 79 1: 73 1: 64 1: 65 1:95 1: 51
CountryColl
0:107
1: 77
cat.clean %>%
ggplot(aes(x=log(SpendRat))) + geom_histogram()
We are interested in developing a model to predict spending ratio. Find a multiple regression model for predicting the amount of money that consumers will spend on catalog shopping, as measured by spending ratio. Your goal is to identify the best model you can. In your write-up be sure to justify your choice of model, discuss any transformation you make to the variables, discuss your model fit, and discuss the effect of the significant predictors using both hypothesis tests and confidence intervals. Remember to check the conditions for inference as you evaluate your models. The data set is much too small to split into training and test data sets, so use cross validation in all your models
## Set up Repeated k-fold Cross Validation
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)
a. Fit a linear model using least squares on the training set, and report the CV error obtained.
lm.fit<-train(log(SpendRat)~., data=cat.clean, trControl=train_control,method='lm')
print(lm.fit)
Linear Regression
184 samples
20 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 167, 167, 165, 165, 165, 166, ...
Resampling results:
RMSE Rsquared MAE
1.394231 0.3002504 1.067814
Tuning parameter 'intercept' was held constant at a value of TRUE
b. Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the CV error obtained.
y_train=log(cat.clean$SpendRat)
X_train=model_matrix(cat.clean,log(SpendRat)~Age+LenRes+Income+TotAsset+SecAssets+ShortLiq+LongLiq+WlthIdx+SpendVol+SpenVel+CollGifts+BricMortar+MarthaHome+SunAds+ThemeColl+CustDec+RetailKids+TeenWr+Carlovers+CountryColl)
parameters <- c(seq(0.1, 2, by =0.1) , seq(2, 5, 0.5) , seq(5, 25, 1))
ridge.fit<-train(y=y_train,x=X_train,method='glmnet',trControl=train_control,tuneGrid=expand.grid(alpha=0,lambda = parameters))
print(ridge.fit)
glmnet
184 samples
21 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 165, 167, 168, 164, 165, 165, ...
Resampling results across tuning parameters:
lambda RMSE Rsquared MAE
0.1 1.282431 0.3389732 1.0136866
0.2 1.262920 0.3477084 1.0012533
0.3 1.255321 0.3519455 0.9979231
0.4 1.252062 0.3543630 0.9972646
0.5 1.251011 0.3558583 0.9976626
0.6 1.251259 0.3568258 0.9990369
0.7 1.252338 0.3574662 1.0010517
0.8 1.253995 0.3578612 1.0033494
0.9 1.256056 0.3580738 1.0058887
1.0 1.258394 0.3581592 1.0086132
1.1 1.260931 0.3581387 1.0113090
1.2 1.263603 0.3580336 1.0139815
1.3 1.266368 0.3578600 1.0166161
1.4 1.269187 0.3576379 1.0192355
1.5 1.272041 0.3573677 1.0217359
1.6 1.274899 0.3570578 1.0241457
1.7 1.277760 0.3567171 1.0265044
1.8 1.280606 0.3563590 1.0287912
1.9 1.283413 0.3559797 1.0309755
2.0 1.286203 0.3555893 1.0330928
2.5 1.299502 0.3534787 1.0432612
3.0 1.311641 0.3513105 1.0526532
3.5 1.322620 0.3492042 1.0618671
4.0 1.332510 0.3471970 1.0703347
4.5 1.341475 0.3453146 1.0779618
5.0 1.349588 0.3435733 1.0848968
6.0 1.363703 0.3404632 1.0974258
7.0 1.375527 0.3378082 1.1077753
8.0 1.385618 0.3355069 1.1165298
9.0 1.394274 0.3335277 1.1238685
10.0 1.401810 0.3317875 1.1302016
11.0 1.408425 0.3302622 1.1356920
12.0 1.414280 0.3289142 1.1405733
13.0 1.419500 0.3277156 1.1448879
14.0 1.424186 0.3266288 1.1487019
15.0 1.428399 0.3256585 1.1521298
16.0 1.432213 0.3247865 1.1551995
17.0 1.435707 0.3239915 1.1579793
18.0 1.438887 0.3232622 1.1604801
19.0 1.441809 0.3226010 1.1627590
20.0 1.444514 0.3219793 1.1648501
21.0 1.446994 0.3214221 1.1667773
22.0 1.449315 0.3208923 1.1685728
23.0 1.451452 0.3204155 1.1702319
24.0 1.453462 0.3199579 1.1717784
25.0 1.455326 0.3195462 1.1732128
Tuning parameter 'alpha' was held constant at a value of 0
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0 and lambda = 0.5.
c. Fit a lasso model on the training set, with \(\lambda\) chosen by cross-validation. Report the CV error obtained, along with the number of non-zero coefficient estimates.
y_train=log(cat.clean$SpendRat)
X_train=model_matrix(cat.clean,log(SpendRat)~Age+LenRes+Income+TotAsset+SecAssets+ShortLiq+LongLiq+WlthIdx+SpendVol+SpenVel+CollGifts+BricMortar+MarthaHome+SunAds+ThemeColl+CustDec+RetailKids+TeenWr+Carlovers+CountryColl)
parameters <- c(seq(0.1, 2, by =0.1) , seq(2, 5, 0.5) , seq(5, 25, 1))
lasso.fit<-train(y=y_train,x=X_train,method='glmnet',trControl=train_control,tuneGrid=expand.grid(alpha=1,lambda = parameters))
print(lasso.fit)
glmnet
184 samples
21 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 166, 167, 165, 165, 167, 168, ...
Resampling results across tuning parameters:
lambda RMSE Rsquared MAE
0.1 1.250853 0.35615328 0.9923317
0.2 1.293874 0.33931292 1.0341048
0.3 1.341793 0.33336816 1.0739817
0.4 1.407003 0.29713600 1.1281301
0.5 1.467919 0.23196363 1.1778427
0.6 1.508460 0.08560497 1.2102028
0.7 1.510381 NaN 1.2115618
0.8 1.510381 NaN 1.2115618
0.9 1.510381 NaN 1.2115618
1.0 1.510381 NaN 1.2115618
1.1 1.510381 NaN 1.2115618
1.2 1.510381 NaN 1.2115618
1.3 1.510381 NaN 1.2115618
1.4 1.510381 NaN 1.2115618
1.5 1.510381 NaN 1.2115618
1.6 1.510381 NaN 1.2115618
1.7 1.510381 NaN 1.2115618
1.8 1.510381 NaN 1.2115618
1.9 1.510381 NaN 1.2115618
2.0 1.510381 NaN 1.2115618
2.5 1.510381 NaN 1.2115618
3.0 1.510381 NaN 1.2115618
3.5 1.510381 NaN 1.2115618
4.0 1.510381 NaN 1.2115618
4.5 1.510381 NaN 1.2115618
5.0 1.510381 NaN 1.2115618
6.0 1.510381 NaN 1.2115618
7.0 1.510381 NaN 1.2115618
8.0 1.510381 NaN 1.2115618
9.0 1.510381 NaN 1.2115618
10.0 1.510381 NaN 1.2115618
11.0 1.510381 NaN 1.2115618
12.0 1.510381 NaN 1.2115618
13.0 1.510381 NaN 1.2115618
14.0 1.510381 NaN 1.2115618
15.0 1.510381 NaN 1.2115618
16.0 1.510381 NaN 1.2115618
17.0 1.510381 NaN 1.2115618
18.0 1.510381 NaN 1.2115618
19.0 1.510381 NaN 1.2115618
20.0 1.510381 NaN 1.2115618
21.0 1.510381 NaN 1.2115618
22.0 1.510381 NaN 1.2115618
23.0 1.510381 NaN 1.2115618
24.0 1.510381 NaN 1.2115618
25.0 1.510381 NaN 1.2115618
Tuning parameter 'alpha' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 0.1.
d. Fit a PCR model on the training set, with M chosen by cross-validation. Report the CV error obtained, along with the value of M selected by cross-validation.
pcr.fit<-train(log(SpendRat)~., data=cat.clean, trControl=train_control,tuneLength=ncol(cat.clean),method='pcr')
plot(pcr.fit)
pcr.fit$bestTune
print(pcr.fit)
Principal Component Analysis
184 samples
20 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 167, 165, 166, 166, 165, 165, ...
Resampling results across tuning parameters:
ncomp RMSE Rsquared MAE
1 1.504416 0.05292512 1.226453
2 1.502397 0.05696327 1.227758
3 1.506584 0.05669179 1.238072
4 1.504600 0.05817332 1.228613
5 1.504614 0.05487248 1.228104
6 1.568391 0.04708296 1.268189
7 1.530471 0.03903891 1.246003
8 1.563494 0.04184430 1.267878
9 1.566718 0.03774084 1.267141
10 1.590034 0.04801677 1.269187
11 1.493764 0.13038231 1.208531
12 1.418096 0.23581979 1.105241
13 1.394111 0.27989803 1.053194
14 1.394441 0.27807673 1.051699
15 1.307876 0.31287039 1.020025
16 1.325327 0.30587889 1.028624
17 1.337142 0.29880732 1.035104
18 1.348773 0.30239710 1.044208
19 1.367447 0.29728069 1.052854
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 15.
e. Fit a PLS model on the training set, with M chosen by cross-validation. Report the CV error obtained, along with the value of M selected by cross-validation.
pls.fit<-train(log(SpendRat)~., data=cat.clean, trControl=train_control,tuneLength=ncol(cat.clean),method='pls')
plot(pls.fit)
pls.fit$bestTune
print(pls.fit)
Partial Least Squares
184 samples
20 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 166, 167, 165, 166, 165, 164, ...
Resampling results across tuning parameters:
ncomp RMSE Rsquared MAE
1 1.515817 0.07985721 1.234183
2 1.521152 0.07681501 1.235722
3 1.524430 0.07891818 1.236456
4 1.525921 0.06506635 1.227040
5 1.601973 0.06390192 1.276570
6 1.672190 0.06527717 1.308169
7 1.621498 0.08136191 1.286456
8 1.642158 0.09102616 1.282712
9 1.730764 0.09214255 1.291040
10 1.480394 0.22943277 1.135545
11 1.353831 0.31676170 1.046596
12 1.357684 0.33486266 1.031558
13 1.361018 0.32809382 1.046569
14 1.383036 0.32544019 1.055099
15 1.385985 0.32711307 1.052993
16 1.382261 0.32780757 1.050593
17 1.381936 0.32786463 1.050362
18 1.381863 0.32790378 1.050287
19 1.381829 0.32791716 1.050270
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 11.
f. Comment on the results obtained. How accurately can we predict the Spending Ratio? Is there much difference among the CV errors resulting from these five approaches?
Model | RMSE |
---|---|
Lasso Regression | 1.2508531 |
Ridge Regression | 1.2510108 |
Principle Components Regression | 1.3078756 |
Partial Least Squares Regression | 1.3538306 |
Linear Regression | 1.3942314 |
Best model was Lasso with \(\lambda\) = 0.1 with a RMSE of 1.2508531. The worst performing model was the linear regression model with a RMSE of 1.3942314. There’s not a substantial amount of difference of CV RMSE across the five models.
In this problem, you will develop a model to predict whether income exceeds $50K/yr based on census data.
a. Use the code in Blackboard to create the adult data set.
adult<-read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", col_names=FALSE, na='?')
names(adult)<-c("age","workclass","fnlwgt","education","education_num","marital_status","occupation","relationship","race","sex","capital_gain","capital_loss","hours_per_week","native_country","income")
adult$fnlwgt<-NULL #remove unnecessary column
#coerce character variables to factors
adult$workclass<-as_factor(adult$workclass)
adult$education<-as_factor(adult$education)
adult$marital_status<-as_factor(adult$marital_status)
adult$occupation<-as_factor(adult$occupation)
adult$relationship<-as_factor(adult$relationship)
adult$race<-as_factor(adult$race)
adult$sex<-as_factor(adult$sex)
adult$native_country<-as_factor(adult$native_country)
adult$income<-as_factor(adult$income)
adult<-na.omit(adult) #remove missing observations
y=adult$income
X=model_matrix(adult,income~.) #create dummy variables
X$`(Intercept)`<-NULL #remove unnecessary column
adult2<-as_tibble(cbind(adult$income,X),.name_repair = "unique") #create new data set
names(adult2)[1]<-"income" #fix the name of the outcome variable
nsv<-nearZeroVar(adult2,saveMetrics = FALSE) #identify near zero variance predictors
adult3<-adult2[,-nsv] #remove them
adult3
b. Explore the data graphically in order to investigate the association between income and the other features. Which of the other features seem most likely to be useful in predicting income? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
Using random forest method for feature selection.
##Generate a small sample of the data set to investigate which variables are closely associated with income
k=sample(nrow(adult3),nrow(adult2)*0.08)
##Use random forest model to calculate variable importance
rf=train(income~.,data=adult3[k,])
rfImp<-varImp(rf)
plot(rfImp,top=10)
c. Split the data into an 80% training set and a 20% test set. Set the seed at 1303.
##Create training indicator vector
inTrain <- createDataPartition(adult3$income, p=0.8, list=FALSE)
##Tabulate training and test data sets
train=adult3[inTrain,]
test=adult3[-inTrain,]
dim(adult2)
[1] 30162 97
dim(train)
[1] 24131 24
dim(test)
[1] 6031 24
d. Perform LDA on the training data in order to predict income using the variables that seemed most associated with income in (b). What is the test error of the model obtained?
##Train Model
lda.fit=train(income~`marital_statusMarried-civ-spouse` + education_num + relationshipHusband + age + hours_per_week + `occupationExec-managerial` + `occupationProf-specialty` + sexFemale + `relationshipOwn-child`,data=train,method='lda',trControl = trainControl(method = "cv"))
lda.fit
Linear Discriminant Analysis
24131 samples
9 predictor
2 classes: '<=50K', '>50K'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 21719, 21718, 21717, 21718, 21718, 21719, ...
Resampling results:
Accuracy Kappa
0.8226754 0.4894095
##Calculate Predictions
pred.lda<-predict(lda.fit,test)
##Estimate Accuracy
confusionMatrix(pred.lda,test$income)
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 4139 704
>50K 391 797
Accuracy : 0.8184
95% CI : (0.8085, 0.8281)
No Information Rate : 0.7511
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.478
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9137
Specificity : 0.5310
Pos Pred Value : 0.8546
Neg Pred Value : 0.6709
Prevalence : 0.7511
Detection Rate : 0.6863
Detection Prevalence : 0.8030
Balanced Accuracy : 0.7223
'Positive' Class : <=50K
e. Perform QDA on the training data in order to predict income using the variables that seemed most associated with income in (b). What is the test error of the model obtained?
##Train Model
qda.fit=train(income~`marital_statusMarried-civ-spouse` + education_num + relationshipHusband + age + hours_per_week + `occupationExec-managerial` + `occupationProf-specialty` + sexFemale + `relationshipOwn-child`,data=train,method='qda',trControl = trainControl(method = "cv"))
qda.fit
Quadratic Discriminant Analysis
24131 samples
9 predictor
2 classes: '<=50K', '>50K'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 21719, 21718, 21718, 21719, 21718, 21718, ...
Resampling results:
Accuracy Kappa
0.7522264 0.4660741
##Calculate Predictions
pred.qda<-predict(qda.fit,test)
##Estimate Accuracy
confusionMatrix(pred.qda,test$income)
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 3271 213
>50K 1259 1288
Accuracy : 0.7559
95% CI : (0.7449, 0.7667)
No Information Rate : 0.7511
P-Value [Acc > NIR] : 0.1982
Kappa : 0.4705
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.7221
Specificity : 0.8581
Pos Pred Value : 0.9389
Neg Pred Value : 0.5057
Prevalence : 0.7511
Detection Rate : 0.5424
Detection Prevalence : 0.5777
Balanced Accuracy : 0.7901
'Positive' Class : <=50K
f. Perform logistic regression on the training data in order to predict income using the variables that seemed most associated with income in (b). What is the test error of the model obtained?
##Train Model
glm.fit=train(income~`marital_statusMarried-civ-spouse` + education_num + relationshipHusband + age + hours_per_week + `occupationExec-managerial` + `occupationProf-specialty` + sexFemale + `relationshipOwn-child`,data=train,method='glm',trControl = trainControl(method = "cv"))
glm.fit
Generalized Linear Model
24131 samples
9 predictor
2 classes: '<=50K', '>50K'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 21719, 21718, 21717, 21719, 21717, 21717, ...
Resampling results:
Accuracy Kappa
0.8238368 0.4869807
##Calculate Predictions
pred.glm<-predict(glm.fit,test)
##Estimate Accuracy
confusionMatrix(pred.glm,test$income)
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 4154 728
>50K 376 773
Accuracy : 0.8169
95% CI : (0.807, 0.8266)
No Information Rate : 0.7511
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4687
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9170
Specificity : 0.5150
Pos Pred Value : 0.8509
Neg Pred Value : 0.6728
Prevalence : 0.7511
Detection Rate : 0.6888
Detection Prevalence : 0.8095
Balanced Accuracy : 0.7160
'Positive' Class : <=50K
g. Perform KNN on the training data, with several values of K, in order to predict income. Use only the variables that seemed most associated with income in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
##Train Model, Let CV choose value for K
knn.fit<-train(income~`marital_statusMarried-civ-spouse` + education_num + relationshipHusband + age + hours_per_week + `occupationExec-managerial` + `occupationProf-specialty` + sexFemale + `relationshipOwn-child`,data=train,method='knn',trControl = trainControl(method = "cv"), tuneLength=20)
knn.fit
k-Nearest Neighbors
24131 samples
9 predictor
2 classes: '<=50K', '>50K'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 21718, 21717, 21718, 21718, 21718, 21718, ...
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.8047330 0.4363006
7 0.8067635 0.4383265
9 0.8078823 0.4407662
11 0.8092911 0.4431910
13 0.8097055 0.4439116
15 0.8100371 0.4443474
17 0.8093331 0.4421319
19 0.8086285 0.4380045
21 0.8098304 0.4399584
23 0.8096647 0.4388197
25 0.8100791 0.4385366
27 0.8106590 0.4386679
29 0.8095400 0.4356190
31 0.8099544 0.4361528
33 0.8094984 0.4343682
35 0.8097058 0.4343902
37 0.8108248 0.4370759
39 0.8097059 0.4336888
41 0.8098714 0.4334224
43 0.8094986 0.4328404
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 37.
##Calculate Predictions
pred.knn<-predict(knn.fit,test)
##Estimate Accuracy
confusionMatrix(pred.knn,test$income)
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 4184 794
>50K 346 707
Accuracy : 0.811
95% CI : (0.8009, 0.8208)
No Information Rate : 0.7511
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4384
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9236
Specificity : 0.4710
Pos Pred Value : 0.8405
Neg Pred Value : 0.6714
Prevalence : 0.7511
Detection Rate : 0.6937
Detection Prevalence : 0.8254
Balanced Accuracy : 0.6973
'Positive' Class : <=50K
h. Choose which model predicts income the best and justify your choice.
Model | Accuracy |
---|---|
Quadratic Discriminant | 0.7522264 |
K Nearest Neighbors (K=37) | 0.8108248 |
Linear Discriminant | 0.8226754 |
Logistic Regression | 0.8238368 |