The Trade-Off Between Prediction Accuracy and Model Interpretability Whenever building a model, generally more complexity allows for a more accurate and comprehensive analysis whether it is classification or regression. However, this typically means the model is generally less interpretable. Typically, we want a model that is more accurate than interpretable, but it is important to make sure you can explain your model, especially in the instance of medical works
Supervised Versus Unsupervised Learning
Supervised learning is when we have a clear input and output with our variables. The model learns off of previous instances of data, and this method is better for prediction rather than classification. Learning takes place on the computer, and is typically faster and less complex. On the other hand, Unsupervised learning means we only have a clear input, and as such, the model learns off of natural patterns exhibited in the data. This can help find patterns in the data, and as such, works really well for classification.
The Bias-Variance Trade-Off Bias and variance have an inverse correlation, meaning as one increases the other increases. This is important to consider in your models, as more flexible models result in less variance but more bias, and less flexible models result in less bias but more variance. This is why we assess our models by RMSE, as we can generally make good judgement on which model may be more proper for our situation off of it.
Linear Regression versus K-Nearest Neighbors Linear regression is a form of supervised learning. Linear regression works very well with linear data, and is very easy to implement in comparison to other methods. Linear regression however, is typically not very good with non-linear data, and thus is not completely applicable to every situation. KNN is also supervised, but offers an extreme amount of flexibility. The model can be computationally expensive if k is high enough, and is generally prone to overfitting.
Logistic Regression versus LDA versus QDA Logistic regression typically works really for binary classification i.e spam/not spam and has a linear boundary, as does LDA, but they differ in the way that the assumptions for LDA are more strict, and thus, typically result in higher accuracy for LDA. QDA has a quadratic boundary. Because its decision boudnary is not linear, it is typically more flexible than either LDA or LR. It is essentially a compromise between LDA and KNN
library('tidyverse')
library('caret')
library('modelr')
library('readr')
set.seed(10003)
The table below displays catalog-spending data for the first few of 200 randomly selected individuals from a very large (over 20,000 households) data base.1 The variable of particular interest is catalog spending as measured by the Spending Ratio (SpendRat). All of the catalog variables are represented by indicator variables; either the consumer bought and the variable is coded as 1 or the consumer didn’t buy and the variable is coded as 0. The other variables can be viewed as indexes for measuring assets, liquidity, and spending.
We recode all of our binary variables into factors so that we may use it as ordinal data.
catalog<-read_csv('C:\\Users\\diego\\Desktop\\Data Mining\\assignments\\midterm\\catalog.csv')
Parsed with column specification:
cols(
.default = col_double()
)
See spec(...) for full column specifications.
catalog$CollGifts<-as_factor(catalog$CollGifts)
catalog$BricMortar<-as_factor(catalog$BricMortar)
catalog$MarthaHome<-as_factor(catalog$MarthaHome)
catalog$SunAds<-as_factor(catalog$SunAds)
catalog$ThemeColl<-as_factor(catalog$ThemeColl)
catalog$CustDec<-as_factor(catalog$CustDec)
catalog$RetailKids<-as_factor(catalog$RetailKids)
catalog$TeenWr<-as_factor(catalog$TeenWr)
catalog$Carlovers<-as_factor(catalog$Carlovers)
catalog$CountryColl<-as_factor(catalog$CountryColl)
str(catalog)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 200 obs. of 21 variables:
$ SpendRat : num 11.8 16.8 11.4 31.3 1.9 ...
$ Age : num 0 35 46 41 46 46 46 56 48 54 ...
$ LenRes : num 2 3 9 2 7 15 16 31 8 8 ...
$ Income : num 3 5 5 2 9 5 4 6 5 5 ...
$ TotAsset : num 122 195 123 117 493 138 162 117 119 50 ...
$ SecAssets : num 27 36 24 25 105 27 25 27 23 10 ...
$ ShortLiq : num 225 220 200 222 310 340 230 300 250 200 ...
$ LongLiq : num 422 420 420 419 500 450 430 440 430 420 ...
$ WlthIdx : num 286 430 290 279 520 440 360 400 360 230 ...
$ SpendVol : num 503 690 600 543 680 440 690 500 610 660 ...
$ SpenVel : num 285 570 280 308 100 50 180 10 0 0 ...
$ CollGifts : Factor w/ 2 levels "0","1": 2 1 2 2 1 1 2 2 2 1 ...
$ BricMortar : Factor w/ 2 levels "0","1": 1 2 1 1 2 2 1 2 1 2 ...
$ MarthaHome : Factor w/ 2 levels "0","1": 1 2 1 1 2 2 1 2 2 1 ...
$ SunAds : Factor w/ 2 levels "0","1": 2 1 2 2 1 1 2 1 1 1 ...
$ ThemeColl : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 2 1 ...
$ CustDec : Factor w/ 2 levels "0","1": 2 2 2 1 2 2 1 2 2 1 ...
$ RetailKids : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 2 1 1 ...
$ TeenWr : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 2 1 2 ...
$ Carlovers : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 2 1 1 ...
$ CountryColl: Factor w/ 2 levels "0","1": 2 1 2 2 1 1 2 1 2 1 ...
- attr(*, "spec")=
.. cols(
.. SpendRat = [32mcol_double()[39m,
.. Age = [32mcol_double()[39m,
.. LenRes = [32mcol_double()[39m,
.. Income = [32mcol_double()[39m,
.. TotAsset = [32mcol_double()[39m,
.. SecAssets = [32mcol_double()[39m,
.. ShortLiq = [32mcol_double()[39m,
.. LongLiq = [32mcol_double()[39m,
.. WlthIdx = [32mcol_double()[39m,
.. SpendVol = [32mcol_double()[39m,
.. SpenVel = [32mcol_double()[39m,
.. CollGifts = [32mcol_double()[39m,
.. BricMortar = [32mcol_double()[39m,
.. MarthaHome = [32mcol_double()[39m,
.. SunAds = [32mcol_double()[39m,
.. ThemeColl = [32mcol_double()[39m,
.. CustDec = [32mcol_double()[39m,
.. RetailKids = [32mcol_double()[39m,
.. TeenWr = [32mcol_double()[39m,
.. Carlovers = [32mcol_double()[39m,
.. CountryColl = [32mcol_double()[39m
.. )
The goal of this section is to explore the data set and get it ready for analysis. There are no missing values in the data set, but there are some incorrect entries that must be identified and removed before completing the analysis. Income is coded as an ordinal value, ranging from 1 to 12. Age can be regarded as quantitative, and any value less than 18 is invalid. Length of residence (LenRes) is a value ranging from zero to someone’s age. LenRes should not be higher than Age. You should create a simple 1-2 paragraph summary of this section. Be sure to fully explain the reasoning behind transforming any columns and removing any rows. Campbell told me to is not sufficient. Justify why it makes sense not to include any rows whose age is less than 18 or why we shouldn’t use rows in which length of residence is larger than age.
Here, we filter our dataset to have rows in which data follow the following specifications: ages are greater than 18, and LenRes is less than age. age has to be greater than 18 as it is pretty impossible to own a house under 18 under normal circumstances, and LenRes has the specification because you can’t own a house longer than you’ve existed for unless you’re a trust fund baby. This could be accounted for in the dataset but it has been decided it’s not pertinent.
cat.clean<-filter(catalog,Age>=18,LenRes<=Age)
Provide a basic summary of the cleaned data set. Include a table of univariate statistics to summarize each variable. Choose meaningful summary statistics for each type of variable. You should also include a basic summary of the catalog spending (SpendRat) including an appropriate graphical display.
dim(cat.clean)
[1] 184 21
summary(cat.clean)
SpendRat Age LenRes Income TotAsset
Min. : 0.080 Min. :21.00 Min. : 0.00 Min. :1.000 Min. : 5.00
1st Qu.: 6.077 1st Qu.:44.75 1st Qu.: 8.00 1st Qu.:4.000 1st Qu.: 94.75
Median : 18.805 Median :53.00 Median :11.00 Median :5.000 Median :150.00
Mean : 43.792 Mean :54.71 Mean :14.58 Mean :4.473 Mean :184.67
3rd Qu.: 50.273 3rd Qu.:63.00 3rd Qu.:19.00 3rd Qu.:5.000 3rd Qu.:222.50
Max. :401.420 Max. :89.00 Max. :46.00 Max. :9.000 Max. :999.00
SecAssets ShortLiq LongLiq WlthIdx SpendVol
Min. : 0.0 Min. :160.0 Min. :400.0 Min. : 90.0 Min. : 0.0
1st Qu.: 19.0 1st Qu.:210.0 1st Qu.:420.0 1st Qu.:300.0 1st Qu.:532.0
Median : 28.0 Median :230.0 Median :430.0 Median :360.0 Median :610.0
Mean : 40.9 Mean :240.6 Mean :439.5 Mean :367.1 Mean :568.4
3rd Qu.: 42.0 3rd Qu.:260.0 3rd Qu.:440.0 3rd Qu.:430.0 3rd Qu.:670.0
Max. :999.0 Max. :999.0 Max. :999.0 Max. :880.0 Max. :780.0
SpenVel CollGifts BricMortar MarthaHome SunAds ThemeColl CustDec RetailKids
Min. : 0.0 0:94 0:131 0:117 0:105 0:111 0:120 0:119
1st Qu.: 40.0 1:90 1: 53 1: 67 1: 79 1: 73 1: 64 1: 65
Median :160.0
Mean :219.5
3rd Qu.:310.0
Max. :999.0
TeenWr Carlovers CountryColl
0:89 0:133 0:107
1:95 1: 51 1: 77
cat.clean %>%
ggplot(aes(x=log(SpendRat))) + geom_histogram()
We are interested in developing a model to predict spending ratio. Find a multiple regression model for predicting the amount of money that consumers will spend on catalog shopping, as measured by spending ratio. Your goal is to identify the best model you can. In your write-up be sure to justify your choice of model, discuss any transformation you make to the variables, discuss your model fit, and discuss the effect of the significant predictors using both hypothesis tests and confidence intervals. Remember to check the conditions for inference as you evaluate your models. The data set is much too small to split into training and test data sets, so use cross validation in all your models
## Set up Repeated k-fold Cross Validation
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)
a. Fit a linear model using least squares on the training set, and report the CV error obtained.
lm.fit<-train(log(SpendRat)~., data=cat.clean, trControl=train_control,method='lm')
print(lm.fit)
Linear Regression
184 samples
20 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 167, 167, 165, 165, 165, 166, ...
Resampling results:
RMSE Rsquared MAE
1.353788 0.3089562 1.047582
Tuning parameter 'intercept' was held constant at a value of TRUE
b. Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the CV error obtained.
y_train=log(cat.clean$SpendRat)
X_train=model_matrix(cat.clean,log(SpendRat)~Age+LenRes+Income+TotAsset+SecAssets+ShortLiq+LongLiq+WlthIdx+SpendVol+SpenVel+CollGifts+BricMortar+MarthaHome+SunAds+ThemeColl+CustDec+RetailKids+TeenWr+Carlovers+CountryColl)
parameters <- c(seq(0.1, 2, by =0.1) , seq(2, 5, 0.5) , seq(5, 25, 1))
ridge.fit<-train(y=y_train,x=X_train,method='glmnet',trControl=train_control,tuneGrid=expand.grid(alpha=0,lambda = parameters))
Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.
print(ridge.fit)
glmnet
184 samples
21 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 166, 166, 165, 165, 167, 165, ...
Resampling results across tuning parameters:
lambda RMSE Rsquared MAE
0.1 1.267051 0.3366916 1.0119377
0.2 1.252265 0.3452095 0.9998015
0.3 1.246772 0.3490908 0.9954463
0.4 1.244588 0.3512467 0.9943238
0.5 1.244173 0.3525806 0.9944682
0.6 1.244855 0.3534525 0.9953885
0.7 1.246272 0.3540359 0.9971209
0.8 1.248201 0.3544200 0.9994983
0.9 1.250498 0.3546429 1.0020119
1.0 1.253046 0.3547497 1.0045020
1.1 1.255770 0.3547620 1.0070238
1.2 1.258615 0.3546985 1.0096574
1.3 1.261542 0.3545660 1.0123627
1.4 1.264510 0.3543813 1.0150804
1.5 1.267494 0.3541502 1.0177404
1.6 1.270489 0.3538833 1.0203233
1.7 1.273477 0.3535828 1.0227834
1.8 1.276424 0.3532558 1.0251233
1.9 1.279358 0.3529077 1.0273822
2.0 1.282242 0.3525407 1.0295746
2.5 1.296000 0.3505258 1.0399743
3.0 1.308477 0.3484035 1.0494215
3.5 1.319693 0.3463194 1.0577313
4.0 1.329801 0.3443176 1.0654698
4.5 1.338892 0.3424487 1.0731297
5.0 1.347120 0.3407101 1.0800273
6.0 1.361405 0.3376191 1.0924368
7.0 1.373374 0.3349786 1.1028297
8.0 1.383498 0.3327413 1.1115573
9.0 1.392224 0.3307994 1.1190655
10.0 1.399794 0.3291209 1.1254942
11.0 1.406425 0.3276583 1.1311264
12.0 1.412285 0.3263728 1.1360674
13.0 1.417495 0.3252379 1.1403968
14.0 1.422162 0.3242282 1.1442105
15.0 1.426381 0.3233129 1.1476060
16.0 1.430212 0.3224853 1.1506694
17.0 1.433677 0.3217527 1.1534134
18.0 1.436858 0.3210735 1.1559043
19.0 1.439782 0.3204541 1.1582330
20.0 1.442458 0.3198946 1.1603602
21.0 1.444955 0.3193679 1.1623622
22.0 1.447243 0.3188948 1.1641816
23.0 1.449396 0.3184437 1.1658883
24.0 1.451380 0.3180365 1.1674491
25.0 1.453249 0.3176497 1.1689215
Tuning parameter 'alpha' was held constant at a value of 0
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0 and lambda = 0.5.
c. Fit a lasso model on the training set, with \(\lambda\) chosen by cross-validation. Report the CV error obtained, along with the number of non-zero coefficient estimates.
y_train=log(cat.clean$SpendRat)
X_train=model_matrix(cat.clean,log(SpendRat)~Age+LenRes+Income+TotAsset+SecAssets+ShortLiq+LongLiq+WlthIdx+SpendVol+SpenVel+CollGifts+BricMortar+MarthaHome+SunAds+ThemeColl+CustDec+RetailKids+TeenWr+Carlovers+CountryColl)
parameters <- c(seq(0.1, 2, by =0.1) , seq(2, 5, 0.5) , seq(5, 25, 1))
lasso.fit<-train(y=y_train,x=X_train,method='glmnet',trControl=train_control,tuneGrid=expand.grid(alpha=1,lambda = parameters))
Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.Setting row names on a tibble is deprecated.There were missing values in resampled performance measures.Setting row names on a tibble is deprecated.
print(lasso.fit)
glmnet
184 samples
21 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 166, 166, 167, 166, 167, 165, ...
Resampling results across tuning parameters:
lambda RMSE Rsquared MAE
0.1 1.252745 0.34116209 0.9993008
0.2 1.301890 0.32135334 1.0412411
0.3 1.351439 0.31181029 1.0803331
0.4 1.415902 0.27410827 1.1336850
0.5 1.474425 0.22194382 1.1817590
0.6 1.511403 0.04943936 1.2115716
0.7 1.513084 NaN 1.2133441
0.8 1.513084 NaN 1.2133441
0.9 1.513084 NaN 1.2133441
1.0 1.513084 NaN 1.2133441
1.1 1.513084 NaN 1.2133441
1.2 1.513084 NaN 1.2133441
1.3 1.513084 NaN 1.2133441
1.4 1.513084 NaN 1.2133441
1.5 1.513084 NaN 1.2133441
1.6 1.513084 NaN 1.2133441
1.7 1.513084 NaN 1.2133441
1.8 1.513084 NaN 1.2133441
1.9 1.513084 NaN 1.2133441
2.0 1.513084 NaN 1.2133441
2.5 1.513084 NaN 1.2133441
3.0 1.513084 NaN 1.2133441
3.5 1.513084 NaN 1.2133441
4.0 1.513084 NaN 1.2133441
4.5 1.513084 NaN 1.2133441
5.0 1.513084 NaN 1.2133441
6.0 1.513084 NaN 1.2133441
7.0 1.513084 NaN 1.2133441
8.0 1.513084 NaN 1.2133441
9.0 1.513084 NaN 1.2133441
10.0 1.513084 NaN 1.2133441
11.0 1.513084 NaN 1.2133441
12.0 1.513084 NaN 1.2133441
13.0 1.513084 NaN 1.2133441
14.0 1.513084 NaN 1.2133441
15.0 1.513084 NaN 1.2133441
16.0 1.513084 NaN 1.2133441
17.0 1.513084 NaN 1.2133441
18.0 1.513084 NaN 1.2133441
19.0 1.513084 NaN 1.2133441
20.0 1.513084 NaN 1.2133441
21.0 1.513084 NaN 1.2133441
22.0 1.513084 NaN 1.2133441
23.0 1.513084 NaN 1.2133441
24.0 1.513084 NaN 1.2133441
25.0 1.513084 NaN 1.2133441
Tuning parameter 'alpha' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 0.1.
d. Fit a PCR model on the training set, with M chosen by cross-validation. Report the CV error obtained, along with the value of M selected by cross-validation.
pcr.fit<-train(log(SpendRat)~., data=cat.clean, trControl=train_control,tuneLength=ncol(cat.clean),method='pcr')
1 package is needed for this model and is not installed. (pls). Would you like to try to install it now?
1: yes
2: no
1
Installing package into 㤼㸱C:/Users/diego/Documents/R/win-library/3.6㤼㸲
(as 㤼㸱lib㤼㸲 is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/pls_2.7-2.zip'
Content type 'application/zip' length 1230527 bytes (1.2 MB)
downloaded 1.2 MB
package ‘pls’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\diego\AppData\Local\Temp\Rtmp2NTT8O\downloaded_packages
plot(pcr.fit)
pcr.fit$bestTune
print(pcr.fit)
Principal Component Analysis
184 samples
20 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 166, 166, 166, 165, 166, 165, ...
Resampling results across tuning parameters:
ncomp RMSE Rsquared MAE
1 1.503001 0.05124398 1.223822
2 1.500916 0.05030193 1.225134
3 1.510551 0.06080604 1.240493
4 1.507454 0.04496214 1.229418
5 1.508424 0.04632181 1.229763
6 1.547384 0.04228428 1.255670
7 1.553237 0.03493156 1.261259
8 1.602087 0.03715311 1.285698
9 1.679544 0.04308677 1.309618
10 1.713405 0.03979775 1.315119
11 1.607684 0.10911150 1.246436
12 1.489018 0.23842172 1.137157
13 1.435726 0.30231337 1.066561
14 1.436901 0.30390130 1.065393
15 1.361983 0.32455326 1.035026
16 1.360091 0.32414467 1.038810
17 1.370272 0.32137060 1.047106
18 1.394330 0.32325615 1.056278
19 1.398918 0.31781916 1.059788
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 16.
e. Fit a PLS model on the training set, with M chosen by cross-validation. Report the CV error obtained, along with the value of M selected by cross-validation.
pls.fit<-train(log(SpendRat)~., data=cat.clean, trControl=train_control,tuneLength=ncol(cat.clean),method='pls')
plot(pls.fit)
pls.fit$bestTune
print(pls.fit)
Partial Least Squares
184 samples
20 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 166, 165, 164, 166, 165, 166, ...
Resampling results across tuning parameters:
ncomp RMSE Rsquared MAE
1 1.514124 0.05180940 1.234679
2 1.519536 0.04003535 1.236803
3 1.521734 0.03980321 1.238881
4 1.517557 0.03787143 1.227074
5 1.589556 0.02953898 1.263265
6 1.615449 0.03619017 1.282169
7 1.596234 0.04111200 1.278422
8 1.777183 0.04538481 1.323395
9 2.020306 0.06903793 1.377884
10 1.485052 0.21801695 1.129988
11 1.359647 0.29279979 1.039547
12 1.366636 0.30431906 1.036349
13 1.370391 0.29791756 1.047712
14 1.396472 0.29764361 1.055757
15 1.394893 0.29916053 1.053718
16 1.391851 0.29975506 1.052253
17 1.391739 0.29968962 1.052281
18 1.391706 0.29971672 1.052274
19 1.391703 0.29971739 1.052272
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 11.
f. Comment on the results obtained. How accurately can we predict the Spending Ratio? Is there much difference among the CV errors resulting from these five approaches?
| Model | RMSE |
|---|---|
| Lasso Regression | 1.2527448 |
| Ridge Regression | 1.2441734 |
| Principle Components Regression | 1.3600913 |
| Partial Least Squares Regression | 1.3596472 |
| Linear Regression | 1.3537877 |
Best model was Ridge Regression with \(\lambda\) = 0.1 with a RMSE of 1.2441734. The worst performing model was the Principal components model with a RMSE of 1.5030006, 1.5009159, 1.5105507, 1.507454, 1.5084236, 1.5473836, 1.5532372, 1.6020867, 1.6795437, 1.7134054, 1.6076836, 1.4890181, 1.4357262, 1.4369009, 1.3619829, 1.3600913, 1.3702716, 1.3943295, 1.3989184. There’s not a substantial amount of difference of CV RMSE across the five models.
In this problem, you will develop a model to predict whether income exceeds $50K/yr based on census data.
a. Use the code in Blackboard to create the adult data set.
adult<-read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", col_names=FALSE, na='?')
Parsed with column specification:
cols(
X1 = [32mcol_double()[39m,
X2 = [31mcol_character()[39m,
X3 = [32mcol_double()[39m,
X4 = [31mcol_character()[39m,
X5 = [32mcol_double()[39m,
X6 = [31mcol_character()[39m,
X7 = [31mcol_character()[39m,
X8 = [31mcol_character()[39m,
X9 = [31mcol_character()[39m,
X10 = [31mcol_character()[39m,
X11 = [32mcol_double()[39m,
X12 = [32mcol_double()[39m,
X13 = [32mcol_double()[39m,
X14 = [31mcol_character()[39m,
X15 = [31mcol_character()[39m
)
names(adult)<-c("age","workclass","fnlwgt","education","education_num","marital_status","occupation","relationship","race","sex","capital_gain","capital_loss","hours_per_week","native_country","income")
set.seed(1303)
adult$fnlwgt<-NULL
adult$workclass<-as_factor(adult$workclass)
adult$education<-as_factor(adult$education)
adult$marital_status<-as_factor(adult$marital_status)
adult$occupation<-as_factor(adult$occupation)
adult$relationship<-as_factor(adult$relationship)
adult$race<-as_factor(adult$race)
adult$sex<-as_factor(adult$sex)
adult$native_country<-as_factor(adult$native_country)
adult$income<-as_factor(adult$income)
adult<-na.omit(adult)
y=adult$income
X=model_matrix(adult,income~.)
X$`(Intercept)`<-NULL
adult2<-as_tibble(cbind(adult$income,X),.name_repair = "unique")
names(adult2)[1]<-"income"
attach(adult2)
The following objects are masked from adult2 (pos = 4):
age, capital_gain, capital_loss, education_num, education10th,
education11th, education12th, education1st-4th, education5th-6th,
education7th-8th, education9th, educationAssoc-acdm, educationAssoc-voc,
educationDoctorate, educationHS-grad, educationMasters, educationPreschool,
educationProf-school, educationSome-college, hours_per_week, income,
marital_statusDivorced, marital_statusMarried-AF-spouse,
marital_statusMarried-civ-spouse, marital_statusMarried-spouse-absent,
marital_statusSeparated, marital_statusWidowed, native_countryCambodia,
native_countryCanada, native_countryChina, native_countryColumbia,
native_countryCuba, native_countryDominican-Republic, native_countryEcuador,
native_countryEl-Salvador, native_countryEngland, native_countryFrance,
native_countryGermany, native_countryGreece, native_countryGuatemala,
native_countryHaiti, native_countryHoland-Netherlands,
native_countryHonduras, native_countryHong, native_countryHungary,
native_countryIndia, native_countryIran, native_countryIreland,
native_countryItaly, native_countryJamaica, native_countryJapan,
native_countryLaos, native_countryMexico, native_countryNicaragua,
native_countryOutlying-US(Guam-USVI-etc), native_countryPeru,
native_countryPhilippines, native_countryPoland, native_countryPortugal,
native_countryPuerto-Rico, native_countryScotland, native_countrySouth,
native_countryTaiwan, native_countryThailand, native_countryTrinadad&Tobago,
native_countryVietnam, native_countryYugoslavia, occupationArmed-Forces,
occupationCraft-repair, occupationExec-managerial,
occupationFarming-fishing, occupationHandlers-cleaners,
occupationMachine-op-inspct, occupationOther-service,
occupationPriv-house-serv, occupationProf-specialty,
occupationProtective-serv, occupationSales, occupationTech-support,
occupationTransport-moving, raceAmer-Indian-Eskimo, raceAsian-Pac-Islander,
raceBlack, raceOther, relationshipHusband, relationshipOther-relative,
relationshipOwn-child, relationshipUnmarried, relationshipWife, sexFemale,
workclassFederal-gov, workclassLocal-gov, workclassNever-worked,
workclassPrivate, workclassSelf-emp-inc, workclassSelf-emp-not-inc,
workclassWithout-pay
summary(adult2)
income age workclassSelf-emp-not-inc workclassPrivate
<=50K:22654 Min. :17.00 Min. :0.00000 Min. :0.0000
>50K : 7508 1st Qu.:28.00 1st Qu.:0.00000 1st Qu.:0.0000
Median :37.00 Median :0.00000 Median :1.0000
Mean :38.44 Mean :0.08285 Mean :0.7389
3rd Qu.:47.00 3rd Qu.:0.00000 3rd Qu.:1.0000
Max. :90.00 Max. :1.00000 Max. :1.0000
workclassFederal-gov workclassLocal-gov workclassSelf-emp-inc workclassWithout-pay
Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.0000000
1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000000
Median :0.00000 Median :0.00000 Median :0.00000 Median :0.0000000
Mean :0.03126 Mean :0.06853 Mean :0.03561 Mean :0.0004642
3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000000
Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.0000000
workclassNever-worked educationHS-grad education11th educationMasters
Min. :0 Min. :0.0000 Min. :0.00000 Min. :0.00000
1st Qu.:0 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
Median :0 Median :0.0000 Median :0.00000 Median :0.00000
Mean :0 Mean :0.3262 Mean :0.03475 Mean :0.05394
3rd Qu.:0 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :0 Max. :1.0000 Max. :1.00000 Max. :1.00000
education9th educationSome-college educationAssoc-acdm educationAssoc-voc
Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.00000
1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
Median :0.00000 Median :0.0000 Median :0.00000 Median :0.00000
Mean :0.01509 Mean :0.2214 Mean :0.03342 Mean :0.04333
3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.00000
education7th-8th educationDoctorate educationProf-school education5th-6th
Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.000000
1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000
Median :0.00000 Median :0.00000 Median :0.00000 Median :0.000000
Mean :0.01847 Mean :0.01243 Mean :0.01797 Mean :0.009548
3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000
Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.000000
education10th education1st-4th educationPreschool education12th education_num
Min. :0.00000 Min. :0.000000 Min. :0.000000 Min. :0.0000 Min. : 1.00
1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.: 9.00
Median :0.00000 Median :0.000000 Median :0.000000 Median :0.0000 Median :10.00
Mean :0.02719 Mean :0.005006 Mean :0.001492 Mean :0.0125 Mean :10.12
3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:13.00
Max. :1.00000 Max. :1.000000 Max. :1.000000 Max. :1.0000 Max. :16.00
marital_statusMarried-civ-spouse marital_statusDivorced
Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000
Mean :0.4663 Mean :0.1397
3rd Qu.:1.0000 3rd Qu.:0.0000
Max. :1.0000 Max. :1.0000
marital_statusMarried-spouse-absent marital_statusSeparated
Min. :0.00000 Min. :0.00000
1st Qu.:0.00000 1st Qu.:0.00000
Median :0.00000 Median :0.00000
Mean :0.01227 Mean :0.03113
3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :1.00000 Max. :1.00000
marital_statusMarried-AF-spouse marital_statusWidowed occupationExec-managerial
Min. :0.0000000 Min. :0.00000 Min. :0.0000
1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.0000
Median :0.0000000 Median :0.00000 Median :0.0000
Mean :0.0006962 Mean :0.02742 Mean :0.1324
3rd Qu.:0.0000000 3rd Qu.:0.00000 3rd Qu.:0.0000
Max. :1.0000000 Max. :1.00000 Max. :1.0000
occupationHandlers-cleaners occupationProf-specialty occupationOther-service
Min. :0.00000 Min. :0.0000 Min. :0.0000
1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.00000 Median :0.0000 Median :0.0000
Mean :0.04476 Mean :0.1339 Mean :0.1065
3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.0000
Max. :1.00000 Max. :1.0000 Max. :1.0000
occupationSales occupationCraft-repair occupationTransport-moving
Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
Median :0.0000 Median :0.0000 Median :0.00000
Mean :0.1188 Mean :0.1336 Mean :0.05212
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
Max. :1.0000 Max. :1.0000 Max. :1.00000
occupationFarming-fishing occupationMachine-op-inspct occupationTech-support
Min. :0.00000 Min. :0.00000 Min. :0.00000
1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
Median :0.00000 Median :0.00000 Median :0.00000
Mean :0.03279 Mean :0.06518 Mean :0.03024
3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :1.00000 Max. :1.00000 Max. :1.00000
occupationProtective-serv occupationArmed-Forces occupationPriv-house-serv
Min. :0.00000 Min. :0.0000000 Min. :0.000000
1st Qu.:0.00000 1st Qu.:0.0000000 1st Qu.:0.000000
Median :0.00000 Median :0.0000000 Median :0.000000
Mean :0.02135 Mean :0.0002984 Mean :0.004741
3rd Qu.:0.00000 3rd Qu.:0.0000000 3rd Qu.:0.000000
Max. :1.00000 Max. :1.0000000 Max. :1.000000
relationshipHusband relationshipWife relationshipOwn-child relationshipUnmarried
Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.00000 Median :0.0000 Median :0.0000
Mean :0.4132 Mean :0.04661 Mean :0.1481 Mean :0.1065
3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.0000
Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000
relationshipOther-relative raceBlack raceAsian-Pac-Islander
Min. :0.00000 Min. :0.0000 Min. :0.00000
1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
Median :0.00000 Median :0.0000 Median :0.00000
Mean :0.02947 Mean :0.0934 Mean :0.02967
3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
Max. :1.00000 Max. :1.0000 Max. :1.00000
raceAmer-Indian-Eskimo raceOther sexFemale capital_gain
Min. :0.000000 Min. :0.000000 Min. :0.0000 Min. : 0
1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.: 0
Median :0.000000 Median :0.000000 Median :0.0000 Median : 0
Mean :0.009482 Mean :0.007659 Mean :0.3243 Mean : 1092
3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:1.0000 3rd Qu.: 0
Max. :1.000000 Max. :1.000000 Max. :1.0000 Max. :99999
capital_loss hours_per_week native_countryCuba native_countryJamaica
Min. : 0.00 Min. : 1.00 Min. :0.00000 Min. :0.000000
1st Qu.: 0.00 1st Qu.:40.00 1st Qu.:0.00000 1st Qu.:0.000000
Median : 0.00 Median :40.00 Median :0.00000 Median :0.000000
Mean : 88.37 Mean :40.93 Mean :0.00305 Mean :0.002652
3rd Qu.: 0.00 3rd Qu.:45.00 3rd Qu.:0.00000 3rd Qu.:0.000000
Max. :4356.00 Max. :99.00 Max. :1.00000 Max. :1.000000
native_countryIndia native_countryMexico native_countrySouth native_countryPuerto-Rico
Min. :0.000000 Min. :0.00000 Min. :0.000000 Min. :0.000000
1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.000000
Median :0.000000 Median :0.00000 Median :0.000000 Median :0.000000
Mean :0.003315 Mean :0.02022 Mean :0.002354 Mean :0.003614
3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.000000
Max. :1.000000 Max. :1.00000 Max. :1.000000 Max. :1.000000
native_countryHonduras native_countryEngland native_countryCanada native_countryGermany
Min. :0.0000000 Min. :0.000000 Min. :0.000000 Min. :0.000000
1st Qu.:0.0000000 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000
Median :0.0000000 Median :0.000000 Median :0.000000 Median :0.000000
Mean :0.0003979 Mean :0.002851 Mean :0.003547 Mean :0.004244
3rd Qu.:0.0000000 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.000000
Max. :1.0000000 Max. :1.000000 Max. :1.000000 Max. :1.000000
native_countryIran native_countryPhilippines native_countryItaly native_countryPoland
Min. :0.000000 Min. :0.000000 Min. :0.000000 Min. :0.000000
1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000
Median :0.000000 Median :0.000000 Median :0.000000 Median :0.000000
Mean :0.001393 Mean :0.006233 Mean :0.002254 Mean :0.001857
3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.000000
Max. :1.000000 Max. :1.000000 Max. :1.000000 Max. :1.000000
native_countryColumbia native_countryCambodia native_countryThailand
Min. :0.000000 Min. :0.0000000 Min. :0.0000000
1st Qu.:0.000000 1st Qu.:0.0000000 1st Qu.:0.0000000
Median :0.000000 Median :0.0000000 Median :0.0000000
Mean :0.001857 Mean :0.0005968 Mean :0.0005636
3rd Qu.:0.000000 3rd Qu.:0.0000000 3rd Qu.:0.0000000
Max. :1.000000 Max. :1.0000000 Max. :1.0000000
native_countryEcuador native_countryLaos native_countryTaiwan native_countryHaiti
Min. :0.0000000 Min. :0.0000000 Min. :0.000000 Min. :0.000000
1st Qu.:0.0000000 1st Qu.:0.0000000 1st Qu.:0.000000 1st Qu.:0.000000
Median :0.0000000 Median :0.0000000 Median :0.000000 Median :0.000000
Mean :0.0008952 Mean :0.0005636 Mean :0.001393 Mean :0.001393
3rd Qu.:0.0000000 3rd Qu.:0.0000000 3rd Qu.:0.000000 3rd Qu.:0.000000
Max. :1.0000000 Max. :1.0000000 Max. :1.000000 Max. :1.000000
native_countryPortugal native_countryDominican-Republic native_countryEl-Salvador
Min. :0.000000 Min. :0.000000 Min. :0.000000
1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000
Median :0.000000 Median :0.000000 Median :0.000000
Mean :0.001127 Mean :0.002221 Mean :0.003315
3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.000000
Max. :1.000000 Max. :1.000000 Max. :1.000000
native_countryFrance native_countryGuatemala native_countryChina native_countryJapan
Min. :0.0000000 Min. :0.000000 Min. :0.000000 Min. :0.000000
1st Qu.:0.0000000 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000
Median :0.0000000 Median :0.000000 Median :0.000000 Median :0.000000
Mean :0.0008952 Mean :0.002089 Mean :0.002254 Mean :0.001956
3rd Qu.:0.0000000 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.000000
Max. :1.0000000 Max. :1.000000 Max. :1.000000 Max. :1.000000
native_countryYugoslavia native_countryPeru native_countryOutlying-US(Guam-USVI-etc)
Min. :0.0000000 Min. :0.0000000 Min. :0.0000000
1st Qu.:0.0000000 1st Qu.:0.0000000 1st Qu.:0.0000000
Median :0.0000000 Median :0.0000000 Median :0.0000000
Mean :0.0005305 Mean :0.0009946 Mean :0.0004642
3rd Qu.:0.0000000 3rd Qu.:0.0000000 3rd Qu.:0.0000000
Max. :1.0000000 Max. :1.0000000 Max. :1.0000000
native_countryScotland native_countryTrinadad&Tobago native_countryGreece
Min. :0.0000000 Min. :0.0000000 Min. :0.0000000
1st Qu.:0.0000000 1st Qu.:0.0000000 1st Qu.:0.0000000
Median :0.0000000 Median :0.0000000 Median :0.0000000
Mean :0.0003647 Mean :0.0005968 Mean :0.0009615
3rd Qu.:0.0000000 3rd Qu.:0.0000000 3rd Qu.:0.0000000
Max. :1.0000000 Max. :1.0000000 Max. :1.0000000
native_countryNicaragua native_countryVietnam native_countryHong native_countryIreland
Min. :0.000000 Min. :0.000000 Min. :0.0000000 Min. :0.0000000
1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.0000000 1st Qu.:0.0000000
Median :0.000000 Median :0.000000 Median :0.0000000 Median :0.0000000
Mean :0.001094 Mean :0.002122 Mean :0.0006299 Mean :0.0007957
3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.0000000 3rd Qu.:0.0000000
Max. :1.000000 Max. :1.000000 Max. :1.0000000 Max. :1.0000000
native_countryHungary native_countryHoland-Netherlands
Min. :0.000000 Min. :0.00e+00
1st Qu.:0.000000 1st Qu.:0.00e+00
Median :0.000000 Median :0.00e+00
Mean :0.000431 Mean :3.32e-05
3rd Qu.:0.000000 3rd Qu.:0.00e+00
Max. :1.000000 Max. :1.00e+00
b. Explore the data graphically in order to investigate the association between income and the other features. Which of the other features seem most likely to be useful in predicting income? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
Using random forest method for feature selection.
##Generate a small sample of the data set to investigate which variables are closely associated with income
k=sample(nrow(adult2),nrow(adult2)*0.08)
library('e1071')
##Use random forest model to calculate variable importance
rf=train(income~.,data=adult2[k,])
It appears my computer is not powerful enough to run this. c. Split the data into an 80% training set and a 20% test set. Set the seed at 1303.
##Create training indicator vector
set.seed(1303)
inTrain <- createDataPartition(adult2$income, p=0.8, list=FALSE)
##Tabulate training and test data sets
train=adult2[inTrain,]
test=adult2[-inTrain,]
dim(adult2)
[1] 30162 97
dim(train)
[1] 24131 97
dim(test)
[1] 6031 97
d. Perform LDA on the training data in order to predict income using the variables that seemed most associated with income in (b). What is the test error of the model obtained?
##Train Model
lda.fit=train(income~age+`marital_statusMarried-civ-spouse`+capital_gain+education_num+hours_per_week+relationshipHusband+capital_loss+`occupationExec-managerial`+workclassPrivate,data=train,method='lda',trControl = trainControl(method = "cv"))
lda.fit
Linear Discriminant Analysis
24131 samples
9 predictor
2 classes: '<=50K', '>50K'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 21719, 21719, 21718, 21718, 21718, 21717, ...
Resampling results:
Accuracy Kappa
0.8317514 0.5150774
##Calculate Predictions
pred.lda<-predict(lda.fit,test)
##Estimate Accuracy
confusionMatrix(pred.lda,test$income)
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 4181 686
>50K 349 815
Accuracy : 0.8284
95% CI : (0.8186, 0.8378)
No Information Rate : 0.7511
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5037
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9230
Specificity : 0.5430
Pos Pred Value : 0.8591
Neg Pred Value : 0.7002
Prevalence : 0.7511
Detection Rate : 0.6933
Detection Prevalence : 0.8070
Balanced Accuracy : 0.7330
'Positive' Class : <=50K
e. Perform QDA on the training data in order to predict income using the variables that seemed most associated with income in (b). What is the test error of the model obtained?
##Train Model
qda.fit=train(income~age+`marital_statusMarried-civ-spouse`+capital_gain+education_num+hours_per_week+relationshipHusband+capital_loss+`occupationExec-managerial`+workclassPrivate,data=train,method='qda',trControl = trainControl(method = "cv"))
qda.fit
Quadratic Discriminant Analysis
24131 samples
9 predictor
2 classes: '<=50K', '>50K'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 21717, 21718, 21719, 21718, 21717, 21718, ...
Resampling results:
Accuracy Kappa
0.7985565 0.3967202
##Calculate Predictions
pred.qda<-predict(qda.fit,test)
##Estimate Accuracy
confusionMatrix(pred.qda,test$income)
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 4142 858
>50K 388 643
Accuracy : 0.7934
95% CI : (0.783, 0.8036)
No Information Rate : 0.7511
P-Value [Acc > NIR] : 5.204e-15
Kappa : 0.3828
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9143
Specificity : 0.4284
Pos Pred Value : 0.8284
Neg Pred Value : 0.6237
Prevalence : 0.7511
Detection Rate : 0.6868
Detection Prevalence : 0.8290
Balanced Accuracy : 0.6714
'Positive' Class : <=50K
f. Perform logistic regression on the training data in order to predict income using the variables that seemed most associated with income in (b). What is the test error of the model obtained?
##Train Model
glm.fit=train(income~age+`marital_statusMarried-civ-spouse`+capital_gain+education_num+hours_per_week+relationshipHusband+capital_loss+`occupationExec-managerial`+workclassPrivate,data=train,method='glm',trControl = trainControl(method = "cv"))
glm.fit: fitted probabilities numerically 0 or 1 occurredglm.fit: fitted probabilities numerically 0 or 1 occurredglm.fit: fitted probabilities numerically 0 or 1 occurredglm.fit: fitted probabilities numerically 0 or 1 occurredglm.fit: fitted probabilities numerically 0 or 1 occurredglm.fit: fitted probabilities numerically 0 or 1 occurredglm.fit: fitted probabilities numerically 0 or 1 occurredglm.fit: fitted probabilities numerically 0 or 1 occurredglm.fit: fitted probabilities numerically 0 or 1 occurredglm.fit: fitted probabilities numerically 0 or 1 occurredglm.fit: fitted probabilities numerically 0 or 1 occurred
glm.fit
Generalized Linear Model
24131 samples
9 predictor
2 classes: '<=50K', '>50K'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 21719, 21719, 21718, 21718, 21717, 21717, ...
Resampling results:
Accuracy Kappa
0.8431492 0.5491027
##Calculate Predictions
pred.glm<-predict(glm.fit,test)
##Estimate Accuracy
confusionMatrix(pred.glm,test$income)
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 4201 667
>50K 329 834
Accuracy : 0.8349
95% CI : (0.8252, 0.8441)
No Information Rate : 0.7511
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5223
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9274
Specificity : 0.5556
Pos Pred Value : 0.8630
Neg Pred Value : 0.7171
Prevalence : 0.7511
Detection Rate : 0.6966
Detection Prevalence : 0.8072
Balanced Accuracy : 0.7415
'Positive' Class : <=50K
g. Perform KNN on the training data, with several values of K, in order to predict income. Use only the variables that seemed most associated with income in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
##Train Model, Let CV choose value for K
knn.fit<-train(income~age+`marital_statusMarried-civ-spouse`+capital_gain+education_num+hours_per_week+relationshipHusband+capital_loss+`occupationExec-managerial`+workclassPrivate,data=train,method='knn',trControl = trainControl(method = "cv"), tuneLength=20)
knn.fit
k-Nearest Neighbors
24131 samples
9 predictor
2 classes: '<=50K', '>50K'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 21717, 21718, 21718, 21718, 21717, 21718, ...
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.8449293 0.5537778
7 0.8478300 0.5586043
9 0.8456334 0.5502511
11 0.8462139 0.5497724
13 0.8451778 0.5459386
15 0.8450536 0.5436919
17 0.8448462 0.5428098
19 0.8438933 0.5382269
21 0.8437273 0.5380229
23 0.8446805 0.5398063
25 0.8431473 0.5349175
27 0.8428985 0.5338392
29 0.8431886 0.5335989
31 0.8421525 0.5296729
33 0.8404119 0.5233287
35 0.8403705 0.5220293
37 0.8390445 0.5177549
39 0.8392101 0.5175455
41 0.8387545 0.5163369
43 0.8385058 0.5154016
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 7.
##Calculate Predictions
pred.knn<-predict(knn.fit,test)
##Estimate Accuracy
confusionMatrix(pred.knn,test$income)
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 4270 644
>50K 260 857
Accuracy : 0.8501
95% CI : (0.8408, 0.859)
No Information Rate : 0.7511
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5616
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9426
Specificity : 0.5710
Pos Pred Value : 0.8689
Neg Pred Value : 0.7672
Prevalence : 0.7511
Detection Rate : 0.7080
Detection Prevalence : 0.8148
Balanced Accuracy : 0.7568
'Positive' Class : <=50K
h. Choose which model predicts income the best and justify your choice.
| Model | Accuracy |
|---|---|
| Linear Discriminant | 0.7985565 |
| Quadratic Discriminant | 0.8317514 |
| K Nearest Neighbors (K=7) | 0.8385058 |
| Logistic Regression | 0.8431492 |
It appears KNN is the optimal way to run this classification problem, as it is the most accurate method given the results, with QDA giving the lowest amount of accuracy. While this is sensitive to other data, so are the other models at a similar percentage, making KNN the best choice.