STA 4143 Data Mining Midterm in CARET

library('tidyverse')
library('caret')
library('modelr')
set.seed(303)

Part 2: Regression (40 Points)

The table below displays catalog-spending data for the first few of 200 randomly selected individuals from a very large (over 20,000 households) database. The variable of particular interest is catalog spending as measured by the Spending Ratio (SpendRat). All of the catalog variables are indicator variables: the consumer either bought (coded 1) or did not buy (coded 0). The other variables can be viewed as indexes measuring assets, liquidity, and spending.

catalog<-read_csv('C:/Users/email/Downloads/catalog.csv')
catalog$CollGifts<-as_factor(catalog$CollGifts)
catalog$BricMortar<-as_factor(catalog$BricMortar)
catalog$MarthaHome<-as_factor(catalog$MarthaHome)
catalog$SunAds<-as_factor(catalog$SunAds)
catalog$ThemeColl<-as_factor(catalog$ThemeColl)
catalog$CustDec<-as_factor(catalog$CustDec)
catalog$RetailKids<-as_factor(catalog$RetailKids)
catalog$TeenWr<-as_factor(catalog$TeenWr)
catalog$Carlovers<-as_factor(catalog$Carlovers)
catalog$CountryColl<-as_factor(catalog$CountryColl)
str(catalog)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':    200 obs. of  21 variables:
 $ SpendRat   : num  11.8 16.8 11.4 31.3 1.9 ...
 $ Age        : num  0 35 46 41 46 46 46 56 48 54 ...
 $ LenRes     : num  2 3 9 2 7 15 16 31 8 8 ...
 $ Income     : num  3 5 5 2 9 5 4 6 5 5 ...
 $ TotAsset   : num  122 195 123 117 493 138 162 117 119 50 ...
 $ SecAssets  : num  27 36 24 25 105 27 25 27 23 10 ...
 $ ShortLiq   : num  225 220 200 222 310 340 230 300 250 200 ...
 $ LongLiq    : num  422 420 420 419 500 450 430 440 430 420 ...
 $ WlthIdx    : num  286 430 290 279 520 440 360 400 360 230 ...
 $ SpendVol   : num  503 690 600 543 680 440 690 500 610 660 ...
 $ SpenVel    : num  285 570 280 308 100 50 180 10 0 0 ...
 $ CollGifts  : Factor w/ 2 levels "0","1": 2 1 2 2 1 1 2 2 2 1 ...
 $ BricMortar : Factor w/ 2 levels "0","1": 1 2 1 1 2 2 1 2 1 2 ...
 $ MarthaHome : Factor w/ 2 levels "0","1": 1 2 1 1 2 2 1 2 2 1 ...
 $ SunAds     : Factor w/ 2 levels "0","1": 2 1 2 2 1 1 2 1 1 1 ...
 $ ThemeColl  : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 2 1 ...
 $ CustDec    : Factor w/ 2 levels "0","1": 2 2 2 1 2 2 1 2 2 1 ...
 $ RetailKids : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 2 1 1 ...
 $ TeenWr     : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 2 1 2 ...
 $ Carlovers  : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 2 1 1 ...
 $ CountryColl: Factor w/ 2 levels "0","1": 2 1 2 2 1 1 2 1 2 1 ...
 - attr(*, "spec")=
  .. cols(
  ..   SpendRat = col_double(),
  ..   Age = col_double(),
  ..   LenRes = col_double(),
  ..   Income = col_double(),
  ..   TotAsset = col_double(),
  ..   SecAssets = col_double(),
  ..   ShortLiq = col_double(),
  ..   LongLiq = col_double(),
  ..   WlthIdx = col_double(),
  ..   SpendVol = col_double(),
  ..   SpenVel = col_double(),
  ..   CollGifts = col_double(),
  ..   BricMortar = col_double(),
  ..   MarthaHome = col_double(),
  ..   SunAds = col_double(),
  ..   ThemeColl = col_double(),
  ..   CustDec = col_double(),
  ..   RetailKids = col_double(),
  ..   TeenWr = col_double(),
  ..   Carlovers = col_double(),
  ..   CountryColl = col_double()
  .. )

Data Cleaning

The goal of this section is to explore the data set and get it ready for analysis. There are no missing values in the data set, but there are some incorrect entries that must be identified and removed before completing the analysis. Income is coded as an ordinal value ranging from 1 to 12. Age can be regarded as quantitative, and any value less than 18 is invalid. Length of residence (LenRes) ranges from zero up to a person's age, so LenRes should not exceed Age. You should write a simple one- to two-paragraph summary of this section. Be sure to fully explain the reasoning behind transforming any columns and removing any rows; "Campbell told me to" is not sufficient. Justify why it makes sense to exclude rows whose Age is less than 18, and why we should not use rows in which length of residence is larger than Age.

cat.clean<-filter(catalog,Age>=18,LenRes<=Age)
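Before filtering, it is worth documenting how many rows each rule removes. A minimal check (a sketch using only the validity rules stated above):

## Count rows that violate each validity rule before removing them
catalog %>%
  summarise(under_18   = sum(Age < 18),
            res_gt_age = sum(LenRes > Age),
            total_rows = n())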

Basic Summary

Provide a basic summary of the cleaned data set. Include a table of univariate statistics to summarize each variable, choosing meaningful summary statistics for each type of variable. You should also include a basic summary of catalog spending (SpendRat), including an appropriate graphical display.

dim(cat.clean)
[1] 184  21
summary(cat.clean)
    SpendRat            Age            LenRes          Income         TotAsset     
 Min.   :  0.080   Min.   :21.00   Min.   : 0.00   Min.   :1.000   Min.   :  5.00  
 1st Qu.:  6.077   1st Qu.:44.75   1st Qu.: 8.00   1st Qu.:4.000   1st Qu.: 94.75  
 Median : 18.805   Median :53.00   Median :11.00   Median :5.000   Median :150.00  
 Mean   : 43.792   Mean   :54.71   Mean   :14.58   Mean   :4.473   Mean   :184.67  
 3rd Qu.: 50.273   3rd Qu.:63.00   3rd Qu.:19.00   3rd Qu.:5.000   3rd Qu.:222.50  
 Max.   :401.420   Max.   :89.00   Max.   :46.00   Max.   :9.000   Max.   :999.00  
   SecAssets        ShortLiq        LongLiq         WlthIdx         SpendVol        SpenVel     
 Min.   :  0.0   Min.   :160.0   Min.   :400.0   Min.   : 90.0   Min.   :  0.0   Min.   :  0.0  
 1st Qu.: 19.0   1st Qu.:210.0   1st Qu.:420.0   1st Qu.:300.0   1st Qu.:532.0   1st Qu.: 40.0  
 Median : 28.0   Median :230.0   Median :430.0   Median :360.0   Median :610.0   Median :160.0  
 Mean   : 40.9   Mean   :240.6   Mean   :439.5   Mean   :367.1   Mean   :568.4   Mean   :219.5  
 3rd Qu.: 42.0   3rd Qu.:260.0   3rd Qu.:440.0   3rd Qu.:430.0   3rd Qu.:670.0   3rd Qu.:310.0  
 Max.   :999.0   Max.   :999.0   Max.   :999.0   Max.   :880.0   Max.   :780.0   Max.   :999.0  
 CollGifts BricMortar MarthaHome SunAds  ThemeColl CustDec RetailKids TeenWr Carlovers
 0:94      0:131      0:117      0:105   0:111     0:120   0:119      0:89   0:133    
 1:90      1: 53      1: 67      1: 79   1: 73     1: 64   1: 65      1:95   1: 51    
                                                                                      
                                                                                      
                                                                                      
                                                                                      
 CountryColl
 0:107      
 1: 77      
            
            
            
            
cat.clean %>%
  ggplot(aes(x=log(SpendRat))) + geom_histogram()

Modeling

We are interested in developing a model to predict spending ratio. Find a multiple regression model for predicting the amount of money that consumers will spend on catalog shopping, as measured by spending ratio. Your goal is to identify the best model you can. In your write-up, be sure to justify your choice of model, discuss any transformations you make to the variables, discuss your model fit, and discuss the effect of the significant predictors using both hypothesis tests and confidence intervals. Remember to check the conditions for inference as you evaluate your models. The data set is much too small to split into training and test sets, so use cross-validation in all your models.

## Set up Repeated k-fold Cross Validation
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)

a. Fit a linear model using least squares on the training set, and report the CV error obtained.

lm.fit<-train(log(SpendRat)~., data=cat.clean, trControl=train_control,method='lm')
print(lm.fit)
Linear Regression 

184 samples
 20 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 167, 167, 165, 165, 165, 166, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  1.394231  0.3002504  1.067814

Tuning parameter 'intercept' was held constant at a value of TRUE
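The modeling prompt asks us to check the conditions for inference. One way to do so (a sketch; final_lm is a name introduced here, and the diagnostics use base R's plot method for lm objects) is to refit the same formula with lm() and examine the standard diagnostic plots:

## Refit the same formula with lm() to check inference conditions
final_lm <- lm(log(SpendRat) ~ ., data = cat.clean)
par(mfrow = c(2, 2))
plot(final_lm)  # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))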

b. Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the CV error obtained.

y_train=log(cat.clean$SpendRat)
## model_matrix() returns a tibble that includes an "(Intercept)" column,
## which is why caret reports 21 predictors below
X_train=model_matrix(cat.clean,log(SpendRat)~Age+LenRes+Income+TotAsset+SecAssets+ShortLiq+LongLiq+WlthIdx+SpendVol+SpenVel+CollGifts+BricMortar+MarthaHome+SunAds+ThemeColl+CustDec+RetailKids+TeenWr+Carlovers+CountryColl)
## Candidate lambda values: a fine grid near zero, coarser farther out
parameters <- c(seq(0.1, 2, by = 0.1), seq(2.5, 5, by = 0.5), seq(6, 25, by = 1))
## alpha = 0 gives the ridge penalty in glmnet
ridge.fit<-train(y=y_train,x=X_train,method='glmnet',trControl=train_control,tuneGrid=expand.grid(alpha=0,lambda = parameters))
print(ridge.fit)
glmnet 

184 samples
 21 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 165, 167, 168, 164, 165, 165, ... 
Resampling results across tuning parameters:

  lambda  RMSE      Rsquared   MAE      
   0.1    1.282431  0.3389732  1.0136866
   0.2    1.262920  0.3477084  1.0012533
   0.3    1.255321  0.3519455  0.9979231
   0.4    1.252062  0.3543630  0.9972646
   0.5    1.251011  0.3558583  0.9976626
   0.6    1.251259  0.3568258  0.9990369
   0.7    1.252338  0.3574662  1.0010517
   0.8    1.253995  0.3578612  1.0033494
   0.9    1.256056  0.3580738  1.0058887
   1.0    1.258394  0.3581592  1.0086132
   1.1    1.260931  0.3581387  1.0113090
   1.2    1.263603  0.3580336  1.0139815
   1.3    1.266368  0.3578600  1.0166161
   1.4    1.269187  0.3576379  1.0192355
   1.5    1.272041  0.3573677  1.0217359
   1.6    1.274899  0.3570578  1.0241457
   1.7    1.277760  0.3567171  1.0265044
   1.8    1.280606  0.3563590  1.0287912
   1.9    1.283413  0.3559797  1.0309755
   2.0    1.286203  0.3555893  1.0330928
   2.5    1.299502  0.3534787  1.0432612
   3.0    1.311641  0.3513105  1.0526532
   3.5    1.322620  0.3492042  1.0618671
   4.0    1.332510  0.3471970  1.0703347
   4.5    1.341475  0.3453146  1.0779618
   5.0    1.349588  0.3435733  1.0848968
   6.0    1.363703  0.3404632  1.0974258
   7.0    1.375527  0.3378082  1.1077753
   8.0    1.385618  0.3355069  1.1165298
   9.0    1.394274  0.3335277  1.1238685
  10.0    1.401810  0.3317875  1.1302016
  11.0    1.408425  0.3302622  1.1356920
  12.0    1.414280  0.3289142  1.1405733
  13.0    1.419500  0.3277156  1.1448879
  14.0    1.424186  0.3266288  1.1487019
  15.0    1.428399  0.3256585  1.1521298
  16.0    1.432213  0.3247865  1.1551995
  17.0    1.435707  0.3239915  1.1579793
  18.0    1.438887  0.3232622  1.1604801
  19.0    1.441809  0.3226010  1.1627590
  20.0    1.444514  0.3219793  1.1648501
  21.0    1.446994  0.3214221  1.1667773
  22.0    1.449315  0.3208923  1.1685728
  23.0    1.451452  0.3204155  1.1702319
  24.0    1.453462  0.3199579  1.1717784
  25.0    1.455326  0.3195462  1.1732128

Tuning parameter 'alpha' was held constant at a value of 0
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0 and lambda = 0.5.

c. Fit a lasso model on the training set, with \(\lambda\) chosen by cross-validation. Report the CV error obtained, along with the number of non-zero coefficient estimates.

## Same response, design matrix, and lambda grid as in part (b)
y_train=log(cat.clean$SpendRat)
X_train=model_matrix(cat.clean,log(SpendRat)~Age+LenRes+Income+TotAsset+SecAssets+ShortLiq+LongLiq+WlthIdx+SpendVol+SpenVel+CollGifts+BricMortar+MarthaHome+SunAds+ThemeColl+CustDec+RetailKids+TeenWr+Carlovers+CountryColl)
parameters <- c(seq(0.1, 2, by = 0.1), seq(2.5, 5, by = 0.5), seq(6, 25, by = 1))
## alpha = 1 gives the lasso penalty in glmnet
lasso.fit<-train(y=y_train,x=X_train,method='glmnet',trControl=train_control,tuneGrid=expand.grid(alpha=1,lambda = parameters))
print(lasso.fit)
glmnet 

184 samples
 21 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 166, 167, 165, 165, 167, 168, ... 
Resampling results across tuning parameters:

  lambda  RMSE      Rsquared    MAE      
   0.1    1.250853  0.35615328  0.9923317
   0.2    1.293874  0.33931292  1.0341048
   0.3    1.341793  0.33336816  1.0739817
   0.4    1.407003  0.29713600  1.1281301
   0.5    1.467919  0.23196363  1.1778427
   0.6    1.508460  0.08560497  1.2102028
   0.7    1.510381         NaN  1.2115618
   0.8    1.510381         NaN  1.2115618
   0.9    1.510381         NaN  1.2115618
   1.0    1.510381         NaN  1.2115618
   1.1    1.510381         NaN  1.2115618
   1.2    1.510381         NaN  1.2115618
   1.3    1.510381         NaN  1.2115618
   1.4    1.510381         NaN  1.2115618
   1.5    1.510381         NaN  1.2115618
   1.6    1.510381         NaN  1.2115618
   1.7    1.510381         NaN  1.2115618
   1.8    1.510381         NaN  1.2115618
   1.9    1.510381         NaN  1.2115618
   2.0    1.510381         NaN  1.2115618
   2.5    1.510381         NaN  1.2115618
   3.0    1.510381         NaN  1.2115618
   3.5    1.510381         NaN  1.2115618
   4.0    1.510381         NaN  1.2115618
   4.5    1.510381         NaN  1.2115618
   5.0    1.510381         NaN  1.2115618
   6.0    1.510381         NaN  1.2115618
   7.0    1.510381         NaN  1.2115618
   8.0    1.510381         NaN  1.2115618
   9.0    1.510381         NaN  1.2115618
  10.0    1.510381         NaN  1.2115618
  11.0    1.510381         NaN  1.2115618
  12.0    1.510381         NaN  1.2115618
  13.0    1.510381         NaN  1.2115618
  14.0    1.510381         NaN  1.2115618
  15.0    1.510381         NaN  1.2115618
  16.0    1.510381         NaN  1.2115618
  17.0    1.510381         NaN  1.2115618
  18.0    1.510381         NaN  1.2115618
  19.0    1.510381         NaN  1.2115618
  20.0    1.510381         NaN  1.2115618
  21.0    1.510381         NaN  1.2115618
  22.0    1.510381         NaN  1.2115618
  23.0    1.510381         NaN  1.2115618
  24.0    1.510381         NaN  1.2115618
  25.0    1.510381         NaN  1.2115618

Tuning parameter 'alpha' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 0.1.
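Part (c) also asks for the number of non-zero coefficient estimates. These can be read off the fitted glmnet object that caret stores in lasso.fit$finalModel (a sketch; coef() returns a sparse coefficient matrix at the chosen \(\lambda\)):

## Count non-zero coefficients at the selected lambda
lasso_coefs <- coef(lasso.fit$finalModel, s = lasso.fit$bestTune$lambda)
sum(lasso_coefs != 0) - 1  # subtract 1 to exclude the intercept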

d. Fit a PCR model on the training set, with M chosen by cross-validation. Report the CV error obtained, along with the value of M selected by cross-validation.

pcr.fit<-train(log(SpendRat)~., data=cat.clean, trControl=train_control,tuneLength=ncol(cat.clean),method='pcr')
plot(pcr.fit)

pcr.fit$bestTune
print(pcr.fit)
Principal Component Analysis 

184 samples
 20 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 167, 165, 166, 166, 165, 165, ... 
Resampling results across tuning parameters:

  ncomp  RMSE      Rsquared    MAE     
   1     1.504416  0.05292512  1.226453
   2     1.502397  0.05696327  1.227758
   3     1.506584  0.05669179  1.238072
   4     1.504600  0.05817332  1.228613
   5     1.504614  0.05487248  1.228104
   6     1.568391  0.04708296  1.268189
   7     1.530471  0.03903891  1.246003
   8     1.563494  0.04184430  1.267878
   9     1.566718  0.03774084  1.267141
  10     1.590034  0.04801677  1.269187
  11     1.493764  0.13038231  1.208531
  12     1.418096  0.23581979  1.105241
  13     1.394111  0.27989803  1.053194
  14     1.394441  0.27807673  1.051699
  15     1.307876  0.31287039  1.020025
  16     1.325327  0.30587889  1.028624
  17     1.337142  0.29880732  1.035104
  18     1.348773  0.30239710  1.044208
  19     1.367447  0.29728069  1.052854

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 15.

e. Fit a PLS model on the training set, with M chosen by cross-validation. Report the CV error obtained, along with the value of M selected by cross-validation.

pls.fit<-train(log(SpendRat)~., data=cat.clean, trControl=train_control,tuneLength=ncol(cat.clean),method='pls')
plot(pls.fit)

pls.fit$bestTune
print(pls.fit)
Partial Least Squares 

184 samples
 20 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 166, 167, 165, 166, 165, 164, ... 
Resampling results across tuning parameters:

  ncomp  RMSE      Rsquared    MAE     
   1     1.515817  0.07985721  1.234183
   2     1.521152  0.07681501  1.235722
   3     1.524430  0.07891818  1.236456
   4     1.525921  0.06506635  1.227040
   5     1.601973  0.06390192  1.276570
   6     1.672190  0.06527717  1.308169
   7     1.621498  0.08136191  1.286456
   8     1.642158  0.09102616  1.282712
   9     1.730764  0.09214255  1.291040
  10     1.480394  0.22943277  1.135545
  11     1.353831  0.31676170  1.046596
  12     1.357684  0.33486266  1.031558
  13     1.361018  0.32809382  1.046569
  14     1.383036  0.32544019  1.055099
  15     1.385985  0.32711307  1.052993
  16     1.382261  0.32780757  1.050593
  17     1.381936  0.32786463  1.050362
  18     1.381863  0.32790378  1.050287
  19     1.381829  0.32791716  1.050270

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 11.

f. Comment on the results obtained. How accurately can we predict the Spending Ratio? Is there much difference among the CV errors resulting from these five approaches?

| Model                            | CV RMSE   |
|----------------------------------|-----------|
| Lasso Regression                 | 1.2508531 |
| Ridge Regression                 | 1.2510108 |
| Principal Components Regression  | 1.3078756 |
| Partial Least Squares Regression | 1.3538306 |
| Linear Regression                | 1.3942314 |

The best model was the lasso with \(\lambda\) = 0.1, at a CV RMSE of 1.2508531; the worst was the linear regression model, at a CV RMSE of 1.3942314. There is not a substantial difference in CV RMSE across the five approaches. Note that every model predicts log(SpendRat), so an RMSE near 1.25 on the log scale corresponds to typical errors of a factor of about exp(1.25) ≈ 3.5 on the original spending-ratio scale; the Spending Ratio cannot be predicted very precisely.
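The comparison table above can also be assembled directly from the fitted train objects rather than by hand (a sketch; each object stores its resampling summaries in $results):

## Collect the minimum CV RMSE from each fitted model
tibble(model   = c("Lasso", "Ridge", "PCR", "PLS", "Linear"),
       cv_rmse = c(min(lasso.fit$results$RMSE),
                   min(ridge.fit$results$RMSE),
                   min(pcr.fit$results$RMSE),
                   min(pls.fit$results$RMSE),
                   lm.fit$results$RMSE)) %>%
  arrange(cv_rmse)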

Part 3: Classification (40 Points)

In this problem, you will develop a model to predict whether income exceeds $50K/yr based on census data.

a. Use the code in Blackboard to create the adult data set.

adult<-read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", col_names=FALSE, na='?')
names(adult)<-c("age","workclass","fnlwgt","education","education_num","marital_status","occupation","relationship","race","sex","capital_gain","capital_loss","hours_per_week","native_country","income")
adult$fnlwgt<-NULL #remove unnecessary column
#coerce character variables to factors
adult$workclass<-as_factor(adult$workclass)
adult$education<-as_factor(adult$education)
adult$marital_status<-as_factor(adult$marital_status)
adult$occupation<-as_factor(adult$occupation)
adult$relationship<-as_factor(adult$relationship)
adult$race<-as_factor(adult$race)
adult$sex<-as_factor(adult$sex)
adult$native_country<-as_factor(adult$native_country)
adult$income<-as_factor(adult$income)
adult<-na.omit(adult) #remove missing observations
y=adult$income
X=model_matrix(adult,income~.) #create dummy variables
X$`(Intercept)`<-NULL #remove unnecessary column
adult2<-as_tibble(cbind(adult$income,X),.name_repair = "unique") #create new data set
names(adult2)[1]<-"income" #fix the name of the outcome variable
nsv<-nearZeroVar(adult2,saveMetrics = FALSE) #identify near zero variance predictors
adult3<-adult2[,-nsv] #remove them
adult3

b. Explore the data graphically in order to investigate the association between income and the other features. Which of the other features seem most likely to be useful in predicting income? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

Using the [random forest method](http://r-statistics.co/Variable-Selection-and-Importance-With-R.html) for feature selection: variables with high random forest importance are good candidate predictors. The highest-importance variables from the plot below (marital_statusMarried-civ-spouse, education_num, relationshipHusband, age, hours_per_week, occupationExec-managerial, occupationProf-specialty, sexFemale, and relationshipOwn-child) are used as the predictors in parts (d)-(g).

##Generate a small (8%) sample of the data set to investigate which variables are closely associated with income
k=sample(nrow(adult3),nrow(adult3)*0.08) #use adult3 consistently (adult2 has the same number of rows)

##Use random forest model to calculate variable importance
rf=train(income~.,data=adult3[k,])
rfImp<-varImp(rf)
plot(rfImp,top=10)

c. Split the data into an 80% training set and a 20% test set. Set the seed at 1303.

##Set the seed specified in the prompt before partitioning
set.seed(1303)
##Create training indicator vector
inTrain <- createDataPartition(adult3$income, p=0.8, list=FALSE)
##Create training and test data sets
train=adult3[inTrain,]
test=adult3[-inTrain,]
dim(adult2)
[1] 30162    97
dim(train)
[1] 24131    24
dim(test)
[1] 6031   24

d. Perform LDA on the training data in order to predict income using the variables that seemed most associated with income in (b). What is the test error of the model obtained?

##Train Model
lda.fit=train(income~`marital_statusMarried-civ-spouse` + education_num + relationshipHusband  + age + hours_per_week  + `occupationExec-managerial`  + `occupationProf-specialty` + sexFemale + `relationshipOwn-child`,data=train,method='lda',trControl = trainControl(method = "cv"))
lda.fit
Linear Discriminant Analysis 

24131 samples
    9 predictor
    2 classes: '<=50K', '>50K' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 21719, 21718, 21717, 21718, 21718, 21719, ... 
Resampling results:

  Accuracy   Kappa    
  0.8226754  0.4894095
##Calculate Predictions
pred.lda<-predict(lda.fit,test)
##Estimate Accuracy
confusionMatrix(pred.lda,test$income)
Confusion Matrix and Statistics

          Reference
Prediction <=50K >50K
     <=50K  4139  704
     >50K    391  797
                                          
               Accuracy : 0.8184          
                 95% CI : (0.8085, 0.8281)
    No Information Rate : 0.7511          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.478           
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9137          
            Specificity : 0.5310          
         Pos Pred Value : 0.8546          
         Neg Pred Value : 0.6709          
             Prevalence : 0.7511          
         Detection Rate : 0.6863          
   Detection Prevalence : 0.8030          
      Balanced Accuracy : 0.7223          
                                          
       'Positive' Class : <=50K           
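The question asks for the test error, which is one minus the test-set accuracy above: roughly 1 - 0.8184 ≈ 0.182 for LDA. It can be extracted programmatically (a sketch; the same one-liner applies to the QDA, logistic regression, and KNN fits below):

## Test error = 1 - test-set accuracy
1 - confusionMatrix(pred.lda, test$income)$overall["Accuracy"]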
                                          

e. Perform QDA on the training data in order to predict income using the variables that seemed most associated with income in (b). What is the test error of the model obtained?

##Train Model
qda.fit=train(income~`marital_statusMarried-civ-spouse` + education_num + relationshipHusband  + age + hours_per_week  + `occupationExec-managerial`  + `occupationProf-specialty` + sexFemale + `relationshipOwn-child`,data=train,method='qda',trControl = trainControl(method = "cv"))
qda.fit
Quadratic Discriminant Analysis 

24131 samples
    9 predictor
    2 classes: '<=50K', '>50K' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 21719, 21718, 21718, 21719, 21718, 21718, ... 
Resampling results:

  Accuracy   Kappa    
  0.7522264  0.4660741
##Calculate Predictions
pred.qda<-predict(qda.fit,test)
##Estimate Accuracy
confusionMatrix(pred.qda,test$income)
Confusion Matrix and Statistics

          Reference
Prediction <=50K >50K
     <=50K  3271  213
     >50K   1259 1288
                                          
               Accuracy : 0.7559          
                 95% CI : (0.7449, 0.7667)
    No Information Rate : 0.7511          
    P-Value [Acc > NIR] : 0.1982          
                                          
                  Kappa : 0.4705          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.7221          
            Specificity : 0.8581          
         Pos Pred Value : 0.9389          
         Neg Pred Value : 0.5057          
             Prevalence : 0.7511          
         Detection Rate : 0.5424          
   Detection Prevalence : 0.5777          
      Balanced Accuracy : 0.7901          
                                          
       'Positive' Class : <=50K           
                                          

f. Perform logistic regression on the training data in order to predict income using the variables that seemed most associated with income in (b). What is the test error of the model obtained?

##Train Model
glm.fit=train(income~`marital_statusMarried-civ-spouse` + education_num + relationshipHusband  + age + hours_per_week  + `occupationExec-managerial`  + `occupationProf-specialty` + sexFemale + `relationshipOwn-child`,data=train,method='glm',trControl = trainControl(method = "cv"))
glm.fit
Generalized Linear Model 

24131 samples
    9 predictor
    2 classes: '<=50K', '>50K' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 21719, 21718, 21717, 21719, 21717, 21717, ... 
Resampling results:

  Accuracy   Kappa    
  0.8238368  0.4869807
##Calculate Predictions
pred.glm<-predict(glm.fit,test)
##Estimate Accuracy
confusionMatrix(pred.glm,test$income)
Confusion Matrix and Statistics

          Reference
Prediction <=50K >50K
     <=50K  4154  728
     >50K    376  773
                                         
               Accuracy : 0.8169         
                 95% CI : (0.807, 0.8266)
    No Information Rate : 0.7511         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.4687         
                                         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
            Sensitivity : 0.9170         
            Specificity : 0.5150         
         Pos Pred Value : 0.8509         
         Neg Pred Value : 0.6728         
             Prevalence : 0.7511         
         Detection Rate : 0.6888         
   Detection Prevalence : 0.8095         
      Balanced Accuracy : 0.7160         
                                         
       'Positive' Class : <=50K          
                                         

g. Perform KNN on the training data, with several values of K, in order to predict income. Use only the variables that seemed most associated with income in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

##Train Model, Let CV choose value for K
knn.fit<-train(income~`marital_statusMarried-civ-spouse` + education_num + relationshipHusband  + age + hours_per_week  + `occupationExec-managerial`  + `occupationProf-specialty` + sexFemale + `relationshipOwn-child`,data=train,method='knn',trControl = trainControl(method = "cv"), tuneLength=20)
knn.fit
k-Nearest Neighbors 

24131 samples
    9 predictor
    2 classes: '<=50K', '>50K' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 21718, 21717, 21718, 21718, 21718, 21718, ... 
Resampling results across tuning parameters:

  k   Accuracy   Kappa    
   5  0.8047330  0.4363006
   7  0.8067635  0.4383265
   9  0.8078823  0.4407662
  11  0.8092911  0.4431910
  13  0.8097055  0.4439116
  15  0.8100371  0.4443474
  17  0.8093331  0.4421319
  19  0.8086285  0.4380045
  21  0.8098304  0.4399584
  23  0.8096647  0.4388197
  25  0.8100791  0.4385366
  27  0.8106590  0.4386679
  29  0.8095400  0.4356190
  31  0.8099544  0.4361528
  33  0.8094984  0.4343682
  35  0.8097058  0.4343902
  37  0.8108248  0.4370759
  39  0.8097059  0.4336888
  41  0.8098714  0.4334224
  43  0.8094986  0.4328404

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 37.
##Calculate Predictions
pred.knn<-predict(knn.fit,test)
##Estimate Accuracy
confusionMatrix(pred.knn,test$income)
Confusion Matrix and Statistics

          Reference
Prediction <=50K >50K
     <=50K  4184  794
     >50K    346  707
                                          
               Accuracy : 0.811           
                 95% CI : (0.8009, 0.8208)
    No Information Rate : 0.7511          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.4384          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9236          
            Specificity : 0.4710          
         Pos Pred Value : 0.8405          
         Neg Pred Value : 0.6714          
             Prevalence : 0.7511          
         Detection Rate : 0.6937          
   Detection Prevalence : 0.8254          
      Balanced Accuracy : 0.6973          
                                          
       'Positive' Class : <=50K           
                                          

h. Choose which model predicts income the best and justify your choice.

| Model                      | CV Accuracy |
|----------------------------|-------------|
| Quadratic Discriminant     | 0.7522264   |
| K Nearest Neighbors (K=37) | 0.8108248   |
| Linear Discriminant        | 0.8226754   |
| **Logistic Regression**    | **0.8238368** |

Logistic regression predicts income the best. It achieves the highest cross-validated accuracy (0.8238), narrowly ahead of LDA, and its test-set accuracy (0.8169) is essentially the same as LDA's (0.8184), while the model remains simple, fast to fit, and easy to interpret.
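As in Part 2, the table can be built from the fitted objects rather than typed by hand (a sketch; the best CV accuracy is the maximum of $results$Accuracy for each train object):

## Collect the best CV accuracy from each classifier
tibble(model       = c("QDA", "KNN (k = 37)", "LDA", "Logistic regression"),
       cv_accuracy = c(max(qda.fit$results$Accuracy),
                       max(knn.fit$results$Accuracy),
                       max(lda.fit$results$Accuracy),
                       max(glm.fit$results$Accuracy))) %>%
  arrange(desc(cv_accuracy))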