Instructions

Part I: U.S. Salary (100 points)

library(tidyverse)
library(cluster)

Introduction to the Data Set

The data set used in this part is extracted from the 1994 Census database.

Features (predictors):
  • age: continuous numeric
  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • fnlwgt: continuous numeric
  • education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
  • education-num: continuous numeric
  • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
  • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
  • race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
  • sex: Female, Male
  • capital-gain: continuous numeric
  • capital-loss: continuous numeric
  • hours-per-week: continuous numeric
  • native-country
Target (response):
  • salary: <=50K or >50K
Prediction task: Determine whether a person makes over 50K a year.

Data Wrangling

First, let us load the data set and keep only the U.S. observations.

salary_US0 = read_csv("salary.csv", show_col_types = FALSE)
salary_US1 <- salary_US0 %>% filter(`native-country`=="United-States")
glimpse(salary_US1)
## Rows: 29,170
## Columns: 15
## $ age              <dbl> 39, 50, 38, 53, 37, 52, 31, 42, 37, 23, 32, 25, 32, 3…
## $ workclass        <chr> "State-gov", "Self-emp-not-inc", "Private", "Private"…
## $ fnlwgt           <dbl> 77516, 83311, 215646, 234721, 284582, 209642, 45781, …
## $ education        <chr> "Bachelors", "Bachelors", "HS-grad", "11th", "Masters…
## $ `education-num`  <dbl> 13, 13, 9, 7, 14, 9, 14, 13, 10, 13, 12, 9, 9, 7, 14,…
## $ `marital-status` <chr> "Never-married", "Married-civ-spouse", "Divorced", "M…
## $ occupation       <chr> "Adm-clerical", "Exec-managerial", "Handlers-cleaners…
## $ relationship     <chr> "Not-in-family", "Husband", "Not-in-family", "Husband…
## $ race             <chr> "White", "White", "White", "Black", "White", "White",…
## $ sex              <chr> "Male", "Male", "Male", "Male", "Female", "Male", "Fe…
## $ `capital-gain`   <dbl> 2174, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0, 0, 0, 0…
## $ `capital-loss`   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `hours-per-week` <dbl> 40, 13, 40, 40, 40, 45, 50, 40, 80, 30, 50, 35, 40, 5…
## $ `native-country` <chr> "United-States", "United-States", "United-States", "U…
## $ salary           <chr> "<=50K", "<=50K", "<=50K", "<=50K", "<=50K", ">50K", …

Second, let us check whether there are any missing values.

sum(is.na(salary_US1))
## [1] 0

No missing values coded as NA! (Be aware, though, that this data set marks unknown entries with the string "?"; the workclass summary below shows 1,659 such rows, which we keep as their own category.) Also notice that all of the categorical variables have type “character”, which is not ideal for machine learning modeling. Let us convert them into factors.

salary_US2 <- as.data.frame(unclass(salary_US1), stringsAsFactors=TRUE)

Use the summary function to have an overview of the variables:

summary(salary_US2)
##       age                   workclass         fnlwgt               education   
##  Min.   :17.00   Private         :20135   Min.   :  12285   HS-grad     :9702  
##  1st Qu.:28.00   Self-emp-not-inc: 2313   1st Qu.: 115895   Some-college:6740  
##  Median :37.00   Local-gov       : 1956   Median : 176730   Bachelors   :4766  
##  Mean   :38.66   ?               : 1659   Mean   : 187069   Masters     :1527  
##  3rd Qu.:48.00   State-gov       : 1210   3rd Qu.: 234138   Assoc-voc   :1289  
##  Max.   :90.00   Self-emp-inc    :  991   Max.   :1484705   11th        :1067  
##                  (Other)         :  906                     (Other)     :4079  
##  education.num                 marital.status            occupation  
##  Min.   : 1.00   Divorced             : 4162   Exec-managerial:3735  
##  1st Qu.: 9.00   Married-AF-spouse    :   23   Prof-specialty :3693  
##  Median :10.00   Married-civ-spouse   :13368   Craft-repair   :3685  
##  Mean   :10.17   Married-spouse-absent:  253   Adm-clerical   :3449  
##  3rd Qu.:12.00   Never-married        : 9579   Sales          :3364  
##  Max.   :16.00   Separated            :  883   Other-service  :2777  
##                  Widowed              :  902   (Other)        :8467  
##          relationship                   race           sex       
##  Husband       :11861   Amer-Indian-Eskimo:  296   Female: 9682  
##  Not-in-family : 7528   Asian-Pac-Islander:  292   Male  :19488  
##  Other-relative:  696   Black             : 2832                 
##  Own-child     : 4691   Other             :  129                 
##  Unmarried     : 3033   White             :25621                 
##  Wife          : 1361                                            
##                                                                  
##   capital.gain    capital.loss     hours.per.week        native.country 
##  Min.   :    0   Min.   :   0.00   Min.   : 1.00   United-States:29170  
##  1st Qu.:    0   1st Qu.:   0.00   1st Qu.:40.00                        
##  Median :    0   Median :   0.00   Median :40.00                        
##  Mean   : 1089   Mean   :  88.51   Mean   :40.45                        
##  3rd Qu.:    0   3rd Qu.:   0.00   3rd Qu.:45.00                        
##  Max.   :99999   Max.   :4356.00   Max.   :99.00                        
##                                                                         
##    salary     
##  <=50K:21999  
##  >50K : 7171  
##               
##               
##               
##               
## 

Notice that since we only kept U.S. observations, native.country now has a single distinct value, United-States. It is a constant, so we can remove it. In addition, there are two variables about education. The categorical variable education has many levels and would consume many degrees of freedom, while education.num encodes the same information numerically. Thus, let us drop education and keep only education.num.

The variable fnlwgt is the final sampling weight that the U.S. Census Bureau assigns to each row. For simplicity, this weighting factor is discarded in this analysis.

Notice that relationship also has many levels, but a person's role in a family can largely be inferred from sex and marital status. Hence, relationship is mostly redundant and can be removed.

salary_US3 <- salary_US2 %>% select(-c(native.country,fnlwgt,education,relationship))

Use salary_US3 to solve Problems 1 to 6.

Exploratory Data Analysis (15 points)

Problem 1. Boxplot (5 points)

Create a side-by-side boxplot to show the relationship between salary and age. Hint: Similar to Exercise 6 in HW 10. Don’t use facet_wrap or facet_grid.

boxplot(age~salary,data=salary_US3, horizontal=T)
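The same plot can be drawn in ggplot2; an optional alternative sketch:

ggplot(salary_US3, aes(x = salary, y = age)) +
  geom_boxplot() +
  coord_flip()   # flip so the boxplots are horizontal, as in the base-R version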

Problem 2. Frequency Polygon (5 points)

Create a frequency polygon to show the relationship between salary and capital.gain. (Hint: Read Page 20 in lecture slides ggplot2_3.)

ggplot(data = salary_US3, mapping = aes(x = capital.gain, y = ..density..)) + 
  geom_freqpoly(mapping = aes(colour = salary), binwidth = 5000)

Problem 3. 2-dimensional Bin-plot (5 points)

Create a two-dimensional bin-plot to show the relationship between sex and salary. Hint: Use geom_bin2d. Read Page 12 in lecture slides ggplot2_2.

ggplot(salary_US3, aes(sex, salary)) + 
  geom_bin2d()

Partition Data

Use the following code to split salary_US3 into a training set and a test set. Make sure that you use the same seed.

set.seed(99)
train_US <- sample_frac(salary_US3, 0.8, fac="salary") # Stratified sampling
test_US <- setdiff(salary_US3, train_US)
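As an optional sanity check, compare the piece sizes. Note that dplyr's setdiff() uses set semantics and drops duplicate rows, so the two pieces need not sum exactly to nrow(salary_US3); the counts below are inferred from the model outputs later in this part:

nrow(train_US)   # matches n = 23336 reported by rpart below
## [1] 23336
nrow(test_US)    # matches the test confusion-matrix totals below
## [1] 4579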

Data preparation is done for now.

Supervised Learning Modeling (85 points)

Problem 4. Decision Tree (25 points)

Load the packages:

library(rpart)
library(rpart.plot)

Step 1. (5 points) Fit a decision tree with train_US using all of the feature variables. Set the seed as follows:

set.seed(99)

full_form <- as.formula(salary~.)

mod_tree <- rpart(full_form, data = train_US)
mod_tree
## n= 23336 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 23336 5741 <=50K (0.75398526 0.24601474)  
##    2) marital.status=Divorced,Married-spouse-absent,Never-married,Separated,Widowed 12585  826 <=50K (0.93436631 0.06563369)  
##      4) capital.gain< 7139.5 12363  609 <=50K (0.95074011 0.04925989) *
##      5) capital.gain>=7139.5 222    5 >50K (0.02252252 0.97747748) *
##    3) marital.status=Married-AF-spouse,Married-civ-spouse 10751 4915 <=50K (0.54283322 0.45716678)  
##      6) education.num< 12.5 7610 2608 <=50K (0.65729304 0.34270696)  
##       12) capital.gain< 5095.5 7227 2234 <=50K (0.69088142 0.30911858) *
##       13) capital.gain>=5095.5 383    9 >50K (0.02349869 0.97650131) *
##      7) education.num>=12.5 3141  834 >50K (0.26552053 0.73447947) *

Step 2. (5 points) Plot the decision tree.

rpart.plot(mod_tree)

Step 3. (5 points) Do prediction for the test data using the trained decision tree. Calculate the confusion matrix.

confusion_matrix <- function(data, y, mod){
  confusion <- data %>% 
    mutate(pred = predict(mod, newdata = data, type = "class"),
           y = y) %>%
    select(y, pred) %>% 
    table()
  return(confusion)
}
misclass <- function(confusion){
  misclass <- 1 - sum(diag(confusion))/sum(confusion)
  return(misclass)
}

confusion_test <- confusion_matrix(test_US, test_US$salary, mod_tree)
confusion_test
##        pred
## y       <=50K >50K
##   <=50K  3142  199
##   >50K    599  639

Step 4. (5 points) Calculate the misclassification rate.

misclass(confusion_test)
## [1] 0.1742739

Step 5. (5 points) Calculate the true positive rate and the true negative rate for the prediction results on the test set, where the two rates are defined as follows:

  • True Positive Rate: P(predicted salary <=50K | true salary <=50K)
  • True Negative Rate: P(predicted salary >50K | true salary >50K)

Since the rows of the confusion matrix correspond to the true labels, each rate conditions on a row:

tpr <- confusion_test[1,1]/sum(confusion_test[1,]); tpr
## [1] 0.940437
tnr <- confusion_test[2,2]/sum(confusion_test[2,]); tnr
## [1] 0.5161551
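Since the same two rates are needed again in Problem 5, a small helper (a hypothetical convenience, not part of the assignment scaffold) avoids repeating the indexing:

# Hypothetical helper: row-conditional rates from a y-by-pred confusion table
rates <- function(confusion){
  c(tpr = confusion[1, 1] / sum(confusion[1, ]),
    tnr = confusion[2, 2] / sum(confusion[2, ]))
}
rates(confusion_test)   # reproduces the two values above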

Problem 5. Random Forest (30 points)

Load the package:

library(randomForest)
## Warning: package 'randomForest' was built under R version 4.1.2

Step 1. (5 points) Fit a random forest model with train_US using all of the feature variables, setting ntree = 1000 and mtry = 5. Set the seed as follows:

set.seed(99)

mod_rforest <- randomForest(full_form, data = train_US, ntree = 1000, mtry = 5)
mod_rforest
## 
## Call:
##  randomForest(formula = full_form, data = train_US, ntree = 1000,      mtry = 5) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 14.65%
## Confusion matrix:
##       <=50K >50K class.error
## <=50K 16247 1348  0.07661267
## >50K   2071 3670  0.36073855

Step 2. (5 points) Do prediction for the test data using the trained random forest. Calculate the confusion matrix.

confusion_testrf <- confusion_matrix(test_US, test_US$salary, mod_rforest)
confusion_testrf
##        pred
## y       <=50K >50K
##   <=50K  3039  302
##   >50K    472  766

Step 3. (5 points) Calculate the misclassification rate.

misclass(confusion_testrf)
## [1] 0.1690325

Step 4. (5 points) Calculate the true positive rate and the true negative rate for the prediction results on the test set.

tpr_rf <- confusion_testrf[1,1]/sum(confusion_testrf[1,]); tpr_rf
## [1] 0.9096079
tnr_rf <- confusion_testrf[2,2]/sum(confusion_testrf[2,]); tnr_rf
## [1] 0.6187399

Step 5. (5 points) Set mtry=10 and ntree=100 and do Steps 1 to 4 again. Which model is better? Set the seed as follows:

The first random forest is better: it has a lower OOB error estimate (14.65% vs. 15.86%) and a lower test misclassification rate (0.169 vs. 0.178). Averaging more trees (larger ntree) stabilizes the predictions, and the smaller mtry decorrelates the individual trees; both effects reduce overfitting.

set.seed(99)

mod_rforest1<- randomForest(full_form,train_US,ntree=100, mtry = 10)
mod_rforest1
## 
## Call:
##  randomForest(formula = full_form, data = train_US, ntree = 100,      mtry = 10) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 10
## 
##         OOB estimate of  error rate: 15.86%
## Confusion matrix:
##       <=50K >50K class.error
## <=50K 16040 1555  0.08837738
## >50K   2145 3596  0.37362829
confusion_testrf1 <- confusion_matrix(test_US, test_US$salary, mod_rforest1)
confusion_testrf1
##        pred
## y       <=50K >50K
##   <=50K  3002  339
##   >50K    477  761
misclass(confusion_testrf1)
## [1] 0.1782048
tpr_rf1 <- confusion_testrf1[1,1]/sum(confusion_testrf1[1,]); tpr_rf1
## [1] 0.8985334
tnr_rf1 <- confusion_testrf1[2,2]/sum(confusion_testrf1[2,]); tnr_rf1
## [1] 0.6147011

Step 6. (5 points) Which variable is the most important? Rank the variables according to their importance. Use the first model.

  1. marital.status (it is the first split in the decision tree above, i.e., it reduces the node impurity, Gini or entropy, the most; this ranking can also be checked programmatically, as in the sketch below)
  2. capital.gain
  3. education.num
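A minimal sketch for verifying the ranking from the fitted forest itself, using randomForest's built-in Gini-based importance measures (output not shown here):

importance(mod_rforest)   # mean decrease in Gini for each predictor
varImpPlot(mod_rforest)   # dot chart of the same ranking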

Problem 6. Regularized Logistic Regression (30 points)

First, list the variables available in salary_US3:

names(salary_US3)
##  [1] "age"            "workclass"      "education.num"  "marital.status"
##  [5] "occupation"     "race"           "sex"            "capital.gain"  
##  [9] "capital.loss"   "hours.per.week" "salary"

Load the package and define the full formula for the logistic regression model as follows:

library(glmnet)
form_full <- as.formula("salary~age+workclass+education.num+
                      marital.status+occupation+race+sex+capital.gain+
                      capital.loss+hours.per.week")

Step 1. (10 points) Use 5-fold cross-validation to find the “optimal” lambda. Print and plot the results. Hint: use the cv.glmnet function in the glmnet package. Set the seed as follows:

set.seed(99)

predictors <- model.matrix(form_full, data = train_US) 
cv.fit <- cv.glmnet(predictors, train_US$salary, nfolds = 5, family = "binomial", type.measure = "class")
cv.fit$lambda.1se
## [1] 0.005176728
plot(cv.fit)

Step 2. (5 points) Use glmnet to fit a regularized logistic regression with lambda equal to lambda.1se found in Step 1.

lambda1 = cv.fit$lambda.1se
mod_lr <- glmnet(predictors, train_US$salary, family = "binomial", lambda = lambda1)
mod_lr$beta
## 39 x 1 sparse Matrix of class "dgCMatrix"
##                                                s0
## (Intercept)                          .           
## age                                  0.0203045912
## workclassFederal-gov                 0.4206971823
## workclassLocal-gov                   .           
## workclassNever-worked                .           
## workclassPrivate                     0.0908181818
## workclassSelf-emp-inc                0.2854952585
## workclassSelf-emp-not-inc           -0.1927690345
## workclassState-gov                   .           
## workclassWithout-pay                 .           
## education.num                        0.2867569303
## marital.statusMarried-AF-spouse      1.4640604991
## marital.statusMarried-civ-spouse     2.0914564325
## marital.statusMarried-spouse-absent  .           
## marital.statusNever-married         -0.3379584076
## marital.statusSeparated              .           
## marital.statusWidowed                .           
## occupationAdm-clerical               .           
## occupationArmed-Forces               .           
## occupationCraft-repair               .           
## occupationExec-managerial            0.7070894622
## occupationFarming-fishing           -0.6435047227
## occupationHandlers-cleaners         -0.3511981997
## occupationMachine-op-inspct         -0.0738227580
## occupationOther-service             -0.5963914056
## occupationPriv-house-serv            .           
## occupationProf-specialty             0.3728751137
## occupationProtective-serv            0.2742614606
## occupationSales                      0.1137338501
## occupationTech-support               0.3820048593
## occupationTransport-moving           .           
## raceAsian-Pac-Islander               .           
## raceBlack                            .           
## raceOther                            .           
## raceWhite                            .           
## sexMale                              0.0838127096
## capital.gain                         0.0002123040
## capital.loss                         0.0005405183
## hours.per.week                       0.0251743954
For comparison, re-fit an ordinary (unpenalized) logistic regression using only the predictors that the lasso retained (every level of race was shrunk to zero above, so race is dropped):

form2 <- as.formula(salary~age+workclass+education.num+marital.status+occupation+sex+capital.gain+capital.loss+hours.per.week)
mod_lr2 <- glm(form2, data=train_US,family=binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(mod_lr2)
## 
## Call:
## glm(formula = form2, family = binomial, data = train_US)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2288  -0.5078  -0.2048  -0.0381   3.2546  
## 
## Coefficients: (1 not defined because of singularities)
##                                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         -9.039e+00  2.254e-01 -40.106  < 2e-16 ***
## age                                  2.611e-02  1.865e-03  14.002  < 2e-16 ***
## workclassFederal-gov                 1.004e+00  1.763e-01   5.694 1.24e-08 ***
## workclassLocal-gov                   2.914e-01  1.615e-01   1.805 0.071079 .  
## workclassNever-worked               -7.968e+00  2.579e+02  -0.031 0.975351    
## workclassPrivate                     5.473e-01  1.442e-01   3.795 0.000147 ***
## workclassSelf-emp-inc                7.834e-01  1.752e-01   4.471 7.78e-06 ***
## workclassSelf-emp-not-inc            2.408e-02  1.587e-01   0.152 0.879412    
## workclassState-gov                   2.281e-01  1.744e-01   1.308 0.190919    
## workclassWithout-pay                -1.096e+01  1.276e+02  -0.086 0.931553    
## education.num                        3.012e-01  1.106e-02  27.220  < 2e-16 ***
## marital.statusMarried-AF-spouse      2.650e+00  5.407e-01   4.901 9.56e-07 ***
## marital.statusMarried-civ-spouse     2.179e+00  7.693e-02  28.320  < 2e-16 ***
## marital.statusMarried-spouse-absent -1.569e-01  3.042e-01  -0.516 0.605999    
## marital.statusNever-married         -5.284e-01  9.509e-02  -5.557 2.74e-08 ***
## marital.statusSeparated             -1.717e-01  1.921e-01  -0.894 0.371373    
## marital.statusWidowed               -2.005e-01  1.856e-01  -1.080 0.280116    
## occupationAdm-clerical               1.429e-01  1.133e-01   1.262 0.207083    
## occupationArmed-Forces              -8.284e-01  1.672e+00  -0.495 0.620266    
## occupationCraft-repair               1.412e-01  9.792e-02   1.442 0.149390    
## occupationExec-managerial            9.180e-01  1.002e-01   9.164  < 2e-16 ***
## occupationFarming-fishing           -8.381e-01  1.627e-01  -5.153 2.57e-07 ***
## occupationHandlers-cleaners         -6.827e-01  1.730e-01  -3.947 7.92e-05 ***
## occupationMachine-op-inspct         -1.922e-01  1.245e-01  -1.544 0.122566    
## occupationOther-service             -9.007e-01  1.567e-01  -5.749 8.97e-09 ***
## occupationPriv-house-serv           -3.460e+00  2.116e+00  -1.635 0.102016    
## occupationProf-specialty             6.214e-01  1.076e-01   5.778 7.58e-09 ***
## occupationProtective-serv            7.197e-01  1.504e-01   4.785 1.71e-06 ***
## occupationSales                      3.308e-01  1.037e-01   3.191 0.001418 ** 
## occupationTech-support               7.599e-01  1.370e-01   5.547 2.90e-08 ***
## occupationTransport-moving                  NA         NA      NA       NA    
## sexMale                              1.557e-01  6.163e-02   2.526 0.011525 *  
## capital.gain                         3.167e-04  1.199e-05  26.417  < 2e-16 ***
## capital.loss                         6.593e-04  4.299e-05  15.336  < 2e-16 ***
## hours.per.week                       2.875e-02  1.870e-03  15.370  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26039  on 23335  degrees of freedom
## Residual deviance: 15158  on 23302  degrees of freedom
## AIC: 15226
## 
## Number of Fisher Scoring iterations: 12

Step 3. (5 points) Do prediction for the test data using the trained model and calculate the misclassification rate.

logistic.misclassrate <- function(dataset, y, fit, form){
  misclass_lr <- dataset %>% 
    mutate(pred.logistic = predict(fit, newx = model.matrix(form, data = dataset), 
                                   type = "class")) %>% 
    mutate(misclassify = ifelse(y != pred.logistic, 1, 0)) %>%
    summarize(misclass.rate = mean(misclassify))
  return(misclass_lr$misclass.rate)
}

predictors1 <- model.matrix(form2, data = train_US) 
fit_opt <- glmnet(predictors1, train_US$salary, 
                  family = "binomial", lambda = lambda1)
misclassrate_lr<-logistic.misclassrate(test_US,test_US$salary,fit_opt, form2)
misclassrate_lr
## [1] 0.1690325

Step 4. (5 points) Use glmnet to re-fit a regularized logistic regression with lambda equal to lambda.min found in Step 1 and calculate the misclassification rate for this one. Set the seed as follows:

set.seed(99)

lambda2 = cv.fit$lambda.min
mod_lr <- glmnet(predictors, train_US$salary, family = "binomial", lambda = lambda2)
mod_lr$beta
## 39 x 1 sparse Matrix of class "dgCMatrix"
##                                                s0
## (Intercept)                          .           
## age                                  0.0252944913
## workclassFederal-gov                 0.8876991158
## workclassLocal-gov                   0.1744466947
## workclassNever-worked                .           
## workclassPrivate                     0.4269646215
## workclassSelf-emp-inc                0.6541269431
## workclassSelf-emp-not-inc           -0.0768595914
## workclassState-gov                   0.1013949020
## workclassWithout-pay                -1.9569187418
## education.num                        0.2982259546
## marital.statusMarried-AF-spouse      2.5621236423
## marital.statusMarried-civ-spouse     2.1787884608
## marital.statusMarried-spouse-absent -0.0734169553
## marital.statusNever-married         -0.5023258283
## marital.statusSeparated             -0.0913308941
## marital.statusWidowed               -0.1457490611
## occupationAdm-clerical               0.1282631917
## occupationArmed-Forces              -0.3806008374
## occupationCraft-repair               0.1298749718
## occupationExec-managerial            0.9068445894
## occupationFarming-fishing           -0.8190704566
## occupationHandlers-cleaners         -0.6419479989
## occupationMachine-op-inspct         -0.1698045356
## occupationOther-service             -0.8499321584
## occupationPriv-house-serv           -1.9463227958
## occupationProf-specialty             0.6122354485
## occupationProtective-serv            0.7103943292
## occupationSales                      0.3165231451
## occupationTech-support               0.7323469613
## occupationTransport-moving           .           
## raceAsian-Pac-Islander               0.3790006886
## raceBlack                            0.0417806673
## raceOther                           -0.2788294267
## raceWhite                            0.1805908310
## sexMale                              0.1434482041
## capital.gain                         0.0003064485
## capital.loss                         0.0006480464
## hours.per.week                       0.0284857908
For comparison, fit the corresponding unpenalized logistic regression with the full formula:

mod_lr2 <- glm(form_full, data = train_US, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(mod_lr2)
## 
## Call:
## glm(formula = form_full, family = binomial, data = train_US)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2064  -0.5082  -0.2042  -0.0386   3.2440  
## 
## Coefficients: (1 not defined because of singularities)
##                                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         -9.500e+00  3.269e-01 -29.065  < 2e-16 ***
## age                                  2.591e-02  1.867e-03  13.879  < 2e-16 ***
## workclassFederal-gov                 1.021e+00  1.770e-01   5.769 7.97e-09 ***
## workclassLocal-gov                   3.016e-01  1.616e-01   1.866 0.062043 .  
## workclassNever-worked               -7.936e+00  2.591e+02  -0.031 0.975564    
## workclassPrivate                     5.496e-01  1.443e-01   3.810 0.000139 ***
## workclassSelf-emp-inc                7.813e-01  1.752e-01   4.460 8.21e-06 ***
## workclassSelf-emp-not-inc            2.817e-02  1.587e-01   0.178 0.859093    
## workclassState-gov                   2.322e-01  1.745e-01   1.331 0.183249    
## workclassWithout-pay                -1.096e+01  1.275e+02  -0.086 0.931500    
## education.num                        2.996e-01  1.108e-02  27.052  < 2e-16 ***
## marital.statusMarried-AF-spouse      2.636e+00  5.408e-01   4.874 1.10e-06 ***
## marital.statusMarried-civ-spouse     2.175e+00  7.694e-02  28.271  < 2e-16 ***
## marital.statusMarried-spouse-absent -1.567e-01  3.047e-01  -0.514 0.607009    
## marital.statusNever-married         -5.321e-01  9.516e-02  -5.591 2.25e-08 ***
## marital.statusSeparated             -1.555e-01  1.927e-01  -0.807 0.419617    
## marital.statusWidowed               -1.989e-01  1.857e-01  -1.072 0.283928    
## occupationAdm-clerical               1.365e-01  1.134e-01   1.204 0.228608    
## occupationArmed-Forces              -6.771e-01  1.758e+00  -0.385 0.700149    
## occupationCraft-repair               1.301e-01  9.805e-02   1.326 0.184677    
## occupationExec-managerial            9.073e-01  1.003e-01   9.046  < 2e-16 ***
## occupationFarming-fishing           -8.497e-01  1.628e-01  -5.218 1.80e-07 ***
## occupationHandlers-cleaners         -6.881e-01  1.731e-01  -3.974 7.06e-05 ***
## occupationMachine-op-inspct         -1.943e-01  1.246e-01  -1.560 0.118753    
## occupationOther-service             -8.966e-01  1.568e-01  -5.718 1.08e-08 ***
## occupationPriv-house-serv           -3.482e+00  2.131e+00  -1.634 0.102314    
## occupationProf-specialty             6.152e-01  1.077e-01   5.710 1.13e-08 ***
## occupationProtective-serv            7.163e-01  1.505e-01   4.758 1.95e-06 ***
## occupationSales                      3.206e-01  1.038e-01   3.088 0.002014 ** 
## occupationTech-support               7.459e-01  1.372e-01   5.437 5.43e-08 ***
## occupationTransport-moving                  NA         NA      NA       NA    
## raceAsian-Pac-Islander               7.485e-01  3.180e-01   2.354 0.018577 *  
## raceBlack                            3.946e-01  2.584e-01   1.527 0.126745    
## raceOther                           -4.830e-02  4.946e-01  -0.098 0.922219    
## raceWhite                            5.118e-01  2.460e-01   2.081 0.037445 *  
## sexMale                              1.498e-01  6.175e-02   2.426 0.015247 *  
## capital.gain                         3.174e-04  1.200e-05  26.445  < 2e-16 ***
## capital.loss                         6.581e-04  4.300e-05  15.306  < 2e-16 ***
## hours.per.week                       2.867e-02  1.870e-03  15.336  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26039  on 23335  degrees of freedom
## Residual deviance: 15149  on 23298  degrees of freedom
## AIC: 15225
## 
## Number of Fisher Scoring iterations: 12
fit_opt <- glmnet(predictors, train_US$salary, 
                  family = "binomial", lambda = lambda2)
misclassrate_lr<-logistic.misclassrate(test_US,test_US$salary,fit_opt, form_full)
misclassrate_lr
## [1] 0.1631361

Step 5. (5 points) Compare the results of the two models. Which one is better?

The re-fit regularized logistic regression with lambda equal to lambda.min is better here because it gives a smaller test misclassification rate (0.1631 vs. 0.1690).
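For reference, here are the test misclassification rates of all the models fitted in this part, copied from the outputs above:

# Test misclassification rates collected from the results above
rates_all <- c(decision_tree   = 0.1742739,   # Problem 4
               rf_ntree1000    = 0.1690325,   # Problem 5, mtry = 5
               rf_ntree100     = 0.1782048,   # Problem 5, mtry = 10
               lasso_lambda1se = 0.1690325,   # Problem 6, Step 3
               lasso_lambdamin = 0.1631361)   # Problem 6, Step 4
sort(rates_all)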

Part II: World Happiness Report (100 points)

Introduction to the Data Set

The World Happiness Report is a landmark survey of the state of global happiness. The first report was published in 2012. We will explore the 2018 data.

  • Overall rank: Happiness rank; numeric. Range: 1 to 156
  • Country or region: Character. 156 values
  • Score: happiness score (subjective well-being); numeric. Range: 0 to 10
  • GDP per capita
  • Social support: the national average of the binary responses (either 0 or 1) to the GWP question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”
  • Healthy life expectancy
  • Freedom to make life choices: the national average of responses to the GWP question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”
  • Generosity: the residual of regressing national average of response to the GWP question “Have you donated money to a charity in the past month?” on GDP per capita.
  • Perceptions of corruption: the national average of the survey responses to two questions in the GWP: “Is corruption widespread throughout the government or not?” and “Is corruption widespread within businesses or not?”

happiness_0 = read_csv("happiness_2018.csv", show_col_types = FALSE)
glimpse(happiness_0)
## Rows: 156
## Columns: 9
## $ `Overall rank`                 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …
## $ `Country or region`            <chr> "Finland", "Norway", "Denmark", "Icelan…
## $ Score                          <dbl> 7.632, 7.594, 7.555, 7.495, 7.487, 7.44…
## $ `GDP per capita`               <dbl> 1.305, 1.456, 1.351, 1.343, 1.420, 1.36…
## $ `Social support`               <dbl> 1.592, 1.582, 1.590, 1.644, 1.549, 1.48…
## $ `Healthy life expectancy`      <dbl> 0.874, 0.861, 0.868, 0.914, 0.927, 0.87…
## $ `Freedom to make life choices` <dbl> 0.681, 0.686, 0.683, 0.677, 0.660, 0.63…
## $ Generosity                     <dbl> 0.202, 0.286, 0.284, 0.353, 0.256, 0.33…
## $ `Perceptions of corruption`    <chr> "0.393", "0.340", "0.408", "0.138", "0.…

Any missing values?

sum(is.na(happiness_0))
## [1] 0

No missing values!

Data Wrangling

Notice that Perceptions of corruption was read in as a character variable, even though it should be numeric. Let us convert it.

happiness_1 = happiness_0 %>% 
  mutate(Corruption = as.numeric(`Perceptions of corruption`)) 
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

The warning message tells us missing values were produced in the conversion process.

sum(is.na(happiness_1$Corruption))
## [1] 1
happiness_1[is.na(happiness_1$Corruption),] %>% select(c(`Country or region`,Corruption))
## # A tibble: 1 × 2
##   `Country or region`  Corruption
##   <chr>                     <dbl>
## 1 United Arab Emirates         NA

Fortunately, only one country has a missing value in Corruption. Remove this observation.

happiness_2 = happiness_1 %>% drop_na()
dim(happiness_2)
## [1] 155  10

Since we are going to do clustering and PCA, we need to drop the “target” variable Overall rank. The label variable Country or region takes a distinct value on every row, so it is not useful for modeling; we can drop it too. The original Perceptions of corruption can also be removed because we already have a numeric version of it.

happiness_3 = happiness_2 %>% 
  select(-c(`Overall rank`, `Country or region`,`Perceptions of corruption`)) 
head(happiness_3,3)
## # A tibble: 3 × 7
##   Score `GDP per capita` `Social support` `Healthy life exp… `Freedom to make l…
##   <dbl>            <dbl>            <dbl>              <dbl>               <dbl>
## 1  7.63             1.30             1.59              0.874               0.681
## 2  7.59             1.46             1.58              0.861               0.686
## 3  7.56             1.35             1.59              0.868               0.683
## # … with 2 more variables: Generosity <dbl>, Corruption <dbl>
names(happiness_3)
## [1] "Score"                        "GDP per capita"              
## [3] "Social support"               "Healthy life expectancy"     
## [5] "Freedom to make life choices" "Generosity"                  
## [7] "Corruption"

Notice that some of the column names are very long with embedded spaces. Let us rename them.

names(happiness_3) <- c("Score", "GDPP", "SocialSupport","LifeExp", "Freedom", "Generosity","Corruption")
head(happiness_3,3)
## # A tibble: 3 × 7
##   Score  GDPP SocialSupport LifeExp Freedom Generosity Corruption
##   <dbl> <dbl>         <dbl>   <dbl>   <dbl>      <dbl>      <dbl>
## 1  7.63  1.30          1.59   0.874   0.681      0.202      0.393
## 2  7.59  1.46          1.58   0.861   0.686      0.286      0.34 
## 3  7.56  1.35          1.59   0.868   0.683      0.284      0.408

Use happiness_3 to solve Problems 7 to 9.

Exploratory Data Analysis (20 points)

Problem 7: Histogram (15 points)

Create histograms for GDPP, Freedom, and Corruption, respectively.

ggplot(data = happiness_3, aes(x = GDPP)) +
    geom_histogram(color = 'white', binwidth = .08, fill='steelblue') +
    labs(x = 'GDPP') 

ggplot(data = happiness_3, aes(x = Freedom)) +
    geom_histogram(color = 'white', binwidth = .05, fill='steelblue') +
    labs(x = 'Freedom')

ggplot(data = happiness_3, aes(x = Corruption)) +
    geom_histogram(color = 'white', binwidth = .03, fill='steelblue') +
    labs(x = 'Corruption')

Problem 8: Scatterplot Matrix (5 points)

Create a scatterplot matrix for all variables in happiness_3. Hint: You may just use the plot() function.

plot(happiness_3, pch=20, cex=.3, col = "skyblue")

If you have extra time, you may install the GGally package and use the ggpairs() function to create the scatterplot matrix.
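A minimal sketch of that optional alternative (assumes the GGally package is installed):

library(GGally)
ggpairs(happiness_3)   # scatterplots, densities, and correlations in one matrix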

Unsupervised Learning (80 points)

Problem 9: Principal Component Analysis (40 points)

Step 1: Calculate principal components (5 points)

Calculate the principal components. Save the results and call it pc.happiness.

pc.happiness <- happiness_3 %>% prcomp(scale. = TRUE)
pc.happiness
## Standard deviations (1, .., p=7):
## [1] 1.9531575 1.1879200 0.7725034 0.7530372 0.5655682 0.4039974 0.3565308
## 
## Rotation (n x k) = (7 x 7):
##                       PC1         PC2         PC3         PC4         PC5
## Score         -0.47669832  0.05368983 -0.04625171  0.08805459 -0.03678967
## GDPP          -0.45549646  0.23911795 -0.08808347 -0.20884619  0.24072812
## SocialSupport -0.42296782  0.21394135 -0.18595613  0.23425680 -0.75567883
## LifeExp       -0.44557603  0.21482773 -0.12628090 -0.25101520  0.46869391
## Freedom       -0.32582611 -0.37152071  0.40342711  0.69482741  0.26999464
## Generosity    -0.09638624 -0.67906233 -0.72360393 -0.01921444  0.04552942
## Corruption    -0.26905194 -0.49886358  0.50320915 -0.58956802 -0.27386491
##                       PC6         PC7
## Score          0.85690768 -0.15600969
## GDPP          -0.09840230  0.78507141
## SocialSupport -0.32958424 -0.07870914
## LifeExp       -0.32963782 -0.58993260
## Freedom       -0.18728161  0.04795450
## Generosity    -0.03784297  0.04590404
## Corruption    -0.04725308 -0.02648475

Step 2: Calculate proportion of variance explained (PVE) (10 points)

happiness.var <- pc.happiness$sdev^2
pve <- happiness.var/sum(happiness.var)
pve
## [1] 0.54497488 0.20159341 0.08525163 0.08100929 0.04569534 0.02331627 0.01815917

Step 3: Create the scree plot and identify the “elbow” (5 points)

PC <- 1:7
data <- data.frame(PC, pve)

ggplot(data=data, aes(x=PC, y=pve))+
  geom_line(color="navy")+
  geom_point(aes(x=3,y=0.087),cex=5,color="orange",alpha=0.3)+
  geom_point(color="red",cex=2)+
  labs(title="Proportion of Variance Explained", x="Principal Component",y="pve")+
  scale_x_continuous(breaks = 1:7)

The curve flattens noticeably after the third principal component, so the “elbow” is at PC = 3 (highlighted in orange).

Step 4: Create a plot for cumulative PVE (5 points)

Draw a horizontal reference line at y=0.9.

ggplot(data=data, aes(x=PC, y=cumsum(pve)))+
  geom_hline(aes(yintercept=0.9), lty=2, color="purple", size=1, alpha=0.5)+
  geom_line(color="navy")+
  geom_point(color="red", cex=2)+
  labs(title="Cumulative Proportion of Variance Explained", 
       x="Principal Component",
       y="cumulative pve")+
  scale_x_continuous(breaks = 1:7)

Step 5: Visualize data with the first two PCs (5 points)

Extract the first two PCs in a data frame and name it PC12. Create a scatterplot using the first two principal components.

PC12 <- pc.happiness$x %>% as_tibble() %>% select(PC1, PC2)
head(PC12, 3)
## # A tibble: 3 × 2
##     PC1   PC2
##   <dbl> <dbl>
## 1 -3.75 -1.23
## 2 -3.82 -1.47
## 3 -3.89 -1.86

ggplot(PC12, aes(x = PC1, y = PC2)) + geom_point()

Note that you don’t need to do clustering yet. Can you see any pattern?

Not really: before cluster labels are added, the points form one fairly continuous cloud spread mainly along PC1, with no clearly separated groups.

Problem 10: Clustering (40 points)

Step 1: Scale Data (5 points)

Scale all the variables in happiness_3 and save the new dataset as happiness_4.

happiness_4 = scale(happiness_3)

Step 2: Optimal Number of Clusters (5 points)

Since we do not know in advance how many clusters are needed to group the happiness data, let us use the fviz_nbclust() function to find the optimal number of clusters.

library(factoextra)
fviz_nbclust(happiness_4, kmeans, method = "gap_stat")
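For comparison, fviz_nbclust() also supports the elbow and average-silhouette criteria; an optional cross-check:

fviz_nbclust(happiness_4, kmeans, method = "wss")         # elbow (within-cluster sum of squares)
fviz_nbclust(happiness_4, kmeans, method = "silhouette")  # average silhouette width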

Step 3: K-Means (10 points)

Use the K-means method with 3 clusters (the number suggested in Step 2) to cluster the data set happiness_4. Then, use mutate to add the clustering results to the data set PC12, which was created in Step 5 of Problem 9. Call the new data set PC12_cluster. Set the seed as follows:

set.seed(99)
km_mod = kmeans(happiness_4, centers=3)

PC12_cluster <- PC12 %>% mutate(cluster = factor(km_mod$cluster))
PC12_cluster
## # A tibble: 155 × 3
##      PC1    PC2 cluster
##    <dbl>  <dbl> <fct>  
##  1 -3.75 -1.23  1      
##  2 -3.82 -1.47  1      
##  3 -3.89 -1.86  1      
##  4 -3.31 -0.855 1      
##  5 -3.77 -1.29  1      
##  6 -3.37 -1.57  1      
##  7 -3.38 -1.48  1      
##  8 -3.72 -2.34  1      
##  9 -3.63 -1.72  1      
## 10 -3.51 -1.76  1      
## # … with 145 more rows

Step 4: Create a visualization using PC12_cluster (5 points)

Create a scatterplot for PC1 and PC2. Make sure to map color to cluster.

PC12_cluster %>% ggplot(aes(x = PC1, y = PC2)) + geom_point(aes(color = cluster))

Step 5: Create a visualization using fviz_cluster (5 points)

Create a plot for the clustering results similar to Step 4 but using the fviz_cluster() function in the factoextra package.

fviz_cluster(km_mod, data = happiness_4)

Step 6: Adding labels back (5 points)

Use mutate to add Country or region to PC12_cluster. Make sure to take Country or region from happiness_2, where the observation with the missing value has already been removed.

PC12_cluster <- PC12 %>% mutate(cluster = factor(km_mod$cluster), countryOrRegion = factor(happiness_2$`Country or region`))
PC12_cluster
## # A tibble: 155 × 4
##      PC1    PC2 cluster countryOrRegion
##    <dbl>  <dbl> <fct>   <fct>          
##  1 -3.75 -1.23  1       Finland        
##  2 -3.82 -1.47  1       Norway         
##  3 -3.89 -1.86  1       Denmark        
##  4 -3.31 -0.855 1       Iceland        
##  5 -3.77 -1.29  1       Switzerland    
##  6 -3.37 -1.57  1       Netherlands    
##  7 -3.38 -1.48  1       Canada         
##  8 -3.72 -2.34  1       New Zealand    
##  9 -3.63 -1.72  1       Sweden         
## 10 -3.51 -1.76  1       Australia      
## # … with 145 more rows

Step 7: Identify those countries or regions in Cluster 1 (5 points)

Print out those countries or regions in Cluster #1.

PC12_cluster %>% filter(cluster == 1) %>% select(countryOrRegion) 
## # A tibble: 22 × 1
##    countryOrRegion
##    <fct>          
##  1 Finland        
##  2 Norway         
##  3 Denmark        
##  4 Iceland        
##  5 Switzerland    
##  6 Netherlands    
##  7 Canada         
##  8 New Zealand    
##  9 Sweden         
## 10 Australia      
## # … with 12 more rows
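Only the first 10 rows of a tibble print by default; to list all 22 countries or regions in Cluster 1, raise the print limit:

PC12_cluster %>% filter(cluster == 1) %>% select(countryOrRegion) %>% print(n = Inf)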