library(tidyverse)
library(cluster)
The data set used in this part is extracted from the 1994 Census database. The variables are:

age: continuous numeric
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
fnlwgt: continuous numeric
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
education-num: continuous numeric
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
sex: Female, Male
capital-gain: continuous numeric
capital-loss: continuous numeric
hours-per-week: continuous numeric
native-country
salary: <=50K or >50K

First, let us load the data set and filter the U.S. observations.
salary_US0 = read_csv("salary.csv", show_col_types = FALSE)
salary_US1 <- salary_US0 %>% filter(`native-country`=="United-States")
glimpse(salary_US1)
## Rows: 29,170
## Columns: 15
## $ age <dbl> 39, 50, 38, 53, 37, 52, 31, 42, 37, 23, 32, 25, 32, 3…
## $ workclass <chr> "State-gov", "Self-emp-not-inc", "Private", "Private"…
## $ fnlwgt <dbl> 77516, 83311, 215646, 234721, 284582, 209642, 45781, …
## $ education <chr> "Bachelors", "Bachelors", "HS-grad", "11th", "Masters…
## $ `education-num` <dbl> 13, 13, 9, 7, 14, 9, 14, 13, 10, 13, 12, 9, 9, 7, 14,…
## $ `marital-status` <chr> "Never-married", "Married-civ-spouse", "Divorced", "M…
## $ occupation <chr> "Adm-clerical", "Exec-managerial", "Handlers-cleaners…
## $ relationship <chr> "Not-in-family", "Husband", "Not-in-family", "Husband…
## $ race <chr> "White", "White", "White", "Black", "White", "White",…
## $ sex <chr> "Male", "Male", "Male", "Male", "Female", "Male", "Fe…
## $ `capital-gain` <dbl> 2174, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0, 0, 0, 0…
## $ `capital-loss` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `hours-per-week` <dbl> 40, 13, 40, 40, 40, 45, 50, 40, 80, 30, 50, 35, 40, 5…
## $ `native-country` <chr> "United-States", "United-States", "United-States", "U…
## $ salary <chr> "<=50K", "<=50K", "<=50K", "<=50K", "<=50K", ">50K", …
Secondly, let us check whether there are any missing values.
sum(is.na(salary_US1))
## [1] 0
No missing values! Notice that all of the categorical variables have type “character”, which is not ideal for machine learning modeling. Let us convert them into factors.
salary_US2 <- as.data.frame(unclass(salary_US1), stringsAsFactors=TRUE)
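For reference, a dplyr equivalent that keeps the tibble class is sketched below; salary_US2_tbl is a hypothetical name and is not used in what follows.
salary_US2_tbl <- salary_US1 %>%
  mutate(across(where(is.character), as.factor))  # every character column becomes a factor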
Use the summary function to get an overview of the variables:
summary(salary_US2)
## age workclass fnlwgt education
## Min. :17.00 Private :20135 Min. : 12285 HS-grad :9702
## 1st Qu.:28.00 Self-emp-not-inc: 2313 1st Qu.: 115895 Some-college:6740
## Median :37.00 Local-gov : 1956 Median : 176730 Bachelors :4766
## Mean :38.66 ? : 1659 Mean : 187069 Masters :1527
## 3rd Qu.:48.00 State-gov : 1210 3rd Qu.: 234138 Assoc-voc :1289
## Max. :90.00 Self-emp-inc : 991 Max. :1484705 11th :1067
## (Other) : 906 (Other) :4079
## education.num marital.status occupation
## Min. : 1.00 Divorced : 4162 Exec-managerial:3735
## 1st Qu.: 9.00 Married-AF-spouse : 23 Prof-specialty :3693
## Median :10.00 Married-civ-spouse :13368 Craft-repair :3685
## Mean :10.17 Married-spouse-absent: 253 Adm-clerical :3449
## 3rd Qu.:12.00 Never-married : 9579 Sales :3364
## Max. :16.00 Separated : 883 Other-service :2777
## Widowed : 902 (Other) :8467
## relationship race sex
## Husband :11861 Amer-Indian-Eskimo: 296 Female: 9682
## Not-in-family : 7528 Asian-Pac-Islander: 292 Male :19488
## Other-relative: 696 Black : 2832
## Own-child : 4691 Other : 129
## Unmarried : 3033 White :25621
## Wife : 1361
##
## capital.gain capital.loss hours.per.week native.country
## Min. : 0 Min. : 0.00 Min. : 1.00 United-States:29170
## 1st Qu.: 0 1st Qu.: 0.00 1st Qu.:40.00
## Median : 0 Median : 0.00 Median :40.00
## Mean : 1089 Mean : 88.51 Mean :40.45
## 3rd Qu.: 0 3rd Qu.: 0.00 3rd Qu.:45.00
## Max. :99999 Max. :4356.00 Max. :99.00
##
## salary
## <=50K:21999
## >50K : 7171
##
##
##
##
##
Notice that since we only kept U.S. observations,
native.country now has only one distinct value,
United-States. It is constant, so we can remove it.
In addition, note that there are two variables describing education. The
variable education is categorical with many
levels, so it would consume many degrees of freedom. Thus, let us drop
education and keep only education-num, its numeric encoding.
The variable fnlwgt represents the final weight
assigned to each row by the U.S. Census Bureau. For simplicity, this
weighting factor is discarded in this analysis.
Notice that relationship also has many levels, but the
role in a family can largely be inferred from sex and marital status. Hence,
relationship is largely redundant and can be removed.
salary_US3 <- salary_US2 %>% select(-c(native.country,fnlwgt,education,relationship))
Use salary_US3 to solve Problems 1 to
6.
Create a side-by-side boxplot to show the relationship between
salary and age. Hint: Similar to Exercise 6 in
HW 10. Don’t use facet_wrap or facet_grid.
boxplot(age~salary,data=salary_US3, horizontal=T)
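For reference, an equivalent ggplot2 version (a sketch, not the graded answer) is:
ggplot(salary_US3, aes(x = age, y = salary)) +
  geom_boxplot()  # one horizontal box of age per salary level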
Create a density plot to show the relationship between
salary and capital.gain. Hint: Read Page 20
in lecture slides ggplot2_3. (Note: “density plot” is a typo in the problem;
what is actually requested is a frequency polygon scaled to density, as below.)
ggplot(data = salary_US3, mapping = aes(x = capital.gain, y = ..density..)) +
geom_freqpoly(mapping = aes(colour = salary), binwidth = 5000)
Create a two-dimensional bin-plot to show the relationship between
sex and salary. Hint: Use
geom_bin2d. Read Page 12 in lecture slides
ggplot2_2.
ggplot(salary_US3, aes(sex, salary)) +
geom_bin2d()
Use the following code to separate salary_US3 to a
training set and a test set. Make sure that you use the same seed.
set.seed(99)
train_US <- sample_frac(salary_US3, 0.8, fac="salary") # stratified sampling
test_US <- setdiff(salary_US3, train_US)
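A quick check of the split is worthwhile. Note that dplyr's setdiff() returns only distinct rows, so duplicated observations are silently dropped from the test set and the two parts need not sum to nrow(salary_US3). A sketch:
nrow(train_US); nrow(test_US)       # 23336 and 4579 here; duplicates are dropped by setdiff()
prop.table(table(train_US$salary))  # class balance in the training set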
Data preparation is done for now.
Load the packages:
library(rpart)
library(rpart.plot)
Step 1. (5 points) Fit a decision tree with train_US
using all of the feature variables. Set the seed as follows:
set.seed(99)
full_form <- as.formula(salary~.)
mod_tree <- rpart(full_form, data = train_US)
mod_tree
## n= 23336
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 23336 5741 <=50K (0.75398526 0.24601474)
## 2) marital.status=Divorced,Married-spouse-absent,Never-married,Separated,Widowed 12585 826 <=50K (0.93436631 0.06563369)
## 4) capital.gain< 7139.5 12363 609 <=50K (0.95074011 0.04925989) *
## 5) capital.gain>=7139.5 222 5 >50K (0.02252252 0.97747748) *
## 3) marital.status=Married-AF-spouse,Married-civ-spouse 10751 4915 <=50K (0.54283322 0.45716678)
## 6) education.num< 12.5 7610 2608 <=50K (0.65729304 0.34270696)
## 12) capital.gain< 5095.5 7227 2234 <=50K (0.69088142 0.30911858) *
## 13) capital.gain>=5095.5 383 9 >50K (0.02349869 0.97650131) *
## 7) education.num>=12.5 3141 834 >50K (0.26552053 0.73447947) *
Step 2. (5 points) Plot the decision tree.
rpart.plot(mod_tree)
Step 3. (5 points) Do prediction for the test data using the trained decision tree. Calculate the confusion matrix.
confusion_matrix <- function(data, y, mod){
  # Predict class labels for `data` with model `mod`, then cross-tabulate
  # the true labels (rows) against the predictions (columns).
  confusion_matrix <- data %>%
    mutate(pred = predict(mod, newdata = data, type = "class"),
           y = y) %>%
    select(y, pred) %>%
    table()
  return(confusion_matrix)
}
misclass <- function(confusion){
  # Misclassification rate = 1 - accuracy = 1 - sum(diagonal)/total.
  misclass <- 1 - sum(diag(confusion))/sum(confusion)
  return(misclass)
}
confusion_test <- confusion_matrix(test_US, test_US$salary, mod_tree)
confusion_test
## pred
## y <=50K >50K
## <=50K 3142 199
## >50K 599 639
Step 4. (5 points) Calculate the misclassification rate.
misclass(confusion_test)
## [1] 0.1742739
Step 5. (5 points) Calculate the true positive rate and the true negative rate for the prediction results on the test set, where the two rates are defined as follows:
tpr <- confusion_test[1,1]/sum(confusion_test[,1]); tpr
## [1] 0.8398824
tnr <- confusion_test[2,2]/sum(confusion_test[,2]); tnr
## [1] 0.7625298
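As implemented above, both rates condition on the predicted class (column sums), with <=50K treated as the positive class. Under the more common convention that conditions on the true class (row sums), the rates would instead be computed as in this sketch:
tpr_row <- confusion_test[1,1]/sum(confusion_test[1,])  # fraction of true <=50K predicted correctly
tnr_row <- confusion_test[2,2]/sum(confusion_test[2,])  # fraction of true >50K predicted correctly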
Load the package:
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.1.2
Step 1. (5 points) Fit a random forest model with
train_US using all of the feature variables. Use default
values for all tuning parameters. Set the seed as follows:
set.seed(99)
mod_rforest<- randomForest(full_form,train_US,ntree=1000, mtry = 5)
mod_rforest
##
## Call:
## randomForest(formula = full_form, data = train_US, ntree = 1000, mtry = 5)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 14.65%
## Confusion matrix:
## <=50K >50K class.error
## <=50K 16247 1348 0.07661267
## >50K 2071 3670 0.36073855
Step 2. (5 points) Do prediction for the test data using the trained random forest. Calculate the confusion matrix.
confusion_testrf <- confusion_matrix(test_US, test_US$salary, mod_rforest)
confusion_testrf
## pred
## y <=50K >50K
## <=50K 3039 302
## >50K 472 766
Step 3. (5 points) Calculate the misclassification rate.
misclass(confusion_testrf)
## [1] 0.1690325
Step 4. (5 points) Calculate the true positive rate and the true negative rate for the prediction results on the test set.
tpr_rf <- confusion_testrf[1,1]/sum(confusion_testrf[,1]); tpr_rf
## [1] 0.8655654
tnr_rf <- confusion_testrf[2,2]/sum(confusion_testrf[,2]); tnr_rf
## [1] 0.7172285
Step 5. (5 points) Set mtry=10 and
ntree=100 and do Steps 1 to 4 again. Which model is better?
Set the seed as follows:
The first random forest (ntree = 1000, mtry = 5) is better: it has a lower OOB error estimate (14.65% vs. 15.86%) and, as shown below, a lower test misclassification rate (0.169 vs. 0.178). Averaging more trees reduces variance, and the smaller mtry decorrelates the individual trees, both of which help guard against overfitting.
set.seed(99)
mod_rforest1<- randomForest(full_form,train_US,ntree=100, mtry = 10)
mod_rforest1
##
## Call:
## randomForest(formula = full_form, data = train_US, ntree = 100, mtry = 10)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 10
##
## OOB estimate of error rate: 15.86%
## Confusion matrix:
## <=50K >50K class.error
## <=50K 16040 1555 0.08837738
## >50K 2145 3596 0.37362829
confusion_testrf1 <- confusion_matrix(test_US, test_US$salary, mod_rforest1)
confusion_testrf1
## pred
## y <=50K >50K
## <=50K 3002 339
## >50K 477 761
misclass(confusion_testrf1)
## [1] 0.1782048
tpr_rf1 <- confusion_testrf1[1,1]/sum(confusion_testrf1[,1]); tpr_rf1
## [1] 0.8628916
tnr_rf1 <- confusion_testrf1[2,2]/sum(confusion_testrf1[,2]); tnr_rf1
## [1] 0.6918182
Step 6. (5 points) Which variable is the most important? Rank the variables according to their importance. Use the first model.
names(salary_US3)
## [1] "age" "workclass" "education.num" "marital.status"
## [5] "occupation" "race" "sex" "capital.gain"
## [9] "capital.loss" "hours.per.week" "salary"
Load the package and define the full formula for the logistic regression model as follows:
library(glmnet)
form_full <- as.formula("salary~age+workclass+education.num+
marital.status+occupation+race+sex+capital.gain+
capital.loss+hours.per.week")
Step 1. (10 points) Use 5-fold cross-validation to find the “optimal”
lambda. Print and plot the results. Hint: use the cv.glmnet
function in the glmnet package. Set the seed as
follows:
set.seed(99)
predictors <- model.matrix(form_full, data = train_US)
cv.fit <- cv.glmnet(predictors, train_US$salary, nfolds=5, family = "binomial", type.measure = "class") # CV on misclassification error
cv.fit$lambda.1se
## [1] 0.005176728
plot(cv.fit)
Step 2. (5 points) Use glmnet to fit a regularized
logistic regression with lambda equal to lambda.1se found
in Step 1.
lambda1 = cv.fit$lambda.1se
mod_lr <- glmnet(predictors, train_US$salary, family = "binomial", lambda = lambda1)
mod_lr$beta
## 39 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) .
## age 0.0203045912
## workclassFederal-gov 0.4206971823
## workclassLocal-gov .
## workclassNever-worked .
## workclassPrivate 0.0908181818
## workclassSelf-emp-inc 0.2854952585
## workclassSelf-emp-not-inc -0.1927690345
## workclassState-gov .
## workclassWithout-pay .
## education.num 0.2867569303
## marital.statusMarried-AF-spouse 1.4640604991
## marital.statusMarried-civ-spouse 2.0914564325
## marital.statusMarried-spouse-absent .
## marital.statusNever-married -0.3379584076
## marital.statusSeparated .
## marital.statusWidowed .
## occupationAdm-clerical .
## occupationArmed-Forces .
## occupationCraft-repair .
## occupationExec-managerial 0.7070894622
## occupationFarming-fishing -0.6435047227
## occupationHandlers-cleaners -0.3511981997
## occupationMachine-op-inspct -0.0738227580
## occupationOther-service -0.5963914056
## occupationPriv-house-serv .
## occupationProf-specialty 0.3728751137
## occupationProtective-serv 0.2742614606
## occupationSales 0.1137338501
## occupationTech-support 0.3820048593
## occupationTransport-moving .
## raceAsian-Pac-Islander .
## raceBlack .
## raceOther .
## raceWhite .
## sexMale 0.0838127096
## capital.gain 0.0002123040
## capital.loss 0.0005405183
## hours.per.week 0.0251743954
In the sparse coefficient vector above, all of the race dummies are shrunk to zero, so we drop race and refit an unregularized logistic regression on the remaining features for comparison:
form2 <- as.formula(salary~age+workclass+education.num+marital.status+occupation+sex+capital.gain+capital.loss+hours.per.week)
mod_lr2 <- glm(form2, data=train_US,family=binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(mod_lr2)
##
## Call:
## glm(formula = form2, family = binomial, data = train_US)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2288 -0.5078 -0.2048 -0.0381 3.2546
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.039e+00 2.254e-01 -40.106 < 2e-16 ***
## age 2.611e-02 1.865e-03 14.002 < 2e-16 ***
## workclassFederal-gov 1.004e+00 1.763e-01 5.694 1.24e-08 ***
## workclassLocal-gov 2.914e-01 1.615e-01 1.805 0.071079 .
## workclassNever-worked -7.968e+00 2.579e+02 -0.031 0.975351
## workclassPrivate 5.473e-01 1.442e-01 3.795 0.000147 ***
## workclassSelf-emp-inc 7.834e-01 1.752e-01 4.471 7.78e-06 ***
## workclassSelf-emp-not-inc 2.408e-02 1.587e-01 0.152 0.879412
## workclassState-gov 2.281e-01 1.744e-01 1.308 0.190919
## workclassWithout-pay -1.096e+01 1.276e+02 -0.086 0.931553
## education.num 3.012e-01 1.106e-02 27.220 < 2e-16 ***
## marital.statusMarried-AF-spouse 2.650e+00 5.407e-01 4.901 9.56e-07 ***
## marital.statusMarried-civ-spouse 2.179e+00 7.693e-02 28.320 < 2e-16 ***
## marital.statusMarried-spouse-absent -1.569e-01 3.042e-01 -0.516 0.605999
## marital.statusNever-married -5.284e-01 9.509e-02 -5.557 2.74e-08 ***
## marital.statusSeparated -1.717e-01 1.921e-01 -0.894 0.371373
## marital.statusWidowed -2.005e-01 1.856e-01 -1.080 0.280116
## occupationAdm-clerical 1.429e-01 1.133e-01 1.262 0.207083
## occupationArmed-Forces -8.284e-01 1.672e+00 -0.495 0.620266
## occupationCraft-repair 1.412e-01 9.792e-02 1.442 0.149390
## occupationExec-managerial 9.180e-01 1.002e-01 9.164 < 2e-16 ***
## occupationFarming-fishing -8.381e-01 1.627e-01 -5.153 2.57e-07 ***
## occupationHandlers-cleaners -6.827e-01 1.730e-01 -3.947 7.92e-05 ***
## occupationMachine-op-inspct -1.922e-01 1.245e-01 -1.544 0.122566
## occupationOther-service -9.007e-01 1.567e-01 -5.749 8.97e-09 ***
## occupationPriv-house-serv -3.460e+00 2.116e+00 -1.635 0.102016
## occupationProf-specialty 6.214e-01 1.076e-01 5.778 7.58e-09 ***
## occupationProtective-serv 7.197e-01 1.504e-01 4.785 1.71e-06 ***
## occupationSales 3.308e-01 1.037e-01 3.191 0.001418 **
## occupationTech-support 7.599e-01 1.370e-01 5.547 2.90e-08 ***
## occupationTransport-moving NA NA NA NA
## sexMale 1.557e-01 6.163e-02 2.526 0.011525 *
## capital.gain 3.167e-04 1.199e-05 26.417 < 2e-16 ***
## capital.loss 6.593e-04 4.299e-05 15.336 < 2e-16 ***
## hours.per.week 2.875e-02 1.870e-03 15.370 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26039 on 23335 degrees of freedom
## Residual deviance: 15158 on 23302 degrees of freedom
## AIC: 15226
##
## Number of Fisher Scoring iterations: 12
Step 3. (5 points) Do prediction for the test data using the trained model and calculate the misclassification rate.
logistic.misclassrate <- function(dataset, y, fit, form){
  # Build the design matrix for `dataset`, predict class labels with the
  # fitted glmnet model, and return the proportion of misclassified rows.
  misclass_lr <- dataset %>%
    mutate(pred.logistic = predict(fit, newx = model.matrix(form, data = dataset),
                                   type = "class")) %>%
    mutate(misclassify = ifelse(y != pred.logistic, 1, 0)) %>%
    summarize(misclass.rate = mean(misclassify))
  return(misclass_lr$misclass.rate)
}
predictors1 <- model.matrix(form2, data = train_US)
fit_opt <- glmnet(predictors1, train_US$salary,
family = "binomial", lambda = lambda1)
misclassrate_lr<-logistic.misclassrate(test_US,test_US$salary,fit_opt, form2)
misclassrate_lr
## [1] 0.1690325
Step 4. (5 points) Use glmnet to re-fit a regularized
logistic regression with lambda equal to lambda.min found
in Step 1 and calculate the misclassification rate for this one. Set the
seed as follows:
set.seed(99)
lambda2 = cv.fit$lambda.min
mod_lr <- glmnet(predictors, train_US$salary, family = "binomial", lambda = lambda2)
mod_lr$beta
## 39 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) .
## age 0.0252944913
## workclassFederal-gov 0.8876991158
## workclassLocal-gov 0.1744466947
## workclassNever-worked .
## workclassPrivate 0.4269646215
## workclassSelf-emp-inc 0.6541269431
## workclassSelf-emp-not-inc -0.0768595914
## workclassState-gov 0.1013949020
## workclassWithout-pay -1.9569187418
## education.num 0.2982259546
## marital.statusMarried-AF-spouse 2.5621236423
## marital.statusMarried-civ-spouse 2.1787884608
## marital.statusMarried-spouse-absent -0.0734169553
## marital.statusNever-married -0.5023258283
## marital.statusSeparated -0.0913308941
## marital.statusWidowed -0.1457490611
## occupationAdm-clerical 0.1282631917
## occupationArmed-Forces -0.3806008374
## occupationCraft-repair 0.1298749718
## occupationExec-managerial 0.9068445894
## occupationFarming-fishing -0.8190704566
## occupationHandlers-cleaners -0.6419479989
## occupationMachine-op-inspct -0.1698045356
## occupationOther-service -0.8499321584
## occupationPriv-house-serv -1.9463227958
## occupationProf-specialty 0.6122354485
## occupationProtective-serv 0.7103943292
## occupationSales 0.3165231451
## occupationTech-support 0.7323469613
## occupationTransport-moving .
## raceAsian-Pac-Islander 0.3790006886
## raceBlack 0.0417806673
## raceOther -0.2788294267
## raceWhite 0.1805908310
## sexMale 0.1434482041
## capital.gain 0.0003064485
## capital.loss 0.0006480464
## hours.per.week 0.0284857908
mod_lr2 <- glm(form_full, data=train_US,family=binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(mod_lr2)
##
## Call:
## glm(formula = form_full, family = binomial, data = train_US)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2064 -0.5082 -0.2042 -0.0386 3.2440
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.500e+00 3.269e-01 -29.065 < 2e-16 ***
## age 2.591e-02 1.867e-03 13.879 < 2e-16 ***
## workclassFederal-gov 1.021e+00 1.770e-01 5.769 7.97e-09 ***
## workclassLocal-gov 3.016e-01 1.616e-01 1.866 0.062043 .
## workclassNever-worked -7.936e+00 2.591e+02 -0.031 0.975564
## workclassPrivate 5.496e-01 1.443e-01 3.810 0.000139 ***
## workclassSelf-emp-inc 7.813e-01 1.752e-01 4.460 8.21e-06 ***
## workclassSelf-emp-not-inc 2.817e-02 1.587e-01 0.178 0.859093
## workclassState-gov 2.322e-01 1.745e-01 1.331 0.183249
## workclassWithout-pay -1.096e+01 1.275e+02 -0.086 0.931500
## education.num 2.996e-01 1.108e-02 27.052 < 2e-16 ***
## marital.statusMarried-AF-spouse 2.636e+00 5.408e-01 4.874 1.10e-06 ***
## marital.statusMarried-civ-spouse 2.175e+00 7.694e-02 28.271 < 2e-16 ***
## marital.statusMarried-spouse-absent -1.567e-01 3.047e-01 -0.514 0.607009
## marital.statusNever-married -5.321e-01 9.516e-02 -5.591 2.25e-08 ***
## marital.statusSeparated -1.555e-01 1.927e-01 -0.807 0.419617
## marital.statusWidowed -1.989e-01 1.857e-01 -1.072 0.283928
## occupationAdm-clerical 1.365e-01 1.134e-01 1.204 0.228608
## occupationArmed-Forces -6.771e-01 1.758e+00 -0.385 0.700149
## occupationCraft-repair 1.301e-01 9.805e-02 1.326 0.184677
## occupationExec-managerial 9.073e-01 1.003e-01 9.046 < 2e-16 ***
## occupationFarming-fishing -8.497e-01 1.628e-01 -5.218 1.80e-07 ***
## occupationHandlers-cleaners -6.881e-01 1.731e-01 -3.974 7.06e-05 ***
## occupationMachine-op-inspct -1.943e-01 1.246e-01 -1.560 0.118753
## occupationOther-service -8.966e-01 1.568e-01 -5.718 1.08e-08 ***
## occupationPriv-house-serv -3.482e+00 2.131e+00 -1.634 0.102314
## occupationProf-specialty 6.152e-01 1.077e-01 5.710 1.13e-08 ***
## occupationProtective-serv 7.163e-01 1.505e-01 4.758 1.95e-06 ***
## occupationSales 3.206e-01 1.038e-01 3.088 0.002014 **
## occupationTech-support 7.459e-01 1.372e-01 5.437 5.43e-08 ***
## occupationTransport-moving NA NA NA NA
## raceAsian-Pac-Islander 7.485e-01 3.180e-01 2.354 0.018577 *
## raceBlack 3.946e-01 2.584e-01 1.527 0.126745
## raceOther -4.830e-02 4.946e-01 -0.098 0.922219
## raceWhite 5.118e-01 2.460e-01 2.081 0.037445 *
## sexMale 1.498e-01 6.175e-02 2.426 0.015247 *
## capital.gain 3.174e-04 1.200e-05 26.445 < 2e-16 ***
## capital.loss 6.581e-04 4.300e-05 15.306 < 2e-16 ***
## hours.per.week 2.867e-02 1.870e-03 15.336 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26039 on 23335 degrees of freedom
## Residual deviance: 15149 on 23298 degrees of freedom
## AIC: 15225
##
## Number of Fisher Scoring iterations: 12
predictors2 <- model.matrix(form_full, data = train_US)
fit_opt <- glmnet(predictors2, train_US$salary,
family = "binomial", lambda = lambda2)
misclassrate_lr<-logistic.misclassrate(test_US,test_US$salary,fit_opt, form_full)
misclassrate_lr
## [1] 0.1631361
Step 5. (5 points) Compare the results of the two models. Which one is better?
The re-fit regularized logistic regression with lambda equal to
lambda.min is better because it gives a smaller
misclassification rate (0.1631 vs. 0.1690 for lambda.1se).
The World Happiness Report is a landmark survey of the state of global happiness. The first report was published in 2012. We will explore the 2018 data.
Overall rank: happiness rank; numeric. Range: 1 to 156
Country or region: character. 156 values
Score: subjective well-being; numeric. Range: 0 to 10
GDP per capita
Social support: the national average of the binary responses (either 0 or 1) to the GWP question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”
Healthy life expectancy
Freedom to make life choices: the national average of responses to the GWP question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”
Generosity: the residual of regressing the national average of responses to the GWP question “Have you donated money to a charity in the past month?” on GDP per capita
Perceptions of corruption: the national average of the survey responses to two GWP questions: “Is corruption widespread throughout the government or not?” and “Is corruption widespread within businesses or not?”

happiness_0 = read_csv("happiness_2018.csv", show_col_types = FALSE)
glimpse(happiness_0)
## Rows: 156
## Columns: 9
## $ `Overall rank` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …
## $ `Country or region` <chr> "Finland", "Norway", "Denmark", "Icelan…
## $ Score <dbl> 7.632, 7.594, 7.555, 7.495, 7.487, 7.44…
## $ `GDP per capita` <dbl> 1.305, 1.456, 1.351, 1.343, 1.420, 1.36…
## $ `Social support` <dbl> 1.592, 1.582, 1.590, 1.644, 1.549, 1.48…
## $ `Healthy life expectancy` <dbl> 0.874, 0.861, 0.868, 0.914, 0.927, 0.87…
## $ `Freedom to make life choices` <dbl> 0.681, 0.686, 0.683, 0.677, 0.660, 0.63…
## $ Generosity <dbl> 0.202, 0.286, 0.284, 0.353, 0.256, 0.33…
## $ `Perceptions of corruption` <chr> "0.393", "0.340", "0.408", "0.138", "0.…
Any missing values?
sum(is.na(happiness_0))
## [1] 0
No missing values!
Notice that Perceptions of corruption was coded as a
character variable, which is not what we want for a numeric measure. Let us
convert it to numeric.
happiness_1 = happiness_0 %>%
mutate(Corruption = as.numeric(`Perceptions of corruption`))
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
The warning message tells us missing values were produced in the conversion process.
sum(is.na(happiness_1$Corruption))
## [1] 1
happiness_1[is.na(happiness_1$Corruption),] %>% select(c(`Country or region`,Corruption))
## # A tibble: 1 × 2
## `Country or region` Corruption
## <chr> <dbl>
## 1 United Arab Emirates NA
Fortunately, only one country has a missing value in
Corruption. Remove this observation.
happiness_2 = happiness_1 %>% drop_na()
dim(happiness_2)
## [1] 155 10
Since we are going to do clustering and PCA, we need to drop the
“target” variable Overall rank. The label variable
Country or region has all unique values, so it is not
useful in modeling. We can drop it too. The original
Perceptions of corruption can also be removed because we
already have a numeric version of it.
happiness_3 = happiness_2 %>%
select(-c(`Overall rank`, `Country or region`,`Perceptions of corruption`))
head(happiness_3,3)
## # A tibble: 3 × 7
## Score `GDP per capita` `Social support` `Healthy life exp… `Freedom to make l…
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.63 1.30 1.59 0.874 0.681
## 2 7.59 1.46 1.58 0.861 0.686
## 3 7.56 1.35 1.59 0.868 0.683
## # … with 2 more variables: Generosity <dbl>, Corruption <dbl>
names(happiness_3)
## [1] "Score" "GDP per capita"
## [3] "Social support" "Healthy life expectancy"
## [5] "Freedom to make life choices" "Generosity"
## [7] "Corruption"
Notice that some of the column names are very long with embedded spaces. Let us rename them.
names(happiness_3) <- c("Score", "GDPP", "SocialSupport","LifeExp", "Freedom", "Generosity","Corruption")
head(happiness_3,3)
## # A tibble: 3 × 7
## Score GDPP SocialSupport LifeExp Freedom Generosity Corruption
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7.63 1.30 1.59 0.874 0.681 0.202 0.393
## 2 7.59 1.46 1.58 0.861 0.686 0.286 0.34
## 3 7.56 1.35 1.59 0.868 0.683 0.284 0.408
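For reference, the same renaming can be done by name rather than by position with dplyr::rename(), applied to the pre-rename columns. A sketch rebuilding happiness_3 from happiness_2:
happiness_3 <- happiness_2 %>%
  select(-c(`Overall rank`, `Country or region`, `Perceptions of corruption`)) %>%
  rename(GDPP = `GDP per capita`, SocialSupport = `Social support`,
         LifeExp = `Healthy life expectancy`, Freedom = `Freedom to make life choices`)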
Use happiness_3 to solve Problems 7 to
9.
Create histograms for GDPP, Freedom, and
Corruption, respectively.
ggplot(data = happiness_3, aes(x = GDPP)) +
geom_histogram(color = 'white', binwidth = .08, fill='steelblue') +
labs(x = 'GDPP')
ggplot(data = happiness_3, aes(x = Freedom)) +
geom_histogram(color = 'white', binwidth = .05, fill='steelblue') +
labs(x = 'Freedom')
ggplot(data = happiness_3, aes(x = Corruption)) +
geom_histogram(color = 'white', binwidth = .03, fill='steelblue') +
labs(x = 'Corruption')
Create a scatterplot matrix for all variables in
happiness_3. Hint: You may just use the plot()
function.
plot(happiness_3, pch=20, cex=.3, col = "skyblue")
If you have extra time, you may install the GGally
package and use the ggpairs() function to create the
scatterplot matrix.
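A sketch with GGally, assuming the package is installed:
library(GGally)
ggpairs(happiness_3)  # scatterplots, densities, and pairwise correlations in one matrix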
Calculate the principal components. Save the results and call it
pc.happiness.
pc.happiness <- happiness_3 %>% prcomp(scale=TRUE)
pc.happiness
## Standard deviations (1, .., p=7):
## [1] 1.9531575 1.1879200 0.7725034 0.7530372 0.5655682 0.4039974 0.3565308
##
## Rotation (n x k) = (7 x 7):
## PC1 PC2 PC3 PC4 PC5
## Score -0.47669832 0.05368983 -0.04625171 0.08805459 -0.03678967
## GDPP -0.45549646 0.23911795 -0.08808347 -0.20884619 0.24072812
## SocialSupport -0.42296782 0.21394135 -0.18595613 0.23425680 -0.75567883
## LifeExp -0.44557603 0.21482773 -0.12628090 -0.25101520 0.46869391
## Freedom -0.32582611 -0.37152071 0.40342711 0.69482741 0.26999464
## Generosity -0.09638624 -0.67906233 -0.72360393 -0.01921444 0.04552942
## Corruption -0.26905194 -0.49886358 0.50320915 -0.58956802 -0.27386491
## PC6 PC7
## Score 0.85690768 -0.15600969
## GDPP -0.09840230 0.78507141
## SocialSupport -0.32958424 -0.07870914
## LifeExp -0.32963782 -0.58993260
## Freedom -0.18728161 0.04795450
## Generosity -0.03784297 0.04590404
## Corruption -0.04725308 -0.02648475
happiness.var <- pc.happiness$sdev^2
pve <- happiness.var[1:7]/sum(happiness.var)
pve
## [1] 0.54497488 0.20159341 0.08525163 0.08100929 0.04569534 0.02331627 0.01815917
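The same proportions (and their cumulative sums) can be read directly from summary():
summary(pc.happiness)  # standard deviation, proportion of variance, cumulative proportion per PC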
PC <- 1:7
pve_df <- data.frame(PC, pve)  # avoid shadowing the base `data` function
ggplot(data=pve_df, aes(x=PC, y=pve))+
geom_line(color="navy")+
geom_point(aes(x=3,y=0.087),cex=5,color="orange",alpha=0.3)+
geom_point(color="red",cex=2)+
labs(title="Proportion of Variance Explained", x="Principal Component",y="pve")+
scale_x_continuous(breaks = 1:7)
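If factoextra (loaded later for clustering) is available, fviz_eig() draws a comparable scree plot in one call:
library(factoextra)
fviz_eig(pc.happiness, addlabels = TRUE)  # scree plot with percentage labels on each bar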
Draw a horizontal reference line at y=0.9.
ggplot(data=pve_df, aes(x=PC, y=cumsum(pve)))+
geom_hline(aes(yintercept=0.9),lty=2,color="purple",linewidth=1, alpha=0.5)+
geom_line(color="navy")+
geom_point(color="red",cex=2)+
labs(title="Cumulative Proportion of Variance Explained",
x="Principal Component",
y="cumulative pve")+
scale_x_continuous(breaks = 1:7)
## Warning: Ignoring unknown parameters: linewidth
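This warning is benign: the linewidth argument was introduced in ggplot2 3.4.0, and the older version used to knit this document ignores it, falling back to the default line width. With pre-3.4.0 ggplot2, writing size = 1 in geom_hline() produces the intended thicker line.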
Extract the first two PCs in a data frame and name it
PC12. Create a scatterplot using the first two principal
components.
pc.happiness$x[,1:2]
## PC1 PC2
## [1,] -3.751467923 -1.233032707
## [2,] -3.822651639 -1.473258656
## [3,] -3.885892205 -1.860252903
## [4,] -3.311512606 -0.854680071
## [5,] -3.772020113 -1.289512069
## [6,] -3.366388512 -1.572872605
## [7,] -3.382425152 -1.482099682
## [8,] -3.715858663 -2.335237739
## [9,] -3.626773779 -1.716295889
## [10,] -3.510831927 -1.755383219
## [11,] -2.047783967 0.298619222
## [12,] -2.930530055 -0.536684311
## [13,] -1.900748983 0.438453789
## [14,] -3.359394423 -1.298061678
## [15,] -2.883341317 -1.024607393
## [16,] -2.719363439 -0.203591768
## [17,] -3.400953210 -0.604105595
## [18,] -2.368827397 -0.304962970
## [19,] -2.881465188 -1.371102409
## [20,] -1.679402945 1.707471956
## [21,] -2.602420670 -1.158221320
## [22,] -2.104012354 0.849628039
## [23,] -0.898044003 1.122305933
## [24,] -1.133130112 0.682622055
## [25,] -1.609762052 1.202828768
## [26,] -1.421520995 0.806817580
## [27,] -1.012350593 0.891915339
## [28,] -1.231426552 1.330008842
## [29,] -0.584687903 -0.083178164
## [30,] -1.745654451 0.243655850
## [31,] -2.373803939 -0.553439867
## [32,] -1.314812628 0.842661198
## [33,] -3.647606366 -1.760584737
## [34,] -0.706543265 -0.079708572
## [35,] -1.796833371 1.302454761
## [36,] -0.721414575 1.050489261
## [37,] -1.056657131 0.696319452
## [38,] -0.967459440 1.798460974
## [39,] -0.159027217 0.976687773
## [40,] -0.616158177 -0.367723686
## [41,] -1.304454563 1.086755989
## [42,] -1.618977612 -0.285291172
## [43,] -1.737179254 -2.184790183
## [44,] -1.437862817 0.434794989
## [45,] -1.249870971 -0.889143146
## [46,] -1.123821275 1.909609055
## [47,] -0.832957616 0.423030067
## [48,] 0.008836708 -0.417228418
## [49,] -0.615836287 2.403580859
## [50,] -1.686293600 0.722110119
## [51,] -0.505295909 1.334721270
## [52,] -0.623691803 1.487519733
## [53,] -1.973464523 1.078562967
## [54,] -1.013843548 -0.061898754
## [55,] -0.637925100 0.994032567
## [56,] -0.699044643 1.406291313
## [57,] -1.359813879 0.211983943
## [58,] -0.412282552 1.754145613
## [59,] -0.962581849 0.609420614
## [60,] -0.846738033 0.816154308
## [61,] 0.043022780 0.093772215
## [62,] -1.507607529 0.671458563
## [63,] -0.584709016 0.348956223
## [64,] -0.216112941 0.980562947
## [65,] 0.095362955 -0.172165027
## [66,] 0.764538045 1.110406893
## [67,] -0.305621963 0.604122897
## [68,] -0.191088384 2.180596109
## [69,] -0.433438743 0.298394369
## [70,] -0.218291434 -0.038574744
## [71,] 0.310748041 -0.066504447
## [72,] -0.556975919 1.065123505
## [73,] -0.410046459 1.197022140
## [74,] 1.319521141 -0.545957525
## [75,] -2.231084882 -0.771416160
## [76,] -1.022132920 1.706344576
## [77,] 0.079263307 1.306795673
## [78,] 0.211633899 2.741655235
## [79,] -0.573237808 -0.272480316
## [80,] 0.072873289 1.332862012
## [81,] -0.016397634 1.194666194
## [82,] -0.605785561 0.403235801
## [83,] 0.660636067 1.703613259
## [84,] 0.880558474 0.945375087
## [85,] -0.443450818 0.950231093
## [86,] -0.011447196 0.804924017
## [87,] 1.169666532 0.436400720
## [88,] 0.085138336 0.642241293
## [89,] -0.029956077 0.236070243
## [90,] 1.560640922 -0.377212003
## [91,] 0.195753285 -0.576428806
## [92,] 0.599243819 0.789685956
## [93,] -0.041877629 0.259363080
## [94,] -0.313449380 -0.095502144
## [95,] 0.035477148 -1.854581627
## [96,] -0.329664304 -1.803238284
## [97,] 1.825232419 -3.118614199
## [98,] 1.978549659 -0.480967931
## [99,] -0.049620332 1.955566996
## [100,] 0.678708012 -1.170106126
## [101,] 0.578608237 2.182614517
## [102,] 1.009343281 1.424667456
## [103,] 1.118150087 1.096232680
## [104,] 0.524566272 0.598606624
## [105,] 0.375008332 -0.946352370
## [106,] 2.153478931 -0.679159158
## [107,] 1.681337963 -0.552747960
## [108,] 1.515853142 0.031127383
## [109,] 0.485710534 -1.450044089
## [110,] 1.223866144 1.478185782
## [111,] 0.835891016 0.583629000
## [112,] 3.037707159 -1.091588338
## [113,] 1.678209707 -0.006875453
## [114,] 1.024746130 -0.795584668
## [115,] -0.140890528 -0.728777542
## [116,] 1.074622456 0.470293735
## [117,] 2.156060937 0.041299842
## [118,] 0.864914950 0.791475624
## [119,] 0.804801585 -1.289682817
## [120,] 2.115213941 -0.497135256
## [121,] 1.138680039 0.747697255
## [122,] 2.122636664 -1.648127539
## [123,] 1.262621790 -1.525803489
## [124,] 1.535447841 -0.859172463
## [125,] 2.082529653 0.671610596
## [126,] 1.739453115 -1.213041260
## [127,] 1.282146790 0.162881535
## [128,] 1.448621035 1.403109292
## [129,] 0.212623815 -3.845008655
## [130,] 3.173070602 -0.192479059
## [131,] 2.709943387 -0.433390781
## [132,] 1.450221504 -0.619426413
## [133,] 2.750536515 -0.844818639
## [134,] 2.086109895 -1.071208274
## [135,] 3.127284593 -0.892517649
## [136,] 2.346271098 1.020498454
## [137,] 1.212293556 1.182970105
## [138,] 3.085743768 -1.018042309
## [139,] 2.701416625 -0.834773715
## [140,] 2.216875833 -0.260873292
## [141,] 2.726489142 1.486813854
## [142,] 2.876844756 0.186958747
## [143,] 2.325232091 -0.282216146
## [144,] 3.912379680 -0.035607255
## [145,] 0.968322698 0.639510870
## [146,] 2.972793394 -1.461052784
## [147,] 3.379225605 -1.489497481
## [148,] 3.136595785 -0.800891565
## [149,] 2.778357132 -1.520938540
## [150,] 0.991623771 -3.098002886
## [151,] 2.696862962 0.710675464
## [152,] 2.000370006 -1.307331902
## [153,] 3.826383061 -0.718214604
## [154,] 5.234976947 -1.547929789
## [155,] 4.551301739 -0.125406939
PC12 <- pc.happiness$x %>% as_tibble() %>% select(1:2)
pc.happiness$x %>% as_tibble() %>% select(PC1, PC2) %>%
bind_cols(happiness_3) %>%
ggplot(aes(x = `PC1`, y = `PC2`)) + geom_point()
Note that you don’t need to do clustering yet. Can you see any pattern?
No clear pattern: the points form one roughly continuous cloud spread mainly along PC1, and no cluster labels have been assigned yet.
Scale all the variables in happiness_3 and save the new
dataset as happiness_4.
happiness_4 = scale(happiness_3)
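A quick sanity check that the scaling worked: every column of happiness_4 should now have mean approximately 0 and standard deviation 1. A sketch:
round(colMeans(happiness_4), 10)  # column means, all essentially 0
apply(happiness_4, 2, sd)         # column standard deviations, all 1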
Since we do not know in advance how many clusters are needed to group the
happiness data, we use the fviz_nbclust() function
to find the optimal number of clusters.
library(factoextra)
fviz_nbclust(happiness_4, kmeans, method = "gap_stat")
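Here we use the gap statistic; the elbow ("wss") and average-silhouette criteria supported by fviz_nbclust() are quick cross-checks on the chosen number of clusters. A sketch:
fviz_nbclust(happiness_4, kmeans, method = "wss")         # elbow (total within-cluster SS)
fviz_nbclust(happiness_4, kmeans, method = "silhouette")  # average silhouette width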
Use the K-means method to cluster the data set happiness_4.
Then, use mutate to add the clustering results to the data
set PC12, which was created in Step 5 of Problem 9. Call
the new data set PC12_cluster. Set the seed as follows:
set.seed(99)
km_mod = kmeans(happiness_4, centers=3)
PC12_cluster <- PC12 %>% mutate(cluster = factor(km_mod$cluster))
PC12_cluster
## # A tibble: 155 × 3
## PC1 PC2 cluster
## <dbl> <dbl> <fct>
## 1 -3.75 -1.23 1
## 2 -3.82 -1.47 1
## 3 -3.89 -1.86 1
## 4 -3.31 -0.855 1
## 5 -3.77 -1.29 1
## 6 -3.37 -1.57 1
## 7 -3.38 -1.48 1
## 8 -3.72 -2.34 1
## 9 -3.63 -1.72 1
## 10 -3.51 -1.76 1
## # … with 145 more rows
Step 4. (5 points) Using PC12_cluster, create a
scatterplot for PC1 and PC2. Make
sure to map color to cluster.
PC12_cluster %>% ggplot(aes(x = PC1, y = PC2)) + geom_point(aes(color = cluster)) + xlab("PC1") + ylab("PC2")
Step 5. (5 points) Create a plot for the clustering results similar to
Step 4, but use the fviz_cluster() function in the
factoextra package.
fviz_cluster(km_mod, data = happiness_4)
Use mutate to add Country or region to
PC12_cluster. Make sure to use the
Country or region column in happiness_2, where the
observation with a missing value has been removed.
PC12_cluster <- PC12 %>% mutate(cluster = factor(km_mod$cluster), countryOrRegion = factor(happiness_2$`Country or region`))
PC12_cluster
## # A tibble: 155 × 4
## PC1 PC2 cluster countryOrRegion
## <dbl> <dbl> <fct> <fct>
## 1 -3.75 -1.23 1 Finland
## 2 -3.82 -1.47 1 Norway
## 3 -3.89 -1.86 1 Denmark
## 4 -3.31 -0.855 1 Iceland
## 5 -3.77 -1.29 1 Switzerland
## 6 -3.37 -1.57 1 Netherlands
## 7 -3.38 -1.48 1 Canada
## 8 -3.72 -2.34 1 New Zealand
## 9 -3.63 -1.72 1 Sweden
## 10 -3.51 -1.76 1 Australia
## # … with 145 more rows
Print out those countries or regions in Cluster #1.
PC12_cluster %>% filter(cluster == 1) %>% select(countryOrRegion)
## # A tibble: 22 × 1
## countryOrRegion
## <fct>
## 1 Finland
## 2 Norway
## 3 Denmark
## 4 Iceland
## 5 Switzerland
## 6 Netherlands
## 7 Canada
## 8 New Zealand
## 9 Sweden
## 10 Australia
## # … with 12 more rows
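As a quick follow-up, counting the countries per cluster summarizes how the 155 observations were partitioned. A sketch:
PC12_cluster %>% count(cluster)  # number of countries in each cluster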