The dataset (1) is derived from the 1994 Census and includes information on the occupations and income of individuals in the United States. The following features are included in the dataset:
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
income: <=50K, >50K.
A subset of the data will be used for illustrative purposes in conjunction with regression techniques to create predictive models for both categorical and continuous response variables. The intention of this document is to compare techniques including regularized regression algorithms and support vector machines to linear and logistic regression in order to recommend methods for further use and analysis of the data.
Multiple linear regression is a statistical method that uses several explanatory variables to predict the outcome of a continuous response variable, and is an extension of ordinary least squares regression. It fits a linear relationship between the explanatory variables and the response. The assumptions for multiple regression include the following: the response variable is normally distributed with constant variance, the explanatory variables are nonrandom and uncorrelated with one another, the explanatory variables have a linear relationship with the response, and the observations are randomly sampled and independent.
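Concretely, for a response $y$ and explanatory variables $x_1, \dots, x_p$, the model takes the form

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2),$$

with the coefficients $\beta_j$ estimated by minimizing the sum of squared residuals.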
Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables determine an outcome measured by a binary response variable; it is therefore appropriate for situations where linear regression cannot be used because the response is a dichotomous categorical variable. The assumptions for logistic regression include the following: the dependent variable should be binary, the independent variables should not be correlated, the log odds (the logit of the probability) should have a linear relationship with the independent variables, the sample size must be sufficiently large, and the observations must be independent. Unlike multiple linear regression, because the response is categorical, logistic regression does not require a normally distributed response variable with constant variance.
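In other words, the linearity in logistic regression is on the log-odds scale: for the probability $p$ of the positive class,

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p.$$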
Regularized regression techniques are an expansion on more traditional regression methods by introducing penalty terms to control the complexity of the model and account for issues with overfitting. They can be effective for situations with many predictors or when multicollinearity is present between features.
We will explore three of the most common forms of regularized regression for linear and logistic regression, including Ridge Regression, LASSO, and Elastic Net Regression techniques. Each introduces penalties to the regression model in a slightly different way and may be more or less advantageous in certain modeling situations.
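As a minimal sketch of how the three penalties are specified with the R glmnet package (assuming a numeric model matrix `x` and response vector `y`; both names are illustrative), the `alpha` argument controls the mix of the two penalty types:

```r
library(glmnet)

# alpha controls the penalty mix:
#   alpha = 0     -> Ridge       (lambda * sum(beta^2); shrinks but never zeroes)
#   alpha = 1     -> LASSO       (lambda * sum(|beta|); can zero out coefficients)
#   0 < alpha < 1 -> Elastic Net (a weighted mix of both)
ridge_fit <- glmnet(x, y, alpha = 0)
lasso_fit <- glmnet(x, y, alpha = 1)
enet_fit  <- glmnet(x, y, alpha = 0.5)
```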
SVM is another technique that can be used in both regression and classification contexts, and is a distribution-free method. In classification, SVM separates classes in a feature space by drawing a hyperplane, or decision boundary, that maximizes the margin between parallel hyperplanes while minimizing misclassification errors. SVM can handle both linear and nonlinear class boundaries, using the kernel trick for nonlinear classification; the most common kernels are the radial basis function (RBF) and polynomial kernels. We will explore linear, radial, and polynomial methods to create SVM classification models.
All of the regression models will be assessed through their root mean squared error, or RMSE. The classification models will be assessed through ROC curves and their corresponding AUC values.
The RMSE is the standard deviation of the residuals, and measures the average difference between a model’s predicted values and the actual values. We will use the RMSE on the same set of training and test data to assess the predictive potential of the linear regression models.
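For $n$ observations with actual values $y_i$ and predicted values $\hat{y}_i$,

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}.$$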
ROC curves are a graphical technique for measuring the performance of a binary classification model by plotting the true positive rate against the false positive rate at various classification thresholds. The AUC, or area under the curve, summarizes the ROC curve in a single value between 0 and 1; the closer the AUC is to 1, the better the performance of the model.
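As a minimal sketch with the R pROC package (`test_labels` and `pred_probs` are illustrative names for the observed classes and predicted probabilities):

```r
library(pROC)

roc_obj <- roc(response = test_labels, predictor = pred_probs)
plot(roc_obj)  # true positive rate vs. false positive rate across thresholds
auc(roc_obj)   # area under the curve, between 0 and 1
```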
Using a small subset of the data, the intention of this analysis is to take an initial look at various techniques for linear and logistic regression, assessing their predictive performance against each other to inform future analysis of the full dataset. Two questions will be addressed: can hours worked per week be predicted from the remaining features (a continuous response, via linear regression methods), and can we predict whether an individual's income exceeds 50k (a binary response, via classification methods)?
To begin, we take a subset of the original data and examine its summary and the number of missing values.
## # A tibble: 3 × 2
## native_country n
## <fct> <int>
## 1 Not US 146
## 2 United States 1450
## 3 <NA> 32
## age workclass fnlwgt education
## Min. :17.00 Private :1145 Min. : 19752 HS-grad :538
## 1st Qu.:28.00 Self-emp-not-inc: 123 1st Qu.:113621 Some-college:375
## Median :37.00 Local-gov : 96 Median :179723 Bachelors :276
## Mean :38.67 State-gov : 63 Mean :190137 Masters : 76
## 3rd Qu.:47.00 Self-emp-inc : 57 3rd Qu.:239087 Assoc-voc : 69
## Max. :90.00 (Other) : 51 Max. :972354 11th : 50
## NA's : 93 (Other) :244
## education_num marital_status occupation
## Min. : 1.00 Divorced :223 Prof-specialty :206
## 1st Qu.: 9.00 Married-AF-spouse : 2 Exec-managerial:205
## Median :10.00 Married-civ-spouse :764 Craft-repair :201
## Mean :10.03 Married-spouse-absent: 16 Sales :201
## 3rd Qu.:12.00 Never-married :521 Adm-clerical :183
## Max. :16.00 Separated : 51 (Other) :539
## Widowed : 51 NA's : 93
## relationship race sex
## Husband :664 Amer-Indian-Eskimo: 18 Female: 553
## Not-in-family :434 Asian-Pac-Islander: 64 Male :1075
## Other-relative: 42 Black : 153
## Own-child :245 Other : 13
## Unmarried :156 White :1380
## Wife : 87
##
## capital_gain capital_loss hours_per_week native_country
## Min. : 0.0 Min. : 0.00 Min. : 2.00 Not US : 146
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.:40.00 United States:1450
## Median : 0.0 Median : 0.00 Median :40.00 NA's : 32
## Mean : 959.2 Mean : 92.87 Mean :40.42
## 3rd Qu.: 0.0 3rd Qu.: 0.00 3rd Qu.:45.00
## Max. :99999.0 Max. :2559.00 Max. :99.00
##
## income
## <=50K:1258
## >50K : 370
##
##
##
##
##
## age workclass fnlwgt education education_num
## 0 93 0 0 0
## marital_status occupation relationship race sex
## 0 93 0 0 0
## capital_gain capital_loss hours_per_week native_country income
## 0 0 0 32 0
We rebinned native_country due to the number of sparse categories. There are enough missing values to suggest the use of imputation, so we proceed with multiple imputation using the R mice library.
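A minimal sketch of the imputation step, assuming the subset is stored in a data frame named `adult` (the name is illustrative); the method vector printed below confirms that mice used polyreg for the multilevel factors and logreg for the binary native_country:

```r
library(mice)

# Multiple imputation; mice picks methods per column automatically
# (polyreg for factors with >2 levels, logreg for binary factors).
imp <- mice(adult, m = 5, seed = 123, printFlag = FALSE)
adult <- complete(imp, 1)  # extract the first completed dataset
```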
## age workclass fnlwgt education education_num
## "" "polyreg" "" "" ""
## marital_status occupation relationship race sex
## "" "polyreg" "" "" ""
## capital_gain capital_loss hours_per_week native_country income
## "" "" "" "logreg" ""
## age workclass fnlwgt education
## Min. :17.00 Federal-gov : 55 Min. : 19752 HS-grad :538
## 1st Qu.:28.00 Local-gov : 104 1st Qu.:113621 Some-college:375
## Median :37.00 Private :1208 Median :179723 Bachelors :276
## Mean :38.67 Self-emp-inc : 57 Mean :190137 Masters : 76
## 3rd Qu.:47.00 Self-emp-not-inc: 132 3rd Qu.:239087 Assoc-voc : 69
## Max. :90.00 State-gov : 71 Max. :972354 11th : 50
## Without-pay : 1 (Other) :244
## education_num marital_status occupation
## Min. : 1.00 Divorced :223 Prof-specialty :218
## 1st Qu.: 9.00 Married-AF-spouse : 2 Sales :214
## Median :10.00 Married-civ-spouse :764 Craft-repair :212
## Mean :10.03 Married-spouse-absent: 16 Exec-managerial:212
## 3rd Qu.:12.00 Never-married :521 Adm-clerical :193
## Max. :16.00 Separated : 51 Other-service :176
## Widowed : 51 (Other) :403
## relationship race sex
## Husband :664 Amer-Indian-Eskimo: 18 Female: 553
## Not-in-family :434 Asian-Pac-Islander: 64 Male :1075
## Other-relative: 42 Black : 153
## Own-child :245 Other : 13
## Unmarried :156 White :1380
## Wife : 87
##
## capital_gain capital_loss hours_per_week native_country
## Min. : 0.0 Min. : 0.00 Min. : 2.00 Not US : 155
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.:40.00 United States:1473
## Median : 0.0 Median : 0.00 Median :40.00
## Mean : 959.2 Mean : 92.87 Mean :40.42
## 3rd Qu.: 0.0 3rd Qu.: 0.00 3rd Qu.:45.00
## Max. :99999.0 Max. :2559.00 Max. :99.00
##
## income
## <=50K:1258
## >50K : 370
##
##
##
##
##
## age workclass fnlwgt education education_num
## 0 0 0 0 0
## marital_status occupation relationship race sex
## 0 0 0 0 0
## capital_gain capital_loss hours_per_week native_country income
## 0 0 0 0 0
With the missing values imputed, we can continue by looking at the distribution of the continuous features.
We note severe issues with nonnormality, especially for capital_gain and capital_loss. Some skew is also observed in fnlwgt and age, but these could possibly be corrected through Box-Cox transformations.
After applying Box-Cox transformations, we note adequate changes for age and fnlwgt, but not for capital_gain or capital_loss. We will therefore have to handle the latter two through other means, such as feature rebinning.
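A sketch of the transformation step using caret's `BoxCoxTrans` (data frame name illustrative; the report's exact implementation isn't shown); the transformed features appear in the later models as `tage` and `tfnlwgt`:

```r
library(caret)

# Box-Cox requires strictly positive values, which age and fnlwgt satisfy.
bc_age     <- BoxCoxTrans(adult$age)
adult$tage <- predict(bc_age, adult$age)

bc_fnlwgt     <- BoxCoxTrans(adult$fnlwgt)
adult$tfnlwgt <- predict(bc_fnlwgt, adult$fnlwgt)
```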
The median of both capital_loss and capital_gain is zero according to the earlier summary, meaning over 50% of observations equal zero; the histograms suggest the vast majority do. Therefore, to rebin these variables, we create binary indicators of whether or not their values equal zero, as sketched below.
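A sketch of the rebinning (same illustrative data frame name); these indicators appear in the later models as `gain_zero` and `loss_zero`:

```r
# TRUE when the individual recorded no capital gain/loss.
adult$gain_zero <- adult$capital_gain == 0
adult$loss_zero <- adult$capital_loss == 0
```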
## Mode FALSE TRUE
## logical 104 1524
## Mode FALSE TRUE
## logical 81 1547
After rebinning, we can look at the correlations of the continuous variables.
There do not appear to be high correlations between the variables from the pairwise correlation plots, nor any particularly obvious patterns.
We will examine the relationship between the binary response variable of income with continuous predictors using stacked boxplots.
We see a fair number of outliers across all plots. Some patterns, however, do seem quite apparent: age looks significantly higher for those with incomes over 50k, as do hours per week and the highest level of education (represented through a numeric scale).
We can compare the distributions of the categorical variables through mosaic plots.
Once again, we see differences in the distributions of income across the categorical variables. Many more individuals with incomes over 50k appear to be married with civilian spouses. In the relationship/role-in-family mosaic plot, we see that many more husbands make over 50k a year compared to the proportion of husbands among individuals with incomes at or below 50k. The distribution of income across sexes also shows a visible difference between males and females. As one might expect, the proportion of executive/managerial positions and specialty professions is much higher for individuals with incomes above 50k than for those with incomes at or below 50k.
We also see evidence of sparse categories, so we print the counts across categories for each of the categorical features.
education | n |
---|---|
10th | 41 |
11th | 50 |
12th | 26 |
1st-4th | 7 |
5th-6th | 19 |
7th-8th | 40 |
9th | 22 |
Assoc-acdm | 41 |
Assoc-voc | 69 |
Bachelors | 276 |
Doctorate | 16 |
HS-grad | 538 |
Masters | 76 |
Preschool | 3 |
Prof-school | 29 |
Some-college | 375 |
marital_status | n |
---|---|
Divorced | 223 |
Married-AF-spouse | 2 |
Married-civ-spouse | 764 |
Married-spouse-absent | 16 |
Never-married | 521 |
Separated | 51 |
Widowed | 51 |
workclass | n |
---|---|
Federal-gov | 55 |
Local-gov | 104 |
Private | 1208 |
Self-emp-inc | 57 |
Self-emp-not-inc | 132 |
State-gov | 71 |
Without-pay | 1 |
occupation | n |
---|---|
Adm-clerical | 193 |
Craft-repair | 212 |
Exec-managerial | 212 |
Farming-fishing | 53 |
Handlers-cleaners | 71 |
Machine-op-inspct | 111 |
Other-service | 176 |
Priv-house-serv | 6 |
Prof-specialty | 218 |
Protective-serv | 31 |
Sales | 214 |
Tech-support | 41 |
Transport-moving | 90 |
relationship | n |
---|---|
Husband | 664 |
Not-in-family | 434 |
Other-relative | 42 |
Own-child | 245 |
Unmarried | 156 |
Wife | 87 |
We can combine Married-AF-spouse with Married-civ-spouse to create the single category "Married", as sketched below. It may be preferable to use education_num instead of education for modeling purposes to better capture the meaning in the data. There don't seem to be particularly intuitive ways of recategorizing the remaining categories.
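One way to collapse the levels, using the forcats package (same illustrative data frame name):

```r
library(forcats)

adult$marital_status <- fct_collapse(
  adult$marital_status,
  Married = c("Married-civ-spouse", "Married-AF-spouse")
)
```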
marital_status | n |
---|---|
Divorced | 223 |
Married | 782 |
Never-Married | 520 |
Separated | 51 |
Widowed | 51 |
Proceeding to the regression methods, we will create models based on the same training and test sets (a randomly sampled 80/20 partition of the data) to predict hours per week worked using various regression methods.
We will begin by constructing a full linear regression model and applying a stepwise feature selection algorithm to reduce it; a sketch of this process follows.
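A sketch of the partition and the two models, under the assumption that the prepared data lives in `adult` and that `step()` is used for the stepwise reduction (the report's exact selection settings aren't shown):

```r
set.seed(123)  # illustrative seed
train_idx <- sample(nrow(adult), size = floor(0.8 * nrow(adult)))
train <- adult[train_idx, ]
test  <- adult[-train_idx, ]

full_model <- lm(hours_per_week ~ ., data = train)
step_model <- step(full_model, direction = "both", trace = 0)
```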
The parameter estimates of the full linear regression model are presented below.
kable(summary(full_model)$coef, caption ="Full Main Effects Linear Model Parameter Estimates")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 48.2706487 | 6.2982952 | 7.6640816 | 0.0000000 |
tage | -1.8300735 | 0.5904604 | -3.0994011 | 0.0019820 |
workclass Local-gov | -0.0005067 | 2.1079810 | -0.0002404 | 0.9998082 |
workclass Private | 0.3958597 | 1.7494639 | 0.2262749 | 0.8210243 |
workclass Self-emp-inc | 4.8415708 | 2.3612246 | 2.0504491 | 0.0405267 |
workclass Self-emp-not-inc | -0.2752901 | 2.0369005 | -0.1351514 | 0.8925137 |
workclass State-gov | -2.5459915 | 2.2665091 | -1.1233096 | 0.2615195 |
tfnlwgt | -0.0065940 | 0.0044796 | -1.4720090 | 0.1412676 |
education_num | 0.0919761 | 0.1491651 | 0.6166062 | 0.5376056 |
marital_statusMarried | -2.1367528 | 2.5684694 | -0.8319168 | 0.4056132 |
marital_statusNever-Married | -3.4865084 | 1.1087650 | -3.1444971 | 0.0017025 |
marital_statusSeparated | -2.7910089 | 1.8403186 | -1.5165900 | 0.1296205 |
marital_statusWidowed | -10.0478542 | 1.9542879 | -5.1414401 | 0.0000003 |
occupation Craft-repair | 2.9638188 | 1.3287327 | 2.2305606 | 0.0258853 |
occupation Exec-managerial | 2.8429637 | 1.2844201 | 2.2134220 | 0.0270467 |
occupation Farming-fishing | 8.1284041 | 2.0280101 | 4.0080689 | 0.0000648 |
occupation Handlers-cleaners | 1.2945211 | 1.8286022 | 0.7079293 | 0.4791198 |
occupation Machine-op-inspct | 3.0853869 | 1.5294450 | 2.0173245 | 0.0438729 |
occupation Other-service | -2.6823728 | 1.3024929 | -2.0594145 | 0.0396592 |
occupation Priv-house-serv | 10.3382552 | 4.5729514 | 2.2607403 | 0.0239450 |
occupation Prof-specialty | 3.8207154 | 1.3280320 | 2.8769755 | 0.0040826 |
occupation Protective-serv | 4.9957483 | 2.4134299 | 2.0699786 | 0.0386572 |
occupation Sales | 2.1768548 | 1.2611750 | 1.7260530 | 0.0845824 |
occupation Tech-support | 1.7416705 | 2.0931857 | 0.8320669 | 0.4055285 |
occupation Transport-moving | 5.1363311 | 1.6753461 | 3.0658328 | 0.0022167 |
relationship Not-in-family | 1.1117187 | 2.6120389 | 0.4256134 | 0.6704621 |
relationship Other-relative | -3.6199992 | 2.7680689 | -1.3077706 | 0.1911891 |
relationship Own-child | -5.1005627 | 2.6713308 | -1.9093714 | 0.0564406 |
relationship Unmarried | 1.4167454 | 2.7394938 | 0.5171559 | 0.6051379 |
relationship Wife | -2.7965732 | 1.6063803 | -1.7409160 | 0.0819418 |
race Asian-Pac-Islander | 3.6970936 | 3.5350674 | 1.0458340 | 0.2958378 |
race Black | 3.6092154 | 3.1844363 | 1.1333922 | 0.2572647 |
race Other | 2.4853272 | 4.5976685 | 0.5405625 | 0.5889045 |
race White | 3.2749591 | 3.0502827 | 1.0736575 | 0.2831813 |
sex Male | 3.1944402 | 0.9242637 | 3.4562000 | 0.0005660 |
gain_zeroTRUE | 0.3239521 | 1.2777899 | 0.2535253 | 0.7999035 |
loss_zeroTRUE | -2.7403716 | 1.4128371 | -1.9396231 | 0.0526479 |
native_countryUnited States | -0.8770166 | 1.2315473 | -0.7121258 | 0.4765184 |
income >50K | 3.6759027 | 0.8894436 | 4.1328113 | 0.0000382 |
The parameter estimates of the stepwise selected linear regression model are presented below.
kable(summary(step_model)$coef, caption ="Stepwise Main Effects Linear Model Parameter Estimates")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 50.0562533 | 4.6548545 | 10.7535592 | 0.0000000 |
tage | -1.8354913 | 0.5855283 | -3.1347608 | 0.0017593 |
workclass Local-gov | -0.1677065 | 2.0967077 | -0.0799856 | 0.9362613 |
workclass Private | 0.2540200 | 1.7418193 | 0.1458360 | 0.8840740 |
workclass Self-emp-inc | 4.8849392 | 2.3537212 | 2.0754111 | 0.0381492 |
workclass Self-emp-not-inc | -0.3568191 | 2.0279190 | -0.1759533 | 0.8603586 |
workclass State-gov | -2.4902844 | 2.2595666 | -1.1021071 | 0.2706238 |
marital_statusMarried | -1.7793437 | 2.5389270 | -0.7008251 | 0.4835405 |
marital_statusNever-Married | -3.5163943 | 1.1054304 | -3.1810184 | 0.0015031 |
marital_statusSeparated | -2.9893517 | 1.8187185 | -1.6436582 | 0.1004941 |
marital_statusWidowed | -10.0469607 | 1.9445842 | -5.1666371 | 0.0000003 |
occupation Craft-repair | 2.8572933 | 1.3156271 | 2.1718109 | 0.0300543 |
occupation Exec-managerial | 2.7552530 | 1.2776829 | 2.1564452 | 0.0312360 |
occupation Farming-fishing | 7.9952491 | 2.0002354 | 3.9971541 | 0.0000678 |
occupation Handlers-cleaners | 1.1931112 | 1.8192187 | 0.6558371 | 0.5120476 |
occupation Machine-op-inspct | 2.9306008 | 1.4947883 | 1.9605457 | 0.0501502 |
occupation Other-service | -2.7154918 | 1.2848092 | -2.1135371 | 0.0347493 |
occupation Priv-house-serv | 10.4446298 | 4.5292449 | 2.3060422 | 0.0212686 |
occupation Prof-specialty | 3.9280586 | 1.2831127 | 3.0613513 | 0.0022496 |
occupation Protective-serv | 4.8535566 | 2.4018496 | 2.0207579 | 0.0435142 |
occupation Sales | 2.1531252 | 1.2569852 | 1.7129280 | 0.0869697 |
occupation Tech-support | 1.7202514 | 2.0845942 | 0.8252212 | 0.4094009 |
occupation Transport-moving | 4.8636760 | 1.6521426 | 2.9438596 | 0.0033004 |
relationship Not-in-family | 1.4309004 | 2.5853418 | 0.5534666 | 0.5800413 |
relationship Other-relative | -3.3918760 | 2.7385639 | -1.2385601 | 0.2157371 |
relationship Own-child | -4.7556025 | 2.6471371 | -1.7965078 | 0.0726512 |
relationship Unmarried | 1.7696374 | 2.7083341 | 0.6534044 | 0.5136138 |
relationship Wife | -2.8753827 | 1.5997744 | -1.7973677 | 0.0725146 |
sex Male | 3.1481220 | 0.9151009 | 3.4401913 | 0.0006001 |
loss_zeroTRUE | -2.8523010 | 1.4010071 | -2.0358933 | 0.0419681 |
income >50K | 3.7607541 | 0.8504124 | 4.4222709 | 0.0000106 |
All of the models will be assessed through their RMSE values. To begin, the RMSE values of the full and stepwise linear models are as follows.
Model | RMSE |
---|---|
Full Main Effect Model | 12.36235 |
Stepwise Main Effect Model | 12.36235 |
Moving on to regularized linear regression, we create LASSO, Ridge, and Elastic Net regression models with the same training and test data. For regularized regression it is important to standardize the data. The following plots are the coefficient path plots and RMSE plots for the LASSO model.
We use cross-validation on the training data to find the best lambda for the candidate models, then examine the resulting RMSE values.
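A sketch of the lambda selection for the LASSO model (Ridge and Elastic Net are analogous with different `alpha`), assuming model matrices `x_train`/`x_test` and response vectors `y_train`/`y_test`; note that glmnet standardizes predictors internally by default:

```r
library(glmnet)

cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)  # 10-fold CV by default
cv_lasso$lambda.min                                 # lambda minimizing CV error

preds <- predict(cv_lasso, newx = x_test, s = "lambda.min")
sqrt(mean((y_test - preds)^2))                      # test RMSE
```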
LASSO | Ridge | Elastic Net |
---|---|---|
13.28 | 13.29 | 13.28 |
We can present the model equations for each of the models as follows.
Lasso Model Equation:
## Model equation: y = 32.4875 + 0.8853*tage + 0*workclass + -0.0017*tfnlwgt + 0.4052*education_num + 0*marital_status + 0*occupation + 0*relationship + 0*race + 0*sex + 0*gain_zero + 0*loss_zero + 0*native_country + 0*income
Ridge Model Equation:
## Model equation: y = 33.6755 + 0.8334*tage + 0*workclass + -0.003*tfnlwgt + 0.3491*education_num + 0*marital_status + 0*occupation + 0*relationship + 0*race + 0*sex + 0*gain_zero + 0*loss_zero + 0*native_country + 0*income
Elastic net Model Equation:
## Model equation: y = 31.5954 + 1.0689*tage + 0*workclass + -0.0036*tfnlwgt + 0.4543*education_num + 0*marital_status + 0*occupation + 0*relationship + 0*race + 0*sex + 0*gain_zero + 0*loss_zero + 0*native_country + 0*income
Moving on to SVM, we create three candidate models, one linear and two nonlinear. Using the R library caret, we find the best value of the cost parameter C for each model and the corresponding RMSE; a sketch of the tuning step follows.
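A sketch of the tuning step with caret's kernlab-backed methods (data frame names as above; tuning grids illustrative):

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 5)

svm_linear <- train(hours_per_week ~ ., data = train, method = "svmLinear",
                    tuneGrid = data.frame(C = c(0.25, 0.5, 1, 2, 4)),
                    trControl = ctrl)
svm_radial <- train(hours_per_week ~ ., data = train, method = "svmRadial",
                    tuneLength = 5, trControl = ctrl)
svm_poly   <- train(hours_per_week ~ ., data = train, method = "svmPoly",
                    tuneLength = 3, trControl = ctrl)

svm_radial$bestTune  # best hyperparameters found
```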
## # A tibble: 3 × 2
## Model RMSE
## <chr> <dbl>
## 1 SVM Radial 11.3
## 2 SVM Linear w/ choice of cost 11.4
## 3 SVM Poly 11.7
Finally, we can compile all of the RMSE values of all the linear candidate regression models below, presenting them in order so that the best-performing models with the lowest RMSE values are at the top of the table.
Model | RMSE |
---|---|
SVM Radial | 11.28096 |
SVM Linear | 11.39403 |
SVM Poly | 11.71654 |
Full Linear Model | 12.36235 |
Stepwise Linear Model | 12.36235 |
Elastic Model | 13.27753 |
LASSO Model | 13.28365 |
Ridge Model | 13.28971 |
Mirroring the process above, we will now build classification models, using the same test set to predict whether or not an individual's income is over 50k a year.
Once again, we will begin with standard regression, this time creating a full and stepwise logistic regression model with income as our binary response.
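The fitting step mirrors the linear case, swapping `lm()` for `glm()` with a binomial family (sketch, same illustrative names):

```r
full_model <- glm(income ~ ., data = train, family = binomial)
step_model <- step(full_model, direction = "both", trace = 0)

# Predicted probabilities on the held-out test set.
probs_full <- predict(full_model, newdata = test, type = "response")
```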
kable(summary(full_model)$coef, caption ="Full Main Effects Logistic Model Parameter Estimates")
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
(Intercept) | 7.0565690 | 2.3188054 | 3.0431916 | 0.0023408 |
tage | -0.7335423 | 0.1862409 | -3.9386744 | 0.0000819 |
workclass Local-gov | 0.6054860 | 0.5584139 | 1.0842960 | 0.2782336 |
workclass Private | 0.4674377 | 0.4642454 | 1.0068762 | 0.3139943 |
workclass Self-emp-inc | 0.1593855 | 0.6081620 | 0.2620775 | 0.7932617 |
workclass Self-emp-not-inc | 0.6071620 | 0.5410758 | 1.1221385 | 0.2618036 |
workclass State-gov | 1.1847186 | 0.6765367 | 1.7511521 | 0.0799197 |
tfnlwgt | -0.0005336 | 0.0013430 | -0.3972796 | 0.6911613 |
education_num | -0.3218152 | 0.0473529 | -6.7960986 | 0.0000000 |
marital_statusMarried | 0.3015322 | 1.2030593 | 0.2506379 | 0.8020941 |
marital_statusNever-Married | 0.7677871 | 0.4406564 | 1.7423716 | 0.0814434 |
marital_statusSeparated | 0.2420295 | 0.7753984 | 0.3121357 | 0.7549374 |
marital_statusWidowed | -0.1833534 | 0.7040866 | -0.2604131 | 0.7945451 |
occupation Craft-repair | 0.3930907 | 0.4124476 | 0.9530682 | 0.3405555 |
occupation Exec-managerial | -0.3576187 | 0.3892420 | -0.9187567 | 0.3582228 |
occupation Farming-fishing | 0.9434672 | 0.6742601 | 1.3992631 | 0.1617341 |
occupation Handlers-cleaners | 2.1228946 | 0.8642171 | 2.4564367 | 0.0140323 |
occupation Machine-op-inspct | 0.6689145 | 0.5141826 | 1.3009280 | 0.1932831 |
occupation Other-service | 1.9143891 | 0.6945667 | 2.7562350 | 0.0058471 |
occupation Priv-house-serv | 14.8698157 | 1452.0027676 | 0.0102409 | 0.9918291 |
occupation Prof-specialty | -0.2049285 | 0.4069848 | -0.5035286 | 0.6145927 |
occupation Protective-serv | -0.5121722 | 0.6320365 | -0.8103522 | 0.4177378 |
occupation Sales | 0.0258114 | 0.4040798 | 0.0638769 | 0.9490682 |
occupation Tech-support | -0.3867937 | 0.5918623 | -0.6535197 | 0.5134213 |
occupation Transport-moving | 1.0930875 | 0.5288984 | 2.0667251 | 0.0387601 |
relationship Not-in-family | 1.5938273 | 1.2005385 | 1.3275937 | 0.1843124 |
relationship Other-relative | 15.5011076 | 636.7743424 | 0.0243432 | 0.9805789 |
relationship Own-child | 1.9599241 | 1.2142501 | 1.6141025 | 0.1065052 |
relationship Unmarried | 1.9041730 | 1.2750295 | 1.4934345 | 0.1353235 |
relationship Wife | -1.5257810 | 0.5078245 | -3.0045440 | 0.0026598 |
race Asian-Pac-Islander | -0.6158089 | 1.3796275 | -0.4463588 | 0.6553381 |
race Black | -0.2740138 | 1.3262309 | -0.2066109 | 0.8363137 |
race Other | -0.9842000 | 1.5588828 | -0.6313496 | 0.5278120 |
race White | -0.4884678 | 1.2770134 | -0.3825079 | 0.7020847 |
sex Male | -1.2400701 | 0.4060342 | -3.0541027 | 0.0022573 |
gain_zeroTRUE | 1.7186232 | 0.3344039 | 5.1393637 | 0.0000003 |
loss_zeroTRUE | 1.2573197 | 0.3689780 | 3.4075733 | 0.0006554 |
native_countryUnited States | 0.2752245 | 0.4346773 | 0.6331698 | 0.5266228 |
hours_per_week | -0.0368134 | 0.0083058 | -4.4322541 | 0.0000093 |
kable(summary(step_model)$coef, caption ="Stepwise Main Effects Logistic Model Parameter Estimates")
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
(Intercept) | 7.5589756 | 1.3841241 | 5.4611981 | 0.0000000 |
tage | -0.7846377 | 0.1760327 | -4.4573407 | 0.0000083 |
education_num | -0.3108584 | 0.0468060 | -6.6414149 | 0.0000000 |
occupation Craft-repair | 0.4172012 | 0.3889686 | 1.0725831 | 0.2834582 |
occupation Exec-managerial | -0.2633769 | 0.3668995 | -0.7178450 | 0.4728529 |
occupation Farming-fishing | 1.1289316 | 0.6367945 | 1.7728350 | 0.0762561 |
occupation Handlers-cleaners | 2.0991186 | 0.8589539 | 2.4438083 | 0.0145331 |
occupation Machine-op-inspct | 0.6927517 | 0.4969245 | 1.3940784 | 0.1632939 |
occupation Other-service | 1.9057293 | 0.6882745 | 2.7688508 | 0.0056254 |
occupation Priv-house-serv | 14.9326797 | 1405.9918785 | 0.0106207 | 0.9915260 |
occupation Prof-specialty | -0.1264728 | 0.3857527 | -0.3278598 | 0.7430177 |
occupation Protective-serv | -0.4186899 | 0.5878844 | -0.7121977 | 0.4763424 |
occupation Sales | 0.0081423 | 0.3755298 | 0.0216821 | 0.9827016 |
occupation Tech-support | -0.3692031 | 0.5850072 | -0.6311087 | 0.5279694 |
occupation Transport-moving | 1.1846821 | 0.5082099 | 2.3310881 | 0.0197487 |
relationship Not-in-family | 1.7251353 | 0.2841262 | 6.0717231 | 0.0000000 |
relationship Other-relative | 15.6584069 | 637.8008879 | 0.0245506 | 0.9804134 |
relationship Own-child | 2.2439725 | 0.5790038 | 3.8755748 | 0.0001064 |
relationship Unmarried | 1.7331027 | 0.5111762 | 3.3904210 | 0.0006979 |
relationship Wife | -1.4474025 | 0.4977983 | -2.9076081 | 0.0036420 |
sex Male | -1.1599890 | 0.3892845 | -2.9797976 | 0.0028844 |
gain_zeroTRUE | 1.7916801 | 0.3316006 | 5.4031261 | 0.0000001 |
loss_zeroTRUE | 1.2351147 | 0.3613874 | 3.4177029 | 0.0006315 |
hours_per_week | -0.0394216 | 0.0081420 | -4.8417690 | 0.0000013 |
The ROC curves and AUC values are presented as follows:
Model | AUC |
---|---|
Full Main Effects Logistic Model | 0.8837286 |
Stepwise Reduced Main Effects Logistic Model | 0.8795015 |
Again, we will use the same three regularized regression techniques. This time, we will calculate the optimal cut-off probabilities.
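One common criterion for the cut-off is Youden's J statistic on the ROC curve (the report's exact criterion isn't shown); a sketch with pROC, using illustrative names:

```r
library(pROC)

roc_lasso <- roc(response = test$income, predictor = lasso_probs)

# Cut-off maximizing sensitivity + specificity - 1 (Youden's J).
coords(roc_lasso, x = "best", best.method = "youden")
```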
The coefficients for each of the regularized classification models are shown.
lasso | ridge | elasticnet | |
---|---|---|---|
(Intercept) | -7.938 | -6.344 | -7.612 |
tage | 0.6055 | 0.548 | 0.6177 |
workclass Local-gov | 0 | -0.2113 | 0 |
workclass Private | 0 | -0.1988 | 0 |
workclass Self-emp-inc | 0.2844 | 0.201 | 0.298 |
workclass Self-emp-not-inc | -0.03721 | -0.2917 | -0.09122 |
workclass State-gov | -0.3318 | -0.6627 | -0.4542 |
tfnlwgt | 0 | 0.0003855 | 3.892e-05 |
education_num | 0.3054 | 0.2509 | 0.296 |
marital_statusMarried | 1.383 | 0.8022 | 1.153 |
marital_statusNever-Married | -0.5042 | -0.6123 | -0.6086 |
marital_statusSeparated | 0 | -0.2036 | 0 |
marital_statusWidowed | 0 | 0.008398 | 0 |
occupation Craft-repair | -0.121 | -0.1538 | -0.195 |
occupation Exec-managerial | 0.3833 | 0.5531 | 0.4133 |
occupation Farming-fishing | -0.5353 | -0.603 | -0.6322 |
occupation Handlers-cleaners | -1.252 | -1.253 | -1.426 |
occupation Machine-op-inspct | -0.3131 | -0.4489 | -0.4264 |
occupation Other-service | -1.221 | -1.058 | -1.316 |
occupation Priv-house-serv | 0 | -1.24 | -0.5635 |
occupation Prof-specialty | 0.1991 | 0.47 | 0.2532 |
occupation Protective-serv | 0.3783 | 0.5986 | 0.4252 |
occupation Sales | 0.03504 | 0.1836 | 0.05208 |
occupation Tech-support | 0.2785 | 0.4952 | 0.3657 |
occupation Transport-moving | -0.7521 | -0.7522 | -0.8502 |
relationship Not-in-family | 0 | -0.4991 | -0.1428 |
relationship Other-relative | -0.7993 | -1.369 | -1.25 |
relationship Own-child | -0.1399 | -0.688 | -0.4314 |
relationship Unmarried | -0.05846 | -0.6981 | -0.3488 |
relationship Wife | 1.208 | 1.042 | 1.271 |
race Asian-Pac-Islander | 0 | 0.2191 | 0.05758 |
race Black | -0.02451 | -0.1028 | -0.08712 |
race Other | 0.2375 | 0.5087 | 0.332 |
race White | 0 | 0.0983 | 0 |
sex Male | 0.9808 | 0.8203 | 1.031 |
gain_zeroTRUE | -1.617 | -1.507 | -1.627 |
loss_zeroTRUE | -1.03 | -1.048 | -1.084 |
native_countryUnited States | -0.04374 | -0.1408 | -0.1115 |
hours_per_week | 0.03093 | 0.02929 | 0.03172 |
Optimal cut-off probability determination:
Finally, we will get the confusion matrix statistics and the ROC and AUC values for each of the regularized regression candidate models.
lasso | ridge | elastic | |
---|---|---|---|
Sensitivity | 0.8893 | 0.8854 | 0.8893 |
Specificity | 0.6944 | 0.7083 | 0.6806 |
Pos Pred Value | 0.9109 | 0.9143 | 0.9073 |
Neg Pred Value | 0.641 | 0.6375 | 0.6364 |
Precision | 0.9109 | 0.9143 | 0.9073 |
Recall | 0.8893 | 0.8854 | 0.8893 |
F1 | 0.9 | 0.8996 | 0.8982 |
Prevalence | 0.7785 | 0.7785 | 0.7785 |
Detection Rate | 0.6923 | 0.6892 | 0.6923 |
Detection Prevalence | 0.76 | 0.7538 | 0.7631 |
Balanced Accuracy | 0.7919 | 0.7969 | 0.7849 |
Once again, we will use support vector machines, this time for classification. I found that this process took much less time for the classification models than for the regression models.
## # A tibble: 3 × 2
## Model Accuracy
## <chr> <dbl>
## 1 SVM Poly 0.827
## 2 SVM Linear w/ choice of cost 0.842
## 3 SVM Radial 0.843
## Confusion Matrix and Statistics
##
## true
## pred Over50k UnderorEqual50k
## Over50k 35 18
## UnderorEqual50k 37 235
##
## Accuracy : 0.8308
## 95% CI : (0.7855, 0.8699)
## No Information Rate : 0.7785
## P-Value [Acc > NIR] : 0.01195
##
## Kappa : 0.4582
##
## Mcnemar's Test P-Value : 0.01522
##
## Sensitivity : 0.4861
## Specificity : 0.9289
## Pos Pred Value : 0.6604
## Neg Pred Value : 0.8640
## Prevalence : 0.2215
## Detection Rate : 0.1077
## Detection Prevalence : 0.1631
## Balanced Accuracy : 0.7075
##
## 'Positive' Class : Over50k
##
## Confusion Matrix and Statistics
##
## true
## pred Over50k UnderorEqual50k
## Over50k 36 19
## UnderorEqual50k 36 234
##
## Accuracy : 0.8308
## 95% CI : (0.7855, 0.8699)
## No Information Rate : 0.7785
## P-Value [Acc > NIR] : 0.01195
##
## Kappa : 0.4641
##
## Mcnemar's Test P-Value : 0.03097
##
## Sensitivity : 0.5000
## Specificity : 0.9249
## Pos Pred Value : 0.6545
## Neg Pred Value : 0.8667
## Prevalence : 0.2215
## Detection Rate : 0.1108
## Detection Prevalence : 0.1692
## Balanced Accuracy : 0.7125
##
## 'Positive' Class : Over50k
##
## Confusion Matrix and Statistics
##
## true
## pred Over50k UnderorEqual50k
## Over50k 36 18
## UnderorEqual50k 36 235
##
## Accuracy : 0.8338
## 95% CI : (0.7888, 0.8726)
## No Information Rate : 0.7785
## P-Value [Acc > NIR] : 0.008171
##
## Kappa : 0.471
##
## Mcnemar's Test P-Value : 0.020700
##
## Sensitivity : 0.5000
## Specificity : 0.9289
## Pos Pred Value : 0.6667
## Neg Pred Value : 0.8672
## Prevalence : 0.2215
## Detection Rate : 0.1108
## Detection Prevalence : 0.1662
## Balanced Accuracy : 0.7144
##
## 'Positive' Class : Over50k
##
Model | AUC |
---|---|
SVM Linear w/ choice of cost | 0.8849363 |
SVM Radial | 0.8861989 |
SVM Poly | 0.8849363 |
Finally, to compare the model performance, we compile the AUC values of all the models together, listed in descending order (so the best performing models are listed first).
Model | AUC |
---|---|
SVM Radial | 0.8861989 |
Ridge Model | 0.8857049 |
SVM Linear | 0.8849363 |
SVM Poly | 0.8849363 |
Elastic Model | 0.8839482 |
Full Logit Model | 0.8837286 |
LASSO Model | 0.8823562 |
Stepwise Logit Model | 0.8795015 |
Overall, the models performed quite similarly: all of the RMSE values were around 11-12, and all of the AUC values were around 0.88 to 0.89. An RMSE of 11-12 means predictions of hours per week are off by about 11-12 hours on average. Considering that hours per week only ranged from 2 to 99, the error across the models is too high to make particularly meaningful predictions of this variable. On the other hand, the classification models performed quite well at predicting whether or not an individual's income was over 50k annually based on the predictive features.
In both regression and classification, SVM performed very well; in particular, the radial kernel SVM performed best in both analyses. However, SVM is very computationally intensive and took a very long time to run. When expanding this analysis to the full dataset, SVM may be favored for predictive potential if one has the time and computing resources; however, I found the runtime of the linear SVM regression model in particular far too costly to be worth the difference.
For the regression analysis, the second-best-performing method after SVM was traditional linear regression, with the full and stepwise-reduced linear models performing almost identically in terms of RMSE. As such, for simplicity, I would likely recommend it for analysis of a larger dataset. SVM is also effective, but very computationally intensive. The regularized regression models, on the other hand, performed similarly to one another and not particularly well in comparison.
For the classification models, on the other hand, regularized regression methods performed as well as or better than traditional logistic regression and ran about equally fast. The SVM classification models were significantly faster to train than the SVM regression models and as such may be worth the time; however, the performance gap between the SVM models and regularized models such as Ridge was small, while the Ridge model took much less time to run. It might therefore still be preferable to use regularized regression methods for the shorter runtime with similar results.
The most obvious limitation of this analysis is that only a very small subset of the original dataset was used. I found that several algorithms were extremely time-consuming with the full dataset, most significantly the imputation of missing values and the SVM regression methods, with linear SVM regression being particularly time-consuming.
The dataset involved a lot of categorical data with many categories, which may have affected the performance of these algorithms. There were also some sparse categories that were hard to effectively recategorize.
The models also didn’t take into account interactions between predictors.