Introduction

The dataset (1) is derived from the 1994 Census and includes information on the occupations and income of individuals in the United States. The following features are included in the dataset:

age: continuous.

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

fnlwgt: continuous.

education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

education-num: continuous.

marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

sex: Female, Male.

capital-gain: continuous.

capital-loss: continuous.

hours-per-week: continuous.

native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

income: <=50k, >50k

A subset of the data will be used for illustrative purposes in conjunction with regression techniques to create predictive models for both categorical and continuous response variables. The intention of this document is to compare techniques including regularized regression algorithms and support vector machines to linear and logistic regression in order to recommend methods for further use and analysis of the data.

Methodology

Regression and Classification Methods

Multiple linear regression is a statistical method that uses several explanatory variables to predict the outcome of a continuous response variable. It extends simple linear regression, which uses a single explanatory variable, and is typically fit by ordinary least squares. It fits a linear relationship between the explanatory variables and the response. The assumptions for multiple regression include the following: the response, conditional on the explanatory variables, is normally distributed with constant variance (equivalently, the errors are), the explanatory variables are nonrandom, the explanatory variables are not highly correlated with one another, the explanatory variables have a linear relationship with the response, and the data are randomly collected and independent.

Logistic regression is a statistical method used for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a binary response variable, so the method is appropriate for situations where linear regression cannot be used because the response is a dichotomous categorical variable. The assumptions for logistic regression include the following: the dependent variable should be binary, the independent variables should not be highly correlated with one another, the log odds (the logit of the probability) and the independent variables should have a linear relationship, the sample size has to be sufficiently large, and observations must be independent. Unlike multiple linear regression, logistic regression has a categorical response and therefore does not require a normally distributed response variable with constant variance.

Regularized regression techniques are an expansion on more traditional regression methods by introducing penalty terms to control the complexity of the model and account for issues with overfitting. They can be effective for situations with many predictors or when multicollinearity is present between features.

We will explore three of the most common forms of regularized regression for linear and logistic regression, including Ridge Regression, LASSO, and Elastic Net Regression techniques. Each introduces penalties to the regression model in a slightly different way and may be more or less advantageous in certain modeling situations.
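As a sketch of how the three penalties differ, the objectives for the linear case can be written as follows. The notation (coefficients \beta, penalty strength \lambda, mixing weight \alpha) is assumed here rather than taken from the analysis code, and software packages may scale the individual terms differently.

```latex
\begin{align*}
\text{Ridge:} \quad
  & \min_{\beta_0,\,\beta} \; \sum_{i=1}^{n} \bigl(y_i - \beta_0 - x_i^\top \beta\bigr)^2
    + \lambda \sum_{j=1}^{p} \beta_j^2 \\
\text{LASSO:} \quad
  & \min_{\beta_0,\,\beta} \; \sum_{i=1}^{n} \bigl(y_i - \beta_0 - x_i^\top \beta\bigr)^2
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \\
\text{Elastic Net:} \quad
  & \min_{\beta_0,\,\beta} \; \sum_{i=1}^{n} \bigl(y_i - \beta_0 - x_i^\top \beta\bigr)^2
    + \lambda \Bigl[ \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert
    + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \Bigr],
    \qquad 0 \le \alpha \le 1
\end{align*}
```

For the logistic versions, the residual sum of squares is replaced by the negative log-likelihood of the binary response. The absolute-value penalty can shrink coefficients exactly to zero, while the squared penalty only shrinks them toward zero.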

Support vector machines (SVM) are another technique that can be used in regression and classification contexts, and are a distribution-free method. For classification, SVM finds a hyperplane, or decision boundary, in the feature space that maximizes the margin between the classes while penalizing misclassification errors. SVM can handle both linear and nonlinear class boundaries, using the kernel trick for the nonlinear case. The most common kernel transformations are the radial kernel and the polynomial kernel. We will explore linear, radial, and polynomial kernels to create SVM models.

Model Performance Measures

All of the linear models will be assessed through their root mean square error, or RMSE. The classification models created will be assessed through ROC curves and their corresponding AUC values.

The RMSE is the square root of the mean squared residual, and measures the typical size of the difference between a model’s predicted values and the actual values. All linear regression models are fit on the same training set, and we compute the RMSE on the same test set to assess their predictive potential.
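The RMSE calculation described above can be sketched in a few lines of base R. The function name `rmse` and the toy values are illustrative assumptions, not objects from the analysis itself.

```r
# Root mean square error between observed and predicted values.
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Toy values (hypothetical, not taken from the dataset).
actual    <- c(40, 45, 38, 50)
predicted <- c(42, 44, 40, 47)
rmse(actual, predicted)
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.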

ROC Curves are a graphical technique used to measure the performance of a binary classification model by plotting the true positive rate against the false positive rate at various classification thresholds. The AUC, or area under the curve, is a number that quantifies the performance of the ROC in a single value by approximating the area under the constructed curve between 0 and 1. The closer the AUC is to 1, the better the performance of the model.
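One way to compute the AUC without drawing the curve is the rank-sum identity: the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below is a base R illustration with made-up scores; `auc_rank` is a hypothetical helper and not necessarily how the AUC values in this report were produced.

```r
# AUC via the rank-sum (Mann-Whitney) identity.
auc_rank <- function(score, label) {
  stopifnot(all(label %in% c(0, 1)))
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  r <- rank(score)  # tied scores receive average ranks
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy scores (hypothetical): three positive and three negative cases.
score <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
label <- c(1, 1, 1, 0, 0, 0)
auc_rank(score, label)
```

Here 8 of the 9 positive/negative pairs are ordered correctly, so the AUC is 8/9.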

Research Questions

Using a small subset of the data, the intention of this analysis is to take an initial look at various techniques for linear and logistic regression to inform future analysis of the full dataset by assessing their predictive performance against each other. The following questions will be addressed through linear and logistic regression:

  1. Which linear regression models most successfully predict the hours per week worked by individuals in the dataset?
  2. Which classification models most successfully predict whether or not an individual’s income is above 50k annually?

EDA

To begin, we take a subset of the original data and examine its summary and the number of missing values.

## # A tibble: 3 × 2
##   native_country     n
##   <fct>          <int>
## 1 Not US           146
## 2 United States   1450
## 3 <NA>              32
##       age                    workclass        fnlwgt               education  
##  Min.   :17.00    Private         :1145   Min.   : 19752    HS-grad     :538  
##  1st Qu.:28.00    Self-emp-not-inc: 123   1st Qu.:113621    Some-college:375  
##  Median :37.00    Local-gov       :  96   Median :179723    Bachelors   :276  
##  Mean   :38.67    State-gov       :  63   Mean   :190137    Masters     : 76  
##  3rd Qu.:47.00    Self-emp-inc    :  57   3rd Qu.:239087    Assoc-voc   : 69  
##  Max.   :90.00   (Other)          :  51   Max.   :972354    11th        : 50  
##                  NA's             :  93                    (Other)      :244  
##  education_num                  marital_status            occupation 
##  Min.   : 1.00    Divorced             :223     Prof-specialty :206  
##  1st Qu.: 9.00    Married-AF-spouse    :  2     Exec-managerial:205  
##  Median :10.00    Married-civ-spouse   :764     Craft-repair   :201  
##  Mean   :10.03    Married-spouse-absent: 16     Sales          :201  
##  3rd Qu.:12.00    Never-married        :521     Adm-clerical   :183  
##  Max.   :16.00    Separated            : 51    (Other)         :539  
##                   Widowed              : 51    NA's            : 93  
##           relationship                  race           sex      
##   Husband       :664    Amer-Indian-Eskimo:  18    Female: 553  
##   Not-in-family :434    Asian-Pac-Islander:  64    Male  :1075  
##   Other-relative: 42    Black             : 153                 
##   Own-child     :245    Other             :  13                 
##   Unmarried     :156    White             :1380                 
##   Wife          : 87                                            
##                                                                 
##   capital_gain      capital_loss     hours_per_week        native_country
##  Min.   :    0.0   Min.   :   0.00   Min.   : 2.00   Not US       : 146  
##  1st Qu.:    0.0   1st Qu.:   0.00   1st Qu.:40.00   United States:1450  
##  Median :    0.0   Median :   0.00   Median :40.00   NA's         :  32  
##  Mean   :  959.2   Mean   :  92.87   Mean   :40.42                       
##  3rd Qu.:    0.0   3rd Qu.:   0.00   3rd Qu.:45.00                       
##  Max.   :99999.0   Max.   :2559.00   Max.   :99.00                       
##                                                                          
##     income    
##   <=50K:1258  
##   >50K : 370  
##               
##               
##               
##               
## 
##            age      workclass         fnlwgt      education  education_num 
##              0             93              0              0              0 
## marital_status     occupation   relationship           race            sex 
##              0             93              0              0              0 
##   capital_gain   capital_loss hours_per_week native_country         income 
##              0              0              0             32              0

We rebinned native_country due to the number of sparse categories. There are enough missing values to suggest the use of imputation, so we proceed with multiple imputation using the R mice library.
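The rebinning and missing-value count described above could be sketched as follows in base R. The miniature data frame `adult` is hypothetical, and the sketch assumes missing values are coded as "?" in the raw file, as in the public version of this dataset. The imputation step itself is not reproduced here.

```r
# Hypothetical miniature of the data; "?" marks a missing value.
adult <- data.frame(
  native_country = c("United-States", "Mexico", "?", "United-States", "India"),
  workclass      = c("Private", "?", "Private", "State-gov", "Private"),
  stringsAsFactors = FALSE
)

# Recode the missing-value marker as NA.
adult[adult == "?"] <- NA

# Collapse the sparse country categories into two levels; NA stays NA.
adult$native_country <- factor(ifelse(
  adult$native_country == "United-States", "United States", "Not US"
))

# Count missing values per column and check the new categories.
colSums(is.na(adult))
table(adult$native_country, useNA = "ifany")
```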

##            age      workclass         fnlwgt      education  education_num 
##             ""      "polyreg"             ""             ""             "" 
## marital_status     occupation   relationship           race            sex 
##             ""      "polyreg"             ""             ""             "" 
##   capital_gain   capital_loss hours_per_week native_country         income 
##             ""             ""             ""       "logreg"             ""
##       age                    workclass        fnlwgt               education  
##  Min.   :17.00    Federal-gov     :  55   Min.   : 19752    HS-grad     :538  
##  1st Qu.:28.00    Local-gov       : 104   1st Qu.:113621    Some-college:375  
##  Median :37.00    Private         :1208   Median :179723    Bachelors   :276  
##  Mean   :38.67    Self-emp-inc    :  57   Mean   :190137    Masters     : 76  
##  3rd Qu.:47.00    Self-emp-not-inc: 132   3rd Qu.:239087    Assoc-voc   : 69  
##  Max.   :90.00    State-gov       :  71   Max.   :972354    11th        : 50  
##                   Without-pay     :   1                    (Other)      :244  
##  education_num                  marital_status            occupation 
##  Min.   : 1.00    Divorced             :223     Prof-specialty :218  
##  1st Qu.: 9.00    Married-AF-spouse    :  2     Sales          :214  
##  Median :10.00    Married-civ-spouse   :764     Craft-repair   :212  
##  Mean   :10.03    Married-spouse-absent: 16     Exec-managerial:212  
##  3rd Qu.:12.00    Never-married        :521     Adm-clerical   :193  
##  Max.   :16.00    Separated            : 51     Other-service  :176  
##                   Widowed              : 51    (Other)         :403  
##           relationship                  race           sex      
##   Husband       :664    Amer-Indian-Eskimo:  18    Female: 553  
##   Not-in-family :434    Asian-Pac-Islander:  64    Male  :1075  
##   Other-relative: 42    Black             : 153                 
##   Own-child     :245    Other             :  13                 
##   Unmarried     :156    White             :1380                 
##   Wife          : 87                                            
##                                                                 
##   capital_gain      capital_loss     hours_per_week        native_country
##  Min.   :    0.0   Min.   :   0.00   Min.   : 2.00   Not US       : 155  
##  1st Qu.:    0.0   1st Qu.:   0.00   1st Qu.:40.00   United States:1473  
##  Median :    0.0   Median :   0.00   Median :40.00                       
##  Mean   :  959.2   Mean   :  92.87   Mean   :40.42                       
##  3rd Qu.:    0.0   3rd Qu.:   0.00   3rd Qu.:45.00                       
##  Max.   :99999.0   Max.   :2559.00   Max.   :99.00                       
##                                                                          
##     income    
##   <=50K:1258  
##   >50K : 370  
##               
##               
##               
##               
## 
##            age      workclass         fnlwgt      education  education_num 
##              0              0              0              0              0 
## marital_status     occupation   relationship           race            sex 
##              0              0              0              0              0 
##   capital_gain   capital_loss hours_per_week native_country         income 
##              0              0              0              0              0

With the missing values imputed, we can continue by looking at the distribution of the continuous features.

We note severe issues with nonnormality, especially for capital_gain and capital_loss. Some skew is also observed in fnlwgt and age, but these could possibly be corrected through Box-Cox transformations.

After applying Box-Cox transformations, we note that the changes appear adequate for age and fnlwgt, but not for capital_gain or capital_loss. Therefore, we will have to handle them through other means, such as feature rebinning.
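For reference, a Box-Cox transformation can be sketched in base R as below. This is an illustration on simulated right-skewed data and not the code used for the transformations above. The helper names (`box_cox`, `bc_loglik`, `skew`) are hypothetical, and the exponent is chosen by a simple grid search over the profile log-likelihood.

```r
# Box-Cox transform; requires strictly positive x.
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Profile log-likelihood of lambda for an intercept-only model.
bc_loglik <- function(x, lambda) {
  y <- box_cox(x, lambda)
  n <- length(x)
  -n / 2 * log(mean((y - mean(y))^2)) + (lambda - 1) * sum(log(x))
}

# Sample skewness, used only to compare before and after.
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(1)
x <- rlnorm(500, meanlog = 3.5, sdlog = 0.6)  # simulated right-skewed data

lambdas <- seq(-2, 2, by = 0.05)
ll <- vapply(lambdas, function(l) bc_loglik(x, l), numeric(1))
best_lambda <- lambdas[which.max(ll)]
x_t <- box_cox(x, best_lambda)

c(lambda = best_lambda, skew_before = skew(x), skew_after = skew(x_t))
```

A variable that is mostly zeros with a few large values, such as capital_gain, cannot be made symmetric by any monotone transformation, which is consistent with the result noted above.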

The median of both capital_loss and capital_gain is zero according to the earlier summary, meaning over 50% of observations are equal to zero. From the histograms, it looks like the vast majority are likely equal to zero. Therefore, to rebin these variables, we will create binary categorical variables of whether or not their values are equal to zero.

##    Mode   FALSE    TRUE 
## logical     104    1524
##    Mode   FALSE    TRUE 
## logical      81    1547
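Indicators of this kind can be created with a single comparison. The sketch below uses a short made-up vector; only the name `gain_zero` is taken from the model output later in this report.

```r
# Hypothetical capital gain values, mostly zero.
capital_gain <- c(0, 0, 7688, 0, 14084, 0)

# TRUE when the value is exactly zero, FALSE otherwise.
gain_zero <- capital_gain == 0
summary(gain_zero)
```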

After rebinning, we can look at the correlations of the continuous variables.

Judging from the pairwise correlation plots, there do not appear to be high correlations between the variables, nor any particularly obvious patterns.

We will examine the relationship between the binary response variable of income with continuous predictors using stacked boxplots.

We see a fair number of outliers across all plots. Some patterns, however, do seem quite apparent: age looks noticeably higher for those with incomes over 50k, as do hours per week and the highest level of education (represented through a numeric scale).

We can compare the distributions of the categorical variables through mosaic plots.

Once again, we see differences in distributions for income across the categorical variables. Many more individuals with incomes over 50k seem to be married with civilian spouses. In the family relationship/role in family mosaic plot, we see that the proportion of husbands is much higher among individuals making over 50k a year than among individuals with incomes less than or equal to 50k. The distribution of income across sexes also shows a visible difference between males and females. As one might expect, the proportion of executive/managerial positions and specialty professions is much higher for individuals with incomes above 50k when compared to the distribution in those with incomes not above 50k.

We also see evidence of sparse categories, so we will print the counts across categories for each of the categorical features.

Education Counts
education       n
10th           41
11th           50
12th           26
1st-4th         7
5th-6th        19
7th-8th        40
9th            22
Assoc-acdm     41
Assoc-voc      69
Bachelors     276
Doctorate      16
HS-grad       538
Masters        76
Preschool       3
Prof-school    29
Some-college  375

Marital Status Counts
marital_status           n
Divorced               223
Married-AF-spouse        2
Married-civ-spouse     764
Married-spouse-absent   16
Never-married          521
Separated               51
Widowed                 51

Work Class Counts
workclass            n
Federal-gov         55
Local-gov          104
Private           1208
Self-emp-inc        57
Self-emp-not-inc   132
State-gov           71
Without-pay          1

Occupation Counts
occupation           n
Adm-clerical       193
Craft-repair       212
Exec-managerial    212
Farming-fishing     53
Handlers-cleaners   71
Machine-op-inspct  111
Other-service      176
Priv-house-serv      6
Prof-specialty     218
Protective-serv     31
Sales              214
Tech-support        41
Transport-moving    90

Relationship/Role in Family Counts
relationship      n
Husband         664
Not-in-family   434
Other-relative   42
Own-child       245
Unmarried       156
Wife             87

We can combine married to armed forces with married to civilian to create the category “Married”. It may be preferable to use education_num instead of education for modeling purposes, since the numeric scale preserves the ordering of education levels. There don’t seem to be particularly intuitive ways of recategorizing the remaining categories.

Marital Status Counts
marital_status    n
Divorced        223
Married         782
Never-Married   520
Separated        51
Widowed          51
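Merging factor levels of this kind can be done in base R by assigning the same name to several levels. The sketch below uses a small made-up vector and merges only the two married categories named in the text; it is an illustration and not the recoding script behind the table above.

```r
# Hypothetical marital status values.
marital_status <- factor(c(
  "Married-civ-spouse", "Divorced", "Married-AF-spouse",
  "Never-married", "Married-civ-spouse"
))

# Levels that should collapse into a single "Married" level.
married <- c("Married-civ-spouse", "Married-AF-spouse")
levels(marital_status)[levels(marital_status) %in% married] <- "Married"

table(marital_status)
```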

Linear Regression Modeling

Proceeding to regression methods, we will create models based on the same training and test sets (a randomly sampled 80/20 partition of the data) to predict hours per week worked using various regression methods.
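A random 80/20 partition can be sketched in base R as follows. The row count `n`, the seed, and the index names are placeholders, since the partitioning code used for this report is not shown.

```r
set.seed(123)  # fixed seed so every model sees the same partition

n <- 1000  # placeholder number of rows
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

c(train = length(train_idx), test = length(test_idx))
```

Reusing the same `train_idx` and `test_idx` for every candidate model is what makes the RMSE values comparable.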

Linear Regression

We will begin by constructing a full linear regression model, and then use a stepwise feature selection algorithm to reduce it.
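The general pattern of fitting a full model and reducing it with AIC-based stepwise selection looks like the sketch below. It uses the built-in `mtcars` data as a stand-in and the names `full_fit` and `step_fit` are hypothetical, so the output does not correspond to the tables in this report.

```r
# Full main-effects model on a built-in example dataset.
full_fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Stepwise selection by AIC; trace = 0 suppresses the step-by-step log.
step_fit <- step(full_fit, direction = "both", trace = 0)

# Coefficient table in the same form as summary(model)$coef.
summary(step_fit)$coef
```

Starting from the full model, `step()` only changes the model when doing so lowers the AIC, so the selected model never has a higher AIC than the full one.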

The parameter estimates of the full linear regression model are presented below.

kable(summary(full_model)$coef, caption ="Full Main Effects Linear Model Parameter Estimates")
Full Main Effects Linear Model Parameter Estimates
Estimate Std. Error t value Pr(>|t|)
(Intercept) 48.2706487 6.2982952 7.6640816 0.0000000
tage -1.8300735 0.5904604 -3.0994011 0.0019820
workclass Local-gov -0.0005067 2.1079810 -0.0002404 0.9998082
workclass Private 0.3958597 1.7494639 0.2262749 0.8210243
workclass Self-emp-inc 4.8415708 2.3612246 2.0504491 0.0405267
workclass Self-emp-not-inc -0.2752901 2.0369005 -0.1351514 0.8925137
workclass State-gov -2.5459915 2.2665091 -1.1233096 0.2615195
tfnlwgt -0.0065940 0.0044796 -1.4720090 0.1412676
education_num 0.0919761 0.1491651 0.6166062 0.5376056
marital_statusMarried -2.1367528 2.5684694 -0.8319168 0.4056132
marital_statusNever-Married -3.4865084 1.1087650 -3.1444971 0.0017025
marital_statusSeparated -2.7910089 1.8403186 -1.5165900 0.1296205
marital_statusWidowed -10.0478542 1.9542879 -5.1414401 0.0000003
occupation Craft-repair 2.9638188 1.3287327 2.2305606 0.0258853
occupation Exec-managerial 2.8429637 1.2844201 2.2134220 0.0270467
occupation Farming-fishing 8.1284041 2.0280101 4.0080689 0.0000648
occupation Handlers-cleaners 1.2945211 1.8286022 0.7079293 0.4791198
occupation Machine-op-inspct 3.0853869 1.5294450 2.0173245 0.0438729
occupation Other-service -2.6823728 1.3024929 -2.0594145 0.0396592
occupation Priv-house-serv 10.3382552 4.5729514 2.2607403 0.0239450
occupation Prof-specialty 3.8207154 1.3280320 2.8769755 0.0040826
occupation Protective-serv 4.9957483 2.4134299 2.0699786 0.0386572
occupation Sales 2.1768548 1.2611750 1.7260530 0.0845824
occupation Tech-support 1.7416705 2.0931857 0.8320669 0.4055285
occupation Transport-moving 5.1363311 1.6753461 3.0658328 0.0022167
relationship Not-in-family 1.1117187 2.6120389 0.4256134 0.6704621
relationship Other-relative -3.6199992 2.7680689 -1.3077706 0.1911891
relationship Own-child -5.1005627 2.6713308 -1.9093714 0.0564406
relationship Unmarried 1.4167454 2.7394938 0.5171559 0.6051379
relationship Wife -2.7965732 1.6063803 -1.7409160 0.0819418
race Asian-Pac-Islander 3.6970936 3.5350674 1.0458340 0.2958378
race Black 3.6092154 3.1844363 1.1333922 0.2572647
race Other 2.4853272 4.5976685 0.5405625 0.5889045
race White 3.2749591 3.0502827 1.0736575 0.2831813
sex Male 3.1944402 0.9242637 3.4562000 0.0005660
gain_zeroTRUE 0.3239521 1.2777899 0.2535253 0.7999035
loss_zeroTRUE -2.7403716 1.4128371 -1.9396231 0.0526479
native_countryUnited States -0.8770166 1.2315473 -0.7121258 0.4765184
income >50K 3.6759027 0.8894436 4.1328113 0.0000382

The parameter estimates of the stepwise selected linear regression model are presented below.

kable(summary(step_model)$coef, caption ="Stepwise Main Effects Linear Model Parameter Estimates")
Stepwise Main Effects Linear Model Parameter Estimates
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.0562533 4.6548545 10.7535592 0.0000000
tage -1.8354913 0.5855283 -3.1347608 0.0017593
workclass Local-gov -0.1677065 2.0967077 -0.0799856 0.9362613
workclass Private 0.2540200 1.7418193 0.1458360 0.8840740
workclass Self-emp-inc 4.8849392 2.3537212 2.0754111 0.0381492
workclass Self-emp-not-inc -0.3568191 2.0279190 -0.1759533 0.8603586
workclass State-gov -2.4902844 2.2595666 -1.1021071 0.2706238
marital_statusMarried -1.7793437 2.5389270 -0.7008251 0.4835405
marital_statusNever-Married -3.5163943 1.1054304 -3.1810184 0.0015031
marital_statusSeparated -2.9893517 1.8187185 -1.6436582 0.1004941
marital_statusWidowed -10.0469607 1.9445842 -5.1666371 0.0000003
occupation Craft-repair 2.8572933 1.3156271 2.1718109 0.0300543
occupation Exec-managerial 2.7552530 1.2776829 2.1564452 0.0312360
occupation Farming-fishing 7.9952491 2.0002354 3.9971541 0.0000678
occupation Handlers-cleaners 1.1931112 1.8192187 0.6558371 0.5120476
occupation Machine-op-inspct 2.9306008 1.4947883 1.9605457 0.0501502
occupation Other-service -2.7154918 1.2848092 -2.1135371 0.0347493
occupation Priv-house-serv 10.4446298 4.5292449 2.3060422 0.0212686
occupation Prof-specialty 3.9280586 1.2831127 3.0613513 0.0022496
occupation Protective-serv 4.8535566 2.4018496 2.0207579 0.0435142
occupation Sales 2.1531252 1.2569852 1.7129280 0.0869697
occupation Tech-support 1.7202514 2.0845942 0.8252212 0.4094009
occupation Transport-moving 4.8636760 1.6521426 2.9438596 0.0033004
relationship Not-in-family 1.4309004 2.5853418 0.5534666 0.5800413
relationship Other-relative -3.3918760 2.7385639 -1.2385601 0.2157371
relationship Own-child -4.7556025 2.6471371 -1.7965078 0.0726512
relationship Unmarried 1.7696374 2.7083341 0.6534044 0.5136138
relationship Wife -2.8753827 1.5997744 -1.7973677 0.0725146
sex Male 3.1481220 0.9151009 3.4401913 0.0006001
loss_zeroTRUE -2.8523010 1.4010071 -2.0358933 0.0419681
income >50K 3.7607541 0.8504124 4.4222709 0.0000106

All of the models will be assessed through their RMSE values. To begin, the RMSE of the full and stepwise linear models are as follows.

Linear Regression RMSE for Hours per Week candidate models
Model                       RMSE
Full Main Effect Model      12.36235
Stepwise Main Effect Model  12.36235

Regularized Regression Techniques

Moving on to regularized linear regression, we create LASSO, Ridge, and Elastic Net regression models with the same training and test data. For regularized linear regression, it’s important to standardize the predictors so that the penalty treats all coefficients on a common scale. The following plots are the coefficient path plots and RMSE plots for the LASSO model.

We use cross-validation on the training data to find the best lambda for the candidate models, then examine the resulting values for the RMSE.

LASSO.opt  Ridge.opt  Elasticnet.opt
13.28      13.29      13.28
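To illustrate how cross-validation selects lambda, the sketch below implements ridge regression with a closed-form solve on simulated data and picks the lambda with the lowest 5-fold cross-validated RMSE. It is a base R illustration under stated assumptions, not the routine that produced the values above. The helpers `ridge_fit` and `ridge_predict` are hypothetical.

```r
set.seed(42)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(2, -1, 0, 0, 0.5) + rnorm(n))  # simulated response

# Ridge fit on standardized predictors; the intercept is not penalized.
ridge_fit <- function(X, y, lambda) {
  mu <- colMeans(X); s <- apply(X, 2, sd)
  Xs <- scale(X, center = mu, scale = s)
  beta <- solve(crossprod(Xs) + lambda * diag(ncol(Xs)),
                crossprod(Xs, y - mean(y)))
  list(beta = drop(beta), mu = mu, s = s, intercept = mean(y))
}

# Predict using the training means and standard deviations.
ridge_predict <- function(fit, Xnew) {
  Xs <- scale(Xnew, center = fit$mu, scale = fit$s)
  drop(fit$intercept + Xs %*% fit$beta)
}

folds <- sample(rep(1:5, length.out = n))   # random fold labels
lambdas <- 10^seq(-3, 3, length.out = 25)   # candidate penalties

cv_rmse <- sapply(lambdas, function(lam) {
  mse <- sapply(1:5, function(k) {
    fit <- ridge_fit(X[folds != k, , drop = FALSE], y[folds != k], lam)
    pred <- ridge_predict(fit, X[folds == k, , drop = FALSE])
    mean((y[folds == k] - pred)^2)
  })
  sqrt(mean(mse))
})

best_lambda <- lambdas[which.min(cv_rmse)]
best_lambda
```

Standardizing inside each training fold, rather than once on the full data, keeps information from the held-out fold out of the fit.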

We can present the model equations for each of the models as follows.

Lasso Model Equation:

## Model equation: y = 32.4875 + 0.8853*tage + 0*workclass + -0.0017*tfnlwgt + 0.4052*education_num + 0*marital_status + 0*occupation + 0*relationship + 0*race + 0*sex + 0*gain_zero + 0*loss_zero + 0*native_country + 0*income

Ridge Model Equation:

## Model equation: y = 33.6755 + 0.8334*tage + 0*workclass + -0.003*tfnlwgt + 0.3491*education_num + 0*marital_status + 0*occupation + 0*relationship + 0*race + 0*sex + 0*gain_zero + 0*loss_zero + 0*native_country + 0*income

Elastic net Model Equation:

## Model equation: y = 31.5954 + 1.0689*tage + 0*workclass + -0.0036*tfnlwgt + 0.4543*education_num + 0*marital_status + 0*occupation + 0*relationship + 0*race + 0*sex + 0*gain_zero + 0*loss_zero + 0*native_country + 0*income

SVM Linear Regression

Moving on to SVM, we create three candidate models, one linear and two non-linear. Using the R library caret, we find the best value of the cost parameter C for each model and the corresponding RMSE.

## # A tibble: 3 × 2
##   Model                         RMSE
##   <chr>                        <dbl>
## 1 SVM Radial                    11.3
## 2 SVM Linear w/ choice of cost  11.4
## 3 SVM Poly                      11.7

All RMSE Values

Finally, we can compile all of the RMSE values of all the linear candidate regression models below, presenting them in order so that the best-performing models with the lowest RMSE values are at the top of the table.

RMSE Values of Candidate Linear Regression Models
models                 RMSEvectors
SVM Radial             11.28096
SVM Linear             11.39403
SVM Poly               11.71654
Full Linear Model      12.36235
Stepwise Linear Model  12.36235
Elastic Model          13.27753
LASSO Model            13.28365
Ridge Model            13.28971
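A ranked table like the one above can be assembled by collecting the RMSE values in a data frame and ordering by RMSE. The sketch below reuses the reported values and column names; the object name `rmse_tbl` is an assumption.

```r
# RMSE values as reported for each candidate model.
rmse_tbl <- data.frame(
  models = c("Full Linear Model", "Stepwise Linear Model", "LASSO Model",
             "Ridge Model", "Elastic Model", "SVM Radial", "SVM Linear",
             "SVM Poly"),
  RMSEvectors = c(12.36235, 12.36235, 13.28365, 13.28971, 13.27753,
                  11.28096, 11.39403, 11.71654)
)

# Lowest RMSE (best-performing model) first.
rmse_tbl <- rmse_tbl[order(rmse_tbl$RMSEvectors), ]
rmse_tbl
```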

Logistic Regression

We now repeat the above process with classification models, using the same test set to predict whether or not an individual’s income is over 50k a year.

Multiple Logistic Regression

Once again, we will begin with standard regression, this time creating a full and stepwise logistic regression model with income as our binary response.
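As a minimal illustration of the technique, a logistic regression can be fit with `glm()` and the binomial family. The sketch below uses simulated data with a single predictor, so the names and values are placeholders and not the income model itself.

```r
set.seed(7)
n <- 500
x <- rnorm(n)

# Simulated binary response whose log odds are linear in x.
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

# Logistic regression: binomial family with the default logit link.
fit <- glm(y ~ x, family = binomial)

# Predicted probabilities and the coefficient table.
prob <- predict(fit, type = "response")
summary(fit)$coef
```

The estimates are on the log-odds scale, so exponentiating a coefficient gives the multiplicative change in the odds for a one-unit increase in the predictor.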

kable(summary(full_model)$coef, caption ="Full Main Effects Logistic Model Parameter Estimates")
Full Main Effects Logistic Model Parameter Estimates
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.0565690 2.3188054 3.0431916 0.0023408
tage -0.7335423 0.1862409 -3.9386744 0.0000819
workclass Local-gov 0.6054860 0.5584139 1.0842960 0.2782336
workclass Private 0.4674377 0.4642454 1.0068762 0.3139943
workclass Self-emp-inc 0.1593855 0.6081620 0.2620775 0.7932617
workclass Self-emp-not-inc 0.6071620 0.5410758 1.1221385 0.2618036
workclass State-gov 1.1847186 0.6765367 1.7511521 0.0799197
tfnlwgt -0.0005336 0.0013430 -0.3972796 0.6911613
education_num -0.3218152 0.0473529 -6.7960986 0.0000000
marital_statusMarried 0.3015322 1.2030593 0.2506379 0.8020941
marital_statusNever-Married 0.7677871 0.4406564 1.7423716 0.0814434
marital_statusSeparated 0.2420295 0.7753984 0.3121357 0.7549374
marital_statusWidowed -0.1833534 0.7040866 -0.2604131 0.7945451
occupation Craft-repair 0.3930907 0.4124476 0.9530682 0.3405555
occupation Exec-managerial -0.3576187 0.3892420 -0.9187567 0.3582228
occupation Farming-fishing 0.9434672 0.6742601 1.3992631 0.1617341
occupation Handlers-cleaners 2.1228946 0.8642171 2.4564367 0.0140323
occupation Machine-op-inspct 0.6689145 0.5141826 1.3009280 0.1932831
occupation Other-service 1.9143891 0.6945667 2.7562350 0.0058471
occupation Priv-house-serv 14.8698157 1452.0027676 0.0102409 0.9918291
occupation Prof-specialty -0.2049285 0.4069848 -0.5035286 0.6145927
occupation Protective-serv -0.5121722 0.6320365 -0.8103522 0.4177378
occupation Sales 0.0258114 0.4040798 0.0638769 0.9490682
occupation Tech-support -0.3867937 0.5918623 -0.6535197 0.5134213
occupation Transport-moving 1.0930875 0.5288984 2.0667251 0.0387601
relationship Not-in-family 1.5938273 1.2005385 1.3275937 0.1843124
relationship Other-relative 15.5011076 636.7743424 0.0243432 0.9805789
relationship Own-child 1.9599241 1.2142501 1.6141025 0.1065052
relationship Unmarried 1.9041730 1.2750295 1.4934345 0.1353235
relationship Wife -1.5257810 0.5078245 -3.0045440 0.0026598
race Asian-Pac-Islander -0.6158089 1.3796275 -0.4463588 0.6553381
race Black -0.2740138 1.3262309 -0.2066109 0.8363137
race Other -0.9842000 1.5588828 -0.6313496 0.5278120
race White -0.4884678 1.2770134 -0.3825079 0.7020847
sex Male -1.2400701 0.4060342 -3.0541027 0.0022573
gain_zeroTRUE 1.7186232 0.3344039 5.1393637 0.0000003
loss_zeroTRUE 1.2573197 0.3689780 3.4075733 0.0006554
native_countryUnited States 0.2752245 0.4346773 0.6331698 0.5266228
hours_per_week -0.0368134 0.0083058 -4.4322541 0.0000093
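
The parameter estimates of the stepwise selected logistic regression model are presented below.

kable(summary(step_model)$coef, caption ="Stepwise Main Effects Logistic Model Parameter Estimates")
Stepwise Main Effects Logistic Model Parameter Estimates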
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.5589756 1.3841241 5.4611981 0.0000000
tage -0.7846377 0.1760327 -4.4573407 0.0000083
education_num -0.3108584 0.0468060 -6.6414149 0.0000000
occupation Craft-repair 0.4172012 0.3889686 1.0725831 0.2834582
occupation Exec-managerial -0.2633769 0.3668995 -0.7178450 0.4728529
occupation Farming-fishing 1.1289316 0.6367945 1.7728350 0.0762561
occupation Handlers-cleaners 2.0991186 0.8589539 2.4438083 0.0145331
occupation Machine-op-inspct 0.6927517 0.4969245 1.3940784 0.1632939
occupation Other-service 1.9057293 0.6882745 2.7688508 0.0056254
occupation Priv-house-serv 14.9326797 1405.9918785 0.0106207 0.9915260
occupation Prof-specialty -0.1264728 0.3857527 -0.3278598 0.7430177
occupation Protective-serv -0.4186899 0.5878844 -0.7121977 0.4763424
occupation Sales 0.0081423 0.3755298 0.0216821 0.9827016
occupation Tech-support -0.3692031 0.5850072 -0.6311087 0.5279694
occupation Transport-moving 1.1846821 0.5082099 2.3310881 0.0197487
relationship Not-in-family 1.7251353 0.2841262 6.0717231 0.0000000
relationship Other-relative 15.6584069 637.8008879 0.0245506 0.9804134
relationship Own-child 2.2439725 0.5790038 3.8755748 0.0001064
relationship Unmarried 1.7331027 0.5111762 3.3904210 0.0006979
relationship Wife -1.4474025 0.4977983 -2.9076081 0.0036420
sex Male -1.1599890 0.3892845 -2.9797976 0.0028844
gain_zeroTRUE 1.7916801 0.3316006 5.4031261 0.0000001
loss_zeroTRUE 1.2351147 0.3613874 3.4177029 0.0006315
hours_per_week -0.0394216 0.0081420 -4.8417690 0.0000013

The ROC curves and AUC values are presented as follows:

Model                                         AUC
Full Main Effects Logistic Model              0.8837286
Stepwise Reduced Main Effects Logistic Model  0.8795015

Regularized Regression

Again, we will use the same three regularized regression techniques. This time, we will calculate the optimal cut-off probabilities.
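The criterion used to choose the cut-off is not stated here. One common choice is Youden's J statistic (sensitivity + specificity - 1), and the sketch below shows that approach on simulated probabilities. Both the criterion and the helper `youden_cutoff` are assumptions made for illustration.

```r
# Cut-off that maximizes Youden's J over a grid of candidate thresholds.
youden_cutoff <- function(prob, truth, grid = seq(0.05, 0.95, by = 0.01)) {
  j <- vapply(grid, function(cut) {
    pred <- prob >= cut
    sens <- sum(pred & truth == 1) / sum(truth == 1)
    spec <- sum(!pred & truth == 0) / sum(truth == 0)
    sens + spec - 1
  }, numeric(1))
  grid[which.max(j)]
}

# Simulated labels and predicted probabilities (hypothetical).
set.seed(11)
truth <- rbinom(400, size = 1, prob = 0.25)
prob <- plogis(rnorm(400, mean = ifelse(truth == 1, 0.5, -1.5)))

cutoff <- youden_cutoff(prob, truth)
cutoff
```

With an imbalanced response such as income, a cut-off chosen this way often differs from 0.5.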

The coefficients for each of the regularized classification models are shown.

  lasso ridge elasticnet
(Intercept) -7.938 -6.344 -7.612
tage 0.6055 0.548 0.6177
workclass Local-gov 0 -0.2113 0
workclass Private 0 -0.1988 0
workclass Self-emp-inc 0.2844 0.201 0.298
workclass Self-emp-not-inc -0.03721 -0.2917 -0.09122
workclass State-gov -0.3318 -0.6627 -0.4542
tfnlwgt 0 0.0003855 3.892e-05
education_num 0.3054 0.2509 0.296
marital_statusMarried 1.383 0.8022 1.153
marital_statusNever-Married -0.5042 -0.6123 -0.6086
marital_statusSeparated 0 -0.2036 0
marital_statusWidowed 0 0.008398 0
occupation Craft-repair -0.121 -0.1538 -0.195
occupation Exec-managerial 0.3833 0.5531 0.4133
occupation Farming-fishing -0.5353 -0.603 -0.6322
occupation Handlers-cleaners -1.252 -1.253 -1.426
occupation Machine-op-inspct -0.3131 -0.4489 -0.4264
occupation Other-service -1.221 -1.058 -1.316
occupation Priv-house-serv 0 -1.24 -0.5635
occupation Prof-specialty 0.1991 0.47 0.2532
occupation Protective-serv 0.3783 0.5986 0.4252
occupation Sales 0.03504 0.1836 0.05208
occupation Tech-support 0.2785 0.4952 0.3657
occupation Transport-moving -0.7521 -0.7522 -0.8502
relationship Not-in-family 0 -0.4991 -0.1428
relationship Other-relative -0.7993 -1.369 -1.25
relationship Own-child -0.1399 -0.688 -0.4314
relationship Unmarried -0.05846 -0.6981 -0.3488
relationship Wife 1.208 1.042 1.271
race Asian-Pac-Islander 0 0.2191 0.05758
race Black -0.02451 -0.1028 -0.08712
race Other 0.2375 0.5087 0.332
race White 0 0.0983 0
sex Male 0.9808 0.8203 1.031
gain_zeroTRUE -1.617 -1.507 -1.627
loss_zeroTRUE -1.03 -1.048 -1.084
native_countryUnited States -0.04374 -0.1408 -0.1115
hours_per_week 0.03093 0.02929 0.03172

Optimal cut-off probability determination:

Finally, we compute classification statistics at the optimal cut-off, together with the ROC curves and AUC values, for each of the regularized regression candidate models.

                      lasso   ridge   elastic
Sensitivity           0.8893  0.8854  0.8893
Specificity           0.6944  0.7083  0.6806
Pos Pred Value        0.9109  0.9143  0.9073
Neg Pred Value        0.641   0.6375  0.6364
Precision             0.9109  0.9143  0.9073
Recall                0.8893  0.8854  0.8893
F1                    0.9     0.8996  0.8982
Prevalence            0.7785  0.7785  0.7785
Detection Rate        0.6923  0.6892  0.6923
Detection Prevalence  0.76    0.7538  0.7631
Balanced Accuracy     0.7919  0.7969  0.7849
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

SVM

Once again, we will use support vector machines, this time for classification. I found that fitting the SVM classification models took much less time than fitting the SVM regression models.

## # A tibble: 3 × 2
##   Model                        Accuracy
##   <chr>                           <dbl>
## 1 SVM Poly                        0.827
## 2 SVM Linear w/ choice of cost    0.842
## 3 SVM Radial                      0.843
## Confusion Matrix and Statistics
## 
##                  true
## pred              Over50k UnderorEqual50k
##   Over50k              35              18
##   UnderorEqual50k      37             235
##                                           
##                Accuracy : 0.8308          
##                  95% CI : (0.7855, 0.8699)
##     No Information Rate : 0.7785          
##     P-Value [Acc > NIR] : 0.01195         
##                                           
##                   Kappa : 0.4582          
##                                           
##  Mcnemar's Test P-Value : 0.01522         
##                                           
##             Sensitivity : 0.4861          
##             Specificity : 0.9289          
##          Pos Pred Value : 0.6604          
##          Neg Pred Value : 0.8640          
##              Prevalence : 0.2215          
##          Detection Rate : 0.1077          
##    Detection Prevalence : 0.1631          
##       Balanced Accuracy : 0.7075          
##                                           
##        'Positive' Class : Over50k         
## 
## Setting levels: control = Over50k, case = UnderorEqual50k
## Setting direction: controls > cases

## Confusion Matrix and Statistics
## 
##                  true
## pred              Over50k UnderorEqual50k
##   Over50k              36              19
##   UnderorEqual50k      36             234
##                                           
##                Accuracy : 0.8308          
##                  95% CI : (0.7855, 0.8699)
##     No Information Rate : 0.7785          
##     P-Value [Acc > NIR] : 0.01195         
##                                           
##                   Kappa : 0.4641          
##                                           
##  Mcnemar's Test P-Value : 0.03097         
##                                           
##             Sensitivity : 0.5000          
##             Specificity : 0.9249          
##          Pos Pred Value : 0.6545          
##          Neg Pred Value : 0.8667          
##              Prevalence : 0.2215          
##          Detection Rate : 0.1108          
##    Detection Prevalence : 0.1692          
##       Balanced Accuracy : 0.7125          
##                                           
##        'Positive' Class : Over50k         
## 
## Setting levels: control = Over50k, case = UnderorEqual50k
## Setting direction: controls > cases

## Confusion Matrix and Statistics
## 
##                  true
## pred              Over50k UnderorEqual50k
##   Over50k              36              18
##   UnderorEqual50k      36             235
##                                           
##                Accuracy : 0.8338          
##                  95% CI : (0.7888, 0.8726)
##     No Information Rate : 0.7785          
##     P-Value [Acc > NIR] : 0.008171        
##                                           
##                   Kappa : 0.471           
##                                           
##  Mcnemar's Test P-Value : 0.020700        
##                                           
##             Sensitivity : 0.5000          
##             Specificity : 0.9289          
##          Pos Pred Value : 0.6667          
##          Neg Pred Value : 0.8672          
##              Prevalence : 0.2215          
##          Detection Rate : 0.1108          
##    Detection Prevalence : 0.1662          
##       Balanced Accuracy : 0.7144          
##                                           
##        'Positive' Class : Over50k         
## 
## Setting levels: control = Over50k, case = UnderorEqual50k
## Setting direction: controls > cases
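
Every statistic in the confusion matrix output above follows from the four cell counts. As a check, the Python sketch below recomputes them from the first matrix (35, 18, 37, 235), with Over50k as the positive class. This is an illustration only, not the R code that produced the output, and the function name is hypothetical.

```python
def confusion_stats(tp, fp, fn, tn):
    """Compute summary statistics from the four cells of a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return {
        "Accuracy": (tp + tn) / total,
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "Pos Pred Value": tp / (tp + fp),
        "Neg Pred Value": tn / (tn + fn),
        "Prevalence": (tp + fn) / total,
        "Detection Rate": tp / total,
        "Detection Prevalence": (tp + fp) / total,
        "Balanced Accuracy": (sensitivity + specificity) / 2,
    }


# Counts from the first confusion matrix, positive class = Over50k:
# predicted Over50k & true Over50k = 35 (TP), predicted Over50k & true <=50k = 18 (FP),
# predicted <=50k & true Over50k = 37 (FN), predicted <=50k & true <=50k = 235 (TN).
stats = confusion_stats(tp=35, fp=18, fn=37, tn=235)

for name, value in stats.items():
    print(f"{name:>22} : {value:.4f}")
```

The low sensitivity (about 0.49) alongside high specificity (about 0.93) shows that these SVM models miss roughly half of the Over50k cases at the default decision threshold, even though overall accuracy is about 0.83.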

Model                         AUC
SVM Linear w/ choice of cost  0.8849363
SVM Radial                    0.8861989
SVM Poly                      0.8849363
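
For readers unfamiliar with AUC: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The Python sketch below computes it that way on made-up toy data. It is an illustration under that definition, not the routine used to produce the values above, and the function name and data are hypothetical.

```python
def auc_rank(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    scores : model scores or predicted probabilities for the positive class
    labels : true labels, 1 = positive, 0 = negative
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data, made up for illustration only.
toy_scores = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]

auc = auc_rank(toy_scores, toy_labels)
print(auc)
```

Because AUC depends only on how cases are ranked, it does not change with the cut-off probability, which makes it a convenient basis for comparing models fitted with different methods.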

All AUC Values

Finally, to compare the model performance, we compile the AUC values of all the models together, listed in descending order (so the best performing models are listed first).

AUC Values of Candidate Classification Models (Descending)
Model                 AUC
SVM Radial            0.8861989
Ridge Model           0.8857049
SVM Linear            0.8849363
SVM Poly              0.8849363
Elastic Model         0.8839482
Full Logit Model      0.8837286
LASSO Model           0.8823562
Stepwise Logit Model  0.8795015

Discussion

Results and Recommendations

Overall, the models performed quite similarly. All of the RMSE values were around 11-12, and all of the AUC values were between about 0.88 and 0.89. An RMSE of 11-12 means that predicted hours per week were typically off by about 11-12 hours. Considering that hours per week ranged only from 2 to 99, this error is too high for the regression models to make particularly meaningful predictions of this variable. On the other hand, the classification models performed quite well in predicting whether or not one’s income was over 50k annually.
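
To make the RMSE interpretation concrete, the Python sketch below computes it for a handful of made-up hours-per-week values. The numbers and the function name are hypothetical and are not taken from the dataset. They are chosen only so that the result lands near the 11-12 range discussed above.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: square root of the average squared difference."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy hours-per-week values, made up for illustration only.
actual = [40, 20, 60, 45]
predicted = [50, 32, 48, 35]

error = rmse(actual, predicted)
print(round(error, 2))
```

Each toy prediction here misses by 10 to 12 hours. That is the scale of error an RMSE of 11-12 implies, and it is large relative to a typical working week.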

In both regression and classification models, SVM performed very well. In particular, the non-linear SVM Radial technique performed the best in both the regression and classification analyses. However, SVM is very computationally intensive and took a very long time to run. When expanding this analysis to the full dataset, SVM may be favored for its predictive potential if one has the time and technology; however, I found the runtime of the SVM regression models in particular far too costly to be worth the difference.

For the regression analysis, the best-performing method other than SVM was traditional linear regression, with the full linear model and the stepwise-reduced linear model performing almost identically in terms of RMSE. As such, for simplicity, I would likely recommend traditional linear regression for analysis of a larger dataset. SVM is also effective, but very computationally intensive. The regularized regression models, on the other hand, all performed similarly to one another and not particularly well in comparison.

For the classification models, on the other hand, the regularized regression methods performed as well as or better than traditional logistic regression and ran about as fast. The SVM classification models were considerably faster to run than the SVM regression models and as such may be worth the time to run; however, the difference in performance between SVM and regularized regression models such as the Ridge model was small (an AUC of 0.8862 for SVM Radial versus 0.8857 for the Ridge model), while the Ridge model took much less time to run. Therefore, regularized regression may still be preferable for its shorter runtime and similar results.

Limitations

The most obvious limitation of this analysis is that only a very small subset of the original dataset was used. I found that several steps were extremely time-consuming with the full dataset, most notably the imputation of missing values and both SVM regression methods, of which the linear SVM regression was the more time-consuming.

The dataset involved a lot of categorical data with many categories, which may have affected the performance of these algorithms. There were also some sparse categories that were hard to effectively recategorize.

The models also didn’t take into account interactions between predictors.

References

  1. https://archive.ics.uci.edu/dataset/2/adult