1 Introduction

This is the third part of the Census Data project. In this project we analyze U.S. census data taken from the UCI (University of California at Irvine) Machine Learning Repository. The project is divided into four parts. Our final goal is to build a model which can predict whether the income of a random adult American citizen is less than or greater than $50,000 a year, based on features such as age, education, occupation, gender, race, etc. In the first part of the project – Cleaning and Preprocessing the Data – we introduced the U.S. census dataset, cleaned the data and transformed it into a form suitable for further analysis. In the second part of the project – Exploratory Data Analysis – we performed a detailed exploratory data analysis and investigated the relationship between the independent variables and the dependent variable “income”. This part of the project is devoted to building predictive models. We use logistic regression, random forests, support vector machines and neural networks. The last part of the project – Theoretical Background – gives theoretical details about logistic regression and about methods for assessing the goodness of fit of the predictive models.

Data analysis techniques used in this part of the project:

  1. Investigation of the predictor variables for collinearity using:

    • the R function “findLinearCombos()” from the package “caret” to identify linearly dependent column vectors in the design matrix

    • Goodman and Kruskal’s tau measure

    • the Generalized Variance Inflation Factor (GVIF)

  2. Building four predictive models using the following classification algorithms:

    • Logistic regression

    • Random forest

    • Support vector machines

    • Neural networks

  3. Cross-validation and grid search for finding the optimal model parameters (cost parameter and kernel width) for the SVM model

  4. Cross-validation and grid search for finding the optimal model parameters (number of hidden neurons, learning rate, weight decay parameter) for the NN model

  5. Finding the optimal number of trees based on error plots of the RF model

  6. Implementation of two techniques for balancing the imbalanced dataset in order to achieve higher sensitivity:

    • Over-sampling

    • Synthetic Minority Over-sampling Technique (SMOTE)

  7. Assessing the goodness of fit of the logistic regression model using:

    • Likelihood ratio test (LRT)

    • The Hosmer-Lemeshow test

  8. Assessing the significance of the explanatory variables

  9. Stepwise selection of covariates for the logistic regression model using:

    • the R function “drop1()”

    • the R function “step()”

    • both functions implement backward selection of predictors based on the LRT

  10. Analysis of the prediction accuracy, sensitivity, specificity and computational time of the fitted models (both on the training and on the test datasets)

We start by loading the necessary packages:

library(ggplot2)
library(scales)
library(plyr)
library(vcd)
library(ggthemes)
library(caret)
library(GoodmanKruskal)
library(ResourceSelection)
library(randomForest)
library(e1071)
library(nnet)
library(DMwR)

2 Reading the Preprocessed Datasets

In the first part of the project – Cleaning and Preprocessing the Data – we cleaned and preprocessed the census data and wrote the final data frames to the comma-separated-value files “adult_df.csv” and “test_df.csv”. Here we read the training and test datasets into the “db.adult” and “db.test” data frames, respectively:

setwd("D:/Data_Science_Projects/Census_DataSet")

db.adult <- read.csv("adult_df.csv")

db.test <- read.csv("test_df.csv")

3 Logistic Regression

We are interested in predicting the values of the variable “income”, so “income” is our response, or dependent, variable. It takes only two values - less than 50K a year and more than 50K a year - so the problem we consider is a classification problem. Let \(Y_i\) be the random variable “income of the i-th subject”, with \(Y_i=1\) if income>50K and \(Y_i=0\) if income<=50K. Then \(Y_i\) (\(i=1,2,\dots, n\), where \(n=30162\) is the number of observations in the data frame) follows a binomial distribution; more precisely, \(Y_i\) follows the Bernoulli distribution, which is the special case of the binomial distribution with a single trial. Since the response variable is binary, we start with a logistic regression model, which is a type of generalized linear model. For more information on logistic regression, see the “2 Logistic regression” section of the Theoretical Background part of the project.
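
Before fitting any model, it is useful to glance at the class balance of the response; a quick sketch using the data frame “db.adult” read above:

# Class balance of the binary response (sketch); the classes are imbalanced,
# which is why over-sampling and SMOTE are considered later in the project
table(db.adult$income)
round(prop.table(table(db.adult$income)), 2)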

3.1 Fitting a logistic regression model with the function glm

The R function that we use to build the logistic regression model is “glm” from the standard R installation (the “stats” package). As the R documentation states, the function fits generalized linear models. Since we want to fit a logistic regression model, in “glm” we set “family” to “binomial” and “link” (the link function) to “logit”. We start with a set of explanatory variables (see below) which we will call “full”. The set consists of all original variables, except “fnlwgt”, “hours_per_week”, “native_country”, “capital_gain” and “capital_loss”. In our analysis we do not take into account the variable “fnlwgt”. We substitute the variable “native_country” with “native_region”, “hours_per_week” with the binned variable “hours_w”, and “capital_gain” and “capital_loss” with “cap_gain” and “cap_loss”, respectively.

names(db.adult)
 [1] "age"            "workclass"      "fnlwgt"         "education"     
 [5] "education_num"  "marital_status" "occupation"     "relationship"  
 [9] "race"           "sex"            "capital_gain"   "capital_loss"  
[13] "hours_per_week" "native_country" "income"         "hours_w"       
[17] "native_region"  "cap_gain"       "cap_loss"      
covariates <- paste("age", "workclass", "education", "education_num",
                    "marital_status", "occupation", "relationship",
                    "race", "sex", "native_region", "hours_w",
                    "cap_gain", "cap_loss", sep = "+")

form <- as.formula(paste("income ~", covariates))

start_time <- proc.time()
glm.model <- glm(formula = form,
                 data = db.adult, 
                 family = binomial(link = "logit"),
                 x = TRUE)
# The option "x=TRUE" returns the design matrix
time.logistic <- proc.time() - start_time
time.logistic
   user  system elapsed 
   4.49    0.15    4.67 

After fitting the model, we need to determine:

  1. the significance of the obtained estimates \(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2,\dots,\hat{\beta}_k\) of the model parameters \(\beta_0,\beta_1,\dots,\beta_k\) (for more information see section “2.4 Maximum likelihood estimators” of the last part of the project – Theoretical Background )
  2. the goodness of fit of the logistic model

3.1.1 Collinearity of the predictor variables

We investigate if there are collinear predictor variables. We begin with a summary of the fitted logistic model:

summary(glm.model)$coefficients[, 1:2]
                                          Estimate   Std. Error
(Intercept)                            0.902289478 6.769089e-01
age                                    0.026438419 1.703372e-03
workclass Local-gov                   -0.639700747 1.122277e-01
workclass Private                     -0.452837780 9.309886e-02
workclass Self-emp-inc                -0.235136463 1.233519e-01
workclass Self-emp-not-inc            -0.872465363 1.099112e-01
workclass State-gov                   -0.753259323 1.250307e-01
workclass Without-pay                -13.180257804 1.993433e+02
education 11th                         0.103269080 2.137006e-01
education 12th                         0.444262173 2.736754e-01
education 1st-4th                     -0.459467478 4.798216e-01
education 5th-6th                     -0.419230130 3.521554e-01
education 7th-8th                     -0.504352776 2.420578e-01
education 9th                         -0.278287394 2.691488e-01
education Assoc-acdm                   1.286519936 1.802206e-01
education Assoc-voc                    1.262132954 1.731718e-01
education Bachelors                    1.902761826 1.612147e-01
education Doctorate                    2.967218072 2.231634e-01
education HS-grad                      0.764449675 1.567956e-01
education Masters                      2.260928202 1.721559e-01
education Preschool                  -12.504022385 9.784995e+01
education Prof-school                  2.900937791 2.067525e-01
education Some-college                 1.124568866 1.590551e-01
marital_status Married-AF-spouse       2.906368163 5.787119e-01
marital_status Married-civ-spouse      2.164167535 2.735700e-01
marital_status Married-spouse-absent   0.009494187 2.365025e-01
marital_status Never-married          -0.442439757 8.766115e-02
marital_status Separated              -0.073529884 1.640838e-01
marital_status Widowed                 0.213420666 1.567942e-01
occupation Armed-Forces               -1.381287295 1.593288e+00
occupation Craft-repair                0.037115382 8.042603e-02
occupation Exec-managerial             0.769219849 7.746537e-02
occupation Farming-fishing            -0.867170091 1.380995e-01
occupation Handlers-cleaners          -0.717075233 1.442101e-01
occupation Machine-op-inspct          -0.314569283 1.027973e-01
occupation Other-service              -0.790992657 1.181567e-01
occupation Priv-house-serv            -3.194481795 1.316745e+00
occupation Prof-specialty              0.502766131 8.200143e-02
occupation Protective-serv             0.593387739 1.255901e-01
occupation Sales                       0.258121406 8.301447e-02
occupation Tech-support                0.659672887 1.114897e-01
occupation Transport-moving           -0.070477875 9.947379e-02
relationship Not-in-family             0.582873644 2.704928e-01
relationship Other-relative           -0.283364787 2.452199e-01
relationship Own-child                -0.615249793 2.697838e-01
relationship Unmarried                 0.438948372 2.860101e-01
relationship Wife                      1.380705410 1.050376e-01
race Asian-Pac-Islander                0.716848339 2.749136e-01
race Black                             0.530568451 2.408688e-01
race Other                             0.186069663 3.703154e-01
race White                             0.631748991 2.303928e-01
sex Male                               0.839522945 7.951133e-02
native_region Central-Asia            -0.044740001 2.883293e-01
native_region East-Asia                0.073407108 2.615591e-01
native_region Europe-East              0.369271777 3.352623e-01
native_region Europe-West              0.574603030 1.932003e-01
native_region Outlying-US              0.320962994 2.251356e-01
native_region South-America           -0.981654260 4.692376e-01
native_region United-States            0.420610201 1.342609e-01
hours_w between_45_and_60              0.438435381 4.369345e-02
hours_w between_60_and_80              0.410845207 9.821402e-02
hours_w less_than_40                  -0.804580132 6.216572e-02
hours_w more_than_80                   0.273986107 1.936161e-01
cap_gain Low                          -6.656754399 5.092054e-01
cap_gain Medium                       -4.779587229 5.138468e-01
cap_loss Low                          -0.789675036 1.489671e-01
cap_loss Medium                        0.808078772 1.791307e-01

From the output of “summary(glm.model)” we notice the following warning message: “Coefficients: (1 not defined because of singularities)”. Also, in the coefficients table the coefficient for the variable “education_num” is NA, i.e. with a missing value. As we explain in detail in the section “2.5 Iterative methods for solving systems of non-linear equations and the issue of collinearity” of the Theoretical Background part of the project, this is an indication that the covariate “education_num” is collinear with some other predictors. Therefore, we have to exclude it from the list of predictor variables and fit the model again. We will use the R function “findLinearCombos” from the package “caret” to test whether the covariate “education_num” is collinear with some of the other predictors. As the R documentation states, the function determines linear combinations in a numeric matrix. Furthermore, it returns a list that enumerates these dependencies and a vector of column positions that can be removed to eliminate the linear dependencies:

findLinearCombos(glm.model$x)
$linearCombos
$linearCombos[[1]]
 [1] 24  1  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23


$remove
[1] 24
findLinearCombos(glm.model$x)$remove
[1] 24

As we can see from the code output above, R found linear dependencies between the covariates and recommends removing column 24 from the design matrix. Below we identify which predictor corresponds to column 24:

colnames(glm.model$x)[findLinearCombos(glm.model$x)$remove]
[1] "education_num"

Indeed, it turns out that the problematic predictor is “education_num”. As we can see from the data frame “db.adult”, the variable “education_num” carries redundant information, equivalent to that contained in “education”; the two variables are in one-to-one correspondence and hence linearly dependent, i.e. collinear. Below we list the unique combinations of values of the variables “education” and “education_num”, using the R function “unique” from the “base” package. As the documentation reads, for a data frame the function “unique” returns a “data frame with the same columns but with duplicate rows removed (and with row names from the first occurrences of the unique rows)”. The variable “education” has a total of 16 factor levels, and we see that each level of “education” corresponds to exactly one number from 1 to 16 in “education_num”:

unique.combinations <- unique(db.adult[ ,c("education", "education_num")])

unique.combinations[order(unique.combinations$education_num), ]
        education education_num
209     Preschool             1
387       1st-4th             2
53        5th-6th             3
15        7th-8th             4
7             9th             5
205          10th             6
4            11th             7
386          12th             8
3         HS-grad             9
11   Some-college            10
46      Assoc-voc            11
14     Assoc-acdm            12
1       Bachelors            13
6         Masters            14
49    Prof-school            15
20      Doctorate            16

Therefore we remove the covariate “education_num” and fit a new model - “glm.model.wld” (“wld” stands for “without linear dependencies”), thus resolving the problem with linearly dependent predictors:

covariates_new <- paste("age", "workclass", "education","marital_status",
                        "occupation", "relationship", "race", "sex",
                        "native_region", "hours_w", "cap_gain",
                        "cap_loss", sep = "+")

form_new <- as.formula(paste("income ~", covariates_new))

start_time <- proc.time()
glm.model.wld <- glm(formula = form_new,
                     data = db.adult, 
                     family = binomial(link = "logit"),
                     x = TRUE,
                     y = TRUE)
time.logistic <- proc.time() - start_time
time.logistic
   user  system elapsed 
   2.65    0.16    2.83 

As we can see from the test below, there are no linear dependencies between the covariates anymore:

findLinearCombos(glm.model.wld$x)
$linearCombos
list()

$remove
NULL

3.1.2 Other diagnostics for detecting collinearity

1. Goodman and Kruskal’s tau measure

Another way to check whether any of the covariates are correlated is to calculate Goodman and Kruskal’s tau for all pairs of covariates. Goodman and Kruskal’s tau measures the strength of association between categorical variables. It also works for discrete numerical variables by treating each value as a separate level of a factor variable. The standard association measure for numerical variables is the Pearson correlation coefficient, but it is not applicable to categorical variables.

In our case, Goodman and Kruskal’s tau gives us the strength of association between pairs of predictors. If there are 3 or more mutually dependent predictors, this coefficient will not help us identify them. To calculate Goodman and Kruskal’s tau for all pairs of predictor variables, we use the function “GKtauDataframe” from the R package “GoodmanKruskal”. The tau measure ranges from 0 to 1: values close to zero indicate weak association, whereas values close to 1 indicate strong association. The tau measure is not symmetric. This means that, if A and B are two categorical variables, then \[ \tau(A,B)\neq \tau(B,A). \] The most popular association measures for categorical variables are the chi-square statistic and Cramer’s V, which can be calculated with the “assocstats” function in the “vcd” package. However, the “assocstats” function is not adapted to data frames. On the other hand, the function “GKtauDataframe” is designed specifically for data frames and makes the calculation of Goodman and Kruskal’s tau for all pairs of predictors straightforward. Furthermore, “GKtauDataframe” returns an object of class “GKtauMatrix”, for which a “plot” method exists that conveniently visualizes Goodman and Kruskal’s tau for all pairs of predictors. As we can see below, the plot has the form of a matrix. The numbers on the diagonal are equal to the number of levels of each categorical variable, while the off-diagonal elements give the Goodman-Kruskal tau values. Each tau measure is also represented graphically by an ellipse, which is a circle for \(\tau=0\) and degenerates into a straight line for \(\tau=1\).
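
The asymmetry can also be checked directly for a single pair of variables with the function “GKtau” from the same package; a short sketch (we rely on the function reporting the association in both directions, as its documentation describes):

# Goodman-Kruskal tau for a single pair of predictors (sketch);
# the returned object contains the tau values in both directions
GKtau(db.adult$marital_status, db.adult$relationship)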

GKmatrix <- GKtauDataframe(db.adult[, c("age", "workclass", 
                                        "education", "education_num",
                                        "marital_status", 
                                        "occupation", "relationship",
                                        "race", "sex", "hours_w",
                                        "native_region", "cap_gain",
                                        "cap_loss")])

plot(GKmatrix)

From the resulting plot above we see that “education” is perfectly predictable from “education_num” and vice versa. This confirms our conclusion that the two variables are collinear. All other \(\tau\) values are close to zero, except for \(\tau\)(relationship, marital_status), \(\tau\)(relationship, sex), and \(\tau\)(marital_status, relationship). These numbers make sense, because “relationship” can be predicted by “marital_status” and the other way around. The only association which is not intuitive is that between “relationship” and “sex”. The tau value of \(0.42\) suggests that being female or male can determine the type of relationship that an individual is in. Looking below at the percentage of women and men belonging to each category of the factor variable “relationship”, we see that 36% of women are “Not-in-family” compared to 20% of men, and 25% of women are “Unmarried” in contrast to only 4% of men.

tab <- xtabs(~ sex + relationship, data = db.adult)

ptab <- prop.table(tab, 1)

print(format(round(ptab, 2), scientific = FALSE))
         relationship
sex        Husband  Not-in-family  Other-relative  Own-child  Unmarried
   Female "0.00"   "0.36"         "0.04"          "0.20"     "0.25"    
   Male   "0.61"   "0.20"         "0.02"          "0.12"     "0.04"    
         relationship
sex        Wife 
   Female "0.14"
   Male   "0.00"

Now we compute Cramer’s V for the above pairs of variables with the “assocstats” function from the “vcd” package. Cramer’s V is symmetric, i.e. V(relationship, marital_status)=V(marital_status, relationship), and hence we need to compute only one of these values. The values we obtain for Cramer’s V indicate a strength of association similar to that suggested by Goodman and Kruskal’s tau:

assocstats(tab)$cramer
[1] 0.6502624
tab1 <- xtabs(~ marital_status + relationship, data = db.adult)

assocstats(tab1)$cramer
[1] 0.4871943

2. Generalized Variance Inflation Factor

The Variance Inflation Factor (VIF) measures how much the variance of an estimated regression coefficient is inflated due to multicollinearity in the model. The basic idea is to regress each independent variable on all other independent variables. A VIF is calculated for each explanatory variable using the R-squared value (coefficient of determination) of the regression of that variable against all other explanatory variables: \[ \text{VIF}_i=\dfrac{1}{1 - R_i^2} \] From the above formula we can see that variance inflation factors take values from 1 upwards. A VIF of 1.7 means that the variance of an estimated regression coefficient is 70% larger than it would be if that predictor were uncorrelated with the other predictors in the model. A commonly adopted rule of thumb is that a VIF greater than or equal to 5 indicates high correlation between the predictor variables. Other VIF values considered critical are 2.5 and 10; in the literature there is a debate about the exact threshold value. In general, the larger the VIF, the less reliable the constructed regression model is.
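
As an illustration of the formula above, the VIF of a single predictor can be computed by hand from the R-squared of its regression on the remaining covariates; a minimal sketch for the numeric covariate “age” (the object name “r2.age” is ours, and this simply duplicates what the “vif” function computes automatically below):

# VIF of "age" computed from first principles (sketch):
# regress age on all other covariates and plug R^2 into 1 / (1 - R^2)
r2.age <- summary(lm(age ~ workclass + education + marital_status + occupation +
                       relationship + race + sex + native_region + hours_w +
                       cap_gain + cap_loss, data = db.adult))$r.squared
1 / (1 - r2.age)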

However, the standard VIF cannot be used for predictors with more than one degree of freedom, such as categorical variables with more than two levels. Fox and Monette (“Generalized Collinearity Diagnostics”, Journal of the American Statistical Association, Volume 87, 1992) generalized the notion of variance inflation to explanatory variables requiring more than 1 degree of freedom and proposed the generalized variance inflation factor (GVIF). To make GVIFs comparable across predictors with different numbers of degrees of freedom (df), one reports the corrected value \[ \text{GVIF}^{1/(2\cdot df)}, \] which for predictors with 1 df (where GVIF coincides with the ordinary VIF) reduces to \(\sqrt{\text{VIF}}\). The GVIF is invariant with respect to the coding of the categorical variables in the model, i.e. it does not depend on the choice of baseline category or coding scheme for the dummy regressors. The rule of thumb is to square the corrected value \(\text{GVIF}^{1/(2\cdot df)}\) and apply the usual VIF threshold of 5; equivalently, we can compare \(\text{GVIF}^{1/(2\cdot df)}\) directly with \(\sqrt{5}\approx 2.236\).

To calculate the GVIFs for our model, we use the “vif” function in the “car” package. Because there are several packages with different implementations of the function “vif”, we have to specify explicitly that we want to call the “vif” function from the “car” package:

car::vif(glm.model.wld)
                     GVIF Df GVIF^(1/(2*Df))
age              1.232896  1        1.110359
workclass        1.628834  6        1.041493
education        2.011061 15        1.023562
marital_status  51.847256  6        1.389613
occupation       2.891764 13        1.041686
relationship   115.512999  5        1.607915
race             2.297860  4        1.109597
sex              2.837879  1        1.684601
native_region    2.393816  7        1.064334
hours_w          1.262756  4        1.029591
cap_gain         1.054674  2        1.013397
cap_loss         1.016078  2        1.003995

When we attempt to calculate the GVIFs for the original model “glm.model”, we get an error message, because there are aliased coefficients in the model, i.e. linearly dependent predictors (perfect collinearity). Since the collinearity is resolved in the second fitted model “glm.model.wld”, the GVIFs are calculated and displayed. As we can see from the output above, all \(\text{GVIF}^{1/(2\cdot Df)}\) values are smaller than the threshold value of \(\sqrt{5}\approx 2.236\). Therefore, we can conclude that there are no indications of collinearity among the explanatory variables in the model “glm.model.wld”.
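
The same check can be expressed in one line of code, using the column of corrected values from the output above:

# Are all corrected GVIF values below the sqrt(5) threshold? (sketch)
gvif <- car::vif(glm.model.wld)
all(gvif[, "GVIF^(1/(2*Df))"] < sqrt(5))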

3.2 Assessing the goodness of fit of the model

3.2.1 Likelihood ratio test

Although the deviance is not useful as a measure of the model goodness of fit in our case because we have ungrouped data (for details see the section “3.2 Limitations of the deviance as a test statistic for ungrouped data” of the Theoretical Background part of the project), we briefly demonstrate how it can be obtained in R. Below we use the “summary()” function to display the deviance of the intercept-only model and the deviance of the model “glm.model.wld”:

summary(glm.model.wld)$null.deviance
[1] 33850.71
summary(glm.model.wld)$deviance
[1] 19642.97

In the output above the null deviance shows how well the response is predicted by a model with nothing but an intercept (grand mean) compared to the saturated model. The deviance (residual deviance) shows how well the observed response is predicted by a model with a given set of predictors (in our case this is the model “glm.model.wld”) compared to the saturated model.

As we discuss in the “3.3 Likelihood ratio test statistic” section of the Theoretical Background part of the project, the likelihood ratio test remains valid for sparse (ungrouped) data, so we apply it to compare the goodness of fit of the intercept-only model and the fitted model with explanatory variables, “glm.model.wld”. At the \(0.05\) significance level we test the null hypothesis that the intercept-only model fits the observed data adequately, i.e. that it explains about the same amount of variation in the response variable as the model “glm.model.wld”:

k <- length(glm.model.wld$coefficients)
D_M <- glm.model.wld$deviance
D_0 <- glm.model.wld$null.deviance

1 - pchisq(q = D_0 - D_M, df = k - 1)
[1] 0

The p-value is approximately \(0\) and hence smaller than \(0.05\), so we reject the null hypothesis. In other words, at the 5% significance level the model with predictors, “glm.model.wld”, fits the observed data significantly better than the grand mean (intercept-only) model.
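
The same likelihood ratio test can be carried out by fitting the intercept-only model explicitly and comparing the two nested models with “anova()”; a short sketch (the object name “glm.null” is ours):

# Likelihood ratio test of the intercept-only model against glm.model.wld (sketch)
glm.null <- glm(income ~ 1, data = db.adult, family = binomial(link = "logit"))
anova(glm.null, glm.model.wld, test = "LRT")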

3.2.2 The Hosmer-Lemeshow test

The Hosmer-Lemeshow test is a goodness of fit test for logistic regression models with ungrouped (individual) binary data, for which, as in our problem, grouping the data according to covariate patterns is not always easy or even possible. The idea of the Hosmer-Lemeshow test is to divide the data into subgroups based on the predicted probabilities \(\hat{\pi}_i\) instead of on the values of the explanatory variables. The test consists of the following steps (a bare-bones sketch of the resulting statistic is given after the list):

  1. First we sort the data set of size \(n\) (we have \(n\) observations) in descending order according to the fitted probabilities \(\hat{\pi}_i\)

  2. Then we divide the data into \(g\) equal-sized groups: the first group contains the first \(\dfrac{n}{g}\) observations with the highest estimated probabilities; the second group contains the \(\dfrac{n}{g}\) observations having the next highest estimated probabilities; etc.

  3. Finally, we construct a Pearson-like test statistic for the observed and expected cell frequencies from a \(2\times g\) table

For more details – go to the “3.5 The Hosmer-Lemeshow test” section of the Theoretical Background part of the project.
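
To make the steps above concrete, here is a minimal, self-contained sketch of the Hosmer-Lemeshow statistic (the function name “hl_statistic” and its argument names are ours; in practice we use the ready-made “hoslem.test()” below):

# Hosmer-Lemeshow statistic from first principles (sketch):
# group the observations by quantiles of the fitted probabilities and
# form a Pearson-like statistic over the resulting 2 x g table
hl_statistic <- function(obs, fit, g = 10) {
  breaks   <- quantile(fit, probs = seq(0, 1, length.out = g + 1))
  groups   <- cut(fit, breaks = breaks, include.lowest = TRUE)
  observed <- tapply(obs, groups, sum)      # observed "successes" per group
  expected <- tapply(fit, groups, sum)      # expected "successes" per group
  n.group  <- tapply(obs, groups, length)   # group sizes
  sum((observed - expected)^2 / (expected * (1 - expected / n.group)))
}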

In order to assess the goodness of fit of the model with the Hosmer-Lemeshow test we use the R function “hoslem.test()” from the package “ResourceSelection”. We have to pass to the function the vector of observed responses in binary format, the vector of expected values (the predicted probabilities), and the number of groups \(g\). Therefore we first extract the fitted probabilities, which can be done either with the function “predict()” or with “glm.model.wld$fitted.values”:

head(glm.model.wld$fitted.values)
         1          2          3          4          5          6 
0.08856988 0.45729601 0.03017249 0.09494667 0.55583302 0.83322025 
predicted.probs <- predict(glm.model.wld, type = "response")

# predicted.probs <- glm.model.wld$fitted.values

head(predicted.probs) # returns probabilities
         1          2          3          4          5          6 
0.08856988 0.45729601 0.03017249 0.09494667 0.55583302 0.83322025 

Next, we need to transform the vector of predicted probabilities \(\hat{\boldsymbol{\pi}}=\left( \hat{\pi}_1, \hat{\pi}_2, \dots, \hat{\pi}_n \right)\) (\(n=30162\)) into a binary vector of predicted responses (predicted income). We denote this binary vector by \(\hat{\mathbf{y}}=\left( \hat{y}_1, \hat{y}_2, \dots, \hat{y}_n \right)\) and construct it in the following way: \[ \hat{y}_i = 1, \quad \text{if} \quad \hat{\pi}_i>0.5, \] and \[ \hat{y}_i = 0, \quad \text{if} \quad \hat{\pi}_i \leq 0.5, \] where a value of 1 corresponds to a yearly income of more than 50K, and a value of 0 to an income of at most 50K.

We take the vector of observed responses (which is a factor variable with two levels - “ >50K” and “ <=50K”) and create a binary vector:

observed.values <- ifelse(db.adult$income  == " >50K", 1, 0)

Next we generate the vector of predicted probabilities – “predicted.probs”, and then we use it to create the binary vector “predicted.response”:

predicted.probs <- predict(glm.model.wld, type = "response")

predicted.response <- ifelse(predicted.probs > 0.5, 1, 0)


head(predicted.response, 20)
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
 0  0  0  0  1  1  0  0  1  1  1  0  0  0  0  0  0  0  0  1 
head(observed.values, 20)
 [1] 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1 1

First we evaluate the accuracy of the fitted logistic model on the training dataset, i.e. we calculate the percentage of correctly predicted response values:

mean(observed.values == predicted.response)
[1] 0.8489159

As we can see from the result above, there is an 84.89% match between the observed and predicted values of the dependent variable. Of course, in order to assess the true prediction accuracy of the fitted model, we need to test it on new, unseen data. We will do this in the next sections of the project.
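
The overall accuracy can be broken down into a confusion matrix, from which the sensitivity and specificity (analyzed for all models later in the project) can be read off; a short sketch:

# Confusion matrix on the training data (sketch); rows = predicted, columns = observed
conf.mat <- table(predicted = predicted.response, observed = observed.values)
conf.mat
conf.mat["1", "1"] / sum(conf.mat[, "1"])   # sensitivity (true positive rate)
conf.mat["0", "0"] / sum(conf.mat[, "0"])   # specificity (true negative rate)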

Finally, we proceed with the Hosmer-Lemeshow test. We run the test with different numbers of groups, taking \(g=10, 20, 50, 100, 200, 300\) and \(400\). As we can see from the results below, for a small number of groups we obtain very small p-values, which suggests a poor fit of the model, whereas for larger values of \(g\) we obtain large (non-significant) p-values, indicating no evidence of lack of fit. As we already mentioned, the Hosmer-Lemeshow test has some serious drawbacks, such as the demonstrated dependence of the results on the choice of groups. Therefore we have to interpret the outcome of the test with caution. Our judgement of the adequacy of the model should be based above all on the final goals we want to achieve, such as whether we want the constructed model to match the observed data as closely as possible or to predict new observations with high accuracy. We will see further on that, despite the lack of fit according to the Hosmer-Lemeshow test (for a relatively small number of groups), the fitted model “glm.model.wld” predicts new observations with high accuracy.

hoslem.test(observed.values, predicted.response, g=10)

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  observed.values, predicted.response
X-squared = 401.75, df = 8, p-value < 2.2e-16
hoslem.test(observed.values, predicted.response, g=20)

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  observed.values, predicted.response
X-squared = 401.75, df = 18, p-value < 2.2e-16
hoslem.test(observed.values, predicted.response, g=50)

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  observed.values, predicted.response
X-squared = 401.75, df = 48, p-value < 2.2e-16
hoslem.test(observed.values, predicted.response, g=100)

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  observed.values, predicted.response
X-squared = 401.75, df = 98, p-value < 2.2e-16
hoslem.test(observed.values, predicted.response, g=200)

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  observed.values, predicted.response
X-squared = 401.75, df = 198, p-value = 5.551e-16
hoslem.test(observed.values, predicted.response, g=300)

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  observed.values, predicted.response
X-squared = 401.75, df = 298, p-value = 5.575e-05
hoslem.test(observed.values, predicted.response, g=400)

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  observed.values, predicted.response
X-squared = 401.75, df = 398, p-value = 0.438

Here it is important to note that goodness-of-fit statistics do not test how well the fitted model predicts the dependent variable, but rather whether there exists a more complicated model which fits the observed data even better. Such a more complex model can be obtained by adding non-linearities or interaction terms, or by changing the link function. However, a basic rule in practice is to keep the model as simple as possible, provided that it serves our purposes well enough. In that sense we have to weigh the advantages and disadvantages of fitting more complicated models and decide what is best in our case. We have to investigate whether the more complicated model fits the observed data significantly better. If there is no considerable improvement in the fit or prediction accuracy, a more intricate model does not justify the additional computational complexity.

Determining what kind of non-linearities to include in the model is not always a trivial task, especially for data which do not have an intuitive interpretation, such as radio signals or gene expression data sets. In such cases neural networks are very useful, because they naturally capture nonlinear relationships. We demonstrate the application of neural networks to our problem later in the project.

3.3 Significance of the explanatory variables in the model

After fitting the model, we can assess the significance of the estimated model coefficients and hence the significance of the predictors in the fitted model.

3.3.1 Overall significance of the categorical covariates in the model

In order to test the overall significance of each covariate in the model we perform likelihood ratio tests with the R function “anova()”. When we run anova(glm.model.wld, test = "LRT"), the function adds the predictors sequentially, one at a time, in the order in which they appear in the formula. At each step it tests, with a likelihood ratio test, whether adding the next covariate significantly reduces the residual deviance compared to the model containing only the previously added terms, as we discuss in the “3.3 Likelihood ratio test statistic” section of the Theoretical Background part of the project. The p-values of the tests are calculated using the chi-squared distribution:

anova(glm.model.wld, test = "LRT")
Analysis of Deviance Table

Model: binomial, link: logit

Response: income

Terms added sequentially (first to last)

               Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                           30161      33851              
age             1   1738.5     30160      32112 < 2.2e-16 ***
workclass       6    426.1     30154      31686 < 2.2e-16 ***
education      15   3570.2     30139      28116 < 2.2e-16 ***
marital_status  6   5091.4     30133      23024 < 2.2e-16 ***
occupation     13    765.5     30120      22259 < 2.2e-16 ***
relationship    5    199.5     30115      22059 < 2.2e-16 ***
race            4     21.3     30111      22038 0.0002802 ***
sex             1    165.7     30110      21872 < 2.2e-16 ***
native_region   7     39.5     30103      21833 1.571e-06 ***
hours_w         4    418.7     30099      21414 < 2.2e-16 ***
cap_gain        2   1473.1     30097      19941 < 2.2e-16 ***
cap_loss        2    298.0     30095      19643 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see from the results above, all explanatory variables are significant and according to the likelihood ratio tests we should keep all of the considered predictors in the model.

3.3.2 Significance of the estimated model parameters

Now we will test the significance of each level of the categorical predictors in the fitted model.

We have a total of 67 model parameters:

length(glm.model.wld$coefficients)
[1] 67

That is, we have 12 predictors, most of which are categorical, and hence dummy variables are created to account for the factor levels of each categorical covariate. Therefore the number of model parameters is greater than the number of predictors. For each categorical (factor) variable with \(l\) levels, \(l-1\) dummy variables are created and one level is chosen as the so-called “base” or “reference” level. By default, R orders the levels of the categorical variables alphabetically and takes the first factor level as the reference level. The choice of base level does not affect the fit of the model or its predictions, only the interpretation of the individual coefficients, which are expressed as differences from the chosen baseline. In the “2.2 Interpretation of the coefficients” section of the Theoretical Background part of the project we discuss in detail how categorical variables are accounted for in the logistic model and how the model coefficients are interpreted.
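
How R expands a factor into dummy variables can be inspected directly; a small sketch for the 7-level factor “workclass”, whose alphabetically first level serves as the reference:

# Treatment (dummy) coding used by R: a factor with l levels -> l-1 indicator columns
contrasts(db.adult$workclass)
# the same coding as it appears in the design matrix (first 3 rows)
head(model.matrix(~ workclass, data = db.adult), 3)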

summary(glm.model.wld)

Call:
glm(formula = form_new, family = binomial(link = "logit"), data = db.adult, 
    x = TRUE, y = TRUE)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.6865  -0.5220  -0.1929  -0.0004   3.7400  

Coefficients:
                                       Estimate Std. Error z value
(Intercept)                            0.902289   0.676909   1.333
age                                    0.026438   0.001703  15.521
workclass Local-gov                   -0.639701   0.112228  -5.700
workclass Private                     -0.452838   0.093099  -4.864
workclass Self-emp-inc                -0.235136   0.123352  -1.906
workclass Self-emp-not-inc            -0.872465   0.109911  -7.938
workclass State-gov                   -0.753259   0.125031  -6.025
workclass Without-pay                -13.180258 199.343318  -0.066
education 11th                         0.103269   0.213701   0.483
education 12th                         0.444262   0.273675   1.623
education 1st-4th                     -0.459467   0.479822  -0.958
education 5th-6th                     -0.419230   0.352155  -1.190
education 7th-8th                     -0.504353   0.242058  -2.084
education 9th                         -0.278287   0.269149  -1.034
education Assoc-acdm                   1.286520   0.180221   7.139
education Assoc-voc                    1.262133   0.173172   7.288
education Bachelors                    1.902762   0.161215  11.803
education Doctorate                    2.967218   0.223163  13.296
education HS-grad                      0.764450   0.156796   4.875
education Masters                      2.260928   0.172156  13.133
education Preschool                  -12.504022  97.849946  -0.128
education Prof-school                  2.900938   0.206752  14.031
education Some-college                 1.124569   0.159055   7.070
marital_status Married-AF-spouse       2.906368   0.578712   5.022
marital_status Married-civ-spouse      2.164168   0.273570   7.911
marital_status Married-spouse-absent   0.009494   0.236502   0.040
marital_status Never-married          -0.442440   0.087661  -5.047
marital_status Separated              -0.073530   0.164084  -0.448
marital_status Widowed                 0.213421   0.156794   1.361
occupation Armed-Forces               -1.381287   1.593288  -0.867
occupation Craft-repair                0.037115   0.080426   0.461
occupation Exec-managerial             0.769220   0.077465   9.930
occupation Farming-fishing            -0.867170   0.138099  -6.279
occupation Handlers-cleaners          -0.717075   0.144210  -4.972
occupation Machine-op-inspct          -0.314569   0.102797  -3.060
occupation Other-service              -0.790993   0.118157  -6.694
occupation Priv-house-serv            -3.194482   1.316745  -2.426
occupation Prof-specialty              0.502766   0.082001   6.131
occupation Protective-serv             0.593388   0.125590   4.725
occupation Sales                       0.258121   0.083014   3.109
occupation Tech-support                0.659673   0.111490   5.917
occupation Transport-moving           -0.070478   0.099474  -0.709
relationship Not-in-family             0.582874   0.270493   2.155
relationship Other-relative           -0.283365   0.245220  -1.156
relationship Own-child                -0.615250   0.269784  -2.281
relationship Unmarried                 0.438948   0.286010   1.535
relationship Wife                      1.380705   0.105038  13.145
race Asian-Pac-Islander                0.716848   0.274914   2.608
race Black                             0.530568   0.240869   2.203
race Other                             0.186070   0.370315   0.502
race White                             0.631749   0.230393   2.742
sex Male                               0.839523   0.079511  10.559
native_region Central-Asia            -0.044740   0.288329  -0.155
native_region East-Asia                0.073407   0.261559   0.281
native_region Europe-East              0.369272   0.335262   1.101
native_region Europe-West              0.574603   0.193200   2.974
native_region Outlying-US              0.320963   0.225136   1.426
native_region South-America           -0.981654   0.469238  -2.092
native_region United-States            0.420610   0.134261   3.133
hours_w between_45_and_60              0.438435   0.043693  10.034
hours_w between_60_and_80              0.410845   0.098214   4.183
hours_w less_than_40                  -0.804580   0.062166 -12.943
hours_w more_than_80                   0.273986   0.193616   1.415
cap_gain Low                          -6.656754   0.509205 -13.073
cap_gain Medium                       -4.779587   0.513847  -9.302
cap_loss Low                          -0.789675   0.148967  -5.301
cap_loss Medium                        0.808079   0.179131   4.511
                                     Pr(>|z|)    
(Intercept)                           0.18255    
age                                   < 2e-16 ***
workclass Local-gov                  1.20e-08 ***
workclass Private                    1.15e-06 ***
workclass Self-emp-inc                0.05662 .  
workclass Self-emp-not-inc           2.06e-15 ***
workclass State-gov                  1.70e-09 ***
workclass Without-pay                 0.94728    
education 11th                        0.62892    
education 12th                        0.10452    
education 1st-4th                     0.33827    
education 5th-6th                     0.23386    
education 7th-8th                     0.03720 *  
education 9th                         0.30116    
education Assoc-acdm                 9.43e-13 ***
education Assoc-voc                  3.14e-13 ***
education Bachelors                   < 2e-16 ***
education Doctorate                   < 2e-16 ***
education HS-grad                    1.09e-06 ***
education Masters                     < 2e-16 ***
education Preschool                   0.89832    
education Prof-school                 < 2e-16 ***
education Some-college               1.55e-12 ***
marital_status Married-AF-spouse     5.11e-07 ***
marital_status Married-civ-spouse    2.56e-15 ***
marital_status Married-spouse-absent  0.96798    
marital_status Never-married         4.48e-07 ***
marital_status Separated              0.65406    
marital_status Widowed                0.17347    
occupation Armed-Forces               0.38597    
occupation Craft-repair               0.64445    
occupation Exec-managerial            < 2e-16 ***
occupation Farming-fishing           3.40e-10 ***
occupation Handlers-cleaners         6.61e-07 ***
occupation Machine-op-inspct          0.00221 ** 
occupation Other-service             2.17e-11 ***
occupation Priv-house-serv            0.01526 *  
occupation Prof-specialty            8.72e-10 ***
occupation Protective-serv           2.30e-06 ***
occupation Sales                      0.00187 ** 
occupation Tech-support              3.28e-09 ***
occupation Transport-moving           0.47863    
relationship Not-in-family            0.03117 *  
relationship Other-relative           0.24786    
relationship Own-child                0.02258 *  
relationship Unmarried                0.12485    
relationship Wife                     < 2e-16 ***
race Asian-Pac-Islander               0.00912 ** 
race Black                            0.02761 *  
race Other                            0.61534    
race White                            0.00611 ** 
sex Male                              < 2e-16 ***
native_region Central-Asia            0.87669    
native_region East-Asia               0.77898    
native_region Europe-East             0.27070    
native_region Europe-West             0.00294 ** 
native_region Outlying-US             0.15397    
native_region South-America           0.03644 *  
native_region United-States           0.00173 ** 
hours_w between_45_and_60             < 2e-16 ***
hours_w between_60_and_80            2.87e-05 ***
hours_w less_than_40                  < 2e-16 ***
hours_w more_than_80                  0.15704    
cap_gain Low                          < 2e-16 ***
cap_gain Medium                       < 2e-16 ***
cap_loss Low                         1.15e-07 ***
cap_loss Medium                      6.45e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 33851  on 30161  degrees of freedom
Residual deviance: 19643  on 30095  degrees of freedom
AIC: 19777

Number of Fisher Scoring iterations: 13

From the summary above we can see, based on the corresponding p-values, which covariates and which levels of the categorical covariates are significant in the log odds model. These p-values are obtained using the Wald test statistic (for details go to the section “4 Significance of the estimated model parameters” of the Theoretical Background part of the project) to test the following hypotheses about the model coefficients:

\(H_0\): \(\beta_j=0\)

vs.

\(H_1\): \(\beta_j\neq 0\)

for all \(j=0,1,2,\dots, k\).

In other words, the Wald test is used to evaluate the statistical significance of each coefficient in the fitted logistic model, i.e. the test checks the hypothesis that each individual coefficient is zero. If the coefficient of a category is not statistically significant, this does not imply that the whole categorical predictor is unimportant and should be removed from the model. The overall effect of the factor variable is tested by performing a likelihood ratio test as we showed earlier.
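
For reference, the z values and p-values reported by “summary()” can be reproduced by hand from the estimates and their standard errors; a short sketch:

# Wald z statistics and two-sided p-values computed by hand (sketch)
coefs <- summary(glm.model.wld)$coefficients
z <- coefs[, "Estimate"] / coefs[, "Std. Error"]
p <- 2 * pnorm(-abs(z))   # matches the Pr(>|z|) column of summary()
head(cbind(z, p), 5)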

In the regression output above, the reported coefficients for each category of a factor variable measure the differences from the base level. As an illustrative example, let us consider the predictor “workclass”, which has \(7\) levels:

levels(db.adult$workclass)
[1] " Federal-gov"      " Local-gov"        " Private"         
[4] " Self-emp-inc"     " Self-emp-not-inc" " State-gov"       
[7] " Without-pay"     

As we can see from the output of “summary”, “workclass Local-gov” differs from the base level “workclass Federal-gov” by \(-0.64\), and that difference is significant at the \(5\)% level. This tells us that \[ \small{ \textbf{odds ratio}=} \] \[ \small{ =\dfrac{\textrm{odds of income being more than 50K given that "workclass"="workclass Local-gov"}}{\textrm{odds of income being more than 50K given that "workclass"="workclass Federal-gov"}}= } \] \[ \small{ =\exp(-0.64)=0.53 } \] or, equivalently, \[ \small{ \dfrac{\textrm{odds of income being more than 50K given that "workclass"="workclass Federal-gov"}}{\textrm{odds of income being more than 50K given that "workclass"="workclass Local-gov"}}=} \] \[ \small{ =\dfrac{1}{\exp(-0.64)}=1.90,} \] which means that it is almost two times more likely for a person to have an income of more than 50K a year if they belong to “workclass Federal-gov” (if they work for the federal government) compared to “workclass Local-gov” (if they work in local government), if we keep all other variables constant.

The category “workclass Private” differs from the reference level by \(-0.45\) (\(\exp(-0.45)=0.64\)) and the difference is also significant at the \(5\)% level. In terms of odds ratios, this is equivalent to \[ \small{ \dfrac{\textrm{odds of income being more than 50K given that "workclass"="workclass Federal-gov"}}{\textrm{odds of income being more than 50K given that "workclass"="workclass Private"}}=} \] \[ \small{ =\dfrac{1}{\exp(-0.45)}=1.57, } \] which means that people who work in the federal government are about \(1.6\) times more likely to earn more than 50K a year compared to people who are employed in the private sector.

Similar conclusions can be made for the levels “workclass Self-emp-inc”, “workclass Self-emp-not-inc” and “workclass State-gov”, which differ from the base level by \(-0.24\), \(-0.87\) and \(-0.75\), respectively. From these results, we can conclude that:

  1. if a subject is employed in the federal government, the odds of earning more than 50K a year are about \(1.3\) times greater than if the person is self-employed with an incorporated business ("workclass Self-emp-inc"), controlling for all other independent variables (note, however, that this particular difference is not significant at the \(5\)% level, with a p-value of about \(0.06\));

  2. people who work in the federal government are about \(2.4\) times more likely to earn more than 50K a year than people who are self-employed in an unincorporated business ("workclass Self-emp-not-inc");

  3. the odds of earning more than 50K a year are about \(2.1\) times greater for people who work in the federal government than for people with “workclass State-gov”.

The last category is “workclass Without-pay”, which differs from the base level by \(-13.18\): \[ \small{ \dfrac{\textrm{odds of income being more than 50K given that "workclass"="workclass Without-pay"}}{\textrm{odds of income being more than 50K given that "workclass"="workclass Federal-gov"}}=} \] \[ \small{ =\exp(-13.18)\approx 1.9\times 10^{-6},} \] or, equivalently, \[ \small{ \dfrac{\textrm{odds of income being more than 50K given that "workclass"="workclass Federal-gov"}}{\textrm{odds of income being more than 50K given that "workclass"="workclass Without-pay"}}=} \] \[ \small{ =\dfrac{1}{\exp(-13.18)}\approx 530{,}000.} \] The latter means that it is vastly less likely for a person to earn more than 50K a year if they are without pay than if they are employed in the federal government, which makes perfect sense. However, we will see that the extreme size of this estimate is mainly due to the fact that very few people without pay participated in the survey.
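
All of these odds ratios (and the confidence intervals shown further below) can also be obtained at once by exponentiating the estimated coefficients; a short sketch:

# Odds ratios relative to the reference levels, with 95% Wald confidence intervals (sketch)
odds.ratios <- exp(cbind(OR = coef(glm.model.wld), confint.default(glm.model.wld)))
round(head(odds.ratios, 8), 3)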

We can make equivalent interpretations for all predictors in the model, but we will not go into details and discuss the impact of every categorical variable; we will only comment on some of the more interesting results. As an example, consider the independent variable “education”. We notice that it is about \(19\) times more likely for an individual to have an income of more than 50K a year if they have a doctorate degree compared to having only a 10th-grade education (\(\exp(2.97)\approx 19\)). Below we list the levels of “education”:

levels(db.adult$education)
 [1] " 10th"         " 11th"         " 12th"         " 1st-4th"     
 [5] " 5th-6th"      " 7th-8th"      " 9th"          " Assoc-acdm"  
 [9] " Assoc-voc"    " Bachelors"    " Doctorate"    " HS-grad"     
[13] " Masters"      " Preschool"    " Prof-school"  " Some-college"
summary(db.adult$education)
         10th          11th          12th       1st-4th       5th-6th 
          820          1048           377           151           288 
      7th-8th           9th    Assoc-acdm     Assoc-voc     Bachelors 
          557           455          1008          1307          5044 
    Doctorate       HS-grad       Masters     Preschool   Prof-school 
          375          9840          1627            45           542 
 Some-college 
         6678 

Also, it is much more likely to earn more than 50K if one has a Bachelor’s or Master’s degree (about \(6.7\) and \(9.6\) times more likely, respectively) relative to the baseline, a 10th-grade education. The same can be said for “education Prof-school”, “education Assoc-acdm” and “education Assoc-voc”: the odds of having an income of more than 50K are about \(18\), \(3.6\) and \(3.5\) times greater, respectively, compared to the reference category. Furthermore, people with some college education are about \(3\) times more likely to earn more than 50K than people with a 10th-grade education. On the other hand, if an individual has a “1st-4th”, “5th-6th”, “7th-8th”, or “9th” grade education, their odds of being paid more than 50K a year are about \(1.6\), \(1.5\), \(1.7\) and \(1.3\) times lower, respectively, than if they had a “10th” grade education. The result for “education Preschool” is very extreme: the odds of earning more than 50K a year are about \(270{,}000\) times lower relative to the base level (\(\exp(-12.5)\approx 3.7\times 10^{-6}\)). This number points in a sensible direction, but its magnitude is largely due to the fact that there are very few people in this category - only \(45\) out of \(30162\) in the whole sample. This can also be seen from the non-significant p-value for “education Preschool”, which indicates that this dummy variable is not significant in the model at the \(5\)% level. From the p-values we also notice that “education 1st-4th”, “education 5th-6th”, “education 9th”, “education 11th” and “education 12th” are not significant at the \(5\)% level.

Below we show the \(95\)% confidence intervals for all estimated model parameters, as well as the number of people belonging to each category of “workclass”:

summary(db.adult$workclass)
      Federal-gov         Local-gov           Private      Self-emp-inc 
              943              2067             22286              1074 
 Self-emp-not-inc         State-gov       Without-pay 
             2499              1279                14 
confint.default(glm.model.wld)
                                             2.5 %        97.5 %
(Intercept)                            -0.42442766   2.229006620
age                                     0.02309987   0.029776966
workclass Local-gov                    -0.85966300  -0.419738493
workclass Private                      -0.63530818  -0.270367376
workclass Self-emp-inc                 -0.47690169   0.006628768
workclass Self-emp-not-inc             -1.08788730  -0.657043423
workclass State-gov                    -0.99831503  -0.508203616
workclass Without-pay                -403.88598205 377.525466439
education 11th                         -0.31557639   0.522114548
education 12th                         -0.09213183   0.980656177
education 1st-4th                      -1.39990049   0.480965538
education 5th-6th                      -1.10944208   0.270981821
education 7th-8th                      -0.97877726  -0.029928297
education 9th                          -0.80580927   0.249234482
education Assoc-acdm                    0.93329414   1.639745735
education Assoc-voc                     0.92272247   1.601543443
education Bachelors                     1.58678677   2.218736884
education Doctorate                     2.52982589   3.404610254
education HS-grad                       0.45713599   1.071763354
education Masters                       1.92350880   2.598347602
education Preschool                  -204.28639279 179.278348024
education Prof-school                   2.49571043   3.306165155
education Some-college                  0.81282661   1.436311123
marital_status Married-AF-spouse        1.77211372   4.040622605
marital_status Married-civ-spouse       1.62798024   2.700354828
marital_status Married-spouse-absent   -0.45404214   0.473030517
marital_status Never-married           -0.61425245  -0.270627067
marital_status Separated               -0.39512831   0.248068539
marital_status Widowed                 -0.09389025   0.520731587
occupation Armed-Forces                -4.50407419   1.741499603
occupation Craft-repair                -0.12051675   0.194747513
occupation Exec-managerial              0.61739052   0.921049182
occupation Farming-fishing             -1.13784005  -0.596500127
occupation Handlers-cleaners           -0.99972183  -0.434428630
occupation Machine-op-inspct           -0.51604829  -0.113090274
occupation Other-service               -1.02257547  -0.559409841
occupation Priv-house-serv             -5.77525481  -0.613708781
occupation Prof-specialty               0.34204628   0.663485977
occupation Protective-serv              0.34723561   0.839539872
occupation Sales                        0.09541604   0.420826770
occupation Tech-support                 0.44115707   0.878188702
occupation Transport-moving            -0.26544291   0.124487162
relationship Not-in-family              0.05271747   1.113029820
relationship Other-relative            -0.76398695   0.197257381
relationship Own-child                 -1.14401636  -0.086483230
relationship Unmarried                 -0.12162117   0.999517916
relationship Wife                       1.17483555   1.586575271
race Asian-Pac-Islander                 0.17802763   1.255669042
race Black                              0.05847431   1.002662589
race Other                             -0.53973527   0.911874593
race White                              0.18018746   1.083310519
sex Male                                0.68368361   0.995362284
native_region Central-Asia             -0.60985508   0.520375081
native_region East-Asia                -0.43923923   0.586053447
native_region Europe-East              -0.28783017   1.026373727
native_region Europe-West               0.19593733   0.953268729
native_region Outlying-US              -0.12029459   0.762220580
native_region South-America            -1.90134306  -0.061965462
native_region United-States             0.15746373   0.683756668
hours_w between_45_and_60               0.35279780   0.524072965
hours_w between_60_and_80               0.21834927   0.603341146
hours_w less_than_40                   -0.92642270  -0.682737562
hours_w more_than_80                   -0.10549456   0.653466778
cap_gain Low                           -7.65477856  -5.658730237
cap_gain Medium                        -5.78670849  -3.772465970
cap_loss Low                           -1.08164521  -0.497704860
cap_loss Medium                         0.45698900   1.159168541

The \(95\)% confidence interval for the odds ratio comparing “workclass Without-pay” versus “workclass Federal-gov” ranges from \(\exp(-403.9)\approx 0\) to \(\exp(377.5)\), an astronomically large number, so the interval is essentially \((0, \infty)\). This anomaly is due to the fact that a very small number of people - only \(14\) - belong to the category “workclass Without-pay”. Thus, this association should be interpreted with a lot of caution. The same can be said for the category (dummy variable) “education Preschool”, to which only \(45\) people from the study belong:

summary(db.adult$education)
         10th          11th          12th       1st-4th       5th-6th 
          820          1048           377           151           288 
      7th-8th           9th    Assoc-acdm     Assoc-voc     Bachelors 
          557           455          1008          1307          5044 
    Doctorate       HS-grad       Masters     Preschool   Prof-school 
          375          9840          1627            45           542 
 Some-college 
         6678 

3.4 Performance of the fitted model

3.4.1 Performance on the training data

First we calculate the percentage of correctly guessed response variables using the training dataset “db.adult”, i.e. the same dataset that we used to fit the logistic regression model “glm.model.wld”. Given the vector of predicted probabilities \(\hat{\boldsymbol{\pi}}\), we calculate a character vector \(\hat{\mathbf{y}}=\left( \hat{y}_1, \hat{y}_2, \dots, \hat{y}_n \right)\) such that \[ \hat{y}_i = " >50K", \quad \text{if} \quad \hat{\pi}_i>0.5, \] and \[ \hat{y}_i = " <=50K", \quad \text{if} \quad \hat{\pi}_i \leq 0.5. \]

Since R codes factor variables as integers internally, the binary response variable is also coded this way. Therefore, when estimating a logistic regression model we need to know which level of the binary response is being modeled. By default R orders the factor levels alphabetically, and the response level modeled in the logistic regression is the second (higher) level. In our case the higher level of the “income” variable is " >50K":

attributes(db.adult$income)
$levels
[1] " <=50K" " >50K" 

$class
[1] "factor"

Hence, the response level modeled is " >50K", i.e. when fitting the logistic regression model, the probability of the income being greater than 50K is calculated. We denote the vector with the predicted income values as “predicted.income.train”:

predicted.income.train <- ifelse(predicted.probs > 0.5, " >50K", " <=50K")
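
where the vector of fitted probabilities “predicted.probs” was computed earlier in the project. For completeness, a minimal sketch of how it can be obtained from the fitted model (calling “predict()” on a glm object without a “newdata” argument and with type = "response" returns the fitted probabilities for the training observations):

# Sketch: fitted probabilities P(income = " >50K") for the training observations
predicted.probs <- predict(glm.model.wld, type = "response")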

As we discussed earlier, there is an 84.89% match between the observed and the predicted values of “income”:

mean(predicted.income.train == db.adult$income)
[1] 0.8489159

Below we show the confusion matrix:

stat.log.train <- confusionMatrix(data = predicted.income.train, 
                                  reference = db.adult$income,
                                  positive = levels(db.adult$income)[2])

stat.log.train
Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K  21075  2978
     >50K    1579  4530
                                          
               Accuracy : 0.8489          
                 95% CI : (0.8448, 0.8529)
    No Information Rate : 0.7511          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5691          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.6034          
            Specificity : 0.9303          
         Pos Pred Value : 0.7415          
         Neg Pred Value : 0.8762          
             Prevalence : 0.2489          
         Detection Rate : 0.1502          
   Detection Prevalence : 0.2025          
      Balanced Accuracy : 0.7668          
                                          
       'Positive' Class :  >50K           
                                          

The sensitivity is the proportion of income values equal to " >50K" that are correctly identified as such, and the specificity is the proportion of income values equal to " <=50K" that are correctly identified as such. We see that the sensitivity is 60.34% and the specificity is 93.03%.
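
These two quantities can also be recovered directly from the counts in the confusion matrix above; a quick arithmetic check:

# Sensitivity: correctly predicted " >50K" among all observed " >50K"
4530 / (4530 + 2978)    # = 0.6034
# Specificity: correctly predicted " <=50K" among all observed " <=50K"
21075 / (21075 + 1579)  # = 0.9303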

3.4.2 Out of sample error

Next we will test how well the fitted model predicts new observations. For this purpose we will use the provided test dataset and, hence, the corresponding test data frame that we created - “db.test”.

predicted.income.test <- predict(glm.model.wld, 
                                 newdata = db.test, 
                                 type = "response") 

predicted.income.test <- ifelse(predicted.income.test > 0.5, " >50K", " <=50K")

Below we show the respective confusion matrix:

stat.log.test <- confusionMatrix(data = predicted.income.test, 
                                 reference = db.test$income,
                                 positive = levels(db.test$income)[2])

stat.log.test
Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K  10560  1496
     >50K     800  2204
                                          
               Accuracy : 0.8475          
                 95% CI : (0.8417, 0.8532)
    No Information Rate : 0.7543          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5608          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.5957          
            Specificity : 0.9296          
         Pos Pred Value : 0.7337          
         Neg Pred Value : 0.8759          
             Prevalence : 0.2457          
         Detection Rate : 0.1463          
   Detection Prevalence : 0.1995          
      Balanced Accuracy : 0.7626          
                                          
       'Positive' Class :  >50K           
                                          

As is evident from the results above, the model correctly predicts 84.75% of the values of the dependent variable. We consider this a good predictive rate. On the test dataset the sensitivity is 59.57% and the specificity is 92.96%.

3.5 Stepwise selection of covariates with likelihood ratio tests

Here we consider two R functions for stepwise selection of covariates. The functions are based on the predictors’ significance according to likelihood ratio tests. The first function is “drop1()”. It drops one predictor at a time from the full model “glm.model.wld” and then compares the restricted and full model in terms of goodness of fit with the help of the likelihood ratio test statistic.

drop1(glm.model.wld,
      trace = TRUE,
      test = "LRT")
Single term deletions

Model:
income ~ age + workclass + education + marital_status + occupation + 
    relationship + race + sex + native_region + hours_w + cap_gain + 
    cap_loss
               Df Deviance   AIC     LRT  Pr(>Chi)    
<none>               19643 19777                      
age             1    19886 20018  243.22 < 2.2e-16 ***
workclass       6    19746 19868  102.65 < 2.2e-16 ***
education      15    20637 20741  993.74 < 2.2e-16 ***
marital_status  6    19751 19873  108.25 < 2.2e-16 ***
occupation     13    20180 20288  537.42 < 2.2e-16 ***
relationship    5    19921 20045  277.75 < 2.2e-16 ***
race            4    19656 19782   12.69 0.0129070 *  
sex             1    19759 19891  115.54 < 2.2e-16 ***
native_region   7    19670 19790   27.40 0.0002826 ***
hours_w         4    20012 20138  369.34 < 2.2e-16 ***
cap_gain        2    21195 21325 1552.24 < 2.2e-16 ***
cap_loss        2    19941 20071  298.01 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we see from the output above, the p-values indicate that all predictors in the model are significant and should be retained.
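
The same likelihood ratio test can be carried out manually for a single covariate by refitting the model without that covariate and comparing it to the full model with “anova()”. A minimal sketch for “race”, the predictor with the largest p-value above (the code simply reuses the objects defined earlier):

# Sketch: LRT comparing the full model with the model without "race"
glm.without.race <- update(glm.model.wld, . ~ . - race)
anova(glm.without.race, glm.model.wld, test = "LRT")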

The second function which we consider is the “step()” function for backward selection of predictors based on LRT:

step.lrt <- step(glm.model.wld, direction="backward", test="LRT") 
Start:  AIC=19776.97
income ~ age + workclass + education + marital_status + occupation + 
    relationship + race + sex + native_region + hours_w + cap_gain + 
    cap_loss

                 Df Deviance   AIC     LRT  Pr(>Chi)    
<none>                 19643 19777                      
- race            4    19656 19782   12.69 0.0129070 *  
- native_region   7    19670 19790   27.40 0.0002826 ***
- workclass       6    19746 19868  102.65 < 2.2e-16 ***
- marital_status  6    19751 19873  108.25 < 2.2e-16 ***
- sex             1    19759 19891  115.54 < 2.2e-16 ***
- age             1    19886 20018  243.22 < 2.2e-16 ***
- relationship    5    19921 20045  277.75 < 2.2e-16 ***
- cap_loss        2    19941 20071  298.01 < 2.2e-16 ***
- hours_w         4    20012 20138  369.34 < 2.2e-16 ***
- occupation     13    20180 20288  537.42 < 2.2e-16 ***
- education      15    20637 20741  993.74 < 2.2e-16 ***
- cap_gain        2    21195 21325 1552.24 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The algorithm starts with the full model - “glm.model.wld” - and successively eliminates explanatory variables, at each step comparing the restricted model to the current model with the likelihood ratio test. From the output above we see that, according to the p-values, all predictor variables should be kept in the model.

3.6 Manual selection of covariates

We use the function “nearZeroVar()” from the “caret” package to check whether any of the predictor variables have near-zero variance:

nearZeroVar(db.adult, saveMetrics = TRUE)
                freqRatio percentUnique zeroVar   nzv
age              1.001175    0.23871096   FALSE FALSE
workclass        8.917967    0.02320801   FALSE FALSE
fnlwgt           1.083333   67.18055832   FALSE FALSE
education        1.473495    0.05304688   FALSE FALSE
education_num    1.473495    0.05304688   FALSE FALSE
marital_status   1.446124    0.02320801   FALSE FALSE
occupation       1.001985    0.04641602   FALSE FALSE
relationship     1.613125    0.01989258   FALSE FALSE
race             9.205893    0.01657715   FALSE FALSE
sex              2.083419    0.00663086   FALSE FALSE
capital_gain    81.970326    0.39122074   FALSE  TRUE
capital_loss   148.118557    0.29838870   FALSE  TRUE
hours_per_week   5.243194    0.31165042   FALSE FALSE
native_country  45.088525    0.13593263   FALSE  TRUE
income           3.017315    0.00663086   FALSE FALSE
hours_w          2.473339    0.01657715   FALSE FALSE
native_region   22.433931    0.02652344   FALSE  TRUE
cap_gain        22.424603    0.00994629   FALSE  TRUE
cap_loss        37.261538    0.00994629   FALSE  TRUE

Indeed, there are predictors with near-zero variance: the variables “capital_gain”, “capital_loss”, “native_country”, “native_region”, “cap_gain” and “cap_loss”. In general, covariates with near-zero variance carry little information about the response variable, but this is not always the case. Therefore we have to be careful about throwing away data.
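
For reference, the two diagnostics reported by “nearZeroVar()” are the ratio of the frequency of the most common value to that of the second most common value (“freqRatio”) and the percentage of distinct values (“percentUnique”). A minimal sketch reproducing them by hand for the variable “cap_gain”:

# Sketch: the diagnostics behind nearZeroVar(), computed manually for "cap_gain"
tab <- sort(table(db.adult$cap_gain), decreasing = TRUE)
tab[1] / tab[2]                                            # freqRatio (about 22.42)
100 * length(unique(db.adult$cap_gain)) / nrow(db.adult)   # percentUnique (about 0.0099)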

Since in our prediction models we do not include the variables “capital_gain”, “capital_loss” and “native_country”, we are not interested in them. Regarding the remaining three features – “native_region”, “cap_gain” and “cap_loss”, we will fit logistic regression models excluding each of these variables one at a time, and then we will compare their prediction accuracy to that of the full model – “glm.model.wld”. We start with fitting a model without the variable “native_region”:

covariates <- paste("age", "workclass", "education",
                    "marital_status", "occupation", "relationship",
                    "race", "sex", "hours_w",
                    "cap_gain", "cap_loss", sep = "+")

form <- as.formula(paste("income ~", covariates))

start_time <- proc.time()
Mod <- glm(formula = form,
           data = db.adult, 
           family = binomial(link = "logit"),
           x = TRUE)
time.wout.region <- proc.time() - start_time
time.wout.region
   user  system elapsed 
   2.44    0.08    2.51 

Below we show the model’s accuracy, sensitivity and specificity on the test data, i.e. on new observations:

predicted.test <- predict(Mod, 
                          newdata = db.test, 
                          type = "response") 

predicted.test <- ifelse(predicted.test > 0.5, " >50K", " <=50K")

mean(predicted.test == db.test$income)
[1] 0.8474104
confMat.with.cap <- confusionMatrix(data = predicted.test, 
                                    reference = db.test$income,
                                    positive = levels(db.test$income)[2])

confMat.with.cap
Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K  10554  1492
     >50K     806  2208
                                          
               Accuracy : 0.8474          
                 95% CI : (0.8416, 0.8531)
    No Information Rate : 0.7543          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5609          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.5968          
            Specificity : 0.9290          
         Pos Pred Value : 0.7326          
         Neg Pred Value : 0.8761          
             Prevalence : 0.2457          
         Detection Rate : 0.1466          
   Detection Prevalence : 0.2001          
      Balanced Accuracy : 0.7629          
                                          
       'Positive' Class :  >50K           
                                          

Fitting the model without “native_region” results in approximately the same accuracy, sensitivity and specificity on the test data. This means that we can drop this variable and thus reduce the computational complexity. Fitting the model without “native_region” took 2.44s, about 1.1 times faster than fitting the full model (2.65s).

Next we remove the variables “cap_gain” and “cap_loss” but we keep “native_region”:

covariates <- paste("age", "workclass", "education",
                    "marital_status", "occupation", "relationship",
                    "race", "sex", "hours_w", "native_region", 
                    sep = "+")

form <- as.formula(paste("income ~", covariates))

Mod <- glm(formula = form,
           data = db.adult, 
           family = binomial(link = "logit"),
           x = TRUE)

We check the model’s accuracy, sensitivity and specificity on the test dataset:

predicted.test <- predict(Mod, 
                          newdata = db.test, 
                          type = "response") 

predicted.test <- ifelse(predicted.test > 0.5, " >50K", " <=50K")

mean(predicted.test == db.test$income)
[1] 0.8316069
confMat.without.cap <- confusionMatrix(data = predicted.test, 
                                       reference = db.test$income,
                                       positive = levels(db.test$income)[2])

confMat.without.cap
Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K  10430  1606
     >50K     930  2094
                                          
               Accuracy : 0.8316          
                 95% CI : (0.8255, 0.8376)
    No Information Rate : 0.7543          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5159          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.5659          
            Specificity : 0.9181          
         Pos Pred Value : 0.6925          
         Neg Pred Value : 0.8666          
             Prevalence : 0.2457          
         Detection Rate : 0.1390          
   Detection Prevalence : 0.2008          
      Balanced Accuracy : 0.7420          
                                          
       'Positive' Class :  >50K           
                                          

As we can see from the results above, the prediction accuracy on the test dataset decreased from 84.75% for the full model to 83.16% for the model without “cap_gain” and “cap_loss”. The sensitivity declines noticeably, from 59.68% to 56.59%, which is undesirable, since this is the proportion of correctly identified observations with income greater than 50K. The specificity also decreases, from 92.9% to 91.81%.

Finally we drop all three variables with near zero variance and we fit a logistic model:

covariates <- paste("age", "workclass", "education",
                    "marital_status", "occupation", "relationship",
                    "race", "sex", "hours_w", 
                    sep = "+")

form <- as.formula(paste("income ~", covariates))

Mod <- glm(formula = form,
           data = db.adult, 
           family = binomial(link = "logit"),
           x = TRUE)

The accuracy, sensitivity and specificity of the model on the test dataset are given below:

predicted.test <- predict(Mod, 
                          newdata = db.test, 
                          type = "response") 

predicted.test <- ifelse(predicted.test > 0.5, " >50K", " <=50K")


mean(predicted.test == db.test$income)
[1] 0.8312749
confMat.without.three <- confusionMatrix(data = predicted.test, 
                                         reference = db.test$income,
                                         positive = levels(db.test$income)[2])

confMat.without.three
Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K  10428  1609
     >50K     932  2091
                                          
               Accuracy : 0.8313          
                 95% CI : (0.8252, 0.8372)
    No Information Rate : 0.7543          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5149          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.5651          
            Specificity : 0.9180          
         Pos Pred Value : 0.6917          
         Neg Pred Value : 0.8663          
             Prevalence : 0.2457          
         Detection Rate : 0.1388          
   Detection Prevalence : 0.2007          
      Balanced Accuracy : 0.7415          
                                          
       'Positive' Class :  >50K           
                                          

We observe that the accuracy drops from 84.75% for the full model to 83.13% for the reduced model. The sensitivity declines noticeably, from 59.57% to 56.51%, which is not a desired effect, since this is the proportion of correctly identified observations with income greater than 50K. The specificity decreases from 92.96% to 91.8%. Therefore, we will keep the variables “cap_gain” and “cap_loss” in the model.

Although the automated stepwise selection of variables pointed out that we should keep all initial variables in the model, our further analysis showed that we can drop “native_region”.

4 Over-sampling

We modify the census dataset in order to obtain more balanced data. As we saw from the exploratory data analysis in the second part of the project, 75.1% of people earn less than 50K a year and only 24.9% earn more than 50K a year. Although this statistic is not surprising, it makes the data imbalanced and leads to low sensitivity of the fitted classification model. We recall that we consider people with income greater than 50K as the “positive” class. Thus, the sensitivity is the proportion of observations with income values equal to " >50K" that are correctly identified as such, and the specificity is the proportion of records with income values equal to " <=50K" that are correctly identified as such. The sensitivity of the logistic regression model is only 59.57%, whereas the specificity is 92.96%. We balance the data by adding copies of observations from the under-represented class. This technique is called over-sampling, or sampling with replacement, and its goal is to build better classification models with higher sensitivity.

First we create a temporary dataset called “db.adult.rich” which contains all observations with income greater than 50K a year:

db.adult.rich <- subset(db.adult, db.adult$income == " >50K")

rownames(db.adult.rich) <- 1:nrow(db.adult.rich)

Then we use the function “createDataPartition()” from the “caret” package to randomly sample 80% of these observations, and we create the dataset “db.adult.oversampled”, which consists of the original observations from the “db.adult” set together with the over-sampled instances:

IndRich <- createDataPartition(y = db.adult.rich$income,
                               times = 1,
                               p = 0.8,
                               list = FALSE)


db.adult.oversampled <- rbind(db.adult, db.adult.rich[IndRich, ])

rownames(db.adult.oversampled) <- 1:nrow(db.adult.oversampled)

After we are finished with the temporary dataset “db.adult.rich”, we remove it from the working environment:

rm(db.adult.rich)
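
As a sanity check we can inspect the class balance of the over-sampled training set. From the confusion matrices reported later in this section, it contains 22654 observations with income " <=50K" and 13515 with income " >50K", i.e. roughly a 63%/37% split instead of the original 75%/25%:

# Class distribution of the over-sampled training set
summary(db.adult.oversampled$income)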

5 Synthetic Minority Over-sampling Technique

The most popular algorithm for generating synthetic samples is the SMOTE algorithm, or the Synthetic Minority Over-sampling Technique. More information can be found in the original paper “SMOTE: Synthetic Minority Over-sampling Technique” by N.V. Chawla, K. W. Bowyer, L.O. Hall and W.P. Kegelmeyer from 2002. We use this technique to create a more balanced training dataset. In short, the idea of the method is to combine under-sampling of the majority class with a special form of over-sampling of the minority class. In the SMOTE algorithm the over-sampling of the minority class is carried out by creating new “synthetic” examples, rather than by over-sampling with replacement from the existing examples. The synthetic observations are generated with the help of the k nearest neighbors (KNN) algorithm. The majority class is under-sampled by randomly removing observations until the minority class becomes some pre-specified percentage of the majority class.
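
In the implementation from the “DMwR” package, the amount of over- and under-sampling is controlled by the arguments “perc.over” and “perc.under”: perc.over = 100 generates one synthetic example for every observation of the minority class, and perc.under = 250 keeps from the majority class 250% of the number of newly generated minority cases. A small arithmetic sketch with the class counts of our training set:

# Minority class (" >50K") has 7508 observations in db.adult
7508 * 100/100            # 7508 synthetic cases; the minority class grows to 15016
7508 * 100/100 * 250/100  # 18770 majority cases are kept after under-sampling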

Below we create a “SMOTEd” training dataset called “db.adult.smote”:

db.adult.smote <- SMOTE(form = form, 
                        data = db.adult,
                        perc.over = 100,
                        perc.under = 250, 
                        k = 5)

rownames(db.adult.smote) <- 1:nrow(db.adult.smote)

where we recall that we use the following formula:

covariates <- paste("age", "workclass", "education","marital_status",
                    "occupation", "relationship", "race", "sex",
                    "native_region", "hours_w", "cap_gain",
                    "cap_loss", sep = "+")

form <- as.formula(paste("income ~", covariates))

We show the distribution of income for the SMOTE and for the original dataset:

summary(db.adult.smote$income) 
 <=50K   >50K 
 18770  15016 
summary(db.adult$income)
 <=50K   >50K 
 22654   7508 

We see that 3884 observations of the majority class from the original unbalanced set were (randomly) removed and 7508 new synthetic observations of the minority class were generated, in agreement with the arithmetic above. Below we show a bar plot of the distribution of income in the new balanced dataset:

ggplot(data = db.adult.smote, 
       mapping = aes(x = income, 
                     fill = income)) + 
  geom_bar(mapping = aes(y = (..count..)/sum(..count..))) +
  geom_text(mapping = aes(label = scales::percent((..count..)/sum(..count..)),
                          y = (..count..)/sum(..count..)), 
            stat = "count",
            vjust = -.1) +
  labs(x = "Income", 
       y = "",
       fill = "Income") +
  scale_y_continuous(labels = percent)

From the graph we see that in the SMOTE training set 55.6% of the observations have an income of less than 50K a year and 44.4% have an income of more than 50K a year.

6 Logistic regression on the balanced training sets

6.1 Over-sampled training set

After we enriched the “db.adult” dataset with observations over-sampled from the under-represented class of people who earn more than 50K a year, we fit a new logistic regression model called “glm.model.over”:

covariates <- paste("age", "workclass", "education","marital_status",
                    "occupation", "relationship", "race", "sex",
                    "native_region", "hours_w", "cap_gain",
                     "cap_loss", sep = "+")


form <- as.formula(paste("income ~", covariates))

start_time <- proc.time()
glm.model.over <- glm(formula = form,
                      data = db.adult.oversampled, 
                      family = binomial(link = "logit"),
                      x = TRUE,
                      y = TRUE)
time.logistic.over <- proc.time() - start_time
time.logistic.over
   user  system elapsed 
   3.36    0.13    3.48 

6.1.1 Accuracy on the training dataset

First we test the performance of the model on the training set – namely, the “db.adult.oversampled” dataset:

predicted.train <- predict(glm.model.over, 
                           newdata = db.adult.oversampled, 
                           type = "response") 

predicted.train <- ifelse(predicted.train > 0.5, " >50K", " <=50K")

Below we show the confusion matrix together with the accuracy, sensitivity and specificity:

stat.Logi.over <- confusionMatrix(data = predicted.train, 
                                  reference = db.adult.oversampled$income,
                                  positive = levels(db.test$income)[2])

stat.Logi.over
Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K  19728  3524
     >50K    2926  9991
                                          
               Accuracy : 0.8217          
                 95% CI : (0.8177, 0.8256)
    No Information Rate : 0.6263          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6156          
 Mcnemar's Test P-Value : 1.057e-13       
                                          
            Sensitivity : 0.7393          
            Specificity : 0.8708          
         Pos Pred Value : 0.7735          
         Neg Pred Value : 0.8484          
             Prevalence : 0.3737          
         Detection Rate : 0.2762          
   Detection Prevalence : 0.3571          
      Balanced Accuracy : 0.8050          
                                          
       'Positive' Class :  >50K           
                                          

We observe that the overall accuracy drops a little from 84.89% (for the model trained on the original unbalanced dataset) to 82.17%, whereas the sensitivity increases significantly from 60.34% to 73.93%. The specificity decreases by 5.95 percentage points, from 93.03% to 87.08%.

6.1.2 Accuracy on the test dataset

Our next step is to test the accuracy of the fitted model on the test dataset:

predicted.test <- predict(glm.model.over, 
                          newdata = db.test, 
                          type = "response") 

predicted.test <- ifelse(predicted.test > 0.5, " >50K", " <=50K")

Below we give the confusion matrix together with additional statistics:

stat.Logi.over <- confusionMatrix(data = predicted.test, 
                                  reference = db.test$income,
                                  positive = levels(db.test$income)[2])

stat.Logi.over
Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K   9869   989
     >50K    1491  2711
                                          
               Accuracy : 0.8353          
                 95% CI : (0.8293, 0.8412)
    No Information Rate : 0.7543          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5751          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.7327          
            Specificity : 0.8688          
         Pos Pred Value : 0.6452          
         Neg Pred Value : 0.9089          
             Prevalence : 0.2457          
         Detection Rate : 0.1800          
   Detection Prevalence : 0.2790          
      Balanced Accuracy : 0.8007          
                                          
       'Positive' Class :  >50K           
                                          

We see that the accuracy on the test set drops down a little bit from 84.75% for the model built on the original unbalanced train set to 83.53% for the model built on the over-sampled train dataset. On the other hand, the sensitivity significantly increases from 59.57% to 73.27% and the specificity decreases from 92.96% to 86.88%.

6.2 SMOTE training set

Next we train a model on the balanced SMOTE training dataset:

covariates <- paste("age", "workclass", "education","marital_status",
                    "occupation", "relationship", "race", "sex",
                    "native_region", "hours_w", "cap_gain",
                     "cap_loss", sep = "+")


form <- as.formula(paste("income ~", covariates))


start_time <- proc.time()
glm.model.smote <- glm(formula = form,
                       data = db.adult.smote, 
                       family = binomial(link = "logit"),
                       x = TRUE,
                       y = TRUE)
time.logistic.smote <- proc.time() - start_time
time.logistic.smote
   user  system elapsed 
   3.17    0.13    3.29 

6.2.1 Accuracy on the training dataset

First we test the performance of the model on the training set – namely, the “db.adult.smote” dataset:

predicted.train <- predict(glm.model.smote, 
                           newdata = db.adult.smote, 
                           type = "response") 

predicted.train <- ifelse(predicted.train > 0.5, " >50K", " <=50K")

Below we show the confusion matrix together with the accuracy, sensitivity and specificity:

stat.Logi.smote <- confusionMatrix(data = predicted.train, 
                                   reference = db.adult.smote$income,
                                   positive = levels(db.test$income)[2])

stat.Logi.smote
Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K  16221  3142
     >50K    2549 11874
                                          
               Accuracy : 0.8316          
                 95% CI : (0.8275, 0.8355)
    No Information Rate : 0.5556          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6576          
 Mcnemar's Test P-Value : 4.247e-15       
                                          
            Sensitivity : 0.7908          
            Specificity : 0.8642          
         Pos Pred Value : 0.8233          
         Neg Pred Value : 0.8377          
             Prevalence : 0.4444          
         Detection Rate : 0.3514          
   Detection Prevalence : 0.4269          
      Balanced Accuracy : 0.8275          
                                          
       'Positive' Class :  >50K           
                                          

We observe that the overall accuracy drops a little from 84.89% (for the original unbalanced train dataset) to 83.16% (for the balanced SMOTE dataset), whereas the sensitivity increases significantly from 60.34% to 79.08%. The specificity decreases by about 6.6 percentage points, from 93.03% to 86.42%.

6.2.2 Accuracy on the test dataset

Our next step is to test the accuracy of the fitted model on the test dataset:

predicted.test <- predict(glm.model.smote, 
                          newdata = db.test, 
                          type = "response") 

predicted.test <- ifelse(predicted.test > 0.5, " >50K", " <=50K")

Below we give the confusion matrix together with additional statistics:

stat.Logi.smote <- confusionMatrix(data = predicted.test, 
                                   reference = db.test$income,
                                   positive = levels(db.test$income)[2])

stat.Logi.smote
Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K   9768  1043
     >50K    1592  2657
                                          
               Accuracy : 0.825           
                 95% CI : (0.8189, 0.8311)
    No Information Rate : 0.7543          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5504          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.7181          
            Specificity : 0.8599          
         Pos Pred Value : 0.6253          
         Neg Pred Value : 0.9035          
             Prevalence : 0.2457          
         Detection Rate : 0.1764          
   Detection Prevalence : 0.2821          
      Balanced Accuracy : 0.7890          
                                          
       'Positive' Class :  >50K           
                                          

As we can see, the overall accuracy drops a little from 84.75% (for the original unbalanced train dataset) to 82.5% (for the balanced SMOTE dataset), whereas the sensitivity increases significantly from 59.57% to 71.81%. The specificity decreases by about 7 percentage points, from 92.96% to 85.99%.

7 Random Forest

Here we fit a classification model with the random forest algorithm using the “randomForest” function from the “randomForest” package. We use the default parameters, such as the number of variables tried at each split of the decision trees (\(\lfloor\sqrt{12}\rfloor = 3\) in our case) and the number of trees (500). We train two models – one on the over-sampled train dataset and one on the SMOTE dataset.

First, we will briefly explain how the random forest algorithm works. The random forest is an “ensemble” method. The main idea of ensemble methods is that a group of “weak learners” comes together and forms a “strong learner”. In the case of the random forest, each decision tree (a type of classification machine learning algorithm) is a “weak learner” and the group of trees (i.e. the so-called forest) forms the “strong learner”. Let us say that we have \(T\) trees in the forest. Then, for each tree, \(N\) cases are sampled at random with replacement to create a subset of the train dataset. This subset usually contains about 66% of the total set. At each node of each tree in the forest, \(m\) predictor variables are selected at random from all the predictor variables. In R the default value of \(m\) for classification is the square root of the total number of predictors in the model. The predictor variable that provides the best split, according to some objective function, is then used to do a binary split on that node. At the next node of the tree, another \(m\) variables are chosen at random from all predictor variables, and so on, until a leaf node is reached. The final classification is obtained by a majority vote over the decisions (classifications) made by the individual trees. Since each decision tree samples at random about 66% of all observations in the train set, the remaining 34% can be used to test the accuracy of the predictions made by the corresponding tree. Thus, the random forest simultaneously trains a model and tests its prediction accuracy. R reports the so-called “out-of-bag” (OOB) error for each decision tree, as well as an averaged overall OOB error. This is the prediction error on the observations left out of the bootstrap sample of each tree.
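
To make the defaults mentioned above concrete, the sketch below shows how the default value of \(m\) is obtained for our 12 predictors and where the OOB error estimates are stored in a fitted “randomForest” object (here “ModRF” refers to the forest fitted in the next subsection):

# Default number of variables tried at each split for a classification forest
floor(sqrt(12))        # = 3 for our 12 predictors

# Cumulative OOB and per-class error rates, one row per tree
head(ModRF$err.rate)   # columns: "OOB", " <=50K", " >50K"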

7.1 Random Forest on the over-sampled train set

Below we fit the random forest model on the over-sampled training dataset:

set.seed(1234)

covariates <- paste("age", "workclass", "education", 
                    "marital_status", "occupation", "relationship",
                    "race", "sex", "native_region", "hours_w",
                    "cap_gain", "cap_loss", sep = "+")

form <- as.formula(paste("income ~", covariates))

start_time <- proc.time()
ModRF <- randomForest(formula = form,
                      data = db.adult.oversampled)
time.rf.over <- proc.time() - start_time
time.rf.over
   user  system elapsed 
  21.55    1.00   23.21 

We see that the overall “out-of-bag” error, or OOB, is 14.13%:

ModRF

Call:
 randomForest(formula = form, data = db.adult.oversampled) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 14.13%
Confusion matrix:
        <=50K  >50K class.error
 <=50K  19577  3077   0.1358259
 >50K    2032 11483   0.1503515

Below we test the accuracy on the training and test datasets and we see that it is 90.92% and 83.47%, respectively. The “out of sample” error is 16.53% and is in agreement with the OOB error:

mean(predict(ModRF, newdata = db.adult.oversampled) == db.adult.oversampled$income)
[1] 0.909204
mean(predict(ModRF, newdata = db.test) == db.test$income)
[1] 0.8346614
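
The “out of sample” error quoted above is simply one minus the accuracy on the test set:

# Out-of-sample (test set) misclassification error
1 - mean(predict(ModRF, newdata = db.test) == db.test$income)  # = 1 - 0.8347 = 0.1653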

Next we display an error plot of the random forest model:

plot(ModRF)

print(ModRF)

Call:
 randomForest(formula = form, data = db.adult.oversampled) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 14.13%
Confusion matrix:
        <=50K  >50K class.error
 <=50K  19577  3077   0.1358259
 >50K    2032 11483   0.1503515

The black solid line is the overall OOB error (i.e. averaged over all trees), the red dashed line is the prediction error for the first category of the variable “income” – " <=50K" (since we have a classification problem), and the green dotted line represents the prediction error for the second category of “income” – " >50K". This means that the red dashed line corresponds approximately to one minus the specificity, and the green dotted line to one minus the sensitivity. Indeed, if we display the confusion matrix and the performance statistics on the train set, we can see the sensitivity and specificity:

confusionMatrix(data = predict(ModRF, newdata = db.adult.oversampled),
                reference = db.adult.oversampled$income,
                positive = levels(db.test$income)[2])
Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K  20647  1276
     >50K    2007 12239
                                          
               Accuracy : 0.9092          
                 95% CI : (0.9062, 0.9122)
    No Information Rate : 0.6263          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.8082          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9056          
            Specificity : 0.9114          
         Pos Pred Value : 0.8591          
         Neg Pred Value : 0.9418          
             Prevalence : 0.3737          
         Detection Rate : 0.3384          
   Detection Prevalence : 0.3939          
      Balanced Accuracy : 0.9085          
                                          
       'Positive' Class :  >50K           
                                          

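Since “plot(ModRF)” draws the three error curves without a legend, below is a minimal sketch for labelling them and for locating the number of trees with the lowest cumulative OOB error (both use the “err.rate” matrix stored in the fitted object):

# Error plot with a legend for the OOB and per-class error curves
plot(ModRF)
legend("topright",
       legend = colnames(ModRF$err.rate),
       col = 1:3, lty = 1:3)

# Number of trees at which the cumulative OOB error is lowest
which.min(ModRF$err.rate[, "OOB"])
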
From the error plot, we see that the errors stop decreasing once the number of trees reaches roughly 100. Hence, we build a RF model with 200 trees to check if we can achieve the same accuracy with a computationally less expensive model:

set.seed(1234)

start_time <- proc.time()
ModRF.small <- randomForest(formula = form, 
                            data = db.adult.oversampled, 
                            ntree = 200)
time.rf.small.over <- proc.time() - start_time
time.rf.small.over
   user  system elapsed 
   8.15    0.33    8.50 

On our machine the computational time went from about 21.55s to 8.15s, which is 2.6 times faster.

print(ModRF.small)

Call:
 randomForest(formula = form, data = db.adult.oversampled, ntree = 200) 
               Type of random forest: classification
                     Number of trees: 200
No. of variables tried at each split: 3

        OOB estimate of  error rate: 14.28%
Confusion matrix:
        <=50K  >50K class.error
 <=50K  19611  3043   0.1343251
 >50K    2123 11392   0.1570847
mean(predict(ModRF.small, newdata = db.adult.oversampled) == db.adult.oversampled$income)
[1] 0.9069369
mean(predict(ModRF.small, newdata = db.test) == db.test$income)
[1] 0.8335325

From the code above, we see that the OOB error increased insignificantly from 14.13% to 14.28%, and the accuracy is approximately the same: 90.69% on the training data and 83.35% on the test dataset.

Next we show the confusion matrix on the test set:

stat.rf.over <- confusionMatrix(data = predict(ModRF.small, 
                                               newdata = db.test),
                                reference = db.test$income,
                                positive = levels(db.test$income)[2])

stat.rf.over
Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K   9818   964
     >50K    1542  2736
                                          
               Accuracy : 0.8336          
                 95% CI : (0.8276, 0.8395)
    No Information Rate : 0.7543          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5735          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.7395          
            Specificity : 0.8643          
         Pos Pred Value : 0.6396          
         Neg Pred Value : 0.9106          
             Prevalence : 0.2457          
         Detection Rate : 0.1817          
   Detection Prevalence : 0.2841          
      Balanced Accuracy : 0.8019          
                                          
       'Positive' Class :  >50K           
                                          

The sensitivity is 73.95% and the specificity is 86.43%.

7.2 Random Forest on the SMOTE train set

Below we build a random forest model on the SMOTE training set. We use the optimal number of trees which we determined to be 200 for the model trained on the over-sampled train dataset:

set.seed(1234)

covariates <- paste("age", "workclass", "education", 
                    "marital_status", "occupation", "relationship",
                    "race", "sex", "native_region", "hours_w",
                    "cap_gain", "cap_loss", sep = "+")

form <- as.formula(paste("income ~", covariates))

start_time <- proc.time()
ModRF <- randomForest(formula = form,
                      data = db.adult.smote,
                      ntree = 200)
time.rf.smote <- proc.time() - start_time
time.rf.smote
   user  system elapsed 
   7.78    0.33    8.19 

First we take a look at the performance of the model on the train set:

confusionMatrix(data = predict(ModRF, newdata = db.adult.smote),
                               reference = db.adult.smote$income,
                               positive = levels(db.test$income)[2])
Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K  17837  1288
     >50K     933 13728
                                          
               Accuracy : 0.9343          
                 95% CI : (0.9316, 0.9369)
    No Information Rate : 0.5556          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.8666          
 Mcnemar's Test P-Value : 5.844e-14       
                                          
            Sensitivity : 0.9142          
            Specificity : 0.9503          
         Pos Pred Value : 0.9364          
         Neg Pred Value : 0.9327          
             Prevalence : 0.4444          
         Detection Rate : 0.4063          
   Detection Prevalence : 0.4339          
      Balanced Accuracy : 0.9323          
                                          
       'Positive' Class :  >50K           
                                          

And we show the confusion matrix, as well as the accuracy, sensitivity and specificity on the test set:

stat.rf.smote <- confusionMatrix(data = predict(ModRF, 
                                                newdata = db.test),
                                 reference = db.test$income,
                                 positive = levels(db.test$income)[2])

stat.rf.smote
Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K   9729   976
     >50K    1631  2724
                                          
               Accuracy : 0.8269          
                 95% CI : (0.8208, 0.8329)
    No Information Rate : 0.7543          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5593          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.7362          
            Specificity : 0.8564          
         Pos Pred Value : 0.6255          
         Neg Pred Value : 0.9088          
             Prevalence : 0.2457          
         Detection Rate : 0.1809          
   Detection Prevalence : 0.2892          
      Balanced Accuracy : 0.7963          
                                          
       'Positive' Class :  >50K           
                                          

We observe that the accuracy and sensitivity for the SMOTE model are slightly lower than those for the over-sampled model.

8 Support Vector Machines

We fit a model with the Support Vector Machines (SVM) algorithm, which is implemented in the R package “e1071”. This is a popular method that usually has reasonable default performance and has parameters that allow for further tuning if necessary. It also allows for fitting both linearly and non-linearly separable data using linear or non-linear kernels, respectively. By default, the “svm” function in the “e1071” package scales the data. We train two types of models – one on the over-sampled train set and one on the SMOTE train set. We also use cross-validation and grid search to find the optimal model hyperparameters.

8.1 Linear kernel

8.1.1 Cross-validation for selecting the optimal model parameters

Our goal was to perform cross-validation with a grid search in order to determine the optimal model hyperparameters. In the case of a linear kernel, the hyperparameter whose value we wanted to tune was the so-called cost parameter \(C\). We wanted to use k-fold cross-validation with k=5. However, since the cross-validation algorithm required a lot of computational time, we decided to try another approach for selecting the best set of model parameters. Namely, we used a fixed split of the “db.adult.oversampled” dataset into a training and a validation set. But even with a fixed split, depending on which values of the parameters we use for the grid search, the computational time may still be very long, and for some values of the parameters the optimization algorithm does not converge. We also followed the recommendations from the paper “A Practical Guide to Support Vector Classification” by Hsu, Chang, and Lin (the authors of the LIBSVM library) to use exponentially growing sequences of C (and \(\gamma\) in the case of a radial kernel), such as \(C=2^{-5}, 2^{-4}, 2^{-3}, \dots, 1, 2^1, 2^2, 2^3, \dots\). We found that the algorithm has difficulties converging for values of C greater than \(2^6\). We also experimented with \(C=2^{-4}\) and obtained performance results (accuracy, sensitivity and specificity) almost identical to those for \(C=1, 2^2, 2^3, 2^4, 2^5\) and \(C=2^6\). Hence, after testing different values, we found no substantial deviation in the accuracy metrics, and the default value in R (cost = 1) gives good performance results while being the best one in terms of runtime and convergence. We build predictive models with both the over-sampled and the SMOTE training datasets; however, the models trained on the two training sets have approximately the same performance regarding accuracy, sensitivity and specificity.

First we show simulations with the over-sampled training set. We tried to run the following code, but we aborted it after it had been running for 1 hour and 30 minutes. We used the “train()” function from the “caret” package, constructed a grid of only 3 values for the cost parameter C and wanted to use 5-fold cross-validation to choose the optimal value of C:

covariates <- paste("age", "workclass", "education", 
                    "marital_status", "occupation", "relationship",
                    "race", "sex", "native_region", "hours_w",
                    "cap_gain", "cap_loss", sep = "+")

form <- as.formula(paste("income ~", covariates))

my.grid <- expand.grid(C = c(0.1, 0.5, 1)) 

set.seed(45112)

start_time <- proc.time()
Model.TunedSVM <- train(form = form,
                        data = db.adult.oversampled,
                        method = "svmLinear",
                        cross = 5,
                        tuneGrid = my.grid,
                        trControl = trainControl(method = "cv"),
                        trace = FALSE)
svm.tuned.time <- proc.time() - start_time
svm.tuned.time

Model.TunedSVM

Another way to do cross-validation is to use the “tune()” function from the R package “e1071”, as shown below:

covariates <- paste("age", "workclass", "education", 
                    "marital_status", "occupation", "relationship",
                    "race", "sex", "native_region", "hours_w",
                    "cap_gain", "cap_loss", sep = "+")

form <- as.formula(paste("income ~", covariates))

start_time <- proc.time()
Model.TunedSVM <- tune(method = svm, 
                       kernel = "linear",
                       train.x = form, 
                       data = db.adult.oversampled,
                       ranges = list(cost = 2^seq(4,10,2)), 
                       tunecontrol = tune.control(sampling = "cross",
                                                  cross = 5))
svm.tuned.time <- proc.time() - start_time
svm.tuned.time

However, we do not run the above code due to the extensive computational time.

We also tried to run the following code with a fixed split of the training set and with the suggested exponentially increasing values of the cost parameter, but the computations took a very long time and the algorithm also had difficulties converging:

set.seed(45112)

start_time <- proc.time()
Model.TunedSVM1 <- tune(method = svm, 
                        kernel = "linear",
                        train.x = form, 
                        data = db.adult.oversampled,
                        ranges = list(cost = 2^seq(4,8,2)), 
                        tunecontrol = tune.control(sampling = "fix"))
svm.tuned.time <- proc.time() - start_time
svm.tuned.time

Finally, we gave up both the 5-fold cross-validation and the fixed split of the training set and settled for a simpler approach. Namely, we built a model on the whole over-sampled training set taking C to be equal to each of the following values: \(C=2^{-4}, 1, 2^4, 2^5, 2^6\), and then we calculated the performance metrics of each of the models on the test dataset. Below we show the code that we used to generate the above mentioned models:

set.seed(45112)

C <- 2^(-4) # 1, 2^4, 2^5, 2^6

start_time <- proc.time()
ModSVM.lin <- svm(form, 
                  data = db.adult.oversampled,
                  cost = C,
                  kernel = "linear")
time.svm.lin <- proc.time() - start_time
time.svm.lin


stat.accuracy <- confusionMatrix(data = predict(ModSVM.lin, 
                                                newdata = db.test),
                                 reference = db.test$income,
                                 positive = levels(db.test$income)[2])

stat.accuracy
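
A minimal sketch of how this manual search can be organized as a loop over the candidate cost values, collecting the metrics that are summarized in the table below (the object names are illustrative, and running the loop takes a long time, as discussed above):

# Sketch: evaluate each candidate cost value on the test set
costs <- c(2^(-4), 1, 2^4, 2^5, 2^6)
svm.results <- data.frame()

for (C in costs) {
  mod <- svm(form, data = db.adult.oversampled, kernel = "linear", cost = C)
  cm  <- confusionMatrix(data = predict(mod, newdata = db.test),
                         reference = db.test$income,
                         positive = levels(db.test$income)[2])
  svm.results <- rbind(svm.results,
                       data.frame(cost        = C,
                                  accuracy    = cm$overall["Accuracy"],
                                  sensitivity = cm$byClass["Sensitivity"],
                                  specificity = cm$byClass["Specificity"]))
}

svm.results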

Next we show a table with the performance metrics of the SVM models trained on the over-sampled dataset “db.adult.oversampled” for different values of the cost parameter \(C\):

Performance on the test dataset of the SVM models trained on the over-sampled dataset:

           Accuracy  Sensitivity  Specificity  Runtime
C=2^(-4)     0.8316       0.7286       0.8651     130s
C=1          0.8315       0.7365       0.8625     159s
C=2^4        0.8311       0.7351       0.8624     560s
C=2^5        0.8311       0.7351       0.8624     980s
C=2^6        0.8321       0.7357       0.8635    1669s

We see that there is no significant deviation in the accuracy, sensitivity and specificity for the different values of the cost parameter, which controls the penalty for misclassification and thus the tendency to overfit. We choose the model with \(C=1\), which is also the model with the highest sensitivity. This model also has a reasonable computational time.

Since we chose the model with \(C=1\), for completeness we run below the svm algorithm with a linear kernel and cost set to 1 (which is also the default value for the “svm()” function) in order to build the said model:

features <- paste("age", "workclass", "education", 
                    "marital_status", "occupation", "relationship",
                    "race", "sex", "native_region", "hours_w",
                    "cap_gain", "cap_loss", sep = "+")

form <- as.formula(paste("income ~", features))

set.seed(45112)

start_time <- proc.time()
ModSVM.lin <- svm(form, 
                  data = db.adult.oversampled, 
                  kernel = "linear",
                  cost = 1)
time.svm.lin.over <- proc.time() - start_time
time.svm.lin.over
   user  system elapsed 
 149.95    0.04  150.28 

The computational time is approximately 150.28s.

Finally we train an svm model with the chosen model parameters (\(C=1\)) on the SMOTE training set:

# SMOTE model:

set.seed(45112)

start_time <- proc.time()
ModSVM.lin.smote <- svm(form, 
                        data = db.adult.smote, 
                        kernel = "linear",
                        cost = 1)
time.svm.lin.smote <- proc.time() - start_time
time.svm.lin.smote
   user  system elapsed 
 146.32    0.15  147.54 

8.1.2 Accuracy on the training set

The vector “predSVM.lin” contains the predictions on the over-sampled training dataset:

predSVM.lin <- predict(ModSVM.lin, newdata = db.adult.oversampled)

and the vector “predSVM.lin.smote” contains the predictions on the SMOTE training set:

# SMOTE model:

predSVM.lin.smote <- predict(ModSVM.lin.smote, newdata = db.adult.smote)

Below we show the confusion matrix and some other statistics for both the over-sampled and the SMOTE models on the respective training sets:

# Over-sampled:

stat.svmLin.over <- confusionMatrix(data = predSVM.lin, 
                                    reference = db.adult.oversampled$income,
                                    positive = levels(db.test$income)[2])


stat.svmLin.over
Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K  19594  3435
     >50K    3060 10080
                                          
               Accuracy : 0.8204          
                 95% CI : (0.8164, 0.8244)
    No Information Rate : 0.6263          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6142          
 Mcnemar's Test P-Value : 3.473e-06       
                                          
            Sensitivity : 0.7458          
            Specificity : 0.8649          
         Pos Pred Value : 0.7671          
         Neg Pred Value : 0.8508          
             Prevalence : 0.3737          
         Detection Rate : 0.2787          
   Detection Prevalence : 0.3633          
      Balanced Accuracy : 0.8054          
                                          
       'Positive' Class :  >50K           
                                          
# SMOTE:

stat.svmLin.smote <- confusionMatrix(data = predSVM.lin.smote, 
                                     reference = db.adult.smote$income,
                                     positive = levels(db.test$income)[2])

stat.svmLin.smote
Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K  16225  3117
     >50K    2545 11899
                                          
               Accuracy : 0.8324          
                 95% CI : (0.8284, 0.8364)
    No Information Rate : 0.5556          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6593          
 Mcnemar's Test P-Value : 3.238e-14       
                                          
            Sensitivity : 0.7924          
            Specificity : 0.8644          
         Pos Pred Value : 0.8238          
         Neg Pred Value : 0.8388          
             Prevalence : 0.4444          
         Detection Rate : 0.3522          
   Detection Prevalence : 0.4275          
      Balanced Accuracy : 0.8284          
                                          
       'Positive' Class :  >50K           
                                          

We observe that the accuracy of the SMOTE model is higher than that of the over-sampled model by about 1.2 percentage points. The sensitivity of the SMOTE model, 79.24%, is also considerably higher than that of the over-sampled model, which is 74.58%.

8.1.3 Accuracy on the test set

The vector “predSVM.lin” contains the predictions (of the model built on the over-sampled train set) on the test data set:

predSVM.lin <- predict(ModSVM.lin, newdata = db.test)

Contrary to our expectations, the accuracy on the test set, 83.22%, is higher than the accuracy on the training set, which was 82.04%.

mean(predSVM.lin == db.test$income)
[1] 0.8322045

We also display the respective confusion matrix:

stat.svmLin.over <- confusionMatrix(data = predSVM.lin, 
                                    reference = db.test$income,
                                    positive = levels(db.test$income)[2])

stat.svmLin.over
Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K   9810   977
     >50K    1550  2723
                                          
               Accuracy : 0.8322          
                 95% CI : (0.8261, 0.8381)
    No Information Rate : 0.7543          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5698          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.7359          
            Specificity : 0.8636          
         Pos Pred Value : 0.6373          
         Neg Pred Value : 0.9094          
             Prevalence : 0.2457          
         Detection Rate : 0.1808          
   Detection Prevalence : 0.2837          
      Balanced Accuracy : 0.7998          
                                          
       'Positive' Class :  >50K           
                                          

The sensitivity is 73.59% and the specificity is 86.36%. Next we show the performance metrics of the SMOTE model:

# SMOTE model:

predSVM.lin.smote <- predict(ModSVM.lin.smote, 
                                  newdata = db.test)

stat.svmLin.smote <- confusionMatrix(data = predSVM.lin.smote, 
                                     reference = db.test$income,
                                     positive = levels(db.test$income)[2])

stat.svmLin.smote
Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K   9792  1060
     >50K    1568  2640
                                          
               Accuracy : 0.8255          
                 95% CI : (0.8193, 0.8315)
    No Information Rate : 0.7543          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.55            
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.7135          
            Specificity : 0.8620          
         Pos Pred Value : 0.6274          
         Neg Pred Value : 0.9023          
             Prevalence : 0.2457          
         Detection Rate : 0.1753          
   Detection Prevalence : 0.2794          
      Balanced Accuracy : 0.7877          
                                          
       'Positive' Class :  >50K           
                                          

In comparison, the sensitivity and the specificity of the SMOTE model are 71.35% and 86.2%, respectively; both are lower than those of the over-sampled model. The accuracy, 82.55%, is also lower than that of the over-sampled model. Thus, we conclude that the SVM model trained on the dataset obtained with our simple over-sampling methodology performs better on the test set than the model trained on the balanced set generated by the far more sophisticated SMOTE algorithm.

8.1.4 Summary

Based on our analysis, we found that the optimal cost parameter is \(C=1\). Below we compare the performance for \(C=1\) of the model trained on the over-sampled dataset with that of the corresponding model trained on the SMOTE dataset:

Performance on the test dataset of the SVM models trained on the over-sampled and on the SMOTE datasets

  Model           Accuracy   Sensitivity   Specificity   Runtime
  Over-sampled   0.8322045     0.7359459     0.8635563      150s
  SMOTE          0.8254980     0.7135135     0.8619718      148s

We note that the accuracy and the sensitivity are higher for the over-sampled model. The specificity is approximately the same for both models, and the training time of the over-sampled model is about 3s longer than that of the SMOTE model.
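For reference, the comparison table above can be assembled directly in R from the objects computed earlier. Below is a minimal sketch; it assumes that “stat.svmLin.over” and “stat.svmLin.smote” hold the test-set confusion matrices computed above and that the timing vectors “time.svm.lin.over” and “time.svm.lin.smote” are still in the workspace:

# Collect accuracy, sensitivity, specificity and elapsed time in one data frame
svm.lin.comparison <- data.frame(
  Model       = c("Over-sampled", "SMOTE"),
  Accuracy    = c(stat.svmLin.over$overall["Accuracy"],
                  stat.svmLin.smote$overall["Accuracy"]),
  Sensitivity = c(stat.svmLin.over$byClass["Sensitivity"],
                  stat.svmLin.smote$byClass["Sensitivity"]),
  Specificity = c(stat.svmLin.over$byClass["Specificity"],
                  stat.svmLin.smote$byClass["Specificity"]),
  Runtime     = c(as.numeric(time.svm.lin.over["elapsed"]),
                  as.numeric(time.svm.lin.smote["elapsed"])))

svm.lin.comparison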

8.2 Radial kernel

We also fit an SVM model with the default radial kernel, which is suitable for data that are not linearly separable. First we build a model on the over-sampled dataset:

set.seed(123)

start_time <- proc.time()
ModSVM <- svm(form, 
              data = db.adult.oversampled)
time.svm.rad.over <- proc.time() - start_time
time.svm.rad.over
   user  system elapsed 
 263.53    0.11  264.88 

We use the default parameters of the “svm()” function, which are \(C=1\) and \(\gamma=\dfrac{1}{d}\), where \(d\) is the number of columns of the model matrix (the data dimension after the categorical predictors are expanded into dummy variables).
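As a quick check (not part of the analysis itself), the parameter values actually used can be inspected on the fitted object; a minimal sketch:

# Printing the model shows the kernel, cost and gamma that were used
ModSVM

# The same values are stored as components of the fitted object
ModSVM$cost
ModSVM$gamma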

In this case the computational time is approximately 264.88s.

We could also try cross-validation to find the optimal values of the model parameters \(C\) (cost) and \(\gamma\) (kernel width):

set.seed(45112)

start_time <- proc.time()
Model.TunedSVM <- tune(method = svm, 
                       train.x = form, 
                       data = db.adult.oversampled,
                       ranges = list(gamma = 2^seq(-9,-3,2),
                                     cost = 2^seq(4,10,2)), 
                       tunecontrol = tune.control(sampling = "fix"))
svm.tuned.time <- proc.time() - start_time
svm.tuned.time

However, due to the very long computational time, we do not run the code shown above, and hence we do not perform this cross-validation.
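A possible compromise, which we do not pursue here, would be to run the same grid search on a random subsample of the over-sampled training set in order to get a rough idea of suitable values of \(C\) and \(\gamma\). A minimal sketch is shown below; the subsample size of 5000 observations is an arbitrary assumption:

set.seed(45112)

# Grid search for gamma and cost on a random subsample of 5000 observations
idx <- sample(nrow(db.adult.oversampled), 5000)

Model.TunedSVM.sub <- tune(method = svm,
                           train.x = form,
                           data = db.adult.oversampled[idx, ],
                           ranges = list(gamma = 2^seq(-9, -3, 2),
                                         cost = 2^seq(4, 10, 2)),
                           tunecontrol = tune.control(sampling = "fix"))

# Best combination of gamma and cost found on the subsample
Model.TunedSVM.sub$best.parameters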

We also fit a model on the SMOTE training set:

# SMOTE model:

set.seed(123)

start_time <- proc.time()
ModSVM.smote <- svm(form, 
                    data = db.adult.smote)
time.svm.rad.smote <- proc.time() - start_time
time.svm.rad.smote
   user  system elapsed 
 211.52    0.09  212.20 

8.2.1 Accuracy on the training set

The vector “predSVM” contains the predictions on the over-sampled training dataset:

predSVM <- predict(ModSVM, newdata = db.adult.oversampled)

The confusion matrix is given below:

stat.svmRad.over <- confusionMatrix(data = predSVM, 
                                    reference = db.adult.oversampled$income,
                                    positive = levels(db.adult$income)[2])

stat.svmRad.over
Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K  19735  3475
     >50K    2919 10040
                                          
               Accuracy : 0.8232          
                 95% CI : (0.8192, 0.8271)
    No Information Rate : 0.6263          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.6192          
 Mcnemar's Test P-Value : 3.9e-12         
                                          
            Sensitivity : 0.7429          
            Specificity : 0.8711          
         Pos Pred Value : 0.7748          
         Neg Pred Value : 0.8503          
             Prevalence : 0.3737          
         Detection Rate : 0.2776          
   Detection Prevalence : 0.3583          
      Balanced Accuracy : 0.8070          
                                          
       'Positive' Class :  >50K           
                                          

We observe that the “in sample” accuracy is a little better than for the linear kernel – 82.32% versus 82.04% – but the computational time increased from 150.28s to 264.88s, i.e. it is roughly 1.8 times as long. The sensitivity is 74.29% and the specificity is 87.11%; compared to the linear kernel, the specificity is a little higher, whereas the sensitivity is marginally lower (74.58% for the linear kernel).

We also calculate the performance metrics for the SMOTE model:

# SMOTE:

predSVM.smote <- predict(ModSVM.smote, newdata = db.adult.smote)

stat.svmRad.smote <- confusionMatrix(data = predSVM.smote, 
                                     reference = db.adult.smote$income,
                                     positive = levels(db.adult$income)[2])

stat.svmRad.smote
Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K  16502  2905
     >50K    2268 12111
                                         
               Accuracy : 0.8469         
                 95% CI : (0.843, 0.8507)
    No Information Rate : 0.5556         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.6886         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
            Sensitivity : 0.8065         
            Specificity : 0.8792         
         Pos Pred Value : 0.8423         
         Neg Pred Value : 0.8503         
             Prevalence : 0.4444         
         Detection Rate : 0.3585         
   Detection Prevalence : 0.4256         
      Balanced Accuracy : 0.8429         
                                         
       'Positive' Class :  >50K          
                                         

We observe that the accuracy, the sensitivity and the specificity are all higher than those of the over-sampled model.

8.2.2 Accuracy on the test set

Finally we check the model accuracy on the test data. The confusion matrix, as well as the accuracy, sensitivity and specificity are shown below:

stat.svmRad.over <- confusionMatrix(data = predict(ModSVM, 
                                                   newdata = db.test), 
                                    reference = db.test$income,
                                    positive = levels(db.test$income)[2])

stat.svmRad.over
Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K   9875   985
     >50K    1485  2715
                                        
               Accuracy : 0.836         
                 95% CI : (0.83, 0.8419)
    No Information Rate : 0.7543        
    P-Value [Acc > NIR] : < 2.2e-16     
                                        
                  Kappa : 0.5768        
 Mcnemar's Test P-Value : < 2.2e-16     
                                        
            Sensitivity : 0.7338        
            Specificity : 0.8693        
         Pos Pred Value : 0.6464        
         Neg Pred Value : 0.9093        
             Prevalence : 0.2457        
         Detection Rate : 0.1803        
   Detection Prevalence : 0.2789        
      Balanced Accuracy : 0.8015        
                                        
       'Positive' Class :  >50K         
                                        

The prediction accuracy on the test dataset is 83.6%, slightly better than the 83.22% obtained with the linear kernel.

The sensitivity is 73.38% and is approximately equal to that for the linear kernel – 73.59% – whereas the specificity is 86.93% and is a little higher than the 86.36% obtained with the linear kernel. Since we are interested in the percentage of correctly predicted outcomes with income >50K, we want a high sensitivity; in that respect the linear kernel is marginally better, although the difference is negligible, while the radial kernel gives slightly better accuracy and specificity.

We also give the performance metrics of the SMOTE model on the test set:

# SMOTE:

stat.svmRad.smote <- confusionMatrix(data = predict(ModSVM.smote, newdata = db.test), 
                                     reference = db.test$income,
                                     positive = levels(db.test$income)[2])

stat.svmRad.smote
Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K   9920  1081
     >50K    1440  2619
                                          
               Accuracy : 0.8326          
                 95% CI : (0.8265, 0.8385)
    No Information Rate : 0.7543          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5627          
 Mcnemar's Test P-Value : 1.003e-12       
                                          
            Sensitivity : 0.7078          
            Specificity : 0.8732          
         Pos Pred Value : 0.6452          
         Neg Pred Value : 0.9017          
             Prevalence : 0.2457          
         Detection Rate : 0.1739          
   Detection Prevalence : 0.2695          
      Balanced Accuracy : 0.7905          
                                          
       'Positive' Class :  >50K           
                                          

The accuracy of the SMOTE model is approximately the same as that of the over-sampled model, whereas the sensitivity is lower.

9 Neural Networks

We use the “nnet” package, which implements single-hidden-layer neural networks. We use cross-validation to determine the optimal number of units in the hidden layer and the optimal decay parameter. The “nnet” package does not allow for tuning of a learning rate hyperparameter, since it fits the weights with a quasi-Newton (BFGS) optimiser rather than by gradient descent.

9.1 Building a predictive model on the over-sampled train set

As an initial attempt, we test a grid with 5 and 10 hidden neurons and decay values of 0.001 and 0.1, using 10-fold cross-validation:

covariates <- paste("age", "workclass", "education",
                    "marital_status", "occupation", "relationship",
                    "race", "sex", "native_region", "hours_w",
                    "cap_gain", "cap_loss", sep = "+")


form <- as.formula(paste("income ~", covariates))

my.grid <- expand.grid(size = c(5, 10), decay = c(0.001, 0.1))

nnet.cv <- train(form = form,
                 data = db.adult.oversampled,
                 method = "nnet",
                 metric = "Accuracy",
                 tuneGrid = my.grid,
                 trControl = trainControl(method = "cv", number = 10),
                 trace = FALSE)

nnet.cv

The output of the “train()” function is shown below:

Resampling results across tuning parameters:

  size   decay    Accuracy       Kappa
     5   0.001   0.7829162   0.5315457
     5   0.100   0.8248827   0.6264014
    10   0.001   0.8236105   0.6235264
    10   0.100   0.8272878   0.6312372

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were size = 10 and decay = 0.1.

We see that the best model corresponds to “size = 10” and “decay = 0.1”. A decay of 0.1 gives better accuracy than 0.001 for both 5 and 10 hidden neurons, and the accuracy also increases slightly with the number of hidden neurons. Therefore, we perform cross-validation again with more neurons and larger values of the decay parameter:

my.grid1 <- expand.grid(size = c(8,10, 12), decay = c(0.1, 1, 2))

trCtrl <- trainControl(method = "cv", number = 5)

nnet.cv1 <- train(form = form,
                  data = db.adult.oversampled,
                  method = "nnet",
                  metric = "Accuracy",
                  tuneGrid = my.grid1,
                  trControl = trCtrl,
                  trace = FALSE)

nnet.cv1

The respective output is given in the table below:

Resampling results across tuning parameters:

  size   decay    Accuracy       Kappa
     8     0.1   0.8275364   0.6318514
     8     1.0   0.7883050   0.5071174
     8     2.0   0.8275364   0.6311347
    10     0.1   0.8250758   0.6249785
    10     1.0   0.8282552   0.6324913
    10     2.0   0.8272599   0.6300738
    12     0.1   0.8263200   0.6291617
    12     1.0   0.8272599   0.6313593
    12     2.0   0.8290569   0.6341225

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were size = 12 and decay = 2.

We see that the best accuracy is achieved for 12 neurons and decay = 2. Therefore, we choose these values as the optimal ones and fit the corresponding neural network model, increasing the maximum number of iterations to 500:

covariates <- paste("age", "workclass", "education",
                    "marital_status", "occupation", "relationship",
                    "race", "sex", "native_region", "hours_w",
                    "cap_gain", "cap_loss", sep = "+")


form <- as.formula(paste("income ~", covariates))

start_time <- proc.time()
neural.fit <- nnet(formula = form,
                   data = db.adult.oversampled,
                   size = 12,
                   decay = 2,
                   maxit = 500,
                   trace = FALSE)
time.nnet.over <- proc.time() - start_time
time.nnet.over
   user  system elapsed 
 162.33    0.07  164.28 

The prediction accuracy on the test data is given below:

predicted.test <- predict(neural.fit, 
                          newdata = db.test,
                          type = "class")
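Note that with type = "class" the predictions are returned as a character vector. Depending on the version of the “caret” package, “confusionMatrix()” may require factor input, in which case the predictions can be converted first (a minimal sketch, only needed if the call below complains about non-factor data):

# Convert the character predictions to a factor with the same levels as the reference
predicted.test <- factor(predicted.test, levels = levels(db.test$income))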

stat.nnet.over <- confusionMatrix(data = predicted.test, 
                                  reference = db.test$income,
                                  positive = levels(db.test$income)[2])

stat.nnet.over
Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K   9793   880
     >50K    1567  2820
                                          
               Accuracy : 0.8375          
                 95% CI : (0.8315, 0.8434)
    No Information Rate : 0.7543          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5874          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.7622          
            Specificity : 0.8621          
         Pos Pred Value : 0.6428          
         Neg Pred Value : 0.9175          
             Prevalence : 0.2457          
         Detection Rate : 0.1873          
   Detection Prevalence : 0.2913          
      Balanced Accuracy : 0.8121          
                                          
       'Positive' Class :  >50K           
                                          

We see that the accuracy is 83.75%, the sensitivity is 76.22% and the specificity is 86.21%.

9.2 Building a predictive model on the SMOTE train set

We fit a neural network model on the SMOTE training dataset, using the optimal hyperparameter values that we determined for the model trained on the over-sampled set:

# SMOTE model:

start_time <- proc.time()
neural.fit.smote <- nnet(formula = form,
                         data = db.adult.smote,
                         size = 12,
                         decay = 2,
                         maxit = 500,
                         trace = FALSE)
time.nnet.smote <- proc.time() - start_time
time.nnet.smote
   user  system elapsed 
 158.64    0.04  160.11 

The performance of the model on the test set is shown below:

predicted.test.smote <- predict(neural.fit.smote, 
                                newdata = db.test,
                                type = "class")

stat.nnet.smote <- confusionMatrix(data = predicted.test.smote, 
                                   reference = db.test$income,
                                   positive = levels(db.test$income)[2])

stat.nnet.smote
Confusion Matrix and Statistics

          Reference
Prediction  <=50K  >50K
     <=50K   9972  1077
     >50K    1388  2623
                                          
               Accuracy : 0.8363          
                 95% CI : (0.8303, 0.8422)
    No Information Rate : 0.7543          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5706          
 Mcnemar's Test P-Value : 4.269e-10       
                                          
            Sensitivity : 0.7089          
            Specificity : 0.8778          
         Pos Pred Value : 0.6540          
         Neg Pred Value : 0.9025          
             Prevalence : 0.2457          
         Detection Rate : 0.1742          
   Detection Prevalence : 0.2663          
      Balanced Accuracy : 0.7934          
                                          
       'Positive' Class :  >50K           
                                          

10 Summary

We built several classification models to predict whether a random U.S. citizen earns more than 50K a year. We used logistic regression, random forest, support vector machines and neural networks. In the table below we summarize the accuracy, sensitivity, specificity and computational time on the test dataset for all fitted models. We show results for models trained on the over-sampled and on the SMOTE datasets.

The model with the highest accuracy and sensitivity is the neural networks model built on the over-sampled train set – 83.75% and 76.22%, respectively.

The models with the next best accuracy and sensitivity are the SVM model with radial kernel trained on the over-sampled dataset and the logistic regression model trained on the over-sampled set, with accuracies of 83.6% and 83.53%, respectively. However, the SVM model with the radial kernel has the longest runtime – 264.88s – whereas the logistic regression model has the shortest – only 3.48s for the over-sampled set and 3.29s for the SMOTE dataset.

In terms of computational time, the best model is therefore the logistic regression trained on the SMOTE dataset. The random forest model also offers good prediction accuracy and sensitivity, and its runtime – 8.5s for the over-sampled model – is close to that of the logistic regression (3.48s) and far shorter than those of the SVM and neural network models.

We also observe that the SMOTE algorithm does not lead to better accuracy and sensitivity compared to the much simpler over-sampling technique that we used.

In conclusion, after accounting for all performance indicators, the logistic regression model and the neural networks model appear to be the most suitable choices for the considered classification problem.

Performance on the test dataset of all fitted models

  Model                                Accuracy   Sensitivity   Specificity   Runtime (s)
  Logistic Regression over-sampled    0.8353254     0.7327027     0.8687500          3.48
  Logistic Regression SMOTE           0.8250332     0.7181081     0.8598592          3.29
  Random Forest over-sampled          0.8335989     0.7394595     0.8642606          8.50
  Random Forest SMOTE                 0.8268924     0.7362162     0.8564261          8.19
  SVM linear kernel over-sampled      0.8322045     0.7359459     0.8635563        150.28
  SVM linear kernel SMOTE             0.8254980     0.7135135     0.8619718        147.54
  SVM radial kernel over-sampled      0.8359894     0.7337838     0.8692782        264.88
  SVM radial kernel SMOTE             0.8326029     0.7078378     0.8732394        212.20
  Neural Networks over-sampled        0.8375166     0.7621622     0.8620599        164.28
  Neural Networks SMOTE               0.8363214     0.7089189     0.8778169        160.11

11 References

  1. Agresti, A., An Introduction to Categorical Data Analysis, Second Edition, 2007

  2. Dobson, A. J., Barnett, A. G., An Introduction to Generalized Linear Models, Third Edition, 2008

  3. Hosmer, D., Lemeshow, S., Sturdivant, R., Applied Logistic Regression, Third Edition, 2013

  4. Agresti, A., Categorical Data Analysis, Third Edition, 2013

  5. McCullagh, P., Nelder, J. A., Generalized Linear Models, Second Edition, 1989

  6. Chawla, N. V., Bowyer, K. W., Hall, L. O., Kegelmeyer, W. P., “SMOTE: Synthetic Minority Over-sampling Technique”, Journal of Artificial Intelligence Research, Volume 16, pages 321-357, 2002