This is the third part of the Census Data project. In this project we analyze U.S. census data taken from the UCI (University of California, Irvine) Machine Learning Repository. The project is divided into four parts. Our final goal is to build a model which can predict whether the income of a random adult American citizen is less than or greater than $50,000 a year, based on features such as age, education, occupation, gender, race, etc. In the first part of the project – Cleaning and Preprocessing the Data – we introduced the U.S. census dataset, cleaned the data and transformed it into a form suitable for further analysis. In the second part of the project – Exploratory Data Analysis – we performed detailed exploratory data analysis and investigated the relationship between the independent variables and the dependent variable “income”. This part of the project is devoted to building predictive models. We use logistic regression, random forests, support vector machines and neural networks. The last part of the project – Theoretical Background – gives theoretical details about logistic regression and about methods for assessing the goodness of fit of the predictive models.
Data analysis techniques used in this part of the project:
Investigation of the predictor variables for collinearity using:
the R function “findLinearCombos()” from the package “caret” to identify linearly dependent column vectors in the design matrix
Goodman and Kruskal’s tau measure
the Generalized Variance Inflation Factor (GVIF)
Building four predictive models using the following classification algorithms:
Logistic regression
Random forest
Support vector machines
Neural networks
Cross-validation and grid search for finding the optimal model parameters (cost parameter and kernel width) for the SVM model
Cross-validation and grid search for finding the optimal model parameters (number of hidden neurons, learning rate, weight decay parameter) for the NN model
Finding the optimal number of trees based on error plots of the RF model
Implementation of two techniques for balancing the imbalanced dataset in order to achieve higher sensitivity:
Over-sampling
Synthetic Minority Over-sampling Technique (SMOTE)
Assessing the goodness of fit of the logistic regression model using the deviance, a likelihood ratio test against the intercept-only model, and the Hosmer-Lemeshow test
Assessing the significance of the explanatory variables
Stepwise selection of covariates for the logistic regression model using:
the R function “drop1()”
the R function “step()”
both functions implement backward selection of predictors based on the likelihood ratio test (LRT)
Analysis of the prediction accuracy, sensitivity, specificity and computational time of the fitted models (both on the training and on the test datasets)
We start by loading the necessary packages:
library(ggplot2)
library(scales)
library(plyr)
library(vcd)
library(ggthemes)
library(caret)
library(GoodmanKruskal)
library(ResourceSelection)
library(randomForest)
library(e1071)
library(nnet)
library(DMwR)
In the first part of the project – Cleaning and Preprocessing the Data – we cleaned and preprocessed the census data and wrote the final data frames into the comma-separated-value files “adult_df.csv” and “test_df.csv”. Hence, we read the train and test datasets into the “db.adult” and “db.test” data frames, respectively:
setwd("D:/Data_Science_Projects/Census_DataSet")
db.adult <- read.csv("adult_df.csv")
db.test <- read.csv("test_df.csv")
We are interested in predicting the values of the variable “income”. Therefore “income” is our response variable, or dependent variable. It assumes only two values - less than 50K a year and more than 50K a year - so the problem we consider is a classification one. Let \(Y_i\) be the random variable “income of the i-th subject”. Let also \(Y_i=1\) if income>50K and \(Y_i=0\) if income<=50K. Then, \(Y_i\) (\(i=1,2,\dots, n\), where \(n=30162\) is the number of observations in the data frame) follows the binomial distribution. More precisely, \(Y_i\) follows the Bernoulli distribution, which is the special case of the binomial distribution with a single trial. Since the response variable is binary, we start with a logistic regression model, which is a type of generalized linear model. For more information on logistic regression – go to the “2 Logistic regression” section of the Theoretical Background part of the project.
The R function that we use to build the logistic regression model is “glm” from the standard R installation (the “stats” package). As the R documentation states, the function fits generalized linear models. Since we want to fit a logistic regression model, in “glm” we set “family” to “binomial” and “link” (the link function) to “logit”. We start with a set of explanatory variables (see below) which we will call “full”. The set consists of all default variables, except “fnlwgt”, “hours_per_week”, “native_country”, “capital_gain” and “capital_loss”. In our analysis we do not take into account the variable “fnlwgt”. We also substitute the variable “native_country” with “native_region”, and “capital_gain” and “capital_loss” are replaced by “cap_gain” and “cap_loss”, respectively.
names(db.adult)
 [1] "age" "workclass" "fnlwgt" "education"
[5] "education_num" "marital_status" "occupation" "relationship"
[9] "race" "sex" "capital_gain" "capital_loss"
[13] "hours_per_week" "native_country" "income" "hours_w"
[17] "native_region" "cap_gain" "cap_loss"
covariates <- paste("age", "workclass", "education", "education_num",
"marital_status", "occupation", "relationship",
"race", "sex", "native_region", "hours_w",
"cap_gain", "cap_loss", sep = "+")
form <- as.formula(paste("income ~", covariates))
start_time <- proc.time()
glm.model <- glm(formula = form,
data = db.adult,
family = binomial(link = "logit"),
x = TRUE)
# The option "x=TRUE" returns the design matrix
time.logistic <- proc.time() - start_time
time.logistic
 user system elapsed
4.49 0.15 4.67
After fitting the model, we need to check the predictors for collinearity, assess the model's goodness of fit, and evaluate the significance of the explanatory variables.
We investigate if there are collinear predictor variables. We begin with a summary of the fitted logistic model:
summary(glm.model)$coefficients[, 1:2]
 Estimate Std. Error
(Intercept) 0.902289478 6.769089e-01
age 0.026438419 1.703372e-03
workclass Local-gov -0.639700747 1.122277e-01
workclass Private -0.452837780 9.309886e-02
workclass Self-emp-inc -0.235136463 1.233519e-01
workclass Self-emp-not-inc -0.872465363 1.099112e-01
workclass State-gov -0.753259323 1.250307e-01
workclass Without-pay -13.180257804 1.993433e+02
education 11th 0.103269080 2.137006e-01
education 12th 0.444262173 2.736754e-01
education 1st-4th -0.459467478 4.798216e-01
education 5th-6th -0.419230130 3.521554e-01
education 7th-8th -0.504352776 2.420578e-01
education 9th -0.278287394 2.691488e-01
education Assoc-acdm 1.286519936 1.802206e-01
education Assoc-voc 1.262132954 1.731718e-01
education Bachelors 1.902761826 1.612147e-01
education Doctorate 2.967218072 2.231634e-01
education HS-grad 0.764449675 1.567956e-01
education Masters 2.260928202 1.721559e-01
education Preschool -12.504022385 9.784995e+01
education Prof-school 2.900937791 2.067525e-01
education Some-college 1.124568866 1.590551e-01
marital_status Married-AF-spouse 2.906368163 5.787119e-01
marital_status Married-civ-spouse 2.164167535 2.735700e-01
marital_status Married-spouse-absent 0.009494187 2.365025e-01
marital_status Never-married -0.442439757 8.766115e-02
marital_status Separated -0.073529884 1.640838e-01
marital_status Widowed 0.213420666 1.567942e-01
occupation Armed-Forces -1.381287295 1.593288e+00
occupation Craft-repair 0.037115382 8.042603e-02
occupation Exec-managerial 0.769219849 7.746537e-02
occupation Farming-fishing -0.867170091 1.380995e-01
occupation Handlers-cleaners -0.717075233 1.442101e-01
occupation Machine-op-inspct -0.314569283 1.027973e-01
occupation Other-service -0.790992657 1.181567e-01
occupation Priv-house-serv -3.194481795 1.316745e+00
occupation Prof-specialty 0.502766131 8.200143e-02
occupation Protective-serv 0.593387739 1.255901e-01
occupation Sales 0.258121406 8.301447e-02
occupation Tech-support 0.659672887 1.114897e-01
occupation Transport-moving -0.070477875 9.947379e-02
relationship Not-in-family 0.582873644 2.704928e-01
relationship Other-relative -0.283364787 2.452199e-01
relationship Own-child -0.615249793 2.697838e-01
relationship Unmarried 0.438948372 2.860101e-01
relationship Wife 1.380705410 1.050376e-01
race Asian-Pac-Islander 0.716848339 2.749136e-01
race Black 0.530568451 2.408688e-01
race Other 0.186069663 3.703154e-01
race White 0.631748991 2.303928e-01
sex Male 0.839522945 7.951133e-02
native_region Central-Asia -0.044740001 2.883293e-01
native_region East-Asia 0.073407108 2.615591e-01
native_region Europe-East 0.369271777 3.352623e-01
native_region Europe-West 0.574603030 1.932003e-01
native_region Outlying-US 0.320962994 2.251356e-01
native_region South-America -0.981654260 4.692376e-01
native_region United-States 0.420610201 1.342609e-01
hours_w between_45_and_60 0.438435381 4.369345e-02
hours_w between_60_and_80 0.410845207 9.821402e-02
hours_w less_than_40 -0.804580132 6.216572e-02
hours_w more_than_80 0.273986107 1.936161e-01
cap_gain Low -6.656754399 5.092054e-01
cap_gain Medium -4.779587229 5.138468e-01
cap_loss Low -0.789675036 1.489671e-01
cap_loss Medium 0.808078772 1.791307e-01
From the output of “summary(glm.model)” we notice the following warning message: “Coefficients: (1 not defined because of singularities)”. Also, in the coefficients table the coefficient for the variable “education_num” is NA, i.e. with a missing value. As we explain in detail in the section “2.5 Iterative methods for solving systems of non-linear equations and the issue of collinearity” of the Theoretical Background part of the project, this is an indication that the covariate “education_num” is collinear with some other predictors. Therefore, we have to exclude it from the list of predictor variables and fit the model again. We will use the R function “findLinearCombos” from the package “caret” to test whether the covariate “education_num” is collinear with some of the other predictors. As the R documentation states, the function determines linear combinations in a numeric matrix. Furthermore, it returns a list that enumerates these dependencies and a vector of column positions that can be removed to eliminate the linear dependencies:
findLinearCombos(glm.model$x)
$linearCombos
$linearCombos[[1]]
 [1] 24 1 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

$remove
[1] 24

findLinearCombos(glm.model$x)$remove
[1] 24
As we can see from the output above, R found linear dependencies among the covariates and recommends removing column 24 from the design matrix. Below we identify which predictor corresponds to column 24:
colnames(glm.model$x)[findLinearCombos(glm.model$x)$remove]
[1] "education_num"
Indeed, it turns out that the problematic predictor is “education_num”. As we can see from the data frame “db.adult”, the variable “education_num” provides redundant information, equivalent to that contained in “education”. From the content of “education_num” and “education” we can also see that the two variables are linearly dependent, i.e. collinear. Below we list the unique combinations of values for the variables “education” and “education_num”. We do this with the help of the R function “unique” from the “base” package. As the documentation reads, for a data frame the function “unique” returns a “data frame with the same columns but with duplicate rows removed (and with row names from the first occurrences of the unique rows)”. The variable “education” has a total of 16 factor levels and we see that to each level of “education” corresponds a number from 1 to 16 from the variable “education_num”:
unique.combinations <- unique(db.adult[ ,c("education", "education_num")])
unique.combinations[order(unique.combinations$education_num), ]
 education education_num
209 Preschool 1
387 1st-4th 2
53 5th-6th 3
15 7th-8th 4
7 9th 5
205 10th 6
4 11th 7
386 12th 8
3 HS-grad 9
11 Some-college 10
46 Assoc-voc 11
14 Assoc-acdm 12
1 Bachelors 13
6 Masters 14
49 Prof-school 15
20 Doctorate 16
Therefore we remove the covariate “education_num” and fit a new model - “glm.model.wld” (“wld” stands for “without linear dependencies”), thus resolving the problem with linearly dependent predictors:
covariates_new <- paste("age", "workclass", "education","marital_status",
"occupation", "relationship", "race", "sex",
"native_region", "hours_w", "cap_gain",
"cap_loss", sep = "+")
form_new <- as.formula(paste("income ~", covariates_new))
start_time <- proc.time()
glm.model.wld <- glm(formula = form_new,
data = db.adult,
family = binomial(link = "logit"),
x = TRUE,
y = TRUE)
time.logistic <- proc.time() - start_time
time.logistic
 user system elapsed
2.65 0.16 2.83
As we can see from the test below, there are no linear dependencies between the covariates anymore:
findLinearCombos(glm.model.wld$x)
$linearCombos
list()

$remove
NULL
1. Goodman and Kruskal’s tau measure
Another way to see whether there are any correlations between the covariates is to calculate Goodman and Kruskal’s tau measure for all pairs of covariates. Goodman and Kruskal’s tau measures the strength of association between categorical variables. It also works for discrete numerical variables by treating each variable value as a separate level of a factor variable. The standard association measure for numerical variables is the Pearson correlation coefficient, but it is not applicable to categorical variables.
In our case, the Goodman and Kruskal’s tau measure will give us the strength of the association between predictors two by two. If there are 3 or more mutually dependent predictors, this coefficient will not help us identify them. In order to calculate Goodman and Kruskal’s tau for all pairs of predictor variables, we use the function “GKtauDataframe” from the R package “GoodmanKruskal”. The tau measure ranges from 0 to 1. Values closer to zero indicate weak association, whereas values closer to 1 indicate strong association. The tau measure is not symmetric. This means that, if A and B are two categorical variables, then \[ \tau(A,B)\neq \tau(B,A). \] The most popular association measures for categorical variables are the chi-squared statistic and Cramer’s V, which can be calculated with the “assocstats” function in the “vcd” package. However, the “assocstats” function is not adapted to data frames. On the other hand, the above-mentioned function “GKtauDataframe” is designed specifically for data frames and it makes the calculation of Goodman and Kruskal’s tau for all pairs of predictors very easy and straightforward. Furthermore, “GKtauDataframe” returns an object of class “GKtauMatrix”, for which a “plot” method is available, visualizing in a convenient way the Goodman and Kruskal’s tau for all pairs of predictors. As we can see below, the plot is in the form of a matrix. The numbers on the diagonal are equal to the number of levels of each categorical variable, while the off-diagonal elements give the Goodman-Kruskal tau values. Each tau measure is represented graphically by an ellipse, which is a circle for \(\tau=0\) and degenerates into a straight line for \(\tau=1\).
GKmatrix <- GKtauDataframe(db.adult[, c("age", "workclass",
"education", "education_num",
"marital_status",
"occupation", "relationship",
"race", "sex", "hours_w",
"native_region", "cap_gain",
"cap_loss")])
plot(GKmatrix)
From the resulting plot above we see that “education” is perfectly predictable from “education_num” and vice versa. This confirms our conclusion that the two variables are collinear. All other \(\tau\) values are close to zero, except for \(\tau\)(relationship, marital_status), \(\tau\)(relationship, sex), and \(\tau\)(marital_status, relationship). These numbers make sense, because “relationship” can be predicted by “marital_status” and the other way around. The only association which is not intuitive is that between “relationship” and “sex”. The tau value of \(0.42\) suggests that being female or male can determine the type of relationship that an individual is in. Looking below at the percentage of women and men belonging to each category of the factor variable “relationship”, we see that 36% of women are “Not-in-family” compared to 20% of men, and 25% of women are “Unmarried” in contrast to only 4% of men.
tab <- xtabs(~ sex + relationship, data = db.adult)
ptab <- prop.table(tab, 1)
print(format(round(ptab, 2), scientific = FALSE))
 relationship
sex Husband Not-in-family Other-relative Own-child Unmarried
Female "0.00" "0.36" "0.04" "0.20" "0.25"
Male "0.61" "0.20" "0.02" "0.12" "0.04"
relationship
sex Wife
Female "0.14"
Male "0.00"
Now we compute the Cramer’s V value for the above pairs of variables with the “assocstats” function from the “vcd” package. The Cramer’s V is symmetric, i.e. V(relationship, marital_status)=V(marital_status, relationship) and hence, we can compute only one of these values. The values we obtain for Cramer’s V indicate strength of association which is similar to the one predicted by the Goodman and Kruskal’s tau:
assocstats(tab)$cramer
[1] 0.6502624
tab1 <- xtabs(~ marital_status + relationship, data = db.adult)
assocstats(tab1)$cramer
[1] 0.4871943
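Since the tau measure is not symmetric, it can also be instructive to look at both directions of association for a single pair of variables. A minimal sketch, assuming the “GKtau()” function from the same “GoodmanKruskal” package:
# A single call reports the association in both directions (tau(x -> y) and tau(y -> x))
GKtau(db.adult$sex, db.adult$relationship)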
2. Generalized Variance Inflation Factor
The Variance Inflation Factor (VIF) measures how much the variance of an estimated regression coefficient is inflated due to multicollinearity in the model. The basic idea is to regress each independent variable on all other independent variables. A VIF is calculated for each explanatory variable using the R-squared value (coefficient of determination) of the regression of that variable against all other explanatory variables: \[ \text{VIF}_i=\dfrac{1}{1 - R_i^2} \] From the above formula, we can see that variance inflation factors take values from 1 upwards. A VIF of 1.7 means that the variance of an estimated regression coefficient is 70% bigger than what one would expect if there were no correlation with the other predictors in the model. A commonly adopted rule of thumb is that a VIF greater than or equal to 5 indicates high correlation between the predictor variables. Other VIF values considered critical are 2.5 and 10; in the literature there is a debate on the exact threshold value for the VIF. In general, the higher the VIF, the less reliable the constructed regression model is.
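To make the definition concrete, the following sketch computes a single VIF by hand for a numeric predictor, using the built-in “mtcars” data purely as a hypothetical illustration (not our census data):
# VIF of "disp": regress it on the other predictors and use the R-squared value
r.squared <- summary(lm(disp ~ hp + wt + drat, data = mtcars))$r.squared
vif.disp  <- 1 / (1 - r.squared)
vif.disp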
However, the standard VIF cannot be used for predictors with more than one degree of freedom, such as categorical variables with more than two levels. Fox and Monette (“Generalized Collinearity Diagnostics”, Journal of the American Statistical Association, Volume 87, 1992) generalized the notion of variance inflation to explanatory variables requiring more than 1 degree of freedom and proposed the generalized variance inflation factor (GVIF). To make GVIFs comparable across predictors with different degrees of freedom (df), one uses the corrected quantity \[ \text{GVIF}^{1/(2\cdot df)}, \] which reduces to \(\sqrt{\text{VIF}}\) for an explanatory variable with 1 df. The GVIF is invariant with respect to the coding of the categorical variables in the model. This means that it does not depend on the choice of baseline category or coding scheme for the dummy regressors. The rule of thumb for the GVIF is to square the corrected value \(\text{GVIF}^{1/(2\cdot df)}\) and apply the usual VIF threshold of 5 or, equivalently, to compare \(\text{GVIF}^{1/(2\cdot df)}\) directly with \(\sqrt{5}\).
To calculate the GVIFs for our model, we use the “vif” function in the “car” package. Because there are several packages with different implementations of the function “vif”, we have to specify explicitly that we want to call the “vif” function from the “car” package:
car::vif(glm.model.wld)
 GVIF Df GVIF^(1/(2*Df))
age 1.232896 1 1.110359
workclass 1.628834 6 1.041493
education 2.011061 15 1.023562
marital_status 51.847256 6 1.389613
occupation 2.891764 13 1.041686
relationship 115.512999 5 1.607915
race 2.297860 4 1.109597
sex 2.837879 1 1.684601
native_region 2.393816 7 1.064334
hours_w 1.262756 4 1.029591
cap_gain 1.054674 2 1.013397
cap_loss 1.016078 2 1.003995
We get an error message when attempting to calculate the GVIFs for “glm.model”, because there are aliased coefficients in the model. This means that there are linearly dependent predictors (perfect collinearity). Since the collinearity is resolved in the second fitted model “glm.model.wld”, the GVIFs are calculated and displayed. As we can see from the output above, all \(\text{GVIF}^{1/(2\cdot Df)}\) values are less than the threshold value of \(\sqrt{5}\approx 2.236\). Therefore, we can conclude that there are no indications of collinearity among the explanatory variables in the model “glm.model.wld”.
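The rule-of-thumb check can also be done programmatically. A small sketch, using the fact that the third column of the “vif” output holds \(\text{GVIF}^{1/(2\cdot Df)}\):
gvif <- car::vif(glm.model.wld)
# TRUE for every predictor means no collinearity is flagged by the rule of thumb
gvif[, "GVIF^(1/(2*Df))"] < sqrt(5)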
Although the deviance is not useful as a measure of the model goodness of fit in our case because we have ungrouped data (for details see the section “3.2 Limitations of the deviance as a test statistic for ungrouped data” of the Theoretical Background part of the project), we briefly demonstrate how it can be obtained in R. Below we use the “summary()” function to display the deviance of the intercept-only model and the deviance of the model “glm.model.wld”:
summary(glm.model.wld)$null.deviance
[1] 33850.71
summary(glm.model.wld)$deviance
[1] 19642.97
In the output above the null deviance shows how well the response is predicted by a model with nothing but an intercept (grand mean) compared to the saturated model. The deviance (residual deviance) shows how well the observed response is predicted by a model with a given set of predictors (in our case this is the model “glm.model.wld”) compared to the saturated model.
As we discuss in the “3.3 Likelihood ratio test statistic” section of the Theoretical Background part of the project, the likelihood ratio test is valid also in the case of sparse data. Therefore, we apply it to compare the goodness of fit of the intercept-only model and the fitted model with explanatory variables - “glm.model.wld”. We test, at the \(0.05\) significance level, the null hypothesis that the null model fits the observed data well and explains about the same amount of variation in the response variable as the model “glm.model.wld”:
k <- length(glm.model.wld$coefficients)
D_M <- glm.model.wld$deviance
D_0 <- glm.model.wld$null.deviance
1 - pchisq(q = D_0 - D_M, df = k - 1)
[1] 0
The p-value is approximately \(0\) and, hence, smaller than \(0.05\). Therefore, at the 5% significance level we reject the null hypothesis that there is no significant difference between the grand-mean model (intercept-only model) and the model “glm.model.wld” (the model with predictors). In other words, the model with predictors explains significantly more of the variation in the response than the intercept-only model.
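The same comparison can be carried out with the “anova()” function by explicitly refitting the intercept-only model; a short sketch using the objects defined above:
# Refit the null (intercept-only) model and compare it with the full model via a LRT
null.model <- glm(income ~ 1, data = db.adult, family = binomial(link = "logit"))
anova(null.model, glm.model.wld, test = "LRT")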
The Hosmer-Lemeshow test is a goodness of fit test for logistic regression models with ungrouped (individual) binary data. As in our problem, grouping the data according to covariate patterns is not always easy or even possible. The idea of the Hosmer-Lemeshow test is to divide the data into subgroups based on the predicted probabilities \(\hat{\pi}_i\) instead of on the values of the explanatory variables. The test consists of the following steps:
First we sort the data set of size \(n\) (we have \(n\) observations) in descending order according to the fitted probabilities \(\hat{\pi}_i\)
Then we divide the data into \(g\) equal-sized groups: the first group contains the first \(\dfrac{n}{g}\) observations with the highest estimated probabilities; the second group contains the \(\dfrac{n}{g}\) observations having the next highest estimated probabilities; etc.
Finally, we construct a Pearson-like test statistic for the observed and expected cell frequencies from a \(2\times g\) table
For more details – go to the “3.5 The Hosmer-Lemeshow test” section of the Theoretical Background part of the project.
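Before turning to the ready-made implementation, the test statistic can be sketched by hand. A minimal, simplified version, assuming a vector of observed 0/1 responses “y” and a vector of fitted probabilities “p”:
# Simplified Hosmer-Lemeshow statistic: group by quantiles of the fitted probabilities
hl.by.hand <- function(y, p, g = 10) {
  groups <- cut(p, breaks = quantile(p, probs = seq(0, 1, length.out = g + 1)),
                include.lowest = TRUE)
  obs.g <- tapply(y, groups, sum)     # observed number of "successes" per group
  exp.g <- tapply(p, groups, sum)     # expected number of "successes" per group
  n.g   <- tapply(y, groups, length)  # group sizes
  sum((obs.g - exp.g)^2 / (exp.g * (1 - exp.g / n.g)))  # Pearson-like chi-squared statistic
}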
In order to assess the goodness of fit of the model with the Hosmer-Lemeshow test we will use the R function “hoslem.test()” from the package “ResourceSelection”. We have to pass to the function the vector of observed responses in a binary format, the vector with expected values (predicted probabilities) and the number of groups \(g\). Therefore we first extract the fitted probabilities, which can be done either with the function “predict()” or with “glm.model.wld$fitted.values”:
head(glm.model.wld$fitted.values)
 1 2 3 4 5 6
0.08856988 0.45729601 0.03017249 0.09494667 0.55583302 0.83322025
predicted.probs <- predict(glm.model.wld, type = "response")
# predicted.probs <- glm.model.wld$fitted.values
head(predicted.probs)   # returns probabilities
 1 2 3 4 5 6
0.08856988 0.45729601 0.03017249 0.09494667 0.55583302 0.83322025
Next, we need to transform the vector of predicted probabilities \(\hat{\boldsymbol{\pi}}=\left( \hat{\pi}_1, \hat{\pi}_2, \dots, \hat{\pi}_n \right)\) (\(n=30162\)) into a binary vector of predicted responses (predicted income). We denote this binary vector by \(\hat{\mathbf{y}}=\left( \hat{y}_1, \hat{y}_2, \dots, \hat{y}_n \right)\). We construct the vector \(\hat{\mathbf{y}}\) in the following way: \[ \hat{y}_i = 1, \quad \text{if} \quad \hat{\pi}_i>0.5, \] and \[ \hat{y}_i = 0, \quad \text{if} \quad \hat{\pi}_i \leq 0.5, \] where a value of 1 is equivalent to a yearly income of more than 50K, and a value of 0 corresponds to an income of at most 50K.
We take the vector of observed responses (which is a factor variable with two levels - " <=50K" and " >50K") and create a binary vector:
observed.values <- ifelse(db.adult$income == " >50K", 1, 0)
Next we generate the vector of predicted probabilities – “predicted.probs”, and then we use it to create the binary vector “predicted.response”:
predicted.probs <- predict(glm.model.wld, type = "response")
predicted.response <- ifelse(predicted.probs > 0.5, 1, 0)
head(predicted.response, 20)
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
0 0 0 0 1 1 0 0 1 1 1 0 0 0 0 0 0 0 0 1
head(observed.values, 20)
 [1] 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 1 1
First we test the accuracy of the fitted logistic model on the training dataset, i.e. we calculate the percentage of correctly predicted response values:
mean(observed.values == predicted.response)
[1] 0.8489159
As we can see from the result above, there is an 84.89% match between the observed and predicted values of the dependent variable. Of course, in order to assess the prediction accuracy of the fitted model properly, we need to test it on a new test dataset. We will do this in the next sections of the project.
Finally, we proceed with the Hosmer-Lemeshow test. We run the test with different numbers of groups, taking \(g=10, 20, 50, 100, 200, 300\) and \(400\). As we can see from the results below, for small numbers of groups we obtain very small p-values, which indicates a poor fit of the model, whereas for larger values of \(g\) we obtain large (non-significant) p-values, indicating no evidence of lack of fit. As we already mentioned, the Hosmer-Lemeshow test has some serious drawbacks, such as the demonstrated dependence of the results on the choice of groups. Therefore, we have to interpret the outcome of the test with caution. Our judgement of the adequacy of the model should be based above all on the final goals we want to achieve, such as whether we want the constructed model to match the observed data as well as possible or to predict new observations with high accuracy. We will see further that, despite the lack of fit according to the Hosmer-Lemeshow test (in the case of a relatively small number of groups), the fitted model “glm.model.wld” predicts new observations with high accuracy.
hoslem.test(observed.values, predicted.response, g=10)
Hosmer and Lemeshow goodness of fit (GOF) test
data: observed.values, predicted.response
X-squared = 401.75, df = 8, p-value < 2.2e-16
hoslem.test(observed.values, predicted.response, g=20)
Hosmer and Lemeshow goodness of fit (GOF) test
data: observed.values, predicted.response
X-squared = 401.75, df = 18, p-value < 2.2e-16
hoslem.test(observed.values, predicted.response, g=50)
Hosmer and Lemeshow goodness of fit (GOF) test
data: observed.values, predicted.response
X-squared = 401.75, df = 48, p-value < 2.2e-16
hoslem.test(observed.values, predicted.response, g=100)
Hosmer and Lemeshow goodness of fit (GOF) test
data: observed.values, predicted.response
X-squared = 401.75, df = 98, p-value < 2.2e-16
hoslem.test(observed.values, predicted.response, g=200)
Hosmer and Lemeshow goodness of fit (GOF) test
data: observed.values, predicted.response
X-squared = 401.75, df = 198, p-value = 5.551e-16
hoslem.test(observed.values, predicted.response, g=300)
Hosmer and Lemeshow goodness of fit (GOF) test
data: observed.values, predicted.response
X-squared = 401.75, df = 298, p-value = 5.575e-05
hoslem.test(observed.values, predicted.response, g=400)
Hosmer and Lemeshow goodness of fit (GOF) test
data: observed.values, predicted.response
X-squared = 401.75, df = 398, p-value = 0.438
Here it is important to note that what goodness-of-fit statistics test is not how well the fitted model predicts the dependent variable, but whether there exists a more complicated model which fits the observed data even better. Such a more complex model can be obtained by adding non-linearities or interaction terms, or by changing the link function. However, a basic rule in practice is to keep the model as simple as possible, provided that it serves our purposes well enough. In that sense, we have to weigh the advantages and disadvantages of fitting more complicated models and decide what is best in our case. We have to investigate whether the more complicated model fits the observed data significantly better. If there is no considerable improvement in the fit or prediction accuracy, a more intricate model does not justify the additional computational complexity.
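As an illustration of how such a comparison could be carried out, one can add an interaction term and test it against the current model with a likelihood ratio test (a sketch; the particular interaction is chosen purely for illustration):
# Hypothetical example: does an age x sex interaction improve the fit significantly?
glm.inter <- update(glm.model.wld, . ~ . + age:sex)
anova(glm.model.wld, glm.inter, test = "LRT")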
Determining what kind of non-linearities to include in the model is not always a trivial task, especially for data which do not have intuitive interpretation, such as radio signals or gene expression data sets, etc. In these cases neural networks are very useful, because their algorithm naturally identifies nonlinear relationships. Further we will demonstrate the application of neural networks to our problem.
After fitting the model, we can assess the significance of the estimated model coefficients and hence the significance of the predictors in the fitted model.
In order to test the significance of each covariate in the model, we perform likelihood ratio tests with the R function “anova()”. When we run “anova(glm.model.wld, test=”LRT“)”, the function adds the predictors to the model one at a time, in the order in which they appear in the model formula. At each step it compares the model containing the newly added term against the model from the previous step (the model without that term) by means of a likelihood ratio test, as we discuss in the “3.3 Likelihood ratio test statistic” section of the Theoretical Background part of the project. The p-values of the tests are calculated using the chi-squared distribution:
anova(glm.model.wld, test = "LRT")
Analysis of Deviance Table
Model: binomial, link: logit
Response: income
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 30161 33851
age 1 1738.5 30160 32112 < 2.2e-16 ***
workclass 6 426.1 30154 31686 < 2.2e-16 ***
education 15 3570.2 30139 28116 < 2.2e-16 ***
marital_status 6 5091.4 30133 23024 < 2.2e-16 ***
occupation 13 765.5 30120 22259 < 2.2e-16 ***
relationship 5 199.5 30115 22059 < 2.2e-16 ***
race 4 21.3 30111 22038 0.0002802 ***
sex 1 165.7 30110 21872 < 2.2e-16 ***
native_region 7 39.5 30103 21833 1.571e-06 ***
hours_w 4 418.7 30099 21414 < 2.2e-16 ***
cap_gain 2 1473.1 30097 19941 < 2.2e-16 ***
cap_loss 2 298.0 30095 19643 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we can see from the results above, all explanatory variables are significant and according to the likelihood ratio tests we should keep all of the considered predictors in the model.
Now we will test the significance of each level of the categorical predictors in the fitted model.
We have a total of 67 model parameters:
length(glm.model.wld$coefficients)
[1] 67
That is, we have 12 predictors, most of which are categorical, and hence dummy variables are created to account for the factor levels of each categorical covariate. Therefore, the number of model parameters is greater than the number of predictors. For each categorical (factor) variable with \(l\) levels, \(l-1\) dummy variables are created and one level is chosen as the so-called “base” or “reference” level. By default, R orders the levels of the categorical variables alphabetically and takes the first factor level as the reference level. The choice of base level does not affect the fitted model (the predicted probabilities and the overall fit are identical), although the individual dummy coefficients are, of course, measured relative to the chosen baseline. In the “2.2 Interpretation of the coefficients” section of the Theoretical Background part of the project we discuss in detail how categorical variables are accounted for in the logistic model and how the model coefficients are interpreted.
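The dummy coding used for a given factor can be inspected directly, and the reference level can be changed if a different baseline is more convenient for interpretation (a small sketch; re-basing with “relevel()” is optional and would require re-fitting the model):
# Dummy (treatment) coding for "workclass"; the baseline level gets a row of zeros
contrasts(db.adult$workclass)
# Optionally choose another baseline before re-fitting the model:
# db.adult$workclass <- relevel(db.adult$workclass, ref = " Private")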
summary(glm.model.wld)
Call:
glm(formula = form_new, family = binomial(link = "logit"), data = db.adult,
x = TRUE, y = TRUE)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.6865 -0.5220 -0.1929 -0.0004 3.7400
Coefficients:
Estimate Std. Error z value
(Intercept) 0.902289 0.676909 1.333
age 0.026438 0.001703 15.521
workclass Local-gov -0.639701 0.112228 -5.700
workclass Private -0.452838 0.093099 -4.864
workclass Self-emp-inc -0.235136 0.123352 -1.906
workclass Self-emp-not-inc -0.872465 0.109911 -7.938
workclass State-gov -0.753259 0.125031 -6.025
workclass Without-pay -13.180258 199.343318 -0.066
education 11th 0.103269 0.213701 0.483
education 12th 0.444262 0.273675 1.623
education 1st-4th -0.459467 0.479822 -0.958
education 5th-6th -0.419230 0.352155 -1.190
education 7th-8th -0.504353 0.242058 -2.084
education 9th -0.278287 0.269149 -1.034
education Assoc-acdm 1.286520 0.180221 7.139
education Assoc-voc 1.262133 0.173172 7.288
education Bachelors 1.902762 0.161215 11.803
education Doctorate 2.967218 0.223163 13.296
education HS-grad 0.764450 0.156796 4.875
education Masters 2.260928 0.172156 13.133
education Preschool -12.504022 97.849946 -0.128
education Prof-school 2.900938 0.206752 14.031
education Some-college 1.124569 0.159055 7.070
marital_status Married-AF-spouse 2.906368 0.578712 5.022
marital_status Married-civ-spouse 2.164168 0.273570 7.911
marital_status Married-spouse-absent 0.009494 0.236502 0.040
marital_status Never-married -0.442440 0.087661 -5.047
marital_status Separated -0.073530 0.164084 -0.448
marital_status Widowed 0.213421 0.156794 1.361
occupation Armed-Forces -1.381287 1.593288 -0.867
occupation Craft-repair 0.037115 0.080426 0.461
occupation Exec-managerial 0.769220 0.077465 9.930
occupation Farming-fishing -0.867170 0.138099 -6.279
occupation Handlers-cleaners -0.717075 0.144210 -4.972
occupation Machine-op-inspct -0.314569 0.102797 -3.060
occupation Other-service -0.790993 0.118157 -6.694
occupation Priv-house-serv -3.194482 1.316745 -2.426
occupation Prof-specialty 0.502766 0.082001 6.131
occupation Protective-serv 0.593388 0.125590 4.725
occupation Sales 0.258121 0.083014 3.109
occupation Tech-support 0.659673 0.111490 5.917
occupation Transport-moving -0.070478 0.099474 -0.709
relationship Not-in-family 0.582874 0.270493 2.155
relationship Other-relative -0.283365 0.245220 -1.156
relationship Own-child -0.615250 0.269784 -2.281
relationship Unmarried 0.438948 0.286010 1.535
relationship Wife 1.380705 0.105038 13.145
race Asian-Pac-Islander 0.716848 0.274914 2.608
race Black 0.530568 0.240869 2.203
race Other 0.186070 0.370315 0.502
race White 0.631749 0.230393 2.742
sex Male 0.839523 0.079511 10.559
native_region Central-Asia -0.044740 0.288329 -0.155
native_region East-Asia 0.073407 0.261559 0.281
native_region Europe-East 0.369272 0.335262 1.101
native_region Europe-West 0.574603 0.193200 2.974
native_region Outlying-US 0.320963 0.225136 1.426
native_region South-America -0.981654 0.469238 -2.092
native_region United-States 0.420610 0.134261 3.133
hours_w between_45_and_60 0.438435 0.043693 10.034
hours_w between_60_and_80 0.410845 0.098214 4.183
hours_w less_than_40 -0.804580 0.062166 -12.943
hours_w more_than_80 0.273986 0.193616 1.415
cap_gain Low -6.656754 0.509205 -13.073
cap_gain Medium -4.779587 0.513847 -9.302
cap_loss Low -0.789675 0.148967 -5.301
cap_loss Medium 0.808079 0.179131 4.511
Pr(>|z|)
(Intercept) 0.18255
age < 2e-16 ***
workclass Local-gov 1.20e-08 ***
workclass Private 1.15e-06 ***
workclass Self-emp-inc 0.05662 .
workclass Self-emp-not-inc 2.06e-15 ***
workclass State-gov 1.70e-09 ***
workclass Without-pay 0.94728
education 11th 0.62892
education 12th 0.10452
education 1st-4th 0.33827
education 5th-6th 0.23386
education 7th-8th 0.03720 *
education 9th 0.30116
education Assoc-acdm 9.43e-13 ***
education Assoc-voc 3.14e-13 ***
education Bachelors < 2e-16 ***
education Doctorate < 2e-16 ***
education HS-grad 1.09e-06 ***
education Masters < 2e-16 ***
education Preschool 0.89832
education Prof-school < 2e-16 ***
education Some-college 1.55e-12 ***
marital_status Married-AF-spouse 5.11e-07 ***
marital_status Married-civ-spouse 2.56e-15 ***
marital_status Married-spouse-absent 0.96798
marital_status Never-married 4.48e-07 ***
marital_status Separated 0.65406
marital_status Widowed 0.17347
occupation Armed-Forces 0.38597
occupation Craft-repair 0.64445
occupation Exec-managerial < 2e-16 ***
occupation Farming-fishing 3.40e-10 ***
occupation Handlers-cleaners 6.61e-07 ***
occupation Machine-op-inspct 0.00221 **
occupation Other-service 2.17e-11 ***
occupation Priv-house-serv 0.01526 *
occupation Prof-specialty 8.72e-10 ***
occupation Protective-serv 2.30e-06 ***
occupation Sales 0.00187 **
occupation Tech-support 3.28e-09 ***
occupation Transport-moving 0.47863
relationship Not-in-family 0.03117 *
relationship Other-relative 0.24786
relationship Own-child 0.02258 *
relationship Unmarried 0.12485
relationship Wife < 2e-16 ***
race Asian-Pac-Islander 0.00912 **
race Black 0.02761 *
race Other 0.61534
race White 0.00611 **
sex Male < 2e-16 ***
native_region Central-Asia 0.87669
native_region East-Asia 0.77898
native_region Europe-East 0.27070
native_region Europe-West 0.00294 **
native_region Outlying-US 0.15397
native_region South-America 0.03644 *
native_region United-States 0.00173 **
hours_w between_45_and_60 < 2e-16 ***
hours_w between_60_and_80 2.87e-05 ***
hours_w less_than_40 < 2e-16 ***
hours_w more_than_80 0.15704
cap_gain Low < 2e-16 ***
cap_gain Medium < 2e-16 ***
cap_loss Low 1.15e-07 ***
cap_loss Medium 6.45e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 33851 on 30161 degrees of freedom
Residual deviance: 19643 on 30095 degrees of freedom
AIC: 19777
Number of Fisher Scoring iterations: 13
From the summary above we can see which covariates, and which levels of the categorical covariates, are significant for the log-odds model, based on the corresponding p-values. These p-values are obtained using the Wald test statistic (for details go to the section “4 Significance of the estimated model parameters” of the Theoretical Background part of the project) to test the following hypotheses for the model coefficients:
\(H_0\): \(\beta_j=0\)
vs.
\(H_1\): \(\beta_j\neq 0\)
for all \(j=0,1,2,\dots, k\).
In other words, the Wald test is used to evaluate the statistical significance of each coefficient in the fitted logistic model, i.e. the test checks the hypothesis that each individual coefficient is zero. If the coefficient of a category is not statistically significant, this does not imply that the whole categorical predictor is unimportant and should be removed from the model. The overall effect of the factor variable is tested by performing a likelihood ratio test as we showed earlier.
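The Wald z statistics reported by “summary()” are simply the estimates divided by their standard errors; as a quick sketch of how the p-values arise:
coefs <- summary(glm.model.wld)$coefficients
z <- coefs[, "Estimate"] / coefs[, "Std. Error"]
p <- 2 * pnorm(-abs(z))   # two-sided Wald p-values
head(cbind(z, p))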
In the regression output above, the reported coefficients for each category of a factor variable measure the differences from the base level. As an illustrative example, let us consider the predictor “workclass”, which has \(7\) levels:
levels(db.adult$workclass)
[1] " Federal-gov" " Local-gov" " Private"
[4] " Self-emp-inc" " Self-emp-not-inc" " State-gov"
[7] " Without-pay"
As we can see from the output of “summary”, the coefficient of “workclass Local-gov” differs from the base level - “workclass Federal-gov” - by \(-0.64\), and that difference is significant at the \(5\)% level. This tells us that \[ \small{ \textbf{odds ratio}=} \] \[ \small{ =\dfrac{\textrm{odds of income being more than 50K given that "workclass"="workclass Local-gov"}}{\textrm{odds of income being more than 50K given that "workclass"="workclass Federal-gov"}}= } \] \[ \small{ =\exp(-0.64)=0.53 } \] or, equivalently, \[ \small{ \dfrac{\textrm{odds of income being more than 50K given that "workclass"="workclass Federal-gov"}}{\textrm{odds of income being more than 50K given that "workclass"="workclass Local-gov"}}=} \] \[ \small{ =\dfrac{1}{\exp(-0.64)}=1.9,} \] which means that it is almost twice as likely for a person to have an income of more than 50K a year if they belong to “workclass Federal-gov” (if they work for the federal government) compared to “workclass Local-gov” (if they work in local government), keeping all other variables constant.
The category “workclass Private” differs from the reference level by \(-0.45\) (\(\exp(-0.45)=0.64\)) and the difference is also significant at the \(5\)% level. In terms of odds ratios, this is equivalent to \[ \small{ \dfrac{\textrm{odds of income being more than 50K given that "workclass"="workclass Federal-gov"}}{\textrm{odds of income being more than 50K given that "workclass"="workclass Private"}}=} \] \[ \small{ =\dfrac{1}{\exp(-0.45)}=1.57, } \] which means that people who work in the federal government are about \(1.6\) times more likely to earn more than 50K a year compared to people who are employed in the private sector.
Similar conclusions can be made for the levels “workclass Self-emp-inc”, “workclass Self-emp-not-inc” and “workclass State-gov”, which differ from the base level by \(-0.24\), \(-0.87\) and \(-0.75\), respectively (note that the difference for “workclass Self-emp-inc” is not quite significant at the 5% level, with a p-value of about \(0.057\)). From these results, we can conclude that:
if a subject is employed in the federal government, the odds of earning more than 50K a year are about \(1.3\) times greater than if the person is self-employed with an incorporated business (controlling for all other independent variables);
people who work in the federal government are about \(2.4\) times more likely to earn more than 50K a year compared to people who are self-employed with an unincorporated business;
the odds of earning more than 50K a year are about \(2.1\) times greater for people who work in the federal government than for people with “workclass State-gov”.
The last category is “workclass Without-pay”, which differs from the base level by \(-13.18\): \[ \small{ \dfrac{\textrm{odds of income being more than 50K given that "workclass"="workclass Without-pay"}}{\textrm{odds of income being more than 50K given that "workclass"="workclass Federal-gov"}}=} \] \[ \small{ =\exp(-13.18)\approx 1.9\times 10^{-6},} \] or, equivalently, \[ \small{ \dfrac{\textrm{odds of income being more than 50K given that "workclass"="workclass Federal-gov"}}{\textrm{odds of income being more than 50K given that "workclass"="workclass Without-pay"}}=} \] \[ \small{ =\dfrac{1}{\exp(-13.18)}\approx 530000.} \] The latter means that it is far less likely for a person to earn more than 50K a year if they are without pay than if they are employed in the federal government, which makes perfect sense. However, we will see that this result is due mainly to the fact that very few people without pay participated in the survey.
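Rather than exponentiating individual coefficients by hand, all odds ratios (each relative to the baseline level of its factor) can be obtained in one step:
# Odds ratios for all dummy variables and for the numeric predictor "age"
odds.ratios <- exp(coef(glm.model.wld))
head(odds.ratios, 8)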
We can make equivalent interpretations for all predictors in the model, but we will not go into details and discuss the impact of every categorical variable. We will just comment on some of the more interesting results. As an example, let us consider the independent variable “education” in more detail. Regarding education, we notice that it is about \(19\) times more likely for an individual to have an income of more than 50K a year if they have a doctorate degree, compared to having only a 10th-grade education (the reference level). Below we list the levels of education:
levels(db.adult$education)
 [1] " 10th" " 11th" " 12th" " 1st-4th"
[5] " 5th-6th" " 7th-8th" " 9th" " Assoc-acdm"
[9] " Assoc-voc" " Bachelors" " Doctorate" " HS-grad"
[13] " Masters" " Preschool" " Prof-school" " Some-college"
summary(db.adult$education)
 10th 11th 12th 1st-4th 5th-6th
820 1048 377 151 288
7th-8th 9th Assoc-acdm Assoc-voc Bachelors
557 455 1008 1307 5044
Doctorate HS-grad Masters Preschool Prof-school
375 9840 1627 45 542
Some-college
6678
Also, it is much more likely to earn more than 50K if one has a Bachelor’s or a Master’s degree (\(6.7\) times and \(9.6\) times more likely, respectively) relative to the baseline - a 10th-grade education. The same can be said for “education Prof-school”, “education Assoc-acdm” and “education Assoc-voc” - the odds of having an income of more than 50K are about \(18\) times, \(3.6\) times and \(3.5\) times greater, respectively, compared to the reference category. Furthermore, people with some college education are about \(3\) times more likely to earn more than 50K compared to people with a 10th-grade education. On the other hand, if an individual has a “1st-4th”, “5th-6th”, “7th-8th”, or “9th” grade education, their odds of being paid more than 50K a year are about \(1.6\), \(1.5\), \(1.7\) and \(1.3\) times lower, respectively, than if they had a “10th” grade education. The result for “education Preschool” is very extreme - the odds of earning more than 50K a year are \(1/\exp(-12.5)\approx 270000\) times lower relative to the base level. This number makes sense but is also due to the fact that there are very few people in this category - only \(45\) people out of \(30162\) in the whole sample. This can also be seen from the non-significant p-value for “education Preschool”, which indicates that this covariate (a dummy variable in this case) is not significant for the model at the \(5\)% level. From the p-values we also notice that “education 1st-4th”, “education 5th-6th”, “education 9th”, “education 11th” and “education 12th” are not significant at the \(5\)% level.
Below we show the \(95\)% confidence intervals for all estimated model parameters, as well as the number of people belonging to each category of “workclass”:
summary(db.adult$workclass)
 Federal-gov Local-gov Private Self-emp-inc
943 2067 22286 1074
Self-emp-not-inc State-gov Without-pay
2499 1279 14
confint.default(glm.model.wld)
 2.5 % 97.5 %
(Intercept) -0.42442766 2.229006620
age 0.02309987 0.029776966
workclass Local-gov -0.85966300 -0.419738493
workclass Private -0.63530818 -0.270367376
workclass Self-emp-inc -0.47690169 0.006628768
workclass Self-emp-not-inc -1.08788730 -0.657043423
workclass State-gov -0.99831503 -0.508203616
workclass Without-pay -403.88598205 377.525466439
education 11th -0.31557639 0.522114548
education 12th -0.09213183 0.980656177
education 1st-4th -1.39990049 0.480965538
education 5th-6th -1.10944208 0.270981821
education 7th-8th -0.97877726 -0.029928297
education 9th -0.80580927 0.249234482
education Assoc-acdm 0.93329414 1.639745735
education Assoc-voc 0.92272247 1.601543443
education Bachelors 1.58678677 2.218736884
education Doctorate 2.52982589 3.404610254
education HS-grad 0.45713599 1.071763354
education Masters 1.92350880 2.598347602
education Preschool -204.28639279 179.278348024
education Prof-school 2.49571043 3.306165155
education Some-college 0.81282661 1.436311123
marital_status Married-AF-spouse 1.77211372 4.040622605
marital_status Married-civ-spouse 1.62798024 2.700354828
marital_status Married-spouse-absent -0.45404214 0.473030517
marital_status Never-married -0.61425245 -0.270627067
marital_status Separated -0.39512831 0.248068539
marital_status Widowed -0.09389025 0.520731587
occupation Armed-Forces -4.50407419 1.741499603
occupation Craft-repair -0.12051675 0.194747513
occupation Exec-managerial 0.61739052 0.921049182
occupation Farming-fishing -1.13784005 -0.596500127
occupation Handlers-cleaners -0.99972183 -0.434428630
occupation Machine-op-inspct -0.51604829 -0.113090274
occupation Other-service -1.02257547 -0.559409841
occupation Priv-house-serv -5.77525481 -0.613708781
occupation Prof-specialty 0.34204628 0.663485977
occupation Protective-serv 0.34723561 0.839539872
occupation Sales 0.09541604 0.420826770
occupation Tech-support 0.44115707 0.878188702
occupation Transport-moving -0.26544291 0.124487162
relationship Not-in-family 0.05271747 1.113029820
relationship Other-relative -0.76398695 0.197257381
relationship Own-child -1.14401636 -0.086483230
relationship Unmarried -0.12162117 0.999517916
relationship Wife 1.17483555 1.586575271
race Asian-Pac-Islander 0.17802763 1.255669042
race Black 0.05847431 1.002662589
race Other -0.53973527 0.911874593
race White 0.18018746 1.083310519
sex Male 0.68368361 0.995362284
native_region Central-Asia -0.60985508 0.520375081
native_region East-Asia -0.43923923 0.586053447
native_region Europe-East -0.28783017 1.026373727
native_region Europe-West 0.19593733 0.953268729
native_region Outlying-US -0.12029459 0.762220580
native_region South-America -1.90134306 -0.061965462
native_region United-States 0.15746373 0.683756668
hours_w between_45_and_60 0.35279780 0.524072965
hours_w between_60_and_80 0.21834927 0.603341146
hours_w less_than_40 -0.92642270 -0.682737562
hours_w more_than_80 -0.10549456 0.653466778
cap_gain Low -7.65477856 -5.658730237
cap_gain Medium -5.78670849 -3.772465970
cap_loss Low -1.08164521 -0.497704860
cap_loss Medium 0.45698900 1.159168541
The \(95\)% confidence interval for the odds ratio comparing “workclass Without-pay” with “workclass Federal-gov” ranges from \(\exp(-403.9)\approx 0\) to \(\exp(377.5)\), an astronomically large number. This anomaly is due to the fact that a very small number of people - only \(14\) - belong to the category “workclass Without-pay”. Thus, this association should be interpreted with a lot of caution. The same can be said for the category (dummy variable) “education Preschool”, to which only \(45\) people from the study belong:
summary(db.adult$education)
 10th 11th 12th 1st-4th 5th-6th
820 1048 377 151 288
7th-8th 9th Assoc-acdm Assoc-voc Bachelors
557 455 1008 1307 5044
Doctorate HS-grad Masters Preschool Prof-school
375 9840 1627 45 542
Some-college
6678
First we calculate the percentage of correctly predicted responses on the training dataset “db.adult”, i.e. the same dataset that we used to fit the logistic regression model “glm.model.wld”. Given the vector of predicted probabilities \(\hat{\boldsymbol{\pi}}\), we calculate a character vector \(\hat{\mathbf{y}}=\left( \hat{y}_1, \hat{y}_2, \dots, \hat{y}_n \right)\) such that \[ \hat{y}_i = " >50K", \quad \text{if} \quad \hat{\pi}_i>0.5, \] and \[ \hat{y}_i = " <=50K", \quad \text{if} \quad \hat{\pi}_i \leq 0.5. \]
Since R internally codes factor variables as numbers, the binary response variable is coded as well. Therefore, when estimating a logistic regression model, we need to know which level of the binary response is being modeled. By default, R orders the factor levels alphabetically, and the level modeled in the logistic regression is the second (last) level. In our case this is " >50K":
attributes(db.adult$income)
$levels
[1] " <=50K" " >50K"
$class
[1] "factor"
Hence, the response level modeled is " >50K", i.e. when fitting the logistic regression model, the probability of the income being greater than 50K is calculated. We denote the vector with the predicted income values as “predicted.income.train”:
predicted.income.train <- ifelse(predicted.probs > 0.5, " >50K", " <=50K")
As we saw earlier, there is an 84.89% match between the observed and predicted values of “income”:
mean(predicted.income.train == db.adult$income)
[1] 0.8489159
Below we show the confusion matrix:
stat.log.train <- confusionMatrix(data = predicted.income.train,
reference = db.adult$income,
positive = levels(db.adult$income)[2])
stat.log.train
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 21075 2978
>50K 1579 4530
Accuracy : 0.8489
95% CI : (0.8448, 0.8529)
No Information Rate : 0.7511
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5691
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.6034
Specificity : 0.9303
Pos Pred Value : 0.7415
Neg Pred Value : 0.8762
Prevalence : 0.2489
Detection Rate : 0.1502
Detection Prevalence : 0.2025
Balanced Accuracy : 0.7668
'Positive' Class : >50K
The sensitivity is the proportion of income values equal to " >50K" that are correctly identified as such, and the specificity is the proportion of income values equal to " <=50K" that are correctly identified as such. We see that the sensitivity is 60.34% and the specificity is 93.03%.
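These two quantities can be verified directly from the raw confusion table (a quick sketch using the objects defined above):
tab.train <- table(Predicted = predicted.income.train, Observed = db.adult$income)
# Sensitivity: correctly predicted " >50K" among all observed " >50K"
sens <- tab.train[" >50K", " >50K"] / sum(tab.train[, " >50K"])
# Specificity: correctly predicted " <=50K" among all observed " <=50K"
spec <- tab.train[" <=50K", " <=50K"] / sum(tab.train[, " <=50K"])
c(sensitivity = sens, specificity = spec)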
Next we will test how well the fitted model predicts new observations. For this purpose we will use the provided test dataset and, hence, the corresponding test data frame that we created - “db.test”.
predicted.income.test <- predict(glm.model.wld,
newdata = db.test,
type = "response")
predicted.income.test <- ifelse(predicted.income.test > 0.5, " >50K", " <=50K")
Below we show the respective confusion matrix:
stat.log.test <- confusionMatrix(data = predicted.income.test,
reference = db.test$income,
positive = levels(db.test$income)[2])
stat.log.test
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 10560 1496
>50K 800 2204
Accuracy : 0.8475
95% CI : (0.8417, 0.8532)
No Information Rate : 0.7543
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5608
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.5957
Specificity : 0.9296
Pos Pred Value : 0.7337
Neg Pred Value : 0.8759
Prevalence : 0.2457
Detection Rate : 0.1463
Detection Prevalence : 0.1995
Balanced Accuracy : 0.7626
'Positive' Class : >50K
As is evident from the results above, the model correctly predicts 84.75% of the values of the dependent variable, which we consider a good predictive rate. On the test dataset the sensitivity is 59.57% and the specificity is 92.96%.
Here we consider two R functions for stepwise selection of covariates. The functions are based on the predictors’ significance according to likelihood ratio tests. The first function is “drop1()”. It drops one predictor at a time from the full model “glm.model.wld” and then compares the restricted and full model in terms of goodness of fit with the help of the likelihood ratio test statistic.
drop1(glm.model.wld,
trace = TRUE,
test = "LRT")Single term deletions
Model:
income ~ age + workclass + education + marital_status + occupation +
relationship + race + sex + native_region + hours_w + cap_gain +
cap_loss
Df Deviance AIC LRT Pr(>Chi)
<none> 19643 19777
age 1 19886 20018 243.22 < 2.2e-16 ***
workclass 6 19746 19868 102.65 < 2.2e-16 ***
education 15 20637 20741 993.74 < 2.2e-16 ***
marital_status 6 19751 19873 108.25 < 2.2e-16 ***
occupation 13 20180 20288 537.42 < 2.2e-16 ***
relationship 5 19921 20045 277.75 < 2.2e-16 ***
race 4 19656 19782 12.69 0.0129070 *
sex 1 19759 19891 115.54 < 2.2e-16 ***
native_region 7 19670 19790 27.40 0.0002826 ***
hours_w 4 20012 20138 369.34 < 2.2e-16 ***
cap_gain 2 21195 21325 1552.24 < 2.2e-16 ***
cap_loss 2 19941 20071 298.01 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we see from the output above, the p-values indicate that all predictors in the model are significant and should be retained.
The second function which we consider is the “step()” function for backward selection of predictors based on LRT:
step.lrt <- step(glm.model.wld, direction="backward", test="LRT")
Start: AIC=19776.97
income ~ age + workclass + education + marital_status + occupation +
relationship + race + sex + native_region + hours_w + cap_gain +
cap_loss
Df Deviance AIC LRT Pr(>Chi)
<none> 19643 19777
- race 4 19656 19782 12.69 0.0129070 *
- native_region 7 19670 19790 27.40 0.0002826 ***
- workclass 6 19746 19868 102.65 < 2.2e-16 ***
- marital_status 6 19751 19873 108.25 < 2.2e-16 ***
- sex 1 19759 19891 115.54 < 2.2e-16 ***
- age 1 19886 20018 243.22 < 2.2e-16 ***
- relationship 5 19921 20045 277.75 < 2.2e-16 ***
- cap_loss 2 19941 20071 298.01 < 2.2e-16 ***
- hours_w 4 20012 20138 369.34 < 2.2e-16 ***
- occupation 13 20180 20288 537.42 < 2.2e-16 ***
- education 15 20637 20741 993.74 < 2.2e-16 ***
- cap_gain 2 21195 21325 1552.24 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The algorithm starts with the full model - “glm.model.wld” - and successively attempts to eliminate explanatory variables, at each step comparing the reduced model (with one term dropped) against the current model by means of a likelihood ratio test. From the output above we see that, according to the p-values, all predictor variables should be kept in the model.
We use the function "nearZeroVar()" from the "caret" package to check whether any of the variables have near zero variance:
nearZeroVar(db.adult, saveMetrics = TRUE)
                freqRatio percentUnique zeroVar   nzv
age 1.001175 0.23871096 FALSE FALSE
workclass 8.917967 0.02320801 FALSE FALSE
fnlwgt 1.083333 67.18055832 FALSE FALSE
education 1.473495 0.05304688 FALSE FALSE
education_num 1.473495 0.05304688 FALSE FALSE
marital_status 1.446124 0.02320801 FALSE FALSE
occupation 1.001985 0.04641602 FALSE FALSE
relationship 1.613125 0.01989258 FALSE FALSE
race 9.205893 0.01657715 FALSE FALSE
sex 2.083419 0.00663086 FALSE FALSE
capital_gain 81.970326 0.39122074 FALSE TRUE
capital_loss 148.118557 0.29838870 FALSE TRUE
hours_per_week 5.243194 0.31165042 FALSE FALSE
native_country 45.088525 0.13593263 FALSE TRUE
income 3.017315 0.00663086 FALSE FALSE
hours_w 2.473339 0.01657715 FALSE FALSE
native_region 22.433931 0.02652344 FALSE TRUE
cap_gain 22.424603 0.00994629 FALSE TRUE
cap_loss 37.261538 0.00994629 FALSE TRUE
Indeed, there are predictors flagged as having near zero variance: the variables "capital_gain", "capital_loss", "native_country", "native_region", "cap_gain" and "cap_loss". In general, covariates with near zero variance carry little information about the response variable, but this is not always the case, so we have to be careful before throwing away data.
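For reference, the "nzv" flag above is not based on the variance itself but on two diagnostics: the frequency ratio of the two most common values and the percentage of unique values. Under the defaults documented for "nearZeroVar()" (freqCut = 95/5 and uniqueCut = 10), a predictor is flagged when its frequency ratio exceeds 19 and its percentage of unique values is below 10. A manual computation for "cap_gain", which should reproduce the corresponding row of the table above, could look like this:
tab <- sort(table(db.adult$cap_gain), decreasing = TRUE)
tab[1] / tab[2]                                           # freqRatio: most common value vs second most common (~22.4)
100 * length(unique(db.adult$cap_gain)) / nrow(db.adult)  # percentUnique (~0.01)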
Since in our prediction models we do not include the variables “capital_gain”, “capital_loss” and “native_country”, we are not interested in them. Regarding the remaining three features – “native_region”, “cap_gain” and “cap_loss”, we will fit logistic regression models excluding each of these variables one at a time, and then we will compare their prediction accuracy to that of the full model – “glm.model.wld”. We start with fitting a model without the variable “native_region”:
covariates <- paste("age", "workclass", "education",
"marital_status", "occupation", "relationship",
"race", "sex", "hours_w",
"cap_gain", "cap_loss", sep = "+")
form <- as.formula(paste("income ~", covariates))
start_time <- proc.time()
Mod <- glm(formula = form,
data = db.adult,
family = binomial(link = "logit"),
x = TRUE)
time.wout.region <- proc.time() - start_time
time.wout.region
 user system elapsed
2.44 0.08 2.51
Below we show the model’s accuracy, sensitivity and specificity on the test data, i.e. on new observations:
predicted.test <- predict(Mod,
newdata = db.test,
type = "response")
predicted.test <- ifelse(predicted.test > 0.5, " >50K", " <=50K")
mean(predicted.test == db.test$income)
[1] 0.8474104
confMat.with.cap <- confusionMatrix(data = predicted.test,
reference = db.test$income,
positive = levels(db.test$income)[2])
confMat.with.cap
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 10554 1492
>50K 806 2208
Accuracy : 0.8474
95% CI : (0.8416, 0.8531)
No Information Rate : 0.7543
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5609
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.5968
Specificity : 0.9290
Pos Pred Value : 0.7326
Neg Pred Value : 0.8761
Prevalence : 0.2457
Detection Rate : 0.1466
Detection Prevalence : 0.2001
Balanced Accuracy : 0.7629
'Positive' Class : >50K
Fitting the model without "native_region" results in approximately the same accuracy, sensitivity and specificity on the test data. This means that we can drop this variable and thus reduce the computational complexity. The time needed to fit the model without "native_region" is 2.44s, roughly 1.1 times less than the 2.65s needed to fit the full model.
Next we remove the variables “cap_gain” and “cap_loss” but we keep “native_region”:
covariates <- paste("age", "workclass", "education",
"marital_status", "occupation", "relationship",
"race", "sex", "hours_w", "native_region",
sep = "+")
form <- as.formula(paste("income ~", covariates))
Mod <- glm(formula = form,
data = db.adult,
family = binomial(link = "logit"),
x = TRUE)
We check the model's accuracy, sensitivity and specificity on the test dataset:
predicted.test <- predict(Mod,
newdata = db.test,
type = "response")
predicted.test <- ifelse(predicted.test > 0.5, " >50K", " <=50K")
mean(predicted.test == db.test$income)
[1] 0.8316069
confMat.without.cap <- confusionMatrix(data = predicted.test,
reference = db.test$income,
positive = levels(db.test$income)[2])
confMat.without.cap
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 10430 1606
>50K 930 2094
Accuracy : 0.8316
95% CI : (0.8255, 0.8376)
No Information Rate : 0.7543
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5159
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.5659
Specificity : 0.9181
Pos Pred Value : 0.6925
Neg Pred Value : 0.8666
Prevalence : 0.2457
Detection Rate : 0.1390
Detection Prevalence : 0.2008
Balanced Accuracy : 0.7420
'Positive' Class : >50K
As we can see from the results above, the prediction accuracy on the test dataset decreases from 84.75% for the full model to 83.16% for the model without "cap_gain" and "cap_loss". The specificity declines a little – from 92.9% to 91.81% – while the sensitivity drops from 59.68% to 56.59%, which is not desirable since the sensitivity is the percentage of correctly predicted outcomes with income greater than 50K.
Finally we drop all three variables with near zero variance and we fit a logistic model:
covariates <- paste("age", "workclass", "education",
"marital_status", "occupation", "relationship",
"race", "sex", "hours_w",
sep = "+")
form <- as.formula(paste("income ~", covariates))
Mod <- glm(formula = form,
data = db.adult,
family = binomial(link = "logit"),
x = TRUE)
The accuracy, sensitivity and specificity of the model on the test dataset are given below:
predicted.test <- predict(Mod,
newdata = db.test,
type = "response")
predicted.test <- ifelse(predicted.test > 0.5, " >50K", " <=50K")
mean(predicted.test == db.test$income)
[1] 0.8312749
confMat.without.three <- confusionMatrix(data = predicted.test,
reference = db.test$income,
positive = levels(db.test$income)[2])
confMat.without.three
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 10428 1609
>50K 932 2091
Accuracy : 0.8313
95% CI : (0.8252, 0.8372)
No Information Rate : 0.7543
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5149
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.5651
Specificity : 0.9180
Pos Pred Value : 0.6917
Neg Pred Value : 0.8663
Prevalence : 0.2457
Detection Rate : 0.1388
Detection Prevalence : 0.2007
Balanced Accuracy : 0.7415
'Positive' Class : >50K
We observe that the accuracy drops from 84.75% for the full model to 83.13% for the reduced model. The specificity declines a little – from 92.96% to 91.8% – while the sensitivity falls from 59.57% to 56.51%, which is not a desired effect. Therefore, we will keep the variables "cap_gain" and "cap_loss" in the model.
Although the automated stepwise selection of variables pointed out that we should keep all initial variables in the model, our further analysis showed that we can drop “native_region”.
We modify the census dataset in order to have more balanced data. As we saw from the exploratory data analysis in the second part of the project, 75.1% of people earn less than 50K a year and only 24.9% earn more than 50K a year. Although this statistic is not surprising, it makes the data imbalanced and leads to low sensitivity of the fitted classification model. We recall that as the "positive" class we consider people with income greater than 50K. Thus, the sensitivity is the proportion of observations with income values equal to " >50K" that are correctly identified as such, and the specificity is the proportion of records with income values equal to " <=50K" that are correctly identified as such. Hence, the sensitivity of the logistic regression model is only 59.57%, whereas the specificity is 92.96%. We balance the data by adding copies of observations from the under-represented class. This technique is called over-sampling, or sampling with replacement, and its goal is to build better classification models with higher sensitivity.
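A quick back-of-the-envelope check (using the class counts of the original training set, 22654 observations with income <=50K and 7508 with income >50K) shows that adding a random 80% of the " >50K" rows raises the share of the minority class to roughly 37%:
n.low  <- 22654                                    # income <=50K in db.adult
n.high <- 7508                                     # income >50K in db.adult
(n.high + 0.8 * n.high) / (n.low + 1.8 * n.high)   # ~0.374, the prevalence we will see for db.adult.oversampled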
First we create a temporary dataset called “db.adult.rich” which contains all observations with income greater than 50K a year:
db.adult.rich <- subset(db.adult, db.adult$income == " >50K")
rownames(db.adult.rich) <- 1:nrow(db.adult.rich)
Then we use the function "createDataPartition()" from the "caret" package to randomly sample 80% of these observations and we create the dataset "db.adult.oversampled", which consists of the original observations from the "db.adult" set and the over-sampled instances:
IndRich <- createDataPartition(y = db.adult.rich$income,
times = 1,
p = 0.8,
list = FALSE)
db.adult.oversampled <- rbind(db.adult, db.adult.rich[IndRich, ])
rownames(db.adult.oversampled) <- 1:nrow(db.adult.oversampled)
After we are finished with the temporary dataset "db.adult.rich", we remove it from the working environment:
rm(db.adult.rich)
The most popular algorithm for generating synthetic samples is the SMOTE algorithm, or the Synthetic Minority Over-sampling Technique. More information can be found in the original paper "SMOTE: Synthetic Minority Over-sampling Technique" by N.V. Chawla, K. W. Bowyer, L.O. Hall and W.P. Kegelmeyer from 2002. We use this technique to create a more balanced training dataset. In short, the idea of the method is to combine under-sampling of the majority class with a special form of over-sampling of the minority class. In the SMOTE algorithm the over-sampling of the minority class is carried out by creating new "synthetic" examples, rather than by over-sampling with replacement from the existing examples. The synthetic observations are generated with the help of the k nearest neighbors (KNN) algorithm. The majority class is under-sampled by randomly removing observations until the minority class becomes some pre-specified percentage of the majority class.
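To make the idea concrete, here is a minimal sketch (not the "DMwR" implementation, and restricted to numeric features) of how a single synthetic minority example is generated: we pick one of the k nearest minority-class neighbours of an observation and interpolate between the two feature vectors:
smote_one <- function(x, neighbours) {
  nb  <- neighbours[sample(nrow(neighbours), 1), ]  # pick one of the k nearest minority neighbours at random
  gap <- runif(length(x))                           # random interpolation weights in [0, 1]
  x + gap * (nb - x)                                # synthetic point on the segment between x and the neighbour
}
# Toy usage with two numeric features (age and hours per week):
smote_one(c(age = 40, hours = 45),
          neighbours = rbind(c(38, 40), c(45, 50), c(42, 44)))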
Below we create a “SMOTEd” training dataset called “db.adult.smote”:
db.adult.smote <- SMOTE(form = form,
data = db.adult,
perc.over = 100,
perc.under = 250,
k = 5)
rownames(db.adult.smote) <- 1:nrow(db.adult.smote)
where we recall that we use the following formula:
covariates <- paste("age", "workclass", "education","marital_status",
"occupation", "relationship", "race", "sex",
"native_region", "hours_w", "cap_gain",
"cap_loss", sep = "+")
form <- as.formula(paste("income ~", covariates))
We show the distribution of income for the SMOTE and for the original dataset:
summary(db.adult.smote$income)
 <=50K  >50K
18770 15016
summary(db.adult$income)
 <=50K  >50K
22654 7508
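These counts follow from the "perc.over" and "perc.under" arguments (as we understand the "DMwR" conventions: "perc.over = 100" generates one new synthetic case per original minority case, and "perc.under = 250" keeps 2.5 majority cases for each newly generated minority case):
7508 + 100/100 * 7508   # minority class after SMOTE: 15016
250/100 * 7508          # majority class kept after under-sampling: 18770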
We see that 3884 observations of the majority class from the original unbalanced set were (randomly) removed and 7508 new observations of the minority class were generated. Below we show a bar plot of the distribution of income in the new balanced dataset:
ggplot(data = db.adult.smote,
mapping = aes(x = db.adult.smote$income,
fill = db.adult.smote$income)) +
geom_bar(mapping = aes(y = (..count..)/sum(..count..))) +
geom_text(mapping = aes(label = scales::percent((..count..)/sum(..count..)),
y = (..count..)/sum(..count..) ),
stat = "count",
vjust = -.1) +
labs(x = "Income",
y = "",
fill = "Income") +
scale_y_continuous(labels = percent)
From the graph we see that 55.6% of all people earn less than 50K a year and 44.4% – earn more than 50K a year.
After we enriched the “db.adult” dataset with observations over-sampled from the under-represented class of people who earn more than 50K a year, we fit a new logistic regression model called “glm.model.over”:
covariates <- paste("age", "workclass", "education","marital_status",
"occupation", "relationship", "race", "sex",
"native_region", "hours_w", "cap_gain",
"cap_loss", sep = "+")
form <- as.formula(paste("income ~", covariates))
start_time <- proc.time()
glm.model.over <- glm(formula = form,
data = db.adult.oversampled,
family = binomial(link = "logit"),
x = TRUE,
y = TRUE)
time.logistic.over <- proc.time() - start_time
time.logistic.over
 user system elapsed
3.36 0.13 3.48
First we test the performance of the model on the training set – namely, the “db.adult.oversampled” dataset:
predicted.train <- predict(glm.model.over,
newdata = db.adult.oversampled,
type = "response")
predicted.train <- ifelse(predicted.train > 0.5, " >50K", " <=50K")
Below we show the confusion matrix together with the accuracy, sensitivity and specificity:
stat.Logi.over <- confusionMatrix(data = predicted.train,
reference = db.adult.oversampled$income,
positive = levels(db.test$income)[2])
stat.Logi.over
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 19728 3524
>50K 2926 9991
Accuracy : 0.8217
95% CI : (0.8177, 0.8256)
No Information Rate : 0.6263
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6156
Mcnemar's Test P-Value : 1.057e-13
Sensitivity : 0.7393
Specificity : 0.8708
Pos Pred Value : 0.7735
Neg Pred Value : 0.8484
Prevalence : 0.3737
Detection Rate : 0.2762
Detection Prevalence : 0.3571
Balanced Accuracy : 0.8050
'Positive' Class : >50K
We observe that the overall accuracy drops a little from 84.89% (for the model trained on the original unbalanced dataset) to 82.17%, whereas the sensitivity significantly increases from 60.34% to 73.93%. The specificity decreases by 5.95 percentage points, from 93.03% to 87.08%.
Our next step is to test the accuracy of the fitted model on the test dataset:
predicted.test <- predict(glm.model.over,
newdata = db.test,
type = "response")
predicted.test <- ifelse(predicted.test > 0.5, " >50K", " <=50K")
Below we give the confusion matrix together with additional statistics:
stat.Logi.over <- confusionMatrix(data = predicted.test,
reference = db.test$income,
positive = levels(db.test$income)[2])
stat.Logi.over
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 9869 989
>50K 1491 2711
Accuracy : 0.8353
95% CI : (0.8293, 0.8412)
No Information Rate : 0.7543
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5751
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.7327
Specificity : 0.8688
Pos Pred Value : 0.6452
Neg Pred Value : 0.9089
Prevalence : 0.2457
Detection Rate : 0.1800
Detection Prevalence : 0.2790
Balanced Accuracy : 0.8007
'Positive' Class : >50K
We see that the accuracy on the test set drops down a little bit from 84.75% for the model built on the original unbalanced train set to 83.53% for the model built on the over-sampled train dataset. On the other hand, the sensitivity significantly increases from 59.57% to 73.27% and the specificity decreases from 92.96% to 86.88%.
Next we train a model on the balanced SMOTE training dataset:
covariates <- paste("age", "workclass", "education","marital_status",
"occupation", "relationship", "race", "sex",
"native_region", "hours_w", "cap_gain",
"cap_loss", sep = "+")
form <- as.formula(paste("income ~", covariates))
start_time <- proc.time()
glm.model.smote <- glm(formula = form,
data = db.adult.smote,
family = binomial(link = "logit"),
x = TRUE,
y = TRUE)
time.logistic.smote <- proc.time() - start_time
time.logistic.smote
 user system elapsed
3.17 0.13 3.29
First we test the performance of the model on the training set – namely, the "db.adult.smote" dataset:
predicted.train <- predict(glm.model.smote,
newdata = db.adult.smote,
type = "response")
predicted.train <- ifelse(predicted.train > 0.5, " >50K", " <=50K")
Below we show the confusion matrix together with the accuracy, sensitivity and specificity:
stat.Logi.smote <- confusionMatrix(data = predicted.train,
reference = db.adult.smote$income,
positive = levels(db.test$income)[2])
stat.Logi.smote
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 16221 3142
>50K 2549 11874
Accuracy : 0.8316
95% CI : (0.8275, 0.8355)
No Information Rate : 0.5556
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6576
Mcnemar's Test P-Value : 4.247e-15
Sensitivity : 0.7908
Specificity : 0.8642
Pos Pred Value : 0.8233
Neg Pred Value : 0.8377
Prevalence : 0.4444
Detection Rate : 0.3514
Detection Prevalence : 0.4269
Balanced Accuracy : 0.8275
'Positive' Class : >50K
We observe that the overall accuracy drops a little from 84.89% (for the original unbalanced train dataset) to 83.16% (for the balanced SMOTE dataset), whereas the sensitivity significantly increases from 60.34% to 79.08%. The specificity decreases by about 6.6 percentage points, from 93.03% to 86.42%.
Our next step is to test the accuracy of the fitted model on the test dataset:
predicted.test <- predict(glm.model.smote,
newdata = db.test,
type = "response")
predicted.test <- ifelse(predicted.test > 0.5, " >50K", " <=50K")
Below we give the confusion matrix together with additional statistics:
stat.Logi.smote <- confusionMatrix(data = predicted.test,
reference = db.test$income,
positive = levels(db.test$income)[2])
stat.Logi.smote
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 9768 1043
>50K 1592 2657
Accuracy : 0.825
95% CI : (0.8189, 0.8311)
No Information Rate : 0.7543
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5504
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.7181
Specificity : 0.8599
Pos Pred Value : 0.6253
Neg Pred Value : 0.9035
Prevalence : 0.2457
Detection Rate : 0.1764
Detection Prevalence : 0.2821
Balanced Accuracy : 0.7890
'Positive' Class : >50K
As we can see, the overall accuracy drops a little from 84.75% (for the original unbalanced train dataset) to 82.5% (for the balanced SMOTE dataset), whereas the sensitivity significantly increases from 59.57% to 71.81%. The specificity decreases by about 7 percentage points, from 92.96% to 85.99%.
Here we fit a classification model with the random forest algorithm, using the "randomForest()" function from the "randomForest" package. We use the default parameters, such as the number of variables tried at each split of the decision trees (in this case \(\lfloor\sqrt{12}\rfloor = 3\)) and the number of trees, which is 500. We train two models – one on the over-sampled train dataset and one on the SMOTE dataset.
First, we will briefly explain how the random forest algorithm works. The random forest is an "ensemble" method. The main idea of ensemble methods is that a group of "weak learners" comes together to form a "strong learner". In the case of the random forest, each decision tree (a type of classification algorithm) is a "weak learner" and the group of trees (i.e. the so-called forest) forms the "strong learner". Suppose we have \(T\) trees in the forest. For each tree, \(N\) cases are sampled at random with replacement to create a subset of the train dataset; this subset typically contains roughly two thirds of the distinct observations. At each node of each tree in the forest, \(m\) predictor variables are selected at random from all the predictor variables. In R the default value of \(m\) is the square root of the total number of predictors in the model (rounded down). The predictor variable that provides the best split, according to some objective function, is then used to do a binary split on that node. At the next node of the tree, another \(m\) variables are chosen at random from all predictor variables, and so on, until a leaf node is reached. The final classification is obtained by a majority vote over the decisions made by the individual trees (see the toy sketch below). Since each decision tree samples at random roughly two thirds of all observations in the train set, the remaining third can be used to test the accuracy of the predictions made by the corresponding tree. Thus, the random forest simultaneously trains a model and estimates its prediction accuracy. R reports the so-called "out-of-bag" (OOB) error for each decision tree, as well as an averaged overall OOB error. This is the prediction error on the training examples left out of the corresponding bootstrap samples.
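To illustrate the aggregation step, here is a toy sketch (not the "randomForest" internals) of the majority vote over the per-tree predictions for a handful of observations:
# Hypothetical matrix of per-tree class predictions: T = 3 trees (rows), 4 observations (columns)
tree_preds <- rbind(c(" >50K", " <=50K", " >50K", " <=50K"),
                    c(" >50K", " <=50K", " <=50K", " <=50K"),
                    c(" <=50K", " <=50K", " >50K", " <=50K"))
# The forest's prediction for each observation is the most frequent vote among the trees
apply(tree_preds, 2, function(votes) names(which.max(table(votes))))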
Below we fit the random forest model on the over-sampled training dataset:
set.seed(1234)
covariates <- paste("age", "workclass", "education",
"marital_status", "occupation", "relationship",
"race", "sex", "native_region", "hours_w",
"cap_gain", "cap_loss", sep = "+")
form <- as.formula(paste("income ~", covariates))
start_time <- proc.time()
ModRF <- randomForest(formula = form,
data = db.adult.oversampled)
time.rf.over <- proc.time() - start_time
time.rf.over
 user system elapsed
21.55 1.00 23.21
We see that the overall “out-of-bag” error, or OOB, is 14.13%:
ModRF
Call:
randomForest(formula = form, data = db.adult.oversampled)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3
OOB estimate of error rate: 14.13%
Confusion matrix:
<=50K >50K class.error
<=50K 19577 3077 0.1358259
>50K 2032 11483 0.1503515
Below we test the accuracy on the training and test datasets and we see that it is 90.92% and 83.47%, respectively. The "out of sample" error is therefore 16.53%, reasonably close to the OOB estimate:
mean(predict(ModRF, newdata = db.adult.oversampled) == db.adult.oversampled$income)
[1] 0.909204
mean(predict(ModRF, newdata = db.test) == db.test$income)
[1] 0.8346614
Next we display an error plot of the random forest model:
plot(ModRF)
print(ModRF)
Call:
randomForest(formula = form, data = db.adult.oversampled)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3
OOB estimate of error rate: 14.13%
Confusion matrix:
<=50K >50K class.error
<=50K 19577 3077 0.1358259
>50K 2032 11483 0.1503515
The black solid line is the overall OOB error (i.e. averaged over all trees), the red dashed line is the prediction error for the first category of the variable "income" – " <=50K", and the green dotted line represents the prediction error for the second category of "income" – " >50K". This means that the red dashed line corresponds approximately to 1-specificity, and the green dotted line to 1-sensitivity. Indeed, if we display the confusion matrix and the performance statistics on the train set, we can see the sensitivity and specificity:
confusionMatrix(data = predict(ModRF, newdata = db.adult.oversampled),
reference = db.adult.oversampled$income,
positive = levels(db.test$income)[2])
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 20647 1276
>50K 2007 12239
Accuracy : 0.9092
95% CI : (0.9062, 0.9122)
No Information Rate : 0.6263
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8082
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9056
Specificity : 0.9114
Pos Pred Value : 0.8591
Neg Pred Value : 0.9418
Prevalence : 0.3737
Detection Rate : 0.3384
Detection Prevalence : 0.3939
Balanced Accuracy : 0.9085
'Positive' Class : >50K
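The correspondence is even more direct on the OOB confusion matrix printed earlier, where one minus the per-class error gives the OOB specificity and sensitivity:
1 - 3077 / (19577 + 3077)   # 1 - class.error for " <=50K": OOB specificity, ~0.864
1 - 2032 / (2032 + 11483)   # 1 - class.error for " >50K":  OOB sensitivity, ~0.850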
From the error plot, we see that the errors do not decrease further once the number of trees reaches a certain threshold of about 100 trees. Hence, we try to build an RF model with 200 trees to check whether we can achieve the same accuracy with a computationally less expensive model:
set.seed(1234)
start_time <- proc.time()
ModRF.small <- randomForest(formula = form,
data = db.adult.oversampled,
ntree = 200)
time.rf.small.over <- proc.time() - start_time
time.rf.small.over
 user system elapsed
8.15 0.33 8.50
On our machine the computational time went from about 21.55s to 8.15s, which is 2.6 times faster.
print(ModRF.small)
Call:
randomForest(formula = form, data = db.adult.oversampled, ntree = 200)
Type of random forest: classification
Number of trees: 200
No. of variables tried at each split: 3
OOB estimate of error rate: 14.28%
Confusion matrix:
<=50K >50K class.error
<=50K 19611 3043 0.1343251
>50K 2123 11392 0.1570847
mean(predict(ModRF.small, newdata = db.adult.oversampled) == db.adult.oversampled$income)
[1] 0.9069369
mean(predict(ModRF.small, newdata = db.test) == db.test$income)
[1] 0.8335325
From the code above, we see that the OOB error increased only slightly, from 14.13% to 14.28%, and the accuracy is approximately the same – 90.69% on the training data and 83.35% on the test dataset.
Next we show the confusion matrix on the test set:
stat.rf.over <- confusionMatrix(data = predict(ModRF.small,
newdata = db.test),
reference = db.test$income,
positive = levels(db.test$income)[2])
stat.rf.over
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 9818 964
>50K 1542 2736
Accuracy : 0.8336
95% CI : (0.8276, 0.8395)
No Information Rate : 0.7543
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5735
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.7395
Specificity : 0.8643
Pos Pred Value : 0.6396
Neg Pred Value : 0.9106
Prevalence : 0.2457
Detection Rate : 0.1817
Detection Prevalence : 0.2841
Balanced Accuracy : 0.8019
'Positive' Class : >50K
The sensitivity is 73.95% and the specificity is 86.43%.
Below we build a random forest model on the SMOTE training set. We use the optimal number of trees which we determined to be 200 for the model trained on the over-sampled train dataset:
set.seed(1234)
covariates <- paste("age", "workclass", "education",
"marital_status", "occupation", "relationship",
"race", "sex", "native_region", "hours_w",
"cap_gain", "cap_loss", sep = "+")
form <- as.formula(paste("income ~", covariates))
start_time <- proc.time()
ModRF <- randomForest(formula = form,
data = db.adult.smote,
ntree = 200)
time.rf.smote <- proc.time() - start_time
time.rf.smote
 user system elapsed
7.78 0.33 8.19
First we take a look at the performance of the model on the train set:
confusionMatrix(data = predict(ModRF, newdata = db.adult.smote),
reference = db.adult.smote$income,
positive = levels(db.test$income)[2])
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 17837 1288
>50K 933 13728
Accuracy : 0.9343
95% CI : (0.9316, 0.9369)
No Information Rate : 0.5556
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8666
Mcnemar's Test P-Value : 5.844e-14
Sensitivity : 0.9142
Specificity : 0.9503
Pos Pred Value : 0.9364
Neg Pred Value : 0.9327
Prevalence : 0.4444
Detection Rate : 0.4063
Detection Prevalence : 0.4339
Balanced Accuracy : 0.9323
'Positive' Class : >50K
And we show the confusion matrix, as well as the accuracy, sensitivity and specificity on the test set:
stat.rf.smote <- confusionMatrix(data = predict(ModRF,
newdata = db.test),
reference = db.test$income,
positive = levels(db.test$income)[2])
stat.rf.smote
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 9729 976
>50K 1631 2724
Accuracy : 0.8269
95% CI : (0.8208, 0.8329)
No Information Rate : 0.7543
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5593
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.7362
Specificity : 0.8564
Pos Pred Value : 0.6255
Neg Pred Value : 0.9088
Prevalence : 0.2457
Detection Rate : 0.1809
Detection Prevalence : 0.2892
Balanced Accuracy : 0.7963
'Positive' Class : >50K
We observe that the accuracy and sensitivity for the SMOTE model are slightly lower than those for the over-sampled model.
We fit a model with the Support Vector Machines (SVM) algorithm, which is implemented in the R package “e1071”. This is a popular method that usually has reasonable default performance and has parameters that allow for further tuning if necessary. It also allows for fitting both linearly and non-linearly separable data using linear or non-linear kernels, respectively. By default, the “svm” function in the “e1071” package scales the data. We train two types of models – one on the over-sampled train set and one on the SMOTE train set. We also use cross-validation and grid search to find the optimal model hyperparameters.
Our goal was to perform cross-validation with a grid search in order to determine the optimal model hyperparameters. In the case of a linear kernel, the hyperparameter whose value we wanted to tune was the so-called cost parameter \(C\). We wanted to use k-fold cross-validation with k=5. However, since the cross-validation required a lot of computational time, we decided to try another approach for selecting the best set of model parameters: a fixed split of the "db.adult.oversampled" dataset into a training and a validation set. But even with a fixed split, depending on which values of the parameters are included in the grid search, the computational time may still be very long, and for some values of the parameters the optimization algorithm does not converge.
We also followed the recommendations from the paper "A Practical Guide to Support Vector Classification" by Hsu, Chang, and Lin (the authors of the LIBSVM library) to use exponentially growing sequences of \(C\) (and of \(\gamma\) in the case of a radial kernel), such as \(C=2^{-5}, 2^{-4}, 2^{-3}, \dots, 1, 2^1, 2^2, 2^3, \dots\). We found that the algorithm has difficulties converging for values of \(C\) greater than \(2^6\). We also experimented with \(C=2^{-4}\) and obtained performance results (accuracy, sensitivity and specificity) almost identical to those for \(C=1, 2^2, 2^3, 2^4, 2^5\) and \(C=2^6\). Hence, after testing different parameters, we found that there is no big deviation in the accuracy metrics and that the default value (cost = 1) gives good performance and is the best one in terms of runtime and convergence.
We build predictive models with both the over-sampled and the SMOTE training datasets; however, the models trained on the two datasets have approximately the same performance regarding accuracy, sensitivity and specificity.
First we show simulations with the over-sampled training set. We tried to run the following code, but we aborted it after it had been running for 1 hour and 30 minutes. We used the "train()" function from the "caret" package, constructed a grid of only 3 values for the cost parameter C and wanted to use 5-fold cross-validation to choose the optimal value of C:
covariates <- paste("age", "workclass", "education",
"marital_status", "occupation", "relationship",
"race", "sex", "native_region", "hours_w",
"cap_gain", "cap_loss", sep = "+")
form <- as.formula(paste("income ~", covariates))
my.grid <- expand.grid(C = c(0.1, 0.5, 1))
set.seed(45112)
start_time <- proc.time()
Model.TunedSVM <- train(form = form,
data = db.adult.oversampled,
method = "svmLinear",
cross = 5,
tuneGrid = my.grid,
trControl = trainControl(method = "cv"),
trace = FALSE)
svm.tuned.time <- proc.time() - start_time
svm.tuned.time
Model.TunedSVM
Another way to do cross-validation is to use the "tune()" function from the R package "e1071", as shown below:
covariates <- paste("age", "workclass", "education",
"marital_status", "occupation", "relationship",
"race", "sex", "native_region", "hours_w",
"cap_gain", "cap_loss", sep = "+")
form <- as.formula(paste("income ~", covariates))
start_time <- proc.time()
Model.TunedSVM <- tune(method = svm,
kernel = "linear",
train.x = form,
data = db.adult.oversampled,
ranges = list(cost = 2^seq(4,10,2)),
tunecontrol = tune.control(sampling = "cross",
cross = 5))
svm.tuned.time <- proc.time() - start_time
svm.tuned.time
However, we do not run the above code due to the extensive computational time.
We also tried to run the following code with a fixed split of the training set and with the suggested exponentially increasing values of the cost parameter, but the computations took a very long time and the algorithm had difficulty converging:
set.seed(45112)
start_time <- proc.time()
Model.TunedSVM1 <- tune(method = svm,
kernel = "linear",
train.x = form,
data = db.adult.oversampled,
ranges = list(cost = 2^seq(4,8,2)),
tunecontrol = tune.control(sampling = "fix"))
svm.tuned.time <- proc.time() - start_time
svm.tuned.time
Finally, we gave up the 5-fold cross-validation as well as the fixed split of the training set and settled for a simpler approach. Namely, we built a model on the whole over-sampled training set taking C to be equal to each of the following values: \(C=2^{-4}, 1, 2^4, 2^5, 2^6\), and then we calculated the performance metrics of each of the models on the test dataset. Below we show the code that we used to generate the above-mentioned models:
set.seed(45112)
C <- 2^(-4) # 1, 2^4, 2^5, 2^6
start_time <- proc.time()
ModSVM.lin <- svm(form,
data = db.adult.oversampled,
cost = C,
kernel = "linear")
time.svm.lin <- proc.time() - start_time
time.svm.lin
stat.accuracy <- confusionMatrix(data = predict(ModSVM.lin,
newdata = db.test),
reference = db.test$income,
positive = levels(db.test$income)[2])
stat.accuracy
Next we show a table with the performance metrics of the SVM models trained on the over-sampled dataset "db.adult.oversampled" for different values of the cost parameter \(C\):
| | Accuracy | Sensitivity | Specificity | Runtime |
|---|---|---|---|---|
| C=2^(-4) | 0.8316 | 0.7286 | 0.8651 | 130s |
| C=1 | 0.8315 | 0.7365 | 0.8625 | 159s |
| C=2^4 | 0.8311 | 0.7351 | 0.8624 | 560s |
| C=2^5 | 0.8311 | 0.7351 | 0.8624 | 980s |
| C=2^6 | 0.8321 | 0.7357 | 0.8635 | 1669s |
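For reference, the rows of this table can be produced with a small loop over the candidate cost values; below is a sketch (assuming "form", "db.adult.oversampled" and "db.test" are defined as above, and keeping in mind that each fit takes several minutes):
cost.grid <- c(2^(-4), 1, 2^4, 2^5, 2^6)
svm.metrics <- lapply(cost.grid, function(C) {
  fit <- svm(form, data = db.adult.oversampled, kernel = "linear", cost = C)
  cm  <- confusionMatrix(data = predict(fit, newdata = db.test),
                         reference = db.test$income,
                         positive = levels(db.test$income)[2])
  c(cost = C, cm$overall["Accuracy"], cm$byClass[c("Sensitivity", "Specificity")])
})
do.call(rbind, svm.metrics)   # one row of accuracy/sensitivity/specificity per cost value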
We see that there is no significant deviation in the accuracy, sensitivity and specificity for the different values of the cost parameter, which controls for overfitting. We choose the model with \(C=1\), which is also the model with the highest value of the sensitivity. This model also has a reasonable computational time.
Since we chose the model with \(C=1\), for completeness we run below the svm algorithm with a linear kernel and cost set to 1 (which is also the default value for the "svm()" function) in order to build the said model:
features <- paste("age", "workclass", "education",
"marital_status", "occupation", "relationship",
"race", "sex", "native_region", "hours_w",
"cap_gain", "cap_loss", sep = "+")
form <- as.formula(paste("income ~", features))
set.seed(45112)
start_time <- proc.time()
ModSVM.lin <- svm(form,
data = db.adult.oversampled,
kernel = "linear",
cost = 1)
time.svm.lin.over <- proc.time() - start_time
time.svm.lin.over
 user system elapsed
149.95 0.04 150.28
The computational time is approximately 150.28s.
Finally we train an svm model with the chosen model parameters (\(C=1\)) on the SMOTE training set:
# SMOTE model:
set.seed(45112)
start_time <- proc.time()
ModSVM.lin.smote <- svm(form,
data = db.adult.smote,
kernel = "linear",
cost = 1)
time.svm.lin.smote <- proc.time() - start_time
time.svm.lin.smote
 user system elapsed
146.32 0.15 147.54
The vector “predSVM.lin” contains the predictions on the over-sampled training dataset:
predSVM.lin <- predict(ModSVM.lin, newdata = db.adult.oversampled)
and the vector "predSVM.lin.smote" contains the predictions on the SMOTE training set:
# SMOTE model:
predSVM.lin.smote <- predict(ModSVM.lin.smote, newdata = db.adult.smote)
Below we show the confusion matrix and some other statistics for both the over-sampled and the SMOTE models on the respective training sets:
# Over-sampled:
stat.svmLin.over <- confusionMatrix(data = predSVM.lin,
reference = db.adult.oversampled$income,
positive = levels(db.test$income)[2])
stat.svmLin.over
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 19594 3435
>50K 3060 10080
Accuracy : 0.8204
95% CI : (0.8164, 0.8244)
No Information Rate : 0.6263
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6142
Mcnemar's Test P-Value : 3.473e-06
Sensitivity : 0.7458
Specificity : 0.8649
Pos Pred Value : 0.7671
Neg Pred Value : 0.8508
Prevalence : 0.3737
Detection Rate : 0.2787
Detection Prevalence : 0.3633
Balanced Accuracy : 0.8054
'Positive' Class : >50K
# SMOTE:
stat.svmLin.smote <- confusionMatrix(data = predSVM.lin.smote,
reference = db.adult.smote$income,
positive = levels(db.test$income)[2])
stat.svmLin.smote
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 16225 3117
>50K 2545 11899
Accuracy : 0.8324
95% CI : (0.8284, 0.8364)
No Information Rate : 0.5556
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6593
Mcnemar's Test P-Value : 3.238e-14
Sensitivity : 0.7924
Specificity : 0.8644
Pos Pred Value : 0.8238
Neg Pred Value : 0.8388
Prevalence : 0.4444
Detection Rate : 0.3522
Detection Prevalence : 0.4275
Balanced Accuracy : 0.8284
'Positive' Class : >50K
We observe that the accuracy of the SMOTE model is higher than that of the over-sampled model by about 1.2 percentage points. Also, the sensitivity of the SMOTE model is 79.24%, which is considerably higher than the 74.58% of the over-sampled model.
The vector “predSVM.lin” contains the predictions (of the model built on the over-sampled train set) on the test data set:
predSVM.lin <- predict(ModSVM.lin, newdata = db.test)
Contrary to our expectations, the accuracy on the test set – 83.22% – is higher than that on the training set, which was 82.04%.
mean(predSVM.lin == db.test$income)
[1] 0.8322045
We also display the respective confusion matrix:
stat.svmLin.over <- confusionMatrix(data = predSVM.lin,
reference = db.test$income,
positive = levels(db.test$income)[2])
stat.svmLin.over
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 9810 977
>50K 1550 2723
Accuracy : 0.8322
95% CI : (0.8261, 0.8381)
No Information Rate : 0.7543
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5698
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.7359
Specificity : 0.8636
Pos Pred Value : 0.6373
Neg Pred Value : 0.9094
Prevalence : 0.2457
Detection Rate : 0.1808
Detection Prevalence : 0.2837
Balanced Accuracy : 0.7998
'Positive' Class : >50K
The sensitivity is 73.59% and the specificity is 86.36%. Next we show the performance metrics of the SMOTE model:
# SMOTE model:
predSVM.lin.smote <- predict(ModSVM.lin.smote,
newdata = db.test)
stat.svmLin.smote <- confusionMatrix(data = predSVM.lin.smote,
reference = db.test$income,
positive = levels(db.test$income)[2])
stat.svmLin.smote
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 9792 1060
>50K 1568 2640
Accuracy : 0.8255
95% CI : (0.8193, 0.8315)
No Information Rate : 0.7543
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.55
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.7135
Specificity : 0.8620
Pos Pred Value : 0.6274
Neg Pred Value : 0.9023
Prevalence : 0.2457
Detection Rate : 0.1753
Detection Prevalence : 0.2794
Balanced Accuracy : 0.7877
'Positive' Class : >50K
In comparison, the sensitivity and the specificity of the SMOTE model are 71.35% and 86.2%, respectively, and both are lower than those of the over-sampled model. The accuracy is 82.55% and is also lower than that of the over-sampled model. Thus, we conclude that the performance of the SVM model trained on the dataset obtained by our simple over-sampling methodology is better than that of the model trained on the balanced set generated by the far more sophisticated SMOTE algorithm.
Based on our analysis, we found out that the optimal cost parameter is equal to 1. We also show a comparison between the performance for \(C=1\) of the model trained on the over-sampled dataset and the corresponding model trained on the SMOTE dataset:
| | Accuracy | Sensitivity | Specificity | Runtime |
|---|---|---|---|---|
| Over-sampled | 0.8322045 | 0.7359459 | 0.8635563 | 150s |
| SMOTE | 0.8254980 | 0.7135135 | 0.8619718 | 148s |
We note that the accuracy and the sensitivity are higher for the over-sampled model. The specificity is approximately the same for both models, and the over-sampled model takes about 3s longer to fit than the SMOTE model.
We also fit an SVM model with the default radial kernel, which is suitable for data that is not linearly separable. First we build a model on the over-sampled dataset:
set.seed(123)
start_time <- proc.time()
ModSVM <- svm(form,
data = db.adult.oversampled)
time.svm.rad.over <- proc.time() - start_time
time.svm.rad.over
 user system elapsed
263.53 0.11 264.88
We use the default values of the "svm()" function, namely \(C=1\) and \(\gamma = 1/(\text{data dimension})\).
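For reference, the radial (RBF) kernel measures the similarity between two observations \(x\) and \(x'\) as \(K(x, x') = \exp\left(-\gamma \lVert x - x'\rVert^2\right)\), so \(\gamma\) acts as the inverse kernel width: larger values make the decision boundary more flexible (and more prone to overfitting), smaller values make it smoother.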
In this case the computational time is approximately 264.88s.
We could also try cross-validation to find the optimal values of the model parameters \(C\) (cost) and \(\gamma\) (kernel width):
set.seed(45112)
start_time <- proc.time()
Model.TunedSVM <- tune(method = svm,
train.x = form,
data = db.adult.oversampled,
ranges = list(gamma = 2^seq(-9,-3,2),
cost = 2^seq(4,10,2)),
tunecontrol = tune.control(sampling = "fix"))
svm.tuned.time <- proc.time() - start_time
svm.tuned.time
However, due to the very long computational time, we do not run the code shown above and hence we do not perform cross-validation.
We also fit a model on the SMOTE training set:
# SMOTE model:
set.seed(123)
start_time <- proc.time()
ModSVM.smote <- svm(form,
data = db.adult.smote)
time.svm.rad.smote <- proc.time() - start_time
time.svm.rad.smote
 user system elapsed
211.52 0.09 212.20
The vector “predSVM” contains the predictions on the training data set:
predSVM <- predict(ModSVM, newdata = db.adult.oversampled)
The confusion matrix is given below:
stat.svmRad.over <- confusionMatrix(data = predSVM,
reference = db.adult.oversampled$income,
positive = levels(db.adult$income)[2])
stat.svmRad.over
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 19735 3475
>50K 2919 10040
Accuracy : 0.8232
95% CI : (0.8192, 0.8271)
No Information Rate : 0.6263
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.6192
Mcnemar's Test P-Value : 3.9e-12
Sensitivity : 0.7429
Specificity : 0.8711
Pos Pred Value : 0.7748
Neg Pred Value : 0.8503
Prevalence : 0.3737
Detection Rate : 0.2776
Detection Prevalence : 0.3583
Balanced Accuracy : 0.8070
'Positive' Class : >50K
We observe that the "in sample" accuracy is a little better compared to the linear kernel – 82.32% versus 82.04% – but the computational time increased from 150.28s to 264.88s, i.e. about 1.8 times. The sensitivity is 74.29% and the specificity is 87.11%; the specificity is a little higher than that for the linear kernel, while the sensitivity is marginally lower.
We also calculate the performance metrics for the SMOTE model:
# SMOTE:
predSVM.smote <- predict(ModSVM.smote, newdata = db.adult.smote)
stat.svmRad.smote <- confusionMatrix(data = predSVM.smote,
reference = db.adult.smote$income,
positive = levels(db.adult$income)[2])
stat.svmRad.smote
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 16502 2905
>50K 2268 12111
Accuracy : 0.8469
95% CI : (0.843, 0.8507)
No Information Rate : 0.5556
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6886
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.8065
Specificity : 0.8792
Pos Pred Value : 0.8423
Neg Pred Value : 0.8503
Prevalence : 0.4444
Detection Rate : 0.3585
Detection Prevalence : 0.4256
Balanced Accuracy : 0.8429
'Positive' Class : >50K
We observe that the accuracy, the sensitivity and the specificity are all higher than those for the over-sampled model.
Finally we check the model accuracy on the test data. The confusion matrix, as well as the accuracy, sensitivity and specificity are shown below:
stat.svmRad.over <- confusionMatrix(data = predict(ModSVM,
newdata = db.test),
reference = db.test$income,
positive = levels(db.test$income)[2])
stat.svmRad.over
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 9875 985
>50K 1485 2715
Accuracy : 0.836
95% CI : (0.83, 0.8419)
No Information Rate : 0.7543
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5768
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.7338
Specificity : 0.8693
Pos Pred Value : 0.6464
Neg Pred Value : 0.9093
Prevalence : 0.2457
Detection Rate : 0.1803
Detection Prevalence : 0.2789
Balanced Accuracy : 0.8015
'Positive' Class : >50K
The prediction accuracy on the test dataset is 83.6% and is slightly better than that for the linear kernel, which was 83.22%.
The sensitivity is 73.38% and is approximately equal to that for the linear kernel – 73.59% – whereas the specificity is 86.93% and is a little higher compared to the 86.36% in the linear kernel case. Since we are interested in the percentage of correctly predicted outcomes with income>50K, we want a high sensitivity. In that respect the two kernels perform similarly, while the radial kernel gives slightly better overall accuracy and specificity.
We also give the performance metrics of the SMOTE model on the test set:
# SMOTE:
stat.svmRad.smote <- confusionMatrix(data = predict(ModSVM.smote, newdata = db.test),
reference = db.test$income,
positive = levels(db.test$income)[2])
stat.svmRad.smote
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 9920 1081
>50K 1440 2619
Accuracy : 0.8326
95% CI : (0.8265, 0.8385)
No Information Rate : 0.7543
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5627
Mcnemar's Test P-Value : 1.003e-12
Sensitivity : 0.7078
Specificity : 0.8732
Pos Pred Value : 0.6452
Neg Pred Value : 0.9017
Prevalence : 0.2457
Detection Rate : 0.1739
Detection Prevalence : 0.2695
Balanced Accuracy : 0.7905
'Positive' Class : >50K
The accuracy of the SMOTE model is approximately the same as that of the over-sampled model, whereas the sensitivity is lower.
We use the "nnet" package, which implements single-hidden-layer neural networks. We use cross-validation to determine the optimal number of units in the hidden layer and the optimal weight decay parameter. The "nnet" package does not expose the learning rate as a tunable hyperparameter.
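For reference, the decay hyperparameter adds an L2 penalty on the network weights to the fitting criterion, i.e. (in a standard formulation) the optimizer minimizes \(E(w) + \lambda \sum_j w_j^2\), where \(E(w)\) is the fitting criterion (least squares or entropy) and \(\lambda\) is the value passed as decay; larger values shrink the weights and act as regularization.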
As an initial attempt in the cross-validation, we choose to test a grid with 5 and 10 hidden neurons and decay equal to 0.001 and 0.1. We perform 10-fold cross validation:
covariates <- paste("age", "workclass", "education",
"marital_status", "occupation", "relationship",
"race", "sex", "native_region", "hours_w",
"cap_gain", "cap_loss", sep = "+")
form <- as.formula(paste("income ~", covariates))
my.grid <- expand.grid(size = c(5, 10), decay = c(0.001, 0.1))
nnet.cv <- train(form = form,
data = db.adult.oversampled,
method = "nnet",
metric = "Accuracy",
tuneGrid = my.grid,
trControl = trainControl(method = "cv", number = 10),
trace = FALSE)
nnet.cv
The output of the "train()" function is shown below:
| size | decay | Accuracy | Kappa |
|---|---|---|---|
| 5 | 0.001 | 0.7829162 | 0.5315457 |
| 5 | 0.100 | 0.8248827 | 0.6264014 |
| 10 | 0.001 | 0.8236105 | 0.6235264 |
| 10 | 0.100 | 0.8272878 | 0.6312372 |
[1] "Accuracy was used to select the optimal model using the largest value. The final values used for the model were size = 10 and decay = 0.1."
We see that the best model corresponds to “size=10” and “decay=0.1”. We also observe that the value of 0.1 for the decay hyperparameter gives better accuracy for both 5 and 10 hidden neurons. We also note that the accuracy is getting slightly higher as the number of hidden neurons increases. Therefore, we decide to perform cross-validation again with more neurons and bigger values of the decay parameter:
my.grid1 <- expand.grid(size = c(8,10, 12), decay = c(0.1, 1, 2))
trCtrl <- trainControl(method = "cv", number = 5)
nnet.cv1 <- train(form = form,
data = db.adult.oversampled,
method = "nnet",
metric = "Accuracy",
tuneGrid = my.grid1,
trControl = trCtrl,
trace = FALSE)
nnet.cv1
The respective output is given in the table below:
| size | decay | Accuracy | Kappa |
|---|---|---|---|
| 8 | 0.1 | 0.8275364 | 0.6318514 |
| 8 | 1.0 | 0.7883050 | 0.5071174 |
| 8 | 2.0 | 0.8275364 | 0.6311347 |
| 10 | 0.1 | 0.8250758 | 0.6249785 |
| 10 | 1.0 | 0.8282552 | 0.6324913 |
| 10 | 2.0 | 0.8272599 | 0.6300738 |
| 12 | 0.1 | 0.8263200 | 0.6291617 |
| 12 | 1.0 | 0.8272599 | 0.6313593 |
| 12 | 2.0 | 0.8290569 | 0.6341225 |
[1] "Accuracy was used to select the optimal model using the largest value. The final values used for the model were size = 12 and decay = 2."
We see that the best accuracy is achieved for 12 neurons and decay=2. Therefore, we choose these values as the optimal ones and fit the respective neural net model, increasing the maximum number of allowed iterations to 500:
covariates <- paste("age", "workclass", "education",
"marital_status", "occupation", "relationship",
"race", "sex", "native_region", "hours_w",
"cap_gain", "cap_loss", sep = "+")
form <- as.formula(paste("income ~", covariates))
start_time <- proc.time()
neural.fit <- nnet(formula = form,
data = db.adult.oversampled,
size = 12,
decay = 2,
maxit = 500,
trace = FALSE)
time.nnet.over <- proc.time() - start_time
time.nnet.over
 user system elapsed
162.33 0.07 164.28
The prediction accuracy on the test data is given below:
predicted.test <- predict(neural.fit,
newdata = db.test,
type = "class")
stat.nnet.over <- confusionMatrix(data = predicted.test,
reference = db.test$income,
positive = levels(db.test$income)[2])
stat.nnet.over
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 9793 880
>50K 1567 2820
Accuracy : 0.8375
95% CI : (0.8315, 0.8434)
No Information Rate : 0.7543
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5874
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.7622
Specificity : 0.8621
Pos Pred Value : 0.6428
Neg Pred Value : 0.9175
Prevalence : 0.2457
Detection Rate : 0.1873
Detection Prevalence : 0.2913
Balanced Accuracy : 0.8121
'Positive' Class : >50K
We see that the accuracy is 83.75%, the sensitivity is 76.22% and the specificity is 86.21%.
We fit a neural net model on the SMOTE train dataset. We use the optimal values of the hyperparameters which we determined in the case of the model trained on the over-sampled set:
# SMOTE model:
start_time <- proc.time()
neural.fit.smote <- nnet(formula = form,
data = db.adult.smote,
size = 12,
decay = 2,
maxit = 500,
trace = FALSE)
time.nnet.smote <- proc.time() - start_time
time.nnet.smote
 user system elapsed
158.64 0.04 160.11
The performance of the model is shown below:
predicted.test.smote <- predict(neural.fit.smote,
newdata = db.test,
type = "class")
stat.nnet.smote <- confusionMatrix(data = predicted.test.smote,
reference = db.test$income,
positive = levels(db.test$income)[2])
stat.nnet.smote
Confusion Matrix and Statistics
Reference
Prediction <=50K >50K
<=50K 9972 1077
>50K 1388 2623
Accuracy : 0.8363
95% CI : (0.8303, 0.8422)
No Information Rate : 0.7543
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.5706
Mcnemar's Test P-Value : 4.269e-10
Sensitivity : 0.7089
Specificity : 0.8778
Pos Pred Value : 0.6540
Neg Pred Value : 0.9025
Prevalence : 0.2457
Detection Rate : 0.1742
Detection Prevalence : 0.2663
Balanced Accuracy : 0.7934
'Positive' Class : >50K
We built several classification models to predict whether a random U.S. citizen earns more than 50K a year. We used logistic regression, random forest, support vector machines and neural networks. In the table below we summarize the accuracy, sensitivity, specificity and computational time on the test dataset for all fitted models. We show results for models trained on the over-sampled and on the SMOTE datasets.
The model with the highest accuracy and sensitivity is the neural networks model built on the over-sampled train set – 83.75% and 76.22%, respectively.
The models with the next best accuracy and sensitivity are the SVM model with radial kernel trained on the over-sampled dataset and the logistic regression model on the over-sampled set. Their accuracies are 83.6% and 83.53%, respectively. On the other hand, the SVM model with the radial kernel has the biggest runtime – 264.88s. In contrast, the logistic regression model has the best runtime – only 3.48s for the over-sampled set and 3.29s for the SMOTE dataset.
In terms of computational time, the best model is the logistic regression one trained on the SMOTE dataset. The random forest model also has good prediction accuracy and sensitivity; the runtime of the over-sampled random forest model is 8.5s, not far from the 3.48s of the over-sampled logistic regression.
We also observe that the SMOTE algorithm does not lead to better accuracy and sensitivity compared to the much simpler over-sampling technique that we used.
In conclusion, after accounting for all performance indicators, the logistic regression model and the neural networks model appear to be the most suitable choices for the considered classification problem.
| | Accuracy | Sensitivity | Specificity | Runtime (s) |
|---|---|---|---|---|
| Logistic Regression over-sampled | 0.8353254 | 0.7327027 | 0.8687500 | 3.48 |
| Logistic Regression SMOTE | 0.8250332 | 0.7181081 | 0.8598592 | 3.29 |
| Random Forest over-sampled | 0.8335989 | 0.7394595 | 0.8642606 | 8.50 |
| Random Forest SMOTE | 0.8268924 | 0.7362162 | 0.8564261 | 8.19 |
| SVM linear kernel over-sampled | 0.8322045 | 0.7359459 | 0.8635563 | 150.28 |
| SVM linear kernel SMOTE | 0.8254980 | 0.7135135 | 0.8619718 | 147.54 |
| SVM radial kernel over-sampled | 0.8359894 | 0.7337838 | 0.8692782 | 264.88 |
| SVM radial kernel SMOTE | 0.8326029 | 0.7078378 | 0.8732394 | 212.20 |
| Neural Networks over-sampled | 0.8375166 | 0.7621622 | 0.8620599 | 164.28 |
| Neural Networks SMOTE | 0.8363214 | 0.7089189 | 0.8778169 | 160.11 |
An Introduction to Categorical Data Analysis, Alan Agresti, Second Edition, 2007
An Introduction to Generalized Linear Models, Annette J. Dobson, Adrian G. Barnett, Third Edition, 2008
Applied Logistic Regression, D. Hosmer, S. Lemeshow, R. Sturdivant, Third Edition, 2013
Categorical Data Analysis, Alan Agresti, Third Edition, 2013
Generalized Linear Models, P. McCullagh, J.A. Nelder, Second Edition, 1989
“SMOTE: Synthetic Minority Over-sampling Technique”, N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer (2002), Journal of Artificial Intelligence Research, Volume 16, pages 321-357