Our main goal in this project will be to predict wether or not a person makes
over $50,000 a year. Since we are dealing with a classification problem with
binary outcome(a person can make either over 50,000 a year or less than 50,000 a year),
we will utilize logistic regression which is the machine learning algorithm of
choice for these types of problems.
In this step, we will perform some data preprocessing in order to identify
significant variables that will be useful for modeling purposes. We will be using histograms to quickly visualize the distribution of different variables
#Univariate Analysis
r1 <- ggplot(dat.train,aes(age)) + geom_histogram(binwidth = 0.7) #The distribution seems to be left skewed
r2 <- ggplot(dat.train,aes(log(fnlwgt))) + geom_histogram(binwidth = 0.5) #Use log transformation to normalize data
r3 <- ggplot(dat.train, aes(log(capital.gain + 1))) + geom_histogram(binwidth = 0.5) #MOstly 0's not significant
r4 <- ggplot(dat.train, aes(log(capital.loss + 1))) + geom_histogram(binwidth = 0.5) #Mostly 0's not significant feature
r5 <- ggplot(dat.train, aes(hours.per.week)) + geom_histogram(binwidth = 0.5)
r6 <- ggplot(dat.train, aes(education.num)) + geom_histogram(binwidth = 0.5) #Assume it corresponds to number of years in school
grid.arrange(r1, r2, r3, r4, r5, r6, ncol=3)
We notice that Capital gain and capital loss have very narrow
distributions(heavily skewed) and over 90% of observations are 0’s therefore they will be excluded from the analysis.
Let’s take a closer look at these variables using bar charts which are visualization tools used by statisticians to analyze the distribution of categorical variables and see if we should exclude any variable to improve our final model.
h1 <- ggplot(dat.train, aes(x=workclass)) + ggtitle("Work Class") + xlab("Work Class") +
geom_bar(aes(y = 100*(..count..)/sum(..count..))) + ylab("Percentage") + coord_flip() +
scale_x_discrete(limits = rev(levels(dat.train$workclass)))
h2 <- ggplot(dat.train, aes(x=education)) + ggtitle("Education") + xlab("Education") +
geom_bar(aes(y = 100*(..count..)/sum(..count..))) + ylab("Percentage") + coord_flip() +
scale_x_discrete(limits = rev(levels(dat.train$education)))
h3 <- ggplot(dat.train, aes(x=occupation)) + ggtitle("Occupation") + xlab("Occupation") +
geom_bar(aes(y = 100*(..count..)/sum(..count..))) + ylab("Percentage") + coord_flip() +
scale_x_discrete(limits = rev(levels(dat.train$occupation)))
#h4 <- ggplot(dat.train, aes(x=native.country)) + ggtitle("Native Country") + xlab("Native Country") +
#geom_bar(aes(y = 100*(..count..)/sum(..count..))) + ylab("Percentage") + coord_flip() +
#scale_x_discrete(limits = rev(levels(dat.train$native.country)))
grid.arrange(h1, h2, h3, ncol=2)
We take a more detailed look at the variable native country.
h5 <- ggplot(dat.train, aes(x=native.country)) + ggtitle("Native Country") + xlab("Native Country") +
geom_bar(aes(y = 100*(..count..)/sum(..count..))) + ylab("Percentage") + coord_flip() +
scale_x_discrete(limits = rev(levels(dat.train$native.country)))
h5
We immediately notice that distribution of the variable native country is very narrow.
Most observations are from the United States therefore the variable will be
excluded from our analysis.
#We excluded all three variables from the analysis
dat.train$capital.gain <- NULL
dat.train$capital.loss <- NULL
dat.train$native.country <- NULL
We will analyze other categorical variables to see if they should be included
in our analysis. This time, we will be using pie charts to further analyze
different distributions.
p1 <- ggplot(dat.train, aes(x=factor(1), fill=marital.status)) +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 1) + coord_polar(theta="y") +
theme(axis.text.y = element_blank(), axis.ticks.y = element_blank(), legend.title=element_blank()) +
xlab("") + ylab("") + ggtitle("Marital Status")
p2 <- ggplot(dat.train, aes(x=factor(1), fill=relationship)) +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 1) + coord_polar(theta="y") +
theme(axis.text.y = element_blank(), axis.ticks.y = element_blank(), legend.title=element_blank()) +
xlab("") + ylab("") + ggtitle("Relationship")
p3 <- ggplot(dat.train, aes(x=factor(1), fill=race)) +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 1) + coord_polar(theta="y") +
theme(axis.text.y = element_blank(), axis.ticks.y = element_blank(), legend.title=element_blank()) +
xlab("") + ylab("") + ggtitle("Race")
p4 <- ggplot(dat.train, aes(x=factor(1), fill=sex)) +
geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 1) + coord_polar(theta="y") +
theme(axis.text.y = element_blank(), axis.ticks.y = element_blank(), legend.title=element_blank()) +
xlab("") + ylab("") + ggtitle("Sex")
grid.arrange(p1, p2, p3, p4, ncol=2)
We notice that distributions of these categorical variables don’t display any
extreme skewness therefore will be conserved for our regression analysis.
It’s always a good practice to identify highly correlated variables if any that
could potentially negatively affect our final model. Let’s see if there are any
correlated numerical variables in our dataset.
#Correlation Analysis Between Numerical Variables:
numeric.var <- sapply(dat.train, is.numeric) # Find numerical variables
corr.matrix <- cor(dat.train[,numeric.var]) # Calculate the correlation matrix
corrplot(corr.matrix, main="\n Correlation Plot for Numerical Variables") ## Correlation plot
par(mar=c(5.1,4.1,4.1,2.1)) # Restore plot margins
There seems to be no visible correlation between numerical variables in the dataset.
Let’s perform the same task for categorical variables. We will be using boxplots for this specific task.
par(mfrow=c(2,2)) #Arrange plots in a grid
boxplot(dat.train$age~dat.train$class, main="Age vs. Income Class",
xlab="Income Class", ylab="Age")
boxplot(dat.train$education.num~dat.train$class, main="Years of Education vs. Income Class",
xlab="Income Class", ylab="Years of Education")
boxplot(dat.train$fnlwgt~dat.train$class,log = "y", main="log(Weight) vs. Income Class",
xlab="Income Class", ylab="log(Weight)")
boxplot(dat.train$hours.per.week~dat.train$class, main="Hours per Week vs. Income Class",
xlab="Income Class", ylab="Hours per Week")
par(mfrow=c(1,1))
We notice that income class and weight have nearly identical distribution.
par(mfrow=c(3,2))
par(las=1) # horizontal axis labels
plot(table(dat.train$class, dat.train$workclass), main="Work Class vs. Income Class", cex=1.5)
plot(table(dat.train$class, dat.train$marital.status), main="Marital Status vs. Income Class", cex=1.5)
plot(table(dat.train$class, dat.train$occupation), main="Occupation vs. Income Class", cex=1.5)
plot(table(dat.train$class, dat.train$relationship), main="Relationship vs. Income Class", cex=1.5)
plot(table(dat.train$class, dat.train$race), main="Race vs. Income Class", cex=1.5)
plot(table(dat.train$class, dat.train$sex), main="Sex vs. Income Class", cex=1.5)
par(las=0) # parallel axis labels
There is no significant correlation between these categorical variables
therefore, we will consider all these variables.
We will use logistic regression as planned to train our model on the training set and use it on unseen data (the test set.)
Logistic regression is modeling the probability that an individual makes more than $50,000 annually. In other words, a response closer to 1 indicates higher chance of making over $50,000, while a response closer to 0 indicates a higher chance of making less than 50,000. Therefore, a threshold of 0.5 is used to determine whether an individual is predicted to make more than $50,000 annually or not. A confusion matrix and a ROC curve are presented to evaluate how well the model predicts income.
The metrics used will be AUROC, accuracy and a confusion matrix.
set.seed(1234) #Used to ensure the reproducibility of results
fit1 <- glm(class ~ ., data = dat.train, family = binomial('logit'), na.action = na.omit) #Logistic Regression Model
#Confusion Matrix:
prob <- predict(fit1, dat.test, type = 'response')
pred <- rep('<=50K', length(prob))
pred[prob>=.5] <- '>50K'
tb <- table(pred, dat.test$class)
tb #Displays confusion Matrix
##
## pred <=50K. >50K.
## <=50K 11465 1701
## >50K 970 2145
accuracy <- sum(diag(tb)) / sum(tb) #Compute the model's accuracy
accuracy # Display Accuracy
## [1] 0.8359437
The model produced an accuracy of about 83.5% Logistic Regression Inference: The model gives higher accuracy on unseen data when it has all the predictors included. The model’s accuracy decreases when some of the predictors are removed.
The ROC curve is a plot of sensitivity (the ability of the model to predict an
event correctly) versus 1-specificity for the possible cut-off classification
probability values.
The ROC curve is more informative than the classification table (Confusion
Matrix) since it summarizes the predictive power for all possible probability
values.
pr <- prediction(prob, dat.test$class)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
The AUC is basically just tells us how frequently a random draw from our
predicted response probabilities on your 1-labeled data (> 50k) will be greater
than a random draw from your predicted response probabilities on your 0-labeled
data (<= 50k).
#AUC
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc #Display AUC
## [1] 0.8854446
The area under the curve is about 0.88 which signals a very strong predictive
ability of our model.
We have realized that logistic regression has the best auc and therefore is the most appropriate algorithm for this problem.
A good way to improve our model’s accuracy would be to optimize parameters contained in different models.
Further feature engineering could also be explored in order to improve the predictive power of our models.
While there are others very powerful and sophisticated machine learning algorithms out there,
The very traditional logistic regression model seems to be the most efficient
and very easy to explain to people without a technical background.