This is a widely cited KNN dataset. I encountered it during my course, and I wish to share it here because it is a good starter example for data pre-processing and machine learning practices.
Fields The dataset contains 16 columns Target filed: Income – The income is divide into two classes: <=50K and >50K Number of attributes: 14 – These are the demographics and other features to describe a person
We can explore the possibility in predicting income level based on the individual’s personal information.
Logistic Regression with UCI Adult Income to predict income level based on the individual’s personal information.
This project explores logistic regression using the UCI Adult Income data set. We will try to predict the salary class of a person based upon the given information. This is from an assigned project from Data Science and Machine Learning with R
getwd()
[1] "C:/Users/badal/Desktop/AEON/R use cases"
adult <- read.csv("file:///C:/Users/badal/Desktop/datset_/adult.csv")
head(adult)
str(adult)
'data.frame': 48842 obs. of 15 variables:
$ age : int 25 38 28 44 18 34 29 63 24 55 ...
$ workclass : Factor w/ 9 levels "?","Federal-gov",..: 5 5 3 5 1 5 1 7 5 5 ...
$ fnlwgt : int 226802 89814 336951 160323 103497 198693 227026 104626 369667 104996 ...
$ education : Factor w/ 16 levels "10th","11th",..: 2 12 8 16 16 1 12 15 16 6 ...
$ educational.num: int 7 9 12 10 10 6 9 15 10 4 ...
$ marital.status : Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 3 3 5 5 5 3 5 3 ...
$ occupation : Factor w/ 15 levels "?","Adm-clerical",..: 8 6 12 8 1 9 1 11 9 4 ...
$ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 4 1 1 1 4 2 5 1 5 1 ...
$ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 3 5 5 3 5 5 3 5 5 5 ...
$ gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 2 2 2 1 2 ...
$ capital.gain : int 0 0 0 7688 0 0 0 3103 0 0 ...
$ capital.loss : int 0 0 0 0 0 0 0 0 0 0 ...
$ hours.per.week : int 40 50 40 40 30 30 40 32 40 10 ...
$ native.country : Factor w/ 42 levels "?","Cambodia",..: 40 40 40 40 40 40 40 40 40 40 ...
$ income : Factor w/ 2 levels "<=50K",">50K": 1 1 2 2 1 1 1 2 1 1 ...
summary(adult)
age workclass fnlwgt education educational.num marital.status
Min. :17.00 Private :33906 Min. : 12285 HS-grad :15784 Min. : 1.00 Divorced : 6633
1st Qu.:28.00 Self-emp-not-inc: 3862 1st Qu.: 117551 Some-college:10878 1st Qu.: 9.00 Married-AF-spouse : 37
Median :37.00 Local-gov : 3136 Median : 178145 Bachelors : 8025 Median :10.00 Married-civ-spouse :22379
Mean :38.64 ? : 2799 Mean : 189664 Masters : 2657 Mean :10.08 Married-spouse-absent: 628
3rd Qu.:48.00 State-gov : 1981 3rd Qu.: 237642 Assoc-voc : 2061 3rd Qu.:12.00 Never-married :16117
Max. :90.00 Self-emp-inc : 1695 Max. :1490400 11th : 1812 Max. :16.00 Separated : 1530
(Other) : 1463 (Other) : 7625 Widowed : 1518
occupation relationship race gender capital.gain capital.loss
Prof-specialty : 6172 Husband :19716 Amer-Indian-Eskimo: 470 Female:16192 Min. : 0 Min. : 0.0
Craft-repair : 6112 Not-in-family :12583 Asian-Pac-Islander: 1519 Male :32650 1st Qu.: 0 1st Qu.: 0.0
Exec-managerial: 6086 Other-relative: 1506 Black : 4685 Median : 0 Median : 0.0
Adm-clerical : 5611 Own-child : 7581 Other : 406 Mean : 1079 Mean : 87.5
Sales : 5504 Unmarried : 5125 White :41762 3rd Qu.: 0 3rd Qu.: 0.0
Other-service : 4923 Wife : 2331 Max. :99999 Max. :4356.0
(Other) :14434
hours.per.week native.country income
Min. : 1.00 United-States:43832 <=50K:37155
1st Qu.:40.00 Mexico : 951 >50K :11687
Median :40.00 ? : 857
Mean :40.42 Philippines : 295
3rd Qu.:45.00 Germany : 206
Max. :99.00 Puerto-Rico : 184
(Other) : 2517
any(is.na(adult))
[1] FALSE
From the structure output, we can see that some of these columns have a large number of factors. We can clean these columns by combining similar factors, thus reducing the total number of factors.
table(adult$workclass)
? Federal-gov Local-gov Never-worked Private Self-emp-inc Self-emp-not-inc
2799 1432 3136 10 33906 1695 3862
State-gov Without-pay
1981 21
Now we combine like factors:
adult$workclass <- as.character(adult$workclass)
adult$workclass[adult$workclass == "Without-pay" |
adult$workclass == "Never-worked"] <- "Jobless"
adult$workclass[adult$workclass == "State-gov" |
adult$workclass == "Local-gov"] <- "govt"
adult$workclass[adult$workclass == "Self-emp-inc" |
adult$workclass == "Self-emp-not-inc"] <- "Self-employed"
table(adult$workclass)
? Federal-gov govt Jobless Private Self-employed
2799 1432 5117 31 33906 5557
table(adult$marital.status)
Divorced Married-AF-spouse Married-civ-spouse Married-spouse-absent Never-married
6633 37 22379 628 16117
Separated Widowed
1530 1518
We can reduce these factors into the following groups:
adult$marital.status <- as.character(adult$marital.status)
adult$marital.status[adult$marital.status == "Married-AF-spouse" |
adult$marital.status == "Married-civ-spouse" |
adult$marital.status == "Married-spouse-absent"] <- "Married"
adult$marital.status[adult$marital.status == "Divorced" |
adult$marital.status == "Separated" |
adult$marital.status == "Widowed"] <- "Not-Married"
table(adult$marital.status)
Married Never-married Not-Married
23044 16117 9681
There are a lot of countries here, we can reduce them to their respective regions.
adult$native.country <- as.character(adult$native.country)
north.america <- c("Canada", "Cuba", "Dominican-Republic", "El-Salvador", "Guatemala",
"Haiti", "Honduras", "Jamaica", "Mexico", "Nicaragua",
"Outlying-US(Guam-USVI-etc)", "Puerto-Rico", "Trinadad&Tobago",
"United-States")
asia <- c("Cambodia", "China", "Hong", "India", "Iran", "Japan", "Laos",
"Philippines", "Taiwan", "Thailand", "Vietnam")
south.america <- c("Columbia", "Ecuador", "Peru")
europe <- c("England", "France", "Germany", "Greece", "Holand-Netherlands",
"Hungary", "Ireland", "Italy", "Poland", "Portugal", "Scotland",
"Yugoslavia")
other <- c("South", "?")
adult$native.country[adult$native.country %in% north.america] <- "North-America"
adult$native.country[adult$native.country %in% asia] <- "Asia"
adult$native.country[adult$native.country %in% south.america] <- "South-America"
adult$native.country[adult$native.country %in% europe] <- "Europe"
adult$native.country[adult$native.country %in% other] <- "Other"
table(adult$native.country)
Asia Europe North-America Other South-America
981 780 45933 972 176
Now we can revert the altered columns back to factors since we had to change them to characters:
During the data cleaning we can see that there were some values with just a “?”. We can convert these values to NA so we can deal with it in a more efficient manner.
table(adult$workclass)
? Federal-gov govt Jobless Private Self-employed
2799 1432 5117 31 33906 5557
adult[adult == "?"] <- NA
table(adult$workclass)
Federal-gov govt Jobless Private Self-employed
1432 5117 31 33906 5557
adult <- na.omit(adult)
NA values have been omitted from the dataset.
First we’ll plot a histogram of ages that is colored by income.
library(ggplot2)
ggplot(adult, aes(age)) + geom_histogram(aes(fill = income), color = "black",
binwidth = 1)
Here the coloring is indicative of percentage. From this plot we can see that the percentage of people who make above 50K peaks out at roughly 35% between ages 30 and 50. Next we will plot a histogram of hours worked per week.
ggplot(adult, aes(hours.per.week)) + geom_histogram(fill = 'darkblue')
It is clear that the highest frequency of hours.per.week occurs at 40. What is the income class by region? First we need to change the name of the country column to region.
library(data.table)
data.table 1.12.2 using 2 threads (see ?getDTthreads). Latest news: r-datatable.com
Attaching package: 㤼㸱data.table㤼㸲
The following objects are masked from 㤼㸱package:dplyr㤼㸲:
between, first, last
The following object is masked from 㤼㸱package:purrr㤼㸲:
transpose
setnames(adult, "native.country", "region")
# Reorder factor levels by count
region.ordered <- reorder(adult$region, adult$region, length)
region.ordered <- factor(region.ordered, levels = rev(levels(region.ordered)))
ggplot(adult, aes(region.ordered)) + geom_bar(aes(fill = income), color = "black")
The purpose of this model is to classify people into two groups, below 50k or above 50k in income. We will build the model using training data, and then predict the salary class using the test group.
library(caTools)
package 㤼㸱caTools㤼㸲 was built under R version 3.6.1
split <- sample.split(adult$income, SplitRatio = 0.8)
train <- subset(adult, split == TRUE)
test <- subset(adult, split == FALSE)
logit <- glm(income ~ ., family = binomial(), train)
glm.fit: fitted probabilities numerically 0 or 1 occurred
Let’s break down what the code means. glm is the generalized linear model we will be using. income ~ . means that we want to model income using (~) every available feature (.). family = binomial() is used because we are predicting a binary outcome, below 50k or above 50k.
predict<- predict(logit, train, type = "response")
prediction from a rank-deficient fit may be misleading
library(ROCR)
package 㤼㸱ROCR㤼㸲 was built under R version 3.6.1Loading required package: gplots
package 㤼㸱gplots㤼㸲 was built under R version 3.6.1
Attaching package: 㤼㸱gplots㤼㸲
The following object is masked from 㤼㸱package:stats㤼㸲:
lowess
ROC_pred = prediction(predict, train$income)
ROC_perf = performance(ROC_pred, "tpr", "fpr")
# Adding threshold labels
plot(ROC_perf, colorize=TRUE, print.cutoffs.at = seq(0,1,0.1), text.adj = c(-0.2, 1.7))
abline(a=0, b=1)
auc_train <- round(as.numeric(performance(ROC_pred, "auc")@y.values),2)
legend(.8, .2, auc_train, title = "AUC", cex=1)
# Making predictions on test set
Pred_Test <- predict(logit, type = "response", newdata = test)
prediction from a rank-deficient fit may be misleading
# Convert probabilities to values using the below
## Based on ROC curve above, selected a threshold of 0.5
test_tab <- table(test$income, Pred_Test > 0.5)
test_tab
FALSE TRUE
<=50K 6434 488
>50K 906 1378
accuracy_test <- round(sum(diag(test_tab))/sum(test_tab),2)
sprintf("Accuracy on test set is %s", accuracy_test)
[1] "Accuracy on test set is 0.85"
Here we are initiliazting predictions on the test data using our logistic regression model, log.model. We specify type = “response” so that we get predicted probabilities instead of probabilities on the logit scale.
We can compare our results using a confusion matrix. Since our predictions are predicted probabilities, we specifiy probabilities that are above or equal to 50% will be TRUE (above 50K) and anything below 50% will be FALSE (below 50K).
table(test$income, Pred_Test >= 0.5)
FALSE TRUE
<=50K 6434 488
>50K 906 1378
From the confusion matrix, we can predict determine the performance of our model.
auc = round(as.numeric(performance(ROCRPredTest, "auc")@y.values),2)
auc
[1] 0.9
How close are the predicted values to the true values?
(9639 + 2116) / (9639 + 744 + 2116 + 1311)
[1] 0.8511948
What is the true positive rate?
9649 / (9639 + 1311)
[1] 0.8811872
Otherwise known as the positive predictive value
9639 / (9639 + 744)
[1] 0.9283444
“Accuracy on test set is 0.85” i.e; 85% to predict the salary class of a person based upon the given information.