In this project we analyze U.S. census data taken from the UCI (University of California at Irvine) Machine Learning Repository. The project is divided into four parts: Cleaning and Preprocessing the Data, Exploratory Data Analysis, Predictive Analysis and Theoretical Background. Our final goal is to build a model which can predict whether the income of a random adult American citizen is less than or greater than $50,000 a year, based on given features such as age, education, occupation, gender, race, etc. We fit four different predictive models: a logistic regression model, a random forest model, a support vector machines (SVM) model and a neural network model. All models achieve approximately the same prediction accuracy.
In the first part of the project we clean and preprocess the dataset. In the second part we use different visualization techniques to conduct a preliminary analysis of the impact of each predictor (also called independent variable, explanatory variable or covariate) on the response variable (also called dependent variable) “income”. In the third part of the project we build predictive models using different algorithms: logistic regression, random forests, support vector machines and neural networks. We test the accuracy of the built models both on the training dataset and on a test dataset. In the last part of the project we provide a theoretical overview of some of the methods that we use.
We work in RStudio; the R version used is 3.3.1.
First we load the packages that we will use:
library(ggplot2)    # plotting
library(plyr)       # data manipulation ("mutate()")
library(gridExtra)  # arranging several plots on one page
library(gmodels)    # cross tabulations ("CrossTable()")
library(grid)       # low-level grid graphics
library(vcd)        # visualizing categorical data
library(scales)     # axis scales and formatting for ggplot2
library(ggthemes)   # additional themes for ggplot2
library(knitr)      # "kable()" tables
Next we download the census dataset. To do this we first set the working directory:
setwd("D:/Data_Science_Projects/Census_DataSet")
Then we download the file containing the training data:
if(!file.exists("./adult.data")){
fileUrl <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
download.file(fileUrl, destfile = "./adult.data")
}
and the file that contains the test data:
if(!file.exists("./adult.test")){
fileUrl <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test"
download.file(fileUrl, destfile = "./adult.test")
}
Our next step is to read the data from the file “adult.data” into the “db.adult” data frame, which is the training dataset:
db.adult <- read.table("adult.data", sep = ",", header = FALSE)
We will read the test data later, in Section 6 – Preprocessing the Test Dataset.
The number of observations in the “db.adult” data frame is:
dim(db.adult)[1]
[1] 32561
and the number of variables is:
dim(db.adult)[2]
[1] 15
We take the names of the variables from the attributes list, which is available at https://archive.ics.uci.edu/ml/datasets/Census+Income or from the file adult.names provided in the UCI repository:
colnames(db.adult) <- c("age", "workclass", "fnlwgt",
"education", "education_num",
"marital_status", "occupation",
"relationship", "race", "sex",
"capital_gain", "capital_loss",
"hours_per_week", "native_country", "income")
Let us take a look at the data frame:
head(db.adult, 10)
age workclass fnlwgt education education_num
1 39 State-gov 77516 Bachelors 13
2 50 Self-emp-not-inc 83311 Bachelors 13
3 38 Private 215646 HS-grad 9
4 53 Private 234721 11th 7
5 28 Private 338409 Bachelors 13
6 37 Private 284582 Masters 14
7 49 Private 160187 9th 5
8 52 Self-emp-not-inc 209642 HS-grad 9
9 31 Private 45781 Masters 14
10 42 Private 159449 Bachelors 13
marital_status occupation relationship race sex
1 Never-married Adm-clerical Not-in-family White Male
2 Married-civ-spouse Exec-managerial Husband White Male
3 Divorced Handlers-cleaners Not-in-family White Male
4 Married-civ-spouse Handlers-cleaners Husband Black Male
5 Married-civ-spouse Prof-specialty Wife Black Female
6 Married-civ-spouse Exec-managerial Wife White Female
7 Married-spouse-absent Other-service Not-in-family Black Female
8 Married-civ-spouse Exec-managerial Husband White Male
9 Never-married Prof-specialty Not-in-family White Female
10 Married-civ-spouse Exec-managerial Husband White Male
capital_gain capital_loss hours_per_week native_country income
1 2174 0 40 United-States <=50K
2 0 0 13 United-States <=50K
3 0 0 40 United-States <=50K
4 0 0 40 United-States <=50K
5 0 0 40 Cuba <=50K
6 0 0 40 United-States <=50K
7 0 0 16 Jamaica <=50K
8 0 0 45 United-States >50K
9 14084 0 50 United-States >50K
10 5178 0 40 United-States >50K
and its structure:
str(db.adult, vec.len = 2, strict.width = "no", width = 30)
'data.frame': 32561 obs. of 15 variables:
$ age : int 39 50 38 53 28 ...
$ workclass : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 ...
$ fnlwgt : int 77516 83311 215646 234721 338409 ...
$ education : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 ...
$ education_num : int 13 13 9 7 13 ...
$ marital_status: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 ...
$ occupation : Factor w/ 15 levels " ?"," Adm-clerical",..: 2 5 7 7 11 ...
$ relationship : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 ...
$ race : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 ...
$ sex : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 ...
$ capital_gain : int 2174 0 0 0 0 ...
$ capital_loss : int 0 0 0 0 0 ...
$ hours_per_week: int 40 13 40 40 40 ...
$ native_country: Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 40 6 ...
$ income : Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 ...
As we can see from the output above, the variables “age”, “fnlwgt”, “education_num”, “capital_gain”, “capital_loss” and “hours_per_week” are of type integer, whereas all the other variables are factor variables with different numbers of levels. In order to see what the levels of each factor variable are, we write the function “levels_factors()”, which takes a data frame as an argument, identifies the factor variables and prints the levels of each categorical variable:
levels_factors <- function(mydata) {
    col_names <- names(mydata)
    for (i in 1:length(col_names)) {
        if (is.factor(mydata[, col_names[i]])) {
            message(noquote(paste("Covariate ", "*",
                                  col_names[i], "*",
                                  " with factor levels:",
                                  sep = "")))
            print(levels(mydata[, col_names[i]]))
        }
    }
}
levels_factors(db.adult)
Covariate *workclass* with factor levels:
[1] " ?" " Federal-gov" " Local-gov"
[4] " Never-worked" " Private" " Self-emp-inc"
[7] " Self-emp-not-inc" " State-gov" " Without-pay"
Covariate *education* with factor levels:
[1] " 10th" " 11th" " 12th" " 1st-4th"
[5] " 5th-6th" " 7th-8th" " 9th" " Assoc-acdm"
[9] " Assoc-voc" " Bachelors" " Doctorate" " HS-grad"
[13] " Masters" " Preschool" " Prof-school" " Some-college"
Covariate *marital_status* with factor levels:
[1] " Divorced" " Married-AF-spouse"
[3] " Married-civ-spouse" " Married-spouse-absent"
[5] " Never-married" " Separated"
[7] " Widowed"
Covariate *occupation* with factor levels:
[1] " ?" " Adm-clerical" " Armed-Forces"
[4] " Craft-repair" " Exec-managerial" " Farming-fishing"
[7] " Handlers-cleaners" " Machine-op-inspct" " Other-service"
[10] " Priv-house-serv" " Prof-specialty" " Protective-serv"
[13] " Sales" " Tech-support" " Transport-moving"
Covariate *relationship* with factor levels:
[1] " Husband" " Not-in-family" " Other-relative" " Own-child"
[5] " Unmarried" " Wife"
Covariate *race* with factor levels:
[1] " Amer-Indian-Eskimo" " Asian-Pac-Islander" " Black"
[4] " Other" " White"
Covariate *sex* with factor levels:
[1] " Female" " Male"
Covariate *native_country* with factor levels:
[1] " ?" " Cambodia"
[3] " Canada" " China"
[5] " Columbia" " Cuba"
[7] " Dominican-Republic" " Ecuador"
[9] " El-Salvador" " England"
[11] " France" " Germany"
[13] " Greece" " Guatemala"
[15] " Haiti" " Holand-Netherlands"
[17] " Honduras" " Hong"
[19] " Hungary" " India"
[21] " Iran" " Ireland"
[23] " Italy" " Jamaica"
[25] " Japan" " Laos"
[27] " Mexico" " Nicaragua"
[29] " Outlying-US(Guam-USVI-etc)" " Peru"
[31] " Philippines" " Poland"
[33] " Portugal" " Puerto-Rico"
[35] " Scotland" " South"
[37] " Taiwan" " Thailand"
[39] " Trinadad&Tobago" " United-States"
[41] " Vietnam" " Yugoslavia"
Covariate *income* with factor levels:
[1] " <=50K" " >50K"
From the output above we notice that some of the factor variables have a level denoted by " ?". According to the documentation provided for the census dataset, the values marked with " ?" are missing values.
Before we can proceed with the exploratory data analysis (EDA) and the predictive analysis later on, we have to deal with the missing values. In order to do that, we read the data file “adult.data” again, but with the additional option na.strings = " ?", which means that all " ?" strings will be marked as NA (not available). In general, missing values in R are denoted by NA.
db.adult <- read.table("adult.data",
sep = ",",
header = FALSE,
na.strings = " ?")
colnames(db.adult) <- c("age", "workclass", "fnlwgt", "education",
"education_num", "marital_status", "occupation",
"relationship", "race", "sex", "capital_gain",
"capital_loss", "hours_per_week", "native_country", "income")
After we have marked the missing values, we can remove them with the function “na.omit()”, which deletes all rows (i.e. observations) containing missing values:
db.adult <- na.omit(db.adult)
and after that we renumber the rows of the data frame:
row.names(db.adult) <- 1:nrow(db.adult)
From the summary and the box plot of the variable “hours_per_week”, we see that the mean number of working hours per week is about 41 (marked as a red dot on the box plot), and that at least 50% of the people taking part in the survey work between 40 and 45 hours per week:
summary(db.adult$hours_per_week)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 40.00 40.00 40.93 45.00 99.00
As a side note, here we briefly explain what a box plot is. The bottom and top of the box are the 25th percentile (also called the first quartile) and the 75th percentile (also called the third quartile), respectively. The solid black line inside the box is the median and the red dot marks the mean. In R, the upper whisker\(=\textrm{min}(\textrm{max}(x), Q_3 + 1.5*IQR)\) and the lower whisker\(=\textrm{max}(\textrm{min}(x), Q_1 - 1.5*IQR)\), where \(x\) is the numeric variable of interest, \(Q_1\) and \(Q_3\) are the first and third quartile, respectively, and \(IQR=Q_3 - Q_1\) is the interquartile range. The length of the box is therefore equal to the IQR. The filled black dots are the so-called outliers, i.e. data points that fall below the lower whisker or above the upper whisker.
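To make these definitions concrete, the short sketch below (not part of the original analysis) computes the whisker limits of “hours_per_week” directly from the formulas above:
# Quartiles and IQR of the working hours
q1 <- quantile(db.adult$hours_per_week, 0.25, names = FALSE)   # first quartile
q3 <- quantile(db.adult$hours_per_week, 0.75, names = FALSE)   # third quartile
iqr <- q3 - q1                                                 # interquartile range
# Whisker limits as defined above
upper.whisker <- min(max(db.adult$hours_per_week), q3 + 1.5 * iqr)
lower.whisker <- max(min(db.adult$hours_per_week), q1 - 1.5 * iqr)
c(lower = lower.whisker, upper = upper.whisker)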
From the box plot we also observe that there are a lot of outliers:
ggplot(aes(x = factor(0), y = hours_per_week),
data = db.adult) +
geom_boxplot() +
stat_summary(fun.y = mean,
geom = 'point',
shape = 19,
color = "red",
cex = 2) +
scale_x_discrete(breaks = NULL) +
scale_y_continuous(breaks = seq(0, 100, 5)) +
xlab(label = "") +
ylab(label = "Working hours per week") +
ggtitle("Box Plot of Working Hours per Week")
Therefore, we will group the working hours into 5 categories that we consider relevant: less than 40, between 40 and 45, between 45 and 60, between 60 and 80, and more than 80 hours per week. We will create a new factor variable called “hours_w” with 5 levels corresponding to these 5 categories:
db.adult$hours_w[db.adult$hours_per_week < 40] <- " less_than_40"
db.adult$hours_w[db.adult$hours_per_week >= 40 &
db.adult$hours_per_week <= 45] <- " between_40_and_45"
db.adult$hours_w[db.adult$hours_per_week > 45 &
db.adult$hours_per_week <= 60 ] <- " between_45_and_60"
db.adult$hours_w[db.adult$hours_per_week > 60 &
db.adult$hours_per_week <= 80 ] <- " between_60_and_80"
db.adult$hours_w[db.adult$hours_per_week > 80] <- " more_than_80"
Note that we define the category " between_40_and_45" with greater-than-or-equal and less-than-or-equal signs, because we want the range of this category to correspond to the span of the data points between the first and the third quartile.
We want the new variable “hours_w” to be a factor variable, so we use the “factor()” function:
db.adult$hours_w <- factor(db.adult$hours_w,
ordered = FALSE,
levels = c(" less_than_40",
" between_40_and_45",
" between_45_and_60",
" between_60_and_80",
" more_than_80"))
From the summary below we can see how many people belong to each category of the factor variable “hours_w”:
summary(db.adult$hours_w)
less_than_40 between_40_and_45 between_45_and_60
6714 16606 5790
between_60_and_80 more_than_80
857 195
As already stated, the majority of people work between 40 and 45 hours a week, but there is also a considerable number of participants who work between 45 and 60 hours per week, as well as less than 40 hours a week. As percentages we have the following:
for(i in 1:length(summary(db.adult$hours_w))){
print(round(100*summary(db.adult$hours_w)[i]/sum(!is.na(db.adult$hours_w)), 2))
}
less_than_40
22.26
between_40_and_45
55.06
between_45_and_60
19.2
between_60_and_80
2.84
more_than_80
0.65
As we can see, 55% of all people work between 40 and 45 hours a week, 22.3% work less than 40 hours, 19.2% work between 45 and 60 hours, 2.8% work between 60 and 80 hours, and 0.65% work more than 80 hours a week.
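The same percentages can also be obtained in a single vectorized call, without the loop; the line below is just an alternative sketch, not the code used above:
# Percentage of observations in each "hours_w" category (alternative to the loop)
round(100 * prop.table(table(db.adult$hours_w)), 2)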
With the help of the “levels()” function, we can see that the factor variable “native_country” has 41 levels. If we build a (logistic regression) predictive model with “native_country” as a covariate, we will end up with 40 additional degrees of freedom due to this categorical variable alone. This will unnecessarily complicate the analysis and might lead to overfitting. Hence, it is better to group the native countries into several global regions. This coarsening of the data also makes the results easier to interpret.
levels(db.adult$native_country)
[1] " Cambodia" " Canada"
[3] " China" " Columbia"
[5] " Cuba" " Dominican-Republic"
[7] " Ecuador" " El-Salvador"
[9] " England" " France"
[11] " Germany" " Greece"
[13] " Guatemala" " Haiti"
[15] " Holand-Netherlands" " Honduras"
[17] " Hong" " Hungary"
[19] " India" " Iran"
[21] " Ireland" " Italy"
[23] " Jamaica" " Japan"
[25] " Laos" " Mexico"
[27] " Nicaragua" " Outlying-US(Guam-USVI-etc)"
[29] " Peru" " Philippines"
[31] " Poland" " Portugal"
[33] " Puerto-Rico" " Scotland"
[35] " South" " Taiwan"
[37] " Thailand" " Trinadad&Tobago"
[39] " United-States" " Vietnam"
[41] " Yugoslavia"
Below we create the new variable “native_region”, where we group the countries by global regions. We first define the regions:
Asia_East <- c(" Cambodia", " China", " Hong", " Laos", " Thailand",
" Japan", " Taiwan", " Vietnam")
Asia_Central <- c(" India", " Iran")
Central_America <- c(" Cuba", " Guatemala", " Jamaica", " Nicaragua",
" Puerto-Rico", " Dominican-Republic", " El-Salvador",
" Haiti", " Honduras", " Mexico", " Trinadad&Tobago")
South_America <- c(" Ecuador", " Peru", " Columbia")
Europe_West <- c(" England", " Germany", " Holand-Netherlands", " Ireland",
" France", " Greece", " Italy", " Portugal", " Scotland")
Europe_East <- c(" Poland", " Yugoslavia", " Hungary")
Then we modify the data frame by adding an additional column named “native_region”. We do this with the help of the “mutate()” function from the “plyr” package:
db.adult <- mutate(db.adult,
native_region = ifelse(native_country %in% Asia_East, " East-Asia",
ifelse(native_country %in% Asia_Central, " Central-Asia",
ifelse(native_country %in% Central_America, " Central-America",
ifelse(native_country %in% South_America, " South-America",
ifelse(native_country %in% Europe_West, " Europe-West",
ifelse(native_country %in% Europe_East, " Europe-East",
ifelse(native_country == " United-States", " United-States",
" Outlying-US" ))))))))
Next we transform the new variable into a factor:
db.adult$native_region <- factor(db.adult$native_region, ordered = FALSE)
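As a side note, the nested “ifelse()” calls can become hard to read when there are many regions. An equivalent alternative (just a sketch using the region vectors defined above, not the approach followed in the project) is to build a named lookup vector and index into it:
# Named lookup vector: country name -> region name (sketch only)
region.lookup <- c(setNames(rep(" East-Asia", length(Asia_East)), Asia_East),
                   setNames(rep(" Central-Asia", length(Asia_Central)), Asia_Central),
                   setNames(rep(" Central-America", length(Central_America)), Central_America),
                   setNames(rep(" South-America", length(South_America)), South_America),
                   setNames(rep(" Europe-West", length(Europe_West)), Europe_West),
                   setNames(rep(" Europe-East", length(Europe_East)), Europe_East),
                   setNames(" United-States", " United-States"))
region.alt <- region.lookup[as.character(db.adult$native_country)]
# Countries not covered by the lookup (e.g. " South") fall back to " Outlying-US"
region.alt[is.na(region.alt)] <- " Outlying-US"
# Should coincide with the variable created above
all(region.alt == as.character(db.adult$native_region))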
Now we are going to create two new categorical variables, “cap_gain” and “cap_loss”. We generate the new variables by grouping the existing variables “capital_gain” and “capital_loss” into three categories each. We do this because the variables “capital_gain” and “capital_loss” consist mostly of zeros, which can seriously distort the analysis.
From the summaries below we see that at least 75% of the values of “capital_gain” and “capital_loss” are zeros.
summary(db.adult$capital_gain)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 0 0 1092 0 100000
summary(db.adult$capital_loss)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 0.00 88.37 0.00 4356.00
More precisely, the percentage of zeros in “capital_gain” is very high, 91.59%:
(nrow(subset(db.adult, db.adult$capital_gain == 0))/nrow(db.adult))*100
[1] 91.58544
and the percentage of zeros in “capital_loss” is also very high – 95.27%:
(nrow(subset(db.adult, db.adult$capital_loss == 0))/nrow(db.adult))*100
[1] 95.26888
As we also saw from the summary above, the mean values of “capital_gain” and “capital_loss” with the zero values included are, respectively:
mean.gain <- mean(db.adult$capital_gain)
mean.loss <- mean(db.adult$capital_loss)
kable(data.frame(Mean_Capital_Gain = mean.gain, Mean_Capital_Loss = mean.loss),
caption = "Mean Capital with Zero Values Included")
| Mean_Capital_Gain | Mean_Capital_Loss |
|------------------:|------------------:|
| 1092.008 | 88.37249 |
We also give the mean capital gain and loss with all zero values removed:
mean.gain <- mean(subset(db.adult$capital_gain, db.adult$capital_gain > 0))
mean.loss <- mean(subset(db.adult$capital_loss, db.adult$capital_loss > 0))
kable(data.frame(Mean_Capital_Gain = mean.gain, Mean_Capital_Loss = mean.loss),
caption = "Mean Capital Only for Nonzero Values")
| Mean_Capital_Gain | Mean_Capital_Loss |
|------------------:|------------------:|
| 12977.6 | 1867.898 |
Next we show a summary of the nonzero values of “capital_loss” and “capital_gain”, as well as their respective interquartile ranges (IQR):
iqr.gain <- IQR(subset(db.adult$capital_gain, db.adult$capital_gain > 0))
iqr.loss <- IQR(subset(db.adult$capital_loss, db.adult$capital_loss > 0))
q.gain <- quantile(x = subset(db.adult$capital_gain, db.adult$capital_gain > 0),
probs = seq(0, 1, 0.25))
q.loss <- quantile(x = subset(db.adult$capital_loss, db.adult$capital_loss > 0),
probs = seq(0, 1, 0.25))
kable(x = data.frame(Capital_Gain = q.gain, Capital_Loss = q.loss),
caption = "Quantiles of the Nonzero Capital")
|      | Capital_Gain | Capital_Loss |
|:-----|-------------:|-------------:|
| 0%   | 114 | 155 |
| 25%  | 3464 | 1672 |
| 50%  | 7298 | 1887 |
| 75%  | 14084 | 1977 |
| 100% | 99999 | 4356 |
kable(x = data.frame(IQR_Capital_Gain = iqr.gain, IQR_Capital_Loss = iqr.loss),
caption = "IQR of the Nonzero Capital")
| IQR_Capital_Gain | IQR_Capital_Loss |
|-----------------:|-----------------:|
| 10620 | 305 |
We notice that the IQR of the nonzero “capital_gain” values is much bigger than that of “capital_loss”. We display the box plot (also known as box-and-whisker plot) of the nonzero “capital_gain”. To make a box plot with “ggplot” for a single quantitative variable, we need to create a fake grouping variable. Hence in the aesthetics function we have “x=factor(0)”. The mean is marked with a red dot:
ggplot(aes(x = factor(0), y = capital_gain),
data = subset(db.adult, db.adult$capital_gain > 0)) +
geom_boxplot() +
stat_summary(fun.y = mean,
geom = 'point',
shape = 19,
color = "red",
cex = 2) +
scale_x_discrete(breaks = NULL) +
scale_y_continuous(breaks = seq(0, 100000, 5000)) +
ylab("Capital Gain") +
xlab("") +
ggtitle("Box plot of Nonzero Capital Gain")
From the box plot we see that, indeed, the bulk of the data is between 3,000 and 15,000 dollars, and there are a few outliers. Next we show a histogram of the nonzero capital gain:
df <- db.adult[db.adult$capital_gain > 0, ]
ggplot(data = df,
aes(x = capital_gain)) +
geom_histogram(binwidth = 5000,
colour = "black",
fill = "lightblue",
alpha = 0.4) +
scale_y_continuous(breaks = seq(0, 4000, 100)) +
labs(x = "Capital gain", y = "Count") +
ggtitle("Histogram of Nonzero Capital Gain")
The histogram confirms once more what we have already established, namely that the majority of people with positive capital gain have a capital gain between 0 and 25,000 dollars, while there are also about 150 people with a capital gain of around 100,000 dollars. We also note that the most common nonzero capital gain is around 5,000 dollars.
Below is the box plot of the nonzero “capital_loss” values (the red dot marks the mean value):
ggplot(aes(x = factor(0), y = capital_loss),
data = subset(db.adult, db.adult$capital_loss > 0)) +
geom_boxplot() +
stat_summary(fun.y = mean,
geom = 'point',
shape = 19,
color = "red",
cex = 2) +
scale_x_discrete(breaks = NULL) +
scale_y_continuous(breaks = seq(0, 5000, 500)) +
ylab("Capital Loss") +
xlab("") +
ggtitle("Box plot of Nonzero Capital Loss")
We also display a histogram of the nonzero capital loss:
df <- db.adult[db.adult$capital_loss > 0, ]
ggplot(data = df,
aes(x = capital_loss)) +
geom_histogram(colour = "black",
fill = "lightblue",
alpha = 0.4) +
scale_x_continuous(breaks = seq(0, 5000, 250)) +
scale_y_continuous(breaks = seq(0, 450, 50)) +
labs(x = "Capital loss", y = "Count") +
ggtitle("Histogram of Nonzero Capital Loss")
From the box plot we observe that most values are between 1,700 and 2,000 dollars (as already stated) and that there are a lot of outliers, represented by the filled black dots. From the histogram we also see that the most common capital loss is about 1,875 dollars.
The box plots and histograms illustrate visually the results of the summary statistics for the nonzero capital gain and capital loss. Based on these results, we will group the values of the variables “capital_loss” and “capital_gain” into categories and we will create two new factor variables called “cap_gain” and “cap_loss”. The cutoffs correspond approximately to the first and third quartiles of the nonzero values. We do the grouping in the following way:
Capital gain: “Low” for values below 3,464 dollars, “Medium” for values between 3,464 and 14,080 dollars, and “High” for values above 14,080 dollars.
Capital loss: “Low” for values below 1,672 dollars, “Medium” for values between 1,672 and 1,977 dollars, and “High” for values above 1,977 dollars.
We use the function “mutate()” together with “ifelse()” to create the new variables “cap_gain”
db.adult <- mutate(db.adult,
cap_gain = ifelse(db.adult$capital_gain < 3464, " Low",
ifelse(db.adult$capital_gain >= 3464 &
db.adult$capital_gain <= 14080, " Medium", " High")))
db.adult$cap_gain <- factor(db.adult$cap_gain,
ordered = TRUE,
levels = c(" Low", " Medium", " High"))
and “cap_loss”:
db.adult <- mutate(db.adult,
cap_loss = ifelse(db.adult$capital_loss < 1672, " Low",
ifelse(db.adult$capital_loss >= 1672 &
db.adult$capital_loss <= 1977, " Medium", " High")))
db.adult$cap_loss <- factor(db.adult$cap_loss,
ordered = TRUE,
levels = c(" Low", " Medium", " High"))
With the help of the function “summary”, we notice that there is one unused factor level (i.e., there are no observations belonging to this level) in the variable “workclass”, namely the level “Never-worked”:
summary(db.adult$workclass)
Federal-gov Local-gov Never-worked Private
943 2067 0 22286
Self-emp-inc Self-emp-not-inc State-gov Without-pay
1074 2499 1279 14
As we can see from the output above, there are no participants in the survey who never worked in their lifetime. Therefore we remove the factor level “Never-worked” from the categorical variable “workclass”. We do this with the function “droplevels()”, which removes unused factor levels from a factor:
db.adult$workclass <- droplevels(db.adult$workclass)
levels(db.adult$workclass)
[1] " Federal-gov" " Local-gov" " Private"
[4] " Self-emp-inc" " Self-emp-not-inc" " State-gov"
[7] " Without-pay"
If we do not remove this unused level, we will almost certainly run into problems with zero cell counts when, for example, investigating whether there is an association between the categorical predictors.
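To see this concretely (a hypothetical check, not part of the original code), one can cross-tabulate “workclass” against “income”; had the unused level “Never-worked” been kept, this table would have contained a row consisting entirely of zeros:
# Cross-tabulation of workclass and income (illustration only)
with(db.adult, table(workclass, income))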
As already mentioned, the census data comes with a separate test dataset, which we use to assess the out-of-sample accuracy of the predictive models we build.
In what follows we repeat the same preprocessing steps that we applied to the training data frame “db.adult”.
First we read the test dataset:
db.test <- read.table("D:/Data_Science_Projects/Census_DataSet/adult.test",
sep = ",",
header = FALSE,
skip = 1,
na.strings = " ?")
colnames(db.test) <- c("age", "workclass", "fnlwgt", "education",
"education_num", "marital_status", "occupation",
"relationship", "race", "sex", "capital_gain",
"capital_loss", "hours_per_week",
"native_country", "income")
Next we remove the missing values from the test data frame:
db.test <- na.omit(db.test)
row.names(db.test) <- 1:nrow(db.test)
Let us take a look at the data frame:
head(db.test, 10)
age workclass fnlwgt education education_num
1 25 Private 226802 11th 7
2 38 Private 89814 HS-grad 9
3 28 Local-gov 336951 Assoc-acdm 12
4 44 Private 160323 Some-college 10
5 34 Private 198693 10th 6
6 63 Self-emp-not-inc 104626 Prof-school 15
7 24 Private 369667 Some-college 10
8 55 Private 104996 7th-8th 4
9 65 Private 184454 HS-grad 9
10 36 Federal-gov 212465 Bachelors 13
marital_status occupation relationship race sex
1 Never-married Machine-op-inspct Own-child Black Male
2 Married-civ-spouse Farming-fishing Husband White Male
3 Married-civ-spouse Protective-serv Husband White Male
4 Married-civ-spouse Machine-op-inspct Husband Black Male
5 Never-married Other-service Not-in-family White Male
6 Married-civ-spouse Prof-specialty Husband White Male
7 Never-married Other-service Unmarried White Female
8 Married-civ-spouse Craft-repair Husband White Male
9 Married-civ-spouse Machine-op-inspct Husband White Male
10 Married-civ-spouse Adm-clerical Husband White Male
capital_gain capital_loss hours_per_week native_country income
1 0 0 40 United-States <=50K.
2 0 0 50 United-States <=50K.
3 0 0 40 United-States >50K.
4 7688 0 40 United-States >50K.
5 0 0 30 United-States <=50K.
6 3103 0 32 United-States >50K.
7 0 0 40 United-States <=50K.
8 0 0 10 United-States <=50K.
9 6418 0 40 United-States >50K.
10 0 0 40 United-States <=50K.
From the display of the first 10 observations of the data frame “db.test” we notice that the names of the levels of the factor variable “income” differ from the respective names in the training dataset “db.adult” by a trailing “.” symbol. Therefore we remove the “.” from the names of the factor levels of “income” in “db.test”:
levels(db.test$income)[1] <- " <=50K"
levels(db.test$income)[2] <- " >50K"
levels(db.test$income)
[1] " <=50K" " >50K"
Just as in the training dataset “db.adult”, we create a new variable called “hours_w”:
db.test$hours_w[db.test$hours_per_week < 40] <- " less_than_40"
db.test$hours_w[db.test$hours_per_week >= 40 &
db.test$hours_per_week <= 45] <- " between_40_and_45"
db.test$hours_w[db.test$hours_per_week > 45 &
db.test$hours_per_week <= 60 ] <- " between_45_and_60"
db.test$hours_w[db.test$hours_per_week > 60 &
db.test$hours_per_week <= 80 ] <- " between_60_and_80"
db.test$hours_w[db.test$hours_per_week > 80] <- " more_than_80"
db.test$hours_w <- factor(db.test$hours_w,
ordered = FALSE,
levels = c(" less_than_40",
" between_40_and_45",
" between_45_and_60",
" between_60_and_80",
" more_than_80"))
We also have to create the variable “native_region”, where we group the countries by global regions:
db.test <- mutate(db.test,
native_region = ifelse(native_country %in% Asia_East, " East-Asia",
ifelse(native_country %in% Asia_Central, " Central-Asia",
ifelse(native_country %in% Central_America, " Central-America",
ifelse(native_country %in% South_America, " South-America",
ifelse(native_country %in% Europe_West, " Europe-West",
ifelse(native_country %in% Europe_East, " Europe-East",
ifelse(native_country == " United-States", " United-States",
" Outlying-US" ))))))))
db.test$native_region <- factor(db.test$native_region, ordered = FALSE)
We have to create the variables “cap_gain” and “cap_loss” as well:
db.test <- mutate(db.test,
cap_gain = ifelse(db.test$capital_gain < 3464, " Low",
ifelse(db.test$capital_gain >= 3464 &
db.test$capital_gain <= 14080, " Medium", " High")))
db.test$cap_gain <- factor(db.test$cap_gain,
                           ordered = TRUE,   # same as in the training data frame
                           levels = c(" Low", " Medium", " High"))
db.test<- mutate(db.test,
cap_loss = ifelse(db.test$capital_loss < 1672, " Low",
ifelse(db.test$capital_loss >= 1672 &
db.test$capital_loss <= 1977, " Medium", " High")))
db.test$cap_loss <- factor(db.test$cap_loss,
                           ordered = TRUE,   # same as in the training data frame
                           levels = c(" Low", " Medium", " High"))
Last, we drop the unused level “Never-worked” from the factor variable “workclass”:
db.test$workclass <- droplevels(db.test$workclass)
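Before exporting, a sanity check of the following kind (a sketch, not part of the original code) can be used to compare the factor levels of each categorical variable in the training and the test data frames; any variable for which the levels differ (for example, a native country that appears only in the training data) would show up as FALSE and may need special care in the predictive models:
# Compare factor levels between training and test data (sketch only)
factor.vars <- names(db.adult)[sapply(db.adult, is.factor)]
sapply(factor.vars,
       function(v) identical(levels(db.adult[[v]]), levels(db.test[[v]])))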
Finally, we export the cleaned and preprocessed train and test datasets into the csv files “adult_df.csv” and “test_df.csv”, respectively:
write.csv(db.adult, "adult_df.csv", row.names = FALSE)
write.csv(db.test, "test_df.csv", row.names = FALSE)