In this project we analyze U.S. census data taken from the UCI (University of California at Irvine) Machine Learning Repository. The project is divided into four parts: Cleaning and Preprocessing the Data, Exploratory Data Analysis, Predictive Analysis and Theoretical Background. Our final goal is to build a model which can predict whether the income of a random adult American citizen is less than or greater than $50,000 a year, based on given features such as age, education, occupation, gender, race, etc. We fit four different predictive models: a logistic regression model, a random forest model, a support vector machines (SVM) model and a neural network model. All models achieve approximately the same prediction accuracy.
In the first part of the project we clean and preprocess the dataset. In the second part we use different visualization techniques to conduct a preliminary analysis of the impact of each predictor (also called independent variable, explanatory variable or covariate) on the response variable (also called dependent variable) “income”. In the third part of the project we build predictive models using different algorithms: logistic regression, random forests, support vector machines and neural networks. We test the accuracy of the built models both on the training dataset and on a test dataset. In the last part of the project we provide a theoretical overview of some of the methods that we use.
We work in RStudio; the R version used is 3.3.1.
First we load the packages that we will use:
library(ggplot2)    # plotting
library(plyr)       # data manipulation ("mutate()")
library(gridExtra)  # arranging several plots on one page
library(gmodels)    # cross tabulations ("CrossTable()")
library(grid)       # low-level grid graphics
library(vcd)        # visualizing categorical data
library(scales)     # axis scales and formatting for ggplot2
library(ggthemes)   # additional themes for ggplot2
library(knitr)      # "kable()" tables
Next we download the census dataset. To do this we first set the working directory:
setwd("D:/Data_Science_Projects/Census_DataSet")
Then we download the file containing the training data:
if(!file.exists("./adult.data")){
fileUrl <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
download.file(fileUrl, destfile = "./adult.data")
}
and the file that contains the test data:
if(!file.exists("./adult.test")){
fileUrl <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test"
download.file(fileUrl, destfile = "./adult.test")
}
Our next step is to read the data from the file “adult.data” into the “db.adult” data frame, which is the training dataset:
db.adult <- read.table("adult.data", sep = ",", header = FALSE)
We will read the test data later, in Section 6 – Preprocessing the Test Dataset.
The number of observations in the “db.adult” data frame is:
dim(db.adult)[1]
[1] 32561
and the number of variables is:
dim(db.adult)[2]
[1] 15
We take the names of the variables from the attributes list, which is available at https://archive.ics.uci.edu/ml/datasets/Census+Income or from the file adult.names provided in the UCI repository:
colnames(db.adult) <- c("age", "workclass", "fnlwgt",
"education", "education_num",
"marital_status", "occupation",
"relationship", "race", "sex",
"capital_gain", "capital_loss",
"hours_per_week", "native_country", "income")
Let us take a look at the data frame:
head(db.adult, 10)
age workclass fnlwgt education education_num
1 39 State-gov 77516 Bachelors 13
2 50 Self-emp-not-inc 83311 Bachelors 13
3 38 Private 215646 HS-grad 9
4 53 Private 234721 11th 7
5 28 Private 338409 Bachelors 13
6 37 Private 284582 Masters 14
7 49 Private 160187 9th 5
8 52 Self-emp-not-inc 209642 HS-grad 9
9 31 Private 45781 Masters 14
10 42 Private 159449 Bachelors 13
marital_status occupation relationship race sex
1 Never-married Adm-clerical Not-in-family White Male
2 Married-civ-spouse Exec-managerial Husband White Male
3 Divorced Handlers-cleaners Not-in-family White Male
4 Married-civ-spouse Handlers-cleaners Husband Black Male
5 Married-civ-spouse Prof-specialty Wife Black Female
6 Married-civ-spouse Exec-managerial Wife White Female
7 Married-spouse-absent Other-service Not-in-family Black Female
8 Married-civ-spouse Exec-managerial Husband White Male
9 Never-married Prof-specialty Not-in-family White Female
10 Married-civ-spouse Exec-managerial Husband White Male
capital_gain capital_loss hours_per_week native_country income
1 2174 0 40 United-States <=50K
2 0 0 13 United-States <=50K
3 0 0 40 United-States <=50K
4 0 0 40 United-States <=50K
5 0 0 40 Cuba <=50K
6 0 0 40 United-States <=50K
7 0 0 16 Jamaica <=50K
8 0 0 45 United-States >50K
9 14084 0 50 United-States >50K
10 5178 0 40 United-States >50K
and its structure:
str(db.adult, vec.len = 2, strict.width = "no", width = 30)
'data.frame': 32561 obs. of 15 variables:
$ age : int 39 50 38 53 28 ...
$ workclass : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 ...
$ fnlwgt : int 77516 83311 215646 234721 338409 ...
$ education : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 ...
$ education_num : int 13 13 9 7 13 ...
$ marital_status: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 ...
$ occupation : Factor w/ 15 levels " ?"," Adm-clerical",..: 2 5 7 7 11 ...
$ relationship : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 ...
$ race : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 ...
$ sex : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 ...
$ capital_gain : int 2174 0 0 0 0 ...
$ capital_loss : int 0 0 0 0 0 ...
$ hours_per_week: int 40 13 40 40 40 ...
$ native_country: Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 40 6 ...
$ income : Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 ...
As we can see from the output above, the variables “age”, “fnlwgt”, “education_num”, “capital_gain”, “capital_loss” and “hours_per_week” are of type integer, whereas all the other variables are factor variables with different numbers of levels. In order to see what the levels of each factor variable are, we write the function “levels_factors()”, which takes a data frame as an argument, identifies the factor variables and prints the levels of each categorical variable:
levels_factors <- function(mydata) {
    col_names <- names(mydata)
    for (i in 1:length(col_names)) {
        if (is.factor(mydata[, col_names[i]])) {
            message(noquote(paste("Covariate ", "*",
                                  col_names[i], "*",
                                  " with factor levels:",
                                  sep = "")))
            print(levels(mydata[, col_names[i]]))
        }
    }
}
levels_factors(db.adult)
Covariate *workclass* with factor levels:
[1] " ?" " Federal-gov" " Local-gov"
[4] " Never-worked" " Private" " Self-emp-inc"
[7] " Self-emp-not-inc" " State-gov" " Without-pay"
Covariate *education* with factor levels:
[1] " 10th" " 11th" " 12th" " 1st-4th"
[5] " 5th-6th" " 7th-8th" " 9th" " Assoc-acdm"
[9] " Assoc-voc" " Bachelors" " Doctorate" " HS-grad"
[13] " Masters" " Preschool" " Prof-school" " Some-college"
Covariate *marital_status* with factor levels:
[1] " Divorced" " Married-AF-spouse"
[3] " Married-civ-spouse" " Married-spouse-absent"
[5] " Never-married" " Separated"
[7] " Widowed"
Covariate *occupation* with factor levels:
[1] " ?" " Adm-clerical" " Armed-Forces"
[4] " Craft-repair" " Exec-managerial" " Farming-fishing"
[7] " Handlers-cleaners" " Machine-op-inspct" " Other-service"
[10] " Priv-house-serv" " Prof-specialty" " Protective-serv"
[13] " Sales" " Tech-support" " Transport-moving"
Covariate *relationship* with factor levels:
[1] " Husband" " Not-in-family" " Other-relative" " Own-child"
[5] " Unmarried" " Wife"
Covariate *race* with factor levels:
[1] " Amer-Indian-Eskimo" " Asian-Pac-Islander" " Black"
[4] " Other" " White"
Covariate *sex* with factor levels:
[1] " Female" " Male"
Covariate *native_country* with factor levels:
[1] " ?" " Cambodia"
[3] " Canada" " China"
[5] " Columbia" " Cuba"
[7] " Dominican-Republic" " Ecuador"
[9] " El-Salvador" " England"
[11] " France" " Germany"
[13] " Greece" " Guatemala"
[15] " Haiti" " Holand-Netherlands"
[17] " Honduras" " Hong"
[19] " Hungary" " India"
[21] " Iran" " Ireland"
[23] " Italy" " Jamaica"
[25] " Japan" " Laos"
[27] " Mexico" " Nicaragua"
[29] " Outlying-US(Guam-USVI-etc)" " Peru"
[31] " Philippines" " Poland"
[33] " Portugal" " Puerto-Rico"
[35] " Scotland" " South"
[37] " Taiwan" " Thailand"
[39] " Trinadad&Tobago" " United-States"
[41] " Vietnam" " Yugoslavia"
Covariate *income* with factor levels:
[1] " <=50K" " >50K"
From the output above we notice that some of the factor variables have a level denoted by " ?". According to the documentation provided for the census dataset, the values marked with " ?" are missing values.
Before we can proceed with the exploratory data analysis (EDA) and the predictive analysis later on, we have to deal with the missing values. In order to do that, we read the data file “adult.data” again, but with the additional option na.strings = " ?", which means that all " ?" strings will be marked as NA (not available). In general, missing values in R are denoted by NA.
db.adult <- read.table("adult.data",
sep = ",",
header = FALSE,
na.strings = " ?")
colnames(db.adult) <- c("age", "workclass", "fnlwgt", "education",
"education_num", "marital_status", "occupation",
"relationship", "race", "sex", "capital_gain",
"capital_loss", "hours_per_week", "native_country", "income")
After we have marked the missing values, we can remove them with the function “na.omit()”, which deletes all rows (i.e. observations) containing missing values:
db.adult <- na.omit(db.adult)
and after that we renumber the rows of the data frame:
row.names(db.adult) <- 1:nrow(db.adult)
From the summary and the box plot of the variable “hours_per_week”, we see that the mean number of working hours per week is about 41 (marked as a red dot on the box plot), and that at least 50% of the people taking part in the survey work between 40 and 45 hours per week:
summary(db.adult$hours_per_week)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 40.00 40.00 40.93 45.00 99.00
As a side note, here we briefly explain what a box plot is. The bottom and top of the box are the 25th percentile (also called the first quartile) and the 75th percentile (also called the third quartile), respectively. The solid black line inside the box is the median and the red dot marks the mean. In R, the upper whisker\(=\textrm{min}(\textrm{max}(x), Q_3 + 1.5*IQR)\) and the lower whisker\(=\textrm{max}(\textrm{min}(x), Q_1 - 1.5*IQR)\), where \(x\) is the numeric variable of interest, \(Q_1\) and \(Q_3\) are the first and third quartile, respectively, and \(IQR=Q_3 - Q_1\) is the interquartile range. The length of the box is therefore equal to the IQR. The filled black dots are the so-called outliers, i.e. data points that fall below the lower whisker or above the upper whisker.
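To make these definitions concrete, the short sketch below (not part of the original analysis) computes the whisker limits of “hours_per_week” directly from the formulas above:
# Quartiles and IQR of the working hours
q1 <- quantile(db.adult$hours_per_week, 0.25, names = FALSE)   # first quartile
q3 <- quantile(db.adult$hours_per_week, 0.75, names = FALSE)   # third quartile
iqr <- q3 - q1                                                 # interquartile range
# Whisker limits as defined above
upper.whisker <- min(max(db.adult$hours_per_week), q3 + 1.5 * iqr)
lower.whisker <- max(min(db.adult$hours_per_week), q1 - 1.5 * iqr)
c(lower = lower.whisker, upper = upper.whisker)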
From the box plot we also observe that there are a lot of outliers:
ggplot(aes(x = factor(0), y = hours_per_week),
data = db.adult) +
geom_boxplot() +
stat_summary(fun.y = mean,
geom = 'point',
shape = 19,
color = "red",
cex = 2) +
scale_x_discrete(breaks = NULL) +
scale_y_continuous(breaks = seq(0, 100, 5)) +
xlab(label = "") +
ylab(label = "Working hours per week") +
ggtitle("Box Plot of Working Hours per Week")
Therefore, we will group the working hours into 5 categories that we consider relevant: less than 40, between 40 and 45, between 45 and 60, between 60 and 80, and more than 80 hours per week. We will create a new factor variable called “hours_w” with 5 levels corresponding to these 5 categories:
db.adult$hours_w[db.adult$hours_per_week < 40] <- " less_than_40"
db.adult$hours_w[db.adult$hours_per_week >= 40 &
db.adult$hours_per_week <= 45] <- " between_40_and_45"
db.adult$hours_w[db.adult$hours_per_week > 45 &
db.adult$hours_per_week <= 60 ] <- " between_45_and_60"
db.adult$hours_w[db.adult$hours_per_week > 60 &
db.adult$hours_per_week <= 80 ] <- " between_60_and_80"
db.adult$hours_w[db.adult$hours_per_week > 80] <- " more_than_80"
Note that we define the category " between_40_and_45" with greater-than-or-equal and less-than-or-equal signs, because we want the range of this category to correspond to the span of the data points between the first and the third quartile.
We want the new variable “hours_w” to be a factor variable, so we use the “factor()” function:
db.adult$hours_w <- factor(db.adult$hours_w,
ordered = FALSE,
levels = c(" less_than_40",
" between_40_and_45",
" between_45_and_60",
" between_60_and_80",
" more_than_80"))
From the summary below we can see how many people belong to each category of the factor variable “hours_w”:
summary(db.adult$hours_w)
less_than_40 between_40_and_45 between_45_and_60
6714 16606 5790
between_60_and_80 more_than_80
857 195
As already stated, the majority of people work between 40 and 45 hours a week, but there is also a considerable number of participants who work between 45 and 60 hours per week, as well as less than 40 hours a week. As percentages we have the following:
for(i in 1:length(summary(db.adult$hours_w))){
print(round(100*summary(db.adult$hours_w)[i]/sum(!is.na(db.adult$hours_w)), 2))
}
less_than_40
22.26
between_40_and_45
55.06
between_45_and_60
19.2
between_60_and_80
2.84
more_than_80
0.65
As we can see, 55% of all people work between 40 and 45 hours a week, 22.3% work less than 40 hours, 19.2% work between 45 and 60 hours, 2.8% work between 60 and 80 hours, and 0.65% work more than 80 hours a week.
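The same percentages can also be obtained in a single vectorized call, without the loop; the line below is just an alternative sketch, not the code used above:
# Percentage of observations in each "hours_w" category (alternative to the loop)
round(100 * prop.table(table(db.adult$hours_w)), 2)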
With the help of the “levels()” function, we can see that the factor variable “native_country” has 41 levels. If we build a (logistic regression) predictive model with “native_country” as a covariate, we will end up with 40 additional degrees of freedom due to this categorical variable alone. This will unnecessarily complicate the analysis and might lead to overfitting. Hence, it is better to group the native countries into several global regions. This coarsening of the data also makes the results easier to interpret.
levels(db.adult$native_country)
[1] " Cambodia" " Canada"
[3] " China" " Columbia"
[5] " Cuba" " Dominican-Republic"
[7] " Ecuador" " El-Salvador"
[9] " England" " France"
[11] " Germany" " Greece"
[13] " Guatemala" " Haiti"
[15] " Holand-Netherlands" " Honduras"
[17] " Hong" " Hungary"
[19] " India" " Iran"
[21] " Ireland" " Italy"
[23] " Jamaica" " Japan"
[25] " Laos" " Mexico"
[27] " Nicaragua" " Outlying-US(Guam-USVI-etc)"
[29] " Peru" " Philippines"
[31] " Poland" " Portugal"
[33] " Puerto-Rico" " Scotland"
[35] " South" " Taiwan"
[37] " Thailand" " Trinadad&Tobago"
[39] " United-States" " Vietnam"
[41] " Yugoslavia"
Below we create the new variable “native_region”, where we group the countries by global regions. We first define the regions:
Asia_East <- c(" Cambodia", " China", " Hong", " Laos", " Thailand",
" Japan", " Taiwan", " Vietnam")
Asia_Central <- c(" India", " Iran")
Central_America <- c(" Cuba", " Guatemala", " Jamaica", " Nicaragua",
" Puerto-Rico", " Dominican-Republic", " El-Salvador",
" Haiti", " Honduras", " Mexico", " Trinadad&Tobago")
South_America <- c(" Ecuador", " Peru", " Columbia")
Europe_West <- c(" England", " Germany", " Holand-Netherlands", " Ireland",
" France", " Greece", " Italy", " Portugal", " Scotland")
Europe_East <- c(" Poland", " Yugoslavia", " Hungary")
Then we modify the data frame by adding an additional column named “native_region”. We do this with the help of the “mutate()” function from the “plyr” package:
db.adult <- mutate(db.adult,
native_region = ifelse(native_country %in% Asia_East, " East-Asia",
ifelse(native_country %in% Asia_Central, " Central-Asia",
ifelse(native_country %in% Central_America, " Central-America",
ifelse(native_country %in% South_America, " South-America",
ifelse(native_country %in% Europe_West, " Europe-West",
ifelse(native_country %in% Europe_East, " Europe-East",
ifelse(native_country == " United-States", " United-States",
" Outlying-US" ))))))))
Next we transform the new variable into a factor:
db.adult$native_region <- factor(db.adult$native_region, ordered = FALSE)
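As a side note, the nested “ifelse()” calls can become hard to read when there are many regions. An equivalent alternative (just a sketch using the region vectors defined above, not the approach followed in the project) is to build a named lookup vector and index into it:
# Named lookup vector: country name -> region name (sketch only)
region.lookup <- c(setNames(rep(" East-Asia", length(Asia_East)), Asia_East),
                   setNames(rep(" Central-Asia", length(Asia_Central)), Asia_Central),
                   setNames(rep(" Central-America", length(Central_America)), Central_America),
                   setNames(rep(" South-America", length(South_America)), South_America),
                   setNames(rep(" Europe-West", length(Europe_West)), Europe_West),
                   setNames(rep(" Europe-East", length(Europe_East)), Europe_East),
                   setNames(" United-States", " United-States"))
region.alt <- region.lookup[as.character(db.adult$native_country)]
# Countries not covered by the lookup (e.g. " South") fall back to " Outlying-US"
region.alt[is.na(region.alt)] <- " Outlying-US"
# Should coincide with the variable created above
all(region.alt == as.character(db.adult$native_region))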
Now we are going to create two new categorical variables, “cap_gain” and “cap_loss”. We generate the new variables by grouping the existing variables “capital_gain” and “capital_loss” into three categories each. We do this because the variables “capital_gain” and “capital_loss” consist mostly of zeros, which can seriously distort the analysis.
From the summaries below we see that at least 75% of the values of “capital_gain” and “capital_loss” are zeros.
summary(db.adult$capital_gain)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 0 0 1092 0 100000
summary(db.adult$capital_loss)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 0.00 88.37 0.00 4356.00
More precisely, the percentage of zeros in “capital_gain” is very high, 91.59%:
(nrow(subset(db.adult, db.adult$capital_gain == 0))/nrow(db.adult))*100
[1] 91.58544
and the percentage of zeros in “capital_loss” is also very high – 95.27%:
(nrow(subset(db.adult, db.adult$capital_loss == 0))/nrow(db.adult))*100
[1] 95.26888
As we also saw from the summary above, the mean values of “capital_gain” and “capital_loss” with the zero values included are, respectively:
mean.gain <- mean(db.adult$capital_gain)
mean.loss <- mean(db.adult$capital_loss)
kable(data.frame(Mean_Capital_Gain = mean.gain, Mean_Capital_Loss = mean.loss),
caption = "Mean Capital with Zero Values Included")
| Mean_Capital_Gain | Mean_Capital_Loss |
|------------------:|------------------:|
| 1092.008 | 88.37249 |
We also give the mean capital gain and loss with all zero values removed:
mean.gain <- mean(subset(db.adult$capital_gain, db.adult$capital_gain > 0))
mean.loss <- mean(subset(db.adult$capital_loss, db.adult$capital_loss > 0))
kable(data.frame(Mean_Capital_Gain = mean.gain, Mean_Capital_Loss = mean.loss),
caption = "Mean Capital Only for Nonzero Values")
| Mean_Capital_Gain | Mean_Capital_Loss |
|------------------:|------------------:|
| 12977.6 | 1867.898 |
Next we show a summary of the nonzero values of “capital_loss” and “capital_gain”, as well as their respective interquartile ranges (IQR):
iqr.gain <- IQR(subset(db.adult$capital_gain, db.adult$capital_gain > 0))
iqr.loss <- IQR(subset(db.adult$capital_loss, db.adult$capital_loss > 0))
q.gain <- quantile(x = subset(db.adult$capital_gain, db.adult$capital_gain > 0),
probs = seq(0, 1, 0.25))
q.loss <- quantile(x = subset(db.adult$capital_loss, db.adult$capital_loss > 0),
probs = seq(0, 1, 0.25))
kable(x = data.frame(Capital_Gain = q.gain, Capital_Loss = q.loss),
caption = "Quantiles of the Nonzero Capital")
|      | Capital_Gain | Capital_Loss |
|:-----|-------------:|-------------:|
| 0%   | 114 | 155 |
| 25%  | 3464 | 1672 |
| 50%  | 7298 | 1887 |
| 75%  | 14084 | 1977 |
| 100% | 99999 | 4356 |
kable(x = data.frame(IQR_Capital_Gain = iqr.gain, IQR_Capital_Loss = iqr.loss),
caption = "IQR of the Nonzero Capital")
| IQR_Capital_Gain | IQR_Capital_Loss |
|-----------------:|-----------------:|
| 10620 | 305 |
We notice that the IQR of the nonzero “capital_gain” values is much bigger than that of “capital_loss”. We display the box plot (also known as box-and-whisker plot) of the nonzero “capital_gain”. To make a box plot with “ggplot” for a single quantitative variable, we need to create a fake grouping variable. Hence in the aesthetics function we have “x=factor(0)”. The mean is marked with a red dot:
ggplot(aes(x = factor(0), y = capital_gain),
data = subset(db.adult, db.adult$capital_gain > 0)) +
geom_boxplot() +
stat_summary(fun.y = mean,
geom = 'point',
shape = 19,
color = "red",
cex = 2) +
scale_x_discrete(breaks = NULL) +
scale_y_continuous(breaks = seq(0, 100000, 5000)) +
ylab("Capital Gain") +
xlab("") +
ggtitle("Box plot of Nonzero Capital Gain")
From the box plot we see that, indeed, the bulk of the data is between 3,000 and 15,000 dollars, and there are a few outliers. Next we show a histogram of the nonzero capital gain:
df <- db.adult[db.adult$capital_gain > 0, ]
ggplot(data = df,
aes(x = capital_gain)) +
geom_histogram(binwidth = 5000,
colour = "black",
fill = "lightblue",
alpha = 0.4) +
scale_y_continuous(breaks = seq(0, 4000, 100)) +
labs(x = "Capital gain", y = "Count") +
ggtitle("Histogram of Nonzero Capital Gain")
The histogram confirms once more what we have already established, namely that the majority of people with positive capital gain have a capital gain between 0 and 25,000 dollars, while there are also about 150 people with a capital gain of around 100,000 dollars. We also note that the most common nonzero capital gain is around 5,000 dollars.
Below is the box plot of the nonzero “capital_loss” values (the red dot marks the mean value):
ggplot(aes(x = factor(0), y = capital_loss),
data = subset(db.adult, db.adult$capital_loss > 0)) +
geom_boxplot() +
stat_summary(fun.y = mean,
geom = 'point',
shape = 19,
color = "red",
cex = 2) +
scale_x_discrete(breaks = NULL) +
scale_y_continuous(breaks = seq(0, 5000, 500)) +
ylab("Capital Loss") +
xlab("") +
ggtitle("Box plot of Nonzero Capital Loss")
We also display a histogram of the nonzero capital loss:
df <- db.adult[db.adult$capital_loss > 0, ]
ggplot(data = df,
aes(x = capital_loss)) +
geom_histogram(colour = "black",
fill = "lightblue",
alpha = 0.4) +
scale_x_continuous(breaks = seq(0, 5000, 250)) +
scale_y_continuous(breaks = seq(0, 450, 50)) +
labs(x = "Capital loss", y = "Count") +
ggtitle("Histogram of Nonzero Capital Loss")
From the box plot we observe that most values are between 1,700 and 2,000 dollars (as already stated) and that there are a lot of outliers, represented by the filled black dots. From the histogram we also see that the most common capital loss is about 1,875 dollars.
The box plots and histograms illustrate visually the results of the summary statistics for the nonzero capital gain and capital loss. Based on these results, we will group the values of the variables “capital_loss” and “capital_gain” into categories and we will create two new factor variables called “cap_gain” and “cap_loss”. The cutoffs correspond approximately to the first and third quartiles of the nonzero values. We do the grouping in the following way:
Capital gain: “Low” for values below 3,464 dollars, “Medium” for values between 3,464 and 14,080 dollars, and “High” for values above 14,080 dollars.
Capital loss: “Low” for values below 1,672 dollars, “Medium” for values between 1,672 and 1,977 dollars, and “High” for values above 1,977 dollars.
We use the function “mutate()” together with “ifelse()” to create the new variables “cap_gain”
db.adult <- mutate(db.adult,
cap_gain = ifelse(db.adult$capital_gain < 3464, " Low",
ifelse(db.adult$capital_gain >= 3464 &
db.adult$capital_gain <= 14080, " Medium", " High")))
db.adult$cap_gain <- factor(db.adult$cap_gain,
ordered = TRUE,
levels = c(" Low", " Medium", " High"))
and “cap_loss”:
db.adult <- mutate(db.adult,
cap_loss = ifelse(db.adult$capital_loss < 1672, " Low",
ifelse(db.adult$capital_loss >= 1672 &
db.adult$capital_loss <= 1977, " Medium", " High")))
db.adult$cap_loss <- factor(db.adult$cap_loss,
ordered = TRUE,
levels = c(" Low", " Medium", " High"))
With the help of the function “summary”, we notice that there is one unused factor level (i.e., there are no observations belonging to this level) in the variable “workclass”, namely the level “Never-worked”:
summary(db.adult$workclass)
Federal-gov Local-gov Never-worked Private
943 2067 0 22286
Self-emp-inc Self-emp-not-inc State-gov Without-pay
1074 2499 1279 14
As we can see from the output above, there are no participants in the survey who never worked in their lifetime. Therefore we remove the factor level “Never-worked” from the categorical variable “workclass”. We do this with the function “droplevels()”, which removes unused factor levels from a factor:
db.adult$workclass <- droplevels(db.adult$workclass)
levels(db.adult$workclass)
[1] " Federal-gov" " Local-gov" " Private"
[4] " Self-emp-inc" " Self-emp-not-inc" " State-gov"
[7] " Without-pay"
If we do not remove this unused level, we will almost certainly run into problems with zero cell counts when, for example, investigating whether there is an association between the categorical predictors.
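To see this concretely (a hypothetical check, not part of the original code), one can cross-tabulate “workclass” against “income”; had the unused level “Never-worked” been kept, this table would have contained a row consisting entirely of zeros:
# Cross-tabulation of workclass and income (illustration only)
with(db.adult, table(workclass, income))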
As already mentioned, the census data comes with a separate test dataset, which we use to assess the out-of-sample accuracy of the predictive models we build.
In what follows we repeat the same preprocessing steps that we applied to the training data frame “db.adult”.
First we read the test dataset:
db.test <- read.table("D:/Data_Science_Projects/Census_DataSet/adult.test",
sep = ",",
header = FALSE,
skip = 1,
na.strings = " ?")
colnames(db.test) <- c("age", "workclass", "fnlwgt", "education",
"education_num", "marital_status", "occupation",
"relationship", "race", "sex", "capital_gain",
"capital_loss", "hours_per_week",
"native_country", "income")
Next we remove the missing values from the test data frame:
db.test <- na.omit(db.test)
row.names(db.test) <- 1:nrow(db.test)
Let us take a look at the data frame:
head(db.test, 10)
age workclass fnlwgt education education_num
1 25 Private 226802 11th 7
2 38 Private 89814 HS-grad 9
3 28 Local-gov 336951 Assoc-acdm 12
4 44 Private 160323 Some-college 10
5 34 Private 198693 10th 6
6 63 Self-emp-not-inc 104626 Prof-school 15
7 24 Private 369667 Some-college 10
8 55 Private 104996 7th-8th 4
9 65 Private 184454 HS-grad 9
10 36 Federal-gov 212465 Bachelors 13
marital_status occupation relationship race sex
1 Never-married Machine-op-inspct Own-child Black Male
2 Married-civ-spouse Farming-fishing Husband White Male
3 Married-civ-spouse Protective-serv Husband White Male
4 Married-civ-spouse Machine-op-inspct Husband Black Male
5 Never-married Other-service Not-in-family White Male
6 Married-civ-spouse Prof-specialty Husband White Male
7 Never-married Other-service Unmarried White Female
8 Married-civ-spouse Craft-repair Husband White Male
9 Married-civ-spouse Machine-op-inspct Husband White Male
10 Married-civ-spouse Adm-clerical Husband White Male
capital_gain capital_loss hours_per_week native_country income
1 0 0 40 United-States <=50K.
2 0 0 50 United-States <=50K.
3 0 0 40 United-States >50K.
4 7688 0 40 United-States >50K.
5 0 0 30 United-States <=50K.
6 3103 0 32 United-States >50K.
7 0 0 40 United-States <=50K.
8 0 0 10 United-States <=50K.
9 6418 0 40 United-States >50K.
10 0 0 40 United-States <=50K.
From the display of the first 10 observations of the data frame “db.test” we notice that the names of the levels of the factor variable “income” differ from the respective names in the training dataset “db.adult” by a trailing “.” symbol. Therefore we remove the “.” from the names of the factor levels of “income” in “db.test”:
levels(db.test$income)[1] <- " <=50K"
levels(db.test$income)[2] <- " >50K"
levels(db.test$income)
[1] " <=50K" " >50K"
Just as in the training dataset “db.adult”, we create a new variable called “hours_w”:
db.test$hours_w[db.test$hours_per_week < 40] <- " less_than_40"
db.test$hours_w[db.test$hours_per_week >= 40 &
db.test$hours_per_week <= 45] <- " between_40_and_45"
db.test$hours_w[db.test$hours_per_week > 45 &
db.test$hours_per_week <= 60 ] <- " between_45_and_60"
db.test$hours_w[db.test$hours_per_week > 60 &
db.test$hours_per_week <= 80 ] <- " between_60_and_80"
db.test$hours_w[db.test$hours_per_week > 80] <- " more_than_80"
db.test$hours_w <- factor(db.test$hours_w,
ordered = FALSE,
levels = c(" less_than_40",
" between_40_and_45",
" between_45_and_60",
" between_60_and_80",
" more_than_80"))
We also have to create the variable “native_region”, where we group the countries by global regions:
db.test <- mutate(db.test,
native_region = ifelse(native_country %in% Asia_East, " East-Asia",
ifelse(native_country %in% Asia_Central, " Central-Asia",
ifelse(native_country %in% Central_America, " Central-America",
ifelse(native_country %in% South_America, " South-America",
ifelse(native_country %in% Europe_West, " Europe-West",
ifelse(native_country %in% Europe_East, " Europe-East",
ifelse(native_country == " United-States", " United-States",
" Outlying-US" ))))))))
db.test$native_region <- factor(db.test$native_region, ordered = FALSE)
We have to create the variables “cap_gain” and “cap_loss” as well:
db.test <- mutate(db.test,
cap_gain = ifelse(db.test$capital_gain < 3464, " Low",
ifelse(db.test$capital_gain >= 3464 &
db.test$capital_gain <= 14080, " Medium", " High")))
db.test$cap_gain <- factor(db.test$cap_gain,
                           ordered = TRUE,   # same as in the training data frame
                           levels = c(" Low", " Medium", " High"))
db.test<- mutate(db.test,
cap_loss = ifelse(db.test$capital_loss < 1672, " Low",
ifelse(db.test$capital_loss >= 1672 &
db.test$capital_loss <= 1977, " Medium", " High")))
db.test$cap_loss <- factor(db.test$cap_loss,
                           ordered = TRUE,   # same as in the training data frame
                           levels = c(" Low", " Medium", " High"))
Last, we drop the unused level “Never-worked” from the factor variable “workclass”:
db.test$workclass <- droplevels(db.test$workclass)
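Before exporting, a sanity check of the following kind (a sketch, not part of the original code) can be used to compare the factor levels of each categorical variable in the training and the test data frames; any variable for which the levels differ (for example, a native country that appears only in the training data) would show up as FALSE and may need special care in the predictive models:
# Compare factor levels between training and test data (sketch only)
factor.vars <- names(db.adult)[sapply(db.adult, is.factor)]
sapply(factor.vars,
       function(v) identical(levels(db.adult[[v]]), levels(db.test[[v]])))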
Finally, we export the cleaned and preprocessed train and test datasets into the csv files “adult_df.csv” and “test_df.csv”, respectively:
write.csv(db.adult, "adult_df.csv", row.names = FALSE)
write.csv(db.test, "test_df.csv", row.names = FALSE)