Problem Statement: To build a model that predicts whether the income of an individual in the US exceeds USD 50,000, based on the data available about that individual.

Data Set Description: The Census Income data set was extracted by Barry Becker from the 1994 Census database and donated to the public UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Census+Income. This data set will help you understand how the income of a person varies depending on factors such as education background, occupation, marital status, geography, age, number of working hours per week, etc.

Here’s a list of the independent or predictor variables used to predict whether an individual earns more than USD 50,000 or not:

Age
Work-class
Final-weight
Education
Education-num (number of years of education)
Marital-status
Occupation
Relationship
Race
Sex
Capital-gain
Capital-loss
Hours-per-week
Native-country

The dependent variable is the “income-level” that represents the level of income. This is a categorical variable and thus it can only take two values:

1. <=50K

2. >50K

Now that we’ve defined our objective and collected the data, it is time to start with the analysis.

Step 1: Import the data

Lucky for us, we found a data set online, so all we have to do is import the data set into our R environment, like so:

#Downloading train and test data
trainFile = "adult.data"; testFile = "adult.test"

if (!file.exists (trainFile))
  download.file (url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                 destfile = trainFile)

if (!file.exists (testFile))
  download.file (url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
                 destfile = testFile)

In the above code snippet, we’ve downloaded both the training data set and the testing data set.

If you take a look at the training data, you’ll notice that the predictor variables are not labelled. Therefore, in the below code snippet, I’ve assigned variable names to each predictor variable, and to make the data more readable, I’ve stripped out the unnecessary white space around the values.

#Assigning column names
colNames = c ("age", "workclass", "fnlwgt", "education",
              "educationnum", "maritalstatus", "occupation",
              "relationship", "race", "sex", "capitalgain",
              "capitalloss", "hoursperweek", "nativecountry",
              "incomelevel")

#Reading training data
training = read.table (trainFile, header = FALSE, sep = ",",
                       strip.white = TRUE, col.names = colNames,
                       na.strings = "?", stringsAsFactors = TRUE)

Now, in order to study the structure of our data set, we call the str() function. This gives us a compact overview of every variable in the data set, along with its type and first few values:

#Display structures of the training dataset
str(training)
## 'data.frame':    32561 obs. of  15 variables:
##  $ age          : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ workclass    : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
##  $ fnlwgt       : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
##  $ education    : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
##  $ educationnum : int  13 13 9 7 13 14 5 9 14 13 ...
##  $ maritalstatus: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
##  $ occupation   : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
##  $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
##  $ race         : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex          : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ capitalgain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capitalloss  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hoursperweek : int  40 13 40 40 40 40 16 45 50 40 ...
##  $ nativecountry: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
##  $ incomelevel  : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...

So, after importing and transforming the data into a readable format, we’ll move to the next crucial step in Data Processing, which is Data Cleaning.

Step 2: Data Cleaning

The data cleaning stage is considered to be one of the most time-consuming tasks in Data Science. It includes removing NA values, getting rid of redundant variables, and resolving any inconsistencies in the data.

We’ll begin the data cleaning by checking if our data observations have any missing values:

table (complete.cases (training))
## 
## FALSE  TRUE 
##  2399 30162

The above output indicates that 2399 sample cases contain NA values. In order to fix this, let’s look at the summary of all our variables and analyze which variables have the greatest number of missing values. The reason why we must get rid of NA values is that they lead to faulty predictions and hence decrease the accuracy of our model.
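Before scanning the full summary, a quick per-column count of missing values pinpoints exactly where the NAs live. This is a one-line sketch in base R; given the article’s findings, only workclass, occupation and nativecountry should show non-zero counts:

#Count the NAs in each column of the training data
sapply (training, function(x) sum (is.na (x)))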

summary (training[!complete.cases(training), ])
##       age                   workclass        fnlwgt              education  
##  Min.   :17.00   Private         : 410   Min.   : 12285   HS-grad     :661  
##  1st Qu.:22.00   Self-emp-inc    :  42   1st Qu.:121804   Some-college:613  
##  Median :36.00   Self-emp-not-inc:  42   Median :177906   Bachelors   :311  
##  Mean   :40.39   Local-gov       :  26   Mean   :189584   11th        :127  
##  3rd Qu.:58.00   State-gov       :  19   3rd Qu.:232668   10th        :113  
##  Max.   :90.00   (Other)         :  24   Max.   :981628   Masters     : 96  
##                  NA's            :1836                    (Other)     :478  
##   educationnum                 maritalstatus           occupation  
##  Min.   : 1.00   Divorced             :229   Prof-specialty : 102  
##  1st Qu.: 9.00   Married-AF-spouse    :  2   Other-service  :  83  
##  Median :10.00   Married-civ-spouse   :911   Exec-managerial:  74  
##  Mean   : 9.57   Married-spouse-absent: 48   Craft-repair   :  69  
##  3rd Qu.:11.00   Never-married        :957   Sales          :  66  
##  Max.   :16.00   Separated            : 86   (Other)        : 162  
##                  Widowed              :166   NA's           :1843  
##          relationship                 race          sex        capitalgain     
##  Husband       :730   Amer-Indian-Eskimo:  25   Female: 989   Min.   :    0.0  
##  Not-in-family :579   Asian-Pac-Islander: 144   Male  :1410   1st Qu.:    0.0  
##  Other-relative: 92   Black             : 307                 Median :    0.0  
##  Own-child     :602   Other             :  40                 Mean   :  897.1  
##  Unmarried     :234   White             :1883                 3rd Qu.:    0.0  
##  Wife          :162                                           Max.   :99999.0  
##                                                                                
##   capitalloss       hoursperweek         nativecountry  incomelevel 
##  Min.   :   0.00   Min.   : 1.00   United-States:1666   <=50K:2066  
##  1st Qu.:   0.00   1st Qu.:25.00   Mexico       :  33   >50K : 333  
##  Median :   0.00   Median :40.00   Canada       :  14               
##  Mean   :  73.87   Mean   :34.23   Philippines  :  10               
##  3rd Qu.:   0.00   3rd Qu.:40.00   Germany      :   9               
##  Max.   :4356.00   Max.   :99.00   (Other)      :  84               
##                                    NA's         : 583

From the above summary, it is observed that three variables have a substantial number of NA values:

Workclass – 1836
Occupation – 1843
Nativecountry – 583

Since these three variables are significant for predicting the income level of an individual, we keep the variables themselves and instead remove the observations that have missing values in them.

#Removing NAs
TrainSet = training [!is.na (training$workclass) & !is.na (training$occupation), ]
TrainSet = TrainSet [!is.na (TrainSet$nativecountry), ]

Once we’ve gotten rid of the NA values, our next step is to remove any variable that isn’t essential for predicting our outcome. It is important to drop such variables because they only add to the complexity of the model without improving its performance.

One such variable is ‘fnlwgt’ (final weight), a sampling weight that denotes the population totals derived from the CPS by calculating “weighted tallies” of particular socio-economic characteristics of the population.
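If you want to verify this claim before dropping the column, one quick sanity check is a two-sample t-test of fnlwgt across the income levels. A minimal sketch; judge the size of the difference from the output rather than the p-value alone, given the large sample:

#Sanity check: does fnlwgt differ meaningfully between the income levels?
t.test (fnlwgt ~ incomelevel, data = TrainSet)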

This variable is removed from our data set since it does not help to predict our resultant variable:

#Removing unnecessary variables
 
TrainSet$fnlwgt = NULL

Step 3: Data Exploration

Data Exploration involves analyzing each feature variable to check if the variables are significant for building the model.

Exploring the age variable

#Data Exploration
#Exploring the age variable
 
summary (TrainSet$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   28.00   37.00   38.44   47.00   90.00
#Boxplot for age variable
boxplot (age ~ incomelevel, data = TrainSet,
         main = "Income levels based on the Age of an individual",
         xlab = "Income Level", ylab = "Age", col = "turquoise")

library(ggplot2)
#Histogram for age variable
incomeBelow50K = (TrainSet$incomelevel == "<=50K")
xlimit = c (min (TrainSet$age), max (TrainSet$age))
ylimit = c (0, 1600)
 
hist1 = qplot (age, data = TrainSet[incomeBelow50K,], margins = TRUE,
               binwidth = 2, xlim = xlimit, ylim = ylimit, colour = incomelevel)

hist2 = qplot (age, data = TrainSet[!incomeBelow50K,], margins = TRUE,
               binwidth = 2, xlim = xlimit, ylim = ylimit, colour = incomelevel)

#Load the gridExtra library for the grid.arrange function
library(gridExtra)
grid.arrange (hist1, hist2, nrow = 2)
## Warning: Removed 1 rows containing missing values (geom_bar).

## Warning: Removed 1 rows containing missing values (geom_bar).

The above illustrations show that the distribution of age differs between the two income levels, which makes age a strong predictor variable.
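To put a number on what the plots show, you can compare a summary statistic of age across the two income levels, for example the median. A quick sketch:

#Median age per income level; the >50K group should skew older,
#matching the boxplot above
tapply (TrainSet$age, TrainSet$incomelevel, median)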

Exploring the ‘educationnum’ variable

This variable denotes the number of years of education of an individual. Let’s see how the ‘educationnum’ variable varies with respect to the income levels:

summary (TrainSet$educationnum)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    9.00   10.00   10.12   13.00   16.00
#Boxplot for education-num variable
boxplot (educationnum ~ incomelevel, data = TrainSet,
         main = "Years of Education distribution for different income levels",
         xlab = "Income Levels", ylab = "Years of Education", col = "green")

The above illustration shows that the ‘educationnum’ distribution differs between the <=50K and >50K income levels, which suggests it is a significant variable for predicting the outcome.

Exploring capital-gain and capital-loss variable

After studying the summaries of the capital-gain and capital-loss variables for each income level, it is clear that their means differ significantly between the groups, indicating that they are suitable variables for predicting the income level of an individual. The summary for the <=50K group is shown below:

summary (TrainSet[ TrainSet$incomelevel == "<=50K", 
                       c("capitalgain", "capitalloss")])
##   capitalgain       capitalloss     
##  Min.   :    0.0   Min.   :   0.00  
##  1st Qu.:    0.0   1st Qu.:   0.00  
##  Median :    0.0   Median :   0.00  
##  Mean   :  148.9   Mean   :  53.45  
##  3rd Qu.:    0.0   3rd Qu.:   0.00  
##  Max.   :41310.0   Max.   :4356.00
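To see both income levels side by side instead of one summary at a time, a grouped mean works. A minimal sketch:

#Mean capital gain/loss per income level
aggregate (cbind (capitalgain, capitalloss) ~ incomelevel,
           data = TrainSet, FUN = mean)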

Exploring hours/week variable

Similarly, the ‘hoursperweek’ variable is evaluated to check if it is a significant predictor variable.

#Evaluate hours/week variable
summary (TrainSet$hoursperweek)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   40.00   40.00   40.93   45.00   99.00
boxplot (hoursperweek ~ incomelevel, data = TrainSet,
         main = "Hours Per Week distribution for different income levels",
         xlab = "Income Levels", ylab = "Hours Per Week", col = "salmon")

The boxplot shows a clear variation between the income levels, which makes hours-per-week an important variable for predicting the outcome.

Similarly, we’ll evaluate the categorical variables as well. In the below section I’ve created qplots for each of them, and after evaluating the plots it is clear that these variables are essential for predicting the income level of an individual.

Exploring work-class variable

#Evaluating work-class variable
qplot (incomelevel, data = TrainSet, fill = workclass) + facet_grid (. ~ workclass)

#Evaluating occupation variable
qplot (incomelevel, data = TrainSet, fill = occupation) + facet_grid (. ~ occupation)

#Evaluating marital-status variable
qplot (incomelevel, data = TrainSet, fill = maritalstatus) + facet_grid (. ~ maritalstatus)

#Evaluating relationship variable
qplot (incomelevel, data = TrainSet, fill = relationship) + facet_grid (. ~ relationship)

All these graphs show that this set of predictor variables is significant for building our predictive model.
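Beyond eyeballing the plots, a chi-squared test of independence offers a more formal check that a categorical variable is associated with the income level. A minimal sketch for workclass; swap in occupation, maritalstatus or relationship as needed:

#Test of independence between workclass and incomelevel;
#a very small p-value backs up the association visible in the plots
chisq.test (table (TrainSet$workclass, TrainSet$incomelevel))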

Step 4: Building A Model

So, after evaluating all our predictor variables, it is finally time to perform Predictive analytics. In this stage, we’ll build a predictive model that will predict whether an individual earns above USD 50,000 or not based on the predictor variables that we evaluated in the previous section.

To build this model I’ve used logistic regression (a binomial GLM), trained with 10-fold cross-validation through the caret package, since we have to classify an individual into either of the two classes, i.e.:

  1. Income level <= USD 50,000

  2. Income level > USD 50,000

#Building the model
set.seed (32323)
#All of the variables are used to build a logistic regression model to predict the incomelevel variable.
censusglm<-glm(incomelevel ~ ., data=TrainSet, family=binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#Building the model
set.seed (1)
 
library(caret)
## Loading required package: lattice
library(gbm)
trCtrl = trainControl(method = "cv", number = 10)
#The logistic regression is used to generate predictions for incomelevel 
boostFit = train (incomelevel ~ age + workclass + education + educationnum +
                    maritalstatus + occupation + relationship +
                    race + capitalgain + capitalloss + hoursperweek +
                    nativecountry,
                  trControl = trCtrl, method = "glm", data = TrainSet)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## (the two warnings above are repeated for each of the 10 cross-validation folds)
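Incidentally, despite the boostFit name, the model trained above is a plain logistic regression, and the warnings are expected here: education and educationnum encode the same information, which makes the fit rank-deficient, and some factor levels separate the classes almost perfectly. If you do want an actual boosted model, as the variable name suggests, caret also wraps the gbm package. The following is a minimal sketch with illustrative settings, not the article’s configuration:

#A hedged alternative: gradient boosted trees via caret's "gbm" method
#(training is noticeably slower than the glm fit above)
set.seed (1)
gbmFit = train (incomelevel ~ ., data = TrainSet,
                method = "gbm", trControl = trCtrl,
                verbose = FALSE)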

Step 5: Checking the accuracy of the model

To evaluate the accuracy of the model, we’re going to use a confusion matrix:

#Checking the accuracy of the model
#(note: caret's confusionMatrix(data, reference) expects the predictions as
#the first argument; passing the actual labels first, as done here, swaps the
#roles in the sensitivity/specificity figures but leaves the accuracy unchanged)

confusionMatrix (TrainSet$incomelevel, predict (boostFit, TrainSet))
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction <=50K  >50K
##      <=50K 21046  1608
##      >50K   2949  4559
##                                           
##                Accuracy : 0.8489          
##                  95% CI : (0.8448, 0.8529)
##     No Information Rate : 0.7955          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5703          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8771          
##             Specificity : 0.7393          
##          Pos Pred Value : 0.9290          
##          Neg Pred Value : 0.6072          
##              Prevalence : 0.7955          
##          Detection Rate : 0.6978          
##    Detection Prevalence : 0.7511          
##       Balanced Accuracy : 0.8082          
##                                           
##        'Positive' Class : <=50K           
## 

The output shows that our model predicts the income level of an individual with an accuracy of approximately 85%, well above the no-information rate of about 79.6% (the accuracy you would get by always predicting the majority class, <=50K).

So far, we’ve used the training data set to build the model; now it’s time to validate it using the testing data set.

Step 6: Load and evaluate the test data set

Just as we cleaned our training data set, the testing data must be prepared so that it has no missing values or unnecessary predictor variables; only then can we use it to validate our model.

Start by loading the testing data set:

#Load the testing data set
testing = read.table (testFile, header = FALSE, sep = ",",
                      strip.white = TRUE, col.names = colNames,
                      na.strings = "?", fill = TRUE, stringsAsFactors = TRUE)

Next, we study the structure of the test data set.

#Display structure of the data
str (testing)
## 'data.frame':    16282 obs. of  15 variables:
##  $ age          : Factor w/ 74 levels "|1x3 Cross validator",..: 1 10 23 13 29 3 19 14 48 9 ...
##  $ workclass    : Factor w/ 9 levels "","Federal-gov",..: 1 5 5 3 5 NA 5 NA 7 5 ...
##  $ fnlwgt       : int  NA 226802 89814 336951 160323 103497 198693 227026 104626 369667 ...
##  $ education    : Factor w/ 17 levels "","10th","11th",..: 1 3 13 9 17 17 2 13 16 17 ...
##  $ educationnum : int  NA 7 9 12 10 10 6 9 15 10 ...
##  $ maritalstatus: Factor w/ 8 levels "","Divorced",..: 1 6 4 4 4 6 6 6 4 6 ...
##  $ occupation   : Factor w/ 15 levels "","Adm-clerical",..: 1 8 6 12 8 NA 9 NA 11 9 ...
##  $ relationship : Factor w/ 7 levels "","Husband","Not-in-family",..: 1 5 2 2 2 5 3 6 2 6 ...
##  $ race         : Factor w/ 6 levels "","Amer-Indian-Eskimo",..: 1 4 6 6 4 6 6 4 6 6 ...
##  $ sex          : Factor w/ 3 levels "","Female","Male": 1 3 3 3 3 2 3 3 3 2 ...
##  $ capitalgain  : int  NA 0 0 0 7688 0 0 0 3103 0 ...
##  $ capitalloss  : int  NA 0 0 0 0 0 0 0 0 0 ...
##  $ hoursperweek : int  NA 40 50 40 40 30 30 40 32 40 ...
##  $ nativecountry: Factor w/ 41 levels "","Cambodia",..: 1 39 39 39 39 39 39 39 39 39 ...
##  $ incomelevel  : Factor w/ 3 levels "","<=50K.",">50K.": 1 2 2 3 3 2 2 2 3 2 ...
#The first row is junk ("|1x3 Cross validator"), and its presence made R
#read the age column in as a factor; drop the row and convert age back.
#Note: as.integer() on a factor returns the level codes, not the values,
#so we convert through as.character() to recover the actual ages.
testing = testing[-1, ]
testing$age = as.integer (as.character (testing$age))

In the below code snippet, we look for complete observations, i.e. rows that have no missing data.

table (complete.cases (testing))
## 
## FALSE  TRUE 
##  1221 15060
summary  (testing [!complete.cases(testing),])
##       age                   workclass       fnlwgt               education  
##  Min.   : 2.00   Private         :189   Min.   :  13862   Some-college:366  
##  1st Qu.: 7.00   Self-emp-not-inc: 24   1st Qu.: 116834   HS-grad     :340  
##  Median :19.00   State-gov       : 16   Median : 174274   Bachelors   :144  
##  Mean   :23.75   Local-gov       : 10   Mean   : 187207   11th        : 66  
##  3rd Qu.:40.00   Federal-gov     :  9   3rd Qu.: 234791   10th        : 53  
##  Max.   :74.00   (Other)         : 10   Max.   :1024535   Masters     : 47  
##                  NA's            :963                     (Other)     :205  
##   educationnum                  maritalstatus           occupation 
##  Min.   : 1.000   Never-married        :562   Prof-specialty : 62  
##  1st Qu.: 9.000   Married-civ-spouse   :413   Other-service  : 32  
##  Median :10.000   Divorced             :107   Sales          : 30  
##  Mean   : 9.581   Widowed              : 75   Exec-managerial: 28  
##  3rd Qu.:10.000   Separated            : 33   Craft-repair   : 23  
##  Max.   :16.000   Married-spouse-absent: 28   (Other)        : 80  
##                   (Other)              :  3   NA's           :966  
##          relationship                 race         sex       capitalgain     
##                :  0                     :  0         :  0   Min.   :    0.0  
##  Husband       :320   Amer-Indian-Eskimo: 10   Female:508   1st Qu.:    0.0  
##  Not-in-family :302   Asian-Pac-Islander: 72   Male  :713   Median :    0.0  
##  Other-relative: 65   Black             :150                Mean   :  608.3  
##  Own-child     :353   Other             : 13                3rd Qu.:    0.0  
##  Unmarried     :103   White             :976                Max.   :99999.0  
##  Wife          : 78                                                          
##   capitalloss       hoursperweek         nativecountry incomelevel  
##  Min.   :   0.00   Min.   : 1.00   United-States:874         :   0  
##  1st Qu.:   0.00   1st Qu.:20.00   Mexico       : 15   <=50K.:1075  
##  Median :   0.00   Median :40.00   Canada       :  5   >50K. : 146  
##  Mean   :  73.81   Mean   :33.49   South        :  5                
##  3rd Qu.:   0.00   3rd Qu.:40.00   England      :  4                
##  Max.   :2603.00   Max.   :99.00   (Other)      : 44                
##                                    NA's         :274

From the summary it is clear that we have many NA values in the ‘workclass’, ‘occupation’ and ‘nativecountry’ variables, so let’s remove the observations with missing values in them:

#Removing NAs
TestSet = testing [!is.na (testing$workclass) & !is.na (testing$occupation), ]
TestSet = TestSet [!is.na (TestSet$nativecountry), ]
 
#Removing unnecessary variables
TestSet$fnlwgt = NULL

Step 7: Validate the model

The test data set is applied to the predictive model to validate its performance. The following code snippet shows how this is done:

#Testing model
TestSet$predicted = predict (boostFit, TestSet)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
table(TestSet$incomelevel, TestSet$predicted)
##         
##          <=50K  >50K
##              0     0
##   <=50K. 10825   535
##   >50K.   1834  1866
#Make a data frame of the actual vs predicted values
#(cbind converts the factors to their integer level codes)
actuals_preds <- data.frame (cbind (actuals = TestSet$incomelevel,
                                    predicted = TestSet$predicted))
correlation_accuracy <- cor (actuals_preds)
head (actuals_preds)

The table compares the predicted income levels with the actual ones. From it, the test-set accuracy works out to (10825 + 1866) / 15060, roughly 84.3%, in line with the training accuracy. This model can be improved further by introducing some variations in the model or by using an alternate algorithm.
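Note the label mismatch visible in the table: the test file writes the classes with a trailing period (“<=50K.” and “>50K.”). A minimal sketch that aligns the labels and computes the test-set accuracy directly:

#Strip the trailing '.' from the test labels, then compute the accuracy
actualClean = sub ("\\.$", "", as.character (TestSet$incomelevel))
mean (actualClean == as.character (TestSet$predicted))
#This reproduces the ~0.843 accuracy computed from the table above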

So, we just executed an entire Data Science Project from scratch.