Problem Statement: To build a model that predicts whether the income of an individual in the US is greater than or less than USD 50,000, based on the data available about that individual.
Data Set Description: This Census Income data set was extracted by Barry Becker from the 1994 Census database and donated to the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Census+Income). This data set will help you understand how the income of a person varies depending on factors such as educational background, occupation, marital status, geography, age, and the number of working hours per week.
Here’s a list of the independent or predictor variables used to predict whether an individual earns more than USD 50,000 or not:
Age
Work-class
Final-weight
Education
Education-num (number of years of education)
Marital-status
Occupation
Relationship
Race
Sex
Capital-gain
Capital-loss
Hours-per-week
Native-country

The dependent variable is “income-level”, which represents the level of income. This is a categorical variable and thus it can only take two values:
1. <=50K
2. >50K
Now that we’ve defined our objective and collected the data, it is time to start with the analysis.
Step 1: Import the data
Lucky for us, the data set is available online, so all we have to do is import it into our R environment, like so:
#Downloading train and test data
trainFile = "adult.data"; testFile = "adult.test"
if (!file.exists (trainFile))
download.file (url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
destfile = trainFile)
if (!file.exists (testFile))
download.file (url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
destfile = testFile)
In the above code snippet, we’ve downloaded both the training and the testing data sets.
If you take a look at the training data, you’ll notice that the columns are not labelled. Therefore, in the below code snippet, I’ve assigned a name to each predictor variable and, to make the data more readable, stripped out the surrounding white space (strip.white = TRUE).
#Assigning column names
colNames = c ("age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel")
#Reading training data
training = read.table (trainFile, header = FALSE, sep = ",",
strip.white = TRUE, col.names = colNames,
na.strings = "?", stringsAsFactors = TRUE)
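Before going any further, it’s worth a quick look at the first few records to confirm the file was parsed as expected; a minimal check:
#Peek at the first few observations
head (training)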
Now, in order to study the structure of our data set, we call the str() function. This displays the type and a preview of every variable in the data set:
#Display structures of the training dataset
str(training)
## 'data.frame': 32561 obs. of 15 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ workclass : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
## $ educationnum : int 13 13 9 7 13 14 5 9 14 13 ...
## $ maritalstatus: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
## $ occupation : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
## $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capitalgain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capitalloss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hoursperweek : int 40 13 40 40 40 40 16 45 50 40 ...
## $ nativecountry: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
## $ incomelevel : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
So, after importing the data and transforming it into a readable format, we’ll move on to the next crucial step of data processing: Data Cleaning.
Step 2: Data Cleaning
The data cleaning stage is considered one of the most time-consuming tasks in Data Science. It includes removing NA values, getting rid of redundant variables, and resolving any inconsistencies in the data.
We’ll begin the data cleaning by checking if our data observations have any missing values:
table (complete.cases (training))
##
## FALSE TRUE
## 2399 30162
The above output indicates that 2399 sample cases have NA values. To fix this, let’s look at a summary of the incomplete observations and analyze which variables have the greatest number of missing values. We must deal with NA values because they lead to wrongful predictions and hence decrease the accuracy of our model.
summary (training [!complete.cases(training),])
## age workclass fnlwgt education
## Min. :17.00 Private : 410 Min. : 12285 HS-grad :661
## 1st Qu.:22.00 Self-emp-inc : 42 1st Qu.:121804 Some-college:613
## Median :36.00 Self-emp-not-inc: 42 Median :177906 Bachelors :311
## Mean :40.39 Local-gov : 26 Mean :189584 11th :127
## 3rd Qu.:58.00 State-gov : 19 3rd Qu.:232668 10th :113
## Max. :90.00 (Other) : 24 Max. :981628 Masters : 96
## NA's :1836 (Other) :478
## educationnum maritalstatus occupation
## Min. : 1.00 Divorced :229 Prof-specialty : 102
## 1st Qu.: 9.00 Married-AF-spouse : 2 Other-service : 83
## Median :10.00 Married-civ-spouse :911 Exec-managerial: 74
## Mean : 9.57 Married-spouse-absent: 48 Craft-repair : 69
## 3rd Qu.:11.00 Never-married :957 Sales : 66
## Max. :16.00 Separated : 86 (Other) : 162
## Widowed :166 NA's :1843
## relationship race sex capitalgain
## Husband :730 Amer-Indian-Eskimo: 25 Female: 989 Min. : 0.0
## Not-in-family :579 Asian-Pac-Islander: 144 Male :1410 1st Qu.: 0.0
## Other-relative: 92 Black : 307 Median : 0.0
## Own-child :602 Other : 40 Mean : 897.1
## Unmarried :234 White :1883 3rd Qu.: 0.0
## Wife :162 Max. :99999.0
##
## capitalloss hoursperweek nativecountry incomelevel
## Min. : 0.00 Min. : 1.00 United-States:1666 <=50K:2066
## 1st Qu.: 0.00 1st Qu.:25.00 Mexico : 33 >50K : 333
## Median : 0.00 Median :40.00 Canada : 14
## Mean : 73.87 Mean :34.23 Philippines : 10
## 3rd Qu.: 0.00 3rd Qu.:40.00 Germany : 9
## Max. :4356.00 Max. :99.00 (Other) : 84
## NA's : 583
From the above summary, it is observed that three variables have a substantial number of NA values:
workclass: 1836
occupation: 1843
nativecountry: 583
The observations with missing values in these variables must be removed, since these are significant variables for predicting the income level of an individual. (You can verify the per-variable counts directly, as shown below.)
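As a minimal sketch, base R can count the missing values per column in one line, instead of eyeballing the summary:
#Counting NA values per variable
colSums (is.na (training))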
#Removing NAs
TrainSet = training [!is.na (training$workclass) & !is.na (training$occupation), ]
TrainSet = TrainSet [!is.na (TrainSet$nativecountry), ]
Once we’ve gotten rid of the NA values, our next step is to remove any variable that isn’t essential for predicting our outcome. It is important to remove such variables because they only increase the complexity of the model without improving its efficiency.
One such variable is ‘fnlwgt’, a sampling weight that denotes population totals derived from the CPS by calculating “weighted tallies” of particular socio-economic characteristics of the population.
Since it does not help to predict our target variable, we remove it from the data set:
#Removing unnecessary variables
TrainSet$fnlwgt = NULL
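A quick dimension check confirms the clean-up: the incomplete rows are gone and the fnlwgt column has been dropped, leaving 30162 observations of 14 variables:
#Verifying the dimensions of the cleaned training set
dim (TrainSet)
## [1] 30162    14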
Step 3: Data Exploration
Data Exploration involves analyzing each feature variable to check if the variables are significant for building the model.
Exploring the age variable
#Data Exploration
#Exploring the age variable
summary (TrainSet$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 28.00 37.00 38.44 47.00 90.00
#Boxplot for age variable
boxplot (age ~ incomelevel, data = TrainSet,
main = "Income levels based on the Age of an individual",
xlab = "Income Level", ylab = "Age", col = "turquoise")
library(ggplot2)
#Histogram for age variable
incomeBelow50K = (TrainSet$incomelevel == "<=50K")
xlimit = c (min (TrainSet$age), max (TrainSet$age))
ylimit = c (0, 1600)
hist1 = qplot (age, data = TrainSet[incomeBelow50K,], margins = TRUE,
binwidth = 2, xlim = xlimit, ylim = ylimit, colour = incomelevel)
hist2 = qplot (age, data = TrainSet[!incomeBelow50K,], margins = TRUE,
binwidth = 2, xlim = xlimit, ylim = ylimit, colour = incomelevel)
#Load the gridExtra library for the grid.arrange function
library(gridExtra)
grid.arrange (hist1, hist2, nrow = 2)
## Warning: Removed 1 rows containing missing values (geom_bar).
## Warning: Removed 1 rows containing missing values (geom_bar).
The above illustrations show that the age distribution differs across the two income levels, and hence age is a strong predictor variable.
Exploring the ‘educationnum’ variable
This variable denotes the number of years of education of an individual. Let’s see how the ‘educationnum’ variable varies with respect to the income levels:
summary (TrainSet$educationnum)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 9.00 10.00 10.12 13.00 16.00
#Boxplot for education-num variable
boxplot (educationnum ~ incomelevel, data = TrainSet,
main = "Years of Education distribution for different income levels",
xlab = "Income Levels", ylab = "Years of Education", col = "green")
The above illustration shows that the distribution of the ‘educationnum’ variable differs between the <=50K and >50K income levels, indicating that it is a significant variable for predicting the outcome.
Exploring the capital-gain and capital-loss variables
After studying the summaries of the capital-gain and capital-loss variables for each income level, it is clear that their means differ significantly between the two groups, indicating that they are suitable variables for predicting the income level of an individual.
summary (TrainSet[ TrainSet$incomelevel == "<=50K",
c("capitalgain", "capitalloss")])
## capitalgain capitalloss
## Min. : 0.0 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 0.00
## Median : 0.0 Median : 0.00
## Mean : 148.9 Mean : 53.45
## 3rd Qu.: 0.0 3rd Qu.: 0.00
## Max. :41310.0 Max. :4356.00
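For comparison, here is the same summary for the >50K group (output omitted here); as noted above, the means differ markedly between the two groups:
summary (TrainSet[ TrainSet$incomelevel == ">50K",
                   c("capitalgain", "capitalloss")])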
Exploring hours/week variable
Similarly, the ‘hoursperweek’ variable is evaluated to check if it is a significant predictor variable.
#Evaluate hours/week variable
summary (TrainSet$hoursperweek)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 40.00 40.00 40.93 45.00 99.00
boxplot (hoursperweek ~ incomelevel, data = TrainSet,
main = "Hours Per Week distribution for different income levels",
xlab = "Income Levels", ylab = "Hours Per Week", col = "salmon")
The boxplot shows a clear variation across the income levels, which makes hours-per-week an important variable for predicting the outcome.
Similarly, we’ll evaluate the categorical variables as well. In the below section, I’ve created qplots for each variable; after evaluating the plots, it is clear that these variables are essential for predicting the income level of an individual.
Exploring work-class variable
#Evaluating work-class variable
qplot (incomelevel, data = TrainSet, fill = workclass) + facet_grid (. ~ workclass)
#Evaluating occupation variable
qplot (incomelevel, data = TrainSet, fill = occupation) + facet_grid (. ~ occupation)
#Evaluating marital-status variable
qplot (incomelevel, data = TrainSet, fill = maritalstatus) + facet_grid (. ~ maritalstatus)
#Evaluating relationship variable
qplot (incomelevel, data = TrainSet, fill = relationship) + facet_grid (. ~ relationship)
All these graphs show that this set of predictor variables is significant for building our predictive model; the numeric check below backs up what the plots suggest.
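To complement the plots with numbers, you can tabulate the share of each category that falls into either income level; a minimal sketch for the work-class variable:
#Proportion of each work class falling into either income level
prop.table (table (TrainSet$workclass, TrainSet$incomelevel), margin = 1)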
Step 4: Building A Model
So, after evaluating all our predictor variables, it is finally time to perform predictive analytics. In this stage, we’ll build a model that predicts whether an individual earns above USD 50,000, based on the predictor variables evaluated in the previous section.
To build this model, I’ve used a logistic regression classifier (trained via the caret package with 10-fold cross-validation), since we have to classify an individual into one of two classes, i.e:
Income level <= USD 50,000
Income level > USD 50,000
#Building a baseline model
set.seed (32323)
#All of the variables are used to build a logistic regression model that predicts incomelevel
censusglm <- glm (incomelevel ~ ., data = TrainSet, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
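If you want to see which predictors this baseline model considers significant, inspect the coefficient table (output omitted here; the warning above typically means some observations are fitted with probabilities indistinguishable from 0 or 1):
#Inspecting the coefficients of the baseline logistic regression model
summary (censusglm)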
#Building the final model with cross-validation
set.seed (1)
library(caret)
## Loading required package: lattice
trCtrl = trainControl(method = "cv", number = 10)
#Training a logistic regression model for incomelevel with 10-fold cross-validation
boostFit = train (incomelevel ~ age + workclass + education + educationnum +
maritalstatus + occupation + relationship +
race + capitalgain + capitalloss + hoursperweek +
nativecountry, trControl = trCtrl,
method = "glm", data = TrainSet)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
These two warnings repeat once for each cross-validation fold. The rank-deficiency warning arises because some dummy-coded factor levels are linearly dependent (for example, education and educationnum encode the same information), and the fitted-probabilities warning means some observations are predicted with probabilities numerically equal to 0 or 1.
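Before moving on, it’s worth reviewing the cross-validated accuracy estimate that caret computed during training (output omitted here):
#Reviewing the cross-validated performance of the model
print (boostFit)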
Step 5: Checking the accuracy of the model
To evaluate the accuracy of the model, we’re going to use a confusion matrix:
#Checking the accuracy of the model
confusionMatrix (TrainSet$incomelevel, predict (boostFit, TrainSet))
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## Confusion Matrix and Statistics
##
## Reference
## Prediction <=50K >50K
## <=50K 21046 1608
## >50K 2949 4559
##
## Accuracy : 0.8489
## 95% CI : (0.8448, 0.8529)
## No Information Rate : 0.7955
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5703
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8771
## Specificity : 0.7393
## Pos Pred Value : 0.9290
## Neg Pred Value : 0.6072
## Prevalence : 0.7955
## Detection Rate : 0.6978
## Detection Prevalence : 0.7511
## Balanced Accuracy : 0.8082
##
## 'Positive' Class : <=50K
##
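As a sanity check, the reported accuracy is simply the proportion of correctly classified observations, which we can reproduce by hand from the confusion matrix above:
#Accuracy = correctly classified observations / total observations
(21046 + 4559) / (21046 + 1608 + 2949 + 4559)
## [1] 0.8489159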
The output shows that our model predicts the income level of an individual with an accuracy of approximately 84.9%, which is a good number.
So far, we used the training data set to build the model; now it’s time to validate the model using the testing data set.
Step 6: Load and evaluate the test data set
Just like the training data set, the testing data must be cleaned so that it has no missing values or unnecessary predictor variables; only then can we use it to validate our model.
Start by loading the testing data set:
#Load the testing data set
testing = read.table (testFile, header = FALSE, sep = ",",
strip.white = TRUE, col.names = colNames,
na.strings = "?", fill = TRUE, stringsAsFactors = TRUE)
Next, we study the structure of the test data set:
#Display structure of the data
str (testing)
## 'data.frame': 16282 obs. of 15 variables:
## $ age : Factor w/ 74 levels "|1x3 Cross validator",..: 1 10 23 13 29 3 19 14 48 9 ...
## $ workclass : Factor w/ 9 levels "","Federal-gov",..: 1 5 5 3 5 NA 5 NA 7 5 ...
## $ fnlwgt : int NA 226802 89814 336951 160323 103497 198693 227026 104626 369667 ...
## $ education : Factor w/ 17 levels "","10th","11th",..: 1 3 13 9 17 17 2 13 16 17 ...
## $ educationnum : int NA 7 9 12 10 10 6 9 15 10 ...
## $ maritalstatus: Factor w/ 8 levels "","Divorced",..: 1 6 4 4 4 6 6 6 4 6 ...
## $ occupation : Factor w/ 15 levels "","Adm-clerical",..: 1 8 6 12 8 NA 9 NA 11 9 ...
## $ relationship : Factor w/ 7 levels "","Husband","Not-in-family",..: 1 5 2 2 2 5 3 6 2 6 ...
## $ race : Factor w/ 6 levels "","Amer-Indian-Eskimo",..: 1 4 6 6 4 6 6 4 6 6 ...
## $ sex : Factor w/ 3 levels "","Female","Male": 1 3 3 3 3 2 3 3 3 2 ...
## $ capitalgain : int NA 0 0 0 7688 0 0 0 3103 0 ...
## $ capitalloss : int NA 0 0 0 0 0 0 0 0 0 ...
## $ hoursperweek : int NA 40 50 40 40 30 30 40 32 40 ...
## $ nativecountry: Factor w/ 41 levels "","Cambodia",..: 1 39 39 39 39 39 39 39 39 39 ...
## $ incomelevel : Factor w/ 3 levels "","<=50K.",">50K.": 1 2 2 3 3 2 2 2 3 2 ...
testing = testing[-1,]   #drop the "|1x3 Cross validator" header row that adult.test begins with
testing$age = as.integer (as.character (testing$age))   #as.integer() on a factor would return level codes, not ages
In the below code snippet, we check how many observations are complete, i.e. free of missing data:
table (complete.cases (testing))
##
## FALSE TRUE
## 1221 15060
summary (testing [!complete.cases(testing),])
## age workclass fnlwgt education
## Min. : 2.00 Private :189 Min. : 13862 Some-college:366
## 1st Qu.: 7.00 Self-emp-not-inc: 24 1st Qu.: 116834 HS-grad :340
## Median :19.00 State-gov : 16 Median : 174274 Bachelors :144
## Mean :23.75 Local-gov : 10 Mean : 187207 11th : 66
## 3rd Qu.:40.00 Federal-gov : 9 3rd Qu.: 234791 10th : 53
## Max. :74.00 (Other) : 10 Max. :1024535 Masters : 47
## NA's :963 (Other) :205
## educationnum maritalstatus occupation
## Min. : 1.000 Never-married :562 Prof-specialty : 62
## 1st Qu.: 9.000 Married-civ-spouse :413 Other-service : 32
## Median :10.000 Divorced :107 Sales : 30
## Mean : 9.581 Widowed : 75 Exec-managerial: 28
## 3rd Qu.:10.000 Separated : 33 Craft-repair : 23
## Max. :16.000 Married-spouse-absent: 28 (Other) : 80
## (Other) : 3 NA's :966
## relationship race sex capitalgain
## : 0 : 0 : 0 Min. : 0.0
## Husband :320 Amer-Indian-Eskimo: 10 Female:508 1st Qu.: 0.0
## Not-in-family :302 Asian-Pac-Islander: 72 Male :713 Median : 0.0
## Other-relative: 65 Black :150 Mean : 608.3
## Own-child :353 Other : 13 3rd Qu.: 0.0
## Unmarried :103 White :976 Max. :99999.0
## Wife : 78
## capitalloss hoursperweek nativecountry incomelevel
## Min. : 0.00 Min. : 1.00 United-States:874 : 0
## 1st Qu.: 0.00 1st Qu.:20.00 Mexico : 15 <=50K.:1075
## Median : 0.00 Median :40.00 Canada : 5 >50K. : 146
## Mean : 73.81 Mean :33.49 South : 5
## 3rd Qu.: 0.00 3rd Qu.:40.00 England : 4
## Max. :2603.00 Max. :99.00 (Other) : 44
## NA's :274
From the summary it is clear that we have many NA values in the ‘workclass’, ‘occupation’ and ‘nativecountry’ variables, so let’s remove the observations where these values are missing:
#Removing NAs
TestSet = testing [!is.na (testing$workclass) & !is.na (testing$occupation), ]
TestSet = TestSet [!is.na (TestSet$nativecountry), ]
#Removing unnecessary variables
TestSet$fnlwgt = NULL
#Ensuring the age variable is stored as an integer (a no-op after the conversion above)
TestSet$age = as.integer (TestSet$age)
Step 7: Validate the model
The test data set is applied to the predictive model to validate the efficiency of the model. The following code snippet shows how this is done:
#Testing model
TestSet$predicted = predict (boostFit, TestSet)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
table(TestSet$incomelevel, TestSet$predicted)
##
## <=50K >50K
## 0 0
## <=50K. 10825 535
## >50K. 1834 1866
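Note that the row and column labels in this table don’t line up: the test labels carry a trailing period (“<=50K.”, “>50K.”) plus an empty level left over from the removed header row, while the predictions use the training labels. A minimal clean-up sketch to harmonize the test labels with the training labels:
#Dropping the unused "" factor level left over from the removed header row
TestSet = droplevels (TestSet)
#Stripping the trailing "." so the test labels match the training labels
TestSet$incomelevel = factor (sub ("\\.$", "", as.character (TestSet$incomelevel)),
                              levels = levels (TrainSet$incomelevel))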
#Making an actuals vs. predicted data frame (cbind coerces both factors to their integer codes)
actuals_preds <- data.frame (cbind (actuals = TestSet$incomelevel, predicted = TestSet$predicted))
correlation_accuracy <- cor (actuals_preds)   #a rough agreement measure between the coded labels
head (actuals_preds)
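With the labels harmonized, caret’s confusionMatrix() gives the full set of test-set metrics, just as we computed for the training data (output omitted here):
#Test-set confusion matrix and accuracy
confusionMatrix (TestSet$incomelevel, TestSet$predicted)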
The table compares the predicted income levels to the actual ones. The model can be improved further by introducing some variations or by using an alternative algorithm.
So, we just executed an entire Data Science Project from scratch.