Census Income Data Exploration

library(rpart)
library(rpart.plot)
library(ggplot2)
library(plyr)
library(reshape2)
library(gmodels)


censusdata <- read.csv("/Users/anushiarora/Desktop/Study Material/Semester 4/Business Analytics/BusinessAnalytics-master/Week-03/adult.data.csv")
dim(censusdata)
## [1] 32561    15
names(censusdata)
##  [1] "age"              "workclass"        "fnlwgt"          
##  [4] "education"        "education.number" "marital.status"  
##  [7] "occupation"       "relationship"     "race"            
## [10] "sex"              "capital.gain"     "capital.loss"    
## [13] "hours.per.week"   "native.country"   "salary"
str(censusdata)
## 'data.frame':    32561 obs. of  15 variables:
##  $ age             : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ workclass       : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
##  $ fnlwgt          : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
##  $ education       : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...
##  $ education.number: int  13 13 9 7 13 14 5 9 14 13 ...
##  $ marital.status  : Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
##  $ occupation      : Factor w/ 15 levels " ?"," Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
##  $ relationship    : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
##  $ race            : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex             : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ capital.gain    : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capital.loss    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours.per.week  : int  40 13 40 40 40 40 16 45 50 40 ...
##  $ native.country  : Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
##  $ salary          : Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 1 1 2 2 2 ...
table(censusdata$salary)
## 
##  <=50K   >50K 
##  24720   7841
summary(censusdata)
##       age                    workclass         fnlwgt       
##  Min.   :17.00    Private         :22696   Min.   :  12285  
##  1st Qu.:28.00    Self-emp-not-inc: 2541   1st Qu.: 117827  
##  Median :37.00    Local-gov       : 2093   Median : 178356  
##  Mean   :38.58    ?               : 1836   Mean   : 189778  
##  3rd Qu.:48.00    State-gov       : 1298   3rd Qu.: 237051  
##  Max.   :90.00    Self-emp-inc    : 1116   Max.   :1484705  
##                  (Other)          :  981                    
##          education     education.number                marital.status 
##   HS-grad     :10501   Min.   : 1.00     Divorced             : 4443  
##   Some-college: 7291   1st Qu.: 9.00     Married-AF-spouse    :   23  
##   Bachelors   : 5355   Median :10.00     Married-civ-spouse   :14976  
##   Masters     : 1723   Mean   :10.08     Married-spouse-absent:  418  
##   Assoc-voc   : 1382   3rd Qu.:12.00     Never-married        :10683  
##   11th        : 1175   Max.   :16.00     Separated            : 1025  
##  (Other)      : 5134                     Widowed              :  993  
##             occupation            relationship  
##   Prof-specialty :4140    Husband       :13193  
##   Craft-repair   :4099    Not-in-family : 8305  
##   Exec-managerial:4066    Other-relative:  981  
##   Adm-clerical   :3770    Own-child     : 5068  
##   Sales          :3650    Unmarried     : 3446  
##   Other-service  :3295    Wife          : 1568  
##  (Other)         :9541                          
##                   race            sex         capital.gain  
##   Amer-Indian-Eskimo:  311    Female:10771   Min.   :    0  
##   Asian-Pac-Islander: 1039    Male  :21790   1st Qu.:    0  
##   Black             : 3124                   Median :    0  
##   Other             :  271                   Mean   : 1078  
##   White             :27816                   3rd Qu.:    0  
##                                              Max.   :99999  
##                                                             
##   capital.loss    hours.per.week         native.country     salary     
##  Min.   :   0.0   Min.   : 1.00    United-States:29170    <=50K:24720  
##  1st Qu.:   0.0   1st Qu.:40.00    Mexico       :  643    >50K : 7841  
##  Median :   0.0   Median :40.00    ?            :  583                 
##  Mean   :  87.3   Mean   :40.44    Philippines  :  198                 
##  3rd Qu.:   0.0   3rd Qu.:45.00    Germany      :  137                 
##  Max.   :4356.0   Max.   :99.00    Canada       :  121                 
##                                   (Other)       : 1709

This dataset contains census income records: 32561 observations of 15 variables. The salary column, which is the target variable, has two classes: 24720 records with salary <=50K and 7841 records with salary >50K.
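As a quick check, the class proportions can be derived from the two counts reported by table(censusdata$salary). The sketch below hard-codes those counts so that it runs without the CSV file; the variable names are illustrative and not part of the original analysis.

```r
# Class counts copied from the table(censusdata$salary) output above
salary_counts <- c("<=50K" = 24720, ">50K" = 7841)

# Share of each class among all records
salary_share <- salary_counts / sum(salary_counts)
print(round(salary_share, 3))
```

Roughly three quarters of the records fall in the <=50K class, so a model that always predicts <=50K would already be right about 76% of the time. This is the baseline that any classifier on this data has to beat.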

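The summary above reports the minimum, maximum, 1st and 3rd quartiles, mean and median of each numeric variable (age, fnlwgt, education.number, capital.gain, capital.loss and hours.per.week). For each categorical variable (workclass, education, marital.status, occupation, relationship, race, sex, native.country and salary) it reports the counts of the most frequent levels. Missing values are coded as "?", for example in 1836 workclass records and 583 native.country records.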

Classification Tree

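A classification tree is fitted with rpart to predict salary from the other 14 variables. Its summary, complexity table and confusion matrix are shown below: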

## Call:
## rpart(formula = salary ~ age + workclass + fnlwgt + education + 
##     education.number + marital.status + occupation + relationship + 
##     race + sex + capital.gain + capital.loss + hours.per.week + 
##     native.country, data = censusdata, method = "class")
##   n= 32561 
## 
##           CP nsplit rel error    xerror        xstd
## 1 0.12638694      0 1.0000000 1.0000000 0.009839876
## 2 0.06402245      2 0.7472261 0.7472261 0.008840225
## 3 0.03749522      3 0.6832037 0.6832037 0.008532119
## 4 0.01000000      4 0.6457085 0.6457085 0.008339388
## 
## Variable importance
##     relationship   marital.status     capital.gain        education 
##               24               24               10                9 
## education.number              sex       occupation              age 
##                9                7                7                5 
##   hours.per.week 
##                3 
## 
## Node number 1: 32561 observations,    complexity param=0.1263869
##   predicted class= <=50K  expected loss=0.2408096  P(node) =1
##     class counts: 24720  7841
##    probabilities: 0.759 0.241 
##   left son=2 (17800 obs) right son=3 (14761 obs)
##   Primary splits:
##       relationship     splits as  RLLLLR, improve=2394.796, (0 missing)
##       marital.status   splits as  LRRLLLL, improve=2360.673, (0 missing)
##       capital.gain     < 5119   to the left,  improve=1658.928, (0 missing)
##       education        splits as  LLLLLLLLLRRLRLRL, improve=1274.368, (0 missing)
##       education.number < 12.5   to the left,  improve=1274.368, (0 missing)
##   Surrogate splits:
##       marital.status splits as  LRRLLLL, agree=0.993, adj=0.984, (0 split)
##       sex            splits as  LR, agree=0.688, adj=0.311, (0 split)
##       age            < 33.5   to the left,  agree=0.650, adj=0.229, (0 split)
##       occupation     splits as  LLLRRRLLLLRRLLR, agree=0.622, adj=0.167, (0 split)
##       hours.per.week < 43.5   to the left,  agree=0.605, adj=0.130, (0 split)
## 
## Node number 2: 17800 observations,    complexity param=0.03749522
##   predicted class= <=50K  expected loss=0.06617978  P(node) =0.5466663
##     class counts: 16622  1178
##    probabilities: 0.934 0.066 
##   left son=4 (17482 obs) right son=5 (318 obs)
##   Primary splits:
##       capital.gain     < 7073.5 to the left,  improve=519.9766, (0 missing)
##       education        splits as  LLLLLLLLLRRLRLRL, improve=147.2153, (0 missing)
##       education.number < 12.5   to the left,  improve=147.2153, (0 missing)
##       occupation       splits as  LLLLRLLLLLRRLLL, improve=121.3603, (0 missing)
##       hours.per.week   < 43.5   to the left,  improve=112.2149, (0 missing)
## 
## Node number 3: 14761 observations,    complexity param=0.1263869
##   predicted class= <=50K  expected loss=0.4513922  P(node) =0.4533337
##     class counts:  8098  6663
##    probabilities: 0.549 0.451 
##   left son=6 (10329 obs) right son=7 (4432 obs)
##   Primary splits:
##       education        splits as  LLLLLLLLLRRLRLRL, improve=938.6245, (0 missing)
##       education.number < 12.5   to the left,  improve=938.6245, (0 missing)
##       occupation       splits as  LLRLRLLLLLRRRRL, improve=893.7492, (0 missing)
##       capital.gain     < 5095.5 to the left,  improve=755.5021, (0 missing)
##       capital.loss     < 1782.5 to the left,  improve=258.7303, (0 missing)
##   Surrogate splits:
##       education.number < 12.5   to the left,  agree=1.000, adj=1.000, (0 split)
##       occupation       splits as  LLLLRLLLLLRLLLL, agree=0.790, adj=0.301, (0 split)
##       capital.gain     < 7493   to the left,  agree=0.716, adj=0.056, (0 split)
##       native.country   splits as  LLLRLLLLLRRLLLL-LLLRRLLLRLLLLLRLLLLRRLLLLL, agree=0.707, adj=0.025, (0 split)
##       capital.loss     < 1894.5 to the left,  agree=0.705, adj=0.019, (0 split)
## 
## Node number 4: 17482 observations
##   predicted class= <=50K  expected loss=0.04987988  P(node) =0.5369
##     class counts: 16610   872
##    probabilities: 0.950 0.050 
## 
## Node number 5: 318 observations
##   predicted class= >50K   expected loss=0.03773585  P(node) =0.009766285
##     class counts:    12   306
##    probabilities: 0.038 0.962 
## 
## Node number 6: 10329 observations,    complexity param=0.06402245
##   predicted class= <=50K  expected loss=0.3345919  P(node) =0.31722
##     class counts:  6873  3456
##    probabilities: 0.665 0.335 
##   left son=12 (9807 obs) right son=13 (522 obs)
##   Primary splits:
##       capital.gain     < 5095.5 to the left,  improve=459.2245, (0 missing)
##       occupation       splits as  LRLLRLLLLLRRRRL, improve=245.6701, (0 missing)
##       education        splits as  LLLLLLLRR--R-L-R, improve=174.7964, (0 missing)
##       education.number < 8.5    to the left,  improve=174.7964, (0 missing)
##       age              < 35.5   to the left,  improve=125.2055, (0 missing)
## 
## Node number 7: 4432 observations
##   predicted class= >50K   expected loss=0.2763989  P(node) =0.1361138
##     class counts:  1225  3207
##    probabilities: 0.276 0.724 
## 
## Node number 12: 9807 observations
##   predicted class= <=50K  expected loss=0.3001937  P(node) =0.3011885
##     class counts:  6863  2944
##    probabilities: 0.700 0.300 
## 
## Node number 13: 522 observations
##   predicted class= >50K   expected loss=0.01915709  P(node) =0.01603145
##     class counts:    10   512
##    probabilities: 0.019 0.981
## 
## Classification tree:
## rpart(formula = salary ~ age + workclass + fnlwgt + education + 
##     education.number + marital.status + occupation + relationship + 
##     race + sex + capital.gain + capital.loss + hours.per.week + 
##     native.country, data = censusdata, method = "class")
## 
## Variables actually used in tree construction:
## [1] capital.gain education    relationship
## 
## Root node error: 7841/32561 = 0.24081
## 
## n= 32561 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.126387      0   1.00000 1.00000 0.0098399
## 2 0.064022      2   0.74723 0.74723 0.0088402
## 3 0.037495      3   0.68320 0.68320 0.0085321
## 4 0.010000      4   0.64571 0.64571 0.0083394

##                
##                 Pred: <=50K Pred: >50K
##   Actual: <=50K       23473       1247
##   Actual: >50K         3816       4025
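
The confusion matrix can be turned into the usual performance measures. The following is a minimal sketch that hard-codes the four counts printed above (the object names are illustrative) and cross-checks the error rate against the CP table, using the fact that the absolute error equals the root node error times the final rel error.

```r
# Confusion matrix copied from the output above (rows = actual, columns = predicted)
conf_mat <- matrix(c(23473, 1247,
                     3816,  4025),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(actual = c("<=50K", ">50K"),
                                   predicted = c("<=50K", ">50K")))

accuracy    <- sum(diag(conf_mat)) / sum(conf_mat)
sensitivity <- conf_mat[">50K", ">50K"] / sum(conf_mat[">50K", ])      # recall for >50K
specificity <- conf_mat["<=50K", "<=50K"] / sum(conf_mat["<=50K", ])   # recall for <=50K

# Cross-check: training error = root node error * final rel error from the CP table
error_from_cp <- (7841 / 32561) * 0.64571

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, error_from_cp = error_from_cp), 4))
```

The tree classifies about 84% of the training records correctly, but it identifies only about half of the people who actually earn >50K. These figures are computed on the data the tree was fitted to, so they are likely to be optimistic for new data.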

The classification tree has 9 nodes (numbered 1 to 7, 12 and 13), 5 of which are terminal. Of the 14 candidate predictors, only three are actually used for splitting: relationship (at the root node), education and capital.gain.
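
rpart numbers nodes so that the children of node n are 2n and 2n + 1, which is why the node numbers in the summary jump from 7 to 12. A small sketch, using only the node numbers listed in the summary output (the variable names are illustrative), counts the nodes and picks out the terminal ones:

```r
# Node numbers listed in the summary output above
nodes <- c(1, 2, 3, 4, 5, 6, 7, 12, 13)

# A node is terminal (a leaf) if its left child, 2 * n, is not in the tree
is_leaf <- !((2 * nodes) %in% nodes)

n_nodes  <- length(nodes)
leaves   <- nodes[is_leaf]
n_splits <- n_nodes - length(leaves)

print(leaves)
```

The number of internal nodes obtained this way agrees with the nsplit value in the last row of the CP table.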

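According to the decision tree, the variables that distinguish people with salary <=50K from those with salary >50K are relationship, capital.gain and education. The percentages below are read from the node summaries above.
1. People who are not a husband or wife (node 2, 55% of the records): with capital.gain < 7073.5 (node 4, 54% of the records) the prediction is <=50K, and 95% of this group do earn <=50K. With capital.gain >= 7073.5 (node 5, 1% of the records) the prediction is >50K, and 96% of this group earn >50K.
2. Husbands and wives (node 3, 45% of the records): with education = Bachelors, Doctorate, Masters or Prof-school (node 7, 14% of the records) the prediction is >50K, and 72% of this group earn >50K. With education = 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, HS-grad, Preschool or Some-college (node 6, 32% of the records), the prediction is >50K only when capital.gain >= 5095.5 (node 13, 2% of the records, 98% of whom earn >50K). Otherwise it is <=50K (node 12, 30% of the records, of whom 30% still earn >50K).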
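
The splits reported in the summary can be written out as plain if/else rules. The function below is a hand-written sketch, not the fitted rpart object: predict_salary is a hypothetical helper, and its thresholds and level sets are copied from the node summaries above. It handles one person at a time and is not vectorized.

```r
# Hand-written version of the tree's rules (thresholds and level sets copied
# from the summary output; predict_salary is a hypothetical helper)
predict_salary <- function(relationship, capital.gain, education) {
  relationship <- trimws(relationship)  # the raw factor levels carry a leading space
  education    <- trimws(education)
  higher_ed    <- c("Bachelors", "Doctorate", "Masters", "Prof-school")

  if (!(relationship %in% c("Husband", "Wife"))) {
    # Node 2: split on capital.gain at 7073.5 (nodes 4 and 5)
    if (capital.gain < 7073.5) "<=50K" else ">50K"
  } else if (education %in% higher_ed) {
    # Node 7
    ">50K"
  } else {
    # Node 6: split on capital.gain at 5095.5 (nodes 12 and 13)
    if (capital.gain < 5095.5) "<=50K" else ">50K"
  }
}

print(predict_salary(" Husband", 0, " Bachelors"))
print(predict_salary(" Own-child", 0, " HS-grad"))
```

Writing the rules out this way makes it clear that the tree uses only three inputs and that capital.gain is used twice with different thresholds.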

Distribution and Correlation

## [1] 0.06875571

The correlation between age and hours worked per week is very small (about 0.069), so there is almost no linear relationship between the two. Age is therefore not a useful basis for allocating working hours.
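
To put the size of this correlation in perspective, the sketch below computes the share of variance explained and the usual t statistic for a correlation coefficient. It assumes the value printed above is a Pearson correlation computed over all 32561 records; the code that produced it is not shown in the report, so this is an assumption.

```r
r <- 0.06875571   # correlation printed above (assumed to be Pearson)
n <- 32561        # number of observations, from dim(censusdata)

r_squared <- r^2                               # share of variance explained
t_stat    <- r * sqrt(n - 2) / sqrt(1 - r^2)   # test statistic for H0: rho = 0
p_value   <- 2 * pt(-abs(t_stat), df = n - 2)  # two-sided p-value

print(c(r_squared = r_squared, t_stat = t_stat, p_value = p_value))
```

Under these assumptions age accounts for well under 1% of the variance in hours worked. With more than 32000 observations even such a weak correlation is statistically distinguishable from zero, so the small coefficient is best read as practically negligible and not as evidence of no relationship at all.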

Linear Regression

LRModel <- glm(CI_data$salary~CI_data$sex+CI_data$age+CI_data$hours.per.week+CI_data$occupation+CI_data$marital.status, data = CI_data)
summary(LRModel)
## 
## Call:
## glm(formula = CI_data$salary ~ CI_data$sex + CI_data$age + CI_data$hours.per.week + 
##     CI_data$occupation + CI_data$marital.status, data = CI_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8507  -0.2852  -0.1457   0.1434   1.1173  
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.6341998  0.0146952   43.16   <2e-16 ***
## CI_data$sex             0.1322856  0.0048484   27.28   <2e-16 ***
## CI_data$age             0.0057837  0.0001679   34.44   <2e-16 ***
## CI_data$hours.per.week  0.0055472  0.0001863   29.77   <2e-16 ***
## CI_data$occupation      0.0054397  0.0005241   10.38   <2e-16 ***
## CI_data$marital.status -0.0284765  0.0015467  -18.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1580833)
## 
##     Null deviance: 5952.8  on 32560  degrees of freedom
## Residual deviance: 5146.4  on 32555  degrees of freedom
## AIC: 32349
## 
## Number of Fisher Scoring iterations: 2

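The model above is a linear regression, not a logistic regression: glm() is called without a family argument, so it fits the default gaussian family. The coefficients for sex, age, hours per week, occupation and marital status are all small in magnitude, but each is statistically significant (p < 2e-16). Sex has the largest coefficient (0.132), so in this model it is one of the major factors associated with salary. Occupation and marital status each receive a single coefficient, which means they enter the model as numeric codes and not as categorical factors, so their coefficients are hard to interpret. The marketing team could use this information when giving male and female employees an estimate of their salary.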
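
The overall fit of the model can be read off the deviances in the summary. For a gaussian glm the deviance is the residual sum of squares, so the sketch below (using only numbers printed above, with illustrative variable names) recovers the ordinary R-squared and the dispersion parameter.

```r
# Deviances and degrees of freedom copied from summary(LRModel) above
null_deviance     <- 5952.8
residual_deviance <- 5146.4
residual_df       <- 32555

# For a gaussian glm, 1 - residual/null deviance is the ordinary R-squared
r_squared <- 1 - residual_deviance / null_deviance

# The dispersion parameter is the residual variance estimate
dispersion <- residual_deviance / residual_df

print(round(c(r_squared = r_squared, dispersion = dispersion), 4))
```

The five predictors explain roughly 14% of the variance in salary. Because salary has only two classes, a logistic regression (glm with family = binomial) would be the more usual choice. That would require refitting the model on the data, which is not done here.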