library(rpart)
library(rpart.plot)
library(ggplot2)
library(plyr)
library(reshape2)
library(gmodels)
censusdata <- read.csv("/Users/anushiarora/Desktop/Study Material/Semester 4/Business Analytics/BusinessAnalytics-master/Week-03/adult.data.csv")
dim(censusdata)
## [1] 32561 15
names(censusdata)
## [1] "age" "workclass" "fnlwgt"
## [4] "education" "education.number" "marital.status"
## [7] "occupation" "relationship" "race"
## [10] "sex" "capital.gain" "capital.loss"
## [13] "hours.per.week" "native.country" "salary"
str(censusdata)
## 'data.frame': 32561 obs. of 15 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ workclass : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ education : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...
## $ education.number: int 13 13 9 7 13 14 5 9 14 13 ...
## $ marital.status : Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
## $ occupation : Factor w/ 15 levels " ?"," Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
## $ relationship : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
## $ race : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capital.gain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capital.loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours.per.week : int 40 13 40 40 40 40 16 45 50 40 ...
## $ native.country : Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
## $ salary : Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 1 1 2 2 2 ...
table(censusdata$salary)
##
## <=50K >50K
## 24720 7841
summary(censusdata)
##       age                    workclass         fnlwgt
##  Min.   :17.00    Private         :22696   Min.   :  12285
##  1st Qu.:28.00    Self-emp-not-inc: 2541   1st Qu.: 117827
##  Median :37.00    Local-gov       : 2093   Median : 178356
##  Mean   :38.58    ?               : 1836   Mean   : 189778
##  3rd Qu.:48.00    State-gov       : 1298   3rd Qu.: 237051
##  Max.   :90.00    Self-emp-inc    : 1116   Max.   :1484705
##                   (Other)         :  981
##          education     education.number               marital.status
##  HS-grad     :10501   Min.   : 1.00     Divorced             : 4443
##  Some-college: 7291   1st Qu.: 9.00     Married-AF-spouse    :   23
##  Bachelors   : 5355   Median :10.00     Married-civ-spouse   :14976
##  Masters     : 1723   Mean   :10.08     Married-spouse-absent:  418
##  Assoc-voc   : 1382   3rd Qu.:12.00     Never-married        :10683
##  11th        : 1175   Max.   :16.00     Separated            : 1025
##  (Other)     : 5134                     Widowed              :  993
##            occupation           relationship
##  Prof-specialty :4140   Husband       :13193
##  Craft-repair   :4099   Not-in-family : 8305
##  Exec-managerial:4066   Other-relative:  981
##  Adm-clerical   :3770   Own-child     : 5068
##  Sales          :3650   Unmarried     : 3446
##  Other-service  :3295   Wife          : 1568
##  (Other)        :9541
##                  race            sex         capital.gain
##  Amer-Indian-Eskimo:  311   Female:10771   Min.   :    0
##  Asian-Pac-Islander: 1039   Male  :21790   1st Qu.:    0
##  Black             : 3124                  Median :    0
##  Other             :  271                  Mean   : 1078
##  White             :27816                  3rd Qu.:    0
##                                            Max.   :99999
##
##   capital.loss    hours.per.week         native.country     salary
##  Min.   :   0.0   Min.   : 1.00   United-States:29170   <=50K:24720
##  1st Qu.:   0.0   1st Qu.:40.00   Mexico       :  643   >50K : 7841
##  Median :   0.0   Median :40.00   ?            :  583
##  Mean   :  87.3   Mean   :40.44   Philippines  :  198
##  3rd Qu.:   0.0   3rd Qu.:45.00   Germany      :  137
##  Max.   :4356.0   Max.   :99.00   Canada       :  121
##                                   (Other)      : 1709
The dataset contains census income records: 32561 observations of 15 variables. The salary column, which is the variable to be predicted, has two classes:
1. Salary <= 50K: 24720 records.
2. Salary > 50K: 7841 records.
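The class balance can be expressed as proportions directly from the two counts. A minimal sketch, using only the counts printed by table(censusdata$salary); the names salary_counts and salary_share are introduced here for illustration:

```r
# Counts copied from the table(censusdata$salary) output
salary_counts <- c("<=50K" = 24720, ">50K" = 7841)

# Share of each class among all records
salary_share <- salary_counts / sum(salary_counts)
round(salary_share, 3)
```

With the data frame loaded, prop.table(table(censusdata$salary)) should give the same shares. Roughly three quarters of the records fall in the <=50K class, so a model that always predicts <=50K is already right about 76% of the time.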
The summary statistics give the minimum, maximum, 1st and 3rd quartiles, mean and median of the numeric variables (age, fnlwgt, education number, capital gain, capital loss and hours per week). For the categorical variables (workclass, education, marital status, occupation, relationship, race, sex, native country and salary) they give the counts of the most frequent levels.
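The output below comes from a classification tree fitted with rpart. The chunk that produced it is not shown, so the following is only a sketch reconstructed from the "Call:" line of that output; the wrapper name fit_salary_tree and the reporting calls in the comments are assumptions, not the original code.

```r
library(rpart)

# Hypothetical wrapper. The formula, data argument and method are taken
# from the "Call:" line of the printed rpart output.
fit_salary_tree <- function(data) {
  rpart(
    salary ~ age + workclass + fnlwgt + education + education.number +
      marital.status + occupation + relationship + race + sex +
      capital.gain + capital.loss + hours.per.week + native.country,
    data = data,
    method = "class"  # classification tree
  )
}

# Assumed usage with the data frame loaded earlier:
# salary_tree <- fit_salary_tree(censusdata)
# summary(salary_tree)                  # node-by-node details
# printcp(salary_tree)                  # complexity parameter table
# rpart.plot::rpart.plot(salary_tree)   # tree diagram
# table(Actual = censusdata$salary,     # confusion matrix on the training data
#       Pred = predict(salary_tree, type = "class"))
```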
## Call:
## rpart(formula = salary ~ age + workclass + fnlwgt + education +
## education.number + marital.status + occupation + relationship +
## race + sex + capital.gain + capital.loss + hours.per.week +
## native.country, data = censusdata, method = "class")
## n= 32561
##
## CP nsplit rel error xerror xstd
## 1 0.12638694 0 1.0000000 1.0000000 0.009839876
## 2 0.06402245 2 0.7472261 0.7472261 0.008840225
## 3 0.03749522 3 0.6832037 0.6832037 0.008532119
## 4 0.01000000 4 0.6457085 0.6457085 0.008339388
##
## Variable importance
## relationship marital.status capital.gain education
## 24 24 10 9
## education.number sex occupation age
## 9 7 7 5
## hours.per.week
## 3
##
## Node number 1: 32561 observations, complexity param=0.1263869
## predicted class= <=50K expected loss=0.2408096 P(node) =1
## class counts: 24720 7841
## probabilities: 0.759 0.241
## left son=2 (17800 obs) right son=3 (14761 obs)
## Primary splits:
## relationship splits as RLLLLR, improve=2394.796, (0 missing)
## marital.status splits as LRRLLLL, improve=2360.673, (0 missing)
## capital.gain < 5119 to the left, improve=1658.928, (0 missing)
## education splits as LLLLLLLLLRRLRLRL, improve=1274.368, (0 missing)
## education.number < 12.5 to the left, improve=1274.368, (0 missing)
## Surrogate splits:
## marital.status splits as LRRLLLL, agree=0.993, adj=0.984, (0 split)
## sex splits as LR, agree=0.688, adj=0.311, (0 split)
## age < 33.5 to the left, agree=0.650, adj=0.229, (0 split)
## occupation splits as LLLRRRLLLLRRLLR, agree=0.622, adj=0.167, (0 split)
## hours.per.week < 43.5 to the left, agree=0.605, adj=0.130, (0 split)
##
## Node number 2: 17800 observations, complexity param=0.03749522
## predicted class= <=50K expected loss=0.06617978 P(node) =0.5466663
## class counts: 16622 1178
## probabilities: 0.934 0.066
## left son=4 (17482 obs) right son=5 (318 obs)
## Primary splits:
## capital.gain < 7073.5 to the left, improve=519.9766, (0 missing)
## education splits as LLLLLLLLLRRLRLRL, improve=147.2153, (0 missing)
## education.number < 12.5 to the left, improve=147.2153, (0 missing)
## occupation splits as LLLLRLLLLLRRLLL, improve=121.3603, (0 missing)
## hours.per.week < 43.5 to the left, improve=112.2149, (0 missing)
##
## Node number 3: 14761 observations, complexity param=0.1263869
## predicted class= <=50K expected loss=0.4513922 P(node) =0.4533337
## class counts: 8098 6663
## probabilities: 0.549 0.451
## left son=6 (10329 obs) right son=7 (4432 obs)
## Primary splits:
## education splits as LLLLLLLLLRRLRLRL, improve=938.6245, (0 missing)
## education.number < 12.5 to the left, improve=938.6245, (0 missing)
## occupation splits as LLRLRLLLLLRRRRL, improve=893.7492, (0 missing)
## capital.gain < 5095.5 to the left, improve=755.5021, (0 missing)
## capital.loss < 1782.5 to the left, improve=258.7303, (0 missing)
## Surrogate splits:
## education.number < 12.5 to the left, agree=1.000, adj=1.000, (0 split)
## occupation splits as LLLLRLLLLLRLLLL, agree=0.790, adj=0.301, (0 split)
## capital.gain < 7493 to the left, agree=0.716, adj=0.056, (0 split)
## native.country splits as LLLRLLLLLRRLLLL-LLLRRLLLRLLLLLRLLLLRRLLLLL, agree=0.707, adj=0.025, (0 split)
## capital.loss < 1894.5 to the left, agree=0.705, adj=0.019, (0 split)
##
## Node number 4: 17482 observations
## predicted class= <=50K expected loss=0.04987988 P(node) =0.5369
## class counts: 16610 872
## probabilities: 0.950 0.050
##
## Node number 5: 318 observations
## predicted class= >50K expected loss=0.03773585 P(node) =0.009766285
## class counts: 12 306
## probabilities: 0.038 0.962
##
## Node number 6: 10329 observations, complexity param=0.06402245
## predicted class= <=50K expected loss=0.3345919 P(node) =0.31722
## class counts: 6873 3456
## probabilities: 0.665 0.335
## left son=12 (9807 obs) right son=13 (522 obs)
## Primary splits:
## capital.gain < 5095.5 to the left, improve=459.2245, (0 missing)
## occupation splits as LRLLRLLLLLRRRRL, improve=245.6701, (0 missing)
## education splits as LLLLLLLRR--R-L-R, improve=174.7964, (0 missing)
## education.number < 8.5 to the left, improve=174.7964, (0 missing)
## age < 35.5 to the left, improve=125.2055, (0 missing)
##
## Node number 7: 4432 observations
## predicted class= >50K expected loss=0.2763989 P(node) =0.1361138
## class counts: 1225 3207
## probabilities: 0.276 0.724
##
## Node number 12: 9807 observations
## predicted class= <=50K expected loss=0.3001937 P(node) =0.3011885
## class counts: 6863 2944
## probabilities: 0.700 0.300
##
## Node number 13: 522 observations
## predicted class= >50K expected loss=0.01915709 P(node) =0.01603145
## class counts: 10 512
## probabilities: 0.019 0.981
##
## Classification tree:
## rpart(formula = salary ~ age + workclass + fnlwgt + education +
## education.number + marital.status + occupation + relationship +
## race + sex + capital.gain + capital.loss + hours.per.week +
## native.country, data = censusdata, method = "class")
##
## Variables actually used in tree construction:
## [1] capital.gain education relationship
##
## Root node error: 7841/32561 = 0.24081
##
## n= 32561
##
## CP nsplit rel error xerror xstd
## 1 0.126387 0 1.00000 1.00000 0.0098399
## 2 0.064022 2 0.74723 0.74723 0.0088402
## 3 0.037495 3 0.68320 0.68320 0.0085321
## 4 0.010000 4 0.64571 0.64571 0.0083394
##
## Pred: <=50K Pred: >50K
## Actual: <=50K 23473 1247
## Actual: >50K 3816 4025
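The classification tree has 9 nodes, 5 of which are terminal (leaf) nodes. The node numbers run up to 13 only because rpart numbers the children of node k as 2k and 2k+1. Of the 14 predictors, just three are actually used for splitting: relationship, education and capital.gain. Education.number gives exactly the same split as education (it appears as a surrogate with agree=1.000), so either variable could play that role.
According to the decision tree, the criteria that distinguish people earning at most 50K from people earning more than 50K are relationship, capital gain and education. The root split is on relationship: husbands and wives go to the right branch (node 3) and all other relationship categories go to the left branch (node 2).
1. For people who are not a husband or wife, capital gain < 7073.5 leads to node 4, which holds 54% of all observations and is predicted <=50K (95% of that node earns <=50K). Capital gain >= 7073.5 leads to node 5, which holds 1% of all observations and is predicted >50K (96% of that node earns >50K).
2. For husbands and wives with education = Bachelors, Doctorate, Masters or Prof-school (node 7, 14% of all observations), the prediction is >50K; 72% of that node earns >50K.
3. For husbands and wives with education = 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, HS-grad, Preschool or Some-college (node 6, 32% of all observations), 33% earn >50K. Within this group, capital gain < 5095.5 is predicted <=50K (node 12, 30% of all observations, 30% of which earn >50K) and capital gain >= 5095.5 is predicted >50K (node 13, 2% of all observations, 98% of which earn >50K).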
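The confusion matrix printed at the end of the tree output can be summarised with a few standard rates. A minimal sketch that uses only the four counts from that matrix; the object names are introduced here for illustration:

```r
# Confusion matrix copied from the tree output (rows = actual, columns = predicted)
conf_mat <- matrix(
  c(23473, 1247,
     3816, 4025),
  nrow = 2, byrow = TRUE,
  dimnames = list(Actual = c("<=50K", ">50K"), Pred = c("<=50K", ">50K"))
)

# Share of all records classified correctly
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
# Share of actual >50K earners that the tree identifies
sensitivity <- conf_mat[">50K", ">50K"] / sum(conf_mat[">50K", ])
# Share of actual <=50K earners that the tree identifies
specificity <- conf_mat["<=50K", "<=50K"] / sum(conf_mat["<=50K", ])

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

On these counts the tree classifies about 84% of the training records correctly, but it finds only about half of the people who earn more than 50K (4025 of 7841). These rates are computed on the same data the tree was fitted to, so they are likely to be optimistic.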
## [1] 0.06875571
The correlation between age and hours worked per week is very small (about 0.069). Age therefore says little about how many hours a person works, so working hours should not be allocated on the basis of a person's age.
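The call that printed the value 0.06875571 is not shown. A plausible sketch, assuming it is the Pearson correlation (the default of cor()) between the age and hours.per.week columns; the helper name age_hours_cor is hypothetical:

```r
# Hypothetical helper. Assumes a data frame with numeric columns
# `age` and `hours.per.week`, as in censusdata.
age_hours_cor <- function(data) {
  cor(data$age, data$hours.per.week, method = "pearson")
}

# Assumed usage with the data frame loaded earlier:
# age_hours_cor(censusdata)
# cor.test(censusdata$age, censusdata$hours.per.week)  # adds a confidence interval
```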
LRModel <- glm(CI_data$salary~CI_data$sex+CI_data$age+CI_data$hours.per.week+CI_data$occupation+CI_data$marital.status, data = CI_data)
summary(LRModel)
##
## Call:
## glm(formula = CI_data$salary ~ CI_data$sex + CI_data$age + CI_data$hours.per.week +
## CI_data$occupation + CI_data$marital.status, data = CI_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8507 -0.2852 -0.1457 0.1434 1.1173
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6341998 0.0146952 43.16 <2e-16 ***
## CI_data$sex 0.1322856 0.0048484 27.28 <2e-16 ***
## CI_data$age 0.0057837 0.0001679 34.44 <2e-16 ***
## CI_data$hours.per.week 0.0055472 0.0001863 29.77 <2e-16 ***
## CI_data$occupation 0.0054397 0.0005241 10.38 <2e-16 ***
## CI_data$marital.status -0.0284765 0.0015467 -18.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1580833)
##
## Null deviance: 5952.8 on 32560 degrees of freedom
## Residual deviance: 5146.4 on 32555 degrees of freedom
## AIC: 32349
##
## Number of Fisher Scoring iterations: 2
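Although this model was intended as a logistic regression, glm() was called without a family argument, so R fitted its default gaussian family (see the dispersion line in the output). The result is therefore a linear model of the salary indicator, not a logistic regression. The coefficients for sex, age, hours per week, occupation and marital status are all small in magnitude, yet every one of them is highly significant (p < 2e-16). Sex has the largest coefficient (0.132), which suggests that sex is one of the stronger predictors of salary class among these variables. Note, however, that each coefficient is expressed per unit of its own predictor, so the sizes are not directly comparable. The marketing team can use this information when estimating the likely salary class of male and female employees.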
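glm() defaults to the gaussian family, so a logistic regression needs family = binomial. A minimal sketch of such a fit with the same five predictors follows. How CI_data was prepared is not shown, so the helper name fit_salary_logit and the assumption that salary is a two-level factor (or a 0/1 variable) are illustrative only.

```r
# Hypothetical helper. Assumes `salary` is a two-level factor (or 0/1)
# and that the predictor columns exist under these names.
fit_salary_logit <- function(data) {
  glm(
    salary ~ sex + age + hours.per.week + occupation + marital.status,
    data = data,
    family = binomial(link = "logit")  # logistic regression
  )
}

# Assumed usage:
# logit_fit <- fit_salary_logit(censusdata)
# summary(logit_fit)
# exp(coef(logit_fit))  # coefficients expressed as odds ratios
```

Writing the formula with bare column names and passing the data frame through data = keeps the coefficient labels short and lets predict() work on new data. Coefficients from such a fit are on the log-odds scale, so they would not be numerically comparable to the estimates in the gaussian output.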