Predicting if income exceeds $50,000 per year based on 1994 US Census Data with Simple Classification Techniques
Data Set Information:
Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0)) download the dataset from: “https://www.kaggle.com/uciml/incomecensus-census-income”
*** Response variable A. income: Categorical variable that contains yearly income of the respondent (“<=$50K” or “>50K”).
*** Independent variables B. age: Numerical variable that contains the age of the respondent.
C. workclass : Categorical variable that contains the type of employer of the respondent
D. fnlwgt: Numerical variable that contains the number of respondents that each row of the data set represents.
E. education: Categorical variable that represents the level of education of the respondent (Doctorate, Prof-school, Masters, Bachelors, Assoc-acdm, Assoc-voc, Some-college, HS-grad, 12th, 11th, 10th, 9th, 7th-8th, 5th-6th, 1st-4th, Preschool)
F. education.num: Numerical variable that represents the education variable.
G. marital.status: The marital status of the respondent
H. occupation: Categorical variable that represents the type of employment of the respondent (?, Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Prof-specialty, Protective-serv, Sales, Tech-support, Transport-moving).
I. relationship: Categorical variable that represents the position in the family of the respondent (Husband, Not-in-family, Other-relative, Own-child, Unmarried, Wife).
J. race: Categorical variable that represents the race of the respondent (Amer-Indian-Eskimo, Asian-Pac-Islander, Black,Other, White).
K. sex: Categorical variable that represent the sex of the respondent (Female, Male).
L. capital.gain: Numerical variable that represents the income gained by the respondent from sources other than salary/wages.
M. capital.loss: Numerical variable that represents the income lost by the respondent from sources other than salary/wages.
N. hours.per.week: Numerical variable that represents the hours worked per week by the respondent.
O. native.country: Categorical variable that represents the native country of the respondent.
We will read in our data using the read_csv() function, from the tidyverse package readr, instead of read.csv().
library(readr)
library(tidyverse)
## -- Attaching packages ----------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0 v dplyr 0.8.4
## v tibble 2.1.3 v stringr 1.4.0
## v tidyr 1.0.2 v forcats 0.4.0
## v purrr 0.3.3
## Warning: package 'ggplot2' was built under R version 3.6.3
## -- Conflicts -------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(dplyr)
library(DataExplorer)
library(pander)
incomecensus <- read_csv("adultincome.csv")
## Parsed with column specification:
## cols(
## age = col_double(),
## workclass = col_character(),
## fnlwgt = col_double(),
## education = col_character(),
## education.num = col_double(),
## marital.status = col_character(),
## occupation = col_character(),
## relationship = col_character(),
## race = col_character(),
## sex = col_character(),
## capital.gain = col_double(),
## capital.loss = col_double(),
## hours.per.week = col_double(),
## native.country = col_character(),
## income = col_character()
## )
First look at the data
str(incomecensus)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 32561 obs. of 15 variables:
## $ age : num 90 82 66 54 41 34 38 74 68 41 ...
## $ workclass : chr "?" "Private" "?" "Private" ...
## $ fnlwgt : num 77053 132870 186061 140359 264663 ...
## $ education : chr "HS-grad" "HS-grad" "Some-college" "7th-8th" ...
## $ education.num : num 9 9 10 4 10 9 6 16 9 10 ...
## $ marital.status: chr "Widowed" "Widowed" "Widowed" "Divorced" ...
## $ occupation : chr "?" "Exec-managerial" "?" "Machine-op-inspct" ...
## $ relationship : chr "Not-in-family" "Not-in-family" "Unmarried" "Unmarried" ...
## $ race : chr "White" "White" "Black" "White" ...
## $ sex : chr "Female" "Female" "Female" "Female" ...
## $ capital.gain : num 0 0 0 0 0 0 0 0 0 0 ...
## $ capital.loss : num 4356 4356 4356 3900 3900 ...
## $ hours.per.week: num 40 18 40 40 40 45 40 20 40 60 ...
## $ native.country: chr "United-States" "United-States" "United-States" "United-States" ...
## $ income : chr "<=50K" "<=50K" "<=50K" "<=50K" ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. workclass = col_character(),
## .. fnlwgt = col_double(),
## .. education = col_character(),
## .. education.num = col_double(),
## .. marital.status = col_character(),
## .. occupation = col_character(),
## .. relationship = col_character(),
## .. race = col_character(),
## .. sex = col_character(),
## .. capital.gain = col_double(),
## .. capital.loss = col_double(),
## .. hours.per.week = col_double(),
## .. native.country = col_character(),
## .. income = col_character()
## .. )
incomecensus <- incomecensus %>%
select(-c(fnlwgt))
head(incomecensus)
## # A tibble: 6 x 14
## age workclass education education.num marital.status occupation relationship
## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 90 ? HS-grad 9 Widowed ? Not-in-fami~
## 2 82 Private HS-grad 9 Widowed Exec-mana~ Not-in-fami~
## 3 66 ? Some-col~ 10 Widowed ? Unmarried
## 4 54 Private 7th-8th 4 Divorced Machine-o~ Unmarried
## 5 41 Private Some-col~ 10 Separated Prof-spec~ Own-child
## 6 34 Private HS-grad 9 Divorced Other-ser~ Unmarried
## # ... with 7 more variables: race <chr>, sex <chr>, capital.gain <dbl>,
## # capital.loss <dbl>, hours.per.week <dbl>, native.country <chr>,
## # income <chr>
Dealing with rows containing “?”
We can see that there are some rows with “?”.
For example
incomecensus %>%
group_by(workclass) %>%
summarise(count = n())
## # A tibble: 9 x 2
## workclass count
## <chr> <int>
## 1 ? 1836
## 2 Federal-gov 960
## 3 Local-gov 2093
## 4 Never-worked 7
## 5 Private 22696
## 6 Self-emp-inc 1116
## 7 Self-emp-not-inc 2541
## 8 State-gov 1298
## 9 Without-pay 14
We will check if there are any other columns containings “?”
First, check for the records with any “?”
missing_count <- purrr::map_df(incomecensus, ~ stringr::str_detect(., pattern = "\\?")) %>%
rowSums() %>%
tbl_df() %>%
filter(value > 0) %>%
summarize(missing_count = n())
missing_count
## # A tibble: 1 x 1
## missing_count
## <int>
## 1 2399
In total, there are 2399 rows containing “?”
count.NA.per.col <- plyr::ldply(incomecensus, function(c) sum(c == "?"))
count.NA.per.col %>%
pander()
| .id | V1 |
|---|---|
| age | 0 |
| workclass | 1836 |
| education | 0 |
| education.num | 0 |
| marital.status | 0 |
| occupation | 1843 |
| relationship | 0 |
| race | 0 |
| sex | 0 |
| capital.gain | 0 |
| capital.loss | 0 |
| hours.per.week | 0 |
| native.country | 583 |
| income | 0 |
There are 3 columns that contain “?” as NA: workclass, occupation, and native.country
We will remove all of these records.
incomecensus <- incomecensus %>%
filter(!workclass=="?", !occupation=="?", !native.country=="?")
The resulting data contains 30162 rows.
Assigning correct R data types to each column
Converting character to factor.m
incomecensus <- incomecensus %>%
mutate_if(is.character,as.factor)
str(incomecensus)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 30162 obs. of 14 variables:
## $ age : num 82 54 41 34 38 74 68 45 38 52 ...
## $ workclass : Factor w/ 7 levels "Federal-gov",..: 3 3 3 3 3 6 1 3 5 3 ...
## $ education : Factor w/ 16 levels "10th","11th",..: 12 6 16 12 1 11 12 11 15 10 ...
## $ education.num : num 9 4 10 9 6 16 9 16 15 13 ...
## $ marital.status: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 7 1 6 1 6 5 1 1 5 7 ...
## $ occupation : Factor w/ 14 levels "Adm-clerical",..: 4 7 10 8 1 10 10 10 10 8 ...
## $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 5 4 5 5 3 2 5 2 2 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 5 5 5 5 3 5 5 ...
## $ sex : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 1 1 1 2 1 ...
## $ capital.gain : num 0 0 0 0 0 0 0 0 0 0 ...
## $ capital.loss : num 4356 3900 3900 3770 3770 ...
## $ hours.per.week: num 18 40 40 45 40 20 40 35 45 20 ...
## $ native.country: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 39 39 39 39 39 39 ...
## $ income : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 2 1 2 2 2 ...
3.1. Missing Values
plot_missing(incomecensus)
We will remove the fnlwgt column
3.2. Checking duplicate records
# to check for duplicate records
incomecensus %>%
summarize(record_count = n(),
distinct_records = n_distinct(.))
## # A tibble: 1 x 2
## record_count distinct_records
## <int> <int>
## 1 30162 26904
The function distinct() [dplyr package] can be used to keep only unique/distinct rows from a data frame.
If there are duplicate rows, only the first row is preserved. It’s an efficient version of the R base function unique().
incomecensus <- incomecensus %>%
distinct()
The resulting data frame contain 26904 records.
3.3. Checking the consistence of the data
incomecensus %>%
ggplot(aes(age, education)) +
geom_point(alpha = 0.3, col = "#00AFBB") +
geom_point(aes(col = (age<= 20 & education == 'Masters')),
alpha = 1,
size = 3) +
scale_colour_manual(values = setNames(c('blue','grey'),
c(T, F))) +
theme(legend.position="bottom") +
scale_x_continuous(breaks = seq(0,100,10))+
xlab("Age") +
ylab("Education Level") +
ggtitle("Age vs Education Level")
There are a few data points (in blue) are outlier datapoints.
It is highly unlikely that the people with age of 20 years and less can complete Masters degree. In reality, it might be possible.
We can take these data points out.
incomecensus <- incomecensus %>%
filter(!(age <= 20 & education == 'Masters'))
B. sex and relationship status
table(incomecensus$sex, incomecensus$relationship)
##
## Husband Not-in-family Other-relative Own-child Unmarried Wife
## Female 1 3262 384 1576 2354 1365
## Male 10808 3853 489 2077 732 1
There is one record where Husband is female
There is also one record where wife is female.
In 1994, this situation was not possible in the US
Where are these records ?
incomecensus %>%
filter((sex == 'Male' & relationship == 'Wife') | (sex == 'Female' & relationship == 'Husband'))
## # A tibble: 2 x 14
## age workclass education education.num marital.status occupation relationship
## <dbl> <fct> <fct> <dbl> <fct> <fct> <fct>
## 1 29 Private Bachelors 13 Married-civ-s~ Exec-mana~ Wife
## 2 34 Private HS-grad 9 Married-civ-s~ Sales Husband
## # ... with 7 more variables: race <fct>, sex <fct>, capital.gain <dbl>,
## # capital.loss <dbl>, hours.per.week <dbl>, native.country <fct>,
## # income <fct>
These points can be removed using the code below.
incomecensus <- incomecensus %>%
filter(!((sex == 'Male' & relationship == 'Wife') | (sex == 'Female' & relationship == 'Husband')))
C. Relationship and marital status
Marital Status: The marital status classification identifies five major categories: never married, married, widowed, and divorced.
table(incomecensus$relationship, incomecensus$marital.status)
##
## Divorced Married-AF-spouse Married-civ-spouse
## Husband 0 9 10799
## Not-in-family 2153 0 14
## Other-relative 102 1 118
## Own-child 305 1 83
## Unmarried 1449 0 0
## Wife 0 10 1355
##
## Married-spouse-absent Never-married Separated Widowed
## Husband 0 0 0 0
## Not-in-family 181 3959 382 426
## Other-relative 26 533 53 40
## Own-child 43 3120 89 12
## Unmarried 120 774 404 339
## Wife 0 0 0 0
The group “other married, spouse absent” includes married people living apart because either the husband or wife was employed and living at a considerable distance from home, was serving away from home in the Armed Forces, had moved to another area, or had a different place of residence for any other reason except separation as defined above. (https://www.unmarried.org/government-terminology/)
incomecensus %>%
filter(marital.status == 'Married-spouse-absent' & relationship == 'Unmarried') %>%
summarise(count = n())
## # A tibble: 1 x 1
## count
## <int>
## 1 120
There are 120 rows where people are ‘Married-spouse-absent’ and ‘Unmarried’ at the same time.
Thers records will be removed.
incomecensus <- incomecensus %>%
filter(!(marital.status == 'Married-spouse-absent' & relationship == 'Unmarried'))
D. age and weekly working hour
incomecensus %>%
ggplot(aes(age, hours.per.week)) +
geom_point(size = 2, shape = 23, color = "blue") +
theme(legend.position="bottom") +
scale_x_continuous(breaks = seq(0,100,10))+
scale_y_continuous(breaks = seq(0,100,10))+
xlab("Age") +
ylab("Weekly working hours") +
ggtitle("Age vs Weekly working hours ")
Some people aged older than 70 and work more than 80 hours a week.
Even for people with 90 years of age, there are records of 100 hours per week.
To lighlight these data point
incomecensus %>%
ggplot(aes(age, hours.per.week)) +
geom_point(alpha = 0.3, col = "#00AFBB") +
geom_point(aes(col = (age >= 70 & hours.per.week >= 40)),
alpha = 1,
size = 3) +
scale_colour_manual(values = setNames(c('red','grey'),
c(T, F))) +
theme(legend.position="bottom") +
scale_x_continuous(breaks = seq(0,100,10))+
scale_y_continuous(breaks = seq(0,100,10))+
xlab("Age") +
ylab("Weekly working hours") +
ggtitle("Age vs Weekly working hours ")
incomecensus %>%
filter(age >= 70 & hours.per.week > 50) %>%
summarise(counts =n())
## # A tibble: 1 x 1
## counts
## <int>
## 1 25
Data points with age more than and equal to 70 years and working hours greater than 50 hours can be removed.
incomecensus <- incomecensus %>%
filter(!(age >= 70 & hours.per.week > 50))
E. capital loss and capital gain
There are many points with 99999 as the Capital which seems suspicious and hence, should be eliminated from the analysis.
incomecensus %>%
group_by(capital.gain) %>%
summarise(counts =n()) %>%
arrange(desc(capital.gain)) %>%
head(10)
## # A tibble: 10 x 2
## capital.gain counts
## <dbl> <int>
## 1 99999 147
## 2 41310 2
## 3 34095 3
## 4 27828 32
## 5 25236 10
## 6 25124 1
## 7 22040 1
## 8 20051 31
## 9 18481 2
## 10 15831 6
We can remove all the records where the capital gain is equal to 99999.
incomecensus <- incomecensus %>%
filter(!(capital.gain == 99999))
incomecensus %>%
group_by(capital.loss) %>%
summarise(counts =n()) %>%
arrange(desc(capital.loss)) %>%
head(10)
## # A tibble: 10 x 2
## capital.loss counts
## <dbl> <int>
## 1 4356 1
## 2 3900 2
## 3 3770 2
## 4 3683 2
## 5 3004 1
## 6 2824 8
## 7 2754 2
## 8 2603 4
## 9 2559 12
## 10 2547 4
F. Work class and occupation
table(incomecensus$occupation, incomecensus$workclass)
##
## Federal-gov Local-gov Private Self-emp-inc Self-emp-not-inc
## Adm-clerical 308 269 2360 28 46
## Armed-Forces 9 0 0 0 0
## Craft-repair 62 138 2377 96 485
## Exec-managerial 175 209 2296 360 369
## Farming-fishing 8 27 427 51 415
## Handlers-cleaners 21 44 1056 2 15
## Machine-op-inspct 14 11 1571 10 35
## Other-service 33 187 2349 27 171
## Priv-house-serv 0 0 137 0 0
## Prof-specialty 164 664 2001 138 346
## Protective-serv 27 291 183 5 6
## Sales 14 7 2508 266 362
## Tech-support 66 38 667 3 26
## Transport-moving 24 112 1093 26 118
##
## State-gov Without-pay
## Adm-clerical 245 3
## Armed-Forces 0 0
## Craft-repair 52 1
## Exec-managerial 181 0
## Farming-fishing 15 5
## Handlers-cleaners 9 1
## Machine-op-inspct 13 1
## Other-service 119 0
## Priv-house-serv 0 0
## Prof-specialty 392 0
## Protective-serv 112 0
## Sales 10 0
## Tech-support 55 0
## Transport-moving 40 1
We might remove the two levels: without-pay
We also might regroup these levels into 3 categoris : public, private, self-employeed
A. income
Categorical variable that contains yearly income of the respondent (“<=$50K” or “>50K”).
incomecensus %>%
group_by(income) %>%
summarise(count = n())
## # A tibble: 2 x 2
## income count
## <fct> <int>
## 1 <=50K 19888
## 2 >50K 6720
income_prop <- incomecensus %>%
group_by(income) %>%
summarise(count = n()) %>%
ungroup()%>%
arrange(desc(income)) %>%
mutate(percentage = round(count/sum(count),4)*100,
lab.pos = cumsum(percentage)-0.5*percentage)
ggplot(data = income_prop,
aes(x = "",
y = percentage,
fill = income))+
geom_bar(stat = "identity")+
coord_polar("y") +
geom_text(aes(y = lab.pos,
label = paste(percentage,"%", sep = "")), col = "blue", size = 5) +
scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
theme_void() +
theme(legend.title = element_text(color = "black", size = 14),
legend.text = element_text(color = "black", size = 14))
The summary of the data shows that 75 % of the observations have an income less than or equal to 50k dollars.
Specifically, 22654 persons have an income <=50k dollars, while 7508 people earn more than 50k.
B. Age
incomecensus %>% ggplot(aes(age)) +
geom_histogram(fill= "lightblue",
color = 'blue',
binwidth = 5) +
labs(title= "Age Distribution") +
theme(plot.title = element_text(hjust = 0.5))
summary(incomecensus$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 28.00 38.00 38.97 48.00 90.00
incomecensus %>% ggplot(aes(age)) +
geom_histogram(aes(fill=income),
color = 'grey',
binwidth = 1) +
scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
labs(title= "Age Distribution for Income")+
theme(plot.title = element_text(hjust = 0.5))
incomecensus %>%
ggplot(aes(age,
fill= income)) +
geom_density(alpha= 0.7, color = 'blue') +
scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
labs(x = "Age", y = "Density", title = "Density graph of age distribution")
Older people tends to earn more.
On the other hand, majority of the people with an age around 25 years earns less than 50k per annum.
C. workclass
library(ggthemes)
library(hrbrthemes)
## Warning: package 'hrbrthemes' was built under R version 3.6.3
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
## Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
## if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
incomecensus %>%
group_by(workclass) %>%
summarise(counts = n()) %>%
mutate(Percentage = round(counts*100/sum(counts),2)) %>%
arrange(desc(counts)) %>%
ggplot(aes(x= reorder(workclass, -counts),
y = counts,
fill = Percentage)) +
geom_bar(stat = "identity",
width = 0.6) +
scale_fill_gradient(low="skyblue1", high="royalblue4")+
geom_text(aes(label = paste0(round(counts,1), "\n",Percentage,"%")),
vjust = -0.1,
color = "darkblue",
size = 4) +
scale_y_continuous(limits = c(0,25000)) +
theme_minimal() +
labs(x = "Work class",
y = "Frequency",
caption = "Income census US 1994")
Almost three-quarters of sample work in the private sector.
In this data, there are dictinction between Local-gov, State-gov, and Federal-gov.
We might group these three category if there is no difference in income level between them.
library(ggpubr)
## Loading required package: magrittr
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
p1 <- incomecensus %>%
group_by(workclass) %>%
summarise(counts = n()) %>%
mutate(Percentage = round(counts*100/sum(counts),2)) %>%
arrange(desc(counts)) %>%
ggplot(aes(x= reorder(workclass, counts),
y = counts)) +
geom_bar(stat = "identity",
width = 0.6,
fill = "steelblue") +
geom_text(aes(label = paste0(round(counts,1),"\n",Percentage,"%")),
vjust = 0.5,
hjust = -0.5,
color = "darkblue",
size = 4) +
scale_y_continuous(limits = c(0,20000)) +
theme_minimal() +
labs(x = "Work class",y = "Frequency") +
coord_flip()
p2 <- incomecensus %>%
group_by(workclass, income) %>%
summarize(n = n()) %>%
mutate(pct = n*100/sum(n)) %>%
ggplot(aes(x = reorder(workclass, n),
y = pct/100,
fill = income)) +
geom_bar(stat = "identity", width = 0.6) +
scale_x_discrete(name = "") +
scale_y_continuous(name= "Percentage",
labels = percent) +
scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
geom_text(aes(label = paste0(round(pct,0),"%")),
position = position_stack(vjust = 0.5),
size = 4,
color = "black") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.y=element_blank()) +
coord_flip()
ggarrange(p1, p2, nrow = 1)
income_gov <- incomecensus %>%
filter(workclass %in% c("Local-gov", "State-gov", "Federal-gov")) %>%
group_by(workclass, income) %>%
summarise(count = n()) %>%
mutate(pct = count/sum(count)) %>%
arrange(desc(income), pct)
income_gov <- income_gov %>% transmute(income, percent = count*100/sum(count))
income_gov
## # A tibble: 6 x 3
## # Groups: workclass [3]
## workclass income percent
## <fct> <fct> <dbl>
## 1 State-gov >50K 27.0
## 2 Local-gov >50K 29.4
## 3 Federal-gov >50K 38.4
## 4 Federal-gov <=50K 61.6
## 5 Local-gov <=50K 70.6
## 6 State-gov <=50K 73.0
incomecensus %>%
filter(workclass %in% c("Local-gov", "State-gov", "Federal-gov")) %>%
group_by(workclass, income) %>%
summarize(n = n()) %>%
mutate(pct = n*100/sum(n)) %>%
ggplot(aes(x = reorder(workclass, n),
y = pct,
fill = income)) +
geom_bar(stat = "identity", width = 0.6) +
scale_x_discrete(name = "") +
scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
geom_text(aes(label = paste0(round(pct,0),"%")),
position = position_stack(vjust = 0.5),
size = 4,
color = "black") +
theme(axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank())
There is a slight difference between Local-gov and Stave-gov.
Working for the federal government can have a greater chance of getting an income higher than 50K.
incomecensus %>%
filter(workclass %in% c("Self-emp-not-inc", "Self-emp-inc", "Without-pay")) %>%
group_by(workclass, income) %>%
summarize(n = n()) %>%
mutate(pct = n*100/sum(n)) %>%
ggplot(aes(x = reorder(workclass, n),
y = pct,
fill = income)) +
geom_bar(stat = "identity", width = 0.6) +
scale_x_discrete(name = "") +
scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
geom_text(aes(label = paste0(round(pct,0),"%")),
position = position_stack(vjust = 0.5),
size = 4,
color = "black") +
theme(axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank())
education
incomecensus %>%
group_by(education) %>%
summarise(counts = n()) %>%
arrange(desc(counts)) %>%
pander()
| education | counts |
|---|---|
| HS-grad | 8201 |
| Some-college | 5853 |
| Bachelors | 4447 |
| Masters | 1534 |
| Assoc-voc | 1248 |
| Assoc-acdm | 990 |
| 11th | 937 |
| 10th | 763 |
| 7th-8th | 534 |
| Prof-school | 486 |
| 9th | 445 |
| 12th | 356 |
| Doctorate | 349 |
| 5th-6th | 275 |
| 1st-4th | 146 |
| Preschool | 44 |
incomecensus %>%
group_by(education.num) %>%
summarise(counts = n()) %>%
arrange(desc(counts)) %>%
pander()
| education.num | counts |
|---|---|
| 9 | 8201 |
| 10 | 5853 |
| 13 | 4447 |
| 14 | 1534 |
| 11 | 1248 |
| 12 | 990 |
| 7 | 937 |
| 6 | 763 |
| 4 | 534 |
| 15 | 486 |
| 5 | 445 |
| 8 | 356 |
| 16 | 349 |
| 3 | 275 |
| 2 | 146 |
| 1 | 44 |
We can see that these two columns are identical.
The column education.num is simply converted from the column education.
If we keep the column education which is factor type, we might regroup some categories in this column.
If we keep the column education.num which is numerical type,, we can keep the column as it is (numeric).
eudcation_df <- incomecensus %>%
group_by(education.num) %>%
summarise(counts = n()) %>%
mutate(Percentage = round(counts*100/sum(counts),2)) %>%
arrange(education.num)
eudcation_df %>%
pander()
| education.num | counts | Percentage |
|---|---|---|
| 1 | 44 | 0.17 |
| 2 | 146 | 0.55 |
| 3 | 275 | 1.03 |
| 4 | 534 | 2.01 |
| 5 | 445 | 1.67 |
| 6 | 763 | 2.87 |
| 7 | 937 | 3.52 |
| 8 | 356 | 1.34 |
| 9 | 8201 | 30.82 |
| 10 | 5853 | 22 |
| 11 | 1248 | 4.69 |
| 12 | 990 | 3.72 |
| 13 | 4447 | 16.71 |
| 14 | 1534 | 5.77 |
| 15 | 486 | 1.83 |
| 16 | 349 | 1.31 |
ggplot(data = eudcation_df ,
aes(x= education.num,
y = counts,
fill = Percentage)) +
geom_bar(stat = "identity",
width = 0.6) +
scale_x_continuous(breaks = c(0:16)) +
scale_fill_gradient(low="skyblue1", high="royalblue4")+
geom_text(aes(label = paste0(round(counts,1), "\n","(",Percentage,"%)")),
vjust = -0.1,
color = "darkblue",
size = 4) +
scale_y_continuous(limits = c(0,12000)) +
theme_minimal() +
labs(x = "Years of education",
y = "Frequency",
caption = "Income census US 1994") +
ggtitle("Distribution of years of education") +
theme(plot.title = element_text(hjust = 0.5),
axis.title = element_text(size = 12),
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
legend.position = "right",
legend.title = element_text(size = 12, face = "bold"),
legend.text = element_text(colour="black", size = 12))
About 50% of respondents have 9-10 years of education.
About 17% of people in the sample hold bachelor degree (13 years of education).
education.pct <- incomecensus %>%
group_by(education.num, income) %>%
summarize(count = n()) %>%
mutate(pct = count/sum(count)) %>%
arrange(desc(income), pct)
ggplot(education.pct,
aes(education.num, pct, fill = income)) +
geom_bar(stat="identity", position = "fill") +
geom_hline(yintercept = 0.2489, col = "blue") +
scale_x_continuous(breaks = c(0:16)) +
scale_y_continuous(labels=scales::percent) +
scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
ggtitle("Income by years of education") +
xlab("Education (years)") +
ylab("Percentage") +
theme(plot.title = element_text(hjust = 0.5),
axis.title = element_text(size = 12),
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
legend.position = "right",
legend.title = element_text(size = 12, face = "bold"),
legend.text = element_text(colour="black", size = 12))
Higher years of education led to a higher chance of having >50K.
marital status
incomecensus %>%
group_by(marital.status) %>%
summarise(counts = n()) %>%
arrange(desc(counts))
## # A tibble: 7 x 2
## marital.status counts
## <fct> <int>
## 1 Married-civ-spouse 12236
## 2 Never-married 8369
## 3 Divorced 3996
## 4 Separated 926
## 5 Widowed 811
## 6 Married-spouse-absent 249
## 7 Married-AF-spouse 21
we can put all the married people in the same group
This column have 7 labels, three of them are:
Married-AF-spouse: Married armed forces spouse
Married-civ-spouse: Married civilian spouse
Married-spouse-absent
These levels can be grouped into the group “married”.
Replace “Married-AF-spouse”, “Married-civ-spouse”, and “Married-spouse-absent” by “Married”
pat = c("Married-AF-spouse|Married-civ-spouse|Married-spouse-absent")
incomecensus <- incomecensus %>%
mutate(marital_status = stringr::str_replace_all(marital.status, pat, "Married"))
incomecensus <- incomecensus %>%
select(-marital.status)
marital_df <- incomecensus %>%
group_by(marital_status) %>%
summarise(counts = n()) %>%
mutate(Percentage = round(counts*100/sum(counts),1)) %>%
arrange(desc(counts))
marital_df %>%
pander()
| marital_status | counts | Percentage |
|---|---|---|
| Married | 12506 | 47 |
| Never-married | 8369 | 31.5 |
| Divorced | 3996 | 15 |
| Separated | 926 | 3.5 |
| Widowed | 811 | 3 |
The proportion of married people that have an income higher than 50K is highest.
p3 <- incomecensus %>%
group_by(marital_status) %>%
summarise(counts = n()) %>%
mutate(Percentage = round(counts*100/sum(counts),2)) %>%
arrange(desc(counts)) %>%
ggplot(aes(x= reorder(marital_status, counts),
y = counts)) +
geom_bar(stat = "identity",
width = 0.6,
fill = "steelblue") +
geom_text(aes(label = paste0(round(counts,1),"\n","(",Percentage,"%)")),
vjust = 0.4,
hjust = -0.5,
color = "darkblue",
size = 4) +
scale_y_continuous(limits = c(0,20000)) +
theme_minimal() +
labs(x = "Marital status",y = "Frequency") +
coord_flip()
p4 <- incomecensus %>%
group_by(marital_status, income) %>%
summarize(n = n()) %>%
mutate(pct = n*100/sum(n)) %>%
ggplot(aes(x = reorder(marital_status, n),
y = pct/100,
fill = income)) +
geom_bar(stat = "identity", width = 0.6) +
scale_x_discrete(name = "") +
scale_y_continuous(name = "Percentage",
labels = scales::percent) +
scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
geom_text(aes(label = paste0(round(pct,0),"%")),
position = position_stack(vjust = 0.5),
size = 4,
color = "black") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.y=element_blank()) +
coord_flip()
ggarrange(p3, p4, nrow = 1)
occupation
Category_df <- incomecensus %>%
group_by(occupation) %>%
summarise(counts = n()) %>%
mutate(Percentage = round(counts*100/sum(counts),2)) %>%
arrange(counts)
ggplot(data = Category_df,
aes(x= reorder(occupation, -counts),
y = counts,
fill = Percentage)) +
geom_bar(stat = "identity",
width = 0.7) +
geom_text(aes(label = paste0(counts,", ", Percentage,"%")),
vjust = 0.2,
hjust = -0.1,
color = "darkblue",
size = 4) +
scale_y_continuous(limits = c(0,5000)) +
scale_fill_distiller(palette = "Spectral") +
theme_minimal() +
labs(x = "Category",
y = "Frequency",
caption = "Income census US 1994") +
coord_flip()
occupation.pct <- incomecensus %>%
group_by(occupation, income) %>%
summarize(count = n()) %>%
mutate(pct = count/sum(count)) %>%
arrange(desc(income), pct)
occupation.pct$occupation <- factor(occupation.pct$occupation,
levels = occupation.pct$occupation[1:(nrow(occupation.pct)/2)])
ggplot(occupation.pct, aes(reorder(occupation,-pct), pct, fill = income)) +
geom_bar(stat="identity", position = "fill") +
geom_hline(yintercept = 0.2489, col = "blue") +
ggtitle("Income by occupation") +
xlab("Occupation") +
ylab("Percentage") +
scale_y_continuous(labels=scales::percent) +
scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
theme(plot.title = element_text(hjust = 0.5),
axis.title = element_text(size = 12),
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
legend.position = "bottom",
legend.title = element_text(size = 12, face = "bold"),
legend.text = element_text(colour="black", size = 12)) +
coord_flip()
About 25% of the sample has an income of more than 50K.
There are five categories of jobs that have a higher percentage: Protective service, tech-support, Sales, Exec-managerial, and Prof-speciality.
Question: Are these people work in the private sector?
incomecensus %>%
filter(occupation %in% c("Protective-serv", "Tech-support", "Sales", "Exec-managerial", "Prof-specialty")) %>%
group_by(workclass) %>%
summarise(freq = n()) %>%
mutate(percentage = round(freq*100/sum(freq),1)) %>%
arrange(desc(percentage)) %>%
pander()
| workclass | freq | percentage |
|---|---|---|
| Private | 7655 | 64.1 |
| Local-gov | 1209 | 10.1 |
| Self-emp-not-inc | 1109 | 9.3 |
| Self-emp-inc | 772 | 6.5 |
| State-gov | 750 | 6.3 |
| Federal-gov | 446 | 3.7 |
About two-third work in the private sector.
Governmental job accounts for 18.81 %.
Blue_Collar (Craft-repair,Farming-fishing,Handlers-cleaners,Machine-op-inspct,Transport-moving)
White_Collar (Adm-clerical, Sales, Tech-support,Protective-serv),
Exec_mgr_prof (Exec-managerial,Prof-specialty),
and Service_other (Armed-Forces, Other-service, Priv-house-serv)
occupation_distr <- incomecensus %>%
group_by(occupation) %>%
summarise(counts = n()) %>%
mutate(Percentage = round(counts*100/sum(counts),2)) %>%
arrange(desc(counts)) %>%
ggplot(aes(x= reorder(occupation, counts),
y = counts)) +
geom_bar(stat = "identity",
width = 0.6,
fill = "steelblue") +
geom_text(aes(label = paste0(round(counts,1),"\n","(",Percentage,"%)")),
vjust = 0.4, hjust = -0.5,color = "darkblue", size = 4) +
scale_y_continuous(limits = c(0,20000)) +
theme_minimal() +
labs(x = "Grouped Occupation", y = "Frequency") +
coord_flip()
occupation_pct <- incomecensus %>%
group_by(occupation, income) %>%
summarize(n = n()) %>%
mutate(pct = n*100/sum(n)) %>%
ggplot(aes(x = reorder(occupation, n),
y = pct/100,
fill = income)) +
geom_bar(stat = "identity", width = 0.6) +
scale_x_discrete(name = "") +
scale_y_continuous(name = "Percentage",
labels = scales::percent) +
scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
geom_text(aes(label = paste0(round(pct,0),"%")),
position = position_stack(vjust = 0.5),
size = 4,
color = "black") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.y=element_blank()) +
coord_flip()
ggarrange(occupation_distr, occupation_pct, nrow = 1)
incomecensus %>%
ggplot(aes(income, color= income, fill= income)) +
geom_bar( alpha = 0.8, width = 0.8) +
facet_grid(~ relationship) +
labs(x = "Incomes", y = "Count", title = "Incomes by relationship")
This column contains categories that might be overlap with other feature.
For example, unmarried people in the relationship attribute is the same as the unmarried level in the marital column.
relationship_distr <- incomecensus %>%
group_by(relationship) %>%
summarise(counts = n()) %>%
mutate(Percentage = round(counts*100/sum(counts),2)) %>%
arrange(desc(counts)) %>%
ggplot(aes(x= reorder(relationship, counts),
y = counts)) +
geom_bar(stat = "identity",
width = 0.6,
fill = "steelblue") +
geom_text(aes(label = paste0(round(counts,1),"\n","(",Percentage,"%)")),
vjust = 0.4, hjust = -0.5,color = "darkblue", size = 4) +
scale_y_continuous(limits = c(0,20000)) +
theme_minimal() +
labs(x = "Relationship", y = "Frequency") +
coord_flip()
relationship_pct <- incomecensus %>%
group_by(relationship, income) %>%
summarize(n = n()) %>%
mutate(pct = n*100/sum(n)) %>%
ggplot(aes(x = reorder(relationship, n),
y = pct/100,
fill = income)) +
geom_bar(stat = "identity", width = 0.6) +
scale_x_discrete(name = "") +
scale_y_continuous(name = "Percentage",
labels = scales::percent) +
scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
geom_text(aes(label = paste0(round(pct,0),"%")),
position = position_stack(vjust = 0.5),
size = 4,
color = "black") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.y=element_blank()) +
coord_flip()
ggarrange(relationship_distr, relationship_pct, nrow = 1)
This variable may be redundant as it collapse somehow with the feature marital_status.
race
incomecensus %>%
ggplot(aes(race, fill= income)) +
geom_bar(position = "fill") +
labs(x = "Race",
y = "Proportion",
title = "Incomes by race")+
scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
geom_hline(yintercept = 0.2489, col="blue") +
coord_flip()
Black and “Amer-Indio-Eskima” earn less than 50k more frequently than the general population.
However, the percent of whites and “Asian-Pac-Islander” that earn more than 50k is over the general population average.
Sex Here we can see that the vast majority of people having an income greater than 50000 dollars are males.
gender_prop <- incomecensus %>%
group_by(sex) %>%
summarise(count = n()) %>%
ungroup()%>%
arrange(desc(sex)) %>%
mutate(percentage = round(count/sum(count),4)*100,
lab.pos = cumsum(percentage)-0.5*percentage)
gender_distr <- ggplot(data = gender_prop,
aes(x = "",
y = percentage,
fill = sex))+
geom_bar(stat = "identity")+
coord_polar("y") +
geom_text(aes(y = lab.pos,
label = paste(percentage,"%", sep = "")), col = "blue", size = 4) +
scale_fill_manual(values=c("orange", "lightblue"),
name = "Gender") +
theme_void() +
theme(legend.title = element_text(color = "black", size = 12),
legend.text = element_text(color = "black", size = 12))
gender_prop <- incomecensus %>%
group_by(sex, income) %>%
summarize(n = n()) %>%
mutate(pct = n*100/sum(n)) %>%
ggplot(aes(x = reorder(sex, n),
y = pct,
fill = income)) +
geom_bar(stat = "identity", width = 0.6) +
scale_x_discrete(name = "") +
scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
geom_text(aes(label = paste0(round(pct,0),"%")),
position = position_stack(vjust = 0.5),
size = 4,
color = "black") +
theme(axis.text.y = element_blank(),
axis.text.x = element_text(color = "black", size = 12),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
legend.title = element_text(color = "black", size = 12),
legend.text = element_text(color = "black", size = 12))
ggarrange(gender_distr, gender_prop, nrow = 1)
Two-third of respondents are male.
The proportion of women that earn more than 50k is much lower than that of their male counterparts.
hour per week
Distribution of hours per week
incomecensus %>% ggplot(aes(hours.per.week)) +
geom_histogram(fill= "orange",
color = 'blue',
binwidth = 5) +
labs(title= "Age Distribution") +
theme(plot.title = element_text(hjust = 0.5))
incomecensus %>%
group_by(income) %>%
summarise("Mean hours per week" = mean(hours.per.week),
"Standard deviatrion" = sd(hours.per.week))
## # A tibble: 2 x 3
## income `Mean hours per week` `Standard deviatrion`
## <fct> <dbl> <dbl>
## 1 <=50K 39.5 12.3
## 2 >50K 45.8 11.0
Contrusting box-plot
ggplot(data = incomecensus,
aes(income,
hours.per.week,
fill = income))+
scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
geom_boxplot()+
labs(x = "Incomes",
y = "Worked hours per week",
title = "Incomes by working hours")
There are many outliers.
Some people work for 100 hours a week (which is possible)
native.country
Incomes by origin
There are 40 different native countries and some of them have just a few cases, for instance, honduras have just 10 cases.
Therefore, the first attempt is to group the countries by continent.
incomecensus <- incomecensus %>%
mutate(native_continent = case_when(
native.country %in% c("France", "Greece", "Hungary", "Italy", "Portugal",
"Scotland", "England", "Germany", "Holand-Netherlands",
"Ireland", "Poland", "Yugoslavia") ~ "Europe",
native.country %in% c("Columbia", "Dominican-Republic", "El-Salvador", "Haiti",
"Honduras", "Mexico", "Outlying-US(Guam-USVI-etc)",
"Cuba", "Ecuador", "Guatemala", "Jamaica", "Nicaragua",
"Peru", "Puerto-Rico", "Trinadad&Tobago") ~ "Latin America", # "Trinadad&Tobago" and "Outlying-US(Guam-USVI-etc"
native.country %in% c("Iran", "Japan", "Philippines", "Taiwan", "Vietnam", "Cambodia",
"China", "Hong", "India", "Laos", "South", "Thailand") ~ "Asia",
native.country %in% c("United-States","Canada") ~ "USA/Canada"))
incomecensus %>%
group_by(native_continent) %>%
count()
## # A tibble: 4 x 2
## # Groups: native_continent [4]
## native_continent n
## <chr> <int>
## 1 Asia 693
## 2 Europe 490
## 3 Latin America 1314
## 4 USA/Canada 24111
incomecensus %>%
group_by(native_continent, income) %>%
summarize(n = n()) %>%
mutate(pct = n*100/sum(n)) %>%
ggplot(aes(x = reorder(native_continent, n),
y = pct,
fill = income)) +
geom_bar(stat = "identity", width = 0.6) +
scale_x_discrete(name = "") +
scale_fill_manual(values=c("#E3CD81FF", "#B1B3B3FF")) +
geom_text(aes(label = paste0(round(pct,0),"%")),
position = position_stack(vjust = 0.5),
size = 4,
color = "blue") +
theme(axis.text.y = element_blank(),
axis.text.x = element_text(color = "black", size = 12),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
legend.title = element_text(color = "black", size = 12),
legend.text = element_text(color = "black", size = 12))
We will remove some columns
incomecensus <- incomecensus %>%
select(-c("education","capital.gain","capital.loss", "relationship", "native.country"))
The final data
str(incomecensus)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 26608 obs. of 10 variables:
## $ age : num 82 54 41 34 38 74 68 45 38 52 ...
## $ workclass : Factor w/ 7 levels "Federal-gov",..: 3 3 3 3 3 6 1 3 5 3 ...
## $ education.num : num 9 4 10 9 6 16 9 16 15 13 ...
## $ occupation : Factor w/ 4 levels "White_collar",..: 4 3 4 2 1 4 4 4 4 2 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 5 5 5 5 3 5 5 ...
## $ sex : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 1 1 1 2 1 ...
## $ hours.per.week : num 18 40 40 45 40 20 40 35 45 20 ...
## $ income : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 2 1 2 2 2 ...
## $ marital_status : chr "Widowed" "Divorced" "Separated" "Divorced" ...
## $ native_continent: chr "USA/Canada" "USA/Canada" "USA/Canada" "USA/Canada" ...
we need to convert two columns marital_status and native_continent in factor
incomecensus$marital_status <- as.factor(incomecensus$marital_status)
incomecensus$native_continent <- as.factor(incomecensus$native_continent)
we will save this data as income.csv
write.csv(incomecensus,file = "income.csv", row.names= FALSE)
MODELLLING
data <- read_csv("income.csv")
## Parsed with column specification:
## cols(
## age = col_double(),
## workclass = col_character(),
## education.num = col_double(),
## occupation = col_character(),
## race = col_character(),
## sex = col_character(),
## hours.per.week = col_double(),
## income = col_character(),
## marital_status = col_character(),
## native_continent = col_character()
## )
data <- data %>%
mutate_if(is.character,as.factor)
str(data)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 26608 obs. of 10 variables:
## $ age : num 82 54 41 34 38 74 68 45 38 52 ...
## $ workclass : Factor w/ 7 levels "Federal-gov",..: 3 3 3 3 3 6 1 3 5 3 ...
## $ education.num : num 9 4 10 9 6 16 9 16 15 13 ...
## $ occupation : Factor w/ 4 levels "Blue_collar",..: 2 1 2 3 4 2 2 2 2 3 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 5 5 5 5 3 5 5 ...
## $ sex : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 1 1 1 2 1 ...
## $ hours.per.week : num 18 40 40 45 40 20 40 35 45 20 ...
## $ income : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 2 1 2 2 2 ...
## $ marital_status : Factor w/ 5 levels "Divorced","Married",..: 5 1 4 1 4 3 1 1 3 5 ...
## $ native_continent: Factor w/ 4 levels "Asia","Europe",..: 4 4 4 4 4 4 4 4 4 4 ...
library(caret)
## Warning: package 'caret' was built under R version 3.6.3
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
Trainindex <- createDataPartition(y = data$income , p = .70, list = FALSE)
training <- data[Trainindex ,]
validation <- data[-Trainindex,]
training_new <- training[-8]
validation_new <- validation[-8]
income_training_label <- training$income
income_validation_label <- validation$income
LOGISTIC REGRESSION package caret
set.seed(123)
default_glm_mod <- train(form = income ~ .,
data = training,
method = "glm",
family = "binomial",
tuneLength = 5)
glm_pred <- predict(default_glm_mod, newdata = validation)
confusionMatrix(glm_pred, validation$income)
## Confusion Matrix and Statistics
##
## Reference
## Prediction <=50K >50K
## <=50K 5476 931
## >50K 490 1085
##
## Accuracy : 0.822
## 95% CI : (0.8134, 0.8303)
## No Information Rate : 0.7474
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4917
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9179
## Specificity : 0.5382
## Pos Pred Value : 0.8547
## Neg Pred Value : 0.6889
## Prevalence : 0.7474
## Detection Rate : 0.6860
## Detection Prevalence : 0.8027
## Balanced Accuracy : 0.7280
##
## 'Positive' Class : <=50K
##
II. Decision tree based methods
II.1 rpart
Fully grown trees
library(rpart)
## Warning: package 'rpart' was built under R version 3.6.3
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.6.3
library(caret)
rparttree <- rpart(income ~ .,
data = training,
method = "class")
rparttree
## n= 18626
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 18626 4704 <=50K (0.74744980 0.25255020)
## 2) marital_status=Divorced,Never-married,Separated,Widowed 9806 735 <=50K (0.92504589 0.07495411) *
## 3) marital_status=Married 8820 3969 <=50K (0.55000000 0.45000000)
## 6) education.num< 12.5 6125 2065 <=50K (0.66285714 0.33714286)
## 12) occupation=Blue_collar,Other_service 3473 855 <=50K (0.75381515 0.24618485) *
## 13) occupation=Exec_mgr_prof,White_collar 2652 1210 <=50K (0.54374057 0.45625943)
## 26) age< 32.5 587 153 <=50K (0.73935264 0.26064736) *
## 27) age>=32.5 2065 1008 >50K (0.48813559 0.51186441)
## 54) education.num< 9.5 950 400 <=50K (0.57894737 0.42105263) *
## 55) education.num>=9.5 1115 458 >50K (0.41076233 0.58923767) *
## 7) education.num>=12.5 2695 791 >50K (0.29350649 0.70649351) *
# Plot the trees
rpart.plot(rparttree)
Marital status is the most important feature for this data set.
plotcp(rparttree)
II.1. C5.0 with boosting
library(C50)
## Warning: package 'C50' was built under R version 3.6.3
set.seed(123)
treeC5 <- C5.0(income ~ .,
data = training,
trials = 100)
treeC5_Pred <- predict(treeC5, validation)
confusionMatrix(treeC5_Pred, validation$income)
## Confusion Matrix and Statistics
##
## Reference
## Prediction <=50K >50K
## <=50K 5462 870
## >50K 504 1146
##
## Accuracy : 0.8279
## 95% CI : (0.8194, 0.8361)
## No Information Rate : 0.7474
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5149
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9155
## Specificity : 0.5685
## Pos Pred Value : 0.8626
## Neg Pred Value : 0.6945
## Prevalence : 0.7474
## Detection Rate : 0.6843
## Detection Prevalence : 0.7933
## Balanced Accuracy : 0.7420
##
## 'Positive' Class : <=50K
##
II.2 Randomm forest
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(123)
treeRf <- randomForest(income ~ .,
data = training,
ntree = 500,
mtry = 3,
importance = TRUE)
treeRf_Pred <- predict(treeRf, validation)
confusionMatrix(treeRf_Pred, validation$income)
## Confusion Matrix and Statistics
##
## Reference
## Prediction <=50K >50K
## <=50K 5374 842
## >50K 592 1174
##
## Accuracy : 0.8203
## 95% CI : (0.8117, 0.8287)
## No Information Rate : 0.7474
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5038
##
## Mcnemar's Test P-Value : 4.851e-11
##
## Sensitivity : 0.9008
## Specificity : 0.5823
## Pos Pred Value : 0.8645
## Neg Pred Value : 0.6648
## Prevalence : 0.7474
## Detection Rate : 0.6733
## Detection Prevalence : 0.7788
## Balanced Accuracy : 0.7416
##
## 'Positive' Class : <=50K
##
NAIVE BAYES
library(e1071)
## Warning: package 'e1071' was built under R version 3.6.3
set.seed(123)
NB <- naiveBayes(income ~., data = training)
# that one is faster than the package caret
NB_pred <- predict(NB, validation, type="class")
confusionMatrix(NB_pred,validation$income)
## Confusion Matrix and Statistics
##
## Reference
## Prediction <=50K >50K
## <=50K 5237 735
## >50K 729 1281
##
## Accuracy : 0.8166
## 95% CI : (0.8079, 0.825)
## No Information Rate : 0.7474
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5137
##
## Mcnemar's Test P-Value : 0.896
##
## Sensitivity : 0.8778
## Specificity : 0.6354
## Pos Pred Value : 0.8769
## Neg Pred Value : 0.6373
## Prevalence : 0.7474
## Detection Rate : 0.6561
## Detection Prevalence : 0.7482
## Balanced Accuracy : 0.7566
##
## 'Positive' Class : <=50K
##
KNN
set.seed(3333)
knn_fit <- train(income ~.,
data = training,
method = "knn",
preProcess = c("center", "scale"),tuneLength = 5)
knn_fit
## k-Nearest Neighbors
##
## 18626 samples
## 9 predictor
## 2 classes: '<=50K', '>50K'
##
## Pre-processing: centered (24), scaled (24)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 18626, 18626, 18626, 18626, 18626, 18626, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.7842959 0.4214521
## 7 0.7923503 0.4385040
## 9 0.7976866 0.4504210
## 11 0.8014893 0.4593102
## 13 0.8046032 0.4663794
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 13.
KNN_pred <- predict(knn_fit, validation, type="raw")
confusionMatrix(KNN_pred,validation$income)
## Confusion Matrix and Statistics
##
## Reference
## Prediction <=50K >50K
## <=50K 5323 855
## >50K 643 1161
##
## Accuracy : 0.8123
## 95% CI : (0.8036, 0.8208)
## No Information Rate : 0.7474
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.485
##
## Mcnemar's Test P-Value : 4.991e-08
##
## Sensitivity : 0.8922
## Specificity : 0.5759
## Pos Pred Value : 0.8616
## Neg Pred Value : 0.6436
## Prevalence : 0.7474
## Detection Rate : 0.6669
## Detection Prevalence : 0.7740
## Balanced Accuracy : 0.7341
##
## 'Positive' Class : <=50K
##