FOr this exercise we We want to know how to predict salary using dataset extracted by Barry Becker from the US 1994 Census database. What we will do is to determine whether a person makes over 50K a year using three different supervised learning models namely: - Naive Bayes - Decision Tree - Random Forest Supervised learning is chosen as we want to teach or train the machine using data that is well-labelled. Then the machine will need to test it on a new set of examples (unseen data) so that the algorithm derived from the train data can be used to produce a potentially correct outcome from labelled data.
The dataset itself was downloaded from: https://www.kaggle.com/datasets/ayessa/salary-prediction-classification, which adapated the dataset from: https://archive.ics.uci.edu/ml/datasets/Census+Income.
library(dplyr) # for data wrangling
library(GGally) # to visualize data
library(caret) # to pre-process data
library(e1071) # for making naive bayes model
library(ROCR) # for making ROC curve
library(partykit) # for making decision tree model visualization
library(randomForest) # for making random forest modeldf_salary <- read.csv("data_input/salary.csv", stringsAsFactors = T)
glimpse(df_salary)#> Rows: 32,561
#> Columns: 15
#> $ age <int> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30, 23, 32,~
#> $ workclass <fct> State-gov, Self-emp-not-inc, Private, Private, Pri~
#> $ fnlwgt <int> 77516, 83311, 215646, 234721, 338409, 284582, 160187, 2~
#> $ education <fct> Bachelors, Bachelors, HS-grad, 11th, Bachelors, M~
#> $ education.num <int> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13, 12, 11,~
#> $ marital.status <fct> Never-married, Married-civ-spouse, Divorced, Marrie~
#> $ occupation <fct> Adm-clerical, Exec-managerial, Handlers-cleaners, H~
#> $ relationship <fct> Not-in-family, Husband, Not-in-family, Husband, Wi~
#> $ race <fct> White, White, White, Black, Black, White, Black,~
#> $ sex <fct> Male, Male, Male, Male, Female, Female, Female, ~
#> $ capital.gain <int> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0, 0, ~
#> $ capital.loss <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ hours.per.week <int> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40, 30, 50,~
#> $ native.country <fct> United-States, United-States, United-States, United~
#> $ salary <fct> <=50K, <=50K, <=50K, <=50K, <=50K, <=50K, <=50K,~
summary(df_salary)#> age workclass fnlwgt
#> Min. :17.00 Private :22696 Min. : 12285
#> 1st Qu.:28.00 Self-emp-not-inc: 2541 1st Qu.: 117827
#> Median :37.00 Local-gov : 2093 Median : 178356
#> Mean :38.58 ? : 1836 Mean : 189778
#> 3rd Qu.:48.00 State-gov : 1298 3rd Qu.: 237051
#> Max. :90.00 Self-emp-inc : 1116 Max. :1484705
#> (Other) : 981
#> education education.num marital.status
#> HS-grad :10501 Min. : 1.00 Divorced : 4443
#> Some-college: 7291 1st Qu.: 9.00 Married-AF-spouse : 23
#> Bachelors : 5355 Median :10.00 Married-civ-spouse :14976
#> Masters : 1723 Mean :10.08 Married-spouse-absent: 418
#> Assoc-voc : 1382 3rd Qu.:12.00 Never-married :10683
#> 11th : 1175 Max. :16.00 Separated : 1025
#> (Other) : 5134 Widowed : 993
#> occupation relationship race
#> Prof-specialty :4140 Husband :13193 Amer-Indian-Eskimo: 311
#> Craft-repair :4099 Not-in-family : 8305 Asian-Pac-Islander: 1039
#> Exec-managerial:4066 Other-relative: 981 Black : 3124
#> Adm-clerical :3770 Own-child : 5068 Other : 271
#> Sales :3650 Unmarried : 3446 White :27816
#> Other-service :3295 Wife : 1568
#> (Other) :9541
#> sex capital.gain capital.loss hours.per.week
#> Female:10771 Min. : 0 Min. : 0.0 Min. : 1.00
#> Male :21790 1st Qu.: 0 1st Qu.: 0.0 1st Qu.:40.00
#> Median : 0 Median : 0.0 Median :40.00
#> Mean : 1078 Mean : 87.3 Mean :40.44
#> 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.:45.00
#> Max. :99999 Max. :4356.0 Max. :99.00
#>
#> native.country salary
#> United-States:29170 <=50K:24720
#> Mexico : 643 >50K : 7841
#> ? : 583
#> Philippines : 198
#> Germany : 137
#> Canada : 121
#> (Other) : 1709
From the above, we can conclude that the dataset contains 32,561 with 15 different columns with the following descriptions:
age: continuous. The age of an individual observation.
workclass: A general term to represent the employment status of an individual. Workclass is divided into Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.this is the number of people that census believes the entry represents
education: The highest level of education achieved by an individual. Education is divided into Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous. The highest level of education achieved in numerical form.
marital-status: Marital status of an individual classified into Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. Note that Married.civ.spouse corresponds to a civilian spouse while Married.AF.spouse is a spouse in the Armed Forces.
occupation: the general type of occupation of an individual classified into Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Represents what this individual is relative to others and classified into Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: Descriptions of an individual’s race and classified into White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: the biological sex of the individual, namely Female, Male.
capital-gain: continuous.Capital gains for an individual
capital-loss: continuous. Capital loss for an individual - hours-per-week: continuous. The hours an individual has reported to work per week
native-country: country of origin for an individual, namely United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
salary: our target variable, whether or not an individual makes more than $50,000 annually (<=50K or >50K)
Before we proceed, we need to see whether there is any missing value.
anyNA(df_salary)#> [1] FALSE
colSums(is.na(df_salary))#> age workclass fnlwgt education education.num
#> 0 0 0 0 0
#> marital.status occupation relationship race sex
#> 0 0 0 0 0
#> capital.gain capital.loss hours.per.week native.country salary
#> 0 0 0 0 0
Although it looks as if the data is free of missing values, we noted while reading the dataset that there are rows that were filled with “?”. Hence, we should treat this data as missing values. Let’s reload the data and classify this as NA.
df_salary <- df_salary %>%
mutate(across(c(2,7,14),~ gsub("[[:punct:]]", "", .))) %>%
mutate_at(vars(workclass, education, marital.status, occupation, relationship, race, sex, native.country, salary), as.factor) %>%
mutate_at(vars(age, fnlwgt, education.num, capital.gain, capital.loss, hours.per.week), as.integer)
summary(df_salary)#> age workclass fnlwgt education
#> Min. :17.00 Private :22696 Min. : 12285 HS-grad :10501
#> 1st Qu.:28.00 Selfempnotinc: 2541 1st Qu.: 117827 Some-college: 7291
#> Median :37.00 Localgov : 2093 Median : 178356 Bachelors : 5355
#> Mean :38.58 : 1836 Mean : 189778 Masters : 1723
#> 3rd Qu.:48.00 Stategov : 1298 3rd Qu.: 237051 Assoc-voc : 1382
#> Max. :90.00 Selfempinc : 1116 Max. :1484705 11th : 1175
#> (Other) : 981 (Other) : 5134
#> education.num marital.status occupation
#> Min. : 1.00 Divorced : 4443 Profspecialty :4140
#> 1st Qu.: 9.00 Married-AF-spouse : 23 Craftrepair :4099
#> Median :10.00 Married-civ-spouse :14976 Execmanagerial:4066
#> Mean :10.08 Married-spouse-absent: 418 Admclerical :3770
#> 3rd Qu.:12.00 Never-married :10683 Sales :3650
#> Max. :16.00 Separated : 1025 Otherservice :3295
#> Widowed : 993 (Other) :9541
#> relationship race sex
#> Husband :13193 Amer-Indian-Eskimo: 311 Female:10771
#> Not-in-family : 8305 Asian-Pac-Islander: 1039 Male :21790
#> Other-relative: 981 Black : 3124
#> Own-child : 5068 Other : 271
#> Unmarried : 3446 White :27816
#> Wife : 1568
#>
#> capital.gain capital.loss hours.per.week native.country
#> Min. : 0 Min. : 0.0 Min. : 1.00 UnitedStates:29170
#> 1st Qu.: 0 1st Qu.: 0.0 1st Qu.:40.00 Mexico : 643
#> Median : 0 Median : 0.0 Median :40.00 : 583
#> Mean : 1078 Mean : 87.3 Mean :40.44 Philippines : 198
#> 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.:45.00 Germany : 137
#> Max. :99999 Max. :4356.0 Max. :99.00 Canada : 121
#> (Other) : 1709
#> salary
#> <=50K:24720
#> >50K : 7841
#>
#>
#>
#>
#>
Apparently “?” is now becoming a blank space. We need to convert these blank spaces into NA.
df_salary[df_salary == "" | df_salary == " "] <- NA
summary(df_salary)#> age workclass fnlwgt education
#> Min. :17.00 Private :22696 Min. : 12285 HS-grad :10501
#> 1st Qu.:28.00 Selfempnotinc: 2541 1st Qu.: 117827 Some-college: 7291
#> Median :37.00 Localgov : 2093 Median : 178356 Bachelors : 5355
#> Mean :38.58 Stategov : 1298 Mean : 189778 Masters : 1723
#> 3rd Qu.:48.00 Selfempinc : 1116 3rd Qu.: 237051 Assoc-voc : 1382
#> Max. :90.00 (Other) : 981 Max. :1484705 11th : 1175
#> NA's : 1836 (Other) : 5134
#> education.num marital.status occupation
#> Min. : 1.00 Divorced : 4443 Profspecialty : 4140
#> 1st Qu.: 9.00 Married-AF-spouse : 23 Craftrepair : 4099
#> Median :10.00 Married-civ-spouse :14976 Execmanagerial: 4066
#> Mean :10.08 Married-spouse-absent: 418 Admclerical : 3770
#> 3rd Qu.:12.00 Never-married :10683 Sales : 3650
#> Max. :16.00 Separated : 1025 (Other) :10993
#> Widowed : 993 NA's : 1843
#> relationship race sex
#> Husband :13193 Amer-Indian-Eskimo: 311 Female:10771
#> Not-in-family : 8305 Asian-Pac-Islander: 1039 Male :21790
#> Other-relative: 981 Black : 3124
#> Own-child : 5068 Other : 271
#> Unmarried : 3446 White :27816
#> Wife : 1568
#>
#> capital.gain capital.loss hours.per.week native.country
#> Min. : 0 Min. : 0.0 Min. : 1.00 UnitedStates:29170
#> 1st Qu.: 0 1st Qu.: 0.0 1st Qu.:40.00 Mexico : 643
#> Median : 0 Median : 0.0 Median :40.00 Philippines : 198
#> Mean : 1078 Mean : 87.3 Mean :40.44 Germany : 137
#> 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.:45.00 Canada : 121
#> Max. :99999 Max. :4356.0 Max. :99.00 (Other) : 1709
#> NA's : 583
#> salary
#> <=50K:24720
#> >50K : 7841
#>
#>
#>
#>
#>
We have succeeded in replacing them to NAs. We need to confirm it
using anyNA parameter.
anyNA(df_salary)#> [1] TRUE
colSums(is.na(df_salary))#> age workclass fnlwgt education education.num
#> 0 1836 0 0 0
#> marital.status occupation relationship race sex
#> 0 1843 0 0 0
#> capital.gain capital.loss hours.per.week native.country salary
#> 0 0 0 583 0
Good. let’s drop the NAs.
df_salary_omitna <- na.omit(df_salary)
anyNA(df_salary_omitna)#> [1] FALSE
colSums(is.na(df_salary_omitna))#> age workclass fnlwgt education education.num
#> 0 0 0 0 0
#> marital.status occupation relationship race sex
#> 0 0 0 0 0
#> capital.gain capital.loss hours.per.week native.country salary
#> 0 0 0 0 0
We also want to check which variables that have outliers.
summary(df_salary_omitna)#> age workclass fnlwgt education
#> Min. :17.00 Private :22286 Min. : 13769 HS-grad :9840
#> 1st Qu.:28.00 Selfempnotinc: 2499 1st Qu.: 117627 Some-college:6678
#> Median :37.00 Localgov : 2067 Median : 178425 Bachelors :5044
#> Mean :38.44 Stategov : 1279 Mean : 189794 Masters :1627
#> 3rd Qu.:47.00 Selfempinc : 1074 3rd Qu.: 237629 Assoc-voc :1307
#> Max. :90.00 Federalgov : 943 Max. :1484705 11th :1048
#> (Other) : 14 (Other) :4618
#> education.num marital.status occupation
#> Min. : 1.00 Divorced : 4214 Profspecialty :4038
#> 1st Qu.: 9.00 Married-AF-spouse : 21 Craftrepair :4030
#> Median :10.00 Married-civ-spouse :14065 Execmanagerial:3992
#> Mean :10.12 Married-spouse-absent: 370 Admclerical :3721
#> 3rd Qu.:13.00 Never-married : 9726 Sales :3584
#> Max. :16.00 Separated : 939 Otherservice :3212
#> Widowed : 827 (Other) :7585
#> relationship race sex
#> Husband :12463 Amer-Indian-Eskimo: 286 Female: 9782
#> Not-in-family : 7726 Asian-Pac-Islander: 895 Male :20380
#> Other-relative: 889 Black : 2817
#> Own-child : 4466 Other : 231
#> Unmarried : 3212 White :25933
#> Wife : 1406
#>
#> capital.gain capital.loss hours.per.week native.country
#> Min. : 0 Min. : 0.00 Min. : 1.00 UnitedStates:27504
#> 1st Qu.: 0 1st Qu.: 0.00 1st Qu.:40.00 Mexico : 610
#> Median : 0 Median : 0.00 Median :40.00 Philippines : 188
#> Mean : 1092 Mean : 88.37 Mean :40.93 Germany : 128
#> 3rd Qu.: 0 3rd Qu.: 0.00 3rd Qu.:45.00 PuertoRico : 109
#> Max. :99999 Max. :4356.00 Max. :99.00 Canada : 107
#> (Other) : 1516
#> salary
#> <=50K:22654
#> >50K : 7508
#>
#>
#>
#>
#>
Based on the summary, fnlwgt tends to have many
outliers, while capital.gain and capital.loss
have too many data that have 0 values. To make our lives easier, we will
drop these data from our model to avoid bias.
df_salary_clean <- df_salary_omitna[-c(3,11:12)]
summary(df_salary_clean)#> age workclass education education.num
#> Min. :17.00 Private :22286 HS-grad :9840 Min. : 1.00
#> 1st Qu.:28.00 Selfempnotinc: 2499 Some-college:6678 1st Qu.: 9.00
#> Median :37.00 Localgov : 2067 Bachelors :5044 Median :10.00
#> Mean :38.44 Stategov : 1279 Masters :1627 Mean :10.12
#> 3rd Qu.:47.00 Selfempinc : 1074 Assoc-voc :1307 3rd Qu.:13.00
#> Max. :90.00 Federalgov : 943 11th :1048 Max. :16.00
#> (Other) : 14 (Other) :4618
#> marital.status occupation relationship
#> Divorced : 4214 Profspecialty :4038 Husband :12463
#> Married-AF-spouse : 21 Craftrepair :4030 Not-in-family : 7726
#> Married-civ-spouse :14065 Execmanagerial:3992 Other-relative: 889
#> Married-spouse-absent: 370 Admclerical :3721 Own-child : 4466
#> Never-married : 9726 Sales :3584 Unmarried : 3212
#> Separated : 939 Otherservice :3212 Wife : 1406
#> Widowed : 827 (Other) :7585
#> race sex hours.per.week
#> Amer-Indian-Eskimo: 286 Female: 9782 Min. : 1.00
#> Asian-Pac-Islander: 895 Male :20380 1st Qu.:40.00
#> Black : 2817 Median :40.00
#> Other : 231 Mean :40.93
#> White :25933 3rd Qu.:45.00
#> Max. :99.00
#>
#> native.country salary
#> UnitedStates:27504 <=50K:22654
#> Mexico : 610 >50K : 7508
#> Philippines : 188
#> Germany : 128
#> PuertoRico : 109
#> Canada : 107
#> (Other) : 1516
Now the data looks neater. Let’s proceed to Exploratory Data Analysis or EDA.
We want to do Exploratory Data Analysis (EDA) first to do an initial
investigation of the data, particularly to see if there are any
patterns/correlations that we need to focus on. We will do so using
three graphical tools, ggcorr, boxplot, and
qplot.
Let’s use a heatmap to see the correlation among the predictors.
ggcorr(df_salary_clean, label = T, hjust = 1, layout.exp = 3)The numerical variables seem to have low correlation among each other. This is a good sign particularly since we will be doing a Naive Bayes method which assumes that predictor variables should be independent.
Now, we can deep dive and see the correlation among predictors using boxplot and qplot and gain some insights.
boxplot(age ~ salary, data = df_salary_clean, main = "Age distribution based on salary", xlab = "Salary", ylab = "Age", col = "green")Based on the above boxplot, it somewhat confirms that age correlates with salary with people from age ranging from 30 - 50ish earning more salary with >50K per year. Meanwhile, those below 30 seems to earn less (<=50K). This assumption makes sense since age equates with experience. People of that age range have been working for more than 10+years, making them eligible for higher salary.
boxplot(education.num ~ salary, data = df_salary_clean, main = "Years of Education vs salary", xlab = "Salary", ylab = "Education Years", col = "blue")This confirms that the longer you study, the better the salary. But, what about the educational background?
qplot(salary, data = df_salary_clean, fill = education) + facet_grid (. ~ education)If we look at the qplot, there seems to be a little correlation between education and salary. Our assumption is the higher the education the higher the salary. This is particularly true for masters, doctorate, and prof-school where the people earning >50K outweighs those earning <=50K. However, the number of respondents for these level of education are less than those earning bachelor degree and thus, not really representative for our model.
qplot(salary, data = df_salary_clean, fill = workclass) + facet_grid (. ~ workclass)Based on the above qplot, workclass and salary seem to have no correlation because in each workclass majority earn >=50K per year (with the exception of self-employed). Additionally, the majority of the observations are coming from the private sector. Hence, we should not be including this variable as the results may be biased.
qplot(salary, data = df_salary_clean, fill = sex) + facet_grid (. ~ marital.status)Based on the above qplot, it’s interesting that if you were male married with civilian, you’ll likely to earn higher than if you were a married female with civilian. But, if you were either a female/male who were never married, you’ll tend to earn less. This shows how marital status of a particular sex or gender matters a lot.
boxplot(hours.per.week ~ salary, data = df_salary_clean, main = "Hours per week correlation with salary", xlab = "Salary", ylab = "Working Hour Per Week", col = "orange")This seems to confirm people’s saying that the more hours you work, the more you earn. Based on the above boxplot, people working more than 40 hours per week earn more than those who work less hours than that.
qplot(salary, data = df_salary_clean, fill = sex) + facet_grid (. ~ occupation)Apparently if you’re working in the exec-managerial level and prof-specialty, you have a higher tendency to earn more compared to other types of occupation. Again, sex seems to also matter because majority of the people at these levels are male. It’s also interesting that female dominated certain occupation such as admin-clerical, while male dominated craft-repair occupation.
qplot(salary, data = df_salary_clean, fill = relationship) + facet_grid (. ~ marital.status)Based on the above, if you were a male and married which is defined as “husband”, you will probably earn higher than if you were a female and married or “wife”. The above shows how marital status play a major role in defining your salary. Due to relationship status and marital status overlap with each other, we will use marital status instead as it lumps husband and wife as “married”.
qplot(salary, data = df_salary_clean, fill = race) + facet_grid (. ~ sex)The above qplot also signifies that if you were a white male, you’ll tend to earn higher than the rest of us. White male supremacy, yes? Again, race alone cannot be interpreted as it is dependent on sex. Another could be that the respondents are mainly whites so if we include race in the model, it would be biased.
qplot(salary, data = df_salary_clean, fill = race) + facet_grid (. ~ native.country)Majority of respondents are white Americans which isnt surprising. This shows that native country has no correlation with salary. Even then, whites are majority of the respondents even if the respondents are Mexican. So, we should not be including native country as one of our predictors to avoid bias.
Based on our EDA above, we will only be using the following variables as our predictors:
numerical variables:
agehours.per.weekeducation.numcategorical variables:
sexoccupationmarital.statusSplitting train test with proportion of 80%:20%
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- sample(nrow(df_salary_clean), nrow(df_salary_clean)*0.8)
df_train <- df_salary_clean[index,]
df_test <- df_salary_clean[-index,]Check target class proportion from the df_train.
prop.table(table(df_train$salary))#>
#> <=50K >50K
#> 0.7524141 0.2475859
The class is a little imbalanced. Let’s do upSampling.
df_train_up <- upSample(x = df_train %>%
select(-salary), y = df_train$salary,
yname= "salary")
prop.table(table(df_train_up$salary))#>
#> <=50K >50K
#> 0.5 0.5
prop.table(table(df_train$salary))#>
#> <=50K >50K
#> 0.7524141 0.2475859
Nice. Our training dataset target class looks balance now.
source: https://www.upgrad.com/blog/naive-bayes-classifier/
What is Naive Bayes Classifier?
It is a classifier method from supervised machine learning that separates data into different classes according to the Bayes’ Theorem with the assumptions that: - all predictors are independent of one another, even if it’s not entirely true, let’s assume that way; - the target is dependent on the predictors variable.
What are the pros and cons of Naive Bayes Classifier?
Pros:
works very fast and therefore can easily predict the class of a test data on a real-time basis, -
can be used to solve multi-class prediction problems,
performs better than other models with less training data if the assumption of independence of all predictors were correctly followed through, and
performs great for categorical input variables compared to numerical variables.
Cons:
prone to “Zero Frequency” or “Skewness Due To Scarcity”
phenomenon, where if the dataset has a categorical variable but was not
present in the training data set, the Naive Bayes model will assign it
to zero probability and therefore not usable for prediction purpose. To
solve this problem, we’ll have to use a smoothing technique called
laplace.
it assumes that predictors are independent of one another, which in reality, most variables are dependent on one another.
To check which model is the best fit for Naive Bayes, whether it’s the model with class imbalance or the model which we have tuned using upSample, we will train two models:
# train
model_naive <- naiveBayes(formula = salary ~ age + hours.per.week + education.num + sex + occupation + marital.status,
data = df_train,
laplace = 1)
model_naive#>
#> Naive Bayes Classifier for Discrete Predictors
#>
#> Call:
#> naiveBayes.default(x = X, y = Y, laplace = laplace)
#>
#> A-priori probabilities:
#> Y
#> <=50K >50K
#> 0.7524141 0.2475859
#>
#> Conditional probabilities:
#> age
#> Y [,1] [,2]
#> <=50K 36.66252 13.50766
#> >50K 43.90626 10.26100
#>
#> hours.per.week
#> Y [,1] [,2]
#> <=50K 39.31611 11.94122
#> >50K 45.65082 10.66788
#>
#> education.num
#> Y [,1] [,2]
#> <=50K 9.633765 2.407389
#> >50K 11.604118 2.366642
#>
#> sex
#> Y Female Male
#> <=50K 0.3846450 0.6153550
#> >50K 0.1480924 0.8519076
#>
#> occupation
#> Y Admclerical ArmedForces Craftrepair
#> <=50K 0.00005503577 0.14232250963 0.00044028619 0.13858007705
#> >50K 0.00016697278 0.06628819502 0.00033394557 0.11888462181
#> occupation
#> Y Execmanagerial Farmingfishing Handlerscleaners Machineopinspct
#> <=50K 0.09108420473 0.03775454045 0.05575123830 0.07743533297
#> >50K 0.25663716814 0.01502755051 0.01118717649 0.03306061112
#> occupation
#> Y Otherservice Privhouseserv Profspecialty Protectiveserv
#> <=50K 0.13439735828 0.00720968630 0.09796367639 0.01970280682
#> >50K 0.01903489731 0.00033394557 0.24211053598 0.02938720988
#> occupation
#> Y Sales Techsupport Transportmoving
#> <=50K 0.11491469455 0.02828838745 0.05410016511
#> >50K 0.12906996160 0.03723493071 0.04124227751
#>
#> marital.status
#> Y Divorced Married-AF-spouse Married-civ-spouse
#> <=50K 0.1649047462 0.0005506002 0.3369122343
#> >50K 0.0590202307 0.0011703728 0.8523658251
#> marital.status
#> Y Married-spouse-absent Never-married Separated Widowed
#> <=50K 0.0150864442 0.4101420548 0.0386521308 0.0337517895
#> >50K 0.0040127069 0.0632001338 0.0088613944 0.0113693362
model_naive_up <- naiveBayes(formula = salary ~ age + hours.per.week + education.num + sex + occupation + marital.status,
data = df_train_up,
laplace = 1)
model_naive_up#>
#> Naive Bayes Classifier for Discrete Predictors
#>
#> Call:
#> naiveBayes.default(x = X, y = Y, laplace = laplace)
#>
#> A-priori probabilities:
#> Y
#> <=50K >50K
#> 0.5 0.5
#>
#> Conditional probabilities:
#> age
#> Y [,1] [,2]
#> <=50K 36.66252 13.50766
#> >50K 43.85789 10.23859
#>
#> hours.per.week
#> Y [,1] [,2]
#> <=50K 39.31611 11.94122
#> >50K 45.60474 10.73596
#>
#> education.num
#> Y [,1] [,2]
#> <=50K 9.633765 2.407389
#> >50K 11.632939 2.361603
#>
#> sex
#> Y Female Male
#> <=50K 0.3846450 0.6153550
#> >50K 0.1510712 0.8489288
#>
#> occupation
#> Y Admclerical ArmedForces Craftrepair
#> <=50K 0.00005503577 0.14232250963 0.00044028619 0.13858007705
#> >50K 0.00005503577 0.06499724821 0.00027517887 0.11997798569
#> occupation
#> Y Execmanagerial Farmingfishing Handlerscleaners Machineopinspct
#> <=50K 0.09108420473 0.03775454045 0.05575123830 0.07743533297
#> >50K 0.25679691800 0.01447440837 0.01062190424 0.03417721519
#> occupation
#> Y Otherservice Privhouseserv Profspecialty Protectiveserv
#> <=50K 0.13439735828 0.00720968630 0.09796367639 0.01970280682
#> >50K 0.01926252064 0.00011007155 0.24589983489 0.02894881673
#> occupation
#> Y Sales Techsupport Transportmoving
#> <=50K 0.11491469455 0.02828838745 0.05410016511
#> >50K 0.12680242157 0.03742432581 0.04017611447
#>
#> marital.status
#> Y Divorced Married-AF-spouse Married-civ-spouse
#> <=50K 0.1649047462 0.0005506002 0.3369122343
#> >50K 0.0613919172 0.0008809602 0.8478141174
#> marital.status
#> Y Married-spouse-absent Never-married Separated Widowed
#> <=50K 0.0150864442 0.4101420548 0.0386521308 0.0337517895
#> >50K 0.0036890210 0.0661821385 0.0084241824 0.0116176633
Let’s predict class from the df_test with the
function of predict() using the two models:
# predict
df_test$pred <- predict(model_naive, newdata = df_test, type = "class")
df_test$pred_up <- predict(model_naive_up, newdata = df_test, type = "class")
head(df_test)Evaluate the models with confusion matrix:
# confusion matrix without upSampling
confusionMatrix(data = df_test$pred,
reference = df_test$salary)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction <=50K >50K
#> <=50K 3920 520
#> >50K 579 1014
#>
#> Accuracy : 0.8178
#> 95% CI : (0.8079, 0.8275)
#> No Information Rate : 0.7457
#> P-Value [Acc > NIR] : < 0.0000000000000002
#>
#> Kappa : 0.5257
#>
#> Mcnemar's Test P-Value : 0.08019
#>
#> Sensitivity : 0.8713
#> Specificity : 0.6610
#> Pos Pred Value : 0.8829
#> Neg Pred Value : 0.6365
#> Prevalence : 0.7457
#> Detection Rate : 0.6498
#> Detection Prevalence : 0.7360
#> Balanced Accuracy : 0.7662
#>
#> 'Positive' Class : <=50K
#>
Without upSampling, we’re getting the following numbers:
NB_withoutup <- cbind.data.frame(
accuracy = 0.8178,
recall = 0.8713,
specificity = 0.6610,
precision = 0.8829)
NB_withoutup# confusion matrix with upSampling
confusionMatrix(data = df_test$pred_up,
reference = df_test$salary)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction <=50K >50K
#> <=50K 3363 232
#> >50K 1136 1302
#>
#> Accuracy : 0.7732
#> 95% CI : (0.7625, 0.7838)
#> No Information Rate : 0.7457
#> P-Value [Acc > NIR] : 0.0000003669
#>
#> Kappa : 0.4993
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.7475
#> Specificity : 0.8488
#> Pos Pred Value : 0.9355
#> Neg Pred Value : 0.5340
#> Prevalence : 0.7457
#> Detection Rate : 0.5574
#> Detection Prevalence : 0.5959
#> Balanced Accuracy : 0.7981
#>
#> 'Positive' Class : <=50K
#>
With upSampling, we’re getting the following numbers:
NB_withup <- cbind.data.frame(
accuracy = 0.7732,
recall = 0.7475,
specificity = 0.8488,
precision = 0.9355)
NB_withupLet’s compare the two models:
rbind("Naive Bayes without tuning" = NB_withoutup, "Naive Bayes with tuning" = NB_withup)In terms of accuracy (how capable our model can predict the Y target) and recall (how capable our model can predict correctly from all the actual data that are actually positive), the Naive Bayes model without upSampling performs better than the model with upSampling.
Meanwhile, the Naive Bayes model with upSampling perform exceptionally well when it comes to specificity (how capable our model can correctly predict the negative class from the actual data that are actually negative) and precision (how capable our model can correctly predict the positive class from the total predicted positive class) compared to the original Naive Bayes model.
Based on the comparison, we will use the Naive Bayes model with upSampling because it performs equally well (>70%) across the board.
Subsequently, we will look into the Receiver Operating Characteristics (ROC) curve and the Area Under the Curve (AUC) to evaluate the chosen model.
The ROC plots two parameters, namely the proportion of true positive rate (TPR) against the proportion of a false positive rate (FPR) at different classification thresholds. A good ROC curve looks like an upside down L which tells you that the closer the bended curve reaches the upper left side of the plot (TPR high, FPR low), the better the model is.
The AUC basically measures the whole area under the ROC curve and tells you how much the selected model is capable of distinguishing between the target classes. A higher AUC score means the better the model distingushes the target classes.
First we need to prepare the prediction result in probability form
from the df_test, and save it as an object with the name of
df_pred_prob:
df_pred_prob <- predict(model_naive_up, newdata = df_test, type = "raw")For the ROC, we want to save it in an object called
df_salary_roc:
df_salary_roc <- data.frame(pred = df_pred_prob[,2], label = df_test$salary)salary_roc <- prediction(predictions = df_salary_roc$pred,
labels = df_salary_roc$label)
# ROC curve
plot(performance(prediction.obj = salary_roc,
measure = "tpr",
x.measure = "fpr"))As expected, the ROC looks like an upside down L. That means our selected model is good.
Let’s find the value of AUC:
salary_auc <- performance(prediction.obj = salary_roc,
measure = "auc")
str(salary_auc)#> Formal class 'performance' [package "ROCR"] with 6 slots
#> ..@ x.name : chr "None"
#> ..@ y.name : chr "Area under the ROC curve"
#> ..@ alpha.name : chr "none"
#> ..@ x.values : list()
#> ..@ y.values :List of 1
#> .. ..$ : num 0.879
#> ..@ alpha.values: list()
salary_auc@y.values#> [[1]]
#> [1] 0.8789886
We can conclude that our miss-classification rate is 12,1%. That means Our model can differentiate between the above average salary (positive class) and the below average salary (negative class) at a classification rate of 87,9%.
source: https://www.upgrad.com/blog/pros-and-cons-of-decision-tree-regression-in-machine-learning/
What is Decision Tree classifier?
Decision Tree is also part of the supervised machine learning which tries to solve the classification problem by transforming the data into a tree representation. It consists of:
Root Node: The first branch of the tree to set the target value, or known as the main predictor.
Interior Node: The next branch(es) which uses other predictors if the root node is not enough in determining the target value.
Terminal/Leaf Node: This is the predicted target class.
How it works is by performing a series of questions on the dataset. By asking these true/false questions, the model is able to narrow down the possible values and make a prediction. The order and content of the question are decided by the model itself.
What are the pros and cons of Decision Tree Classifier?
Pros:
can be used for classification and regression problems
easy to interpret, understand, and visualize
data pre-processing require less effort and no need to normalize/scaling the data
very quick in identifying relationships between variables with the most significant variables
can handle both numerical and categorical variables, and not heavily influenced by outliers
Cons:
prone to overfitting. It continues to develop hypotheses that reduce the training set error but at the cost of increasing the test set error.
Can be solved by pruning
does not perform well with continuous numerical variables
quite instable if there is a small change in the data since it can cause a big difference in the tree structure
in terms of running it in R, it takes longer time to train the model compared to Naive Bayes, particularly if the complexity levels of the dataset are greater.
Since our observations are more than > 20,000, we will set the decision tree with a minimum observation in each node with 4000 observations so that the machine can plot the decision trees more easily. We will also use the df_train from earlier instead of df_train_up as decision tree is not heavily influenced by outliers.
model_ctree <- ctree(formula = salary~ age + hours.per.week + education.num + sex + occupation + marital.status,
data = df_train,
control = ctree_control(mincriterion = 0.07,
minsplit = 4000,
minbucket = 7))
plot(model_ctree, type = "simple")Let’s predict probability from the df_test with the
function of predict() using the decision tree model:
dftest_new <- predict(object = model_ctree,
newdata = df_test,
type = "response")
# confusion matrix data test
confusionMatrix(data = dftest_new,
reference = df_test$salary)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction <=50K >50K
#> <=50K 4275 903
#> >50K 224 631
#>
#> Accuracy : 0.8132
#> 95% CI : (0.8031, 0.823)
#> No Information Rate : 0.7457
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.4233
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.9502
#> Specificity : 0.4113
#> Pos Pred Value : 0.8256
#> Neg Pred Value : 0.7380
#> Prevalence : 0.7457
#> Detection Rate : 0.7086
#> Detection Prevalence : 0.8583
#> Balanced Accuracy : 0.6808
#>
#> 'Positive' Class : <=50K
#>
For our data test, we’re getting the following results:
DT_datatest <- cbind.data.frame(
accuracy = 0.7898,
recall = 0.8092,
specificity = 0.7286,
precision = 0.9040)
DT_datatestdftrain_new <-predict(object = model_ctree,
newdata = df_train_up,
type = "response")
# confusion matrix data train
confusionMatrix(dftrain_new, reference = df_train_up$salary)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction <=50K >50K
#> <=50K 17251 10799
#> >50K 904 7356
#>
#> Accuracy : 0.6777
#> 95% CI : (0.6729, 0.6825)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.3554
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.9502
#> Specificity : 0.4052
#> Pos Pred Value : 0.6150
#> Neg Pred Value : 0.8906
#> Prevalence : 0.5000
#> Detection Rate : 0.4751
#> Detection Prevalence : 0.7725
#> Balanced Accuracy : 0.6777
#>
#> 'Positive' Class : <=50K
#>
Meanwhile, for our data train, we’re getting the following results:
DT_datatrain <- cbind.data.frame(
accuracy = 0.6777,
recall = 0.9502,
specificity = 0.4052,
precision = 0.6150)
DT_datatrainLet’s compare the two results:
rbind("Decision Tree performed in Data Test" = DT_datatest, "Decision Tree performed in Data Train" = DT_datatrain)It seems that we do not experience any overfitting since our test data performs better than our train data. That means our model fits just right when it comes to predicting unseen data.
source: https://www.upgrad.com/blog/random-forest-classifier/
What is Random Forest?
Random Forest is method that consists of multiple decision trees, just as the name suggests, forest = many trees. The reason why it’s called Random is because it uses randomness to improve its accuracy and solve overfitting issues found in decision tree model. The algorithm works by making decision trees based on a random selection of data samples and get predictions from every tree. Subsequently, the model selects the best solution through voting.
What are the pros and cons of Random Forest?
Pros:
Cons:
slower to run compared to other classifiers as it uses multiple decision trees to make predictions, can try using random sampling to solve the problem
cannot be used for real-time predictions - cannot be easily interpreted compared to a decision tree
This algorithm is substantially slower than other classification algorithms because it uses multiple decision trees to make predictions. When a random forest classifier makes a prediction, every tree in the forest has to make a prediction for the same input and vote on the same. This process can be very time-consuming. Because of its slow pace, random forest classifiers can be unsuitable for real-time predictions. The model can be quite challenging to interpret in comparison to a decision tree as you can make a selection by following the tree’s path. However, that’s not possible in a random forest as it has multiple decision trees.
As Random Sampling is not prone to overfitting, we may want to use
the earlier dataset that is df_salary_omitna for our model
without dropping any variables. However due to the large size of the
observations, we may want to conduct random sampling first and reduce it
to 10% of the observations from the original dataset or around 3000.
This is to expedite the random forest processing.
df_salary_sample <- sample_n(df_salary_omitna, 3000)
df_salary_sampleBefore we make cross-validation, we will need to split between the
data train and test of df_salary_sample with a proportion
of 80%:20%.
RNGkind(sample.kind = "Rounding")
set.seed(100)
insample <- sample(nrow(df_salary_sample), nrow(df_salary_sample)*0.8)
RF_train <- df_salary_sample[insample,]
RF_test <- df_salary_sample[-insample,]Now let’s check the proportion of the target class of the
RF_train data.
prop.table(table(RF_train$salary))#>
#> <=50K >50K
#> 0.7491667 0.2508333
Now, let’s begin our model fitting.
set.seed(100)
control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model_RF <- train(salary~ ., data = RF_train, method = "rf", trainControl = control)
saveRDS(model_RF, "model_RF.RDS")Now, let’s read our model.
model_RF <- readRDS("model_RF.RDS")
model_RF#> Random Forest
#>
#> 2400 samples
#> 14 predictor
#> 2 classes: ' <=50K', ' >50K'
#>
#> No pre-processing
#> Resampling: Bootstrapped (25 reps)
#> Summary of sample sizes: 2400, 2400, 2400, 2400, 2400, 2400, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.7535454 0.01713643
#> 51 0.8396008 0.55230122
#> 100 0.8321996 0.53734907
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 51.
From the results, we can understand the following:
2400 samples -> the number of rows on our data train used in creating the model
14 predictor -> jthe number of predictor variables in our data train
2 classes -> the number of target class on our data
Summary of sample sizes -> the number of sample size on our data train based on the k-fold cross validation
mtry dan accuracy -> shows the number of mtry used and the number of accuracy based on each mtry.
From the model summary, after doing several trials of
mtry the number of mtry that we can choose is
51, which has the highest accuracy when tested in the test data from the
boostrap sampling.
The boostrap sampling produces unused data when making random forest. These data are called out-of-bag data and are considered as data test by the model. The model will then try to do prediction using those data and calculate the error. The error is called out-of-bag error.
model_RF$finalModel#>
#> Call:
#> randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)), trainControl = ..1)
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 51
#>
#> OOB estimate of error rate: 14.96%
#> Confusion matrix:
#> <=50K >50K class.error
#> <=50K 1657 141 0.07842047
#> >50K 218 384 0.36212625
From the above model, our out-of-bag error rate is 15.54%, which means our model’s accuracy is 84.46%. Based on the model, we can check what are the predictors that highly affect salary of a person.
plot(varImp(model_RF))From the plotting above, similar to the decision tree model,
marital.status of Married-civ-spouse is the most important
predictor for determining one’s salary.
Now we can check for the confusionMatrix.
model_RF_test <- predict(model_RF, newdata = RF_test)
confusionMatrix(as.factor(model_RF_test), RF_test$salary)#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction <=50K >50K
#> <=50K 389 67
#> >50K 26 118
#>
#> Accuracy : 0.845
#> 95% CI : (0.8135, 0.873)
#> No Information Rate : 0.6917
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.6128
#>
#> Mcnemar's Test P-Value : 0.00003357
#>
#> Sensitivity : 0.9373
#> Specificity : 0.6378
#> Pos Pred Value : 0.8531
#> Neg Pred Value : 0.8194
#> Prevalence : 0.6917
#> Detection Rate : 0.6483
#> Detection Prevalence : 0.7600
#> Balanced Accuracy : 0.7876
#>
#> 'Positive' Class : <=50K
#>
For our Random Forest model, we’re getting the following results:
RF <- cbind.data.frame(
accuracy = 0.8567,
recall = 0.9376,
specificity = 0.6159,
precision = 0.8789)
RFNote that the above model uses random sampling dataset to expedite the process of running the model. Hence, the accuracy rate should be considered with a grain of salt.
Let’s compare the three models.
rbind("Naive Bayes"= NB_withup, "Decision Tree"= DT_datatest, "Random Forest" = RF)If we look at the three models, the model that gives better precision and are good across the board is Naive-Bayes. However, again this could be because when we did Random Forest, it was only using random sampling dataset which may cause the lower performance. This warrants for another Random Forest test using all observations with better PC than mine.
However, Decision Tree and Random Forest models do agree on one thing, that marital.status, particularly married-civ-spouse is the predictor that has the highest effect on determining a person’s salary based on the US 1994 Census database. This was confirmed by the most recent AEI’s research as reported in: https://www.cbsnews.com/news/married-men-earn-far-more-than-any-other-american-group/. According to the news, the reason behind this is because men who are married tend to work more hours than their bachelor counterparts because of the “marriage effect” where men with higher motivation to work and support a family.
Another report stated that this is because being legally married, particularly in the US, introduces “a whole host of tangible benefits” both at work and through federal tax breaks. Some of these benefits include additional pension contributions and higher premium for the family’s healthcare. Source: https://www.cnbc.com/2020/01/15/additional-benefits-90-percent-of-us-companies-offer-if-youre-married.html