1 Introduction

For this exercise we want to predict salary using a dataset extracted by Barry Becker from the 1994 US Census database. Specifically, we will determine whether a person makes over 50K a year using three different supervised learning models:

  • Naive Bayes

  • Decision Tree

  • Random Forest

Supervised learning is chosen because we want to train the machine on well-labelled data and then test the resulting model on a new set of examples (unseen data), so that the algorithm derived from the training data can produce correct predictions for data it has not seen before.

The dataset itself was downloaded from: https://www.kaggle.com/datasets/ayessa/salary-prediction-classification, which adapted the dataset from: https://archive.ics.uci.edu/ml/datasets/Census+Income.

2 Import Libraries

library(dplyr) # for data wrangling
library(GGally) # to visualize data
library(caret) # to pre-process data
library(e1071) # for making naive bayes model 
library(ROCR) # for making ROC curve
library(partykit) # for making decision tree model visualization
library(randomForest) # for making random forest model

3 Understanding the Dataset

df_salary <- read.csv("data_input/salary.csv", stringsAsFactors = T)
glimpse(df_salary)
#> Rows: 32,561
#> Columns: 15
#> $ age            <int> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30, 23, 32,~
#> $ workclass      <fct>  State-gov,  Self-emp-not-inc,  Private,  Private,  Pri~
#> $ fnlwgt         <int> 77516, 83311, 215646, 234721, 338409, 284582, 160187, 2~
#> $ education      <fct>  Bachelors,  Bachelors,  HS-grad,  11th,  Bachelors,  M~
#> $ education.num  <int> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13, 12, 11,~
#> $ marital.status <fct>  Never-married,  Married-civ-spouse,  Divorced,  Marrie~
#> $ occupation     <fct>  Adm-clerical,  Exec-managerial,  Handlers-cleaners,  H~
#> $ relationship   <fct>  Not-in-family,  Husband,  Not-in-family,  Husband,  Wi~
#> $ race           <fct>  White,  White,  White,  Black,  Black,  White,  Black,~
#> $ sex            <fct>  Male,  Male,  Male,  Male,  Female,  Female,  Female, ~
#> $ capital.gain   <int> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0, 0, ~
#> $ capital.loss   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ hours.per.week <int> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40, 30, 50,~
#> $ native.country <fct>  United-States,  United-States,  United-States,  United~
#> $ salary         <fct>  <=50K,  <=50K,  <=50K,  <=50K,  <=50K,  <=50K,  <=50K,~
summary(df_salary)
#>       age                    workclass         fnlwgt       
#>  Min.   :17.00    Private         :22696   Min.   :  12285  
#>  1st Qu.:28.00    Self-emp-not-inc: 2541   1st Qu.: 117827  
#>  Median :37.00    Local-gov       : 2093   Median : 178356  
#>  Mean   :38.58    ?               : 1836   Mean   : 189778  
#>  3rd Qu.:48.00    State-gov       : 1298   3rd Qu.: 237051  
#>  Max.   :90.00    Self-emp-inc    : 1116   Max.   :1484705  
#>                  (Other)          :  981                    
#>          education     education.num                  marital.status 
#>   HS-grad     :10501   Min.   : 1.00    Divorced             : 4443  
#>   Some-college: 7291   1st Qu.: 9.00    Married-AF-spouse    :   23  
#>   Bachelors   : 5355   Median :10.00    Married-civ-spouse   :14976  
#>   Masters     : 1723   Mean   :10.08    Married-spouse-absent:  418  
#>   Assoc-voc   : 1382   3rd Qu.:12.00    Never-married        :10683  
#>   11th        : 1175   Max.   :16.00    Separated            : 1025  
#>  (Other)      : 5134                    Widowed              :  993  
#>             occupation            relationship                    race      
#>   Prof-specialty :4140    Husband       :13193    Amer-Indian-Eskimo:  311  
#>   Craft-repair   :4099    Not-in-family : 8305    Asian-Pac-Islander: 1039  
#>   Exec-managerial:4066    Other-relative:  981    Black             : 3124  
#>   Adm-clerical   :3770    Own-child     : 5068    Other             :  271  
#>   Sales          :3650    Unmarried     : 3446    White             :27816  
#>   Other-service  :3295    Wife          : 1568                              
#>  (Other)         :9541                                                      
#>       sex         capital.gain    capital.loss    hours.per.week 
#>   Female:10771   Min.   :    0   Min.   :   0.0   Min.   : 1.00  
#>   Male  :21790   1st Qu.:    0   1st Qu.:   0.0   1st Qu.:40.00  
#>                  Median :    0   Median :   0.0   Median :40.00  
#>                  Mean   : 1078   Mean   :  87.3   Mean   :40.44  
#>                  3rd Qu.:    0   3rd Qu.:   0.0   3rd Qu.:45.00  
#>                  Max.   :99999   Max.   :4356.0   Max.   :99.00  
#>                                                                  
#>         native.country     salary     
#>   United-States:29170    <=50K:24720  
#>   Mexico       :  643    >50K : 7841  
#>   ?            :  583                 
#>   Philippines  :  198                 
#>   Germany      :  137                 
#>   Canada       :  121                 
#>  (Other)       : 1709

From the above, we can conclude that the dataset contains 32,561 observations and 15 columns, with the following descriptions:

  • age: continuous. The age of an individual observation.

  • workclass: A general term to represent the employment status of an individual. Workclass is divided into Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

  • fnlwgt: continuous. The number of people the census believes the entry represents.

  • education: The highest level of education achieved by an individual. Education is divided into Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

  • education-num: continuous. The highest level of education achieved in numerical form.

  • marital-status: Marital status of an individual classified into Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. Note that Married.civ.spouse corresponds to a civilian spouse while Married.AF.spouse is a spouse in the Armed Forces.

  • occupation: the general type of occupation of an individual classified into Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

  • relationship: Represents what this individual is relative to others and classified into Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

  • race: Descriptions of an individual’s race and classified into White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

  • sex: the biological sex of the individual, namely Female, Male.

  • capital-gain: continuous. Capital gains for an individual.

  • capital-loss: continuous. Capital loss for an individual.

  • hours-per-week: continuous. The hours an individual has reported to work per week.

  • native-country: country of origin for an individual, namely United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

  • salary: our target variable, whether or not an individual makes more than $50,000 annually (<=50K or >50K)

4 Data Cleansing

Before we proceed, we need to check whether there are any missing values.

anyNA(df_salary)
#> [1] FALSE
colSums(is.na(df_salary))
#>            age      workclass         fnlwgt      education  education.num 
#>              0              0              0              0              0 
#> marital.status     occupation   relationship           race            sex 
#>              0              0              0              0              0 
#>   capital.gain   capital.loss hours.per.week native.country         salary 
#>              0              0              0              0              0

Although it looks as if the data is free of missing values, we noted when reading the dataset that some rows are filled with “?”. We should treat these values as missing. Let’s clean the affected columns so these values can be recoded as NA.

df_salary <- df_salary %>%
  # strip punctuation (including "?" and hyphens) from workclass, occupation, and native.country
  mutate(across(c(2, 7, 14), ~ gsub("[[:punct:]]", "", .))) %>%
  # turn the categorical columns back into factors
  mutate_at(vars(workclass, education, marital.status, occupation, relationship, race, sex, native.country, salary), as.factor) %>%
  # keep the numeric columns as integers
  mutate_at(vars(age, fnlwgt, education.num, capital.gain, capital.loss, hours.per.week), as.integer)
summary(df_salary)
#>       age                 workclass         fnlwgt                education    
#>  Min.   :17.00    Private      :22696   Min.   :  12285    HS-grad     :10501  
#>  1st Qu.:28.00    Selfempnotinc: 2541   1st Qu.: 117827    Some-college: 7291  
#>  Median :37.00    Localgov     : 2093   Median : 178356    Bachelors   : 5355  
#>  Mean   :38.58                 : 1836   Mean   : 189778    Masters     : 1723  
#>  3rd Qu.:48.00    Stategov     : 1298   3rd Qu.: 237051    Assoc-voc   : 1382  
#>  Max.   :90.00    Selfempinc   : 1116   Max.   :1484705    11th        : 1175  
#>                  (Other)       :  981                     (Other)      : 5134  
#>  education.num                  marital.status            occupation  
#>  Min.   : 1.00    Divorced             : 4443    Profspecialty :4140  
#>  1st Qu.: 9.00    Married-AF-spouse    :   23    Craftrepair   :4099  
#>  Median :10.00    Married-civ-spouse   :14976    Execmanagerial:4066  
#>  Mean   :10.08    Married-spouse-absent:  418    Admclerical   :3770  
#>  3rd Qu.:12.00    Never-married        :10683    Sales         :3650  
#>  Max.   :16.00    Separated            : 1025    Otherservice  :3295  
#>                   Widowed              :  993   (Other)        :9541  
#>           relationship                    race            sex       
#>   Husband       :13193    Amer-Indian-Eskimo:  311    Female:10771  
#>   Not-in-family : 8305    Asian-Pac-Islander: 1039    Male  :21790  
#>   Other-relative:  981    Black             : 3124                  
#>   Own-child     : 5068    Other             :  271                  
#>   Unmarried     : 3446    White             :27816                  
#>   Wife          : 1568                                              
#>                                                                     
#>   capital.gain    capital.loss    hours.per.week        native.country 
#>  Min.   :    0   Min.   :   0.0   Min.   : 1.00    UnitedStates:29170  
#>  1st Qu.:    0   1st Qu.:   0.0   1st Qu.:40.00    Mexico      :  643  
#>  Median :    0   Median :   0.0   Median :40.00                :  583  
#>  Mean   : 1078   Mean   :  87.3   Mean   :40.44    Philippines :  198  
#>  3rd Qu.:    0   3rd Qu.:   0.0   3rd Qu.:45.00    Germany     :  137  
#>  Max.   :99999   Max.   :4356.0   Max.   :99.00    Canada      :  121  
#>                                                   (Other)      : 1709  
#>     salary     
#>   <=50K:24720  
#>   >50K : 7841  
#>                
#>                
#>                
#>                
#> 

The “?” entries have now become blank strings. We need to convert these blanks into NA.

df_salary[df_salary == "" | df_salary == " "] <- NA
summary(df_salary)
#>       age                 workclass         fnlwgt                education    
#>  Min.   :17.00    Private      :22696   Min.   :  12285    HS-grad     :10501  
#>  1st Qu.:28.00    Selfempnotinc: 2541   1st Qu.: 117827    Some-college: 7291  
#>  Median :37.00    Localgov     : 2093   Median : 178356    Bachelors   : 5355  
#>  Mean   :38.58    Stategov     : 1298   Mean   : 189778    Masters     : 1723  
#>  3rd Qu.:48.00    Selfempinc   : 1116   3rd Qu.: 237051    Assoc-voc   : 1382  
#>  Max.   :90.00   (Other)       :  981   Max.   :1484705    11th        : 1175  
#>                  NA's          : 1836                     (Other)      : 5134  
#>  education.num                  marital.status            occupation   
#>  Min.   : 1.00    Divorced             : 4443    Profspecialty : 4140  
#>  1st Qu.: 9.00    Married-AF-spouse    :   23    Craftrepair   : 4099  
#>  Median :10.00    Married-civ-spouse   :14976    Execmanagerial: 4066  
#>  Mean   :10.08    Married-spouse-absent:  418    Admclerical   : 3770  
#>  3rd Qu.:12.00    Never-married        :10683    Sales         : 3650  
#>  Max.   :16.00    Separated            : 1025   (Other)        :10993  
#>                   Widowed              :  993   NA's           : 1843  
#>           relationship                    race            sex       
#>   Husband       :13193    Amer-Indian-Eskimo:  311    Female:10771  
#>   Not-in-family : 8305    Asian-Pac-Islander: 1039    Male  :21790  
#>   Other-relative:  981    Black             : 3124                  
#>   Own-child     : 5068    Other             :  271                  
#>   Unmarried     : 3446    White             :27816                  
#>   Wife          : 1568                                              
#>                                                                     
#>   capital.gain    capital.loss    hours.per.week        native.country 
#>  Min.   :    0   Min.   :   0.0   Min.   : 1.00    UnitedStates:29170  
#>  1st Qu.:    0   1st Qu.:   0.0   1st Qu.:40.00    Mexico      :  643  
#>  Median :    0   Median :   0.0   Median :40.00    Philippines :  198  
#>  Mean   : 1078   Mean   :  87.3   Mean   :40.44    Germany     :  137  
#>  3rd Qu.:    0   3rd Qu.:   0.0   3rd Qu.:45.00    Canada      :  121  
#>  Max.   :99999   Max.   :4356.0   Max.   :99.00   (Other)      : 1709  
#>                                                   NA's         :  583  
#>     salary     
#>   <=50K:24720  
#>   >50K : 7841  
#>                
#>                
#>                
#>                
#> 

We have succeeded in replacing them with NAs. Let’s confirm this using anyNA() and colSums(is.na()).

anyNA(df_salary)
#> [1] TRUE
colSums(is.na(df_salary))
#>            age      workclass         fnlwgt      education  education.num 
#>              0           1836              0              0              0 
#> marital.status     occupation   relationship           race            sex 
#>              0           1843              0              0              0 
#>   capital.gain   capital.loss hours.per.week native.country         salary 
#>              0              0              0            583              0

Good. Let’s drop the rows containing NAs.

df_salary_omitna <- na.omit(df_salary)
anyNA(df_salary_omitna)
#> [1] FALSE
colSums(is.na(df_salary_omitna))
#>            age      workclass         fnlwgt      education  education.num 
#>              0              0              0              0              0 
#> marital.status     occupation   relationship           race            sex 
#>              0              0              0              0              0 
#>   capital.gain   capital.loss hours.per.week native.country         salary 
#>              0              0              0              0              0

We also want to check which variables have outliers.

summary(df_salary_omitna)
#>       age                 workclass         fnlwgt                education   
#>  Min.   :17.00    Private      :22286   Min.   :  13769    HS-grad     :9840  
#>  1st Qu.:28.00    Selfempnotinc: 2499   1st Qu.: 117627    Some-college:6678  
#>  Median :37.00    Localgov     : 2067   Median : 178425    Bachelors   :5044  
#>  Mean   :38.44    Stategov     : 1279   Mean   : 189794    Masters     :1627  
#>  3rd Qu.:47.00    Selfempinc   : 1074   3rd Qu.: 237629    Assoc-voc   :1307  
#>  Max.   :90.00    Federalgov   :  943   Max.   :1484705    11th        :1048  
#>                  (Other)       :   14                     (Other)      :4618  
#>  education.num                  marital.status            occupation  
#>  Min.   : 1.00    Divorced             : 4214    Profspecialty :4038  
#>  1st Qu.: 9.00    Married-AF-spouse    :   21    Craftrepair   :4030  
#>  Median :10.00    Married-civ-spouse   :14065    Execmanagerial:3992  
#>  Mean   :10.12    Married-spouse-absent:  370    Admclerical   :3721  
#>  3rd Qu.:13.00    Never-married        : 9726    Sales         :3584  
#>  Max.   :16.00    Separated            :  939    Otherservice  :3212  
#>                   Widowed              :  827   (Other)        :7585  
#>           relationship                    race            sex       
#>   Husband       :12463    Amer-Indian-Eskimo:  286    Female: 9782  
#>   Not-in-family : 7726    Asian-Pac-Islander:  895    Male  :20380  
#>   Other-relative:  889    Black             : 2817                  
#>   Own-child     : 4466    Other             :  231                  
#>   Unmarried     : 3212    White             :25933                  
#>   Wife          : 1406                                              
#>                                                                     
#>   capital.gain    capital.loss     hours.per.week        native.country 
#>  Min.   :    0   Min.   :   0.00   Min.   : 1.00    UnitedStates:27504  
#>  1st Qu.:    0   1st Qu.:   0.00   1st Qu.:40.00    Mexico      :  610  
#>  Median :    0   Median :   0.00   Median :40.00    Philippines :  188  
#>  Mean   : 1092   Mean   :  88.37   Mean   :40.93    Germany     :  128  
#>  3rd Qu.:    0   3rd Qu.:   0.00   3rd Qu.:45.00    PuertoRico  :  109  
#>  Max.   :99999   Max.   :4356.00   Max.   :99.00    Canada      :  107  
#>                                                    (Other)      : 1516  
#>     salary     
#>   <=50K:22654  
#>   >50K : 7508  
#>                
#>                
#>                
#>                
#> 

Based on the summary, fnlwgt appears to have many outliers, while capital.gain and capital.loss are dominated by zero values. To keep things simple and avoid bias, we will drop these three columns from our model.
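Before dropping them, a quick sanity check can back this up (an optional sketch; the boxplot and the share of zero values are only for inspection and are not part of the modelling workflow):

# quick visual check of the heavy right tail in fnlwgt
boxplot(df_salary_omitna$fnlwgt, main = "fnlwgt distribution", col = "grey")
# share of observations with zero capital gain / loss
mean(df_salary_omitna$capital.gain == 0)
mean(df_salary_omitna$capital.loss == 0)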

df_salary_clean <- df_salary_omitna[-c(3, 11:12)] # drop fnlwgt, capital.gain, and capital.loss
summary(df_salary_clean)
#>       age                 workclass             education    education.num  
#>  Min.   :17.00    Private      :22286    HS-grad     :9840   Min.   : 1.00  
#>  1st Qu.:28.00    Selfempnotinc: 2499    Some-college:6678   1st Qu.: 9.00  
#>  Median :37.00    Localgov     : 2067    Bachelors   :5044   Median :10.00  
#>  Mean   :38.44    Stategov     : 1279    Masters     :1627   Mean   :10.12  
#>  3rd Qu.:47.00    Selfempinc   : 1074    Assoc-voc   :1307   3rd Qu.:13.00  
#>  Max.   :90.00    Federalgov   :  943    11th        :1048   Max.   :16.00  
#>                  (Other)       :   14   (Other)      :4618                  
#>                 marital.status            occupation            relationship  
#>   Divorced             : 4214    Profspecialty :4038    Husband       :12463  
#>   Married-AF-spouse    :   21    Craftrepair   :4030    Not-in-family : 7726  
#>   Married-civ-spouse   :14065    Execmanagerial:3992    Other-relative:  889  
#>   Married-spouse-absent:  370    Admclerical   :3721    Own-child     : 4466  
#>   Never-married        : 9726    Sales         :3584    Unmarried     : 3212  
#>   Separated            :  939    Otherservice  :3212    Wife          : 1406  
#>   Widowed              :  827   (Other)        :7585                          
#>                   race            sex        hours.per.week 
#>   Amer-Indian-Eskimo:  286    Female: 9782   Min.   : 1.00  
#>   Asian-Pac-Islander:  895    Male  :20380   1st Qu.:40.00  
#>   Black             : 2817                   Median :40.00  
#>   Other             :  231                   Mean   :40.93  
#>   White             :25933                   3rd Qu.:45.00  
#>                                              Max.   :99.00  
#>                                                             
#>        native.country     salary     
#>   UnitedStates:27504    <=50K:22654  
#>   Mexico      :  610    >50K : 7508  
#>   Philippines :  188                 
#>   Germany     :  128                 
#>   PuertoRico  :  109                 
#>   Canada      :  107                 
#>  (Other)      : 1516

Now the data looks neater. Let’s proceed to Exploratory Data Analysis or EDA.

5 Exploratory Data Analysis

We first perform Exploratory Data Analysis (EDA) as an initial investigation of the data, particularly to see whether there are any patterns or correlations we should focus on. We will use three graphical tools: ggcorr, boxplot, and qplot.

Let’s use a heatmap to see the correlation among the predictors.

ggcorr(df_salary_clean, label = T, hjust = 1, layout.exp = 3)

The numerical variables seem to have low correlation with one another. This is a good sign, particularly since we will be using Naive Bayes, which assumes that the predictor variables are independent.

Now, we can deep dive and see the correlation among predictors using boxplot and qplot and gain some insights.

  1. Boxplot for age and salary
boxplot(age ~ salary, data = df_salary_clean, main = "Age distribution based on salary", xlab = "Salary", ylab = "Age", col = "green")

The boxplot above suggests that age correlates with salary: people roughly between 30 and 50 are more likely to earn >50K per year, while those below 30 tend to earn <=50K. This makes sense since age is a proxy for experience; people in that range have typically been working for 10+ years, which makes them eligible for higher salaries.

  2. Boxplot for education number and salary
boxplot(education.num ~ salary, data = df_salary_clean, main = "Years of Education vs salary", xlab = "Salary", ylab = "Education Years", col = "blue")

This suggests that more years of education are associated with a higher salary. But what about the educational background itself?

  3. Qplot for education and salary
qplot(salary, data = df_salary_clean, fill = education) + facet_grid (. ~ education)

Looking at the qplot, there appears to be some correlation between education and salary; our assumption is that higher education leads to higher salary. This is particularly true for Masters, Doctorate, and Prof-school, where the people earning >50K outnumber those earning <=50K. However, these education levels have far fewer respondents than Bachelors, so they are not fully representative for our model.

  4. Qplot for workclass and salary
qplot(salary, data = df_salary_clean, fill = workclass) + facet_grid (. ~ workclass)

Based on the qplot above, workclass and salary appear to have little correlation because in each workclass the majority earn <=50K per year (with the exception of the self-employed). Additionally, most observations come from the private sector, so including this variable could bias the results.

  5. Qplot for marital status and salary
qplot(salary, data = df_salary_clean, fill = sex) + facet_grid (. ~ marital.status)

Based on the qplot above, it is interesting that a male married to a civilian spouse is more likely to earn >50K than a married female, while never-married individuals of either sex tend to earn less. This suggests that the combination of marital status and sex matters a lot.

  6. Boxplot for working hours and salary
boxplot(hours.per.week ~ salary, data = df_salary_clean, main = "Hours per week correlation with salary", xlab = "Salary", ylab = "Working Hour Per Week", col = "orange")

This seems to confirm the common saying that the more hours you work, the more you earn: people working more than 40 hours per week tend to earn more than those who work fewer hours.

  7. Qplot for occupation and salary
qplot(salary, data = df_salary_clean, fill = sex) + facet_grid (. ~ occupation)

Apparently, people working in exec-managerial and prof-specialty occupations tend to earn more than those in other types of occupation. Again, sex seems to matter, because the majority of people in these occupations are male. It is also interesting that females dominate certain occupations such as adm-clerical, while males dominate craft-repair.

  8. Qplot for relationship, marital status, and salary
qplot(salary, data = df_salary_clean, fill = relationship) + facet_grid (. ~ marital.status)

Based on the above, a married male (recorded as “Husband”) will probably earn more than a married female (“Wife”). This again shows how marital status plays a major role in determining salary. Because relationship and marital status overlap, we will use marital status instead, as it lumps husband and wife together as married.

  9. Qplot for race, sex and salary
qplot(salary, data = df_salary_clean, fill = race) + facet_grid (. ~ sex)

The qplot above suggests that white males tend to earn more than every other group. Race alone is hard to interpret, however, because it interacts with sex; and because the respondents are predominantly white, including race in the model could introduce bias.

  10. Qplot for native country, race, and salary
qplot(salary, data = df_salary_clean, fill = race) + facet_grid (. ~ native.country)

The majority of respondents are white Americans, which isn’t surprising, and native country shows no clear correlation with salary. Since the respondents are overwhelmingly from the United States, we should not include native country as a predictor, to avoid bias.

Based on our EDA above, we will only be using the following variables as our predictors:

  1. numerical variables:

    • age
    • hours.per.week
    • education.num
  2. categorical variables:

    • sex
    • occupation
    • marital.status

6 Cross Validation

Splitting into training and test sets with a proportion of 80%:20%.

RNGkind(sample.kind = "Rounding")
set.seed(100)

index <- sample(nrow(df_salary_clean), nrow(df_salary_clean)*0.8)
df_train <- df_salary_clean[index,]
df_test <- df_salary_clean[-index,]
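As a side note, caret’s createDataPartition() can produce a stratified split that preserves the class proportions in both sets. This is a hypothetical alternative (the object names below are made up) and is not used in the rest of the analysis:

set.seed(100)
# stratified 80%:20% split on the target variable
idx_strat <- createDataPartition(df_salary_clean$salary, p = 0.8, list = FALSE)
df_train_strat <- df_salary_clean[idx_strat, ]
df_test_strat  <- df_salary_clean[-idx_strat, ]
prop.table(table(df_train_strat$salary))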

Check the target class proportion in df_train.

prop.table(table(df_train$salary))
#> 
#>     <=50K      >50K 
#> 0.7524141 0.2475859

The classes are somewhat imbalanced. Let’s apply upsampling using caret’s upSample().

df_train_up <- upSample(x = df_train %>%
                          select(-salary), y = df_train$salary,
                        yname= "salary")

prop.table(table(df_train_up$salary))
#> 
#>  <=50K   >50K 
#>    0.5    0.5
prop.table(table(df_train$salary))
#> 
#>     <=50K      >50K 
#> 0.7524141 0.2475859

Nice. Our training dataset’s target classes look balanced now.
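As an aside, caret also provides downSample(), which balances the classes by discarding rows from the majority class instead of duplicating the minority class. A hypothetical alternative that we do not use here:

# downsample the majority class so both classes have equal counts
df_train_down <- downSample(x = df_train %>% select(-salary),
                            y = df_train$salary,
                            yname = "salary")
prop.table(table(df_train_down$salary))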

7 Naive Bayes

source: https://www.upgrad.com/blog/naive-bayes-classifier/

What is Naive Bayes Classifier?

It is a supervised machine learning classifier that separates data into different classes according to Bayes’ Theorem, under the assumptions that:

  • all predictors are independent of one another (even if this is not entirely true, we assume it is); and

  • the target depends on the predictor variables.
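As a small illustration of the idea, here is a minimal sketch of Bayes’ rule with a single categorical predictor (sex) from our cleaned data; this is roughly what naiveBayes() computes internally for each predictor:

prior      <- prop.table(table(df_salary_clean$salary))           # P(class)
likelihood <- prop.table(table(df_salary_clean$salary,
                               df_salary_clean$sex), margin = 1)  # P(sex | class)
# unnormalised posterior for a male observation: P(class) * P(Male | class)
posterior  <- prior * likelihood[, 2]                             # column 2 = Male
posterior / sum(posterior)                                        # P(class | Male)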

What are the pros and cons of Naive Bayes Classifier?

Pros:

  • works very fast and can therefore easily predict the class of test data in real time,

  • can be used to solve multi-class prediction problems,

  • performs better than other models with less training data, provided the independence assumption holds, and

  • performs great for categorical input variables compared to numerical variables.

Cons:

  • prone to the “Zero Frequency” (or “skewness due to scarcity”) problem: if a category level appears in the test data but was not present in the training data, the model assigns it zero probability and cannot use it for prediction. To solve this, we use a smoothing technique called Laplace smoothing (see the sketch after this list).

  • it assumes that predictors are independent of one another, whereas in reality most variables depend on one another.
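As referenced in the first point above, here is a minimal sketch of what the laplace argument changes, fitted on our training split with a single predictor (the column subset shown is arbitrary):

nb_raw    <- naiveBayes(salary ~ occupation, data = df_train, laplace = 0)
nb_smooth <- naiveBayes(salary ~ occupation, data = df_train, laplace = 1)
nb_raw$tables$occupation[, 1:3]     # class-conditional probabilities P(occupation | salary)
nb_smooth$tables$occupation[, 1:3]  # same probabilities after adding 1 to every count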

7.1 Model Fitting

To check which training data works best for Naive Bayes, the original (imbalanced) training set or the one balanced with upSample(), we will train two models:

# train
model_naive <- naiveBayes(formula = salary ~ age + hours.per.week + education.num + sex + occupation + marital.status,
                          data = df_train, 
                           laplace = 1)
model_naive
#> 
#> Naive Bayes Classifier for Discrete Predictors
#> 
#> Call:
#> naiveBayes.default(x = X, y = Y, laplace = laplace)
#> 
#> A-priori probabilities:
#> Y
#>     <=50K      >50K 
#> 0.7524141 0.2475859 
#> 
#> Conditional probabilities:
#>         age
#> Y            [,1]     [,2]
#>    <=50K 36.66252 13.50766
#>    >50K  43.90626 10.26100
#> 
#>         hours.per.week
#> Y            [,1]     [,2]
#>    <=50K 39.31611 11.94122
#>    >50K  45.65082 10.66788
#> 
#>         education.num
#> Y             [,1]     [,2]
#>    <=50K  9.633765 2.407389
#>    >50K  11.604118 2.366642
#> 
#>         sex
#> Y           Female      Male
#>    <=50K 0.3846450 0.6153550
#>    >50K  0.1480924 0.8519076
#> 
#>         occupation
#> Y                        Admclerical   ArmedForces   Craftrepair
#>    <=50K 0.00005503577 0.14232250963 0.00044028619 0.13858007705
#>    >50K  0.00016697278 0.06628819502 0.00033394557 0.11888462181
#>         occupation
#> Y         Execmanagerial  Farmingfishing  Handlerscleaners  Machineopinspct
#>    <=50K   0.09108420473   0.03775454045     0.05575123830    0.07743533297
#>    >50K    0.25663716814   0.01502755051     0.01118717649    0.03306061112
#>         occupation
#> Y         Otherservice  Privhouseserv  Profspecialty  Protectiveserv
#>    <=50K 0.13439735828  0.00720968630  0.09796367639   0.01970280682
#>    >50K  0.01903489731  0.00033394557  0.24211053598   0.02938720988
#>         occupation
#> Y                Sales   Techsupport  Transportmoving
#>    <=50K 0.11491469455 0.02828838745    0.05410016511
#>    >50K  0.12906996160 0.03723493071    0.04124227751
#> 
#>         marital.status
#> Y            Divorced  Married-AF-spouse  Married-civ-spouse
#>    <=50K 0.1649047462       0.0005506002        0.3369122343
#>    >50K  0.0590202307       0.0011703728        0.8523658251
#>         marital.status
#> Y         Married-spouse-absent  Never-married    Separated      Widowed
#>    <=50K           0.0150864442   0.4101420548 0.0386521308 0.0337517895
#>    >50K            0.0040127069   0.0632001338 0.0088613944 0.0113693362
model_naive_up <- naiveBayes(formula = salary ~ age + hours.per.week + education.num + sex + occupation + marital.status,
                          data = df_train_up, 
                           laplace = 1)
model_naive_up
#> 
#> Naive Bayes Classifier for Discrete Predictors
#> 
#> Call:
#> naiveBayes.default(x = X, y = Y, laplace = laplace)
#> 
#> A-priori probabilities:
#> Y
#>  <=50K   >50K 
#>    0.5    0.5 
#> 
#> Conditional probabilities:
#>         age
#> Y            [,1]     [,2]
#>    <=50K 36.66252 13.50766
#>    >50K  43.85789 10.23859
#> 
#>         hours.per.week
#> Y            [,1]     [,2]
#>    <=50K 39.31611 11.94122
#>    >50K  45.60474 10.73596
#> 
#>         education.num
#> Y             [,1]     [,2]
#>    <=50K  9.633765 2.407389
#>    >50K  11.632939 2.361603
#> 
#>         sex
#> Y           Female      Male
#>    <=50K 0.3846450 0.6153550
#>    >50K  0.1510712 0.8489288
#> 
#>         occupation
#> Y                        Admclerical   ArmedForces   Craftrepair
#>    <=50K 0.00005503577 0.14232250963 0.00044028619 0.13858007705
#>    >50K  0.00005503577 0.06499724821 0.00027517887 0.11997798569
#>         occupation
#> Y         Execmanagerial  Farmingfishing  Handlerscleaners  Machineopinspct
#>    <=50K   0.09108420473   0.03775454045     0.05575123830    0.07743533297
#>    >50K    0.25679691800   0.01447440837     0.01062190424    0.03417721519
#>         occupation
#> Y         Otherservice  Privhouseserv  Profspecialty  Protectiveserv
#>    <=50K 0.13439735828  0.00720968630  0.09796367639   0.01970280682
#>    >50K  0.01926252064  0.00011007155  0.24589983489   0.02894881673
#>         occupation
#> Y                Sales   Techsupport  Transportmoving
#>    <=50K 0.11491469455 0.02828838745    0.05410016511
#>    >50K  0.12680242157 0.03742432581    0.04017611447
#> 
#>         marital.status
#> Y            Divorced  Married-AF-spouse  Married-civ-spouse
#>    <=50K 0.1649047462       0.0005506002        0.3369122343
#>    >50K  0.0613919172       0.0008809602        0.8478141174
#>         marital.status
#> Y         Married-spouse-absent  Never-married    Separated      Widowed
#>    <=50K           0.0150864442   0.4101420548 0.0386521308 0.0337517895
#>    >50K            0.0036890210   0.0661821385 0.0084241824 0.0116176633

7.2 Model Evaluation

Let’s predict the class for df_test using predict() with the two models:

# predict
df_test$pred <- predict(model_naive, newdata = df_test, type = "class")
df_test$pred_up <- predict(model_naive_up, newdata = df_test, type = "class")
head(df_test)

Evaluate the models with confusion matrix:

# confusion matrix without upSampling

confusionMatrix(data = df_test$pred, 
                reference = df_test$salary)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  <=50K  >50K
#>      <=50K   3920   520
#>      >50K     579  1014
#>                                               
#>                Accuracy : 0.8178              
#>                  95% CI : (0.8079, 0.8275)    
#>     No Information Rate : 0.7457              
#>     P-Value [Acc > NIR] : < 0.0000000000000002
#>                                               
#>                   Kappa : 0.5257              
#>                                               
#>  Mcnemar's Test P-Value : 0.08019             
#>                                               
#>             Sensitivity : 0.8713              
#>             Specificity : 0.6610              
#>          Pos Pred Value : 0.8829              
#>          Neg Pred Value : 0.6365              
#>              Prevalence : 0.7457              
#>          Detection Rate : 0.6498              
#>    Detection Prevalence : 0.7360              
#>       Balanced Accuracy : 0.7662              
#>                                               
#>        'Positive' Class :  <=50K              
#> 

Without upSampling, we’re getting the following numbers:

NB_withoutup <- cbind.data.frame(
  accuracy = 0.8178, 
  recall = 0.8713,
  specificity = 0.6610,
  precision = 0.8829)

NB_withoutup
# confusion matrix with upSampling

confusionMatrix(data = df_test$pred_up, 
                reference = df_test$salary)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  <=50K  >50K
#>      <=50K   3363   232
#>      >50K    1136  1302
#>                                                
#>                Accuracy : 0.7732               
#>                  95% CI : (0.7625, 0.7838)     
#>     No Information Rate : 0.7457               
#>     P-Value [Acc > NIR] : 0.0000003669         
#>                                                
#>                   Kappa : 0.4993               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.7475               
#>             Specificity : 0.8488               
#>          Pos Pred Value : 0.9355               
#>          Neg Pred Value : 0.5340               
#>              Prevalence : 0.7457               
#>          Detection Rate : 0.5574               
#>    Detection Prevalence : 0.5959               
#>       Balanced Accuracy : 0.7981               
#>                                                
#>        'Positive' Class :  <=50K               
#> 

With upSampling, we’re getting the following numbers:

NB_withup <- cbind.data.frame(
  accuracy = 0.7732, 
  recall = 0.7475,
  specificity = 0.8488,
  precision = 0.9355)

NB_withup

Let’s compare the two models:

rbind("Naive Bayes without tuning" = NB_withoutup, "Naive Bayes with tuning" = NB_withup)

In terms of accuracy (how well the model predicts the target overall) and recall (how many of the actually positive observations are predicted correctly), the Naive Bayes model without upsampling performs better than the model with upsampling.

Meanwhile, the Naive Bayes model with upsampling performs considerably better in terms of specificity (how many of the actually negative observations are predicted correctly) and precision (how many of the predicted positives are actually positive) than the original Naive Bayes model.

Based on this comparison, we will use the Naive Bayes model with upsampling because it performs consistently well (>70%) across all four metrics.
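As a side note, these metrics can be pulled straight out of the confusionMatrix() object instead of being typed in by hand; eval_metrics below is a hypothetical helper, not part of the original workflow:

eval_metrics <- function(pred, actual) {
  cm <- confusionMatrix(data = pred, reference = actual)
  data.frame(accuracy    = unname(cm$overall["Accuracy"]),
             recall      = unname(cm$byClass["Sensitivity"]),
             specificity = unname(cm$byClass["Specificity"]),
             precision   = unname(cm$byClass["Pos Pred Value"]))
}

rbind("Naive Bayes without tuning" = eval_metrics(df_test$pred, df_test$salary),
      "Naive Bayes with tuning"    = eval_metrics(df_test$pred_up, df_test$salary))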

7.3 ROC and AUC

Subsequently, we will look into the Receiver Operating Characteristics (ROC) curve and the Area Under the Curve (AUC) to evaluate the chosen model.

The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds. A good ROC curve bends sharply toward the upper-left corner of the plot (high TPR, low FPR); the closer the curve gets to that corner, the better the model.

The AUC measures the whole area under the ROC curve and tells you how well the selected model can distinguish between the target classes: the higher the AUC, the better the model separates them.

First we need the predictions for df_test in probability form, saved as an object named df_pred_prob:

df_pred_prob <- predict(model_naive_up, newdata = df_test, type = "raw")

For the ROC, we want to save it in an object called df_salary_roc:

df_salary_roc <- data.frame(pred = df_pred_prob[,2], label = df_test$salary)
salary_roc <- prediction(predictions = df_salary_roc$pred,
                       labels = df_salary_roc$label)

# ROC curve
plot(performance(prediction.obj = salary_roc,
            measure = "tpr",
            x.measure = "fpr"))
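Optionally, the diagonal of a random classifier can be added as a reference line:

abline(a = 0, b = 1, lty = 2)  # reference line: a model with no discriminative power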

As expected, the ROC curve bends toward the upper-left corner, which indicates that our selected model is good.

Let’s find the value of AUC:

salary_auc <- performance(prediction.obj = salary_roc,
                         measure = "auc")

str(salary_auc)
#> Formal class 'performance' [package "ROCR"] with 6 slots
#>   ..@ x.name      : chr "None"
#>   ..@ y.name      : chr "Area under the ROC curve"
#>   ..@ alpha.name  : chr "none"
#>   ..@ x.values    : list()
#>   ..@ y.values    :List of 1
#>   .. ..$ : num 0.879
#>   ..@ alpha.values: list()
salary_auc@y.values
#> [[1]]
#> [1] 0.8789886

We can conclude that our model separates the >50K (positive) and <=50K (negative) classes well, with an AUC of 87.9%, i.e., about 12.1% short of a perfect classifier.

8 Decision Tree

source: https://www.upgrad.com/blog/pros-and-cons-of-decision-tree-regression-in-machine-learning/

What is Decision Tree classifier?

Decision Tree is another supervised machine learning method, which solves the classification problem by transforming the data into a tree representation. It consists of:

  • Root Node: the first split of the tree, made on the main (most informative) predictor.

  • Interior Node: the subsequent branch(es), which use other predictors when the root node alone is not enough to determine the target value.

  • Terminal/Leaf Node: This is the predicted target class.

It works by asking a series of true/false questions about the data; each question narrows down the possible values until a prediction can be made. The order and content of the questions are decided by the model itself.

What are the pros and cons of Decision Tree Classifier?

Pros:

  • can be used for classification and regression problems

  • easy to interpret, understand, and visualize

  • data pre-processing requires less effort; there is no need to normalize or scale the data

  • quickly identifies the most significant variables and the relationships between variables

  • can handle both numerical and categorical variables, and not heavily influenced by outliers

Cons:

  • prone to overfitting: it keeps developing hypotheses that reduce the training set error at the cost of increasing the test set error. This can be mitigated by pruning.

  • does not perform well with continuous numerical variables

  • quite unstable: a small change in the data can cause a big difference in the tree structure

  • in terms of running it in R, it takes longer to train than Naive Bayes, particularly as the dataset becomes more complex.

8.1 Model Fitting

Since we have more than 20,000 observations, we set the tree to require at least 4,000 observations in a node before it is split (minsplit = 4000), so that the resulting tree stays small enough to plot. We will also use the earlier df_train instead of df_train_up, since decision trees are not heavily influenced by outliers.

model_ctree <- ctree(formula = salary~ age + hours.per.week + education.num + sex + occupation + marital.status,
                    data = df_train,
                    control = ctree_control(mincriterion = 0.07,
                                            minsplit = 4000,
                                            minbucket = 7))


plot(model_ctree, type = "simple")
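The fitted splits can also be inspected in text form, which is sometimes easier to read than the large plot (an optional check):

print(model_ctree)                             # text representation of the splits
length(nodeids(model_ctree, terminal = TRUE))  # number of terminal (leaf) nodes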

8.2 Model Evaluation

Let’s predict the class for df_test using predict() with the decision tree model:

dftest_new <- predict(object = model_ctree,
                          newdata = df_test,
                          type = "response")

# confusion matrix data test
confusionMatrix(data = dftest_new, 
                reference = df_test$salary)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  <=50K  >50K
#>      <=50K   4275   903
#>      >50K     224   631
#>                                                
#>                Accuracy : 0.8132               
#>                  95% CI : (0.8031, 0.823)      
#>     No Information Rate : 0.7457               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.4233               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.9502               
#>             Specificity : 0.4113               
#>          Pos Pred Value : 0.8256               
#>          Neg Pred Value : 0.7380               
#>              Prevalence : 0.7457               
#>          Detection Rate : 0.7086               
#>    Detection Prevalence : 0.8583               
#>       Balanced Accuracy : 0.6808               
#>                                                
#>        'Positive' Class :  <=50K               
#> 

For our data test, we’re getting the following results:

DT_datatest <- cbind.data.frame(
  accuracy = 0.8132, 
  recall = 0.9502,
  specificity = 0.4113,
  precision = 0.8256)

DT_datatest
dftrain_new <-predict(object = model_ctree,
                          newdata = df_train_up,
                          type = "response")

# confusion matrix data train
confusionMatrix(dftrain_new, reference = df_train_up$salary)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  <=50K  >50K
#>      <=50K  17251 10799
#>      >50K     904  7356
#>                                                
#>                Accuracy : 0.6777               
#>                  95% CI : (0.6729, 0.6825)     
#>     No Information Rate : 0.5                  
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.3554               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.9502               
#>             Specificity : 0.4052               
#>          Pos Pred Value : 0.6150               
#>          Neg Pred Value : 0.8906               
#>              Prevalence : 0.5000               
#>          Detection Rate : 0.4751               
#>    Detection Prevalence : 0.7725               
#>       Balanced Accuracy : 0.6777               
#>                                                
#>        'Positive' Class :  <=50K               
#> 

Meanwhile, for our data train, we’re getting the following results:

DT_datatrain <- cbind.data.frame(
  accuracy = 0.6777, 
  recall = 0.9502,
  specificity = 0.4052,
  precision = 0.6150)

DT_datatrain

Let’s compare the two results:

rbind("Decision Tree performed in Data Test" = DT_datatest, "Decision Tree performed in Data Train" = DT_datatrain)

It seems we are not overfitting, since the model performs at least as well on the test data as on the (upsampled) training data. That suggests our model generalizes reasonably well to unseen data.

9 Random Forest

source: https://www.upgrad.com/blog/random-forest-classifier/

What is Random Forest?

Random Forest is a method that consists of multiple decision trees, as the name suggests (forest = many trees). It is called Random because it uses randomness to improve accuracy and reduce the overfitting seen in a single decision tree. The algorithm builds decision trees on random selections of data samples, gets a prediction from every tree, and then selects the best answer through voting.

What are the pros and cons of Random Forest?

Pros:

  • often more accurate than other classifiers
  • very robust, as it uses multiple decision trees to reach a conclusion
  • less prone to overfitting, because it aggregates the predictions of many trees
  • highly versatile, as it can be used for both regression and classification problems
  • it allows us to identify the features that contribute most to the classifier

Cons:

  • slower to run than other classifiers because it uses multiple decision trees to make predictions; random sampling of the data can help mitigate this

  • cannot easily be used for real-time predictions

  • cannot be easily interpreted compared to a decision tree

This algorithm is substantially slower than other classification algorithms because it uses multiple decision trees to make predictions: every tree in the forest has to make a prediction for the same input and then vote on the result, which can be very time-consuming. Because of this slow pace, random forest classifiers can be unsuitable for real-time predictions. The model is also harder to interpret than a single decision tree, where you can follow the tree’s path to see how a decision was made; that is not possible in a random forest with its many trees.

9.1 Random Sampling from Earlier Data

Since Random Forest is not prone to overfitting, we can use the earlier dataset, df_salary_omitna, for this model without dropping any variables. However, because of the large number of observations, we first take a random sample of about 10% of the data (around 3,000 rows) to speed up the random forest training.

df_salary_sample <- sample_n(df_salary_omitna, 3000)
df_salary_sample
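Note that sample_n() above is not seeded, so the subset, and therefore the numbers reported below, will vary from run to run. Seeding before sampling would make the analysis reproducible (an optional tweak; the seed value is arbitrary):

set.seed(100)  # arbitrary seed so the 3,000-row sample can be reproduced
df_salary_sample <- sample_n(df_salary_omitna, 3000)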

9.2 K-Fold Cross-Validation

Before doing the cross-validation, we need to split df_salary_sample into training and test sets with an 80%:20% proportion.

RNGkind(sample.kind = "Rounding")
set.seed(100)

insample <- sample(nrow(df_salary_sample), nrow(df_salary_sample)*0.8)
RF_train <- df_salary_sample[insample,]
RF_test <- df_salary_sample[-insample,]

Now let’s check the proportion of the target class of the RF_train data.

prop.table(table(RF_train$salary))
#> 
#>     <=50K      >50K 
#> 0.7491667 0.2508333

Now, let’s begin our model fitting.

set.seed(100)
control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

model_RF <- train(salary ~ ., data = RF_train, method = "rf", trControl = control)
   
saveRDS(model_RF, "model_RF.RDS")

Now, let’s read our model.

model_RF <- readRDS("model_RF.RDS")
model_RF
#> Random Forest 
#> 
#> 2400 samples
#>   14 predictor
#>    2 classes: ' <=50K', ' >50K' 
#> 
#> No pre-processing
#> Resampling: Bootstrapped (25 reps) 
#> Summary of sample sizes: 2400, 2400, 2400, 2400, 2400, 2400, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa     
#>     2   0.7535454  0.01713643
#>    51   0.8396008  0.55230122
#>   100   0.8321996  0.53734907
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 51.

From the results, we can understand the following:

  1. 2400 samples -> the number of rows in our training data used to create the model

  2. 14 predictor -> the number of predictor variables in our training data

  3. 2 classes -> the number of target classes in our data

  4. Summary of sample sizes -> the sizes of the resamples drawn from our training data

  5. mtry and Accuracy -> the mtry values tried and the accuracy obtained with each one.

From the model summary, after trying several values of mtry, the best choice is mtry = 51, which achieved the highest accuracy on the held-out resamples.
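The accuracy for each candidate mtry can also be inspected visually with caret’s plot method for train objects (optional):

plot(model_RF)  # resampled accuracy against the mtry values tried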

9.3 Out of Bag Error

The bootstrap sampling used to build each tree leaves some observations unused. These observations are called out-of-bag data and act as test data for that tree: the model predicts them and calculates the error, which is called the out-of-bag (OOB) error.

model_RF$finalModel
#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)), trainControl = ..1) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 51
#> 
#>         OOB estimate of  error rate: 14.96%
#> Confusion matrix:
#>         <=50K  >50K class.error
#>  <=50K   1657   141  0.07842047
#>  >50K     218   384  0.36212625

From the model above, our out-of-bag error rate is 14.96%, which corresponds to an accuracy of 85.04%. Based on the model, we can check which predictors most strongly affect a person’s salary.

plot(varImp(model_RF))

From the plot above, and consistent with the decision tree model, the marital.status level Married-civ-spouse is the most important predictor of salary.

9.4 Prediction and Model Evaluation

Now we can check the confusion matrix.

model_RF_test <- predict(model_RF, newdata = RF_test)
confusionMatrix(as.factor(model_RF_test), RF_test$salary)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  <=50K  >50K
#>      <=50K    389    67
#>      >50K      26   118
#>                                                
#>                Accuracy : 0.845                
#>                  95% CI : (0.8135, 0.873)      
#>     No Information Rate : 0.6917               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.6128               
#>                                                
#>  Mcnemar's Test P-Value : 0.00003357           
#>                                                
#>             Sensitivity : 0.9373               
#>             Specificity : 0.6378               
#>          Pos Pred Value : 0.8531               
#>          Neg Pred Value : 0.8194               
#>              Prevalence : 0.6917               
#>          Detection Rate : 0.6483               
#>    Detection Prevalence : 0.7600               
#>       Balanced Accuracy : 0.7876               
#>                                                
#>        'Positive' Class :  <=50K               
#> 

For our Random Forest model, we’re getting the following results:

RF <- cbind.data.frame(
  accuracy = 0.8450, 
  recall = 0.9373,
  specificity = 0.6378,
  precision = 0.8531)

RF

Note that the model above was trained on a random sample of the data to speed up computation, so its accuracy should be taken with a grain of salt.

10 Conclusion

Let’s compare the three models.

rbind("Naive Bayes"= NB_withup, "Decision Tree"= DT_datatest, "Random Forest" = RF)

Looking at the three models, the one that gives the best precision and performs well across the board is Naive Bayes. However, this could be because the Random Forest was trained on only a random sample of the data, which may have lowered its performance. This warrants another Random Forest run using all observations on a more powerful machine.

However, the Decision Tree and Random Forest models do agree on one thing: marital.status, particularly Married-civ-spouse, is the predictor with the strongest effect on a person’s salary in the 1994 US Census data. This is consistent with AEI research as reported in: https://www.cbsnews.com/news/married-men-earn-far-more-than-any-other-american-group/. According to the report, married men tend to work more hours than their bachelor counterparts because of the “marriage effect”, whereby married men are more motivated to work to support a family.

Another report states that being legally married, particularly in the US, brings “a whole host of tangible benefits” both at work and through federal tax breaks, including additional pension contributions and coverage of healthcare premiums for the family. Source: https://www.cnbc.com/2020/01/15/additional-benefits-90-percent-of-us-companies-offer-if-youre-married.html