Executive Summary

This project conducts preprocessing, exploratory data analysis, and predictive data analysis on the “SalaryData” dataset provided by the instructor. The quality of the data provided was generally high. A precise analysis scope was difficult to craft because the dataset came with no supporting context, no stated business need, and no plan for adopting the results. Nonetheless, the vision of a successful analysis included: creating analyses that are realistic and relevant to a real-world business application, thoroughly exploring the dataset via graphical representations, and building at least a handful of predictive models with a high degree of accuracy.

Originally, the data had a structure of 10 variables and 200 entries. Preprocessing, which screened the data for duplicates and removed specific outliers, rows with NA values, and the “ID” variable, reduced the structure to 9 variables and 192 entries. Multiple variables were then converted into factor form for graphical analysis.

Graphical exploratory data analysis was conducted on the dataset, yielding many significant insights. Examples include: salaries and bonuses are higher in the IT industry than in the Non-IT industry; both industries are fairly evenly split between males and females, although the IT industry does have roughly 5% more males than females; gender does not play any significant role in determining an individual’s salary or bonus; both salary and bonus increase as years of experience increase; and salary and bonus appear to be highly correlated with the number of certifications an individual has attained.

Predictive data analysis was then conducted on the dataset. The data was partitioned into a training dataset and a testing dataset in a 70/30 split, using sampling without replacement. Models to predict each variable were created, with only those for industry, marital status, bonus, and salary yielding results of any significant accuracy.

Models for industry and marital status were created using logistic regression, due to the discrete class nature of the variables predicted. Models for bonus and salary were created using multiple linear regression, due to the continuous numerical nature of the variables predicted. Once the models were trained on the training dataset, they were fed the testing dataset to evaluate accuracy. On the testing partition, the industry model achieved an accuracy of 57%, the marital status model an accuracy of 76%, the bonus model a predicted-to-actual correlation of 0.915 (with an adjusted R^2 of 0.919), and the salary model a correlation of 0.853 (with an adjusted R^2 of 0.734).

It is recommended that only the models for industry, marital status, bonus, and salary be used for predictive use cases. In a business application, it is unlikely that predictive models for industry or marital status would be of any use to a company. However, a predictive model for salary can be extremely helpful for tasks such as determining the proper salary to offer an incoming employee based on factors like education, experience, and certifications. A predictive model for bonus can be similarly helpful for determining the proper bonus to give an existing employee.


Data Preprocessing

Quality, Scope, and Issues

The quality of the data provided in the “SalaryData” dataset was generally high. No duplicate entries were found, and the minor issues that did surface, such as major outliers and NA values, were easily corrected.

Unfortunately, since the dataset was provided by the instructor with no supporting context, stated business need, or plan for adoption of results, it is difficult to craft an analysis scope. Nonetheless, the vision of a successful analysis includes: creating analyses that are realistic and relevant to a real-world business application, thoroughly exploring the dataset via graphical representations, and building at least a handful of predictive models with a high degree of accuracy.

Issues with the data included:

  • 1 likely entry error creating an outlier in the “Experience” attribute.
  • 8 unique rows identified to contain NA values.

Outlining data preprocessing steps

The following is an outline and justification of the steps taken in preprocessing the dataset:

  • Loading in Data
    • Data was loaded in using the read_excel function
  • Preventing Scientific Notation
    • The “scipen=999” option was set to ensure that numbers are not displayed in scientific notation in subsequent analysis.
  • Finding Outliers
    • An “sapply” function was used to find all outliers in the dataset.
    • The attribute “Bonus” was then plotted in a boxplot to determine if the outliers were the product of entry error and should be excluded or not. It was determined that the outliers in “Bonus” should be left in.
  • Plotting then Removing Experience Outliers
    • The outlier of importance identified in the previous step was in the “Experience” attribute: a value of 21, where all other values ranged from 1 to 8. This was confirmed through the creation of a boxplot.
    • It was determined that the outlier in the “Experience” attribute was most likely an entry error and was removed.
  • Identifying and Dropping Rows with NA Values
    • Rows with NA values prevent predictive analytics methods from being applied to the data.
    • The rows with NA values were found using an “sapply” function. Since most data rows are complete, the few rows with NAs are best removed.
    • The rows with NA values were removed using the “drop_na” function, with the removal verified by another “sapply” call.
  • Checking for duplicate ID values then deleting column
    • The dataset was checked for duplicate entries that would skew subsequent analysis. No duplicates were found. The “ID” attribute was then removed from the dataset as it is not significant, and to reduce dimensionality.
  • Evaluating Correlation
    • A “plot_correlation” function was used to generate a heatmap of correlation between variables in the dataset.
  • Changing data types and integer labels
    • The data types and integer labels of various attributes were changed, mainly into factors with descriptive level names (e.g., Male, Female), to make them easier to understand in the exploratory analysis process.
    • The updated dataset table was then displayed to conclude the data preprocessing stage.

Loading in Data

library(readxl)
salary <-
  read_excel("C:/Users/cbado/OneDrive/Documents/R/SalaryData.xlsx")
head(salary,3)
## # A tibble: 3 x 10
##      id gender industry education satisfied married experience certification
##   <dbl>  <dbl>    <dbl>     <dbl>     <dbl>   <dbl>      <dbl>         <dbl>
## 1     1      0        1         2         0       1          6             3
## 2     2      1        1         3         1       1          4             3
## 3     3      1        1         3         0       1          7             4
## # ... with 2 more variables: bonus <dbl>, salary <dbl>
options(scipen=999)

Finding Outliers

sapply(salary, function(col) boxplot.stats(col)$out)
## $id
## numeric(0)
## 
## $gender
## numeric(0)
## 
## $industry
## numeric(0)
## 
## $education
## numeric(0)
## 
## $satisfied
## numeric(0)
## 
## $married
## numeric(0)
## 
## $experience
## [1] 21
## 
## $certification
## numeric(0)
## 
## $bonus
##  [1] 2147.082 3080.000 2089.164 1989.036 2353.000 2308.150 2497.600 1961.564
##  [9] 2263.248 2586.880 2656.830 2775.465
## 
## $salary
## numeric(0)
boxplot(salary$bonus,
        ylab = "Bonus",
        main = "Boxplot of Bonus")

Plotting then Removing Experience Outliers

However, the outlier in the “Experience” column is most likely an entry error and must be removed.

boxplot(salary$experience,
        ylab = "Experience",
        main = "Boxplot of Experience")
mtext(paste("Outliers: ", paste(boxplot.stats(salary$experience)$out, collapse = ", ")))

out <- boxplot.stats(salary$experience)$out
out_ind <- which(salary$experience %in% c(out))
salary[out_ind, 7] <- NA  # column 7 is "experience"; flag the outlier as NA so it is dropped below

Identifying and Dropping Rows with NA Values

sapply(salary, function(col) sum(is.na(col)))
##            id        gender      industry     education     satisfied 
##             0             0             1             2             2 
##       married    experience certification         bonus        salary 
##             0             1             0             2             0
library(tidyr)
salary<-drop_na(salary)
sapply(salary, function(col) sum(is.na(col)))
##            id        gender      industry     education     satisfied 
##             0             0             0             0             0 
##       married    experience certification         bonus        salary 
##             0             0             0             0             0

Checking for duplicate ID values then deleting column

length(unique(salary$id)) == nrow(salary)
## [1] TRUE
salary$id <- NULL

Evaluating Correlation

library(DataExplorer)
plot_correlation(na.omit(salary), maxcat = 5L)

Changing data types and integer labels

salary$gender <-ifelse(
  test = salary$gender==0,yes = "Male", no="Female")
salary$gender <-as.factor(salary$gender)

salary$industry <-ifelse(test = salary$industry==0,yes = "Non-IT", no="IT")
salary$industry <-as.factor(salary$industry)

salary$education <-ifelse(test = salary$education==1,yes = "Undergrad", no=(ifelse(test = salary$education==2,yes = "Grad", no="Advanced")))
salary$education <- as.factor(salary$education)
salary$education <- factor(salary$education, levels = c("Undergrad", "Grad", "Advanced"))  # assign the result so the level order is actually stored

salary$satisfied <-ifelse(test = salary$satisfied==1,yes = "Yes", no= "No")
salary$satisfied <- as.factor(salary$satisfied)

salary$married <-ifelse(test = salary$married==1,yes = "Yes", no= "No")
salary$married <- as.factor(salary$married)

salary$certification <- as.factor(salary$certification)
head(salary,4)
## # A tibble: 4 x 9
##   gender industry education satisfied married experience certification bonus
##   <fct>  <fct>    <fct>     <fct>     <fct>        <dbl> <fct>         <dbl>
## 1 Male   IT       Grad      No        Yes              6 3             2147.
## 2 Female IT       Advanced  Yes       Yes              4 3             1565.
## 3 Female IT       Advanced  No        Yes              7 4             3080 
## 4 Female IT       Advanced  Yes       Yes              6 3             2089.
## # ... with 1 more variable: salary <dbl>

Exploratory Data Analysis

library(ggplot2)
library(cowplot)
summary(salary)
##     gender     industry       education  satisfied married     experience  
##  Female:93   IT    :107   Advanced :44   No :101   No :106   Min.   :1.00  
##  Male  :99   Non-IT: 85   Grad     :88   Yes: 91   Yes: 86   1st Qu.:1.00  
##                           Undergrad:60                       Median :2.00  
##                                                              Mean   :3.12  
##                                                              3rd Qu.:5.00  
##                                                              Max.   :8.00  
##  certification     bonus             salary      
##  0:54          Min.   :  26.93   Min.   : 26928  
##  1:64          1st Qu.: 127.91   1st Qu.: 43850  
##  2:48          Median : 269.74   Median : 53186  
##  3:23          Mean   : 548.91   Mean   : 56878  
##  4: 3          3rd Qu.: 774.39   3rd Qu.: 66298  
##                Max.   :3080.00   Max.   :102242

The table above is a summary of the “SalaryData” dataset after preprocessing. For factor variables, it provides a count of instances. For numerical variables, it provides min/max, measures of central tendency, and quartiles. The summary is helpful in creating a preliminary understanding of the data.

Exploring Education

salary$education <- factor(salary$education,levels = c("Undergrad", "Grad", "Advanced"))
edusalary <- ggplot(salary, aes(x=education, y=salary, fill=education)) + geom_boxplot()+ggtitle("Salary by Education Level")+xlab("Education Level")+ylab("Salary")+scale_y_continuous(labels=scales::dollar_format())+labs(fill = "Education Level")
edubonus <- ggplot(salary, aes(x=education, y=bonus, fill=education)) + geom_boxplot()+ggtitle("Bonus by Education Level")+xlab("Education Level")+ylab("Bonus")+scale_y_continuous(labels=scales::dollar_format())+labs(fill = "Education Level")
plot_grid(edusalary, edubonus, labels = "AUTO")

It can be seen that both salary and bonus amounts generally increase with advancement in education level, particularly when measured by the mean. However, bonus appears to be more dependent on education level than salary is, while salary has a greater range within each education level. Outliers exist in graduate salaries, some beyond any salary received by those with advanced degrees. Only an individual with an advanced degree is likely to be given a bonus over $2,250.

salary$education <- factor(salary$education,levels = c("Advanced", "Grad", "Undergrad"))
ggplot() + geom_col(data = salary, aes(x="", y = frequency(education), fill = education), position = "fill")+ggtitle("Education Level Composition by Industry") + xlab("") + ylab("Composition Percentage")+labs(fill = "Education")+facet_grid(. ~ industry)+scale_y_continuous(labels=scales::percent_format())

The education composition of both the IT and Non-IT industries is evaluated. The Non-IT industry has a greater share of individuals with undergrad degrees, while the IT industry has the most grad and advanced degrees, with only just over 25% of IT individuals holding only an undergrad degree.

ggplot() + geom_col(data = salary, aes(x=experience, y = frequency(education), fill = education), position = "fill")+ggtitle("Education Level Composition by Years of Experience") + xlab("Years of Experience") + ylab("Composition Percentage")+labs(fill = "Education")+scale_y_continuous(labels=scales::percent_format())

An analysis of education level composition by years of experience displays a likely upward progression in education level as more years of experience are gained. After seven years, over 75% of individuals have an advanced degree, and after eight years, 100%. However, the 100% rate at eight years likely occurs only because the dataset contains just two individuals with eight years of experience.
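This can be verified with a quick frequency table of experience (a simple check; output omitted here):

# count how many individuals fall at each year of experience
table(salary$experience)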

Exploring Industry

indsalary <- ggplot(salary, aes(x=industry, y=salary, fill=industry)) + geom_boxplot()+ggtitle("Salary by Industry")+xlab("Industry")+ylab("Salary")+scale_y_continuous(labels=scales::dollar_format())+labs(fill = "Industry")
indbonus <- ggplot(salary, aes(x=industry, y=bonus, fill=industry)) + geom_boxplot()+ggtitle("Bonus by Industry")+xlab("Industry")+ylab("Bonus")+scale_y_continuous(labels=scales::dollar_format())+labs(fill = "Industry")
plot_grid(indsalary, indbonus, labels = "AUTO")

An analysis of salary and bonus by industry reveals that both are higher in the IT industry than in the Non-IT industry. However, in the IT industry both salary and bonus have a greater range than in the Non-IT industry. The Non-IT industry also has many more high-bonus outliers, meaning IT-level bonuses can still be received in the Non-IT industry.

ggplot() + geom_col(data = salary, aes(x="", y = frequency(gender), fill = gender), position = "fill")+ggtitle("Gender by Industry") + xlab("") + ylab("Composition Percentage")+labs(fill = "Gender")+scale_y_continuous(labels=scales::percent_format())+facet_grid(. ~ industry)+scale_fill_manual(values=c("#f06aa4", "#6ca0dc"))

An analysis of gender by industry reveals that both the IT and Non-IT industries are fairly evenly composed of males and females, although the IT industry does have roughly 5% more males than females. Gender would be a very poor predictor of industry in any predictive analysis.

Exploring Salary & Bonus

ggplot(salary, aes(x=salary, y=bonus, color=industry)) + geom_point()+ggtitle("Bonus by Salary")+xlab("Salary")+ylab("Bonus")+scale_y_continuous(labels=scales::dollar_format())+scale_x_continuous(labels=scales::dollar_format())+scale_color_manual(values=c("#FFCB05","#00274C"))+labs(color = "Industry")

An analysis of salary and bonus across both industries reveals that as salary increases, so too does bonus, at an increasing rate: the higher one’s salary, the larger one’s bonus, generally. However, as salary increases, the range of received bonuses also widens. It can also be seen that most individuals earning over $60,000 per year are in the IT industry, even after accounting for the fact that the dataset contains roughly 25% more IT individuals.

library(plyr)
mc <- ddply(salary, "industry", summarise, grp.mean=mean(salary))
a <- ggplot(salary, aes(x=salary, color=industry))+
  geom_histogram(aes(y=..density..), color="black", fill="grey")+geom_density(alpha=.2, fill=NA)+facet_grid(industry ~ .)+
  geom_vline(data=mc, aes(xintercept=grp.mean, color=industry),
             linetype="dashed")+scale_color_manual(values=c("orange", "blue"))+scale_x_continuous(labels=scales::dollar_format())+ggtitle("Salary Distribution by Industry")+xlab("Salary")+ylab("Density")+labs(color = "Salary")

mg <- ddply(salary, "industry", summarise, grp.mean=mean(bonus))
b <- ggplot(salary, aes(x=bonus, color=industry))+
  geom_histogram(aes(y=..density..), color="black", fill="grey")+geom_density(alpha=.2, fill=NA)+facet_grid(industry ~ .)+
  geom_vline(data=mg, aes(xintercept=grp.mean, color=industry),
             linetype="dashed")+scale_color_manual(values=c("orange", "blue"))+scale_x_continuous(labels=scales::dollar_format())+ggtitle("Bonus Distribution by Industry")+xlab("Bonus")+ylab("Density")+labs(color = "Bonus")
plot_grid(a, b, labels = "AUTO")

A distribution analysis of both salary and bonus by industry further confirms the insights from the previous chart. Salary in the IT industry is bimodally distributed, whereas salary in the Non-IT industry and bonus in both industries are unimodal. All distributions are right-skewed, since high salaries and bonuses are less common due to the cost they impose on the business.

mu <- ddply(salary, "gender", summarise, grp.mean=mean(salary))
c <- ggplot(salary, aes(x=salary, color=gender))+
  geom_histogram(aes(y=..density..), color="black", fill="grey")+geom_density(alpha=.2, fill=NA)+facet_grid(gender ~ .)+
  geom_vline(data=mu, aes(xintercept=grp.mean, color=gender),
             linetype="dashed")+scale_color_manual(values=c("#f06aa4", "#6ca0dc"))+scale_x_continuous(labels=scales::dollar_format())+ggtitle("Salary Distribution by Gender")+xlab("Salary")+ylab("Density")+labs(color = "Salary")

mb <- ddply(salary, "gender", summarise, grp.mean=mean(bonus))
d <- ggplot(salary, aes(x=bonus, color=gender))+
  geom_histogram(aes(y=..density..), color="black", fill="grey")+geom_density(alpha=.2, fill=NA)+facet_grid(gender ~ .)+
  geom_vline(data=mb, aes(xintercept=grp.mean, color=gender),
             linetype="dashed")+scale_color_manual(values=c("#f06aa4", "#6ca0dc"))+scale_x_continuous(labels=scales::dollar_format())+ggtitle("Bonus Distribution by Gender")+xlab("Bonus")+ylab("Density")+labs(color = "Bonus")
plot_grid(c, d, labels = "AUTO")

A distribution analysis of both salary and bonus by gender reveals that gender does not play any significant role in determining an individual’s salary or bonus (which is definitely a good thing). Males do earn very slightly more in both salary and bonus. Salary and bonus are similarly distributed across genders, and both are right-skewed.

Exploring Marriage Rates

ggplot(salary, aes(x=factor(""), stat="bin", fill=married)) + geom_bar(position="fill") +scale_y_continuous(labels = c("0", "25%", "50%", "75%", "100%")) +
  ggtitle("Marriage Rate by Industry") + xlab("") + ylab("") +
  facet_grid(facets=. ~ industry)+
  coord_polar(theta="y")+
  scale_fill_manual(values=c("#999999", "#E69F00"))+labs(fill = "Married")

marit<-table(salary$married[salary$industry=="IT"])
marnit<-table(salary$married[salary$industry=="Non-IT"])
round(prop.table(marit),digits=2)
## 
##   No  Yes 
## 0.43 0.57
round(prop.table(marnit),digits=2)
## 
##   No  Yes 
## 0.71 0.29

An analysis of marriage rate by industry reveals that 57% of individuals in the IT industry are married, compared to only 29% in the Non-IT industry. However, used alone, marital status would be a poor predictor of industry.

ggplot() + geom_col(data = salary, aes(x=gender, y = frequency(married), fill = married), position = "fill")+ggtitle("Marriage Rate by Gender") + xlab("Gender") + ylab("Marriage Rate")+labs(fill = "Married")+facet_grid(facets=. ~ industry)+scale_y_continuous(labels=scales::percent_format())

Evaluating marriage rates by gender again confirms that gender does not play a significant role in predicting other variables. Marriage rates for males and females are relatively similar in both the IT and Non-IT industries, aside from individuals in the IT industry being more likely to be married, as confirmed in the previous pie charts.

ggplot() + geom_col(data = salary, aes(x=experience, y = frequency(married), fill = married), position = "fill")+ggtitle("Marriage Rate by Years of Experience") + xlab("Years of Experience") + ylab("Marriage Rate")+labs(fill = "Married")+scale_y_continuous(labels=scales::percent_format())

Evaluating marriage rates by years of experience reveals that the likelihood an individual is married increases as they gain more years of experience, up until year seven, where marriage rates plateau at roughly 50%. This may be because only a small number of individuals in the sample have seven or more years of experience, making the analysis unrepresentative of actual marriage rates beyond that point. Or perhaps, after seven years of experience, individuals begin to get divorced.

ggplot(salary, aes(x=factor(married), y=salary, fill=married))+geom_boxplot()+ggtitle("Salary by Marital Status") + xlab("Married") + ylab("Salary")+scale_y_continuous(labels=scales::dollar_format())+labs(fill = "Married")

An analysis of salary by marital status shows that salary is likely to be higher for those who are married. However, this is rather misleading: the likelihood of being married increases with years of experience, so the higher salary is likely due to greater experience rather than to marriage itself.
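One quick way to check this confound (a sketch, not part of the original analysis) is to compare average years of experience across marital status:

# mean years of experience for married vs. unmarried individuals
aggregate(experience ~ married, data = salary, FUN = mean)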

Exploring Years of Experience

ggplot(salary, aes(x=factor(experience), y=salary))+geom_boxplot()+ggtitle("Salary by Years of Experience") + xlab("Years of Experience") + ylab("Salary")+scale_y_continuous(labels=scales::dollar_format())+facet_grid(facets=. ~ industry)

Generally, both salary and bonus increase as years of experience increase. This further supports the postulation that marital status does not itself influence salary; rather, salary is higher among married individuals simply because the likelihood of being married increases with years of experience. The ranges of salary for each year of experience vary much more widely in the IT industry than in the Non-IT industry.

Exploring Satisfaction Rate

salary$education <- factor(salary$education,levels = c("Undergrad", "Grad", "Advanced"))
ggplot() + geom_col(data = salary, aes(x=education, y = frequency(satisfied), fill = satisfied), position = "fill")+ggtitle("Satisfaction Rate by Education Level") + xlab("Education Level") + ylab("Satisfaction Rate")+labs(fill = "Satisfied")+facet_grid(facets=. ~ industry)+scale_y_continuous(labels=scales::percent_format())

An analysis of satisfaction rate by education level and industry reveals that in the IT industry, satisfaction increases as one’s education level increases. However, in the Non-IT industry, the opposite is true, as satisfaction decreases as one’s education level increases.

ggplot() + geom_col(data = salary, aes(x=experience, y = frequency(satisfied), fill = satisfied), position = "fill")+ggtitle("Satisfaction Rate by Years of Experience") + xlab("Years of Experience") + ylab("Satisfaction Rate")+labs(fill = "Satisfied")+scale_y_continuous(labels=scales::percent_format())

No consistent trend can be derived from the graph above, implying that the years of experience one has do not influence their satisfaction.

ggplot() + geom_col(data = salary, aes(x=married, y = frequency(satisfied), fill = satisfied), position = "fill")+ggtitle("Satisfaction Rate by Marital Status") + xlab("Married") + ylab("Satisfaction Rate")+labs(fill = "Satisfied")+scale_y_continuous(labels=scales::percent_format())

Those who are married have a higher likelihood of being satisfied. Those who are not married are likely to not be satisfied.

ggplot() + geom_col(data = salary, aes(x=gender, y = frequency(satisfied), fill = satisfied), position = "fill")+ggtitle("Satisfaction Rate by Gender") + xlab("Gender") + ylab("Satisfaction Rate")+labs(fill = "Satisfied")+scale_y_continuous(labels=scales::percent_format())+facet_grid(facets=. ~ industry)

Males in both the IT industry and Non-IT industry are marginally more likely to be satisfied compared to females. Both males and females are not likely to be satisfied in the Non-IT industry.

e <- ggplot(salary, aes(x=satisfied, y=salary, fill=satisfied)) + geom_boxplot()+ggtitle("Salary by Job Satisfaction")+xlab("Satisfied")+ylab("Salary")+scale_y_continuous(labels=scales::dollar_format())+labs(fill = "Satisfied")
f <- ggplot(salary, aes(x=satisfied, y=bonus, fill=satisfied)) + geom_boxplot()+ggtitle("Bonus by Job Satisfaction")+xlab("Satisfied")+ylab("Bonus")+scale_y_continuous(labels=scales::dollar_format())+labs(fill = "Satisfied")
plot_grid(e, f, labels = "AUTO")

While there is a very large range for all observed metrics in the above plot, those who are satisfied generally have higher salaries and bonuses. It is likely that individuals are satisfied because they feel adequately compensated for their work, rather than being compensated based on whether they are satisfied.

Exploring Certifications

ggplot() + geom_col(data = salary, aes(x="", y = frequency(certification), fill = factor(certification)), position = position_fill(reverse = TRUE))+ggtitle("# of Certifications Attained Composition by Industry") + xlab("") + ylab("Composition Percentage")+labs(fill = "# of Certifications Attained")+facet_grid(. ~ industry)+scale_y_continuous(labels=scales::percent_format())

An analysis of the composition of the number of certifications attained, by industry, exhibits that those in the Non-IT industry are most likely to have zero or only one certification. Over half of those in the IT industry hold two or more certifications, though the single most common count is one.

ggplot(salary, aes(x=factor(certification), y=salary))+geom_boxplot()+ggtitle("Salary by Number of Certifications Attained") + xlab("Number of Certifications Attained") + ylab("Salary")+scale_y_continuous(labels=scales::dollar_format())+facet_grid(facets=. ~ industry)

In both the IT and Non-IT industries, salary appears to be highly correlated with the number of certifications attained by an individual. However, salary rises more steeply with the number of certifications in the IT industry.

ggplot(salary, aes(x=factor(certification), y=bonus))+geom_boxplot()+ggtitle("Bonus by Number of Certifications Attained") + xlab("Number of Certifications Attained") + ylab("Bonus")+scale_y_continuous(labels=scales::dollar_format())+facet_grid(facets=. ~ industry)

In both the IT and Non-IT industries, bonus appears to grow exponentially with the number of certifications attained by an individual. However, a large bonus is less likely to be given on the basis of certifications in the Non-IT industry.


Predictive Data Analysis

salary <- read_excel("C:/Users/cbado/OneDrive/Documents/R/SalaryData.xlsx")
out <- boxplot.stats(salary$experience)$out
out_ind <- which(salary$experience %in% c(out))
salary[out_ind, 7] = NA
library(tidyr)
salary<-drop_na(salary)
salary$id <- NULL

This step re-loads the original “SalaryData” dataset, which is then re-preprocessed. The reason is that it is easier to re-load and keep variables in their original integer form rather than the factor form many attributes were given in the previous section. Admittedly this is a crude approach, but it yields the proper resulting data nonetheless.

Data Partitioning

dt = sort(sample(nrow(salary), nrow(salary)*.7))
train<-salary[dt,]
test<-salary[-dt,]

This step partitions the data into a training dataset and a testing dataset. The training dataset comprises 70% of the observed instances, while the testing dataset comprises the remaining 30%. The sampling is done without replacement.
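Note that sample() draws randomly (and without replacement by default), so the exact partition, and therefore all downstream accuracy figures, will vary from run to run. A minimal variant that fixes the seed for reproducibility (the seed value 123 is arbitrary):

set.seed(123)                                   # make the random draw repeatable
dt <- sort(sample(nrow(salary), nrow(salary) * 0.7))
train <- salary[dt, ]
test <- salary[-dt, ]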

Predicting Gender

A logistic regression model is well suited to predicting gender, since gender is a binary class variable.

gen <- glm(gender ~ ., data = train, family = "binomial")
gen
## 
## Call:  glm(formula = gender ~ ., family = "binomial", data = train)
## 
## Coefficients:
##   (Intercept)       industry      education      satisfied        married  
##    1.04155468    -0.02764653     0.11026936    -0.09580969     0.16969102  
##    experience  certification          bonus         salary  
##   -0.00605253    -0.59886184     0.00126216    -0.00002218  
## 
## Degrees of Freedom: 133 Total (i.e. Null);  125 Residual
## Null Deviance:       185.6 
## Residual Deviance: 180.9     AIC: 198.9
ll.null <- gen$null.deviance/-2
ll.proposed <- gen$deviance/-2
(ll.null - ll.proposed) / ll.null
1 - pchisq(2*(ll.proposed - ll.null), df=(length(gen$coefficients)-1))
## [1] 0.7890994

The value printed above, 0.789, is the p-value from a chi-square test comparing the model against the null model; since it is far above 0.05, the predictors do not significantly improve on the null model. In any case, in-sample fit alone is not sufficient for evaluating how well the model predicts new input data, as a high fit could reflect overfitting to the training data.
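For reference, McFadden’s pseudo-R^2 can be computed directly from the deviances printed above (a small sketch using the fitted model object):

# McFadden's pseudo-R^2: 1 - (residual deviance / null deviance)
1 - (gen$deviance / gen$null.deviance)
# roughly 1 - 180.9/185.6, i.e. about 0.025 -- a very weak in-sample fit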

predictions <- predict(gen, test, type="response")
prediction.rd <- ifelse(predictions > 0.5, 1, 0)

table(prediction.rd, test$gender)
##              
## prediction.rd  0  1
##             0 20 19
##             1 10  9
accuracy <- table(prediction.rd, test$gender)

sum(diag(accuracy))/sum(accuracy)
## [1] 0.5

This model has an accuracy rate of only 50% when fed testing data, no better than a coin flip.

predicted.gen.data <- data.frame(probability.of.gender=gen$fitted.values,gender=train$gender)
predicted.gen.data <- predicted.gen.data[order(predicted.gen.data$probability.of.gender, decreasing = FALSE),]
predicted.gen.data$rank <- 1:nrow(predicted.gen.data)
ggplot(data = predicted.gen.data,aes(x=rank, y=probability.of.gender))+geom_point(aes(colour=gender), alpha=1, shape=4, stroke=2)+ggtitle("Predicting Gender")+xlab("Index")+ylab("Prob. of Gender")+labs(color = "Gender")+scale_colour_gradient(low = "#CC6594",high = "#347DC1")

Based on the 50% accuracy rate when predicting gender from new input data, as well as the visualization displayed in the above plot, gender cannot be accurately predicted.

Predicting Industry

A logistic regression model is well suited to predicting industry, since industry is a binary class variable.

indu <- glm(industry ~ ., data = train, family = "binomial")
indu
## 
## Call:  glm(formula = industry ~ ., family = "binomial", data = train)
## 
## Coefficients:
##   (Intercept)         gender      education      satisfied        married  
##   -0.49159288    -0.06268086     0.21882641     1.00422943     0.32428259  
##    experience  certification          bonus         salary  
##    0.10187521    -0.32169114     0.00178695    -0.00001743  
## 
## Degrees of Freedom: 133 Total (i.e. Null);  125 Residual
## Null Deviance:       182.8 
## Residual Deviance: 156.6     AIC: 174.6
ll.null <- indu$null.deviance/-2
ll.proposed <- indu$deviance/-2
(ll.null - ll.proposed) / ll.null
1 - pchisq(2*(ll.proposed - ll.null), df=(length(indu$coefficients)-1))
## [1] 0.0009719201

The value printed above, 0.00097, is the p-value against the null model; since it is well below 0.05, the model is a statistically significant improvement over the null model.

predictions <- predict(indu, test, type="response")
prediction.rd <- ifelse(predictions > 0.5, 1, 0)

table(prediction.rd, test$industry)
##              
## prediction.rd  0  1
##             0 12  9
##             1 16 21
accuracy <- table(prediction.rd, test$industry)

sum(diag(accuracy))/sum(accuracy)
## [1] 0.5689655

This model has an accuracy rate of 57% when fed testing data, which is only modestly better than chance.

predicted.indu.data <- data.frame(probability.of.indu=indu$fitted.values,industry=train$industry)
predicted.indu.data <- predicted.indu.data[order(predicted.indu.data$probability.of.indu, decreasing = FALSE),]
predicted.indu.data$rank <- 1:nrow(predicted.indu.data)
ggplot(data = predicted.indu.data,aes(x=rank, y=probability.of.indu))+geom_point(aes(colour=industry), alpha=1, shape=4, stroke=2)+ggtitle("Predicting Industry")+xlab("Index")+ylab("Prob. of Industry")+labs(color = "Industry")+scale_colour_gradient(low = "#FFCB05",high = "#00274C")

Based on the 57% accuracy rate when predicting industry from new input data, as well as the visualization displayed in the above plot, the generated model does only a fair job of predicting industry.

Predicting Education

A k-nearest neighbors (KNN) algorithm is well suited to predicting education level, since it is a supervised machine learning algorithm that handles classification problems, and education level is a class variable.

library(class)
educ <- knn(train,test,cl=train$education,k=11)

tab <- table(educ,test$education)
tab
##     
## educ  1  2  3
##    1  8  5  1
##    2  7 17  5
##    3  3  5  7
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(tab)
## [1] 55.17241

Based on the 55.2% accuracy rate when predicting education level from new input data, this model cannot accurately predict education level, though it is better than randomly selecting one of the three education levels (a 33% baseline).
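Because KNN is distance-based, features on large scales (such as salary) dominate the distance calculation, and in the call above the target column itself is part of the feature matrix. A refinement worth trying (a sketch, not applied in the results above) standardizes the features and excludes the target:

# standardize features so salary does not dominate distances, and drop the
# target column from the feature matrix before running KNN
feats_train <- scale(train[, setdiff(names(train), "education")])
feats_test  <- scale(test[, setdiff(names(test), "education")],
                     center = attr(feats_train, "scaled:center"),
                     scale  = attr(feats_train, "scaled:scale"))
educ_scaled <- knn(feats_train, feats_test, cl = train$education, k = 11)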

Predicting Satisfaction

A logistic regression model is well suited to predicting satisfaction, since satisfaction is a binary class variable.

sat <- glm(satisfied ~ industry+salary+married, data = train, family = "binomial")
sat
## 
## Call:  glm(formula = satisfied ~ industry + salary + married, family = "binomial", 
##     data = train)
## 
## Coefficients:
## (Intercept)     industry       salary      married  
## -1.59933214   0.94476212   0.00001257   0.54212355  
## 
## Degrees of Freedom: 133 Total (i.e. Null);  130 Residual
## Null Deviance:       185.5 
## Residual Deviance: 170.1     AIC: 178.1
ll.null <- sat$null.deviance/-2
ll.proposed <- sat$deviance/-2
(ll.null - ll.proposed) / ll.null
1 - pchisq(2*(ll.proposed - ll.null), df=(length(sat$coefficients)-1))
## [1] 0.001478168

The value printed above, 0.0015, is the p-value against the null model, indicating the selected predictors are jointly significant.

predictions <- predict(sat, test, type="response")
prediction.rd <- ifelse(predictions > 0.5, 1, 0)

table(prediction.rd, test$satisfied)
##              
## prediction.rd  0  1
##             0 18 17
##             1 13 10
accuracy <- table(prediction.rd, test$satisfied)

sum(diag(accuracy))/sum(accuracy)
## [1] 0.4827586

This model has an accuracy rate of 48% when fed testing data, which is no better than randomly tossing a coin to predict the outcome.

predicted.sat.data <- data.frame(probability.of.satisfied=sat$fitted.values,satisfied=train$satisfied)
predicted.sat.data <- predicted.sat.data[order(predicted.sat.data$probability.of.satisfied, decreasing = FALSE),]
predicted.sat.data$rank <- 1:nrow(predicted.sat.data)
ggplot(data = predicted.sat.data,aes(x=rank, y=probability.of.satisfied))+geom_point(aes(colour=satisfied), alpha=1, shape=4, stroke=2)+ggtitle("Predicting Satisfaction based on Industry, Salary, & Marital Status")+xlab("Index")+ylab("Prob. of Satisfied")+labs(color = "Satisfied")+scale_colour_gradient(low = "#FFCB05",high = "#00274C")

Based on the 48% accuracy rate when predicting satisfaction from new input data, as well as the visualization displayed in the above plot, the generated model cannot reliably predict satisfaction.

Predicting Marital Status

A logistic regression model is well suited to predicting marital status, since marital status is a binary class variable.

marr <- glm(married ~ salary+industry, data = train, family = "binomial")
marr
## 
## Call:  glm(formula = married ~ salary + industry, family = "binomial", 
##     data = train)
## 
## Coefficients:
## (Intercept)       salary     industry  
## -3.76362284   0.00005759   0.49448835  
## 
## Degrees of Freedom: 133 Total (i.e. Null);  131 Residual
## Null Deviance:       184.3 
## Residual Deviance: 153.5     AIC: 159.5
ll.null <- marr$null.deviance/-2
ll.proposed <- marr$deviance/-2
(ll.null - ll.proposed) / ll.null
1 - pchisq(2*(ll.proposed - ll.null), df=(length(marr$coefficients)-1))
## [1] 0.0000002073406

The value printed above, roughly 0.0000002, is the p-value against the null model; it is extremely small, indicating that salary and industry are jointly significant predictors of marital status.

predictions <- predict(marr, test, type="response")
prediction.rd <- ifelse(predictions > 0.5, 1, 0)

table(prediction.rd, test$married)
##              
## prediction.rd  0  1
##             0 28 10
##             1  4 16
accuracy <- table(prediction.rd, test$married)

sum(diag(accuracy))/sum(accuracy)
## [1] 0.7586207

This model has an accuracy rate of 76% when fed testing data, which is fairly good.

predicted.marr.data <- data.frame(probability.of.married=marr$fitted.values,married=train$married)
predicted.marr.data <- predicted.marr.data[order(predicted.marr.data$probability.of.married, decreasing = FALSE),]
predicted.marr.data$rank <- 1:nrow(predicted.marr.data)
ggplot(data = predicted.marr.data,aes(x=rank, y=probability.of.married))+geom_point(aes(colour=married), alpha=1, shape=4, stroke=2)+ggtitle("Predicting Marital Status based on Salary & Industry ")+xlab("Index")+ylab("Prob. of Married")+labs(color = "Married")+scale_colour_gradient(low = "#FFCB05",high = "#00274C")

Based on the 76% accuracy rate when predicting marital status from new input data, as well as the visualization displayed in the above plot, the generated model does a fair job of predicting marital status.

Predicting Certification

A k-nearest neighbors (KNN) algorithm is well suited to predicting the number of certifications attained, since it is a supervised machine learning algorithm that handles classification problems, and the number of certifications is a class variable.

certi <- knn(train,test,cl=train$certification,k=12)

tab <- table(certi,test$certification)
tab
##      
## certi  0  1  2  3  4
##     0  5  5  1  0  0
##     1  9  8  5  0  0
##     2  0  7 11  1  1
##     3  0  1  1  2  1
##     4  0  0  0  0  0
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(tab)
## [1] 44.82759

Based on the 44.8% accuracy rate when predicting the number of certifications attained from new input data, this model cannot accurately predict the number of certifications, though it is better than randomly selecting one of the five classes (a 20% baseline).

Predicting Bonus

A linear regression model is well suited to predicting bonus, since bonus is a continuous numerical variable. The log of bonus was used to fit the model, since a non-linear relationship exists between bonus and the predictive attributes; modeling log(bonus) allows a linear model to capture that relationship.

logbonus <-log(train$bonus)
bonfit = lm(logbonus~certification+salary+experience, data=train)
summary(bonfit)
## 
## Call:
## lm(formula = logbonus ~ certification + salary + experience, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.19638 -0.10474  0.05824  0.21090  0.74390 
## 
## Coefficients:
##                  Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)   3.250664082 0.121362255  26.785 < 0.0000000000000002 ***
## certification 0.548236288 0.043725929  12.538 < 0.0000000000000002 ***
## salary        0.000023520 0.000003193   7.365      0.0000000000181 ***
## experience    0.124517321 0.027926289   4.459      0.0000176301755 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3418 on 130 degrees of freedom
## Multiple R-squared:  0.9209, Adjusted R-squared:  0.919 
## F-statistic: 504.3 on 3 and 130 DF,  p-value: < 0.00000000000000022

All variables have p-values of less than 0.05, meaning they are statistically significant and contribute properly to the model. The generated model has an adjusted R^2 value of 0.919, which is very good.

predbon <- predict(bonfit, test)

actuals_preds <- data.frame(cbind(actuals=test$bonus, predicteds=predbon))
cor(actuals_preds)
##              actuals predicteds
## actuals    1.0000000  0.9151683
## predicteds 0.9151683  1.0000000

The correlation between actual and predicted values on the testing data is 0.915, which is very good. Based on this correlation, as well as the high adjusted R^2 value, the generated model does a very good job of predicting bonus.
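One caveat: because the model was fit to log(bonus), predict() returns values on the log scale, so the correlation above compares log-scale predictions with raw bonuses. A sketch of converting the predictions back to dollars (assuming the objects above are still in memory):

# exponentiate to return from the log scale to dollars; adding half the
# residual variance corrects for the standard log-normal retransformation bias
sigma2 <- summary(bonfit)$sigma^2
predbon_dollars <- exp(predbon + sigma2 / 2)
head(data.frame(actual = test$bonus, predicted = round(predbon_dollars, 2)))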

Predicting Salary

A linear regression model is well suited to predicting salary, since salary is a continuous numerical variable. In this model, bonus cannot be used as a predictor variable: if the model is used to determine the proper salary for an incoming employee, they will not yet have received a bonus at the company, and their bonus will be partly dependent on their salary.

salfit = lm(salary~education+married+experience, data=train)
summary(salfit)
## 
## Call:
## lm(formula = salary ~ education + married + experience, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22502.4  -6189.4    556.7   4935.2  27909.7 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  26524.8     2290.4  11.581 < 0.0000000000000002 ***
## education     4517.1     1138.1   3.969             0.000119 ***
## married       4461.9     1736.0   2.570             0.011291 *  
## experience    6335.6      441.6  14.347 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9141 on 130 degrees of freedom
## Multiple R-squared:   0.74,  Adjusted R-squared:  0.734 
## F-statistic: 123.3 on 3 and 130 DF,  p-value: < 0.00000000000000022

All variables have p-values of less than 0.05, meaning they are statistically significant and contribute properly to the model. The generated model has an adjusted R^2 value of 0.734, which is good.

predsal <- predict(salfit, test)

actuals_preds <- data.frame(cbind(actuals=test$salary, predicteds=predsal))
cor(actuals_preds)
##              actuals predicteds
## actuals    1.0000000  0.8528089
## predicteds 0.8528089  1.0000000

The correlation between actual and predicted values on the testing data is 0.853, which is good. Based on this correlation, as well as the solid adjusted R^2 value, the generated model does a good job of predicting salary.

Recommendations

It is recommended that only the models for industry, marital status, bonus, and salary be used for predictive use cases. In a business application, it is unlikely that predictive models for industry or marital status would be of any use to a company. However, a predictive model for salary can be extremely helpful for tasks such as determining the proper salary to offer an incoming employee based on factors like education, experience, and certifications, as illustrated in the sketch below. A predictive model for bonus can be similarly helpful for determining the proper bonus to give an existing employee.
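As an illustration of the salary use case, a hypothetical incoming hire could be scored with the fitted model (the input values below are invented for the example; codes follow the original integer scheme):

# hypothetical candidate: grad degree (education = 2), married (1),
# four years of experience
new_hire <- data.frame(education = 2, married = 1, experience = 4)
predict(salfit, new_hire, interval = "prediction")  # point estimate plus a range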

This concludes the report. Thank you.