Nowadays, when determining an employee’s salary, many different components are taken into consideration. Performing this analysis on the provided dataset allows us to understand whether any of the X variables, also known as predictors, have an effect on the Y dependent variable, salary.
setwd("C:/Users/12267/Desktop/UWindsor/Winter 2021/MSCI 3230 Data Science Tools & Methods/RSTUDIO Work")
library(ggplot2)
library(gplots)
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(reshape)
library (tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.4
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v tibble 3.0.5 v dplyr 1.0.3
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## v purrr 0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x tidyr::expand() masks reshape::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::rename() masks reshape::rename()
library (Hmisc)
## Warning: package 'Hmisc' was built under R version 4.0.4
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.0.4
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(leaps)
## Warning: package 'leaps' was built under R version 4.0.4
library(caret)
## Warning: package 'caret' was built under R version 4.0.4
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
## The following object is masked from 'package:purrr':
##
## lift
library(gains)
library(pROC)
## Warning: package 'pROC' was built under R version 4.0.4
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
All of the libraries required for the analysis are loaded above.
#### Read the data
df <- read.csv("data/SalaryData.csv")
t(t(names(df))) ##display column names in a user friendly manner
## [,1]
## [1,] "id"
## [2,] "gender"
## [3,] "industry"
## [4,] "education"
## [5,] "satisfied"
## [6,] "married"
## [7,] "experience"
## [8,] "certification"
## [9,] "bonus"
## [10,] "salary"
## [11,] "actual.cat"
#change to lowercase if so desired
names(df) <- tolower(names(df))
Read the data and display the column names within the dataset.
#Handling missing data
#Check which columns have missing values
summary(df)
## id gender industry education
## Min. : 1.00 Min. :0.000 Min. :0.0000 Min. :1.000
## 1st Qu.: 50.75 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:1.000
## Median :100.50 Median :0.000 Median :1.0000 Median :2.000
## Mean :100.50 Mean :0.475 Mean :0.5528 Mean :1.904
## 3rd Qu.:150.25 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:2.000
## Max. :200.00 Max. :1.000 Max. :1.0000 Max. :3.000
## NA's :1 NA's :2
## satisfied married experience certification
## Min. :0.0000 Min. :0.000 Min. :1.000 Min. :0.0
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.0
## Median :0.0000 Median :0.000 Median :3.000 Median :1.0
## Mean :0.4798 Mean :0.465 Mean :3.205 Mean :1.3
## 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:5.000 3rd Qu.:2.0
## Max. :1.0000 Max. :1.000 Max. :8.000 Max. :4.0
## NA's :2
## bonus salary actual.cat
## Min. : 27.0 Min. : 26865 Min. :0.000
## 1st Qu.: 128.5 1st Qu.: 43931 1st Qu.:0.000
## Median : 284.5 Median : 54048 Median :0.000
## Mean : 573.7 Mean : 57565 Mean :0.445
## 3rd Qu.: 855.2 3rd Qu.: 68313 3rd Qu.:1.000
## Max. :3080.0 Max. :102242 Max. :1.000
## NA's :2
#Replace missing values with the column median in the industry, education,
#satisfied, and bonus columns
df$industry  <- impute(df$industry, median)
df$education <- impute(df$education, median)
df$satisfied <- impute(df$satisfied, median)
df$bonus     <- impute(df$bonus, median)
#final check to make sure there are no more NA in dataset
summary(df)
##
## 1 values imputed to 1
##
##
## 2 values imputed to 2
##
##
## 2 values imputed to 0
##
##
## 2 values imputed to 284.5
## id gender industry education
## Min. : 1.00 Min. :0.000 Min. :0.000 Min. :1.000
## 1st Qu.: 50.75 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:1.000
## Median :100.50 Median :0.000 Median :1.000 Median :2.000
## Mean :100.50 Mean :0.475 Mean :0.555 Mean :1.905
## 3rd Qu.:150.25 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:2.000
## Max. :200.00 Max. :1.000 Max. :1.000 Max. :3.000
## satisfied married experience certification bonus
## Min. :0.000 Min. :0.000 Min. :1.000 Min. :0.0 Min. : 27.0
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.0 1st Qu.: 129.5
## Median :0.000 Median :0.000 Median :3.000 Median :1.0 Median : 284.5
## Mean :0.475 Mean :0.465 Mean :3.205 Mean :1.3 Mean : 570.9
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:5.000 3rd Qu.:2.0 3rd Qu.: 853.8
## Max. :1.000 Max. :1.000 Max. :8.000 Max. :4.0 Max. :3080.0
## salary actual.cat
## Min. : 26865 Min. :0.000
## 1st Qu.: 43931 1st Qu.:0.000
## Median : 54048 Median :0.000
## Mean : 57565 Mean :0.445
## 3rd Qu.: 68313 3rd Qu.:1.000
## Max. :102242 Max. :1.000
For this boxplot we are interested in the stand-alone distribution of salary, not its distribution across any X variable. Since the minimum value is $26,865 and the maximum is $102,242, we set the displayed range to between 20,000 and 120,000. We note that the salary column does not contain any outliers, and that the average salary is roughly $57,500.
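The chunk that produced that stand-alone boxplot is not shown above; a minimal sketch of how it could be generated with ggplot2, using the 20,000–120,000 range described in the text, might look like the following (the empty-string x aesthetic is just a placeholder so a single box is drawn).
#Stand-alone boxplot of salary, no grouping variable
ggplot(df, aes(x = "", y = salary)) +
  geom_boxplot(fill = "lightblue", outlier.color = "red") +
  coord_cartesian(ylim = c(20000, 120000)) +
  labs(title = "Distribution of Salary", x = "", y = "Salary")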
ggplot(df, aes(x = as.factor(gender), y = salary, fill = factor(gender))) +
  geom_boxplot(outlier.color = "red", outlier.shape = "o") +
  labs(title = "Salary vs. Gender", x = "Gender", y = "Salary")

First, we examine the relationship between salary and bonus. The scatter plot shows a positive, upward trend between the two columns: as an employee’s bonus increases, their salary also increases.
ggplot(df, aes(x = bonus, y = salary)) + geom_point(aes(colour = factor(married)))
## Don't know how to automatically pick scale for object of type impute. Defaulting to continuous.

This gives us an understanding of whether being married holds significance for salary and bonus. Indeed, being married appears to have an impact on the bonus, which in turn affects salary. For example, the majority of class “1”, which represents married employees, falls in the higher range of bonuses, which then leads to a higher salary. We conclude that married employees tend to receive larger bonuses, which leads to higher salaries.
ggplot(df, aes(x = bonus, y = salary)) + geom_point(aes(colour = factor(education)))
## Don't know how to automatically pick scale for object of type impute. Defaulting to continuous.

As noted in the scatter plot above, as bonus increases, salary increases. This graph adds the factor of education, giving us an understanding of the level of education each employee has completed. Indeed, education has an impact on the bonus. For example, the majority of class “1”, which represents employees who completed up to the undergraduate level, falls in the lower range of bonuses, which then leads to a lower salary. We conclude that the higher the level of education completed, the greater the bonus, which leads to an increase in salary.
ggplot(df, aes(x = bonus, y = salary)) + geom_point(aes(colour = factor(satisfied)))
## Don't know how to automatically pick scale for object of type impute. Defaulting to continuous.

This graph plots bonus against salary as a factor of satisfaction, giving us an understanding of whether the employee is satisfied or not. Unlike the factor of education in the graph above, there is no clear range for satisfaction, meaning that satisfaction does not appear to impact salary. For example, an employee can be unsatisfied with a large bonus and a high salary, or satisfied with a large bonus and a high salary.
##bar charts
##education column
ggplot(df, aes(x = education, fill = factor(education))) +
geom_bar()
## Don't know how to automatically pick scale for object of type impute. Defaulting to continuous.

ggplot(df, aes(x = education, y = salary, fill = as.factor(education))) +
geom_bar(stat = "summary", fun = "mean")
## Don't know how to automatically pick scale for object of type impute. Defaulting to continuous.

Here we want to understand how many employees completed undergraduate, graduate, or advanced education, so we first show the count of each level. The output indicates that the majority of employees have completed a graduate-level education. We then plot salary against education and get the same outcome as the scatter plot: the higher the level of education completed, the greater the salary. Computing the aggregate function gives an average salary of $50,777 for employees with an undergraduate education, $55,993 for a graduate-level education, and $70,397 for an advanced education.
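The aggregate computation quoted above is not shown in this section; a minimal sketch of how those averages could be obtained with base R’s aggregate() (assuming the education codes 1, 2, 3 map to undergraduate, graduate, and advanced):
#Mean salary for each education level (1 = undergraduate, 2 = graduate, 3 = advanced)
aggregate(salary ~ education, data = df, FUN = mean)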
###Industry Column
ggplot(df, aes(x = factor(industry), fill = factor(industry))) +
geom_bar()

ggplot(df, aes(x = factor(industry), y = salary, fill = factor(industry))) +
geom_bar(stat = "summary", fun = "mean")

Here we want to determine how many employees are in the IT industry and whether that affects their salary. “1” represents employees within the IT industry and “0” represents those who are not. We conclude that roughly 20 more employees are in the IT industry, and that those in the IT industry tend to have a higher salary.
####Gender column
ggplot(df, aes(x = factor(gender), fill = factor(gender))) +
geom_bar()

ggplot(df, aes(x = factor(gender), y = salary, fill = factor(gender))) +
geom_bar(stat = "summary", fun = "mean")

Here we want to determine how many employees are male or female and whether gender affects their salary. “1” represents female and “0” represents male. We conclude that there is only a slight difference in salary between females and males, with men tending to earn a bit more.
####satisfied column
ggplot(df, aes(x = factor(satisfied), fill = factor(satisfied))) +
geom_bar()

ggplot(df, aes(x = factor(satisfied), y = salary, fill = factor(satisfied))) +
geom_bar(stat = "summary", fun = "mean")

#####Certification
ggplot(df, aes(x = factor(certification), fill = factor(certification))) +
geom_bar()

ggplot(df, aes(x = certification, y = salary, fill = as.factor(certification))) +
geom_bar(stat = "summary", fun = "mean")

Here we want to determine the number of certifications employees have attained and analyze its effect on salary. The majority of employees have 1 certification, while only a few have 4. In terms of salary, those with a greater number of certifications tend to have a higher salary.
##histogram
ggplot(df, aes(x= salary)) + geom_histogram( fill = "red", col = "blue", binwidth = 6500)

This histogram indicates that the salary column is somewhat right-skewed.
## heatmap with values
t(t(names(df)))
## [,1]
## [1,] "id"
## [2,] "gender"
## [3,] "industry"
## [4,] "education"
## [5,] "satisfied"
## [6,] "married"
## [7,] "experience"
## [8,] "certification"
## [9,] "bonus"
## [10,] "salary"
## [11,] "actual.cat"
df1 <- df[ , c("gender", "industry", "education", "satisfied", "married", "experience", "certification", "bonus", "salary")]
heatmap.2(cor(df1), Rowv = FALSE, Colv = FALSE, dendrogram = "none",
cellnote = round(cor(df1),2),
notecol = "black", key = FALSE, trace = 'none', margins = c(10,10))

The heatmap presents the correlations between the variables in the dataset. We include all variables except the “id” column, as it holds no significance. The higher correlations are found among experience, certification, bonus, and salary.
########### PREDICTIVE ANALYSIS ##################
########## Linear Regression Chapter 6 ############
##not the proper approach (no training/validation split yet)
##regression with all predictors included
reg1 <- lm(salary ~ .-id -actual.cat, data = df)
# use options() to ensure numbers are not displayed in scientific notation.
options(scipen = 999)
summary(reg1)
##
## Call:
## lm(formula = salary ~ . - id - actual.cat, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20148.1 -5382.9 -34.6 4454.8 21358.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32645.519 2168.598 15.054 < 0.0000000000000002 ***
## gender -1615.713 1191.197 -1.356 0.17658
## industry 1100.772 1271.338 0.866 0.38767
## education 2710.258 878.694 3.084 0.00234 **
## satisfied -63.722 1193.212 -0.053 0.95747
## married 5393.470 1300.664 4.147 0.000050694764 ***
## experience 3611.859 537.007 6.726 0.000000000197 ***
## certification -585.682 1100.134 -0.532 0.59509
## bonus 11.597 2.243 5.169 0.000000589750 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8309 on 191 degrees of freedom
## Multiple R-squared: 0.7919, Adjusted R-squared: 0.7831
## F-statistic: 90.83 on 8 and 191 DF, p-value: < 0.00000000000000022
For our first regression, we include all variables in the dataset except the “id” column, as it holds no significance. We run a regression with salary as the Y variable, predicting it using all other variables, and consider the significance of each predictor. We notice that predictors such as gender, industry, satisfied, and certification are not significant in predicting salary, while education, married, experience, and bonus are significant to some extent. We get a multiple R-squared of 79.2% and an adjusted R-squared of 78.3%, meaning that about 78.3% of the variation in the Y column, salary, can be explained by the variation in these variables. Even though the model is good, it still includes insignificant variables.
set.seed(1) # set seed for reproducing the partition
train.rows <- sample(1:nrow(df), 120)
train.df <- df[train.rows, ]
train.df <- as.data.frame(train.df)#add on
valid.df <- df[-train.rows, ]
valid.df <- as.data.frame(valid.df)
reg2 <- lm(salary ~ .-id -actual.cat, data = train.df) #added-id
summary(reg2)
##
## Call:
## lm(formula = salary ~ . - id - actual.cat, data = train.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16924.9 -4372.8 -284.8 3939.4 19674.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32386.197 2622.134 12.351 < 0.0000000000000002 ***
## gender -915.488 1420.029 -0.645 0.5205
## industry 291.169 1539.388 0.189 0.8503
## education 2346.184 1062.262 2.209 0.0293 *
## satisfied 1348.118 1418.714 0.950 0.3441
## married 3700.025 1533.352 2.413 0.0175 *
## experience 4205.456 690.644 6.089 0.0000000166 ***
## certification -1355.223 1338.397 -1.013 0.3135
## bonus 12.008 2.928 4.102 0.0000784388 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7660 on 111 degrees of freedom
## Multiple R-squared: 0.8102, Adjusted R-squared: 0.7966
## F-statistic: 59.25 on 8 and 111 DF, p-value: < 0.00000000000000022
The model fit on the training data frame shows the same significant columns as reg1; the only slight change is in the coefficient estimates. Overall, the two models are consistent with each other. This model has a multiple R-squared of 81% and an adjusted R-squared of 80%.
# use predict() to make predictions on a new set.
lm.pred <- predict(reg2, valid.df)
options(scipen=999, digits = 0)
some.residuals <- valid.df$salary[1:20] - lm.pred[1:20]
data.frame("Actual" = valid.df$salary[1:20],"Predicted" = lm.pred[1:20],
"Residual" = some.residuals)
## Actual Predicted Residual
## 2 97790 75398 22392
## 3 96250 103503 -7253
## 4 94962 90101 4861
## 5 94716 87469 7247
## 6 94672 81992 12680
## 8 93744 72307 21437
## 9 92925 70998 21927
## 11 91377 76389 14988
## 12 91300 73716 17584
## 16 89112 77580 11532
## 18 87210 80708 6502
## 19 87152 72992 14160
## 27 80840 95231 -14391
## 28 80510 98555 -18045
## 30 79299 102318 -23019
## 32 77868 73122 4746
## 36 76632 74750 1882
## 38 75600 67932 7668
## 41 74121 69211 4910
## 46 71550 64604 6946
options(scipen=999, digits = 3)
We sample the first 20 rows of the validation set to examine the residuals. For example, in row 4 the actual salary of the employee is $94,962 and the predicted salary is $90,101, so the model underestimated the actual salary by $4,861.
# use accuracy() to compute common accuracy measures.
t(accuracy(lm.pred, valid.df$salary))
## Test set
## ME 124.5
## RMSE 9378.4
## MAE 7468.8
## MPE -2.3
## MAPE 12.8
We want to assess the accuracy of the model. The most common accuracy measure is the root mean squared error (RMSE), which here is 9,378.4.
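For reference, the RMSE reported by accuracy() can also be computed directly from the validation residuals; a minimal sketch:
#RMSE = square root of the mean squared prediction error on the validation set
sqrt(mean((valid.df$salary - lm.pred)^2))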
all.residuals <- valid.df$salary - lm.pred
df2 <- data.frame(all.residuals, lm.pred)
h1 <- ggplot(df2, aes(x= all.residuals)) +
geom_histogram(fill = "blue", col = "red", binwidth = 2500)
h1 <- h1 + labs(x = "residuals", y = "count", title = "Histogram of residuals")
h1

Looking at the distribution of the residuals, it is approximately normal, so we can be reasonably comfortable saying that the regression model predicts salary without substantial bias.
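As a supplementary check that was not part of the original output, a normal Q-Q plot of the residuals can make this assessment more precise; a minimal sketch with ggplot2:
#Q-Q plot of the validation residuals against a normal distribution
ggplot(df2, aes(sample = all.residuals)) +
  stat_qq() +
  stat_qq_line(colour = "red") +
  labs(title = "Normal Q-Q plot of residuals")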
p1 <- ggplot(df2, aes(x = lm.pred, y = all.residuals)) + geom_point()
p1 <- p1 + labs(x = "predicted values", y = "residuals", title = "Residuals vs. predicted values")
p1 <- p1+ geom_smooth()
p1
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Next, we use an exhaustive search that considers every combination of the predictor columns and keeps the best model of each size; with eight candidate predictors, this yields eight models.
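The code that builds the exhaustive-search summary object sum used below is not shown above; a plausible sketch using regsubsets() from the leaps package (the exact arguments in the original analysis may differ, for example whether it was run on df or train.df):
#Exhaustive search over all subsets of the eight predictors
search <- regsubsets(salary ~ . -id -actual.cat, data = train.df,
                     nbest = 1, nvmax = 8, method = "exhaustive")
sum <- summary(search)  #contains $which, $rsq, $adjr2, etc.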
# show metrics
sum$rsq
## [1] 0.743 0.790 0.797 0.806 0.808 0.809 0.810 0.810
Displays the R-squared of each model.
sum$adjr2
## [1] 0.741 0.786 0.792 0.799 0.799 0.799 0.798 0.797
Displays the adjusted R-squared of each model.
##Find the model with the highest adjusted R2.
x <- which(sum$adjr2== max(sum$adjr2))
t(t(sum$which[x,]))
## [,1]
## (Intercept) TRUE
## gender FALSE
## industry FALSE
## education TRUE
## satisfied FALSE
## married TRUE
## experience TRUE
## certification FALSE
## bonus TRUE
We conclude that model 4 is the best model: it includes only the intercept, education, married, experience, and bonus, and gives an adjusted R-squared of 0.799 (about 0.80). Models 5 and 6 reach the same value, but model 4 is simpler since it contains fewer predictors.
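To act on this result, the selected four-predictor model can be refit and evaluated on the validation set; a sketch (reg3 is a hypothetical name, and this refit is not shown in the original output):
#Refit the model chosen by the exhaustive search and check validation accuracy
reg3 <- lm(salary ~ education + married + experience + bonus, data = train.df)
summary(reg3)
accuracy(predict(reg3, valid.df), valid.df$salary)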
## Logistic regression in Chapter 10 ###########################################
options(digits = 2)
##Change the colname to 'actual' from 'salary'
names(df)[names(df) == 'salary'] <- 'actual'
#Predict the salary category (actual.cat) using all variables
reg4 <- glm(actual.cat~ .-id -actual, data = df, family = "binomial")
options(scipen=999)
summary(reg4)
##
## Call:
## glm(formula = actual.cat ~ . - id - actual, family = "binomial",
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.311 -0.192 -0.083 0.075 1.915
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.75402 1.53511 -3.75 0.00018 ***
## gender -1.10243 0.81479 -1.35 0.17605
## industry -0.96439 0.81770 -1.18 0.23824
## education 0.10521 0.53586 0.20 0.84435
## satisfied -1.00601 0.75504 -1.33 0.18273
## married 0.59727 0.72298 0.83 0.40874
## experience 1.59300 0.38576 4.13 0.000036 ***
## certification -0.88253 0.71694 -1.23 0.21833
## bonus 0.00605 0.00287 2.11 0.03476 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 274.834 on 199 degrees of freedom
## Residual deviance: 61.616 on 191 degrees of freedom
## AIC: 79.62
##
## Number of Fisher Scoring iterations: 8
We attempt to predict the salary category (actual.cat) using all variables. From the summary of this regression, we conclude that the only significant predictors are experience and bonus.
psuedoR2 <- 1-reg4$deviance/reg4$null.deviance
psuedoR2
## [1] 0.78
It returns a pseudo-R-squared of 0.78, meaning the model explains about 78% of the null deviance in the salary category.
##Predict the salary category using experience and bonus
reg5 <- glm(actual.cat ~ experience + bonus, data = df, family = "binomial")
summary(reg5)
##
## Call:
## glm(formula = actual.cat ~ experience + bonus, family = "binomial",
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.282 -0.193 -0.122 0.153 1.554
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.61205 1.01730 -6.50 0.000000000081 ***
## experience 1.61033 0.37249 4.32 0.000015378853 ***
## bonus 0.00272 0.00166 1.64 0.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 274.834 on 199 degrees of freedom
## Residual deviance: 67.088 on 197 degrees of freedom
## AIC: 73.09
##
## Number of Fisher Scoring iterations: 7
We attempt to predict the salary category using only the significant variables from reg4, namely experience and bonus. After running this regression, we notice that bonus is no longer significant.
psuedoR2 <- 1-reg5$deviance/reg5$null.deviance
psuedoR2
## [1] 0.76
This is still a good model: it returns a pseudo-R-squared of 0.76 while using only two predictors.
##Predict the salary category using experience only
reg6 <- glm(actual.cat ~ experience, data = df, family = "binomial")
summary(reg6)
##
## Call:
## glm(formula = actual.cat ~ experience, family = "binomial", data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.363 -0.180 -0.123 0.239 1.470
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.982 1.025 -6.82 0.0000000000094 ***
## experience 2.105 0.304 6.92 0.0000000000044 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 274.834 on 199 degrees of freedom
## Residual deviance: 70.321 on 198 degrees of freedom
## AIC: 74.32
##
## Number of Fisher Scoring iterations: 7
We run the last regression using only experience, as it is the most significant predictor of the salary category.
psuedoR2 <- 1-reg6$deviance/reg6$null.deviance
psuedoR2
## [1] 0.74
After fitting this regression, it returns a pseudo-R-squared of 0.74.
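To use this final model for classification, a step not shown in this section, the fitted probabilities can be converted to predicted classes with an assumed 0.5 cutoff and compared against the actual categories; a minimal sketch:
#Predicted probability of the high-salary category for each employee
pred.prob <- predict(reg6, type = "response")
#Classify with an assumed 0.5 cutoff and tabulate against the actual categories
pred.class <- ifelse(pred.prob > 0.5, 1, 0)
table(Predicted = pred.class, Actual = df$actual.cat)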