Nowadays, when determining an employee’s salary, many different components are taken into consideration. Performing this analysis on the provided dataset allows us to understand whether any of the X variables, also known as predictors, have an effect on the Y dependent variable, salary.
setwd("C:/Users/12267/Desktop/UWindsor/Winter 2021/MSCI 3230 Data Science Tools & Methods/RSTUDIO Work")
library(ggplot2)
library(gplots)
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(reshape)
library (tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.4
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v tibble 3.0.5 v dplyr 1.0.3
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## v purrr 0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x tidyr::expand() masks reshape::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::rename() masks reshape::rename()
library (Hmisc)
## Warning: package 'Hmisc' was built under R version 4.0.4
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.0.4
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(leaps)
## Warning: package 'leaps' was built under R version 4.0.4
library(caret)
## Warning: package 'caret' was built under R version 4.0.4
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
## The following object is masked from 'package:purrr':
##
## lift
library(gains)
library(pROC)
## Warning: package 'pROC' was built under R version 4.0.4
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
All of the libraries required for the analysis are loaded above.
#### Read the data
df <- read.csv("data/SalaryData.csv")
t(t(names(df))) ##display column names in a user friendly manner
## [,1]
## [1,] "id"
## [2,] "gender"
## [3,] "industry"
## [4,] "education"
## [5,] "satisfied"
## [6,] "married"
## [7,] "experience"
## [8,] "certification"
## [9,] "bonus"
## [10,] "salary"
## [11,] "actual.cat"
#change to lowercase if so desired
names(df) <- tolower(names(df))
Read the data and display the column names within the dataset.
#Handling missing data
#Check which columns have missing values
summary(df)
## id gender industry education
## Min. : 1.00 Min. :0.000 Min. :0.0000 Min. :1.000
## 1st Qu.: 50.75 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:1.000
## Median :100.50 Median :0.000 Median :1.0000 Median :2.000
## Mean :100.50 Mean :0.475 Mean :0.5528 Mean :1.904
## 3rd Qu.:150.25 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:2.000
## Max. :200.00 Max. :1.000 Max. :1.0000 Max. :3.000
## NA's :1 NA's :2
## satisfied married experience certification
## Min. :0.0000 Min. :0.000 Min. :1.000 Min. :0.0
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.0
## Median :0.0000 Median :0.000 Median :3.000 Median :1.0
## Mean :0.4798 Mean :0.465 Mean :3.205 Mean :1.3
## 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:5.000 3rd Qu.:2.0
## Max. :1.0000 Max. :1.000 Max. :8.000 Max. :4.0
## NA's :2
## bonus salary actual.cat
## Min. : 27.0 Min. : 26865 Min. :0.000
## 1st Qu.: 128.5 1st Qu.: 43931 1st Qu.:0.000
## Median : 284.5 Median : 54048 Median :0.000
## Mean : 573.7 Mean : 57565 Mean :0.445
## 3rd Qu.: 855.2 3rd Qu.: 68313 3rd Qu.:1.000
## Max. :3080.0 Max. :102242 Max. :1.000
## NA's :2
#Replace missing values with the column median in the industry, education,
#satisfied, and bonus columns
df$industry  <- impute(df$industry, median)
df$education <- impute(df$education, median)
df$satisfied <- impute(df$satisfied, median)
df$bonus     <- impute(df$bonus, median)
#final check to make sure there are no more NA in dataset
summary(df)
##
## 1 values imputed to 1
##
##
## 2 values imputed to 2
##
##
## 2 values imputed to 0
##
##
## 2 values imputed to 284.5
## id gender industry education
## Min. : 1.00 Min. :0.000 Min. :0.000 Min. :1.000
## 1st Qu.: 50.75 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:1.000
## Median :100.50 Median :0.000 Median :1.000 Median :2.000
## Mean :100.50 Mean :0.475 Mean :0.555 Mean :1.905
## 3rd Qu.:150.25 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:2.000
## Max. :200.00 Max. :1.000 Max. :1.000 Max. :3.000
## satisfied married experience certification bonus
## Min. :0.000 Min. :0.000 Min. :1.000 Min. :0.0 Min. : 27.0
## 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.0 1st Qu.: 129.5
## Median :0.000 Median :0.000 Median :3.000 Median :1.0 Median : 284.5
## Mean :0.475 Mean :0.465 Mean :3.205 Mean :1.3 Mean : 570.9
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:5.000 3rd Qu.:2.0 3rd Qu.: 853.8
## Max. :1.000 Max. :1.000 Max. :8.000 Max. :4.0 Max. :3080.0
## salary actual.cat
## Min. : 26865 Min. :0.000
## 1st Qu.: 43931 1st Qu.:0.000
## Median : 54048 Median :0.000
## Mean : 57565 Mean :0.445
## 3rd Qu.: 68313 3rd Qu.:1.000
## Max. :102242 Max. :1.000
For this boxplot we are interested in the stand-alone distribution of salary, not its distribution across any X variable. Since the minimum value is $26,865 and the maximum is $102,242, we set the displayed range to between 20,000 and 120,000. We note that the salary column does not contain any outliers, and that the average salary is roughly $57,500.
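The chunk that produced that stand-alone boxplot is not shown above; a minimal sketch of how it could be generated with ggplot2, using the 20,000–120,000 range described in the text, might look like the following (the empty-string x aesthetic is just a placeholder so a single box is drawn).
#Stand-alone boxplot of salary, no grouping variable
ggplot(df, aes(x = "", y = salary)) +
  geom_boxplot(fill = "lightblue", outlier.color = "red") +
  coord_cartesian(ylim = c(20000, 120000)) +
  labs(title = "Distribution of Salary", x = "", y = "Salary")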
ggplot(df, aes(x = as.factor(gender), y = salary, fill = factor(gender))) +
  geom_boxplot(outlier.color = "red", outlier.shape = "o") +
  labs(title = "Salary vs. Gender", x = "Gender", y = "Salary")

First, we examine the relationship between salary and bonus. The scatter plot shows a positive, upward trend between the two columns: as an employee’s bonus increases, their salary also increases.
ggplot(df, aes(x = bonus, y = salary)) + geom_point(aes(colour = factor(married)))
## Don't know how to automatically pick scale for object of type impute. Defaulting to continuous.

This gives us an understanding of whether being married holds significance for salary and bonus. Indeed, being married appears to have an impact on the bonus, which in turn affects salary. For example, the majority of class “1”, which represents married employees, falls in the higher range of bonuses, which then leads to a higher salary. We conclude that married employees tend to receive larger bonuses, which leads to higher salaries.
ggplot(df, aes(x = bonus, y = salary)) + geom_point(aes(colour = factor(education)))
## Don't know how to automatically pick scale for object of type impute. Defaulting to continuous.

As noted in the scatter plot above, as bonus increases, salary increases. This graph adds the factor of education, giving us an understanding of the level of education each employee has completed. Indeed, education has an impact on the bonus. For example, the majority of class “1”, which represents employees who completed up to the undergraduate level, falls in the lower range of bonuses, which then leads to a lower salary. We conclude that the higher the level of education completed, the greater the bonus, which leads to an increase in salary.
ggplot(df, aes(x = bonus, y = salary)) + geom_point(aes(colour = factor(satisfied)))
## Don't know how to automatically pick scale for object of type impute. Defaulting to continuous.

This graph plots bonus against salary as a factor of satisfaction, giving us an understanding of whether the employee is satisfied or not. Unlike the factor of education in the graph above, there is no clear range for satisfaction, meaning that satisfaction does not appear to impact salary. For example, an employee can be unsatisfied with a large bonus and a high salary, or satisfied with a large bonus and a high salary.
##bar charts
##education column
ggplot(df, aes(x = education, fill = factor(education))) +
geom_bar()
## Don't know how to automatically pick scale for object of type impute. Defaulting to continuous.

ggplot(df, aes(x = education, y = salary, fill = as.factor(education))) +
geom_bar(stat = "summary", fun = "mean")
## Don't know how to automatically pick scale for object of type impute. Defaulting to continuous.

Here we want to understand how many employees completed undergraduate, graduate, or advanced education, so we first show the count of each level. The output indicates that the majority of employees have completed a graduate-level education. We then plot salary against education and get the same outcome as the scatter plot: the higher the level of education completed, the greater the salary. Computing the aggregate function gives an average salary of $50,777 for employees with an undergraduate education, $55,993 for a graduate-level education, and $70,397 for an advanced education.
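The aggregate computation quoted above is not shown in this section; a minimal sketch of how those averages could be obtained with base R’s aggregate() (assuming the education codes 1, 2, 3 map to undergraduate, graduate, and advanced):
#Mean salary for each education level (1 = undergraduate, 2 = graduate, 3 = advanced)
aggregate(salary ~ education, data = df, FUN = mean)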
###Industry Column
ggplot(df, aes(x = factor(industry), fill = factor(industry))) +
geom_bar()

ggplot(df, aes(x = factor(industry), y = salary, fill = factor(industry))) +
geom_bar(stat = "summary", fun = "mean")

Here we want to determine how many employees are in the IT industry and whether that affects their salary. “1” represents employees within the IT industry and “0” represents those who are not. We conclude that roughly 20 more employees are in the IT industry, and that those in the IT industry tend to have a higher salary.
####Gender column
ggplot(df, aes(x = factor(gender), fill = factor(gender))) +
geom_bar()

ggplot(df, aes(x = factor(gender), y = salary, fill = factor(gender))) +
geom_bar(stat = "summary", fun = "mean")

Here we want to determine how many employees are male or female and whether gender affects their salary. “1” represents female and “0” represents male. We conclude that there is only a slight difference in salary between females and males, with men tending to earn a bit more.
####satisfied column
ggplot(df, aes(x = factor(satisfied), fill = factor(satisfied))) +
geom_bar()

ggplot(df, aes(x = factor(satisfied), y = salary, fill = factor(satisfied))) +
geom_bar(stat = "summary", fun = "mean")

#####Certification
ggplot(df, aes(x = factor(certification), fill = factor(certification))) +
geom_bar()

ggplot(df, aes(x = certification, y = salary, fill = as.factor(certification))) +
geom_bar(stat = "summary", fun = "mean")

Here we want to determine the number of certifications employees have attained and analyze its effect on salary. The majority of employees have 1 certification, while only a few have 4. In terms of salary, those with a greater number of certifications tend to have a higher salary.
##histogram
ggplot(df, aes(x= salary)) + geom_histogram( fill = "red", col = "blue", binwidth = 6500)

This histogram indicates that the salary column is somewhat right-skewed.
## heatmap with values
t(t(names(df)))
## [,1]
## [1,] "id"
## [2,] "gender"
## [3,] "industry"
## [4,] "education"
## [5,] "satisfied"
## [6,] "married"
## [7,] "experience"
## [8,] "certification"
## [9,] "bonus"
## [10,] "salary"
## [11,] "actual.cat"
df1 <- df[ , c("gender", "industry", "education", "satisfied", "married", "experience", "certification", "bonus", "salary")]
heatmap.2(cor(df1), Rowv = FALSE, Colv = FALSE, dendrogram = "none",
cellnote = round(cor(df1),2),
notecol = "black", key = FALSE, trace = 'none', margins = c(10,10))

The heatmap presents the correlations between the variables in the dataset. We include all variables except the “id” column, as it holds no significance. The higher correlations are found among experience, certification, bonus, and salary.
########### PREDICTIVE ANALYSIS ##################
########## Linear Regression Chapter 6 ############
##not the proper approach (no training/validation split yet)
##regression with all predictors included
reg1 <- lm(salary ~ .-id -actual.cat, data = df)
# use options() to ensure numbers are not displayed in scientific notation.
options(scipen = 999)
summary(reg1)
##
## Call:
## lm(formula = salary ~ . - id - actual.cat, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20148.1 -5382.9 -34.6 4454.8 21358.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32645.519 2168.598 15.054 < 0.0000000000000002 ***
## gender -1615.713 1191.197 -1.356 0.17658
## industry 1100.772 1271.338 0.866 0.38767
## education 2710.258 878.694 3.084 0.00234 **
## satisfied -63.722 1193.212 -0.053 0.95747
## married 5393.470 1300.664 4.147 0.000050694764 ***
## experience 3611.859 537.007 6.726 0.000000000197 ***
## certification -585.682 1100.134 -0.532 0.59509
## bonus 11.597 2.243 5.169 0.000000589750 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8309 on 191 degrees of freedom
## Multiple R-squared: 0.7919, Adjusted R-squared: 0.7831
## F-statistic: 90.83 on 8 and 191 DF, p-value: < 0.00000000000000022
For our first regression, we include all variables in the dataset except the “id” column, as it holds no significance. We run a regression with salary as the Y variable, predicting it using all other variables, and consider the significance of each predictor. We notice that predictors such as gender, industry, satisfied, and certification are not significant in predicting salary, while education, married, experience, and bonus are significant to some extent. We get a multiple R-squared of 79.2% and an adjusted R-squared of 78.3%, meaning that about 78.3% of the variation in the Y column, salary, can be explained by the variation in these variables. Even though the model is good, it still includes insignificant variables.
set.seed(1) # set seed for reproducing the partition
train.rows <- sample(1:nrow(df), 120)
train.df <- df[train.rows, ]
train.df <- as.data.frame(train.df)#add on
valid.df <- df[-train.rows, ]
valid.df <- as.data.frame(valid.df)
reg2 <- lm(salary ~ .-id -actual.cat, data = train.df) #added-id
summary(reg2)
##
## Call:
## lm(formula = salary ~ . - id - actual.cat, data = train.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16924.9 -4372.8 -284.8 3939.4 19674.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32386.197 2622.134 12.351 < 0.0000000000000002 ***
## gender -915.488 1420.029 -0.645 0.5205
## industry 291.169 1539.388 0.189 0.8503
## education 2346.184 1062.262 2.209 0.0293 *
## satisfied 1348.118 1418.714 0.950 0.3441
## married 3700.025 1533.352 2.413 0.0175 *
## experience 4205.456 690.644 6.089 0.0000000166 ***
## certification -1355.223 1338.397 -1.013 0.3135
## bonus 12.008 2.928 4.102 0.0000784388 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7660 on 111 degrees of freedom
## Multiple R-squared: 0.8102, Adjusted R-squared: 0.7966
## F-statistic: 59.25 on 8 and 111 DF, p-value: < 0.00000000000000022
The model fit on the training data frame shows the same significant columns as reg1; the only slight change is in the coefficient estimates. Overall, the two models are consistent with each other. This model has a multiple R-squared of 81% and an adjusted R-squared of 80%.
# use predict() to make predictions on a new set.
lm.pred <- predict(reg2, valid.df)
options(scipen=999, digits = 0)
some.residuals <- valid.df$salary[1:20] - lm.pred[1:20]
data.frame("Actual" = valid.df$salary[1:20],"Predicted" = lm.pred[1:20],
"Residual" = some.residuals)
## Actual Predicted Residual
## 2 97790 75398 22392
## 3 96250 103503 -7253
## 4 94962 90101 4861
## 5 94716 87469 7247
## 6 94672 81992 12680
## 8 93744 72307 21437
## 9 92925 70998 21927
## 11 91377 76389 14988
## 12 91300 73716 17584
## 16 89112 77580 11532
## 18 87210 80708 6502
## 19 87152 72992 14160
## 27 80840 95231 -14391
## 28 80510 98555 -18045
## 30 79299 102318 -23019
## 32 77868 73122 4746
## 36 76632 74750 1882
## 38 75600 67932 7668
## 41 74121 69211 4910
## 46 71550 64604 6946
options(scipen=999, digits = 3)
We sample the first 20 rows of the validation set to examine the residuals. For example, in row 4 the actual salary of the employee is $94,962 and the predicted salary is $90,101, so the model underestimated the actual salary by $4,861.
# use accuracy() to compute common accuracy measures.
t(accuracy(lm.pred, valid.df$salary))
## Test set
## ME 124.5
## RMSE 9378.4
## MAE 7468.8
## MPE -2.3
## MAPE 12.8
We want to assess the accuracy of the model. The most common accuracy measure is the root mean squared error (RMSE), which here is 9,378.4.
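For reference, the RMSE reported by accuracy() can also be computed directly from the validation residuals; a minimal sketch:
#RMSE = square root of the mean squared prediction error on the validation set
sqrt(mean((valid.df$salary - lm.pred)^2))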
all.residuals <- valid.df$salary - lm.pred
df2 <- data.frame(all.residuals, lm.pred)
h1 <- ggplot(df2, aes(x= all.residuals)) +
geom_histogram(fill = "blue", col = "red", binwidth = 2500)
h1 <- h1 + labs(x = "residuals", y = "count", title = "Histogram of residuals")
h1

Looking at the distribution of the residuals, it is approximately normal, so we can be reasonably comfortable saying that the regression model predicts salary without substantial bias.
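As a supplementary check that was not part of the original output, a normal Q-Q plot of the residuals can make this assessment more precise; a minimal sketch with ggplot2:
#Q-Q plot of the validation residuals against a normal distribution
ggplot(df2, aes(sample = all.residuals)) +
  stat_qq() +
  stat_qq_line(colour = "red") +
  labs(title = "Normal Q-Q plot of residuals")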
p1 <- ggplot(df2, aes(x = lm.pred, y = all.residuals)) + geom_point()
p1 <- p1 + labs(x = "predicted values", y = "residuals", title = "Residuals vs. predicted values")
p1 <- p1+ geom_smooth()
p1
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Next, we use an exhaustive search that considers every combination of the predictor columns and keeps the best model of each size; with eight candidate predictors, this yields eight models.
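The code that builds the exhaustive-search summary object sum used below is not shown above; a plausible sketch using regsubsets() from the leaps package (the exact arguments in the original analysis may differ, for example whether it was run on df or train.df):
#Exhaustive search over all subsets of the eight predictors
search <- regsubsets(salary ~ . -id -actual.cat, data = train.df,
                     nbest = 1, nvmax = 8, method = "exhaustive")
sum <- summary(search)  #contains $which, $rsq, $adjr2, etc.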
# show metrics
sum$rsq
## [1] 0.743 0.790 0.797 0.806 0.808 0.809 0.810 0.810
Displays the R-squared of each model.
sum$adjr2
## [1] 0.741 0.786 0.792 0.799 0.799 0.799 0.798 0.797
Displays the adjusted R-squared of each model.
##Find the model with the highest adjusted R2.
x <- which(sum$adjr2== max(sum$adjr2))
t(t(sum$which[x,]))
## [,1]
## (Intercept) TRUE
## gender FALSE
## industry FALSE
## education TRUE
## satisfied FALSE
## married TRUE
## experience TRUE
## certification FALSE
## bonus TRUE
We conclude that model 4 is the best model: it includes only the intercept, education, married, experience, and bonus, and gives an adjusted R-squared of 0.799 (about 0.80). Models 5 and 6 reach the same value, but model 4 is simpler since it contains fewer predictors.
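To act on this result, the selected four-predictor model can be refit and evaluated on the validation set; a sketch (reg3 is a hypothetical name, and this refit is not shown in the original output):
#Refit the model chosen by the exhaustive search and check validation accuracy
reg3 <- lm(salary ~ education + married + experience + bonus, data = train.df)
summary(reg3)
accuracy(predict(reg3, valid.df), valid.df$salary)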
## Logistic regression in Chapter 10 ###########################################
options(digits = 2)
##Change the colname to 'actual' from 'salary'
names(df)[names(df) == 'salary'] <- 'actual'
#Predict the salary category (actual.cat) using all variables
reg4 <- glm(actual.cat~ .-id -actual, data = df, family = "binomial")
options(scipen=999)
summary(reg4)
##
## Call:
## glm(formula = actual.cat ~ . - id - actual, family = "binomial",
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.311 -0.192 -0.083 0.075 1.915
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.75402 1.53511 -3.75 0.00018 ***
## gender -1.10243 0.81479 -1.35 0.17605
## industry -0.96439 0.81770 -1.18 0.23824
## education 0.10521 0.53586 0.20 0.84435
## satisfied -1.00601 0.75504 -1.33 0.18273
## married 0.59727 0.72298 0.83 0.40874
## experience 1.59300 0.38576 4.13 0.000036 ***
## certification -0.88253 0.71694 -1.23 0.21833
## bonus 0.00605 0.00287 2.11 0.03476 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 274.834 on 199 degrees of freedom
## Residual deviance: 61.616 on 191 degrees of freedom
## AIC: 79.62
##
## Number of Fisher Scoring iterations: 8
We attempt to predict the salary category (actual.cat) using all variables. From the summary of this regression, we conclude that the only significant predictors are experience and bonus.
psuedoR2 <- 1-reg4$deviance/reg4$null.deviance
psuedoR2
## [1] 0.78
It returns a pseudo-R-squared of 0.78, meaning the model explains about 78% of the null deviance in the salary category.
##Predict the salary category using experience and bonus
reg5 <- glm(actual.cat ~ experience + bonus, data = df, family = "binomial")
summary(reg5)
##
## Call:
## glm(formula = actual.cat ~ experience + bonus, family = "binomial",
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.282 -0.193 -0.122 0.153 1.554
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.61205 1.01730 -6.50 0.000000000081 ***
## experience 1.61033 0.37249 4.32 0.000015378853 ***
## bonus 0.00272 0.00166 1.64 0.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 274.834 on 199 degrees of freedom
## Residual deviance: 67.088 on 197 degrees of freedom
## AIC: 73.09
##
## Number of Fisher Scoring iterations: 7
We attempt to predict the salary category using only the significant variables from reg4, namely experience and bonus. After running this regression, we notice that bonus is no longer significant.
psuedoR2 <- 1-reg5$deviance/reg5$null.deviance
psuedoR2
## [1] 0.76
This is still a good model: it returns a pseudo-R-squared of 0.76 while using only two predictors.
##Predict the salary category using experience only
reg6 <- glm(actual.cat ~ experience, data = df, family = "binomial")
summary(reg6)
##
## Call:
## glm(formula = actual.cat ~ experience, family = "binomial", data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.363 -0.180 -0.123 0.239 1.470
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.982 1.025 -6.82 0.0000000000094 ***
## experience 2.105 0.304 6.92 0.0000000000044 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 274.834 on 199 degrees of freedom
## Residual deviance: 70.321 on 198 degrees of freedom
## AIC: 74.32
##
## Number of Fisher Scoring iterations: 7
We run the last regression using only experience, as it is the most significant predictor of the salary category.
psuedoR2 <- 1-reg6$deviance/reg6$null.deviance
psuedoR2
## [1] 0.74
After fitting this regression, it returns a pseudo-R-squared of 0.74.
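To use this final model for classification, a step not shown in this section, the fitted probabilities can be converted to predicted classes with an assumed 0.5 cutoff and compared against the actual categories; a minimal sketch:
#Predicted probability of the high-salary category for each employee
pred.prob <- predict(reg6, type = "response")
#Classify with an assumed 0.5 cutoff and tabulate against the actual categories
pred.class <- ifelse(pred.prob > 0.5, 1, 0)
table(Predicted = pred.class, Actual = df$actual.cat)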