0.1 Executive Summary

This report aims to predict salary using various employee demographic characteristics. Data preprocessing is completed to ensure the dataset is ready for analysis. Missing variables were deleted or set to the mean of the data attribute. Then dummy variables are created to ensure categorical variables can be used in the analysis. Salary is not normally distributed so logarithmic and square root transformations are completed to normalize the data. The correlation between the predictive variables and the dependent variable, salary, are very strong. This indicates large information overlap; it also is a promising sign that the multiple linear regression model will be effective. Next salary is compared with different categorical variables. The first and most important observation for management is a discrepancy in salary between men and women. Women are on average making a lower salary than men. More women than men have advanced degrees and men and women have on average the same number of certifications and experience. This is notable as the gender pay gap has been a very prevalent topic of discussion in all industries. Next, people who are satisfied make more money on average than people who are not satisfied. This is useful to note as a small increase is salary could create more employee satisfaction in the firm. The last categorical variable is industry. People in the IT field are making significantly more than people in the non-IT industry. If managers are wanting to keep employees and protect them from head hunting, they should match their wages to the industry average. When determining an employee’s salary the most important variables to note are certification, education, experience, industry, and marital status. These are the variables that are used in the stepwise multiple linear regression model.The model resulted in a strong adjusted R squared and has both statistical and practical significance.

0.2 Data Preprocessing and Scope

0.2.1 Scope

The scope of this dataset is 195 employees with their related characteristics. This includes their job satisfaction, salary, gender, education, marital status and other variables. A full list of variables and a sample of the data set can be seen below.

> knitr::kable(head(FinalProjectSalary))
gender industry education satisfied married experience certification bonus salary expensive IsExpensive IsModerate
0 1 2 0 1 6 3 2147.082 102242 3 1 0
1 1 3 1 1 4 3 1564.640 97790 2 0 1
1 1 3 0 1 7 4 3080.000 96250 1 0 0
1 1 3 1 1 6 3 2089.164 94962 1 0 0
0 1 2 1 1 6 3 1989.036 94716 1 0 0
0 1 2 1 1 6 2 1420.080 94672 3 1 0

0.2.2 Data Dictionary

Below is a table of all data attributes with thier descriptions and domains.
Attribute Description Indicator
Gender What is the gender of the person? 1-Female, 0-Male
Industry what industry does the perosn work in? 1- IT, 0 - Non-IT
Education What level of education? 1- Undegrad, 2- Grad, 3- Advanced
Satisfied Is the person satisfied in thier job? 1-Yes, 0-No
Married Is the person married? 1-Yes, 0-No
Experience How many years of job experince? Count if years
Certification How many certifications does the person have? Count of certifications
Bonus What bonus does the person get? Dollar Amount
Salary What is the persons salary? Annual dollar amount
Expensive City How expensive is the city they live in? 1 - Low, 2-Medium, 3-High

0.2.3 Summary of Variables

The summary statistics are useful to ensure the data is within the attributes stated domain. For categorical variables the summary should produce a count of frequency. For continuous variables summary statistics returns mean, median, quartiles etc. The following is a summary of the relevant statistics of each variable. ID variable will be excluded as it does not contain useful data.

  • Gender is a categorical variable. There are 105 men and 95 women in this data set.
  • Industry is a categorical variable. There are 89 people in non-IT jobs and 110 people in IT jobs in this data set.
  • Education is a categorical variable. There are 64 undergrads, 89 grads and 45 people with advanced education.
  • Satisfaction is a categorical variable. There are 95 satisfied people and 103 unsatisfied people.
  • Married is a categorical variable. There is 93 married people and 107 non-married people.
  • Experience is a numerical variable. The range in experience is from 1 year to 21 years of experience. The mean number of years of experience is 3.305.
  • Certification is a categorical variable. The range of certification is from 1 to 4.
  • Bonus is a continuous numerical variable. The range of bonus is from 26.86 dollars to 3,080.2 dollars.
  • Salary is a continuous numerical variable.The range of salary is from 26,865 dollars to 102,242 dollars, with a mean of 57,565 dollars.
  • Expensive is a categorical variable. The number of people in inexpensive cities is 67, in medium cities is 76, and in expensive cities is 57.
> summary(FinalProjectSalary)
 gender  industry   education satisfied  married   experience     certification
 0:105   0   : 89   1   :64   0   :103   0:107   Min.   : 1.000   0:55         
 1: 95   1   :110   2   :89   1   : 95   1: 93   1st Qu.: 1.000   1:64         
         NA's:  1   3   :45   NA's:  2           Median : 3.000   2:52         
                    NA's: 2                      Mean   : 3.305   3:24         
                                                 3rd Qu.: 5.000   4: 5         
                                                 Max.   :21.000                
                                                                               
     bonus             salary       expensive  IsExpensive      IsModerate  
 Min.   :  26.86   Min.   : 26865   1:67      Min.   :0.000   Min.   :0.00  
 1st Qu.: 128.55   1st Qu.: 43930   2:76      1st Qu.:0.000   1st Qu.:0.00  
 Median : 284.37   Median : 54048   3:57      Median :0.000   Median :0.00  
 Mean   : 573.76   Mean   : 57565             Mean   :0.285   Mean   :0.38  
 3rd Qu.: 855.12   3rd Qu.: 68313             3rd Qu.:1.000   3rd Qu.:1.00  
 Max.   :3080.00   Max.   :102242             Max.   :1.000   Max.   :1.00  
 NA's   :2                                                                  

0.2.4 Dummy and Categorical Variables

Dummy variables are created for the attribute expensive. This attribute indicates how expensive the city is that the employee lives in. Two additional columns were created which are IsExpensive and IsModerate. The R code for this is as follows FinalProjectSalary$IsExpensive <- with(FinalProjectSalary, ifelse(expensive == 3, 1,0)) FinalProjectSalary$IsModerate <- with(FinalProjectSalary, ifelse(expensive == 2, 1,0)). Additionally all categorical variables need to be defined as categorical in order to be used in the analysis. The code for transformation these attributes is attribute <- as.factor(attribute).

0.2.5 Data Quality

First to determine the initial quality of the data the count of missing values can be taken. It is important to be aware of any data quality issue prior to analysis. To use this dataset for proper and sound analysis the rows with missing values that are categorical (satisfied, education, industry ) will be deleted in order to preserve data quality. For the missing values that are continuous (bonus) the mean will be substituted in an attempt to not shrink the size of the data set further.

> sapply(FinalProjectSalary, function(x)(sum(is.na(x)))) # NA counts
       gender      industry     education     satisfied       married 
            0             1             2             2             0 
   experience certification         bonus        salary     expensive 
            0             0             2             0             0 
  IsExpensive    IsModerate 
            0             0 

0.3 Analysis of Distribution and Transformation

0.3.1 Salary Distribution

From the histogram we can see that salary data is right skewed as it has the highest frequency is on the left and there is a tail on the right. In the boxplot it is evident that the distribution is right skewed as the top whisker is much longer than the bottom and the 50% quartile line is shifted toward the bottom. Given that the QQPlot opens up there is an indication the data is right skewed. The small downward curve at the end indicates a light tail. Lastly, the mean is greater then the median (50% Quartile) which further indicates the data is right skewed. Lastly, the normality tests displays a pvalue < 0.05 meaning the null hypothesis is accepted and salary is not normally distributed.

> with(FinalProjectSalary, Hist(salary, scale="frequency", breaks="Sturges", col="deepskyblue4", main = "Histrogram of Salary"))

> Boxplot( ~ salary, data=FinalProjectSalary, id=list(method="y"), main ="Box Plot of Salary", col="deepskyblue4")

> with(FinalProjectSalary, qqPlot(salary, dist="norm", id=list(method="y", n=2, labels=rownames(FinalProjectSalary)), main ="QQ Plot of Salary", col="deepskyblue4"))

[1] 1 2
> numSummary(FinalProjectSalary[,"salary", drop=FALSE], statistics=c("mean", "quantiles"), quantiles=c(.5))
     mean   50%   n
 57564.64 54048 200
> normalityTest(~salary, test="shapiro.test", data=FinalProjectSalary)

    Shapiro-Wilk normality test

data:  salary
W = 0.95548, p-value = 6.618e-06

0.3.2 Transformation

Lambda is 0 which would indicate a log transformation. After the log transformation the pvalue was much closer to 0.05 but still is not greater than 0.05. A square root transformation was also completed but it did not produce a pvalue larger than 0.05 either.

> TF <- (powerTransform(salary ~ 1, data=FinalProjectSalary, family="bcPower"))
> summary(TF)
bcPower Transformation to Normality 
   Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
Y1    -0.001           0      -0.4342       0.4322

Likelihood ratio test that transformation parameter is equal to 0
 (log transformation)
                               LRT df   pval
LR test, lambda = (0) 1.924313e-05  1 0.9965

Likelihood ratio test that no transformation is needed
                           LRT df       pval
LR test, lambda = (1) 19.84609  1 8.3934e-06
> lamb <- TF$lambda 
> TSalary <- FinalProjectSalary$salary^TF$lamb
> hist(TSalary, main="Histogram for Log Salary Trasformation", 
+      xlab="Salary", col="deepskyblue4" )

> shapiro.test(TSalary)

    Shapiro-Wilk normality test

data:  TSalary
W = 0.98397, p-value = 0.02246
> summary(powerTransform(salary ~ 1, data=FinalProjectSalary, family="bcPower"))
bcPower Transformation to Normality 
   Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
Y1    -0.001           0      -0.4342       0.4322

Likelihood ratio test that transformation parameter is equal to 0
 (log transformation)
                               LRT df   pval
LR test, lambda = (0) 1.924313e-05  1 0.9965

Likelihood ratio test that no transformation is needed
                           LRT df       pval
LR test, lambda = (1) 19.84609  1 8.3934e-06
> sqrt_sal <- sqrt(FinalProjectSalary$salary)
> hist(sqrt_sal, main="Histogram for Sqrt Salary Trasformation", 
+      xlab="Salary", col="deepskyblue4")

> shapiro.test(sqrt_sal)

    Shapiro-Wilk normality test

data:  sqrt_sal
W = 0.9753, p-value = 0.001346

0.4 Analysis of Correlation

The first method of analysis of correlation that will be completed is a correlation matrix. This matrix compares the correlations of all numerical variables. Conditional formatting was used and only correlations stronger then 0.6 and not equal to 1 are indicated by bolded text. There is very strong correlations between these variables. This does indicate a good base for prediction, but it also indicates there is a lot of information overlap meaning the variables are not truly independent. These same values are reflected visually using the correlation plot below. The more bulbus and ligher the oval the weaker the correlation and the narrower and darker the oval the stronger the correlation. This plot indicates only positive correlations between variables and quite strong correlations.

> library(pander)
> t <- cor(FinalProjectSalary[,c("bonus","experience","salary")], use="complete")
> emphasize.strong.cells(which(t > 0.6 & t != 1 , arr.ind = TRUE))
> pander(t)
  bonus experience salary
bonus 1 0.6842 0.8425
experience 0.6842 1 0.6361
salary 0.8425 0.6361 1
> library(corrplot)
corrplot 0.92 loaded
> C <- cor(FinalProjectSalary[, c("bonus", "experience", "salary")], use = "complete")
> corrplot(C, order="FPC", method = "ellipse")

0.5 Comparisons Using Catagorical Variables

Bonuses, certification and experience are all highly correlated with salary. Often salaries are determined by number of certification and years of experience. Bonuses are commonly a percentage of salary or are correlated in some fashion. Additionally experience and certification are not continuous and represent a smaller range of number. For these reasons salary will be primarily used as the variable that will be compared in accordance with different categorical variables.

0.5.1 Salary and Gender

Observations: The box plot on the left labeled 0 is men’s salaries and the box plot on the right labeled 1 is women’s salaries. Women have two outliers which are higher than the top whisker. This could be two women who hold C-suite jobs so their salaries are much higher than the average. It is interesting to note that the median salary of a women is lower than the median salary of a man. It is also important to note that the 3rd quartile of the men’s box plot end a lot higher and the whisker is longer meaning the top 50% of men are making more than the top 50% of women. The 1st quartile and bottom whisker in both plots are about the same size meaning mean and women earning less than $54,048 are making approximately the same amount of money. Contributing factors to a higher salary in men could be experince or certifications so an analysis of those variables will be conducted. Both women and men have the same median experience with men having a slightly longer whisker. This shows the top 25% of mens have a bit more experince than the top 25% of women by about 1 year. The median level of certification is 1 for both men and women. 50% of women and men have 0 certifications. The 3rd quartile range for both is between 1 and 2 certifications. The top 25% of women have between 2 and 4 certification while the top 25% of men have been 2 and 3 certifications. In a comparison of education it can be seen that a higher percentage of women have advanced degrees than men.

Interpretation: This gender pay gap has been a very prevalent topic of discussion in all industries. According to the data women are just as qualified and are technically more educated. This could be a coincidence but it should be a statistic of concern that should be investigated further.

> Boxplot(salary~gender, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")

[1] "2" "3"
> Boxplot(experience~gender, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")

[1] "200"
> Boxplot(certification~gender, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")
Warning in Ops.factor(yy, b$stats[1, group]): '<' not meaningful for factors
Warning in Ops.factor(yy, b$stats[5, group]): '>' not meaningful for factors

[1] NA NA
> knitr::kable(data.frame(Education=rep(c('Undergrad', 'Grad', 'Advanced')),
+                  Men=rep(c('18%', '48%','34%'),),
+                  Women=rep(c('16%','46%','38%'))))
Education Men Women
Undergrad 18% 16%
Grad 48% 46%
Advanced 34% 38%

0.5.2 Salary and Education

Observations: The median salary increases as education increases. It is interesting to note that the top 25% of people with undergraduate degrees are earning more on average than the top 25% of people with graduate degrees. There are many outliers in the graduate box plot.

Interpretation: It is logical that salary increases with more education as people become more qualified. An explanation for the oultiers is that there are several positions such as lawyer and businesses manager that commonly have graduate degrees and make high salaries. These people would likely make more than a person with a masters degrees in arts or sciences who is looking to do research or work in a non-profit field.

> Boxplot(salary~education, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")

[1] "1"  "5"  "6"  "9"  "11" "19"

0.5.3 Salary and Marital Status

Observations: Married people are represented by the box plot on the right with the 1 label. The median salary increases greatly if a person is married. The top 50% of married people are on average making more than the top 25% of unmarried people.

Interpretation: Does this mean if you get married you will make more money? No, marriage has many other variables that factor into salary. Age is the main one. The average age of marriage in Canada is about 30 years old. This means people are later in their careers which often comes with a salary increase. Additionally, married people look for more stability and may need to be able to potentially support children and a mortgage. Married people are often not in the financial position to take jobs with lower salaries even if the job might be more enjoyable.

> Boxplot(salary~married, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")

0.5.4 Salary and Satisfaction

Observations: People who are not satisfied with their job are represented by 0 and people who are satisfied are represented by 1. It is evident that people who are satisfied have a lightly higher median salary. The 3rd quartile of satisfied people reaches higher than the 3rd quartile of unsatisfied people meaning that 25% of satisfied people are making more than the 25% of unsatisfied people. The whiskers reach to the same level on both meaning the top 25% of people make approximately the same amount.

Interpretation: Do satisfied people make a better salary because they are happy so they perform better or are people more satisfied because they have a higher salary. This correlation is difficult to determine and will take further analysis to conclude.

> Boxplot(salary~satisfied, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")

[1] "1"

0.5.5 Salary and Industry

Observations: People in the tech industry are denoted by the box plot labeled 1. People in the tech industry have a higher median salary than those who are not. The top 50% of people in tech are making the same or more than the top 25% of people in non-tech. The whisker on the tech plot reached much higher meaning the top 25% of tech people earn more than the top 25% of non-tech people. There are two outliers in non-tech jobs.

Interpretation: Tech is a fast growing field with jobs that are heavily in demand such as computer scientists and data scientists. The world is becoming more data driven and the competition to acquire top tech talent is fierce. To be competitive tech firms must offer higher salaries. This is likely the reason there is a discrepancy in salaries between tech and non-tech people. There are a few possible explanations for the salary outliers in non-tech industry. This could be people working at C-suite executives in non-tech industries or could be doctors or lawyers, which are non-tech high paying jobs.

> Boxplot(salary~industry, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")

[1] "11" "13"

0.5.6 Gender, Salary and Industry

Observations: There are an equal amount of women and men working in non-tech. There are more men working in tech than women. Looking at the tech industry a larger percentage of women make an above average salary than men. Looking at non-tech there are a larger percentage of women making below average salary than men.

Interpretation: The tech industry is historically male dominated so it seems there has been an emphasis put on paying women in tech competitively, this could be why a larger percentage of women are getting paid above average.

> FinalProjectSalary$SalAbvBlwAvg <- with(FinalProjectSalary, ifelse(salary>57565, 'Above', 'Below'))
> library(abind, pos=27)
> local({
+   .Table <- xtabs(~SalAbvBlwAvg+gender+industry, data=FinalProjectSalary)
+   cat("\nFrequency table:\n")
+   print(.Table)
+   cat("\nColumn percentages:\n")
+   print(colPercents(.Table)) 
+ })

Frequency table:
, , industry = 0

            gender
SalAbvBlwAvg  0  1
       Above 16 11
       Below 28 34

, , industry = 1

            gender
SalAbvBlwAvg  0  1
       Above 33 28
       Below 28 21


Column percentages:
, , industry = 0

            gender
SalAbvBlwAvg     0     1
       Above  36.4  24.4
       Below  63.6  75.6
       Total 100.0 100.0
       Count  44.0  45.0

, , industry = 1

            gender
SalAbvBlwAvg     0     1
       Above  54.1  57.1
       Below  45.9  42.9
       Total 100.0 100.0
       Count  61.0  49.0

0.6 Predictive Analysis

0.6.1 Model 1: Multiple Linear Regression Continuous Variables

> Sal_Reg1 <- lm(salary~bonus+certification+experience, data=SalaryFinal)
> summary(Sal_Reg1)

Call:
lm(formula = salary ~ bonus + certification + experience, data = SalaryFinal)

Residuals:
   Min     1Q Median     3Q    Max 
-33515  -5571   -490   5026  31085 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   42244.760   1262.760  33.454   <2e-16 ***
bonus            21.866      2.306   9.481   <2e-16 ***
certification    -9.026   1237.716  -0.007    0.994    
experience      835.627    379.877   2.200    0.029 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9373 on 191 degrees of freedom
Multiple R-squared:  0.7187,    Adjusted R-squared:  0.7143 
F-statistic: 162.7 on 3 and 191 DF,  p-value: < 2.2e-16
> equatiomatic::extract_eq(Sal_Reg1)

\[ \operatorname{salary} = \alpha + \beta_{1}(\operatorname{bonus}) + \beta_{2}(\operatorname{certification}) + \beta_{3}(\operatorname{\experience}) + \epsilon \]

> equatiomatic::extract_eq(Sal_Reg1, use_coefs = TRUE)

\[ \operatorname{\widehat{salary}} = 42244.76 + 21.87(\operatorname{bonus}) - 9.03(\operatorname{certification}) + 835.63(\operatorname{\experience}) \] The first output that will be investigated is the residual values. The regression model skews negatively. This can be noticed by identifying the negative median value and the deviation of Q1 vs Q3 from the median. Next the coefficients can be used to create a linear expression.

Example using mean values for all explanatory variables \[y= 21.84(573.76)-9.026(1.3)+835.627(3.305)+42244.76\] \[y=$57,525.69\]

Taking into account standard error \[max = 24.172(Bonus)+1228.69(Certification)+1215.504(Experience)+43545.03\] \[min = 19.56(Bonus)-1246.742(Certification)+455.75(Experience)+40944.49\]

From the asterisks it can be interpreted that bonus and intercept have highly significant p-values as they have 3 askerisks. Experience has one asterisk indicating a significant p-value. Certification does not have an astericks meaning it does not have a significant p-value. The small p-values in experince, bonus and intercept mean that the null is rejected and that there is a relationship between these variables and salary.

An adjusted R squared below 0.4 is seen as a low correlation and above 0.7 is seen as a very high correlation. From the Adjusted R squared in this model is can be concluded there is a strong correlation as it is 0.7143. The intercept represents the base salary without any experience, bonus or salary. This would be considered starting salary and would be the amount employees get paid directly out of school.

0.6.2 Model 1.1: Stepwise Linear Regression

Backwards stepwise multiple linear regression is used to find the optimal linear equation with the lowest Akaike information criteria (AIC). This form of regression begins with all variables and removed one at a time until all the remaining explanatory variables are statistically significant.

\[y= salary\] \[y = 21.85(Bonus)+835.55(Experience)+42240.70 \]

Example using mean values for all explanatory variables \[y= 21.85(573.76)+835.55(3.305)+42240.70 \] \[y=$57,538.85\]

Overall the best model was the one using bonus and experience. So a salary can be estimated by inputing the before mentioned variables into the equation formulated by the stepwise multiple regression analysis. However, bonuses are often derived from salary so other models that are more applicable should be created.

> stepwise(Sal_Reg1, direction='backward', criterion='AIC')

Direction:  backward
Criterion:  AIC 

Start:  AIC=3570.73
salary ~ bonus + certification + experience

                Df  Sum of Sq        RSS    AIC
- certification  1       4671 1.6779e+10 3568.7
<none>                        1.6779e+10 3570.7
- experience     1  425086228 1.7204e+10 3573.6
- bonus          1 7897167079 2.4676e+10 3643.9

Step:  AIC=3568.73
salary ~ bonus + experience

             Df  Sum of Sq        RSS    AIC
<none>                     1.6779e+10 3568.7
- experience  1 4.2533e+08 1.7205e+10 3571.6
- bonus       1 1.9405e+10 3.6184e+10 3716.6

Call:
lm(formula = salary ~ bonus + experience, data = SalaryFinal)

Coefficients:
(Intercept)        bonus   experience  
   42240.70        21.85       835.55  

0.6.3 Model 2: Multiple Linear Regression All Variables

This model will use all variables to create a linear equation. Bonus will be excluded from the model because it has strong variable interaction. Often bonuses are a percentage of salary. Since salary would be determined first it would be inappropriate to use bonus.

> AllRegSal <- lm(salary~certification+education+expensive+experience+gender+industry+married+satisfied, data=SalaryFinal)
> summary(AllRegSal)

Call:
lm(formula = salary ~ certification + education + expensive + 
    experience + gender + industry + married + satisfied, data = SalaryFinal)

Residuals:
   Min     1Q Median     3Q    Max 
-48626  -6345   -370   6227  27108 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    29778.5     3027.9   9.835  < 2e-16 ***
certification   6765.3      927.4   7.295 8.36e-12 ***
education       4725.0     1056.9   4.471 1.35e-05 ***
expensive       -889.0      955.5  -0.930 0.353370    
experience      2064.4      385.2   5.359 2.46e-07 ***
gender          -760.8     1486.7  -0.512 0.609440    
industry        3793.6     1570.2   2.416 0.016659 *  
married         5800.3     1623.7   3.572 0.000451 ***
satisfied        301.5     1497.3   0.201 0.840615    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10240 on 186 degrees of freedom
Multiple R-squared:  0.6727,    Adjusted R-squared:  0.6587 
F-statistic: 47.79 on 8 and 186 DF,  p-value: < 2.2e-16
> equatiomatic::extract_eq(AllRegSal)

\[ \operatorname{salary} = \alpha + \beta_{1}(\operatorname{certification}) + \beta_{2}(\operatorname{education}) + \beta_{3}(\operatorname{\expensive}) + \beta_{4}(\operatorname{\experience}) + \beta_{5}(\operatorname{gender}) + \beta_{6}(\operatorname{industry}) + \beta_{7}(\operatorname{married}) + \beta_{8}(\operatorname{satisfied}) + \epsilon \]

> equatiomatic::extract_eq(AllRegSal, use_coefs = TRUE)

\[ \operatorname{\widehat{salary}} = 29778.48 + 6765.32(\operatorname{certification}) + 4725.04(\operatorname{education}) - 888.99(\operatorname{\expensive}) + 2064.44(\operatorname{\experience}) - 760.82(\operatorname{gender}) + 3793.62(\operatorname{industry}) + 5800.33(\operatorname{married}) + 301.54(\operatorname{satisfied}) \]

In this model the intercept is 29778.5. However, values like education cannot be 0. Meaning the lowest salary for someone with a degree is about 5000 dollars higher. Additionally people working in IT will be paid about 4000 dollars more than non tech. In this model the most significant p-values are those for certification, education, experience, industry and marital status.The adjusted R squared is 0.6587 which is considered a well fitting model.

0.6.4 Model 2.1: Stepwise Linear Regression

Backwards stepwise linear regression is used to find the optimal linear equation with the lowest Akaike information criteria (AIC). In this model all variables are used. The optimal equation can be found below. This equation indicates

  • married people have a higher salary
  • more educated people have a higher salary
  • more experiences people have a higher salary
  • people in tech have a higher salary
  • people with more certifications have a higher salary
  • the base salary is 27,694 dollars

\[y= salary\] \[y = 6906(certification)+4738(education)+2052(experience)+3901(industry)+5623(married)+27694\]

After creating a summary of the model created by the stepwise regression. It is notable that the multiple R squared regression is higher for model 2.1 than 2.0. It can be concluded that model 2.1 is a better fit than model 2.0. The multiple R squared for model 2.1 is smaller than that of model 1, most likely because of the variable interaction between salary and bonus.

Model 2.1 provides the most practical significance and applicability even if it is not the most statistically significant.

> stepwise(AllRegSal, direction='backward', criterion='AIC')

Direction:  backward
Criterion:  AIC 

Start:  AIC=3610.24
salary ~ certification + education + expensive + experience + 
    gender + industry + married + satisfied

                Df  Sum of Sq        RSS    AIC
- satisfied      1    4256596 1.9525e+10 3608.3
- gender         1   27484685 1.9549e+10 3608.5
- expensive      1   90851296 1.9612e+10 3609.1
<none>                        1.9521e+10 3610.2
- industry       1  612614767 2.0134e+10 3614.3
- married        1 1339372447 2.0861e+10 3621.2
- education      1 2097842348 2.1619e+10 3628.1
- experience     1 3014227158 2.2535e+10 3636.2
- certification  1 5585374932 2.5107e+10 3657.3

Step:  AIC=3608.29
salary ~ certification + education + expensive + experience + 
    gender + industry + married

                Df  Sum of Sq        RSS    AIC
- gender         1   28470318 1.9554e+10 3606.6
- expensive      1   89567594 1.9615e+10 3607.2
<none>                        1.9525e+10 3608.3
- industry       1  626256510 2.0152e+10 3612.4
- married        1 1360255640 2.0886e+10 3619.4
- education      1 2095511973 2.1621e+10 3626.2
- experience     1 3054675395 2.2580e+10 3634.6
- certification  1 5582338749 2.5108e+10 3655.3

Step:  AIC=3606.57
salary ~ certification + education + expensive + experience + 
    industry + married

                Df  Sum of Sq        RSS    AIC
- expensive      1   97067754 1.9651e+10 3605.5
<none>                        1.9554e+10 3606.6
- industry       1  643513336 2.0197e+10 3610.9
- married        1 1340095099 2.0894e+10 3617.5
- education      1 2072401334 2.1626e+10 3624.2
- experience     1 3091685588 2.2646e+10 3633.2
- certification  1 5619470281 2.5173e+10 3653.8

Step:  AIC=3605.54
salary ~ certification + education + experience + industry + 
    married

                Df  Sum of Sq        RSS    AIC
<none>                        1.9651e+10 3605.5
- industry       1  655155031 2.0306e+10 3609.9
- married        1 1284089413 2.0935e+10 3615.9
- education      1 2124956397 2.1776e+10 3623.6
- experience     1 3025861222 2.2677e+10 3631.5
- certification  1 5945662017 2.5597e+10 3655.1

Call:
lm(formula = salary ~ certification + education + experience + 
    industry + married, data = SalaryFinal)

Coefficients:
  (Intercept)  certification      education     experience       industry  
        27694           6906           4738           2052           3901  
      married  
         5623  
> StepAllRegSal <- lm(salary~certification+education+experience+industry+married, data=SalaryFinal)
> summary(StepAllRegSal)

Call:
lm(formula = salary ~ certification + education + experience + 
    industry + married, data = SalaryFinal)

Residuals:
   Min     1Q Median     3Q    Max 
-48670  -6587    139   5968  26591 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    27694.3     2170.1  12.762  < 2e-16 ***
certification   6905.8      913.2   7.562 1.68e-12 ***
education       4738.2     1048.1   4.521 1.09e-05 ***
experience      2052.5      380.5   5.395 2.04e-07 ***
industry        3901.0     1554.1   2.510 0.012905 *  
married         5623.0     1600.0   3.514 0.000552 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10200 on 189 degrees of freedom
Multiple R-squared:  0.6706,    Adjusted R-squared:  0.6618 
F-statistic: 76.94 on 5 and 189 DF,  p-value: < 2.2e-16

0.7 Conclusion

Statistical significance does not always result in practical significance. There are numerous conclusion that can be drawn from this report. First, there seems to be an active gender wage gap, surprisingly it is more prevalent in non-tech industries. There is a bias in industry wages with people in tech making more on average. Education and experience both create salary increases. It became evident that bonuses are derived from salaries creating models that were statistically significant but not applicable in the real world. Many of the variables surround one variable that wasn’t included, age. Age often determines marital status, education level, years of experience and amount of certifications. All of these variables are highly correlated with age and therefore salary is highly correlated with age. People “climb the corporate ladder” with each new title a salary bump is common. Overall the best predictors of salary are certification, education, experience, industry, and marital status proven using backward stepwise mulitple linear regression.

0.8 Appendix: All code for this report

> # include this code chunk as-is to set options
> knitr::opts_chunk$set(comment=NA, prompt=TRUE, out.width=750, fig.height=8, fig.width=8)
> library(Rcmdr)
> library(car)
> library(RcmdrMisc)
> library(bookdown)
> library(DiagrammeR)
> library(equatiomatic)
> library(kableExtra)
> library(knitr)
> library(xtable)
> # include this code chunk as-is to enable 3D graphs
> library(rgl)
> knitr::knit_hooks$set(webgl = hook_webgl)
> knitr::include_graphics("/Users/erindane/Desktop/R Studios /Picture1.png")
> DiagrammeR::grViz("digraph {
+   graph [layout = dot, rankdir = TB]
+   
+   node [shape = rectangle]        
+   rec1 [label = 'Step 1. Introduction']
+   rec2 [label = 'Step 2. Exploratory Analysis']
+   rec3 [label =  'Step 3. Predictive Analysis']
+   rec4 [label = 'Step 4. Recommendations/Conclusions']
+   
+   # edge definitions with the node IDs
+   rec1 -> rec2 -> rec3 -> rec4
+   }",
+   height = 500)
> FinalProjectSalary <- readXL("/Users/erindane/Desktop/R Studios /FinalProjectSalary.xlsx", 
+   rownames=FALSE, header=TRUE, na="", sheet="Sheet1", stringsAsFactors=TRUE)
> FinalProjectSalary <- within(FinalProjectSalary, {
+   satisfied <- as.factor(satisfied)
+   education <- as.factor(education)
+   expensive <- as.factor(expensive)
+   gender <- as.factor(gender)
+   industry <- as.factor(industry)
+   married <- as.factor(married)
+   certification <- as.factor(certification)
+   id <- NULL 
+ })
> FinalProjectSalary$IsExpensive <- with(FinalProjectSalary, ifelse(expensive == 3, 1,0))
> FinalProjectSalary$IsModerate <- with(FinalProjectSalary, ifelse(expensive == 2, 1,0))
> knitr::kable(head(FinalProjectSalary))
> knitr::kable(data.frame(Attribute=rep(c('Gender', 'Industry', 'Education','Satisfied','Married','Experience', 'Certification','Bonus','Salary','Expensive City')),
+                 Description=rep(c('What is the gender of the person?', 'what industry does the perosn work in?','What level of education?','Is the person satisfied in thier job?','Is the person married?','How many years of job experince?','How many certifications does the person have?','What bonus does the person get?', 'What is the persons salary?','How expensive is the city they live in?'),),
+                  Indicator=rep(c('1-Female, 0-Male','1- IT, 0 - Non-IT','1- Undegrad, 2- Grad, 3- Advanced', '1-Yes, 0-No','1-Yes, 0-No', 'Count if years', 'Count of certifications', 'Dollar Amount','Annual dollar amount','1 - Low, 2-Medium, 3-High'))))
> summary(FinalProjectSalary)
> sapply(FinalProjectSalary, function(x)(sum(is.na(x)))) # NA counts
> with(FinalProjectSalary, Hist(salary, scale="frequency", breaks="Sturges", col="deepskyblue4", main = "Histrogram of Salary"))
> Boxplot( ~ salary, data=FinalProjectSalary, id=list(method="y"), main ="Box Plot of Salary", col="deepskyblue4")
> with(FinalProjectSalary, qqPlot(salary, dist="norm", id=list(method="y", n=2, labels=rownames(FinalProjectSalary)), main ="QQ Plot of Salary", col="deepskyblue4"))
> numSummary(FinalProjectSalary[,"salary", drop=FALSE], statistics=c("mean", "quantiles"), quantiles=c(.5))
> normalityTest(~salary, test="shapiro.test", data=FinalProjectSalary)
> TF <- (powerTransform(salary ~ 1, data=FinalProjectSalary, family="bcPower"))
> summary(TF)
> lamb <- TF$lambda 
> TSalary <- FinalProjectSalary$salary^TF$lamb
> hist(TSalary, main="Histogram for Log Salary Trasformation", 
+      xlab="Salary", col="deepskyblue4" )
> shapiro.test(TSalary)
> 
> summary(powerTransform(salary ~ 1, data=FinalProjectSalary, family="bcPower"))
> sqrt_sal <- sqrt(FinalProjectSalary$salary)
> hist(sqrt_sal, main="Histogram for Sqrt Salary Trasformation", 
+      xlab="Salary", col="deepskyblue4")
> shapiro.test(sqrt_sal)
> library(pander)
> t <- cor(FinalProjectSalary[,c("bonus","experience","salary")], use="complete")
> emphasize.strong.cells(which(t > 0.6 & t != 1 , arr.ind = TRUE))
> pander(t)
> library(corrplot)
> C <- cor(FinalProjectSalary[, c("bonus", "experience", "salary")], use = "complete")
> corrplot(C, order="FPC", method = "ellipse")
> Boxplot(salary~gender, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")
> Boxplot(experience~gender, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")
> Boxplot(certification~gender, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")
> knitr::kable(data.frame(Education=rep(c('Undergrad', 'Grad', 'Advanced')),
+                  Men=rep(c('18%', '48%','34%'),),
+                  Women=rep(c('16%','46%','38%'))))
> Boxplot(salary~education, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")
> Boxplot(salary~married, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")
> Boxplot(salary~satisfied, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")
> Boxplot(salary~industry, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")
> FinalProjectSalary$SalAbvBlwAvg <- with(FinalProjectSalary, ifelse(salary>57565, 'Above', 'Below'))
> library(abind, pos=27)
> local({
+   .Table <- xtabs(~SalAbvBlwAvg+gender+industry, data=FinalProjectSalary)
+   cat("\nFrequency table:\n")
+   print(.Table)
+   cat("\nColumn percentages:\n")
+   print(colPercents(.Table)) 
+ })
> SalaryFinal <- readXL("/Users/erindane/Desktop/R Studios /FinalProjectSalaryPP.xlsx", 
+   rownames=FALSE, header=TRUE, na="", sheet="Sheet1", stringsAsFactors=TRUE)
> SalaryFinal <- within(SalaryFinal, {
+   id <- NULL 
+ })
> SalaryFinal$IsExpensive <- with(SalaryFinal, ifelse(expensive == 3, 1,0))
> SalaryFinal$IsModerate <- with(SalaryFinal, ifelse(expensive == 2, 1,0))
> Sal_Reg1 <- lm(salary~bonus+certification+experience, data=SalaryFinal)
> summary(Sal_Reg1)
> equatiomatic::extract_eq(Sal_Reg1)
> equatiomatic::extract_eq(Sal_Reg1, use_coefs = TRUE)
> stepwise(Sal_Reg1, direction='backward', criterion='AIC')
> AllRegSal <- lm(salary~certification+education+expensive+experience+gender+industry+married+satisfied, data=SalaryFinal)
> summary(AllRegSal)
> equatiomatic::extract_eq(AllRegSal)
> equatiomatic::extract_eq(AllRegSal, use_coefs = TRUE)
> stepwise(AllRegSal, direction='backward', criterion='AIC')
> StepAllRegSal <- lm(salary~certification+education+experience+industry+married, data=SalaryFinal)
> summary(StepAllRegSal)

  1. U of Windsor, ↩︎