This report aims to predict salary using various employee demographic characteristics. Data preprocessing is completed to ensure the dataset is ready for analysis. Missing variables were deleted or set to the mean of the data attribute. Then dummy variables are created to ensure categorical variables can be used in the analysis. Salary is not normally distributed so logarithmic and square root transformations are completed to normalize the data. The correlation between the predictive variables and the dependent variable, salary, are very strong. This indicates large information overlap; it also is a promising sign that the multiple linear regression model will be effective. Next salary is compared with different categorical variables. The first and most important observation for management is a discrepancy in salary between men and women. Women are on average making a lower salary than men. More women than men have advanced degrees and men and women have on average the same number of certifications and experience. This is notable as the gender pay gap has been a very prevalent topic of discussion in all industries. Next, people who are satisfied make more money on average than people who are not satisfied. This is useful to note as a small increase is salary could create more employee satisfaction in the firm. The last categorical variable is industry. People in the IT field are making significantly more than people in the non-IT industry. If managers are wanting to keep employees and protect them from head hunting, they should match their wages to the industry average. When determining an employee’s salary the most important variables to note are certification, education, experience, industry, and marital status. These are the variables that are used in the stepwise multiple linear regression model.The model resulted in a strong adjusted R squared and has both statistical and practical significance.
The scope of this dataset is 195 employees with their related characteristics. This includes their job satisfaction, salary, gender, education, marital status and other variables. A full list of variables and a sample of the data set can be seen below.
> knitr::kable(head(FinalProjectSalary))
| gender | industry | education | satisfied | married | experience | certification | bonus | salary | expensive | IsExpensive | IsModerate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 0 | 1 | 6 | 3 | 2147.082 | 102242 | 3 | 1 | 0 |
| 1 | 1 | 3 | 1 | 1 | 4 | 3 | 1564.640 | 97790 | 2 | 0 | 1 |
| 1 | 1 | 3 | 0 | 1 | 7 | 4 | 3080.000 | 96250 | 1 | 0 | 0 |
| 1 | 1 | 3 | 1 | 1 | 6 | 3 | 2089.164 | 94962 | 1 | 0 | 0 |
| 0 | 1 | 2 | 1 | 1 | 6 | 3 | 1989.036 | 94716 | 1 | 0 | 0 |
| 0 | 1 | 2 | 1 | 1 | 6 | 2 | 1420.080 | 94672 | 3 | 1 | 0 |
| Attribute | Description | Indicator |
|---|---|---|
| Gender | What is the gender of the person? | 1-Female, 0-Male |
| Industry | what industry does the perosn work in? | 1- IT, 0 - Non-IT |
| Education | What level of education? | 1- Undegrad, 2- Grad, 3- Advanced |
| Satisfied | Is the person satisfied in thier job? | 1-Yes, 0-No |
| Married | Is the person married? | 1-Yes, 0-No |
| Experience | How many years of job experince? | Count if years |
| Certification | How many certifications does the person have? | Count of certifications |
| Bonus | What bonus does the person get? | Dollar Amount |
| Salary | What is the persons salary? | Annual dollar amount |
| Expensive City | How expensive is the city they live in? | 1 - Low, 2-Medium, 3-High |
The summary statistics are useful to ensure the data is within the attributes stated domain. For categorical variables the summary should produce a count of frequency. For continuous variables summary statistics returns mean, median, quartiles etc. The following is a summary of the relevant statistics of each variable. ID variable will be excluded as it does not contain useful data.
> summary(FinalProjectSalary)
gender industry education satisfied married experience certification
0:105 0 : 89 1 :64 0 :103 0:107 Min. : 1.000 0:55
1: 95 1 :110 2 :89 1 : 95 1: 93 1st Qu.: 1.000 1:64
NA's: 1 3 :45 NA's: 2 Median : 3.000 2:52
NA's: 2 Mean : 3.305 3:24
3rd Qu.: 5.000 4: 5
Max. :21.000
bonus salary expensive IsExpensive IsModerate
Min. : 26.86 Min. : 26865 1:67 Min. :0.000 Min. :0.00
1st Qu.: 128.55 1st Qu.: 43930 2:76 1st Qu.:0.000 1st Qu.:0.00
Median : 284.37 Median : 54048 3:57 Median :0.000 Median :0.00
Mean : 573.76 Mean : 57565 Mean :0.285 Mean :0.38
3rd Qu.: 855.12 3rd Qu.: 68313 3rd Qu.:1.000 3rd Qu.:1.00
Max. :3080.00 Max. :102242 Max. :1.000 Max. :1.00
NA's :2
Dummy variables are created for the attribute expensive. This attribute indicates how expensive the city is that the employee lives in. Two additional columns were created which are IsExpensive and IsModerate. The R code for this is as follows
FinalProjectSalary$IsExpensive <- with(FinalProjectSalary, ifelse(expensive == 3, 1,0))
FinalProjectSalary$IsModerate <- with(FinalProjectSalary, ifelse(expensive == 2, 1,0)).
Additionally all categorical variables need to be defined as categorical in order to be used in the analysis. The code for transformation these attributes is attribute <- as.factor(attribute).
First to determine the initial quality of the data the count of missing values can be taken. It is important to be aware of any data quality issue prior to analysis. To use this dataset for proper and sound analysis the rows with missing values that are categorical (satisfied, education, industry ) will be deleted in order to preserve data quality. For the missing values that are continuous (bonus) the mean will be substituted in an attempt to not shrink the size of the data set further.
> sapply(FinalProjectSalary, function(x)(sum(is.na(x)))) # NA counts
gender industry education satisfied married
0 1 2 2 0
experience certification bonus salary expensive
0 0 2 0 0
IsExpensive IsModerate
0 0
From the histogram we can see that salary data is right skewed as it has the highest frequency is on the left and there is a tail on the right. In the boxplot it is evident that the distribution is right skewed as the top whisker is much longer than the bottom and the 50% quartile line is shifted toward the bottom. Given that the QQPlot opens up there is an indication the data is right skewed. The small downward curve at the end indicates a light tail. Lastly, the mean is greater then the median (50% Quartile) which further indicates the data is right skewed. Lastly, the normality tests displays a pvalue < 0.05 meaning the null hypothesis is accepted and salary is not normally distributed.
> with(FinalProjectSalary, Hist(salary, scale="frequency", breaks="Sturges", col="deepskyblue4", main = "Histrogram of Salary"))
> Boxplot( ~ salary, data=FinalProjectSalary, id=list(method="y"), main ="Box Plot of Salary", col="deepskyblue4")
> with(FinalProjectSalary, qqPlot(salary, dist="norm", id=list(method="y", n=2, labels=rownames(FinalProjectSalary)), main ="QQ Plot of Salary", col="deepskyblue4"))
[1] 1 2
> numSummary(FinalProjectSalary[,"salary", drop=FALSE], statistics=c("mean", "quantiles"), quantiles=c(.5))
mean 50% n
57564.64 54048 200
> normalityTest(~salary, test="shapiro.test", data=FinalProjectSalary)
Shapiro-Wilk normality test
data: salary
W = 0.95548, p-value = 6.618e-06
Lambda is 0 which would indicate a log transformation. After the log transformation the pvalue was much closer to 0.05 but still is not greater than 0.05. A square root transformation was also completed but it did not produce a pvalue larger than 0.05 either.
> TF <- (powerTransform(salary ~ 1, data=FinalProjectSalary, family="bcPower"))
> summary(TF)
bcPower Transformation to Normality
Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
Y1 -0.001 0 -0.4342 0.4322
Likelihood ratio test that transformation parameter is equal to 0
(log transformation)
LRT df pval
LR test, lambda = (0) 1.924313e-05 1 0.9965
Likelihood ratio test that no transformation is needed
LRT df pval
LR test, lambda = (1) 19.84609 1 8.3934e-06
> lamb <- TF$lambda
> TSalary <- FinalProjectSalary$salary^TF$lamb
> hist(TSalary, main="Histogram for Log Salary Trasformation",
+ xlab="Salary", col="deepskyblue4" )
> shapiro.test(TSalary)
Shapiro-Wilk normality test
data: TSalary
W = 0.98397, p-value = 0.02246
> summary(powerTransform(salary ~ 1, data=FinalProjectSalary, family="bcPower"))
bcPower Transformation to Normality
Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
Y1 -0.001 0 -0.4342 0.4322
Likelihood ratio test that transformation parameter is equal to 0
(log transformation)
LRT df pval
LR test, lambda = (0) 1.924313e-05 1 0.9965
Likelihood ratio test that no transformation is needed
LRT df pval
LR test, lambda = (1) 19.84609 1 8.3934e-06
> sqrt_sal <- sqrt(FinalProjectSalary$salary)
> hist(sqrt_sal, main="Histogram for Sqrt Salary Trasformation",
+ xlab="Salary", col="deepskyblue4")
> shapiro.test(sqrt_sal)
Shapiro-Wilk normality test
data: sqrt_sal
W = 0.9753, p-value = 0.001346
The first method of analysis of correlation that will be completed is a correlation matrix. This matrix compares the correlations of all numerical variables. Conditional formatting was used and only correlations stronger then 0.6 and not equal to 1 are indicated by bolded text. There is very strong correlations between these variables. This does indicate a good base for prediction, but it also indicates there is a lot of information overlap meaning the variables are not truly independent. These same values are reflected visually using the correlation plot below. The more bulbus and ligher the oval the weaker the correlation and the narrower and darker the oval the stronger the correlation. This plot indicates only positive correlations between variables and quite strong correlations.
> library(pander)
> t <- cor(FinalProjectSalary[,c("bonus","experience","salary")], use="complete")
> emphasize.strong.cells(which(t > 0.6 & t != 1 , arr.ind = TRUE))
> pander(t)
| bonus | experience | salary | |
|---|---|---|---|
| bonus | 1 | 0.6842 | 0.8425 |
| experience | 0.6842 | 1 | 0.6361 |
| salary | 0.8425 | 0.6361 | 1 |
> library(corrplot)
corrplot 0.92 loaded
> C <- cor(FinalProjectSalary[, c("bonus", "experience", "salary")], use = "complete")
> corrplot(C, order="FPC", method = "ellipse")
Bonuses, certification and experience are all highly correlated with salary. Often salaries are determined by number of certification and years of experience. Bonuses are commonly a percentage of salary or are correlated in some fashion. Additionally experience and certification are not continuous and represent a smaller range of number. For these reasons salary will be primarily used as the variable that will be compared in accordance with different categorical variables.
Observations: The box plot on the left labeled 0 is men’s salaries and the box plot on the right labeled 1 is women’s salaries. Women have two outliers which are higher than the top whisker. This could be two women who hold C-suite jobs so their salaries are much higher than the average. It is interesting to note that the median salary of a women is lower than the median salary of a man. It is also important to note that the 3rd quartile of the men’s box plot end a lot higher and the whisker is longer meaning the top 50% of men are making more than the top 50% of women. The 1st quartile and bottom whisker in both plots are about the same size meaning mean and women earning less than $54,048 are making approximately the same amount of money. Contributing factors to a higher salary in men could be experince or certifications so an analysis of those variables will be conducted. Both women and men have the same median experience with men having a slightly longer whisker. This shows the top 25% of mens have a bit more experince than the top 25% of women by about 1 year. The median level of certification is 1 for both men and women. 50% of women and men have 0 certifications. The 3rd quartile range for both is between 1 and 2 certifications. The top 25% of women have between 2 and 4 certification while the top 25% of men have been 2 and 3 certifications. In a comparison of education it can be seen that a higher percentage of women have advanced degrees than men.
Interpretation: This gender pay gap has been a very prevalent topic of discussion in all industries. According to the data women are just as qualified and are technically more educated. This could be a coincidence but it should be a statistic of concern that should be investigated further.
> Boxplot(salary~gender, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")
[1] "2" "3"
> Boxplot(experience~gender, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")
[1] "200"
> Boxplot(certification~gender, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")
Warning in Ops.factor(yy, b$stats[1, group]): '<' not meaningful for factors
Warning in Ops.factor(yy, b$stats[5, group]): '>' not meaningful for factors
[1] NA NA
> knitr::kable(data.frame(Education=rep(c('Undergrad', 'Grad', 'Advanced')),
+ Men=rep(c('18%', '48%','34%'),),
+ Women=rep(c('16%','46%','38%'))))
| Education | Men | Women |
|---|---|---|
| Undergrad | 18% | 16% |
| Grad | 48% | 46% |
| Advanced | 34% | 38% |
Observations: The median salary increases as education increases. It is interesting to note that the top 25% of people with undergraduate degrees are earning more on average than the top 25% of people with graduate degrees. There are many outliers in the graduate box plot.
Interpretation: It is logical that salary increases with more education as people become more qualified. An explanation for the oultiers is that there are several positions such as lawyer and businesses manager that commonly have graduate degrees and make high salaries. These people would likely make more than a person with a masters degrees in arts or sciences who is looking to do research or work in a non-profit field.
> Boxplot(salary~education, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")
[1] "1" "5" "6" "9" "11" "19"
Observations: Married people are represented by the box plot on the right with the 1 label. The median salary increases greatly if a person is married. The top 50% of married people are on average making more than the top 25% of unmarried people.
Interpretation: Does this mean if you get married you will make more money? No, marriage has many other variables that factor into salary. Age is the main one. The average age of marriage in Canada is about 30 years old. This means people are later in their careers which often comes with a salary increase. Additionally, married people look for more stability and may need to be able to potentially support children and a mortgage. Married people are often not in the financial position to take jobs with lower salaries even if the job might be more enjoyable.
> Boxplot(salary~married, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")
Observations: People who are not satisfied with their job are represented by 0 and people who are satisfied are represented by 1. It is evident that people who are satisfied have a lightly higher median salary. The 3rd quartile of satisfied people reaches higher than the 3rd quartile of unsatisfied people meaning that 25% of satisfied people are making more than the 25% of unsatisfied people. The whiskers reach to the same level on both meaning the top 25% of people make approximately the same amount.
Interpretation: Do satisfied people make a better salary because they are happy so they perform better or are people more satisfied because they have a higher salary. This correlation is difficult to determine and will take further analysis to conclude.
> Boxplot(salary~satisfied, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")
[1] "1"
Observations: People in the tech industry are denoted by the box plot labeled 1. People in the tech industry have a higher median salary than those who are not. The top 50% of people in tech are making the same or more than the top 25% of people in non-tech. The whisker on the tech plot reached much higher meaning the top 25% of tech people earn more than the top 25% of non-tech people. There are two outliers in non-tech jobs.
Interpretation: Tech is a fast growing field with jobs that are heavily in demand such as computer scientists and data scientists. The world is becoming more data driven and the competition to acquire top tech talent is fierce. To be competitive tech firms must offer higher salaries. This is likely the reason there is a discrepancy in salaries between tech and non-tech people. There are a few possible explanations for the salary outliers in non-tech industry. This could be people working at C-suite executives in non-tech industries or could be doctors or lawyers, which are non-tech high paying jobs.
> Boxplot(salary~industry, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")
[1] "11" "13"
Observations: There are an equal amount of women and men working in non-tech. There are more men working in tech than women. Looking at the tech industry a larger percentage of women make an above average salary than men. Looking at non-tech there are a larger percentage of women making below average salary than men.
Interpretation: The tech industry is historically male dominated so it seems there has been an emphasis put on paying women in tech competitively, this could be why a larger percentage of women are getting paid above average.
> FinalProjectSalary$SalAbvBlwAvg <- with(FinalProjectSalary, ifelse(salary>57565, 'Above', 'Below'))
> library(abind, pos=27)
> local({
+ .Table <- xtabs(~SalAbvBlwAvg+gender+industry, data=FinalProjectSalary)
+ cat("\nFrequency table:\n")
+ print(.Table)
+ cat("\nColumn percentages:\n")
+ print(colPercents(.Table))
+ })
Frequency table:
, , industry = 0
gender
SalAbvBlwAvg 0 1
Above 16 11
Below 28 34
, , industry = 1
gender
SalAbvBlwAvg 0 1
Above 33 28
Below 28 21
Column percentages:
, , industry = 0
gender
SalAbvBlwAvg 0 1
Above 36.4 24.4
Below 63.6 75.6
Total 100.0 100.0
Count 44.0 45.0
, , industry = 1
gender
SalAbvBlwAvg 0 1
Above 54.1 57.1
Below 45.9 42.9
Total 100.0 100.0
Count 61.0 49.0
> Sal_Reg1 <- lm(salary~bonus+certification+experience, data=SalaryFinal)
> summary(Sal_Reg1)
Call:
lm(formula = salary ~ bonus + certification + experience, data = SalaryFinal)
Residuals:
Min 1Q Median 3Q Max
-33515 -5571 -490 5026 31085
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42244.760 1262.760 33.454 <2e-16 ***
bonus 21.866 2.306 9.481 <2e-16 ***
certification -9.026 1237.716 -0.007 0.994
experience 835.627 379.877 2.200 0.029 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9373 on 191 degrees of freedom
Multiple R-squared: 0.7187, Adjusted R-squared: 0.7143
F-statistic: 162.7 on 3 and 191 DF, p-value: < 2.2e-16
> equatiomatic::extract_eq(Sal_Reg1)
\[ \operatorname{salary} = \alpha + \beta_{1}(\operatorname{bonus}) + \beta_{2}(\operatorname{certification}) + \beta_{3}(\operatorname{\experience}) + \epsilon \]
> equatiomatic::extract_eq(Sal_Reg1, use_coefs = TRUE)
\[ \operatorname{\widehat{salary}} = 42244.76 + 21.87(\operatorname{bonus}) - 9.03(\operatorname{certification}) + 835.63(\operatorname{\experience}) \] The first output that will be investigated is the residual values. The regression model skews negatively. This can be noticed by identifying the negative median value and the deviation of Q1 vs Q3 from the median. Next the coefficients can be used to create a linear expression.
Example using mean values for all explanatory variables \[y= 21.84(573.76)-9.026(1.3)+835.627(3.305)+42244.76\] \[y=$57,525.69\]
Taking into account standard error \[max = 24.172(Bonus)+1228.69(Certification)+1215.504(Experience)+43545.03\] \[min = 19.56(Bonus)-1246.742(Certification)+455.75(Experience)+40944.49\]
From the asterisks it can be interpreted that bonus and intercept have highly significant p-values as they have 3 askerisks. Experience has one asterisk indicating a significant p-value. Certification does not have an astericks meaning it does not have a significant p-value. The small p-values in experince, bonus and intercept mean that the null is rejected and that there is a relationship between these variables and salary.
An adjusted R squared below 0.4 is seen as a low correlation and above 0.7 is seen as a very high correlation. From the Adjusted R squared in this model is can be concluded there is a strong correlation as it is 0.7143. The intercept represents the base salary without any experience, bonus or salary. This would be considered starting salary and would be the amount employees get paid directly out of school.
Backwards stepwise multiple linear regression is used to find the optimal linear equation with the lowest Akaike information criteria (AIC). This form of regression begins with all variables and removed one at a time until all the remaining explanatory variables are statistically significant.
\[y= salary\] \[y = 21.85(Bonus)+835.55(Experience)+42240.70 \]
Example using mean values for all explanatory variables \[y= 21.85(573.76)+835.55(3.305)+42240.70 \] \[y=$57,538.85\]
Overall the best model was the one using bonus and experience. So a salary can be estimated by inputing the before mentioned variables into the equation formulated by the stepwise multiple regression analysis. However, bonuses are often derived from salary so other models that are more applicable should be created.
> stepwise(Sal_Reg1, direction='backward', criterion='AIC')
Direction: backward
Criterion: AIC
Start: AIC=3570.73
salary ~ bonus + certification + experience
Df Sum of Sq RSS AIC
- certification 1 4671 1.6779e+10 3568.7
<none> 1.6779e+10 3570.7
- experience 1 425086228 1.7204e+10 3573.6
- bonus 1 7897167079 2.4676e+10 3643.9
Step: AIC=3568.73
salary ~ bonus + experience
Df Sum of Sq RSS AIC
<none> 1.6779e+10 3568.7
- experience 1 4.2533e+08 1.7205e+10 3571.6
- bonus 1 1.9405e+10 3.6184e+10 3716.6
Call:
lm(formula = salary ~ bonus + experience, data = SalaryFinal)
Coefficients:
(Intercept) bonus experience
42240.70 21.85 835.55
This model will use all variables to create a linear equation. Bonus will be excluded from the model because it has strong variable interaction. Often bonuses are a percentage of salary. Since salary would be determined first it would be inappropriate to use bonus.
> AllRegSal <- lm(salary~certification+education+expensive+experience+gender+industry+married+satisfied, data=SalaryFinal)
> summary(AllRegSal)
Call:
lm(formula = salary ~ certification + education + expensive +
experience + gender + industry + married + satisfied, data = SalaryFinal)
Residuals:
Min 1Q Median 3Q Max
-48626 -6345 -370 6227 27108
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29778.5 3027.9 9.835 < 2e-16 ***
certification 6765.3 927.4 7.295 8.36e-12 ***
education 4725.0 1056.9 4.471 1.35e-05 ***
expensive -889.0 955.5 -0.930 0.353370
experience 2064.4 385.2 5.359 2.46e-07 ***
gender -760.8 1486.7 -0.512 0.609440
industry 3793.6 1570.2 2.416 0.016659 *
married 5800.3 1623.7 3.572 0.000451 ***
satisfied 301.5 1497.3 0.201 0.840615
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10240 on 186 degrees of freedom
Multiple R-squared: 0.6727, Adjusted R-squared: 0.6587
F-statistic: 47.79 on 8 and 186 DF, p-value: < 2.2e-16
> equatiomatic::extract_eq(AllRegSal)
\[ \operatorname{salary} = \alpha + \beta_{1}(\operatorname{certification}) + \beta_{2}(\operatorname{education}) + \beta_{3}(\operatorname{\expensive}) + \beta_{4}(\operatorname{\experience}) + \beta_{5}(\operatorname{gender}) + \beta_{6}(\operatorname{industry}) + \beta_{7}(\operatorname{married}) + \beta_{8}(\operatorname{satisfied}) + \epsilon \]
> equatiomatic::extract_eq(AllRegSal, use_coefs = TRUE)
\[ \operatorname{\widehat{salary}} = 29778.48 + 6765.32(\operatorname{certification}) + 4725.04(\operatorname{education}) - 888.99(\operatorname{\expensive}) + 2064.44(\operatorname{\experience}) - 760.82(\operatorname{gender}) + 3793.62(\operatorname{industry}) + 5800.33(\operatorname{married}) + 301.54(\operatorname{satisfied}) \]
In this model the intercept is 29778.5. However, values like education cannot be 0. Meaning the lowest salary for someone with a degree is about 5000 dollars higher. Additionally people working in IT will be paid about 4000 dollars more than non tech. In this model the most significant p-values are those for certification, education, experience, industry and marital status.The adjusted R squared is 0.6587 which is considered a well fitting model.
Backwards stepwise linear regression is used to find the optimal linear equation with the lowest Akaike information criteria (AIC). In this model all variables are used. The optimal equation can be found below. This equation indicates
\[y= salary\] \[y = 6906(certification)+4738(education)+2052(experience)+3901(industry)+5623(married)+27694\]
After creating a summary of the model created by the stepwise regression. It is notable that the multiple R squared regression is higher for model 2.1 than 2.0. It can be concluded that model 2.1 is a better fit than model 2.0. The multiple R squared for model 2.1 is smaller than that of model 1, most likely because of the variable interaction between salary and bonus.
Model 2.1 provides the most practical significance and applicability even if it is not the most statistically significant.
> stepwise(AllRegSal, direction='backward', criterion='AIC')
Direction: backward
Criterion: AIC
Start: AIC=3610.24
salary ~ certification + education + expensive + experience +
gender + industry + married + satisfied
Df Sum of Sq RSS AIC
- satisfied 1 4256596 1.9525e+10 3608.3
- gender 1 27484685 1.9549e+10 3608.5
- expensive 1 90851296 1.9612e+10 3609.1
<none> 1.9521e+10 3610.2
- industry 1 612614767 2.0134e+10 3614.3
- married 1 1339372447 2.0861e+10 3621.2
- education 1 2097842348 2.1619e+10 3628.1
- experience 1 3014227158 2.2535e+10 3636.2
- certification 1 5585374932 2.5107e+10 3657.3
Step: AIC=3608.29
salary ~ certification + education + expensive + experience +
gender + industry + married
Df Sum of Sq RSS AIC
- gender 1 28470318 1.9554e+10 3606.6
- expensive 1 89567594 1.9615e+10 3607.2
<none> 1.9525e+10 3608.3
- industry 1 626256510 2.0152e+10 3612.4
- married 1 1360255640 2.0886e+10 3619.4
- education 1 2095511973 2.1621e+10 3626.2
- experience 1 3054675395 2.2580e+10 3634.6
- certification 1 5582338749 2.5108e+10 3655.3
Step: AIC=3606.57
salary ~ certification + education + expensive + experience +
industry + married
Df Sum of Sq RSS AIC
- expensive 1 97067754 1.9651e+10 3605.5
<none> 1.9554e+10 3606.6
- industry 1 643513336 2.0197e+10 3610.9
- married 1 1340095099 2.0894e+10 3617.5
- education 1 2072401334 2.1626e+10 3624.2
- experience 1 3091685588 2.2646e+10 3633.2
- certification 1 5619470281 2.5173e+10 3653.8
Step: AIC=3605.54
salary ~ certification + education + experience + industry +
married
Df Sum of Sq RSS AIC
<none> 1.9651e+10 3605.5
- industry 1 655155031 2.0306e+10 3609.9
- married 1 1284089413 2.0935e+10 3615.9
- education 1 2124956397 2.1776e+10 3623.6
- experience 1 3025861222 2.2677e+10 3631.5
- certification 1 5945662017 2.5597e+10 3655.1
Call:
lm(formula = salary ~ certification + education + experience +
industry + married, data = SalaryFinal)
Coefficients:
(Intercept) certification education experience industry
27694 6906 4738 2052 3901
married
5623
> StepAllRegSal <- lm(salary~certification+education+experience+industry+married, data=SalaryFinal)
> summary(StepAllRegSal)
Call:
lm(formula = salary ~ certification + education + experience +
industry + married, data = SalaryFinal)
Residuals:
Min 1Q Median 3Q Max
-48670 -6587 139 5968 26591
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27694.3 2170.1 12.762 < 2e-16 ***
certification 6905.8 913.2 7.562 1.68e-12 ***
education 4738.2 1048.1 4.521 1.09e-05 ***
experience 2052.5 380.5 5.395 2.04e-07 ***
industry 3901.0 1554.1 2.510 0.012905 *
married 5623.0 1600.0 3.514 0.000552 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10200 on 189 degrees of freedom
Multiple R-squared: 0.6706, Adjusted R-squared: 0.6618
F-statistic: 76.94 on 5 and 189 DF, p-value: < 2.2e-16
Statistical significance does not always result in practical significance. There are numerous conclusion that can be drawn from this report. First, there seems to be an active gender wage gap, surprisingly it is more prevalent in non-tech industries. There is a bias in industry wages with people in tech making more on average. Education and experience both create salary increases. It became evident that bonuses are derived from salaries creating models that were statistically significant but not applicable in the real world. Many of the variables surround one variable that wasn’t included, age. Age often determines marital status, education level, years of experience and amount of certifications. All of these variables are highly correlated with age and therefore salary is highly correlated with age. People “climb the corporate ladder” with each new title a salary bump is common. Overall the best predictors of salary are certification, education, experience, industry, and marital status proven using backward stepwise mulitple linear regression.
> # include this code chunk as-is to set options
> knitr::opts_chunk$set(comment=NA, prompt=TRUE, out.width=750, fig.height=8, fig.width=8)
> library(Rcmdr)
> library(car)
> library(RcmdrMisc)
> library(bookdown)
> library(DiagrammeR)
> library(equatiomatic)
> library(kableExtra)
> library(knitr)
> library(xtable)
> # include this code chunk as-is to enable 3D graphs
> library(rgl)
> knitr::knit_hooks$set(webgl = hook_webgl)
> knitr::include_graphics("/Users/erindane/Desktop/R Studios /Picture1.png")
> DiagrammeR::grViz("digraph {
+ graph [layout = dot, rankdir = TB]
+
+ node [shape = rectangle]
+ rec1 [label = 'Step 1. Introduction']
+ rec2 [label = 'Step 2. Exploratory Analysis']
+ rec3 [label = 'Step 3. Predictive Analysis']
+ rec4 [label = 'Step 4. Recommendations/Conclusions']
+
+ # edge definitions with the node IDs
+ rec1 -> rec2 -> rec3 -> rec4
+ }",
+ height = 500)
> FinalProjectSalary <- readXL("/Users/erindane/Desktop/R Studios /FinalProjectSalary.xlsx",
+ rownames=FALSE, header=TRUE, na="", sheet="Sheet1", stringsAsFactors=TRUE)
> FinalProjectSalary <- within(FinalProjectSalary, {
+ satisfied <- as.factor(satisfied)
+ education <- as.factor(education)
+ expensive <- as.factor(expensive)
+ gender <- as.factor(gender)
+ industry <- as.factor(industry)
+ married <- as.factor(married)
+ certification <- as.factor(certification)
+ id <- NULL
+ })
> FinalProjectSalary$IsExpensive <- with(FinalProjectSalary, ifelse(expensive == 3, 1,0))
> FinalProjectSalary$IsModerate <- with(FinalProjectSalary, ifelse(expensive == 2, 1,0))
> knitr::kable(head(FinalProjectSalary))
> knitr::kable(data.frame(Attribute=rep(c('Gender', 'Industry', 'Education','Satisfied','Married','Experience', 'Certification','Bonus','Salary','Expensive City')),
+ Description=rep(c('What is the gender of the person?', 'what industry does the perosn work in?','What level of education?','Is the person satisfied in thier job?','Is the person married?','How many years of job experince?','How many certifications does the person have?','What bonus does the person get?', 'What is the persons salary?','How expensive is the city they live in?'),),
+ Indicator=rep(c('1-Female, 0-Male','1- IT, 0 - Non-IT','1- Undegrad, 2- Grad, 3- Advanced', '1-Yes, 0-No','1-Yes, 0-No', 'Count if years', 'Count of certifications', 'Dollar Amount','Annual dollar amount','1 - Low, 2-Medium, 3-High'))))
> summary(FinalProjectSalary)
> sapply(FinalProjectSalary, function(x)(sum(is.na(x)))) # NA counts
> with(FinalProjectSalary, Hist(salary, scale="frequency", breaks="Sturges", col="deepskyblue4", main = "Histrogram of Salary"))
> Boxplot( ~ salary, data=FinalProjectSalary, id=list(method="y"), main ="Box Plot of Salary", col="deepskyblue4")
> with(FinalProjectSalary, qqPlot(salary, dist="norm", id=list(method="y", n=2, labels=rownames(FinalProjectSalary)), main ="QQ Plot of Salary", col="deepskyblue4"))
> numSummary(FinalProjectSalary[,"salary", drop=FALSE], statistics=c("mean", "quantiles"), quantiles=c(.5))
> normalityTest(~salary, test="shapiro.test", data=FinalProjectSalary)
> TF <- (powerTransform(salary ~ 1, data=FinalProjectSalary, family="bcPower"))
> summary(TF)
> lamb <- TF$lambda
> TSalary <- FinalProjectSalary$salary^TF$lamb
> hist(TSalary, main="Histogram for Log Salary Trasformation",
+ xlab="Salary", col="deepskyblue4" )
> shapiro.test(TSalary)
>
> summary(powerTransform(salary ~ 1, data=FinalProjectSalary, family="bcPower"))
> sqrt_sal <- sqrt(FinalProjectSalary$salary)
> hist(sqrt_sal, main="Histogram for Sqrt Salary Trasformation",
+ xlab="Salary", col="deepskyblue4")
> shapiro.test(sqrt_sal)
> library(pander)
> t <- cor(FinalProjectSalary[,c("bonus","experience","salary")], use="complete")
> emphasize.strong.cells(which(t > 0.6 & t != 1 , arr.ind = TRUE))
> pander(t)
> library(corrplot)
> C <- cor(FinalProjectSalary[, c("bonus", "experience", "salary")], use = "complete")
> corrplot(C, order="FPC", method = "ellipse")
> Boxplot(salary~gender, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")
> Boxplot(experience~gender, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")
> Boxplot(certification~gender, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")
> knitr::kable(data.frame(Education=rep(c('Undergrad', 'Grad', 'Advanced')),
+ Men=rep(c('18%', '48%','34%'),),
+ Women=rep(c('16%','46%','38%'))))
> Boxplot(salary~education, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")
> Boxplot(salary~married, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")
> Boxplot(salary~satisfied, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")
> Boxplot(salary~industry, data=FinalProjectSalary, id=list(method="y"),col="deepskyblue4")
> FinalProjectSalary$SalAbvBlwAvg <- with(FinalProjectSalary, ifelse(salary>57565, 'Above', 'Below'))
> library(abind, pos=27)
> local({
+ .Table <- xtabs(~SalAbvBlwAvg+gender+industry, data=FinalProjectSalary)
+ cat("\nFrequency table:\n")
+ print(.Table)
+ cat("\nColumn percentages:\n")
+ print(colPercents(.Table))
+ })
> SalaryFinal <- readXL("/Users/erindane/Desktop/R Studios /FinalProjectSalaryPP.xlsx",
+ rownames=FALSE, header=TRUE, na="", sheet="Sheet1", stringsAsFactors=TRUE)
> SalaryFinal <- within(SalaryFinal, {
+ id <- NULL
+ })
> SalaryFinal$IsExpensive <- with(SalaryFinal, ifelse(expensive == 3, 1,0))
> SalaryFinal$IsModerate <- with(SalaryFinal, ifelse(expensive == 2, 1,0))
> Sal_Reg1 <- lm(salary~bonus+certification+experience, data=SalaryFinal)
> summary(Sal_Reg1)
> equatiomatic::extract_eq(Sal_Reg1)
> equatiomatic::extract_eq(Sal_Reg1, use_coefs = TRUE)
> stepwise(Sal_Reg1, direction='backward', criterion='AIC')
> AllRegSal <- lm(salary~certification+education+expensive+experience+gender+industry+married+satisfied, data=SalaryFinal)
> summary(AllRegSal)
> equatiomatic::extract_eq(AllRegSal)
> equatiomatic::extract_eq(AllRegSal, use_coefs = TRUE)
> stepwise(AllRegSal, direction='backward', criterion='AIC')
> StepAllRegSal <- lm(salary~certification+education+experience+industry+married, data=SalaryFinal)
> summary(StepAllRegSal)
U of Windsor, danee@uwindsor.ca↩︎