The dataset is loaded as follows:
setwd("C:\\Users\\Tejajay\\Desktop\\Internship\\3. Data Analytics")
# The file name is passed directly; paste() on a single string changes nothing.
mbass <- read.csv("MBA Starting Salaries Data.csv")
We attempt to determine the impact of factors such as age, sex, GMAT scores, work experience and degree satisfaction on the starting salary of fresh MBA graduates. The average starting salary across all students in the dataset, including those recorded with a salary of 0, is:
mean(mbass$salary)
## [1] 38730.53
Summary statistics are as follows:
Average age:
mean(mbass$age)
## [1] 27.35766
Male vs Female students:
table(mbass$sex)
##
## 1 2
## 206 68
GMAT scores (total score, then quantitative, verbal and overall percentiles):
mean(mbass$gmat_tot)
## [1] 619.4526
mean(mbass$gmat_qpc)
## [1] 80.64234
mean(mbass$gmat_vpc)
## [1] 78.32117
mean(mbass$gmat_tpc)
## [1] 84.19708
Average GPA for the Spring and Fall semesters:
mean(mbass$s_avg)
## [1] 3.025401
mean(mbass$f_avg)
## [1] 3.061533
Average quartile ranking:
mean(mbass$quarter)
## [1] 2.478102
Average work experience:
mean(mbass$work_yrs)
## [1] 3.872263
First language of students:
table(mbass$frstlang)
##
## 1 2
## 242 32
Satisfaction of Students with their degree:
mean(mbass$satis)
## [1] 4.631387
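The column-by-column summaries above can also be produced in one pass. The sketch below is only an illustration: demo is a small made-up data frame that borrows a few column names from mbass, and its values are not taken from the survey.

```r
# Hypothetical stand-in for mbass: column names follow the report, values are invented.
demo <- data.frame(
  age      = c(23, 25, 27, 31, 24),
  gmat_tot = c(620, 660, 580, 710, 640),
  work_yrs = c(1, 2, 4, 8, 2),
  sex      = c(1, 2, 1, 1, 2)
)

# Mean of every numeric column in one call instead of one mean() per column.
num_cols  <- c("age", "gmat_tot", "work_yrs")
col_means <- sapply(demo[num_cols], mean)
print(round(col_means, 2))

# Category codes such as sex are better summarised as counts than as means.
sex_counts <- table(demo$sex)
print(sex_counts)
```

The same pattern should carry over to the real data by replacing demo with mbass and extending num_cols.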
We shall now try to understand how salary changes with respect to a few factors such as age, sex, GMAT percentile and work experience.
# table() cross-tabulates salary values against each factor, so every bar is a stacked count of students.
ssvsage <- table(mbass$salary, mbass$age)
barplot(ssvsage, main = "Starting Salary vs Age", xlab = "Age", ylab = "Number of Students")
ssvssex <- table(mbass$salary, mbass$sex)
barplot(ssvssex, main = "Starting Salary vs Sex", xlab = "Sex", ylab = "Number of Students")
ssvsgmat <- table(mbass$salary, mbass$gmat_tpc)
barplot(ssvsgmat, main = "Starting Salary vs GMAT Percentile", xlab = "GMAT Percentile", ylab = "Number of Students")
ssvswex <- table(mbass$salary, mbass$work_yrs)
barplot(ssvswex, main = "Starting Salary vs Past Work Experience", xlab = "Work Experience", ylab = "Number of Students")
ssvssatis <- table(mbass$salary, mbass$satis)
barplot(ssvssatis, main = "Starting Salary vs Satisfaction", xlab = "Satisfaction with MBA degree", ylab = "Number of Students")
Each bar above stacks counts of students, because table() cross-tabulates individual salary values against the grouping variable, so bar height shows how many students fall in a group rather than how much they earn. Read with that caveat, the plots suggest that the highest-paid candidates are in the 24-26 age group and that male candidates tend to have higher starting salaries than female candidates, which points to a possible gender gap. The candidate's total GMAT percentile does not seem to affect the starting salary much, as no clear pattern appears in the graph. Candidates with around 2 years of work experience seem to receive higher salaries than their more experienced counterparts, and the students receiving the highest salaries seem to be highly satisfied with their degrees.
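To compare salary levels across groups directly, one option is to compute the mean salary per group and draw a boxplot. The following is a minimal sketch on invented data, not a re-run of the analysis: demo and its values are hypothetical, and only the column names salary and sex come from the dataset.

```r
# Hypothetical data: six students with invented salaries and sex codes.
demo <- data.frame(
  salary = c(90000, 105000, 98000, 120000, 101000, 95000),
  sex    = c(1, 1, 2, 1, 2, 2)
)

# Mean salary within each sex code.
mean_by_sex <- tapply(demo$salary, demo$sex, mean)
print(mean_by_sex)

# A boxplot puts salary itself on the y-axis, one box per group.
boxplot(salary ~ sex, data = demo,
        main = "Starting Salary by Sex",
        xlab = "Sex", ylab = "Starting Salary")
```

With the real data, the same calls on the rows with salary > 0 would show whether the gaps described above hold for salary itself.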
plot(mbass$satis, mbass$frstlang, main = "Satisfaction vs First Language", xlab = "Satisfaction with Degree", ylab = "First Language" )
abline(lm(mbass$frstlang~mbass$satis), col = "red")
plot(mbass$gmat_tpc, mbass$work_yrs, main = "Total GMAT Percentile vs Work Experience", xlab = "GMAT Percentile", ylab = "Work Experience")
abline(lm(mbass$work_yrs~mbass$gmat_tpc), col = "blue")
In the scatterplots above, the fitted line suggests that satisfaction with the degree is lower for students with the higher first-language code, i.e. students whose first language is not English appear less satisfied with their MBA degree than native English speakers. Note that frstlang is a two-level category code rather than a score, so the line only indicates the direction of the difference. If salary and satisfaction are related, as the earlier barplot suggested, this may put native English speakers at an advantage in terms of salary. Similarly, students with the highest GMAT percentiles tend to have the least work experience. This could matter to prospective students, given that both work experience and GMAT percentile may play a part in determining starting salaries.
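A fitted line on a scatterplot shows direction but not strength. One way to quantify a relationship such as GMAT percentile versus work experience is a Pearson correlation test. The sketch below uses invented values in a hypothetical data frame demo; only the column names follow the dataset.

```r
# Hypothetical data: GMAT percentile and years of work experience for six students.
demo <- data.frame(
  gmat_tpc = c(95, 90, 80, 70, 85, 60),
  work_yrs = c(1, 2, 3, 5, 2, 6)
)

# cor.test() reports the correlation coefficient together with a p-value
# and a confidence interval, which a plotted line alone does not give.
res <- cor.test(demo$gmat_tpc, demo$work_yrs)
r   <- res$estimate[["cor"]]
print(res)
```

A coefficient near -1 or 1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship, whatever the slope of the drawn line.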
The dependence of the above-discussed factors on each other is described in the correlogram below:
corrgram::corrgram(mbass)
To develop a salary model, we look only at students who actually got a job. For this purpose we want only students who answered all the questions in the survey; the code below uses a positive recorded salary (salary > 0) as the selection criterion and keeps the columns from age through satis.
mbasalary <- subset(mbass, salary>0, select = age:satis)
View(mbasalary)
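If unanswered questions were stored as NA, an explicit completeness filter could follow the salary filter. This is a sketch under that assumption: the report does not say how missing answers are encoded, and demo is a made-up data frame.

```r
# Hypothetical data with NA marking an unanswered question (an assumption).
demo <- data.frame(
  age    = c(24, 26, NA, 30, 28),
  salary = c(95000, 0, 100000, 110000, 0),
  satis  = c(6, 5, 7, NA, 4)
)

# Step 1: keep students with a recorded positive salary.
placed <- subset(demo, salary > 0)

# Step 2: keep only rows with no missing value in any column.
placed_complete <- placed[complete.cases(placed), ]

print(nrow(placed))
print(nrow(placed_complete))
```

Comparing the two row counts shows how many placed students would be dropped for incomplete answers.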
We will try to develop a model using the following techniques:
1. Chi-square tests
2. T-tests
3. Regression models
chisq.test(mbasalary$age, mbasalary$salary)
## Warning in chisq.test(mbasalary$age, mbasalary$salary): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mbasalary$age and mbasalary$salary
## X-squared = 717.62, df = 574, p-value = 3.929e-05
chisq.test(mbasalary$sex, mbasalary$salary)
## Warning in chisq.test(mbasalary$sex, mbasalary$salary): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mbasalary$sex and mbasalary$salary
## X-squared = 52.681, df = 41, p-value = 0.1045
chisq.test(mbasalary$gmat_tpc, mbasalary$salary)
## Warning in chisq.test(mbasalary$gmat_tpc, mbasalary$salary): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mbasalary$gmat_tpc and mbasalary$salary
## X-squared = 1422.2, df = 1230, p-value = 0.0001065
chisq.test(mbasalary$s_avg, mbasalary$salary)
## Warning in chisq.test(mbasalary$s_avg, mbasalary$salary): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mbasalary$s_avg and mbasalary$salary
## X-squared = 792.97, df = 861, p-value = 0.9524
chisq.test(mbasalary$f_avg, mbasalary$salary)
## Warning in chisq.test(mbasalary$f_avg, mbasalary$salary): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mbasalary$f_avg and mbasalary$salary
## X-squared = 596.28, df = 574, p-value = 0.2518
chisq.test(mbasalary$quarter, mbasalary$salary)
## Warning in chisq.test(mbasalary$quarter, mbasalary$salary): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mbasalary$quarter and mbasalary$salary
## X-squared = 129.85, df = 123, p-value = 0.3186
chisq.test(mbasalary$work_yrs, mbasalary$salary)
## Warning in chisq.test(mbasalary$work_yrs, mbasalary$salary): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mbasalary$work_yrs and mbasalary$salary
## X-squared = 535.23, df = 451, p-value = 0.003809
chisq.test(mbasalary$frstlang, mbasalary$salary)
## Warning in chisq.test(mbasalary$frstlang, mbasalary$salary): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mbasalary$frstlang and mbasalary$salary
## X-squared = 69.847, df = 41, p-value = 0.003296
chisq.test(mbasalary$satis, mbasalary$salary)
## Warning in chisq.test(mbasalary$satis, mbasalary$salary): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mbasalary$satis and mbasalary$salary
## X-squared = 109.1, df = 164, p-value = 0.9997
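Taken at face value, the chi-square tests suggest that age, GMAT percentile, work experience and first language are associated with starting salary (p < 0.05), whereas sex, average GPA for the Spring and Fall semesters, quartile ranking and satisfaction with the degree show no significant association (p > 0.05). These results need caution. Every test raised the warning that the chi-squared approximation may be incorrect: salary is treated as a category with one level per distinct value (df = 41 for the two-level sex variable implies 42 distinct salaries), so most cells of each table hold very few students. A p-value above 0.05 also means only that no association was detected, not that the factor has no effect. This may explain why the results contrast with the plots observed earlier.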
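One way to reduce the sparse-table problem behind the warnings is to group salary into a few bands before testing and to request a simulated p-value. The sketch below runs on synthetic data: demo, the band labels and the choice of a median split are all illustrative assumptions, not part of the original analysis.

```r
set.seed(42)

# Synthetic data: 100 students, 20 of them with first-language code 2.
n <- 100
demo <- data.frame(
  frstlang = rep(c(1, 2), times = c(80, 20)),
  salary   = round(rnorm(n, mean = 100000, sd = 15000))
)

# Split salary at its median so the table has 2 x 2 cells instead of
# one column per distinct salary value.
demo$salary_band <- cut(
  demo$salary,
  breaks = quantile(demo$salary, c(0, 0.5, 1)),
  include.lowest = TRUE,
  labels = c("lower half", "upper half")
)

tab <- table(demo$frstlang, demo$salary_band)
print(tab)

# simulate.p.value = TRUE replaces the chi-squared approximation with a
# Monte Carlo p-value based on B random tables.
res <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
print(res)
```

Binning discards information about exact salaries, so the band boundaries are a design choice that should be reported with the result.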
t.test(mbasalary$age, mbasalary$salary)
##
## Welch Two Sample t-test
##
## data: mbasalary$age and mbasalary$salary
## t = -58.503, df = 102, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -106496.23 -99511.69
## sample estimates:
## mean of x mean of y
## 26.7767 103030.7379
t.test(mbasalary$sex, mbasalary$salary)
##
## Welch Two Sample t-test
##
## data: mbasalary$sex and mbasalary$salary
## t = -58.517, df = 102, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -106521.71 -99537.17
## sample estimates:
## mean of x mean of y
## 1.300971e+00 1.030307e+05
t.test(mbasalary$gmat_tpc, mbasalary$salary)
##
## Welch Two Sample t-test
##
## data: mbasalary$gmat_tpc and mbasalary$salary
## t = -58.47, df = 102, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -106438.49 -99453.94
## sample estimates:
## mean of x mean of y
## 84.52427 103030.73786
t.test(mbasalary$s_avg, mbasalary$salary)
##
## Welch Two Sample t-test
##
## data: mbasalary$s_avg and mbasalary$salary
## t = -58.516, df = 102, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -106519.92 -99535.37
## sample estimates:
## mean of x mean of y
## 3.09233 103030.73786
t.test(mbasalary$f_avg, mbasalary$salary)
##
## Welch Two Sample t-test
##
## data: mbasalary$f_avg and mbasalary$salary
## t = -58.516, df = 102, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -106519.92 -99535.38
## sample estimates:
## mean of x mean of y
## 3.090971e+00 1.030307e+05
t.test(mbasalary$quarter, mbasalary$salary)
##
## Welch Two Sample t-test
##
## data: mbasalary$quarter and mbasalary$salary
## t = -58.517, df = 102, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -106520.7 -99536.2
## sample estimates:
## mean of x mean of y
## 2.262136e+00 1.030307e+05
t.test(mbasalary$work_yrs, mbasalary$salary)
##
## Welch Two Sample t-test
##
## data: mbasalary$work_yrs and mbasalary$salary
## t = -58.516, df = 102, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -106519.33 -99534.79
## sample estimates:
## mean of x mean of y
## 3.679612e+00 1.030307e+05
t.test(mbasalary$frstlang, mbasalary$salary)
##
## Welch Two Sample t-test
##
## data: mbasalary$frstlang and mbasalary$salary
## t = -58.517, df = 102, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -106521.9 -99537.4
## sample estimates:
## mean of x mean of y
## 1.067961e+00 1.030307e+05
t.test(mbasalary$satis, mbasalary$salary)
##
## Welch Two Sample t-test
##
## data: mbasalary$satis and mbasalary$salary
## t = -58.515, df = 102, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -106517.13 -99532.58
## sample estimates:
## mean of x mean of y
## 5.883495e+00 1.030307e+05
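In all the above t-tests the p-value is below 2.2e-16, but this should not be read as evidence that every factor influences the starting salary. Called with two numeric vectors, t.test() tests whether the mean of the first equals the mean of the second, so each test above only confirms that the mean of a variable (for example, a mean age of 26.78) differs from the mean salary (103030.74). That difference reflects the different units of the two variables, not a relationship between them. A t-test is informative here only when salary is compared between two groups of students, such as by sex or by first language. This explains the contrast with our analysis using graphs as well as chi-squared tests.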
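A two-group comparison of salary can be written with the formula interface of t.test(). The following is a sketch on synthetic data, with demo and its group means invented for illustration; it shows the form of the call, not a result for the survey.

```r
set.seed(1)

# Synthetic data: 30 students per sex code with invented salary distributions.
demo <- data.frame(
  sex    = factor(rep(c(1, 2), each = 30)),
  salary = c(rnorm(30, mean = 104000, sd = 15000),
             rnorm(30, mean = 98000,  sd = 15000))
)

# salary ~ sex compares mean salary between the two sex groups
# (Welch two-sample t-test by default).
res <- t.test(salary ~ sex, data = demo)
print(res)
```

Here the two sample estimates are both salaries, one per group, so the confidence interval describes a salary gap in the same units.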
lm(formula = mbasalary$age ~ mbasalary$salary)
##
## Call:
## lm(formula = mbasalary$age ~ mbasalary$salary)
##
## Coefficients:
## (Intercept) mbasalary$salary
## 1.735e+01 9.148e-05
lm(formula = mbasalary$sex ~ mbasalary$salary)
##
## Call:
## lm(formula = mbasalary$sex ~ mbasalary$salary)
##
## Coefficients:
## (Intercept) mbasalary$salary
## 1.743e+00 -4.289e-06
lm(formula = mbasalary$gmat_tpc ~ mbasalary$salary)
##
## Call:
## lm(formula = mbasalary$gmat_tpc ~ mbasalary$salary)
##
## Coefficients:
## (Intercept) mbasalary$salary
## 9.290e+01 -8.131e-05
lm(formula = mbasalary$s_avg ~ mbasalary$salary)
##
## Call:
## lm(formula = mbasalary$s_avg ~ mbasalary$salary)
##
## Coefficients:
## (Intercept) mbasalary$salary
## 2.870e+00 2.155e-06
lm(formula = mbasalary$f_avg ~ mbasalary$salary)
##
## Call:
## lm(formula = mbasalary$f_avg ~ mbasalary$salary)
##
## Coefficients:
## (Intercept) mbasalary$salary
## 3.389e+00 -2.894e-06
lm(formula = mbasalary$quarter ~ mbasalary$salary)
##
## Call:
## lm(formula = mbasalary$quarter ~ mbasalary$salary)
##
## Coefficients:
## (Intercept) mbasalary$salary
## 3.092e+00 -8.053e-06
lm(formula = mbasalary$work_yrs ~ mbasalary$salary)
##
## Call:
## lm(formula = mbasalary$work_yrs ~ mbasalary$salary)
##
## Coefficients:
## (Intercept) mbasalary$salary
## -4.2126319 0.0000766
lm(formula = mbasalary$frstlang ~ mbasalary$salary)
##
## Call:
## lm(formula = mbasalary$frstlang ~ mbasalary$salary)
##
## Coefficients:
## (Intercept) mbasalary$salary
## 6.786e-01 3.779e-06
lm(formula = mbasalary$satis ~ mbasalary$salary)
##
## Call:
## lm(formula = mbasalary$satis ~ mbasalary$salary)
##
## Coefficients:
## (Intercept) mbasalary$salary
## 6.064e+00 -1.756e-06
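The regression models above fit a straight line of the form y = mx + c for each variable. As written, each formula places the variable on the left of ~ and salary on the right, so y is the variable and x is the starting salary. Each slope therefore gives the expected change in the variable per unit of salary, which is why the coefficients are so small (for example, 9.148e-05 years of age per unit of salary). To predict salary from a variable, the formula has to be reversed to salary ~ variable.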
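A model with salary as the response could look like the sketch below. It is fitted to synthetic data generated inside the block, so the coefficients say nothing about the survey; demo, the chosen predictors and the generating equation are assumptions made for illustration.

```r
set.seed(7)

# Synthetic data: salary generated from work experience and GMAT percentile plus noise.
n <- 50
demo <- data.frame(
  work_yrs = sample(0:10, n, replace = TRUE),
  gmat_tpc = sample(50:99, n, replace = TRUE)
)
demo$salary <- 85000 + 2500 * demo$work_yrs + 100 * demo$gmat_tpc +
  rnorm(n, sd = 8000)

# Salary on the left of ~ makes it the response; predictors go on the right.
fit <- lm(salary ~ work_yrs + gmat_tpc, data = demo)

# summary() adds standard errors, p-values and R-squared to the coefficients.
print(summary(fit))

# Predicted salary for a hypothetical student.
pred <- predict(fit, newdata = data.frame(work_yrs = 3, gmat_tpc = 85))
print(pred)
```

Using the data argument instead of mbasalary$ prefixes in the formula keeps the coefficient names short and lets predict() accept new data.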