mba.df <- read.csv("F:/Data Analytics for Managerial Applications/MBA Starting Salaries Data.csv")
head(mba.df)
## age sex gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg quarter work_yrs
## 1 23 2 620 77 87 87 3.4 3.00 1 2
## 2 24 1 610 90 71 87 3.5 4.00 1 2
## 3 24 1 670 99 78 95 3.3 3.25 1 2
## 4 24 1 570 56 81 75 3.3 2.67 1 1
## 5 24 2 710 93 98 98 3.6 3.75 1 2
## 6 24 1 640 82 89 91 3.9 3.75 1 2
## frstlang salary satis
## 1 1 0 7
## 2 1 0 6
## 3 1 0 6
## 4 1 0 7
## 5 1 999 5
## 6 1 0 6
library(psych)
describe(mba.df[,c(1,3:10,12,13)]) ##Summarizing the data
## vars n mean sd median trimmed mad min max
## age 1 274 27.36 3.71 27 26.76 2.97 22 48
## gmat_tot 2 274 619.45 57.54 620 618.86 59.30 450 790
## gmat_qpc 3 274 80.64 14.87 83 82.31 14.83 28 99
## gmat_vpc 4 274 78.32 16.86 81 80.33 14.83 16 99
## gmat_tpc 5 274 84.20 14.02 87 86.12 11.86 0 99
## s_avg 6 274 3.03 0.38 3 3.03 0.44 2 4
## f_avg 7 274 3.06 0.53 3 3.09 0.37 0 4
## quarter 8 274 2.48 1.11 2 2.47 1.48 1 4
## work_yrs 9 274 3.87 3.23 3 3.29 1.48 0 22
## salary 10 274 39025.69 50951.56 999 33607.86 1481.12 0 220000
## satis 11 274 172.18 371.61 6 91.50 1.48 1 998
## range skew kurtosis se
## age 26 2.16 6.45 0.22
## gmat_tot 340 -0.01 0.06 3.48
## gmat_qpc 71 -0.92 0.30 0.90
## gmat_vpc 83 -1.04 0.74 1.02
## gmat_tpc 99 -2.28 9.02 0.85
## s_avg 2 -0.06 -0.38 0.02
## f_avg 4 -2.08 10.85 0.03
## quarter 3 0.02 -1.35 0.07
## work_yrs 22 2.78 9.80 0.20
## salary 220000 0.70 -1.05 3078.10
## satis 997 1.77 1.13 22.45
str(mba.df)
## 'data.frame': 274 obs. of 13 variables:
## $ age : int 23 24 24 24 24 24 25 25 25 25 ...
## $ sex : int 2 1 1 1 2 1 1 2 1 1 ...
## $ gmat_tot: int 620 610 670 570 710 640 610 650 630 680 ...
## $ gmat_qpc: int 77 90 99 56 93 82 89 88 79 99 ...
## $ gmat_vpc: int 87 71 78 81 98 89 74 89 91 81 ...
## $ gmat_tpc: int 87 87 95 75 98 91 87 92 89 96 ...
## $ s_avg : num 3.4 3.5 3.3 3.3 3.6 3.9 3.4 3.3 3.3 3.45 ...
## $ f_avg : num 3 4 3.25 2.67 3.75 3.75 3.5 3.75 3.25 3.67 ...
## $ quarter : int 1 1 1 1 1 1 1 1 1 1 ...
## $ work_yrs: int 2 2 2 1 2 2 2 2 2 2 ...
## $ frstlang: int 1 1 1 1 1 1 1 1 2 1 ...
## $ salary : int 0 0 0 0 999 0 0 0 999 998 ...
## $ satis : int 7 6 6 7 5 6 5 6 4 998 ...
Constructing a box plot to understand how the GMAT scores and fall semester grades are distributed by gender:
par(mfrow = c(2,1))
boxplot(mba.df$gmat_tot ~ mba.df$sex, horizontal = TRUE, ylab = "Sex", xlab = "GMAT scores", yaxt = "n", main = "Boxplot of GMAT scores by sex")
axis(side = 2, las = 2, at = c(1:2), labels = c("Male","Female"))
boxplot(mba.df$f_avg ~ mba.df$sex, horizontal = TRUE, ylab = "Sex", xlab = "Fall average grades", yaxt = "n", main = "Boxplot of fall average grades by sex")
axis(side = 2, las = 2, at = c(1:2), labels = c("Male","Female"))
Note that the median GMAT score and the IQR are very similar for the two sexes, while the fall average grades differ considerably between them.
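To put numbers on what the box plots show, the group medians and IQRs can be computed directly. The sketch below uses only the six rows printed by head(mba.df) above, stored in a hypothetical data frame named demo, so its output illustrates the calls and not the full sample. The labels follow the plot axes (1 = Male, 2 = Female).

```r
# Illustration only: the six rows shown by head(mba.df); `demo` is a hypothetical name
demo <- data.frame(
  sex      = c(2, 1, 1, 1, 2, 1),
  gmat_tot = c(620, 610, 670, 570, 710, 640),
  f_avg    = c(3.00, 4.00, 3.25, 2.67, 3.75, 3.75)
)
demo$sex <- factor(demo$sex, levels = c(1, 2), labels = c("Male", "Female"))

# Median and interquartile range within each sex
gmat_median <- tapply(demo$gmat_tot, demo$sex, median)
gmat_iqr    <- tapply(demo$gmat_tot, demo$sex, IQR)
favg_median <- tapply(demo$f_avg, demo$sex, median)
print(rbind(gmat_median, gmat_iqr, favg_median))
```

On the full data the same calls would be run with mba.df in place of demo.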
Our ultimate goal is to find the factors most strongly associated with getting or not getting a job, so we split the data frame into two parts: candidates who got a job (salary > 999) and candidates who did not (salary = 0). Rows with salary coded 998 or 999 fall into neither group:
placed <- mba.df[which(mba.df$salary > 999),]
notplaced <- mba.df[which(mba.df$salary == 0),]
View(placed)
View(notplaced)
par(mfrow = c(2,1))
boxplot(placed$gmat_tot ~ placed$sex, horizontal = TRUE, ylab = "Sex", xlab = "GMAT scores", yaxt = "n", main = "Boxplot of GMAT scores by sex for placed candidates")
axis(side = 2, las = 2, at = c(1:2), labels = c("Male","Female"))
boxplot(notplaced$gmat_tot ~ notplaced$sex, horizontal = TRUE, ylab = "Sex", xlab = "GMAT scores", yaxt = "n", main = "Boxplot of GMAT scores by sex for unplaced candidates")
axis(side = 2, las = 2, at = c(1:2), labels = c("Male","Female"))
Interestingly, the median GMAT score is almost identical for the two sexes within each group, while the medians of the placed and unplaced candidates differ by approximately 15 points.
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
# Exclude rows whose salary holds the codes 998 or 999
valid_salary <- mba.df$salary != 998 & mba.df$salary != 999
scatterplotMatrix(mba.df[valid_salary, c("salary","gmat_tot","s_avg","work_yrs")], spread = FALSE, smoother.args = list(lty = 2), main = "Scatter Plot Matrix")
library(corrgram)
corrgram(mba.df[valid_salary, c(1,3:10,12)], order=FALSE, lower.panel=panel.shade, upper.panel=panel.pie, text.panel=panel.txt, main="Corrgram of MBA data intercorrelations")
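The tables that follow use a data frame all_sal with a 0/1 column got_job, but the step that creates them is not shown. A minimal sketch of one construction consistent with the row counts reported below (103 placed + 90 not placed = 193) is given here. The helper name build_all_sal and the toy rows are assumptions for illustration, not the author's code.

```r
# Hypothetical helper: keep rows with a known outcome and flag whether a job was obtained
build_all_sal <- function(df) {
  keep <- df$salary == 0 | df$salary > 999      # drops the codes 998 and 999
  out <- df[keep, ]
  out$got_job <- as.integer(out$salary > 999)   # 1 = Job, 0 = No Job
  out
}

# Toy rows (made-up values) to show the behaviour
toy <- data.frame(sex    = c(1, 2, 1, 2, 1),
                  salary = c(0, 999, 85000, 998, 0))
toy_all_sal <- build_all_sal(toy)
print(toy_all_sal)
```

With the real data this would be called as all_sal <- build_all_sal(mba.df).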
jobs_per_sex <- xtabs(~ sex+got_job, data = all_sal)
addmargins(jobs_per_sex)
## got_job
## sex 0 1 Sum
## 1 67 72 139
## 2 23 31 54
## Sum 90 103 193
jobs_by_lang <- xtabs(~ frstlang+got_job, data = all_sal)
addmargins(jobs_by_lang)
## got_job
## frstlang 0 1 Sum
## 1 82 96 178
## 2 8 7 15
## Sum 90 103 193
Proportion of jobs by gender and language (row percentages):
prop.table(jobs_per_sex,1)*100
## got_job
## sex 0 1
## 1 48.20144 51.79856
## 2 42.59259 57.40741
prop.table(jobs_by_lang,1)*100
## got_job
## frstlang 0 1
## 1 46.06742 53.93258
## 2 53.33333 46.66667
Null Hypothesis: Getting a job is independent of sex, i.e. there is no association between job placement (got_job) and sex. Alternative Hypothesis: Job placement and sex are associated.
chisq.test(jobs_per_sex)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: jobs_per_sex
## X-squared = 0.29208, df = 1, p-value = 0.5889
Thus, as the p-value = 0.5889 > 0.05, we fail to reject the null hypothesis that job placement is independent of sex.
Null Hypothesis: Getting a job is independent of first language, i.e. there is no association between job placement (got_job) and language. Alternative Hypothesis: Job placement and first language are associated.
chisq.test(jobs_by_lang)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: jobs_by_lang
## X-squared = 0.074127, df = 1, p-value = 0.7854
Here again, since the p-value = 0.7854 > 0.05, we fail to reject the null hypothesis that job placement is independent of first language.
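Both tests can be reproduced from the printed counts alone, which is a convenient check when the underlying data frame is not at hand. The sketch below rebuilds the two 2x2 tables from the addmargins() output above. The object names jobs_sex_counts, jobs_lang_counts, sex_test and lang_test are introduced here for illustration.

```r
# Counts copied from the addmargins() output above (rows: level 1, 2; columns: got_job 0, 1)
jobs_sex_counts <- matrix(c(67, 72,
                            23, 31), nrow = 2, byrow = TRUE,
                          dimnames = list(sex = c("1", "2"), got_job = c("0", "1")))
jobs_lang_counts <- matrix(c(82, 96,
                              8,  7), nrow = 2, byrow = TRUE,
                           dimnames = list(frstlang = c("1", "2"), got_job = c("0", "1")))

# chisq.test() applies Yates' continuity correction by default for 2x2 tables
sex_test  <- chisq.test(jobs_sex_counts)
lang_test <- chisq.test(jobs_lang_counts)

# Expected counts under independence; small values here would make the
# chi-squared approximation less reliable
print(round(lang_test$expected, 2))
print(c(sex = sex_test$p.value, language = lang_test$p.value))
```

The expected counts are worth a look for the language table in particular, since only 15 candidates have frstlang = 2.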
cor(all_sal[,c(3,4,7,8,10,12)])
## gmat_tot gmat_qpc s_avg f_avg work_yrs
## gmat_tot 1.000000e+00 0.74309972 0.14356746 0.101082103 -0.17369086
## gmat_qpc 7.430997e-01 1.00000000 0.01903816 0.130285115 -0.24138468
## s_avg 1.435675e-01 0.01903816 1.00000000 0.520554250 0.15913663
## f_avg 1.010821e-01 0.13028512 0.52055425 1.000000000 -0.04795136
## work_yrs -1.736909e-01 -0.24138468 0.15913663 -0.047951357 1.00000000
## salary -5.685962e-05 0.02839164 0.09632412 0.008846655 -0.05326685
## salary
## gmat_tot -5.685962e-05
## gmat_qpc 2.839164e-02
## s_avg 9.632412e-02
## f_avg 8.846655e-03
## work_yrs -5.326685e-02
## salary 1.000000e+00
library(psych)
corr.test(all_sal[,c(3,6,7,8,10,12)], use = "complete")
## Call:corr.test(x = all_sal[, c(3, 6, 7, 8, 10, 12)], use = "complete")
## Correlation matrix
## gmat_tot gmat_tpc s_avg f_avg work_yrs salary
## gmat_tot 1.00 0.88 0.14 0.10 -0.17 0.00
## gmat_tpc 0.88 1.00 0.19 0.11 -0.17 0.06
## s_avg 0.14 0.19 1.00 0.52 0.16 0.10
## f_avg 0.10 0.11 0.52 1.00 -0.05 0.01
## work_yrs -0.17 -0.17 0.16 -0.05 1.00 -0.05
## salary 0.00 0.06 0.10 0.01 -0.05 1.00
## Sample Size
## [1] 193
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## gmat_tot gmat_tpc s_avg f_avg work_yrs salary
## gmat_tot 0.00 0.00 0.42 1.00 0.19 1
## gmat_tpc 0.00 0.00 0.11 1.00 0.23 1
## s_avg 0.05 0.01 0.00 0.00 0.27 1
## f_avg 0.16 0.13 0.00 0.00 1.00 1
## work_yrs 0.02 0.02 0.03 0.51 0.00 1
## salary 1.00 0.40 0.18 0.90 0.46 0
##
## To see confidence intervals of the correlations, print with the short=FALSE option
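Therefore we see that none of the variables is significantly correlated with salary in all_sal. The unadjusted p-values in the salary row are 1.00 (gmat_tot), 0.40 (gmat_tpc), 0.18 (s_avg), 0.90 (f_avg) and 0.46 (work_yrs), all > 0.05, and the correlations themselves lie between -0.05 and 0.10. We therefore fail to reject the null hypothesis of no correlation with salary for each of these variables. The strong, significant correlations are among the predictors: gmat_tot with gmat_tpc (r = 0.88) and s_avg with f_avg (r = 0.52). Note that all_sal appears to include the candidates without a job (the counts above suggest 90 of the 193 rows), whose salary is 0, which weakens any linear relationship with salary.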
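The unadjusted p-values below the diagonal can be reproduced from a correlation and the sample size, which helps when reading the table. A minimal sketch for the s_avg and salary pair, using the standard t-test for a Pearson correlation (the names r, n, t_stat and p_value are introduced here for illustration):

```r
# Values taken from the output above
r <- 0.09632412   # correlation between s_avg and salary
n <- 193          # sample size reported by corr.test

# t statistic for H0: true correlation = 0, with n - 2 degrees of freedom
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)
print(round(c(t = t_stat, p = p_value), 2))
```

The resulting p-value rounds to 0.18, the entry for s_avg in the salary row.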
Applying the linear regression model to the problem:
y = b0 + b1x1 + b2x2 + b3x3 + … + bnxn
Here the dependent variable is starting salary and the independent variables are gmat_qpc, gmat_vpc, s_avg, f_avg, work_yrs, sex, frstlang and satis. The model is fitted on the placed candidates only:
Salary = b0 + b1(gmat_qpc) + b2(gmat_vpc) + b3(s_avg) + b4(f_avg) + b5(work_yrs) + b6(sex) + b7(frstlang) + b8(satis)
lmodel <- lm(salary ~ gmat_qpc + gmat_vpc + s_avg + f_avg + work_yrs + sex + frstlang + satis, data = placed)
summary(lmodel)
##
## Call:
## lm(formula = salary ~ gmat_qpc + gmat_vpc + s_avg + f_avg + work_yrs +
## sex + frstlang + satis, data = placed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29800 -7822 -1742 4869 82341
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 86719.94 23350.43 3.714 0.000346 ***
## gmat_qpc 98.72 121.85 0.810 0.419884
## gmat_vpc -95.80 102.99 -0.930 0.354699
## s_avg 4659.05 5015.66 0.929 0.355320
## f_avg -1698.83 3834.70 -0.443 0.658773
## work_yrs 2331.12 585.99 3.978 0.000137 ***
## sex -5289.24 3545.91 -1.492 0.139140
## frstlang 13994.76 6641.66 2.107 0.037770 *
## satis -1671.20 2070.62 -0.807 0.421643
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15740 on 94 degrees of freedom
## Multiple R-squared: 0.285, Adjusted R-squared: 0.2241
## F-statistic: 4.683 on 8 and 94 DF, p-value: 7.574e-05
Insights from the Regression Analysis:
- The F-test p-value of 7.574e-05 (< 0.05) indicates that the predictors are jointly significant, i.e. at least one coefficient differs from zero.
- The adjusted R-squared of 0.2241 (multiple R-squared 0.285) means the explanatory variables explain only about 22% to 29% of the variance in starting salary, so there must be quite a few other explanatory variables not taken into account here.
- Only the coefficients of work_yrs and frstlang are statistically significant (p-value < 0.05). Holding the other variables constant, each additional year of work experience is associated with an increase of about 2331 units in starting salary, and frstlang = 2 is associated with a starting salary about 13995 units higher than frstlang = 1.
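Two of the quantities above can be checked by hand from the printed summary: how the adjusted R-squared follows from the multiple R-squared, and how precise the work_yrs estimate is. The sketch below uses only numbers copied from the output. The variable names are introduced here for illustration.

```r
# Values copied from the summary(lmodel) output above
r2 <- 0.285      # Multiple R-squared
k  <- 8          # number of predictors
n  <- 94 + k + 1 # observations: residual df + predictors + intercept = 103

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# 95% confidence interval for the work_yrs coefficient (estimate +/- t * SE)
est <- 2331.12
se  <- 585.99
ci  <- est + c(-1, 1) * qt(0.975, df = n - k - 1) * se

print(round(adj_r2, 4))
print(round(ci))
```

On the fitted object itself, confint(lmodel) would give the intervals for all coefficients at once.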
Here we apply logistic (logit) regression to the binary variable got_job (1 = Job, 0 = No Job) that we had added to the all_sal data frame:
model <- glm(got_job ~ gmat_tpc + gmat_tot + s_avg + f_avg + work_yrs + sex + frstlang + satis, family=binomial(link='logit'), data=all_sal)
summary(model)
##
## Call:
## glm(formula = got_job ~ gmat_tpc + gmat_tot + s_avg + f_avg +
## work_yrs + sex + frstlang + satis, family = binomial(link = "logit"),
## data = all_sal)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7969 -1.1750 0.7812 1.0610 1.7901
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.243768 3.154526 -0.077 0.9384
## gmat_tpc 0.073538 0.044501 1.652 0.0984 .
## gmat_tot -0.016302 0.009382 -1.738 0.0823 .
## s_avg 0.652323 0.495360 1.317 0.1879
## f_avg -0.081864 0.347219 -0.236 0.8136
## work_yrs -0.087415 0.046405 -1.884 0.0596 .
## sex 0.210544 0.337300 0.624 0.5325
## frstlang 0.062156 0.576303 0.108 0.9141
## satis 0.437869 0.204119 2.145 0.0319 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 266.68 on 192 degrees of freedom
## Residual deviance: 251.21 on 184 degrees of freedom
## AIC: 269.21
##
## Number of Fisher Scoring iterations: 4
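The results suggest that the coefficient of satis (satisfaction level) is the only one that is statistically significant at the 5% level (p = 0.0319). The coefficients of gmat_tpc, gmat_tot and work_yrs are significant only at the 10% level, and the remaining coefficients are not statistically significant.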
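Logit coefficients are on the log-odds scale, so they are easier to read as odds ratios. The sketch below converts the satis coefficient and also runs a likelihood-ratio test of the whole model against the intercept-only model, using only numbers copied from the output above. The variable names are introduced here for illustration.

```r
# Values copied from the summary(model) output above
beta_satis <- 0.437869
se_satis   <- 0.204119

# Odds ratio for a one-unit increase in satis, with a Wald 95% interval
odds_ratio <- exp(beta_satis)
or_ci      <- exp(beta_satis + c(-1, 1) * qnorm(0.975) * se_satis)

# Likelihood-ratio test of the fitted model against the intercept-only model
lr_stat <- 266.68 - 251.21   # null deviance - residual deviance
lr_df   <- 192 - 184         # difference in degrees of freedom
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(round(c(odds_ratio = odds_ratio, lower = or_ci[1], upper = or_ci[2]), 2))
print(round(lr_p, 3))
```

An odds ratio of about 1.55 means the odds of getting a job are estimated to be roughly 55% higher per one-point increase in satis, with the other variables held fixed. The likelihood-ratio p-value comes out just above 0.05, so the model as a whole is only a borderline improvement over the intercept-only model.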