Import the dataset:

mba.df <- read.csv("F:/Data Analytics for Managerial Applications/MBA Starting Salaries Data.csv")
head(mba.df)
##   age sex gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg quarter work_yrs
## 1  23   2      620       77       87       87   3.4  3.00       1        2
## 2  24   1      610       90       71       87   3.5  4.00       1        2
## 3  24   1      670       99       78       95   3.3  3.25       1        2
## 4  24   1      570       56       81       75   3.3  2.67       1        1
## 5  24   2      710       93       98       98   3.6  3.75       1        2
## 6  24   1      640       82       89       91   3.9  3.75       1        2
##   frstlang salary satis
## 1        1      0     7
## 2        1      0     6
## 3        1      0     6
## 4        1      0     7
## 5        1    999     5
## 6        1      0     6
library(psych)
describe(mba.df[,c(1,3:10,12,13)]) ##Summarizing the data
##          vars   n     mean       sd median  trimmed     mad min    max
## age         1 274    27.36     3.71     27    26.76    2.97  22     48
## gmat_tot    2 274   619.45    57.54    620   618.86   59.30 450    790
## gmat_qpc    3 274    80.64    14.87     83    82.31   14.83  28     99
## gmat_vpc    4 274    78.32    16.86     81    80.33   14.83  16     99
## gmat_tpc    5 274    84.20    14.02     87    86.12   11.86   0     99
## s_avg       6 274     3.03     0.38      3     3.03    0.44   2      4
## f_avg       7 274     3.06     0.53      3     3.09    0.37   0      4
## quarter     8 274     2.48     1.11      2     2.47    1.48   1      4
## work_yrs    9 274     3.87     3.23      3     3.29    1.48   0     22
## salary     10 274 39025.69 50951.56    999 33607.86 1481.12   0 220000
## satis      11 274   172.18   371.61      6    91.50    1.48   1    998
##           range  skew kurtosis      se
## age          26  2.16     6.45    0.22
## gmat_tot    340 -0.01     0.06    3.48
## gmat_qpc     71 -0.92     0.30    0.90
## gmat_vpc     83 -1.04     0.74    1.02
## gmat_tpc     99 -2.28     9.02    0.85
## s_avg         2 -0.06    -0.38    0.02
## f_avg         4 -2.08    10.85    0.03
## quarter       3  0.02    -1.35    0.07
## work_yrs     22  2.78     9.80    0.20
## salary   220000  0.70    -1.05 3078.10
## satis       997  1.77     1.13   22.45
str(mba.df)
## 'data.frame':    274 obs. of  13 variables:
##  $ age     : int  23 24 24 24 24 24 25 25 25 25 ...
##  $ sex     : int  2 1 1 1 2 1 1 2 1 1 ...
##  $ gmat_tot: int  620 610 670 570 710 640 610 650 630 680 ...
##  $ gmat_qpc: int  77 90 99 56 93 82 89 88 79 99 ...
##  $ gmat_vpc: int  87 71 78 81 98 89 74 89 91 81 ...
##  $ gmat_tpc: int  87 87 95 75 98 91 87 92 89 96 ...
##  $ s_avg   : num  3.4 3.5 3.3 3.3 3.6 3.9 3.4 3.3 3.3 3.45 ...
##  $ f_avg   : num  3 4 3.25 2.67 3.75 3.75 3.5 3.75 3.25 3.67 ...
##  $ quarter : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ work_yrs: int  2 2 2 1 2 2 2 2 2 2 ...
##  $ frstlang: int  1 1 1 1 1 1 1 1 2 1 ...
##  $ salary  : int  0 0 0 0 999 0 0 0 999 998 ...
##  $ satis   : int  7 6 6 7 5 6 5 6 4 998 ...
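
The summary above shows that salary and satis contain the placeholder codes 998 and 999 (satis, for instance, has a maximum of 998), which inflates their means and standard deviations. Below is a minimal sketch of recoding those codes to NA before summarizing, assuming they simply mark unavailable values. The toy data frame and the helper name recode_missing are hypothetical and not part of the original analysis:

```r
# Toy rows mimicking the coding seen in head(mba.df); this is NOT the real data.
toy <- data.frame(
  salary = c(0, 999, 998, 85000, 0),
  satis  = c(7, 5, 998, 6, 6)
)

# Hypothetical helper: replace sentinel codes with NA
# (assumption: 998 and 999 both mean "value not available").
recode_missing <- function(x, codes = c(998, 999)) {
  x[x %in% codes] <- NA
  x
}

toy_clean <- as.data.frame(lapply(toy, recode_missing))

# Summaries now skip the placeholders instead of treating them as real values.
mean_satis <- mean(toy_clean$satis, na.rm = TRUE)
n_missing_salary <- sum(is.na(toy_clean$salary))
mean_satis
n_missing_salary
```

The same helper could be applied to mba.df before calling describe(), at the cost of losing the distinction between the two codes.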

Including Plots

Constructing a box plot to understand how the GMAT scores and fall semester grades are distributed by gender:

par(mfrow = c(2,1))
boxplot(mba.df$gmat_tot ~ mba.df$sex, horizontal = TRUE, ylab = "Sex", xlab = "GMAT scores", yaxt = "n", main = "Boxplot of GMAT scores by sex")
axis(side = 2, las = 2, at = c(1:2), labels = c("Male","Female"))
boxplot(mba.df$f_avg ~ mba.df$sex, horizontal = TRUE, ylab = "Sex", xlab = "Fall average grades", yaxt = "n", main = "Boxplot of fall average grades by sex")
axis(side = 2, las = 2, at = c(1:2), labels = c("Male","Female"))

Note that the median GMAT scores and IQRs are very similar for the two genders, while the fall average grades differ considerably between them.

Our ultimate goal is to find the factors most strongly associated with getting or NOT getting a job. So we split the data frame into two parts: students who got jobs (salary above the 998/999 placeholder codes) and students who didn't (salary equal to 0):

placed <- mba.df[which(mba.df$salary > 999),]
notplaced <- mba.df[which(mba.df$salary == 0),]
View(placed)
View(notplaced)

Boxplots to check the median GMAT scores of the students who got jobs and of those who didn't:

par(mfrow = c(2,1))
boxplot(placed$gmat_tot ~ placed$sex, horizontal = TRUE, ylab = "Sex", xlab = "GMAT scores", yaxt = "n", main = "Boxplot of GMAT scores by sex for placed candidates")
axis(side = 2, las = 2, at = c(1:2), labels = c("Male","Female"))
boxplot(notplaced$gmat_tot ~ notplaced$sex, horizontal = TRUE, ylab = "Sex", xlab = "GMAT scores", yaxt = "n", main = "Boxplot of GMAT scores by sex for unplaced candidates")
axis(side = 2, las = 2, at = c(1:2), labels = c("Male","Female"))

Interestingly, within each group the median GMAT score is almost identical for the two sexes, while there seems to be a difference of approximately 15 points in the median GMAT scores between the placed and unplaced candidates.

Plotting a scatterplot matrix for various variables

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
# Compare the salary column (not the string "salary") with the placeholder codes
scatterplotMatrix(mba.df[(mba.df$salary != 998) & (mba.df$salary != 999),c("salary","gmat_tot","s_avg","work_yrs")], spread = FALSE, smoother.args = list(lty = 2), main = "Scatter Plot Matrix")

Plotting a corrgram for all numeric variables in the dataset

library(corrgram)
# Compare the salary column (not the string "salary") with the placeholder codes
corrgram(mba.df[(mba.df$salary != 998) & (mba.df$salary != 999),c(1,3:10,12)], order=FALSE, lower.panel=panel.shade, upper.panel=panel.pie, text.panel=panel.txt, main="Corrgram of MBA data intercorrelations")

Subsetting the data based on unavailable/undisclosed salary data and then adding a new column to the data frame

Since a part of the salary data is undisclosed/unavailable, it’s best to have a subset of the data so that we can add another factor column to it based on who got jobs and who didn’t:

all_sal <- mba.df[which(mba.df$salary != 998 & mba.df$salary != 999),]
all_sal$got_job <- ifelse(all_sal$salary > 0, 1, 0)
View(all_sal)

Contingency Tables

jobs_per_sex <- xtabs(~ sex+got_job, data = all_sal)
addmargins(jobs_per_sex)
##      got_job
## sex     0   1 Sum
##   1    67  72 139
##   2    23  31  54
##   Sum  90 103 193
jobs_by_lang <- xtabs(~ frstlang+got_job, data = all_sal)
addmargins(jobs_by_lang)
##         got_job
## frstlang   0   1 Sum
##      1    82  96 178
##      2     8   7  15
##      Sum  90 103 193

Proportion of jobs by gender and language (row percentages):

prop.table(jobs_per_sex,1)*100
##    got_job
## sex        0        1
##   1 48.20144 51.79856
##   2 42.59259 57.40741
prop.table(jobs_by_lang,1)*100
##         got_job
## frstlang        0        1
##        1 46.06742 53.93258
##        2 53.33333 46.66667

Running chi-square tests of independence

Case-1:

Null Hypothesis: Getting a job is independent of sex, i.e. there is no association between placement status (got_job) and sex. Alternative Hypothesis: Placement status and sex are associated.

chisq.test(jobs_per_sex)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  jobs_per_sex
## X-squared = 0.29208, df = 1, p-value = 0.5889

Since the p-value = 0.5889 > 0.05, we fail to reject the null hypothesis: the data provide no evidence that placement status depends on sex.

Case-2:

Null Hypothesis: Getting a job is independent of first language, i.e. there is no association between placement status (got_job) and first language. Alternative Hypothesis: Placement status and first language are associated.

chisq.test(jobs_by_lang)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  jobs_by_lang
## X-squared = 0.074127, df = 1, p-value = 0.7854

Here again, since the p-value = 0.7854 > 0.05, we fail to reject the null hypothesis: the data provide no evidence that placement status depends on first language.
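
The chi-square statistic compares the observed counts with the counts expected if the two variables were independent. Below is a minimal sketch that rebuilds both contingency tables from the printed counts, so it runs without the CSV, and inspects the expected counts. The object names sex_tab and lang_tab are illustrative only:

```r
# Counts copied from the addmargins() output above.
# matrix() fills column by column: first the got_job = 0 column, then got_job = 1.
sex_tab  <- matrix(c(67, 23, 72, 31), nrow = 2,
                   dimnames = list(sex = c("1", "2"), got_job = c("0", "1")))
lang_tab <- matrix(c(82, 8, 96, 7), nrow = 2,
                   dimnames = list(frstlang = c("1", "2"), got_job = c("0", "1")))

# For 2x2 tables chisq.test() applies Yates' continuity correction by default.
sex_res  <- chisq.test(sex_tab)
lang_res <- chisq.test(lang_tab)

sex_res$expected    # counts expected under independence
lang_res$expected

# Common rule of thumb: the chi-square approximation is considered adequate
# when every expected count is at least 5.
all(sex_res$expected >= 5)
all(lang_res$expected >= 5)

c(sex = unname(sex_res$statistic), lang = unname(lang_res$statistic))
```

If an expected count fell below 5, for example in the small frstlang = 2 row, fisher.test() on the same table would be a common alternative.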

Constructing a correlation matrix to determine correlation between starting salary and numeric variables such as GMAT scores, MBA grades and work experience

cor(all_sal[,c(3,4,7,8,10,12)])
##               gmat_tot    gmat_qpc      s_avg        f_avg    work_yrs
## gmat_tot  1.000000e+00  0.74309972 0.14356746  0.101082103 -0.17369086
## gmat_qpc  7.430997e-01  1.00000000 0.01903816  0.130285115 -0.24138468
## s_avg     1.435675e-01  0.01903816 1.00000000  0.520554250  0.15913663
## f_avg     1.010821e-01  0.13028512 0.52055425  1.000000000 -0.04795136
## work_yrs -1.736909e-01 -0.24138468 0.15913663 -0.047951357  1.00000000
## salary   -5.685962e-05  0.02839164 0.09632412  0.008846655 -0.05326685
##                 salary
## gmat_tot -5.685962e-05
## gmat_qpc  2.839164e-02
## s_avg     9.632412e-02
## f_avg     8.846655e-03
## work_yrs -5.326685e-02
## salary    1.000000e+00

Conducting a correlation test for the variables above:

library(psych)
corr.test(all_sal[,c(3,6,7,8,10,12)], use = "complete")
## Call:corr.test(x = all_sal[, c(3, 6, 7, 8, 10, 12)], use = "complete")
## Correlation matrix 
##          gmat_tot gmat_tpc s_avg f_avg work_yrs salary
## gmat_tot     1.00     0.88  0.14  0.10    -0.17   0.00
## gmat_tpc     0.88     1.00  0.19  0.11    -0.17   0.06
## s_avg        0.14     0.19  1.00  0.52     0.16   0.10
## f_avg        0.10     0.11  0.52  1.00    -0.05   0.01
## work_yrs    -0.17    -0.17  0.16 -0.05     1.00  -0.05
## salary       0.00     0.06  0.10  0.01    -0.05   1.00
## Sample Size 
## [1] 193
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##          gmat_tot gmat_tpc s_avg f_avg work_yrs salary
## gmat_tot     0.00     0.00  0.42  1.00     0.19      1
## gmat_tpc     0.00     0.00  0.11  1.00     0.23      1
## s_avg        0.05     0.01  0.00  0.00     0.27      1
## f_avg        0.16     0.13  0.00  0.00     1.00      1
## work_yrs     0.02     0.02  0.03  0.51     0.00      1
## salary       1.00     0.40  0.18  0.90     0.46      0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

The p-values for starting salary are in the salary row below the diagonal (unadjusted): gmat_tot 1.00, gmat_tpc 0.40, s_avg 0.18, f_avg 0.90 and work_yrs 0.46. Therefore we see that:

1. None of these p-values is below 0.05, so for every variable (gmat_tot, gmat_tpc, s_avg, f_avg and work_yrs) we fail to reject the null hypothesis that it is uncorrelated with starting salary.
2. The clearly significant correlations are among the explanatory variables themselves, e.g. gmat_tot with gmat_tpc (r = 0.88) and s_avg with f_avg (r = 0.52).

Note that all_sal still contains the unplaced students with salary = 0, so these correlations mix placement status with salary level.
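
Each unadjusted p-value in the table comes from a t-test on the corresponding correlation coefficient. As a check, the sketch below recomputes the p-value for s_avg versus salary from the printed r and sample size, using the standard relation t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom:

```r
r <- 0.09632412   # cor(s_avg, salary), copied from the correlation matrix above
n <- 193          # sample size reported by corr.test

# t statistic for H0: true correlation = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_val <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

round(t_stat, 2)
round(p_val, 2)   # compare with the salary / s_avg entry below the diagonal
```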

Linear regression on placed data

Applying the linear regression model to the problem: y = b0 + b1x1 + b2x2 + b3x3 + … + bnxn. Here the dependent variable is starting salary and the independent variables are gmat_qpc, gmat_vpc, s_avg, f_avg, work_yrs, sex, frstlang and satis. The model is fitted on the placed students only.

Salary = b0 + b1(gmat_qpc) + b2(gmat_vpc) + b3(s_avg) + b4(f_avg) + b5(work_yrs) + b6(sex) + b7(frstlang) + b8(satis)

lmodel <- lm(salary ~ gmat_qpc + gmat_vpc + s_avg + f_avg + work_yrs + sex + frstlang + satis, data = placed)
summary(lmodel)
## 
## Call:
## lm(formula = salary ~ gmat_qpc + gmat_vpc + s_avg + f_avg + work_yrs + 
##     sex + frstlang + satis, data = placed)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -29800  -7822  -1742   4869  82341 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 86719.94   23350.43   3.714 0.000346 ***
## gmat_qpc       98.72     121.85   0.810 0.419884    
## gmat_vpc      -95.80     102.99  -0.930 0.354699    
## s_avg        4659.05    5015.66   0.929 0.355320    
## f_avg       -1698.83    3834.70  -0.443 0.658773    
## work_yrs     2331.12     585.99   3.978 0.000137 ***
## sex         -5289.24    3545.91  -1.492 0.139140    
## frstlang    13994.76    6641.66   2.107 0.037770 *  
## satis       -1671.20    2070.62  -0.807 0.421643    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15740 on 94 degrees of freedom
## Multiple R-squared:  0.285,  Adjusted R-squared:  0.2241 
## F-statistic: 4.683 on 8 and 94 DF,  p-value: 7.574e-05

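Insights from the Regression Analysis:

- The overall F-test p-value of 7.574e-05 (< 0.05) indicates that the model fits significantly better than an intercept-only model, i.e. at least one coefficient is non-zero. On its own this does not make it a good model.
- The multiple R-squared is 0.285 and the adjusted R-squared is 0.2241, so the explanatory variables account for only about 22% to 29% of the variance in starting salary. There must be quite a few other explanatory variables not taken into account here.
- Only the coefficients of work_yrs and frstlang are statistically significant (p-value < 0.05). Holding the other variables constant, each additional year of work experience is associated with an increase of about 2331 units in starting salary, and a change in frstlang from 1 to 2 with an increase of about 13995 units.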
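
The adjusted R-squared in the summary penalizes the multiple R-squared for the number of predictors. Below is a short worked check using only the printed values. Because the multiple R-squared is rounded to three decimals in the output, the result should match the reported 0.2241 only approximately:

```r
r2 <- 0.285   # Multiple R-squared from summary(lmodel)
p  <- 8       # number of predictors in the model
n  <- 103     # observations: 94 residual df + 8 predictors + 1 intercept

# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)
```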

Logistic regression

Here we fit a logistic (logit) regression to the binary variable got_job (1 = Job, 0 = No Job) that we added to the all_sal data frame:

model <- glm(got_job ~ gmat_tpc + gmat_tot + s_avg + f_avg + work_yrs + sex + frstlang + satis, family=binomial(link='logit'), data=all_sal)
summary(model)
## 
## Call:
## glm(formula = got_job ~ gmat_tpc + gmat_tot + s_avg + f_avg + 
##     work_yrs + sex + frstlang + satis, family = binomial(link = "logit"), 
##     data = all_sal)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7969  -1.1750   0.7812   1.0610   1.7901  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.243768   3.154526  -0.077   0.9384  
## gmat_tpc     0.073538   0.044501   1.652   0.0984 .
## gmat_tot    -0.016302   0.009382  -1.738   0.0823 .
## s_avg        0.652323   0.495360   1.317   0.1879  
## f_avg       -0.081864   0.347219  -0.236   0.8136  
## work_yrs    -0.087415   0.046405  -1.884   0.0596 .
## sex          0.210544   0.337300   0.624   0.5325  
## frstlang     0.062156   0.576303   0.108   0.9141  
## satis        0.437869   0.204119   2.145   0.0319 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 266.68  on 192  degrees of freedom
## Residual deviance: 251.21  on 184  degrees of freedom
## AIC: 269.21
## 
## Number of Fisher Scoring iterations: 4

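The results suggest that the coefficient of satis (satisfaction level) is the only one that is statistically significant at the 5% level (p = 0.0319). The coefficients of gmat_tpc, gmat_tot and work_yrs are significant only at the 10% level, and the remaining coefficients are not statistically significant.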
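
Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below does this for two of the printed coefficients and also runs a likelihood-ratio (deviance) test of the whole model against the intercept-only model. All numbers are copied from the summary above, so the results inherit its rounding:

```r
# Coefficients copied from summary(model).
coefs <- c(satis = 0.437869, work_yrs = -0.087415)

# Odds ratio: multiplicative change in the odds of got_job = 1
# for a one-unit increase in the predictor, other predictors held fixed.
odds_ratios <- exp(coefs)
round(odds_ratios, 2)

# Likelihood-ratio test: null deviance minus residual deviance,
# referred to a chi-square distribution with (192 - 184) degrees of freedom.
lr_stat   <- 266.68 - 251.21
lr_df     <- 192 - 184
p_overall <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
p_overall
```

With these rounded deviances the statistic is 15.47 on 8 degrees of freedom, slightly below the 5% critical value of about 15.51, so the model as a whole is borderline at best. That fits with only one individually significant coefficient.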