MBA Starting Salaries

This data set contains the reported starting salaries of MBAs graduating in 2012. It also contains their GMAT scores and some information about how they did in the MBA program. We will answer some important questions about graduates' starting salaries, such as whether gender and/or age made a difference, and whether students liked this particular program. We also consider whether GMAT score made a difference in marks.

  1. Reading the raw data into a dataframe
setwd("D:/desktop/Data Analytics internship-sameer mathur/work/datasets")
mbasal.df <- read.csv("MBA Starting Salaries Data.csv")
attach(mbasal.df)
# 998 and 999 are excluded as placeholder codes rather than real salaries; 0 is treated as not placed
nojob.df <- subset(mbasal.df, salary == 0)
placed.df <- subset(mbasal.df, !(salary %in% c(998, 999)) & salary != 0)
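
The split above can be tried on a small stand-in data frame before running it on the real file. This is only a sketch: `toy.df` and its values are made up for illustration, and it assumes that 998 and 999 are placeholder codes rather than salaries.

```r
# Hypothetical toy data, not the real MBA data set.
toy.df <- data.frame(
  age    = c(24, 27, 31, 25, 29, 26),
  salary = c(0, 95000, 998, 105000, 999, 0)
)

# Assumption: 998 and 999 are placeholder codes, not salaries.
answered <- !(toy.df$salary %in% c(998, 999))

# Graduates who reported a salary of 0 (not placed).
nojob.toy <- toy.df[answered & toy.df$salary == 0, ]

# Graduates who reported a real, non-zero salary (placed).
placed.toy <- toy.df[answered & toy.df$salary > 0, ]

nrow(nojob.toy)
nrow(placed.toy)
```

Building the `answered` filter once and reusing it keeps the two subsets consistent with each other.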

DATA SUMMARY:

  2. Summary statistics - mean, MAD, min and range of each variable
library(psych)
describe(mbasal.df)[,c(1,3,7,8,10)]
##          vars     mean     mad min  range
## age         1    27.36    2.97  22     26
## sex         2     1.25    0.00   1      1
## gmat_tot    3   619.45   59.30 450    340
## gmat_qpc    4    80.64   14.83  28     71
## gmat_vpc    5    78.32   14.83  16     83
## gmat_tpc    6    84.20   11.86   0     99
## s_avg       7     3.03    0.44   2      2
## f_avg       8     3.06    0.37   0      4
## quarter     9     2.48    1.48   1      3
## work_yrs   10     3.87    1.48   0     22
## frstlang   11     1.12    0.00   1      1
## salary     12 39025.69 1481.12   0 220000
## satis      13   172.18    1.48   1    997

VISUALIZATION: EFFECT OF VARIABLES ON PLACEMENT AFTER THE MBA PROGRAM

  3. Effect of age on placement
par(mfrow=c(1,2))
hist(nojob.df$age, main="age distribution of MBA graduates of 2012 who didn't get placed", xlab="age", ylim=c(0,25), xlim=c(20,50), col="grey",breaks=10)
hist(placed.df$age, main="age distribution of MBA graduates of 2012 who got placed", xlab="age", ylim=c(0,40), xlim=c(20,45), col="grey",breaks=10)

  4. Who got better placement: males or females?
par(mfrow=c(1,1))
pie(table(placed.df$sex),col=c("blue","violet"),main="gender wise split of placed graduates", labels = c("male", "female"))

  5. Effect of GMAT performance on placement
par(mfrow=c(1,2))
library(lattice)
boxplot(placed.df$gmat_tot, main="GMAT performance of placed students",xlab="GMAT Score",horizontal = TRUE,col="green")
boxplot(nojob.df$gmat_tot, main="GMAT performance of non-placed students",xlab="GMAT Score",horizontal = TRUE,col="green")

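  6. Effect of prior work experience on placement
par(mfrow=c(1,2))
# tabulate first so that each bar is a count of students, not one bar per student
barplot(table(placed.df$work_yrs), main="Prior work experience of placed students",xlab="work experience (years)",ylab="number of students")
barplot(table(nojob.df$work_yrs), main="Prior work experience of non-placed students",xlab="work experience (years)",ylab="number of students")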

  7. MBA performance of placed and non-placed students
# lattice plots ignore par(mfrow), so each xyplot is drawn on its own page
xyplot(s_avg ~ f_avg, data=placed.df, main="MBA performance of placed students",ylab="spring average",type = c("p", "g"), xlab="fall average")

xyplot(s_avg ~ f_avg, data=nojob.df, main="MBA performance of non-placed students",ylab="spring average",type = c("p", "g"), xlab="fall average")

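The pie chart shows that more men than women were placed, although men also make up most of the class (the mean of sex in the summary table is 1.25, with 1 coded as male), so this is a difference in counts rather than in placement rates. Most placements were in the 20-30 age group. Of the variables plotted, only age, sex and prior work experience appear to affect placement.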
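
A count of placed graduates by gender depends on how many men and women are in the class, so a placement rate per gender is a useful cross-check. The sketch below shows one way to get it with `table()` and `prop.table()`. The data frame `toy.df` and the column `placed` are hypothetical and only illustrate the calculation.

```r
# Hypothetical toy data: sex is 1 = male, 2 = female; placed is 1 = yes, 0 = no.
toy.df <- data.frame(
  sex    = c(1, 1, 1, 1, 1, 1, 2, 2),
  placed = c(1, 1, 1, 0, 0, 0, 1, 0)
)

# Cross-tabulate gender against placement (rows = sex, columns = placed).
counts <- table(sex = toy.df$sex, placed = toy.df$placed)
counts

# Row-wise proportions: the share of each gender that was placed.
rates <- prop.table(counts, margin = 1)
rates[, "1"]
```

In this toy example three men and one woman are placed, yet the placement rate is the same for both groups, which is the situation the pie chart alone cannot reveal.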

EFFECT OF VARIABLES ON SALARY: WHO GOT HOW MUCH SALARY?

  8. Age-wise split of salary

attach(placed.df)
## The following objects are masked from mbasal.df:
## 
##     age, f_avg, frstlang, gmat_qpc, gmat_tot, gmat_tpc, gmat_vpc,
##     quarter, s_avg, salary, satis, sex, work_yrs
par(mfrow=c(1,1))
xyplot(placed.df$salary ~ placed.df$age, main="effect of age on salary",ylab="salary",type = c("p", "g"), xlab="age")

  9. Gender-wise split of salary
library(sm)
## Package 'sm', version 2.2-5.4: type help(sm) for summary information
# levels and labels belong to factor(), not to sm.density.compare()
gender <- factor(placed.df$sex, levels = c(1, 2), labels = c("1 male", "2 female"))
sm.density.compare(placed.df$salary, gender, xlab="salary")
title(main="gender-wise split of salary")

  10. Gender-wise split of salary and age, sized by satisfaction with the MBA program
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
qplot(x = salary, y = age, data = placed.df, color = factor(sex), size = satis)

  11. GMAT performance against work experience, split by first language and sized by satisfaction
qplot(x = gmat_tot, y = work_yrs, data = placed.df, color = factor(frstlang), size = satis)

CORRELATIONS:

  12. Scatterplot matrix
library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplotMatrix(~salary+age+sex+work_yrs+satis,data=placed.df)

  13. MBA starting salary analysis correlogram
library(corrgram)
par(mfrow=c(1,1))
corrgram(placed.df, order=TRUE, lower.panel=panel.shade,
         upper.panel=panel.pie, text.panel=panel.txt,
         main="MBA starting salary analysis Correlogram")

The correlogram shows that salary is correlated most strongly with two variables, work experience and age. The scatterplot matrix also shows that most people gave a satisfaction rating of 4, 5 or 6, and that ratings of 3 or 7 are rare. Men also appear to have higher salaries than women.
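
A correlogram encodes correlations as shading, so it can help to look at the numbers behind it as well. The sketch below computes a correlation matrix with `cor()` and ranks the variables by their correlation with salary. The data frame `toy.df` is made up for illustration and does not reproduce the real values.

```r
# Hypothetical toy data, not the real MBA data set.
toy.df <- data.frame(
  salary   = c(85000, 92000, 96000, 100000, 105000, 112000, 120000, 98000),
  age      = c(23, 24, 25, 26, 27, 29, 32, 25),
  work_yrs = c(1, 2, 2, 3, 4, 5, 8, 2),
  gmat_tot = c(640, 600, 680, 620, 590, 650, 610, 700)
)

# Pearson correlation between every pair of columns.
cmat <- cor(toy.df)

# Correlations of each other variable with salary, strongest positive first.
sal.cor <- sort(cmat["salary", colnames(cmat) != "salary"], decreasing = TRUE)
round(sal.cor, 2)
```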

CHI-SQUARED TESTS:

chisq.test(placed.df$salary, placed.df$age)
## Warning in chisq.test(placed.df$salary, placed.df$age): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  placed.df$salary and placed.df$age
## X-squared = 717.62, df = 574, p-value = 3.929e-05
chisq.test(placed.df$salary, placed.df$sex)
## Warning in chisq.test(placed.df$salary, placed.df$sex): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  placed.df$salary and placed.df$sex
## X-squared = 52.681, df = 41, p-value = 0.1045
chisq.test(placed.df$salary, placed.df$work_yrs)
## Warning in chisq.test(placed.df$salary, placed.df$work_yrs): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  placed.df$salary and placed.df$work_yrs
## X-squared = 535.23, df = 451, p-value = 0.003809
chisq.test(placed.df$salary, placed.df$gmat_tot)
## Warning in chisq.test(placed.df$salary, placed.df$gmat_tot): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  placed.df$salary and placed.df$gmat_tot
## X-squared = 927.24, df = 820, p-value = 0.005279

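At the 5% level, the tests suggest that age, GMAT score and years of work experience are associated with starting salary, while sex is not (p = 0.1045). The warnings show that the chi-squared approximation may be unreliable here, because salary, age and GMAT score take many distinct values and most cells of each table hold very few graduates.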
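
Since salary is a numeric variable, a different kind of test may be a useful cross-check on the chi-squared results. The sketch below is not the method used above: it swaps in a Spearman rank correlation test for salary against age and a Welch t-test for salary by sex. The data frame `toy.df` is made up, so the numbers it produces say nothing about the real data.

```r
# Hypothetical toy data: sex is 1 = male, 2 = female.
toy.df <- data.frame(
  salary = c(85000, 92000, 96000, 100000, 105000, 112000, 120000, 98000),
  age    = c(23, 24, 25, 26, 27, 29, 32, 25),
  sex    = c(1, 1, 1, 1, 1, 2, 2, 2)
)

# Rank correlation between salary and age (numeric against numeric).
# exact = FALSE avoids the exact-p-value warning when there are ties.
age.test <- cor.test(toy.df$salary, toy.df$age,
                     method = "spearman", exact = FALSE)
age.test$estimate
age.test$p.value

# Difference in mean salary between the two sex groups (Welch t-test).
sex.test <- t.test(salary ~ sex, data = toy.df)
sex.test$p.value
```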

MODELLING THE DATA

  14. Linear regression model
fit <- lm(formula = placed.df$salary ~ placed.df$age+placed.df$work_yrs+placed.df$sex+placed.df$gmat_tot)
summary(fit)
## 
## Call:
## lm(formula = placed.df$salary ~ placed.df$age + placed.df$work_yrs + 
##     placed.df$sex + placed.df$gmat_tot)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -30250  -8730  -2148   5632  82607 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)        56162.20   30473.15   1.843   0.0684 .
## placed.df$age       2298.17    1009.90   2.276   0.0250 *
## placed.df$work_yrs   407.10    1095.76   0.372   0.7111  
## placed.df$sex      -3898.40    3407.50  -1.144   0.2554  
## placed.df$gmat_tot   -18.01      30.88  -0.583   0.5610  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15650 on 98 degrees of freedom
## Multiple R-squared:  0.2628, Adjusted R-squared:  0.2327 
## F-statistic: 8.733 on 4 and 98 DF,  p-value: 4.512e-06
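
The model above is written with `placed.df$` inside the formula, which works for fitting but ties the coefficients to that exact data frame, so `predict()` cannot be given new data. A common alternative is to use bare column names with the `data` argument. The sketch below shows this on a made-up data frame (`toy.df` and `new.grad` are hypothetical), so its coefficients are not those of the real model.

```r
# Hypothetical toy data, not the real MBA data set.
toy.df <- data.frame(
  salary   = c(85000, 92000, 96000, 100000, 105000, 112000, 120000, 98000),
  age      = c(23, 24, 25, 26, 27, 29, 32, 25),
  work_yrs = c(1, 2, 2, 3, 4, 5, 8, 2)
)

# Same kind of model, with column names resolved through data =.
fit.toy <- lm(salary ~ age + work_yrs, data = toy.df)
coef(fit.toy)

# Because the formula uses plain column names, new cases can be scored.
new.grad <- data.frame(age = 28, work_yrs = 4)
pred <- predict(fit.toy, newdata = new.grad)
pred
```

With this form the coefficient names in the summary are also shorter (`age` instead of `placed.df$age`).
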

  15. Logistic regression model
# recode salary as a binary outcome; note that this also sets the 998 and 999 codes to 1
mbasal.df$salary[mbasal.df$salary!=0] <- 1
model <- glm(formula = salary ~ ., family = binomial, 
    data = mbasal.df)
summary(model)
## 
## Call:
## glm(formula = salary ~ ., family = binomial, data = mbasal.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8186  -1.1565   0.6494   0.9510   1.6021  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  5.777303   3.646448   1.584   0.1131  
## age         -0.170377   0.074523  -2.286   0.0222 *
## sex         -0.060096   0.331825  -0.181   0.8563  
## gmat_tot    -0.005910   0.010295  -0.574   0.5660  
## gmat_qpc     0.012621   0.028191   0.448   0.6544  
## gmat_vpc     0.013256   0.027455   0.483   0.6292  
## gmat_tpc     0.009468   0.019197   0.493   0.6219  
## s_avg       -0.007863   0.620573  -0.013   0.9899  
## f_avg       -0.164727   0.339987  -0.485   0.6280  
## quarter     -0.150032   0.192517  -0.779   0.4358  
## work_yrs     0.086216   0.082649   1.043   0.2969  
## frstlang     0.683328   0.571813   1.195   0.2321  
## satis        0.008274   0.011895   0.696   0.4867  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 346.93  on 273  degrees of freedom
## Residual deviance: 293.13  on 261  degrees of freedom
## AIC: 319.13
## 
## Number of Fisher Scoring iterations: 10
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
p <- predict(model, type="response")
pr <- prediction(p, mbasal.df$salary)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.7075483
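
The recoding used for this model turns every non-zero salary into 1, which includes the 998 and 999 codes. A sketch of a variant that first drops those rows is given here, together with a base R calculation of the AUC as the share of placed/non-placed pairs that the model ranks correctly. The data frame `toy.df` is made up, it assumes 998 and 999 are placeholder codes, and it uses age as the only predictor to keep the example small.

```r
# Hypothetical toy data, not the real MBA data set.
toy.df <- data.frame(
  age    = c(23, 24, 25, 25, 26, 27, 28, 30, 31, 33, 35, 40),
  salary = c(90000, 0, 95000, 998, 0, 100000, 0, 999, 110000, 0, 0, 0)
)

# Assumption: 998 and 999 are placeholder codes, so drop those rows first.
resp.df <- subset(toy.df, !(salary %in% c(998, 999)))

# Binary outcome: 1 = placed (non-zero salary), 0 = not placed.
resp.df$placed <- as.integer(resp.df$salary > 0)

# Logistic regression of placement on age.
model.toy <- glm(placed ~ age, family = binomial, data = resp.df)
p.toy <- predict(model.toy, type = "response")

# AUC as the proportion of (placed, not placed) pairs in which the placed
# graduate gets the higher predicted probability; ties count as one half.
pos <- p.toy[resp.df$placed == 1]
neg <- p.toy[resp.df$placed == 0]
auc.toy <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc.toy
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect ranking, which gives a scale for reading the value reported above.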

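In both analyses, age is the only predictor that is significant at the 5% level, so it is the most important factor for starting salary and placement. Years of work experience showed an association with salary in the chi-squared test, but it is not significant in either regression once age is included (p = 0.71 and p = 0.30). The linear regression model is highly significant overall (F-test p-value 4.512e-06), although it explains only about 26% of the variance in salary, while the logistic regression model has an AUC of about 0.71, which indicates moderate discrimination. On these figures the linear regression model looks like the better fit, with the caveat that the two models predict different outcomes (salary amount versus placement), so the comparison is only indicative.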