This data set contains the reported starting salaries of MBA graduates of 2012. It also contains their GMAT scores and some information about how they performed in the MBA program. We will answer some important questions related to the starting salaries of graduates, such as whether gender and/or age made a difference, and whether students liked this particular program. We also consider whether GMAT score made a difference in marks.
setwd("D:/desktop/Data Analytics internship-sameer mathur/work/datasets")
mbasal.df <- read.csv("MBA Starting Salaries Data.csv")
attach(mbasal.df)
nojob.df <- subset(mbasal.df, salary != 998 & salary != 999 & salary == 0)   # graduates who did not get placed (salary reported as 0)
placed.df <- subset(mbasal.df, salary != 998 & salary != 999 & salary != 0)  # placed graduates with a genuine reported salary (998/999 are special codes)
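As a quick sanity check (a minimal sketch, assuming 998 and 999 are survey non-response codes), we can count how many graduates fall into each group:

nrow(placed.df)                          # graduates reporting a non-zero salary
nrow(nojob.df)                           # graduates with a reported salary of 0 (not placed)
sum(mbasal.df$salary %in% c(998, 999))   # records carrying the special codes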
DATA SUMMARY:
library(psych)
describe(mbasal.df)[,c(1,3,7,8,10)]
## vars mean mad min range
## age 1 27.36 2.97 22 26
## sex 2 1.25 0.00 1 1
## gmat_tot 3 619.45 59.30 450 340
## gmat_qpc 4 80.64 14.83 28 71
## gmat_vpc 5 78.32 14.83 16 83
## gmat_tpc 6 84.20 11.86 0 99
## s_avg 7 3.03 0.44 2 2
## f_avg 8 3.06 0.37 0 4
## quarter 9 2.48 1.48 1 3
## work_yrs 10 3.87 1.48 0 22
## frstlang 11 1.12 0.00 1 1
## salary 12 39025.69 1481.12 0 220000
## satis 13 172.18 1.48 1 997
VISUALIZATION: EFFECT OF VARIABLES ON PLACEMENT AFTER THE MBA PROGRAM
par(mfrow=c(1,2))
hist(nojob.df$age, main="age distribution of MBA graduates of 2012 who didn't get placed", xlab="age", ylim=c(0,25), xlim=c(20,50), col="grey",breaks=10)
hist(placed.df$age, main="age distribution of MBA graduates of 2012 who got placed", xlab="age", ylim=c(0,40), xlim=c(20,45), col="grey",breaks=10)
par(mfrow=c(1,1))
pie(table(placed.df$sex),col=c("blue","violet"),main="gender wise split of placed graduates", labels = c("male", "female"))
par(mfrow=c(1,2))
library(lattice)
boxplot(placed.df$gmat_tot, main="GMAT performance of placed students", xlab="GMAT Score", horizontal = TRUE, col="green")
boxplot(nojob.df$gmat_tot, main="GMAT performance of non-placed students", xlab="GMAT Score", horizontal = TRUE, col="green")
par(mfrow=c(1,2))
barplot(placed.df$work_yrs, main="Prior work experience of placed students",ylab="work experience")
barplot(nojob.df$work_yrs, main="Prior work experience of non-placed students",ylab="work experience")
par(mfrow=c(1,2))
xyplot(placed.df$s_avg ~ placed.df$f_avg, main="MBA performance of placed students",ylab="spring average",type = c("p", "g"), xlab="fall average")
xyplot(nojob.df$s_avg ~ nojob.df$f_avg, main="MBA performance of non-placed students",ylab="spring average",type = c("p", "g"), xlab="fall average")
It is clear that more men were placed than women. Also, most placements were in the 20-30 age group. Only age, sex, and prior work experience appear to have affected the placement statistics.
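The gender claim can be checked with a quick cross-tabulation; a minimal sketch (output not shown):

# placement rate within each sex (1 = male, 2 = female), excluding the 998/999 codes
with(subset(mbasal.df, salary != 998 & salary != 999),
     prop.table(table(sex, placed = salary > 0), margin = 1))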
EFFECT OF VARIABLES ON SALARY: WHO GOT HOW MUCH SALARY?
Age-wise split-up of salary
attach(placed.df)
## The following objects are masked from mbasal.df:
##
## age, f_avg, frstlang, gmat_qpc, gmat_tot, gmat_tpc, gmat_vpc,
## quarter, s_avg, salary, satis, sex, work_yrs
par(mfrow=c(1,1))
xyplot(placed.df$salary ~ placed.df$age, main="effect of age on salary",ylab="salary",type = c("p", "g"), xlab="age")
library(sm)
## Package 'sm', version 2.2-5.4: type help(sm) for summary information
gender<- factor(placed.df$sex)
sm.density.compare(placed.df$salary, gender, xlab="salary",levels= c(1,2),
labels = c("1 male","2 female"))
title(main="gender-wise split up of salary")
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
qplot(x = salary, y = age, data = placed.df, color = factor(sex), size = satis)
qplot(x = gmat_tot, y = work_yrs, data = placed.df, color = factor(frstlang), size = satis)
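To put numbers behind the gender-wise salary plots, a simple group summary can be computed; a minimal sketch (output omitted):

# mean and median starting salary by sex (1 = male, 2 = female)
aggregate(salary ~ sex, data = placed.df,
          FUN = function(x) c(mean = mean(x), median = median(x)))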
CORRELATIONS:
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(~salary+age+sex+work_yrs+satis,data=placed.df)
library(corrgram)
par(mfrow=c(1,1))
corrgram(placed.df, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="MBA starting salary analysis Correlogram")
It is clear from the correlogram that salary depends significantly on only two variables, namely work experience and age. We can also see from the scatter plot matrix that most people gave a satisfaction rating of 4, 5, or 6, while ratings of 3 or 7 are rare. We can also see that men have higher salaries than women.
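The same relationships can be read off numerically; a minimal sketch of the pairwise correlations behind the correlogram:

# numeric correlations between salary and the candidate predictors
round(cor(placed.df[, c("salary", "age", "work_yrs", "gmat_tot")]), 2)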
CHI-SQUARED TESTS:
chisq.test(placed.df$salary, placed.df$age)
## Warning in chisq.test(placed.df$salary, placed.df$age): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: placed.df$salary and placed.df$age
## X-squared = 717.62, df = 574, p-value = 3.929e-05
chisq.test(placed.df$salary, placed.df$sex)
## Warning in chisq.test(placed.df$salary, placed.df$sex): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: placed.df$salary and placed.df$sex
## X-squared = 52.681, df = 41, p-value = 0.1045
chisq.test(placed.df$salary, placed.df$work_yrs)
## Warning in chisq.test(placed.df$salary, placed.df$work_yrs): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: placed.df$salary and placed.df$work_yrs
## X-squared = 535.23, df = 451, p-value = 0.003809
chisq.test(placed.df$salary, placed.df$gmat_tot)
## Warning in chisq.test(placed.df$salary, placed.df$gmat_tot): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: placed.df$salary and placed.df$gmat_tot
## X-squared = 927.24, df = 820, p-value = 0.005279
The results suggest that only age, GMAT score, and years of work experience affect starting salaries.
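Every test above warns that the chi-squared approximation may be incorrect because the salary cells are sparse. One option, sketched here but not re-run, is to compute a Monte Carlo p-value instead:

# simulate the null distribution rather than relying on the asymptotic approximation
chisq.test(placed.df$salary, placed.df$age, simulate.p.value = TRUE, B = 2000)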
MODELLING THE DATA
fit <- lm(formula = placed.df$salary ~ placed.df$age+placed.df$work_yrs+placed.df$sex+placed.df$gmat_tot)
summary(fit)
##
## Call:
## lm(formula = placed.df$salary ~ placed.df$age + placed.df$work_yrs +
## placed.df$sex + placed.df$gmat_tot)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30250 -8730 -2148 5632 82607
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56162.20 30473.15 1.843 0.0684 .
## placed.df$age 2298.17 1009.90 2.276 0.0250 *
## placed.df$work_yrs 407.10 1095.76 0.372 0.7111
## placed.df$sex -3898.40 3407.50 -1.144 0.2554
## placed.df$gmat_tot -18.01 30.88 -0.583 0.5610
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15650 on 98 degrees of freedom
## Multiple R-squared: 0.2628, Adjusted R-squared: 0.2327
## F-statistic: 8.733 on 4 and 98 DF, p-value: 4.512e-06
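Before trusting these coefficients, the usual residual diagnostics can be inspected; a minimal sketch:

# residuals vs. fitted, normal Q-Q, scale-location and leverage plots for the linear model
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))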
mbasal.df$salary[mbasal.df$salary != 0] <- 1  # recode salary as a binary placement indicator (any non-zero value, including the 998/999 codes, becomes 1)
model <- glm(formula = salary ~ ., family = binomial,
data = mbasal.df)
summary(model)
##
## Call:
## glm(formula = salary ~ ., family = binomial, data = mbasal.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8186 -1.1565 0.6494 0.9510 1.6021
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.777303 3.646448 1.584 0.1131
## age -0.170377 0.074523 -2.286 0.0222 *
## sex -0.060096 0.331825 -0.181 0.8563
## gmat_tot -0.005910 0.010295 -0.574 0.5660
## gmat_qpc 0.012621 0.028191 0.448 0.6544
## gmat_vpc 0.013256 0.027455 0.483 0.6292
## gmat_tpc 0.009468 0.019197 0.493 0.6219
## s_avg -0.007863 0.620573 -0.013 0.9899
## f_avg -0.164727 0.339987 -0.485 0.6280
## quarter -0.150032 0.192517 -0.779 0.4358
## work_yrs 0.086216 0.082649 1.043 0.2969
## frstlang 0.683328 0.571813 1.195 0.2321
## satis 0.008274 0.011895 0.696 0.4867
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 346.93 on 273 degrees of freedom
## Residual deviance: 293.13 on 261 degrees of freedom
## AIC: 319.13
##
## Number of Fisher Scoring iterations: 10
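Since most predictors in the logistic model are not significant, one possible refinement, sketched here only, is to let stepwise AIC selection drop them and re-inspect the reduced model:

# backward/forward selection on AIC; keeps only predictors that improve the fit
reduced <- step(model, trace = 0)
summary(reduced)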
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
p <- predict(model, type="response")
pr <- prediction(p, mbasal.df$salary)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.7075483
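The AUC can be complemented with a confusion matrix; a minimal sketch assuming a 0.5 classification threshold:

# classify as placed if the predicted probability exceeds 0.5, then tabulate against the observed outcome
pred_class <- ifelse(p > 0.5, 1, 0)
table(predicted = pred_class, observed = mbasal.df$salary)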
From both analyses, it is evident that age is a vital factor in determining starting salaries. Years of work experience also has some small significance. Since the linear regression model has a very low p-value while the logistic regression model achieves an AUC of only about 0.71, the linear regression model predicts the results better and is the better fit.