This data set contains the reported starting salaries of MBA graduates of 2012. It also contains their GMAT scores and some information about how they performed in the MBA program. We will answer some important questions related to the starting salaries of graduates, such as whether gender and/or age made a difference, and whether students liked this particular program. We also consider whether GMAT score made a difference in marks.
setwd("D:/desktop/Data Analytics internship-sameer mathur/work/datasets")
mbasal.df <- read.csv("MBA Starting Salaries Data.csv")
attach(mbasal.df)
nojob.df <- subset(mbasal.df, salary != 998 & salary != 999 & salary == 0)   # graduates who did not get placed (salary reported as 0)
placed.df <- subset(mbasal.df, salary != 998 & salary != 999 & salary != 0)  # placed graduates with a genuine reported salary (998/999 are special codes)
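As a quick sanity check (a minimal sketch, assuming 998 and 999 are survey non-response codes), we can count how many graduates fall into each group:

nrow(placed.df)                          # graduates reporting a non-zero salary
nrow(nojob.df)                           # graduates with a reported salary of 0 (not placed)
sum(mbasal.df$salary %in% c(998, 999))   # records carrying the special codes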
DATA SUMMARY:
library(psych)
describe(mbasal.df)[,c(1,3,7,8,10)]
## vars mean mad min range
## age 1 27.36 2.97 22 26
## sex 2 1.25 0.00 1 1
## gmat_tot 3 619.45 59.30 450 340
## gmat_qpc 4 80.64 14.83 28 71
## gmat_vpc 5 78.32 14.83 16 83
## gmat_tpc 6 84.20 11.86 0 99
## s_avg 7 3.03 0.44 2 2
## f_avg 8 3.06 0.37 0 4
## quarter 9 2.48 1.48 1 3
## work_yrs 10 3.87 1.48 0 22
## frstlang 11 1.12 0.00 1 1
## salary 12 39025.69 1481.12 0 220000
## satis 13 172.18 1.48 1 997
VISUALIZATION: EFFECT OF VARIABLES ON PLACEMENT AFTER THE MBA PROGRAM
par(mfrow=c(1,2))
hist(nojob.df$age, main="age distribution of MBA graduates of 2012 who didn't get placed", xlab="age", ylim=c(0,25), xlim=c(20,50), col="grey",breaks=10)
hist(placed.df$age, main="age distribution of MBA graduates of 2012 who got placed", xlab="age", ylim=c(0,40), xlim=c(20,45), col="grey",breaks=10)
par(mfrow=c(1,1))
pie(table(placed.df$sex),col=c("blue","violet"),main="gender wise split of placed graduates", labels = c("male", "female"))
par(mfrow=c(1,2))
library(lattice)
boxplot(placed.df$gmat_tot, main="GMAT performance of placed students", xlab="GMAT Score", horizontal = TRUE, col="green")
boxplot(nojob.df$gmat_tot, main="GMAT performance of non-placed students", xlab="GMAT Score", horizontal = TRUE, col="green")
par(mfrow=c(1,2))
barplot(placed.df$work_yrs, main="Prior work experience of placed students",ylab="work experience")
barplot(nojob.df$work_yrs, main="Prior work experience of non-placed students",ylab="work experience")
par(mfrow=c(1,2))
xyplot(placed.df$s_avg ~ placed.df$f_avg, main="MBA performance of placed students",ylab="spring average",type = c("p", "g"), xlab="fall average")
xyplot(nojob.df$s_avg ~ nojob.df$f_avg, main="MBA performance of non-placed students",ylab="spring average",type = c("p", "g"), xlab="fall average")
It is clear that more men were placed than women. Also, most placements were in the 20-30 age group. Only age, sex, and prior work experience appear to have affected the placement statistics.
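The gender claim can be checked with a quick cross-tabulation; a minimal sketch (output not shown):

# placement rate within each sex (1 = male, 2 = female), excluding the 998/999 codes
with(subset(mbasal.df, salary != 998 & salary != 999),
     prop.table(table(sex, placed = salary > 0), margin = 1))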
EFFECT OF VARIABLES ON SALARY: WHO GOT HOW MUCH SALARY?
Age-wise split-up of salary
attach(placed.df)
## The following objects are masked from mbasal.df:
##
## age, f_avg, frstlang, gmat_qpc, gmat_tot, gmat_tpc, gmat_vpc,
## quarter, s_avg, salary, satis, sex, work_yrs
par(mfrow=c(1,1))
xyplot(placed.df$salary ~ placed.df$age, main="effect of age on salary",ylab="salary",type = c("p", "g"), xlab="age")
library(sm)
## Package 'sm', version 2.2-5.4: type help(sm) for summary information
gender<- factor(placed.df$sex)
sm.density.compare(placed.df$salary, gender, xlab="salary",levels= c(1,2),
labels = c("1 male","2 female"))
title(main="gender-wise split up of salary")
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
qplot(x = salary, y = age, data = placed.df, color = factor(sex), size = satis)
qplot(x = gmat_tot, y = work_yrs, data = placed.df, color = factor(frstlang), size = satis)
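To put numbers behind the gender-wise salary plots, a simple group summary can be computed; a minimal sketch (output omitted):

# mean and median starting salary by sex (1 = male, 2 = female)
aggregate(salary ~ sex, data = placed.df,
          FUN = function(x) c(mean = mean(x), median = median(x)))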
CORRELATIONS:
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(~salary+age+sex+work_yrs+satis,data=placed.df)
library(corrgram)
par(mfrow=c(1,1))
corrgram(placed.df, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="MBA starting salary analysis Correlogram")
It is clear from the correlogram that salary depends significantly on only two variables, namely work experience and age. We can also see from the scatter plot matrix that most people gave a satisfaction rating of 4, 5, or 6, while ratings of 3 or 7 are rare. We can also see that men have higher salaries than women.
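The same relationships can be read off numerically; a minimal sketch of the pairwise correlations behind the correlogram:

# numeric correlations between salary and the candidate predictors
round(cor(placed.df[, c("salary", "age", "work_yrs", "gmat_tot")]), 2)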
CHI-SQUARED TESTS:
chisq.test(placed.df$salary, placed.df$age)
## Warning in chisq.test(placed.df$salary, placed.df$age): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: placed.df$salary and placed.df$age
## X-squared = 717.62, df = 574, p-value = 3.929e-05
chisq.test(placed.df$salary, placed.df$sex)
## Warning in chisq.test(placed.df$salary, placed.df$sex): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: placed.df$salary and placed.df$sex
## X-squared = 52.681, df = 41, p-value = 0.1045
chisq.test(placed.df$salary, placed.df$work_yrs)
## Warning in chisq.test(placed.df$salary, placed.df$work_yrs): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: placed.df$salary and placed.df$work_yrs
## X-squared = 535.23, df = 451, p-value = 0.003809
chisq.test(placed.df$salary, placed.df$gmat_tot)
## Warning in chisq.test(placed.df$salary, placed.df$gmat_tot): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: placed.df$salary and placed.df$gmat_tot
## X-squared = 927.24, df = 820, p-value = 0.005279
The results suggest that only age, GMAT score, and years of work experience affect starting salaries.
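Every test above warns that the chi-squared approximation may be incorrect because the salary cells are sparse. One option, sketched here but not re-run, is to compute a Monte Carlo p-value instead:

# simulate the null distribution rather than relying on the asymptotic approximation
chisq.test(placed.df$salary, placed.df$age, simulate.p.value = TRUE, B = 2000)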
MODELLING THE DATA
fit <- lm(formula = placed.df$salary ~ placed.df$age+placed.df$work_yrs+placed.df$sex+placed.df$gmat_tot)
summary(fit)
##
## Call:
## lm(formula = placed.df$salary ~ placed.df$age + placed.df$work_yrs +
## placed.df$sex + placed.df$gmat_tot)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30250 -8730 -2148 5632 82607
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56162.20 30473.15 1.843 0.0684 .
## placed.df$age 2298.17 1009.90 2.276 0.0250 *
## placed.df$work_yrs 407.10 1095.76 0.372 0.7111
## placed.df$sex -3898.40 3407.50 -1.144 0.2554
## placed.df$gmat_tot -18.01 30.88 -0.583 0.5610
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15650 on 98 degrees of freedom
## Multiple R-squared: 0.2628, Adjusted R-squared: 0.2327
## F-statistic: 8.733 on 4 and 98 DF, p-value: 4.512e-06
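Before trusting these coefficients, the usual residual diagnostics can be inspected; a minimal sketch:

# residuals vs. fitted, normal Q-Q, scale-location and leverage plots for the linear model
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))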
mbasal.df$salary[mbasal.df$salary != 0] <- 1  # recode salary as a binary placement indicator (any non-zero value, including the 998/999 codes, becomes 1)
model <- glm(formula = salary ~ ., family = binomial,
data = mbasal.df)
summary(model)
##
## Call:
## glm(formula = salary ~ ., family = binomial, data = mbasal.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8186 -1.1565 0.6494 0.9510 1.6021
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.777303 3.646448 1.584 0.1131
## age -0.170377 0.074523 -2.286 0.0222 *
## sex -0.060096 0.331825 -0.181 0.8563
## gmat_tot -0.005910 0.010295 -0.574 0.5660
## gmat_qpc 0.012621 0.028191 0.448 0.6544
## gmat_vpc 0.013256 0.027455 0.483 0.6292
## gmat_tpc 0.009468 0.019197 0.493 0.6219
## s_avg -0.007863 0.620573 -0.013 0.9899
## f_avg -0.164727 0.339987 -0.485 0.6280
## quarter -0.150032 0.192517 -0.779 0.4358
## work_yrs 0.086216 0.082649 1.043 0.2969
## frstlang 0.683328 0.571813 1.195 0.2321
## satis 0.008274 0.011895 0.696 0.4867
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 346.93 on 273 degrees of freedom
## Residual deviance: 293.13 on 261 degrees of freedom
## AIC: 319.13
##
## Number of Fisher Scoring iterations: 10
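Since most predictors in the logistic model are not significant, one possible refinement, sketched here only, is to let stepwise AIC selection drop them and re-inspect the reduced model:

# backward/forward selection on AIC; keeps only predictors that improve the fit
reduced <- step(model, trace = 0)
summary(reduced)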
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
p <- predict(model, type="response")
pr <- prediction(p, mbasal.df$salary)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.7075483
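The AUC can be complemented with a confusion matrix; a minimal sketch assuming a 0.5 classification threshold:

# classify as placed if the predicted probability exceeds 0.5, then tabulate against the observed outcome
pred_class <- ifelse(p > 0.5, 1, 0)
table(predicted = pred_class, observed = mbasal.df$salary)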
From both analyses, it is evident that age is a vital factor in determining starting salaries. Years of work experience also has some small significance. Since the linear regression model has a very low p-value while the logistic regression model achieves an AUC of only about 0.71, the linear regression model predicts the results better and is the better fit.