This project analyses the factors that affect the starting salaries of MBA graduates of an institute, and the likelihood of enrolling in an MBA programme at that institute.
Reading the data into R:
mba <- read.csv("MBA Starting Salaries Data.csv")
summary(mba)
## age sex gmat_tot gmat_qpc
## Min. :22.00 Min. :1.000 Min. :450.0 Min. :28.00
## 1st Qu.:25.00 1st Qu.:1.000 1st Qu.:580.0 1st Qu.:72.00
## Median :27.00 Median :1.000 Median :620.0 Median :83.00
## Mean :27.36 Mean :1.248 Mean :619.5 Mean :80.64
## 3rd Qu.:29.00 3rd Qu.:1.000 3rd Qu.:660.0 3rd Qu.:93.00
## Max. :48.00 Max. :2.000 Max. :790.0 Max. :99.00
## gmat_vpc gmat_tpc s_avg f_avg
## Min. :16.00 Min. : 0.0 Min. :2.000 Min. :0.000
## 1st Qu.:71.00 1st Qu.:78.0 1st Qu.:2.708 1st Qu.:2.750
## Median :81.00 Median :87.0 Median :3.000 Median :3.000
## Mean :78.32 Mean :84.2 Mean :3.025 Mean :3.062
## 3rd Qu.:91.00 3rd Qu.:94.0 3rd Qu.:3.300 3rd Qu.:3.250
## Max. :99.00 Max. :99.0 Max. :4.000 Max. :4.000
## quarter work_yrs frstlang salary
## Min. :1.000 Min. : 0.000 Min. :1.000 Min. : 0
## 1st Qu.:1.250 1st Qu.: 2.000 1st Qu.:1.000 1st Qu.: 0
## Median :2.000 Median : 3.000 Median :1.000 Median : 999
## Mean :2.478 Mean : 3.872 Mean :1.117 Mean : 39026
## 3rd Qu.:3.000 3rd Qu.: 4.000 3rd Qu.:1.000 3rd Qu.: 97000
## Max. :4.000 Max. :22.000 Max. :2.000 Max. :220000
## satis
## Min. : 1.0
## 1st Qu.: 5.0
## Median : 6.0
## Mean :172.2
## 3rd Qu.: 7.0
## Max. :998.0
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## The following object is masked from 'package:car':
##
## logit
describe(mba)
## vars n mean sd median trimmed mad min max
## age 1 274 27.36 3.71 27 26.76 2.97 22 48
## sex 2 274 1.25 0.43 1 1.19 0.00 1 2
## gmat_tot 3 274 619.45 57.54 620 618.86 59.30 450 790
## gmat_qpc 4 274 80.64 14.87 83 82.31 14.83 28 99
## gmat_vpc 5 274 78.32 16.86 81 80.33 14.83 16 99
## gmat_tpc 6 274 84.20 14.02 87 86.12 11.86 0 99
## s_avg 7 274 3.03 0.38 3 3.03 0.44 2 4
## f_avg 8 274 3.06 0.53 3 3.09 0.37 0 4
## quarter 9 274 2.48 1.11 2 2.47 1.48 1 4
## work_yrs 10 274 3.87 3.23 3 3.29 1.48 0 22
## frstlang 11 274 1.12 0.32 1 1.02 0.00 1 2
## salary 12 274 39025.69 50951.56 999 33607.86 1481.12 0 220000
## satis 13 274 172.18 371.61 6 91.50 1.48 1 998
## range skew kurtosis se
## age 26 2.16 6.45 0.22
## sex 1 1.16 -0.66 0.03
## gmat_tot 340 -0.01 0.06 3.48
## gmat_qpc 71 -0.92 0.30 0.90
## gmat_vpc 83 -1.04 0.74 1.02
## gmat_tpc 99 -2.28 9.02 0.85
## s_avg 2 -0.06 -0.38 0.02
## f_avg 4 -2.08 10.85 0.03
## quarter 3 0.02 -1.35 0.07
## work_yrs 22 2.78 9.80 0.20
## frstlang 1 2.37 3.65 0.02
## salary 220000 0.70 -1.05 3078.10
## satis 997 1.77 1.13 22.45
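The salary column does not hold a real salary for every graduate: besides 0, it contains values such as 998 and 999 (and satis contains 998), which look like codes rather than observations.
To separate the graduates who were placed and reported a salary from those who were not placed, it is better to create two sub-tables.
To improve readability, it is also better to convert sex and first language into factor variables.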
mba$sex[mba$sex==1]<- "Male"
mba$sex[mba$sex==2]<- "Female"
mba$sex<-factor(mba$sex)
mba$frstlang[mba$frstlang==1]<- "English"
mba$frstlang[mba$frstlang==2]<- "Other"
mba$frstlang<-factor(mba$frstlang)
str(mba)
## 'data.frame': 274 obs. of 13 variables:
## $ age : int 23 24 24 24 24 24 25 25 25 25 ...
## $ sex : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 2 2 1 2 2 ...
## $ gmat_tot: int 620 610 670 570 710 640 610 650 630 680 ...
## $ gmat_qpc: int 77 90 99 56 93 82 89 88 79 99 ...
## $ gmat_vpc: int 87 71 78 81 98 89 74 89 91 81 ...
## $ gmat_tpc: int 87 87 95 75 98 91 87 92 89 96 ...
## $ s_avg : num 3.4 3.5 3.3 3.3 3.6 3.9 3.4 3.3 3.3 3.45 ...
## $ f_avg : num 3 4 3.25 2.67 3.75 3.75 3.5 3.75 3.25 3.67 ...
## $ quarter : int 1 1 1 1 1 1 1 1 1 1 ...
## $ work_yrs: int 2 2 2 1 2 2 2 2 2 2 ...
## $ frstlang: Factor w/ 2 levels "English","Other": 1 1 1 1 1 1 1 1 2 1 ...
## $ salary : int 0 0 0 0 999 0 0 0 999 998 ...
## $ satis : int 7 6 6 7 5 6 5 6 4 998 ...
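The recoding above can also be written in a single step with `factor()`, which maps each numeric code to a label explicitly. The following is only a minimal sketch on a made-up data frame (`toy` is hypothetical), assuming the same coding as in the code above (1 = Male, 2 = Female; 1 = English, 2 = Other).

```r
# Toy data frame with made-up codes; not taken from the MBA data set
toy <- data.frame(sex = c(1, 2, 1, 1), frstlang = c(1, 1, 2, 1))

# Map numeric codes to labels in one step; `levels` fixes the code order
toy$sex <- factor(toy$sex, levels = c(1, 2), labels = c("Male", "Female"))
toy$frstlang <- factor(toy$frstlang, levels = c(1, 2),
                       labels = c("English", "Other"))

str(toy)
table(toy$sex)
```

Note that this sketch makes "Male" the first level, whereas the alphabetical default used above puts "Female" first, so the reference level in a later regression would change accordingly.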
placed<-mba[which(mba$salary>1000),]
some(placed)
## age sex gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg quarter
## 36 27 Female 700 94 98 98 3.3 3.25 1
## 46 23 Female 650 93 81 93 3.4 3.00 1
## 118 25 Male 670 95 89 95 3.2 3.50 2
## 120 24 Male 560 52 81 72 3.2 3.25 2
## 133 34 Male 550 72 58 69 3.0 3.00 2
## 190 25 Male 610 89 74 87 2.7 2.75 3
## 200 24 Male 710 99 92 99 2.9 3.00 3
## 208 28 Male 570 56 84 75 2.9 3.00 3
## 272 25 Male 540 79 45 65 2.6 2.50 4
## 273 26 Male 550 72 58 69 2.6 2.75 4
## work_yrs frstlang salary satis
## 36 2 English 85000 6
## 46 2 English 100000 7
## 118 2 English 95000 6
## 120 2 English 96000 7
## 133 16 English 105000 5
## 190 4 English 93000 6
## 200 3 English 100000 6
## 208 4 English 108000 6
## 272 3 English 115000 5
## 273 3 English 126710 6
notplaced<-mba[which(mba$salary==0),]
some(notplaced)
## age sex gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg quarter
## 2 24 Male 610 90 71 87 3.50 4.00 1
## 33 42 Female 650 75 98 93 3.38 3.00 1
## 107 30 Male 680 97 87 96 3.00 3.00 2
## 150 25 Male 550 72 58 69 2.90 3.00 3
## 165 27 Female 550 66 63 69 2.90 3.00 3
## 184 34 Male 610 82 78 86 2.70 3.00 3
## 218 25 Male 700 99 87 98 2.00 2.00 4
## 230 27 Male 580 84 58 78 2.70 2.75 4
## 237 28 Male 570 69 71 0 2.30 2.50 4
## 253 32 Male 510 79 22 54 2.30 2.25 4
## work_yrs frstlang salary satis
## 2 2 English 0 6
## 33 13 English 0 5
## 107 4 English 0 5
## 150 3 English 0 6
## 165 3 English 0 4
## 184 12 English 0 5
## 218 1 English 0 7
## 230 1 English 0 5
## 237 5 English 0 5
## 253 5 Other 0 5
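The two subsets above leave out rows whose salary is neither 0 nor above 1000. As a rough sketch of the same split done in one pass, the example below labels every row and then separates the groups. The data are made up, and treating the remaining values as "unknown" is an assumption, not something the data set documents here.

```r
# Toy salaries (made up); 0, 998 and 999 mimic the coded values seen above
toy <- data.frame(id = 1:6,
                  salary = c(0, 999, 85000, 998, 0, 105000))

# Classify each row the same way the subsets above do:
# salary > 1000 -> placed, salary == 0 -> not placed, anything else -> unknown
toy$status <- ifelse(toy$salary > 1000, "placed",
              ifelse(toy$salary == 0, "notplaced", "unknown"))

# split() returns a named list of data frames, one per status
groups <- split(toy, toy$status)
sapply(groups, nrow)
```

Keeping the status as a column makes it easy to check that no row has been dropped by accident.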
summary(placed)
## age sex gmat_tot gmat_qpc gmat_vpc
## Min. :22.00 Female:31 Min. :500 Min. :39.00 Min. :30.00
## 1st Qu.:25.00 Male :72 1st Qu.:580 1st Qu.:72.00 1st Qu.:71.00
## Median :26.00 Median :620 Median :82.00 Median :81.00
## Mean :26.78 Mean :616 Mean :79.73 Mean :78.56
## 3rd Qu.:28.00 3rd Qu.:655 3rd Qu.:89.00 3rd Qu.:92.00
## Max. :40.00 Max. :720 Max. :99.00 Max. :99.00
## gmat_tpc s_avg f_avg quarter
## Min. :51.00 Min. :2.200 Min. :0.000 Min. :1.000
## 1st Qu.:78.00 1st Qu.:2.850 1st Qu.:2.915 1st Qu.:1.000
## Median :87.00 Median :3.100 Median :3.250 Median :2.000
## Mean :84.52 Mean :3.092 Mean :3.091 Mean :2.262
## 3rd Qu.:93.50 3rd Qu.:3.400 3rd Qu.:3.415 3rd Qu.:3.000
## Max. :99.00 Max. :4.000 Max. :4.000 Max. :4.000
## work_yrs frstlang salary satis
## Min. : 0.00 English:96 Min. : 64000 Min. :3.000
## 1st Qu.: 2.00 Other : 7 1st Qu.: 95000 1st Qu.:5.000
## Median : 3.00 Median :100000 Median :6.000
## Mean : 3.68 Mean :103031 Mean :5.883
## 3rd Qu.: 4.00 3rd Qu.:106000 3rd Qu.:6.000
## Max. :16.00 Max. :220000 Max. :7.000
describe(placed)
## vars n mean sd median trimmed mad min
## age 1 103 26.78 3.27 2.60e+01 26.30 2.97 22.0
## sex* 2 103 1.70 0.46 2.00e+00 1.75 0.00 1.0
## gmat_tot 3 103 616.02 50.69 6.20e+02 615.90 59.30 500.0
## gmat_qpc 4 103 79.73 13.39 8.20e+01 81.05 13.34 39.0
## gmat_vpc 5 103 78.56 16.14 8.10e+01 80.33 16.31 30.0
## gmat_tpc 6 103 84.52 11.01 8.70e+01 85.60 11.86 51.0
## s_avg 7 103 3.09 0.38 3.10e+00 3.10 0.44 2.2
## f_avg 8 103 3.09 0.49 3.25e+00 3.13 0.37 0.0
## quarter 9 103 2.26 1.12 2.00e+00 2.20 1.48 1.0
## work_yrs 10 103 3.68 3.01 3.00e+00 3.11 1.48 0.0
## frstlang* 11 103 1.07 0.25 1.00e+00 1.00 0.00 1.0
## salary 12 103 103030.74 17868.80 1.00e+05 101065.06 7413.00 64000.0
## satis 13 103 5.88 0.78 6.00e+00 5.89 1.48 3.0
## max range skew kurtosis se
## age 40 18.0 1.92 4.90 0.32
## sex* 2 1.0 -0.86 -1.28 0.05
## gmat_tot 720 220.0 0.01 -0.69 4.99
## gmat_qpc 99 60.0 -0.81 0.17 1.32
## gmat_vpc 99 69.0 -0.87 0.21 1.59
## gmat_tpc 99 48.0 -0.84 0.19 1.08
## s_avg 4 1.8 -0.13 -0.61 0.04
## f_avg 4 4.0 -2.52 13.86 0.05
## quarter 4 3.0 0.27 -1.34 0.11
## work_yrs 16 16.0 2.48 6.83 0.30
## frstlang* 2 1.0 3.38 9.54 0.02
## salary 220000 156000.0 3.18 17.16 1760.67
## satis 7 4.0 -0.40 0.44 0.08
library(lattice)
histogram(placed$salary,main="Salary Distribution of Placed",xlab="Salary",ylab="Percent of Total",col="grey")
# Variation of different variables with Sex
aggregate(cbind(salary,work_yrs,age)~sex,data = placed,mean)
## sex salary work_yrs age
## 1 Female 98524.39 3.258065 26.06452
## 2 Male 104970.97 3.861111 27.08333
boxplot(salary~sex,data = placed,horizontal=TRUE, col=c("blue3","red2"),main="Salary vs Sex",xlab="Salary",ylab="Sex")
scatterplot(salary~work_yrs,data = placed,main="Distribution of Salary vs Work experience",ylab="Salary",xlab="Work Years")
scatterplot(salary~gmat_tot,data = placed,main="Distribution of Salary vs GMAT Scores",ylab="Salary",xlab="GMAT Scores")
scatterplotMatrix(~salary+s_avg+f_avg,data=placed)
Average salary of the placed MBA graduates of different age groups
aggregate(salary~age,data = placed ,mean)
## age salary
## 1 22 85000.00
## 2 23 91651.20
## 3 24 101518.75
## 4 25 99086.96
## 5 26 101665.00
## 6 27 102214.29
## 7 28 103625.00
## 8 29 102083.33
## 9 30 109916.67
## 10 31 100500.00
## 11 32 107300.00
## 12 33 118000.00
## 13 34 105000.00
## 14 39 112000.00
## 15 40 183000.00
boxplot(salary~age,data=placed,horizontal=TRUE,main="Distribution of Salary vs Age",ylab="Age",xlab="Salary")
This is the average of only those graduates who got placed.
aggregate(salary~satis,data = placed , mean)
## satis salary
## 1 3 95000.00
## 2 4 95000.00
## 3 5 102974.34
## 4 6 105364.20
## 5 7 98531.82
boxplot(salary~satis,data=placed,horizontal=TRUE,main="Distribution of Salary vs Satisfaction",ylab="Satisfaction",xlab="Salary")
boxplot(salary~frstlang,data=placed,horizontal=TRUE,main="Distribution of Salary vs Language",ylab="Language",xlab="Salary")
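The correlation matrix below is computed over all graduates, irrespective of whether they were placed. Because salary and satis still contain the coded values here, the salary and satis correlations in this matrix should be read with caution.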
cor(mba[,c(1,3:10,12,13)])
## age gmat_tot gmat_qpc gmat_vpc gmat_tpc
## age 1.00000000 -0.14593840 -0.21616985 -0.04417547 -0.169903066
## gmat_tot -0.14593840 1.00000000 0.72473781 0.74839187 0.847799647
## gmat_qpc -0.21616985 0.72473781 1.00000000 0.15218014 0.651377538
## gmat_vpc -0.04417547 0.74839187 0.15218014 1.00000000 0.666216035
## gmat_tpc -0.16990307 0.84779965 0.65137754 0.66621604 1.000000000
## s_avg 0.14970402 0.11311702 -0.02984873 0.20445365 0.117362449
## f_avg -0.01744806 0.10442409 0.07370455 0.07592225 0.079732099
## quarter -0.04967221 -0.09223903 0.03636638 -0.17460736 -0.083035351
## work_yrs 0.85829810 -0.18235434 -0.23660827 -0.06639049 -0.173361859
## salary -0.06257355 -0.05497188 -0.04403293 -0.00613934 0.004930901
## satis -0.12788825 0.08255770 0.06060004 0.06262375 0.092934266
## s_avg f_avg quarter work_yrs salary
## age 0.14970402 -0.01744806 -4.967221e-02 0.858298096 -0.062573547
## gmat_tot 0.11311702 0.10442409 -9.223903e-02 -0.182354339 -0.054971880
## gmat_qpc -0.02984873 0.07370455 3.636638e-02 -0.236608270 -0.044032933
## gmat_vpc 0.20445365 0.07592225 -1.746074e-01 -0.066390490 -0.006139340
## gmat_tpc 0.11736245 0.07973210 -8.303535e-02 -0.173361859 0.004930901
## s_avg 1.00000000 0.55062139 -7.621166e-01 0.129292714 0.145836062
## f_avg 0.55062139 1.00000000 -4.475064e-01 -0.039056921 0.029443027
## quarter -0.76211664 -0.44750637 1.000000e+00 -0.086026406 -0.164369865
## work_yrs 0.12929271 -0.03905692 -8.602641e-02 1.000000000 0.009023407
## salary 0.14583606 0.02944303 -1.643699e-01 0.009023407 1.000000000
## satis -0.03268664 0.01089273 -1.267198e-05 -0.109255286 -0.335217114
## satis
## age -1.278882e-01
## gmat_tot 8.255770e-02
## gmat_qpc 6.060004e-02
## gmat_vpc 6.262375e-02
## gmat_tpc 9.293427e-02
## s_avg -3.268664e-02
## f_avg 1.089273e-02
## quarter -1.267198e-05
## work_yrs -1.092553e-01
## salary -3.352171e-01
## satis 1.000000e+00
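Correlations computed on a column that mixes real salaries with coded entries can differ sharply from those computed on real salaries alone. The sketch below illustrates the effect with made-up numbers only; the values are not drawn from the MBA data.

```r
# Made-up example: salary rises with work_yrs for placed graduates,
# while 0 and 999 stand in for coded, non-salary entries
toy <- data.frame(
  work_yrs = c(1, 2, 3, 4, 5, 2, 6, 8),
  salary   = c(90000, 95000, 100000, 105000, 110000, 0, 999, 0)
)

# Correlation over all rows, codes included
cor_all <- cor(toy$work_yrs, toy$salary)

# Correlation over rows with a real salary only
real <- toy$salary > 1000
cor_real <- cor(toy$work_yrs[real], toy$salary[real])

c(all_rows = cor_all, real_salaries = cor_real)
```

This is why the matrix for the placed graduates is the more informative one for questions about salary.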
To see how the salaries of the placed graduates are correlated:
cor(placed[,c(1,3:10,12,13)])
## age gmat_tot gmat_qpc gmat_vpc gmat_tpc
## age 1.00000000 -0.07871678 -0.165039057 0.01799420 -0.09609156
## gmat_tot -0.07871678 1.00000000 0.666382266 0.78038546 0.96680810
## gmat_qpc -0.16503906 0.66638227 1.000000000 0.09466541 0.65865003
## gmat_vpc 0.01799420 0.78038546 0.094665411 1.00000000 0.78443167
## gmat_tpc -0.09609156 0.96680810 0.658650025 0.78443167 1.00000000
## s_avg 0.15654954 0.17198874 0.015471662 0.15865101 0.13938500
## f_avg -0.21699191 0.12246257 0.098418869 0.02290167 0.07051391
## quarter -0.12568145 -0.10578964 0.012648346 -0.12862079 -0.09955033
## work_yrs 0.88052470 -0.12280018 -0.182701263 -0.02812182 -0.13246963
## salary 0.49964284 -0.09067141 0.014141299 -0.13743230 -0.13201783
## satis 0.10832308 0.06474206 -0.003984632 0.14863481 0.11630842
## s_avg f_avg quarter work_yrs salary
## age 0.15654954 -0.21699191 -0.12568145 0.88052470 0.49964284
## gmat_tot 0.17198874 0.12246257 -0.10578964 -0.12280018 -0.09067141
## gmat_qpc 0.01547166 0.09841887 0.01264835 -0.18270126 0.01414130
## gmat_vpc 0.15865101 0.02290167 -0.12862079 -0.02812182 -0.13743230
## gmat_tpc 0.13938500 0.07051391 -0.09955033 -0.13246963 -0.13201783
## s_avg 1.00000000 0.44590413 -0.84038355 0.16328236 0.10173175
## f_avg 0.44590413 1.00000000 -0.43144819 -0.21633018 -0.10603897
## quarter -0.84038355 -0.43144819 1.00000000 -0.12896722 -0.12848526
## work_yrs 0.16328236 -0.21633018 -0.12896722 1.00000000 0.45466634
## salary 0.10173175 -0.10603897 -0.12848526 0.45466634 1.00000000
## satis -0.14356557 -0.11773304 0.22511985 0.06299926 -0.04005060
## satis
## age 0.108323083
## gmat_tot 0.064742057
## gmat_qpc -0.003984632
## gmat_vpc 0.148634805
## gmat_tpc 0.116308417
## s_avg -0.143565573
## f_avg -0.117733043
## quarter 0.225119851
## work_yrs 0.062999256
## salary -0.040050600
## satis 1.000000000
library(gplots)
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(corrgram)
corrgram(placed, order=FALSE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="Corrgram of placed graduates' variables")
Model 1: salary = b0 + b1 * work_yrs + b2 * age + b3 * gmat_qpc + b4 * gmat_vpc + b5 * s_avg + b6 * f_avg + b7 * sex + b8 * frstlang + e
mod1<-lm(salary~work_yrs + age + gmat_qpc + gmat_vpc + s_avg + f_avg + sex +frstlang,data = placed)
summary(mod1)
##
## Call:
## lm(formula = salary ~ work_yrs + age + gmat_qpc + gmat_vpc +
## s_avg + f_avg + sex + frstlang, data = placed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28098 -8994 -1771 4642 79218
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39633.2 29871.9 1.327 0.1878
## work_yrs 663.6 1144.3 0.580 0.5634
## age 1890.4 1135.5 1.665 0.0993 .
## gmat_qpc 120.1 121.0 0.993 0.3234
## gmat_vpc -146.6 101.8 -1.439 0.1535
## s_avg 4006.6 4968.5 0.806 0.4220
## f_avg -1055.5 3808.8 -0.277 0.7823
## sexMale 3737.4 3575.7 1.045 0.2986
## frstlangOther 7851.5 7340.9 1.070 0.2876
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15570 on 94 degrees of freedom
## Multiple R-squared: 0.3006, Adjusted R-squared: 0.2411
## F-statistic: 5.051 on 8 and 94 DF, p-value: 3.102e-05
As the summary above shows, the p-values of all the independent variables are greater than 0.05, even though the overall F-test is significant (p = 3.102e-05).
The estimated coefficients are:
mod1$coefficients
## (Intercept) work_yrs age gmat_qpc gmat_vpc
## 39633.1612 663.5975 1890.4183 120.1321 -146.5648
## s_avg f_avg sexMale frstlangOther
## 4006.6062 -1055.5165 3737.3708 7851.5333
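The names sexMale and frstlangOther appear because `lm()` expands each factor into 0/1 indicator columns measured against the factor's first level. A minimal sketch on made-up factors, using `model.matrix()` to show the columns that would be built:

```r
# Toy factors (made up) with the same levels as in the data above
toy <- data.frame(
  sex      = factor(c("Female", "Male", "Male", "Female")),
  frstlang = factor(c("English", "English", "Other", "English"))
)

# model.matrix() shows the design matrix that lm() would build:
# one 0/1 column per non-reference level, plus the intercept
X <- model.matrix(~ sex + frstlang, data = toy)
colnames(X)
X
```

The sexMale coefficient is therefore the estimated salary difference of male relative to female graduates, and frstlangOther the difference of other first languages relative to English, with the remaining predictors held fixed.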
Visualising the coefficients
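library(coefplot)  # assumed source of coefplot(); it is not loaded earlier in this document
coefplot(mod1)
As can be seen, the intervals of all the coefficients include zero.
Model 1 is therefore not a good model for this regression.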
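One possible contributor is that age and work_yrs are strongly correlated among the placed graduates (about 0.88 in the correlation matrix above), which tends to inflate the standard errors of both. A common diagnostic is the variance inflation factor. The sketch below computes it by hand on simulated data; `toy` and `vif_by_hand` are illustrative names, not part of the original analysis or of any package.

```r
set.seed(1)
n <- 100
# Simulated predictors: work_yrs is built to track age closely
age      <- round(rnorm(n, mean = 27, sd = 3))
work_yrs <- pmax(0, age - 23 + round(rnorm(n, sd = 1)))
s_avg    <- runif(n, 2, 4)
toy <- data.frame(age, work_yrs, s_avg)

# VIF of predictor j = 1 / (1 - R^2) from regressing j on the other predictors
vif_by_hand <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

vif_by_hand(toy)
```

Values close to 1 indicate little overlap with the other predictors, while larger values indicate that a predictor is largely explained by the rest.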
Model 2: salary = b0 + b1 * work_yrs + b2 * satis + b3 * sex + b4 * frstlang + e
mod2<-lm(salary~work_yrs + satis + sex +frstlang,data = placed)
summary(mod2)
##
## Call:
## lm(formula = salary ~ work_yrs + satis + sex + frstlang, data = placed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30492 -8055 -1744 5362 80436
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 102214.0 11827.8 8.642 1.06e-13 ***
## work_yrs 2409.4 526.1 4.579 1.37e-05 ***
## satis -2244.4 1988.4 -1.129 0.2618
## sexMale 5949.5 3392.2 1.754 0.0826 .
## frstlangOther 14675.7 6274.0 2.339 0.0214 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15580 on 98 degrees of freedom
## Multiple R-squared: 0.2695, Adjusted R-squared: 0.2397
## F-statistic: 9.038 on 4 and 98 DF, p-value: 2.953e-06
In this model work_yrs (p = 1.37e-05) and frstlang (p = 0.0214) are significant at the 5% level, sex is significant only at the 10% level (p = 0.0826), and satis is not significant (p = 0.2618).
The estimated coefficients are:
mod2$coefficients
## (Intercept) work_yrs satis sexMale frstlangOther
## 102213.950 2409.391 -2244.425 5949.464 14675.672
Visualising the coefficients
coefplot(mod2)
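The intervals that a coefficient plot draws can also be inspected numerically with `confint()`. The sketch below uses simulated data (all names are hypothetical) and flags the coefficients whose 95% interval contains zero, which for a linear model matches a p-value above 0.05.

```r
set.seed(42)
n <- 80
# Simulated data: y depends on x1 but not on x2
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 5 + 2 * x1 + rnorm(n)
fit <- lm(y ~ x1 + x2)

# 95% confidence intervals, one row per coefficient
ci <- confint(fit, level = 0.95)

# TRUE where the interval contains zero
crosses_zero <- ci[, 1] < 0 & ci[, 2] > 0
cbind(ci, crosses_zero)
```

Applied to a fitted model such as mod2, the same call would list the interval for each predictor next to the plot.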
# Conclusion
1. It is clear that work experience and first language make a significant difference to the salary of the graduates.
2. With every additional year of work experience, salary increases by about Rs. 2,409, holding the other predictors in Model 2 fixed.
3. First language has a relatively greater impact on salary than sex, GMAT scores, age and satisfaction with the course.
4. Model 2 suits the regression better than Model 1: it has significant predictors and a similar adjusted R-squared (0.2397 against 0.2411) with fewer variables.
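The two models are not nested (Model 2 contains satis, which Model 1 lacks), so they are more naturally compared through adjusted R-squared or AIC than through an F-test between them. The following sketch shows such a comparison on simulated data; the variables and model names are made up for illustration.

```r
set.seed(7)
n <- 100
# Simulated data: y depends on x1 only; x2 and x3 are noise
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 3 * d$x1 + rnorm(n)

fit_a <- lm(y ~ x2 + x3, data = d)   # omits the relevant predictor
fit_b <- lm(y ~ x1 + x3, data = d)   # includes it

# Higher adjusted R-squared and lower AIC both favour a model
comparison <- data.frame(
  model  = c("fit_a", "fit_b"),
  adj_r2 = c(summary(fit_a)$adj.r.squared, summary(fit_b)$adj.r.squared),
  aic    = c(AIC(fit_a), AIC(fit_b))
)
comparison
```

The same table could be built for mod1 and mod2, since both are fitted on the same placed graduates.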