This project is a statistical analysis of the starting salaries of MBA graduates. It is mainly a categorical data analysis, built on contingency tables and chi-squared tests. The analysis was expected to answer the following questions:
Does the GMAT score have an effect on the marks?
Is English as a first language a statistically significant factor contributing to a higher GMAT score?
Did gender of the students make a difference in their respective starting salaries?
Did students like the programs that they were selected for?
setwd("C:/Users/Shreyas Jadhav/Downloads")
# Read the data set from the working directory set above
mba <- read.csv("MBA Starting Salaries Data.csv")
View(mba)
attach(mba)
summary(mba)
## age sex gmat_tot gmat_qpc
## Min. :22.00 Min. :1.000 Min. :450.0 Min. :28.00
## 1st Qu.:25.00 1st Qu.:1.000 1st Qu.:580.0 1st Qu.:72.00
## Median :27.00 Median :1.000 Median :620.0 Median :83.00
## Mean :27.36 Mean :1.248 Mean :619.5 Mean :80.64
## 3rd Qu.:29.00 3rd Qu.:1.000 3rd Qu.:660.0 3rd Qu.:93.00
## Max. :48.00 Max. :2.000 Max. :790.0 Max. :99.00
## gmat_vpc gmat_tpc s_avg f_avg
## Min. :16.00 Min. : 0.0 Min. :2.000 Min. :0.000
## 1st Qu.:71.00 1st Qu.:78.0 1st Qu.:2.708 1st Qu.:2.750
## Median :81.00 Median :87.0 Median :3.000 Median :3.000
## Mean :78.32 Mean :84.2 Mean :3.025 Mean :3.062
## 3rd Qu.:91.00 3rd Qu.:94.0 3rd Qu.:3.300 3rd Qu.:3.250
## Max. :99.00 Max. :99.0 Max. :4.000 Max. :4.000
## quarter work_yrs frstlang salary
## Min. :1.000 Min. : 0.000 Min. :1.000 Min. : 0
## 1st Qu.:1.250 1st Qu.: 2.000 1st Qu.:1.000 1st Qu.: 0
## Median :2.000 Median : 3.000 Median :1.000 Median : 999
## Mean :2.478 Mean : 3.872 Mean :1.117 Mean : 39026
## 3rd Qu.:3.000 3rd Qu.: 4.000 3rd Qu.:1.000 3rd Qu.: 97000
## Max. :4.000 Max. :22.000 Max. :2.000 Max. :220000
## satis
## Min. : 1.0
## 1st Qu.: 5.0
## Median : 6.0
## Mean :172.2
## 3rd Qu.: 7.0
## Max. :998.0
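The summary above shows that `satis` mixes real scores with the code 998 (third quartile 7, maximum 998), which pulls its mean up to 172.2. A minimal sketch of recoding such a code to `NA` before summarising, on a small made-up data frame `toy` (hypothetical values, not the MBA data), and assuming that 998 marks an unanswered entry:

```r
# Hypothetical toy data; 998 is assumed to be a "no answer" code.
toy <- data.frame(satis = c(5, 998, 6, 7, 6))

# Copy the data and replace the sentinel code with NA.
toy_clean <- toy
toy_clean$satis[toy_clean$satis == 998] <- NA

# Compare the mean with and without the sentinel code.
mean_satis_raw   <- mean(toy$satis)
mean_satis_clean <- mean(toy_clean$satis, na.rm = TRUE)
print(c(raw = mean_satis_raw, clean = mean_satis_clean))
```

The same idea would apply to any column that uses numeric codes for missing answers.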
table(sex,frstlang)
##    frstlang
## sex   1   2
##   1 182  24
##   2  60   8
placed<-mba[which(mba$salary!=0 & mba$salary!=999 & mba$salary!=998),]
not_placed<-mba[which(mba$salary==0),]
min(placed$salary)
## [1] 64000
max(placed$salary)
## [1] 220000
table(placed$sex)
##
## 1 2
## 72 31
table(mba$salary>999)
##
## FALSE TRUE
## 171 103
table(mba$salary==0)
##
## FALSE TRUE
## 184 90
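The two tables above separate disclosed salaries (values above 999) from zero salaries. A minimal sketch of labelling every row in one step, on a made-up vector `salary_toy`. The label "undisclosed" for the codes 998 and 999 is an assumption, since the analysis only excludes these rows from the placed group:

```r
# Hypothetical salary values, including the special codes 0, 998 and 999.
salary_toy <- c(0, 998, 999, 64000, 220000, 0)

# Map each salary to a placement status (labels are illustrative).
status <- ifelse(salary_toy == 0, "not placed",
          ifelse(salary_toy %in% c(998, 999), "undisclosed", "placed"))
status <- factor(status, levels = c("placed", "not placed", "undisclosed"))

# One table with all three groups.
status_counts <- table(status)
print(status_counts)
```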
hist(mba$salary,main="Salaries of MBA graduates",xlab="Salary",ylab="Count",
breaks=20,col="lightblue",xlim=c(0,220000),ylim=c(0,200))
library(car)
scatterplotMatrix(formula=~age+s_avg,cex=0.6,data=mba,diagonal="density")
library(corrplot)
## corrplot 0.84 loaded
A<-cor(mba)
wb<-c("white","black")
corrplot(A,order="hclust",addrect=2,col=wb,bg="blue")
c1<-cut(age,5)
table(c1,sex)
## sex
## c1 1 2
## (22,27.2] 135 47
## (27.2,32.4] 58 15
## (32.4,37.6] 8 3
## (37.6,42.8] 2 3
## (42.8,48] 3 0
c2<-cut(s_avg,5)
mytable<-xtabs(~frstlang+sex+c2)
prop.table(ftable(mytable))*100
##           c2     (2,2.4]  (2.4,2.8]  (2.8,3.2]  (3.2,3.6]    (3.6,4]
## frstlang sex
## 1        1     5.8394161 16.7883212 24.0875912 16.7883212  2.9197080
##          2     0.3649635  4.7445255  6.2043796  9.8540146  0.7299270
## 2        1     0.7299270  2.9197080  3.6496350  1.4598540  0.0000000
##          2     0.0000000  1.0948905  1.8248175  0.0000000  0.0000000
c3<-cut(placed$gmat_tot,5)
mytable2<-xtabs(~frstlang+sex+c3,data=placed)
ftable(mytable2)
##           c3 (500,544] (544,588] (588,632] (632,676] (676,720]
## frstlang sex
## 1        1           3        18        27        11         9
##          2           2         6         8         8         4
## 2        1           1         0         1         1         1
##          2           1         2         0         0         0
mytable3<-xtabs(~sex+satis,data=placed)
mytable3
## satis
## sex 3 4 5 6 7
## 1 0 1 17 40 14
## 2 1 0 12 10 8
mytable4<-xtabs(~sex+quarter+satis,data=placed)
ftable(mytable4)
##             satis  3  4  5  6  7
## sex quarter
## 1   1              0  1  7 10  5
##     2              0  0  4 13  2
##     3              0  0  4  9  4
##     4              0  0  2  8  3
## 2   1              1  0  6  3  2
##     2              0  0  4  1  1
##     3              0  0  1  3  3
##     4              0  0  1  3  2
library(corrplot) # corrplot.mixed() is provided by corrplot, not car
corrplot.mixed(corr=cor(placed[,c(3:11)],use="complete.obs"),upper="circle",tl.pos="lt")
library(lattice)
barchart(work_yrs~age,data=placed,groups=sex,auto.key=TRUE,
par.settings=simpleTheme(col=c("lightpink","blue")))
barchart(frstlang~gmat_tot,data=placed,groups=sex,xlab="Total GMAT score",
ylab="First Language",
main="Distribution of total GMAT score wrt\nfirst language and gender",
auto.key=TRUE,par.settings=simpleTheme(col=c("black","red")))
barchart(quarter~gmat_tot,data=placed,groups=sex,xlab="Total GMAT score",
ylab="Quarter",
main="Distribution of total GMAT score wrt\nquarter and gender",
auto.key=TRUE,par.settings=simpleTheme(col=c("pink","red")))
mytable5<-xtabs(~quarter+salary,data=placed)
chisq.test(mytable5)
## Warning in chisq.test(mytable5): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable5
## X-squared = 129.85, df = 123, p-value = 0.3186
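The warning above appears because the table is sparse: df = 123 = (4 - 1) x (42 - 1), so salary contributes 42 distinct values and most cells have very small expected counts. A minimal sketch of two common ways to handle this, on simulated data (the data frame `toy` and the three salary bands are hypothetical, not taken from the MBA data):

```r
set.seed(1)  # reproducible toy data
n <- 100
toy <- data.frame(
  quarter = sample(1:4, n, replace = TRUE),
  salary  = round(rnorm(n, mean = 100000, sd = 15000), -3)
)

# Option 1: collapse salary into three bands (tertiles) before tabulating.
toy$salary_band <- cut(toy$salary,
                       breaks = quantile(toy$salary, c(0, 1/3, 2/3, 1)),
                       include.lowest = TRUE,
                       labels = c("low", "mid", "high"))
banded <- xtabs(~ quarter + salary_band, data = toy)
res_banded <- chisq.test(banded)

# Option 2: keep the sparse table but use a Monte Carlo p-value,
# which does not rely on the large-sample chi-squared approximation.
sparse <- xtabs(~ quarter + salary, data = toy)
res_sim <- chisq.test(sparse, simulate.p.value = TRUE, B = 2000)

print(res_banded$p.value)
print(res_sim$p.value)
```

Banding trades detail for larger cell counts, while the simulated p-value keeps every distinct salary but costs some computation.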
mytable6<-xtabs(~age+salary,data=placed)
chisq.test(mytable6)
## Warning in chisq.test(mytable6): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable6
## X-squared = 717.62, df = 574, p-value = 3.929e-05
mytable7<-xtabs(~gmat_tot+salary,data=placed)
chisq.test(mytable7)
## Warning in chisq.test(mytable7): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable7
## X-squared = 927.24, df = 820, p-value = 0.005279
mytable8<-xtabs(~I((s_avg+f_avg)/2)+salary,data=placed)
chisq.test(mytable8)
## Warning in chisq.test(mytable8): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable8
## X-squared = 2138.9, df = 2132, p-value = 0.454
mytable9<-xtabs(~sex+salary,data=placed)
chisq.test(mytable9)
## Warning in chisq.test(mytable9): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable9
## X-squared = 52.681, df = 41, p-value = 0.1045
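The test above treats each of the 42 distinct salaries as a separate category. To ask directly whether one group earns more than the other, a two-sample comparison of the salaries is a common alternative. A minimal sketch on simulated data, where the data frame `toy`, the group sizes and the labels "male" and "female" are all hypothetical:

```r
set.seed(42)  # reproducible toy data
toy <- data.frame(
  sex    = factor(rep(c("male", "female"), times = c(30, 20))),
  salary = c(rnorm(30, mean = 105000, sd = 15000),
             rnorm(20, mean = 98000,  sd = 15000))
)

# Welch two-sample t-test: compares the mean salary of the two groups.
tt <- t.test(salary ~ sex, data = toy)

# Wilcoxon rank-sum test: a rank-based alternative for skewed salaries.
wt <- wilcox.test(salary ~ sex, data = toy)

# Group means, for reading the direction of any difference.
group_means <- tapply(toy$salary, toy$sex, mean)
print(group_means)
print(c(t_test = tt$p.value, wilcoxon = wt$p.value))
```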
mytable10<-xtabs(~sex+satis,data=placed)
chisq.test(mytable10)
## Warning in chisq.test(mytable10): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: mytable10
## X-squared = 7.3413, df = 4, p-value = 0.1189
mytable11<-xtabs(~sex+quarter,data=placed)
chisq.test(mytable11)
##
## Pearson's Chi-squared test
##
## data: mytable11
## X-squared = 0.76332, df = 3, p-value = 0.8582
mytable12<-xtabs(~quarter+work_yrs,data=placed)
chisq.test(mytable12)
## Warning in chisq.test(mytable12): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: mytable12
## X-squared = 29.47, df = 33, p-value = 0.6436
mytable13<-xtabs(~sex+gmat_tpc,data=placed)
chisq.test(mytable13)
## Warning in chisq.test(mytable13): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: mytable13
## X-squared = 29.541, df = 30, p-value = 0.4893
mytable14<-xtabs(~gmat_tot+quarter,data=placed)
chisq.test(mytable14)
## Warning in chisq.test(mytable14): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: mytable14
## X-squared = 56.697, df = 60, p-value = 0.5972
mytable15<-xtabs(~s_avg+f_avg,data=placed)
chisq.test(mytable15)
## Warning in chisq.test(mytable15): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: mytable15
## X-squared = 551.43, df = 294, p-value < 2.2e-16
mytable16<-xtabs(~work_yrs+satis,data=placed)
chisq.test(mytable16)
## Warning in chisq.test(mytable16): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: mytable16
## X-squared = 131.13, df = 44, p-value = 1.35e-10
mytable17<-xtabs(~s_avg+quarter,data=placed)
chisq.test(mytable17)
## Warning in chisq.test(mytable17): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: mytable17
## X-squared = 279.96, df = 63, p-value < 2.2e-16
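A chi-squared test reports whether two variables are associated but not in which direction. Because `quarter` is ordered, a rank correlation can show the direction as well. A minimal sketch on simulated data, assuming that quarter 1 is the top quarter of the class and quarter 4 the bottom (the vectors `s_avg_toy` and `quarter_toy` are hypothetical):

```r
set.seed(7)  # reproducible toy data
s_avg_toy <- round(runif(80, min = 2, max = 4), 2)

# Assign quarters from the averages: the highest averages get quarter 1.
bins <- cut(s_avg_toy, breaks = quantile(s_avg_toy, 0:4 / 4),
            include.lowest = TRUE)
quarter_toy <- 5 - as.integer(bins)

# Spearman rank correlation; exact = FALSE because the data contain ties.
rho <- cor.test(s_avg_toy, quarter_toy, method = "spearman", exact = FALSE)
print(rho$estimate)
print(rho$p.value)
```

A negative estimate here means that higher averages go with lower quarter numbers.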
mytable18<-xtabs(~f_avg+quarter,data=placed)
chisq.test(mytable18)
## Warning in chisq.test(mytable18): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: mytable18
## X-squared = 86.722, df = 42, p-value = 6.017e-05
mytable19<-xtabs(~age+salary,data=placed)
chisq.test(mytable19)
## Warning in chisq.test(mytable19): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: mytable19
## X-squared = 717.62, df = 574, p-value = 3.929e-05
bwplot(~salary,data=placed,xlab="Salaries",
main="Boxplot of Salaries",horizontal=TRUE)
fit1<-lm(salary~ age+satis+sex+frstlang, data=placed)
summary(fit1)
##
## Call:
## lm(formula = salary ~ age + satis + sex + frstlang, data = placed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25463 -9177 -1636 5686 79645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48730.8 17649.8 2.761 0.00688 **
## age 2452.8 508.1 4.827 5.1e-06 ***
## satis -2542.7 1972.2 -1.289 0.20034
## sex -4720.6 3393.2 -1.391 0.16732
## frstlang 9105.5 6524.3 1.396 0.16598
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15430 on 98 degrees of freedom
## Multiple R-squared: 0.2835, Adjusted R-squared: 0.2543
## F-statistic: 9.695 on 4 and 98 DF, p-value: 1.197e-06
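In the model above, `sex` and `frstlang` enter as the numeric codes 1 and 2. For a two-level variable this gives the same slope as a factor, but a factor labels the coefficient with the level it refers to. A minimal sketch on simulated data, where the data frame `toy` and the labels "male" and "female" for codes 1 and 2 are assumptions:

```r
set.seed(3)  # reproducible toy data
n <- 60
toy <- data.frame(
  age = sample(23:35, n, replace = TRUE),
  sex = sample(1:2, n, replace = TRUE)
)
toy$salary <- 50000 + 2000 * toy$age - 4000 * (toy$sex == 2) +
  rnorm(n, sd = 8000)

# Numeric coding: the coefficient is the change per unit step from 1 to 2.
fit_num <- lm(salary ~ age + sex, data = toy)

# Factor coding: the coefficient is named after the non-reference level.
toy$sex_f <- factor(toy$sex, levels = 1:2, labels = c("male", "female"))
fit_fac <- lm(salary ~ age + sex_f, data = toy)

print(coef(fit_num)["sex"])
print(coef(fit_fac)["sex_ffemale"])
```

Both coefficients estimate the same difference between the two groups; only the intercept and the coefficient name change.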
fit2<-lm(salary~age+satis+sex+frstlang+work_yrs, data=placed)
summary(fit2)
##
## Call:
## lm(formula = salary ~ age + satis + sex + frstlang + work_yrs,
## data = placed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25863 -9753 -834 5571 78637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63735.6 26478.0 2.407 0.018 *
## age 1719.3 1089.4 1.578 0.118
## satis -2471.2 1978.7 -1.249 0.215
## sex -4999.4 3420.1 -1.462 0.147
## frstlang 10459.5 6775.8 1.544 0.126
## work_yrs 850.8 1117.2 0.762 0.448
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15460 on 97 degrees of freedom
## Multiple R-squared: 0.2878, Adjusted R-squared: 0.2511
## F-statistic: 7.839 on 5 and 97 DF, p-value: 3.121e-06
fit3<-lm(salary~ age+satis+sex+frstlang+quarter+gmat_tot, data=placed)
summary(fit3)
##
## Call:
## lm(formula = salary ~ age + satis + sex + frstlang + quarter +
## gmat_tot, data = placed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24883 -9077 -2370 5880 80864
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60431.42 26748.87 2.259 0.0261 *
## age 2352.11 523.32 4.495 1.95e-05 ***
## satis -2064.90 2051.36 -1.007 0.3167
## sex -4866.33 3417.08 -1.424 0.1577
## frstlang 9633.98 6679.04 1.442 0.1524
## quarter -1215.85 1456.29 -0.835 0.4058
## gmat_tot -15.33 30.95 -0.495 0.6216
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15520 on 96 degrees of freedom
## Multiple R-squared: 0.2898, Adjusted R-squared: 0.2454
## F-statistic: 6.53 on 6 and 96 DF, p-value: 8.343e-06
fit4<-lm(salary~quarter+work_yrs+frstlang+age, data=placed)
summary(fit4)
##
## Call:
## lm(formula = salary ~ quarter + work_yrs + frstlang + age, data = placed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29740 -9176 -1180 4180 78065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44849.8 23850.9 1.880 0.0630 .
## quarter -1366.1 1407.8 -0.970 0.3342
## work_yrs 750.4 1117.3 0.672 0.5034
## frstlang 9610.8 6818.8 1.409 0.1619
## age 1801.8 1080.2 1.668 0.0985 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15580 on 98 degrees of freedom
## Multiple R-squared: 0.2696, Adjusted R-squared: 0.2398
## F-statistic: 9.045 on 4 and 98 DF, p-value: 2.923e-06
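The four fits above are compared by reading adjusted R-squared and the residual standard error from each summary. A minimal sketch of collecting these figures into one table, on simulated data (the data frame `toy` and the models `m1` to `m3` are hypothetical stand-ins for `fit1` to `fit4`):

```r
set.seed(11)  # reproducible toy data
n <- 100
toy <- data.frame(
  age     = sample(23:40, n, replace = TRUE),
  quarter = sample(1:4, n, replace = TRUE)
)
toy$work_yrs <- pmax(0, toy$age - 24 + sample(-1:1, n, replace = TRUE))
toy$salary   <- 40000 + 2500 * toy$age + rnorm(n, sd = 15000)

# Candidate models, kept in a named list.
fits <- list(
  m1 = lm(salary ~ age, data = toy),
  m2 = lm(salary ~ age + work_yrs, data = toy),
  m3 = lm(salary ~ age + work_yrs + quarter, data = toy)
)

# One row per model: adjusted R-squared, residual standard error, AIC.
comparison <- data.frame(
  model  = names(fits),
  adj_r2 = sapply(fits, function(f) summary(f)$adj.r.squared),
  sigma  = sapply(fits, sigma),
  aic    = sapply(fits, AIC)
)
print(comparison)

# Model with the highest adjusted R-squared.
best <- comparison$model[which.max(comparison$adj_r2)]
print(best)
```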
The number of males whose first language is English (182) is about three times the number of females whose first language is English (60).
The minimum salary of a placed MBA graduate is 64000 and the maximum is 220000.
Of the 274 students, 103 were placed and disclosed their salary.
90 MBA graduates reported a salary of 0, meaning that they did not get a job and were not placed.
Most males (135) and most females (47) fall in the youngest age group, (22,27.2].
Students whose first language is English performed better in the GMAT entrance exam than those whose first language is not English.
The most common satisfaction score among placed males was 6 (40 of 72).
A student's spring and fall MBA averages are negatively correlated with the quarter number, which means that the higher the MBA average, the better the quarter.
Students aged 40 have about 11 years of work experience.
Among students whose first language is English, females scored better than males on the GMAT exam.
Among students whose first language is not English, males scored better than females on the GMAT exam.
A high GMAT score does not necessarily go with a better quarter at the MBA college (chi-squared test, p = 0.5972).
A student's salary is not significantly related to the quarter at the MBA college (chi-squared test, p = 0.3186).
Older students tend to receive higher salaries after graduating (chi-squared test, p = 3.929e-05; the age coefficient in fit1 is positive and significant).
Surprisingly, total GMAT score is associated with salary in the chi-squared test (p = 0.005279), although its coefficient in the regression model fit3 is not significant (p = 0.6216).
The average of the spring and fall MBA scores is not significantly associated with the salary of the student (p = 0.454).
Students who have a good spring MBA score tend to have a good fall MBA score too (chi-squared test, p < 2.2e-16).
Students who have good fall and spring MBA scores are in a better quarter, as per Pearson's chi-squared tests (p < 2.2e-16 for spring, p = 6.017e-05 for fall).
The median salary of placed graduates is around 100000, and very few students have a salary over 200000.
English as a first language is a statistically significant factor for a better GMAT score.
Of all the regression models analysed, the best fit is fit1, lm(salary ~ age + satis + sex + frstlang, data = placed): it has the highest adjusted R-squared (0.2543), the lowest overall p-value (1.197e-06) and the lowest residual standard error (15430).