Introduction

This project is a statistical analysis of the starting salaries of MBA graduates. It is a categorical statistical analysis, carried out to answer the following questions:

  1. Does the GMAT score have an effect on the marks?

  2. Is having English as a first language a statistically significant factor contributing to a higher GMAT score?

  3. Did gender of the students make a difference in their respective starting salaries?

  4. Did students like the programs that they were selected for?

Analysis

Reading the data set

setwd("C:/Users/Shreyas Jadhav/Downloads")  
mba <- read.csv("MBA Starting Salaries Data.csv")
View(mba)
attach(mba)
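
A variant of the loading step that avoids attach() might look like the sketch below. It assumes, as the conclusions further down suggest, that sex is coded 1 = male, 2 = female and frstlang is coded 1 = English, 2 = other. The small data frame and the names mba_demo, sex_f and frstlang_f are invented so the block runs without the CSV file.

```r
# Hypothetical stand-in for the CSV so the sketch runs on its own;
# with the real file this would be read.csv("MBA Starting Salaries Data.csv").
mba_demo <- data.frame(
  sex      = c(1, 2, 1, 1, 2),
  frstlang = c(1, 1, 2, 1, 2),
  salary   = c(0, 999, 85000, 998, 105000)
)

# Assumed coding: sex 1 = male, 2 = female; frstlang 1 = English, 2 = other.
mba_demo$sex_f      <- factor(mba_demo$sex, levels = c(1, 2),
                              labels = c("Male", "Female"))
mba_demo$frstlang_f <- factor(mba_demo$frstlang, levels = c(1, 2),
                              labels = c("English", "Other"))

# with() gives column access by name without attaching the data frame.
demo_tab <- with(mba_demo, table(sex_f, frstlang_f))
print(demo_tab)
```

Labelled factors make tables and plot legends easier to read, and with() or a data= argument avoids the name clashes that attach() can cause.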

Summary of mba data set

summary(mba)
##       age             sex           gmat_tot        gmat_qpc    
##  Min.   :22.00   Min.   :1.000   Min.   :450.0   Min.   :28.00  
##  1st Qu.:25.00   1st Qu.:1.000   1st Qu.:580.0   1st Qu.:72.00  
##  Median :27.00   Median :1.000   Median :620.0   Median :83.00  
##  Mean   :27.36   Mean   :1.248   Mean   :619.5   Mean   :80.64  
##  3rd Qu.:29.00   3rd Qu.:1.000   3rd Qu.:660.0   3rd Qu.:93.00  
##  Max.   :48.00   Max.   :2.000   Max.   :790.0   Max.   :99.00  
##     gmat_vpc        gmat_tpc        s_avg           f_avg      
##  Min.   :16.00   Min.   : 0.0   Min.   :2.000   Min.   :0.000  
##  1st Qu.:71.00   1st Qu.:78.0   1st Qu.:2.708   1st Qu.:2.750  
##  Median :81.00   Median :87.0   Median :3.000   Median :3.000  
##  Mean   :78.32   Mean   :84.2   Mean   :3.025   Mean   :3.062  
##  3rd Qu.:91.00   3rd Qu.:94.0   3rd Qu.:3.300   3rd Qu.:3.250  
##  Max.   :99.00   Max.   :99.0   Max.   :4.000   Max.   :4.000  
##     quarter         work_yrs         frstlang         salary      
##  Min.   :1.000   Min.   : 0.000   Min.   :1.000   Min.   :     0  
##  1st Qu.:1.250   1st Qu.: 2.000   1st Qu.:1.000   1st Qu.:     0  
##  Median :2.000   Median : 3.000   Median :1.000   Median :   999  
##  Mean   :2.478   Mean   : 3.872   Mean   :1.117   Mean   : 39026  
##  3rd Qu.:3.000   3rd Qu.: 4.000   3rd Qu.:1.000   3rd Qu.: 97000  
##  Max.   :4.000   Max.   :22.000   Max.   :2.000   Max.   :220000  
##      satis      
##  Min.   :  1.0  
##  1st Qu.:  5.0  
##  Median :  6.0  
##  Mean   :172.2  
##  3rd Qu.:  7.0  
##  Max.   :998.0
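
The mean satis of 172.2 (with a maximum of 998) and the median salary of 999 suggest that values such as 998 and 999 are placeholder codes, not measurements. Assuming that reading is correct, one way to keep them out of summaries is to recode them as NA. The data below are invented for illustration.

```r
# Invented values that mimic the pattern seen in the summary above.
demo <- data.frame(
  salary = c(0, 998, 999, 64000, 105000),
  satis  = c(5, 998, 6, 7, 6)
)

# Assumption: 998 and 999 are "not answered / not disclosed" codes.
codes <- c(998, 999)
cleaned <- demo
cleaned$salary[cleaned$salary %in% codes] <- NA
cleaned$satis[cleaned$satis %in% codes]   <- NA

# Summaries ignore the coded rows when na.rm = TRUE.
mean_satis_raw   <- mean(demo$satis)
mean_satis_clean <- mean(cleaned$satis, na.rm = TRUE)
print(c(raw = mean_satis_raw, clean = mean_satis_clean))
```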

Table to study how many students of each gender have English as their first language

table(sex,frstlang)
##    frstlang
## sex   1   2
##   1 182  24
##   2  60   8
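
The table above gives counts. To read them as proportions within each gender, prop.table() can be applied row-wise. The sketch below re-enters the counts by hand so that it runs on its own; the names counts and row_props are illustrative.

```r
# Counts copied from the table above (rows: sex 1 and 2; columns: frstlang 1 and 2).
counts <- matrix(c(182, 24,
                    60,  8),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(sex = c("1", "2"), frstlang = c("1", "2")))

# margin = 1 divides each row by its row total, giving the share of each
# first-language group within each sex.
row_props <- prop.table(counts, margin = 1)
print(round(row_props * 100, 1))
```

On these counts, roughly 88% of each sex falls in the frstlang = 1 column (182 of 206 and 60 of 68).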

New data set consisting only of those candidates who got a job and disclosed their salary

placed<-mba[which(mba$salary!=0 & mba$salary!=999 & mba$salary!=998),]

New data set consisting only of those candidates who didn’t get a job

not_placed<-mba[which(mba$salary==0),]

Minimum salary

min(placed$salary)
## [1] 64000

Maximum salary

max(placed$salary)
## [1] 220000

Table for gender distribution of placed students.

table(placed$sex)
## 
##  1  2 
## 72 31

Table to find out the number of students who disclosed their salary.

table(mba$salary>999)
## 
## FALSE  TRUE 
##   171   103

Table to find out how many students did not get placed.

table(mba$salary==0)
## 
## FALSE  TRUE 
##   184    90
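
The two tables above can be combined into a single three-way classification. The sketch below assumes that 998 and 999 are placeholder codes for students who did not report a salary. The salaries are invented, and the status labels are illustrative only.

```r
# Invented salaries; 998 and 999 are treated as placeholder codes (an assumption).
salary_demo <- c(0, 0, 998, 999, 64000, 96000, 999, 120000)

status <- ifelse(salary_demo == 0, "not placed",
          ifelse(salary_demo %in% c(998, 999), "coded 998/999", "salary disclosed"))

status_counts <- table(status)
print(status_counts)

# Arithmetic check using the counts reported above: 274 students in total,
# 103 with salary > 999 and 90 with salary == 0.
remaining <- 274 - 103 - 90
print(remaining)
```

With the reported counts, 81 of the 274 students fall into neither the "salary disclosed" nor the "not placed" group.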

Histogram of salaries of MBA graduates

hist(mba$salary,main="Salary of Passed out students",xlab="Salary",ylab="Count"
     ,breaks=20,col="lightblue",xlim=c(0,220000),ylim=c(0,200))
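
Because mba$salary still contains the values 0, 998 and 999, the first bar of this histogram mixes unplaced and non-reporting students with actual salaries. A histogram restricted to disclosed salaries could be sketched as follows, on invented values and with illustrative bin widths.

```r
# Invented salaries including the 0 / 998 / 999 values seen in the summary.
salary_all <- c(0, 0, 999, 998, 64000, 85000, 96000, 105000, 220000)

# Keep only values above the placeholder codes, mirroring the filter used
# for the 'placed' subset.
placed_salary <- salary_all[salary_all > 999]

# plot = FALSE returns the bin counts without drawing; drop it to see the plot.
h <- hist(placed_salary, breaks = seq(0, 250000, by = 50000), plot = FALSE)
print(data.frame(lower = head(h$breaks, -1), count = h$counts))
```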

ScatterplotMatrix for age and spring MBA average.

library(car)
scatterplotMatrix(formula=~age+s_avg,cex=0.6,data=mba,diagonal="density")

Corrplots

library(corrplot)
## corrplot 0.84 loaded
A<-cor(mba)
wb<-c("white","black")
corrplot(A,order="hclust",addrect=2,col=wb,bg="blue")

Number of males and females in each of 5 age intervals

c1<-cut(age,5)
table(c1,sex)
##              sex
## c1              1   2
##   (22,27.2]   135  47
##   (27.2,32.4]  58  15
##   (32.4,37.6]   8   3
##   (37.6,42.8]   2   3
##   (42.8,48]     3   0
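
cut(age,5) splits the range into five equal-width intervals, which gives limits such as 27.2 and 32.4. If round limits are preferred, the breaks can be supplied explicitly. The sketch below uses invented ages, and the break points are only one possible choice.

```r
# Invented ages covering the observed range of 22 to 48.
age_demo <- c(22, 25, 27, 28, 31, 35, 40, 48)

# Explicit breaks give round interval limits; include.lowest = TRUE keeps
# age 22 in the first interval, and right = TRUE (the default) closes each
# interval on the right.
age_group <- cut(age_demo,
                 breaks = c(22, 27, 32, 37, 42, 48),
                 include.lowest = TRUE)
age_counts <- table(age_group)
print(age_counts)
```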

Distribution according to s_avg (5 intervals), frstlang and sex, as percentages of all students.

c2<-cut(s_avg,5)
mytable<-xtabs(~frstlang+sex+c2)
prop.table(ftable(mytable))*100
##              c2    (2,2.4]  (2.4,2.8]  (2.8,3.2]  (3.2,3.6]    (3.6,4]
## frstlang sex                                                          
## 1        1       5.8394161 16.7883212 24.0875912 16.7883212  2.9197080
##          2       0.3649635  4.7445255  6.2043796  9.8540146  0.7299270
## 2        1       0.7299270  2.9197080  3.6496350  1.4598540  0.0000000
##          2       0.0000000  1.0948905  1.8248175  0.0000000  0.0000000

Distribution of placed students according to GMAT total (5 intervals), sex and frstlang

c3<-cut(placed$gmat_tot,5)
mytable2<-xtabs(~frstlang+sex+c3,data=placed)
ftable(mytable2)
##              c3 (500,544] (544,588] (588,632] (632,676] (676,720]
## frstlang sex                                                     
## 1        1              3        18        27        11         9
##          2              2         6         8         8         4
## 2        1              1         0         1         1         1
##          2              1         2         0         0         0

Table of satisfaction ratings (satis) by gender for placed students.

mytable3<-xtabs(~sex+satis,data=placed)
mytable3
##    satis
## sex  3  4  5  6  7
##   1  0  1 17 40 14
##   2  1  0 12 10  8

Table of satisfaction ratings (satis) by gender and quarter for placed students.

mytable4<-xtabs(~sex+quarter+satis,data=placed)
ftable(mytable4)
##             satis  3  4  5  6  7
## sex quarter                     
## 1   1              0  1  7 10  5
##     2              0  0  4 13  2
##     3              0  0  4  9  4
##     4              0  0  2  8  3
## 2   1              1  0  6  3  2
##     2              0  0  4  1  1
##     3              0  0  1  3  3
##     4              0  0  1  3  2

Corrplot

library(corrplot)
corrplot.mixed(corr=cor(placed[,c(3:11)],use="complete.obs"),upper="circle",tl.pos="lt")

Barcharts to explain how gender is divided amongst other parameters.

library(lattice)
barchart(work_yrs~age,data=placed,groups=sex,auto.key=TRUE,
         par.settings=simpleTheme(col=c("lightpink","blue")))

barchart(frstlang~gmat_tot,data=placed,groups=sex,xlab="Total GMAT score",
         ylab="First Language",
         main="Distribution of total GMAT score wrt first language and Gender",
         auto.key=TRUE,par.settings=simpleTheme(col=c("black","red")))

barchart(quarter~gmat_tot,data=placed,groups=sex,xlab="Total GMAT score",
         ylab="Quarter",
         main="Distribution of total GMAT score wrt quarter and Gender",
         auto.key=TRUE,par.settings=simpleTheme(col=c("pink","red")))

Pearson’s Chi-Squared Tests to study different relations.

mytable5<-xtabs(~quarter+salary,data=placed)
chisq.test(mytable5)
## Warning in chisq.test(mytable5): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable5
## X-squared = 129.85, df = 123, p-value = 0.3186
mytable6<-xtabs(~age+salary,data=placed)
chisq.test(mytable6)
## Warning in chisq.test(mytable6): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable6
## X-squared = 717.62, df = 574, p-value = 3.929e-05
mytable7<-xtabs(~gmat_tot+salary,data=placed)
chisq.test(mytable7)
## Warning in chisq.test(mytable7): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable7
## X-squared = 927.24, df = 820, p-value = 0.005279
mytable8<-xtabs(~I((s_avg+f_avg)/2)+salary,data=placed)
chisq.test(mytable8)
## Warning in chisq.test(mytable8): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable8
## X-squared = 2138.9, df = 2132, p-value = 0.454
mytable9<-xtabs(~sex+salary,data=placed)
chisq.test(mytable9)
## Warning in chisq.test(mytable9): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable9
## X-squared = 52.681, df = 41, p-value = 0.1045
mytable10<-xtabs(~sex+satis,data=placed)
chisq.test(mytable10)
## Warning in chisq.test(mytable10): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable10
## X-squared = 7.3413, df = 4, p-value = 0.1189
mytable11<-xtabs(~sex+quarter,data=placed)
chisq.test(mytable11)
## 
##  Pearson's Chi-squared test
## 
## data:  mytable11
## X-squared = 0.76332, df = 3, p-value = 0.8582
mytable12<-xtabs(~quarter+work_yrs,data=placed)
chisq.test(mytable12)
## Warning in chisq.test(mytable12): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable12
## X-squared = 29.47, df = 33, p-value = 0.6436
mytable13<-xtabs(~sex+gmat_tpc,data=placed)
chisq.test(mytable13)
## Warning in chisq.test(mytable13): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable13
## X-squared = 29.541, df = 30, p-value = 0.4893
mytable14<-xtabs(~gmat_tot+quarter,data=placed)
chisq.test(mytable14)
## Warning in chisq.test(mytable14): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable14
## X-squared = 56.697, df = 60, p-value = 0.5972
mytable15<-xtabs(~s_avg+f_avg,data=placed)
chisq.test(mytable15)
## Warning in chisq.test(mytable15): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable15
## X-squared = 551.43, df = 294, p-value < 2.2e-16
mytable16<-xtabs(~work_yrs+satis,data=placed)
chisq.test(mytable16)
## Warning in chisq.test(mytable16): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable16
## X-squared = 131.13, df = 44, p-value = 1.35e-10
mytable17<-xtabs(~s_avg+quarter,data=placed)
chisq.test(mytable17)
## Warning in chisq.test(mytable17): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable17
## X-squared = 279.96, df = 63, p-value < 2.2e-16
mytable18<-xtabs(~f_avg+quarter,data=placed)
chisq.test(mytable18)
## Warning in chisq.test(mytable18): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable18
## X-squared = 86.722, df = 42, p-value = 6.017e-05
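
Most of the tests above carry the warning "Chi-squared approximation may be incorrect", which R issues when expected cell counts are small. Cross-tabulating near-continuous variables such as salary or gmat_tot produces very sparse tables, so binning them with cut() first may help. Another option, sketched below on the sex-by-satis counts, is a Monte Carlo p-value via the simulate.p.value argument of chisq.test(). The seed and the number of replicates B are arbitrary choices.

```r
# Counts copied from the sex-by-satis table for placed students shown earlier.
sex_satis <- matrix(c(0, 1, 17, 40, 14,
                      1, 0, 12, 10,  8),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(sex = c("1", "2"),
                                    satis = c("3", "4", "5", "6", "7")))

# Asymptotic test; R warns here because several expected counts are below 5.
asym <- suppressWarnings(chisq.test(sex_satis))

# Monte Carlo p-value: same statistic, but the reference distribution is
# simulated instead of relying on the chi-squared approximation.
set.seed(1)  # arbitrary seed, only for reproducibility
sim <- chisq.test(sex_satis, simulate.p.value = TRUE, B = 10000)

print(c(statistic = unname(asym$statistic),
        p_asymptotic = asym$p.value,
        p_simulated = sim$p.value))
```

For small tables like this one, fisher.test() is a further alternative that does not rely on the large-sample approximation.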

Boxplot to explain distribution of salaries

bwplot(placed$salary,xlab="Salaries",
       main="Boxplot of Salaries",horizontal=TRUE)

Linear Regression Models

fit1<-lm(salary~ age+satis+sex+frstlang, data=placed)
summary(fit1)
## 
## Call:
## lm(formula = salary ~ age + satis + sex + frstlang, data = placed)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -25463  -9177  -1636   5686  79645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  48730.8    17649.8   2.761  0.00688 ** 
## age           2452.8      508.1   4.827  5.1e-06 ***
## satis        -2542.7     1972.2  -1.289  0.20034    
## sex          -4720.6     3393.2  -1.391  0.16732    
## frstlang      9105.5     6524.3   1.396  0.16598    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15430 on 98 degrees of freedom
## Multiple R-squared:  0.2835, Adjusted R-squared:  0.2543 
## F-statistic: 9.695 on 4 and 98 DF,  p-value: 1.197e-06
fit2<-lm(salary~age+satis+sex+frstlang+work_yrs, data=placed)
summary(fit2)
## 
## Call:
## lm(formula = salary ~ age + satis + sex + frstlang + work_yrs, 
##     data = placed)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -25863  -9753   -834   5571  78637 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  63735.6    26478.0   2.407    0.018 *
## age           1719.3     1089.4   1.578    0.118  
## satis        -2471.2     1978.7  -1.249    0.215  
## sex          -4999.4     3420.1  -1.462    0.147  
## frstlang     10459.5     6775.8   1.544    0.126  
## work_yrs       850.8     1117.2   0.762    0.448  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15460 on 97 degrees of freedom
## Multiple R-squared:  0.2878, Adjusted R-squared:  0.2511 
## F-statistic: 7.839 on 5 and 97 DF,  p-value: 3.121e-06
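
In fit2 the standard error of age roughly doubles compared with fit1 (508.1 to 1089.4) and age loses significance once work_yrs enters the model. One possible explanation, not checked in this report, is that age and work_yrs are strongly correlated. The sketch below shows how such a check could be done in base R on simulated data. The data-generating assumptions are invented for illustration.

```r
set.seed(42)  # arbitrary seed for a reproducible synthetic example
n <- 100
work_yrs <- rpois(n, lambda = 4)
# Synthetic assumption: age is roughly 23 plus years worked, plus noise.
age <- 23 + work_yrs + rnorm(n, sd = 1)
salary <- 50000 + 2000 * age + rnorm(n, sd = 15000)
sim <- data.frame(salary, age, work_yrs)

# Variance inflation factor for age: 1 / (1 - R^2) from regressing age
# on the other predictor(s).
r2_age  <- summary(lm(age ~ work_yrs, data = sim))$r.squared
vif_age <- 1 / (1 - r2_age)

# Standard error of the age coefficient with and without work_yrs.
se_alone <- coef(summary(lm(salary ~ age, data = sim)))["age", "Std. Error"]
se_both  <- coef(summary(lm(salary ~ age + work_yrs, data = sim)))["age", "Std. Error"]

print(c(cor = cor(sim$age, sim$work_yrs), vif_age = vif_age,
        se_alone = se_alone, se_both = se_both))
```

On the real data, cor(placed$age, placed$work_yrs) would be the quickest first check.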
fit3<-lm(salary~ age+satis+sex+frstlang+quarter+gmat_tot, data=placed)
summary(fit3)
## 
## Call:
## lm(formula = salary ~ age + satis + sex + frstlang + quarter + 
##     gmat_tot, data = placed)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24883  -9077  -2370   5880  80864 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 60431.42   26748.87   2.259   0.0261 *  
## age          2352.11     523.32   4.495 1.95e-05 ***
## satis       -2064.90    2051.36  -1.007   0.3167    
## sex         -4866.33    3417.08  -1.424   0.1577    
## frstlang     9633.98    6679.04   1.442   0.1524    
## quarter     -1215.85    1456.29  -0.835   0.4058    
## gmat_tot      -15.33      30.95  -0.495   0.6216    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15520 on 96 degrees of freedom
## Multiple R-squared:  0.2898, Adjusted R-squared:  0.2454 
## F-statistic:  6.53 on 6 and 96 DF,  p-value: 8.343e-06
fit4<-lm(salary~quarter+work_yrs+frstlang+age, data=placed)
summary(fit4)
## 
## Call:
## lm(formula = salary ~ quarter + work_yrs + frstlang + age, data = placed)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -29740  -9176  -1180   4180  78065 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  44849.8    23850.9   1.880   0.0630 .
## quarter      -1366.1     1407.8  -0.970   0.3342  
## work_yrs       750.4     1117.3   0.672   0.5034  
## frstlang      9610.8     6818.8   1.409   0.1619  
## age           1801.8     1080.2   1.668   0.0985 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15580 on 98 degrees of freedom
## Multiple R-squared:  0.2696, Adjusted R-squared:  0.2398 
## F-statistic: 9.045 on 4 and 98 DF,  p-value: 2.923e-06
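
The four models can also be compared side by side without reading each summary separately. The following sketch fits a list of formulas and collects adjusted R-squared, residual standard error and AIC into one table. It runs on simulated data, so the formulas and numbers are illustrative and are not a re-analysis of the MBA data.

```r
set.seed(7)  # arbitrary seed; the data below are simulated, not the MBA data
n <- 103
sim <- data.frame(
  age      = round(rnorm(n, mean = 27, sd = 3)),
  sex      = sample(1:2, n, replace = TRUE),
  work_yrs = rpois(n, lambda = 4)
)
sim$salary <- 40000 + 2400 * sim$age + rnorm(n, sd = 15000)

# Candidate formulas, kept in a named list so they can be fitted in one pass.
formulas <- list(
  m1 = salary ~ age + sex,
  m2 = salary ~ age + sex + work_yrs,
  m3 = salary ~ work_yrs
)
fits <- lapply(formulas, lm, data = sim)

comparison <- data.frame(
  adj_r2   = sapply(fits, function(m) summary(m)$adj.r.squared),
  resid_se = sapply(fits, function(m) summary(m)$sigma),
  aic      = sapply(fits, AIC)
)
print(comparison[order(-comparison$adj_r2), ])
```

Applied to the report's models, the list would hold the formulas of fit1 to fit4 with data = placed.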

Conclusions:

  1. The number of males whose first language is English (182) is about three times the number of females whose first language is English (60).

  2. The minimum salary of a placed MBA graduate is 64000 and maximum salary is 220000.

  3. 103 students in all were placed and disclosed their salary out of 274 students.

  4. 90 MBA graduates did not get a job and hence, were not placed.

  5. Most males and females fall in the youngest age interval, (22,27.2].

  6. Those students whose first language is English performed better in their GMAT entrance exam than those whose first language is not English.

  7. The most common satisfaction rating among placed males was 6 (40 of 72).

  8. The spring and fall MBA averages of a student are negatively correlated with his or her quarter, which means that the higher the MBA average, the better (lower-numbered) the quarter.

  9. Those students aged 40 years have a work experience of 11 years.

  10. With English as their first language, females have scored better than males on the GMAT exam.

  11. When English is not their first language, males have scored better than females on the GMAT exam.

  12. Students with a high GMAT score do not necessarily have a better quarter at their MBA college (p = 0.5972).

  13. The salary of a student is not significantly related to his or her quarter at the MBA college (p = 0.3186).

  14. Older students tend to receive higher starting salaries after graduating.

  15. GMAT total and salary are significantly associated according to Pearson’s Chi-Squared Test (p = 0.005279), although the test does not show the direction of the relationship and gmat_tot is not significant in the regression model fit3 (p = 0.6216).

  16. The average of the spring and fall MBA scores is not significantly associated with the salary of the student (p = 0.454).

  17. Students who have a good spring MBA average tend to have a good fall MBA average too (p < 2.2e-16).

  18. Those students who have a good fall and spring MBA score have a better quarter, as per Pearson’s Chi-Squared Test.

  19. The median salary is around 100000 and very few students have a salary over 200000.

  20. Having English as a first language is a statistically significant factor for a better GMAT score.

  21. Of all the regression models analysed, the best fit is fit1 (salary~ age+satis+sex+frstlang, data=placed): it has the highest adjusted R-squared (0.2543), the lowest overall p-value (1.197e-06) and the lowest residual standard error (15430).
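
The conclusions about first language and GMAT score could be backed by a direct two-group comparison, which the analysis above does not show. A minimal sketch of such a test follows, on simulated scores; the group sizes, means and spreads are invented, so the printed p-values say nothing about the real data.

```r
set.seed(3)  # arbitrary seed; the scores below are simulated, not the MBA data
gmat_demo <- data.frame(
  frstlang = factor(rep(c("English", "Other"), times = c(60, 15))),
  gmat_tot = c(round(rnorm(60, mean = 620, sd = 55)),
               round(rnorm(15, mean = 615, sd = 55)))
)

# Welch two-sample t-test: does mean GMAT total differ by first language?
tt <- t.test(gmat_tot ~ frstlang, data = gmat_demo)

# Rank-based alternative that does not assume normality.
wt <- wilcox.test(gmat_tot ~ frstlang, data = gmat_demo, exact = FALSE)

print(c(t_p = tt$p.value, wilcox_p = wt$p.value))
```

With the real data, the same calls would take gmat_tot and a factor version of frstlang from mba.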