Setting Up

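# Load the data (the CSV is assumed to be in the working directory)
mba <- read.csv("MBA Starting Salaries Data.csv")
# mba1: drop rows where satisfaction is coded 998
mba1 <- mba[mba$satis != 998, ]
# mba2: keep only rows with an actual salary (drop codes 998, 999 and 0)
mba2 <- mba[which(mba$salary != 998 & mba$salary != 999 & mba$salary != 0), ]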
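
An alternative to keeping separate filtered copies is to recode the sentinel values as NA once and let R's missing-value handling do the rest. The sketch below uses a small made-up data frame (the name `toy` and its values are hypothetical, not taken from the real CSV); only the codes 0, 998 and 999 come from the filters above.

```r
# Toy data frame (hypothetical values, not taken from the real CSV)
toy <- data.frame(
  salary = c(0, 998, 999, 85000, 105000),
  satis  = c(6, 998, 5, 7, 6)
)

# Recode the sentinel codes as NA so summaries ignore them
clean <- toy
clean$salary[clean$salary %in% c(998, 999)] <- NA
clean$satis[clean$satis == 998] <- NA

# Rows with a real, positive salary (mirrors the mba2 filter)
placed <- clean[!is.na(clean$salary) & clean$salary > 0, ]

mean(clean$satis, na.rm = TRUE)  # mean over valid ratings only
nrow(placed)                     # number of rows with a real salary
```

With this approach, functions such as `mean()` and `summary()` need `na.rm = TRUE` or report the NA count explicitly, which makes the amount of missing data visible.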

Summary

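Most of the analysis below excludes candidates whose salary was not reported (coded 998 or 999) or is recorded as 0. The raw summaries in this section still include those codes, which is why the median salary appears as 999 and the mean satisfaction as 172.2.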

summary(mba)
##       age             sex           gmat_tot        gmat_qpc    
##  Min.   :22.00   Min.   :1.000   Min.   :450.0   Min.   :28.00  
##  1st Qu.:25.00   1st Qu.:1.000   1st Qu.:580.0   1st Qu.:72.00  
##  Median :27.00   Median :1.000   Median :620.0   Median :83.00  
##  Mean   :27.36   Mean   :1.248   Mean   :619.5   Mean   :80.64  
##  3rd Qu.:29.00   3rd Qu.:1.000   3rd Qu.:660.0   3rd Qu.:93.00  
##  Max.   :48.00   Max.   :2.000   Max.   :790.0   Max.   :99.00  
##     gmat_vpc        gmat_tpc        s_avg           f_avg      
##  Min.   :16.00   Min.   : 0.0   Min.   :2.000   Min.   :0.000  
##  1st Qu.:71.00   1st Qu.:78.0   1st Qu.:2.708   1st Qu.:2.750  
##  Median :81.00   Median :87.0   Median :3.000   Median :3.000  
##  Mean   :78.32   Mean   :84.2   Mean   :3.025   Mean   :3.062  
##  3rd Qu.:91.00   3rd Qu.:94.0   3rd Qu.:3.300   3rd Qu.:3.250  
##  Max.   :99.00   Max.   :99.0   Max.   :4.000   Max.   :4.000  
##     quarter         work_yrs         frstlang         salary      
##  Min.   :1.000   Min.   : 0.000   Min.   :1.000   Min.   :     0  
##  1st Qu.:1.250   1st Qu.: 2.000   1st Qu.:1.000   1st Qu.:     0  
##  Median :2.000   Median : 3.000   Median :1.000   Median :   999  
##  Mean   :2.478   Mean   : 3.872   Mean   :1.117   Mean   : 39026  
##  3rd Qu.:3.000   3rd Qu.: 4.000   3rd Qu.:1.000   3rd Qu.: 97000  
##  Max.   :4.000   Max.   :22.000   Max.   :2.000   Max.   :220000  
##      satis      
##  Min.   :  1.0  
##  1st Qu.:  5.0  
##  Median :  6.0  
##  Mean   :172.2  
##  3rd Qu.:  7.0  
##  Max.   :998.0
library(psych)
describe(mba)
##          vars   n     mean       sd median  trimmed     mad min    max
## age         1 274    27.36     3.71     27    26.76    2.97  22     48
## sex         2 274     1.25     0.43      1     1.19    0.00   1      2
## gmat_tot    3 274   619.45    57.54    620   618.86   59.30 450    790
## gmat_qpc    4 274    80.64    14.87     83    82.31   14.83  28     99
## gmat_vpc    5 274    78.32    16.86     81    80.33   14.83  16     99
## gmat_tpc    6 274    84.20    14.02     87    86.12   11.86   0     99
## s_avg       7 274     3.03     0.38      3     3.03    0.44   2      4
## f_avg       8 274     3.06     0.53      3     3.09    0.37   0      4
## quarter     9 274     2.48     1.11      2     2.47    1.48   1      4
## work_yrs   10 274     3.87     3.23      3     3.29    1.48   0     22
## frstlang   11 274     1.12     0.32      1     1.02    0.00   1      2
## salary     12 274 39025.69 50951.56    999 33607.86 1481.12   0 220000
## satis      13 274   172.18   371.61      6    91.50    1.48   1    998
##           range  skew kurtosis      se
## age          26  2.16     6.45    0.22
## sex           1  1.16    -0.66    0.03
## gmat_tot    340 -0.01     0.06    3.48
## gmat_qpc     71 -0.92     0.30    0.90
## gmat_vpc     83 -1.04     0.74    1.02
## gmat_tpc     99 -2.28     9.02    0.85
## s_avg         2 -0.06    -0.38    0.02
## f_avg         4 -2.08    10.85    0.03
## quarter       3  0.02    -1.35    0.07
## work_yrs     22  2.78     9.80    0.20
## frstlang      1  2.37     3.65    0.02
## salary   220000  0.70    -1.05 3078.10
## satis       997  1.77     1.13   22.45

Distribution Plots

hist(mba$salary,
     breaks=5,
     col="gray",
     xlab="Salary",
     main="")

Comparing Salaries and Gender

library(lattice)
boxplot(salary~sex, data=mba2,ylab="1 = Male; 2 = Female",xlab="Salary", horizontal=TRUE )

A t-test on salary by gender checks whether the mean salaries of males and females differ significantly.

H0: There is no difference in the mean salaries.

t.test(salary~sex,data=mba2)
## 
##  Welch Two Sample t-test
## 
## data:  salary by sex
## t = 1.3628, df = 38.115, p-value = 0.1809
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3128.55 16021.72
## sample estimates:
## mean in group 1 mean in group 2 
##       104970.97        98524.39

p-value = 0.1809 > 0.05

Since the p-value exceeds 0.05, a difference of this size could plausibly arise by chance, so we fail to reject the null hypothesis.
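
To make the Welch test less of a black box, the sketch below recomputes the statistic, the degrees of freedom and the p-value by hand and compares them with `t.test`. The two groups are synthetic (generated with `rnorm`, with arbitrary means and spreads), so the numbers are illustrative only and are not the salary data.

```r
set.seed(1)  # synthetic salaries, purely illustrative
g1 <- rnorm(70, mean = 105000, sd = 18000)
g2 <- rnorm(30, mean = 98000,  sd = 16000)

# Welch statistic: difference in means over its standard error
v1 <- var(g1) / length(g1)
v2 <- var(g2) / length(g2)
t_manual <- (mean(g1) - mean(g2)) / sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom
df_manual <- (v1 + v2)^2 /
  (v1^2 / (length(g1) - 1) + v2^2 / (length(g2) - 1))

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

res <- t.test(g1, g2)  # Welch is the default (var.equal = FALSE)
c(manual = t_manual, builtin = unname(res$statistic))
```

The fractional degrees of freedom in the output above (38.115) come from this Welch-Satterthwaite approximation, which does not assume equal variances in the two groups.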

Comparing Salaries and Age

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplot(jitter(salary)~jitter(age), data=mba2,
     main="Salary Vs. Age",
     xlab="Age",ylab="Salary")

To check the correlation between salary and age:

cor.test(mba2$salary,mba2$age)
## 
##  Pearson's product-moment correlation
## 
## data:  mba2$salary and mba2$age
## t = 5.7968, df = 101, p-value = 7.748e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3388862 0.6320523
## sample estimates:
##       cor 
## 0.4996428

cor = 0.499

p-value is very small 7.748e-08 < 0.01

Hence, we reject the null hypothesis that age and salary are not correlated.
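
As a worked check, the t statistic and the confidence interval reported by `cor.test` can be reproduced from the correlation and the degrees of freedom alone. The sketch below uses only the two numbers printed above; the variable names are chosen for illustration.

```r
r  <- 0.4996428   # sample correlation reported above
df <- 101         # degrees of freedom reported above (n - 2)
n  <- df + 2

# t statistic for H0: correlation = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 4)  # compare with t = 5.7968 above
round(ci, 4)      # compare with the interval 0.3389 to 0.6321 above
```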

Comparing Salaries and Language

1 = English Speaking

2 = Non-English Speaking

boxplot(salary~frstlang,data=mba2,xlab="1 : English Speaking   2 : Non-English Speaking",ylab="Salary",main="Salary & language")

A t-test on salary by first language checks whether the mean salaries of the two groups differ significantly.

H0: There is no difference in the mean salaries.

t.test(salary~frstlang,data=mba2)
## 
##  Welch Two Sample t-test
## 
## data:  salary by frstlang
## t = -1.1202, df = 6.0863, p-value = 0.3049
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -59933.62  22202.25
## sample estimates:
## mean in group 1 mean in group 2 
##        101748.6        120614.3

p-value = 0.3049 > 0.05

Since the p-value exceeds 0.05, we fail to reject the null hypothesis that the mean salaries of native English speakers and non-native English speakers are equal.

Comparing Salaries and Work Experience

scatterplot(salary~work_yrs, data=mba2,xlab="Work experience (years) of those who got jobs", ylab="Salary", main="Salary vs Work Experience")

Comparing Salaries and Spring Average

scatterplot(salary~s_avg, data=mba2,xlab="Spring Average", ylab="Salary", main="Salary vs Spring Average")

To check the correlation between salary and spring average:

cor.test(mba2$salary,mba2$s_avg)
## 
##  Pearson's product-moment correlation
## 
## data:  mba2$salary and mba2$s_avg
## t = 1.0277, df = 101, p-value = 0.3065
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.09363639  0.28955576
## sample estimates:
##       cor 
## 0.1017317

cor = 0.1017

p-value is very large 0.3065 > 0.01

Hence, we fail to reject the null hypothesis that salary and spring MBA average are not correlated.

Comparing Salaries and Fall Average

scatterplot(salary~f_avg, data=mba2,xlab="Fall Average", ylab="Salary", main="Salary vs Fall Average")

To check the correlation between salary and fall average:

cor.test(mba2$salary,mba2$f_avg)
## 
##  Pearson's product-moment correlation
## 
## data:  mba2$salary and mba2$f_avg
## t = -1.0717, df = 101, p-value = 0.2864
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.29353985  0.08931862
## sample estimates:
##       cor 
## -0.106039

cor = -0.1060

p-value is very large 0.2864 > 0.01

Hence, we fail to reject the null hypothesis that salary and fall MBA average are not correlated.

Corrgram of those people who got a job

library(corrgram)
corrgram(mba2, order=FALSE, lower.panel=panel.cor,
         upper.panel=panel.pie, text.panel=panel.txt,
         main="MBA")

Distribution of Satisfaction

table(mba$satis)
## 
##   1   2   3   4   5   6   7 998 
##   1   1   5  17  74  97  33  46
hist(mba1$satis,
     breaks=7,
     col="gray",
     xlim = c(0,7),
     xlab="Satisfaction rating",
     main="Distribution of Satisfaction Rating")

Most respondents (171 of the 228 who gave a rating) chose 5 or 6 on the satisfaction scale.
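
The share of each rating can be computed directly from the counts printed by `table(mba$satis)`. This is a small sketch that retypes those counts by hand (leaving out the 998 code) rather than reading the CSV.

```r
# Counts of ratings 1-7 as printed by table(mba$satis), excluding code 998
counts <- c("1" = 1, "2" = 1, "3" = 5, "4" = 17, "5" = 74, "6" = 97, "7" = 33)

share <- counts / sum(counts)   # proportion of respondents per rating
round(share, 3)
sum(share[c("5", "6")])         # combined share of ratings 5 and 6: 0.75
```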

Satisfaction vs Gender

sat1 <- xtabs(~sex+satis,data=mba1)
sat1
##    satis
## sex  1  2  3  4  5  6  7
##   1  1  0  4 12 53 79 20
##   2  0  1  1  5 21 18 13
chisq.test(sat1)
## Warning in chisq.test(sat1): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  sat1
## X-squared = 9.5091, df = 6, p-value = 0.1469

p-value for this test is large 0.1469 > 0.05

Hence, we fail to reject the null hypothesis that satisfaction and gender are independent.
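
The warning "Chi-squared approximation may be incorrect" is raised when some expected cell counts are below 5. One hedged way to cross-check the result is a Monte Carlo p-value, which `chisq.test` supports through `simulate.p.value`. The sketch below retypes the sex-by-satisfaction counts from the table above; the seed and the number of replicates `B` are arbitrary choices.

```r
# Sex-by-satisfaction counts copied from the table printed above
sat <- matrix(c(1, 0, 4, 12, 53, 79, 20,
                0, 1, 1,  5, 21, 18, 13),
              nrow = 2, byrow = TRUE,
              dimnames = list(sex = c("1", "2"), satis = as.character(1:7)))

# Asymptotic test; the warning arises because some expected counts are small
asym <- suppressWarnings(chisq.test(sat))
sum(asym$expected < 5)   # number of cells with expected count below 5

# Monte Carlo p-value, which does not rely on the chi-squared approximation
set.seed(42)
sim <- chisq.test(sat, simulate.p.value = TRUE, B = 10000)
sim$p.value
```

If the simulated p-value leads to the same decision as the asymptotic one, the warning is less of a concern for the conclusion.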

Satisfaction vs Language

boxplot(satis~frstlang,data=mba1,xlab="1 : English Speaking   2 : Non-English Speaking",ylab="Satisfaction Level",main="Satisfaction & language")

sat22 <- xtabs(~frstlang+satis,data=mba)
sat22
##         satis
## frstlang  1  2  3  4  5  6  7 998
##        1  1  0  3 10 67 92 31  38
##        2  0  1  2  7  7  5  2   8
chisq.test(sat22)
## Warning in chisq.test(sat22): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  sat22
## X-squared = 32.744, df = 7, p-value = 2.954e-05

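p-value is very small 2.954e-05 < 0.01

Hence, we can reject the null hypothesis and say that first language and student satisfaction are related and not independent of each other. Note that this table was built from mba rather than mba1, so it includes the 998 code as an eighth category and the result partly reflects differences in who gave a rating at all.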
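
Because the table above includes the 998 column, one option is to rerun the test on ratings 1 to 7 only. The sketch below retypes the counts from the printed table with that column dropped and uses a simulated p-value, since several cells are sparse; the seed and `B` are arbitrary, and no particular outcome is assumed here.

```r
# First-language-by-satisfaction counts from the table above, 998 column dropped
lang_sat <- matrix(c(1, 0, 3, 10, 67, 92, 31,
                     0, 1, 2,  7,  7,  5,  2),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(frstlang = c("1", "2"),
                                   satis = as.character(1:7)))

# Several expected counts are small, so use a simulated p-value
set.seed(42)
res <- chisq.test(lang_sat, simulate.p.value = TRUE, B = 10000)
res$statistic
res$p.value
```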

A t-test on satisfaction by first language checks whether the mean ratings of the two groups differ significantly.

H0: There is no difference in satisfaction.

t.test(satis~frstlang,data=mba1)
## 
##  Welch Two Sample t-test
## 
## data:  satis by frstlang
## t = 3.2902, df = 25.906, p-value = 0.002887
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3245711 1.4058211
## sample estimates:
## mean in group 1 mean in group 2 
##        5.656863        4.791667

p-value is very small, 0.002887 < 0.01

We reject the null hypothesis that native and non-native English speakers have the same mean satisfaction. In this sample, native English speakers report a higher mean rating (5.66 versus 4.79).

Satisfaction vs Age

scatterplot(jitter(age)~jitter(satis),data=mba1,xlab = "Satisfaction Rating",ylab = "Age",main="Age vs Satisfaction")

To check the correlation between satisfaction and age:

cor.test(mba1$age,mba1$satis)
## 
##  Pearson's product-moment correlation
## 
## data:  mba1$age and mba1$satis
## t = -0.98062, df = 226, p-value = 0.3278
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1933820  0.0653869
## sample estimates:
##         cor 
## -0.06509178

p-value is very large 0.3278 > 0.05

Hence, we fail to reject the null hypothesis that satisfaction rating and age are not correlated.

GMAT

scatterplot(jitter(s_avg)~jitter(gmat_tot), data=mba,xlab="GMAT Total", ylab="Spring Average", main="Effect of GMAT Total on Spring Average")

scatterplot(jitter(f_avg)~jitter(gmat_tot), data=mba,xlab="GMAT Total", ylab="Fall Average", main="Effect of GMAT Total on Fall average")

scatterplot(jitter(quarter)~jitter(gmat_tot), data=mba,xlab="GMAT Total", ylab="Quartile Ranking", main="Effect of GMAT Total on Quartile Ranking")

To check the correlation between quarter and GMAT total:

cor.test(mba$quarter,mba$gmat_tot)
## 
##  Pearson's product-moment correlation
## 
## data:  mba$quarter and mba$gmat_tot
## t = -1.5278, df = 272, p-value = 0.1277
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.20846043  0.02655113
## sample estimates:
##         cor 
## -0.09223903

p-value is high 0.1277 > 0.05

Hence, we fail to reject the null hypothesis that quarter and GMAT total are not correlated.

library(corrgram)
corrgram(mba, order=FALSE, lower.panel=panel.cor,
         upper.panel=panel.pie, text.panel=panel.txt,
         main="MBA")

Regression Models

Model 1:

Linear Model for age, GMAT performance and work experience

fit1 <- lm( salary~age+gmat_tpc+work_yrs,data=mba2)
summary(fit1)
## 
## Call:
## lm(formula = salary ~ age + gmat_tpc + work_yrs, data = mba2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33547  -7760  -1788   4647  76796 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  47565.6    25902.6   1.836   0.0693 .
## age           2455.0      999.0   2.458   0.0157 *
## gmat_tpc      -133.9      142.0  -0.943   0.3480  
## work_yrs       284.6     1090.2   0.261   0.7946  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15630 on 99 degrees of freedom
## Multiple R-squared:  0.2573, Adjusted R-squared:  0.2348 
## F-statistic: 11.43 on 3 and 99 DF,  p-value: 1.683e-06

Model 2:

Linear Model for age, gender, GMAT performance, language and quartile ranking

fit2<-lm(salary ~ age+sex+gmat_tot+gmat_tpc+gmat_qpc+gmat_vpc+frstlang+quarter, data=mba2)
summary(fit2)
## 
## Call:
## lm(formula = salary ~ age + sex + gmat_tot + gmat_tpc + gmat_qpc + 
##     gmat_vpc + frstlang + quarter, data = mba2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -25167  -7550  -1109   5163  71055 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 50739.508  44276.926   1.146   0.2547    
## age          2458.718    525.702   4.677 9.73e-06 ***
## sex         -3456.252   3458.815  -0.999   0.3202    
## gmat_tot        6.807    159.950   0.043   0.9661    
## gmat_tpc    -1429.300    693.662  -2.061   0.0421 *  
## gmat_qpc      796.902    474.955   1.678   0.0967 .  
## gmat_vpc      533.123    473.264   1.126   0.2628    
## frstlang     5868.789   6789.285   0.864   0.3896    
## quarter     -1820.645   1392.015  -1.308   0.1941    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15200 on 94 degrees of freedom
## Multiple R-squared:  0.3333, Adjusted R-squared:  0.2766 
## F-statistic: 5.875 on 8 and 94 DF,  p-value: 4.368e-06

p-value of age < 0.05

p-value of overall GMAT percentile (gmat_tpc) < 0.05

Hence, age and overall GMAT percentile are the statistically significant predictors of salary in this model.
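
Rather than reading p-values off the printed summary, they can be pulled out of the fitted model programmatically. The sketch below does this on synthetic data (the variables `x1`, `x2` and `y` are made up for illustration, with `x1` constructed to have a real effect); the same pattern would apply to a fitted object such as fit2.

```r
set.seed(7)  # synthetic data; x1 truly affects y, x2 does not
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 5 + 2 * x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# Coefficient table: estimate, standard error, t value, p-value
coefs <- coef(summary(fit))
pvals <- coefs[, "Pr(>|t|)"]

# Names of predictors significant at the 5% level (intercept excluded)
sig <- setdiff(names(pvals)[pvals < 0.05], "(Intercept)")
sig
```

A significant coefficient indicates an association conditional on the other predictors in the model, not necessarily a causal effect.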

Model 3:

Linear Model for spring average, fall average, work experience and satisfaction

fit3<- lm(salary~s_avg+f_avg+work_yrs+satis, data=mba2)
summary(fit3)
## 
## Call:
## lm(formula = salary ~ s_avg + f_avg + work_yrs + satis, data = mba2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33329  -7748   -853   3885  87689 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 101048.7    20095.5   5.028 2.23e-06 ***
## s_avg         1588.0     4987.7   0.318    0.751    
## f_avg        -1186.1     3885.5  -0.305    0.761    
## work_yrs      2649.6      572.3   4.630 1.12e-05 ***
## satis        -1531.7     2075.3  -0.738    0.462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16180 on 98 degrees of freedom
## Multiple R-squared:  0.2125, Adjusted R-squared:  0.1804 
## F-statistic: 6.611 on 4 and 98 DF,  p-value: 9.407e-05

p-value of work experience < 0.05.

Hence, work experience is the only statistically significant predictor of salary in this model.

Comparing those who got a job and those who didn’t

We compare the first 90 rows of those who got a job with the 90 rows of those who did not get a job.

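# Placed students: salary above the sentinel codes
job1 <- subset(mba, salary > 999)
# Unplaced students: salary recorded as 0
job0 <- subset(mba, salary == 0)
# Keep the first 90 placed students so both groups have equal size
job1 <- job1[1:90, ]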
chisq.test(job1$sex,job0$sex)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  job1$sex and job0$sex
## X-squared = 0.11711, df = 1, p-value = 0.7322
chisq.test(job1$frstlang,job0$frstlang)
## Warning in chisq.test(job1$frstlang, job0$frstlang): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  job1$frstlang and job0$frstlang
## X-squared = 0.0080703, df = 1, p-value = 0.9284
chisq.test(job1$quarter,job0$quarter)
## Warning in chisq.test(job1$quarter, job0$quarter): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  job1$quarter and job0$quarter
## X-squared = 110.98, df = 9, p-value < 2.2e-16

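As both p-values exceed 0.05, these tests give no evidence that placement depended on sex or first language.

For quartile ranking, the p-value is below 2.2e-16, which this analysis reads as placed students having performed better academically than those who did not get a job. Note, however, that chisq.test(x, y) cross-tabulates the two vectors element by element, so each test above pairs the i-th placed student with the i-th unplaced student rather than comparing the two groups directly. These conclusions should therefore be treated as tentative.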
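
A more direct check of whether placement is related to a variable is to cross-tabulate placement status against it within a single data frame. The sketch below shows the idea on synthetic students (the data frame `toy` and its values are made up); with the real data, rows coded 998 or 999 would need to be excluded first.

```r
set.seed(3)  # synthetic students, purely illustrative
toy <- data.frame(
  salary = sample(c(0, 90000, 110000), 200, replace = TRUE),
  sex    = sample(1:2, 200, replace = TRUE)
)

# Placement indicator: TRUE when a positive salary is recorded
toy$placed <- toy$salary > 0

# 2 x 2 contingency table of placement status against sex
tab <- table(placed = toy$placed, sex = toy$sex)
tab

# Test of independence between placement and sex
res <- chisq.test(tab)
res$p.value
```

This design uses every student once and does not require the two groups to have the same number of rows.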
