# Read the raw data
mba <- read.csv("MBA Starting Salaries Data.csv")
# mba1: drop respondents who did not answer the satisfaction question (coded 998)
mba1 <- mba[mba$satis != 998, ]
# mba2: keep only respondents who reported a salary (drop the codes 998 and 999 and the zeros)
mba2 <- mba[which(mba$salary != 998 & mba$salary != 999 & mba$salary != 0), ]
Most of the analysis below excludes candidates who did not disclose their salary (coded 998 or 999) or who were recorded with a salary of 0.
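As a quick sanity check (a minimal sketch, assuming the column names used above), we can confirm the row counts and that no placeholder salary codes remain in mba2:
# Row counts before and after filtering
nrow(mba); nrow(mba1); nrow(mba2)
# Should be FALSE: no 0/998/999 placeholder codes left in the cleaned salary data
any(mba2$salary %in% c(0, 998, 999))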
summary(mba)
## age sex gmat_tot gmat_qpc
## Min. :22.00 Min. :1.000 Min. :450.0 Min. :28.00
## 1st Qu.:25.00 1st Qu.:1.000 1st Qu.:580.0 1st Qu.:72.00
## Median :27.00 Median :1.000 Median :620.0 Median :83.00
## Mean :27.36 Mean :1.248 Mean :619.5 Mean :80.64
## 3rd Qu.:29.00 3rd Qu.:1.000 3rd Qu.:660.0 3rd Qu.:93.00
## Max. :48.00 Max. :2.000 Max. :790.0 Max. :99.00
## gmat_vpc gmat_tpc s_avg f_avg
## Min. :16.00 Min. : 0.0 Min. :2.000 Min. :0.000
## 1st Qu.:71.00 1st Qu.:78.0 1st Qu.:2.708 1st Qu.:2.750
## Median :81.00 Median :87.0 Median :3.000 Median :3.000
## Mean :78.32 Mean :84.2 Mean :3.025 Mean :3.062
## 3rd Qu.:91.00 3rd Qu.:94.0 3rd Qu.:3.300 3rd Qu.:3.250
## Max. :99.00 Max. :99.0 Max. :4.000 Max. :4.000
## quarter work_yrs frstlang salary
## Min. :1.000 Min. : 0.000 Min. :1.000 Min. : 0
## 1st Qu.:1.250 1st Qu.: 2.000 1st Qu.:1.000 1st Qu.: 0
## Median :2.000 Median : 3.000 Median :1.000 Median : 999
## Mean :2.478 Mean : 3.872 Mean :1.117 Mean : 39026
## 3rd Qu.:3.000 3rd Qu.: 4.000 3rd Qu.:1.000 3rd Qu.: 97000
## Max. :4.000 Max. :22.000 Max. :2.000 Max. :220000
## satis
## Min. : 1.0
## 1st Qu.: 5.0
## Median : 6.0
## Mean :172.2
## 3rd Qu.: 7.0
## Max. :998.0
library(psych)
describe(mba)
## vars n mean sd median trimmed mad min max
## age 1 274 27.36 3.71 27 26.76 2.97 22 48
## sex 2 274 1.25 0.43 1 1.19 0.00 1 2
## gmat_tot 3 274 619.45 57.54 620 618.86 59.30 450 790
## gmat_qpc 4 274 80.64 14.87 83 82.31 14.83 28 99
## gmat_vpc 5 274 78.32 16.86 81 80.33 14.83 16 99
## gmat_tpc 6 274 84.20 14.02 87 86.12 11.86 0 99
## s_avg 7 274 3.03 0.38 3 3.03 0.44 2 4
## f_avg 8 274 3.06 0.53 3 3.09 0.37 0 4
## quarter 9 274 2.48 1.11 2 2.47 1.48 1 4
## work_yrs 10 274 3.87 3.23 3 3.29 1.48 0 22
## frstlang 11 274 1.12 0.32 1 1.02 0.00 1 2
## salary 12 274 39025.69 50951.56 999 33607.86 1481.12 0 220000
## satis 13 274 172.18 371.61 6 91.50 1.48 1 998
## range skew kurtosis se
## age 26 2.16 6.45 0.22
## sex 1 1.16 -0.66 0.03
## gmat_tot 340 -0.01 0.06 3.48
## gmat_qpc 71 -0.92 0.30 0.90
## gmat_vpc 83 -1.04 0.74 1.02
## gmat_tpc 99 -2.28 9.02 0.85
## s_avg 2 -0.06 -0.38 0.02
## f_avg 4 -2.08 10.85 0.03
## quarter 3 0.02 -1.35 0.07
## work_yrs 22 2.78 9.80 0.20
## frstlang 1 2.37 3.65 0.02
## salary 220000 0.70 -1.05 3078.10
## satis 997 1.77 1.13 22.45
hist(mba$salary,
breaks=5,
col="gray",
xlab="Salary",
main="")
Comparing Salaries and Gender
library(lattice)
boxplot(salary~sex, data=mba2,ylab="1 = Male; 2 = Female",xlab="Salary", horizontal=TRUE )
We run a t-test of salary by gender to check whether the mean salaries of males and females differ significantly.
H0: There is no difference in the mean salaries.
t.test(salary~sex,data=mba2)
##
## Welch Two Sample t-test
##
## data: salary by sex
## t = 1.3628, df = 38.115, p-value = 0.1809
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3128.55 16021.72
## sample estimates:
## mean in group 1 mean in group 2
## 104970.97 98524.39
p-value = 0.1809 > 0.05
A difference this large could easily occur by chance, so we fail to reject the null hypothesis of equal mean salaries.
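Salary distributions are typically right-skewed, so a rank-based test is a common robustness check for this comparison. A minimal sketch (results not run in this report):
# Wilcoxon rank-sum test: distribution-free comparison of salaries by gender
wilcox.test(salary ~ sex, data = mba2)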
Comparing Salaries and Age
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplot(jitter(salary)~jitter(age), data=mba2,
main="Salary Vs. Age",
xlab="Age",ylab="Salary")
To check the correlation between salary and age:
cor.test(mba2$salary,mba2$age)
##
## Pearson's product-moment correlation
##
## data: mba2$salary and mba2$age
## t = 5.7968, df = 101, p-value = 7.748e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3388862 0.6320523
## sample estimates:
## cor
## 0.4996428
cor = 0.4996
The p-value (7.748e-08) is well below 0.01.
Hence, we reject the null hypothesis that age and salary are uncorrelated.
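To express this relationship on the salary scale, a simple regression of salary on age can be fit; the slope estimates the change in salary per additional year of age. A minimal sketch (the object name fit_age is introduced here only for illustration):
# Simple regression of salary on age; the age coefficient is the estimated change per year
fit_age <- lm(salary ~ age, data = mba2)
summary(fit_age)$coefficients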
Comparing Salaries and Language
1 = English Speaking
2 = Non-English Speaking
boxplot(salary~frstlang,data=mba2,xlab="1 : English Speaking 2 : Non-English Speaking",ylab="Salary",main="Salary & language")
We run a t-test of salary by first language to check whether the mean salaries of native English speakers and non-native speakers differ significantly.
H0: There is no difference in the mean salaries.
t.test(salary~frstlang,data=mba2)
##
## Welch Two Sample t-test
##
## data: salary by frstlang
## t = -1.1202, df = 6.0863, p-value = 0.3049
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -59933.62 22202.25
## sample estimates:
## mean in group 1 mean in group 2
## 101748.6 120614.3
p-value = 0.3049 > 0.05
A difference this large could easily occur by chance, so we fail to reject the null hypothesis that the mean salaries of native English speakers and non-native English speakers are equal.
Comparing Salaries and Work Experience
scatterplot(salary~work_yrs, data=mba2,xlab="Work experience (years) of those who got jobs", ylab="Salary", main="Salary vs Work Experience")
Comparing Salaries and Spring Average
scatterplot(salary~s_avg, data=mba2,xlab="Spring Average", ylab="Salary", main="Salary vs Spring Average")
To check the correlation between salary and spring average:
cor.test(mba2$salary,mba2$s_avg)
##
## Pearson's product-moment correlation
##
## data: mba2$salary and mba2$s_avg
## t = 1.0277, df = 101, p-value = 0.3065
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.09363639 0.28955576
## sample estimates:
## cor
## 0.1017317
cor = 0.1017
The p-value is large (0.3065 > 0.05).
Hence, we fail to reject the null hypothesis that salary and spring MBA average are uncorrelated.
Comparing Salaries and Fall Average
scatterplot(salary~f_avg, data=mba2,xlab="Fall Average", ylab="Salary", main="Salary vs Fall Average")
To check the correlation between salary and fall average:
cor.test(mba2$salary,mba2$f_avg)
##
## Pearson's product-moment correlation
##
## data: mba2$salary and mba2$f_avg
## t = -1.0717, df = 101, p-value = 0.2864
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.29353985 0.08931862
## sample estimates:
## cor
## -0.106039
cor = -0.1060
The p-value is large (0.2864 > 0.05).
Hence, we fail to reject the null hypothesis that salary and fall MBA average are uncorrelated.
library(corrgram)
corrgram(mba2, order=FALSE, lower.panel=panel.cor,
upper.panel=panel.pie, text.panel=panel.txt,
main="MBA")
table(mba$satis)
##
## 1 2 3 4 5 6 7 998
## 1 1 5 17 74 97 33 46
hist(mba1$satis,
breaks=7,
col="gray",
xlim = c(0,7),
xlab="Satisfaction rating",
main="Distribution of Satisfaction Rating")
The majority of respondents rated their satisfaction 5 or 6.
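To quantify this, the proportions behind the histogram can be tabulated. A minimal sketch using the cleaned mba1 data:
# Share of respondents at each satisfaction level (non-responses already excluded)
round(prop.table(table(mba1$satis)), 2)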
Satisfaction vs Gender
sat1 <- xtabs(~sex+satis,data=mba1)
sat1
## satis
## sex 1 2 3 4 5 6 7
## 1 1 0 4 12 53 79 20
## 2 0 1 1 5 21 18 13
chisq.test(sat1)
## Warning in chisq.test(sat1): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: sat1
## X-squared = 9.5091, df = 6, p-value = 0.1469
The p-value for this test is large (0.1469 > 0.05).
Hence, we fail to reject the null hypothesis that satisfaction and gender are independent.
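The warning above indicates that some expected cell counts are small, so the chi-squared approximation may be unreliable. A common check is to recompute the p-value by Monte Carlo simulation. A minimal sketch (results not run in this report):
# Chi-squared test with a simulated p-value to avoid the small-count approximation
chisq.test(sat1, simulate.p.value = TRUE, B = 10000)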
Satisfaction vs Language
boxplot(satis~frstlang,data=mba1,xlab="1 : English Speaking 2 : Non-English Speaking",ylab="Satisfaction Level",main="Satisfaction & language")
sat22 <- xtabs(~frstlang+satis,data=mba)
sat22
## satis
## frstlang 1 2 3 4 5 6 7 998
## 1 1 0 3 10 67 92 31 38
## 2 0 1 2 7 7 5 2 8
chisq.test(sat22)
## Warning in chisq.test(sat22): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: sat22
## X-squared = 32.744, df = 7, p-value = 2.954e-05
p-value is very small (2.954e-05 < 0.01). Hence, we reject the null hypothesis of independence and conclude that first language and student satisfaction are related.
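Note that the table above keeps the 998 non-response code as its own column, which contributes to the chi-squared statistic. The test can be repeated on actual ratings only by building the table from mba1. A minimal sketch (the name sat2 is introduced here for illustration; results not run in this report):
# Cross-tabulate first language and satisfaction without the 998 non-response code
sat2 <- xtabs(~frstlang+satis, data=mba1)
chisq.test(sat2, simulate.p.value = TRUE, B = 10000)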
We run a t-test of satisfaction by first language to check whether mean satisfaction differs between the two groups.
H0: There is no difference in mean satisfaction.
t.test(satis~frstlang,data=mba1)
##
## Welch Two Sample t-test
##
## data: satis by frstlang
## t = 3.2902, df = 25.906, p-value = 0.002887
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3245711 1.4058211
## sample estimates:
## mean in group 1 mean in group 2
## 5.656863 4.791667
p-value is very small, 0.002887 < 0.01
We reject the null hypothesis that mean satisfaction is the same for native and non-native English speakers; the two groups differ significantly in mean satisfaction.
Satisfaction vs Age
scatterplot(jitter(age)~jitter(satis),data=mba1,xlab = "Satisfaction Rating",ylab = "Age",main="Age vs Satisfaction")
To check the correlation between satisfaction and age:
cor.test(mba1$age,mba1$satis)
##
## Pearson's product-moment correlation
##
## data: mba1$age and mba1$satis
## t = -0.98062, df = 226, p-value = 0.3278
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1933820 0.0653869
## sample estimates:
## cor
## -0.06509178
The p-value is large (0.3278 > 0.05).
Hence, we fail to reject the null hypothesis that satisfaction rating and age are uncorrelated.
scatterplot(jitter(s_avg)~jitter(gmat_tot), data=mba,xlab="GMAT Total", ylab="Spring Average", main="Effect of GMAT Total on Spring Average")
scatterplot(jitter(f_avg)~jitter(gmat_tot), data=mba,xlab="GMAT Total", ylab="Fall Average", main="Effect of GMAT Total on Fall average")
scatterplot(jitter(quarter)~jitter(gmat_tot), data=mba,xlab="GMAT Total", ylab="Quartile Ranking", main="Effect of GMAT Total on Quartile Ranking")
To check the correlation between quarter and GMAT total:
cor.test(mba$quarter,mba$gmat_tot)
##
## Pearson's product-moment correlation
##
## data: mba$quarter and mba$gmat_tot
## t = -1.5278, df = 272, p-value = 0.1277
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.20846043 0.02655113
## sample estimates:
## cor
## -0.09223903
The p-value is large (0.1277 > 0.05). Hence, we fail to reject the null hypothesis that quartile ranking and GMAT total are uncorrelated.
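Since quartile ranking is an ordinal variable, a rank-based (Spearman) correlation is a natural complementary check. A minimal sketch (results not run in this report):
# Spearman correlation treats quarter as ordinal rather than interval-scaled
cor.test(mba$quarter, mba$gmat_tot, method = "spearman")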
library(corrgram)
corrgram(mba, order=FALSE, lower.panel=panel.cor,
upper.panel=panel.pie, text.panel=panel.txt,
main="MBA")
Model 1:
Linear Model for age, GMAT performance and work experience
fit1 <- lm( salary~age+gmat_tpc+work_yrs,data=mba2)
summary(fit1)
##
## Call:
## lm(formula = salary ~ age + gmat_tpc + work_yrs, data = mba2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33547 -7760 -1788 4647 76796
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47565.6 25902.6 1.836 0.0693 .
## age 2455.0 999.0 2.458 0.0157 *
## gmat_tpc -133.9 142.0 -0.943 0.3480
## work_yrs 284.6 1090.2 0.261 0.7946
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15630 on 99 degrees of freedom
## Multiple R-squared: 0.2573, Adjusted R-squared: 0.2348
## F-statistic: 11.43 on 3 and 99 DF, p-value: 1.683e-06
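In Model 1 only age is significant (p = 0.0157). Before reading too much into the coefficients, the usual residual diagnostics can be examined. A minimal sketch (plots not shown in this report):
# Standard diagnostic plots: residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(2, 2))
plot(fit1)
par(mfrow = c(1, 1))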
Model 2:
Linear Model for age, gender, GMAT performance, language and quartile ranking
fit2<-lm(salary ~ age+sex+gmat_tot+gmat_tpc+gmat_qpc+gmat_vpc+frstlang+quarter, data=mba2)
summary(fit2)
##
## Call:
## lm(formula = salary ~ age + sex + gmat_tot + gmat_tpc + gmat_qpc +
## gmat_vpc + frstlang + quarter, data = mba2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25167 -7550 -1109 5163 71055
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50739.508 44276.926 1.146 0.2547
## age 2458.718 525.702 4.677 9.73e-06 ***
## sex -3456.252 3458.815 -0.999 0.3202
## gmat_tot 6.807 159.950 0.043 0.9661
## gmat_tpc -1429.300 693.662 -2.061 0.0421 *
## gmat_qpc 796.902 474.955 1.678 0.0967 .
## gmat_vpc 533.123 473.264 1.126 0.2628
## frstlang 5868.789 6789.285 0.864 0.3896
## quarter -1820.645 1392.015 -1.308 0.1941
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15200 on 94 degrees of freedom
## Multiple R-squared: 0.3333, Adjusted R-squared: 0.2766
## F-statistic: 5.875 on 8 and 94 DF, p-value: 4.368e-06
In Model 2, the p-values for age (9.73e-06) and overall GMAT percentile (gmat_tpc, 0.0421) are below 0.05.
Hence, salary is significantly associated with these two factors.
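Confidence intervals make the size of these effects easier to judge than the p-values alone. A minimal sketch (results not run in this report):
# 95% confidence intervals for the Model 2 coefficients
confint(fit2)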
Model 3:
Linear Model for spring average, fall average, work experience and satisfaction
fit3<- lm(salary~s_avg+f_avg+work_yrs+satis, data=mba2)
summary(fit3)
##
## Call:
## lm(formula = salary ~ s_avg + f_avg + work_yrs + satis, data = mba2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33329 -7748 -853 3885 87689
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101048.7 20095.5 5.028 2.23e-06 ***
## s_avg 1588.0 4987.7 0.318 0.751
## f_avg -1186.1 3885.5 -0.305 0.761
## work_yrs 2649.6 572.3 4.630 1.12e-05 ***
## satis -1531.7 2075.3 -0.738 0.462
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16180 on 98 degrees of freedom
## Multiple R-squared: 0.2125, Adjusted R-squared: 0.1804
## F-statistic: 6.611 on 4 and 98 DF, p-value: 9.407e-05
The p-value for work experience (1.12e-05) is below 0.05, and it is the only significant predictor in this model.
Hence, salary is associated with work experience.
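Since all three models are fitted to the same 103 observations in mba2, they can also be compared by AIC, where a lower value indicates a better trade-off between fit and complexity. A minimal sketch (results not run in this report):
# Compare the three linear models on the same data
AIC(fit1, fit2, fit3)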
We compare the first 90 students who reported a salary (those who got a job) with the 90 students whose salary is recorded as 0 (those who did not get a job).
# Students who reported a salary (got a job)
job1=subset(mba, salary>999)
# Students whose salary is recorded as 0 (did not get a job)
job0=subset(mba, salary==0)
# Keep the first 90 placed students to match the 90 unplaced students
job1=job1[1:90,]
chisq.test(job1$sex,job0$sex)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: job1$sex and job0$sex
## X-squared = 0.11711, df = 1, p-value = 0.7322
chisq.test(job1$frstlang,job0$frstlang)
## Warning in chisq.test(job1$frstlang, job0$frstlang): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: job1$frstlang and job0$frstlang
## X-squared = 0.0080703, df = 1, p-value = 0.9284
chisq.test(job1$quarter,job0$quarter)
## Warning in chisq.test(job1$quarter, job0$quarter): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: job1$quarter and job0$quarter
## X-squared = 110.98, df = 9, p-value < 2.2e-16
As the p-values for sex and first language are greater than 0.05, placement appears to be unrelated to sex or first language.
For quartile ranking, the p-value is less than 2.2e-16, which suggests that students who got a job ranked better academically than those who did not.
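An alternative to pairing two arbitrary sets of 90 rows is to define a placement indicator on the full data set and cross-tabulate it against each candidate variable. A minimal sketch, not the analysis run above (the names placed_data and placed are introduced here for illustration; results not run in this report):
# Keep only students with a real outcome: salary 0 (not placed) or a reported salary (placed)
placed_data <- mba[mba$salary == 0 | mba$salary > 999, ]
placed_data$placed <- as.integer(placed_data$salary > 999)
# Test whether placement status is associated with quartile ranking
chisq.test(table(placed_data$placed, placed_data$quarter))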