MBAsalaries.df <- read.csv("MBA Starting Salaries Data.csv")
library(psych)
describe(MBAsalaries.df)[,c(1:5,8,9)]
##          vars   n     mean       sd median min    max
## age         1 274    27.36     3.71     27  22     48
## sex         2 274     1.25     0.43      1   1      2
## gmat_tot    3 274   619.45    57.54    620 450    790
## gmat_qpc    4 274    80.64    14.87     83  28     99
## gmat_vpc    5 274    78.32    16.86     81  16     99
## gmat_tpc    6 274    84.20    14.02     87   0     99
## s_avg       7 274     3.03     0.38      3   2      4
## f_avg       8 274     3.06     0.53      3   0      4
## quarter     9 274     2.48     1.11      2   1      4
## work_yrs   10 274     3.87     3.23      3   0     22
## frstlang   11 274     1.12     0.32      1   1      2
## salary     12 274 39025.69 50951.56    999   0 220000
## satis      13 274   172.18   371.61      6   1    998
# Keep only candidates with a reported salary; values below 1000
# (such as 0 and 999) are codes, not salaries
MBAsalTrue.df <- MBAsalaries.df[which(MBAsalaries.df$salary >= 1000), ]
hist(MBAsalTrue.df$salary)
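The filter above works because the salary column mixes real salaries with small numeric codes. The following is a minimal sketch of making that explicit, run on a toy vector rather than the real file. It assumes that 998 and 999 are "salary unknown" codes and that 0 means "not placed"; the source does not define the codes, and `classify_salary` is a hypothetical helper introduced only for illustration.

```r
# Toy salary values; NOT taken from the real data set.
salary <- c(0, 998, 999, 85000, 96000, 0, 105000)

# Hypothetical helper: label each record by what its salary value encodes.
# Assumption: 998/999 are "salary unknown" codes, 0 means "not placed".
classify_salary <- function(x) {
  out <- ifelse(x == 0, "not_placed",
                ifelse(x %in% c(998, 999), "unknown", "placed"))
  factor(out, levels = c("not_placed", "unknown", "placed"))
}

status <- classify_salary(salary)
print(table(status))

# Recode the unknown codes as NA so they cannot distort means or plots.
salary_clean <- ifelse(status == "unknown", NA, salary)
print(mean(salary_clean[status == "placed"]))
```

Keeping the three groups as an explicit factor makes it harder to accidentally average a code such as 999 together with genuine salaries.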
The salary can be affected by multiple factors. These factors and their associated variables are given below:

1. Ascribed factors of the candidate
    * age
    * sex
    * frstlang
2. Performance of the candidate
    * gmat_tot
    * gmat_qpc
    * gmat_vpc
    * gmat_tpc
    * s_avg
    * f_avg
    * quarter
3. Work experience
    * work_yrs
4. Satisfaction with the MBA program
    * satis
We can now investigate the relationship of these variables with the starting salary. We should also look into the profile of candidates who did not get placed.

# Salary vs Sex
boxplot(salary~sex, data=MBAsalTrue.df)
The boxplot suggests that female candidates (sex = 2) have lower starting salaries than male candidates (sex = 1). Next, we look at the candidates who did not get placed and examine their breakdown by sex.
# Extract the candidates who were not placed (salary == 0)
MBAsalZero.df<-MBAsalaries.df[which(MBAsalaries.df$salary==0),]
table(MBAsalZero.df$sex)
##
## 1 2
## 67 23
Of the 90 candidates who did not get placed, 67 are men and 23 are women. Raw counts alone are not conclusive, because men may simply outnumber women in the class.
It is therefore pertinent to ask what percentage of women and men got placed. For this purpose, we create a new data frame containing all candidates whose placement outcome is known (a reported salary, or a salary of 0 for those not placed).
MBAknownsal.df<-MBAsalaries.df[which(MBAsalaries.df$salary>= 1000|MBAsalaries.df$salary==0),]
# Create an indicator variable for placement status (TRUE = placed);
# the threshold matches the salary >= 1000 filter used for MBAsalTrue.df
MBAknownsal.df$Status <- (MBAknownsal.df$salary >= 1000)
MBAknownsal.df$Status<- factor(MBAknownsal.df$Status)
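The numeric codes used in this analysis (sex: 1 = male, 2 = female; frstlang: 1 = English, 2 = other) can be turned into labelled factors so that tables describe themselves. The sketch below uses a toy data frame, not the real data, and the label strings are illustrative choices.

```r
# Toy records; NOT taken from the real data set.
toy <- data.frame(sex = c(1, 2, 1, 1, 2),
                  frstlang = c(1, 1, 2, 1, 1),
                  salary = c(0, 95000, 0, 88000, 102000))

# Attach readable labels to the numeric codes.
toy$sex <- factor(toy$sex, levels = c(1, 2), labels = c("Male", "Female"))
toy$frstlang <- factor(toy$frstlang, levels = c(1, 2),
                       labels = c("English", "Other"))

# Placement indicator with explicit labels instead of TRUE/FALSE.
toy$Status <- factor(toy$salary > 0, levels = c(FALSE, TRUE),
                     labels = c("Not placed", "Placed"))

tab <- table(toy$Status, toy$sex)
print(tab)
```

With labelled factors, the row and column headers of a cross-tabulation no longer need a legend such as "(=1)" or "(=2)" in the surrounding text.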
Now we can look at the proportion of women (sex = 2) and men (sex = 1) who got placed; each column of the table sums to 1.
prop.table(xtabs(~ MBAknownsal.df$Status + MBAknownsal.df$sex , data=MBAknownsal.df),2)
##                      MBAknownsal.df$sex
## MBAknownsal.df$Status         1         2
##                 FALSE 0.4820144 0.4259259
##                 TRUE  0.5179856 0.5740741
H1: The proportion of women placed is higher than the proportion of men placed.
chisq.test(xtabs(~ MBAknownsal.df$Status + MBAknownsal.df$sex , data=MBAknownsal.df))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: xtabs(~MBAknownsal.df$Status + MBAknownsal.df$sex, data = MBAknownsal.df)
## X-squared = 0.29208, df = 1, p-value = 0.5889
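Since the p-value (0.5889) is greater than 0.05, we fail to reject the null hypothesis that placement status and sex are independent. The data therefore do not support H1; this is an absence of evidence for a difference, not proof that women and men are placed at the same rate.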
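The chi-squared test of independence is two-sided, whereas H1 is directional. The sketch below rebuilds the 2x2 table and adds a one-sided `prop.test`. The not-placed counts (67 men, 23 women) are printed above; the placed counts (72 men, 31 women) are an inference from the printed column proportions, not values shown in the original output.

```r
# Counts reconstructed from the outputs above (assumption: 139 men and
# 54 women with a known outcome, inferred from the printed proportions).
placement <- matrix(c(67, 72, 23, 31), nrow = 2,
                    dimnames = list(Status = c("FALSE", "TRUE"),
                                    sex = c("1", "2")))

# Column proportions: share placed / not placed within each sex.
print(round(prop.table(placement, 2), 4))

# Two-sided test of independence (Yates-corrected by default for 2x2 tables).
two_sided <- chisq.test(placement)
print(two_sided)

# One-sided test matching H1: is the placed share of women (first group)
# greater than the placed share of men (second group)?
one_sided <- prop.test(x = c(31, 72), n = c(54, 139),
                       alternative = "greater")
print(one_sided$p.value)
```

Because the observed difference points in the direction of H1, the one-sided p-value is half the two-sided one, which is still far above 0.05.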
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplot(MBAsalTrue.df$age, MBAsalTrue.df$salary)
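The plot is read here as showing starting salary beginning to dip at higher ages. That reading should be treated with caution, because the overall correlation between age and salary among placed candidates is positive (0.50 in the correlation matrix further below).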
Age profile of the candidates who were not placed:
hist(MBAsalZero.df$age)
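The histogram shows that younger candidates are the most frequent among those not placed. This may largely reflect the age distribution of the whole class (median age 27) rather than an effect of age on placement.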
Now let us compare the profile of those who are placed and not placed based on first language.
print("placed people")
## [1] "placed people"
table(MBAsalTrue.df$frstlang)
##
## 1 2
## 96 7
print("not placed people")
## [1] "not placed people"
table(MBAsalZero.df$frstlang)
##
## 1 2
## 82 8
Now we can look at the proportion of candidates placed among those whose first language is English (frstlang = 1) and among the others (frstlang = 2); each column of the table sums to 1.
prop.table(xtabs(~ MBAknownsal.df$Status + MBAknownsal.df$frstlang , data=MBAknownsal.df),2)
##                      MBAknownsal.df$frstlang
## MBAknownsal.df$Status         1         2
##                 FALSE 0.4606742 0.5333333
##                 TRUE  0.5393258 0.4666667
H2: The proportion of candidates placed is higher among those whose first language is English than among those whose first language is not English.
chisq.test(xtabs(~ MBAknownsal.df$Status + MBAknownsal.df$frstlang , data=MBAknownsal.df))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: xtabs(~MBAknownsal.df$Status + MBAknownsal.df$frstlang, data = MBAknownsal.df)
## X-squared = 0.074127, df = 1, p-value = 0.7854
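Since the p-value (0.7854) is greater than 0.05, we fail to reject the null hypothesis that placement status and first language are independent. The data therefore do not support H2; this is an absence of evidence for a difference, not proof that the two groups are placed at the same rate.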
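Only 15 candidates with a known outcome have a first language other than English (7 placed, 8 not placed), so it is worth checking the chi-squared approximation. The sketch below rebuilds the table from the counts printed above, inspects the expected cell counts, and runs Fisher's exact test as a cross-check. The "at least 5 per cell" guideline is a common rule of thumb, not a requirement stated in the original analysis.

```r
# Counts taken from the two frequency tables printed above.
language_tab <- matrix(c(82, 96, 8, 7), nrow = 2,
                       dimnames = list(Status = c("FALSE", "TRUE"),
                                       frstlang = c("1", "2")))

# Expected counts under independence; the chi-squared approximation is
# usually considered reasonable when all of them are at least 5.
chi <- chisq.test(language_tab)
print(round(chi$expected, 2))

# Fisher's exact test does not rely on the large-sample approximation.
exact <- fisher.test(language_tab)
print(exact$p.value)
```

If the exact test and the chi-squared test agree, as they should here given the near-equal proportions, the conclusion does not hinge on the small cell counts.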
# Salary vs GMAT score
scatterplot(MBAsalTrue.df$gmat_tot, MBAsalTrue.df$salary)
# Salary vs quartile
boxplot(salary~quarter, data= MBAsalTrue.df)
scatterplot(MBAsalTrue.df$work_yrs, MBAsalTrue.df$salary)
library(corrplot)
## corrplot 0.84 loaded
round(cor(MBAsalTrue.df),2)
##            age   sex gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg
## age       1.00 -0.14    -0.08    -0.17     0.02    -0.10  0.16 -0.22
## sex      -0.14  1.00    -0.02    -0.15     0.05    -0.05  0.08  0.17
## gmat_tot -0.08 -0.02     1.00     0.67     0.78     0.97  0.17  0.12
## gmat_qpc -0.17 -0.15     0.67     1.00     0.09     0.66  0.02  0.10
## gmat_vpc  0.02  0.05     0.78     0.09     1.00     0.78  0.16  0.02
## gmat_tpc -0.10 -0.05     0.97     0.66     0.78     1.00  0.14  0.07
## s_avg     0.16  0.08     0.17     0.02     0.16     0.14  1.00  0.45
## f_avg    -0.22  0.17     0.12     0.10     0.02     0.07  0.45  1.00
## quarter  -0.13 -0.02    -0.11     0.01    -0.13    -0.10 -0.84 -0.43
## work_yrs  0.88 -0.09    -0.12    -0.18    -0.03    -0.13  0.16 -0.22
## frstlang  0.35  0.08    -0.13     0.01    -0.22    -0.16 -0.14 -0.05
## salary    0.50 -0.17    -0.09     0.01    -0.14    -0.13  0.10 -0.11
## satis     0.11 -0.09     0.06     0.00     0.15     0.12 -0.14 -0.12
##          quarter work_yrs frstlang salary satis
## age        -0.13     0.88     0.35   0.50  0.11
## sex        -0.02    -0.09     0.08  -0.17 -0.09
## gmat_tot   -0.11    -0.12    -0.13  -0.09  0.06
## gmat_qpc    0.01    -0.18     0.01   0.01  0.00
## gmat_vpc   -0.13    -0.03    -0.22  -0.14  0.15
## gmat_tpc   -0.10    -0.13    -0.16  -0.13  0.12
## s_avg      -0.84     0.16    -0.14   0.10 -0.14
## f_avg      -0.43    -0.22    -0.05  -0.11 -0.12
## quarter     1.00    -0.13     0.11  -0.13  0.23
## work_yrs   -0.13     1.00     0.20   0.45  0.06
## frstlang    0.11     0.20     1.00   0.27  0.09
## salary     -0.13     0.45     0.27   1.00 -0.04
## satis       0.23     0.06     0.09  -0.04  1.00
library(corrgram)
corrgram(MBAsalTrue.df, order = FALSE, lower.panel = panel.shade, upper.panel = panel.pie, text.panel = panel.txt, main= "Corrgram of various factors")
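Strongly correlated predictors matter for any regression built on these variables. The sketch below copies six rows and columns from the correlation matrix printed above and lists the pairs whose absolute correlation is at least 0.8. The 0.8 cut-off is an illustrative assumption, not a threshold used in the original analysis.

```r
# Sub-matrix copied from the printed correlation matrix (placed candidates).
vars <- c("age", "work_yrs", "gmat_tot", "gmat_tpc", "s_avg", "quarter")
r <- matrix(c(
   1.00,  0.88, -0.08, -0.10,  0.16, -0.13,
   0.88,  1.00, -0.12, -0.13,  0.16, -0.13,
  -0.08, -0.12,  1.00,  0.97,  0.17, -0.11,
  -0.10, -0.13,  0.97,  1.00,  0.14, -0.10,
   0.16,  0.16,  0.17,  0.14,  1.00, -0.84,
  -0.13, -0.13, -0.11, -0.10, -0.84,  1.00),
  nrow = 6, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once (upper triangle) and flag |r| >= 0.8.
# Assumption: 0.8 is an arbitrary illustrative threshold.
idx <- which(abs(r) >= 0.8 & upper.tri(r), arr.ind = TRUE)
high <- data.frame(var1 = rownames(r)[idx[, "row"]],
                   var2 = colnames(r)[idx[, "col"]],
                   r = r[idx])
print(high)
```

The three flagged pairs (age and work_yrs, gmat_tot and gmat_tpc, s_avg and quarter) carry largely overlapping information, so entering both members of a pair into one model makes their individual coefficients hard to interpret.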
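# Note: MBAknownsal.df also contains the unplaced candidates (salary = 0),
# so this model mixes "placed or not" with "how much was earned";
# MBAsalTrue.df holds the placed candidates only
fit <- lm(salary ~ age + gmat_qpc + gmat_tot + gmat_tpc + gmat_vpc + s_avg + f_avg + work_yrs + sex + frstlang, data = MBAknownsal.df)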
summary(fit)
##
## Call:
## lm(formula = salary ~ age + gmat_qpc + gmat_tot + gmat_tpc +
## gmat_vpc + s_avg + f_avg + work_yrs + sex + frstlang, data = MBAknownsal.df)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -86129 -52437  22217  43263 188119
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 200482.95   92802.71   2.160   0.0321 *
## age          -4703.82    1916.34  -2.455   0.0150 *
## gmat_qpc       538.13     854.89   0.629   0.5298
## gmat_tot      -376.13     313.59  -1.199   0.2319
## gmat_tpc       691.61     653.79   1.058   0.2915
## gmat_vpc       522.85     821.83   0.636   0.5254
## s_avg        22036.96   12592.72   1.750   0.0818 .
## f_avg        -7562.26    8825.77  -0.857   0.3927
## work_yrs      3559.88    2175.77   1.636   0.1035
## sex            -40.44    8746.76  -0.005   0.9963
## frstlang     14456.64   15774.81   0.916   0.3606
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52760 on 182 degrees of freedom
## Multiple R-squared: 0.06607, Adjusted R-squared: 0.01476
## F-statistic: 1.288 on 10 and 182 DF, p-value: 0.2403
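The above analysis shows that age is the only predictor whose coefficient is significant at the 5% level (p = 0.015). This result should be read with caution. The overall F-test is not significant (p = 0.2403) and the adjusted R-squared is only 0.015, so the model as a whole explains almost none of the variation. The model was also fitted on MBAknownsal.df, which includes the unplaced candidates with a salary of 0, so it does not describe starting salaries of placed candidates alone. Finally, several predictors are strongly correlated with each other among placed candidates (for example age and work_yrs, r = 0.88, and gmat_tot and gmat_tpc, r = 0.97), which inflates the standard errors of their coefficients.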
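The overall fit statistics in the summary can be recomputed from the reported R-squared and degrees of freedom, which is a quick way to check how much weight the individual coefficients deserve. The sketch below uses only numbers copied from the `summary(fit)` output above; the variable names are illustrative.

```r
# Values copied from the summary(fit) output above.
r2 <- 0.06607          # Multiple R-squared
k <- 10                # number of predictors
df_resid <- 182        # residual degrees of freedom
n <- df_resid + k + 1  # number of observations used in the fit

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F statistic and its p-value (upper tail of the F distribution).
f_stat <- (r2 / k) / ((1 - r2) / df_resid)
p_overall <- pf(f_stat, k, df_resid, lower.tail = FALSE)

print(n)
print(round(c(adj_r2 = adj_r2, F = f_stat, p = p_overall), 4))
```

The recovered sample size of 193 equals the 103 placed plus the 90 unplaced candidates, which confirms that the zero-salary rows were part of this fit.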