We are given a dataset on candidates who enrolled in an MBA program. Here, we estimate relationships and differences between groups of students, to better understand whether the MBA course on offer is worth picking.
We explore:
1. The factors that ‘starting salary’ depends on, including the effects of gender and/or age.
2. Whether students liked this particular program.
3. Whether the GMAT score made a difference in marks.
4. The effects of first language, work experience etc. of candidates.
# Reading the data into R.
mba.df <- read.csv("MBA Starting Salaries Data.csv")
mba.df$sex[mba.df$sex==1]='M'
mba.df$sex[mba.df$sex==2]='F'
mba.df$frstlang[mba.df$frstlang==1]='Eng'
mba.df$frstlang[mba.df$frstlang==2]='Other'
mba.df$sex=factor(mba.df$sex)
mba.df$frstlang=factor(mba.df$frstlang)
View(mba.df)
attach(mba.df)
# Examining type of data.
str(mba.df)
## 'data.frame': 274 obs. of 13 variables:
## $ age : int 23 24 24 24 24 24 25 25 25 25 ...
## $ sex : Factor w/ 2 levels "F","M": 1 2 2 2 1 2 2 1 2 2 ...
## $ gmat_tot: int 620 610 670 570 710 640 610 650 630 680 ...
## $ gmat_qpc: int 77 90 99 56 93 82 89 88 79 99 ...
## $ gmat_vpc: int 87 71 78 81 98 89 74 89 91 81 ...
## $ gmat_tpc: int 87 87 95 75 98 91 87 92 89 96 ...
## $ s_avg : num 3.4 3.5 3.3 3.3 3.6 3.9 3.4 3.3 3.3 3.45 ...
## $ f_avg : num 3 4 3.25 2.67 3.75 3.75 3.5 3.75 3.25 3.67 ...
## $ quarter : int 1 1 1 1 1 1 1 1 1 1 ...
## $ work_yrs: int 2 2 2 1 2 2 2 2 2 2 ...
## $ frstlang: Factor w/ 2 levels "Eng","Other": 1 1 1 1 1 1 1 1 2 1 ...
## $ salary : int 0 0 0 0 999 0 0 0 999 998 ...
## $ satis : int 7 6 6 7 5 6 5 6 4 998 ...
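The recoding of sex and frstlang above first overwrites the numeric codes with strings and then converts to a factor. The same result can be reached in one step with factor(). The following is a minimal sketch on a small made-up data frame: the name toy and its values are assumptions, while the codes (1 = M, 2 = F; 1 = Eng, 2 = Other) follow the code above. The levels are listed so that the level order matches the str() output (F before M).

```r
# Toy stand-in for a few rows of the CSV (values are invented).
toy <- data.frame(sex = c(1, 2, 1, 1), frstlang = c(1, 1, 2, 1))

# Map the numeric codes to labelled factor levels in one step.
# levels = c(2, 1) keeps "F" as the first level, as in the str() output.
toy$sex      <- factor(toy$sex,      levels = c(2, 1), labels = c("F", "M"))
toy$frstlang <- factor(toy$frstlang, levels = c(1, 2), labels = c("Eng", "Other"))

str(toy)
```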
# Describing data.
library(psych)
describe(mba.df)
## vars n mean sd median trimmed mad min max
## age 1 274 27.36 3.71 27 26.76 2.97 22 48
## sex* 2 274 1.75 0.43 2 1.81 0.00 1 2
## gmat_tot 3 274 619.45 57.54 620 618.86 59.30 450 790
## gmat_qpc 4 274 80.64 14.87 83 82.31 14.83 28 99
## gmat_vpc 5 274 78.32 16.86 81 80.33 14.83 16 99
## gmat_tpc 6 274 84.20 14.02 87 86.12 11.86 0 99
## s_avg 7 274 3.03 0.38 3 3.03 0.44 2 4
## f_avg 8 274 3.06 0.53 3 3.09 0.37 0 4
## quarter 9 274 2.48 1.11 2 2.47 1.48 1 4
## work_yrs 10 274 3.87 3.23 3 3.29 1.48 0 22
## frstlang* 11 274 1.12 0.32 1 1.02 0.00 1 2
## salary 12 274 39025.69 50951.56 999 33607.86 1481.12 0 220000
## satis 13 274 172.18 371.61 6 91.50 1.48 1 998
## range skew kurtosis se
## age 26 2.16 6.45 0.22
## sex* 1 -1.16 -0.66 0.03
## gmat_tot 340 -0.01 0.06 3.48
## gmat_qpc 71 -0.92 0.30 0.90
## gmat_vpc 83 -1.04 0.74 1.02
## gmat_tpc 99 -2.28 9.02 0.85
## s_avg 2 -0.06 -0.38 0.02
## f_avg 4 -2.08 10.85 0.03
## quarter 3 0.02 -1.35 0.07
## work_yrs 22 2.78 9.80 0.20
## frstlang* 1 2.37 3.65 0.02
## salary 220000 0.70 -1.05 3078.10
## satis 997 1.77 1.13 22.45
# Make table for placed students.
placed <- mba.df[(salary>0 & salary!=998 & salary!=999),]
View(placed)
# Make table for students not placed.
notplaced <- mba.df[(salary==0),]
View(notplaced)
# Table for recording satisfaction rating excluding non-participants.
sat <- mba.df[(satis!=998),]
View(sat)
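An alternative to keeping three separate subsets is to recode the special values as NA once, so that later summaries skip them. The sketch below assumes only what the filters above imply: 998 and 999 are codes rather than real salaries or ratings, and a salary of 0 means not placed. The data frame toy and the column status are hypothetical.

```r
# Toy data with the same special codes as the dataset (values are invented).
toy <- data.frame(
  salary = c(0, 999, 998, 95000, 105000),
  satis  = c(6, 5, 998, 7, 6)
)

# Treat the codes 998 and 999 as missing rather than as real values.
toy$salary[toy$salary %in% c(998, 999)] <- NA
toy$satis[toy$satis == 998] <- NA

# Placement status: 0 means not placed, a positive salary means placed,
# and a missing salary stays NA.
toy$status <- ifelse(toy$salary > 0, "placed", "notplaced")

# subset() drops rows where the condition is NA.
placed_toy <- subset(toy, status == "placed")

# Mean satisfaction over the students who gave a rating.
mean(toy$satis, na.rm = TRUE)
```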
# Visualizing data.
library(lattice)
histogram(sex)
histogram(frstlang)
boxplot(age~sex,main="Candidate age distribution w.r.t. gender",horizontal=TRUE,yaxt="n")
axis(side=2,at=c(2,1),labels=c("Male","Female"))
boxplot(placed$salary,main="Salary distribution among placed candidates",horizontal = TRUE)
boxplot(gmat_tot,main="GMAT Score distribution",horizontal = TRUE)
boxplot(s_avg,main="Spring MBA score average distribution",horizontal = TRUE)
boxplot(f_avg,main="Fall MBA score average distribution",horizontal = TRUE)
boxplot(sat$satis,main="Satisfaction rating distribution",horizontal=TRUE)
hist(work_yrs,main = "Distribution of work experience",xlab="Years in work",ylab="Frequency",breaks=20)
Most students’ first language is English (>80%).
Most of the students are male (~75%).
The median salary of placed candidates is 100000, the median GMAT score lies between 600 and 650, the Spring and Fall median scores are both 3.0, and the median satisfaction rating is 6.
A majority of students have between 0 and 5 years of work experience.
The median age of male MBAs is greater than that of female MBAs.
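The observations above are read off the plots. They can also be checked numerically. The following is a minimal sketch on an invented data frame (toy and all of its values are assumptions, not rows of the dataset). It shows the base R calls that give the share of each sex, the median age per sex and the median salary of placed students.

```r
# Invented example data; not taken from the MBA dataset.
toy <- data.frame(
  sex    = factor(c("M", "F", "M", "M", "F", "M")),
  age    = c(27, 24, 29, 26, 25, 31),
  salary = c(100000, 95000, 0, 105000, 0, 98000)
)

# Share of each sex, in percent.
round(100 * prop.table(table(toy$sex)), 1)

# Median age per sex.
age_medians <- tapply(toy$age, toy$sex, median)
age_medians

# Median salary among placed students only (salary > 0).
placed_median <- median(toy$salary[toy$salary > 0])
placed_median
```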
# Scatter plots to understand relationships between parameters.
par(mfrow=c(2,4))
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplot(gmat_tot,s_avg,xlab="GMAT Total",ylab="Spring MBA average",main = "Spring avg.score vs GMAT Total ")
scatterplot(gmat_tot,f_avg,xlab="GMAT Total",ylab="Fall MBA average",main = "Fall avg.score vs GMAT Total ")
attach(placed)
## The following objects are masked from mba.df:
##
## age, f_avg, frstlang, gmat_qpc, gmat_tot, gmat_tpc, gmat_vpc,
## quarter, s_avg, salary, satis, sex, work_yrs
scatterplot(salary~gmat_tot,xlab="GMAT Total",ylab="Salary",main="GMAT score - salary distribution",labels=row.names(placed))
scatterplot(salary~s_avg,xlab="Spring MBA average",ylab = "Salary",main="Spring avg. score - salary distribution",labels=row.names(placed))
scatterplot(salary~f_avg,xlab="Fall MBA average",ylab = "Salary",main="Fall avg. score - salary distribution",labels=row.names(placed))
scatterplot(salary~age,xlab="Age",ylab="Salary",main="Age - Salary distribution",labels=row.names(placed))
scatterplot(salary~work_yrs,xlab="Work Experience",ylab="Salary",main="Work Experience - Salary distribution",labels=row.names(placed))
scatterplot(salary~satis,xlab="Satisfaction",ylab="Salary",main="Satisfaction - Salary distribution",labels=row.names(placed))
par(mfrow=c(1,1))
library(corrgram)
## Warning: package 'corrgram' was built under R version 3.4.3
corrgram(mba.df,order = TRUE,lower.panel = panel.shade,upper.panel = panel.pie,text.panel = panel.txt,main="Corrgram of MBA_Salaries dataset")
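From the corrgram, salary is positively correlated with work experience and with the average MBA scores.
GMAT scores are also positively correlated with the average MBA scores and with the satisfaction ratings.
Note that the corrgram is drawn from mba.df, which still contains the 998 and 999 codes in salary and satis, so the correlations involving these two variables should be read with caution.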
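Because the corrgram call uses mba.df, the 998 and 999 codes in salary and satis enter the correlations as if they were real values. One way to check the pattern without them is to recode the codes as NA and correlate only the placed students. The sketch below uses base cor() on invented data; toy, placed_toy and the values are assumptions.

```r
# Invented example data containing the special codes.
toy <- data.frame(
  salary   = c(95000, 999, 105000, 998, 120000, 0),
  work_yrs = c(2, 3, 4, 1, 8, 2),
  satis    = c(6, 5, 7, 998, 6, 5)
)

# Recode the special codes as missing so they do not enter the correlations.
toy$salary[toy$salary %in% c(998, 999)] <- NA
toy$satis[toy$satis == 998] <- NA

# Restrict to placed students (subset() drops rows with a missing salary).
placed_toy <- subset(toy, salary > 0)

# Correlation matrix computed on the available pairs of values.
round(cor(placed_toy, use = "pairwise.complete.obs"), 2)
```

The cleaned data frame could then be passed to corrgram() in place of mba.df.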
t.test(salary~sex,placed)
##
## Welch Two Sample t-test
##
## data: salary by sex
## t = -1.3628, df = 38.115, p-value = 0.1809
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -16021.72 3128.55
## sample estimates:
## mean in group F mean in group M
## 98524.39 104970.97
t.test(gmat_tot~sex,placed)
##
## Welch Two Sample t-test
##
## data: gmat_tot by sex
## t = -0.1877, df = 51.483, p-value = 0.8518
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -25.14641 20.84534
## sample estimates:
## mean in group F mean in group M
## 614.5161 616.6667
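From the t-tests, we cannot reject the hypothesis of equal means: among placed students, neither mean salary (p = 0.18) nor mean GMAT score (p = 0.85) differs significantly between males and females (p > 0.05 in both cases).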
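The decision rule used here (compare the p-value with 0.05) can be applied directly to the object that t.test() returns. The following is a small sketch on simulated data; the data frame toy, the seed and the group means are assumptions chosen only for illustration.

```r
# Simulated salaries for two groups (values are invented).
set.seed(1)
toy <- data.frame(
  sex    = factor(rep(c("F", "M"), each = 10)),
  salary = c(rnorm(10, 98000, 15000), rnorm(10, 105000, 15000))
)

# t.test() runs the Welch test by default (var.equal = FALSE).
res <- t.test(salary ~ sex, data = toy)

res$estimate   # group means
res$conf.int   # 95% confidence interval for the difference in means
res$p.value    # p-value of the two-sided test

# Decision at the 5% level.
decision <- if (res$p.value < 0.05) "reject H0: means differ" else "fail to reject H0"
decision
```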
fit1 <- lm(salary~sex+frstlang+work_yrs+gmat_tot+s_avg+f_avg)
summary(fit1)
##
## Call:
## lm(formula = salary ~ sex + frstlang + work_yrs + gmat_tot +
## s_avg + f_avg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32652 -8940 -1709 5186 83182
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85685.28 22530.99 3.803 0.000251 ***
## sexM 5886.39 3462.79 1.700 0.092388 .
## frstlangOther 15101.77 6473.46 2.333 0.021743 *
## work_yrs 2201.83 579.92 3.797 0.000257 ***
## gmat_tot -11.90 31.77 -0.375 0.708712
## s_avg 4851.02 4986.79 0.973 0.333110
## f_avg -1153.74 3822.28 -0.302 0.763422
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15760 on 96 degrees of freedom
## Multiple R-squared: 0.2678, Adjusted R-squared: 0.2221
## F-statistic: 5.853 on 6 and 96 DF, p-value: 3.114e-05
The model explains 26.78% of the variance in salary (multiple R-squared = 0.2678). Here work_yrs and frstlangOther are the significant variables (p-value < 0.05).
fit2 <- lm(salary~age+sex+frstlang+work_yrs+satis+s_avg+f_avg+gmat_tot)
summary(fit2)
##
## Call:
## lm(formula = salary ~ age + sex + frstlang + work_yrs + satis +
## s_avg + f_avg + gmat_tot)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28074 -9183 -1632 5602 79935
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64846.39 33542.40 1.933 0.0562 .
## age 1650.02 1125.42 1.466 0.1459
## sexM 5215.92 3507.28 1.487 0.1403
## frstlangOther 11131.89 7126.07 1.562 0.1216
## work_yrs 787.18 1146.15 0.687 0.4939
## satis -2237.86 2037.81 -1.098 0.2749
## s_avg 3083.80 5058.12 0.610 0.5435
## f_avg -655.40 3825.50 -0.171 0.8643
## gmat_tot -12.40 31.89 -0.389 0.6982
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15670 on 94 degrees of freedom
## Multiple R-squared: 0.2914, Adjusted R-squared: 0.231
## F-statistic: 4.831 on 8 and 94 DF, p-value: 5.285e-05
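The model explains 29.14% of the variance in salary (multiple R-squared = 0.2914), and the overall F-test is significant (p-value < 0.05), although no individual coefficient is significant at the 5% level.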
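Hence, fit2 is slightly better than fit1 as a linear model for ‘starting salary’: its adjusted R-squared is marginally higher (0.231 vs 0.2221). Multiple R-squared alone cannot decide this, since it never decreases when predictors are added.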
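Since fit1 uses a subset of the predictors in fit2, the two models are nested and can also be compared with a partial F-test. The sketch below illustrates the idea on simulated data; toy, small, large and the coefficients used to simulate salary are assumptions, not estimates from the dataset.

```r
# Simulated data (invented values) with one useful and two noise predictors.
set.seed(42)
n <- 60
toy <- data.frame(
  work_yrs = rpois(n, 3),
  age      = round(rnorm(n, 27, 3)),
  s_avg    = runif(n, 2, 4)
)
toy$salary <- 90000 + 2000 * toy$work_yrs + rnorm(n, 0, 15000)

small <- lm(salary ~ work_yrs + s_avg, data = toy)
large <- lm(salary ~ work_yrs + s_avg + age, data = toy)

# Multiple R-squared never decreases when a predictor is added ...
r2 <- c(small = summary(small)$r.squared, large = summary(large)$r.squared)
r2

# ... so compare adjusted R-squared, which penalises extra terms.
c(small = summary(small)$adj.r.squared, large = summary(large)$adj.r.squared)

# Partial F-test: does the larger model fit significantly better?
anova(small, large)
```

Applied to the models above, the corresponding call would be anova(fit1, fit2).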
COMPARING THOSE GROUPS WHO GOT A JOB WITH THOSE WHO DIDN’T:
# Contingency Tables.
mytable <- xtabs(~sex+salary,notplaced)
margin.table(mytable,1)
## sex
## F M
## 23 67
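Hence, in absolute numbers more males (67) than females (23) were not placed. Since about 75% of the students are male, this count alone does not show that males are less likely to be placed; the proportion not placed within each sex has to be compared.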
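To compare the two groups on equal terms, placement status can be cross-tabulated with sex and turned into row proportions. The following is a minimal sketch on invented data (toy, status and the values are assumptions). It is built so that males have the higher count of unplaced students while the rate is the same for both sexes.

```r
# Invented example: 3 females and 6 males.
toy <- data.frame(
  sex    = factor(c("F", "F", "F", "M", "M", "M", "M", "M", "M")),
  salary = c(0, 90000, 95000, 0, 0, 100000, 98000, 105000, 110000)
)

# Salary 0 means not placed, a positive salary means placed.
toy$status <- factor(ifelse(toy$salary > 0, "placed", "notplaced"))

# Counts by sex and placement status.
counts <- table(toy$sex, toy$status)
counts

# Row proportions: share of each sex that is placed / not placed.
rates <- prop.table(counts, margin = 1)
rates
```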