MBA_Starting_Salaries

A dataset describing candidates who enrolled in an MBA program is provided. Here, we estimate relationships and differences between groups to help decide, based on these results, whether the MBA course on offer is worth taking.

We explore: 1. The factors on which ‘starting salary’ depends, including the effects of gender and/or age. 2. Whether students liked this particular program. 3. Whether GMAT score made a difference to marks. 4. The effects of candidates’ first language, work experience, etc.

# Read the data into R.
mba.df <- read.csv("MBA Starting Salaries Data.csv")
mba.df$sex[mba.df$sex==1]='M'
mba.df$sex[mba.df$sex==2]='F'
mba.df$frstlang[mba.df$frstlang==1]='Eng'
mba.df$frstlang[mba.df$frstlang==2]='Other'
mba.df$sex=factor(mba.df$sex)
mba.df$frstlang=factor(mba.df$frstlang)
View(mba.df)
attach(mba.df)
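The recoding above can also be done in one step per column with `factor()`. The following is only a minimal sketch on a made-up toy data frame (`raw` and its values are hypothetical, not taken from the CSV). Listing level 2 first keeps "F" as the first level, matching the `str()` output of this report.

```r
# Hypothetical toy data standing in for the raw 1/2 codes in the CSV.
raw <- data.frame(sex = c(1, 2, 1, 1), frstlang = c(1, 1, 2, 1))

# Map codes to labels directly; levels = c(2, 1) makes "F" the first level,
# which is also what alphabetical ordering produces in the original code.
raw$sex      <- factor(raw$sex,      levels = c(2, 1), labels = c("F", "M"))
raw$frstlang <- factor(raw$frstlang, levels = c(1, 2), labels = c("Eng", "Other"))

str(raw)
```

Passing `data = raw` to later modelling and plotting calls would avoid relying on `attach()`, whose masking behaviour shows up further down in this report.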

# Examining type of data.
str(mba.df)

## 'data.frame':    274 obs. of  13 variables:
##  $ age     : int  23 24 24 24 24 24 25 25 25 25 ...
##  $ sex     : Factor w/ 2 levels "F","M": 1 2 2 2 1 2 2 1 2 2 ...
##  $ gmat_tot: int  620 610 670 570 710 640 610 650 630 680 ...
##  $ gmat_qpc: int  77 90 99 56 93 82 89 88 79 99 ...
##  $ gmat_vpc: int  87 71 78 81 98 89 74 89 91 81 ...
##  $ gmat_tpc: int  87 87 95 75 98 91 87 92 89 96 ...
##  $ s_avg   : num  3.4 3.5 3.3 3.3 3.6 3.9 3.4 3.3 3.3 3.45 ...
##  $ f_avg   : num  3 4 3.25 2.67 3.75 3.75 3.5 3.75 3.25 3.67 ...
##  $ quarter : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ work_yrs: int  2 2 2 1 2 2 2 2 2 2 ...
##  $ frstlang: Factor w/ 2 levels "Eng","Other": 1 1 1 1 1 1 1 1 2 1 ...
##  $ salary  : int  0 0 0 0 999 0 0 0 999 998 ...
##  $ satis   : int  7 6 6 7 5 6 5 6 4 998 ...

# Describing data.
library(psych)
describe(mba.df)

##           vars   n     mean       sd median  trimmed     mad min    max
## age          1 274    27.36     3.71     27    26.76    2.97  22     48
## sex*         2 274     1.75     0.43      2     1.81    0.00   1      2
## gmat_tot     3 274   619.45    57.54    620   618.86   59.30 450    790
## gmat_qpc     4 274    80.64    14.87     83    82.31   14.83  28     99
## gmat_vpc     5 274    78.32    16.86     81    80.33   14.83  16     99
## gmat_tpc     6 274    84.20    14.02     87    86.12   11.86   0     99
## s_avg        7 274     3.03     0.38      3     3.03    0.44   2      4
## f_avg        8 274     3.06     0.53      3     3.09    0.37   0      4
## quarter      9 274     2.48     1.11      2     2.47    1.48   1      4
## work_yrs    10 274     3.87     3.23      3     3.29    1.48   0     22
## frstlang*   11 274     1.12     0.32      1     1.02    0.00   1      2
## salary      12 274 39025.69 50951.56    999 33607.86 1481.12   0 220000
## satis       13 274   172.18   371.61      6    91.50    1.48   1    998
##            range  skew kurtosis      se
## age           26  2.16     6.45    0.22
## sex*           1 -1.16    -0.66    0.03
## gmat_tot     340 -0.01     0.06    3.48
## gmat_qpc      71 -0.92     0.30    0.90
## gmat_vpc      83 -1.04     0.74    1.02
## gmat_tpc      99 -2.28     9.02    0.85
## s_avg          2 -0.06    -0.38    0.02
## f_avg          4 -2.08    10.85    0.03
## quarter        3  0.02    -1.35    0.07
## work_yrs      22  2.78     9.80    0.20
## frstlang*      1  2.37     3.65    0.02
## salary    220000  0.70    -1.05 3078.10
## satis        997  1.77     1.13   22.45

# Make table for placed students.
placed <- mba.df[(salary>0 & salary!=998 & salary!=999),]
View(placed)

# Make table for students not placed.
notplaced <- mba.df[(salary==0),]
View(notplaced)

# Table for recording satisfaction rating excluding non-participants.
sat <- mba.df[(satis!=998),]
View(sat)
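An alternative to building separate tables is to turn the placeholder codes into `NA` once, so that later summaries skip them. This is a sketch on invented values; the names `toy`, `salary_clean` and `satis_clean` are hypothetical, and treating 998/999 as missing is an assumption that mirrors the filters used above.

```r
# Invented rows: 0 = not placed, 998/999 = placeholder codes, others = salaries.
toy <- data.frame(salary = c(0, 998, 999, 85000, 120000),
                  satis  = c(6, 998, 5, 7, 6))

# Replace placeholder codes with NA instead of dropping rows.
toy$salary_clean <- ifelse(toy$salary %in% c(998, 999), NA, toy$salary)
toy$satis_clean  <- ifelse(toy$satis == 998, NA, toy$satis)

# The same groups as above, derived from the cleaned column.
placed_toy    <- subset(toy, !is.na(salary_clean) & salary_clean > 0)
notplaced_toy <- subset(toy, salary == 0)

# Summaries can now ignore missing values explicitly.
sat_median <- median(toy$satis_clean, na.rm = TRUE)
```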

# Visualizing data.
library(lattice)
histogram(sex)

histogram(frstlang)

boxplot(age~sex,main="Candidate age distribution by gender",horizontal=TRUE,yaxt="n")
axis(side=2,at=c(2,1),labels=c("Male","Female"))

boxplot(placed$salary,main="Salary distribution among placed candidates",horizontal = TRUE)

boxplot(gmat_tot,main="GMAT Score distribution",horizontal = TRUE)

boxplot(s_avg,main="Spring MBA score average distribution",horizontal = TRUE)

boxplot(f_avg,main="Fall MBA score average distribution",horizontal = TRUE)

boxplot(sat$satis,main="Satisfaction rating distribution",horizontal=TRUE)

hist(work_yrs,main = "Distribution of work experience",xlab="Years in work",ylab="Frequency",breaks=20)

Most students’ first language is English (>80%).
Most of the students are male (~75%).
Among placed candidates the median salary is about 100000; the median GMAT score lies between 600 and 650; the Spring and Fall median averages are both 3.0; and the median satisfaction rating is 6.
A majority of students have between 0 and 5 years of work experience.
The age boxplot shows that the median age of male MBAs is greater than that of female MBAs.
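Observations like these can be read off the plots, but they can also be checked numerically. The sketch below uses a small invented data frame (`toy` and all its values are hypothetical) to show the kind of calls that would back up each statement.

```r
# Invented data for illustration only.
toy <- data.frame(
  sex      = factor(c("M", "M", "F", "M", "F", "M", "M", "M")),
  frstlang = factor(c("Eng", "Eng", "Eng", "Other", "Eng", "Eng", "Eng", "Eng")),
  age      = c(27, 29, 24, 31, 26, 28, 25, 30)
)

# Share of each category.
sex_share  <- prop.table(table(toy$sex))
lang_share <- prop.table(table(toy$frstlang))

# Median age within each gender.
age_median <- tapply(toy$age, toy$sex, median)

sex_share
lang_share
age_median
```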

# Scatter plots to understand relationships between parameters.
par(mfrow=c(2,4))
library(car)

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

scatterplot(gmat_tot,s_avg,xlab="GMAT Total",ylab="Spring MBA average",main = "Spring avg.score vs GMAT Total ")

scatterplot(gmat_tot,f_avg,xlab="GMAT Total",ylab="Fall MBA average",main = "Fall avg.score vs GMAT Total ")

attach(placed)

## The following objects are masked from mba.df:
## 
##     age, f_avg, frstlang, gmat_qpc, gmat_tot, gmat_tpc, gmat_vpc,
##     quarter, s_avg, salary, satis, sex, work_yrs

scatterplot(salary~gmat_tot,xlab="GMAT Total",ylab="Salary",main="GMAT score - salary distribution",labels=row.names(placed))

scatterplot(salary~s_avg,xlab="Spring MBA average",ylab = "Salary",main="Spring avg. score - salary distribution",labels=row.names(placed))

scatterplot(salary~f_avg,xlab="Fall MBA average",ylab = "Salary",main="Fall avg. score - salary distribution",labels=row.names(placed))

scatterplot(salary~age,xlab="Age",ylab="Salary",main="Age - Salary distribution",labels=row.names(placed))

scatterplot(salary~work_yrs,xlab="Work Experience",ylab="Salary",main="Work Experience - Salary distribution",labels=row.names(placed))

scatterplot(salary~satis,xlab="Satisfaction",ylab="Salary",main="Satisfaction - Salary distribution",labels=row.names(placed))

par(mfrow=c(1,1))

library(corrgram)

## Warning: package 'corrgram' was built under R version 3.4.3

corrgram(mba.df,order = TRUE,lower.panel = panel.shade,upper.panel = panel.pie,text.panel = panel.txt,main="Corrgram of MBA_Salaries dataset")

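From the corrgram, we can see a positive correlation between salary and both work experience and average MBA scores.
We can also see positive correlations between GMAT scores and MBA average scores, and between GMAT scores and satisfaction ratings.
Note that the corrgram is drawn from the full mba.df, in which salary and satis still contain the placeholder codes 998 and 999, so correlations involving these two columns should be read with caution.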
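Placeholder codes such as 998 can distort a correlation considerably. The following sketch, on invented numbers (`toy` is hypothetical), compares the correlation with and without such codes; it illustrates the effect and is not a re-analysis of the dataset.

```r
# Invented values; 998 stands in for a non-response code.
toy <- data.frame(
  gmat_tot = c(600, 620, 640, 660, 680, 700),
  satis    = c(5, 6, 998, 6, 7, 998)
)

# Correlation with the codes left in as if they were real ratings.
r_raw <- cor(toy$gmat_tot, toy$satis)

# Correlation after marking the codes as missing.
toy$satis[toy$satis == 998] <- NA
r_clean <- cor(toy$gmat_tot, toy$satis, use = "complete.obs")

c(raw = r_raw, clean = r_clean)
```

For a whole data frame, `cor(df, use = "pairwise.complete.obs")` applies the same idea to every pair of numeric columns.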

t.test(salary~sex,placed)

## 
##  Welch Two Sample t-test
## 
## data:  salary by sex
## t = -1.3628, df = 38.115, p-value = 0.1809
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -16021.72   3128.55
## sample estimates:
## mean in group F mean in group M 
##        98524.39       104970.97

t.test(gmat_tot~sex,placed)

## 
##  Welch Two Sample t-test
## 
## data:  gmat_tot by sex
## t = -0.1877, df = 51.483, p-value = 0.8518
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -25.14641  20.84534
## sample estimates:
## mean in group F mean in group M 
##        614.5161        616.6667

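From the t-tests, neither mean salary nor mean GMAT score differs significantly between males and females among placed candidates (p > 0.05), so we cannot conclude that males get higher GMAT scores or salaries than females.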
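To make the Welch test less of a black box, the sketch below recomputes its statistic and degrees of freedom by hand on simulated data and compares them with `t.test()`. The data frame `toy` and its group sizes, means and spreads are invented for illustration.

```r
set.seed(1)

# Simulated salaries for two groups of unequal size and spread.
toy <- data.frame(
  sex    = factor(rep(c("F", "M"), times = c(12, 30))),
  salary = c(rnorm(12, 98000, 12000), rnorm(30, 105000, 18000))
)

res <- t.test(salary ~ sex, data = toy)   # Welch test is the default

f <- toy$salary[toy$sex == "F"]
m <- toy$salary[toy$sex == "M"]

# Welch statistic: difference in means over the unpooled standard error.
a <- var(f) / length(f)
b <- var(m) / length(m)
t_manual  <- (mean(f) - mean(m)) / sqrt(a + b)

# Welch-Satterthwaite degrees of freedom.
df_manual <- (a + b)^2 / (a^2 / (length(f) - 1) + b^2 / (length(m) - 1))

c(t = t_manual, df = df_manual, p = res$p.value)
```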

WHO GOT HOW MUCH SALARY: We take the subset of the dataset consisting of only those people who actually got a job, and use it to frame the problem as y = f(x), where y = starting salary and x = the various factors it could depend on, for example the impact of {gender; first language; prior work experience; GMAT performance; MBA performance} in determining the starting salary.

fit1 <- lm(salary~sex+frstlang+work_yrs+gmat_tot+s_avg+f_avg)
summary(fit1)

## 
## Call:
## lm(formula = salary ~ sex + frstlang + work_yrs + gmat_tot + 
##     s_avg + f_avg)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32652  -8940  -1709   5186  83182 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   85685.28   22530.99   3.803 0.000251 ***
## sexM           5886.39    3462.79   1.700 0.092388 .  
## frstlangOther 15101.77    6473.46   2.333 0.021743 *  
## work_yrs       2201.83     579.92   3.797 0.000257 ***
## gmat_tot        -11.90      31.77  -0.375 0.708712    
## s_avg          4851.02    4986.79   0.973 0.333110    
## f_avg         -1153.74    3822.28  -0.302 0.763422    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15760 on 96 degrees of freedom
## Multiple R-squared:  0.2678, Adjusted R-squared:  0.2221 
## F-statistic: 5.853 on 6 and 96 DF,  p-value: 3.114e-05

fit1 explains 26.78% of the variance in salary (multiple R-squared = 0.2678). Here work_yrs and frstlangOther are the significant predictors (p-value < 0.05).

fit2 <- lm(salary~age+sex+frstlang+work_yrs+satis+s_avg+f_avg+gmat_tot)
summary(fit2)

## 
## Call:
## lm(formula = salary ~ age + sex + frstlang + work_yrs + satis + 
##     s_avg + f_avg + gmat_tot)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -28074  -9183  -1632   5602  79935 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   64846.39   33542.40   1.933   0.0562 .
## age            1650.02    1125.42   1.466   0.1459  
## sexM           5215.92    3507.28   1.487   0.1403  
## frstlangOther 11131.89    7126.07   1.562   0.1216  
## work_yrs        787.18    1146.15   0.687   0.4939  
## satis         -2237.86    2037.81  -1.098   0.2749  
## s_avg          3083.80    5058.12   0.610   0.5435  
## f_avg          -655.40    3825.50  -0.171   0.8643  
## gmat_tot        -12.40      31.89  -0.389   0.6982  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15670 on 94 degrees of freedom
## Multiple R-squared:  0.2914, Adjusted R-squared:  0.231 
## F-statistic: 4.831 on 8 and 94 DF,  p-value: 5.285e-05

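fit2 explains 29.14% of the variance in salary (multiple R-squared = 0.2914), and the overall F-test is significant (p-value < 0.05), although no individual predictor is significant at the 0.05 level.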

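Hence, fit2 fits ‘starting salary’ marginally better than fit1: its adjusted R-squared is slightly higher (0.231 vs 0.2221). Multiple R-squared alone cannot settle this, because it never decreases when predictors are added.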
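When one model's predictors are a subset of the other's, the comparison can be made more carefully than by multiple R-squared alone. The sketch below uses simulated data (`toy`, `small` and `large` are hypothetical names, and the generating numbers are invented) to show adjusted R-squared, a partial F-test and AIC side by side.

```r
set.seed(42)
n <- 100

# Simulated data: salary depends on work_yrs only; age tracks work_yrs.
toy <- data.frame(work_yrs = rpois(n, 3),
                  gmat_tot = round(rnorm(n, 620, 55)))
toy$age    <- 23 + toy$work_yrs + rpois(n, 1)
toy$salary <- 90000 + 2500 * toy$work_yrs + rnorm(n, 0, 15000)

small <- lm(salary ~ work_yrs + gmat_tot, data = toy)
large <- lm(salary ~ work_yrs + gmat_tot + age, data = toy)

# Multiple R-squared cannot go down when a predictor is added ...
r2 <- c(small = summary(small)$r.squared, large = summary(large)$r.squared)

# ... so adjusted R-squared, a partial F-test and AIC are more informative.
adj <- c(small = summary(small)$adj.r.squared,
         large = summary(large)$adj.r.squared)

r2
adj
anova(small, large)
AIC(small, large)
```

In this simulation `age` is built from `work_yrs`, which is one plausible reason (an assumption, not something checked in this report) why adding an age-like variable can inflate standard errors without improving the fit much.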

COMPARING THOSE GROUPS WHO GOT A JOB WITH THOSE WHO DIDN’T:

# Contingency Tables.
mytable <- xtabs(~sex+salary,notplaced)
margin.table(mytable,1)

## sex
##  F  M 
## 23 67

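Hence, in absolute terms more males (67) than females (23) are unemployed. Since about 75% of the class is male, this split (67/90 ≈ 74% male) is roughly in line with the class composition rather than evidence that males are placed less often.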
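To put the counts above in context, they can be compared with the overall gender mix of the class. The sketch below takes the not-placed counts from the table above and assumes a class share of 25% female and 75% male, which is only the approximate figure suggested by the descriptive statistics; `not_placed`, `class_share` and `gof` are hypothetical names.

```r
# Counts from the contingency table above.
not_placed <- c(F = 23, M = 67)

# Assumed (approximate) gender mix of the whole class.
class_share <- c(F = 0.25, M = 0.75)

# Share of each gender among students who were not placed.
np_share <- prop.table(not_placed)

# Goodness-of-fit test: do the not-placed counts follow the class mix?
gof <- chisq.test(not_placed, p = class_share)

np_share
gof
```

A fuller comparison would cross-tabulate placement status against sex for all students, which needs the placed counts by gender as well.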

Pavani Koduri

December 27, 2017