Use the data set MBA Salaries.
1. Analysis of the data (summary statistics, plots, etc.)
2. Visualization of the data
3. t-tests and chi-square tests on the data
4. Identify which columns/features affect an MBA student's salary
5. Predict salaries for some dummy input values.
The data is provided by Harvard Business School.
Size: 11 KB; 274 obs. of 13 variables.
Attributes:
Field       Description
age         age, in years
sex         1 = Male; 2 = Female
gmat_tot    total GMAT score
gmat_qpc    quantitative GMAT percentile
gmat_vpc    verbal GMAT percentile
gmat_tpc    overall GMAT percentile
s_avg       spring MBA average
f_avg       fall MBA average
quarter     quartile ranking (1st is top, 4th is bottom)
work_yrs    years of work experience
frstlang    first language (1 = English; 2 = other)
salary      starting salary
satis       degree of satisfaction with MBA program (1 = low, 7 = high satisfaction)
Missing salary data are coded as follows:
998 = did not answer the survey
999 = answered the survey but did not disclose salary data
Size of data set: 274 records
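Before any modelling, these coded values have to be treated as missing. A minimal sketch (assuming the data frame has already been read in as dfrModel, as in the Dataset section below; the analysis itself simply filters these rows out later with subset()):
# Recode the survey non-response codes, and the zero salaries that also appear
# in the file, as NA. Illustrative only; the report filters these rows instead.
dfrModel$salary[dfrModel$salary %in% c(0, 998, 999)] <- NA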
Setup
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(corrgram)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(vcd)
## Loading required package: grid
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
library(corrplot)
Functions
# detect_outliers: count the values in a numeric vector that fall outside
# 1.5 * IQR from the first/third quartiles (the usual box-plot rule).
detect_outliers <- function(inp, na.rm=TRUE) {
  i.qnt <- quantile(inp, probs=c(.25, .75), na.rm=na.rm)
  i.max <- 1.5 * IQR(inp, na.rm=na.rm)
  otp <- inp
  otp[inp < (i.qnt[1] - i.max)] <- NA
  otp[inp > (i.qnt[2] + i.max)] <- NA
  sum(is.na(otp))
}
# Non_outliers: return a copy of x with values outside 1.5 * IQR set to NA.
Non_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
# Remove_Outliers: wrap Non_outliers and return the input vector with the
# outlier positions replaced by NA (via a match back to the non-outlier values).
Remove_Outliers <- function(z, na.rm = TRUE) {
  Out <- Non_outliers(z)
  Out <- as.data.frame(Out)
  z <- Out$Out[match(z, Out$Out)]
  z
}
# Graph_Boxplot: draw a box plot of a single column (passed in as a vector)
# so outliers are easy to spot. Note that it refers to the global data frame
# dfrModel for the ggplot data argument.
Graph_Boxplot <- function(input, na.rm = TRUE) {
  Plot <- ggplot(dfrModel, aes(x="", y=input)) +
    geom_boxplot(aes(fill=input), color="green") +
    labs(title="Outliers")
  Plot
}
Dataset
setwd("D:/Welingkar/My/IL/Project/MBA Salaries/Data")
dfrModel <- read.csv("./MBA Starting Salaries Data.csv", header=T, stringsAsFactors=F)
intRowCount <- nrow(dfrModel)
head(dfrModel)
## age sex gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg quarter work_yrs
## 1 23 2 620 77 87 87 3.4 3.00 1 2
## 2 24 1 610 90 71 87 3.5 4.00 1 2
## 3 24 1 670 99 78 95 3.3 3.25 1 2
## 4 24 1 570 56 81 75 3.3 2.67 1 1
## 5 24 2 710 93 98 98 3.6 3.75 1 2
## 6 24 1 640 82 89 91 3.9 3.75 1 2
## frstlang salary satis
## 1 1 0 7
## 2 1 0 6
## 3 1 0 6
## 4 1 0 7
## 5 1 999 5
## 6 1 0 6
Observation
1. There are intRowCount = 274 data records in the file.
Summary
#summary(dfrModel)
lapply(dfrModel, FUN=describe)
## $age
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 274 27.36 3.71 27 26.76 2.97 22 48 26 2.16 6.45
## se
## X1 0.22
##
## $sex
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 274 1.25 0.43 1 1.19 0 1 2 1 1.16 -0.66 0.03
##
## $gmat_tot
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 274 619.45 57.54 620 618.86 59.3 450 790 340 -0.01 0.06
## se
## X1 3.48
##
## $gmat_qpc
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 274 80.64 14.87 83 82.31 14.83 28 99 71 -0.92 0.3
## se
## X1 0.9
##
## $gmat_vpc
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 274 78.32 16.86 81 80.33 14.83 16 99 83 -1.04 0.74
## se
## X1 1.02
##
## $gmat_tpc
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 274 84.2 14.02 87 86.12 11.86 0 99 99 -2.28 9.02
## se
## X1 0.85
##
## $s_avg
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 274 3.03 0.38 3 3.03 0.44 2 4 2 -0.06 -0.38
## se
## X1 0.02
##
## $f_avg
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 274 3.06 0.53 3 3.09 0.37 0 4 4 -2.08 10.85
## se
## X1 0.03
##
## $quarter
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 274 2.48 1.11 2 2.47 1.48 1 4 3 0.02 -1.35 0.07
##
## $work_yrs
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 274 3.87 3.23 3 3.29 1.48 0 22 22 2.78 9.8 0.2
##
## $frstlang
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 274 1.12 0.32 1 1.02 0 1 2 1 2.37 3.65 0.02
##
## $salary
## vars n mean sd median trimmed mad min max range
## X1 1 274 39025.69 50951.56 999 33607.86 1481.12 0 220000 220000
## skew kurtosis se
## X1 0.7 -1.05 3078.1
##
## $satis
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 274 172.18 371.61 6 91.5 1.48 1 998 997 1.77 1.13
## se
## X1 22.45
Box Plot
lapply(dfrModel, FUN=Graph_Boxplot)
## (One box plot is produced per column: age, sex, gmat_tot, gmat_qpc, gmat_vpc, gmat_tpc, s_avg, f_avg, quarter, work_yrs, frstlang, salary, satis.)
Observation
Several of the predictor variables (sex, frstlang, quarter, satis) are categorical variables coded as numbers. These should be converted to factor variables, for example as shown below.
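A minimal sketch of that conversion, using the codings from the attribute list (the analysis below keeps the original numeric coding, so this is illustrative only):
# Convert the numerically coded categorical columns to labelled factors.
dfrFactors <- dfrModel
dfrFactors$sex      <- factor(dfrFactors$sex, levels=c(1, 2), labels=c("Male", "Female"))
dfrFactors$frstlang <- factor(dfrFactors$frstlang, levels=c(1, 2), labels=c("English", "Other"))
dfrFactors$quarter  <- factor(dfrFactors$quarter, levels=1:4, ordered=TRUE)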
Scatter Plot
par(mfrow=c(2, 2))
plot(y=dfrModel$salary, x=dfrModel$age,
col="blue",
ylim=c(0, 250000), xlim=c(20, 50),
main="Relationship Btw Age & Salary",
ylab="Salary", xlab="Age (Yrs)")
plot(y=dfrModel$salary, x=dfrModel$gmat_tot,
col="green",
ylim=c(0, 250000), xlim=c(400, 800),
main="Relationship Btw GMAT Total & Salary",
ylab="Salary", xlab="GMAT Total")
plot(y=dfrModel$salary, x=dfrModel$work_yrs,
col="green",
ylim=c(0, 250000), xlim=c(0, 30),
main="Relationship Btw Work Exp & Salary",
ylab="Salary", xlab="Work Exp")
plot(jitter(dfrModel$satis),jitter(dfrModel$salary),
col="blue",
xlim=c(0, 7), ylim=c(0, 250000),
main="Relationship Btw Satisfaction & Salary",
xlab="Satisfaction", ylab="Salary")
Observations
1. Some of the plotted points are not genuine salaries but the coded missing values (0, 998, 999), which cluster near the bottom of each plot.
2. For the rest of the data there is some relationship between salary and the other variables.
Correlation Plot
#pairs(dfrModel)
corrplot(corr=cor(dfrModel[ , c(1:13)], use="complete.obs"),
method ="ellipse")
Subset of Data: Students Who Got a Job (Disclosed Salary)
dfrMode_Job <- subset(dfrModel, !(dfrModel$salary %in% c(0, 998, 999)))
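As a quick check of the subset size (the regression later in this report shows 98 residual degrees of freedom with 5 estimated coefficients, implying 103 records):
nrow(dfrMode_Job)   # expected to be 103 rows with a disclosed salary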
Outliers
lapply(dfrMode_Job, FUN=detect_outliers)
## $age
## [1] 5
##
## $sex
## [1] 0
##
## $gmat_tot
## [1] 0
##
## $gmat_qpc
## [1] 2
##
## $gmat_vpc
## [1] 3
##
## $gmat_tpc
## [1] 2
##
## $s_avg
## [1] 0
##
## $f_avg
## [1] 3
##
## $quarter
## [1] 0
##
## $work_yrs
## [1] 9
##
## $frstlang
## [1] 7
##
## $salary
## [1] 9
##
## $satis
## [1] 1
Outlier Observations
1. There are outliers in the data; however, we keep them because the number of data records is already small. A sketch of how they could be removed anyway is shown below.
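A minimal sketch of dropping the outliers instead, using the Non_outliers / Remove_Outliers helpers defined in the Functions section (illustrative only; it is not applied in the analysis below):
# Replace outlying values in every column with NA, then keep only complete rows.
dfrNoOut <- as.data.frame(lapply(dfrMode_Job, Remove_Outliers))
dfrNoOut <- dfrNoOut[complete.cases(dfrNoOut), ]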
T-test
Null Hypothesis: There is no difference between the means of gmat_qpc and gmat_tot.
Alternative Hypothesis: There is a difference between the means of gmat_qpc and gmat_tot.
# Note: gmat_tot is taken from the job subset, gmat_qpc from the full data set.
t.test(dfrMode_Job$gmat_tot, dfrModel$gmat_qpc)
##
## Welch Two Sample t-test
##
## data: dfrMode_Job$gmat_tot and dfrModel$gmat_qpc
## t = 105.5, df = 108.66, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 525.3191 545.4351
## sample estimates:
## mean of x mean of y
## 616.01942 80.64234
T-test Observations
1. The p-value is far below 0.05 (at the 95% confidence level), so we reject the null hypothesis in favour of the alternative.
2. There is therefore a difference between the means of gmat_qpc and gmat_tot.
3. The mean of gmat_tot is 616.02.
4. The mean of gmat_qpc is 80.64.
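The project brief (item 3 above) also calls for a chi-square test. A minimal sketch, with sex versus quartile ranking chosen purely for illustration, testing whether quartile ranking is independent of sex:
# Chi-square test of independence between sex and quartile ranking.
# The choice of variables is an illustrative assumption, not from the original report.
tblSexQuarter <- table(dfrModel$sex, dfrModel$quarter)
chisq.test(tblSexQuarter)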
Correlation with Salary
# Correlation of every column with salary, within the job subset.
vctCorr <- numeric(0)
for (i in names(dfrMode_Job)) {
  cor.result <- cor(dfrMode_Job$salary, as.numeric(dfrMode_Job[, i]))
  vctCorr <- c(vctCorr, cor.result)
}
dfrCorr <- vctCorr
names(dfrCorr) <- names(dfrMode_Job)
dfrCorr
## age sex gmat_tot gmat_qpc gmat_vpc gmat_tpc
## 0.49964284 -0.16628869 -0.09067141 0.01414130 -0.13743230 -0.13201783
## s_avg f_avg quarter work_yrs frstlang salary
## 0.10173175 -0.10603897 -0.12848526 0.45466634 0.26701953 1.00000000
## satis
## -0.04005060
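The same vector of correlations can be computed in a single call (a sketch that should reproduce the numbers above):
# One-line equivalent of the loop: correlate every column with salary.
sapply(dfrMode_Job, function(col) cor(dfrMode_Job$salary, as.numeric(col)))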
Correlation Matrix
cor(dfrMode_Job[, c(1:13)])
## age sex gmat_tot gmat_qpc gmat_vpc
## age 1.00000000 -0.14352927 -0.07871678 -0.165039057 0.01799420
## sex -0.14352927 1.00000000 -0.01955548 -0.147099027 0.05341428
## gmat_tot -0.07871678 -0.01955548 1.00000000 0.666382266 0.78038546
## gmat_qpc -0.16503906 -0.14709903 0.66638227 1.000000000 0.09466541
## gmat_vpc 0.01799420 0.05341428 0.78038546 0.094665411 1.00000000
## gmat_tpc -0.09609156 -0.04686981 0.96680810 0.658650025 0.78443167
## s_avg 0.15654954 0.08079985 0.17198874 0.015471662 0.15865101
## f_avg -0.21699191 0.16572186 0.12246257 0.098418869 0.02290167
## quarter -0.12568145 -0.02139041 -0.10578964 0.012648346 -0.12862079
## work_yrs 0.88052470 -0.09233003 -0.12280018 -0.182701263 -0.02812182
## frstlang 0.35026743 0.07512009 -0.13164323 0.014198516 -0.21835333
## salary 0.49964284 -0.16628869 -0.09067141 0.014141299 -0.13743230
## satis 0.10832308 -0.09199534 0.06474206 -0.003984632 0.14863481
## gmat_tpc s_avg f_avg quarter work_yrs
## age -0.09609156 0.15654954 -0.21699191 -0.12568145 0.88052470
## sex -0.04686981 0.08079985 0.16572186 -0.02139041 -0.09233003
## gmat_tot 0.96680810 0.17198874 0.12246257 -0.10578964 -0.12280018
## gmat_qpc 0.65865003 0.01547166 0.09841887 0.01264835 -0.18270126
## gmat_vpc 0.78443167 0.15865101 0.02290167 -0.12862079 -0.02812182
## gmat_tpc 1.00000000 0.13938500 0.07051391 -0.09955033 -0.13246963
## s_avg 0.13938500 1.00000000 0.44590413 -0.84038355 0.16328236
## f_avg 0.07051391 0.44590413 1.00000000 -0.43144819 -0.21633018
## quarter -0.09955033 -0.84038355 -0.43144819 1.00000000 -0.12896722
## work_yrs -0.13246963 0.16328236 -0.21633018 -0.12896722 1.00000000
## frstlang -0.16437561 -0.13788905 -0.05061394 0.10955726 0.19627277
## salary -0.13201783 0.10173175 -0.10603897 -0.12848526 0.45466634
## satis 0.11630842 -0.14356557 -0.11773304 0.22511985 0.06299926
## frstlang salary satis
## age 0.35026743 0.49964284 0.108323083
## sex 0.07512009 -0.16628869 -0.091995338
## gmat_tot -0.13164323 -0.09067141 0.064742057
## gmat_qpc 0.01419852 0.01414130 -0.003984632
## gmat_vpc -0.21835333 -0.13743230 0.148634805
## gmat_tpc -0.16437561 -0.13201783 0.116308417
## s_avg -0.13788905 0.10173175 -0.143565573
## f_avg -0.05061394 -0.10603897 -0.117733043
## quarter 0.10955726 -0.12848526 0.225119851
## work_yrs 0.19627277 0.45466634 0.062999256
## frstlang 1.00000000 0.26701953 0.089834769
## salary 0.26701953 1.00000000 -0.040050600
## satis 0.08983477 -0.04005060 1.000000000
Data Cleaning
Based on the correlation matrix above, gmat_tot is dropped because it is almost perfectly correlated with gmat_tpc (0.97), quarter because it is strongly correlated with s_avg (-0.84), and satis because its correlation with salary is negligible (-0.04).
dfrMode_Job <- select(dfrMode_Job, -c(gmat_tot, quarter, satis))
head(dfrMode_Job)
## age sex gmat_qpc gmat_vpc gmat_tpc s_avg f_avg work_yrs frstlang salary
## 35 22 2 90 92 94 3.5 3.75 1 1 85000
## 36 27 2 94 98 98 3.3 3.25 2 1 85000
## 37 25 2 87 96 96 3.5 2.67 2 1 86000
## 38 25 2 82 91 93 3.4 3.25 3 1 88000
## 39 27 1 96 96 98 3.3 3.50 2 1 92000
## 40 28 2 52 98 87 3.4 3.75 5 1 93000
Visualize
dfrGraph <- gather(dfrMode_Job, variable, value, -salary)
head(dfrGraph)
## salary variable value
## 1 85000 age 22
## 2 85000 age 27
## 3 86000 age 25
## 4 88000 age 25
## 5 92000 age 27
## 6 93000 age 28
ggplot(dfrGraph) +
geom_jitter(aes(value,salary, colour=variable)) +
geom_smooth(aes(value,salary, colour=variable), method=lm, se=FALSE) +
facet_wrap(~variable, scales="free_x") +
labs(title="Relation Of salary With Other Features")
Observation
Each of the retained features shows some relationship with salary.
Find the Best Multiple Linear Model for MBA Salary
Choose the best linear model using step(), which selects a model by AIC in a stepwise algorithm.
In statistics, stepwise regression is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of explanatory variables based on some prespecified criterion.
The Akaike information criterion (AIC) is a measure of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Hence, AIC provides a means for model selection.
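Formally, for a model with k estimated parameters and maximized likelihood L, AIC = 2k - 2 ln(L); among the candidate models, the one with the lowest AIC is preferred.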
#?step()
stpModel=step(lm(data=dfrMode_Job, salary~.), trace=0, steps=100)
stpSummary <- summary(stpModel)
stpSummary
##
## Call:
## lm(formula = salary ~ age + gmat_qpc + gmat_vpc + gmat_tpc, data = dfrMode_Job)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29373 -8011 280 5705 67116
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39383.0 18467.6 2.133 0.0355 *
## age 2791.1 464.8 6.005 3.25e-08 ***
## gmat_qpc 822.3 347.9 2.363 0.0201 *
## gmat_vpc 513.2 350.7 1.463 0.1467
## gmat_tpc -1383.8 680.4 -2.034 0.0447 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15110 on 98 degrees of freedom
## Multiple R-squared: 0.3129, Adjusted R-squared: 0.2848
## F-statistic: 11.16 on 4 and 98 DF, p-value: 1.691e-07
Observations on the MBA Salaries Model
We tried different models, but the R-squared value is low for all of them, because the number of observations is small and the correlations between salary and the other variables are weak.
For academic purposes we still proceed with this model to predict the salaries.
1. salary ~ age + gmat_qpc + gmat_vpc + gmat_tpc
2. In this model, salary depends on the four variables listed above.
3. The adjusted R-squared is about 0.2848, which is low, but we still proceed with the model.
4. The overall p-value is less than 0.05, so we reject the null hypothesis; the four variables above jointly affect salary.
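Since the GMAT percentile columns are strongly correlated with each other, it is worth checking multicollinearity and the usual residual diagnostics before trusting the individual coefficients. A minimal sketch using the car package already loaded in Setup (not run in this report):
# Variance inflation factors for the stepwise model; values well above 5-10
# would flag problematic multicollinearity among the GMAT percentiles.
vif(stpModel)
# Standard residual diagnostic plots for the fitted model.
par(mfrow=c(2, 2))
plot(stpModel)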
Make the Final Multiple Linear Model
# Rebuild the stepwise model explicitly with renamed predictors:
# x1 = age, x3 = gmat_qpc, x4 = gmat_vpc, x5 = gmat_tpc.
x1 <- dfrMode_Job$age
x3 <- dfrMode_Job$gmat_qpc
x4 <- dfrMode_Job$gmat_vpc
x5 <- dfrMode_Job$gmat_tpc
y <- dfrMode_Job$salary
slmModel1 <- lm(y~x1+x3+x4+x5, data=dfrMode_Job)
summary(slmModel1)
##
## Call:
## lm(formula = y ~ x1 + x3 + x4 + x5, data = dfrMode_Job)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29373 -8011 280 5705 67116
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39383.0 18467.6 2.133 0.0355 *
## x1 2791.1 464.8 6.005 3.25e-08 ***
## x3 822.3 347.9 2.363 0.0201 *
## x4 513.2 350.7 1.463 0.1467
## x5 -1383.8 680.4 -2.034 0.0447 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15110 on 98 degrees of freedom
## Multiple R-squared: 0.3129, Adjusted R-squared: 0.2848
## F-statistic: 11.16 on 4 and 98 DF, p-value: 1.691e-07
Observation
No errors. Model successfully created.
Show Model
# print summary
summary(slmModel1)
##
## Call:
## lm(formula = y ~ x1 + x3 + x4 + x5, data = dfrMode_Job)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29373 -8011 280 5705 67116
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39383.0 18467.6 2.133 0.0355 *
## x1 2791.1 464.8 6.005 3.25e-08 ***
## x3 822.3 347.9 2.363 0.0201 *
## x4 513.2 350.7 1.463 0.1467
## x5 -1383.8 680.4 -2.034 0.0447 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15110 on 98 degrees of freedom
## Multiple R-squared: 0.3129, Adjusted R-squared: 0.2848
## F-statistic: 11.16 on 4 and 98 DF, p-value: 1.691e-07
Test Data
# Dummy input values: x1 = age 25, x3 = gmat_qpc 79, x4 = gmat_vpc 93, x5 = gmat_tpc 93.
dfrtest <- data.frame(x1=c(25), x3=c(79), x4=c(93), x5=c(93))
dfrtest
## x1 x3 x4 x5
## 1 25 79 93 93
Observation
Test Data successfully created.
Predict
result <- predict(slmModel1, dfrtest)
print(result)
## 1
## 93152.76
Observation
1. The prediction works, but it deviates somewhat from the true value because the model's R-squared is low; the true salary for this record is 96000.
2. The predicted salary (about 93153) is therefore reasonably close to the true value.
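To make the uncertainty of such a prediction explicit, a prediction interval can be requested (a sketch using the same test record):
# Point prediction plus a 95% prediction interval for the test record.
predict(slmModel1, dfrtest, interval="prediction", level=0.95)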
###########End of the Project#########