Analysis of MBA Salaries
NAME: Monesh Kumar Sharma
EMAIL: monesh.sharma23992@gmail.com
COLLEGE / COMPANY: Welingkar Institute of Management, Mumbai

Problem Definition
Use MBA Starting Salaries Data.csv to create a logistic model ??? Draw Contingency Tables as appropriate ??? Run Chi-Square test ??? Run a Logistic Regression

Data Location

Data is already provided by Harward Business school.

Data Description

Size: 11KB 274 obs. of 13 variables:

Attributes:
Field Description age age - in years
sex 1=Male; 2=Female
gmat_tot total GMAT score
gmat_qpc quantitative GMAT percentile
gmat_vpc verbal GMAT percentile
qmat_tpc overall GMAT percentile
s_avg spring MBA average
f_avg fall MBA average
quarter quartile ranking (1st is top, 4th is bottom)
work_yrs years of work experience
frstlang first language (1=English; 2=other)
salary starting salary
satis degree of satisfaction with MBA program (1= low, 7 = high satisfaction)

Missing salary and data are coded as follows:
998 = did not answer the survey
999 = answered the survey but did not disclose salary data
Size of data set: 274 records

Assumption
In Outcome Variable of data, 1 is taken as the students who got the job while 0 means who did not get the job.

Setup

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(corrgram)
library(gridExtra) 
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(Deducer)
## Loading required package: JGR
## Loading required package: rJava
## Loading required package: JavaGD
## Loading required package: iplots
## 
## Please type JGR() to launch console. Platform specific launchers (.exe and .app) can also be obtained at http://www.rforge.net/JGR/files/.
## Loading required package: car
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## 
## Note Non-JGR console detected:
##  Deducer is best used from within JGR (http://jgr.markushelbig.org/).
##  To Bring up GUI dialogs, type deducer().
library(caret)
## Loading required package: lattice
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
library(vcd)
## Loading required package: grid
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:car':
## 
##     logit
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

Functions

detect_outliers <- function(inp, na.rm=TRUE) {
  i.qnt <- quantile(inp, probs=c(.25, .75), na.rm=na.rm)
  i.max <- 2.5 * IQR(inp, na.rm=na.rm)
  otp <- inp
  otp[inp < (i.qnt[1] - i.max)] <- NA
  otp[inp > (i.qnt[2] + i.max)] <- NA
  inp[is.na(otp)]
}

detect_na <- function(inp) {
  sum(is.na(inp))
}

Graph_Boxplot <- function (input, na.rm = TRUE){
Plot <- ggplot(dfrModel, aes(x="", y=input)) +
            geom_boxplot(aes(fill=input), color="green") +
            labs(title="Outliers")
Plot
}

Dataset

setwd("D:/Welingkar/My/IL/Project/MBA Salaries/Data")
dfrModel <- read.csv("./MBA Starting Salaries Data.csv", header=T, stringsAsFactors=F)
intRowCount <- nrow(dfrModel)
head(dfrModel)
##   age sex gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg quarter work_yrs
## 1  23   2      620       77       87       87   3.4  3.00       1        2
## 2  24   1      610       90       71       87   3.5  4.00       1        2
## 3  24   1      670       99       78       95   3.3  3.25       1        2
## 4  24   1      570       56       81       75   3.3  2.67       1        1
## 5  24   2      710       93       98       98   3.6  3.75       1        2
## 6  24   1      640       82       89       91   3.9  3.75       1        2
##   frstlang salary satis
## 1        1      0     7
## 2        1      0     6
## 3        1      0     6
## 4        1      0     7
## 5        1    999     5
## 6        1      0     6

Observation
There is Numeric data only
As no Alphanumeric Column so no need to do any changes here or no need to drop any column as of now.

Outliers Graph

lapply(dfrModel, FUN=Graph_Boxplot)
## $age

## 
## $sex

## 
## $gmat_tot

## 
## $gmat_qpc

## 
## $gmat_vpc

## 
## $gmat_tpc

## 
## $s_avg

## 
## $f_avg

## 
## $quarter

## 
## $work_yrs

## 
## $frstlang

## 
## $salary

## 
## $satis

Observations
There are few outliers.
We will go with Outliers

Subset of Data except Salary Code 998 and 999

Subset of Data who got Job

dfrModel <- subset(dfrModel, !(dfrModel$salary %in% c(998, 999)))

Outliers Data

#detect_outliers(dfrModel$Age)
lapply(dfrModel, FUN=detect_outliers)
## $age
## [1] 42 48 40 43 43 40
## 
## $sex
## integer(0)
## 
## $gmat_tot
## integer(0)
## 
## $gmat_qpc
## integer(0)
## 
## $gmat_vpc
## integer(0)
## 
## $gmat_tpc
## [1] 0
## 
## $s_avg
## numeric(0)
## 
## $f_avg
## [1] 0 0
## 
## $quarter
## integer(0)
## 
## $work_yrs
## [1] 13 22 16 15 18 16 16 22 15
## 
## $frstlang
##  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 
## $salary
## integer(0)
## 
## $satis
## integer(0)

Observations
As there are less outliers so we will go with outliers

New Column for Job

dfrModel$Job <- ifelse(dfrModel$salary == 0, 0, 1)
dfrModel$salary <- NULL
head(dfrModel)
##   age sex gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg quarter work_yrs
## 1  23   2      620       77       87       87   3.4  3.00       1        2
## 2  24   1      610       90       71       87   3.5  4.00       1        2
## 3  24   1      670       99       78       95   3.3  3.25       1        2
## 4  24   1      570       56       81       75   3.3  2.67       1        1
## 6  24   1      640       82       89       91   3.9  3.75       1        2
## 7  25   1      610       89       74       87   3.4  3.50       1        2
##   frstlang satis Job
## 1        1     7   0
## 2        1     6   0
## 3        1     6   0
## 4        1     7   0
## 6        1     6   0
## 7        1     5   0

Observations
A new column has been which contains the values as 1 = Who got the Job 0 = Who did not get the job

Missing Data

lapply(dfrModel, FUN=detect_na)
## $age
## [1] 0
## 
## $sex
## [1] 0
## 
## $gmat_tot
## [1] 0
## 
## $gmat_qpc
## [1] 0
## 
## $gmat_vpc
## [1] 0
## 
## $gmat_tpc
## [1] 0
## 
## $s_avg
## [1] 0
## 
## $f_avg
## [1] 0
## 
## $quarter
## [1] 0
## 
## $work_yrs
## [1] 0
## 
## $frstlang
## [1] 0
## 
## $satis
## [1] 0
## 
## $Job
## [1] 0

Observation
There is no data in the dataset with values as NA

Summary

#summary(dfrModel)
lapply(dfrModel, FUN=describe)
## $age
##    vars   n  mean   sd median trimmed  mad min max range skew kurtosis  se
## X1    1 193 27.59 4.22     27   26.86 2.97  22  48    26 1.93     4.55 0.3
## 
## $sex
##    vars   n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 193 1.28 0.45      1    1.23   0   1   2     1 0.97    -1.06 0.03
## 
## $gmat_tot
##    vars   n   mean    sd median trimmed  mad min max range skew kurtosis
## X1    1 193 615.23 56.54    610  614.19 59.3 450 760   310 0.08    -0.31
##      se
## X1 4.07
## 
## $gmat_qpc
##    vars   n  mean    sd median trimmed   mad min max range  skew kurtosis
## X1    1 193 79.35 15.15     82   80.92 14.83  28  99    71 -0.88     0.23
##      se
## X1 1.09
## 
## $gmat_vpc
##    vars   n  mean   sd median trimmed   mad min max range skew kurtosis
## X1    1 193 78.13 16.1     81   79.87 14.83  22  99    77 -0.9     0.36
##      se
## X1 1.16
## 
## $gmat_tpc
##    vars   n  mean    sd median trimmed   mad min max range  skew kurtosis
## X1    1 193 83.48 13.53     87   85.08 11.86   0  99    99 -1.87     7.03
##      se
## X1 0.97
## 
## $s_avg
##    vars   n mean   sd median trimmed  mad min max range  skew kurtosis
## X1    1 193 3.06 0.38   3.09    3.08 0.43   2   4     2 -0.27    -0.15
##      se
## X1 0.03
## 
## $f_avg
##    vars   n mean   sd median trimmed  mad min max range  skew kurtosis
## X1    1 193 3.08 0.52      3    3.11 0.37   0   4     4 -2.17    11.03
##      se
## X1 0.04
## 
## $quarter
##    vars   n mean  sd median trimmed  mad min max range skew kurtosis   se
## X1    1 193 2.39 1.1      2    2.37 1.48   1   4     3 0.13    -1.32 0.08
## 
## $work_yrs
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 193  4.1 3.69      3    3.37 1.48   0  22    22 2.47     7.02 0.27
## 
## $frstlang
##    vars   n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 193 1.08 0.27      1       1   0   1   2     1 3.13     7.84 0.02
## 
## $satis
##    vars   n mean   sd median trimmed  mad min max range  skew kurtosis
## X1    1 193 5.76 0.77      6    5.75 1.48   3   7     4 -0.17    -0.06
##      se
## X1 0.06
## 
## $Job
##    vars   n mean  sd median trimmed mad min max range  skew kurtosis   se
## X1    1 193 0.53 0.5      1    0.54   0   0   1     1 -0.13    -1.99 0.04

Contingency Tables

No od Student Got job

table(dfrModel$Job)
## 
##   0   1 
##  90 103

Data Distribution

mytable <- xtabs(~ Job+sex+quarter, data=dfrModel)

ftable(mytable)
##         quarter  1  2  3  4
## Job sex                    
## 0   1           11 21 16 19
##     2            7  6  7  3
## 1   1           23 19 17 13
##     2           12  6  7  6
margin.table(mytable, 1)
## Job
##   0   1 
##  90 103
margin.table(mytable, 2)
## sex
##   1   2 
## 139  54
margin.table(mytable, 3)
## quarter
##  1  2  3  4 
## 53 52 47 41

Correlation

vctCorr = numeric(0)
for (i in names(dfrModel)){
    cor.result <- cor(as.numeric(dfrModel$Job), as.numeric(dfrModel[,i]))
    vctCorr <- c(vctCorr, cor.result)
}
dfrCorr <- vctCorr
names(dfrCorr) <- names(dfrModel)
dfrCorr
##         age         sex    gmat_tot    gmat_qpc    gmat_vpc    gmat_tpc 
## -0.20569719  0.05047054  0.01491495  0.02698202  0.02888098  0.08264631 
##       s_avg       f_avg     quarter    work_yrs    frstlang       satis 
##  0.08063913  0.02746051 -0.12788216 -0.12330395 -0.03899476  0.16882557 
##         Job 
##  1.00000000

Data For Visualization

dfrGraph <- gather(dfrModel, variable, value, -Job)
head(dfrGraph)
##   Job variable value
## 1   0      age    23
## 2   0      age    24
## 3   0      age    24
## 4   0      age    24
## 5   0      age    24
## 6   0      age    25

Data Visualization

ggplot(dfrGraph) +
    geom_jitter(aes(value,Job, colour=variable)) + 
    facet_wrap(~variable, scales="free_x") +
    labs(title="Relation Of Diabetes [atient] With Other Features")

Observation
As per graph, There is some correlation between Job variable and other variables.

Find Best Multi Logistic Model
Choose the best logistic model by using step().

stpModel=step(glm(data=dfrModel, formula=Job~., family=binomial), trace=0, steps=100)
summary(stpModel)
## 
## Call:
## glm(formula = Job ~ age + gmat_tot + gmat_tpc + quarter + satis, 
##     family = binomial, data = dfrModel)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9491  -1.1229   0.7702   1.0287   2.1052  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  5.154571   3.053403   1.688  0.09138 . 
## age         -0.112199   0.040810  -2.749  0.00597 **
## gmat_tot    -0.015176   0.009298  -1.632  0.10266   
## gmat_tpc     0.067617   0.044113   1.533  0.12532   
## quarter     -0.296904   0.143264  -2.072  0.03823 * 
## satis        0.426628   0.203977   2.092  0.03648 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 266.68  on 192  degrees of freedom
## Residual deviance: 244.99  on 187  degrees of freedom
## AIC: 256.99
## 
## Number of Fisher Scoring iterations: 4

Observation
Best results given by Job ~ age + gmat_tot + gmat_tpc + quarter + satis
The degrees of Freedom for Null Variance is 193-1=192 when only outcome is considered so only one parameter
While for Reidual Deviance is 193-6=687 As no of parameters are 6 for residual deviance.

Make Final Multi Linear Model

# make model
mgmModel <- glm(data=dfrModel, formula=Job ~ age + gmat_tot + gmat_tpc + quarter + satis, family=binomial(link="logit"))
# print summary
summary(mgmModel)
## 
## Call:
## glm(formula = Job ~ age + gmat_tot + gmat_tpc + quarter + satis, 
##     family = binomial(link = "logit"), data = dfrModel)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9491  -1.1229   0.7702   1.0287   2.1052  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  5.154571   3.053403   1.688  0.09138 . 
## age         -0.112199   0.040810  -2.749  0.00597 **
## gmat_tot    -0.015176   0.009298  -1.632  0.10266   
## gmat_tpc     0.067617   0.044113   1.533  0.12532   
## quarter     -0.296904   0.143264  -2.072  0.03823 * 
## satis        0.426628   0.203977   2.092  0.03648 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 266.68  on 192  degrees of freedom
## Residual deviance: 244.99  on 187  degrees of freedom
## AIC: 256.99
## 
## Number of Fisher Scoring iterations: 4
pR2 = 1 - mgmModel$deviance / mgmModel$null.deviance
pR2
## [1] 0.08133441
mgmModel_null <- glm(dfrModel$Job~1, family = binomial, data = dfrModel)
pR21= 1- logLik(mgmModel)/logLik(mgmModel_null)
pR21
## 'log Lik.' 0.08133441 (df=6)
library(rcompanion)
nagelkerke(mgmModel)
## $Models
##                                                                                                      
## Model: "glm, Job ~ age + gmat_tot + gmat_tpc + quarter + satis, binomial(link = \"logit\"), dfrModel"
## Null:  "glm, Job ~ 1, binomial(link = \"logit\"), dfrModel"                                          
## 
## $Pseudo.R.squared.for.model.vs.null
##                              Pseudo.R.squared
## McFadden                            0.0813344
## Cox and Snell (ML)                  0.1062990
## Nagelkerke (Cragg and Uhler)        0.1419470
## 
## $Likelihood.ratio.test
##  Df.diff LogLik.diff Chisq    p.value
##       -5     -10.845 21.69 0.00059958
## 
## $Number.of.observations
##           
## Model: 193
## Null:  193
## 
## $Messages
## [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
## 
## $Warnings
## [1] "None"
pR2(mgmModel)
##           llh       llhNull            G2      McFadden          r2ML 
## -122.49418103 -133.33925034   21.69013862    0.08133441    0.10629911 
##          r2CU 
##    0.14194748

Confusion Matrix

prdVal <- predict(mgmModel, type='response')
prdBln <- ifelse(prdVal > 0.50, 1, 0)
cnfmtrx <- table(prd=prdBln, act=dfrModel$Job)
confusionMatrix(cnfmtrx)
## Confusion Matrix and Statistics
## 
##    act
## prd  0  1
##   0 49 21
##   1 41 82
##                                          
##                Accuracy : 0.6788         
##                  95% CI : (0.6079, 0.744)
##     No Information Rate : 0.5337         
##     P-Value [Acc > NIR] : 2.916e-05      
##                                          
##                   Kappa : 0.3454         
##  Mcnemar's Test P-Value : 0.01582        
##                                          
##             Sensitivity : 0.5444         
##             Specificity : 0.7961         
##          Pos Pred Value : 0.7000         
##          Neg Pred Value : 0.6667         
##              Prevalence : 0.4663         
##          Detection Rate : 0.2539         
##    Detection Prevalence : 0.3627         
##       Balanced Accuracy : 0.6703         
##                                          
##        'Positive' Class : 0              
## 

Observations
Accuracy is good which is around 0.6788
P value is also less than 0.05

Regression Data

dfrPlot <- mutate(dfrModel, PrdVal=prdVal, POutcome=prdBln)
head(dfrPlot)
##   age sex gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg quarter work_yrs
## 1  23   2      620       77       87       87   3.4  3.00       1        2
## 2  24   1      610       90       71       87   3.5  4.00       1        2
## 3  24   1      670       99       78       95   3.3  3.25       1        2
## 4  24   1      570       56       81       75   3.3  2.67       1        1
## 5  24   1      640       82       89       91   3.9  3.75       1        2
## 6  25   1      610       89       74       87   3.4  3.50       1        2
##   frstlang satis Job    PrdVal POutcome
## 1        1     7   0 0.8503428        1
## 2        1     6   0 0.7941661        1
## 3        1     6   0 0.7272317        1
## 4        1     7   0 0.8281352        1
## 5        1     6   0 0.7623167        1
## 6        1     5   0 0.6924072        1

Regression Visulaization

#dfrPlot
ggplot(dfrPlot, aes(x=PrdVal, y=POutcome))  + 
    geom_point(shape=19, colour="blue", fill="blue") +
    geom_smooth(method="gam", formula=y~s(log(x)), se=FALSE) +
    labs(title="Binomial Regression Curve") +
    labs(x="") +
    labs(y="")

ROC Visulaization

#rocplot(logistic.model,diag=TRUE,pred.prob.labels=FALSE,prob.label.digits=3,AUC=TRUE)
rocplot(mgmModel)

Observations
The closer the ROC gets to the optimal point of perfect prediction the closer the AUC gets to 1.
We can see that AUC value is around 0.6867 as well as Accuracy of our model is around 0.6788 which are quite equal so Model is good.

Test Data

dfrTests <- data.frame(age=c(20), gmat_tot=c(570), gmat_tpc=c(71), quarter=c(1), satis=c(6))
head(dfrTests)
##   age gmat_tot gmat_tpc quarter satis
## 1  20      570       71       1     6

Predict

resVal <- predict(mgmModel, dfrTests, type="response")
prdSur <- ifelse(resVal > 0.5, 1, 0)
prdSur <- as.factor(prdSur)
levels(prdSur) <- c("Not Get Job", "Get Job")
dfrTests <- mutate(dfrTests, Result=resVal, Prd_Outcome=prdSur)
head(dfrTests)
##   age gmat_tot gmat_tpc quarter satis    Result Prd_Outcome
## 1  20      570       71       1     6 0.7898739 Not Get Job

Summary

  1. Data is loaded from the file
  2. Outliers has been checked after removing the unwanted data, this model is with outliers
  3. Data Summary is checked to get the details of important statistical parameters
  4. Box Plot, Correlation graph is made to check the data distribution
  5. New Column has been added to know the details of Student who got the job or not 1 = Students who got the job 0 = Students who did not get the job

  6. Data Contigency Tables has been created to get the distribution of data
  7. Logistic Model is created
    Job ~ age + gmat_tot + gmat_tpc + quarter + satis

Dependent Variable = Job
Independent = Age, gmat_tot, gmat_tpc, quarter, satis
Above are the Dependent and Independent variables in the Model

  1. Accuracy of the Model is around 0.6768 which is good
  2. AUC Value is also near to 0.68 in ROC curve which is near to accuracy
  3. Model is created and Data prediction is done on test data.

End of Project

Model:
Data Imputation is done for all the column except Pregnancies and Outcome as other columns have many values as 0 which is important to be imputed.
Median and Mice Random factor is used to do the data Imputation.
For column where outliers are less, Median imputation is used while for others Mice random factor technique is used for Data Imputation.
Model is created by all the variables taken in account but only four variables are significant enough to give good result:
Outcome~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction Accuracy of Train Data is: 0.77 AUC Value of ROC Curve: 0.84 As no much difference between Accuracy and AUC value so it is a good model.

Data Testing
After Creating the model Data is tested on test data.
Same prediction value is used to get Diabetes patients.
Accuracy of Confusion matrix is around 0.77

Data is tested successfully.