Problem Definition
Use MBA Starting Salaries Data.csv to:
- Create a logistic model
- Draw contingency tables as appropriate
- Run a chi-square test
- Run a logistic regression
The data is provided by Harvard Business School.
Size: 11 KB, 274 obs. of 13 variables
Attributes:
age: age in years
sex: 1 = Male; 2 = Female
gmat_tot: total GMAT score
gmat_qpc: quantitative GMAT percentile
gmat_vpc: verbal GMAT percentile
gmat_tpc: overall GMAT percentile
s_avg: spring MBA average
f_avg: fall MBA average
quarter: quartile ranking (1st is top, 4th is bottom)
work_yrs: years of work experience
frstlang: first language (1 = English; 2 = other)
salary: starting salary
satis: degree of satisfaction with MBA program (1 = low, 7 = high satisfaction)
Missing salary data are coded as follows:
998 = did not answer the survey
999 = answered the survey but did not disclose salary data
Size of data set: 274 records
Assumption
In the derived outcome variable, 1 denotes students who got a job and 0 denotes students who did not.
Setup
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(corrgram)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(Deducer)
## Loading required package: JGR
## Loading required package: rJava
## Loading required package: JavaGD
## Loading required package: iplots
##
## Please type JGR() to launch console. Platform specific launchers (.exe and .app) can also be obtained at http://www.rforge.net/JGR/files/.
## Loading required package: car
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
##
##
## Note Non-JGR console detected:
## Deducer is best used from within JGR (http://jgr.markushelbig.org/).
## To Bring up GUI dialogs, type deducer().
library(caret)
## Loading required package: lattice
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
library(vcd)
## Loading required package: grid
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
Functions
detect_outliers <- function(inp, na.rm=TRUE) {
  # Return the values lying more than 2.5 * IQR below Q1 or above Q3
  i.qnt <- quantile(inp, probs=c(.25, .75), na.rm=na.rm)
  i.max <- 2.5 * IQR(inp, na.rm=na.rm)
  otp <- inp
  otp[inp < (i.qnt[1] - i.max)] <- NA
  otp[inp > (i.qnt[2] + i.max)] <- NA
  inp[is.na(otp)]  # the outlier values themselves
}
detect_na <- function(inp) {
  # Count the NA values in a vector
  sum(is.na(inp))
}
Graph_Boxplot <- function (input, na.rm = TRUE){
  # Boxplot of a single column (passed as a vector); relies on the global dfrModel
  Plot <- ggplot(dfrModel, aes(x="", y=input)) +
    geom_boxplot(aes(fill=input), color="green") +
    labs(title="Outliers")
  Plot
}
Dataset
setwd("D:/Welingkar/My/IL/Project/MBA Salaries/Data")
dfrModel <- read.csv("./MBA Starting Salaries Data.csv", header=T, stringsAsFactors=F)
intRowCount <- nrow(dfrModel)
head(dfrModel)
## age sex gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg quarter work_yrs
## 1 23 2 620 77 87 87 3.4 3.00 1 2
## 2 24 1 610 90 71 87 3.5 4.00 1 2
## 3 24 1 670 99 78 95 3.3 3.25 1 2
## 4 24 1 570 56 81 75 3.3 2.67 1 1
## 5 24 2 710 93 98 98 3.6 3.75 1 2
## 6 24 1 640 82 89 91 3.9 3.75 1 2
## frstlang salary satis
## 1 1 0 7
## 2 1 0 6
## 3 1 0 6
## 4 1 0 7
## 5 1 999 5
## 6 1 0 6
Observation
All columns are numeric.
Since there are no alphanumeric columns, no type conversions or column drops are needed at this stage.
Outliers Graph
lapply(dfrModel, FUN=Graph_Boxplot)
## (one boxplot is rendered per column: age, sex, gmat_tot, gmat_qpc, gmat_vpc, gmat_tpc, s_avg, f_avg, quarter, work_yrs, frstlang, salary, satis)
Observations
There are a few outliers.
We will keep them in the data; the exact values are listed below.
Subset of Data: Remove Survey Non-Respondents
Records coded 998 (did not answer the survey) or 999 (did not disclose salary) are dropped.
dfrModel <- subset(dfrModel, !(dfrModel$salary %in% c(998, 999)))
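A quick check, not in the original run, of how many records remain (the describe() output below shows n = 193):
intRowCountClean <- nrow(dfrModel)  # 193 of the original 274 records remain
intRowCount - intRowCountClean      # 81 non-respondent records dropped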
Outliers Data
lapply(dfrModel, FUN=detect_outliers)
## $age
## [1] 42 48 40 43 43 40
##
## $sex
## integer(0)
##
## $gmat_tot
## integer(0)
##
## $gmat_qpc
## integer(0)
##
## $gmat_vpc
## integer(0)
##
## $gmat_tpc
## [1] 0
##
## $s_avg
## numeric(0)
##
## $f_avg
## [1] 0 0
##
## $quarter
## integer(0)
##
## $work_yrs
## [1] 13 22 16 15 18 16 16 22 15
##
## $frstlang
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##
## $salary
## integer(0)
##
## $satis
## integer(0)
Observations
Since there are only a few outliers, we keep them in the data rather than removing or imputing them.
New Column for Job
# salary == 0 means no job; any positive salary means the student got a job
dfrModel$Job <- ifelse(dfrModel$salary == 0, 0, 1)
# drop salary: Job now encodes the outcome
dfrModel$salary <- NULL
head(dfrModel)
## age sex gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg quarter work_yrs
## 1 23 2 620 77 87 87 3.4 3.00 1 2
## 2 24 1 610 90 71 87 3.5 4.00 1 2
## 3 24 1 670 99 78 95 3.3 3.25 1 2
## 4 24 1 570 56 81 75 3.3 2.67 1 1
## 6 24 1 640 82 89 91 3.9 3.75 1 2
## 7 25 1 610 89 74 87 3.4 3.50 1 2
## frstlang satis Job
## 1 1 7 0
## 2 1 6 0
## 3 1 6 0
## 4 1 7 0
## 6 1 6 0
## 7 1 5 0
Observations
A new column Job has been added: 1 = got a job, 0 = did not get a job.
Missing Data
lapply(dfrModel, FUN=detect_na)
## $age
## [1] 0
##
## $sex
## [1] 0
##
## $gmat_tot
## [1] 0
##
## $gmat_qpc
## [1] 0
##
## $gmat_vpc
## [1] 0
##
## $gmat_tpc
## [1] 0
##
## $s_avg
## [1] 0
##
## $f_avg
## [1] 0
##
## $quarter
## [1] 0
##
## $work_yrs
## [1] 0
##
## $frstlang
## [1] 0
##
## $satis
## [1] 0
##
## $Job
## [1] 0
Observation
There are no NA values in the dataset.
Summary
#summary(dfrModel)
lapply(dfrModel, FUN=describe)
## $age
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 193 27.59 4.22 27 26.86 2.97 22 48 26 1.93 4.55 0.3
##
## $sex
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 193 1.28 0.45 1 1.23 0 1 2 1 0.97 -1.06 0.03
##
## $gmat_tot
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 193 615.23 56.54 610 614.19 59.3 450 760 310 0.08 -0.31
## se
## X1 4.07
##
## $gmat_qpc
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 193 79.35 15.15 82 80.92 14.83 28 99 71 -0.88 0.23
## se
## X1 1.09
##
## $gmat_vpc
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 193 78.13 16.1 81 79.87 14.83 22 99 77 -0.9 0.36
## se
## X1 1.16
##
## $gmat_tpc
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 193 83.48 13.53 87 85.08 11.86 0 99 99 -1.87 7.03
## se
## X1 0.97
##
## $s_avg
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 193 3.06 0.38 3.09 3.08 0.43 2 4 2 -0.27 -0.15
## se
## X1 0.03
##
## $f_avg
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 193 3.08 0.52 3 3.11 0.37 0 4 4 -2.17 11.03
## se
## X1 0.04
##
## $quarter
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 193 2.39 1.1 2 2.37 1.48 1 4 3 0.13 -1.32 0.08
##
## $work_yrs
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 193 4.1 3.69 3 3.37 1.48 0 22 22 2.47 7.02 0.27
##
## $frstlang
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 193 1.08 0.27 1 1 0 1 2 1 3.13 7.84 0.02
##
## $satis
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 193 5.76 0.77 6 5.75 1.48 3 7 4 -0.17 -0.06
## se
## X1 0.06
##
## $Job
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 193 0.53 0.5 1 0.54 0 0 1 1 -0.13 -1.99 0.04
Number of Students Who Got a Job
table(dfrModel$Job)
##
## 0 1
## 90 103
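The same split as proportions; 103/193 is about 0.534, which reappears later as the No Information Rate in the confusion matrix:
prop.table(table(dfrModel$Job))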
Data Distribution
mytable <- xtabs(~ Job+sex+quarter, data=dfrModel)
ftable(mytable)
## quarter 1 2 3 4
## Job sex
## 0 1 11 21 16 19
## 2 7 6 7 3
## 1 1 23 19 17 13
## 2 12 6 7 6
margin.table(mytable, 1)
## Job
## 0 1
## 90 103
margin.table(mytable, 2)
## sex
## 1 2
## 139 54
margin.table(mytable, 3)
## quarter
## 1 2 3 4
## 53 52 47 41
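The problem definition also calls for a chi-square test of independence; a minimal sketch on these contingency tables (these calls were not part of the original run, so no output is shown). The vcd package loaded in Setup can likewise draw a mosaic plot of the three-way table:
# Is getting a job independent of sex? Of quartile ranking?
chisq.test(table(dfrModel$Job, dfrModel$sex))
chisq.test(table(dfrModel$Job, dfrModel$quarter))
# Mosaic plot of the three-way table, shaded by Pearson residuals
mosaic(~ Job + sex + quarter, data=dfrModel, shade=TRUE)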
Correlation
# Correlation of every column with the Job outcome
vctCorr <- numeric(0)
for (i in names(dfrModel)){
  cor.result <- cor(as.numeric(dfrModel$Job), as.numeric(dfrModel[,i]))
  vctCorr <- c(vctCorr, cor.result)
}
dfrCorr <- vctCorr
names(dfrCorr) <- names(dfrModel)
dfrCorr
## age sex gmat_tot gmat_qpc gmat_vpc gmat_tpc
## -0.20569719 0.05047054 0.01491495 0.02698202 0.02888098 0.08264631
## s_avg f_avg quarter work_yrs frstlang satis
## 0.08063913 0.02746051 -0.12788216 -0.12330395 -0.03899476 0.16882557
## Job
## 1.00000000
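The corrgram package was loaded in Setup but not yet used; a sketch of how the full correlation matrix could be visualized with it:
# Shaded lower panel, pie upper panel, variables reordered by PCA
corrgram(dfrModel, order=TRUE,
         lower.panel=panel.shade, upper.panel=panel.pie,
         text.panel=panel.txt, main="Correlations Among MBA Variables")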
Data For Visualization
dfrGraph <- gather(dfrModel, variable, value, -Job)  # long format: one (variable, value) row per cell
head(dfrGraph)
## Job variable value
## 1 0 age 23
## 2 0 age 24
## 3 0 age 24
## 4 0 age 24
## 5 0 age 24
## 6 0 age 25
Data Visualization
ggplot(dfrGraph) +
  geom_jitter(aes(value, Job, colour=variable)) +
  facet_wrap(~variable, scales="free_x") +
  labs(title="Relation of Job With Other Features")
Observation
The plots show weak but visible relationships between Job and several of the other variables, consistent with the correlations computed above.
Find the Best Logistic Model
Choose the best logistic model using stepwise selection with step().
stpModel <- step(glm(data=dfrModel, formula=Job~., family=binomial), trace=0, steps=100)
summary(stpModel)
##
## Call:
## glm(formula = Job ~ age + gmat_tot + gmat_tpc + quarter + satis,
## family = binomial, data = dfrModel)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9491 -1.1229 0.7702 1.0287 2.1052
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.154571 3.053403 1.688 0.09138 .
## age -0.112199 0.040810 -2.749 0.00597 **
## gmat_tot -0.015176 0.009298 -1.632 0.10266
## gmat_tpc 0.067617 0.044113 1.533 0.12532
## quarter -0.296904 0.143264 -2.072 0.03823 *
## satis 0.426628 0.203977 2.092 0.03648 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 266.68 on 192 degrees of freedom
## Residual deviance: 244.99 on 187 degrees of freedom
## AIC: 256.99
##
## Number of Fisher Scoring iterations: 4
Observation
The best model found is Job ~ age + gmat_tot + gmat_tpc + quarter + satis.
The null deviance has 193 - 1 = 192 degrees of freedom, since only the intercept is estimated.
The residual deviance has 193 - 6 = 187 degrees of freedom, since the model estimates 6 parameters (the intercept plus 5 predictors).
Fit the Final Logistic Model
# make model
mgmModel <- glm(data=dfrModel, formula=Job ~ age + gmat_tot + gmat_tpc + quarter + satis, family=binomial(link="logit"))
# print summary
summary(mgmModel)
##
## Call:
## glm(formula = Job ~ age + gmat_tot + gmat_tpc + quarter + satis,
## family = binomial(link = "logit"), data = dfrModel)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9491 -1.1229 0.7702 1.0287 2.1052
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.154571 3.053403 1.688 0.09138 .
## age -0.112199 0.040810 -2.749 0.00597 **
## gmat_tot -0.015176 0.009298 -1.632 0.10266
## gmat_tpc 0.067617 0.044113 1.533 0.12532
## quarter -0.296904 0.143264 -2.072 0.03823 *
## satis 0.426628 0.203977 2.092 0.03648 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 266.68 on 192 degrees of freedom
## Residual deviance: 244.99 on 187 degrees of freedom
## AIC: 256.99
##
## Number of Fisher Scoring iterations: 4
# McFadden's pseudo R-squared computed from the deviances
pR2 = 1 - mgmModel$deviance / mgmModel$null.deviance
pR2
## [1] 0.08133441
mgmModel_null <- glm(dfrModel$Job~1, family = binomial, data = dfrModel)
pR21= 1- logLik(mgmModel)/logLik(mgmModel_null)
pR21
## 'log Lik.' 0.08133441 (df=6)
library(rcompanion)
nagelkerke(mgmModel)
## $Models
##
## Model: "glm, Job ~ age + gmat_tot + gmat_tpc + quarter + satis, binomial(link = \"logit\"), dfrModel"
## Null: "glm, Job ~ 1, binomial(link = \"logit\"), dfrModel"
##
## $Pseudo.R.squared.for.model.vs.null
## Pseudo.R.squared
## McFadden 0.0813344
## Cox and Snell (ML) 0.1062990
## Nagelkerke (Cragg and Uhler) 0.1419470
##
## $Likelihood.ratio.test
## Df.diff LogLik.diff Chisq p.value
## -5 -10.845 21.69 0.00059958
##
## $Number.of.observations
##
## Model: 193
## Null: 193
##
## $Messages
## [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
##
## $Warnings
## [1] "None"
pR2(mgmModel)
## llh llhNull G2 McFadden r2ML
## -122.49418103 -133.33925034 21.69013862 0.08133441 0.10629911
## r2CU
## 0.14194748
Confusion Matrix
prdVal <- predict(mgmModel, type='response')  # predicted probabilities on the training data
prdBln <- ifelse(prdVal > 0.50, 1, 0)         # classify at the 0.50 cutoff
cnfmtrx <- table(prd=prdBln, act=dfrModel$Job)
confusionMatrix(cnfmtrx)
## Confusion Matrix and Statistics
##
## act
## prd 0 1
## 0 49 21
## 1 41 82
##
## Accuracy : 0.6788
## 95% CI : (0.6079, 0.744)
## No Information Rate : 0.5337
## P-Value [Acc > NIR] : 2.916e-05
##
## Kappa : 0.3454
## Mcnemar's Test P-Value : 0.01582
##
## Sensitivity : 0.5444
## Specificity : 0.7961
## Pos Pred Value : 0.7000
## Neg Pred Value : 0.6667
## Prevalence : 0.4663
## Detection Rate : 0.2539
## Detection Prevalence : 0.3627
## Balanced Accuracy : 0.6703
##
## 'Positive' Class : 0
##
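The 0.50 cutoff used above is a default choice; a quick sketch, not part of the original analysis, of how training accuracy varies with the cutoff:
cutoffs <- seq(0.30, 0.70, by=0.05)
sapply(cutoffs, function(ct) mean(ifelse(prdVal > ct, 1, 0) == dfrModel$Job))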
Observations
Accuracy is about 0.6788, i.e. (49 + 82) / 193, noticeably better than the No Information Rate of 0.5337.
The p-value for Acc > NIR is well below 0.05, so the model predicts significantly better than always guessing the majority class.
Regression Data
dfrPlot <- mutate(dfrModel, PrdVal=prdVal, POutcome=prdBln)
head(dfrPlot)
## age sex gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg quarter work_yrs
## 1 23 2 620 77 87 87 3.4 3.00 1 2
## 2 24 1 610 90 71 87 3.5 4.00 1 2
## 3 24 1 670 99 78 95 3.3 3.25 1 2
## 4 24 1 570 56 81 75 3.3 2.67 1 1
## 5 24 1 640 82 89 91 3.9 3.75 1 2
## 6 25 1 610 89 74 87 3.4 3.50 1 2
## frstlang satis Job PrdVal POutcome
## 1 1 7 0 0.8503428 1
## 2 1 6 0 0.7941661 1
## 3 1 6 0 0.7272317 1
## 4 1 7 0 0.8281352 1
## 5 1 6 0 0.7623167 1
## 6 1 5 0 0.6924072 1
Regression Visualization
ggplot(dfrPlot, aes(x=PrdVal, y=POutcome)) +
geom_point(shape=19, colour="blue", fill="blue") +
geom_smooth(method="gam", formula=y~s(log(x)), se=FALSE) +
labs(title="Binomial Regression Curve") +
labs(x="") +
labs(y="")
ROC Visualization
#rocplot(logistic.model,diag=TRUE,pred.prob.labels=FALSE,prob.label.digits=3,AUC=TRUE)
rocplot(mgmModel)
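rocplot() comes from the Deducer package; if the JGR/Deducer stack is unavailable, the AUC can also be computed with the pROC package (an alternative sketch, assuming pROC is installed):
library(pROC)
rocObj <- roc(dfrModel$Job, prdVal)  # observed outcome, predicted probability
auc(rocObj)                          # about 0.6867, matching the rocplot above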
Observations
The closer the ROC curve gets to the point of perfect prediction (the top-left corner), the closer the AUC gets to 1.
The AUC is about 0.6867 and the model accuracy is about 0.6788; the two agree closely, so the model's performance is consistent, though only moderately strong.
Test Data
dfrTests <- data.frame(age=c(20), gmat_tot=c(570), gmat_tpc=c(71), quarter=c(1), satis=c(6))
head(dfrTests)
## age gmat_tot gmat_tpc quarter satis
## 1 20 570 71 1 6
Predict
resVal <- predict(mgmModel, dfrTests, type="response")
prdSur <- ifelse(resVal > 0.5, 1, 0)
# Map 0/1 to labels explicitly; reassigning levels() mislabels the result
# when only one class is present in the predictions
prdSur <- factor(prdSur, levels=c(0, 1), labels=c("Not Get Job", "Get Job"))
dfrTests <- mutate(dfrTests, Result=resVal, Prd_Outcome=prdSur)
head(dfrTests)
## age gmat_tot gmat_tpc quarter satis Result Prd_Outcome
## 1 20 570 71 1 6 0.7898739 Get Job
A Prd_Outcome column has been added showing whether each test student is predicted to get a job ("Get Job") or not ("Not Get Job"); here the predicted probability of 0.79 exceeds 0.5, so the student is classified as getting a job.
The final logistic model is:
Job ~ age + gmat_tot + gmat_tpc + quarter + satis
Dependent variable: Job
Independent variables: age, gmat_tot, gmat_tpc, quarter, satis
End of Project
Model:
No imputation was needed: the dataset contains no NA values. Records with salary codes 998 and 999 (survey non-respondents) were removed, leaving 193 of the 274 records, and the few outliers were retained.
The model was built from all variables via stepwise selection, which kept five predictors:
Job ~ age + gmat_tot + gmat_tpc + quarter + satis
Accuracy on the training data is about 0.68 and the AUC of the ROC curve is about 0.69. Since accuracy and AUC are close, the model is reasonably consistent.
Data Testing
After creating the model, it was applied to a sample test record.
The same 0.5 cutoff on the predicted probability was used to classify the student.
The predicted probability of about 0.79 classifies that student as getting a job.