Use the data set MBA Salaries.
1. Analysis of the data (summary statistics, plots, etc.)
2. Visualization of the data
3. t-tests and chi-square tests on the data
4. Identify which columns/features affect an MBA student's salary
5. Predict salaries for some dummy input values.
The data is provided by Harvard Business School.
Size: 11 KB; 274 obs. of 13 variables.
Attributes:
Field       Description
age         age, in years
sex         1 = Male; 2 = Female
gmat_tot    total GMAT score
gmat_qpc    quantitative GMAT percentile
gmat_vpc    verbal GMAT percentile
gmat_tpc    overall GMAT percentile
s_avg       spring MBA average
f_avg       fall MBA average
quarter     quartile ranking (1st is top, 4th is bottom)
work_yrs    years of work experience
frstlang    first language (1 = English; 2 = other)
salary      starting salary
satis       degree of satisfaction with MBA program (1 = low, 7 = high satisfaction)
Missing salary data are coded as follows:
998 = did not answer the survey
999 = answered the survey but did not disclose salary data
Size of data set: 274 records
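Before any modelling, these coded values have to be treated as missing. A minimal sketch (assuming the data frame has already been read in as dfrModel, as in the Dataset section below; the analysis itself simply filters these rows out later with subset()):
# Recode the survey non-response codes, and the zero salaries that also appear
# in the file, as NA. Illustrative only; the report filters these rows instead.
dfrModel$salary[dfrModel$salary %in% c(0, 998, 999)] <- NA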
Setup
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(corrgram)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(vcd)
## Loading required package: grid
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
library(corrplot)
Functions
# detect_outliers: count the values in a numeric vector that fall outside
# 1.5 * IQR from the first/third quartiles (the usual box-plot rule).
detect_outliers <- function(inp, na.rm=TRUE) {
  i.qnt <- quantile(inp, probs=c(.25, .75), na.rm=na.rm)
  i.max <- 1.5 * IQR(inp, na.rm=na.rm)
  otp <- inp
  otp[inp < (i.qnt[1] - i.max)] <- NA
  otp[inp > (i.qnt[2] + i.max)] <- NA
  sum(is.na(otp))
}
# Non_outliers: return a copy of x with values outside 1.5 * IQR set to NA.
Non_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
# Remove_Outliers: wrap Non_outliers and return the input vector with the
# outlier positions replaced by NA (via a match back to the non-outlier values).
Remove_Outliers <- function(z, na.rm = TRUE) {
  Out <- Non_outliers(z)
  Out <- as.data.frame(Out)
  z <- Out$Out[match(z, Out$Out)]
  z
}
# Graph_Boxplot: draw a box plot of a single column (passed in as a vector)
# so outliers are easy to spot. Note that it refers to the global data frame
# dfrModel for the ggplot data argument.
Graph_Boxplot <- function(input, na.rm = TRUE) {
  Plot <- ggplot(dfrModel, aes(x="", y=input)) +
    geom_boxplot(aes(fill=input), color="green") +
    labs(title="Outliers")
  Plot
}
Dataset
setwd("D:/Welingkar/My/IL/Project/MBA Salaries/Data")
dfrModel <- read.csv("./MBA Starting Salaries Data.csv", header=T, stringsAsFactors=F)
intRowCount <- nrow(dfrModel)
head(dfrModel)
## age sex gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg quarter work_yrs
## 1 23 2 620 77 87 87 3.4 3.00 1 2
## 2 24 1 610 90 71 87 3.5 4.00 1 2
## 3 24 1 670 99 78 95 3.3 3.25 1 2
## 4 24 1 570 56 81 75 3.3 2.67 1 1
## 5 24 2 710 93 98 98 3.6 3.75 1 2
## 6 24 1 640 82 89 91 3.9 3.75 1 2
## frstlang salary satis
## 1 1 0 7
## 2 1 0 6
## 3 1 0 6
## 4 1 0 7
## 5 1 999 5
## 6 1 0 6
Observation
1. There are intRowCount = 274 data records in the file.
Summary
#summary(dfrModel)
lapply(dfrModel, FUN=describe)
## $age
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 274 27.36 3.71 27 26.76 2.97 22 48 26 2.16 6.45
## se
## X1 0.22
##
## $sex
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 274 1.25 0.43 1 1.19 0 1 2 1 1.16 -0.66 0.03
##
## $gmat_tot
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 274 619.45 57.54 620 618.86 59.3 450 790 340 -0.01 0.06
## se
## X1 3.48
##
## $gmat_qpc
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 274 80.64 14.87 83 82.31 14.83 28 99 71 -0.92 0.3
## se
## X1 0.9
##
## $gmat_vpc
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 274 78.32 16.86 81 80.33 14.83 16 99 83 -1.04 0.74
## se
## X1 1.02
##
## $gmat_tpc
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 274 84.2 14.02 87 86.12 11.86 0 99 99 -2.28 9.02
## se
## X1 0.85
##
## $s_avg
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 274 3.03 0.38 3 3.03 0.44 2 4 2 -0.06 -0.38
## se
## X1 0.02
##
## $f_avg
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 274 3.06 0.53 3 3.09 0.37 0 4 4 -2.08 10.85
## se
## X1 0.03
##
## $quarter
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 274 2.48 1.11 2 2.47 1.48 1 4 3 0.02 -1.35 0.07
##
## $work_yrs
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 274 3.87 3.23 3 3.29 1.48 0 22 22 2.78 9.8 0.2
##
## $frstlang
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 274 1.12 0.32 1 1.02 0 1 2 1 2.37 3.65 0.02
##
## $salary
## vars n mean sd median trimmed mad min max range
## X1 1 274 39025.69 50951.56 999 33607.86 1481.12 0 220000 220000
## skew kurtosis se
## X1 0.7 -1.05 3078.1
##
## $satis
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 274 172.18 371.61 6 91.5 1.48 1 998 997 1.77 1.13
## se
## X1 22.45
Box Plot
lapply(dfrModel, FUN=Graph_Boxplot)
## (One box plot is produced per column: age, sex, gmat_tot, gmat_qpc, gmat_vpc, gmat_tpc, s_avg, f_avg, quarter, work_yrs, frstlang, salary, satis.)
Observation
Several of the predictor variables (sex, frstlang, quarter, satis) are categorical variables coded as numbers. These should be converted to factor variables, for example as shown below.
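A minimal sketch of that conversion, using the codings from the attribute list (the analysis below keeps the original numeric coding, so this is illustrative only):
# Convert the numerically coded categorical columns to labelled factors.
dfrFactors <- dfrModel
dfrFactors$sex      <- factor(dfrFactors$sex, levels=c(1, 2), labels=c("Male", "Female"))
dfrFactors$frstlang <- factor(dfrFactors$frstlang, levels=c(1, 2), labels=c("English", "Other"))
dfrFactors$quarter  <- factor(dfrFactors$quarter, levels=1:4, ordered=TRUE)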
Scatter Plot
par(mfrow=c(2, 2))
plot(y=dfrModel$salary, x=dfrModel$age,
col="blue",
ylim=c(0, 250000), xlim=c(20, 50),
main="Relationship Btw Age & Salary",
ylab="Salary", xlab="Age (Yrs)")
plot(y=dfrModel$salary, x=dfrModel$gmat_tot,
col="green",
ylim=c(0, 250000), xlim=c(400, 800),
main="Relationship Btw GMAT Total & Salary",
ylab="Salary", xlab="GMAT Total")
plot(y=dfrModel$salary, x=dfrModel$work_yrs,
col="green",
ylim=c(0, 250000), xlim=c(0, 30),
main="Relationship Btw Work Exp & Salary",
ylab="Salary", xlab="Work Exp")
plot(jitter(dfrModel$satis),jitter(dfrModel$salary),
col="blue",
xlim=c(0, 7), ylim=c(0, 250000),
main="Relationship Btw Satisfaction & Salary",
xlab="Satisfaction", ylab="Salary")
Observations
1. Some of the plotted points are not genuine salaries but the coded missing values (0, 998, 999), which cluster near the bottom of each plot.
2. For the rest of the data there is some relationship between salary and the other variables.
Correlation Plot
#pairs(dfrModel)
corrplot(corr=cor(dfrModel[ , c(1:13)], use="complete.obs"),
method ="ellipse")
Subset of Data: Students Who Got a Job (Disclosed Salary)
dfrMode_Job <- subset(dfrModel, !(dfrModel$salary %in% c(0, 998, 999)))
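As a quick check of the subset size (the regression later in this report shows 98 residual degrees of freedom with 5 estimated coefficients, implying 103 records):
nrow(dfrMode_Job)   # expected to be 103 rows with a disclosed salary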
Outliers
lapply(dfrMode_Job, FUN=detect_outliers)
## $age
## [1] 5
##
## $sex
## [1] 0
##
## $gmat_tot
## [1] 0
##
## $gmat_qpc
## [1] 2
##
## $gmat_vpc
## [1] 3
##
## $gmat_tpc
## [1] 2
##
## $s_avg
## [1] 0
##
## $f_avg
## [1] 3
##
## $quarter
## [1] 0
##
## $work_yrs
## [1] 9
##
## $frstlang
## [1] 7
##
## $salary
## [1] 9
##
## $satis
## [1] 1
Outlier Observations
1. There are outliers in the data; however, we keep them because the number of data records is already small. A sketch of how they could be removed anyway is shown below.
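A minimal sketch of dropping the outliers instead, using the Non_outliers / Remove_Outliers helpers defined in the Functions section (illustrative only; it is not applied in the analysis below):
# Replace outlying values in every column with NA, then keep only complete rows.
dfrNoOut <- as.data.frame(lapply(dfrMode_Job, Remove_Outliers))
dfrNoOut <- dfrNoOut[complete.cases(dfrNoOut), ]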
T-test
Null Hypothesis: There is no difference between the means of gmat_qpc and gmat_tot.
Alternative Hypothesis: There is a difference between the means of gmat_qpc and gmat_tot.
# Note: gmat_tot is taken from the job subset, gmat_qpc from the full data set.
t.test(dfrMode_Job$gmat_tot, dfrModel$gmat_qpc)
##
## Welch Two Sample t-test
##
## data: dfrMode_Job$gmat_tot and dfrModel$gmat_qpc
## t = 105.5, df = 108.66, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 525.3191 545.4351
## sample estimates:
## mean of x mean of y
## 616.01942 80.64234
T-test Observations
1. The p-value is far below 0.05 (at the 95% confidence level), so we reject the null hypothesis in favour of the alternative.
2. There is therefore a difference between the means of gmat_qpc and gmat_tot.
3. The mean of gmat_tot is 616.02.
4. The mean of gmat_qpc is 80.64.
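The project brief (item 3 above) also calls for a chi-square test. A minimal sketch, with sex versus quartile ranking chosen purely for illustration, testing whether quartile ranking is independent of sex:
# Chi-square test of independence between sex and quartile ranking.
# The choice of variables is an illustrative assumption, not from the original report.
tblSexQuarter <- table(dfrModel$sex, dfrModel$quarter)
chisq.test(tblSexQuarter)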
Correlation with Salary
# Correlation of every column with salary, within the job subset.
vctCorr <- numeric(0)
for (i in names(dfrMode_Job)) {
  cor.result <- cor(dfrMode_Job$salary, as.numeric(dfrMode_Job[, i]))
  vctCorr <- c(vctCorr, cor.result)
}
dfrCorr <- vctCorr
names(dfrCorr) <- names(dfrMode_Job)
dfrCorr
## age sex gmat_tot gmat_qpc gmat_vpc gmat_tpc
## 0.49964284 -0.16628869 -0.09067141 0.01414130 -0.13743230 -0.13201783
## s_avg f_avg quarter work_yrs frstlang salary
## 0.10173175 -0.10603897 -0.12848526 0.45466634 0.26701953 1.00000000
## satis
## -0.04005060
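The same vector of correlations can be computed in a single call (a sketch that should reproduce the numbers above):
# One-line equivalent of the loop: correlate every column with salary.
sapply(dfrMode_Job, function(col) cor(dfrMode_Job$salary, as.numeric(col)))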
Correlation Matrix
cor(dfrMode_Job[, c(1:13)])
## age sex gmat_tot gmat_qpc gmat_vpc
## age 1.00000000 -0.14352927 -0.07871678 -0.165039057 0.01799420
## sex -0.14352927 1.00000000 -0.01955548 -0.147099027 0.05341428
## gmat_tot -0.07871678 -0.01955548 1.00000000 0.666382266 0.78038546
## gmat_qpc -0.16503906 -0.14709903 0.66638227 1.000000000 0.09466541
## gmat_vpc 0.01799420 0.05341428 0.78038546 0.094665411 1.00000000
## gmat_tpc -0.09609156 -0.04686981 0.96680810 0.658650025 0.78443167
## s_avg 0.15654954 0.08079985 0.17198874 0.015471662 0.15865101
## f_avg -0.21699191 0.16572186 0.12246257 0.098418869 0.02290167
## quarter -0.12568145 -0.02139041 -0.10578964 0.012648346 -0.12862079
## work_yrs 0.88052470 -0.09233003 -0.12280018 -0.182701263 -0.02812182
## frstlang 0.35026743 0.07512009 -0.13164323 0.014198516 -0.21835333
## salary 0.49964284 -0.16628869 -0.09067141 0.014141299 -0.13743230
## satis 0.10832308 -0.09199534 0.06474206 -0.003984632 0.14863481
## gmat_tpc s_avg f_avg quarter work_yrs
## age -0.09609156 0.15654954 -0.21699191 -0.12568145 0.88052470
## sex -0.04686981 0.08079985 0.16572186 -0.02139041 -0.09233003
## gmat_tot 0.96680810 0.17198874 0.12246257 -0.10578964 -0.12280018
## gmat_qpc 0.65865003 0.01547166 0.09841887 0.01264835 -0.18270126
## gmat_vpc 0.78443167 0.15865101 0.02290167 -0.12862079 -0.02812182
## gmat_tpc 1.00000000 0.13938500 0.07051391 -0.09955033 -0.13246963
## s_avg 0.13938500 1.00000000 0.44590413 -0.84038355 0.16328236
## f_avg 0.07051391 0.44590413 1.00000000 -0.43144819 -0.21633018
## quarter -0.09955033 -0.84038355 -0.43144819 1.00000000 -0.12896722
## work_yrs -0.13246963 0.16328236 -0.21633018 -0.12896722 1.00000000
## frstlang -0.16437561 -0.13788905 -0.05061394 0.10955726 0.19627277
## salary -0.13201783 0.10173175 -0.10603897 -0.12848526 0.45466634
## satis 0.11630842 -0.14356557 -0.11773304 0.22511985 0.06299926
## frstlang salary satis
## age 0.35026743 0.49964284 0.108323083
## sex 0.07512009 -0.16628869 -0.091995338
## gmat_tot -0.13164323 -0.09067141 0.064742057
## gmat_qpc 0.01419852 0.01414130 -0.003984632
## gmat_vpc -0.21835333 -0.13743230 0.148634805
## gmat_tpc -0.16437561 -0.13201783 0.116308417
## s_avg -0.13788905 0.10173175 -0.143565573
## f_avg -0.05061394 -0.10603897 -0.117733043
## quarter 0.10955726 -0.12848526 0.225119851
## work_yrs 0.19627277 0.45466634 0.062999256
## frstlang 1.00000000 0.26701953 0.089834769
## salary 0.26701953 1.00000000 -0.040050600
## satis 0.08983477 -0.04005060 1.000000000
Data Cleaning
Based on the correlation matrix above, gmat_tot is dropped because it is almost perfectly correlated with gmat_tpc (0.97), quarter because it is strongly correlated with s_avg (-0.84), and satis because its correlation with salary is negligible (-0.04).
dfrMode_Job <- select(dfrMode_Job, -c(gmat_tot, quarter, satis))
head(dfrMode_Job)
## age sex gmat_qpc gmat_vpc gmat_tpc s_avg f_avg work_yrs frstlang salary
## 35 22 2 90 92 94 3.5 3.75 1 1 85000
## 36 27 2 94 98 98 3.3 3.25 2 1 85000
## 37 25 2 87 96 96 3.5 2.67 2 1 86000
## 38 25 2 82 91 93 3.4 3.25 3 1 88000
## 39 27 1 96 96 98 3.3 3.50 2 1 92000
## 40 28 2 52 98 87 3.4 3.75 5 1 93000
Visualize
dfrGraph <- gather(dfrMode_Job, variable, value, -salary)
head(dfrGraph)
## salary variable value
## 1 85000 age 22
## 2 85000 age 27
## 3 86000 age 25
## 4 88000 age 25
## 5 92000 age 27
## 6 93000 age 28
ggplot(dfrGraph) +
geom_jitter(aes(value,salary, colour=variable)) +
geom_smooth(aes(value,salary, colour=variable), method=lm, se=FALSE) +
facet_wrap(~variable, scales="free_x") +
labs(title="Relation Of salary With Other Features")
Observation
Each of the retained features shows some relationship with salary.
Find the Best Multiple Linear Model for MBA Salary
Choose the best linear model using step(), which selects a model by AIC in a stepwise algorithm.
In statistics, stepwise regression is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of explanatory variables based on some prespecified criterion.
The Akaike information criterion (AIC) is a measure of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Hence, AIC provides a means for model selection.
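Formally, for a model with k estimated parameters and maximized likelihood L, AIC = 2k - 2 ln(L); among the candidate models, the one with the lowest AIC is preferred.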
#?step()
stpModel=step(lm(data=dfrMode_Job, salary~.), trace=0, steps=100)
stpSummary <- summary(stpModel)
stpSummary
##
## Call:
## lm(formula = salary ~ age + gmat_qpc + gmat_vpc + gmat_tpc, data = dfrMode_Job)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29373 -8011 280 5705 67116
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39383.0 18467.6 2.133 0.0355 *
## age 2791.1 464.8 6.005 3.25e-08 ***
## gmat_qpc 822.3 347.9 2.363 0.0201 *
## gmat_vpc 513.2 350.7 1.463 0.1467
## gmat_tpc -1383.8 680.4 -2.034 0.0447 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15110 on 98 degrees of freedom
## Multiple R-squared: 0.3129, Adjusted R-squared: 0.2848
## F-statistic: 11.16 on 4 and 98 DF, p-value: 1.691e-07
Observations on the MBA Salaries Model
We tried different models, but the R-squared value is low for all of them, because the number of observations is small and the correlations between salary and the other variables are weak.
For academic purposes we still proceed with this model to predict the salaries.
1. salary ~ age + gmat_qpc + gmat_vpc + gmat_tpc
2. In this model, salary depends on the four variables listed above.
3. The adjusted R-squared is about 0.2848, which is low, but we still proceed with the model.
4. The overall p-value is less than 0.05, so we reject the null hypothesis; the four variables above jointly affect salary.
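Since the GMAT percentile columns are strongly correlated with each other, it is worth checking multicollinearity and the usual residual diagnostics before trusting the individual coefficients. A minimal sketch using the car package already loaded in Setup (not run in this report):
# Variance inflation factors for the stepwise model; values well above 5-10
# would flag problematic multicollinearity among the GMAT percentiles.
vif(stpModel)
# Standard residual diagnostic plots for the fitted model.
par(mfrow=c(2, 2))
plot(stpModel)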
Make the Final Multiple Linear Model
# Rebuild the stepwise model explicitly with renamed predictors:
# x1 = age, x3 = gmat_qpc, x4 = gmat_vpc, x5 = gmat_tpc.
x1 <- dfrMode_Job$age
x3 <- dfrMode_Job$gmat_qpc
x4 <- dfrMode_Job$gmat_vpc
x5 <- dfrMode_Job$gmat_tpc
y <- dfrMode_Job$salary
slmModel1 <- lm(y~x1+x3+x4+x5, data=dfrMode_Job)
summary(slmModel1)
##
## Call:
## lm(formula = y ~ x1 + x3 + x4 + x5, data = dfrMode_Job)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29373 -8011 280 5705 67116
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39383.0 18467.6 2.133 0.0355 *
## x1 2791.1 464.8 6.005 3.25e-08 ***
## x3 822.3 347.9 2.363 0.0201 *
## x4 513.2 350.7 1.463 0.1467
## x5 -1383.8 680.4 -2.034 0.0447 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15110 on 98 degrees of freedom
## Multiple R-squared: 0.3129, Adjusted R-squared: 0.2848
## F-statistic: 11.16 on 4 and 98 DF, p-value: 1.691e-07
Observation
No errors. Model successfully created.
Show Model
# print summary
summary(slmModel1)
##
## Call:
## lm(formula = y ~ x1 + x3 + x4 + x5, data = dfrMode_Job)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29373 -8011 280 5705 67116
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39383.0 18467.6 2.133 0.0355 *
## x1 2791.1 464.8 6.005 3.25e-08 ***
## x3 822.3 347.9 2.363 0.0201 *
## x4 513.2 350.7 1.463 0.1467
## x5 -1383.8 680.4 -2.034 0.0447 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15110 on 98 degrees of freedom
## Multiple R-squared: 0.3129, Adjusted R-squared: 0.2848
## F-statistic: 11.16 on 4 and 98 DF, p-value: 1.691e-07
Test Data
# Dummy input values: x1 = age 25, x3 = gmat_qpc 79, x4 = gmat_vpc 93, x5 = gmat_tpc 93.
dfrtest <- data.frame(x1=c(25), x3=c(79), x4=c(93), x5=c(93))
dfrtest
## x1 x3 x4 x5
## 1 25 79 93 93
Observation
Test Data successfully created.
Predict
result <- predict(slmModel1, dfrtest)
print(result)
## 1
## 93152.76
Observation
1. The prediction works, but it deviates somewhat from the true value because the model's R-squared is low; the true salary for this record is 96000.
2. The predicted salary (about 93153) is therefore reasonably close to the true value.
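To make the uncertainty of such a prediction explicit, a prediction interval can be requested (a sketch using the same test record):
# Point prediction plus a 95% prediction interval for the test record.
predict(slmModel1, dfrtest, interval="prediction", level=0.95)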
###########End of the Project#########