Let us bring in the data for the MBA starting Salary casee study for detailed analysis and making the dataset readdy for analysis.
setwd("C:/Users/Rajat/Desktop/Internship IIM Lucknow/Datasets/Offlinel4")
ms1 <- read.csv(paste("MBA Starting Salaries Data.csv",sep=""))
ms1$sex[ms1$sex==1] <- "Male"
ms1$sex[ms1$sex==2] <- "Female"
ms1$sex= factor(ms1$sex)
ms1$frstlang[ms1$frstlang==1] <- "English"
ms1$frstlang[ms1$frstlang==2] <- "Other"
ms1$frstlang= factor(ms1$frstlang)
head(ms1)
## age sex gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg quarter
## 1 23 Female 620 77 87 87 3.4 3.00 1
## 2 24 Male 610 90 71 87 3.5 4.00 1
## 3 24 Male 670 99 78 95 3.3 3.25 1
## 4 24 Male 570 56 81 75 3.3 2.67 1
## 5 24 Female 710 93 98 98 3.6 3.75 1
## 6 24 Male 640 82 89 91 3.9 3.75 1
## work_yrs frstlang salary satis
## 1 2 English 0 7
## 2 2 English 0 6
## 3 2 English 0 6
## 4 1 English 0 7
## 5 2 English 999 5
## 6 2 English 0 6
Lets have a look at the data types and summary statistics for the following dataset.
str(ms1)
## 'data.frame': 274 obs. of 13 variables:
## $ age : int 23 24 24 24 24 24 25 25 25 25 ...
## $ sex : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 2 2 1 2 2 ...
## $ gmat_tot: int 620 610 670 570 710 640 610 650 630 680 ...
## $ gmat_qpc: int 77 90 99 56 93 82 89 88 79 99 ...
## $ gmat_vpc: int 87 71 78 81 98 89 74 89 91 81 ...
## $ gmat_tpc: int 87 87 95 75 98 91 87 92 89 96 ...
## $ s_avg : num 3.4 3.5 3.3 3.3 3.6 3.9 3.4 3.3 3.3 3.45 ...
## $ f_avg : num 3 4 3.25 2.67 3.75 3.75 3.5 3.75 3.25 3.67 ...
## $ quarter : int 1 1 1 1 1 1 1 1 1 1 ...
## $ work_yrs: int 2 2 2 1 2 2 2 2 2 2 ...
## $ frstlang: Factor w/ 2 levels "English","Other": 1 1 1 1 1 1 1 1 2 1 ...
## $ salary : int 0 0 0 0 999 0 0 0 999 998 ...
## $ satis : int 7 6 6 7 5 6 5 6 4 998 ...
Now we will summarize the data
summary(ms1)
## age sex gmat_tot gmat_qpc
## Min. :22.00 Female: 68 Min. :450.0 Min. :28.00
## 1st Qu.:25.00 Male :206 1st Qu.:580.0 1st Qu.:72.00
## Median :27.00 Median :620.0 Median :83.00
## Mean :27.36 Mean :619.5 Mean :80.64
## 3rd Qu.:29.00 3rd Qu.:660.0 3rd Qu.:93.00
## Max. :48.00 Max. :790.0 Max. :99.00
## gmat_vpc gmat_tpc s_avg f_avg
## Min. :16.00 Min. : 0.0 Min. :2.000 Min. :0.000
## 1st Qu.:71.00 1st Qu.:78.0 1st Qu.:2.708 1st Qu.:2.750
## Median :81.00 Median :87.0 Median :3.000 Median :3.000
## Mean :78.32 Mean :84.2 Mean :3.025 Mean :3.062
## 3rd Qu.:91.00 3rd Qu.:94.0 3rd Qu.:3.300 3rd Qu.:3.250
## Max. :99.00 Max. :99.0 Max. :4.000 Max. :4.000
## quarter work_yrs frstlang salary
## Min. :1.000 Min. : 0.000 English:242 Min. : 0
## 1st Qu.:1.250 1st Qu.: 2.000 Other : 32 1st Qu.: 0
## Median :2.000 Median : 3.000 Median : 999
## Mean :2.478 Mean : 3.872 Mean : 39026
## 3rd Qu.:3.000 3rd Qu.: 4.000 3rd Qu.: 97000
## Max. :4.000 Max. :22.000 Max. :220000
## satis
## Min. : 1.0
## 1st Qu.: 5.0
## Median : 6.0
## Mean :172.2
## 3rd Qu.: 7.0
## Max. :998.0
lets study some of the variables of MBA starting Salaries
describe(ms1$gmat_tot)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 274 619.45 57.54 620 618.86 59.3 450 790 340 -0.01 0.06
## se
## X1 3.48
describe(ms1$s_avg)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 274 3.03 0.38 3 3.03 0.44 2 4 2 -0.06 -0.38
## se
## X1 0.02
describe(ms1$f_avg)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 274 3.06 0.53 3 3.09 0.37 0 4 4 -2.08 10.85
## se
## X1 0.03
Here the average of Fall and Spring is out of 5. Now lets visualize some variables, lets start with age distribution.
hist(ms1$age, col="red",
xlab="Age in years",breaks=22,xlim=c(20,50),
main="Age distribution")
boxplot(ms1$gmat_tot,
xlab="Gmat Total score",col="yellow",
main="Boxplot of GMAT total Score", horizontal = TRUE)
boxplot(ms1$gmat_qpc,
xlab="Gmat Quants score",col="darkolivegreen",
main="Boxplot of GMAT Quants Score", horizontal = TRUE)
boxplot(ms1$gmat_vpc,
xlab="Gmat Verbal score",col="darkorange",
main="Boxplot of GMAT Verbal Score", horizontal = TRUE)
boxplot(ms1$s_avg,
xlab="Average score in Spring",col="deeppink",
main="Boxplot of Average Score in Spring", horizontal = TRUE)
boxplot(ms1$f_avg,
xlab="Average score in Fall",col="firebrick1",
main="Boxplot of Average Score in Fall", horizontal = TRUE)
boxplot(ms1$work_yrs,
xlab="Work experience of Students in years",col="seagreen",
main="Boxplot of Work experience in Years", horizontal = TRUE)
As seen from all the boxplots there are some outliers in some plots which may be due to diverse culture of students at the MBA program.
lets have a number of students having English having thier 1st language.
options(digits =3)
table(ms1$sex)
##
## Female Male
## 68 206
table(ms1$frstlang)
##
## English Other
## 242 32
y<-table(ms1$sex,ms1$frstlang)
prop.table(y)*100
##
## English Other
## Female 21.90 2.92
## Male 66.42 8.76
z <- table(ms1$satis)
prop.table(z)*100
##
## 1 2 3 4 5 6 7 998
## 0.365 0.365 1.825 6.204 27.007 35.401 12.044 16.788
Thus the proportion table shows that the satisfactory levels are high with around 75% responses above a response of 5. lets find a correlation matrix for all the numeric variables
dd2 <- subset(ms1,select=c(age,gmat_tot,gmat_qpc,gmat_vpc,
gmat_tpc, s_avg, f_avg, quarter,work_yrs,
salary,satis))
corrs <- cor(dd2, use="pairwise.complete.obs")
corrs
## age gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg
## age 1.0000 -0.1459 -0.2162 -0.04418 -0.16990 0.1497 -0.0174
## gmat_tot -0.1459 1.0000 0.7247 0.74839 0.84780 0.1131 0.1044
## gmat_qpc -0.2162 0.7247 1.0000 0.15218 0.65138 -0.0298 0.0737
## gmat_vpc -0.0442 0.7484 0.1522 1.00000 0.66622 0.2045 0.0759
## gmat_tpc -0.1699 0.8478 0.6514 0.66622 1.00000 0.1174 0.0797
## s_avg 0.1497 0.1131 -0.0298 0.20445 0.11736 1.0000 0.5506
## f_avg -0.0174 0.1044 0.0737 0.07592 0.07973 0.5506 1.0000
## quarter -0.0497 -0.0922 0.0364 -0.17461 -0.08304 -0.7621 -0.4475
## work_yrs 0.8583 -0.1824 -0.2366 -0.06639 -0.17336 0.1293 -0.0391
## salary -0.0626 -0.0550 -0.0440 -0.00614 0.00493 0.1458 0.0294
## satis -0.1279 0.0826 0.0606 0.06262 0.09293 -0.0327 0.0109
## quarter work_yrs salary satis
## age -4.97e-02 0.85830 -0.06257 -1.28e-01
## gmat_tot -9.22e-02 -0.18235 -0.05497 8.26e-02
## gmat_qpc 3.64e-02 -0.23661 -0.04403 6.06e-02
## gmat_vpc -1.75e-01 -0.06639 -0.00614 6.26e-02
## gmat_tpc -8.30e-02 -0.17336 0.00493 9.29e-02
## s_avg -7.62e-01 0.12929 0.14584 -3.27e-02
## f_avg -4.48e-01 -0.03906 0.02944 1.09e-02
## quarter 1.00e+00 -0.08603 -0.16437 -1.27e-05
## work_yrs -8.60e-02 1.00000 0.00902 -1.09e-01
## salary -1.64e-01 0.00902 1.00000 -3.35e-01
## satis -1.27e-05 -0.10926 -0.33522 1.00e+00
Lets have a correlation Matrix for the entire dataset to get a visual representation of the correlations betweeen the variables.
par(mfrow=c(1,1))
corrplot(corr=cor(ms1[,c(1,3:10,12,13)]),use="complete.obs",
method="ellipse")
lets first have a complete dataset of students that were placed and have mentioned there salary.
Placed <- ms1[which(ms1$salary>999),]
some(Placed)
## age sex gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg quarter
## 60 39 Male 600 72 81 83 3.6 3.50 1
## 66 30 Male 600 77 81 84 3.5 3.25 1
## 117 31 Male 540 60 62 65 3.1 3.00 2
## 128 25 Female 620 89 74 87 3.1 3.50 2
## 129 27 Male 620 97 63 88 3.2 3.00 2
## 133 34 Male 550 72 58 69 3.0 3.00 2
## 138 26 Male 650 89 87 93 3.2 3.25 2
## 198 28 Male 610 89 67 86 2.7 3.00 3
## 204 30 Male 670 83 97 96 2.8 2.75 3
## 258 25 Female 720 96 98 99 3.5 3.60 4
## work_yrs frstlang salary satis
## 60 16 English 112000 7
## 66 5 English 120000 6
## 117 8 English 93000 6
## 128 2 English 101000 5
## 129 3 English 103000 6
## 133 16 English 105000 5
## 138 4 English 130000 7
## 198 4 English 98000 7
## 204 6 English 102500 5
## 258 3 English 85000 6
Lets find some interesting insights from the placed students dataset.
describe(Placed$salary)
## vars n mean sd median trimmed mad min max range skew
## X1 1 103 103031 17869 1e+05 101065 7413 64000 220000 156000 3.18
## kurtosis se
## X1 17.2 1761
l<-aggregate(Placed$salary,by=list(Sex=Placed$sex),mean)
names(l)[2] <- "Average Salary"
l
## Sex Average Salary
## 1 Female 98524
## 2 Male 104971
m<-aggregate(Placed$salary,by=list(Age=Placed$age),mean)
names(m)[2] <- "Average Salary"
m
## Age Average Salary
## 1 22 85000
## 2 23 91651
## 3 24 101519
## 4 25 99087
## 5 26 101665
## 6 27 102214
## 7 28 103625
## 8 29 102083
## 9 30 109917
## 10 31 100500
## 11 32 107300
## 12 33 118000
## 13 34 105000
## 14 39 112000
## 15 40 183000
This gives a deep insights of the salary of the placed students. Also it gives the mean salary as per Gender and Age.
Also lets find the correlation between the different factors affeccting the Salary of students.
scatterplotMatrix(~Placed$gmat_tot+Placed$work_yrs+Placed$sex+Placed$age+Placed$salary, diagonal="histogram")
dd3 <- subset(Placed,select=c(age,gmat_tot,gmat_qpc,gmat_vpc,
gmat_tpc, s_avg, f_avg, quarter,work_yrs,
salary,satis))
corrs <- cor(dd3, use="pairwise.complete.obs")
corrs
## age gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg
## age 1.0000 -0.0787 -0.16504 0.0180 -0.0961 0.1565 -0.2170
## gmat_tot -0.0787 1.0000 0.66638 0.7804 0.9668 0.1720 0.1225
## gmat_qpc -0.1650 0.6664 1.00000 0.0947 0.6587 0.0155 0.0984
## gmat_vpc 0.0180 0.7804 0.09467 1.0000 0.7844 0.1587 0.0229
## gmat_tpc -0.0961 0.9668 0.65865 0.7844 1.0000 0.1394 0.0705
## s_avg 0.1565 0.1720 0.01547 0.1587 0.1394 1.0000 0.4459
## f_avg -0.2170 0.1225 0.09842 0.0229 0.0705 0.4459 1.0000
## quarter -0.1257 -0.1058 0.01265 -0.1286 -0.0996 -0.8404 -0.4314
## work_yrs 0.8805 -0.1228 -0.18270 -0.0281 -0.1325 0.1633 -0.2163
## salary 0.4996 -0.0907 0.01414 -0.1374 -0.1320 0.1017 -0.1060
## satis 0.1083 0.0647 -0.00398 0.1486 0.1163 -0.1436 -0.1177
## quarter work_yrs salary satis
## age -0.1257 0.8805 0.4996 0.10832
## gmat_tot -0.1058 -0.1228 -0.0907 0.06474
## gmat_qpc 0.0126 -0.1827 0.0141 -0.00398
## gmat_vpc -0.1286 -0.0281 -0.1374 0.14863
## gmat_tpc -0.0996 -0.1325 -0.1320 0.11631
## s_avg -0.8404 0.1633 0.1017 -0.14357
## f_avg -0.4314 -0.2163 -0.1060 -0.11773
## quarter 1.0000 -0.1290 -0.1285 0.22512
## work_yrs -0.1290 1.0000 0.4547 0.06300
## salary -0.1285 0.4547 1.0000 -0.04005
## satis 0.2251 0.0630 -0.0401 1.00000
From above Correlaion it can seen that Salary is only positively correlated with Age, S_avg and Work experience variables
lets have a look at the box plots for the variables in the Placed Dataset.
boxplot(Placed$salary~Placed$work_yrs,horizontal=TRUE,
col=c("red","blue","yellow","orange","green"),
main="Salary based on Work Experience",
xlab="Salary of MBA Students",
ylab="Work Exp.")
boxplot(Placed$salary~Placed$gmat_tot,horizontal=TRUE,
col=c("red","blue","yellow","orange","green"),
main="Salary based on Total Gmat Score",
xlab="Salary of MBA Students",ylab="Total GMAT Score")
This boxplots gives fair idea about the mean and outliers in salaries based on GMAT scores and work experience.
par(mfrow=c(3,1))
boxplot(Placed$salary~Placed$sex,horizontal=TRUE,
col=c("Yellow","darkorange"),
main="Gender Based Salary Distribution",las=1,
xlab="Salary of MBA Students",
ylab="Sex")
boxplot(Placed$salary~Placed$quarter,horizontal=TRUE,
col=c("red","blue","yellow","orange4"),las=1,
main="Quarter Ranking Based Salary Distribution")
boxplot(Placed$salary~Placed$frstlang,
horizontal=TRUE,las=1,col=c("limegreen","Purple4"),
main="First Language Based Salary Distribution")
lets study the effect of Gender and work experience on getting placed, first by getting the number and then by testing it.
mytable <- aggregate(Placed$salary,by=list(Gender=Placed$sex),mean)
names(mytable)[2]<- "Average Salary"
mytable
## Gender Average Salary
## 1 Female 98524
## 2 Male 104971
mytable2 <- aggregate(Placed$salary,by=list(Work_Exp=Placed$work_yrs),mean)
names(mytable2)[2]<- "Average Salary"
mytable2
## Work_Exp Average Salary
## 1 0 95000
## 2 1 103532
## 3 2 97674
## 4 3 101653
## 5 4 105455
## 6 5 103143
## 7 6 105929
## 8 7 98000
## 9 8 105025
## 10 10 118000
## 11 15 183000
## 12 16 108500
Now some scatter plots fro the students who are placed.
scatterplot(salary ~age, data=Placed,
spread=FALSE, smoother.args=list(lty=2),
main="Scatter plot of salary vs age",
xlab="age",
ylab="salary")
scatterplot(salary ~sex, data=Placed,
spread=FALSE, smoother.args=list(lty=2),
main="Scatter plot of salary vs sex",
xlab="sex",
ylab="salary")
## [1] "256" "67" "274" "257" "68" "69" "138" "139" "273"
scatterplot(salary ~work_yrs, data=Placed,
main="Scatter plot of salary vs Work exp.",
xlab="Work experience in years",
ylab="salary")
Lets build a regression model to find the factors that can statistically affect the salary of MBA students as a whole and individually.
mode1 <- (salary ~ age + sex +gmat_tot+ gmat_qpc + gmat_vpc + gmat_tpc + s_avg + f_avg + quarter+work_yrs + satis)
modulus11 <- lm(salary ~ age + sex +gmat_tot+ gmat_qpc + gmat_vpc + gmat_tpc + s_avg + f_avg + quarter+work_yrs + satis,data=ms1 )
summary(modulus11)
##
## Call:
## lm(formula = salary ~ age + sex + gmat_tot + gmat_qpc + gmat_vpc +
## gmat_tpc + s_avg + f_avg + quarter + work_yrs + satis, data = ms1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77512 -42013 -4615 43171 202697
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 218082.68 73827.61 2.95 0.0034 **
## age -3942.41 1538.32 -2.56 0.0109 *
## sexMale -1628.38 6813.43 -0.24 0.8113
## gmat_tot -286.07 207.32 -1.38 0.1688
## gmat_qpc 346.71 576.23 0.60 0.5479
## gmat_vpc 343.23 523.70 0.66 0.5128
## gmat_tpc 499.36 413.89 1.21 0.2287
## s_avg 13153.80 12852.57 1.02 0.3070
## f_avg -6364.97 6625.26 -0.96 0.3376
## quarter -5377.62 4037.59 -1.33 0.1841
## work_yrs 2976.66 1751.80 1.70 0.0905 .
## satis -47.43 7.77 -6.10 3.7e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47100 on 262 degrees of freedom
## Multiple R-squared: 0.182, Adjusted R-squared: 0.147
## F-statistic: 5.28 on 11 and 262 DF, p-value: 1.54e-07
from above F-test, we could infer that only age is a significant parameter rest all are statistically insignificant.
library(leaps)
## Warning: package 'leaps' was built under R version 3.4.3
leap1 <- regsubsets(mode1, data = ms1, nbest=1)
summary(leap1)
## Subset selection object
## Call: regsubsets.formula(mode1, data = ms1, nbest = 1)
## 11 Variables (and intercept)
## Forced in Forced out
## age FALSE FALSE
## sexMale FALSE FALSE
## gmat_tot FALSE FALSE
## gmat_qpc FALSE FALSE
## gmat_vpc FALSE FALSE
## gmat_tpc FALSE FALSE
## s_avg FALSE FALSE
## f_avg FALSE FALSE
## quarter FALSE FALSE
## work_yrs FALSE FALSE
## satis FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## age sexMale gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg
## 1 ( 1 ) " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " " " " " " " "
## 5 ( 1 ) "*" " " "*" " " " " " " " " " "
## 6 ( 1 ) "*" " " "*" " " " " "*" " " " "
## 7 ( 1 ) "*" " " "*" " " " " "*" "*" " "
## 8 ( 1 ) "*" " " "*" " " " " "*" "*" "*"
## quarter work_yrs satis
## 1 ( 1 ) " " " " "*"
## 2 ( 1 ) "*" " " "*"
## 3 ( 1 ) "*" " " "*"
## 4 ( 1 ) "*" "*" "*"
## 5 ( 1 ) "*" "*" "*"
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) "*" "*" "*"
## 8 ( 1 ) "*" "*" "*"
plot(leap1, scale="adjr2")
library(coefplot)
## Warning: package 'coefplot' was built under R version 3.4.3
coefplot(modulus11, intercept= FALSE, outerCI=1.96,coefficients=c("age","sex","work_yrs", "gmat_tot","quarter","satis"))
Removing all the statistically insignificant variables, for better model with improved R squared and adjusted R squared
modulus11 <- lm(salary ~ age,data=Placed )
summary(modulus11)
##
## Call:
## lm(formula = salary ~ age, data = Placed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31454 -8533 -2182 4546 80886
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29963 12698 2.36 0.02 *
## age 2729 471 5.80 7.7e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15600 on 101 degrees of freedom
## Multiple R-squared: 0.25, Adjusted R-squared: 0.242
## F-statistic: 33.6 on 1 and 101 DF, p-value: 7.75e-08
Although this model has a p-value less than 0.05, it isn’t a very good model.