Because many surveys use graduate pay as a large component of their MBA program ratings, the factors on which students’ starting pay depends should be reported. Starting pay also plays a major role in analysing students’ degree of satisfaction with the MBA program. Variables such as GMAT scores, GMAT percentiles, spring average, fall average, years of work experience, quartile ranking, first language and gender can have a positive, a negative or no effect on starting salary and on the degree of satisfaction.
mbasal.df <- read.csv("MBA Starting Salaries Data.csv")
View(mbasal.df)
summary(mbasal.df)
## age sex gmat_tot gmat_qpc
## Min. :22.00 Min. :1.000 Min. :450.0 Min. :28.00
## 1st Qu.:25.00 1st Qu.:1.000 1st Qu.:580.0 1st Qu.:72.00
## Median :27.00 Median :1.000 Median :620.0 Median :83.00
## Mean :27.36 Mean :1.248 Mean :619.5 Mean :80.64
## 3rd Qu.:29.00 3rd Qu.:1.000 3rd Qu.:660.0 3rd Qu.:93.00
## Max. :48.00 Max. :2.000 Max. :790.0 Max. :99.00
## gmat_vpc gmat_tpc s_avg f_avg
## Min. :16.00 Min. : 0.0 Min. :2.000 Min. :0.000
## 1st Qu.:71.00 1st Qu.:78.0 1st Qu.:2.708 1st Qu.:2.750
## Median :81.00 Median :87.0 Median :3.000 Median :3.000
## Mean :78.32 Mean :84.2 Mean :3.025 Mean :3.062
## 3rd Qu.:91.00 3rd Qu.:94.0 3rd Qu.:3.300 3rd Qu.:3.250
## Max. :99.00 Max. :99.0 Max. :4.000 Max. :4.000
## quarter work_yrs frstlang salary
## Min. :1.000 Min. : 0.000 Min. :1.000 Min. : 0
## 1st Qu.:1.250 1st Qu.: 2.000 1st Qu.:1.000 1st Qu.: 0
## Median :2.000 Median : 3.000 Median :1.000 Median : 999
## Mean :2.478 Mean : 3.872 Mean :1.117 Mean : 39026
## 3rd Qu.:3.000 3rd Qu.: 4.000 3rd Qu.:1.000 3rd Qu.: 97000
## Max. :4.000 Max. :22.000 Max. :2.000 Max. :220000
## satis
## Min. : 1.0
## 1st Qu.: 5.0
## Median : 6.0
## Mean :172.2
## 3rd Qu.: 7.0
## Max. :998.0
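The summary above reports a mean satis of 172.2 and a median salary of 999, which suggests that 998 and 999 are placeholder codes rather than real values (the code later filters them out of salary). A minimal sketch of one way to recode such values to NA before summarising, using a small made-up data frame; `toy`, `recode_missing` and the values are hypothetical and not part of the original analysis:

```r
# Toy data with hypothetical values; 998 and 999 stand in for the placeholder codes.
toy <- data.frame(
  salary = c(0, 85000, 999, 998, 105000),
  satis  = c(5, 6, 4, 998, 7)
)

# Replace the placeholder codes with NA so they are excluded from summaries.
recode_missing <- function(x, codes = c(998, 999)) {
  x[x %in% codes] <- NA
  x
}
toy_clean <- as.data.frame(lapply(toy, recode_missing))

mean(toy$satis)                      # distorted by the 998 code
mean(toy_clean$satis, na.rm = TRUE)  # mean of the real ratings only
```

With the codes recoded, `summary()` reports them as NA counts instead of folding them into the mean and quartiles.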
# Drop the rows whose salary is one of the placeholder codes 998 or 999
newdata1 <- mbasal.df[which(mbasal.df$salary != 998 & mbasal.df$salary != 999), ]
hist(newdata1$salary, breaks=5,col="purple",xlab="starting salary", main="Salary distribution")
##Histograms
hist(mbasal.df$age, breaks = 5,col="Blue",main="Histogram of Age")
hist(mbasal.df$gmat_tpc, breaks = 5,col="Red",main="Histogram of Gmat Total %le")
hist(mbasal.df$s_avg, breaks = 5,col="Green",main="Histogram of spring MBA average")
hist(mbasal.df$f_avg, breaks = 5,col="pink",main="Histogram of fall MBA average")
# sex is coded 1 = male, 2 = female in this dataset's data dictionary
boxplot(salary ~ sex, data=mbasal.df, horizontal=TRUE, yaxt="n",
ylab="Gender", xlab="Salary",
main="Comparison of Salaries of Males and Females")
axis(side=2, at=c(1,2), labels=c("Males", "Females"))
boxplot(mbasal.df$salary ~ mbasal.df$frstlang, data=mbasal.df, horizontal=TRUE, yaxt="n",
ylab="First language", xlab="Salary",
main="Comparison of Salaries based on their first language")
axis(side=2, at=c(1,2), labels=c("english", "others"))
boxplot(mbasal.df$salary ~ mbasal.df$quarter, data=mbasal.df, horizontal=TRUE, yaxt="n",
ylab="quartile rating", xlab="Salary",
main="Comparison of Salaries of Quartile ranking")
axis(side=2, at=c(1,2,3,4), labels=c("first", "second", "third", "fourth"))
boxplot(mbasal.df$salary ~ mbasal.df$satis, data=mbasal.df, horizontal=TRUE, yaxt="n",
ylab="Degree of satisfaction", xlab="Salary",
main="Comparison of Salaries on degree of satisfaction of students")
axis(side=2, at=c(1,2,3,4,5,6,7,8), labels=c("1", "2","3","4","5","6","7","no"))
library(car)
scatterplot(salary ~age, data=newdata1,
spread=FALSE, smoother.args=list(lty=2),
main="Scatter plot of salary vs age",
xlab="age",
ylab="salary")
scatterplot(salary ~sex, data=newdata1,
spread=FALSE, smoother.args=list(lty=2),
main="Scatter plot of salary vs sex",
xlab="sex",
ylab="salary")
scatterplot(salary ~frstlang, data=newdata1,
main="Scatter plot of salary vs first language",
xlab="first language",
ylab="salary")
scatterplot(salary ~gmat_tot, data=newdata1,
spread=FALSE, smoother.args=list(lty=2),
main="Scatter plot of salary vs Gmat total",
xlab="Gmat score",
ylab="salary")
Salary vs work experience
library(car)
scatterplot(salary ~ work_yrs | sex ,data=mbasal.df, main="Scatterplot of Salary with Work Experience", xlab="Work Experience", ylab="Starting Salaries")
Salary vs Degree of satisfaction
scatterplot(salary ~ satis ,data=mbasal.df, main="Scatterplot of Salary with Degree of satisfaction", xlab="Degree of satisfaction", ylab="Starting Salaries")
Salary vs GMAT total score
scatterplot(salary ~ gmat_tot ,data=mbasal.df, main="Scatterplot of Salary with GMAT Total Score", xlab="GMAT Total Score", ylab="Starting Salaries")
library(corrgram)
corrgram(newdata1, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="MBA starting salary analysis Correlogram")
x <- newdata1[,c("age", "gmat_tot", "gmat_qpc", "gmat_vpc","gmat_tpc","s_avg","f_avg","work_yrs","salary")]
y <- newdata1[,c("age", "gmat_tot", "gmat_qpc", "gmat_vpc","gmat_tpc","s_avg","f_avg","work_yrs","salary")]
cov(x,y)
## age gmat_tot gmat_qpc gmat_vpc gmat_tpc
## age 1.778562e+01 -29.954933 -14.089729 -0.4564443 -7.5127645
## gmat_tot -2.995493e+01 3196.950561 636.350928 685.4644322 672.4651878
## gmat_qpc -1.408973e+01 636.350928 229.384067 42.7985481 141.4933074
## gmat_vpc -4.564443e-01 685.464432 42.798548 259.2695920 149.8747571
## gmat_tpc -7.512764e+00 672.465188 141.493307 149.8747571 183.0113882
## s_avg 2.626913e-01 3.076706 0.109287 1.1636153 0.9688199
## f_avg -7.513817e-02 2.969557 1.025241 0.2769703 0.7718585
## work_yrs 1.355880e+01 -36.222204 -13.484078 -2.4562014 -8.2897776
## salary -2.918528e+04 -170.881369 22855.717832 2901.3078044 43822.5291991
## s_avg f_avg work_yrs salary
## age 0.2626913 -0.07513817 1.355880e+01 -2.918528e+04
## gmat_tot 3.0767055 2.96955689 -3.622220e+01 -1.708814e+02
## gmat_qpc 0.1092870 1.02524072 -1.348408e+01 2.285572e+04
## gmat_vpc 1.1636153 0.27697026 -2.456201e+00 2.901308e+03
## gmat_tpc 0.9688199 0.77185854 -8.289778e+00 4.382253e+04
## s_avg 0.1436561 0.10251263 2.224652e-01 1.940528e+03
## f_avg 0.1025126 0.26995964 -9.189254e-02 2.443157e+02
## work_yrs 0.2224652 -0.09189254 1.360379e+01 -1.044263e+04
## salary 1940.5276360 244.31568869 -1.044263e+04 2.825177e+09
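Covariances depend on the units of each variable, so the salary row above dominates simply because salary is measured in much larger numbers. A small sketch, on made-up values rather than the MBA file, of how a covariance matrix can be rescaled to correlations with base R's `cov2cor()`; the data frame `toy` is hypothetical:

```r
# Hypothetical values for illustration only.
toy <- data.frame(
  work_yrs = c(1, 2, 3, 4, 6, 8),
  salary   = c(85000, 90000, 96000, 98000, 105000, 120000)
)

cov_mat <- cov(toy)          # scale-dependent
cor_mat <- cov2cor(cov_mat)  # rescaled to the range [-1, 1]; same result as cor(toy)

cov_mat
cor_mat
```

Correlations make it easier to compare the strength of association across variables with very different scales.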
mba1=mbasal.df[which(mbasal.df$salary!=0),]
View(mba1)
table(mba1$salary,mba1$satis)
##
## 1 2 3 4 5 6 7 998
## 998 0 0 0 0 0 0 0 46
## 999 1 1 4 12 9 7 1 0
## 64000 0 0 0 0 0 0 1 0
## 77000 0 0 0 0 0 1 0 0
## 78256 0 0 0 0 1 0 0 0
## 82000 0 0 0 0 0 0 1 0
## 85000 0 0 0 0 1 3 0 0
## 86000 0 0 0 0 2 0 0 0
## 88000 0 0 0 0 0 0 1 0
## 88500 0 0 0 0 0 1 0 0
## 90000 0 0 0 0 2 0 1 0
## 92000 0 0 0 0 1 1 1 0
## 93000 0 0 0 0 1 2 0 0
## 95000 0 0 1 1 1 2 2 0
## 96000 0 0 0 0 1 1 2 0
## 96500 0 0 0 0 0 1 0 0
## 97000 0 0 0 0 0 1 1 0
## 98000 0 0 0 0 2 5 3 0
## 99000 0 0 0 0 0 1 0 0
## 100000 0 0 0 0 1 6 2 0
## 100400 0 0 0 0 0 0 1 0
## 101000 0 0 0 0 1 1 0 0
## 101100 0 0 0 0 0 1 0 0
## 101600 0 0 0 0 0 1 0 0
## 102500 0 0 0 0 1 0 0 0
## 103000 0 0 0 0 0 1 0 0
## 104000 0 0 0 0 1 1 0 0
## 105000 0 0 0 0 4 6 1 0
## 106000 0 0 0 0 0 2 1 0
## 107000 0 0 0 0 1 0 0 0
## 107300 0 0 0 0 0 0 1 0
## 107500 0 0 0 0 1 0 0 0
## 108000 0 0 0 0 0 2 0 0
## 110000 0 0 0 0 1 0 0 0
## 112000 0 0 0 0 0 2 1 0
## 115000 0 0 0 0 3 2 0 0
## 118000 0 0 0 0 0 0 1 0
## 120000 0 0 0 0 2 2 0 0
## 126710 0 0 0 0 0 1 0 0
## 130000 0 0 0 0 0 0 1 0
## 145800 0 0 0 0 0 1 0 0
## 146000 0 0 0 0 0 1 0 0
## 162000 0 0 0 0 1 0 0 0
## 220000 0 0 0 0 0 1 0 0
table(mba1$s_avg,mba1$f_avg)
##
## 0 2 2.25 2.33 2.5 2.67 2.75 2.8 2.83 3 3.17 3.2 3.25 3.33 3.4 3.5
## 2.2 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0
## 2.3 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0
## 2.4 0 1 1 0 2 0 3 0 0 0 0 0 0 0 0 0
## 2.45 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
## 2.5 0 0 1 0 3 0 4 0 0 2 0 0 0 0 0 0
## 2.6 0 0 0 0 4 0 3 0 0 3 0 0 0 0 0 0
## 2.67 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2.7 0 0 0 0 3 0 6 0 0 6 0 1 3 0 0 0
## 2.73 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## 2.8 0 0 0 0 0 0 3 0 0 5 0 0 2 0 0 0
## 2.9 0 0 0 0 0 0 5 1 0 6 0 0 6 1 0 1
## 2.91 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 4 0 0 6 0 0 4 0 0 0
## 3.09 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1
## 3.1 0 0 0 1 0 1 0 0 0 5 0 0 2 1 0 3
## 3.18 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## 3.2 0 0 0 0 0 0 0 0 0 4 0 0 6 0 1 1
## 3.27 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## 3.3 0 0 0 0 0 0 0 0 0 2 0 0 9 0 0 5
## 3.4 0 0 0 0 0 0 0 1 0 1 0 0 3 0 0 0
## 3.45 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 3.5 0 0 0 0 0 1 0 0 0 2 0 0 3 0 0 4
## 3.56 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3.6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4
## 3.7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3.8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2
## 4 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## 3.6 3.67 3.75 4
## 2.2 0 0 0 0
## 2.3 0 0 0 0
## 2.4 0 0 0 0
## 2.45 0 0 0 0
## 2.5 0 0 0 0
## 2.6 0 0 0 0
## 2.67 0 0 0 0
## 2.7 0 0 0 0
## 2.73 0 0 0 0
## 2.8 0 0 0 0
## 2.9 0 0 0 0
## 2.91 0 0 0 0
## 3 0 0 0 0
## 3.09 0 0 0 0
## 3.1 0 0 1 0
## 3.18 0 0 0 0
## 3.2 0 0 0 0
## 3.27 0 0 0 0
## 3.3 0 0 0 0
## 3.4 0 1 2 1
## 3.45 0 1 0 0
## 3.5 1 0 1 2
## 3.56 0 0 0 1
## 3.6 0 1 2 0
## 3.7 1 0 0 1
## 3.8 0 0 0 1
## 4 0 0 0 1
table(mba1$gmat_tot,mba1$gmat_qpc)
##
## 39 43 46 48 49 50 52 53 55 56 57 60 64 65 66 67 68 71 72 74 75 77 78
## 450 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 460 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## 500 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
## 520 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 530 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## 540 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## 550 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 0
## 560 1 0 0 0 0 0 3 0 1 0 1 1 1 0 0 0 1 0 0 0 1 0 0
## 570 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 1 1 0 1 0 0
## 580 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 3 0 0 0 1
## 590 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 1 0 0 0 0
## 600 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 1 0 0 4 0
## 610 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0
## 620 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1
## 630 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 0 1 0 0
## 640 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## 650 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 660 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## 670 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 680 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 690 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 700 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 710 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 720 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 730 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 740 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 790 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## 79 81 82 83 84 85 87 88 89 90 91 92 93 94 95 96 97 98 99
## 450 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 460 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 500 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 520 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 530 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 540 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 550 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 560 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
## 570 0 0 2 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0
## 580 2 0 0 2 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0
## 590 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0
## 600 0 0 1 0 1 0 1 0 2 0 1 0 0 0 0 0 1 0 1
## 610 0 0 1 1 0 0 1 0 2 0 0 0 0 0 0 0 0 0 0
## 620 1 1 1 0 1 1 1 1 2 0 0 0 1 0 0 0 2 0 0
## 630 3 0 1 2 1 1 2 0 0 0 0 0 2 0 0 1 0 0 0
## 640 2 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0
## 650 2 0 1 0 0 0 1 0 3 0 1 0 1 0 1 0 0 0 1
## 660 0 1 0 1 1 0 0 1 0 1 1 0 0 0 2 0 1 0 1
## 670 0 0 0 3 1 0 2 0 0 0 1 0 0 0 1 0 2 1 2
## 680 1 0 0 0 1 0 1 0 0 0 1 1 0 0 0 2 1 0 1
## 690 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 1
## 700 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0
## 710 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 1 0 0 1
## 720 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0
## 730 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## 740 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3
## 790 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
table(mba1$gmat_vpc,mba1$gmat_tpc)
##
## 0 37 44 51 52 58 61 62 65 68 69 71 72 73 75 77 78 79 80 81 83 84 85
## 16 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 22 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 30 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 33 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## 37 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## 41 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0
## 45 0 0 0 1 0 0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0
## 46 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 50 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0
## 54 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
## 58 0 0 0 0 0 0 0 0 0 0 2 0 2 0 2 0 0 1 0 1 1 0 0
## 62 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 2 0 0
## 63 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 67 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 1 0 1 0 1
## 71 1 0 0 0 0 0 0 0 0 0 0 0 2 0 1 0 5 0 0 0 0 0 0
## 74 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0
## 75 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 78 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 2 0
## 81 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 1 0 0 2 1 2 0
## 82 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## 84 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0
## 85 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 87 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## 89 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## 90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 91 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0
## 92 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 93 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 95 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## 96 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 97 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 98 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 99 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## 86 87 88 89 90 91 92 93 94 95 96 97 98 99
## 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 33 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 37 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 41 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 45 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 46 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 50 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 54 0 1 0 0 0 0 0 0 0 0 0 0 0 0
## 58 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 62 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 63 0 0 1 0 0 0 0 1 0 0 0 0 0 0
## 67 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 71 1 1 0 1 0 1 0 0 0 1 0 0 0 0
## 74 0 5 0 0 0 0 0 0 0 0 1 0 0 0
## 75 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## 78 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## 81 2 1 0 0 1 1 0 1 1 2 2 0 0 0
## 82 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 84 1 1 0 2 0 1 0 3 1 0 0 0 0 0
## 85 0 0 0 1 0 0 0 0 0 0 2 0 0 0
## 87 1 0 1 4 1 0 0 1 0 0 1 2 0 0
## 89 0 1 0 1 1 0 0 1 0 1 0 1 0 0
## 90 0 0 0 0 0 0 0 0 0 1 0 1 0 0
## 91 0 0 0 2 0 0 0 2 0 0 0 0 0 0
## 92 0 0 0 0 0 0 1 0 1 0 0 0 0 1
## 93 1 0 0 1 0 1 0 1 2 1 0 1 1 0
## 95 0 0 0 2 0 1 0 1 1 2 0 1 3 0
## 96 0 1 0 0 0 0 0 0 0 1 2 0 1 0
## 97 0 0 0 0 0 0 0 0 0 0 1 0 0 1
## 98 1 1 0 0 0 0 1 0 0 1 2 0 4 5
## 99 0 0 0 0 0 0 0 0 0 1 1 0 0 2
chisq.test(mba1$salary,mba1$satis)
## Warning in chisq.test(mba1$salary, mba1$satis): Chi-squared approximation
## may be incorrect
##
## Pearson's Chi-squared test
##
## data: mba1$salary and mba1$satis
## X-squared = 391.04, df = 301, p-value = 0.0003578
chisq.test(mba1$s_avg,mba1$f_avg)
## Warning in chisq.test(mba1$s_avg, mba1$f_avg): Chi-squared approximation
## may be incorrect
##
## Pearson's Chi-squared test
##
## data: mba1$s_avg and mba1$f_avg
## X-squared = 1033.1, df = 494, p-value < 2.2e-16
chisq.test(mba1$gmat_tot,mba1$gmat_qpc)
## Warning in chisq.test(mba1$gmat_tot, mba1$gmat_qpc): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mba1$gmat_tot and mba1$gmat_qpc
## X-squared = 1559.3, df = 1066, p-value < 2.2e-16
chisq.test(mba1$gmat_vpc,mba1$gmat_tpc)
## Warning in chisq.test(mba1$gmat_vpc, mba1$gmat_tpc): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mba1$gmat_vpc and mba1$gmat_tpc
## X-squared = 1790.8, df = 1152, p-value < 2.2e-16
The results of the Chi-square tests are read as showing that age, GMAT percentile, work experience and first language are associated with starting salary (i.e., p < 0.05), whereas sex, average GPA for the spring and fall semesters, quartile ranking and satisfaction with the degree are not (p > 0.05). This, however, contrasts with the plots observed earlier. Because R warns that the Chi-squared approximation may be incorrect for every test above (most cells in these tables hold very few observations), the p-values should be treated with caution.
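The "Chi-squared approximation may be incorrect" warnings arise because the tables cross many distinct values and most cells are nearly empty. One possible way around this, sketched here on simulated data, is to bin the continuous variable and request a Monte Carlo p-value through the `simulate.p.value` argument of `chisq.test()`. The data frame `toy`, the two salary bands and the choice of `B = 2000` are assumptions made for illustration:

```r
set.seed(42)

# Simulated data; not the MBA file.
toy <- data.frame(
  salary = round(rnorm(60, mean = 100000, sd = 15000)),
  satis  = sample(5:7, 60, replace = TRUE)
)

# Bin salary at its median so each cell of the table holds a reasonable count.
toy$salary_band <- cut(
  toy$salary,
  breaks = quantile(toy$salary, probs = c(0, 0.5, 1)),
  include.lowest = TRUE,
  labels = c("lower half", "upper half")
)

tab <- table(toy$salary_band, toy$satis)
tab

# Monte Carlo p-value; it does not rely on the large-sample approximation.
res <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
res$p.value
```

Binning loses some information, so the number of bands is a trade-off between cell counts and resolution.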
t.test(mba1$age, mba1$salary)
##
## Welch Two Sample t-test
##
## data: mba1$age and mba1$salary
## t = -15.005, df = 183, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -65725.43 -50449.67
## sample estimates:
## mean of x mean of y
## 26.79348 58114.34239
t.test(mba1$sex, mba1$salary)
##
## Welch Two Sample t-test
##
## data: mba1$sex and mba1$salary
## t = -15.012, df = 183, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -65750.98 -50475.22
## sample estimates:
## mean of x mean of y
## 1.244565 58114.342391
t.test(mba1$gmat_tpc, mba1$salary)
##
## Welch Two Sample t-test
##
## data: mba1$gmat_tpc and mba1$salary
## t = -14.99, df = 183, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -65667.09 -50391.33
## sample estimates:
## mean of x mean of y
## 85.13043 58114.34239
t.test(mba1$s_avg,mba1$f_avg)
##
## Welch Two Sample t-test
##
## data: mba1$s_avg and mba1$f_avg
## t = -0.81877, df = 339.4, p-value = 0.4135
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.13110158 0.05403637
## sample estimates:
## mean of x mean of y
## 3.022554 3.061087
In the t-tests against salary above, the alternative hypothesis is stated as “true difference in means is not equal to 0” and we get p < 2.2e-16, which is read as meaning that all of these factors influence starting salary. This is contrary to our analysis using the plots as well as the Chi-square tests.
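A two-sample t-test of a predictor column against the salary column compares the means of the two columns, which differ mainly because the variables are measured on different scales. To ask whether salary differs between two groups, the formula interface of `t.test()` can be used instead. A minimal sketch on made-up values; the data frame `toy` is hypothetical:

```r
# Hypothetical salaries for two groups coded 1 and 2.
toy <- data.frame(
  sex    = c(1, 1, 1, 1, 2, 2, 2, 2),
  salary = c(95000, 100000, 105000, 98000, 96000, 101000, 99000, 104000)
)

# Welch two-sample t-test of salary between the two groups.
res <- t.test(salary ~ sex, data = toy)
res$estimate  # mean salary in each group
res$p.value
```

For a continuous predictor such as age or GMAT percentile, a correlation test or a regression is the more natural tool.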
Correlation between salary and satis.
cor.test(mbasal.df$salary,mbasal.df$satis)
##
## Pearson's product-moment correlation
##
## data: mbasal.df$salary and mbasal.df$satis
## t = -5.8681, df = 272, p-value = 1.279e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4363825 -0.2256820
## sample estimates:
## cor
## -0.3352171
Correlation between salary and gmat_tot.
cor.test(mbasal.df$gmat_tot,mbasal.df$salary)
##
## Pearson's product-moment correlation
##
## data: mbasal.df$gmat_tot and mbasal.df$salary
## t = -0.90799, df = 272, p-value = 0.3647
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17234911 0.06394461
## sample estimates:
## cor
## -0.05497188
Correlation between salary and gmat_qpc.
cor.test(mbasal.df$gmat_qpc,mbasal.df$salary)
##
## Pearson's product-moment correlation
##
## data: mbasal.df$gmat_qpc and mbasal.df$salary
## t = -0.72691, df = 272, p-value = 0.4679
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.16168920 0.07485761
## sample estimates:
## cor
## -0.04403293
In this model we regress salary on the remaining columns.
Model1 <- salary~age+sex+gmat_tot+gmat_vpc+gmat_qpc+gmat_tpc+work_yrs+frstlang+satis+s_avg+f_avg
fit1 <- lm(Model1, data = mbasal.df)
summary(fit1)
##
## Call:
## lm(formula = Model1, data = mbasal.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77221 -41250 -2975 44176 202527
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 170674.41 66986.28 2.548 0.01141 *
## age -4206.94 1557.09 -2.702 0.00735 **
## sex 2182.40 6858.18 0.318 0.75057
## gmat_tot -272.05 209.71 -1.297 0.19569
## gmat_vpc 309.40 551.11 0.561 0.57499
## gmat_qpc 324.04 579.47 0.559 0.57650
## gmat_tpc 481.98 417.04 1.156 0.24885
## work_yrs 3219.85 1769.12 1.820 0.06990 .
## frstlang -2287.14 10365.65 -0.221 0.82554
## satis -46.94 7.85 -5.979 7.32e-09 ***
## s_avg 24684.19 9459.12 2.610 0.00959 **
## f_avg -5950.72 6639.67 -0.896 0.37095
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47210 on 262 degrees of freedom
## Multiple R-squared: 0.1762, Adjusted R-squared: 0.1416
## F-statistic: 5.093 on 11 and 262 DF, p-value: 3.187e-07
Now we find the best predictors
library(leaps)
leap1 <- regsubsets(Model1, data = mbasal.df, nbest=1)
# summary(leap1)
plot(leap1, scale="adjr2")
Model 2 predicts the starting salary of MBA graduates, “salary”, as a function of the following explanatory variables: “age”, “work_yrs”, “satis” and “s_avg”.
Model2 <- salary~age+work_yrs+satis+s_avg
fit2 <- lm(Model2, data = mbasal.df)
summary(fit2)
##
## Call:
## lm(formula = Model2, data = mbasal.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73301 -42350 -5053 44553 200136
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 96159.410 40784.640 2.358 0.01910 *
## age -4594.481 1501.256 -3.060 0.00243 **
## work_yrs 3758.222 1714.644 2.192 0.02925 *
## satis -47.568 7.718 -6.164 2.59e-09 ***
## s_avg 20558.544 7549.300 2.723 0.00689 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46990 on 269 degrees of freedom
## Multiple R-squared: 0.1618, Adjusted R-squared: 0.1493
## F-statistic: 12.98 on 4 and 269 DF, p-value: 1.114e-09
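When a smaller model is nested inside a larger one, as Model 2 is inside Model 1, the two fits can be compared directly. A sketch of such a comparison with base R's `anova()` and `AIC()`, run on simulated data; the data frame `toy`, its coefficients and the model names `reduced` and `full` are assumptions for illustration:

```r
set.seed(7)
n <- 80

# Simulated data in which only work_yrs truly drives salary.
toy <- data.frame(
  age      = sample(22:40, n, replace = TRUE),
  work_yrs = sample(0:10, n, replace = TRUE),
  f_avg    = round(runif(n, 2, 4), 2)
)
toy$salary <- 60000 + 1500 * toy$work_yrs + rnorm(n, sd = 5000)

reduced <- lm(salary ~ work_yrs, data = toy)
full    <- lm(salary ~ work_yrs + age + f_avg, data = toy)

# F-test: do the extra terms in the full model reduce the residual error?
cmp <- anova(reduced, full)
cmp

# Lower AIC indicates the better trade-off between fit and complexity.
AIC(reduced, full)
```

A large p-value in the F-test suggests the dropped predictors add little, which supports keeping the smaller model.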
Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes).
Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.
Data set before cleaning process
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2018 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
mbasal<-subset(mbasal.df, salary != 998 & salary != 999)
missmap(mbasal, main = "Missing value before Cleaning",horizontal=FALSE)
Creating a new job column
mbasal.df$job <- ifelse(mbasal.df$salary > 0,1, 0)
Now we remove the salary column from the data
mbasal.df$salary <- NULL
trainingSet <- mbasal.df[1:153,]
testSet <- mbasal.df[154:193,]
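Taking the first rows as the training set assumes that the file is in random order. A sketch of an alternative that first drops the 998/999 codes, builds the binary outcome and then splits at random, shown on simulated data; `toy`, `answered`, `toy_train`, `toy_test`, the 80/20 ratio and the seed are all hypothetical choices:

```r
set.seed(123)

# Simulated data; 998 and 999 stand in for the placeholder codes.
toy <- data.frame(
  age    = sample(22:40, 50, replace = TRUE),
  salary = sample(c(0, 998, 999, 85000, 100000), 50, replace = TRUE)
)

# Keep only rows with a real salary value (0 means no job).
answered <- subset(toy, !(salary %in% c(998, 999)))
answered$job <- ifelse(answered$salary > 0, 1, 0)
answered$salary <- NULL

# Random 80/20 split into training and test sets.
train_idx <- sample(seq_len(nrow(answered)), size = floor(0.8 * nrow(answered)))
toy_train <- answered[train_idx, ]
toy_test  <- answered[-train_idx, ]

nrow(toy_train)
nrow(toy_test)
```

Fixing the seed keeps the split reproducible between runs.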
Analysis of variance (ANOVA) is a collection of statistical models and their associated procedures (such as “variation” among and between groups) used to analyze the differences among group means. ANOVA was developed by statistician and evolutionary biologist Ronald Fisher.
newmodel <- glm(formula = job ~ age+gmat_tpc+gmat_qpc+gmat_vpc+frstlang+quarter, family = binomial(link = "logit"), data = trainingSet)
anova(newmodel,test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: job
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 152 198.67
## age 1 4.5067 151 194.16 0.0337618 *
## gmat_tpc 1 0.0013 150 194.16 0.9708961
## gmat_qpc 1 0.0911 149 194.07 0.7628236
## gmat_vpc 1 0.2083 148 193.86 0.6480689
## frstlang 1 0.3541 147 193.51 0.5518005
## quarter 1 11.1143 146 182.39 0.0008567 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
newmodel <- glm(formula = job ~ age+quarter, family = binomial(link = "logit"), data = trainingSet)
Age and quartile ranking are the significant predictors in the model.
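Coefficients of a logistic regression are on the log-odds scale, so exponentiating them gives odds ratios that are easier to interpret. A minimal sketch on simulated data; the data frame `toy` and the coefficients used to generate it are invented for illustration and say nothing about the MBA data:

```r
set.seed(2024)
n <- 200

# Simulated predictors and a hypothetical relationship with the outcome.
toy <- data.frame(
  age     = sample(22:40, n, replace = TRUE),
  quarter = sample(1:4, n, replace = TRUE)
)
p_job   <- plogis(2 - 0.05 * toy$age - 0.4 * toy$quarter)
toy$job <- rbinom(n, 1, p_job)

fit <- glm(job ~ age + quarter, family = binomial(link = "logit"), data = toy)

# Odds ratio per one-unit increase in each predictor.
odds_ratios <- exp(coef(fit))
odds_ratios
```

An odds ratio below 1 means the odds of the outcome fall as the predictor increases, and one above 1 means they rise.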
Cumulative Gains and Lift Charts. Lift is a measure of the effectiveness of a predictive model calculated as the ratio between the results obtained with and without the predictive model. Cumulative gains and lift charts are visual aids for measuring model performance. Both charts consist of a lift curve and a baseline.
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
p <- predict(newmodel,testSet,type='response')
pr <- prediction(p, testSet$job)
library(gplots)
lift.obj <- performance(pr, measure="lift", x.measure="rpp")
plot(lift.obj,main="Lift Chart",xlab="% Populations",ylab="Lift", col="blue")
abline(1,0,col="red")
The model (blue line) plotted in the lift chart gives a large predictive lift over the baseline (red line). At the level where a naive approach would produce a 20% rate of positive prediction, the model charted in blue produces about 5 times the baseline.
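To make the definition of lift concrete, it can be computed by hand for a small example: rank the cases by predicted score, take the top fraction, and divide the positive rate in that fraction by the overall positive rate. The scores and labels below are made up for illustration:

```r
# Hypothetical predicted probabilities and true outcomes for 10 cases.
scores <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05)
labels <- c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0)

ord   <- order(scores, decreasing = TRUE)
top_k <- 2  # top 20% of the 10 cases

base_rate <- mean(labels)                # overall positive rate: 0.4
top_rate  <- mean(labels[ord][1:top_k])  # positive rate in the top 20%: 1
lift_top20 <- top_rate / base_rate       # 2.5
lift_top20
```

Since the positive rate in any slice cannot exceed 1, lift is bounded above by 1 divided by the overall positive rate (2.5 in this example).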
lift.obj <- performance(pr, "tpr", x.measure="rpp")
plot(lift.obj,main="Gain Chart",xlab="Rate of positive prediction",ylab="True positive rate", col="green")
abline(0,1,col="red")
The gap between the green line (response with the predictive model) and the red line (response without the predictive model) indicates the gain obtained by using the predictive model.
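The gain curve can also be traced by hand: after ranking cases by predicted score, the cumulative share of all positives captured is plotted against the share of cases selected. A small sketch with made-up scores and labels; all values here are hypothetical:

```r
# Hypothetical predicted probabilities and true outcomes for 10 cases.
scores <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05)
labels <- c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0)

ord <- order(scores, decreasing = TRUE)

# Share of cases selected (x-axis) and share of positives captured (y-axis).
rpp  <- seq_along(labels) / length(labels)
gain <- cumsum(labels[ord]) / sum(labels)

data.frame(rpp = rpp, gain = gain)
```

In this toy example the top 20% of cases capture half of all positives, whereas a random selection of 20% would be expected to capture 20%.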
The boxplots and scatterplots suggest that the starting salary of a student in the MBA program depends on the student’s first language and on the degree of satisfaction.
Most of the students who did not receive offers reported English as their first language. The average salary of students with another first language is slightly higher than that of students whose first language is English.
Logistic regression indicates that age is an important factor in determining whether a graduate has a job.
From the corrgram and the correlation matrix, it appears that starting salary is strongly related to first language.
From the Chi-squared tests and the t-tests on those who have a job and those who do not, we infer that there is a significant relationship between starting salary, degree of satisfaction with the MBA program and first language.
The regression model helps us conclude that salary level depends, positively or negatively, on: GMAT total, GMAT quantitative percentile, GMAT verbal percentile, years of work experience, degree of satisfaction, first language, quartile, spring average, fall average and GMAT overall percentile.