MBA Starting Salaries Data Analysis

About the project

Many surveys use graduates' pay as a large component of how they rate MBA programs, so the factors that starting salary depends on are worth reporting. Students' degree of satisfaction with the program also plays a major role in this analysis. Variables such as GMAT scores, GMAT percentiles, spring and fall averages, years of work experience, quartile ranking, first language and gender may relate positively, negatively or not at all to starting salary and to degree of satisfaction.

Reading and viewing the MBA dataset in R

mbasal.df <- read.csv("MBA Starting Salaries Data.csv")
View(mbasal.df)

Summarize the data to see the minimum, quartiles, median, mean and maximum of each variable

summary(mbasal.df)

##       age             sex           gmat_tot        gmat_qpc    
##  Min.   :22.00   Min.   :1.000   Min.   :450.0   Min.   :28.00  
##  1st Qu.:25.00   1st Qu.:1.000   1st Qu.:580.0   1st Qu.:72.00  
##  Median :27.00   Median :1.000   Median :620.0   Median :83.00  
##  Mean   :27.36   Mean   :1.248   Mean   :619.5   Mean   :80.64  
##  3rd Qu.:29.00   3rd Qu.:1.000   3rd Qu.:660.0   3rd Qu.:93.00  
##  Max.   :48.00   Max.   :2.000   Max.   :790.0   Max.   :99.00  
##     gmat_vpc        gmat_tpc        s_avg           f_avg      
##  Min.   :16.00   Min.   : 0.0   Min.   :2.000   Min.   :0.000  
##  1st Qu.:71.00   1st Qu.:78.0   1st Qu.:2.708   1st Qu.:2.750  
##  Median :81.00   Median :87.0   Median :3.000   Median :3.000  
##  Mean   :78.32   Mean   :84.2   Mean   :3.025   Mean   :3.062  
##  3rd Qu.:91.00   3rd Qu.:94.0   3rd Qu.:3.300   3rd Qu.:3.250  
##  Max.   :99.00   Max.   :99.0   Max.   :4.000   Max.   :4.000  
##     quarter         work_yrs         frstlang         salary      
##  Min.   :1.000   Min.   : 0.000   Min.   :1.000   Min.   :     0  
##  1st Qu.:1.250   1st Qu.: 2.000   1st Qu.:1.000   1st Qu.:     0  
##  Median :2.000   Median : 3.000   Median :1.000   Median :   999  
##  Mean   :2.478   Mean   : 3.872   Mean   :1.117   Mean   : 39026  
##  3rd Qu.:3.000   3rd Qu.: 4.000   3rd Qu.:1.000   3rd Qu.: 97000  
##  Max.   :4.000   Max.   :22.000   Max.   :2.000   Max.   :220000  
##      satis      
##  Min.   :  1.0  
##  1st Qu.:  5.0  
##  Median :  6.0  
##  Mean   :172.2  
##  3rd Qu.:  7.0  
##  Max.   :998.0
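
summary() reports the minimum, quartiles, median, mean and maximum, but not the standard deviation. The following is a minimal sketch of how the standard deviations could be obtained with sapply() and sd(). The data frame `toy` and its values are made up for illustration and only stand in for the numeric columns of mbasal.df.

```r
# Toy data standing in for a few numeric columns of mbasal.df (values are made up)
toy <- data.frame(
  age      = c(24, 26, 27, 29, 34),
  work_yrs = c(1, 2, 3, 4, 10)
)

# Apply sd() to every column; the result is a named numeric vector
col_sd <- sapply(toy, sd)
print(round(col_sd, 3))
```

On the real data the same idea would be `sapply(mbasal.df, sd)`, assuming every column is numeric as the summary output suggests.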

Starting Salary

# Drop the coded salary values 998 and 999 before plotting
newdata1 <- subset(mbasal.df, salary != 998 & salary != 999)
hist(newdata1$salary, breaks=5, col="purple", xlab="starting salary", main="Salary distribution")

Histograms

hist(mbasal.df$age, breaks = 5,col="Blue",main="Histogram of Age")

hist(mbasal.df$gmat_tpc, breaks = 5,col="Red",main="Histogram of GMAT total percentile")

hist(mbasal.df$s_avg, breaks = 5,col="Green",main="Histogram of spring MBA average")

hist(mbasal.df$f_avg, breaks = 5,col="pink",main="Histogram of fall MBA average")

Box plots

boxplot(mbasal.df$salary ~ mbasal.df$sex, data=mbasal.df, horizontal=TRUE, yaxt="n", 
        ylab="Gender", xlab="Salary",
        main="Comparison of Salaries of Males and Females")
# Assumes the coding 1 = male, 2 = female (about a quarter of the rows have sex = 2)
axis(side=2, at=c(1,2), labels=c("Males", "Females"))

boxplot(mbasal.df$salary ~ mbasal.df$frstlang, data=mbasal.df, horizontal=TRUE, yaxt="n", 
        ylab="First language", xlab="Salary",
        main="Comparison of Salaries based on their first language")
axis(side=2, at=c(1,2), labels=c("english", "others"))

boxplot(mbasal.df$salary ~ mbasal.df$quarter, data=mbasal.df, horizontal=TRUE, yaxt="n", 
        ylab="quartile rating", xlab="Salary",
        main="Comparison of Salaries of Quartile ranking")
axis(side=2, at=c(1,2,3,4), labels=c("first", "second","third","fourth"))

boxplot(mbasal.df$salary ~ mbasal.df$satis, data=mbasal.df, horizontal=TRUE, yaxt="n", 
        ylab="Degree of satisfaction", xlab="Salary",
        main="Comparison of Salaries on degree of satisfaction of students")
axis(side=2, at=c(1,2,3,4,5,6,7,8), labels=c("1", "2","3","4","5","6","7","no"))

Scatter Plots

library(car)    
scatterplot(salary ~age,     data=newdata1,
            spread=FALSE, smoother.args=list(lty=2),
            main="Scatter plot of salary vs age",
            xlab="age",
            ylab="salary")

scatterplot(salary ~sex,     data=newdata1,
            spread=FALSE, smoother.args=list(lty=2),
            main="Scatter plot of salary vs sex",
            xlab="sex",
            ylab="salary")

scatterplot(salary ~frstlang,     data=newdata1,
            main="Scatter plot of salary vs first language",
            xlab="first language",
            ylab="salary")

scatterplot(salary ~gmat_tot,     data=newdata1,
            spread=FALSE, smoother.args=list(lty=2),
            main="Scatter plot of salary vs Gmat total",
            xlab="Gmat score",
            ylab="salary")

Salary vs work experience

library(car)
scatterplot(salary ~ work_yrs | sex ,data=mbasal.df, main="Scatterplot of Salary with Work Experience", xlab="Work Experience", ylab="Starting Salaries")

Salary vs Degree of satisfaction

scatterplot(salary ~ satis ,data=mbasal.df, main="Scatterplot of Salary with Degree of satisfaction", xlab="Degree of satisfaction", ylab="Starting Salaries")

Salary vs GMAT total score

scatterplot(salary ~ gmat_tot ,data=mbasal.df, main="Scatterplot of Salary with GMAT Total Score", xlab="GMAT Total Score", ylab="Starting Salaries")

Corrgram

library(corrgram)
corrgram(newdata1, order=TRUE, lower.panel=panel.shade,
         upper.panel=panel.pie, text.panel=panel.txt,
         main="MBA starting salary analysis Correlogram")

Variance - Covariance Matrix

# cov() of a data frame with itself is the variance-covariance matrix of its columns
vars <- c("age", "gmat_tot", "gmat_qpc", "gmat_vpc", "gmat_tpc", "s_avg", "f_avg", "work_yrs", "salary")
cov(newdata1[, vars])

##                    age    gmat_tot     gmat_qpc     gmat_vpc      gmat_tpc
## age       1.778562e+01  -29.954933   -14.089729   -0.4564443    -7.5127645
## gmat_tot -2.995493e+01 3196.950561   636.350928  685.4644322   672.4651878
## gmat_qpc -1.408973e+01  636.350928   229.384067   42.7985481   141.4933074
## gmat_vpc -4.564443e-01  685.464432    42.798548  259.2695920   149.8747571
## gmat_tpc -7.512764e+00  672.465188   141.493307  149.8747571   183.0113882
## s_avg     2.626913e-01    3.076706     0.109287    1.1636153     0.9688199
## f_avg    -7.513817e-02    2.969557     1.025241    0.2769703     0.7718585
## work_yrs  1.355880e+01  -36.222204   -13.484078   -2.4562014    -8.2897776
## salary   -2.918528e+04 -170.881369 22855.717832 2901.3078044 43822.5291991
##                 s_avg        f_avg      work_yrs        salary
## age         0.2626913  -0.07513817  1.355880e+01 -2.918528e+04
## gmat_tot    3.0767055   2.96955689 -3.622220e+01 -1.708814e+02
## gmat_qpc    0.1092870   1.02524072 -1.348408e+01  2.285572e+04
## gmat_vpc    1.1636153   0.27697026 -2.456201e+00  2.901308e+03
## gmat_tpc    0.9688199   0.77185854 -8.289778e+00  4.382253e+04
## s_avg       0.1436561   0.10251263  2.224652e-01  1.940528e+03
## f_avg       0.1025126   0.26995964 -9.189254e-02  2.443157e+02
## work_yrs    0.2224652  -0.09189254  1.360379e+01 -1.044263e+04
## salary   1940.5276360 244.31568869 -1.044263e+04  2.825177e+09
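
Covariances are hard to compare because they carry the units of both variables (the variance of salary alone is about 2.8e9). A minimal sketch of how a covariance matrix can be rescaled to correlations with cov2cor(); it uses only the age and work_yrs entries printed above, typed in by hand, so the matrix `v` is an illustration rather than output computed from the data.

```r
# 2 x 2 block of the covariance matrix above (age and work_yrs), typed in by hand
v <- matrix(c(17.78562, 13.55880,
              13.55880, 13.60379),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("age", "work_yrs"), c("age", "work_yrs")))

# cov2cor() divides each covariance by the product of the two standard deviations
r <- cov2cor(v)
print(round(r, 2))   # off-diagonal entry is about 0.87
```

Read this way, age and years of work experience are strongly correlated, which is worth keeping in mind when both appear in the same regression.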

Creating a dataset without the students whose recorded salary is 0 (rows coded 998 and 999 are still included, as the first table below shows)

mba1 <- mbasal.df[which(mbasal.df$salary != 0), ]
View(mba1)

Contingency tables

table(mba1$salary,mba1$satis)

##         
##           1  2  3  4  5  6  7 998
##   998     0  0  0  0  0  0  0  46
##   999     1  1  4 12  9  7  1   0
##   64000   0  0  0  0  0  0  1   0
##   77000   0  0  0  0  0  1  0   0
##   78256   0  0  0  0  1  0  0   0
##   82000   0  0  0  0  0  0  1   0
##   85000   0  0  0  0  1  3  0   0
##   86000   0  0  0  0  2  0  0   0
##   88000   0  0  0  0  0  0  1   0
##   88500   0  0  0  0  0  1  0   0
##   90000   0  0  0  0  2  0  1   0
##   92000   0  0  0  0  1  1  1   0
##   93000   0  0  0  0  1  2  0   0
##   95000   0  0  1  1  1  2  2   0
##   96000   0  0  0  0  1  1  2   0
##   96500   0  0  0  0  0  1  0   0
##   97000   0  0  0  0  0  1  1   0
##   98000   0  0  0  0  2  5  3   0
##   99000   0  0  0  0  0  1  0   0
##   100000  0  0  0  0  1  6  2   0
##   100400  0  0  0  0  0  0  1   0
##   101000  0  0  0  0  1  1  0   0
##   101100  0  0  0  0  0  1  0   0
##   101600  0  0  0  0  0  1  0   0
##   102500  0  0  0  0  1  0  0   0
##   103000  0  0  0  0  0  1  0   0
##   104000  0  0  0  0  1  1  0   0
##   105000  0  0  0  0  4  6  1   0
##   106000  0  0  0  0  0  2  1   0
##   107000  0  0  0  0  1  0  0   0
##   107300  0  0  0  0  0  0  1   0
##   107500  0  0  0  0  1  0  0   0
##   108000  0  0  0  0  0  2  0   0
##   110000  0  0  0  0  1  0  0   0
##   112000  0  0  0  0  0  2  1   0
##   115000  0  0  0  0  3  2  0   0
##   118000  0  0  0  0  0  0  1   0
##   120000  0  0  0  0  2  2  0   0
##   126710  0  0  0  0  0  1  0   0
##   130000  0  0  0  0  0  0  1   0
##   145800  0  0  0  0  0  1  0   0
##   146000  0  0  0  0  0  1  0   0
##   162000  0  0  0  0  1  0  0   0
##   220000  0  0  0  0  0  1  0   0

table(mba1$s_avg,mba1$f_avg)

##       
##        0 2 2.25 2.33 2.5 2.67 2.75 2.8 2.83 3 3.17 3.2 3.25 3.33 3.4 3.5
##   2.2  0 1    0    0   0    0    0   0    0 1    0   0    0    0   0   0
##   2.3  0 0    1    0   1    0    0   0    0 0    0   0    0    0   0   0
##   2.4  0 1    1    0   2    0    3   0    0 0    0   0    0    0   0   0
##   2.45 0 0    0    0   0    0    1   0    0 0    0   0    0    0   0   0
##   2.5  0 0    1    0   3    0    4   0    0 2    0   0    0    0   0   0
##   2.6  0 0    0    0   4    0    3   0    0 3    0   0    0    0   0   0
##   2.67 1 0    0    0   0    0    0   0    0 0    0   0    0    0   0   0
##   2.7  0 0    0    0   3    0    6   0    0 6    0   1    3    0   0   0
##   2.73 0 0    0    0   0    0    0   0    0 0    1   0    0    0   0   0
##   2.8  0 0    0    0   0    0    3   0    0 5    0   0    2    0   0   0
##   2.9  0 0    0    0   0    0    5   1    0 6    0   0    6    1   0   1
##   2.91 0 0    0    0   0    0    0   0    1 0    0   0    0    0   0   0
##   3    0 0    0    0   0    0    4   0    0 6    0   0    4    0   0   0
##   3.09 0 0    0    0   0    0    0   0    0 1    0   0    0    0   0   1
##   3.1  0 0    0    1   0    1    0   0    0 5    0   0    2    1   0   3
##   3.18 0 0    0    0   0    0    0   0    0 0    0   0    1    0   0   0
##   3.2  0 0    0    0   0    0    0   0    0 4    0   0    6    0   1   1
##   3.27 0 0    0    0   0    0    0   0    0 0    0   0    1    0   0   0
##   3.3  0 0    0    0   0    0    0   0    0 2    0   0    9    0   0   5
##   3.4  0 0    0    0   0    0    0   1    0 1    0   0    3    0   0   0
##   3.45 0 0    0    0   0    0    0   0    0 0    0   0    0    0   0   1
##   3.5  0 0    0    0   0    1    0   0    0 2    0   0    3    0   0   4
##   3.56 0 0    0    0   0    0    0   0    0 0    0   0    0    0   0   0
##   3.6  0 0    0    0   0    0    0   0    0 0    0   0    0    0   0   4
##   3.7  0 0    0    0   0    0    0   0    0 0    0   0    0    0   0   0
##   3.8  0 0    0    0   0    0    0   0    0 0    0   0    0    0   0   2
##   4    1 0    0    0   0    0    0   0    0 0    0   0    0    0   0   0
##       
##        3.6 3.67 3.75 4
##   2.2    0    0    0 0
##   2.3    0    0    0 0
##   2.4    0    0    0 0
##   2.45   0    0    0 0
##   2.5    0    0    0 0
##   2.6    0    0    0 0
##   2.67   0    0    0 0
##   2.7    0    0    0 0
##   2.73   0    0    0 0
##   2.8    0    0    0 0
##   2.9    0    0    0 0
##   2.91   0    0    0 0
##   3      0    0    0 0
##   3.09   0    0    0 0
##   3.1    0    0    1 0
##   3.18   0    0    0 0
##   3.2    0    0    0 0
##   3.27   0    0    0 0
##   3.3    0    0    0 0
##   3.4    0    1    2 1
##   3.45   0    1    0 0
##   3.5    1    0    1 2
##   3.56   0    0    0 1
##   3.6    0    1    2 0
##   3.7    1    0    0 1
##   3.8    0    0    0 1
##   4      0    0    0 1

table(mba1$gmat_tot,mba1$gmat_qpc)

##      
##       39 43 46 48 49 50 52 53 55 56 57 60 64 65 66 67 68 71 72 74 75 77 78
##   450  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   460  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0
##   500  0  0  1  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  1
##   520  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   530  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0
##   540  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0
##   550  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  2  1  0  0  0
##   560  1  0  0  0  0  0  3  0  1  0  1  1  1  0  0  0  1  0  0  0  1  0  0
##   570  0  0  0  0  0  0  0  0  0  1  0  0  0  1  0  0  1  1  1  0  1  0  0
##   580  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  3  0  0  0  1
##   590  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  2  0  1  0  0  0  0
##   600  0  0  0  0  0  0  0  1  0  0  0  1  0  0  0  1  1  0  1  0  0  4  0
##   610  0  0  0  1  0  0  0  0  0  0  0  0  1  0  0  0  0  0  1  0  1  0  0
##   620  0  0  0  0  0  0  1  0  0  0  0  1  0  0  0  0  0  0  1  0  0  0  1
##   630  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  2  0  1  0  0
##   640  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0
##   650  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   660  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0
##   670  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   680  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   690  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   700  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   710  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   720  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   730  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   740  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   790  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##      
##       79 81 82 83 84 85 87 88 89 90 91 92 93 94 95 96 97 98 99
##   450  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   460  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   500  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   520  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   530  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   540  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   550  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   560  1  1  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0
##   570  0  0  2  0  0  0  0  0  1  0  0  0  1  0  1  0  0  0  0
##   580  2  0  0  2  0  0  0  0  1  0  1  0  0  0  0  0  0  0  0
##   590  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  1  0  0
##   600  0  0  1  0  1  0  1  0  2  0  1  0  0  0  0  0  1  0  1
##   610  0  0  1  1  0  0  1  0  2  0  0  0  0  0  0  0  0  0  0
##   620  1  1  1  0  1  1  1  1  2  0  0  0  1  0  0  0  2  0  0
##   630  3  0  1  2  1  1  2  0  0  0  0  0  2  0  0  1  0  0  0
##   640  2  0  0  0  0  0  1  0  1  0  0  0  1  0  0  0  0  0  0
##   650  2  0  1  0  0  0  1  0  3  0  1  0  1  0  1  0  0  0  1
##   660  0  1  0  1  1  0  0  1  0  1  1  0  0  0  2  0  1  0  1
##   670  0  0  0  3  1  0  2  0  0  0  1  0  0  0  1  0  2  1  2
##   680  1  0  0  0  1  0  1  0  0  0  1  1  0  0  0  2  1  0  1
##   690  0  0  0  0  0  0  1  0  0  0  0  0  0  1  0  1  0  0  1
##   700  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  0  0  1  0
##   710  0  0  0  0  0  0  0  0  0  0  0  0  2  0  2  1  0  0  1
##   720  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  0  0
##   730  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0
##   740  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  3
##   790  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1

table(mba1$gmat_vpc,mba1$gmat_tpc)

##     
##      0 37 44 51 52 58 61 62 65 68 69 71 72 73 75 77 78 79 80 81 83 84 85
##   16 0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   22 0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   30 0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   33 0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0
##   37 0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0
##   41 0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  1  0  0  0
##   45 0  0  0  1  0  0  0  1  1  0  1  0  1  0  0  0  0  0  0  0  1  0  0
##   46 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   50 0  0  0  0  0  0  0  0  0  1  0  1  0  0  0  0  0  0  1  0  0  0  0
##   54 0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0
##   58 0  0  0  0  0  0  0  0  0  0  2  0  2  0  2  0  0  1  0  1  1  0  0
##   62 0  0  0  0  0  0  1  0  1  0  0  0  0  0  1  0  0  0  0  0  2  0  0
##   63 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   67 0  0  0  0  0  1  0  0  0  0  0  0  1  0  0  0  0  1  1  0  1  0  1
##   71 1  0  0  0  0  0  0  0  0  0  0  0  2  0  1  0  5  0  0  0  0  0  0
##   74 0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  0  0  0  0  0  1  0  0
##   75 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   78 0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  2  0
##   81 0  0  0  0  0  0  0  0  0  0  0  0  3  0  0  0  1  0  0  2  1  2  0
##   82 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0
##   84 0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  1  1  0  0
##   85 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   87 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0
##   89 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0
##   90 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   91 0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  1  0  0
##   92 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   93 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   95 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0
##   96 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   97 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   98 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   99 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##     
##      86 87 88 89 90 91 92 93 94 95 96 97 98 99
##   16  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   22  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   30  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   33  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   37  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   41  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   45  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   46  1  0  0  0  0  0  0  0  0  0  0  0  0  0
##   50  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   54  0  1  0  0  0  0  0  0  0  0  0  0  0  0
##   58  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   62  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   63  0  0  1  0  0  0  0  1  0  0  0  0  0  0
##   67  1  0  0  0  0  0  0  0  0  0  0  0  0  0
##   71  1  1  0  1  0  1  0  0  0  1  0  0  0  0
##   74  0  5  0  0  0  0  0  0  0  0  1  0  0  0
##   75  0  0  0  0  0  1  0  0  0  0  0  0  0  0
##   78  0  0  0  0  0  1  0  0  0  0  0  0  0  0
##   81  2  1  0  0  1  1  0  1  1  2  2  0  0  0
##   82  0  0  0  0  0  0  0  0  0  0  0  0  0  0
##   84  1  1  0  2  0  1  0  3  1  0  0  0  0  0
##   85  0  0  0  1  0  0  0  0  0  0  2  0  0  0
##   87  1  0  1  4  1  0  0  1  0  0  1  2  0  0
##   89  0  1  0  1  1  0  0  1  0  1  0  1  0  0
##   90  0  0  0  0  0  0  0  0  0  1  0  1  0  0
##   91  0  0  0  2  0  0  0  2  0  0  0  0  0  0
##   92  0  0  0  0  0  0  1  0  1  0  0  0  0  1
##   93  1  0  0  1  0  1  0  1  2  1  0  1  1  0
##   95  0  0  0  2  0  1  0  1  1  2  0  1  3  0
##   96  0  1  0  0  0  0  0  0  0  1  2  0  1  0
##   97  0  0  0  0  0  0  0  0  0  0  1  0  0  1
##   98  1  1  0  0  0  0  1  0  0  1  2  0  4  5
##   99  0  0  0  0  0  0  0  0  0  1  1  0  0  2

Chi-square test

chisq.test(mba1$salary,mba1$satis)

## Warning in chisq.test(mba1$salary, mba1$satis): Chi-squared approximation
## may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  mba1$salary and mba1$satis
## X-squared = 391.04, df = 301, p-value = 0.0003578

chisq.test(mba1$s_avg,mba1$f_avg)

## Warning in chisq.test(mba1$s_avg, mba1$f_avg): Chi-squared approximation
## may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  mba1$s_avg and mba1$f_avg
## X-squared = 1033.1, df = 494, p-value < 2.2e-16

chisq.test(mba1$gmat_tot,mba1$gmat_qpc)

## Warning in chisq.test(mba1$gmat_tot, mba1$gmat_qpc): Chi-squared
## approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  mba1$gmat_tot and mba1$gmat_qpc
## X-squared = 1559.3, df = 1066, p-value < 2.2e-16

chisq.test(mba1$gmat_vpc,mba1$gmat_tpc)

## Warning in chisq.test(mba1$gmat_vpc, mba1$gmat_tpc): Chi-squared
## approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  mba1$gmat_vpc and mba1$gmat_tpc
## X-squared = 1790.8, df = 1152, p-value < 2.2e-16

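All four tests shown above reject independence at the 5% level: salary and degree of satisfaction (p = 0.0003578), spring and fall averages, GMAT total score and quantitative percentile, and GMAT verbal and overall percentiles (each p < 2.2e-16). Only the first of these involves starting salary. In every case R warns that the chi-squared approximation may be incorrect, because the tables cross many distinct values and most cells hold very small counts, so these p-values should be read with caution. The conclusions that age, GMAT percentile, work experience and first language affect starting salary (p < 0.05), and that sex, spring and fall averages and quartile ranking do not (p > 0.05), rest on tests whose output is not reproduced here. They are also in contrast with the results obtained from the plots that we observed earlier.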
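
The warnings above arise because each distinct salary or score becomes its own row or column, leaving most cells nearly empty. One common way around this is to group a continuous variable into a few bands before testing. The sketch below shows the idea on made-up data; the vectors `salary` and `satis` and the two-band split at the median are assumptions for illustration, not part of the original analysis.

```r
# Made-up data: 200 students sorted by salary (illustrative values only)
salary <- seq(80000, by = 100, length.out = 200)
satis  <- rep(c(5, 6, 7, 5, 6, 7), times = c(40, 40, 20, 20, 40, 40))

# Two salary bands split at the median, instead of one level per distinct salary
band <- cut(salary, breaks = c(-Inf, median(salary), Inf),
            labels = c("lower half", "upper half"))

tab <- table(band, satis)
res <- chisq.test(tab)

print(tab)
print(res$expected)   # expected counts per cell; all are 30 or 40 here
print(res)
```

With every expected count well above 5, the chi-squared approximation is reasonable and R no longer issues the warning.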

T-test

t.test(mba1$age, mba1$salary)

## 
##  Welch Two Sample t-test
## 
## data:  mba1$age and mba1$salary
## t = -15.005, df = 183, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -65725.43 -50449.67
## sample estimates:
##   mean of x   mean of y 
##    26.79348 58114.34239

t.test(mba1$sex, mba1$salary)

## 
##  Welch Two Sample t-test
## 
## data:  mba1$sex and mba1$salary
## t = -15.012, df = 183, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -65750.98 -50475.22
## sample estimates:
##    mean of x    mean of y 
##     1.244565 58114.342391

t.test(mba1$gmat_tpc, mba1$salary)

## 
##  Welch Two Sample t-test
## 
## data:  mba1$gmat_tpc and mba1$salary
## t = -14.99, df = 183, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -65667.09 -50391.33
## sample estimates:
##   mean of x   mean of y 
##    85.13043 58114.34239

t.test(mba1$s_avg,mba1$f_avg)

## 
##  Welch Two Sample t-test
## 
## data:  mba1$s_avg and mba1$f_avg
## t = -0.81877, df = 339.4, p-value = 0.4135
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.13110158  0.05403637
## sample estimates:
## mean of x mean of y 
##  3.022554  3.061087

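In all the above t-tests the alternative hypothesis is “true difference in means is not equal to 0”. The tests that pair a variable with salary give p < 2.2e-16, while the test of spring average against fall average gives p = 0.4135. A two-sample t-test of, say, age against salary only compares the mean of age (26.79) with the mean of salary (58114.34). A tiny p-value there says that the two means differ, which is expected for variables measured on different scales, and does not by itself show that the variable influences starting salary. This is why these results appear to contradict our analysis using the graphs as well as the chi-square tests.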
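
A t-test that asks whether salary differs between two groups compares salary across the levels of a grouping variable, rather than comparing two different columns. A minimal sketch using the formula interface of t.test(); the data frame `toy`, its values and the factor labels are made up for illustration.

```r
# Made-up salaries for two groups of five students each
toy <- data.frame(
  sex    = factor(rep(c("male", "female"), each = 5), levels = c("male", "female")),
  salary = c(95000, 100000, 105000,  98000, 102000,
             96000,  99000, 101000, 104000, 105000)
)

# Welch two-sample t-test of salary between the two levels of sex
res <- t.test(salary ~ sex, data = toy)
print(res$estimate)   # group means: 100000 and 101000
print(res$p.value)
```

Here the null hypothesis is that mean salary is the same in both groups, so a small p-value would point to a salary difference between them.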

Pearson's Correlation Test

Correlation between salary and satis.

cor.test(mbasal.df$salary,mbasal.df$satis)

## 
##  Pearson's product-moment correlation
## 
## data:  mbasal.df$salary and mbasal.df$satis
## t = -5.8681, df = 272, p-value = 1.279e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4363825 -0.2256820
## sample estimates:
##        cor 
## -0.3352171

Correlation between salary and gmat_tot.

cor.test(mbasal.df$gmat_tot,mbasal.df$salary)

## 
##  Pearson's product-moment correlation
## 
## data:  mbasal.df$gmat_tot and mbasal.df$salary
## t = -0.90799, df = 272, p-value = 0.3647
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17234911  0.06394461
## sample estimates:
##         cor 
## -0.05497188

Correlation between salary and gmat_qpc.

cor.test(mbasal.df$gmat_qpc,mbasal.df$salary)

## 
##  Pearson's product-moment correlation
## 
## data:  mbasal.df$gmat_qpc and mbasal.df$salary
## t = -0.72691, df = 272, p-value = 0.4679
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.16168920  0.07485761
## sample estimates:
##         cor 
## -0.04403293
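
The correlations above are computed on mbasal.df, which still contains the coded values 998 and 999. Such codes can dominate a correlation. The sketch below shows the effect on made-up data, under the assumption that 998 and 999 are placeholder codes rather than real salaries or satisfaction scores; the data frame `toy` is hypothetical.

```r
# Made-up data: five real observations plus two rows holding the code 998
toy <- data.frame(
  salary = c(90000, 95000, 100000, 105000, 110000, 998, 998),
  satis  = c(5, 5, 6, 6, 7, 998, 998)
)

# Correlation with the codes treated as ordinary numbers
r_raw <- cor(toy$salary, toy$satis)

# Recode the placeholder values to NA (assumption: they are not real measurements)
clean <- toy
clean$salary[clean$salary %in% c(998, 999)] <- NA
clean$satis[clean$satis == 998] <- NA

# Correlation using only complete rows
r_clean <- cor(clean$salary, clean$satis, use = "complete.obs")

print(c(raw = r_raw, clean = r_clean))   # raw is negative, clean is positive
```

In this toy case the sign of the correlation flips once the codes are removed, so the negative correlation between salary and satis reported above may partly reflect the coding.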

REGRESSION ANALYSIS

Model 1

In this model we regress salary on the other columns (all except quarter).

Model1 <- salary~age+sex+gmat_tot+gmat_vpc+gmat_qpc+gmat_tpc+work_yrs+frstlang+satis+s_avg+f_avg
fit1 <- lm(Model1, data = mbasal.df)
summary(fit1)

## 
## Call:
## lm(formula = Model1, data = mbasal.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -77221 -41250  -2975  44176 202527 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 170674.41   66986.28   2.548  0.01141 *  
## age          -4206.94    1557.09  -2.702  0.00735 ** 
## sex           2182.40    6858.18   0.318  0.75057    
## gmat_tot      -272.05     209.71  -1.297  0.19569    
## gmat_vpc       309.40     551.11   0.561  0.57499    
## gmat_qpc       324.04     579.47   0.559  0.57650    
## gmat_tpc       481.98     417.04   1.156  0.24885    
## work_yrs      3219.85    1769.12   1.820  0.06990 .  
## frstlang     -2287.14   10365.65  -0.221  0.82554    
## satis          -46.94       7.85  -5.979 7.32e-09 ***
## s_avg        24684.19    9459.12   2.610  0.00959 ** 
## f_avg        -5950.72    6639.67  -0.896  0.37095    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47210 on 262 degrees of freedom
## Multiple R-squared:  0.1762, Adjusted R-squared:  0.1416 
## F-statistic: 5.093 on 11 and 262 DF,  p-value: 3.187e-07

NOW WE FIND THE BEST PREDICTORS

library(leaps)
leap1 <- regsubsets(Model1, data = mbasal.df, nbest=1)
# summary(leap1)
plot(leap1, scale="adjr2")

Model 2

Model 2 predicts the starting salary of MBA graduates, “salary”, as a function of the following explanatory variables: “age”, “work_yrs”, “satis” and “s_avg”.

Model2 <- salary~age+work_yrs+satis+s_avg
fit2 <- lm(Model2, data = mbasal.df)
summary(fit2)

## 
## Call:
## lm(formula = Model2, data = mbasal.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -73301 -42350  -5053  44553 200136 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 96159.410  40784.640   2.358  0.01910 *  
## age         -4594.481   1501.256  -3.060  0.00243 ** 
## work_yrs     3758.222   1714.644   2.192  0.02925 *  
## satis         -47.568      7.718  -6.164 2.59e-09 ***
## s_avg       20558.544   7549.300   2.723  0.00689 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46990 on 269 degrees of freedom
## Multiple R-squared:  0.1618, Adjusted R-squared:  0.1493 
## F-statistic: 12.98 on 4 and 269 DF,  p-value: 1.114e-09
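
To see what the fitted coefficients of Model 2 imply, the prediction for one student can be worked out by hand. The sketch below copies the estimates printed above; the student profile (age 27, 3 years of work experience, satisfaction 6, spring average 3.0) is a hypothetical example.

```r
# Coefficient estimates copied from the summary(fit2) output above
coefs <- c("(Intercept)" = 96159.410,
           age           = -4594.481,
           work_yrs      =  3758.222,
           satis         =   -47.568,
           s_avg         = 20558.544)

# Hypothetical student; the leading 1 multiplies the intercept
student <- c("(Intercept)" = 1, age = 27, work_yrs = 3, satis = 6, s_avg = 3)

# Linear prediction: sum of coefficient times value
pred <- sum(coefs * student[names(coefs)])
print(round(pred))   # about 44773
```

The prediction is well below typical reported salaries because the model is fit on mbasal.df, which includes rows with salary 0 and the coded values 998 and 999.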

LOGISTIC REGRESSION

Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes).
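
As a brief illustration of the definition above, logistic regression passes a linear combination of the predictors through the logistic function to obtain a probability between 0 and 1. The coefficients `b0` and `b_age` below are hypothetical values chosen for the sketch, not estimates from this data.

```r
# Hypothetical coefficients (not fitted to the MBA data)
b0    <- -1.0
b_age <-  0.05

age <- c(22, 27, 40)

# Linear predictor (log-odds of the outcome)
eta <- b0 + b_age * age

# plogis() is the logistic function 1 / (1 + exp(-eta))
p_hat  <- plogis(eta)
manual <- 1 / (1 + exp(-eta))

print(round(p_hat, 3))
```

A linear predictor of 0 corresponds to a probability of exactly 0.5, and larger values push the probability towards 1.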

Data Cleaning Process

Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.

Data set before cleaning process

library(Amelia)

## Loading required package: Rcpp

## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2018 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##

mbasal<-subset(mbasal.df, salary != 998  & salary != 999)
missmap(mbasal, main = "Missing value before Cleaning",horizontal=FALSE)

Creating a new job column

mbasal.df$job <- ifelse(mbasal.df$salary > 0,1, 0)

Removing the salary column from the data

mbasal.df$salary <- NULL

Model fitting

trainingSet  <- mbasal.df[1:153,]
testSet  <- mbasal.df[154:193,]
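
The split above takes rows by position. If the file happens to be ordered in some way, a random split may give more representative training and test sets. A minimal sketch of that alternative on a made-up data frame; the 80/20 proportion, the seed and the data frame `toy` are assumptions for illustration.

```r
set.seed(42)   # fixed seed so the split is reproducible

# Made-up data frame standing in for mbasal.df
toy <- data.frame(id = 1:20, job = rep(c(0, 1), 10))

# Draw 80% of the row indices at random for training
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

print(c(train = nrow(train), test = nrow(test)))
```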

Anova Test

Analysis of variance (ANOVA) is a collection of statistical models and their associated procedures (such as “variation” among and between groups) used to analyze the differences among group means. ANOVA was developed by statistician and evolutionary biologist Ronald Fisher.

newmodel <- glm(formula = job ~ age+gmat_tpc+gmat_qpc+gmat_vpc+frstlang+quarter, family = binomial(link = "logit"), data = trainingSet)
anova(newmodel,test="Chisq")

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: job
## 
## Terms added sequentially (first to last)
## 
## 
##          Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                       152     198.67              
## age       1   4.5067       151     194.16 0.0337618 *  
## gmat_tpc  1   0.0013       150     194.16 0.9708961    
## gmat_qpc  1   0.0911       149     194.07 0.7628236    
## gmat_vpc  1   0.2083       148     193.86 0.6480689    
## frstlang  1   0.3541       147     193.51 0.5518005    
## quarter   1  11.1143       146     182.39 0.0008567 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

newmodel <- glm(formula = job ~ age+quarter, family = binomial(link = "logit"), data = trainingSet)

Age and quartile ranking are the significant predictors, so the model is refit above with only those two terms.

Lift Chart

Cumulative Gains and Lift Charts. Lift is a measure of the effectiveness of a predictive model calculated as the ratio between the results obtained with and without the predictive model. Cumulative gains and lift charts are visual aids for measuring model performance. Both charts consist of a lift curve and a baseline.

library(ROCR)

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

p <- predict(newmodel,testSet,type='response')
pr <- prediction(p, testSet$job)

library(gplots)
lift.obj <- performance(pr, measure="lift", x.measure="rpp")
plot(lift.obj,main="Lift Chart",xlab="% Populations",ylab="Lift", col="blue")
abline(1,0,col="red")

The fitted model (blue line) in the lift chart shows a clear predictive lift over the baseline (red line). At the point where a naive approach would give a 20% rate of positive predictions, the blue curve is about 5 times the baseline.
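
The lift value at a given rate of positive predictions can also be computed by hand: it is the share of true positives among the top-scored cases divided by the overall share of positives. The sketch below uses made-up scores and labels, and the helper `lift_at` is a hypothetical function written for this illustration, not part of ROCR.

```r
# Lift at a given fraction of the population, ranked by predicted score
lift_at <- function(scores, labels, frac) {
  n_top <- max(1, round(frac * length(scores)))            # size of the top group
  top   <- order(scores, decreasing = TRUE)[seq_len(n_top)]
  mean(labels[top]) / mean(labels)                         # precision in top group / base rate
}

# Made-up predicted probabilities and true outcomes (1 = positive)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05)
labels <- c(1,   1,   1,   0,   1,   0,   0,   0,   0,   0)

print(lift_at(scores, labels, 0.2))   # 2.5
```

In this toy case both of the top two cases are positive while the overall positive rate is 0.4, giving a lift of 2.5 at 20% of the population.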

Gain Chart

lift.obj <- performance(pr, "tpr", x.measure="rpp")
plot(lift.obj,main="Gain Chart",xlab="Rate of positive prediction",ylab="True positive rate", col="green")
abline(0,1,col="red")

The gap between the green line (response with the predictive model) and the red line (response without it) indicates the gain obtained by using the predictive model.
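
The points of a gain curve can be reproduced with a cumulative sum: after sorting cases by predicted score, the true positive rate at each cutoff is the number of positives captured so far divided by the total number of positives. A minimal sketch on made-up scores and labels, which are assumptions for illustration only.

```r
# Made-up predicted probabilities and true outcomes (1 = positive)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05)
labels <- c(1,   1,   1,   0,   1,   0,   0,   0,   0,   0)

ord <- order(scores, decreasing = TRUE)

# x axis: rate of positive predictions; y axis: true positive rate
rpp <- seq_along(ord) / length(ord)
tpr <- cumsum(labels[ord]) / sum(labels)

print(data.frame(rpp = rpp, tpr = tpr))
```

The diagonal baseline corresponds to tpr equal to rpp, so the gain at any cutoff is the vertical distance between the two.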

Summary

The box plots and scatter plots suggest that an MBA student's starting salary depends on the student's first language and on the degree of satisfaction with the program.

Most of the students who did not receive offers reported English as their first language. The average salary of students with another first language is slightly higher than that of students whose first language is English.

Logistic regression indicates that age is an important factor in determining whether a student got a job.

From the corrgram and the correlation matrix, starting salary appears to be strongly related to first language.

From the chi-squared tests, and from the t-tests on those who got a job and those who did not, there appears to be a significant relationship between starting salary, degree of satisfaction with the MBA program and first language.

The regression model helps us conclude that starting salary is related to GMAT total score, GMAT quantitative percentile, GMAT verbal percentile, years of work experience, degree of satisfaction, first language, quartile ranking, spring average, fall average and GMAT overall percentile.