Let us bring in the data for the MBA starting Salary casee study for detailed analysis and making the dataset readdy for analysis.

setwd("C:/Users/Rajat/Desktop/Internship IIM Lucknow/Datasets/Offlinel4")
ms1 <- read.csv(paste("MBA Starting Salaries Data.csv",sep=""))
ms1$sex[ms1$sex==1] <- "Male"
ms1$sex[ms1$sex==2] <- "Female"
ms1$sex= factor(ms1$sex)
ms1$frstlang[ms1$frstlang==1] <- "English"
ms1$frstlang[ms1$frstlang==2] <- "Other"
ms1$frstlang= factor(ms1$frstlang)
head(ms1)
##   age    sex gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg quarter
## 1  23 Female      620       77       87       87   3.4  3.00       1
## 2  24   Male      610       90       71       87   3.5  4.00       1
## 3  24   Male      670       99       78       95   3.3  3.25       1
## 4  24   Male      570       56       81       75   3.3  2.67       1
## 5  24 Female      710       93       98       98   3.6  3.75       1
## 6  24   Male      640       82       89       91   3.9  3.75       1
##   work_yrs frstlang salary satis
## 1        2  English      0     7
## 2        2  English      0     6
## 3        2  English      0     6
## 4        1  English      0     7
## 5        2  English    999     5
## 6        2  English      0     6

Lets have a look at the data types and summary statistics for the following dataset.

str(ms1)
## 'data.frame':    274 obs. of  13 variables:
##  $ age     : int  23 24 24 24 24 24 25 25 25 25 ...
##  $ sex     : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 2 2 1 2 2 ...
##  $ gmat_tot: int  620 610 670 570 710 640 610 650 630 680 ...
##  $ gmat_qpc: int  77 90 99 56 93 82 89 88 79 99 ...
##  $ gmat_vpc: int  87 71 78 81 98 89 74 89 91 81 ...
##  $ gmat_tpc: int  87 87 95 75 98 91 87 92 89 96 ...
##  $ s_avg   : num  3.4 3.5 3.3 3.3 3.6 3.9 3.4 3.3 3.3 3.45 ...
##  $ f_avg   : num  3 4 3.25 2.67 3.75 3.75 3.5 3.75 3.25 3.67 ...
##  $ quarter : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ work_yrs: int  2 2 2 1 2 2 2 2 2 2 ...
##  $ frstlang: Factor w/ 2 levels "English","Other": 1 1 1 1 1 1 1 1 2 1 ...
##  $ salary  : int  0 0 0 0 999 0 0 0 999 998 ...
##  $ satis   : int  7 6 6 7 5 6 5 6 4 998 ...

Now we will summarize the data

summary(ms1)
##       age            sex         gmat_tot        gmat_qpc    
##  Min.   :22.00   Female: 68   Min.   :450.0   Min.   :28.00  
##  1st Qu.:25.00   Male  :206   1st Qu.:580.0   1st Qu.:72.00  
##  Median :27.00                Median :620.0   Median :83.00  
##  Mean   :27.36                Mean   :619.5   Mean   :80.64  
##  3rd Qu.:29.00                3rd Qu.:660.0   3rd Qu.:93.00  
##  Max.   :48.00                Max.   :790.0   Max.   :99.00  
##     gmat_vpc        gmat_tpc        s_avg           f_avg      
##  Min.   :16.00   Min.   : 0.0   Min.   :2.000   Min.   :0.000  
##  1st Qu.:71.00   1st Qu.:78.0   1st Qu.:2.708   1st Qu.:2.750  
##  Median :81.00   Median :87.0   Median :3.000   Median :3.000  
##  Mean   :78.32   Mean   :84.2   Mean   :3.025   Mean   :3.062  
##  3rd Qu.:91.00   3rd Qu.:94.0   3rd Qu.:3.300   3rd Qu.:3.250  
##  Max.   :99.00   Max.   :99.0   Max.   :4.000   Max.   :4.000  
##     quarter         work_yrs         frstlang       salary      
##  Min.   :1.000   Min.   : 0.000   English:242   Min.   :     0  
##  1st Qu.:1.250   1st Qu.: 2.000   Other  : 32   1st Qu.:     0  
##  Median :2.000   Median : 3.000                 Median :   999  
##  Mean   :2.478   Mean   : 3.872                 Mean   : 39026  
##  3rd Qu.:3.000   3rd Qu.: 4.000                 3rd Qu.: 97000  
##  Max.   :4.000   Max.   :22.000                 Max.   :220000  
##      satis      
##  Min.   :  1.0  
##  1st Qu.:  5.0  
##  Median :  6.0  
##  Mean   :172.2  
##  3rd Qu.:  7.0  
##  Max.   :998.0

lets study some of the variables of MBA starting Salaries

describe(ms1$gmat_tot)
##    vars   n   mean    sd median trimmed  mad min max range  skew kurtosis
## X1    1 274 619.45 57.54    620  618.86 59.3 450 790   340 -0.01     0.06
##      se
## X1 3.48
describe(ms1$s_avg)
##    vars   n mean   sd median trimmed  mad min max range  skew kurtosis
## X1    1 274 3.03 0.38      3    3.03 0.44   2   4     2 -0.06    -0.38
##      se
## X1 0.02
describe(ms1$f_avg)
##    vars   n mean   sd median trimmed  mad min max range  skew kurtosis
## X1    1 274 3.06 0.53      3    3.09 0.37   0   4     4 -2.08    10.85
##      se
## X1 0.03

Here the average of Fall and Spring is out of 5. Now lets visualize some variables, lets start with age distribution.

hist(ms1$age, col="red",
     xlab="Age in years",breaks=22,xlim=c(20,50),
     main="Age distribution")

boxplot(ms1$gmat_tot,
        xlab="Gmat Total score",col="yellow",
        main="Boxplot of GMAT total Score", horizontal = TRUE)

boxplot(ms1$gmat_qpc,
        xlab="Gmat Quants score",col="darkolivegreen",
        main="Boxplot of GMAT Quants Score", horizontal = TRUE)

boxplot(ms1$gmat_vpc,
        xlab="Gmat Verbal score",col="darkorange",
        main="Boxplot of GMAT Verbal Score", horizontal = TRUE)

boxplot(ms1$s_avg,
        xlab="Average score in Spring",col="deeppink",
        main="Boxplot of Average Score in Spring", horizontal = TRUE)

boxplot(ms1$f_avg,
        xlab="Average score in Fall",col="firebrick1",
        main="Boxplot of Average Score in Fall", horizontal = TRUE)

boxplot(ms1$work_yrs,
        xlab="Work experience of Students in years",col="seagreen",
        main="Boxplot of Work experience in Years", horizontal = TRUE)

As seen from all the boxplots there are some outliers in some plots which may be due to diverse culture of students at the MBA program.

lets have a number of students having English having thier 1st language.

options(digits =3)
table(ms1$sex)
## 
## Female   Male 
##     68    206
table(ms1$frstlang)
## 
## English   Other 
##     242      32
y<-table(ms1$sex,ms1$frstlang)
prop.table(y)*100
##         
##          English Other
##   Female   21.90  2.92
##   Male     66.42  8.76
z <- table(ms1$satis)
prop.table(z)*100
## 
##      1      2      3      4      5      6      7    998 
##  0.365  0.365  1.825  6.204 27.007 35.401 12.044 16.788

Thus the proportion table shows that the satisfactory levels are high with around 75% responses above a response of 5. lets find a correlation matrix for all the numeric variables

dd2 <- subset(ms1,select=c(age,gmat_tot,gmat_qpc,gmat_vpc,
                           gmat_tpc, s_avg, f_avg, quarter,work_yrs,
                           salary,satis))
corrs <- cor(dd2, use="pairwise.complete.obs")
corrs
##              age gmat_tot gmat_qpc gmat_vpc gmat_tpc   s_avg   f_avg
## age       1.0000  -0.1459  -0.2162 -0.04418 -0.16990  0.1497 -0.0174
## gmat_tot -0.1459   1.0000   0.7247  0.74839  0.84780  0.1131  0.1044
## gmat_qpc -0.2162   0.7247   1.0000  0.15218  0.65138 -0.0298  0.0737
## gmat_vpc -0.0442   0.7484   0.1522  1.00000  0.66622  0.2045  0.0759
## gmat_tpc -0.1699   0.8478   0.6514  0.66622  1.00000  0.1174  0.0797
## s_avg     0.1497   0.1131  -0.0298  0.20445  0.11736  1.0000  0.5506
## f_avg    -0.0174   0.1044   0.0737  0.07592  0.07973  0.5506  1.0000
## quarter  -0.0497  -0.0922   0.0364 -0.17461 -0.08304 -0.7621 -0.4475
## work_yrs  0.8583  -0.1824  -0.2366 -0.06639 -0.17336  0.1293 -0.0391
## salary   -0.0626  -0.0550  -0.0440 -0.00614  0.00493  0.1458  0.0294
## satis    -0.1279   0.0826   0.0606  0.06262  0.09293 -0.0327  0.0109
##            quarter work_yrs   salary     satis
## age      -4.97e-02  0.85830 -0.06257 -1.28e-01
## gmat_tot -9.22e-02 -0.18235 -0.05497  8.26e-02
## gmat_qpc  3.64e-02 -0.23661 -0.04403  6.06e-02
## gmat_vpc -1.75e-01 -0.06639 -0.00614  6.26e-02
## gmat_tpc -8.30e-02 -0.17336  0.00493  9.29e-02
## s_avg    -7.62e-01  0.12929  0.14584 -3.27e-02
## f_avg    -4.48e-01 -0.03906  0.02944  1.09e-02
## quarter   1.00e+00 -0.08603 -0.16437 -1.27e-05
## work_yrs -8.60e-02  1.00000  0.00902 -1.09e-01
## salary   -1.64e-01  0.00902  1.00000 -3.35e-01
## satis    -1.27e-05 -0.10926 -0.33522  1.00e+00

Lets have a correlation Matrix for the entire dataset to get a visual representation of the correlations betweeen the variables.

par(mfrow=c(1,1))
 corrplot(corr=cor(ms1[,c(1,3:10,12,13)]),use="complete.obs",
                    method="ellipse")

lets first have a complete dataset of students that were placed and have mentioned there salary.

Placed <- ms1[which(ms1$salary>999),]
some(Placed)
##     age    sex gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg quarter
## 60   39   Male      600       72       81       83   3.6  3.50       1
## 66   30   Male      600       77       81       84   3.5  3.25       1
## 117  31   Male      540       60       62       65   3.1  3.00       2
## 128  25 Female      620       89       74       87   3.1  3.50       2
## 129  27   Male      620       97       63       88   3.2  3.00       2
## 133  34   Male      550       72       58       69   3.0  3.00       2
## 138  26   Male      650       89       87       93   3.2  3.25       2
## 198  28   Male      610       89       67       86   2.7  3.00       3
## 204  30   Male      670       83       97       96   2.8  2.75       3
## 258  25 Female      720       96       98       99   3.5  3.60       4
##     work_yrs frstlang salary satis
## 60        16  English 112000     7
## 66         5  English 120000     6
## 117        8  English  93000     6
## 128        2  English 101000     5
## 129        3  English 103000     6
## 133       16  English 105000     5
## 138        4  English 130000     7
## 198        4  English  98000     7
## 204        6  English 102500     5
## 258        3  English  85000     6

Lets find some interesting insights from the placed students dataset.

describe(Placed$salary)
##    vars   n   mean    sd median trimmed  mad   min    max  range skew
## X1    1 103 103031 17869  1e+05  101065 7413 64000 220000 156000 3.18
##    kurtosis   se
## X1     17.2 1761
l<-aggregate(Placed$salary,by=list(Sex=Placed$sex),mean)
names(l)[2] <- "Average Salary"
l
##      Sex Average Salary
## 1 Female          98524
## 2   Male         104971
m<-aggregate(Placed$salary,by=list(Age=Placed$age),mean)
names(m)[2] <- "Average Salary"
m
##    Age Average Salary
## 1   22          85000
## 2   23          91651
## 3   24         101519
## 4   25          99087
## 5   26         101665
## 6   27         102214
## 7   28         103625
## 8   29         102083
## 9   30         109917
## 10  31         100500
## 11  32         107300
## 12  33         118000
## 13  34         105000
## 14  39         112000
## 15  40         183000

This gives a deep insights of the salary of the placed students. Also it gives the mean salary as per Gender and Age.

Also lets find the correlation between the different factors affeccting the Salary of students.

scatterplotMatrix(~Placed$gmat_tot+Placed$work_yrs+Placed$sex+Placed$age+Placed$salary,                           diagonal="histogram")

dd3 <- subset(Placed,select=c(age,gmat_tot,gmat_qpc,gmat_vpc,
                           gmat_tpc, s_avg, f_avg, quarter,work_yrs,
                           salary,satis))
corrs <- cor(dd3, use="pairwise.complete.obs")
corrs
##              age gmat_tot gmat_qpc gmat_vpc gmat_tpc   s_avg   f_avg
## age       1.0000  -0.0787 -0.16504   0.0180  -0.0961  0.1565 -0.2170
## gmat_tot -0.0787   1.0000  0.66638   0.7804   0.9668  0.1720  0.1225
## gmat_qpc -0.1650   0.6664  1.00000   0.0947   0.6587  0.0155  0.0984
## gmat_vpc  0.0180   0.7804  0.09467   1.0000   0.7844  0.1587  0.0229
## gmat_tpc -0.0961   0.9668  0.65865   0.7844   1.0000  0.1394  0.0705
## s_avg     0.1565   0.1720  0.01547   0.1587   0.1394  1.0000  0.4459
## f_avg    -0.2170   0.1225  0.09842   0.0229   0.0705  0.4459  1.0000
## quarter  -0.1257  -0.1058  0.01265  -0.1286  -0.0996 -0.8404 -0.4314
## work_yrs  0.8805  -0.1228 -0.18270  -0.0281  -0.1325  0.1633 -0.2163
## salary    0.4996  -0.0907  0.01414  -0.1374  -0.1320  0.1017 -0.1060
## satis     0.1083   0.0647 -0.00398   0.1486   0.1163 -0.1436 -0.1177
##          quarter work_yrs  salary    satis
## age      -0.1257   0.8805  0.4996  0.10832
## gmat_tot -0.1058  -0.1228 -0.0907  0.06474
## gmat_qpc  0.0126  -0.1827  0.0141 -0.00398
## gmat_vpc -0.1286  -0.0281 -0.1374  0.14863
## gmat_tpc -0.0996  -0.1325 -0.1320  0.11631
## s_avg    -0.8404   0.1633  0.1017 -0.14357
## f_avg    -0.4314  -0.2163 -0.1060 -0.11773
## quarter   1.0000  -0.1290 -0.1285  0.22512
## work_yrs -0.1290   1.0000  0.4547  0.06300
## salary   -0.1285   0.4547  1.0000 -0.04005
## satis     0.2251   0.0630 -0.0401  1.00000

From above Correlaion it can seen that Salary is only positively correlated with Age, S_avg and Work experience variables

lets have a look at the box plots for the variables in the Placed Dataset.

boxplot(Placed$salary~Placed$work_yrs,horizontal=TRUE,
        col=c("red","blue","yellow","orange","green"),
        main="Salary based on Work Experience",
        xlab="Salary of MBA Students",
        ylab="Work Exp.")

boxplot(Placed$salary~Placed$gmat_tot,horizontal=TRUE,
        col=c("red","blue","yellow","orange","green"),
        main="Salary based on Total Gmat Score",
        xlab="Salary of MBA Students",ylab="Total GMAT Score")

This boxplots gives fair idea about the mean and outliers in salaries based on GMAT scores and work experience.

par(mfrow=c(3,1))
boxplot(Placed$salary~Placed$sex,horizontal=TRUE,
        col=c("Yellow","darkorange"),
        main="Gender Based Salary Distribution",las=1,
        xlab="Salary of MBA Students",
        ylab="Sex")

boxplot(Placed$salary~Placed$quarter,horizontal=TRUE,
        col=c("red","blue","yellow","orange4"),las=1,
        main="Quarter Ranking Based Salary Distribution")

boxplot(Placed$salary~Placed$frstlang,
        horizontal=TRUE,las=1,col=c("limegreen","Purple4"),
        main="First Language Based Salary Distribution")

lets study the effect of Gender and work experience on getting placed, first by getting the number and then by testing it.

mytable <- aggregate(Placed$salary,by=list(Gender=Placed$sex),mean)
names(mytable)[2]<- "Average Salary"
mytable
##   Gender Average Salary
## 1 Female          98524
## 2   Male         104971
mytable2 <- aggregate(Placed$salary,by=list(Work_Exp=Placed$work_yrs),mean)
names(mytable2)[2]<- "Average Salary"
mytable2
##    Work_Exp Average Salary
## 1         0          95000
## 2         1         103532
## 3         2          97674
## 4         3         101653
## 5         4         105455
## 6         5         103143
## 7         6         105929
## 8         7          98000
## 9         8         105025
## 10       10         118000
## 11       15         183000
## 12       16         108500

Now some scatter plots fro the students who are placed.

scatterplot(salary ~age,     data=Placed,
            spread=FALSE, smoother.args=list(lty=2),
            main="Scatter plot of salary vs age",
            xlab="age",
            ylab="salary")

scatterplot(salary ~sex,     data=Placed,
            spread=FALSE, smoother.args=list(lty=2),
            main="Scatter plot of salary vs sex",
            xlab="sex",
            ylab="salary")

## [1] "256" "67"  "274" "257" "68"  "69"  "138" "139" "273"
scatterplot(salary ~work_yrs,     data=Placed,
            main="Scatter plot of salary vs Work exp.",
            xlab="Work experience in years",
            ylab="salary")

Lets build a regression model to find the factors that can statistically affect the salary of MBA students as a whole and individually.

mode1 <- (salary ~ age + sex +gmat_tot+ gmat_qpc +  gmat_vpc + gmat_tpc + s_avg +                              f_avg + quarter+work_yrs + satis)
modulus11 <- lm(salary ~ age + sex +gmat_tot+ gmat_qpc +  gmat_vpc + gmat_tpc + s_avg +                              f_avg + quarter+work_yrs + satis,data=ms1 )
summary(modulus11)
## 
## Call:
## lm(formula = salary ~ age + sex + gmat_tot + gmat_qpc + gmat_vpc + 
##     gmat_tpc + s_avg + f_avg + quarter + work_yrs + satis, data = ms1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -77512 -42013  -4615  43171 202697 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 218082.68   73827.61    2.95   0.0034 ** 
## age          -3942.41    1538.32   -2.56   0.0109 *  
## sexMale      -1628.38    6813.43   -0.24   0.8113    
## gmat_tot      -286.07     207.32   -1.38   0.1688    
## gmat_qpc       346.71     576.23    0.60   0.5479    
## gmat_vpc       343.23     523.70    0.66   0.5128    
## gmat_tpc       499.36     413.89    1.21   0.2287    
## s_avg        13153.80   12852.57    1.02   0.3070    
## f_avg        -6364.97    6625.26   -0.96   0.3376    
## quarter      -5377.62    4037.59   -1.33   0.1841    
## work_yrs      2976.66    1751.80    1.70   0.0905 .  
## satis          -47.43       7.77   -6.10  3.7e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47100 on 262 degrees of freedom
## Multiple R-squared:  0.182,  Adjusted R-squared:  0.147 
## F-statistic: 5.28 on 11 and 262 DF,  p-value: 1.54e-07

from above F-test, we could infer that only age is a significant parameter rest all are statistically insignificant.

library(leaps)
## Warning: package 'leaps' was built under R version 3.4.3
leap1 <- regsubsets(mode1, data = ms1, nbest=1)
summary(leap1)
## Subset selection object
## Call: regsubsets.formula(mode1, data = ms1, nbest = 1)
## 11 Variables  (and intercept)
##          Forced in Forced out
## age          FALSE      FALSE
## sexMale      FALSE      FALSE
## gmat_tot     FALSE      FALSE
## gmat_qpc     FALSE      FALSE
## gmat_vpc     FALSE      FALSE
## gmat_tpc     FALSE      FALSE
## s_avg        FALSE      FALSE
## f_avg        FALSE      FALSE
## quarter      FALSE      FALSE
## work_yrs     FALSE      FALSE
## satis        FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          age sexMale gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg
## 1  ( 1 ) " " " "     " "      " "      " "      " "      " "   " "  
## 2  ( 1 ) " " " "     " "      " "      " "      " "      " "   " "  
## 3  ( 1 ) "*" " "     " "      " "      " "      " "      " "   " "  
## 4  ( 1 ) "*" " "     " "      " "      " "      " "      " "   " "  
## 5  ( 1 ) "*" " "     "*"      " "      " "      " "      " "   " "  
## 6  ( 1 ) "*" " "     "*"      " "      " "      "*"      " "   " "  
## 7  ( 1 ) "*" " "     "*"      " "      " "      "*"      "*"   " "  
## 8  ( 1 ) "*" " "     "*"      " "      " "      "*"      "*"   "*"  
##          quarter work_yrs satis
## 1  ( 1 ) " "     " "      "*"  
## 2  ( 1 ) "*"     " "      "*"  
## 3  ( 1 ) "*"     " "      "*"  
## 4  ( 1 ) "*"     "*"      "*"  
## 5  ( 1 ) "*"     "*"      "*"  
## 6  ( 1 ) "*"     "*"      "*"  
## 7  ( 1 ) "*"     "*"      "*"  
## 8  ( 1 ) "*"     "*"      "*"
plot(leap1, scale="adjr2")

library(coefplot)
## Warning: package 'coefplot' was built under R version 3.4.3
coefplot(modulus11, intercept= FALSE, outerCI=1.96,coefficients=c("age","sex","work_yrs", "gmat_tot","quarter","satis"))

Removing all the statistically insignificant variables, for better model with improved R squared and adjusted R squared

modulus11 <- lm(salary ~ age,data=Placed )
summary(modulus11)
## 
## Call:
## lm(formula = salary ~ age, data = Placed)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31454  -8533  -2182   4546  80886 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    29963      12698    2.36     0.02 *  
## age             2729        471    5.80  7.7e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15600 on 101 degrees of freedom
## Multiple R-squared:  0.25,   Adjusted R-squared:  0.242 
## F-statistic: 33.6 on 1 and 101 DF,  p-value: 7.75e-08

Although this model has a p-value less than 0.05, it isn’t a very good model.