Reading the data and summary

MBAsalaries.df <- read.csv("MBA Starting Salaries Data.csv")
library(psych)
describe(MBAsalaries.df)[,c(1:5,8,9)]
##          vars   n     mean       sd median min    max
## age         1 274    27.36     3.71     27  22     48
## sex         2 274     1.25     0.43      1   1      2
## gmat_tot    3 274   619.45    57.54    620 450    790
## gmat_qpc    4 274    80.64    14.87     83  28     99
## gmat_vpc    5 274    78.32    16.86     81  16     99
## gmat_tpc    6 274    84.20    14.02     87   0     99
## s_avg       7 274     3.03     0.38      3   2      4
## f_avg       8 274     3.06     0.53      3   0      4
## quarter     9 274     2.48     1.11      2   1      4
## work_yrs   10 274     3.87     3.23      3   0     22
## frstlang   11 274     1.12     0.32      1   1      2
## salary     12 274 39025.69 50951.56    999   0 220000
## satis      13 274   172.18   371.61      6   1    998

Exploring Data

# keep only the candidates who reported a starting salary (drops unplaced candidates and rows with no salary info)
MBAsalTrue.df <- MBAsalaries.df[which (MBAsalaries.df$salary >= 1000) , ]
hist(MBAsalTrue.df$salary)
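
The summary above reports a median salary of 999, which suggests that small values are placeholder codes rather than real salaries. A minimal sketch of labelling the three groups explicitly, on a made-up vector and under the assumption that 998 and 999 mean "salary not reported" (the source only says that salary info is not given for these rows):

```r
# Toy salaries, not taken from the dataset.
# ASSUMPTION: 998 and 999 are "not reported" codes; 0 means not placed.
salary <- c(0, 998, 999, 85000, 0, 105000, 999)

status <- ifelse(salary == 0, "not placed",
          ifelse(salary %in% c(998, 999), "not reported", "placed"))

# Count how many candidates fall into each group
status_counts <- table(status)
print(status_counts)
```

Labelling the groups once makes later subsets (placed, not placed, known salary) easier to keep consistent.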

The salary can be affected by multiple factors. These factors and their associated variables are given below:

1. Ascribed factors of the candidate
   * age
   * sex
   * frstlang
2. Performance of the candidate
   * gmat_tot
   * gmat_qpc
   * gmat_vpc
   * gmat_tpc
   * s_avg
   * f_avg
   * quarter
3. Work experience
   * work_yrs
4. Satisfaction with the MBA program
   * satis

We can now investigate the relationship of these variables with the starting salary. Further, we should also look into the profile of candidates who did not get placed.

Salary vs Sex

boxplot(salary~sex, data=MBAsalTrue.df)

It appears that female candidates have lower starting salaries. At this point, let us bring in the data of candidates who did not get placed and view their sex profile.

# recording profile of candidates who are not placed
MBAsalZero.df<-MBAsalaries.df[which(MBAsalaries.df$salary==0),]
table(MBAsalZero.df$sex)
## 
##  1  2 
## 67 23

In absolute numbers, more men (67) than women (23) did not get placed. These raw counts do not account for how many men and women there are overall.

At this point it is pertinent to ask what percentage of women and men got placed. For this purpose, we create a new data frame combining the data of all candidates whose placement information is known.

MBAknownsal.df<-MBAsalaries.df[which(MBAsalaries.df$salary>= 1000|MBAsalaries.df$salary==0),]
# creating a dummy variable to capture the status of placement
MBAknownsal.df$Status <- (MBAknownsal.df$salary >= 1000)
MBAknownsal.df$Status<- factor(MBAknownsal.df$Status)

Now, we may look into the percentage of women (=2) and men (=1) who got placed

prop.table(xtabs(~ MBAknownsal.df$Status + MBAknownsal.df$sex , data=MBAknownsal.df),2)
##                      MBAknownsal.df$sex
## MBAknownsal.df$Status         1         2
##                 FALSE 0.4820144 0.4259259
##                 TRUE  0.5179856 0.5740741

H1: The percentage of women placed is higher than the percentage of men placed.

Chi-square test: is placement status independent of sex?

chisq.test(xtabs(~ MBAknownsal.df$Status + MBAknownsal.df$sex , data=MBAknownsal.df))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  xtabs(~MBAknownsal.df$Status + MBAknownsal.df$sex, data = MBAknownsal.df)
## X-squared = 0.29208, df = 1, p-value = 0.5889

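Since the p-value (0.5889) is greater than 0.05, we fail to reject the null hypothesis that placement and sex are independent. The data therefore do not support H1: the difference in placement rates between women and men is not statistically significant.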
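
H1 is directional, while chisq.test() is a two-sided test of independence. One possible alternative is a one-sided two-sample proportion test with prop.test(). The sketch below uses the counts implied by the tables above (31 of 54 women and 72 of 139 men placed); it is an illustration of the idea, not part of the original analysis.

```r
# Counts implied by the tables above: placed candidates and group sizes
placed <- c(women = 31, men = 72)
total  <- c(women = 54, men = 139)

# One-sided test: is the placement proportion of women greater than that of men?
# The first group (women) is compared against the second (men).
one_sided <- prop.test(placed, total, alternative = "greater")
print(one_sided)
```

With the continuity correction left at its default, the test statistic is the same X-squared as in the chi-square test, and only the p-value changes because of the one-sided alternative.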

Salary vs Age

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplot(MBAsalTrue.df$age, MBAsalTrue.df$salary)

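The plot suggests that starting salary begins to dip at higher ages, although the overall correlation between age and salary among placed candidates is positive (0.50, see the correlation matrix below).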

About the candidates not placed

hist(MBAsalZero.df$age)

The histogram shows that younger candidates are more frequent among those not placed.

Salary vs first language

Now let us compare the profile of those who are placed and not placed based on first language.

print("placed people")
## [1] "placed people"
table(MBAsalTrue.df$frstlang)
## 
##  1  2 
## 96  7
print("not placed people")
## [1] "not placed people"
table(MBAsalZero.df$frstlang)
## 
##  1  2 
## 82  8

Now, we may look into the percentage of people with English (=1) as first language and others (=2) who got placed

prop.table(xtabs(~ MBAknownsal.df$Status + MBAknownsal.df$frstlang , data=MBAknownsal.df),2)
##                      MBAknownsal.df$frstlang
## MBAknownsal.df$Status         1         2
##                 FALSE 0.4606742 0.5333333
##                 TRUE  0.5393258 0.4666667

H2: The percentage of people placed whose first language is English is higher than the percentage of people placed whose first language is not English.

Chi-square test: is placement status independent of first language?

chisq.test(xtabs(~ MBAknownsal.df$Status + MBAknownsal.df$frstlang , data=MBAknownsal.df))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  xtabs(~MBAknownsal.df$Status + MBAknownsal.df$frstlang, data = MBAknownsal.df)
## X-squared = 0.074127, df = 1, p-value = 0.7854

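Since the p-value (0.7854) is greater than 0.05, we fail to reject the null hypothesis that placement and first language are independent. The data therefore do not support H2: the difference in placement rates between the two language groups is not statistically significant.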
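
Only 15 candidates have a first language other than English, so an exact test is a reasonable cross-check on the chi-square approximation. A minimal sketch using Fisher's exact test on the counts from the two tables above (this swaps in a different test from the one used in the analysis):

```r
# 2x2 table built from the counts printed above
lang_tab <- matrix(c(96, 82,   # frstlang = 1: placed, not placed
                     7,  8),   # frstlang = 2: placed, not placed
                   nrow = 2,
                   dimnames = list(Status   = c("placed", "not placed"),
                                   frstlang = c("1", "2")))

# Exact test of independence between placement status and first language
exact <- fisher.test(lang_tab)
print(exact)
```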

Salary vs performance

Salary vs GMAT score

scatterplot(MBAsalTrue.df$gmat_tot, MBAsalTrue.df$salary)

Salary vs quartile

boxplot(salary~quarter, data= MBAsalTrue.df)

Salary vs Work Experience

scatterplot(MBAsalTrue.df$work_yrs, MBAsalTrue.df$salary)

Correlations

library(corrplot)
## corrplot 0.84 loaded
round(cor(MBAsalTrue.df),2)
##            age   sex gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg
## age       1.00 -0.14    -0.08    -0.17     0.02    -0.10  0.16 -0.22
## sex      -0.14  1.00    -0.02    -0.15     0.05    -0.05  0.08  0.17
## gmat_tot -0.08 -0.02     1.00     0.67     0.78     0.97  0.17  0.12
## gmat_qpc -0.17 -0.15     0.67     1.00     0.09     0.66  0.02  0.10
## gmat_vpc  0.02  0.05     0.78     0.09     1.00     0.78  0.16  0.02
## gmat_tpc -0.10 -0.05     0.97     0.66     0.78     1.00  0.14  0.07
## s_avg     0.16  0.08     0.17     0.02     0.16     0.14  1.00  0.45
## f_avg    -0.22  0.17     0.12     0.10     0.02     0.07  0.45  1.00
## quarter  -0.13 -0.02    -0.11     0.01    -0.13    -0.10 -0.84 -0.43
## work_yrs  0.88 -0.09    -0.12    -0.18    -0.03    -0.13  0.16 -0.22
## frstlang  0.35  0.08    -0.13     0.01    -0.22    -0.16 -0.14 -0.05
## salary    0.50 -0.17    -0.09     0.01    -0.14    -0.13  0.10 -0.11
## satis     0.11 -0.09     0.06     0.00     0.15     0.12 -0.14 -0.12
##          quarter work_yrs frstlang salary satis
## age        -0.13     0.88     0.35   0.50  0.11
## sex        -0.02    -0.09     0.08  -0.17 -0.09
## gmat_tot   -0.11    -0.12    -0.13  -0.09  0.06
## gmat_qpc    0.01    -0.18     0.01   0.01  0.00
## gmat_vpc   -0.13    -0.03    -0.22  -0.14  0.15
## gmat_tpc   -0.10    -0.13    -0.16  -0.13  0.12
## s_avg      -0.84     0.16    -0.14   0.10 -0.14
## f_avg      -0.43    -0.22    -0.05  -0.11 -0.12
## quarter     1.00    -0.13     0.11  -0.13  0.23
## work_yrs   -0.13     1.00     0.20   0.45  0.06
## frstlang    0.11     0.20     1.00   0.27  0.09
## salary     -0.13     0.45     0.27   1.00 -0.04
## satis       0.23     0.06     0.09  -0.04  1.00
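
The matrix reports correlations but not whether they differ significantly from zero. As a worked example, the age and salary correlation of 0.50 among the 103 placed candidates can be converted to a t statistic with the standard formula t = r * sqrt(n - 2) / sqrt(1 - r^2). The value of r is the rounded figure from the table, so the result is approximate.

```r
# Values read from the output above (r is rounded to two decimals)
r <- 0.50   # correlation between age and salary
n <- 103    # number of placed candidates (96 + 7)

# t statistic for testing H0: true correlation is zero
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

print(c(t = t_stat, p = p_value))
```

On the actual data frame, cor.test(MBAsalTrue.df$age, MBAsalTrue.df$salary) performs the same test without rounding.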

Corrgram

library(corrgram) 
corrgram(MBAsalTrue.df, order = FALSE, lower.panel = panel.shade, upper.panel = panel.pie, text.panel = panel.txt, main= "Corrgram of various factors")

Regression analysis

fit<- lm(salary~ age+gmat_qpc+gmat_tot+gmat_tpc+gmat_vpc+s_avg+f_avg+work_yrs+sex+frstlang, data= MBAknownsal.df)
summary(fit)
## 
## Call:
## lm(formula = salary ~ age + gmat_qpc + gmat_tot + gmat_tpc + 
##     gmat_vpc + s_avg + f_avg + work_yrs + sex + frstlang, data = MBAknownsal.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -86129 -52437  22217  43263 188119 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 200482.95   92802.71   2.160   0.0321 *
## age          -4703.82    1916.34  -2.455   0.0150 *
## gmat_qpc       538.13     854.89   0.629   0.5298  
## gmat_tot      -376.13     313.59  -1.199   0.2319  
## gmat_tpc       691.61     653.79   1.058   0.2915  
## gmat_vpc       522.85     821.83   0.636   0.5254  
## s_avg        22036.96   12592.72   1.750   0.0818 .
## f_avg        -7562.26    8825.77  -0.857   0.3927  
## work_yrs      3559.88    2175.77   1.636   0.1035  
## sex            -40.44    8746.76  -0.005   0.9963  
## frstlang     14456.64   15774.81   0.916   0.3606  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52760 on 182 degrees of freedom
## Multiple R-squared:  0.06607,    Adjusted R-squared:  0.01476 
## F-statistic: 1.288 on 10 and 182 DF,  p-value: 0.2403

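The above analysis shows that age is the only predictor significant at the 5% level (p = 0.015). However, the model as a whole is not significant (F-statistic p-value = 0.2403, adjusted R-squared = 0.015), and because it is fit on MBAknownsal.df, which includes unplaced candidates with a salary of 0, the coefficients mix the effect on placement with the effect on starting salary.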
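
The regression above is fit on MBAknownsal.df, which contains both placed candidates and unplaced candidates with a salary of 0. One way to separate the two questions is a two-part approach: a logistic regression for whether a candidate is placed, and a linear regression for salary among placed candidates only. The sketch below uses simulated data with made-up coefficients purely to show the structure; the data frame toy and its columns are hypothetical and are not the MBA dataset.

```r
set.seed(1)

# Simulated stand-in data (ASSUMPTION: not the real MBA data)
n <- 200
toy <- data.frame(age      = round(runif(n, 22, 40)),
                  work_yrs = round(runif(n, 0, 10)))
toy$placed <- rbinom(n, 1, 0.55) == 1
toy$salary <- ifelse(toy$placed,
                     60000 + 2000 * toy$work_yrs + rnorm(n, sd = 10000),
                     0)

# Part 1: what predicts being placed at all? (logistic regression)
placement_fit <- glm(placed ~ age + work_yrs, family = binomial, data = toy)

# Part 2: what predicts salary, among placed candidates only? (linear regression)
salary_fit <- lm(salary ~ age + work_yrs, data = subset(toy, placed))

print(summary(salary_fit)$coefficients)
```

On the real data, the same structure would use Status as the response in the first model and MBAsalTrue.df as the data for the second.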