DATA DESCRIPTION

age = age in years, sex = sex (1 = Male; 2 = Female), gmat_tot = total GMAT score, gmat_qpc = quantitative GMAT percentile

gmat_vpc = verbal GMAT percentile, gmat_tpc = overall GMAT percentile, s_avg = spring MBA average

f_avg = fall MBA average, quarter = quartile ranking (1st is top, 4th is bottom)

work_yrs = years of work experience, frstlang = first language (1 = English; 2 = other), salary = starting salary

satis = degree of satisfaction with MBA program (1 = low, 7 = high satisfaction)

The salary and satis columns contain a lot of missing information. Missing salary and satis data are coded as follows: 998 = did not answer the survey; 999 = answered the survey but did not disclose the salary; 0 = not placed.
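Because 998 and 999 are codes rather than real values, one option is to turn them into NA before computing any statistics. The following is a minimal sketch on a small made-up data frame (`toy` and the `_clean` column names are hypothetical, not part of the data set):

```r
# Hypothetical toy rows that follow the coding scheme described above
toy <- data.frame(
  salary = c(0, 998, 999, 85000, 120000),
  satis  = c(5, 998, 6, 7, 998)
)

# 998 and 999 are survey codes, not measurements, so recode them as NA
toy$salary_clean <- ifelse(toy$salary %in% c(998, 999), NA, toy$salary)
toy$satis_clean  <- ifelse(toy$satis == 998, NA, toy$satis)

# A salary of 0 means "not placed"; keep that as a flag instead of a value
toy$placed <- !is.na(toy$salary_clean) & toy$salary_clean > 0

# Statistics now ignore the codes
mean(toy$salary_clean[toy$placed])
mean(toy$satis_clean, na.rm = TRUE)
```

With this recoding, functions such as summary() report the number of NA values separately instead of mixing the codes into the mean.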

Reading the dataset in R

mba<-read.csv("MBA Starting Salaries Data.csv")
View(mba)
dim(mba)
## [1] 274  13

Summarizing the whole data set

summary(mba)
##       age             sex           gmat_tot        gmat_qpc    
##  Min.   :22.00   Min.   :1.000   Min.   :450.0   Min.   :28.00  
##  1st Qu.:25.00   1st Qu.:1.000   1st Qu.:580.0   1st Qu.:72.00  
##  Median :27.00   Median :1.000   Median :620.0   Median :83.00  
##  Mean   :27.36   Mean   :1.248   Mean   :619.5   Mean   :80.64  
##  3rd Qu.:29.00   3rd Qu.:1.000   3rd Qu.:660.0   3rd Qu.:93.00  
##  Max.   :48.00   Max.   :2.000   Max.   :790.0   Max.   :99.00  
##     gmat_vpc        gmat_tpc        s_avg           f_avg      
##  Min.   :16.00   Min.   : 0.0   Min.   :2.000   Min.   :0.000  
##  1st Qu.:71.00   1st Qu.:78.0   1st Qu.:2.708   1st Qu.:2.750  
##  Median :81.00   Median :87.0   Median :3.000   Median :3.000  
##  Mean   :78.32   Mean   :84.2   Mean   :3.025   Mean   :3.062  
##  3rd Qu.:91.00   3rd Qu.:94.0   3rd Qu.:3.300   3rd Qu.:3.250  
##  Max.   :99.00   Max.   :99.0   Max.   :4.000   Max.   :4.000  
##     quarter         work_yrs         frstlang         salary      
##  Min.   :1.000   Min.   : 0.000   Min.   :1.000   Min.   :     0  
##  1st Qu.:1.250   1st Qu.: 2.000   1st Qu.:1.000   1st Qu.:     0  
##  Median :2.000   Median : 3.000   Median :1.000   Median :   999  
##  Mean   :2.478   Mean   : 3.872   Mean   :1.117   Mean   : 39026  
##  3rd Qu.:3.000   3rd Qu.: 4.000   3rd Qu.:1.000   3rd Qu.: 97000  
##  Max.   :4.000   Max.   :22.000   Max.   :2.000   Max.   :220000  
##      satis      
##  Min.   :  1.0  
##  1st Qu.:  5.0  
##  Median :  6.0  
##  Mean   :172.2  
##  3rd Qu.:  7.0  
##  Max.   :998.0
library(psych)
describe(mba)
##          vars   n     mean       sd median  trimmed     mad min    max
## age         1 274    27.36     3.71     27    26.76    2.97  22     48
## sex         2 274     1.25     0.43      1     1.19    0.00   1      2
## gmat_tot    3 274   619.45    57.54    620   618.86   59.30 450    790
## gmat_qpc    4 274    80.64    14.87     83    82.31   14.83  28     99
## gmat_vpc    5 274    78.32    16.86     81    80.33   14.83  16     99
## gmat_tpc    6 274    84.20    14.02     87    86.12   11.86   0     99
## s_avg       7 274     3.03     0.38      3     3.03    0.44   2      4
## f_avg       8 274     3.06     0.53      3     3.09    0.37   0      4
## quarter     9 274     2.48     1.11      2     2.47    1.48   1      4
## work_yrs   10 274     3.87     3.23      3     3.29    1.48   0     22
## frstlang   11 274     1.12     0.32      1     1.02    0.00   1      2
## salary     12 274 39025.69 50951.56    999 33607.86 1481.12   0 220000
## satis      13 274   172.18   371.61      6    91.50    1.48   1    998
##           range  skew kurtosis      se
## age          26  2.16     6.45    0.22
## sex           1  1.16    -0.66    0.03
## gmat_tot    340 -0.01     0.06    3.48
## gmat_qpc     71 -0.92     0.30    0.90
## gmat_vpc     83 -1.04     0.74    1.02
## gmat_tpc     99 -2.28     9.02    0.85
## s_avg         2 -0.06    -0.38    0.02
## f_avg         4 -2.08    10.85    0.03
## quarter       3  0.02    -1.35    0.07
## work_yrs     22  2.78     9.80    0.20
## frstlang      1  2.37     3.65    0.02
## salary   220000  0.70    -1.05 3078.10
## satis       997  1.77     1.13   22.45

The mean age is 27.36 years with a standard deviation of 3.71 years, so ages vary little. The mean total GMAT score is 619.45 with a standard deviation of 57.54, which is modest relative to the mean (about 9%). The mean quantitative GMAT percentile is 80.64 with a standard deviation of 14.87, and the mean overall GMAT percentile is 84.20 with a standard deviation of 14.02. Mean work experience is 3.87 years (about 3 years 10 months), and its standard deviation of 3.23 years is almost as large as the mean, so work experience varies widely. The salary column contains the codes 0, 998 and 999, and the satis column contains the code 998. These codes mark missing information, so the summary statistics of these two columns are not meaningful as they stand.

Break-up of the variables sex-wise

Average age of males and females

aggregate(mba$age,by=list(sex=mba$sex),mean)
##   sex        x
## 1   1 27.41748
## 2   2 27.17647

The mean age of males (sex = 1) is about 27 years 5 months, and the mean age of females (sex = 2) is about 27 years 2 months. The output above reports means only, so it says nothing about how the spread of ages compares between the two groups.

The mean ages of males and females are nearly the same, so age is unlikely to be the factor behind any salary difference between the sexes.
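To compare the spread of ages as well as the means, aggregate() can return both statistics at once. This is a small sketch on made-up numbers (the `toy` data frame is hypothetical); the same call should work on mba before sex is recoded.

```r
# Hypothetical toy data: sex coded 1 = male, 2 = female as in the data set
toy <- data.frame(
  sex = c(1, 1, 1, 2, 2, 2),
  age = c(25, 27, 29, 24, 27, 30)
)

# FUN returns a named vector, so the result holds mean and sd per group
by_sex <- aggregate(age ~ sex, data = toy,
                    FUN = function(x) c(mean = mean(x), sd = sd(x)))
by_sex

# The statistics are stored as a matrix column and can be pulled out by name
by_sex$age[, "sd"]
```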

GMAT total score of males and females separately

aggregate(mba$gmat_tot,by=list(sex=mba$sex),mean)
##   sex        x
## 1   1 621.2136
## 2   2 614.1176

The average total GMAT score is 621.21 for males and 614.12 for females. The scores are again nearly the same for both sexes, so this is unlikely to be the factor behind any salary difference.

Average GMAT total percentile for both the groups separately

aggregate(mba$gmat_tpc,by=list(sex=mba$sex),mean)
##   sex        x
## 1   1 84.26214
## 2   2 84.00000

The average total GMAT percentile is practically the same for males and females.

How many females and males have English as their first language? To answer this we recode the sex column as 1 = MALE, 2 = FEMALE and the frstlang column as 1 = ENGLISH, 2 = OTHER.

mba$sex[mba$sex==1]<-'MALE'
mba$sex[mba$sex==2]<-'FEMALE'
mba$sex<-factor(mba$sex)
mba$frstlang[mba$frstlang==1]<-'ENGLISH'
mba$frstlang[mba$frstlang==2]<-'OTHER'
mba$frstlang<-factor(mba$frstlang)
t<-table(mba$sex,mba$frstlang)
t
##         
##          ENGLISH OTHER
##   FEMALE      60     8
##   MALE       182    24

More males than females have English as their first language in absolute terms (182 versus 60), but the share is about 88% in both groups (182 of 206 males and 60 of 68 females).

Average years of work experience for both sexes

aggregate(mba$work_yrs,by=list(sex=mba$sex),mean)
##      sex        x
## 1 FEMALE 3.808824
## 2   MALE 3.893204

Both sexes have nearly the same average work experience.

Males and females do not differ in the share of English speakers either; the counts differ only because there are about three times as many males (206) as females (68).
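Raw counts can mislead when the groups differ in size, so it helps to look at row proportions. The sketch below rebuilds the sex by first-language table from the counts printed above (the object names are illustrative) and divides each row by its total:

```r
# Counts copied from the sex-by-first-language table above
lang_by_sex <- matrix(c(60, 182, 8, 24), nrow = 2,
                      dimnames = list(sex = c("FEMALE", "MALE"),
                                      frstlang = c("ENGLISH", "OTHER")))

# margin = 1 divides every cell by its row total
row_share <- prop.table(lang_by_sex, margin = 1)
round(row_share, 3)
```

On the real data the equivalent call would be prop.table(table(mba$sex, mba$frstlang), margin = 1).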

As we are interested in the candidates who got placed, we will create new data frames for the candidates who were placed, were not placed, did not answer, and answered but did not disclose the salary.

placed<-mba[which(mba$salary>999),]
View(placed)
notplaced<-mba[which(mba$salary==0),]
View(notplaced)
didnotanswer<-mba[which(mba$salary==998),]
View(didnotanswer)
didnotdisclose<-mba[which(mba$salary==999),]
View(didnotdisclose)
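The four subsets above can also be expressed as a single status column, which keeps all rows in one data frame. This is a sketch on hypothetical toy salaries; it assumes, as in the coding scheme, that no genuine salary lies between 1 and 999.

```r
# Hypothetical toy salaries using the codes 0, 998 and 999
toy <- data.frame(salary = c(0, 998, 999, 85000, 120000, 0))

# Map each salary code to a label; everything else counts as placed
status <- ifelse(toy$salary == 0,   "notplaced",
          ifelse(toy$salary == 998, "didnotanswer",
          ifelse(toy$salary == 999, "didnotdisclose", "placed")))
toy$status <- factor(status,
                     levels = c("placed", "notplaced",
                                "didnotanswer", "didnotdisclose"))

# Group sizes in one call
table(toy$status)

# split() gives back the separate data frames when they are needed
groups <- split(toy, toy$status)
nrow(groups$placed)
```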

Average salary of placed candidates

mean(placed$salary)
## [1] 103030.7

The average salary of placed candidates is 103030.7.

Comparison of vital statistics on the basis of sex

t1<-table(placed$sex,placed$frstlang)
t1
##         
##          ENGLISH OTHER
##   FEMALE      28     3
##   MALE        68     4

Among the placed candidates, 68 males and 28 females have English as their first language.

Mean of gmat total by sex

aggregate(placed$gmat_tot,by=list(sex=placed$sex),mean)
##      sex        x
## 1 FEMALE 614.5161
## 2   MALE 616.6667

Mean of total gmat percentile by sex

aggregate(placed$gmat_tpc,by=list(sex=placed$sex),mean)
##      sex        x
## 1 FEMALE 83.74194
## 2   MALE 84.86111

The average total GMAT percentile of placed candidates is almost the same for both sexes.

Mean age sex-wise

aggregate(placed$age,by=list(sex=placed$sex),mean)
##      sex        x
## 1 FEMALE 26.06452
## 2   MALE 27.08333

Placed females are on average about one year younger than placed males (26.06 versus 27.08 years), which is a small difference.

Mean work experience sex-wise

aggregate(placed$work_yrs,by=list(sex=placed$sex),mean)
##      sex        x
## 1 FEMALE 3.258065
## 2   MALE 3.861111

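Again there is not much difference in average work experience between placed males (3.86 years) and placed females (3.26 years).

Sex-wise means of different variables in the original data frame mba (salary here still includes the 0, 998 and 999 codes, so the salary means are not meaningful)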

aggregate(cbind(salary,work_yrs,gmat_tot,age)~sex,data = mba ,mean)
##      sex   salary work_yrs gmat_tot      age
## 1 FEMALE 45121.07 3.808824 614.1176 27.17647
## 2   MALE 37013.62 3.893204 621.2136 27.41748

Average salary of placed males and females

aggregate(placed$salary,by=list(sex=placed$sex),mean)
##      sex         x
## 1 FEMALE  98524.39
## 2   MALE 104970.97

Here the average salary of placed males (104970.97) is greater than that of placed females (98524.39).

Mean salary and work experience age-wise in the data frames placed and mba (the mba means still include the 0, 998 and 999 salary codes)

aggregate(cbind(salary,work_yrs)~age,data=placed,mean)
##    age    salary  work_yrs
## 1   22  85000.00  1.000000
## 2   23  91651.20  1.800000
## 3   24 101518.75  1.875000
## 4   25  99086.96  2.260870
## 5   26 101665.00  2.642857
## 6   27 102214.29  3.214286
## 7   28 103625.00  4.500000
## 8   29 102083.33  5.833333
## 9   30 109916.67  6.333333
## 10  31 100500.00  5.500000
## 11  32 107300.00  2.000000
## 12  33 118000.00 10.000000
## 13  34 105000.00 16.000000
## 14  39 112000.00 16.000000
## 15  40 183000.00 15.000000
aggregate(cbind(work_yrs,salary)~age,data=mba,mean)
##    age  work_yrs    salary
## 1   22  1.000000  42500.00
## 2   23  1.750000  57282.00
## 3   24  1.727273  49342.24
## 4   25  2.264151  43395.55
## 5   26  2.875000  35982.07
## 6   27  3.130435  31499.37
## 7   28  4.666667  39809.00
## 8   29  4.500000  28067.95
## 9   30  5.583333  55291.25
## 10  31  5.800000  40599.40
## 11  32  5.625000  13662.25
## 12  33 10.000000 118000.00
## 13  34 11.500000  26250.00
## 14  35  9.333333      0.00
## 15  36 12.500000      0.00
## 16  37  9.000000      0.00
## 17  39 10.500000  56000.00
## 18  40 15.000000 183000.00
## 19  42 13.000000      0.00
## 20  43 19.000000      0.00
## 21  48 22.000000      0.00

Average salary on the basis of satisfaction and sex

aggregate(placed$salary,by=list(satisfaction=placed$satis,sex=placed$sex),mean)
##   satisfaction    sex         x
## 1            3 FEMALE  95000.00
## 2            5 FEMALE  93354.67
## 3            6 FEMALE 111400.00
## 4            7 FEMALE  90625.00
## 5            4   MALE  95000.00
## 6            5   MALE 109764.71
## 7            6   MALE 103855.25
## 8            7   MALE 103050.00

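Among placed females the average salary is highest at satisfaction level 6, whereas among placed males it is highest at level 5. The table gives the average salary per satisfaction level, not the number of candidates at each level, so it does not show which sex is more satisfied.

Data frame containing the candidates who are placed together with those who did not disclose the salary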

placed1<-rbind(placed,didnotdisclose)
View(placed1)

Let us visualize the placed data using the boxplot

Age distribution

boxplot(placed$age,main="age distribution",xlab="age",horizontal = TRUE,col = "orchid3")

A few outliers can be seen; otherwise the data are roughly symmetric, with a median age of 26.

GMAT TOTAL Distribution

boxplot(placed$gmat_tot,main="gmat total distribution",xlab="gmat_tot",horizontal = TRUE,col = "beige")

The median total GMAT score is approximately 625. The tail is slightly longer towards the left, but we can consider the data to be symmetric.

GMAT total percentile

boxplot(placed$gmat_tpc,main="total percentile distribution",xlab="gmat-tpc",horizontal = TRUE,col = "blue4")

The tail is towards the left and most values are concentrated on the right, so the data are negatively skewed. More than 50% of the candidates have a total percentile above the average. Only 25% of the candidates are below the 78th percentile.

Salary distribution

boxplot(placed$salary,main="salary distribution",xlab="salary",horizontal = TRUE,col = "peachpuff")

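The median salary is 100000. A few candidates earn extreme salaries, which appear as outliers. Apart from these outliers the highest salary is approximately 125000, while the overall maximum is 220000.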
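The numbers behind a boxplot can be read off with boxplot.stats(), which separates the whisker ends from the outliers. The sketch below uses a handful of made-up salaries (the vector `toy_salary` is hypothetical) to show that the upper whisker and the maximum are different things:

```r
# Hypothetical salaries with one extreme value
toy_salary <- c(85000, 95000, 100000, 105000, 110000, 220000)

bs <- boxplot.stats(toy_salary)

# stats = lower whisker, lower hinge, median, upper hinge, upper whisker
bs$stats

# Points beyond 1.5 times the hinge spread are reported as outliers
bs$out

# The maximum includes the outlier, the upper whisker does not
max(toy_salary)
```

On the real data, boxplot.stats(placed$salary) would give the corresponding values for the plot above.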

Satisfaction distribution

boxplot(placed$satis,main="satisfaction distribution",xlab="satisfaction",horizontal = TRUE,col = "orchid3")

Candidates seem pretty satisfied: 50% of the candidates are in the range 5 to 6, the median satisfaction level is 6, and 75% of the candidates are at satisfaction level 5 or above. Getting a job after the MBA appears to go along with high satisfaction.

Spring MBA average distribution

boxplot(placed$s_avg,horizontal = TRUE,main="spring mba average",xlab="s_avg",col="brown")

The spring MBA average is roughly symmetric, with a median above 3.0.

Salary distribution gender-wise

boxplot(placed$salary~placed$sex,las=1,horizontal=TRUE, main="Salary breakup sex wise", xlab="salary",ylab="sex",col=c("red","blue"))

In the plot, salaries of males are somewhat higher than those of females.

Salary distribution on the basis of first language

boxplot(placed$salary~placed$frstlang,main="salary on the basis of first language",xlab="salary",horizontal=TRUE,col=c("orchid3","peachpuff"))

Here candidates with another first language earn more than candidates whose first language is English.

Number of placed candidates by first language

t<-table(placed$frstlang)
t
## 
## ENGLISH   OTHER 
##      96       7

English speakers are the majority, but the 7 candidates with another first language have higher earnings. With so few candidates in that group, the comparison is fragile.

Boxplot of salary by total GMAT score

boxplot(placed$salary~placed$gmat_tot,horizontal=TRUE, main="salary vs gmat total",xlab="salary",ylab="gmat score", las=1,col=c("red","beige","magenta","cyan","orchid3","blue","peachpuff","chartreuse4"))

Surprisingly, candidates with a GMAT score of 500, i.e. the lowest total GMAT score, earn more than the others. Each score forms its own group here, so most boxes rest on very few candidates.

Salary and Total gmat percentile

boxplot(placed$salary~placed$gmat_tpc,horizontal=TRUE, main="salary vs gmat percentile",xlab="salary",ylab="gmat percentile", las=1,col=c("red","beige","magenta","cyan","orchid3","blue","peachpuff","chartreuse4"))

Candidates at the 69th percentile show consistent salaries, i.e. little fluctuation, along with a higher salary level. Their median salary is a little lower than the median of the 99th percentile group.

Salary on the basis of work experience

boxplot(placed$salary~placed$work_yrs,horizontal=TRUE, main="salary vs work experience",xlab="salary",ylab="work ex", las=1,col=c("red","beige","magenta","cyan","orchid3","blue","peachpuff","chartreuse4"))

More experience generally goes with more salary; for example, 15 years of experience attracts the highest packages.

Salary vs Quartile ranking

boxplot(placed$salary~placed$quarter,horizontal=TRUE, main="salary vs quartile",xlab="salary",ylab="quartile", las=1,col=c("red","beige","magenta","cyan","orchid3","blue","peachpuff","chartreuse4"))

Here lowest quartile is earning maximum salary.

Let us see what histograms are saying

Salary for placed candidates

library(lattice)
histogram(~placed$salary,data=placed, main="frequency of starting salary", xlab="salary", col="peachpuff")

Salary for placed candidates together with those who did not disclose the salary

histogram(~placed1$salary,data=placed1, main="frequency of starting salary", xlab="salary", col="grey")

In both graphs, the largest percentage of candidates earns between 100000 and 125000.

Scatterplot of salary vs gmat total

plot(placed$gmat_tot,placed$salary,main="scatterplot salary vs total gmat score",xlab = "gmat score", ylab = "salary",las=1,col=c("red","blue","green","brown"))
abline(lm(salary~gmat_tot,data=placed),col="red")

Salary vs work experience

plot(placed$work_yrs,placed$salary,main="scatterplot salary vs work experience",xlab = "work ex", ylab = "salary",las=1,col=c("red","blue","green","brown"))
abline(lm(salary~work_yrs,data=placed),col="red")

Salary vs f_avg

plot(placed$f_avg,placed$salary,main="scatterplot salary vs fall mba average",xlab = "f_avg", ylab = "salary",las=1,col=c("red","blue","green","brown"))
abline(lm(salary~f_avg,data=placed),col="beige")
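When a fitted line is added to a scatterplot with abline(), the regression has to be written as y ~ x, with the variable on the vertical axis on the left of the formula. The sketch below uses five made-up points (the vectors are hypothetical and in arbitrary units) to show that swapping the two sides gives a different line:

```r
# Hypothetical points: x = work experience, y = salary (arbitrary units)
work_yrs <- c(1, 2, 3, 4, 5)
salary   <- c(2, 4, 5, 4, 5)

# Correct orientation for plot(work_yrs, salary): salary is the response
fit_y_on_x <- lm(salary ~ work_yrs)

# Swapped orientation: this line belongs on a plot with the axes exchanged
fit_x_on_y <- lm(work_yrs ~ salary)

coef(fit_y_on_x)   # intercept and slope of salary on work_yrs
coef(fit_x_on_y)   # intercept and slope of work_yrs on salary

# Usage on a plot:
# plot(work_yrs, salary); abline(fit_y_on_x, col = "red")
```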

Plot of the entire data

plot(placed)

Plot of salary with work experience, GMAT total, f_avg, satisfaction and total percentile

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplotMatrix(formula=~salary+work_yrs+gmat_tot+f_avg+satis+gmat_tpc,data=placed,diagonal="histogram")

Here we see that salary and work experience are positively correlated. GMAT total and GMAT total percentile are also positively correlated.

Let us make some comparisons between placed and not placed candidates

First we combine the data frames placed, notplaced and didnotdisclose

fullmba<-rbind(placed,notplaced,didnotdisclose)
View(fullmba)

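Now we will create a dummy Gotplaced = 1 (got a job) and 0 (didn’t get a job). Note that candidates who did not disclose their salary (code 999) fall into the 0 group under this definition.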

fullmba$Gotplaced=(fullmba$salary>999)
View(fullmba)
fullmba$Gotplaced<-factor(fullmba$Gotplaced)

Number of placed and not placed candidates

tab<-table(fullmba$Gotplaced)
tab
## 
## FALSE  TRUE 
##   125   103

Number of placed and not placed candidates gender-wise

tab1<-xtabs(~fullmba$Gotplaced+ fullmba$sex,data=fullmba)
tab1
##                  fullmba$sex
## fullmba$Gotplaced FEMALE MALE
##             FALSE     28   97
##             TRUE      31   72
addmargins(tab1)
##                  fullmba$sex
## fullmba$Gotplaced FEMALE MALE Sum
##             FALSE     28   97 125
##             TRUE      31   72 103
##             Sum       59  169 228

Express the above counts as proportions of the total

prop.table(tab1)
##                  fullmba$sex
## fullmba$Gotplaced    FEMALE      MALE
##             FALSE 0.1228070 0.4254386
##             TRUE  0.1359649 0.3157895
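The proportions above are shares of all 228 candidates. To compare placement rates between the sexes, it is often more useful to divide each column by its own total. The sketch below rebuilds the table from the counts printed above (the object names are illustrative):

```r
# Counts copied from the Gotplaced-by-sex table above
placement <- matrix(c(28, 31, 97, 72), nrow = 2,
                    dimnames = list(Gotplaced = c("FALSE", "TRUE"),
                                    sex = c("FEMALE", "MALE")))

# margin = 2 divides every cell by its column total,
# so the TRUE row is the placement rate within each sex
col_share <- prop.table(placement, margin = 2)
round(col_share, 3)
```

With the original objects the same result should come from prop.table(tab1, margin = 2).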

Number of placed and not placed candidates by sex and work experience

tab2<-xtabs(~fullmba$Gotplaced+ fullmba$sex + fullmba$work_yrs,data=fullmba)
ftable(tab2)
##                               fullmba$work_yrs  0  1  2  3  4  5  6  7  8  9 10 11 12 13 15 16 18 22
## fullmba$Gotplaced fullmba$sex                                                                       
## FALSE             FEMALE                        0  0  8  6  2  5  1  2  0  1  1  1  0  1  0  0  0  0
##                   MALE                          2 14 20 16 18  9  3  5  2  1  0  1  2  0  0  1  1  2
## TRUE              FEMALE                        0  4 14  5  1  3  2  0  1  0  0  0  0  0  1  0  0  0
##                   MALE                          1  4 24 16 10  4  5  1  3  0  1  0  0  0  1  2  0  0
addmargins(ftable(tab2))
##                                               Sum
##     0  0  8  6  2  5  1 2 0 1 1 1 0 1 0 0 0 0  28
##     2 14 20 16 18  9  3 5 2 1 0 1 2 0 0 1 1 2  97
##     0  4 14  5  1  3  2 0 1 0 0 0 0 0 1 0 0 0  31
##     1  4 24 16 10  4  5 1 3 0 1 0 0 0 1 2 0 0  72
## Sum 3 22 66 43 31 21 11 8 6 2 2 2 2 1 2 3 1 2 228

Number of placed and not placed candidates by sex and first language

tab3<-xtabs(~fullmba$Gotplaced+ fullmba$sex + fullmba$frstlang, data= fullmba)
ftable(tab3)
##                               fullmba$frstlang ENGLISH OTHER
## fullmba$Gotplaced fullmba$sex                               
## FALSE             FEMALE                            25     3
##                   MALE                              83    14
## TRUE              FEMALE                            28     3
##                   MALE                              68     4
addmargins(ftable(tab3))
##            Sum
##      25  3  28
##      83 14  97
##      28  3  31
##      68  4  72
## Sum 204 24 228

Express the above as proportions of the total

prop.table(ftable(tab3))
##                               fullmba$frstlang    ENGLISH      OTHER
## fullmba$Gotplaced fullmba$sex                                       
## FALSE             FEMALE                       0.10964912 0.01315789
##                   MALE                         0.36403509 0.06140351
## TRUE              FEMALE                       0.12280702 0.01315789
##                   MALE                         0.29824561 0.01754386

Now we will test some hypotheses

H0: Percentage of females placed = percentage of males placed

H1: Percentage of females placed is more than the percentage of males placed

We use the chi-square test here. Note that it tests for any association between sex and placement, i.e. it is two-sided.

chisq.test(tab1)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tab1
## X-squared = 1.366, df = 1, p-value = 0.2425

Here the p-value (0.2425) is greater than 0.05, so we fail to reject the null hypothesis: the data give no significant evidence that the percentage of females placed differs from the percentage of males placed.
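Since H1 is directional, a one-sided test of two proportions is one way to match the test to the hypothesis. The following sketch uses prop.test() from base R with the counts from the table above; the vector names are illustrative, and the first group (FEMALE) is the one tested as "greater".

```r
# Placed counts and group sizes taken from the Gotplaced-by-sex table
placed_counts <- c(FEMALE = 31, MALE = 72)
group_sizes   <- c(FEMALE = 59, MALE = 169)

# Two-sided test: same statistic as chisq.test() on the 2 x 2 table
two_sided <- prop.test(placed_counts, group_sizes)

# One-sided test of H1: proportion placed is greater for FEMALE than MALE
one_sided <- prop.test(placed_counts, group_sizes, alternative = "greater")

two_sided$estimate    # sample proportions placed per group
two_sided$p.value
one_sided$p.value
```

The one-sided p-value is half the two-sided one here because the sample difference points in the direction of H1.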

Second hypothesis

H2: The percentage of people placed whose first language is English is higher than that of people with another first language

t4<-xtabs(~fullmba$Gotplaced+ fullmba$frstlang,data = fullmba)
prop.table(t4)
##                  fullmba$frstlang
## fullmba$Gotplaced    ENGLISH      OTHER
##             FALSE 0.47368421 0.07456140
##             TRUE  0.42105263 0.03070175
chisq.test(t4)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  t4
## X-squared = 2.1002, df = 1, p-value = 0.1473

Again the p-value (0.1473) is greater than 0.05, so we fail to reject the null hypothesis: there is no significant evidence that the percentage placed differs between people whose first language is English and people with another first language.

Now hypotheses based on the t-test

1st H1: Average salary of placed females is greater than that of placed males

t.test(salary~sex,data =placed )
## 
##  Welch Two Sample t-test
## 
## data:  salary by sex
## t = -1.3628, df = 38.115, p-value = 0.1809
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -16021.72   3128.55
## sample estimates:
## mean in group FEMALE   mean in group MALE 
##             98524.39            104970.97

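Here the p-value (0.1809) is greater than 0.05, so there is no statistically significant difference between the average salaries of placed males and females. The test above is two-sided, and the sample mean for females is in fact lower than for males, so the data do not support H1.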
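A directional hypothesis can be tested with the alternative argument of t.test(). With a formula, "greater" refers to the first factor level, which is FEMALE in alphabetical order. The sketch below uses made-up salaries (the `toy` data frame is hypothetical) in which females earn less, to show how the one-sided and two-sided p-values relate:

```r
# Hypothetical salaries: female mean 95000, male mean 105000
toy <- data.frame(
  sex    = rep(c("FEMALE", "MALE"), each = 5),
  salary = c(90, 95, 100, 98, 92, 100, 105, 110, 102, 108) * 1000
)

# Default Welch test, two-sided
two_sided <- t.test(salary ~ sex, data = toy)

# One-sided: is the FEMALE mean greater than the MALE mean?
female_gt <- t.test(salary ~ sex, data = toy, alternative = "greater")

two_sided$p.value
female_gt$p.value   # large, because the sample difference has the wrong sign
```

When the sample difference goes against the stated alternative, the one-sided p-value is above 0.5 and the hypothesis cannot be supported.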

Second hypothesis

H2: Average salary of placed people whose first language is English is greater than that of the others

t.test(salary~frstlang,data=placed)
## 
##  Welch Two Sample t-test
## 
## data:  salary by frstlang
## t = -1.1202, df = 6.0863, p-value = 0.3049
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -59933.62  22202.25
## sample estimates:
## mean in group ENGLISH   mean in group OTHER 
##              101748.6              120614.3

Here too the p-value (0.3049) is greater than 0.05, so there is no statistically significant difference between the average salaries of placed English speakers and placed speakers of other languages. With only 7 candidates in the OTHER group, the test has little power.

Now we will run the regression analysis

For model selection, let us first look at the correlogram of the variables for placed candidates.

We read the data set again and create a new data frame called plac containing only the placed candidates, because the salary column has many missing values and because cor() needs the original numeric coding of sex and frstlang.

MBA<-read.csv("MBA Starting Salaries Data.csv")
plac<-MBA[which(MBA$salary>999),]
coor<-cor(plac)
coor
##                  age         sex    gmat_tot     gmat_qpc    gmat_vpc
## age       1.00000000 -0.14352927 -0.07871678 -0.165039057  0.01799420
## sex      -0.14352927  1.00000000 -0.01955548 -0.147099027  0.05341428
## gmat_tot -0.07871678 -0.01955548  1.00000000  0.666382266  0.78038546
## gmat_qpc -0.16503906 -0.14709903  0.66638227  1.000000000  0.09466541
## gmat_vpc  0.01799420  0.05341428  0.78038546  0.094665411  1.00000000
## gmat_tpc -0.09609156 -0.04686981  0.96680810  0.658650025  0.78443167
## s_avg     0.15654954  0.08079985  0.17198874  0.015471662  0.15865101
## f_avg    -0.21699191  0.16572186  0.12246257  0.098418869  0.02290167
## quarter  -0.12568145 -0.02139041 -0.10578964  0.012648346 -0.12862079
## work_yrs  0.88052470 -0.09233003 -0.12280018 -0.182701263 -0.02812182
## frstlang  0.35026743  0.07512009 -0.13164323  0.014198516 -0.21835333
## salary    0.49964284 -0.16628869 -0.09067141  0.014141299 -0.13743230
## satis     0.10832308 -0.09199534  0.06474206 -0.003984632  0.14863481
##             gmat_tpc       s_avg       f_avg     quarter    work_yrs
## age      -0.09609156  0.15654954 -0.21699191 -0.12568145  0.88052470
## sex      -0.04686981  0.08079985  0.16572186 -0.02139041 -0.09233003
## gmat_tot  0.96680810  0.17198874  0.12246257 -0.10578964 -0.12280018
## gmat_qpc  0.65865003  0.01547166  0.09841887  0.01264835 -0.18270126
## gmat_vpc  0.78443167  0.15865101  0.02290167 -0.12862079 -0.02812182
## gmat_tpc  1.00000000  0.13938500  0.07051391 -0.09955033 -0.13246963
## s_avg     0.13938500  1.00000000  0.44590413 -0.84038355  0.16328236
## f_avg     0.07051391  0.44590413  1.00000000 -0.43144819 -0.21633018
## quarter  -0.09955033 -0.84038355 -0.43144819  1.00000000 -0.12896722
## work_yrs -0.13246963  0.16328236 -0.21633018 -0.12896722  1.00000000
## frstlang -0.16437561 -0.13788905 -0.05061394  0.10955726  0.19627277
## salary   -0.13201783  0.10173175 -0.10603897 -0.12848526  0.45466634
## satis     0.11630842 -0.14356557 -0.11773304  0.22511985  0.06299926
##             frstlang      salary        satis
## age       0.35026743  0.49964284  0.108323083
## sex       0.07512009 -0.16628869 -0.091995338
## gmat_tot -0.13164323 -0.09067141  0.064742057
## gmat_qpc  0.01419852  0.01414130 -0.003984632
## gmat_vpc -0.21835333 -0.13743230  0.148634805
## gmat_tpc -0.16437561 -0.13201783  0.116308417
## s_avg    -0.13788905  0.10173175 -0.143565573
## f_avg    -0.05061394 -0.10603897 -0.117733043
## quarter   0.10955726 -0.12848526  0.225119851
## work_yrs  0.19627277  0.45466634  0.062999256
## frstlang  1.00000000  0.26701953  0.089834769
## salary    0.26701953  1.00000000 -0.040050600
## satis     0.08983477 -0.04005060  1.000000000
library(corrplot)
## corrplot 0.84 loaded
corrplot(corr =cor(plac,use = "complete.obs"),method = "ellipse")

Salary is positively correlated with age (0.50), work experience (0.45) and first language (0.27), and very weakly with s_avg (0.10). In the plot, blue shades show positive correlations and red shades show negative ones; sex, gmat_tot and gmat_tpc, for example, are negatively correlated with salary. Because frstlang is coded 1 = English and 2 = other, its positive correlation means that candidates with another first language tend to earn more.

Regression analysis

Model1 <- salary ~  work_yrs + s_avg + f_avg + gmat_qpc + gmat_vpc + sex + frstlang + satis 
fit <- lm(Model1, data = plac)
summary(fit)
## 
## Call:
## lm(formula = Model1, data = plac)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -29800  -7822  -1742   4869  82341 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 86719.94   23350.43   3.714 0.000346 ***
## work_yrs     2331.12     585.99   3.978 0.000137 ***
## s_avg        4659.05    5015.66   0.929 0.355320    
## f_avg       -1698.83    3834.70  -0.443 0.658773    
## gmat_qpc       98.72     121.85   0.810 0.419884    
## gmat_vpc      -95.80     102.99  -0.930 0.354699    
## sex         -5289.24    3545.91  -1.492 0.139140    
## frstlang    13994.76    6641.66   2.107 0.037770 *  
## satis       -1671.20    2070.62  -0.807 0.421643    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15740 on 94 degrees of freedom
## Multiple R-squared:  0.285,  Adjusted R-squared:  0.2241 
## F-statistic: 4.683 on 8 and 94 DF,  p-value: 7.574e-05

Here the only significant variables are work_yrs and frstlang, and both have positive coefficients.

The R-squared is only 0.285, i.e. the model explains 28.5% of the variation in salary.
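The adjusted R-squared in the output penalises the plain R-squared for the number of predictors. As a quick check, it can be recomputed from the figures printed above, using n = 103 observations (94 residual degrees of freedom plus 9 estimated coefficients) and p = 8 predictors. The helper name `adj_r2` is made up for this sketch.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Figures from the summary above: R-squared 0.285, 103 rows, 8 predictors
adj_first_model <- adj_r2(0.285, n = 103, p = 8)
adj_first_model   # about 0.224, in line with the reported value
```

Because of this penalty, adjusted R-squared is the fairer figure when models with different numbers of predictors are compared.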

Second regression

Model <- salary ~ 
             work_yrs + s_avg + f_avg + gmat_qpc + gmat_vpc +  satis
fit1 <- lm(Model, data = plac)
summary(fit1)
## 
## Call:
## lm(formula = Model, data = plac)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33922  -8301  -1314   4030  85666 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  95302.4    22053.9   4.321 3.79e-05 ***
## work_yrs      2690.3      577.6   4.658 1.03e-05 ***
## s_avg         3001.1     5055.9   0.594    0.554    
## f_avg        -1817.0     3871.3  -0.469    0.640    
## gmat_qpc       151.8      121.5   1.249    0.215    
## gmat_vpc      -152.5      102.2  -1.492    0.139    
## satis        -1012.3     2093.7  -0.484    0.630    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16060 on 96 degrees of freedom
## Multiple R-squared:  0.2401, Adjusted R-squared:  0.1926 
## F-statistic: 5.056 on 6 and 96 DF,  p-value: 0.0001515

Again, this model explains only 24% of the variation in salary.

Third model

Model1 <- salary ~ 
             work_yrs + f_avg + gmat_qpc + gmat_vpc + satis+gmat_tot+gmat_tpc+s_avg+age 
fit1 <- lm(Model1, data = plac)
summary(fit1)
## 
## Call:
## lm(formula = Model1, data = plac)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -29429  -7405    358   5528  69521 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 54916.31   50589.39   1.086   0.2805  
## work_yrs      338.47    1087.50   0.311   0.7563  
## f_avg       -1642.93    3797.13  -0.433   0.6663  
## gmat_qpc      891.50     486.78   1.831   0.0702 .
## gmat_vpc      579.60     488.29   1.187   0.2382  
## satis       -1284.40    2065.97  -0.622   0.5357  
## gmat_tot      -21.21     169.27  -0.125   0.9005  
## gmat_tpc    -1419.88     711.10  -1.997   0.0488 *
## s_avg        3460.17    4923.28   0.703   0.4839  
## age          2437.73    1002.98   2.430   0.0170 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15410 on 93 degrees of freedom
## Multiple R-squared:  0.3218, Adjusted R-squared:  0.2562 
## F-statistic: 4.903 on 9 and 93 DF,  p-value: 2.219e-05

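Here the model explains 32% of the variation in salary, and the significant variables are gmat_tpc and age. Note that age and work_yrs are strongly correlated (0.88), as are gmat_tot and gmat_tpc (0.97), so the individual coefficients in this model should be interpreted with caution.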
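One common way to quantify overlap between predictors is the variance inflation factor (VIF), which is 1 / (1 - R-squared) from regressing one predictor on the others. The sketch below computes it by hand on simulated data (all values are hypothetical and not taken from the MBA file; `vif_manual` is a made-up helper) in which age closely tracks work experience:

```r
set.seed(42)  # simulated, hypothetical data
n        <- 100
work_yrs <- rpois(n, 4)
age      <- 23 + work_yrs + rnorm(n, sd = 1)  # age follows experience closely
gmat     <- rnorm(n, mean = 620, sd = 55)     # unrelated to the other two

# VIF of one predictor: regress it on all remaining predictors
vif_manual <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vif_age  <- vif_manual(age,  data.frame(work_yrs, gmat))
vif_gmat <- vif_manual(gmat, data.frame(work_yrs, age))

vif_age    # well above 1: age is largely explained by work_yrs
vif_gmat   # close to 1: gmat carries independent information
```

A large VIF suggests dropping one variable from a correlated pair before reading the coefficients individually.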