DATA DESCRIPTION
age = age in years
sex = 1: male; 2: female
gmat_tot = total GMAT score
gmat_qpc = quantitative GMAT percentile
gmat_vpc = verbal GMAT percentile
gmat_tpc = overall GMAT percentile
s_avg = spring MBA average
f_avg = fall MBA average
quarter = quartile ranking (1st is top, 4th is bottom)
work_yrs = years of work experience
frstlang = first language (1 = English; 2 = other)
salary = starting salary
satis = degree of satisfaction with MBA program (1 = low, 7 = high satisfaction)
There is a lot of missing information in the salary and satis columns. Missing salary and satis data are coded as follows:
998 = did not answer the survey
999 = answered the survey but did not disclose the salary
0 = not placed
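Because 998, 999 and 0 are codes rather than measurements, any mean taken over the raw columns mixes the two. The following is a minimal sketch of recoding them to NA before summarizing. The data frame `toy`, its values, and the `_clean` column names are hypothetical and are not rows from the MBA file; the sketch also assumes that only 998 appears as a code in satis.

```r
# Hypothetical toy data; values are illustrative only.
toy <- data.frame(
  salary = c(95000, 998, 999, 0, 105000),
  satis  = c(6, 998, 5, 7, 998)
)

# Keep the raw codes and add cleaned columns alongside them.
toy$salary_clean <- toy$salary
toy$salary_clean[toy$salary %in% c(0, 998, 999)] <- NA

toy$satis_clean <- toy$satis
toy$satis_clean[toy$satis == 998] <- NA

# Means over real values only.
mean_salary <- mean(toy$salary_clean, na.rm = TRUE)
mean_satis  <- mean(toy$satis_clean, na.rm = TRUE)
print(c(mean_salary, mean_satis))
```

Keeping the raw column next to the cleaned one preserves the distinction between "not placed", "did not answer" and "did not disclose", which the later parts of the analysis rely on.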
Reading the dataset in R
mba<-read.csv("MBA Starting Salaries Data.csv")
View(mba)
dim(mba)
## [1] 274 13
Summarizing the whole data set
summary(mba)
## age sex gmat_tot gmat_qpc
## Min. :22.00 Min. :1.000 Min. :450.0 Min. :28.00
## 1st Qu.:25.00 1st Qu.:1.000 1st Qu.:580.0 1st Qu.:72.00
## Median :27.00 Median :1.000 Median :620.0 Median :83.00
## Mean :27.36 Mean :1.248 Mean :619.5 Mean :80.64
## 3rd Qu.:29.00 3rd Qu.:1.000 3rd Qu.:660.0 3rd Qu.:93.00
## Max. :48.00 Max. :2.000 Max. :790.0 Max. :99.00
## gmat_vpc gmat_tpc s_avg f_avg
## Min. :16.00 Min. : 0.0 Min. :2.000 Min. :0.000
## 1st Qu.:71.00 1st Qu.:78.0 1st Qu.:2.708 1st Qu.:2.750
## Median :81.00 Median :87.0 Median :3.000 Median :3.000
## Mean :78.32 Mean :84.2 Mean :3.025 Mean :3.062
## 3rd Qu.:91.00 3rd Qu.:94.0 3rd Qu.:3.300 3rd Qu.:3.250
## Max. :99.00 Max. :99.0 Max. :4.000 Max. :4.000
## quarter work_yrs frstlang salary
## Min. :1.000 Min. : 0.000 Min. :1.000 Min. : 0
## 1st Qu.:1.250 1st Qu.: 2.000 1st Qu.:1.000 1st Qu.: 0
## Median :2.000 Median : 3.000 Median :1.000 Median : 999
## Mean :2.478 Mean : 3.872 Mean :1.117 Mean : 39026
## 3rd Qu.:3.000 3rd Qu.: 4.000 3rd Qu.:1.000 3rd Qu.: 97000
## Max. :4.000 Max. :22.000 Max. :2.000 Max. :220000
## satis
## Min. : 1.0
## 1st Qu.: 5.0
## Median : 6.0
## Mean :172.2
## 3rd Qu.: 7.0
## Max. :998.0
library(psych)
describe(mba)
## vars n mean sd median trimmed mad min max
## age 1 274 27.36 3.71 27 26.76 2.97 22 48
## sex 2 274 1.25 0.43 1 1.19 0.00 1 2
## gmat_tot 3 274 619.45 57.54 620 618.86 59.30 450 790
## gmat_qpc 4 274 80.64 14.87 83 82.31 14.83 28 99
## gmat_vpc 5 274 78.32 16.86 81 80.33 14.83 16 99
## gmat_tpc 6 274 84.20 14.02 87 86.12 11.86 0 99
## s_avg 7 274 3.03 0.38 3 3.03 0.44 2 4
## f_avg 8 274 3.06 0.53 3 3.09 0.37 0 4
## quarter 9 274 2.48 1.11 2 2.47 1.48 1 4
## work_yrs 10 274 3.87 3.23 3 3.29 1.48 0 22
## frstlang 11 274 1.12 0.32 1 1.02 0.00 1 2
## salary 12 274 39025.69 50951.56 999 33607.86 1481.12 0 220000
## satis 13 274 172.18 371.61 6 91.50 1.48 1 998
## range skew kurtosis se
## age 26 2.16 6.45 0.22
## sex 1 1.16 -0.66 0.03
## gmat_tot 340 -0.01 0.06 3.48
## gmat_qpc 71 -0.92 0.30 0.90
## gmat_vpc 83 -1.04 0.74 1.02
## gmat_tpc 99 -2.28 9.02 0.85
## s_avg 2 -0.06 -0.38 0.02
## f_avg 4 -2.08 10.85 0.03
## quarter 3 0.02 -1.35 0.07
## work_yrs 22 2.78 9.80 0.20
## frstlang 1 2.37 3.65 0.02
## salary 220000 0.70 -1.05 3078.10
## satis 997 1.77 1.13 22.45
The mean age is 27.36 years with a standard deviation of 3.71 years, so ages do not vary much. The mean GMAT total is 619.45 with a standard deviation of 57.54 points. The mean quantitative GMAT percentile is 80.64 with a standard deviation of 14.87, and the mean overall GMAT percentile is 84.20 with a standard deviation of 14.02. Mean work experience is 3.87 years (about 3 years 10 months), and its standard deviation of 3.23 years is almost as large as the mean. The salary and satis columns contain the codes 998 and 999, which mark missing information rather than real values. This is why the mean of satis (172.18) lies far outside its 1 to 7 scale, and why the mean salary (39025.69) is not meaningful as it stands.
Break-up of the variables by sex
Average age of males and females
aggregate(mba$age,by=list(sex=mba$sex),mean)
## sex x
## 1 1 27.41748
## 2 2 27.17647
Here the mean age of males (sex = 1) is 27.42 years (about 27 years 5 months), while the mean age of females (sex = 2) is 27.18 years (about 27 years 2 months).
The mean ages of males and females are practically the same, so age is unlikely to explain any salary difference between the sexes.
GMAT total score of males and females separately
aggregate(mba$gmat_tot,by=list(sex=mba$sex),mean)
## sex x
## 1 1 621.2136
## 2 2 614.1176
The average GMAT total score is 621.21 for males and 614.12 for females. The scores are again nearly the same for both sexes, so this is unlikely to be a factor behind any salary difference.
Average GMAT total percentile for both the groups separately
aggregate(mba$gmat_tpc,by=list(sex=mba$sex),mean)
## sex x
## 1 1 84.26214
## 2 2 84.00000
Here the average overall GMAT percentile is practically the same for males and females.
How many females and males speak English as their first language? For this we recode the sex column (1 = MALE, 2 = FEMALE) and the frstlang column (1 = ENGLISH, 2 = OTHER).
mba$sex[mba$sex==1]<-'MALE'
mba$sex[mba$sex==2]<-'FEMALE'
mba$sex<-factor(mba$sex)
mba$frstlang[mba$frstlang==1]<-'ENGLISH'
mba$frstlang[mba$frstlang==2]<-'OTHER'
mba$frstlang<-factor(mba$frstlang)
t<-table(mba$sex,mba$frstlang)
t
##
## ENGLISH OTHER
## FEMALE 60 8
## MALE 182 24
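Here more males than females speak English as their first language (182 versus 60), but there are also about three times as many males in the data. As a share, about 88% of each sex has English as a first language.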
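Raw counts are hard to compare when the groups differ in size, so it can help to look at shares within each sex. A small sketch, using only the four counts printed in the table above (the object names `counts`, `tab` and `row_share` are chosen here for illustration):

```r
# Counts copied from the sex-by-first-language table above.
counts <- matrix(c(60, 8, 182, 24), nrow = 2, byrow = TRUE,
                 dimnames = list(sex = c("FEMALE", "MALE"),
                                 frstlang = c("ENGLISH", "OTHER")))
tab <- as.table(counts)

# margin = 1 makes each row sum to 1: share of each language within a sex.
row_share <- prop.table(tab, margin = 1)
print(round(row_share, 3))
```

With `margin = 1` each row sums to 1, so the ENGLISH column can be read directly as the share of English speakers among females and among males.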
Average years of work experience for both the sexes
aggregate(mba$work_yrs,by=list(sex=mba$sex),mean)
## sex x
## 1 FEMALE 3.808824
## 2 MALE 3.893204
Both have nearly the same average years of work experience.
We observe that males and females differ in counts by first language, although the proportion of English speakers is nearly identical (about 88%) for both sexes.
As we are interested in the candidates who got placed, we will create new data frames for the placed, not placed, did not answer, and answered but did not disclose groups.
placed<-mba[which(mba$salary>999),]
View(placed)
notplaced<-mba[which(mba$salary==0),]
View(notplaced)
didnotanswer<-mba[which(mba$salary==998),]
View(didnotanswer)
didnotdisclose<-mba[which(mba$salary==999),]
View(didnotdisclose)
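As an alternative to four separate data frames, the same split can be kept in one column. This is only a sketch on a hypothetical salary vector; the name `status` and its level labels are assumptions made here, not part of the original analysis.

```r
# Hypothetical salaries; 0, 998 and 999 are the codes described earlier.
salary <- c(0, 998, 999, 85000, 120000, 0)

# Map each salary value to a group label.
status <- ifelse(salary == 0,   "not placed",
          ifelse(salary == 998, "did not answer",
          ifelse(salary == 999, "did not disclose", "placed")))
status <- factor(status,
                 levels = c("placed", "not placed",
                            "did not answer", "did not disclose"))

status_counts <- table(status)
print(status_counts)
```

A single factor like this can then be used directly in `table()`, `aggregate()` or `subset()` without keeping several copies of the data in sync.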
Average salary of placed candidates
mean(placed$salary)
## [1] 103030.7
The average salary is 103030.7.
Comparison of vital statistics on the basis of sex
t1<-table(placed$sex,placed$frstlang)
t1
##
## ENGLISH OTHER
## FEMALE 28 3
## MALE 68 4
Among the placed candidates, 68 males have English as their first language as compared to 28 females.
Mean of gmat total by sex
aggregate(placed$gmat_tot,by=list(sex=placed$sex),mean)
## sex x
## 1 FEMALE 614.5161
## 2 MALE 616.6667
Mean of total gmat percentile by sex
aggregate(placed$gmat_tpc,by=list(sex=placed$sex),mean)
## sex x
## 1 FEMALE 83.74194
## 2 MALE 84.86111
The average overall GMAT percentile among the placed candidates is almost the same for both sexes.
Mean age sex-wise
aggregate(placed$age,by=list(sex=placed$sex),mean)
## sex x
## 1 FEMALE 26.06452
## 2 MALE 27.08333
Placed females are about one year younger on average (26.06 versus 27.08 years), so there is not much difference in age among the placed candidates by sex.
Mean work experience sex-wise
aggregate(placed$work_yrs,by=list(sex=placed$sex),mean)
## sex x
## 1 FEMALE 3.258065
## 2 MALE 3.861111
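Again, there is not much difference in average work experience between placed males (3.86 years) and placed females (3.26 years).
Sex-wise mean of different variables in the original data frame mba. Note that salary here still includes the codes 0, 998 and 999, so the salary means in this table are not real average salaries.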
aggregate(cbind(salary,work_yrs,gmat_tot,age)~sex,data = mba ,mean)
## sex salary work_yrs gmat_tot age
## 1 FEMALE 45121.07 3.808824 614.1176 27.17647
## 2 MALE 37013.62 3.893204 621.2136 27.41748
Average salary of placed males and females
aggregate(placed$salary,by=list(sex=placed$sex),mean)
## sex x
## 1 FEMALE 98524.39
## 2 MALE 104970.97
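Here we can see that the average salary of placed males is 104970.97, which is greater than that of placed females (98524.39).
Mean salary and work experience by age in the data frames placed and mba (the mba version again includes the salary codes 0, 998 and 999)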
aggregate(cbind(salary,work_yrs)~age,data=placed,mean)
## age salary work_yrs
## 1 22 85000.00 1.000000
## 2 23 91651.20 1.800000
## 3 24 101518.75 1.875000
## 4 25 99086.96 2.260870
## 5 26 101665.00 2.642857
## 6 27 102214.29 3.214286
## 7 28 103625.00 4.500000
## 8 29 102083.33 5.833333
## 9 30 109916.67 6.333333
## 10 31 100500.00 5.500000
## 11 32 107300.00 2.000000
## 12 33 118000.00 10.000000
## 13 34 105000.00 16.000000
## 14 39 112000.00 16.000000
## 15 40 183000.00 15.000000
aggregate(cbind(work_yrs,salary)~age,data=mba,mean)
## age work_yrs salary
## 1 22 1.000000 42500.00
## 2 23 1.750000 57282.00
## 3 24 1.727273 49342.24
## 4 25 2.264151 43395.55
## 5 26 2.875000 35982.07
## 6 27 3.130435 31499.37
## 7 28 4.666667 39809.00
## 8 29 4.500000 28067.95
## 9 30 5.583333 55291.25
## 10 31 5.800000 40599.40
## 11 32 5.625000 13662.25
## 12 33 10.000000 118000.00
## 13 34 11.500000 26250.00
## 14 35 9.333333 0.00
## 15 36 12.500000 0.00
## 16 37 9.000000 0.00
## 17 39 10.500000 56000.00
## 18 40 15.000000 183000.00
## 19 42 13.000000 0.00
## 20 43 19.000000 0.00
## 21 48 22.000000 0.00
Average salary on the basis of satisfaction and sex
aggregate(placed$salary,by=list(satisfaction=placed$satis,sex=placed$sex),mean)
## satisfaction sex x
## 1 3 FEMALE 95000.00
## 2 5 FEMALE 93354.67
## 3 6 FEMALE 111400.00
## 4 7 FEMALE 90625.00
## 5 4 MALE 95000.00
## 6 5 MALE 109764.71
## 7 6 MALE 103855.25
## 8 7 MALE 103050.00
This table shows the average salary at each satisfaction level for each sex; on its own it does not show which sex is more satisfied. For women the average salary is highest at satisfaction level 6 (111400.00), whereas for men it is highest at level 5 (109764.71).
Data frame containing the candidates who are placed together with those who did not disclose their salary
placed1<-rbind(placed,didnotdisclose)
View(placed1)
Let us visualize the placed data using the boxplot
Age distribution
boxplot(placed$age,main="age distribution",xlab="age",horizontal = TRUE,col = "orchid3")
A few outliers can be seen here; otherwise the data is roughly symmetric, with a median age of 26.
GMAT TOTAL Distribution
boxplot(placed$gmat_tot,main="gmat total distribution",xlab="gmat_tot",horizontal = TRUE,col = "beige")
The median GMAT total score is approximately 625. The tail is slightly longer towards the left, but we can consider this data to be symmetric.
GMAT total percentile
boxplot(placed$gmat_tpc,main="total percentile distribution",xlab="gmat-tpc",horizontal = TRUE,col = "blue4")
The tail is towards the left and most values are concentrated on the right, so the data is negatively skewed. More than 50% of the candidates have a total percentile above the average. Only 25% of the candidates are below the 78th percentile.
Salary distribution
boxplot(placed$salary,main="salary distribution",xlab="salary",horizontal = TRUE,col = "peachpuff")
The median salary is 100000. A few candidates are earning extreme salaries, which appear as outliers. The upper whisker ends at approximately 125000, while the highest salary in the data is 220000.
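The end of a boxplot whisker and the maximum of the data are different things when outliers are present. A minimal sketch of reading both numerically with `boxplot.stats()`, on a hypothetical salary vector (`sal` and its values are made up for illustration):

```r
# Hypothetical salaries, not taken from the data set.
sal <- c(85000, 92000, 95000, 98000, 100000, 102000, 105000, 110000, 220000)

bs <- boxplot.stats(sal)
whiskers <- bs$stats[c(1, 5)]  # ends of the lower and upper whiskers
outliers <- bs$out             # points drawn individually beyond the whiskers

print(whiskers)
print(outliers)
print(max(sal))                # the true maximum, which may be an outlier
```

Applied to `placed$salary`, the same call would give the exact whisker ends and the list of outlying salaries instead of values read off the plot by eye.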
Satisfaction distribution
boxplot(placed$satis,main="satisfaction distribution",xlab="satisfaction",horizontal = TRUE,col = "orchid3")
Candidates seem pretty satisfied, as 50% of candidates are in the range 5 to 6. The median satisfaction level is 6, and 75% of candidates are at satisfaction level 5 or above. It seems that getting a job after the MBA is quite satisfying.
Spring MBA average distribution
boxplot(placed$s_avg,horizontal = TRUE,main="spring mba average",xlab="s_avg",col="brown")
We get roughly symmetric data for the spring MBA average, with a median over 3.0.
SALARY distribution gender-wise
boxplot(placed$salary~placed$sex,las=1,horizontal=TRUE, main="Salary breakup sex wise", xlab="salary",ylab="sex",col=c("red","blue"))
Salary for males is higher than that of females.
Salary distribution on the basis of first language
boxplot(placed$salary~placed$frstlang,main="salary on the basis of first language",xlab="salary",horizontal=TRUE,col=c("orchid3","peachpuff"))
Here candidates having another language as their first language are earning more.
Counts by first language
t<-table(placed$frstlang)
t
##
## ENGLISH OTHER
## 96 7
Here English speakers are the large majority, but the 7 candidates with another first language have higher earnings. With only 7 such candidates, this comparison should be read with caution.
Boxplot of salary by total GMAT score
boxplot(placed$salary~placed$gmat_tot,horizontal=TRUE, main="salary vs gmat total",xlab="salary",ylab="gmat score", las=1,col=c("red","beige","magenta","cyan","orchid3","blue","peachpuff","chartreuse4"))
Surprisingly, candidates with a GMAT score of 500, i.e. the lowest GMAT total score, are earning more than the others.
Salary and Total gmat percentile
boxplot(placed$salary~placed$gmat_tpc,horizontal=TRUE, main="salary vs gmat percentile",xlab="salary",ylab="gmat percentile", las=1,col=c("red","beige","magenta","cyan","orchid3","blue","peachpuff","chartreuse4"))
Candidates at the 69th percentile show consistent salaries, i.e. little fluctuation along with a higher salary. Their median salary is a little lower than the median of the 99th percentile group.
Salary on the basis of work experience
boxplot(placed$salary~placed$work_yrs,horizontal=TRUE, main="salary vs work experience",xlab="salary",ylab="work ex", las=1,col=c("red","beige","magenta","cyan","orchid3","blue","peachpuff","chartreuse4"))
More experience, more salary. Here 15 years of experience is attracting higher packages.
Salary vs Quartile ranking
boxplot(placed$salary~placed$quarter,horizontal=TRUE, main="salary vs quartile",xlab="salary",ylab="quartile", las=1,col=c("red","beige","magenta","cyan","orchid3","blue","peachpuff","chartreuse4"))
Here the lowest quartile is earning the maximum salary.
Let us see what histograms are saying
Salary for placed candidates
library(lattice)
histogram(~placed$salary,data=placed, main="frequency of starting salary", xlab="salary", col="peachpuff")
Salary for the placed candidates together with those who did not disclose their salary
histogram(~placed1$salary,data=placed1, main="frequency of starting salary", xlab="salary", col="grey")
In both graphs the largest percentage of candidates earn in the range 100000 to 125000.
Scatterplot of salary vs GMAT total
plot(placed$gmat_tot,placed$salary,main="scatterplot salary vs total gmat score",xlab = "gmat score", ylab = "salary",las=1,col=c("red","blue","green","brown"))
abline(lm(salary~gmat_tot,data=placed),col="red")
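`abline()` draws the fitted line correctly only when the model's response is the variable on the vertical axis, so for `plot(x, y)` the model should be `lm(y ~ x)`. A small sketch on simulated data shows how different the two sets of coefficients are; `x`, `y` and the numbers used to generate them are hypothetical.

```r
# Hypothetical data: x plays the role of a GMAT score, y of a salary.
set.seed(1)
x <- seq(500, 700, by = 20)
y <- 60000 + 50 * x + rnorm(length(x), sd = 2000)

fit_yx <- lm(y ~ x)  # matches plot(x, y): response on the vertical axis
fit_xy <- lm(x ~ y)  # reversed roles: coefficients are on a different scale

plot(x, y, xlab = "x", ylab = "y", las = 1)
abline(fit_yx, col = "red")

print(coef(fit_yx))
print(coef(fit_xy))
```

Passing `fit_xy` to `abline()` on this plot would use an intercept and slope expressed in units of x per unit of y, which generally puts the line far outside the plotted region.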
Salary vs work experience
plot(placed$work_yrs,placed$salary,main="scatterplot salary vs work experience",xlab = "work ex", ylab = "salary",las=1,col=c("red","blue","green","brown"))
abline(lm(salary~work_yrs,data=placed),col="red")
Salary vs f_avg
plot(placed$f_avg,placed$salary,main="scatterplot salary vs fall mba average",xlab = "f_avg", ylab = "salary",las=1,col=c("red","blue","green","brown"))
abline(lm(salary~f_avg,data=placed),col="red")
Plot of the entire data
plot(placed)
Plot of salary with work experience, GMAT total, f_avg, satisfaction and total percentile
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(formula=~salary+work_yrs+gmat_tot+f_avg+satis+gmat_tpc,data=placed,diagonal="histogram")
Here we see that salary and work experience are positively correlated. GMAT total and GMAT total percentile are also positively correlated.
Let us make some comparisons between placed and not placed candidates
So first combine our data frames placed, notplaced and didnotdisclose
fullmba<-rbind(placed,notplaced,didnotdisclose)
View(fullmba)
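Now we will create an indicator Gotplaced that is TRUE for candidates who got a job (salary > 999) and FALSE otherwise. Note that FALSE covers both the not placed candidates (salary = 0) and those who did not disclose their salary (salary = 999), so it is not strictly a "didn't get a job" group.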
fullmba$Gotplaced=(fullmba$salary>999)
View(fullmba)
fullmba$Gotplaced<-factor(fullmba$Gotplaced)
No of placed and not placed candidates
tab<-table(fullmba$Gotplaced)
tab
##
## FALSE TRUE
## 125 103
No of placed and not placed gender wise
tab1<-xtabs(~fullmba$Gotplaced+ fullmba$sex,data=fullmba)
tab1
## fullmba$sex
## fullmba$Gotplaced FEMALE MALE
## FALSE 28 97
## TRUE 31 72
addmargins(tab1)
## fullmba$sex
## fullmba$Gotplaced FEMALE MALE Sum
## FALSE 28 97 125
## TRUE 31 72 103
## Sum 59 169 228
Express the above counts as proportions of the grand total
prop.table(tab1)
## fullmba$sex
## fullmba$Gotplaced FEMALE MALE
## FALSE 0.1228070 0.4254386
## TRUE 0.1359649 0.3157895
No of placed not placed on the basis of work experience
tab2<-xtabs(~fullmba$Gotplaced+ fullmba$sex + fullmba$work_yrs,data=fullmba)
ftable(tab2)
## fullmba$work_yrs 0 1 2 3 4 5 6 7 8 9 10 11 12 13 15 16 18 22
## fullmba$Gotplaced fullmba$sex
## FALSE FEMALE 0 0 8 6 2 5 1 2 0 1 1 1 0 1 0 0 0 0
## MALE 2 14 20 16 18 9 3 5 2 1 0 1 2 0 0 1 1 2
## TRUE FEMALE 0 4 14 5 1 3 2 0 1 0 0 0 0 0 1 0 0 0
## MALE 1 4 24 16 10 4 5 1 3 0 1 0 0 0 1 2 0 0
addmargins(ftable(tab2))
## Sum
## 0 0 8 6 2 5 1 2 0 1 1 1 0 1 0 0 0 0 28
## 2 14 20 16 18 9 3 5 2 1 0 1 2 0 0 1 1 2 97
## 0 4 14 5 1 3 2 0 1 0 0 0 0 0 1 0 0 0 31
## 1 4 24 16 10 4 5 1 3 0 1 0 0 0 1 2 0 0 72
## Sum 3 22 66 43 31 21 11 8 6 2 2 2 2 1 2 3 1 2 228
No of placed and not placed on the basis of language
tab3<-xtabs(~fullmba$Gotplaced+ fullmba$sex + fullmba$frstlang, data= fullmba)
ftable(tab3)
## fullmba$frstlang ENGLISH OTHER
## fullmba$Gotplaced fullmba$sex
## FALSE FEMALE 25 3
## MALE 83 14
## TRUE FEMALE 28 3
## MALE 68 4
addmargins(ftable(tab3))
## Sum
## 25 3 28
## 83 14 97
## 28 3 31
## 68 4 72
## Sum 204 24 228
Express the above as proportions of the grand total
prop.table(ftable(tab3))
## fullmba$frstlang ENGLISH OTHER
## fullmba$Gotplaced fullmba$sex
## FALSE FEMALE 0.10964912 0.01315789
## MALE 0.36403509 0.06140351
## TRUE FEMALE 0.12280702 0.01315789
## MALE 0.29824561 0.01754386
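Now we will test some hypotheses
H0: percentage of females placed = percentage of males placed
H1: percentage of females placed is more than the percentage of males placed
Using the chi-square test here. Note that chisq.test is two-sided: it checks whether the two percentages differ in either direction.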
chisq.test(tab1)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab1
## X-squared = 1.366, df = 1, p-value = 0.2425
Here the p-value is greater than 0.05, so we fail to reject the null hypothesis: there is no statistically significant evidence that the percentage of females placed differs from the percentage of males placed.
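Since H1 is directional, a one-sided test of two proportions is one way to match it more closely. The sketch below uses `prop.test()` with the counts from the Gotplaced-by-sex table (31 of 59 females and 72 of 169 males in the TRUE row); the object names are chosen here for illustration, and this is offered as one possible approach rather than the method used in the analysis.

```r
# Counts from the Gotplaced-by-sex table: placed / total, females then males.
placed_n <- c(FEMALE = 31, MALE = 72)
total_n  <- c(FEMALE = 59, MALE = 169)

# Two-sided test; for two groups this matches chisq.test on the 2x2 table.
two_sided <- prop.test(placed_n, total_n)

# One-sided test of "female placement rate is greater than male".
one_sided <- prop.test(placed_n, total_n, alternative = "greater")

print(two_sided$estimate)   # placement rate within each sex
print(two_sided$statistic)  # same X-squared as the chi-square test above
print(one_sided$p.value)
```

The `estimate` component also gives the placement rate within each sex, which is the quantity the hypotheses are actually about.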
Second hypothesis
H0: percentage of people placed is the same whether or not English is the first language
H2: percentage of people placed having English as first language is more than that of people having another first language
t4<-xtabs(~fullmba$Gotplaced+ fullmba$frstlang,data = fullmba)
prop.table(t4)
## fullmba$frstlang
## fullmba$Gotplaced ENGLISH OTHER
## FALSE 0.47368421 0.07456140
## TRUE 0.42105263 0.03070175
chisq.test(t4)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: t4
## X-squared = 2.1002, df = 1, p-value = 0.1473
Again our p-value is greater than 0.05, so we fail to reject the null hypothesis: there is no statistically significant evidence that the percentage placed differs between people with English as first language and those with another first language.
Now hypotheses based on the t-test
1st H1: Average salary of placed females is greater than that of placed males
t.test(salary~sex,data =placed )
##
## Welch Two Sample t-test
##
## data: salary by sex
## t = -1.3628, df = 38.115, p-value = 0.1809
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -16021.72 3128.55
## sample estimates:
## mean in group FEMALE mean in group MALE
## 98524.39 104970.97
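Here the p-value is greater than 0.05, so there is no statistically significant difference in the average salary of placed males and females. Note that the sample means point the other way from H1: placed males average 104970.97 and placed females 98524.39.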
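`t.test()` is two-sided by default, while the stated H1 is directional. A brief sketch of the `alternative` argument on hypothetical data follows; `toy` and its salaries are made up, and with the formula interface the difference tested is the first factor level minus the second (here FEMALE minus MALE).

```r
# Hypothetical salaries for two groups; not rows from the data set.
toy <- data.frame(
  salary = c(98000, 95000, 101000, 99000, 104000, 107000, 103000, 106000),
  sex    = factor(rep(c("FEMALE", "MALE"), each = 4))
)

# Default: two-sided Welch test.
tt_two <- t.test(salary ~ sex, data = toy)

# One-sided: mean(FEMALE) - mean(MALE) > 0.
tt_greater <- t.test(salary ~ sex, data = toy, alternative = "greater")

print(tt_two$estimate)
print(tt_two$p.value)
print(tt_greater$p.value)
```

When the sample difference points against the one-sided alternative, as in this toy example, the one-sided p-value is large, which is the situation the note on sample means describes.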
Second hypothesis
H2: Average salary of placed people having English as first language is greater than that of the others
t.test(salary~frstlang,data=placed)
##
## Welch Two Sample t-test
##
## data: salary by frstlang
## t = -1.1202, df = 6.0863, p-value = 0.3049
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -59933.62 22202.25
## sample estimates:
## mean in group ENGLISH mean in group OTHER
## 101748.6 120614.3
On the basis of the p-value, which is greater than 0.05, we can say that there is no statistically significant difference in average salary between placed candidates whose first language is English and those with another first language. The OTHER group has only 7 candidates, so this test has little power.
Now we will run the regression analysis
For model selection let us see the correlogram of the variables for the placed candidates
We read the data set again here and create a new data frame called plac containing only the placed candidates, because the salary column has a lot of missing information. Reading the file again also restores sex and frstlang to their numeric codes, which cor() requires.
MBA<-read.csv("MBA Starting Salaries Data.csv")
plac<-MBA[which(MBA$salary>999),]
coor<-cor(plac)
coor
## age sex gmat_tot gmat_qpc gmat_vpc
## age 1.00000000 -0.14352927 -0.07871678 -0.165039057 0.01799420
## sex -0.14352927 1.00000000 -0.01955548 -0.147099027 0.05341428
## gmat_tot -0.07871678 -0.01955548 1.00000000 0.666382266 0.78038546
## gmat_qpc -0.16503906 -0.14709903 0.66638227 1.000000000 0.09466541
## gmat_vpc 0.01799420 0.05341428 0.78038546 0.094665411 1.00000000
## gmat_tpc -0.09609156 -0.04686981 0.96680810 0.658650025 0.78443167
## s_avg 0.15654954 0.08079985 0.17198874 0.015471662 0.15865101
## f_avg -0.21699191 0.16572186 0.12246257 0.098418869 0.02290167
## quarter -0.12568145 -0.02139041 -0.10578964 0.012648346 -0.12862079
## work_yrs 0.88052470 -0.09233003 -0.12280018 -0.182701263 -0.02812182
## frstlang 0.35026743 0.07512009 -0.13164323 0.014198516 -0.21835333
## salary 0.49964284 -0.16628869 -0.09067141 0.014141299 -0.13743230
## satis 0.10832308 -0.09199534 0.06474206 -0.003984632 0.14863481
## gmat_tpc s_avg f_avg quarter work_yrs
## age -0.09609156 0.15654954 -0.21699191 -0.12568145 0.88052470
## sex -0.04686981 0.08079985 0.16572186 -0.02139041 -0.09233003
## gmat_tot 0.96680810 0.17198874 0.12246257 -0.10578964 -0.12280018
## gmat_qpc 0.65865003 0.01547166 0.09841887 0.01264835 -0.18270126
## gmat_vpc 0.78443167 0.15865101 0.02290167 -0.12862079 -0.02812182
## gmat_tpc 1.00000000 0.13938500 0.07051391 -0.09955033 -0.13246963
## s_avg 0.13938500 1.00000000 0.44590413 -0.84038355 0.16328236
## f_avg 0.07051391 0.44590413 1.00000000 -0.43144819 -0.21633018
## quarter -0.09955033 -0.84038355 -0.43144819 1.00000000 -0.12896722
## work_yrs -0.13246963 0.16328236 -0.21633018 -0.12896722 1.00000000
## frstlang -0.16437561 -0.13788905 -0.05061394 0.10955726 0.19627277
## salary -0.13201783 0.10173175 -0.10603897 -0.12848526 0.45466634
## satis 0.11630842 -0.14356557 -0.11773304 0.22511985 0.06299926
## frstlang salary satis
## age 0.35026743 0.49964284 0.108323083
## sex 0.07512009 -0.16628869 -0.091995338
## gmat_tot -0.13164323 -0.09067141 0.064742057
## gmat_qpc 0.01419852 0.01414130 -0.003984632
## gmat_vpc -0.21835333 -0.13743230 0.148634805
## gmat_tpc -0.16437561 -0.13201783 0.116308417
## s_avg -0.13788905 0.10173175 -0.143565573
## f_avg -0.05061394 -0.10603897 -0.117733043
## quarter 0.10955726 -0.12848526 0.225119851
## work_yrs 0.19627277 0.45466634 0.062999256
## frstlang 1.00000000 0.26701953 0.089834769
## salary 0.26701953 1.00000000 -0.040050600
## satis 0.08983477 -0.04005060 1.000000000
library(corrplot)
## corrplot 0.84 loaded
corrplot(corr =cor(plac,use = "complete.obs"),method = "ellipse")
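Here salary is positively correlated with age (0.50), work experience (0.45) and first language (0.27). It is also positively correlated with s_avg, though very weakly (0.10). In the plot, blue shades show positive correlation and red shades show negative correlation; sex (-0.17), gmat_tot (-0.09) and gmat_tpc (-0.13) are negatively correlated with salary. Because sex is coded 1 = male, 2 = female and frstlang is coded 1 = English, 2 = other, the signs mean that salary tends to be lower for females and higher for candidates whose first language is not English. These are correlations, not causal effects.
Regression analysis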
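A correlation matrix shows the size and sign of each correlation but not how precisely it is estimated. One way to check a single pair is `cor.test()`, sketched here on simulated data; `age`, `salary` and the generating numbers are hypothetical.

```r
# Hypothetical data with a built-in positive relationship.
set.seed(3)
age    <- round(rnorm(60, mean = 27, sd = 3))
salary <- 50000 + 2000 * age + rnorm(60, sd = 12000)

ct <- cor.test(age, salary)  # Pearson correlation with a t-based test

print(ct$estimate)  # the same value cor(age, salary) returns
print(ct$conf.int)  # 95% confidence interval for the correlation
print(ct$p.value)
```

For the placed candidates, `cor.test(plac$salary, plac$age)` would give the corresponding interval for the 0.50 correlation reported in the matrix.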
Model1 <- salary ~ work_yrs + s_avg + f_avg + gmat_qpc + gmat_vpc + sex + frstlang + satis
fit <- lm(Model1, data = plac)
summary(fit)
##
## Call:
## lm(formula = Model1, data = plac)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29800 -7822 -1742 4869 82341
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 86719.94 23350.43 3.714 0.000346 ***
## work_yrs 2331.12 585.99 3.978 0.000137 ***
## s_avg 4659.05 5015.66 0.929 0.355320
## f_avg -1698.83 3834.70 -0.443 0.658773
## gmat_qpc 98.72 121.85 0.810 0.419884
## gmat_vpc -95.80 102.99 -0.930 0.354699
## sex -5289.24 3545.91 -1.492 0.139140
## frstlang 13994.76 6641.66 2.107 0.037770 *
## satis -1671.20 2070.62 -0.807 0.421643
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15740 on 94 degrees of freedom
## Multiple R-squared: 0.285, Adjusted R-squared: 0.2241
## F-statistic: 4.683 on 8 and 94 DF, p-value: 7.574e-05
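Here the only significant variables are work_yrs and frstlang, and both have positive coefficients. Since frstlang is coded 1 = English, 2 = other, the positive coefficient means a higher expected salary for candidates whose first language is not English, holding the other variables fixed.
The R-squared is only 0.285, i.e. 28.5% of the variation in salary is explained by the model.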
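The adjusted R-squared reported in the summary penalizes the plain R-squared for the number of predictors. As a check on the arithmetic, the sketch below recomputes it from the values printed above using the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1); the variable names are chosen here for illustration.

```r
r2 <- 0.285  # Multiple R-squared from the summary (rounded as printed)
n  <- 103    # observations: 94 residual df + 8 predictors + 1 intercept
p  <- 8      # predictors in Model1

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(adj_r2)
```

The result agrees with the reported 0.2241 up to the rounding of the printed R-squared.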
Now the 2nd regression
Model <- salary ~
work_yrs + s_avg + f_avg + gmat_qpc + gmat_vpc + satis
fit1 <- lm(Model, data = plac)
summary(fit1)
##
## Call:
## lm(formula = Model, data = plac)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33922 -8301 -1314 4030 85666
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95302.4 22053.9 4.321 3.79e-05 ***
## work_yrs 2690.3 577.6 4.658 1.03e-05 ***
## s_avg 3001.1 5055.9 0.594 0.554
## f_avg -1817.0 3871.3 -0.469 0.640
## gmat_qpc 151.8 121.5 1.249 0.215
## gmat_vpc -152.5 102.2 -1.492 0.139
## satis -1012.3 2093.7 -0.484 0.630
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16060 on 96 degrees of freedom
## Multiple R-squared: 0.2401, Adjusted R-squared: 0.1926
## F-statistic: 5.056 on 6 and 96 DF, p-value: 0.0001515
Again this model explains only 24% of the variation in salary.
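The second model is the first one with sex and frstlang removed, so the two are nested and can be compared formally rather than by R-squared alone. A minimal sketch of that idea with `anova()` on simulated data; the data frame `toy`, the grouping variable `grp` and the coefficients are all hypothetical.

```r
# Hypothetical data: salary depends on work_yrs only.
set.seed(42)
n <- 100
toy <- data.frame(
  work_yrs = rpois(n, 4),
  s_avg    = runif(n, 2, 4),
  grp      = factor(sample(c("A", "B"), n, replace = TRUE))
)
toy$salary <- 90000 + 2500 * toy$work_yrs + rnorm(n, sd = 15000)

small <- lm(salary ~ work_yrs + s_avg, data = toy)
big   <- lm(salary ~ work_yrs + s_avg + grp, data = toy)

# F test of whether the extra term in `big` improves the fit.
cmp <- anova(small, big)
print(cmp)
```

R-squared never decreases when predictors are added, so the F test from `anova()` is a more informative guide than comparing 28.5% with 24% directly.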
3rd model
Model1 <- salary ~
work_yrs + f_avg + gmat_qpc + gmat_vpc + satis+gmat_tot+gmat_tpc+s_avg+age
fit1 <- lm(Model1, data = plac)
summary(fit1)
##
## Call:
## lm(formula = Model1, data = plac)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29429 -7405 358 5528 69521
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54916.31 50589.39 1.086 0.2805
## work_yrs 338.47 1087.50 0.311 0.7563
## f_avg -1642.93 3797.13 -0.433 0.6663
## gmat_qpc 891.50 486.78 1.831 0.0702 .
## gmat_vpc 579.60 488.29 1.187 0.2382
## satis -1284.40 2065.97 -0.622 0.5357
## gmat_tot -21.21 169.27 -0.125 0.9005
## gmat_tpc -1419.88 711.10 -1.997 0.0488 *
## s_avg 3460.17 4923.28 0.703 0.4839
## age 2437.73 1002.98 2.430 0.0170 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15410 on 93 degrees of freedom
## Multiple R-squared: 0.3218, Adjusted R-squared: 0.2562
## F-statistic: 4.903 on 9 and 93 DF, p-value: 2.219e-05
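Here 32% of the variation in salary is explained by the model, and the significant variables are gmat_tpc and age. Note that work_yrs is no longer significant once age is included: the two are strongly correlated (0.88 in the correlation matrix above), as are gmat_tot and gmat_tpc (0.97), so the individual coefficients in this model should be interpreted with caution.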
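When predictors are strongly correlated with each other, their standard errors inflate and individual coefficients become unstable. A common diagnostic is the variance inflation factor, VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on the other predictors. The sketch below computes it with base R on simulated data; `age`, `work_yrs`, `other` and the way they are generated are hypothetical.

```r
# Hypothetical predictors: work_yrs is built to track age closely.
set.seed(7)
n <- 200
age      <- round(rnorm(n, mean = 27, sd = 3))
work_yrs <- pmax(0, age - 23 + rnorm(n, sd = 1))
other    <- rnorm(n)  # unrelated predictor

X <- data.frame(age, work_yrs, other)

# Regress each predictor on the remaining ones and convert R^2 to a VIF.
vif_manual <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

Applying the same calculation to the predictors of the third model would show how much age and work_yrs, and gmat_tot and gmat_tpc, overlap with each other.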