In this task, we analyze the MBA Starting Salaries dataset.
First, we download the dataset and read the file.
mba.df<-read.csv(paste("G:/R Intern/MBA Starting Salaries Data.csv"))
View(mba.df)
dim(mba.df)
## [1] 274 13
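Before the analysis, we must clean the data. In the salary column, 998 and 999 are codes for students who did not disclose a salary, and 0 marks students who were not placed. We keep only the rows with an actual salary.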
placed.mba<-mba.df[which(mba.df$salary!=999 & mba.df$salary!=998 & mba.df$salary!=0),]
View(placed.mba)
dim(placed.mba)
## [1] 103 13
This is the data for all students who reported their salary details. As we see, only 103 of the 274 students are in this dataset.
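An alternative to dropping rows is to recode the sentinel values as NA, so that R's usual missing-value handling applies. The sketch below uses a small made-up data frame (`toy`, not the MBA data) and assumes, as in the filtering step above, that 998 and 999 are codes rather than real values.

```r
# Toy data frame for illustration only (not the real MBA data).
toy <- data.frame(
  salary = c(0, 999, 95000, 998, 105000),
  satis  = c(6, 5, 7, 998, 6)
)

# Assumption: 998 and 999 are codes for "no answer", not real values.
missing_codes <- c(998, 999)

# Replace the codes with NA in every column.
toy_clean <- as.data.frame(lapply(toy, function(x) {
  x[x %in% missing_codes] <- NA
  x
}))

# Missing values can now be skipped explicitly.
mean(toy_clean$salary, na.rm = TRUE)                        # disclosed values, zeros included
mean(toy_clean$salary[toy_clean$salary > 0], na.rm = TRUE)  # positive salaries only
```

Keeping the rows and marking the codes as NA leaves the other columns of those students available for later steps.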
summary(mba.df)
## age sex gmat_tot gmat_qpc
## Min. :22.00 Min. :1.000 Min. :450.0 Min. :28.00
## 1st Qu.:25.00 1st Qu.:1.000 1st Qu.:580.0 1st Qu.:72.00
## Median :27.00 Median :1.000 Median :620.0 Median :83.00
## Mean :27.36 Mean :1.248 Mean :619.5 Mean :80.64
## 3rd Qu.:29.00 3rd Qu.:1.000 3rd Qu.:660.0 3rd Qu.:93.00
## Max. :48.00 Max. :2.000 Max. :790.0 Max. :99.00
## gmat_vpc gmat_tpc s_avg f_avg
## Min. :16.00 Min. : 0.0 Min. :2.000 Min. :0.000
## 1st Qu.:71.00 1st Qu.:78.0 1st Qu.:2.708 1st Qu.:2.750
## Median :81.00 Median :87.0 Median :3.000 Median :3.000
## Mean :78.32 Mean :84.2 Mean :3.025 Mean :3.062
## 3rd Qu.:91.00 3rd Qu.:94.0 3rd Qu.:3.300 3rd Qu.:3.250
## Max. :99.00 Max. :99.0 Max. :4.000 Max. :4.000
## quarter work_yrs frstlang salary
## Min. :1.000 Min. : 0.000 Min. :1.000 Min. : 0
## 1st Qu.:1.250 1st Qu.: 2.000 1st Qu.:1.000 1st Qu.: 0
## Median :2.000 Median : 3.000 Median :1.000 Median : 999
## Mean :2.478 Mean : 3.872 Mean :1.117 Mean : 39026
## 3rd Qu.:3.000 3rd Qu.: 4.000 3rd Qu.:1.000 3rd Qu.: 97000
## Max. :4.000 Max. :22.000 Max. :2.000 Max. :220000
## satis
## Min. : 1.0
## 1st Qu.: 5.0
## Median : 6.0
## Mean :172.2
## 3rd Qu.: 7.0
## Max. :998.0
library(psych)
describe(mba.df)
## vars n mean sd median trimmed mad min max
## age 1 274 27.36 3.71 27 26.76 2.97 22 48
## sex 2 274 1.25 0.43 1 1.19 0.00 1 2
## gmat_tot 3 274 619.45 57.54 620 618.86 59.30 450 790
## gmat_qpc 4 274 80.64 14.87 83 82.31 14.83 28 99
## gmat_vpc 5 274 78.32 16.86 81 80.33 14.83 16 99
## gmat_tpc 6 274 84.20 14.02 87 86.12 11.86 0 99
## s_avg 7 274 3.03 0.38 3 3.03 0.44 2 4
## f_avg 8 274 3.06 0.53 3 3.09 0.37 0 4
## quarter 9 274 2.48 1.11 2 2.47 1.48 1 4
## work_yrs 10 274 3.87 3.23 3 3.29 1.48 0 22
## frstlang 11 274 1.12 0.32 1 1.02 0.00 1 2
## salary 12 274 39025.69 50951.56 999 33607.86 1481.12 0 220000
## satis 13 274 172.18 371.61 6 91.50 1.48 1 998
## range skew kurtosis se
## age 26 2.16 6.45 0.22
## sex 1 1.16 -0.66 0.03
## gmat_tot 340 -0.01 0.06 3.48
## gmat_qpc 71 -0.92 0.30 0.90
## gmat_vpc 83 -1.04 0.74 1.02
## gmat_tpc 99 -2.28 9.02 0.85
## s_avg 2 -0.06 -0.38 0.02
## f_avg 4 -2.08 10.85 0.03
## quarter 3 0.02 -1.35 0.07
## work_yrs 22 2.78 9.80 0.20
## frstlang 1 2.37 3.65 0.02
## salary 220000 0.70 -1.05 3078.10
## satis 997 1.77 1.13 22.45
As we see in the salary row, the median is 999 because many students did not disclose their salary (the 998 code likewise inflates the mean of satis to 172.2). We therefore repeat the summary on the cleaned dataset.
summary(placed.mba$salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 64000 95000 100000 103031 106000 220000
As we see above, placed students earn at least 64000, with a mean of about 103000 and a median of 100000.
str(mba.df)
## 'data.frame': 274 obs. of 13 variables:
## $ age : int 23 24 24 24 24 24 25 25 25 25 ...
## $ sex : int 2 1 1 1 2 1 1 2 1 1 ...
## $ gmat_tot: int 620 610 670 570 710 640 610 650 630 680 ...
## $ gmat_qpc: int 77 90 99 56 93 82 89 88 79 99 ...
## $ gmat_vpc: int 87 71 78 81 98 89 74 89 91 81 ...
## $ gmat_tpc: int 87 87 95 75 98 91 87 92 89 96 ...
## $ s_avg : num 3.4 3.5 3.3 3.3 3.6 3.9 3.4 3.3 3.3 3.45 ...
## $ f_avg : num 3 4 3.25 2.67 3.75 3.75 3.5 3.75 3.25 3.67 ...
## $ quarter : int 1 1 1 1 1 1 1 1 1 1 ...
## $ work_yrs: int 2 2 2 1 2 2 2 2 2 2 ...
## $ frstlang: int 1 1 1 1 1 1 1 1 2 1 ...
## $ salary : int 0 0 0 0 999 0 0 0 999 998 ...
## $ satis : int 7 6 6 7 5 6 5 6 4 998 ...
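# Mean salary of placed students: & works element-wise, and the 0, 998 and 999 codes are excluded.
# This is the mean reported by summary(placed.mba$salary) above (about 103031).
mean(mba.df$salary[mba.df$salary > 0 & mba.df$salary != 998 & mba.df$salary != 999])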
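When filtering a vector, the element-wise operator `&` is needed; `&&` is meant for single TRUE/FALSE values, and recent R versions raise an error when it is given longer vectors. A minimal sketch on a made-up vector `s` (the coding of 0, 998 and 999 is assumed to match the salary column):

```r
# Made-up salary vector: 0 = not placed, 998/999 = not disclosed (assumed coding).
s <- c(0, 999, 95000, 998, 105000)

# Element-wise condition: one TRUE/FALSE per entry.
keep <- s > 0 & s != 998 & s != 999
keep

# Mean over the entries that passed the filter.
mean(s[keep])
```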
Here we see that categorical variables such as sex, quarter, frstlang and satis are stored as integers, not factors. We convert sex and frstlang to factors.
mba.df$sex <- factor(mba.df$sex)
mba.df$frstlang <- factor(mba.df$frstlang)
str(mba.df)
## 'data.frame': 274 obs. of 13 variables:
## $ age : int 23 24 24 24 24 24 25 25 25 25 ...
## $ sex : Factor w/ 2 levels "1","2": 2 1 1 1 2 1 1 2 1 1 ...
## $ gmat_tot: int 620 610 670 570 710 640 610 650 630 680 ...
## $ gmat_qpc: int 77 90 99 56 93 82 89 88 79 99 ...
## $ gmat_vpc: int 87 71 78 81 98 89 74 89 91 81 ...
## $ gmat_tpc: int 87 87 95 75 98 91 87 92 89 96 ...
## $ s_avg : num 3.4 3.5 3.3 3.3 3.6 3.9 3.4 3.3 3.3 3.45 ...
## $ f_avg : num 3 4 3.25 2.67 3.75 3.75 3.5 3.75 3.25 3.67 ...
## $ quarter : int 1 1 1 1 1 1 1 1 1 1 ...
## $ work_yrs: int 2 2 2 1 2 2 2 2 2 2 ...
## $ frstlang: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 2 1 ...
## $ salary : int 0 0 0 0 999 0 0 0 999 998 ...
## $ satis : int 7 6 6 7 5 6 5 6 4 998 ...
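Factor levels can also carry readable labels, which makes later tables and plots easier to read. The sketch below uses a made-up vector; the mapping 1 = Male, 2 = Female is an assumption about the coding and should be checked against the dataset's documentation.

```r
# Made-up codes for illustration only.
sex_codes <- c(2, 1, 1, 1, 2, 1)

# Assumed mapping: 1 = Male, 2 = Female.
sex_f <- factor(sex_codes, levels = c(1, 2), labels = c("Male", "Female"))

# Counts per labelled level.
table(sex_f)
```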
After reviewing the summary statistics and the description of each variable, we visualize the variables.
hist(mba.df$age,col=c("green"),main="Age Plot vs Frequency Count",xlab="Age",breaks = 6)
As seen above, most ages fall between 25 and 30 years, which is consistent with the median of 27.
##### Gender Plot
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
mba.df$sex <- as.factor(mba.df$sex)
ggplot(mba.df, aes(x = age)) + geom_histogram() + facet_wrap(~sex)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The plot above shows the age distribution in mba.df, with one panel per sex.
hist(mba.df$gmat_tot,xlab="Marks Total in GMAT",main="Total Marks in GMAT vs Frequency Count",col=c("pink"))
hist(mba.df$work_yrs,xlab="Work Years",main="Work Years in Company",col=c("orange"))
We note from the above that most students have only a few years of work experience, with a median of 3 years and a mean of 3.87 years.
plot(mba.df$quarter)
##### Satisfaction from MBA
The following bar plot shows the satisfaction level of the MBA students, excluding the 998 code.
barplot(prop.table(table(mba.df$satis[mba.df$satis!=998])),col=c("azure"))
We see from the above that many students reported high satisfaction (a rating of 6).
boxplot(salary~sex,data=mba.df,xlab="Salary",ylab="Gender",main="Comparison of Salaries of Males And Females",horizontal=TRUE,names=c("Males","Females"),col=c("blue","pink"))
This plot is not reliable. The 0, 998 and 999 codes are still in the data, so the median line sits near zero (the overall median is 999) and the boxes mostly reflect who did not report a salary. Note also that the labels must follow the order of the sex codes 1 and 2: the two-way table later in this report shows that code 2 is the smaller group, the women, so the labels read Males, then Females.
After removing the rows without a salary, we redraw the boxplot for the placed students only:
boxplot(salary~sex,data=placed.mba,xlab="Salary",ylab="Gender",main="Comparison of Salaries of Males And Females",horizontal=TRUE,names=c("Males","Females"),col=c("blue","pink"))
This boxplot is more informative. With the labels in the order of the codes, the men have the higher typical salary. The women's box shows a few high-salary outliers, but most of them earn less. This is in line with the negative coefficient on sex in the regression on placed students later in this report.
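A boxplot comparison can be cross-checked numerically with group medians. The sketch below uses made-up salaries and group labels, so its numbers say nothing about the MBA data.

```r
# Made-up data for illustration only.
toy <- data.frame(
  group  = c("A", "A", "A", "B", "B", "B"),
  salary = c(90000, 100000, 150000, 95000, 98000, 101000)
)

# Median salary within each group.
med <- tapply(toy$salary, toy$group, median)
med
```

With the real data, the equivalent call would be tapply(placed.mba$salary, placed.mba$sex, median).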
Now we come to the scatter plot matrix.
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(formula = ~ age + gmat_tot +s_avg +f_avg + work_yrs +frstlang, cex=1,
data=mba.df,diagonal="histogram")
Next we compute the correlations among the variables. For this, all columns must be numeric or integer, so we convert the factors back to numbers.
mba.df$sex <- as.numeric(mba.df$sex)
mba.df$frstlang <- as.numeric(mba.df$frstlang)
str(mba.df)
## 'data.frame': 274 obs. of 13 variables:
## $ age : int 23 24 24 24 24 24 25 25 25 25 ...
## $ sex : num 2 1 1 1 2 1 1 2 1 1 ...
## $ gmat_tot: int 620 610 670 570 710 640 610 650 630 680 ...
## $ gmat_qpc: int 77 90 99 56 93 82 89 88 79 99 ...
## $ gmat_vpc: int 87 71 78 81 98 89 74 89 91 81 ...
## $ gmat_tpc: int 87 87 95 75 98 91 87 92 89 96 ...
## $ s_avg : num 3.4 3.5 3.3 3.3 3.6 3.9 3.4 3.3 3.3 3.45 ...
## $ f_avg : num 3 4 3.25 2.67 3.75 3.75 3.5 3.75 3.25 3.67 ...
## $ quarter : int 1 1 1 1 1 1 1 1 1 1 ...
## $ work_yrs: int 2 2 2 1 2 2 2 2 2 2 ...
## $ frstlang: num 1 1 1 1 1 1 1 1 2 1 ...
## $ salary : int 0 0 0 0 999 0 0 0 999 998 ...
## $ satis : int 7 6 6 7 5 6 5 6 4 998 ...
All columns are now of type int or num.
library(corrgram)
round(cor(mba.df[sapply(mba.df, function(x) !is.factor(x))]),2)
## age sex gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg
## age 1.00 -0.03 -0.15 -0.22 -0.04 -0.17 0.15 -0.02
## sex -0.03 1.00 -0.05 -0.16 0.07 -0.01 0.13 0.09
## gmat_tot -0.15 -0.05 1.00 0.72 0.75 0.85 0.11 0.10
## gmat_qpc -0.22 -0.16 0.72 1.00 0.15 0.65 -0.03 0.07
## gmat_vpc -0.04 0.07 0.75 0.15 1.00 0.67 0.20 0.08
## gmat_tpc -0.17 -0.01 0.85 0.65 0.67 1.00 0.12 0.08
## s_avg 0.15 0.13 0.11 -0.03 0.20 0.12 1.00 0.55
## f_avg -0.02 0.09 0.10 0.07 0.08 0.08 0.55 1.00
## quarter -0.05 -0.13 -0.09 0.04 -0.17 -0.08 -0.76 -0.45
## work_yrs 0.86 -0.01 -0.18 -0.24 -0.07 -0.17 0.13 -0.04
## frstlang 0.06 0.00 -0.14 0.14 -0.39 -0.10 -0.14 -0.04
## salary -0.06 0.07 -0.05 -0.04 -0.01 0.00 0.15 0.03
## satis -0.13 -0.05 0.08 0.06 0.06 0.09 -0.03 0.01
## quarter work_yrs frstlang salary satis
## age -0.05 0.86 0.06 -0.06 -0.13
## sex -0.13 -0.01 0.00 0.07 -0.05
## gmat_tot -0.09 -0.18 -0.14 -0.05 0.08
## gmat_qpc 0.04 -0.24 0.14 -0.04 0.06
## gmat_vpc -0.17 -0.07 -0.39 -0.01 0.06
## gmat_tpc -0.08 -0.17 -0.10 0.00 0.09
## s_avg -0.76 0.13 -0.14 0.15 -0.03
## f_avg -0.45 -0.04 -0.04 0.03 0.01
## quarter 1.00 -0.09 0.10 -0.16 0.00
## work_yrs -0.09 1.00 -0.03 0.01 -0.11
## frstlang 0.10 -0.03 1.00 -0.09 0.08
## salary -0.16 0.01 -0.09 1.00 -0.34
## satis 0.00 -0.11 0.08 -0.34 1.00
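The matrix above is a correlation matrix. The strongest relationships are between age and work_yrs (0.86), between gmat_tot and gmat_tpc (0.85), and between s_avg and quarter (-0.76).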
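Because the matrix above is computed on mba.df, the 998 and 999 codes in salary and satis enter the correlations as if they were real values. One possible way around this, sketched here on made-up data, is to recode them as NA and let cor() use pairwise complete observations.

```r
# Made-up data for illustration only; 998 and 999 are assumed to be codes.
toy <- data.frame(
  age    = c(24, 26, 28, 30, 32, 27),
  salary = c(90000, 95000, 999, 110000, 120000, 998)
)

# Correlation with the codes treated as real salaries.
r_coded <- cor(toy$age, toy$salary)

# Recode the codes as NA, then correlate over complete pairs only.
toy$salary[toy$salary %in% c(998, 999)] <- NA
r_clean <- cor(toy$age, toy$salary, use = "pairwise.complete.obs")

round(c(coded = r_coded, clean = r_clean), 2)
```

In this toy example the codes pull the correlation down sharply, which is why the salary and satis rows of the matrix above should be read with care.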
library(corrplot)
## corrplot 0.84 loaded
corrgram(mba.df,lower.panel = panel.shade,upper.panel = panel.pie,text.panel = panel.txt)
The corrgram above visualizes the same correlations.
We now create a new subset, answered_survey. It is like placed.mba, except that it also keeps the students who were not placed (salary equal to 0).
answered_survey<-mba.df[which(mba.df$salary!=999 & mba.df$salary!=998),]
all_jobs<-ifelse(answered_survey$salary>0,"Placed","Unplaced")
mytable<-table(all_jobs)
addmargins(mytable)
## all_jobs
## Placed Unplaced Sum
## 103 90 193
As we see, the number of placed students is 103, which matches the number of rows in placed.mba.
To relate gender to placement, we build a two-way table:
Gender_Placed<-xtabs(~all_jobs+answered_survey$sex)
addmargins(Gender_Placed)
## answered_survey$sex
## all_jobs 1 2 Sum
## Placed 72 31 103
## Unplaced 67 23 90
## Sum 139 54 193
Thus we see a table of Gender vs Placed.
Inferences:
1. Among the survey respondents, the number of women (54) is less than half the number of men (139).
2. The placement rate of both genders is just above 50%.
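The placement rates behind the second inference can be read off with row-wise proportions. The sketch rebuilds the table from the counts printed above; the labels Male (code 1) and Female (code 2) are an assumption about the coding, consistent with the first inference.

```r
# Counts copied from the two-way table above (rows: sex code, columns: status).
counts <- matrix(c(72, 67,
                   31, 23),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(sex = c("Male", "Female"),   # assumed labels for codes 1 and 2
                                 status = c("Placed", "Unplaced")))

# Row-wise proportions: each row sums to 1.
rates <- prop.table(counts, margin = 1)
round(rates, 3)
```

This gives a placement rate of roughly 51.8% for code 1 and 57.4% for code 2.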
Now we run the chi square test.
Consider the null hypothesis that whether a student was placed is independent of the student's gender.
chisq.test(Gender_Placed)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: Gender_Placed
## X-squared = 0.29208, df = 1, p-value = 0.5889
Since the p-value (0.5889) is greater than 0.05, we fail to reject the null hypothesis: the data give no evidence that placement depends on sex.
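The test statistic can be reproduced by hand from the counts in the two-way table, which shows what the Yates correction does. This is a sketch of the standard formula for a 2x2 table, not a replacement for chisq.test().

```r
# Observed counts copied from the Gender_Placed table above.
obs <- matrix(c(72, 31,
                67, 23),
              nrow = 2, byrow = TRUE,
              dimnames = list(all_jobs = c("Placed", "Unplaced"),
                              sex = c("1", "2")))

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Yates' correction subtracts 0.5 from each absolute deviation.
x2 <- sum((abs(obs - expected) - 0.5)^2 / expected)

# Upper-tail probability of a chi-squared distribution with 1 degree of freedom.
p <- pchisq(x2, df = 1, lower.tail = FALSE)

round(c(X.squared = x2, p.value = p), 4)
```

The observed counts differ from the expected ones by only about two students per cell, which is why the statistic is so small.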
We now look for what influenced the starting salaries. The first two models are fit on answered_survey, which still includes unplaced students with a salary of 0; the later models use only the placed students.
Model1=salary~age+sex+gmat_tot+s_avg+f_avg+quarter+work_yrs+frstlang+satis
fit<-lm(Model1,data=answered_survey)
summary(fit)
##
## Call:
## lm(formula = Model1, data = answered_survey)
##
## Residuals:
## Min 1Q Median 3Q Max
## -92921 -49638 19272 44437 179275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98967.61 83396.78 1.187 0.2369
## age -4026.66 1910.89 -2.107 0.0365 *
## sex 867.55 8430.95 0.103 0.9182
## gmat_tot -25.44 69.39 -0.367 0.7143
## s_avg 8965.84 16466.66 0.544 0.5868
## f_avg -5706.10 8643.34 -0.660 0.5100
## quarter -6765.75 5115.92 -1.322 0.1877
## work_yrs 2759.42 2182.01 1.265 0.2076
## frstlang 14798.88 14623.90 1.012 0.3129
## satis 10526.84 5014.72 2.099 0.0372 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52150 on 183 degrees of freedom
## Multiple R-squared: 0.08241, Adjusted R-squared: 0.03728
## F-statistic: 1.826 on 9 and 183 DF, p-value: 0.06612
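As we see, only age and satis are significant at the 5% level, and the model as a whole explains little (adjusted R-squared 0.037, F-test p-value 0.066). So we discard this model.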
regression1 <- lm(salary~age+sex+work_yrs,data = answered_survey)
summary(regression1)
##
## Call:
## lm(formula = salary ~ age + sex + work_yrs, data = answered_survey)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68650 -54709 26126 42319 179010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 159190 45205 3.522 0.000538 ***
## age -4388 1841 -2.384 0.018125 *
## sex 1581 8456 0.187 0.851846
## work_yrs 3610 2104 1.716 0.087842 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52700 on 189 degrees of freedom
## Multiple R-squared: 0.03223, Adjusted R-squared: 0.01687
## F-statistic: 2.098 on 3 and 189 DF, p-value: 0.1019
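Here, only age is significant at the 5% level (p = 0.018), and the overall F-test is not significant (p = 0.10). So we try other columns, this time on the placed students only.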
regression2 <- lm(salary~gmat_tot+sex+gmat_tot+frstlang+quarter,data = placed.mba)
summary(regression2)
##
## Call:
## lm(formula = salary ~ gmat_tot + sex + gmat_tot + frstlang +
## quarter, data = placed.mba)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29089 -8925 -2319 4937 104804
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 113231.52 23588.95 4.800 5.69e-06 ***
## gmat_tot -26.24 33.49 -0.783 0.43524
## sex -7492.26 3646.75 -2.055 0.04259 *
## frstlang 20533.14 6733.09 3.050 0.00295 **
## quarter -2749.72 1512.20 -1.818 0.07206 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16920 on 98 degrees of freedom
## Multiple R-squared: 0.1386, Adjusted R-squared: 0.1035
## F-statistic: 3.943 on 4 and 98 DF, p-value: 0.0052
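Here, frstlang (p = 0.003) and sex (p = 0.043) are significant at the 5% level, and the model as a whole is significant (F-test p-value 0.0052), although it explains only about 10% of the variance (adjusted R-squared 0.1035). Note that gmat_tot appears twice in the formula; R keeps a single term, as the coefficient table shows.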
regression3 <- lm(salary~age+work_yrs,data = placed.mba)
summary(regression3)
##
## Call:
## lm(formula = salary ~ age + work_yrs, data = placed.mba)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31675 -8099 -2108 4411 80650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36967.5 23323.8 1.585 0.1161
## age 2413.8 997.4 2.420 0.0173 *
## work_yrs 388.8 1084.0 0.359 0.7206
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15620 on 100 degrees of freedom
## Multiple R-squared: 0.2506, Adjusted R-squared: 0.2356
## F-statistic: 16.72 on 2 and 100 DF, p-value: 5.438e-07
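In this last regression, age is significant at the 5% level (p = 0.017) while work_yrs is not (p = 0.72), which is plausible given the strong correlation between the two (0.86 in the correlation matrix above). The model as a whole is significant (F-test p-value 5.4e-07) and has the highest adjusted R-squared of the models tried (0.2356).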
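The summary figures reported for a linear model are linked by simple formulas, which can serve as a sanity check. The sketch recomputes the adjusted R-squared and the F-test p-value of the last model from its printed R-squared, F statistic and degrees of freedom.

```r
# Values copied from the summary of regression3 above.
r2  <- 0.2506   # multiple R-squared
f   <- 16.72    # F statistic
df1 <- 2        # number of predictors
df2 <- 100      # residual degrees of freedom
n   <- df1 + df2 + 1   # number of observations

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df2

# Upper-tail probability of the F distribution gives the overall p-value.
p_f <- pf(f, df1, df2, lower.tail = FALSE)

round(adj_r2, 4)
signif(p_f, 3)
```

Both values agree with the printed summary up to the rounding of the inputs.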
This concludes our report: summary statistics, visualizations, a chi-squared test and regressions on the MBA starting salaries data.