Analysis of the Dataset

In this task, we analyze the MBA Starting Salaries dataset.

Reading The Dataset

First, we download the dataset and read the file.

mba.df <- read.csv("G:/R Intern/MBA Starting Salaries Data.csv")
View(mba.df)
dim(mba.df)

## [1] 274  13

Creating Summary Statistics

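For this, we must first clean the data. Salaries coded 998 or 999 are placeholders for undisclosed values, and a salary of 0 marks a student who did not get a salary, so we keep only the rows with a real salary.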

placed.mba<-mba.df[which(mba.df$salary!=999 & mba.df$salary!=998 & mba.df$salary!=0),]
View(placed.mba)
dim(placed.mba)

## [1] 103  13

This is the data of all students who have given their salary details. As we see, only 103 of the 274 students are in this subset.
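An alternative to filtering rows is to recode the placeholder values as NA, so that later summaries cannot mistake them for real numbers. The sketch below uses a small invented data frame (`toy` and its values are assumptions for illustration, not rows from the MBA data).

```r
# Toy data standing in for mba.df; the values are invented for illustration.
toy <- data.frame(salary = c(0, 999, 85000, 998, 120000),
                  satis  = c(6, 5, 7, 998, 6))

# Treat the placeholder codes as missing rather than as real numbers.
toy$salary[toy$salary %in% c(998, 999)] <- NA
toy$satis[toy$satis == 998] <- NA

# Students with a recorded, positive salary (the analogue of placed.mba).
placed <- toy[!is.na(toy$salary) & toy$salary > 0, ]

mean(placed$salary)            # 102500
mean(toy$satis, na.rm = TRUE)  # 6
```

With the codes stored as NA, functions such as mean() and summary() either skip them (with na.rm = TRUE) or report them explicitly.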

summary(mba.df)

##       age             sex           gmat_tot        gmat_qpc    
##  Min.   :22.00   Min.   :1.000   Min.   :450.0   Min.   :28.00  
##  1st Qu.:25.00   1st Qu.:1.000   1st Qu.:580.0   1st Qu.:72.00  
##  Median :27.00   Median :1.000   Median :620.0   Median :83.00  
##  Mean   :27.36   Mean   :1.248   Mean   :619.5   Mean   :80.64  
##  3rd Qu.:29.00   3rd Qu.:1.000   3rd Qu.:660.0   3rd Qu.:93.00  
##  Max.   :48.00   Max.   :2.000   Max.   :790.0   Max.   :99.00  
##     gmat_vpc        gmat_tpc        s_avg           f_avg      
##  Min.   :16.00   Min.   : 0.0   Min.   :2.000   Min.   :0.000  
##  1st Qu.:71.00   1st Qu.:78.0   1st Qu.:2.708   1st Qu.:2.750  
##  Median :81.00   Median :87.0   Median :3.000   Median :3.000  
##  Mean   :78.32   Mean   :84.2   Mean   :3.025   Mean   :3.062  
##  3rd Qu.:91.00   3rd Qu.:94.0   3rd Qu.:3.300   3rd Qu.:3.250  
##  Max.   :99.00   Max.   :99.0   Max.   :4.000   Max.   :4.000  
##     quarter         work_yrs         frstlang         salary      
##  Min.   :1.000   Min.   : 0.000   Min.   :1.000   Min.   :     0  
##  1st Qu.:1.250   1st Qu.: 2.000   1st Qu.:1.000   1st Qu.:     0  
##  Median :2.000   Median : 3.000   Median :1.000   Median :   999  
##  Mean   :2.478   Mean   : 3.872   Mean   :1.117   Mean   : 39026  
##  3rd Qu.:3.000   3rd Qu.: 4.000   3rd Qu.:1.000   3rd Qu.: 97000  
##  Max.   :4.000   Max.   :22.000   Max.   :2.000   Max.   :220000  
##      satis      
##  Min.   :  1.0  
##  1st Qu.:  5.0  
##  Median :  6.0  
##  Mean   :172.2  
##  3rd Qu.:  7.0  
##  Max.   :998.0

library(psych)
describe(mba.df)

##          vars   n     mean       sd median  trimmed     mad min    max
## age         1 274    27.36     3.71     27    26.76    2.97  22     48
## sex         2 274     1.25     0.43      1     1.19    0.00   1      2
## gmat_tot    3 274   619.45    57.54    620   618.86   59.30 450    790
## gmat_qpc    4 274    80.64    14.87     83    82.31   14.83  28     99
## gmat_vpc    5 274    78.32    16.86     81    80.33   14.83  16     99
## gmat_tpc    6 274    84.20    14.02     87    86.12   11.86   0     99
## s_avg       7 274     3.03     0.38      3     3.03    0.44   2      4
## f_avg       8 274     3.06     0.53      3     3.09    0.37   0      4
## quarter     9 274     2.48     1.11      2     2.47    1.48   1      4
## work_yrs   10 274     3.87     3.23      3     3.29    1.48   0     22
## frstlang   11 274     1.12     0.32      1     1.02    0.00   1      2
## salary     12 274 39025.69 50951.56    999 33607.86 1481.12   0 220000
## satis      13 274   172.18   371.61      6    91.50    1.48   1    998
##           range  skew kurtosis      se
## age          26  2.16     6.45    0.22
## sex           1  1.16    -0.66    0.03
## gmat_tot    340 -0.01     0.06    3.48
## gmat_qpc     71 -0.92     0.30    0.90
## gmat_vpc     83 -1.04     0.74    1.02
## gmat_tpc     99 -2.28     9.02    0.85
## s_avg         2 -0.06    -0.38    0.02
## f_avg         4 -2.08    10.85    0.03
## quarter       3  0.02    -1.35    0.07
## work_yrs     22  2.78     9.80    0.20
## frstlang      1  2.37     3.65    0.02
## salary   220000  0.70    -1.05 3078.10
## satis       997  1.77     1.13   22.45

As we see in the salary column, the median is 999 because many students did not disclose their salary. We therefore repeat the summary on the cleaned dataset.

summary(placed.mba$salary)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   64000   95000  100000  103031  106000  220000

As we see above, the salaries of placed students start at 64000, have a mean of about 103000, and a median of 100000.

str(mba.df)

## 'data.frame':    274 obs. of  13 variables:
##  $ age     : int  23 24 24 24 24 24 25 25 25 25 ...
##  $ sex     : int  2 1 1 1 2 1 1 2 1 1 ...
##  $ gmat_tot: int  620 610 670 570 710 640 610 650 630 680 ...
##  $ gmat_qpc: int  77 90 99 56 93 82 89 88 79 99 ...
##  $ gmat_vpc: int  87 71 78 81 98 89 74 89 91 81 ...
##  $ gmat_tpc: int  87 87 95 75 98 91 87 92 89 96 ...
##  $ s_avg   : num  3.4 3.5 3.3 3.3 3.6 3.9 3.4 3.3 3.3 3.45 ...
##  $ f_avg   : num  3 4 3.25 2.67 3.75 3.75 3.5 3.75 3.25 3.67 ...
##  $ quarter : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ work_yrs: int  2 2 2 1 2 2 2 2 2 2 ...
##  $ frstlang: int  1 1 1 1 1 1 1 1 2 1 ...
##  $ salary  : int  0 0 0 0 999 0 0 0 999 998 ...
##  $ satis   : int  7 6 6 7 5 6 5 6 4 998 ...

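mean(mba.df$salary[mba.df$salary > 0 & mba.df$salary != 998 & mba.df$salary != 999])

This is the mean salary of the placed students, about 103031, the same value reported by summary(placed.mba$salary) above.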
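Row filters of this kind rely on the element-wise operator `&`; the `&&` operator is meant for single TRUE/FALSE values. A minimal sketch on an invented vector (the values are assumptions for illustration) also shows that a condition no element can satisfy selects nothing, and the mean of an empty selection is NaN.

```r
# Invented salary vector for illustration only.
salary <- c(0, 999, 85000, 998, 120000)

# `&` compares element by element, so it yields one TRUE/FALSE per entry.
keep <- salary > 0 & salary != 998 & salary != 999
keep                  # FALSE FALSE  TRUE FALSE  TRUE

mean(salary[keep])    # 102500

# No value is both below 997 and above 999, so nothing is selected
# and the mean of the empty vector is NaN.
mean(salary[salary < 997 & salary > 999])   # NaN
```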

Here, we see that categorical variables such as sex, quarter, frstlang and satis are stored as integers, not factors. We convert sex and frstlang to factors.

mba.df$sex <- factor(mba.df$sex)
mba.df$frstlang <- factor(mba.df$frstlang)
str(mba.df)

## 'data.frame':    274 obs. of  13 variables:
##  $ age     : int  23 24 24 24 24 24 25 25 25 25 ...
##  $ sex     : Factor w/ 2 levels "1","2": 2 1 1 1 2 1 1 2 1 1 ...
##  $ gmat_tot: int  620 610 670 570 710 640 610 650 630 680 ...
##  $ gmat_qpc: int  77 90 99 56 93 82 89 88 79 99 ...
##  $ gmat_vpc: int  87 71 78 81 98 89 74 89 91 81 ...
##  $ gmat_tpc: int  87 87 95 75 98 91 87 92 89 96 ...
##  $ s_avg   : num  3.4 3.5 3.3 3.3 3.6 3.9 3.4 3.3 3.3 3.45 ...
##  $ f_avg   : num  3 4 3.25 2.67 3.75 3.75 3.5 3.75 3.25 3.67 ...
##  $ quarter : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ work_yrs: int  2 2 2 1 2 2 2 2 2 2 ...
##  $ frstlang: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 2 1 ...
##  $ salary  : int  0 0 0 0 999 0 0 0 999 998 ...
##  $ satis   : int  7 6 6 7 5 6 5 6 4 998 ...
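Factors are easier to read in tables and plots when their levels carry labels. The sketch below shows the idea on an invented vector; the label meanings (1 = "Male", 2 = "Female") are an assumption for illustration and should be checked against the dataset's codebook.

```r
# Invented codes for illustration; the label meanings are an assumption.
sex_code <- c(2, 1, 1, 1, 2)

# levels gives the codes in the data, labels gives the names to display.
sex <- factor(sex_code, levels = c(1, 2), labels = c("Male", "Female"))

levels(sex)   # "Male"   "Female"
table(sex)    # counts per labelled level
```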

Visualizing the Dataset

After seeing the Summary Statistics and description of each variable, we draw visualizations of each.

Age Plot

hist(mba.df$age,col=c("green"),main="Age Plot vs Frequency Count",xlab="Age",breaks = 6)

As seen above, most students are between 25 and 30 years old, consistent with the median age of 27.

Gender Plot

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

mba.df$sex <- as.factor(mba.df$sex)
ggplot(mba.df, aes(x = age)) + geom_histogram() + facet_wrap(~sex)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The plot above shows the age distribution separately for each sex in mba.df.

GMAT MARKS Plot

hist(mba.df$gmat_tot,xlab="Marks Total in GMAT",main="Total Marks in GMAT vs Frequency Count",col=c("pink"))

Work Years

hist(mba.df$work_yrs,xlab="Work Years",main="Work Years in Company",col=c("orange"))

We note from the above that most students have between 2 and 4 years of work experience (median 3, mean 3.87), with a long right tail reaching 22 years.

Next, we look at the quartile ranking of the students.

plot(mba.df$quarter)

Satisfaction from MBA

The following bar plot shows the satisfaction levels of the MBA students, leaving out the placeholder code 998.

barplot(prop.table(table(mba.df$satis[mba.df$satis!=998])),col=c("azure"))

We see from the above that many students reported high satisfaction (a rating of 6).

Boxplots For Analyzing the Gender Gap

boxplot(salary~sex,data=mba.df,xlab="Salary",ylab="Gender",main="Comparison of Salaries of Males And Females",horizontal=TRUE,names=c("Males","Females"),col=c("blue","pink"))

The labels assume that sex is coded 1 = male and 2 = female, the coding implied by the gender table later in this report, where the smaller group (sex = 2) is identified as women. This plot is not reliable for judging a gender salary gap: the full dataset still contains the codes 0, 998 and 999, so the overall median salary is 999 and the median lines sit near zero.

Refined Boxplot

After removing the students with a salary of 0 or an undisclosed salary, we redraw the boxplot for the placed students only:

boxplot(salary~sex,data=placed.mba,xlab="Salary",ylab="Gender",main="Comparison of Salaries of Males And Females",horizontal=TRUE,names=c("Males","Females"),col=c("blue","pink"))

This boxplot is more informative. With sex coded 1 = male and 2 = female (the coding implied by the gender table later in this report), the women have a few high-salary outliers, but most of them earn less than the men. This agrees with the negative coefficient on sex in the regression on placed.mba further below.

Scatter Plot Matrix

Now we come to Scatter Plot Matrix

library(car)

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

scatterplotMatrix(formula = ~ age + gmat_tot +s_avg +f_avg + work_yrs +frstlang, cex=1,
                       data=mba.df,diagonal="histogram")

Correlation

Next we compute the correlations among the variables. For this, every column must be numeric or integer, so we convert the factors back.

mba.df$sex <- as.numeric(mba.df$sex)
mba.df$frstlang <- as.numeric(mba.df$frstlang)
str(mba.df)

## 'data.frame':    274 obs. of  13 variables:
##  $ age     : int  23 24 24 24 24 24 25 25 25 25 ...
##  $ sex     : num  2 1 1 1 2 1 1 2 1 1 ...
##  $ gmat_tot: int  620 610 670 570 710 640 610 650 630 680 ...
##  $ gmat_qpc: int  77 90 99 56 93 82 89 88 79 99 ...
##  $ gmat_vpc: int  87 71 78 81 98 89 74 89 91 81 ...
##  $ gmat_tpc: int  87 87 95 75 98 91 87 92 89 96 ...
##  $ s_avg   : num  3.4 3.5 3.3 3.3 3.6 3.9 3.4 3.3 3.3 3.45 ...
##  $ f_avg   : num  3 4 3.25 2.67 3.75 3.75 3.5 3.75 3.25 3.67 ...
##  $ quarter : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ work_yrs: int  2 2 2 1 2 2 2 2 2 2 ...
##  $ frstlang: num  1 1 1 1 1 1 1 1 2 1 ...
##  $ salary  : int  0 0 0 0 999 0 0 0 999 998 ...
##  $ satis   : int  7 6 6 7 5 6 5 6 4 998 ...

All columns are now of type int or num.
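One caveat when converting factors: as.numeric() on a factor returns the position of each level, not the original value. It gives the expected result here only because the levels are "1" and "2". A minimal sketch on an invented factor shows the difference and the safer conversion through as.character().

```r
# Invented factor whose levels are not 1, 2, 3.
f <- factor(c(10, 20, 10, 30))

as.numeric(f)                 # 1 2 1 3  (level positions, not the values)
as.numeric(as.character(f))   # 10 20 10 30  (the original values)
```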

Correlating

library(corrgram)
round(cor(mba.df[sapply(mba.df, function(x) !is.factor(x))]),2)

##            age   sex gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg
## age       1.00 -0.03    -0.15    -0.22    -0.04    -0.17  0.15 -0.02
## sex      -0.03  1.00    -0.05    -0.16     0.07    -0.01  0.13  0.09
## gmat_tot -0.15 -0.05     1.00     0.72     0.75     0.85  0.11  0.10
## gmat_qpc -0.22 -0.16     0.72     1.00     0.15     0.65 -0.03  0.07
## gmat_vpc -0.04  0.07     0.75     0.15     1.00     0.67  0.20  0.08
## gmat_tpc -0.17 -0.01     0.85     0.65     0.67     1.00  0.12  0.08
## s_avg     0.15  0.13     0.11    -0.03     0.20     0.12  1.00  0.55
## f_avg    -0.02  0.09     0.10     0.07     0.08     0.08  0.55  1.00
## quarter  -0.05 -0.13    -0.09     0.04    -0.17    -0.08 -0.76 -0.45
## work_yrs  0.86 -0.01    -0.18    -0.24    -0.07    -0.17  0.13 -0.04
## frstlang  0.06  0.00    -0.14     0.14    -0.39    -0.10 -0.14 -0.04
## salary   -0.06  0.07    -0.05    -0.04    -0.01     0.00  0.15  0.03
## satis    -0.13 -0.05     0.08     0.06     0.06     0.09 -0.03  0.01
##          quarter work_yrs frstlang salary satis
## age        -0.05     0.86     0.06  -0.06 -0.13
## sex        -0.13    -0.01     0.00   0.07 -0.05
## gmat_tot   -0.09    -0.18    -0.14  -0.05  0.08
## gmat_qpc    0.04    -0.24     0.14  -0.04  0.06
## gmat_vpc   -0.17    -0.07    -0.39  -0.01  0.06
## gmat_tpc   -0.08    -0.17    -0.10   0.00  0.09
## s_avg      -0.76     0.13    -0.14   0.15 -0.03
## f_avg      -0.45    -0.04    -0.04   0.03  0.01
## quarter     1.00    -0.09     0.10  -0.16  0.00
## work_yrs   -0.09     1.00    -0.03   0.01 -0.11
## frstlang    0.10    -0.03     1.00  -0.09  0.08
## salary     -0.16     0.01    -0.09   1.00 -0.34
## satis       0.00    -0.11     0.08  -0.34  1.00

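The table above is the correlation matrix of all variables. The strongest relationships are between age and work_yrs (0.86), among the GMAT scores, and between s_avg and quarter (-0.76). Because salary still contains the codes 0, 998 and 999 and satis the code 998, their correlations here should be read with caution.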

library(corrplot)

## corrplot 0.84 loaded

corrgram(mba.df,lower.panel = panel.shade,upper.panel = panel.pie,text.panel = panel.txt)

The corrgram above displays the same correlations graphically.

Creating a new Subset.

We now create a new subset, answered_survey. It is built like placed.mba, except that it also keeps the students who did not get a salary (salary equal to 0).

answered_survey<-mba.df[which(mba.df$salary!=999 & mba.df$salary!=998),]
all_jobs<-ifelse(answered_survey$salary>0,"Placed","Unplaced")
mytable<-table(all_jobs)
addmargins(mytable)

## all_jobs
##   Placed Unplaced      Sum 
##      103       90      193

As we see, the number of placed students is 103, which matches the number of rows in placed.mba.

Next, we build a two-way table of placement status by gender:

Gender_Placed<-xtabs(~all_jobs+answered_survey$sex)
addmargins(Gender_Placed)

##           answered_survey$sex
## all_jobs     1   2 Sum
##   Placed    72  31 103
##   Unplaced  67  23  90
##   Sum      139  54 193

Thus we see a table of Gender vs Placed.

Inferences:

1. The women (sex = 2) number 54, fewer than half of the 139 men.
2. The placement rate is just above 50% for both genders: 72 of 139 men (about 52%) and 31 of 54 women (about 57%).
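The placement rates can be read off the table with prop.table(). The sketch below rebuilds the table by hand from the counts printed above (the object name `counts` is introduced here for illustration) and computes the proportions within each sex.

```r
# Counts copied from the Gender_Placed table above; columns are sex codes 1 and 2.
counts <- matrix(c(72, 67, 31, 23), nrow = 2,
                 dimnames = list(all_jobs = c("Placed", "Unplaced"),
                                 sex = c("1", "2")))

# margin = 2 divides each column by its total, giving rates within each sex.
rates <- round(prop.table(counts, margin = 2), 3)
rates
# Placed row: 0.518 for sex 1 and 0.574 for sex 2
```

On the real data, prop.table(Gender_Placed, margin = 2) would give the same proportions directly.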

Now we run the chi square test.

The null hypothesis is that whether a student was placed is independent of the student's gender.

chisq.test(Gender_Placed)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  Gender_Placed
## X-squared = 0.29208, df = 1, p-value = 0.5889

Since the p-value (0.5889) is greater than 0.05, we fail to reject the null hypothesis: there is no evidence that placement depends on gender.
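To see what the test compares, it helps to look at the counts expected under independence. The sketch below rebuilds the 2 x 2 table from the counts printed above (the object names are introduced here for illustration) and reruns the test with base R's chisq.test().

```r
# Counts copied from the Gender_Placed table above.
counts <- matrix(c(72, 67, 31, 23), nrow = 2,
                 dimnames = list(all_jobs = c("Placed", "Unplaced"),
                                 sex = c("1", "2")))

# For a 2 x 2 table, chisq.test() applies Yates' continuity correction by default.
res <- chisq.test(counts)

res$expected    # counts expected if placement and sex were independent
res$statistic   # about 0.292, as in the output above
res$p.value     # about 0.589, as in the output above
```

The observed counts differ from the expected ones by only about two students per cell, which is why the statistic is small and the p-value large.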

Testing the Starting Salary

We now look for the variables that influenced the starting salaries. The first two models below are fit on answered_survey, which still includes unplaced students with a salary of 0; the last two use placed.mba, the placed students only.

Model1=salary~age+sex+gmat_tot+s_avg+f_avg+quarter+work_yrs+frstlang+satis
fit<-lm(Model1,data=answered_survey)
summary(fit)

## 
## Call:
## lm(formula = Model1, data = answered_survey)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -92921 -49638  19272  44437 179275 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 98967.61   83396.78   1.187   0.2369  
## age         -4026.66    1910.89  -2.107   0.0365 *
## sex           867.55    8430.95   0.103   0.9182  
## gmat_tot      -25.44      69.39  -0.367   0.7143  
## s_avg        8965.84   16466.66   0.544   0.5868  
## f_avg       -5706.10    8643.34  -0.660   0.5100  
## quarter     -6765.75    5115.92  -1.322   0.1877  
## work_yrs     2759.42    2182.01   1.265   0.2076  
## frstlang    14798.88   14623.90   1.012   0.3129  
## satis       10526.84    5014.72   2.099   0.0372 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52150 on 183 degrees of freedom
## Multiple R-squared:  0.08241,    Adjusted R-squared:  0.03728 
## F-statistic: 1.826 on 9 and 183 DF,  p-value: 0.06612

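As we see, age and satis are significant only at the 5% level, and the model as a whole is not significant (F-test p = 0.066, adjusted R-squared 0.037). So we discard this model.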

regression1 <- lm(salary~age+sex+work_yrs,data = answered_survey)
summary(regression1)

## 
## Call:
## lm(formula = salary ~ age + sex + work_yrs, data = answered_survey)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -68650 -54709  26126  42319 179010 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   159190      45205   3.522 0.000538 ***
## age            -4388       1841  -2.384 0.018125 *  
## sex             1581       8456   0.187 0.851846    
## work_yrs        3610       2104   1.716 0.087842 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52700 on 189 degrees of freedom
## Multiple R-squared:  0.03223,    Adjusted R-squared:  0.01687 
## F-statistic: 2.098 on 3 and 189 DF,  p-value: 0.1019

Here, age is significant only at the 5% level (p = 0.018), and the model as a whole is not significant (F-test p = 0.10). So, we try other columns.

regression2 <- lm(salary~gmat_tot+sex+gmat_tot+frstlang+quarter,data = placed.mba)
summary(regression2)

## 
## Call:
## lm(formula = salary ~ gmat_tot + sex + gmat_tot + frstlang + 
##     quarter, data = placed.mba)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -29089  -8925  -2319   4937 104804 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 113231.52   23588.95   4.800 5.69e-06 ***
## gmat_tot       -26.24      33.49  -0.783  0.43524    
## sex          -7492.26    3646.75  -2.055  0.04259 *  
## frstlang     20533.14    6733.09   3.050  0.00295 ** 
## quarter      -2749.72    1512.20  -1.818  0.07206 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16920 on 98 degrees of freedom
## Multiple R-squared:  0.1386, Adjusted R-squared:  0.1035 
## F-statistic: 3.943 on 4 and 98 DF,  p-value: 0.0052

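Here, frstlang (p = 0.003) and sex (p = 0.043) are significant at the 5% level, and the model as a whole is significant (F-test p = 0.0052), although it explains only about 14% of the variance in salary.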

regression3 <- lm(salary~age+work_yrs,data = placed.mba)
summary(regression3)

## 
## Call:
## lm(formula = salary ~ age + work_yrs, data = placed.mba)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31675  -8099  -2108   4411  80650 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  36967.5    23323.8   1.585   0.1161  
## age           2413.8      997.4   2.420   0.0173 *
## work_yrs       388.8     1084.0   0.359   0.7206  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15620 on 100 degrees of freedom
## Multiple R-squared:  0.2506, Adjusted R-squared:  0.2356 
## F-statistic: 16.72 on 2 and 100 DF,  p-value: 5.438e-07

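In this last regression, age is significant at the 5% level (p = 0.017) while work_yrs is not (p = 0.72). The two predictors are strongly correlated (0.86 in the correlation matrix above), so work_yrs adds little once age is in the model. This model explains about 25% of the variance in salary and is significant overall (F-test p = 5.4e-07).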
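Candidate models like these can be compared more systematically by their adjusted R-squared and, when one model is nested in the other, with anova(). The sketch below uses simulated data (not the MBA data; every value and the data-generating rule are assumptions for illustration) in which work_yrs closely tracks age.

```r
# Simulated data for illustration only; this is not the MBA dataset.
set.seed(1)
n <- 100
age <- round(runif(n, 22, 40))
work_yrs <- pmax(age - 23 + round(rnorm(n)), 0)   # closely tied to age
salary <- 60000 + 1500 * age + rnorm(n, sd = 10000)
sim <- data.frame(age, work_yrs, salary)

# Two nested candidate models.
m_small <- lm(salary ~ age, data = sim)
m_large <- lm(salary ~ age + work_yrs, data = sim)

# Adjusted R-squared penalizes extra predictors that add little.
summary(m_small)$adj.r.squared
summary(m_large)$adj.r.squared

# F-test of whether the extra predictor improves the fit.
anova(m_small, m_large)
```

When two predictors are strongly correlated, as age and work_yrs are here by construction, adding the second one usually changes the fit very little.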

This completes our analysis of the MBA Starting Salaries dataset: summary statistics, visualizations, correlations, a chi-square test and regression models.

This concludes our report.

MBAStartingSalaries

Anshuman Raina

23 January 2018