MBA.STARTING.ANALYSIS

##INTRODUCTION

#Every year, MBA programs across the country advertise for prospective students, promoting the academic excellence of their programs, the uniqueness of their offerings, the quality of their faculty and a variety of other factors. In a competitive market, the goal is to attract the brightest and the best.

#Many schools claim that graduates of their programs will earn large salaries upon graduation, a point that clearly ranks high among many students' decision-making criteria. In fact, the Financial Times' rating of MBA programs uses graduates' salaries as a large component of its rating system.

#Marie Daer, an aspiring MBA applicant, was very interested in the starting salaries of graduating students. Surprisingly, she was able to track down a dataset from a prominent - but anonymous - MBA school.

#Daer was able to learn the following about the data. Three months after graduation, the students in the class of 2012 were sent a survey. The survey asked about their satisfaction with the MBA program as well as their starting salary. The survey was not anonymous, and the responses of these students were added to the information already on file about them. These data included the graduates' age, sex, years of work experience, GMAT information, fall and spring MBA average, quartile ranking, and their native language.

#Daer was pleased to have located the data. She wondered whether it could answer some important questions that would help her decide whether to enroll in the MBA program at this particular school. In particular, she wondered about starting salaries, whether gender and/or age made a difference, and whether students liked this particular program. She also wondered whether her GMAT score made a difference in marks. Since her native language was not English, Daer had a relatively low GMAT.

#Dataset information


#Missing salary and satisfaction data are coded as follows:
#998 = did not answer the survey
#999 = answered the survey but did not disclose salary data
#Size of data set: 274 records
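
#A minimal sketch of how these codes could be turned into NA while the file is read, assuming a tiny made-up CSV (the three rows below are illustrative and are not taken from the real data). read.csv() accepts an na.strings argument for this.

```r
# Hypothetical three-row extract; the values are made up for illustration.
csv_text <- "age,salary,satis
25,0,6
27,999,5
26,998,998"

# Treat the survey codes 998 and 999 as missing in every column.
# Assumption: no genuine value in any column equals 998 or 999,
# which should be checked before using this on real data.
toy <- read.csv(text = csv_text, na.strings = c("NA", "998", "999"))

# Count the missing values per column.
colSums(is.na(toy))
```

#Recoding at read time loses the distinction between 998 (no response) and 999 (salary not disclosed), so it only fits when that distinction is not needed later.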

mba <- read.csv("MBA Starting Salaries Data.csv")
str(mba)

## 'data.frame':    274 obs. of  13 variables:
##  $ age     : int  23 24 24 24 24 24 25 25 25 25 ...
##  $ sex     : int  2 1 1 1 2 1 1 2 1 1 ...
##  $ gmat_tot: int  620 610 670 570 710 640 610 650 630 680 ...
##  $ gmat_qpc: int  77 90 99 56 93 82 89 88 79 99 ...
##  $ gmat_vpc: int  87 71 78 81 98 89 74 89 91 81 ...
##  $ gmat_tpc: int  87 87 95 75 98 91 87 92 89 96 ...
##  $ s_avg   : num  3.4 3.5 3.3 3.3 3.6 3.9 3.4 3.3 3.3 3.45 ...
##  $ f_avg   : num  3 4 3.25 2.67 3.75 3.75 3.5 3.75 3.25 3.67 ...
##  $ quarter : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ work_yrs: int  2 2 2 1 2 2 2 2 2 2 ...
##  $ frstlang: int  1 1 1 1 1 1 1 1 2 1 ...
##  $ salary  : int  0 0 0 0 999 0 0 0 999 998 ...
##  $ satis   : int  7 6 6 7 5 6 5 6 4 998 ...

View(mba)
dim(mba)

## [1] 274  13

#Data Preparation

mba$sex <- as.numeric(mba$sex)
mba$frstlang <- as.numeric(mba$frstlang)
mba$salary <- as.numeric(mba$salary)
mba$satis <- as.numeric(mba$satis)
 
str(mba)

## 'data.frame':    274 obs. of  13 variables:
##  $ age     : int  23 24 24 24 24 24 25 25 25 25 ...
##  $ sex     : num  2 1 1 1 2 1 1 2 1 1 ...
##  $ gmat_tot: int  620 610 670 570 710 640 610 650 630 680 ...
##  $ gmat_qpc: int  77 90 99 56 93 82 89 88 79 99 ...
##  $ gmat_vpc: int  87 71 78 81 98 89 74 89 91 81 ...
##  $ gmat_tpc: int  87 87 95 75 98 91 87 92 89 96 ...
##  $ s_avg   : num  3.4 3.5 3.3 3.3 3.6 3.9 3.4 3.3 3.3 3.45 ...
##  $ f_avg   : num  3 4 3.25 2.67 3.75 3.75 3.5 3.75 3.25 3.67 ...
##  $ quarter : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ work_yrs: int  2 2 2 1 2 2 2 2 2 2 ...
##  $ frstlang: num  1 1 1 1 1 1 1 1 2 1 ...
##  $ salary  : num  0 0 0 0 999 0 0 0 999 998 ...
##  $ satis   : num  7 6 6 7 5 6 5 6 4 998 ...

colSums(is.na(mba)) #No missing values as NA but there are missing values as 999, 998.

##      age      sex gmat_tot gmat_qpc gmat_vpc gmat_tpc    s_avg    f_avg 
##        0        0        0        0        0        0        0        0 
##  quarter work_yrs frstlang   salary    satis 
##        0        0        0        0        0

dim(mba) #[1] 274  13

## [1] 274  13

#Let's isolate the records of students who either did not answer the survey or did not disclose their salary, and see whether we can drop them.

x <- mba[mba$salary==999 | mba$satis==998,]
dim(x) #[1] 81 13. That is about 30% of the records, too many to drop, so we first recode them as NA.

## [1] 81 13

mba[mba$satis==998,"satis"] <- NA
mba[mba$salary %in% c(998,999), "salary"] <- NA

#Now, let's see the number of missing values

colSums(is.na(mba))

##      age      sex gmat_tot gmat_qpc gmat_vpc gmat_tpc    s_avg    f_avg 
##        0        0        0        0        0        0        0        0 
##  quarter work_yrs frstlang   salary    satis 
##        0        0        0       81       46

#salary has 81 NA values and satis has 46, so we need to fill these gaps with sensible values.

#Let's look at the distributions to decide which values to use.

# Data Cleaning

library(tidyverse)

## -- Attaching packages ------------------------------------------------------------------------------------------------------------------------------------------------ tidyverse 1.2.1 --

## v ggplot2 3.0.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.6
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0

## -- Conflicts --------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr)

ggplot(mba, aes(salary)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 81 rows containing non-finite values (stat_bin).

# The salary distribution is skewed to the right, so we impute the missing values with the median rather than the mean.
mba[is.na(mba$salary),"salary"] <- median(mba$salary, na.rm=T) 
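
#A small sketch of the same median imputation that also records which rows were filled in, assuming a made-up salary vector (the numbers and the column name salary_imputed are illustrative only).

```r
# Made-up salaries with two missing values (illustrative only).
toy <- data.frame(salary = c(0, 85000, NA, 97000, 220000, NA))

# Remember which rows are about to be imputed before overwriting them.
toy$salary_imputed <- is.na(toy$salary)

# The median is less sensitive to a long right tail than the mean.
salary_median <- median(toy$salary, na.rm = TRUE)
toy$salary[toy$salary_imputed] <- salary_median

toy
```

#Keeping a flag column makes it possible to rerun later steps with and without the imputed rows.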
 
#Finding the median and mode of satis

ggplot(mba, aes(satis)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 46 rows containing non-finite values (stat_bin).

median(mba$satis, na.rm=T)

## [1] 6

mba %>% group_by(satis) %>% summarise(count=n())

## # A tibble: 8 x 2
##   satis count
##   <dbl> <int>
## 1     1     1
## 2     2     1
## 3     3     5
## 4     4    17
## 5     5    74
## 6     6    97
## 7     7    33
## 8    NA    46

#The median and the mode are both 6, so we replace the NA values with 6.

mba[is.na(mba$satis), "satis"] <- 6

colSums(is.na(mba))

##      age      sex gmat_tot gmat_qpc gmat_vpc gmat_tpc    s_avg    f_avg 
##        0        0        0        0        0        0        0        0 
##  quarter work_yrs frstlang   salary    satis 
##        0        0        0        0        0

#Now, it's time to work on outliers.

library(lattice)
bwplot(mba$age)

#There are several outliers, so we will replace them with the mean of the variable.

#Two ways to find outliers:
#1) Find data points outside the interquartile range and replace them with the mean.
#2) Find data points outside the 5th to 95th percentile range and replace them with the mean.

#We choose the percentile method: the 5th to 95th percentile range is wider than the interquartile range, so fewer observations are altered and less information is lost.

#Percentile method: the 5% cut-off in each tail mirrors a conventional significance level, and the central 90% of the data is left untouched.
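
#To see how the two rules can differ, here is a sketch on a made-up age vector. It assumes the IQR rule means the usual boxplot fences (1.5 x IQR beyond the quartiles), which the text above does not state explicitly.

```r
# Made-up ages (illustrative only).
toy_age <- c(22, 24, 25, 25, 26, 26, 27, 27, 28, 29, 30, 48)

# Rule 1 (assumed form): fences at 1.5 x IQR beyond the quartiles.
q <- quantile(toy_age, probs = c(0.25, 0.75))
iqr_fence <- c(q[1] - 1.5 * IQR(toy_age), q[2] + 1.5 * IQR(toy_age))
iqr_flag <- toy_age < iqr_fence[1] | toy_age > iqr_fence[2]

# Rule 2: anything outside the 5th to 95th percentile range.
pct_bounds <- quantile(toy_age, probs = c(0.05, 0.95))
pct_flag <- toy_age < pct_bounds[1] | toy_age > pct_bounds[2]

toy_age[iqr_flag]  # values flagged by the IQR fences
toy_age[pct_flag]  # values flagged by the percentile rule
```

#In this toy vector the percentile rule flags both 22 and 48, while the 1.5 x IQR fences flag only 48. The percentile rule flags values in both tails by construction, whether or not they are unusual.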

(RangePercentile <- quantile(mba$age, 0.95, na.rm=T) - quantile(mba$age,0.05,na.rm=T))

##   95% 
## 10.35

# Range is 10.35. 

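#Replacing outliers: values outside the 5th to 95th percentile range are set to the mean. Note that the second line recomputes the percentile and the mean on the already modified column.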
mba$age <- ifelse(mba$age>quantile(mba$age, 0.95, na.rm=T), mean(mba$age, na.rm=T), mba$age)
mba$age <- ifelse(mba$age<quantile(mba$age, 0.05, na.rm=T), mean(mba$age, na.rm=T), mba$age)
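
#A possible variant, sketched as a helper that computes both bounds and the mean once, before anything is replaced. The function name replace_outside_percentiles() and the toy vector are hypothetical; this is not the code that produced the output shown in this report.

```r
# Hypothetical helper: replace values outside [lower, upper] percentiles
# with the mean, using bounds taken from the untouched input.
replace_outside_percentiles <- function(x, lower = 0.05, upper = 0.95) {
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  centre <- mean(x, na.rm = TRUE)
  outside <- !is.na(x) & (x < bounds[1] | x > bounds[2])
  x[outside] <- centre
  x
}

# Made-up ages (illustrative only).
toy_age <- c(22, 24, 25, 25, 26, 26, 27, 27, 28, 29, 30, 48)
cleaned_age <- replace_outside_percentiles(toy_age)
cleaned_age
```

#Because the bounds come from the original values, the result does not depend on whether the upper or the lower tail is treated first.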

bwplot(mba$age, main="age", col="yellow")

#Most outliers are gone, but a few remain; it is rarely possible to remove them all. Should we drop the remaining rows because there are so few of them? No: dropping them would lose data, and as long as they do not distort the overall pattern they can stay. Only outliers that clearly distort the pattern would have to be removed.

# Similarly, we will treat outliers in the remaining numerical variables.

#gmat_tot

bwplot(mba$gmat_tot)

mba$gmat_tot <- ifelse(mba$gmat_tot > quantile(mba$gmat_tot, 0.95, na.rm=T), mean(mba$gmat_tot, na.rm=T), mba$gmat_tot)

mba$gmat_tot <- ifelse(mba$gmat_tot < quantile(mba$gmat_tot, 0.05, na.rm=T), mean(mba$gmat_tot, na.rm=T), mba$gmat_tot)

bwplot(mba$gmat_tot) #outliers removed, we have no outliers.

#gmat_qpc

bwplot(mba$gmat_qpc) #with outliers

#Note: the line below trims gmat_tot a second time; gmat_qpc itself is left unchanged.
mba$gmat_tot <- ifelse(mba$gmat_tot < quantile(mba$gmat_tot, 0.05, na.rm=T), mean(mba$gmat_tot, na.rm=T), mba$gmat_tot)

bwplot(mba$gmat_qpc) #gmat_qpc is unchanged, so its outliers remain.

#gmat_vpc

bwplot(mba$gmat_vpc) #With Outliers

mba$gmat_vpc <- ifelse(mba$gmat_vpc < quantile(mba$gmat_vpc, 0.05, na.rm=T), mean(mba$gmat_vpc), mba$gmat_vpc)

bwplot(mba$gmat_vpc) #Without Outliers

#gmat_tpc

bwplot(mba$gmat_tpc) #With outliers

mba$gmat_tpc <- ifelse(mba$gmat_tpc < quantile(mba$gmat_tpc, 0.05, na.rm=T), mean(mba$gmat_tpc), mba$gmat_tpc)

bwplot(mba$gmat_tpc) #Without Outliers

#f_avg

bwplot(mba$f_avg) #Original data with outliers outside the percentile range; let's reduce them to a minimum.

mba$f_avg <- ifelse(mba$f_avg < quantile(mba$f_avg, 0.05, na.rm=T), mean(mba$f_avg, na.rm=T), mba$f_avg)

mba$f_avg <- ifelse(mba$f_avg > quantile(mba$f_avg, 0.95, na.rm=T), mean(mba$f_avg, na.rm=T), mba$f_avg)

bwplot(mba$f_avg)

#Satis

mba$satis <- ifelse(mba$satis < quantile(mba$satis, 0.05, na.rm=T), median(mba$satis, na.rm=T), mba$satis)

bwplot(mba$satis)

#years_of_experience

bwplot(mba$work_yrs) # Original Data with outliers

mba$work_yrs <- ifelse(mba$work_yrs > quantile(mba$work_yrs, 0.95, na.rm=T), mean(mba$work_yrs, na.rm=T), mba$work_yrs)

bwplot(mba$work_yrs)# Reduced number of outliers.

#Voila! The raw, messy data has been cleaned into a consistent data set, still with dimensions 274 x 13. Let's look at the prepared data.

summary(mba)

##       age            sex           gmat_tot        gmat_qpc    
##  Min.   :24.0   Min.   :1.000   Min.   :550.0   Min.   :28.00  
##  1st Qu.:25.0   1st Qu.:1.000   1st Qu.:590.0   1st Qu.:72.00  
##  Median :27.0   Median :1.000   Median :619.5   Median :83.00  
##  Mean   :26.9   Mean   :1.248   Mean   :621.4   Mean   :80.64  
##  3rd Qu.:28.0   3rd Qu.:1.000   3rd Qu.:650.0   3rd Qu.:93.00  
##  Max.   :34.0   Max.   :2.000   Max.   :710.0   Max.   :99.00  
##     gmat_vpc     gmat_tpc         s_avg           f_avg      
##  Min.   :45   Min.   :62.00   Min.   :2.000   Min.   :2.500  
##  1st Qu.:71   1st Qu.:80.25   1st Qu.:2.708   1st Qu.:3.000  
##  Median :81   Median :87.00   Median :3.000   Median :3.062  
##  Mean   :80   Mean   :86.21   Mean   :3.025   Mean   :3.095  
##  3rd Qu.:91   3rd Qu.:94.00   3rd Qu.:3.300   3rd Qu.:3.250  
##  Max.   :99   Max.   :99.00   Max.   :4.000   Max.   :3.750  
##     quarter         work_yrs        frstlang         salary      
##  Min.   :1.000   Min.   : 0.00   Min.   :1.000   Min.   :     0  
##  1st Qu.:1.250   1st Qu.: 2.00   1st Qu.:1.000   1st Qu.:     0  
##  Median :2.000   Median : 3.00   Median :1.000   Median : 85000  
##  Mean   :2.478   Mean   : 3.33   Mean   :1.117   Mean   : 63858  
##  3rd Qu.:3.000   3rd Qu.: 4.00   3rd Qu.:1.000   3rd Qu.: 97000  
##  Max.   :4.000   Max.   :10.00   Max.   :2.000   Max.   :220000  
##      satis      
##  Min.   :4.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.726  
##  3rd Qu.:6.000  
##  Max.   :7.000

nrow(mba[mba$salary==0,]) #90 students report a salary of 0, a sizeable share of the sample.

## [1] 90

(placement_percentage <- nrow(mba[mba$salary!=0,])/nrow(mba)*100)

## [1] 67.15328

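# 67.15% of our sample has a nonzero salary and is treated as placed. Note that this includes the 81 records whose missing salary was imputed with the median.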


#Adding new column showing whether a person is placed or not.

mba$placement[mba$salary==0] <- 0
mba$placement[mba$salary!=0] <- 1

placed <- mba[mba$salary!=0,]
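
#An alternative sketch, assuming one prefers to derive the placement flag while undisclosed salaries are still NA, so that they are not counted as placed. The data frame below is made up for illustration.

```r
# Made-up salaries; NA stands for a salary that was not disclosed.
toy <- data.frame(salary = c(0, 92000, NA, 0, 105000))

# as.integer() turns TRUE/FALSE into 1/0; an NA salary gives an NA flag.
toy$placement <- as.integer(toy$salary != 0)

# which() drops the NA rows when subsetting to placed students.
placed_toy <- toy[which(toy$placement == 1), ]

toy
placed_toy
```

#With this ordering the placement rate would be computed over known salaries only, which is a different design choice from the one used in this report.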

#Data Manipulation and Visualisation using tidyverse and dplyr libraries

library("tidyverse")
library("dplyr")

#Impact of Age on Salary:

(x<- (placed) %>%  group_by(age) %>% summarise(AVG.SALARY=mean(salary), count=n()) %>% arrange(desc(AVG.SALARY)))

## # A tibble: 13 x 3
##      age AVG.SALARY count
##    <dbl>      <dbl> <int>
##  1  27.4    159333.     3
##  2  33      118000      1
##  3  34      105000      1
##  4  30       99950     10
##  5  24       98215     20
##  6  28       94933.    15
##  7  29       94318.    11
##  8  26       92777     30
##  9  31       92750      8
## 10  27       92531.    32
## 11  32       92433.     3
## 12  25       92364.    44
## 13  26.8     90543.     6

# Age does not appear to have much influence on salary. Let's verify this with a correlation test:

cor(placed$age, placed$salary)

## [1] 0.06725152

cor.test(placed$age, placed$salary)

## 
##  Pearson's product-moment correlation
## 
## data:  placed$age and placed$salary
## t = 0.90933, df = 182, p-value = 0.3644
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.07816999  0.20987077
## sample estimates:
##        cor 
## 0.06725152

#Correlation test: the correlation is weak (r = 0.067) and not statistically significant (p = 0.36), so there is no evidence of an association between age and salary.
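
#The numbers quoted above can also be pulled out of the test object instead of being read off the printout. A minimal sketch on simulated data (the variables toy_age and toy_salary are made up and unrelated to the real data set):

```r
# Simulated data for illustration only.
set.seed(1)
toy_age <- round(rnorm(50, mean = 27, sd = 2))
toy_salary <- 90000 + 500 * toy_age + rnorm(50, sd = 15000)

ct <- cor.test(toy_age, toy_salary)

ct$estimate   # sample correlation coefficient
ct$p.value    # p-value of the two-sided test
ct$conf.int   # 95 percent confidence interval for the correlation
ct$parameter  # degrees of freedom (n - 2)
```

#Storing the result in an object makes it easy to report r and the p-value together without copying them by hand.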

(x<- (placed) %>%  group_by(age) %>% summarise(AVG.SALARY=mean(salary), count=n()) %>% arrange(desc(count)))

## # A tibble: 13 x 3
##      age AVG.SALARY count
##    <dbl>      <dbl> <int>
##  1  25       92364.    44
##  2  27       92531.    32
##  3  26       92777     30
##  4  24       98215     20
##  5  28       94933.    15
##  6  29       94318.    11
##  7  30       99950     10
##  8  31       92750      8
##  9  26.8     90543.     6
## 10  27.4    159333.     3
## 11  32       92433.     3
## 12  33      118000      1
## 13  34      105000      1

# The most common ages among placed students are 24 to 28, which suggests that companies mostly hire younger graduates, although it may simply reflect the age mix of the class. The frequency plot below shows the same pattern.

ggplot(placed, aes(age)) + geom_freqpoly(color="red")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Impact of Sex on salary

(sex <- placed %>% group_by(sex) %>% summarise(AVG.SALARY=mean(salary), count=n()))

## # A tibble: 2 x 3
##     sex AVG.SALARY count
##   <dbl>      <dbl> <int>
## 1     1     95345.   139
## 2     2     94317.    45

boxplot(salary~sex, data=placed, col=c("red","pink"), xlab="Sex", ylab="salary", main="Sex based salary comparison")

#There is no noticeable difference in the salaries offered to candidates based on gender. Let's test whether this holds statistically.

cor.test(placed$sex, placed$salary)

## 
##  Pearson's product-moment correlation
## 
## data:  placed$sex and placed$salary
## t = -0.37186, df = 182, p-value = 0.7104
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1715306  0.1175764
## sample estimates:
##         cor 
## -0.02755326

str(mba)

## 'data.frame':    274 obs. of  14 variables:
##  $ age      : num  26.8 24 24 24 24 ...
##  $ sex      : num  2 1 1 1 2 1 1 2 1 1 ...
##  $ gmat_tot : num  620 610 670 570 710 640 610 650 630 680 ...
##  $ gmat_qpc : int  77 90 99 56 93 82 89 88 79 99 ...
##  $ gmat_vpc : num  87 71 78 81 98 89 74 89 91 81 ...
##  $ gmat_tpc : num  87 87 95 75 98 91 87 92 89 96 ...
##  $ s_avg    : num  3.4 3.5 3.3 3.3 3.6 3.9 3.4 3.3 3.3 3.45 ...
##  $ f_avg    : num  3 3.13 3.25 2.67 3.75 ...
##  $ quarter  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ work_yrs : num  2 2 2 1 2 2 2 2 2 2 ...
##  $ frstlang : num  1 1 1 1 1 1 1 1 2 1 ...
##  $ salary   : num  0 0 0 0 85000 0 0 0 85000 85000 ...
##  $ satis    : num  7 6 6 7 5 6 5 6 4 6 ...
##  $ placement: num  0 0 0 0 1 0 0 0 1 1 ...

# As the p-value (0.71) > 0.05, we fail to reject the null hypothesis: there is no evidence that gender influences salary.
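
#With a two-level variable such as sex, the Pearson correlation test gives the same p-value as a two-sample t-test with pooled variance. A sketch on simulated data (made-up values, for illustration only):

```r
# Simulated salaries for two groups coded 1 and 2, as in the sex column.
set.seed(42)
toy <- data.frame(
  sex = rep(c(1, 2), times = c(30, 10)),
  salary = rnorm(40, mean = 95000, sd = 15000)
)

ct <- cor.test(toy$sex, toy$salary)

# var.equal = TRUE requests the pooled-variance t-test;
# the default (Welch) test would generally give a slightly different p-value.
tt <- t.test(salary ~ sex, data = toy, var.equal = TRUE)

c(cor_test = ct$p.value, t_test = tt$p.value)
```

#The t-test output also reports the two group means directly, which can be easier to interpret than a correlation with a 1/2 code.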

#Impact of Sex on placement:

#Null hypothesis: gender and placement are independent.
#Test: chisq.test()

(Sex_Placed <- xtabs(~sex+placement, data=mba))

##    placement
## sex   0   1
##   1  67 139
##   2  23  45

chisq.test(Sex_Placed)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  Sex_Placed
## X-squared = 0.0023919, df = 1, p-value = 0.961

str(mba$sex)

##  num [1:274] 2 1 1 1 2 1 1 2 1 1 ...

#Insight: as p (0.96) > 0.05, there is no evidence that gender plays a role in getting placed.
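
#The test can be reproduced from the counts printed above without the full data set. The sketch below rebuilds the 2 x 2 table by hand and also shows the expected counts under independence.

```r
# Counts copied from the xtabs() output above (rows: sex, columns: placement).
sex_placed <- matrix(
  c(67, 23, 139, 45),
  nrow = 2,
  dimnames = list(sex = c("1", "2"), placement = c("0", "1"))
)

# For a 2 x 2 table chisq.test() applies Yates' continuity correction by default.
res <- chisq.test(sex_placed)

res$expected  # counts expected if sex and placement were independent
res$p.value
```

#The observed counts are very close to the expected ones, which is why the test statistic is almost zero.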

#Impact of first language (English or other) on salary:

(frst_language <- placed %>% group_by(frstlang) %>% summarise(AVG.SALARY=mean(salary), count=n()))

## # A tibble: 2 x 3
##   frstlang AVG.SALARY count
##      <dbl>      <dbl> <int>
## 1        1     95049.   160
## 2        2     95388.    24

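#frstlang is converted to a factor so that it is treated as a group; reorder() needs FUN (upper case) to order the groups by median salary.
ggplot(placed, aes(x=reorder(factor(frstlang), salary, FUN=median), y=salary, fill=factor(frstlang))) + geom_boxplot() + xlab("frstlang") + labs(fill="frstlang") + ggtitle("Language based comparison", subtitle = "1 stands for English, 2 for other language")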

cor.test(placed$frstlang, placed$salary )

## 
##  Pearson's product-moment correlation
## 
## data:  placed$frstlang and placed$salary
## t = 0.09587, df = 182, p-value = 0.9237
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1376964  0.1516113
## sample estimates:
##         cor 
## 0.007106149

#Since the p-value (0.92) > 0.05, there is no evidence that salary depends on first language.

#Impact of first language on placement:

(frstlangJOB <- xtabs(~frstlang + placement, data=mba))

##         placement
## frstlang   0   1
##        1  82 160
##        2   8  24

prop.table(frstlangJOB) * 100

##         placement
## frstlang         0         1
##        1 29.927007 58.394161
##        2  2.919708  8.759124

str(mba)

## 'data.frame':    274 obs. of  14 variables:
##  $ age      : num  26.8 24 24 24 24 ...
##  $ sex      : num  2 1 1 1 2 1 1 2 1 1 ...
##  $ gmat_tot : num  620 610 670 570 710 640 610 650 630 680 ...
##  $ gmat_qpc : int  77 90 99 56 93 82 89 88 79 99 ...
##  $ gmat_vpc : num  87 71 78 81 98 89 74 89 91 81 ...
##  $ gmat_tpc : num  87 87 95 75 98 91 87 92 89 96 ...
##  $ s_avg    : num  3.4 3.5 3.3 3.3 3.6 3.9 3.4 3.3 3.3 3.45 ...
##  $ f_avg    : num  3 3.13 3.25 2.67 3.75 ...
##  $ quarter  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ work_yrs : num  2 2 2 1 2 2 2 2 2 2 ...
##  $ frstlang : num  1 1 1 1 1 1 1 1 2 1 ...
##  $ salary   : num  0 0 0 0 85000 0 0 0 85000 85000 ...
##  $ satis    : num  7 6 6 7 5 6 5 6 4 6 ...
##  $ placement: num  0 0 0 0 1 0 0 0 1 1 ...

#About 66% of the students whose first language is English (160 of 242) and 75% of the others (24 of 32) have been placed. Let's see whether this difference is significant or just a matter of insufficient data.

#chisq.test() to test the null hypothesis that placement is independent of the candidate's first language.

chisq.test(frstlangJOB)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  frstlangJOB
## X-squared = 0.64868, df = 1, p-value = 0.4206

#With p = 0.42 we fail to reject the null hypothesis: there is no evidence that getting a job depends on first language.
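
#prop.table() without a margin gives each cell as a share of all 274 students. To compare placement rates between the two language groups, row-wise percentages are easier to read. A sketch using the counts printed above:

```r
# Counts copied from the xtabs() output above (rows: frstlang, columns: placement).
frstlang_job <- matrix(
  c(82, 8, 160, 24),
  nrow = 2,
  dimnames = list(frstlang = c("1", "2"), placement = c("0", "1"))
)

# margin = 1 makes every row sum to 100%.
row_pct <- prop.table(frstlang_job, margin = 1) * 100
round(row_pct, 1)
```

#Each row now shows the unplaced and placed share within one language group.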

#Did the MBA students like the programme?

#placed

ggplot(placed, aes(satis)) + geom_bar(col="deeppink", fill="light blue") + theme(aspect.ratio = 0.5) + ggtitle("Frequency based on level of satisfaction of placed students", subtitle = "satis>5 are satisfied") + xlab("Level of Satisfaction")

#let's see the percentage of people satisfied/unsatisfied by analysing both the placed and unplaced students.

placed %>% count(satis)

## # A tibble: 4 x 2
##   satis     n
##   <dbl> <int>
## 1     4    13
## 2     5    38
## 3     6   110
## 4     7    23

prop.table(xtabs(~satis, data=placed)) * 100

## satis
##         4         5         6         7 
##  7.065217 20.652174 59.782609 12.500000

(percent_satisfied_placed <- length(placed[placed$satis %in% c(6,7),"satis"])/dim(placed)[1]*100)

## [1] 72.28261

#Unplaced

mba[mba$placement==0,] %>% count(satis)

## # A tibble: 4 x 2
##   satis     n
##   <dbl> <int>
## 1     4     4
## 2     5    36
## 3     6    40
## 4     7    10

#Interestingly, a large number of unplaced students are satisfied with the programme.

ggplot(mba[mba$placement==0,], aes(satis)) + geom_bar(col="deeppink", fill="light yellow") + theme(aspect.ratio = 0.5) + ggtitle("Frequency based on level of satisfaction of unplaced students", subtitle = "satis>5 are satisfied") + xlab("Level of Satisfaction")

(percent_satisfied_unplaced <- length(mba[mba$placement==0 & mba$satis %in% c(6,7), "satis"])/nrow(mba[mba$placement==0,])*100)

## [1] 55.55556

#55.56% of unplaced students are satisfied.

#In_general

(percent_satisfied_overall <- length(mba[mba$satis %in% c(6,7),"satis"])/dim(mba)[1] * 100)

## [1] 66.78832

#About two-thirds of the students are satisfied overall.
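
#The three percentages can be computed in one step by taking the mean of a logical condition per group. The sketch below rebuilds the satisfaction scores from the count tables printed above; the data frame toy is constructed only for this illustration.

```r
# Satisfaction scores rebuilt from the count(satis) tables above.
placed_satis <- rep(c(4, 5, 6, 7), times = c(13, 38, 110, 23))
unplaced_satis <- rep(c(4, 5, 6, 7), times = c(4, 36, 40, 10))

toy <- data.frame(
  satis = c(placed_satis, unplaced_satis),
  placement = rep(c(1, 0), times = c(length(placed_satis), length(unplaced_satis)))
)

# The mean of a TRUE/FALSE vector is the share of TRUE values.
pct_satisfied <- tapply(toy$satis >= 6, toy$placement, mean) * 100
pct_overall <- mean(toy$satis >= 6) * 100

round(pct_satisfied, 2)
round(pct_overall, 2)
```

#Using one expression for all groups avoids the bracket slip that is easy to make when each percentage is written out separately.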

#Significant insights

library("corrgram")

## 
## Attaching package: 'corrgram'

## The following object is masked from 'package:lattice':
## 
##     panel.fill

corrgram(mba, lower.panel = panel.shade, upper.panel=panel.pie, text.panel=panel.txt, order=T, main="Corrgram of MBA Data Variables")

#Effect of Gmat_score on placement and salary

cor(mba[,c("gmat_tot","gmat_qpc","gmat_vpc","gmat_tpc")], mba[,c("salary","placement")])

##              salary  placement
## gmat_tot 0.06063265 0.06373658
## gmat_qpc 0.06281164 0.08158152
## gmat_vpc 0.01492840 0.04488052
## gmat_tpc 0.07756364 0.09891251

library("car")

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

scatterplotMatrix(formula=~placement+salary+gmat_tot+gmat_qpc+gmat_tpc+gmat_vpc, cex=0.6, data=mba, main="ScatterPlots of various Gmat_variables with salary and placement")

## Warning in smoother(x[subs], y[subs], col = smoother.args$col[i], log.x =
## FALSE, : could not fit positive part of the spread

cor.test(mba$gmat_tot, mba$salary)

## 
##  Pearson's product-moment correlation
## 
## data:  mba$gmat_tot and mba$salary
## t = 1.0018, df = 272, p-value = 0.3173
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.05828608  0.17785471
## sample estimates:
##        cor 
## 0.06063265

cor.test(mba$gmat_qpc, mba$salary)

## 
##  Pearson's product-moment correlation
## 
## data:  mba$gmat_qpc and mba$salary
## t = 1.038, df = 272, p-value = 0.3002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.05610591  0.17997202
## sample estimates:
##        cor 
## 0.06281164

cor.test(mba$gmat_tpc, mba$salary)

## 
##  Pearson's product-moment correlation
## 
## data:  mba$gmat_tpc and mba$salary
## t = 1.2831, df = 272, p-value = 0.2006
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.04131605  0.19427792
## sample estimates:
##        cor 
## 0.07756364
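
#The three tests above repeat the same call with a different column. A sketch of how they could be collected into one table, using simulated data (the values below are made up and unrelated to the real data set):

```r
# Simulated data for illustration only.
set.seed(7)
toy <- data.frame(
  gmat_tot = rnorm(60, mean = 620, sd = 50),
  gmat_qpc = runif(60, min = 30, max = 99),
  gmat_tpc = runif(60, min = 60, max = 99),
  salary = rnorm(60, mean = 90000, sd = 20000)
)

gmat_cols <- c("gmat_tot", "gmat_qpc", "gmat_tpc")

# Run cor.test() for each GMAT column and keep r and the p-value.
results <- t(sapply(gmat_cols, function(v) {
  ct <- cor.test(toy[[v]], toy$salary)
  c(r = unname(ct$estimate), p_value = ct$p.value)
}))

round(results, 4)
```

#A single table makes it easier to compare the variables side by side before drawing conclusions.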

#Conclusion: Wonderful! The analysis is done and has surfaced several findings and anomalies. Let's summarise the case study briefly.

#Salary is only weakly related to GMAT scores, and none of the correlations is statistically significant (p = 0.32 for gmat_tot, 0.30 for gmat_qpc, 0.20 for gmat_tpc). The scatterplot matrix shows similarly weak effects of gmat_qpc, gmat_tpc, gmat_vpc and gmat_tot on salary. Among them, the overall percentile (gmat_tpc) has the largest correlation with salary (0.078), but it is still small. So a low GMAT score should not be much of a problem for Daer, although a decent overall percentile may give her a slight edge.

#72.28% of placed students and 66.79% of students overall are satisfied with the programme.

#55.56% of unplaced students are satisfied with the programme. What could be the reasons?

#*Either the data was collected poorly or it is very unevenly distributed.

#*Human data-entry error? Unlikely: an error rate this high is not plausible.

#*If placement is the biggest driver of satisfaction, it is unlikely that more than half of the unplaced students would really be satisfied, so the data may have come from mixed rather than consistent sources. We should therefore tell our manager that the data is not consistent enough to build robust models of satisfaction.

#Thank you very much for your kind attention to my work. More interesting work will follow soon. - Puneet

Puneet

August 27, 2018