##INTRODUCTION
#Every year, MBA programs across the country advertise for prospective students, promoting the academic excellence of their programs, the uniqueness of their offerings, the quality of their faculty and a variety of other factors. In a competitive market, the goal is to attract the brightest and the best.
#Many schools claim that graduates of their programs will earn large salaries upon graduation, a point that clearly ranks high on the list of many students' decision-making criteria. In fact, the Financial Times' rating of MBA programs uses graduates' salaries as a large component of its rating system.
#Marie Daer, an aspiring MBA applicant, was very interested in the starting salaries of graduating students. Surprisingly, she was able to track down a dataset from a prominent - but anonymous - MBA school.
#Daer was able to learn the following about the data. Three months after graduation, the students in the class of 2012 were sent a survey. The survey asked about their satisfaction with the MBA program as well as their starting salary. The survey was not anonymous, and the responses of these students were added to the information already on file about them. These data included the graduates' age, sex, years of work experience, GMAT information, fall and spring MBA average, quartile ranking, and their native language.
#Daer was pleased to have located the data. She wondered whether it could answer some important questions that would help her decide whether to enroll in the MBA program at this particular school. In particular, she wondered about starting salaries, whether gender and/or age made a difference, and whether students liked this particular program. She also wondered whether her GMAT score made a difference in marks. Since her native language was not English, Daer had a relatively low GMAT.
#Dataset information
#Missing salary and satisfaction data are coded as follows:
#998 = did not answer the survey
#999 = answered the survey but did not disclose salary data
#Size of data set: 274 records
mba <- read.csv("MBA Starting Salaries Data.csv")
str(mba)
## 'data.frame': 274 obs. of 13 variables:
## $ age : int 23 24 24 24 24 24 25 25 25 25 ...
## $ sex : int 2 1 1 1 2 1 1 2 1 1 ...
## $ gmat_tot: int 620 610 670 570 710 640 610 650 630 680 ...
## $ gmat_qpc: int 77 90 99 56 93 82 89 88 79 99 ...
## $ gmat_vpc: int 87 71 78 81 98 89 74 89 91 81 ...
## $ gmat_tpc: int 87 87 95 75 98 91 87 92 89 96 ...
## $ s_avg : num 3.4 3.5 3.3 3.3 3.6 3.9 3.4 3.3 3.3 3.45 ...
## $ f_avg : num 3 4 3.25 2.67 3.75 3.75 3.5 3.75 3.25 3.67 ...
## $ quarter : int 1 1 1 1 1 1 1 1 1 1 ...
## $ work_yrs: int 2 2 2 1 2 2 2 2 2 2 ...
## $ frstlang: int 1 1 1 1 1 1 1 1 2 1 ...
## $ salary : int 0 0 0 0 999 0 0 0 999 998 ...
## $ satis : int 7 6 6 7 5 6 5 6 4 998 ...
View(mba)
dim(mba)
## [1] 274 13
#Data Preparation
mba$sex <- as.numeric(mba$sex)
mba$frstlang <- as.numeric(mba$frstlang)
mba$salary <- as.numeric(mba$salary)
mba$satis <- as.numeric(mba$satis)
str(mba)
## 'data.frame': 274 obs. of 13 variables:
## $ age : int 23 24 24 24 24 24 25 25 25 25 ...
## $ sex : num 2 1 1 1 2 1 1 2 1 1 ...
## $ gmat_tot: int 620 610 670 570 710 640 610 650 630 680 ...
## $ gmat_qpc: int 77 90 99 56 93 82 89 88 79 99 ...
## $ gmat_vpc: int 87 71 78 81 98 89 74 89 91 81 ...
## $ gmat_tpc: int 87 87 95 75 98 91 87 92 89 96 ...
## $ s_avg : num 3.4 3.5 3.3 3.3 3.6 3.9 3.4 3.3 3.3 3.45 ...
## $ f_avg : num 3 4 3.25 2.67 3.75 3.75 3.5 3.75 3.25 3.67 ...
## $ quarter : int 1 1 1 1 1 1 1 1 1 1 ...
## $ work_yrs: int 2 2 2 1 2 2 2 2 2 2 ...
## $ frstlang: num 1 1 1 1 1 1 1 1 2 1 ...
## $ salary : num 0 0 0 0 999 0 0 0 999 998 ...
## $ satis : num 7 6 6 7 5 6 5 6 4 998 ...
colSums(is.na(mba)) #No values are stored as NA, but missing values are coded as 998 and 999.
## age sex gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg
## 0 0 0 0 0 0 0 0
## quarter work_yrs frstlang salary satis
## 0 0 0 0 0
dim(mba) #[1] 274 13
## [1] 274 13
#Let's isolate the records of students who either did not answer the survey or did not disclose their salary, and see whether they can be dropped.
x <- mba[mba$salary==999 | mba$satis==998,]
dim(x) #81 of 274 records (about 30%) is too many to drop, so we first recode them as NA.
## [1] 81 13
mba[mba$satis==998,"satis"] <- NA
mba[mba$salary %in% c(998,999), "salary"] <- NA
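# A minimal sketch of the recoding step on made-up values. The helper code_missing() and
# the vector toy_salary are illustrative names, not part of the original analysis, and the
# sketch assumes 998 and 999 are the only sentinel codes, as stated in the dataset notes.
code_missing <- function(x, codes = c(998, 999)) {
  # replace() returns a copy of x with the matching positions set to NA
  replace(x, x %in% codes, NA)
}
toy_salary <- c(0, 85000, 999, 998, 120000)
code_missing(toy_salary)
sum(is.na(code_missing(toy_salary)))  # number of values recoded to NA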
#Now, let's see the number of missing values
colSums(is.na(mba))
## age sex gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg
## 0 0 0 0 0 0 0 0
## quarter work_yrs frstlang salary satis
## 0 0 0 81 46
#salary has 81 and satis has 46 NA values, so we need to fill them with appropriate values.
# Let's look at the distributions to choose sensible replacements.
# Data Cleaning
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 3.0.0 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.6
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts --------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
ggplot(mba, aes(salary)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 81 rows containing non-finite values (stat_bin).

# The salary distribution is right-skewed, so we replace the missing values with the median rather than the mean.
mba[is.na(mba$salary),"salary"] <- median(mba$salary, na.rm=T)
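# A hedged sketch of median imputation that also records which values were filled in, so
# that imputed rows can be told apart later. The data frame toy and the column
# salary_imputed are hypothetical names, and the numbers are made up for illustration.
toy <- data.frame(salary = c(0, 90000, NA, 105000, NA, 220000))
toy$salary_imputed <- is.na(toy$salary)   # TRUE where the original value was missing
toy$salary[toy$salary_imputed] <- median(toy$salary, na.rm = TRUE)
toy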
#Finding mode
ggplot(mba, aes(satis)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 46 rows containing non-finite values (stat_bin).

median(mba$satis, na.rm=T)
## [1] 6
mba %>% group_by(satis) %>% summarise(count=n())
## # A tibble: 8 x 2
## satis count
## <dbl> <int>
## 1 1 1
## 2 2 1
## 3 3 5
## 4 4 17
## 5 5 74
## 6 6 97
## 7 7 33
## 8 NA 46
#The median and the mode are both 6, so we replace the NA values with 6.
mba[is.na(mba$satis), "satis"] <- 6
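# Base R has no function for the statistical mode (mode() reports the storage type), so a
# small helper is sketched here. stat_mode() is a hypothetical name; the toy vector is
# rebuilt from the satisfaction counts tabulated above.
stat_mode <- function(x) {
  counts <- table(x)   # frequency of each distinct non-NA value
  as.numeric(names(counts)[which.max(counts)])
}
toy_satis <- rep(1:7, times = c(1, 1, 5, 17, 74, 97, 33))
stat_mode(toy_satis)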
colSums(is.na(mba))
## age sex gmat_tot gmat_qpc gmat_vpc gmat_tpc s_avg f_avg
## 0 0 0 0 0 0 0 0
## quarter work_yrs frstlang salary satis
## 0 0 0 0 0
#Now, it's time to work on outliers.
library(lattice)
bwplot(mba$age)

#There are many outliers, so we reduce their influence by replacing them with the mean.
#Two ways to flag outliers: 1) flag points beyond the boxplot whiskers (more than 1.5 * IQR from the quartiles) and replace them with the mean.
#2) Replace points outside the 5th to 95th percentile range with the mean.
#We choose the percentile method: the 5th to 95th percentile range is wider than the interquartile range, so fewer genuine observations are altered.
#Percentile method: it leaves the central 90% of the data untouched and only modifies the extreme 5% in each tail.
(RangePercentile <- quantile(mba$age, 0.95, na.rm=T) - quantile(mba$age,0.05,na.rm=T))
## 95%
## 10.35
# The 5th to 95th percentile range spans 10.35 years.
#Replacing outliers: values outside the 5th to 95th percentile range are set to the mean.
mba$age <- ifelse(mba$age>quantile(mba$age, 0.95, na.rm=T), mean(mba$age, na.rm=T), mba$age)
mba$age <- ifelse(mba$age<quantile(mba$age, 0.05, na.rm=T), mean(mba$age, na.rm=T), mba$age)
bwplot(mba$age, main="age", col="yellow")

#Most outliers have been replaced, but a few remain, because the boxplot recomputes its whiskers from the modified data. Should we drop the remaining rows, since they are few? No: dropping them would lose data, and so few points are unlikely to distort the patterns we are looking for. Only if they clearly distorted the analysis would removal be justified.
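# A sketch of the percentile rule as a reusable helper. replace_outside() is a hypothetical
# name. Unlike the two-step ifelse() calls, it computes both cut-offs and the mean once,
# before any value is replaced, so the lower cut-off is not affected by the upper replacement.
replace_outside <- function(x, lower = 0.05, upper = 0.95) {
  bounds <- quantile(x, c(lower, upper), na.rm = TRUE)
  centre <- mean(x, na.rm = TRUE)
  ifelse(x < bounds[1] | x > bounds[2], centre, x)
}
toy_age <- c(22, 24, 25, 25, 26, 27, 27, 28, 29, 48)   # made-up ages
replace_outside(toy_age)
# For comparison, the 1.5 * IQR rule used by boxplots flags these values:
boxplot.stats(toy_age)$out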
# Similarly, we will treat outliers in the other numeric variables.
#gmat_tot
bwplot(mba$gmat_tot)

mba$gmat_tot <- ifelse(mba$gmat_tot > quantile(mba$gmat_tot, 0.95, na.rm=T), mean(mba$gmat_tot, na.rm=T), mba$gmat_tot)
mba$gmat_tot <- ifelse(mba$gmat_tot < quantile(mba$gmat_tot, 0.05, na.rm=T), mean(mba$gmat_tot, na.rm=T), mba$gmat_tot)
bwplot(mba$gmat_tot) #outliers removed, we have no outliers.

#gmat_qpc
bwplot(mba$gmat_qpc) #with outliers

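#Note: the next line targets gmat_tot again rather than gmat_qpc, so gmat_qpc keeps its original values (minimum 28 in the summary below). Substitute gmat_qpc for gmat_tot to treat it like the other variables.
mba$gmat_tot <- ifelse(mba$gmat_tot < quantile(mba$gmat_tot, 0.05, na.rm=T), mean(mba$gmat_tot, na.rm=T), mba$gmat_tot)
bwplot(mba$gmat_qpc) #gmat_qpc is unchanged, so its low-end outliers remain.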

#gmat_vpc
bwplot(mba$gmat_vpc) #With Outliers

mba$gmat_vpc <- ifelse(mba$gmat_vpc < quantile(mba$gmat_vpc, 0.05, na.rm=T), mean(mba$gmat_vpc), mba$gmat_vpc)
bwplot(mba$gmat_vpc) #Without Outliers

#gmat_tpc
bwplot(mba$gmat_tpc) #With outliers

mba$gmat_tpc <- ifelse(mba$gmat_tpc < quantile(mba$gmat_tpc, 0.05, na.rm=T), mean(mba$gmat_tpc), mba$gmat_tpc)
bwplot(mba$gmat_tpc) #Without Outliers

#f_avg
bwplot(mba$f_avg) #Original data with outliers outside the percentile range; let's reduce their number.

mba$f_avg <- ifelse(mba$f_avg < quantile(mba$f_avg, 0.05, na.rm=T), mean(mba$f_avg, na.rm=T), mba$f_avg)
mba$f_avg <- ifelse(mba$f_avg > quantile(mba$f_avg, 0.95, na.rm=T), mean(mba$f_avg, na.rm=T), mba$f_avg)
bwplot(mba$f_avg)

#Satis
mba$satis <- ifelse(mba$satis < quantile(mba$satis, 0.05, na.rm=T), median(mba$satis, na.rm=T), mba$satis)
bwplot(mba$satis)

#years_of_experience
bwplot(mba$work_yrs) # Original Data with outliers

mba$work_yrs <- ifelse(mba$work_yrs > quantile(mba$work_yrs, 0.95, na.rm=T), mean(mba$work_yrs, na.rm=T), mba$work_yrs)
bwplot(mba$work_yrs)# Reduced number of outliers.

#The raw, messy data has now been cleaned. Data preparation is complete and the dimensions are still 274 x 13. Let's view the prepared data.
summary(mba)
## age sex gmat_tot gmat_qpc
## Min. :24.0 Min. :1.000 Min. :550.0 Min. :28.00
## 1st Qu.:25.0 1st Qu.:1.000 1st Qu.:590.0 1st Qu.:72.00
## Median :27.0 Median :1.000 Median :619.5 Median :83.00
## Mean :26.9 Mean :1.248 Mean :621.4 Mean :80.64
## 3rd Qu.:28.0 3rd Qu.:1.000 3rd Qu.:650.0 3rd Qu.:93.00
## Max. :34.0 Max. :2.000 Max. :710.0 Max. :99.00
## gmat_vpc gmat_tpc s_avg f_avg
## Min. :45 Min. :62.00 Min. :2.000 Min. :2.500
## 1st Qu.:71 1st Qu.:80.25 1st Qu.:2.708 1st Qu.:3.000
## Median :81 Median :87.00 Median :3.000 Median :3.062
## Mean :80 Mean :86.21 Mean :3.025 Mean :3.095
## 3rd Qu.:91 3rd Qu.:94.00 3rd Qu.:3.300 3rd Qu.:3.250
## Max. :99 Max. :99.00 Max. :4.000 Max. :3.750
## quarter work_yrs frstlang salary
## Min. :1.000 Min. : 0.00 Min. :1.000 Min. : 0
## 1st Qu.:1.250 1st Qu.: 2.00 1st Qu.:1.000 1st Qu.: 0
## Median :2.000 Median : 3.00 Median :1.000 Median : 85000
## Mean :2.478 Mean : 3.33 Mean :1.117 Mean : 63858
## 3rd Qu.:3.000 3rd Qu.: 4.00 3rd Qu.:1.000 3rd Qu.: 97000
## Max. :4.000 Max. :10.00 Max. :2.000 Max. :220000
## satis
## Min. :4.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.726
## 3rd Qu.:6.000
## Max. :7.000
dim(mba[mba$salary==0,])[1] #90 records have a salary of 0 (not placed), which is a sizeable share.
## [1] 90
(placement_percentage <- nrow(mba[mba$salary!=0,])/nrow(mba)*100)
## [1] 67.15328
# 67.15 % of our sample is placed. This count includes the 81 records whose missing salary was replaced by the median.
#Adding new column showing whether a person is placed or not.
mba$placement[mba$salary==0] <- 0
mba$placement[mba$salary!=0] <- 1
placed <- mba[mba$salary!=0,]
#Data Manipulation and Visualisation using tidyverse and dplyr libraries
library("tidyverse")
library("dplyr")
#Impact of Age on Salary:
(x<- (placed) %>% group_by(age) %>% summarise(AVG.SALARY=mean(salary), count=n()) %>% arrange(desc(AVG.SALARY)))
## # A tibble: 13 x 3
## age AVG.SALARY count
## <dbl> <dbl> <int>
## 1 27.4 159333. 3
## 2 33 118000 1
## 3 34 105000 1
## 4 30 99950 10
## 5 24 98215 20
## 6 28 94933. 15
## 7 29 94318. 11
## 8 26 92777 30
## 9 31 92750 8
## 10 27 92531. 32
## 11 32 92433. 3
## 12 25 92364. 44
## 13 26.8 90543. 6
# Age does not appear to have much influence on salary. Let's verify this with a correlation test:
cor(placed$age, placed$salary)
## [1] 0.06725152
cor.test(placed$age, placed$salary)
##
## Pearson's product-moment correlation
##
## data: placed$age and placed$salary
## t = 0.90933, df = 182, p-value = 0.3644
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07816999 0.20987077
## sample estimates:
## cor
## 0.06725152
##Correlation test: the correlation is weak (r = 0.067) and not significant (p = 0.36), so there is no evidence of an association between age and salary.
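# A sketch of pulling the estimate and p-value out of a cor.test() result instead of reading
# them off the printout. The data are simulated (sim_age and sim_salary are made-up names).
# Spearman's rank correlation is shown as an alternative that is less sensitive to a skewed
# salary distribution; it is not what the analysis above uses.
set.seed(1)
sim_age <- sample(24:34, 100, replace = TRUE)
sim_salary <- 90000 + rnorm(100, sd = 15000)   # generated independently of age
pearson <- cor.test(sim_age, sim_salary)
spearman <- cor.test(sim_age, sim_salary, method = "spearman", exact = FALSE)
c(r = unname(pearson$estimate), p = pearson$p.value)
c(rho = unname(spearman$estimate), p = spearman$p.value)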
(x<- (placed) %>% group_by(age) %>% summarise(AVG.SALARY=mean(salary), count=n()) %>% arrange(desc(count)))
## # A tibble: 13 x 3
## age AVG.SALARY count
## <dbl> <dbl> <int>
## 1 25 92364. 44
## 2 27 92531. 32
## 3 26 92777 30
## 4 24 98215 20
## 5 28 94933. 15
## 6 29 94318. 11
## 7 30 99950 10
## 8 31 92750 8
## 9 26.8 90543. 6
## 10 27.4 159333. 3
## 11 32 92433. 3
## 12 33 118000 1
## 13 34 105000 1
# The most common ages among placed graduates are 24 to 28. This may suggest that employers favour younger hires, although it may simply reflect the age mix of the class. The same can be seen in a frequency polygon:
ggplot(placed, aes(age)) + geom_freqpoly(color="red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Impact of Sex on salary
(sex <- placed %>% group_by(sex) %>% summarise(AVG.SALARY=mean(salary), count=n()))
## # A tibble: 2 x 3
## sex AVG.SALARY count
## <dbl> <dbl> <int>
## 1 1 95345. 139
## 2 2 94317. 45
boxplot(salary~sex, data=placed, col=c("red","pink"), xlab="Sex", ylab="salary", main="Sex based salary comparison")

#There is no notable difference in the average salaries offered to the two gender groups. Let's test whether the difference is statistically significant.
cor.test(placed$sex, placed$salary)
##
## Pearson's product-moment correlation
##
## data: placed$sex and placed$salary
## t = -0.37186, df = 182, p-value = 0.7104
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1715306 0.1175764
## sample estimates:
## cor
## -0.02755326
str(mba)
## 'data.frame': 274 obs. of 14 variables:
## $ age : num 26.8 24 24 24 24 ...
## $ sex : num 2 1 1 1 2 1 1 2 1 1 ...
## $ gmat_tot : num 620 610 670 570 710 640 610 650 630 680 ...
## $ gmat_qpc : int 77 90 99 56 93 82 89 88 79 99 ...
## $ gmat_vpc : num 87 71 78 81 98 89 74 89 91 81 ...
## $ gmat_tpc : num 87 87 95 75 98 91 87 92 89 96 ...
## $ s_avg : num 3.4 3.5 3.3 3.3 3.6 3.9 3.4 3.3 3.3 3.45 ...
## $ f_avg : num 3 3.13 3.25 2.67 3.75 ...
## $ quarter : int 1 1 1 1 1 1 1 1 1 1 ...
## $ work_yrs : num 2 2 2 1 2 2 2 2 2 2 ...
## $ frstlang : num 1 1 1 1 1 1 1 1 2 1 ...
## $ salary : num 0 0 0 0 85000 0 0 0 85000 85000 ...
## $ satis : num 7 6 6 7 5 6 5 6 4 6 ...
## $ placement: num 0 0 0 0 1 0 0 0 1 1 ...
# As the p-value of the gender test (0.71) is greater than 0.05, we fail to reject the null hypothesis: there is no evidence that gender influences salary.
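# Because sex has only two levels, a two-sample comparison of means is a common alternative
# to a correlation test. The sketch below runs Welch's t-test on simulated data; toy_pay and
# its values are hypothetical, and the 1/2 coding is assumed to follow the dataset.
set.seed(2)
toy_pay <- data.frame(
  sex = factor(rep(c(1, 2), times = c(60, 20))),
  salary = rnorm(80, mean = 95000, sd = 18000)
)
welch <- t.test(salary ~ sex, data = toy_pay)   # var.equal = FALSE by default (Welch)
welch$estimate   # mean salary in each group
welch$p.value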
#Impact of Sex on placement:
##Null Hypothesis - Gender and placement are independent.
##Chisq.test
(Sex_Placed <- xtabs(~sex+placement, data=mba))
## placement
## sex 0 1
## 1 67 139
## 2 23 45
chisq.test(Sex_Placed)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: Sex_Placed
## X-squared = 0.0023919, df = 1, p-value = 0.961
str(mba$sex)
## num [1:274] 2 1 1 1 2 1 1 2 1 1 ...
#Insight: as p = 0.96 > 0.05, we fail to reject the null hypothesis; there is no evidence that gender affects the chance of being placed.
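# The chi-squared test above can be reproduced from the printed counts alone. This sketch
# rebuilds the 2x2 table by hand (sex_tab and chi_res are illustrative names) and also shows
# the expected counts under independence, which the statistic compares the observed counts against.
sex_tab <- matrix(c(67, 23, 139, 45), nrow = 2,
                  dimnames = list(sex = c("1", "2"), placement = c("0", "1")))
chi_res <- chisq.test(sex_tab)   # Yates' continuity correction is applied to 2x2 tables by default
chi_res$expected
chi_res$p.value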
#Impact of first language i.e. English on salary:
(frst_language <- placed %>% group_by(frstlang) %>% summarise(AVG.SALARY=mean(salary), count=n()))
## # A tibble: 2 x 3
## frstlang AVG.SALARY count
## <dbl> <dbl> <int>
## 1 1 95049. 160
## 2 2 95388. 24
ggplot(placed, aes(x=reorder(frstlang, salary, FUN=median), y=salary, fill=factor(frstlang))) + geom_boxplot() + xlab("frstlang") + labs(fill="frstlang") + ggtitle("Language based comparison", subtitle = "1 stands for English, 2 for other language")

cor.test(placed$frstlang, placed$salary )
##
## Pearson's product-moment correlation
##
## data: placed$frstlang and placed$salary
## t = 0.09587, df = 182, p-value = 0.9237
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1376964 0.1516113
## sample estimates:
## cor
## 0.007106149
#Since the p-value (0.92) is greater than 0.05, there is no evidence that salary depends on first language.
#Impact of first language on placement:
(frstlangJOB <- xtabs(~frstlang + placement, data=mba))
## placement
## frstlang 0 1
## 1 82 160
## 2 8 24
prop.table(frstlangJOB) * 100
## placement
## frstlang 0 1
## 1 29.927007 58.394161
## 2 2.919708 8.759124
#By count, most placed students are native English speakers, but they also make up most of the class: 160 of 242 (66%) English speakers were placed, against 24 of 32 (75%) of the others. Let's test whether this difference is significant or could be due to chance.
#chisq.test() tests the null hypothesis that placement is independent of the candidate's first language.
chisq.test(frstlangJOB)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: frstlangJOB
## X-squared = 0.64868, df = 1, p-value = 0.4206
#As p = 0.42 > 0.05, we fail to reject the null hypothesis: there is no evidence that getting a job depends on first language.
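# prop.table() without a margin gives each cell's share of the whole sample. Row-wise
# proportions (margin = 1) give the placement rate within each language group, which is the
# quantity the comparison is about. The table is rebuilt from the printed counts;
# lang_tab is an illustrative name.
lang_tab <- matrix(c(82, 8, 160, 24), nrow = 2,
                   dimnames = list(frstlang = c("1", "2"), placement = c("0", "1")))
round(prop.table(lang_tab, margin = 1) * 100, 1)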
#Did the MBA graduates like the programme?
#placed
ggplot(placed, aes(satis)) + geom_bar(col="deeppink", fill="light blue") + theme(aspect.ratio = 0.5) + ggtitle("Frequency based on level of satisfaction of placed students", subtitle = "satis>5 are satisfied") + xlab("Level of Satisfaction")

#let's see the percentage of people satisfied/unsatisfied by analysing both the placed and unplaced students.
placed %>% count(satis)
## # A tibble: 4 x 2
## satis n
## <dbl> <int>
## 1 4 13
## 2 5 38
## 3 6 110
## 4 7 23
prop.table(xtabs(~satis, data=placed)) * 100
## satis
## 4 5 6 7
## 7.065217 20.652174 59.782609 12.500000
(percent_satisfied_placed <- length(placed[placed$satis %in% c(6,7),"satis"])/dim(placed)[1]*100)
## [1] 72.28261
#Unplaced
mba[mba$placement==0,] %>% count(satis)
## # A tibble: 4 x 2
## satis n
## <dbl> <int>
## 1 4 4
## 2 5 36
## 3 6 40
## 4 7 10
#Interestingly, a large number of unplaced students are satisfied with the programme.
ggplot(mba[mba$placement==0,], aes(satis)) + geom_bar(col="deeppink", fill="light yellow") + theme(aspect.ratio = 0.5) + ggtitle("Frequency based on level of satisfaction of unplaced students", subtitle = "satis>5 are satisfied") + xlab("Level of Satisfaction")

(percent_satisfied_unplaced <- length(mba[mba$placement==0 & mba$satis %in% c(6,7), "satis"])/dim(mba[mba$placement==0,])[1]*100)
## [1] 55.55556
#55.56 % of unplaced students are satisfied.
#In_general
(percent_satisfied_overall <- length(mba[mba$satis %in% c(6,7),"satis"])/dim(mba)[1] * 100)
## [1] 66.78832
#2/3rd of students are satisfied in general.
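# The share of satisfied students can also be computed as the mean of a logical vector, which
# avoids the nested indexing. The sketch rebuilds the satisfaction scores from the counts
# printed above; satis_placed and satis_unplaced are illustrative names.
satis_placed   <- rep(4:7, times = c(13, 38, 110, 23))
satis_unplaced <- rep(4:7, times = c(4, 36, 40, 10))
mean(satis_placed >= 6) * 100     # placed students with satis of 6 or 7
mean(satis_unplaced >= 6) * 100   # unplaced students with satis of 6 or 7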
#Significant insights
library("corrgram")
##
## Attaching package: 'corrgram'
## The following object is masked from 'package:lattice':
##
## panel.fill
corrgram(mba, lower.panel = panel.shade, upper.panel=panel.pie, text.panel=panel.txt, order=T, main="Corrgram of MBA Data Variables")

#Effect of Gmat_score on placement and salary
cor(mba[,c(3,4,5,6)], mba[,c(12,14)])
## salary placement
## gmat_tot 0.06063265 0.06373658
## gmat_qpc 0.06281164 0.08158152
## gmat_vpc 0.01492840 0.04488052
## gmat_tpc 0.07756364 0.09891251
library("car")
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
scatterplotMatrix(formula=~placement+salary+gmat_tot+gmat_qpc+gmat_tpc+gmat_vpc, cex=0.6, data=mba, main="ScatterPlots of various Gmat_variables with salary and placement")
## Warning in smoother(x[subs], y[subs], col = smoother.args$col[i], log.x =
## FALSE, : could not fit positive part of the spread

cor.test(mba$gmat_tot, mba$salary)
##
## Pearson's product-moment correlation
##
## data: mba$gmat_tot and mba$salary
## t = 1.0018, df = 272, p-value = 0.3173
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.05828608 0.17785471
## sample estimates:
## cor
## 0.06063265
cor.test(mba$gmat_qpc, mba$salary)
##
## Pearson's product-moment correlation
##
## data: mba$gmat_qpc and mba$salary
## t = 1.038, df = 272, p-value = 0.3002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.05610591 0.17997202
## sample estimates:
## cor
## 0.06281164
cor.test(mba$gmat_tpc, mba$salary)
##
## Pearson's product-moment correlation
##
## data: mba$gmat_tpc and mba$salary
## t = 1.2831, df = 272, p-value = 0.2006
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.04131605 0.19427792
## sample estimates:
## cor
## 0.07756364
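# The three tests above repeat the same call, so they can be run in one pass. A sketch on
# simulated data: toy_gmat and its values are made up, and only the column names follow the dataset.
set.seed(3)
toy_gmat <- data.frame(
  gmat_tot = round(rnorm(50, 620, 50)),
  gmat_qpc = sample(30:99, 50, replace = TRUE),
  gmat_tpc = sample(60:99, 50, replace = TRUE),
  salary   = rnorm(50, 95000, 15000)
)
gmat_cols <- c("gmat_tot", "gmat_qpc", "gmat_tpc")
# One row per GMAT variable: correlation with salary and its p-value
gmat_res <- t(sapply(gmat_cols, function(v) {
  ct <- cor.test(toy_gmat[[v]], toy_gmat$salary)
  c(r = unname(ct$estimate), p = ct$p.value)
}))
gmat_res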
#Conclusion: the analysis is done. Let's summarise the findings of the case study.
#Salary is only weakly correlated with the GMAT variables (r between 0.01 and 0.08), and none of the three tests is significant (p between 0.20 and 0.32). The scatterplot matrix shows similarly weak relationships for gmat_qpc, gmat_tpc, gmat_vpc and gmat_tot. Among them, the overall percentile (gmat_tpc) has the largest correlation with salary, but it is still not significant. So Daer's relatively low GMAT score should not be much of a problem, although a decent overall percentile may help slightly.
#72.28 % of employed students and 66.79 % of all students are satisfied with the programme.
#55.56 % of unemployed students are satisfied with the programme. What could be the reasons?
#*Either the data was collected inconsistently, or its distribution is highly non-uniform.
#*Data-entry error? Unlikely: such a high rate of human error is not plausible.
#*If placement is the main driver of satisfaction, it is hard to believe that more than half of the unplaced students are satisfied. This suggests the data may have been collected from inconsistent sources, so we should inform our manager that it is not reliable enough to build robust models of satisfaction.
##Thank you very much for your kind attention to my work. I will come up with more interesting work soon. - Puneet