The data is in an Excel sheet. Hence, we will need the ‘readxl’ package.
# install.packages('readxl')
library(readxl)
df <- read_excel('./Maven business school(final).xlsx')
df <- as.data.frame(df)
head(df)
##   Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience
## 1          1         Business            68.4      90.2              No
## 2          2         Business            62.1      92.8              No
## 3          3 Computer Science            70.2      68.7             Yes
## 4          4      Engineering            75.1      80.7              No
## 5          5          Finance            60.9      74.9              No
## 6          6 Computer Science            74.5      80.7              No
##   Employability (Before) Employability (After)     Status Annual Salary
## 1                    252                   276     Placed        111000
## 2                    423                   410 Not Placed            NA
## 3                    101                   119     Placed        107000
## 4                    288                   334 Not Placed            NA
## 5                    248                   252 Not Placed            NA
## 6                    145                   209 Not Placed            NA
The population consists of students who have graduated from the Maven business school. Several variables have been collected for further analysis:
Undergrad Degree: Nominal categorical variable giving the bachelor’s degree the student completed.
Undergrad Grade: Numeric variable between 0 and 100 which is the final grade average from the undergraduate program.
MBA Grade: Numeric variable between 0 and 100 which is the final grade average from the MBA program.
Work Experience: Yes/No, signifies whether the student had prior work experience.
Employability (before): Score (0-500) from a third-party test taken before the Master’s, which assesses the student’s appeal to employers in selected industries.
Employability (after): Score (0-500) taken after the Master’s degree.
Status: Indicator of employment status (Placed, Not Placed).
Annual Salary: Student’s annual salary in USD.
Source: Kaggle.com (2025)
First, let us change the column names:
colnames(df) <- c('id', 'degree', 'bsc_grade', 'mba_grade', 'experience',
'score_before', 'score_after', 'status', 'salary')
Next, some columns will be dropped in order to focus the analysis on the employability score.
df <- df[, c(1:2,4,6,7)]
head(df, 3)
##   id           degree mba_grade score_before score_after
## 1  1         Business      90.2          252         276
## 2  2         Business      92.8          423         410
## 3  3 Computer Science      68.7          101         119
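Selecting columns by position depends on the column order of the spreadsheet. A minimal sketch of the same selection by name, using a small made-up data frame (`toy`, `keep`, `toy_small` and all values are illustrative, not taken from the data set):

```r
# Toy data frame with the renamed columns; values are made up for illustration.
toy <- data.frame(
  id = 1:3,
  degree = c('Business', 'Art', 'Finance'),
  bsc_grade = c(68.4, 70.0, 60.9),
  mba_grade = c(90.2, 55.0, 74.9),
  experience = c('No', 'Yes', 'No'),
  score_before = c(252, 200, 248),
  score_after = c(276, 310, 252),
  status = c('Placed', 'Not Placed', 'Not Placed'),
  salary = c(111000, NA, NA)
)

# Keep the same five columns as toy[, c(1:2, 4, 6, 7)], but by name.
keep <- c('id', 'degree', 'mba_grade', 'score_before', 'score_after')
toy_small <- toy[, keep]
names(toy_small)
```

Selecting by name keeps working if columns are later added or reordered, and the code documents which variables are retained.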
Since the homework specifies that a factor should have a maximum of 4 levels, we have to get rid of one degree category. Let’s first see which categories there are and then drop one.
unique(df$degree)
## [1] "Business" "Computer Science" "Engineering" "Finance"
## [5] "Art"
df <- df[df$degree != 'Engineering', ]
Furthermore, we need to make sure that the degree is a factor and not a character vector.
df$degree <- as.factor(df$degree)
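The order of the two steps matters for the number of levels. A small sketch on a made-up character vector (`deg`, `f1` and `f2` are illustrative names): converting after the filter leaves only the remaining categories as levels, whereas filtering an existing factor keeps the empty level until `droplevels()` is called.

```r
# Made-up degree labels for illustration.
deg <- c('Business', 'Engineering', 'Art', 'Finance', 'Computer Science', 'Art')

# Filter first, then convert: only the remaining categories become levels.
f1 <- as.factor(deg[deg != 'Engineering'])
nlevels(f1)

# Convert first, then filter: 'Engineering' survives as an empty level.
f2 <- as.factor(deg)
f2 <- f2[f2 != 'Engineering']
nlevels(f2)

# droplevels() removes levels that no longer occur in the data.
nlevels(droplevels(f2))
```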
Now the data is more readable because there are fewer variables. We can also create a new variable ‘score_change’, which indicates how the employability score changed over the course of the Master’s degree.
df$score_change <- df$score_after - df$score_before
Lastly, let’s check whether the data contains missing values.
any(is.na(df))
## [1] TRUE
Since the output is TRUE, there are NAs in the data. We can simply delete the rows that contain missing values.
df <- na.omit(df)
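Before dropping rows it can help to see which columns hold the missing values and how many rows would be lost. A minimal sketch on a made-up data frame (`toy` and its values are illustrative):

```r
# Made-up data with a few missing values.
toy <- data.frame(
  mba_grade = c(90.2, NA, 68.7, 80.7),
  score_before = c(252, 423, NA, 288),
  score_after = c(276, 410, 119, 334)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
na_per_column

# Rows that na.omit() would remove (at least one NA in the row).
rows_lost <- sum(!complete.cases(toy))
rows_lost

# Drop the incomplete rows and check how many remain.
toy_clean <- na.omit(toy)
nrow(toy_clean)
```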
First, we can look at the variables using stat.desc() from the ‘pastecs’ package.
#install.packages('pastecs')
library(pastecs)
round(stat.desc(df[,c(-1, -2)]), 2)
## mba_grade score_before score_after score_change
## nbr.val 952.00 952.00 952.00 952.00
## nbr.null 29.00 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00
## min 0.00 62.00 62.00 -158.79
## max 96.10 423.00 694.93 443.14
## range 96.10 361.00 632.93 601.93
## sum 50067.89 206198.69 273875.82 67677.14
## median 52.87 215.41 260.97 53.51
## mean 52.59 216.60 287.68 71.09
## SE.mean 0.76 1.19 4.03 3.84
## CI.mean.0.95 1.50 2.33 7.91 7.53
## var 553.40 1345.09 15466.39 14022.27
## std.dev 23.52 36.68 124.36 118.42
## coef.var 0.45 0.17 0.43 1.67
The ‘nbr.null’ row tells us that 29 observations have a final MBA grade of 0, which may signify that these students failed the program.
The ‘range’ row reveals a problem with the data. Since the range is defined as max minus min, it cannot be larger than the width of the variable’s scale. The employability score is defined as lying between 0 and 500, yet the range after business school is 632.93 and the maximum is 694.93.
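The number of affected observations can be counted directly. A minimal sketch, assuming the documented 0 to 500 scale and using made-up scores (`scores` and `out_of_range` are illustrative names):

```r
# Made-up scores; two of them exceed the documented maximum of 500.
scores <- c(276, 410, 119, 694.93, 512.4, 62)

# Logical flag for values outside the documented 0-500 scale.
out_of_range <- scores < 0 | scores > 500

sum(out_of_range)     # how many observations violate the scale
mean(out_of_range)    # share of observations that violate the scale
scores[out_of_range]  # the offending values
```

On the real data the same expression would be applied to `df$score_after`.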
The median employability score, i.e. the value below which 50% of the observations fall, is 215.41 before the Master’s program and 260.97 after it. Moreover, since both columns are measured on the same scale, we can compare their variances directly: the variance of the scores after business school is more than 10 times higher than before it.
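The comparison of the two variances can be checked with a short calculation, using the values from the ‘var’ row of the stat.desc() output as printed:

```r
# Variances copied from the stat.desc() output.
var_before <- 1345.09
var_after <- 15466.39

# Ratio of variances: how many times larger the variance is after the program.
var_ratio <- var_after / var_before
round(var_ratio, 1)

# The same comparison on the scale of the scores (standard deviations).
sd_ratio <- sqrt(var_ratio)
round(sd_ratio, 1)
```

In standard-deviation terms the spread is roughly 3.4 times larger, which can be easier to interpret because it is expressed in the same units as the score.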
We can also look at the score change by bachelor’s degree.
library(psych)
describeBy(df$score_change, df$degree)
##
## Descriptive statistics by group
## group: Art
## vars n mean sd median trimmed mad min max range skew
## X1 1 227 76.27 122.52 55 68.9 105.45 -158.79 438.06 596.85 0.62
## kurtosis se
## X1 0.09 8.13
## ------------------------------------------------------------
## group: Business
## vars n mean sd median trimmed mad min max range skew
## X1 1 256 69.42 115.13 59.64 62 88.44 -145.34 440.6 585.94 0.74
## kurtosis se
## X1 0.7 7.2
## ------------------------------------------------------------
## group: Computer Science
## vars n mean sd median trimmed mad min max range skew
## X1 1 239 71.74 119.09 55.42 63.75 96.47 -144.15 443.14 587.29 0.73
## kurtosis se
## X1 0.45 7.7
## ------------------------------------------------------------
## group: Finance
## vars n mean sd median trimmed mad min max range skew
## X1 1 230 67.15 117.77 45.78 58.32 92.13 -149.62 435.51 585.13 0.79
## kurtosis se
## X1 0.56 7.77
The mean change in score is positive for all degrees; however, students who previously studied finance appear to benefit the least from the graduate program on average.
Furthermore, the skewness coefficient is positive for all degrees, so the distribution of the score change is positively skewed in every group, i.e. it has a longer right tail.
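For reference, a skewness coefficient can be computed by hand as the third standardized moment. The sketch below uses made-up values and the simple moment estimator; describeBy() may apply a slightly different variant, so its numbers need not match this formula exactly.

```r
# Made-up right-skewed sample: most values are small, one is very large.
x <- c(10, 20, 30, 40, 50, 60, 400)

# Third standardized moment (population form, dividing by n).
m <- mean(x)
skewness <- mean((x - m)^3) / mean((x - m)^2)^1.5
skewness

# In a right-skewed sample the mean is pulled above the median.
mean(x) > median(x)
```

The same pattern is visible in the grouped output, where the mean exceeds the median for every degree.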
We can start with histograms of the scores before and after the program to see how they changed. In order to ‘stack’ the histograms on top of each other we need to have one column with values. Hence, a data frame purely for the sake of the visualization will be made.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:pastecs':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
temp1 <- data.frame(group = rep('Before', length(df$score_before)),
score = df$score_before)
temp2 <- data.frame(group = rep('After', length(df$score_after)),
score = df$score_after)
data <- rbind(temp1, temp2)
rm(temp1, temp2)
data %>% ggplot(aes(x = score)) +
geom_histogram(fill = 'steelblue') +
facet_wrap(~group, nrow = 2) +
theme_dark()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can clearly see that the scores before the graduate program are concentrated around 200, with only a couple falling around 400. After the program, however, the variation is much higher, which results in the ‘flatter’ shape of the distribution.
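The long-format data frame used for the histograms can also be produced in one step with base R’s stack(). A minimal sketch on a made-up data frame (`toy` and `long` are illustrative names):

```r
# Made-up scores in the same wide layout as df.
toy <- data.frame(
  score_before = c(252, 423, 101),
  score_after = c(276, 410, 119)
)

# stack() puts all values in one column ('values') and the name of the
# source column in a factor ('ind').
long <- stack(toy[, c('score_before', 'score_after')])
names(long) <- c('score', 'group')

# Relabel the groups; listing 'Before' first fixes the order of the levels.
long$group <- factor(long$group,
                     levels = c('score_before', 'score_after'),
                     labels = c('Before', 'After'))
long
```

With explicit factor levels, facet_wrap(~group) would place the ‘Before’ panel ahead of ‘After’ rather than ordering the panels alphabetically.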
Visualizing the score after the program in boxplots lets us see how many outliers we are dealing with.
df %>% ggplot(aes(x = degree, y = score_after)) +
geom_boxplot(fill = c('red', 'steelblue', 'forestgreen', 'darkorchid4')) +
theme_dark()
The boxplots show that a large share of the observations lie above the maximum score of 500.
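The points drawn individually in a boxplot are, by default, those lying more than 1.5 times the interquartile range beyond the box. A minimal sketch of counting them with base R’s boxplot.stats(), on made-up scores (`scores` and `bs` are illustrative names):

```r
# Made-up scores with one extreme value.
scores <- c(210, 220, 230, 240, 250, 260, 270, 280, 690)

# boxplot.stats() returns the whisker and hinge positions ($stats) and the
# values that would be drawn as individual outlier points ($out).
bs <- boxplot.stats(scores)
bs$out
length(bs$out)

# Separately from the boxplot rule: values above the documented maximum.
sum(scores > 500)
```

The two counts answer different questions: the first is relative to the spread of the data, the second is relative to the documented 0 to 500 scale.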
Lastly, in order to try to explain the variation in the score change we can use a scatterplot with the mba_grade on the x-axis.
#install.packages('car')
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:psych':
##
## logit
scatterplot(score_change ~ mba_grade,
smooth = FALSE,
data = df)
From the fitted line we can see that the overall grade a student receives in the Master’s program does not appear to be related to the change in score. Adding the degree as a grouping variable to the scatterplot does not help with the interpretation, as the change still appears random.
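The visual impression can be backed by a number. A minimal sketch using Pearson’s correlation and a simple linear fit on simulated data in which the two variables are independent by construction (`toy`, the seed and the distribution parameters are all made up, so the output says nothing about the real data):

```r
# Simulated data: grade and score change are generated independently.
set.seed(1)
toy <- data.frame(
  mba_grade = runif(200, min = 0, max = 100),
  score_change = rnorm(200, mean = 70, sd = 120)
)

# Pearson correlation with a test of H0: the correlation equals zero.
ct <- cor.test(toy$score_change, toy$mba_grade)
ct$estimate
ct$p.value

# Slope of a least-squares line of score_change on mba_grade.
fit <- lm(score_change ~ mba_grade, data = toy)
coef(fit)['mba_grade']
```

On the real data the same calls would use `df` instead of `toy`. A correlation close to zero would support the reading of the plot, but it only measures linear association and does not establish the absence of any effect.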