ADA Homework 1

1. Import Data into RStudio.

The data is in an Excel sheet. Hence, we will need the ‘readxl’ package.

# install.packages('readxl')
library(readxl)

df <- read_excel('./Maven business school(final).xlsx')
df <- as.data.frame(df)

2. Display Data using head function

head(df)

##   Student ID Undergrad Degree Undergrad Grade MBA Grade Work Experience
## 1          1         Business            68.4      90.2              No
## 2          2         Business            62.1      92.8              No
## 3          3 Computer Science            70.2      68.7             Yes
## 4          4      Engineering            75.1      80.7              No
## 5          5          Finance            60.9      74.9              No
## 6          6 Computer Science            74.5      80.7              No
##   Employability (Before) Employability (After)     Status Annual Salary
## 1                    252                   276     Placed        111000
## 2                    423                   410 Not Placed            NA
## 3                    101                   119     Placed        107000
## 4                    288                   334 Not Placed            NA
## 5                    248                   252 Not Placed            NA
## 6                    145                   209 Not Placed            NA

3. Explain your data

The population consists of students who have graduated from the Maven business school. Several variables were collected for further analysis:

Undergrad Degree: Categorical nominal variable giving the bachelor’s degree the student completed.

Undergrad Grade: Numeric variable between 0 and 100 which is the final grade average from the undergraduate program.

MBA Grade: Numeric variable between 0 and 100 which is the final grade average from the MBA program.

Work Experience: Yes/No, signifies whether the student had prior work experience.

Employability (before): Score (0-500) from a third-party test taken before the Master’s degree, which assesses the student’s appeal to employers in selected industries.

Employability (after): Score (0-500) taken after the Masters degree.

Status: Indicator of employment status (Placed, Not Placed).

Annual Salary: Student’s annual salary in USD.

4. Name the source of the data

Source: Kaggle.com (2025)

5. Carry out data manipulation

First, let us change the column names:

colnames(df) <- c('id', 'degree', 'bsc_grade', 'mba_grade', 'experience',
                  'score_before', 'score_after', 'status', 'salary')

Next, some columns will be dropped in order to focus the analysis on the employability score.

df <- df[, c(1:2,4,6,7)]

head(df, 3)

##   id           degree mba_grade score_before score_after
## 1  1         Business      90.2          252         276
## 2  2         Business      92.8          423         410
## 3  3 Computer Science      68.7          101         119

Since the homework specifies that a factor should have at most 4 levels, and the degree variable may have more, we have to get rid of one. Let’s first see which levels there are and then drop one.

unique(df$degree)

## [1] "Business"         "Computer Science" "Engineering"      "Finance"         
## [5] "Art"

df <- df[df$degree != 'Engineering', ]

Furthermore, we need to make sure that the degree is a factor and not a character vector.

df$degree <- as.factor(df$degree)
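
The order of the two steps above matters: a factor remembers its levels even after the corresponding rows are removed. The following is a minimal sketch on a made-up vector (not the real data) that shows the difference, and how droplevels() would clean up if the conversion had happened first.

```r
# Toy vector standing in for df$degree (values are illustrative only).
degree <- c("Business", "Engineering", "Art", "Finance", "Business")

# Converting first and filtering afterwards keeps the empty level around.
f_first <- as.factor(degree)
f_first <- f_first[f_first != "Engineering"]
levels(f_first)              # still lists "Engineering"
levels(droplevels(f_first))  # the unused level is removed

# Filtering first and converting afterwards avoids the problem.
f_after <- as.factor(degree[degree != "Engineering"])
levels(f_after)
```

An unused level would otherwise show up as an empty group in tables and plots.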

Now the data is more readable because it has fewer variables. We can also create a new variable ‘score_change’, which indicates how the employability score changed over the course of the Master’s degree.

df$score_change <- df$score_after - df$score_before

Lastly, let’s check whether the data contains missing values.

any(is.na(df))

## [1] TRUE

Since the output is TRUE, there are NAs in the data. We can simply delete the rows that contain missing values.

df <- na.omit(df)
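
Before deleting rows it can be useful to know where the missing values are and how many rows will be lost. A small sketch on made-up data (the column names mirror ours, the values do not):

```r
# Toy data with a few missing values (illustrative only).
toy <- data.frame(
  id          = 1:5,
  mba_grade   = c(90.2, NA, 68.7, 80.7, 74.9),
  score_after = c(276, 410, NA, 334, NA)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
na_per_column

# Number of rows that na.omit() would drop (rows with at least one NA).
rows_dropped <- sum(!complete.cases(toy))
rows_dropped

cleaned <- na.omit(toy)
nrow(cleaned)
```

If most of the NAs sit in a single column that is not needed for the analysis, dropping that column instead of the rows would keep more observations.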

6. Present descriptive statistics.

First, we can look at the variables using stat.desc()

#install.packages('pastecs')
library(pastecs)

round(stat.desc(df[,c(-1, -2)]), 2)

##              mba_grade score_before score_after score_change
## nbr.val         952.00       952.00      952.00       952.00
## nbr.null         29.00         0.00        0.00         0.00
## nbr.na            0.00         0.00        0.00         0.00
## min               0.00        62.00       62.00      -158.79
## max              96.10       423.00      694.93       443.14
## range            96.10       361.00      632.93       601.93
## sum           50067.89    206198.69   273875.82     67677.14
## median           52.87       215.41      260.97        53.51
## mean             52.59       216.60      287.68        71.09
## SE.mean           0.76         1.19        4.03         3.84
## CI.mean.0.95      1.50         2.33        7.91         7.53
## var             553.40      1345.09    15466.39     14022.27
## std.dev          23.52        36.68      124.36       118.42
## coef.var          0.45         0.17        0.43         1.67

The ‘nbr.null’ row tells us that 29 observations have an MBA grade of 0, which may indicate that these students failed the program.

The ‘max’ and ‘range’ rows reveal a problem with the data. Since the range is defined as max minus min, it cannot be larger than the width of the variable’s scale. The employability score is defined as lying between 0 and 500, yet score_after has a range of 632.93 and a maximum of 694.93.
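
To judge how serious this is, we could count the observations that lie above the documented maximum. The sketch below uses made-up values and a hypothetical constant scale_max; with the real data, toy would be replaced by df.

```r
# Toy data standing in for df (values are made up for illustration).
toy <- data.frame(
  degree      = c("Art", "Art", "Business", "Finance", "Finance"),
  score_after = c(276, 694.93, 410, 512.4, 119)
)

scale_max <- 500  # upper bound of the employability score as described above

# Flag observations above the documented maximum.
toy$above_max <- toy$score_after > scale_max

n_above     <- sum(toy$above_max)                # number of rows above the scale
share_above <- mean(toy$above_max)               # share of all rows
by_degree   <- table(toy$degree, toy$above_max)  # breakdown per degree

n_above
share_above
by_degree
```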

The median employability score, i.e. the value below which 50% of the observations lie, is 215.41 before the master’s program and 260.97 after it. Moreover, since both columns are measured on the same scale, their variances can be compared directly: the variance of the scores after business school (15466.39) is more than 10 times the variance before it (1345.09).

We can also look at the score change based on the bachelor’s degree.
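
The comparison of the spread can be checked with a short calculation. The numbers are copied from the stat.desc() output above; only the variable names are new.

```r
# Variances and standard deviations copied from the stat.desc() output.
var_before <- 1345.09
var_after  <- 15466.39
sd_before  <- 36.68
sd_after   <- 124.36

var_ratio <- var_after / var_before  # ratio of the variances
sd_ratio  <- sd_after / sd_before    # ratio of the standard deviations

round(var_ratio, 1)
round(sd_ratio, 1)
```

The variance ratio is about 11.5, while the standard deviation ratio is about 3.4. Since the standard deviation is in the same units as the score, a typical deviation from the mean is roughly 3.4 times larger after the program.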

library(psych)

describeBy(df$score_change, df$degree)

## 
##  Descriptive statistics by group 
## group: Art
##    vars   n  mean     sd median trimmed    mad     min    max  range skew
## X1    1 227 76.27 122.52     55    68.9 105.45 -158.79 438.06 596.85 0.62
##    kurtosis   se
## X1     0.09 8.13
## ------------------------------------------------------------ 
## group: Business
##    vars   n  mean     sd median trimmed   mad     min   max  range skew
## X1    1 256 69.42 115.13  59.64      62 88.44 -145.34 440.6 585.94 0.74
##    kurtosis  se
## X1      0.7 7.2
## ------------------------------------------------------------ 
## group: Computer Science
##    vars   n  mean     sd median trimmed   mad     min    max  range skew
## X1    1 239 71.74 119.09  55.42   63.75 96.47 -144.15 443.14 587.29 0.73
##    kurtosis  se
## X1     0.45 7.7
## ------------------------------------------------------------ 
## group: Finance
##    vars   n  mean     sd median trimmed   mad     min    max  range skew
## X1    1 230 67.15 117.77  45.78   58.32 92.13 -149.62 435.51 585.13 0.79
##    kurtosis   se
## X1     0.56 7.77

The mean change in score is positive for all degrees; however, students who previously studied finance appear to benefit the least, on average, from attending the graduate program (mean change of 67.15).

Furthermore, the skewness coefficient is positive for all degrees, so the score change is positively skewed in every group. This means that the distribution has a longer right tail.
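
To make the skewness coefficient less of a black box, here is a minimal base R sketch of the plain moment-based version on a made-up vector. The helper name skewness_g1 is hypothetical, and psych may apply a small-sample adjustment, so its values can differ slightly from this formula.

```r
# Made-up, right-skewed sample (illustrative only).
x <- c(1, 2, 2, 3, 3, 3, 4, 10, 25)

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2.
skewness_g1 <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skewness_g1(x)       # positive for a longer right tail
mean(x) > median(x)  # the mean is pulled towards the long tail
```

The same pattern appears in our output: in every degree group the mean score change is larger than the median.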

7. Graph the distributions of the variables

We can start with histograms of the scores before and after the program to see how they changed. In order to ‘stack’ the histograms on top of each other we need to have one column with values. Hence, a data frame purely for the sake of the visualization will be made.

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:pastecs':
## 
##     first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

temp1 <- data.frame(group = rep('Before', length(df$score_before)),
                    score = df$score_before)
temp2 <- data.frame(group = rep('After', length(df$score_after)),
                    score = df$score_after)
data <- rbind(temp1, temp2)
rm(temp1, temp2)

data %>% ggplot(aes(x = score)) +
  geom_histogram(fill = 'steelblue') +
  facet_wrap(~group, nrow = 2) +
  theme_dark()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can clearly see that the scores before the graduate program are concentrated around 200, with only a few falling around 400. After the program, however, the variation becomes a lot higher, which results in the ‘flatter’ shape of the distribution.

Visualizing the score after the program in boxplots lets us further see how many outliers we are dealing with.
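
As an aside, the long data frame used for the histograms can also be built without temporary objects. The sketch below uses base R’s stack() on made-up values; it is an alternative to the rbind() approach, not the code that produced the plot.

```r
# Toy data standing in for the two score columns (illustrative values).
toy <- data.frame(score_before = c(252, 423, 101),
                  score_after  = c(276, 410, 119))

# stack() returns one row per value, with columns 'values' and 'ind'.
long <- stack(toy[, c("score_before", "score_after")])
names(long) <- c("score", "group")

# Relabel the groups and fix their order so 'Before' comes first.
long$group <- factor(long$group,
                     levels = c("score_before", "score_after"),
                     labels = c("Before", "After"))
long
```

Storing the group as a factor with an explicit level order also controls the order of the facets; with a plain character column the panels are sorted alphabetically, so ‘After’ is drawn above ‘Before’.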

df %>% ggplot(aes(x = degree, y = score_after)) +
  geom_boxplot(fill = c('red', 'steelblue', 'forestgreen', 'darkorchid4')) +
  theme_dark()

Based on that, we see that a substantial share of the observations lies above the maximum score of 500.
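
Note that ‘outlier’ in a boxplot has a specific meaning: a point further than 1.5 times the box height from the box, which is not the same as a value above the scale maximum. A small sketch with made-up scores shows how boxplot.stats() from base R lists those points, so that both counts can be compared.

```r
# Made-up scores with one extreme value (illustrative only).
x <- c(200, 210, 220, 230, 240, 250, 260, 270, 280, 690)

# boxplot.stats() applies the same 1.5 * IQR rule as the boxplot whiskers.
stats <- boxplot.stats(x)
stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # points drawn individually as outliers

# Values above the documented maximum of 500, for comparison.
n_above_500 <- sum(x > 500)
n_above_500
```

In this toy example both rules flag the same single value, but in general they can disagree: a score of 480 may be a boxplot outlier while still being a valid score.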

Lastly, in order to try to explain the variation in the score change we can use a scatterplot with the mba_grade on the x-axis.

#install.packages('car')
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:psych':
## 
##     logit

scatterplot(score_change ~ mba_grade,
            smooth = FALSE,
            data = df)

From the nearly flat fitted line we can see that the overall grade one receives during the master’s program does not appear to be related to the change in score. Adding the degree as a variable to the scatterplot does not help with the interpretation, as the change is seemingly random.
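
The visual impression could be backed up with numbers from a simple linear regression. The sketch below runs on simulated data in which the two variables are unrelated by construction, so it only illustrates the calls; the results for our data would have to be computed on df.

```r
# Simulated data: score_change is generated independently of mba_grade.
set.seed(1)
toy <- data.frame(mba_grade = runif(200, min = 0, max = 100))
toy$score_change <- rnorm(200, mean = 70, sd = 120)

# Slope of the fitted line and share of variance explained.
fit       <- lm(score_change ~ mba_grade, data = toy)
slope     <- coef(fit)[["mba_grade"]]
r_squared <- summary(fit)$r.squared

# Test of the null hypothesis that the correlation is zero.
p_value <- cor.test(toy$mba_grade, toy$score_change)$p.value

slope
r_squared
p_value
```

A slope close to zero, a small R-squared and a large p-value would all support the reading of the scatterplot, although none of them rules out a non-linear relationship.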