Anna Kostiukovych

Homework Assignment 1

First, the dataset has to be imported into RStudio. As it is a CSV file, the read.csv function can be used:

Adata <- read.csv("Admission.csv")

Then, it is possible to check what the data looks like:

head(Adata)
##    GPA GMAT Decision
## 1 2.96  596    admit
## 2 3.14  473    admit
## 3 3.22  482    admit
## 4 3.29  527    admit
## 5 3.69  505    admit
## 6 3.46  693    admit
str(Adata)
## 'data.frame':    85 obs. of  3 variables:
##  $ GPA     : num  2.96 3.14 3.22 3.29 3.69 3.46 3.03 3.19 3.63 3.59 ...
##  $ GMAT    : int  596 473 482 527 505 693 626 663 447 588 ...
##  $ Decision: chr  "admit" "admit" "admit" "admit" ...

The dataset contains the GPA and GMAT scores of 85 applicants to a university program, together with the admission decision for each applicant. Each row represents an individual applicant, which is the unit of observation, so the sample size is 85.

Variables Description

  1. GPA (Grade Point Average)
    • Type: Numerical Interval (differences are meaningful, but no true zero, as a GPA of 0 does not indicate an absolute lack of academic performance)
    • Scale: 0 to 4.0
    • Definition: The applicant’s Grade Point Average on a 4.0 scale.
  2. GMAT (Graduate Management Admission Test Score)
    • Type: Numerical Interval (impossible to say that one score is two times better than another)
    • Scale: 200 to 800
    • Definition: The standardized GMAT score of the applicant.
  3. Decision (Admission Decision)
    • Type: Categorical (Nominal)
    • Categories: "admit", "notadmit", "border"
    • Definition: Indicates whether the applicant was admitted, rejected or waitlisted.

The dataset used in this analysis was sourced from Kaggle, an online platform for dataset sharing.

Now, rows with missing values are removed:

Adata <- na.omit(Adata) 
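
Before dropping rows, it can help to see how many values are actually missing. The following is only a minimal sketch: the toy data frame reuses the first six rows printed by head(Adata), and one GMAT value is blanked out on purpose (an assumption made purely for illustration, not a property of the real file).

```r
# Toy data: first six rows from head(Adata), with one GMAT value
# set to NA on purpose (illustrative assumption only).
toy <- data.frame(
  GPA      = c(2.96, 3.14, 3.22, 3.29, 3.69, 3.46),
  GMAT     = c(596, 473, NA, 527, 505, 693),
  Decision = rep("admit", 6)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Compare the row count before and after removing incomplete rows
rows_before  <- nrow(toy)
toy_clean    <- na.omit(toy)
rows_removed <- rows_before - nrow(toy_clean)
rows_removed
```

Running colSums(is.na(Adata)) on the real data in the same way would show whether na.omit removes anything at all.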

I also change the names of the first two columns:

colnames(Adata)[1] <- "Undergrad GPA"
colnames(Adata)[2] <- "GMAT Score"
head(Adata)
##   Undergrad GPA GMAT Score Decision
## 1          2.96        596    admit
## 2          3.14        473    admit
## 3          3.22        482    admit
## 4          3.29        527    admit
## 5          3.69        505    admit
## 6          3.46        693    admit

It is useful to add a categorical variable for GPA level (Low: below 3.0, Medium: 3.0 to below 3.5, High: 3.5 and above):

Adata <- cbind(Adata, rep(NA, nrow(Adata)))
colnames(Adata)[4] <- "GPA Level"
for (i in 1:nrow(Adata)) {
  if (Adata[i,1] < 3) {Adata[i,4] <- "Low"} 
  else {if (Adata[i,1] < 3.5) {Adata[i,4] <- "Medium"} 
    else {Adata[i,4] <- "High"}} 
}
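
The same grouping can also be written without a loop. The sketch below uses base R's cut() on a toy vector (the six GPA values printed by head(Adata)); the object names gpa and gpa_level are hypothetical. With right = FALSE the intervals are closed on the left, so a GPA of exactly 3.0 falls into "Medium" and exactly 3.5 into "High", matching the loop above.

```r
# Toy vector: the six GPA values shown by head(Adata)
gpa <- c(2.96, 3.14, 3.22, 3.29, 3.69, 3.46)

# Intervals: [-Inf, 3) -> Low, [3, 3.5) -> Medium, [3.5, Inf) -> High
gpa_level <- cut(
  gpa,
  breaks = c(-Inf, 3, 3.5, Inf),
  labels = c("Low", "Medium", "High"),
  right  = FALSE
)
gpa_level

# Boundary check: 3.5 is classified as "High"
boundary <- cut(3.5, breaks = c(-Inf, 3, 3.5, Inf),
                labels = c("Low", "Medium", "High"), right = FALSE)
```

cut() returns a factor whose levels follow the order of the labels, so the result is already ordered Low, Medium, High.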

Now I convert the Decision and GPA Level variables into factors:

Adata$Decision <- factor(Adata$Decision, 
                         levels = c("admit", "notadmit", "border"))
Adata$`GPA Level` <- factor(Adata$`GPA Level`, 
                            levels = c("Low", "Medium", "High"))

It is possible to create a new data frame containing only admitted students who have a high GPA:

Adata2 <- Adata[Adata$Decision=="admit" & Adata$`GPA Level`=="High",]
head(Adata2)
##    Undergrad GPA GMAT Score Decision GPA Level
## 5           3.69        505    admit      High
## 9           3.63        447    admit      High
## 10          3.59        588    admit      High
## 13          3.50        572    admit      High
## 14          3.78        591    admit      High
## 22          3.58        564    admit      High

The psych package has to be loaded now:

library(psych)
describe(Adata[, c("Undergrad GPA", "GMAT Score")])
##               vars  n   mean    sd median trimmed   mad    min   max  range
## Undergrad GPA    1 85   2.97  0.43   3.01    2.97  0.52   2.13   3.8   1.67
## GMAT Score       2 85 488.45 81.52 482.00  484.36 84.51 313.00 693.0 380.00
##                skew kurtosis   se
## Undergrad GPA -0.05    -1.01 0.05
## GMAT Score     0.39    -0.14 8.84

The average (mean) undergraduate GPA of the applicants was 2.97, while the median was slightly higher at 3.01, indicating a fairly symmetric distribution with minimal skew (skew = -0.05).
The standard deviation of GPA was 0.43, suggesting that most applicants had broadly similar academic performance.
For the GMAT score, the mean was 488.45 and the median was 482, again showing a relatively balanced distribution with only a mild right skew (skew = 0.39).
The standard deviation of GMAT scores was 81.52. Because GPA and GMAT are measured on different scales, the two standard deviations cannot be compared directly; the spread has to be judged relative to each mean.

To compare the variability of the two variables, it is possible to calculate their coefficients of variation (standard deviation divided by mean).

cv_gpa <- (sd(Adata$`Undergrad GPA`) / 
             mean(Adata$`Undergrad GPA`))
cv_gmat <- (sd(Adata$`GMAT Score`) / 
             mean(Adata$`GMAT Score`))
cv_gpa
## [1] 0.1442201
cv_gmat
## [1] 0.1669011

The coefficient of variation is higher for GMAT scores (about 0.167) than for GPA (about 0.144), so GMAT scores vary somewhat more relative to their mean.
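
If the coefficient of variation is needed for several columns, the calculation can be wrapped in a small helper. This is only a sketch: the function name cv is hypothetical, and the toy data frame reuses the six rows printed by head(Adata), so its values will differ from the full-sample results above.

```r
# Hypothetical helper: coefficient of variation = sd / mean
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

# Toy data: first six rows from head(Adata)
toy <- data.frame(
  GPA  = c(2.96, 3.14, 3.22, 3.29, 3.69, 3.46),
  GMAT = c(596, 473, 482, 527, 505, 693)
)

# Apply the helper to every numeric column at once
cvs <- sapply(toy, cv)
cvs
```

Because the coefficient of variation is unit-free, rescaling a variable (for example, multiplying every GMAT score by 10) leaves it unchanged, which is what makes it suitable for comparing variables on different scales.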

describeBy(Adata[, c("Undergrad GPA", "GMAT Score")], 
           group = Adata$Decision)
## 
##  Descriptive statistics by group 
## group: admit
##               vars  n   mean    sd median trimmed   mad    min   max  range
## Undergrad GPA    1 31   3.40  0.21   3.39    3.40  0.19   2.96   3.8   0.84
## GMAT Score       2 31 561.23 67.96 559.00  560.16 56.34 431.00 693.0 262.00
##               skew kurtosis    se
## Undergrad GPA 0.08    -0.56  0.04
## GMAT Score    0.17    -0.65 12.21
## ------------------------------------------------------------ 
## group: notadmit
##               vars  n   mean    sd median trimmed   mad    min   max  range
## Undergrad GPA    1 28   2.48  0.18   2.47    2.48  0.16   2.13   2.9   0.77
## GMAT Score       2 28 447.07 62.38 435.50  449.21 65.23 321.00 542.0 221.00
##                skew kurtosis    se
## Undergrad GPA  0.28    -0.27  0.03
## GMAT Score    -0.07    -1.10 11.79
## ------------------------------------------------------------ 
## group: border
##               vars  n   mean    sd median trimmed   mad    min   max  range
## Undergrad GPA    1 26   2.99  0.17   3.01    2.98  0.18   2.73   3.5   0.77
## GMAT Score       2 26 446.23 47.40 446.00  448.32 42.25 313.00 546.0 233.00
##                skew kurtosis   se
## Undergrad GPA  0.81     0.82 0.03
## GMAT Score    -0.50     0.76 9.30

Using the describeBy() function, I examined the descriptive statistics for undergraduate GPA and GMAT scores across three admission decision groups: admit, notadmit, and border.

Admitted students had the highest average GPA (mean = 3.40) and GMAT score (mean = 561.23), indicating strong academic performance. However, the minimum values in this group are as low as 2.96 for GPA and 431 for GMAT. In contrast, the not admitted group had considerably lower averages (GPA mean = 2.48, GMAT mean = 447.07), suggesting that both measures may have contributed to the negative decision. The borderline group had an average GPA of 2.99, well above that of the not admitted group, but its average GMAT score (446.23) was almost identical to that group's. This implies that their academic record was relatively strong, while lower GMAT performance may have placed them in an uncertain decision category.

Histogram of GPA by Admission Group

library(ggplot2)
ggplot(Adata, aes(x = `Undergrad GPA`)) +
  geom_histogram(binwidth = 0.1, fill = "darkslateblue", color = "white") +
  facet_wrap(~ Decision, ncol = 1) +
  labs(title = "Distribution of Undergraduate GPA by Admission Decision",
       x = "GPA", y = "Count") +
  theme_minimal()

The histograms illustrate the distribution of undergraduate GPA for each admission decision category: admit, notadmit, and border.

Admitted students show a clear concentration of GPAs in the higher range (around 3.2 to 3.6), reflecting strong academic performance. In contrast, not admitted students are heavily concentrated in the lower GPA range, particularly between 2.4 and 2.6, suggesting GPA may have played a major role in rejection. The borderline group has GPAs clustered around the mid-range (approximately 2.8 to 3.1), indicating that these applicants were neither clearly strong nor weak based on academic performance alone.

This visual comparison supports the conclusion that GPA is positively associated with admission success.

GMAT Score Boxplots by Admission Group

ggplot(Adata, aes(x = "", y = `GMAT Score`)) +
  geom_boxplot(fill = "coral", color = "black") +
  coord_flip() +
  facet_wrap(~ Decision, ncol = 1) +
  labs(
    title = "GMAT Score Distribution by Admission Decision",
    x = "",
    y = "GMAT Score"
  ) +
  theme_minimal()

The boxplots show the distribution of GMAT scores across the three admission decision categories.

Admitted students have the highest GMAT scores overall, with the middle 50% (interquartile range) falling roughly between 520 and 600, and some scores reaching as high as 690. This suggests that strong GMAT performance is closely linked to admission.

The not admitted group has a much lower median than the admitted group (435.5 versus 559). This is consistent with the idea that lower GMAT scores are a major factor in rejections.

The borderline group’s GMAT distribution is very similar to that of the not admitted group, centered around the mid-400s. This indicates that their GMAT scores alone may not have been strong enough to secure admission, even if other aspects of their application were competitive.
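
Quartiles read off a boxplot are approximate, so they can also be computed directly. The sketch below uses invented toy scores (not taken from the dataset) and hypothetical object names, simply to show the pattern of splitting by group and applying quantile().

```r
# Invented toy scores for two groups (illustration only, not real data)
gmat  <- c(500, 520, 560, 600, 640,   # "admit"
           380, 420, 440, 480, 520)   # "notadmit"
group <- rep(c("admit", "notadmit"), each = 5)

# First quartile, median and third quartile for each group;
# the result is a matrix with one column per group
q <- sapply(split(gmat, group), quantile, probs = c(0.25, 0.5, 0.75))
q

# Interquartile range per group
iqr <- q["75%", ] - q["25%", ]
iqr
```

On the real data, the same call with Adata$`GMAT Score` and Adata$Decision would give the exact quartiles behind each box.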

Scatterplot of GPA vs. GMAT, Colored by Decision

ggplot(Adata, aes(x = `Undergrad GPA`, y = `GMAT Score`, 
                  color = Decision)) +
  geom_point(size = 2, alpha = 0.8) +
  labs(
    title = "Scatterplot of GPA and GMAT Score by Admission Decision",
    x = "Undergraduate GPA",
    y = "GMAT Score"
  ) +
  theme_minimal()

The scatterplot visualizes the relationship between undergraduate GPA and GMAT scores, with data points colored by admission decision.

Admitted students (red) are clearly clustered in the upper-right corner, where both GPA and GMAT scores are high. This suggests that strong performance in both areas is closely associated with admission.

Not admitted students (green) mostly appear in the lower-left portion of the plot, indicating that lower scores in both GPA and GMAT are commonly associated with rejection.

Borderline applicants (blue) are concentrated in the mid-range of both variables, showing moderate performance that may have made their applications harder to decide. Notably, some borderline applicants have a strong GPA or a strong GMAT score, but not both, which may explain why they did not fall clearly into the admit or not admit categories.

Overall, the plot suggests a positive relationship between GPA and GMAT, and shows that both are associated with the final admission decision.
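
The strength of the relationship between GPA and GMAT can be quantified with a correlation coefficient. The following is a minimal sketch on a toy data frame (the six rows printed by head(Adata)); with so few rows, all from the admit group, the result says nothing about the full sample and only illustrates the calls.

```r
# Toy data: first six rows from head(Adata)
toy <- data.frame(
  GPA  = c(2.96, 3.14, 3.22, 3.29, 3.69, 3.46),
  GMAT = c(596, 473, 482, 527, 505, 693)
)

# Pearson correlation (linear association)
r_pearson <- cor(toy$GPA, toy$GMAT)

# Spearman rank correlation (monotonic association, less sensitive to outliers)
r_spearman <- cor(toy$GPA, toy$GMAT, method = "spearman")

c(pearson = r_pearson, spearman = r_spearman)
```

On the full data, the equivalent call would be cor(Adata$`Undergrad GPA`, Adata$`GMAT Score`).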