For our k-Nearest Neighbor algorithm, our group decided to use a data set that shows the years it took to graduate for a sample of students, with factors such as their high school GPA, college GPA, ACT score, SAT score, parental highest level of education, and parental income. We decided to look primarily at the GPA, SAT score, parental level of education, and parental income to predict the quantitative variable years.to.graduate (how long it would take a given college student to graduate given these factors).

### STAT 0218 Homework 1
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
#@authors: Claire Zhang, Michael He, Samuel Bryan 
# Loading in the data set
graduation_datanew <- read.csv("/Users/clairezhang/Downloads/STAT 0218/Datasets/graduation_rate.csv")

To create our kNN algorithm through R Studio, we first decided to select the columns that were relevant to our prediction, so we selected the SAT score, the parental level of education, parental income, college GPA, and years to graduate. There were a few issues before we cleaned up the data. For example, we had the categorical variable ‘parental.level.of.education’, which was separated into six categories: some high school, high school, some college, associate’s degree, bachelor’s degree, master’s degree. To mitigate this issue, we assigned numerical values to each of the categories, with 1 being ‘some high school’ and 6 being the highest, at ‘master’s degree’. Assigning numbers to each category is subjective, but we decided that there was a relatively equal “difference” between each category. Another issue was that none of our variables were scaled correctly. For example, income was scaled from about 10000-130000, whereas the GPA was only scaled from 2.5-4.0, and there were similar complications with the SAT scores. The kKN algorithm commonly uses Euclidean distance to make predictions, so if we didn’t scale the number, then variables with larger differences would disproportionately affect the prediction. We decided to scale each one of the variables on a 0 to 1 scale by subtracting the minimum and dividing by the range of that particular variable.

Then we decided to do an Exploratory Data Analysis (EDA) to see how the relationships/correlations between data. We graphed the years to graduate over the parental level of education. Our boxplot essentially shows that as parental level of education increases (with master’s degree being the highest graduation level that a parent can be classified as in this data set), the time to graduate will decrease. There are a few exceptions, however, as the master’s degree boxplot has a few rather large outliers. This then piqued our curiosity, and we decided to move forward and make another exploratory graph below.

grad_datanew <- graduation_datanew |>
  select(1,2,3,4,6,7) #isolating the variables in a new data set

# Exploratory Data Analysis (EDA)
# Factor the categorical variable (parental level of education)
# into a numerical value
grad_data <- grad_datanew |>
  mutate(new_col = factor(parental.level.of.education))

# Create exploratory graphs
# This one depicts the years to graduate over the parental
# level of education
grad_data |>
  ggplot() +
  geom_boxplot(aes(factor(parental.level.of.education, level = c("master's degree", "bachelor's degree", "associate's degree", "some college", "high school", "some high school")), years.to.graduate)) +
  labs(x = "Parental Level of Education",
       y = "Years to Graduate") +
  ggtitle("How Parental Level of Education Impacts Students' Graduation Times")

Our second exploratory graph encompasses the years to graduate over the parental income. This gives us some more insight because people with master’s degrees statistically, on average, have a higher income due to more education. We can see that the occasional student whose family makes $100000, the child could possibly graduate late, for example, 8 years max. This could be due to the resources that the family could have to pay off extra years of tuition or prioritizing their child’s education. However, the jitter plot is still mostly negatively correlated. As parental income increases, years to graduate generally decreases.

# Second exploratory graph
grad_data |>
  ggplot() +
  geom_jitter(aes(x = parental.income, y = years.to.graduate)) +
  labs(x = "Parental Income", y = "Years to Graduate") +
  ggtitle("Years till Graduation for Students' Based on Parental Income")

Below we started working on our inputs so that we could plug in values and test the effectiveness of our function. Then, we used the Euclidean distance to find the distances between the quantitative variables and the inputs that we stored into a variable.

# Build a kNN algorithm to determine how long a student takes to graduate

# Master's - 1, Bachelor's - 2, Associate's - 3, Some college - 4,
# High school - 5, Some high school - 6
grad_data_new <- grad_data |>
  mutate(grad_level = factor(
    parental.level.of.education,
    levels = c(
      "master's degree",
      "bachelor's degree",
      "associate's degree",
      "some college",
      "high school",
      "some high school"
    ),
    labels = c(1, 2, 3, 4, 5, 6)
  )) |> 
  mutate(numeric_grad_level = as.numeric(grad_level))

# Scaling the grad_data
grad_data_scaled <- grad_data_new |>
  mutate(scaled_SAT =
           (SAT.total.score - min(SAT.total.score)) / (max(SAT.total.score) - min(SAT.total.score))) |>
  mutate(scaled_income =
           (parental.income - min(parental.income)) / (max(parental.income) -
                                                         min(parental.income))) |>
  mutate(scaled_grad_level =
           (numeric_grad_level - min(numeric_grad_level)) / (max(numeric_grad_level) - min(numeric_grad_level))) |>
  mutate(scaled_GPA =
           (college.gpa - min(college.gpa)) / (max(college.gpa) - min(college.gpa)))
# Inputs
k <- 20
par_edu_input <- (1 - min(grad_data_scaled$numeric_grad_level)) / (
  max(grad_data_scaled$numeric_grad_level) -
    min(grad_data_scaled$numeric_grad_level)
)
SAT_input <- (1550 - min(grad_data_scaled$SAT.total.score)) / (max(grad_data_scaled$SAT.total.score) -
                                                                 min(grad_data_scaled$SAT.total.score))
income_input <- (20000 - min(grad_data_scaled$parental.income)) / (max(grad_data_scaled$parental.income) -
                                                                     min(grad_data_scaled$parental.income))
gpa_input <- (2.5 - min(grad_data_scaled$college.gpa)) / (max(grad_data_scaled$college.gpa) -
                                                            min(grad_data_scaled$college.gpa))
# Adding on the other variables
grad_data_new |>
  mutate(distance = sqrt((parental.income - income_input) ^ 2 +
                           (numeric_grad_level - par_edu_input) ^ 2
  )) |>
  arrange(distance) |>
  head(k) |>
  count(years.to.graduate) |>
  arrange(-n) |>
  head(1) |>
  pull(years.to.graduate)
## [1] 6
#Turning above code into own function
my_knn <- function(k, income_input, par_edu_input, dataset_input){
  dataset_input |>
    mutate(distance = sqrt((parental.income - income_input)^2  + 
                             (numeric_grad_lvl- par_edu_input)^2)) |>
    arrange(distance) |>
    head(k) |>
    count(years.to.graduate) |>
    arrange(-n) |>
    head(1) |>
    pull(years.to.graduate)
}

grad_data_scaled |>
  summarize(mean_yrs = mean(years.to.graduate))
##   mean_yrs
## 1    4.982
grad_data_scaled |>
  mutate(distance = sqrt((scaled_income - income_input) ^ 2 +
                           (scaled_grad_level - par_edu_input) ^ 2 +
                           (scaled_GPA - gpa_input) ^ 2 +
                           (scaled_SAT - SAT_input) ^ 2
  )) |>
  arrange(distance) |>
  head(k) |>
  count(years.to.graduate) |>
  arrange(-n) |>
  head(5) |>
  pull(years.to.graduate)
## [1] 5 6 4 9 7

Below, we begin to search for the optimal k. We start by creating training data and test data from our original dataset. Then we create a function, called my_knn, that will take an input of the dataset, SAT, parental education level, GPA, and income, and spit back a prediction of the years it will take to graduate. To find the optimal k, we created a loop that tested the function’s corrrctness for a range of k = 1 to k = 20. The value of ‘k’ that we found to be optimal is k = 20 with the probability of correctness being .666. This essentially means that 66.6% of the predictions made by our kNN algorithm are correct.

#How good is my algorithm? What k should I choose?
training_rows <- sample(1:nrow(grad_data_scaled), 
                        size = nrow(grad_data_scaled)/2)

train_data <- grad_data_scaled[training_rows, ]
test_data <- grad_data_scaled[-training_rows, ]

#Let's see how good our algorithm is for a given k
my_knn <- function(k, par_edu_input, SAT_input, gpa_input, income_input, dataset_input){
  dataset_input |>
    mutate(distance = sqrt((scaled_income - income_input)^2 + 
                             (scaled_grad_level - par_edu_input)^2 +
                             (scaled_GPA - gpa_input)^2 + 
                             (scaled_SAT - SAT_input)^2)) |>
    arrange(distance) |> 
    head(k) |> 
    count(years.to.graduate) |> 
    arrange(-n) |> 
    head(5) |> 
    pull(years.to.graduate)
}
k_acc <- NULL

ans <- NULL

for(k in 1:20){
  
  for(i in 1:nrow(test_data)){
    prediction <- my_knn(k, 
                         test_data$scaled_income[i], 
                         test_data$scaled_GPA[i], 
                         test_data$scaled_SAT[i],
                         test_data$scaled_grad_level[i],
                         train_data)
    ans[i] <- abs(prediction - test_data$years.to.graduate[i]) <= 1
  }
  k_acc[k] <- mean(ans)
}

k_acc
##  [1] 0.552 0.622 0.606 0.592 0.590 0.630 0.646 0.624 0.632 0.630 0.632 0.634
## [13] 0.634 0.646 0.648 0.668 0.664 0.664 0.662 0.670
test_data_new <- test_data |>
  mutate(scaled_grad_level = factor(
    recode(
      scaled_grad_level,
      "0" = "Master's",
      "0.2" = "Bachelor's",
      "0.4" = "Associate's",
      "0.6" = "Some College",
      "0.8" = "High School",
      "1" = "Some High School"
    ),
    levels = c(
      "Master's",
      "Bachelor's",
      "Associate's",
      "Some College",
      "High School",
      "Some High School"
    )
  )) |>
  mutate(GPA = (scaled_GPA *(max(test_data$college.gpa) - min(test_data$college.gpa))) + min(test_data$college.gpa)) |>
  mutate(income = (scaled_income *(max(test_data$parental.income) - min(test_data$parental.income))) + min(test_data$parental.income))

Above is our graph showing the impact of all of the these factors on years until gradution. This depicts how our kNN algorithm takes all of the variables into consideration when forming a prediction for a given student. It visualizes how there are no data points of students in our data set with parents with master’s and bachelor’s degrees that have received a bad SAT score or a GPA lower than around 3.0.

A few of the advantages and disadvantages of our kNN algorithm: One of our first algorithms with k = 20 was giving very low (“inaccurate”) results in terms of how often it predicted the correct year to graduate. We realized that our KNN was trying to be too exact, meaning our model was seemingly less accurate because it was looking for an exact graduate time. Fixing this involved implementing bounds that were within an acceptable range (+/- 1 year) based on a given input of variables. In other words, our algorithm would identify that if a student took five years to graduate, and our model predicted four, that’s a close approximation that should be weighted accordingly.