INSS615 Homework 5

Morgan State University

Department of Information Science & Systems

Fall 2024

INSS 615: Data Wrangling for Visualization

Name: Ava’ Roberts

Due: Dec 1, 2024 (Sunday)

Questions

A. Scrape the College Ranked by Acceptance Rate dataset available at this link: https://www.oedb.org/rankings/acceptance-rate/#table-rankings and select the first 9 columns [Rank, School, Student to Faculty Ratio, Graduation Rate, Retention Rate, Acceptance Rate, Enrollment Rate, Institutional Aid Rate, and Default Rate] as the dataset for this assignment. [20 Points]

Hint: There are 6 pages of data, so you may want to use a for loop to automate the scraping process and combine the data from all 6 pages. This is just a suggestion—you are free to create the dataset without automating the web scrapping process.

Solution:

library(rvest)

## Warning: package 'rvest' was built under R version 4.2.3

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.2.1     ✔ dplyr   1.1.4
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.5     ✔ forcats 1.0.0

## Warning: package 'tibble' was built under R version 4.2.3

## Warning: package 'readr' was built under R version 4.2.3

## Warning: package 'dplyr' was built under R version 4.2.3

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()         masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag()            masks stats::lag()

# multipage scraping using a for loop
# empty list to store data
all_data <- list()

# Loop through 6 pages
for (page_num in 1:6) {
  # URL for the current page
  url <- paste0("https://www.oedb.org/rankings/acceptance-rate/page/", page_num, "/")
  
  # Read the HTML on page
  page <- read_html(url)
  
  # Extract the tables
  table <- page %>%
    html_node("table") %>%
    html_table(fill = TRUE)
  
  # Select columns
  selected_data <- table %>%
    select(Rank, School, `Student to Faculty Ratio`, `Graduation Rate`, `Retention Rate`,
           `Acceptance Rate`, `Enrollment Rate`, `Institutional Aid Rate`, `Default Rate`)
  
  # Add data to list
  all_data[[page_num]] <- selected_data
}

# Combine all pages to one data frame
final_data <- bind_rows(all_data)

# Display the first few rows of the combined data
print(head(final_data))

## # A tibble: 6 × 9
##    Rank School         Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
##   <int> <chr>          <chr>                  <chr>             <chr>           
## 1     1 Harvard Unive… 7 to 1                 98%               98%             
## 2     2 Yale Universi… 6 to 1                 97%               99%             
## 3     3 University of… 6 to 1                 95%               98%             
## 4     4 Johns Hopkins… 10 to 1                94%               97%             
## 5     5 Cornell Unive… 9 to 1                 93%               97%             
## 6     6 Tufts Univers… 9 to 1                 93%               97%             
## # ℹ abbreviated name: ¹`Student to Faculty Ratio`
## # ℹ 4 more variables: `Acceptance Rate` <chr>, `Enrollment Rate` <chr>,
## #   `Institutional Aid Rate` <chr>, `Default Rate` <chr>

B. You are going to need the dataset created in Question A to answer the following questions. There are 16 questions each carrying 5 points:

Replace the missing values “N/A” in the dataset with NA.

Solution:

# Replace all "N/A"values with the string "NA"
final_data_cleaned <- final_data %>%
  mutate(across(everything(), ~ ifelse(is.na(.) | . == "N/A", "NA", .)))

# Display the first few rows of the cleaned data
print(head(final_data_cleaned))

## # A tibble: 6 × 9
##    Rank School         Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
##   <int> <chr>          <chr>                  <chr>             <chr>           
## 1     1 Harvard Unive… 7 to 1                 98%               98%             
## 2     2 Yale Universi… 6 to 1                 97%               99%             
## 3     3 University of… 6 to 1                 95%               98%             
## 4     4 Johns Hopkins… 10 to 1                94%               97%             
## 5     5 Cornell Unive… 9 to 1                 93%               97%             
## 6     6 Tufts Univers… 9 to 1                 93%               97%             
## # ℹ abbreviated name: ¹`Student to Faculty Ratio`
## # ℹ 4 more variables: `Acceptance Rate` <chr>, `Enrollment Rate` <chr>,
## #   `Institutional Aid Rate` <chr>, `Default Rate` <chr>

Convert percentage columns (e.g., Graduation Rate) to numeric format.

Solution:

# Id the percentage columns to convert
percentage_columns <- c("Graduation Rate", "Retention Rate", "Acceptance Rate", 
                        "Enrollment Rate", "Institutional Aid Rate", "Default Rate")

# Define a function to convert percentage strings to numeric
convert_to_numeric <- function(x) {
  as.numeric(str_remove(x, "%")) / 100
}

# Apply the function to the specified percentage columns
final_data_cleaned <- final_data_cleaned %>%
  mutate(across(all_of(percentage_columns), convert_to_numeric))

## Warning: There were 5 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(all_of(percentage_columns), convert_to_numeric)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warnings()]8;; to see the 4 remaining warnings.

# Display the first few rows of the cleaned data
print(head(final_data_cleaned))

## # A tibble: 6 × 9
##    Rank School         Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
##   <int> <chr>          <chr>                              <dbl>            <dbl>
## 1     1 Harvard Unive… 7 to 1                              0.98             0.98
## 2     2 Yale Universi… 6 to 1                              0.97             0.99
## 3     3 University of… 6 to 1                              0.95             0.98
## 4     4 Johns Hopkins… 10 to 1                             0.94             0.97
## 5     5 Cornell Unive… 9 to 1                              0.93             0.97
## 6     6 Tufts Univers… 9 to 1                              0.93             0.97
## # ℹ abbreviated name: ¹`Student to Faculty Ratio`
## # ℹ 4 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## #   `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>

Transform the “Student to Faculty Ratio” column into two separate numeric columns: Students and Faculty.

Solution:

# Split the Student to Faculty Ratio column into Students and Faculty
final_data_cleaned <- final_data_cleaned %>%
  mutate(
    Students = as.numeric(str_extract(`Student to Faculty Ratio`, "^[0-9]+")),
    Faculty = as.numeric(str_extract(`Student to Faculty Ratio`, "[0-9]+$"))
  )

# Display the first few rows of data
print(head(final_data_cleaned))

## # A tibble: 6 × 11
##    Rank School         Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
##   <int> <chr>          <chr>                              <dbl>            <dbl>
## 1     1 Harvard Unive… 7 to 1                              0.98             0.98
## 2     2 Yale Universi… 6 to 1                              0.97             0.99
## 3     3 University of… 6 to 1                              0.95             0.98
## 4     4 Johns Hopkins… 10 to 1                             0.94             0.97
## 5     5 Cornell Unive… 9 to 1                              0.93             0.97
## 6     6 Tufts Univers… 9 to 1                              0.93             0.97
## # ℹ abbreviated name: ¹`Student to Faculty Ratio`
## # ℹ 6 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## #   `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>, Students <dbl>,
## #   Faculty <dbl>

What is the count of missing values in the “Default Rate” column? Impute the missing values in the “Default Rate” column with the median value.

Solution:

# Count missing values in the "Default Rate" column
missing_count <- sum(is.na(final_data_cleaned$`Default Rate`))
cat("Count of missing values in 'Default Rate':", missing_count, "\n")

## Count of missing values in 'Default Rate': 291

# mutate missing values with the median
median_default_rate <- median(final_data_cleaned$`Default Rate`, na.rm = TRUE)
final_data_cleaned <- final_data_cleaned %>%
  mutate(`Default Rate` = ifelse(is.na(`Default Rate`), median_default_rate, `Default Rate`))

# Display the first few rows of the updated dataset
print(head(final_data_cleaned))

## # A tibble: 6 × 11
##    Rank School         Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
##   <int> <chr>          <chr>                              <dbl>            <dbl>
## 1     1 Harvard Unive… 7 to 1                              0.98             0.98
## 2     2 Yale Universi… 6 to 1                              0.97             0.99
## 3     3 University of… 6 to 1                              0.95             0.98
## 4     4 Johns Hopkins… 10 to 1                             0.94             0.97
## 5     5 Cornell Unive… 9 to 1                              0.93             0.97
## 6     6 Tufts Univers… 9 to 1                              0.93             0.97
## # ℹ abbreviated name: ¹`Student to Faculty Ratio`
## # ℹ 6 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## #   `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>, Students <dbl>,
## #   Faculty <dbl>

Find the average graduation rate for universities ranked in the top 50.

Solution:

# Filter top 50 universities
top_50_universities <- final_data_cleaned %>%
  filter(Rank <= 50)

# average graduation rate
average_graduation_rate <- mean(top_50_universities$`Graduation Rate`, na.rm = TRUE)

# Display the result
cat("The average graduation rate for universities ranked in the top 50 is:", average_graduation_rate * 100, "%\n")

## The average graduation rate for universities ranked in the top 50 is: 79.18 %

Filter universities with a retention rate above 90% and find the count of rows in the subset.

Solution:

# Filter universities with retention rate above 90%
high_retention_universities <- final_data_cleaned %>%
  filter(`Retention Rate` > 0.90)

# Count the number of rows 
count_high_retention <- nrow(high_retention_universities)

# Display the count
cat("The number of universities with a retention rate above 90% is:", count_high_retention, "\n")

## The number of universities with a retention rate above 90% is: 98

Rank universities by enrollment rate in descending order and display the last 6 rows.

Solution:

# Rank by enrollment rate in descending order
ranked_universities <- final_data_cleaned %>%
  arrange(desc(`Enrollment Rate`))

# Display the last 6 rows of the ranked data
tail(ranked_universities, 6)

Create a histogram of graduation rates using ggplot2 library.

Solution:

# Load the ggplot2 library
library(ggplot2)

# Create a histogram of graduation rates
ggplot(final_data_cleaned, aes(x = `Graduation Rate`)) +
  geom_histogram(binwidth = 0.05, fill = "green", color = "black", alpha = 0.7) +
  labs(
    title = "Histogram of Graduation Rates",
    x = "Graduation Rate (Proportion)",
    y = "Frequency"
  ) +
  theme_minimal()

## Warning: Removed 6 rows containing non-finite values (`stat_bin()`).

Plot a scatterplot between acceptance rate and enrollment rate using ggplot2 library.

Solution:

# Create a scatterplot of acceptance rate vs. enrollment rate
ggplot(final_data_cleaned, aes(x = `Acceptance Rate`, y = `Enrollment Rate`)) +
  geom_point(color = "red", alpha = 0.7) +
  labs(
    title = "Scatterplot of Acceptance Rate vs. Enrollment Rate",
    x = "Acceptance Rate (Proportion)",
    y = "Enrollment Rate (Proportion)"
  ) +
  theme_minimal()

## Warning: Removed 29 rows containing missing values (`geom_point()`).

Calculate the average default rate by aid rate category (e.g., grouped into ranges like 0-20%, 20-40%). Display the categories.

Solution:

# Define aid rate bins
final_data_cleaned <- final_data_cleaned %>%
  mutate(`Aid Rate Category` = cut(
    `Institutional Aid Rate`,
    breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1.0),
    labels = c("0-20%", "20-40%", "40-60%", "60-80%", "80-100%"),
    include.lowest = TRUE
  ))

# Calculate the average default 
avg_default_rate_by_category <- final_data_cleaned %>%
  group_by(`Aid Rate Category`) %>%
  summarise(`Average Default Rate` = mean(`Default Rate`, na.rm = TRUE))

# Display the result
print(avg_default_rate_by_category)

## # A tibble: 5 × 2
##   `Aid Rate Category` `Average Default Rate`
##   <fct>                                <dbl>
## 1 0-20%                               0.1   
## 2 20-40%                              0.0756
## 3 40-60%                              0.0605
## 4 60-80%                              0.0651
## 5 80-100%                             0.0562

Normalize the acceptance rate to a scale of 0-1 and save in a new column “Acceptance Rate Normalized”. Display the first 6 values.

Solution:

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

 # Normalize the acceptance rate to a scale of 0-1
final_data_cleaned <- final_data_cleaned %>%
  mutate(`Acceptance Rate Normalized` = rescale(`Acceptance Rate`, to = c(0, 1)))

# Display the first 6 values of new column
head(final_data_cleaned[, c("Acceptance Rate", "Acceptance Rate Normalized")])

What is the count of the duplicate entries in the “School” column? Remove duplicate university entries.

Solution:

# Count duplicate entries in the "School" column
duplicate_count <- final_data_cleaned %>%
  filter(duplicated(School)) %>%
  nrow()

cat("The number of duplicate entries in the 'School' column is:", duplicate_count, "\n")

## The number of duplicate entries in the 'School' column is: 3

# Remove duplicate entries in "School" column
final_data_cleaned <- final_data_cleaned %>%
  distinct(School, .keep_all = TRUE)

# Display the result
print(final_data_cleaned)

## # A tibble: 568 × 13
##     Rank School        Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
##    <int> <chr>         <chr>                              <dbl>            <dbl>
##  1     1 Harvard Univ… 7 to 1                              0.98             0.98
##  2     2 Yale Univers… 6 to 1                              0.97             0.99
##  3     3 University o… 6 to 1                              0.95             0.98
##  4     4 Johns Hopkin… 10 to 1                             0.94             0.97
##  5     5 Cornell Univ… 9 to 1                              0.93             0.97
##  6     6 Tufts Univer… 9 to 1                              0.93             0.97
##  7     7 University o… 17 to 1                             0.92             0.96
##  8     8 University o… 16 to 1                             0.91             0.96
##  9     9 Georgetown U… 11 to 1                             0.94             0.96
## 10    10 Washington U… 8 to 1                              0.93             0.96
## # ℹ 558 more rows
## # ℹ abbreviated name: ¹`Student to Faculty Ratio`
## # ℹ 8 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## #   `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>, Students <dbl>,
## #   Faculty <dbl>, `Aid Rate Category` <fct>,
## #   `Acceptance Rate Normalized` <dbl>

Find the correlation between graduation rate and retention rate (exclude the NAs in both columns).

Solution:

# Calculate the correlation between Graduation Rate and Retention Rate
correlation <- cor(
  final_data_cleaned$`Graduation Rate`,
  final_data_cleaned$`Retention Rate`,
  use = "complete.obs"
)

# Display the correlation
cat("The correlation between Graduation Rate and Retention Rate is:", correlation, "\n")

## The correlation between Graduation Rate and Retention Rate is: 0.6159709

Extract the values in School column into a new variable without “University” in the string. For example “Rowan University” becomes “Rowan”

Solution:

# Create a new column without the word "University"
final_data_cleaned <- final_data_cleaned %>%
  mutate(School_Short = gsub(" University", "", School))

# Display the first few rows of the new column
head(final_data_cleaned[, c("School", "School_Short")])

Count how many universities have “Institute” in their name.

Solution:

# Count the number of universities with "Institute" in their name
institute_count <- sum(grepl("Institute", final_data_cleaned$School, ignore.case = TRUE))


# Display the count
cat("The number of universities with 'Institute' in their name is:", institute_count, "\n")

## The number of universities with 'Institute' in their name is: 17

Export the cleaned and processed dataset to a CSV file.

Solution:

# Export the cleaned and processed dataset to a CSV file
write.csv(final_data_cleaned, file = "cleaned_processed_dataset.csv", row.names = FALSE)