INSS615 Homework 5

Morgan State University

Department of Information Science & Systems

Fall 2024

INSS 615: Data Wrangling for Visualization

Name: Darren Mitchell

Due: Dec 1, 2024 (Sunday)

Questions

A. Scrape the College Ranked by Acceptance Rate dataset available at this link: https://www.oedb.org/rankings/acceptance-rate/#table-rankings and select the first 9 columns [Rank, School, Student to Faculty Ratio, Graduation Rate, Retention Rate, Acceptance Rate, Enrollment Rate, Institutional Aid Rate, and Default Rate] as the dataset for this assignment. [20 Points]

Hint: There are 6 pages of data, so you may want to use a for loop to automate the scraping process and combine the data from all 6 pages. This is just a suggestion—you are free to create the dataset without automating the web scrapping process.

Solution:

library(rvest) 
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()         masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag()            masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# multipage scraping using a for loop
# Load necessary libraries
library(rvest)
library(tidyverse)

# Define the URLs for each page (update these manually based on actual page URLs)
urls <- c(
  "https://www.oedb.org/rankings/acceptance-rate/#table-rankings",
  "https://www.oedb.org/rankings/acceptance-rate/page/2/#table-rankings",
  "https://www.oedb.org/rankings/acceptance-rate/page/3/#table-rankings",
  "https://www.oedb.org/rankings/acceptance-rate/page/4/#table-rankings",
  "https://www.oedb.org/rankings/acceptance-rate/page/5/#table-rankings",
  "https://www.oedb.org/rankings/acceptance-rate/page/6/#table-rankings"
  # Add more URLs if there are more pages
)

# Function to scrape and process data from a single URL
scrape_page <- function(url) {
  # Read the HTML content
  web_page <- read_html(url)
  
  # Extract the table and select columns
  college_table <- web_page %>%
    html_node("table") %>%
    html_table(fill = TRUE) %>%
    select(
      Rank = 1, 
      School = 2, 
      `Student to Faculty Ratio` = 3, 
      `Graduation Rate` = 4, 
      `Retention Rate` = 5, 
      `Acceptance Rate` = 6, 
      `Enrollment Rate` = 7, 
      `Institutional Aid Rate` = 8, 
      `Default Rate` = 9
    )
  
  return(college_table)
}

# Scrape and combine data from all pages
final_data <- map_dfr(urls, scrape_page)

# View the final dataset
print(final_data)

## # A tibble: 571 × 9
##     Rank School        Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
##    <int> <chr>         <chr>                  <chr>             <chr>           
##  1     1 Harvard Univ… 7 to 1                 98%               98%             
##  2     2 Yale Univers… 6 to 1                 97%               99%             
##  3     3 University o… 6 to 1                 95%               98%             
##  4     4 Johns Hopkin… 10 to 1                94%               97%             
##  5     5 Cornell Univ… 9 to 1                 93%               97%             
##  6     6 Tufts Univer… 9 to 1                 93%               97%             
##  7     7 University o… 17 to 1                92%               96%             
##  8     8 University o… 16 to 1                91%               96%             
##  9     9 Georgetown U… 11 to 1                94%               96%             
## 10    10 Washington U… 8 to 1                 93%               96%             
## # ℹ 561 more rows
## # ℹ abbreviated name: ¹`Student to Faculty Ratio`
## # ℹ 4 more variables: `Acceptance Rate` <chr>, `Enrollment Rate` <chr>,
## #   `Institutional Aid Rate` <chr>, `Default Rate` <chr>

# Optional: Save the final dataset as a CSV file
write_csv(final_data, "college_acceptance_data_full.csv")

B. You are going to need the dataset created in Question A to answer the following questions. There are 16 questions each carrying 5 points:

Replace the missing values “N/A” in the dataset with NA.

Solution:

# Replace "N/A" with NA
final_data <- final_data %>%
  mutate(across(everything(), ~ na_if(as.character(.), "N/A")))


# View the updated dataset
print(final_data)

## # A tibble: 571 × 9
##    Rank  School        Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
##    <chr> <chr>         <chr>                  <chr>             <chr>           
##  1 1     Harvard Univ… 7 to 1                 98%               98%             
##  2 2     Yale Univers… 6 to 1                 97%               99%             
##  3 3     University o… 6 to 1                 95%               98%             
##  4 4     Johns Hopkin… 10 to 1                94%               97%             
##  5 5     Cornell Univ… 9 to 1                 93%               97%             
##  6 6     Tufts Univer… 9 to 1                 93%               97%             
##  7 7     University o… 17 to 1                92%               96%             
##  8 8     University o… 16 to 1                91%               96%             
##  9 9     Georgetown U… 11 to 1                94%               96%             
## 10 10    Washington U… 8 to 1                 93%               96%             
## # ℹ 561 more rows
## # ℹ abbreviated name: ¹`Student to Faculty Ratio`
## # ℹ 4 more variables: `Acceptance Rate` <chr>, `Enrollment Rate` <chr>,
## #   `Institutional Aid Rate` <chr>, `Default Rate` <chr>

Convert percentage columns (e.g., Graduation Rate) to numeric format.

Solution:

# Replace "N/A" with NA and handle percentage columns
final_data <- final_data %>%
  mutate(across(everything(), ~ na_if(as.character(.), "N/A"))) %>%
  mutate(
    `Graduation Rate` = as.numeric(str_remove(`Graduation Rate`, "%")),
    `Retention Rate` = as.numeric(str_remove(`Retention Rate`, "%")),
    `Acceptance Rate` = as.numeric(str_remove(`Acceptance Rate`, "%")),
    `Enrollment Rate` = as.numeric(str_remove(`Enrollment Rate`, "%")),
    `Institutional Aid Rate` = as.numeric(str_remove(`Institutional Aid Rate`, "%")),
    `Default Rate` = as.numeric(str_remove(`Default Rate`, "%"))
  )

# View the updated dataset
print(final_data)

## # A tibble: 571 × 9
##    Rank  School        Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
##    <chr> <chr>         <chr>                              <dbl>            <dbl>
##  1 1     Harvard Univ… 7 to 1                                98               98
##  2 2     Yale Univers… 6 to 1                                97               99
##  3 3     University o… 6 to 1                                95               98
##  4 4     Johns Hopkin… 10 to 1                               94               97
##  5 5     Cornell Univ… 9 to 1                                93               97
##  6 6     Tufts Univer… 9 to 1                                93               97
##  7 7     University o… 17 to 1                               92               96
##  8 8     University o… 16 to 1                               91               96
##  9 9     Georgetown U… 11 to 1                               94               96
## 10 10    Washington U… 8 to 1                                93               96
## # ℹ 561 more rows
## # ℹ abbreviated name: ¹`Student to Faculty Ratio`
## # ℹ 4 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## #   `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>

Transform the “Student to Faculty Ratio” column into two separate numeric columns: Students and Faculty.

Solution:

# Transform the "Student to Faculty Ratio" column into two numeric columns
final_data <- final_data %>%
  separate(
    `Student to Faculty Ratio`, # Column to split
    into = c("Students", "Faculty"), # Names of new columns
    sep = ":", # Separator is the colon (:)
    convert = TRUE # Automatically convert to numeric
  )

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 571 rows [1, 2, 3, 4, 5,
## 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

# View the transformed dataset
print(final_data)

## # A tibble: 571 × 10
##    Rank  School              Students Faculty `Graduation Rate` `Retention Rate`
##    <chr> <chr>               <chr>    <lgl>               <dbl>            <dbl>
##  1 1     Harvard University  7 to 1   NA                     98               98
##  2 2     Yale University     6 to 1   NA                     97               99
##  3 3     University of Penn… 6 to 1   NA                     95               98
##  4 4     Johns Hopkins Univ… 10 to 1  NA                     94               97
##  5 5     Cornell University  9 to 1   NA                     93               97
##  6 6     Tufts University    9 to 1   NA                     93               97
##  7 7     University of Cali… 17 to 1  NA                     92               96
##  8 8     University of Cali… 16 to 1  NA                     91               96
##  9 9     Georgetown Univers… 11 to 1  NA                     94               96
## 10 10    Washington Univers… 8 to 1   NA                     93               96
## # ℹ 561 more rows
## # ℹ 4 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## #   `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>

What is the count of missing values in the “Default Rate” column? Impute the missing values in the “Default Rate” column with the median value.

Solution:

# Count missing values in the "Default Rate" column
missing_count <- sum(is.na(final_data$`Default Rate`))
cat("Count of missing values in 'Default Rate':", missing_count, "\n")

## Count of missing values in 'Default Rate': 291

# Impute missing values with the median
final_data <- final_data %>%
  mutate(
    `Default Rate` = ifelse(
      is.na(`Default Rate`), 
      median(`Default Rate`, na.rm = TRUE), 
      `Default Rate`
    )
  )

# View the dataset after imputation
print(final_data)

## # A tibble: 571 × 10
##    Rank  School              Students Faculty `Graduation Rate` `Retention Rate`
##    <chr> <chr>               <chr>    <lgl>               <dbl>            <dbl>
##  1 1     Harvard University  7 to 1   NA                     98               98
##  2 2     Yale University     6 to 1   NA                     97               99
##  3 3     University of Penn… 6 to 1   NA                     95               98
##  4 4     Johns Hopkins Univ… 10 to 1  NA                     94               97
##  5 5     Cornell University  9 to 1   NA                     93               97
##  6 6     Tufts University    9 to 1   NA                     93               97
##  7 7     University of Cali… 17 to 1  NA                     92               96
##  8 8     University of Cali… 16 to 1  NA                     91               96
##  9 9     Georgetown Univers… 11 to 1  NA                     94               96
## 10 10    Washington Univers… 8 to 1   NA                     93               96
## # ℹ 561 more rows
## # ℹ 4 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## #   `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>

Find the average graduation rate for universities ranked in the top 50.

Solution:

# Filter for universities ranked in the top 50 and calculate the average graduation rate
average_graduation_rate <- final_data %>%
  filter(Rank <= 50) %>%
  summarise(Average_Graduation_Rate = mean(`Graduation Rate`, na.rm = TRUE))

# Print the result
print(average_graduation_rate)

## # A tibble: 1 × 1
##   Average_Graduation_Rate
##                     <dbl>
## 1                    65.9

Filter universities with a retention rate above 90% and find the count of rows in the subset.

Solution:

# Filter universities with retention rate above 90% and count the rows
count_above_90 <- final_data %>%
  filter(`Retention Rate` > 90) %>%
  summarise(Count = n())

# Print the result
print(count_above_90)

## # A tibble: 1 × 1
##   Count
##   <int>
## 1    98

Rank universities by enrollment rate in descending order and display the last 6 rows.

Solution:

# Rank universities by enrollment rate in descending order and display the last 6 rows
last_6_universities <- final_data %>%
  arrange(desc(`Enrollment Rate`)) %>%
  tail(6)

# Print the result
print(last_6_universities)

## # A tibble: 6 × 10
##   Rank  School               Students Faculty `Graduation Rate` `Retention Rate`
##   <chr> <chr>                <chr>    <lgl>               <dbl>            <dbl>
## 1 566   Southeastern Baptis… 6 to 1   NA                     80               40
## 2 567   Touro University Wo… 13 to 1  NA                     NA              100
## 3 568   Unitek College       16 to 1  NA                     NA              100
## 4 569   University of Weste… 16 to 1  NA                     88               NA
## 5 570   Virginia Baptist Co… 5 to 1   NA                    100               25
## 6 571   West Virginia Junio… 25 to 1  NA                     53               69
## # ℹ 4 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## #   `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>

Create a histogram of graduation rates using ggplot2 library.

Solution:

# Load ggplot2 library
library(ggplot2)

# Create a histogram of graduation rates
ggplot(final_data, aes(x = `Graduation Rate`)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black", alpha = 0.7) +
  labs(
    title = "Histogram of Graduation Rates",
    x = "Graduation Rate (%)",
    y = "Frequency"
  ) +
  theme_minimal()

## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_bin()`).

Plot a scatterplot between acceptance rate and enrollment rate using ggplot2 library.

Solution:

# Load ggplot2 library
library(ggplot2)

# Create a scatterplot of acceptance rate vs. enrollment rate
ggplot(final_data, aes(x = `Acceptance Rate`, y = `Enrollment Rate`)) +
  geom_point(color = "blue", alpha = 0.7, size = 3) +
  labs(
    title = "Scatterplot of Acceptance Rate vs. Enrollment Rate",
    x = "Acceptance Rate (%)",
    y = "Enrollment Rate (%)"
  ) +
  theme_minimal()

## Warning: Removed 29 rows containing missing values or values outside the scale range
## (`geom_point()`).

Calculate the average default rate by aid rate category (e.g., grouped into ranges like 0-20%, 20-40%). Display the categories.

Solution:

# Define aid rate categories
final_data$Aid_Rate_Category <- cut(
  final_data$`Institutional Aid Rate`,
  breaks = c(0, 20, 40, 60, 80, 100),
  labels = c("0-20%", "20-40%", "40-60%", "60-80%", "80-100%"),
  include.lowest = TRUE
)

# Calculate the average default rate by aid rate category
category_levels <- levels(final_data$Aid_Rate_Category)
average_default_rate <- sapply(category_levels, function(category) {
  mean(final_data$`Default Rate`[final_data$Aid_Rate_Category == category], na.rm = TRUE)
})

# Combine categories and averages into a data frame
result <- data.frame(
  Aid_Rate_Category = category_levels,
  Average_Default_Rate = average_default_rate
)

# Display the result
print(result)

##         Aid_Rate_Category Average_Default_Rate
## 0-20%               0-20%            10.000000
## 20-40%             20-40%             7.555556
## 40-60%             40-60%             6.052980
## 60-80%             60-80%             6.514451
## 80-100%           80-100%             5.619469

Normalize the acceptance rate to a scale of 0-1 and save in a new column “Acceptance Rate Normalized”. Display the first 6 values.

Solution:

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

# Load the scales library
library(scales)

# Normalize the Acceptance Rate to a scale of 0-1
final_data$`Acceptance Rate Normalized` <- rescale(final_data$`Acceptance Rate`, to = c(0, 1))

# Display the first 6 values of the normalized column
head(final_data[, c("Acceptance Rate", "Acceptance Rate Normalized")])

What is the count of the duplicate entries in the “School” column? Remove duplicate university entries.

Solution:

# Count duplicate entries in the "School" column
duplicate_count <- sum(duplicated(final_data$School))
cat("Count of duplicate entries in 'School':", duplicate_count, "\n")

## Count of duplicate entries in 'School': 3

# Remove duplicate university entries based on the "School" column
final_data <- final_data[!duplicated(final_data$School), ]

# Display the dataset after removing duplicates
head(final_data)

Find the correlation between graduation rate and retention rate (exclude the NAs in both columns).

Solution:

# Calculate the correlation between Graduation Rate and Retention Rate
correlation <- cor(
  final_data$`Graduation Rate`,
  final_data$`Retention Rate`,
  use = "complete.obs" # Exclude rows with NA in either column
)

# Display the correlation
cat("Correlation between Graduation Rate and Retention Rate:", correlation, "\n")

## Correlation between Graduation Rate and Retention Rate: 0.6159709

Extract the values in School column into a new variable without “University” in the string. For example “Rowan University” becomes “Rowan”

Solution:

# Extract the values in the School column without "University"
school_without_university <- gsub("University", "", final_data$School)

# Trim leading and trailing whitespace
school_without_university <- trimws(school_without_university)

# Display the first few values
head(school_without_university)

## [1] "Harvard"         "Yale"            "of Pennsylvania" "Johns Hopkins"  
## [5] "Cornell"         "Tufts"

Count how many universities have “Institute” in their name.

Solution:

# Count universities with "Institute" in their name
institute_count <- sum(grepl("Institute", final_data$School))

# Display the count
cat("Number of universities with 'Institute' in their name:", institute_count, "\n")

## Number of universities with 'Institute' in their name: 17

Export the cleaned and processed dataset to a CSV file.

Solution:

# Export the cleaned dataset to a CSV file
write.csv(final_data, "cleaned_final_data.csv", row.names = FALSE)

cat("The cleaned dataset has been exported to 'cleaned_final_data.csv'.\n")

## The cleaned dataset has been exported to 'cleaned_final_data.csv'.