Morgan State University

Department of Information Science & Systems

Fall 2024

INSS 615: Data Wrangling for Visualization

Name: Anais Williams

Due: Dec 1, 2024 (Sunday)

Questions

A. Scrape the College Ranked by Acceptance Rate dataset available at this link: https://www.oedb.org/rankings/acceptance-rate/#table-rankings and select the first 9 columns [Rank, School, Student to Faculty Ratio, Graduation Rate, Retention Rate, Acceptance Rate, Enrollment Rate, Institutional Aid Rate, and Default Rate] as the dataset for this assignment. [20 Points]

Hint: There are 6 pages of data, so you may want to use a for loop to automate the scraping process and combine the data from all 6 pages. This is just a suggestion—you are free to create the dataset without automating the web scrapping process.

Solution:

#loading libraries
library(rvest)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(stringr)


scrape_page <- function(url) {
  content <- read_html(url)
  table <- content %>% html_node("table")
  rows <- table %>% html_nodes("tr")
  data_list <- list()
  for (i in 2:length(rows)) {
    cells <- rows[i] %>% html_nodes("td") %>% html_text(trim = TRUE)
    if (length(cells) >= 9) {
      data_list[[i - 1]] <- cells[1:9]  # Get only the first 9 columns
    }
  }
  data <- do.call(rbind, data_list) %>% as.data.frame(stringsAsFactors = FALSE)
  colnames(data) <- c("Rank", "School", "Student to Faculty Ratio", 
                      "Graduation Rate", "Retention Rate", "Acceptance Rate", 
                      "Enrollment Rate", "Institutional Aid Rate", "Default Rate")
  return(data)
}
scraped_url <- "https://www.oedb.org/rankings/acceptance-rate/page/"
full_data <- data.frame()
for (page_num in 1:6) {
  url <- paste0(scraped_url, page_num, "/#table-rankings")
  page_data <- scrape_page(url)
  full_data <- bind_rows(full_data, page_data)
}
head(full_data)
##   Rank                     School Student to Faculty Ratio Graduation Rate
## 1    1         Harvard University                   7 to 1             98%
## 2    2            Yale University                   6 to 1             97%
## 3    3 University of Pennsylvania                   6 to 1             95%
## 4    4   Johns Hopkins University                  10 to 1             94%
## 5    5         Cornell University                   9 to 1             93%
## 6    6           Tufts University                   9 to 1             93%
##   Retention Rate Acceptance Rate Enrollment Rate Institutional Aid Rate
## 1            98%              6%              4%                    44%
## 2            99%              7%              5%                    52%
## 3            98%             10%              7%                    54%
## 4            97%             14%              5%                    51%
## 5            97%             15%              8%                    55%
## 6            97%             16%              7%                    43%
##   Default Rate
## 1          N/A
## 2          N/A
## 3          N/A
## 4          N/A
## 5          N/A
## 6          N/A

B. You are going to need the dataset created in Question A to answer the following questions. There are 16 questions each carrying 5 points:

  1. Replace the missing values “N/A” in the dataset with NA.

Solution:

full_data[full_data == "N/A"] <- NA
head(full_data)
##   Rank                     School Student to Faculty Ratio Graduation Rate
## 1    1         Harvard University                   7 to 1             98%
## 2    2            Yale University                   6 to 1             97%
## 3    3 University of Pennsylvania                   6 to 1             95%
## 4    4   Johns Hopkins University                  10 to 1             94%
## 5    5         Cornell University                   9 to 1             93%
## 6    6           Tufts University                   9 to 1             93%
##   Retention Rate Acceptance Rate Enrollment Rate Institutional Aid Rate
## 1            98%              6%              4%                    44%
## 2            99%              7%              5%                    52%
## 3            98%             10%              7%                    54%
## 4            97%             14%              5%                    51%
## 5            97%             15%              8%                    55%
## 6            97%             16%              7%                    43%
##   Default Rate
## 1         <NA>
## 2         <NA>
## 3         <NA>
## 4         <NA>
## 5         <NA>
## 6         <NA>
  1. Convert percentage columns (e.g., Graduation Rate) to numeric format.

Solution:

percent_columns <- c("Graduation Rate", "Retention Rate", "Acceptance Rate", 
                        "Enrollment Rate", "Institutional Aid Rate", "Default Rate")
full_data <- full_data %>%
  mutate(across(all_of(percent_columns), 
                ~ as.numeric(gsub("%", "", .))))
head(full_data)
##   Rank                     School Student to Faculty Ratio Graduation Rate
## 1    1         Harvard University                   7 to 1              98
## 2    2            Yale University                   6 to 1              97
## 3    3 University of Pennsylvania                   6 to 1              95
## 4    4   Johns Hopkins University                  10 to 1              94
## 5    5         Cornell University                   9 to 1              93
## 6    6           Tufts University                   9 to 1              93
##   Retention Rate Acceptance Rate Enrollment Rate Institutional Aid Rate
## 1             98               6               4                     44
## 2             99               7               5                     52
## 3             98              10               7                     54
## 4             97              14               5                     51
## 5             97              15               8                     55
## 6             97              16               7                     43
##   Default Rate
## 1           NA
## 2           NA
## 3           NA
## 4           NA
## 5           NA
## 6           NA
  1. Transform the “Student to Faculty Ratio” column into two separate numeric columns: Students and Faculty.

Solution:

full_data$`Student to Faculty Ratio` <- gsub(" to ", ":", full_data$`Student to Faculty Ratio`)
full_data <- full_data %>%
  separate(`Student to Faculty Ratio`, into = c("Students", "Faculty"), sep = ":", convert = TRUE)
full_data$Students <- as.numeric(full_data$Students)
full_data$Faculty <- as.numeric(full_data$Faculty)
head(full_data)
##   Rank                     School Students Faculty Graduation Rate
## 1    1         Harvard University        7       1              98
## 2    2            Yale University        6       1              97
## 3    3 University of Pennsylvania        6       1              95
## 4    4   Johns Hopkins University       10       1              94
## 5    5         Cornell University        9       1              93
## 6    6           Tufts University        9       1              93
##   Retention Rate Acceptance Rate Enrollment Rate Institutional Aid Rate
## 1             98               6               4                     44
## 2             99               7               5                     52
## 3             98              10               7                     54
## 4             97              14               5                     51
## 5             97              15               8                     55
## 6             97              16               7                     43
##   Default Rate
## 1           NA
## 2           NA
## 3           NA
## 4           NA
## 5           NA
## 6           NA
  1. What is the count of missing values in the “Default Rate” column? Impute the missing values in the “Default Rate” column with the median value.

Solution:

missing_count <- sum(is.na(full_data$`Default Rate`))
cat("Count of missing values in 'Default Rate' column:", missing_count, "\n")
## Count of missing values in 'Default Rate' column: 291
median_value <- median(full_data$`Default Rate`, na.rm = TRUE)
full_data$`Default Rate`[is.na(full_data$`Default Rate`)] <- median_value
cat("Count of missing values after imputation:", sum(is.na(full_data$`Default Rate`)), "\n")
## Count of missing values after imputation: 0
  1. Find the average graduation rate for universities ranked in the top 50.

Solution:

top_50_colleges <- full_data %>% filter(Rank <= 50)
avg_graduation_rate <- mean(top_50_colleges$`Graduation Rate`, na.rm = TRUE)
cat("Average Graduation Rate for Top 50 Universities:", avg_graduation_rate, "\n")
## Average Graduation Rate for Top 50 Universities: 65.9009
  1. Filter universities with a retention rate above 90% and find the count of rows in the subset.

Solution:

high_retention <- full_data %>% filter(`Retention Rate` > 90)
count_high_retention <- nrow(high_retention)
cat("Number of universities with a retention rate above 90%:", count_high_retention, "\n")
## Number of universities with a retention rate above 90%: 98
  1. Rank universities by enrollment rate in descending order and display the last 6 rows.

Solution:

full_data$`Enrollment Rate` <- as.numeric(as.character(full_data$`Enrollment Rate`))
sorted_data <- full_data %>% arrange(desc(`Enrollment Rate`))
last_6_rows <- tail(sorted_data, 6)
head(last_6_rows)
##     Rank                                  School Students Faculty
## 566  566            Southeastern Baptist College        6       1
## 567  567              Touro University Worldwide       13       1
## 568  568                          Unitek College       16       1
## 569  569            University of Western States       16       1
## 570  570                Virginia Baptist College        5       1
## 571  571 West Virginia Junior College-Morgantown       25       1
##     Graduation Rate Retention Rate Acceptance Rate Enrollment Rate
## 566              80             40              NA              NA
## 567              NA            100              NA              NA
## 568              NA            100              NA              NA
## 569              88             NA              NA              NA
## 570             100             25              NA              NA
## 571              53             69              NA              NA
##     Institutional Aid Rate Default Rate
## 566                     21            6
## 567                     76            4
## 568                     20            6
## 569                     56            6
## 570                     38            6
## 571                     88            6
#Exlcuding the NAs from the list
sorted_data <- full_data %>%
  filter(!is.na(`Enrollment Rate`)) %>%
  arrange(desc(`Enrollment Rate`))
last_6_rows <- tail(sorted_data, 6)
head(last_6_rows)
##     Rank                   School Students Faculty Graduation Rate
## 537   64      Spring Hill College       14       1              53
## 538  211       Arcadia University       10       1              58
## 539    2          Yale University        6       1              97
## 540    4 Johns Hopkins University       10       1              94
## 541   98       Fordham University       14       1              81
## 542    1       Harvard University        7       1              98
##     Retention Rate Acceptance Rate Enrollment Rate Institutional Aid Rate
## 537             81              41               6                     99
## 538             79              59               6                     94
## 539             99               7               5                     52
## 540             97              14               5                     51
## 541             91              48               5                     80
## 542             98               6               4                     44
##     Default Rate
## 537            8
## 538            6
## 539            6
## 540            6
## 541            5
## 542            6
  1. Create a histogram of graduation rates using ggplot2 library.

Solution:

#Load library
library(ggplot2)

ggplot(full_data, aes(x = `Graduation Rate`)) +
  geom_histogram(binwidth = 2, fill = "magenta", color = "pink", alpha = 0.7) +
  labs(
    title = "Histogram of Graduation Rates",
    x = "Graduation Rate (%)",
    y = "Frequency"
  ) +
  theme_minimal()
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_bin()`).

  1. Plot a scatterplot between acceptance rate and enrollment rate using ggplot2 library.

Solution:

ggplot(full_data, aes(x = `Acceptance Rate`, y = `Enrollment Rate`)) +
  geom_point(color = "magenta", alpha = 0.7) +   # Scatter plot points
  labs(
    title = "Scatterplot of Acceptance Rate vs Enrollment Rate",
    x = "Acceptance Rate (%)",
    y = "Enrollment Rate (%)"
  ) +
  theme_minimal()
## Warning: Removed 29 rows containing missing values or values outside the scale range
## (`geom_point()`).

  1. Calculate the average default rate by aid rate category (e.g., grouped into ranges like 0-20%, 20-40%). Display the categories.

Solution:

full_data <- full_data %>%
  mutate(
    AidRateCategory = case_when(
      `Institutional Aid Rate` >= 0 & `Institutional Aid Rate` <= 20 ~ "0-20%",
      `Institutional Aid Rate` > 20 & `Institutional Aid Rate` <= 40 ~ "20-40%",
      `Institutional Aid Rate` > 40 & `Institutional Aid Rate` <= 60 ~ "40-60%",
      `Institutional Aid Rate` > 60 & `Institutional Aid Rate` <= 80 ~ "60-80%",
      `Institutional Aid Rate` > 80 & `Institutional Aid Rate` <= 100 ~ "80-100%",
      TRUE ~ "Other" # For values outside the expected range, if any
    )
  )

avg_default_rate_by_aid_cat <- full_data %>%
  group_by(AidRateCategory) %>%
  summarise(AverageDefaultRate = mean(`Default Rate`, na.rm = TRUE))
head(avg_default_rate_by_aid_cat)
## # A tibble: 5 × 2
##   AidRateCategory AverageDefaultRate
##   <chr>                        <dbl>
## 1 0-20%                        10   
## 2 20-40%                        7.56
## 3 40-60%                        6.05
## 4 60-80%                        6.51
## 5 80-100%                       5.62
  1. Normalize the acceptance rate to a scale of 0-1 and save in a new column “Acceptance Rate Normalized”. Display the first 6 values.

Solution:

min_rate <- min(full_data$`Acceptance Rate`, na.rm = TRUE)
max_rate <- max(full_data$`Acceptance Rate`, na.rm = TRUE)
full_data$"Acceptance Rate Normalized" <- (full_data$`Acceptance Rate` - min_rate) / (max_rate - min_rate)

head(full_data$`Acceptance Rate Normalized`)
## [1] 0.00000000 0.01063830 0.04255319 0.08510638 0.09574468 0.10638298
  1. What is the count of the duplicate entries in the “School” column? Remove duplicate university entries.

Solution:

duplicate_count <- sum(duplicated(full_data$School))
print(paste("Count of duplicate entries in the 'School' column:", duplicate_count))
## [1] "Count of duplicate entries in the 'School' column: 3"
full_unique <- full_data %>%
  distinct(School, .keep_all = TRUE)
  1. Find the correlation between graduation rate and retention rate (exclude the NAs in both columns).

Solution:

correlation <- cor(full_data$`Graduation Rate`, full_data$`Retention Rate`, use = "complete.obs")
print(paste("The correlation between Graduation Rate and Retention Rate is:", correlation))
## [1] "The correlation between Graduation Rate and Retention Rate is: 0.616971168087149"
  1. Extract the values in School column into a new variable without “University” in the string. For example “Rowan University” becomes “Rowan”

Solution:

full_data$SchoolShort <- str_trim(full_data$School)
full_data$SchoolShort <- str_remove_all(full_data$School, "\\bUniversity\\b")
full_data$SchoolShort <- str_trim(full_data$SchoolShort)

head(full_data$SchoolShort, 10)
##  [1] "Harvard"                   "Yale"                     
##  [3] "of Pennsylvania"           "Johns Hopkins"            
##  [5] "Cornell"                   "Tufts"                    
##  [7] "of California-Berkeley"    "of California-Los Angeles"
##  [9] "Georgetown"                "Washington  in St Louis"
  1. Count how many universities have “Institute” in their name.

Solution:

institute_count <- sum(str_detect(full_data$School, "\\bInstitute\\b"))
print(institute_count)
## [1] 17
  1. Export the cleaned and processed dataset to a CSV file.

Solution:

write.csv(full_data, "cleaned_college_data.csv", row.names = FALSE)