Morgan State University
Department of Information Science & Systems
Fall 2024
INSS 615: Data Wrangling for Visualization
Name: Anais Williams
Due: Dec 1, 2024 (Sunday)
Questions
A. Scrape the College Ranked by Acceptance Rate dataset available at this link: https://www.oedb.org/rankings/acceptance-rate/#table-rankings and select the first 9 columns [Rank, School, Student to Faculty Ratio, Graduation Rate, Retention Rate, Acceptance Rate, Enrollment Rate, Institutional Aid Rate, and Default Rate] as the dataset for this assignment. [20 Points]
Hint: There are 6 pages of data, so you may want to use a for loop to automate the scraping process and combine the data from all 6 pages. This is just a suggestion—you are free to create the dataset without automating the web scrapping process.
Solution:
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(stringr)
scrape_page <- function(url) {
content <- read_html(url)
table <- content %>% html_node("table")
rows <- table %>% html_nodes("tr")
data_list <- list()
for (i in 2:length(rows)) {
cells <- rows[i] %>% html_nodes("td") %>% html_text(trim = TRUE)
if (length(cells) >= 9) {
data_list[[i - 1]] <- cells[1:9] # Get only the first 9 columns
}
}
data <- do.call(rbind, data_list) %>% as.data.frame(stringsAsFactors = FALSE)
colnames(data) <- c("Rank", "School", "Student to Faculty Ratio",
"Graduation Rate", "Retention Rate", "Acceptance Rate",
"Enrollment Rate", "Institutional Aid Rate", "Default Rate")
return(data)
}
scraped_url <- "https://www.oedb.org/rankings/acceptance-rate/page/"
full_data <- data.frame()
for (page_num in 1:6) {
url <- paste0(scraped_url, page_num, "/#table-rankings")
page_data <- scrape_page(url)
full_data <- bind_rows(full_data, page_data)
}
head(full_data)## Rank School Student to Faculty Ratio Graduation Rate
## 1 1 Harvard University 7 to 1 98%
## 2 2 Yale University 6 to 1 97%
## 3 3 University of Pennsylvania 6 to 1 95%
## 4 4 Johns Hopkins University 10 to 1 94%
## 5 5 Cornell University 9 to 1 93%
## 6 6 Tufts University 9 to 1 93%
## Retention Rate Acceptance Rate Enrollment Rate Institutional Aid Rate
## 1 98% 6% 4% 44%
## 2 99% 7% 5% 52%
## 3 98% 10% 7% 54%
## 4 97% 14% 5% 51%
## 5 97% 15% 8% 55%
## 6 97% 16% 7% 43%
## Default Rate
## 1 N/A
## 2 N/A
## 3 N/A
## 4 N/A
## 5 N/A
## 6 N/A
B. You are going to need the dataset created in Question A to answer the following questions. There are 16 questions each carrying 5 points:
- Replace the missing values “N/A” in the dataset with NA.
Solution:
## Rank School Student to Faculty Ratio Graduation Rate
## 1 1 Harvard University 7 to 1 98%
## 2 2 Yale University 6 to 1 97%
## 3 3 University of Pennsylvania 6 to 1 95%
## 4 4 Johns Hopkins University 10 to 1 94%
## 5 5 Cornell University 9 to 1 93%
## 6 6 Tufts University 9 to 1 93%
## Retention Rate Acceptance Rate Enrollment Rate Institutional Aid Rate
## 1 98% 6% 4% 44%
## 2 99% 7% 5% 52%
## 3 98% 10% 7% 54%
## 4 97% 14% 5% 51%
## 5 97% 15% 8% 55%
## 6 97% 16% 7% 43%
## Default Rate
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
- Convert percentage columns (e.g., Graduation Rate) to numeric format.
Solution:
percent_columns <- c("Graduation Rate", "Retention Rate", "Acceptance Rate",
"Enrollment Rate", "Institutional Aid Rate", "Default Rate")
full_data <- full_data %>%
mutate(across(all_of(percent_columns),
~ as.numeric(gsub("%", "", .))))
head(full_data)## Rank School Student to Faculty Ratio Graduation Rate
## 1 1 Harvard University 7 to 1 98
## 2 2 Yale University 6 to 1 97
## 3 3 University of Pennsylvania 6 to 1 95
## 4 4 Johns Hopkins University 10 to 1 94
## 5 5 Cornell University 9 to 1 93
## 6 6 Tufts University 9 to 1 93
## Retention Rate Acceptance Rate Enrollment Rate Institutional Aid Rate
## 1 98 6 4 44
## 2 99 7 5 52
## 3 98 10 7 54
## 4 97 14 5 51
## 5 97 15 8 55
## 6 97 16 7 43
## Default Rate
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
- Transform the “Student to Faculty Ratio” column into two separate numeric columns: Students and Faculty.
Solution:
full_data$`Student to Faculty Ratio` <- gsub(" to ", ":", full_data$`Student to Faculty Ratio`)
full_data <- full_data %>%
separate(`Student to Faculty Ratio`, into = c("Students", "Faculty"), sep = ":", convert = TRUE)
full_data$Students <- as.numeric(full_data$Students)
full_data$Faculty <- as.numeric(full_data$Faculty)
head(full_data)## Rank School Students Faculty Graduation Rate
## 1 1 Harvard University 7 1 98
## 2 2 Yale University 6 1 97
## 3 3 University of Pennsylvania 6 1 95
## 4 4 Johns Hopkins University 10 1 94
## 5 5 Cornell University 9 1 93
## 6 6 Tufts University 9 1 93
## Retention Rate Acceptance Rate Enrollment Rate Institutional Aid Rate
## 1 98 6 4 44
## 2 99 7 5 52
## 3 98 10 7 54
## 4 97 14 5 51
## 5 97 15 8 55
## 6 97 16 7 43
## Default Rate
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
- What is the count of missing values in the “Default Rate” column? Impute the missing values in the “Default Rate” column with the median value.
Solution:
missing_count <- sum(is.na(full_data$`Default Rate`))
cat("Count of missing values in 'Default Rate' column:", missing_count, "\n")## Count of missing values in 'Default Rate' column: 291
median_value <- median(full_data$`Default Rate`, na.rm = TRUE)
full_data$`Default Rate`[is.na(full_data$`Default Rate`)] <- median_value
cat("Count of missing values after imputation:", sum(is.na(full_data$`Default Rate`)), "\n")## Count of missing values after imputation: 0
- Find the average graduation rate for universities ranked in the top 50.
Solution:
top_50_colleges <- full_data %>% filter(Rank <= 50)
avg_graduation_rate <- mean(top_50_colleges$`Graduation Rate`, na.rm = TRUE)
cat("Average Graduation Rate for Top 50 Universities:", avg_graduation_rate, "\n")## Average Graduation Rate for Top 50 Universities: 65.9009
- Filter universities with a retention rate above 90% and find the count of rows in the subset.
Solution:
high_retention <- full_data %>% filter(`Retention Rate` > 90)
count_high_retention <- nrow(high_retention)
cat("Number of universities with a retention rate above 90%:", count_high_retention, "\n")## Number of universities with a retention rate above 90%: 98
- Rank universities by enrollment rate in descending order and display the last 6 rows.
Solution:
full_data$`Enrollment Rate` <- as.numeric(as.character(full_data$`Enrollment Rate`))
sorted_data <- full_data %>% arrange(desc(`Enrollment Rate`))
last_6_rows <- tail(sorted_data, 6)
head(last_6_rows)## Rank School Students Faculty
## 566 566 Southeastern Baptist College 6 1
## 567 567 Touro University Worldwide 13 1
## 568 568 Unitek College 16 1
## 569 569 University of Western States 16 1
## 570 570 Virginia Baptist College 5 1
## 571 571 West Virginia Junior College-Morgantown 25 1
## Graduation Rate Retention Rate Acceptance Rate Enrollment Rate
## 566 80 40 NA NA
## 567 NA 100 NA NA
## 568 NA 100 NA NA
## 569 88 NA NA NA
## 570 100 25 NA NA
## 571 53 69 NA NA
## Institutional Aid Rate Default Rate
## 566 21 6
## 567 76 4
## 568 20 6
## 569 56 6
## 570 38 6
## 571 88 6
#Exlcuding the NAs from the list
sorted_data <- full_data %>%
filter(!is.na(`Enrollment Rate`)) %>%
arrange(desc(`Enrollment Rate`))
last_6_rows <- tail(sorted_data, 6)
head(last_6_rows)## Rank School Students Faculty Graduation Rate
## 537 64 Spring Hill College 14 1 53
## 538 211 Arcadia University 10 1 58
## 539 2 Yale University 6 1 97
## 540 4 Johns Hopkins University 10 1 94
## 541 98 Fordham University 14 1 81
## 542 1 Harvard University 7 1 98
## Retention Rate Acceptance Rate Enrollment Rate Institutional Aid Rate
## 537 81 41 6 99
## 538 79 59 6 94
## 539 99 7 5 52
## 540 97 14 5 51
## 541 91 48 5 80
## 542 98 6 4 44
## Default Rate
## 537 8
## 538 6
## 539 6
## 540 6
## 541 5
## 542 6
- Create a histogram of graduation rates using ggplot2 library.
Solution:
#Load library
library(ggplot2)
ggplot(full_data, aes(x = `Graduation Rate`)) +
geom_histogram(binwidth = 2, fill = "magenta", color = "pink", alpha = 0.7) +
labs(
title = "Histogram of Graduation Rates",
x = "Graduation Rate (%)",
y = "Frequency"
) +
theme_minimal()## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_bin()`).
- Plot a scatterplot between acceptance rate and enrollment rate using ggplot2 library.
Solution:
ggplot(full_data, aes(x = `Acceptance Rate`, y = `Enrollment Rate`)) +
geom_point(color = "magenta", alpha = 0.7) + # Scatter plot points
labs(
title = "Scatterplot of Acceptance Rate vs Enrollment Rate",
x = "Acceptance Rate (%)",
y = "Enrollment Rate (%)"
) +
theme_minimal()## Warning: Removed 29 rows containing missing values or values outside the scale range
## (`geom_point()`).
- Calculate the average default rate by aid rate category (e.g., grouped into ranges like 0-20%, 20-40%). Display the categories.
Solution:
full_data <- full_data %>%
mutate(
AidRateCategory = case_when(
`Institutional Aid Rate` >= 0 & `Institutional Aid Rate` <= 20 ~ "0-20%",
`Institutional Aid Rate` > 20 & `Institutional Aid Rate` <= 40 ~ "20-40%",
`Institutional Aid Rate` > 40 & `Institutional Aid Rate` <= 60 ~ "40-60%",
`Institutional Aid Rate` > 60 & `Institutional Aid Rate` <= 80 ~ "60-80%",
`Institutional Aid Rate` > 80 & `Institutional Aid Rate` <= 100 ~ "80-100%",
TRUE ~ "Other" # For values outside the expected range, if any
)
)
avg_default_rate_by_aid_cat <- full_data %>%
group_by(AidRateCategory) %>%
summarise(AverageDefaultRate = mean(`Default Rate`, na.rm = TRUE))
head(avg_default_rate_by_aid_cat)## # A tibble: 5 × 2
## AidRateCategory AverageDefaultRate
## <chr> <dbl>
## 1 0-20% 10
## 2 20-40% 7.56
## 3 40-60% 6.05
## 4 60-80% 6.51
## 5 80-100% 5.62
- Normalize the acceptance rate to a scale of 0-1 and save in a new column “Acceptance Rate Normalized”. Display the first 6 values.
Solution:
min_rate <- min(full_data$`Acceptance Rate`, na.rm = TRUE)
max_rate <- max(full_data$`Acceptance Rate`, na.rm = TRUE)
full_data$"Acceptance Rate Normalized" <- (full_data$`Acceptance Rate` - min_rate) / (max_rate - min_rate)
head(full_data$`Acceptance Rate Normalized`)## [1] 0.00000000 0.01063830 0.04255319 0.08510638 0.09574468 0.10638298
- What is the count of the duplicate entries in the “School” column? Remove duplicate university entries.
Solution:
duplicate_count <- sum(duplicated(full_data$School))
print(paste("Count of duplicate entries in the 'School' column:", duplicate_count))## [1] "Count of duplicate entries in the 'School' column: 3"
- Find the correlation between graduation rate and retention rate (exclude the NAs in both columns).
Solution:
correlation <- cor(full_data$`Graduation Rate`, full_data$`Retention Rate`, use = "complete.obs")
print(paste("The correlation between Graduation Rate and Retention Rate is:", correlation))## [1] "The correlation between Graduation Rate and Retention Rate is: 0.616971168087149"
- Extract the values in School column into a new variable without “University” in the string. For example “Rowan University” becomes “Rowan”
Solution:
full_data$SchoolShort <- str_trim(full_data$School)
full_data$SchoolShort <- str_remove_all(full_data$School, "\\bUniversity\\b")
full_data$SchoolShort <- str_trim(full_data$SchoolShort)
head(full_data$SchoolShort, 10)## [1] "Harvard" "Yale"
## [3] "of Pennsylvania" "Johns Hopkins"
## [5] "Cornell" "Tufts"
## [7] "of California-Berkeley" "of California-Los Angeles"
## [9] "Georgetown" "Washington in St Louis"
- Count how many universities have “Institute” in their name.
Solution:
## [1] 17
- Export the cleaned and processed dataset to a CSV file.
Solution: