Morgan State University
Department of Information Science & Systems
Fall 2024
INSS 615: Data Wrangling for Visualization
Name: Ava’ Roberts
Due: Dec 1, 2024 (Sunday)
Questions
A. Scrape the College Ranked by Acceptance Rate dataset available at this link: https://www.oedb.org/rankings/acceptance-rate/#table-rankings and select the first 9 columns [Rank, School, Student to Faculty Ratio, Graduation Rate, Retention Rate, Acceptance Rate, Enrollment Rate, Institutional Aid Rate, and Default Rate] as the dataset for this assignment. [20 Points]
Hint: There are 6 pages of data, so you may want to use a for loop to automate the scraping process and combine the data from all 6 pages. This is just a suggestion—you are free to create the dataset without automating the web scrapping process.
Solution:
library(rvest)
## Warning: package 'rvest' was built under R version 4.2.3
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.2.1 ✔ dplyr 1.1.4
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.5 ✔ forcats 1.0.0
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag() masks stats::lag()
# multipage scraping using a for loop
# empty list to store data
all_data <- list()
# Loop through 6 pages
for (page_num in 1:6) {
# URL for the current page
url <- paste0("https://www.oedb.org/rankings/acceptance-rate/page/", page_num, "/")
# Read the HTML on page
page <- read_html(url)
# Extract the tables
table <- page %>%
html_node("table") %>%
html_table(fill = TRUE)
# Select columns
selected_data <- table %>%
select(Rank, School, `Student to Faculty Ratio`, `Graduation Rate`, `Retention Rate`,
`Acceptance Rate`, `Enrollment Rate`, `Institutional Aid Rate`, `Default Rate`)
# Add data to list
all_data[[page_num]] <- selected_data
}
# Combine all pages to one data frame
final_data <- bind_rows(all_data)
# Display the first few rows of the combined data
print(head(final_data))
## # A tibble: 6 × 9
## Rank School Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
## <int> <chr> <chr> <chr> <chr>
## 1 1 Harvard Unive… 7 to 1 98% 98%
## 2 2 Yale Universi… 6 to 1 97% 99%
## 3 3 University of… 6 to 1 95% 98%
## 4 4 Johns Hopkins… 10 to 1 94% 97%
## 5 5 Cornell Unive… 9 to 1 93% 97%
## 6 6 Tufts Univers… 9 to 1 93% 97%
## # ℹ abbreviated name: ¹`Student to Faculty Ratio`
## # ℹ 4 more variables: `Acceptance Rate` <chr>, `Enrollment Rate` <chr>,
## # `Institutional Aid Rate` <chr>, `Default Rate` <chr>
B. You are going to need the dataset created in Question A to answer the following questions. There are 16 questions each carrying 5 points:
Solution:
# Replace all "N/A"values with the string "NA"
final_data_cleaned <- final_data %>%
mutate(across(everything(), ~ ifelse(is.na(.) | . == "N/A", "NA", .)))
# Display the first few rows of the cleaned data
print(head(final_data_cleaned))
## # A tibble: 6 × 9
## Rank School Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
## <int> <chr> <chr> <chr> <chr>
## 1 1 Harvard Unive… 7 to 1 98% 98%
## 2 2 Yale Universi… 6 to 1 97% 99%
## 3 3 University of… 6 to 1 95% 98%
## 4 4 Johns Hopkins… 10 to 1 94% 97%
## 5 5 Cornell Unive… 9 to 1 93% 97%
## 6 6 Tufts Univers… 9 to 1 93% 97%
## # ℹ abbreviated name: ¹`Student to Faculty Ratio`
## # ℹ 4 more variables: `Acceptance Rate` <chr>, `Enrollment Rate` <chr>,
## # `Institutional Aid Rate` <chr>, `Default Rate` <chr>
Solution:
# Id the percentage columns to convert
percentage_columns <- c("Graduation Rate", "Retention Rate", "Acceptance Rate",
"Enrollment Rate", "Institutional Aid Rate", "Default Rate")
# Define a function to convert percentage strings to numeric
convert_to_numeric <- function(x) {
as.numeric(str_remove(x, "%")) / 100
}
# Apply the function to the specified percentage columns
final_data_cleaned <- final_data_cleaned %>%
mutate(across(all_of(percentage_columns), convert_to_numeric))
## Warning: There were 5 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(all_of(percentage_columns), convert_to_numeric)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warnings()]8;; to see the 4 remaining warnings.
# Display the first few rows of the cleaned data
print(head(final_data_cleaned))
## # A tibble: 6 × 9
## Rank School Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
## <int> <chr> <chr> <dbl> <dbl>
## 1 1 Harvard Unive… 7 to 1 0.98 0.98
## 2 2 Yale Universi… 6 to 1 0.97 0.99
## 3 3 University of… 6 to 1 0.95 0.98
## 4 4 Johns Hopkins… 10 to 1 0.94 0.97
## 5 5 Cornell Unive… 9 to 1 0.93 0.97
## 6 6 Tufts Univers… 9 to 1 0.93 0.97
## # ℹ abbreviated name: ¹`Student to Faculty Ratio`
## # ℹ 4 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## # `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>
Solution:
# Split the Student to Faculty Ratio column into Students and Faculty
final_data_cleaned <- final_data_cleaned %>%
mutate(
Students = as.numeric(str_extract(`Student to Faculty Ratio`, "^[0-9]+")),
Faculty = as.numeric(str_extract(`Student to Faculty Ratio`, "[0-9]+$"))
)
# Display the first few rows of data
print(head(final_data_cleaned))
## # A tibble: 6 × 11
## Rank School Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
## <int> <chr> <chr> <dbl> <dbl>
## 1 1 Harvard Unive… 7 to 1 0.98 0.98
## 2 2 Yale Universi… 6 to 1 0.97 0.99
## 3 3 University of… 6 to 1 0.95 0.98
## 4 4 Johns Hopkins… 10 to 1 0.94 0.97
## 5 5 Cornell Unive… 9 to 1 0.93 0.97
## 6 6 Tufts Univers… 9 to 1 0.93 0.97
## # ℹ abbreviated name: ¹`Student to Faculty Ratio`
## # ℹ 6 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## # `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>, Students <dbl>,
## # Faculty <dbl>
Solution:
# Count missing values in the "Default Rate" column
missing_count <- sum(is.na(final_data_cleaned$`Default Rate`))
cat("Count of missing values in 'Default Rate':", missing_count, "\n")
## Count of missing values in 'Default Rate': 291
# mutate missing values with the median
median_default_rate <- median(final_data_cleaned$`Default Rate`, na.rm = TRUE)
final_data_cleaned <- final_data_cleaned %>%
mutate(`Default Rate` = ifelse(is.na(`Default Rate`), median_default_rate, `Default Rate`))
# Display the first few rows of the updated dataset
print(head(final_data_cleaned))
## # A tibble: 6 × 11
## Rank School Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
## <int> <chr> <chr> <dbl> <dbl>
## 1 1 Harvard Unive… 7 to 1 0.98 0.98
## 2 2 Yale Universi… 6 to 1 0.97 0.99
## 3 3 University of… 6 to 1 0.95 0.98
## 4 4 Johns Hopkins… 10 to 1 0.94 0.97
## 5 5 Cornell Unive… 9 to 1 0.93 0.97
## 6 6 Tufts Univers… 9 to 1 0.93 0.97
## # ℹ abbreviated name: ¹`Student to Faculty Ratio`
## # ℹ 6 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## # `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>, Students <dbl>,
## # Faculty <dbl>
Solution:
# Filter top 50 universities
top_50_universities <- final_data_cleaned %>%
filter(Rank <= 50)
# average graduation rate
average_graduation_rate <- mean(top_50_universities$`Graduation Rate`, na.rm = TRUE)
# Display the result
cat("The average graduation rate for universities ranked in the top 50 is:", average_graduation_rate * 100, "%\n")
## The average graduation rate for universities ranked in the top 50 is: 79.18 %
Solution:
# Filter universities with retention rate above 90%
high_retention_universities <- final_data_cleaned %>%
filter(`Retention Rate` > 0.90)
# Count the number of rows
count_high_retention <- nrow(high_retention_universities)
# Display the count
cat("The number of universities with a retention rate above 90% is:", count_high_retention, "\n")
## The number of universities with a retention rate above 90% is: 98
Solution:
# Rank by enrollment rate in descending order
ranked_universities <- final_data_cleaned %>%
arrange(desc(`Enrollment Rate`))
# Display the last 6 rows of the ranked data
tail(ranked_universities, 6)
Solution:
# Load the ggplot2 library
library(ggplot2)
# Create a histogram of graduation rates
ggplot(final_data_cleaned, aes(x = `Graduation Rate`)) +
geom_histogram(binwidth = 0.05, fill = "green", color = "black", alpha = 0.7) +
labs(
title = "Histogram of Graduation Rates",
x = "Graduation Rate (Proportion)",
y = "Frequency"
) +
theme_minimal()
## Warning: Removed 6 rows containing non-finite values (`stat_bin()`).
Solution:
# Create a scatterplot of acceptance rate vs. enrollment rate
ggplot(final_data_cleaned, aes(x = `Acceptance Rate`, y = `Enrollment Rate`)) +
geom_point(color = "red", alpha = 0.7) +
labs(
title = "Scatterplot of Acceptance Rate vs. Enrollment Rate",
x = "Acceptance Rate (Proportion)",
y = "Enrollment Rate (Proportion)"
) +
theme_minimal()
## Warning: Removed 29 rows containing missing values (`geom_point()`).
Solution:
# Define aid rate bins
final_data_cleaned <- final_data_cleaned %>%
mutate(`Aid Rate Category` = cut(
`Institutional Aid Rate`,
breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1.0),
labels = c("0-20%", "20-40%", "40-60%", "60-80%", "80-100%"),
include.lowest = TRUE
))
# Calculate the average default
avg_default_rate_by_category <- final_data_cleaned %>%
group_by(`Aid Rate Category`) %>%
summarise(`Average Default Rate` = mean(`Default Rate`, na.rm = TRUE))
# Display the result
print(avg_default_rate_by_category)
## # A tibble: 5 × 2
## `Aid Rate Category` `Average Default Rate`
## <fct> <dbl>
## 1 0-20% 0.1
## 2 20-40% 0.0756
## 3 40-60% 0.0605
## 4 60-80% 0.0651
## 5 80-100% 0.0562
Solution:
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
# Normalize the acceptance rate to a scale of 0-1
final_data_cleaned <- final_data_cleaned %>%
mutate(`Acceptance Rate Normalized` = rescale(`Acceptance Rate`, to = c(0, 1)))
# Display the first 6 values of new column
head(final_data_cleaned[, c("Acceptance Rate", "Acceptance Rate Normalized")])
Solution:
# Count duplicate entries in the "School" column
duplicate_count <- final_data_cleaned %>%
filter(duplicated(School)) %>%
nrow()
cat("The number of duplicate entries in the 'School' column is:", duplicate_count, "\n")
## The number of duplicate entries in the 'School' column is: 3
# Remove duplicate entries in "School" column
final_data_cleaned <- final_data_cleaned %>%
distinct(School, .keep_all = TRUE)
# Display the result
print(final_data_cleaned)
## # A tibble: 568 × 13
## Rank School Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
## <int> <chr> <chr> <dbl> <dbl>
## 1 1 Harvard Univ… 7 to 1 0.98 0.98
## 2 2 Yale Univers… 6 to 1 0.97 0.99
## 3 3 University o… 6 to 1 0.95 0.98
## 4 4 Johns Hopkin… 10 to 1 0.94 0.97
## 5 5 Cornell Univ… 9 to 1 0.93 0.97
## 6 6 Tufts Univer… 9 to 1 0.93 0.97
## 7 7 University o… 17 to 1 0.92 0.96
## 8 8 University o… 16 to 1 0.91 0.96
## 9 9 Georgetown U… 11 to 1 0.94 0.96
## 10 10 Washington U… 8 to 1 0.93 0.96
## # ℹ 558 more rows
## # ℹ abbreviated name: ¹`Student to Faculty Ratio`
## # ℹ 8 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## # `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>, Students <dbl>,
## # Faculty <dbl>, `Aid Rate Category` <fct>,
## # `Acceptance Rate Normalized` <dbl>
Solution:
# Calculate the correlation between Graduation Rate and Retention Rate
correlation <- cor(
final_data_cleaned$`Graduation Rate`,
final_data_cleaned$`Retention Rate`,
use = "complete.obs"
)
# Display the correlation
cat("The correlation between Graduation Rate and Retention Rate is:", correlation, "\n")
## The correlation between Graduation Rate and Retention Rate is: 0.6159709
Solution:
# Create a new column without the word "University"
final_data_cleaned <- final_data_cleaned %>%
mutate(School_Short = gsub(" University", "", School))
# Display the first few rows of the new column
head(final_data_cleaned[, c("School", "School_Short")])
Solution:
# Count the number of universities with "Institute" in their name
institute_count <- sum(grepl("Institute", final_data_cleaned$School, ignore.case = TRUE))
# Display the count
cat("The number of universities with 'Institute' in their name is:", institute_count, "\n")
## The number of universities with 'Institute' in their name is: 17
Solution:
# Export the cleaned and processed dataset to a CSV file
write.csv(final_data_cleaned, file = "cleaned_processed_dataset.csv", row.names = FALSE)