Morgan State University
Department of Information Science & Systems
Fall 2024
INSS 615: Data Wrangling for Visualization
Name: Darren Mitchell
Due: Dec 1, 2024 (Sunday)
Questions
A. Scrape the College Ranked by Acceptance Rate dataset available at this link: https://www.oedb.org/rankings/acceptance-rate/#table-rankings and select the first 9 columns [Rank, School, Student to Faculty Ratio, Graduation Rate, Retention Rate, Acceptance Rate, Enrollment Rate, Institutional Aid Rate, and Default Rate] as the dataset for this assignment. [20 Points]
Hint: There are 6 pages of data, so you may want to use a for loop to automate the scraping process and combine the data from all 6 pages. This is just a suggestion—you are free to create the dataset without automating the web scrapping process.
Solution:
library(rvest)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# multipage scraping using a for loop
# Load necessary libraries
library(rvest)
library(tidyverse)
# Define the URLs for each page (update these manually based on actual page URLs)
urls <- c(
"https://www.oedb.org/rankings/acceptance-rate/#table-rankings",
"https://www.oedb.org/rankings/acceptance-rate/page/2/#table-rankings",
"https://www.oedb.org/rankings/acceptance-rate/page/3/#table-rankings",
"https://www.oedb.org/rankings/acceptance-rate/page/4/#table-rankings",
"https://www.oedb.org/rankings/acceptance-rate/page/5/#table-rankings",
"https://www.oedb.org/rankings/acceptance-rate/page/6/#table-rankings"
# Add more URLs if there are more pages
)
# Function to scrape and process data from a single URL
scrape_page <- function(url) {
# Read the HTML content
web_page <- read_html(url)
# Extract the table and select columns
college_table <- web_page %>%
html_node("table") %>%
html_table(fill = TRUE) %>%
select(
Rank = 1,
School = 2,
`Student to Faculty Ratio` = 3,
`Graduation Rate` = 4,
`Retention Rate` = 5,
`Acceptance Rate` = 6,
`Enrollment Rate` = 7,
`Institutional Aid Rate` = 8,
`Default Rate` = 9
)
return(college_table)
}
# Scrape and combine data from all pages
final_data <- map_dfr(urls, scrape_page)
# View the final dataset
print(final_data)
## # A tibble: 571 × 9
## Rank School Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
## <int> <chr> <chr> <chr> <chr>
## 1 1 Harvard Univ… 7 to 1 98% 98%
## 2 2 Yale Univers… 6 to 1 97% 99%
## 3 3 University o… 6 to 1 95% 98%
## 4 4 Johns Hopkin… 10 to 1 94% 97%
## 5 5 Cornell Univ… 9 to 1 93% 97%
## 6 6 Tufts Univer… 9 to 1 93% 97%
## 7 7 University o… 17 to 1 92% 96%
## 8 8 University o… 16 to 1 91% 96%
## 9 9 Georgetown U… 11 to 1 94% 96%
## 10 10 Washington U… 8 to 1 93% 96%
## # ℹ 561 more rows
## # ℹ abbreviated name: ¹`Student to Faculty Ratio`
## # ℹ 4 more variables: `Acceptance Rate` <chr>, `Enrollment Rate` <chr>,
## # `Institutional Aid Rate` <chr>, `Default Rate` <chr>
# Optional: Save the final dataset as a CSV file
write_csv(final_data, "college_acceptance_data_full.csv")
B. You are going to need the dataset created in Question A to answer the following questions. There are 16 questions each carrying 5 points:
Solution:
# Replace "N/A" with NA
final_data <- final_data %>%
mutate(across(everything(), ~ na_if(as.character(.), "N/A")))
# View the updated dataset
print(final_data)
## # A tibble: 571 × 9
## Rank School Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
## <chr> <chr> <chr> <chr> <chr>
## 1 1 Harvard Univ… 7 to 1 98% 98%
## 2 2 Yale Univers… 6 to 1 97% 99%
## 3 3 University o… 6 to 1 95% 98%
## 4 4 Johns Hopkin… 10 to 1 94% 97%
## 5 5 Cornell Univ… 9 to 1 93% 97%
## 6 6 Tufts Univer… 9 to 1 93% 97%
## 7 7 University o… 17 to 1 92% 96%
## 8 8 University o… 16 to 1 91% 96%
## 9 9 Georgetown U… 11 to 1 94% 96%
## 10 10 Washington U… 8 to 1 93% 96%
## # ℹ 561 more rows
## # ℹ abbreviated name: ¹`Student to Faculty Ratio`
## # ℹ 4 more variables: `Acceptance Rate` <chr>, `Enrollment Rate` <chr>,
## # `Institutional Aid Rate` <chr>, `Default Rate` <chr>
Solution:
# Replace "N/A" with NA and handle percentage columns
final_data <- final_data %>%
mutate(across(everything(), ~ na_if(as.character(.), "N/A"))) %>%
mutate(
`Graduation Rate` = as.numeric(str_remove(`Graduation Rate`, "%")),
`Retention Rate` = as.numeric(str_remove(`Retention Rate`, "%")),
`Acceptance Rate` = as.numeric(str_remove(`Acceptance Rate`, "%")),
`Enrollment Rate` = as.numeric(str_remove(`Enrollment Rate`, "%")),
`Institutional Aid Rate` = as.numeric(str_remove(`Institutional Aid Rate`, "%")),
`Default Rate` = as.numeric(str_remove(`Default Rate`, "%"))
)
# View the updated dataset
print(final_data)
## # A tibble: 571 × 9
## Rank School Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
## <chr> <chr> <chr> <dbl> <dbl>
## 1 1 Harvard Univ… 7 to 1 98 98
## 2 2 Yale Univers… 6 to 1 97 99
## 3 3 University o… 6 to 1 95 98
## 4 4 Johns Hopkin… 10 to 1 94 97
## 5 5 Cornell Univ… 9 to 1 93 97
## 6 6 Tufts Univer… 9 to 1 93 97
## 7 7 University o… 17 to 1 92 96
## 8 8 University o… 16 to 1 91 96
## 9 9 Georgetown U… 11 to 1 94 96
## 10 10 Washington U… 8 to 1 93 96
## # ℹ 561 more rows
## # ℹ abbreviated name: ¹`Student to Faculty Ratio`
## # ℹ 4 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## # `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>
Solution:
# Transform the "Student to Faculty Ratio" column into two numeric columns
final_data <- final_data %>%
separate(
`Student to Faculty Ratio`, # Column to split
into = c("Students", "Faculty"), # Names of new columns
sep = ":", # Separator is the colon (:)
convert = TRUE # Automatically convert to numeric
)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 571 rows [1, 2, 3, 4, 5,
## 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
# View the transformed dataset
print(final_data)
## # A tibble: 571 × 10
## Rank School Students Faculty `Graduation Rate` `Retention Rate`
## <chr> <chr> <chr> <lgl> <dbl> <dbl>
## 1 1 Harvard University 7 to 1 NA 98 98
## 2 2 Yale University 6 to 1 NA 97 99
## 3 3 University of Penn… 6 to 1 NA 95 98
## 4 4 Johns Hopkins Univ… 10 to 1 NA 94 97
## 5 5 Cornell University 9 to 1 NA 93 97
## 6 6 Tufts University 9 to 1 NA 93 97
## 7 7 University of Cali… 17 to 1 NA 92 96
## 8 8 University of Cali… 16 to 1 NA 91 96
## 9 9 Georgetown Univers… 11 to 1 NA 94 96
## 10 10 Washington Univers… 8 to 1 NA 93 96
## # ℹ 561 more rows
## # ℹ 4 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## # `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>
Solution:
# Count missing values in the "Default Rate" column
missing_count <- sum(is.na(final_data$`Default Rate`))
cat("Count of missing values in 'Default Rate':", missing_count, "\n")
## Count of missing values in 'Default Rate': 291
# Impute missing values with the median
final_data <- final_data %>%
mutate(
`Default Rate` = ifelse(
is.na(`Default Rate`),
median(`Default Rate`, na.rm = TRUE),
`Default Rate`
)
)
# View the dataset after imputation
print(final_data)
## # A tibble: 571 × 10
## Rank School Students Faculty `Graduation Rate` `Retention Rate`
## <chr> <chr> <chr> <lgl> <dbl> <dbl>
## 1 1 Harvard University 7 to 1 NA 98 98
## 2 2 Yale University 6 to 1 NA 97 99
## 3 3 University of Penn… 6 to 1 NA 95 98
## 4 4 Johns Hopkins Univ… 10 to 1 NA 94 97
## 5 5 Cornell University 9 to 1 NA 93 97
## 6 6 Tufts University 9 to 1 NA 93 97
## 7 7 University of Cali… 17 to 1 NA 92 96
## 8 8 University of Cali… 16 to 1 NA 91 96
## 9 9 Georgetown Univers… 11 to 1 NA 94 96
## 10 10 Washington Univers… 8 to 1 NA 93 96
## # ℹ 561 more rows
## # ℹ 4 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## # `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>
Solution:
# Filter for universities ranked in the top 50 and calculate the average graduation rate
average_graduation_rate <- final_data %>%
filter(Rank <= 50) %>%
summarise(Average_Graduation_Rate = mean(`Graduation Rate`, na.rm = TRUE))
# Print the result
print(average_graduation_rate)
## # A tibble: 1 × 1
## Average_Graduation_Rate
## <dbl>
## 1 65.9
Solution:
# Filter universities with retention rate above 90% and count the rows
count_above_90 <- final_data %>%
filter(`Retention Rate` > 90) %>%
summarise(Count = n())
# Print the result
print(count_above_90)
## # A tibble: 1 × 1
## Count
## <int>
## 1 98
Solution:
# Rank universities by enrollment rate in descending order and display the last 6 rows
last_6_universities <- final_data %>%
arrange(desc(`Enrollment Rate`)) %>%
tail(6)
# Print the result
print(last_6_universities)
## # A tibble: 6 × 10
## Rank School Students Faculty `Graduation Rate` `Retention Rate`
## <chr> <chr> <chr> <lgl> <dbl> <dbl>
## 1 566 Southeastern Baptis… 6 to 1 NA 80 40
## 2 567 Touro University Wo… 13 to 1 NA NA 100
## 3 568 Unitek College 16 to 1 NA NA 100
## 4 569 University of Weste… 16 to 1 NA 88 NA
## 5 570 Virginia Baptist Co… 5 to 1 NA 100 25
## 6 571 West Virginia Junio… 25 to 1 NA 53 69
## # ℹ 4 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## # `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>
Solution:
# Load ggplot2 library
library(ggplot2)
# Create a histogram of graduation rates
ggplot(final_data, aes(x = `Graduation Rate`)) +
geom_histogram(binwidth = 5, fill = "blue", color = "black", alpha = 0.7) +
labs(
title = "Histogram of Graduation Rates",
x = "Graduation Rate (%)",
y = "Frequency"
) +
theme_minimal()
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_bin()`).
Solution:
# Load ggplot2 library
library(ggplot2)
# Create a scatterplot of acceptance rate vs. enrollment rate
ggplot(final_data, aes(x = `Acceptance Rate`, y = `Enrollment Rate`)) +
geom_point(color = "blue", alpha = 0.7, size = 3) +
labs(
title = "Scatterplot of Acceptance Rate vs. Enrollment Rate",
x = "Acceptance Rate (%)",
y = "Enrollment Rate (%)"
) +
theme_minimal()
## Warning: Removed 29 rows containing missing values or values outside the scale range
## (`geom_point()`).
Solution:
# Define aid rate categories
final_data$Aid_Rate_Category <- cut(
final_data$`Institutional Aid Rate`,
breaks = c(0, 20, 40, 60, 80, 100),
labels = c("0-20%", "20-40%", "40-60%", "60-80%", "80-100%"),
include.lowest = TRUE
)
# Calculate the average default rate by aid rate category
category_levels <- levels(final_data$Aid_Rate_Category)
average_default_rate <- sapply(category_levels, function(category) {
mean(final_data$`Default Rate`[final_data$Aid_Rate_Category == category], na.rm = TRUE)
})
# Combine categories and averages into a data frame
result <- data.frame(
Aid_Rate_Category = category_levels,
Average_Default_Rate = average_default_rate
)
# Display the result
print(result)
## Aid_Rate_Category Average_Default_Rate
## 0-20% 0-20% 10.000000
## 20-40% 20-40% 7.555556
## 40-60% 40-60% 6.052980
## 60-80% 60-80% 6.514451
## 80-100% 80-100% 5.619469
Solution:
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
# Load the scales library
library(scales)
# Normalize the Acceptance Rate to a scale of 0-1
final_data$`Acceptance Rate Normalized` <- rescale(final_data$`Acceptance Rate`, to = c(0, 1))
# Display the first 6 values of the normalized column
head(final_data[, c("Acceptance Rate", "Acceptance Rate Normalized")])
Solution:
# Count duplicate entries in the "School" column
duplicate_count <- sum(duplicated(final_data$School))
cat("Count of duplicate entries in 'School':", duplicate_count, "\n")
## Count of duplicate entries in 'School': 3
# Remove duplicate university entries based on the "School" column
final_data <- final_data[!duplicated(final_data$School), ]
# Display the dataset after removing duplicates
head(final_data)
Solution:
# Calculate the correlation between Graduation Rate and Retention Rate
correlation <- cor(
final_data$`Graduation Rate`,
final_data$`Retention Rate`,
use = "complete.obs" # Exclude rows with NA in either column
)
# Display the correlation
cat("Correlation between Graduation Rate and Retention Rate:", correlation, "\n")
## Correlation between Graduation Rate and Retention Rate: 0.6159709
Solution:
# Extract the values in the School column without "University"
school_without_university <- gsub("University", "", final_data$School)
# Trim leading and trailing whitespace
school_without_university <- trimws(school_without_university)
# Display the first few values
head(school_without_university)
## [1] "Harvard" "Yale" "of Pennsylvania" "Johns Hopkins"
## [5] "Cornell" "Tufts"
Solution:
# Count universities with "Institute" in their name
institute_count <- sum(grepl("Institute", final_data$School))
# Display the count
cat("Number of universities with 'Institute' in their name:", institute_count, "\n")
## Number of universities with 'Institute' in their name: 17
Solution:
# Export the cleaned dataset to a CSV file
write.csv(final_data, "cleaned_final_data.csv", row.names = FALSE)
cat("The cleaned dataset has been exported to 'cleaned_final_data.csv'.\n")
## The cleaned dataset has been exported to 'cleaned_final_data.csv'.