Morgan State University
Department of Information Science & Systems
Fall 2024
INSS 615: Data Wrangling for Visualization
Name: OLUWAFEMI SAMUEL ESAN
Due: Dec 1, 2024 (Sunday)
Questions
A. Scrape the College Ranked by Acceptance Rate dataset available at this link: https://www.oedb.org/rankings/acceptance-rate/#table-rankings and select the first 9 columns [Rank, School, Student to Faculty Ratio, Graduation Rate, Retention Rate, Acceptance Rate, Enrollment Rate, Institutional Aid Rate, and Default Rate] as the dataset for this assignment. [20 Points]
Hint: There are 6 pages of data, so you may want to use a for loop to automate the scraping process and combine the data from all 6 pages. This is just a suggestion—you are free to create the dataset without automating the web scrapping process.
Solution:
# Load required libraries
library(rvest)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# List of all page URLs
urls <- c(
"https://www.oedb.org/rankings/acceptance-rate/#table-rankings",
"https://www.oedb.org/rankings/acceptance-rate/page/2/#table-rankings",
"https://www.oedb.org/rankings/acceptance-rate/page/3/#table-rankings",
"https://www.oedb.org/rankings/acceptance-rate/page/4/#table-rankings",
"https://www.oedb.org/rankings/acceptance-rate/page/5/#table-rankings",
"https://www.oedb.org/rankings/acceptance-rate/page/6/#table-rankings"
)
# Initialize an empty list to store page data
all_pages <- list()
# Loop through the URLs and Scrape the College Ranked by Acceptance Rate data
for (i in seq_along(urls)) {
# Read the HTML content from the page
webpage <- tryCatch({
read_html(urls[i])
}, error = function(e) {
message(paste("Error loading page:", urls[i]))
return(NULL)
})
# Proceed if the page loaded successfully
if (!is.null(webpage)) {
# Extract the table data
page_data <- tryCatch({
webpage %>%
html_node("table") %>%
html_table(fill = TRUE)
}, error = function(e) {
message(paste("Error extracting table from page:", urls[i]))
return(NULL)
})
# Add the page data to the list if it's valid
if (!is.null(page_data)) {
all_pages[[i]] <- page_data
}
}
}
# Combine all pages into a single df
full_dataset <- bind_rows(all_pages)
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`
# Check the structure of the combined dataset
print(str(full_dataset))
## tibble [571 × 15] (S3: tbl_df/tbl/data.frame)
## $ Rank : int [1:571] 1 2 3 4 5 6 7 8 9 10 ...
## $ School : chr [1:571] "Harvard University" "Yale University" "University of Pennsylvania" "Johns Hopkins University" ...
## $ Student to Faculty Ratio: chr [1:571] "7 to 1" "6 to 1" "6 to 1" "10 to 1" ...
## $ Graduation Rate : chr [1:571] "98%" "97%" "95%" "94%" ...
## $ Retention Rate : chr [1:571] "98%" "99%" "98%" "97%" ...
## $ Acceptance Rate : chr [1:571] "6%" "7%" "10%" "14%" ...
## $ Enrollment Rate : chr [1:571] "4%" "5%" "7%" "5%" ...
## $ Institutional Aid Rate : chr [1:571] "44%" "52%" "54%" "51%" ...
## $ Default Rate : chr [1:571] "N/A" "N/A" "N/A" "N/A" ...
## $ ...10 : chr [1:571] "N/A" "N/A" "N/A" "N/A" ...
## $ ...11 : chr [1:571] "N/A" "N/A" "N/A" "N/A" ...
## $ ...12 : chr [1:571] "N/A" "N/A" "N/A" "N/A" ...
## $ ...13 : chr [1:571] "," "," "," "," ...
## $ ...14 : chr [1:571] "N/AN/A" "N/AN/A" "N/AN/A" "N/AN/A" ...
## $ ...15 : logi [1:571] NA NA NA NA NA NA ...
## NULL
# Select the first 9 columns
new_data <- full_dataset %>%
select(Rank, School, `Student to Faculty Ratio`, `Graduation Rate`,
`Retention Rate`, `Acceptance Rate`, `Enrollment Rate`,
`Institutional Aid Rate`, `Default Rate`)
# Check the cleaned dataset
print(new_data)
## # A tibble: 571 × 9
## Rank School Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
## <int> <chr> <chr> <chr> <chr>
## 1 1 Harvard Univ… 7 to 1 98% 98%
## 2 2 Yale Univers… 6 to 1 97% 99%
## 3 3 University o… 6 to 1 95% 98%
## 4 4 Johns Hopkin… 10 to 1 94% 97%
## 5 5 Cornell Univ… 9 to 1 93% 97%
## 6 6 Tufts Univer… 9 to 1 93% 97%
## 7 7 University o… 17 to 1 92% 96%
## 8 8 University o… 16 to 1 91% 96%
## 9 9 Georgetown U… 11 to 1 94% 96%
## 10 10 Washington U… 8 to 1 93% 96%
## # ℹ 561 more rows
## # ℹ abbreviated name: ¹`Student to Faculty Ratio`
## # ℹ 4 more variables: `Acceptance Rate` <chr>, `Enrollment Rate` <chr>,
## # `Institutional Aid Rate` <chr>, `Default Rate` <chr>
# Save the dataset to a CSV file
write.csv(new_data, "college_acceptance_rate_clean.csv", row.names = FALSE)
B. You are going to need the dataset created in Question A to answer the following questions. There are 16 questions each carrying 5 points:
Solution:
#Replace the missing values "N/A" in the dataset with NA
new_data <- new_data %>%
mutate(
Rank = as.character(Rank),
School = as.character(School),
`Student to Faculty Ratio` = as.character(`Student to Faculty Ratio`),
`Graduation Rate` = as.character(`Graduation Rate`),
`Retention Rate` = as.character(`Retention Rate`),
`Acceptance Rate` = as.character(`Acceptance Rate`),
`Enrollment Rate` = as.character(`Enrollment Rate`),
`Institutional Aid Rate` = as.character(`Institutional Aid Rate`),
`Default Rate` = as.character(`Default Rate`)
) %>%
mutate(across(everything(), ~ na_if(., "N/A"))) # Replace "N/A" with "NA" (empty)
# View the final dataset
print(new_data)
## # A tibble: 571 × 9
## Rank School Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
## <chr> <chr> <chr> <chr> <chr>
## 1 1 Harvard Univ… 7 to 1 98% 98%
## 2 2 Yale Univers… 6 to 1 97% 99%
## 3 3 University o… 6 to 1 95% 98%
## 4 4 Johns Hopkin… 10 to 1 94% 97%
## 5 5 Cornell Univ… 9 to 1 93% 97%
## 6 6 Tufts Univer… 9 to 1 93% 97%
## 7 7 University o… 17 to 1 92% 96%
## 8 8 University o… 16 to 1 91% 96%
## 9 9 Georgetown U… 11 to 1 94% 96%
## 10 10 Washington U… 8 to 1 93% 96%
## # ℹ 561 more rows
## # ℹ abbreviated name: ¹`Student to Faculty Ratio`
## # ℹ 4 more variables: `Acceptance Rate` <chr>, `Enrollment Rate` <chr>,
## # `Institutional Aid Rate` <chr>, `Default Rate` <chr>
Solution:
new_data <- new_data %>%
mutate(
`Graduation Rate` = as.numeric(str_remove(`Graduation Rate`, "%")) / 100,
`Retention Rate` = as.numeric(str_remove(`Retention Rate`, "%")) / 100,
`Acceptance Rate` = as.numeric(str_remove(`Acceptance Rate`, "%")) / 100,
`Enrollment Rate` = as.numeric(str_remove(`Enrollment Rate`, "%")) / 100,
`Institutional Aid Rate` = as.numeric(str_remove(`Institutional Aid Rate`, "%")) / 100,
`Default Rate` = as.numeric(str_remove(`Default Rate`, "%")) / 100
)
# View the final dataset
print(new_data)
## # A tibble: 571 × 9
## Rank School Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
## <chr> <chr> <chr> <dbl> <dbl>
## 1 1 Harvard Univ… 7 to 1 0.98 0.98
## 2 2 Yale Univers… 6 to 1 0.97 0.99
## 3 3 University o… 6 to 1 0.95 0.98
## 4 4 Johns Hopkin… 10 to 1 0.94 0.97
## 5 5 Cornell Univ… 9 to 1 0.93 0.97
## 6 6 Tufts Univer… 9 to 1 0.93 0.97
## 7 7 University o… 17 to 1 0.92 0.96
## 8 8 University o… 16 to 1 0.91 0.96
## 9 9 Georgetown U… 11 to 1 0.94 0.96
## 10 10 Washington U… 8 to 1 0.93 0.96
## # ℹ 561 more rows
## # ℹ abbreviated name: ¹`Student to Faculty Ratio`
## # ℹ 4 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## # `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>
Solution:
# Separate the Column Students to Faculty Ratio into two columns of Students and Faculty
new_data <- new_data %>%
separate(
`Student to Faculty Ratio`,
into = c("Students", "Faculty"),
sep = " to ",
convert = TRUE
)%>%
mutate(`Students` = as.numeric( Students),
`Faculty` = as.numeric( Faculty),
)
# View the resulting dataset
print(new_data)
## # A tibble: 571 × 10
## Rank School Students Faculty `Graduation Rate` `Retention Rate`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 Harvard University 7 1 0.98 0.98
## 2 2 Yale University 6 1 0.97 0.99
## 3 3 University of Penn… 6 1 0.95 0.98
## 4 4 Johns Hopkins Univ… 10 1 0.94 0.97
## 5 5 Cornell University 9 1 0.93 0.97
## 6 6 Tufts University 9 1 0.93 0.97
## 7 7 University of Cali… 17 1 0.92 0.96
## 8 8 University of Cali… 16 1 0.91 0.96
## 9 9 Georgetown Univers… 11 1 0.94 0.96
## 10 10 Washington Univers… 8 1 0.93 0.96
## # ℹ 561 more rows
## # ℹ 4 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## # `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>
Solution:
# Count missing values in the "Default Rate" column
missing_count <- sum(is.na(new_data$`Default Rate`))
cat("Number of missing values in 'Default Rate':", missing_count, "\n")
## Number of missing values in 'Default Rate': 291
# Calculate the median of the "Default Rate" column, excluding NAs
default_rate_median <- median(new_data$`Default Rate`, na.rm = TRUE)
# Impute missing values in the "Default Rate" column with the median value
new_data <- new_data %>%
mutate(`Default Rate` = ifelse(is.na(`Default Rate`), default_rate_median, `Default Rate`))
# Verify that there are no more missing values in the "Default Rate" column
cat("Number of missing values in 'Default Rate' after imputation:", sum(is.na(new_data$`Default Rate`)), "\n")
## Number of missing values in 'Default Rate' after imputation: 0
Solution:
# Select the top 50 ranked universities
top_50 <- new_data %>%
filter(as.numeric(Rank) <= 50)
# View the resulting dataset
print(top_50)
## # A tibble: 50 × 10
## Rank School Students Faculty `Graduation Rate` `Retention Rate`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 Harvard University 7 1 0.98 0.98
## 2 2 Yale University 6 1 0.97 0.99
## 3 3 University of Penn… 6 1 0.95 0.98
## 4 4 Johns Hopkins Univ… 10 1 0.94 0.97
## 5 5 Cornell University 9 1 0.93 0.97
## 6 6 Tufts University 9 1 0.93 0.97
## 7 7 University of Cali… 17 1 0.92 0.96
## 8 8 University of Cali… 16 1 0.91 0.96
## 9 9 Georgetown Univers… 11 1 0.94 0.96
## 10 10 Washington Univers… 8 1 0.93 0.96
## # ℹ 40 more rows
## # ℹ 4 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## # `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>
# Calculate the average graduation rate
average_graduation_rate <- top_50 %>%
summarise(AverageGraduationRate = mean(`Graduation Rate`, na.rm = TRUE)) %>%
pull(AverageGraduationRate)
# Print the result
cat("Average graduation rate (%) for top 50 ranked Universities is:", round(average_graduation_rate * 100, 2), "%\n")
## Average graduation rate (%) for top 50 ranked Universities is: 79.18 %
Solution:
# Filter universities with a retention rate above 90%
high_retention_universities <- new_data %>%
filter(`Retention Rate` > 0.90)
# Count of rows in the filtered subset
count_high_retention <- nrow(high_retention_universities)
# View the result
cat("The number of universities with a retention rate above 90% is:", count_high_retention, "\n")
## The number of universities with a retention rate above 90% is: 98
Solution:
# Rank universities by enrollment rate in descending order
ranked_universities <- new_data %>%
arrange(desc(`Enrollment Rate`))
# Display the last 6 rows
last_6_universities <- tail(ranked_universities, 6)
# View the result
print(last_6_universities)
## # A tibble: 6 × 10
## Rank School Students Faculty `Graduation Rate` `Retention Rate`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 566 Southeastern Baptis… 6 1 0.8 0.4
## 2 567 Touro University Wo… 13 1 NA 1
## 3 568 Unitek College 16 1 NA 1
## 4 569 University of Weste… 16 1 0.88 NA
## 5 570 Virginia Baptist Co… 5 1 1 0.25
## 6 571 West Virginia Junio… 25 1 0.53 0.69
## # ℹ 4 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## # `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>
Solution:
# Load ggplot2 library needed for creating charts
library(ggplot2)
# Create a histogram of graduation rates
ggplot(new_data, aes(x = `Graduation Rate`)) +
geom_histogram(binwidth = 0.05, fill = "green", color = "blue", alpha = 0.7) +
labs(
title = "Histogram of Graduation Rates",
x = "Graduation Rate (Proportion)",
y = "Frequency"
) +
theme_minimal()
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_bin()`).
Solution:
# Load ggplot2 library
library(ggplot2)
# Create a scatterplot of acceptance rate vs. enrollment rate
ggplot(new_data, aes(x = `Acceptance Rate`, y = `Enrollment Rate`)) +
geom_point(color = "orange", alpha = 0.7) +
labs(
title = "Scatterplot of Acceptance Rate vs Enrollment Rate",
x = "Acceptance Rate (Proportion)",
y = "Enrollment Rate (Proportion)"
) +
theme_minimal()
## Warning: Removed 29 rows containing missing values or values outside the scale range
## (`geom_point()`).
Solution:
# Load required library
library(dplyr)
# Create aid rate categories
new_data <- new_data %>%
mutate(AidRateCategory = case_when(
`Institutional Aid Rate` >= 0 & `Institutional Aid Rate` < 0.2 ~ "0-20%",
`Institutional Aid Rate` >= 0.2 & `Institutional Aid Rate` < 0.4 ~ "20-40%",
`Institutional Aid Rate` >= 0.4 & `Institutional Aid Rate` < 0.6 ~ "40-60%",
`Institutional Aid Rate` >= 0.6 & `Institutional Aid Rate` < 0.8 ~ "60-80%",
`Institutional Aid Rate` >= 0.8 & `Institutional Aid Rate` <= 1 ~ "80-100%",
TRUE ~ "Other"
))
# Calculate the average default rate for each aid rate category
average_default_rate <- new_data %>%
group_by(AidRateCategory) %>%
summarise(AverageDefaultRate = round( mean(`Default Rate`, na.rm = TRUE),2)) %>%
arrange(AidRateCategory)
# Display the results
print(average_default_rate)
## # A tibble: 5 × 2
## AidRateCategory AverageDefaultRate
## <chr> <dbl>
## 1 0-20% 0.12
## 2 20-40% 0.08
## 3 40-60% 0.06
## 4 60-80% 0.06
## 5 80-100% 0.06
Solution:
# Load the required library
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
# Normalize and rescale the Acceptance Rate to 0-1
new_data <- new_data %>%
mutate(
# Normalize the Scale Acceptance Rate to 0-1
`Acceptance Rate Normalized` = (`Acceptance Rate` - min(`Acceptance Rate`, na.rm = TRUE)) /
(max(`Acceptance Rate`, na.rm = TRUE) - min(`Acceptance Rate`, na.rm = TRUE)),
# Rescale and Adjust normalized values to fall within a target range of 0-1
`Acceptance Rate Rescaled` = scales::rescale(`Acceptance Rate`, to = c(0, 1))
)
# Display the first 6 rows of the new columns
head(new_data %>%
select(`Acceptance Rate`, `Acceptance Rate Normalized`, `Acceptance Rate Rescaled`)
)
Solution:
# Count duplicate entries in the "School" column
duplicate_count <- new_data %>%
filter(duplicated(School)) %>%
nrow()
# Print the count of duplicates
cat("The number of duplicate entries in the 'School' column is:", duplicate_count, "\n")
## The number of duplicate entries in the 'School' column is: 3
# Remove duplicate entries in the "School" column
new_data <- new_data %>%
distinct(School, .keep_all = TRUE)
# To Verify the removal of duplicates
cat("The number of rows after removing duplicates is:", nrow(new_data), "\n")
## The number of rows after removing duplicates is: 568
#Total number of rows from new_data before duplicate removal was 571 (The 571 universities been considered)
Solution:
# Calculate the correlation between Graduation Rate and Retention Rate (exclude the NAs in both columns)
correlation <- cor(
new_data$`Graduation Rate`,
new_data$`Retention Rate`,
use = "complete.obs"
)
# Print the correlation
cat("The correlation between Graduation Rate and Retention Rate is :", round(correlation, 2), "\n")
## The correlation between Graduation Rate and Retention Rate is : 0.62
#The result shows a positive relationship between Graduation rate and Retention rate
Solution:
# Create a new variable with "University" removed from the School column
new_data <- new_data %>%
mutate(School_Name_Without_University = str_remove(School,"University"))
# View the first 6 rows of the new variable
head(new_data %>% select(School, School_Name_Without_University))
Solution:
# Count the number of universities with "Institute" in their name
institute_count <- new_data %>%
filter(str_detect(School, "Institute")) %>%
nrow()
# Print the result
cat("The number of universities with 'Institute' in their name is:", institute_count, "\n")
## The number of universities with 'Institute' in their name is: 17
Solution:
# Specify the file name
output_file <- "University_ranking_cleaned_dataset.csv"
# Save the final_data dataset as a CSV file
write.csv(new_data, file = output_file, row.names = FALSE)
# Print a confirmation message
cat("The cleaned_data has been saved as:", output_file, "\n")
## The cleaned_data has been saved as: University_ranking_cleaned_dataset.csv