Morgan State University

Department of Information Science & Systems

Fall 2024

INSS 615: Data Wrangling for Visualization

Name: OLUWAFEMI SAMUEL ESAN

Due: Dec 1, 2024 (Sunday)

Questions

A. Scrape the College Ranked by Acceptance Rate dataset available at this link: https://www.oedb.org/rankings/acceptance-rate/#table-rankings and select the first 9 columns [Rank, School, Student to Faculty Ratio, Graduation Rate, Retention Rate, Acceptance Rate, Enrollment Rate, Institutional Aid Rate, and Default Rate] as the dataset for this assignment. [20 Points]

Hint: There are 6 pages of data, so you may want to use a for loop to automate the scraping process and combine the data from all 6 pages. This is just a suggestion—you are free to create the dataset without automating the web scrapping process.

Solution:

# Load required libraries
library(rvest)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()         masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag()            masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# List of all page URLs
urls <- c(
  "https://www.oedb.org/rankings/acceptance-rate/#table-rankings",
  "https://www.oedb.org/rankings/acceptance-rate/page/2/#table-rankings",
  "https://www.oedb.org/rankings/acceptance-rate/page/3/#table-rankings",
  "https://www.oedb.org/rankings/acceptance-rate/page/4/#table-rankings",
  "https://www.oedb.org/rankings/acceptance-rate/page/5/#table-rankings",
  "https://www.oedb.org/rankings/acceptance-rate/page/6/#table-rankings"
)

# Initialize an empty list to store page data
all_pages <- list()

# Loop through the URLs and Scrape the College Ranked by Acceptance Rate data
for (i in seq_along(urls)) {
  # Read the HTML content from the page
  webpage <- tryCatch({
    read_html(urls[i])
  }, error = function(e) {
    message(paste("Error loading page:", urls[i]))
    return(NULL)
  })
  
  # Proceed if the page loaded successfully
  if (!is.null(webpage)) {
    # Extract the table data
    page_data <- tryCatch({
      webpage %>%
        html_node("table") %>%
        html_table(fill = TRUE)
    }, error = function(e) {
      message(paste("Error extracting table from page:", urls[i]))
      return(NULL)
    })
    
    # Add the page data to the list if it's valid
    if (!is.null(page_data)) {
      all_pages[[i]] <- page_data
    }
  }
}

# Combine all pages into a single df
full_dataset <- bind_rows(all_pages)
## New names:
## New names:
## New names:
## New names:
## New names:
## New names:
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`
# Check the structure of the combined dataset
print(str(full_dataset))
## tibble [571 × 15] (S3: tbl_df/tbl/data.frame)
##  $ Rank                    : int [1:571] 1 2 3 4 5 6 7 8 9 10 ...
##  $ School                  : chr [1:571] "Harvard University" "Yale University" "University of Pennsylvania" "Johns Hopkins University" ...
##  $ Student to Faculty Ratio: chr [1:571] "7 to 1" "6 to 1" "6 to 1" "10 to 1" ...
##  $ Graduation Rate         : chr [1:571] "98%" "97%" "95%" "94%" ...
##  $ Retention Rate          : chr [1:571] "98%" "99%" "98%" "97%" ...
##  $ Acceptance Rate         : chr [1:571] "6%" "7%" "10%" "14%" ...
##  $ Enrollment Rate         : chr [1:571] "4%" "5%" "7%" "5%" ...
##  $ Institutional Aid Rate  : chr [1:571] "44%" "52%" "54%" "51%" ...
##  $ Default Rate            : chr [1:571] "N/A" "N/A" "N/A" "N/A" ...
##  $ ...10                   : chr [1:571] "N/A" "N/A" "N/A" "N/A" ...
##  $ ...11                   : chr [1:571] "N/A" "N/A" "N/A" "N/A" ...
##  $ ...12                   : chr [1:571] "N/A" "N/A" "N/A" "N/A" ...
##  $ ...13                   : chr [1:571] "," "," "," "," ...
##  $ ...14                   : chr [1:571] "N/AN/A" "N/AN/A" "N/AN/A" "N/AN/A" ...
##  $ ...15                   : logi [1:571] NA NA NA NA NA NA ...
## NULL
# Select the first 9 columns
new_data <- full_dataset %>%
  select(Rank, School, `Student to Faculty Ratio`, `Graduation Rate`, 
         `Retention Rate`, `Acceptance Rate`, `Enrollment Rate`, 
         `Institutional Aid Rate`, `Default Rate`)

# Check the cleaned dataset
print(new_data)
## # A tibble: 571 × 9
##     Rank School        Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
##    <int> <chr>         <chr>                  <chr>             <chr>           
##  1     1 Harvard Univ… 7 to 1                 98%               98%             
##  2     2 Yale Univers… 6 to 1                 97%               99%             
##  3     3 University o… 6 to 1                 95%               98%             
##  4     4 Johns Hopkin… 10 to 1                94%               97%             
##  5     5 Cornell Univ… 9 to 1                 93%               97%             
##  6     6 Tufts Univer… 9 to 1                 93%               97%             
##  7     7 University o… 17 to 1                92%               96%             
##  8     8 University o… 16 to 1                91%               96%             
##  9     9 Georgetown U… 11 to 1                94%               96%             
## 10    10 Washington U… 8 to 1                 93%               96%             
## # ℹ 561 more rows
## # ℹ abbreviated name: ¹​`Student to Faculty Ratio`
## # ℹ 4 more variables: `Acceptance Rate` <chr>, `Enrollment Rate` <chr>,
## #   `Institutional Aid Rate` <chr>, `Default Rate` <chr>
# Save the dataset to a CSV file
write.csv(new_data, "college_acceptance_rate_clean.csv", row.names = FALSE)

B. You are going to need the dataset created in Question A to answer the following questions. There are 16 questions each carrying 5 points:

  1. Replace the missing values “N/A” in the dataset with NA.

Solution:

 #Replace the missing values "N/A" in the dataset with NA
  new_data <- new_data %>%
    mutate(
      Rank = as.character(Rank),
      School = as.character(School),
      `Student to Faculty Ratio` = as.character(`Student to Faculty Ratio`),
      `Graduation Rate` = as.character(`Graduation Rate`),
      `Retention Rate` = as.character(`Retention Rate`),
      `Acceptance Rate` = as.character(`Acceptance Rate`),
      `Enrollment Rate` = as.character(`Enrollment Rate`),
      `Institutional Aid Rate` = as.character(`Institutional Aid Rate`),
      `Default Rate` = as.character(`Default Rate`)
    ) %>%
                      mutate(across(everything(), ~ na_if(., "N/A"))) # Replace "N/A" with "NA" (empty)

# View the final dataset
print(new_data)
## # A tibble: 571 × 9
##    Rank  School        Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
##    <chr> <chr>         <chr>                  <chr>             <chr>           
##  1 1     Harvard Univ… 7 to 1                 98%               98%             
##  2 2     Yale Univers… 6 to 1                 97%               99%             
##  3 3     University o… 6 to 1                 95%               98%             
##  4 4     Johns Hopkin… 10 to 1                94%               97%             
##  5 5     Cornell Univ… 9 to 1                 93%               97%             
##  6 6     Tufts Univer… 9 to 1                 93%               97%             
##  7 7     University o… 17 to 1                92%               96%             
##  8 8     University o… 16 to 1                91%               96%             
##  9 9     Georgetown U… 11 to 1                94%               96%             
## 10 10    Washington U… 8 to 1                 93%               96%             
## # ℹ 561 more rows
## # ℹ abbreviated name: ¹​`Student to Faculty Ratio`
## # ℹ 4 more variables: `Acceptance Rate` <chr>, `Enrollment Rate` <chr>,
## #   `Institutional Aid Rate` <chr>, `Default Rate` <chr>
  1. Convert percentage columns (e.g., Graduation Rate) to numeric format.

Solution:

  new_data <- new_data %>%
                              mutate(
              `Graduation Rate` = as.numeric(str_remove(`Graduation Rate`, "%")) / 100,
              `Retention Rate` = as.numeric(str_remove(`Retention Rate`, "%")) / 100,
              `Acceptance Rate` = as.numeric(str_remove(`Acceptance Rate`, "%")) / 100,
              `Enrollment Rate` = as.numeric(str_remove(`Enrollment Rate`, "%")) / 100,
              `Institutional Aid Rate` = as.numeric(str_remove(`Institutional Aid Rate`, "%")) / 100,
              `Default Rate` = as.numeric(str_remove(`Default Rate`, "%")) / 100
            )


# View the final dataset
print(new_data)
## # A tibble: 571 × 9
##    Rank  School        Student to Faculty R…¹ `Graduation Rate` `Retention Rate`
##    <chr> <chr>         <chr>                              <dbl>            <dbl>
##  1 1     Harvard Univ… 7 to 1                              0.98             0.98
##  2 2     Yale Univers… 6 to 1                              0.97             0.99
##  3 3     University o… 6 to 1                              0.95             0.98
##  4 4     Johns Hopkin… 10 to 1                             0.94             0.97
##  5 5     Cornell Univ… 9 to 1                              0.93             0.97
##  6 6     Tufts Univer… 9 to 1                              0.93             0.97
##  7 7     University o… 17 to 1                             0.92             0.96
##  8 8     University o… 16 to 1                             0.91             0.96
##  9 9     Georgetown U… 11 to 1                             0.94             0.96
## 10 10    Washington U… 8 to 1                              0.93             0.96
## # ℹ 561 more rows
## # ℹ abbreviated name: ¹​`Student to Faculty Ratio`
## # ℹ 4 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## #   `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>
  1. Transform the “Student to Faculty Ratio” column into two separate numeric columns: Students and Faculty.

Solution:

# Separate the Column Students to Faculty Ratio into two columns of Students and Faculty
new_data <- new_data %>%
  separate(
    `Student to Faculty Ratio`,
    into = c("Students", "Faculty"),
    sep = " to ",
    convert = TRUE
                  )%>%
                    mutate(`Students` = as.numeric( Students),
                            `Faculty` = as.numeric( Faculty),
                           )



# View the resulting dataset
print(new_data)
## # A tibble: 571 × 10
##    Rank  School              Students Faculty `Graduation Rate` `Retention Rate`
##    <chr> <chr>                  <dbl>   <dbl>             <dbl>            <dbl>
##  1 1     Harvard University         7       1              0.98             0.98
##  2 2     Yale University            6       1              0.97             0.99
##  3 3     University of Penn…        6       1              0.95             0.98
##  4 4     Johns Hopkins Univ…       10       1              0.94             0.97
##  5 5     Cornell University         9       1              0.93             0.97
##  6 6     Tufts University           9       1              0.93             0.97
##  7 7     University of Cali…       17       1              0.92             0.96
##  8 8     University of Cali…       16       1              0.91             0.96
##  9 9     Georgetown Univers…       11       1              0.94             0.96
## 10 10    Washington Univers…        8       1              0.93             0.96
## # ℹ 561 more rows
## # ℹ 4 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## #   `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>
  1. What is the count of missing values in the “Default Rate” column? Impute the missing values in the “Default Rate” column with the median value.

Solution:

# Count missing values in the "Default Rate" column
missing_count <- sum(is.na(new_data$`Default Rate`))
cat("Number of missing values in 'Default Rate':", missing_count, "\n")
## Number of missing values in 'Default Rate': 291
# Calculate the median of the "Default Rate" column, excluding NAs
default_rate_median <- median(new_data$`Default Rate`, na.rm = TRUE)

# Impute missing values in the "Default Rate" column with the median value
new_data <- new_data %>%
  mutate(`Default Rate` = ifelse(is.na(`Default Rate`), default_rate_median, `Default Rate`))

# Verify that there are no more missing values in the "Default Rate" column
cat("Number of missing values in 'Default Rate' after imputation:", sum(is.na(new_data$`Default Rate`)), "\n")
## Number of missing values in 'Default Rate' after imputation: 0
  1. Find the average graduation rate for universities ranked in the top 50.

Solution:

# Select the top 50 ranked universities
top_50 <- new_data %>%
  filter(as.numeric(Rank) <= 50)

# View the resulting dataset
print(top_50)
## # A tibble: 50 × 10
##    Rank  School              Students Faculty `Graduation Rate` `Retention Rate`
##    <chr> <chr>                  <dbl>   <dbl>             <dbl>            <dbl>
##  1 1     Harvard University         7       1              0.98             0.98
##  2 2     Yale University            6       1              0.97             0.99
##  3 3     University of Penn…        6       1              0.95             0.98
##  4 4     Johns Hopkins Univ…       10       1              0.94             0.97
##  5 5     Cornell University         9       1              0.93             0.97
##  6 6     Tufts University           9       1              0.93             0.97
##  7 7     University of Cali…       17       1              0.92             0.96
##  8 8     University of Cali…       16       1              0.91             0.96
##  9 9     Georgetown Univers…       11       1              0.94             0.96
## 10 10    Washington Univers…        8       1              0.93             0.96
## # ℹ 40 more rows
## # ℹ 4 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## #   `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>
# Calculate the average graduation rate
average_graduation_rate <- top_50 %>%
  summarise(AverageGraduationRate = mean(`Graduation Rate`, na.rm = TRUE)) %>%
  pull(AverageGraduationRate)

# Print the result
cat("Average graduation rate (%) for top 50 ranked Universities is:", round(average_graduation_rate * 100, 2), "%\n")
## Average graduation rate (%) for top 50 ranked Universities is: 79.18 %
  1. Filter universities with a retention rate above 90% and find the count of rows in the subset.

Solution:

# Filter universities with a retention rate above 90%
high_retention_universities <- new_data %>%
  filter(`Retention Rate` > 0.90)

# Count of rows in the filtered subset
count_high_retention <- nrow(high_retention_universities)

# View the result
cat("The number of universities with a retention rate above 90% is:", count_high_retention, "\n")
## The number of universities with a retention rate above 90% is: 98
  1. Rank universities by enrollment rate in descending order and display the last 6 rows.

Solution:

# Rank universities by enrollment rate in descending order
ranked_universities <- new_data %>%
  arrange(desc(`Enrollment Rate`))

# Display the last 6 rows
last_6_universities <- tail(ranked_universities, 6)

# View the result
print(last_6_universities)
## # A tibble: 6 × 10
##   Rank  School               Students Faculty `Graduation Rate` `Retention Rate`
##   <chr> <chr>                   <dbl>   <dbl>             <dbl>            <dbl>
## 1 566   Southeastern Baptis…        6       1              0.8              0.4 
## 2 567   Touro University Wo…       13       1             NA                1   
## 3 568   Unitek College             16       1             NA                1   
## 4 569   University of Weste…       16       1              0.88            NA   
## 5 570   Virginia Baptist Co…        5       1              1                0.25
## 6 571   West Virginia Junio…       25       1              0.53             0.69
## # ℹ 4 more variables: `Acceptance Rate` <dbl>, `Enrollment Rate` <dbl>,
## #   `Institutional Aid Rate` <dbl>, `Default Rate` <dbl>
  1. Create a histogram of graduation rates using ggplot2 library.

Solution:

# Load ggplot2 library needed for creating charts
library(ggplot2)

# Create a histogram of graduation rates
ggplot(new_data, aes(x = `Graduation Rate`)) +
  geom_histogram(binwidth = 0.05, fill = "green", color = "blue", alpha = 0.7) +
  labs(
    title = "Histogram of Graduation Rates",
    x = "Graduation Rate (Proportion)",
    y = "Frequency"
  ) +
  theme_minimal()
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_bin()`).

  1. Plot a scatterplot between acceptance rate and enrollment rate using ggplot2 library.

Solution:

# Load ggplot2 library
library(ggplot2)

# Create a scatterplot of acceptance rate vs. enrollment rate
ggplot(new_data, aes(x = `Acceptance Rate`, y = `Enrollment Rate`)) +
  geom_point(color = "orange", alpha = 0.7) +
  labs(
    title = "Scatterplot of Acceptance Rate vs Enrollment Rate",
    x = "Acceptance Rate (Proportion)",
    y = "Enrollment Rate (Proportion)"
  ) +
  theme_minimal()
## Warning: Removed 29 rows containing missing values or values outside the scale range
## (`geom_point()`).

  1. Calculate the average default rate by aid rate category (e.g., grouped into ranges like 0-20%, 20-40%). Display the categories.

Solution:

# Load required library
library(dplyr)

# Create aid rate categories
new_data <- new_data %>%
  mutate(AidRateCategory = case_when(
    `Institutional Aid Rate` >= 0 & `Institutional Aid Rate` < 0.2 ~ "0-20%",
    `Institutional Aid Rate` >= 0.2 & `Institutional Aid Rate` < 0.4 ~ "20-40%",
    `Institutional Aid Rate` >= 0.4 & `Institutional Aid Rate` < 0.6 ~ "40-60%",
    `Institutional Aid Rate` >= 0.6 & `Institutional Aid Rate` < 0.8 ~ "60-80%",
    `Institutional Aid Rate` >= 0.8 & `Institutional Aid Rate` <= 1 ~ "80-100%",
    TRUE ~ "Other"
  ))

# Calculate the average default rate for each aid rate category
average_default_rate <- new_data %>%
  group_by(AidRateCategory) %>%
  summarise(AverageDefaultRate = round( mean(`Default Rate`, na.rm = TRUE),2)) %>%
  arrange(AidRateCategory)

# Display the results
print(average_default_rate)
## # A tibble: 5 × 2
##   AidRateCategory AverageDefaultRate
##   <chr>                        <dbl>
## 1 0-20%                         0.12
## 2 20-40%                        0.08
## 3 40-60%                        0.06
## 4 60-80%                        0.06
## 5 80-100%                       0.06
  1. Normalize the acceptance rate to a scale of 0-1 and save in a new column “Acceptance Rate Normalized”. Display the first 6 values.

Solution:

# Load the required library 
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
# Normalize and rescale the Acceptance Rate to 0-1
new_data <- new_data %>%
  mutate(
    # Normalize the Scale Acceptance Rate to 0-1
    `Acceptance Rate Normalized` = (`Acceptance Rate` - min(`Acceptance Rate`, na.rm = TRUE)) / 
                                    (max(`Acceptance Rate`, na.rm = TRUE) - min(`Acceptance Rate`, na.rm = TRUE)),
    
    # Rescale and Adjust normalized values to fall within a target range of 0-1
    `Acceptance Rate Rescaled` = scales::rescale(`Acceptance Rate`, to = c(0, 1))
  )

# Display the first 6 rows of the new columns
head(new_data %>% 
       select(`Acceptance Rate`, `Acceptance Rate Normalized`, `Acceptance Rate Rescaled`)
     )
  1. What is the count of the duplicate entries in the “School” column? Remove duplicate university entries.

Solution:

# Count duplicate entries in the "School" column
duplicate_count <- new_data %>%
  filter(duplicated(School)) %>%
  nrow()

# Print the count of duplicates
cat("The number of duplicate entries in the 'School' column is:", duplicate_count, "\n")
## The number of duplicate entries in the 'School' column is: 3
# Remove duplicate entries in the "School" column
new_data <- new_data %>%
  distinct(School, .keep_all = TRUE)

# To Verify the removal of duplicates
cat("The number of rows after removing duplicates is:", nrow(new_data), "\n")
## The number of rows after removing duplicates is: 568
#Total number of rows from new_data before duplicate removal was 571 (The 571 universities been considered)
  1. Find the correlation between graduation rate and retention rate (exclude the NAs in both columns).

Solution:

# Calculate the correlation between Graduation Rate and Retention Rate (exclude the NAs in both columns)
correlation <- cor(
new_data$`Graduation Rate`,
  new_data$`Retention Rate`,
  use = "complete.obs" 
)

# Print the correlation
cat("The correlation between Graduation Rate and Retention Rate is :", round(correlation, 2), "\n")
## The correlation between Graduation Rate and Retention Rate is : 0.62
#The result shows a positive relationship between Graduation rate and Retention rate
  1. Extract the values in School column into a new variable without “University” in the string. For example “Rowan University” becomes “Rowan”

Solution:

# Create a new variable with "University" removed from the School column
new_data <- new_data %>%
  mutate(School_Name_Without_University = str_remove(School,"University"))

# View the first 6 rows of the new variable
head(new_data %>% select(School, School_Name_Without_University))
  1. Count how many universities have “Institute” in their name.

Solution:

# Count the number of universities with "Institute" in their name
institute_count <- new_data %>%
  filter(str_detect(School, "Institute")) %>%
  nrow()

# Print the result
cat("The number of universities with 'Institute' in their name is:", institute_count, "\n")
## The number of universities with 'Institute' in their name is: 17
  1. Export the cleaned and processed dataset to a CSV file.

Solution:

# Specify the file name
output_file <- "University_ranking_cleaned_dataset.csv"

# Save the final_data dataset as a CSV file
write.csv(new_data, file = output_file, row.names = FALSE)

# Print a confirmation message
cat("The cleaned_data has been saved as:", output_file, "\n")
## The cleaned_data has been saved as: University_ranking_cleaned_dataset.csv