Morgan State University
Department of Information Science & Systems
Fall 2024
INSS 615: Data Wrangling for Visualization
Name: Mohammed Naveed Afroz Mulla
Due: Dec 1, 2024 (Sunday)
Questions
A. Scrape the College Ranked by Acceptance Rate dataset available at
this link: https://www.oedb.org/rankings/acceptance-rate/#table-rankings
and select the first 9 columns [Rank, School, Student to Faculty Ratio,
Graduation Rate, Retention Rate, Acceptance Rate, Enrollment Rate,
Institutional Aid Rate, and Default Rate] as the dataset for this
assignment. [20 Points]
Hint: There are 6 pages of data, so you may want to use a for loop to
automate the scraping process and combine the data from all 6 pages.
This is just a suggestion—you are free to create the dataset without
automating the web scrapping process.
Solution:
library(rvest)
library(tidyverse)
library(dplyr)
# multipage scraping using a for loop
base_url <- "https://www.oedb.org/rankings/acceptance-rate/#table-rankings?page="
all_pages <- list()
for (page in 1:6) {
url <- paste0(base_url, page)
page_content <- read_html(url)
table <- page_content %>%
html_table(fill = TRUE) %>%
.[[1]]
all_pages[[page]] <- table
}
college_data <- bind_rows(all_pages)
New names:New names:New names:New names:New names:New names:
selected_columns <- c("Rank", "School", "Student to Faculty Ratio",
"Graduation Rate", "Retention Rate", "Acceptance Rate",
"Enrollment Rate", "Institutional Aid Rate", "Default Rate")
college_data <- college_data %>%
select(all_of(selected_columns))
head(college_data)
NA
B. You are going to need the dataset created in Question A to answer
the following questions. There are 16 questions each carrying 5
points:
- Replace the missing values “N/A” in the dataset with NA.
Solution:
college_data[college_data == "N/A"] <- NA
head(college_data)
- Convert percentage columns (e.g., Graduation Rate) to numeric
format.
Solution:
percentage_columns <- c("Graduation Rate", "Retention Rate",
"Acceptance Rate", "Enrollment Rate",
"Institutional Aid Rate", "Default Rate")
college_data[percentage_columns] <- lapply(college_data[percentage_columns], function(column) {
as.numeric(gsub("%", "", column))
})
str(college_data)
tibble [600 × 9] (S3: tbl_df/tbl/data.frame)
$ Rank : int [1:600] 1 2 3 4 5 6 7 8 9 10 ...
$ School : chr [1:600] "Harvard University" "Yale University" "University of Pennsylvania" "Johns Hopkins University" ...
$ Student to Faculty Ratio: chr [1:600] "7 to 1" "6 to 1" "6 to 1" "10 to 1" ...
$ Graduation Rate : num [1:600] 98 97 95 94 93 93 92 91 94 93 ...
$ Retention Rate : num [1:600] 98 99 98 97 97 97 96 96 96 96 ...
$ Acceptance Rate : num [1:600] 6 7 10 14 15 16 17 17 17 17 ...
$ Enrollment Rate : num [1:600] 4 5 7 5 8 7 7 6 8 6 ...
$ Institutional Aid Rate : num [1:600] 44 52 54 51 55 43 61 61 42 49 ...
$ Default Rate : num [1:600] NA NA NA NA NA NA NA NA NA NA ...
head(college_data)
- Transform the “Student to Faculty Ratio” column into two separate
numeric columns: Students and Faculty.
Solution:
library(dplyr)
library(stringr)
college_data <- college_data %>%
mutate(
Students = as.numeric(str_extract(`Student to Faculty Ratio`, "^[0-9]+")),
Faculty = as.numeric(str_extract(`Student to Faculty Ratio`, "[0-9]+$"))
)
college_data <- college_data %>%
select(-`Student to Faculty Ratio`)
head(college_data)
- What is the count of missing values in the “Default Rate” column?
Impute the missing values in the “Default Rate” column with the median
value.
Solution:
missing_count <- sum(is.na(college_data$`Default Rate`))
cat("Number of missing values in 'Default Rate':", missing_count, "\n")
Number of missing values in 'Default Rate': 414
default_rate_median <- median(college_data$`Default Rate`, na.rm = TRUE)
college_data$`Default Rate`[is.na(college_data$`Default Rate`)] <- default_rate_median
cat("Number of missing values in 'Default Rate' after imputation:", sum(is.na(college_data$`Default Rate`)), "\n")
Number of missing values in 'Default Rate' after imputation: 0
- Find the average graduation rate for universities ranked in the top
50.
Solution:
str(college_data$`Graduation Rate`)
num [1:600] 98 97 95 94 93 93 92 91 94 93 ...
college_data$`Graduation Rate` <- as.numeric(gsub("%", "", college_data$`Graduation Rate`))
top_50_universities <- college_data %>%
filter(Rank <= 50)
average_graduation_rate <- mean(top_50_universities$`Graduation Rate`, na.rm = TRUE)
cat("Average Graduation Rate for Top 50 Universities:", average_graduation_rate, "%\n")
Average Graduation Rate for Top 50 Universities: 79.18 %
- Filter universities with a retention rate above 90% and find the
count of rows in the subset.
Solution:
high_retention_universities <- college_data %>%
filter(`Retention Rate` > 90)
high_retention_count <- nrow(high_retention_universities)
cat("Number of universities with a retention rate above 90%:", high_retention_count, "\n")
Number of universities with a retention rate above 90%: 300
- Rank universities by enrollment rate in descending order and display
the last 6 rows.
Solution:
ranked_universities <- college_data %>%
arrange(desc(`Enrollment Rate`))
tail(ranked_universities, 6)
NA
- Create a histogram of graduation rates using ggplot2 library.
Solution:
if (!require("ggplot2")) install.packages("ggplot2")
library(ggplot2)
ggplot(data = college_data, aes(x = `Graduation Rate`)) +
geom_histogram(binwidth = 5, color = "black", fill = "lightblue") +
labs(
title = "Histogram of Graduation Rates",
x = "Graduation Rate (%)",
y = "Frequency"
) +
theme_minimal()

- Plot a scatterplot between acceptance rate and enrollment rate using
ggplot2 library.
Solution:
if (!require("ggplot2")) install.packages("ggplot2")
library(ggplot2)
ggplot(data = college_data, aes(x = `Acceptance Rate`, y = `Enrollment Rate`)) +
geom_point(color = "blue", size = 2) +
labs(
title = "Scatterplot of Acceptance Rate vs Enrollment Rate",
x = "Acceptance Rate (%)",
y = "Enrollment Rate (%)"
) +
theme_minimal()

NA
NA
- Calculate the average default rate by aid rate category (e.g.,
grouped into ranges like 0-20%, 20-40%). Display the categories.
Solution:
college_data <- college_data %>%
mutate(
AidRateCategory = cut(
`Institutional Aid Rate`,
breaks = c(0, 20, 40, 60, 80, 100),
labels = c("0-20%", "20-40%", "40-60%", "60-80%", "80-100%"),
include.lowest = TRUE
)
)
average_default_rate <- college_data %>%
group_by(AidRateCategory) %>%
summarise(AverageDefaultRate = mean(`Default Rate`, na.rm = TRUE))
print(average_default_rate)
NA
NA
- Normalize the acceptance rate to a scale of 0-1 and save in a new
column “Acceptance Rate Normalized”. Display the first 6 values.
Solution:
library(scales)
college_data <- college_data %>%
mutate(`Acceptance Rate Normalized` =
(`Acceptance Rate` - min(`Acceptance Rate`, na.rm = TRUE)) /
(max(`Acceptance Rate`, na.rm = TRUE) - min(`Acceptance Rate`, na.rm = TRUE)))
head(college_data[, c("Acceptance Rate", "Acceptance Rate Normalized")])
NA
NA
NA
- What is the count of the duplicate entries in the “School” column?
Remove duplicate university entries.
Solution:
duplicate_count <- sum(duplicated(college_data$School))
cat("Number of duplicate entries in the 'School' column:", duplicate_count, "\n")
Number of duplicate entries in the 'School' column: 500
college_data <- college_data %>%
distinct(School, .keep_all = TRUE)
cat("Number of rows after removing duplicates:", nrow(college_data), "\n")
Number of rows after removing duplicates: 100
- Find the correlation between graduation rate and retention rate
(exclude the NAs in both columns).
Solution:
correlation <- cor(
college_data$`Graduation Rate`,
college_data$`Retention Rate`,
use = "complete.obs" # Exclude NAs in both columns
)
cat("Correlation between Graduation Rate and Retention Rate:", correlation, "\n")
Correlation between Graduation Rate and Retention Rate: 0.8052769
- Extract the values in School column into a new variable without
“University” in the string. For example “Rowan University” becomes
“Rowan”
Solution:
college_data <- college_data %>%
mutate(School_No_University = gsub(" University", "", School))
head(college_data[, c("School", "School_No_University")])
NA
- Count how many universities have “Institute” in their name.
Solution:
# Count universities with "Institute" in their name
institute_count <- sum(grepl("Institute", college_data$School))
# Display the count
cat("Number of universities with 'Institute' in their name:", institute_count, "\n")
Number of universities with 'Institute' in their name: 5
- Export the cleaned and processed dataset to a CSV file.
Solution:
# Export the cleaned dataset to a CSV file
write.csv(college_data, "cleaned_college_data.csv", row.names = FALSE)
# Confirm the file has been saved
cat("The cleaned dataset has been exported to 'cleaned_college_data.csv'.\n")
The cleaned dataset has been exported to 'cleaned_college_data.csv'.
---
title: "INSS615 Homework 5"
output:
  # word_document: default
  html_notebook: default
  html_document:
    df_print: paged
---


**Morgan State University**

**Department of Information Science & Systems**

**Fall 2024**

**INSS 615: Data Wrangling for Visualization**

**Name: Mohammed Naveed Afroz Mulla**

*Due: Dec 1, 2024 (Sunday)*



Questions


A. Scrape the College Ranked by Acceptance Rate dataset available at this link: https://www.oedb.org/rankings/acceptance-rate/#table-rankings and select the first 9 columns [Rank, School, Student to Faculty Ratio, Graduation Rate, Retention Rate, Acceptance Rate, Enrollment Rate, Institutional Aid Rate, and Default Rate] as the dataset for this assignment. [20 Points]

Hint: There are 6 pages of data, so you may want to use a for loop to automate the scraping process and combine the data from all 6 pages. This is just a suggestion—you are free to create the dataset without automating the web scrapping process.

 
  Solution:
```{r}
library(rvest) 
library(tidyverse)
library(dplyr)


# multipage scraping using a for loop

base_url <- "https://www.oedb.org/rankings/acceptance-rate/#table-rankings?page="

all_pages <- list()

for (page in 1:6) {
 
  url <- paste0(base_url, page)
  
  page_content <- read_html(url)
  
  table <- page_content %>%
    html_table(fill = TRUE) %>%
    .[[1]]
  
  all_pages[[page]] <- table
}

college_data <- bind_rows(all_pages)

selected_columns <- c("Rank", "School", "Student to Faculty Ratio", 
                      "Graduation Rate", "Retention Rate", "Acceptance Rate",
                      "Enrollment Rate", "Institutional Aid Rate", "Default Rate")

college_data <- college_data %>%
  select(all_of(selected_columns))
head(college_data)

```

B. You are going to need the dataset created in Question A to answer the following questions. There are 16 questions each carrying 5 points:

1. Replace the missing values "N/A" in the dataset with NA.


  Solution:
```{r}
college_data[college_data == "N/A"] <- NA

head(college_data)
```

2. Convert percentage columns (e.g., Graduation Rate) to numeric format.

  
  Solution:
```{r}

percentage_columns <- c("Graduation Rate", "Retention Rate", 
                        "Acceptance Rate", "Enrollment Rate", 
                        "Institutional Aid Rate", "Default Rate")

college_data[percentage_columns] <- lapply(college_data[percentage_columns], function(column) {

  as.numeric(gsub("%", "", column))
})

str(college_data)

head(college_data)
```


3. Transform the "Student to Faculty Ratio" column into two separate numeric columns: Students and Faculty.


  Solution:
```{r}
library(dplyr)
library(stringr)

college_data <- college_data %>%
  mutate(
    Students = as.numeric(str_extract(`Student to Faculty Ratio`, "^[0-9]+")),
    Faculty = as.numeric(str_extract(`Student to Faculty Ratio`, "[0-9]+$"))
  )

college_data <- college_data %>%
  select(-`Student to Faculty Ratio`)

head(college_data)
```




4. What is the count of missing values in the "Default Rate" column? Impute the missing values in the "Default Rate" column with the median value.


  Solution:
```{r}
missing_count <- sum(is.na(college_data$`Default Rate`))
cat("Number of missing values in 'Default Rate':", missing_count, "\n")

default_rate_median <- median(college_data$`Default Rate`, na.rm = TRUE)

college_data$`Default Rate`[is.na(college_data$`Default Rate`)] <- default_rate_median

cat("Number of missing values in 'Default Rate' after imputation:", sum(is.na(college_data$`Default Rate`)), "\n")

```


5. Find the average graduation rate for universities ranked in the top 50.


  Solution:
```{r}
str(college_data$`Graduation Rate`)

college_data$`Graduation Rate` <- as.numeric(gsub("%", "", college_data$`Graduation Rate`))

top_50_universities <- college_data %>%
  filter(Rank <= 50)

average_graduation_rate <- mean(top_50_universities$`Graduation Rate`, na.rm = TRUE)

cat("Average Graduation Rate for Top 50 Universities:", average_graduation_rate, "%\n")


```


6. Filter universities with a retention rate above 90% and find the count of rows in the subset.


  Solution:
```{r}
high_retention_universities <- college_data %>%
  filter(`Retention Rate` > 90)

high_retention_count <- nrow(high_retention_universities)


cat("Number of universities with a retention rate above 90%:", high_retention_count, "\n")


```


7. Rank universities by enrollment rate in descending order and display the last 6 rows.


  Solution:
```{r}

ranked_universities <- college_data %>%
  arrange(desc(`Enrollment Rate`))

tail(ranked_universities, 6)

```


8. Create a histogram of graduation rates using ggplot2 library.


  Solution:
```{r}

if (!require("ggplot2")) install.packages("ggplot2")

library(ggplot2)

ggplot(data = college_data, aes(x = `Graduation Rate`)) +
  geom_histogram(binwidth = 5, color = "black", fill = "lightblue") +
  labs(
    title = "Histogram of Graduation Rates",
    x = "Graduation Rate (%)",
    y = "Frequency"
  ) +
  theme_minimal()

```


9. Plot a scatterplot between acceptance rate and enrollment rate using ggplot2 library.


  Solution:
```{r}
if (!require("ggplot2")) install.packages("ggplot2")

library(ggplot2)

ggplot(data = college_data, aes(x = `Acceptance Rate`, y = `Enrollment Rate`)) +
  geom_point(color = "blue", size = 2) +
  labs(
    title = "Scatterplot of Acceptance Rate vs Enrollment Rate",
    x = "Acceptance Rate (%)",
    y = "Enrollment Rate (%)"
  ) +
  theme_minimal()


```


10. Calculate the average default rate by aid rate category (e.g., grouped into ranges like 0-20%, 20-40%). Display the categories.


  Solution:
```{r}
college_data <- college_data %>%
  mutate(
    AidRateCategory = cut(
      `Institutional Aid Rate`,
      breaks = c(0, 20, 40, 60, 80, 100),
      labels = c("0-20%", "20-40%", "40-60%", "60-80%", "80-100%"),
      include.lowest = TRUE
    )
  )

average_default_rate <- college_data %>%
  group_by(AidRateCategory) %>%
  summarise(AverageDefaultRate = mean(`Default Rate`, na.rm = TRUE))

print(average_default_rate)


```




11. Normalize the acceptance rate to a scale of 0-1 and save in a new column "Acceptance Rate Normalized". Display the first 6 values.


  Solution:
 
```{r}
library(scales)
college_data <- college_data %>%
  mutate(`Acceptance Rate Normalized` = 
           (`Acceptance Rate` - min(`Acceptance Rate`, na.rm = TRUE)) / 
           (max(`Acceptance Rate`, na.rm = TRUE) - min(`Acceptance Rate`, na.rm = TRUE)))

head(college_data[, c("Acceptance Rate", "Acceptance Rate Normalized")])



```

12. What is the count of the duplicate entries in the "School" column? Remove duplicate university entries.


 Solution:

```{r}
duplicate_count <- sum(duplicated(college_data$School))
cat("Number of duplicate entries in the 'School' column:", duplicate_count, "\n")

college_data <- college_data %>%
  distinct(School, .keep_all = TRUE)

cat("Number of rows after removing duplicates:", nrow(college_data), "\n")


```


13. Find the correlation between graduation rate and retention rate (exclude the NAs in both columns).


 Solution:

```{r}
correlation <- cor(
  college_data$`Graduation Rate`,
  college_data$`Retention Rate`,
  use = "complete.obs" # Exclude NAs in both columns
)

cat("Correlation between Graduation Rate and Retention Rate:", correlation, "\n")


```



14. Extract the values in School column into a new variable without "University" in the string. For example "Rowan University" becomes "Rowan"


 Solution:

```{r}
college_data <- college_data %>%
  mutate(School_No_University = gsub(" University", "", School))

head(college_data[, c("School", "School_No_University")])

```




15. Count how many universities have "Institute" in their name.


 Solution:

```{r}
institute_count <- sum(grepl("Institute", college_data$School))

cat("Number of universities with 'Institute' in their name:", institute_count, "\n")



```

16. Export the cleaned and processed dataset to a CSV file.


 Solution:

```{r}
write.csv(college_data, "cleaned_college_data.csv", row.names = FALSE)

cat("The cleaned dataset has been exported to 'cleaned_college_data.csv'.\n")



```

