Morgan State University

Department of Information Science & Systems

Fall 2024

INSS 615: Data Wrangling for Visualization

Name: Natalia Miranda

Due: Dec 1, 2024 (Sunday)

Questions

A. Scrape the College Ranked by Acceptance Rate dataset available at this link: https://www.oedb.org/rankings/acceptance-rate/#table-rankings and select the first 9 columns [Rank, School, Student to Faculty Ratio, Graduation Rate, Retention Rate, Acceptance Rate, Enrollment Rate, Institutional Aid Rate, and Default Rate] as the dataset for this assignment. [20 Points]

Hint: There are 6 pages of data, so you may want to use a for loop to automate the scraping process and combine the data from all 6 pages. This is just a suggestion; you are free to create the dataset without automating the web scraping process.

Solution:

library(rvest) 
library(tidyverse)
library(readr)
library(dplyr)
library(scales)

# URL OF THE WEBSITE: url_base PAGE# url_end
url_base <- "https://www.oedb.org/rankings/acceptance-rate/page/" # Base URL for the target website
url_end <- "/#table-rankings" # Ending part of URL for the target website

# Empty list to store the tables from each page
tables_list <- list()

# Loop for the MULTIPAGE SCRAPING, through the page numbers (1 to 6)
for (page_num in 1:6) {
    # TARGET WEBSITE: build the full URL for each page
  page_url <- paste0(url_base, page_num, url_end)
  
    # Variable to store the PAGE HTML
  page <- rvest::read_html(page_url)
   
    #Using the HTML code, copy the *Xpath* or *CSS Selector* of the element (text or table) you want to get
  location <- "#content > div.js-data-list-r > table"
    #Extract the element
  page_table <- html_elements(page, css = location)
  
    # Convert the extracted element to a DATAFRAME
  target_table <- html_table(page_table)[[1]]

    # Keep the first 9 columns and store this page's table in the list
  tables_list[[page_num]] <- target_table[, 1:9]
}

# Join all tables into a SINGLE DATA FRAME
original <- bind_rows(tables_list) #Table to save the original version
college_table <- bind_rows(tables_list) #Table to edit on the following steps

# View the combined dataset
View(college_table)

B. You are going to need the dataset created in Question A to answer the following questions. There are 16 questions each carrying 5 points:

  1. Replace the missing values “N/A” in the dataset with NA.

Reviewing N/A Values:

#Checking N/A values within the Dataset
total_na_count <- sum(college_table == "N/A", na.rm = TRUE)
print(paste("Total 'N/A' values in the dataset:", total_na_count))
[1] "Total 'N/A' values in the dataset: 359"
# Column breakdown of N/A values
cols_with_na <- colSums(college_table == "N/A", na.rm = TRUE)
print("Breakdown of 'N/A' values by column:")
[1] "Breakdown of 'N/A' values by column:"
print(cols_with_na[cols_with_na > 0])  # Show only columns with at least one N/A
Graduation Rate  Retention Rate Acceptance Rate Enrollment Rate    Default Rate 
              6               4              29              29             291 

Solution:

#Replace Values
college_table[college_table == "N/A"] <- NA
  2. Convert percentage columns (e.g., Graduation Rate) to numeric format.

Verifying columns names:

names(college_table)
[1] "Rank"                     "School"                   "Student to Faculty Ratio"
[4] "Graduation Rate"          "Retention Rate"           "Acceptance Rate"         
[7] "Enrollment Rate"          "Institutional Aid Rate"   "Default Rate"            

Solution:

# Percentage columns, stored as text with a trailing "%"
percentage_columns <- c("Graduation Rate", "Retention Rate", "Acceptance Rate",
                        "Enrollment Rate", "Institutional Aid Rate", "Default Rate")

# Strip the "%" sign and convert each column to a proportion between 0 and 1
for (col in percentage_columns) {
  college_table[[col]] <- as.numeric(gsub("%", "", college_table[[col]])) / 100
}
  3. Transform the “Student to Faculty Ratio” column into two separate numeric columns: Students and Faculty.

Solution:

#Separate Columns
college_table <- college_table %>%
  separate("Student to Faculty Ratio", into = c("Students", "Faculty"), sep = " to ")

#Convert columns to NUMERIC VALUE
college_table$Students <- parse_number(college_table$Students)
college_table$Faculty <- parse_number(college_table$Faculty)
  4. What is the count of missing values in the “Default Rate” column? Impute the missing values in the “Default Rate” column with the median value.

Solution:

# Count the missing values in the "Default Rate" column
missing_DefaultRate <- sum(is.na(college_table$"Default Rate"))
missing_DefaultRate
[1] 291
# Calculate the median of the "Default Rate" column, ignoring NAs
median_DefaultRate <- median(college_table$"Default Rate", na.rm = TRUE)
median_DefaultRate
[1] 0.06
# Impute the missing values with the median
college_table$"Default Rate" <- ifelse(is.na(college_table$"Default Rate"),
                                       median_DefaultRate,
                                       college_table$"Default Rate")
  5. Find the average graduation rate for universities ranked in the top 50.

Solution:

# Select Universities ranked in top 50 
U_top_50 <- college_table %>% filter(Rank <= 50)

# Calculate the average graduation rate
avg_grad_rate <- mean(U_top_50$"Graduation Rate", na.rm = TRUE)

avg_grad_rate
[1] 0.7918
  6. Filter universities with a retention rate above 90% and find the count of rows in the subset.

Solution:

# Select universities with a retention rate above 90% (decimal format)
U_90retention <- college_table %>% filter(`Retention Rate` > 0.90)

# Count the rows in the filtered subset
count_90retention <- nrow(U_90retention)

# Print the result
count_90retention
[1] 98
  7. Rank universities by enrollment rate in descending order and display the last 6 rows.

Solution:

# Rank universities by enrollment rate in descending order, ignoring NAs
U_EnrollmentTail <- college_table %>%
  filter(!is.na(`Enrollment Rate`)) %>%  # Exclude rows with NA in Enrollment Rate
  arrange(desc(`Enrollment Rate`))      # Sort in descending order

# Display the last 6 rows
tail_rows <- tail(U_EnrollmentTail, 6)

tail_rows
  8. Create a histogram of graduation rates using the ggplot2 library.

HISTOGRAMS in ggplot2

    ggplot(data, aes(x = column_name)) +
      geom_histogram(binwidth = value, fill = "color", color = "border_color") +
      labs(title = "Plot Title", x = "X-axis Label", y = "Y-axis Label") +
      theme_style()

Solution:


college_table_clean <- college_table %>%
  filter(!is.na(`Graduation Rate`))

# Create a histogram of graduation rates
ggplot(college_table_clean, aes(x = `Graduation Rate`)) +  # Set x-axis
  geom_histogram(fill = "blue", color = "black") +  # Bar fill and border colors (default of 30 bins)
  labs(title = "Graduation Rates Distribution", x = "Graduation Rate", y = "University Count") +  # Title and axis labels
  theme_minimal() 
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  9. Plot a scatterplot between acceptance rate and enrollment rate using the ggplot2 library.

SCATTERPLOT in ggplot2

    ggplot(data = <dataframe>, aes(x = <x_column>, y = <y_column>)) +  # Map variables to axes
      geom_point(color = <point_color>, size = <point_size>, alpha = <transparency_value>) +  # Add points with customizations
      labs(
        title = "<Plot Title>",  # Add title
        x = "<X-axis Label>",    # Add x-axis label
        y = "<Y-axis Label>"     # Add y-axis label
      ) +
      theme_<theme_name>()  # Choose a theme (minimal, classic, etc.)

Solution:

ggplot(college_table, aes(x = `Acceptance Rate`, y = `Enrollment Rate`)) +  # Map variables to x and y axes
  geom_point(color = "blue", size = 1, alpha = 0.5) +  # Customize points features
  labs(
    title = "Acceptance Rate vs. Enrollment Rate",  # Title for the plot
    x = "Acceptance Rate",  # Name for x-axis
    y = "Enrollment Rate"   # Name for y-axis
  ) +
  theme_minimal() 
Warning: Removed 29 rows containing missing values or values outside the scale range (`geom_point()`).

  10. Calculate the average default rate by aid rate category (e.g., grouped into ranges like 0-20%, 20-40%). Display the categories.

Solution:

# Define RANGES for "Institutional Aid Rate" CATEGORIES
college_table <- college_table %>%
  mutate(`Institutional Aid Category` = case_when(
    `Institutional Aid Rate` >= 0 & `Institutional Aid Rate` < 0.2 ~ "0-20%",
    `Institutional Aid Rate` >= 0.2 & `Institutional Aid Rate` < 0.4 ~ "20-40%",
    `Institutional Aid Rate` >= 0.4 & `Institutional Aid Rate` < 0.6 ~ "40-60%",
    `Institutional Aid Rate` >= 0.6 & `Institutional Aid Rate` < 0.8 ~ "60-80%",
    `Institutional Aid Rate` >= 0.8 & `Institutional Aid Rate` <= 1 ~ "80-100%",
  ))

# Calculate the average Default Rate for each RANGE
avg_AidRaid <- college_table %>%
  filter(!is.na(`Default Rate`)) %>%  # Exclude NAs
    #Group the Default Rate values by the categories previously established
  group_by(`Institutional Aid Category`) %>% 
    #Calculate averages for each group
  summarize(Average_Default_Rate = mean(`Default Rate`)) %>%
    #Sort the results by the categories to keep order (ascending)
  arrange(`Institutional Aid Category`) 

avg_AidRaid
  11. Normalize the acceptance rate to a scale of 0-1 and save in a new column “Acceptance Rate Normalized”. Display the first 6 values.

$$x_{\text{normalized}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

From the scales library, use the function rescale():

    rescale(x, to = c(0, 1), from = range(x, na.rm = TRUE, finite = TRUE))

- x: the data to rescale
- to: the output range
- from: the input range, which defaults to the minimum and maximum of x with NA values ignored, so no separate na.rm argument is needed

Solution:

library(scales)

# Normalize the "Acceptance Rate" column using rescale()
college_table <- college_table %>%
  mutate(
    # Creating new column to save the normalized values with the 0-1 scale
    `Acceptance Rate Normalized` = rescale(`Acceptance Rate`, to = c(0, 1))  # Normalizes to 0-1 range; NA inputs stay NA
  )

# Display the first 6 normalized values
head(college_table$`Acceptance Rate Normalized`)
[1] 0.00000000 0.01063830 0.04255319 0.08510638 0.09574468 0.10638298
  12. What is the count of the duplicate entries in the “School” column? Remove duplicate university entries.

Solution:

# Count duplicates in the "School" column
duplicate_count <- sum(duplicated(college_table$School))

print(paste("There are", duplicate_count, "duplicated schools."))
[1] "There are 3 duplicated schools."
# Remove the duplicate entries based on the "School" column
college_table<- college_table %>%
  distinct(School, .keep_all = TRUE)
  13. Find the correlation between graduation rate and retention rate (exclude the NAs in both columns).

Use the function cor() to calculate the correlation between two columns, excluding NAs (complete.obs):

    result <- cor(data$Column1, data$Column2, use = "complete.obs", method = "pearson")

Solution:

# Calculate the correlation between "Graduation Rate" and "Retention Rate", excluding NAs
correlation_result <- cor(college_table$`Graduation Rate`, college_table$`Retention Rate`, use = "complete.obs")

print(paste("The correlation between Graduation Rate and Retention Rate is:", correlation_result))
[1] "The correlation between Graduation Rate and Retention Rate is: 0.615970939269887"
  14. Extract the values in the School column into a new variable without “University” in the string. For example, “Rowan University” becomes “Rowan”.

Use gsub() to replace a pattern in a string:

    new_column <- gsub("pattern_to_replace", "replacement_string", data$column_name)

Solution:

# Create a new variable "School_Name" by removing "University" from the "School" column
college_table$School_Name <- gsub(" University", "", college_table$School)

# Display the first few rows of the new column
head(college_table$School_Name)
[1] "Harvard"                    "Yale"                       "University of Pennsylvania"
[4] "Johns Hopkins"              "Cornell"                    "Tufts"                     
  15. Count how many universities have “Institute” in their name.

Use grepl() to get a logical vector marking the matches:

    grepl("pattern_to_find", dataset$column_name)

Wrap the call in sum() to get the count.

Solution:


# Count how many universities have "Institute" in their name
count_institute <- sum(grepl("Institute", college_table$School))


print(paste("There are", count_institute, "universities with Institute in their name."))
[1] "There are 17 universities with Institute in their name."
  16. Export the cleaned and processed dataset to a CSV file.

write.csv(data, file, row.names)

Solution:


# Export the cleaned and processed dataset to a CSV file
write.csv(college_table, "INSS615_H5_OutputFile.csv", row.names = FALSE)

print("csv file created")
[1] "csv file created"
getwd()
[1] "C:/Users/natal/Downloads"
---
title: "INSS615 Homework 5"
output:
  # word_document: default
  html_notebook: default
  html_document:
    df_print: paged
---


**Morgan State University**

**Department of Information Science & Systems**

**Fall 2024**

**INSS 615: Data Wrangling for Visualization**

**Name: Natalia Miranda**

*Due: Dec 1, 2024 (Sunday)*



Questions


A. Scrape the College Ranked by Acceptance Rate dataset available at this link: https://www.oedb.org/rankings/acceptance-rate/#table-rankings and select the first 9 columns [Rank, School, Student to Faculty Ratio, Graduation Rate, Retention Rate, Acceptance Rate, Enrollment Rate, Institutional Aid Rate, and Default Rate] as the dataset for this assignment. [20 Points]

Hint: There are 6 pages of data, so you may want to use a for loop to automate the scraping process and combine the data from all 6 pages. This is just a suggestion; you are free to create the dataset without automating the web scraping process.

 
  Solution:
```{r}
library(rvest) 
library(tidyverse)
library(readr)
library(dplyr)
library(scales)

# URL OF THE WEBSITE: url_base PAGE# url_end
url_base <- "https://www.oedb.org/rankings/acceptance-rate/page/" # Base URL for the target website
url_end <- "/#table-rankings" # Ending part of URL for the target website

# Empty list to store the tables from each page
tables_list <- list()

# Loop for the MULTIPAGE SCRAPING, through the page numbers (1 to 6)
for (page_num in 1:6) {
    # TARGET WEBSITE: build the full URL for each page
  page_url <- paste0(url_base, page_num, url_end)
  
    # Variable to store the PAGE HTML
  page <- rvest::read_html(page_url)
   
    #Using the HTML code, copy the *Xpath* or *CSS Selector* of the element (text or table) you want to get
  location <- "#content > div.js-data-list-r > table"
    #Extract the element
  page_table <- html_elements(page, css = location)
  
    # Convert the extracted element to a DATAFRAME
  target_table <- html_table(page_table)[[1]]

    # Keep the first 9 columns and store this page's table in the list
  tables_list[[page_num]] <- target_table[, 1:9]
}

# Join all tables into a SINGLE DATA FRAME
original <- bind_rows(tables_list) #Table to save the original version
college_table <- bind_rows(tables_list) #Table to edit on the following steps

# View the combined dataset
View(college_table)


```
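
The loop above stops at the first request that fails. As a minimal sketch of a more defensive variant, the block below assumes a hypothetical helper, `scrape_pages()`, that receives the page-reading function as an argument so the control flow can be tried without network access. The helper name and the stand-in reader `fake_reader()` are illustrative and not part of the assignment.

```r
# Build the six page URLs up front (same pattern as the loop above)
url_base <- "https://www.oedb.org/rankings/acceptance-rate/page/"
url_end  <- "/#table-rankings"
page_urls <- paste0(url_base, 1:6, url_end)

# Hypothetical helper: apply `read_page` to each URL and skip pages that fail
scrape_pages <- function(urls, read_page) {
  results <- lapply(urls, function(u) {
    tryCatch(read_page(u), error = function(e) {
      message("Skipping ", u, ": ", conditionMessage(e))
      NULL
    })
  })
  # Drop the failed pages and stack the rest into one data frame
  results <- Filter(Negate(is.null), results)
  do.call(rbind, results)
}

# Stand-in reader for illustration: fails on page 3, returns one row otherwise
fake_reader <- function(u) {
  if (grepl("/page/3/", u, fixed = TRUE)) stop("simulated timeout")
  data.frame(url = u, stringsAsFactors = FALSE)
}

demo <- scrape_pages(page_urls, fake_reader)
nrow(demo)
```

With rvest, `read_page` could be a small function that wraps the `read_html()`, `html_elements()`, and `html_table()` calls from the solution.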

B. You are going to need the dataset created in Question A to answer the following questions. There are 16 questions each carrying 5 points:

1. Replace the missing values "N/A" in the dataset with NA.

Reviewing N/A Values:
```{R}
#Checking N/A values within the Dataset
total_na_count <- sum(college_table == "N/A", na.rm = TRUE)
print(paste("Total 'N/A' values in the dataset:", total_na_count))

# Column breakdown of N/A values
cols_with_na <- colSums(college_table == "N/A", na.rm = TRUE)
print("Breakdown of 'N/A' values by column:")
print(cols_with_na[cols_with_na > 0])  # Show only columns with at least one N/A

```
  Solution:
```{r}
#Replace Values
college_table[college_table == "N/A"] <- NA

```
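
To see why the one-line replacement works, here is a small sketch on made-up data (the `toy` data frame is an assumption for illustration only). Comparing a data frame with a string yields a logical matrix, and indexing with that matrix assigns to every matching cell at once.

```r
# Toy data frame with the same "N/A" text placeholder as the scraped table
toy <- data.frame(
  School = c("A", "B", "C"),
  `Graduation Rate` = c("90%", "N/A", "75%"),
  `Default Rate` = c("N/A", "N/A", "3%"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Comparing a data frame to a string returns a logical matrix
is_placeholder <- toy == "N/A"
colSums(is_placeholder)   # per-column count of placeholders

# Logical-matrix indexing replaces every matching cell in one step
toy[is_placeholder] <- NA
sum(is.na(toy))
```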

2. Convert percentage columns (e.g., Graduation Rate) to numeric format.

Verifying columns names:
```{R}
names(college_table)
```

  Solution:
```{r}
# Percentage columns, stored as text with a trailing "%"
percentage_columns <- c("Graduation Rate", "Retention Rate", "Acceptance Rate",
                        "Enrollment Rate", "Institutional Aid Rate", "Default Rate")

# Strip the "%" sign and convert each column to a proportion between 0 and 1
for (col in percentage_columns) {
  college_table[[col]] <- as.numeric(gsub("%", "", college_table[[col]])) / 100
}

```
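
A minimal sketch of the conversion on a made-up vector (`pct_text` is illustrative, not taken from the dataset): the `%` sign is removed, the text is parsed as a number, and the result is divided by 100. Missing values pass through as `NA`.

```r
# Made-up percentage strings, including a missing value
pct_text <- c("85%", "7.5%", NA, "100%")

# fixed = TRUE treats "%" as a literal character rather than a regex
pct_num <- as.numeric(gsub("%", "", pct_text, fixed = TRUE)) / 100
pct_num
```

This only runs cleanly once the "N/A" strings have been replaced with `NA`; otherwise `as.numeric()` would warn about coercion.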


3. Transform the "Student to Faculty Ratio" column into two separate numeric columns: Students and Faculty.


  Solution:
```{r}
#Separate Columns
college_table <- college_table %>%
  separate("Student to Faculty Ratio", into = c("Students", "Faculty"), sep = " to ")

#Convert columns to NUMERIC VALUE
college_table$Students <- parse_number(college_table$Students)
college_table$Faculty <- parse_number(college_table$Faculty)

```
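
For comparison, the same split can be sketched in base R on a made-up vector, assuming every non-missing value follows the "`<students> to <faculty>`" pattern seen in the column. The vector `ratio_text` is illustrative.

```r
# Made-up ratio strings in the "<students> to <faculty>" form
ratio_text <- c("7 to 1", "12 to 1", NA)

# Split each string on the literal separator " to "
parts <- strsplit(ratio_text, " to ", fixed = TRUE)

# Take the first and second piece of each split and convert to numeric
students <- as.numeric(vapply(parts, function(p) p[1], character(1)))
faculty  <- as.numeric(vapply(parts, function(p) p[2], character(1)))

data.frame(Students = students, Faculty = faculty)
```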


4. What is the count of missing values in the "Default Rate" column? Impute the missing values in the "Default Rate" column with the median value.


  Solution:
```{r}
# Count the missing values in the "Default Rate" column
missing_DefaultRate <- sum(is.na(college_table$"Default Rate"))
missing_DefaultRate
# Calculate the median of the "Default Rate" column, ignoring NAs
median_DefaultRate <- median(college_table$"Default Rate", na.rm = TRUE)
median_DefaultRate

# Impute the missing values with the median
college_table$"Default Rate" <- ifelse(is.na(college_table$"Default Rate"),
                                       median_DefaultRate,
                                       college_table$"Default Rate")


```
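
The three steps above (count, median, impute) can be checked by hand on a short made-up vector. The values in `default_rate` below are illustrative only.

```r
# Made-up default rates with two missing entries
default_rate <- c(0.02, NA, 0.06, 0.10, NA)

# Count the missing values before imputing
n_missing <- sum(is.na(default_rate))

# Median of the observed values 0.02, 0.06, 0.10
med <- median(default_rate, na.rm = TRUE)

# Replace only the missing entries with the median
default_rate[is.na(default_rate)] <- med
default_rate
```

Logical indexing is an equivalent alternative to the `ifelse()` call used in the solution.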


5. Find the average graduation rate for universities ranked in the top 50.


  Solution:
```{r}
# Select Universities ranked in top 50 
U_top_50 <- college_table %>% filter(Rank <= 50)

# Calculate the average graduation rate
avg_grad_rate <- mean(U_top_50$"Graduation Rate", na.rm = TRUE)

avg_grad_rate

```


6. Filter universities with a retention rate above 90% and find the count of rows in the subset.


  Solution:
```{r}
# Select universities with a retention rate above 90% (decimal format)
U_90retention <- college_table %>% filter(`Retention Rate` > 0.90)

# Count the rows in the filtered subset
count_90retention <- nrow(U_90retention)

# Print the result
count_90retention

```


7. Rank universities by enrollment rate in descending order and display the last 6 rows.


  Solution:
```{r}
# Rank universities by enrollment rate in descending order, ignoring NAs
U_EnrollmentTail <- college_table %>%
  filter(!is.na(`Enrollment Rate`)) %>%  # Exclude rows with NA in Enrollment Rate
  arrange(desc(`Enrollment Rate`))      # Sort in descending order

# Display the last 6 rows
tail_rows <- tail(U_EnrollmentTail, 6)

tail_rows

```

8. Create a histogram of graduation rates using ggplot2 library.

HISTOGRAMS in ggplot2

    ggplot(data, aes(x = column_name)) +
      geom_histogram(binwidth = value, fill = "color", color = "border_color") +
      labs(title = "Plot Title", x = "X-axis Label", y = "Y-axis Label") +
      theme_style()

  Solution:
```{r}

college_table_clean <- college_table %>%
  filter(!is.na(`Graduation Rate`))

# Create a histogram of graduation rates
ggplot(college_table_clean, aes(x = `Graduation Rate`)) +  # Set x-axis
  geom_histogram(fill = "blue", color = "black") +  # Bar fill and border colors (default of 30 bins)
  labs(title = "Graduation Rates Distribution", x = "Graduation Rate", y = "University Count") +  # Title and axis labels
  theme_minimal() 


```


9. Plot a scatterplot between acceptance rate and enrollment rate using ggplot2 library.

SCATTERPLOT in ggplot2

    ggplot(data = <dataframe>, aes(x = <x_column>, y = <y_column>)) +  # Map variables to axes
      geom_point(color = <point_color>, size = <point_size>, alpha = <transparency_value>) +  # Add points with customizations
      labs(
        title = "<Plot Title>",  # Add title
        x = "<X-axis Label>",    # Add x-axis label
        y = "<Y-axis Label>"     # Add y-axis label
      ) +
      theme_<theme_name>()  # Choose a theme (minimal, classic, etc.)

  Solution:
```{r}
ggplot(college_table, aes(x = `Acceptance Rate`, y = `Enrollment Rate`)) +  # Map variables to x and y axes
  geom_point(color = "blue", size = 1, alpha = 0.5) +  # Customize points features
  labs(
    title = "Acceptance Rate vs. Enrollment Rate",  # Title for the plot
    x = "Acceptance Rate",  # Name for x-axis
    y = "Enrollment Rate"   # Name for y-axis
  ) +
  theme_minimal() 

```


10. Calculate the average default rate by aid rate category (e.g., grouped into ranges like 0-20%, 20-40%). Display the categories.


  Solution:
```{r}
# Define RANGES for "Institutional Aid Rate" CATEGORIES
college_table <- college_table %>%
  mutate(`Institutional Aid Category` = case_when(
    `Institutional Aid Rate` >= 0 & `Institutional Aid Rate` < 0.2 ~ "0-20%",
    `Institutional Aid Rate` >= 0.2 & `Institutional Aid Rate` < 0.4 ~ "20-40%",
    `Institutional Aid Rate` >= 0.4 & `Institutional Aid Rate` < 0.6 ~ "40-60%",
    `Institutional Aid Rate` >= 0.6 & `Institutional Aid Rate` < 0.8 ~ "60-80%",
    `Institutional Aid Rate` >= 0.8 & `Institutional Aid Rate` <= 1 ~ "80-100%",
  ))

# Calculate the average Default Rate for each RANGE
avg_AidRaid <- college_table %>%
  filter(!is.na(`Default Rate`)) %>%  # Exclude NAs
    #Group the Default Rate values by the categories previously established
  group_by(`Institutional Aid Category`) %>% 
    #Calculate averages for each group
  summarize(Average_Default_Rate = mean(`Default Rate`)) %>%
    #Sort the results by the categories to keep order (ascending)
  arrange(`Institutional Aid Category`) 

avg_AidRaid


```
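
As an alternative sketch, base R's `cut()` can build the same five categories in one call. The vectors below are made up for illustration; `right = FALSE` makes each interval closed on the left, matching the `>=` and `<` conditions in the `case_when()` above, and `include.lowest = TRUE` closes the last interval so that a rate of exactly 1 is kept.

```r
# Made-up aid rates and default rates
aid_rate     <- c(0.05, 0.20, 0.39, 0.60, 0.80, 1.00, NA)
default_rate <- c(0.08, 0.06, 0.04, 0.05, 0.03, 0.01, 0.06)

aid_category <- cut(
  aid_rate,
  breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
  labels = c("0-20%", "20-40%", "40-60%", "60-80%", "80-100%"),
  right = FALSE,          # intervals are [lower, upper)
  include.lowest = TRUE   # with right = FALSE, the last interval becomes [0.8, 1]
)

# Average default rate per category; empty categories come back as NA
avg_by_category <- tapply(default_rate, aid_category, mean)
avg_by_category
```

The breaks are written out literally rather than generated with `seq(0, 1, by = 0.2)`, because floating-point rounding in the generated sequence can push a boundary value such as 0.6 into the lower bin.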


11. Normalize the acceptance rate to a scale of 0-1 and save in a new column "Acceptance Rate Normalized". Display the first 6 values.

$$x_{\text{normalized}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

From the scales library, use the function rescale():

    rescale(x, to = c(0, 1), from = range(x, na.rm = TRUE, finite = TRUE))

- x: the data to rescale
- to: the output range
- from: the input range, which defaults to the minimum and maximum of x with NA values ignored, so no separate na.rm argument is needed

  Solution:
 
```{r}
library(scales)

# Normalize the "Acceptance Rate" column using rescale()
college_table <- college_table %>%
  mutate(
    # Creating new column to save the normalized values with the 0-1 scale
    `Acceptance Rate Normalized` = rescale(`Acceptance Rate`, to = c(0, 1))  # Normalizes to 0-1 range; NA inputs stay NA
  )

# Display the first 6 normalized values
head(college_table$`Acceptance Rate Normalized`)


```
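
The min-max formula can also be written out directly, which makes it easy to check what `rescale()` is doing. This is a minimal sketch with a hypothetical helper `min_max()` and made-up acceptance rates.

```r
# Hypothetical helper implementing (x - min(x)) / (max(x) - min(x))
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)   # minimum and maximum, ignoring NAs
  (x - rng[1]) / (rng[2] - rng[1])
}

# Made-up acceptance rates, including a missing value
acceptance <- c(0.06, 0.07, 0.10, NA, 1.00)
normalized <- min_max(acceptance)
normalized
```

The smallest value maps to 0, the largest to 1, and missing inputs stay missing.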

12. What is the count of the duplicate entries in the "School" column? Remove duplicate university entries.


 Solution:

```{r}
# Count duplicates in the "School" column
duplicate_count <- sum(duplicated(college_table$School))

print(paste("There are", duplicate_count, "duplicated schools."))

# Remove the duplicate entries based on the "School" column
college_table<- college_table %>%
  distinct(School, .keep_all = TRUE)

```
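
One detail worth spelling out: `duplicated()` flags every repeat after the first occurrence, so the count is the number of extra rows, not the number of distinct schools that repeat. A small sketch on made-up names:

```r
# Made-up table: "Alpha College" appears three times, "Beta University" twice
schools <- data.frame(
  School = c("Alpha College", "Beta University", "Alpha College",
             "Gamma Institute", "Beta University", "Alpha College"),
  Rank = 1:6,
  stringsAsFactors = FALSE
)

# TRUE for each row whose School value has already appeared above it
dup_flags <- duplicated(schools$School)
sum(dup_flags)                                      # extra rows: 3
length(unique(schools$School[dup_flags]))           # distinct repeated schools: 2

# Keep the first occurrence of each school
deduped <- schools[!dup_flags, ]
deduped
```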

13. Find the correlation between graduation rate and retention rate (exclude the NAs in both columns).

Use the function cor() to calculate the correlation between two columns, excluding NAs (complete.obs): 

    result <- cor(data$Column1, data$Column2, use = "complete.obs", method = "pearson")


 Solution:

```{r}
# Calculate the correlation between "Graduation Rate" and "Retention Rate", excluding NAs
correlation_result <- cor(college_table$`Graduation Rate`, college_table$`Retention Rate`, use = "complete.obs")

print(paste("The correlation between Graduation Rate and Retention Rate is:", correlation_result))


```
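
To illustrate what `use = "complete.obs"` does, the sketch below uses two made-up vectors and compares the result with manually dropping every pair that has a missing value on either side.

```r
# Made-up rates; each vector has one missing value in a different position
grad <- c(0.90, 0.80, NA,   0.60, 0.70)
ret  <- c(0.95, 0.85, 0.90, NA,   0.75)

# Without handling NAs the result is NA
cor(grad, ret)

# complete.obs keeps only the pairs where both values are present
r_complete <- cor(grad, ret, use = "complete.obs")

# Equivalent manual version using complete.cases()
keep <- complete.cases(grad, ret)
r_manual <- cor(grad[keep], ret[keep])

c(r_complete, r_manual)
```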

14. Extract the values in School column into a new variable without "University" in the string. For example "Rowan University" becomes "Rowan"

Use gsub() to replace a pattern in a string:

    new_column <- gsub("pattern_to_replace", "replacement_string", data$column_name)

 Solution:

```{r}
# Create a new variable "School_Name" by removing "University" from the "School" column
college_table$School_Name <- gsub(" University", "", college_table$School)

# Display the first few rows of the new column
head(college_table$School_Name)


```
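
The pattern `" University"` only matches when the word follows a space, so names that begin with it, such as "University of Pennsylvania", pass through unchanged. If those should be shortened as well, one possible sketch follows. Dropping the trailing "of" is an assumption, since the assignment does not say how such names should look, and the `school` vector is illustrative.

```r
school <- c("Harvard University", "University of Pennsylvania",
            "Rowan University", "Tufts University",
            "Massachusetts Institute of Technology")

# Remove the whole word "University" plus an optional following " of"
stripped <- gsub("\\bUniversity\\b( of)?", "", school, perl = TRUE)

# Collapse repeated spaces and trim the ends
school_name <- trimws(gsub("\\s+", " ", stripped, perl = TRUE))
school_name
```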


15. Count how many universities have "Institute" in their name.

Use grepl() to get a logical vector marking the matches:

    grepl("pattern_to_find", dataset$column_name)

Wrap the call in sum() to get the count.

 Solution:

```{r}

# Count how many universities have "Institute" in their name
count_institute <- sum(grepl("Institute", college_table$School))


print(paste("There are", count_institute, "universities with Institute in their name."))

```
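
`grepl()` is case-sensitive by default and returns `FALSE` for missing values. The sketch below, on made-up names, shows the difference between an exact match and a case-insensitive one.

```r
# Made-up names; the third uses a lowercase "institute"
school <- c("Massachusetts Institute of Technology", "Rowan University",
            "California institute of the Arts", "Pratt Institute", NA)

# Exact, case-sensitive match on the literal text
exact_count <- sum(grepl("Institute", school, fixed = TRUE))

# Case-insensitive match also catches the lowercase spelling
any_case_count <- sum(grepl("institute", school, ignore.case = TRUE))

c(exact = exact_count, any_case = any_case_count)
```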

16. Export the cleaned and processed dataset to a CSV file.

write.csv(data, file, row.names)

 Solution:

```{r}

# Export the cleaned and processed dataset to a CSV file
write.csv(college_table, "INSS615_H5_OutputFile.csv", row.names = FALSE)

print("csv file created")

getwd()
```
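
A quick way to confirm an export is to read the file back. The sketch below writes a made-up data frame to a temporary file (the file name `demo_output.csv` is illustrative) and checks that column names with spaces and missing values survive the round trip.

```r
# Made-up data frame with a space in a column name and one missing value
out <- data.frame(
  School = c("Alpha College", "Beta University"),
  `Graduation Rate` = c(0.9, NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Write to a temporary location so the working directory is left untouched
path <- file.path(tempdir(), "demo_output.csv")
write.csv(out, path, row.names = FALSE)

# check.names = FALSE keeps "Graduation Rate" from becoming "Graduation.Rate"
back <- read.csv(path, check.names = FALSE, stringsAsFactors = FALSE)
back
```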

