library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd("C:/Users/sarah/OneDrive/Desktop/Data 101")

enrolldata <- read.csv("Montgomery_College_Enrollment_Data_20260312.csv")

Introduction

Question: Which race has the highest number of students attending the Rockville campus?

Montgomery College is a community college in Montgomery County, Maryland. The enrollment data, provided by Montgomery College, contains 18 variables and 25,320 observations for the fall semester of 2015. Important variables include gender, race, age group, and an attendance indicator for each of the three campuses.

In this project, I will use the variables “race” and “attending_rockville” to answer my question. The “race” variable has 8 unique values: “White”, “Black”, “Asian”, “Hispanic”, “Pacific Islander”, “Multi-Race”, “Native American”, and “Unknown”, where “Unknown” means the race was not available. The “attending_rockville” variable has two unique values: “Yes” and “No”. If the value is “No”, the student was not attending Rockville and was enrolled at one of the other campuses.

Data Analysis

In this section, I will clean the dataset and prepare it for use. I will then check the class of the variables I will be using to confirm that each is stored as the expected type (numerical variables as numbers, categorical variables as text). Then, I will create a frequency table of the number of students attending the Rockville campus by race. Finally, I will plot the table.

Cleaning & EDA

Check for NA values

colSums(is.na(enrolldata))

##                Fall.Term             Student.Type           Student.Status 
##                        0                        0                        0 
##                   Gender                Ethnicity                     Race 
##                        0                        0                        0 
##     Attending.Germantown      Attending.Rockville Attending.Takoma.Park.SS 
##                        0                        0                        0 
##    Attend.Day.or.Evening   MC.Program.Description                Age.Group 
##                        0                        0                        0 
##              HS.Category         MCPS.High.School               City.in.MD 
##                        0                        0                        0 
##                    State                      ZIP             County.in.MD 
##                        0                       99                        0

Only the ZIP variable contains NA values (99 of them). Neither of our chosen variables, Race and Attending.Rockville, has any, so no rows need to be dropped.
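One caveat: `is.na()` only counts values that R read in as `NA`. In character columns, `read.csv()` keeps a blank field as an empty string rather than `NA`, so blank entries would not appear in the counts above. Below is a minimal sketch of an extra check. The data frame `toy`, its values, and the helper `blank_or_na()` are made up for illustration and are not part of the enrollment file.

```r
# Toy data standing in for the two columns of interest (values are made up)
toy <- data.frame(
  Race = c("White", "", "Asian", NA),
  Attending.Rockville = c("Yes", "No", " ", "Yes")
)

# Hypothetical helper: TRUE for entries that are NA, empty, or whitespace-only
blank_or_na <- function(x) is.na(x) | trimws(x) == ""

# Count such entries in each column
missing_counts <- colSums(sapply(toy, blank_or_na))
missing_counts
```

If every count is zero on the real data, the conclusion from `colSums(is.na(enrolldata))` holds for blank strings as well.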

Replace all dots with underscores and lowercase all the names of each of the variables.

names(enrolldata) <- gsub("[.]", "_", names(enrolldata)) # replace . with underscore
names(enrolldata) <- tolower(names(enrolldata))
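To see what these two lines do, here is the same transformation applied to three of the original column names. The vector `old_names` is only an illustration. Inside a character class such as `[.]`, the dot is matched literally instead of meaning "any character".

```r
# Three of the original column names, copied from the output above
old_names <- c("Fall.Term", "Attending.Rockville", "Attending.Takoma.Park.SS")

# Replace every literal dot with an underscore, then lowercase the result
new_names <- tolower(gsub("[.]", "_", old_names))
new_names
# "fall_term" "attending_rockville" "attending_takoma_park_ss"
```

An equivalent alternative is `gsub(".", "_", old_names, fixed = TRUE)`, which treats the pattern as plain text rather than a regular expression.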

unique(enrolldata$fall_term)

## [1] 2015

The only value for the “fall_term” variable is 2015, meaning this data is from the fall semester of 2015.

See the data type of each variable.

class(enrolldata$race)

## [1] "character"

class(enrolldata$attending_rockville)

## [1] "character"

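Both of our variables are “character”, meaning R stores their values as text strings. When reading a CSV file, R assigns the “character” class to any column whose values cannot all be read as numbers or logical values, so a column mixing numbers and text (e.g. “12” and “H3llo”) is also stored as “character”. For categorical variables such as these, “character” is the expected class.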
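The sketch below shows how `read.csv()` chooses a class for each column. It assumes R 4.0 or later, where text columns are no longer converted to factors by default. The CSV text and the column names `id`, `race`, and `code` are made up for illustration.

```r
# A tiny made-up CSV: id is all numbers, race is all text, code mixes both
csv_text <- "id,race,code\n1,White,12\n2,Asian,A7\n3,Black,30"
toy <- read.csv(text = csv_text)

# Class chosen for each column
col_classes <- sapply(toy, class)
col_classes
# id: "integer", race: "character", code: "character"
```

The single value `A7` is enough to make the whole `code` column “character”.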

Check the unique values for each variable.

unique(enrolldata$race)

## [1] "White"            "Black"            "Asian"            "Hispanic"        
## [5] "Pacific Islander" "Multi-Race"       "Native American"  "Unknown"

unique(enrolldata$attending_rockville)

## [1] "Yes" "No"

Both variables contain only the expected categories, with consistent spelling and capitalization, so they are ready for analysis.
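Reading the `unique()` output by eye works for a handful of categories. As a rough sketch, the same check can be automated by comparing the values against the set you expect. The vectors `expected_yes_no` and `toy_values` are made up for illustration; the enrollment data itself showed no such problems.

```r
# Categories we expect to see in a Yes/No column
expected_yes_no <- c("Yes", "No")

# Made-up values containing a lowercase variant and a trailing space
toy_values <- c("Yes", "No", "yes", "No ", "Yes")

# Values that appear in the data but not in the expected set
unexpected <- setdiff(unique(toy_values), expected_yes_no)
unexpected
# "yes" "No "
```

An empty result means every value matches one of the expected categories exactly.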

Summary Table

Create a new dataset, select only the two necessary variables, and filter for values that have “Yes” for attending Rockville. Group by the race and count every value in descending order for those attending Rockville.

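enrolldata2 <- enrolldata |>
  select(race, attending_rockville) |> # Keep only the two necessary variables
  filter(attending_rockville == "Yes") |> # Keep only students attending Rockville
  group_by(race) |>
  count(attending_rockville) |> # Count students within each race
  arrange(desc(n)) # Sort from most to fewest students
head(enrolldata2)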

## # A tibble: 6 × 3
## # Groups:   race [6]
##   race            attending_rockville     n
##   <chr>           <chr>               <int>
## 1 White           Yes                  6785
## 2 Black           Yes                  4431
## 3 Asian           Yes                  2591
## 4 Hispanic        Yes                  1349
## 5 Multi-Race      Yes                   555
## 6 Native American Yes                   343

We can see that the race with the highest number of students attending the Rockville campus is White, with 6,785 students. The runners-up are Black students (4,431) and Asian students (2,591).
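Raw counts are easier to compare when each one is also expressed as a share of the total. The following is a minimal sketch on a small made-up data frame (`toy` and its rows are illustrative, not real enrollment records). It uses `count()` with `sort = TRUE`, which groups, counts, and sorts in one step.

```r
library(dplyr)

# Made-up rows standing in for enrolldata (not real counts)
toy <- data.frame(
  race = c("White", "White", "White", "Black", "Black", "Asian", "White"),
  attending_rockville = c("Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes")
)

rockville_shares <- toy |>
  filter(attending_rockville == "Yes") |> # Rockville students only
  count(race, sort = TRUE) |>             # students per race, largest first
  mutate(share = n / sum(n))              # proportion of Rockville students
rockville_shares
```

Applied to the real data, the same pipeline would report what fraction of Rockville students each race represents.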

Plot

Create a bar graph of the dataset using race as the x-value and the number of students as the y-value. Add color by using “fill = race”, which gives each bar its own color. Use “coord_flip()” to flip the x- and y-axes, because the race labels overlap each other when positioned along the bottom.

ggplot(enrolldata2, aes(x = race, y = n, fill = race)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Race by Number of Students at the Rockville Campus",
    y = "Number of Students Enrolled",
    x = "Race"
  )

From this graph, we see that White students are the largest group at the Rockville campus.
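By default, ggplot2 orders the bars alphabetically by race, so the largest bar is not necessarily first. One possible refinement, sketched here on made-up counts (`toy_counts` is illustrative only), is to sort the bars with `reorder()` and hide the legend, which repeats the axis labels.

```r
library(ggplot2)

# Made-up counts for illustration only
toy_counts <- data.frame(
  race = c("Asian", "White", "Black"),
  n = c(1, 3, 2)
)

# reorder() turns race into a factor whose levels are sorted by n
toy_counts$race_sorted <- reorder(toy_counts$race, toy_counts$n)

p <- ggplot(toy_counts, aes(x = race_sorted, y = n, fill = race_sorted)) +
  geom_col(show.legend = FALSE) + # the legend would repeat the axis labels
  coord_flip() +
  labs(x = "Race", y = "Number of Students")
p
```

With `coord_flip()`, the level with the largest count ends up at the top of the chart.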

Conclusion

At the Rockville campus of Montgomery College, 6,785 White students were enrolled in the fall 2015 term, more than any other race. The runners-up are Black students (4,431) and Asian students (2,591). One possible explanation is that White residents make up a large share of the population around Rockville, although this dataset alone cannot confirm it.

This dataset can be analyzed further with statistical tests, such as chi-square tests and ANOVA, or used to explore possible associations between variables. For example, a chi-square test could address the question “Is there an association between race and attendance at the Rockville campus?” Furthermore, linear or logistic models can describe the relationship between variables.
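As a rough sketch of what such a follow-up could look like, the code below runs a chi-square test of independence on a small contingency table. The counts and the group labels in `toy_table` are made up for illustration and are not taken from the enrollment data, so the result says nothing about Montgomery College.

```r
# Made-up counts for illustration only (not taken from the enrollment data)
toy_table <- matrix(
  c(60, 40,
    45, 55,
    30, 70),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    race = c("Group A", "Group B", "Group C"),
    attending_rockville = c("Yes", "No")
  )
)

# Chi-square test of independence between the row and column variables
test_result <- chisq.test(toy_table)
test_result$p.value
```

On the real data, the table could presumably be built with `table(enrolldata$race, enrolldata$attending_rockville)` before passing it to `chisq.test()`.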

References

Montgomery County Government. (2023). Montgomery College enrollment data. data.montgomerycountymd.gov

Repository Link: https://data.montgomerycountymd.gov/Education/Montgomery-College-Enrollment-Data/wmr2-6hn6/about_data

Project 1 Data 101

Muhammad Ahtisham

2026-03-12