The dataset I am working with contains information about student enrollment at Montgomery College. It provides details on students’ demographics, enrollment status, program of study, and campus attendance. Specifically, the dataset includes variables such as:
fall_term: The academic year or term of enrollment.
age_group: The age range of the student (e.g., under 18, 18-24, 25-34, etc.).
mc_program_description: The program or major in which the student is enrolled.
student_type: The type of student (e.g., new, continuing).
student_status: Whether the student is part-time or full-time.
gender: Gender of the student.
ethnicity / race: Self-reported ethnicity and race.
campus: Which campus the student attends, recorded in three separate columns (Attending Germantown, Attending Rockville, Attending Takoma Park/SS).
The goal of this project is to explore enrollment patterns by age group, focusing on identifying the most popular majors across different age categories.
The data comes from dataMontgomery, the Montgomery County government open data portal.
Loading necessary libraries and dataset
library(tidyverse)
Warning: package 'readr' was built under R version 4.4.3
Warning: package 'lubridate' was built under R version 4.4.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("C:/Users/Lenovo/Downloads/SummerData110") # set working directory
data <- read_csv("Montgomery_College_Enrollment_Data_20250612.csv") # load dataset
Rows: 25320 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): Student Type, Student Status, Gender, Ethnicity, Race, Attending G...
dbl (2): Fall Term, ZIP
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Checking the column names
colnames(data)
[1] "Fall Term" "Student Type"
[3] "Student Status" "Gender"
[5] "Ethnicity" "Race"
[7] "Attending Germantown" "Attending Rockville"
[9] "Attending Takoma Park/SS" "Attend Day or Evening"
[11] "MC Program Description" "Age Group"
[13] "HS Category" "MCPS High School"
[15] "City in MD" "State"
[17] "ZIP" "County in MD"
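The cleaning and counting steps that the rest of the analysis relies on are not shown in the output above. Here is a minimal sketch of what they could look like, assuming the names are cleaned with rename_with() and the students are counted with count(). The toy tibble, its values, and the names toy, toy_clean, na_counts, and summary_toy are made up for illustration; the real analysis would apply the same steps to data.

```r
library(dplyr)

# Hypothetical stand-in for the enrollment data (values are invented)
toy <- tibble::tibble(
  `Age Group` = c("18-20", "18-20", "18-20", "21-24", "21-24"),
  `MC Program Description` = c("General Studies", "General Studies",
                               "Business", "General Studies", "Nursing")
)

# Convert column names to lower case and replace spaces with underscores
toy_clean <- toy |>
  rename_with(function(x) gsub(" ", "_", tolower(x)))

# Count missing values in the two columns used for the plot
na_counts <- colSums(is.na(toy_clean[c("age_group", "mc_program_description")]))

# Count students per age group and program
summary_toy <- toy_clean |>
  count(age_group, mc_program_description, name = "n_students")
```

The resulting table has one row per age group and program, with the student count in n_students, which is the shape the top-five step expects.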
# Keep the five largest programs within each age group
top_programs <- summary_data |>
  group_by(age_group) |>
  slice_max(order_by = n_students, n = 5) |>
  ungroup()
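One detail worth knowing about slice_max(): by default it keeps ties (with_ties = TRUE), so an age group can end up with more than five programs if several share the same count. A small sketch on invented data shows the difference; toy_counts and its values are hypothetical.

```r
library(dplyr)

# Hypothetical counts with a tie for second place
toy_counts <- tibble::tibble(
  age_group = c("A", "A", "A"),
  program = c("P1", "P2", "P3"),
  n_students = c(5, 3, 3)
)

# Default behavior: tied rows are all kept
kept_ties <- toy_counts |>
  group_by(age_group) |>
  slice_max(order_by = n_students, n = 2) |>
  ungroup()

# with_ties = FALSE returns exactly n rows per group
no_ties <- toy_counts |>
  group_by(age_group) |>
  slice_max(order_by = n_students, n = 2, with_ties = FALSE) |>
  ungroup()

nrow(kept_ties) # 3
nrow(no_ties)   # 2
```

Whether to drop ties is a judgment call: keeping them is more honest about the data, while dropping them guarantees exactly five bars per panel.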
Visualizing using ggplot
library(ggplot2)

ggplot(top_programs, aes(x = reorder(mc_program_description, n_students), y = n_students, fill = mc_program_description)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ age_group) + # remove scales = "free_y" to unify x-axis
  coord_flip() +
  labs(
    title = "Top 5 Programs in Each Age Group \nat Montgomery College",
    subtitle = "Enrollment data from Montgomery \nCounty government open data portal",
    x = "Program",
    y = "Number of Students",
    caption = "Source: https://data.montgomerycountymd.gov/"
  ) +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal(base_size = 12) +
  theme(
    strip.text = element_text(face = "bold", size = 12),
    axis.text.y = element_text(size = 10),
    plot.title = element_text(face = "bold", size = 14)
  )
Final Essay
In this project, I worked with a dataset on student enrollment at Montgomery College. As part of data cleaning, I converted all variable names to lower case and replaced spaces with underscores. Luckily, the two columns I needed for the visualization (age_group and mc_program_description) did not contain any missing values, so I didn't have to handle NAs in this analysis. The visualization I created shows the top five programs by age group. It highlights that most students are younger than 20 years old, which makes sense since Montgomery College is a two-year institution that attracts many students right after high school. I was surprised to see that the most popular program across all age groups is General Studies. This could mean that many students are still exploring their academic and career interests. Additionally, I noticed that Health Sciences is consistently among the top choices across age groups, which might point to strong interest in healthcare professions in the region. Business and Engineering Science also appear frequently. One thing I wish I could have done is generate a separate plot for each age group, each with its own labels for that group's top programs.
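For the separate plots, one possible approach is to split the top-programs table by age group and build one ggplot per piece. This is only a sketch on invented data, not something I ran on the real dataset; top_toy, its values, and plots are hypothetical names.

```r
library(ggplot2)

# Hypothetical stand-in for top_programs (values are invented)
top_toy <- tibble::tibble(
  age_group = c("A", "A", "B", "B"),
  mc_program_description = c("General Studies", "Business",
                             "General Studies", "Nursing"),
  n_students = c(10, 4, 7, 3)
)

# Split into one data frame per age group, then build one plot for each
plots <- lapply(split(top_toy, top_toy$age_group), function(df) {
  ggplot(df, aes(x = reorder(mc_program_description, n_students),
                 y = n_students)) +
    geom_col() +
    coord_flip() +
    labs(
      title = paste("Top programs, age group", df$age_group[1]),
      x = "Program",
      y = "Number of Students"
    )
})

# plots is a named list; print(plots[["A"]]) draws the plot for group A
```

Because each plot only sees its own group's rows, every chart lists just that group's programs on its axis.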
Note: I used ChatGPT to help summarize the data and debug my plotting code.