Project 1

Author

Viktoriia L.

Top 5 Programs by Age Group at Montgomery College

Introduction

The dataset I am working with contains information about student enrollment at Montgomery College. It provides details on students’ demographics, enrollment status, program of study, and campus attendance. Specifically, the dataset includes variables such as:

fall_term: The academic year or term of enrollment.
age_group: The age range of the student (e.g., under 18, 18-24, 25-34, etc.).
mc_program_description: The program or major in which the student is enrolled.
student_type: The type of student (e.g., new, continuing).
student_status: Whether the student is part-time or full-time.
gender: Gender of the student.
ethnicity / race: Self-reported ethnicity and race.
campus: Which campus the student attends, recorded in three separate columns (attending_germantown, attending_rockville, attending_takoma_park/ss).

The goal of this project is to explore enrollment patterns by age group, focusing on identifying the most popular majors across different age categories.

The data comes from dataMontgomery, the Montgomery County government open data portal.

Loading necessary libraries and dataset

library(tidyverse)

Warning: package 'readr' was built under R version 4.4.3

Warning: package 'lubridate' was built under R version 4.4.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd("C:/Users/Lenovo/Downloads/SummerData110") # set working directory
data <- read_csv("Montgomery_College_Enrollment_Data_20250612.csv") # load dataset

Rows: 25320 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): Student Type, Student Status, Gender, Ethnicity, Race, Attending G...
dbl  (2): Fall Term, ZIP

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Checking the column names

colnames(data)

 [1] "Fall Term"                "Student Type"            
 [3] "Student Status"           "Gender"                  
 [5] "Ethnicity"                "Race"                    
 [7] "Attending Germantown"     "Attending Rockville"     
 [9] "Attending Takoma Park/SS" "Attend Day or Evening"   
[11] "MC Program Description"   "Age Group"               
[13] "HS Category"              "MCPS High School"        
[15] "City in MD"               "State"                   
[17] "ZIP"                      "County in MD"

Cleaning the variable names

colnames(data) <- gsub(" ", "_", tolower(colnames(data)))

colnames(data)

 [1] "fall_term"                "student_type"            
 [3] "student_status"           "gender"                  
 [5] "ethnicity"                "race"                    
 [7] "attending_germantown"     "attending_rockville"     
 [9] "attending_takoma_park/ss" "attend_day_or_evening"   
[11] "mc_program_description"   "age_group"               
[13] "hs_category"              "mcps_high_school"        
[15] "city_in_md"               "state"                   
[17] "zip"                      "county_in_md"
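The slash in attending_takoma_park/ss survives this cleaning, which is why that column has to be quoted in backticks later on. Below is a minimal sketch of one alternative, assuming it is acceptable to turn every run of non-alphanumeric characters into a single underscore. The helper clean_names_sketch is a hypothetical name used only for illustration; it is not part of the analysis above.

```r
# Hypothetical helper: lowercase the names, then collapse anything that is
# not a letter or digit (spaces, slashes, ...) into a single underscore.
clean_names_sketch <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)
  gsub("^_+|_+$", "", x)  # trim underscores left at either end
}

clean_names_sketch(c("Fall Term", "Attending Takoma Park/SS"))
# [1] "fall_term"                "attending_takoma_park_ss"
```

With names cleaned this way, every column could be referenced without backticks.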

View first few rows

head(data)

# A tibble: 6 × 18
  fall_term student_type student_status gender ethnicity    race    
      <dbl> <chr>        <chr>          <chr>  <chr>        <chr>   
1      2015 Continuing   Full-Time      Female Not Hispanic White   
2      2015 Continuing   Part-Time      Male   Not Hispanic White   
3      2015 Continuing   Part-Time      Male   Not Hispanic Black   
4      2015 New          Full-Time      Male   Not Hispanic Asian   
5      2015 New          Full-Time      Female Hispanic     White   
6      2015 Continuing   Full-Time      Female Hispanic     Hispanic
# ℹ 12 more variables: attending_germantown <chr>, attending_rockville <chr>,
#   `attending_takoma_park/ss` <chr>, attend_day_or_evening <chr>,
#   mc_program_description <chr>, age_group <chr>, hs_category <chr>,
#   mcps_high_school <chr>, city_in_md <chr>, state <chr>, zip <dbl>,
#   county_in_md <chr>

Summary statistics for all columns

summary(data)

   fall_term    student_type       student_status        gender         
 Min.   :2015   Length:25320       Length:25320       Length:25320      
 1st Qu.:2015   Class :character   Class :character   Class :character  
 Median :2015   Mode  :character   Mode  :character   Mode  :character  
 Mean   :2015                                                           
 3rd Qu.:2015                                                           
 Max.   :2015                                                           
                                                                        
  ethnicity             race           attending_germantown attending_rockville
 Length:25320       Length:25320       Length:25320         Length:25320       
 Class :character   Class :character   Class :character     Class :character   
 Mode  :character   Mode  :character   Mode  :character     Mode  :character   
                                                                               
                                                                               
                                                                               
                                                                               
 attending_takoma_park/ss attend_day_or_evening mc_program_description
 Length:25320             Length:25320          Length:25320          
 Class :character         Class :character      Class :character      
 Mode  :character         Mode  :character      Mode  :character      
                                                                      
                                                                      
                                                                      
                                                                      
  age_group         hs_category        mcps_high_school    city_in_md       
 Length:25320       Length:25320       Length:25320       Length:25320      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
    state                zip        county_in_md      
 Length:25320       Min.   :  926   Length:25320      
 Class :character   1st Qu.:20852   Class :character  
 Mode  :character   Median :20877   Mode  :character  
                    Mean   :20892                     
                    3rd Qu.:20902                     
                    Max.   :95492                     
                    NA's   :99

Check NAs in each column

colSums(is.na(data))

               fall_term             student_type           student_status 
                       0                        0                        0 
                  gender                ethnicity                     race 
                       0                        0                        0 
    attending_germantown      attending_rockville attending_takoma_park/ss 
                       0                        0                        0 
   attend_day_or_evening   mc_program_description                age_group 
                       0                        0                        0 
             hs_category         mcps_high_school               city_in_md 
                       0                    11762                       97 
                   state                      zip             county_in_md 
                       8                       99                        0

Counting the distinct programs at Montgomery College

data |>
  distinct(mc_program_description) |>
  count()

# A tibble: 1 × 1
      n
  <int>
1    96

Grouping data by age and program

summary_data <- data |>
  group_by(age_group, mc_program_description) |>
  summarise(n_students = n(), .groups = "drop")
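The group_by() plus summarise(n()) pattern above can also be written with dplyr's count(), which is a shorthand for the same grouping and tallying. The sketch below uses a few made-up rows (the toy data frame and its values are assumptions for illustration, not the real enrollment data) to show that both forms give the same counts.

```r
library(dplyr)

# Hypothetical toy rows standing in for the enrollment data
toy <- data.frame(
  age_group = c("A", "A", "A", "B"),
  mc_program_description = c("General Studies", "General Studies",
                             "Business", "Business")
)

# Same pattern as in the analysis above
by_summarise <- toy |>
  group_by(age_group, mc_program_description) |>
  summarise(n_students = n(), .groups = "drop")

# Shorthand: count() groups, tallies, and ungroups in one step
by_count <- toy |>
  count(age_group, mc_program_description, name = "n_students")

by_count
```

Either form should work here; count() is simply shorter when the only summary needed is the number of rows.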

Filtering out the "Unknown" age group

summary_data <- summary_data |>
  filter(age_group != "Unknown")

Slicing the top 5 programs in each age group

top_programs <- summary_data |>
  group_by(age_group) |>
  slice_max(order_by = n_students, n = 5) |>
  ungroup()
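One detail worth keeping in mind: slice_max() keeps ties by default (with_ties = TRUE), so an age group can come back with more than five programs if several share the same count. The sketch below illustrates this on made-up numbers; the toy data frame, its labels, and its counts are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical counts: two programs tie for second place in group "A"
toy <- data.frame(
  age_group = c("A", "A", "A", "B", "B"),
  mc_program_description = c("P1", "P2", "P3", "P1", "P2"),
  n_students = c(10, 7, 7, 5, 3)
)

# Default behaviour: tied rows are all kept, so group "A" returns 3 rows
top_with_ties <- toy |>
  group_by(age_group) |>
  slice_max(order_by = n_students, n = 2) |>
  ungroup()

# with_ties = FALSE returns exactly n rows per group
top_no_ties <- toy |>
  group_by(age_group) |>
  slice_max(order_by = n_students, n = 2, with_ties = FALSE) |>
  ungroup()

nrow(top_with_ties)  # 5
nrow(top_no_ties)    # 4
```

Whether to keep or drop ties is a judgment call; dropping them guarantees exactly five bars per panel, while keeping them avoids an arbitrary choice between equally popular programs.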

Visualizing using ggplot

library(ggplot2)

ggplot(top_programs, aes(x = reorder(mc_program_description, n_students), y = n_students, fill = mc_program_description)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ age_group) +  # default fixed scales (no scales = "free_y"), so all panels share the same axes
  coord_flip() +
  labs(
    title = "Top 5 Programs in Each Age Group \nat Montgomery College",
    subtitle = "Enrollment data from Montgomery \nCounty government open data portal",
    x = "Program",
    y = "Number of Students",
    caption = "Source: https://data.montgomerycountymd.gov/"
  ) +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal(base_size = 12) +
  theme(
    strip.text = element_text(face = "bold", size = 12),
    axis.text.y = element_text(size = 10),
    plot.title = element_text(face = "bold", size = 14)
  )
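A possible variation on the faceted plot is to draw one chart per age group, so that each chart lists and orders only its own top programs. The sketch below is one way this might be done, using split() and lapply() on made-up counts. The toy_top data frame and the plot_one_group helper are hypothetical names introduced for illustration; with the real data, top_programs would take the place of toy_top.

```r
library(ggplot2)

# Hypothetical counts standing in for top_programs
toy_top <- data.frame(
  age_group = c("A", "A", "B", "B"),
  mc_program_description = c("General Studies", "Business",
                             "General Studies", "Nursing"),
  n_students = c(100, 40, 30, 20)
)

# Hypothetical helper: one bar chart for a single age group,
# with programs ordered by their count within that group only
plot_one_group <- function(df) {
  ggplot(df, aes(x = n_students,
                 y = reorder(mc_program_description, n_students))) +
    geom_col(fill = "steelblue") +
    labs(
      title = paste("Top programs, age group", df$age_group[1]),
      x = "Number of Students",
      y = "Program"
    ) +
    theme_minimal(base_size = 12)
}

# One plot per age group, stored in a named list
plots <- lapply(split(toy_top, toy_top$age_group), plot_one_group)
names(plots)
```

Each element of plots can then be printed on its own, for example plots[["A"]].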

Final Essay

In this project, I worked with a dataset on student enrollment at Montgomery College. As part of data cleaning, I converted all variable names to lowercase and replaced spaces with underscores. Luckily, the two columns I needed for the visualization, age_group and mc_program_description, did not contain any missing values, so I didn’t have to handle NAs in this analysis.

The visualization I created shows the top five programs by age group. It highlights that most students are younger than 20 years old, which makes sense since Montgomery College is a two-year institution that attracts many students right after high school. I was surprised to see that the most popular program across all age groups is General Studies. This could reflect that students are still exploring their academic and career interests. Additionally, I noticed that Health Sciences is consistently among the top choices across age groups, which might point to strong interest in healthcare professions in the region. Business and Engineering Science also appear frequently. One thing I wish I could have done is generate a separate plot for each age group, with each plot listing only that group's top programs.

Note: I used ChatGPT to help summarize the data and debug my plotting code.