Project 2 - MCPS High School Academic Analysis

Author

Alex Lopez

Introduction

My datasets used consists of three datasets: all_schools (information on every MCPS school, including long/lat), merged_hs (a merged dataset that holds various quantitative data), and hs_areas (polygons that will be used in a map). All data is from OpenDataMCPS for their school addresses and coordinates, and their MCPS “School Profiles” page that provide exact numbers on performance.

https://data.montgomeryschoolsmd.org/ https://www.montgomeryschoolsmd.org/school-profiles/

In my main dataset, merged_hs, I chose to focus on two factors to determine academic success in high schools: Attendance Rate and Graduation Rate, and whether those two are positively correlated. I also wanted to visualize if a schools attendance/graduation % is below or above the county’s average rate. I filtered to focus on high schools, changed categorical data to numeric, and created a new column, “what_symbol”, as an indicator for symbols on a graph but also indicating whether a school is above the county % in attendance/graduation rate, only has 1 of them satisfying the threshold, or both are below the both county % rates. This indicator would be useful in coloring my highcharter graph and placing each school in a quadrant series, and the same colors are used in my map where the school’s polygon of their cluster area is also colored the same way for consistency. I had to clean my data up further when I wanted to include polygon and geom area into my map, and filter out school names manually so when I joined them, I could have a new dataset that includes the polygons and the indicators to color them accordingly.

I chose this topic and dataset because of education has always been important to me, and I’m aware of certain issues in this county and various socioeconomic factors that can affect this schools system in different areas. I wanted to put color and thought into actually visualizing this rather than just knowing from experience, so my graph and map are a starter for that topic and can build up to include certain other factors that lead to lower/higher attendance rate, the population of a school, its resources, grades, etc.

1: Loading in Libraries and Datasets

setwd("/Applications/DATA110")
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(leaflet)
library(highcharter)

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

library(ggpubr)

Warning: package 'ggpubr' was built under R version 4.3.3

Loading required package: ggplot2

Warning: package 'ggplot2' was built under R version 4.3.3

library(sf)

Warning: package 'sf' was built under R version 4.3.3

Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE

library(stringr)

school <- readr::read_csv("all_schools.csv")

Rows: 200 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): CATEGORY, SCHOOL NAME, ADDRESS, CITY, PHONE, URL, LOCATION
dbl (3): ZIP CODE, LONGITUDE, LATITUDE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

merged_hs <- readr::read_csv("merged_hs.csv")

Rows: 25 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (17): SCHOOL.NAME, CATEGORY, ADDRESS, CITY, PHONE, URL, LOCATION, X, AM,...
dbl (15): ZIP.CODE, LONGITUDE, LATITUDE, Total, Female, Male, HI, FARMS, Spe...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

hs_areas <- readr::read_csv("hs_serviceareas.csv")

Rows: 25 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): the_geom, SchoolName
num (2): Shape_Leng, Shape_Area

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2: Cleaning up both datasets, merging, and more cleaning

# Giving the categories I'm using cleaner names
merged_hs <- merged_hs %>%
  rename(
    School_Name = SCHOOL.NAME,
    Attendance_Rate = Attendance.Rate.,
    Graduation_Rate = Graduation.Rate.
  )

# Cleaning specific data and creating columns to be suitable to graph, categorical to numerial from datasets, filtering for only High Schools, assigning future indicators by symbols

#Caution: some are redundant because my original file changed category names out of nowhere but below symbolizes some parts of merged_hs cleaning
school <- school %>%
rename(SCHOOLNAME = `SCHOOL NAME`)

merged_hs$Graduation_Rate <- gsub("≥95.0",  96, merged_hs$Graduation_Rate)
merged_hs <- merged_hs %>%
  filter(CATEGORY == "HIGH SCHOOLS") %>%
  mutate(Graduation_Rate = as.numeric(Graduation_Rate)) %>%
  mutate(Total = as.numeric(Total)) %>%
  mutate(what_symbol = case_when(
    merged_hs$Attendance_Rate >= 89.9 & merged_hs$Graduation_Rate >= 92.0 ~ "diamond",
    merged_hs$Attendance_Rate <= 89.9 & merged_hs$Graduation_Rate <= 92.0 ~ "triangle-down",
    TRUE ~ "triangle"
  ))

Data Plot 1: Highcharter

highchart() |>
  # Four quadrants to highlight a high school's academic performance manually by the data (x,y)
  hc_add_series(data = list(list(89.9,92),
                            list(100,92),
                            list(100,100),
                            list(89.9,100)),
                type = "polygon", 
                name = "<b>Quadrant I</b>", color = "rgba(154,187,127,0.15)",
                enableMouseTracking = FALSE)|>
  hc_add_series(data = list(list(80,92),
                            list(89.9,92),
                            list(89.9,100),
                            list(80,100)),
                type = "polygon", 
                name = "<b>Quadrant II</b>", color = "rgba(243,204,144,0.15)",
                enableMouseTracking = FALSE)|>
  hc_add_series(data = list(list(80,70),
                            list(89.9,70),
                            list(89.9,92),
                            list(80,92)),
                type = "polygon", 
                name = "<b>Quadrant III</b>", color = "rgba(212,143,140,0.15)",
                enableMouseTracking = FALSE)|>
  hc_add_series(data = list(list(89.9,70),
                            list(100,70),
                            list(100,92),
                            list(89.9,92)),
                type = "polygon", 
                name = "<b>Quadrant IV</b>", color = "rgba(243,204,144,0.15)",
                enableMouseTracking = FALSE)|>
  
  # Three series depending on their academic performance quadrant
  hc_add_series(data = merged_hs %>% 
                filter(what_symbol == "diamond"),
                type = "scatter", 
                name = "Above Attendance & Graduation Rate",
                hcaes(x = Attendance_Rate,
                      y = Graduation_Rate),
                marker = list(symbol = "diamond", radius = 7.5))|>
  hc_add_series(data = merged_hs %>% 
                filter(what_symbol == "triangle"),
                type = "scatter", 
                name = "Below in Either Attendance/Graduation Rate",
                hcaes(x = Attendance_Rate,
                      y = Graduation_Rate),
                marker = list(symbol = "triangle", radius = 6))|>
  hc_add_series(data = merged_hs %>%
                filter(what_symbol == "triangle-down"),
                type = "scatter", 
                name = "Below Attendance/Graduation Rate",
                hcaes(x = Attendance_Rate,
                      y = Graduation_Rate),
                marker = list(symbol = "triangle-down", radius = 6))|>
  
  # Title
  
  hc_title(text = "<b>At a Glance Academic Performance Analysis in MCPS High Schools (2023-2024 FY)</b>") |>
  
  # X and Y axis, adjusting to fit the graph
  # Plotting the countywide rates as thresholds
  
  hc_xAxis(min = 80, max = 100,
    plotLines = list(
      list(
        value = 89.9, color = "#cccccc", width = 2, dashStyle = "longdashdot", label = list(text = "<i>MCPS Attendance <b>(89.9%)</b></i>", style = list(fontSize = "10px"), verticalAlign = "middle", y = 110, align = "right"))),
    title = list(text = "Attendance Rate (%)")) |>
  hc_yAxis(min = 75, max = 100,
    plotLines = list(
      list(
        value = 92, color = "#cccccc", width = 2, dashStyle = "longdashdot", label = list(text = "<i>MCPS Graduation <b>(92%)</b></i>", style = list(fontSize = "10px")))),
    title = list(text = "Graduation Rate (%)")) |>
  
  #Highcharter + Quadrants + Tooltip doubles as a legend 
  hc_legend(enabled = FALSE) |>
  
  hc_caption(text = "Source: OpenDataMCPS") |>
  
  # Personalized color palette to signify an upward trend red/yellow/green
  hc_colorAxis(
    stops = color_stops(colors = c("#c14850", "#f1d46c", "#8a9b6e"))
    ) |>

  hc_tooltip(
    pointFormat = "<b>{point.School_Name}</b><br>
    Student Total: {point.Total}<br>
    Attendance Rate (%): {point.Attendance_Rate}<br>
    Graduation Rate (%): {point.Graduation_Rate}"
  )

My highcharter map is plotted on the two factors I explored in MCPS high schools, attendance and graduation rates. Knowing my data and wanting to analyze the factors based on the county rate, I plotted a vertical and horizontal line to act as the county thresholds, and I added series of polygons before the scatter plot to put some color into the background of where the individual schools would land, fit through the hc x/y axis borders.

Red = School is below the Attendance & the Graduation Rate Yellow = School is below the Attendance OR the Graduation Rate Green = School is above the Attendance & the Graduation Rate

As a scatter plot, each point is represented by a high school, and not only can someone observe if it fits in one of these categories, my what_symbol column also filtered to add a specific symbol to its specific quadrant that would only fit with other schools in its same category.

Triangle Down - School is below the Attendance & the Graduation Rate Normal Triangle - School is below the Attendance OR the Graduation Rate Diamond - School is above the Attendance & the Graduation Rate

I chose these three symbols because they progress in shape, a triangle is only part of a diamond, and a diamond shape would represent two triangles/a full shape that means a school is above the attendance/graduation rates. All in all, the background colors and the scatter plot symbols are able to indicate a school’s academic analysis on the two rates I evaluated, and the tooltip, when hovered over a point, gives the school name, its student population, and the exact percentage stats of said school.

I ran into a ton of trial and error in this highcharter chart, and had to use various external websites to familiarize myself with highcharter and explore more of its functions. I learned about putting the polygon areas in first before the scatter plots (a lofty idea I really wanted to include), looked at JavaScript syntax and understood how its brackets translated into lists in R, moving axis labels around, gradient code in highcharter and how it differed from ggplot, a legend that would not work (that I had to remove). One school (Wootton) has the same rates as Churchill and I could not figure out how to move it across highcharter – something for the future. However the struggles aside I’m satisfied with what I learned and hoped to visualize especially with color coding.

3: Map Preparation + More Cleaning

#Popup Info for every school
popup <- paste0(
"<b>School Name: </b>", school$SCHOOLNAME, "<br>",
"<b>Category: </b>", school$CATEGORY, "<br>",
"<b>City: </b>", school$CITY, "<br>",
"<b>Address: </b>", school$ADDRESS, "<br>"
)

# Color by school level
colors <- colorFactor(palette = c("#d1c2ee","#a467ba","#5a2055"),
  levels = c("ELEMENTARY SCHOOLS", "MIDDLE SCHOOLS", "HIGH SCHOOLS"))

# The Beginning of the end - adding the Polygon data and cleaning it all

# Use of AI to assist through lines 161 + 166 + 181:188 (OpenAI)
# Using polygon data to map out clusters

hs_polygons <- st_as_sf(hs_areas, wkt = "the_geom")

#Joining polygon data + the symbol/academic indicators datasets to properly insert into the map

new_hs_polygons <- hs_polygons %>%
  mutate(SchoolName = sub("HS", "", hs_polygons$SchoolName)) %>%
  rename(School_Name = SchoolName)

# Manually changing names by cell because a proper function was frustrating
indicator <- merged_hs[, c("School_Name", names(merged_hs)[33])] 
new_hs_polygons[4, "School_Name"] <- "Thomas S Wootton"
new_hs_polygons[8, "School_Name"] <- "James Hubert Blake"
new_hs_polygons[11, "School_Name"] <- "Walt Whitman"
new_hs_polygons[13, "School_Name"] <- "James Hubert Blake"
new_hs_polygons[15, "School_Name"] <- "Albert Einstein"
new_hs_polygons[18, "School_Name"] <- "John F Kennedy"
new_hs_polygons[23, "School_Name"] <- "Winston Churchill"
new_hs_polygons[25, "School_Name"] <- "Montgomery Blair"

# Invisible spaces were an issue
new_hs_polygons$School_Name <- str_squish(new_hs_polygons$School_Name)
indicator$School_Name <- str_squish(indicator$School_Name)

merged_polygons <- left_join(
  new_hs_polygons,
  indicator[, c("School_Name", "what_symbol")],
  by = "School_Name"
)

# Same color palette resemblance from the prior highcharter chart, assigns same colors from gradients depending on symbol/academic performance

pal <- colorFactor(
  palette = c("#8a9b6e", "#f1d46c", "#c14850"), domain = merged_polygons$what_symbol)

Data Plot 2: Map

# https://r-charts.com/spatial/interactive-maps-leaflet/

# Adding:
# - Points for schools, varying sizes (school)
# - Cluster polygons for HS clusters (merged_polygons) + indicating colors

leaflet(school) |>
  setView(lng = -77.2, lat = 39.1, zoom = 10) |>
addProviderTiles("CartoDB.Positron") |>
addPolygons(data = merged_polygons, color = "#340e28", weight = 0.5, fillColor = ~pal(what_symbol), opacity = 0.5) |>
addCircleMarkers(
  data = school,
  radius = case_when(
    school$CATEGORY == "HIGH SCHOOLS" ~ 7,
    school$CATEGORY == "MIDDLE SCHOOLS" ~ 2,
    school$CATEGORY == "ELEMENTARY SCHOOLS" ~ 1
    ),
  fillColor = ~colors(CATEGORY),
  fillOpacity = 0.8,
  color = "#340e28",
  weight = 2,
  popup = popup
  )

Assuming "LONGITUDE" and "LATITUDE" are longitude and latitude, respectively

My map is zoomed in on Montgomery County, where I used a merged dataset called merged_polygons to combine polygon data with a high school name and its symbol from my highcharter graph (that the symbol’s background quadrant will turn into the fillColor on this map accordingly). I also plotted every MCPS school, elementary, middle, and high. High schools are a much larger radius because of their focus on my project, but I chose to include the elementary/middle as they all feed into the high schools and to begin to understand school density. Individual school colors/size dont matter aside from having them for show – my real analysis is from the high school’s cluster maps.

Based on my map, Montgomery County high schools that are further west and south are colored in green/(diamond shape), to indicate that these high schools are above the attendance/graduation rates % of the county. Yellow clusters only satisfy one of the thresholds (northern and eastern Montgomery County), and red clusters in central and southeastern Montgomery County are below both attendance/graduation rates %. There is a pattern, or at least some adjacent correlation in clusters’ attendance and graduation data. Visually, lesser feeder schools (elementary and middle) could contribute to this correlation, but it will have to be another topic to explore in the future.

I started out with the map in my project because of my immediate interest towards MCPS data, but coloring in using polygons and R instead of Tableau proved to be the toughest part of this project and the most frustrating. In every database I downloaded, almost every category’s syntax was different and I had to constantly edit and filter, even manually changing school names so they could join without any N/As. The data itself even suffered from invisible spaces which I finally realized after various trial and error. Out of frustration I had to use some AI in this aspect to learn of sf to properly add polygon data into R and clean up my data until it finally was proven useful in merged_polygons. Thankfully, that was it and it gave me the one result I needed – coloring polygon data with my initial exploration on attendance/graduation rates. Lastly I wish I included the exact percentages into the school popups too, but it would have had to include elementary/middle schools as well, and I hopefully estimate that a viewer is able to color associate the three color indicators with the above/below the county’s attendance and graduation rates from my prior graph too.