CT Attendance Rates

Author

Joshua Shleifer

School Attendance in Connecticut

Describing the Topic and Dataset

The dataset I chose to analyze covers attendance for public school students, PK-12, by student group and by district during the 2019-2020, 2020-2021, and 2021-2022 school years. Students are grouped by disability status, free and reduced meal eligibility, race, homelessness, and whether or not they are considered high needs. The high needs group includes students who are English language learners, who receive special education, or who qualify for free and reduced lunch. Race is split into four categories: Black or African-American, White, Hispanic/Latino of any race, and all other races.

This dataset was sourced from data.gov and was published by the State of Connecticut. A few attendance values are missing. They have been suppressed either to protect student confidentiality or to ensure that statistics based on a small sample size are not interpreted as equally representative as those based on a sufficiently large sample size. I plan to explore the links between these groups and attendance rates, as well as overall attendance across these years, to see if attendance was significantly affected by the pandemic.

The Data

Loading the Data and Necessary Libraries

library(tidyverse)
setwd("C:/Users/Shea/Documents/data110/csvs")
Attendance <- read_csv("School_Attendance_by_Student_Group_and_District__2021-2022.csv")

Cleaning the Data by Removing Spaces and Capital Letters

names(Attendance) <- tolower(names(Attendance))
names(Attendance) <- gsub(" ","_",names(Attendance))
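To make the effect of these two calls concrete, here is a small sketch on a stand-alone character vector. The headers below are illustrative examples written in the style of this dataset, not a verified copy of the file's header row. Note that the last cleaned name is still long and contains hyphens, which is why the count and rate columns are renamed later on.

```r
# Illustrative headers only (assumed, not read from the real CSV).
raw_names <- c(
  "District name",
  "Student group",
  "2021-2022 student count - year to date"
)

# Same two steps as above: lowercase first, then spaces to underscores.
clean_names <- gsub(" ", "_", tolower(raw_names), fixed = TRUE)

print(clean_names)
```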

Creating a Separate Dataset of Just the Summaries

Normally I would have grouped the categories, created a new variable “attended” by multiplying the student count by the attendance rate, summed both across districts, and divided to get a statewide percentage exactly like the summary found at the top of the dataset. However, as the dataset already had that information, I saw no reason to reinvent the wheel.

AttSum <- filter(Attendance, district_name == "Connecticut")
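For reference, the manual approach described above could look roughly like the sketch below. It runs on toy data: the districts, counts, and rates are made up, and the column names `count`, `rate`, `attended`, and `weighted_rate` are placeholders rather than the dataset's real headers.

```r
library(dplyr)

# Toy data: hypothetical districts and numbers, not taken from the real file.
toy <- tibble(
  district_name = c("District A", "District B", "District A", "District B"),
  student_group = c("All Students", "All Students", "Homeless", "Homeless"),
  count = c(1000, 3000, 10, 30),
  rate  = c(0.90, 0.94, 0.80, 0.84)
)

toy_summary <- toy |>
  # Estimated number of students attending in each district/group row.
  mutate(attended = count * rate) |>
  group_by(student_group) |>
  # Count-weighted attendance rate across districts for each group.
  summarise(
    students = sum(count),
    weighted_rate = sum(attended) / sum(count)
  )

print(toy_summary)
```

Weighting by student count matters here: a plain average of the district rates would give a small district the same influence as a large one. On the real file, suppressed values would show up as `NA`, so those rows would likely need to be dropped or otherwise handled before summing.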

Renaming Variables

names(AttSum)[5] = "2021-2022_count"
names(AttSum)[6] = "2021-2022_rate"
names(AttSum)[7] = "2020-2021_count"
names(AttSum)[8] = "2020-2021_rate"
names(AttSum)[9] = "2019-2020_count"
names(AttSum)[10] = "2019-2020_rate"

Converting Wide to Long and Splitting Variable Names

AttSumLongRates <- AttSum[2:10] |>
  pivot_longer(
        cols = c(5,7,9),
        names_sep = "_",
        names_to = c("year","yearRate"), # yearRate is a throwaway variable. It contains the string "rate", which I don't need and have no use for
        values_to = "AttRate")
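As a possible alternative, the same reshaping can be written so that columns are selected by name instead of by position, and the throwaway piece of the name is discarded during the pivot. The sketch below uses toy data with made-up values, and only assumes that the rate columns end in "_rate" as in the renaming step above.

```r
library(dplyr)
library(tidyr)

# Toy wide table: values are invented for illustration only.
toy_wide <- tibble(
  student_group = c("All Students", "Homeless"),
  `2021-2022_count` = c(4000, 40),
  `2021-2022_rate`  = c(0.91, 0.85),
  `2020-2021_count` = c(4100, 38),
  `2020-2021_rate`  = c(0.93, 0.80)
)

toy_long <- toy_wide |>
  # Keep the group label and only the rate columns, chosen by name.
  select(student_group, ends_with("_rate")) |>
  pivot_longer(
    cols = ends_with("_rate"),
    names_sep = "_",
    # An NA in names_to drops that piece of the split name ("rate").
    names_to = c("year", NA),
    values_to = "AttRate"
  )

print(toy_long)
```

Selecting by name keeps working if the column order changes, at the cost of depending on the "_rate" suffix being applied consistently.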

The Plot

p1 <- AttSumLongRates |>
  ggplot() +
  geom_bar(aes(x=fct_inorder(student_group), y=AttRate, fill = year),
      position = "dodge", stat = "identity") +
  scale_fill_manual(values=c("#1619cc", "#26e059", "#d1249a")) +
  labs(fill = "School Year",
       x = "Student Group",
       y = "Attendance Rate",
       title = "Attendance by Group For 2019-2022",
       caption = "Source: data.gov, Published by State of Connecticut")
p1 + 
  coord_cartesian(ylim=c(0.7,1)) + #I limited the y axis so I could more easily see the differences between each of the rates. I found that without doing so the data all looked almost the same
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) # I'm not usually the biggest fan of vertical labels but with the categories being as long as they are I couldn't find a nicer way to display them

Processes and Final Thoughts

The Cleaning

This dataset required a fair amount of cleaning to make it useful. To make the process easy to follow, I’m going to describe the steps in order. First, I converted all of the column names to lowercase. Then I replaced the spaces in the column names with underscores so they would be easier to reference in code. After that, I took the statewide summary of the data so I could compare student groups rather than districts. Because the dataset already includes that summary in its first few rows, under the district name “Connecticut”, I used it rather than recomputing it myself. In the new dataset of just the summaries, I renamed the columns holding the student counts and attendance rates to make converting the data to long format easier. I then converted the data from wide to long, with the attendance rate columns supplying the values. During the conversion I split each column name at the underscore to drop the “_rate” at the end, leaving just the school year, which made labeling the legend easier. Finally, I used a grouped bar graph because I think that’s the cleanest way to compare the rates across groups and years.

Interesting Patterns

When graphing the data, I found that economic disadvantage only starts to correlate with decreased attendance past a certain point. Homeless students had the worst attendance, but those eligible for reduced meals actually had a higher attendance rate than average. Students with high needs also had a noticeably lower attendance rate than those without. However, because that category combines several variables that we don’t have individual information for, I don’t see how we can draw a meaningful conclusion from it.

With regard to attendance over the years, I expected 2019-2020 to have the highest attendance rate, as that year began before COVID-19, but I wasn’t sure what would happen between 2020-2021 and 2021-2022. The graph shows that attendance rates dropped in almost every category from 2019-2020 to 2020-2021. From 2020-2021 to 2021-2022, overall attendance went down again, but the changes varied much more from category to category. Most surprising to me was that students experiencing homelessness had a large increase in attendance and were one of only two categories to increase.
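Year-over-year comparisons like these can also be checked numerically rather than read off the graph. The sketch below is one way to do it, using toy data shaped like AttSumLongRates. The rates are invented for illustration and are not the real figures.

```r
library(dplyr)

# Toy long-format data: made-up rates, same column names as AttSumLongRates.
toy_long <- tibble(
  student_group = rep(c("All Students", "Homeless"), each = 3),
  year = rep(c("2019-2020", "2020-2021", "2021-2022"), times = 2),
  AttRate = c(0.95, 0.93, 0.91, 0.88, 0.80, 0.85)
)

changes <- toy_long |>
  # Sort so that lag() compares each year with the previous one.
  arrange(student_group, year) |>
  group_by(student_group) |>
  mutate(change = AttRate - lag(AttRate)) |>
  ungroup()

# Groups whose rate went up from 2020-2021 to 2021-2022.
increased <- changes |>
  filter(year == "2021-2022", change > 0) |>
  pull(student_group)

print(changes)
print(increased)
```

Sorting the year labels as text works here only because they all share the same "YYYY-YYYY" format.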

Unexplored Ideas

My biggest regret about the dataset is that it doesn’t break every group down in detail. I noted above that students with high needs had noticeably lower attendance than those without. I would’ve loved to explore that further and home in on which specific factor the lower attendance correlates with, but the group is too broad to do so, and the dataset doesn’t include all of the categories that make it up. I also would’ve liked the x-axis labels to look nicer, but I tried a few approaches and couldn’t find one that was better.