Data 110 project 2

Author

Ayan Elmi

(https://ftwtoday.6amcity.com/city-guide/live/school-districts-tarrant-county-tx)

Introduction

My data set is about school district in the United States surveyed in 2022-2023. According to the department of education, school districts are special-purpose governments and administrative units designed by state and local officials to provide public education for local residents. District boundaries are collected for NCES by the U.S. Census Bureau to develop demographic estimates and to support educational research and program administration.

My dataset has 13239 observations and 29 variables.

My interest:

The reason why I am investigating this data set is because I was born in Maryland, however I did not attend school here. I know that having more teachers can improve the quality of education because it allows students to receive more one on one attention on their studies. Thus, I wanted to compare the number of students and teachers in different school districts in Maryland and see if there is a correlation. Therefore, I wanted to compare it to the states surrounding Maryland.

Variables

member: the number of students which is quantitative. ttoch: The number of teachers quantitative variable. stuteratio: the student teacher ratio (student/teachers) is a quantitative variable. sy_status_text :operating staus of the school, for example, “1-open” means the school is currently open is a categorical variable. locale_text= is the school rural or urban quantitative variable. statename : Name of the state (categorical). coname: name of the county (categorical). lea_name: Local Education Agency (school district) categorical. lat: latitude (geographical coordinates). long:longitude (geographical coordinates).

The variables I will be focusing on is member (number of students), ttoch (number of teachers) and states.

Research question

My question is do the school districts in Maryland, Virginia and Pennsylvania have significantly different student teacher ratios?

source: https://catalog.data.gov/dataset/school-district-characteristics-current-4aa03 found on data.gov.

Loading libraries

library(tidyverse)
library (plotly)
library (leaflet)

Set working directory and checking the data set

setwd("~/Desktop/Data 110")
school_districts<- read_csv("School_District_Characteristics_-_Current.csv")

Checking the head and structure of my data set

head(school_districts)

# A tibble: 6 × 29
  OBJECTID SURVYEAR  STATENAME LEAID   ST_LEAID LEA_NAME LSTREET1 LSTREET2 LCITY
     <dbl> <chr>     <chr>     <chr>   <chr>    <chr>    <chr>    <chr>    <chr>
1        1 2022-2023 FLORIDA   1200180 FL-06    Broward… 600 SE … <NA>     FORT…
2        2 2022-2023 FLORIDA   1200240 FL-08    Charlot… 1445 ED… <NA>     PORT…
3        3 2022-2023 FLORIDA   1200330 FL-11    Collier… 5775 OS… <NA>     NAPL…
4        4 2022-2023 FLORIDA   1200390 FL-13    Dade Co… 1450 NE… <NA>     MIAMI
5        5 2022-2023 FLORIDA   1200420 FL-14    DeSoto … 530 LAS… <NA>     ARCA…
6        6 2022-2023 FLORIDA   1200660 FL-22    Glades … 400 10T… <NA>     MOOR…
# ℹ 20 more variables: LSTATE <chr>, LZIP <chr>, LZIP4 <chr>,
#   LEA_TYPE_TEXT <chr>, LEA_TYPE <dbl>, GSLO <chr>, GSHI <chr>,
#   SY_STATUS_TEXT <chr>, SCH <dbl>, MEMBER <dbl>, TOTTCH <dbl>,
#   STUTERATIO <dbl>, LOCALE_TEXT <chr>, CONAME <chr>, COID <chr>, PHONE <chr>,
#   Lat <dbl>, Long <dbl>, Shape__Area <dbl>, Shape__Length <dbl>

#str(school_districts) did the # for rendering purpose.

Cleaning the data

names(school_districts) <- tolower(names(school_districts)) #make all of the names lowercase instead of all capital letters"
names(school_districts) <- gsub("[(). \\-]", "_", names(school_districts)) #instead make the gaps have _
head(school_districts) #verifying how it looks

# A tibble: 6 × 29
  objectid survyear  statename leaid   st_leaid lea_name lstreet1 lstreet2 lcity
     <dbl> <chr>     <chr>     <chr>   <chr>    <chr>    <chr>    <chr>    <chr>
1        1 2022-2023 FLORIDA   1200180 FL-06    Broward… 600 SE … <NA>     FORT…
2        2 2022-2023 FLORIDA   1200240 FL-08    Charlot… 1445 ED… <NA>     PORT…
3        3 2022-2023 FLORIDA   1200330 FL-11    Collier… 5775 OS… <NA>     NAPL…
4        4 2022-2023 FLORIDA   1200390 FL-13    Dade Co… 1450 NE… <NA>     MIAMI
5        5 2022-2023 FLORIDA   1200420 FL-14    DeSoto … 530 LAS… <NA>     ARCA…
6        6 2022-2023 FLORIDA   1200660 FL-22    Glades … 400 10T… <NA>     MOOR…
# ℹ 20 more variables: lstate <chr>, lzip <chr>, lzip4 <chr>,
#   lea_type_text <chr>, lea_type <dbl>, gslo <chr>, gshi <chr>,
#   sy_status_text <chr>, sch <dbl>, member <dbl>, tottch <dbl>,
#   stuteratio <dbl>, locale_text <chr>, coname <chr>, coid <chr>, phone <chr>,
#   lat <dbl>, long <dbl>, shape__area <dbl>, shape__length <dbl>

checking for Na’s in the dataset

colSums((is.na(school_districts)))

      objectid       survyear      statename          leaid       st_leaid 
             0              0              0              0              0 
      lea_name       lstreet1       lstreet2          lcity         lstate 
             0              0          13051              0              0 
          lzip          lzip4  lea_type_text       lea_type           gslo 
             0           5018              0              0              0 
          gshi sy_status_text            sch         member         tottch 
             0              0              0              0              0 
    stuteratio    locale_text         coname           coid          phone 
             0              0              0              0              0 
           lat           long    shape__area  shape__length 
             0              0              0              0

Checking for what states are in the dataset

#unique(school_districts$statename) did the # for rendering

filtering my 3 states

chosen_states <- school_districts |>
  filter(statename %in% c("MARYLAND", "VIRGINIA", "PENNSYLVANIA"),
         sy_status_text == "1-Open",
         member >= 100 ) # I choose 100 as the minimum number of members because 100 students is too little.
head(chosen_states)

# A tibble: 6 × 29
  objectid survyear  statename leaid   st_leaid lea_name lstreet1 lstreet2 lcity
     <dbl> <chr>     <chr>     <chr>   <chr>    <chr>    <chr>    <chr>    <chr>
1     6403 2022-2023 VIRGINIA  5102220 VA-137   Lexingt… 300 Dia… <NA>     Lexi…
2     8460 2022-2023 MARYLAND  2400030 MD-01    Allegan… 108 Was… <NA>     Cumb…
3     8461 2022-2023 MARYLAND  2400060 MD-02    Anne Ar… 2644 Ri… <NA>     Anna…
4     8462 2022-2023 MARYLAND  2400090 MD-30    Baltimo… 200 Eas… <NA>     Balt…
5     8463 2022-2023 MARYLAND  2400120 MD-03    Baltimo… 6901 Ch… <NA>     Tows…
6     8464 2022-2023 MARYLAND  2400150 MD-04    Calvert… 1305 Da… <NA>     Prin…
# ℹ 20 more variables: lstate <chr>, lzip <chr>, lzip4 <chr>,
#   lea_type_text <chr>, lea_type <dbl>, gslo <chr>, gshi <chr>,
#   sy_status_text <chr>, sch <dbl>, member <dbl>, tottch <dbl>,
#   stuteratio <dbl>, locale_text <chr>, coname <chr>, coid <chr>, phone <chr>,
#   lat <dbl>, long <dbl>, shape__area <dbl>, shape__length <dbl>

What variables are included? (can any of them be removed?)

names(school_districts)

 [1] "objectid"       "survyear"       "statename"      "leaid"         
 [5] "st_leaid"       "lea_name"       "lstreet1"       "lstreet2"      
 [9] "lcity"          "lstate"         "lzip"           "lzip4"         
[13] "lea_type_text"  "lea_type"       "gslo"           "gshi"          
[17] "sy_status_text" "sch"            "member"         "tottch"        
[21] "stuteratio"     "locale_text"    "coname"         "coid"          
[25] "phone"          "lat"            "long"           "shape__area"   
[29] "shape__length"

Removing unused variables

states_filtered <-chosen_states |>
  select(-objectid,-survyear, -shape__area, -shape__length, -gshi, -coid, -lzip, -sch, -lstreet1, -lstreet2, -lcity, -lea_type_text, -gslo, -phone, -lzip4, -lstate,-st_leaid,-lea_type, -leaid,)

head(states_filtered)

# A tibble: 6 × 10
  statename lea_name  sy_status_text member tottch stuteratio locale_text coname
  <chr>     <chr>     <chr>           <dbl>  <dbl>      <dbl> <chr>       <chr> 
1 VIRGINIA  Lexingto… 1-Open            485   40.8       11.9 32-Town: D… Lexin…
2 MARYLAND  Allegany… 1-Open           8181  619.        13.2 13-City: S… Alleg…
3 MARYLAND  Anne Aru… 1-Open          84452 5930.        14.2 21-Suburb:… Anne …
4 MARYLAND  Baltimor… 1-Open          75995 5008.        15.2 11-City: L… Balti…
5 MARYLAND  Baltimor… 1-Open         111082 7652.        14.5 21-Suburb:… Balti…
6 MARYLAND  Calvert … 1-Open          15461 1000.        15.5 41-Rural: … Calve…
# ℹ 2 more variables: lat <dbl>, long <dbl>

Adding a columns that says if the district is either in a city, rural area,town or suburb

# I will use this in the map 
states_filtered2 <-states_filtered |>
  mutate(area_simple=case_when(locale_text %in% c("13-City: Small", "11-City: Large", "12-City: Mid-size") ~"city",
    locale_text %in% c( "21-Suburb: Large", "22-Suburb: Mid-size", "23-Suburb: Small" ) ~"suburb",
    locale_text %in% c("42-Rural: Distant","41-Rural: Fringe", "43-Rural: Remote") ~"rural",
    locale_text %in% c("33-Town: Remote", "31-Town: Fringe","32-Town: Distant" ) ~"town",
    TRUE ~ "others"
  ))

Visualizations

Exploration

Plot 1 without faceting and plotly (not final one to be graded)

options (scipen= 999) # used week 6 tutorial on scatterplot
plot_2<- ggplot (states_filtered2, aes(x=tottch, y=member, color=statename)) +
  geom_point(alpha=0.6) +
  scale_x_log10()+  # used week 6 tutorial on scatterplot
  scale_y_log10()+
  labs(x= "Number of teachers (log scale)", y= "Number of students (log scale)",
       title= "Number of Students vs Number of Teachers",
       caption= "Source: NCES", color= "State", )+
  scale_color_brewer(palette="Set1")+
theme_bw()+
  theme(plot.title = element_text(face="bold",size=14,hjust=0.5,family = "mono"))
#used thilonis' gis tutorial  for the theme and faceting 
plot_2

Final visualization without plotly added

options (scipen= 999) # used week 6 tutorial on scatterplot
plot_2<- ggplot (states_filtered2, aes(x=tottch, y=member, color=statename)) +
  geom_point(alpha=0.6) +
  scale_x_log10()+  # used week 6 tutorial on scatterplot
  scale_y_log10()+
  labs(x= "Number of teachers (log scale)", y= "Number of students (log scale)",
       title= "Number of Students vs Number of Teachers ",
       caption= "Source: NCES", color= "State", )+
   facet_wrap(~statename)+
  scale_color_brewer(palette="Set1")+
theme_bw()+
  theme(plot.title = element_text(face="bold",size=14,hjust=0.5,family = "mono"),
        axis.text.x = element_text(size = 9),
        axis.text.y = element_text(size = 9),
        legend.title = element_blank()) 
 
plot_2

Final visualization with plotly added

options (scipen= 999) # to remove significant figures
final_plot<- ggplot (states_filtered2, aes(x=tottch, y=member, color=statename, 
                                            text=paste("School District:",lea_name, "\n",
                                                       "State:", statename, "\n",
                                                       "County name:",coname, "\n",
                                                       "Student number:", member,"\n",
                                                      "Teacher number:", tottch,"\n", 
                                                      "Student teacher ratio:", round(stuteratio, 1) 
                                                      ))) +
  geom_point(alpha=0.6) +
  scale_x_log10()+
  scale_y_log10()+
  labs(x= "Number of teachers (log scale)", y= "Number of students (log scale)",
       title= "Number of Students vs Number of Teachers (log scale)",
       caption= "Source: NCES", color= "State", )+
   facet_wrap(~statename)+
  scale_color_brewer(palette="Set1")+
theme_bw()+
  theme(plot.title = element_text(face="bold",size=14,hjust=0.5,family = "mono"),
        axis.text.x = element_text(size = 9),
        axis.text.y = element_text(size = 9),
        legend.title = element_blank()) 

final_plot<- ggplotly(final_plot, tooltip="text") #I wanted to only have the tooltips I added so I used this cite https://r-graph-gallery.com/customize-plotly-tooltip.html to help me.
                                     
final_plot

Mapping Maryland Districts

filtering for only Maryland

maryland_district<-states_filtered2 |>
  filter(statename== "MARYLAND")

Getting the latitude and longitude

maryland_long <-  mean (maryland_district$long)
maryland_lat <- mean (maryland_district$lat)

Creating the color pallete for student-teacher ratio

coloring_stuteratio <-colorNumeric(palette = "RdYlGn",
                                   domain=maryland_district$stuteratio,
                                   reverse=TRUE)
##reverse is used to flip the colors making red the high student teacher ratio. I learned from the site below
#https://stackoverflow.com/questions/40276569/reverse-order-in-r-leaflet-continuous-legend use this site for the colors and idea of reverse
#and I used a bit of sajutee's Gis tutorial colornumeric idea and color palette

Creating the map

Without tooltip added (not final)

leaflet() |>
  setView(lng = maryland_long, lat = maryland_lat, zoom = 7.2) |>
  addProviderTiles("Esri.WorldStreetMap") |>
  addCircles(
    data = maryland_district,
    radius =(maryland_district$member)/10, #this is what worked best for the size.
    color = "#14010d",
    fillColor = ~coloring_stuteratio(stuteratio), #learned from the same link above.
    fillOpacity = 0.6)

Assuming "long" and "lat" are longitude and latitude, respectively

Creating tooltip

popup_maryland <- paste0(
       "<b>School District: </b>", maryland_district$lea_name, "<br>",
      "<b>County: </b>", maryland_district$coname, "<br>",
      "<b>Area type: </b>", maryland_district$area_simple, "<br>",
      "<b>Number of Students: </b>", maryland_district$member , "<br>",
      "<b>Number of Teachers: </b>", maryland_district$tottch , "<br>", 
      "<b>Student-Teacher ratio: </b>", round(maryland_district$stuteratio,1), "<br>") #rounding learnt in data 101

Final map

leaflet() |>
  setView(lng = maryland_long, lat = maryland_lat, zoom = 7.3) |>
  addProviderTiles("Esri.WorldStreetMap") |>
  addCircles(
    data = maryland_district,
    radius =(maryland_district$member)/10,
    color = "#14010d",
    fillColor =~coloring_stuteratio(stuteratio),
    fillOpacity = 0.6,
    popup=popup_maryland)

Assuming "long" and "lat" are longitude and latitude, respectively

Essay

How I cleaned the data set.

I cleaned the data set by first changing all my names of the variables to lowercase.This was done by first going into the data set and then making all of them lowercase with ‘tolower’ “names(school_district) <- tolower(names(school_district))”. Next, I checked for NA’s using ‘colSums’ and since there were no NA’s in my data set I did not have to filter anything. Lastly, I checked for what unique states were in the data set to see if the 3 states I wanted to investigate were there, and I filtered for Maryland,Virginia and Pennsylvania. Additionally, I filtered for open schools with over 100 members using ” filter(sy_status_text == “1-Open”, member >= 100 )“. Lastly, I removed selected columns that I was not using and I used”mutate(area_simple=case_when(locale_text)“to add a column that has the locale text as either city, suburb, rural and town.

What the visualizations illustrate:

My faceted scatterplot highlighted the strong linear correlation between the number of students and the number of teachers across school districts in Maryland, Virginia, and Pennsylvania. The patterns I found were a strong linear correlation in all three graphs. I used the log scale to remove the clustering of the small districts on the left side of the graph. I was surprised to see that all three states had similar trends, even though Pennsylvania had more districts that didn’t disrupt the staffing patterns or affect the correlation. Pennsylvania has the highest number of school districts of the three.

In the map, I focused on Maryland, the map shows Maryland school districts geographically, the size of the circle represents the number of students enrolled and the color represents the student teacher ratio. In my map, green is a lower ratio, which is better and red is a higher ratio which is worse.I noted that the clustering was around Annapolis(the capital), Baltimore and Washington. This also gives insight into Maryland’s population distribution, where the majority live in the clustered areas on the map.

The colors are predominantly yellow and orange, meaning moderate ratios. Frederick County Public Schools was the most red, indicating that it had a high student teacher (16.1:1) ratio, despite being in a suburban county which surprised me. Furthermore, Wicomico County Public Schools and Worcester County Public Schools had the lowest at 12.5:1 and 11.4:1 and are a town and city. This could suggest that, despite location, funding towards schooling or staffing numbers remain similar.

Additional improvements I thought about including but couldn’t:

I thought about initially including th entire DMV, however, District of Columbia only had one row which was too little information for me to investigate and would misconstrue the data presented. Also, if I could find more recent data on school districts in the US. This data set said current but it was done 2022-2023. Thus, I would like to see if 2024-2025 would show any difference in student-teacher ratio from 2022-2023 survey year.

citations:

https://ftwtoday.6amcity.com/city-guide/live/school-districts-tarrant-county-tx

https://stackoverflow.com/questions/40276569/reverse-order-in-r-leaflet-continuous-legend

https://r-graph-gallery.com/customize-plotly-tooltip.html -