Motivation

Datasources

Academic Performance Data

dbn_url <- "https://raw.githubusercontent.com/andrewbowen19/cunyDATA607/main/projects/final_project/2006_-_2012__Math_Test_Results__-_All_Students.csv"

tests <- read.csv(dbn_url)

head(tests)
##      DBN Grade Year  Demographic Number.Tested Mean.Scale.Score Num.Level.1
## 1 01M015     3 2006 All Students            39              667           2
## 2 01M015     3 2007 All Students            31              672           2
## 3 01M015     3 2008 All Students            37              668           0
## 4 01M015     3 2009 All Students            33              668           0
## 5 01M015     3 2010 All Students            26              677           6
## 6 01M015     3 2011 All Students            28              671          10
##   Pct.Level.1 Num.Level.2 Pct.Level.2 Num.Level.3 Pct.Level.3 Num.Level.4
## 1         5.1          11        28.2          20        51.3           6
## 2         6.5           3         9.7          22          71           4
## 3           0           6        16.2          29        78.4           2
## 4           0           4        12.1          28        84.8           1
## 5        23.1          12        46.2           6        23.1           2
## 6        35.7          13        46.4           5        17.9           0
##   Pct.Level.4 Num.Level.3.and.4 Pct.Level.3.and.4
## 1        15.4                26              66.7
## 2        12.9                26              83.9
## 3         5.4                31              83.8
## 4           3                29              87.9
## 5         7.7                 8              30.8
## 6           0                 5              17.9

Getting School Geospatial Data

Looking up school district GIS data from NYC Open Data. School point locations can be found here, we’ll download them and unzip the file to get the shp file

# Create temporary files
temp <- tempfile()
temp2 <- tempfile()

# Download the zip file and save to 'temp' 
school_locs_url <- "https://data.cityofnewyork.us/download/jfju-ynrr/application%2Fzip"
download.file(school_locs_url, temp)

# Unzip the folder contents and then read in the shp file to a dataframe
unzip(zipfile = temp, exdir = temp2)
school_locs <- sf::read_sf(temp2)

Cleaning School Location Data

# Cleaning and re-naming the ATS_CODE column to 
school_locs <- school_locs %>% 
                  dplyr::rename( "DBN"="ATS_CODE")
school_locs$DBN <- str_replace_all(school_locs$DBN, "\\s+", "")

# Projected CRS: NAD83
school_locs_sf <- st_as_sf(school_locs, coords=c("YCoord", "XCoord"), NAD83)

school_locs <- st_transform(school_locs_sf, crs = 4326 ) %>%
      mutate( lat = st_coordinates(.)[,2],lon = st_coordinates(.)[,1])

Joining & Cleaning Academic Data

Let’s join our school_locs dataframe to our tests

#  Adding in borough lookup as well
school_boro_lookup = data.frame(BORO=c("K", "M", "Q", "R", "X"),
                                borough=c("Brooklyn", "Manhattan", "Queens", "Staten Island", "Bronx")
                                )
school_data <- as.data.frame(left_join(tests, school_locs, by="DBN")) %>%
                            left_join(., school_boro_lookup, by="BORO") %>%
                            rename("year" = "Year") %>%
                            mutate(year = as.character(year))

Pulling Air Quality Data for New York City

Looking up air quality for school locations in New York’s geographical area.

Air quality data

Using this NYC Open Data Air Quality Dataset from their API. Going to call their API

Breaking down air quality data by borough. Air pollution varies borough-to-borough in New York City so we’ll group our data that way. We’ll need to add a borough column to our Air quality dataset

#  Getting data from API call
# https://data.cityofnewyork.us/Environment/Air-Quality/c3uy-2p5r
nyc_open_data_aq_url <- "https://data.cityofnewyork.us/resource/c3uy-2p5r.json"
nyc_aq <- as.data.frame(fromJSON(nyc_open_data_aq_url))

# Adding in borough lookup code to compare to school data
# Also adding year column to compare with testing data
borough_lookup <- data.frame(borough_code = c("1", "2", "3", "4", "5"),
                    borough=c("Bronx",
                    "Brooklyn",
                    "Manhattan",
                    "Queens",
                    "Staten Island"))
nyc_aq <- nyc_aq %>%
            mutate(borough_code = str_extract(geo_join_id, "^\\d"),
                   year = str_extract(start_date, "^\\d{4}")) %>%
            left_join(., borough_lookup, by="borough_code")


head(nyc_aq)
##   unique_id indicator_id                 name measure measure_info
## 1    216498          386           Ozone (O3)    Mean          ppb
## 2    216499          386           Ozone (O3)    Mean          ppb
## 3    219969          386           Ozone (O3)    Mean          ppb
## 4    219970          386           Ozone (O3)    Mean          ppb
## 5    164876          383 Sulfur Dioxide (SO2)    Mean          ppb
## 6    164877          383 Sulfur Dioxide (SO2)    Mean          ppb
##   geo_type_name geo_join_id                       geo_place_name    time_period
## 1            CD         313                  Coney Island (CD13)    Summer 2013
## 2            CD         313                  Coney Island (CD13)    Summer 2014
## 3       Borough           1                                Bronx    Summer 2013
## 4       Borough           1                                Bronx    Summer 2014
## 5            CD         211     Morris Park and Bronxdale (CD11) Winter 2008-09
## 6            CD         212 Williamsbridge and Baychester (CD12) Winter 2008-09
##                start_date data_value borough_code year   borough
## 1 2013-06-01T00:00:00.000      34.64            3 2013 Manhattan
## 2 2014-06-01T00:00:00.000      33.22            3 2014 Manhattan
## 3 2013-06-01T00:00:00.000      31.25            1 2013     Bronx
## 4 2014-06-01T00:00:00.000      31.15            1 2014     Bronx
## 5 2008-12-01T00:00:00.000       5.89            2 2008  Brooklyn
## 6 2008-12-01T00:00:00.000       5.75            2 2008  Brooklyn

Additional AQ Data Cleaning

nyc_aq <- nyc_aq %>% pivot_wider(., names_from = name, values_from=data_value) %>%
            rename("ozone" = "Ozone (O3)", "sulfur_o2" = "Sulfur Dioxide (SO2)") %>%
            mutate(ozone = as.double(ozone), sulfur_o2 = as.double(sulfur_o2))

Joining our Air Quality and Academic

df <- inner_join(school_data, nyc_aq, by=c("borough", "year"))

Data Analysis

Let’s get a sense of the distributions for our ozone and sulfer_o2 variables

bin_width = 0.5
# Ozone distribution
ggplot(df, aes(x=ozone)) + geom_histogram(binwidth=bin_width)
## Warning: Removed 278654 rows containing non-finite values (stat_bin).

# SO2 distribution
ggplot(df, aes(x=sulfur_o2)) + geom_histogram(binwidth=bin_width)
## Warning: Removed 268035 rows containing non-finite values (stat_bin).

Making a choropleth map of each borough, colored by mean ozone and sulfur_o2 value, with school locations overlaid on a map of NYC.

## Warning: OGR support is provided by the sf and terra packages among others
## Warning: OGR support is provided by the sf and terra packages among others
## Warning: OGR support is provided by the sf and terra packages among others
## Warning: OGR support is provided by the sf and terra packages among others
## Warning: OGR support is provided by the sf and terra packages among others
## Warning: OGR support is provided by the sf and terra packages among others

Creating the same plot with the sulfer_o2 column

ggplot() + 
  geom_polygon(data=nyc_neighborhoods_df, aes(x=long, y=lat, group=group)) + 
  geom_point(data=df, aes(x = lon, y=lat, colour=sulfur_o2, fill=sulfur_o2))

ANOVA (Ozone)

Running an Analysis of Variance (ANOVA) hypothesis test to see if there’s a statistically significant different in ozone levels between boroughs. We’ll conduct this test with significance level (\(\alpha = 0.05\)) and the following hypotheses:

ozone_aov <- aov(ozone ~ borough, data = df )
summary(ozone_aov)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## borough        1  67576   67576    6310 <2e-16 ***
## Residuals   9710 103984      11                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 278654 observations deleted due to missingness

Because our p-value is less than our significance level, we can reject the null hypothesis and claim that there is a statistically significant difference between mean ozone (O3) levels between NYC boroughs.

ANOVA (SO2)

so2_aov <- aov(sulfur_o2 ~ borough, data = df )
summary(so2_aov)
##                Df Sum Sq Mean Sq F value Pr(>F)    
## borough         1  22844   22844    6608 <2e-16 ***
## Residuals   20329  70277       3                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 268035 observations deleted due to missingness

Because our p-value is less than our significance level, we can reject the null hypothesis and claim that there is a statistically significant difference between mean sulfur dioxide (SO2) levels between NYC boroughs.

Testing Academic Performance against Air Quality

Data imputation

I’ll need to impute some values in order to make our matrix less sparse. Going to use the median of each distribution as our imputing value. Each distribution

# Coercing the mean scaled score, ozone, & sulfur o2 columns to an integer
df$mean_scaled_score <- as.numeric(df$Mean.Scale.Score)
## Warning: NAs introduced by coercion
df$ozone <- as.numeric(df$ozone)
df$sulfur_o2 <- as.numeric(df$sulfur_o2)

# Imputing missing values using the mice packages
df <- df %>%
          mutate(mean_scaled_score = ifelse(is.na(mean_scaled_score), median(mean_scaled_score, na.rm = TRUE), mean_scaled_score)) %>%
          mutate(ozone = ifelse(is.na(ozone), median(ozone, na.rm = TRUE), ozone)) %>%
          mutate(sulfur_o2 = ifelse(is.na(sulfur_o2), median(sulfur_o2, na.rm = TRUE), sulfur_o2))

Let’s check scores by borough. First we’ll make a bar chart.

# Creating bar plot of mean score by borough
df %>%
    group_by(borough) %>%
    summarise(mean_scaled_score = mean(mean_scaled_score)) %>%
    ggplot(., aes(x = borough, y = mean_scaled_score)) + geom_bar(stat='identity')

Next, let’s create a choropleth (as we did above) showing the mean_scaled_score as a fill value for each school

library(viridis)
## Loading required package: viridisLite
# Plot scores by school loc
cp <- ggplot() + 
  geom_polygon(data=nyc_neighborhoods_df, aes(x=long, y=lat, group=group)) + 
  geom_point(data=df, aes(x = lon, y=lat, colour=mean_scaled_score)) 
# Add color scale
cp + scale_color_viridis() #scale_color_gradient(low="blue", high="red")

Now that we have our values imputed properly, we can use the built-in cor.test function to check

cor.test(x = df$ozone, y = df$mean_scaled_score)
## 
##  Pearson's product-moment correlation
## 
## data:  df$ozone and df$mean_scaled_score
## t = -3.132, df = 288364, p-value = 0.001737
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.009481922 -0.002182447
## sample estimates:
##          cor 
## -0.005832262
cor.test(x = df$sulfur_o2, y = df$mean_scaled_score)
## 
##  Pearson's product-moment correlation
## 
## data:  df$sulfur_o2 and df$mean_scaled_score
## t = -12.323, df = 288364, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.02659001 -0.01929413
## sample estimates:
##         cor 
## -0.02294238

We see slightly negative correlations for both variables with our mean_scaled_score variable. This implies a weakly negative/no relation between air quality measures (such as ozone and sulfur dioxide presence) and academic performance.

Further Work

Some further work on this topic I’d like to incorporate - More granular air quality data - Try using different environmental factors (pollution, etc.) as inputs to our hypothesis testing - Updated academic data (our dataset only ran from 2006 - 2012) - Using academic data besides test scores. Tests are not always indiicative of a student’s academic skill. Grade point average could give a good sense of longer-term effects of environmental impacts on students’ performances

Conclusion

Overall, I was unable to prove significant correlation with academic performance given the datasets used. There are a multitude of factors that can influence a student’s academic performance (family life, socioeconomic status, race, etc.) that make it difficult to ascribe causal variables to something as broad as academic performance. In addition, the publicly-available datasets I could find that were usable were not completely up-to-date.