dbn_url <- "https://raw.githubusercontent.com/andrewbowen19/cunyDATA607/main/projects/final_project/2006_-_2012__Math_Test_Results__-_All_Students.csv"
tests <- read.csv(dbn_url)
head(tests)
## DBN Grade Year Demographic Number.Tested Mean.Scale.Score Num.Level.1
## 1 01M015 3 2006 All Students 39 667 2
## 2 01M015 3 2007 All Students 31 672 2
## 3 01M015 3 2008 All Students 37 668 0
## 4 01M015 3 2009 All Students 33 668 0
## 5 01M015 3 2010 All Students 26 677 6
## 6 01M015 3 2011 All Students 28 671 10
## Pct.Level.1 Num.Level.2 Pct.Level.2 Num.Level.3 Pct.Level.3 Num.Level.4
## 1 5.1 11 28.2 20 51.3 6
## 2 6.5 3 9.7 22 71 4
## 3 0 6 16.2 29 78.4 2
## 4 0 4 12.1 28 84.8 1
## 5 23.1 12 46.2 6 23.1 2
## 6 35.7 13 46.4 5 17.9 0
## Pct.Level.4 Num.Level.3.and.4 Pct.Level.3.and.4
## 1 15.4 26 66.7
## 2 12.9 26 83.9
## 3 5.4 31 83.8
## 4 3 29 87.9
## 5 7.7 8 30.8
## 6 0 5 17.9
Looking up school district GIS data from NYC Open Data. School point locations can be found here, we’ll download them and unzip the file to get the shp file
# Create temporary files
temp <- tempfile()
temp2 <- tempfile()
# Download the zip file and save to 'temp'
school_locs_url <- "https://data.cityofnewyork.us/download/jfju-ynrr/application%2Fzip"
download.file(school_locs_url, temp)
# Unzip the folder contents and then read in the shp file to a dataframe
unzip(zipfile = temp, exdir = temp2)
school_locs <- sf::read_sf(temp2)
# Cleaning and re-naming the ATS_CODE column to
school_locs <- school_locs %>%
dplyr::rename( "DBN"="ATS_CODE")
school_locs$DBN <- str_replace_all(school_locs$DBN, "\\s+", "")
# Projected CRS: NAD83
school_locs_sf <- st_as_sf(school_locs, coords=c("YCoord", "XCoord"), NAD83)
school_locs <- st_transform(school_locs_sf, crs = 4326 ) %>%
mutate( lat = st_coordinates(.)[,2],lon = st_coordinates(.)[,1])
Let’s join our school_locs dataframe to our
tests
# Adding in borough lookup as well
school_boro_lookup = data.frame(BORO=c("K", "M", "Q", "R", "X"),
borough=c("Brooklyn", "Manhattan", "Queens", "Staten Island", "Bronx")
)
school_data <- as.data.frame(left_join(tests, school_locs, by="DBN")) %>%
left_join(., school_boro_lookup, by="BORO") %>%
rename("year" = "Year") %>%
mutate(year = as.character(year))
Looking up air quality for school locations in New York’s geographical area.
Using this NYC Open Data Air Quality Dataset from their API. Going to call their API
Breaking down air quality data by borough. Air pollution varies
borough-to-borough in New York City so we’ll group our data that
way. We’ll need to add a borough column to our Air quality
dataset
# Getting data from API call
# https://data.cityofnewyork.us/Environment/Air-Quality/c3uy-2p5r
nyc_open_data_aq_url <- "https://data.cityofnewyork.us/resource/c3uy-2p5r.json"
nyc_aq <- as.data.frame(fromJSON(nyc_open_data_aq_url))
# Adding in borough lookup code to compare to school data
# Also adding year column to compare with testing data
borough_lookup <- data.frame(borough_code = c("1", "2", "3", "4", "5"),
borough=c("Bronx",
"Brooklyn",
"Manhattan",
"Queens",
"Staten Island"))
nyc_aq <- nyc_aq %>%
mutate(borough_code = str_extract(geo_join_id, "^\\d"),
year = str_extract(start_date, "^\\d{4}")) %>%
left_join(., borough_lookup, by="borough_code")
head(nyc_aq)
## unique_id indicator_id name measure measure_info
## 1 216498 386 Ozone (O3) Mean ppb
## 2 216499 386 Ozone (O3) Mean ppb
## 3 219969 386 Ozone (O3) Mean ppb
## 4 219970 386 Ozone (O3) Mean ppb
## 5 164876 383 Sulfur Dioxide (SO2) Mean ppb
## 6 164877 383 Sulfur Dioxide (SO2) Mean ppb
## geo_type_name geo_join_id geo_place_name time_period
## 1 CD 313 Coney Island (CD13) Summer 2013
## 2 CD 313 Coney Island (CD13) Summer 2014
## 3 Borough 1 Bronx Summer 2013
## 4 Borough 1 Bronx Summer 2014
## 5 CD 211 Morris Park and Bronxdale (CD11) Winter 2008-09
## 6 CD 212 Williamsbridge and Baychester (CD12) Winter 2008-09
## start_date data_value borough_code year borough
## 1 2013-06-01T00:00:00.000 34.64 3 2013 Manhattan
## 2 2014-06-01T00:00:00.000 33.22 3 2014 Manhattan
## 3 2013-06-01T00:00:00.000 31.25 1 2013 Bronx
## 4 2014-06-01T00:00:00.000 31.15 1 2014 Bronx
## 5 2008-12-01T00:00:00.000 5.89 2 2008 Brooklyn
## 6 2008-12-01T00:00:00.000 5.75 2 2008 Brooklyn
nyc_aq <- nyc_aq %>% pivot_wider(., names_from = name, values_from=data_value) %>%
rename("ozone" = "Ozone (O3)", "sulfur_o2" = "Sulfur Dioxide (SO2)") %>%
mutate(ozone = as.double(ozone), sulfur_o2 = as.double(sulfur_o2))
df <- inner_join(school_data, nyc_aq, by=c("borough", "year"))
Let’s get a sense of the distributions for our ozone and
sulfer_o2 variables
bin_width = 0.5
# Ozone distribution
ggplot(df, aes(x=ozone)) + geom_histogram(binwidth=bin_width)
## Warning: Removed 278654 rows containing non-finite values (stat_bin).
# SO2 distribution
ggplot(df, aes(x=sulfur_o2)) + geom_histogram(binwidth=bin_width)
## Warning: Removed 268035 rows containing non-finite values (stat_bin).
Making a choropleth map of each borough, colored by mean
ozone and sulfur_o2 value, with school
locations overlaid on a map of NYC.
## Warning: OGR support is provided by the sf and terra packages among others
## Warning: OGR support is provided by the sf and terra packages among others
## Warning: OGR support is provided by the sf and terra packages among others
## Warning: OGR support is provided by the sf and terra packages among others
## Warning: OGR support is provided by the sf and terra packages among others
## Warning: OGR support is provided by the sf and terra packages among others
Creating the same plot with the sulfer_o2 column
ggplot() +
geom_polygon(data=nyc_neighborhoods_df, aes(x=long, y=lat, group=group)) +
geom_point(data=df, aes(x = lon, y=lat, colour=sulfur_o2, fill=sulfur_o2))
Running an Analysis of Variance (ANOVA) hypothesis test to see if there’s a statistically significant different in ozone levels between boroughs. We’ll conduct this test with significance level (\(\alpha = 0.05\)) and the following hypotheses:
\(H_0\): There is no difference in mean ozone (O3) levels between NYC boroughs
\(H_a\): There is a difference in mean ozone (O3) levels between NYC boroughs
ozone_aov <- aov(ozone ~ borough, data = df )
summary(ozone_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## borough 1 67576 67576 6310 <2e-16 ***
## Residuals 9710 103984 11
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 278654 observations deleted due to missingness
Because our p-value is less than our significance level, we can reject the null hypothesis and claim that there is a statistically significant difference between mean ozone (O3) levels between NYC boroughs.
\(H_0\): There is no difference in mean sulfur dioxide (SO2) levels between NYC boroughs
\(H_a\): There is a difference in mean sulfur dioxide (SO2) levels between NYC boroughs
so2_aov <- aov(sulfur_o2 ~ borough, data = df )
summary(so2_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## borough 1 22844 22844 6608 <2e-16 ***
## Residuals 20329 70277 3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 268035 observations deleted due to missingness
Because our p-value is less than our significance level, we can reject the null hypothesis and claim that there is a statistically significant difference between mean sulfur dioxide (SO2) levels between NYC boroughs.
I’ll need to impute some values in order to make our matrix less sparse. Going to use the median of each distribution as our imputing value. Each distribution
# Coercing the mean scaled score, ozone, & sulfur o2 columns to an integer
df$mean_scaled_score <- as.numeric(df$Mean.Scale.Score)
## Warning: NAs introduced by coercion
df$ozone <- as.numeric(df$ozone)
df$sulfur_o2 <- as.numeric(df$sulfur_o2)
# Imputing missing values using the mice packages
df <- df %>%
mutate(mean_scaled_score = ifelse(is.na(mean_scaled_score), median(mean_scaled_score, na.rm = TRUE), mean_scaled_score)) %>%
mutate(ozone = ifelse(is.na(ozone), median(ozone, na.rm = TRUE), ozone)) %>%
mutate(sulfur_o2 = ifelse(is.na(sulfur_o2), median(sulfur_o2, na.rm = TRUE), sulfur_o2))
Let’s check scores by borough. First we’ll make a bar chart.
# Creating bar plot of mean score by borough
df %>%
group_by(borough) %>%
summarise(mean_scaled_score = mean(mean_scaled_score)) %>%
ggplot(., aes(x = borough, y = mean_scaled_score)) + geom_bar(stat='identity')
Next, let’s create a choropleth (as we did above) showing the
mean_scaled_score as a fill value for each school
library(viridis)
## Loading required package: viridisLite
# Plot scores by school loc
cp <- ggplot() +
geom_polygon(data=nyc_neighborhoods_df, aes(x=long, y=lat, group=group)) +
geom_point(data=df, aes(x = lon, y=lat, colour=mean_scaled_score))
# Add color scale
cp + scale_color_viridis() #scale_color_gradient(low="blue", high="red")
Now that we have our values imputed properly, we can use the built-in
cor.test function to check
cor.test(x = df$ozone, y = df$mean_scaled_score)
##
## Pearson's product-moment correlation
##
## data: df$ozone and df$mean_scaled_score
## t = -3.132, df = 288364, p-value = 0.001737
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.009481922 -0.002182447
## sample estimates:
## cor
## -0.005832262
cor.test(x = df$sulfur_o2, y = df$mean_scaled_score)
##
## Pearson's product-moment correlation
##
## data: df$sulfur_o2 and df$mean_scaled_score
## t = -12.323, df = 288364, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.02659001 -0.01929413
## sample estimates:
## cor
## -0.02294238
We see slightly negative correlations for both variables with our
mean_scaled_score variable. This implies a weakly
negative/no relation between air quality measures (such as ozone and
sulfur dioxide presence) and academic performance.
Some further work on this topic I’d like to incorporate - More granular air quality data - Try using different environmental factors (pollution, etc.) as inputs to our hypothesis testing - Updated academic data (our dataset only ran from 2006 - 2012) - Using academic data besides test scores. Tests are not always indiicative of a student’s academic skill. Grade point average could give a good sense of longer-term effects of environmental impacts on students’ performances
Overall, I was unable to prove significant correlation with academic performance given the datasets used. There are a multitude of factors that can influence a student’s academic performance (family life, socioeconomic status, race, etc.) that make it difficult to ascribe causal variables to something as broad as academic performance. In addition, the publicly-available datasets I could find that were usable were not completely up-to-date.