This code-through showcases several R libraries, including
ggplot2, dplyr, knitr (for its kable function),
kableExtra, scales, and plotly,
to analyze and visualize air quality surveillance data from New York
City. Because exposure to common air pollutants is linked to
various health issues, understanding how air quality varies across NYC
neighborhoods is crucial.
This code-through tutorial covers several aspects of data
manipulation and visualization in R. Throughout the semester, we have
been working with census data and choropleth maps, unraveling spatial
trends as part of our coursework. Recently, I came across the
plotly library, which is well suited to building plots
from geographical data. Captivated by its potential, I’ve chosen to
use coordinates and geographical data for this code-through
assignment as well. The assignment shows how to combine
geographical data with the plotting capabilities of plotly
to create scatter plots on maps in R for spatial
data analysis.
It also covers the versatile features of the kableExtra
package, with a specific focus on kable_styling to elevate
table formatting. Additionally, it provides a detailed exploration of
ggplot2, demonstrating its adaptability through diverse
plotting methods and visualizations. Furthermore, the code-through
examines the robust data manipulation functionalities offered by the
dplyr package, highlighting essential functions like
filter, group_by, reframe, and
arrange.
In general, this code-through assignment aims to provide sample usages of indispensable tools and techniques for data analysis and visualization in R, through clear explanations and practical examples.
First, we need to import and set up the data. The required packages
are loaded in the code below. The data is sourced from https://data.cityofnewyork.us/Environment/Air-Quality/c3uy-2p5r/about_data.
The fromJSON function from the jsonlite library is used to import
the JSON data into the data frame dat.
# SET GLOBAL KNITR OPTIONS
knitr::opts_chunk$set(echo = TRUE,
                      message = FALSE,
                      warning = FALSE,
                      fig.width = 10,
                      fig.height = 8)
# LOAD PACKAGES
library(pander)
library(kableExtra)
library(dplyr)
library(ggplot2)
library(knitr)
library(jsonlite)
library(scales)
library(plotly)
# READ IN DATA
url <- "https://data.cityofnewyork.us/resource/c3uy-2p5r.json"
dat <- fromJSON(url)
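Later steps in this tutorial wrap data_value in as.numeric, which suggests the endpoint delivers numbers as quoted strings. The sketch below illustrates that behavior on a small inline JSON sample rather than the live API. The two records are copied from the table shown in this tutorial, and the names sample_json and sample_dat are hypothetical.

```r
library(jsonlite)

# Inline sample standing in for the API response
# (assumption: numeric fields arrive as quoted strings)
sample_json <- '[
  {"name": "Ozone (O3)", "geo_place_name": "Willowbrook", "data_value": "34.8"},
  {"name": "Fine particles (PM 2.5)", "geo_place_name": "Brooklyn", "data_value": "6.3"}
]'

# An array of JSON objects is simplified into a data frame
sample_dat <- fromJSON(sample_json)

# Quoted numbers are parsed as character, not numeric
class(sample_dat$data_value)

# Convert once, up front, so later sorting and arithmetic behave numerically
sample_dat$data_value <- as.numeric(sample_dat$data_value)
mean(sample_dat$data_value)
```

Converting the column right after import is one option; the tutorial instead converts at the point of use.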
The example below uses the kableExtra library to apply the
kable_styling function. It displays the first 10 rows of the data, and
the hover option highlights each row as the cursor moves over it.
kable_table <- dat %>% slice(1:10)
kable(kable_table) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"))
| unique_id | indicator_id | name | measure | measure_info | geo_type_name | geo_join_id | geo_place_name | time_period | start_date | data_value |
|---|---|---|---|---|---|---|---|---|---|---|
| 825967 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | UHF34 | 104 | Pelham - Throgs Neck | Summer 2022 | 2022-06-01T00:00:00.000 | 12.0 |
| 823492 | 365 | Fine particles (PM 2.5) | Mean | mcg/m3 | CD | 307 | Sunset Park (CD7) | Summer 2022 | 2022-06-01T00:00:00.000 | 6.7 |
| 827012 | 386 | Ozone (O3) | Mean | ppb | CD | 313 | Coney Island (CD13) | Summer 2022 | 2022-06-01T00:00:00.000 | 37.7 |
| 827081 | 386 | Ozone (O3) | Mean | ppb | UHF34 | 103 | Fordham - Bronx Pk | Summer 2022 | 2022-06-01T00:00:00.000 | 31.7 |
| 827103 | 386 | Ozone (O3) | Mean | ppb | UHF42 | 503 | Willowbrook | Summer 2022 | 2022-06-01T00:00:00.000 | 34.8 |
| 823211 | 365 | Fine particles (PM 2.5) | Mean | mcg/m3 | CD | 105 | Midtown (CD5) | Summer 2022 | 2022-06-01T00:00:00.000 | 8.7 |
| 823241 | 365 | Fine particles (PM 2.5) | Mean | mcg/m3 | UHF42 | 401 | Long Island City - Astoria | Summer 2022 | 2022-06-01T00:00:00.000 | 7.2 |
| 825903 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | UHF34 | 303 | East Harlem | Summer 2022 | 2022-06-01T00:00:00.000 | 13.0 |
| 823337 | 365 | Fine particles (PM 2.5) | Mean | mcg/m3 | Borough | 2 | Brooklyn | Summer 2022 | 2022-06-01T00:00:00.000 | 6.3 |
| 827065 | 386 | Ozone (O3) | Mean | ppb | UHF34 | 304 | Upper West Side | Summer 2022 | 2022-06-01T00:00:00.000 | 29.9 |
# sum data_value (converted to numeric) for each pollutant name
name_counts <- aggregate(data_value ~ name, data = dat, FUN = function(x) sum(as.numeric(x)))
# calculate each pollutant's percentage share of the overall total
name_counts$percentage <- (name_counts$data_value / sum(name_counts$data_value)) * 100
# create pie chart with percentages
ggplot(name_counts, aes(x = "", y = data_value, fill = name)) +
  geom_bar(stat = "identity", width = 1, color = "black", size = 0.3) +
  geom_text(aes(label = percent(percentage / 100), y = data_value + 0.5),
            position = position_stack(vjust = 0.5), size = 5, color = "white", fontface = "bold") +
  coord_polar("y", start = 0) +
  labs(title = "Distribution of pollutants affecting Air Quality in NYC",
       fill = "Types of Pollutants",
       x = NULL, y = NULL) +
  theme_void() +
  theme(legend.position = "right",
        plot.title = element_text(size = 20, face = "bold", hjust = 0.5),
        legend.text = element_text(size = 12),
        legend.title = element_text(size = 14),
        legend.key.size = unit(1.5, "lines"))
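The percentage step can be checked on a small hand-made data frame. This sketch uses hypothetical toy values rather than the NYC data, and the names toy and shares are made up for illustration. It computes both the summed data_value per pollutant, which is what the pie chart plots, and the number of records per pollutant, since the two give different shares.

```r
library(dplyr)

# Toy data (hypothetical values, stored as text like the imported data)
toy <- data.frame(
  name = c("Ozone (O3)", "Ozone (O3)", "Nitrogen dioxide (NO2)", "Fine particles (PM 2.5)"),
  data_value = c("30", "34", "16", "20"),
  stringsAsFactors = FALSE
)

shares <- toy %>%
  mutate(data_value = as.numeric(data_value)) %>%
  group_by(name) %>%
  # one row per pollutant: summed value and number of records
  summarise(total_value = sum(data_value), n_records = n()) %>%
  # shares of the overall total, by value and by record count
  mutate(pct_of_value = 100 * total_value / sum(total_value),
         pct_of_records = 100 * n_records / sum(n_records))

shares
```

Which share is appropriate depends on the question being asked; note also that the pollutants in the real data are measured in different units (ppb and mcg/m3), so summed values are not directly comparable across pollutants.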
In this example, the data frame is filtered to include only rows
where geo_type_name is Borough. Then, the
data is grouped by several columns: unique_id,
name, data_value, and
geo_type_name, which are renamed inside group_by. The column
geo_place_name is returned under the new name Location_Name by the reframe
function. Finally, the data is arranged in descending order based on
Location_Name.
The split function divides data_borough_summary into
separate tables based on the values of Location_Name
(borough names). Each resulting table contains data specific to a
particular borough. Because split orders its groups alphabetically, the
tables run from Bronx to Staten Island regardless of the earlier sort.
A loop iterates over each borough. For each borough, a header
indicating the borough name is printed using cat. Then, the
corresponding table (stored in borough_tables[[borough]])
is printed using kable.
The result is a series of tables, each preceded by a header naming the borough and each listing the air quality data values for that borough, which makes the values easier to compare and interpret borough by borough.
# filter data for boroughs and summarize (reframe) data values
data_borough_summary <- dat %>%
  filter(geo_type_name == "Borough") %>%
  group_by(ID = unique_id, Name_Of_Pollutants = name, Data_Value = data_value, Type_Of_Location = geo_type_name) %>%
  reframe(Location_Name = geo_place_name) %>%
  arrange(desc(Location_Name))
# create separate tables for each borough
borough_tables <- split(data_borough_summary, f = data_borough_summary$Location_Name)
# print each table using kable
for (borough in names(borough_tables)) {
  cat(paste("**", borough, "**", "\n"))
  # format is passed to kable(), which builds the table; print() only displays it
  print(kable(borough_tables[[borough]], format = "markdown"))
  cat("\n\n")
}
## ** Bronx **
##
##
## |ID |Name_Of_Pollutants |Data_Value |Type_Of_Location |Location_Name |
## |:------|:-----------------------|:----------|:----------------|:-------------|
## |823339 |Fine particles (PM 2.5) |6.1 |Borough |Bronx |
## |823340 |Fine particles (PM 2.5) |7.1 |Borough |Bronx |
## |823341 |Fine particles (PM 2.5) |7.3 |Borough |Bronx |
## |825810 |Nitrogen dioxide (NO2) |16.0 |Borough |Bronx |
## |825811 |Nitrogen dioxide (NO2) |12.2 |Borough |Bronx |
## |825812 |Nitrogen dioxide (NO2) |21.8 |Borough |Bronx |
## |827148 |Ozone (O3) |32.2 |Borough |Bronx |
##
##
## ** Brooklyn **
##
##
## |ID |Name_Of_Pollutants |Data_Value |Type_Of_Location |Location_Name |
## |:------|:-----------------------|:----------|:----------------|:-------------|
## |823336 |Fine particles (PM 2.5) |5.8 |Borough |Brooklyn |
## |823337 |Fine particles (PM 2.5) |6.3 |Borough |Brooklyn |
## |823338 |Fine particles (PM 2.5) |6.9 |Borough |Brooklyn |
## |825807 |Nitrogen dioxide (NO2) |15.4 |Borough |Brooklyn |
## |825808 |Nitrogen dioxide (NO2) |10.7 |Borough |Brooklyn |
## |825809 |Nitrogen dioxide (NO2) |21.2 |Borough |Brooklyn |
## |827147 |Ozone (O3) |34.7 |Borough |Brooklyn |
##
##
## ** Manhattan **
##
##
## |ID |Name_Of_Pollutants |Data_Value |Type_Of_Location |Location_Name |
## |:------|:-----------------------|:----------|:----------------|:-------------|
## |740885 |Nitrogen dioxide (NO2) |16.4 |Borough |Manhattan |
## |823333 |Fine particles (PM 2.5) |7.0 |Borough |Manhattan |
## |823334 |Fine particles (PM 2.5) |7.5 |Borough |Manhattan |
## |823335 |Fine particles (PM 2.5) |7.9 |Borough |Manhattan |
## |825804 |Nitrogen dioxide (NO2) |19.1 |Borough |Manhattan |
## |825805 |Nitrogen dioxide (NO2) |15.4 |Borough |Manhattan |
## |825806 |Nitrogen dioxide (NO2) |23.4 |Borough |Manhattan |
## |827146 |Ozone (O3) |30.2 |Borough |Manhattan |
##
##
## ** Queens **
##
##
## |ID |Name_Of_Pollutants |Data_Value |Type_Of_Location |Location_Name |
## |:------|:-----------------------|:----------|:----------------|:-------------|
## |743728 |Ozone (O3) |30.9 |Borough |Queens |
## |823330 |Fine particles (PM 2.5) |5.7 |Borough |Queens |
## |823331 |Fine particles (PM 2.5) |6.4 |Borough |Queens |
## |823332 |Fine particles (PM 2.5) |6.7 |Borough |Queens |
## |825801 |Nitrogen dioxide (NO2) |14.9 |Borough |Queens |
## |825802 |Nitrogen dioxide (NO2) |10.6 |Borough |Queens |
## |825803 |Nitrogen dioxide (NO2) |20.1 |Borough |Queens |
## |827145 |Ozone (O3) |34.5 |Borough |Queens |
##
##
## ** Staten Island **
##
##
## |ID |Name_Of_Pollutants |Data_Value |Type_Of_Location |Location_Name |
## |:------|:-----------------------|:----------|:----------------|:-------------|
## |823327 |Fine particles (PM 2.5) |5.2 |Borough |Staten Island |
## |823328 |Fine particles (PM 2.5) |5.8 |Borough |Staten Island |
## |823329 |Fine particles (PM 2.5) |6.1 |Borough |Staten Island |
## |825798 |Nitrogen dioxide (NO2) |11.2 |Borough |Staten Island |
## |825799 |Nitrogen dioxide (NO2) |7.8 |Borough |Staten Island |
## |825800 |Nitrogen dioxide (NO2) |16.4 |Borough |Staten Island |
## |827144 |Ozone (O3) |35.3 |Borough |Staten Island |
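Since each row appears to be unique by unique_id, the same tables could probably be built without grouping. The sketch below is an alternative under that assumption: it renames columns with select and then applies the same split-and-loop pattern. The data frame sample_dat is a hypothetical three-row sample with values copied from the output above.

```r
library(dplyr)
library(knitr)

# Hypothetical sample with the same columns as dat (values taken from the tables above)
sample_dat <- data.frame(
  unique_id = c("827148", "827147", "823337"),
  name = c("Ozone (O3)", "Ozone (O3)", "Fine particles (PM 2.5)"),
  data_value = c("32.2", "34.7", "6.3"),
  geo_type_name = c("Borough", "Borough", "Borough"),
  geo_place_name = c("Bronx", "Brooklyn", "Brooklyn"),
  stringsAsFactors = FALSE
)

# select() can rename while choosing columns, so no group_by()/reframe() is needed
borough_summary <- sample_dat %>%
  filter(geo_type_name == "Borough") %>%
  select(ID = unique_id, Name_Of_Pollutants = name, Data_Value = data_value,
         Type_Of_Location = geo_type_name, Location_Name = geo_place_name)

# split() returns a named list with one data frame per borough, in alphabetical order
borough_tables <- split(borough_summary, borough_summary$Location_Name)

for (borough in names(borough_tables)) {
  cat("**", borough, "**\n", sep = "")
  print(kable(borough_tables[[borough]], format = "markdown", row.names = FALSE))
  cat("\n")
}
```

Using sep = "" in cat keeps the asterisks adjacent to the borough name, so the header renders as bold when the chunk output is treated as markdown.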
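The next example creates a map-based scatter plot using plotly, where
each borough is represented by a marker with color indicating its AQI
level. The AQI numbers are made up for illustration.
# sample data for borough coordinates (latitude and longitude)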
borough_coordinates <- data.frame(
  Borough = c("Manhattan", "Brooklyn", "Queens", "Bronx", "Staten Island"),
  Latitude = c(40.776676, 40.650002, 40.742054, 40.837048, 40.579021),
  Longitude = c(-73.971321, -73.949997, -73.769417, -73.865433, -74.151535)
)
# sample data for AQI summary
aqi_summary <- data.frame(
  Name = c("Manhattan", "Brooklyn", "Queens", "Bronx", "Staten Island"),
  Air_Quality_Index = c(24, 30.5, 29.1, 31.9, 36) # made-up AQI numbers
)
# create a map-based scatter plot using plotly with the Viridis colorscale
fig <- plot_ly(borough_coordinates,
               type = 'scattermapbox',
               mode = 'markers',
               lat = ~Latitude,
               lon = ~Longitude,
               marker = list(size = 10, color = aqi_summary$Air_Quality_Index, colorscale = "Viridis"),
               text = ~paste("Borough: ", Borough, "<br>AQI: ", round(aqi_summary$Air_Quality_Index, 2))) %>%
  layout(title = "Air Quality Index in NYC Boroughs",
         mapbox = list(
           style = "carto-positron",
           zoom = 8, # adjust the zoom level as needed
           center = list(lat = 40.7128, lon = -74.0060)
         ),
         xaxis = list(title = "Longitude"),
         yaxis = list(title = "Latitude"))
fig
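The plot takes marker positions from borough_coordinates and marker colors from aqi_summary, which works because both data frames list the boroughs in the same order. A more defensive variant, sketched here with the same made-up AQI numbers, joins the two frames by borough name first so that each value stays attached to its coordinates. The name borough_map_data is hypothetical.

```r
library(dplyr)

borough_coordinates <- data.frame(
  Borough = c("Manhattan", "Brooklyn", "Queens", "Bronx", "Staten Island"),
  Latitude = c(40.776676, 40.650002, 40.742054, 40.837048, 40.579021),
  Longitude = c(-73.971321, -73.949997, -73.769417, -73.865433, -74.151535)
)

# Same made-up AQI numbers, deliberately listed in a different order
aqi_summary <- data.frame(
  Name = c("Staten Island", "Bronx", "Queens", "Brooklyn", "Manhattan"),
  Air_Quality_Index = c(36, 31.9, 29.1, 30.5, 24)
)

# Match rows by borough name instead of relying on row position
borough_map_data <- left_join(borough_coordinates, aqi_summary,
                              by = c("Borough" = "Name"))

borough_map_data
```

With the joined frame, the plot_ly call could take borough_map_data as its data argument and use borough_map_data$Air_Quality_Index for the marker color and hover text.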
# filter data and select top 10 rows
# note: data_value is imported as text, so desc(data_value) sorts alphabetically;
# desc(as.numeric(data_value)) would rank the values numerically
community_districts_summary <- dat %>%
  filter(
    geo_type_name == "CD" &
      name == "Nitrogen dioxide (NO2)" &
      time_period == "Summer 2022"
  ) %>%
  arrange(desc(data_value)) %>%
  slice(1:10) %>%
  select(unique_id, name, geo_place_name, time_period, data_value)
# print the top 10 rows with kable_styling, highlighting the first three
community_districts_summary %>%
  kable("html") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("bordered", "hover", "condensed")) %>%
  row_spec(1:3, italic = TRUE, color = "gold", background = "brown")
| unique_id | name | geo_place_name | time_period | data_value |
|---|---|---|---|---|
| 826336 | Nitrogen dioxide (NO2) | Flatbush and Midwood (CD14) | Summer 2022 | 9.8 |
| 826327 | Nitrogen dioxide (NO2) | Bensonhurst (CD11) | Summer 2022 | 9.4 |
| 826111 | Nitrogen dioxide (NO2) | St. George and Stapleton (CD1) | Summer 2022 | 9.3 |
| 826378 | Nitrogen dioxide (NO2) | South Ozone Park and Howard Beach (CD10) | Summer 2022 | 9.2 |
| 826339 | Nitrogen dioxide (NO2) | Sheepshead Bay (CD15) | Summer 2022 | 8.5 |
| 826348 | Nitrogen dioxide (NO2) | Flatlands and Canarsie (CD18) | Summer 2022 | 8.5 |
| 826333 | Nitrogen dioxide (NO2) | Coney Island (CD13) | Summer 2022 | 8.2 |
| 826114 | Nitrogen dioxide (NO2) | South Beach and Willowbrook (CD2) | Summer 2022 | 8.0 |
| 826076 | Nitrogen dioxide (NO2) | Rockaway and Broad Channel (CD14) | Summer 2022 | 6.9 |
| 826117 | Nitrogen dioxide (NO2) | Tottenville and Great Kills (CD3) | Summer 2022 | 6.8 |
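Because data_value comes in as text, sorting it with desc orders the rows alphabetically rather than numerically. The sketch below shows the difference on three Summer 2022 NO2 values taken from the tables in this tutorial. Two of them belong to UHF34 neighborhoods rather than community districts and are used only to illustrate the sorting; the data frame name no2 is hypothetical.

```r
library(dplyr)

no2 <- data.frame(
  geo_place_name = c("Flatbush and Midwood (CD14)", "Pelham - Throgs Neck", "East Harlem"),
  data_value = c("9.8", "12.0", "13.0"),
  stringsAsFactors = FALSE
)

# Text sort: "9.8" ranks first because the character "9" sorts after "1"
text_sorted <- no2 %>% arrange(desc(data_value))

# Numeric sort: 13.0 ranks first
numeric_sorted <- no2 %>% arrange(desc(as.numeric(data_value)))

text_sorted$data_value
numeric_sorted$data_value
```

Wrapping the column in as.numeric inside arrange leaves the stored values untouched, so the printed table keeps its original formatting.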
We can use the ten community district rows selected above to create a simple scatter plot of their NO2 values, as shown below.
# create the scatter plot (data_value is converted to numeric so the y axis is continuous)
scatter_plot <- ggplot(community_districts_summary, aes(x = unique_id, y = as.numeric(data_value), color = geo_place_name)) +
  geom_point() +
  labs(title = "Scatter Plot of Nitrogen Dioxide (NO2) Levels in NYC Community Districts",
       x = "Unique ID of each community district (CD)",
       y = "NO2 Data Value",
       color = "Community District") +
  theme_minimal() + # apply a minimal theme
  theme(legend.position = "right",
        axis.text.x = element_text(angle = 85, hjust = 0.7))
scatter_plot
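Since unique_id has no meaningful order along the x axis, a bar chart ordered by value may be easier to read. The following is a sketch of that alternative using geom_col, built on three rows copied from the table above. The names cd_sample and bar_plot are hypothetical, and the ppb unit comes from the measure_info column shown earlier for NO2.

```r
library(ggplot2)

# Three rows copied from the community district table (values stored as text)
cd_sample <- data.frame(
  geo_place_name = c("Flatbush and Midwood (CD14)", "Bensonhurst (CD11)", "Coney Island (CD13)"),
  data_value = c("9.8", "9.4", "8.2"),
  stringsAsFactors = FALSE
)
cd_sample$data_value <- as.numeric(cd_sample$data_value)

# reorder() sorts the districts by their NO2 value; coord_flip() makes the labels readable
bar_plot <- ggplot(cd_sample, aes(x = reorder(geo_place_name, data_value), y = data_value)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "NO2 Levels in Selected NYC Community Districts (Summer 2022)",
       x = "Community District",
       y = "NO2 (ppb)") +
  theme_minimal()

bar_plot
```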
Resource I - New York City Open Data. (https://data.cityofnewyork.us/Environment/Air-Quality/c3uy-2p5r/about_data)
Resource II - Scatter Plots on Maps in R using plotly. (https://plotly.com/r/scatter-plots-on-maps/)
Resource III - kable_styling: HTML table attributes. (https://www.rdocumentation.org/packages/kableExtra/versions/1.3.4/topics/kable_styling)
Resource IV - Pie chart in ggplot2. (https://r-charts.com/part-whole/pie-chart-ggplot2/)
Resource V - Bar plot in ggplot2 with geom_bar and geom_col. (https://r-charts.com/ranking/bar-plot-ggplot2/)
Resource VI - Latitude and Longitude Finder. (https://www.latlong.net/)