The tidyverse is a collection of R packages sharing a common design philosophy, with tools and functions that make organizing and visualizing data easier. People usually write R packages to make specific tasks easier to do.
For this assignment, I will demonstrate how to clean and visualize geographic data using tidyverse functions, along with a few extra packages that work well in conjunction with the tidyverse. The data set used in this assignment is NYC Motor Vehicle Crashes from June 2012 to August 2023. In theory, every car crash is recorded by the police. The tidyverse is helpful for both cleaning and visualizing the data in this data set.
For the following assignment to work, please install the following packages:
install.packages(c('readr','ggmap','googledrive','tidyverse','gganimate'))
To reproduce the functionality of this R Markdown file, please obtain a Google Maps API key and define it below.
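For example (placeholder value only; substitute your own key, and never commit a real key to version control):
my_google_api = "YOUR_GOOGLE_MAPS_API_KEY"  # placeholder, not a real key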
library(readr)
library(ggmap)
## Loading required package: ggplot2
## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, will retire in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
## The sp package is now running under evolution status 2
## (status 2 uses the sf package in place of rgdal)
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
register_google(key = my_google_api)
Since the CSV data set I used is too big for GitHub (444 MB vs. the 25 MB maximum), I uploaded it to Google Drive and pull it from there using the googledrive package, which is installed with the tidyverse but not attached by library(tidyverse). To reproduce, get the CSV file here and add it to your Google Drive. (I tried to use the Kaggle API with no success.)
drive_auth() will open your default browser and ask you to log in to Google Drive.
drive_ls(pattern = 'name.csv') will look for a file whose name matches the pattern and return its metadata, including the Google file ID, when it gets a match.
drive_download() will download the file to the current working directory.
# Load the googledrive package
library(googledrive)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ stringr 1.5.0
## ✔ forcats 1.0.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# authenticate via the browser
drive_auth()
## ! Using an auto-discovered, cached token.
## To suppress this message, modify your code or options to clearly consent to
## the use of a cached token.
## See gargle's "Non-interactive auth" vignette for more details:
## <https://gargle.r-lib.org/articles/non-interactive-auth.html>
## ℹ The googledrive package is using a cached token for 'jeanj7722@gmail.com'.
file = drive_ls(pattern = "nyc_crashes.csv")
drive_download(file, path = "nyc_crashes.csv")
## File downloaded:
## • 'nyc_crashes.csv' <id: 1Nxkb8HtqYIKypxxYx40XqxlHUjq4Cvaf>
## Saved locally as:
## • 'nyc_crashes.csv'
nyc_crashes = read.csv('nyc_crashes.csv')
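As an aside, base read.csv() converts spaces in the raw headers to dots (e.g., CRASH.DATE), which the rest of this document relies on. For a file this size, readr's read_csv() (attached above) would likely be much faster, though it preserves the raw header names; a minimal sketch under that assumption:
# Faster alternative using readr; returns a tibble. Caveat: read_csv()
# keeps the raw header names (likely "CRASH DATE", with spaces), so the
# downstream column references would need adjusting or a rename step.
nyc_crashes_tbl = read_csv('nyc_crashes.csv')  # nyc_crashes_tbl is an illustrative name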
The tidyverse has a lot of functions that can help you clean data.
glimpse() is a dplyr function (included in the tidyverse) that returns column information for a data set. In this case, we use it to view the columns and some values of nyc_crashes.
glimpse(nyc_crashes)
## Rows: 2,022,581
## Columns: 30
## $ X <int> 1923971, 1921212, 1923909, 1921229, 1923…
## $ CRASH.DATE <chr> "2012-07-01", "2012-07-01", "2012-07-01"…
## $ CRASH.TIME <chr> "2:50", "19:30", "14:14", "3:58", "18:37…
## $ BOROUGH <chr> "BRONX", "BROOKLYN", "MANHATTAN", "MANHA…
## $ ZIP.CODE <dbl> 10451, 11238, 10003, 10002, 10026, 11229…
## $ LATITUDE <dbl> 40.82558, 40.68304, 40.73117, 40.72199, …
## $ LONGITUDE <dbl> -73.91846, -73.96478, -73.99192, -73.985…
## $ LOCATION <chr> "(40.8255779, -73.9184596)", "(40.683039…
## $ ON.STREET.NAME <chr> "EAST 161 STREET ", "FUL…
## $ CROSS.STREET.NAME <chr> "MORRIS AVENUE ", "WAS…
## $ OFF.STREET.NAME <chr> "", "", "", "", "", "", "", "", "", "", …
## $ NUMBER.OF.PERSONS.INJURED <dbl> 1, 1, 7, 0, 1, 0, 0, 2, 1, 0, 2, 0, 1, 0…
## $ NUMBER.OF.PERSONS.KILLED <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ NUMBER.OF.PEDESTRIANS.INJURED <int> 0, 0, 3, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ NUMBER.OF.PEDESTRIANS.KILLED <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ NUMBER.OF.CYCLIST.INJURED <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ NUMBER.OF.CYCLIST.KILLED <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ NUMBER.OF.MOTORIST.INJURED <int> 1, 1, 3, 0, 0, 0, 0, 2, 0, 0, 2, 0, 1, 0…
## $ NUMBER.OF.MOTORIST.KILLED <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ CONTRIBUTING.FACTOR.VEHICLE.1 <chr> "Failure to Yield Right-of-Way", "Unspec…
## $ CONTRIBUTING.FACTOR.VEHICLE.2 <chr> "Unspecified", "Unspecified", "Unspecifi…
## $ CONTRIBUTING.FACTOR.VEHICLE.3 <chr> "", "", "", "", "", "", "", "", "", "", …
## $ CONTRIBUTING.FACTOR.VEHICLE.4 <chr> "", "", "", "", "", "", "", "", "", "", …
## $ CONTRIBUTING.FACTOR.VEHICLE.5 <chr> "", "", "", "", "", "", "", "", "", "", …
## $ COLLISION_ID <int> 85182, 191551, 9409, 12176, 63606, 11626…
## $ VEHICLE.TYPE.CODE.1 <chr> "PASSENGER VEHICLE", "SPORT UTILITY / ST…
## $ VEHICLE.TYPE.CODE.2 <chr> "PASSENGER VEHICLE", "TAXI", "BICYCLE", …
## $ VEHICLE.TYPE.CODE.3 <chr> "", "", "", "", "", "", "", "", "", "", …
## $ VEHICLE.TYPE.CODE.4 <chr> "", "", "", "", "", "", "", "", "", "", …
## $ VEHICLE.TYPE.CODE.5 <chr> "", "", "", "", "", "", "", "", "", "", …
I then wanted to delete columns that didn't matter to us. We can use the select() function to select (or, with a minus sign, drop) and rename variables in our data set. By wrapping the column names in -c(...), we remove the listed variables (aka columns).
In other words, we pipe nyc_crashes through select(), remove the columns, and name the new data frame nyc_crashes_mod. (A more compact tidyselect version is sketched after the code block.)
nyc_crashes_mod = nyc_crashes %>%
select(-c(
"NUMBER.OF.PERSONS.INJURED",
"NUMBER.OF.PERSONS.KILLED",
"NUMBER.OF.PEDESTRIANS.INJURED",
"NUMBER.OF.PEDESTRIANS.KILLED",
"NUMBER.OF.CYCLIST.INJURED",
"NUMBER.OF.CYCLIST.KILLED",
"NUMBER.OF.MOTORIST.INJURED",
"NUMBER.OF.MOTORIST.KILLED",
"CONTRIBUTING.FACTOR.VEHICLE.1",
"CONTRIBUTING.FACTOR.VEHICLE.2",
"CONTRIBUTING.FACTOR.VEHICLE.3",
"CONTRIBUTING.FACTOR.VEHICLE.4",
"CONTRIBUTING.FACTOR.VEHICLE.5",
"COLLISION_ID",
"VEHICLE.TYPE.CODE.1",
"VEHICLE.TYPE.CODE.2",
"VEHICLE.TYPE.CODE.3",
"VEHICLE.TYPE.CODE.4",
"VEHICLE.TYPE.CODE.5"
))
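Because the dropped columns share common prefixes, the same call can likely be written more compactly with tidyselect helpers; a sketch, assuming no column we want to keep shares these prefixes:
# Equivalent, more compact drop using tidyselect helpers (sketch).
nyc_crashes_mod = nyc_crashes %>%
  select(-starts_with("NUMBER.OF"),
         -starts_with("CONTRIBUTING.FACTOR"),
         -starts_with("VEHICLE.TYPE.CODE"),
         -COLLISION_ID)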
We can use the filter() function to subset the data set based on specific values in a column.
In this case, I first filtered for crashes that occurred in the BOROUGH of QUEENS and placed the results in queens_crashes.
Then, I filtered for crashes that occurred in 2023 by taking the first four characters of the date string (the year) with str_sub(). (An equivalent date-based filter is sketched after the output below.)
# Queens only
queens_crashes = nyc_crashes_mod %>%
filter(BOROUGH == 'QUEENS')
# 2023 Queens only
queens_crashes_2023 = queens_crashes %>%
filter(str_sub(CRASH.DATE, 1, 4) == '2023')
glimpse(queens_crashes_2023)
## Rows: 11,434
## Columns: 11
## $ X <int> 1970421, 1970449, 1970326, 1970331, 1970424, 1970299…
## $ CRASH.DATE <chr> "2023-01-01", "2023-01-01", "2023-01-01", "2023-01-0…
## $ CRASH.TIME <chr> "12:00", "18:06", "15:30", "5:01", "4:30", "12:00", …
## $ BOROUGH <chr> "QUEENS", "QUEENS", "QUEENS", "QUEENS", "QUEENS", "Q…
## $ ZIP.CODE <dbl> 11362, 11358, 11354, 11379, 11373, 11354, 11375, 113…
## $ LATITUDE <dbl> 40.76896, 40.76141, 40.76516, 40.71638, 40.74184, 40…
## $ LONGITUDE <dbl> -73.73119, -73.80385, -73.81911, -73.89751, -73.8697…
## $ LOCATION <chr> "(40.76896, -73.731186)", "(40.761406, -73.80385)", …
## $ ON.STREET.NAME <chr> "", "162 STREET", "", "ELIOT AVENUE", "94 STREET", "…
## $ CROSS.STREET.NAME <chr> "", "STATION ROAD", "", "MOUNT OLIVET CRESCENT", "49…
## $ OFF.STREET.NAME <chr> "49-10 LITTLE NECK PARKWAY", "", "147-10 NORT…
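Since lubridate loads with the tidyverse, an equivalent filter can parse the date rather than slice the string; a sketch, assuming CRASH.DATE is always formatted as "YYYY-MM-DD":
# Equivalent year filter via lubridate (loaded with the tidyverse).
queens_crashes_2023 = queens_crashes %>%
  filter(year(ymd(CRASH.DATE)) == 2023)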
While not part of the tidyverse, get_map() from the ggmap package was used to fetch the map of Queens from Google Maps. I registered an API key above; please enter your own for this to work.
The ggmap package works well with ggplot2, which IS in the tidyverse.
queens_map = get_map(location = "Queens, NY", zoom = 12)
## ℹ <https://maps.googleapis.com/maps/api/staticmap?center=Queens,%20NY&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx>
## ℹ <https://maps.googleapis.com/maps/api/geocode/json?address=Queens,+NY&key=xxx>
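To avoid the extra geocoding request, get_map() also accepts a longitude/latitude center directly; the coordinates below are an approximate center for Queens, chosen purely for illustration:
# Alternative: pass an approximate lon/lat center instead of an address,
# skipping the geocode call (coordinates are illustrative, not exact).
queens_map = get_map(location = c(lon = -73.82, lat = 40.73), zoom = 12)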
Here, I first draw the map of Queens using ggmap().
Afterwards, I add a layer using ggplot2's geom_point() function. Here, I load the data of queens_crashes_2023 and map longitude and latitude to the x and y aesthetics. I color the points by ZIP code (every crash in the same ZIP code shares a color).
Afterwards, I use scale_color_gradientn() to build a color gradient between red and blue (rev() reverses the blue-to-red ramp).
Finally, I add the labs() function to give the graph a title.
Displayed should be a map of Queens, NY, with each point corresponding to a car crash that occurred in 2023.
ggmap(queens_map) +
geom_point(data = queens_crashes_2023,
aes(x = LONGITUDE, y = LATITUDE, color = ZIP.CODE),
alpha = 0.5, size = 1) +
scale_color_gradientn(colors = rev(colorRampPalette(c("blue", "red"))(100))) +
labs(title = '2023 QUEENS CAR CRASHES')
## Warning: Removed 2414 rows containing missing values (`geom_point()`).
I wanted to find the number of crashes in each ZIP code, along with each ZIP code's mean crash location, and plot them.
To do this, I first piped queens_crashes_2023 into group_by() to group it by ZIP code. Afterwards, I used summarise() to calculate the average longitude and latitude of each ZIP code, plus the crash count (n() returns the group size). I saved this in avg_location_per_zip.
avg_location_per_zip = queens_crashes_2023 %>%
  group_by(ZIP.CODE) %>%
  summarise(avg_latitude = mean(LATITUDE, na.rm = TRUE),
            avg_longitude = mean(LONGITUDE, na.rm = TRUE),
            avg_crashes = n())  # n() gives the crash count per ZIP code
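As a quick sanity check (illustrative, not part of the original analysis), the summary can be sorted to show the ZIP codes with the most crashes:
# Peek at the five ZIP codes with the most 2023 crashes (illustrative).
avg_location_per_zip %>%
  arrange(desc(avg_crashes)) %>%
  head(5)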
Next, I plotted this data using the same ggplot2 functions. Each point marks the mean location of all 2023 crashes in a Queens ZIP code, colored by that ZIP code's crash count.
ggmap(queens_map) +
geom_point(data = avg_location_per_zip,
aes(x = avg_longitude, y = avg_latitude, color = avg_crashes),
size = 5, alpha = 0.7) +
scale_color_gradientn(colors = rev(colorRampPalette(c("blue", "red"))(100))) +
labs(title = 'Crashes per ZIP Code in Queens for 2023')
## Warning: Removed 34 rows containing missing values (`geom_point()`).
I wanted to make a graph like the one above, but animated to show the change over the years. To do this, I first cleaned the data up a bit more.
Using the filter() function, I removed the rows with NA in the ZIP code column. Afterwards, I used the mutate() function to convert the crash-date string into a Date and extract the year.
Then, similar to above, I used group_by() and summarise() to group by ZIP code and calculate the average longitude and latitude and the crash count per ZIP code. This is done for each year.
# remove rows with missing ZIP codes
queens_crashes_clean = queens_crashes %>%
  filter(!is.na(ZIP.CODE))
# convert the crash-date string into a Date and extract the year
queens_crashes_clean = queens_crashes_clean %>%
  mutate(CRASH.DATE = as.Date(CRASH.DATE, format = "%Y-%m-%d"),
         year = as.integer(format(CRASH.DATE, "%Y")))
avg_location_per_zip_year = queens_crashes_clean %>%
  group_by(year, ZIP.CODE) %>%
  summarise(
    avg_latitude = mean(LATITUDE, na.rm = TRUE),
    avg_longitude = mean(LONGITUDE, na.rm = TRUE),
    avg_crashes = n()
  )
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
Finally, I use the gganimate package to create a graph like the one above, animated across the years. It renders the GIF into your current working directory; to view it, go to your working directory and open the GIF. The gganimate package is not part of the tidyverse, but it is built to extend ggplot2 with animation layers.
library(gganimate)
# animated plot
anim_map = ggmap(queens_map) +
  geom_point(data = avg_location_per_zip_year,
             aes(x = avg_longitude, y = avg_latitude, color = avg_crashes),
             size = 5, alpha = 0.7) +
  scale_color_gradientn(colors = rev(colorRampPalette(c("blue", "red"))(100))) +
  labs(title = 'Crashes per ZIP Code in Queens for {closest_state}',
       x = 'Longitude', y = 'Latitude') +
  theme_minimal() +
  transition_states(year, transition_length = 2, state_length = 1) +
  enter_fade() +
  exit_fade()
# save
anim_save("avg_crashes_per_zip_year_map.gif", animation = anim_map)
## Warning: Removed rows containing missing values (`geom_point()`).
## (This warning repeats for each rendered frame; between 15 and 64 rows were dropped per frame.)
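anim_save() renders gganim objects with default settings; to control frame count, speed, and size, one can render explicitly with animate() first. A sketch with illustrative (untuned) values:
# Optional: render with explicit settings before saving.
anim = animate(anim_map, nframes = 120, fps = 10, width = 640, height = 640)
anim_save("avg_crashes_per_zip_year_map.gif", animation = anim)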