---
title: "Data Wrangling Final Project"
author: "Lawrence Porter"
date: "December 8, 2016"
output: html_document
---
Between 1999 and 2010, traffic crashes involving alcohol and/or drugs resulted in an estimated 14,256 deaths, 841,004 injuries and damage to 2,779,458 vehicles in PDO crashes alone. It is also estimated that there were 11,880 fatal impaired driving crashes, 574,872 injury-only impaired crashes and 1,828,589 PDO impaired crashes, totalling 2,415,341 crashes. Using a social cost model, these deaths, injuries and PDO crashes cost Canadians an estimated $246.1 billion. Based on a population of 33 million people, that represents a cost of about $7,457 per Canadian. (from madd.ca)
This project combines two datasets. The first is the incident rate of impaired driving in 25 major cities and regional police jurisdictions from 1998 to 2015. The second, used for mapping, is the geocode data (latitude and longitude) for the same 25 police jurisdictions, which was obtained from a website and pasted into a CSV file to be joined with the impaired driving dataset.
There are four visualizations provided that examine this data:
* A faceted line plot that illustrates the IncidentRate trend for impaired driving in each of the 25 police jurisdictions from 1998 to 2015.
* An ordered bar plot that shows the percent increase or decrease in the incident rate from 2008 to 2015, which makes it easy to identify the cities with the largest changes.
* A line plot comparing average annual incident rates by province from 1998 to 2015.
* A map of the 25 police jurisdictions in 2015 that uses sequential colors to identify which cities and regions have the highest and lowest IncidentRate values. Mapping was problematic because the police jurisdictions do not align well with available choropleths. In some cases a police jurisdiction is a city; in other cases it covers a large area that includes urban and rural areas not bounded by specific counties. Points were therefore used to represent the jurisdictions, with color showing the incident rate. Map insets were included to call out separate jurisdictions in densely populated areas.
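The percent change behind the ordered bar plot can be sketched as follows. This is a minimal illustration rather than the project's actual code: the data frame `rates`, its columns `first_rate` and `last_rate`, the city names, and every number in it are invented for the example.

```r
# Hypothetical start and end incident rates (per 100,000 population).
# These values are invented for illustration; they are not from the data set.
rates <- data.frame(
  City       = c("City A", "City B", "City C"),
  first_rate = c(200, 150, 80),
  last_rate  = c(150, 180, 80),
  stringsAsFactors = FALSE
)

# Percent change relative to the starting rate.
rates$pct_change <- (rates$last_rate - rates$first_rate) / rates$first_rate * 100

# Sort from the largest decrease to the largest increase,
# which is the order an ordered bar plot would display.
rates <- rates[order(rates$pct_change), ]
print(rates)
```

A negative value means the incident rate fell over the period and a positive value means it rose.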
library(RSocrata)  # Import data from Socrata open data portals
library(readr)     # Read flat/tabular text files from disk (or a connection)
library(tidyverse) # General packages for data manipulation, tidying and plotting
library(ggmap)     # Spatial visualization with Google Maps and OpenStreetMap
library(ggthemes)  # Extra themes, scales and geoms for 'ggplot2'
library(plotly)    # Create interactive plots
library(maps)      # Draw geographical maps
library(mapdata)   # Extra, higher-resolution map databases to supplement 'maps'
library(sp)        # Classes and methods for spatial data
library(gridExtra) # Arrange plots and tables on a grid
library(grid)      # Low-level graphics system for fine control over layout and appearance
library(DT)        # Create scrollable and sortable data tables
library(ggrepel)   # Text and label geoms for 'ggplot2' that repel each other to avoid overlap
library(stringr)   # Simple, consistent wrappers for common string operations
This dataset measures the number of impaired driving incidents per 100,000 population. Incidents include impairment from alcohol, narcotics, or prescription medication, as well as failure to comply with testing. The data comes from Statistics Canada, tables 252-0075 to 252-0081. Link to the "codebook": https://www.opendatanetwork.com/dataset/dashboard.edmonton.ca/n5ux-m2xi
For mapping purposes, I downloaded latitude and longitude data from www.findlatitudeandlongitude.com into a CSV file, which was joined to the first dataset.
More information from source tables is here: http://www5.statcan.gc.ca/cansim/a26?lang=eng&retrLang=eng&id=2520076&&pattern=&stByVal=1&p1=1&p2=31&tabMode=dataTable&csid=
“In 2011, the introduction of the Immediate Roadside Prohibition (IRP) in British Columbia provided an alternative method for officers to proceed with penalties for impaired drivers and may account for the trends reported for 2011 and 2012.”
Using the RSocrata package, the data was imported directly from the open data portal linked above.
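A minimal sketch of the import and join steps is shown below. Several things here are assumptions: the resource URL is guessed from the dataset id in the codebook link, the helper `fetch_impaired()` is a hypothetical name, and the two small data frames are stand-ins built from rows in the printed output so the join can run without network access. `Jimp` is the name the joined data is referred to by later in this report.

```r
library(dplyr)

# Not run here: pulling the data needs network access and the RSocrata package.
# The default URL is an assumption derived from the dataset id n5ux-m2xi.
fetch_impaired <- function(url = "https://dashboard.edmonton.ca/resource/n5ux-m2xi.json") {
  RSocrata::read.socrata(url)
}

# Stand-in for the imported incident rate data (two rows from the printed tibble).
impaired <- data.frame(
  City = c("Burnaby", "Calgary"),
  Province = c("BC", "AB"),
  Year = c("1998", "1998"),
  IncidentRate = c(107.60, 212.33),
  stringsAsFactors = FALSE
)

# Stand-in for the geocode CSV file.
geocodes <- data.frame(
  City = c("Burnaby", "Calgary"),
  Province = c("BC", "AB"),
  latitude = c(49.24881, 51.04861),
  longitude = c(-122.98051, -114.07085),
  stringsAsFactors = FALSE
)

# Join on the two shared key columns, City and Province.
Jimp <- left_join(impaired, geocodes, by = c("City", "Province"))
print(Jimp)
```

Naming the keys in `by` makes the join explicit; leaving it out would let dplyr pick the shared columns itself and print a "Joining, by" message.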
## Joining, by = c("City", "Province")
## # A tibble: 450 × 6
## City Province Year IncidentRate latitude longitude
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Burnaby BC 1998 107.60 49.24881 -122.98051
## 2 Calgary AB 1998 212.33 51.04861 -114.07085
## 3 Durham Region ON 1998 147.72 43.93684 -78.92882
## 4 Edmonton AB 1998 340.97 53.54439 -113.49093
## 5 Gatineau QB 1998 NA 45.47650 -75.70130
## 6 Halifax Region NS 1998 NA 44.64886 -63.57532
## 7 Halton Region ON 1998 119.61 43.53254 -79.87448
## 8 Hamilton ON 1998 102.30 43.25002 -79.86609
## 9 Laval QB 1998 114.58 45.60660 -73.71240
## 10 London ON 1998 196.35 42.98695 -81.24318
## # ... with 440 more rows
The data can be used to answer questions such as:
* Which city had the highest incident rate, and when?
* Which cities or provinces had the 5 highest and the 5 lowest incident rates?
* What are the geocodes for the cities you select?
* Which year had the lowest incident rates?
* Given a latitude/longitude, which city or cities are located there?
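Questions like these can also be answered in code. The sketch below is an illustration only: it uses a handful of 1998 rows copied from the printed tibble rather than the full 450 rows, so its answers apply to that small sample and not to the whole dataset.

```r
library(dplyr)

# A few rows copied from the printed tibble (1998 only), for illustration.
Jimp <- data.frame(
  City = c("Burnaby", "Calgary", "Edmonton", "Hamilton", "Gatineau"),
  Province = c("BC", "AB", "AB", "ON", "QB"),
  Year = c("1998", "1998", "1998", "1998", "1998"),
  IncidentRate = c(107.60, 212.33, 340.97, 102.30, NA),
  stringsAsFactors = FALSE
)

# Highest incident rate and when it occurred (rows with NA are dropped first).
highest <- Jimp %>%
  filter(!is.na(IncidentRate)) %>%
  arrange(desc(IncidentRate)) %>%
  head(1)

# The two lowest rates in this sample; on the full data, head(5) would give
# the five lowest.
lowest_two <- Jimp %>%
  filter(!is.na(IncidentRate)) %>%
  arrange(IncidentRate) %>%
  head(2)

print(highest)
print(lowest_two)
```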
During the data preparation phase, I used code to determine the size, variable names, and classes of the data:
* Dimensions: as_tibble() shows 450 observations x 6 columns
* Column names: as_tibble() shows City, Province, Year, IncidentRate, latitude, longitude
* Data classes: as_tibble() shows City-chr, Province-chr, Year-chr, IncidentRate-dbl, latitude-dbl, longitude-dbl
* Missing values: sum(is.na(Jimp)) and summary(Jimp) show 19 in total; the missing values fall in the years 1998-2007, with none from 2008 onward
* Summary statistics: summary(Jimp$IncidentRate) (only IncidentRate is meaningful)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  35.78  100.50  136.30  151.30  185.60  441.70      19
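The checks listed above can be reproduced with base R along these lines. This is a sketch on a three-row stand-in for `Jimp` (values copied from the printed tibble), so the counts it prints differ from the 450-row results reported here. The per-year count of missing values with `tapply()` is one possible approach, not necessarily the one used in the project.

```r
# Three-row stand-in with the same six columns as the joined data.
Jimp <- data.frame(
  City = c("Burnaby", "Gatineau", "Halifax Region"),
  Province = c("BC", "QB", "NS"),
  Year = c("1998", "1998", "1998"),
  IncidentRate = c(107.60, NA, NA),
  latitude = c(49.24881, 45.47650, 44.64886),
  longitude = c(-122.98051, -75.70130, -63.57532),
  stringsAsFactors = FALSE
)

size <- dim(Jimp)                 # number of rows and columns
col_names <- names(Jimp)          # column names
col_classes <- sapply(Jimp, class) # class per column ("numeric" is shown as dbl in a tibble)
total_missing <- sum(is.na(Jimp)) # total number of missing values

# Missing IncidentRate values counted per year.
missing_by_year <- tapply(is.na(Jimp$IncidentRate), Jimp$Year, sum)

print(size)
print(col_names)
print(col_classes)
print(total_missing)
print(missing_by_year)
print(summary(Jimp$IncidentRate)) # quartiles, mean, and NA count
```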
## `geom_smooth()` using method = 'loess'
Notes:
* Because some of the jurisdictions are regions that combine urban and large rural areas not defined by county boundaries, the data did not fit choropleth maps well (although I did investigate them).
* I tried several maps of Canada and ran into problems with zoom levels. Zoom level 4 showed all of North America, which made points plotted on Canada hard to read. Zoom level 5 showed part of Canada but not all areas of interest.
* For the best mapping, I decided to use bounding boxes defined by latitude and longitude to get a larger map of the areas of interest.
* Google Maps had the best-looking base maps, but a bounding box cannot be specified with Google Maps.
* I ended up using the "stamen" map source with a "toner" (black and white) base because it allowed specifying boundaries, which let me focus more tightly on highly populated areas where data from several regions overlapped on the large map of Canada.
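The bounding box approach in these notes can be sketched as follows. The helper names `bbox_around()` and `plot_inset()`, the padding of 0.5 degrees, and the zoom level are all assumptions made for illustration; the coordinates are copied from the printed tibble. The tile download is wrapped in a function that is not called here because it needs network access, and recent ggmap releases serve the Stamen tiles through a different provider, so that call may need adapting.

```r
# Coordinates of four Ontario jurisdictions, copied from the printed tibble.
central <- data.frame(
  City = c("Durham Region", "Halton Region", "Hamilton", "London"),
  latitude = c(43.93684, 43.53254, 43.25002, 42.98695),
  longitude = c(-78.92882, -79.87448, -79.86609, -81.24318),
  stringsAsFactors = FALSE
)

# Build a bounding box around the points, padded by `pad` degrees on each side.
# The element names (left, bottom, right, top) follow the ggmap bbox convention.
bbox_around <- function(lon, lat, pad = 0.5) {
  c(left = min(lon) - pad, bottom = min(lat) - pad,
    right = max(lon) + pad, top = max(lat) + pad)
}

bbox <- bbox_around(central$longitude, central$latitude)
print(bbox)

# Not run here: fetching tiles needs network access and the ggmap package.
# zoom = 7 is an arbitrary choice for this sketch.
plot_inset <- function(points, bbox, zoom = 7) {
  base <- ggmap::get_stamenmap(bbox = bbox, zoom = zoom, maptype = "toner")
  ggmap::ggmap(base) +
    ggplot2::geom_point(data = points,
                        ggplot2::aes(x = longitude, y = latitude))
}
```

Deriving the box from the points themselves keeps every jurisdiction inside the inset, and the padding controls how tightly the map is cropped around them.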
This project studied the incident rate of impaired driving per 100,000 population across Canada from 1998 to 2015. I wanted to find how regions across Canada differ, which areas have the highest incident rates, and which cities have increased or decreased their incident rates from the first observation to the most recent one, and by how much.
Canadian police jurisdictions typically operate at the local city level or the provincial level (there is no county level in Canada), and the forces cooperate closely. The data does not distinguish between the two, but in the larger mixed urban/rural regions the provincial force would certainly be dominant, so the trends could reflect provincial enforcement policy.
I addressed the problem by creating
* Scrollable and searchable data tables that can be used to slice, order and select data of interest.
I also created several plots
* A faceted line plot of cities showing the incident rate by year for each city
* An ordered bar plot showing the percent change in the incident rate from 1998 to 2015
* Two maps to help visualize regions, locate cities, and display incident rates
    * One of all of Canada, plotting the cities in the data set with 2015 incident rates and a color gradient to show the level
    * One close-up of Central Canada with the same information, needed because the high population density and number of police jurisdictions in Central Canada make the data unclear on the map of Canada