title: “Data Wrangling Final Project”
author: “Lawrence Porter”
date: “December 8, 2016”
output: html_document

Impaired Driving Rates in Canada for Years 1998 to 2015

Executive Summary

Between 1999 and 2010, traffic crashes involving alcohol and/or drugs resulted in an estimated 14,256 deaths, 841,004 injuries and damage to 2,779,458 vehicles in property-damage-only (PDO) crashes alone. It is also estimated that there were 11,880 fatal impaired driving crashes, 574,872 injury-only impaired crashes and 1,828,589 PDO impaired crashes, totalling 2,415,341 crashes. Using a social cost model, these deaths, injuries and PDO crashes cost Canadians an estimated $246.1 billion. Based on a population of 33 million people, that represents a cost of about $7,457 per Canadian. (from madd.ca)

This project combines two datasets. The first is the incident rate of impaired driving in 25 major city and regional police jurisdictions from 1998 to 2015. The second, used for mapping, is the geocode (latitude and longitude) data for the same 25 police jurisdictions, which was obtained from a website and pasted into a CSV file to be joined with the impaired driving data set.

There are four visualizations provided that examine this data:

  1. A faceted line plot that illustrates the IncidentRate trend for impaired driving in each of the 25 police jurisdictions from 1998 to 2015.

  2. An ordered bar plot showing the percent increase or decrease in the incident rate from 2008 to 2015, which makes it easy to identify the cities with the largest changes.

  3. A line plot comparing average annual incident rates by province from 1998 to 2015.

  4. A map of the 25 police jurisdictions for 2015, using sequential colors to identify which cities/regions have the highest and lowest IncidentRates. Mapping was problematic because the police jurisdictions do not align well with available choropleths. In some cases a police jurisdiction is a city; in others it covers a large area, including urban and rural areas that are not bounded by specific counties. Points were therefore used to represent the jurisdictions, with color showing the incident rates. Map insets were included to call out separate jurisdictions in densely populated areas.

2 Load Packages Required for Data Cleaning, Plotting and Mapping

library(RSocrata)  #Provides easier interaction with Socrata open data portals and data import
library(readr)    #Read flat/tabular text files from disk (or a connection).
library(tidyverse)  #general package for data manipulation, tidying and plotting
library(ggmap) #a package for spatial visualization with Google Maps and OpenStreetMap
library(ggthemes) #Extra Themes, Scales and Geoms for 'ggplot2'
library(plotly) #create interactive plots
library(maps) #for creating geographical maps
library(mapdata) #contains basic data to go along with 'maps' (topographic and geologic)
library(sp) #classes and methods for spatial data
library(gridExtra)  #plot graphs and tables on grids  
library(grid) #grid is a low-level graphics system which provides a great deal of control and flexibility in the appearance and arrangement of graphical output
library(DT) #create scrollable and sortable datatables
library(ggrepel) #Provides text and label geoms for 'ggplot2' that help to avoid overlapping text labels. Labels repel away from each other and away from the data points.
library(stringr) #Simple, Consistent Wrappers for Common String Operations

3 Data Selection and Preparation

This dataset measures the number of impaired driving incidents per 100,000 population. Incidents include impairment from alcohol, narcotics, or prescription medication, and include failure to comply with testing. The data was obtained from Statistics Canada, tables 252-0075 to 252-0081. Link to the “codebook”: https://www.opendatanetwork.com/dataset/dashboard.edmonton.ca/n5ux-m2xi

For mapping purposes, I downloaded latitude and longitude data from www.findlatitudeandlongitude.com into a csv file that was joined to the first data set.

More information from source tables is here: http://www5.statcan.gc.ca/cansim/a26?lang=eng&retrLang=eng&id=2520076&&pattern=&stByVal=1&p1=1&p2=31&tabMode=dataTable&csid=

About the Raw Data

  • “Year”: runs from 1998 to 2015; entries are coded as numeric values.
  • “Police Jurisdiction”: there are 25 police jurisdictions in this dataset, coded as characters.
  • “NA”: the data set contains blank entries. The original data source states that “a process of imputation was applied to derive counts for violations that do not exist on their own in the aggregate survey. For approximately 80% of the aggregate offence codes, there is a 1:1 mapping with a new incident-based violation code. For violations where this was not the case, such as the aggregate other Criminal Code category, it was necessary to estimate (impute) this figure using the distribution of other Criminal Code offences from existing Incident-based UCR2 respondents.”

Highlights From Data Footnotes

  • “Data from the previous year are revised to reflect any updates or changes that have been received from the police services.” The latest year's data (2015) may be revised in the future, but data prior to 2015 should be correct.
  • “With the release of 2012 data, revised population estimates at the respondent level were applied back to and including 2004.”
  • “Prior to 1999, a number of Royal Canadian Mounted Police (RCMP) detachments in Saskatchewan were double counting the number of actual offences of impaired driving. This over-counting was corrected in 1999, therefore, comparisons with previous years should be made with caution. It is recommended that the analysis of impaired driving be based on persons charged data rather than actual offences”
  • “In 2011, the introduction of the Immediate Roadside Prohibition (IRP) in British Columbia provided an alternative method for officers to proceed with penalties for impaired drivers and may account for the trends reported for 2011 and 2012.”

  • Using the RSocrata package, the data was imported directly from the previously mentioned webpage.

Create Scrollable Table to Display First 10 Rows of Raw Data

  • Data was imported as a data frame and converted to a tibble using as_tibble() from the tidyverse
  • There are 19 variables and 25 observations
  • There are 19 missing values in total, all in the years up to 2007. The number of missing values decreases each year until 2007, and from 2008 to 2015 there are none. These missing values do not affect the analysis at this time and will be kept

Scrollable table of First 10 rows of raw data

  • The year column names began with an “X”, which was removed
  • Because the years were column headings, the data needed to be converted from wide to long format (each year becomes a row) using the tidyverse (this kind of reshaping is provided by tidyr rather than dplyr).
  • Police.Jurisdiction combined the city/region and the province in one field, so it was separated into City and Province, again using the tidyverse
  • The result has 450 rows and 4 variables (Year, City, Province, IncidentRate)
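
The three preparation steps above can be sketched on a toy table. This is only an illustration in base R (the report itself used tidyverse functions); the "City, PROV" layout of Police.Jurisdiction and the X1999 values are assumptions made up for the example, while the X1998 values are taken from the table printed later in this report.

```r
# Toy wide-format data. The "City, PROV" layout and the X1999 values are
# assumptions for illustration only.
wide <- data.frame(
  Police.Jurisdiction = c("Calgary, AB", "Edmonton, AB", "Hamilton, ON"),
  X1998 = c(212.33, 340.97, 102.30),
  X1999 = c(200.00, 330.00, NA),
  stringsAsFactors = FALSE
)

# Columns that hold one year each, e.g. "X1998"
year_cols <- grep("^X[0-9]{4}$", names(wide), value = TRUE)

# Wide to long: one row per jurisdiction per year, with the leading "X" removed
long <- data.frame(
  Police.Jurisdiction = rep(wide$Police.Jurisdiction, times = length(year_cols)),
  Year = rep(sub("^X", "", year_cols), each = nrow(wide)),
  IncidentRate = unlist(wide[year_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Split the combined field into City and Province
parts <- strsplit(long$Police.Jurisdiction, ",\\s*")
long$City <- vapply(parts, `[`, character(1), 1)
long$Province <- vapply(parts, `[`, character(1), 2)

# Keep the four variables named in the report
long <- long[c("Year", "City", "Province", "IncidentRate")]
print(long)
```

With 25 jurisdictions and 18 year columns, the same reshaping gives the 25 x 18 = 450 rows reported above.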

Display head of revised data set in grid table

  • The prepared data was joined with the geocode data (imported from a website) using inner_join() from the dplyr package, which added the latitude and longitude variables. The data set was renamed ImpairedL
## Joining, by = c("City", "Province")
## # A tibble: 450 × 6
##              City Province  Year IncidentRate latitude  longitude
##             <chr>    <chr> <chr>        <dbl>    <dbl>      <dbl>
## 1         Burnaby       BC  1998       107.60 49.24881 -122.98051
## 2         Calgary       AB  1998       212.33 51.04861 -114.07085
## 3   Durham Region       ON  1998       147.72 43.93684  -78.92882
## 4        Edmonton       AB  1998       340.97 53.54439 -113.49093
## 5        Gatineau       QB  1998           NA 45.47650  -75.70130
## 6  Halifax Region       NS  1998           NA 44.64886  -63.57532
## 7   Halton Region       ON  1998       119.61 43.53254  -79.87448
## 8        Hamilton       ON  1998       102.30 43.25002  -79.86609
## 9           Laval       QB  1998       114.58 45.60660  -73.71240
## 10         London       ON  1998       196.35 42.98695  -81.24318
## # ... with 440 more rows
  • To get a quick overview of some of the data, the Average Incident Rate per Year for all of Canada was calculated
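
As a rough sketch of the join and the yearly average, the following uses base R's merge(), which performs an inner join by default, in place of dplyr's inner_join(). The 1998 rates and the coordinates are copied from the printed table; the 1999 rates and the object names are made up for illustration.

```r
# Toy long-format rates; the 1999 values are invented for illustration.
rates <- data.frame(
  City = rep(c("Calgary", "Edmonton", "Gatineau"), times = 2),
  Province = rep(c("AB", "AB", "QB"), times = 2),
  Year = rep(c("1998", "1999"), each = 3),
  IncidentRate = c(212.33, 340.97, NA, 200, 330, 150),
  stringsAsFactors = FALSE
)

# One row of coordinates per jurisdiction
geo <- data.frame(
  City = c("Calgary", "Edmonton", "Gatineau"),
  Province = c("AB", "AB", "QB"),
  latitude = c(51.04861, 53.54439, 45.47650),
  longitude = c(-114.07085, -113.49093, -75.70130),
  stringsAsFactors = FALSE
)

# Inner join on both keys; rows without a match in geo would be dropped
joined <- merge(rates, geo, by = c("City", "Province"))

# Guard against silently losing rows to misspelled keys
stopifnot(nrow(joined) == nrow(rates))

# Average incident rate per year across all jurisdictions, ignoring NAs
annual_mean <- tapply(joined$IncidentRate, joined$Year, mean, na.rm = TRUE)
print(annual_mean)
```

The row-count check is worth keeping in a join like this, because a city name spelled differently in the two files would otherwise disappear from the result without any warning.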

Scrollable and Searchable Table of Prepared Data Set with Column Filters

Click on column heading to order data

Enter a City or Province name or a year to filter.

Slice the data! Click on the IncidentRate, latitude or longitude boxes for a slider to select values.

Answer questions such as:
* What city had the highest incident rate, and when?
* What cities or provinces had the 5 highest and 5 lowest incident rates?
* What are the geocodes for the cities you select?
* What year had the lowest incident rates?
* Select a latitude/longitude. What city or cities are located there?

Variables of Concern

During the data preparation phase, I used code to determine the size, variable names and classes of the data:
* Dimensions: as_tibble() shows 450 observations x 6 columns
* Column names: as_tibble() shows City, Province, Year, IncidentRate, latitude, longitude
* Data classes: as_tibble() shows City-chr, Province-chr, Year-chr, IncidentRate-dbl, latitude-dbl, longitude-dbl
* Missing values: sum(is.na(Jimp)) and summary(Jimp) show 19 in total, all in the years 1998-2007 and none from 2008 onward
* Summary statistics: summary(Jimp$IncidentRate) (only IncidentRate is meaningful)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   35.78  100.50  136.30  151.30  185.60  441.70      19
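
The checks listed above can be reproduced along these lines. This is a minimal base R sketch on a made-up stand-in for Jimp (named jimp_toy here); only the 1998 Calgary rate and the two 1998 NAs come from the printed table, and the 2008 values are invented.

```r
# Small stand-in for Jimp; the 2008 rates are invented for illustration.
jimp_toy <- data.frame(
  City = rep(c("Calgary", "Gatineau", "Halifax Region"), times = 2),
  Year = rep(c("1998", "2008"), each = 3),
  IncidentRate = c(212.33, NA, NA, 150, 90, 180),
  stringsAsFactors = FALSE
)

dim(jimp_toy)               # rows x columns
names(jimp_toy)             # column names
sapply(jimp_toy, class)     # class of each column

# Total number of missing values in the whole table
total_missing <- sum(is.na(jimp_toy))

# Missing IncidentRate values counted per year
missing_by_year <- tapply(is.na(jimp_toy$IncidentRate), jimp_toy$Year, sum)

print(total_missing)
print(missing_by_year)
summary(jimp_toy$IncidentRate)  # includes an NA's count when values are missing
```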

The Impaired data set has been transformed and joined to create Jimp, which has 6 variables and 450 observations and is ready for exploration.

4 Exploratory Data Analysis

VIZ 1 Faceted Chart Comparing IncidentRates from 1998 to 2015 by City

Insights from VIZ 1

  • The data shows the relative differences in incident rates between regions of the country and a general pattern of reduction in incident rates from 1998 to 2015. (Less enforcement? Successful campaigns against drinking and driving? Socio-economic conditions?)
  • In the early years of the 1998 to 2015 period, most cities in the western region of Canada had much higher IncidentRates per 100,000 population than the rest of the country, but their rates dropped quickly.
  • Vancouver had a unique pattern with a bell-shaped curve. IncidentRates peaked in 2006-2009. I checked the original Statistics Canada footnotes but there was nothing there to explain this pattern.
  • Ontario and Quebec (Central) consistently had lower IncidentRates than the West or East. Rates showed a gradual decline over the years.
  • Halifax Region (starting in 2008) had a higher IncidentRate than most of the rest of the country.

VIZ # 2

Bar Plot indicating the percentage change in incident rate from selected years 2008 to 2015

  • Minimal plot theme
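
A percent change of this kind can be computed as in the sketch below. The city names and rates are hypothetical, and the change is measured relative to the 2008 rate, which is an assumption since the report does not state the base. The last lines show how the same drop looks when expressed relative to the 2015 rate instead.

```r
# Hypothetical rates per 100,000 population; not taken from the data set.
rates <- data.frame(
  City = c("A", "B", "C"),
  rate_2008 = c(150, 80, 120),
  rate_2015 = c(50, 100, 120),
  stringsAsFactors = FALSE
)

# Percent change relative to the starting (2008) rate
rates$pct_change <- 100 * (rates$rate_2015 - rates$rate_2008) / rates$rate_2008

# Order from largest decrease to largest increase, as in an ordered bar plot
rates <- rates[order(rates$pct_change), ]
print(rates)

# A rate cannot go below zero, so a decline relative to 2008 is at most 100%
stopifnot(all(rates$pct_change >= -100))

# The same drop for city A expressed relative to the 2015 rate instead
drop_vs_2015 <- 100 * (150 - 50) / 50
print(drop_vs_2015)
```

The choice of base matters when reading the plot: city A falls by about 67% relative to 2008, but the same drop is 200% of its 2015 rate.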

Insights from VIZ 2

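  • Most cities reduced their incident rates from 2008 to 2015.
  • Vancouver had the largest decline, shown as over 200% on the plot. A decline measured relative to the 2008 rate cannot exceed 100%, so this figure was presumably calculated against a different base, such as the 2015 rate.
  • Surrey, BC, close to Vancouver, did not show the same result; its rate actually increased by over 25%.
  • Most cities in Ontario showed minimal increase or decrease.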

VIZ # 3 Provincial Average Incident Rates by Year

Mouse over lines to view values

## `geom_smooth()` using method = 'loess'
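
The provincial averages behind this plot can be sketched with base R's aggregate(). The 1998 rates are taken from the printed table (Calgary and Edmonton for AB, Hamilton and London for ON); the 1999 rates are invented for illustration.

```r
# Toy city-level rates; the 1999 values are invented for illustration.
city_rates <- data.frame(
  Province = rep(c("AB", "AB", "ON", "ON"), times = 2),
  Year = rep(c(1998, 1999), each = 4),
  IncidentRate = c(212.33, 340.97, 102.30, 196.35, 200, 330, 100, NA)
)

# Mean rate per province and year. The formula interface drops rows with a
# missing IncidentRate before averaging (its default na.action is na.omit).
prov_avg <- aggregate(IncidentRate ~ Province + Year, data = city_rates, FUN = mean)
print(prov_avg)
```

Each province-year average is unweighted here, so a small jurisdiction counts as much as a large one. A population-weighted average would need population data that is not part of this data set.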

Insights from VIZ 3

  • British Columbia has a much different pattern than the rest of the provinces. The reason is unknown, but while incident rates were going down in the other provinces, BC's rate was going up from 1998 to 2008.
  • Alberta, Manitoba and Saskatchewan all exhibit the sharpest drops in IncidentRate over the time period.
  • Quebec and Ontario started with low IncidentRates that have not changed much over the time period.

VIZ # 4 Map of Canada

Filtered Dataset for Year 2015 Only

Notes:
* Because some of this data covers regions that include both urban and large urban/rural locations not defined by counties, it did not fit nicely with choropleth maps (but I did investigate them).
* I looked at several different maps of Canada and there were problems with zoom levels. Zoom level 4 showed all of North America, which made points placed on the Canada portion of the map hard to read. Zoom level 5 showed part of Canada but not all areas of interest.
* For the best mapping I decided to use bounding boxes defined by latitude and longitude to get a larger map of the areas of interest.
* Google Maps had the best-looking base maps, but a bounding box cannot be specified with Google Maps.
* I ended up using the “stamen” maps with a “toner” (black and white) base because they allowed specifying boundaries, so I could focus more tightly on highly populated areas where data from several regions overlapped on the large map of Canada.
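
The filtering and bounding-box idea in the notes above can be sketched as follows. The coordinates are copied from the printed table; the helper name padded_bbox and the 0.5 degree padding are hypothetical choices for this example. The left/bottom/right/top naming follows the convention I understand ggmap to use for bounding boxes, which should be checked against the installed ggmap version before relying on it.

```r
# Toy point data; coordinates are from the printed table, years are assumed.
pts <- data.frame(
  City = c("Durham Region", "Halton Region", "Hamilton", "London", "Calgary"),
  Year = c("2015", "2015", "2015", "2015", "1998"),
  latitude = c(43.93684, 43.53254, 43.25002, 42.98695, 51.04861),
  longitude = c(-78.92882, -79.87448, -79.86609, -81.24318, -114.07085),
  stringsAsFactors = FALSE
)

# Keep only the 2015 observations used for the map
pts_2015 <- pts[pts$Year == "2015", ]

# Hypothetical helper: bounding box around a set of points, padded in degrees
padded_bbox <- function(lon, lat, pad = 0.5) {
  c(left = min(lon) - pad, bottom = min(lat) - pad,
    right = max(lon) + pad, top = max(lat) + pad)
}

inset_bbox <- padded_bbox(pts_2015$longitude, pts_2015$latitude)
print(inset_bbox)
```

A box computed this way from only the southern Ontario points gives the kind of tight inset described above, while a box computed from all 25 jurisdictions gives the country-wide map.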

Insights From Chart and VIZ 4:

  1. The 3 cities with the highest incident rates in 2015: 1 Regina SK, 2 Surrey BC, 3 Halifax NS (Don’t drink and drive there!)
  2. The 3 cities with the lowest incident rates in 2015: 1 Toronto ON, 2 Ottawa ON, 3 London ON (Don’t drink and drive, but if you have to, go to Ontario)
  3. The central regions of Canada have the lowest incident rates
  4. Western and Eastern Canada have the highest incident rates
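
A ranking like the one in points 1 and 2 can be produced by ordering the 2015 rows. In this sketch the rates are made-up placeholders, chosen only so that the order matches the ranking reported above; they are not the actual 2015 values.

```r
# Placeholder 2015 rates (invented); only their order mirrors the report.
r2015 <- data.frame(
  City = c("Toronto", "Regina", "Calgary", "Surrey", "Ottawa",
           "Halifax Region", "London"),
  IncidentRate = c(40, 300, 120, 280, 45, 260, 50),
  stringsAsFactors = FALSE
)

# Three highest rates, largest first
highest <- head(r2015[order(r2015$IncidentRate, decreasing = TRUE), ], 3)

# Three lowest rates, smallest first
lowest <- head(r2015[order(r2015$IncidentRate), ], 3)

print(highest)
print(lowest)
```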

5 Summary Section

This project studied the incident rate of impaired driving per 100,000 population across Canada from 1998 to 2015. I wanted to find differences between regions across Canada, which areas have the highest incident rates, and which cities have increased or decreased their incident rates from the first observation to the most recent observation, and by how much.

Canadian police jurisdictions are typically on a local city level or provincial level (there is no county level in Canada) and there is much cooperation between them. The data does not distinguish between the two, but in the larger urban/rural regions the provincial force would certainly be dominant, and the trends could reflect provincial enforcement policy.

I addressed the problem by creating:
* Scrollable and searchable data tables that can be used to slice, order and select data of interest.
I also created several plots:
* A faceted line plot showing the incident rates by year for each city
* An ordered bar plot showing the percent change in the incident rate from 2008 to 2015
* Two maps to help visualize regions, locate cities, and display incident rates
* The first map covers the whole of Canada and plots the cities in the data set, using 2015 incident rates and a color gradient to show the level
* The second map is a close-up of Central Canada with the same information, needed because the high population density and number of police jurisdictions there make the data unclear on the map of Canada

Interesting Insights from the Analysis

  1. Vancouver has the largest variations in impaired driving incident rates of any city.
  2. Vancouver had the largest drop in incident rate from 2008 to 2015.
  3. Halifax NS has maintained a high incident rate. Socio-economic factors could be driving this.
  4. Western Canada seems to have had higher incident rates throughout the years. Much of this area is rural and lacks public transportation.
  5. British Columbia has a much different pattern than the rest of the provinces. The reason is unknown, but while incident rates were going down in the other provinces, BC's rate was going up from 1998 to 2008.
  6. Alberta, Manitoba and Saskatchewan all exhibit the sharpest drops in IncidentRate over the time period.
  7. Central Canada (Ontario, Quebec) has maintained low incident rates over the time period.
  8. Western Canadian cities showed the largest percentage drops in incident rate from 2008 to 2015.
  9. Reporting cities/regions in this data set are heavily concentrated in southern Ontario.
  10. The 3 cities with the highest incident rates in 2015: 1 Regina SK, 2 Surrey BC, 3 Halifax NS (Don’t drink and drive there!)
  11. The 3 cities with the lowest incident rates in 2015: 1 Toronto ON, 2 Ottawa ON, 3 London ON (Don’t drink and drive, but if you have to, go to Ontario)

Future Analysis Opportunities

  • Break out impaired driving by infraction to check whether drug-impaired driving offences are increasing, as stated in some advertising literature
  • Break out rural and urban areas to see if there is any significant difference between them
  • Apply some demographic data to see if it offers further insights
  • Look at the availability of public transportation in each area to see if it correlates with impaired driving rates