Description of Data Set

This dataset measures the number of impaired driving incidents per 100,000 population. Incidents include impairment from alcohol, narcotics, or prescription medication, and include failure to comply with testing. Data is obtained from Statistics Canada, tables 252-0075 to 252-0081
link to “codebook”: https://www.opendatanetwork.com/dataset/dashboard.edmonton.ca/n5ux-m2xi
also more information from source tables is here: http://www5.statcan.gc.ca/cansim/a26?lang=eng&retrLang=eng&id=2520076&&pattern=&stByVal=1&p1=1&p2=31&tabMode=dataTable&csid=

There are 19 variables and 25 observations
Data is in wide format with 18 Years (1998-2015) and Police Jurisdiction as column names.

“Year”" from 1998 to 2015 and entries are coded as numeric values
“Police Jurisdiction” There are 25 Police jurisdictions in this dataset and are coded as characters
“NA” While the data set contains blank entries, the original data source states that “a process of imputation was applied to derive counts for violations that do not exist on their own in the aggregate survey. For approximately 80% of the aggregate offence codes, there is a 1:1 mapping with a new incident-based violation code. For violations where this was not the case, such as the aggregate other Criminal Code category, it was necessary to estimate (impute) this figure using the distribution of other Criminal Code offences from existing Incident-based UCR2 respondents.”

Highlights from data footnotes

“Data from the previous year are revised to reflect any updates or changes that have been received from the police services.” The latest years data (2015) may be revised in future, but data prior to 2015 should be correct.

“With the release of 2012 data, revised population estimates at the respondent level were applied back to and including 2004.”

“Prior to 1999, a number of Royal Canadian Mounted Police (RCMP) detachments in Saskatchewan were double counting the number of actual offences of impaired driving. This over-counting was corrected in 1999, therefore, comparisons with previous years should be made with caution. It is recommended that the analysis of impaired driving be based on persons charged data rather than actual offences”

“In 2011, the introduction of the Immediate Roadside Prohibition (IRP) in British Columbia provided an alternative method for officers to proceed with penalties for impaired drivers and may account for the trends reported for 2011 and 2012.”

Import Data

library(jsonlite)  #load package to import json file
library(tidyverse) #load packages to tidy data and create graphics
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
Impaired <- fromJSON("https://dashboard.edmonton.ca/resource/kcvf-z5h9.json")
Impaired <- as_tibble(Impaired)
Impaired
## # A tibble: 25 × 19
##    `_1998`            `_1999` `_2000`            `_2001` `_2002`
## *    <chr>              <chr>   <chr>              <chr>   <chr>
## 1    107.6               92.5  104.35             174.68  164.62
## 2   212.33             206.56   211.3             214.16  217.33
## 3   147.72             175.07  187.27              176.1  181.23
## 4   340.97 311.08999999999997  300.93             250.99  234.28
## 5     <NA>               <NA>    <NA>               <NA>  184.27
## 6     <NA>               <NA>    <NA>               <NA>    <NA>
## 7   119.61             100.64  114.53             106.86   97.11
## 8    102.3             100.58  104.61              95.87   86.87
## 9   114.58 154.58000000000001  146.57 160.33000000000001  139.32
## 10  196.35             186.05  179.34 152.58000000000001  154.84
## # ... with 15 more rows, and 14 more variables: `_2003` <chr>,
## #   `_2004` <chr>, `_2005` <chr>, `_2006` <chr>, `_2007` <chr>,
## #   `_2008` <chr>, `_2009` <chr>, `_2010` <chr>, `_2011` <chr>,
## #   `_2012` <chr>, `_2013` <chr>, `_2014` <chr>, `_2015` <chr>,
## #   police_jurisdiction <chr>
str(Impaired)
## Classes 'tbl_df', 'tbl' and 'data.frame':    25 obs. of  19 variables:
##  $ _1998              : chr  "107.6" "212.33" "147.72" "340.97" ...
##  $ _1999              : chr  "92.5" "206.56" "175.07" "311.08999999999997" ...
##  $ _2000              : chr  "104.35" "211.3" "187.27" "300.93" ...
##  $ _2001              : chr  "174.68" "214.16" "176.1" "250.99" ...
##  $ _2002              : chr  "164.62" "217.33" "181.23" "234.28" ...
##  $ _2003              : chr  "179.13" "197.2" "181.7" "220.56" ...
##  $ _2004              : chr  "155.18" "198.24" "162.25" "212.89" ...
##  $ _2005              : chr  "172.06" "185.76" "139.43" "199.11" ...
##  $ _2006              : chr  "159.47999999999999" "166.2" "144.80000000000001" "200.66" ...
##  $ _2007              : chr  "125.78" "159.07" "139.76" "239.35" ...
##  $ _2008              : chr  "132.85" "155.46" "134.58000000000001" "255.26" ...
##  $ _2009              : chr  "182.28" "154.54" "138.03" "253.1" ...
##  $ _2010              : chr  "211.79" "152.44" "147" "257.83" ...
##  $ _2011              : chr  "258.07" "155.84" "137.21" "251.9" ...
##  $ _2012              : chr  "181.23" "149.32" "129.38" "248.46" ...
##  $ _2013              : chr  "200.54" "121.57" "113.79" "209.32" ...
##  $ _2014              : chr  "221.22" "105.58" "121.4" "196.87" ...
##  $ _2015              : chr  "174.91" "86.7" "116.61" "171.71" ...
##  $ police_jurisdiction: chr  "Burnaby, BC" "Calgary, AB" "Durham Region, ON" "Edmonton, AB" ...

There are 19 variables and 25 observations

The jsonlite package seems to have imported all the observations as characters despite the datasource information saying these were numeric values. The observations for years 1998-2015 will need to be converted to numerical values

Some observations have many decimal points and some only 1. All will be converted to 2 decimal places

Police Jurisdiction includes both city and province. These will be split into separate fields for regional analysis

Data Cleaning

Change Column Title to remove underscore

names(Impaired) <- gsub("_", "", names(Impaired), fixed = TRUE)
names(Impaired)
##  [1] "1998"               "1999"               "2000"              
##  [4] "2001"               "2002"               "2003"              
##  [7] "2004"               "2005"               "2006"              
## [10] "2007"               "2008"               "2009"              
## [13] "2010"               "2011"               "2012"              
## [16] "2013"               "2014"               "2015"              
## [19] "policejurisdiction"

Convert chr to numeric values

cols <- c("1998", "1999", "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008",
          "2009", "2010", "2011", "2012", "2013", "2014", "2015")
Impaired[cols] <- sapply(Impaired[cols],as.numeric)
sapply(Impaired, class)
##               1998               1999               2000 
##          "numeric"          "numeric"          "numeric" 
##               2001               2002               2003 
##          "numeric"          "numeric"          "numeric" 
##               2004               2005               2006 
##          "numeric"          "numeric"          "numeric" 
##               2007               2008               2009 
##          "numeric"          "numeric"          "numeric" 
##               2010               2011               2012 
##          "numeric"          "numeric"          "numeric" 
##               2013               2014               2015 
##          "numeric"          "numeric"          "numeric" 
## policejurisdiction 
##        "character"

Missing Values

19 missing values until 2007. Missing values reduce each year until 2007. From 2008-2015 there are no missing values
Will determine what to do with missing values upon further analysis

## [1] 19
##       1998             1999             2000             2001       
##  Min.   : 52.45   Min.   : 46.81   Min.   : 35.78   Min.   : 76.81  
##  1st Qu.:107.60   1st Qu.:106.36   1st Qu.:113.17   1st Qu.:125.75  
##  Median :147.72   Median :167.01   Median :165.16   Median :157.74  
##  Mean   :173.97   Mean   :171.80   Mean   :174.02   Mean   :169.47  
##  3rd Qu.:196.35   3rd Qu.:185.71   3rd Qu.:193.53   3rd Qu.:191.94  
##  Max.   :402.44   Max.   :409.65   Max.   :441.67   Max.   :376.50  
##  NA's   :4        NA's   :3        NA's   :3        NA's   :3       
##       2002             2003             2004             2005       
##  Min.   : 83.79   Min.   : 64.75   Min.   : 72.86   Min.   : 60.19  
##  1st Qu.:127.83   1st Qu.:117.51   1st Qu.:117.51   1st Qu.:111.70  
##  Median :154.06   Median :141.38   Median :145.57   Median :136.00  
##  Mean   :162.64   Mean   :150.05   Mean   :155.17   Mean   :142.83  
##  3rd Qu.:188.19   3rd Qu.:179.77   3rd Qu.:177.44   3rd Qu.:164.01  
##  Max.   :293.45   Max.   :302.69   Max.   :391.66   Max.   :368.01  
##  NA's   :1        NA's   :1        NA's   :1        NA's   :1       
##       2006             2007             2008             2009       
##  Min.   : 61.61   Min.   : 64.89   Min.   : 77.83   Min.   : 70.04  
##  1st Qu.:100.52   1st Qu.:103.22   1st Qu.: 94.56   1st Qu.:102.11  
##  Median :127.61   Median :125.57   Median :128.39   Median :134.58  
##  Mean   :138.32   Mean   :141.73   Mean   :147.78   Mean   :153.99  
##  3rd Qu.:167.51   3rd Qu.:165.45   3rd Qu.:170.24   3rd Qu.:198.88  
##  Max.   :317.99   Max.   :347.84   Max.   :365.25   Max.   :377.59  
##  NA's   :1        NA's   :1                                         
##       2010             2011             2012             2013       
##  Min.   : 72.01   Min.   : 74.06   Min.   : 65.66   Min.   : 58.58  
##  1st Qu.: 92.98   1st Qu.: 95.03   1st Qu.: 84.01   1st Qu.: 85.75  
##  Median :136.93   Median :125.56   Median :129.38   Median :121.57  
##  Mean   :153.00   Mean   :154.49   Mean   :142.06   Mean   :139.56  
##  3rd Qu.:193.89   3rd Qu.:197.85   3rd Qu.:181.23   3rd Qu.:172.57  
##  Max.   :322.16   Max.   :309.52   Max.   :291.51   Max.   :349.13  
##                                                                     
##       2014             2015        policejurisdiction
##  Min.   : 48.17   Min.   : 47.97   Length:25         
##  1st Qu.: 88.96   1st Qu.: 86.70   Class :character  
##  Median :111.10   Median :106.44   Mode  :character  
##  Mean   :134.10   Mean   :128.90                     
##  3rd Qu.:161.60   3rd Qu.:168.95                     
##  Max.   :299.37   Max.   :294.15                     
## 

Convert to long format. Years should be rows not columns.

Separate policejurisdiction into City and Province

ImpairedL <-Impaired %>% 
  gather(year, IncidentRate, -policejurisdiction) %>%
  separate(policejurisdiction, into = c("City", "Province"), sep = ", ")
head(ImpairedL)
## # A tibble: 6 × 4
##             City Province  year IncidentRate
##            <chr>    <chr> <chr>        <dbl>
## 1        Burnaby       BC  1998       107.60
## 2        Calgary       AB  1998       212.33
## 3  Durham Region       ON  1998       147.72
## 4       Edmonton       AB  1998       340.97
## 5       Gatineau       QB  1998           NA
## 6 Halifax Region       NS  1998           NA
tail(ImpairedL)
## # A tibble: 6 × 4
##              City Province  year IncidentRate
##             <chr>    <chr> <chr>        <dbl>
## 1         Toronto       ON  2015        47.97
## 2       Vancouver       BC  2015       119.11
## 3 Waterloo Region       ON  2015        82.95
## 4         Windsor       ON  2015       104.41
## 5        Winnipeg       MB  2015        80.18
## 6     York Region       ON  2015        94.23
dim(ImpairedL)
## [1] 450   4

Impaired has been transformed and ImpairedL now has 4 variables and 450 Observations

Planned Analysis

Trends over time by city/region> Are number of infractions increasing/decreasing (Visualization using line graphs facet wrap for each jurisdiction)

Year over year % increase/decrease in infractions by city and by region.

Calculate city/regional average over time frame Map with intensity of color (?). What cities regions have highest incidents of impaired driving? Any common factors?