This project is the week 3 assignment for the Developing Data Products course in the Johns Hopkins University Data Science Specialization on Coursera. The objective is to create a web page presentation using R Markdown that features a plot created with Plotly.
The data comes from the City of Vancouver’s Open Data Catalogue (Vancouver, British Columbia, Canada). The dataset, released by the Vancouver Police Department (VPD) and updated every Sunday morning, lists reported crimes year by year beginning in 2003. The data is available at https://data.vancouver.ca/datacatalogue/crime-data.htm.
Let’s load some packages that will be used in this analysis:
library(dplyr)
library(tidyr)
library(plotly)
And let’s load the data from the City of Vancouver’s server:
url <- "ftp://webftp.vancouver.ca/opendata/csv/crime_csv_all_years.zip"
temp <- tempfile()
download.file(url, temp)
data <- read.csv(unz(temp, "crime_csv_all_years.csv"))
unlink(temp)
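One caveat when reproducing this step: since R 4.0.0, `read.csv()` no longer converts text columns to factors by default, whereas the `summary()` output shown later lists counts per crime type and per neighbourhood, which is how factor columns are summarized. A minimal sketch of the difference, using a tiny made-up CSV (the rows are illustrative and not taken from the VPD data):

```r
# Tiny inline CSV standing in for the real file (values are made up).
csv_text <- "TYPE,YEAR,NEIGHBOURHOOD
Mischief,2003,West End
Mischief,2004,Fairview
Other Theft,2004,West End"

# Default in R >= 4.0.0: text columns stay character vectors.
as_chr <- read.csv(text = csv_text)

# Explicit conversion: text columns become factors, so summary()
# reports a count per level.
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_fct$TYPE)
summary(as_fct$TYPE)
```

If the summary of the real data shows only length, class, and mode for TYPE and NEIGHBOURHOOD, passing `stringsAsFactors = TRUE` to `read.csv()` is one way to get per-level counts instead.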
Below is the list of variables that are included in this dataset:
| Attribute | Description |
|---|---|
| TYPE | Type of crime activity |
| YEAR | A four-digit field that indicates the year when the reported crime activity occurred |
| MONTH | A numeric field that indicates the month when the reported crime activity occurred |
| DAY | Day of the month when the reported crime activity occurred |
| HOUR | Hour (in 24-hour format) when the reported crime activity occurred |
| MINUTE | Minute when the reported crime activity occurred |
| HUNDRED_BLOCK | Generalized location of the reported crime activity |
| NEIGHBOURHOOD | Neighbourhoods within the City of Vancouver |
| X | Coordinate values are projected in UTM Zone 10 |
| Y | Coordinate values are projected in UTM Zone 10 |
As stated on the City of Vancouver’s website, all coordinate data in this dataset are offset, and in some cases withheld, to protect privacy. The X and Y coordinates and the generalized HUNDRED_BLOCK location are therefore dropped, and only the following attributes will be used in this analysis:
| Attribute | Description |
|---|---|
| TYPE | Type of crime activity |
| YEAR | A four-digit field that indicates the year when the reported crime activity occurred |
| MONTH | A numeric field that indicates the month when the reported crime activity occurred |
| DAY | Day of the month when the reported crime activity occurred |
| HOUR | Hour (in 24-hour format) when the reported crime activity occurred |
| MINUTE | Minute when the reported crime activity occurred |
| NEIGHBOURHOOD | Neighbourhoods within the City of Vancouver |
Removing the irrelevant variables:
data <- select(data, -X, -Y, -HUNDRED_BLOCK)
summary(data)
## TYPE YEAR
## Theft from Vehicle :191704 Min. :2003
## Mischief : 77960 1st Qu.:2006
## Break and Enter Residential/Other: 64022 Median :2009
## Other Theft : 58982 Mean :2010
## Offence Against a Person : 58301 3rd Qu.:2014
## Theft of Vehicle : 40156 Max. :2018
## (Other) : 89796
## MONTH DAY HOUR MINUTE
## Min. : 1.000 Min. : 1.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 9.00 1st Qu.: 0.00
## Median : 7.000 Median :15.00 Median :15.00 Median :10.00
## Mean : 6.502 Mean :15.41 Mean :13.72 Mean :17.03
## 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:19.00 3rd Qu.:30.00
## Max. :12.000 Max. :31.00 Max. :23.00 Max. :59.00
## NA's :58540 NA's :58540
## NEIGHBOURHOOD
## Central Business District:124888
## : 60924
## West End : 45219
## Fairview : 34631
## Mount Pleasant : 33747
## Grandview-Woodland : 29686
## (Other) :251826
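The summary shows two kinds of gaps: `NA` values in HOUR and MINUTE, and records whose NEIGHBOURHOOD is an empty string. A small sketch of how both could be counted, run here on a made-up data frame rather than the real data (`toy` and its values are assumptions for illustration only):

```r
# Made-up records mimicking the two kinds of gaps seen in the summary.
toy <- data.frame(
  HOUR = c(9, NA, 15, NA),
  MINUTE = c(0, NA, 30, NA),
  NEIGHBOURHOOD = c("West End", "", "Fairview", ""),
  stringsAsFactors = FALSE
)

# Count NA values in each column.
na_counts <- colSums(is.na(toy))

# Count records whose neighbourhood is blank rather than NA.
blank_neighbourhood <- sum(toy$NEIGHBOURHOOD == "")

na_counts
blank_neighbourhood
```

The same two expressions should apply to `data` unchanged; the comparison with `""` works whether NEIGHBOURHOOD is a character or a factor column.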
For this assignment, the total number of reported crimes per year (regardless of type) is plotted with Plotly for two popular neighbourhoods in Downtown Vancouver, leaving out the current, still incomplete year:
dist <- "Central Business District"
# Count reported crimes per year, leaving out the current (incomplete) year
dataPlot <- data %>%
  filter(NEIGHBOURHOOD == dist, YEAR != as.integer(format(Sys.Date(), "%Y"))) %>%
  group_by(YEAR) %>%
  summarize(totalCrime = n())
# Line chart of the yearly totals, with YEAR treated as a discrete axis
plot_ly(data = dataPlot, x = ~as.factor(YEAR), y = ~totalCrime, type = "scatter", mode = "lines")
dist <- "West End"
# Same steps as above, for the second neighbourhood
dataPlot <- data %>%
  filter(NEIGHBOURHOOD == dist, YEAR != as.integer(format(Sys.Date(), "%Y"))) %>%
  group_by(YEAR) %>%
  summarize(totalCrime = n())
# Line chart of the yearly totals, with YEAR treated as a discrete axis
plot_ly(data = dataPlot, x = ~as.factor(YEAR), y = ~totalCrime, type = "scatter", mode = "lines")
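Because the two blocks above differ only in the neighbourhood name, both series could also be drawn on one chart for easier comparison. The sketch below is one possible variant, not part of the original analysis. It runs on a made-up data frame (`toy`), and the names `districts`, `toyPlot`, and `p` are introduced here for illustration:

```r
library(dplyr)
library(plotly)

# Made-up records standing in for the crime data (values are illustrative).
toy <- data.frame(
  YEAR = c(2003, 2003, 2004, 2003, 2004, 2004),
  NEIGHBOURHOOD = c("West End", "West End", "West End",
                    "Central Business District",
                    "Central Business District",
                    "Central Business District")
)

districts <- c("Central Business District", "West End")

# One row per neighbourhood and year, holding the number of records.
toyPlot <- toy %>%
  filter(NEIGHBOURHOOD %in% districts) %>%
  count(NEIGHBOURHOOD, YEAR, name = "totalCrime")

# Mapping color to NEIGHBOURHOOD draws one line per neighbourhood.
p <- plot_ly(data = toyPlot, x = ~factor(YEAR), y = ~totalCrime,
             color = ~NEIGHBOURHOOD, type = "scatter", mode = "lines")
p
```

With the real data, replacing `toy` with `data` and adding the current-year filter from the blocks above should give the combined version of the two charts.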
More analysis will be performed on this dataset for the course’s final project.
Until then!