Philly Crime Analysis - Dataset 1 of 4

---
author: "David Apolinar"
title: "Philly Crime Analysis - Dataset 1 of 4"
output: 
  flexdashboard::flex_dashboard:
    theme: flatly
    orientation: rows
    vertical_layout: fill
    source_code: embed
    
---


```{r read CSV file}
library(readr)
library(dplyr)
library(httr)
library(magrittr)
library(tidyr)
library(treemapify)
library(ggplot2)
library(stringr)
library(treemap)
library(revgeo)
library(zipcode)
library(leaflet)
library(DT)

```



```{r}
### DataSet Events
incidents <- read_csv("https://azuredocuments.blob.core.windows.net/phillycrime/phillycrime2016-2019.csv")
# Data has 3 years of information. For this assignment, we selected crime for only 2018.
crime2018 <- incidents[incidents$dispatch_date >= as.Date('2018-01-01') & incidents$dispatch_date < as.Date('2019-01-01'),]
# Let's take a quick glance at the data
#head(crime2018, n=3)
crime2018 <- crime2018 %>% select(dc_dist, psa, dispatch_date, dispatch_time, location_block, text_general_code, ucr_general, lng, lat, hour_)
# Let's check the columns we are working with.
#names(crime2018)

```

```{r drop na}
# drop na from all rows
filtered2018 <- drop_na(crime2018)
# remove block from the address field
filtered2018$location_block<-filtered2018$location_block %>% str_remove_all("BLOCK")
# add a month column
filtered2018 <- filtered2018  %>% mutate(month=format(dispatch_date, "%B"))
# add day of week column
filtered2018 <- filtered2018  %>% mutate(Weekday=format(dispatch_date, "%A"))
```

```{r analysis}
# Unique category names
unique.categories <- filtered2018 %>% select(text_general_code) %>% distinct()
#datatable(unique.categories)
# Counts per category
category.counts <- filtered2018 %>% group_by(text_general_code) %>% summarise(Total=n()) %>% arrange(desc(Total))
#datatable(category.counts)
# Counts by Month
month.counts <- filtered2018 %>% group_by(month) %>% summarise(Total=n()) %>% arrange(desc(Total))
#datatable(month.counts)
# Counts by Day of Week
weekday.counts <- filtered2018 %>% group_by(Weekday) %>% summarise(Total=n()) %>% arrange(desc(Total))
#datatable(weekday.counts)
```




Philadelphia Crime Information
====================

Row {.tabset .tabset-fade}
-------------------------------------
### Overview

For this analysis, I choose to review Philly's crime data to determine which sections are the most dangerous, but also, answer the following questions:

(1) What time of the year does the most crime occur?
(2) What day of the week does most crime occur?
(3) What time of the day does most crime occur?
(4) Which are the most common crimes committed in Philly?
(5) Which sections are experience the most amount of crime?

Pulling Data from Azure Blob Storage

This data was originally downloaded from the City of Philadelphia's public data. 
As part of this data pull, I stored the data on Azure Blob storage instead of github due to github's 25MB limit. I am also filtering the data to focus on a single year of data crime (2018). The original file is roughly 300 MB in size, and therefore it was necessary to reduce the total number of records.

Data had 3 years of information. For this assignment, we selected crime for only 2018.

### Crime by Calendar Year


```{r month geo bar}
# Create a factor so that our geom_bar is ordered by month.
month.factor.order <- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
g<-ggplot(filtered2018, aes(x=ordered(filtered2018$month, levels = month.factor.order))) + geom_bar(fill ="darkblue")+ labs(y= "Total crimes commited over the year", x="Month", title="Total Philly Crimes over the calendar year") 
g
```

> The first question we want to answer is what time during the year do most crimes occur? Based on the graph above, it appears as the summer months, crime is most prevalent. There does appear to be a peak during the holiday months, but not as high as during the summer. 

### Philly Crime by time of the day
```{r}
gh <- ggplot(filtered2018, aes(x=hour_)) + geom_histogram(color="black", fill="red") +  labs(y= "Total crimes commited over the day", x="Time of the day", title="Total Philly Crimes during the day") 
gh
```

> The 2nd question we can answer here is what time of the day do most crimes occur? Based on the histogram and using a 0-24 scale for time of the day, we can see that most crime occurs during 11-12am and 5-6pm. 


### Crime by Weekday

```{r weekday bar}
weekday.factor <- c(weekday.counts$Weekday)
g<-ggplot(filtered2018, aes(x=ordered(filtered2018$Weekday, levels = weekday.factor))) + geom_bar(fill ="blue") + labs(y= "Total crimes commited", x="Weekday", title="Total Philly Crimes over weekday") 
g
```

> One of our other questions, what day of the week does most crime occur, can be answered by the graph above. Ironically, Monday is when most crime activities appear to occur with crime slowly tampering off towards the weekends.

###  Philly Crime 2018 Treemap

```{r most common categories}
treemap(category.counts, #Your data frame object
        index=c("text_general_code"),  #A list of your categorical variables
        vSize = "Total",  #This is your quantitative variable
        type="index", #Type sets the organization and color scheme of your treemap
        palette = "Reds",  #Select your color palette from the RColorBrewer presets or make your own.
        title="Philly Crime 2018", #Customize your title
        fontsize.title = 14 #Change the font size of the title
        )
```

> Here we can visually see how the crime categories compare to other relative in size.

### Conclusion

Philly's crime have some interesting and unexpected surprises. I did not expect total crime to be highest during the early morning or early evening. I did expect the summer months to have the highest levels of crime. As a potential future project, I would like to join this data with housing and economic data to determine if any correlation exists between the variables. 

Other Observations:
Having a large data sample was challenging to parse using R as certain calculations would cause R to hang. I am not certain as to why this was happening, but it is worth noting that I was able to leverage almost all of my observations during the analysis in this report sans NA values.

Data Tables
=========================

Row {.tabset .tabset-fade}
-------------------------------------
   
### Most crimes by Category
```{r dt most common category}
datatable(category.counts)
```



### Most crimes by Month
```{r dt most common month}
datatable(month.counts)
```


### Most crimes by day of the week
```{r dt most common weekly}
datatable(weekday.counts)
```



Spatial Map
=============

Row
-------------

### Map
```{r map}
leaflet(data = filtered2018) %>% addTiles() %>%
   addMarkers(~lng, ~lat, label= ~as.character(`text_general_code`), popup    = ~as.character(paste("Crime: ",text_general_code, ", Date: ", dispatch_date)),clusterOptions = markerClusterOptions())
```

> Finally, using spatial data, we can create a map to look at the cluster of crimes committed in Philly. The map is interactive and shows which areas are most affected by crime.