How safe is Cincinnati

For my data wrangling project, I am analysing crime patterns in the city of Cincinnati. While the city has come a long way since the 2001 riots, I wanted to check whether the trend of falling crime rates has been sustained over the years, and to answer the question of whether Cincinnati neighborhoods are really getting safer.

I am using data from the last 5 years to understand how crime has changed by time and location, and hopefully to show a drop in reported crimes over the entire city. The conclusion of my analysis should answer the following specific questions.

This data can be leveraged by city officials and the police department to roll out targeted interventions in crime-prone neighborhoods. The analysis would also help people take precautions while travelling through crime-prone areas.

Dataset

For my analysis, I am using the dataset provided by the Police Data Initiative, of which the Cincinnati Police Department is a member.

Packages required

During the project, I use the following packages.

library(dplyr)      ## For Data manipulation
library(lubridate)  ## Used for date manipulation
library(readr)      ## To read the csv input file
library(tidyr)      ## For Data Cleaning
library(DT)         ## To render R data objects as tables in HTML page.
library(leaflet)    ## For interactive maps
library(ggplot2)    ## For visualization

Data Preparation

The following section explains the steps required to prepare the data for analysis.

Importing Data

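We use the base R read.csv function to import the dataset into R. The factor columns in the structure output further down are consistent with read.csv as it behaved before R 4.0, where strings were converted to factors by default.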

# Read the file and view data
dataset <- read.csv("https://www.dropbox.com/s/cpya6okvbopi81y/city_of_cincinnati_police_data_initiative_crime_incidents.csv?dl=1")
head(dataset)

While head() prints the first rows to the console, the View() function displays the data in a separate window. This is useful when the dataset has a large number of columns.

Data Cleaning

Since our analysis relies on time, we focus on the DATE_FROM column and convert it from text to a date-time type. I also create two separate fields: one holding the full date and time, and one holding just the time of day (in hours) for any later analysis by time.

#Establish time of occurence
event_datetime_occurence <- substr(dataset$DATE_FROM, 1, 22)
event_datetime_occurence <- (strptime(event_datetime_occurence, '%m/%d/%Y %I:%M:%S %p'))

event_time_occurence <- substr(dataset$DATE_FROM, 12, 22)
event_time_occurence <- as.difftime(event_time_occurence, '%I:%M:%S %p', units = "hours")

#round to one decimal place of an hour for easier analysis
event_time_occurence <- round(event_time_occurence, 1)

#Bind to Dataset
dataset <- cbind(dataset, event_datetime_occurence, event_time_occurence)

#Convert to a table
#as_tibble() replaces the deprecated tbl_df()
dataset <- as_tibble(dataset)
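As a sanity check on the parsing step, the sketch below runs the same format string on a single hand-written timestamp. The sample value and the variable names are illustrative assumptions, not rows from the dataset, and parsing the AM/PM marker with %p assumes an English-like locale.

```r
# Hypothetical timestamp in the same layout as DATE_FROM
sample_stamp <- "03/16/2015 03:02:00 PM"

# Parse the 12-hour clock value; tz is fixed so the result is reproducible
parsed <- strptime(sample_stamp, "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Hour of day as a decimal number, rounded to one decimal place
hour_of_day <- round(parsed$hour + parsed$min / 60, 1)

print(format(parsed, "%Y-%m-%d %H:%M:%S"))
print(hour_of_day)
```

Checking one value by hand like this makes it easier to spot a mismatched format string, which would otherwise silently produce NA for every row.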

As the analysis is restricted to the last 5 years, we use filter to select only the required data.

#Filter for years 2014 - 2018
dataset_filtered <- filter(dataset, event_datetime_occurence >= "2014-1-1" & event_datetime_occurence < "2019-1-1")
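The filter above compares a date-time column with plain strings and relies on R to coerce them. A minimal sketch of the same half-open range with explicit date-time bounds is shown below, using base R and a made-up data frame; the column names and timestamps are assumptions for illustration only.

```r
# Hypothetical incident timestamps around the range boundaries
events <- data.frame(
  id = 1:4,
  when = as.POSIXct(c("2013-12-31 23:59:00", "2014-01-01 00:00:00",
                      "2018-12-31 23:59:00", "2019-01-01 00:00:00"),
                    tz = "UTC")
)

# Explicit bounds: start is included, end is excluded
start <- as.POSIXct("2014-01-01", tz = "UTC")
end <- as.POSIXct("2019-01-01", tz = "UTC")

in_range <- events[events$when >= start & events$when < end, ]
print(in_range$id)
```

Using a closed lower bound and an open upper bound keeps every moment of 2018 in the range without having to write out the last second of the year.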

The original dataset has over 40 columns. We need only a subset of those for our analysis.

# select only the required columns
dataset_final <- select(dataset_filtered, INSTANCEID,
                        event_datetime_occurence,
                        event_time_occurence,
                        OFFENSE,
                        DAYOFWEEK,
                        CPD_NEIGHBORHOOD,
                        WEAPONS,
                        LONGITUDE_X,LATITUDE_X,VICTIM_AGE,ZIP)

I rename a couple of columns to make them more readable.

colnames(dataset_final)[colnames(dataset_final)=="LATITUDE_X"] <- "LATITUDE"
colnames(dataset_final)[colnames(dataset_final)=="LONGITUDE_X"] <- "LONGITUDE"

Next we check for and remove rows with NA or empty values.

dataset_final <- na.omit(dataset_final)

dataset_final <- with(dataset_final, dataset_final[!(WEAPONS == "" | is.na(WEAPONS)), ])
dataset_final <- with(dataset_final, dataset_final[!(OFFENSE == "" | is.na(OFFENSE)), ])
dataset_final <- with(dataset_final, dataset_final[!(VICTIM_AGE == "" | is.na(VICTIM_AGE)), ])
dataset_final <- with(dataset_final, dataset_final[!(INSTANCEID == "" | is.na(INSTANCEID)), ])
dataset_final <- with(dataset_final, dataset_final[!(DAYOFWEEK == "" | is.na(DAYOFWEEK)), ])
dataset_final <- with(dataset_final, dataset_final[!(CPD_NEIGHBORHOOD == "" | is.na(CPD_NEIGHBORHOOD)), ])
dataset_final <- with(dataset_final, dataset_final[!(ZIP == "" | is.na(ZIP)), ])

Finally, we check the dimensions of the dataset.

dim(dataset_final)
## [1] 161616     11
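The per-column filters used in the cleaning step all follow one pattern: drop a row when the column is NA or an empty string. A minimal sketch of that pattern as a reusable helper is shown below. The function name drop_blank_rows, the toy data frame and the weapon labels in it are hypothetical and are not part of the original analysis.

```r
# Drop rows where any of the listed columns is NA or an empty string
drop_blank_rows <- function(df, cols) {
  keep <- rep(TRUE, nrow(df))
  for (col in cols) {
    values <- as.character(df[[col]])
    keep <- keep & !is.na(values) & values != ""
  }
  df[keep, , drop = FALSE]
}

# Hypothetical toy data with one empty string and two NA values
toy <- data.frame(
  WEAPONS = c("99--NONE", "", NA, "12--HANDGUN"),
  ZIP = c(45206, 45214, 45207, NA),
  stringsAsFactors = FALSE
)

cleaned <- drop_blank_rows(toy, c("WEAPONS", "ZIP"))
print(nrow(cleaned))
```

Passing the column names as a vector keeps the list of checked columns in one place, so adding or removing a column is a one-word change.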

Data dictionary

Now we take a look at the structure of the final dataset.

#looking at structure of final dataset
str(dataset_final)
## Classes 'tbl_df', 'tbl' and 'data.frame':    161616 obs. of  11 variables:
##  $ INSTANCEID              : Factor w/ 280111 levels "00002938-D5AD-4F41-890B-1D10A6611A6F",..: 235246 202681 64816 277019 71716 175854 204673 8016 112024 2230 ...
##  $ event_datetime_occurence: POSIXct, format: "2015-03-16 15:02:00" "2018-03-13 16:45:00" ...
##  $ event_time_occurence    : 'difftime' num  15 16.8 0 1.2 ...
##   ..- attr(*, "units")= chr "hours"
##  $ OFFENSE                 : Factor w/ 202 levels "","ABDUCTION",..: 15 24 69 24 175 31 24 173 175 175 ...
##  $ DAYOFWEEK               : Factor w/ 8 levels "","FRIDAY","MONDAY",..: 3 7 2 6 5 7 5 5 6 5 ...
##  $ CPD_NEIGHBORHOOD        : Factor w/ 54 levels "","AVONDALE",..: 50 51 18 38 29 11 4 39 22 38 ...
##  $ WEAPONS                 : Factor w/ 70 levels "","11--FIREARM (TYPE NOT STATED)",..: 51 36 69 36 51 51 36 51 51 51 ...
##  $ LONGITUDE               : num  -84.5 -84.5 -84.5 -84.5 -84.5 ...
##  $ LATITUDE                : num  39.1 39.1 39.1 39.1 39.1 ...
##  $ VICTIM_AGE              : Factor w/ 13 levels "","00","18-25",..: 4 6 12 6 3 6 6 12 8 7 ...
##  $ ZIP                     : num  45206 45214 45207 45202 45219 ...
##  - attr(*, "na.action")= 'omit' Named int  89 93 178 248 294 340 384 396 405 435 ...
##   ..- attr(*, "names")= chr  "89" "93" "178" "248" ...

Most of the columns are factors; the exceptions are the date and time columns and the numeric longitude, latitude and ZIP columns. Each column is explained below.

  • Instance ID - Uniquely identifies a particular crime in the dataset.
  • event_datetime_occurence - The date and time the crime occurred, derived from DATE_FROM.
  • event_time_occurence - The time of day the crime occurred, in hours.
  • Offense - The type of offense that occurred.
  • Dayofweek - Day of the week the crime occurred.
  • CPD_Neighborhood - Identifies the neighborhood in which the crime occurred.
  • Weapons - Whether any weapons were used and, if so, which ones.
  • Latitude - Latitude of the crime location.
  • Longitude - Longitude of the crime location.
  • Victim Age - Age group of the victim affected by the crime.
  • ZIP - ZIP code of the crime location.

The first 6 rows of the final dataset are given below.

#looking at the first rows of final dataset
head(dataset_final)
## # A tibble: 6 x 11
##   INSTANCEID event_datetime_occ~ event_time_occu~ OFFENSE DAYOFWEEK
##   <fct>      <dttm>              <time>           <fct>   <fct>    
## 1 D6D7D173-~ 2015-03-16 15:02:00 15.0 hours       AGGRAV~ MONDAY   
## 2 B908DB49-~ 2018-03-13 16:45:00 16.8 hours       ASSAULT TUESDAY  
## 3 3B4450F2-~ 2015-07-24 00:00:00  0.0 hours       ENDANG~ FRIDAY   
## 4 FD5706CD-~ 2015-09-03 01:10:00  1.2 hours       ASSAULT THURSDAY 
## 5 417FB2EC-~ 2014-08-31 18:00:00 18.0 hours       THEFT   SUNDAY   
## 6 A0705DF3-~ 2015-01-20 11:00:00 11.0 hours       BURGLA~ TUESDAY  
## # ... with 6 more variables: CPD_NEIGHBORHOOD <fct>, WEAPONS <fct>,
## #   LONGITUDE <dbl>, LATITUDE <dbl>, VICTIM_AGE <fct>, ZIP <dbl>

Proposed Exploratory Data Analysis

We use visual analysis, plotting crimes over time for each neighborhood. This should help us confirm or reject our assumption, and also help us identify the crime-prone neighborhoods. We can also use MapR function to plot the same. We should also be able to check the trend of the most common crimes across neighborhoods, and to see whether any pattern appears when plotting crimes against day of the week and time of day.

Exploratory Data Analysis

Before we get into deeper analysis, note that the columns for victim age, weapons and offenses have too many factor levels. Let’s collapse them by combining several levels into one, which makes it easier to draw conclusions.

Let’s fold the juvenile categories into a single UNDER 18 group for victim age. Let’s also remove records that don’t have a known victim age.

# clean data for Victim age
dataset_final$VICTIM_AGE <- gsub(".*JUVENILE.*", "UNDER 18", (dataset_final$VICTIM_AGE), perl = FALSE)

#remove the entries where victim_age is 'unknown'
dataset_final[dataset_final$VICTIM_AGE == "UNKNOWN" ,"VICTIM_AGE"] <- "NA"
dataset_final <- dataset_final[!(dataset_final$VICTIM_AGE == "NA"),]

I am also combining levels in the weapons column, and removing the numeric code and special characters attached to each weapon label.

# Combine weapons of similar kind under one umbrella,remove the number attached
dataset_final <- dataset_final %>%
  mutate(WEAPONS = gsub(".*11.*", "FIREARM", WEAPONS)) %>% 
  mutate(WEAPONS = gsub(".*12.*", "HANDGUN",  WEAPONS)) %>%
  mutate(WEAPONS = gsub(".*13.*", "RIFLE",  WEAPONS)) %>%
  mutate(WEAPONS = gsub(".*14.*", "SHOTGUN",  WEAPONS)) %>%
  mutate(WEAPONS = gsub(".*15.*", "FIREARM",  WEAPONS)) %>%
  mutate(WEAPONS = gsub(".*16.*", "FIREARM",  WEAPONS)) %>%
  mutate(WEAPONS = gsub(".*17.*", "FIREARM",  WEAPONS)) %>%
  mutate(WEAPONS = gsub(".*18.*", "BB AND PELLET GUNS",  WEAPONS)) %>%
  mutate(WEAPONS = gsub(".*20.*", "KNIFE/CUTTING INSTRUMENT",  WEAPONS)) %>%
  mutate(WEAPONS = gsub(".*30.*", "BLUNT OBJECT",  WEAPONS)) %>%
  mutate(WEAPONS = gsub(".*35.*", "MOTOR VEHICLE",  WEAPONS)) %>%
  mutate(WEAPONS = gsub(".*40.*", "PERSONAL WEAPON",  WEAPONS)) %>%
  mutate(WEAPONS = gsub(".*60.*", "EXPLOSIVES",  WEAPONS)) %>%
  mutate(WEAPONS = gsub(".*70.*", "DRUGS",  WEAPONS)) %>%
  mutate(WEAPONS = gsub(".*80.*", "OTHER WEAPONS",  WEAPONS)) %>%
  mutate(WEAPONS = gsub(".*U.*", "UNKNOWN",  WEAPONS)) %>%
  mutate(WEAPONS = gsub(".*65.*", "FIRE/INCENDIARY DEVICE",  WEAPONS)) %>%
  mutate(WEAPONS = gsub(".*50.*", "POISON",  WEAPONS)) %>%
  mutate(WEAPONS = gsub(".*99.*", "NONE",  WEAPONS)) %>%
  mutate(WEAPONS = gsub(".*85.*", "ASPHYXIATION",  WEAPONS))
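Each gsub call above matches its pattern anywhere in the label, so the order of the calls matters and a short pattern can also hit a label that an earlier call has already rewritten. One alternative, sketched below under stated assumptions, is to split off the code that precedes "--" and look it up in a named vector. Only the "11--FIREARM (TYPE NOT STATED)" label is taken from the structure output earlier; the other raw labels, and the names raw_weapons and weapon_lookup, are hypothetical.

```r
# Raw labels in the "code--description" layout (all but the first are assumed)
raw_weapons <- c("11--FIREARM (TYPE NOT STATED)", "12--HANDGUN",
                 "U--UNKNOWN", "99--NONE", "SOMETHING ELSE")

# Lookup from code to collapsed category (illustrative subset only)
weapon_lookup <- c("11" = "FIREARM", "12" = "HANDGUN",
                   "U" = "UNKNOWN", "99" = "NONE")

# Keep only the code that precedes the first "--"
codes <- sub("--.*$", "", raw_weapons)

# Translate each code; labels without a known code are left unchanged
recoded <- unname(weapon_lookup[codes])
recoded[is.na(recoded)] <- raw_weapons[is.na(recoded)]

print(recoded)
```

Because each label is translated exactly once from its own code, the result does not depend on the order of the lookup entries.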

Let’s also combine similar types of crime into a single category to reduce the number of levels in the offense column.

#merge values for offenses
dataset_final <- dataset_final %>%
  mutate(OFFENSE = gsub(".*ASSAULT.*", "ASSAULT", OFFENSE)) %>%
  mutate(OFFENSE = gsub(".*BURGLARY.*", "BURGLARY", OFFENSE))  %>%
  mutate(OFFENSE = gsub(".*RAPE.*", "RAPE", OFFENSE))  %>%
  mutate(OFFENSE = gsub(".*ROBBERY.*", "ROBBERY", OFFENSE))  %>%
  mutate(OFFENSE = gsub(".*MURDER.*", "MURDER", OFFENSE))  %>%
  mutate(OFFENSE = gsub(".*ABDUCTION.*", "ABDUCTION", OFFENSE))  %>%
  mutate(OFFENSE = gsub(".*MENACING.*", "MENACING", OFFENSE))  %>%
  mutate(OFFENSE = gsub(".*FORGERY.*", "FORGERY", OFFENSE))  %>%
  mutate(OFFENSE = gsub(".*KIDNAPPING.*", "KIDNAPPING", OFFENSE))  %>%
  mutate(OFFENSE = gsub(".*ARSON.*", "ARSON", OFFENSE))  %>%
  mutate(OFFENSE = gsub(".*SEX.*", "SEX", OFFENSE))  %>% 
  mutate(OFFENSE = gsub(".*INTIMID.*", "INTIMIDATION", OFFENSE))  %>%
  mutate(OFFENSE = gsub(".*HARRASS.*", "HARRASS", OFFENSE))  %>%
  mutate(OFFENSE = gsub(".*VANDALISM.*", "VANDALISM", OFFENSE))  %>%
  mutate(OFFENSE = gsub(".*THEFT.*", "THEFT", OFFENSE))  %>%
  mutate(OFFENSE = gsub(".*CRIMINAL.*", "CRIMINAL", OFFENSE))  %>%
  mutate(OFFENSE = gsub(".*DISORDERLY CONDUCT.*", "DISORDERLY CONDUCT", OFFENSE))  %>%
  mutate(OFFENSE = gsub(".*ENDANGERING CHILDREN.*", "ENDANGERING CHILDREN", OFFENSE))  %>%
  mutate(OFFENSE = gsub(".*VIOL.*", "VIOLATE PROTECTION ORDER", OFFENSE))  %>%
  mutate(OFFENSE = gsub(".*CREDIT CARD.*", "CREDIT CARD FRAUD", OFFENSE))  %>%
  mutate(OFFENSE = gsub(".*TELEPHONE HARRASSMENT.*", "TELEPHONE HARRASSMENT", OFFENSE))  %>%
  mutate(OFFENSE = gsub(".*PATIENT ABUSE.*", "PATIENT ABUSE", OFFENSE))  %>%
  mutate(OFFENSE = gsub(".*UNAUTHORISED USE.*", "UNAUTHORISED USE", OFFENSE))

We can see the effect of our cleanups. We now have 8 different categories for victim age group.

#print categories of age group
unique(dataset_final$VICTIM_AGE)
## [1] "26-30"    "41-50"    "UNDER 18" "18-25"    "61-70"    "51-60"   
## [7] "31-40"    "OVER 70"

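We have 10 categories of weapons used to commit crimes. Labels such as HANDGUN, SHOTGUN, KNIFE/CUTTING INSTRUMENT, BLUNT OBJECT and DRUGS do not appear in the list below: each contains the letter U, so the later ".*U.*" pattern rewrites them to UNKNOWN.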

#print categories of weapons
unique(dataset_final$WEAPONS)
##  [1] "NONE"                   "PERSONAL WEAPON"       
##  [3] "UNKNOWN"                "FIREARM"               
##  [5] "OTHER WEAPONS"          "MOTOR VEHICLE"         
##  [7] "RIFLE"                  "FIRE/INCENDIARY DEVICE"
##  [9] "EXPLOSIVES"             "POISON"

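We have reduced the number of crime types to 62. A few variants remain separate because the HARRASS and UNAUTHORISED patterns do not match the spellings used in the data (HARASS and UNAUTHORIZED).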

#print categories of offense
unique(dataset_final$OFFENSE)
##  [1] "MENACING"                                                
##  [2] "ASSAULT"                                                 
##  [3] "ENDANGERING CHILDREN"                                    
##  [4] "THEFT"                                                   
##  [5] "BURGLARY"                                                
##  [6] "TELEPHONE HARASSMENT"                                    
##  [7] "BREAKING AND ENTERING"                                   
##  [8] "ROBBERY"                                                 
##  [9] "CRIMINAL"                                                
## [10] "VIOLATE PROTECTION ORDER"                                
## [11] "MURDER"                                                  
## [12] "IMPROPERLY DISCHARGING FIREARM AT/INTO HABITATION/SCHOOL"
## [13] "UNAUTHORIZED USE OF MOTOR VEHICLE"                       
## [14] "SEX"                                                     
## [15] "CREDIT CARD FRAUD"                                       
## [16] "TAMPERING WITH COIN MACHINES"                            
## [17] "RETALIATION"                                             
## [18] "VANDALISM"                                               
## [19] "TAKING THE IDENTITY OF ANOTHER"                          
## [20] "FORGERY"                                                 
## [21] "PATIENT ABUSE"                                           
## [22] "PASSING BAD CHECKS"                                      
## [23] "KIDNAPPING"                                              
## [24] "IMPERSONATING PEACE OFFICER/PRIVATE POLICEMAN"           
## [25] "INTIMIDATION"                                            
## [26] "ABDUCTION"                                               
## [27] "TELECOMMUNICATIONS FRAUD"                                
## [28] "INDUCING PANIC"                                          
## [29] "CONTRIB TO CHILD UNRULINESS/DELINQUENCY"                 
## [30] "UNLAWFUL RESTRAINT"                                      
## [31] "INTERFERENCE WITH CUSTODY"                               
## [32] "RUNAWAY"                                                 
## [33] "DISORDERLY CONDUCT"                                      
## [34] "RECEIVING STOLEN PROPERTY"                               
## [35] "VOYEURISM"                                               
## [36] "TELEPHONE HARASS-FAIL TO DESIST"                         
## [37] "MAKING FALSE ALARMS"                                     
## [38] "UNAUTHORIZED USE OF PROPERTY"                            
## [39] "RECKLESS HOMICIDE"                                       
## [40] "EXTORTION"                                               
## [41] "AGG VEH HOM/VEH HOM/VEH MNSLGHTER"                       
## [42] "SAFECRACKING"                                            
## [43] "FAIL COMPLY ORDER/SIGNAL OF PO-ELUDE/FLEE"               
## [44] "DISRUPTING PUBLIC SERVICE"                               
## [45] "EMBEZZLEMENT"                                            
## [46] "ARSON"                                                   
## [47] "UNAUTHORIZED USE OF MOTOR VEHICLE-JOY RIDING"            
## [48] "FAIL TO PROVIDE FOR FUNCTIONALLY IMPAIRED PERSON"        
## [49] "WIRE FRAUD"                                              
## [50] "IMPORTUNING"                                             
## [51] "TELEPHONE HARASSMENT - ANONYMOUS"                        
## [52] "PUBLIC INDECENCY"                                        
## [53] "PERSONATING AN OFFICER"                                  
## [54] "VOLUNTARY MANSLAUGHTER"                                  
## [55] "IMPERSONATING PO/PRIVATE POLICEMAN-FACILITATE CRIME"     
## [56] "DEFRAUDING A LIVERY OR HOSTELRY"                         
## [57] "TAMPERING WITH RECORDS"                                  
## [58] "B&E-COMMIT FELONY-PREMISES OF ANOTHER"                   
## [59] "NEGLIGENT HOMICIDE"                                      
## [60] "OBSTRUCTING OFFICIAL BUSINESS"                           
## [61] "INVOLUNTARY MANSLAUGHTER"                                
## [62] "AGGRAVATED TRESPASS"
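The listing above still shows three telephone harassment entries and several unauthorized use entries. A minimal sketch of how patterns spelled the way the data spells them would merge those entries is given below. The labels are copied from the listing; the target category names and the variable names are assumptions, and the sketch is not the code that produced the results in this report.

```r
# Labels copied from the unique(OFFENSE) listing
offenses <- c("TELEPHONE HARASSMENT",
              "TELEPHONE HARASS-FAIL TO DESIST",
              "TELEPHONE HARASSMENT - ANONYMOUS",
              "UNAUTHORIZED USE OF MOTOR VEHICLE",
              "UNAUTHORIZED USE OF PROPERTY",
              "THEFT")

# Patterns use the dataset's spellings: HARASS and UNAUTHORIZED
merged <- gsub(".*TELEPHONE HARASS.*", "TELEPHONE HARASSMENT", offenses)
merged <- gsub(".*UNAUTHORIZED USE.*", "UNAUTHORIZED USE", merged)

print(unique(merged))
```

Applying this to the full column would lower the category count below 62, so the counts reported in this section would need to be regenerated.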

Now we start our analysis. The first thing to check is the top 5 most crime-prone neighborhoods.

#Top 5 worst neighborhoods
dataset_final    %>% count(CPD_NEIGHBORHOOD, sort = TRUE) %>% top_n(5)
## # A tibble: 5 x 2
##   CPD_NEIGHBORHOOD     n
##   <fct>            <int>
## 1 WESTWOOD         13281
## 2 WEST PRICE HILL   9521
## 3 EAST PRICE HILL   8519
## 4 AVONDALE          6388
## 5 OVER-THE-RHINE    5850

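Westwood has more than twice as many reported incidents as Over-the-Rhine (13,281 versus 5,850), which makes it stand out as the most crime-prone neighborhood. These are raw counts and are not adjusted for population.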
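To put such counts in context it can help to express them as a share of all incidents. The sketch below does this in base R on a handful of made-up neighborhood labels; the data and variable names are assumptions and the percentages are not results from the Cincinnati dataset.

```r
# Hypothetical neighborhood labels, one per incident
incidents <- c("WESTWOOD", "AVONDALE", "WESTWOOD",
               "CLIFTON", "AVONDALE", "WESTWOOD")

# Count incidents per neighborhood, largest first
counts <- sort(table(incidents), decreasing = TRUE)

# Keep the two largest and convert them to a percentage of all incidents
top_two <- counts[seq_len(2)]
shares <- round(100 * as.numeric(top_two) / length(incidents), 1)
names(shares) <- names(top_two)

print(shares)
```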

Let’s now check the most common types of weapon used to commit crime.

#Top 5 weapons used 
dataset_final    %>% count(WEAPONS,  sort = TRUE) %>% filter(WEAPONS != "NONE" & WEAPONS != "UNKNOWN") %>% top_n(5)
## # A tibble: 5 x 2
##   WEAPONS             n
##   <chr>           <int>
## 1 PERSONAL WEAPON 25599
## 2 FIREARM          2798
## 3 OTHER WEAPONS    2198
## 4 MOTOR VEHICLE     488
## 5 RIFLE             216

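Most crimes with a known weapon are committed using personal weapons: 25,599 incidents, roughly nine times the 2,798 firearm incidents in second place. Keep in mind that the recoding above folds handguns, knives and several other categories into UNKNOWN, which is excluded here, so those weapons are missing from this ranking.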

Let’s check the most common offense categories.

#Most common offense type
dataset_final    %>% count(OFFENSE,  sort = TRUE) %>% top_n(5)
## # A tibble: 5 x 2
##   OFFENSE      n
##   <chr>    <int>
## 1 THEFT    40714
## 2 ASSAULT  22419
## 3 CRIMINAL 17781
## 4 BURGLARY 14832
## 5 ROBBERY   8910

We can see that theft is the most common crime. The legal difference between theft, burglary and robbery is as follows. Robbery, in contrast to theft, is a taking of property that involves person-to-person interaction with force, intimidation, and/or coercion. Burglary, in contrast to both theft and robbery, is the entering of a building or residence with the intention to commit a theft or any felony.

We now check the age profile of the victims.

#Most vulnerable age profile
dataset_final    %>% count(VICTIM_AGE, sort = TRUE) %>% top_n(5)
## # A tibble: 5 x 2
##   VICTIM_AGE     n
##   <chr>      <int>
## 1 18-25      31299
## 2 31-40      27496
## 3 26-30      19736
## 4 41-50      19243
## 5 51-60      16832

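We can see that college-age students and young adults (18-25) are the most frequent victims, followed by the 31-40 group. The age bands are not equally wide, so the counts are not directly comparable across bands.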

Next we check for the most unsafe time of day.

#Most crime prone time
dataset_final    %>% count(event_time_occurence, sort = TRUE) %>% top_n(5)
## # A tibble: 5 x 2
##   event_time_occurence     n
##   <time>               <int>
## 1  0 hours              5410
## 2 12 hours              4277
## 3 22 hours              3451
## 4 18 hours              3400
## 5 21 hours              3321

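Four of the top five are in the evening or at night. Because event_time_occurence is still rounded to one decimal place at this point, each row counts incidents logged at, or within a few minutes of, the exact hour. The peaks at 0 and 12 hours may partly reflect default or estimated times rather than true times of occurrence.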

Next we check which day of the week is most crime prone.

#Most crime prone day of week
dataset_final    %>% count(DAYOFWEEK, sort = TRUE) %>% top_n(5)
## # A tibble: 5 x 2
##   DAYOFWEEK     n
##   <fct>     <int>
## 1 SATURDAY  19792
## 2 SUNDAY    19738
## 3 FRIDAY    19316
## 4 MONDAY    19169
## 5 WEDNESDAY 18962

Weekends are the most crime prone, with Saturday and Sunday in the top two spots, although the gap is small: Saturday's 19,792 incidents are only about 4% above Wednesday's 18,962.

Now let’s plot the total number of crimes over the years.

#Visualizing crime over the years 
barplot(table(substr(dataset_final$event_datetime_occurence, 1, 4)),
        main = "Crimes over years in Cincinnati",
        xlab = "Year",
        ylab = "Total Number of Crimes"
        )

We can see a small decreasing trend from 2015 onward.
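The bar plot takes the year from the first four characters of the printed date-time. A minimal sketch of the same tabulation with an explicit year format is shown below, on a few made-up timestamps; the values and variable names are assumptions for illustration.

```r
# Hypothetical incident timestamps
stamps <- as.POSIXct(c("2014-05-01 10:00:00", "2015-07-04 22:30:00",
                       "2015-12-31 23:59:00", "2018-01-01 00:00:00"),
                     tz = "UTC")

# Extract the four-digit year directly from the date-time value
years <- format(stamps, "%Y")

# Count incidents per year; this table can be passed to barplot()
yearly_counts <- table(years)
print(yearly_counts)
```

Asking for the year by format does not depend on how the date-time happens to be printed as text.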

Let’s try plotting the crimes on a map. For this I am using a package called leaflet. Leaflet is an open-source JavaScript library for interactive maps. We plot the occurrences of crime across the city as clusters. As we zoom in, the detail becomes more granular until each cluster resolves into single incidents.

#Adding year variable to dataset
dataset_final <- mutate(dataset_final,event_occurence_year = (substr(dataset_final$event_datetime_occurence,1,4)))

#Convert neighborhood to factor
dataset_final$CPD_NEIGHBORHOOD <- as.factor(dataset_final$CPD_NEIGHBORHOOD)
head(dataset_final)

#creating function for visualization by year
mapbydate <- function(yearofCrime){
  subdataset <- filter(dataset_final, event_occurence_year == yearofCrime)
  #build the clustered marker map and return it explicitly
  leaflet() %>%
    addTiles() %>%
    addMarkers(lng = subdataset$LONGITUDE,
               lat = subdataset$LATITUDE,
               clusterOptions = markerClusterOptions()
    )
}

#map of occurences grouped by year for 2017,2018
map2017 <- mapbydate("2017")
map2018 <- mapbydate("2018")

Printing the map for 2018.

map2018

Printing the map for 2017.

map2017

Let’s plot the five crime-prone neighborhoods identified above to understand how crime has trended in each over the last 5 years.

dataset_final %>%
  filter(dataset_final$CPD_NEIGHBORHOOD == "WESTWOOD" |
           dataset_final$CPD_NEIGHBORHOOD == "WEST PRICE HILL" |
           dataset_final$CPD_NEIGHBORHOOD == "EAST PRICE HILL" |
           dataset_final$CPD_NEIGHBORHOOD == "OVER-THE-RHINE" |
           dataset_final$CPD_NEIGHBORHOOD == "AVONDALE") %>%
  group_by(CPD_NEIGHBORHOOD, event_occurence_year) %>%
  tally() %>%
  ggplot(aes(x = event_occurence_year, y = n, group = CPD_NEIGHBORHOOD, color = CPD_NEIGHBORHOOD)) + 
  geom_line() +
  labs(y = "Number of Occurences",
       x = "Years",
       title = "Is it getting better? Trend of Crimes .",
       color = "Crime prone neighborhood in Cincinnati")

We can see a large gap between Westwood and the other areas, but the downward trend in reported crime is the same across all five neighborhoods.
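The neighborhood filter used for this plot repeats the same comparison five times. A minimal sketch of the same selection with %in%, followed by a count per neighborhood and year, is shown below in base R. The toy data frame and the names incidents, focus_hoods and yearly are assumptions for illustration.

```r
# Hypothetical incidents with a neighborhood and a year
incidents <- data.frame(
  hood = c("WESTWOOD", "WESTWOOD", "AVONDALE",
           "CLIFTON", "AVONDALE", "WESTWOOD"),
  year = c("2014", "2014", "2014", "2014", "2015", "2015"),
  stringsAsFactors = FALSE
)

# Keep only the neighborhoods of interest
focus_hoods <- c("WESTWOOD", "AVONDALE")
focus <- incidents[incidents$hood %in% focus_hoods, ]

# Count incidents for every neighborhood and year combination
yearly <- as.data.frame(table(hood = focus$hood, year = focus$year),
                        responseName = "n", stringsAsFactors = FALSE)
print(yearly)
```

Keeping the neighborhood names in one vector means the later plots can reuse the same selection without copying the condition.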

Now, in these neighborhoods, what are the most dangerous times to be out?

#Time of crime occurence
dataset_final$event_time_occurence <- round(dataset_final$event_time_occurence,digits =  0)
dataset_final %>%
  filter(dataset_final$CPD_NEIGHBORHOOD == "WESTWOOD" |
           dataset_final$CPD_NEIGHBORHOOD == "WEST PRICE HILL" |
           dataset_final$CPD_NEIGHBORHOOD == "EAST PRICE HILL" |
           dataset_final$CPD_NEIGHBORHOOD == "OVER-THE-RHINE" |
           dataset_final$CPD_NEIGHBORHOOD == "AVONDALE") %>%
  group_by(CPD_NEIGHBORHOOD, event_time_occurence) %>%
  tally() %>%
  ggplot(aes(x = event_time_occurence, y = n, group = CPD_NEIGHBORHOOD, color = CPD_NEIGHBORHOOD)) +
  geom_point(alpha = .5) +
  stat_smooth(aes(x = event_time_occurence, y = n),method = "lm", formula = y ~ poly(x, 10), se = FALSE) +
  labs(y = "Number of Occurences",
       x = "Time of day",
       title = "Crime prone times of Day ",
       color = "Crime prone neighborhood in Cincinnati")

We can see that evening and night times are more dangerous than the rest of the day.

Next we try to answer the question: on which day of the week should we avoid going out?

#Day of the week crime occurences
dataset_final$DAYOFWEEK <- factor(dataset_final$DAYOFWEEK, levels = c("SUNDAY", "MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY", "FRIDAY", "SATURDAY"))
dataset_final %>%
  filter(dataset_final$CPD_NEIGHBORHOOD == "WESTWOOD" |
           dataset_final$CPD_NEIGHBORHOOD == "WEST PRICE HILL" |
           dataset_final$CPD_NEIGHBORHOOD == "EAST PRICE HILL" |
           dataset_final$CPD_NEIGHBORHOOD == "OVER-THE-RHINE" |
           dataset_final$CPD_NEIGHBORHOOD == "AVONDALE") %>%
  group_by(CPD_NEIGHBORHOOD, DAYOFWEEK) %>%
  tally() %>%
  ggplot(aes(x = DAYOFWEEK, y = n, group = CPD_NEIGHBORHOOD, color = CPD_NEIGHBORHOOD)) +
  geom_point(alpha = .5) +
  stat_smooth(aes(x = DAYOFWEEK, y = n),method = "lm", formula = y ~ poly(x, 4), se = FALSE) +
  labs(y = "Number of Occurences",
       x = "Day of Week",
       title = "Crime prone days of the Week",
       color = "Crime prone Neighborhood in Cincinnati")

We can see that weekends (Saturday and Sunday) are the most likely days for crime.

Is there a particular month that is more crime prone than the others?

#month of crime occurence 
#event_datetime_occurence is already a date-time, so no format string is needed
dataset_final$month <- as.factor(month(dataset_final$event_datetime_occurence))


dataset_final %>%
  group_by(month, event_occurence_year) %>%
  tally() %>%
  ggplot(aes(x = month, y = n, group = event_occurence_year, color = event_occurence_year)) + 
  geom_line() +
  labs(y = "Number of Occurences",
       x = "Months",
       title = "Trend of Crimes in different months over the years",
       color = "Years")

We can see that the summer months have a much higher incidence of crime.
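For readers without lubridate, the month can also be taken from a date-time in base R. The sketch below shows this on a few made-up timestamps and labels the months by name; the values and variable names are assumptions for illustration.

```r
# Hypothetical incident timestamps
stamps <- as.POSIXct(c("2016-01-15 08:00:00", "2016-07-04 21:00:00",
                       "2017-07-20 13:30:00"), tz = "UTC")

# POSIXlt stores months as 0-11, so add 1 to get calendar months
month_number <- as.POSIXlt(stamps)$mon + 1

# Use all 12 months as levels so empty months still appear in the counts
month_factor <- factor(month_number, levels = 1:12, labels = month.abb)

print(table(month_factor))
```

Fixing the factor levels to all twelve months keeps the x axis in calendar order even when some months have no incidents.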

Is there a particular age profile to victims?

# How have the different age groups been victimized in these neighborhoods?
dataset_final %>% filter(dataset_final$CPD_NEIGHBORHOOD == "WESTWOOD" |
                           dataset_final$CPD_NEIGHBORHOOD == "WEST PRICE HILL" |
                           dataset_final$CPD_NEIGHBORHOOD == "EAST PRICE HILL" |
                           dataset_final$CPD_NEIGHBORHOOD == "OVER-THE-RHINE" |
                           dataset_final$CPD_NEIGHBORHOOD == "AVONDALE") %>%
  group_by(VICTIM_AGE, CPD_NEIGHBORHOOD) %>%
  tally() %>%
  ggplot(aes(x = VICTIM_AGE, y = n, group = CPD_NEIGHBORHOOD, color = CPD_NEIGHBORHOOD)) + 
  geom_line() +
  labs(y = "Number of Occurences",
       x = "Age Group",
       title = "Most Vulnerable Age Groups",
       color = "Cincinnati Neighborhoods")

In these areas the most common victim age group is 31-40. The 18-25 group, which is the most common across Cincinnati as a whole, is the second most common in these neighborhoods.

Summary

For this analysis we looked at crime data from the City of Cincinnati. We started with an overview that identified the crime-prone neighborhoods. We also identified the most common types of crime and weapons used, the age profile of the victims, and the times when most crimes were committed. We then did a detailed neighborhood-level analysis of the most affected neighborhoods, using visualization to check whether they were becoming safer over time. We also plotted the crime occurrences on a map for better understanding.

We summarize our study as given below.

  • Westwood remains the most crime-prone neighborhood.
  • Most crime is committed using personal weapons and is of a non-violent type.
  • Theft is the most common type of crime.
  • The largest group of victims is between 18 and 25 years of age.
  • Most crime happens on weekends and during evening/night hours.

However, from the trend analysis we can see that reported crime across Cincinnati and its most crime-prone neighborhoods has been falling over the years. Cincinnati is slowly but surely becoming a safer place to live.