Crime Data Study: A Comparison of an East Coast City, New York City (NYC), a West Coast City, San Francisco (SFO), and a Midwest City, Chicago (CHI)


Summary

This R Markdown document provides documentation for the data exploration and analysis of publicly available crime data for New York City, San Francisco, and Chicago.


R Code :

Loading Packages Used

knitr::opts_chunk$set(message = FALSE, echo = TRUE)

# Library for string manipulation/regex operations
library(stringr)
# Library for data display in tabular format
library(DT)
# Library to read text file
library(RCurl)
# Library to gather (to long format) and spread (to wide format) data, to tidy data
library(tidyr)
# Library to filter, transform data
library(dplyr)
# Library to plot
library(ggplot2)
library(knitr)


# Library for db operations
library(RMySQL)


# Library for loading data
library(jsonlite)
library(XML)
library(xml2)


# Library for date/time handling
library(lubridate)
# Library for interactive plots
library(plotly)


# Library for bitwise operations (used by RCurl)
library(bitops)

# Library for fast string processing
library(stringi)

# Library for maps and spatial plots
library(ggmap)

# Library for low-level grid graphics
library(grid)

Forming MySQL DB Connection to Crime Schema

The MySQL database is set up in the cloud on Amazon Web Services. A connection to the crimedb schema can be obtained as follows.

The connection object ‘conn’ will be used for querying and manipulating the database tables.

conn <- dbConnect(MySQL(), user = "mygroup", password = "mygroup#01", host = "mygroup.c5rotlbjbl71.us-east-1.rds.amazonaws.com", 
    dbname = "crimedb")


# List the database tables in crime schema
entdbtables <- dbListTables(conn)


############################################################################# 
# FOR TEST PURPOSES OF THIS RMD ONLY: delete existing NYC data to avoid
# integrity constraint errors on re-runs.

# delnycdata <- dbGetQuery(conn, "DELETE FROM crimedata WHERE citycode = 'NYC'")

################################################################################## 

San Francisco (SFO) DataSet Study

Fetching San Francisco (SFO) DataSet from JSON file

This JSON has two parts: the first section holds the metadata and the second section holds the actual data.

# JSON Load

sf_crim.json <- fromJSON("https://raw.githubusercontent.com/bvshyam/Project3_cuny_607/master/Data/sfo_dec2015.json")

# Look for data in JSON
sf_crimedata <- sf_crim.json[["data"]]


# Select the necessary columns in the JSON

sf_crime.df <- data.frame(offenseid = sapply(sf_crimedata, function(x) x[[1]]), Category = sapply(sf_crimedata, 
    function(x) x[[10]]), DayOfWeek = sapply(sf_crimedata, function(x) x[[12]]), 
    Date = as.Date(substr(sapply(sf_crimedata, function(x) x[[13]]), 1, 10), "%Y-%m-%d"), 
    Time = sapply(sf_crimedata, function(x) x[[14]]), PdDistrict = sapply(sf_crimedata, 
        function(x) x[[15]]), Location = paste(sapply(sf_crimedata, function(x) x[[19]]), 
        ",", sapply(sf_crimedata, function(x) x[[18]])), month = month.abb[month(as.Date(substr(sapply(sf_crimedata, 
        function(x) x[[13]]), 1, 10), "%Y-%m-%d"))], offensearea = sapply(sf_crimedata, 
        function(x) x[[15]]))
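To confirm that the numeric indices used above (x[[1]], x[[10]], x[[13]], ...) point at the intended fields, the column names declared in the metadata section can be listed. This is a minimal sketch and assumes the standard Socrata export layout, where column definitions sit under meta$view$columns.

# Sketch: list the field names from the metadata so each numeric index used
# above can be matched against a column name (assumes Socrata export layout).
sf_colnames <- sf_crim.json[["meta"]][["view"]][["columns"]][["name"]]
head(sf_colnames, 20)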


sf_crime_changes <- sf_crime.df

# Adding Time period for each day

sf_crime_changes <- mutate(sf_crime_changes, Hour = round(as.numeric(str_replace_all(sf_crime_changes$Time, 
    ":", ".")))) %>% mutate(Date = as.Date(sf_crime_changes$Date, "%m/%d/%Y"))

sf_crime_changes <- sf_crime_changes %>% mutate(timeframe = sapply(sf_crime_changes$Hour, 
    function(x) {
        if (x > 6 & x <= 11) {
            "Morning"
        } else if (x > 11 & x <= 16) {
            "Noon"
        } else if (x > 16 & x <= 20) {
            "Evening"
        } else if (x <= 6 | x > 20) {
            "Night"
        }
    }))
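The sapply/if-else chain above assigns the buckets row by row; an equivalent vectorized sketch of the same bucketing uses dplyr::case_when, keeping the Hour column computed above:

# Vectorized alternative (sketch): same time-of-day buckets via case_when.
sf_crime_changes$timeframe <- case_when(
    sf_crime_changes$Hour > 6 & sf_crime_changes$Hour <= 11 ~ "Morning", 
    sf_crime_changes$Hour > 11 & sf_crime_changes$Hour <= 16 ~ "Noon", 
    sf_crime_changes$Hour > 16 & sf_crime_changes$Hour <= 20 ~ "Evening", 
    TRUE ~ "Night")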


# Selecting the columns and calculating the additional columns

sf_crime_db <- select(sf_crime_changes, offenseid = offenseid, offense = Category, 
    offensedate = Date, offensetime = Time, dayofweek = DayOfWeek, offensehour = Hour, 
    offensemonth = month, offensearea = offensearea) %>% mutate(citycode = "SFO")

Database connection and insert

# Arrange the columns for database insert

sf_crime_db_1 <- select(sf_crime_db, citycode, offenseid, offense, offenselocaldate = offensedate, 
    offenselocaltime = offensetime, dayofweek, offensehour, offensemonth, offensearea)


# Bulk insert from dataframe to MYSQL table called crimedata

dbWriteTable(conn, value = sf_crime_db_1, name = "crimedata", row.names = F, append = TRUE)
## [1] TRUE
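A quick sanity check (a sketch, using the same connection) is to count the SFO rows now present in the table:

# Optional check: number of SFO rows in crimedata after the bulk insert.
dbGetQuery(conn, "SELECT COUNT(*) FROM crimedata WHERE citycode = 'SFO'")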

Analysis on SFO data

Now that we have the required data, we can perform some analysis and plot the graphs.

# Select the required columns for comparing hour and offense

sfo_crime_hour <- sf_crime_changes %>% select(Hour, Category) %>% group_by(Hour, 
    Category) %>% summarise(total = n()) %>% dplyr::arrange(total)


# Hour vs Offense

p <- ggplot(sfo_crime_hour, aes(x = Hour, y = total)) + 
    geom_line(aes(color = Category)) + ggtitle("Hour vs Offense") + xlab("Hour") + 
    ylab("Offense Count")
ggplotly(p)
# Analyze Crime by offense

category_count <- table(sf_crime_changes$Category) %>% data.frame() %>% setNames(c("Offense", 
    "Frequency"))

datatable(category_count)
# Plot a scattered plot by comparing Offense vs Frequency(occurrence)



a <- list(autotick = TRUE, tick0 = 0, dtick = 1, size = 24, showticklabels = FALSE, 
    showlegend = FALSE, tickcolor = toRGB("white"), zeroline = FALSE, showline = FALSE, 
    showgrid = FALSE, visible = FALSE, title = "Offense")

b <- list(title = "Crime count")

plot_ly(x = ~category_count$Offense, y = ~category_count$Frequency, type = "scatter", 
    color = category_count$Offense) %>% layout(xaxis = a, yaxis = b, title = "Offense vs Frequency")

Analyze Crime by Date

# Plot a scattered plot by comparing Offense Date vs Frequency(occurrence)

category_date <- table(sf_crime_changes$Date) %>% data.frame() %>% setNames(c("Offense_date", 
    "Frequency"))

category_date$Offense_date <- str_replace(category_date$Offense_date, "/2015|2015-", 
    "")

datatable(category_date)
# Plot a scattered plot by comparing Offense Date vs Frequency(occurrence)


a1 <- list(autotick = TRUE, tick0 = 0, dtick = 1, size = 24, showlegend = FALSE, 
    tickangle = 45, title = FALSE)

b1 <- list(title = "Crime count")

plot_ly(x = ~category_date$Offense_date, y = ~category_date$Frequency, type = "area") %>% 
    layout(xaxis = a1, yaxis = b1, title = "Date vs Frequency")

Analyze by timeframe and top crimes

# Crime according to timeframe and Day

crimesbyday <- table(sf_crime_changes$DayOfWeek, sf_crime_changes$timeframe) %>% 
    data.frame() %>% setNames(c("Day", "Day_category", "Frequency"))


datatable(crimesbyday)
# Top crimes greater than 5%


mostcrimes <- table(sf_crime_changes$DayOfWeek, sf_crime_changes$Category) %>% data.frame() %>% 
    spread(Var1, Freq) %>% mutate(summary = rowSums(.[, 2:8], 1))


topcrimes <- mostcrimes[(mostcrimes$summary/sum(mostcrimes$summary)) > 0.05, ] %>% 
    data.frame()


datatable(topcrimes)
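Making the 5% threshold explicit can help when reading the table; a small sketch that adds each offense's share of all crimes before filtering:

# Sketch: add each offense's share of the total, then keep those above 5%.
mostcrimes$share <- round(mostcrimes$summary/sum(mostcrimes$summary), 3)
datatable(mostcrimes[mostcrimes$share > 0.05, ])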

New York City (NYC) DataSet Study

Loading New York City (NYC) DataSet

Loading the crime data: the data file, in XML format, is read from its GitHub location.

knitr::opts_chunk$set(message = FALSE, echo = TRUE)


################################ XML #######################################

nycxml.giturl <- "https://raw.githubusercontent.com/bvshyam/Project3_cuny_607/master/Data/nyc_dec2015.xml"
nycdataxml <- xmlRoot(xmlParse(getURL(nycxml.giturl)))  # get XML file contents

class(nycdataxml)
FALSE [1] "XMLInternalElementNode" "XMLInternalNode"       
FALSE [3] "XMLAbstractNode"
xmlName(nycdataxml)
FALSE [1] "response"
xmlSize(nycdataxml)
## [1] 1
xmlSize(nycdataxml[1])
## [1] 1
xmlSize(nycdataxml[[1]])
## [1] 8055
nycdataxmlDFchk <- xmlToDataFrame(nycdataxml[[1]])

# The number of rows available in the dataset
nrow(nycdataxmlDFchk)
## [1] 8055
head(nycdataxmlDFchk)
##   objectid identifier     occurrence_date day_of_week occurrence_month
## 1  1023513   e9e03977 2015-12-01T00:00:00     Tuesday              Dec
## 2  1023514   127584fd 2015-12-01T00:00:00     Tuesday              Dec
## 3  1023515   8a7b0245 2015-12-01T00:00:00     Tuesday              Dec
## 4  1023516   b1adaa5e 2015-12-01T00:00:00     Tuesday              Dec
## 5  1023517   8cfddfd5 2015-12-01T00:01:00     Tuesday              Dec
## 6  1023518   eea9fa7b 2015-12-01T00:01:00     Tuesday              Dec
##   occurrence_day occurrence_year occurrence_hour compstat_month
## 1              1            2015               0             12
## 2              1            2015               0             12
## 3              1            2015               0             12
## 4              1            2015               0             12
## 5              1            2015               0             12
## 6              1            2015               0             12
##   compstat_day compstat_year                        offense
## 1            1          2015 GRAND LARCENY OF MOTOR VEHICLE
## 2           31          2015                  GRAND LARCENY
## 3           11          2015                  GRAND LARCENY
## 4            4          2015                  GRAND LARCENY
## 5            3          2015                  GRAND LARCENY
## 6            1          2015 GRAND LARCENY OF MOTOR VEHICLE
##   offense_classification sector precinct       borough     jurisdiction
## 1                 FELONY      H      122 STATEN ISLAND N.Y. POLICE DEPT
## 2                 FELONY      A      101        QUEENS N.Y. POLICE DEPT
## 3                 FELONY      A       67      BROOKLYN N.Y. POLICE DEPT
## 4                 FELONY      H      115        QUEENS N.Y. POLICE DEPT
## 5                 FELONY      A      113        QUEENS N.Y. POLICE DEPT
## 6                 FELONY      H       94      BROOKLYN N.Y. POLICE DEPT
##   xcoordinate ycoordinate location_1
## 1      963230      155663           
## 2     1056616      157681           
## 3      997049      176315           
## 4     1020414      213600           
## 5     1039980      182653           
## 6      997286      201416

Data Manipulation New York City (NYC) DataSet

Selecting columns from the NYC dataset for study

nyc_crimedataDF <- nycdataxmlDFchk %>% select(objectid, occurrence_date, day_of_week, 
    occurrence_month, occurrence_hour, offense, borough)


# mutate to strip time as separate attribute from date.

nyc_crimedataDF <- nyc_crimedataDF %>% mutate(occurrence_time = str_extract(occurrence_date, 
    "[[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}")) %>% mutate(occurrence_date = str_extract(occurrence_date, 
    "[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}")) %>% mutate(occurrence_hour = as.numeric(occurrence_hour))



# mutate to add Week# from (0 - 52)

nyc_crimedataDF <- nyc_crimedataDF %>% mutate(week = strftime(as.POSIXlt(occurrence_date), 
    format = "%W"))



# View of NYC crime data View(nyc_crimedataDF)
datatable(nyc_crimedataDF)

Populating Crimedb database CrimeData table with New York City (NYC) Crime Data

The manipulated data frame is now prepared for posting into the crimedata database table.

# Prepare dataframe for database insert

nycdbinsert <- subset(nyc_crimedataDF, select = c(objectid, offense, occurrence_date, 
    occurrence_time, day_of_week, occurrence_hour, occurrence_month, borough))



# Adding citycode column at start of dataframe
nycdbinsert <- cbind(citycode = "NYC", nycdbinsert)


# matching column names to database table column names

colnames(nycdbinsert) <- c("citycode", "offenseid", "offense", "offenselocaldate", 
    "offenselocaltime", "dayofweek", "offensehour", "offensemonth", "offensearea")


# Bulk insert from dataframe to MYSQL table

dbWriteTable(conn, value = nycdbinsert, name = "crimedata", row.names = F, append = TRUE)
## [1] TRUE
# View(nycdbinsert)

Tidying NYC Crime data to make it plot ready

Group by Borough and Offense

Performing group-by operations on the dataset: group by borough and, within each borough, by offense.

nyc_crimedata_boroughDF <- nyc_crimedataDF %>% select(borough, offense) %>% group_by(borough, 
    offense) %>% dplyr::summarise(total = n())

datatable(nyc_crimedata_boroughDF)

Group by Week# and Offense

Performing group-by operations on the dataset: group by Week# and, within each Week#, by offense.

nyc_crimedata_weekDF <- nyc_crimedataDF %>% select(week, offense) %>% group_by(week, 
    offense) %>% dplyr::summarise(total = n())

datatable(nyc_crimedata_weekDF)

Group by Hour# and Offense

Performing group-by operations on the dataset: group by Hour# and, within each Hour#, by offense.

nyc_crimedata_hourDF <- nyc_crimedataDF %>% select(occurrence_hour, offense) %>% 
    group_by(occurrence_hour, offense) %>% dplyr::summarise(total = n()) %>% arrange(occurrence_hour)

nyc_crimedata_hourrateDF <- nyc_crimedata_hourDF %>% select(occurrence_hour, total) %>% 
    group_by(occurrence_hour) %>% dplyr::summarise(offensecount = sum(total)) %>% 
    arrange(occurrence_hour)


datatable(nyc_crimedata_hourDF)

NYC DataSet Plots

The New York City crime data is studied for insights that throw light on crime occurrence in the city. The following plots capture these.

Hour vs Offense

Plot of the ‘Hour’ at which the crime occurred against the type of ‘Offense’. This reveals whether certain types of crime occur more at particular times of day than others.

ggplot(nyc_crimedata_hourDF, aes(y = total, x = occurrence_hour)) + geom_point(aes(color = offense)) + 
    ggtitle("Hypothesis : Offense Vs Hour Of Day") + facet_grid(~offense) + xlab("Hour Of Day") + ylab("Offense Count")

Findings :

We observe burglary happening more between midnight and dawn, suggesting the crime is more advantageous when people are asleep or less alert.

Assaults also happen during the early hours, perhaps reflecting early-morning travelers.

Thefts in general, and larceny of any kind, occur more during the daytime. Similarly, robbery in the NYC area is observed primarily in the late evenings.

The heinous crimes of murder and sexual assault are seen occurring in the daytime, though they remain a risk at any time of day.


Day vs Offense

Plot of the ‘Day’ (i.e. the day of the week) on which the crime occurred against the type of ‘Offense’. This throws light on whether crime occurs more on weekdays or weekends, and on whether there is any association between the two factors of the plot.

nyc_crimedata_dayDF <- nyc_crimedataDF %>% select(day_of_week, offense) %>% group_by(day_of_week, offense) %>% dplyr::summarise(daywisetotal = n())

datatable(nyc_crimedata_dayDF)
ggplot(nyc_crimedata_dayDF, aes(y = daywisetotal, x = day_of_week, fill = offense)) + xlab("Day Of Week") + ylab("Offense Count") + 
    ggtitle("Hypothesis : Offense Vs Day Of Week") + geom_bar(stat = "identity", position = position_dodge()) + scale_fill_brewer(palette = "OrRd") + 
    scale_x_discrete(limits = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

Findings : We find grand larceny is high on weekdays and mid-week; robbery is also high mid-week. Assault is lower on weekends compared to weekdays, suggesting that assaults occur more when people are typically alone or travelling during the week, rather than on weekends when they are more social and accompanied by someone.


Offense vs Borough

Comparison of the offenses occurring in the NYC boroughs: Bronx, Brooklyn, Manhattan, Queens, and Staten Island.

ggplot(nyc_crimedata_boroughDF, aes(x = borough, y = total, fill = offense)) + 
    ggtitle("Offense In NYC Boroughs") + geom_bar(colour = "black", stat = "identity") + coord_flip() + xlab("Boroughs") + 
    ylab("Offense Count") + scale_fill_brewer(palette = "RdPu")

Findings : Brooklyn has the highest offense count, followed by Manhattan. Staten Island is the safest NYC borough.

The heinous crimes of murder and rape are far less frequent than grand larceny, which is high in all boroughs in comparison to other offenses.

Brooklyn has the highest burglary as well as robbery rate among all the boroughs.




Chicago (CHI) DataSet Study


Loading Chicago (CHI) DataSet into R from an external source:

chicago <- read.csv("https://raw.githubusercontent.com/bvshyam/Project3_cuny_607/master/Data/chicago_dec2015.csv", header=TRUE, sep=",", stringsAsFactors = FALSE,na.strings = c("",NA,"n/a","N/A","NA"))

Clean the data!

As the data is in the format “M/D/Y H:M:S”, we need to break it down and separate the components so we can do more fine-grained analysis. To do so, we use the lubridate and stringr libraries: stringr lets us use regular expressions to filter data, and lubridate provides easy-to-use, nicely abstracted functions for transforming the data into time formats. Because the timestamp is in AM/PM format, we first convert everything to twenty-four-hour time so the analyst can tell morning from evening; otherwise the data would all show a 12-hour clock with no distinction. Once this is done, we can perform calculations on the data. We also extract the hours, minutes, and seconds of each crime.

crimeDates <- as.Date(chicago$Date, "%m/%d/%Y")
twentyFourHourTime <- as.POSIXct(chicago$Date, format = "%m/%d/%Y %I:%M:%S %p")
chicagoTimes <- unlist(str_extract_all(twentyFourHourTime, pattern="\\d{1,2}\\:\\d{1,2}\\:\\d{1,2}"))
partOfDay <- unlist(str_extract_all(chicago$Date, pattern="[[:alpha:]]{2}"))
dayOfWeek <- wday(crimeDates, label = TRUE)
crimeMonth <- month(crimeDates, label = TRUE)
crimeTimeStamp <- hms(chicagoTimes)
crimeHr <- hour(crimeTimeStamp)
crimeMin <- minute(crimeTimeStamp)
crimeSec <- second(crimeTimeStamp)

Once the data has been cleaned, we need to put it into a new data frame for further analysis.

cleanedData <- data.frame(
crimeType = chicago$Primary.Type,
wasArrestMade = chicago$Arrest,
district = chicago$District,
location = chicago$Location.Description,
crimeDates,
dayOfWeek,
crimeMonth,
crimeHr,
crimeMin,
crimeSec,
latitude = chicago$Latitude,
longitude =chicago$Longitude
)

For the purpose of adding the data into the shared database, we need to add the time and the crime ID.

cleanedData$offenseid <- chicago$ID
cleanedData$offenselocaltime <- chicagoTimes

Create a dataframe in the format expected in the database

dataForDB <- data.frame (
cityCode = "CHI",
offenseid = cleanedData$offenseid,
offense = cleanedData$crimeType,
offenselocaldate =cleanedData$crimeDates,
offenselocaltime = cleanedData$offenselocaltime,
offensehour= cleanedData$crimeHr,
offensemonth= cleanedData$crimeMonth,
offensearea = cleanedData$location
)

dataForDB$dayofweek <- weekdays(as.Date(crimeDates))

dataForDB <- dataForDB[c(1,2,3,4,5,9,6,7,8)]
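Reordering by numeric position works, but it silently breaks if a column is added or moved; an equivalent name-based sketch uses dplyr::select:

# Equivalent, name-based column ordering (sketch).
dataForDB <- dataForDB %>% select(cityCode, offenseid, offense, offenselocaldate, 
    offenselocaltime, dayofweek, offensehour, offensemonth, offensearea)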

Create the connection to the database, and insert the data

conn <- dbConnect(MySQL(),
                 user = 'mygroup',
                 password = 'mygroup#01',
                 host = 'mygroup.c5rotlbjbl71.us-east-1.rds.amazonaws.com',
                 dbname = 'crimedb')

dbWriteTable(conn, value = dataForDB, name = "crimedata", row.names=F, append = TRUE ) 
## [1] TRUE

Analysis

What we are most interested in for the analysis is finding how the primary crime types compare in terms of arrest rates. First, we need to group the data by the primary type and arrest data (true or false). We can then count how many of each occurred.

table(cleanedData$crimeType, cleanedData$wasArrestMade )
##                                    
##                                     false true
##   ARSON                                27    1
##   ASSAULT                             967  236
##   BATTERY                            2815  772
##   BURGLARY                           1261   66
##   CONCEALED CARRY LICENSE VIOLATION     0    2
##   CRIM SEXUAL ASSAULT                  88    5
##   CRIMINAL DAMAGE                    2286  126
##   CRIMINAL TRESPASS                   160  273
##   DECEPTIVE PRACTICE                 1173   54
##   GAMBLING                              0    2
##   HOMICIDE                             26    7
##   INTERFERENCE WITH PUBLIC OFFICER      3   64
##   INTIMIDATION                          9    2
##   KIDNAPPING                           18    1
##   LIQUOR LAW VIOLATION                  0   10
##   MOTOR VEHICLE THEFT                 832   65
##   NARCOTICS                             0 1264
##   NON-CRIMINAL                          2    0
##   NON - CRIMINAL                        1    0
##   OBSCENITY                             2    1
##   OFFENSE INVOLVING CHILDREN          148   23
##   OTHER OFFENSE                      1004  240
##   PROSTITUTION                          0   65
##   PUBLIC INDECENCY                      0    2
##   PUBLIC PEACE VIOLATION               47   93
##   ROBBERY                             928   70
##   SEX OFFENSE                          47   14
##   STALKING                              9    0
##   THEFT                              4345  548
##   WEAPONS VIOLATION                    37  128

Here we see that the narcotics crime type has the highest number of arrests, while the theft crime type has the highest number of non-arrests.
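Raw counts favor frequent crime types; a sketch of the corresponding arrest rate per crime type (assuming the arrest flag holds the strings "true"/"false", as shown above) makes the comparison more direct:

# Sketch: arrest rate (share of incidents that ended in an arrest) per crime type.
arrestRates <- cleanedData %>% group_by(crimeType) %>% 
    summarise(total = n(), arrests = sum(wasArrestMade == "true")) %>% 
    mutate(arrestRate = round(arrests/total, 2)) %>% arrange(desc(arrestRate))
head(arrestRates)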

We can then find out where one is most likely to get arrested by grouping the data by crime type, arrest boolean, and location. Once this is done, we can compare arrests against non-arrests. The following barplot shows arrests that did occur as “true”, and arrests that did not occur as “false”.

arrestByLocation <- table(cleanedData$wasArrestMade, cleanedData$location )
barplot(arrestByLocation, main="Arrests By Location", xlab="Location", col=c("darkblue","red"), legend = rownames(arrestByLocation))

The same data can be viewed in a neater format via a group_by operation.

groupedCrimesByLocation <- cleanedData %>% group_by(crimeType, wasArrestMade, location) %>% summarise(Total=n())

arrestOnly <- filter(groupedCrimesByLocation, wasArrestMade %in% "true")

arrestOnly
## Source: local data frame [407 x 4]
## Groups: crimeType, wasArrestMade [27]
## 
##    crimeType wasArrestMade                                    location
##       <fctr>        <fctr>                                      <fctr>
## 1      ARSON          true                                   APARTMENT
## 2    ASSAULT          true AIRPORT BUILDING NON-TERMINAL - SECURE AREA
## 3    ASSAULT          true                                       ALLEY
## 4    ASSAULT          true                                   APARTMENT
## 5    ASSAULT          true                                        BANK
## 6    ASSAULT          true                               BAR OR TAVERN
## 7    ASSAULT          true                                    CAR WASH
## 8    ASSAULT          true                               CHA APARTMENT
## 9    ASSAULT          true              CHA HALLWAY/STAIRWELL/ELEVATOR
## 10   ASSAULT          true                     CHA PARKING LOT/GROUNDS
## # ... with 397 more rows, and 1 more variables: Total <int>
arrestOnly[which.max(arrestOnly$Total),]
## Source: local data frame [1 x 4]
## Groups: crimeType, wasArrestMade [1]
## 
##   crimeType wasArrestMade location Total
##      <fctr>        <fctr>   <fctr> <int>
## 1 NARCOTICS          true   STREET   424
noArrest <- filter(groupedCrimesByLocation, wasArrestMade %in% "false")

noArrest
## Source: local data frame [606 x 4]
## Groups: crimeType, wasArrestMade [24]
## 
##    crimeType wasArrestMade                          location Total
##       <fctr>        <fctr>                            <fctr> <int>
## 1      ARSON         false                             ALLEY     1
## 2      ARSON         false                         APARTMENT     3
## 3      ARSON         false CHURCH/SYNAGOGUE/PLACE OF WORSHIP     1
## 4      ARSON         false                 CONVENIENCE STORE     1
## 5      ARSON         false            DRIVEWAY - RESIDENTIAL     1
## 6      ARSON         false                       GAS STATION     1
## 7      ARSON         false                GROCERY FOOD STORE     1
## 8      ARSON         false                             OTHER     1
## 9      ARSON         false    PARKING LOT/GARAGE(NON.RESID.)     3
## 10     ARSON         false                         RESIDENCE     5
## # ... with 596 more rows
noArrest[which.max(noArrest$Total),]
## Source: local data frame [1 x 4]
## Groups: crimeType, wasArrestMade [1]
## 
##   crimeType wasArrestMade location Total
##      <fctr>        <fctr>   <fctr> <int>
## 1     THEFT         false   STREET  1205

As the data shows, more people are likely to be arrested for a narcotics charge on the street. It also shows that one has a greater chance of not being arrested for theft on the street as well.

Speaking of arrests, one interesting observation would be to find what day of the week someone is likely to be arrested.

table(cleanedData$wasArrestMade, cleanedData$dayOfWeek )
##        
##          Sun  Mon Tues  Wed Thurs  Fri  Sat
##   false 2083 2095 2775 2749  2277 2072 2184
##   true   566  490  773  706   523  509  567

Based on this information, Tuesday is the day on which most people are arrested in this sample.

Now, how do we find the average time at which these crimes occur? We need the mean of the timestamps from the Chicago data (note that this covers all incidents, not just arrests). Since we converted the 12-hour times to 24-hour time, we can take the mean of that variable directly.

mean(twentyFourHourTime)
## [1] "2015-12-15 13:59:45 EST"

The mean timestamp in this sample falls on 12/15/15 at about 1:59 PM.
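Because the mean above runs over full date-times, it mixes the calendar date with the clock time; a sketch of the average time of day alone, using the hour/minute/second columns extracted earlier:

# Sketch: mean clock time of the incidents, ignoring the date part.
meanSeconds <- mean(crimeHr * 3600 + crimeMin * 60 + crimeSec, na.rm = TRUE)
seconds_to_period(meanSeconds)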


Mapping Crime Data For US Cities

Mapping for East Coast City New York City (NYC)

nyc_data <- getURL('https://raw.githubusercontent.com/bvshyam/Project3_cuny_607/master/Data/nyc_dec2015.csv', ssl.verifyhost=F, ssl.verifypeer=F)

data.nyc <- read.csv(textConnection(nyc_data), header=T)
data.nyc_f <- data.nyc[complete.cases(data.nyc),]


#extracting lat and lon
dataloc <- data.frame(lat = stri_extract_first(data.nyc_f$Location.1, regex="\\d{1,}.\\d{1,}"),
           lon = stri_extract_last(data.nyc_f$Location.1, regex = "\\-\\d{1,}.\\d{1,}"))
head(dataloc)
##                lat               lon
## 1 40.5939184530001 -74.0756886519999
## 2 40.5991869440001     -73.739410553
## 3 40.6506195700001 -73.9538766169999
## 4 40.7528934640001     -73.869474086
## 5      40.66784984 -73.7991105359999
## 6     40.719515733      -73.95297396
#subsetting and changing datatype
data.nyc_f <- data.nyc_f[,c(4,12,20)]
str(data.nyc_f)
## 'data.frame':    8055 obs. of  3 variables:
##  $ Day.of.Week: Factor w/ 7 levels "Friday","Monday",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Offense    : Factor w/ 7 levels "BURGLARY","FELONY ASSAULT",..: 4 3 3 3 3 4 3 3 2 2 ...
##  $ Location.1 : Factor w/ 6495 levels "(40.503177034, -74.231875648)",..: 204 271 1048 3938 1422 3059 3996 4656 5819 5819 ...
data.nyc_f$Offense<- as.character(data.nyc_f$Offense)
data.nyc_f$Day.of.Week<- as.character(data.nyc_f$Day.of.Week)
#combining with latitude data
df2 <- data.frame(cbind(data.nyc_f$Offense,data.nyc_f$Day.of.Week,dataloc[]))
#changing datatype
df2$data.nyc_f.Offense <- as.character(df2$data.nyc_f.Offense)
df2$data.nyc_f.Day.of.Week <- as.character(df2$data.nyc_f.Day.of.Week)
df2$lat <- as.numeric(as.character(df2$lat))
df2$lon <- as.numeric(as.character(df2$lon))


str(df2)
## 'data.frame':    8055 obs. of  4 variables:
##  $ data.nyc_f.Offense    : chr  "GRAND LARCENY OF MOTOR VEHICLE" "GRAND LARCENY" "GRAND LARCENY" "GRAND LARCENY" ...
##  $ data.nyc_f.Day.of.Week: chr  "Tuesday" "Tuesday" "Tuesday" "Tuesday" ...
##  $ lat                   : num  40.6 40.6 40.7 40.8 40.7 ...
##  $ lon                   : num  -74.1 -73.7 -74 -73.9 -73.8 ...
#subsetting data based on more violent crime
dv.nyc <- df2[which(df2$data.nyc_f.Offense == 'ASSAULT' | df2$data.nyc_f.Offense =='ROBBERY'),]
str(dv.nyc )
## 'data.frame':    1524 obs. of  4 variables:
##  $ data.nyc_f.Offense    : chr  "ROBBERY" "ROBBERY" "ROBBERY" "ROBBERY" ...
##  $ data.nyc_f.Day.of.Week: chr  "Tuesday" "Tuesday" "Tuesday" "Tuesday" ...
##  $ lat                   : num  40.8 40.8 40.6 40.8 40.7 ...
##  $ lon                   : num  -73.9 -73.9 -74.1 -73.9 -74 ...
#mapping 
map <- get_map('nyc', zoom = 15, maptype = 'roadmap')
ggmap(map, extent = "device") + stat_density2d(
    aes(x = lon, y = lat, fill = ..level.., alpha = ..level..), 
    size = 2, bins = 6, data = dv.nyc, geom = "polygon")
## Warning: Removed 1507 rows containing non-finite values (stat_density2d).
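The warning above reflects the tight map extent: at zoom = 15 only a few blocks around the geocoded center are covered, so most of the 1,524 points fall outside the map and are dropped. A sketch with a wider zoom keeps far more of them on the map:

# Sketch: a wider map (zoom = 11 roughly covers the five boroughs) drops far fewer points.
map_wide <- get_map('nyc', zoom = 11, maptype = 'roadmap')
ggmap(map_wide, extent = "device") + stat_density2d(
    aes(x = lon, y = lat, fill = ..level.., alpha = ..level..), 
    size = 2, bins = 6, data = dv.nyc, geom = "polygon")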


Mapping for MidWest City Chicago

c_data <- getURL('https://raw.githubusercontent.com/bvshyam/Project3_cuny_607/master/Data/chicago_dec2015.csv', ssl.verifyhost=F, ssl.verifypeer=F)

 

data.chi <- read.csv(textConnection(c_data), header=T)

df <- data.chi[,c(6,22)]

df <- df[complete.cases(df),]

df$Location <- as.character(df$Location)

str(df)
## 'data.frame':    20369 obs. of  2 variables:
##  $ Primary.Type: Factor w/ 30 levels "ARSON","ASSAULT",..: 9 3 9 9 21 9 22 27 27 29 ...
##  $ Location    : chr  "" "" "" "" ...
#extracting latitude and longitude values

dataloc <- data.frame(lat = stri_extract_first(df$Location, regex = "\\d{1,}.\\d{1,}"),

           lon = stri_extract_last(df$Location, regex = "\\-\\d{1,}.\\d{1,}"))

#changing to numeric type

dataloc$lat <- as.numeric(as.character(dataloc$lat))

dataloc$lon <- as.numeric(as.character(dataloc$lon))

 

#recreating data frame with latitude and longitude column

fndata <- data.frame(cbind(df$Primary.Type,dataloc[]))

finaldata <- fndata[complete.cases(dataloc),]

str(finaldata)
## 'data.frame':    19910 obs. of  3 variables:
##  $ df.Primary.Type: Factor w/ 30 levels "ARSON","ASSAULT",..: 9 22 27 27 29 29 7 9 9 9 ...
##  $ lat            : num  41.9 41.9 41.9 41.9 41.8 ...
##  $ lon            : num  -87.7 -87.7 -87.7 -87.7 -87.6 ...
#subsetting data based on more violent crime

dv.chicago <- finaldata[which(finaldata$df.Primary.Type == 'ASSAULT' | finaldata$df.Primary.Type =='ROBBERY'),]

str(dv.chicago)
## 'data.frame':    2176 obs. of  3 variables:
##  $ df.Primary.Type: Factor w/ 30 levels "ARSON","ASSAULT",..: 2 26 26 2 2 26 26 26 26 26 ...
##  $ lat            : num  41.8 41.8 41.7 41.7 41.8 ...
##  $ lon            : num  -87.6 -87.7 -87.7 -87.6 -87.7 ...
#mapping violent crime density

map <- get_map('chicago', zoom = 15, maptype = 'roadmap')

ggmap(map, extent = "device") + stat_density2d(
    aes(x = lon, y = lat, fill = ..level.., alpha = ..level..), 
    size = 2, bins = 6, data = dv.chicago, geom = "polygon")
## Warning: Removed 2131 rows containing non-finite values (stat_density2d).


Mapping for West Coast City San Francisco (SFO)

#data load and subsetting

#d <- getURL("https://raw.githubusercontent.com/bvshyam/Project3_cuny_607/master/Data/SFPD_Incidents_-_from_1_January_2003.csv", ssl.verifyhost=F, ssl.verifypeer=F)


d <- getURL("https://raw.githubusercontent.com/bvshyam/Project3_cuny_607/master/Data/sfo_dec2015.csv", ssl.verifyhost=F, ssl.verifypeer=F)



d <- read.csv(textConnection(d), header=T)

 

d <- d[complete.cases(d),]

d <- d[,c(2:6,10,11)]

 

#changing column name

colnames(d)[6]<- "lon"

colnames(d)[7]<- "lat"

str(d)
## 'data.frame':    11394 obs. of  7 variables:
##  $ Category : Factor w/ 36 levels "ARSON","ASSAULT",..: 17 17 20 21 19 26 27 21 21 17 ...
##  $ Descript : Factor w/ 428 levels "ABANDONMENT OF CHILD",..: 278 278 238 211 194 225 353 197 175 277 ...
##  $ DayOfWeek: Factor w/ 7 levels "Friday","Monday",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Date     : Factor w/ 31 levels "12/01/2015","12/02/2015",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Time     : Factor w/ 1194 levels "00:01","00:02",..: 986 646 570 1 1 1 1 540 1 382 ...
##  $ lon      : num  -122 -122 -122 -122 -122 ...
##  $ lat      : num  37.8 37.8 37.8 37.8 37.7 ...
#subsetting data based on more violent crime

dv <- d[which(d$Category == 'ASSAULT' | d$Category =='ROBBERY'),]

str(dv)
## 'data.frame':    1232 obs. of  7 variables:
##  $ Category : Factor w/ 36 levels "ARSON","ASSAULT",..: 2 2 2 2 2 2 2 2 2 24 ...
##  $ Descript : Factor w/ 428 levels "ABANDONMENT OF CHILD",..: 393 393 393 61 393 150 172 61 393 48 ...
##  $ DayOfWeek: Factor w/ 7 levels "Friday","Monday",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Date     : Factor w/ 31 levels "12/01/2015","12/02/2015",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Time     : Factor w/ 1194 levels "00:01","00:02",..: 1 1 1 1 1 1 763 249 894 931 ...
##  $ lon      : num  -122 -122 -122 -122 -122 ...
##  $ lat      : num  37.8 37.8 37.8 37.8 37.8 ...
#mapping with dayofweek

map <- get_map('San Francisco', zoom = 12, maptype = 'roadmap')

ggmap(map, extent = "device") + stat_density2d(
    aes(x = lon, y = lat, fill = ..level.., alpha = ..level..), 
    size = 2, bins = 4, data = dv, geom = "polygon") + facet_wrap(~DayOfWeek)

#mapping with area and legend

ggmap(map, extent = "device", legend = "topleft") + stat_density2d(
    aes(x = lon, y = lat, fill = ..level.., alpha = ..level..), 
    size = 2, bins = 6, data = dv, geom = "polygon")


Crime Compare for Major US Cities, West Coast SFO, East Coast NYC, Mid-West Chicago

Loading Captured Crime Data From All Cities

Loading the crime data captured for the different US cities: East Coast NYC, West Coast SFO, and Midwest Chicago.

# DATE_FORMAT(offenselocaltime,'%H:%i:%s')

if (isIdCurrent(conn)) {
    allcitycrimedata <- dbGetQuery(conn, "select crimedata.citycode, offenseid, offense, offenselocaldate, offenselocaltime, dayofweek, offensehour, offensemonth, offensearea from crimedata, city where city.citycode = crimedata.citycode")
    
}

# View(allcitycrimedata)


# Categorizing the offenses at a broad level: adding a category column to the data frame


for (i in seq(1, nrow(allcitycrimedata), 1)) {
    
    if (grepl("LARCENY|ROBBERY|THEFT|STOLEN|EMBEZZLEMENT", allcitycrimedata$offense[i], 
        ignore.case = TRUE)) {
        allcitycrimedata$category[i] <- "LARCENY"
    } else if (grepl("RAPE|SEX OFFENSE|SEXUAL|BATTERY|PROSTITUTION", allcitycrimedata$offense[i], 
        ignore.case = TRUE)) {
        allcitycrimedata$category[i] <- "SEX OFFENSE"
    } else if (grepl("BURGLARY", allcitycrimedata$offense[i], ignore.case = TRUE)) {
        allcitycrimedata$category[i] <- "BURGLARY"
    } else if (grepl("MURDER|HOMICIDE", allcitycrimedata$offense[i], ignore.case = TRUE)) {
        allcitycrimedata$category[i] <- "MURDER"
    } else if (grepl("NARCOTIC|LIQUOR", allcitycrimedata$offense[i], ignore.case = TRUE)) {
        allcitycrimedata$category[i] <- "NARCOTIC"
    } else if (grepl("OTHER|SUICIDE|NON CRIMINAL|NON-CRIMINAL|NON - CRIMINAL|SUSPICIOUS|DECEPTIVE|INTERFERENCE|INTIMIDATION|STALKING|TRESSPASS|SECONDARY|RUNAWAY|MISSING|TRESPASS", 
        allcitycrimedata$offense[i], ignore.case = TRUE)) {
        allcitycrimedata$category[i] <- "OTHER"
    } else if (grepl("FORGERY|FRAUD|BAD CHECKS|EXTORTION", allcitycrimedata$offense[i], 
        ignore.case = TRUE)) {
        allcitycrimedata$category[i] <- "FRAUD"
    } else if (grepl("PUBLIC INDECENCY|OBSCENITY|DRUNKENNESS|CONDUCT|PUBLIC PEACE|WARRANTS|WEAPON|VEHICLE|LICENSE|DRIVING|BRIBERY|GAMBLING", 
        allcitycrimedata$offense[i], ignore.case = TRUE)) {
        allcitycrimedata$category[i] <- "MISCONDUCT"
    } else if (grepl("FAMILY|KIDNAPPING", allcitycrimedata$offense[i], ignore.case = TRUE)) {
        allcitycrimedata$category[i] <- "FAMILY"
    } else if (grepl("CHILDREN", allcitycrimedata$offense[i], ignore.case = TRUE)) {
        allcitycrimedata$category[i] <- "CHILD ABUSE"
    } else if (grepl("VANDALISM|DAMAGE|ARSON", allcitycrimedata$offense[i], ignore.case = TRUE)) {
        allcitycrimedata$category[i] <- "VANDALISM"
    } else if (grepl("ASSAULT", allcitycrimedata$offense[i], ignore.case = TRUE)) {
        allcitycrimedata$category[i] <- "ASSAULT"
    }
    
}  # end of for loop


datatable(allcitycrimedata)
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## http://rstudio.github.io/DT/server.html
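The row-by-row loop above gets slow as the extract grows; a vectorized sketch of the same mapping (same regular expressions, first match wins) uses dplyr::case_when. Only the first few categories are shown here; the remaining branches would follow the identical pattern, in the same order as the loop.

# Vectorized sketch of the categorization above (abbreviated).
allcitycrimedata$category <- case_when(
    grepl("LARCENY|ROBBERY|THEFT|STOLEN|EMBEZZLEMENT", allcitycrimedata$offense, ignore.case = TRUE) ~ "LARCENY", 
    grepl("RAPE|SEX OFFENSE|SEXUAL|BATTERY|PROSTITUTION", allcitycrimedata$offense, ignore.case = TRUE) ~ "SEX OFFENSE", 
    grepl("BURGLARY", allcitycrimedata$offense, ignore.case = TRUE) ~ "BURGLARY", 
    grepl("MURDER|HOMICIDE", allcitycrimedata$offense, ignore.case = TRUE) ~ "MURDER", 
    # ... remaining categories (NARCOTIC, OTHER, FRAUD, MISCONDUCT, FAMILY,
    # CHILD ABUSE, VANDALISM, ASSAULT) follow the same grepl pattern ...
    TRUE ~ NA_character_)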

Data Manipulation And Plots For Crime Compare Across Cities

Grouping the data retrieved from the database by city and, within each city, by offense category.

crimedata_cityDF <- allcitycrimedata %>% select(citycode, category) %>% group_by(citycode, 
    category) %>% dplyr::summarise(total = n())

datatable(crimedata_cityDF)

Plot Offense Vs Cities

ggplot(crimedata_cityDF, aes(x = citycode, y = total, fill = category)) + 
    ggtitle("Offense In US Cities (East Coast, MidWest, West Coast)") + geom_bar(colour = "black", stat = "identity") + coord_flip() + 
    xlab("Cities") + ylab("Offense Count") + scale_fill_brewer(palette = "Paired")

Findings : From the bar plot comparing the cities across offense categories, we see that Chicago has the highest crime count, almost double that of San Francisco and more than double that of New York City. Larceny is the highest offense in Chicago, followed by New York City. Chicago also leads in burglary and sex offenses; in fact, Chicago leads in almost all types of crime.

SFO has a high share of smaller crimes such as suspicious activity, trespassing, non-criminal offenses, stalking, deception, interference, and intimidation. San Francisco also leads in misconduct-type offenses such as obscenity, drunkenness, public indecency, DUI and other driving-related offenses, and narcotics.

The safest city appears to be New York City, compared with San Francisco and Chicago.

Close database connection

# remove SFO tables and variables

remove(sf_crime_db, sf_crime_db_1, sfo_crime_hour, topcrimes, entdbtables, a, b, 
    a1, b1, category_count, category_date, crimesbyday, mostcrimes, sf_crime.df, 
    sf_crime_changes, i, p, sf_crim.json, sf_crimedata)
dbDisconnect(conn)
## [1] TRUE
##################################################################### 

Motivation

The motivation of the project has been to demonstrate teamwork and the soft skills that form the foundation of a good data scientist, and, through collaboration, to study crime data and come up with insights.

Efforts

We collaborated as a team of Kumudini Bhave, Shyam BV, and Mark Halpin since Oct 16, later joined by Upal Chowdhury. Our team demonstrated great teamwork, with sensible work distribution, accommodation, tolerance, and healthy collaboration.

Shyam and Mark did great work finding interesting datasets and worked on and analyzed the SFO and Chicago crime datasets. Kumudini contributed by working on and analyzing the NYC crime data, producing the cross-city plots, and integrating the individual Rmds. Upal was instrumental in setting up the much-needed database in the cloud and helping us all stay in sync; having a database in the cloud made it possible to keep all the data in one place and reinforced the strength of the project. Upal also produced some interesting mappings of the SFO, NYC, and Chicago crime datasets, and prepared the ER diagram and workflow diagram.

Difficulties Faced

We had an initial dilemma: we started off with multi-year crime data but over time realized that only a month's data could be handled by the environment. Reaching that realization consumed some time.

We also had issues with the integration of the Rmd files, as the JSON data loading would time out.

Findings

With our teamwork, we were successfully able to study crime across major cities on the East Coast, West Coast, and Midwest of the USA. The plots and our findings are captured in the main Rmd file.

Which are the most valued data science skills?

The most valued data science skills: our team worked together and demonstrated the essential and most valued data science skills. We found an apt dataset for our project, one that engaged our curiosity about crime in different cities. We worked through unstructured and structured data; we scraped data in different formats (XML, CSV, JSON); we worked in SQL and posted to relational databases; we used R for coding; and we worked on large datasets and performed data mining. Lastly, and most importantly, as a team dispersed globally across time zones, we worked around the clock to deliver a united effort.