This is an R Markdown document for providing documentation for performing Data Exploration And Analysis Of the Crime DataSet of publicly available crime data for New York City and San Francisco and Chicago
knitr::opts_chunk$set(message = FALSE, echo = TRUE)
# Library for string manipulation/regex operations
library(stringr)
# Library for data display in tabular format
library(DT)
# Library to read text file
library(RCurl)
# Library to gather (to long format) and spread (to wide format) data, to tidy data
library(tidyr)
# Library to filter, transform data
library(dplyr)
# Library to plot
library(ggplot2)
library(knitr)
# Library for db operations
library(RMySQL)
# Library for loading data
library(jsonlite)
library(XML)
library(xml2)
library(lubridate)
library(plotly)
library(bitops)
library(stringi)
library(ggmap)
library(grid)
Database MYSQL set up in cloud. Connection and access to database can be obtained as follows through amazon web services. @schema crimedb
The connection object ‘conn’ will further be used for querying and manipulating database tables.
conn <- dbConnect(MySQL(), user = "mygroup", password = "mygroup#01", host = "mygroup.c5rotlbjbl71.us-east-1.rds.amazonaws.com",
dbname = "crimedb")
# List the database tables in crime schema
entdbtables <- dbListTables(conn)
############################################################################# FOR TEST PURPOSE OF RMD, NEED TO DELETE EXISTING DATA TO AVOID INTEGRITY
############################################################################# CONSTRAINT ERRORS
# delnycdata <- dbGetQuery(conn,'DELETE FROM crimedata WHERE citycode = 'NYC'')
##################################################################################
This JSON has two parts. First section has the meta data and second section will have the actual data
# JSON Load
sf_crim.json <- fromJSON("https://raw.githubusercontent.com/bvshyam/Project3_cuny_607/master/Data/sfo_dec2015.json")
# Look for data in JSON
sf_crimedata <- sf_crim.json[["data"]]
# Select the necessary columns in the JSON
sf_crime.df <- data.frame(offenseid = sapply(sf_crimedata, function(x) x[[1]]), Category = sapply(sf_crimedata,
function(x) x[[10]]), DayOfWeek = sapply(sf_crimedata, function(x) x[[12]]),
Date = as.Date(substr(sapply(sf_crimedata, function(x) x[[13]]), 1, 10), "%Y-%m-%d"),
Time = sapply(sf_crimedata, function(x) x[[14]]), PdDistrict = sapply(sf_crimedata,
function(x) x[[15]]), Location = paste(sapply(sf_crimedata, function(x) x[[19]]),
",", sapply(sf_crimedata, function(x) x[[18]])), month = month.abb[month(as.Date(substr(sapply(sf_crimedata,
function(x) x[[13]]), 1, 10), "%Y-%m-%d"))], offensearea = sapply(sf_crimedata,
function(x) x[[15]]))
sf_crime_changes <- sf_crime.df
# Adding Time period for each day
sf_crime_changes <- mutate(sf_crime_changes, Hour = round(as.numeric(str_replace_all(sf_crime_changes$Time,
":", ".")))) %>% mutate(Date = as.Date(sf_crime_changes$Date, "%m/%d/%Y"))
sf_crime_changes <- sf_crime_changes %>% mutate(timeframe = sapply(sf_crime_changes$Hour,
function(x) {
if (x > 6 & x <= 11) {
"Morning"
} else if (x > 11 & x <= 16) {
"Noon"
} else if (x > 16 & x <= 20) {
"Evening"
} else if (x <= 6 | x > 20) {
"Night"
}
}))
# Selecing the columns and calculating the additional columns
sf_crime_db <- select(sf_crime_changes, offenseid = offenseid, offense = Category,
offensedate = Date, offensetime = Time, dayofweek = DayOfWeek, offensehour = Hour,
offensemonth = month, offensearea = offensearea) %>% mutate(citycode = "SFO")
# Arrange the columns for database insert
sf_crime_db_1 <- select(sf_crime_db, citycode, offenseid, offense, offenselocaldate = offensedate,
offenselocaltime = offensetime, dayofweek, offensehour, offensemonth, offensearea)
# Bulk insert from dataframe to MYSQL table called crimedata
dbWriteTable(conn, value = sf_crime_db_1, name = "crimedata", row.names = F, append = TRUE)
## [1] TRUE
Now we have the required data. We will make some analysis and plot the graphs.
# Select the required columns for comparing hour and offense
sfo_crime_hour <- sf_crime_changes %>% select(Hour, Category) %>% group_by(Hour,
Category) %>% summarise(total = n()) %>% dplyr::arrange(total)
# Hour vs Offense
p <- ggplot(sfo_crime_hour, aes(x = sfo_crime_hour$Hour, y = sfo_crime_hour$total)) +
geom_line(aes(color = Category)) + ggtitle("Hour vs Offense") + xlab("Hour") +
ylab("Offense")
ggplotly(p)
# Analyze Crime by offense
category_count <- table(sf_crime_changes$Category) %>% data.frame() %>% setNames(c("Offense",
"Frequency"))
datatable(category_count)
# Plot a scattered plot by comparing Offense vs Frequency(occurrence)
a <- list(autotick = TRUE, tick0 = 0, dtick = 1, size = 24, showticklabels = FALSE,
showlegend = FALSE, tickcolor = toRGB("white"), zeroline = FALSE, showline = FALSE,
showticklabels = FALSE, showgrid = FALSE, visible = FALSE, title = "Offense")
b <- list(title = "Crime count")
plot_ly(x = ~category_count$Offense, y = ~category_count$Frequency, type = "scatter",
color = category_count$Offense) %>% layout(xaxis = a, yaxis = b, title = "Offense vs Frequency")
# Plot a scattered plot by comparing Offense Date vs Frequency(occurrence)
category_date <- table(sf_crime_changes$Date) %>% data.frame() %>% setNames(c("Offense_date",
"Frequency"))
category_date$Offense_date <- str_replace(category_date$Offense_date, "/2015|2015-",
"")
datatable(category_date)
# Plot a scattered plot by comparing Offense Date vs Frequency(occurrence)
a1 <- list(autotick = TRUE, tick0 = 0, dtick = 1, size = 24, showlegend = FALSE,
tickangle = 45, title = FALSE)
b1 <- list(title = "Crime count")
plot_ly(x = ~category_date$Offense_date, y = ~category_date$Frequency, type = "area") %>%
layout(xaxis = a1, yaxis = b1, title = "Date vs Frequency")
# Crime according to timeframe and Day
crimesbyday <- table(sf_crime_changes$DayOfWeek, sf_crime_changes$timeframe) %>%
data.frame() %>% setNames(c("Day", "Day_category", "Frequency"))
datatable(crimesbyday)
# Top crimes greater than 5%
mostcrimes <- table(sf_crime_changes$DayOfWeek, sf_crime_changes$Category) %>% data.frame() %>%
spread(Var1, Freq) %>% mutate(summary = rowSums(.[, 2:8], 1))
topcrimes <- mostcrimes[(mostcrimes$summary/sum(mostcrimes$summary)) > 0.05, ] %>%
data.frame()
datatable(topcrimes)
Loading the crime data. Reading data file in XML format, from the GitHub location with Header as True
knitr::opts_chunk$set(message = FALSE, echo = TRUE)
################################ XML #######################################
nycxml.giturl <- "https://raw.githubusercontent.com/bvshyam/Project3_cuny_607/master/Data/nyc_dec2015.xml"
nycdataxml <- xmlRoot(xmlParse(getURL(nycxml.giturl))) # get XML file contents
class(nycdataxml)
FALSE [1] "XMLInternalElementNode" "XMLInternalNode"
FALSE [3] "XMLAbstractNode"
xmlName(nycdataxml)
FALSE [1] "response"
xmlSize(nycdataxml)
FALSE [1] 1
xmlSize(nycdataxml[1])
FALSE [1] 1
xmlSize(nycdataxml[[1]])
FALSE [1] 8055
nycdataxmlDFchk <- xmlToDataFrame(nycdataxml[[1]])
# The number of rows available in datasw
nrow(nycdataxmlDFchk)
FALSE [1] 8055
head(nycdataxmlDFchk)
FALSE objectid identifier occurrence_date day_of_week occurrence_month
FALSE 1 1023513 e9e03977 2015-12-01T00:00:00 Tuesday Dec
FALSE 2 1023514 127584fd 2015-12-01T00:00:00 Tuesday Dec
FALSE 3 1023515 8a7b0245 2015-12-01T00:00:00 Tuesday Dec
FALSE 4 1023516 b1adaa5e 2015-12-01T00:00:00 Tuesday Dec
FALSE 5 1023517 8cfddfd5 2015-12-01T00:01:00 Tuesday Dec
FALSE 6 1023518 eea9fa7b 2015-12-01T00:01:00 Tuesday Dec
FALSE occurrence_day occurrence_year occurrence_hour compstat_month
FALSE 1 1 2015 0 12
FALSE 2 1 2015 0 12
FALSE 3 1 2015 0 12
FALSE 4 1 2015 0 12
FALSE 5 1 2015 0 12
FALSE 6 1 2015 0 12
FALSE compstat_day compstat_year offense
FALSE 1 1 2015 GRAND LARCENY OF MOTOR VEHICLE
FALSE 2 31 2015 GRAND LARCENY
FALSE 3 11 2015 GRAND LARCENY
FALSE 4 4 2015 GRAND LARCENY
FALSE 5 3 2015 GRAND LARCENY
FALSE 6 1 2015 GRAND LARCENY OF MOTOR VEHICLE
FALSE offense_classification sector precinct borough jurisdiction
FALSE 1 FELONY H 122 STATEN ISLAND N.Y. POLICE DEPT
FALSE 2 FELONY A 101 QUEENS N.Y. POLICE DEPT
FALSE 3 FELONY A 67 BROOKLYN N.Y. POLICE DEPT
FALSE 4 FELONY H 115 QUEENS N.Y. POLICE DEPT
FALSE 5 FELONY A 113 QUEENS N.Y. POLICE DEPT
FALSE 6 FELONY H 94 BROOKLYN N.Y. POLICE DEPT
FALSE xcoordinate ycoordinate location_1
FALSE 1 963230 155663
FALSE 2 1056616 157681
FALSE 3 997049 176315
FALSE 4 1020414 213600
FALSE 5 1039980 182653
FALSE 6 997286 201416
Selecting columns from the NYC dataset for study
nyc_crimedataDF <- nycdataxmlDFchk %>% select(objectid, occurrence_date, day_of_week,
occurrence_month, occurrence_hour, offense, borough)
# mutate to strip time as separate attribute from date.
nyc_crimedataDF <- nyc_crimedataDF %>% mutate(occurrence_time = str_extract(occurrence_date,
"[[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}")) %>% mutate(occurrence_date = str_extract(occurrence_date,
"[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}")) %>% mutate(occurrence_hour = as.numeric(occurrence_hour))
# mutate to add Week# from (0 - 52)
nyc_crimedataDF <- nyc_crimedataDF %>% mutate(week = strftime(as.POSIXlt(occurrence_date),
format = "%W"))
# View of NYC crime data View(nyc_crimedataDF)
datatable(nyc_crimedataDF)
The data loaded in to data frame which is manipulated is now selected for database posting into crimedata table
# Prepare dataframe for database insert
nycdbinsert <- subset(nyc_crimedataDF, select = c(objectid, offense, occurrence_date,
occurrence_time, day_of_week, occurrence_hour, occurrence_month, borough))
# Adding citycode column at start of dataframe
nycdbinsert <- cbind(citycode = "NYC", nycdbinsert)
# matching column names to database table column names
colnames(nycdbinsert) <- c("citycode", "offenseid", "offense", "offenselocaldate",
"offenselocaltime", "dayofweek", "offensehour", "offensemonth", "offensearea")
# Bulk insert from dataframe to MYSQL table
dbWriteTable(conn, value = nycdbinsert, name = "crimedata", row.names = F, append = TRUE)
## [1] TRUE
# View(nycdbinsert)
Performing group by operations on the dataset to grou by borough and within borough, group by offense.
nyc_crimedata_boroughDF <- nyc_crimedataDF %>% select(borough, offense) %>% group_by(borough,
offense) %>% dplyr::summarise(total = n())
datatable(nyc_crimedata_boroughDF)
Performing group by operations on the dataset to grou by Week# and within Week#, group by Offense.
nyc_crimedata_weekDF <- nyc_crimedataDF %>% select(week, offense) %>% group_by(week,
offense) %>% dplyr::summarise(total = n())
datatable(nyc_crimedata_weekDF)
Performing group by operations on the dataset to grou by Hour# and within Hour#, group by Offense.
nyc_crimedata_hourDF <- nyc_crimedataDF %>% select(occurrence_hour, offense) %>%
group_by(occurrence_hour, offense) %>% dplyr::summarise(total = n()) %>% arrange(occurrence_hour)
nyc_crimedata_hourrateDF <- nyc_crimedata_hourDF %>% select(occurrence_hour, total) %>%
group_by(occurrence_hour) %>% dplyr::summarise(offensecount = sum(total)) %>%
arrange(occurrence_hour)
datatable(nyc_crimedata_hourDF)
The New York City Crime Data is studied to find insights that would throw light on crime occurence in the city. The following plots achieve the same.
Plot of ‘Hour’ the crime occurred in the city against the type of ‘Offense’. This details certain patterns for different types of crimes if they occur at certain time of day more than other.
ggplot(nyc_crimedata_hourDF, aes(y = nyc_crimedata_hourDF$total, x = nyc_crimedata_hourDF$occurrence_hour)) + geom_point(aes(color = offense)) +
ggtitle("Hypothesis : Offense Vs Hour Of Day") + facet_grid(~offense) + xlab("Hour Of Day") + ylab("Offense")
Findings :
We observe Burglary happening more between the sleep hours of 12 am to dawn , denoting the crime is more advantageous when it is sleep time or less awareness time.
Assaults also happening during early hours , perhaps denoting early day travelers.
Thefts in general, any kind of larceny occurs more during day time. similarly robbery is being observed to be happening in NYC area prmarily in late evenings.
Heinous crimes of Murder /Sexual assualts are being seen occurrirng in day time however it can be a risk anytime of day.
Plot of ‘Day’ (i.e. the day of the week) the crime occurred in the city against the type of ‘Offense’. This throws light on facts and figures about the crime occurring more on weekdays or weekends , if there is any association between the two factors of the plot.
nyc_crimedata_dayDF <- nyc_crimedataDF %>% select(day_of_week, offense) %>% group_by(day_of_week, offense) %>% dplyr::summarise(daywisetotal = n())
datatable(nyc_crimedata_dayDF)
ggplot(nyc_crimedata_dayDF, aes(y = daywisetotal, x = day_of_week, fill = offense)) + xlab("Day Of Week") + ylab("Offense Count") +
ggtitle("Hypothesis : Offense Vs Day Of Week") + geom_bar(stat = "identity", position = position_dodge()) + scale_fill_brewer(palette = "OrRd") +
scale_x_discrete(limits = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
Findings : We find Grand Larceny being high during weekdays and mid week Robbery as well is high in mid week. Assault is less on weekends comapred to weekday, highlighting that perhaps assaults occur more when people might be usually alone/travelling during weekday rather than weekend when they are more social and acocmpanied with someone.
Comaprison of the offenses occuring in NYC boroughs. Bronx, Brooklyn, Manhattan, Queens, Staten Island
ggplot(nyc_crimedata_boroughDF, aes(x = nyc_crimedata_boroughDF$borough, y = nyc_crimedata_boroughDF$total, fill = nyc_crimedata_boroughDF$offense)) +
ggtitle("Offense In NYC Boroughs") + geom_bar(colour = "black", stat = "identity") + coord_flip() + xlab("Boroughs") +
ylab("Offense") + scale_fill_brewer(palette = "RdPu")
Findings : Brooklyn has the highest Offense count followed by Manhattan. Staten Island is the safest NYC borough
Heinous crime of murder / rape are much lesser compared to Grand Larceny which is high in all boroughs in comaprison to other offenses.
Brooklyn has the highest burglary as well as robbery rate among all the boroughs.
# #########################################################
# Please add Marks code for Chicago here on per the index
# Please follow the heading size protocol as above two datasets
# Number of hashes make the headings and subheadings.
# Please keep page breaks after every section
# And remove this part afer adding code
#########################################################
chicago <- read.csv("https://raw.githubusercontent.com/bvshyam/Project3_cuny_607/master/Data/chicago_dec2015.csv", header=TRUE, sep=",", stringsAsFactors = FALSE,na.strings = c("",NA,"n/a","N/A","NA"))
As the data is in a format of “M/D/Y H:M:S”, we need to break that down and separate the data so we can get more fine grained analysis with it. To do so, we use the lubridate and stringr library. String allows us to use regular expressions to filter data, and lubridate provides easy to use, nicely abstracted functions to transform the data into time formats. As the timestamp is in the AM/PM format, we have to convert everything to twenty four hour time so that you as the analyst can tell the difference. Otherwise, the data will just all show a 12 hr format with no distinction. Once this occurs, we can perform calculations on the data. We are also going to extract out the hours, minutes, and seconds of the crime.
crimeDates <- as.Date(chicago$Date, "%m/%d/%Y")
twentyFourHourTime <- as.POSIXct(chicago$Date, format = "%m/%d/%Y %I:%M:%S %p")
chicagoTimes <- unlist(str_extract_all(twentyFourHourTime, pattern="\\d{1,2}\\:\\d{1,2}\\:\\d{1,2}"))
partOfDay <- unlist(str_extract_all(chicago$Date, pattern="[[:alpha:]]{2}"))
dayOfWeek <- wday(crimeDates, label = TRUE)
crimeMonth <- month(crimeDates, label = TRUE)
crimeTimeStamp <- hms(chicagoTimes)
crimeHr <- hour(crimeTimeStamp)
crimeMin <- minute(crimeTimeStamp)
crimeSec <- second(crimeTimeStamp)
Once the data has been cleaned, we need to put it into a new data frame for further analysis.
cleanedData <- data.frame(
crimeType = chicago$Primary.Type,
wasArrestMade = chicago$Arrest,
district = chicago$District,
location = chicago$Location.Description,
crimeDates,
dayOfWeek,
crimeMonth,
crimeHr,
crimeMin,
crimeSec,
latitude = chicago$Latitude,
longitude =chicago$Longitude
)
For the purpose of adding the data into the shared database, I need to add the time and crime id.
cleanedData$offenseid <- chicago$ID
cleanedData$offenselocaltime <- chicagoTimes
dataForDB <- data.frame (
cityCode = "CHI",
offenseid = cleanedData$offenseid,
offense = cleanedData$crimeType,
offenselocaldate =cleanedData$crimeDates,
offenselocaltime = cleanedData$offenselocaltime,
offensehour= cleanedData$crimeHr,
offensemonth= cleanedData$crimeMonth,
offensearea = cleanedData$location
)
dataForDB$dayofweek <- weekdays(as.Date(crimeDates))
dataForDB <- dataForDB[c(1,2,3,4,5,9,6,7,8)]
Create the connection to the database, and insert the data
conn <- dbConnect(MySQL(),
user = 'mygroup',
password = 'mygroup#01',
host = 'mygroup.c5rotlbjbl71.us-east-1.rds.amazonaws.com',
dbname = 'crimedb')
dbWriteTable(conn, value = dataForDB, name = "crimedata", row.names=F, append = TRUE )
## [1] TRUE
What we are most interested in for the analysis is finding how the primary crime types compare in terms of arrest rates. First, we need to group the data by the primary type and arrest data (true or false). We can then count how many of each occurred.
table(cleanedData$crimeType, cleanedData$wasArrestMade )
##
## false true
## ARSON 27 1
## ASSAULT 967 236
## BATTERY 2815 772
## BURGLARY 1261 66
## CONCEALED CARRY LICENSE VIOLATION 0 2
## CRIM SEXUAL ASSAULT 88 5
## CRIMINAL DAMAGE 2286 126
## CRIMINAL TRESPASS 160 273
## DECEPTIVE PRACTICE 1173 54
## GAMBLING 0 2
## HOMICIDE 26 7
## INTERFERENCE WITH PUBLIC OFFICER 3 64
## INTIMIDATION 9 2
## KIDNAPPING 18 1
## LIQUOR LAW VIOLATION 0 10
## MOTOR VEHICLE THEFT 832 65
## NARCOTICS 0 1264
## NON-CRIMINAL 2 0
## NON - CRIMINAL 1 0
## OBSCENITY 2 1
## OFFENSE INVOLVING CHILDREN 148 23
## OTHER OFFENSE 1004 240
## PROSTITUTION 0 65
## PUBLIC INDECENCY 0 2
## PUBLIC PEACE VIOLATION 47 93
## ROBBERY 928 70
## SEX OFFENSE 47 14
## STALKING 9 0
## THEFT 4345 548
## WEAPONS VIOLATION 37 128
Here we see that the narcotics crime type has the highest number of arrests, while the theft crime type has the highest number of non-arrests.
We can then find out where one is most likely to get arrested by grouping the data together based on a crime type, arrest boolean, and location. Once this is done, we can compare the data of arrests and non arrests. The following barplot shows the arrests that did occur as “true”, and arrests that did not occur as “false”
arrestByLocation <- table(cleanedData$wasArrestMade, cleanedData$location )
barplot(arrestByLocation, main="Arrests By Location", xlab="Location", col=c("darkblue","red"), legend = rownames(arrestByLocation))
The same data can be viewed in a neater format via the group by command
groupedCrimesByLocation <- cleanedData %>% group_by(crimeType, wasArrestMade, location) %>% summarise(Total=n())
arrestOnly <- filter(groupedCrimesByLocation, wasArrestMade %in% "true")
arrestOnly
## Source: local data frame [407 x 4]
## Groups: crimeType, wasArrestMade [27]
##
## crimeType wasArrestMade location
## <fctr> <fctr> <fctr>
## 1 ARSON true APARTMENT
## 2 ASSAULT true AIRPORT BUILDING NON-TERMINAL - SECURE AREA
## 3 ASSAULT true ALLEY
## 4 ASSAULT true APARTMENT
## 5 ASSAULT true BANK
## 6 ASSAULT true BAR OR TAVERN
## 7 ASSAULT true CAR WASH
## 8 ASSAULT true CHA APARTMENT
## 9 ASSAULT true CHA HALLWAY/STAIRWELL/ELEVATOR
## 10 ASSAULT true CHA PARKING LOT/GROUNDS
## # ... with 397 more rows, and 1 more variables: Total <int>
arrestOnly[which.max(arrestOnly$Total),]
## Source: local data frame [1 x 4]
## Groups: crimeType, wasArrestMade [1]
##
## crimeType wasArrestMade location Total
## <fctr> <fctr> <fctr> <int>
## 1 NARCOTICS true STREET 424
noArrest <- filter(groupedCrimesByLocation, wasArrestMade %in% "false")
noArrest
## Source: local data frame [606 x 4]
## Groups: crimeType, wasArrestMade [24]
##
## crimeType wasArrestMade location Total
## <fctr> <fctr> <fctr> <int>
## 1 ARSON false ALLEY 1
## 2 ARSON false APARTMENT 3
## 3 ARSON false CHURCH/SYNAGOGUE/PLACE OF WORSHIP 1
## 4 ARSON false CONVENIENCE STORE 1
## 5 ARSON false DRIVEWAY - RESIDENTIAL 1
## 6 ARSON false GAS STATION 1
## 7 ARSON false GROCERY FOOD STORE 1
## 8 ARSON false OTHER 1
## 9 ARSON false PARKING LOT/GARAGE(NON.RESID.) 3
## 10 ARSON false RESIDENCE 5
## # ... with 596 more rows
noArrest[which.max(noArrest$Total),]
## Source: local data frame [1 x 4]
## Groups: crimeType, wasArrestMade [1]
##
## crimeType wasArrestMade location Total
## <fctr> <fctr> <fctr> <int>
## 1 THEFT false STREET 1205
As the data shows, more people are likely to be arrested for a narcotics charge on the street. It also shows that one has a greater chance of not being arrested for theft on the street as well.
Speaking of arrests, one interesting observation would be to find what day of the week someone is likely to be arrested.
table(cleanedData$wasArrestMade, cleanedData$dayOfWeek )
##
## Sun Mon Tues Wed Thurs Fri Sat
## false 2083 2095 2775 2749 2277 2072 2184
## true 566 490 773 706 523 509 567
Based on this information, Tuesday is the day when most people are arrested for the sample.
Now, how do we find the average time someone is arrested? Well, we need to find the mean of the timestamp from the Chicago data. Since we did the conversion of 12 hr time to 24 hr, we can perform a “mean” calculation against the variable to find the time.
mean(twentyFourHourTime)
## [1] "2015-12-15 13:59:45 EST"
The average time that someone would be arrested in this sample is on 12/15/15 at 1:59AM.
nyc_data <- getURL('https://raw.githubusercontent.com/bvshyam/Project3_cuny_607/master/Data/nyc_dec2015.csv', ssl.verifyhost=F, ssl.verifypeer=F)
data.nyc <- read.csv(textConnection(nyc_data), header=T)
data.nyc_f <- data.nyc[complete.cases(data.nyc),]
#extracting lat and lon
dataloc <- data.frame(lat = stri_extract_first(data.nyc_f$Location.1, regex="\\d{1,}.\\d{1,}"),
lon = stri_extract_last(data.nyc_f$Location.1, regex = "\\-\\d{1,}.\\d{1,}"))
head(dataloc)
## lat lon
## 1 40.5939184530001 -74.0756886519999
## 2 40.5991869440001 -73.739410553
## 3 40.6506195700001 -73.9538766169999
## 4 40.7528934640001 -73.869474086
## 5 40.66784984 -73.7991105359999
## 6 40.719515733 -73.95297396
#subsetting and changing datatype
data.nyc_f <- data.nyc_f[,c(4,12,20)]
str(data.nyc_f)
## 'data.frame': 8055 obs. of 3 variables:
## $ Day.of.Week: Factor w/ 7 levels "Friday","Monday",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Offense : Factor w/ 7 levels "BURGLARY","FELONY ASSAULT",..: 4 3 3 3 3 4 3 3 2 2 ...
## $ Location.1 : Factor w/ 6495 levels "(40.503177034, -74.231875648)",..: 204 271 1048 3938 1422 3059 3996 4656 5819 5819 ...
data.nyc_f$Offense<- as.character(data.nyc_f$Offense)
data.nyc_f$Day.of.Week<- as.character(data.nyc_f$Day.of.Week)
#combining with latitude data
df2 <- data.frame(cbind(data.nyc_f$Offense,data.nyc_f$Day.of.Week,dataloc[]))
#changing datatype
df2$data.nyc_f.Offense <- as.character(df2$data.nyc_f.Offense)
df2$data.nyc_f.Day.of.Week <- as.character(df2$data.nyc_f.Day.of.Week)
df2$lat <- as.numeric(as.character(df2$lat))
df2$lon <- as.numeric(as.character(df2$lon))
str(df2)
## 'data.frame': 8055 obs. of 4 variables:
## $ data.nyc_f.Offense : chr "GRAND LARCENY OF MOTOR VEHICLE" "GRAND LARCENY" "GRAND LARCENY" "GRAND LARCENY" ...
## $ data.nyc_f.Day.of.Week: chr "Tuesday" "Tuesday" "Tuesday" "Tuesday" ...
## $ lat : num 40.6 40.6 40.7 40.8 40.7 ...
## $ lon : num -74.1 -73.7 -74 -73.9 -73.8 ...
#subsetting data based on more violent crime
dv.nyc <- df2[which(df2$data.nyc_f.Offense == 'ASSAULT' | df2$data.nyc_f.Offense =='ROBBERY'),]
str(dv.nyc )
## 'data.frame': 1524 obs. of 4 variables:
## $ data.nyc_f.Offense : chr "ROBBERY" "ROBBERY" "ROBBERY" "ROBBERY" ...
## $ data.nyc_f.Day.of.Week: chr "Tuesday" "Tuesday" "Tuesday" "Tuesday" ...
## $ lat : num 40.8 40.8 40.6 40.8 40.7 ...
## $ lon : num -73.9 -73.9 -74.1 -73.9 -74 ...
#mapping
map <- get_map('nyc', zoom = 15, maptype = 'roadmap')
ggmap(map,extent = "device")+stat_density2d(
aes(x = dv.nyc$lon, y = dv.nyc$lat, fill = ..level..,
alpha = ..level..),
size = 2, bins = 6, data = dv.nyc,
geom = "polygon")
## Warning: Removed 1507 rows containing non-finite values (stat_density2d).
c_data <- getURL('https://raw.githubusercontent.com/bvshyam/Project3_cuny_607/master/Data/chicago_dec2015.csv', ssl.verifyhost=F, ssl.verifypeer=F)
data.chi <- read.csv(textConnection(c_data), header=T)
df <- data.chi[,c(6,22)]
df <- df[complete.cases(df),]
df$Location <- as.character(df$Location)
str(df)
## 'data.frame': 20369 obs. of 2 variables:
## $ Primary.Type: Factor w/ 30 levels "ARSON","ASSAULT",..: 9 3 9 9 21 9 22 27 27 29 ...
## $ Location : chr "" "" "" "" ...
#extracting latitude and longitude values
dataloc <- data.frame(lat = stri_extract_first(df$Location, regex = "\\d{1,}.\\d{1,}"),
lon = stri_extract_last(df$Location, regex = "\\-\\d{1,}.\\d{1,}"))
#changing to numeric type
dataloc$lat <- as.numeric(as.character(dataloc$lat))
dataloc$lon <- as.numeric(as.character(dataloc$lon))
#recreating data frame with latitude and longitude column
fndata <- data.frame(cbind(df$Primary.Type,dataloc[]))
finaldata <- fndata[complete.cases(dataloc),]
str(finaldata)
## 'data.frame': 19910 obs. of 3 variables:
## $ df.Primary.Type: Factor w/ 30 levels "ARSON","ASSAULT",..: 9 22 27 27 29 29 7 9 9 9 ...
## $ lat : num 41.9 41.9 41.9 41.9 41.8 ...
## $ lon : num -87.7 -87.7 -87.7 -87.7 -87.6 ...
#subsetting data based on more violent crime
dv.chicago <- finaldata[which(finaldata$df.Primary.Type == 'ASSAULT' | finaldata$df.Primary.Type =='ROBBERY'),]
str(dv.chicago)
## 'data.frame': 2176 obs. of 3 variables:
## $ df.Primary.Type: Factor w/ 30 levels "ARSON","ASSAULT",..: 2 26 26 2 2 26 26 26 26 26 ...
## $ lat : num 41.8 41.8 41.7 41.7 41.8 ...
## $ lon : num -87.6 -87.7 -87.7 -87.6 -87.7 ...
#mapping with dayofweek
map <- get_map('chicago', zoom = 15, maptype = 'roadmap')
ggmap(map,extent = "device")+stat_density2d(
aes(x = lon, y = lat, fill = ..level..,
alpha = ..level..),
size = 2, bins = 6, data = dv.chicago,
geom = "polygon")
## Warning: Removed 2131 rows containing non-finite values (stat_density2d).
#dat load and subsetting
#d <- getURL("https://raw.githubusercontent.com/bvshyam/Project3_cuny_607/master/Data/SFPD_Incidents_-_from_1_January_2003.csv", ssl.verifyhost=F, ssl.verifypeer=F)
d <- getURL("https://raw.githubusercontent.com/bvshyam/Project3_cuny_607/master/Data/sfo_dec2015.csv", ssl.verifyhost=F, ssl.verifypeer=F)
d <- read.csv(textConnection(d), header=T)
d <- d[complete.cases(d),]
d <- d[,c(2:6,10,11)]
#changing column name
colnames(d)[6]<- "lon"
colnames(d)[7]<- "lat"
str(d)
## 'data.frame': 11394 obs. of 7 variables:
## $ Category : Factor w/ 36 levels "ARSON","ASSAULT",..: 17 17 20 21 19 26 27 21 21 17 ...
## $ Descript : Factor w/ 428 levels "ABANDONMENT OF CHILD",..: 278 278 238 211 194 225 353 197 175 277 ...
## $ DayOfWeek: Factor w/ 7 levels "Friday","Monday",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Date : Factor w/ 31 levels "12/01/2015","12/02/2015",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Time : Factor w/ 1194 levels "00:01","00:02",..: 986 646 570 1 1 1 1 540 1 382 ...
## $ lon : num -122 -122 -122 -122 -122 ...
## $ lat : num 37.8 37.8 37.8 37.8 37.7 ...
#subsetting data based on more violent crime
dv <- d[which(d$Category == 'ASSAULT' | d$Category =='ROBBERY'),]
str(dv)
## 'data.frame': 1232 obs. of 7 variables:
## $ Category : Factor w/ 36 levels "ARSON","ASSAULT",..: 2 2 2 2 2 2 2 2 2 24 ...
## $ Descript : Factor w/ 428 levels "ABANDONMENT OF CHILD",..: 393 393 393 61 393 150 172 61 393 48 ...
## $ DayOfWeek: Factor w/ 7 levels "Friday","Monday",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Date : Factor w/ 31 levels "12/01/2015","12/02/2015",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Time : Factor w/ 1194 levels "00:01","00:02",..: 1 1 1 1 1 1 763 249 894 931 ...
## $ lon : num -122 -122 -122 -122 -122 ...
## $ lat : num 37.8 37.8 37.8 37.8 37.8 ...
#mapping with dayofweek
map <- get_map('San Francisco', zoom = 12, maptype = 'roadmap')
ggmap(map,extent = "device")+stat_density2d(
aes(x = lon, y = lat, fill = ..level..,
alpha = ..level..),
size = 2, bins = 4, data = dv,
geom = "polygon")+facet_wrap(~DayOfWeek)
#mapping with area and legend
ggmap(map,extent = "device", legend = "topleft")+stat_density2d(
aes(x = lon, y = lat, fill = ..level..,
alpha = ..level..),
size = 2, bins = 6, data = dv,
geom = "polygon")
Loading crime data captured for different US cities, EastCoast NYC, WestCoast SFO and MidWest Chicago.
# DATE_FORMAT(offenselocaltime,'%H:%i:%s')
if (isIdCurrent(conn)) {
allcitycrimedata <- dbGetQuery(conn, "select crimedata.citycode, offenseid, offense, offenselocaldate, offenselocaltime, dayofweek, offensehour, offensemonth, offensearea from crimedata, city where city.citycode = crimedata.citycode")
}
# View(allcitycrimedata)
# Categorizing the offenses at broad level Adding Category column to data frame
for (i in seq(1, nrow(allcitycrimedata), 1)) {
if (grepl("LARCENY|ROBBERY|THEFT|STOLEN|EMBEZZLEMENT", allcitycrimedata$offense[i],
ignore.case = TRUE)) {
allcitycrimedata$category[i] <- "LARCENY"
} else if (grepl("RAPE|SEX OFFENSE|SEXUAL|BATTERY|PROSTITUTION", allcitycrimedata$offense[i],
ignore.case = TRUE)) {
allcitycrimedata$category[i] <- "SEX OFFENSE"
} else if (grepl("BURGLARY", allcitycrimedata$offense[i], ignore.case = TRUE)) {
allcitycrimedata$category[i] <- "BURGLARY"
} else if (grepl("MURDER|HOMICIDE", allcitycrimedata$offense[i], ignore.case = TRUE)) {
allcitycrimedata$category[i] <- "MURDER"
} else if (grepl("NARCOTIC|LIQUOR", allcitycrimedata$offense[i], ignore.case = TRUE)) {
allcitycrimedata$category[i] <- "NARCOTIC"
} else if (grepl("OTHER|SUICIDE|NON CRIMINAL|NON-CRIMINAL|NON - CRIMINAL|SUSPICIOUS|DECEPTIVE|INTERFERENCE|INTIMIDATION|STALKING|TRESSPASS|SECONDARY|RUNAWAY|MISSING|TRESPASS",
allcitycrimedata$offense[i], ignore.case = TRUE)) {
allcitycrimedata$category[i] <- "OTHER"
} else if (grepl("FORGERY|FRAUD|BAD CHECKS|EXTORTION", allcitycrimedata$offense[i],
ignore.case = TRUE)) {
allcitycrimedata$category[i] <- "FRAUD"
} else if (grepl("PUBLIC INDECENCY|OBSCENITY|DRUNKENNESS|CONDUCT|PUBLIC PEACE|WARRANTS|WEAPON|VEHICLE|LICENSE|DRIVING|BRIBERY|GAMBLING",
allcitycrimedata$offense[i], ignore.case = TRUE)) {
allcitycrimedata$category[i] <- "MISCONDUCT"
} else if (grepl("FAMILY|KIDNAPPING", allcitycrimedata$offense[i], ignore.case = TRUE)) {
allcitycrimedata$category[i] <- "FAMILY"
} else if (grepl("CHILDREN", allcitycrimedata$offense[i], ignore.case = TRUE)) {
allcitycrimedata$category[i] <- "CHILD ABUSE"
} else if (grepl("VANDALISM|DAMAGE|ARSON", allcitycrimedata$offense[i], ignore.case = TRUE)) {
allcitycrimedata$category[i] <- "VANDALISM"
} else if (grepl("ASSAULT", allcitycrimedata$offense[i], ignore.case = TRUE)) {
allcitycrimedata$category[i] <- "ASSAULT"
}
} # end of for loop
datatable(allcitycrimedata)
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## http://rstudio.github.io/DT/server.html
Grouping data retrieved from database with group by city and within city, group by offense
crimedata_cityDF <- allcitycrimedata %>% select(citycode, category) %>% group_by(citycode,
category) %>% dplyr::summarise(total = n())
datatable(crimedata_cityDF)
ggplot(crimedata_cityDF, aes(x = crimedata_cityDF$citycode, y = crimedata_cityDF$total, fill = crimedata_cityDF$category)) +
ggtitle("Offense In US Cities(East Coast, MidWest, WestCoast") + geom_bar(colour = "black", stat = "identity") + coord_flip() +
xlab("Cities") + ylab("Offense Category") + scale_fill_brewer(palette = "Paired")
Findings From the bar plot of city compare for different offenses, we see that Chicago is highest in crime rate, almost double of San Francisco and more than double of that of New York City. Larceny is the highest offense in Chicago, followed by New York City . Chicago also leads in Burglary, and Sex Offenses Chicago leads in almost all types of crimes
SFO has high of smaller crimes like suspicious activity, trespassing, non criminal offenses, stalking, deception, intereference, intimidation San Francisco also leads in misconduct types of offenses, like obscenity, drunkeness, public indecency, DUI, driving related or Narcotic
Safest City seems to be New York City area as comapred to San Francisco and Chicago.
# remove SFO tables and variables
remove(sf_crime_db, sf_crime_db_1, sf_crime, sf_day_offense, sfo_crime_hour, topcrimes,
sfo_crimedata_hourDF, entdbtables, a, b, a1, b1, category_count, crimesbyday,
mostcrimes, sf_crime.df, i, p, sf_crim.json, sf_crimedata)
dbDisconnect(conn)
## [1] TRUE
#####################################################################
The project motivation has been to successfully demonstrate a team work, soft skills which also form the foundation of a good Data Scientist. And with collaboration help study crime data and come up with insights.
We collaborated together as a team of Kumudini Bhave, Shyam BV, Mark Halpin since Oct 16 to be joined in by Upal Chowdhury. Our team demonstrated great team work with work distribution, accommodation and tolerance and healthy collaboration .
Shyam, Mark did great work over finding interesting datasets and worked /analyzed on SFO , Chicago crime datasets. Kumudini contributed to working and analyzing on NYC crime data, doing across cities plot and integrating the individual Rmds. Upal was very instrumental in setting the much required database on cloud and helping us be all in sync. Having a database on cloud made it greatly possible to have all data in one place and reinforced the strength of the project. Upal also did some interesting mappings of the three cities SFO, NYC and Chicago crime datasets.And prepared the ER diagram and Workflow diagram
We did have initial dilemma as we started off with multiyear crime data but over period realised that only a months data could be handled by the environment. It consumed some time for this realization.
We also had issues with integration of Rmd files as JSON data loading would time out.
With our team work, we were succesfully able to study crime across the major cities on East Coast, West Coast and Mid-West of USA . The Plots an our findings are captured in the main Rmd file.
The most valued data science skills : Our team worked together and could demonstrate the essential and most valued data science skills. We found the apt dataset for our project that engaged our curiousity to find about crime in different cities. We worked through unstructured /structured data We scraped data in different formats of XML, CSV, JSON We worked in SQL and posted to relational databases We used R for Coding We worked on large datasets and performed data mining. Lastly and most importantly, we are a team globally dispersed across timezones , we worked round the clock to deliver the united effort.