This exploratory data analysis uses the San Francisco (SF) Crime dataset, available at https://www.kaggle.com/c/sf-crime/download/train.csv.zip. The dataset was used in Kaggle’s well-known crime classification practice competition and was provided by SF OpenData, the central clearinghouse for data published by the City and County of San Francisco.
The idea of this notebook is to answer one specific question: which are the safest attractions of San Francisco for a visitor?
To analyze this, I also used TripAdvisor’s top 15 attractions in SF (https://www.tripadvisor.com/Attractions-g60713-Activities-San_Francisco_California.html#ATTRACTION_SORT_WRAPPER).
First, we got the data and opened it in R:
if(!require(dplyr)) install.packages("dplyr")
if(!require(ggmap)) install.packages("ggmap")
library("ggmap")
library("dplyr")
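# stringsAsFactors = TRUE keeps text columns as factors, which levels() below relies on (R >= 4.0 defaults to FALSE)
data <- read.csv("train.csv", header = TRUE, stringsAsFactors = TRUE)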
The TripAdvisor list was used to build a table with the name and coordinates of each of the top 15 tourist attractions. Note that in this table X holds the latitude and Y the longitude, the opposite of the crime dataset:
turisticPlaces <- read.table("locais.txt",header=TRUE,sep = ",")
turisticPlaces
## X Y NOME
## 1 37.82698 -122.4230 ALCATRAZ
## 2 37.81051 -122.4768 GOLDEN GATE
## 3 37.77860 -122.3893 AT&T PARK
## 4 37.76942 -122.4862 GOLDEN G PARK
## 5 37.77986 -122.5095 LANDS END
## 6 37.80199 -122.4487 FINE ARTS MUS
## 7 37.75522 -122.4479 TWIN PEAKS
## 8 37.79470 -122.4117 CABLE CARS
## 9 37.80084 -122.3986 THE EXPLORATORIUM
## 10 37.76987 -122.4661 ACADEMY OF SCIENCE
## 11 37.79527 -122.3934 FERRY MKT
## 12 37.78447 -122.5008 LEGION OF HONOR
## 13 37.78016 -122.4162 ASIAN ART MUS
## 14 37.80145 -122.4587 DISNEY
## 15 37.78572 -122.4011 SFMOMA
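Because this table stores latitude in X and longitude in Y, while the crime data shown later stores longitude in X and latitude in Y, it is easy to swap the axes by accident when plotting. A minimal sketch of one way to guard against that, assuming explicit column names are acceptable (the names `lat` and `lon` and the data frame `places` are illustrative, not part of the original notebook):

```r
# Two rows copied from the table above; here X is latitude and Y is longitude
places <- data.frame(X = c(37.82698, 37.78572),
                     Y = c(-122.4230, -122.4011),
                     NOME = c("ALCATRAZ", "SFMOMA"))

# Rename to explicit names so the axes cannot be mixed up later
names(places)[names(places) == "X"] <- "lat"
names(places)[names(places) == "Y"] <- "lon"

# Sanity check: San Francisco longitudes are negative, latitudes lie between 37 and 38
stopifnot(all(places$lon < 0), all(places$lat > 37 & places$lat < 38))
print(places)
```

The rest of this notebook keeps the original X/Y names, so the sketch is only a suggestion for readers adapting the code.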
The first step was to clean the data. Before doing so, let us look at a few observations of the crime dataset.
head(data)
## Dates Category Descript
## 1 2015-05-13 23:53:00 WARRANTS WARRANT ARREST
## 2 2015-05-13 23:53:00 OTHER OFFENSES TRAFFIC VIOLATION ARREST
## 3 2015-05-13 23:33:00 OTHER OFFENSES TRAFFIC VIOLATION ARREST
## 4 2015-05-13 23:30:00 LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO
## 5 2015-05-13 23:30:00 LARCENY/THEFT GRAND THEFT FROM LOCKED AUTO
## 6 2015-05-13 23:30:00 LARCENY/THEFT GRAND THEFT FROM UNLOCKED AUTO
## DayOfWeek PdDistrict Resolution Address X
## 1 Wednesday NORTHERN ARREST, BOOKED OAK ST / LAGUNA ST -122.4259
## 2 Wednesday NORTHERN ARREST, BOOKED OAK ST / LAGUNA ST -122.4259
## 3 Wednesday NORTHERN ARREST, BOOKED VANNESS AV / GREENWICH ST -122.4244
## 4 Wednesday NORTHERN NONE 1500 Block of LOMBARD ST -122.4270
## 5 Wednesday PARK NONE 100 Block of BRODERICK ST -122.4387
## 6 Wednesday INGLESIDE NONE 0 Block of TEDDY AV -122.4033
## Y
## 1 37.77460
## 2 37.77460
## 3 37.80041
## 4 37.80087
## 5 37.77154
## 6 37.71343
As we can see, the data has nine columns: the date of the crime, its category, a short description, the day of the week, the police department district, the resolution, the address, and the longitude (X) and latitude (Y).
I decided to analyze the description of the crime to find the crimes that are most likely to affect tourists.
To do this, I checked the available description levels in the dataset.
levels(data$Descript)
After analyzing them, I decided that the relevant words in the crime descriptions for tourists were THEFT, PICKPOCKET and STOLEN:
substring<-"THEFT|PICKPOCKET|STOLEN"
levelsList<-levels(data$Descript)
listOfRelevantsDesc<-grepl(substring,levelsList)
firstLevelList<-levelsList[listOfRelevantsDesc]
This reduced the universe from 879 distinct description levels to 101. Looking more closely at these 101 levels, I filtered again to keep only the ones relevant to tourists (for example, tourists are unlikely to have a trailer or boat stolen, or to have private property trespassed). For this second filter, I chose the following words:
CARD, PICKPOCKET, BICYCLE, WITH PRIOR, POSSESSION, PHONES, TRICK, PERSON, COMPUTER and METALS
secondSubstring<-"CARD|PICKPOCKET|BICYCLE|WITH PRIOR|POSSESSION|PHONES|TRICK|PERSON|COMPUTER|METALS"
secondLevelList<-firstLevelList[grepl(secondSubstring,firstLevelList)]
finalFilteredData<-filter(data,data$Descript%in%secondLevelList)
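The two-stage keyword filter can be illustrated on a handful of made-up description strings. This is only a sketch: the vector `toyLevels` is hypothetical and not taken from train.csv, and the second pattern is shortened to three of the keywords listed above.

```r
# Hypothetical description levels, NOT taken from train.csv
toyLevels <- c("GRAND THEFT PICKPOCKET",
               "STOLEN TRAILER",
               "PETTY THEFT BICYCLE",
               "TRESPASSING",
               "GRAND THEFT FROM PERSON")

# Stage 1: keep anything that mentions a theft-related keyword
stage1 <- toyLevels[grepl("THEFT|PICKPOCKET|STOLEN", toyLevels)]

# Stage 2: within that subset, keep only the tourist-relevant keywords
stage2 <- stage1[grepl("PICKPOCKET|BICYCLE|PERSON", stage1)]

print(stage1)
print(stage2)
```

Filtering in two stages means a level must match both patterns, so "STOLEN TRAILER" passes the first stage but is dropped by the second.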
Next, I kept only the incidents from 2015 (the most recent year in this dataset). This reduced the data from 34427 observations to 1115, as can be seen here:
# Keep only the incidents whose timestamp contains the year 2015
year <- "2015"
final2015Data <- filter(finalFilteredData, grepl(year, Dates))
nrow(final2015Data)
## [1] 1115
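Matching the string "2015" anywhere in the timestamp works here because, in the "YYYY-MM-DD HH:MM:SS" layout, those four characters can only appear in the year. A stricter alternative is to parse the timestamps and compare the year explicitly. The sketch below assumes that layout and uses made-up timestamps (`toyDates` is hypothetical):

```r
# Hypothetical timestamps in the same layout as the Dates column
toyDates <- c("2015-05-13 23:53:00", "2014-12-31 20:15:00", "2015-01-01 00:01:00")

# Parse into date-times and extract the calendar year as text
parsed <- as.POSIXct(toyDates, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
years <- format(parsed, "%Y")

# Logical mask for incidents that happened in 2015
is2015 <- years == "2015"
print(toyDates[is2015])
```

Parsing also makes it straightforward to filter by month or hour later, should that be of interest.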
After this stage, I plotted the location of those remaining crimes in red with the density estimation of the crimes in blue, as can be seen here:
qmplot(X,Y,data=final2015Data,maptype = "toner-lite",color = I("red"))+geom_density_2d()
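For readers curious about what the contour lines represent: ggplot2's 2D density layers are documented as using `MASS::kde2d()`, a kernel density estimate evaluated on a grid. The sketch below calls it directly on simulated coordinates (the two clusters are invented for illustration and are not the crime data) and locates the grid cell with the highest estimated density.

```r
library(MASS)

set.seed(42)
# Simulated points: a dense cluster and a sparser one (NOT real crime locations)
lon <- c(rnorm(200, -122.41, 0.005), rnorm(50, -122.45, 0.01))
lat <- c(rnorm(200, 37.785, 0.005), rnorm(50, 37.76, 0.01))

# Kernel density estimate on a 50 x 50 grid; z[i, j] is the density at (x[i], y[j])
dens <- kde2d(lon, lat, n = 50)

# Grid cell with the highest estimated density
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)
cat("Peak near lon", dens$x[peak[1, 1]], "lat", dens$y[peak[1, 2]], "\n")
```

The contours drawn on the map are level curves of this kind of surface, so tighter rings mark areas where incidents are more concentrated.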
To get a visualization that helps answer the initial question, I plotted the density of the selected crimes in 2015 together with the location of the tourist attractions in green. Note that the goal is not to rank the attractions by how violent their surroundings are, only to separate the ones in higher-risk areas from the rest.
qmplot(X, Y, data = final2015Data, geom = "blank", maptype = "toner-background", darken = .7, legend = "bottomright") +
stat_density_2d(aes(fill = ..level..), geom = "polygon", alpha = .3, color = NA) +
scale_fill_gradient2("Quantity of selected crimes\nthat may affect tourists (2015)", low = "white", mid = "yellow", high = "red", midpoint = 750, limits = c(250, 1500)) +
# In turisticPlaces, X is the latitude and Y the longitude, so the axes are swapped here
geom_point(data = turisticPlaces, aes(Y, X), color = "green", size = 3, shape = 21, stroke = 2) +
geom_text(data = turisticPlaces, aes(Y, X, label = NOME), nudge_y = 0.0025, size = 2.5, color = "white") +
ggtitle("Touristic places (green) and selected crimes density for SF in 2015")+theme(plot.title = element_text(hjust = 0.5,size=20))
Source: filtered data from https://www.kaggle.com/c/sf-crime (accessed on 04/18/2017)
We also checked how the crimes vary throughout the week.
qmplot(X, Y, data = final2015Data, geom = "blank", maptype = "toner-background", darken = .7, legend = "bottomright") +
stat_density_2d(aes(fill = ..level..), geom = "polygon", alpha = .3, color = NA) +
scale_fill_gradient2("Quantity of selected crimes\nthat may affect tourists (2015)", low = "white", mid = "yellow", high = "red", midpoint = 750, limits = c(250, 1500)) +
# In turisticPlaces, X is the latitude and Y the longitude, so the axes are swapped here
geom_point(data = turisticPlaces, aes(Y, X), color = "blue") +
facet_wrap(~ DayOfWeek) +
ggtitle("Selected crimes density for SF in 2015 for each day of the week") +
theme(plot.title = element_text(hjust = 0.5, size = 18))
The maps above suggest that, if you want to visit SFMOMA, the Cable Cars or the Asian Art Museum (all located in areas with a higher density of incidents), Sundays are safer than Saturdays. Wednesday is also a good day to go.
As we can see in the map below, the most dangerous of TripAdvisor's top 15 attractions are SFMOMA, the Asian Art Museum and the Cable Cars. If you want to visit these places, pay attention to your belongings, because these three attractions are in the areas with the highest density of the selected crimes that can affect tourists in SF.
I would also recommend paying attention at The Exploratorium and the Ferry Marketplace, as they sit on the border of the most dangerous areas.
Source: filtered data from https://www.kaggle.com/c/sf-crime (accessed on 04/18/2017)
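The maps give a visual answer. A complementary check, not part of the original analysis, would be to count the selected incidents within a fixed radius of each attraction. The sketch below is a rough illustration under several assumptions: the `haversine` helper, the 500 m radius and the incident coordinates in `crimes` are all made up for the example, and only the two attraction coordinates are copied from the table earlier.

```r
# Great-circle distance in metres between points given in decimal degrees
haversine <- function(lat1, lon1, lat2, lon2, radius = 6371000) {
  toRad <- pi / 180
  dLat <- (lat2 - lat1) * toRad
  dLon <- (lon2 - lon1) * toRad
  a <- sin(dLat / 2)^2 + cos(lat1 * toRad) * cos(lat2 * toRad) * sin(dLon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Two attractions copied from the table earlier, with explicit lat/lon names
places <- data.frame(name = c("SFMOMA", "TWIN PEAKS"),
                     lat = c(37.78572, 37.75522),
                     lon = c(-122.4011, -122.4479))

# Hypothetical incident coordinates, NOT real records from train.csv
crimes <- data.frame(lat = c(37.7860, 37.7850, 37.7870, 37.7555),
                     lon = c(-122.4010, -122.4020, -122.4000, -122.4480))

# Count incidents within 500 m of each attraction (the radius is an arbitrary choice)
places$nearby <- sapply(seq_len(nrow(places)), function(i) {
  d <- haversine(places$lat[i], places$lon[i], crimes$lat, crimes$lon)
  sum(d <= 500)
})
print(places)
```

With the real data, `crimes` would be replaced by the Y (latitude) and X (longitude) columns of `final2015Data`, and the resulting counts could be sorted to rank the attractions.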