Introduction

This exploratory data analysis uses the San Francisco (SF) Crime dataset, available at https://www.kaggle.com/c/sf-crime/download/train.csv.zip. The dataset was used in Kaggle's crime classification practice competition and was provided by SF OpenData, the central clearinghouse for data published by the City and County of San Francisco.

The goal of this notebook is to answer one specific question: which San Francisco attractions are the safest for a visitor?

To analyze this, I also used TripAdvisor’s TOP 15 attractions in SF (https://www.tripadvisor.com/Attractions-g60713-Activities-San_Francisco_California.html#ATTRACTION_SORT_WRAPPER).

First, we downloaded the data and loaded it into R:

if(!require(dplyr)) install.packages("dplyr")
if(!require(ggmap)) install.packages("ggmap")

library("ggmap")
library("dplyr")

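# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0),
# which the later call to levels(data$Descript) relies on
data <- read.csv("train.csv", header = TRUE, stringsAsFactors = TRUE)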

The TripAdvisor data was used to build a table with the latitude, longitude and name of each of the TOP 15 tourist attractions, as shown here (note that in this table X holds the latitude and Y the longitude, the reverse of the crime dataset):

turisticPlaces <- read.table("locais.txt",header=TRUE,sep = ",")
turisticPlaces
##           X         Y               NOME
## 1  37.82698 -122.4230           ALCATRAZ
## 2  37.81051 -122.4768        GOLDEN GATE
## 3  37.77860 -122.3893          AT&T PARK
## 4  37.76942 -122.4862      GOLDEN G PARK
## 5  37.77986 -122.5095          LANDS END
## 6  37.80199 -122.4487     FINE ARTS MUS 
## 7  37.75522 -122.4479         TWIN PEAKS
## 8  37.79470 -122.4117         CABLE CARS
## 9  37.80084 -122.3986  THE EXPLORATORIUM
## 10 37.76987 -122.4661 ACADEMY OF SCIENCE
## 11 37.79527 -122.3934          FERRY MKT
## 12 37.78447 -122.5008    LEGION OF HONOR
## 13 37.78016 -122.4162      ASIAN ART MUS
## 14 37.80145 -122.4587             DISNEY
## 15 37.78572 -122.4011             SFMOMA
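
Because X and Y in this table have the opposite meaning to the crime data, it can help to give the columns explicit names before plotting. The following is only a minimal sketch on the first three rows of the table: the data frame `places` and the column names `lat` and `lon` are illustrative assumptions and are not used elsewhere in this notebook.

```r
# Illustrative subset of the attractions table (first three rows shown above)
places <- data.frame(
  X = c(37.82698, 37.81051, 37.77860),
  Y = c(-122.4230, -122.4768, -122.3893),
  NOME = c("ALCATRAZ", "GOLDEN GATE", "AT&T PARK")
)

# Rename so the meaning of each coordinate is explicit
# (lat/lon are hypothetical names, not part of the original notebook)
names(places)[names(places) == "X"] <- "lat"
names(places)[names(places) == "Y"] <- "lon"

# Sanity check: San Francisco latitudes are near 37.8, longitudes near -122.4
stopifnot(all(places$lat > 37 & places$lat < 38))
stopifnot(all(places$lon > -123 & places$lon < -122))
places
```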

Cleaning Data

The first step was to clean the data. To do this, we first look at a few observations of the crime dataset.

head(data)
##                 Dates       Category                       Descript
## 1 2015-05-13 23:53:00       WARRANTS                 WARRANT ARREST
## 2 2015-05-13 23:53:00 OTHER OFFENSES       TRAFFIC VIOLATION ARREST
## 3 2015-05-13 23:33:00 OTHER OFFENSES       TRAFFIC VIOLATION ARREST
## 4 2015-05-13 23:30:00  LARCENY/THEFT   GRAND THEFT FROM LOCKED AUTO
## 5 2015-05-13 23:30:00  LARCENY/THEFT   GRAND THEFT FROM LOCKED AUTO
## 6 2015-05-13 23:30:00  LARCENY/THEFT GRAND THEFT FROM UNLOCKED AUTO
##   DayOfWeek PdDistrict     Resolution                   Address         X
## 1 Wednesday   NORTHERN ARREST, BOOKED        OAK ST / LAGUNA ST -122.4259
## 2 Wednesday   NORTHERN ARREST, BOOKED        OAK ST / LAGUNA ST -122.4259
## 3 Wednesday   NORTHERN ARREST, BOOKED VANNESS AV / GREENWICH ST -122.4244
## 4 Wednesday   NORTHERN           NONE  1500 Block of LOMBARD ST -122.4270
## 5 Wednesday       PARK           NONE 100 Block of BRODERICK ST -122.4387
## 6 Wednesday  INGLESIDE           NONE       0 Block of TEDDY AV -122.4033
##          Y
## 1 37.77460
## 2 37.77460
## 3 37.80041
## 4 37.80087
## 5 37.77154
## 6 37.71343

As we can see, the data has nine columns: the date of the crime, the category of crime, a short description of the crime, the day of the week, the police department district, the resolution, the address, the longitude (X) and the latitude (Y).

I decided to use the crime descriptions to find the crimes most likely to affect tourists.

To do so, I checked the available levels of the Descript column.

levels(data$Descript)

After analyzing them, I decided that the relevant words in the crime descriptions for tourists were THEFT, PICKPOCKET and STOLEN.

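# Named firstSubstring to avoid masking base::substring()
firstSubstring <- "THEFT|PICKPOCKET|STOLEN"
levelsList <- levels(data$Descript)
# Logical vector: TRUE for each description level that matches a keyword
listOfRelevantsDesc <- grepl(firstSubstring, levelsList)
firstLevelList <- levelsList[listOfRelevantsDesc]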

After this step, the 879 distinct description levels were reduced to 101. Looking more closely at these 101 levels, I filtered again to keep only those relevant to tourists (for example, tourists are unlikely to have a trailer or boat stolen, or to have their private property trespassed). For this second filter, I chose the following words:

CARD, PICKPOCKET, BICYCLE, WITH PRIOR, POSSESSION, PHONES, TRICK, PERSON, COMPUTER and METALS

secondSubstring <- "CARD|PICKPOCKET|BICYCLE|WITH PRIOR|POSSESSION|PHONES|TRICK|PERSON|COMPUTER|METALS"
secondLevelList <- firstLevelList[grepl(secondSubstring, firstLevelList)]
# Keep only the rows whose description survived both keyword filters
finalFilteredData <- filter(data, Descript %in% secondLevelList)
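
To make the two-stage keyword filter easier to follow, here is a small self-contained sketch on a toy vector. The description strings are hypothetical examples written in the style of the Descript column, and the second pattern is shortened for illustration, so this is not the exact filter applied above.

```r
# Hypothetical description strings in the style of the Descript column
descriptions <- c(
  "GRAND THEFT PICKPOCKET",
  "PETTY THEFT BICYCLE",
  "STOLEN TRAILER",
  "WARRANT ARREST",
  "GRAND THEFT FROM PERSON"
)

# Stage 1: keep descriptions mentioning theft-like keywords
stage1 <- descriptions[grepl("THEFT|PICKPOCKET|STOLEN", descriptions)]

# Stage 2: keep only the ones plausibly affecting a visitor
# (shortened pattern, for illustration only)
stage2 <- stage1[grepl("PICKPOCKET|BICYCLE|PERSON", stage1)]

stage1  # no longer contains "WARRANT ARREST"
stage2  # additionally drops "STOLEN TRAILER"
```

The `|` in each pattern is a regular-expression alternation, so a description is kept when it contains any one of the listed words.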

Next, I kept only the incidents from 2015 (the most recent year in this dataset). This reduced the data from 34427 observations to 1115, as can be seen here:

year <- "2015"
# Dates look like "2015-05-13 23:53:00", so keep rows whose timestamp starts with the year
final2015Data <- filter(finalFilteredData, grepl(paste0("^", year), Dates))

nrow(final2015Data)
## [1] 1115
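
An alternative to matching the year as text is to parse the timestamps and extract the year explicitly. The sketch below assumes the "YYYY-MM-DD HH:MM:SS" layout seen in the output of head(data); the data frame `toy` and its `Year` column are hypothetical and only serve to illustrate the idea.

```r
# Hypothetical timestamps in the same layout as the Dates column
toy <- data.frame(
  Dates = c("2015-05-13 23:53:00", "2014-12-31 20:15:00", "2015-01-01 00:01:00"),
  stringsAsFactors = FALSE
)

# Parse the text into date-times, then extract the four-digit year
parsed <- as.POSIXct(toy$Dates, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
toy$Year <- format(parsed, "%Y")

# Keep only the rows from 2015
toy2015 <- toy[toy$Year == "2015", ]
nrow(toy2015)
```

Parsing first makes it harder to match "2015" by accident in some other part of the string, at the cost of an extra step.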

Visualization Stage


Next, I plotted the locations of the remaining crimes in red, together with a density estimate of the crimes in blue, as can be seen here:

qmplot(X,Y,data=final2015Data,maptype = "toner-lite",color = I("red"))+geom_density_2d()

To get a visualization that helps answer our initial question, I plotted the density of the selected crimes in 2015 together with the locations of the tourist attractions in green. Note that we do not want to rank the attractions by how violent they are. We only want to separate the ones in dangerous areas from the ones that are not.

qmplot(X, Y, data = final2015Data, geom = "blank", maptype = "toner-background", darken = .7, legend = "bottomright") +
  stat_density_2d(aes(fill = ..level..), geom = "polygon", alpha = .3, color = NA) +
  scale_fill_gradient2("Quantity of selected crimes\nthat may affect tourists (2015)", low = "white", mid = "yellow", high = "red", midpoint = 750, limits = c(250, 1500)) +
  # In turisticPlaces, X is latitude and Y is longitude, so Y goes on the x-axis
  geom_point(data = turisticPlaces, aes(Y, X), color = "green", size = 3, shape = 21, stroke = 2) +
  # Label each attraction with its name, nudged slightly above the point
  geom_text(data = turisticPlaces, aes(Y, X, label = NOME), nudge_y = 0.0025, size = 2.5, color = "white") +
  ggtitle("Touristic places (green) and selected crimes density for SF in 2015") +
  theme(plot.title = element_text(hjust = 0.5, size = 20))

Source: filtered data from https://www.kaggle.com/c/sf-crime (accessed on 04/18/2017)

Other insights

We also checked how the crimes vary throughout the week.

qmplot(X, Y, data = final2015Data, geom = "blank", maptype = "toner-background", darken = .7, legend = "bottomright") +
  stat_density_2d(aes(fill = ..level..), geom = "polygon", alpha = .3, color = NA) +
  scale_fill_gradient2("Quantity of selected crimes\nthat may affect tourists (2015)", low = "white", mid = "yellow", high = "red", midpoint = 750, limits = c(250, 1500)) +
  # In turisticPlaces, X is latitude and Y is longitude, so Y goes on the x-axis
  geom_point(data = turisticPlaces, aes(Y, X), color = "blue") +
  facet_wrap(~ DayOfWeek) +
  ggtitle("Selected crimes density for SF in 2015 for each day of the week") +
  theme(plot.title = element_text(hjust = 0.5, size = 18))



As the maps above show, if you want to visit the SFMOMA, the Cable Cars or the Asian Art Museum (which are in areas with more crime incidents), it is safer to go on Sundays than on Saturdays. Wednesday is also a good day to go.
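
The visual comparison between days can be complemented with a simple count of incidents per day of the week. The sketch below uses a hypothetical stand-in for final2015Data, so the numbers it prints are illustrative only and say nothing about the real dataset.

```r
# Hypothetical stand-in for final2015Data with only the DayOfWeek column
toy <- data.frame(
  DayOfWeek = c("Saturday", "Saturday", "Sunday", "Wednesday", "Saturday", "Friday")
)

# Order the days so the table reads Monday to Sunday rather than alphabetically
dayOrder <- c("Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday")
toy$DayOfWeek <- factor(toy$DayOfWeek, levels = dayOrder)

# Count incidents per day (days with no incidents show up as 0)
countsByDay <- table(toy$DayOfWeek)
countsByDay
```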

Conclusion

As we can see in the map below, the most dangerous of TripAdvisor's TOP 15 attractions are SFMOMA, the Asian Art Museum and the Cable Cars. If you visit these places, keep an eye on your belongings, because these three attractions are in areas with a higher density of the selected crimes that can affect tourists in SF.

I would also recommend taking care at The Exploratorium and the Ferry Marketplace, as they are on the border of the most dangerous areas.
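
A numeric check of this visual reading could count the selected crimes within a fixed distance of each attraction. The following is a minimal sketch under stated assumptions: the `haversine` helper, the 500 m radius and the incident coordinates are all hypothetical and were not part of the original analysis. Only the two attraction coordinates are taken from the table in the introduction.

```r
# Great-circle distance in metres between points given in decimal degrees
haversine <- function(lat1, lon1, lat2, lon2, radius = 6371000) {
  toRad <- pi / 180
  dLat <- (lat2 - lat1) * toRad
  dLon <- (lon2 - lon1) * toRad
  a <- sin(dLat / 2)^2 +
    cos(lat1 * toRad) * cos(lat2 * toRad) * sin(dLon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Two attractions from the table in the introduction
places <- data.frame(
  NOME = c("SFMOMA", "LANDS END"),
  lat = c(37.78572, 37.77986),
  lon = c(-122.4011, -122.5095)
)

# Hypothetical incident locations, NOT taken from the dataset
crimes <- data.frame(
  lat = c(37.7860, 37.7850, 37.7745),
  lon = c(-122.4015, -122.4000, -122.4259)
)

# Count incidents within 500 m of each attraction (radius is an assumption)
places$nearby <- sapply(seq_len(nrow(places)), function(i) {
  d <- haversine(places$lat[i], places$lon[i], crimes$lat, crimes$lon)
  sum(d <= 500)
})
places
```

With the real data, `crimes` would be replaced by the Y (latitude) and X (longitude) columns of final2015Data, and the radius would need to be chosen with care, since the ranking of attractions can change with it.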

Source: filtered data from https://www.kaggle.com/c/sf-crime (accessed on 04/18/2017)