How the story starts

I was attending 2017 American Society of Clinical Oncology (ASCO) annual meeting in Chicago. It was my first time to do a uber from the airport to our hotel. We had a nice uber driver and we talked a lot about the city. He said he grew up in the south part of Chicago and there are a lot of crimes there. He knows the city so well that he tries his best to avoid bad districts to pick up customers. When strangers meet, topic will always be weather. I certainly joked about the badness of winter in Chicago and he replied back with the badness of hotness in Houston.

He said there are around 600 cases of shooting per year in Chicago, and in the holidays such as Memorial day, shooting arises to 40 per day. I asked him why do you think there is such an increase. He said: I think it has to do with the weather. In the Memorial day, it becomes warmer. everybody has access to everybody. that’s why the crime rate is higher as well."

As a budding data scientist (that’s how I define myself) to interrogate data from DNA sequencing on tumor samples, I should not just take his words, I want to verify this by some evidences. Initially, I was surprised that weather can be associated with crime rate. Then I googled around when I got the hotel and it turns out that this notion is very common.

See a post on this topic Chicago Crime vs. Weather: Under the Hood. The author did a very good job to demonstrate the association of weather and crime rate in Chicago.

There is even a Kaggle channel for this topic.

In the post I linked, the analysis was done in python. I want to do some similar and more analysis using R. That’s why I am firing up Rstudio and start to type.

Download the crime data

To associate crime rate with weather, I need data sets for them respectively.

The city of Chicago has a very nice portol for the crime data.

I downloaded the crime data with the csv format. Note that the data set is relatively big (~1.5G). My computer has enough memory to hold the whole data set.

less -S ~/Downloads/Crimes_-_2001_to_present.csv | wc -l
# 6 million lines
##  6346217
# load in the libraries
library(tidyverse)
library(ggplot2)
library(lubridate)

crime<- read_csv("~/Downloads/Crimes_-_2001_to_present.csv", col_names =T)
## split the Date column, keep only the day information

crime<- crime %>% mutate(Day = gsub("([0-9]{2}/[0-9]{2}/[0-9]{4}) .+", "\\1", Date))

## change the Day column to a date type using mdy function from lubridate
crime<- crime %>% mutate(Day_date = mdy(Day)) %>% dplyr::select(-Date, -Day)

## what's the date range for the crime data?
range(crime$Day_date)
## [1] "2001-01-01" "2017-05-27"

Get the weather data

The weather data is a bit harder to get if you follow the linked post above. luckily, I found a prepared weather data set from the kaggle channel (was uploaded two days ago when I wrote this! lucky me).

UPDATE: unfortunately, the weather data set was not clean and there are many missing dates.

On twitter, Kevin Johnson @bigkage suggested a package weatherData.

only the developmental branch in github is working, the package on CRAN does not work…

Take a look at the rnoaa package as well. A post on it https://recology.info/2015/07/weather-data-with-rnoaa/

library("devtools")
#install_github("Ram-N/weatherData")
library(weatherData)

# O'Hare airport code is ORD, worked like a charm!
# other columns can be fetched. ?getWeatherForDate. for me, only temperature is needed
weather <- getWeatherForDate("ORD", "2001-01-01", "2017-05-27")
dim(weather)

However, only 388 records were found.

I opened an issue on github.

The reason for this is that weatherUnderground doesn’t want us to pull a huge volume of data in a single file. (For some reason, they truncate the number of rows to be 400 rows or less.)

To fetch all the data:

multi_weather <- lapply(as.character(2001:2017), 
                        function(x){getWeatherForYear("ORD", x)})

weather <- do.call(rbind, multi_weather)

merge the two data sets

str(weather)
## 'data.frame':    5977 obs. of  4 variables:
##  $ Date             : POSIXlt, format: "2001-01-01" "2001-01-02" ...
##  $ Max_TemperatureF : int  17 44 50 57 46 54 39 45 46 51 ...
##  $ Mean_TemperatureF: int  6 24 34 42 36 40 32 32 32 40 ...
##  $ Min_TemperatureF : int  -6 3 19 26 26 26 26 18 17 28 ...
str(crime)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6346216 obs. of  22 variables:
##  $ ID                  : int  5437456 5437458 5437459 5437460 5437461 5437462 5437463 5437464 5437465 5437466 ...
##  $ Case Number         : chr  "HN270582" "HN237768" "HN247565" "HN270399" ...
##  $ Block               : chr  "028XX E 90TH ST" "013XX S CHRISTIANA AVE" "024XX S SAWYER AVE" "086XX S LAFLIN ST" ...
##  $ IUCR                : chr  "1822" "2024" "2022" "1320" ...
##  $ Primary Type        : chr  "NARCOTICS" "NARCOTICS" "NARCOTICS" "CRIMINAL DAMAGE" ...
##  $ Description         : chr  "MANU/DEL:CANNABIS OVER 10 GMS" "POSS: HEROIN(WHITE)" "POSS: COCAINE" "TO VEHICLE" ...
##  $ Location Description: chr  "STREET" "SIDEWALK" "SIDEWALK" "OTHER" ...
##  $ Arrest              : chr  "true" "true" "true" "false" ...
##  $ Domestic            : chr  "false" "false" "false" "true" ...
##  $ Beat                : chr  "0423" "1021" "1024" "0614" ...
##  $ District            : chr  "004" "010" "010" "006" ...
##  $ Ward                : int  10 24 22 21 4 10 28 32 16 7 ...
##  $ Community Area      : int  46 29 30 71 35 55 27 7 61 48 ...
##  $ FBI Code            : chr  "18" "18" "18" "14" ...
##  $ X Coordinate        : int  1196743 1154244 1155082 1167841 1180453 1199105 1154131 1165540 1164423 1192802 ...
##  $ Y Coordinate        : int  1845841 1893642 1887583 1847481 1881415 1816541 1900784 1917732 1870859 1843576 ...
##  $ Year                : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ Updated On          : chr  "04/15/2016 08:55:02 AM" "04/15/2016 08:55:02 AM" "04/15/2016 08:55:02 AM" "04/15/2016 08:55:02 AM" ...
##  $ Latitude            : num  41.7 41.9 41.8 41.7 41.8 ...
##  $ Longitude           : num  -87.6 -87.7 -87.7 -87.7 -87.6 ...
##  $ Location            : chr  "(41.73185728, -87.55483341)" "(41.863979388, -87.709253509)" "(41.84733605, -87.706339649)" "(41.737026255, -87.660665675)" ...
##  $ Day_date            : Date, format: "2007-04-07" "2007-03-20" ...
# change the POSIXlt to date 
weather$Date<- as.Date(weather$Date)
crime_weather<- inner_join(crime, weather, by = c("Day_date" = "Date"))

## Note that weather information on some dates were missing
anti_join(crime, weather, by = c("Day_date" = "Date")) %>% .$Day_date %>% unique()
##  [1] "2005-01-23" "2004-07-28" "2001-12-25" "2001-12-24" "2001-12-22"
##  [6] "2001-11-04" "2001-11-03" "2004-08-21" "2001-05-28" "2004-08-22"
## [11] "2002-11-24" "2002-02-05" "2001-07-30" "2001-06-18" "2006-01-07"
## [16] "2001-12-23" "2004-08-20" "2006-01-06" "2006-01-08" "2007-01-01"
## [21] "2007-01-04" "2007-01-05" "2001-07-29"
str(crime_weather)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6317559 obs. of  25 variables:
##  $ ID                  : int  5437456 5437458 5437459 5437460 5437461 5437462 5437463 5437464 5437465 5437466 ...
##  $ Case Number         : chr  "HN270582" "HN237768" "HN247565" "HN270399" ...
##  $ Block               : chr  "028XX E 90TH ST" "013XX S CHRISTIANA AVE" "024XX S SAWYER AVE" "086XX S LAFLIN ST" ...
##  $ IUCR                : chr  "1822" "2024" "2022" "1320" ...
##  $ Primary Type        : chr  "NARCOTICS" "NARCOTICS" "NARCOTICS" "CRIMINAL DAMAGE" ...
##  $ Description         : chr  "MANU/DEL:CANNABIS OVER 10 GMS" "POSS: HEROIN(WHITE)" "POSS: COCAINE" "TO VEHICLE" ...
##  $ Location Description: chr  "STREET" "SIDEWALK" "SIDEWALK" "OTHER" ...
##  $ Arrest              : chr  "true" "true" "true" "false" ...
##  $ Domestic            : chr  "false" "false" "false" "true" ...
##  $ Beat                : chr  "0423" "1021" "1024" "0614" ...
##  $ District            : chr  "004" "010" "010" "006" ...
##  $ Ward                : int  10 24 22 21 4 10 28 32 16 7 ...
##  $ Community Area      : int  46 29 30 71 35 55 27 7 61 48 ...
##  $ FBI Code            : chr  "18" "18" "18" "14" ...
##  $ X Coordinate        : int  1196743 1154244 1155082 1167841 1180453 1199105 1154131 1165540 1164423 1192802 ...
##  $ Y Coordinate        : int  1845841 1893642 1887583 1847481 1881415 1816541 1900784 1917732 1870859 1843576 ...
##  $ Year                : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ Updated On          : chr  "04/15/2016 08:55:02 AM" "04/15/2016 08:55:02 AM" "04/15/2016 08:55:02 AM" "04/15/2016 08:55:02 AM" ...
##  $ Latitude            : num  41.7 41.9 41.8 41.7 41.8 ...
##  $ Longitude           : num  -87.6 -87.7 -87.7 -87.7 -87.6 ...
##  $ Location            : chr  "(41.73185728, -87.55483341)" "(41.863979388, -87.709253509)" "(41.84733605, -87.706339649)" "(41.737026255, -87.660665675)" ...
##  $ Day_date            : Date, format: "2007-04-07" "2007-03-20" ...
##  $ Max_TemperatureF    : int  36 57 75 36 32 36 34 36 34 36 ...
##  $ Mean_TemperatureF   : int  26 46 58 26 26 26 26 26 26 26 ...
##  $ Min_TemperatureF    : int  17 36 43 17 21 17 19 17 19 17 ...

ggplot2 for visulization

Most of the time was spent to get the data and tidy the data (the data are relatively tidy in this example, usually data are much more messy). Now, let’s begin some exploration. Independent of weather, what’s the crime rate trend from 2001 to 2017 ?

## 2017 is not quite at the end yet, let's filter it out 
crime %>% filter(Year != 2017) %>% group_by(Year) %>% summarise(average = n()/365) %>%
        ggplot(aes(x= Year, y = average)) + geom_bar(stat= "identity") +
        geom_text(aes(label= round(average)), vjust=-0.2) +
        ggtitle("number of incidents of crimes per day") + 
        theme_bw(base_size = 14)

Overall, there is a decreasing rate of crime over the years. I am still surprised by the great number of crime rates.

Types of crimes

Let’s dig into a bit details of the type of crimes:

# types of crimes
crime %>% .$`Primary Type` %>% unique()
##  [1] "NARCOTICS"                        
##  [2] "CRIMINAL DAMAGE"                  
##  [3] "BATTERY"                          
##  [4] "THEFT"                            
##  [5] "MOTOR VEHICLE THEFT"              
##  [6] "DECEPTIVE PRACTICE"               
##  [7] "BURGLARY"                         
##  [8] "PROSTITUTION"                     
##  [9] "CRIMINAL TRESPASS"                
## [10] "OTHER OFFENSE"                    
## [11] "OFFENSE INVOLVING CHILDREN"       
## [12] "GAMBLING"                         
## [13] "ROBBERY"                          
## [14] "ASSAULT"                          
## [15] "LIQUOR LAW VIOLATION"             
## [16] "CRIM SEXUAL ASSAULT"              
## [17] "PUBLIC PEACE VIOLATION"           
## [18] "SEX OFFENSE"                      
## [19] "KIDNAPPING"                       
## [20] "WEAPONS VIOLATION"                
## [21] "OBSCENITY"                        
## [22] "INTERFERENCE WITH PUBLIC OFFICER" 
## [23] "ARSON"                            
## [24] "STALKING"                         
## [25] "INTIMIDATION"                     
## [26] "HOMICIDE"                         
## [27] "PUBLIC INDECENCY"                 
## [28] "OTHER NARCOTIC VIOLATION"         
## [29] "RITUALISM"                        
## [30] "NON-CRIMINAL"                     
## [31] "HUMAN TRAFFICKING"                
## [32] "CONCEALED CARRY LICENSE VIOLATION"
## [33] "NON - CRIMINAL"                   
## [34] "NON-CRIMINAL (SUBJECT SPECIFIED)" 
## [35] "DOMESTIC VIOLENCE"
# plot it
crime %>% count(Year, `Primary Type`) %>% filter(Year != 2017) %>% 
        ggplot(aes(x = Year, y = n/365)) + 
        geom_line() + geom_point() + 
        facet_wrap(~ `Primary Type`, scale = "free_y") +
        ggtitle("number of incidents of crimes per day") + 
        theme_bw(base_size = 12)
## geom_path: Each group consists of only one observation. Do you need to
## adjust the group aesthetic?

The things I noticed is that NON-CRIMINAL should be combined with NON - CRIMINAL and NON-CRIMINAL (SUBJECT SPECIFIED); only one record of DOMESTIC VIOLENCE; No RITUALISM after 2007; No HUMAN TRAFFICKING before 2013 (no recording but not necessary not happening?).

Let’s fix that and re-plot it:

# there is only one record for DOMESTIC VIOLENCE
crime %>% count(Year, `Primary Type`) %>% filter(Year != 2017, `Primary Type` != "DOMESTIC VIOLENCE") %>% 
        mutate(`Primary Type` = gsub(" - ", "-", `Primary Type`)) %>%
        mutate(`Primary Type` = gsub(" \\(SUBJECT SPECIFIED\\)", "", `Primary Type`)) %>%
        ggplot(aes(x = Year, y = n/365)) + 
        geom_line() + geom_point() + 
        facet_wrap(~ `Primary Type`, scale = "free_y") +
        ggtitle("number of incidents of crimes per day") + 
        theme_bw(base_size = 10)

Our uber driver mentioned around 600 shooting per year. There is no specific type of shooting in the data. The only thing close to it is WEAPONS VIOLATION, it is around 8 to 12 per day. Of course, a weapons violation does not mean a shooting.

Time series: Crime rate across years

crime  %>% filter(Year != 2017, `Primary Type` != "DOMESTIC VIOLENCE") %>% 
        mutate(`Primary Type` = gsub(" - ", "-", `Primary Type`)) %>%
        mutate(`Primary Type` = gsub(" \\(SUBJECT SPECIFIED\\)", "", `Primary Type`)) %>%
        count(Day_date) %>%
        ggplot(aes(x = Day_date, y =n )) + geom_line() +
        ggtitle("number of incidents of crimes per day across years") + 
        ylab("number of incidents per day") +
        xlab("Date") +
        theme_bw(base_size = 14)

Did you observe the cyclic pattern?!

Let’s split by year and explore in more details: which days have the highest crime rate?

First, let’s extract the days. ref from R for data science.

datetime <- ymd_hms("2016-07-08 12:34:56")

year(datetime)
## [1] 2016
month(datetime)
## [1] 7
mday(datetime)
## [1] 8
yday(datetime)
## [1] 190
wday(datetime)
## [1] 6

split by year and day:

crime  %>% filter(Year != 2017, `Primary Type` != "DOMESTIC VIOLENCE") %>% 
        mutate(`Primary Type` = gsub(" - ", "-", `Primary Type`)) %>%
        mutate(`Primary Type` = gsub(" \\(SUBJECT SPECIFIED\\)", "", `Primary Type`)) %>%
        mutate(day = yday(.$Day_date)) %>% 
        mutate(Year = as.factor(Year)) %>%
        count(Year, day) %>%
        ggplot(aes(x = day, y =n, group = Year)) + geom_line(aes(color = Year)) +
        ggtitle("number of incidents of crimes per day across years") + 
        ylab("number of incidents per day") +
        xlab("Date") +
        theme_bw(base_size = 14)

We see more clearly that there is a peak around day 200 for each year. It is around June and July. The weather in June when I attend the ASCO meeting is very good.

One last try to compare crime incidents across month of each year:

crime  %>% filter(Year != 2017, `Primary Type` != "DOMESTIC VIOLENCE") %>% 
        mutate(`Primary Type` = gsub(" - ", "-", `Primary Type`)) %>%
        mutate(`Primary Type` = gsub(" \\(SUBJECT SPECIFIED\\)", "", `Primary Type`)) %>%
        mutate(day = yday(.$Day_date)) %>% 
        mutate(month = month(.$Day_date)) %>%
        mutate(Year = as.factor(Year)) %>%
        count(Year, month) %>%
        ggplot(aes(x = as.factor(month), y =n)) + geom_boxplot() +
        geom_jitter(aes(color = Year), position=position_jitter(0.2)) +
        ggtitle("number of incidents of crimes per month across years") + 
        ylab("number of incidents per month") +
        xlab("month") +
        theme_bw(base_size = 14)

Consistent with our previous observations, July and August have a higher crime rate than other months.

Hypothesis: Weather condition (temperature) is correlated with crime rate

To answer this question, we will need to turn to the combined weather and crime data set.

# tidy the data a bit
crime_weather<- crime_weather %>% filter(Year != 2017, `Primary Type` != "DOMESTIC VIOLENCE") %>% 
        mutate(`Primary Type` = gsub(" - ", "-", `Primary Type`)) %>%
        mutate(`Primary Type` = gsub(" \\(SUBJECT SPECIFIED\\)", "", `Primary Type`)) %>%
        mutate(day = yday(.$Day_date)) %>% 
        mutate(month = as.factor(month(.$Day_date))) %>%
        mutate(Year = as.factor(Year)) 

## number of days for each temperature 
temp_day<- crime_weather %>% na.omit() %>% 
        dplyr::group_by(Mean_TemperatureF) %>% dplyr::summarise(n_days= n_distinct(day)) 

ggplot(temp_day, aes(x = Mean_TemperatureF, y = n_days)) + geom_line() + geom_point() +
        ggtitle("number of days for each temperature")

  • There are fewer days with extremly cold or hot weather.

Average crime per temperature per day

crime_weather %>% na.omit() %>% count(Mean_TemperatureF) %>% 
        inner_join(temp_day) %>%
        mutate(average = n/n_days) %>% 
        ggplot(aes(x = Mean_TemperatureF, y = average)) + 
        geom_line() +
        geom_point() +
        ggtitle("average number of crimes for each temperature per day")
## Joining, by = "Mean_TemperatureF"

average crime per temperature per day per crime type

crime_weather %>% na.omit() %>% count(Mean_TemperatureF, `Primary Type`) %>%
        inner_join(temp_day) %>%
        mutate(average = n/n_days) %>% 
        ggplot(aes(x = Mean_TemperatureF, y = average)) + 
        geom_line() +
        geom_point() +
        facet_wrap(~ `Primary Type`, scales = "free_y") +
        ggtitle("average number of crimes for each temperature per day")
## Joining, by = "Mean_TemperatureF"

Indeed, we see a trend of increasing crime rate as the temperature rises. However, remember that:

  • Correlation does not mean causation.

There are various ways to explore the data, comments are welcome :)

cartographic/Choropleth graph

We have latitude and longitude information in the data set, Let’s try something fancy using the cartographic maps to map the crimes to the Chicago districts.

library(ggmap)

chicago_map_10<- get_map("Chicago, USA", zoom = 10)
# Create the map of chicago, let's filter the data to a smaller data set

ggmap(chicago_map_10) + coord_cartesian() +
        geom_point(data = crime %>% filter(Year == 2017), 
                   aes(x = Longitude, y = Latitude),
                   alpha = 0.1, shape = 1) 

ggmap(chicago_map_10) + coord_cartesian() +
        geom_hex(data = crime %>% filter(Year == 2017), aes(x = Longitude, y = Latitude)) 

The other way is to use the choroplethr package to map the crime number (summarized for each district first) in each district and then plot. it should give a better view since now we have over-plotting issues.

Reproducible environment

sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.5 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] hexbin_1.27.1   ggmap_2.6.1     weatherData_0.5 devtools_1.12.0
##  [5] lubridate_1.6.0 dplyr_0.5.0     purrr_0.2.2     readr_1.0.0    
##  [9] tidyr_0.6.0     tibble_1.2      ggplot2_2.2.1   tidyverse_1.1.1
## 
## loaded via a namespace (and not attached):
##  [1] reshape2_1.4.2    haven_1.0.0       lattice_0.20-34  
##  [4] colorspace_1.2-7  htmltools_0.3.5   yaml_2.1.13      
##  [7] foreign_0.8-67    withr_1.0.2       DBI_0.5-1        
## [10] sp_1.2-4          modelr_0.1.0      readxl_0.1.1     
## [13] jpeg_0.1-8        plyr_1.8.4        stringr_1.2.0    
## [16] munsell_0.4.3     gtable_0.2.0      rvest_0.3.2      
## [19] mapproj_1.2-4     RgoogleMaps_1.4.1 psych_1.6.9      
## [22] evaluate_0.10     memoise_1.0.0     labeling_0.3     
## [25] knitr_1.14        forcats_0.2.0     parallel_3.3.1   
## [28] curl_2.6          proto_0.3-10      broom_0.4.2      
## [31] Rcpp_0.12.8       geosphere_1.5-5   scales_0.4.1     
## [34] backports_1.0.5   formatR_1.4       jsonlite_1.1     
## [37] mnormt_1.5-5      rjson_0.2.15      hms_0.2          
## [40] png_0.1-7         digest_0.6.10     stringi_1.1.2    
## [43] grid_3.3.1        rprojroot_1.2     tools_3.3.1      
## [46] maps_3.1.1        magrittr_1.5      lazyeval_0.2.0   
## [49] xml2_1.0.0        assertthat_0.1    rmarkdown_1.3    
## [52] httr_1.2.1        R6_2.2.0          nlme_3.1-128