Stat. 694 Project - Alameda Crime Data

The data is downloaded from ArcGIS Hub and the address is:
https://hub.arcgis.com/datasets/9e459776d4c3463cad52fe6003ffc668_0/data

Before cleaning the data, I saved the dataset to a file so that I do not need to download it every time I restart R.

library(rgdal)      # readOGR()
library(geojsonio)  # geojson_write()
ala <- readOGR('https://opendata.arcgis.com/datasets/9e459776d4c3463cad52fe6003ffc668_0.geojson')
geojson_write(ala, file = 'Alameda_Crime_Data.geojson')
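
The download-once idea can be made explicit with a small guard. This is only a sketch: fetch_once is a hypothetical helper name, and it uses base R's download.file in place of the readOGR/geojson_write pair, so it would store the raw GeoJSON rather than a re-serialized copy.

```r
# Hypothetical helper (not part of the original project): download a file
# only when no local copy exists yet, then return the local path.
fetch_once <- function(url, path, fetch = utils::download.file) {
  if (!file.exists(path)) {
    fetch(url, path)   # runs only when the file is missing
  }
  path
}

# Usage (needs network access, so it is commented out here):
# geo_path <- fetch_once(
#   'https://opendata.arcgis.com/datasets/9e459776d4c3463cad52fe6003ffc668_0.geojson',
#   'Alameda_Crime_Data.geojson')
```

The fetch argument exists only so the download step can be swapped out, for example when trying the helper without a network connection.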

After cleaning the data, I saved it as an .Rds file; the analysis below starts from that file.

library(pacman)
p_load(tidyverse, DT, dygraphs, plotly, lubridate, xts)

source('Chen_Xiaodan_Stat694_Project.R')

alam <- read_rds('alameda_crime_data.Rds')
alam <- alam %>% mutate(date = as.Date.POSIXct(time))
head(alam)

Here are the crime counts by CrimeCode, in descending order.

alam %>%
  select(CrimeCode) %>%
  group_by(CrimeCode) %>%
  count() %>%
  arrange(desc(n))

The datatable lets us search for the relevant information about each CrimeCode.

datatable(alam)

## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

Here are the crime counts by City, in descending order.

alam %>% select(City) %>%
  group_by(City) %>%
  count() %>%
  arrange(desc(n))

Relabel City as 'OTHERS' for cities with 800 or fewer recorded crimes.

alame <- alam %>% 
         mutate(City=as.character(City)) %>%
         group_by(City) %>%
         summarize(n = n(), .groups = 'drop') %>%
         mutate(city_n = as.factor(ifelse(n > 800, City, 'OTHERS'))) %>%
         # join the per-city label back onto every crime record
         left_join(alam, by = 'City')

alame

Plot the monthly crime counts as a time series for each city.

alame_plot1 <- alame %>% filter(!is.na(month)) %>%
                    group_by(month, city_n) %>%
                    summarise(n_crime=n()) %>% 
                    ggplot(aes(month, n_crime)) +
                    geom_line() + 
                    facet_wrap(~city_n)

## `summarise()` regrouping output by 'month' (override with `.groups` argument)

alame_plot1

Plot the time series for the three cities with the most crimes and compare them with the total across all cities.

There is a spike at a specific time of year in several years. I will identify those dates and the reasons for the high counts (that is, which types of crime they are).
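
One way to find those dates is to count crimes per day, keep the days with the largest counts, and then tabulate the crime codes on those days. The sketch below uses base R on a made-up toy data frame: the column names date and CrimeCode follow the data above, but the values (including the codes 'A', 'B', 'C') and the helper name top_dates are invented for illustration.

```r
# Toy data, invented for illustration only (not from the Alameda dataset).
toy <- data.frame(
  date = as.Date(c('2019-07-04', '2019-07-04', '2019-07-04',
                   '2019-07-05', '2019-07-06', '2019-07-06')),
  CrimeCode = c('A', 'A', 'B', 'C', 'A', 'B')
)

# Hypothetical helper: return the k dates with the most records.
top_dates <- function(df, k = 1) {
  counts <- sort(table(df$date), decreasing = TRUE)
  as.Date(names(counts)[seq_len(min(k, length(counts)))])
}

spike_days <- top_dates(toy, k = 1)

# Crime codes recorded on the spike day(s), most frequent first.
spike_codes <- sort(table(toy$CrimeCode[toy$date %in% spike_days]),
                    decreasing = TRUE)
spike_days
spike_codes
```

On the real data, the same two steps could be applied to alam, with k chosen after looking at the daily counts.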

alamed <- alame %>% filter(!is.na(month)) %>%
                    group_by(month) %>%
                    summarise(n_crime=n(), .groups = 'drop') %>%
                    mutate(city_n = 'TOTAL', city_n = as.factor(city_n))


alameda <- alame %>% filter(!is.na(month), city_n %in% c('CASTRO VALLEY', 'HAYWARD', 'SAN LEANDRO')) %>%
                    group_by(month, city_n) %>%
                    summarise(n_crime=n()) %>% 
                    rbind(alamed)

## `summarise()` regrouping output by 'month' (override with `.groups` argument)

alame_plot2 <- alameda %>%  ggplot(aes(month, n_crime, col = city_n)) +
                 geom_line() 
alame_plot2

Zooming in on the plot shows a decrease in crime counts during the cities' lockdown.
I will run a hypothesis test to check whether the counts in this period are significantly lower than at other times.
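
As a rough sketch of such a test, a one-sided Wilcoxon rank-sum test can compare monthly counts during the lockdown with those from other months. The numbers below are invented purely to show the mechanics and are not results from the Alameda data; choosing this test over, say, a t-test or a Poisson model is also an assumption.

```r
# Monthly crime counts: invented values for illustration only.
lockdown_months <- c(1500, 1450, 1520)
other_months    <- c(2000, 2100, 1950, 2050, 2150, 1980, 2020, 2080)

# H0: the two groups have the same distribution.
# H1: counts in lockdown months tend to be lower.
res <- wilcox.test(lockdown_months, other_months, alternative = 'less')
res$p.value
```

With only a few lockdown months, and with seasonality in the series, a result from a test like this should be read cautiously.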

lockdown <- tibble(date = c(as.numeric(as.yearmon('2020-03-17')), 
                            as.numeric(as.yearmon('2020-05-22'))))
                           
alame_plot2 %+% filter(alameda, month > 'Jun 2018') +
  geom_vline(data = lockdown, aes(xintercept = date), linetype = 'dotted', col = 'blue4') +
  annotate('text', x = as.numeric(as.yearmon('2020-04-22')), y = 2050, label = 'Lockdown', colour = 'blue4', size = 3)  # annotate() draws the label once instead of once per data row

Make an interactive dygraph of the daily crime counts.

alamed <- alame %>% filter(!is.na(date)) %>%
                    group_by(date) %>%
                    summarise(n_crime=n(), .groups = 'drop') %>%
                    mutate(city_n = 'TOTAL', city_n = as.factor(city_n))

time_series <- xts(x= alamed$n_crime, order.by = alamed$date)
 
time_series %>%
  dygraph(main = 'Crime_Counts_Alameda') %>%
  dyRangeSelector(dateWindow = c('2019-01-01','2020-09-28'))

Next, I will plot the crime counts by day of the month to look for any pattern.
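
A minimal sketch of that tabulation, assuming a date column of class Date like the one created above. The toy dates are made up, and day_of_month_counts is a hypothetical helper name.

```r
# Hypothetical helper: number of records on each day of the month (1-31).
day_of_month_counts <- function(dates) {
  day <- as.integer(format(dates, '%d'))
  # factor() with fixed levels keeps days that have zero records
  table(factor(day, levels = 1:31))
}

# Toy dates, invented for illustration only.
toy_dates <- as.Date(c('2020-01-01', '2020-02-01', '2020-02-15', '2020-03-31'))
counts <- day_of_month_counts(toy_dates)
counts

# A simple bar chart is one way to look for a pattern:
# barplot(counts, xlab = 'Day of month', ylab = 'Number of crimes')
```

Days 29 to 31 occur in fewer months, so their raw counts will be lower for that reason alone; dividing by the number of times each day occurs in the study period would make the days comparable.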

Going forward, I will extend the project with the following tools and methods:

Shiny app, echart
heatmap
hypothesis test

Xiaodan Chen

2020/10/20