Introdution

For this project, I decided to work with the NYC criminal complaints dataset which is available on the Open Data NYC website. I will be focusing through the project on the observation on the complaints and analyzing them by average and counts measure on months and years and how is the data spreading by borough(Manhattan Brooklyn, Staten Island, Queens, The Bronx).

hopefully with the mission to visualize and understand and see if criminal complaints are increasing/decreasing on average and comparison.

Data

Data Source

The data was retieve from the NYC Open Data
https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i

The dataset coinstain about 6.4M rows.
it has 35 variables.

Data Preparation DataBricks Spark

Because this dataset is a big file i decided to big data tools such as Databricks SPark tools in order to work with faster filtering,manipulate and wrangling the data.

Data loading MongoDB

After the the big data was organized and mipulated on Databricks Spark. Made a connection from Databricks notebook to MongoDB in order to export the subset of the processed data for later on access the data on Rstudio connection back on MongoDb Atlas Server and retrive the data.

#MongoDB connection Object
con <- mongo(collection = "data_count_days", db = "data_607", url = url)
df.count.days <- con$find()
colnames(df.count.days) <-c("date","type","count","boro")



con <- mongo(collection = "data_count_mean_month", db = "data_607", url = url)
df.count.mean.month <- con$find()
colnames(df.count.mean.month) <-c("mean","date")


con <- mongo(collection = "data_count_sum_month", db = "data_607", url = url)
df.count.sum.month <- con$find()
colnames(df.count.sum.month) <- c("sum","date")


con <- mongo(collection = "data_count_type", db = "data_607", url = url)
df.count.type <- con$find()
colnames(df.count.type) <- c("type","count")

con <- mongo(collection = "data_count_days_alone", db = "data_607", url = url)
df.count.days.alone <- con$find()
colnames(df.count.days.alone) <- c("date","count")


con <- mongo(collection = "data_count_boro", db = "data_607", url = url)
df.count.boro <- con$find()
colnames(df.count.boro) <- c("type","count","boro")


con <- mongo(collection = "data_count_boro_days", db = "data_607", url = url)
df.count.boro.days <- con$find()
colnames(df.count.boro.days) <- c("count","boro","date")

Data cleaning and wrangling

df.year.count <- df.count.days.alone %>% group_by(year=floor_date(as.Date(as.character(date),"%m/%d/%Y"), "year")) %>% summarise(total = sum(count)) %>% filter(as.Date(year,"%m/%d/%Y") > "1999-01-01" & as.Date(year,"%m/%d/%Y") < "2019-01-01") 

df.count.boro  <- df.count.boro[-15,] 
df.count.days <- df.count.days %>% filter(type != "LAW_CAT_CD" ) 
df.count.type <- df.count.type %>% filter(type != "LAW_CAT_CD" ) 

df.count.boro.days <- df.count.boro.days %>% filter(boro != "BORO_NM" )

Data Exploratory

pl1 <- df.count.type  %>%
          plot_ly(x=~type , 
                  y = ~count , 
                  type = "bar",
                   xaxis = list(autotick = T, dtick = 1),
                 marker=list(color= ~count , size=20 , opacity=0.9) ) %>% layout(xaxis = list(title = "Type of Crime"),yaxis = list(title = "Number of Criminal Complaints"))

pl1

pl <- df.count.boro %>%  ggplot( aes(boro, count))
pl + geom_boxplot(varwidth=T, fill="blue") + 
    labs(title="Box plot", 
         subtitle="Criminal Complaints grouped by Borough of NYC",
         x="Borough of NYC",
         y="Criminal Complaints")

pl2 <- df.year.count %>% 
          plot_ly(x=~ year , 
                  y = ~total , 
                  type = "bar",
                 marker=list(color= ~total , size=10 , opacity=0.5) ) %>% layout(xaxis = list(title = "Years"),yaxis = list(title = "Number of Violations"))
  
pl2

Modeling

ANOVA

H0 - The means of daily count crime by borough is the same and doesnt vary.

Ha- The means of daily count crime by borough indeed vary and are not the same.

dat <-  df.count.boro.days %>% filter(as.Date(date,"%m/%d/%Y") > "2006-01-01" & as.Date(date,"%m/%d/%Y") < "2018-01-01")


df.anova <- dat %>% select(boro,count)


anova.analysis <- aov(count ~ boro,data = df.anova)

summary(anova.analysis)

##                Df    Sum Sq  Mean Sq F value Pr(>F)    
## boro            4 286248728 71562182   40753 <2e-16 ***
## Residuals   21905  38465137     1756                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ggplot(df.anova, aes(x=boro, y=count,fill=boro)) + 
    geom_boxplot()

conclusion

Doing the ANOVA analysis and comparing the mean values by the borough over the select range 2006-2008. we obtain a P value below 5 % and we reject the hypothesis null.

This is not a conclusive approach due that other factors may affect the final result and further research is needed.