Data 606 Project Proposal

Data Preparation

library(dplyr)
library(ggplot2)
library(plotly)
data <- read.csv(file="dat.csv")

# select the columns needed for the analysis
data.new <- data %>% select(CMPLNT_FR_DT,CMPLNT_FR_TM,OFNS_DESC,PD_DESC,LAW_CAT_CD,BORO_NM,PREM_TYP_DESC,Latitude,Longitude)

# count complaints by crime type (PD_DESC), most frequent first
crime.compl.count <-  data.new %>% group_by(PD_DESC) %>% summarise(total = n()) %>% arrange(desc(total))

#the first 20 most common crime complaints
first_20 <- crime.compl.count[1:20,]
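The `str()` output further down shows that `PD_DESC` has an empty level (`""`), so blank descriptions can end up in the ranking. A minimal sketch of one way to exclude them, using a small made-up data frame (`toy` and `top_crimes` are hypothetical names, not objects from this project):

```r
library(dplyr)

# Toy stand-in for data.new; these rows are invented for illustration only.
toy <- data.frame(
  PD_DESC = c("ASSAULT 3", "ASSAULT 3", "",
              "LARCENY,PETIT FROM STORE-SHOPL", "ASSAULT 3",
              "LARCENY,PETIT FROM STORE-SHOPL", "AGGRAVATED HARASSMENT 2"),
  stringsAsFactors = FALSE
)

top_crimes <- toy %>%
  filter(PD_DESC != "") %>%                       # drop complaints with no description
  count(PD_DESC, name = "total", sort = TRUE) %>% # same as group_by + summarise + arrange
  slice_head(n = 2)                               # n = 20 would be used on the real data

print(top_crimes)
```

Whether blank descriptions should be dropped or kept as their own category is a judgment call that depends on the research question.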

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

I want to study criminal complaints in NYC over the last few years and determine whether they are increasing or decreasing, comparing the trend by borough and possibly by zip code area.
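As a rough sketch of how this question could be approached, complaints can be counted per borough and year after parsing `CMPLNT_FR_DT`. The example below uses a small made-up data frame (`toy` and `yearly` are hypothetical names) and assumes the dates are in month/day/two-digit-year form, as the `"1/1/07"` values in the `str()` output suggest. Note that `%y` maps the years 69 to 99 to the 1900s, so the real data should be checked for older dates.

```r
library(dplyr)

# Toy stand-in for data.new; these rows are invented for illustration only.
toy <- data.frame(
  CMPLNT_FR_DT = c("1/1/16", "6/15/16", "1/1/17", "3/1/17", "10/6/17", "7/1/17"),
  BORO_NM      = c("BRONX", "BROOKLYN", "BRONX", "BRONX", "BROOKLYN", "BROOKLYN"),
  stringsAsFactors = FALSE
)

yearly <- toy %>%
  mutate(date = as.Date(CMPLNT_FR_DT, format = "%m/%d/%y"),  # assumed date format
         year = as.integer(format(date, "%Y"))) %>%
  count(BORO_NM, year, name = "total") %>%   # complaints per borough and year
  group_by(BORO_NM) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(change = total - lag(total)) %>%    # year-over-year difference
  ungroup()

print(yearly)
```

A positive `change` indicates more complaints than in the previous year for that borough; the first year of each borough has no previous year and is `NA`.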

Cases

What are the cases, and how many are there?

Each case is a criminal complaint reported to the NYPD. There are 469,900 observations and 36 variables.

Data collection

Describe the method of data collection.

The data was retrieved from the NYC OpenData website, downloaded as a CSV file, and loaded into RStudio.

Type of study

What type of study is this (observational/experiment)?

This is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

The response variable is the count of crime complaints (by borough or zip code), which is a numerical variable.

Independent Variable

You should have two independent variables, one quantitative and one qualitative.

The independent variables are the borough, which is qualitative, and the crime complaint level (violation, misdemeanor, felony), which can be expressed as a count per level and is then quantitative.
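One possible way to turn the complaint level into counts is to tabulate `LAW_CAT_CD` within each borough. The sketch below uses a small made-up data frame (`toy`, `level_counts`, and `level_table` are hypothetical names) rather than the real data:

```r
library(dplyr)

# Toy stand-in for data.new; these rows are invented for illustration only.
toy <- data.frame(
  BORO_NM    = c("BRONX", "BRONX", "BROOKLYN", "BROOKLYN", "BROOKLYN", "QUEENS"),
  LAW_CAT_CD = c("FELONY", "MISDEMEANOR", "MISDEMEANOR", "MISDEMEANOR",
                 "VIOLATION", "FELONY"),
  stringsAsFactors = FALSE
)

# Long form: one row per borough/level combination that occurs in the data
level_counts <- toy %>% count(BORO_NM, LAW_CAT_CD, name = "total")

# Wide form: boroughs in rows, levels in columns, missing combinations filled with 0
level_table <- xtabs(total ~ BORO_NM + LAW_CAT_CD, data = level_counts)

print(level_table)
```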

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

str(data.new)

## 'data.frame':    469900 obs. of  9 variables:
##  $ CMPLNT_FR_DT : Factor w/ 1768 levels "1/1/07","1/1/08",..: 582 582 582 582 582 582 582 582 582 582 ...
##  $ CMPLNT_FR_TM : Factor w/ 1440 levels "0:00:00","0:01:00",..: 679 676 676 676 671 668 666 666 665 661 ...
##  $ OFNS_DESC    : Factor w/ 63 levels "","ABORTION",..: 27 30 27 52 8 20 26 27 30 12 ...
##  $ PD_DESC      : Factor w/ 353 levels "","A.B.C.,FALSE PROOF OF AGE",..: 123 137 122 194 21 20 180 122 137 80 ...
##  $ LAW_CAT_CD   : Factor w/ 3 levels "FELONY","MISDEMEANOR",..: 3 2 3 2 2 1 1 3 2 2 ...
##  $ BORO_NM      : Factor w/ 6 levels "","BRONX","BROOKLYN",..: 2 6 5 3 3 2 5 4 5 4 ...
##  $ PREM_TYP_DESC: Factor w/ 73 levels "","ABANDONED BUILDING",..: 55 62 62 20 55 53 47 62 62 20 ...
##  $ Latitude     : num  40.9 40.5 40.7 40.6 40.6 ...
##  $ Longitude    : num  -73.9 -74.2 -73.9 -73.9 -74 ...

summary(data.new)

##   CMPLNT_FR_DT      CMPLNT_FR_TM   
##  1/1/17 :  1986   12:00:00: 12824  
##  6/1/17 :  1681   15:00:00: 10188  
##  10/6/17:  1590   18:00:00:  9731  
##  7/1/17 :  1583   17:00:00:  9429  
##  5/1/17 :  1562   20:00:00:  9064  
##  3/1/17 :  1540   16:00:00:  8895  
##  (Other):459958   (Other) :409769  
##                           OFNS_DESC     
##  PETIT LARCENY                 : 83731  
##  HARRASSMENT 2                 : 66601  
##  ASSAULT 3 & RELATED OFFENSES  : 51561  
##  CRIMINAL MISCHIEF & RELATED OF: 49503  
##  GRAND LARCENY                 : 43468  
##  OFF. AGNST PUB ORD SENSBLTY & : 21942  
##  (Other)                       :153094  
##                            PD_DESC             LAW_CAT_CD    
##  HARASSMENT,SUBD 3,4,5         : 47941   FELONY     :143381  
##  ASSAULT 3                     : 41661   MISDEMEANOR:259066  
##  LARCENY,PETIT FROM STORE-SHOPL: 29062   VIOLATION  : 67453  
##  AGGRAVATED HARASSMENT 2       : 20384                       
##  LARCENY,PETIT FROM BUILDING,UN: 18933                       
##  HARASSMENT,SUBD 1,CIVILIAN    : 18660                       
##  (Other)                       :293259                       
##           BORO_NM                          PREM_TYP_DESC   
##               :   293   STREET                    :137499  
##  BRONX        :103701   RESIDENCE - APT. HOUSE    :102503  
##  BROOKLYN     :137928   RESIDENCE-HOUSE           : 43442  
##  MANHATTAN    :114802   RESIDENCE - PUBLIC HOUSING: 37203  
##  QUEENS       : 91804   CHAIN STORE               : 13897  
##  STATEN ISLAND: 21372   OTHER                     : 13739  
##                         (Other)                   :121617  
##     Latitude       Longitude     
##  Min.   :40.50   Min.   :-74.25  
##  1st Qu.:40.67   1st Qu.:-73.97  
##  Median :40.73   Median :-73.93  
##  Mean   :40.74   Mean   :-73.93  
##  3rd Qu.:40.81   3rd Qu.:-73.88  
##  Max.   :40.91   Max.   :-73.70  
##  NA's   :39      NA's   :39
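The summary shows 293 complaints with a blank `BORO_NM` and 39 with missing coordinates. A minimal sketch of how such rows could be filtered out before any comparison by borough or location, using a small made-up data frame (`toy` and `cleaned` are hypothetical names):

```r
library(dplyr)

# Toy stand-in for data.new; these rows are invented for illustration only.
toy <- data.frame(
  BORO_NM   = c("BRONX", "", "QUEENS", "BROOKLYN"),
  Latitude  = c(40.9, 40.7, NA, 40.6),
  Longitude = c(-73.9, -73.9, NA, -74.0),
  stringsAsFactors = FALSE
)

cleaned <- toy %>%
  filter(BORO_NM != "",                          # drop complaints with no borough
         !is.na(Latitude), !is.na(Longitude))    # drop complaints with no coordinates

print(cleaned)
```

With fewer than 0.1% of the rows affected, dropping them is unlikely to change the counts much, but that is an assumption worth checking on the full data.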

# the 20 most commonly reported crimes, ordered by complaint count
ggplot(first_20, aes(x = total, y = reorder(PD_DESC, total))) +
  geom_point() +
  labs(x = "Number of Crime Complaints", y = "Crime Type")

# count crime complaints by borough
pl <- data.new %>%
  group_by(BORO_NM, .drop = TRUE) %>%
  summarise(total = n()) %>%
  plot_ly(x = ~BORO_NM,
          y = ~total,
          type = "bar",
          marker = list(color = ~total, opacity = 0.9)) %>%
  layout(xaxis = list(title = "Borough"),
         yaxis = list(title = "Number of Crime Complaints"))

pl

# count crime complaints by offense type
pl <- data.new %>%
  group_by(OFNS_DESC) %>%
  summarise(total = n()) %>%
  plot_ly(x = ~OFNS_DESC,
          y = ~total,
          type = "bar",
          marker = list(color = ~OFNS_DESC, opacity = 0.9)) %>%
  layout(xaxis = list(title = "Crime Type Name",
                      tickmode = "linear", dtick = 1),  # label every offense type
         yaxis = list(title = "Number of Crime Complaints"),
         margin = list(b = 240))

pl

boxplot(first_20$total)