Introduction

The word crime is derived from Latin as “cerno” which means sinfulness in English can be defined as “violation of a law in which there is injury to the public or a member of the public and a term in jail or prison, and/or a fine as possible penalty” -https://legal-dictionary.thefreedictionary.com/crime. A quote by Frank Wolf say “We can’t just rail against crime. We must speak of the root problems - devastating family breakup, an insidious culture of violence that cheapens human life, skyrocketing prisoner recidivism rates that robs our communities of husbands and fathers (wives and mothers) - and recognize that there is a societal role in rehabilitation and restoration”. To this end, I decided to take a clue from him and decided to do my research on factors affecting crime rates - social, economic, security and psychological against the popular notion that population is the only determining factor in high rate of crime in a geographical area and at a point in time. The research will focus mainly on New York state for the period of 5-10 years.

Abstract

In other to gain more insight about crime rate, a stratum of the dataset was analyzed. The NYPD historical complaints data from 2010 to 2015 was collected from New York State open data website.

It was discovered that most crimes are committed on Fridays, while July follow closely by August and May are the months with the highest crime rate. In NYC borough, Bronx came first, while Staten Island came last in terms of crime rate. A popular believe is that population alone has the most impact on crime rate, but a further analysis also revealed other culprits; the likes of economic and psychological factors also contribute significantly to crime rate.

It was very difficult combining the needed dataset from all the above dataset into one. Therefore, to make it a little bit easier, we will implore the helping hand of SQL (Structured Query Language). A relational database like ORACLE SQL, MySQL or PostgreSQL. I chose PostgreSQL to access and extract the require data.

If your connection is right, you should see that the above table are now in your database. But some of the data type may be wrong, use the code below to alter the it to desirable data type. Upon further scrutiny, we discovered that the poverty table still have some unwanted character in them, we therefore removed them using gsub from grep, a R based.

Background

Crime and violence impact us psychologically (psychological factor), whether we are directly involved, a family member or friend is involved, our residence or community was impacted or exposure to it on media coverage. This is sometimes the case when a family member or friend is killed or hurt after directly or indirectly exposure to crime, it is natural for us to experience strong feelings and the effects may push some people state of mind beyond it optimal capability and may in turn leads to revenge or survival kind of crime(s).

Also, Unemployment (economical factor) has a great deal of common sense appeal as one of the root cause of criminal behavior as unemployment is always viewed as a causal agent of economic deprivation may ultimately increases the probability of crime.

According study conducted (Social factor) by likes of Wilson (1987), Ferguson (1952), Robins (1995) and Gluecks (1950) established that parents with criminal records are more likely to have children criminal records also. Bowlby (1946) believed that a traumatic experience of parent-child separation and disruption of attachment relations may cause delinquency after parental separation, that where divorce came into being. The different between Bowlby work and mine is that; my research work not only focus on children alone, but all ages, as crime affect everyone not just the children and their immediate family, but society in general.

The role of law enforcement officers (Security) is to protect citizenry against crimes. The research work did focus on the ratio of law enforcement to the crime rates. Does hiring more police /law enforcement officers curb the menace of crimes?

Methodology

The research project answered and provided a recommendation (where necessary) for many of the overlooked factors that may influenced crime rates in a community. A R programing was used to obtain and check for any positive or negative effect/interaction(s) between crime rates and all the factors, while a graphical representation was included in the analysis. Prediction was done on counties to check for any increase or decrease in crime rate in the future as well as a recommendation on best method to curb the menace (crime rates).

The dataset for this research work was derived from various governmental organizations throughout New York State (counties). The datasets were secondary data as primary data are not feasible due to some constraint like cost, government restrictions etc.

Most of the data tables did not conform to the standard R format. I therefore adopted the use of PostgreSQL. After loading the dataset into R, I reloaded them into PostgreSQL as a table and called it directly from SQL into R environment using PostgreSQL package that was connected to R through ODBC. I was parsing SQL queries in R environment and made some cleaning. I also used SQLDF package in R to write SQL queries in R.

Dataset

The required datasets are as follows:

Data Cleaning

There are some missing observations in the datasets before and after combinations. I therefore employ the use of a package in R called MICE, multivariate Imputation by Chained Equations (mice package) is said to be the best approach to tackle end-on the menace missing observation may cause. It is well known fact that if proper care is not taken in dealing with missing observation, the analysis would likely be biased. To avoid this, the mice package in R was used. A fully conditional specification where each incomplete variable is imputed by a separate model at which the algorithm will impute mixes of continuous, binary, ordered or unordered data based on the data type.

Limitation

The best datasets required for this research would have been the one collected from the department of correction, (as the saying goes “from the horse’s mouth”) but due to nature of the datasets, government put some restrictions in it use and therefore blocking public access to it. We therefore opted for the dataset from different organizations (some were scrapped from websites) as we believe it will give same result, which required extra work to bring them together as one dataset.

As mentioned above, this research project did used (heavily) relational database like PostgreSQL, MySQL, OracleSQL and SQLIte (Structured Query Language) for most of the query. Most (if not all) of the dataset used in the analysis are not readily available. I therefore sorted dataset from various government agencies. Some of the dataset was scrapped fro

Packages

As school of thought dictate, we should first install the packages to be used in this research by using the following syntax “install.packages(”The Package name“)” one after the other.

# install.packages(c("flexdashboard", "DT", "pandocfilters", "googleVis", "R.rsp", "rmarkdown", "knitr", "plyr", "plotrix", "tidyr","dplyr", "ggplot2","plotly", "MASS","reshape2", "Amelia","mice","stringi","ROCR", "XML", "httr", "AUC", "magrittr", "rvest", "rgdal", "purrr", "DBI",
# "leaflet", "tigris", "RPostgreSQL" , "data.table", "tibble", "cowplot", "Hmisc", "sp", "maptools", "PerformanceAnalytics" , "ggcorrplot", "ggalt",
# "ggmap", "ggthemes", "readr", "webshot", "caTools", "caret", "rpart", "agricolae", "sqldf"))
suppressMessages(library("rmarkdown"))
#setwd("C:/Users/HP/Desktop")
library(flexdashboard)
library(DT)
library(prettydoc)
suppressMessages(library(R.rsp))
suppressMessages(library(pandocfilters))
suppressPackageStartupMessages(library(googleVis))
op <- options(gvis.plot.tag='chart')

Cont’d…

suppressMessages(library(knitr))
suppressMessages(library(plyr))
suppressMessages(library(plotrix))
suppressMessages(library(tidyr))
suppressMessages(library(dplyr))
suppressMessages(library(ggplot2))
suppressMessages(library(plotly))
suppressMessages(library(MASS))
suppressMessages(library(reshape2))
suppressMessages(library(Amelia))
suppressMessages(library(mice))
suppressMessages(library(stringi))
suppressMessages(library(ROCR))
suppressMessages(library(XML))
suppressMessages(library(httr))
suppressMessages(library(AUC))
suppressMessages(library(magrittr))
suppressMessages(library(rvest))
suppressMessages(library(rgdal))
suppressMessages(library(purrr))
suppressMessages(library(DBI))
suppressMessages(library(leaflet))
suppressMessages(library(tigris))
suppressMessages(library(RPostgreSQL))
suppressMessages(library(data.table))
suppressMessages(library(tibble))
suppressMessages(library(cowplot))
suppressMessages(library(Hmisc))
suppressMessages(library(sp))
suppressMessages(library(maptools))
suppressMessages(library(PerformanceAnalytics))
suppressMessages(library(ggcorrplot))
suppressMessages(library(ggalt))
suppressMessages(library(ggmap))
suppressMessages(library(ggthemes))
suppressMessages(library(readr))
suppressMessages(library(webshot))
suppressMessages(library(caTools))
suppressMessages(library(randomForest))
suppressMessages(library(caret))
suppressMessages(library(rpart))
suppressMessages(library(agricolae))
library(htmltools)
library(RColorBrewer);

PostgreSQL Connection.

It will be very difficult to work or combine all the needed dataset from all the above dataset. Therefore to make it a little bit easier, we will implore the help hand of SQL (Structured Query Language). A relational database like ORACLE SQL, MySQL or PostgreSQL would be nice in doing it better. I outed to use PostreSQL to access and extract the require data.

Connect your PostreSQL and R using the code below.

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "capstone",
                 host = "localhost", port = 5432,
                 user = "postgres", password = "oracle")

Web Scrapping

The divorce came as a table from New York state health department website as follows.Here is the glimpse after cleaning: As mentioned above, this research project did used (heavily) relational database like PostgreSQL, MySQL, OracleSQL and SQLIte (Structured Query Language) for most of the query. Most (if not all) of the dataset used in the analysis are not readily available. I therefore sorted dataset from various government agencies. Some of the datase was scrapped for websites e.g the divorce dataset for year 2015 was scrapped from: https://www.health.ny.gov/statistics/vital_statistics/2015/table50.htm while others were downloaded and the required rows and columns were extracted.

div_2015 <- ("https://www.health.ny.gov/statistics/vital_statistics/2015/table50.htm") %>% 
  map(possibly(~html_session(.) %>% 
                 read_html %>% 
                 html_table(fill = TRUE) %>% 
                 .[[1]], 
               NULL))


div_2014 <- ("https://www.health.ny.gov/statistics/vital_statistics/2014/table50.htm") %>% 
  map(possibly(~html_session(.) %>% 
                 read_html %>% 
                 html_table(fill = TRUE) %>% 
                 .[[1]], 
               NULL))


div_2013 <- ("https://www.health.ny.gov/statistics/vital_statistics/2013/table50.htm") %>% 
  map(possibly(~html_session(.) %>% 
                 read_html %>% 
                 html_table(fill = TRUE) %>% 
                 .[[1]], 
               NULL))



div_2012 <- ("https://www.health.ny.gov/statistics/vital_statistics/2012/table50.htm") %>% 
  map(possibly(~html_session(.) %>% 
                 read_html %>% 
                 html_table(fill = TRUE) %>% 
                 .[[1]], 
               NULL))



div_2011 <- ("https://www.health.ny.gov/statistics/vital_statistics/2011/table50.htm") %>% 
  map(possibly(~html_session(.) %>% 
                 read_html %>% 
                 html_table(fill = TRUE) %>% 
                 .[[1]], 
               NULL))

div_2010 <- ("https://www.health.ny.gov/statistics/vital_statistics/2010/table50.htm") %>% 
  map(possibly(~html_session(.) %>% 
                 read_html %>% 
                 html_table(fill = TRUE) %>% 
                 .[[1]], 
               NULL))

# After scrapping the table, we realized that they are in list format. We will therefore convert it to accepatable format; first as a dataframe, subsetting unwanted rows and adding year column to the dataset as it was not provided in the scrapped table.

div_2010 <- as.data.frame(div_2010)
div_2011 <- as.data.frame(div_2011)
div_2012 <- as.data.frame(div_2012)
div_2013 <- as.data.frame(div_2013)
div_2014 <- as.data.frame(div_2014)
div_2015 <- as.data.frame(div_2015)


div_2010 <- div_2010[-c(1, 2,3,4,5, 11, 12), ]
div_2011 <- div_2011[-c(1, 2,3,4,5, 11, 12), ]
div_2012 <- div_2012[-c(1, 2,3,4,5, 11, 12), ]
div_2013 <- div_2013[-c(1, 2,3,4,5, 11, 12), ]
div_2014 <- div_2014[-c(1, 2,3,4,5, 11, 12), ]
div_2015 <- div_2015[-c(1, 2,3,4,5, 11, 12), ]

div_2010[["year"]] <- rep(2010, nrow(div_2010))
div_2011[["year"]] <- rep(2011, nrow(div_2011))
div_2012[["year"]] <- rep(2012, nrow(div_2012))
div_2013[["year"]] <- rep(2013, nrow(div_2013))
div_2014[["year"]] <- rep(2014, nrow(div_2014))
div_2015[["year"]] <- rep(2015, nrow(div_2015))
#The column names were not in conformitting with the standard R Programming column naming, we will therefore rename the columns.

Merge the divorce dataset from 2010 to 2015.

divorce <- Reduce(function(...) merge(..., all=T), list(div_2010, div_2011,
                            div_2012, div_2013, div_2014, div_2015))
colnames(divorce) <- c("county", "total", "cruelty", "abandonment", "imprisonment",
                       "adultery", "afterlegalseparation", "afterseparationbyagreement", "notstated", "year")
kable(head(divorce[, 1:8]))
county total cruelty abandonment imprisonment adultery afterlegalseparation afterseparationbyagreement
Albany 719 2 7 0 2 1 16
Albany 722 1 20 0 0 0 31
Albany 766 109 460 6 7 5 152
Albany 783 12 67 1 0 2 60
Albany 809 10 23 1 1 0 35
Albany 862 45 194 0 1 2 119

Above is the scrapped and cleaned New York state divorce dataset from the year 2010 to 2015. Also, other dataset like the number Arrest, Mental Health, Police, Unemployment, Poverty, Crimes, Longtitude and Latitude was included in the analysis.

Arrest<- "https://docs.google.com/spreadsheets/d/e/2PACX-1vTQbIoMp7kHRauK90knVVWxL0ZQgSyJo5w53ufpeoJhv3NEKZ7e52Z9fU5gI2yj4Q50kCt2S9-WIe8t/pub?gid=0&single=true&output=csv"

LongLat <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vRdUPece17flbIDQQ-prWnhh9ZOntvmHz9Saao-qC6EvV5VODzQ6TUfacf4oeB_w6EnDq0xmEhjw7X5/pub?gid=0&single=true&output=csv"

Mental <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vT87ZxCxB9WAXFNmOWmxAV0aBlpyiE22Jqu_sgTtXpgpgHCzmskdboIiDtcJVgramrtVoTyqT2lzFeT/pub?gid=0&single=true&output=csv"


CrimesNY <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vQ8KAFelT2l0AXFAabbbZSaeoqmQJRt-11kt7Jm_LKaHGbHa_ndUQidLj8Op_d19thmsyxoe9p5Dl3o/pub?gid=0&single=true&output=csv"

 LawEnf <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vQauvaPEbqzKX1Vr9ojJuwqjYZFYTyMgclAOLxeUg9ithYryjbQz0foYmAe52jk3A7xBUl33vpDfksj/pub?gid=0&single=true&output=csv"

 
UnemploymentData <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vSUjXqUWfUlOIQ_hYncSP8-BlOEVnzNQ5YDg49HD63IZDoZuw-2h8DiXt6hBuHTKeGAa9F2l4c23fwN/pub?gid=0&single=true&output=csv"


PovData <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vSpp6mizldwEOSIXYPIuLH4Y8Ho3WvfkdAY8mTCcNnOwFExK7xo4q7bv6qUW-v2Z49rZIHJxo-3iG1X/pub?gid=0&single=true&output=csv"

Read datasets into R

# Note, the dataset contains column(s) of data we needed. We will load all of them and rename them to meet up with the standard.

nypd <- read.csv("C:/Users/HP/Documents/GitHub/Capstone_Project_2/NYPD_Complaint_Data_Historic.csv",
                 sep = ",", header=TRUE, stringsAsFactors = FALSE)


arrest <- read.csv(url(Arrest),header = TRUE, sep = ",", col.names= c("county", "year", "total",
                                          "felonytotal", "drugfelony","violentfelony", "dwifelony",
                                          "otherfelony", "misdemeanortotal", "drugmisd",
                                         "dwimisd", "propertymisd", "othermisd"))



mental_health <- read.csv(url(Mental), header = TRUE, sep = ",", col.names= c("rowcreateddatetime",
                          "year", "omhregioncode", "omhregionlabel", "county", "agegroup", "ratecodegroup", "countycode",
                          "countOfrecipientbycounty", "total", "paidclaimtotal"))



police <- read.csv(url(LawEnf), header = TRUE, sep = ",", stringsAsFactors = FALSE,
                   col.names=  c("county", "year", "policetotal"))



unemployment <- read.csv(UnemploymentData, header = TRUE, sep = ",", stringsAsFactors = TRUE,
                     col.names= c('year',   'month',    'county', 'total', 'benefitamountsdollars'))


lat_long <- read.csv(url(LongLat), header = TRUE, sep = ",")


crimes_ny <- read.csv(url(CrimesNY),header = TRUE, sep = ",")
crimes_ny <- crimes_ny[, 1:7]
colnames(crimes_ny)= c('county','year','population','indexcount', 'violentcount','propertycount','firearmcount')



poverty <- read_csv(url(PovData), locale = locale(encoding = "UTF-8"))
pov_sub <- poverty[, c(1, 2, 3, 4, 6, 15, 24)]
colnames(pov_sub) <- c("year","state", "countyid", "county", "allageproverty", "ueighteenpoverty", "fivetoseventenpov")

NYPD HISTORICAL COMPLAINTS

The New York City Department of Police, NYPD historical dataset that was downloaded from the city open data source was first analyzed as the stratum of the entire state as the NYC dataset contains all the require variables for further scrutiny. Below is the glimpse of the dataset

kable(head(nypd[, 1:8]))
CMPLNT_NUM CMPLNT_FR_DT CMPLNT_FR_TM CMPLNT_TO_DT CMPLNT_TO_TM RPT_DT KY_CD OFNS_DESC
101109527 12/31/2015 23:45:00 12/31/2015 113 FORGERY
153401121 12/31/2015 23:36:00 12/31/2015 101 MURDER & NON-NEGL. MANSLAUGHTER
569369778 12/31/2015 23:30:00 12/31/2015 117 DANGEROUS DRUGS
968417082 12/31/2015 23:30:00 12/31/2015 344 ASSAULT 3 & RELATED OFFENSES
641637920 12/31/2015 23:25:00 12/31/2015 23:30:00 12/31/2015 344 ASSAULT 3 & RELATED OFFENSES
365661343 12/31/2015 23:18:00 12/31/2015 23:25:00 12/31/2015 106 FELONY ASSAULT
print(paste0(".....Last columns"))
## [1] ".....Last columns"
kable(head(nypd[, 17:24]))
PREM_TYP_DESC PARKS_NM HADEVELOPT X_COORD_CD Y_COORD_CD Latitude Longitude Lat_Lon
BAR/NIGHT CLUB 1007314 241257 40.82885 -73.91666 (40.828848333, -73.916661142)
1043991 193406 40.69734 -73.78456 (40.697338138, -73.784556739)
OTHER 999463 231690 40.80261 -73.94505 (40.802606608, -73.945051911)
RESIDENCE-HOUSE 1060183 177862 40.65455 -73.72634 (40.654549444, -73.726338791)
OTHER 987606 208148 40.73800 -73.98789 (40.7380024, -73.98789129)
DRUG STORE 996149 181562 40.66502 -73.95711 (40.665022689, -73.957110763)

The arrest table…

kable(head(arrest[, 1:8]))
county year total felonytotal drugfelony violentfelony dwifelony otherfelony
Albany 1970 1226 688 97 191 5 395
Albany 1971 1833 829 131 231 6 461
Albany 1972 3035 1054 211 256 8 579
Albany 1973 3573 1134 244 274 28 588
Albany 1974 4255 1329 281 308 17 723
Albany 1975 4173 1259 209 344 12 694

The mental health table

DT::datatable(mental_health, options = list(
  pageLength = 10
))
kable(head(police));
county year policetotal
Albany 2010 587
Albany 2011 1280
Albany 2012 1233
Albany 2013 1228
Albany 2014 1260
Albany 2015 1212

Change the columns to months after dcast. Here is Unemployment table after cleaning..

# The unemployment dataset will be reshaped and have their columns renamed appropriately.

unemployment2 <- unemployment[, -5]
unemployment_dcast <- dcast(unemployment2, year + county ~ month, value.var="total")

names(unemployment_dcast)[names(unemployment_dcast)== 1] <- "jan"
names(unemployment_dcast)[names(unemployment_dcast)== 2] <- "feb"
names(unemployment_dcast)[names(unemployment_dcast)== 3] <- "mar"
names(unemployment_dcast)[names(unemployment_dcast)== 4] <- "apr"
names(unemployment_dcast)[names(unemployment_dcast)== 5] <- "may"
names(unemployment_dcast)[names(unemployment_dcast)== 6] <- "jun"
names(unemployment_dcast)[names(unemployment_dcast)== 7] <- "jul"
names(unemployment_dcast)[names(unemployment_dcast)== 8] <- "aug"
names(unemployment_dcast)[names(unemployment_dcast)== 9] <- "sep"
names(unemployment_dcast)[names(unemployment_dcast)== 10] <- "oct"
names(unemployment_dcast)[names(unemployment_dcast)== 11] <- "nov"
names(unemployment_dcast)[names(unemployment_dcast)== 12] <- "dec"

unemployment_dcast <- tbl_df(unemployment_dcast)



DT::datatable(unemployment_dcast, options = list(
  pageLength = 10
))

New York State crimes count.

DT::datatable(crimes_ny, options = list(
  pageLength = 10
))

Crimes County By County Table

Load Data Into PostreSQL

Let start by checking if any of the dataset do exist in the database, and if it exist, we will remove them and write (or save) our datasets in the PostgreSQL Automatically.

dbRemoveTable(con, "arrest")
## [1] TRUE
dbWriteTable(con, "arrest", arrest, row.names=FALSE)
## [1] TRUE
dbRemoveTable(con, "mental_health")
## [1] TRUE
dbWriteTable(con, "mental_health", mental_health, row.names=FALSE)
## [1] TRUE
dbRemoveTable(con, "police")
## [1] TRUE
dbWriteTable(con, "police", police, row.names=FALSE)
## [1] TRUE
dbRemoveTable(con, "unemployment_dcast")
## [1] TRUE
dbWriteTable(con, "unemployment_dcast", unemployment_dcast, row.names=FALSE)
## [1] TRUE
dbRemoveTable(con, "divorce")
## [1] TRUE
dbWriteTable(con, "divorce", divorce, row.names=FALSE)
## [1] TRUE
dbRemoveTable(con, "crimes_ny")
## [1] TRUE
dbWriteTable(con, "crimes_ny", crimes_ny, row.names=FALSE)
## [1] TRUE
dbRemoveTable(con, "pov_sub")
## [1] TRUE
dbWriteTable(con, "pov_sub", pov_sub, row.names=FALSE)
## [1] TRUE

If your connection is right, you should see that the above table are now in your database. But some of the data type may be wrong, use the code below to alter the it to desirable data type.

dbGetQuery(con, statement = paste("ALTER TABLE crimes_ny
    ALTER COLUMN population TYPE INT USING population::integer"));
# dbGetQuery(con, statement = paste("ALTER TABLE inmates_dcast
#           ALTER COLUMN values TYPE VARCHAR USING values::VARCHAR"))
# 

Data Cleaning (SQL)

After writing (or saving) them into the database, we can now try to subset the columns and the row we needed from the tables.

police_sub <- dbGetQuery(con, statement = paste(
                       "SELECT year, county, policetotal as police",
                       "FROM police",
                       "WHERE year IN (2010, 2011,2012,2013,2014,2015)",
                      "ORDER BY year"));


arrest_sub <- dbGetQuery(con, statement=paste(
                           "SELECT *",
                           "FROM arrest",
                           "WHERE year IN (2010, 2011,2012,2013,2014,2015)",
                           "ORDER BY county ASC"));



divorce$total <- (gsub(",", "", divorce$total));
divorce_sub <-  dbGetQuery(con, statement=paste(
                           "SELECT total as divorce, county, year",
                           "FROM divorce",
                           "ORDER BY county ASC"));


emp <-  dbGetQuery(con, statement=paste(
                           "SELECT *",
                           "FROM unemployment_dcast",
                           "WHERE year IN (2010, 2011,2012,2013,2014,2015)",
                           "ORDER BY county ASC"));


crimes_ny_sub <-  dbGetQuery(con, statement=paste(
                           "SELECT county, year, population ",
                           "FROM crimes_ny",
                           "WHERE year IN (2010, 2011,2012,2013,2014,2015)",
                           "ORDER BY county ASC"));

Saving them in tbl for faster accessibility.

# For easy and better table ouput
police_sub <- tbl_df(police_sub)
arrest_sub <- tbl_df(arrest_sub)
divorce_sub <- tbl_df(divorce_sub)
crimes_ny_sub <- tbl_df(crimes_ny_sub)
#inmates_sub <- tbl_df(inmates_sub)
dbGetQuery(con, statement = paste("ALTER TABLE mental_health ALTER COLUMN total TYPE INT"));
mental_health_sub <- as.data.frame(dbGetQuery(con, statement = paste(
                       "SELECT year, county, agegroup, total as mentalhealth",
                       "FROM mental_health",
                       "WHERE year BETWEEN 2010 AND 2015",
                       "ORDER BY year, agegroup")));

Finally, after all the hard work done by PostreSQl, we now have our sought after dataset, hoooray!

Lets disconnect PostreSQL and start using R and SQLDF.

#dbListTables(con);

dbDisconnect(con)
## [1] TRUE
detach("package:RPostgreSQL", unload=TRUE)
library(sqldf);

DATA CLEANING (R and SQL)

Upon further scrutiny, we discovered that the poverty table still have some unwanted character in them, we therefore removed them using gsub from grep, a R based.

mental_sub <- dcast(mental_health_sub, year +county ~ agegroup, value.var="mentalhealth", fun.aggregate = sum)

mental_sub <- mental_sub[, c(1:4)]

divorce_sub$divorce <- (gsub(",", "", divorce_sub$divorce));
divorce_sub1 <- sapply(divorce_sub$divorce, as.numeric)
divorce_sub2 <- divorce_sub[, c(2:3)]
div_extraction <- cbind.data.frame(divorce_sub2, divorce_sub1)
colnames(div_extraction) <- c("county", "year", "divorce")
povselect <- sqldf::sqldf("SELECT * FROM pov_sub 
                          WHERE year BETWEEN 2010 AND 2015 AND state = 36 AND county !='New York'
                          ORDER BY year ASC")

povselect$county <- (gsub("\\(NY)", "", povselect$county));
povselect$county <- (gsub("\\County", "", povselect$county));
povselect$allageproverty <- (gsub(",", "", povselect$allageproverty));
povselect$ueighteenpoverty <- (gsub(",", "", povselect$ueighteenpoverty));
povselect$fivetoseventenpov <- (gsub(",", "", povselect$fivetoseventenpov))
col_pov <- c('allageproverty', 'ueighteenpoverty', 'fivetoseventenpov')
povselect[col_pov] <- sapply(povselect[col_pov], as.numeric)
povertydata <- povselect[, -c(2:3)]
emp$jan <- (gsub(",", "", emp$jan))
emp$feb <- (gsub(",", "", emp$feb))
emp$mar <- (gsub(",", "", emp$mar))
emp$apr <- (gsub(",", "", emp$apr))
emp$may <- (gsub(",", "", emp$may))
emp$jun <- (gsub(",", "", emp$jun))
emp$jul <- (gsub(",", "", emp$jul))
emp$aug <- (gsub(",", "", emp$aug))
emp$sep <- (gsub(",", "", emp$sep))
emp$oct <- (gsub(",", "", emp$oct))
emp$nov <- (gsub(",", "", emp$nov))
emp$dec <- (gsub(",", "", emp$dec))



emp1 <- sapply(emp[, c(3:14)], as.data.frame.numeric)
emp2 <- as.data.frame(emp1)


unemp_sub <- cbind.data.frame(emp2,emp[, c(1,2)], stringsAsFactors=FALSE)
colnames(unemp_sub) <- c("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug",
                         "sept", "oct", "nov", "dec", "year", "county")

df_unemp_sub <-sqldf::sqldf("SELECT county, year, (jan + feb + mar + apr + may +
                           jun + jul + aug + sept + oct + nov + dec) as unemployment
                           FROM unemp_sub")
nypd[, 2] <- as.Date.character(nypd[, 2], format = "%m/%d/%Y")
nypd[, 4] <- as.Date.character(nypd[, 4], format = "%m/%d/%Y")
nypd[, 6] <- as.Date.character(nypd[, 6], format = "%m/%d/%Y")

Data cleaning of NYPD’s latitude and Longtitude from coordinate to a usable R format.

nypd$Lat_Lon <- (gsub(")","", nypd$Lat_Lon))
nypd$Lat_Lon <- (gsub(",",":", nypd$Lat_Lon))
nypd$Lat_Lon <- (gsub("[[alphanum:]]","", nypd$Lat_Lon))

nypd_new <- subset(nypd, CMPLNT_FR_DT >= '2010-1-1' & CMPLNT_FR_DT <= '2015-12-31', 
                   select=c(CMPLNT_NUM , CMPLNT_FR_DT, CMPLNT_FR_TM, RPT_DT,  OFNS_DESC,
                            CRM_ATPT_CPTD_CD, LAW_CAT_CD,BORO_NM,  Latitude, Longitude,
                            HADEVELOPT, ADDR_PCT_CD, PREM_TYP_DESC))

colnames(nypd_new) <- c("ComplainNum", "IncidentDate", "IncidentTime",
                        "ReportedDate", "OffenseDesc", "CrimeFailSuccces", "CrimeLevel",
                        "Borough", "lat", "Long", "NYCHA", "PolicePrecinct", "BuildingType" )

nypd_new<- as.data.frame(nypd_new)
nypd_new$DateTime = paste(nypd_new$IncidentDate, nypd_new$IncidentTime, sep="")

nypd_new$dayofweek <- strftime(nypd_new$IncidentDate, '%A') # Days of the week

nypd_new$months <- strftime(nypd_new$IncidentDate, '%B') # Converting To Months Column

nypd_new$year <- strftime(nypd_new$IncidentDate, '%Y') # Converting To Year Column

kable(head(nypd_new[, 10:17]))
Long NYCHA PolicePrecinct BuildingType DateTime dayofweek months year
-73.91666 44 BAR/NIGHT CLUB 2015-12-3123:45:00 Thursday December 2015
-73.78456 103 2015-12-3123:36:00 Thursday December 2015
-73.94505 28 OTHER 2015-12-3123:30:00 Thursday December 2015
-73.72634 105 RESIDENCE-HOUSE 2015-12-3123:30:00 Thursday December 2015
-73.98789 13 OTHER 2015-12-3123:25:00 Thursday December 2015
-73.95711 71 DRUG STORE 2015-12-3123:18:00 Thursday December 2015

Incident Report

 incidentreport <- sqldf::sqldf("SELECT
                            
                             'At ' || IncidentTime || ',' || dayofweek ||', '|| months || ' ' || year
                            ||', a complaint was filled with NYPD police precinct ' || PolicePrecinct ||
                           ' about ' || OffenseDesc
                         || ' that was committed . The complainant maintained that the incident occurred at a '||
                          BuildingType || ' in ' || Borough || ' with co-ordinate ' || lat|| ' and ' ||Long||
                         ' . The Police described the incident as ' || CrimeFailSuccces || ' and classified it as ' ||
                          CrimeLevel ||' with complain number: ' || ComplainNum
                           as Incident_Report from nypd_new
                                ")
 
 kable(head(incidentreport))
Incident_Report
At 23:45:00,Thursday, December 2015, a complaint was filled with NYPD police precinct 44 about FORGERY that was committed . The complainant maintained that the incident occurred at a BAR/NIGHT CLUB in BRONX with co-ordinate 40.828848333 and -73.916661142 . The Police described the incident as COMPLETED and classified it as FELONY with complain number: 101109527
At 23:36:00,Thursday, December 2015, a complaint was filled with NYPD police precinct 103 about MURDER & NON-NEGL. MANSLAUGHTER that was committed . The complainant maintained that the incident occurred at a in QUEENS with co-ordinate 40.697338138 and -73.784556739 . The Police described the incident as COMPLETED and classified it as FELONY with complain number: 153401121
At 23:30:00,Thursday, December 2015, a complaint was filled with NYPD police precinct 28 about DANGEROUS DRUGS that was committed . The complainant maintained that the incident occurred at a OTHER in MANHATTAN with co-ordinate 40.802606608 and -73.945051911 . The Police described the incident as COMPLETED and classified it as FELONY with complain number: 569369778
At 23:30:00,Thursday, December 2015, a complaint was filled with NYPD police precinct 105 about ASSAULT 3 & RELATED OFFENSES that was committed . The complainant maintained that the incident occurred at a RESIDENCE-HOUSE in QUEENS with co-ordinate 40.654549444 and -73.726338791 . The Police described the incident as COMPLETED and classified it as MISDEMEANOR with complain number: 968417082
At 23:25:00,Thursday, December 2015, a complaint was filled with NYPD police precinct 13 about ASSAULT 3 & RELATED OFFENSES that was committed . The complainant maintained that the incident occurred at a OTHER in MANHATTAN with co-ordinate 40.7380024 and -73.98789129 . The Police described the incident as COMPLETED and classified it as MISDEMEANOR with complain number: 641637920
At 23:18:00,Thursday, December 2015, a complaint was filled with NYPD police precinct 71 about FELONY ASSAULT that was committed . The complainant maintained that the incident occurred at a DRUG STORE in BROOKLYN with co-ordinate 40.665022689 and -73.957110763 . The Police described the incident as ATTEMPTED and classified it as FELONY with complain number: 365661343

Inferential Statistics (NYC)

precinct_count <- plyr::count(nypd_new, vars = "PolicePrecinct")
precinct <- sqldf::sqldf("SELECT PolicePrecinct, freq as NumberOfCrimesHandled
                         FROM precinct_count ORDER BY freq DESC LIMIT 10")
precinct_low <- sqldf::sqldf("SELECT PolicePrecinct,  freq as NumberOfCrimesHandled
                             FROM precinct_count ORDER BY freq ASC LIMIT 10")
lat= c(40.67124, 40.84156, 40.82885, 40.81687 , 40.75792, 40.86400 , 40.68564, 40.64174, 40.84865, 40.65247)
long= c(-73.88831, -73.86918, -73.92980, -73.91769,-73.99111 , -73.89342, -73.91079, -74.07568, -73.91168, -73.94426)

lat2 <- c(0,40.77472,40.58751,40.67687,40.58230,40.73599,40.75520,40.50930,40.81077,40.72350)
long2 <- c(0,-73.97468,-73.81171,-74.00806,-74.16894,-73.74580,-73.89195,-74.24090,-73.88334,-73.94862)

precinct <- cbind.data.frame(precinct, lat, long)
precinct_low <- cbind.data.frame(precinct, lat2, long2)

kable(head(precinct, 10))
PolicePrecinct NumberOfCrimesHandled lat long
75 101480 40.67124 -73.88831
43 78285 40.84156 -73.86918
44 75838 40.82885 -73.92980
40 69758 40.81687 -73.91769
14 68051 40.75792 -73.99111
52 64737 40.86400 -73.89342
73 64724 40.68564 -73.91079
120 60366 40.64174 -74.07568
46 60027 40.84865 -73.91168
67 56866 40.65247 -73.94426

MAPS

neigbourhood_json <- GET('http://data.beta.nyc//dataset/0ff93d2d-90ba-457c-9f7e-39e47bf2ac5f/resource/35dd04fb-81b3-479b-a074-a27a37888ce7/download/d085e2f8d0b54d4590b1e7d1f35594c1pediacitiesnycneighborhoods.geojson')

neigbourhood_json2 <- GET('http://data.beta.nyc//dataset/0ff93d2d-90ba-457c-9f7e-39e47bf2ac5f/resource/35dd04fb-81b3-479b-a074-a27a37888ce7/download/d085e2f8d0b54d4590b1e7d1f35594c1pediacitiesnycneighborhoods.geojson')
neighborhoods_low <- readOGR(content(neigbourhood_json2,'text'), 'OGRGeoJSON', verbose = F)

precinct_low2 <- precinct_low
coordinates(precinct_low2) <- ~long2 + lat2
proj4string(precinct_low2) <- proj4string(neighborhoods_low)
combining_low <- over(precinct_low2, neighborhoods_low)
precint_low3 <- cbind(precinct_low, combining_low)
precint_low3 <- precint_low3[, -c(3,4, 8)]

map1 <- leaflet(neighborhoods_low) %>%
  addTiles() %>% 
  addPolygons(popup = ~neighborhoods_low, color = "#03F", fill = TRUE,
              fillColor = "brown", opacity = 0.5, fillOpacity = 0.2) %>% 
  addMarkers(~long2, ~lat2, popup = ~neighborhoods_low, data = precint_low3) %>%
  addProviderTiles("Stamen.Toner", options = providerTileOptions()) %>%
  setView(-73.92, 40.75, zoom = 9.5)




## Map 2

# Neigbourhood with high crime rate

 neighborhoods <- readOGR(content(neigbourhood_json,'text'), 'OGRGeoJSON', verbose = F)

precinct2 <- precinct
coordinates(precinct2) <- ~long + lat
proj4string(precinct2) <- proj4string(neighborhoods)
combining <- over(precinct2, neighborhoods)
precint3 <- cbind(precinct, combining)
precint3 <- precint3[, -8]

map2 <- leaflet(neighborhoods) %>%
  addTiles() %>% 
  addPolygons(popup = ~neighborhood, color = "#03F", fill = TRUE,
              fillColor = "brown", opacity = 0.5, fillOpacity = 0.2) %>% 
  addMarkers(~long, ~lat, popup = ~neighborhood, data = precint3) %>%
  addProviderTiles("HERE.normalday", options = providerTileOptions()) %>%
  setView(-73.92, 40.75, zoom = 10.5)


# New York state counties

#df_maps <- df_incident[1:62, ]  

counties_map <- leaflet(lat_long) %>%
  addTiles() %>%
  addMarkers(lng = lat_long$long, lat = lat_long$lat, popup = lat_long$county, data =lat_long) %>%
  addProviderTiles("Esri.WorldImagery", options = providerTileOptions()) %>%
  setView(-73.82, 40.75, zoom = 5.9)


counties_map
map1
map2
# Table (nyc high and low)
kable(precint_low3[, 1:6])
PolicePrecinct NumberOfCrimesHandled lat2 long2 neighborhood borough
75 101480 0.00000 0.00000 NA NA
43 78285 40.77472 -73.97468 Central Park Manhattan
44 75838 40.58751 -73.81171 Rockaway Beach Queens
40 69758 40.67687 -74.00806 Red Hook Brooklyn
14 68051 40.58230 -74.16894 New Springville Staten Island
52 64737 40.73599 -73.74580 Hollis Hills Queens
73 64724 40.75520 -73.89195 East Elmhurst Queens
120 60366 40.50930 -74.24090 Tottenville Staten Island
46 60027 40.81077 -73.88334 Hunts Point Bronx
67 56866 40.72350 -73.94862 Greenpoint Brooklyn
kable(precint3)
PolicePrecinct NumberOfCrimesHandled lat long neighborhood boroughCode borough
75 101480 40.67124 -73.88831 East New York 4 Brooklyn
43 78285 40.84156 -73.86918 Van Nest 2 Bronx
44 75838 40.82885 -73.92980 Highbridge 2 Bronx
40 69758 40.81687 -73.91769 Melrose 2 Bronx
14 68051 40.75792 -73.99111 Hell’s Kitchen 1 Manhattan
52 64737 40.86400 -73.89342 Fordham 2 Bronx
73 64724 40.68564 -73.91079 Bushwick 4 Brooklyn
120 60366 40.64174 -74.07568 St. George 5 Staten Island
46 60027 40.84865 -73.91168 Morris Heights 2 Bronx
67 56866 40.65247 -73.94426 East Flatbush 3 Brooklyn

From analysis and map above, we can deduce that we have 5 out 10 locations with the highest number of crimes complaint in Bronx, 3 in Brooklyn, 1 each in Manhattan and Staten Island, but none in Queens! Does this have anything to do with poverty level?

daysofweek <- plyr::count(nypd_new, vars = "dayofweek")
offenses <- plyr::count(nypd_new, vars = "OffenseDesc")
daysofweek[, -1] <- sapply(daysofweek[, -1], as.numeric)
daysofweek$percentages <- round(daysofweek$freq/sum(daysofweek$freq)*100)
 
days <- paste(daysofweek$dayofweek, daysofweek$percentages) # add percents to labels
days <- paste(days,"%",sep="") 
 
pie(daysofweek$freq,labels = days, col=rainbow(length(days)),
    main="Pie Chart of Crime Complaints In Days Of The Week") 

print(paste0("Most crimes are carried out on FRIDAYS with the total being: ", max(daysofweek$freq)))
## [1] "Most crimes are carried out on FRIDAYS with the total being: 462851"
print(paste0("While the least crimes were carried out on SUNDAYS with the total being: ", min(daysofweek$freq)))
## [1] "While the least crimes were carried out on SUNDAYS with the total being: 387899"
theme_set(theme_bw())
incident_months <- plyr::count(nypd_new, vars = "months")
incident_months[, 2] <- sapply(incident_months[, 2],as.numeric)
incident_months2 <- sqldf::sqldf("SELECT months as MonthOfIncident, freq as IncidentFrequency
                                FROM incident_months ORDER BY freq DESC")
kable(head(incident_months2))
MonthOfIncident IncidentFrequency
July 268942
August 266856
May 263434
October 259617
June 257991
September 256227
ggplot(incident_months2, aes(x=MonthOfIncident, y=IncidentFrequency)) + 
  geom_point(size=5) + 
  geom_segment(aes(x=MonthOfIncident, 
                   xend=MonthOfIncident, 
                   y=0, 
                   yend=IncidentFrequency)) + 
  labs(title="NYPD Crime Complaints Per Month ", 
       subtitle="FROM 2010 To 2015") + 
  theme(axis.text.x = element_text(angle=90, vjust=0.9)) + coord_flip()

The Most crimes was carried out in July follow by August and May in Third place , while February has the least numbers of crimes complaints.

incident_time <- plyr::count(nypd_new, vars = "IncidentTime")
incident_time <- sqldf::sqldf("SELECT IncidentTime as TimeOfIncident, freq as IncidentFrequency
                              FROM incident_time ORDER BY freq DESC")
kable(head(incident_time))
TimeOfIncident IncidentFrequency
12:00:00 77587
15:00:00 66375
18:00:00 63726
20:00:00 60164
16:00:00 60009
17:00:00 59880

Most Crimes are carried out around 12 Noon, 3PM, 6PM and 4PM, while the least crimes are carried out in the early morning toward the morning.

Crime_cat <- plyr::count(nypd_new, vars = "CrimeLevel")

sprintf(" Misdemeanor crimes are '%i' times greater than Felony crimes in New York City", round(Crime_cat[2, 2]/Crime_cat[1, 2]))
## [1] " Misdemeanor crimes are '2' times greater than Felony crimes in New York City"
Building_cat <- plyr::count(nypd_new, vars = "BuildingType")
Building_cat <- sqldf::sqldf("SELECT BuildingType as TypeOfBuilding, freq as IncidentFrequency
                             FROM Building_cat ORDER BY freq DESC LIMIT 20")

Building_cat[, -1] <- sapply(Building_cat[, -1], as.numeric)


Building_cat$percentages <- round(Building_cat$IncidentFrequency/sum(Building_cat$IncidentFrequency)*100)
days <- paste(Building_cat$TypeOfBuilding, Building_cat$percentages) # add percents to labels
days <- paste(days,"%",sep="") # ad % to labels
pie(Building_cat$IncidentFrequency,labels = days, col= c("purple", "violetred1", "green3","cornsilk", "cyan", "white"),
   main="Top 20 Building/Area Where Crimes Were Carried Out", border = FALSE, init.angle = 315) 

kable(head(Building_cat))
TypeOfBuilding IncidentFrequency percentages
STREET 962514 35
RESIDENCE - APT. HOUSE 634715 23
RESIDENCE-HOUSE 272063 10
RESIDENCE - PUBLIC HOUSING 230529 8
OTHER 81313 3
COMMERCIAL BUILDING 73895 3

Crimes were most carried out on the street, followed by an apartmental houses, where residential houses came third!

At this time, we will be combining the required datasets into one and write it into PostgreSQL for further scrutiny.

df <- sqldf::sqldf("SELECT c.county, c.population, c.year,  a.felonytotal,a.misdemeanortotal,
a.total as totalcrimes,p.policetotal as lawenforcementtotal ,a.othermisd, d.divorce, m.child as mentalchild,
m.adult as mentaladult, (m.child+m.adult) as mentaltotal, u.unemployment ,a.violentfelony, a.drugfelony,
a.dwifelony, a.otherfelony, a.drugmisd, a.dwimisd, a.propertymisd FROM crimes_ny_sub c
LEFT JOIN arrest_sub a ON a.county=c.county AND a.year=c.year
LEFT JOIN div_extraction d ON d.year=c.year AND a.county=d.county
LEFT JOIN df_unemp_sub u ON u.year=a.year AND u.county=d.county
LEFT JOIN mental_sub m ON m.year=a.year AND m.county=a.county
LEFT JOIN police p ON p.year=a.year AND p.county=d.county
ORDER BY c.year, c.county ASC")

df <- sqldf::sqldf("SELECT * from df WHERE county NOT IN ('Out-of-State', 'St. Lawrence') ")
df <- cbind.data.frame(df, povertydata[, 3:5])

Inferential Statistics (NY)

A summary statistics of the datasets

summary(df)
##     county            population           year       felonytotal   
##  Length:372         Min.   :   4678   Min.   :2010   Min.   :   11  
##  Class :character   1st Qu.:  50431   1st Qu.:2011   1st Qu.:  326  
##  Mode  :character   Median :  89674   Median :2012   Median :  529  
##                     Mean   : 316077   Mean   :2012   Mean   : 2529  
##                     3rd Qu.: 233454   3rd Qu.:2014   3rd Qu.: 1512  
##                     Max.   :2640252   Max.   :2015   Max.   :27551  
##                                                      NA's   :6      
##  misdemeanortotal  totalcrimes    lawenforcementtotal   othermisd      
##  Min.   :   52    Min.   :   66   Min.   :   0.0      Min.   :   13.0  
##  1st Qu.:  890    1st Qu.: 1214   1st Qu.: 131.0      1st Qu.:  329.8  
##  Median : 1382    Median : 1934   Median : 224.0      Median :  510.5  
##  Mean   : 6448    Mean   : 8977   Mean   : 550.8      Mean   : 2167.5  
##  3rd Qu.: 4055    3rd Qu.: 5520   3rd Qu.: 471.0      3rd Qu.: 1512.2  
##  Max.   :68603    Max.   :95647   Max.   :4751.0      Max.   :21854.0  
##  NA's   :6        NA's   :6       NA's   :44          NA's   :6        
##     divorce         mentalchild      mentaladult      mentaltotal     
##  Min.   :    5.0   Min.   :   365   Min.   :  1082   Min.   :   2162  
##  1st Qu.:  158.8   1st Qu.:  2215   1st Qu.:  6798   1st Qu.:   9067  
##  Median :  255.5   Median :  6826   Median : 14301   Median :  21757  
##  Mean   :  931.6   Mean   : 30786   Mean   : 71983   Mean   : 102768  
##  3rd Qu.:  667.5   3rd Qu.: 23274   3rd Qu.: 44093   3rd Qu.:  67605  
##  Max.   :15666.0   Max.   :290846   Max.   :915912   Max.   :1159220  
##  NA's   :8         NA's   :12       NA's   :12       NA's   :12       
##   unemployment    violentfelony      drugfelony       dwifelony    
##  Min.   :  1000   Min.   :   1.0   Min.   :   0.0   Min.   :  1.0  
##  1st Qu.:  9275   1st Qu.:  57.0   1st Qu.:  43.0   1st Qu.: 34.0  
##  Median : 15450   Median : 103.5   Median :  85.5   Median : 56.0  
##  Mean   : 48497   Mean   : 698.0   Mean   : 483.0   Mean   : 96.7  
##  3rd Qu.: 37825   3rd Qu.: 370.0   3rd Qu.: 223.0   3rd Qu.:108.0  
##  Max.   :459700   Max.   :9507.0   Max.   :7679.0   Max.   :615.0  
##  NA's   :8        NA's   :6        NA's   :6        NA's   :6      
##   otherfelony         drugmisd           dwimisd        propertymisd  
##  Min.   :    7.0   Min.   :    3.00   Min.   :  19.0   Min.   :    9  
##  1st Qu.:  167.2   1st Qu.:   65.25   1st Qu.: 186.2   1st Qu.:  234  
##  Median :  279.5   Median :  122.50   Median : 306.5   Median :  480  
##  Mean   : 1251.5   Mean   : 1471.95   Mean   : 620.5   Mean   : 2188  
##  3rd Qu.:  824.2   3rd Qu.:  464.00   3rd Qu.: 713.5   3rd Qu.: 1212  
##  Max.   :13590.0   Max.   :27784.00   Max.   :3943.0   Max.   :30197  
##  NA's   :6         NA's   :6          NA's   :6        NA's   :6      
##  allageproverty   ueighteenpoverty fivetoseventenpov
##  Min.   :   490   Min.   :   123   Min.   :    80   
##  1st Qu.:  7311   1st Qu.:  2140   1st Qu.:  1423   
##  Median : 12260   Median :  3510   Median :  2343   
##  Mean   : 48550   Mean   : 15236   Mean   : 10440   
##  3rd Qu.: 32620   3rd Qu.:  9326   3rd Qu.:  6185   
##  Max.   :615319   Max.   :206071   Max.   :138276   
## 

We saw that some the columns in the summary are not in the right format and have some of NAs and are not in the right/accetable datatype for a perfect analysis. We will therefore convert them back to numerica datatype (a more acceptable datatype for analysis)

df[, -1] <- sapply(df[, -1], as.numeric) #Convert some df column to numeric

A glimpse of the dataset with the NAS

missmap(df, legend = TRUE, col = c("wheat","darkred", col=c('yellow', 'darkgreen')),
        main ="Plot Showing The Missing Values Per Observation",
        y.cex = 1, x.cex = 0.8,csvar = NULL, tsvar =NULL, rank.order = TRUE)

The above is a graphical representation of all the missing values.The darkgreen area shows the no missing values while the other color indicates the present of missing values in each columns.

A mulitivariate Imputation by Chained Equations (mice package) is said to be the best approach to tackle end-on the menace missing observation may cause. It is well know that if proper care is not taken in dealing with missing observation, the analysis would likely be biased. To avoid this, the mice package in R will use a fully conditional specification were each incomplete variable is imputed by a separate model at which the algoritm will impute mixes of continous, binary, ordered or unordered data based on the data type.

df_missing_val <- mice(df, m=1, method='cart', seed = 500, printFlag=FALSE, where = is.na(df))
## Warning: Number of logged events: 87
df_complete <- complete(df_missing_val, action='long', include =FALSE)

If possible, we will show map of the location, and as googlevis requirement, the longtitude and latitude needed for the map must be in “lat:long” format

df_complete <- merge(df_complete , lat_long, all=T)
df_complete$Lat_Long = paste(df_complete$lat, df_complete$long, sep=":")
df_cleaned <- df_complete[, -c(2,3)]

df_pca <- df_complete[-c(235, 302),  c(1, 4,6, 7, 8, 9, 11, 12,13, 15, 23, 24, 25)]

DT::datatable(df_cleaned, options = list(
  pageLength = 10
))

Checking out for outliers (an observation point that is abnormal distant from other observations) in an analysis one of the best way to avoid variablity in the measurement or experimantal error.

outliers_check <- function(check) {
  check < quantile(check, 0.25, na.rm = TRUE) - 1.5 * IQR(check, na.rm = TRUE) |
    check > quantile(check, 0.75, na.rm = TRUE) + 1.5 * IQR(check, na.rm = TRUE)
}


outlier_obs <- df_cleaned %>%
  group_by(year, county) %>%
  mutate(Flagged = is.na(felonytotal) | outliers_check(felonytotal) | felonytotal <= 0) %>%
  ungroup()

#outliers <- subset(outlier_obs, Flagged == TRUE | year==2015, select=c(year, county, population, lawenforcementtotal, felonytotal, divorce))

A boxplot showing observation with oultiers

theme_set(theme_bw())


ct <- ggplot(df_cleaned, aes(county, totalcrimes))
ct + geom_boxplot() + 
  geom_dotplot(binaxis='y', 
               stackdir='center', 
               dotsize = .15, 
               fill="red") +
  theme(axis.text.x = element_text(angle=75, vjust=0.6)) + 
  labs(title="Box plot + Dot plot of", 
       subtitle="County Vs Total Crimes",
       x="New York State Counties",
       y="Total Crimes (Felony & Misdemeanor) ")

df_corr <- df_cleaned[ , -c(1,3, 24, 25, 26)]


corr_coef <- cor(df_corr, method = "kendall", use = "complete")

ggcorrplot(corr_coef, hc.order = TRUE, 
           type = "lower", 
           lab = TRUE, 
           lab_size = 3, 
           method="circle", 
           colors = c("tomato2", "white", "springgreen3"), 
           title="Correlogram of variables", 
           ggtheme=theme_bw)

The data set would be used in two different way.

Incidents Rate

Incidence: a measure of the occurrence of a new disease, in a defined population at risk for the disease, in a specified time period.

while

1). Incidence proportion (also known as cumulative incidence) is the number of new cases within a specified time period divided by the size of the population initially at risk.

  • Mathematically as,

\(Incidence\quad Rate\quad =\quad \frac { Number\quad of\quad newcases\quad of\quad incidence\quad or\quad disease \quad or\quad injury\quad during\quad specified\quad period }{ Time\quad each\quad person\quad was\quad observed,\quad totaled\quad for\quad all\quad persons }\)

df_incident <- sqldf::sqldf("SELECT 

county as County, year as Year, population as Populations, 
(((felonytotal + misdemeanortotal)/population)*100000) as TotalCrimes,
((mentalchild/population)*10000) as ChildMental, ((mentaladult/population)*10000) as AdultMental,
((mentaltotal/population)*10000) as TotalMental,
((lawenforcementtotal/population)*100000) as lawenforcementtotal,
((divorce/population)*100000) as DivorcePerPopPerK, ((drugfelony/population)*100000) as DrugFelonyPerPopPerK, 
((violentfelony/population)*100000) as ViolentTotal, ((dwifelony/population)*100000) as DWIFelony,
((otherfelony/population)*100000) as OtherFelony, ((felonytotal/population)*100000) as TotalFelony, 
((propertymisd/population)*100000) as PropertyMisdemeanor, 
((drugmisd/population)*100000) as DrugMisdemeanor, ((dwimisd/population)*100000) as DwiMisdemeanor,
((othermisd/population)*100000) as OtherMisdemeanor,
((misdemeanortotal)/population)*100000
as TotalMisdemeanor, 
((unemployment/population)*100000) as UnemploymentTotal,
((allageproverty/population)*100000) as AllPersonInPoverty ,
((ueighteenpoverty/population)*100000) as Under18InPoverty,
((fivetoseventenpov/population)*100000) as FiveTo17InPoverty,lat, long, Lat_long
FROM df_cleaned
WHERE county NOT IN ('Out-of-State', 'St. Lawrence') 
ORDER BY Year, county ASC

")
df_incident <- as.data.frame(df_incident)
kable(head(df_incident[, 1:9]))
County Year Populations TotalCrimes ChildMental AdultMental TotalMental lawenforcementtotal DivorcePerPopPerK
Albany 2010 295267 3128.016 1638.5847 2626.775 4265.360 198.8031 259.4262
Allegany 2010 48531 2542.705 649.4818 2695.185 3344.666 325.5651 327.6256
Bronx 2010 1387983 6352.599 1780.7855 4031.065 5811.851 193.8064 223.5618
Broome 2010 191892 2814.083 860.3277 1962.093 2822.421 128.7182 355.9294
Cattaraugus 2010 76602 2545.625 1243.8318 2883.345 4127.177 304.1696 276.7552
Cayuga 2010 78400 2022.959 2907.9082 2459.184 5367.092 302.2959 270.4082

Incident Table Cont’d…“)

kable(head(df_incident[, 10:18]))
DrugFelonyPerPopPerK ViolentTotal DWIFelony OtherFelony TotalFelony PropertyMisdemeanor DrugMisdemeanor DwiMisdemeanor OtherMisdemeanor
173.4024 207.9474 59.60707 529.6901 970.6469 748.1364 270.94122 389.13932 749.1525
115.3902 160.7220 90.66370 317.3229 684.0988 473.9239 98.90585 471.86335 813.9128
553.2489 494.5306 5.25943 484.0117 1537.0505 1486.1133 1727.83096 87.68119 1513.9234
113.6056 178.2253 68.26757 390.3237 750.4221 781.6897 171.97173 305.38011 804.6193
58.7452 161.8757 109.65771 298.9478 629.2264 625.3100 91.38143 399.46738 800.2402
99.4898 88.0102 47.19388 315.0510 549.7449 521.6837 103.31633 247.44898 600.7653

print(paste0(“Incident Table Last”))

kable(head(df_incident[, 19:26]))
TotalMisdemeanor UnemploymentTotal AllPersonInPoverty Under18InPoverty FiveTo17InPoverty lat long Lat_Long
2157.369 15579.12 12917.12 3500.899 2258.634 42.6526 -73.7562 42.6526:-73.7562
1858.606 21841.71 15674.52 4897.900 3294.801 42.0901 -78.4942 42.0901:-78.4942
4815.549 19906.58 29148.48 10956.330 7667.457 40.8448 -73.8648 40.8448:-73.8648
2063.661 22199.99 16375.88 4809.476 3218.477 42.1792 -75.8534 42.1792:-75.8534
1916.399 28067.15 15209.79 5276.625 3469.883 42.3292 -78.8681 42.3292:-78.8681
1473.214 23341.84 12196.43 4053.571 2739.796 42.9190 -76.7263 42.919:-76.7263

Charts

Poverty Chart

theme_set(theme_bw())

poverty_inc <- sqldf::sqldf("SELECT 
                        Year, AllPersonInPoverty, County
                        FROM df_incident
                        ORDER BY AllPersonInPoverty DESC LIMIT 30
                        ")

poverty_plot <- ggplot(poverty_inc, aes(x=AllPersonInPoverty, y=reorder(County, +AllPersonInPoverty), fill=as.factor(Year))) +
geom_point(colour="blue", size=5, alpha=.8) +
scale_fill_brewer(palette="Blues", breaks=rev(levels(poverty_inc$Year))) +
  labs(title="Chart of Top 30 Poverty  In New York State Per County
       \n (Incident Rate Per 100K Persons From 2010 To 2015))")




## Divorce Chart

div_inc <- sqldf::sqldf("SELECT 
                        Year, DivorcePerPopPerK, County
                        FROM df_incident
                        ORDER BY DivorcePerPopPerK DESC LIMIT 30
                        ")

divorce_plot <- ggplot(div_inc, aes(x=DivorcePerPopPerK, y=reorder(County, +DivorcePerPopPerK), fill=as.factor(Year))) +
geom_point(colour="pink", size=5, alpha=.8) +
scale_fill_brewer(palette="Blues", breaks=rev(levels(div_inc$Year))) +
  labs(title="Chart of Top 30 Divorce  In New York State Per County
       \n (Incident Rate Per 100K Persons From 2010 To 2015))")



## Unemployment Chart



unemp_inc <- sqldf::sqldf("SELECT 
                        Year, UnemploymentTotal, County
                        FROM df_incident
                        ORDER BY UnemploymentTotal DESC LIMIT 30
                        ")


unemployment_plot <- ggplot(unemp_inc, aes(x=UnemploymentTotal, y=reorder(County, +UnemploymentTotal), fill=as.factor(Year))) +
geom_point(colour="brown", size=5, alpha=.8) +
scale_fill_brewer(palette="Blues", breaks=rev(levels(unemp_inc$Year))) +
  labs(title="Chart of Top 30 Unemployment In New York State Per County
       \n (Incident Rate Per 100K Persons From 2010 To 2015)")



## Mental Health Chart

mentalhealth_inc <- sqldf::sqldf("SELECT 
                        Year, TotalMental, County
                        FROM df_incident
                        ORDER BY TotalMental DESC LIMIT 30
                        ")

mental_plot <- ggplot(mentalhealth_inc, aes(x=TotalMental, y=reorder(County, +TotalMental), fill=as.factor(Year))) +
geom_point(colour="purple", size=5, alpha=.8, shape = 21) +
scale_fill_brewer(palette="Blues", breaks=rev(levels(mentalhealth_inc$Year))) +
  labs(title="Chart of Top 30 Mental Health In New York State Per County
       \n (Incident Rate Per 100K Persons From 2010 To 2015)")


poverty_plot #poverty plot
divorce_plot #divorce plot
unemployment_plot #unemployment plot
mental_plot #mental health plot

The above plot depicts some of the variable used in the analysis. As we can see, most of the plot shows NYC! The main reason NYC historical complaint data was used for further investigation as its a representative of the other counties in NY state.

Total Felony Popuation Desity

ggplot(df_incident, aes(TotalFelony)) + geom_density(aes(color=as.factor(Year), fill=as.factor(Year)),alpha=0.1) +
  labs(x="Total Felony", y="Density", title="Felony Total Per 100k Population Density") +
  theme(legend.position = "top")+ theme_economist()

Total Misdemeanor Popuation Desity

misd_dens_plt <- ggplot(df_incident, aes(TotalMisdemeanor)) + geom_density(aes(color=as.factor(Year),
                        fill=as.factor(Year)),alpha=0.1) + labs(x="Misdemeanor Total", y="Density",
                        title="Misdemeanor Total Per 100k Population Density") + theme(legend.position = "top")+
                       theme_economist()

misd_dens_plt

Felony & Misdemeanor Combo

felony <- dplyr::select(df_incident, TotalFelony, Year, UnemploymentTotal) %>%
    mutate(Year = as.factor(Year)) %>%
ggplot(aes(x = Year, y = TotalFelony)) + 
    geom_jitter(aes(col = Year), alpha = 0.8) +
    labs(title = "x") +
    theme_light()

misd <- dplyr::select(df_incident, TotalMisdemeanor, Year, UnemploymentTotal) %>%
    mutate(Year = as.factor(Year)) %>%
ggplot(aes(x = Year, y = TotalMisdemeanor)) + 
    geom_jitter(aes(col = Year), alpha = 0.8) +
    labs(title = "x") +
    theme_light()

p <- plot_grid(felony, misd) 
title <- ggdraw() + draw_label("A Plot of Felony and Misdemeanor Incident Per Population 100k", fontface='bold')
plot_grid(title, p, ncol=1, rel_heights = c(0.1, 2.2))

3D Scatter Plot

A better view of some important observations in a reactive 3D!

df_incident$size <- df_incident$TotalCrimes
colors <- c('#4AC6B7', '#1972A4', '#965F8A', '#FF7070', '#C61951')

comparison <- plot_ly(df_incident[1:1000, ], x = ~TotalMisdemeanor, y = ~TotalFelony, z = ~lawenforcementtotal,
                      color = ~Year, size = ~size, colors = colors,
             marker = list(symbol = 'circle', sizemode = 'diameter'), sizes = c(1, 50),
             text = ~paste('County:', County, '<br>Population Total:', Populations,
                           '<br>Total Unemployment Pop in 100K:', UnemploymentTotal)) %>%
  layout(title = 'Comparison Among Police, Felony and Misdemeanor Crimes In New York State',
         scene = list(xaxis = list(title = 'Misdemeanor Pop At Risk Per 1K',
                      gridcolor = 'rgb(255, 255, 255)',
                      type = 'log',
                      zerolinewidth = 1,
                      ticklen = 5,
                      gridwidth = 2),
               yaxis = list(title = ' Felony Crime Pop At Risk Per 100K',
                      gridcolor = 'rgb(255, 255, 255)',
                      zerolinewidth = 1,
                      ticklen = 5,
                      gridwith = 2),
               zaxis = list(title = 'Police Pop Per 100K',
                            gridcolor = 'rgb(255, 255, 255)',
                            type = 'log',
                            zerolinewidth = 1,
                            ticklen = 5,
                            gridwith = 2)),
         paper_bgcolor = 'rgb(243, 243, 243)',
         plot_bgcolor = 'rgb(243, 243, 243)')

comparison

Regression Analysis & Prediction

  • Check for normality
qqnorm(df_cleaned$totalcrimes)
qqline(df$totalcrimes, col = 2)

From the above normality check, we can see that the variable under observation is not normally distributed. We will therefore use Generalized Linear Model (conventional linear regression models) to; select best model, estimate parameters and interpretations. The generalized linear models (GLMs) are a broad class of models that include linear regression, ANOVA, Poisson regression, log-linear models etc.

i.e The dependent variable Yi does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal etc.

For Linear Regression, we have

\({ Y }_{ i }\quad =\quad { \beta }_{ 0 }\quad +\quad { \beta }_{ 1 }{ x }_{ i }\quad +\quad { \varepsilon }_{ i }\quad\)

where,

\({ Y }_{ i }\quad =\quad Dependent\quad Variable.\\ { \beta }_{ 0 }\quad =\quad Intercept.\\ { \beta }_{ 1 }\quad =\quad Parameter\quad To\quad Be\quad estimated\quad (slope).\\ { x }_{ i }\quad =\quad Independent\quad Variable.\\ { \varepsilon }_{ i }\quad =\quad Error\quad Term.\\ i\quad \quad =\quad 1,\quad 2,\quad 3,\quad 4,......,\quad n\quad\)

While Multiple linear regression, we have (in matrix notation):

\(\hat { \beta } =\quad ({ { X }^{ T }X) }^{ -1 }{ X }^{ T }Y\)

SETTING UP HYPOTHESIS: For Crimes Total

  • Null Hypothesis: There is no relationship between Crimes and any other variables.

    \(H_o:\) \(\mu_1\) = \(\mu_2\) = \(\mu_3\)…..= \(\mu_n\)

  • Alternative Hythesis:

    \(H_a:\) Not \(H_o:\) \(\mu_1\) \(\neq\) \(\mu_2\) \(\neq\) \(\mu_3\)…..\(\neq\) \(\mu_n\)

Rejection:

Reject \(H_o\) (Null Hypothesis) if the calculated value (P-Value) is less than the tabulated value(Table value = 0.05 ), otherwise do not reject \(H_o\)

  • NB: We used both forward and backward model selection to select the best model for our analysis.

  • Dummy Variables

Counties dummy varibale

counties <- (c("Albany", "Allegany", "Bronx", "Broome", "Cattaraugus", "Cayuga",
         "Chautauqua", "Chemung", "Chenango", "Clinton", "Columbia", "Cortland",
         "Delaware", "Dutchess", "Erie", "Essex", "Franklin","Fulton", "Genesee", 
         "Greene", "Hamilton", "Herkimer", "Jefferson", "Kings", "Lewis","Livingston",
         " Madison", "Monroe", "Montgomery", "Nassau", "New York", "Niagara", "Oneida",
         " Onondaga", " Ontario", "Orange", "Orleans", "Oswego", "Otsego",  "Putnam", "Queens",
         "Rensselaer", "Richmond", "Rockland","Saratoga", "Schenectady", "Schoharie", "Schuyler",
         "Seneca", "St Lawrence",  "Steuben", "Suffolk",  "Sullivan", "Tioga", "Tompkins", "Ulster",
         "Warren", "Washington", "Wayne", "Westchester", "Wyoming", "Yates"))


lab <- c(1, 2, 3, 4, 5,  6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
         26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
         52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64)
dummy <- cbind(counties, lab)
kable(head(dummy))
counties lab
Albany 1
Allegany 2
Bronx 3
Broome 4
Cattaraugus 5
Cayuga 6

Years dummy variable

df_cleaned <- df_cleaned[-c(235, 302), -c(4, 5, 8, 12, 14:20,24:26)]


df_cleaned$county <- factor(df_cleaned$county,
                    levels=counties,
                    labels= lab)

df_cleaned$year <- factor(df_cleaned$year,
                     levels = c("2010", "2011", "2012", "2013", "2014", "2015"),
                     labels = c(1, 2, 3, 4, 5, 6));

#df_cleaned$totalcrimes <- as.factor(df_cleaned$totalcrimes)


set.seed(123)

splits <- sample.split(df_cleaned$county, SplitRatio = 0.8)
train_set <- subset(df_cleaned, splits==TRUE)
test_set <- subset(df_cleaned, splits==FALSE)

#train_set[, 2:19] <- scale(train_set[, 2:19])
#test_set[, 2:19] <- scale(test_set[, 2:19])

Regression

regressor <-  glm(totalcrimes~., family=gaussian(link = "identity"),data=train_set)
stepwise <- step(regressor, direction = "both")
## Start:  AIC=5002.1
## totalcrimes ~ county + population + year + lawenforcementtotal + 
##     divorce + mentalchild + mentaladult + unemployment + allageproverty + 
##     ueighteenpoverty + fivetoseventenpov
## 
##                       Df   Deviance    AIC
## - lawenforcementtotal  1  241942427 5000.3
## - divorce              1  242655053 5001.1
## <none>                    241814047 5002.1
## - allageproverty       1  245092297 5004.1
## - ueighteenpoverty     1  248825615 5008.5
## - year                 5  257390382 5010.5
## - mentalchild          1  253561895 5014.1
## - unemployment         1  253903448 5014.5
## - fivetoseventenpov    1  254127113 5014.8
## - mentaladult          1  256384468 5017.4
## - population           1  320388196 5083.1
## - county              58 1098802081 5332.7
## 
## Step:  AIC=5000.26
## totalcrimes ~ county + population + year + divorce + mentalchild + 
##     mentaladult + unemployment + allageproverty + ueighteenpoverty + 
##     fivetoseventenpov
## 
##                       Df   Deviance    AIC
## - divorce              1  242705306 4999.2
## <none>                    241942427 5000.3
## + lawenforcementtotal  1  241814047 5002.1
## - allageproverty       1  245277509 5002.3
## - ueighteenpoverty     1  248945811 5006.7
## - year                 5  258021041 5009.2
## - mentalchild          1  253617214 5012.2
## - unemployment         1  254031095 5012.6
## - fivetoseventenpov    1  254342989 5013.0
## - mentaladult          1  256596366 5015.6
## - population           1  320413512 5081.1
## - county              58 1107065647 5332.9
## 
## Step:  AIC=4999.19
## totalcrimes ~ county + population + year + mentalchild + mentaladult + 
##     unemployment + allageproverty + ueighteenpoverty + fivetoseventenpov
## 
##                       Df   Deviance    AIC
## <none>                    242705306 4999.2
## + divorce              1  241942427 5000.3
## + lawenforcementtotal  1  242655053 5001.1
## - allageproverty       1  248206467 5003.8
## - ueighteenpoverty     1  250067204 5006.0
## - year                 5  259171855 5008.6
## - unemployment         1  254031855 5010.6
## - mentalchild          1  254634736 5011.3
## - fivetoseventenpov    1  254805543 5011.5
## - mentaladult          1  256868278 5013.9
## - population           1  327927630 5086.0
## - county              58 2175493407 5530.2

Analysis of Variance (ANOVA)

Analysis of Variance (ANOVA) above is used to compare the means of the variance for statistical significance.

anova(stepwise, test="F")
# 
fit <- aov(totalcrimes~., data=train_set)
summary(fit)
##                      Df    Sum Sq   Mean Sq  F value   Pr(>F)    
## county               58 1.080e+11 1.863e+09 1710.044  < 2e-16 ***
## population            1 4.378e+08 4.378e+08  401.888  < 2e-16 ***
## year                  5 2.327e+07 4.654e+06    4.272 0.000993 ***
## lawenforcementtotal   1 2.568e+06 2.568e+06    2.358 0.126070    
## divorce               1 2.233e+06 2.233e+06    2.050 0.153654    
## mentalchild           1 1.845e+07 1.845e+07   16.940 5.44e-05 ***
## mentaladult           1 3.540e+07 3.540e+07   32.500 3.77e-08 ***
## unemployment          1 1.471e+07 1.471e+07   13.507 0.000298 ***
## allageproverty        1 6.444e+06 6.444e+06    5.916 0.015793 *  
## ueighteenpoverty      1 1.628e+05 1.628e+05    0.149 0.699407    
## fivetoseventenpov     1 1.231e+07 1.231e+07   11.304 0.000911 ***
## Residuals           222 2.418e+08 1.089e+06                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# HSD.test(fit, 'unemployment ')

Residual Plot

residual <- data.frame(Fitted = fitted(stepwise), Residuals = resid(stepwise), Treatment = train_set$totalcrimes)

plot_res <- ggplot(residual, aes(Fitted, Residuals, colour = Treatment)) + geom_point()

ggplotly(plot_res)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

Confident Interval

confint(stepwise)
## Waiting for profiling to be done...
##                           2.5 %        97.5 %
## (Intercept)        3.514137e+04  5.257861e+04
## county2           -4.509359e+04 -3.026503e+04
## county3            1.759665e+05  2.561847e+05
## county4           -1.847632e+04 -1.218690e+04
## county5           -4.003689e+04 -2.698067e+04
## county6           -3.933880e+04 -2.645562e+04
## county7           -2.996834e+04 -2.008484e+04
## county8           -3.761863e+04 -2.524088e+04
## county9           -4.475257e+04 -3.004656e+04
## county10          -3.902694e+04 -2.619976e+04
## county11          -4.312707e+04 -2.893036e+04
## county12          -4.447702e+04 -2.975810e+04
## county13          -4.543456e+04 -3.042556e+04
## county14          -6.699662e+03 -3.683733e+03
## county15           6.884225e+04  1.050451e+05
## county16          -4.706535e+04 -3.154038e+04
## county17          -4.421168e+04 -2.960197e+04
## county18          -4.386743e+04 -2.948005e+04
## county19          -4.344640e+04 -2.916084e+04
## county20          -4.478454e+04 -2.993710e+04
## county21          -5.172073e+04 -3.468023e+04
## county22          -4.270715e+04 -2.869432e+04
## county23          -3.289370e+04 -2.202968e+04
## county24           2.831810e+05  4.268649e+05
## county25          -4.877178e+04 -3.274365e+04
## county26          -4.276384e+04 -2.872511e+04
## county28           4.858833e+04  7.553175e+04
## county29          -4.467252e+04 -3.006745e+04
## county30           9.342428e+04  1.472886e+05
## county31           1.852829e+05  2.661927e+05
## county32          -1.728764e+04 -1.161247e+04
## county33          -1.381400e+04 -9.091988e+03
## county36           4.654261e+03  1.032043e+04
## county37          -4.618999e+04 -3.096345e+04
## county38          -3.264768e+04 -2.201878e+04
## county39          -4.273305e+04 -2.860299e+04
## county40          -3.802003e+04 -2.554821e+04
## county41           2.211987e+05  3.367049e+05
## county42          -2.767832e+04 -1.873803e+04
## county43           1.653561e+04  2.676194e+04
## county44          -6.870157e+03 -2.116914e+03
## county45          -1.868452e+04 -1.235097e+04
## county46          -2.706671e+04 -1.781711e+04
## county47          -4.634840e+04 -3.126450e+04
## county48          -4.977473e+04 -3.336613e+04
## county49          -4.742532e+04 -3.170056e+04
## county52          -3.630333e+04 -2.444980e+04
## county53          -3.631639e+04 -2.459065e+04
## county54           1.126232e+05  1.757783e+05
## county55          -3.973076e+04 -2.648663e+04
## county56          -4.529073e+04 -3.044249e+04
## county57          -3.595419e+04 -2.424635e+04
## county58          -2.260695e+04 -1.485317e+04
## county59          -4.202072e+04 -2.810953e+04
## county60          -4.285657e+04 -2.874644e+04
## county61          -3.851339e+04 -2.601193e+04
## county62           6.553923e+04  1.023905e+05
## county63          -4.659207e+04 -3.131903e+04
## county64          -4.876006e+04 -3.267136e+04
## population        -1.316622e-01 -8.400117e-02
## year2             -2.078425e+01  8.704594e+02
## year3             -2.181291e+01  8.911599e+02
## year4              1.632421e+02  1.072794e+03
## year5              1.180490e+02  1.037611e+03
## year6             -5.025215e+02  4.932371e+02
## mentalchild       -6.769890e-02 -1.742043e-02
## mentaladult        8.618134e-03  2.902459e-02
## unemployment       1.103547e-02  4.501041e-02
## allageproverty    -1.598438e-01 -1.112728e-02
## ueighteenpoverty   8.897740e-02  6.283368e-01
## fivetoseventenpov -7.998222e-01 -2.084640e-01

Prediction

pred <- predict(regressor, test_set, type="response")
actual_pred <- cbind.data.frame(test_set$totalcrimes, pred)
actual_pred <- actual_pred[1:20, ]
kable(actual_pred)
test_set$totalcrimes pred
5 8743 8710.9305
11 820 695.6136
16 78402 79652.0196
24 5400 6524.1559
26 2082 1739.9127
31 1664 1861.8931
37 3532 3953.7500
48 2210 2215.0483
50 817 785.2716
59 2006 1314.9362
65 1262 1246.1255
68 1527 1436.8891
73 926 619.6817
84 5997 7320.9345
87 27531 27140.7780
94 891 496.9765
97 1329 1501.8446
104 1284 1081.2709
114 1443 1427.4388
118 1469 1358.9968

Check for Accuracy

MAPE The mean absolute percentage error (MAPE), also known as mean absolute percentage deviation (MAPD), is a measure of prediction accuracy of a forecasting method in statistics. Represented mathematically as:

\(M=\quad \frac { 100 }{ n } \sum _{ t=1 }^{ n }{ \left| \frac { { A }_{ t }\quad -\quad { F }_{ t } }{ { A }_{ t } } \right| }\) ,

Where,

\({ A }_{ t }\quad =\quad Actual\quad Value\\ \quad \quad \quad \quad \quad \quad { F }_{ t }\quad =\quad Predicted\quad Value.\)

min_max_accuracy <- mean(apply(actual_pred, 1, min) / apply(actual_pred, 1, max))*100 
print(paste('The Percentage Accuracy Is: ', min_max_accuracy))
## [1] "The Percentage Accuracy Is:  87.4774508797211"
mape <- mean(abs((actual_pred$pred - actual_pred$`test_set$totalcrimes`))/actual_pred$pred)*100 
print(paste('The MAPE Is: ', mape ))
## [1] "The MAPE Is:  16.5027632643134"

Principal Component Analysis

To obtain the best model, the best way is to adopt a dimensionality reduction. The PCA, like factor analysis, is a mathematical procedure that transforms a number of correlated variables into a smaller number of uncorrelated variables.

prin_comp <- prcomp(train_set[, -c(1,3)], scale. = T)
biplot(prin_comp, scale = 0)

std_dev <- prin_comp$sdev
pr_var <- std_dev^2
pca_var <- pr_var/sum(pr_var)

plot(pca_var, xlab = "Principal Component",
             ylab = "Proportion of Variance Explained",
             type = "b")

plot(cumsum(pca_var), xlab = "Principal Component",
              ylab = "Cumulative Proportion of Variance Explained",
              type = "b")

pca2 <- df_pca[, -1]
pca1 <- prcomp(pca2, scale. = TRUE)

cir <- function(center = c(0, 0), npoints = 100) {
    r = 1
    t = seq(0, 2 * pi, length = npoints)
    x = center[1] + r * cos(t)
    y = center[1] + r * sin(t)
    return(data.frame(x = x, y = y))
}
corcir = cir(c(0, 0), npoints = 100)

correlations = as.data.frame(cor(pca2, pca1$x))

arrows = data.frame(x1 = c(0, 0, 0, 0), y1 = c(0, 0, 0, 0), x2 = correlations$PC1, 
    y2 = correlations$PC2)


ggplot() + geom_path(data = corcir, aes(x = x, y = y), colour = "black") + 
    geom_segment(data = arrows, aes(x = x1, y = y1, xend = x2, yend = y2), colour = "blue") + 
    geom_text(data = correlations, aes(x = PC1, y = PC2, label = rownames(correlations))) + 
    geom_hline(yintercept = 0, colour = "brown") + geom_vline(xintercept = 0, 
    colour = "brown") + xlim(-1.1, 1.1) + ylim(-1.1, 1.1) + labs(x = "pc1 aixs", 
    y = "pc2 axis") + ggtitle("Circle of correlations")

PCA Visual

#train set
train_pca <- data.frame(TotalCrimes = train_set$totalcrimes, prin_comp$x)
train_pca <- train_pca[,1:8]
pca_model <- rpart(TotalCrimes ~ .,data = train_pca, method = "anova")

#test set
test_pca <- predict(prin_comp, newdata = test_set)
test_pca <- as.data.frame(test_pca)
test_pca <- test_pca[,1:8]
#prediction
prediction <- predict(pca_model, test_pca)

INTERPRETATIONS:

DECISION: Since the calculated P-values are less than the tabulated values (\(??\) = 0.05), we therefore reject the alternative hypothesis, H_o.

CONCLUSION: We therefore conclude that there is a positive relationship between crimes and factors like Economic and Psychological as confirmed by variables like: Unemployment, Mental Health, Poverty, County (or location) and Year. It also means; as the factors increases, so also is crime rate and vice-versa. We can therefore confirm that not only increase and decrease in population affect the crime rate, but other factors like economic and psychological also contribute significantly to crime rate in a geographical area. While security and social factors may not contribute significantly to crime rate, but a further analysis may prove otherwise.

Bibliography

The Origins of attachment theory- John Bowlby and Mary Ainsworth

http://www.sakkyndig.com/psykologi/artikler/Bowlbyatt.pdf

ftp://statgen.ncsu.edu/pub/thorne/molevoclass/AtchleyOct19.pdf

https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Current-YTD/5uac-w243

http://data.beta.nyc/dataset/pediacities-nyc-neighborhoods/resource/35dd04fb-81b3-479b-a074-a27a37888ce7

http://www.kevjohnson.org/making-maps-in-r/

http://leaflet-extras.github.io/leaflet-providers/preview/

http://nyscommunityaction.org/wp-content/uploads/2018/04/Comparitive-Look-@-County-Profiles.pdf

https://stackoverflow.com/questions/19449307/how-to-solve-error-in-as-posixct-numericx2l-origin-must-be-supp

http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html

https://data.ny.gov/en/browse?q=nypd

http://www1.nyc.gov/site/nypd/stats/crime-statistics/crime-statistics-landing.page

https://data.ny.gov/en/browse?q=New%20York%20State%20Population%20Data%3A%20Beginning%202003&anonymous=true&sortBy=relevance

https://www.health.ny.gov/statistics/vital_statistics/2006/table50.htm

http://www.postgresqltutorial.com/postgresql-alter-table/

https://www.statmethods.net/graphs/scatterplot.html

http://nyscommunityaction.org/wp-content/uploads/2018/04/Comparitive-Look-@-County-Profiles.pdf