Introduction

The word crime is derived from Latin as “cerno” which means sinfulness in English can be defined as “violation of a law in which there is injury to the public or a member of the public and a term in jail or prison, and/or a fine as possible penalty” -https://legal-dictionary.thefreedictionary.com/crime. A quote by Frank Wolf say “We can’t just rail against crime. We must speak of the root problems - devastating family breakup, an insidious culture of violence that cheapens human life, skyrocketing prisoner recidivism rates that robs our communities of husbands and fathers (wives and mothers) - and recognize that there is a societal role in rehabilitation and restoration”. To this end, I decided to take a clue from him and decided to do my research on factors affecting crime rates - social, economic, security and psychological against the popular notion that population is the only determining factor in high rate of crime in a geographical area and at a point in time. The research will focus mainly on New York state for the period of 5-10 years.

Abstract

In other to gain more insight about crime rate, a stratum of the dataset was analyzed. The NYPD historical complaints data from 2010 to 2015 was collected from New York State open data website.

It was discovered that most crimes are committed on Fridays, while July follow closely by August and May are the months with the highest crime rate. In NYC borough, Bronx came first, while Staten Island came last in terms of crime rate. A popular believe is that population alone has the most impact on crime rate, but a further analysis also revealed other culprits; the likes of economic and psychological factors also contribute significantly to crime rate.

It was very difficult combining the needed dataset from all the above dataset into one. Therefore, to make it a little bit easier, we will implore the helping hand of SQL (Structured Query Language). A relational database like ORACLE SQL, MySQL or PostgreSQL. I chose PostgreSQL to access and extract the require data.

If your connection is right, you should see that the above table are now in your database. But some of the data type may be wrong, use the code below to alter the it to desirable data type. Upon further scrutiny, we discovered that the poverty table still have some unwanted character in them, we therefore removed them using gsub from grep, a R based.

Background

Crime and violence impact us psychologically (psychological factor), whether we are directly involved, a family member or friend is involved, our residence or community was impacted or exposure to it on media coverage. This is sometimes the case when a family member or friend is killed or hurt after directly or indirectly exposure to crime, it is natural for us to experience strong feelings and the effects may push some people state of mind beyond it optimal capability and may in turn leads to revenge or survival kind of crime(s).

Also, Unemployment (economical factor) has a great deal of common sense appeal as one of the root cause of criminal behavior as unemployment is always viewed as a causal agent of economic deprivation may ultimately increases the probability of crime.

According study conducted (Social factor) by likes of Wilson (1987), Ferguson (1952), Robins (1995) and Gluecks (1950) established that parents with criminal records are more likely to have children criminal records also. Bowlby (1946) believed that a traumatic experience of parent-child separation and disruption of attachment relations may cause delinquency after parental separation, that where divorce came into being. The different between Bowlby work and mine is that; my research work not only focus on children alone, but all ages, as crime affect everyone not just the children and their immediate family, but society in general.

The role of law enforcement officers (Security) is to protect citizenry against crimes. The research work did focus on the ratio of law enforcement to the crime rates. Does hiring more police /law enforcement officers curb the menace of crimes?

Methodology

The research project answered and provided a recommendation (where necessary) for many of the overlooked factors that may influenced crime rates in a community. A R programing was used to obtain and check for any positive or negative effect/interaction(s) between crime rates and all the factors, while a graphical representation was included in the analysis. Prediction was done on counties to check for any increase or decrease in crime rate in the future as well as a recommendation on best method to curb the menace (crime rates).

The dataset for this research work was derived from various governmental organizations throughout New York State (counties). The datasets were secondary data as primary data are not feasible due to some constraint like cost, government restrictions etc.

Most of the data tables did not conform to the standard R format. I therefore adopted the use of PostgreSQL. After loading the dataset into R, I reloaded them into PostgreSQL as a table and called it directly from SQL into R environment using PostgreSQL package that was connected to R through ODBC. I was parsing SQL queries in R environment and made some cleaning. I also used SQLDF package in R to write SQL queries in R.

Dataset

The required datasets are as follows:

Population: As widely known, population plays an important role is geographical location as population somethings (if not always) used to determine how wealthy, organized etc. that society are. It (Population) can also spells dome to the geographical location as it is something believe that the higher the population, the higher the crime rate and vice versa. I will not dispute this, but I will like to also shift our attention to some other factor we always overlooked like!
Crimes Rate (Felony & Misdemeanor): As the main culprit in our analysis, I collected dataset for crime rate for both New York City and New York State as I was trying to dig deeper. New York City is seen as the largest city in New York State, therefore a stratum from NYC would give a better judgment of the whole state I think. Analysis was done on the NYC historical complaint dataset also the NYS.
Law Enforcement: Total employed law enforcements at the time of the incident was also considered. As one article in New York Times say “The notion of pruning the police inevitably raises the specter of more crime, even if there is little evidence to support such fears. The relationship between the number of officers and lawful behavior is not clear-cut. In New York City, for example, the police force peaked at more than 40,000 in 2000. Since then, both the number of officers and the crime rate have declined.” We looked into this and included the law enforcement officers. NB: Law enforcement officers used here comprises of college security officer, hospital police etc.
Unemployment: No society can escape this. Robert Frost once said, “The reason why worry kills more people than work is that more people worry that work”. Also, a school of thought said, “Idle hands are devils workshop”. You may want to turn to crime if you don’t work. Therefore, unemployment may be seen as a culprit.
Child Abuse and Neglect: I wanted to include this into the analysis, but the required datasets was not readily available at the time of this analysis. It would have been of great to see the effect(s) of child abuse and neglect in crime rate. Because, I believed, like the divorce, a child that is abused physically or mentally may want to seek help or inclusion from the outside world and that may lead them into crimes.
Divorce: Divorce or single parental care may also contribute significantly to crime rate, (although this is debatable) as a single parent may not be able to single-handedly successfully rate a child. Even with a home with both parents present have their own challenges, let alone single parent as such child may not be adequately taken care of and may want to seek for paternal or maternal (whichever missing) outside the home and may in turn spell dome as the child may fall into wrong hand seeking prey.

Data Cleaning

There are some missing observations in the datasets before and after combinations. I therefore employ the use of a package in R called MICE, multivariate Imputation by Chained Equations (mice package) is said to be the best approach to tackle end-on the menace missing observation may cause. It is well known fact that if proper care is not taken in dealing with missing observation, the analysis would likely be biased. To avoid this, the mice package in R was used. A fully conditional specification where each incomplete variable is imputed by a separate model at which the algorithm will impute mixes of continuous, binary, ordered or unordered data based on the data type.

The Divorce table did not come with years, I was able to add years of occurrence into their respective rows, also reformatted it and removed all the unwanted characters.
Unemployment table came with months name in number, I changed it to alphabets and changing it from long format to wide and removed all the unwanted characters.
NYPD Historical Complaint dataset was not in the correct format at all, I changed the date to format, re-formatted the longitude and latitude, made years, months and days-of-the-weeks out of the datasets.

Limitation

The best datasets required for this research would have been the one collected from the department of correction, (as the saying goes “from the horse’s mouth”) but due to nature of the datasets, government put some restrictions in it use and therefore blocking public access to it. We therefore opted for the dataset from different organizations (some were scrapped from websites) as we believe it will give same result, which required extra work to bring them together as one dataset.

As mentioned above, this research project did used (heavily) relational database like PostgreSQL, MySQL, OracleSQL and SQLIte (Structured Query Language) for most of the query. Most (if not all) of the dataset used in the analysis are not readily available. I therefore sorted dataset from various government agencies. Some of the dataset was scrapped fro

Packages

As school of thought dictate, we should first install the packages to be used in this research by using the following syntax “install.packages(”The Package name“)” one after the other.

Gist : Uncomment the codes below for a faster installation of all the packages used

# install.packages(c("flexdashboard", "DT", "pandocfilters", "googleVis", "R.rsp", "rmarkdown", "knitr", "plyr", "plotrix", "tidyr","dplyr", "ggplot2","plotly", "MASS","reshape2", "Amelia","mice","stringi","ROCR", "XML", "httr", "AUC", "magrittr", "rvest", "rgdal", "purrr", "DBI",
# "leaflet", "tigris", "RPostgreSQL" , "data.table", "tibble", "cowplot", "Hmisc", "sp", "maptools", "PerformanceAnalytics" , "ggcorrplot", "ggalt",
# "ggmap", "ggthemes", "readr", "webshot", "caTools", "caret", "rpart", "agricolae", "sqldf"))

suppressMessages(library("rmarkdown"))
#setwd("C:/Users/HP/Desktop")
library(flexdashboard)
library(DT)
library(prettydoc)
suppressMessages(library(R.rsp))
suppressMessages(library(pandocfilters))
suppressPackageStartupMessages(library(googleVis))
op <- options(gvis.plot.tag='chart')

Cont’d…

suppressMessages(library(knitr))
suppressMessages(library(plyr))
suppressMessages(library(plotrix))
suppressMessages(library(tidyr))
suppressMessages(library(dplyr))
suppressMessages(library(ggplot2))
suppressMessages(library(plotly))
suppressMessages(library(MASS))
suppressMessages(library(reshape2))
suppressMessages(library(Amelia))
suppressMessages(library(mice))
suppressMessages(library(stringi))
suppressMessages(library(ROCR))
suppressMessages(library(XML))
suppressMessages(library(httr))
suppressMessages(library(AUC))
suppressMessages(library(magrittr))
suppressMessages(library(rvest))
suppressMessages(library(rgdal))
suppressMessages(library(purrr))
suppressMessages(library(DBI))
suppressMessages(library(leaflet))
suppressMessages(library(tigris))
suppressMessages(library(RPostgreSQL))
suppressMessages(library(data.table))
suppressMessages(library(tibble))
suppressMessages(library(cowplot))
suppressMessages(library(Hmisc))
suppressMessages(library(sp))
suppressMessages(library(maptools))
suppressMessages(library(PerformanceAnalytics))
suppressMessages(library(ggcorrplot))
suppressMessages(library(ggalt))
suppressMessages(library(ggmap))
suppressMessages(library(ggthemes))
suppressMessages(library(readr))
suppressMessages(library(webshot))
suppressMessages(library(caTools))
suppressMessages(library(randomForest))
suppressMessages(library(caret))
suppressMessages(library(rpart))
suppressMessages(library(agricolae))
library(htmltools)
library(RColorBrewer);

PostgreSQL Connection.

It will be very difficult to work or combine all the needed dataset from all the above dataset. Therefore to make it a little bit easier, we will implore the help hand of SQL (Structured Query Language). A relational database like ORACLE SQL, MySQL or PostgreSQL would be nice in doing it better. I outed to use PostreSQL to access and extract the require data.

Connect your PostreSQL and R using the code below.

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "capstone",
                 host = "localhost", port = 5432,
                 user = "postgres", password = "oracle")

Web Scrapping

The divorce came as a table from New York state health department website as follows.Here is the glimpse after cleaning: As mentioned above, this research project did used (heavily) relational database like PostgreSQL, MySQL, OracleSQL and SQLIte (Structured Query Language) for most of the query. Most (if not all) of the dataset used in the analysis are not readily available. I therefore sorted dataset from various government agencies. Some of the datase was scrapped for websites e.g the divorce dataset for year 2015 was scrapped from: https://www.health.ny.gov/statistics/vital_statistics/2015/table50.htm while others were downloaded and the required rows and columns were extracted.

div_2015 <- ("https://www.health.ny.gov/statistics/vital_statistics/2015/table50.htm") %>% 
  map(possibly(~html_session(.) %>% 
                 read_html %>% 
                 html_table(fill = TRUE) %>% 
                 .[[1]], 
               NULL))


div_2014 <- ("https://www.health.ny.gov/statistics/vital_statistics/2014/table50.htm") %>% 
  map(possibly(~html_session(.) %>% 
                 read_html %>% 
                 html_table(fill = TRUE) %>% 
                 .[[1]], 
               NULL))


div_2013 <- ("https://www.health.ny.gov/statistics/vital_statistics/2013/table50.htm") %>% 
  map(possibly(~html_session(.) %>% 
                 read_html %>% 
                 html_table(fill = TRUE) %>% 
                 .[[1]], 
               NULL))



div_2012 <- ("https://www.health.ny.gov/statistics/vital_statistics/2012/table50.htm") %>% 
  map(possibly(~html_session(.) %>% 
                 read_html %>% 
                 html_table(fill = TRUE) %>% 
                 .[[1]], 
               NULL))



div_2011 <- ("https://www.health.ny.gov/statistics/vital_statistics/2011/table50.htm") %>% 
  map(possibly(~html_session(.) %>% 
                 read_html %>% 
                 html_table(fill = TRUE) %>% 
                 .[[1]], 
               NULL))

div_2010 <- ("https://www.health.ny.gov/statistics/vital_statistics/2010/table50.htm") %>% 
  map(possibly(~html_session(.) %>% 
                 read_html %>% 
                 html_table(fill = TRUE) %>% 
                 .[[1]], 
               NULL))

# After scrapping the table, we realized that they are in list format. We will therefore convert it to accepatable format; first as a dataframe, subsetting unwanted rows and adding year column to the dataset as it was not provided in the scrapped table.

div_2010 <- as.data.frame(div_2010)
div_2011 <- as.data.frame(div_2011)
div_2012 <- as.data.frame(div_2012)
div_2013 <- as.data.frame(div_2013)
div_2014 <- as.data.frame(div_2014)
div_2015 <- as.data.frame(div_2015)


div_2010 <- div_2010[-c(1, 2,3,4,5, 11, 12), ]
div_2011 <- div_2011[-c(1, 2,3,4,5, 11, 12), ]
div_2012 <- div_2012[-c(1, 2,3,4,5, 11, 12), ]
div_2013 <- div_2013[-c(1, 2,3,4,5, 11, 12), ]
div_2014 <- div_2014[-c(1, 2,3,4,5, 11, 12), ]
div_2015 <- div_2015[-c(1, 2,3,4,5, 11, 12), ]

div_2010[["year"]] <- rep(2010, nrow(div_2010))
div_2011[["year"]] <- rep(2011, nrow(div_2011))
div_2012[["year"]] <- rep(2012, nrow(div_2012))
div_2013[["year"]] <- rep(2013, nrow(div_2013))
div_2014[["year"]] <- rep(2014, nrow(div_2014))
div_2015[["year"]] <- rep(2015, nrow(div_2015))
#The column names were not in conformitting with the standard R Programming column naming, we will therefore rename the columns.

Merge the divorce dataset from 2010 to 2015.

divorce <- Reduce(function(...) merge(..., all=T), list(div_2010, div_2011,
                            div_2012, div_2013, div_2014, div_2015))
colnames(divorce) <- c("county", "total", "cruelty", "abandonment", "imprisonment",
                       "adultery", "afterlegalseparation", "afterseparationbyagreement", "notstated", "year")
kable(head(divorce[, 1:8]))

county	total	cruelty	abandonment	imprisonment	adultery	afterlegalseparation	afterseparationbyagreement
Albany	719	2	7	0	2	1	16
Albany	722	1	20	0	0	0	31
Albany	766	109	460	6	7	5	152
Albany	783	12	67	1	0	2	60
Albany	809	10	23	1	1	0	35
Albany	862	45	194	0	1	2	119

Above is the scrapped and cleaned New York state divorce dataset from the year 2010 to 2015. Also, other dataset like the number Arrest, Mental Health, Police, Unemployment, Poverty, Crimes, Longtitude and Latitude was included in the analysis.

Arrest<- "https://docs.google.com/spreadsheets/d/e/2PACX-1vTQbIoMp7kHRauK90knVVWxL0ZQgSyJo5w53ufpeoJhv3NEKZ7e52Z9fU5gI2yj4Q50kCt2S9-WIe8t/pub?gid=0&single=true&output=csv"

LongLat <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vRdUPece17flbIDQQ-prWnhh9ZOntvmHz9Saao-qC6EvV5VODzQ6TUfacf4oeB_w6EnDq0xmEhjw7X5/pub?gid=0&single=true&output=csv"

Mental <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vT87ZxCxB9WAXFNmOWmxAV0aBlpyiE22Jqu_sgTtXpgpgHCzmskdboIiDtcJVgramrtVoTyqT2lzFeT/pub?gid=0&single=true&output=csv"


CrimesNY <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vQ8KAFelT2l0AXFAabbbZSaeoqmQJRt-11kt7Jm_LKaHGbHa_ndUQidLj8Op_d19thmsyxoe9p5Dl3o/pub?gid=0&single=true&output=csv"

 LawEnf <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vQauvaPEbqzKX1Vr9ojJuwqjYZFYTyMgclAOLxeUg9ithYryjbQz0foYmAe52jk3A7xBUl33vpDfksj/pub?gid=0&single=true&output=csv"

 
UnemploymentData <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vSUjXqUWfUlOIQ_hYncSP8-BlOEVnzNQ5YDg49HD63IZDoZuw-2h8DiXt6hBuHTKeGAa9F2l4c23fwN/pub?gid=0&single=true&output=csv"


PovData <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vSpp6mizldwEOSIXYPIuLH4Y8Ho3WvfkdAY8mTCcNnOwFExK7xo4q7bv6qUW-v2Z49rZIHJxo-3iG1X/pub?gid=0&single=true&output=csv"

Read datasets into R

# Note, the dataset contains column(s) of data we needed. We will load all of them and rename them to meet up with the standard.

nypd <- read.csv("C:/Users/HP/Documents/GitHub/Capstone_Project_2/NYPD_Complaint_Data_Historic.csv",
                 sep = ",", header=TRUE, stringsAsFactors = FALSE)


arrest <- read.csv(url(Arrest),header = TRUE, sep = ",", col.names= c("county", "year", "total",
                                          "felonytotal", "drugfelony","violentfelony", "dwifelony",
                                          "otherfelony", "misdemeanortotal", "drugmisd",
                                         "dwimisd", "propertymisd", "othermisd"))



mental_health <- read.csv(url(Mental), header = TRUE, sep = ",", col.names= c("rowcreateddatetime",
                          "year", "omhregioncode", "omhregionlabel", "county", "agegroup", "ratecodegroup", "countycode",
                          "countOfrecipientbycounty", "total", "paidclaimtotal"))



police <- read.csv(url(LawEnf), header = TRUE, sep = ",", stringsAsFactors = FALSE,
                   col.names=  c("county", "year", "policetotal"))



unemployment <- read.csv(UnemploymentData, header = TRUE, sep = ",", stringsAsFactors = TRUE,
                     col.names= c('year',   'month',    'county', 'total', 'benefitamountsdollars'))


lat_long <- read.csv(url(LongLat), header = TRUE, sep = ",")


crimes_ny <- read.csv(url(CrimesNY),header = TRUE, sep = ",")
crimes_ny <- crimes_ny[, 1:7]
colnames(crimes_ny)= c('county','year','population','indexcount', 'violentcount','propertycount','firearmcount')



poverty <- read_csv(url(PovData), locale = locale(encoding = "UTF-8"))
pov_sub <- poverty[, c(1, 2, 3, 4, 6, 15, 24)]
colnames(pov_sub) <- c("year","state", "countyid", "county", "allageproverty", "ueighteenpoverty", "fivetoseventenpov")

NYPD HISTORICAL COMPLAINTS

The New York City Department of Police, NYPD historical dataset that was downloaded from the city open data source was first analyzed as the stratum of the entire state as the NYC dataset contains all the require variables for further scrutiny. Below is the glimpse of the dataset

Download The NYPD Historical Complaint Here!
You’re required to download the NYPD Historical Complaint data and change path below to where you downloaded it. I can not include it here, its about 1.4GB in size!
NB: The NYPD dataset is very large, i will therefore divide the table into chuck so are to fit the screen as (option=width) does not seem to work!

kable(head(nypd[, 1:8]))

CMPLNT_NUM	CMPLNT_FR_DT	CMPLNT_FR_TM	CMPLNT_TO_DT	CMPLNT_TO_TM	RPT_DT	KY_CD	OFNS_DESC
101109527	12/31/2015	23:45:00			12/31/2015	113	FORGERY
153401121	12/31/2015	23:36:00			12/31/2015	101	MURDER & NON-NEGL. MANSLAUGHTER
569369778	12/31/2015	23:30:00			12/31/2015	117	DANGEROUS DRUGS
968417082	12/31/2015	23:30:00			12/31/2015	344	ASSAULT 3 & RELATED OFFENSES
641637920	12/31/2015	23:25:00	12/31/2015	23:30:00	12/31/2015	344	ASSAULT 3 & RELATED OFFENSES
365661343	12/31/2015	23:18:00	12/31/2015	23:25:00	12/31/2015	106	FELONY ASSAULT

print(paste0(".....Last columns"))

## [1] ".....Last columns"

kable(head(nypd[, 17:24]))

PREM_TYP_DESC	X_COORD_CD	Y_COORD_CD	Latitude	Longitude	Lat_Lon
BAR/NIGHT CLUB	1007314	241257	40.82885	-73.91666	(40.828848333, -73.916661142)
	1043991	193406	40.69734	-73.78456	(40.697338138, -73.784556739)
OTHER	999463	231690	40.80261	-73.94505	(40.802606608, -73.945051911)
RESIDENCE-HOUSE	1060183	177862	40.65455	-73.72634	(40.654549444, -73.726338791)
OTHER	987606	208148	40.73800	-73.98789	(40.7380024, -73.98789129)
DRUG STORE	996149	181562	40.66502	-73.95711	(40.665022689, -73.957110763)

The arrest table…

kable(head(arrest[, 1:8]))

county	year	total	felonytotal	drugfelony	violentfelony	dwifelony	otherfelony
Albany	1970	1226	688	97	191	5	395
Albany	1971	1833	829	131	231	6	461
Albany	1972	3035	1054	211	256	8	579
Albany	1973	3573	1134	244	274	28	588
Albany	1974	4255	1329	281	308	17	723
Albany	1975	4173	1259	209	344	12	694

The mental health table

DT::datatable(mental_health, options = list(
  pageLength = 10
))

NYPD Historical Complaint Data

kable(head(police));

county	year	policetotal
Albany	2010	587
Albany	2011	1280
Albany	2012	1233
Albany	2013	1228
Albany	2014	1260
Albany	2015	1212

Change the columns to months after dcast. Here is Unemployment table after cleaning..

# The unemployment dataset will be reshaped and have their columns renamed appropriately.

unemployment2 <- unemployment[, -5]
unemployment_dcast <- dcast(unemployment2, year + county ~ month, value.var="total")

names(unemployment_dcast)[names(unemployment_dcast)== 1] <- "jan"
names(unemployment_dcast)[names(unemployment_dcast)== 2] <- "feb"
names(unemployment_dcast)[names(unemployment_dcast)== 3] <- "mar"
names(unemployment_dcast)[names(unemployment_dcast)== 4] <- "apr"
names(unemployment_dcast)[names(unemployment_dcast)== 5] <- "may"
names(unemployment_dcast)[names(unemployment_dcast)== 6] <- "jun"
names(unemployment_dcast)[names(unemployment_dcast)== 7] <- "jul"
names(unemployment_dcast)[names(unemployment_dcast)== 8] <- "aug"
names(unemployment_dcast)[names(unemployment_dcast)== 9] <- "sep"
names(unemployment_dcast)[names(unemployment_dcast)== 10] <- "oct"
names(unemployment_dcast)[names(unemployment_dcast)== 11] <- "nov"
names(unemployment_dcast)[names(unemployment_dcast)== 12] <- "dec"

unemployment_dcast <- tbl_df(unemployment_dcast)



DT::datatable(unemployment_dcast, options = list(
  pageLength = 10
))

New York State crimes count.

DT::datatable(crimes_ny, options = list(
  pageLength = 10
))

Crimes County By County Table

Load Data Into PostreSQL

Let start by checking if any of the dataset do exist in the database, and if it exist, we will remove them and write (or save) our datasets in the PostgreSQL Automatically.

dbRemoveTable(con, "arrest")

## [1] TRUE

dbWriteTable(con, "arrest", arrest, row.names=FALSE)

## [1] TRUE

dbRemoveTable(con, "mental_health")

## [1] TRUE

dbWriteTable(con, "mental_health", mental_health, row.names=FALSE)

## [1] TRUE

dbRemoveTable(con, "police")

## [1] TRUE

dbWriteTable(con, "police", police, row.names=FALSE)

## [1] TRUE

dbRemoveTable(con, "unemployment_dcast")

## [1] TRUE

dbWriteTable(con, "unemployment_dcast", unemployment_dcast, row.names=FALSE)

## [1] TRUE

dbRemoveTable(con, "divorce")

## [1] TRUE

dbWriteTable(con, "divorce", divorce, row.names=FALSE)

## [1] TRUE

dbRemoveTable(con, "crimes_ny")

## [1] TRUE

dbWriteTable(con, "crimes_ny", crimes_ny, row.names=FALSE)

## [1] TRUE

dbRemoveTable(con, "pov_sub")

## [1] TRUE

dbWriteTable(con, "pov_sub", pov_sub, row.names=FALSE)

## [1] TRUE

If your connection is right, you should see that the above table are now in your database. But some of the data type may be wrong, use the code below to alter the it to desirable data type.

dbGetQuery(con, statement = paste("ALTER TABLE crimes_ny
    ALTER COLUMN population TYPE INT USING population::integer"));

# dbGetQuery(con, statement = paste("ALTER TABLE inmates_dcast
#           ALTER COLUMN values TYPE VARCHAR USING values::VARCHAR"))
#

Data Cleaning (SQL)

After writing (or saving) them into the database, we can now try to subset the columns and the row we needed from the tables.

police_sub <- dbGetQuery(con, statement = paste(
                       "SELECT year, county, policetotal as police",
                       "FROM police",
                       "WHERE year IN (2010, 2011,2012,2013,2014,2015)",
                      "ORDER BY year"));


arrest_sub <- dbGetQuery(con, statement=paste(
                           "SELECT *",
                           "FROM arrest",
                           "WHERE year IN (2010, 2011,2012,2013,2014,2015)",
                           "ORDER BY county ASC"));



divorce$total <- (gsub(",", "", divorce$total));
divorce_sub <-  dbGetQuery(con, statement=paste(
                           "SELECT total as divorce, county, year",
                           "FROM divorce",
                           "ORDER BY county ASC"));


emp <-  dbGetQuery(con, statement=paste(
                           "SELECT *",
                           "FROM unemployment_dcast",
                           "WHERE year IN (2010, 2011,2012,2013,2014,2015)",
                           "ORDER BY county ASC"));


crimes_ny_sub <-  dbGetQuery(con, statement=paste(
                           "SELECT county, year, population ",
                           "FROM crimes_ny",
                           "WHERE year IN (2010, 2011,2012,2013,2014,2015)",
                           "ORDER BY county ASC"));

Saving them in tbl for faster accessibility.

# For easy and better table ouput
police_sub <- tbl_df(police_sub)
arrest_sub <- tbl_df(arrest_sub)
divorce_sub <- tbl_df(divorce_sub)
crimes_ny_sub <- tbl_df(crimes_ny_sub)
#inmates_sub <- tbl_df(inmates_sub)

dbGetQuery(con, statement = paste("ALTER TABLE mental_health ALTER COLUMN total TYPE INT"));

mental_health_sub <- as.data.frame(dbGetQuery(con, statement = paste(
                       "SELECT year, county, agegroup, total as mentalhealth",
                       "FROM mental_health",
                       "WHERE year BETWEEN 2010 AND 2015",
                       "ORDER BY year, agegroup")));

Finally, after all the hard work done by PostreSQl, we now have our sought after dataset, hoooray!

Lets disconnect PostreSQL and start using R and SQLDF.

#dbListTables(con);

dbDisconnect(con)

## [1] TRUE

detach("package:RPostgreSQL", unload=TRUE)
library(sqldf);

DATA CLEANING (R and SQL)

Upon further scrutiny, we discovered that the poverty table still have some unwanted character in them, we therefore removed them using gsub from grep, a R based.

Mental Health and Divorce.

mental_sub <- dcast(mental_health_sub, year +county ~ agegroup, value.var="mentalhealth", fun.aggregate = sum)

mental_sub <- mental_sub[, c(1:4)]

divorce_sub$divorce <- (gsub(",", "", divorce_sub$divorce));
divorce_sub1 <- sapply(divorce_sub$divorce, as.numeric)
divorce_sub2 <- divorce_sub[, c(2:3)]
div_extraction <- cbind.data.frame(divorce_sub2, divorce_sub1)
colnames(div_extraction) <- c("county", "year", "divorce")

Poverty

povselect <- sqldf::sqldf("SELECT * FROM pov_sub 
                          WHERE year BETWEEN 2010 AND 2015 AND state = 36 AND county !='New York'
                          ORDER BY year ASC")

povselect$county <- (gsub("\\(NY)", "", povselect$county));
povselect$county <- (gsub("\\County", "", povselect$county));
povselect$allageproverty <- (gsub(",", "", povselect$allageproverty));
povselect$ueighteenpoverty <- (gsub(",", "", povselect$ueighteenpoverty));
povselect$fivetoseventenpov <- (gsub(",", "", povselect$fivetoseventenpov))
col_pov <- c('allageproverty', 'ueighteenpoverty', 'fivetoseventenpov')
povselect[col_pov] <- sapply(povselect[col_pov], as.numeric)
povertydata <- povselect[, -c(2:3)]

Employment

emp$jan <- (gsub(",", "", emp$jan))
emp$feb <- (gsub(",", "", emp$feb))
emp$mar <- (gsub(",", "", emp$mar))
emp$apr <- (gsub(",", "", emp$apr))
emp$may <- (gsub(",", "", emp$may))
emp$jun <- (gsub(",", "", emp$jun))
emp$jul <- (gsub(",", "", emp$jul))
emp$aug <- (gsub(",", "", emp$aug))
emp$sep <- (gsub(",", "", emp$sep))
emp$oct <- (gsub(",", "", emp$oct))
emp$nov <- (gsub(",", "", emp$nov))
emp$dec <- (gsub(",", "", emp$dec))



emp1 <- sapply(emp[, c(3:14)], as.data.frame.numeric)
emp2 <- as.data.frame(emp1)


unemp_sub <- cbind.data.frame(emp2,emp[, c(1,2)], stringsAsFactors=FALSE)
colnames(unemp_sub) <- c("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug",
                         "sept", "oct", "nov", "dec", "year", "county")

df_unemp_sub <-sqldf::sqldf("SELECT county, year, (jan + feb + mar + apr + may +
                           jun + jul + aug + sept + oct + nov + dec) as unemployment
                           FROM unemp_sub")

Converting Character Columns To Date.

nypd[, 2] <- as.Date.character(nypd[, 2], format = "%m/%d/%Y")
nypd[, 4] <- as.Date.character(nypd[, 4], format = "%m/%d/%Y")
nypd[, 6] <- as.Date.character(nypd[, 6], format = "%m/%d/%Y")

Data cleaning of NYPD’s latitude and Longtitude from coordinate to a usable R format.

nypd$Lat_Lon <- (gsub(")","", nypd$Lat_Lon))
nypd$Lat_Lon <- (gsub(",",":", nypd$Lat_Lon))
nypd$Lat_Lon <- (gsub("[[alphanum:]]","", nypd$Lat_Lon))

nypd_new <- subset(nypd, CMPLNT_FR_DT >= '2010-1-1' & CMPLNT_FR_DT <= '2015-12-31', 
                   select=c(CMPLNT_NUM , CMPLNT_FR_DT, CMPLNT_FR_TM, RPT_DT,  OFNS_DESC,
                            CRM_ATPT_CPTD_CD, LAW_CAT_CD,BORO_NM,  Latitude, Longitude,
                            HADEVELOPT, ADDR_PCT_CD, PREM_TYP_DESC))

colnames(nypd_new) <- c("ComplainNum", "IncidentDate", "IncidentTime",
                        "ReportedDate", "OffenseDesc", "CrimeFailSuccces", "CrimeLevel",
                        "Borough", "lat", "Long", "NYCHA", "PolicePrecinct", "BuildingType" )

nypd_new<- as.data.frame(nypd_new)

Converting the NYPD data to days of week, months and even years to easy accessibility.

nypd_new$DateTime = paste(nypd_new$IncidentDate, nypd_new$IncidentTime, sep="")

nypd_new$dayofweek <- strftime(nypd_new$IncidentDate, '%A') # Days of the week

nypd_new$months <- strftime(nypd_new$IncidentDate, '%B') # Converting To Months Column

nypd_new$year <- strftime(nypd_new$IncidentDate, '%Y') # Converting To Year Column

kable(head(nypd_new[, 10:17]))

Long	PolicePrecinct	BuildingType	DateTime	dayofweek	months	year
-73.91666	44	BAR/NIGHT CLUB	2015-12-3123:45:00	Thursday	December	2015
-73.78456	103		2015-12-3123:36:00	Thursday	December	2015
-73.94505	28	OTHER	2015-12-3123:30:00	Thursday	December	2015
-73.72634	105	RESIDENCE-HOUSE	2015-12-3123:30:00	Thursday	December	2015
-73.98789	13	OTHER	2015-12-3123:25:00	Thursday	December	2015
-73.95711	71	DRUG STORE	2015-12-3123:18:00	Thursday	December	2015

Incident Report

Writing Incident Report With SQL Query

 incidentreport <- sqldf::sqldf("SELECT
                            
                             'At ' || IncidentTime || ',' || dayofweek ||', '|| months || ' ' || year
                            ||', a complaint was filled with NYPD police precinct ' || PolicePrecinct ||
                           ' about ' || OffenseDesc
                         || ' that was committed . The complainant maintained that the incident occurred at a '||
                          BuildingType || ' in ' || Borough || ' with co-ordinate ' || lat|| ' and ' ||Long||
                         ' . The Police described the incident as ' || CrimeFailSuccces || ' and classified it as ' ||
                          CrimeLevel ||' with complain number: ' || ComplainNum
                           as Incident_Report from nypd_new
                                ")
 
 kable(head(incidentreport))

Incident_Report
At 23:45:00,Thursday, December 2015, a complaint was filled with NYPD police precinct 44 about FORGERY that was committed . The complainant maintained that the incident occurred at a BAR/NIGHT CLUB in BRONX with co-ordinate 40.828848333 and -73.916661142 . The Police described the incident as COMPLETED and classified it as FELONY with complain number: 101109527
At 23:36:00,Thursday, December 2015, a complaint was filled with NYPD police precinct 103 about MURDER & NON-NEGL. MANSLAUGHTER that was committed . The complainant maintained that the incident occurred at a in QUEENS with co-ordinate 40.697338138 and -73.784556739 . The Police described the incident as COMPLETED and classified it as FELONY with complain number: 153401121
At 23:30:00,Thursday, December 2015, a complaint was filled with NYPD police precinct 28 about DANGEROUS DRUGS that was committed . The complainant maintained that the incident occurred at a OTHER in MANHATTAN with co-ordinate 40.802606608 and -73.945051911 . The Police described the incident as COMPLETED and classified it as FELONY with complain number: 569369778
At 23:30:00,Thursday, December 2015, a complaint was filled with NYPD police precinct 105 about ASSAULT 3 & RELATED OFFENSES that was committed . The complainant maintained that the incident occurred at a RESIDENCE-HOUSE in QUEENS with co-ordinate 40.654549444 and -73.726338791 . The Police described the incident as COMPLETED and classified it as MISDEMEANOR with complain number: 968417082
At 23:25:00,Thursday, December 2015, a complaint was filled with NYPD police precinct 13 about ASSAULT 3 & RELATED OFFENSES that was committed . The complainant maintained that the incident occurred at a OTHER in MANHATTAN with co-ordinate 40.7380024 and -73.98789129 . The Police described the incident as COMPLETED and classified it as MISDEMEANOR with complain number: 641637920
At 23:18:00,Thursday, December 2015, a complaint was filled with NYPD police precinct 71 about FELONY ASSAULT that was committed . The complainant maintained that the incident occurred at a DRUG STORE in BROOKLYN with co-ordinate 40.665022689 and -73.957110763 . The Police described the incident as ATTEMPTED and classified it as FELONY with complain number: 365661343

Inferential Statistics (NYC)

Top Ten NYPD Precinct & Neigbhorhoods with High Crime Incidents From 2010 to 2015.

precinct_count <- plyr::count(nypd_new, vars = "PolicePrecinct")
precinct <- sqldf::sqldf("SELECT PolicePrecinct, freq as NumberOfCrimesHandled
                         FROM precinct_count ORDER BY freq DESC LIMIT 10")
precinct_low <- sqldf::sqldf("SELECT PolicePrecinct,  freq as NumberOfCrimesHandled
                             FROM precinct_count ORDER BY freq ASC LIMIT 10")
lat= c(40.67124, 40.84156, 40.82885, 40.81687 , 40.75792, 40.86400 , 40.68564, 40.64174, 40.84865, 40.65247)
long= c(-73.88831, -73.86918, -73.92980, -73.91769,-73.99111 , -73.89342, -73.91079, -74.07568, -73.91168, -73.94426)

lat2 <- c(0,40.77472,40.58751,40.67687,40.58230,40.73599,40.75520,40.50930,40.81077,40.72350)
long2 <- c(0,-73.97468,-73.81171,-74.00806,-74.16894,-73.74580,-73.89195,-74.24090,-73.88334,-73.94862)

precinct <- cbind.data.frame(precinct, lat, long)
precinct_low <- cbind.data.frame(precinct, lat2, long2)

kable(head(precinct, 10))

PolicePrecinct	NumberOfCrimesHandled	lat	long
75	101480	40.67124	-73.88831
43	78285	40.84156	-73.86918
44	75838	40.82885	-73.92980
40	69758	40.81687	-73.91769
14	68051	40.75792	-73.99111
52	64737	40.86400	-73.89342
73	64724	40.68564	-73.91079
120	60366	40.64174	-74.07568
46	60027	40.84865	-73.91168
67	56866	40.65247	-73.94426

MAPS

Map showing all New York Counties, least and most crime area in New York City.
click on the pop-up to see the county name.

neigbourhood_json <- GET('http://data.beta.nyc//dataset/0ff93d2d-90ba-457c-9f7e-39e47bf2ac5f/resource/35dd04fb-81b3-479b-a074-a27a37888ce7/download/d085e2f8d0b54d4590b1e7d1f35594c1pediacitiesnycneighborhoods.geojson')

neigbourhood_json2 <- GET('http://data.beta.nyc//dataset/0ff93d2d-90ba-457c-9f7e-39e47bf2ac5f/resource/35dd04fb-81b3-479b-a074-a27a37888ce7/download/d085e2f8d0b54d4590b1e7d1f35594c1pediacitiesnycneighborhoods.geojson')

Precinct juridiction with low crime
Precinct juridiction with high crime

neighborhoods_low <- readOGR(content(neigbourhood_json2,'text'), 'OGRGeoJSON', verbose = F)

precinct_low2 <- precinct_low
coordinates(precinct_low2) <- ~long2 + lat2
proj4string(precinct_low2) <- proj4string(neighborhoods_low)
combining_low <- over(precinct_low2, neighborhoods_low)
precint_low3 <- cbind(precinct_low, combining_low)
precint_low3 <- precint_low3[, -c(3,4, 8)]

map1 <- leaflet(neighborhoods_low) %>%
  addTiles() %>% 
  addPolygons(popup = ~neighborhoods_low, color = "#03F", fill = TRUE,
              fillColor = "brown", opacity = 0.5, fillOpacity = 0.2) %>% 
  addMarkers(~long2, ~lat2, popup = ~neighborhoods_low, data = precint_low3) %>%
  addProviderTiles("Stamen.Toner", options = providerTileOptions()) %>%
  setView(-73.92, 40.75, zoom = 9.5)




## Map 2

# Neigbourhood with high crime rate

 neighborhoods <- readOGR(content(neigbourhood_json,'text'), 'OGRGeoJSON', verbose = F)

precinct2 <- precinct
coordinates(precinct2) <- ~long + lat
proj4string(precinct2) <- proj4string(neighborhoods)
combining <- over(precinct2, neighborhoods)
precint3 <- cbind(precinct, combining)
precint3 <- precint3[, -8]

map2 <- leaflet(neighborhoods) %>%
  addTiles() %>% 
  addPolygons(popup = ~neighborhood, color = "#03F", fill = TRUE,
              fillColor = "brown", opacity = 0.5, fillOpacity = 0.2) %>% 
  addMarkers(~long, ~lat, popup = ~neighborhood, data = precint3) %>%
  addProviderTiles("HERE.normalday", options = providerTileOptions()) %>%
  setView(-73.92, 40.75, zoom = 10.5)


# New York state counties

#df_maps <- df_incident[1:62, ]  

counties_map <- leaflet(lat_long) %>%
  addTiles() %>%
  addMarkers(lng = lat_long$long, lat = lat_long$lat, popup = lat_long$county, data =lat_long) %>%
  addProviderTiles("Esri.WorldImagery", options = providerTileOptions()) %>%
  setView(-73.82, 40.75, zoom = 5.9)


counties_map

map1

map2

# Table (nyc high and low)
kable(precint_low3[, 1:6])

PolicePrecinct	NumberOfCrimesHandled	lat2	long2	neighborhood	borough
75	101480	0.00000	0.00000	NA	NA
43	78285	40.77472	-73.97468	Central Park	Manhattan
44	75838	40.58751	-73.81171	Rockaway Beach	Queens
40	69758	40.67687	-74.00806	Red Hook	Brooklyn
14	68051	40.58230	-74.16894	New Springville	Staten Island
52	64737	40.73599	-73.74580	Hollis Hills	Queens
73	64724	40.75520	-73.89195	East Elmhurst	Queens
120	60366	40.50930	-74.24090	Tottenville	Staten Island
46	60027	40.81077	-73.88334	Hunts Point	Bronx
67	56866	40.72350	-73.94862	Greenpoint	Brooklyn

kable(precint3)

PolicePrecinct	NumberOfCrimesHandled	lat	long	neighborhood	boroughCode	borough
75	101480	40.67124	-73.88831	East New York	4	Brooklyn
43	78285	40.84156	-73.86918	Van Nest	2	Bronx
44	75838	40.82885	-73.92980	Highbridge	2	Bronx
40	69758	40.81687	-73.91769	Melrose	2	Bronx
14	68051	40.75792	-73.99111	Hell’s Kitchen	1	Manhattan
52	64737	40.86400	-73.89342	Fordham	2	Bronx
73	64724	40.68564	-73.91079	Bushwick	4	Brooklyn
120	60366	40.64174	-74.07568	St. George	5	Staten Island
46	60027	40.84865	-73.91168	Morris Heights	2	Bronx
67	56866	40.65247	-73.94426	East Flatbush	3	Brooklyn

From analysis and map above, we can deduce that we have 5 out 10 locations with the highest number of crimes complaint in Bronx, 3 in Brooklyn, 1 each in Manhattan and Staten Island, but none in Queens! Does this have anything to do with poverty level?

What Day Has Most Crime?

daysofweek <- plyr::count(nypd_new, vars = "dayofweek")
offenses <- plyr::count(nypd_new, vars = "OffenseDesc")
daysofweek[, -1] <- sapply(daysofweek[, -1], as.numeric)
daysofweek$percentages <- round(daysofweek$freq/sum(daysofweek$freq)*100)
 
days <- paste(daysofweek$dayofweek, daysofweek$percentages) # add percents to labels
days <- paste(days,"%",sep="") 
 
pie(daysofweek$freq,labels = days, col=rainbow(length(days)),
    main="Pie Chart of Crime Complaints In Days Of The Week")

print(paste0("Most crimes are carried out on FRIDAYS with the total being: ", max(daysofweek$freq)))

## [1] "Most crimes are carried out on FRIDAYS with the total being: 462851"

print(paste0("While the least crimes were carried out on SUNDAYS with the total being: ", min(daysofweek$freq)))

## [1] "While the least crimes were carried out on SUNDAYS with the total being: 387899"

What Month(s) Has Most Crimes?

theme_set(theme_bw())
incident_months <- plyr::count(nypd_new, vars = "months")
incident_months[, 2] <- sapply(incident_months[, 2],as.numeric)
incident_months2 <- sqldf::sqldf("SELECT months as MonthOfIncident, freq as IncidentFrequency
                                FROM incident_months ORDER BY freq DESC")
kable(head(incident_months2))

MonthOfIncident	IncidentFrequency
July	268942
August	266856
May	263434
October	259617
June	257991
September	256227

ggplot(incident_months2, aes(x=MonthOfIncident, y=IncidentFrequency)) + 
  geom_point(size=5) + 
  geom_segment(aes(x=MonthOfIncident, 
                   xend=MonthOfIncident, 
                   y=0, 
                   yend=IncidentFrequency)) + 
  labs(title="NYPD Crime Complaints Per Month ", 
       subtitle="FROM 2010 To 2015") + 
  theme(axis.text.x = element_text(angle=90, vjust=0.9)) + coord_flip()

The Most crimes was carried out in July follow by August and May in Third place , while February has the least numbers of crimes complaints.

What Time Of The Day Has Most Crime?

incident_time <- plyr::count(nypd_new, vars = "IncidentTime")
incident_time <- sqldf::sqldf("SELECT IncidentTime as TimeOfIncident, freq as IncidentFrequency
                              FROM incident_time ORDER BY freq DESC")
kable(head(incident_time))

TimeOfIncident	IncidentFrequency
12:00:00	77587
15:00:00	66375
18:00:00	63726
20:00:00	60164
16:00:00	60009
17:00:00	59880

Most Crimes are carried out around 12 Noon, 3PM, 6PM and 4PM, while the least crimes are carried out in the early morning toward the morning.

Misdemeanor Vs Felony

Crime_cat <- plyr::count(nypd_new, vars = "CrimeLevel")

sprintf(" Misdemeanor crimes are '%i' times greater than Felony crimes in New York City", round(Crime_cat[2, 2]/Crime_cat[1, 2]))

## [1] " Misdemeanor crimes are '2' times greater than Felony crimes in New York City"

Location (Area/Place) Where Most Crime Occured?

Building_cat <- plyr::count(nypd_new, vars = "BuildingType")
Building_cat <- sqldf::sqldf("SELECT BuildingType as TypeOfBuilding, freq as IncidentFrequency
                             FROM Building_cat ORDER BY freq DESC LIMIT 20")

Building_cat[, -1] <- sapply(Building_cat[, -1], as.numeric)


Building_cat$percentages <- round(Building_cat$IncidentFrequency/sum(Building_cat$IncidentFrequency)*100)
days <- paste(Building_cat$TypeOfBuilding, Building_cat$percentages) # add percents to labels
days <- paste(days,"%",sep="") # ad % to labels
pie(Building_cat$IncidentFrequency,labels = days, col= c("purple", "violetred1", "green3","cornsilk", "cyan", "white"),
   main="Top 20 Building/Area Where Crimes Were Carried Out", border = FALSE, init.angle = 315)

kable(head(Building_cat))

TypeOfBuilding	IncidentFrequency	percentages
STREET	962514	35
RESIDENCE - APT. HOUSE	634715	23
RESIDENCE-HOUSE	272063	10
RESIDENCE - PUBLIC HOUSING	230529	8
OTHER	81313	3
COMMERCIAL BUILDING	73895	3

Crimes were most carried out on the street, followed by an apartmental houses, where residential houses came third!

At this time, we will be combining the required datasets into one and write it into PostgreSQL for further scrutiny.

df <- sqldf::sqldf("SELECT c.county, c.population, c.year,  a.felonytotal,a.misdemeanortotal,
a.total as totalcrimes,p.policetotal as lawenforcementtotal ,a.othermisd, d.divorce, m.child as mentalchild,
m.adult as mentaladult, (m.child+m.adult) as mentaltotal, u.unemployment ,a.violentfelony, a.drugfelony,
a.dwifelony, a.otherfelony, a.drugmisd, a.dwimisd, a.propertymisd FROM crimes_ny_sub c
LEFT JOIN arrest_sub a ON a.county=c.county AND a.year=c.year
LEFT JOIN div_extraction d ON d.year=c.year AND a.county=d.county
LEFT JOIN df_unemp_sub u ON u.year=a.year AND u.county=d.county
LEFT JOIN mental_sub m ON m.year=a.year AND m.county=a.county
LEFT JOIN police p ON p.year=a.year AND p.county=d.county
ORDER BY c.year, c.county ASC")

df <- sqldf::sqldf("SELECT * from df WHERE county NOT IN ('Out-of-State', 'St. Lawrence') ")
df <- cbind.data.frame(df, povertydata[, 3:5])

Inferential Statistics (NY)

A summary statistics of the datasets

summary(df)

##     county            population           year       felonytotal   
##  Length:372         Min.   :   4678   Min.   :2010   Min.   :   11  
##  Class :character   1st Qu.:  50431   1st Qu.:2011   1st Qu.:  326  
##  Mode  :character   Median :  89674   Median :2012   Median :  529  
##                     Mean   : 316077   Mean   :2012   Mean   : 2529  
##                     3rd Qu.: 233454   3rd Qu.:2014   3rd Qu.: 1512  
##                     Max.   :2640252   Max.   :2015   Max.   :27551  
##                                                      NA's   :6      
##  misdemeanortotal  totalcrimes    lawenforcementtotal   othermisd      
##  Min.   :   52    Min.   :   66   Min.   :   0.0      Min.   :   13.0  
##  1st Qu.:  890    1st Qu.: 1214   1st Qu.: 131.0      1st Qu.:  329.8  
##  Median : 1382    Median : 1934   Median : 224.0      Median :  510.5  
##  Mean   : 6448    Mean   : 8977   Mean   : 550.8      Mean   : 2167.5  
##  3rd Qu.: 4055    3rd Qu.: 5520   3rd Qu.: 471.0      3rd Qu.: 1512.2  
##  Max.   :68603    Max.   :95647   Max.   :4751.0      Max.   :21854.0  
##  NA's   :6        NA's   :6       NA's   :44          NA's   :6        
##     divorce         mentalchild      mentaladult      mentaltotal     
##  Min.   :    5.0   Min.   :   365   Min.   :  1082   Min.   :   2162  
##  1st Qu.:  158.8   1st Qu.:  2215   1st Qu.:  6798   1st Qu.:   9067  
##  Median :  255.5   Median :  6826   Median : 14301   Median :  21757  
##  Mean   :  931.6   Mean   : 30786   Mean   : 71983   Mean   : 102768  
##  3rd Qu.:  667.5   3rd Qu.: 23274   3rd Qu.: 44093   3rd Qu.:  67605  
##  Max.   :15666.0   Max.   :290846   Max.   :915912   Max.   :1159220  
##  NA's   :8         NA's   :12       NA's   :12       NA's   :12       
##   unemployment    violentfelony      drugfelony       dwifelony    
##  Min.   :  1000   Min.   :   1.0   Min.   :   0.0   Min.   :  1.0  
##  1st Qu.:  9275   1st Qu.:  57.0   1st Qu.:  43.0   1st Qu.: 34.0  
##  Median : 15450   Median : 103.5   Median :  85.5   Median : 56.0  
##  Mean   : 48497   Mean   : 698.0   Mean   : 483.0   Mean   : 96.7  
##  3rd Qu.: 37825   3rd Qu.: 370.0   3rd Qu.: 223.0   3rd Qu.:108.0  
##  Max.   :459700   Max.   :9507.0   Max.   :7679.0   Max.   :615.0  
##  NA's   :8        NA's   :6        NA's   :6        NA's   :6      
##   otherfelony         drugmisd           dwimisd        propertymisd  
##  Min.   :    7.0   Min.   :    3.00   Min.   :  19.0   Min.   :    9  
##  1st Qu.:  167.2   1st Qu.:   65.25   1st Qu.: 186.2   1st Qu.:  234  
##  Median :  279.5   Median :  122.50   Median : 306.5   Median :  480  
##  Mean   : 1251.5   Mean   : 1471.95   Mean   : 620.5   Mean   : 2188  
##  3rd Qu.:  824.2   3rd Qu.:  464.00   3rd Qu.: 713.5   3rd Qu.: 1212  
##  Max.   :13590.0   Max.   :27784.00   Max.   :3943.0   Max.   :30197  
##  NA's   :6         NA's   :6          NA's   :6        NA's   :6      
##  allageproverty   ueighteenpoverty fivetoseventenpov
##  Min.   :   490   Min.   :   123   Min.   :    80   
##  1st Qu.:  7311   1st Qu.:  2140   1st Qu.:  1423   
##  Median : 12260   Median :  3510   Median :  2343   
##  Mean   : 48550   Mean   : 15236   Mean   : 10440   
##  3rd Qu.: 32620   3rd Qu.:  9326   3rd Qu.:  6185   
##  Max.   :615319   Max.   :206071   Max.   :138276   
##

We saw that some the columns in the summary are not in the right format and have some of NAs and are not in the right/accetable datatype for a perfect analysis. We will therefore convert them back to numerica datatype (a more acceptable datatype for analysis)

df[, -1] <- sapply(df[, -1], as.numeric) #Convert some df column to numeric

A glimpse of the dataset with the NAS

missmap(df, legend = TRUE, col = c("wheat","darkred", col=c('yellow', 'darkgreen')),
        main ="Plot Showing The Missing Values Per Observation",
        y.cex = 1, x.cex = 0.8,csvar = NULL, tsvar =NULL, rank.order = TRUE)

The above is a graphical representation of all the missing values.The darkgreen area shows the no missing values while the other color indicates the present of missing values in each columns.

A mulitivariate Imputation by Chained Equations (mice package) is said to be the best approach to tackle end-on the menace missing observation may cause. It is well know that if proper care is not taken in dealing with missing observation, the analysis would likely be biased. To avoid this, the mice package in R will use a fully conditional specification were each incomplete variable is imputed by a separate model at which the algoritm will impute mixes of continous, binary, ordered or unordered data based on the data type.

Filling The Missing Values

df_missing_val <- mice(df, m=1, method='cart', seed = 500, printFlag=FALSE, where = is.na(df))

## Warning: Number of logged events: 87

df_complete <- complete(df_missing_val, action='long', include =FALSE)

If possible, we will show map of the location, and as googlevis requirement, the longtitude and latitude needed for the map must be in “lat:long” format

Cleaning and adding map longitute and latititudes.

df_complete <- merge(df_complete , lat_long, all=T)
df_complete$Lat_Long = paste(df_complete$lat, df_complete$long, sep=":")
df_cleaned <- df_complete[, -c(2,3)]

df_pca <- df_complete[-c(235, 302),  c(1, 4,6, 7, 8, 9, 11, 12,13, 15, 23, 24, 25)]

DT::datatable(df_cleaned, options = list(
  pageLength = 10
))

Checking out for outliers (an observation point that is abnormal distant from other observations) in an analysis one of the best way to avoid variablity in the measurement or experimantal error.

outliers_check <- function(check) {
  check < quantile(check, 0.25, na.rm = TRUE) - 1.5 * IQR(check, na.rm = TRUE) |
    check > quantile(check, 0.75, na.rm = TRUE) + 1.5 * IQR(check, na.rm = TRUE)
}


outlier_obs <- df_cleaned %>%
  group_by(year, county) %>%
  mutate(Flagged = is.na(felonytotal) | outliers_check(felonytotal) | felonytotal <= 0) %>%
  ungroup()

#outliers <- subset(outlier_obs, Flagged == TRUE | year==2015, select=c(year, county, population, lawenforcementtotal, felonytotal, divorce))

A boxplot showing observation with oultiers

theme_set(theme_bw())


ct <- ggplot(df_cleaned, aes(county, totalcrimes))
ct + geom_boxplot() + 
  geom_dotplot(binaxis='y', 
               stackdir='center', 
               dotsize = .15, 
               fill="red") +
  theme(axis.text.x = element_text(angle=75, vjust=0.6)) + 
  labs(title="Box plot + Dot plot of", 
       subtitle="County Vs Total Crimes",
       x="New York State Counties",
       y="Total Crimes (Felony & Misdemeanor) ")

df_corr <- df_cleaned[ , -c(1,3, 24, 25, 26)]


corr_coef <- cor(df_corr, method = "kendall", use = "complete")

ggcorrplot(corr_coef, hc.order = TRUE, 
           type = "lower", 
           lab = TRUE, 
           lab_size = 3, 
           method="circle", 
           colors = c("tomato2", "white", "springgreen3"), 
           title="Correlogram of variables", 
           ggtheme=theme_bw)

The data set would be used in two different way.

Incidents Rate

Incidence: a measure of the occurrence of a new disease, in a defined population at risk for the disease, in a specified time period.

while

1). Incidence proportion (also known as cumulative incidence) is the number of new cases within a specified time period divided by the size of the population initially at risk.

Mathematically as,

$Incidence\quad Rate\quad =\quad \frac { Number\quad of\quad newcases\quad of\quad incidence\quad or\quad disease \quad or\quad injury\quad during\quad specified\quad period }{ Time\quad each\quad person\quad was\quad observed,\quad totaled\quad for\quad all\quad persons }$

df_incident <- sqldf::sqldf("SELECT 

county as County, year as Year, population as Populations, 
(((felonytotal + misdemeanortotal)/population)*100000) as TotalCrimes,
((mentalchild/population)*10000) as ChildMental, ((mentaladult/population)*10000) as AdultMental,
((mentaltotal/population)*10000) as TotalMental,
((lawenforcementtotal/population)*100000) as lawenforcementtotal,
((divorce/population)*100000) as DivorcePerPopPerK, ((drugfelony/population)*100000) as DrugFelonyPerPopPerK, 
((violentfelony/population)*100000) as ViolentTotal, ((dwifelony/population)*100000) as DWIFelony,
((otherfelony/population)*100000) as OtherFelony, ((felonytotal/population)*100000) as TotalFelony, 
((propertymisd/population)*100000) as PropertyMisdemeanor, 
((drugmisd/population)*100000) as DrugMisdemeanor, ((dwimisd/population)*100000) as DwiMisdemeanor,
((othermisd/population)*100000) as OtherMisdemeanor,
((misdemeanortotal)/population)*100000
as TotalMisdemeanor, 
((unemployment/population)*100000) as UnemploymentTotal,
((allageproverty/population)*100000) as AllPersonInPoverty ,
((ueighteenpoverty/population)*100000) as Under18InPoverty,
((fivetoseventenpov/population)*100000) as FiveTo17InPoverty,lat, long, Lat_long
FROM df_cleaned
WHERE county NOT IN ('Out-of-State', 'St. Lawrence') 
ORDER BY Year, county ASC

")
df_incident <- as.data.frame(df_incident)
kable(head(df_incident[, 1:9]))

County	Year	Populations	TotalCrimes	ChildMental	AdultMental	TotalMental	lawenforcementtotal	DivorcePerPopPerK
Albany	2010	295267	3128.016	1638.5847	2626.775	4265.360	198.8031	259.4262
Allegany	2010	48531	2542.705	649.4818	2695.185	3344.666	325.5651	327.6256
Bronx	2010	1387983	6352.599	1780.7855	4031.065	5811.851	193.8064	223.5618
Broome	2010	191892	2814.083	860.3277	1962.093	2822.421	128.7182	355.9294
Cattaraugus	2010	76602	2545.625	1243.8318	2883.345	4127.177	304.1696	276.7552
Cayuga	2010	78400	2022.959	2907.9082	2459.184	5367.092	302.2959	270.4082

Incident Table Cont’d…“)

kable(head(df_incident[, 10:18]))

DrugFelonyPerPopPerK	ViolentTotal	DWIFelony	OtherFelony	TotalFelony	PropertyMisdemeanor	DrugMisdemeanor	DwiMisdemeanor	OtherMisdemeanor
173.4024	207.9474	59.60707	529.6901	970.6469	748.1364	270.94122	389.13932	749.1525
115.3902	160.7220	90.66370	317.3229	684.0988	473.9239	98.90585	471.86335	813.9128
553.2489	494.5306	5.25943	484.0117	1537.0505	1486.1133	1727.83096	87.68119	1513.9234
113.6056	178.2253	68.26757	390.3237	750.4221	781.6897	171.97173	305.38011	804.6193
58.7452	161.8757	109.65771	298.9478	629.2264	625.3100	91.38143	399.46738	800.2402
99.4898	88.0102	47.19388	315.0510	549.7449	521.6837	103.31633	247.44898	600.7653

print(paste0(“Incident Table Last”))

kable(head(df_incident[, 19:26]))

TotalMisdemeanor	UnemploymentTotal	AllPersonInPoverty	Under18InPoverty	FiveTo17InPoverty	lat	long	Lat_Long
2157.369	15579.12	12917.12	3500.899	2258.634	42.6526	-73.7562	42.6526:-73.7562
1858.606	21841.71	15674.52	4897.900	3294.801	42.0901	-78.4942	42.0901:-78.4942
4815.549	19906.58	29148.48	10956.330	7667.457	40.8448	-73.8648	40.8448:-73.8648
2063.661	22199.99	16375.88	4809.476	3218.477	42.1792	-75.8534	42.1792:-75.8534
1916.399	28067.15	15209.79	5276.625	3469.883	42.3292	-78.8681	42.3292:-78.8681
1473.214	23341.84	12196.43	4053.571	2739.796	42.9190	-76.7263	42.919:-76.7263

Charts

Poverty Chart

theme_set(theme_bw())

poverty_inc <- sqldf::sqldf("SELECT 
                        Year, AllPersonInPoverty, County
                        FROM df_incident
                        ORDER BY AllPersonInPoverty DESC LIMIT 30
                        ")

poverty_plot <- ggplot(poverty_inc, aes(x=AllPersonInPoverty, y=reorder(County, +AllPersonInPoverty), fill=as.factor(Year))) +
geom_point(colour="blue", size=5, alpha=.8) +
scale_fill_brewer(palette="Blues", breaks=rev(levels(poverty_inc$Year))) +
  labs(title="Chart of Top 30 Poverty  In New York State Per County
       \n (Incident Rate Per 100K Persons From 2010 To 2015))")




## Divorce Chart

div_inc <- sqldf::sqldf("SELECT 
                        Year, DivorcePerPopPerK, County
                        FROM df_incident
                        ORDER BY DivorcePerPopPerK DESC LIMIT 30
                        ")

divorce_plot <- ggplot(div_inc, aes(x=DivorcePerPopPerK, y=reorder(County, +DivorcePerPopPerK), fill=as.factor(Year))) +
geom_point(colour="pink", size=5, alpha=.8) +
scale_fill_brewer(palette="Blues", breaks=rev(levels(div_inc$Year))) +
  labs(title="Chart of Top 30 Divorce  In New York State Per County
       \n (Incident Rate Per 100K Persons From 2010 To 2015))")



## Unemployment Chart



unemp_inc <- sqldf::sqldf("SELECT 
                        Year, UnemploymentTotal, County
                        FROM df_incident
                        ORDER BY UnemploymentTotal DESC LIMIT 30
                        ")


unemployment_plot <- ggplot(unemp_inc, aes(x=UnemploymentTotal, y=reorder(County, +UnemploymentTotal), fill=as.factor(Year))) +
geom_point(colour="brown", size=5, alpha=.8) +
scale_fill_brewer(palette="Blues", breaks=rev(levels(unemp_inc$Year))) +
  labs(title="Chart of Top 30 Unemployment In New York State Per County
       \n (Incident Rate Per 100K Persons From 2010 To 2015)")



## Mental Health Chart

mentalhealth_inc <- sqldf::sqldf("SELECT 
                        Year, TotalMental, County
                        FROM df_incident
                        ORDER BY TotalMental DESC LIMIT 30
                        ")

mental_plot <- ggplot(mentalhealth_inc, aes(x=TotalMental, y=reorder(County, +TotalMental), fill=as.factor(Year))) +
geom_point(colour="purple", size=5, alpha=.8, shape = 21) +
scale_fill_brewer(palette="Blues", breaks=rev(levels(mentalhealth_inc$Year))) +
  labs(title="Chart of Top 30 Mental Health In New York State Per County
       \n (Incident Rate Per 100K Persons From 2010 To 2015)")


poverty_plot #poverty plot
divorce_plot #divorce plot
unemployment_plot #unemployment plot
mental_plot #mental health plot

The above plot depicts some of the variable used in the analysis. As we can see, most of the plot shows NYC! The main reason NYC historical complaint data was used for further investigation as its a representative of the other counties in NY state.

Total Felony Popuation Desity

ggplot(df_incident, aes(TotalFelony)) + geom_density(aes(color=as.factor(Year), fill=as.factor(Year)),alpha=0.1) +
  labs(x="Total Felony", y="Density", title="Felony Total Per 100k Population Density") +
  theme(legend.position = "top")+ theme_economist()

Total Misdemeanor Popuation Desity

misd_dens_plt <- ggplot(df_incident, aes(TotalMisdemeanor)) + geom_density(aes(color=as.factor(Year),
                        fill=as.factor(Year)),alpha=0.1) + labs(x="Misdemeanor Total", y="Density",
                        title="Misdemeanor Total Per 100k Population Density") + theme(legend.position = "top")+
                       theme_economist()

misd_dens_plt

Felony & Misdemeanor Combo

felony <- dplyr::select(df_incident, TotalFelony, Year, UnemploymentTotal) %>%
    mutate(Year = as.factor(Year)) %>%
ggplot(aes(x = Year, y = TotalFelony)) + 
    geom_jitter(aes(col = Year), alpha = 0.8) +
    labs(title = "x") +
    theme_light()

misd <- dplyr::select(df_incident, TotalMisdemeanor, Year, UnemploymentTotal) %>%
    mutate(Year = as.factor(Year)) %>%
ggplot(aes(x = Year, y = TotalMisdemeanor)) + 
    geom_jitter(aes(col = Year), alpha = 0.8) +
    labs(title = "x") +
    theme_light()

p <- plot_grid(felony, misd) 
title <- ggdraw() + draw_label("A Plot of Felony and Misdemeanor Incident Per Population 100k", fontface='bold')
plot_grid(title, p, ncol=1, rel_heights = c(0.1, 2.2))

3D Scatter Plot

A better view of some important observations in a reactive 3D!

df_incident$size <- df_incident$TotalCrimes
colors <- c('#4AC6B7', '#1972A4', '#965F8A', '#FF7070', '#C61951')

comparison <- plot_ly(df_incident[1:1000, ], x = ~TotalMisdemeanor, y = ~TotalFelony, z = ~lawenforcementtotal,
                      color = ~Year, size = ~size, colors = colors,
             marker = list(symbol = 'circle', sizemode = 'diameter'), sizes = c(1, 50),
             text = ~paste('County:', County, '<br>Population Total:', Populations,
                           '<br>Total Unemployment Pop in 100K:', UnemploymentTotal)) %>%
  layout(title = 'Comparison Among Police, Felony and Misdemeanor Crimes In New York State',
         scene = list(xaxis = list(title = 'Misdemeanor Pop At Risk Per 1K',
                      gridcolor = 'rgb(255, 255, 255)',
                      type = 'log',
                      zerolinewidth = 1,
                      ticklen = 5,
                      gridwidth = 2),
               yaxis = list(title = ' Felony Crime Pop At Risk Per 100K',
                      gridcolor = 'rgb(255, 255, 255)',
                      zerolinewidth = 1,
                      ticklen = 5,
                      gridwith = 2),
               zaxis = list(title = 'Police Pop Per 100K',
                            gridcolor = 'rgb(255, 255, 255)',
                            type = 'log',
                            zerolinewidth = 1,
                            ticklen = 5,
                            gridwith = 2)),
         paper_bgcolor = 'rgb(243, 243, 243)',
         plot_bgcolor = 'rgb(243, 243, 243)')

comparison

Regression Analysis & Prediction

Check for normality

qqnorm(df_cleaned$totalcrimes)
qqline(df$totalcrimes, col = 2)

From the above normality check, we can see that the variable under observation is not normally distributed. We will therefore use Generalized Linear Model (conventional linear regression models) to; select best model, estimate parameters and interpretations. The generalized linear models (GLMs) are a broad class of models that include linear regression, ANOVA, Poisson regression, log-linear models etc.

i.e The dependent variable Yi does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal etc.

For Linear Regression, we have

${ Y }_{ i }\quad =\quad { \beta }_{ 0 }\quad +\quad { \beta }_{ 1 }{ x }_{ i }\quad +\quad { \varepsilon }_{ i }\quad$

where,

${ Y }_{ i }\quad =\quad Dependent\quad Variable.\\ { \beta }_{ 0 }\quad =\quad Intercept.\\ { \beta }_{ 1 }\quad =\quad Parameter\quad To\quad Be\quad estimated\quad (slope).\\ { x }_{ i }\quad =\quad Independent\quad Variable.\\ { \varepsilon }_{ i }\quad =\quad Error\quad Term.\\ i\quad \quad =\quad 1,\quad 2,\quad 3,\quad 4,......,\quad n\quad$

While Multiple linear regression, we have (in matrix notation):

$\hat { \beta } =\quad ({ { X }^{ T }X) }^{ -1 }{ X }^{ T }Y$

SETTING UP HYPOTHESIS: For Crimes Total

Null Hypothesis: There is no relationship between Crimes and any other variables.

$H_o:$ $\mu_1$ = $\mu_2$ = $\mu_3$…..= $\mu_n$
Alternative Hythesis:

$H_a:$ Not $H_o:$ $\mu_1$ $\neq$ $\mu_2$ $\neq$ $\mu_3$…..$\neq$ $\mu_n$

Rejection:

Reject $H_o$ (Null Hypothesis) if the calculated value (P-Value) is less than the tabulated value(Table value = 0.05 ), otherwise do not reject $H_o$

NB: We used both forward and backward model selection to select the best model for our analysis.
Dummy Variables

Counties dummy varibale

counties <- (c("Albany", "Allegany", "Bronx", "Broome", "Cattaraugus", "Cayuga",
         "Chautauqua", "Chemung", "Chenango", "Clinton", "Columbia", "Cortland",
         "Delaware", "Dutchess", "Erie", "Essex", "Franklin","Fulton", "Genesee", 
         "Greene", "Hamilton", "Herkimer", "Jefferson", "Kings", "Lewis","Livingston",
         " Madison", "Monroe", "Montgomery", "Nassau", "New York", "Niagara", "Oneida",
         " Onondaga", " Ontario", "Orange", "Orleans", "Oswego", "Otsego",  "Putnam", "Queens",
         "Rensselaer", "Richmond", "Rockland","Saratoga", "Schenectady", "Schoharie", "Schuyler",
         "Seneca", "St Lawrence",  "Steuben", "Suffolk",  "Sullivan", "Tioga", "Tompkins", "Ulster",
         "Warren", "Washington", "Wayne", "Westchester", "Wyoming", "Yates"))


lab <- c(1, 2, 3, 4, 5,  6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
         26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
         52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64)
dummy <- cbind(counties, lab)
kable(head(dummy))

counties	lab
Albany	1
Allegany	2
Bronx	3
Broome	4
Cattaraugus	5
Cayuga	6

Years dummy variable

df_cleaned <- df_cleaned[-c(235, 302), -c(4, 5, 8, 12, 14:20,24:26)]


df_cleaned$county <- factor(df_cleaned$county,
                    levels=counties,
                    labels= lab)

df_cleaned$year <- factor(df_cleaned$year,
                     levels = c("2010", "2011", "2012", "2013", "2014", "2015"),
                     labels = c(1, 2, 3, 4, 5, 6));

#df_cleaned$totalcrimes <- as.factor(df_cleaned$totalcrimes)


set.seed(123)

splits <- sample.split(df_cleaned$county, SplitRatio = 0.8)
train_set <- subset(df_cleaned, splits==TRUE)
test_set <- subset(df_cleaned, splits==FALSE)

#train_set[, 2:19] <- scale(train_set[, 2:19])
#test_set[, 2:19] <- scale(test_set[, 2:19])

Regression

regressor <-  glm(totalcrimes~., family=gaussian(link = "identity"),data=train_set)
stepwise <- step(regressor, direction = "both")

## Start:  AIC=5002.1
## totalcrimes ~ county + population + year + lawenforcementtotal + 
##     divorce + mentalchild + mentaladult + unemployment + allageproverty + 
##     ueighteenpoverty + fivetoseventenpov
## 
##                       Df   Deviance    AIC
## - lawenforcementtotal  1  241942427 5000.3
## - divorce              1  242655053 5001.1
## <none>                    241814047 5002.1
## - allageproverty       1  245092297 5004.1
## - ueighteenpoverty     1  248825615 5008.5
## - year                 5  257390382 5010.5
## - mentalchild          1  253561895 5014.1
## - unemployment         1  253903448 5014.5
## - fivetoseventenpov    1  254127113 5014.8
## - mentaladult          1  256384468 5017.4
## - population           1  320388196 5083.1
## - county              58 1098802081 5332.7
## 
## Step:  AIC=5000.26
## totalcrimes ~ county + population + year + divorce + mentalchild + 
##     mentaladult + unemployment + allageproverty + ueighteenpoverty + 
##     fivetoseventenpov
## 
##                       Df   Deviance    AIC
## - divorce              1  242705306 4999.2
## <none>                    241942427 5000.3
## + lawenforcementtotal  1  241814047 5002.1
## - allageproverty       1  245277509 5002.3
## - ueighteenpoverty     1  248945811 5006.7
## - year                 5  258021041 5009.2
## - mentalchild          1  253617214 5012.2
## - unemployment         1  254031095 5012.6
## - fivetoseventenpov    1  254342989 5013.0
## - mentaladult          1  256596366 5015.6
## - population           1  320413512 5081.1
## - county              58 1107065647 5332.9
## 
## Step:  AIC=4999.19
## totalcrimes ~ county + population + year + mentalchild + mentaladult + 
##     unemployment + allageproverty + ueighteenpoverty + fivetoseventenpov
## 
##                       Df   Deviance    AIC
## <none>                    242705306 4999.2
## + divorce              1  241942427 5000.3
## + lawenforcementtotal  1  242655053 5001.1
## - allageproverty       1  248206467 5003.8
## - ueighteenpoverty     1  250067204 5006.0
## - year                 5  259171855 5008.6
## - unemployment         1  254031855 5010.6
## - mentalchild          1  254634736 5011.3
## - fivetoseventenpov    1  254805543 5011.5
## - mentaladult          1  256868278 5013.9
## - population           1  327927630 5086.0
## - county              58 2175493407 5530.2

Analysis of Variance (ANOVA)

Analysis of Variance (ANOVA) above is used to compare the means of the variance for statistical significance.

anova(stepwise, test="F")

# 
fit <- aov(totalcrimes~., data=train_set)
summary(fit)

##                      Df    Sum Sq   Mean Sq  F value   Pr(>F)    
## county               58 1.080e+11 1.863e+09 1710.044  < 2e-16 ***
## population            1 4.378e+08 4.378e+08  401.888  < 2e-16 ***
## year                  5 2.327e+07 4.654e+06    4.272 0.000993 ***
## lawenforcementtotal   1 2.568e+06 2.568e+06    2.358 0.126070    
## divorce               1 2.233e+06 2.233e+06    2.050 0.153654    
## mentalchild           1 1.845e+07 1.845e+07   16.940 5.44e-05 ***
## mentaladult           1 3.540e+07 3.540e+07   32.500 3.77e-08 ***
## unemployment          1 1.471e+07 1.471e+07   13.507 0.000298 ***
## allageproverty        1 6.444e+06 6.444e+06    5.916 0.015793 *  
## ueighteenpoverty      1 1.628e+05 1.628e+05    0.149 0.699407    
## fivetoseventenpov     1 1.231e+07 1.231e+07   11.304 0.000911 ***
## Residuals           222 2.418e+08 1.089e+06                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# HSD.test(fit, 'unemployment ')

Residual Plot

residual <- data.frame(Fitted = fitted(stepwise), Residuals = resid(stepwise), Treatment = train_set$totalcrimes)

plot_res <- ggplot(residual, aes(Fitted, Residuals, colour = Treatment)) + geom_point()

ggplotly(plot_res)

## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

Confident Interval

confint(stepwise)

## Waiting for profiling to be done...

##                           2.5 %        97.5 %
## (Intercept)        3.514137e+04  5.257861e+04
## county2           -4.509359e+04 -3.026503e+04
## county3            1.759665e+05  2.561847e+05
## county4           -1.847632e+04 -1.218690e+04
## county5           -4.003689e+04 -2.698067e+04
## county6           -3.933880e+04 -2.645562e+04
## county7           -2.996834e+04 -2.008484e+04
## county8           -3.761863e+04 -2.524088e+04
## county9           -4.475257e+04 -3.004656e+04
## county10          -3.902694e+04 -2.619976e+04
## county11          -4.312707e+04 -2.893036e+04
## county12          -4.447702e+04 -2.975810e+04
## county13          -4.543456e+04 -3.042556e+04
## county14          -6.699662e+03 -3.683733e+03
## county15           6.884225e+04  1.050451e+05
## county16          -4.706535e+04 -3.154038e+04
## county17          -4.421168e+04 -2.960197e+04
## county18          -4.386743e+04 -2.948005e+04
## county19          -4.344640e+04 -2.916084e+04
## county20          -4.478454e+04 -2.993710e+04
## county21          -5.172073e+04 -3.468023e+04
## county22          -4.270715e+04 -2.869432e+04
## county23          -3.289370e+04 -2.202968e+04
## county24           2.831810e+05  4.268649e+05
## county25          -4.877178e+04 -3.274365e+04
## county26          -4.276384e+04 -2.872511e+04
## county28           4.858833e+04  7.553175e+04
## county29          -4.467252e+04 -3.006745e+04
## county30           9.342428e+04  1.472886e+05
## county31           1.852829e+05  2.661927e+05
## county32          -1.728764e+04 -1.161247e+04
## county33          -1.381400e+04 -9.091988e+03
## county36           4.654261e+03  1.032043e+04
## county37          -4.618999e+04 -3.096345e+04
## county38          -3.264768e+04 -2.201878e+04
## county39          -4.273305e+04 -2.860299e+04
## county40          -3.802003e+04 -2.554821e+04
## county41           2.211987e+05  3.367049e+05
## county42          -2.767832e+04 -1.873803e+04
## county43           1.653561e+04  2.676194e+04
## county44          -6.870157e+03 -2.116914e+03
## county45          -1.868452e+04 -1.235097e+04
## county46          -2.706671e+04 -1.781711e+04
## county47          -4.634840e+04 -3.126450e+04
## county48          -4.977473e+04 -3.336613e+04
## county49          -4.742532e+04 -3.170056e+04
## county52          -3.630333e+04 -2.444980e+04
## county53          -3.631639e+04 -2.459065e+04
## county54           1.126232e+05  1.757783e+05
## county55          -3.973076e+04 -2.648663e+04
## county56          -4.529073e+04 -3.044249e+04
## county57          -3.595419e+04 -2.424635e+04
## county58          -2.260695e+04 -1.485317e+04
## county59          -4.202072e+04 -2.810953e+04
## county60          -4.285657e+04 -2.874644e+04
## county61          -3.851339e+04 -2.601193e+04
## county62           6.553923e+04  1.023905e+05
## county63          -4.659207e+04 -3.131903e+04
## county64          -4.876006e+04 -3.267136e+04
## population        -1.316622e-01 -8.400117e-02
## year2             -2.078425e+01  8.704594e+02
## year3             -2.181291e+01  8.911599e+02
## year4              1.632421e+02  1.072794e+03
## year5              1.180490e+02  1.037611e+03
## year6             -5.025215e+02  4.932371e+02
## mentalchild       -6.769890e-02 -1.742043e-02
## mentaladult        8.618134e-03  2.902459e-02
## unemployment       1.103547e-02  4.501041e-02
## allageproverty    -1.598438e-01 -1.112728e-02
## ueighteenpoverty   8.897740e-02  6.283368e-01
## fivetoseventenpov -7.998222e-01 -2.084640e-01

Prediction

pred <- predict(regressor, test_set, type="response")
actual_pred <- cbind.data.frame(test_set$totalcrimes, pred)
actual_pred <- actual_pred[1:20, ]
kable(actual_pred)

	test_set$totalcrimes	pred
5	8743	8710.9305
11	820	695.6136
16	78402	79652.0196
24	5400	6524.1559
26	2082	1739.9127
31	1664	1861.8931
37	3532	3953.7500
48	2210	2215.0483
50	817	785.2716
59	2006	1314.9362
65	1262	1246.1255
68	1527	1436.8891
73	926	619.6817
84	5997	7320.9345
87	27531	27140.7780
94	891	496.9765
97	1329	1501.8446
104	1284	1081.2709
114	1443	1427.4388
118	1469	1358.9968

Check for Accuracy

MAPE The mean absolute percentage error (MAPE), also known as mean absolute percentage deviation (MAPD), is a measure of prediction accuracy of a forecasting method in statistics. Represented mathematically as:

$M=\quad \frac { 100 }{ n } \sum _{ t=1 }^{ n }{ \left| \frac { { A }_{ t }\quad -\quad { F }_{ t } }{ { A }_{ t } } \right| }$ ,

Where,

${ A }_{ t }\quad =\quad Actual\quad Value\\ \quad \quad \quad \quad \quad \quad { F }_{ t }\quad =\quad Predicted\quad Value.$

min_max_accuracy <- mean(apply(actual_pred, 1, min) / apply(actual_pred, 1, max))*100 
print(paste('The Percentage Accuracy Is: ', min_max_accuracy))

## [1] "The Percentage Accuracy Is:  87.4774508797211"

mape <- mean(abs((actual_pred$pred - actual_pred$`test_set$totalcrimes`))/actual_pred$pred)*100 
print(paste('The MAPE Is: ', mape ))

## [1] "The MAPE Is:  16.5027632643134"

Principal Component Analysis

To obtain the best model, the best way is to adopt a dimensionality reduction. The PCA, like factor analysis, is a mathematical procedure that transforms a number of correlated variables into a smaller number of uncorrelated variables.

prin_comp <- prcomp(train_set[, -c(1,3)], scale. = T)
biplot(prin_comp, scale = 0)

std_dev <- prin_comp$sdev
pr_var <- std_dev^2
pca_var <- pr_var/sum(pr_var)

plot(pca_var, xlab = "Principal Component",
             ylab = "Proportion of Variance Explained",
             type = "b")

plot(cumsum(pca_var), xlab = "Principal Component",
              ylab = "Cumulative Proportion of Variance Explained",
              type = "b")

pca2 <- df_pca[, -1]
pca1 <- prcomp(pca2, scale. = TRUE)

cir <- function(center = c(0, 0), npoints = 100) {
    r = 1
    t = seq(0, 2 * pi, length = npoints)
    x = center[1] + r * cos(t)
    y = center[1] + r * sin(t)
    return(data.frame(x = x, y = y))
}
corcir = cir(c(0, 0), npoints = 100)

correlations = as.data.frame(cor(pca2, pca1$x))

arrows = data.frame(x1 = c(0, 0, 0, 0), y1 = c(0, 0, 0, 0), x2 = correlations$PC1, 
    y2 = correlations$PC2)


ggplot() + geom_path(data = corcir, aes(x = x, y = y), colour = "black") + 
    geom_segment(data = arrows, aes(x = x1, y = y1, xend = x2, yend = y2), colour = "blue") + 
    geom_text(data = correlations, aes(x = PC1, y = PC2, label = rownames(correlations))) + 
    geom_hline(yintercept = 0, colour = "brown") + geom_vline(xintercept = 0, 
    colour = "brown") + xlim(-1.1, 1.1) + ylim(-1.1, 1.1) + labs(x = "pc1 aixs", 
    y = "pc2 axis") + ggtitle("Circle of correlations")

PCA Visual

#train set
train_pca <- data.frame(TotalCrimes = train_set$totalcrimes, prin_comp$x)
train_pca <- train_pca[,1:8]
pca_model <- rpart(TotalCrimes ~ .,data = train_pca, method = "anova")

#test set
test_pca <- predict(prin_comp, newdata = test_set)
test_pca <- as.data.frame(test_pca)
test_pca <- test_pca[,1:8]
#prediction
prediction <- predict(pca_model, test_pca)

INTERPRETATIONS:

Generalized Linear Model: The analysis revealed that variables like: years, mentalchild, mentaladult, unemployment, allageproverty, ueighteenpoverty, fivetoseventenpov and population have P-values less than 0.05, which means they are statistically significant.
Residual Plot: Unpattern nature of the residuals conform to the accuracy of model.
Analysis of Variance: Supports the GLM model but struck out the variable “ueighteenpoverty”.
Confidence Interval: The CI test indicates that at 95% the confidence interval would contain the true population mean as none of the interval is zero.
Principal Component Analysis: After the dimensionality reduction, PCA also confirms the best model.

DECISION: Since the calculated P-values are less than the tabulated values ($??$ = 0.05), we therefore reject the alternative hypothesis, H_o.

CONCLUSION: We therefore conclude that there is a positive relationship between crimes and factors like Economic and Psychological as confirmed by variables like: Unemployment, Mental Health, Poverty, County (or location) and Year. It also means; as the factors increases, so also is crime rate and vice-versa. We can therefore confirm that not only increase and decrease in population affect the crime rate, but other factors like economic and psychological also contribute significantly to crime rate in a geographical area. While security and social factors may not contribute significantly to crime rate, but a further analysis may prove otherwise.

Bibliography

The Origins of attachment theory- John Bowlby and Mary Ainsworth

http://www.sakkyndig.com/psykologi/artikler/Bowlbyatt.pdf

ftp://statgen.ncsu.edu/pub/thorne/molevoclass/AtchleyOct19.pdf

https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Current-YTD/5uac-w243

http://data.beta.nyc/dataset/pediacities-nyc-neighborhoods/resource/35dd04fb-81b3-479b-a074-a27a37888ce7

http://www.kevjohnson.org/making-maps-in-r/

http://leaflet-extras.github.io/leaflet-providers/preview/

http://nyscommunityaction.org/wp-content/uploads/2018/04/Comparitive-Look-@-County-Profiles.pdf

https://stackoverflow.com/questions/19449307/how-to-solve-error-in-as-posixct-numericx2l-origin-must-be-supp

http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html

https://data.ny.gov/en/browse?q=nypd