Introduction
The word crime is derived from Latin as “cerno” which means sinfulness in English can be defined as “violation of a law in which there is injury to the public or a member of the public and a term in jail or prison, and/or a fine as possible penalty” -https://legal-dictionary.thefreedictionary.com/crime. A quote by Frank Wolf say “We can’t just rail against crime. We must speak of the root problems - devastating family breakup, an insidious culture of violence that cheapens human life, skyrocketing prisoner recidivism rates that robs our communities of husbands and fathers (wives and mothers) - and recognize that there is a societal role in rehabilitation and restoration”. To this end, I decided to take a clue from him and decided to do my research on factors affecting crime rates - social, economic, security and psychological against the popular notion that population is the only determining factor in high rate of crime in a geographical area and at a point in time. The research will focus mainly on New York state for the period of 5-10 years.
Abstract
In other to gain more insight about crime rate, a stratum of the dataset was analyzed. The NYPD historical complaints data from 2010 to 2015 was collected from New York State open data website.
It was discovered that most crimes are committed on Fridays, while July follow closely by August and May are the months with the highest crime rate. In NYC borough, Bronx came first, while Staten Island came last in terms of crime rate. A popular believe is that population alone has the most impact on crime rate, but a further analysis also revealed other culprits; the likes of economic and psychological factors also contribute significantly to crime rate.
It was very difficult combining the needed dataset from all the above dataset into one. Therefore, to make it a little bit easier, we will implore the helping hand of SQL (Structured Query Language). A relational database like ORACLE SQL, MySQL or PostgreSQL. I chose PostgreSQL to access and extract the require data.
If your connection is right, you should see that the above table are now in your database. But some of the data type may be wrong, use the code below to alter the it to desirable data type. Upon further scrutiny, we discovered that the poverty table still have some unwanted character in them, we therefore removed them using gsub from grep, a R based.
Background
Crime and violence impact us psychologically (psychological factor), whether we are directly involved, a family member or friend is involved, our residence or community was impacted or exposure to it on media coverage. This is sometimes the case when a family member or friend is killed or hurt after directly or indirectly exposure to crime, it is natural for us to experience strong feelings and the effects may push some people state of mind beyond it optimal capability and may in turn leads to revenge or survival kind of crime(s).
Also, Unemployment (economical factor) has a great deal of common sense appeal as one of the root cause of criminal behavior as unemployment is always viewed as a causal agent of economic deprivation may ultimately increases the probability of crime.
According study conducted (Social factor) by likes of Wilson (1987), Ferguson (1952), Robins (1995) and Gluecks (1950) established that parents with criminal records are more likely to have children criminal records also. Bowlby (1946) believed that a traumatic experience of parent-child separation and disruption of attachment relations may cause delinquency after parental separation, that where divorce came into being. The different between Bowlby work and mine is that; my research work not only focus on children alone, but all ages, as crime affect everyone not just the children and their immediate family, but society in general.
The role of law enforcement officers (Security) is to protect citizenry against crimes. The research work did focus on the ratio of law enforcement to the crime rates. Does hiring more police /law enforcement officers curb the menace of crimes?
Methodology
The research project answered and provided a recommendation (where necessary) for many of the overlooked factors that may influenced crime rates in a community. A R programing was used to obtain and check for any positive or negative effect/interaction(s) between crime rates and all the factors, while a graphical representation was included in the analysis. Prediction was done on counties to check for any increase or decrease in crime rate in the future as well as a recommendation on best method to curb the menace (crime rates).
The dataset for this research work was derived from various governmental organizations throughout New York State (counties). The datasets were secondary data as primary data are not feasible due to some constraint like cost, government restrictions etc.
Most of the data tables did not conform to the standard R format. I therefore adopted the use of PostgreSQL. After loading the dataset into R, I reloaded them into PostgreSQL as a table and called it directly from SQL into R environment using PostgreSQL package that was connected to R through ODBC. I was parsing SQL queries in R environment and made some cleaning. I also used SQLDF package in R to write SQL queries in R.
Dataset
The required datasets are as follows:
Population: As widely known, population plays an important role is geographical location as population somethings (if not always) used to determine how wealthy, organized etc. that society are. It (Population) can also spells dome to the geographical location as it is something believe that the higher the population, the higher the crime rate and vice versa. I will not dispute this, but I will like to also shift our attention to some other factor we always overlooked like!
Crimes Rate (Felony & Misdemeanor): As the main culprit in our analysis, I collected dataset for crime rate for both New York City and New York State as I was trying to dig deeper. New York City is seen as the largest city in New York State, therefore a stratum from NYC would give a better judgment of the whole state I think. Analysis was done on the NYC historical complaint dataset also the NYS.
Law Enforcement: Total employed law enforcements at the time of the incident was also considered. As one article in New York Times say “The notion of pruning the police inevitably raises the specter of more crime, even if there is little evidence to support such fears. The relationship between the number of officers and lawful behavior is not clear-cut. In New York City, for example, the police force peaked at more than 40,000 in 2000. Since then, both the number of officers and the crime rate have declined.” We looked into this and included the law enforcement officers. NB: Law enforcement officers used here comprises of college security officer, hospital police etc.
Unemployment: No society can escape this. Robert Frost once said, “The reason why worry kills more people than work is that more people worry that work”. Also, a school of thought said, “Idle hands are devils workshop”. You may want to turn to crime if you don’t work. Therefore, unemployment may be seen as a culprit.
Child Abuse and Neglect: I wanted to include this into the analysis, but the required datasets was not readily available at the time of this analysis. It would have been of great to see the effect(s) of child abuse and neglect in crime rate. Because, I believed, like the divorce, a child that is abused physically or mentally may want to seek help or inclusion from the outside world and that may lead them into crimes.
Divorce: Divorce or single parental care may also contribute significantly to crime rate, (although this is debatable) as a single parent may not be able to single-handedly successfully rate a child. Even with a home with both parents present have their own challenges, let alone single parent as such child may not be adequately taken care of and may want to seek for paternal or maternal (whichever missing) outside the home and may in turn spell dome as the child may fall into wrong hand seeking prey.
Data Cleaning
There are some missing observations in the datasets before and after combinations. I therefore employ the use of a package in R called MICE, multivariate Imputation by Chained Equations (mice package) is said to be the best approach to tackle end-on the menace missing observation may cause. It is well known fact that if proper care is not taken in dealing with missing observation, the analysis would likely be biased. To avoid this, the mice package in R was used. A fully conditional specification where each incomplete variable is imputed by a separate model at which the algorithm will impute mixes of continuous, binary, ordered or unordered data based on the data type.
The Divorce table did not come with years, I was able to add years of occurrence into their respective rows, also reformatted it and removed all the unwanted characters.
Unemployment table came with months name in number, I changed it to alphabets and changing it from long format to wide and removed all the unwanted characters.
NYPD Historical Complaint dataset was not in the correct format at all, I changed the date to format, re-formatted the longitude and latitude, made years, months and days-of-the-weeks out of the datasets.
Limitation
The best datasets required for this research would have been the one collected from the department of correction, (as the saying goes “from the horse’s mouth”) but due to nature of the datasets, government put some restrictions in it use and therefore blocking public access to it. We therefore opted for the dataset from different organizations (some were scrapped from websites) as we believe it will give same result, which required extra work to bring them together as one dataset.
As mentioned above, this research project did used (heavily) relational database like PostgreSQL, MySQL, OracleSQL and SQLIte (Structured Query Language) for most of the query. Most (if not all) of the dataset used in the analysis are not readily available. I therefore sorted dataset from various government agencies. Some of the dataset was scrapped fro
As school of thought dictate, we should first install the packages to be used in this research by using the following syntax “install.packages(”The Package name“)” one after the other.
# install.packages(c("flexdashboard", "DT", "pandocfilters", "googleVis", "R.rsp", "rmarkdown", "knitr", "plyr", "plotrix", "tidyr","dplyr", "ggplot2","plotly", "MASS","reshape2", "Amelia","mice","stringi","ROCR", "XML", "httr", "AUC", "magrittr", "rvest", "rgdal", "purrr", "DBI",
# "leaflet", "tigris", "RPostgreSQL" , "data.table", "tibble", "cowplot", "Hmisc", "sp", "maptools", "PerformanceAnalytics" , "ggcorrplot", "ggalt",
# "ggmap", "ggthemes", "readr", "webshot", "caTools", "caret", "rpart", "agricolae", "sqldf"))
suppressMessages(library("rmarkdown"))
#setwd("C:/Users/HP/Desktop")
library(flexdashboard)
library(DT)
library(prettydoc)
suppressMessages(library(R.rsp))
suppressMessages(library(pandocfilters))
suppressPackageStartupMessages(library(googleVis))
op <- options(gvis.plot.tag='chart')
Cont’d…
suppressMessages(library(knitr))
suppressMessages(library(plyr))
suppressMessages(library(plotrix))
suppressMessages(library(tidyr))
suppressMessages(library(dplyr))
suppressMessages(library(ggplot2))
suppressMessages(library(plotly))
suppressMessages(library(MASS))
suppressMessages(library(reshape2))
suppressMessages(library(Amelia))
suppressMessages(library(mice))
suppressMessages(library(stringi))
suppressMessages(library(ROCR))
suppressMessages(library(XML))
suppressMessages(library(httr))
suppressMessages(library(AUC))
suppressMessages(library(magrittr))
suppressMessages(library(rvest))
suppressMessages(library(rgdal))
suppressMessages(library(purrr))
suppressMessages(library(DBI))
suppressMessages(library(leaflet))
suppressMessages(library(tigris))
suppressMessages(library(RPostgreSQL))
suppressMessages(library(data.table))
suppressMessages(library(tibble))
suppressMessages(library(cowplot))
suppressMessages(library(Hmisc))
suppressMessages(library(sp))
suppressMessages(library(maptools))
suppressMessages(library(PerformanceAnalytics))
suppressMessages(library(ggcorrplot))
suppressMessages(library(ggalt))
suppressMessages(library(ggmap))
suppressMessages(library(ggthemes))
suppressMessages(library(readr))
suppressMessages(library(webshot))
suppressMessages(library(caTools))
suppressMessages(library(randomForest))
suppressMessages(library(caret))
suppressMessages(library(rpart))
suppressMessages(library(agricolae))
library(htmltools)
library(RColorBrewer);
It will be very difficult to work or combine all the needed dataset from all the above dataset. Therefore to make it a little bit easier, we will implore the help hand of SQL (Structured Query Language). A relational database like ORACLE SQL, MySQL or PostgreSQL would be nice in doing it better. I outed to use PostreSQL to access and extract the require data.
Connect your PostreSQL and R using the code below.
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "capstone",
host = "localhost", port = 5432,
user = "postgres", password = "oracle")
The divorce came as a table from New York state health department website as follows.Here is the glimpse after cleaning: As mentioned above, this research project did used (heavily) relational database like PostgreSQL, MySQL, OracleSQL and SQLIte (Structured Query Language) for most of the query. Most (if not all) of the dataset used in the analysis are not readily available. I therefore sorted dataset from various government agencies. Some of the datase was scrapped for websites e.g the divorce dataset for year 2015 was scrapped from: https://www.health.ny.gov/statistics/vital_statistics/2015/table50.htm while others were downloaded and the required rows and columns were extracted.
div_2015 <- ("https://www.health.ny.gov/statistics/vital_statistics/2015/table50.htm") %>%
map(possibly(~html_session(.) %>%
read_html %>%
html_table(fill = TRUE) %>%
.[[1]],
NULL))
div_2014 <- ("https://www.health.ny.gov/statistics/vital_statistics/2014/table50.htm") %>%
map(possibly(~html_session(.) %>%
read_html %>%
html_table(fill = TRUE) %>%
.[[1]],
NULL))
div_2013 <- ("https://www.health.ny.gov/statistics/vital_statistics/2013/table50.htm") %>%
map(possibly(~html_session(.) %>%
read_html %>%
html_table(fill = TRUE) %>%
.[[1]],
NULL))
div_2012 <- ("https://www.health.ny.gov/statistics/vital_statistics/2012/table50.htm") %>%
map(possibly(~html_session(.) %>%
read_html %>%
html_table(fill = TRUE) %>%
.[[1]],
NULL))
div_2011 <- ("https://www.health.ny.gov/statistics/vital_statistics/2011/table50.htm") %>%
map(possibly(~html_session(.) %>%
read_html %>%
html_table(fill = TRUE) %>%
.[[1]],
NULL))
div_2010 <- ("https://www.health.ny.gov/statistics/vital_statistics/2010/table50.htm") %>%
map(possibly(~html_session(.) %>%
read_html %>%
html_table(fill = TRUE) %>%
.[[1]],
NULL))
# After scrapping the table, we realized that they are in list format. We will therefore convert it to accepatable format; first as a dataframe, subsetting unwanted rows and adding year column to the dataset as it was not provided in the scrapped table.
div_2010 <- as.data.frame(div_2010)
div_2011 <- as.data.frame(div_2011)
div_2012 <- as.data.frame(div_2012)
div_2013 <- as.data.frame(div_2013)
div_2014 <- as.data.frame(div_2014)
div_2015 <- as.data.frame(div_2015)
div_2010 <- div_2010[-c(1, 2,3,4,5, 11, 12), ]
div_2011 <- div_2011[-c(1, 2,3,4,5, 11, 12), ]
div_2012 <- div_2012[-c(1, 2,3,4,5, 11, 12), ]
div_2013 <- div_2013[-c(1, 2,3,4,5, 11, 12), ]
div_2014 <- div_2014[-c(1, 2,3,4,5, 11, 12), ]
div_2015 <- div_2015[-c(1, 2,3,4,5, 11, 12), ]
div_2010[["year"]] <- rep(2010, nrow(div_2010))
div_2011[["year"]] <- rep(2011, nrow(div_2011))
div_2012[["year"]] <- rep(2012, nrow(div_2012))
div_2013[["year"]] <- rep(2013, nrow(div_2013))
div_2014[["year"]] <- rep(2014, nrow(div_2014))
div_2015[["year"]] <- rep(2015, nrow(div_2015))
#The column names were not in conformitting with the standard R Programming column naming, we will therefore rename the columns.
Merge the divorce dataset from 2010 to 2015.
divorce <- Reduce(function(...) merge(..., all=T), list(div_2010, div_2011,
div_2012, div_2013, div_2014, div_2015))
colnames(divorce) <- c("county", "total", "cruelty", "abandonment", "imprisonment",
"adultery", "afterlegalseparation", "afterseparationbyagreement", "notstated", "year")
kable(head(divorce[, 1:8]))
| county | total | cruelty | abandonment | imprisonment | adultery | afterlegalseparation | afterseparationbyagreement |
|---|---|---|---|---|---|---|---|
| Albany | 719 | 2 | 7 | 0 | 2 | 1 | 16 |
| Albany | 722 | 1 | 20 | 0 | 0 | 0 | 31 |
| Albany | 766 | 109 | 460 | 6 | 7 | 5 | 152 |
| Albany | 783 | 12 | 67 | 1 | 0 | 2 | 60 |
| Albany | 809 | 10 | 23 | 1 | 1 | 0 | 35 |
| Albany | 862 | 45 | 194 | 0 | 1 | 2 | 119 |
Above is the scrapped and cleaned New York state divorce dataset from the year 2010 to 2015. Also, other dataset like the number Arrest, Mental Health, Police, Unemployment, Poverty, Crimes, Longtitude and Latitude was included in the analysis.
Arrest<- "https://docs.google.com/spreadsheets/d/e/2PACX-1vTQbIoMp7kHRauK90knVVWxL0ZQgSyJo5w53ufpeoJhv3NEKZ7e52Z9fU5gI2yj4Q50kCt2S9-WIe8t/pub?gid=0&single=true&output=csv"
LongLat <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vRdUPece17flbIDQQ-prWnhh9ZOntvmHz9Saao-qC6EvV5VODzQ6TUfacf4oeB_w6EnDq0xmEhjw7X5/pub?gid=0&single=true&output=csv"
Mental <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vT87ZxCxB9WAXFNmOWmxAV0aBlpyiE22Jqu_sgTtXpgpgHCzmskdboIiDtcJVgramrtVoTyqT2lzFeT/pub?gid=0&single=true&output=csv"
CrimesNY <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vQ8KAFelT2l0AXFAabbbZSaeoqmQJRt-11kt7Jm_LKaHGbHa_ndUQidLj8Op_d19thmsyxoe9p5Dl3o/pub?gid=0&single=true&output=csv"
LawEnf <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vQauvaPEbqzKX1Vr9ojJuwqjYZFYTyMgclAOLxeUg9ithYryjbQz0foYmAe52jk3A7xBUl33vpDfksj/pub?gid=0&single=true&output=csv"
UnemploymentData <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vSUjXqUWfUlOIQ_hYncSP8-BlOEVnzNQ5YDg49HD63IZDoZuw-2h8DiXt6hBuHTKeGAa9F2l4c23fwN/pub?gid=0&single=true&output=csv"
PovData <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vSpp6mizldwEOSIXYPIuLH4Y8Ho3WvfkdAY8mTCcNnOwFExK7xo4q7bv6qUW-v2Z49rZIHJxo-3iG1X/pub?gid=0&single=true&output=csv"
# Note, the dataset contains column(s) of data we needed. We will load all of them and rename them to meet up with the standard.
nypd <- read.csv("C:/Users/HP/Documents/GitHub/Capstone_Project_2/NYPD_Complaint_Data_Historic.csv",
sep = ",", header=TRUE, stringsAsFactors = FALSE)
arrest <- read.csv(url(Arrest),header = TRUE, sep = ",", col.names= c("county", "year", "total",
"felonytotal", "drugfelony","violentfelony", "dwifelony",
"otherfelony", "misdemeanortotal", "drugmisd",
"dwimisd", "propertymisd", "othermisd"))
mental_health <- read.csv(url(Mental), header = TRUE, sep = ",", col.names= c("rowcreateddatetime",
"year", "omhregioncode", "omhregionlabel", "county", "agegroup", "ratecodegroup", "countycode",
"countOfrecipientbycounty", "total", "paidclaimtotal"))
police <- read.csv(url(LawEnf), header = TRUE, sep = ",", stringsAsFactors = FALSE,
col.names= c("county", "year", "policetotal"))
unemployment <- read.csv(UnemploymentData, header = TRUE, sep = ",", stringsAsFactors = TRUE,
col.names= c('year', 'month', 'county', 'total', 'benefitamountsdollars'))
lat_long <- read.csv(url(LongLat), header = TRUE, sep = ",")
crimes_ny <- read.csv(url(CrimesNY),header = TRUE, sep = ",")
crimes_ny <- crimes_ny[, 1:7]
colnames(crimes_ny)= c('county','year','population','indexcount', 'violentcount','propertycount','firearmcount')
poverty <- read_csv(url(PovData), locale = locale(encoding = "UTF-8"))
pov_sub <- poverty[, c(1, 2, 3, 4, 6, 15, 24)]
colnames(pov_sub) <- c("year","state", "countyid", "county", "allageproverty", "ueighteenpoverty", "fivetoseventenpov")
The New York City Department of Police, NYPD historical dataset that was downloaded from the city open data source was first analyzed as the stratum of the entire state as the NYC dataset contains all the require variables for further scrutiny. Below is the glimpse of the dataset
You’re required to download the NYPD Historical Complaint data and change path below to where you downloaded it. I can not include it here, its about 1.4GB in size!
NB: The NYPD dataset is very large, i will therefore divide the table into chuck so are to fit the screen as (option=width) does not seem to work!
kable(head(nypd[, 1:8]))
| CMPLNT_NUM | CMPLNT_FR_DT | CMPLNT_FR_TM | CMPLNT_TO_DT | CMPLNT_TO_TM | RPT_DT | KY_CD | OFNS_DESC |
|---|---|---|---|---|---|---|---|
| 101109527 | 12/31/2015 | 23:45:00 | 12/31/2015 | 113 | FORGERY | ||
| 153401121 | 12/31/2015 | 23:36:00 | 12/31/2015 | 101 | MURDER & NON-NEGL. MANSLAUGHTER | ||
| 569369778 | 12/31/2015 | 23:30:00 | 12/31/2015 | 117 | DANGEROUS DRUGS | ||
| 968417082 | 12/31/2015 | 23:30:00 | 12/31/2015 | 344 | ASSAULT 3 & RELATED OFFENSES | ||
| 641637920 | 12/31/2015 | 23:25:00 | 12/31/2015 | 23:30:00 | 12/31/2015 | 344 | ASSAULT 3 & RELATED OFFENSES |
| 365661343 | 12/31/2015 | 23:18:00 | 12/31/2015 | 23:25:00 | 12/31/2015 | 106 | FELONY ASSAULT |
print(paste0(".....Last columns"))
## [1] ".....Last columns"
kable(head(nypd[, 17:24]))
| PREM_TYP_DESC | PARKS_NM | HADEVELOPT | X_COORD_CD | Y_COORD_CD | Latitude | Longitude | Lat_Lon |
|---|---|---|---|---|---|---|---|
| BAR/NIGHT CLUB | 1007314 | 241257 | 40.82885 | -73.91666 | (40.828848333, -73.916661142) | ||
| 1043991 | 193406 | 40.69734 | -73.78456 | (40.697338138, -73.784556739) | |||
| OTHER | 999463 | 231690 | 40.80261 | -73.94505 | (40.802606608, -73.945051911) | ||
| RESIDENCE-HOUSE | 1060183 | 177862 | 40.65455 | -73.72634 | (40.654549444, -73.726338791) | ||
| OTHER | 987606 | 208148 | 40.73800 | -73.98789 | (40.7380024, -73.98789129) | ||
| DRUG STORE | 996149 | 181562 | 40.66502 | -73.95711 | (40.665022689, -73.957110763) |
The arrest table…
kable(head(arrest[, 1:8]))
| county | year | total | felonytotal | drugfelony | violentfelony | dwifelony | otherfelony |
|---|---|---|---|---|---|---|---|
| Albany | 1970 | 1226 | 688 | 97 | 191 | 5 | 395 |
| Albany | 1971 | 1833 | 829 | 131 | 231 | 6 | 461 |
| Albany | 1972 | 3035 | 1054 | 211 | 256 | 8 | 579 |
| Albany | 1973 | 3573 | 1134 | 244 | 274 | 28 | 588 |
| Albany | 1974 | 4255 | 1329 | 281 | 308 | 17 | 723 |
| Albany | 1975 | 4173 | 1259 | 209 | 344 | 12 | 694 |
The mental health table
DT::datatable(mental_health, options = list(
pageLength = 10
))
kable(head(police));
| county | year | policetotal |
|---|---|---|
| Albany | 2010 | 587 |
| Albany | 2011 | 1280 |
| Albany | 2012 | 1233 |
| Albany | 2013 | 1228 |
| Albany | 2014 | 1260 |
| Albany | 2015 | 1212 |
Change the columns to months after dcast. Here is Unemployment table after cleaning..
# The unemployment dataset will be reshaped and have their columns renamed appropriately.
unemployment2 <- unemployment[, -5]
unemployment_dcast <- dcast(unemployment2, year + county ~ month, value.var="total")
names(unemployment_dcast)[names(unemployment_dcast)== 1] <- "jan"
names(unemployment_dcast)[names(unemployment_dcast)== 2] <- "feb"
names(unemployment_dcast)[names(unemployment_dcast)== 3] <- "mar"
names(unemployment_dcast)[names(unemployment_dcast)== 4] <- "apr"
names(unemployment_dcast)[names(unemployment_dcast)== 5] <- "may"
names(unemployment_dcast)[names(unemployment_dcast)== 6] <- "jun"
names(unemployment_dcast)[names(unemployment_dcast)== 7] <- "jul"
names(unemployment_dcast)[names(unemployment_dcast)== 8] <- "aug"
names(unemployment_dcast)[names(unemployment_dcast)== 9] <- "sep"
names(unemployment_dcast)[names(unemployment_dcast)== 10] <- "oct"
names(unemployment_dcast)[names(unemployment_dcast)== 11] <- "nov"
names(unemployment_dcast)[names(unemployment_dcast)== 12] <- "dec"
unemployment_dcast <- tbl_df(unemployment_dcast)
DT::datatable(unemployment_dcast, options = list(
pageLength = 10
))
New York State crimes count.
DT::datatable(crimes_ny, options = list(
pageLength = 10
))
Crimes County By County Table
Let start by checking if any of the dataset do exist in the database, and if it exist, we will remove them and write (or save) our datasets in the PostgreSQL Automatically.
dbRemoveTable(con, "arrest")
## [1] TRUE
dbWriteTable(con, "arrest", arrest, row.names=FALSE)
## [1] TRUE
dbRemoveTable(con, "mental_health")
## [1] TRUE
dbWriteTable(con, "mental_health", mental_health, row.names=FALSE)
## [1] TRUE
dbRemoveTable(con, "police")
## [1] TRUE
dbWriteTable(con, "police", police, row.names=FALSE)
## [1] TRUE
dbRemoveTable(con, "unemployment_dcast")
## [1] TRUE
dbWriteTable(con, "unemployment_dcast", unemployment_dcast, row.names=FALSE)
## [1] TRUE
dbRemoveTable(con, "divorce")
## [1] TRUE
dbWriteTable(con, "divorce", divorce, row.names=FALSE)
## [1] TRUE
dbRemoveTable(con, "crimes_ny")
## [1] TRUE
dbWriteTable(con, "crimes_ny", crimes_ny, row.names=FALSE)
## [1] TRUE
dbRemoveTable(con, "pov_sub")
## [1] TRUE
dbWriteTable(con, "pov_sub", pov_sub, row.names=FALSE)
## [1] TRUE
If your connection is right, you should see that the above table are now in your database. But some of the data type may be wrong, use the code below to alter the it to desirable data type.
dbGetQuery(con, statement = paste("ALTER TABLE crimes_ny
ALTER COLUMN population TYPE INT USING population::integer"));
# dbGetQuery(con, statement = paste("ALTER TABLE inmates_dcast
# ALTER COLUMN values TYPE VARCHAR USING values::VARCHAR"))
#
After writing (or saving) them into the database, we can now try to subset the columns and the row we needed from the tables.
police_sub <- dbGetQuery(con, statement = paste(
"SELECT year, county, policetotal as police",
"FROM police",
"WHERE year IN (2010, 2011,2012,2013,2014,2015)",
"ORDER BY year"));
arrest_sub <- dbGetQuery(con, statement=paste(
"SELECT *",
"FROM arrest",
"WHERE year IN (2010, 2011,2012,2013,2014,2015)",
"ORDER BY county ASC"));
divorce$total <- (gsub(",", "", divorce$total));
divorce_sub <- dbGetQuery(con, statement=paste(
"SELECT total as divorce, county, year",
"FROM divorce",
"ORDER BY county ASC"));
emp <- dbGetQuery(con, statement=paste(
"SELECT *",
"FROM unemployment_dcast",
"WHERE year IN (2010, 2011,2012,2013,2014,2015)",
"ORDER BY county ASC"));
crimes_ny_sub <- dbGetQuery(con, statement=paste(
"SELECT county, year, population ",
"FROM crimes_ny",
"WHERE year IN (2010, 2011,2012,2013,2014,2015)",
"ORDER BY county ASC"));
Saving them in tbl for faster accessibility.
# For easy and better table ouput
police_sub <- tbl_df(police_sub)
arrest_sub <- tbl_df(arrest_sub)
divorce_sub <- tbl_df(divorce_sub)
crimes_ny_sub <- tbl_df(crimes_ny_sub)
#inmates_sub <- tbl_df(inmates_sub)
dbGetQuery(con, statement = paste("ALTER TABLE mental_health ALTER COLUMN total TYPE INT"));
mental_health_sub <- as.data.frame(dbGetQuery(con, statement = paste(
"SELECT year, county, agegroup, total as mentalhealth",
"FROM mental_health",
"WHERE year BETWEEN 2010 AND 2015",
"ORDER BY year, agegroup")));
Finally, after all the hard work done by PostreSQl, we now have our sought after dataset, hoooray!
Lets disconnect PostreSQL and start using R and SQLDF.
#dbListTables(con);
dbDisconnect(con)
## [1] TRUE
detach("package:RPostgreSQL", unload=TRUE)
library(sqldf);
Upon further scrutiny, we discovered that the poverty table still have some unwanted character in them, we therefore removed them using gsub from grep, a R based.
mental_sub <- dcast(mental_health_sub, year +county ~ agegroup, value.var="mentalhealth", fun.aggregate = sum)
mental_sub <- mental_sub[, c(1:4)]
divorce_sub$divorce <- (gsub(",", "", divorce_sub$divorce));
divorce_sub1 <- sapply(divorce_sub$divorce, as.numeric)
divorce_sub2 <- divorce_sub[, c(2:3)]
div_extraction <- cbind.data.frame(divorce_sub2, divorce_sub1)
colnames(div_extraction) <- c("county", "year", "divorce")
povselect <- sqldf::sqldf("SELECT * FROM pov_sub
WHERE year BETWEEN 2010 AND 2015 AND state = 36 AND county !='New York'
ORDER BY year ASC")
povselect$county <- (gsub("\\(NY)", "", povselect$county));
povselect$county <- (gsub("\\County", "", povselect$county));
povselect$allageproverty <- (gsub(",", "", povselect$allageproverty));
povselect$ueighteenpoverty <- (gsub(",", "", povselect$ueighteenpoverty));
povselect$fivetoseventenpov <- (gsub(",", "", povselect$fivetoseventenpov))
col_pov <- c('allageproverty', 'ueighteenpoverty', 'fivetoseventenpov')
povselect[col_pov] <- sapply(povselect[col_pov], as.numeric)
povertydata <- povselect[, -c(2:3)]
emp$jan <- (gsub(",", "", emp$jan))
emp$feb <- (gsub(",", "", emp$feb))
emp$mar <- (gsub(",", "", emp$mar))
emp$apr <- (gsub(",", "", emp$apr))
emp$may <- (gsub(",", "", emp$may))
emp$jun <- (gsub(",", "", emp$jun))
emp$jul <- (gsub(",", "", emp$jul))
emp$aug <- (gsub(",", "", emp$aug))
emp$sep <- (gsub(",", "", emp$sep))
emp$oct <- (gsub(",", "", emp$oct))
emp$nov <- (gsub(",", "", emp$nov))
emp$dec <- (gsub(",", "", emp$dec))
emp1 <- sapply(emp[, c(3:14)], as.data.frame.numeric)
emp2 <- as.data.frame(emp1)
unemp_sub <- cbind.data.frame(emp2,emp[, c(1,2)], stringsAsFactors=FALSE)
colnames(unemp_sub) <- c("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug",
"sept", "oct", "nov", "dec", "year", "county")
df_unemp_sub <-sqldf::sqldf("SELECT county, year, (jan + feb + mar + apr + may +
jun + jul + aug + sept + oct + nov + dec) as unemployment
FROM unemp_sub")
nypd[, 2] <- as.Date.character(nypd[, 2], format = "%m/%d/%Y")
nypd[, 4] <- as.Date.character(nypd[, 4], format = "%m/%d/%Y")
nypd[, 6] <- as.Date.character(nypd[, 6], format = "%m/%d/%Y")
Data cleaning of NYPD’s latitude and Longtitude from coordinate to a usable R format.
nypd$Lat_Lon <- (gsub(")","", nypd$Lat_Lon))
nypd$Lat_Lon <- (gsub(",",":", nypd$Lat_Lon))
nypd$Lat_Lon <- (gsub("[[alphanum:]]","", nypd$Lat_Lon))
nypd_new <- subset(nypd, CMPLNT_FR_DT >= '2010-1-1' & CMPLNT_FR_DT <= '2015-12-31',
select=c(CMPLNT_NUM , CMPLNT_FR_DT, CMPLNT_FR_TM, RPT_DT, OFNS_DESC,
CRM_ATPT_CPTD_CD, LAW_CAT_CD,BORO_NM, Latitude, Longitude,
HADEVELOPT, ADDR_PCT_CD, PREM_TYP_DESC))
colnames(nypd_new) <- c("ComplainNum", "IncidentDate", "IncidentTime",
"ReportedDate", "OffenseDesc", "CrimeFailSuccces", "CrimeLevel",
"Borough", "lat", "Long", "NYCHA", "PolicePrecinct", "BuildingType" )
nypd_new<- as.data.frame(nypd_new)
nypd_new$DateTime = paste(nypd_new$IncidentDate, nypd_new$IncidentTime, sep="")
nypd_new$dayofweek <- strftime(nypd_new$IncidentDate, '%A') # Days of the week
nypd_new$months <- strftime(nypd_new$IncidentDate, '%B') # Converting To Months Column
nypd_new$year <- strftime(nypd_new$IncidentDate, '%Y') # Converting To Year Column
kable(head(nypd_new[, 10:17]))
| Long | NYCHA | PolicePrecinct | BuildingType | DateTime | dayofweek | months | year |
|---|---|---|---|---|---|---|---|
| -73.91666 | 44 | BAR/NIGHT CLUB | 2015-12-3123:45:00 | Thursday | December | 2015 | |
| -73.78456 | 103 | 2015-12-3123:36:00 | Thursday | December | 2015 | ||
| -73.94505 | 28 | OTHER | 2015-12-3123:30:00 | Thursday | December | 2015 | |
| -73.72634 | 105 | RESIDENCE-HOUSE | 2015-12-3123:30:00 | Thursday | December | 2015 | |
| -73.98789 | 13 | OTHER | 2015-12-3123:25:00 | Thursday | December | 2015 | |
| -73.95711 | 71 | DRUG STORE | 2015-12-3123:18:00 | Thursday | December | 2015 |
incidentreport <- sqldf::sqldf("SELECT
'At ' || IncidentTime || ',' || dayofweek ||', '|| months || ' ' || year
||', a complaint was filled with NYPD police precinct ' || PolicePrecinct ||
' about ' || OffenseDesc
|| ' that was committed . The complainant maintained that the incident occurred at a '||
BuildingType || ' in ' || Borough || ' with co-ordinate ' || lat|| ' and ' ||Long||
' . The Police described the incident as ' || CrimeFailSuccces || ' and classified it as ' ||
CrimeLevel ||' with complain number: ' || ComplainNum
as Incident_Report from nypd_new
")
kable(head(incidentreport))
| Incident_Report |
|---|
| At 23:45:00,Thursday, December 2015, a complaint was filled with NYPD police precinct 44 about FORGERY that was committed . The complainant maintained that the incident occurred at a BAR/NIGHT CLUB in BRONX with co-ordinate 40.828848333 and -73.916661142 . The Police described the incident as COMPLETED and classified it as FELONY with complain number: 101109527 |
| At 23:36:00,Thursday, December 2015, a complaint was filled with NYPD police precinct 103 about MURDER & NON-NEGL. MANSLAUGHTER that was committed . The complainant maintained that the incident occurred at a in QUEENS with co-ordinate 40.697338138 and -73.784556739 . The Police described the incident as COMPLETED and classified it as FELONY with complain number: 153401121 |
| At 23:30:00,Thursday, December 2015, a complaint was filled with NYPD police precinct 28 about DANGEROUS DRUGS that was committed . The complainant maintained that the incident occurred at a OTHER in MANHATTAN with co-ordinate 40.802606608 and -73.945051911 . The Police described the incident as COMPLETED and classified it as FELONY with complain number: 569369778 |
| At 23:30:00,Thursday, December 2015, a complaint was filled with NYPD police precinct 105 about ASSAULT 3 & RELATED OFFENSES that was committed . The complainant maintained that the incident occurred at a RESIDENCE-HOUSE in QUEENS with co-ordinate 40.654549444 and -73.726338791 . The Police described the incident as COMPLETED and classified it as MISDEMEANOR with complain number: 968417082 |
| At 23:25:00,Thursday, December 2015, a complaint was filled with NYPD police precinct 13 about ASSAULT 3 & RELATED OFFENSES that was committed . The complainant maintained that the incident occurred at a OTHER in MANHATTAN with co-ordinate 40.7380024 and -73.98789129 . The Police described the incident as COMPLETED and classified it as MISDEMEANOR with complain number: 641637920 |
| At 23:18:00,Thursday, December 2015, a complaint was filled with NYPD police precinct 71 about FELONY ASSAULT that was committed . The complainant maintained that the incident occurred at a DRUG STORE in BROOKLYN with co-ordinate 40.665022689 and -73.957110763 . The Police described the incident as ATTEMPTED and classified it as FELONY with complain number: 365661343 |
precinct_count <- plyr::count(nypd_new, vars = "PolicePrecinct")
precinct <- sqldf::sqldf("SELECT PolicePrecinct, freq as NumberOfCrimesHandled
FROM precinct_count ORDER BY freq DESC LIMIT 10")
precinct_low <- sqldf::sqldf("SELECT PolicePrecinct, freq as NumberOfCrimesHandled
FROM precinct_count ORDER BY freq ASC LIMIT 10")
lat= c(40.67124, 40.84156, 40.82885, 40.81687 , 40.75792, 40.86400 , 40.68564, 40.64174, 40.84865, 40.65247)
long= c(-73.88831, -73.86918, -73.92980, -73.91769,-73.99111 , -73.89342, -73.91079, -74.07568, -73.91168, -73.94426)
lat2 <- c(0,40.77472,40.58751,40.67687,40.58230,40.73599,40.75520,40.50930,40.81077,40.72350)
long2 <- c(0,-73.97468,-73.81171,-74.00806,-74.16894,-73.74580,-73.89195,-74.24090,-73.88334,-73.94862)
precinct <- cbind.data.frame(precinct, lat, long)
precinct_low <- cbind.data.frame(precinct, lat2, long2)
kable(head(precinct, 10))
| PolicePrecinct | NumberOfCrimesHandled | lat | long |
|---|---|---|---|
| 75 | 101480 | 40.67124 | -73.88831 |
| 43 | 78285 | 40.84156 | -73.86918 |
| 44 | 75838 | 40.82885 | -73.92980 |
| 40 | 69758 | 40.81687 | -73.91769 |
| 14 | 68051 | 40.75792 | -73.99111 |
| 52 | 64737 | 40.86400 | -73.89342 |
| 73 | 64724 | 40.68564 | -73.91079 |
| 120 | 60366 | 40.64174 | -74.07568 |
| 46 | 60027 | 40.84865 | -73.91168 |
| 67 | 56866 | 40.65247 | -73.94426 |
Map showing all New York Counties, least and most crime area in New York City.
click on the pop-up to see the county name.
neigbourhood_json <- GET('http://data.beta.nyc//dataset/0ff93d2d-90ba-457c-9f7e-39e47bf2ac5f/resource/35dd04fb-81b3-479b-a074-a27a37888ce7/download/d085e2f8d0b54d4590b1e7d1f35594c1pediacitiesnycneighborhoods.geojson')
neigbourhood_json2 <- GET('http://data.beta.nyc//dataset/0ff93d2d-90ba-457c-9f7e-39e47bf2ac5f/resource/35dd04fb-81b3-479b-a074-a27a37888ce7/download/d085e2f8d0b54d4590b1e7d1f35594c1pediacitiesnycneighborhoods.geojson')
neighborhoods_low <- readOGR(content(neigbourhood_json2,'text'), 'OGRGeoJSON', verbose = F)
precinct_low2 <- precinct_low
coordinates(precinct_low2) <- ~long2 + lat2
proj4string(precinct_low2) <- proj4string(neighborhoods_low)
combining_low <- over(precinct_low2, neighborhoods_low)
precint_low3 <- cbind(precinct_low, combining_low)
precint_low3 <- precint_low3[, -c(3,4, 8)]
map1 <- leaflet(neighborhoods_low) %>%
addTiles() %>%
addPolygons(popup = ~neighborhoods_low, color = "#03F", fill = TRUE,
fillColor = "brown", opacity = 0.5, fillOpacity = 0.2) %>%
addMarkers(~long2, ~lat2, popup = ~neighborhoods_low, data = precint_low3) %>%
addProviderTiles("Stamen.Toner", options = providerTileOptions()) %>%
setView(-73.92, 40.75, zoom = 9.5)
## Map 2
# Neigbourhood with high crime rate
neighborhoods <- readOGR(content(neigbourhood_json,'text'), 'OGRGeoJSON', verbose = F)
precinct2 <- precinct
coordinates(precinct2) <- ~long + lat
proj4string(precinct2) <- proj4string(neighborhoods)
combining <- over(precinct2, neighborhoods)
precint3 <- cbind(precinct, combining)
precint3 <- precint3[, -8]
map2 <- leaflet(neighborhoods) %>%
addTiles() %>%
addPolygons(popup = ~neighborhood, color = "#03F", fill = TRUE,
fillColor = "brown", opacity = 0.5, fillOpacity = 0.2) %>%
addMarkers(~long, ~lat, popup = ~neighborhood, data = precint3) %>%
addProviderTiles("HERE.normalday", options = providerTileOptions()) %>%
setView(-73.92, 40.75, zoom = 10.5)
# New York state counties
#df_maps <- df_incident[1:62, ]
counties_map <- leaflet(lat_long) %>%
addTiles() %>%
addMarkers(lng = lat_long$long, lat = lat_long$lat, popup = lat_long$county, data =lat_long) %>%
addProviderTiles("Esri.WorldImagery", options = providerTileOptions()) %>%
setView(-73.82, 40.75, zoom = 5.9)
counties_map
map1
map2
# Table (nyc high and low)
kable(precint_low3[, 1:6])
| PolicePrecinct | NumberOfCrimesHandled | lat2 | long2 | neighborhood | borough |
|---|---|---|---|---|---|
| 75 | 101480 | 0.00000 | 0.00000 | NA | NA |
| 43 | 78285 | 40.77472 | -73.97468 | Central Park | Manhattan |
| 44 | 75838 | 40.58751 | -73.81171 | Rockaway Beach | Queens |
| 40 | 69758 | 40.67687 | -74.00806 | Red Hook | Brooklyn |
| 14 | 68051 | 40.58230 | -74.16894 | New Springville | Staten Island |
| 52 | 64737 | 40.73599 | -73.74580 | Hollis Hills | Queens |
| 73 | 64724 | 40.75520 | -73.89195 | East Elmhurst | Queens |
| 120 | 60366 | 40.50930 | -74.24090 | Tottenville | Staten Island |
| 46 | 60027 | 40.81077 | -73.88334 | Hunts Point | Bronx |
| 67 | 56866 | 40.72350 | -73.94862 | Greenpoint | Brooklyn |
kable(precint3)
| PolicePrecinct | NumberOfCrimesHandled | lat | long | neighborhood | boroughCode | borough |
|---|---|---|---|---|---|---|
| 75 | 101480 | 40.67124 | -73.88831 | East New York | 4 | Brooklyn |
| 43 | 78285 | 40.84156 | -73.86918 | Van Nest | 2 | Bronx |
| 44 | 75838 | 40.82885 | -73.92980 | Highbridge | 2 | Bronx |
| 40 | 69758 | 40.81687 | -73.91769 | Melrose | 2 | Bronx |
| 14 | 68051 | 40.75792 | -73.99111 | Hell’s Kitchen | 1 | Manhattan |
| 52 | 64737 | 40.86400 | -73.89342 | Fordham | 2 | Bronx |
| 73 | 64724 | 40.68564 | -73.91079 | Bushwick | 4 | Brooklyn |
| 120 | 60366 | 40.64174 | -74.07568 | St. George | 5 | Staten Island |
| 46 | 60027 | 40.84865 | -73.91168 | Morris Heights | 2 | Bronx |
| 67 | 56866 | 40.65247 | -73.94426 | East Flatbush | 3 | Brooklyn |
From analysis and map above, we can deduce that we have 5 out 10 locations with the highest number of crimes complaint in Bronx, 3 in Brooklyn, 1 each in Manhattan and Staten Island, but none in Queens! Does this have anything to do with poverty level?
daysofweek <- plyr::count(nypd_new, vars = "dayofweek")
offenses <- plyr::count(nypd_new, vars = "OffenseDesc")
daysofweek[, -1] <- sapply(daysofweek[, -1], as.numeric)
daysofweek$percentages <- round(daysofweek$freq/sum(daysofweek$freq)*100)
days <- paste(daysofweek$dayofweek, daysofweek$percentages) # add percents to labels
days <- paste(days,"%",sep="")
pie(daysofweek$freq,labels = days, col=rainbow(length(days)),
main="Pie Chart of Crime Complaints In Days Of The Week")
print(paste0("Most crimes are carried out on FRIDAYS with the total being: ", max(daysofweek$freq)))
## [1] "Most crimes are carried out on FRIDAYS with the total being: 462851"
print(paste0("While the least crimes were carried out on SUNDAYS with the total being: ", min(daysofweek$freq)))
## [1] "While the least crimes were carried out on SUNDAYS with the total being: 387899"
theme_set(theme_bw())
incident_months <- plyr::count(nypd_new, vars = "months")
incident_months[, 2] <- sapply(incident_months[, 2],as.numeric)
incident_months2 <- sqldf::sqldf("SELECT months as MonthOfIncident, freq as IncidentFrequency
FROM incident_months ORDER BY freq DESC")
kable(head(incident_months2))
| MonthOfIncident | IncidentFrequency |
|---|---|
| July | 268942 |
| August | 266856 |
| May | 263434 |
| October | 259617 |
| June | 257991 |
| September | 256227 |
ggplot(incident_months2, aes(x=MonthOfIncident, y=IncidentFrequency)) +
geom_point(size=5) +
geom_segment(aes(x=MonthOfIncident,
xend=MonthOfIncident,
y=0,
yend=IncidentFrequency)) +
labs(title="NYPD Crime Complaints Per Month ",
subtitle="FROM 2010 To 2015") +
theme(axis.text.x = element_text(angle=90, vjust=0.9)) + coord_flip()
The Most crimes was carried out in July follow by August and May in Third place , while February has the least numbers of crimes complaints.
incident_time <- plyr::count(nypd_new, vars = "IncidentTime")
incident_time <- sqldf::sqldf("SELECT IncidentTime as TimeOfIncident, freq as IncidentFrequency
FROM incident_time ORDER BY freq DESC")
kable(head(incident_time))
| TimeOfIncident | IncidentFrequency |
|---|---|
| 12:00:00 | 77587 |
| 15:00:00 | 66375 |
| 18:00:00 | 63726 |
| 20:00:00 | 60164 |
| 16:00:00 | 60009 |
| 17:00:00 | 59880 |
Most Crimes are carried out around 12 Noon, 3PM, 6PM and 4PM, while the least crimes are carried out in the early morning toward the morning.
Crime_cat <- plyr::count(nypd_new, vars = "CrimeLevel")
sprintf(" Misdemeanor crimes are '%i' times greater than Felony crimes in New York City", round(Crime_cat[2, 2]/Crime_cat[1, 2]))
## [1] " Misdemeanor crimes are '2' times greater than Felony crimes in New York City"
Building_cat <- plyr::count(nypd_new, vars = "BuildingType")
Building_cat <- sqldf::sqldf("SELECT BuildingType as TypeOfBuilding, freq as IncidentFrequency
FROM Building_cat ORDER BY freq DESC LIMIT 20")
Building_cat[, -1] <- sapply(Building_cat[, -1], as.numeric)
Building_cat$percentages <- round(Building_cat$IncidentFrequency/sum(Building_cat$IncidentFrequency)*100)
days <- paste(Building_cat$TypeOfBuilding, Building_cat$percentages) # add percents to labels
days <- paste(days,"%",sep="") # ad % to labels
pie(Building_cat$IncidentFrequency,labels = days, col= c("purple", "violetred1", "green3","cornsilk", "cyan", "white"),
main="Top 20 Building/Area Where Crimes Were Carried Out", border = FALSE, init.angle = 315)
kable(head(Building_cat))
| TypeOfBuilding | IncidentFrequency | percentages |
|---|---|---|
| STREET | 962514 | 35 |
| RESIDENCE - APT. HOUSE | 634715 | 23 |
| RESIDENCE-HOUSE | 272063 | 10 |
| RESIDENCE - PUBLIC HOUSING | 230529 | 8 |
| OTHER | 81313 | 3 |
| COMMERCIAL BUILDING | 73895 | 3 |
Crimes were most carried out on the street, followed by an apartmental houses, where residential houses came third!
At this time, we will be combining the required datasets into one and write it into PostgreSQL for further scrutiny.
df <- sqldf::sqldf("SELECT c.county, c.population, c.year, a.felonytotal,a.misdemeanortotal,
a.total as totalcrimes,p.policetotal as lawenforcementtotal ,a.othermisd, d.divorce, m.child as mentalchild,
m.adult as mentaladult, (m.child+m.adult) as mentaltotal, u.unemployment ,a.violentfelony, a.drugfelony,
a.dwifelony, a.otherfelony, a.drugmisd, a.dwimisd, a.propertymisd FROM crimes_ny_sub c
LEFT JOIN arrest_sub a ON a.county=c.county AND a.year=c.year
LEFT JOIN div_extraction d ON d.year=c.year AND a.county=d.county
LEFT JOIN df_unemp_sub u ON u.year=a.year AND u.county=d.county
LEFT JOIN mental_sub m ON m.year=a.year AND m.county=a.county
LEFT JOIN police p ON p.year=a.year AND p.county=d.county
ORDER BY c.year, c.county ASC")
df <- sqldf::sqldf("SELECT * from df WHERE county NOT IN ('Out-of-State', 'St. Lawrence') ")
df <- cbind.data.frame(df, povertydata[, 3:5])
A summary statistics of the datasets
summary(df)
## county population year felonytotal
## Length:372 Min. : 4678 Min. :2010 Min. : 11
## Class :character 1st Qu.: 50431 1st Qu.:2011 1st Qu.: 326
## Mode :character Median : 89674 Median :2012 Median : 529
## Mean : 316077 Mean :2012 Mean : 2529
## 3rd Qu.: 233454 3rd Qu.:2014 3rd Qu.: 1512
## Max. :2640252 Max. :2015 Max. :27551
## NA's :6
## misdemeanortotal totalcrimes lawenforcementtotal othermisd
## Min. : 52 Min. : 66 Min. : 0.0 Min. : 13.0
## 1st Qu.: 890 1st Qu.: 1214 1st Qu.: 131.0 1st Qu.: 329.8
## Median : 1382 Median : 1934 Median : 224.0 Median : 510.5
## Mean : 6448 Mean : 8977 Mean : 550.8 Mean : 2167.5
## 3rd Qu.: 4055 3rd Qu.: 5520 3rd Qu.: 471.0 3rd Qu.: 1512.2
## Max. :68603 Max. :95647 Max. :4751.0 Max. :21854.0
## NA's :6 NA's :6 NA's :44 NA's :6
## divorce mentalchild mentaladult mentaltotal
## Min. : 5.0 Min. : 365 Min. : 1082 Min. : 2162
## 1st Qu.: 158.8 1st Qu.: 2215 1st Qu.: 6798 1st Qu.: 9067
## Median : 255.5 Median : 6826 Median : 14301 Median : 21757
## Mean : 931.6 Mean : 30786 Mean : 71983 Mean : 102768
## 3rd Qu.: 667.5 3rd Qu.: 23274 3rd Qu.: 44093 3rd Qu.: 67605
## Max. :15666.0 Max. :290846 Max. :915912 Max. :1159220
## NA's :8 NA's :12 NA's :12 NA's :12
## unemployment violentfelony drugfelony dwifelony
## Min. : 1000 Min. : 1.0 Min. : 0.0 Min. : 1.0
## 1st Qu.: 9275 1st Qu.: 57.0 1st Qu.: 43.0 1st Qu.: 34.0
## Median : 15450 Median : 103.5 Median : 85.5 Median : 56.0
## Mean : 48497 Mean : 698.0 Mean : 483.0 Mean : 96.7
## 3rd Qu.: 37825 3rd Qu.: 370.0 3rd Qu.: 223.0 3rd Qu.:108.0
## Max. :459700 Max. :9507.0 Max. :7679.0 Max. :615.0
## NA's :8 NA's :6 NA's :6 NA's :6
## otherfelony drugmisd dwimisd propertymisd
## Min. : 7.0 Min. : 3.00 Min. : 19.0 Min. : 9
## 1st Qu.: 167.2 1st Qu.: 65.25 1st Qu.: 186.2 1st Qu.: 234
## Median : 279.5 Median : 122.50 Median : 306.5 Median : 480
## Mean : 1251.5 Mean : 1471.95 Mean : 620.5 Mean : 2188
## 3rd Qu.: 824.2 3rd Qu.: 464.00 3rd Qu.: 713.5 3rd Qu.: 1212
## Max. :13590.0 Max. :27784.00 Max. :3943.0 Max. :30197
## NA's :6 NA's :6 NA's :6 NA's :6
## allageproverty ueighteenpoverty fivetoseventenpov
## Min. : 490 Min. : 123 Min. : 80
## 1st Qu.: 7311 1st Qu.: 2140 1st Qu.: 1423
## Median : 12260 Median : 3510 Median : 2343
## Mean : 48550 Mean : 15236 Mean : 10440
## 3rd Qu.: 32620 3rd Qu.: 9326 3rd Qu.: 6185
## Max. :615319 Max. :206071 Max. :138276
##
We saw that some the columns in the summary are not in the right format and have some of NAs and are not in the right/accetable datatype for a perfect analysis. We will therefore convert them back to numerica datatype (a more acceptable datatype for analysis)
df[, -1] <- sapply(df[, -1], as.numeric) #Convert some df column to numeric
A glimpse of the dataset with the NAS
missmap(df, legend = TRUE, col = c("wheat","darkred", col=c('yellow', 'darkgreen')),
main ="Plot Showing The Missing Values Per Observation",
y.cex = 1, x.cex = 0.8,csvar = NULL, tsvar =NULL, rank.order = TRUE)
The above is a graphical representation of all the missing values.The darkgreen area shows the no missing values while the other color indicates the present of missing values in each columns.
A mulitivariate Imputation by Chained Equations (mice package) is said to be the best approach to tackle end-on the menace missing observation may cause. It is well know that if proper care is not taken in dealing with missing observation, the analysis would likely be biased. To avoid this, the mice package in R will use a fully conditional specification were each incomplete variable is imputed by a separate model at which the algoritm will impute mixes of continous, binary, ordered or unordered data based on the data type.
df_missing_val <- mice(df, m=1, method='cart', seed = 500, printFlag=FALSE, where = is.na(df))
## Warning: Number of logged events: 87
df_complete <- complete(df_missing_val, action='long', include =FALSE)
If possible, we will show map of the location, and as googlevis requirement, the longtitude and latitude needed for the map must be in “lat:long” format
df_complete <- merge(df_complete , lat_long, all=T)
df_complete$Lat_Long = paste(df_complete$lat, df_complete$long, sep=":")
df_cleaned <- df_complete[, -c(2,3)]
df_pca <- df_complete[-c(235, 302), c(1, 4,6, 7, 8, 9, 11, 12,13, 15, 23, 24, 25)]
DT::datatable(df_cleaned, options = list(
pageLength = 10
))
Checking out for outliers (an observation point that is abnormal distant from other observations) in an analysis one of the best way to avoid variablity in the measurement or experimantal error.
outliers_check <- function(check) {
check < quantile(check, 0.25, na.rm = TRUE) - 1.5 * IQR(check, na.rm = TRUE) |
check > quantile(check, 0.75, na.rm = TRUE) + 1.5 * IQR(check, na.rm = TRUE)
}
outlier_obs <- df_cleaned %>%
group_by(year, county) %>%
mutate(Flagged = is.na(felonytotal) | outliers_check(felonytotal) | felonytotal <= 0) %>%
ungroup()
#outliers <- subset(outlier_obs, Flagged == TRUE | year==2015, select=c(year, county, population, lawenforcementtotal, felonytotal, divorce))
A boxplot showing observation with oultiers
theme_set(theme_bw())
ct <- ggplot(df_cleaned, aes(county, totalcrimes))
ct + geom_boxplot() +
geom_dotplot(binaxis='y',
stackdir='center',
dotsize = .15,
fill="red") +
theme(axis.text.x = element_text(angle=75, vjust=0.6)) +
labs(title="Box plot + Dot plot of",
subtitle="County Vs Total Crimes",
x="New York State Counties",
y="Total Crimes (Felony & Misdemeanor) ")
df_corr <- df_cleaned[ , -c(1,3, 24, 25, 26)]
corr_coef <- cor(df_corr, method = "kendall", use = "complete")
ggcorrplot(corr_coef, hc.order = TRUE,
type = "lower",
lab = TRUE,
lab_size = 3,
method="circle",
colors = c("tomato2", "white", "springgreen3"),
title="Correlogram of variables",
ggtheme=theme_bw)
The data set would be used in two different way.
Incidence: a measure of the occurrence of a new disease, in a defined population at risk for the disease, in a specified time period.
while
1). Incidence proportion (also known as cumulative incidence) is the number of new cases within a specified time period divided by the size of the population initially at risk.
\(Incidence\quad Rate\quad =\quad \frac { Number\quad of\quad newcases\quad of\quad incidence\quad or\quad disease \quad or\quad injury\quad during\quad specified\quad period }{ Time\quad each\quad person\quad was\quad observed,\quad totaled\quad for\quad all\quad persons }\)
df_incident <- sqldf::sqldf("SELECT
county as County, year as Year, population as Populations,
(((felonytotal + misdemeanortotal)/population)*100000) as TotalCrimes,
((mentalchild/population)*10000) as ChildMental, ((mentaladult/population)*10000) as AdultMental,
((mentaltotal/population)*10000) as TotalMental,
((lawenforcementtotal/population)*100000) as lawenforcementtotal,
((divorce/population)*100000) as DivorcePerPopPerK, ((drugfelony/population)*100000) as DrugFelonyPerPopPerK,
((violentfelony/population)*100000) as ViolentTotal, ((dwifelony/population)*100000) as DWIFelony,
((otherfelony/population)*100000) as OtherFelony, ((felonytotal/population)*100000) as TotalFelony,
((propertymisd/population)*100000) as PropertyMisdemeanor,
((drugmisd/population)*100000) as DrugMisdemeanor, ((dwimisd/population)*100000) as DwiMisdemeanor,
((othermisd/population)*100000) as OtherMisdemeanor,
((misdemeanortotal)/population)*100000
as TotalMisdemeanor,
((unemployment/population)*100000) as UnemploymentTotal,
((allageproverty/population)*100000) as AllPersonInPoverty ,
((ueighteenpoverty/population)*100000) as Under18InPoverty,
((fivetoseventenpov/population)*100000) as FiveTo17InPoverty,lat, long, Lat_long
FROM df_cleaned
WHERE county NOT IN ('Out-of-State', 'St. Lawrence')
ORDER BY Year, county ASC
")
df_incident <- as.data.frame(df_incident)
kable(head(df_incident[, 1:9]))
| County | Year | Populations | TotalCrimes | ChildMental | AdultMental | TotalMental | lawenforcementtotal | DivorcePerPopPerK |
|---|---|---|---|---|---|---|---|---|
| Albany | 2010 | 295267 | 3128.016 | 1638.5847 | 2626.775 | 4265.360 | 198.8031 | 259.4262 |
| Allegany | 2010 | 48531 | 2542.705 | 649.4818 | 2695.185 | 3344.666 | 325.5651 | 327.6256 |
| Bronx | 2010 | 1387983 | 6352.599 | 1780.7855 | 4031.065 | 5811.851 | 193.8064 | 223.5618 |
| Broome | 2010 | 191892 | 2814.083 | 860.3277 | 1962.093 | 2822.421 | 128.7182 | 355.9294 |
| Cattaraugus | 2010 | 76602 | 2545.625 | 1243.8318 | 2883.345 | 4127.177 | 304.1696 | 276.7552 |
| Cayuga | 2010 | 78400 | 2022.959 | 2907.9082 | 2459.184 | 5367.092 | 302.2959 | 270.4082 |
Incident Table Cont’d…“)
kable(head(df_incident[, 10:18]))
| DrugFelonyPerPopPerK | ViolentTotal | DWIFelony | OtherFelony | TotalFelony | PropertyMisdemeanor | DrugMisdemeanor | DwiMisdemeanor | OtherMisdemeanor |
|---|---|---|---|---|---|---|---|---|
| 173.4024 | 207.9474 | 59.60707 | 529.6901 | 970.6469 | 748.1364 | 270.94122 | 389.13932 | 749.1525 |
| 115.3902 | 160.7220 | 90.66370 | 317.3229 | 684.0988 | 473.9239 | 98.90585 | 471.86335 | 813.9128 |
| 553.2489 | 494.5306 | 5.25943 | 484.0117 | 1537.0505 | 1486.1133 | 1727.83096 | 87.68119 | 1513.9234 |
| 113.6056 | 178.2253 | 68.26757 | 390.3237 | 750.4221 | 781.6897 | 171.97173 | 305.38011 | 804.6193 |
| 58.7452 | 161.8757 | 109.65771 | 298.9478 | 629.2264 | 625.3100 | 91.38143 | 399.46738 | 800.2402 |
| 99.4898 | 88.0102 | 47.19388 | 315.0510 | 549.7449 | 521.6837 | 103.31633 | 247.44898 | 600.7653 |
print(paste0(“Incident Table Last”))
kable(head(df_incident[, 19:26]))
| TotalMisdemeanor | UnemploymentTotal | AllPersonInPoverty | Under18InPoverty | FiveTo17InPoverty | lat | long | Lat_Long |
|---|---|---|---|---|---|---|---|
| 2157.369 | 15579.12 | 12917.12 | 3500.899 | 2258.634 | 42.6526 | -73.7562 | 42.6526:-73.7562 |
| 1858.606 | 21841.71 | 15674.52 | 4897.900 | 3294.801 | 42.0901 | -78.4942 | 42.0901:-78.4942 |
| 4815.549 | 19906.58 | 29148.48 | 10956.330 | 7667.457 | 40.8448 | -73.8648 | 40.8448:-73.8648 |
| 2063.661 | 22199.99 | 16375.88 | 4809.476 | 3218.477 | 42.1792 | -75.8534 | 42.1792:-75.8534 |
| 1916.399 | 28067.15 | 15209.79 | 5276.625 | 3469.883 | 42.3292 | -78.8681 | 42.3292:-78.8681 |
| 1473.214 | 23341.84 | 12196.43 | 4053.571 | 2739.796 | 42.9190 | -76.7263 | 42.919:-76.7263 |
theme_set(theme_bw())
poverty_inc <- sqldf::sqldf("SELECT
Year, AllPersonInPoverty, County
FROM df_incident
ORDER BY AllPersonInPoverty DESC LIMIT 30
")
poverty_plot <- ggplot(poverty_inc, aes(x=AllPersonInPoverty, y=reorder(County, +AllPersonInPoverty), fill=as.factor(Year))) +
geom_point(colour="blue", size=5, alpha=.8) +
scale_fill_brewer(palette="Blues", breaks=rev(levels(poverty_inc$Year))) +
labs(title="Chart of Top 30 Poverty In New York State Per County
\n (Incident Rate Per 100K Persons From 2010 To 2015))")
## Divorce Chart
div_inc <- sqldf::sqldf("SELECT
Year, DivorcePerPopPerK, County
FROM df_incident
ORDER BY DivorcePerPopPerK DESC LIMIT 30
")
divorce_plot <- ggplot(div_inc, aes(x=DivorcePerPopPerK, y=reorder(County, +DivorcePerPopPerK), fill=as.factor(Year))) +
geom_point(colour="pink", size=5, alpha=.8) +
scale_fill_brewer(palette="Blues", breaks=rev(levels(div_inc$Year))) +
labs(title="Chart of Top 30 Divorce In New York State Per County
\n (Incident Rate Per 100K Persons From 2010 To 2015))")
## Unemployment Chart
unemp_inc <- sqldf::sqldf("SELECT
Year, UnemploymentTotal, County
FROM df_incident
ORDER BY UnemploymentTotal DESC LIMIT 30
")
unemployment_plot <- ggplot(unemp_inc, aes(x=UnemploymentTotal, y=reorder(County, +UnemploymentTotal), fill=as.factor(Year))) +
geom_point(colour="brown", size=5, alpha=.8) +
scale_fill_brewer(palette="Blues", breaks=rev(levels(unemp_inc$Year))) +
labs(title="Chart of Top 30 Unemployment In New York State Per County
\n (Incident Rate Per 100K Persons From 2010 To 2015)")
## Mental Health Chart
mentalhealth_inc <- sqldf::sqldf("SELECT
Year, TotalMental, County
FROM df_incident
ORDER BY TotalMental DESC LIMIT 30
")
mental_plot <- ggplot(mentalhealth_inc, aes(x=TotalMental, y=reorder(County, +TotalMental), fill=as.factor(Year))) +
geom_point(colour="purple", size=5, alpha=.8, shape = 21) +
scale_fill_brewer(palette="Blues", breaks=rev(levels(mentalhealth_inc$Year))) +
labs(title="Chart of Top 30 Mental Health In New York State Per County
\n (Incident Rate Per 100K Persons From 2010 To 2015)")
poverty_plot #poverty plot
divorce_plot #divorce plot
unemployment_plot #unemployment plot
mental_plot #mental health plot
The above plot depicts some of the variable used in the analysis. As we can see, most of the plot shows NYC! The main reason NYC historical complaint data was used for further investigation as its a representative of the other counties in NY state.
ggplot(df_incident, aes(TotalFelony)) + geom_density(aes(color=as.factor(Year), fill=as.factor(Year)),alpha=0.1) +
labs(x="Total Felony", y="Density", title="Felony Total Per 100k Population Density") +
theme(legend.position = "top")+ theme_economist()
misd_dens_plt <- ggplot(df_incident, aes(TotalMisdemeanor)) + geom_density(aes(color=as.factor(Year),
fill=as.factor(Year)),alpha=0.1) + labs(x="Misdemeanor Total", y="Density",
title="Misdemeanor Total Per 100k Population Density") + theme(legend.position = "top")+
theme_economist()
misd_dens_plt
felony <- dplyr::select(df_incident, TotalFelony, Year, UnemploymentTotal) %>%
mutate(Year = as.factor(Year)) %>%
ggplot(aes(x = Year, y = TotalFelony)) +
geom_jitter(aes(col = Year), alpha = 0.8) +
labs(title = "x") +
theme_light()
misd <- dplyr::select(df_incident, TotalMisdemeanor, Year, UnemploymentTotal) %>%
mutate(Year = as.factor(Year)) %>%
ggplot(aes(x = Year, y = TotalMisdemeanor)) +
geom_jitter(aes(col = Year), alpha = 0.8) +
labs(title = "x") +
theme_light()
p <- plot_grid(felony, misd)
title <- ggdraw() + draw_label("A Plot of Felony and Misdemeanor Incident Per Population 100k", fontface='bold')
plot_grid(title, p, ncol=1, rel_heights = c(0.1, 2.2))
A better view of some important observations in a reactive 3D!
df_incident$size <- df_incident$TotalCrimes
colors <- c('#4AC6B7', '#1972A4', '#965F8A', '#FF7070', '#C61951')
comparison <- plot_ly(df_incident[1:1000, ], x = ~TotalMisdemeanor, y = ~TotalFelony, z = ~lawenforcementtotal,
color = ~Year, size = ~size, colors = colors,
marker = list(symbol = 'circle', sizemode = 'diameter'), sizes = c(1, 50),
text = ~paste('County:', County, '<br>Population Total:', Populations,
'<br>Total Unemployment Pop in 100K:', UnemploymentTotal)) %>%
layout(title = 'Comparison Among Police, Felony and Misdemeanor Crimes In New York State',
scene = list(xaxis = list(title = 'Misdemeanor Pop At Risk Per 1K',
gridcolor = 'rgb(255, 255, 255)',
type = 'log',
zerolinewidth = 1,
ticklen = 5,
gridwidth = 2),
yaxis = list(title = ' Felony Crime Pop At Risk Per 100K',
gridcolor = 'rgb(255, 255, 255)',
zerolinewidth = 1,
ticklen = 5,
gridwith = 2),
zaxis = list(title = 'Police Pop Per 100K',
gridcolor = 'rgb(255, 255, 255)',
type = 'log',
zerolinewidth = 1,
ticklen = 5,
gridwith = 2)),
paper_bgcolor = 'rgb(243, 243, 243)',
plot_bgcolor = 'rgb(243, 243, 243)')
comparison
Regression Analysis & Prediction
qqnorm(df_cleaned$totalcrimes)
qqline(df$totalcrimes, col = 2)
From the above normality check, we can see that the variable under observation is not normally distributed. We will therefore use Generalized Linear Model (conventional linear regression models) to; select best model, estimate parameters and interpretations. The generalized linear models (GLMs) are a broad class of models that include linear regression, ANOVA, Poisson regression, log-linear models etc.
i.e The dependent variable Yi does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal etc.
For Linear Regression, we have
\({ Y }_{ i }\quad =\quad { \beta }_{ 0 }\quad +\quad { \beta }_{ 1 }{ x }_{ i }\quad +\quad { \varepsilon }_{ i }\quad\)
where,
\({ Y }_{ i }\quad =\quad Dependent\quad Variable.\\ { \beta }_{ 0 }\quad =\quad Intercept.\\ { \beta }_{ 1 }\quad =\quad Parameter\quad To\quad Be\quad estimated\quad (slope).\\ { x }_{ i }\quad =\quad Independent\quad Variable.\\ { \varepsilon }_{ i }\quad =\quad Error\quad Term.\\ i\quad \quad =\quad 1,\quad 2,\quad 3,\quad 4,......,\quad n\quad\)
While Multiple linear regression, we have (in matrix notation):
\(\hat { \beta } =\quad ({ { X }^{ T }X) }^{ -1 }{ X }^{ T }Y\)
SETTING UP HYPOTHESIS: For Crimes Total
Null Hypothesis: There is no relationship between Crimes and any other variables.
\(H_o:\) \(\mu_1\) = \(\mu_2\) = \(\mu_3\)…..= \(\mu_n\)
Alternative Hythesis:
\(H_a:\) Not \(H_o:\) \(\mu_1\) \(\neq\) \(\mu_2\) \(\neq\) \(\mu_3\)…..\(\neq\) \(\mu_n\)
Rejection:
Reject \(H_o\) (Null Hypothesis) if the calculated value (P-Value) is less than the tabulated value(Table value = 0.05 ), otherwise do not reject \(H_o\)
NB: We used both forward and backward model selection to select the best model for our analysis.
Dummy Variables
Counties dummy varibale
counties <- (c("Albany", "Allegany", "Bronx", "Broome", "Cattaraugus", "Cayuga",
"Chautauqua", "Chemung", "Chenango", "Clinton", "Columbia", "Cortland",
"Delaware", "Dutchess", "Erie", "Essex", "Franklin","Fulton", "Genesee",
"Greene", "Hamilton", "Herkimer", "Jefferson", "Kings", "Lewis","Livingston",
" Madison", "Monroe", "Montgomery", "Nassau", "New York", "Niagara", "Oneida",
" Onondaga", " Ontario", "Orange", "Orleans", "Oswego", "Otsego", "Putnam", "Queens",
"Rensselaer", "Richmond", "Rockland","Saratoga", "Schenectady", "Schoharie", "Schuyler",
"Seneca", "St Lawrence", "Steuben", "Suffolk", "Sullivan", "Tioga", "Tompkins", "Ulster",
"Warren", "Washington", "Wayne", "Westchester", "Wyoming", "Yates"))
lab <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64)
dummy <- cbind(counties, lab)
kable(head(dummy))
| counties | lab |
|---|---|
| Albany | 1 |
| Allegany | 2 |
| Bronx | 3 |
| Broome | 4 |
| Cattaraugus | 5 |
| Cayuga | 6 |
Years dummy variable
df_cleaned <- df_cleaned[-c(235, 302), -c(4, 5, 8, 12, 14:20,24:26)]
df_cleaned$county <- factor(df_cleaned$county,
levels=counties,
labels= lab)
df_cleaned$year <- factor(df_cleaned$year,
levels = c("2010", "2011", "2012", "2013", "2014", "2015"),
labels = c(1, 2, 3, 4, 5, 6));
#df_cleaned$totalcrimes <- as.factor(df_cleaned$totalcrimes)
set.seed(123)
splits <- sample.split(df_cleaned$county, SplitRatio = 0.8)
train_set <- subset(df_cleaned, splits==TRUE)
test_set <- subset(df_cleaned, splits==FALSE)
#train_set[, 2:19] <- scale(train_set[, 2:19])
#test_set[, 2:19] <- scale(test_set[, 2:19])
regressor <- glm(totalcrimes~., family=gaussian(link = "identity"),data=train_set)
stepwise <- step(regressor, direction = "both")
## Start: AIC=5002.1
## totalcrimes ~ county + population + year + lawenforcementtotal +
## divorce + mentalchild + mentaladult + unemployment + allageproverty +
## ueighteenpoverty + fivetoseventenpov
##
## Df Deviance AIC
## - lawenforcementtotal 1 241942427 5000.3
## - divorce 1 242655053 5001.1
## <none> 241814047 5002.1
## - allageproverty 1 245092297 5004.1
## - ueighteenpoverty 1 248825615 5008.5
## - year 5 257390382 5010.5
## - mentalchild 1 253561895 5014.1
## - unemployment 1 253903448 5014.5
## - fivetoseventenpov 1 254127113 5014.8
## - mentaladult 1 256384468 5017.4
## - population 1 320388196 5083.1
## - county 58 1098802081 5332.7
##
## Step: AIC=5000.26
## totalcrimes ~ county + population + year + divorce + mentalchild +
## mentaladult + unemployment + allageproverty + ueighteenpoverty +
## fivetoseventenpov
##
## Df Deviance AIC
## - divorce 1 242705306 4999.2
## <none> 241942427 5000.3
## + lawenforcementtotal 1 241814047 5002.1
## - allageproverty 1 245277509 5002.3
## - ueighteenpoverty 1 248945811 5006.7
## - year 5 258021041 5009.2
## - mentalchild 1 253617214 5012.2
## - unemployment 1 254031095 5012.6
## - fivetoseventenpov 1 254342989 5013.0
## - mentaladult 1 256596366 5015.6
## - population 1 320413512 5081.1
## - county 58 1107065647 5332.9
##
## Step: AIC=4999.19
## totalcrimes ~ county + population + year + mentalchild + mentaladult +
## unemployment + allageproverty + ueighteenpoverty + fivetoseventenpov
##
## Df Deviance AIC
## <none> 242705306 4999.2
## + divorce 1 241942427 5000.3
## + lawenforcementtotal 1 242655053 5001.1
## - allageproverty 1 248206467 5003.8
## - ueighteenpoverty 1 250067204 5006.0
## - year 5 259171855 5008.6
## - unemployment 1 254031855 5010.6
## - mentalchild 1 254634736 5011.3
## - fivetoseventenpov 1 254805543 5011.5
## - mentaladult 1 256868278 5013.9
## - population 1 327927630 5086.0
## - county 58 2175493407 5530.2
Analysis of Variance (ANOVA) above is used to compare the means of the variance for statistical significance.
anova(stepwise, test="F")
#
fit <- aov(totalcrimes~., data=train_set)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## county 58 1.080e+11 1.863e+09 1710.044 < 2e-16 ***
## population 1 4.378e+08 4.378e+08 401.888 < 2e-16 ***
## year 5 2.327e+07 4.654e+06 4.272 0.000993 ***
## lawenforcementtotal 1 2.568e+06 2.568e+06 2.358 0.126070
## divorce 1 2.233e+06 2.233e+06 2.050 0.153654
## mentalchild 1 1.845e+07 1.845e+07 16.940 5.44e-05 ***
## mentaladult 1 3.540e+07 3.540e+07 32.500 3.77e-08 ***
## unemployment 1 1.471e+07 1.471e+07 13.507 0.000298 ***
## allageproverty 1 6.444e+06 6.444e+06 5.916 0.015793 *
## ueighteenpoverty 1 1.628e+05 1.628e+05 0.149 0.699407
## fivetoseventenpov 1 1.231e+07 1.231e+07 11.304 0.000911 ***
## Residuals 222 2.418e+08 1.089e+06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# HSD.test(fit, 'unemployment ')
residual <- data.frame(Fitted = fitted(stepwise), Residuals = resid(stepwise), Treatment = train_set$totalcrimes)
plot_res <- ggplot(residual, aes(Fitted, Residuals, colour = Treatment)) + geom_point()
ggplotly(plot_res)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
confint(stepwise)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 3.514137e+04 5.257861e+04
## county2 -4.509359e+04 -3.026503e+04
## county3 1.759665e+05 2.561847e+05
## county4 -1.847632e+04 -1.218690e+04
## county5 -4.003689e+04 -2.698067e+04
## county6 -3.933880e+04 -2.645562e+04
## county7 -2.996834e+04 -2.008484e+04
## county8 -3.761863e+04 -2.524088e+04
## county9 -4.475257e+04 -3.004656e+04
## county10 -3.902694e+04 -2.619976e+04
## county11 -4.312707e+04 -2.893036e+04
## county12 -4.447702e+04 -2.975810e+04
## county13 -4.543456e+04 -3.042556e+04
## county14 -6.699662e+03 -3.683733e+03
## county15 6.884225e+04 1.050451e+05
## county16 -4.706535e+04 -3.154038e+04
## county17 -4.421168e+04 -2.960197e+04
## county18 -4.386743e+04 -2.948005e+04
## county19 -4.344640e+04 -2.916084e+04
## county20 -4.478454e+04 -2.993710e+04
## county21 -5.172073e+04 -3.468023e+04
## county22 -4.270715e+04 -2.869432e+04
## county23 -3.289370e+04 -2.202968e+04
## county24 2.831810e+05 4.268649e+05
## county25 -4.877178e+04 -3.274365e+04
## county26 -4.276384e+04 -2.872511e+04
## county28 4.858833e+04 7.553175e+04
## county29 -4.467252e+04 -3.006745e+04
## county30 9.342428e+04 1.472886e+05
## county31 1.852829e+05 2.661927e+05
## county32 -1.728764e+04 -1.161247e+04
## county33 -1.381400e+04 -9.091988e+03
## county36 4.654261e+03 1.032043e+04
## county37 -4.618999e+04 -3.096345e+04
## county38 -3.264768e+04 -2.201878e+04
## county39 -4.273305e+04 -2.860299e+04
## county40 -3.802003e+04 -2.554821e+04
## county41 2.211987e+05 3.367049e+05
## county42 -2.767832e+04 -1.873803e+04
## county43 1.653561e+04 2.676194e+04
## county44 -6.870157e+03 -2.116914e+03
## county45 -1.868452e+04 -1.235097e+04
## county46 -2.706671e+04 -1.781711e+04
## county47 -4.634840e+04 -3.126450e+04
## county48 -4.977473e+04 -3.336613e+04
## county49 -4.742532e+04 -3.170056e+04
## county52 -3.630333e+04 -2.444980e+04
## county53 -3.631639e+04 -2.459065e+04
## county54 1.126232e+05 1.757783e+05
## county55 -3.973076e+04 -2.648663e+04
## county56 -4.529073e+04 -3.044249e+04
## county57 -3.595419e+04 -2.424635e+04
## county58 -2.260695e+04 -1.485317e+04
## county59 -4.202072e+04 -2.810953e+04
## county60 -4.285657e+04 -2.874644e+04
## county61 -3.851339e+04 -2.601193e+04
## county62 6.553923e+04 1.023905e+05
## county63 -4.659207e+04 -3.131903e+04
## county64 -4.876006e+04 -3.267136e+04
## population -1.316622e-01 -8.400117e-02
## year2 -2.078425e+01 8.704594e+02
## year3 -2.181291e+01 8.911599e+02
## year4 1.632421e+02 1.072794e+03
## year5 1.180490e+02 1.037611e+03
## year6 -5.025215e+02 4.932371e+02
## mentalchild -6.769890e-02 -1.742043e-02
## mentaladult 8.618134e-03 2.902459e-02
## unemployment 1.103547e-02 4.501041e-02
## allageproverty -1.598438e-01 -1.112728e-02
## ueighteenpoverty 8.897740e-02 6.283368e-01
## fivetoseventenpov -7.998222e-01 -2.084640e-01
pred <- predict(regressor, test_set, type="response")
actual_pred <- cbind.data.frame(test_set$totalcrimes, pred)
actual_pred <- actual_pred[1:20, ]
kable(actual_pred)
| test_set$totalcrimes | pred | |
|---|---|---|
| 5 | 8743 | 8710.9305 |
| 11 | 820 | 695.6136 |
| 16 | 78402 | 79652.0196 |
| 24 | 5400 | 6524.1559 |
| 26 | 2082 | 1739.9127 |
| 31 | 1664 | 1861.8931 |
| 37 | 3532 | 3953.7500 |
| 48 | 2210 | 2215.0483 |
| 50 | 817 | 785.2716 |
| 59 | 2006 | 1314.9362 |
| 65 | 1262 | 1246.1255 |
| 68 | 1527 | 1436.8891 |
| 73 | 926 | 619.6817 |
| 84 | 5997 | 7320.9345 |
| 87 | 27531 | 27140.7780 |
| 94 | 891 | 496.9765 |
| 97 | 1329 | 1501.8446 |
| 104 | 1284 | 1081.2709 |
| 114 | 1443 | 1427.4388 |
| 118 | 1469 | 1358.9968 |
MAPE The mean absolute percentage error (MAPE), also known as mean absolute percentage deviation (MAPD), is a measure of prediction accuracy of a forecasting method in statistics. Represented mathematically as:
\(M=\quad \frac { 100 }{ n } \sum _{ t=1 }^{ n }{ \left| \frac { { A }_{ t }\quad -\quad { F }_{ t } }{ { A }_{ t } } \right| }\) ,
Where,
\({ A }_{ t }\quad =\quad Actual\quad Value\\ \quad \quad \quad \quad \quad \quad { F }_{ t }\quad =\quad Predicted\quad Value.\)
min_max_accuracy <- mean(apply(actual_pred, 1, min) / apply(actual_pred, 1, max))*100
print(paste('The Percentage Accuracy Is: ', min_max_accuracy))
## [1] "The Percentage Accuracy Is: 87.4774508797211"
mape <- mean(abs((actual_pred$pred - actual_pred$`test_set$totalcrimes`))/actual_pred$pred)*100
print(paste('The MAPE Is: ', mape ))
## [1] "The MAPE Is: 16.5027632643134"
To obtain the best model, the best way is to adopt a dimensionality reduction. The PCA, like factor analysis, is a mathematical procedure that transforms a number of correlated variables into a smaller number of uncorrelated variables.
prin_comp <- prcomp(train_set[, -c(1,3)], scale. = T)
biplot(prin_comp, scale = 0)
std_dev <- prin_comp$sdev
pr_var <- std_dev^2
pca_var <- pr_var/sum(pr_var)
plot(pca_var, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
type = "b")
plot(cumsum(pca_var), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
type = "b")
pca2 <- df_pca[, -1]
pca1 <- prcomp(pca2, scale. = TRUE)
cir <- function(center = c(0, 0), npoints = 100) {
r = 1
t = seq(0, 2 * pi, length = npoints)
x = center[1] + r * cos(t)
y = center[1] + r * sin(t)
return(data.frame(x = x, y = y))
}
corcir = cir(c(0, 0), npoints = 100)
correlations = as.data.frame(cor(pca2, pca1$x))
arrows = data.frame(x1 = c(0, 0, 0, 0), y1 = c(0, 0, 0, 0), x2 = correlations$PC1,
y2 = correlations$PC2)
ggplot() + geom_path(data = corcir, aes(x = x, y = y), colour = "black") +
geom_segment(data = arrows, aes(x = x1, y = y1, xend = x2, yend = y2), colour = "blue") +
geom_text(data = correlations, aes(x = PC1, y = PC2, label = rownames(correlations))) +
geom_hline(yintercept = 0, colour = "brown") + geom_vline(xintercept = 0,
colour = "brown") + xlim(-1.1, 1.1) + ylim(-1.1, 1.1) + labs(x = "pc1 aixs",
y = "pc2 axis") + ggtitle("Circle of correlations")
#train set
train_pca <- data.frame(TotalCrimes = train_set$totalcrimes, prin_comp$x)
train_pca <- train_pca[,1:8]
pca_model <- rpart(TotalCrimes ~ .,data = train_pca, method = "anova")
#test set
test_pca <- predict(prin_comp, newdata = test_set)
test_pca <- as.data.frame(test_pca)
test_pca <- test_pca[,1:8]
#prediction
prediction <- predict(pca_model, test_pca)
INTERPRETATIONS:
Generalized Linear Model: The analysis revealed that variables like: years, mentalchild, mentaladult, unemployment, allageproverty, ueighteenpoverty, fivetoseventenpov and population have P-values less than 0.05, which means they are statistically significant.
Residual Plot: Unpattern nature of the residuals conform to the accuracy of model.
Analysis of Variance: Supports the GLM model but struck out the variable “ueighteenpoverty”.
Principal Component Analysis: After the dimensionality reduction, PCA also confirms the best model.
DECISION: Since the calculated P-values are less than the tabulated values (\(??\) = 0.05), we therefore reject the alternative hypothesis, H_o.
CONCLUSION: We therefore conclude that there is a positive relationship between crimes and factors like Economic and Psychological as confirmed by variables like: Unemployment, Mental Health, Poverty, County (or location) and Year. It also means; as the factors increases, so also is crime rate and vice-versa. We can therefore confirm that not only increase and decrease in population affect the crime rate, but other factors like economic and psychological also contribute significantly to crime rate in a geographical area. While security and social factors may not contribute significantly to crime rate, but a further analysis may prove otherwise.
Bibliography
The Origins of attachment theory- John Bowlby and Mary Ainsworth
http://www.sakkyndig.com/psykologi/artikler/Bowlbyatt.pdf
ftp://statgen.ncsu.edu/pub/thorne/molevoclass/AtchleyOct19.pdf
https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Current-YTD/5uac-w243
http://www.kevjohnson.org/making-maps-in-r/
http://leaflet-extras.github.io/leaflet-providers/preview/
http://nyscommunityaction.org/wp-content/uploads/2018/04/Comparitive-Look-@-County-Profiles.pdf
http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html
https://data.ny.gov/en/browse?q=nypd
http://www1.nyc.gov/site/nypd/stats/crime-statistics/crime-statistics-landing.page
https://www.health.ny.gov/statistics/vital_statistics/2006/table50.htm
http://www.postgresqltutorial.com/postgresql-alter-table/
https://www.statmethods.net/graphs/scatterplot.html
http://nyscommunityaction.org/wp-content/uploads/2018/04/Comparitive-Look-@-County-Profiles.pdf