Crime affects us all. It affects our friends, our families and our co-workers. More importantly, it affects how we grow as a community. The stronger the community, the more likely it will be able to help its members grow.
There are many types of crimes and have been recorded in many different common areas. These areas include our neighborhood streets, our residence, apartment builds, libraries, schools etc. And a few examples of the types of crimes in these areas are: Robbery, Sexual Assault, Narcotics and many more.
In Chicago, we have record counts in all of these categories. These categories all seem to have a correlation to our factors of growth. Factors of growth such as our community graduation or more explicitly the school drop out rate.
This research project tells the story of how crime and the growing rate of school drop outs are related. Even though the data analyzed for crime analysis only included the years 2012 and 2013, it shows how growth in crime correlates to the growth in drop out rates in those areas where crime is high. More specifically in areas that are less fortunate in community growth. After completing this research and further analyzed the visualizations, I found there are a few areas that are resistant to this crime to drop out rate relationship. However, this is due to its community members having more resources than those with less. I can confirm due to being a native of Chicago and know the areas very well.
Chicago crime growth rates affect school drop out rates and as the drop out rate increase the crime rate also increases.
This analysis project was able to identify a direct relationship between crime rates and drop out rates as well as the inverse that show as drop out rates grow - crime grows in specific areas of Chicago. It was able to also identify the segemented areas where crimes would be more likely to occur on our neighbor streets by Police District.
This research uses data collected with no intended reporting bias. As is intended to help highlight areas of growth opportunity at the community level. The results of this analysis may cause disparate impact to the residents of areas where crime counts are high if used by a party who’s intension is not fair.
The limitation of this analysis is due to the limited amount of drop out rate data and is the reason why it only uses data from the year 2012 and 2013. If someone were to have more years of dropout rate by school it may help show more upto date growth in drop out rates.
Below is my complete project used to complete this analysis.
# install.packages("ggplot2")
# install.packages("data.table")
# install.packages("ggrepel")
# install.packages("dplyr")
# install.packages("ggmap")
# install.packages("rgeos")
# install.packages("sqldf")
library(ggplot2)
library(ggmap)
library(dplyr)
library(data.table)
library(ggrepel)
library(sqldf)
library(sp)
library(rgeos)
library(reshape2)
# https://www.kaggle.com/chicago/chicago-police-stations
PoliceStations <- read.csv("police-stations.csv")
# https://www.kaggle.com/chicago/chicago-public-schools-data
PSchools <- read.csv("chicago-public-schools-high-school-progress-report-2013-2014.csv")
# https://www.kaggle.com/chicago/chicago-crime
Crime <- read.csv("bq-results-20190722-210325-h0fceiz30y1l.csv")
# sqldf("select * from PSchools LIMIT 10")
# sqldf("select * from PoliceStations LIMIT 10")
# sqldf("select * from Crime where date > '2011-12-31' and date < '2014-01-01' LIMIT 10")
# Create Dataframes of the subset of variables needed
schoollat <- PSchools$Latitude
schoollong <- PSchools$Longitude
schoolname <- PSchools$Name.of.School
schoolID <- PSchools$School.ID
# schooldf <- as.data.frame(cbind(schoolname,schoollong,schoollat,schoolID,schooldropoutrate2013,schooldropoutrate2012))
schooldf <- as.data.frame(cbind(schoolname,schoollong,schoollat,schoolID))
policelat <- PoliceStations$LATITUDE
policelong <- PoliceStations$LONGITUDE
policename <- PoliceStations$DISTRICT.NAME
policedistrict <- PoliceStations$DISTRICT
policedf <- as.data.frame(cbind(policename,policelat,policelong,policedistrict))
coordinates(schooldf) <- c("schoollong", "schoollat")
coordinates(policedf) <- c("policelong", "policelat")
validschooldf <- na.omit(PSchools$Name.of.School)
# place holder
closestSiteVecpolicedf <- vector(mode = "numeric",length = nrow(policedf))
minDistVecpolicedf <- vector(mode = "numeric",length = nrow(policedf))
# initialize coordinates
schooldf[1,]$coordinates
schooldf[1,]$schoolname
# Get the closest policestation to each school.
ClosestDistance_df <- data.frame(matrix(ncol = 3, nrow = 0))
# Loop through each school and each police station to find the closest police station.
for (i in 1 : nrow(schooldf))
{
tdf <- data.frame(matrix(ncol = 3, nrow = 0))
x <- c( "schoolname", "policename", "distance")
colnames(tdf) <- x
for (a in 1 : nrow(policedf))
{
newRow <- data.frame(schoolname= schooldf[i,]$schoolname, policename=policedf[a,]$policename, distance= spDistsN1(policedf[a,],schooldf[i,],longlat = TRUE))
tdf <- rbind(tdf, newRow)
}
#Add the closest row to dataframe
ClosestDistance_df <- rbind(ClosestDistance_df, tdf[which.min(tdf$distance),])
}
# Prepare the merge
ClosestDistance_df <- merge(ClosestDistance_df, policedf, by= "policename")
PoliceStations$policedistrict <- PoliceStations$DISTRICT
ClosestDistance_df <- merge(ClosestDistance_df, PoliceStations, by= "policedistrict")
ClosestDistance_df <- merge(ClosestDistance_df, schooldf, by= "schoolname")
PSchools$schoolID <- PSchools$School.ID
ClosestDistance_df <- merge(ClosestDistance_df, PSchools, by= "schoolID")
# Get Types of Crime counts by district for 2012 and 2013
CrimeAgg2012 <- sqldf("SELECT iucr, primary_type||'_2012' as primary_type, district as DISTRICT, count(*) cnt FROM Crime where date > '2011-12-31' and date < '2013-01-01' group by iucr, primary_type, district ")
CrimeAgg2013 <- sqldf("SELECT iucr, primary_type||'_2013' as primary_type, district as DISTRICT, count(*) cnt FROM Crime where date > '2012-12-31' and date < '2014-01-01' group by iucr, primary_type, district ")
# Transpose/pivot to columns
PivotCrime2012 = dcast(CrimeAgg2012, DISTRICT ~ primary_type, value.var = "cnt", fun.aggregate=sum)
PivotCrime2013 = dcast(CrimeAgg2013, DISTRICT ~ primary_type, value.var = "cnt", fun.aggregate=sum)
# Get total crime counts by district for 2012 and 2013
CrimeTot2012 <- sqldf("SELECT DISTRICT as DISTRICT, count(*) TotalCrimeCount_2012 FROM Crime where date > '2011-12-31' and date < '2013-01-01' group by district ")
CrimeTot2013 <- sqldf("SELECT DISTRICT as DISTRICT, count(*) TotalCrimeCount_2013 FROM Crime where date > '2012-12-31' and date < '2014-01-01' group by district ")
# Merge to the master analytics table
ClosestDistance_df <- merge(ClosestDistance_df, CrimeTot2012, by= "DISTRICT")
ClosestDistance_df <- merge(ClosestDistance_df, CrimeTot2013, by= "DISTRICT")
ClosestDistance_df <- merge(ClosestDistance_df, PivotCrime2012, by= "DISTRICT")
ClosestDistance_df <- merge(ClosestDistance_df, PivotCrime2013, by= "DISTRICT")
# Remove Duplicate Fields after Merge
MasterAnalyticsTable <- within(ClosestDistance_df, rm("City"))
MasterAnalyticsTable <- within(MasterAnalyticsTable, rm("State"))
MasterAnalyticsTable <- within(MasterAnalyticsTable, rm("WEBSITE"))
MasterAnalyticsTable <- within(MasterAnalyticsTable, rm("X.Coordinate"))
MasterAnalyticsTable <- within(MasterAnalyticsTable, rm("Y.Coordinate"))
MasterAnalyticsTable <- within(MasterAnalyticsTable, rm("Latitude"))
MasterAnalyticsTable <- within(MasterAnalyticsTable, rm("Longitude"))
MasterAnalyticsTable <- within(MasterAnalyticsTable, rm("Location"))
MasterAnalyticsTableOut <- sqldf("select distinct DISTRICT, `DISTRICT.NAME`, policelat, policelong, distance, schoollat, schoollong, `Name.of.School`,`One.Year.DropOut.Rate.Percentage.2013` as DropOutRate_2013,`One.Year.DropOut.Rate.Percentage.2012` as DropOutRate_2012, `School.ID`, `Street.Address`, `Student.Response.Rate`, `Teacher.Response.Rate`, Safe, TotalCrimeCount_2012, TotalCrimeCount_2013, ARSON_2012, ASSAULT_2012, BATTERY_2012, BURGLARY_2012, `CRIM SEXUAL ASSAULT_2012`,`CRIMINAL DAMAGE_2012`,`CRIMINAL TRESPASS_2012`,`DECEPTIVE PRACTICE_2012`,GAMBLING_2012,HOMICIDE_2012,`INTERFERENCE WITH PUBLIC OFFICER_2012`,INTIMIDATION_2012,KIDNAPPING_2012,`LIQUOR LAW VIOLATION_2012`
,`MOTOR VEHICLE THEFT_2012`,NARCOTICS_2012,`NON-CRIMINAL (SUBJECT SPECIFIED)_2012`,OBSCENITY_2012,`OFFENSE INVOLVING CHILDREN_2012`,`OTHER OFFENSE_2012`,PROSTITUTION_2012,`PUBLIC INDECENCY_2012`,`PUBLIC PEACE VIOLATION_2012`
,ROBBERY_2012,`SEX OFFENSE_2012`,STALKING_2012,THEFT_2012,`WEAPONS VIOLATION_2012`,ARSON_2013,ASSAULT_2013,BATTERY_2013,BURGLARY_2013,`CRIM SEXUAL ASSAULT_2013`,`CRIMINAL DAMAGE_2013`,`CRIMINAL TRESPASS_2013`
,`DECEPTIVE PRACTICE_2013`,GAMBLING_2013,HOMICIDE_2013,`INTERFERENCE WITH PUBLIC OFFICER_2013`,INTIMIDATION_2013,KIDNAPPING_2013,`LIQUOR LAW VIOLATION_2013`,`MOTOR VEHICLE THEFT_2013`,NARCOTICS_2013,`NON - CRIMINAL_2013`
,OBSCENITY_2013,`OFFENSE INVOLVING CHILDREN_2013`,`OTHER OFFENSE_2013`,PROSTITUTION_2013,`OFFENSE INVOLVING CHILDREN_2013`,`OTHER OFFENSE_2013`,PROSTITUTION_2013,`PUBLIC INDECENCY_2013`,`PUBLIC PEACE VIOLATION_2013`,ROBBERY_2013
,`SEX OFFENSE_2013`,STALKING_2013,THEFT_2013,`WEAPONS VIOLATION_2013`
from MasterAnalyticsTable " )
str(MasterAnalyticsTableOut)
# summary(MasterAnalyticsTable)
sqldf("select * from MasterAnalyticsTableOut LIMIT 10" )
# Replace All N/A values with 0
MasterAnalyticsTableOut[is.na(MasterAnalyticsTableOut)] = 0
# Rename fields
#MasterAnalyticsTableOut %>%
# rename(
# One.Year.DropOut.Rate.Percentage.2012 = DropOut.Rate.2012,
# TotalCrimeCount_2012 = Total.Crime.2012,
# One.Year.DropOut.Rate.Percentage.2013 = DropOut.Rate.2013,
# TotalCrimeCount_2013 = Total.Crime.2013
# )
MasterAnalyticsTableOut
# Break down of where crimes are happening in 2012 vs 2013
# unique(Crime$location_description)
CrimeLocTypes_2012 <- sqldf("SELECT location_description, count(*) Location_Cnt FROM Crime where date > '2011-12-31' and date < '2013-01-01' group by location_description order by count(*) desc LIMIT 20")
#CrimeLocTypes_2012
ggplot(aes(x = reorder(location_description, Location_Cnt), y = Location_Cnt), data = CrimeLocTypes_2012) +
geom_bar(stat = 'identity', width = 0.5) +
geom_text(aes(label = Location_Cnt), stat = 'identity', data = CrimeLocTypes_2012, hjust = -0.1, size = 3.5) +
coord_flip() +
xlab('Major Crime Locations') +
ylab('Number of Occurrences') +
ggtitle('Top 20 Crime Area Types 2012') +
theme_bw() +
theme(plot.title = element_text(size = 16),
axis.title = element_text(size = 12, face = "bold"))
CrimeLocTypes_2013 <- sqldf("SELECT location_description, count(*) Location_Cnt FROM Crime where date > '2012-12-31' and date < '2014-01-01' group by location_description order by count(*) desc LIMIT 20")
#CrimeLocTypes_2013
ggplot(aes(x = reorder(location_description, Location_Cnt), y = Location_Cnt), data = CrimeLocTypes_2013) +
geom_bar(stat = 'identity', width = 0.5) +
geom_text(aes(label = Location_Cnt), stat = 'identity', data = CrimeLocTypes_2013, hjust = -0.1, size = 3.5) +
coord_flip() +
xlab('Major Crime Locations') +
ylab('Number of Occurrences') +
ggtitle('Top 20 Crime Area Types 2013') +
theme_bw() +
theme(plot.title = element_text(size = 16),
axis.title = element_text(size = 12, face = "bold"))
ggmap::register_google(key = "AIzaSyBCTcgVKReBhR_2cIxYo6WBDodBvUP8tu0")
google_key()
# Break down of where crimes are happening in 2012 vs 2013
# unique(Crime$location_description)
CrimeTypes_2012 <- sqldf("SELECT a.DISTRICT, location_description||' '||primary_type as LocType, b.LATITUDE, b.LONGITUDE, count(*) Location_Cnt FROM Crime a join PoliceStations b on a.DISTRICT = b.DISTRICT where date > '2011-12-31' and date < '2013-01-01' group by a.DISTRICT, LocType, b.LATITUDE, b.LONGITUDE order by count(*) desc ")
DropOutRatePerDistrict <- sqldf("SELECT cast(DISTRICT as integer) district, policelat, policelong, sum(DropOutRate_2012) DropOuts FROM MasterAnalyticsTableOut group by cast(DISTRICT as integer), policelat, policelong order by cast(DISTRICT as integer) asc ")
DistrictStreetCrime_2012 <- sqldf("SELECT district, count(*) Location_Cnt FROM Crime where (date > '2011-12-31' and date < '2013-01-01') and location_description like '%STREET%' and district not in (31) group by district ")
#theme_set(theme_bw(16))
#ChicagoMap <- qmap("chicago", zoom = 14, color = "bw", legend = "topleft")
ChicagoMap <- qmap("chicago", zoom = 11, color = "color", size = .05) + ggtitle("District Locations")
ChicagoMap +
geom_point(aes(x = LONGITUDE, y = LATITUDE, colour = DISTRICT), data = PoliceStations)
ggplot(CrimeTypes_2012, aes(x = district, y = Location_Cnt, colours=LocType )) + ggtitle("Crimes per District 2012")+ geom_col()
ggplot(DistrictStreetCrime_2012, aes(x = district, y = Location_Cnt, colours=district )) + ggtitle("StreetCrimes per District 2012")+ geom_col()
ggplot(DropOutRatePerDistrict, aes(x = district, y = DropOuts, colours=district )) + ggtitle("DropOut per District 2012")+ geom_col()
CrimeTypes_2013 <- sqldf("SELECT a.DISTRICT, location_description||' '||primary_type as LocType, b.LATITUDE, b.LONGITUDE, count(*) Location_Cnt FROM Crime a join PoliceStations b on a.DISTRICT = b.DISTRICT where date > '2012-12-31' and date < '2014-01-01' group by a.DISTRICT, LocType, b.LATITUDE, b.LONGITUDE order by count(*) desc ")
DropOutRatePerDistrict2013 <- sqldf("SELECT cast(DISTRICT as integer) district, policelat, policelong, sum(DropOutRate_2013) DropOuts FROM MasterAnalyticsTableOut group by cast(DISTRICT as integer), policelat, policelong order by cast(DISTRICT as integer) asc ")
DistrictStreetCrime_2013 <- sqldf("SELECT district, count(*) Location_Cnt FROM Crime where (date > '2012-12-31' and date < '2014-01-01') and location_description like '%STREET%' and district not in (31) group by district ")
ggplot(CrimeTypes_2013, aes(x = district, y = Location_Cnt, colours=LocType )) + ggtitle("Crimes per District 2013")+ geom_col()
ggplot(DistrictStreetCrime_2013, aes(x = district, y = Location_Cnt, colours=district )) + ggtitle("StreetCrimes per District 2013")+ geom_col()
ggplot(DropOutRatePerDistrict, aes(x = district, y = DropOuts, colours=district )) + ggtitle("DropOut per District 2013")+ geom_col()
# Review relationshiop between the DropOutRate and the following variables: TotalCrimeCount_2013 + `CRIMINAL DAMAGE_2013` + BURGLARY_2013 + `MOTOR VEHICLE THEFT_2013` + `NARCOTICS_2013` + THEFT_2013 + ROBBERY_2013 + ASSAULT_2013 + `OTHER OFFENSE_2013`
#install.packages("lm.beta")
library(lm.beta)
CrimeAnalysis <- lm( `DropOutRate_2013` ~ TotalCrimeCount_2013 + `CRIMINAL DAMAGE_2013` + BURGLARY_2013 + `MOTOR VEHICLE THEFT_2013` + `NARCOTICS_2013` + THEFT_2013 + ROBBERY_2013 + ASSAULT_2013 + `OTHER OFFENSE_2013`, data = MasterAnalyticsTableOut)
summary(CrimeAnalysis)
coef_lmbeta1 <- lm.beta(CrimeAnalysis)
coef_lmbeta1
par(mfrow=c(2,2))
plot(CrimeAnalysis)
# The affect of crime on drop out rate
library(ggplot2)
library(ggmap)
library(lubridate)
library(dplyr)
library(data.table)
library(ggrepel)
library(tidyverse)
#theme_set(theme_bw(16))
#ChicagoMap <- qmap("chicago", zoom = 14, color = "bw", legend = "topleft")
ChicagoMap <- qmap("chicago", zoom = 11, color = "color") + ggtitle("Dropouts vs Crime Counts 2012") # zoom="auto"
ChicagoMap +
geom_point(aes(x = schoollong, y = schoollat, colour = DropOutRate_2012, size = TotalCrimeCount_2012),
data = MasterAnalyticsTableOut)
#theme_set(theme_bw(16))
#ChicagoMap <- qmap("chicago", zoom = 14, color = "bw", legend = "topleft")
ChicagoMap <- qmap("chicago", zoom = 11, color = "color") + ggtitle("Dropouts vs Crime Counts 2013")
ChicagoMap +
geom_point(aes(x = schoollong, y = schoollat, colour = DropOutRate_2013, size = TotalCrimeCount_2013),
data = MasterAnalyticsTableOut)
# CRIME HEAT MAP
chicago = c(lon = -87.6298, lat = 41.8781)
Chicago.map = get_map(location = chicago, zoom = 11, color = "color")
CrimeAll2012 <- sqldf("SELECT * FROM Crime a join PoliceStations b on a.DISTRICT = b.DISTRICT where date > '2011-12-31' and date < '2013-01-01'")
CrimeAll2013 <- sqldf("SELECT * FROM Crime a join PoliceStations b on a.DISTRICT = b.DISTRICT where date > '2012-12-31' and date < '2014-01-01'")
ggmap(Chicago.map, extent = "normal", maprange=TRUE) %+% CrimeAll2012 + aes(x = longitude, y = latitude, zoom = 12) + ggtitle("Crime Heat Map 2012")+
geom_density2d() +
stat_density2d(aes(fill = ..level.., alpha = ..level..),
size = 0.01, bins = 16, geom = 'polygon') +
scale_fill_gradient(low = "green", high = "red") +
scale_alpha(range = c(0.00, 0.25), guide = FALSE) +
coord_map(projection="mercator",
xlim=c(attr(Chicago.map, "bb")$longitude, attr(Chicago.map, "bb")$longitude),
ylim=c(attr(Chicago.map, "bb")$latitude, attr(Chicago.map, "bb")$latitude)) +
theme(legend.position = "none", axis.title = element_blank(), text = element_text(size = 12))
ggmap(Chicago.map, extent = "normal", maprange=TRUE) %+% CrimeAll2013 + aes(x = longitude, y = latitude, zoom = 12) + ggtitle("Crime Heat Map 2013") +
geom_density2d() +
stat_density2d(aes(fill = ..level.., alpha = ..level..),
size = 0.01, bins = 16, geom = 'polygon') +
scale_fill_gradient(low = "green", high = "red") +
scale_alpha(range = c(0.00, 0.25), guide = FALSE) +
coord_map(projection="mercator",
xlim=c(attr(Chicago.map, "bb")$longitude, attr(Chicago.map, "bb")$longitude),
ylim=c(attr(Chicago.map, "bb")$latitude, attr(Chicago.map, "bb")$latitude)) +
theme(legend.position = "none", axis.title = element_blank(), text = element_text(size = 12))
# KMEANS CLUSTER OF CRIME
chicago = c(lon = -87.6298, lat = 41.8781)
Chicago.map = get_map(location = chicago, zoom = 11, color = "color")
# PoliceStations
districts <- sqldf("select `DISTRICT.NAME`, DISTRICT, LATITUDE lat, LONGITUDE lon from PoliceStations ")
schooldistrict <- sqldf("select `Name.of.School`, DISTRICT, `DISTRICT.NAME`, schoollat lat, schoollong lon, distance from MasterAnalyticsTableOut ")
schooldistrict1 <- sqldf("select schoollat lat, schoollong lon from MasterAnalyticsTableOut ")
#install.packages("leaflet")
library(leaflet)
leaflet() %>% #initialize object
addProviderTiles("CartoDB.Positron") %>% #http://leaflet-extras.github.io/leaflet-providers/preview/index.html
addCircleMarkers(data = districts, lng = ~ lon, lat = ~ lat, radius = 3,color = "green", popup = ~`DISTRICT.NAME`) %>%
addCircleMarkers(data = schooldistrict, lng = ~ lon, lat = ~ lat, radius = 3,
color = "blue")
set.seed(23)
CrimeAll2012 <- sqldf("SELECT * FROM Crime a join PoliceStations b on a.DISTRICT = b.DISTRICT where date > '2011-12-31' and date < '2013-01-01'")
CrimeAll2013 <- sqldf("SELECT * FROM Crime a join PoliceStations b on a.DISTRICT = b.DISTRICT where date > '2012-12-31' and date < '2014-01-01'")
CrimeAll2012L <- sqldf("SELECT a.district ,a.latitude as lat, a.longitude lon FROM Crime a join PoliceStations b on a.DISTRICT = b.DISTRICT where date > '2011-12-31' and date < '2013-01-01'")
CrimeAll2012L <- na.omit(CrimeAll2012L)
CrimeAll2013L <- sqldf("SELECT a.district ,a.latitude as lat, a.longitude lon FROM Crime a join PoliceStations b on a.DISTRICT = b.DISTRICT where date > '2012-12-31' and date < '2014-01-01'")
CrimeAll2013L <- na.omit(CrimeAll2013L)
clusters <- kmeans(na.omit(CrimeAll2012L[,2:3]), 23) # Currently 23 Police Districts
CrimeAll2012L$Borough <- as.factor(clusters$cluster)
ggmap(Chicago.map) + geom_point(aes(x = lon[], y = lat[], colour = as.factor(Borough)),data = CrimeAll2012L) +
geom_label(data = districts, aes(x=lon, y=lat, label = `DISTRICT.NAME`), size = 2, fontface = "bold") +
ggtitle("Chicago Boroughs using KMean 2012")
clusters <- kmeans(na.omit(CrimeAll2013L[,2:3]), 23) # Currently 23 Police Districts
CrimeAll2013L$Borough <- as.factor(clusters$cluster)
ggmap(Chicago.map) + geom_point(aes(x = lon[], y = lat[], colour = as.factor(Borough)),data = CrimeAll2013L) +
geom_label(data = districts, aes(x=lon, y=lat, label = `DISTRICT.NAME`), size = 2, fontface = "bold") +
# geom_text_repel(data = districts, aes(x = lon, y = lat, label = `DISTRICT.NAME`), fontface = "bold", nudge_x = c(1, -1.5, 2, 2, -1), nudge_y = c(0.25, -0.25, 0.5, 0.5, -0.5)) +
ggtitle("Chicago Boroughs using KMean 2013")