For this project, I chose the dataset “Minneapolis Police Department 2017 Stop Data” (“MplsStops”) and the related dataset “Minneapolis Demographic Data 2015, by Neighborhood” (“MplsDemo”) from http://vincentarelbundock.github.io/Rdatasets/. The first dataset contains information from stops made by the Minneapolis Police Department in 2017 (almost 52K records), while the second dataset includes demographic data for 84 neighborhoods in Minneapolis. Descriptions of both datasets can be found at: http://vincentarelbundock.github.io/Rdatasets/doc/carData/MplsStops.html http://vincentarelbundock.github.io/Rdatasets/doc/carData/MplsDemo.html
# install packages
library(plyr)
library(dplyr)
library(ggplot2)
library(maps)
library(ggmap)
# load datasets from either of 2 file locations: C drive or GitHub
# location1 <- "C:/Users/Kevin/Documents/MSDS/Week3 Assignments/MplsStops.csv"
# location2 <- "C:/Users/Kevin/Documents/MSDS/Week3 Assignments/MplsDemo.csv"
# location1 <- "https://raw.githubusercontent.com/kecbenson/Bridge_Program/master/MplsStops.csv"
# location2 <- "https://raw.githubusercontent.com/kecbenson/Bridge_Program/master/MplsDemo.csv"
# use GitHub file for this assignment
location1 <- "https://raw.githubusercontent.com/kecbenson/Bridge_Program/master/MplsStops.csv"
location2 <- "https://raw.githubusercontent.com/kecbenson/Bridge_Program/master/MplsDemo.csv"
stops <- as.tbl(read.csv(file=location1))
demo <- as.tbl(read.csv(file=location2))
stops
## # A tibble: 51,920 x 15
## X idNum date problem MDC citationIssued personSearch
## <int> <fct> <fct> <fct> <fct> <fct> <fct>
## 1 6823 17-0000~ 2017-01-01 0~ suspici~ MDC <NA> NO
## 2 6824 17-0000~ 2017-01-01 0~ suspici~ MDC <NA> NO
## 3 6825 17-0000~ 2017-01-01 0~ traffic MDC <NA> NO
## 4 6826 17-0000~ 2017-01-01 0~ suspici~ MDC <NA> NO
## 5 6827 17-0000~ 2017-01-01 0~ traffic MDC <NA> NO
## 6 6828 17-0001~ 2017-01-01 0~ traffic MDC <NA> NO
## 7 6829 17-0001~ 2017-01-01 0~ suspici~ MDC <NA> NO
## 8 6830 17-0001~ 2017-01-01 0~ traffic MDC <NA> NO
## 9 6831 17-0001~ 2017-01-01 0~ traffic MDC <NA> NO
## 10 6832 17-0001~ 2017-01-01 0~ traffic MDC <NA> NO
## # ... with 51,910 more rows, and 8 more variables: vehicleSearch <fct>,
## # preRace <fct>, race <fct>, gender <fct>, lat <dbl>, long <dbl>,
## # policePrecinct <int>, neighborhood <fct>
str(stops)
## Classes 'tbl_df', 'tbl' and 'data.frame': 51920 obs. of 15 variables:
## $ X : int 6823 6824 6825 6826 6827 6828 6829 6830 6831 6832 ...
## $ idNum : Factor w/ 51920 levels "17-000003","17-000007",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ date : Factor w/ 51873 levels "2017-01-01 00:00:42",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ problem : Factor w/ 2 levels "suspicious","traffic": 1 1 2 1 2 2 1 2 2 2 ...
## $ MDC : Factor w/ 2 levels "MDC","other": 1 1 1 1 1 1 1 1 1 1 ...
## $ citationIssued: Factor w/ 2 levels "NO","YES": NA NA NA NA NA NA NA NA NA NA ...
## $ personSearch : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
## $ vehicleSearch : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
## $ preRace : Factor w/ 8 levels "Asian","Black",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ race : Factor w/ 8 levels "Asian","Black",..: 7 7 8 3 8 3 2 6 8 2 ...
## $ gender : Factor w/ 3 levels "Female","Male",..: 3 2 1 2 1 2 2 1 2 2 ...
## $ lat : num 45 45 44.9 44.9 45 ...
## $ long : num -93.2 -93.3 -93.3 -93.3 -93.3 ...
## $ policePrecinct: int 1 1 5 5 1 1 1 2 2 4 ...
## $ neighborhood : Factor w/ 87 levels "Armatage","Audubon Park",..: 11 20 84 84 20 20 20 51 59 28 ...
demo
## # A tibble: 84 x 9
## X neighborhood population white black foreignBorn hhIncome poverty
## <int> <fct> <int> <dbl> <dbl> <dbl> <int> <dbl>
## 1 1 Cedar Rivers~ 8247 0.353 0.464 0.408 18892 0.06
## 2 3 Phillips West 5184 0.199 0.538 0.318 18404 0.042
## 3 4 Downtown West 7141 0.561 0.211 0.203 67086 0.057
## 4 5 Downtown East 1674 0.543 0.221 0.221 70669 0.071
## 5 6 Shingle Creek 3249 0.407 0.259 0.14 59414 0.11
## 6 7 Ventura Vill~ 6150 0.214 0.424 0.329 17469 0.048
## 7 8 Sumner - Gle~ 1676 0.087 0.656 0.292 18854 0.074
## 8 9 Lind - Bohan~ 5420 0.417 0.251 0.172 43438 0.089
## 9 10 Victory 4525 0.622 0.25 0.055 57148 0.066
## 10 11 Webber - Cam~ 5109 0.394 0.383 0.11 37030 0.053
## # ... with 74 more rows, and 1 more variable: collegeGrad <dbl>
str(demo)
## Classes 'tbl_df', 'tbl' and 'data.frame': 84 obs. of 9 variables:
## $ X : int 1 3 4 5 6 7 8 9 10 11 ...
## $ neighborhood: Factor w/ 84 levels "Armatage","Audubon Park",..: 10 61 19 18 67 75 72 40 76 78 ...
## $ population : int 8247 5184 7141 1674 3249 6150 1676 5420 4525 5109 ...
## $ white : num 0.353 0.199 0.561 0.543 0.407 0.214 0.087 0.417 0.622 0.394 ...
## $ black : num 0.464 0.538 0.211 0.221 0.259 0.424 0.656 0.251 0.25 0.383 ...
## $ foreignBorn : num 0.408 0.318 0.203 0.221 0.14 0.329 0.292 0.172 0.055 0.11 ...
## $ hhIncome : int 18892 18404 67086 70669 59414 17469 18854 43438 57148 37030 ...
## $ poverty : num 0.06 0.042 0.057 0.071 0.11 0.048 0.074 0.089 0.066 0.053 ...
## $ collegeGrad : num 0.258 0.211 0.551 0.577 0.247 0.173 0.165 0.196 0.345 0.203 ...
Let’s start by summarizing the datasets, and summing up the total population of all neighborhoods in the Demo dataset.
summary(stops)
## X idNum date
## Min. : 6823 17-000003: 1 2017-01-10 12:05:28: 2
## 1st Qu.:20379 17-000007: 1 2017-01-22 17:33:37: 2
## Median :33864 17-000073: 1 2017-01-27 22:40:21: 2
## Mean :33861 17-000092: 1 2017-01-27 23:21:45: 2
## 3rd Qu.:47387 17-000098: 1 2017-01-30 14:26:45: 2
## Max. :60838 17-000111: 1 2017-02-03 13:35:57: 2
## (Other) :51914 (Other) :51908
## problem MDC citationIssued personSearch
## suspicious:25822 MDC :43699 NO :15899 NO :38462
## traffic :26098 other: 8221 YES : 3211 YES : 5237
## NA's:32810 NA's: 8221
##
##
##
##
## vehicleSearch preRace race
## NO :40579 Unknown :28337 Black :15220
## YES : 3120 Black : 6805 White :11703
## NA's: 8221 White : 6004 Unknown : 9219
## Native American: 908 East African: 2188
## Latino : 528 Latino : 1858
## (Other) : 1117 (Other) : 3511
## NA's : 8221 NA's : 8221
## gender lat long policePrecinct
## Female :10015 Min. :44.89 Min. :-93.33 Min. :1.000
## Male :27131 1st Qu.:44.95 1st Qu.:-93.29 1st Qu.:2.000
## Unknown: 6492 Median :44.98 Median :-93.28 Median :3.000
## NA's : 8282 Mean :44.97 Mean :-93.27 Mean :3.257
## 3rd Qu.:45.00 3rd Qu.:-93.25 3rd Qu.:4.000
## Max. :45.05 Max. :-93.20 Max. :5.000
##
## neighborhood
## Downtown West: 4409
## Whittier : 3328
## Near - North : 2256
## Lyndale : 2154
## Jordan : 2075
## Hawthorne : 2031
## (Other) :35667
summary(demo)
## X neighborhood population white
## Min. : 1.00 Armatage : 1 Min. : 240 Min. :0.0870
## 1st Qu.: 25.75 Audubon Park: 1 1st Qu.: 2909 1st Qu.:0.5150
## Median : 51.50 Bancroft : 1 Median : 4518 Median :0.6940
## Mean : 49.68 Beltrami : 1 Mean : 4781 Mean :0.6343
## 3rd Qu.: 72.25 Bottineau : 1 3rd Qu.: 5985 3rd Qu.:0.8183
## Max. :100.00 Bryant : 1 Max. :16022 Max. :0.9020
## (Other) :78
## black foreignBorn hhIncome poverty
## Min. :0.00400 Min. :0.0290 Min. : 17469 Min. :0.03100
## 1st Qu.:0.05275 1st Qu.:0.0800 1st Qu.: 42028 1st Qu.:0.04900
## Median :0.12200 Median :0.1255 Median : 57302 Median :0.05900
## Mean :0.16114 Mean :0.1409 Mean : 60799 Mean :0.06457
## 3rd Qu.:0.22525 3rd Qu.:0.1725 3rd Qu.: 75482 3rd Qu.:0.07675
## Max. :0.65600 Max. :0.4080 Max. :118750 Max. :0.13500
##
## collegeGrad
## Min. :0.1220
## 1st Qu.:0.3417
## Median :0.4795
## Mean :0.4766
## 3rd Qu.:0.6222
## Max. :0.8370
##
sum(demo$population)
## [1] 401606
Several observations about the Stops dataset:
Several observations about the Demo dataset:
First let’s see if the neighborhood labels in the Stops dataset match those in the Demo dataset. It turns out that there are 3 neighborhoods in the Stops dataset that are NOT in the Demo dataset: “Camden”, “Humboldt” and “Near - North”. This accounts for 2,300 observations out of 51,920, or roughly ~4%.
# identify which neighborhood names in Stops dataset DO NOT match those in Demo dataset
v1 <- levels(stops$neighborhood)
v2 <- levels(demo$neighborhood)
levels(stops$neighborhood)[ !(v1 %in% v2) ]
## [1] "Camden Industrial" "Humboldt Industrial Area"
## [3] "Near - North"
sum( !(stops$neighborhood %in% v2) )
## [1] 2300
Second, join selected columns from the Demo dataset into the Stops dataset. Keep in mind that ~4% of the observations in the Stops dataset will have missing Demo data.
# rank order the neighborhoods in the Demo dataset into quartiles and append the column
demo <- as.tbl(cbind(demo, quartile = ntile(demo$hhIncome, 4)))
demo
## # A tibble: 84 x 10
## X neighborhood population white black foreignBorn hhIncome poverty
## <int> <fct> <int> <dbl> <dbl> <dbl> <int> <dbl>
## 1 1 Cedar Rivers~ 8247 0.353 0.464 0.408 18892 0.06
## 2 3 Phillips West 5184 0.199 0.538 0.318 18404 0.042
## 3 4 Downtown West 7141 0.561 0.211 0.203 67086 0.057
## 4 5 Downtown East 1674 0.543 0.221 0.221 70669 0.071
## 5 6 Shingle Creek 3249 0.407 0.259 0.14 59414 0.11
## 6 7 Ventura Vill~ 6150 0.214 0.424 0.329 17469 0.048
## 7 8 Sumner - Gle~ 1676 0.087 0.656 0.292 18854 0.074
## 8 9 Lind - Bohan~ 5420 0.417 0.251 0.172 43438 0.089
## 9 10 Victory 4525 0.622 0.25 0.055 57148 0.066
## 10 11 Webber - Cam~ 5109 0.394 0.383 0.11 37030 0.053
## # ... with 74 more rows, and 2 more variables: collegeGrad <dbl>,
## # quartile <int>
str(demo)
## Classes 'tbl_df', 'tbl' and 'data.frame': 84 obs. of 10 variables:
## $ X : int 1 3 4 5 6 7 8 9 10 11 ...
## $ neighborhood: Factor w/ 84 levels "Armatage","Audubon Park",..: 10 61 19 18 67 75 72 40 76 78 ...
## $ population : int 8247 5184 7141 1674 3249 6150 1676 5420 4525 5109 ...
## $ white : num 0.353 0.199 0.561 0.543 0.407 0.214 0.087 0.417 0.622 0.394 ...
## $ black : num 0.464 0.538 0.211 0.221 0.259 0.424 0.656 0.251 0.25 0.383 ...
## $ foreignBorn : num 0.408 0.318 0.203 0.221 0.14 0.329 0.292 0.172 0.055 0.11 ...
## $ hhIncome : int 18892 18404 67086 70669 59414 17469 18854 43438 57148 37030 ...
## $ poverty : num 0.06 0.042 0.057 0.071 0.11 0.048 0.074 0.089 0.066 0.053 ...
## $ collegeGrad : num 0.258 0.211 0.551 0.577 0.247 0.173 0.165 0.196 0.345 0.203 ...
## $ quartile : int 1 1 3 3 3 1 1 2 2 1 ...
# create neighborhood vector to index into Demo dataset
v3 <- match(stops$neighborhood, demo$neighborhood)
length(v3)
## [1] 51920
# create new columns to add to Stops dataset
pop <- demo$population[v3]
income <- demo$hhIncome[v3]
college <- demo$collegeGrad[v3]
quartile <- demo$quartile[v3]
# add to Stops dataset, create Combined dataset
comb <- as.tbl(cbind(stops, pop, income, college, quartile))
comb
## # A tibble: 51,920 x 19
## X idNum date problem MDC citationIssued personSearch
## <int> <fct> <fct> <fct> <fct> <fct> <fct>
## 1 6823 17-0000~ 2017-01-01 0~ suspici~ MDC <NA> NO
## 2 6824 17-0000~ 2017-01-01 0~ suspici~ MDC <NA> NO
## 3 6825 17-0000~ 2017-01-01 0~ traffic MDC <NA> NO
## 4 6826 17-0000~ 2017-01-01 0~ suspici~ MDC <NA> NO
## 5 6827 17-0000~ 2017-01-01 0~ traffic MDC <NA> NO
## 6 6828 17-0001~ 2017-01-01 0~ traffic MDC <NA> NO
## 7 6829 17-0001~ 2017-01-01 0~ suspici~ MDC <NA> NO
## 8 6830 17-0001~ 2017-01-01 0~ traffic MDC <NA> NO
## 9 6831 17-0001~ 2017-01-01 0~ traffic MDC <NA> NO
## 10 6832 17-0001~ 2017-01-01 0~ traffic MDC <NA> NO
## # ... with 51,910 more rows, and 12 more variables: vehicleSearch <fct>,
## # preRace <fct>, race <fct>, gender <fct>, lat <dbl>, long <dbl>,
## # policePrecinct <int>, neighborhood <fct>, pop <int>, income <int>,
## # college <dbl>, quartile <int>
str(comb)
## Classes 'tbl_df', 'tbl' and 'data.frame': 51920 obs. of 19 variables:
## $ X : int 6823 6824 6825 6826 6827 6828 6829 6830 6831 6832 ...
## $ idNum : Factor w/ 51920 levels "17-000003","17-000007",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ date : Factor w/ 51873 levels "2017-01-01 00:00:42",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ problem : Factor w/ 2 levels "suspicious","traffic": 1 1 2 1 2 2 1 2 2 2 ...
## $ MDC : Factor w/ 2 levels "MDC","other": 1 1 1 1 1 1 1 1 1 1 ...
## $ citationIssued: Factor w/ 2 levels "NO","YES": NA NA NA NA NA NA NA NA NA NA ...
## $ personSearch : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
## $ vehicleSearch : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
## $ preRace : Factor w/ 8 levels "Asian","Black",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ race : Factor w/ 8 levels "Asian","Black",..: 7 7 8 3 8 3 2 6 8 2 ...
## $ gender : Factor w/ 3 levels "Female","Male",..: 3 2 1 2 1 2 2 1 2 2 ...
## $ lat : num 45 45 44.9 44.9 45 ...
## $ long : num -93.2 -93.3 -93.3 -93.3 -93.3 ...
## $ policePrecinct: int 1 1 5 5 1 1 1 2 2 4 ...
## $ neighborhood : Factor w/ 87 levels "Armatage","Audubon Park",..: 11 20 84 84 20 20 20 51 59 28 ...
## $ pop : int 8247 7141 14604 14604 7141 7141 7141 10496 1393 5023 ...
## $ income : int 18892 67086 35855 35855 67086 67086 67086 27104 83520 40442 ...
## $ college : num 0.258 0.551 0.399 0.399 0.551 0.551 0.551 0.587 0.734 0.181 ...
## $ quartile : int 1 3 1 1 3 3 3 1 4 1 ...
Next, change the date column to a date/time data type, extract time variables (e.g., hour of the day, day of the week, etc.) and then remove columns that won’t be used in our analysis.
# change date column to POSIXlt data type per dataset description
comb$date <- as.POSIXlt(comb$date)
# add time, day, month
# comb$date[["hour"]] # hour of the day, 0-23
# comb$date[["wday"]] # day of the week, 0-6
# comb$date[["mday"]] # day of the month, 1-31
# comb$date[["yday"]] # day of the year, 0-365
# comb$date[["mon"]] # month
comb <- ( comb %>% mutate(hour=comb$date[["hour"]], wday=comb$date[["wday"]],
mday=comb$date[["mday"]], yday=comb$date[["yday"]], month=comb$date[["mon"]] ) )
# drop extraneous columns: X, idNum, MDC, preRace, race, gender, policePrecinct
comb <- ( comb %>% select(-X, -idNum, -MDC, -preRace, -race, -gender, -policePrecinct) )
comb
## # A tibble: 51,920 x 17
## date problem citationIssued personSearch vehicleSearch
## <S3: POSIXlt> <fct> <fct> <fct> <fct>
## 1 2017-01-01 00:00:42 suspicio~ <NA> NO NO
## 2 2017-01-01 00:03:07 suspicio~ <NA> NO NO
## 3 2017-01-01 00:23:15 traffic <NA> NO NO
## 4 2017-01-01 00:33:48 suspicio~ <NA> NO NO
## 5 2017-01-01 00:37:58 traffic <NA> NO NO
## 6 2017-01-01 00:46:48 traffic <NA> NO NO
## 7 2017-01-01 00:48:46 suspicio~ <NA> NO NO
## 8 2017-01-01 00:50:55 traffic <NA> NO NO
## 9 2017-01-01 00:57:10 traffic <NA> NO NO
## 10 2017-01-01 01:05:50 traffic <NA> NO NO
## # ... with 51,910 more rows, and 12 more variables: lat <dbl>, long <dbl>,
## # neighborhood <fct>, pop <int>, income <int>, college <dbl>,
## # quartile <int>, hour <int>, wday <int>, mday <int>, yday <int>,
## # month <int>
str(comb)
## Classes 'tbl_df', 'tbl' and 'data.frame': 51920 obs. of 17 variables:
## $ date : POSIXlt, format: "2017-01-01 00:00:42" "2017-01-01 00:03:07" ...
## $ problem : Factor w/ 2 levels "suspicious","traffic": 1 1 2 1 2 2 1 2 2 2 ...
## $ citationIssued: Factor w/ 2 levels "NO","YES": NA NA NA NA NA NA NA NA NA NA ...
## $ personSearch : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
## $ vehicleSearch : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
## $ lat : num 45 45 44.9 44.9 45 ...
## $ long : num -93.2 -93.3 -93.3 -93.3 -93.3 ...
## $ neighborhood : Factor w/ 87 levels "Armatage","Audubon Park",..: 11 20 84 84 20 20 20 51 59 28 ...
## $ pop : int 8247 7141 14604 14604 7141 7141 7141 10496 1393 5023 ...
## $ income : int 18892 67086 35855 35855 67086 67086 67086 27104 83520 40442 ...
## $ college : num 0.258 0.551 0.399 0.399 0.551 0.551 0.551 0.587 0.734 0.181 ...
## $ quartile : int 1 3 1 1 3 3 3 1 4 1 ...
## $ hour : int 0 0 0 0 0 0 0 0 0 1 ...
## $ wday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mday : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ month : int 0 0 0 0 0 0 0 0 0 0 ...
summary(comb)
## date problem citationIssued
## Min. :2017-01-01 00:00:42 suspicious:25822 NO :15899
## 1st Qu.:2017-03-29 08:35:09 traffic :26098 YES : 3211
## Median :2017-06-17 18:46:47 NA's:32810
## Mean :2017-06-23 19:57:49
## 3rd Qu.:2017-09-18 18:32:06
## Max. :2017-12-31 23:52:35
##
## personSearch vehicleSearch lat long
## NO :38462 NO :40579 Min. :44.89 Min. :-93.33
## YES : 5237 YES : 3120 1st Qu.:44.95 1st Qu.:-93.29
## NA's: 8221 NA's: 8221 Median :44.98 Median :-93.28
## Mean :44.97 Mean :-93.27
## 3rd Qu.:45.00 3rd Qu.:-93.25
## Max. :45.05 Max. :-93.20
##
## neighborhood pop income college
## Downtown West: 4409 Min. : 240 Min. : 17469 Min. :0.1220
## Whittier : 3328 1st Qu.: 4609 1st Qu.: 35855 1st Qu.:0.2360
## Near - North : 2256 Median : 5934 Median : 44733 Median :0.3990
## Lyndale : 2154 Mean : 6428 Mean : 49557 Mean :0.4189
## Jordan : 2075 3rd Qu.: 8097 3rd Qu.: 67086 3rd Qu.:0.5680
## Hawthorne : 2031 Max. :16022 Max. :118750 Max. :0.8370
## (Other) :35667 NA's :2300 NA's :2300 NA's :2300
## quartile hour wday mday
## Min. :1.000 Min. : 0.0 Min. :0.000 Min. : 1.0
## 1st Qu.:1.000 1st Qu.: 9.0 1st Qu.:2.000 1st Qu.: 8.0
## Median :2.000 Median :15.0 Median :3.000 Median :16.0
## Mean :1.976 Mean :13.5 Mean :3.172 Mean :15.8
## 3rd Qu.:3.000 3rd Qu.:20.0 3rd Qu.:5.000 3rd Qu.:23.0
## Max. :4.000 Max. :23.0 Max. :6.000 Max. :31.0
## NA's :2300
## yday month
## Min. : 0.0 Min. : 0.000
## 1st Qu.: 87.0 1st Qu.: 2.000
## Median :167.0 Median : 5.000
## Mean :173.2 Mean : 5.236
## 3rd Qu.:260.0 3rd Qu.: 8.000
## Max. :364.0 Max. :11.000
##
If I had more time, it would be useful to:
This would allow us to do some interesting analysis that accounts for the neighborhood size / population, or the severity of the police stop activity. For instance, higher-income neighborhoods have lower absolute numbers of police stops, but this should be scaled for the population size. Secondly, not all stops are created equal — some stops result in vehicle searches or citations (tickets), while most stops result in no searches or citations. Creating a severity metric would allow us to quantify severity as an attribute of police stops.
Please see examples below in the answer to Question 4.
The overall question I tried to address is “What are some of the relationships in police stops over time and across neighborhoods, based on the datasets?” Breaking this down, there are several sub-questions that the analysis answers:
First, let’s look at the demographic data, and see how the neighborhoods vary by median household income, proportion of the population having a college degree and population size. As expected, household income has a strong positive relationship with the proportion of college graduates.
(Example of scatter plot)
ggplot(demo, aes(x = hhIncome, y = collegeGrad)) + geom_point(aes(size = population, col = quartile)) +
geom_smooth(method = "lm", se = FALSE, col = "red") +
labs(x="Household income", y="Proportion of college grads", title="Selected demographics of Minneapolis neighborhoods")
Second, take a look at how the number of stops varies by the neighborhood’s household income, proportion of college graduates and population. The histograms show that neighborhoods with higher income and more college graduates tend to have fewer police stops. The relationship with neighborhood population is less conclusive.
(Examples of histograms)
g <- ggplot(comb)
g + geom_histogram(aes(x=income), binwidth = 5000) +
labs(x="Household income", y="Numb. of stops", title="Number of stops vs. household income")
g + geom_histogram(aes(x=college), binwidth = 0.075) +
labs(x="Prop. of college grads", y="Numb. of stops", title="Number of stops vs. proportion of college grads")
g + geom_histogram(aes(x=pop), binwidth = 1000) +
labs(x="Population", y="Numb. of stops", title="Number of stops vs. population")
Third, let’s plot the observation count of stops over time, to see the pattern of police stops over different time measures:
(Examples of histograms, bar charts and density chart)
g <- ggplot(comb)
g + geom_histogram(aes(x=hour), binwidth = 1) +
labs(x="Hour of the day (0=midnight, 23=11PM)", y="Numb. of stops", title="Stop activity during the day")
g + geom_bar(aes(x=wday)) +
labs(x="Day of the week (0=Sun, 6=Sat)", y="Numb. of stops", title="Stop activity during the week")
g + geom_histogram(aes(x=mday)) +
labs(x="Day of the month (1-31)", y="Numb. of stops", title="Stop activity during the month")
g + geom_density(aes(x=yday)) +
labs(x="Day of the year (0=Jan 1, 364=Dec 31)", y="Numb. of stops", title="Stop activity during the year")
g + geom_bar(aes(x=month)) +
labs(x="Month (0=Jan, 11=Dec)", y="Numb. of stops", title="Stop activity by month")
Next, let’s see if the distribution of police stops during the day varies by neighborhood income quartile. The box plot shows that 4th-quartile (high-income) neighborhoods tend to have fewer stops before 10AM or after 8PM.
(Example of box plot)
ggplot(comb, aes(y = hour, x = quartile, group = quartile)) + geom_boxplot() +
labs(x="Income quartile", y="Hour of the day (0-23)", title="Distribution of stop times by neighborhood income quartile")
To expand on this last point, let’s breakdown the count of stops by search type during the day, by neighborhood income quartile. The charts show that police stops are concentrated in 1st- and 2nd-quartile (lower-income) neighborhoods, and stops tend to be most active in the evening hours leading up to midnight. This pattern is generally true for other attributes include citations issued and personal or vehicle searches.
As a next step, it would be appropriate to scale the number of stops by the neighborhood population, and do the analysis on a per capita basis.
(Example of stacked bar charts and facet wraps)
ggplot(comb, aes(x=hour)) + geom_histogram(binwidth = 1) + facet_wrap(~ quartile) +
labs(x="Hour of the day (0-23)", y="Numb. of stops", title="Stop activity during the day, per neighborhood income quartile")
ggplot(comb, aes(x=hour, fill = problem)) + geom_histogram(binwidth = 1) + facet_wrap(~ quartile) +
labs(x="Hour of the day (0-23)", y="Numb. of stops", title="Stop activity by problem reported, per neighborhood income quartile")
ggplot(comb, aes(x=hour, fill = citationIssued)) + geom_histogram(binwidth = 1) + facet_wrap(~ quartile) +
labs(x="Hour of the day (0-23)", y="Numb. of stops", title="Stop activity by citation result, per neighborhood income quartile")
ggplot(comb, aes(x=hour, fill = personSearch)) + geom_histogram(binwidth = 1) + facet_wrap(~ quartile) +
labs(x="Hour of the day (0-23)", y="Numb. of stops", title="Stop activity by personal search result, per neighborhood income quartile")
ggplot(comb, aes(x=hour, fill = vehicleSearch)) + geom_histogram(binwidth = 1) + facet_wrap(~ quartile) +
labs(x="Hour of the day (0-23)", y="Numb. of stops", title="Stop activity by vehicle search result, per neighborhood income quartile")
Finally, let’s see how to visualize 2017 police stop activity in Minneapolis. To do this, I consulted a few sources on creating heatmaps, and ended up borrowing code logic from a paper on RPubs (https://rpubs.com/jimu_xw/crime_visualization).
(Example of heatmaps using ggmap)
# get city map and plot sample stop data onto the map
minnie <- get_map(location = "minneapolis", zoom = 12)
ggmap(minnie) + geom_jitter(data=comb[sample(nrow(comb), 5000), ], aes(x=long, y=lat)) +
labs(title="Sample of 2017 police stop activity in Minneapolis")
# create heatmap -- borrow code from https://rpubs.com/jimu_xw/crime_visualization
LatLonCounts <- as.data.frame(table(round(comb$long, 2), round(comb$lat, 2)))
str(LatLonCounts)
## 'data.frame': 238 obs. of 3 variables:
## $ Var1: Factor w/ 14 levels "-93.33","-93.32",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Var2: Factor w/ 17 levels "44.89","44.9",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Freq: int 0 3 14 18 59 126 22 22 3 0 ...
LatLonCounts$Long <- as.numeric(as.character(LatLonCounts$Var1))
LatLonCounts$Lat <- as.numeric(as.character(LatLonCounts$Var2))
# try a few iterations
ggmap(minnie) + geom_point(data = LatLonCounts, aes(x = Long, y = Lat, color = Freq, size = Freq)) +
labs(title="Frequency of 2017 police stops in Minneapolis")
ggmap(minnie) + geom_point(data = LatLonCounts, aes(x = Long, y = Lat, color = Freq, size = Freq)) +
scale_color_gradient(low = "yellow", high = "red") +
labs(title="Frequency of 2017 police stops in Minneapolis")
ggmap(minnie) + geom_tile(data = LatLonCounts, aes(x = Long, y = Lat, alpha = Freq), fill = "red") +
labs(title="Heat map of 2017 police stops in Minneapolis")
In conclusion, 2017 Minneapolis police stop activity exhibits clear relationships in the frequency and type of stops that occur over time and across neighborhoods.
This is provided in the setup and data loading section above, where the file locations can be set to either the C: drive or to my GitHub repository. This assignment was run using the GitHub files.