Introduction

For this project, I chose the dataset “Minneapolis Police Department 2017 Stop Data” (“MplsStops”) and the related dataset “Minneapolis Demographic Data 2015, by Neighborhood” (“MplsDemo”) from http://vincentarelbundock.github.io/Rdatasets/. The first dataset contains information from stops made by the Minneapolis Police Department in 2017 (almost 52K records), while the second dataset includes demographic data for 84 neighborhoods in Minneapolis. Descriptions of both datasets can be found at: http://vincentarelbundock.github.io/Rdatasets/doc/carData/MplsStops.html http://vincentarelbundock.github.io/Rdatasets/doc/carData/MplsDemo.html

Setup: Install packages and load datasets

# install packages
library(plyr)
library(dplyr)
library(ggplot2)
library(maps)        
library(ggmap)

# load datasets from either of 2 file locations: C drive or GitHub
# location1 <- "C:/Users/Kevin/Documents/MSDS/Week3 Assignments/MplsStops.csv"
# location2 <- "C:/Users/Kevin/Documents/MSDS/Week3 Assignments/MplsDemo.csv"
# location1 <- "https://raw.githubusercontent.com/kecbenson/Bridge_Program/master/MplsStops.csv"
# location2 <- "https://raw.githubusercontent.com/kecbenson/Bridge_Program/master/MplsDemo.csv"

# use GitHub file for this assignment
location1 <- "https://raw.githubusercontent.com/kecbenson/Bridge_Program/master/MplsStops.csv"
location2 <- "https://raw.githubusercontent.com/kecbenson/Bridge_Program/master/MplsDemo.csv"
stops <- as.tbl(read.csv(file=location1))
demo <- as.tbl(read.csv(file=location2))

stops
## # A tibble: 51,920 x 15
##        X idNum    date          problem  MDC   citationIssued personSearch
##    <int> <fct>    <fct>         <fct>    <fct> <fct>          <fct>       
##  1  6823 17-0000~ 2017-01-01 0~ suspici~ MDC   <NA>           NO          
##  2  6824 17-0000~ 2017-01-01 0~ suspici~ MDC   <NA>           NO          
##  3  6825 17-0000~ 2017-01-01 0~ traffic  MDC   <NA>           NO          
##  4  6826 17-0000~ 2017-01-01 0~ suspici~ MDC   <NA>           NO          
##  5  6827 17-0000~ 2017-01-01 0~ traffic  MDC   <NA>           NO          
##  6  6828 17-0001~ 2017-01-01 0~ traffic  MDC   <NA>           NO          
##  7  6829 17-0001~ 2017-01-01 0~ suspici~ MDC   <NA>           NO          
##  8  6830 17-0001~ 2017-01-01 0~ traffic  MDC   <NA>           NO          
##  9  6831 17-0001~ 2017-01-01 0~ traffic  MDC   <NA>           NO          
## 10  6832 17-0001~ 2017-01-01 0~ traffic  MDC   <NA>           NO          
## # ... with 51,910 more rows, and 8 more variables: vehicleSearch <fct>,
## #   preRace <fct>, race <fct>, gender <fct>, lat <dbl>, long <dbl>,
## #   policePrecinct <int>, neighborhood <fct>
str(stops)
## Classes 'tbl_df', 'tbl' and 'data.frame':    51920 obs. of  15 variables:
##  $ X             : int  6823 6824 6825 6826 6827 6828 6829 6830 6831 6832 ...
##  $ idNum         : Factor w/ 51920 levels "17-000003","17-000007",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ date          : Factor w/ 51873 levels "2017-01-01 00:00:42",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ problem       : Factor w/ 2 levels "suspicious","traffic": 1 1 2 1 2 2 1 2 2 2 ...
##  $ MDC           : Factor w/ 2 levels "MDC","other": 1 1 1 1 1 1 1 1 1 1 ...
##  $ citationIssued: Factor w/ 2 levels "NO","YES": NA NA NA NA NA NA NA NA NA NA ...
##  $ personSearch  : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
##  $ vehicleSearch : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
##  $ preRace       : Factor w/ 8 levels "Asian","Black",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ race          : Factor w/ 8 levels "Asian","Black",..: 7 7 8 3 8 3 2 6 8 2 ...
##  $ gender        : Factor w/ 3 levels "Female","Male",..: 3 2 1 2 1 2 2 1 2 2 ...
##  $ lat           : num  45 45 44.9 44.9 45 ...
##  $ long          : num  -93.2 -93.3 -93.3 -93.3 -93.3 ...
##  $ policePrecinct: int  1 1 5 5 1 1 1 2 2 4 ...
##  $ neighborhood  : Factor w/ 87 levels "Armatage","Audubon Park",..: 11 20 84 84 20 20 20 51 59 28 ...
demo
## # A tibble: 84 x 9
##        X neighborhood  population white black foreignBorn hhIncome poverty
##    <int> <fct>              <int> <dbl> <dbl>       <dbl>    <int>   <dbl>
##  1     1 Cedar Rivers~       8247 0.353 0.464       0.408    18892   0.06 
##  2     3 Phillips West       5184 0.199 0.538       0.318    18404   0.042
##  3     4 Downtown West       7141 0.561 0.211       0.203    67086   0.057
##  4     5 Downtown East       1674 0.543 0.221       0.221    70669   0.071
##  5     6 Shingle Creek       3249 0.407 0.259       0.14     59414   0.11 
##  6     7 Ventura Vill~       6150 0.214 0.424       0.329    17469   0.048
##  7     8 Sumner - Gle~       1676 0.087 0.656       0.292    18854   0.074
##  8     9 Lind - Bohan~       5420 0.417 0.251       0.172    43438   0.089
##  9    10 Victory             4525 0.622 0.25        0.055    57148   0.066
## 10    11 Webber - Cam~       5109 0.394 0.383       0.11     37030   0.053
## # ... with 74 more rows, and 1 more variable: collegeGrad <dbl>
str(demo)
## Classes 'tbl_df', 'tbl' and 'data.frame':    84 obs. of  9 variables:
##  $ X           : int  1 3 4 5 6 7 8 9 10 11 ...
##  $ neighborhood: Factor w/ 84 levels "Armatage","Audubon Park",..: 10 61 19 18 67 75 72 40 76 78 ...
##  $ population  : int  8247 5184 7141 1674 3249 6150 1676 5420 4525 5109 ...
##  $ white       : num  0.353 0.199 0.561 0.543 0.407 0.214 0.087 0.417 0.622 0.394 ...
##  $ black       : num  0.464 0.538 0.211 0.221 0.259 0.424 0.656 0.251 0.25 0.383 ...
##  $ foreignBorn : num  0.408 0.318 0.203 0.221 0.14 0.329 0.292 0.172 0.055 0.11 ...
##  $ hhIncome    : int  18892 18404 67086 70669 59414 17469 18854 43438 57148 37030 ...
##  $ poverty     : num  0.06 0.042 0.057 0.071 0.11 0.048 0.074 0.089 0.066 0.053 ...
##  $ collegeGrad : num  0.258 0.211 0.551 0.577 0.247 0.173 0.165 0.196 0.345 0.203 ...

Question 1

Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.

Let’s start by summarizing the datasets, and summing up the total population of all neighborhoods in the Demo dataset.

summary(stops)
##        X               idNum                        date      
##  Min.   : 6823   17-000003:    1   2017-01-10 12:05:28:    2  
##  1st Qu.:20379   17-000007:    1   2017-01-22 17:33:37:    2  
##  Median :33864   17-000073:    1   2017-01-27 22:40:21:    2  
##  Mean   :33861   17-000092:    1   2017-01-27 23:21:45:    2  
##  3rd Qu.:47387   17-000098:    1   2017-01-30 14:26:45:    2  
##  Max.   :60838   17-000111:    1   2017-02-03 13:35:57:    2  
##                  (Other)  :51914   (Other)            :51908  
##        problem         MDC        citationIssued personSearch
##  suspicious:25822   MDC  :43699   NO  :15899     NO  :38462  
##  traffic   :26098   other: 8221   YES : 3211     YES : 5237  
##                                   NA's:32810     NA's: 8221  
##                                                              
##                                                              
##                                                              
##                                                              
##  vehicleSearch            preRace                race      
##  NO  :40579    Unknown        :28337   Black       :15220  
##  YES : 3120    Black          : 6805   White       :11703  
##  NA's: 8221    White          : 6004   Unknown     : 9219  
##                Native American:  908   East African: 2188  
##                Latino         :  528   Latino      : 1858  
##                (Other)        : 1117   (Other)     : 3511  
##                NA's           : 8221   NA's        : 8221  
##      gender           lat             long        policePrecinct 
##  Female :10015   Min.   :44.89   Min.   :-93.33   Min.   :1.000  
##  Male   :27131   1st Qu.:44.95   1st Qu.:-93.29   1st Qu.:2.000  
##  Unknown: 6492   Median :44.98   Median :-93.28   Median :3.000  
##  NA's   : 8282   Mean   :44.97   Mean   :-93.27   Mean   :3.257  
##                  3rd Qu.:45.00   3rd Qu.:-93.25   3rd Qu.:4.000  
##                  Max.   :45.05   Max.   :-93.20   Max.   :5.000  
##                                                                  
##         neighborhood  
##  Downtown West: 4409  
##  Whittier     : 3328  
##  Near - North : 2256  
##  Lyndale      : 2154  
##  Jordan       : 2075  
##  Hawthorne    : 2031  
##  (Other)      :35667
summary(demo)
##        X                neighborhood   population        white       
##  Min.   :  1.00   Armatage    : 1    Min.   :  240   Min.   :0.0870  
##  1st Qu.: 25.75   Audubon Park: 1    1st Qu.: 2909   1st Qu.:0.5150  
##  Median : 51.50   Bancroft    : 1    Median : 4518   Median :0.6940  
##  Mean   : 49.68   Beltrami    : 1    Mean   : 4781   Mean   :0.6343  
##  3rd Qu.: 72.25   Bottineau   : 1    3rd Qu.: 5985   3rd Qu.:0.8183  
##  Max.   :100.00   Bryant      : 1    Max.   :16022   Max.   :0.9020  
##                   (Other)     :78                                    
##      black          foreignBorn        hhIncome         poverty       
##  Min.   :0.00400   Min.   :0.0290   Min.   : 17469   Min.   :0.03100  
##  1st Qu.:0.05275   1st Qu.:0.0800   1st Qu.: 42028   1st Qu.:0.04900  
##  Median :0.12200   Median :0.1255   Median : 57302   Median :0.05900  
##  Mean   :0.16114   Mean   :0.1409   Mean   : 60799   Mean   :0.06457  
##  3rd Qu.:0.22525   3rd Qu.:0.1725   3rd Qu.: 75482   3rd Qu.:0.07675  
##  Max.   :0.65600   Max.   :0.4080   Max.   :118750   Max.   :0.13500  
##                                                                       
##   collegeGrad    
##  Min.   :0.1220  
##  1st Qu.:0.3417  
##  Median :0.4795  
##  Mean   :0.4766  
##  3rd Qu.:0.6222  
##  Max.   :0.8370  
## 
sum(demo$population)
## [1] 401606

Several observations about the Stops dataset:

  • The dataset includes date and time data; we can use this to see if there are time of day or day of the week or seasonality patterns. It appears that the date field is treated as a factor, so this may need to be converted to a date/time data type.
  • The stops data is roughly split between “suspicious” (25,822) and “traffic” (26,098); it will be interesting to see if there are different patterns between “suspicious” and “traffic” stops over time or across neighborhoods.
  • Only a small portion of the stops result in:
    • citation issued (3,211 out of 51,920, or ~6%)
    • person searched (5,237 or ~10%)
    • vehicle searched (3,120 or ~6%).
  • The dataset includes latitude / longitude data, so we can try to plot the stop data on to a city map of Minneapolis.
  • The dataset includes the neighborhood, so we can use this to reference demographic info from the Demo dataset.

Several observations about the Demo dataset:

  • The dataset includes neighborhood labels, but we’ll need to cross-check against the neighborhood labels in the Stops dataset to make sure they’re consistent (so can be used to cross-reference between the two datasets).
  • The dataset includes population data, so we can calculate per capita statistics (e.g., stops divided by population for each neighborhood). The total population covered by these neighborhoods is 401,606. The mean and median neighborhood population are 4,781 and 4,518, respectively.
  • The dataset includes the median household income and percentage of the population having a college degree, by neighborhood. We can use these to compare against the stops data. Across all neighborhoods, the household income averages $61K (median of $57K) and the average proportion having a college degree is 48%.

Question 2

Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example - if it makes sense you could sum two columns together).

First let’s see if the neighborhood labels in the Stops dataset match those in the Demo dataset. It turns out that there are 3 neighborhoods in the Stops dataset that are NOT in the Demo dataset: “Camden”, “Humboldt” and “Near - North”. This accounts for 2,300 observations out of 51,920, or roughly ~4%.

# identify which neighborhood names in Stops dataset DO NOT match those in Demo dataset
v1 <- levels(stops$neighborhood)
v2 <- levels(demo$neighborhood)
levels(stops$neighborhood)[ !(v1 %in% v2) ]
## [1] "Camden Industrial"        "Humboldt Industrial Area"
## [3] "Near - North"
sum( !(stops$neighborhood %in% v2) )
## [1] 2300

Second, join selected columns from the Demo dataset into the Stops dataset. Keep in mind that ~4% of the observations in the Stops dataset will have missing Demo data.

# rank order the neighborhoods in the Demo dataset into quartiles and append the column
demo <- as.tbl(cbind(demo, quartile = ntile(demo$hhIncome, 4)))
demo
## # A tibble: 84 x 10
##        X neighborhood  population white black foreignBorn hhIncome poverty
##    <int> <fct>              <int> <dbl> <dbl>       <dbl>    <int>   <dbl>
##  1     1 Cedar Rivers~       8247 0.353 0.464       0.408    18892   0.06 
##  2     3 Phillips West       5184 0.199 0.538       0.318    18404   0.042
##  3     4 Downtown West       7141 0.561 0.211       0.203    67086   0.057
##  4     5 Downtown East       1674 0.543 0.221       0.221    70669   0.071
##  5     6 Shingle Creek       3249 0.407 0.259       0.14     59414   0.11 
##  6     7 Ventura Vill~       6150 0.214 0.424       0.329    17469   0.048
##  7     8 Sumner - Gle~       1676 0.087 0.656       0.292    18854   0.074
##  8     9 Lind - Bohan~       5420 0.417 0.251       0.172    43438   0.089
##  9    10 Victory             4525 0.622 0.25        0.055    57148   0.066
## 10    11 Webber - Cam~       5109 0.394 0.383       0.11     37030   0.053
## # ... with 74 more rows, and 2 more variables: collegeGrad <dbl>,
## #   quartile <int>
str(demo)
## Classes 'tbl_df', 'tbl' and 'data.frame':    84 obs. of  10 variables:
##  $ X           : int  1 3 4 5 6 7 8 9 10 11 ...
##  $ neighborhood: Factor w/ 84 levels "Armatage","Audubon Park",..: 10 61 19 18 67 75 72 40 76 78 ...
##  $ population  : int  8247 5184 7141 1674 3249 6150 1676 5420 4525 5109 ...
##  $ white       : num  0.353 0.199 0.561 0.543 0.407 0.214 0.087 0.417 0.622 0.394 ...
##  $ black       : num  0.464 0.538 0.211 0.221 0.259 0.424 0.656 0.251 0.25 0.383 ...
##  $ foreignBorn : num  0.408 0.318 0.203 0.221 0.14 0.329 0.292 0.172 0.055 0.11 ...
##  $ hhIncome    : int  18892 18404 67086 70669 59414 17469 18854 43438 57148 37030 ...
##  $ poverty     : num  0.06 0.042 0.057 0.071 0.11 0.048 0.074 0.089 0.066 0.053 ...
##  $ collegeGrad : num  0.258 0.211 0.551 0.577 0.247 0.173 0.165 0.196 0.345 0.203 ...
##  $ quartile    : int  1 1 3 3 3 1 1 2 2 1 ...
# create neighborhood vector to index into Demo dataset
v3 <- match(stops$neighborhood, demo$neighborhood)
length(v3)
## [1] 51920
# create new columns to add to Stops dataset
pop <- demo$population[v3]
income <- demo$hhIncome[v3]
college <- demo$collegeGrad[v3]
quartile <- demo$quartile[v3]
# add to Stops dataset, create Combined dataset
comb <- as.tbl(cbind(stops, pop, income, college, quartile))
comb
## # A tibble: 51,920 x 19
##        X idNum    date          problem  MDC   citationIssued personSearch
##    <int> <fct>    <fct>         <fct>    <fct> <fct>          <fct>       
##  1  6823 17-0000~ 2017-01-01 0~ suspici~ MDC   <NA>           NO          
##  2  6824 17-0000~ 2017-01-01 0~ suspici~ MDC   <NA>           NO          
##  3  6825 17-0000~ 2017-01-01 0~ traffic  MDC   <NA>           NO          
##  4  6826 17-0000~ 2017-01-01 0~ suspici~ MDC   <NA>           NO          
##  5  6827 17-0000~ 2017-01-01 0~ traffic  MDC   <NA>           NO          
##  6  6828 17-0001~ 2017-01-01 0~ traffic  MDC   <NA>           NO          
##  7  6829 17-0001~ 2017-01-01 0~ suspici~ MDC   <NA>           NO          
##  8  6830 17-0001~ 2017-01-01 0~ traffic  MDC   <NA>           NO          
##  9  6831 17-0001~ 2017-01-01 0~ traffic  MDC   <NA>           NO          
## 10  6832 17-0001~ 2017-01-01 0~ traffic  MDC   <NA>           NO          
## # ... with 51,910 more rows, and 12 more variables: vehicleSearch <fct>,
## #   preRace <fct>, race <fct>, gender <fct>, lat <dbl>, long <dbl>,
## #   policePrecinct <int>, neighborhood <fct>, pop <int>, income <int>,
## #   college <dbl>, quartile <int>
str(comb)
## Classes 'tbl_df', 'tbl' and 'data.frame':    51920 obs. of  19 variables:
##  $ X             : int  6823 6824 6825 6826 6827 6828 6829 6830 6831 6832 ...
##  $ idNum         : Factor w/ 51920 levels "17-000003","17-000007",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ date          : Factor w/ 51873 levels "2017-01-01 00:00:42",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ problem       : Factor w/ 2 levels "suspicious","traffic": 1 1 2 1 2 2 1 2 2 2 ...
##  $ MDC           : Factor w/ 2 levels "MDC","other": 1 1 1 1 1 1 1 1 1 1 ...
##  $ citationIssued: Factor w/ 2 levels "NO","YES": NA NA NA NA NA NA NA NA NA NA ...
##  $ personSearch  : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
##  $ vehicleSearch : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
##  $ preRace       : Factor w/ 8 levels "Asian","Black",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ race          : Factor w/ 8 levels "Asian","Black",..: 7 7 8 3 8 3 2 6 8 2 ...
##  $ gender        : Factor w/ 3 levels "Female","Male",..: 3 2 1 2 1 2 2 1 2 2 ...
##  $ lat           : num  45 45 44.9 44.9 45 ...
##  $ long          : num  -93.2 -93.3 -93.3 -93.3 -93.3 ...
##  $ policePrecinct: int  1 1 5 5 1 1 1 2 2 4 ...
##  $ neighborhood  : Factor w/ 87 levels "Armatage","Audubon Park",..: 11 20 84 84 20 20 20 51 59 28 ...
##  $ pop           : int  8247 7141 14604 14604 7141 7141 7141 10496 1393 5023 ...
##  $ income        : int  18892 67086 35855 35855 67086 67086 67086 27104 83520 40442 ...
##  $ college       : num  0.258 0.551 0.399 0.399 0.551 0.551 0.551 0.587 0.734 0.181 ...
##  $ quartile      : int  1 3 1 1 3 3 3 1 4 1 ...

Next, change the date column to a date/time data type, extract time variables (e.g., hour of the day, day of the week, etc.) and then remove columns that won’t be used in our analysis.

# change date column to POSIXlt data type per dataset description
comb$date <- as.POSIXlt(comb$date)
# add time, day, month
# comb$date[["hour"]]   # hour of the day, 0-23
# comb$date[["wday"]]   # day of the week, 0-6
# comb$date[["mday"]]   # day of the month, 1-31
# comb$date[["yday"]]   # day of the year, 0-365 
# comb$date[["mon"]]    # month
comb <- ( comb %>% mutate(hour=comb$date[["hour"]], wday=comb$date[["wday"]], 
                    mday=comb$date[["mday"]], yday=comb$date[["yday"]], month=comb$date[["mon"]] ) )
# drop extraneous columns: X, idNum, MDC, preRace, race, gender, policePrecinct
comb <- ( comb %>% select(-X, -idNum, -MDC, -preRace, -race, -gender, -policePrecinct) )

comb
## # A tibble: 51,920 x 17
##    date                problem   citationIssued personSearch vehicleSearch
##    <S3: POSIXlt>       <fct>     <fct>          <fct>        <fct>        
##  1 2017-01-01 00:00:42 suspicio~ <NA>           NO           NO           
##  2 2017-01-01 00:03:07 suspicio~ <NA>           NO           NO           
##  3 2017-01-01 00:23:15 traffic   <NA>           NO           NO           
##  4 2017-01-01 00:33:48 suspicio~ <NA>           NO           NO           
##  5 2017-01-01 00:37:58 traffic   <NA>           NO           NO           
##  6 2017-01-01 00:46:48 traffic   <NA>           NO           NO           
##  7 2017-01-01 00:48:46 suspicio~ <NA>           NO           NO           
##  8 2017-01-01 00:50:55 traffic   <NA>           NO           NO           
##  9 2017-01-01 00:57:10 traffic   <NA>           NO           NO           
## 10 2017-01-01 01:05:50 traffic   <NA>           NO           NO           
## # ... with 51,910 more rows, and 12 more variables: lat <dbl>, long <dbl>,
## #   neighborhood <fct>, pop <int>, income <int>, college <dbl>,
## #   quartile <int>, hour <int>, wday <int>, mday <int>, yday <int>,
## #   month <int>
str(comb)
## Classes 'tbl_df', 'tbl' and 'data.frame':    51920 obs. of  17 variables:
##  $ date          : POSIXlt, format: "2017-01-01 00:00:42" "2017-01-01 00:03:07" ...
##  $ problem       : Factor w/ 2 levels "suspicious","traffic": 1 1 2 1 2 2 1 2 2 2 ...
##  $ citationIssued: Factor w/ 2 levels "NO","YES": NA NA NA NA NA NA NA NA NA NA ...
##  $ personSearch  : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
##  $ vehicleSearch : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
##  $ lat           : num  45 45 44.9 44.9 45 ...
##  $ long          : num  -93.2 -93.3 -93.3 -93.3 -93.3 ...
##  $ neighborhood  : Factor w/ 87 levels "Armatage","Audubon Park",..: 11 20 84 84 20 20 20 51 59 28 ...
##  $ pop           : int  8247 7141 14604 14604 7141 7141 7141 10496 1393 5023 ...
##  $ income        : int  18892 67086 35855 35855 67086 67086 67086 27104 83520 40442 ...
##  $ college       : num  0.258 0.551 0.399 0.399 0.551 0.551 0.551 0.587 0.734 0.181 ...
##  $ quartile      : int  1 3 1 1 3 3 3 1 4 1 ...
##  $ hour          : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ wday          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mday          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yday          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ month         : int  0 0 0 0 0 0 0 0 0 0 ...
summary(comb)
##       date                           problem      citationIssued
##  Min.   :2017-01-01 00:00:42   suspicious:25822   NO  :15899    
##  1st Qu.:2017-03-29 08:35:09   traffic   :26098   YES : 3211    
##  Median :2017-06-17 18:46:47                      NA's:32810    
##  Mean   :2017-06-23 19:57:49                                    
##  3rd Qu.:2017-09-18 18:32:06                                    
##  Max.   :2017-12-31 23:52:35                                    
##                                                                 
##  personSearch vehicleSearch      lat             long       
##  NO  :38462   NO  :40579    Min.   :44.89   Min.   :-93.33  
##  YES : 5237   YES : 3120    1st Qu.:44.95   1st Qu.:-93.29  
##  NA's: 8221   NA's: 8221    Median :44.98   Median :-93.28  
##                             Mean   :44.97   Mean   :-93.27  
##                             3rd Qu.:45.00   3rd Qu.:-93.25  
##                             Max.   :45.05   Max.   :-93.20  
##                                                             
##         neighborhood        pop            income          college      
##  Downtown West: 4409   Min.   :  240   Min.   : 17469   Min.   :0.1220  
##  Whittier     : 3328   1st Qu.: 4609   1st Qu.: 35855   1st Qu.:0.2360  
##  Near - North : 2256   Median : 5934   Median : 44733   Median :0.3990  
##  Lyndale      : 2154   Mean   : 6428   Mean   : 49557   Mean   :0.4189  
##  Jordan       : 2075   3rd Qu.: 8097   3rd Qu.: 67086   3rd Qu.:0.5680  
##  Hawthorne    : 2031   Max.   :16022   Max.   :118750   Max.   :0.8370  
##  (Other)      :35667   NA's   :2300    NA's   :2300     NA's   :2300    
##     quartile          hour           wday            mday     
##  Min.   :1.000   Min.   : 0.0   Min.   :0.000   Min.   : 1.0  
##  1st Qu.:1.000   1st Qu.: 9.0   1st Qu.:2.000   1st Qu.: 8.0  
##  Median :2.000   Median :15.0   Median :3.000   Median :16.0  
##  Mean   :1.976   Mean   :13.5   Mean   :3.172   Mean   :15.8  
##  3rd Qu.:3.000   3rd Qu.:20.0   3rd Qu.:5.000   3rd Qu.:23.0  
##  Max.   :4.000   Max.   :23.0   Max.   :6.000   Max.   :31.0  
##  NA's   :2300                                                 
##       yday           month       
##  Min.   :  0.0   Min.   : 0.000  
##  1st Qu.: 87.0   1st Qu.: 2.000  
##  Median :167.0   Median : 5.000  
##  Mean   :173.2   Mean   : 5.236  
##  3rd Qu.:260.0   3rd Qu.: 8.000  
##  Max.   :364.0   Max.   :11.000  
## 

If I had more time, it would be useful to:

  • Aggregate the number of stops by neighborhood, then divide by population to get stops per capita
  • Create and calculate a “stop severity” metric, say severity = citationIssued + personSearch + vehicleSearch

This would allow us to do some interesting analysis that accounts for the neighborhood size / population, or the severity of the police stop activity. For instance, higher-income neighborhoods have lower absolute numbers of police stops, but this should be scaled for the population size. Secondly, not all stops are created equal — some stops result in vehicle searches or citations (tickets), while most stops result in no searches or citations. Creating a severity metric would allow us to quantify severity as an attribute of police stops.

Question 3

Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.

Please see examples below in the answer to Question 4.

Question 4

Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.

The overall question I tried to address is “What are some of the relationships in police stops over time and across neighborhoods, based on the datasets?” Breaking this down, there are several sub-questions that the analysis answers:

  • What are the demographics of the Minneapolis neighborhoods?
  • What are some of the average patterns of police stops over time, with respect to:
    • Hour of the day
    • Day of the week
    • Day of the year
    • Type of police stops
    • Neighborhood income?
  • How can we visualize the police stop data on a Minneapolis city map?

First, let’s look at the demographic data, and see how the neighborhoods vary by median household income, proportion of the population having a college degree and population size. As expected, household income has a strong positive relationship with the proportion of college graduates.

(Example of scatter plot)

ggplot(demo, aes(x = hhIncome, y = collegeGrad)) + geom_point(aes(size = population, col = quartile)) + 
    geom_smooth(method = "lm", se = FALSE, col = "red") + 
    labs(x="Household income", y="Proportion of college grads", title="Selected demographics of Minneapolis neighborhoods")  

Second, take a look at how the number of stops varies by the neighborhood’s household income, proportion of college graduates and population. The histograms show that neighborhoods with higher income and more college graduates tend to have fewer police stops. The relationship with neighborhood population is less conclusive.

(Examples of histograms)

g <- ggplot(comb)
g + geom_histogram(aes(x=income), binwidth = 5000) + 
    labs(x="Household income", y="Numb. of stops", title="Number of stops vs. household income")

g + geom_histogram(aes(x=college), binwidth = 0.075) + 
    labs(x="Prop. of college grads", y="Numb. of stops", title="Number of stops vs. proportion of college grads")

g + geom_histogram(aes(x=pop), binwidth = 1000) + 
    labs(x="Population", y="Numb. of stops", title="Number of stops vs. population")

Third, let’s plot the observation count of stops over time, to see the pattern of police stops over different time measures:

  • Hour of the day: police stops are lowest between the hours of 4AM and 7AM, and then increase almost uniformly until 11PM.
  • Day of the week: Friday is the busiest day and Sunday the quietest day for stops.
  • Day of the month: the 15th of the month shows a spike in activity; possibly related to St. Patrick’s Day (3/14), tax day (4/15) or police “quotas”? End of the month seems to show a drop-off in stop activity, but this could be related to daycount issues (e.g., not all months have a 31st day).
  • Day of the year and month: police stops are slowest over November and December, but then rise to a peak in March and April.

(Examples of histograms, bar charts and density chart)

g <- ggplot(comb)
g + geom_histogram(aes(x=hour), binwidth = 1) + 
    labs(x="Hour of the day (0=midnight, 23=11PM)", y="Numb. of stops", title="Stop activity during the day")

g + geom_bar(aes(x=wday)) + 
    labs(x="Day of the week (0=Sun, 6=Sat)", y="Numb. of stops", title="Stop activity during the week")

g + geom_histogram(aes(x=mday)) + 
    labs(x="Day of the month (1-31)", y="Numb. of stops", title="Stop activity during the month")

g + geom_density(aes(x=yday)) + 
    labs(x="Day of the year (0=Jan 1, 364=Dec 31)", y="Numb. of stops", title="Stop activity during the year")

g + geom_bar(aes(x=month)) + 
    labs(x="Month (0=Jan, 11=Dec)", y="Numb. of stops", title="Stop activity by month")

Next, let’s see if the distribution of police stops during the day varies by neighborhood income quartile. The box plot shows that 4th-quartile (high-income) neighborhoods tend to have fewer stops before 10AM or after 8PM.

(Example of box plot)

ggplot(comb, aes(y = hour, x = quartile, group = quartile)) + geom_boxplot() +
    labs(x="Income quartile", y="Hour of the day (0-23)", title="Distribution of stop times by neighborhood income quartile") 

To expand on this last point, let’s breakdown the count of stops by search type during the day, by neighborhood income quartile. The charts show that police stops are concentrated in 1st- and 2nd-quartile (lower-income) neighborhoods, and stops tend to be most active in the evening hours leading up to midnight. This pattern is generally true for other attributes include citations issued and personal or vehicle searches.

As a next step, it would be appropriate to scale the number of stops by the neighborhood population, and do the analysis on a per capita basis.

(Example of stacked bar charts and facet wraps)

ggplot(comb, aes(x=hour)) + geom_histogram(binwidth = 1) + facet_wrap(~ quartile) +
    labs(x="Hour of the day (0-23)", y="Numb. of stops", title="Stop activity during the day, per neighborhood income quartile")

ggplot(comb, aes(x=hour, fill = problem)) + geom_histogram(binwidth = 1) + facet_wrap(~ quartile) +
    labs(x="Hour of the day (0-23)", y="Numb. of stops", title="Stop activity by problem reported, per neighborhood income quartile")

ggplot(comb, aes(x=hour, fill = citationIssued)) + geom_histogram(binwidth = 1) + facet_wrap(~ quartile) +
    labs(x="Hour of the day (0-23)", y="Numb. of stops", title="Stop activity by citation result, per neighborhood income quartile")

ggplot(comb, aes(x=hour, fill = personSearch)) + geom_histogram(binwidth = 1) + facet_wrap(~ quartile) +
    labs(x="Hour of the day (0-23)", y="Numb. of stops", title="Stop activity by personal search result, per neighborhood income quartile")

ggplot(comb, aes(x=hour, fill = vehicleSearch)) + geom_histogram(binwidth = 1) + facet_wrap(~ quartile) +
    labs(x="Hour of the day (0-23)", y="Numb. of stops", title="Stop activity by vehicle search result, per neighborhood income quartile")

Finally, let’s see how to visualize 2017 police stop activity in Minneapolis. To do this, I consulted a few sources on creating heatmaps, and ended up borrowing code logic from a paper on RPubs (https://rpubs.com/jimu_xw/crime_visualization).

(Example of heatmaps using ggmap)

# get city map and plot sample stop data onto the map
minnie <- get_map(location = "minneapolis", zoom = 12)        
ggmap(minnie) + geom_jitter(data=comb[sample(nrow(comb), 5000), ], aes(x=long, y=lat)) +
    labs(title="Sample of 2017 police stop activity in Minneapolis")

# create heatmap -- borrow code from https://rpubs.com/jimu_xw/crime_visualization
LatLonCounts <- as.data.frame(table(round(comb$long, 2), round(comb$lat, 2)))
str(LatLonCounts)
## 'data.frame':    238 obs. of  3 variables:
##  $ Var1: Factor w/ 14 levels "-93.33","-93.32",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Var2: Factor w/ 17 levels "44.89","44.9",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Freq: int  0 3 14 18 59 126 22 22 3 0 ...
LatLonCounts$Long <- as.numeric(as.character(LatLonCounts$Var1))
LatLonCounts$Lat <- as.numeric(as.character(LatLonCounts$Var2))

# try a few iterations
ggmap(minnie) + geom_point(data = LatLonCounts, aes(x = Long, y = Lat, color = Freq, size = Freq)) +
    labs(title="Frequency of 2017 police stops in Minneapolis")

ggmap(minnie) + geom_point(data = LatLonCounts, aes(x = Long, y = Lat, color = Freq, size = Freq)) +
    scale_color_gradient(low = "yellow", high = "red") +
    labs(title="Frequency of 2017 police stops in Minneapolis")

ggmap(minnie) + geom_tile(data = LatLonCounts, aes(x = Long, y = Lat, alpha = Freq), fill = "red") +
    labs(title="Heat map of 2017 police stops in Minneapolis")

In conclusion, 2017 Minneapolis police stop activity exhibits clear relationships in the frequency and type of stops that occur over time and across neighborhoods.

Question 5

BONUS - place the original .csv in a github file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.

This is provided in the setup and data loading section above, where the file locations can be set to either the C: drive or to my GitHub repository. This assignment was run using the GitHub files.