The dataset that I will explore in this project was created by the San Francisco Chronicle as a part of their Fast and Fatal investigation into police car chases in the U.S. The dataset was compiled from both government sources, private organisations and the Chronicle’s own journalists.
This dataset is a compilation of at over 3300 cases of deaths connected to police vehicle pursuits, with variables that describe the location of the pursuit, the race and gender of the diseased, as well as the inital reasons for the pursuit and what relation the diseased had to the event (for example, bystander, offiser, driver, passenger). My plan for exploring this dataset is that I will begin by cleaning it, and then exploring the dataset walking through the categories through a series of rough exploratory graphs. My aim is that by the end I will have sufficiently transformed that dataset to be able to apply linear regression model to at least a few of the key variables.
Finally, I want to add that this project got a little long, the graph and relevant stuff can be found at the end: Results and Final Graph
Appendix:
Packages:
loading necessary packages:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(GGally) # might not need this one
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
library(psych)
Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':
%+%, alpha
library(usdata) # could be controversiallibrary(lubridate)library(streamgraph)library(treemap)library(stringr)
Rows: 3336 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): data_source, date, gender, race, race_source, county, state, name,...
dbl (8): unique_id, year, number_killed, age, lat, long, centroid_geo, in_f...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##Cleaning data: cleaning away
head(police_on_speed)
# A tibble: 6 × 22
unique_id data_source year date number_killed age gender race race_source
<dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 2918 nhtsa_sfch… 2021 10/2… 1 93 male white photo,nhtsa
2 323 nhtsa_sfch… 2017 8/17… 1 92 male white photo,nhtsa
3 1488 nhtsa_sfch… 2019 9/15… 3 91 male white photo,nhtsa
4 2315 nhtsa_sfch… 2020 12/2… 2 89 male whit… photo,nhtsa
5 1720 nhtsa_sfch… 2020 2/14… 1 89 female white photo,nhtsa
6 1490 nhtsa_sfch… 2019 9/15… 3 89 female white photo,nhtsa
# ℹ 13 more variables: county <chr>, state <chr>, lat <dbl>, long <dbl>,
# name <chr>, initial_reason <chr>, person_role <chr>, main_agency <chr>,
# news_urls <chr>, city <chr>, zip <chr>, centroid_geo <dbl>,
# in_fars_pursuit <dbl>
This is supposed to be a binary, or rather a bolean (true or false) and it is not. This might be a controversial move, but I am going to assume that those that come out as 2 were in fact meant to be 1.
I have a gripe with the demographic data. The gripe I have is that there are seven categories for mixed race for latino. As a result, latio and the other mixed race categories barely show up in the data at all. I think it may also be fair to presume that it may be better to create a composit latino and mixed race category.
Domestic incidents and minor incident/no crime do not really have a sufficent number of cases to really excavate for data and rather serve as a distraction. Therefore I have chosen to fold them into the categroy “other”
county_summary <- police_on_speed |>group_by(county) |>summarise(total_killed =sum(number_killed, na.rm =TRUE), incidents =n())# can I group by other things here? county_explorer <-ggplot(county_summary, aes(x = county, y = incidents, color = total_killed)) +theme_minimal(base_size =12) +geom_point() county_explorer
This is an interesting graph, clearly the pandemic had some kind of effect.
Discovering a huge oversight pivoting for sake of finding data
So it turns out - of course - that I have so far misread the data, as the number killed is duplicated for each individual killed, greatly skewing the data in favour of situations in which many were killed rather than a few. I am now uncertain about how to undo this but I will work on.
model <-lm(number_killed ~ black + white + asian + latino, data = pivot_race)summary(model)
Call:
lm(formula = number_killed ~ black + white + asian + latino,
data = pivot_race)
Residuals:
Min 1Q Median 3Q Max
-1.1951 -0.6165 -0.4822 0.3835 6.3835
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.616541 0.044548 36.288 < 2e-16 ***
black 0.005552 0.059321 0.094 0.9254
white -0.134322 0.059045 -2.275 0.0230 *
asian 0.280895 0.170455 1.648 0.0995 .
latino 0.578581 0.102800 5.628 2.07e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.027 on 2080 degrees of freedom
Multiple R-squared: 0.02527, Adjusted R-squared: 0.02339
F-statistic: 13.48 on 4 and 2080 DF, p-value: 7.535e-11
Analysis:
This is a terrible model! The p value, and the adjusted R^2 values suggest that there is no clear relationship to be found here. I am clearly not looking hard enough.
TASK: Include a linear or multiple linear regression analysis of 2 or more quantitative variables. Write the equation for your model and analyze your model based on p-values, adjusted R^2 values, and diagnostic plots.
ggplot(data = police_on_speed, aes(x = age, color = person_role)) +geom_density(alpha =1, na.rm =TRUE) +# Assuming you want a density plotscale_fill_fermenter() +labs(title ="Police Kills by age and their relationship to the chase", caption ="Source: San Francisco Chronicle", color ="Role of Person killed") +xlab("Age") +# Label for the x-axisylab("Deaths proportion of each type") # Label for the y-axis
# Install devtools if you haven't already#install.packages("devtools")# Use devtools to install the streamgraph package from GitHub (replace <repository> with the actual repository path)#devtools::install_github("hrbrmstr/streamgraph")# Load the package#library(streamgraph)
library(ggalluvial)
ggplot(data = police_on_speed, mapping =aes(x = date, y = age, color = person_role, size = number_killed)) +geom_point(alpha =0.6, na.rm =TRUE) +labs(title ="Police kills",caption ="Source: U.S. Census Bureau and Nathan Yau") +theme_minimal(base_size =12)
new_table <- police_on_speed |>group_by(year, race) |>summarise(incidents =n(), .groups ='drop')ggalluv <- new_table|>ggplot(aes(x = year, y = incidents, alluvium = race)) +theme_bw() +geom_alluvium(aes(fill = race), color ="white",width = .1, alpha = .8,decreasing =FALSE) +scale_fill_discrete() +labs(title ="The number of deaths by demographic",y ="Number of deaths", fill ="Demographic",caption ="Source: San Francisco Chronicle")
ggalluv
other_table <- police_on_speed |>group_by(year, initial_reason) |>summarise(incidents =n(), .groups ='drop')Rolealluv <- other_table|>ggplot(aes(x = year, y = incidents, alluvium = initial_reason)) +theme_bw() +geom_alluvium(aes(fill = initial_reason), color ="white",width = .1, alpha = .8,decreasing =FALSE) +scale_fill_brewer(palette ="Spectral") +labs(title ="The number of deaths by the inital reason given for police action",y ="number of deaths", fill ="Initial reason given",caption ="Source: San Francisco Chronicle")
Rolealluv
Some more data cleaning:
traffic_stop_data <- police_on_speed |>filter(initial_reason =="traffic stop") |>group_by(year, race) |>summarise(incidents =n(), .groups ='drop')ggplot(traffic_stop_data, aes(x = year, y = incidents, color = race)) +geom_line() +geom_point() +scale_color_brewer() +# Using a discrete viridis palette for better color distinctiontheme_bw() +labs(title ="Traffic Stop Incidents by Race Over Time",x ="Year",y ="Number of Incidents",color ="Race") +theme(plot.title =element_text(hjust =0.5))
# Example using role_table to create a treemap# Ensure your role_table has the columns 'race', 'initial_reason', and 'incidents'tree_race <-treemap(role_table,index =c("race", "initial_reason"), # Hierarchical structure: primary grouping by race, then by initial_reasonvSize ="incidents", # The size of the rectangles represents the number of incidentsvColor ="incidents", # Color can also represent the number of incidents, or you can use another variablepalette ="Reds", # Color palettetitle ="Treemap of Incidents by Race and Initial Reason")
tree_race
$tm
race initial_reason vSize vColor stdErr vColorValue level
1 asian <NA> 39 15 39 NA 1
2 asian other 1 1 1 NA 2
3 asian suspected nonviolent 9 3 9 NA 2
4 asian suspected violent 2 2 2 NA 2
5 asian traffic stop 24 6 24 NA 2
6 asian unknown 3 3 3 NA 2
7 black <NA> 688 30 688 NA 1
8 black other 21 6 21 NA 2
9 black suspected nonviolent 164 6 164 NA 2
10 black suspected violent 116 6 116 NA 2
11 black traffic stop 328 6 328 NA 2
12 black unknown 59 6 59 NA 2
13 latino <NA> 123 23 123 NA 1
14 latino other 3 2 3 NA 2
15 latino suspected nonviolent 53 6 53 NA 2
16 latino suspected violent 12 5 12 NA 2
17 latino traffic stop 48 6 48 NA 2
18 latino unknown 7 4 7 NA 2
19 mixed race <NA> 210 25 210 NA 1
20 mixed race other 2 2 2 NA 2
21 mixed race suspected nonviolent 68 6 68 NA 2
22 mixed race suspected violent 28 5 28 NA 2
23 mixed race traffic stop 86 6 86 NA 2
24 mixed race unknown 26 6 26 NA 2
25 other <NA> 32 18 32 NA 1
26 other other 3 2 3 NA 2
27 other suspected nonviolent 12 5 12 NA 2
28 other suspected violent 3 3 3 NA 2
29 other traffic stop 12 6 12 NA 2
30 other unknown 2 2 2 NA 2
31 unknown <NA> 290 26 290 NA 1
32 unknown other 4 2 4 NA 2
33 unknown suspected nonviolent 91 6 91 NA 2
34 unknown suspected violent 43 6 43 NA 2
35 unknown traffic stop 117 6 117 NA 2
36 unknown unknown 35 6 35 NA 2
37 white <NA> 703 30 703 NA 1
38 white other 36 6 36 NA 2
39 white suspected nonviolent 177 6 177 NA 2
40 white suspected violent 80 6 80 NA 2
41 white traffic stop 334 6 334 NA 2
42 white unknown 76 6 76 NA 2
x0 y0 w h color
1 0.6671463 0.00000000 0.18283514 0.10230548 #FFF5F0
2 0.8359172 0.00000000 0.01406424 0.03410183 #E6D8D5
3 0.7796602 0.02557637 0.05625697 0.07672911 #E6D9D5
4 0.8359172 0.03410183 0.01406424 0.06820365 #E6DBD5
5 0.6671463 0.00000000 0.11251393 0.10230548 #E6DCD5
6 0.7796602 0.00000000 0.05625697 0.02557637 #E6DDD5
7 0.0000000 0.00000000 0.66714628 0.49460820 #FEE0D2
8 0.5895711 0.00000000 0.07757515 0.12983465 #E5BCB5
9 0.3180581 0.20490911 0.27151302 0.28969909 #E5C0B5
10 0.3180581 0.00000000 0.27151302 0.20490911 #E5C4B5
11 0.0000000 0.00000000 0.31805811 0.49460820 #E5C8B5
12 0.5895711 0.12983465 0.07757515 0.36477354 #E5CCB5
13 0.8770540 0.10230548 0.12294597 0.47982709 #FCBBA1
14 0.9441155 0.10230548 0.05588453 0.02574682 #E38C81
15 0.8770540 0.37537780 0.12294597 0.20675476 #E39481
16 0.8770540 0.10230548 0.06706144 0.08582273 #E39D81
17 0.8770540 0.18812821 0.12294597 0.18724960 #E3A581
18 0.9441155 0.12805229 0.05588453 0.06007591 #E3AD81
19 0.6671463 0.10230548 0.20990775 0.47982709 #FC9272
20 0.8296555 0.10230548 0.04739852 0.02023761 #E3574E
21 0.6671463 0.18494236 0.16250923 0.20068958 #E3644E
22 0.6671463 0.10230548 0.16250923 0.08263689 #E3704E
23 0.6671463 0.38563195 0.20990775 0.19650062 #E37D4E
24 0.8296555 0.12254308 0.04739852 0.26308887 #E3894E
25 0.8499814 0.00000000 0.15001858 0.10230548 #FB6A4A
26 0.9624954 0.06394092 0.03750464 0.03836455 #E22523
27 0.8499814 0.00000000 0.05625697 0.10230548 #E23523
28 0.9624954 0.02557637 0.03750464 0.03836455 #E24523
29 0.9062384 0.00000000 0.05625697 0.10230548 #E25523
30 0.9624954 0.00000000 0.03750464 0.02557637 #E26523
31 0.6671463 0.58213256 0.33285372 0.41786744 #EF3B2C
32 0.9058827 0.58213256 0.09411726 0.02038378 #D70517
33 0.6671463 0.58213256 0.23873646 0.18281700 #D70506
34 0.9058827 0.78087439 0.09411726 0.21912561 #D71505
35 0.6671463 0.76494957 0.23873646 0.23505043 #D72605
36 0.9058827 0.60251634 0.09411726 0.17835805 #D73805
37 0.0000000 0.49460820 0.66714628 0.50539180 #CB181D
38 0.5608584 0.49460820 0.10628789 0.16244737 #B70024
39 0.3169657 0.65192860 0.24389274 0.34807140 #B70014
40 0.3169657 0.49460820 0.24389274 0.15732041 #B70005
41 0.0000000 0.49460820 0.31696566 0.50539180 #B70A00
42 0.5608584 0.65705556 0.10628789 0.34294444 #B71900
$type
[1] "index"
$vSize
[1] "incidents"
$vColor
[1] NA
$stdErr
[1] "incidents"
$algorithm
[1] "pivotSize"
$vpCoorX
[1] 0.02812148 0.97187852
$vpCoorY
[1] 0.01968504 0.91031496
$aspRatio
[1] 1.483512
$range
[1] NA
$mapping
[1] NA NA NA
$draw
[1] TRUE
ggplot(role_table, aes(x = initial_reason, y = race, fill = incidents)) +geom_tile() +# Creates the heatmap tilesscale_fill_gradient(low ="white", high ="red") +# Gradient of colorslabs(title ="Heatmap of Incidents by Race and Initial Reason",x ="Initial Reason",y ="Race",fill ="Number of Incidents") +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1))
ggplot(role_table, aes(x = initial_reason, y = incidents, fill = race)) +geom_bar(stat ="identity", position ="dodge") +# Creates the heatmap tilestheme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1))
Summary:
a:
The first thing I did was to remove unncessary columns that I would not be investigating (news_urls, centroid_geo, race_source, data_source, name). Then I wanted to filter out some of the na’s out of sets I thought I might end up using. After that I changed the date format it something slightly more usable using the lubridate package so that I wouldn’t have any issues plotting data in a time series. I then discovered that there were errors in the “in_fars_pursuit” column. This is column supposed to be a binary, or rather a bolean (true or false) and it is not. In what may be a controversial move, I decided not to throw out those rows but rather assumed that those that come out as 2 were in fact meant to be 1. I finally renamed the column as I discovered that I found the spelling hard to remember.
b:
As far as what the visualisation represents. The most arresting discovery I found was that in a timeseries graph we can see a very distinct jump in the number of deaths during the pandemic and a somewhat return to original levels in the final year of 2022.
c:
What I was going to do and which took far too much time was to figure out how to join this dataset with the US census. This would have allowed me to use demograpfic and income data from the specific counties, and would have allowed for a much more detailed analysis.