Data110_PROJECT_1_SF_POLICE

Data 110 Project 1 Police chases

Introduction:

The dataset that I will explore in this project was created by the San Francisco Chronicle as a part of their Fast and Fatal investigation into police car chases in the U.S. The dataset was compiled from both government sources, private organisations and the Chronicle’s own journalists.

This dataset is a compilation of at over 3300 cases of deaths connected to police vehicle pursuits, with variables that describe the location of the pursuit, the race and gender of the diseased, as well as the inital reasons for the pursuit and what relation the diseased had to the event (for example, bystander, offiser, driver, passenger). My plan for exploring this dataset is that I will begin by cleaning it, and then exploring the dataset walking through the categories through a series of rough exploratory graphs. My aim is that by the end I will have sufficiently transformed that dataset to be able to apply linear regression model to at least a few of the key variables.

Finally, I want to add that this project got a little long, the graph and relevant stuff can be found at the end: Results and Final Graph

Appendix:

Packages:

loading necessary packages:

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tinytex)
library(ggplot2)
library(wesanderson)
library(ggfortify)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(GGally) # might not need this one

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

library(psych)


Attaching package: 'psych'

The following objects are masked from 'package:ggplot2':

    %+%, alpha

library(usdata) # could be controversial
library(lubridate)
library(streamgraph)
library(treemap)
library(stringr)

Dataset(s):

loading dataset:

setwd("/Users/gimle/Desktop/Data 110/Datasets 110")
police_on_speed <- read_csv("sfc_pursuit_fatalities-3.csv")

Rows: 3336 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): data_source, date, gender, race, race_source, county, state, name,...
dbl  (8): unique_id, year, number_killed, age, lat, long, centroid_geo, in_f...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

##Cleaning data: cleaning away

head(police_on_speed)

# A tibble: 6 × 22
  unique_id data_source  year date  number_killed   age gender race  race_source
      <dbl> <chr>       <dbl> <chr>         <dbl> <dbl> <chr>  <chr> <chr>      
1      2918 nhtsa_sfch…  2021 10/2…             1    93 male   white photo,nhtsa
2       323 nhtsa_sfch…  2017 8/17…             1    92 male   white photo,nhtsa
3      1488 nhtsa_sfch…  2019 9/15…             3    91 male   white photo,nhtsa
4      2315 nhtsa_sfch…  2020 12/2…             2    89 male   whit… photo,nhtsa
5      1720 nhtsa_sfch…  2020 2/14…             1    89 female white photo,nhtsa
6      1490 nhtsa_sfch…  2019 9/15…             3    89 female white photo,nhtsa
# ℹ 13 more variables: county <chr>, state <chr>, lat <dbl>, long <dbl>,
#   name <chr>, initial_reason <chr>, person_role <chr>, main_agency <chr>,
#   news_urls <chr>, city <chr>, zip <chr>, centroid_geo <dbl>,
#   in_fars_pursuit <dbl>

unique_names <- length(unique(police_on_speed$county))

police_on_speed <- police_on_speed |> 
  select(-c(news_urls, centroid_geo, race_source, data_source, name))

Dealing with NAs

police_on_speed <- police_on_speed |>
  filter(!is.na(initial_reason) & !is.na(main_agency))

Dealing with how they have written the date:

police_on_speed$date <- police_on_speed$date |> mdy()

Problem with in_fars_pursuit:

This is supposed to be a binary, or rather a bolean (true or false) and it is not. This might be a controversial move, but I am going to assume that those that come out as 2 were in fact meant to be 1.

police_on_speed <- police_on_speed |> 
  mutate(in_fars_pursuit = gsub("2", "1", in_fars_pursuit))

Also, this is a stupid name.

police_on_speed <- police_on_speed |> 
  rename(fast_binary = in_fars_pursuit)

I have a gripe with the demographic data. The gripe I have is that there are seven categories for mixed race for latino. As a result, latio and the other mixed race categories barely show up in the data at all. I think it may also be fair to presume that it may be better to create a composit latino and mixed race category.

police_on_speed <- police_on_speed |> 
  mutate(race = case_when(
    race %in% c("white,latino", "black,latino", "other,latino", "latino,white", "latino,black") ~ "mixed race",
    TRUE ~ race 
  ))

I kept making the mistake of choosing the column “number_killed”

police_on_speed <- police_on_speed |> 
  mutate(incidents = 1)

Domestic incidents and minor incident/no crime do not really have a sufficent number of cases to really excavate for data and rather serve as a distraction. Therefore I have chosen to fold them into the categroy “other”

police_on_speed <- police_on_speed |> 
  mutate(initial_reason = case_when(
    initial_reason %in% c("domestic incident", "minor incident/no crime") ~ "other",
    TRUE ~ initial_reason 
  ))

view(police_on_speed)

Exploration:

Not including final graph

polyear <- police_on_speed |> 
    mutate(date = as.Date(date),
    year_month = floor_date(date, "month"))

monthly_killed_summary <- polyear |> 
  group_by(year_month) |> 
  summarise(total_killed = sum(number_killed, na.rm = TRUE)) |> 
  ungroup() |> 
  arrange(year_month)  

Covid_kills <- ggplot(monthly_killed_summary, aes(x = year_month, y = total_killed)) +
  geom_point() + 
  scale_x_date(date_breaks = "1 year", date_labels = "%Y-%b") + 
  theme_minimal() +
  labs(x = "Year-Month", y = "Number Killed", title = "Monthly Trend of Number Killed ") 
 
Covid_kills + geom_smooth(method='loess',formula=y~x)

state_summary <- police_on_speed |>
  group_by(state) |> 
  summarise(total_killed = sum(number_killed, na.rm = TRUE))

state_explorer <- ggplot(state_summary, aes(x = state, y = total_killed)) +
  theme_minimal(base_size = 10) + geom_point() 
state_explorer

bystander_summary <- police_on_speed |> 
  group_by(lat) |> 
  summarise(total_killed = sum(number_killed, na.rm = TRUE))

bystander_explorer <- ggplot(bystander_summary, aes(x = total_killed , y = lat)) +
  theme_minimal(base_size = 12) + geom_point()
bystander_explorer

bystander_summary <- police_on_speed |> 
  group_by(person_role) |> 
  summarise(total_killed = sum(number_killed, na.rm = TRUE))

bystander_explorer <- ggplot(bystander_summary, aes(x = person_role, y = total_killed)) +
  theme_minimal(base_size = 12) + geom_point()
bystander_explorer

county_summary <- police_on_speed |> 
  group_by(county) |> 
  summarise(total_killed = sum(number_killed, na.rm = TRUE), incidents = n())
# can I group by other things here?  

county_explorer <- ggplot(county_summary, aes(x = county, y = incidents, color = total_killed)) +
  theme_minimal(base_size = 12) + geom_point() 
county_explorer

county_summary <- police_on_speed |>  
  group_by(county) |> 
  summarise(total_killed = sum(number_killed, na.rm = TRUE)) |> 
  arrange(desc(total_killed)) |>
  slice_head(n = 100) 

county_explorer <- ggplot(county_summary, aes(x = reorder(county, total_killed), y = total_killed)) +
  geom_point() +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
county_explorer

year_summary <- police_on_speed |> 
  group_by(year) |> 
  summarise(total_killed = sum(number_killed, na.rm = TRUE))
year_explorer <- ggplot(year_summary, aes(x = year, y = total_killed)) +
  theme_minimal(base_size = 12) + geom_point()
year_explorer

This is an interesting graph, clearly the pandemic had some kind of effect.

Discovering a huge oversight pivoting for sake of finding data

So it turns out - of course - that I have so far misread the data, as the number killed is duplicated for each individual killed, greatly skewing the data in favour of situations in which many were killed rather than a few. I am now uncertain about how to undo this but I will work on.

pivot_race <- police_on_speed |> 
  mutate(value = 1) |> 
  pivot_wider(names_from = race, values_from = value, values_fill = list(value = 0))

pivot_bystander <- police_on_speed |> 
   mutate(value = 1) |> 
  pivot_wider(names_from = person_role, values_from = value, values_fill = list(value = 0))

pivot_reason <- police_on_speed |> 
   mutate(value = 1) |> 
  pivot_wider(names_from = initial_reason, values_from = value, values_fill = list(value = 0))

pivot_year <- police_on_speed |> 
   mutate(value = 1) |> 
  pivot_wider(names_from = year, values_from = value, values_fill = list(value = 0))

Linear regression:

For race:

model <- lm(number_killed ~ black + white + asian + latino, 
            data = pivot_race)
summary(model)


Call:
lm(formula = number_killed ~ black + white + asian + latino, 
    data = pivot_race)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1951 -0.6165 -0.4822  0.3835  6.3835 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.616541   0.044548  36.288  < 2e-16 ***
black        0.005552   0.059321   0.094   0.9254    
white       -0.134322   0.059045  -2.275   0.0230 *  
asian        0.280895   0.170455   1.648   0.0995 .  
latino       0.578581   0.102800   5.628 2.07e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.027 on 2080 degrees of freedom
Multiple R-squared:  0.02527,   Adjusted R-squared:  0.02339 
F-statistic: 13.48 on 4 and 2080 DF,  p-value: 7.535e-11

Analysis:

This is a terrible model! The p value, and the adjusted R^2 values suggest that there is no clear relationship to be found here. I am clearly not looking hard enough.

TASK: Include a linear or multiple linear regression analysis of 2 or more quantitative variables. Write the equation for your model and analyze your model based on p-values, adjusted R^2 values, and diagnostic plots.

ggplot(data = police_on_speed, aes(x = age, color = person_role)) +  
  geom_density(alpha = 1, na.rm = TRUE) +  # Assuming you want a density plot
  scale_fill_fermenter() +
  labs(title = "Police Kills by age and their relationship to the chase", caption = "Source: San Francisco Chronicle", color = "Role of Person killed") +
  xlab("Age") +  # Label for the x-axis
  ylab("Deaths proportion of each type")   # Label for the y-axis

# Install devtools if you haven't already
#install.packages("devtools")

# Use devtools to install the streamgraph package from GitHub (replace <repository> with the actual repository path)
#devtools::install_github("hrbrmstr/streamgraph")

# Load the package
#library(streamgraph)

library(ggalluvial)

ggplot(data = police_on_speed, mapping = aes(x = date, y = age, color = person_role, size = number_killed)) +  
  geom_point(alpha = 0.6, na.rm = TRUE) + 
  labs(title = "Police kills",
  caption = "Source: U.S. Census Bureau and Nathan Yau") +
  theme_minimal(base_size = 12)

new_table <- police_on_speed |> 
  group_by(year, race)  |> 
  summarise(incidents = n(), .groups = 'drop')


ggalluv <- new_table|>
  ggplot(aes(x = year, y = incidents, alluvium = race)) + 
  theme_bw() +
  geom_alluvium(aes(fill = race), 
                color = "white",
                width = .1, 
                alpha = .8,
                decreasing = FALSE) +
  scale_fill_discrete() +
  labs(title = "The number of deaths by demographic",
       y = "Number of deaths", 
       fill = "Demographic",
       caption = "Source: San Francisco Chronicle")

ggalluv

other_table <- police_on_speed |> 
  group_by(year, initial_reason)  |> 
  summarise(incidents = n(), .groups = 'drop')

Rolealluv <- other_table|>
  ggplot(aes(x = year, y = incidents, alluvium = initial_reason)) + 
  theme_bw() +
  geom_alluvium(aes(fill = initial_reason), 
                color = "white",
                width = .1, 
                alpha = .8,
                decreasing = FALSE) +
  scale_fill_brewer(palette = "Spectral") +
  labs(title = "The number of deaths by the inital reason given for police action",
       y = "number of deaths", 
       fill = "Initial reason given",
       caption = "Source: San Francisco Chronicle")

Rolealluv

Some more data cleaning:

traffic_stop_data <- police_on_speed |> 
  filter(initial_reason == "traffic stop") |> 
  group_by(year, race) |> 
  summarise(incidents = n(), .groups = 'drop')



ggplot(traffic_stop_data, aes(x = year, y = incidents, color = race)) +
  geom_line() + 
  geom_point() +
  scale_color_brewer() +  # Using a discrete viridis palette for better color distinction
  theme_bw() +
  labs(title = "Traffic Stop Incidents by Race Over Time",
       x = "Year",
       y = "Number of Incidents",
       color = "Race") +
  theme(plot.title = element_text(hjust = 0.5))

role_table <- police_on_speed |> 
  group_by(year, race, initial_reason)  |> 
  summarise(incidents = n(), .groups = 'drop')

# Example using role_table to create a treemap
# Ensure your role_table has the columns 'race', 'initial_reason', and 'incidents'
tree_race <-treemap(role_table,
        index = c("race", "initial_reason"),  # Hierarchical structure: primary grouping by race, then by initial_reason
        vSize = "incidents",  # The size of the rectangles represents the number of incidents
        vColor = "incidents",  # Color can also represent the number of incidents, or you can use another variable
        palette = "Reds",  # Color palette
        title = "Treemap of Incidents by Race and Initial Reason")

tree_race

$tm
         race       initial_reason vSize vColor stdErr vColorValue level
1       asian                 <NA>    39     15     39          NA     1
2       asian                other     1      1      1          NA     2
3       asian suspected nonviolent     9      3      9          NA     2
4       asian    suspected violent     2      2      2          NA     2
5       asian         traffic stop    24      6     24          NA     2
6       asian              unknown     3      3      3          NA     2
7       black                 <NA>   688     30    688          NA     1
8       black                other    21      6     21          NA     2
9       black suspected nonviolent   164      6    164          NA     2
10      black    suspected violent   116      6    116          NA     2
11      black         traffic stop   328      6    328          NA     2
12      black              unknown    59      6     59          NA     2
13     latino                 <NA>   123     23    123          NA     1
14     latino                other     3      2      3          NA     2
15     latino suspected nonviolent    53      6     53          NA     2
16     latino    suspected violent    12      5     12          NA     2
17     latino         traffic stop    48      6     48          NA     2
18     latino              unknown     7      4      7          NA     2
19 mixed race                 <NA>   210     25    210          NA     1
20 mixed race                other     2      2      2          NA     2
21 mixed race suspected nonviolent    68      6     68          NA     2
22 mixed race    suspected violent    28      5     28          NA     2
23 mixed race         traffic stop    86      6     86          NA     2
24 mixed race              unknown    26      6     26          NA     2
25      other                 <NA>    32     18     32          NA     1
26      other                other     3      2      3          NA     2
27      other suspected nonviolent    12      5     12          NA     2
28      other    suspected violent     3      3      3          NA     2
29      other         traffic stop    12      6     12          NA     2
30      other              unknown     2      2      2          NA     2
31    unknown                 <NA>   290     26    290          NA     1
32    unknown                other     4      2      4          NA     2
33    unknown suspected nonviolent    91      6     91          NA     2
34    unknown    suspected violent    43      6     43          NA     2
35    unknown         traffic stop   117      6    117          NA     2
36    unknown              unknown    35      6     35          NA     2
37      white                 <NA>   703     30    703          NA     1
38      white                other    36      6     36          NA     2
39      white suspected nonviolent   177      6    177          NA     2
40      white    suspected violent    80      6     80          NA     2
41      white         traffic stop   334      6    334          NA     2
42      white              unknown    76      6     76          NA     2
          x0         y0          w          h   color
1  0.6671463 0.00000000 0.18283514 0.10230548 #FFF5F0
2  0.8359172 0.00000000 0.01406424 0.03410183 #E6D8D5
3  0.7796602 0.02557637 0.05625697 0.07672911 #E6D9D5
4  0.8359172 0.03410183 0.01406424 0.06820365 #E6DBD5
5  0.6671463 0.00000000 0.11251393 0.10230548 #E6DCD5
6  0.7796602 0.00000000 0.05625697 0.02557637 #E6DDD5
7  0.0000000 0.00000000 0.66714628 0.49460820 #FEE0D2
8  0.5895711 0.00000000 0.07757515 0.12983465 #E5BCB5
9  0.3180581 0.20490911 0.27151302 0.28969909 #E5C0B5
10 0.3180581 0.00000000 0.27151302 0.20490911 #E5C4B5
11 0.0000000 0.00000000 0.31805811 0.49460820 #E5C8B5
12 0.5895711 0.12983465 0.07757515 0.36477354 #E5CCB5
13 0.8770540 0.10230548 0.12294597 0.47982709 #FCBBA1
14 0.9441155 0.10230548 0.05588453 0.02574682 #E38C81
15 0.8770540 0.37537780 0.12294597 0.20675476 #E39481
16 0.8770540 0.10230548 0.06706144 0.08582273 #E39D81
17 0.8770540 0.18812821 0.12294597 0.18724960 #E3A581
18 0.9441155 0.12805229 0.05588453 0.06007591 #E3AD81
19 0.6671463 0.10230548 0.20990775 0.47982709 #FC9272
20 0.8296555 0.10230548 0.04739852 0.02023761 #E3574E
21 0.6671463 0.18494236 0.16250923 0.20068958 #E3644E
22 0.6671463 0.10230548 0.16250923 0.08263689 #E3704E
23 0.6671463 0.38563195 0.20990775 0.19650062 #E37D4E
24 0.8296555 0.12254308 0.04739852 0.26308887 #E3894E
25 0.8499814 0.00000000 0.15001858 0.10230548 #FB6A4A
26 0.9624954 0.06394092 0.03750464 0.03836455 #E22523
27 0.8499814 0.00000000 0.05625697 0.10230548 #E23523
28 0.9624954 0.02557637 0.03750464 0.03836455 #E24523
29 0.9062384 0.00000000 0.05625697 0.10230548 #E25523
30 0.9624954 0.00000000 0.03750464 0.02557637 #E26523
31 0.6671463 0.58213256 0.33285372 0.41786744 #EF3B2C
32 0.9058827 0.58213256 0.09411726 0.02038378 #D70517
33 0.6671463 0.58213256 0.23873646 0.18281700 #D70506
34 0.9058827 0.78087439 0.09411726 0.21912561 #D71505
35 0.6671463 0.76494957 0.23873646 0.23505043 #D72605
36 0.9058827 0.60251634 0.09411726 0.17835805 #D73805
37 0.0000000 0.49460820 0.66714628 0.50539180 #CB181D
38 0.5608584 0.49460820 0.10628789 0.16244737 #B70024
39 0.3169657 0.65192860 0.24389274 0.34807140 #B70014
40 0.3169657 0.49460820 0.24389274 0.15732041 #B70005
41 0.0000000 0.49460820 0.31696566 0.50539180 #B70A00
42 0.5608584 0.65705556 0.10628789 0.34294444 #B71900

$type
[1] "index"

$vSize
[1] "incidents"

$vColor
[1] NA

$stdErr
[1] "incidents"

$algorithm
[1] "pivotSize"

$vpCoorX
[1] 0.02812148 0.97187852

$vpCoorY
[1] 0.01968504 0.91031496

$aspRatio
[1] 1.483512

$range
[1] NA

$mapping
[1] NA NA NA

$draw
[1] TRUE

ggplot(role_table, aes(x = initial_reason, y = race, fill = incidents)) + 
  geom_tile() +  # Creates the heatmap tiles
  scale_fill_gradient(low = "white", high = "red") +  # Gradient of colors
  labs(title = "Heatmap of Incidents by Race and Initial Reason",
       x = "Initial Reason",
       y = "Race",
       fill = "Number of Incidents") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(role_table, aes(x = initial_reason, y = incidents, fill = race)) + 
  geom_bar(stat = "identity", position = "dodge") +  # Creates the heatmap tiles
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Summary:

The first thing I did was to remove unncessary columns that I would not be investigating (news_urls, centroid_geo, race_source, data_source, name). Then I wanted to filter out some of the na’s out of sets I thought I might end up using. After that I changed the date format it something slightly more usable using the lubridate package so that I wouldn’t have any issues plotting data in a time series. I then discovered that there were errors in the “in_fars_pursuit” column. This is column supposed to be a binary, or rather a bolean (true or false) and it is not. In what may be a controversial move, I decided not to throw out those rows but rather assumed that those that come out as 2 were in fact meant to be 1. I finally renamed the column as I discovered that I found the spelling hard to remember.

As far as what the visualisation represents. The most arresting discovery I found was that in a timeseries graph we can see a very distinct jump in the number of deaths during the pandemic and a somewhat return to original levels in the final year of 2022.

What I was going to do and which took far too much time was to figure out how to join this dataset with the US census. This would have allowed me to use demograpfic and income data from the specific counties, and would have allowed for a much more detailed analysis.