This goes through a lot of the queries that you might write to work on your PPP data, but it will also serve as an example for the test.

If you are having trouble figuring out what these do, try them and see what they produce. Also try running pieces in their parts. You can select everything up to (but not including) a pipe %>% and run it by hitting CTL-Enter (both Mac and PC)

Loading libraries

Every time you start up R, you will have to run every chunk before subsequent ones will work. Don’t forget to do this. Here are some typical libraries that many of your programs will always use:

knitr::opts_chunk$set(echo=TRUE, 
                      message=FALSE, 
                      results = "hide")


library(tidyverse)  # for almost everything
library(janitor)    # for fixing column names, etc.
library(readxl)     # for importing Excel (.xlsx) files
library(lubridate)  # for working with dates

Overview

This document reviews the key commands that will be on the R exam. You will largely not have to write code yourself – instead, you’ll have to understand it. In cases where you write it, there will be plenty of examples preceeding so that you would be able to copy, paste and adjust to your own needs.

Although this document is organized by verb, the questions in the real test will not be. They will be organized by data set, and all of the verbs will be mixed together. However, there is nothing from R that will be on the test that is outside this document that you are expected to know. Your bigger challenge is the same as the first test: to understand and imagine the journalistic value of a dataset.

Each section has typical questions that might be in the exam itself.

Making sure you know where things are on your computer

The easiest way to make sure you know how to find things is to save everything into the same folder for a project – say, a week at school, or the PPP project. Then open that folder as a project.

For the long, complicated links that are csv files, you just have to copy and paste them into your code.

Files we’ll work with in this tutorial:

We MAY go over these, but they’re good practice for you. The two CSV files are used to standardize the categories in the spreadsheet, by joining to them .

Reading data

Three typical ways to get data in:

  • Load an R file
  • Import a csv (or one of its cousins)
  • Import an Excel file

R and csv files can be read from links. Excel files can’t.

When you import a file, you will likely have to check on a few things:

  • Were the data types of each column properly read? For example, is a date actually a date, or is a text field incorrectly seen as a number?

  • Are the names of the columns difficult to work with in R?

Import a csv

First we’ll import the Phoenix city salaries for 2020. You’ll have to describe what the following code does, in your own words:

Sometimes, as in this example, it’s useful to create a simple variable that holds just one thing to use in another command. In this case, it’s a long link copied from the Web. If you wanted to change it, all you’d have to do is change that line and it would propagate to everywhere you’ve used it.

phx_2020_salary_link <- "https://www.phoenixopendata.com/dataset/0c02bbfd-08d9-41e9-a00c-97848a9f7350/resource/4d8c9800-8054-4c11-8a13-b9b5bd0511dc/download/open-data-portal-staff-salaries-dec-2020.csv"

read_csv( phx_2020_salary_link)

Hmmm. That read it, but it came out with some weird messages and I don’t have a data frame.

Why?

phx_2020 <- read_csv(phx_2020_salary_link)

I still have the same weird messages, but they appear ok: Everything but the ANNUAL RATE has been read as text, and the amount has been read as a number. Instead of making this the default, let’s make it clear what we want it to do:

phx_2020 <- read_csv(phx_2020_salary_link, 
                     col_types = cols ( .default="c", 
                                        "ANNUAL RATE" = "n"))

No more message.

The names are still awful because there are spaces in them. Change them to lower case and remove the spaces using a shortcut for renaming variables in a standard format, from the janitor library:

phx_2020 <- read_csv(phx_2020_salary_link, 
                     col_types = cols ( .default="c", 
                                        "ANNUAL RATE" = "n")) %>%
  clean_names()

This is the same thing as doing:

  rename ( last_name = LAST_NAME, 
            first_name = FIRST_NAME
             etc.)

Read an Excel file

You don’t have the same options of controlling the input from Excel. You just have to pretty much live with what it guesses, then work with it later.

Notice that we don’t have a link, because it was saved into your default folder. However, the first line of the spreadsheet contains a title – here, you need to skip one line:

payouts <- read_excel("payouts.xlsx", skip=1) %>%
  clean_names()

Read an R file

R files can be called several things – you’ll most frequently see them as .Rda or .Rdata. These don’t get imported, they just get loaded. Load from the web by using the url() function. They create data frames in your environment without naming them, and there can be all kinds of objects in them. Think of them as a zip file that’s been unzipped right into R:

load ( url ("https://github.com/cronkitedata/rstudyguide/blob/master/data/az_ppp_zipcodes.Rda?raw=true"))
What is the noun for this dataset? Ie, what does each row represent?

Filter and arrange

Here are a few example questions you might get about a dataset like the PPP data:

  1. In your own words, what does this code do? Make sure your description is something that a normal person with no background in data would understand.

  2. Why might a reporter want to know this?

  3. try to write a short, declarative sentece that conveys the essence of the results. This is the equivalent of a straight lede that accurately describes the findings that would be understandable to someone who knew nothing about the data analysis .

You would do the same thing in the next three chunks.

az_ppp_zip %>%
  select ( borrower, city, state, zip, date_approved, status_date, initial_amt:nonpayroll_amt ) %>%
  arrange ( desc(nonpayroll_amt)) %>%
  head (15)

Number 2:

az_ppp_zip %>%
  filter ( str_detect ( naics_sector, "^72") & date_approved >= "2020-12-20") %>%
  select ( borrower, city, state, zip, date_approved,  initial_amt) %>%
  arrange ( desc ( date_approved) ) 

Number 3:

az_ppp_zip %>%
  filter ( business_age %in% c("STARTUP, LOAN FUNDS WILL OPEN BUSINESS", "NEW BUSINESS OR 2 YEARS OR LESS") &
             city == "PHOENIX")

(An alternative way to get the same thing using the wildcard regular expression of “^STARTUP|NEW” where ‘^’ means “it should start with this” and | means “either one of these things.”)

az_ppp_zip %>%
  filter ( str_detect ( business_age, "^STARTUP|NEW") &
             city == "PHOENIX")

What is another question that can be answered using select, filter and arrange verbs that a reporter might ask of the data? Write your question in plain English, not in computer-ese, and make it clear why a reporter might want to know this.

Mutate

NOTE: Looking at it today, there is a problem in the way the government is reporting the non-payroll amount in this second draw of money (since December). There are no non-payroll amounts, except there is sometimes a “1” in the utilities amount. I don’t know if that’s meant as a TRUE-FALSE indicator, or what, but it’s not reliable at this time. (I’ve checked, and it’s like that in the original data file as published by SBA, which hasn’t been updated since I downloaded it. )

Answer the same questions as above for this set of code chunks :

az_ppp_zip %>%
  mutate ( dollars_per_job = payroll_amt / jobs_reported, 
           new_bus_type = case_when ( str_detect ( business_type, "(SOLE PROP|SELF-EMPLOYED|INDEPENDENT)" ) ~ "SELF", 
                                    non_profit == "Y" ~ "NON-PROFIT", 
                                    TRUE ~ "FOR-PROFIT")
           ) %>%
  select (borrower, new_bus_type, initial_amt, dollars_per_job)

GROUP BY , SUMMARISE

Same questions for these code chunks:

(This one will take a minute to run. The group by here is slow.)

both_draws <- 
  az_ppp_zip %>%
  group_by ( borrower, zip) %>%
  summarise ( num_of_loans = n(), 
              amount = sum (initial_amt, na.rm=T), 
              .groups="drop"
            ) %>%
  filter ( num_of_loans > 1) %>%
  arrange ( zip)

(Don’t worry about that .groups=“drop” statement - it just gets rid of something that will be a pain later on. )

Additional question:

Now that you have this list, what might you do with it? Let’s discuss in the session.

zcta_totals <- 
  az_ppp_zip %>%
  group_by (census_zip ) %>%
  summarise ( number_of_loans = n() , 
              loan_amount = sum(initial_amt, na.rm=T), 
              payroll_amount = sum( payroll_amt, na.rm=T), 
              jobs_supported = sum(jobs_reported, na.rm=T)) 

Next one:

zcta_totals %>%
  mutate ( pct_of_loans = number_of_loans / sum(number_of_loans) * 100) %>%
  arrange ( desc (loan_amount))

Next one:

az_ppp_zip %>%
  mutate ( program = if_else ( date_approved < "2020-12-20", "First draw", "Second draw"))  %>%
  group_by ( census_zip, program) %>%
  summarise ( number_of_loans = n() ) %>%
  pivot_wider ( values_from = number_of_loans, values_fill=0, names_from = program) %>%
  clean_names() %>%
  filter ( second_draw / first_draw > .5 & first_draw > 20)

JOIN

Same questions

az_ppp_byziptype <- 
  az_ppp_zip %>%
  left_join ( az_by_zipcode, by=c("census_zip"="zcta")) %>%
  group_by ( lender, zcta_ethnic) %>%
  summarise (number_of_loans = n() ) %>%
  mutate (lender_total = sum(number_of_loans)) %>%
  pivot_wider ( names_from = zcta_ethnic, values_from = number_of_loans, values_fill=0) %>%
  mutate (white_as_pct = `01 - Majority white`/lender_total * 100) %>%
  filter (lender_total > 20)

What might you do with this table? (There’s one logical mistake in this: It excludes any zip codes in which there were no loans given. You’d have to join again to the zip code list and fill it with zeros to get it exact.)

Example 2:

options (scipen=999)

az_ppp_zip %>%
  group_by (census_zip) %>%
  summarise ( number_of_loans = n(), 
              number_of_borrowers  = n_distinct ( borrower, zip), 
              total_amount = sum( initial_amt )
  ) %>%
  right_join ( az_by_zipcode, by=c("census_zip"="zcta")) %>%
  mutate ( amt_per_cap = total_amount / tot_pop, 
           #YOU HAVEN'T SEEN THIS AND ARENT EXPECTED TO UNDERSTAND - IT IS TURNING NA INTO 0 FOR ALL NUMERIC
           across ( where (is.numeric), replace_na, 0)
           ) %>%
  select (zcta_ethnic,  tot_pop, census_zip, zipcode_city, number_of_loans: total_amount, amt_per_cap, county_name,  median_inc_2018) %>%
  filter (tot_pop > 100) %>%
  arrange ( amt_per_cap)

(Here’s where you see how missing the ones that go no loans matters.)

Another join example

We already loaded in the Phoenix salaries for 2020. Let’s load in the previous year’s doing exactly what we did before.

phx_2019_salary_link <- "https://www.phoenixopendata.com/dataset/0c02bbfd-08d9-41e9-a00c-97848a9f7350/resource/4280d70d-1dae-45dd-8190-2ba32109979d/download/2019_salies.csv"

phx_2019 <- read_csv ( phx_2019_salary_link, col_types = cols (.default="c", "ANNUAL RATE" = "n")) %>% clean_names()

Let’s try a full join - we have to rename columns so that they don’t get mixed up. This is really dangerous, and I always regret it, since if I make a mistake I have to go back to the import statement.

phx_2019 <- 
  phx_2019 %>%
  rename ( descr2019 = descr, annual_rate_2019 = annual_rate)  %>%
  select (last_name:annual_rate_2019)

phx_2020 <- 
phx_2020 %>%
  rename ( descr2020 = descr, annual_rate_2020 = annual_rate)

Now - what happens with a “full” join, in other words keeping everything?

phx_bothyears <- 
  phx_2020 %>% full_join ( phx_2019, by=c("last_name"="last_name", "first_name"="first_name"))

There’s a lot of duplicates in here, but that’s OK - what do you think this query does? And why might it be helpful for a reporter? What are the likely mistakes in it?

phx_bothyears %>%
  filter (is.na( descr2020) & str_detect(descr2019, "Police") )
---
title: "R study session"
author: "Sarah Cohen"
date: "`r Sys.Date()`"
output: 
  html_document:
    theme: yeti
    highlight: kate
    toc: true
    toc_depth: 4
    toc_float: true
    df_print: paged
    code_download: true
---


This goes through a lot of the queries that you might write to work on your PPP data, but it will also serve as an example for the test. 

If you are having trouble figuring out what these do, try them and see what they produce. Also try running pieces in their parts. You can select everything up to (but not including) a pipe `%>%` and run it by hitting CTL-Enter (both Mac and PC)

## Loading libraries

Every time you start up R, you will have to run every chunk before subsequent ones will work. Don't forget to do this. Here are some typical libraries that many of your programs will always use:

```{r setup, message=FALSE, warning=FALSE }

knitr::opts_chunk$set(echo=TRUE, 
                      message=FALSE, 
                      results = "hide")


library(tidyverse)  # for almost everything
library(janitor)    # for fixing column names, etc.
library(readxl)     # for importing Excel (.xlsx) files
library(lubridate)  # for working with dates


```




## Overview

This document reviews the key commands that will be on the R exam. You will largely not have to write code yourself -- instead, you'll have to understand it. In cases where you write it, there will be plenty of examples preceeding so that you would be able to copy, paste and adjust to your own needs. 

Although this document is organized by verb, the questions in the real test will not be. They will be organized by data set, and all of the verbs will be mixed together. However, there is nothing from R that will be on the test that is outside this document that you are expected to know. Your bigger challenge is the same as the first test: to understand and imagine the journalistic value of a dataset. 


Each section has typical questions that might be in the exam itself. 

## Making sure you know where things are on your computer

The easiest way to make sure you know how to find things is to save everything into the same folder for a project -- say, a week at school, or the PPP project. Then open that folder as a project. 

For the long, complicated links that are csv files, you just have to copy and paste them into your code. 

Files we'll work with in this tutorial: 

* PPP data that is in an R data file, which is documented here: <https://cronkitedata.github.io/rstudyguide/extra/az_ppp_documentation.html>


* The Phoenix salaries of city workers for 2020 and 2019: 
<https://www.phoenixopendata.com/dataset/0c02bbfd-08d9-41e9-a00c-97848a9f7350/resource/4d8c9800-8054-4c11-8a13-b9b5bd0511dc/download/open-data-portal-staff-salaries-dec-2020.csv> and <https://www.phoenixopendata.com/dataset/0c02bbfd-08d9-41e9-a00c-97848a9f7350/resource/4280d70d-1dae-45dd-8190-2ba32109979d/download/2019_salies.csv>

We MAY go over these, but they're good practice for you. The two CSV files are used to standardize the categories in the spreadsheet, by joining to them . 

* Three files from the LA Times on payouts from lawsuits. You read the story this data produced earlier in the semester. 

  * Download the Excel file called `payouts.xlsx` from this link: <https://github.com/datadesk/la-settlements-analysis/raw/master/input/payouts.xlsx> and put it in your folder. 
  * Two "lookup tables" that the LAT made to standardize some of the information are in csv's: <https://raw.githubusercontent.com/datadesk/la-settlements-analysis/master/input/casetypes.csv> and <https://github.com/datadesk/la-settlements-analysis/blob/master/input/departments.csv>
  


## Reading data 

Three typical ways to get data in: 

* Load an R file
* Import a csv (or one of its cousins)
* Import an Excel file

R and csv files can be read from links. Excel files can't. 

When you import a file, you will likely have to check on a few things:

* Were the data types of each column properly read? For example, is a date actually a date, or is a text field incorrectly seen as a number? 

* Are the names of the columns difficult to work with in R?

### Import a csv

First we'll import the Phoenix city salaries for 2020. You'll have to describe what the following code does, in your own words: 


Sometimes, as in this example, it's useful to create a simple variable that holds just one thing to use in another command. In this case, it's a long link copied from the Web. If you wanted to change it, all you'd have to do is change that line and it would propagate to everywhere you've used it. 

```{r import_csv_phx2020}

phx_2020_salary_link <- "https://www.phoenixopendata.com/dataset/0c02bbfd-08d9-41e9-a00c-97848a9f7350/resource/4d8c9800-8054-4c11-8a13-b9b5bd0511dc/download/open-data-portal-staff-salaries-dec-2020.csv"

read_csv( phx_2020_salary_link)

```
Hmmm. That read it, but it came out with some weird messages and I don't have a data frame. 

Why? 

```{r}

phx_2020 <- read_csv(phx_2020_salary_link)

```
I still have the same weird messages, but they appear ok: Everything but the `ANNUAL RATE` has been read as text, and the amount has been read as a number. Instead of making this the default, let's make it clear what we want it to do: 

```{r}

phx_2020 <- read_csv(phx_2020_salary_link, 
                     col_types = cols ( .default="c", 
                                        "ANNUAL RATE" = "n"))


```

No more message. 

The names are still awful because there are spaces in them. Change them to lower case and remove the spaces using a shortcut for renaming variables in a standard format, from the janitor library: 

```{r read_csv_phx}

phx_2020 <- read_csv(phx_2020_salary_link, 
                     col_types = cols ( .default="c", 
                                        "ANNUAL RATE" = "n")) %>%
  clean_names()


```


This is the same thing as doing: 

      rename ( last_name = LAST_NAME, 
                first_name = FIRST_NAME
                 etc.)


#### Read an Excel file

You don't have the same options of controlling the input from Excel. You just have to pretty much live with what it guesses, then work with it later. 

Notice that we don't have a link, because it was saved into your default folder. However, the first line of the spreadsheet contains a title -- here, you need to skip one line:

```{r read_excel_payouts}

payouts <- read_excel("payouts.xlsx", skip=1) %>%
  clean_names()

```

#### Read an R file

R files can be called several things -- you'll most frequently see them as .Rda or .Rdata. These don't get imported, they just get loaded. Load from the web by using the `url()` function. They create data frames in your environment without naming them, and there can be all kinds of objects in them. Think of them as a zip file that's been unzipped right into R:

```{r load_ppp}

load ( url ("https://github.com/cronkitedata/rstudyguide/blob/master/data/az_ppp_zipcodes.Rda?raw=true"))

```


<div style="font-weight:bold;color:red;">What is the noun for this dataset? Ie, what does each row represent?</div> 


## Filter and arrange


Here are a few example questions you might get about a dataset like the PPP data: 


<div style="font-weight:bold;color:red;">

a) In your own words, what does this code do? Make sure your description is something that a normal person with no background in data would understand. 

b) Why  might a reporter want to know this? 

c) try to write a short, declarative sentece that conveys the essence of the results. This is the equivalent of a straight lede that accurately describes the findings that would be understandable to someone who knew nothing about the data analysis . 

You would do the same thing in the next three chunks. 

</div>


```{r}

az_ppp_zip %>%
  select ( borrower, city, state, zip, date_approved, status_date, initial_amt:nonpayroll_amt ) %>%
  arrange ( desc(nonpayroll_amt)) %>%
  head (15)



```

Number 2: 

```{r}

az_ppp_zip %>%
  filter ( str_detect ( naics_sector, "^72") & date_approved >= "2020-12-20") %>%
  select ( borrower, city, state, zip, date_approved,  initial_amt) %>%
  arrange ( desc ( date_approved) ) 



```



Number 3: 

```{r}

az_ppp_zip %>%
  filter ( business_age %in% c("STARTUP, LOAN FUNDS WILL OPEN BUSINESS", "NEW BUSINESS OR 2 YEARS OR LESS") &
             city == "PHOENIX")

```

(An alternative way to get the same thing using the wildcard regular expression of "^STARTUP|NEW" where '^' means "it should start with this" and `|` means "either one of these things.")

```{r}


az_ppp_zip %>%
  filter ( str_detect ( business_age, "^STARTUP|NEW") &
             city == "PHOENIX")


```

<div style="font-weight:bold;color:red;">

What is another question that can be answered using select, filter and arrange verbs that a reporter might ask of the data? Write your question in plain English, not in computer-ese, and make it clear why a reporter might want to know this. 

</div>


## Mutate

NOTE: Looking at it today, there is a problem in the way the government is reporting the non-payroll amount in this second draw of money (since December). There are no non-payroll amounts, except there is sometimes a "1" in the utilities amount. I don't know if that's meant as a TRUE-FALSE indicator, or what, but it's not reliable at this time.  (I've checked, and it's like that in the original data file as published by SBA, which hasn't been updated since I downloaded it. )  

Answer the same questions as above for this set of code chunks : 

```{r}

az_ppp_zip %>%
  mutate ( dollars_per_job = payroll_amt / jobs_reported, 
           new_bus_type = case_when ( str_detect ( business_type, "(SOLE PROP|SELF-EMPLOYED|INDEPENDENT)" ) ~ "SELF", 
                                    non_profit == "Y" ~ "NON-PROFIT", 
                                    TRUE ~ "FOR-PROFIT")
           ) %>%
  select (borrower, new_bus_type, initial_amt, dollars_per_job)
  


```


## GROUP BY , SUMMARISE

Same questions for these code chunks: 

(This one will take a minute to run. The group by here is slow.)

```{r grouping1}

both_draws <- 
  az_ppp_zip %>%
  group_by ( borrower, zip) %>%
  summarise ( num_of_loans = n(), 
              amount = sum (initial_amt, na.rm=T), 
              .groups="drop"
            ) %>%
  filter ( num_of_loans > 1) %>%
  arrange ( zip)
  
  


```

(Don't worry about that .groups="drop" statement - it just gets rid of something that will be a pain later on. )

Additional question: 

Now that you have this list, what might you do with it? Let's discuss in the session. 

```{r grouping2}

zcta_totals <- 
  az_ppp_zip %>%
  group_by (census_zip ) %>%
  summarise ( number_of_loans = n() , 
              loan_amount = sum(initial_amt, na.rm=T), 
              payroll_amount = sum( payroll_amt, na.rm=T), 
              jobs_supported = sum(jobs_reported, na.rm=T)) 



```


Next one: 

```{r}

zcta_totals %>%
  mutate ( pct_of_loans = number_of_loans / sum(number_of_loans) * 100) %>%
  arrange ( desc (loan_amount))


```

Next one: 

```{r}

az_ppp_zip %>%
  mutate ( program = if_else ( date_approved < "2020-12-20", "First draw", "Second draw"))  %>%
  group_by ( census_zip, program) %>%
  summarise ( number_of_loans = n() ) %>%
  pivot_wider ( values_from = number_of_loans, values_fill=0, names_from = program) %>%
  clean_names() %>%
  filter ( second_draw / first_draw > .5 & first_draw > 20)
  
```

## JOIN 

Same questions 

```{r}

az_ppp_byziptype <- 
  az_ppp_zip %>%
  left_join ( az_by_zipcode, by=c("census_zip"="zcta")) %>%
  group_by ( lender, zcta_ethnic) %>%
  summarise (number_of_loans = n() ) %>%
  mutate (lender_total = sum(number_of_loans)) %>%
  pivot_wider ( names_from = zcta_ethnic, values_from = number_of_loans, values_fill=0) %>%
  mutate (white_as_pct = `01 - Majority white`/lender_total * 100) %>%
  filter (lender_total > 20)
  

```

What might you do with this table? (There's one logical mistake in this: It excludes any zip codes in which there were no loans given. You'd have to join again to the zip code list and fill it with zeros to get it exact.)

Example 2: 

```{r}

options (scipen=999)

az_ppp_zip %>%
  group_by (census_zip) %>%
  summarise ( number_of_loans = n(), 
              number_of_borrowers  = n_distinct ( borrower, zip), 
              total_amount = sum( initial_amt )
  ) %>%
  right_join ( az_by_zipcode, by=c("census_zip"="zcta")) %>%
  mutate ( amt_per_cap = total_amount / tot_pop, 
           #YOU HAVEN'T SEEN THIS AND ARENT EXPECTED TO UNDERSTAND - IT IS TURNING NA INTO 0 FOR ALL NUMERIC
           across ( where (is.numeric), replace_na, 0)
           ) %>%
  select (zcta_ethnic,  tot_pop, census_zip, zipcode_city, number_of_loans: total_amount, amt_per_cap, county_name,  median_inc_2018) %>%
  filter (tot_pop > 100) %>%
  arrange ( amt_per_cap)



```


(Here's where you see how missing the ones that go no loans matters.)

## Another join example

We already loaded in the Phoenix salaries for 2020. Let's load in the previous year's doing exactly what we did before. 

```{r }
phx_2019_salary_link <- "https://www.phoenixopendata.com/dataset/0c02bbfd-08d9-41e9-a00c-97848a9f7350/resource/4280d70d-1dae-45dd-8190-2ba32109979d/download/2019_salies.csv"

phx_2019 <- read_csv ( phx_2019_salary_link, col_types = cols (.default="c", "ANNUAL RATE" = "n")) %>% clean_names()

```

Let's try a full join - we have to rename columns so that they don't get mixed up. This is really dangerous, and I always regret it, since if I make a mistake I have to go back to the import statement.

```{r}
phx_2019 <- 
  phx_2019 %>%
  rename ( descr2019 = descr, annual_rate_2019 = annual_rate)  %>%
  select (last_name:annual_rate_2019)

phx_2020 <- 
phx_2020 %>%
  rename ( descr2020 = descr, annual_rate_2020 = annual_rate)



```

Now - what happens with a "full" join, in other words keeping everything? 

```{r}

phx_bothyears <- 
  phx_2020 %>% full_join ( phx_2019, by=c("last_name"="last_name", "first_name"="first_name"))


```

There's a lot of duplicates in here, but that's OK - what do you think this query does? And why might it be helpful for a reporter? What are the likely mistakes in it? 


```{r}

phx_bothyears %>%
  filter (is.na( descr2020) & str_detect(descr2019, "Police") )

```

