This is a typical setup section. If you’re having trouble knitting with errors, try changing this to error=TRUE to let it keep going when it hits a mistake. But go back and look at why – it’s sometimes easier to see in the final R markdown.
This example markdown uses a new library called formattable, for formatted tables. Install it in your console, not in this markdown!
library(knitr)
knitr::opts_chunk$set(echo = TRUE,
warning=TRUE,
message=FALSE,
error=FALSE)
library(tidyverse) # the usual
library(janitor) # to work with messy data
library(scales) # to make numbers readable
library(lubridate) # to work with dates
library(formattable) # to make easier-to-read tables
options (scipen=999, digits=10) # to avoid scientific notation.
We’re going to work with the Paycheck Protection Program data that I’ve prepared for you. Here is the documentation that I have so far - I hope to add to it as we learn more.
I have saved the dataset at this url:
https://github.com/cronkitedata/rstudyguide/blob/master/data/az_ppp.Rda?raw=t
This loads the data and prints out a list of the variables that are in it, so you don’t have to keep looking.
load( url ( "https://github.com/cronkitedata/rstudyguide/blob/master/data/az_ppp.Rda?raw=t" ) )
glimpse(az_ppp)
## Rows: 95,151
## Columns: 30
## $ id <dbl> 2008847101, 3057347105, 3470877104, 3530847305, 3771367…
## $ date_approved <date> 2020-04-10, 2020-04-11, 2020-04-11, 2020-04-29, 2020-0…
## $ status_date <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ borrower <chr> "SFE HOLDINGS, LLC", "NAVAJO TRIBAL UTILITY AUTHORITY",…
## $ address <chr> "9366 EAST RAINTREE DRIVE", "PO BOX 170", "2999 N44TH S…
## $ city <chr> "SCOTTSDALE", "FORT DEFIANCE", "PHOENIX", "TUCSON", "SC…
## $ state <chr> "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "…
## $ zip <chr> "85260", "86504", "85018", "85711", "85250", "85054", "…
## $ loan_status <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ term <dbl> 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,…
## $ initial_amt <dbl> 10000000.00, 10000000.00, 10000000.00, 10000000.00, 100…
## $ payroll_amt <dbl> 10000000.00, 10000000.00, 10000000.00, 10000000.00, 100…
## $ nonpayroll_amt <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 2500000.00, 0.00, 0…
## $ franchise <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "CA…
## $ business_type <chr> "LIMITED LIABILITY COMPANY(LLC)", "NON-PROFIT ORGANIZA…
## $ lender <chr> "THE HUNTINGTON NATIONAL BANK", "GREAT WESTERN BANK", "…
## $ lender_state <chr> "OH", "SD", "OH", "IL", "AZ", "SD", "AL", "OK", "NY", "…
## $ rural_urban <chr> "U", "R", "U", "U", "U", "U", "U", "U", "U", "U", "U", …
## $ business_age <chr> "EXISTING OR MORE THAN 2 YEARS OLD", "EXISTING OR MORE …
## $ project_city <chr> "SCOTTSDALE", "FORT DEFIANCE", "PHOENIX", "TUCSON", "SC…
## $ project_county <chr> "MARICOPA", "APACHE", "MARICOPA", "PIMA", "MARICOPA", "…
## $ project_state <chr> "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "…
## $ jobs_reported <dbl> 500, 500, 500, 500, 500, 500, 378, 500, 500, 500, 481, …
## $ naics_code <chr> "722310", "221122", "621610", "621111", "517311", "5613…
## $ naics_sector <chr> "722 - RESTAURANTS AND FOOD SERVICE", "22 - UTILITIES",…
## $ naics_descript <chr> "FOOD SERVICE CONTRACTORS", "ELECTRIC POWER DISTRIBUTIO…
## $ race_ethnicity <chr> "UNANSWERED", "UNANSWERED", "UNANSWERED", "UNANSWERED",…
## $ gender <chr> "UNANSWERED", "UNANSWERED", "UNANSWERED", "UNANSWERED",…
## $ veteran <chr> "UNANSWERED", "UNANSWERED", "UNANSWERED", "UNANSWERED",…
## $ non_profit <chr> NA, "Y", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
In each of the sections below, work through the examples and then describe, in normal, everyday words, what the code is doing, and what the answer tells you. In other words, fill out the part of the markdown before and after the code so that you have a report that you could show others.
Here’s an exammple :
Only some of the columns that we got from the government are particularly useful. This code eliminates some of those that are not very informative, such as the demographic information that was mainly left blank and addresses of the lenders.
az_ppp_v1 <-
az_ppp %>%
select ( borrower, franchise, address, city, zip, date_approved, initial_amt:nonpayroll_amt, jobs_reported,
business_type, non_profit, business_age, project_county, project_state, naics_code:naics_descript)
glimpse(az_ppp_v1)
## Rows: 95,151
## Columns: 18
## $ borrower <chr> "SFE HOLDINGS, LLC", "NAVAJO TRIBAL UTILITY AUTHORITY",…
## $ franchise <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "CA…
## $ address <chr> "9366 EAST RAINTREE DRIVE", "PO BOX 170", "2999 N44TH S…
## $ city <chr> "SCOTTSDALE", "FORT DEFIANCE", "PHOENIX", "TUCSON", "SC…
## $ zip <chr> "85260", "86504", "85018", "85711", "85250", "85054", "…
## $ date_approved <date> 2020-04-10, 2020-04-11, 2020-04-11, 2020-04-29, 2020-0…
## $ initial_amt <dbl> 10000000.00, 10000000.00, 10000000.00, 10000000.00, 100…
## $ payroll_amt <dbl> 10000000.00, 10000000.00, 10000000.00, 10000000.00, 100…
## $ nonpayroll_amt <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 2500000.00, 0.00, 0…
## $ jobs_reported <dbl> 500, 500, 500, 500, 500, 500, 378, 500, 500, 500, 481, …
## $ business_type <chr> "LIMITED LIABILITY COMPANY(LLC)", "NON-PROFIT ORGANIZA…
## $ non_profit <chr> NA, "Y", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ business_age <chr> "EXISTING OR MORE THAN 2 YEARS OLD", "EXISTING OR MORE …
## $ project_county <chr> "MARICOPA", "APACHE", "MARICOPA", "PIMA", "MARICOPA", "…
## $ project_state <chr> "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "…
## $ naics_code <chr> "722310", "221122", "621610", "621111", "517311", "5613…
## $ naics_sector <chr> "722 - RESTAURANTS AND FOOD SERVICE", "22 - UTILITIES",…
## $ naics_descript <chr> "FOOD SERVICE CONTRACTORS", "ELECTRIC POWER DISTRIBUTIO…
(There’s no need to describe what you discovered here, since you just trimmed down the list of columns and listed them. )
Make this a nicely formatted and easy to read description of the journalistic question or data-integrity question that is being asked, and then, if necessary, what it means.
az_ppp_v1 %>%
mutate (total_amt = payroll_amt + nonpayroll_amt ,
difference = total_amt - initial_amt ) %>%
filter ( difference < -10000 | difference > 10000) %>%
select (borrower, initial_amt, total_amt, payroll_amt, nonpayroll_amt) %>%
arrange (initial_amt)
az_ppp_v2 <-
az_ppp_v1 %>%
mutate (total_amt = payroll_amt + nonpayroll_amt ,
loan_round = case_when ( date_approved >= "2020-12-23" ~ "Second round",
(date_approved < "2020-12-23" & date_approved >= "2020-03-01") ~ "First Round",
TRUE ~ "Inaccurate date")
)
az_ppp_v2 %>%
group_by ( borrower, zip) %>%
summarise ( both_round_total = sum (total_amt, na.rm=T),
both_round_payroll = sum (payroll_amt, na.rm=T),
num_of_rounds = n_distinct (loan_round)
) %>%
filter ( num_of_rounds == 2) %>%
arrange (desc( both_round_total)) %>%
head (150) %>%
mutate ( both_round_total = comma(both_round_total),
both_round_payroll = comma(both_round_payroll))
This is a trick : It creates a dataset containing all of the details of those businesses that got loans in both rounds. Instead of summarising the result, you use the summary function to filter! (The ungroup() thing is hard to explain. Just stick it in there for now if you use this trick. )
az_ppp_both_rounds <-
az_ppp_v2 %>%
group_by (borrower, zip) %>%
filter ( n_distinct( loan_round) == 2) %>%
ungroup()
az_ppp_v2 %>%
group_by (naics_sector) %>%
summarise ( number_of_loans = comma(n(), digits=0 ) ) %>%
formattable::format_table()
| naics_sector | number_of_loans |
|---|---|
| 11 - AGRICULTURE | 870 |
| 21 - MINING | 156 |
| 22 - UTILITIES | 166 |
| 23 - CONSTRUCTION | 8,414 |
| 31-33 - MANUFACTURING | 3,941 |
| 42 - WHOLESALE TRADE | 2,737 |
| 44-45 - RETAIL TRADE | 7,829 |
| 485 - TAXI AND LIMOUSINE SERVICES | 334 |
| 489 - OTHER TRANSPORTATION | 2,226 |
| 49 - DELIVERY SERVICE & WAREHOUSING | 250 |
| 51 - INFORMATION, MOVIES AND PUBLISHING | 1,316 |
| 52 - FINANCE | 3,518 |
| 53 - REAL ESTATE AND OTHER RENTALS | 5,727 |
| 54 - LEGAL, ADVERTISING, CONSULTING AND OTHER PROFESSIONAL SERVICES | 13,659 |
| 55-56 - MANAGEMENT & ADMIN SUPPORT | 5,534 |
| 61 - EDUCATION | 1,826 |
| 62 - HEALTH CARE | 11,239 |
| 71 - ARTS AND RECREATION | 2,470 |
| 721 - HOTELS AND ACCOMMODATIONS | 1,238 |
| 722 - RESTAURANTS AND FOOD SERVICE | 5,700 |
| 813 - RELIGIOUS, GRANTMAKING, CIVIC ORGS | 1,957 |
| 819 - OTHER MISC SERVICES | 8,389 |
| 92 - GOVERNMENT | 219 |
| 99 - NO VALID INDUSTRY CODE | 5,436 |
az_ppp_v4 <-
az_ppp_v2 %>%
filter ( str_detect ( naics_sector, "^(813|61|92)")) %>%
arrange ( desc( total_amt ))
You finish: what in here is interesting?
az_ppp_v2 %>%
group_by (business_age) %>%
summarise ( num_loans = comma(n(), digits=0)) %>%
format_table()
| business_age | num_loans |
|---|---|
| CHANGE OF OWNERSHIP | 20 |
| EXISTING OR MORE THAN 2 YEARS OLD | 76,384 |
| NEW BUSINESS OR 2 YEARS OR LESS | 13,223 |
| STARTUP, LOAN FUNDS WILL OPEN BUSINESS | 40 |
| UNANSWERED | 5,484 |
You finish: What other questions might you want to ask about this data?
az_ppp_v2 %>%
mutate (new_sector = if_else ( str_detect ( naics_sector, "^72"), "72 - RESTAURANTS AND HOTELS", naics_sector)) %>%
group_by ( city, new_sector) %>%
summarise ( num_loans = n(),
num_businesses = n_distinct ( borrower, zip) ,
total_amt = comma ( sum ( total_amt, na.rm=T), digits=0 ),
total_nonpayroll = comma (sum (total_amt, na.rm=T, digits=0)))
This uses the trick from the other section: Pick out the largest industry for each city. You haven’t seen this before, but try to guess what’s happening!
az_ppp_v2 %>%
mutate (new_sector = if_else ( str_detect ( naics_sector, "^72"), "72 - RESTAURANTS AND HOTELS", naics_sector)) %>%
group_by ( city, new_sector) %>%
summarise ( num_loans = n(),
num_companies = n_distinct ( borrower, zip),
loan_amount = sum ( total_amt),
nonpayroll_total = sum (nonpayroll_amt) ) %>%
group_by ( city) %>%
mutate ( total_loans = comma(sum( num_loans), digits=0),
pct_sector = percent ( loan_amount / sum(loan_amount))
) %>%
slice_max ( order_by = loan_amount) %>%
filter ( num_loans > 20) %>%
format_table()
| city | new_sector | num_loans | num_companies | loan_amount | nonpayroll_total | total_loans | pct_sector |
|---|---|---|---|---|---|---|---|
| ANTHEM | 62 - HEALTH CARE | 38 | 37 | 2979695.16 | 207265.61 | 285 | 26.64% |
| APACHE JUNCTION | 23 - CONSTRUCTION | 65 | 61 | 8155087.45 | 138620.00 | 358 | 24.28% |
| AVONDALE | 23 - CONSTRUCTION | 56 | 54 | 9263383.69 | 103027.76 | 537 | 18.25% |
| BUCKEYE | 11 - AGRICULTURE | 39 | 39 | 11411576.30 | 950584.80 | 533 | 29.08% |
| BULLHEAD CITY | 62 - HEALTH CARE | 56 | 56 | 6602556.50 | 358633.25 | 307 | 27.74% |
| CASA GRANDE | 62 - HEALTH CARE | 63 | 61 | 8748056.93 | 136725.27 | 502 | 23.06% |
| CAVE CREEK | 55-56 - MANAGEMENT & ADMIN SUPPORT | 55 | 55 | 7100559.93 | 31596.00 | 754 | 18.46% |
| CHANDLER | 23 - CONSTRUCTION | 292 | 286 | 66080472.37 | 1706356.28 | 4,710 | 14.85% |
| CHINO VALLEY | 23 - CONSTRUCTION | 37 | 36 | 1991652.97 | 6877.00 | 164 | 28.67% |
| COLORADO CITY | 23 - CONSTRUCTION | 24 | 24 | 818501.00 | 1.00 | 48 | 58.84% |
| COTTONWOOD | 62 - HEALTH CARE | 39 | 39 | 4746828.00 | 615804.00 | 385 | 19.52% |
| EL MIRAGE | 23 - CONSTRUCTION | 26 | 26 | 5099989.83 | 453899.47 | 109 | 47.08% |
| FLAGSTAFF | 72 - RESTAURANTS AND HOTELS | 205 | 192 | 32943558.00 | 1000540.03 | 1,712 | 22.54% |
| FORT MOHAVE | 23 - CONSTRUCTION | 23 | 23 | 1265855.59 | 16786.61 | 111 | 19.35% |
| FOUNTAIN HILLS | 72 - RESTAURANTS AND HOTELS | 41 | 39 | 6395102.57 | 50008.00 | 481 | 24.85% |
| GILBERT | 23 - CONSTRUCTION | 448 | 434 | 81893711.61 | 3516400.80 | 4,534 | 21.17% |
| GLENDALE | 23 - CONSTRUCTION | 346 | 334 | 47513419.85 | 1386695.01 | 3,063 | 20.22% |
| GOODYEAR | 54 - LEGAL, ADVERTISING, CONSULTING AND OTHER PROFESSIONAL SERVICES | 114 | 113 | 9060425.59 | 261849.89 | 945 | 14.17% |
| GREEN VALLEY | 62 - HEALTH CARE | 32 | 31 | 10491596.50 | 1360580.56 | 230 | 49.75% |
| KINGMAN | 72 - RESTAURANTS AND HOTELS | 59 | 58 | 31797138.23 | 69238.81 | 460 | 49.01% |
| LAKE HAVASU CITY | 72 - RESTAURANTS AND HOTELS | 109 | 104 | 10930788.37 | 83387.34 | 988 | 17.08% |
| LITCHFIELD PARK | 72 - RESTAURANTS AND HOTELS | 21 | 21 | 4762937.50 | 45863.00 | 385 | 19.09% |
| MESA | 23 - CONSTRUCTION | 748 | 727 | 122438440.24 | 2660368.26 | 6,863 | 18.45% |
| NEW RIVER | 23 - CONSTRUCTION | 24 | 23 | 705239.00 | 12409.00 | 103 | 22.80% |
| NOGALES | 489 - OTHER TRANSPORTATION | 62 | 61 | 6740488.70 | 225928.00 | 306 | 24.12% |
| ORO VALLEY | 62 - HEALTH CARE | 43 | 40 | 4341728.40 | 24353.02 | 295 | 21.22% |
| PAGE | 72 - RESTAURANTS AND HOTELS | 57 | 55 | 6906236.92 | 288849.26 | 162 | 34.01% |
| PARADISE VALLEY | 62 - HEALTH CARE | 114 | 114 | 8646008.86 | 347349.99 | 503 | 17.01% |
| PAYSON | 72 - RESTAURANTS AND HOTELS | 22 | 22 | 3481682.74 | 123663.00 | 285 | 22.79% |
| PEORIA | 62 - HEALTH CARE | 325 | 314 | 33227158.37 | 1704173.39 | 2,360 | 18.70% |
| PHOENIX | 23 - CONSTRUCTION | 1801 | 1754 | 407997068.41 | 12849015.03 | 21,975 | 14.54% |
| PRESCOTT | 23 - CONSTRUCTION | 134 | 133 | 19938616.24 | 74821.81 | 1,385 | 17.91% |
| PRESCOTT VALLEY | 62 - HEALTH CARE | 71 | 68 | 13353363.21 | 208501.00 | 451 | 34.79% |
| QUEEN CREEK | 23 - CONSTRUCTION | 125 | 121 | 12662276.79 | 644378.09 | 929 | 24.60% |
| RIO RICO | 42 - WHOLESALE TRADE | 30 | 30 | 4988539.06 | 288406.31 | 130 | 48.53% |
| SAFFORD | 62 - HEALTH CARE | 23 | 23 | 8057634.91 | 34713.00 | 156 | 38.61% |
| SAN TAN VALLEY | 62 - HEALTH CARE | 34 | 34 | 3422086.97 | 168487.00 | 400 | 21.63% |
| SCOTTSDALE | 62 - HEALTH CARE | 1614 | 1561 | 203260186.51 | 7999720.70 | 12,141 | 15.64% |
| SEDONA | 72 - RESTAURANTS AND HOTELS | 112 | 105 | 12760503.56 | 909820.04 | 714 | 30.54% |
| SHOW LOW | 62 - HEALTH CARE | 34 | 34 | 3931790.70 | 465811.34 | 240 | 18.71% |
| SIERRA VISTA | 62 - HEALTH CARE | 95 | 93 | 6596442.57 | 333509.34 | 509 | 17.76% |
| SNOWFLAKE | 23 - CONSTRUCTION | 21 | 20 | 1872904.00 | 5680.00 | 97 | 25.70% |
| SUN CITY | 62 - HEALTH CARE | 57 | 55 | 6172431.59 | 52305.00 | 352 | 31.02% |
| SURPRISE | 62 - HEALTH CARE | 174 | 169 | 18564658.64 | 484548.09 | 1,190 | 24.79% |
| TEMPE | 54 - LEGAL, ADVERTISING, CONSULTING AND OTHER PROFESSIONAL SERVICES | 720 | 703 | 99359730.09 | 6333469.44 | 4,163 | 17.04% |
| TUCSON | 62 - HEALTH CARE | 1244 | 1218 | 174406564.64 | 10134888.26 | 10,596 | 17.01% |
| WILLIAMS | 72 - RESTAURANTS AND HOTELS | 50 | 49 | 5789708.85 | 78554.50 | 134 | 57.05% |
| YUMA | 11 - AGRICULTURE | 77 | 77 | 33721983.52 | 360983.63 | 1,309 | 23.89% |
Now you take it from here.
az_ppp %>%
group_by ( loan_status ) %>%
summarise ( num_loans = n(), num_companies = n_distinct ( borrower, zip) ,
total_amt = sum (payroll_amt + nonpayroll_amt ) )
az_ppp %>%
filter (loan_status=="PAID IN FULL" & payroll_amt < 200000) %>%
select ( date_approved, borrower, city, payroll_amt , nonpayroll_amt,
jobs_reported, naics_sector) %>%
mutate ( dollars_per_job = if_else (jobs_reported > 0 ,
comma (( payroll_amt + nonpayroll_amt)/ jobs_reported,digits=0),
NA_real_), .before=payroll_amt) %>%
arrange (desc(dollars_per_job))