Here is a fairly typical setup section that you can copy and paste into any document. I usually just start a new document, erase everything and copy this from the top through the first code chunk.
library(knitr)
knitr::opts_chunk$set(echo = TRUE,
warning=TRUE,
message=FALSE,
error=TRUE)
library(tidyverse)
library(janitor)
library(scales)
library(lubridate)
options (scipen=999, digits=10)
We’re going to work with the Paycheck Protection Program data that I’ve prepared for you. Here is the documentation that I have so far - I hope to add to it as we learn more.
I have saved the dataset at this url:
https://github.com/cronkitedata/rstudyguide/blob/master/data/az_ppp.Rda?raw=t
For you to load it, you’d copy that URL into the following code chunk, where I’ve left a blank space. The computer is looking for the name of a file, so it must be in quotation marks:
load( url ( "https://github.com/cronkitedata/rstudyguide/blob/master/data/az_ppp.Rda?raw=t" ) )
To explore your data, you can click on it in the upper right of your screen, or you can type View (az_ppp) in your Console.
You can sort and filter the data in this view, but nothing will be saved. I don’t know how to save or copy from that view.
One of the powerful things about using a programming language is that you can choose to look just at the columns you plan to use for the moment, making it easier to look through your results. Run this code to
The verb “glimpse()” lets you look at a compact list of the column names so you can see what order they’re in and what they are called, and also shows you a little preview of what’s in them. Use this to examine whether the items are upper or lower case, whether they’re characters or numbers, etc.:
az_ppp %>% glimpse()
## Rows: 95,151
## Columns: 30
## $ id <dbl> 2008847101, 3057347105, 3470877104, 3530847305, 3771367…
## $ date_approved <date> 2020-04-10, 2020-04-11, 2020-04-11, 2020-04-29, 2020-0…
## $ status_date <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ borrower <chr> "SFE HOLDINGS, LLC", "NAVAJO TRIBAL UTILITY AUTHORITY",…
## $ address <chr> "9366 EAST RAINTREE DRIVE", "PO BOX 170", "2999 N44TH S…
## $ city <chr> "SCOTTSDALE", "FORT DEFIANCE", "PHOENIX", "TUCSON", "SC…
## $ state <chr> "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "…
## $ zip <chr> "85260", "86504", "85018", "85711", "85250", "85054", "…
## $ loan_status <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ term <dbl> 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,…
## $ initial_amt <dbl> 10000000.00, 10000000.00, 10000000.00, 10000000.00, 100…
## $ payroll_amt <dbl> 10000000.00, 10000000.00, 10000000.00, 10000000.00, 100…
## $ nonpayroll_amt <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 2500000.00, 0.00, 0…
## $ franchise <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "CA…
## $ business_type <chr> "LIMITED LIABILITY COMPANY(LLC)", "NON-PROFIT ORGANIZA…
## $ lender <chr> "THE HUNTINGTON NATIONAL BANK", "GREAT WESTERN BANK", "…
## $ lender_state <chr> "OH", "SD", "OH", "IL", "AZ", "SD", "AL", "OK", "NY", "…
## $ rural_urban <chr> "U", "R", "U", "U", "U", "U", "U", "U", "U", "U", "U", …
## $ business_age <chr> "EXISTING OR MORE THAN 2 YEARS OLD", "EXISTING OR MORE …
## $ project_city <chr> "SCOTTSDALE", "FORT DEFIANCE", "PHOENIX", "TUCSON", "SC…
## $ project_county <chr> "MARICOPA", "APACHE", "MARICOPA", "PIMA", "MARICOPA", "…
## $ project_state <chr> "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "…
## $ jobs_reported <dbl> 500, 500, 500, 500, 500, 500, 378, 500, 500, 500, 481, …
## $ naics_code <chr> "722310", "221122", "621610", "621111", "517311", "5613…
## $ naics_sector <chr> "722 - RESTAURANTS AND FOOD SERVICE", "22 - UTILITIES",…
## $ naics_descript <chr> "FOOD SERVICE CONTRACTORS", "ELECTRIC POWER DISTRIBUTIO…
## $ race_ethnicity <chr> "UNANSWERED", "UNANSWERED", "UNANSWERED", "UNANSWERED",…
## $ gender <chr> "UNANSWERED", "UNANSWERED", "UNANSWERED", "UNANSWERED",…
## $ veteran <chr> "UNANSWERED", "UNANSWERED", "UNANSWERED", "UNANSWERED",…
## $ non_profit <chr> NA, "Y", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
In this code chunk, create a new data frame with a subset of the information that you might want to use for one set of queries. Remember, it’s not everything you might ever want to do – it’s just what you want to work on right now!
az_ppp_basic <-
az_ppp %>%
select ( borrower, city, zip,
date_approved,
contains ("amt"),
naics_sector,
race_ethnicity:non_profit,
-gender)
Hints:
new_data <- az_ppp %>% ....race_ethnicity:non_profitcontains : contains ( "amt")Here’s one answer to the above challenge:
az_ppp_basic <-
az_ppp %>%
select ( borrower, city, zip,
date_approved,
contains ("amt"),
naics_sector,
race_ethnicity:non_profit,
-gender)
Now you have a new data frame just with the information you want.
First, sort the new data by the date approved, then by the lowest to highest amount, then look at only the first 15 rows:
az_ppp_basic %>%
arrange ( desc(date_approved), borrower) %>%
head (150 ) %>%
select ( date_approved, borrower, city, initial_amt:nonpayroll_amt)
Possible error: If you picked out columns using SELECT in the chunk above, you have to have the variables you want to sort in the list. It won’t find them otherwise.
There are three rarely-used columns that, when used, indicate some demographic information about the owner of the company. They are gender; race_ethnicity; and veteran.
Your choices in race_ethnicity are:
UNANSWERED
WHITE
AMERICAN INDIAN OR ALASKA NATIVE
HISPANIC
ASIAN
BLACK OR AFRICAN AMERICAN
Write a code chunk to select only the the Native American owners:
az_ppp_basic %>%
filter ( race_ethnicity == "BLACK OR AFRICAN AMERICAN") %>%
select ( borrower, city, initial_amt, payroll_amt, date_approved)
Now write a code chunk that picks out the Native American and Black owners:
az_ppp_basic %>%
filter ( race_ethnicity == 'BLACK OR AFRICAN AMERICAN' | race_ethnicity == 'HISPANIC' )
Say you want to pick out all of the owners who filled out the race and ethnicity form, but are NOT “WHITE”. Use the %in% operator to reduce your typing:
race_ethnicity %in% c( "HISPANIC", "ASIAN", "BLACK OR AFRICAN AMERICAN" .... )
az_ppp_basic %>%
filter ( ! (race_ethnicity %in% c( "WHITE", "UNANSWERED") ))
Now, use that same code to pick out all of those owners in the restaurant and food service industries, which is listed as “722 - RESTAURANTS AND FOOD SERVICE”
Hint: You need to use the AND operator: &
Write your own filtering code chunk below. First, describe what you are going to look for in “markdown” – in the section before the code chunk. Then add the code chunk, and write the code to accomplish it. If you have trouble figuring out how, ask for help!!
You may have noticed by now that this is a real pain – you have to know EXACTLY how the information is recorded in the database, and you have to know every permutation of it. That’s because R doesn’t have what’s called “wildcard” filtering - anything that starts with or has letters in common.
In R, we’ll use what’s called “regular expressions” to do fuzzy filtering . We can also use them with other verbs. At its simplest, a regular expression tells you to look for characters at the beginning, the end , or anywhere in a column. To use regular expressions, though, you have to do it a little differently. Instead of putting the name of the column first, you put it inside a function. It looks like this:
FILTER (str_detect ( race_ethnicity, "BLACK"))
That way you don’t have to type it all out.
You can also use the same “OR” operator within a regular expression:
FILTER (str_detect (race_ethnicity, "BLACK|HISPANIC|ASIAN|INDIAN"))
That stuff inside the quotes is the phrase it will look for, with the “|” meaning - “any of these”. Here’s an example, looking for anything that has “FOOD” or “HOTEL” in the sector :
az_ppp %>%
select (date_approved, borrower, city, initial_amt, franchise, naics_sector ) %>%
filter ( str_detect ( naics_sector, "FOOD|HOTEL")) %>%
arrange ( city) %>%
head(100)