Setup

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(nycflights13)

These notes are a compilation of important points from Chapter 14 of R4DS.

parse_number()

This is an extremely useful function that saves you from having to do a lot of string manipulation. It returns the first number it finds in a string of text, ignoring the surrounding words and the commas used as grouping marks.

Here are a few examples.

s = "there are 5 peanuts and 3 squirrels"

num = parse_number(s)
num

## [1] 5

s = "1,600,000"
num = parse_number(s)
num

## [1] 1600000

Make at least 3 examples of your own. The act of writing “parse_number()” will put this function in your memory.
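
Here are a few more inputs worth trying, offered as a sketch rather than a full tour. The strings are made up for illustration. The locale() call with grouping_mark comes from readr and is only needed when the grouping mark is not a comma.

```r
library(readr)

# Currency symbols and surrounding words are ignored
price = parse_number("It cost $123.45")

# A trailing percent sign is dropped; the value is NOT divided by 100
pct = parse_number("20%")

# Grouping marks other than "," need a locale (made-up example string)
euro = parse_number("123.456.789", locale = locale(grouping_mark = "."))

price
pct
euro
```

Note that parse_number() stops at the first number, so anything after it in the string is lost.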

count()

This is very similar to the function table() in base R.

t = table(flights$dest)

c = flights %>% count(dest)

Look at t

str(t)

##  'table' int [1:105(1d)] 254 265 439 8 17215 2439 275 443 375 297 ...
##  - attr(*, "dimnames")=List of 1
##   ..$ : chr [1:105] "ABQ" "ACK" "ALB" "ANC" ...

t

## 
##   ABQ   ACK   ALB   ANC   ATL   AUS   AVL   BDL   BGR   BHM   BNA   BOS   BQN 
##   254   265   439     8 17215  2439   275   443   375   297  6333 15508   896 
##   BTV   BUF   BUR   BWI   BZN   CAE   CAK   CHO   CHS   CLE   CLT   CMH   CRW 
##  2589  4681   371  1781    36   116   864    52  2884  4573 14064  3524   138 
##   CVG   DAY   DCA   DEN   DFW   DSM   DTW   EGE   EYW   FLL   GRR   GSO   GSP 
##  3941  1525  9705  7266  8738   569  9384   213    17 12055   765  1606   849 
##   HDN   HNL   HOU   IAD   IAH   ILM   IND   JAC   JAX   LAS   LAX   LEX   LGA 
##    15   707  2115  5700  7198   110  2077    25  2720  5997 16174     1     1 
##   LGB   MCI   MCO   MDW   MEM   MHT   MIA   MKE   MSN   MSP   MSY   MTJ   MVY 
##   668  2008 14082  4113  1789  1009 11728  2802   572  7185  3799    15   221 
##   MYR   OAK   OKC   OMA   ORD   ORF   PBI   PDX   PHL   PHX   PIT   PSE   PSP 
##    59   312   346   849 17283  1536  6554  1354  1632  4656  2875   365    19 
##   PVD   PWM   RDU   RIC   ROC   RSW   SAN   SAT   SAV   SBN   SDF   SEA   SFO 
##   376  2352  8163  2454  2416  3537  2737   686   804    10  1157  3923 13331 
##   SJC   SJU   SLC   SMF   SNA   SRQ   STL   STT   SYR   TPA   TUL   TVC   TYS 
##   329  5819  2467   284   825  1211  4339   522  1761  7466   315   101   631 
##   XNA 
##  1036

Look at c

str(c)

## tibble [105 × 2] (S3: tbl_df/tbl/data.frame)
##  $ dest: chr [1:105] "ABQ" "ACK" "ALB" "ANC" ...
##  $ n   : int [1:105] 254 265 439 8 17215 2439 275 443 375 297 ...

c

## # A tibble: 105 × 2
##    dest      n
##    <chr> <int>
##  1 ABQ     254
##  2 ACK     265
##  3 ALB     439
##  4 ANC       8
##  5 ATL   17215
##  6 AUS    2439
##  7 AVL     275
##  8 BDL     443
##  9 BGR     375
## 10 BHM     297
## # … with 95 more rows

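What’s the difference?

table() returns a special 'table' object: a vector of counts with the destinations stored as names. count() returns a tibble with one column for the destination and one column, n, for the count. A tibble is easier to work with because it can be piped straight into other tidyverse functions. The tidyverse philosophy is to produce tibbles.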
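
To make "easier to work with" concrete, here is a small sketch. The tibble trips and its destinations are made up to stand in for flights, and share is a hypothetical column name.

```r
library(dplyr)

# Made-up destinations standing in for flights$dest
trips = tibble(dest = c("ATL", "ORD", "ATL", "LAX", "ATL", "ORD"))

# count() returns a tibble, so the result can be piped into other verbs
by_dest = trips %>%
  count(dest, sort = TRUE) %>%      # sort = TRUE puts the largest count first
  mutate(share = n / sum(n))

# The same counts from table() need a conversion step before they are a data frame
from_table = as.data.frame(table(trips$dest), stringsAsFactors = FALSE)

by_dest
from_table
```

The converted table ends up with the generic column names Var1 and Freq, while count() keeps the original column name dest.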

Recycling

Recall the basic principle that R sees even the simplest piece of data as a vector. You may think of multiplying a vector by a number, but R sees the number as a vector of length 1. Consider the following.

x = c(1,2,3)
twox = 2 * x
twox

## [1] 2 4 6

R “recycles” the 2 to get a vector of the same length as x. So, here’s what really happens. The 2 becomes a vector with three copies of 2.

vtwo = c(2,2,2)
vtwox = vtwo * x
vtwox

## [1] 2 4 6

To see this more clearly, observe what happens when you multiply by a shorter vector, not just a simple number.

short = c(2,3)
short_times_x = short * x

## Warning in short * x: longer object length is not a multiple of shorter object
## length

short_times_x

## [1] 2 6 6

What happened?

recycled_short = c(2,3,2)

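R recycled short into this vector of length 3 to match the length of x. The warning appeared because 3 is not a multiple of 2, so the last copy of short had to be cut off.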

recycled_short * x

## [1] 2 6 6
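
The warning depends only on whether the lengths fit evenly. The sketch below uses made-up vectors to show both cases. suppressWarnings() is used only to keep the output short.

```r
x = 1:6

# Length 2 divides length 6, so c(10, 100) is recycled three times with no warning
even_fit = x * c(10, 100)

# Length 4 does not divide 6: R still recycles (1, 2, 3, 4, 1, 2) but warns;
# the warning is silenced here
uneven_fit = suppressWarnings(x * c(1, 2, 3, 4))

even_fit
uneven_fit
```

The silent case is the more dangerous one, since a length mismatch that happens to divide evenly gives no hint that anything was recycled.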

Recycling Booleans: Extraction

Consider the following.

x = 1:10
x

##  [1]  1  2  3  4  5  6  7  8  9 10

odds = x[c(T,F)]
odds

## [1] 1 3 5 7 9

What happened?

R recycled the vector of length 2 to create a vector of length 10.

l2 = c(T,F)
l10 = c(T,F,T,F,T,F,T,F,T,F)

x[l2]

## [1] 1 3 5 7 9

x[l10]

## [1] 1 3 5 7 9

Question

How would you go about skipping every third item in a vector?

Test it with the vector of numbers from 1 to 15.

Solution

skipper = c(T,T,F)
x = 1:15
x[skipper]

##  [1]  1  2  4  5  7  8 10 11 13 14
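
Two variations on the same idea, as a sketch. TRUE and FALSE are spelled out here because T and F are ordinary variables that can be overwritten. The names thirds and skip_thirds are made up.

```r
x = 1:15

# Keep ONLY every third item: the pattern is recycled five times
thirds = x[c(FALSE, FALSE, TRUE)]

# Skip every third item again, this time by testing positions
# instead of relying on a recycled pattern
skip_thirds = x[seq_along(x) %% 3 != 0]

thirds
skip_thirds
```

The position-based version also works when the vector length is not a multiple of 3, and it makes the rule explicit.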

Doodle

Take a few minutes and make a few examples of your own.

Logarithms, etc.

The numbers you have may not be the best versions of the data to work with.

Think about square figures. A square with side length \(l\) has area \(l^2\).

Should your analysis focus on the length of a side or on the area of the square?

Sometimes data is heavily skewed to the right, with a small number of very large values. A histogram or density graph then turns out to be useless because the detail for most of the data is cramped into a small area.

County Populations

Load the file county.rda. Then make a histogram and a density plot of the variable pop2017.

Solution

load("county.rda")

county %>% 
  ggplot(aes(x = pop2017)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 3 rows containing non-finite values (stat_bin).

county %>% 
  ggplot(aes(x = pop2017)) +
  geom_density() +
  geom_rug()

## Warning: Removed 3 rows containing non-finite values (stat_density).

Logarithmic Transformation

Create a variable logpop in this dataframe using the function log10. Then redo the graphs.

Solution

county = county %>% 
  mutate(logpop = log10(pop2017))

county %>% 
  ggplot(aes(x = logpop)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 3 rows containing non-finite values (stat_bin).

county %>% 
  ggplot(aes(x = logpop)) +
  geom_density() +
  geom_rug()

## Warning: Removed 3 rows containing non-finite values (stat_density).
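
Why does log10 help? It turns equal ratios into equal steps, so values that differ by factors of 10 end up evenly spaced. The sketch below uses a made-up data frame, toy, not the county data. It also shows one alternative, scale_x_log10() from ggplot2, which puts the axis on a log scale without creating a new variable.

```r
library(ggplot2)

# Made-up populations spanning several orders of magnitude
toy = data.frame(pop = c(100, 1000, 10000, 10000000))

# Each factor of 10 in pop becomes a step of 1 in logpop
toy$logpop = log10(toy$pop)
toy$logpop

# Alternative: leave the data alone and transform the axis instead
p = ggplot(toy, aes(x = pop)) +
  geom_histogram(bins = 10) +
  scale_x_log10()
```

With scale_x_log10() the axis labels stay in the original units, which can be easier to read than values of logpop.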

Percentiles

We use the quantile function to get percentiles.

Example

quantile(county$pop2017,.75,na.rm = TRUE)

##   75% 
## 67756

To reverse the quantile function, that is, to find the fraction of values below a given cutoff, use the mean of a boolean expression.

Example: What fraction of the pop2017 values are below 10000?

Solution

mean(county$pop2017 < 10000,na.rm = TRUE)

## [1] 0.2255495
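
The mean trick works because R counts TRUE as 1 and FALSE as 0, so the mean of a boolean vector is the fraction of TRUE values. Here is a sketch with the numbers 1 to 100 standing in for real data. It also shows the base R function ecdf(), which is another way to get the reverse of quantile().

```r
x = 1:100

# Forward: which value sits at the 75th percentile?
q75 = quantile(x, 0.75)

# Reverse: what fraction of the data lies below that value?
frac_below = mean(x < q75)

# ecdf() builds the reverse function once so it can be reused for many cutoffs.
# It counts values less than OR EQUAL TO the cutoff.
cdf = ecdf(x)
fracs = cdf(c(10, 50))

q75
frac_below
fracs
```

Watch the difference between < and <= when the cutoff is a value that actually occurs in the data: mean(x < 10) and cdf(10) disagree by one observation.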

Time Series Stuff

First I need to import the CPI data I downloaded from FRED.

library(readr)
CPI <- read_csv("CPIAUCSL.csv")

## Rows: 86 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (1): CPIAUCSL
## date (1): DATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Let’s plot it.

Solution

CPI = CPI %>% 
  rename(CPI = CPIAUCSL)

CPI %>% 
  ggplot(aes(x = DATE, y = CPI)) +
  geom_point()

Lags and Differences

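CPI_plus = CPI %>% 
  mutate(diff = CPI - lag(CPI),                        # change from the previous month
         pct_diff  = diff/lag(CPI),                    # monthly growth rate
         diff12 = CPI - lag(CPI,12),                   # change from 12 months earlier
         pct_diff12 = diff12/lag(CPI,12),              # year-over-year growth rate
         annualized_monthly = (1 + pct_diff)^12 - 1)   # monthly rate compounded over 12 months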
head(CPI_plus)

## # A tibble: 6 × 7
##   DATE         CPI   diff pct_diff diff12 pct_diff12 annualized_monthly
##   <date>     <dbl>  <dbl>    <dbl>  <dbl>      <dbl>              <dbl>
## 1 2016-01-01  238. NA     NA           NA         NA            NA     
## 2 2016-02-01  237. -0.316 -0.00133     NA         NA            -0.0158
## 3 2016-03-01  238.  0.744  0.00313     NA         NA             0.0383
## 4 2016-04-01  239.  0.912  0.00383     NA         NA             0.0469
## 5 2016-05-01  240.  0.565  0.00236     NA         NA             0.0287
## 6 2016-06-01  240.  0.665  0.00278     NA         NA             0.0338

tail(CPI_plus)

## # A tibble: 6 × 7
##   DATE         CPI  diff pct_diff diff12 pct_diff12 annualized_monthly
##   <date>     <dbl> <dbl>    <dbl>  <dbl>      <dbl>              <dbl>
## 1 2022-09-01  297. 1.22   0.00413   22.5     0.0821             0.0507
## 2 2022-10-01  298. 1.45   0.00488   21.5     0.0776             0.0602
## 3 2022-11-01  299. 0.611  0.00205   19.9     0.0714             0.0249
## 4 2022-12-01  299. 0.392  0.00131   18.1     0.0644             0.0159
## 5 2023-01-01  301. 1.55   0.00517   17.9     0.0635             0.0638
## 6 2023-02-01  302. 1.11   0.00370   17.0     0.0599             0.0453
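
To see what lag() is doing, it helps to try it on a tiny vector. This sketch uses a made-up index, idx, that grows exactly 1% per period. The names prev, two_back, pct_change and annualized are made up as well.

```r
library(dplyr)

# Made-up index that grows exactly 1% per period
idx = c(100, 101, 102.01)

# lag() shifts the values down by one position and fills the gap with NA
prev = dplyr::lag(idx)

# lag(x, n) shifts by n positions, so the first n values are NA
two_back = dplyr::lag(idx, 2)

# Growth rate relative to the previous value
pct_change = (idx - prev) / prev

# Compounding a 1% monthly rate over 12 months gives about 12.7% per year
annualized = (1 + pct_change)^12 - 1

prev
two_back
pct_change
annualized
```

This is why the first row of CPI_plus has NA for diff and the first 12 rows have NA for diff12: there is no earlier value to compare against.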

Plot Growth Rates

CPI_plus %>% 
  ggplot(aes(x = DATE)) +
  geom_point(aes(y = pct_diff12),color = "red") +
  geom_point(aes(y = annualized_monthly),color = "blue")

## Warning: Removed 12 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing missing values (geom_point).
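
The plot above has no legend because the colors are set by hand in each layer. One possible alternative, sketched here under the assumption that a legend is wanted, is to stack the two series with pivot_longer() from tidyr and map the series name to color. The tibble toy and its numbers are made up to stand in for CPI_plus. The column names series and rate are hypothetical.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up growth rates standing in for CPI_plus
toy = tibble(
  DATE = as.Date(c("2022-01-01", "2022-02-01", "2022-03-01")),
  pct_diff12 = c(0.075, 0.079, 0.085),
  annualized_monthly = c(0.081, 0.098, 0.160)
)

# Stack the two rate columns into one, with a label column saying which is which
toy_long = toy %>%
  pivot_longer(c(pct_diff12, annualized_monthly),
               names_to = "series", values_to = "rate")

# Mapping series to color makes ggplot draw the legend automatically
p = toy_long %>%
  ggplot(aes(x = DATE, y = rate, color = series)) +
  geom_point()
```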

Notes on Numbers
