Harold Nelson
2023-03-15
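These notes use the tidyverse throughout. The loading chunk itself isn't shown in the source, but the attach messages below are the output of:

library(tidyverse)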
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
These notes are a compilation of important points from Chapter 14 of R4DS.
parse_number(), from the readr package, is an extremely useful function that saves you from having to do a lot of string manipulation. It finds the first number in a string of text and ignores the characters around it.
Here are a few examples.
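The example chunk isn't shown in the source; calls along these lines would reproduce the output below (the input strings are my guesses):

parse_number("I am 5 years old")  # pulls out the first number: 5
parse_number("$1,600,000")        # drops the dollar sign and commas: 1600000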
## [1] 5
## [1] 1600000
Make at least three examples of your own. The act of typing parse_number() will help fix this function in your memory.
The dplyr function count() is very similar to the function table() in base R.
Look at t.
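The chunk that builds t isn't shown. The counts below match the dest column of the flights table, so presumably it was something like this (assuming the flights data from nycflights13):

library(nycflights13)
t = table(flights$dest)
str(t)
t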
## 'table' int [1:105(1d)] 254 265 439 8 17215 2439 275 443 375 297 ...
## - attr(*, "dimnames")=List of 1
## ..$ : chr [1:105] "ABQ" "ACK" "ALB" "ANC" ...
##
## ABQ ACK ALB ANC ATL AUS AVL BDL BGR BHM BNA BOS BQN
## 254 265 439 8 17215 2439 275 443 375 297 6333 15508 896
## BTV BUF BUR BWI BZN CAE CAK CHO CHS CLE CLT CMH CRW
## 2589 4681 371 1781 36 116 864 52 2884 4573 14064 3524 138
## CVG DAY DCA DEN DFW DSM DTW EGE EYW FLL GRR GSO GSP
## 3941 1525 9705 7266 8738 569 9384 213 17 12055 765 1606 849
## HDN HNL HOU IAD IAH ILM IND JAC JAX LAS LAX LEX LGA
## 15 707 2115 5700 7198 110 2077 25 2720 5997 16174 1 1
## LGB MCI MCO MDW MEM MHT MIA MKE MSN MSP MSY MTJ MVY
## 668 2008 14082 4113 1789 1009 11728 2802 572 7185 3799 15 221
## MYR OAK OKC OMA ORD ORF PBI PDX PHL PHX PIT PSE PSP
## 59 312 346 849 17283 1536 6554 1354 1632 4656 2875 365 19
## PVD PWM RDU RIC ROC RSW SAN SAT SAV SBN SDF SEA SFO
## 376 2352 8163 2454 2416 3537 2737 686 804 10 1157 3923 13331
## SJC SJU SLC SMF SNA SRQ STL STT SYR TPA TUL TVC TYS
## 329 5819 2467 284 825 1211 4339 522 1761 7466 315 101 631
## XNA
## 1036
Look at c.
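Again the chunk isn't shown; c appears to be the count() version of the same tally:

c = count(flights, dest)
str(c)
c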
## tibble [105 × 2] (S3: tbl_df/tbl/data.frame)
## $ dest: chr [1:105] "ABQ" "ACK" "ALB" "ANC" ...
## $ n : int [1:105] 254 265 439 8 17215 2439 275 443 375 297 ...
## # A tibble: 105 × 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 265
## 3 ALB 439
## 4 ANC 8
## 5 ATL 17215
## 6 AUS 2439
## 7 AVL 275
## 8 BDL 443
## 9 BGR 375
## 10 BHM 297
## # … with 95 more rows
What’s the difference?
A tibble is easier to work with: the destination names live in an ordinary column rather than in an attribute, and it flows directly into the rest of a dplyr pipeline. The tidyverse philosophy is to produce tibbles.
Recall the basic principle that R sees even the simplest piece of data as a vector. You may think of multiplying a vector by a number, but R sees that number as a vector of length one. Consider the following.
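The chunk isn't shown; a vector x = c(1, 2, 3) multiplied by 2 reproduces the output (the name x is confirmed by the warning message further down):

x = c(1, 2, 3)
2 * x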
## [1] 2 4 6
R “recycles” the 2 to get a vector of the same length as x. So here’s what really happens: the 2 becomes a vector with three copies of 2.
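Spelled out explicitly (this chunk is my reconstruction):

c(2, 2, 2) * x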
## [1] 2 4 6
To see this more clearly, observe what happens when you multiply by a shorter vector rather than a single number.
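The warning text below tells us the expression was short * x; for the result to be 2 6 6, short must have been a length-2 vector like c(2, 3):

short = c(2, 3)
short * x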
## Warning in short * x: longer object length is not a multiple of shorter object
## length
## [1] 2 6 6
What happened?
Recycling extends short to length 3 to match x, reusing its first element, so the effective multiplier is c(2, 3, 2).
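Written out (my reconstruction):

c(2, 3, 2) * x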
## [1] 2 6 6
Consider the following.
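The chunk isn't shown; printing the numbers 1 through 10 and then subsetting them with a length-2 logical vector reproduces the two output lines (the name y is my choice):

y = 1:10
y
y[c(TRUE, FALSE)]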
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 1 3 5 7 9
R recycled the vector of length 2 to create a vector of length 10.
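These two equivalent calls (again reconstructed) produce the identical results below:

y[rep(c(TRUE, FALSE), 5)]
y[c(TRUE, FALSE)]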
## [1] 1 3 5 7 9
## [1] 1 3 5 7 9
How would you go about skipping every third item in a vector?
Test it with the vector of numbers from 1 to 15.
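One possible solution, sketched here using the same recycling idea: a length-3 logical vector with FALSE in the third slot.

(1:15)[c(TRUE, TRUE, FALSE)]  # keeps 1 2 4 5 7 8 10 11 13 14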
Take a few minutes and make a few examples of your own.
The numbers you have may not be the best versions of the data to work with.
Think about square figures. A square with side length \(l\) has area \(l^2\).
Should your analysis focus on the length of a side or on the area of the square?
Sometimes data is heavily skewed to the right, with a small number of very large values. A histogram or density plot is then nearly useless because the detail for most of the data is crammed into a small part of the axis.
Load the file county.rda. Then do a histogram and a density of the variable pop2017.
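The chunk isn't shown. Assuming county.rda sits in the working directory and contains a data frame named county, something like this produces the messages below:

load("county.rda")

county %>%
  ggplot(aes(x = pop2017)) +
  geom_histogram()

county %>%
  ggplot(aes(x = pop2017)) +
  geom_density()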
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 3 rows containing non-finite values (stat_bin).
## Warning: Removed 3 rows containing non-finite values (stat_density).
## Logarithmic Transformation
Create a variable logpop in this dataframe using the function log10(). Then redo the graphs.
county = county %>%
  mutate(logpop = log10(pop2017))

county %>%
  ggplot(aes(x = logpop)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 3 rows containing non-finite values (stat_bin).
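The stat_density warning below implies the density plot was redone as well; that chunk would have been:

county %>%
  ggplot(aes(x = logpop)) +
  geom_density()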
## Warning: Removed 3 rows containing non-finite values (stat_density).
We use the quantile function to get percentiles.
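The call isn't shown; a quantile() call like this, with na.rm = TRUE for the three missing values, matches the output:

quantile(county$pop2017, probs = 0.75, na.rm = TRUE)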
## 75%
## 67756
To reverse the quantile function, take the mean of a boolean expression.
Example: what fraction of the pop2017 values are below 10000?
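A sketch of the answer (the chunk is not shown in the source):

mean(county$pop2017 < 10000, na.rm = TRUE)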
First I need to import the CPI data I downloaded from FRED.
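The import chunk isn't shown. Assuming the FRED file was saved as CPIAUCSL.csv, and noting that later chunks use a column named CPI rather than CPIAUCSL, the import was presumably along these lines:

CPI = read_csv("CPIAUCSL.csv") %>%
  rename(CPI = CPIAUCSL)  # later code refers to the series as CPI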
## Rows: 86 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (1): CPIAUCSL
## date (1): DATE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let’s plot it.
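The plotting chunk isn't shown; a minimal version would be:

CPI %>%
  ggplot(aes(x = DATE, y = CPI)) +
  geom_line()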
CPI_plus = CPI %>%
  mutate(diff = CPI - lag(CPI),                       # change from the previous month
         pct_diff = diff / lag(CPI),                  # month-over-month percent change
         diff12 = CPI - lag(CPI, 12),                 # change from 12 months earlier
         pct_diff12 = diff12 / lag(CPI, 12),          # year-over-year percent change
         annualized_monthly = (1 + pct_diff)^12 - 1)  # monthly rate compounded over 12 months

head(CPI_plus)
## # A tibble: 6 × 7
## DATE CPI diff pct_diff diff12 pct_diff12 annualized_monthly
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2016-01-01 238. NA NA NA NA NA
## 2 2016-02-01 237. -0.316 -0.00133 NA NA -0.0158
## 3 2016-03-01 238. 0.744 0.00313 NA NA 0.0383
## 4 2016-04-01 239. 0.912 0.00383 NA NA 0.0469
## 5 2016-05-01 240. 0.565 0.00236 NA NA 0.0287
## 6 2016-06-01 240. 0.665 0.00278 NA NA 0.0338
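The second table below shows the most recent months of the series, which is what tail() would print; that call isn't shown in the source:

tail(CPI_plus)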
## # A tibble: 6 × 7
## DATE CPI diff pct_diff diff12 pct_diff12 annualized_monthly
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2022-09-01 297. 1.22 0.00413 22.5 0.0821 0.0507
## 2 2022-10-01 298. 1.45 0.00488 21.5 0.0776 0.0602
## 3 2022-11-01 299. 0.611 0.00205 19.9 0.0714 0.0249
## 4 2022-12-01 299. 0.392 0.00131 18.1 0.0644 0.0159
## 5 2023-01-01 301. 1.55 0.00517 17.9 0.0635 0.0638
## 6 2023-02-01 302. 1.11 0.00370 17.0 0.0599 0.0453