October 8, 2020

First things first

Some things you should know

  • Typical uses for R
    • Data analysis
    • Data visualization
    • Reshaping data
    • Automation
  • Mental models and programming
    • Excel/SPSS/Stata → computational programming tools like R
  • What all R programmers do
    • Google for answers
    • Borrow code
    • Ask friends and pros for help
  • Useful references (see Reference Materials in the syllabus for more)

Quick Tour of RStudio

  • Open and save (locally) an R Notebook
  • Run code from the notebook and the console
  • Make sure you can install packages
    • 2 ways to install packages
      • Command line/Console
      • RStudio GUI

install.packages('devtools')

Libraries, functions, data

  • Libraries (or packages) are collections of functions (and datasets)
  • Over 10,000 libraries on CRAN
install.packages('tidyverse')
install.packages(c('ggthemes', 'officer'))
library(tidyverse)
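A minimal sketch of the install-once, load-every-session pattern. It uses the base packages stats and utils as stand-ins (an assumption for illustration) so it runs offline; swap in the packages you actually need. The object names needed, is_installed, and missing_pkgs are hypothetical.

```r
# Packages we want available; 'stats' and 'utils' ship with R and stand in
# for packages such as 'tidyverse' (illustrative assumption).
needed <- c('stats', 'utils')

# requireNamespace() returns TRUE when a package is already installed
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

# Install only what is missing (this step needs an internet connection)
if (length(missing_pkgs) > 0) {
    install.packages(missing_pkgs)
}

# Installing happens once; library() must be called in every new session
library(stats)
missing_pkgs
```

Checking first avoids reinstalling packages every time the notebook runs.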

Libraries, functions, data

  • Functions perform operations in R
foo <- c(1,2,4)
foo %>% min()
## [1] 1
foo %>% mean()
## [1] 2.333333
foo %>% max()
## [1] 4
foo %>% sd()
## [1] 1.527525
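To see what these functions do under the hood, here is a small sketch that rebuilds mean() and sd() by hand. It uses plain function calls instead of the pipe so it runs without loading any packages; foo %>% mean() and mean(foo) are the same call. The names manual_mean and manual_sd are hypothetical.

```r
foo <- c(1, 2, 4)

# The mean is the sum divided by the number of values
manual_mean <- sum(foo) / length(foo)

# The sample standard deviation divides the squared deviations by n - 1
manual_sd <- sqrt(sum((foo - manual_mean)^2) / (length(foo) - 1))

# all.equal() compares numbers while tolerating tiny rounding differences
all.equal(manual_mean, mean(foo))
all.equal(manual_sd, sd(foo))
```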

Libraries, functions, data

  • Function arguments
    • Let’s look at arguments in the glm and min functions
    • Arguments are inputs
    • Arguments are comma separated
help(glm)
help(min)

cat_function <- function(love = TRUE){
    if(love == TRUE){
        print('I love cats!')
    }
    else {
        print('I am not a cool person.')
    }
}
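A short usage sketch for cat_function, showing how a default argument behaves. The function is repeated so the block runs on its own; the names default_result and other_result are hypothetical.

```r
# Same function as above, repeated so this block is self-contained
cat_function <- function(love = TRUE) {
    if (love == TRUE) {
        print('I love cats!')
    } else {
        print('I am not a cool person.')
    }
}

# No argument supplied, so the default love = TRUE is used
default_result <- cat_function()

# Overriding the default by naming the argument
other_result <- cat_function(love = FALSE)
```

Run help(cat_function) style lookups on real functions, such as help(min), to see which arguments have defaults.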

Libraries, functions, data

  • Function arguments
    • Let’s look at arguments in the glm and min functions
    • Arguments are inputs
    • Arguments are comma separated
foo <- c(1,2,NA, 4)
foo %>% min()
## [1] NA
foo %>% mean()
## [1] NA
foo %>% max()
## [1] NA
foo %>% sd()
## [1] NA

Libraries, functions, data

  • Function arguments
    • Let’s look at arguments in the glm and min functions
    • Arguments are inputs
    • Arguments are comma separated
foo <- c(1,2,NA, 4)
foo %>% min(na.rm = TRUE)
## [1] 1
foo %>% mean(na.rm = TRUE)
## [1] 2.333333
foo %>% max(na.rm = TRUE)
## [1] 4
foo %>% sd(na.rm = TRUE)
## [1] 1.527525
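Before reaching for na.rm = TRUE, it helps to know how many values are missing. A minimal sketch in base R (no packages needed); the names n_missing, mean_default, and mean_clean are hypothetical.

```r
foo <- c(1, 2, NA, 4)

# is.na() flags each missing value; sum() counts the TRUEs
n_missing <- sum(is.na(foo))

# One NA is enough to make the result NA by default
mean_default <- mean(foo)

# na.rm = TRUE drops the NA first, so this matches the mean of c(1, 2, 4)
mean_clean <- mean(foo, na.rm = TRUE)

n_missing
mean_default
mean_clean
```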

Libraries, functions, data

  • Oh my, so many types of data!
    • Vectors, tibbles, data frames, matrices
    • Integers, numbers, characters, factors
  • Confirm data type
c('Washington', 'Oregon', 'Idaho') %>% class()
## [1] "character"
c('Washington', 'Oregon', 'Idaho') %>% is.character()
## [1] TRUE
c('Washington', 'Oregon', 'Idaho') %>% is.factor()
## [1] FALSE
  • Coerce data type
c('Washington', 'Oregon', 'Idaho') %>% as.factor()
## [1] Washington Oregon     Idaho     
## Levels: Idaho Oregon Washington
c('Washington', 'Oregon', 'Idaho') %>% as.factor() %>% class()
## [1] "factor"
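Coercion can surprise you with factors. The sketch below, in base R, shows that a factor stores alphabetical levels and that coercing it to integer returns level codes rather than labels. The object names are hypothetical.

```r
states <- c('Washington', 'Oregon', 'Idaho')
states_factor <- as.factor(states)

# Levels are stored in alphabetical order, not in order of appearance
levels(states_factor)

# Coercing a factor to integer returns the level codes, not the labels
state_codes <- as.integer(states_factor)

# Coercing back to character recovers the original labels
states_again <- as.character(states_factor)

state_codes
states_again
```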

Libraries, functions, data

  • Oh my, so many types of data!
    • Vectors, tibbles, data frames, matrices
  • Vectors contain 1 or more values in a sequence (all of the same type)
  • Call a specific element by its location in a vector
c('Washington', 'Oregon', 'Idaho')
## [1] "Washington" "Oregon"     "Idaho"
1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
rep(1:2, times = 2)
## [1] 1 2 1 2
seq(from = 0, to = 100, by = 10)
##  [1]   0  10  20  30  40  50  60  70  80  90 100

Libraries, functions, data

  • Oh my, so many types of data!
    • Vectors, tibbles, data frames, matrices
  • Vectors contain 1 or more values in a sequence (all of the same type)
  • Call a specific element by its location in a vector
c('Washington', 'Oregon', 'Idaho')[2]
## [1] "Oregon"
seq(from = 0, to = 100, by = 10)[6]
## [1] 50
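Square brackets accept more than a single position. A small sketch of three other common ways to index a vector; the names first_and_third, without_second, and above_fifty are hypothetical.

```r
states <- c('Washington', 'Oregon', 'Idaho')
tens <- seq(from = 0, to = 100, by = 10)

# Several positions at once
first_and_third <- states[c(1, 3)]

# A negative index drops that position
without_second <- states[-2]

# A logical condition keeps only the elements where it is TRUE
above_fifty <- tens[tens > 50]

first_and_third
without_second
above_fifty
```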

Libraries, functions, data

  • Oh my, so many types of data!
    • Vectors, tibbles, data frames, matrices
  • Tibbles look a little like spreadsheets
tibble(x = c(1:3), y = c(4:6), z = c('Washington', 'Oregon', 'Idaho'))
## # A tibble: 3 x 3
##       x     y z         
##   <int> <int> <chr>     
## 1     1     4 Washington
## 2     2     5 Oregon    
## 3     3     6 Idaho

Libraries, functions, data

  • Call a specific variable by name or location in the tibble
    • Variables are vectors
    • Use the $ between the tibble name and the variable name
    • pull() is the function form of $, which makes it easy to use in a pipe
    • Also able to call the column (or row) by index number
cars$speed %>% head(7)
## [1]  4  4  7  7  8  9 10
cars %>% pull(speed) %>% head(7)
## [1]  4  4  7  7  8  9 10
cars[,1] %>% head(7)
## [1]  4  4  7  7  8  9 10
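A sketch of the same lookups in base R, so it runs without loading dplyr (pull() is left out for that reason). Note that cars is a built-in data frame; on a tibble, the [, 1] form returns a one-column tibble rather than a vector. The object names are hypothetical.

```r
# cars is a data frame that ships with R
by_dollar <- cars$speed
by_name <- cars[['speed']]
by_position <- cars[, 1]

# All three return the same vector
identical(by_dollar, by_name)
identical(by_dollar, by_position)

# Rows are indexed before the comma: the first row, all columns
first_row <- cars[1, ]
first_row
```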

Libraries, functions, data

  • Oh my, so many types of data!
    • Vectors, tibbles, data frames, matrices
  • Data frames are like tibbles, minus the printed metadata (dimensions and column types) and the truncated print
data.frame(x = c(1:3), y = c(4:6), z = c('Washington', 'Oregon', 'Idaho'))
##   x y          z
## 1 1 4 Washington
## 2 2 5     Oregon
## 3 3 6      Idaho

Libraries, functions, data

  • Oh my, so many types of data!
    • Vectors, tibbles, data frames, matrices
  • Matrices and tibbles have similar shapes
  • Matrices hold data of a single type
  • For this class, choose tibbles over matrices
matrix(data = 1:6, nrow = 3, ncol = 2)
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
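A quick sketch of why the single-type rule matters: mixing numbers and text forces a matrix to character, while a data frame (and likewise a tibble) keeps one type per column. The names mixed_matrix and mixed_df are hypothetical.

```r
# Mixing numbers and text converts every value to character
mixed_matrix <- matrix(data = c(1, 2, 'a', 'b'), nrow = 2, ncol = 2)
typeof(mixed_matrix)

# A data frame keeps a separate type for each column
mixed_df <- data.frame(x = c(1, 2), z = c('a', 'b'))
class(mixed_df$x)
```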

Let’s apply what we learned to real data

You are an analyst at a large global health organization. Organizational leadership needs to report publicly on the Covid-19 data that you’ve collected. Over the next month, you will be asked to prepare findings that you observe in the data.

Exercise 1 - 5 minutes
- What type of data object is covid?
- What are the min() and mean() for est_infections_p100k?
- What is the range for date?

To answer the questions, import the Covid-19 dataset.

covid <- read_csv('https://rb.gy/lzlylj')

covid %>% class()
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
covid %>% pull(est_infections_p100k) %>% min(na.rm = TRUE)
## [1] 0.00000003176912
covid %>% pull(est_infections_p100k) %>% mean(na.rm = TRUE)
## [1] 51.07284
covid %>% pull(date) %>% min(na.rm = TRUE)
## [1] "2020-02-04"
covid %>% pull(date) %>% max(na.rm = TRUE)
## [1] "2021-01-01"
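Another way to get a range is the base R range() function, which returns the minimum and maximum in one call. A minimal sketch on a few made-up dates (the dates vector is hypothetical, not the covid data):

```r
# Hypothetical dates; as.Date() turns text into Date values
dates <- as.Date(c('2020-06-15', '2020-02-04', NA, '2021-01-01'))

# range() returns the minimum and maximum together
date_range <- range(dates, na.rm = TRUE)

# diff() gives the span between them in days
span_days <- diff(date_range)

date_range
span_days
```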

Base R functions to explore tibbles

  • Get in the habit of running these functions when you open a new dataset
    • head() shows you the first rows of a tibble
      • The n argument defaults to 6
      • tail() shows the last rows of a tibble
    • summary() shows summary statistics for all variables in a dataset
      • Reported summary statistics depend on the variable type
    • ls() lists the names of all variables in a tibble, in alphabetical order
      • You can also use ls() with nothing between the parentheses to see all objects in your workspace
    • str() shows the type and the first few values of every variable in a tibble
    • dim() tells you the dimensions of your dataset
      • nrow() reports the number of rows only
      • ncol() reports the number of columns only
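The functions above can be tried on the built-in cars data frame, which needs no download. This is only a sketch of the habit, not output from the class dataset; names() is added here as a base R companion to ls() that keeps column order.

```r
# cars ships with R: 50 rows, 2 columns
head(cars)          # first 6 rows by default
head(cars, n = 3)   # n controls how many rows are shown
tail(cars, n = 3)   # last 3 rows

dim(cars)           # rows, then columns
nrow(cars)
ncol(cars)

str(cars)           # type and first few values of each variable
summary(cars)       # summary statistics for each variable

names(cars)         # variable names in column order
ls(cars)            # the same names, sorted alphabetically
```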

Base R functions to explore tibbles

Exercise 2 - 5 minutes
- How many observations (or rows) are in covid?
- What is the median() value for mobility_composite?
- How many variables are in covid?

Hint: There are multiple ways to answer these questions with the functions you know

covid %>% dim()
## [1] 165210     13
covid %>% pull(mobility_composite) %>% median(na.rm = TRUE)
## [1] -19.54311
covid %>% ncol()
## [1] 13

Other methods to answer questions

covid %>% nrow()
covid %>% summary()
covid %>% ls()

Base R functions to explore vectors

  • table() shows you the distribution of values in a vector
  • length() tells you the number of elements in a vector
  • unique() shows you the unique values in a vector
  • sort() orders values in a vector
  • summary() shows you descriptive statistics for a vector
    • summary() can run on a vector or tibble

You can chain together multiple functions

covid %>% pull(location) %>% unique() %>% length()
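A small sketch of these vector functions on a made-up vector, written with plain function calls so it runs without packages. The locations vector is hypothetical and only stands in for a column such as location.

```r
# Hypothetical vector standing in for a column such as location
locations <- c('Oregon', 'Idaho', 'Oregon', 'Washington', 'Idaho', 'Oregon')

length(locations)            # number of elements
unique(locations)            # distinct values, in order of first appearance
sort(unique(locations))      # distinct values, alphabetical

counts <- table(locations)   # how many times each value appears
counts

# Nested calls read inside out; the pipe version reads left to right
n_locations <- length(unique(locations))
n_locations
```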

Base R functions to explore vectors

Exercise 3 - 5 minutes
- How many ‘projected’ values are there in mobility_data_type?
- Which is the third to last value for location when values are listed alphabetically?
- Is ‘projected’ a value from the total_tests_data_type variable?

covid %>% pull(mobility_data_type) %>% table()
## .
##  observed projected 
##    123887     39992
covid %>% pull(location) %>% table() %>% tail()
## .
##   Wyoming     Yemen   Yucatán Zacatecas    Zambia  Zimbabwe 
##       540       538       538       538       538       538
covid %>% pull(total_tests_data_type) %>% unique() 
## [1] "observed" NA
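Instead of reading the answers off printed output, you can pull them out directly. A sketch on made-up vectors (locations and data_type are hypothetical stand-ins for the covid columns):

```r
locations <- c('Zambia', 'Wyoming', 'Zimbabwe', 'Yemen', 'Zacatecas', 'Yemen')
data_type <- c('observed', NA, 'observed')

sorted_locations <- sort(unique(locations))

# tail(x, 3) keeps the last three values; the first of those is third to last
third_to_last <- tail(sorted_locations, 3)[1]

# Count one specific value; na.rm = TRUE ignores the missing entries
n_observed <- sum(data_type == 'observed', na.rm = TRUE)

# %in% answers a yes/no membership question
has_projected <- 'projected' %in% data_type

third_to_last
n_observed
has_projected
```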

Troubleshooting your R code

Make sure…

  • You loaded your libraries
  • Arguments in your functions are comma-separated
  • Your functions have a closing paren
  • You have an end quotation mark where relevant
  • The variable type is correct to perform the function you call
    • For example you probably don’t want to run min() on a character string
  • Your assignment operator looks like this
<-
  • Your pipe operator looks like this
%>%
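The variable type check is the easiest one to miss, because R does not always raise an error. A minimal sketch with numbers stored as text (the values vector is hypothetical):

```r
# Numbers stored as text, a common result of messy imports (hypothetical)
values <- c('10', '2', '33')

# min() compares text alphabetically, so '10' comes before '2'
text_min <- min(values)

# mean() cannot average text: it returns NA with a warning
text_mean <- suppressWarnings(mean(values))

# Check the type, coerce, then compute
class(values)
numeric_values <- as.numeric(values)
min(numeric_values)
mean(numeric_values)
```

Running class() first is a cheap way to catch this before the numbers end up in a report.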

Troubleshooting your R code

Exercise 4 - 5 minutes

This code contains 6 mistakes. The code is supposed to create three new data objects called covid_inf, mean_inf, and median_inf and report the mean and median confirmed infections.

  • Correct 6 coding mistakes
  • Reference the previous slide for help
covid_inf <- covid %.% pull(confirmed_infection)

mean_inf < mean(covid_inf na.rm = TRUE)
median_inf <- covid_inf %>% median(na.rm = TRUE

mean_inf
medean_inf

Troubleshooting your R code

Exercise 4 - 5 minutes

This code contains 6 mistakes. The code is supposed to create three new data objects called covid_inf, mean_inf, and median_inf and report the mean and median confirmed infections.

  • Correct 6 coding mistakes
  • Reference the previous slide for help
covid_inf <- covid %>% pull(confirmed_infections)

mean_inf <- mean(covid_inf, na.rm = TRUE)
median_inf <- covid_inf %>% median(na.rm = TRUE)

mean_inf
## [1] 1335.746
median_inf
## [1] 54

Practice with crime data

You wear many hats and are also a crime analyst at a think tank. For a report being prepared by the think tank, you need to analyze crime data.

Exercise 5 - 7 minutes
- What is the most recent occurred_date?
- Which neighborhood sees the most incident activity?
- What are the earliest and latest reported_year values?

Begin the exercise by importing the crime dataset in RStudio

crime <- read_csv('https://rb.gy/5zuayh') 

crime %>% pull(occurred_date) %>% max(na.rm = TRUE)
## [1] "2020-09-25"
crime %>% pull(neighborhood) %>% table() %>% sort() %>% tail()
## .
##          UNIVERSITY         SLU/CASCADE          QUEEN ANNE           NORTHGATE 
##                5458                5970                6792                7614 
##        CAPITOL HILL DOWNTOWN COMMERCIAL 
##                8144               11692
crime %>% pull(reported_year) %>% min(na.rm = TRUE)
## [1] 2008
crime %>% pull(reported_year) %>% max(na.rm = TRUE)
## [1] 2020
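To pull out the single most frequent value rather than scanning the tail of a sorted table, one option is which.max(). A sketch on a made-up vector (neighborhood here is hypothetical, not the crime data):

```r
# Hypothetical neighborhood vector
neighborhood <- c('NORTHGATE', 'CAPITOL HILL', 'NORTHGATE', 'UNIVERSITY',
                  'NORTHGATE', 'CAPITOL HILL')

counts <- table(neighborhood)

# decreasing = TRUE puts the largest count first
sorted_counts <- sort(counts, decreasing = TRUE)
head(sorted_counts, 3)

# which.max() finds the position of the largest count; names() gives its label
top_neighborhood <- names(which.max(counts))
top_neighborhood
```

If two values tie for the largest count, which.max() returns only the first, so check the sorted table when ties are possible.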