R Curriculum

Overview

Will work through real-world problems
Alpha access will not be necessary
Will query using presto function (must have permissions)
Downloads should be complete prior to course start

Learning Outcomes

At the end of this course students should know how to:

Navigate RStudio environment
Know how to get help with a function
Introduced to helpful functions
Read in data through presto and csv
Recognize tidy data
Use main dplyr verbs
Apply principles to real data examples

Navigating RStudio

What is R? A development environment designed for reproducible statistical analysis.
What is RStudio? An Interactive Development Environment (IDE) for R

Separated into 4 windows:

Console
Script
Help/Visualizations
Environment variables

Good to Know:

Ability to write directly into console or use script with Cmd+Enter to run.
Scripts can be run line by line

How to get help - It is difficult to remember argument ordering for all functions - Help is available within the RStudio environment - Use question mark before function to get help (?) - If multiple functions share the name, all options will appear - Help screen is organized by:

1. Description - Describes the functionality
2. Usage - function argument layout with defaults
3. Arguments - Detail of all function arguments
4. Value - What the function outputs
5. Examples - Reproducible example usage of function

?mean

Helpful functions:

head - Preview data
set.seed - Set seed for reproducible results
str - View structure of the data
table - tally values
cut - group a vector into bins

Reading Data into R

Use Rbnb package to query data from silver or gold clusters
CSV files can be read in using read.csv

The Rbnb package is a collection of functions, useful packages, and design aesthetics created and maintained by Airbnb developers. The main function for reading in data is presto. It facilitates the querying of data from Airbnb data sources.

library(Rbnb)
test <- presto('select * from core_data_anon.fct_bookings limit 1')

Helpful tip: - Cached data will be loaded if query is ____ days/hours old - To avoid caching use argument use_cache=FALSE

Tidy Data

How to use tidy data principles? Why are tidy data principles important?

Will help with tidyverse R packages
Vitally important to ggplot usage

What is tidy data?

Each variable is a column
Each observation is a row
Each type of observational unit is a table

Examples:

–Replace examples with Abnb data

data1
#>           name treatmenta treatmentb
#> 1   John Smith         NA         18
#> 2     Jane Doe          4          1
#> 3 Mary Johnson          6          7

Reshaped to

data2
#>           name treatment  n
#> 1     Jane Doe         a  4
#> 2     Jane Doe         b  1
#> 3   John Smith         a NA
#> 4   John Smith         b 18
#> 5 Mary Johnson         a  6
#> 6 Mary Johnson         b  7

There are many times when ‘long’ data will be needed instead of ‘wide’.

How to identify ‘wide’ data:

Data measurement is split into multiple columns

Pipe Operator

Makes sequences of functions more readable
Passes previous function returns into the next expression
By default, first argument of new function filled by result of previous result

# generate 5 random numbers ranging from 1 to 10
x <- runif(5, 1, 10)

# print my random sequence
x

## [1] 1.352935 8.613906 2.634372 8.190679 6.350486

## function
## what does it do?
floor(mean(round(x)))

## [1] 5

## pipeline without print statement
x %>% 
  round %>% 
  mean %>% 
  floor

## [1] 5

## pipeline with print statement
x %>% print %>%
  round %>% print %>%
  mean %>% print %>%
  floor

## [1] 1.352935 8.613906 2.634372 8.190679 6.350486
## [1] 1 9 3 8 6
## [1] 5.4

## [1] 5

Main dplyr verbs

filter

Similar to SQL ‘WHERE’ clause
Specifies which rows to return

df1 <- presto("select * from core_data.fct_reviews_of_host limit 50")
df1 %>% filter(dim_overall_rating > 3)

select

Similar to SQL ‘SELECT’ statement
Which columns/fields to return
Unlike SQL, data aggregations will not be done in this statement

df1 %>% filter(dim_overall_rating > 3) %>% 
  select(id_review, dim_overall_rating, ds)

Tips on variable selection

Check ?select for great examples on usage
starts_with(), ends_with(), and contains() can help with inexact searches

df1 %>% select(contains('rating'))

group_by

Useful for aggregation by groups
Allows for partitioned group operations

df1 %>% group_by(dim_rating)

mutate

Similar to SQL ‘OVER’ statement
When combined with ‘GROUP BY’, similar to ‘PARTITION BY’ combination

df1 %>% mutate(running_total = cumsum(dim_accuracy_rating))

summarise

Similar to SQL ‘GROUP BY’ aggregations
Should return one value per operation

df1 %>% group_by(dim_overall_rating) %>%
  summarise(m_count = n())

gather

Reshapes data from ‘wide’ to ‘long’
Arguments: data key value columns to gather

Data is not yet tidy. The type of rating should be its own field. The rating itself should be its own field.

gather(df1, rating_type, rating_value, -id_review)

-Two more ways to specify the table - gather(df1, rating_type, rating_value, m_accuracy_rating, m_overall_rating, m_cleanliness_rating) - gather(df1, rating_type, rating_value, m_accuracy_rating:m_cleanliness_rating)