Overview
- Will work through real-world problems
- Alpha access will not be necessary
- Will query using presto function (must have permissions)
- Downloads should be complete prior to course start
Learning Outcomes
At the end of this course students should know how to:
- Navigate RStudio environment
- Know how to get help with a function
- Introduced to helpful functions
- Read in data through presto and csv
- Recognize tidy data
- Use main dplyr verbs
- Apply principles to real data examples
Navigating RStudio
What is R? A development environment designed for reproducible statistical analysis.
What is RStudio? An Interactive Development Environment (IDE) for R
Separated into 4 windows:
- Console
- Script
- Help/Visualizations
- Environment variables
Good to Know:
- Ability to write directly into console or use script with Cmd+Enter to run.
- Scripts can be run line by line
How to get help - It is difficult to remember argument ordering for all functions - Help is available within the RStudio environment - Use question mark before function to get help (?) - If multiple functions share the name, all options will appear - Help screen is organized by:
1. Description - Describes the functionality
2. Usage - function argument layout with defaults
3. Arguments - Detail of all function arguments
4. Value - What the function outputs
5. Examples - Reproducible example usage of function
?mean
Helpful functions:
- head - Preview data
- set.seed - Set seed for reproducible results
- str - View structure of the data
- table - tally values
- cut - group a vector into bins
Reading Data into R
- Use Rbnb package to query data from silver or gold clusters
- CSV files can be read in using read.csv
The Rbnb package is a collection of functions, useful packages, and design aesthetics created and maintained by Airbnb developers. The main function for reading in data is presto. It facilitates the querying of data from Airbnb data sources.
library(Rbnb)
test <- presto('select * from core_data_anon.fct_bookings limit 1')
Helpful tip: - Cached data will be loaded if query is ____ days/hours old - To avoid caching use argument use_cache=FALSE
Tidy Data
How to use tidy data principles? Why are tidy data principles important?
- Will help with tidyverse R packages
- Vitally important to ggplot usage
What is tidy data?
- Each variable is a column
- Each observation is a row
- Each type of observational unit is a table
Examples:
–Replace examples with Abnb data
data1
#> name treatmenta treatmentb
#> 1 John Smith NA 18
#> 2 Jane Doe 4 1
#> 3 Mary Johnson 6 7
Reshaped to
data2
#> name treatment n
#> 1 Jane Doe a 4
#> 2 Jane Doe b 1
#> 3 John Smith a NA
#> 4 John Smith b 18
#> 5 Mary Johnson a 6
#> 6 Mary Johnson b 7
There are many times when ‘long’ data will be needed instead of ‘wide’.
How to identify ‘wide’ data:
- Data measurement is split into multiple columns
Pipe Operator
- Makes sequences of functions more readable
- Passes previous function returns into the next expression
- By default, first argument of new function filled by result of previous result
# generate 5 random numbers ranging from 1 to 10
x <- runif(5, 1, 10)
# print my random sequence
x
## [1] 1.352935 8.613906 2.634372 8.190679 6.350486
## function
## what does it do?
floor(mean(round(x)))
## [1] 5
## pipeline without print statement
x %>%
round %>%
mean %>%
floor
## [1] 5
## pipeline with print statement
x %>% print %>%
round %>% print %>%
mean %>% print %>%
floor
## [1] 1.352935 8.613906 2.634372 8.190679 6.350486
## [1] 1 9 3 8 6
## [1] 5.4
## [1] 5
Main dplyr verbs
filter
- Similar to SQL ‘WHERE’ clause
- Specifies which rows to return
df1 <- presto("select * from core_data.fct_reviews_of_host limit 50")
df1 %>% filter(dim_overall_rating > 3)
select
- Similar to SQL ‘SELECT’ statement
- Which columns/fields to return
- Unlike SQL, data aggregations will not be done in this statement
df1 %>% filter(dim_overall_rating > 3) %>%
select(id_review, dim_overall_rating, ds)
Tips on variable selection
- Check ?select for great examples on usage
- starts_with(), ends_with(), and contains() can help with inexact searches
df1 %>% select(contains('rating'))
group_by
- Useful for aggregation by groups
- Allows for partitioned group operations
df1 %>% group_by(dim_rating)
mutate
- Similar to SQL ‘OVER’ statement
- When combined with ‘GROUP BY’, similar to ‘PARTITION BY’ combination
df1 %>% mutate(running_total = cumsum(dim_accuracy_rating))
summarise
- Similar to SQL ‘GROUP BY’ aggregations
- Should return one value per operation
df1 %>% group_by(dim_overall_rating) %>%
summarise(m_count = n())
gather
- Reshapes data from ‘wide’ to ‘long’
- Arguments: data key value columns to gather
Data is not yet tidy. The type of rating should be its own field. The rating itself should be its own field.
gather(df1, rating_type, rating_value, -id_review)
-Two more ways to specify the table - gather(df1, rating_type, rating_value, m_accuracy_rating, m_overall_rating, m_cleanliness_rating) - gather(df1, rating_type, rating_value, m_accuracy_rating:m_cleanliness_rating)