Overview

Learning Outcomes

At the end of this course students should know how to:

Reading Data into R

The Rbnb package is a collection of functions, useful packages, and design aesthetics created and maintained by Airbnb developers. The main function for reading in data is presto. It facilitates the querying of data from Airbnb data sources.

library(Rbnb)
test <- presto('select * from core_data_anon.fct_bookings limit 1')

Helpful tip: - Cached data will be loaded if query is ____ days/hours old - To avoid caching use argument use_cache=FALSE

Tidy Data

How to use tidy data principles? Why are tidy data principles important?

What is tidy data?

Examples:

–Replace examples with Abnb data

data1
#>           name treatmenta treatmentb
#> 1   John Smith         NA         18
#> 2     Jane Doe          4          1
#> 3 Mary Johnson          6          7

Reshaped to

data2
#>           name treatment  n
#> 1     Jane Doe         a  4
#> 2     Jane Doe         b  1
#> 3   John Smith         a NA
#> 4   John Smith         b 18
#> 5 Mary Johnson         a  6
#> 6 Mary Johnson         b  7

There are many times when ‘long’ data will be needed instead of ‘wide’.

How to identify ‘wide’ data:

Pipe Operator

# generate 5 random numbers ranging from 1 to 10
x <- runif(5, 1, 10)

# print my random sequence
x
## [1] 1.352935 8.613906 2.634372 8.190679 6.350486
## function
## what does it do?
floor(mean(round(x)))
## [1] 5
## pipeline without print statement
x %>% 
  round %>% 
  mean %>% 
  floor
## [1] 5
## pipeline with print statement
x %>% print %>%
  round %>% print %>%
  mean %>% print %>%
  floor
## [1] 1.352935 8.613906 2.634372 8.190679 6.350486
## [1] 1 9 3 8 6
## [1] 5.4
## [1] 5

Main dplyr verbs

filter

  • Similar to SQL ‘WHERE’ clause
  • Specifies which rows to return
df1 <- presto("select * from core_data.fct_reviews_of_host limit 50")
df1 %>% filter(dim_overall_rating > 3)

select

  • Similar to SQL ‘SELECT’ statement
  • Which columns/fields to return
  • Unlike SQL, data aggregations will not be done in this statement
df1 %>% filter(dim_overall_rating > 3) %>% 
  select(id_review, dim_overall_rating, ds)

Tips on variable selection

  • Check ?select for great examples on usage
  • starts_with(), ends_with(), and contains() can help with inexact searches
df1 %>% select(contains('rating'))

group_by

  • Useful for aggregation by groups
  • Allows for partitioned group operations
df1 %>% group_by(dim_rating)

mutate

  • Similar to SQL ‘OVER’ statement
  • When combined with ‘GROUP BY’, similar to ‘PARTITION BY’ combination
df1 %>% mutate(running_total = cumsum(dim_accuracy_rating))

summarise

  • Similar to SQL ‘GROUP BY’ aggregations
  • Should return one value per operation
df1 %>% group_by(dim_overall_rating) %>%
  summarise(m_count = n())

gather

  • Reshapes data from ‘wide’ to ‘long’
  • Arguments: data key value columns to gather

Data is not yet tidy. The type of rating should be its own field. The rating itself should be its own field.

gather(df1, rating_type, rating_value, -id_review)

-Two more ways to specify the table - gather(df1, rating_type, rating_value, m_accuracy_rating, m_overall_rating, m_cleanliness_rating) - gather(df1, rating_type, rating_value, m_accuracy_rating:m_cleanliness_rating)