
knitr::opts_chunk$set(echo = TRUE)

Instructions

The Ratings_and_Sentiments.csv dataset contains over 7,600 reviews of 66 coffee shops in Austin, Texas. The data was initially scraped, munged, and prepped by Rachel Downs (http://www.racheldowns.co), a Marketing and MIS student at UT Austin.

Your challenge is to further prepare the data (easy) and create summaries and charts answering various questions about the data (also easy).

To complete this assignment, follow these steps:

  1. Download the Austin_Coffee_Sentiment.RMD file from the course website.

  2. Open Austin_Coffee_Sentiment.RMD in RStudio.

  3. Replace the “Your Name Here” text in the author: field with your own name.

  4. Supply your solutions to the project by editing Austin_Coffee_Sentiment.Rmd.

  5. When you have completed the homework and have checked that your code both runs in the Console and knits correctly when you click Knit HTML, rename the R Markdown file to Austin_Coffee_Sentiment_YourNameHere.Rmd, and submit it.

Tip: Note that each of the code blocks in this problem contains the expression eval = FALSE. This tells R Markdown to display the code contained in the block, but not to evaluate it. You’ll need to change this to eval = TRUE before you knit.

Step 0. Load Libraries

Load the following libraries; you may need to install them first!

  • tidyverse
  • lubridate: a new library that makes dealing with dates easy
  • stringr: a new library that makes dealing with strings easy
  • ggplot2
  • janitor: clean up column names with clean_names()

Don’t forget to change eval=FALSE to eval=TRUE.

# load libraries here 
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(stringr)

Step 1. Stage

Read the data with read_csv(). You’ll see 4 parsing failures in the warnings; just ignore them.

  1. Create a new data frame called “yelp” using read_csv() to read in the ratings_and_sentiments.csv data file.

yelp <- read_csv("../data/ratings_and_sentiments.csv")

  2. Use head() to print the first 5 records.
yelp <- read_csv("ratings_and_sentiments.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   coffee_shop_name = col_character(),
##   review_text = col_character(),
##   rating = col_character(),
##   cat_rating = col_character(),
##   coffee_sent = col_character()
## )
## i Use `spec()` for the full column specifications.
## Warning: 4 parsing failures.
##  row          col expected  actual                         file
## 1099 vibe_sent    a double #VALUE! 'ratings_and_sentiments.csv'
## 2006 vibe_sent    a double #VALUE! 'ratings_and_sentiments.csv'
## 6833 food_sent    a double #VALUE! 'ratings_and_sentiments.csv'
## 7446 parking_sent a double #VALUE! 'ratings_and_sentiments.csv'
# head(5) on its own just returns the number 5; pass the data frame to see its first 5 rows
yelp %>% head(5)
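
The four parsing failures above come from the spreadsheet error string #VALUE! landing in numeric columns. Ignoring the warning is fine for this assignment, but one way to handle it explicitly is to tell read_csv() to treat that string as missing. The sketch below is only an illustration: it writes a small made-up file to a temporary path, and the rows in it are not taken from the real data.

```r
library(readr)

# Hypothetical three-row file standing in for ratings_and_sentiments.csv
tmp <- tempfile(fileext = ".csv")
writeLines(c("coffee_shop_name,vibe_sent",
             "Shop A,1",
             "Shop B,#VALUE!",
             "Shop C,0"), tmp)

# Treat the spreadsheet error string as NA so vibe_sent parses as a number
toy <- read_csv(tmp, na = c("", "NA", "#VALUE!"))

toy$vibe_sent              # 1 NA 0
sum(is.na(toy$vibe_sent))  # 1
```

With the real file, the same na argument would presumably turn the four flagged cells into NA without a parsing warning, though that has not been checked here.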

Step 2. Structure & Transform

Use the pipe (%>%) and stringr functions to create a “yelp_prep” dataset by performing the following:

  1. Filter out any rows where coffee_shop_name is missing (NA): filter(!is.na(coffee_shop_name))

  2. Create a new date variable (review_date) based on the date contained in the review_text field.

  • You’ll note the review date is the first item in that field, so you’ll need to parse it.
  • If you’ve loaded stringr, the word() function will grab the first “word” in a sentence.
  • Or you can take the hard route with something like this: sub("\\s.*", "", review_text)
  • To convert review_date to a Date data type you’ll need to use another function.
    • lubridate has a function called mdy() (month, day, year) that converts a string into a date; use that!
  • Make sure your new field is a Date data type.

Hint: mutate(review_date = mdy(word(review_text)))
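
To see what the hint does step by step, here is a minimal sketch on two made-up review strings that follow the same “date first” layout. The strings and the object names (reviews, first_word, review_date) are assumptions for illustration only.

```r
library(stringr)
library(lubridate)

# Made-up review strings; the date is the first "word" in each
reviews <- c("11/25/2016 Loved the latte art", "12/2/2016 Parking was tight")

first_word  <- word(reviews)     # word() returns the first word by default
review_date <- mdy(first_word)   # character -> Date (month/day/year order)

# Base R alternative: drop everything from the first whitespace onward
first_word_base <- sub("\\s.*", "", reviews)

first_word                       # "11/25/2016" "12/2/2016"
class(review_date)               # "Date"
format(review_date, "%Y-%m-%d")  # "2016-11-25" "2016-12-02"
```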

  3. Create a flag variable “starbucks_flag”: if the review mentions “starbucks”, regardless of case, set the flag to “yes”, else default it to “no”.

Hint: use mutate() to create starbucks_flag with if_else(condition, "yes", "no"); there should be at least 320 mentions of starbucks. See https://stringr.tidyverse.org/reference/index.html and https://stringr.tidyverse.org/reference/str_detect.html

Hint: stringr and str_detect(): how do you deal with mixed case, hmm? https://stringr.tidyverse.org/reference/case.html
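
Two common ways to make str_detect() ignore case are sketched below on made-up review text: lower-case the text first, or wrap the pattern in regex(..., ignore_case = TRUE). The toy data and the column names flag_lower and flag_regex are assumptions for illustration.

```r
library(dplyr)
library(stringr)

# Made-up reviews with mixed capitalisation
toy <- tibble(review_text = c("Better than Starbucks",
                              "STARBUCKS is closer",
                              "Great espresso"))

toy <- toy %>%
  mutate(
    # Option 1: normalise the text, then match a lower-case pattern
    flag_lower = if_else(str_detect(str_to_lower(review_text), "starbucks"), "yes", "no"),
    # Option 2: leave the text alone and make the pattern case-insensitive
    flag_regex = if_else(str_detect(review_text, regex("starbucks", ignore_case = TRUE)), "yes", "no")
  )

toy$flag_lower  # "yes" "yes" "no"
toy$flag_regex  # "yes" "yes" "no"
```

Option 2 keeps the original capitalisation of review_text intact, which can matter if the text is displayed later.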

  4. Create a flag variable “good_flag”: if the review mentions the word “good”, regardless of case, set the flag to “yes”, else default it to “no”. There should be around 3077 references to good. Same hints as for starbucks_flag.

  5. Create a flag variable “great_flag”: if the review mentions the word “great”, regardless of case, set the flag to “yes”, else default it to “no”. There should be around 2870 references to great. Same hints as for starbucks_flag.

  6. Convert coffee_sent to numeric using as.numeric(coffee_sent)

  7. List the first 10 records to make sure your code works. Hint: use head(10) at the end of the pipe.
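
Converting coffee_sent with as.numeric() raises an “NAs introduced by coercion” warning whenever a value is not a number. A minimal sketch with made-up values shows what happens; the string #VALUE! mirrors the spreadsheet error reported in the parsing failures, and suppressWarnings() is used here only to keep the example output quiet.

```r
# Made-up character values, one of which is not a number
raw_sent <- c("0", "1", "#VALUE!", "-1")

# Non-numeric strings become NA; everything else converts as expected
coffee_sent_num <- suppressWarnings(as.numeric(raw_sent))

coffee_sent_num              # 0  1 NA -1
sum(is.na(coffee_sent_num))  # 1
```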

Helper code:

yelp_prep <- yelp %>%
  filter(!is.na(coffee_shop_name)) %>%
  mutate(review_date = mdy(word(review_text))) %>%
  mutate(review_text = str_to_lower(review_text)) %>%
  mutate(starbucks_flag = if_else(str_detect(review_text, "starbucks"), TRUE, FALSE)) %>%
  mutate(good_flag = if_else(str_detect(str_to_upper(review_text), "GOOD"), "Yes", "no")) %>%
  mutate(great_flag = if_else(str_detect(str_to_upper(review_text), "GREAT"), "Yes", "no")) %>%
  mutate(coffee_sent = as.numeric(coffee_sent))

yelp_prep %>% head(5)

yelp_prep %>% group_by(starbucks_flag) %>% summarise(n=n()) %>% arrange(desc(n))

yelp_prep <- yelp %>% 
  filter(!is.na(coffee_shop_name)) %>%
  mutate(review_date = mdy(word(review_text))) %>%
  mutate(review_text = str_to_lower(review_text)) %>%
  mutate(starbucks_flag = if_else(str_detect(review_text, "starbucks"), TRUE, FALSE)) %>%
  mutate(good_flag = if_else(str_detect(str_to_upper(review_text), "GOOD"), "Yes", "no")) %>%
  mutate(great_flag = if_else(str_detect(str_to_upper(review_text), "GREAT"), "Yes", "no")) %>%
  mutate(coffee_sent = as.numeric(coffee_sent))
## Warning: Problem with `mutate()` input `coffee_sent`.
## i NAs introduced by coercion
## i Input `coffee_sent` is `as.numeric(coffee_sent)`.
## Warning in mask$eval_all_mutate(dots[[i]]): NAs introduced by coercion
yelp_prep %>%
  head(10)
## # A tibble: 10 x 24
##    coffee_shop_name review_text rating num_rating cat_rating bool_HIGH
##    <chr>            <chr>       <chr>       <dbl> <chr>          <dbl>
##  1 The Factory - C~ "11/25/201~ 5.0 s~          5 HIGH               1
##  2 The Factory - C~ "12/2/2016~ 4.0 s~          4 HIGH               1
##  3 The Factory - C~ "11/30/201~ 4.0 s~          4 HIGH               1
##  4 The Factory - C~ "11/25/201~ 2.0 s~          2 LOW                0
##  5 The Factory - C~ "12/3/2016~ 4.0 s~          4 HIGH               1
##  6 The Factory - C~ "11/20/201~ 4.0 s~          4 HIGH               1
##  7 The Factory - C~ "10/27/201~ 4.0 s~          4 HIGH               1
##  8 The Factory - C~ "11/2/2016~ 5.0 s~          5 HIGH               1
##  9 The Factory - C~ "10/25/201~ 3.0 s~          3 LOW                0
## 10 The Factory - C~ "11/10/201~ 5.0 s~          5 HIGH               1
## # ... with 18 more variables: overall_sent <dbl>, vibe_sent <dbl>,
## #   tea_sent <dbl>, service_sent <dbl>, seating_sent <dbl>, price_sent <dbl>,
## #   parking_sent <dbl>, location_sent <dbl>, alcohol_sent <dbl>,
## #   coffee_sent <dbl>, food_sent <dbl>, hours_sent <dbl>, internet_sent <dbl>,
## #   local_sent <dbl>, review_date <date>, starbucks_flag <lgl>,
## #   good_flag <chr>, great_flag <chr>
yelp_prep %>%
  group_by(starbucks_flag) %>%
  summarise(n=n()) %>% arrange(desc(n))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   starbucks_flag     n
##   <lgl>          <int>
## 1 FALSE           7296
## 2 TRUE             320

Step 3. Frequency Analysis & Graphs

Create 5 bar charts for the following variables. Set the fill/color by the indicated field and use a facet wrap where one is listed. What, if anything, do the graphs tell you about sentiment and ratings?

Here is a template to follow:

yelp_prep %>% ggplot(aes(x=numeric_variable, fill=fill_variable)) + geom_bar() + facet_wrap(~facet_variable)

Variable   | Fill/Color | Facet Wrap
num_rating | great_flag | starbucks_flag
yelp_prep %>%
  ggplot(aes(x=num_rating, fill=great_flag)) +
  geom_bar() + 
  facet_wrap(~starbucks_flag)

Variable     | Fill/Color     | Facet Wrap
overall_sent | starbucks_flag | N/A
yelp_prep %>%
  ggplot(aes(x=overall_sent, fill=starbucks_flag)) +
  geom_bar()  # no facet variable is listed for this chart, so facet_wrap() is left out

Variable     | Fill/Color | Facet Wrap
service_sent | great_flag | N/A
yelp_prep %>%
  ggplot(aes(x=service_sent, fill=great_flag)) +
  geom_bar() 

Variable    | Fill/Color | Facet Wrap
coffee_sent | N/A        | starbucks_flag
yelp_prep %>%
  ggplot(aes(x=coffee_sent)) +  # no fill variable is listed for this chart, so the fill aesthetic is left out
  geom_bar() + 
  facet_wrap(~starbucks_flag)
## Warning: Removed 1 rows containing non-finite values (stat_count).

Variable  | Fill/Color | Facet Wrap
food_sent | good_flag  | N/A
yelp_prep %>%
  ggplot(aes(x=food_sent, fill=good_flag)) +
  geom_bar()  # no facet variable is listed for this chart, so facet_wrap() is left out
## Warning: Removed 1 rows containing non-finite values (stat_count).

Step 4. More Interesting Analysis: Does Day of Week Matter?

Create a new data set called sentiment_summary

  1. Create a new variable day_of_week by applying wday() to review_date, like this: wday(review_date, label = TRUE, abbr = FALSE)
  2. Group by day_of_week
  3. Use summarize() to create a count, a mean of overall_sent, and a mean of coffee_sent; remember to remove NA values by passing na.rm = TRUE to mean()

Also, if you forgot to convert coffee_sent to numeric, now is a good time to go back and do that : )

Print out the result with sentiment_summary %>% print()
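
Where na.rm = TRUE goes matters: it is an argument of mean(), not of summarise(). The sketch below shows the grouping and summary steps on a tiny made-up data set (four invented reviews, two on a Tuesday and two on a Friday, one with a missing coffee_sent). The names toy and toy_summary are assumptions for illustration.

```r
library(dplyr)
library(lubridate)

# Made-up reviews: two Fridays (2016-11-25, 2016-12-02), two Tuesdays (2016-11-29, 2016-12-06)
toy <- tibble(
  review_date  = as.Date(c("2016-11-25", "2016-12-02", "2016-11-29", "2016-12-06")),
  overall_sent = c(1, 3, 2, 0),
  coffee_sent  = c(1, 0, NA, 1)
)

toy_summary <- toy %>%
  mutate(day_of_week = wday(review_date, label = TRUE, abbr = FALSE)) %>%
  group_by(day_of_week) %>%
  summarise(
    count_n           = n(),
    mean_overall_sent = mean(overall_sent, na.rm = TRUE),
    mean_coffee_sent  = mean(coffee_sent, na.rm = TRUE)  # NA dropped before averaging
  )

toy_summary %>% print()
```

Because na.rm = TRUE sits inside mean(), the Tuesday group averages only its one non-missing coffee_sent value instead of returning NA. Written as summarise(..., na.rm = TRUE), the argument would instead become a new column called na.rm.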

sentiment_summary_v1 <- yelp %>% 
  filter(!is.na(coffee_shop_name)) %>%
  mutate(review_date = mdy(word(review_text))) %>%
  mutate(review_text = str_to_lower(review_text)) %>%
  mutate(starbucks_flag = if_else(str_detect(review_text, "starbucks"), TRUE, FALSE)) %>%
  mutate(good_flag = if_else(str_detect(str_to_upper(review_text), "GOOD"), "Yes", "no")) %>%
  mutate(great_flag = if_else(str_detect(str_to_upper(review_text), "GREAT"), "Yes", "no")) %>%
  mutate(day_of_week = wday(review_date, label = TRUE, abbr = FALSE)) %>%
  mutate(coffee_sent = as.numeric(coffee_sent))
## Warning: Problem with `mutate()` input `coffee_sent`.
## i NAs introduced by coercion
## i Input `coffee_sent` is `as.numeric(coffee_sent)`.
## Warning in mask$eval_all_mutate(dots[[i]]): NAs introduced by coercion
sentiment_summary <- sentiment_summary_v1 %>%  
  group_by(day_of_week) %>%
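  # na.rm = TRUE goes inside each mean(); passed to summarise() directly, as in the run printed below, it only adds a column named na.rm and leaves the Tuesday coffee mean as NA
  summarise(count_n=n(), mean_overall_sent = mean(overall_sent, na.rm = TRUE), mean_coffee_sent = mean(coffee_sent, na.rm = TRUE))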
## `summarise()` ungrouping output (override with `.groups` argument)
sentiment_summary %>% print()
## # A tibble: 7 x 5
##   day_of_week count_n mean_overall_sent mean_coffee_sent na.rm
##   <ord>         <int>             <dbl>            <dbl> <lgl>
## 1 Sunday         1279              1.07            0.472 TRUE 
## 2 Monday         1146              1.13            0.551 TRUE 
## 3 Tuesday         983              1.14           NA     TRUE 
## 4 Wednesday      1092              1.09            0.441 TRUE 
## 5 Thursday       1011              1.14            0.513 TRUE 
## 6 Friday          951              1.10            0.545 TRUE 
## 7 Saturday       1154              1.09            0.525 TRUE

Step 5. Create the Following Bar Charts using stat="identity"

Create three bar charts to answer the following questions:

  1. What is the most/least reviewed day of the week?

  2. On what day are you most likely to get the highest/lowest mean overall sentiment?

  3. On what day are you most likely to get the highest/lowest mean coffee sentiment?

For example:

sentiment_summary %>%
  ggplot(aes(x=day_of_week, y=count_n)) +
  geom_bar(stat="identity")  # geom_col() has no stat argument; geom_bar(stat="identity") plots the y values as given

sentiment_summary %>%
  ggplot(aes(x=day_of_week, y=mean_overall_sent)) +
  geom_bar(stat="identity")  # geom_col() has no stat argument; geom_bar(stat="identity") plots the y values as given

sentiment_summary %>%
  ggplot(aes(x=day_of_week, y=mean_coffee_sent)) +
  geom_bar(stat="identity")  # geom_col() has no stat argument; geom_bar(stat="identity") plots the y values as given
## Warning: Removed 1 rows containing missing values (position_stack).
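
The “Ignoring unknown parameters: stat” warning appears because geom_col() already uses the identity stat and does not accept a stat argument. The sketch below, on a made-up three-day summary, shows that geom_bar(stat = "identity") and plain geom_col() draw the same bar heights. The data frame toy_counts and the plot names are assumptions for illustration.

```r
library(ggplot2)

# Made-up pre-summarised counts, one row per day
toy_counts <- data.frame(
  day_of_week = factor(c("Mon", "Tue", "Wed"), levels = c("Mon", "Tue", "Wed")),
  count_n     = c(12, 7, 9)
)

# Option 1: geom_bar() told to use the y values as they are
p_bar <- ggplot(toy_counts, aes(x = day_of_week, y = count_n)) +
  geom_bar(stat = "identity")

# Option 2: geom_col(), which uses the identity stat without being asked
p_col <- ggplot(toy_counts, aes(x = day_of_week, y = count_n)) +
  geom_col()

# The computed bar heights are the same in both plots
layer_data(p_bar)$y
layer_data(p_col)$y
```

Either form works for the three charts in this step; the only thing to avoid is combining geom_col() with a stat argument.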

Finally

Wrap this up in an R notebook and knit it to HTML.