The Ratings_and_Sentiments.csv dataset contains over 7,600 reviews of 66 coffee shops in Austin, Texas. The data was initially scraped, munged, and prepped by Rachel Downs (http://www.racheldowns.co), a Marketing and MIS student at UT Austin.
Your challenge is to further prepare the data (easy) and to create summaries and charts that answer various questions about the data (also easy).
To complete this assignment, follow these steps:
Download the Austin_Coffee_Sentiment.RMD file from the course website.
Open Austin_Coffee_Sentiment.RMD in RStudio.
Replace the “Your Name Here” text in the author: field with your own name.
Supply your solutions to the project by editing Austin_Coffee_Sentiment.Rmd.
When you have completed the homework and have checked that your code both runs in the Console and knits correctly when you click Knit HTML, rename the R Markdown file to Austin_Coffee_Sentiment_YourNameHere.Rmd, and submit it.
Tip: Each of the code blocks in this problem contains the expression eval = FALSE. This tells R Markdown to display the code contained in the block, but not to evaluate it. You’ll need to change this to eval = TRUE before you knit.
Load the following libraries; you may need to install them first!
- tidyverse
- lubridate # a new library that makes dealing with dates easy
- stringr # a new library that makes dealing with strings easy
- ggplot2
- janitor # clean up column names with clean_names()
Don’t forget to change eval=FALSE to eval=TRUE.
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
Read the data with read_csv(). You’ll see 4 parsing failures; just ignore them.
yelp <- read_csv("../data/ratings_and_sentiments.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## coffee_shop_name = col_character(),
## review_text = col_character(),
## rating = col_character(),
## cat_rating = col_character(),
## coffee_sent = col_character()
## )
## i Use `spec()` for the full column specifications.
## Warning: 4 parsing failures.
## row col expected actual file
## 1099 vibe_sent a double #VALUE! 'ratings_and_sentiments.csv'
## 2006 vibe_sent a double #VALUE! 'ratings_and_sentiments.csv'
## 6833 food_sent a double #VALUE! 'ratings_and_sentiments.csv'
## 7446 parking_sent a double #VALUE! 'ratings_and_sentiments.csv'
## [1] 5
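The four parsing failures above come from the spreadsheet error marker `#VALUE!` appearing in columns that are otherwise numeric. As an optional alternative to ignoring the warnings, the sketch below uses the `na` argument of read_csv() to read that marker as a missing value. The CSV text is a made-up stand-in for the real file (only the column names are borrowed from the dataset), and the assignment itself does not require this step.

```r
library(readr)

# Toy CSV text standing in for the real file. The values are invented for
# illustration; a string containing newlines is read as literal data.
csv_text <- "coffee_shop_name,vibe_sent,coffee_sent
Shop A,1,0
Shop B,#VALUE!,2
Shop C,0,1
"

# Treat "#VALUE!" as missing, in addition to the default "" and "NA".
toy <- read_csv(csv_text, na = c("", "NA", "#VALUE!"))

# Count the missing values in the column that contained the marker.
sum(is.na(toy$vibe_sent))
```

With the marker declared as missing, the affected column can be guessed as numeric instead of triggering a parsing failure.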
Use the pipe (%>%) and stringr functions to create a “yelp_prep” dataset by performing the following:
Filter out any rows where coffee_shop_name is missing (NA): filter(!is.na(coffee_shop_name))
Create a new date variable (review_date) based on the date contained in the review_text field.
Hint: mutate(review_date = mdy(word(review_text)))
Hint: use mutate() to create starbucks_flag with if_else(condition, 1, 0). There should be at least 320 mentions of Starbucks. See https://stringr.tidyverse.org/reference/index.html and https://stringr.tidyverse.org/reference/str_detect.html
Hint: use stringr and str_detect(). How do you deal with mixed case? See https://stringr.tidyverse.org/reference/case.html
Create a flag variable “good_flag”: if the review mentions the word “good”, regardless of case, set the flag to “yes”; otherwise default it to “no”. There should be around 3077 references to good. Same hints as 2c.
Create a flag variable “great_flag”: if the review mentions the word “great”, regardless of case, set the flag to “yes”; otherwise default it to “no”. There should be around 2870 references to great. Same hints as 2c.
Convert coffee_sent to numeric using as.numeric(coffee_sent)
List the first 10 records to make sure your code works. Hint: use head(,10)
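The preparation steps listed above can be tried on a small made-up example before running them on the full dataset. The sketch below assumes, as the hints suggest, that each review_text starts with a date in month/day/year form. The toy reviews and the `toy` and `toy_prep` names are hypothetical. It shows two ways to handle mixed case: lower-casing the text first, or passing regex(..., ignore_case = TRUE) to str_detect().

```r
library(dplyr)
library(stringr)
library(lubridate)

# Made-up reviews; each starts with a date, mimicking the review_text field.
toy <- tibble(
  review_text = c(
    "11/25/2016 Better than Starbucks, and the pastries are GOOD",
    "12/2/2016 Great espresso, friendly staff",
    "10/27/2016 Too loud to work in"
  ),
  coffee_sent = c("1", "#VALUE!", "0")
)

toy_prep <- toy %>%
  mutate(
    # word() returns the first space-separated token; mdy() parses it as a date.
    review_date = mdy(word(review_text)),
    # Option 1 for mixed case: lower-case the text before matching.
    starbucks_flag = if_else(str_detect(str_to_lower(review_text), "starbucks"), 1, 0),
    # Option 2 for mixed case: a case-insensitive pattern.
    good_flag = if_else(str_detect(review_text, regex("good", ignore_case = TRUE)), "yes", "no"),
    # Non-numeric strings become NA; the coercion warning is silenced here.
    coffee_sent = suppressWarnings(as.numeric(coffee_sent))
  )

toy_prep %>% head(10)
```

Note that str_detect() matches substrings, so a pattern such as "good" also matches words like "goodness"; the counts quoted in the assignment presumably reflect that behavior.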
Helper code:
yelp_prep <- yelp %>%
  # drop rows with no shop name
  filter(!is.na(coffee_shop_name)) %>%
  # the first word of review_text is the review date
  mutate(review_date = mdy(word(review_text))) %>%
  mutate(review_text = str_to_lower(review_text)) %>%
  mutate(starbucks_flag = if_else(str_detect(review_text, "starbucks"), TRUE, FALSE)) %>%
  mutate(good_flag = if_else(str_detect(str_to_upper(review_text), "GOOD"), "Yes", "no")) %>%
  mutate(great_flag = if_else(str_detect(str_to_upper(review_text), "GREAT"), "Yes", "no")) %>%
  mutate(coffee_sent = as.numeric(coffee_sent))
yelp_prep %>% head(10)
yelp_prep %>% group_by(starbucks_flag) %>% summarise(n = n()) %>% arrange(desc(n))
## Warning: Problem with `mutate()` input `coffee_sent`.
## i NAs introduced by coercion
## i Input `coffee_sent` is `as.numeric(coffee_sent)`.
## Warning in mask$eval_all_mutate(dots[[i]]): NAs introduced by coercion
## # A tibble: 10 x 24
## coffee_shop_name review_text rating num_rating cat_rating bool_HIGH
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 The Factory - C~ "11/25/201~ 5.0 s~ 5 HIGH 1
## 2 The Factory - C~ "12/2/2016~ 4.0 s~ 4 HIGH 1
## 3 The Factory - C~ "11/30/201~ 4.0 s~ 4 HIGH 1
## 4 The Factory - C~ "11/25/201~ 2.0 s~ 2 LOW 0
## 5 The Factory - C~ "12/3/2016~ 4.0 s~ 4 HIGH 1
## 6 The Factory - C~ "11/20/201~ 4.0 s~ 4 HIGH 1
## 7 The Factory - C~ "10/27/201~ 4.0 s~ 4 HIGH 1
## 8 The Factory - C~ "11/2/2016~ 5.0 s~ 5 HIGH 1
## 9 The Factory - C~ "10/25/201~ 3.0 s~ 3 LOW 0
## 10 The Factory - C~ "11/10/201~ 5.0 s~ 5 HIGH 1
## # ... with 18 more variables: overall_sent <dbl>, vibe_sent <dbl>,
## # tea_sent <dbl>, service_sent <dbl>, seating_sent <dbl>, price_sent <dbl>,
## # parking_sent <dbl>, location_sent <dbl>, alcohol_sent <dbl>,
## # coffee_sent <dbl>, food_sent <dbl>, hours_sent <dbl>, internet_sent <dbl>,
## # local_sent <dbl>, review_date <date>, starbucks_flag <lgl>,
## # good_flag <chr>, great_flag <chr>
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## starbucks_flag n
## <lgl> <int>
## 1 FALSE 7296
## 2 TRUE 320
Create 5 bar charts for the following variables. Set the fill/color by the indicated field and use a facet wrap where one is listed. What, if anything, do the graphs tell you about sentiment and ratings?
Here is a template to follow:
yelp_prep %>% ggplot(aes(x = numeric_variable, fill = fill_variable)) + geom_bar() + facet_wrap(~facet_variable)
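To see how the three parts of the template fit together, here is a minimal sketch on a tiny made-up data frame. The column names are borrowed from the assignment, but the values are invented, so the resulting chart says nothing about the real data.

```r
library(ggplot2)

# Invented values; only the column names mirror the assignment.
toy <- data.frame(
  num_rating = c(5, 4, 4, 2, 5, 3),
  great_flag = c("yes", "no", "yes", "no", "yes", "no"),
  starbucks_flag = c(0, 0, 1, 0, 1, 0)
)

# geom_bar() counts the rows at each x value, fill splits each bar by flag,
# and facet_wrap() draws one panel per value of starbucks_flag.
p <- ggplot(toy, aes(x = num_rating, fill = great_flag)) +
  geom_bar() +
  facet_wrap(~starbucks_flag)

p
```

When a row of the tables below lists no fill or no facet, the corresponding piece (the fill aesthetic or the facet_wrap() layer) can simply be left out.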
| Variable | Fill/Color | Facet Wrap |
|---|---|---|
| num_rating | great_flag | starbucks_flag |
| Variable | Fill/Color | Facet Wrap |
|---|---|---|
| overall_sent | starbucks_flag | N/A |
| Variable | Fill/Color | Facet Wrap |
|---|---|---|
| service_sent | great_flag | |
| Variable | Fill/Color | Facet Wrap |
|---|---|---|
| coffee_sent | N/A | starbucks_flag |
## Warning: Removed 1 rows containing non-finite values (stat_count).
| Variable | Fill/Color | Facet Wrap |
|---|---|---|
| food_sent | good_flag | N/A |
## Warning: Removed 1 rows containing non-finite values (stat_count).
Create a new data set called sentiment_summary.
Also, if you forgot to convert coffee_sent to numeric, now is a good time to go back and do that : )
Print out sentiment_summary: sentiment_summary %>% print()
sentiment_summary_v1 <- yelp %>%
filter(!is.na(coffee_shop_name)) %>%
mutate(review_date = mdy(word(review_text))) %>%
mutate(review_text = str_to_lower(review_text)) %>%
mutate(starbucks_flag = if_else(str_detect(review_text, "starbucks"), TRUE, FALSE)) %>%
mutate(good_flag = if_else(str_detect(str_to_upper(review_text), "GOOD"), "Yes", "no")) %>%
mutate(great_flag = if_else(str_detect(str_to_upper(review_text), "GREAT"), "Yes", "no")) %>%
mutate(day_of_week = wday(review_date, label = TRUE, abbr = FALSE)) %>%
mutate(coffee_sent = as.numeric(coffee_sent))
## Warning: Problem with `mutate()` input `coffee_sent`.
## i NAs introduced by coercion
## i Input `coffee_sent` is `as.numeric(coffee_sent)`.
## Warning in mask$eval_all_mutate(dots[[i]]): NAs introduced by coercion
sentiment_summary <- sentiment_summary_v1 %>%
group_by(day_of_week) %>%
summarise(count_n = n(), mean_overall_sent = mean(overall_sent), mean_coffee_sent = mean(coffee_sent), na.rm = TRUE)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 5
## day_of_week count_n mean_overall_sent mean_coffee_sent na.rm
## <ord> <int> <dbl> <dbl> <lgl>
## 1 Sunday 1279 1.07 0.472 TRUE
## 2 Monday 1146 1.13 0.551 TRUE
## 3 Tuesday 983 1.14 NA TRUE
## 4 Wednesday 1092 1.09 0.441 TRUE
## 5 Thursday 1011 1.14 0.513 TRUE
## 6 Friday 951 1.10 0.545 TRUE
## 7 Saturday 1154 1.09 0.525 TRUE
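In the output above, `na.rm` shows up as its own column and Tuesday's mean_coffee_sent is NA. That is what happens when `na.rm = TRUE` is passed to summarise() rather than to mean(): summarise() treats it as a new column, and mean() still returns NA for any group containing a missing value. The sketch below, on made-up data, shows the argument placed inside mean(); the `toy` and `toy_summary` names are hypothetical.

```r
library(dplyr)

# Made-up data with one missing coffee_sent value on Tuesday.
toy <- tibble(
  day_of_week = c("Monday", "Monday", "Tuesday", "Tuesday"),
  coffee_sent = c(1, 0, 2, NA)
)

toy_summary <- toy %>%
  group_by(day_of_week) %>%
  summarise(
    count_n = n(),
    # na.rm belongs to mean(), so missing values are dropped per group.
    mean_coffee_sent = mean(coffee_sent, na.rm = TRUE)
  )

toy_summary
```

Note that count_n still counts every row, including the one whose coffee_sent is missing.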
Create three bar charts to answer the following questions:
What is the most/least reviewed day of the week?
On what day are you most likely to get the highest/lowest mean overall sentiment?
On what day are you most likely to get the highest/lowest mean coffee sentiment?
For example:
## Warning: Ignoring unknown parameters: stat
## Warning: Ignoring unknown parameters: stat
## Warning: Ignoring unknown parameters: stat
## Warning: Removed 1 rows containing missing values (position_stack).
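For charts built from an already-summarised table, geom_col() plots the values as given, one bar per row. The sketch below uses a small made-up summary table whose counts are invented. One likely source of the "Ignoring unknown parameters: stat" warning above is passing `stat = "identity"` to geom_col(), which does not need it; that is an assumption about the original code, which is not shown.

```r
library(ggplot2)

# Invented counts; the factor levels keep the days in calendar order.
toy_summary <- data.frame(
  day_of_week = factor(
    c("Sunday", "Monday", "Tuesday"),
    levels = c("Sunday", "Monday", "Tuesday")
  ),
  count_n = c(12, 9, 7)
)

# geom_col() uses the y values directly, so no stat argument is needed.
p <- ggplot(toy_summary, aes(x = day_of_week, y = count_n)) +
  geom_col()

p
```

Swapping `count_n` for a mean sentiment column gives the other two charts; a row whose mean is NA is dropped with a warning like the last one above.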
Wrap this up in an R notebook and knit it to HTML.