In this lesson students will …
These data originally come from winemag.com and are hosted on Kaggle
https://www.kaggle.com/datasets/zynicide/wine-reviews
The data we are using is a subset of these data only for wines from the United States.
### UNITED STATES WINE
usWine <- read.csv("https://raw.githubusercontent.com/kitadasmalley/DATA252/main/Data/usWine.csv")
str(usWine)
## 'data.frame': 62397 obs. of 11 variables:
## $ X : int 0 2 3 8 9 11 12 14 15 16 ...
## $ country : chr "US" "US" "US" "US" ...
## $ description: chr "This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry f"| __truncated__ "Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious, balanced and com"| __truncated__ "This spent 20 months in 30% new French oak, and incorporates fruit from Ponzi's Aurora, Abetina and Madrona vin"| __truncated__ "This re-named vineyard was formerly bottled as deLancellotti. You'll find striking minerality underscoring chun"| __truncated__ ...
## $ designation: chr "Martha's Vineyard" "Special Selected Late Harvest" "Reserve" "Silice" ...
## $ points : int 96 96 96 95 95 95 95 95 95 95 ...
## $ price : int 235 90 65 65 60 48 48 185 90 325 ...
## $ province : chr "California" "California" "Oregon" "Oregon" ...
## $ region_1 : chr "Napa Valley" "Knights Valley" "Willamette Valley" "Chehalem Mountains" ...
## $ region_2 : chr "Napa" "Sonoma" "Willamette Valley" "Willamette Valley" ...
## $ variety : chr "Cabernet Sauvignon" "Sauvignon Blanc" "Pinot Noir" "Pinot Noir" ...
## $ winery : chr "Heitz" "Macauley" "Ponzi" "Bergström" ...
head(usWine)
## X country
## 1 0 US
## 2 2 US
## 3 3 US
## 4 8 US
## 5 9 US
## 6 11 US
## description
## 1 This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. Balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. Enjoy 2022–2030.
## 2 Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious, balanced and complex botrytised white. Dark gold in color, it layers toasted hazelnut, pear compote and orange peel flavors, reveling in the succulence of its 122 g/L of residual sugar.
## 3 This spent 20 months in 30% new French oak, and incorporates fruit from Ponzi's Aurora, Abetina and Madrona vineyards, among others. Aromatic, dense and toasty, it deftly blends aromas and flavors of toast, cigar box, blackberry, black cherry, coffee and graphite. Tannins are polished to a fine sheen, and frame a finish loaded with dark chocolate and espresso. Drink now through 2032.
## 4 This re-named vineyard was formerly bottled as deLancellotti. You'll find striking minerality underscoring chunky black fruits. Accents of citrus and graphite comingle, with exceptional midpalate concentration. This is a wine to cellar, though it is already quite enjoyable. Drink now through 2030.
## 5 The producer sources from two blocks of the vineyard for this wine—one at a high elevation, which contributes bright acidity. Crunchy cranberry, pomegranate and orange peel flavors surround silky, succulent layers of texture that present as fleshy fruit. That delicately lush flavor has considerable length.
## 6 From 18-year-old vines, this supple well-balanced effort blends flavors of mocha, cherry, vanilla and breakfast tea. Superbly integrated and delicious even at this early stage, this wine seems destined for a long and savory cellar life. Drink now through 2028.
## designation points price province region_1
## 1 Martha's Vineyard 96 235 California Napa Valley
## 2 Special Selected Late Harvest 96 90 California Knights Valley
## 3 Reserve 96 65 Oregon Willamette Valley
## 4 Silice 95 65 Oregon Chehalem Mountains
## 5 Gap's Crown Vineyard 95 60 California Sonoma Coast
## 6 Estate Vineyard Wadensvil Block 95 48 Oregon Ribbon Ridge
## region_2 variety winery
## 1 Napa Cabernet Sauvignon Heitz
## 2 Sonoma Sauvignon Blanc Macauley
## 3 Willamette Valley Pinot Noir Ponzi
## 4 Willamette Valley Pinot Noir Bergström
## 5 Sonoma Pinot Noir Blue Farm
## 6 Willamette Valley Pinot Noir Patricia Green Cellars
Question of Interest: How many different types of wines are made in Oregon?
library(tidyverse)
### OREGON WINE
orWine<-usWine%>%
filter(province=="Oregon")%>%
count(variety)%>%
arrange(desc(n))
dim(orWine)
## [1] 67 2
#head(orWine)
Question of Interest: What are the top ten wines made in Oregon?
orTop<-orWine%>%
mutate(variety2=fct_reorder(variety, n))%>%
slice_max(n, n=10)
orTop
## variety n variety2
## 1 Pinot Noir 2552 Pinot Noir
## 2 Pinot Gris 427 Pinot Gris
## 3 Chardonnay 348 Chardonnay
## 4 Riesling 184 Riesling
## 5 Syrah 146 Syrah
## 6 Pinot Blanc 117 Pinot Blanc
## 7 Cabernet Sauvignon 78 Cabernet Sauvignon
## 8 Viognier 76 Viognier
## 9 Red Blend 74 Red Blend
## 10 Tempranillo 58 Tempranillo
Let’s also turn this into a graphic!
ggplot(orTop, aes(x=variety2, y=n))+
geom_col()+
coord_flip()
Filter for only pinor noir and pinot gris
pinot<-usWine%>%
filter(province=="Oregon")%>%
filter(variety %in% c("Pinot Noir", "Pinot Gris"))
## Density
ggplot(pinot, aes(x=points, fill=variety))+
geom_density(alpha=0.5)
## Boxplot
ggplot(pinot, aes(x=points, fill=variety))+
geom_boxplot()
### Density
ggplot(pinot, aes(x=price, fill=variety))+
geom_density(alpha=0.5)
## Warning: Removed 16 rows containing non-finite values (`stat_density()`).
### Boxplot
ggplot(pinot, aes(x=price, fill=variety))+
geom_boxplot()
## Warning: Removed 16 rows containing non-finite values (`stat_boxplot()`).
We will separate the reviews into terms. This is known as tokenizing.
#install.packages("tidytext")
library(tidytext)
## tokenize
tokenWine<-pinot%>%
unnest_tokens(word, description)
### each term will add a new row
dim(pinot)
## [1] 2979 11
dim(tokenWine)
## [1] 129332 11
### what does this look like?
head(tokenWine)
## X country designation points price province region_1
## 1 3 US Reserve 96 65 Oregon Willamette Valley
## 2 3 US Reserve 96 65 Oregon Willamette Valley
## 3 3 US Reserve 96 65 Oregon Willamette Valley
## 4 3 US Reserve 96 65 Oregon Willamette Valley
## 5 3 US Reserve 96 65 Oregon Willamette Valley
## 6 3 US Reserve 96 65 Oregon Willamette Valley
## region_2 variety winery word
## 1 Willamette Valley Pinot Noir Ponzi this
## 2 Willamette Valley Pinot Noir Ponzi spent
## 3 Willamette Valley Pinot Noir Ponzi 20
## 4 Willamette Valley Pinot Noir Ponzi months
## 5 Willamette Valley Pinot Noir Ponzi in
## 6 Willamette Valley Pinot Noir Ponzi 30
### count
countWords<-tokenWine%>%
count(word)%>%
arrange(desc(n))
head(countWords)
## word n
## 1 and 7669
## 2 the 5766
## 3 a 5232
## 4 of 4461
## 5 with 2999
## 6 this 2661
What do you notice?
These are common “filler words! Let’s take those out.
We want to remove stop words using the stop_words
dataframe in tidytext
.
head(stop_words)
## # A tibble: 6 × 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
An anti_join
removes rows that are in the second
dataframe.
#### anti_join
tidyWineWords<-pinot%>%
unnest_tokens(word, description)%>%
anti_join(stop_words)
## Joining, by = "word"
head(tidyWineWords)
## X country designation points price province region_1
## 1 3 US Reserve 96 65 Oregon Willamette Valley
## 2 3 US Reserve 96 65 Oregon Willamette Valley
## 3 3 US Reserve 96 65 Oregon Willamette Valley
## 4 3 US Reserve 96 65 Oregon Willamette Valley
## 5 3 US Reserve 96 65 Oregon Willamette Valley
## 6 3 US Reserve 96 65 Oregon Willamette Valley
## region_2 variety winery word
## 1 Willamette Valley Pinot Noir Ponzi spent
## 2 Willamette Valley Pinot Noir Ponzi 20
## 3 Willamette Valley Pinot Noir Ponzi months
## 4 Willamette Valley Pinot Noir Ponzi 30
## 5 Willamette Valley Pinot Noir Ponzi french
## 6 Willamette Valley Pinot Noir Ponzi oak
Now we can check the counts again.
tidyWineWords_Count<-tidyWineWords%>%
count(word)%>%
arrange(desc(n))
head(tidyWineWords_Count)
## word n
## 1 fruit 2148
## 2 wine 1550
## 3 flavors 1412
## 4 cherry 1018
## 5 pinot 921
## 6 finish 855
Let’s plot the top 30 words used to describe wine.
### word counts
tidyWineWords_Top<-tidyWineWords%>%
count(word)%>%
arrange(desc(n))%>%
slice_max(n, n=30)
ggplot(tidyWineWords_Top, aes(x=word, y=n))+
geom_col()+
coord_flip()
In graphic best practice, we should order the bars by frequency.
tidyWineWords_Top%>%
mutate(word2=fct_reorder(word, n))%>%
ggplot(aes(x=word2, y=n))+
geom_col()+
coord_flip()
Do you notice any other words that you might want to take out? Words that are not informative?
Words like wine, pinot, vineyard, oregon, estate, and flavors were common in the reviews, but didn’t really provide any more information about how a wine tastes. Let’s take those out too.
### custom stop
custom_stop_words <- tribble(
~word, ~lexicon,
"wine", "CUSTOM",
"pinot", "CUSTOM",
"vineyard", "CUSTOM",
"oregon", "CUSTOM",
"estate", "CUSTOM",
"flavors", "CUSTOM"
)
stop_words2 <- stop_words %>%
bind_rows(custom_stop_words)
We can count frequency again.
### again
tidyWineWords2<-pinot%>%
unnest_tokens(word, description)%>%
anti_join(stop_words2)
## Joining, by = "word"
tidyWineWords_Top2<-tidyWineWords2%>%
count(word)%>%
arrange(desc(n))%>%
slice_max(n, n=30)
tidyWineWords_Top2%>%
mutate(word2=fct_reorder(word, n))%>%
ggplot(aes(x=word2, y=n))+
geom_col()+
coord_flip()
Grouping by variety we can compare pinot noir vs pinot gris.
### compare
comparePinot<-tidyWineWords2%>%
count(word, variety)%>%
group_by(variety)%>%
slice_max(n, n=15)%>%
ungroup()%>%
mutate(word2=fct_reorder(word, n))
ggplot(comparePinot, aes(x=word2, y=n, fill=variety))+
geom_col(show.legend=FALSE)+
facet_wrap(~variety, scales="free")+
coord_flip()
Take the top 30 words and make a word cloud.
### word cloud
#install.packages("wordcloud")
library(wordcloud)
## Loading required package: RColorBrewer
word_counts <- tidyWineWords2 %>%
count(word)
wordcloud(
words = word_counts$word,
freq = word_counts$n,
max.words = 30
)
There are multiple methods for sentiment analysis within
tidytext
bing
: codes positive or negativeafinn
: sentiment score from -5 to 5nrc
: word-emotion lexicon, such as trust, fear,
negative, sadness, anger, …loughran
: language tone, such as constraining,
litigous, negative, positive, superfluous, uncertaintySource: https://www.tidytextmining.com/sentiment.html
#install.packages("textdata")
library(textdata)
get_sentiments("bing")%>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
#get_sentiments("afinn")%>%
# head()
get_sentiments("nrc")%>%
count(sentiment)
## # A tibble: 10 × 2
## sentiment n
## <chr> <int>
## 1 anger 1245
## 2 anticipation 837
## 3 disgust 1056
## 4 fear 1474
## 5 joy 687
## 6 negative 3316
## 7 positive 2308
## 8 sadness 1187
## 9 surprise 532
## 10 trust 1230
get_sentiments("loughran")%>%
count(sentiment)
## # A tibble: 6 × 2
## sentiment n
## <chr> <int>
## 1 constraining 184
## 2 litigious 904
## 3 negative 2355
## 4 positive 354
## 5 superfluous 56
## 6 uncertainty 297
We will apply bing
to the wine descriptions
###sentiment
sentiment_review <- tidyWineWords2 %>%
inner_join(get_sentiments("bing"))
## Joining, by = "word"
### What sentiments are in the wine descriptions?
sentiment_review %>%
count(sentiment)
## sentiment n
## 1 negative 2160
## 2 positive 7579
### what words and sentiments are used
sentiment_review %>%
count(word, sentiment) %>%
arrange(desc(n))%>%
head()
## word sentiment n
## 1 pretty positive 333
## 2 fresh positive 309
## 3 balanced positive 249
## 4 clean positive 239
## 5 elegant positive 228
## 6 nicely positive 217
What are the top 10 words per sentiment?
word_counts <- sentiment_review %>%
count(word, sentiment) %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(
word2 = fct_reorder(word, n)
)
ggplot(word_counts, aes(x = word2, y = n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ sentiment, scales = "free") +
coord_flip() +
labs(
title = "Sentiment Word Counts",
x = "Words"
)