DATA252: Introduction to Text Analysis

Learning Objectives

In this lesson students will …

Learn how to work with unstructured text data
Utilize the tidytext package
Tockenize text
Remove stop words (and customize lexicon for stop words)
Create visualizations for text
Perfom basic sentiment analysis

Step 1: Load the Wine Data

These data originally come from winemag.com and are hosted on Kaggle

https://www.kaggle.com/datasets/zynicide/wine-reviews

The data we are using is a subset of these data only for wines from the United States.

### UNITED STATES WINE
usWine <- read.csv("https://raw.githubusercontent.com/kitadasmalley/DATA252/main/Data/usWine.csv")
str(usWine)

## 'data.frame':    62397 obs. of  11 variables:
##  $ X          : int  0 2 3 8 9 11 12 14 15 16 ...
##  $ country    : chr  "US" "US" "US" "US" ...
##  $ description: chr  "This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry f"| __truncated__ "Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious, balanced and com"| __truncated__ "This spent 20 months in 30% new French oak, and incorporates fruit from Ponzi's Aurora, Abetina and Madrona vin"| __truncated__ "This re-named vineyard was formerly bottled as deLancellotti. You'll find striking minerality underscoring chun"| __truncated__ ...
##  $ designation: chr  "Martha's Vineyard" "Special Selected Late Harvest" "Reserve" "Silice" ...
##  $ points     : int  96 96 96 95 95 95 95 95 95 95 ...
##  $ price      : int  235 90 65 65 60 48 48 185 90 325 ...
##  $ province   : chr  "California" "California" "Oregon" "Oregon" ...
##  $ region_1   : chr  "Napa Valley" "Knights Valley" "Willamette Valley" "Chehalem Mountains" ...
##  $ region_2   : chr  "Napa" "Sonoma" "Willamette Valley" "Willamette Valley" ...
##  $ variety    : chr  "Cabernet Sauvignon" "Sauvignon Blanc" "Pinot Noir" "Pinot Noir" ...
##  $ winery     : chr  "Heitz" "Macauley" "Ponzi" "Bergström" ...

head(usWine)

##    X country
## 1  0      US
## 2  2      US
## 3  3      US
## 4  8      US
## 5  9      US
## 6 11      US
##                                                                                                                                                                                                                                                                                                                                                                                          description
## 1                                This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. Balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. Enjoy 2022–2030.
## 2                                                                                                           Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious, balanced and complex botrytised white. Dark gold in color, it layers toasted hazelnut, pear compote and orange peel flavors, reveling in the succulence of its 122 g/L of residual sugar.
## 3 This spent 20 months in 30% new French oak, and incorporates fruit from Ponzi's Aurora, Abetina and Madrona vineyards, among others. Aromatic, dense and toasty, it deftly blends aromas and flavors of toast, cigar box, blackberry, black cherry, coffee and graphite. Tannins are polished to a fine sheen, and frame a finish loaded with dark chocolate and espresso. Drink now through 2032.
## 4                                                                                         This re-named vineyard was formerly bottled as deLancellotti. You'll find striking minerality underscoring chunky black fruits. Accents of citrus and graphite comingle, with exceptional midpalate concentration. This is a wine to cellar, though it is already quite enjoyable. Drink now through 2030.
## 5                                                                                The producer sources from two blocks of the vineyard for this wine—one at a high elevation, which contributes bright acidity. Crunchy cranberry, pomegranate and orange peel flavors surround silky, succulent layers of texture that present as fleshy fruit. That delicately lush flavor has considerable length.
## 6                                                                                                                               From 18-year-old vines, this supple well-balanced effort blends flavors of mocha, cherry, vanilla and breakfast tea. Superbly integrated and delicious even at this early stage, this wine seems destined for a long and savory cellar life. Drink now through 2028.
##                       designation points price   province           region_1
## 1               Martha's Vineyard     96   235 California        Napa Valley
## 2   Special Selected Late Harvest     96    90 California     Knights Valley
## 3                         Reserve     96    65     Oregon  Willamette Valley
## 4                          Silice     95    65     Oregon Chehalem Mountains
## 5            Gap's Crown Vineyard     95    60 California       Sonoma Coast
## 6 Estate Vineyard Wadensvil Block     95    48     Oregon       Ribbon Ridge
##            region_2            variety                 winery
## 1              Napa Cabernet Sauvignon                  Heitz
## 2            Sonoma    Sauvignon Blanc               Macauley
## 3 Willamette Valley         Pinot Noir                  Ponzi
## 4 Willamette Valley         Pinot Noir              Bergström
## 5            Sonoma         Pinot Noir              Blue Farm
## 6 Willamette Valley         Pinot Noir Patricia Green Cellars

Step 2: Oregon Wine

Question of Interest: How many different types of wines are made in Oregon?

library(tidyverse)

### OREGON WINE
orWine<-usWine%>%
  filter(province=="Oregon")%>%
  count(variety)%>%
  arrange(desc(n))

dim(orWine)

## [1] 67  2

#head(orWine)

Step 3: Most Common

Question of Interest: What are the top ten wines made in Oregon?

orTop<-orWine%>%
  mutate(variety2=fct_reorder(variety, n))%>%
  slice_max(n, n=10)

orTop

##               variety    n           variety2
## 1          Pinot Noir 2552         Pinot Noir
## 2          Pinot Gris  427         Pinot Gris
## 3          Chardonnay  348         Chardonnay
## 4            Riesling  184           Riesling
## 5               Syrah  146              Syrah
## 6         Pinot Blanc  117        Pinot Blanc
## 7  Cabernet Sauvignon   78 Cabernet Sauvignon
## 8            Viognier   76           Viognier
## 9           Red Blend   74          Red Blend
## 10        Tempranillo   58        Tempranillo

Let’s also turn this into a graphic!

ggplot(orTop, aes(x=variety2, y=n))+
  geom_col()+
  coord_flip()

Step 4: Pinot Noir vs Pinot Gris

Filter for only pinor noir and pinot gris

pinot<-usWine%>%
  filter(province=="Oregon")%>%
  filter(variety %in% c("Pinot Noir", "Pinot Gris"))

Compare Points

## Density
ggplot(pinot, aes(x=points, fill=variety))+
  geom_density(alpha=0.5)

## Boxplot
ggplot(pinot, aes(x=points, fill=variety))+
  geom_boxplot()

Compare Price

### Density
ggplot(pinot, aes(x=price, fill=variety))+
  geom_density(alpha=0.5)

## Warning: Removed 16 rows containing non-finite values (`stat_density()`).

### Boxplot
ggplot(pinot, aes(x=price, fill=variety))+
  geom_boxplot()

## Warning: Removed 16 rows containing non-finite values (`stat_boxplot()`).

Step 5: Tokenize

We will separate the reviews into terms. This is known as tokenizing.

#install.packages("tidytext")
library(tidytext)

## tokenize
tokenWine<-pinot%>%
  unnest_tokens(word, description)

### each term will add a new row
dim(pinot)

## [1] 2979   11

dim(tokenWine)

## [1] 129332     11

### what does this look like?
head(tokenWine)

##   X country designation points price province          region_1
## 1 3      US     Reserve     96    65   Oregon Willamette Valley
## 2 3      US     Reserve     96    65   Oregon Willamette Valley
## 3 3      US     Reserve     96    65   Oregon Willamette Valley
## 4 3      US     Reserve     96    65   Oregon Willamette Valley
## 5 3      US     Reserve     96    65   Oregon Willamette Valley
## 6 3      US     Reserve     96    65   Oregon Willamette Valley
##            region_2    variety winery   word
## 1 Willamette Valley Pinot Noir  Ponzi   this
## 2 Willamette Valley Pinot Noir  Ponzi  spent
## 3 Willamette Valley Pinot Noir  Ponzi     20
## 4 Willamette Valley Pinot Noir  Ponzi months
## 5 Willamette Valley Pinot Noir  Ponzi     in
## 6 Willamette Valley Pinot Noir  Ponzi     30

Step 6: Count Words

### count
countWords<-tokenWine%>%
  count(word)%>%
  arrange(desc(n))

head(countWords)

##   word    n
## 1  and 7669
## 2  the 5766
## 3    a 5232
## 4   of 4461
## 5 with 2999
## 6 this 2661

What do you notice?

These are common “filler words! Let’s take those out.

Step 7: Stop Words

We want to remove stop words using the stop_words dataframe in tidytext.

head(stop_words)

## # A tibble: 6 × 2
##   word      lexicon
##   <chr>     <chr>  
## 1 a         SMART  
## 2 a's       SMART  
## 3 able      SMART  
## 4 about     SMART  
## 5 above     SMART  
## 6 according SMART

An anti_join removes rows that are in the second dataframe.

#### anti_join
tidyWineWords<-pinot%>%
  unnest_tokens(word, description)%>%
  anti_join(stop_words)

## Joining, by = "word"

head(tidyWineWords)

##   X country designation points price province          region_1
## 1 3      US     Reserve     96    65   Oregon Willamette Valley
## 2 3      US     Reserve     96    65   Oregon Willamette Valley
## 3 3      US     Reserve     96    65   Oregon Willamette Valley
## 4 3      US     Reserve     96    65   Oregon Willamette Valley
## 5 3      US     Reserve     96    65   Oregon Willamette Valley
## 6 3      US     Reserve     96    65   Oregon Willamette Valley
##            region_2    variety winery   word
## 1 Willamette Valley Pinot Noir  Ponzi  spent
## 2 Willamette Valley Pinot Noir  Ponzi     20
## 3 Willamette Valley Pinot Noir  Ponzi months
## 4 Willamette Valley Pinot Noir  Ponzi     30
## 5 Willamette Valley Pinot Noir  Ponzi french
## 6 Willamette Valley Pinot Noir  Ponzi    oak

Now we can check the counts again.

tidyWineWords_Count<-tidyWineWords%>%
  count(word)%>%
  arrange(desc(n))

head(tidyWineWords_Count)

##      word    n
## 1   fruit 2148
## 2    wine 1550
## 3 flavors 1412
## 4  cherry 1018
## 5   pinot  921
## 6  finish  855

Let’s plot the top 30 words used to describe wine.

### word counts
tidyWineWords_Top<-tidyWineWords%>%
  count(word)%>%
  arrange(desc(n))%>%
  slice_max(n, n=30)

ggplot(tidyWineWords_Top, aes(x=word, y=n))+
  geom_col()+
  coord_flip()

In graphic best practice, we should order the bars by frequency.

tidyWineWords_Top%>%
  mutate(word2=fct_reorder(word, n))%>%
  ggplot(aes(x=word2, y=n))+
  geom_col()+
  coord_flip()

Do you notice any other words that you might want to take out? Words that are not informative?

Step 8: Custom Stop Words

Words like wine, pinot, vineyard, oregon, estate, and flavors were common in the reviews, but didn’t really provide any more information about how a wine tastes. Let’s take those out too.

### custom stop
custom_stop_words <- tribble(
  ~word,    ~lexicon,
  "wine", "CUSTOM",
  "pinot",      "CUSTOM", 
  "vineyard", "CUSTOM", 
  "oregon", "CUSTOM",
  "estate", "CUSTOM", 
  "flavors", "CUSTOM" 
)
stop_words2 <- stop_words %>%
  bind_rows(custom_stop_words)

We can count frequency again.

### again
tidyWineWords2<-pinot%>%
  unnest_tokens(word, description)%>%
  anti_join(stop_words2)

## Joining, by = "word"

tidyWineWords_Top2<-tidyWineWords2%>%
  count(word)%>%
  arrange(desc(n))%>%
  slice_max(n, n=30)


tidyWineWords_Top2%>%
  mutate(word2=fct_reorder(word, n))%>%
  ggplot(aes(x=word2, y=n))+
  geom_col()+
  coord_flip()

Step 9: Words for Pinots

Grouping by variety we can compare pinot noir vs pinot gris.

### compare
comparePinot<-tidyWineWords2%>%
  count(word, variety)%>%
  group_by(variety)%>%
  slice_max(n, n=15)%>%
  ungroup()%>%
  mutate(word2=fct_reorder(word, n))
  

ggplot(comparePinot, aes(x=word2, y=n, fill=variety))+
  geom_col(show.legend=FALSE)+
  facet_wrap(~variety, scales="free")+
  coord_flip()

Step 10: Word Clouds

Take the top 30 words and make a word cloud.

### word cloud
#install.packages("wordcloud")
library(wordcloud)

## Loading required package: RColorBrewer

word_counts <- tidyWineWords2 %>%
  count(word)

wordcloud(
  words = word_counts$word,
  freq = word_counts$n,
  max.words = 30
)

Step 11: Sentiment Analysis

There are multiple methods for sentiment analysis within tidytext

bing: codes positive or negative
afinn: sentiment score from -5 to 5
nrc: word-emotion lexicon, such as trust, fear, negative, sadness, anger, …
loughran: language tone, such as constraining, litigous, negative, positive, superfluous, uncertainty

Source: https://www.tidytextmining.com/sentiment.html

#install.packages("textdata")
library(textdata)

get_sentiments("bing")%>%
  count(sentiment)

## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4781
## 2 positive   2005

#get_sentiments("afinn")%>%
#  head()

get_sentiments("nrc")%>%
  count(sentiment)

## # A tibble: 10 × 2
##    sentiment        n
##    <chr>        <int>
##  1 anger         1245
##  2 anticipation   837
##  3 disgust       1056
##  4 fear          1474
##  5 joy            687
##  6 negative      3316
##  7 positive      2308
##  8 sadness       1187
##  9 surprise       532
## 10 trust         1230

get_sentiments("loughran")%>%
  count(sentiment)

## # A tibble: 6 × 2
##   sentiment        n
##   <chr>        <int>
## 1 constraining   184
## 2 litigious      904
## 3 negative      2355
## 4 positive       354
## 5 superfluous     56
## 6 uncertainty    297

We will apply bing to the wine descriptions

###sentiment
sentiment_review <- tidyWineWords2 %>%
  inner_join(get_sentiments("bing"))

## Joining, by = "word"

### What sentiments are in the wine descriptions?
sentiment_review %>%
  count(sentiment)

##   sentiment    n
## 1  negative 2160
## 2  positive 7579

### what words and sentiments are used 
sentiment_review %>%
  count(word, sentiment) %>%
  arrange(desc(n))%>%
  head()

##       word sentiment   n
## 1   pretty  positive 333
## 2    fresh  positive 309
## 3 balanced  positive 249
## 4    clean  positive 239
## 5  elegant  positive 228
## 6   nicely  positive 217

What are the top 10 words per sentiment?

word_counts <- sentiment_review %>%
  count(word, sentiment) %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(
    word2 = fct_reorder(word, n)
  )

ggplot(word_counts, aes(x = word2, y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free") +
  coord_flip() +
  labs(
    title = "Sentiment Word Counts",
    x = "Words"
  )