This Milestone Report outlines exploratory analysis and a hierarchical clustering model of several hundred online reviews of award-winning wines aggregated by Wilson Daniels from The Wine Spectator, The Wine Enthusiast and other ezines.
The majority of the California wines reviewed in the online database are from five regions of Northern California. Wine reviews, or ‘tasting notes’, are tabulated below by number of lines and number of words for each region.
| Region | Line Count | Word Count |
|---|---|---|
| North Coast | 1,008 lines | 14,819 words |
| Napa Valley | 410 lines | 6,031 words |
| Russian River Valley | 103 lines | 1,536 words |
| Anderson Valley | 67 lines | 694 words |
| Sonoma Valley | 37 lines | 527 words |
reviews <- file.path("C:/Users/d2i2k/Documents","regions")
reviews
## [1] "C:/Users/d2i2k/Documents/regions"
dir(reviews)
## [1] "anderson_valley.csv" "napa_valley.csv"
## [3] "north_coast.csv" "russian_river_valley.csv"
## [5] "sonoma_valley.csv"
library(tm)
## Loading required package: NLP
corpus <- Corpus(DirSource(reviews))
summary(corpus)
## Length Class Mode
## anderson_valley.csv 2 PlainTextDocument list
## napa_valley.csv 2 PlainTextDocument list
## north_coast.csv 2 PlainTextDocument list
## russian_river_valley.csv 2 PlainTextDocument list
## sonoma_valley.csv 2 PlainTextDocument list
The document term matrix (dtm) based on the corpus has 5 region-specific rows and 4,698 terms or columns.
dtm <- DocumentTermMatrix(corpus)
dim(dtm)
## [1] 5 4698
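As a minimal sketch of what a document-term matrix holds, the base R snippet below builds one by hand, with one row per document and one column per distinct term. The two short texts and the names `toy_docs`, `vocab` and `toy_dtm` are hypothetical and are not taken from the review files; `DocumentTermMatrix` applies its own tokenization rules, so its counts may differ.

```r
# Hypothetical two-document corpus standing in for the region files.
toy_docs <- c(napa   = "rich cherry fruit with a long finish",
              sonoma = "fresh fruit and bright cherry notes")

# Split each document into lower-case word tokens.
tokens <- strsplit(tolower(toy_docs), "[^a-z]+")

# The vocabulary is the sorted set of distinct tokens.
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per term; each cell is a count.
toy_dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))

dim(toy_dtm)      # number of documents, number of terms
colSums(toy_dtm)  # overall frequency of each term across documents
```

Summing the columns, as `colSums(as.matrix(dtm))` does in this report, collapses the per-region counts into one overall frequency per term.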
freq <- colSums(as.matrix(dtm))
docs <- corpus
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("and","are","but","has","for","its","the","this","that","with"))
docs <- tm_map(docs, PlainTextDocument)
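The cleaning steps above can be illustrated on a single string with base R. This is only a sketch: the tasting note and the short `stops` vector are hypothetical, and tm's `stopwords("english")` list is considerably longer.

```r
# Hypothetical tasting note; not taken from the review files.
note <- "This 2012 Pinot Noir has rich, ripe cherry flavors and a long finish!"

# A few stopwords for illustration only.
stops <- c("this", "has", "and", "a", "the", "with")

clean <- gsub("[[:punct:]]", "", note)   # remove punctuation
clean <- gsub("[[:digit:]]", "", clean)  # remove numerals
clean <- tolower(clean)                  # fold to lower case

# Split on whitespace, then drop empty tokens and stopwords.
words <- strsplit(clean, "[[:space:]]+")[[1]]
words <- words[nzchar(words) & !(words %in% stops)]
words
```

Lower-casing before stopword removal matters here because the stopword list is lower case, so "This" would otherwise survive the filter.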
After pre-processing to remove punctuation, numerals and stopwords, as well as sparse terms, the reduced document-term matrix (dtms) has 3,167 terms or columns.
## [1] 3167
freq[tail(ord)]
## aromas rich fruit finish flavors wine
## 137 139 144 150 210 284
freq <- sort(colSums(as.matrix(dtms)), decreasing=TRUE)
The keyword “wine” occurred most often, 284 times, followed by “flavors” and “finish”, which occurred 210 and 150 times respectively across the tasting notes for California wines.
wf <- data.frame(word=names(freq), freq=freq)
head(wf, 15)
## word freq
## wine wine 284
## flavors flavors 210
## finish finish 150
## fruit fruit 144
## rich rich 139
## notes notes 111
## palate palate 102
## long long 95
## fresh fresh 87
## cherry cherry 84
## pinot pinot 71
## noir noir 66
## complex complex 65
## ripe ripe 60
## red red 56
The 15 words that occurred more than 50 times in the tasting notes are displayed in the histogram from left to right in alphabetical order, from “cherry” to “wine”.
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
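The chart of the most frequent words can be approximated as follows. This sketch uses base graphics instead of ggplot2, so it is not the code behind the original figure. The frequencies are copied from the top-15 listing in this report, and reading “from cherry to wine” as alphabetical order is an assumption.

```r
# Frequencies copied from the top-15 listing in this report.
freq <- c(wine = 284, flavors = 210, finish = 150, fruit = 144, rich = 139,
          notes = 111, palate = 102, long = 95, fresh = 87, cherry = 84,
          pinot = 71, noir = 66, complex = 65, ripe = 60, red = 56)

# Keep words seen more than 50 times and order them alphabetically,
# which puts "cherry" first and "wine" last.
top <- freq[freq > 50]
top <- top[order(names(top))]

# Base-graphics bar chart; las = 2 rotates the word labels.
barplot(top, las = 2, ylab = "Frequency",
        main = "Words occurring more than 50 times")
```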
The 15 words that occurred more than 50 times in the tasting notes are drawn in the word cloud at sizes proportional to their frequency.
## Loading required package: RColorBrewer
The word pairing “pinot noir” is a grape varietal for California red wines. The two words are perfectly correlated, with a correlation coefficient r=1.00. “Raspberries” is also perfectly correlated with both words, and four other terms, “cinnamon”, “cuvee”, “gorgeous” and “vibrant”, are highly correlated with them (r ≥ 0.99).
findAssocs(dtm, c("pinot", "noir"), corlimit=0.99)
| Word | pinot | noir |
|---|---|---|
| raspberries | 1.00 | 1.00 |
| cinnamon | 0.99 | 1.00 |
| cuvee | 0.99 | 1.00 |
| gorgeous | 0.99 | 0.99 |
| vibrant | 0.99 | 0.99 |
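The association scores in the table are correlations between columns of the document-term matrix. The base R sketch below shows the idea on a hypothetical matrix: the counts, the name `toy` and the limit of 0.9 are illustrative assumptions, and the snippet is not tm's implementation of `findAssocs`.

```r
# Hypothetical counts for five documents (rows) and four terms (columns).
toy <- cbind(pinot  = c(9, 2, 14, 5, 1),
             noir   = c(8, 2, 13, 5, 1),
             cherry = c(6, 1, 10, 4, 2),
             oak    = c(1, 7,  2, 3, 6))

# Pearson correlation of every term with "pinot", excluding "pinot" itself.
r <- cor(toy)[, "pinot"]
r <- r[names(r) != "pinot"]

# Keep terms at or above the limit, rounded to two decimals and sorted.
corlimit <- 0.9
assoc <- sort(round(r[r >= corlimit], 2), decreasing = TRUE)
assoc
```

With only five documents, each correlation rests on five data points, so coefficients close to 1 are easier to reach than they would be in a larger corpus.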
The first cluster groups the similar terms “black”, “oak”, “plum” and “tannins”. “Pinot” and “noir” appear together in the fourth cluster of the dendrogram.
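A hierarchical clustering of terms can be sketched with base R's `dist` and `hclust`. The counts in `m` are hypothetical, and the Euclidean distance and Ward linkage (`"ward.D2"`) are assumptions, since the report does not state which distance or linkage was used.

```r
# Hypothetical term-by-document counts: rows are terms, columns are regions.
m <- rbind(black   = c(12, 10, 3,  1, 2),
           oak     = c(11,  9, 4,  1, 2),
           plum    = c(10,  9, 3,  2, 1),
           tannins = c(12, 11, 4,  1, 1),
           pinot   = c( 1,  2, 9, 12, 8),
           noir    = c( 1,  2, 9, 11, 8))

d   <- dist(m, method = "euclidean")   # pairwise distances between terms
fit <- hclust(d, method = "ward.D2")   # agglomerative clustering

plot(fit)                              # draws the dendrogram
groups <- cutree(fit, k = 2)           # cut the tree into two clusters
groups
```

Terms whose counts rise and fall together across the regions end up in the same branch, which is why a word pair such as “pinot” and “noir” would be expected to share a cluster.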
The first confidence ellipse (in red) contains four similar terms, “flavors”, “finish”, “rich” and “wine”, which are narrowly dispersed along the two principal components of the k-means cluster plot (k=2). The second confidence ellipse (in blue) contains the remaining, more widely dispersed wine tasting terms.
Since two principal components explain 97 percent of the overall variation among similar terms, California wine reviews should be categorizable.
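The two ingredients of the cluster plot, a k-means partition with k=2 and the share of variation carried by the first two principal components, can be sketched in base R as follows. The matrix `m` is hypothetical, so neither the resulting cluster labels nor the variance share reproduce the 97 percent reported above.

```r
set.seed(42)  # k-means starts from random centres

# Hypothetical term-by-document counts (rows are terms, columns are regions).
m <- rbind(wine    = c(120, 90, 40, 20, 14),
           flavors = c( 95, 70, 25, 12,  8),
           finish  = c( 70, 50, 16,  8,  6),
           cherry  = c( 30, 22, 18, 10,  4),
           plum    = c( 28, 20,  6,  3,  2),
           oak     = c( 25, 18,  8,  4,  3))

# Partition the terms into k = 2 clusters; nstart retries random starts.
km <- kmeans(m, centers = 2, nstart = 10)
km$cluster

# Principal components of the same matrix and their variance shares.
pc <- prcomp(m)
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
sum(var_explained[1:2])   # share of variation held by PC1 and PC2 together
```

When the first two components carry most of the variation, a two-dimensional plot of the terms loses little information, which is the basis for the conclusion drawn above.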