Preamble

This Milestone Report outlines exploratory analysis and a hierarchical clustering model of several hundred online reviews of award-winning wines aggregated by Wilson Daniels from The Wine Spectator, The Wine Enthusiast and other ezines.

file:///C:/Users/d2i2k/Desktop/Milestone%20Report/Database%20of%20Reviews%20For%20Our%20Top-Rated%20Award-Winning%20Wines%20-%20Wilson%20Daniels.html.

The majority of the California wines reviewed in the online database are from five regions of Northern California. The wine reviews, or ‘tasting notes’, are tabulated below by number of lines and number of words.

Region	Line Count	Word Count
North Coast	1,008 lines	14,819 words
Napa Valley	410 lines	6,031 words
Russian River Valley	103 lines	1,536 words
Anderson Valley	67 lines	694 words
Sonoma Valley	37 lines	527 words
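
How the line and word counts were produced is not shown in the report. The following is a minimal sketch of one way to tabulate them in base R, assuming one plain-text file per region and words separated by whitespace. The helper name `count_lines_words` and the sample text are hypothetical, not part of the original analysis.

```r
# Hypothetical helper: count lines and whitespace-separated words in a text file.
count_lines_words <- function(path) {
  lines <- readLines(path, warn = FALSE)
  words <- unlist(strsplit(trimws(lines), "\\s+"))
  c(lines = length(lines), words = sum(nzchar(words)))
}

# Toy stand-in for one region file (not the actual review data).
tmp <- tempfile(fileext = ".csv")
writeLines(c("Rich cherry fruit, long finish.", "Fresh and vibrant."), tmp)

counts <- count_lines_words(tmp)
print(counts)
```

Applying the helper to each file returned by `dir()` with `sapply` would give one row of counts per region.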

Loading Data

reviews <- file.path("C:/Users/d2i2k/Documents","regions")
reviews

## [1] "C:/Users/d2i2k/Documents/regions"

dir(reviews)

## [1] "anderson_valley.csv"      "napa_valley.csv"         
## [3] "north_coast.csv"          "russian_river_valley.csv"
## [5] "sonoma_valley.csv"

Create Corpus

library(tm)

## Loading required package: NLP

corpus <- Corpus(DirSource(reviews))
summary(corpus)

##                          Length Class             Mode
## anderson_valley.csv      2      PlainTextDocument list
## napa_valley.csv          2      PlainTextDocument list
## north_coast.csv          2      PlainTextDocument list
## russian_river_valley.csv 2      PlainTextDocument list
## sonoma_valley.csv        2      PlainTextDocument list

Document Term Matrix (dtm)

The document term matrix (dtm) built from the corpus has 5 rows, one per region, and 4,698 columns, one per term.

dtm <- DocumentTermMatrix(corpus)
dim(dtm)

## [1]    5 4698
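
To make the shape of such a matrix concrete, here is a small base R sketch that builds a document-by-term count matrix by hand. The two toy documents and the names `toy_docs` and `toy_dtm` are illustrative assumptions; the report itself relies on `DocumentTermMatrix()` from the tm package.

```r
# Two toy "documents" (not the actual tasting notes).
toy_docs <- c(napa   = "rich cherry fruit rich finish",
              sonoma = "fresh fruit long finish")

# Split each document into words and collect the vocabulary.
tokens <- strsplit(toy_docs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# Count each vocabulary term per document: terms in rows, documents in columns.
counts <- vapply(tokens,
                 function(tok) tabulate(match(tok, terms), nbins = length(terms)),
                 integer(length(terms)))
rownames(counts) <- terms

# Transpose so that documents are rows and terms are columns, as in a dtm.
toy_dtm <- t(counts)
print(dim(toy_dtm))
print(toy_dtm)
```

Column sums of this matrix give the overall term frequencies, which is what `colSums(as.matrix(dtm))` computes for the real corpus.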

freq <- colSums(as.matrix(dtm))
docs <- corpus
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("and","are","but","has","for","its","the","this","that","with"))
docs <- tm_map(docs, PlainTextDocument)

After pre-processing to remove punctuation, numerals and stopwords, as well as sparse terms, the term document matrix (tdm) has 3,167 terms, down from 4,698.

Term Document Matrix (tdm)

## [1] 3167
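
The code for the sparse-term step is not shown in the report. The sketch below illustrates how tm's `removeSparseTerms()` behaves on a toy corpus. The three documents and the threshold of 0.5 are assumptions chosen for illustration; the threshold actually used in the analysis is unknown.

```r
library(tm)

# Toy corpus of three short documents (not the actual tasting notes).
toy <- VCorpus(VectorSource(c("rich cherry fruit",
                              "rich fruit finish",
                              "rich oak tannins")))
toy_dtm <- DocumentTermMatrix(toy)

# Keep only terms whose sparsity (share of documents without the term)
# is at most 0.5. The value 0.5 is an assumed, illustrative threshold.
toy_dtms <- removeSparseTerms(toy_dtm, 0.5)

kept <- sort(Terms(toy_dtms))
print(kept)
```

Terms that appear in only one of the three documents are dropped, while terms shared by at least two documents survive.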

Exploratory Analysis

ord <- order(freq)
freq[tail(ord)]

##  aromas    rich   fruit  finish flavors    wine 
##     137     139     144     150     210     284

freq <- sort(colSums(as.matrix(dtms)), decreasing=TRUE)

Distribution of Frequently Occurring Words

The word “wine” occurred most often (284 times), followed by “flavors” (210 times) and “finish” (150 times), across the tasting notes for California wines.

wf <- data.frame(word=names(freq), freq=freq)
head(wf, 15)

##            word freq
## wine       wine  284
## flavors flavors  210
## finish   finish  150
## fruit     fruit  144
## rich       rich  139
## notes     notes  111
## palate   palate  102
## long       long   95
## fresh     fresh   87
## cherry   cherry   84
## pinot     pinot   71
## noir       noir   66
## complex complex   65
## ripe       ripe   60
## red         red   56

Histogram of Frequently Occurring Words

The 15 words that occurred more than 50 times in the tasting notes are displayed in alphabetical order, from “cherry” to “wine”, left to right in the histogram.

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate
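
The plotting code is not included in the rendered report. A minimal sketch of how such a chart could be drawn with ggplot2 follows, using the 15 frequencies listed in the table above. The exact geoms and theme settings are assumptions, not the author's code.

```r
library(ggplot2)

# Word frequencies copied from the top-15 table in this report.
wf <- data.frame(
  word = c("wine", "flavors", "finish", "fruit", "rich", "notes", "palate",
           "long", "fresh", "cherry", "pinot", "noir", "complex", "ripe", "red"),
  freq = c(284, 210, 150, 144, 139, 111, 102, 95, 87, 84, 71, 66, 65, 60, 56)
)

# Words on a discrete x axis are ordered alphabetically by default.
top_words <- subset(wf, freq > 50)
p <- ggplot(top_words, aes(x = word, y = freq)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
```

Strictly speaking this is a bar chart of counts per word rather than a histogram of a continuous variable.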

Word Cloud of Frequently Occurring Words

The 15 words that occurred more than 50 times in the tasting notes appear in the word cloud at sizes proportional to their frequency.

## Loading required package: RColorBrewer
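
The word cloud call itself is not shown. The sketch below assumes the wordcloud and RColorBrewer packages and reuses the frequencies from the top-15 table; the seed, palette and other arguments are illustrative choices.

```r
library(wordcloud)
library(RColorBrewer)

# Word frequencies copied from the top-15 table in this report.
freq <- c(wine = 284, flavors = 210, finish = 150, fruit = 144, rich = 139,
          notes = 111, palate = 102, long = 95, fresh = 87, cherry = 84,
          pinot = 71, noir = 66, complex = 65, ripe = 60, red = 56)

set.seed(142)  # arbitrary seed so the layout is reproducible
wordcloud(words = names(freq), freq = freq, min.freq = 50,
          random.order = FALSE, colors = brewer.pal(6, "Dark2"))
```

With `random.order = FALSE` the most frequent words are placed first, near the center of the cloud.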

Term Correlation

The word pair “pinot noir” names a red grape varietal grown in California. The two words are perfectly correlated, with a correlation coefficient r=1.00. Four other terms, “cinnamon”, “cuvee”, “gorgeous” and “vibrant”, are also highly correlated with both words (r of at least 0.99), as is “raspberries”.

findAssocs(dtm, c("pinot", "noir"), corlimit=0.99)

Word	pinot	noir
raspberries	1.00	1.00
cinnamon	0.99	1.00
cuvee	0.99	1.00
gorgeous	0.99	0.99
vibrant	0.99	0.99
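
The associations above are correlations between the per-document counts of two terms. The base R sketch below shows the same idea with `cor()` on a small matrix. The counts are invented for illustration and are not taken from the wine reviews.

```r
# Toy term counts per region (invented numbers, not the real data).
m <- cbind(
  pinot    = c(2, 10, 40, 12, 3),
  noir     = c(2, 10, 40, 12, 3),
  cinnamon = c(0,  1,  5,  2, 0),
  oak      = c(9,  1,  3,  8, 2)
)
rownames(m) <- c("anderson_valley", "napa_valley", "north_coast",
                 "russian_river_valley", "sonoma_valley")

# Pearson correlation between every pair of term columns.
r <- cor(m)
print(round(r["pinot", ], 2))
```

Because the matrix has only five rows, each coefficient is computed from just five values, so correlations close to 1.00 are easy to obtain.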

Hierarchical Clustering

Cluster Dendrogram of Frequently Occurring Words

The first cluster groups the similar terms “black”, “oak”, “plum” and “tannins”. “Pinot” and “noir” appear in the fourth cluster of the dendrogram.
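
The clustering code is not shown. A minimal sketch of agglomerative clustering of terms in base R follows, assuming Euclidean distances and Ward linkage. Both choices, and the toy counts, are assumptions rather than the report's actual settings.

```r
# Toy counts: one row per term, one column per document (invented numbers).
m <- rbind(
  black   = c(5, 20, 30,  2, 1),
  oak     = c(6, 18, 28,  3, 1),
  plum    = c(5, 19, 31,  2, 2),
  tannins = c(4, 21, 29,  3, 1),
  pinot   = c(20, 2, 10, 15, 1),
  noir    = c(20, 2, 10, 15, 1)
)

d   <- dist(m)                       # Euclidean distance between terms
fit <- hclust(d, method = "ward.D")  # agglomerative clustering, Ward linkage
plot(fit)                            # draws the dendrogram

# Cut the tree into two groups (the number of groups is an arbitrary choice).
groups <- cutree(fit, k = 2)
print(groups)
```

Terms with similar count profiles across documents merge low in the tree, so “pinot” and “noir” land in one group and the four other terms in the second.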

k-Means Clustering (k=2)

The first confidence ellipse (in red) contains four similar terms, “flavors”, “finish”, “rich” and “wine”, which are narrowly dispersed along the two principal components of the k-means (k=2) cluster plot. The second confidence ellipse (in blue) contains the remaining, more widely dispersed wine-tasting terms.
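
The code behind the cluster plot is not shown. One common way to produce a plot of this kind is `kmeans()` followed by `clusplot()` from the cluster package, which projects the points onto the first two principal components and draws an ellipse around each cluster. The sketch below assumes that approach and uses synthetic data, not the term matrix from the report.

```r
library(cluster)

# Synthetic data: a tight group of 8 points and a looser group of 20 points.
set.seed(1)
m <- rbind(
  matrix(rnorm(8 * 3,  mean = 100, sd = 5),  ncol = 3),
  matrix(rnorm(20 * 3, mean = 20,  sd = 15), ncol = 3)
)
rownames(m) <- paste0("term", seq_len(nrow(m)))

# k-means with k = 2; several random starts make the result stable.
kfit <- kmeans(m, centers = 2, nstart = 10)

# Plot the clusters on the first two principal components.
clusplot(m, kfit$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)
```

The subtitle that `clusplot()` prints reports the share of point variability explained by the two plotted components.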

Conclusion

Since two principal components explain 97 percent of the overall variation among similar terms, California wine reviews should be categorizable.
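
For reference, the share of variation explained by the leading principal components can be computed directly from the component variances. The sketch below uses `prcomp()` on a toy matrix; the numbers are invented and are not meant to reproduce the 97 percent figure.

```r
# Toy matrix: six terms (rows) by three documents (columns), invented numbers.
m <- cbind(
  x1 = c(10, 12,  9, 30, 28, 31),
  x2 = c( 5,  6,  4, 16, 14, 15),
  x3 = c( 1,  3,  2,  2,  1,  3)
)

pca <- prcomp(m)  # principal components of the centered data

# Each component's variance as a fraction of the total variance.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
top2 <- sum(var_explained[1:2])
print(round(var_explained, 3))
print(top2)
```

A value of `top2` close to 1 means a two-dimensional plot loses little of the spread in the original data.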

RMarkdown

Douglas M Okamoto

September 2, 2016