Ch. 1 - Fast & dirty: Polarity scoring

Let’s talk about our feelings

Jump right in! Visualize polarity

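# Load qdap for polarity() and magrittr for the %$% exposition pipe
library(qdap)
library(magrittr)

# Examine the text data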
text_df
##     person
## 1     Nick
## 2 Jonathan
## 3  Martijn
## 4   Nicole
## 5     Nick
## 6 Jonathan
## 7  Martijn
## 8   Nicole
##                                                                         text
## 1                                              DataCamp courses are the best
## 2                                                 I like talking to students
## 3                            Other online data science curricula are boring.
## 4                                                         What is for lunch?
## 5                                        DataCamp has lots of great content!
## 6                           Students are passionate and are excited to learn
## 7 Other data science curriculum is hard to learn and difficult to understand
## 8                                             I think the food here is good.
# Calculate the overall polarity score
text_df %$% polarity(text)
##   all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all               8          54        0.179       0.452              0.396
# Calculate the polarity score by person
(datacamp_conversation <- text_df %$% polarity(text, person))
##     person total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 Jonathan               2          13        0.577       0.184              3.141
## 2  Martijn               2          19       -0.478       0.141             -3.388
## 3     Nick               2          11        0.428       0.028             15.524
## 4   Nicole               2          11        0.189       0.267              0.707
# Counts table from datacamp_conversation
counts(datacamp_conversation)
##     person wc polarity           pos.words       neg.words                                                                   text.var
## 1     Nick  5    0.447                best               -                                              DataCamp courses are the best
## 2 Jonathan  5    0.447                like               -                                                 I like talking to students
## 3  Martijn  7   -0.378                   -          boring                            Other online data science curricula are boring.
## 4   Nicole  4    0.000                   -               -                                                         What is for lunch?
## 5     Nick  6    0.408               great               -                                        DataCamp has lots of great content!
## 6 Jonathan  8    0.707 passionate, excited               -                           Students are passionate and are excited to learn
## 7  Martijn 12   -0.577                   - hard, difficult Other data science curriculum is hard to learn and difficult to understand
## 8   Nicole  7    0.378                good               -                                             I think the food here is good.
# Plot the conversation polarity
plot(datacamp_conversation)
## Warning: `show_guide` has been deprecated. Please use `show.legend` instead.
## Warning: Ignoring unknown aesthetics: x
## Warning: `show_guide` has been deprecated. Please use `show.legend` instead.
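
The numbers in the tables above can be reproduced by hand, which helps show what the score means. The following is a minimal base-R sketch, not qdap's implementation: it assumes every lexicon hit is worth +1 or -1 and that no negators or amplifiers are present, which holds for these eight sentences. Under that assumption a sentence's polarity is the net hit count divided by the square root of its word count, and stan.mean.polarity is the mean polarity divided by its standard deviation. The column names `n_pos` and `n_neg` are made up for this illustration; the counts are copied from the counts() table.

```r
# Word counts and lexicon hits copied from the counts() table above.
conversation <- data.frame(
  person = rep(c("Nick", "Jonathan", "Martijn", "Nicole"), times = 2),
  wc     = c(5, 5, 7, 4, 6, 8, 12, 7),
  n_pos  = c(1, 1, 0, 0, 1, 2, 0, 1),   # hypothetical column: positive hits
  n_neg  = c(0, 0, 1, 0, 0, 0, 2, 0),   # hypothetical column: negative hits
  stringsAsFactors = FALSE
)

# Assumption: each hit counts +1 or -1 (no negators or amplifiers here).
conversation$polarity <- with(conversation, (n_pos - n_neg) / sqrt(wc))

# Overall summary: mean, sample standard deviation, and their ratio.
overall <- c(
  ave.polarity       = mean(conversation$polarity),
  sd.polarity        = sd(conversation$polarity),
  stan.mean.polarity = mean(conversation$polarity) / sd(conversation$polarity)
)

# The same summary, grouped by person (groups come out in alphabetical order).
ave_by_person <- tapply(conversation$polarity, conversation$person, mean)
sd_by_person  <- tapply(conversation$polarity, conversation$person, sd)
by_person <- data.frame(
  person             = names(ave_by_person),
  ave.polarity       = as.numeric(ave_by_person),
  sd.polarity        = as.numeric(sd_by_person),
  stan.mean.polarity = as.numeric(ave_by_person / sd_by_person)
)

print(round(conversation$polarity, 3))
print(round(overall, 3))
print(by_person)
```

Nick's two sentences score almost the same, so his standard deviation is tiny and his stan.mean.polarity is very large. With only two sentences per person, that column says more about consistency than about strength of feeling.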

TM refresher (I)

# Load tm for the corpus functions; clean_corpus() and tm_define are pre-defined
library(tm)
clean_corpus
## function(corpus){
##   corpus <- tm_map(corpus, content_transformer(replace_abbreviation))
##   corpus <- tm_map(corpus, removePunctuation)
##   corpus <- tm_map(corpus, removeNumbers)
##   corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "coffee"))
##   corpus <- tm_map(corpus, content_transformer(tolower))
##   corpus <- tm_map(corpus, stripWhitespace)
##   return(corpus)
## }
tm_define
##                                                                                                   x
## 1                           Text mining is the process of distilling actionable insights from text.
## 2 Sentiment analysis represents the set of tools to extract an author's feelings towards a subject.
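# Create a VectorSource
# (tm_define is a one-column data frame, so the whole column x becomes a
# single document that holds both sentences)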
tm_vector <- VectorSource(tm_define)

# Apply VCorpus
tm_corpus <- VCorpus(tm_vector)

# Examine the first document's contents
content(tm_corpus[[1]])
## [1] "Text mining is the process of distilling actionable insights from text."                          
## [2] "Sentiment analysis represents the set of tools to extract an author's feelings towards a subject."
# Clean the text
tm_clean <- clean_corpus(tm_corpus)

# Reexamine the contents of the first doc
content(tm_clean[[1]])
## [1] "text mining process distilling actionable insights text"                         
## [2] "sentiment analysis represents set tools extract authors feelings towards subject"

TM refresher (II)

How many words do YOU know? Zipf’s law & subjectivity lexicon

What is a subjectivity lexicon?

Where can you observe Zipf’s law?

Polarity on actual text

Explore qdap’s polarity & built-in lexicon

Happy songs!

LOL, this song is wicked good

Stressed Out!


Ch. 2 - Sentiment analysis the tidytext way

Plutchik’s wheel of emotion, polarity vs. sentiment

One theory of emotion

DTM vs. tidytext matrix

Getting Sentiment Lexicons

Bing lexicon with an inner join explanation

Bing tidy polarity: Simple example

Bing tidy polarity: Count & spread the white whale

Bing tidy polarity: Call me Ishmael (with ggplot2)!

AFINN & NRC methodologies in more detail

AFINN: I’m your Huckleberry

The wonderful wizard of NRC


Ch. 3 - Visualizing sentiment

Parlor trick or worthwhile?

Real insight?

Unhappy ending? Chronological polarity

Word impact, frequency analysis

Introspection using sentiment analysis

Divide & conquer: Using polarity for a comparison cloud

Emotional introspection

Compare & contrast stacked bar chart

Interpreting visualizations

Kernel density plot

Box plot

Radar chart

Treemaps for groups of documents


Ch. 4 - Case study: Airbnb reviews

Refresher on the text mining workflow

Step 1: What do you want to know?

Step 2: Identify Text Sources

Quickly examine the basic polarity

Step 3: Organize (& clean) the text

Create Polarity Based Corpora

Create a Tidy Text Tibble!

Compare Tidy Sentiment to Qdap Polarity

Step 4: Feature Extraction & Step 5: Time for analysis… almost there!

Assessing author effort

Comparison Cloud

Scaled Comparison Cloud

Step 6: Reach a conclusion

Confirm an expected conclusion

Choose a less expected insight

Your turn!


About Michael Mallari

Michael is a hybrid thinker and doer—a byproduct of being a CliftonStrengths “Learner” over time. With 20+ years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.

Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate to receive his MS in Applied Analytics from Columbia University.

LinkedIn | Twitter | www.michaelmallari.com/data | www.columbia.edu/~mm5470