Introduction to Sentiment Analysis in R: Day One

Marcus Mann, Ph.D.

About the Instructor

Hi! My name is Marcus Mann and I’m an assistant professor in sociology at Purdue University. I got my Ph.D. in sociology at Duke University. I mainly study politics and knowledge (e.g. political identity and attitudes toward science and scientists) and use a variety of computational methods in my research. I also teach these methods at the graduate level here at Purdue.

Sentiment Analysis in my own work

Traits of partisan perceptions of declines in social trust
Traits of speech on political subreddits

What we’re going to be doing here

At the end of this course, you will know:

How sentiment analysis is different from other text analysis methods and when to use it.
How to apply different sentiment analysis dictionaries to your data.
How to prepare your text for analysis.
How to incorporate results from sentiment analyses into broader analytical frameworks.
How to independently explore and use different dictionaries according to your emerging questions and needs.

Organization of the class

Day one: What is sentiment analysis and when to use it
Day two: Preparing text for analysis and using different dictionaries
Day three: Getting sentiment analysis scores and incorporating them into broader analytical frameworks
Day four: Interpreting and communicating results of sentiment analysis

Day One

Morning

Why sentiment analysis?

What is sentiment analysis?
When to use sentiment analysis methods as opposed to other (i.e. inductive) methods
The universe of ‘sentiment analysis’ methods and how to choose an approach

Afternoon

Coding basics

Loading packages and data
Cleaning/tidying text data
Quantifying text
Looking ahead to other sentiment analysis dictionaries

Day Two

Sentiment analysis using well-validated dictionaries (e.g. LIWC-22)
Sentiment analysis in R packages (e.g. tidytext and syuzhet)
Visualizing dictionary scores using ggplot2
Building your own sentiment dictionary
Putting sentiment analysis scores to work

Day Three

APIs and collecting digital trace data
Dealing with text data with different structures
Incorporating sentiment scores into a regression framework
Reporting your sentiment analysis approach and results to your audience

What is sentiment analysis?

“the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer’s attitude towards a particular topic, product, etc. is positive, negative, or neutral.”

When to use sentiment analysis as opposed to other methods

Three families of automated text analysis

Term frequency
Document Structure
Semantic Similarity

Document Structure

Structural Topic Modeling (STM)
Inductive
Interested in word co-occurrence rather than frequency

Semantic Similarity

Word embeddings
Inductive
Takes into account the entire context of word occurrence
A “black box” of neural networks

Term Frequency (what we’re doing here)

Literally counting words
Usually associated with robust validation
Emotional and psychological states

Closed Vocabulary Approach (our main focus)

Deductive
Negative-positive polarity
Validated dictionaries
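
To make the closed vocabulary idea concrete, here is a minimal base R sketch of dictionary-based counting. The word lists and the score_text() helper are made up for illustration; they are not taken from LIWC or any validated dictionary, and real dictionaries are far larger.

```r
# Toy dictionaries: invented for illustration, NOT from a validated lexicon
positive_words <- c("good", "happy", "love")
negative_words <- c("bad", "sad", "hate")

# Hypothetical helper: count dictionary hits in a single string
score_text <- function(text) {
  # Lowercase, then split on anything that is not a letter or apostrophe
  tokens <- strsplit(tolower(text), "[^a-z']+")[[1]]
  tokens <- tokens[tokens != ""]
  n_pos <- sum(tokens %in% positive_words)
  n_neg <- sum(tokens %in% negative_words)
  # Polarity here is simply positive hits minus negative hits
  c(positive = n_pos, negative = n_neg, polarity = n_pos - n_neg)
}

result <- score_text("I love this good book, but the ending was sad.")
result
```

The approach is deductive because the categories are fixed before we look at the text: we only count words that someone already decided were positive or negative.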

Term frequency

Open Vocabulary Approach (we will touch on this)

Inductive
Negative-positive polarity
Validated dictionaries

LIWC (Linguistic Inquiry and Word Count)

A collection of dictionaries that represent a variety of psychological and emotional states
Dictionaries are “nested” so that some states might belong to broader categories (e.g. the anger dictionary is a sub-category of the emotional dictionary)
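
One way to picture nesting is the small base R sketch below. The category names and words are hypothetical stand-ins, not LIWC's actual entries; the point is only that a word in a sub-category also counts toward its parent category.

```r
# Hypothetical sub-categories (illustrative words only, not LIWC's lists)
anger_words   <- c("hate", "furious")
sadness_words <- c("cry", "grief")

# The broader category contains every word from its sub-categories
emotion_words <- union(anger_words, sadness_words)

tokens <- c("i", "hate", "mondays")

counts <- c(
  anger   = sum(tokens %in% anger_words),
  sadness = sum(tokens %in% sadness_words),
  emotion = sum(tokens %in% emotion_words)
)
counts
```

Here "hate" is counted once under anger and once under the broader emotion category, so scores for nested categories are not independent of each other.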

Some Coding Basics in R

Setting working directory

You always want to make sure you’re working from the same directory on your computer
To check what your current directory is, you can type getwd()
And to set a new directory, you can type setwd("~/your/working/directory/here")
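
A quick sketch of checking, changing, and restoring the working directory. Here tempdir() is just a stand-in for your own project folder so the example runs on any machine.

```r
# Remember where we started so we can come back
old_dir <- getwd()

# tempdir() is a placeholder for your own project folder
setwd(tempdir())
current_dir <- getwd()
current_dir

# Restore the original working directory
setwd(old_dir)
```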

Loading packages

The pacman package allows you to install AND load packages at the same time through its “p_load” function, using only one line of code.
This means you only need to install the pacman package once and then can use p_load for every package after.

First install the package as normal with install.packages("pacman")
Then you can install and load all packages you’ll need for the rest of class

pacman::p_load(devtools, harrypotter, textdata, tidyverse, stringr, tidytext, dplyr)
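
For intuition, the install-if-missing-then-load idea can be sketched in base R as below. load_or_install() is a hypothetical helper written for this example; it is not part of pacman and does not reproduce everything p_load does. The example loads "stats" and "utils", which ship with R, so nothing is actually installed when you run it.

```r
# Hypothetical helper: install a package only if it is missing, then load it
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)  # only reached when the package is not installed
    }
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# "stats" and "utils" come with R, so this only loads them
loaded <- load_or_install(c("stats", "utils"))
```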

Preparing text data using Harry Potter dataset

First we’re going to load Bradley Boehmke’s Harry Potter dataset, which he has made publicly available on GitHub and which includes all text from the Harry Potter series organized into its separate books.
To download this corpus, we use the “devtools” package, which can install user-contributed R packages that are not available through CRAN.

The traditional way to do this looks like this:

install.packages("devtools")
library(devtools)
install_github("bradleyboehmke/harrypotter")

Each book is a character vector in which each element is the text of one chapter
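
To see what that structure looks like without loading the package, here is a toy stand-in. toy_book and its text are invented for illustration; the real objects (e.g. philosophers_stone) hold full chapter text.

```r
# Toy stand-in for one book: one string per chapter (invented text)
toy_book <- c(
  "This is the text of chapter one.",
  "This is the text of chapter two. It is a bit longer.",
  "This is the text of chapter three."
)

is.character(toy_book)        # the book is a character vector
n_chapters <- length(toy_book)  # number of elements = number of chapters
chapter_lengths <- nchar(toy_book)  # characters per chapter
toy_book[2]                   # indexing pulls out a single chapter
```

The same calls (length(), nchar(), indexing with [ ]) should work on any of the book objects once the harrypotter package is loaded.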

Create a vector of book titles

titles <- c("Philosopher's Stone", "Chamber of Secrets", "Prisoner of Azkaban", "Goblet of Fire", "Order of the Phoenix", "Half-Blood Prince", "Deathly Hallows")

Create a list object that includes all of the Harry Potter data from the package we just loaded

books <- list(philosophers_stone, chamber_of_secrets, prisoner_of_azkaban, goblet_of_fire, order_of_the_phoenix, half_blood_prince, deathly_hallows)

For loop to turn Harry Potter package into tidy data

series <- tibble()

for (i in seq_along(titles)) {
  # One row per chapter, then tokenize each chapter into words
  temp <- tibble(chapter = seq_along(books[[i]]), text = books[[i]]) %>%
    unnest_tokens(word, text) %>%
    mutate(book = titles[i]) %>%
    select(book, everything())

  series <- rbind(series, temp)
}

Set factor to keep books in order of publication

series$book <- factor(series$book, levels = rev(titles))
series
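
Once text is in this one-word-per-row format, quantifying it is mostly a matter of counting. The sketch below runs the same unnest_tokens() step on a small invented dataset (toy_chapters is hypothetical, not the Harry Potter text) and then counts words with dplyr's count().

```r
library(dplyr)
library(tibble)
library(tidytext)

# Invented stand-in data: one row per chapter
toy_chapters <- tibble(
  book    = c("Book A", "Book A", "Book B"),
  chapter = c(1, 2, 1),
  text    = c("The owl flew home.", "The owl slept.", "A train left the station.")
)

# One row per word; unnest_tokens() lowercases and strips punctuation by default
toy_tokens <- toy_chapters %>%
  unnest_tokens(word, text)

# Term frequency: how often does each word appear overall?
word_counts <- toy_tokens %>%
  count(word, sort = TRUE)

word_counts
```

Swapping toy_tokens for series should give word counts for the whole corpus, and count(book, word, sort = TRUE) would break them out by book.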