Introduction to Sentiment Analysis in R: Day One

Marcus Mann, Ph.D.

About the Instructor

Hi! My name is Marcus Mann and I’m an assistant professor in sociology at Purdue University. I got my Ph.D. in sociology at Duke University. I mainly study politics and knowledge (e.g. political identity and attitudes toward science and scientists) and use a variety of computational methods in my research. I also teach these methods at the graduate level here at Purdue.

Sentiment Analysis in my own work

Cognitive-Emotional Currents
  • Traits of partisan perceptions of declines in social trust
  • Traits of speech on political subreddits

What we’re going to be doing here

At the end of this course, you will know:

  • How sentiment analysis is different from other text analysis methods and when to use it.
  • How to apply different sentiment analysis dictionaries to your data.
  • How to prepare your text for analysis.
  • How to incorporate results from sentiment analyses into broader analytical frameworks.
  • How to independently explore and use different dictionaries according to your emerging questions and needs.

Organization of the class

  • Day one: What is sentiment analysis and when to use it
  • Day two: Preparing text for analysis and using different dictionaries
  • Day three: Getting sentiment analysis scores and incorporating them into broader analytical frameworks
  • Day four: Interpreting and communicating results of sentiment analysis

Day One

Morning

Why sentiment analysis?

  • What is sentiment analysis?
  • When to use sentiment analysis methods as opposed to other (i.e. inductive) methods
  • The universe of ‘sentiment analysis’ methods and how to choose an approach

Afternoon

Coding basics

  • Loading packages and data
  • Cleaning/tidying text data
  • Quantifying text
  • Looking ahead to other sentiment analysis dictionaries

Day Two

  • Sentiment analysis using well-validated dictionaries (e.g. LIWC2022)
  • Sentiment analysis in R packages (e.g. tidytext & syuzhet)
  • Visualizing dictionary scores using ggplot
  • Building your own sentiment dictionary
  • Putting sentiment analysis scores to work

Day Three

  • APIs and collecting digital trace data
  • Dealing with text data with different structures
  • Incorporating sentiment scores into a regression framework
  • Reporting your sentiment analysis approach and results to your audience

What is sentiment analysis?

  • “the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer’s attitude towards a particular topic, product, etc. is positive, negative, or neutral.”

When to use sentiment analysis as opposed to other methods

Three families of automated text analysis

The “three families” of text analysis

  1. Term frequency
  2. Document Structure
  3. Semantic Similarity

Document Structure

  • Structural Topic Modeling (STM)
  • Inductive
  • Interested in word co-occurrence rather than frequency

Semantic Similarity

  • Word embeddings
  • Inductive
  • Takes into account the entire context of word occurrence
  • A “black box” of neural networks

Term Frequency (what we’re doing here)

  • Literally counting words
  • Usually associated with robust validation
  • Emotional and psychological states

Closed Vocabulary Approach (our main focus)

  • Deductive
  • Negative-positive polarity
  • Validated dictionaries
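
To make the closed vocabulary idea concrete, here is a minimal base R sketch of counting dictionary matches in a sentence. The word lists and the example sentence are invented for illustration and are not taken from any validated dictionary; the "positive minus negative" score is just one simple way to express polarity.

```r
# Toy dictionary: these word lists are invented for illustration only
toy_dictionary <- list(
  positive = c("good", "happy", "love", "great"),
  negative = c("bad", "sad", "hate", "awful")
)

text <- "I love this great book but the ending is sad"

# Split the text into lowercase words
words <- strsplit(tolower(text), "\\s+")[[1]]

# Count how many words fall into each dictionary category
counts <- sapply(toy_dictionary, function(terms) sum(words %in% terms))

# A simple polarity score: positive matches minus negative matches
polarity <- counts[["positive"]] - counts[["negative"]]

print(counts)
print(polarity)
```

The approach is deductive because the categories are fixed before we look at the text: we only count what the dictionary tells us to look for.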

Term frequency

  • Open Vocabulary Approach (we will touch on this)
    • Inductive
    • Negative-positive polarity
    • Validated dictionaries

LIWC (Linguistic Inquiry and Word Count)

  • A collection of dictionaries that represent a variety of psychological and emotional states
  • Dictionaries are “nested” so that some states might belong to broader categories (e.g. the anger dictionary is a sub-category of the emotional dictionary)
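
One way to picture nesting is sketched below in base R. The word lists are hypothetical and are NOT the actual LIWC dictionaries; the point is only that a word matching a sub-category also counts toward its parent category.

```r
# Hypothetical word lists; these are NOT the actual LIWC dictionaries
anger_words   <- c("angry", "furious", "hate")
sadness_words <- c("sad", "cry", "grief")

# The broader category contains every word from its sub-categories
emotion_words <- union(anger_words, sadness_words)

words <- c("she", "was", "furious", "and", "began", "to", "cry")

# A word that matches a sub-category also counts toward its parent
anger_count   <- sum(words %in% anger_words)
emotion_count <- sum(words %in% emotion_words)

print(c(anger = anger_count, emotion = emotion_count))
```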

LIWC 2022

Some Coding Basics in R

Setting working directory

  • You always want to make sure you’re working from the same directory on your computer

  • To check what your current directory is, you can type getwd()

  • And to set a new directory, you can type setwd("~/your/working/directory/here")
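
Put together, checking and changing the directory might look like the sketch below. It uses tempdir() as a stand-in for your own project folder, which is an assumption made only so the example runs on any machine.

```r
# Remember where we started so we can return afterwards
original_dir <- getwd()

# tempdir() stands in for your own project folder in this sketch
project_dir <- tempdir()
setwd(project_dir)

# Confirm the change; normalizePath() makes the two paths comparable
same_dir <- normalizePath(getwd()) == normalizePath(project_dir)
print(same_dir)

# Go back to the original directory
setwd(original_dir)
```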

Loading packages

  • The pacman package allows you to install AND load packages at the same time through its “p_load” function, using only one line of code.

  • This means you only need to install the pacman package once and then can use p_load for every package after.

  • First install the package as normal with install.packages("pacman")

  • Then you can install and load all packages you’ll need for the rest of class

    pacman::p_load(devtools, harrypotter, textdata, tidyverse, stringr, tidytext, dplyr)
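
To see what “install AND load” saves you, here is a rough base R sketch of the same idea. The helper name load_or_install is hypothetical and this is not how pacman is actually implemented; it only illustrates the install-if-missing, then-load pattern.

```r
# Hypothetical helper: a rough stand-in for the idea behind p_load,
# not pacman's actual implementation
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    # Install only when the package is not already available
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    # Attach the package, as library() would
    library(pkg, character.only = TRUE)
  }
  # Report which of the requested packages are now attached
  pkgs %in% .packages()
}

# "stats" and "utils" ship with R, so nothing is downloaded here
loaded <- load_or_install(c("stats", "utils"))
print(loaded)
```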

Preparing text data using Harry Potter dataset

  • First we’re going to load Bradley Boehmke’s Harry Potter dataset, which he has made publicly available on GitHub. It includes all text from the Harry Potter series, organized into its separate books.

  • Because this corpus is not available through CRAN, we use the “devtools” package, which can install user-generated R packages directly from GitHub.

  • The traditional way to do this looks like this:

    install.packages("devtools")
    library(devtools)
    install_github("bradleyboehmke/harrypotter")

Each book is a character vector in which each element is a chapter

  • Create a vector of book titles

titles <- c("Philosopher's Stone", "Chamber of Secrets", "Prisoner of Azkaban", "Goblet of Fire", "Order of the Phoenix", "Half-Blood Prince", "Deathly Hallows")

  • Create a list object that includes all of the Harry Potter data from the package we just loaded

books <- list(philosophers_stone, chamber_of_secrets, prisoner_of_azkaban, goblet_of_fire, order_of_the_phoenix, half_blood_prince, deathly_hallows)
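
If you want to get a feel for this structure before working with the real books, the sketch below builds a toy version of it. The objects toy_book_one, toy_book_two, toy_titles, and toy_books are invented stand-ins, not part of the harrypotter package.

```r
# Toy stand-ins for two books; each element of a vector is one "chapter"
toy_book_one <- c("Chapter one text.", "Chapter two text.", "Chapter three text.")
toy_book_two <- c("Another first chapter.", "Another second chapter.")

toy_titles <- c("Toy Book One", "Toy Book Two")
toy_books  <- list(toy_book_one, toy_book_two)

# Chapters per book: the length of each character vector
chapters_per_book <- lengths(toy_books)
names(chapters_per_book) <- toy_titles
print(chapters_per_book)

# Pull the text of the second chapter of the first book
second_chapter <- toy_books[[1]][2]
print(second_chapter)
```

The same calls should work on the real books list, e.g. lengths(books) to see how many chapters each book has.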

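For loop to turn Harry Potter package into tidy data

  • For each book, build a tibble with one row per chapter, then tokenize each chapter into words

series <- tibble()

for (i in seq_along(titles)) {
  # one row per chapter of book i
  temp <- tibble(chapter = seq_along(books[[i]]), text = books[[i]]) %>%
    # tokenize each chapter into words (one row per word)
    unnest_tokens(word, text) %>%
    # label every row with its book title and move that column first
    mutate(book = titles[i]) %>%
    select(book, everything())

  # append this book to the running result
  series <- rbind(series, temp)
}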
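
To show what the loop produces without needing the packages installed, here is a base R sketch on toy data. The tokenize_chapter helper and the toy text are invented for illustration; the helper only approximates what unnest_tokens does (lowercasing, dropping punctuation, one word per row) and is not a replacement for it.

```r
toy_titles <- c("Toy Book One", "Toy Book Two")
toy_books <- list(
  c("The owl arrived. The owl left!", "A quiet chapter."),
  c("Another book begins.")
)

# Hypothetical helper that roughly mimics word tokenization
tokenize_chapter <- function(text) {
  # Lowercase, then turn anything that is not a letter or apostrophe into a space
  cleaned <- gsub("[^a-z']+", " ", tolower(text))
  words <- strsplit(trimws(cleaned), " ")[[1]]
  words[words != ""]
}

# Build one data frame per chapter, then stack them
rows <- list()
for (i in seq_along(toy_titles)) {
  for (ch in seq_along(toy_books[[i]])) {
    words <- tokenize_chapter(toy_books[[i]][ch])
    rows[[length(rows) + 1]] <- data.frame(
      book = toy_titles[i], chapter = ch, word = words,
      stringsAsFactors = FALSE
    )
  }
}
toy_series <- do.call(rbind, rows)

# Word frequencies across the toy corpus, most common first
word_counts <- sort(table(toy_series$word), decreasing = TRUE)

print(head(toy_series))
print(word_counts)
```

The result has one row per word, with columns for book, chapter, and word, which is the “tidy” shape the rest of the analysis relies on.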

Set factor to keep books in order of publication

# note: rev() stores the levels in reverse publication order
series$book <- factor(series$book, levels = rev(titles))

# print the result
series
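
Why set the levels at all? The toy sketch below (with invented short titles) shows that factor() sorts levels alphabetically unless you supply an order, and that wrapping the order in rev() reverses it.

```r
# Toy titles in "publication" order; invented for illustration
toy_titles  <- c("Stone", "Chamber", "Prisoner")
book_column <- c("Chamber", "Stone", "Prisoner", "Stone")

# Without explicit levels, factor() orders levels alphabetically
default_levels <- levels(factor(book_column))

# With explicit levels, the order follows the vector we supply
publication_levels <- levels(factor(book_column, levels = toy_titles))

# rev() stores the same levels in reverse order
reversed_levels <- levels(factor(book_column, levels = rev(toy_titles)))

print(default_levels)
print(publication_levels)
print(reversed_levels)
```

Level order matters later because tables and plots generally follow it when arranging groups.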