POLISCI 3325G - Data Science for Political Science

Evelyne Brie

Winter 2023

Text Analysis

This week, we will: (1) load a dataset, (2) manipulate text strings, (3) process the content of a textual variable by pasting vectors together and creating a Document Term Matrix, and (4) create a new variable representing the percentage of texts in which each unigram appears.

The dataset we will be working on today contains text from fake news sources on the Web. It contains text from 244 websites and represents 12999 posts in total (we will be using a subset of 5000 posts in this recitation). Fake news Web pages were identified using the BS Detector Chrome Extension by Daniel Sieradski. Please note that, as a consequence, the data we will be analyzing are full of nonsense…

Relevant functions: paste(), gsub(), Corpus(), VectorSource(), DocumentTermMatrix(), as.matrix(), apply(), order().

1. Loading Data

First, we load the FakeNewsData.csv file into our R session using read.csv().

# Setting the working directory to the folder containing this script (requires RStudio)
setwd(dirname(rstudioapi::getSourceEditorContext()$path))

# Loading the dataset using read.csv()
myData <- read.csv("FakeNewsData.csv")

# Looking at the dimensions of the dataset using dim()
dim(myData)

## [1] 5000   21

# Looking at the variable names using colnames()
colnames(myData)

##  [1] "X"                  "uuid"               "ord_in_thread"     
##  [4] "author"             "published"          "title"             
##  [7] "text"               "language"           "crawled"           
## [10] "site_url"           "country"            "domain_rank"       
## [13] "thread_title"       "spam_score"         "main_img_url"      
## [16] "replies_count"      "participants_count" "likes"             
## [19] "comments"           "shares"             "type"

# Viewing the class attribute of each variable using sapply() and class()
sapply(myData, FUN=class)

##                  X               uuid      ord_in_thread             author 
##          "integer"        "character"          "integer"        "character" 
##          published              title               text           language 
##        "character"        "character"        "character"        "character" 
##            crawled           site_url            country        domain_rank 
##        "character"        "character"        "character"          "integer" 
##       thread_title         spam_score       main_img_url      replies_count 
##        "character"          "numeric"        "character"          "integer" 
## participants_count              likes           comments             shares 
##          "integer"          "integer"          "integer"          "integer" 
##               type 
##        "character"

# Looking at the title and the text variable for the fifth observation
myData[5, c("title","text")]

##                                                                                                 title
## 5 FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Healthcare Begins With A Bombshell! » 100percentfedUp.com
##   text
## 5 Email HEALTHCARE REFORM TO MAKE AMERICA GREAT AGAIN \nSince March of 2010, the American people have had to suffer under the incredible economic burden of the Affordable Care Act—Obamacare. This legislation, passed by totally partisan votes in the House and Senate and signed into law by the most divisive and partisan President in American history, has tragically but predictably resulted in runaway costs, websites that don’t work, greater rationing of care, higher premiums, less competition and fewer choices. Obamacare has raised the economic uncertainty of every single person residing in this country. As it appears Obamacare is certain to collapse of its own weight, the damage done by the Democrats and President Obama, and abetted by the Supreme Court, will be difficult to repair unless the next President and a Republican congress lead the effort to bring much-needed free market reforms to the healthcare industry. \nCongress must act. Our elected representatives in the House and Senate must: \n1. Completely repeal Obamacare. Our elected representatives must eliminate the individual mandate. No person should be required to buy insurance unless he or she wants to. \n2. Modify existing law that inhibits the sale of health insurance across state lines. As long as the plan purchased complies with state requirements, any vendor ought to be able to offer insurance in any state. By allowing full competition in this market, insurance costs will go down and consumer satisfaction will go up. \n3. Allow individuals to fully deduct health insurance premium payments from their tax returns under the current tax system. Businesses are allowed to take these deductions so why wouldn’t Congress allow individuals the same exemptions? As we allow the free market to provide insurance coverage opportunities to companies and individuals, we must also make sure that no one slips through the cracks simply because they cannot afford insurance. 
We must review basic options for Medicaid and work with states to ensure that those who want healthcare coverage can have it. TRENDING ON 100% Fed Up

2. Processing the textual content

Before analyzing the data, we might want to make some changes to the texts stored in the data frame. Here are some useful functions to modify text strings in R.

Function	Description
`gsub()`	Replaces every match of a pattern in a text string (or removes it, if the replacement is empty)
`strsplit()`	Splits a text string into pieces at a given separator
`paste()`	Puts together separate elements to form a single text string
`substr()`	Extracts a subset of characters from a text string
`toupper()`	Converts all elements of a text string to uppercase

We’ll only use the paste() and gsub() functions today (i.e. we will paste two columns of texts together to make one single vector of text to analyze, and then replace some words within that vector). If interested, you can find examples of text manipulation using more functions here.
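As a quick illustration of the five functions in the table, here is a minimal sketch applied to a made-up sentence. The string and the object names are invented for this example and are not part of the dataset.

```r
# Toy string, invented for this illustration (not taken from the dataset)
x <- "Fake news spreads fast"

replaced  <- gsub("fast", "quickly", x)  # replace every match of a pattern
words     <- strsplit(x, " ")[[1]]       # split at spaces; strsplit() returns a list
joined    <- paste("Breaking:", x)       # join elements (separated by a space by default)
firstPart <- substr(x, 1, 9)             # extract characters 1 to 9
shouted   <- toupper(x)                  # convert to uppercase

print(list(replaced, words, joined, firstPart, shouted))
```

Note that strsplit() returns a list with one element per input string, which is why the sketch uses [[1]] to pull out the vector of words.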

2.1 Merging two textual variables to create a totalText variable

# Let's say we want to create a new variable with the text of the article and the title combined
myData$totalText <- paste(myData$title, myData$text, sep=" ")

# paste() already returns a character vector; as.character() is kept as a harmless safeguard
myData$totalText <- as.character(myData$totalText)

# Displaying the content of the second post
myData$totalText[2]

## [1] "Re: Why Did Attorney General Loretta Lynch Plead The Fifth? Why Did Attorney General Loretta Lynch Plead The Fifth? Barracuda Brigade 2016-10-28 Print The administration is blocking congressional probe into cash payments to Iran. Of course she needs to plead the 5th. She either can’t recall, refuses to answer, or just plain deflects the question. Straight up corruption at its finest! \n100percentfedUp.com ; Talk about covering your ass! Loretta Lynch did just that when she plead the Fifth to avoid incriminating herself over payments to Iran…Corrupt to the core! Attorney General Loretta Lynch is declining to comply with an investigation by leading members of Congress about the Obama administration’s secret efforts to send Iran $1.7 billion in cash earlier this year, prompting accusations that Lynch has “pleaded the Fifth” Amendment to avoid incriminating herself over these payments, according to lawmakers and communications exclusively obtained by the Washington Free Beacon. \nSen. Marco Rubio (R., Fla.) and Rep. Mike Pompeo (R., Kan.) initially presented Lynch in October with a series of questions about how the cash payment to Iran was approved and delivered. \nIn an Oct. 24 response, Assistant Attorney General Peter Kadzik responded on Lynch’s behalf, refusing to answer the questions and informing the lawmakers that they are barred from publicly disclosing any details about the cash payment, which was bound up in a ransom deal aimed at freeing several American hostages from Iran. \nThe response from the attorney general’s office is “unacceptable” and provides evidence that Lynch has chosen to “essentially plead the fifth and refuse to respond to inquiries regarding [her]role in providing cash to the world’s foremost state sponsor of terrorism,” Rubio and Pompeo wrote on Friday in a follow-up letter to Lynch. More Related"

# Removing irrelevant terms using gsub()
# Note: each pattern is matched anywhere in the text, including inside longer words
myData$totalText <- gsub("Print","",myData$totalText)
myData$totalText <- gsub("More Related","",myData$totalText)
myData$totalText <- gsub("\n","",myData$totalText) # "\n" marks a line break in the raw data

# Sanity check: displaying the content of the second post again
# Notice how we successfully removed these terms?
myData$totalText[2]

## [1] "Re: Why Did Attorney General Loretta Lynch Plead The Fifth? Why Did Attorney General Loretta Lynch Plead The Fifth? Barracuda Brigade 2016-10-28  The administration is blocking congressional probe into cash payments to Iran. Of course she needs to plead the 5th. She either can’t recall, refuses to answer, or just plain deflects the question. Straight up corruption at its finest! 100percentfedUp.com ; Talk about covering your ass! Loretta Lynch did just that when she plead the Fifth to avoid incriminating herself over payments to Iran…Corrupt to the core! Attorney General Loretta Lynch is declining to comply with an investigation by leading members of Congress about the Obama administration’s secret efforts to send Iran $1.7 billion in cash earlier this year, prompting accusations that Lynch has “pleaded the Fifth” Amendment to avoid incriminating herself over these payments, according to lawmakers and communications exclusively obtained by the Washington Free Beacon. Sen. Marco Rubio (R., Fla.) and Rep. Mike Pompeo (R., Kan.) initially presented Lynch in October with a series of questions about how the cash payment to Iran was approved and delivered. In an Oct. 24 response, Assistant Attorney General Peter Kadzik responded on Lynch’s behalf, refusing to answer the questions and informing the lawmakers that they are barred from publicly disclosing any details about the cash payment, which was bound up in a ransom deal aimed at freeing several American hostages from Iran. The response from the attorney general’s office is “unacceptable” and provides evidence that Lynch has chosen to “essentially plead the fifth and refuse to respond to inquiries regarding [her]role in providing cash to the world’s foremost state sponsor of terrorism,” Rubio and Pompeo wrote on Friday in a follow-up letter to Lynch. "
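One caveat with gsub(): a bare pattern such as "Print" is removed wherever it occurs, including inside longer words, and deleting "\n" outright can glue two words together when no space surrounds the line break. The sketch below shows a stricter variant on two invented strings. It is an optional alternative under those assumptions, not the cleaning used to produce the outputs in this lab.

```r
# Toy strings, invented for this illustration (not taken from the dataset)
posts <- c("Print The administration is blocking a probe.",
           "Printing presses ran all night.\nMore at eleven.")

# A bare pattern matches anywhere, so "Printing" loses its first five letters
naive <- gsub("Print", "", posts)

# \\b marks a word boundary, so only the standalone word "Print" is removed
cleaned <- gsub("\\bPrint\\b", "", posts)

# Replacing the line break with a space keeps neighbouring words apart
cleaned <- gsub("\n", " ", cleaned)

# Trimming leading and trailing whitespace left over from the removals
cleaned <- trimws(cleaned)

print(naive)
print(cleaned)
```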

2.2 Creating a Document Term Matrix

A Document Term Matrix (DTM) records how often each term occurs in each document of a collection: each row is a document, each column is a term, and each cell is a count. The documents can be stored as separate files or as separate elements of a vector (as in the current lab material).

While creating a DTM, we remove punctuation, numbers and stopwords (i.e. common words like “the”, “a”, etc.), and perform stemming (i.e. reduce words to their “stem”; for instance, “elections” and “elected” both become “elect”).

	Term 1	Term 2	Term 3	Term 4	…
Document 1	1	0	3	0	…
Document 2	0	0	0	0	…
Document 3	0	1	0	0	…
Document 4	3	0	3	0	…
Document 5	2	0	1	0	…
…	…	…	…	…	…
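To make the structure of the table concrete, here is a minimal base R sketch that builds a tiny Document Term Matrix by hand. The three documents and all object names are invented for this illustration, and the sketch skips stopword removal and stemming; it is not how the tm package builds its matrix internally.

```r
# Three toy documents, invented for this illustration
docs <- c(d1 = "election fraud election",
          d2 = "fraud claims",
          d3 = "election claims claims")

# Splitting each document into lowercase words
tokens <- strsplit(tolower(docs), " ")

# The vocabulary is the sorted set of unique words across all documents
vocab <- sort(unique(unlist(tokens)))

# Counting each vocabulary word in each document:
# one row per document, one column per term
toyDTM <- sapply(vocab, function(term) sapply(tokens, function(w) sum(w == term)))
rownames(toyDTM) <- names(docs)

print(toyDTM)
```

Each cell answers the question "how many times does this term occur in this document?", which is exactly what the larger matrix created with tm stores for the 5000 posts.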

We will be using the tm (text mining) package to perform text analysis. Please install this package using install.packages() and load it using library(). Stemming also relies on the SnowballC package, so install it as well.

# Loading the tm package
library(tm)

# Creating a corpus from the text within the "myData$totalText" vector
myCorpus <- Corpus(VectorSource(myData$totalText),readerControl = list(language = "eng"))

# Creating a Document Term Matrix
# wordLengths = c(3, Inf) keeps terms of at least 3 characters (this is also tm's default)
dtm <- DocumentTermMatrix(myCorpus, control = list(stemming = TRUE, stopwords = TRUE, wordLengths = c(3, Inf),
                                               removeNumbers = TRUE, removePunctuation = TRUE))

# Looking at the dimensions of the Document Term Matrix
dim(dtm)

## [1]  5000 89818

3. Calculating unigrams

Here, we convert the Document Term Matrix into a regular matrix and create a variable that counts the number of posts in which a given term (or unigram) appears. In other words, we want to know which words appear in the most fake news posts. This is just one way to measure the prevalence of terms, and it is the one you should use to answer the problem set questions on text analysis.
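Before running this on the full matrix, it may help to see the logic on a small scale. The sketch below uses an invented 4 x 3 count matrix (the term labels and object names are made up for this example) to show that we count the posts containing a term, not the total number of times the term occurs.

```r
# Toy count matrix, invented for this illustration: 4 documents x 3 terms
toy <- matrix(c(2, 0, 1,
                0, 0, 3,
                5, 1, 0,
                1, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4), c("trump", "elect", "vote")))

# (toy > 0) is TRUE wherever a term occurs at least once; multiplying by 1
# turns TRUE/FALSE into 1/0, and summing each column counts the documents
docFreq <- apply((toy > 0) * 1, 2, sum)

# colSums() on the logical matrix gives the same counts
docFreq2 <- colSums(toy > 0)

# Share of documents containing each term
docShare <- docFreq / nrow(toy)

print(docFreq)
print(docShare)
```

In this toy matrix, "vote" occurs 4 times in total but in only 2 of the 4 documents, so its document share is 0.5.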

# Converting the Document Term Matrix to a regular matrix (this can be memory-intensive for a matrix this large)
myWords <- as.matrix(dtm)

# Calculating the number of posts in which a given term appears
v1 <- apply((myWords > 0)*1,2,sum)

# Displaying the first 10 terms in that vector
v1[1:10]

##         ––––     –‘ballot        –‘big      –“cool”      –“could       –“grab 
##            1            1            1            1            1            1 
##        –“old       –“our” –“profession     –“spies” 
##            1            1            1            1

# Generating a percentage, expressed as a proportion between 0 and 1 (multiply by 100 for percent)
v2 <- v1/dim(dtm)[1]

# What is the highest percentage of posts a term appears in?
max(v2)

## [1] 0.5148

# Which term appears in the highest percentage of posts?
which(v2==max(v2))

##  will 
## 81727

# Displaying the 30 terms that appear in the most posts
v2[order(v2,decreasing=TRUE)][1:30]

##    will     one   peopl    time    like   state    also     can    just     new 
##  0.5148  0.4874  0.4464  0.4296  0.4270  0.4168  0.4004  0.4002  0.3966  0.3898 
##     now    said    year     say    even     get     use    make   trump    mani 
##  0.3890  0.3878  0.3862  0.3678  0.3670  0.3542  0.3510  0.3466  0.3258  0.3228 
## hillari  report clinton    take     day   elect    work    call   right    come 
##  0.3216  0.3212  0.3172  0.3156  0.3142  0.3138  0.3124  0.3106  0.3036  0.3026

# Keeping only the terms that are in more than 5% of the posts
dtm2 <- dtm[,v2 > .05]

# How many words/columns remain?
dim(dtm2)

## [1] 5000  915

# Comparing with the original dtm
dim(dtm)

## [1]  5000 89818

# How many terms were removed? (the result is negative because dtm2 has fewer columns than dtm)
dim(dtm2)[2]-dim(dtm)[2]

## [1] -88903

Exercise

Conduct a similar analysis on the myData$thread_title variable using the following instructions.

Step 1

Load in the dataset.

Relevant function: read.csv().

Step 2

Create a corpus.

Relevant functions: Corpus(), VectorSource().

Step 3

Create a Document Term Matrix.

Relevant function: DocumentTermMatrix().

Step 4

Create a matrix from the Document Term Matrix and calculate the number (not the percentage) of thread titles in which a given term appears.

Relevant functions: as.matrix(), apply().

Step 5

Display the 5 most used terms.

Relevant function: order().

# Answers available on OWL