In this recitation, we will: (1) load a dataset, (2) manipulate text strings, (3) preprocess the content of a textual variable by creating a Document Term Matrix and (4) create a new variable representing the number of unique unigrams.  

The dataset we will be working on today contains text from fake news sources on the Web. It contains text from 244 websites and represents 12999 posts in total (we will be using a subset of 5000 posts in this recitation). Fake news Web pages were identified using the BS Detector Chrome Extension by Daniel Sieradski.

 


Deliberate misinformation has become a salient theme in U.S. politics

 

 

1. Loading the dataset

 

First, we load the FakeNewsData.csv file into our R session using read.csv().

 

# Loading the dataset using read.csv()
myData <- read.csv("/Users/evelynebrie/Dropbox/TA/PSCI_107_Fall2017/Data_Science/Recitations/Week_13/FakeNewsData.csv")

# Looking at the dimensions of the dataset using dim()
dim(myData)
## [1] 5000   21
# Looking at the variable names using colnames()
colnames(myData)
##  [1] "X"                  "uuid"               "ord_in_thread"     
##  [4] "author"             "published"          "title"             
##  [7] "text"               "language"           "crawled"           
## [10] "site_url"           "country"            "domain_rank"       
## [13] "thread_title"       "spam_score"         "main_img_url"      
## [16] "replies_count"      "participants_count" "likes"             
## [19] "comments"           "shares"             "type"
# Viewing the class attribute of each variable using sapply() and class()
sapply(myData, FUN=class)
##                  X               uuid      ord_in_thread 
##          "integer"           "factor"          "integer" 
##             author          published              title 
##           "factor"           "factor"           "factor" 
##               text           language            crawled 
##           "factor"           "factor"           "factor" 
##           site_url            country        domain_rank 
##           "factor"           "factor"          "integer" 
##       thread_title         spam_score       main_img_url 
##           "factor"          "numeric"           "factor" 
##      replies_count participants_count              likes 
##          "integer"          "integer"          "integer" 
##           comments             shares               type 
##          "integer"          "integer"           "factor"
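
The class listing above shows that the text columns were read in as factors, which was the default behaviour of read.csv() before R 4.0. As a minimal sketch, assuming a small made-up CSV string (csvText) in place of FakeNewsData.csv, passing stringsAsFactors = FALSE reads such columns as character from the start:

```r
# Hypothetical two-row CSV standing in for the real file
csvText <- "title,text,likes\nFirst title,Some text,3\nSecond title,More text,0"

# stringsAsFactors = FALSE keeps text columns as character vectors
toyData <- read.csv(text = csvText, stringsAsFactors = FALSE)

# Checking the class of each column
sapply(toyData, FUN = class)
```

With character columns, functions such as gsub() and paste() work on the text directly, without going through factor levels.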

 

2. Preprocessing the textual content

 

Here are some useful functions to modify text strings in R:

 

Function      Description
gsub()        Replaces every match of a pattern within a text string
strsplit()    Splits a text string into pieces at a given separator
paste()       Joins separate elements into a single text string
substr()      Extracts a subset of characters from a text string
toupper()     Converts all letters in a text string to uppercase

 

Examples of text manipulation using these functions are available here.
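
As a quick illustration of the five functions listed above, here is a short sketch on a made-up string (x is an assumption for illustration and does not come from the dataset):

```r
# A made-up example string
x <- "Fake news spreads fast"

# gsub(): replace every match of a pattern
replaced <- gsub("fast", "quickly", x)

# strsplit(): split on a separator; it returns a list, so we take element 1
words <- strsplit(x, " ")[[1]]

# paste(): join elements into a single string
joined <- paste("Title:", x, sep = " ")

# substr(): extract characters 1 through 4
firstFour <- substr(x, 1, 4)

# toupper(): convert all letters to uppercase
shouting <- toupper(x)
```

Note that strsplit() returns a list with one element per input string, which is why the [[1]] is needed to get at the vector of words.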

 

2.1 Merging two textual variables to create a totalText variable

 

# Let's say we want to create a new variable with the text of the article and the title combined
myData$totalText <- paste(myData$title, myData$text, sep=" ")

# Making sure the variable is of class character (paste() already returns character, so this is a safeguard)
myData$totalText <- as.character(myData$totalText)

# Displaying the content of the first two posts
myData$totalText[1:2]
## [1] "Muslims BUSTED: They Stole Millions In Gov’t Benefits Print They should pay all the back all the money plus interest. The entire family and everyone who came in with them need to be deported asap. Why did it take two years to bust them? \nHere we go again …another group stealing from the government and taxpayers! A group of Somalis stole over four million in government benefits over just 10 months! \nWe’ve reported on numerous cases like this one where the Muslim refugees/immigrants commit fraud by scamming our system…It’s way out of control! More Related"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
## [2] "Re: Why Did Attorney General Loretta Lynch Plead The Fifth? Why Did Attorney General Loretta Lynch Plead The Fifth? Barracuda Brigade 2016-10-28 Print The administration is blocking congressional probe into cash payments to Iran. Of course she needs to plead the 5th. She either can’t recall, refuses to answer, or just plain deflects the question. Straight up corruption at its finest! \n100percentfedUp.com ; Talk about covering your ass! Loretta Lynch did just that when she plead the Fifth to avoid incriminating herself over payments to Iran…Corrupt to the core! Attorney General Loretta Lynch is declining to comply with an investigation by leading members of Congress about the Obama administration’s secret efforts to send Iran $1.7 billion in cash earlier this year, prompting accusations that Lynch has “pleaded the Fifth” Amendment to avoid incriminating herself over these payments, according to lawmakers and communications exclusively obtained by the Washington Free Beacon. \nSen. Marco Rubio (R., Fla.) and Rep. Mike Pompeo (R., Kan.) initially presented Lynch in October with a series of questions about how the cash payment to Iran was approved and delivered. \nIn an Oct. 24 response, Assistant Attorney General Peter Kadzik responded on Lynch’s behalf, refusing to answer the questions and informing the lawmakers that they are barred from publicly disclosing any details about the cash payment, which was bound up in a ransom deal aimed at freeing several American hostages from Iran. \nThe response from the attorney general’s office is “unacceptable” and provides evidence that Lynch has chosen to “essentially plead the fifth and refuse to respond to inquiries regarding [her]role in providing cash to the world’s foremost state sponsor of terrorism,” Rubio and Pompeo wrote on Friday in a follow-up letter to Lynch. More Related"
# Removing irrelevant terms using gsub()
myData$totalText <- gsub("Print","",myData$totalText)
myData$totalText <- gsub("More Related","",myData$totalText)
myData$totalText <- gsub("\n","",myData$totalText)

# Sanity check: displaying the content of the first two posts again
myData$totalText[1:2]
## [1] "Muslims BUSTED: They Stole Millions In Gov’t Benefits  They should pay all the back all the money plus interest. The entire family and everyone who came in with them need to be deported asap. Why did it take two years to bust them? Here we go again …another group stealing from the government and taxpayers! A group of Somalis stole over four million in government benefits over just 10 months! We’ve reported on numerous cases like this one where the Muslim refugees/immigrants commit fraud by scamming our system…It’s way out of control! "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [2] "Re: Why Did Attorney General Loretta Lynch Plead The Fifth? Why Did Attorney General Loretta Lynch Plead The Fifth? Barracuda Brigade 2016-10-28  The administration is blocking congressional probe into cash payments to Iran. Of course she needs to plead the 5th. She either can’t recall, refuses to answer, or just plain deflects the question. Straight up corruption at its finest! 100percentfedUp.com ; Talk about covering your ass! Loretta Lynch did just that when she plead the Fifth to avoid incriminating herself over payments to Iran…Corrupt to the core! Attorney General Loretta Lynch is declining to comply with an investigation by leading members of Congress about the Obama administration’s secret efforts to send Iran $1.7 billion in cash earlier this year, prompting accusations that Lynch has “pleaded the Fifth” Amendment to avoid incriminating herself over these payments, according to lawmakers and communications exclusively obtained by the Washington Free Beacon. Sen. Marco Rubio (R., Fla.) and Rep. Mike Pompeo (R., Kan.) initially presented Lynch in October with a series of questions about how the cash payment to Iran was approved and delivered. In an Oct. 24 response, Assistant Attorney General Peter Kadzik responded on Lynch’s behalf, refusing to answer the questions and informing the lawmakers that they are barred from publicly disclosing any details about the cash payment, which was bound up in a ransom deal aimed at freeing several American hostages from Iran. The response from the attorney general’s office is “unacceptable” and provides evidence that Lynch has chosen to “essentially plead the fifth and refuse to respond to inquiries regarding [her]role in providing cash to the world’s foremost state sponsor of terrorism,” Rubio and Pompeo wrote on Friday in a follow-up letter to Lynch. "
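
One thing to watch for in the cleaning step above: replacing "\n" with an empty string fuses two words whenever the line break is not next to a space, and the pattern "Print" also matches inside longer words such as "Printing". In the two posts shown, every line break follows a space, so nothing is fused there. A possible alternative, sketched on a made-up string (raw is an assumption for illustration), replaces line breaks with a space, matches Print only as a whole word, and then collapses repeated spaces:

```r
# Made-up string; the line break has no space on either side
raw <- "Print Printing errors found\nMore details soon More Related"

# Replace line breaks with a space instead of deleting them
clean <- gsub("\n", " ", raw)

# \\b marks a word boundary, so "Printing" is left intact
clean <- gsub("\\bPrint\\b", "", clean)
clean <- gsub("More Related", "", clean)

# Collapse runs of spaces and trim both ends
clean <- trimws(gsub(" +", " ", clean))
clean
```

This is one way to do it, not the approach used in the recitation; it would produce slightly different spacing from the output shown above.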

 

2.2 Creating a Document Term Matrix (with punctuation removal, stopword removal and stemming)

 

A Document Term Matrix records how often each term occurs in each document of a collection: every row is a document, every column is a term, and every cell is a count. The documents can be stored as separate files or as separate elements of a vector (as in the current recitation and in PS6).

 

             Term 1  Term 2  Term 3  Term 4
Document 1        1       0       3       0
Document 2        0       0       0       0
Document 3        0       1       0       0
Document 4        3       0       3       0
Document 5        2       0       1       0
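
To make the structure concrete, a small Document Term Matrix can be built by hand in base R. This is only a sketch on three made-up documents (docs, vocab and toyDtm are names chosen for illustration); it does no stemming or stopword removal, which the tm package handles in the real analysis:

```r
# Three made-up documents stored as elements of a character vector
docs <- c("fake news spreads", "news travels fast", "fake fake claims")

# Split each document into lowercase words
tokens <- strsplit(tolower(docs), " ")

# The vocabulary is the sorted set of unique words
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term in each document (rows = documents)
toyDtm <- t(sapply(tokens, function(w) as.vector(table(factor(w, levels = vocab)))))
rownames(toyDtm) <- paste("Document", seq_along(docs))
colnames(toyDtm) <- vocab
toyDtm
```

Using factor() with fixed levels guarantees that every document gets a count for every vocabulary term, including zeros, so all rows have the same length.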

 

We will use the tm (text mining) package to perform text analysis. Please install it using install.packages("tm") and load it using library(tm). The stemming option used below also relies on the SnowballC package, so install it as well if it is missing.

 

library(tm)
## Warning: package 'tm' was built under R version 3.4.2
## Loading required package: NLP
# Creating a corpus from the text within the "myData$totalText" vector
myCorpus <- Corpus(VectorSource(myData$totalText),readerControl = list(language = "eng"))

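# Creating a Document Term Matrix
# (wordLengths = c(3, Inf) keeps terms of at least 3 characters; current tm versions use it in place of the older minWordLength option)
dtm <- DocumentTermMatrix(myCorpus, control = list(stemming = TRUE, stopwords = TRUE, wordLengths = c(3, Inf),
                                               removeNumbers = TRUE, removePunctuation = TRUE))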

# Looking at the dimensions of the Document Term Matrix
dim(dtm)
## [1]  5000 71767

 

3. Coding the variable with the presence of unigrams

 

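Here, we convert the Document Term Matrix to a regular matrix and recode each cell as the presence (1) or absence (0) of a unigram in a post. Summing these indicators by column gives the number of posts in which each term appears.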
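
Before running this on the full matrix, the idea can be sketched on a made-up count matrix (toyCounts and its values are assumptions for illustration). The recitation code below uses apply(); colSums() and rowSums() are swapped in here as equivalent base R shortcuts. The row sums of the same presence matrix give the number of unique unigrams in each post:

```r
# Made-up counts: 3 posts (rows) by 3 terms (columns)
toyCounts <- matrix(c(2, 0, 1,
                      0, 0, 3,
                      1, 1, 1),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(paste("Post", 1:3), c("elect", "trump", "vote")))

# TRUE/FALSE presence indicators converted to 1/0
presence <- (toyCounts > 0) * 1

# Column sums: number of posts in which each term appears
postsPerTerm <- colSums(presence)

# Row sums: number of unique unigrams in each post
uniquePerPost <- rowSums(presence)
```

On the real data, the same row sum could be stored as a new column of myData, assuming the rows of the matrix are in the same order as the rows of the data frame.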

 

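# Converting the Document Term Matrix to a regular matrix (rows = posts, columns = terms)
# Note: this dense matrix has 5000 x 71767 cells, so it can take a lot of memory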
myWords <- as.matrix(dtm)

# Calculating the number of posts in which a given term appears
v1 <- apply((myWords > 0)*1,2,sum)

# Displaying the first 10 terms in that vector
v1[1:10]
##        — a  — publish    million       ––––   –‘ballot      –‘big 
##          1          1          2          1          1          1 
##    –“cool”    –“could     –“grab      –“old 
##          1          1          1          1
# Converting counts to the proportion of posts in which each term appears
v2 <- v1/dim(dtm)[1]

# What is the highest proportion of posts a term appears in?
max(v2)
## [1] 0.5152
# Which term appears in the highest proportion of posts?
which(v2==max(v2))
##  will 
## 64136
# Displaying the 30 terms that appear in the most posts
v2[order(v2,decreasing=TRUE)][1:30]
##    will     one   peopl    time    like   state    year     can    also 
##  0.5152  0.4994  0.4522  0.4462  0.4326  0.4256  0.4202  0.4086  0.4024 
##     now    just     new    said     say    even     get     use    make 
##  0.3994  0.3980  0.3910  0.3900  0.3710  0.3682  0.3554  0.3528  0.3514 
##   elect   trump    call hillari    mani     day  report   right clinton 
##  0.3296  0.3290  0.3282  0.3246  0.3246  0.3238  0.3230  0.3230  0.3206 
##    work    take  presid 
##  0.3200  0.3180  0.3152
# Keeping only the terms that are in more than 5% of the posts
dtm2 <- dtm[,v2 > .05]

# How many words/columns remain?
dim(dtm2)
## [1] 5000  950
# Comparing with the original dtm.
dim(dtm)
## [1]  5000 71767
# How many terms were removed? (original number of columns minus remaining columns)
dim(dtm)[2]-dim(dtm2)[2]
## [1] 70817
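
The filtering step above can be checked on a small scale. The sketch below uses a made-up count matrix and a 30% threshold chosen only to suit the toy data (toyCounts, share, kept and removed are illustrative names); colMeans() on a logical matrix is used as a shortcut for counting posts and dividing by the number of posts:

```r
# Made-up counts: 4 posts (rows) by 3 terms (columns)
toyCounts <- matrix(c(1, 0, 0,
                      2, 0, 1,
                      0, 0, 0,
                      3, 1, 0),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(NULL, c("elect", "rare", "vote")))

# Share of posts containing each term
share <- colMeans(toyCounts > 0)

# Keep only the terms found in more than 30% of posts
kept <- toyCounts[, share > 0.3, drop = FALSE]

# Number of terms removed: original columns minus remaining columns
removed <- ncol(toyCounts) - ncol(kept)
```

The drop = FALSE argument keeps the result a matrix even when only one column survives the filter, which avoids surprises when calling dim() or ncol() afterwards.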

 


Exercises


 

Conduct a similar analysis on the myData$thread_title variable using the following instructions.

 

Step 1

Load in the dataset.

Relevant function: read.csv().

 

Step 2

Create a corpus.

Relevant functions: Corpus(), VectorSource().

 

Step 3

Create a Document Term Matrix.

Relevant function: DocumentTermMatrix().

 

Step 4

Create a matrix from the Document Term Matrix and calculate the percentage of thread titles in which a given term appears.

Relevant functions: as.matrix(), apply().

 

Step 5

Display the 20 most used terms.

Relevant function: order().

 

Step 6

Create a new Document Term Matrix with only the terms that are in more than 5% of the thread titles. How many terms were removed from the original Document Term Matrix?

Relevant tools: logical operators (e.g. >) to select columns, dim().