In this recitation, we will: (1) load a dataset, (2) manipulate text strings, (3) preprocess the content of a textual variable by creating a Document Term Matrix and (4) create a new variable representing the number of unique unigrams.
The dataset we will be working on today contains text from fake news sources on the Web. It contains text from 244 websites and represents 12999 posts in total (we will be using a subset of 5000 posts in this recitation). Fake news Web pages were identified using the BS Detector Chrome Extension by Daniel Sieradski.
First, we load the FakeNewsData.csv file into our R session using read.csv().
# Loading the dataset using read.csv()
myData <- read.csv("/Users/evelynebrie/Dropbox/TA/PSCI_107_Fall2017/Data_Science/Recitations/Week_13/FakeNewsData.csv")
# Looking at the dimensions of the dataset using dim()
dim(myData)
## [1] 5000 21
# Looking at the variable names using colnames()
colnames(myData)
##  [1] "X"                  "uuid"               "ord_in_thread"
##  [4] "author"             "published"          "title"
##  [7] "text"               "language"           "crawled"
## [10] "site_url"           "country"            "domain_rank"
## [13] "thread_title"       "spam_score"         "main_img_url"
## [16] "replies_count"      "participants_count" "likes"
## [19] "comments"           "shares"             "type"
# Viewing the class attribute of each variable using sapply() and class()
sapply(myData, FUN=class)
##                  X               uuid      ord_in_thread
##          "integer"           "factor"          "integer"
##             author          published              title
##           "factor"           "factor"           "factor"
##               text           language            crawled
##           "factor"           "factor"           "factor"
##           site_url            country        domain_rank
##           "factor"           "factor"          "integer"
##       thread_title         spam_score       main_img_url
##           "factor"          "numeric"           "factor"
##      replies_count participants_count              likes
##          "integer"          "integer"          "integer"
##           comments             shares               type
##          "integer"          "integer"           "factor"
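The output above shows that read.csv() stored every text column, including title and text, as a factor. One way to get character strings directly is to pass stringsAsFactors = FALSE when reading the file (in R 4.0.0 and later this is already the default). The sketch below uses a small made-up CSV in place of FakeNewsData.csv; the contents and the name toyData are hypothetical.

```r
# Toy CSV standing in for FakeNewsData.csv (hypothetical content)
csvText <- "title,text,likes\nFirst title,Some text here,3\nSecond title,More text,0"

# Read the text columns as character strings rather than factors
toyData <- read.csv(text = csvText, stringsAsFactors = FALSE)

# Check the class of each column
sapply(toyData, FUN = class)
```

With character columns, the later as.character() conversion becomes unnecessary, although it does no harm.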
Here are some useful functions to modify text strings in R:
| Function | Description |
|---|---|
| gsub() | Removes or substitutes part of a text string |
| strsplit() | Splits a text string into pieces |
| paste() | Combines separate elements into a single text string |
| substr() | Selects a subset of a text string |
| toupper() | Converts all characters of a text string to uppercase |
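To see these functions in action, here is a small sketch applied to a made-up headline. The string and the variable names are hypothetical and are not taken from the dataset.

```r
# A made-up headline to experiment with
headline <- "Breaking News: Print this story"

# gsub(): remove the word "Print" and the space after it
cleaned <- gsub("Print ", "", headline)

# strsplit(): split the string on spaces (returns a list, hence [[1]])
words <- strsplit(cleaned, " ")[[1]]

# paste(): combine separate elements into a single string
combined <- paste("Title:", cleaned, sep = " ")

# substr(): keep characters 1 to 8
firstWord <- substr(cleaned, 1, 8)

# toupper(): convert to uppercase
shouted <- toupper(firstWord)

cleaned
words
combined
shouted
```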
# Let's say we want to create a new variable with the text of the article and the title combined
myData$totalText <- paste(myData$title, myData$text, sep=" ")
# Convert this variable to character
myData$totalText <- as.character(myData$totalText)
# Displaying the content of the first two posts
myData$totalText[1:2]
## [1] "Muslims BUSTED: They Stole Millions In Gov’t Benefits Print They should pay all the back all the money plus interest. The entire family and everyone who came in with them need to be deported asap. Why did it take two years to bust them? \nHere we go again …another group stealing from the government and taxpayers! A group of Somalis stole over four million in government benefits over just 10 months! \nWe’ve reported on numerous cases like this one where the Muslim refugees/immigrants commit fraud by scamming our system…It’s way out of control! More Related"
## [2] "Re: Why Did Attorney General Loretta Lynch Plead The Fifth? Why Did Attorney General Loretta Lynch Plead The Fifth? Barracuda Brigade 2016-10-28 Print The administration is blocking congressional probe into cash payments to Iran. Of course she needs to plead the 5th. She either can’t recall, refuses to answer, or just plain deflects the question. Straight up corruption at its finest! \n100percentfedUp.com ; Talk about covering your ass! Loretta Lynch did just that when she plead the Fifth to avoid incriminating herself over payments to Iran…Corrupt to the core! Attorney General Loretta Lynch is declining to comply with an investigation by leading members of Congress about the Obama administration’s secret efforts to send Iran $1.7 billion in cash earlier this year, prompting accusations that Lynch has “pleaded the Fifth” Amendment to avoid incriminating herself over these payments, according to lawmakers and communications exclusively obtained by the Washington Free Beacon. \nSen. Marco Rubio (R., Fla.) and Rep. Mike Pompeo (R., Kan.) initially presented Lynch in October with a series of questions about how the cash payment to Iran was approved and delivered. \nIn an Oct. 24 response, Assistant Attorney General Peter Kadzik responded on Lynch’s behalf, refusing to answer the questions and informing the lawmakers that they are barred from publicly disclosing any details about the cash payment, which was bound up in a ransom deal aimed at freeing several American hostages from Iran. \nThe response from the attorney general’s office is “unacceptable” and provides evidence that Lynch has chosen to “essentially plead the fifth and refuse to respond to inquiries regarding [her]role in providing cash to the world’s foremost state sponsor of terrorism,” Rubio and Pompeo wrote on Friday in a follow-up letter to Lynch. More Related"
# Removing irrelevant terms using gsub()
myData$totalText <- gsub("Print","",myData$totalText)
myData$totalText <- gsub("More Related","",myData$totalText)
myData$totalText <- gsub("\n","",myData$totalText)
# Sanity check: displaying the content of the first two posts again
myData$totalText[1:2]
## [1] "Muslims BUSTED: They Stole Millions In Gov’t Benefits They should pay all the back all the money plus interest. The entire family and everyone who came in with them need to be deported asap. Why did it take two years to bust them? Here we go again …another group stealing from the government and taxpayers! A group of Somalis stole over four million in government benefits over just 10 months! We’ve reported on numerous cases like this one where the Muslim refugees/immigrants commit fraud by scamming our system…It’s way out of control! "
## [2] "Re: Why Did Attorney General Loretta Lynch Plead The Fifth? Why Did Attorney General Loretta Lynch Plead The Fifth? Barracuda Brigade 2016-10-28 The administration is blocking congressional probe into cash payments to Iran. Of course she needs to plead the 5th. She either can’t recall, refuses to answer, or just plain deflects the question. Straight up corruption at its finest! 100percentfedUp.com ; Talk about covering your ass! Loretta Lynch did just that when she plead the Fifth to avoid incriminating herself over payments to Iran…Corrupt to the core! Attorney General Loretta Lynch is declining to comply with an investigation by leading members of Congress about the Obama administration’s secret efforts to send Iran $1.7 billion in cash earlier this year, prompting accusations that Lynch has “pleaded the Fifth” Amendment to avoid incriminating herself over these payments, according to lawmakers and communications exclusively obtained by the Washington Free Beacon. Sen. Marco Rubio (R., Fla.) and Rep. Mike Pompeo (R., Kan.) initially presented Lynch in October with a series of questions about how the cash payment to Iran was approved and delivered. In an Oct. 24 response, Assistant Attorney General Peter Kadzik responded on Lynch’s behalf, refusing to answer the questions and informing the lawmakers that they are barred from publicly disclosing any details about the cash payment, which was bound up in a ransom deal aimed at freeing several American hostages from Iran. The response from the attorney general’s office is “unacceptable” and provides evidence that Lynch has chosen to “essentially plead the fifth and refuse to respond to inquiries regarding [her]role in providing cash to the world’s foremost state sponsor of terrorism,” Rubio and Pompeo wrote on Friday in a follow-up letter to Lynch. "
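One caveat with the cleanup above: gsub("Print", "", ...) removes the letters "Print" wherever they occur, including inside longer words such as "Printing", and deleting "\n" outright can fuse two words when no space surrounds the line break. A slightly more careful variant, sketched here on a made-up post (the string and variable names are hypothetical), uses word boundaries and replaces line breaks with a space.

```r
# A made-up post containing the same boilerplate terms
post <- "Printing Costs Rise Print Paper prices went up.\nMore Related"

# Naive removal also damages the word "Printing"
naive <- gsub("Print", "", post)

# \\b marks a word boundary, so only the standalone word "Print" is removed
careful <- gsub("\\bPrint\\b", "", post)

# fixed = TRUE treats the pattern as literal text rather than a regular expression
careful <- gsub("More Related", "", careful, fixed = TRUE)

# Replace line breaks with a space instead of deleting them
careful <- gsub("\n", " ", careful, fixed = TRUE)

# Collapse repeated spaces and trim the ends
careful <- trimws(gsub(" +", " ", careful))

naive
careful
```

Whether this extra care matters depends on the data; for the posts shown above, the simpler calls give a readable result.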
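A Document Term Matrix displays the frequency of terms in a set of documents: each row is a document, each column is a term, and each cell counts how many times that term appears in that document. These documents can be stored as different files, or as different elements of a vector (as in the current recitation and in PS6).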
| | Term 1 | Term 2 | Term 3 | Term 4 | … |
|---|---|---|---|---|---|
| Document 1 | 1 | 0 | 3 | 0 | … |
| Document 2 | 0 | 0 | 0 | 0 | … |
| Document 3 | 0 | 1 | 0 | 0 | … |
| Document 4 | 3 | 0 | 3 | 0 | … |
| Document 5 | 2 | 0 | 1 | 0 | … |
| … | … | … | … | … | … |
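To make the structure of this table concrete, the sketch below builds a tiny Document Term Matrix by hand in base R from three made-up documents. This is only an illustration of the idea, not how the tm package does it internally; the documents and variable names are hypothetical.

```r
# Three made-up documents
docs <- c("fake news spreads fast", "news travels fast fast", "nothing here")

# Split each document into words (unigrams)
tokens <- strsplit(docs, " ")

# The vocabulary is the sorted set of unique words across all documents
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word appears in each document
counts <- vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
)

# Transpose so that rows are documents and columns are terms
toyDtm <- t(counts)
dimnames(toyDtm) <- list(paste("Document", seq_along(docs)), vocab)

toyDtm
```

Each row of toyDtm is one document and each column is one term, just as in the table above.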
We will be using the tm (text mining) package to perform text analysis. Please install this package using install.packages() and load it using library().
library(tm)
## Warning: package 'tm' was built under R version 3.4.2
## Loading required package: NLP
# Creating a corpus from the text within the "myData$totalText" vector
myCorpus <- Corpus(VectorSource(myData$totalText),readerControl = list(language = "eng"))
# Creating a Document Term Matrix
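# wordLengths = c(3, Inf) keeps terms of at least 3 characters (older tm releases called this option minWordLength)
dtm <- DocumentTermMatrix(myCorpus, control = list(stemming = TRUE, stopwords = TRUE, wordLengths = c(3, Inf),
removeNumbers = TRUE, removePunctuation = TRUE))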
# Looking at the dimensions of the Document Term Matrix
dim(dtm)
## [1] 5000 71767
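A matrix with 5000 rows and 71767 columns is mostly zeros, which is why tm stores the Document Term Matrix in a sparse format. Converting it to a regular matrix with as.matrix() fills in every cell. The back-of-the-envelope sketch below estimates the size of such a dense copy, assuming 8 bytes per cell (an assumption; the actual storage depends on the cell type).

```r
# Dimensions reported by dim(dtm)
nDocs <- 5000
nTerms <- 71767

# Total number of cells in a dense matrix of that size
nCells <- nDocs * nTerms

# Approximate memory in gigabytes, assuming 8 bytes per cell (assumption)
gigabytes <- nCells * 8 / 1024^3

nCells
round(gigabytes, 2)
```

If memory is tight, it can help to filter out rare terms before converting to a regular matrix.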
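Here, we convert the Document Term Matrix into a regular matrix and recode each cell as the presence (1) or absence (0) of a unigram within each post. Summing each column then gives the number of posts in which each term appears.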
# Creating a matrix with the terms inside the variable
myWords <- as.matrix(dtm)
# Calculating the number of posts in which a given term appears
v1 <- apply((myWords > 0)*1,2,sum)
# Displaying the first 10 terms in that vector
v1[1:10]
## — a — publish million –––– –‘ballot –‘big
## 1 1 2 1 1 1
## –“cool” –“could –“grab –“old
## 1 1 1 1
# Generating a percentage
v2 <- v1/dim(dtm)[1]
# What is the highest percentage of posts a term appears in?
max(v2)
## [1] 0.5152
# Which term appears in the highest percentage of posts?
which(v2==max(v2))
## will
## 64136
# Displaying the 30 most used terms
v2[order(v2,decreasing=T)][1:30]
##    will     one   peopl    time    like   state    year     can    also
##  0.5152  0.4994  0.4522  0.4462  0.4326  0.4256  0.4202  0.4086  0.4024
##     now    just     new    said     say    even     get     use    make
##  0.3994  0.3980  0.3910  0.3900  0.3710  0.3682  0.3554  0.3528  0.3514
##   elect   trump    call hillari    mani     day  report   right clinton
##  0.3296  0.3290  0.3282  0.3246  0.3246  0.3238  0.3230  0.3230  0.3206
##    work    take  presid
##  0.3200  0.3180  0.3152
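The steps used to obtain these shares can be checked on a toy matrix. The sketch below uses made-up counts (the matrix and variable names are hypothetical) and also shows that colSums() on the logical matrix gives the same document counts as the apply() call used above.

```r
# Toy count matrix: 4 posts (rows) and 3 terms (columns)
toyWords <- matrix(
  c(1, 0, 3,
    0, 0, 0,
    0, 1, 0,
    3, 0, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("post", 1:4), c("elect", "trump", "vote"))
)

# Number of posts in which each term appears, as in the recitation
viaApply <- apply((toyWords > 0) * 1, 2, sum)

# Same result: TRUE counts as 1 when summed
viaColSums <- colSums(toyWords > 0)

# Share of posts containing each term
share <- viaColSums / nrow(toyWords)

# Terms ordered from most to least widespread
share[order(share, decreasing = TRUE)]
```

Note that this measures the share of posts containing a term at least once, not how often the term is used overall.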
# Keeping only the terms that are in more than 5% of the posts
dtm2 <- dtm[,v2 > .05]
# How many words/columns remain?
dim(dtm2)
## [1] 5000 950
# Comparing with the original dtm.
dim(dtm)
## [1] 5000 71767
# How many terms were removed? (The difference is negative because dtm2 has fewer columns than dtm)
dim(dtm2)[2]-dim(dtm)[2]
## [1] -70817
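The filtering step relies on logical indexing: the comparison produces a vector of TRUE and FALSE values, one per term, and only the columns marked TRUE are kept. A minimal sketch on made-up numbers (all names and values here are hypothetical):

```r
# Made-up shares of posts containing each term
toyShare <- c(will = 0.52, trump = 0.33, ballot = 0.04, cool = 0.01)

# Made-up count matrix with one column per term
toyCounts <- matrix(1:8, nrow = 2,
                    dimnames = list(c("post1", "post2"), names(toyShare)))

# Logical vector: TRUE for terms found in more than 5% of posts
keep <- toyShare > 0.05

# Keep only the columns marked TRUE (drop = FALSE keeps the matrix structure)
filtered <- toyCounts[, keep, drop = FALSE]

# Number of terms removed, counted two equivalent ways
nRemoved <- sum(!keep)
ncol(toyCounts) - ncol(filtered)
```

Subtracting the smaller column count from the larger one gives a positive number of removed terms.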
Conduct a similar analysis on the myData$thread_title variable using the following instructions.
1. Load in the dataset.
   Relevant function: read.csv().
2. Create a corpus.
   Relevant functions: Corpus(), VectorSource().
3. Create a Document Term Matrix.
   Relevant function: DocumentTermMatrix().
4. Create a matrix from the Document Term Matrix and calculate the percentage of thread titles in which a given term appears.
   Relevant functions: as.matrix(), apply().
5. Display the 20 most used terms.
   Relevant function: order().
6. Create a new Document Term Matrix with only the terms that are in more than 5% of the thread titles. How many terms were removed from the original Document Term Matrix?
   Relevant tools: logical operators.