Immigration has become one of the most important and divisive issues in American politics. To understand the role of this issue in political discourse, I use text analysis to identify the words and phrases most closely associated with uses of the word “immigrant” and its other forms.
I start by loading the relevant libraries.
library(tidyverse)
library(tm)
library(text2vec)
library(data.table)
library(stringi)
I use the scan() command from base R to load the Twitter, blog, and news data. In each file, one record is separated from the next by a newline character, denoted \n.
twitter <- scan("en_US.twitter.txt", character(0), sep = "\n")
## Warning in scan("en_US.twitter.txt", character(0), sep = "\n"): embedded
## nul(s) found in input
blogs <- scan("en_US.blogs.txt", character(0), sep = "\n")
news <- scan("en_US.news.txt", character(0), sep = "\n")
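The warning for the Twitter file indicates embedded nul characters; by default, scan() terminates the affected field at the nul and warns. If those characters should simply be discarded, the file can be re-read with skipNul = TRUE (a minor variation of my own, not required for the analysis):
# Re-read the Twitter file, silently skipping embedded nul characters
twitter <- scan("en_US.twitter.txt", character(0), sep = "\n", skipNul = TRUE)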
To show the distribution of word counts, I calculate the number of words in each record using the stri_count_words() command in the stringi library.
wordcount.twitter <- stri_count_words(twitter)
wordcount.blogs <- stri_count_words(blogs)
wordcount.news <- stri_count_words(news)
Each set of data is currently a long character vector. I want to perform calculations while treating these vectors as data frames. So I employ the data.frame() command.
twitter <- data.frame(twitter)
blogs <- data.frame(blogs)
news <- data.frame(news)
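Note that older versions of R convert character vectors to factors inside data.frame() by default, which is why as.character() appears in the mutate() calls below. An alternative (a sketch, relevant only if that default applies to your R version) is to keep the text as character from the start, in place of the three calls above:
# Alternative to the data.frame() calls above: keep the text columns as
# character rather than factor
twitter <- data.frame(twitter, stringsAsFactors = FALSE)
blogs <- data.frame(blogs, stringsAsFactors = FALSE)
news <- data.frame(news, stringsAsFactors = FALSE)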
I use the mutate() command to place a numeric ID within each dataset. I convert the text to lowercase and use grepl() to create a binary indicator for whether each record contains the string “immigra”, which encapsulates the word “immigrant” and its derivatives.
twitter.nrow <- nrow(twitter)
blogs.nrow <- nrow(blogs)
news.nrow <- nrow(news)
twitter <- mutate(twitter,
                  ID = 1:twitter.nrow,
                  twitter = tolower(as.character(twitter)),
                  immigrant = grepl("immigra", twitter))
blogs <- mutate(blogs,
                ID = 1:blogs.nrow,
                blogs = tolower(as.character(blogs)),
                immigrant = grepl("immigra", blogs))
news <- mutate(news,
               ID = 1:news.nrow,
               news = tolower(as.character(news)),
               immigrant = grepl("immigra", news))
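As a quick illustration of the pattern match (a toy example of my own, not drawn from the data), grepl() returns TRUE whenever the substring appears anywhere in the record:
# The pattern matches "immigrant", "immigration", "immigrants", and so on
grepl("immigra", c("a new immigration bill", "the immigrants arrived", "no mention here"))
## [1]  TRUE  TRUE FALSE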
The goal here is to tokenize the records in each data set so that we can count distinct words and see which words are most closely related to the presence of a mention of immigration in a record.
Note: Text analysis is something I am learning about for the first time. In the following report, I follow the workflow used in the vignette for the text2vec package. See browseVignettes(package = "text2vec"). I apply these commands to the en_US data on Twitter posts, blog posts, and news articles. Outside of using the commands suggested by the text2vec vignette, the following work is my own.
The first step in building a document-term matrix for each data frame is to convert the data frames to data.table objects with the setDT() command and to set the ID column as the key with setkey().
setDT(twitter)
setDT(blogs)
setDT(news)
setkey(twitter, ID)
setkey(blogs, ID)
setkey(news, ID)
Stop words are simple and common words that are not meaningful for our analysis. I remove them to reduce the size of the vocabulary from each data source and to improve model accuracy by reducing noise. The following list of stop words is taken from http://xpo6.com/list-of-english-stop-words/. I removed the word “bill” from this list because it might refer to legislation.
stopwords <- c("a", "about", "above", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along", "already", "also","although","always","am","among", "amongst", "amoungst", "amount", "an", "and", "another", "any","anyhow","anyone","anything","anyway", "anywhere", "are", "around", "as", "at", "back","be","became", "because","become","becomes", "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", "between", "beyond", "both", "bottom","but", "by", "call", "can", "cannot", "cant", "co", "con", "could", "couldnt", "cry", "de", "describe", "detail", "do", "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven","else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fifty", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "ie", "if", "in", "inc", "indeed", "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own","part", "per", "perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she", "should", "show", "side", "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they", "thick", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves", "the")
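A quick check (my own addition) confirms that “bill” is not in the final list:
# Verify that "bill" was removed from the stop word list
"bill" %in% stopwords
## [1] FALSE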
We tokenize each record by word and build a vocabulary for each of the three data sources using the itoken() and create_vocabulary() functions from the text2vec library.
it_twitter <- itoken(twitter$twitter,
                     tokenizer = word_tokenizer,
                     ids = twitter$ID,
                     progressbar = FALSE)
it_blogs <- itoken(blogs$blogs,
                   tokenizer = word_tokenizer,
                   ids = blogs$ID,
                   progressbar = FALSE)
it_news <- itoken(news$news,
                  tokenizer = word_tokenizer,
                  ids = news$ID,
                  progressbar = FALSE)
vocab.twitter <- create_vocabulary(it_twitter, stopwords = stopwords)
vocab.blogs <- create_vocabulary(it_blogs, stopwords = stopwords)
vocab.news <- create_vocabulary(it_news, stopwords = stopwords)
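Before pruning, it can be helpful to glance at the most frequent terms. The sketch below assumes a text2vec version in which the vocabulary object behaves like a data frame with term and term_count columns; the structure differs in older releases, so adjust accordingly:
# Ten most frequent terms in the Twitter vocabulary (structure depends on
# the text2vec version)
head(vocab.twitter[order(vocab.twitter$term_count, decreasing = TRUE), ], 10)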
I prune each vocabulary by removing words that appear too often (in more than half of the records) or too rarely (fewer than three times in total).
pruned_twitter <- prune_vocabulary(vocab.twitter,
                                   term_count_min = 3,
                                   doc_proportion_max = 0.5)
vector.twitter <- vocab_vectorizer(pruned_twitter)
pruned_blogs <- prune_vocabulary(vocab.blogs,
                                 term_count_min = 3,
                                 doc_proportion_max = 0.5)
vector.blogs <- vocab_vectorizer(pruned_blogs)
pruned_news <- prune_vocabulary(vocab.news,
                                term_count_min = 3,
                                doc_proportion_max = 0.5)
vector.news <- vocab_vectorizer(pruned_news)
Finally, I use the create_dtm() function to create the document-term matrices.
dtm.twitter <- create_dtm(it_twitter, vector.twitter)
dtm.blogs <- create_dtm(it_blogs, vector.blogs)
dtm.news <- create_dtm(it_news, vector.news)
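As a quick sanity check (my own addition, not part of the text2vec vignette workflow), I can confirm that immigration-related terms survived pruning by searching the column names of a document-term matrix:
# Immigration-related terms remaining in the pruned Twitter vocabulary
grep("immigra", colnames(dtm.twitter), value = TRUE)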
The first important summary is the number of records with and without a mention of immigration in each of the data frames.
table(twitter$immigrant)
##
## FALSE TRUE
## 2359676 472
table(blogs$immigrant)
##
## FALSE TRUE
## 897837 1451
table(news$immigrant)
##
## FALSE TRUE
## 1005808 4434
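Because the three sources differ so much in size, the share of records mentioning immigration is easier to compare than the raw counts. A quick calculation (my own addition):
# Proportion of records in each source that mention immigration
c(twitter = mean(twitter$immigrant),
  blogs = mean(blogs$immigrant),
  news = mean(news$immigrant))
Based on the tables above, this works out to roughly 0.02% of tweets, 0.16% of blog records, and 0.44% of news records.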
To see the number of words remaining in the pruned vocabulary of each data source, I use the dim() function; the second value returned is the number of terms (columns) in the document-term matrix.
dim(dtm.twitter)
## [1] 2360148 101343
dim(dtm.blogs)
## [1] 899288 109454
dim(dtm.news)
## [1] 1010242 108498
To see how the word counts are distributed across documents of each type, I summarize the word counts; standard deviations are computed separately below.
summary(wordcount.twitter)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 12.75 18.00 47.00
summary(wordcount.blogs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 28.00 41.75 60.00 6726.00
summary(wordcount.news)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.41 46.00 1796.00
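Since summary() does not report a measure of spread beyond the quartiles, the standard deviations of the word counts can be computed directly (my own addition to complement the summaries above):
# Standard deviation of the words-per-record distributions
sd(wordcount.twitter)
sd(wordcount.blogs)
sd(wordcount.news)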
Next I use histograms to display the distribution of word counts from each data source. I start with the Twitter data.
g <- ggplot(data.frame(wordcount.twitter), aes(x = wordcount.twitter)) +
  geom_histogram(bins = 32, fill = I("blue"), col = I("red")) +
  xlab("Number of words") +
  ylab("Number of documents") +
  ggtitle("Distribution of Word Counts in the Twitter Data")
g
Next I show the word counts in the blogs data. To keep the histogram readable, I first drop the small number of blog records with 500 or more words.
wordcount.blogs <- wordcount.blogs[wordcount.blogs < 500]
g <- ggplot(data.frame(wordcount.blogs), aes(x = wordcount.blogs)) +
  geom_histogram(binwidth = 20, fill = I("blue"), col = I("red")) +
  xlab("Number of words") +
  ylab("Number of documents") +
  ggtitle("Distribution of Word Counts in the Blogs Data")
g
Finally I show the word counts in the news data, again dropping the few records with 250 or more words.
wordcount.news <- wordcount.news[wordcount.news < 250]
g <- ggplot(data.frame(wordcount.news), aes(x = wordcount.news)) +
  geom_histogram(binwidth = 20, fill = I("blue"), col = I("red")) +
  xlab("Number of words") +
  ylab("Number of documents") +
  ggtitle("Distribution of Word Counts in the News Data")
g
In this report I showed that while there are far more Twitter posts in the data than news articles or blog posts, most of the documents that mention immigration are news articles. All three data sources have pruned vocabularies of roughly equal size, at about 100,000 to 110,000 terms. Blog posts are the longest records by average word count, followed by news articles and then tweets, and the distribution of blog word counts is heavily right-skewed.
The research question is: do tweets, blogs, and news articles discuss immigration using different language? To answer it, I need to determine the words or n-grams that are most closely associated with immigration within the Twitter, blog, and news data separately, and then compare the sets identified from each source. To do this, I will use the glmnet package to fit a logistic regression model with an L1 penalty and 4-fold cross-validation, as sketched below.
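A minimal sketch of that planned model for the Twitter data, assuming the glmnet package is installed. create_dtm() returns a sparse matrix, which cv.glmnet() accepts directly as the predictor matrix; this is a plan rather than a result.
library(glmnet)
# L1-penalized logistic regression of the immigration indicator on the
# Twitter document-term matrix, with 4-fold cross-validation
cv.fit <- cv.glmnet(x = dtm.twitter,
                    y = factor(twitter$immigrant),
                    family = "binomial",
                    alpha = 1,
                    nfolds = 4)
# Terms with nonzero coefficients at the lambda minimizing cross-validated error
coefs <- as.matrix(coef(cv.fit, s = "lambda.min"))
head(rownames(coefs)[coefs != 0], 20)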