Background

In partnership with SwiftKey, the Johns Hopkins Data Science Specialization capstone involves building a text-prediction algorithm and incorporating it into a Shiny app.

This RMarkdown document walks through the data download, cleaning, exploration, statistical analysis, visualization, and prediction modeling, with each section titled accordingly.

The Data

zipURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(zipURL, destfile = "./swiftzip.zip") #548 MB 
unzip("./swiftzip.zip", list = TRUE) #get the file names don't extract 
##                             Name    Length                Date
## 1                         final/         0 2014-07-22 10:10:00
## 2                   final/de_DE/         0 2014-07-22 10:10:00
## 3  final/de_DE/de_DE.twitter.txt  75578341 2014-07-22 10:11:00
## 4    final/de_DE/de_DE.blogs.txt  85459666 2014-07-22 10:11:00
## 5     final/de_DE/de_DE.news.txt  95591959 2014-07-22 10:11:00
## 6                   final/ru_RU/         0 2014-07-22 10:10:00
## 7    final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8     final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9  final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10                  final/en_US/         0 2014-07-22 10:10:00
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12    final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13   final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14                  final/fi_FI/         0 2014-07-22 10:10:00
## 15    final/fi_FI/fi_FI.news.txt  94234350 2014-07-22 10:11:00
## 16   final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17 final/fi_FI/fi_FI.twitter.txt  25331142 2014-07-22 10:10:00

Looking at the list of files, you’ll notice 3 types of files (Twitter, Blogs, and News) in 4 languages (German, Russian, English, Finnish).
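
As a quick check of that breakdown, the file listing above can be tabulated by language folder and source type (a small sketch reusing the same unzip call with list = TRUE; nothing is extracted):

zipinfo <- unzip("./swiftzip.zip", list = TRUE) #data frame of Name, Length, Date
txtfiles <- zipinfo[grepl("\\.txt$", zipinfo$Name), ] #drop the directory entries
language <- basename(dirname(txtfiles$Name)) #e.g. "en_US"
filetype <- sub(".*\\.([a-z]+)\\.txt$", "\\1", txtfiles$Name) #"twitter", "blogs", or "news"
table(language, filetype) #4 languages x 3 source types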

Sneak Peek

The data include the text itself, the date accessed, and, if applicable, a tag for the subject. Some entries are written in multiple languages or contain words that are highly similar across languages; these need to be cleaned appropriately.
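
One simple screen for such entries (a rough sketch, not the final cleaning step) is to flag lines containing non-ASCII characters, which often indicate foreign-language or mixed-language text. Here txt stands in for any character vector of entries, such as the sample read below:

hasForeign <- grepl("[^\x01-\x7F]", txt) #TRUE for entries containing non-ASCII characters
sum(hasForeign) #how many entries are flagged
head(txt[hasForeign], 3) #inspect a few flagged entries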

A sample of some of the English Twitter data is shown below:

#directories will be repeatedly set in and out of the zip file 

filepth <- getwd() # get the working directory 
unzip("./swiftzip.zip", exdir = filepth) #unzip the files into the working directory 
#inside the directory is now a new directory called "final"

#inside "final" is 4 folders, 1 for each language 

setdirectory <- "./final/en_US" #specify the directory with the ENGLISH US files 
setwd(setdirectory)
 #twitter, news, and blog txt files 

con <- file("en_US.twitter.txt", "r") #connection issues here are usually directory issues, 
#such as the working directory resetting between notebook chunks 
txtsmpl1 <- readLines(con, 5) #read the first 5 lines of the file 
close(con) #close the connection when done 

txtsmpl1
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."                                                                       
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                           
## [5] "Words from a complete stranger! Made my birthday even better :)"

The first 5 lines are shown. We can use readLines to assign the full texts to different variables, but be warned: they are quite large. The en_US.twitter.txt file is roughly 167 MB on disk and over 300 MB once read into memory.
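
Before reading the full files it is worth checking their sizes on disk (a quick sketch; the paths assume the working directory is already set to ./final/en_US as above):

#file sizes on disk in MB; the character vectors will take up even more space in memory
round(file.size(c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt")) / 2^20)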

Each entry in each file varies in the number of characters:

setwd(setdirectory) #every notebook chunk needs this because the files are not in the same directory 
#as the RMD file 

twittertxt <- readLines("en_US.twitter.txt", skipNul = TRUE) 
blogtxt <- readLines("en_US.blogs.txt", skipNul = TRUE) 

newscon <- file("en_US.news.txt",'rb')
newstxt <- readLines(newscon)
close(newscon)

#Due to a variety of special characters and nuls, skipNul = TRUE was passed to readLines 
#Due to an embedded SUB (0x1A) character in the news file, a binary-mode connection was required 

txtlist <- list(twittertxt, blogtxt, newstxt) 
maxlength <- function(txt){max(nchar(txt))}
lapply(txtlist, maxlength) #get the longest entry from each file and return its character count 
## [[1]]
## [1] 213
## 
## [[2]]
## [1] 40835
## 
## [[3]]
## [1] 11384

There is a blog post with over 40,000 characters. The tweets, as expected, are far shorter: the longest is 213 characters (URLs and similar content can push the stored text past Twitter's nominal 140-character limit).
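
To locate that entry (a small sketch using the vectors already in memory):

longest <- which.max(nchar(blogtxt)) #index of the longest blog entry
nchar(blogtxt[longest]) #should match the 40,835 characters reported above
substr(blogtxt[longest], 1, 80) #peek at its first 80 characters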

Exploratory Analysis

numberOfEntries <- lapply(txtlist, length) 
meanlength <- function(txt){mean(nchar(txt))}
medianlength <- function(txt){median(nchar(txt))}

AvgNumChars <- lapply(txtlist,meanlength)
MedNumChars <- lapply(txtlist,medianlength)

entrytable <- cbind.data.frame(source = c("twitter","blogs","news"), 
                               'Number of Entries' = as.character(numberOfEntries),
                               'Mean Num Chars' = as.character(
                                       lapply(AvgNumChars, round)),
                               'Median Num Chars' = as.character(MedNumChars)
)
entrytable 
##    source Number of Entries Mean Num Chars Median Num Chars
## 1 twitter           2360148             69               64
## 2   blogs            899288            232              157
## 3    news           1010242            202              185

As the table above shows, tweets tend to stay around 60 - 70 characters, which makes sense given the 140-character limit. Blog posts and news articles vary greatly and include some very long entries that skew the mean number of characters upward. This means the sampling method can greatly affect the prediction capabilities of our model: longer posts and articles may include technical language (that a casual SwiftKey user wouldn't type), repeated pronouns (it, he, she) referring to varying subjects, and other constructs unique to those sources.
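
That skew is visible directly in the character-count distributions (a quick sketch on the vectors already in memory; note how far the mean sits above the median for blogs and news):

summary(nchar(twittertxt)) #five-number summary plus mean of entry lengths
summary(nchar(blogtxt))
summary(nchar(newstxt))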

Data Selection

To greatly reduce our data (so it fits in typical RAM), we will keep only entries whose character counts fall between the 25th and 75th percentiles for each source. These cutoffs are somewhat arbitrary, but intuitively they should capture the most typical (average) ways the language is used.

reducer <- function(txt){ 
        bound <- quantile(nchar(txt))[c(2,4)]  #25th and 75th percentiles
        txt[nchar(txt) >= bound[1] & nchar(txt) <= bound[2]]
}

smalltwitter <- reducer(twittertxt)
smallblogs <- reducer(blogtxt)
smallnews <- reducer(newstxt)
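
As a quick sanity check (a sketch that assumes the full vectors are still in memory), the number of retained entries can be compared to the originals:

#fraction of entries kept per source; roughly 0.5 for each
round(c(twitter = length(smalltwitter) / length(twittertxt),
        blogs = length(smallblogs) / length(blogtxt),
        news = length(smallnews) / length(newstxt)), 2)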

Our data have been reduced by roughly 50%. To visualize these entries in word form, we'll use a few select libraries and create histograms of commonly used words and various n-grams. To do this we will also remove punctuation, numbers, and extra whitespace, and force all text to lowercase.

library(tm) 
## Loading required package: NLP
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(tidytext)

cleaner <- function(txt){ 
        txt <- gsub("n't", " not", txt) #expand n't contractions (don't -> do not, etc.)
        txt %>% removePunctuation() %>% stripWhitespace() %>% tolower() %>% removeNumbers()
}

ctwitter <- cleaner(smalltwitter)
cblogs <- cleaner(smallblogs)
cnews <- cleaner(smallnews)

rm(twittertxt,blogtxt,newstxt, smalltwitter,smallblogs,smallnews) #reduce memory 

With the cleaned data, we can create multiple n-grams and highlight the most frequent ones, but memory is still an issue. We will first sample 10% of each cleaned dataset and explore those before building models on the full cleaned datasets.

tidify <- function(txt){ 
        txt <- cbind.data.frame(entry = 1:length(txt), txt) #retain grouping information
        txt[,2] <- as.character(txt[,2]) #the text column defaults to a factor; convert it to character
        txt <- group_by(txt, entry) %>% unnest_tokens(word, txt)}
#note: to count words across entries you must ungroup 

set.seed(4) #for reproducibility

#get 10% of the character vectors, then tidify them into unigram tibbles 
tidytwitter <- tidify(sample(ctwitter, .1*length(ctwitter)))  
tidyblogs <- tidify(sample(cblogs, .1*length(cblogs)))
tidynews <- tidify(sample(cnews, .1*length(cnews)))
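
As the note above mentions, counting on these grouped tibbles tallies words within each entry; to get corpus-wide counts the grouping has to be dropped first (a small sketch):

ungroup(tidytwitter) %>% count(word, sort = TRUE) %>% head(5) #top words across the whole Twitter sample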

num.unique <- function(txt){nrow(unique(txt[,1]))} # counts the number of entries sampled
num.unique.word <- function(txt){nrow(unique(txt[,2]))} #counts number of unique words per source

tidylist <- list(tidytwitter,tidyblogs,tidynews)
unigramTable <- cbind.data.frame(
        source = c("twitter","blogs","news"), 
        'Number of Entries' = sapply(tidylist, num.unique), 
        'Number of Unique Words' = sapply(tidylist, num.unique.word)
)
unigramTable
##    source Number of Entries Number of Unique Words
## 1 twitter            119453                  66913
## 2   blogs             45327                  65216
## 3    news             50655                  67907

Most Frequent Words per Source

library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
#converting to data.table for computational speed 
dtTwitter <- data.table(tidytwitter)
dtBlogs <- data.table(tidyblogs)
dtNews <- data.table(tidynews)

top10 <- function(txt){count(txt, word, sort = TRUE)[1:10,]}

The top ten words for each source:

ggTwitter <- ggplot(aes(word, n), data = top10(dtTwitter)) + 
        geom_col() + xlab(NULL) + coord_flip() + labs(title = "Twitter Top 10 words")

ggBlogs <- ggplot(aes(word, n), data = top10(dtBlogs)) + 
        geom_col() + xlab(NULL) + coord_flip() + labs(title = "Blogs Top 10 words")

ggNews <- ggplot(aes(word, n), data = top10(dtNews)) + 
        geom_col() + xlab(NULL) + coord_flip() + labs(title = "News Top 10 words") 

ggTwitter

ggBlogs

ggNews 

Results and Future Plans

The sources do not have equivalent top ten words, nor do they have similar proportions of popular words. This shows a fundamental difference in the language of tweets, blog posts, and news articles (which makes sense considering slang, technical writing, and personal writing).
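
One way to quantify the difference in proportions (a rough sketch reusing the data.tables above; topShare is a helper defined here only for illustration):

topShare <- function(dt){max(count(dt, word)$n) / nrow(dt)} #share of tokens taken by the most common word
round(sapply(list(twitter = dtTwitter, blogs = dtBlogs, news = dtNews), topShare), 3)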

From here, a similar analysis will be done with a variety of n-grams (2-, 3-, and 4-word groups), and a predictive model can be built. A model can be trained on each source, and from there a majority-vote model can combine them to predict words based on which source the input text is most similar to.
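
For example, bigrams can be built with the same tidytext machinery used above (a sketch of the planned approach, not the final model; twitterSample is a fresh 10% sample taken here only for illustration):

set.seed(4)
twitterSample <- sample(ctwitter, .1*length(ctwitter)) #10% of the cleaned Twitter entries
bigramTwitter <- cbind.data.frame(entry = seq_along(twitterSample), txt = twitterSample,
                                  stringsAsFactors = FALSE) %>%
        unnest_tokens(bigram, txt, token = "ngrams", n = 2) %>% #split each entry into 2-word sequences
        count(bigram, sort = TRUE)
head(bigramTwitter, 5) #the five most frequent two-word sequences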