In partnership with SwiftKey, the Johns Hopkins Data Science Specialization Capstone involves building a text-prediction algorithm and incorporating it into a Shiny app.
This RMarkdown document covers the data download, cleaning, exploration, statistical analysis, visualization, and prediction modeling, with each section titled appropriately.
zipURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(zipURL, destfile = "./swiftzip.zip") #548 MB
unzip("./swiftzip.zip", list = TRUE) #get the file names don't extract
## Name Length Date
## 1 final/ 0 2014-07-22 10:10:00
## 2 final/de_DE/ 0 2014-07-22 10:10:00
## 3 final/de_DE/de_DE.twitter.txt 75578341 2014-07-22 10:11:00
## 4 final/de_DE/de_DE.blogs.txt 85459666 2014-07-22 10:11:00
## 5 final/de_DE/de_DE.news.txt 95591959 2014-07-22 10:11:00
## 6 final/ru_RU/ 0 2014-07-22 10:10:00
## 7 final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8 final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9 final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10 final/en_US/ 0 2014-07-22 10:10:00
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12 final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13 final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14 final/fi_FI/ 0 2014-07-22 10:10:00
## 15 final/fi_FI/fi_FI.news.txt 94234350 2014-07-22 10:11:00
## 16 final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17 final/fi_FI/fi_FI.twitter.txt 25331142 2014-07-22 10:10:00
Looking at the list of files, you’ll notice three types of files (Twitter, blogs, and news) in four languages (German, Russian, English, and Finnish).
The data include the text itself, the date accessed, and, if applicable, a tag for the subject. Some text is written in multiple languages or contains words that are highly similar across languages; these entries need to be cleaned appropriately.
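One simple way to screen for such entries (a sketch only, not the cleaning applied later in this report) is to drop lines containing non-ASCII characters using base R’s iconv(); the vector name rawLines below is hypothetical.
#Sketch: keep only entries that convert cleanly to ASCII; iconv() returns NA for
#elements with characters it cannot represent, a rough proxy for non-English text
keepASCII <- function(txt){ txt[!is.na(iconv(txt, from = "UTF-8", to = "ASCII"))] }
#e.g. rawLines <- keepASCII(rawLines)  (rawLines being a hypothetical character vector of entries)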
A sample of some of the English Twitter data is shown below:
#directories will be repeatedly set in and out of the zip file
filepth <- getwd() # get the working directory
unzip("./swiftzip.zip", exdir = filepth) #unzip the files into the working directory
#inside the directory is now a new directory called "final"
#inside "final" is 4 folders, 1 for each language
setdirectory <- "./final/en_US" #specify the directory with the ENGLISH US files
setwd(setdirectory)
#twitter, news, and blog txt files
con <- file("en_US.twitter.txt","r") #some issues with connecting to the file
#most likely caused by directory issues, such as only accessing in a notebook chunk
txtsmpl1 <- readLines(con,5) #read the first 5 lines of the file
txtsmpl1
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [5] "Words from a complete stranger! Made my birthday even better :)"
The first 5 lines are shown. We can use readLines to assign the texts to different variables, but be warned: they are quite large. The en_US.twitter.txt data alone take up over 300 MB once loaded into memory.
Each element in each file varies in the number of characters:
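As a quick sanity check before committing to a full read (a sketch only, not part of the main pipeline), base R’s file.size() reports the on-disk sizes, and object.size() can confirm the in-memory footprint afterwards:
#Sketch: on-disk sizes in MB (paths assume the en_US folder set as the working directory above)
round(file.size(c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt")) / 1e6)
#after reading, format(object.size(twittertxt), units = "MB") gives the
#(typically larger) in-memory footprint of the resulting character vector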
setwd(setdirectory) #every notebook chunk needs this because the files are not in the same directory
#as the RMD file
twittertxt <- readLines("en_US.twitter.txt", skipNul = TRUE)
blogtxt <- readLines("en_US.blogs.txt", skipNul = TRUE)
newscon <- file("en_US.news.txt",'rb')
newstxt <- readLines(newscon)
close(newscon)
#Due to a variety of special characters and nuls, skipNul = TRUE was passed to readLines
#Due to an embedded SUB character in the news file, a binary-mode connection was required
txtlist <- list(twittertxt, blogtxt, newstxt)
maxlength <- function(txt){max(nchar(txt))}
lapply(txtlist, maxlength) #get the longest element from each file and return its nchar
## [[1]]
## [1] 213
##
## [[2]]
## [1] 40835
##
## [[3]]
## [1] 11384
There is a blog post with over 40,000 characters. As expected, the longest tweet is only 213 characters (tweets are limited to 140 characters, but URLs allow some to exceed that).
numberOfEntries <- lapply(txtlist, length)
meanlength <- function(txt){mean(nchar(txt))}
medianlength <- function(txt){median(nchar(txt))}
AvgNumChars <- lapply(txtlist,meanlength)
MedNumChars <- lapply(txtlist,medianlength)
entrytable <- cbind.data.frame(source = c("twitter","blogs","news"),
'Number of Entries' = as.character(numberOfEntries),
'Mean Num Chars' = as.character(
lapply(AvgNumChars, round)),
'Median Num Chars' = as.character(MedNumChars)
)
entrytable
## source Number of Entries Mean Num Chars Median Num Chars
## 1 twitter 2360148 69 64
## 2 blogs 899288 232 157
## 3 news 1010242 202 185
As the table above shows, tweets tend to stay around 60-70 characters, which makes sense given the 140-character limit on any tweet. Blog posts and news articles vary far more and include some very long entries that skew the mean number of characters upward. This means the sampling method can greatly affect the predictive performance of our model, as longer posts and articles may include technical language (that a casual SwiftKey user wouldn’t type), repetitive use of pronouns (it, he, she) with varying subjects, and other constructs unique to those sources.
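To see this skew directly, a quick sketch (using the vectors read in above) compares the five-number summaries of entry lengths; the long right tails in blogs and news are what pull the means above the medians.
#Sketch: summary of entry lengths (in characters) per source
lapply(list(twitter = twittertxt, blogs = blogtxt, news = newstxt),
       function(txt) summary(nchar(txt)))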
To greatly reduce our data (so it fits typical RAM), we will keep only the entries whose character counts fall between the 25th and 75th percentiles for each source. These cutoffs are arbitrary, but intuitively they should capture the most typical (average) ways the language is used.
reducer <- function(txt){
bound <- quantile(nchar(txt))[c(2,4)] #25th and 75th percentiles
txt[nchar(txt) >= bound[1] & nchar(txt) <= bound[2]]
}
smalltwitter <- reducer(twittertxt)
smallblogs <- reducer(blogtxt)
smallnews <- reducer(newstxt)
Our data have been reduced by roughly 50%. To visualize these entries in word form, we’ll use a few select libraries and create frequency plots of commonly used words and various n-grams. To do this we will also remove punctuation, numbers, and extra whitespace, and force all text to lowercase.
library(tm)
## Loading required package: NLP
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(tidytext)
cleaner <- function(txt){
txt <- gsub("n't", " not",txt) #for won't, don't, etc.
txt <- txt %>% removePunctuation() %>% stripWhitespace() %>% tolower() %>% removeNumbers()
}
ctwitter <- cleaner(smalltwitter)
cblogs <- cleaner(smallblogs)
cnews <- cleaner(smallnews)
rm(twittertxt,blogtxt,newstxt, smalltwitter,smallblogs,smallnews) #reduce memory
With the cleaned data, we can create multiple n-grams and highlight the most frequent ones, but memory is still an issue. We will first sample 10% of each cleaned dataset and explore those samples before modeling on the full cleaned datasets.
tidify <- function(txt){
txt <- cbind.data.frame(entry = 1:length(txt), txt) #retain grouping information
txt[,2] <- as.character(txt[,2]) #cbind.data.frame coerces the text column to factor, so convert back to character
txt <- group_by(txt, entry) %>% unnest_tokens(word,txt)}
#note: to count words across entries you must ungroup
set.seed(4) #for reproducibility
#get 10% of the character vectors, then tidify them into unigram tibbles
tidytwitter <- tidify(sample(ctwitter, .1*length(ctwitter)))
tidyblogs <- tidify(sample(cblogs, .1*length(cblogs)))
tidynews <- tidify(sample(cnews, .1*length(cnews)))
num.unique <- function(txt){nrow(unique(txt[,1]))} # counts the number of entries sampled
num.unique.word <- function(txt){nrow(unique(txt[,2]))} #counts number of unique words per source
tidylist <- list(tidytwitter,tidyblogs,tidynews)
unigramTable <- cbind.data.frame(
source = c("twitter","blogs","news"),
'Number of Entries' = sapply(tidylist, num.unique),
'Number of Unique Words' = sapply(tidylist, num.unique.word)
)
unigramTable
## source Number of Entries Number of Unique Words
## 1 twitter 119453 66913
## 2 blogs 45327 65216
## 3 news 50655 67907
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
#converting to data.table for computational speed
dtTwitter <- data.table(tidytwitter)
dtBlogs <- data.table(tidyblogs)
dtNews <- data.table(tidynews)
top10 <- function(txt){count(txt, word, sort = TRUE)[1:10,]}
The top ten words for each source:
ggTwitter <- ggplot(aes(word,n), data = top10(dtTwitter)) +
geom_col()+xlab(NULL) +coord_flip() + labs(title = "Twitter Top 10 words")
ggBlogs <- ggplot(aes(word,n), data = top10(dtBlogs)) +
geom_col()+xlab(NULL) +coord_flip() + labs(title = "Blogs Top 10 words")
ggNews <- ggplot(aes(word,n), data = top10(dtNews)) +
geom_col()+xlab(NULL) +coord_flip() + labs(title = "News Top 10 words")
ggTwitter
ggBlogs
ggNews
The sources do not share the same top ten words, nor do they use popular words in similar proportions. This points to a fundamental difference in the language of tweets, blog posts, and news articles (which makes sense given slang, technical writing, and personal writing).
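A quick way to check this overlap is sketched below, using the top10() helper defined above:
#Sketch: which top-ten words, if any, are shared by all three sources?
Reduce(intersect, list(top10(dtTwitter)$word,
                       top10(dtBlogs)$word,
                       top10(dtNews)$word))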
From here, a similar analysis will be run on a variety of n-grams (2-, 3-, and 4-word groups) and a predictive model will be built. A model can be built from each source, and from there a majority-vote model can be created that predicts words based on which source the input words most resemble.
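As a preview of that next step, here is a minimal sketch of bigram tokenization with tidytext on the cleaned Twitter sample; the helper name bigramify is hypothetical, and the same pattern extends to 3- and 4-grams and to the other sources.
#Sketch: tokenize a 10% sample of the cleaned Twitter text into bigrams and count
#the most frequent ones (bigramify is a hypothetical helper, not part of the pipeline above)
bigramify <- function(txt, n = 2){
  df <- data.frame(entry = seq_along(txt), txt = txt, stringsAsFactors = FALSE)
  unnest_tokens(df, ngram, txt, token = "ngrams", n = n)
}
twitterBigrams <- bigramify(sample(ctwitter, floor(.1*length(ctwitter))))
count(twitterBigrams, ngram, sort = TRUE)[1:10,] #ten most common bigrams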