Summary

This is a milestone report for the current capstone project, in which students have to develop a text prediction model and a Shiny GUI based on data provided by SwiftKey. In this report, using three text files compiled from Twitter, blogs and news sites, we conduct some exploratory analysis and clean up the data to better prepare for building the resulting prediction model.

Load the necessary libraries

require(tm)        #text mining framework
require(ggplot2)   #plotting
require(wordcloud) #word clouds
require(RWeka)     #n-gram tokenization
require(dplyr)     #data manipulation
require(stringi)   #string statistics
require(SnowballC) #stemming

Getting and Loading the Data into R

The aforementioned dataset can be found here: Capstone Dataset

The file can either be downloaded from the link provided and unzipped, or it can be sourced directly using R. In this case, the files have been saved and extracted into the same directory as the report file.
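
For completeness, the download step can also be scripted in R. Below is a minimal sketch, with zip_url left as a placeholder for the Capstone Dataset link above:

#optional: download and unzip the dataset directly in R
#zip_url is a placeholder for the Capstone Dataset link above
zip_url <- "<Capstone Dataset URL>"
if (!file.exists("./en_US.twitter.txt")) {
  download.file(zip_url, destfile = "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip", exdir = ".") #the en_US files may sit in a subfolder after extraction
}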

#Open connections to the 3 English text files
con_twitter <- file("./en_US.twitter.txt", "r")
con_blogs <- file("./en_US.blogs.txt", "r")
con_news <- file("./en_US.news.txt", "r")
#read the files
twitter <- readLines(con_twitter, skipNul = TRUE)
news <- readLines(con_news, skipNul = TRUE)
blogs <- readLines(con_blogs, skipNul = TRUE)
#close the connections
close(con_twitter); close(con_news); close(con_blogs)

Initial Look at the Data Files

It is important to know some basic information about any file that is being preprocessed and analyzed. Below is a look at these details:

data.frame(cbind(c("twitter","news","blogs"),rbind(stri_stats_general(twitter),stri_stats_general(news),stri_stats_general(blogs))))
##        V1   Lines LinesNEmpty     Chars CharsNWhite
## 1 twitter 2360148     2360148 162385035   134371036
## 2    news   77259       77259  15683765    13117038
## 3   blogs  899288      899288 208361438   171926076

Cleaning up the Data

Before any exploratory analysis can be done, the data has to be cleaned. We will remove punctuation, numbers, special characters and URLs, and convert all text to lower case for better analysis. Due to the size of the data, we will only sample 5% of it.

set.seed(1234)
#sample and combine the data for better handling
sample_data <- c(sample(twitter, length(twitter)*0.05),
                 sample(news, length(news)*0.05),
                 sample(blogs, length(blogs)*0.05))
#Brief look at the sample
stri_stats_general(sample_data)
##       Lines LinesNEmpty       Chars CharsNWhite 
##      166833      166833    19236179    15900229
#Cleaning the data
corpus <- VCorpus(VectorSource(sample_data)) #create the corpus
toSpace <- content_transformer(function(x,pattern) gsub(pattern, " ", x)) #create a function to transform anything to space
corpus <- tm_map(corpus, toSpace, "http[^[:space:]]*") #remove URLs
corpus <- tm_map(corpus, toSpace, "/|@|\\|") #remove slashes, @ symbols and pipes
corpus <- tm_map(corpus, content_transformer(tolower)) #convert to lower case
corpus <- tm_map(corpus, removePunctuation) #remove punctuation
corpus <- tm_map(corpus, removeNumbers) #remove numbers
corpus <- tm_map(corpus, stripWhitespace) #remove extra spacing
corpus <- tm_map(corpus, PlainTextDocument) #convert documents to plain text
corpus <- tm_map(corpus, removeWords, stopwords("english")) #remove English stopwords
corpus <- tm_map(corpus, stemDocument) #stem the words
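
Before tokenizing, it can help to sanity-check that the cleaning steps behaved as expected; a quick, optional check is to print a few of the cleaned documents:

#quick check: print the first few cleaned documents
for (i in 1:3) {
  writeLines(as.character(corpus[[i]]))
}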

Exploratory Data Analysis

Tokenizing the Text Sample

First, let’s look at the frequency of single words (unigrams) as well as of bigrams and trigrams. To do so, we must tokenize the sample.

#tokenizing the data
corpus.df <- data.frame(text = unlist(sapply(corpus, '[', 'content')), stringsAsFactors = F) #flatten the corpus into a character column
unigramtoken <- data.frame(table(NGramTokenizer(corpus.df$text, Weka_control(min = 1, max = 1))))
bigramtoken <- data.frame(table(NGramTokenizer(corpus.df$text, Weka_control(min = 2, max = 2))))
trigramtoken <- data.frame(table(NGramTokenizer(corpus.df$text, Weka_control(min = 3, max = 3))))

#order the tokens by descending order
unigram <- unigramtoken[order(unigramtoken$Freq, decreasing = TRUE),]
bigram  <- bigramtoken[order(bigramtoken$Freq, decreasing = TRUE),]
trigram <- trigramtoken[order(trigramtoken$Freq, decreasing = TRUE),]

Unigrams Frequency

#get top 20 unigrams
unigram_20 <- head(unigram,20)
g <- ggplot(unigram_20)
g + geom_bar(aes(x = reorder(Var1, - Freq), y = Freq), stat="identity") + xlab("Unigrams") + ylab("Frequency") + ggtitle("Top 20 Most Common Unigrams") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
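
Since the wordcloud package is already loaded, the same unigram frequencies can also be visualised as a word cloud. A quick, optional sketch using the unigram table built above:

#optional: word cloud of the most frequent unigrams
set.seed(1234)
wordcloud(words = as.character(unigram$Var1), freq = unigram$Freq,
          max.words = 100, random.order = FALSE, scale = c(3, 0.5))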

Bigrams Frequency

#get top 20 bigrams
bigram_20 <- head(bigram,20)
g <- ggplot(bigram_20)
g + geom_bar(aes(x = reorder(Var1, - Freq), y = Freq), stat="identity") + xlab("Bigrams") + ylab("Frequency") + ggtitle("Top 20 Most Common Bigrams") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

Trigrams Frequency

#get top 20 trigrams
trigram_20 <- head(trigram,20)
g <- ggplot(trigram_20)
g + geom_bar(aes(x = reorder(Var1, - Freq), y = Freq), stat="identity") + xlab("Trigrams") + ylab("Frequency") + ggtitle("Top 20 Most Common Trigrams") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

Findings and Comments

Most of the words found are common even in regular speech. However, there were a few pieces of internet slang and abbreviations such as “rt”, while tokens such as “happi” and “pretti” are artifacts of the stemming step rather than real words; neither is helpful in building an algorithm that suggests words. Grammar is important.

Since Twitter is probably the worst platform for obtaining proper words and statements, I would reduce its use when training the data for the Shiny app. The same applies to blogs, although the butchering of the English language is less severe there. News sites would likely have the best writing quality.

Going forward, different sampling sizes would need to be used for each source in order to find a good balance between popular “proper” words and slang.
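
As a rough illustration of that idea, the sampling step above could take a different proportion from each source, for example a smaller share of the Twitter data; the rates below are placeholders rather than tuned values:

#illustrative only: per-source sampling rates (placeholder values, not tuned)
set.seed(1234)
sample_data2 <- c(sample(twitter, length(twitter) * 0.02), #less weight on Twitter
                  sample(news, length(news) * 0.10),       #more weight on news
                  sample(blogs, length(blogs) * 0.05))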