This milestone report explores and analyses the SwiftKey data. The main goal of this project is to create an algorithm that predicts the next possible word while a fragment of text is being typed into an input field. The n-gram analysis will be done on three different data sources, i.e. blogs, news and Twitter. In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams are typically collected from a text or speech corpus (source: Wikipedia).
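As a quick illustration of the idea (using the same RWeka tokenizer applied later in this report), the following toy call extracts word bigrams from a single example sentence; the sentence is just an illustration and is not part of the SwiftKey data.
library(RWeka)
# bigrams (n = 2) of a toy sentence: "to be", "be or", "or not", "not to", "to be"
NGramTokenizer("to be or not to be", Weka_control(min = 2, max = 2))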
The data consists of text files in four languages: English, Russian, German and Finnish. For each language there are three files, namely blogs, news and twitter. Of these four languages, I chose to analyze English as it is the only language I am familiar with.
setwd("C:/Users/DELL 1/Documents/Capstone Project/")
library(RWeka)
library(stringi)
library(stringr)
library(ggplot2)
library(dplyr)
library(reshape2)
library(tm)
twitter <- readLines(con <- file("./final/en_US/en_US.twitter.txt", encoding = "UTF-8")); close(con)
blogs <- readLines(con <- file("./final/en_US/en_US.blogs.txt", encoding = "UTF-8")); close(con)
news <- readLines(con <- file("./final/en_US/en_US.news.txt", encoding = "UTF-8")); close(con)
The following code computes the file size, line count and word count for the English blogs, news and twitter files.
blogs_words <- stri_count_words(blogs)
news_words <- stri_count_words(news)
twitter_words <- stri_count_words(twitter)
blogs_size <- file.info("final/en_US/en_US.blogs.txt")$size/1024^2
news_size <- file.info("final/en_US/en_US.news.txt")$size/1024^2
twitter_size <- file.info("final/en_US/en_US.twitter.txt")$size/1024^2
summary_table <- data.frame(filename = c("blogs","news","twitter"),
file_size_MB = c(blogs_size, news_size, twitter_size),
num_of_lines = c(length(blogs),length(news),length(twitter)),
num_of_words = c(sum(blogs_words),sum(news_words),sum(twitter_words)),
mean_num_Of_words = c(mean(blogs_words),mean(news_words),mean(twitter_words)))
summary_table
## filename file_size_MB num_of_lines num_of_words mean_num_Of_words
## 1 blogs 200.4242 899288 37546246 41.75108
## 2 news 196.2775 77259 2674536 34.61779
## 3 twitter 159.3641 2360148 30093369 12.75063
From each data source, a 1% sample is taken. Sampling is done to get a quick analysis of the data and to reduce the time needed to pre-process and clean it, as well as to tokenize the words of the corpus into different n-grams. This is done in the hope that the chosen sample is sufficient to represent the whole data population.
set.seed(1000)
blogs_sample <- sample(blogs, length(blogs)*0.01)
news_sample <- sample(news, length(news)*0.01)
twitter_sample <- sample(twitter, length(twitter)*0.01)
twitter_sample <- sapply(twitter_sample,
function(row) iconv(row, "latin1", "ASCII", sub=""))
#Creating corpus
#The three samples taken from blogs, news and tweets are now combined for further analysis.
text_sample <- c(blogs_sample,news_sample,twitter_sample)
length(text_sample) #number of lines
## [1] 33365
sum(stri_count_words(text_sample)) #number of words
## [1] 706286
#remove all weird characters
cleanedTwitter <- sapply(twitter, function(x) iconv(enc2utf8(x), sub = "byte"))
cleanedBlogs <- sapply(blogs, function(x) iconv(enc2utf8(x), sub = "byte"))
cleanedNews <- sapply(news, function(x) iconv(enc2utf8(x), sub = "byte"))
The following code cleans the chosen data sample by removing other “noise”.
doc.vec <- VectorSource(text_sample)
doc.corpus <- Corpus(doc.vec)
#convert to lower case (wrapped in content_transformer so the corpus structure is preserved)
doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
#remove all punctuation
doc.corpus <- tm_map(doc.corpus, removePunctuation )
#remove all numbers
doc.corpus <- tm_map(doc.corpus, removeNumbers )
#remove all white spaces
doc.corpus <- tm_map(doc.corpus, stripWhitespace )
#convert to plain text document
doc.corpus <- tm_map(doc.corpus, PlainTextDocument )
# Remove stopwords
doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
The following code computes word frequencies for each n-gram model and orders them from largest to smallest. The top 25 most frequent terms from each model are reported in the form of a histogram. A document-term matrix or term-document matrix (TDM) is a mathematical matrix that describes the frequency of terms occurring in a collection of documents; in the term-document matrix used here, rows correspond to terms and columns correspond to documents in the collection (source: Wikipedia).
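As a small illustration of that structure, separate from the sampled corpus, a term-document matrix can be built for two toy documents with the same tm functions used below; rows are terms, columns are the two documents, and each cell holds a count.
toy_corpus <- Corpus(VectorSource(c("the cat sat", "the cat ran")))
inspect(TermDocumentMatrix(toy_corpus))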
Get_Freq <- function(TDM)
{
freq <- sort(rowSums(as.matrix(TDM)), decreasing = TRUE)
return(data.frame(word = names(freq), freq = freq))
}
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
createPlot <- function(data, label)
{
ggplot(data[1:25,], aes(reorder(word, -freq), freq)) +
labs(x = label, y = "Frequency") +
theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
geom_bar(stat = "identity", fill = "blue")
}
# Get frequencies of most common n-grams in data sample
freq1 <- Get_Freq(removeSparseTerms(TermDocumentMatrix(doc.corpus), 0.999))
freq2 <- Get_Freq(removeSparseTerms(TermDocumentMatrix(doc.corpus, control = list(tokenize = bigram)), 0.999))
freq3 <- Get_Freq(removeSparseTerms(TermDocumentMatrix(doc.corpus, control = list(tokenize = trigram)), 0.9999))
Here is a histogram of the 25 most common unigrams in the data sample.
createPlot(freq1, "25 Most Common Unigram")
The next histogram shows the 25 most common bigrams in the data sample.
createPlot(freq2, "25 Most Common Bigrams")
Here is a final histogram depicting the 25 most common trigrams in the data sample.
createPlot(freq3, "25 Most Common Trigrams")
A few observations can be made from the above analysis:
-The n-grams show good predictive potential even with a small sample size.
-Processing is very slow despite the small sample used.
-The bigram and trigram results make more sense than the unigram ones.
The next plan is to:
-Perform the remaining pre-processing data cleanup
-Remove profanity (I am still considering this due to the concern of losing the actual meaning of words if profanity is removed)
-Determine if word stemming is beneficial
-Build a prediction model based on a larger sample of the data (perhaps 2%); a minimal sketch follows this list
-Build a Shiny web app that allows users to type in a phrase of words and hit submit
-Create an R presentation describing the application
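To make the planned prediction model concrete, here is a minimal, hypothetical sketch of a frequency-based backoff predictor built on the bigram and trigram tables computed above (freq2, freq3). The function name predict_next_word is my own placeholder, not part of any package, and the final model will need smoothing and better handling of stopwords.
# Minimal backoff sketch (assumes freq2 and freq3 are the frequency tables
# computed above, with columns `word` and `freq`, sorted by decreasing freq)
predict_next_word <- function(phrase, freq2, freq3) {
  tokens <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(tokens)
  if (n == 0) return(NA_character_)
  # try trigrams first: look for trigrams starting with the last two words
  if (n >= 2) {
    prefix <- paste(tokens[n - 1], tokens[n])
    hits <- freq3[startsWith(as.character(freq3$word), paste0(prefix, " ")), ]
    if (nrow(hits) > 0) {
      # tables are sorted by frequency, so the first match is the most frequent
      return(tail(strsplit(as.character(hits$word[1]), " ")[[1]], 1))
    }
  }
  # back off to bigrams starting with the last word
  hits <- freq2[startsWith(as.character(freq2$word), paste0(tokens[n], " ")), ]
  if (nrow(hits) > 0) {
    return(tail(strsplit(as.character(hits$word[1]), " ")[[1]], 1))
  }
  NA_character_
}
# example call: returns the most frequent continuation, or NA if none is found
predict_next_word("happy new", freq2, freq3)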