For the non-data-scientist manager!
The overall goal of the project is to predict the next word before a person types it, based on what they have entered earlier. You might have already seen this on your mobile phone's messaging keyboard. The basic benefit for the user is less time spent typing the next word.
Now, to achieve this we need to create software that offers the user its best “guess” at the next word. This software needs a “brain” which can “guess” the next word (data scientists more elegantly call it “predict”, because the guess is backed by mathematical probability rather than random chance!). That “brain” is called a “model”, and like any human brain it needs to be “trained” to “predict” well.
To train this software brain, it needs to learn from examples: what is the best choice of next word? Or, what is the best choice of next word given the last couple of words written?
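To make "learning from examples" concrete, here is a minimal sketch in base R. It is only an illustration, not the model built in this project: the four training sentences are made up, and the helper `guess_next` is a hypothetical name. It simply counts which word most often follows a given word.

```r
# Toy training text; these sentences are invented for illustration only.
examples <- c("i love you", "i love pizza", "i love you too", "we love you")

# Split each sentence into words, then pair every word with the word after it.
words_by_sentence <- strsplit(tolower(examples), " ", fixed = TRUE)
pairs <- do.call(rbind, lapply(words_by_sentence, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(current = head(w, -1), following = tail(w, -1),
             stringsAsFactors = FALSE)
}))

# Hypothetical helper: return the word seen most often after `word`,
# or NA when the word never appeared in the examples.
guess_next <- function(word, pairs) {
  candidates <- pairs$following[pairs$current == tolower(word)]
  if (length(candidates) == 0) return(NA_character_)
  names(sort(table(candidates), decreasing = TRUE))[1]
}

guess_next("love", pairs)   # "you"  (seen 3 times, versus "pizza" once)
guess_next("i", pairs)      # "love"
guess_next("zebra", pairs)  # NA, the brain has never seen this word
```

The last call shows why a large corpus matters: the brain can only guess after words it has actually seen.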
English is a vast language, and for practical purposes the number of words is unlimited. So to make this brain more accurate in its predictions, we need to feed it a huge corpus of words that appear in real sentences. Thanks to the modern age, news sites, blogs and social media such as Twitter provide a mountain of sentences from which our artificial brain can learn and predict.
In the next few sections, you will see how we downloaded the data and did an initial analysis of it.
Let us start with loading the needed libraries and data.
#Keep Blog, News, Twitter Files in working directory
library(stringi)
library(tm)
library(RWeka)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Read Data
blogs <- readLines(con <- file("en_US.blogs.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)
news <- readLines(con <- file("en_US.news.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)
twitter <- readLines(con <- file("en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)
#Line and Word Counts
data.frame("Blog_Line_Count"=length(blogs),
"News_Line_Count"=length(news),
"Twitter_Line_Count"=length(twitter)
,"Blog_Word_Count"=sum(stri_count_words(blogs)),
"News_Word_Count"=sum(stri_count_words(news)),
"Twitter_Word_Count"=sum(stri_count_words(twitter))
)
## Blog_Line_Count News_Line_Count Twitter_Line_Count Blog_Word_Count
## 1 899288 77259 2360148 37546246
## News_Word_Count Twitter_Word_Count
## 1 2674536 30093410
Looks like we have a fairly large English word corpus! Let us look at the most frequently occurring one-word, two-word and three-word combinations.
To do that we need to do a little bit of data cleansing: remove punctuation, stopwords, numbers, etc. Packages will handle all these fundamental data cleansing tasks so that we are left with only a corpus of words. On this corpus, we can check the most frequent unigrams, bigrams and trigrams.
But as you can see, the data is huge, so it makes sense to start with a small fraction of it, say 0.5%, and do the analysis on that.
set.seed(123)
blogs_red<- sample(blogs, 0.005*length(blogs))
news_red <- sample(news, 0.005*length(news))
twitter_red <- sample(twitter, 0.005*length(twitter))
sample <- c(blogs_red, news_red, twitter_red)
sum(stri_count_words(sample))
## [1] 350706
#Little bit of data cleansing.
#Drop characters that cannot be converted to ASCII (sub = "" avoids turning whole lines into NA)
sample <- iconv(sample, 'UTF-8', 'ASCII', sub = "")
#One document per sampled line
corpus <- VCorpus(VectorSource(sample))
corpus <- corpus %>%
tm_map(content_transformer(tolower)) %>%
tm_map(removeWords, stopwords('english')) %>%
tm_map(removePunctuation) %>%
tm_map(removeNumbers) %>%
tm_map(stripWhitespace)
#Pull the cleaned text back out so the tokenizer gets a plain character vector
clean_text <- sapply(corpus, as.character)
uni <- NGramTokenizer(clean_text, Weka_control(min = 1, max = 1))
bi <- NGramTokenizer(clean_text, Weka_control(min = 2, max = 2))
tri <- NGramTokenizer(clean_text, Weka_control(min = 3, max = 3))
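If the terms unigram, bigram and trigram are new to you, the following base R sketch shows roughly what the tokenizers above produce, using one made-up line. The helper `make_ngrams` is a hypothetical name and not part of any package. Stopword removal is left out here to keep the example free of dependencies.

```r
# Hypothetical helper: build all n-word sequences from one cleaned line.
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# A toy line, cleaned by hand in the same spirit as the pipeline above.
raw   <- "Thanks for the 2 great days, see you soon!"
clean <- tolower(raw)
clean <- gsub("[[:punct:]]", "", clean)    # drop punctuation
clean <- gsub("[[:digit:]]", "", clean)    # drop numbers
clean <- gsub("\\s+", " ", trimws(clean))  # collapse extra spaces

make_ngrams(clean, 1)  # 8 single words
make_ngrams(clean, 2)  # 7 word pairs, starting with "thanks for"
make_ngrams(clean, 3)  # 6 word triples, starting with "thanks for the"
```

Counting how often each of these sequences appears across many lines gives the frequency tables used in the charts.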
Let us check out our top 20 unigrams, bigrams and trigrams and visualize them.
uni.df <- data.frame(table(uni))
uni.df <- uni.df[order(uni.df$Freq, decreasing = TRUE),]
ggplot(uni.df[1:20,], aes(x=uni, y=Freq)) +
geom_bar(stat="Identity", fill="#D95F02")+
xlab("Unigrams") + ylab("Frequency")+
ggtitle("Top 20 Unigrams") +
theme(axis.text.x=element_text(angle=90, hjust=1))
bi.df <- data.frame(table(bi))
bi.df <- bi.df[order(bi.df$Freq, decreasing = TRUE),]
ggplot(bi.df[1:20,], aes(x=bi, y=Freq)) +
geom_bar(stat="Identity", fill="#D95F02")+
xlab("Bigrams") + ylab("Frequency")+
ggtitle("Top 20 Bigrams") +
theme(axis.text.x=element_text(angle=90, hjust=1))
tri.df <- data.frame(table(tri))
tri.df <- tri.df[order(tri.df$Freq, decreasing = TRUE),]
ggplot(tri.df[1:20,], aes(x=tri, y=Freq)) +
geom_bar(stat="Identity", fill="#D95F02")+
xlab("Trigrams") + ylab("Frequency")+
ggtitle("Top 20 Trigrams") +
theme(axis.text.x=element_text(angle=90, hjust=1))
Fun Fact: It is interesting to see that “I Love You” is among the most-used trigrams!
By now we have a fairly good idea of our data set and its basic statistics. Next we will build models that learn from word associations, for example using the previous n words to predict the best possible next word. That is the goal.