Objectives

  1. This HTML page describes the exploratory analysis of the training data set.
  2. It gives basic summaries of the three files: word counts, line counts and basic data tables.
  3. It includes basic plots, such as histograms, to illustrate features of the data.
  4. The audience of this report is a manager who is not a data scientist but who should be able to appreciate the project.

Basic Report

For a manager who is not a data scientist!

The overall goal of the project is to predict the next word before a person types it, based on what they have entered so far. You have probably already seen this on your mobile messaging keyboard. The main benefit for the user is less time spent typing the next word.

To achieve this, we need to create software that gives the user the best “guess” for the next word. This software needs a “brain” that can “guess” the next word (data scientists more elegantly say “predict”, because the guess is backed by estimated probabilities rather than chance). That “brain” is called a “model” and, like any human brain, it needs to be “trained” to “predict” well.

To train this software brain, it needs to learn from examples: what is the best choice for the next word, and what is the best choice given the last couple of words written?

English is a vast language, and for practical purposes the number of distinct words and word combinations is unbounded. To make this brain more accurate, we therefore need to feed it a huge corpus of words used in real sentences. Fortunately, news sites, blogs and social media such as Twitter provide an abundance of sentences from which our artificial brain can learn.

In the next few sections, you will see how we downloaded the data and did an initial analysis of it.

Basic Summary: Word Count, Line Count, Basic Data Tables

Let us start with loading the needed libraries and data.

#Keep Blog, News, Twitter Files in working directory
library(stringi)
library(tm)
library(RWeka)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Read Data
blogs <- readLines(con <- file("en_US.blogs.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)

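#Note: on Windows, a text-mode read of this file may stop early at an embedded
#control character; opening the connection with file("en_US.news.txt", "rb") is a common way to check
news <- readLines(con <- file("en_US.news.txt"), encoding = "UTF-8", skipNul = TRUE)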
close(con)

twitter <- readLines(con <- file("en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)
close(con)

How many lines and how many words are there in each file?

#Line and Word Counts
data.frame("Blog_Line_Count" = length(blogs),
           "News_Line_Count" = length(news),
           "Twitter_Line_Count" = length(twitter),
           "Blog_Word_Count" = sum(stri_count_words(blogs)),
           "News_Word_Count" = sum(stri_count_words(news)),
           "Twitter_Word_Count" = sum(stri_count_words(twitter)))
##   Blog_Line_Count News_Line_Count Twitter_Line_Count Blog_Word_Count
## 1          899288           77259            2360148        37546246
##   News_Word_Count Twitter_Word_Count
## 1         2674536           30093410
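The same kind of summary can also be laid out with one row per source, which is often easier to read. The sketch below is only an illustration on made-up toy text: `summarise_source` is a hypothetical helper, and it counts words by splitting on whitespace in base R, which will not match `stri_count_words` exactly on real data.

```r
# Hypothetical helper: one row of summary statistics per text source.
# Words are counted by splitting on whitespace (an approximation).
summarise_source <- function(name, lines) {
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    source = name,
    lines = length(lines),
    words = sum(words_per_line),
    mean_words_per_line = mean(words_per_line),
    stringsAsFactors = FALSE
  )
}

# Made-up toy data standing in for the real files
toy <- list(
  blogs = c("the quick brown fox", "jumps over the lazy dog today"),
  twitter = c("so tired", "love this", "see you soon")
)

# Apply the helper to every source and stack the rows
summary_table <- do.call(rbind, Map(summarise_source, names(toy), toy))
print(summary_table)
```

With the real data, the list would hold `blogs`, `news` and `twitter` instead of the toy vectors.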

Looks like we have a fairly large English word corpus! Let us look at the most frequently occurring one-word, two-word and three-word combinations.

Cleaning Data and Building Corpus

To do that, we first need to do a little data cleansing: remove punctuation, stop words, numbers and so on. R packages handle these fundamental cleansing tasks, leaving us with a corpus of plain words. On this corpus we can then find the most frequently occurring unigrams, bigrams and trigrams.
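To make the cleansing steps concrete, here is a minimal base R sketch applied to a single made-up sentence. It is not the pipeline used in this report: `clean_line` and `tiny_stopwords` are hypothetical names, and the stop word list is deliberately tiny.

```r
# Made-up, deliberately tiny stop word list (an assumption for illustration)
tiny_stopwords <- c("the", "a", "is", "to", "and")

# Hypothetical helper: lower-case, then strip punctuation, digits and stop words
clean_line <- function(x, stopwords = tiny_stopwords) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)                    # drop punctuation
  x <- gsub("[[:digit:]]", "", x)                    # drop digits
  pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)             # drop stop words as whole words
  x <- gsub("\\s+", " ", x)                          # collapse repeated whitespace
  trimws(x)
}

clean_line("The cat, aged 3, is going to the VET!")
# [1] "cat aged going vet"
```

Only the content-bearing words survive, which is what we want before counting word combinations.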

Because the data set is huge, it makes sense to start with a small random fraction of it, say 0.5%, and run the analysis on that.

set.seed(123)
blogs_red<- sample(blogs, 0.005*length(blogs))
news_red <- sample(news, 0.005*length(news))
twitter_red <- sample(twitter, 0.005*length(twitter))
sample <- c(blogs_red, news_red, twitter_red)
sum(stri_count_words(sample))
## [1] 350706
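#Little bit of data cleansing.
#sub = "" drops non-ASCII characters; without it, iconv() turns any line containing one into NA
sample <- iconv(sample, 'UTF-8', 'ASCII', sub = "")
#One document per sampled line
corpus <- VCorpus(VectorSource(sample))
#tolower must be wrapped in content_transformer() so the corpus structure is preserved
corpus <- corpus %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords('english')) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(stripWhitespace)

#Pull the cleaned text back out as a character vector, which is what NGramTokenizer expects
clean_text <- unname(sapply(corpus, as.character))

uni <- NGramTokenizer(clean_text, Weka_control(min = 1, max = 1))
bi <- NGramTokenizer(clean_text, Weka_control(min = 2, max = 2))
tri <- NGramTokenizer(clean_text, Weka_control(min = 3, max = 3))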
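For readers new to the terms, a unigram is a single word, a bigram is two consecutive words and a trigram is three. The base R sketch below shows the idea on one made-up sentence. `make_ngrams` is a hypothetical helper written for illustration; it is not how `NGramTokenizer` works internally.

```r
# Hypothetical helper: build n-grams from one cleaned, space-separated string.
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  # Paste together each run of n consecutive words
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

toy <- "i love you and i love coffee"

make_ngrams(toy, 2)   # all two-word sequences in the sentence

# Count how often each bigram occurs, most frequent first
bigram_counts <- sort(table(make_ngrams(toy, 2)), decreasing = TRUE)
print(bigram_counts)
```

In this toy sentence “i love” occurs twice and every other bigram once, so it comes out on top of the count.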

Let us look at our top 20 unigrams, bigrams and trigrams and visualize them.

#Count each unigram and sort from most to least frequent
uni.df <- data.frame(table(uni))
uni.df <- uni.df[order(uni.df$Freq, decreasing = TRUE),]

#reorder() keeps the bars sorted by frequency instead of alphabetically
ggplot(uni.df[1:20,], aes(x=reorder(uni, -Freq), y=Freq)) +
  geom_bar(stat="identity", fill="#D95F02")+
  xlab("Unigrams") + ylab("Frequency")+
  ggtitle("Top 20 Unigrams") +
  theme(axis.text.x=element_text(angle=90, hjust=1))

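#Count each bigram and sort from most to least frequent
bi.df <- data.frame(table(bi))
bi.df <- bi.df[order(bi.df$Freq, decreasing = TRUE),]

#reorder() keeps the bars sorted by frequency instead of alphabetically
ggplot(bi.df[1:20,], aes(x=reorder(bi, -Freq), y=Freq)) +
  geom_bar(stat="identity", fill="#D95F02")+
  xlab("Bigrams") + ylab("Frequency")+
  ggtitle("Top 20 Bigrams") +
  theme(axis.text.x=element_text(angle=90, hjust=1))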

#Count each trigram and sort from most to least frequent
tri.df <- data.frame(table(tri))
tri.df <- tri.df[order(tri.df$Freq, decreasing = TRUE),]

#reorder() keeps the bars sorted by frequency instead of alphabetically
ggplot(tri.df[1:20,], aes(x=reorder(tri, -Freq), y=Freq)) +
  geom_bar(stat="identity", fill="#D95F02")+
  xlab("Trigrams") + ylab("Frequency")+
  ggtitle("Top 20 Trigrams") +
  theme(axis.text.x=element_text(angle=90, hjust=1))

Summary

Fun fact: it is interesting to see that “I Love You” turned out to be the most used trigram!

By now we have a fairly good idea of our data set and its basic statistics. Next, we will build models that learn from word associations, using the previous n words to predict the best possible next word. That is the goal.
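As a taste of what such a model could look like, here is a minimal sketch of a next-word predictor that uses only the single previous word (a bigram model). It is an illustration on made-up sentences, not the model this project will deliver, and `train_bigrams` and `predict_next` are hypothetical names.

```r
# Hypothetical helper: count how often each word follows each other word.
train_bigrams <- function(lines) {
  pairs <- lapply(strsplit(tolower(trimws(lines)), "\\s+"), function(w) {
    if (length(w) < 2) return(NULL)
    # Pair every word with the word that follows it
    data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, pairs)
  counts <- as.data.frame(table(prev = pairs$prev, nxt = pairs$nxt),
                          stringsAsFactors = FALSE)
  counts[counts$Freq > 0, ]   # keep only pairs that were actually seen
}

# Hypothetical helper: suggest up to k most frequent followers of `word`.
predict_next <- function(model, word, k = 3) {
  hits <- model[model$prev == tolower(word), ]
  hits <- hits[order(hits$Freq, decreasing = TRUE), ]
  head(hits$nxt, k)
}

# Made-up training sentences
toy_lines <- c("i love you", "i love coffee", "i love you too", "you love tea")
model <- train_bigrams(toy_lines)

predict_next(model, "love")   # first suggestion is "you" (seen twice after "love")
predict_next(model, "i")      # "love"
```

A real model would need to look further back than one word and to handle words it has never seen, which this sketch simply answers with no suggestion.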