Data Analysis

Let’s load the en_US data for initial analysis. It consists of three files: blog, twitter, and news.
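
Before the analysis, the three files are read into character vectors. A minimal sketch of the load step is shown below; the file paths are assumptions and should be adjusted to wherever the en_US files are stored locally.

blog    <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)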

Let’s look at our data set. For the initial analysis, let’s look at the line count and word count of all three files.

# Summary table: one row per source, with line and word counts
df <- data.frame(text.source = c("blog", "twitter", "news"), line.count = NA, word.count = NA)
input_str_list <- list(blog = blog, twitter = twitter, news = news)
df$line.count <- sapply(input_str_list, length)
df$word.count <- sapply(input_str_list, func_word_count)  # func_word_count: word-count helper
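
The helper func_word_count used above is defined elsewhere in the script; a minimal version (an assumption, not necessarily the original implementation) could count the whitespace-separated tokens on each line and sum them:

# Hypothetical implementation of func_word_count:
# total number of whitespace-separated words in a character vector
func_word_count <- function(x) {
  sum(sapply(strsplit(x, "\\s+"), length))
}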

Plot Line Count

library(ggplot2)
p <- ggplot(data = df, aes(x = text.source, y = line.count)) +
  geom_bar(stat = "identity", color = "white", fill = "steelblue") +
  ggtitle("Line Count") +
  labs(x = "Data Source", y = "Line Count") +
  geom_text(aes(label = line.count), size = 3, position = position_stack(vjust = 0.5)) +
  theme_minimal()
p

Plot Word Count

w <- ggplot(data = df, aes(x = text.source, y = word.count)) +
  geom_bar(stat = "identity", color = "white", fill = "steelblue") +
  ggtitle("Word Count") +
  labs(x = "Data Source", y = "Word Count") +
  geom_text(aes(label = word.count), size = 3, position = position_stack(vjust = 0.5)) +
  theme_minimal()
w

For further analysis, let’s subset the data and analyze the first 3,000 lines of each of the three files.

blog_subset <- blog[1:3000]
twitter_subset <- twitter[1:3000]
news_subset <- news[1:3000]

For the initial analysis we will work with these subsets, starting with the blog data. Let’s load it into a corpus and do 1-gram modelling to analyze the word patterns.

library("tm")
library(RWeka)     # provides NGramTokenizer() and Weka_control() used below
library(reshape2)
a <- Corpus(VectorSource(blog_subset))
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a, stripWhitespace)
a <- tm_map(a, content_transformer(tolower))
options(mc.cores = 1)  # avoid parallel issues between tm and RWeka
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
tdm <- TermDocumentMatrix(a, control = list(tokenize = UnigramTokenizer))

This gives us the term-document matrix of the 1-gram model. Let’s get the top 15 highest-frequency words and their frequencies. Please note that this is only a subset of the data, used for initial analysis.

tdm.matrix_blog <- as.matrix(tdm)
topwords <- rowSums(tdm.matrix_blog)           # total frequency of each term
top_sort <- sort(topwords, decreasing = TRUE)  # sort terms by frequency
top_15 <- top_sort[1:15]                       # keep the 15 most frequent

Top 15 words for the blog data subset:

##  the  and that  for with  you  was this have  but  are  not from  all they 
## 6107 3644 1536 1160  991  974  908  833  729  684  581  546  483  457  427

Let’s do the same analysis for the twitter and news data as well.
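
Rather than repeating the code, the 1-gram pipeline above can be wrapped in a small helper and applied to the other two subsets. The sketch below is illustrative; the function name top_unigrams is introduced here and is not part of the original script.

top_unigrams <- function(text_subset, n = 15) {
  corp <- Corpus(VectorSource(text_subset))
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, stripWhitespace)
  corp <- tm_map(corp, content_transformer(tolower))
  tok  <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
  tdm  <- TermDocumentMatrix(corp, control = list(tokenize = tok))
  sort(rowSums(as.matrix(tdm)), decreasing = TRUE)[1:n]   # top n terms by frequency
}
top_unigrams(twitter_subset)
top_unigrams(news_subset)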

Top 15 words for the Twitter data subset:

##  the  you  and  for that with this your  are have just  all  its  but  was 
## 1176  690  566  476  317  225  212  210  202  191  186  161  158  158  155

Top 15 words for the News data subset:

##  the  and  for that with said  was  his  but from  are have  its  has  not 
## 5779 2640 1145 1030  776  752  664  498  492  452  445  431  356  356  348

Next steps

The project goal is to develop a natural language prediction algorithm and app. We will continue by developing an n-gram dictionary: a dictionary of bigrams, trigrams, and four-grams, collectively called n-grams, as sketched below.
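
As a sketch of that step, the same tm/RWeka tooling used above can build the higher-order tables simply by changing the tokenizer’s min and max arguments; the variable names below are illustrative.

BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Frequency table of bigrams over the blog corpus object a created earlier
tdm_bigram  <- TermDocumentMatrix(a, control = list(tokenize = BigramTokenizer))
bigram_freq <- sort(rowSums(as.matrix(tdm_bigram)), decreasing = TRUE)
head(bigram_freq, 10)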

Then we will move on to predicting the next word using a back-off model over the n-gram dictionary built in the previous step.
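
A very rough illustration of the back-off idea is sketched below. All names here are hypothetical and the real model will be more refined: look the last words of the user’s input up in the highest-order table first, and fall back to lower orders when there is no match.

# Hypothetical back-off lookup: try trigrams keyed on the last two words,
# then bigrams keyed on the last word, then the overall most frequent unigram.
# Each *_freq argument is a named frequency vector sorted in decreasing order, as built above.
predict_next <- function(input, trigram_freq, bigram_freq, unigram_freq) {
  words <- tolower(unlist(strsplit(input, "\\s+")))
  n <- length(words)
  if (n >= 2) {
    hits <- grep(paste0("^", words[n - 1], " ", words[n], " "), names(trigram_freq), value = TRUE)
    if (length(hits) > 0) return(sub(".* ", "", hits[1]))  # last word of the best trigram
  }
  if (n >= 1) {
    hits <- grep(paste0("^", words[n], " "), names(bigram_freq), value = TRUE)
    if (length(hits) > 0) return(sub(".* ", "", hits[1]))  # last word of the best bigram
  }
  names(unigram_freq)[1]                                    # fall back to the top unigram
}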

After the model is developed, an app will take input from the user and suggest the most likely words to come next.