Let’s load the en_US data for initial analysis. It consists of three files: blog, Twitter, and news text.
Let’s look at our data set. For the initial analysis we compute the line count and word count of all three files.
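The loading step and the func_word_count helper are not shown above; the lines below are a minimal sketch, assuming the standard en_US file names and a simple whitespace-based word counter (both the file paths and the helper definition are assumptions, not the original code).
# Sketch of the loading step (file paths are assumed; adjust as needed)
blog    <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
# Possible definition of func_word_count (assumption): total number of
# whitespace-separated tokens across all lines of a character vector
func_word_count <- function(x) sum(sapply(strsplit(x, "\\s+"), length))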
df <- data.frame(text.source = c("blog", "twitter", "news"), line.count = NA, word.count = NA)
input_str_list <- list(blog = blog, twitter = twitter, news = news)
df$line.count <- sapply(input_str_list, length)
df$word.count <- sapply(input_str_list, func_word_count)
library(ggplot2)
p <- ggplot(data = df, aes(x = text.source, y = line.count)) +
  geom_bar(stat = "identity", color = "white", fill = "steelblue") +
  ggtitle("Line Count") +
  labs(x = "Data Source", y = "Line Count") +
  geom_text(aes(label = line.count), size = 3, position = position_stack(vjust = 0.5)) +
  theme_minimal()
p
## Plot Word Count
w <- ggplot(data = df, aes(x = text.source, y = word.count)) +
  geom_bar(stat = "identity", color = "white", fill = "steelblue") +
  ggtitle("Word Count") +
  labs(x = "Data Source", y = "Word Count") +
  geom_text(aes(label = word.count), size = 3, position = position_stack(vjust = 0.5)) +
  theme_minimal()
w
For further analysis, let’s subset the data and analyze the first 3000 lines of each of the three files.
blog_subset <- blog[1:3000]
twitter_subset <- twitter[1:3000]
news_subset <- news[1:3000]
For the initial analysis we will work with this subset of the data. Let’s start with the blog subset: we load it into a corpus and run unigram (1-gram) tokenization to analyze the word-frequency pattern.
library("tm")
## Warning: package 'tm' was built under R version 3.3.3
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(reshape2)
a <- Corpus(VectorSource(blog_subset))
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a, stripWhitespace)
a <- tm_map(a, content_transformer(tolower))
options(mc.cores=1)
library(RWeka)  # provides NGramTokenizer and Weka_control
# With min = max = 1 this is effectively a unigram tokenizer
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
tdm <- TermDocumentMatrix(a, control = list(tokenize = UnigramTokenizer))
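As a quick sanity check (not part of the original analysis), tm’s findFreqTerms() can list the terms above a chosen frequency threshold before converting the matrix:
# List unigrams appearing at least 500 times (the threshold is an arbitrary choice)
findFreqTerms(tdm, lowfreq = 500)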
This gives us the term-document matrix for the unigram model. Let’s get the 15 highest-frequency words and their frequencies. Please note that this is only a subset of the data, used for initial analysis.
tdm.matrix_blog <- as.matrix(tdm)
topwords <- rowSums(tdm.matrix_blog)
top_sort <- sort(topwords, decreasing = TRUE)
top_15 <- top_sort[1:15]
Top 15 words from the blog subset:
## the and that for with you was this have but are not from all they
## 6107 3644 1536 1160 991 974 908 833 729 684 581 546 483 457 427
Let’s do the same analysis for the Twitter and news data as well; a reusable helper is sketched below.
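The Twitter and news runs are not shown in the original; a small wrapper such as the one below (a sketch only; the name top_n_words is hypothetical) could repeat the blog pipeline for each subset:
# Hypothetical helper: repeat the unigram pipeline for any text subset
top_n_words <- function(text_vec, n = 15) {
  corp <- Corpus(VectorSource(text_vec))
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, stripWhitespace)
  corp <- tm_map(corp, content_transformer(tolower))
  tdm  <- TermDocumentMatrix(corp, control = list(tokenize = UnigramTokenizer))
  sort(rowSums(as.matrix(tdm)), decreasing = TRUE)[1:n]
}
top_n_words(news_subset)
top_n_words(twitter_subset)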
Top 15 words for News data subset:
## the you and for that with this your are have just all its but was
## 1176 690 566 476 317 225 212 210 202 191 186 161 158 158 155
Top 15 words for Twitter data subset:
## the and for that with said was his but from are have its has not
## 5779 2640 1145 1030 776 752 664 498 492 452 445 431 356 356 348
The project goal is to develop a natural-language prediction algorithm and app. We will continue by building an n-gram dictionary: a collection of bigrams, trigrams, and four-grams, collectively called n-grams.
Then we will predict the next word from the n-grams using a back-off model over the dictionary built in the n-gram step.
After the model is developed, the app will take input from the user and suggest the most likely words to come next.
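To illustrate the planned back-off approach (a minimal sketch only; the frequency tables and the predict_next function are assumptions, not the final model), prediction could fall back from trigram to bigram to unigram counts:
# Hypothetical back-off prediction over pre-built n-gram frequency tables.
# trigram_freq / bigram_freq / unigram_freq are assumed to be named numeric
# vectors whose names are space-separated n-grams, e.g. "one of the".
predict_next <- function(phrase, trigram_freq, bigram_freq, unigram_freq) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)

  # 1. Try trigrams whose first two words match the last two input words
  if (length(words) == 2) {
    prefix <- paste(words, collapse = " ")
    hits <- trigram_freq[startsWith(names(trigram_freq), paste0(prefix, " "))]
    if (length(hits) > 0) return(sub(".* ", "", names(which.max(hits))))
  }

  # 2. Back off to bigrams starting with the last input word
  hits <- bigram_freq[startsWith(names(bigram_freq), paste0(tail(words, 1), " "))]
  if (length(hits) > 0) return(sub(".* ", "", names(which.max(hits))))

  # 3. Final fallback: the single most frequent unigram
  names(which.max(unigram_freq))
}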