Jiahao Deng
I used the caTools package to split the data into two parts, the tm package to clean the smaller part of the data, and the wordcloud package to illustrate the frequency of the words.
I have saved my workspace, so I just need to load it into R.
## Warning in readLines("en_US.twitter.txt"): line 167155 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 268547 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1274086 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1759032 appears to contain
## an embedded nul
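The warnings above come from readLines() hitting embedded nul bytes in the Twitter file. Since the workspace is loaded directly, the reading step is not echoed here; below is a minimal sketch of how the three files could be read, assuming the blogs and news files follow the same naming pattern as the Twitter file. Setting skipNul = TRUE drops the nul bytes and silences these warnings.
# Assumed file names for blogs and news; skipNul = TRUE skips embedded nuls
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)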
To make the exploratory analysis faster, I use the sample.split function to draw a 1% sample of each data set.
library(caTools)
set.seed(0)
# Keep a random 1% of each file for exploration
blogs.split <- sample.split(blogs, SplitRatio = 0.01)
blogs.trainSparse <- subset(blogs, blogs.split == TRUE)
twitter.split <- sample.split(twitter, SplitRatio = 0.01)
twitter.trainSparse <- subset(twitter, twitter.split == TRUE)
news.split <- sample.split(news, SplitRatio = 0.01)
news.trainSparse <- subset(news, news.split == TRUE)
library(tm)
## Loading required package: NLP
library(tidyr)
blogs.corpus <- Corpus(VectorSource(blogs.trainSparse))
# content_transformer() keeps the documents as PlainTextDocuments,
# so the tm_map(PlainTextDocument) workaround is not needed
blogs.corpus <- blogs.corpus %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeWords, c("just", "can", "like", stopwords("english"))) %>%
  tm_map(removeNumbers) %>%
  tm_map(stripWhitespace)
twitter.corpus <- Corpus(VectorSource(twitter.trainSparse))
twitter.corpus <- twitter.corpus %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeWords, c("just", "can", "like", stopwords("english"))) %>%
  tm_map(removeNumbers) %>%
  tm_map(stripWhitespace)
news.corpus <- Corpus(VectorSource(news.trainSparse))
news.corpus <- news.corpus %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeWords, c("just", "can", "like", stopwords("english"))) %>%
  tm_map(removeNumbers) %>%
  tm_map(stripWhitespace)
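The word clouds below come from the wordcloud package; the chunk that draws them is not echoed in this report. A minimal sketch, assuming term frequencies computed from the cleaned corpora (word.freq is a hypothetical helper name):
library(wordcloud)
library(RColorBrewer)
# Hypothetical helper: sorted term frequencies from a cleaned corpus
word.freq <- function(corpus) {
  tdm <- TermDocumentMatrix(corpus)
  sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
}
blogs.freq <- word.freq(blogs.corpus)
wordcloud(names(blogs.freq), blogs.freq, max.words = 100,
          colors = brewer.pal(8, "Dark2"))
# twitter.corpus and news.corpus are drawn the same way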
Wordcloud in the Blogs File
Wordcloud in the Twitter File
Wordcloud in the News File
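The bar charts below expect data frames (blogs.top.plot and friends) with word and times columns that are not built in the code shown. A sketch of how they might be derived, reusing the hypothetical word.freq helper above (top.words is also a hypothetical name):
library(ggplot2)
# Hypothetical helper: top-n words as the data frame the plots expect
top.words <- function(freq, n = 20) {
  data.frame(word = names(freq)[1:n], times = as.numeric(freq[1:n]))
}
blogs.top.plot <- top.words(blogs.freq)
twitter.top.plot <- top.words(word.freq(twitter.corpus))
news.top.plot <- top.words(word.freq(news.corpus))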
blogs.g <- ggplot(blogs.top.plot, aes(x = reorder(word, times), y = times)) +
  geom_bar(stat = "identity", fill = "lightgreen") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Top 20 Words in Blogs File") +
  xlab("Top Words") + ylab("Number of Records")
blogs.g
twitter.g <- ggplot(twitter.top.plot, aes(x = reorder(word, times), y = times)) +
  geom_bar(stat = "identity", fill = "lightgreen") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Top 20 Words in Twitter File") +
  xlab("Top Words") + ylab("Number of Records")
twitter.g
news.g <- ggplot(news.top.plot, aes(x = reorder(word, times), y = times)) +
  geom_bar(stat = "identity", fill = "lightgreen") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Top 20 Words in News File") +
  xlab("Top Words") + ylab("Number of Records")
news.g
Plans for the prediction algorithm:
1. If a word appears in a high-frequency n-gram group, predict the next word as the one that follows it in that n-gram (see the sketch after this list).
2. Build a KNN model that picks, as the predicted next word, the candidate with the nearest distance to the previous word.
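A minimal sketch of idea 1, using bigrams only and the sampled blogs text from above (predict.next is a hypothetical helper, not the final model):
# Count adjacent word pairs in the sampled text
words <- unlist(strsplit(tolower(blogs.trainSparse), "\\s+"))
bigrams <- paste(head(words, -1), tail(words, -1))
bigram.freq <- sort(table(bigrams), decreasing = TRUE)
# Predict the most frequent follower of a given word
predict.next <- function(word) {
  hits <- grep(paste0("^", word, " "), names(bigram.freq))
  if (length(hits) == 0) return(NA)
  strsplit(names(bigram.freq)[hits[1]], " ")[[1]][2]
}
predict.next("happy")  # most common word after "happy" in the sample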