Jiahao Deng’s Milestone Report

Jiahao Deng

In this report I aim to accomplish the following tasks:

  1. Demonstrate that I have downloaded the data and successfully loaded it.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that I have amassed so far.
  4. Get feedback on my plans for creating a prediction algorithm and Shiny app.

Tools that I use

I used the “caTools” package to split the data into two parts and the “tm” package to clean the smaller part. I used the “wordcloud” package to illustrate word frequencies and “ggplot2” to draw the bar charts of the most frequent words.

Load the working environment

I have saved my workspace, so I only need to load it into R.

## Warning in readLines("en_US.twitter.txt"): line 167155 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 268547 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1274086 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1759032 appears to contain
## an embedded nul
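
The warnings above come from nul bytes embedded in the Twitter file. The following is a minimal sketch of one way to read such a file without them and to compute a few basic counts. It is not the code used to produce the saved workspace. The helper names read.text.file and summarise.lines are hypothetical, and the word count is only approximate because it splits on whitespace. The demo writes a tiny temporary file so the block runs on its own.

```r
# Hypothetical helper: read a text file line by line, skipping embedded nuls.
read.text.file <- function(path) {
  con <- file(path, open = "rb")   # open the file in binary mode
  on.exit(close(con))              # always close the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Hypothetical helper: basic summary statistics for a vector of lines.
summarise.lines <- function(lines) {
  data.frame(
    lines = length(lines),
    # approximate word count: number of whitespace-separated tokens
    words = sum(lengths(strsplit(lines, "\\s+")))
  )
}

# Demo on a small temporary file whose first line contains a nul byte:
# bytes are "a", nul, "b", newline, "c", space, "d", newline
tmp <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x61, 0x00, 0x62, 0x0a, 0x63, 0x20, 0x64, 0x0a)), tmp)

demo.lines   <- read.text.file(tmp)
demo.summary <- summarise.lines(demo.lines)
print(demo.summary)
```

With the real data, the same helpers could be applied to "en_US.twitter.txt" and the other two files, assuming they sit in the working directory.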

Split the data

To make the exploratory analysis faster, I use the sample.split function from caTools to draw a 1% sample of each file.

library(caTools)
set.seed(seed = 0)
blogs.split = sample.split(blogs, SplitRatio = 0.01)
blogs.trainSparse = subset(blogs, blogs.split==TRUE)
twitter.split = sample.split(twitter, SplitRatio = 0.01)
twitter.trainSparse = subset(twitter, twitter.split==TRUE)
news.split = sample.split(news, SplitRatio = 0.01)
news.trainSparse = subset(news, news.split==TRUE)
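
sample.split is designed for splitting labelled outcome vectors while preserving class ratios. For plain text lines, a base R draw with sample() gives a comparable random subset. The following is a minimal sketch of that alternative, not the method used in this report. The helper name sample.lines and the example data are hypothetical.

```r
# Hypothetical helper: keep a random fraction of the lines, in original order.
sample.lines <- function(lines, ratio = 0.01, seed = 0) {
  set.seed(seed)                                  # make the draw reproducible
  n.keep <- round(length(lines) * ratio)          # number of lines to keep
  keep   <- sample(length(lines), size = n.keep)  # random line indices
  lines[sort(keep)]
}

# Demo on made-up data: 1000 lines, 1% sample
example.lines  <- paste("line", 1:1000)
example.sample <- sample.lines(example.lines, ratio = 0.01)
length(example.sample)
```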

Clean the data

library(tm)
## Loading required package: NLP
library(tidyr)  # provides the %>% pipe

# Apply the same cleaning steps to each sample and return a tm corpus.
# tolower is wrapped in content_transformer() so that the documents keep
# their class and no PlainTextDocument conversion is needed afterwards.
clean.corpus <- function(text) {
  Corpus(VectorSource(text)) %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(removePunctuation) %>%
    tm_map(removeWords, c("just", "can", "like", stopwords("english"))) %>%
    tm_map(removeNumbers) %>%
    tm_map(stripWhitespace)
}

blogs.corpus   <- clean.corpus(blogs.trainSparse)
twitter.corpus <- clean.corpus(twitter.trainSparse)
news.corpus    <- clean.corpus(news.trainSparse)

Plot

Wordcloud in the Blogs File

Wordcloud in the Twitter File

Wordcloud in the News File

library(ggplot2)

# Bar chart of the 20 most frequent words in the blogs sample
blogs.g <- ggplot(blogs.top.plot, aes(x = reorder(word, times), y = times)) +
  geom_bar(stat = "identity", fill = "lightgreen") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Top 20 Words in Blogs File") +
  xlab("Top Words") +
  ylab("Number of Occurrences")
blogs.g

# Bar chart of the 20 most frequent words in the Twitter sample
twitter.g <- ggplot(twitter.top.plot, aes(x = reorder(word, times), y = times)) +
  geom_bar(stat = "identity", fill = "lightgreen") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Top 20 Words in Twitter File") +
  xlab("Top Words") +
  ylab("Number of Occurrences")
twitter.g

# Bar chart of the 20 most frequent words in the news sample
news.g <- ggplot(news.top.plot, aes(x = reorder(word, times), y = times)) +
  geom_bar(stat = "identity", fill = "lightgreen") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Top 20 Words in News File") +
  xlab("Top Words") +
  ylab("Number of Occurrences")
news.g
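
The plotting code above expects data frames such as blogs.top.plot with the columns word and times, but this report does not show how they are built. The following is a minimal sketch in base R of how such a table could be produced from already-cleaned text. The helper name top.words and the example text are hypothetical, and the original tables may have been built differently, for example from a term-document matrix.

```r
# Hypothetical helper: return the n most frequent words in a character
# vector of cleaned text, as a data frame with columns word and times.
top.words <- function(text, n = 20) {
  tokens <- unlist(strsplit(text, "\\s+"))           # split on whitespace
  tokens <- tokens[tokens != ""]                     # drop empty tokens
  counts <- sort(table(tokens), decreasing = TRUE)   # most frequent first
  counts <- counts[seq_len(min(n, length(counts)))]  # keep at most n words
  data.frame(word  = names(counts),
             times = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Demo on made-up text
example.text <- c("good day good night", "good day")
example.top  <- top.words(example.text, n = 2)
print(example.top)
```

A table of this shape could also feed the word clouds, for instance with wordcloud(words = example.top$word, freq = example.top$times) from the wordcloud package.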

Initial idea to build a prediction model

I would like to build a prediction model based on the following steps:

1. If the preceding words match the start of a high-frequency n-gram, predict the word that follows them in that n-gram.

2. Build a KNN model that predicts, as the next word, the candidate nearest to the previous word.
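
Step 1 above can be sketched for the simplest case, n = 2. The following base R example counts bigrams and predicts the most frequent follower of a given word. It is only an illustration under simplifying assumptions: the input is already cleaned, at least one line has two words, and there is no smoothing or back-off. The function names build.bigrams and guess.next.word and the example lines are hypothetical.

```r
# Hypothetical helper: count how often each word pair occurs.
build.bigrams <- function(lines) {
  words <- lapply(strsplit(lines, "\\s+"), function(w) w[w != ""])
  first  <- unlist(lapply(words, function(w) head(w, -1)))  # all but last word
  second <- unlist(lapply(words, function(w) tail(w, -1)))  # all but first word
  pairs  <- data.frame(first = first, second = second,
                       n = rep(1L, length(first)),
                       stringsAsFactors = FALSE)
  # sum the counts for each distinct (first, second) pair
  aggregate(n ~ first + second, data = pairs, FUN = sum)
}

# Hypothetical helper: most frequent word seen after `previous`,
# or NA if `previous` never starts a bigram.
guess.next.word <- function(bigrams, previous) {
  candidates <- bigrams[bigrams$first == previous, ]
  if (nrow(candidates) == 0) return(NA_character_)
  candidates$second[which.max(candidates$n)]
}

# Demo on made-up lines
example.bigrams <- build.bigrams(c("i love you", "i love r", "i love you too"))
guess.next.word(example.bigrams, "love")
```

Longer n-grams would follow the same pattern, with the lookup key built from the last n - 1 words instead of one.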