This report will analyze the data available for the Capstone Project. The goal of the project is to develop a predictive text writing application. There are three data sets available:
1. en_US.blogs.txt
2. en_US.twitter.txt
3. en_US.news.txt
Each data set uses the language differently. To keep the report simple and concise, the R code used for the analysis is not shown.
As a first step, the words in each text file are tokenized by splitting on punctuation and whitespace. The data sets are then cleaned by converting all characters to ASCII encoding and to lower case. Finally, frequency tables of the tokens are built and saved to data frames for later display.
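The code behind this report is not shown, so the following is only a minimal base-R sketch of how these three steps could look. The helper names `tokenize` and `token_frequencies` and the sample sentences are assumptions made for illustration. Splitting on punctuation also breaks contractions such as "don't" into two tokens.

```r
# Hypothetical helper: clean a character vector and split it into tokens.
tokenize <- function(lines) {
  # Convert to ASCII; characters that cannot be represented are dropped.
  ascii <- iconv(lines, from = "UTF-8", to = "ASCII", sub = "")
  lower <- tolower(ascii)
  # Split on runs of whitespace and/or punctuation.
  tokens <- unlist(strsplit(lower, "[[:space:][:punct:]]+"))
  # Discard empty strings produced by leading separators.
  tokens[nzchar(tokens)]
}

# Hypothetical helper: frequency table of tokens as a data frame,
# sorted from most to least frequent.
token_frequencies <- function(tokens) {
  counts <- sort(table(tokens), decreasing = TRUE)
  data.frame(token = names(counts),
             freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Illustrative input only; the real input would be the lines of one text file.
sample_lines <- c("I love R, and you?", "You and I love text.")
freq <- token_frequencies(tokenize(sample_lines))
print(head(freq))
```

The number of rows of such a data frame is the number of distinct tokens, and the sum of its `freq` column is the total number of tokens.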
First, let’s look at the number of lines in each data set. With about 2.3 million lines, the “twitter” data set has the most lines, followed by “news” with about 1 million and “blogs” with about 900,000.
One might expect the data set with the most lines to also have the most words. But as can be seen below, the “twitter” data set has the lowest number of words. This makes sense, as a tweet is limited to 140 characters. Both “news” and “blogs” have a higher number of words, at around 40 million each.
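Line and word counts of this kind can be obtained along the following lines. This is a sketch under assumptions, not the code used for the report: the helper name `count_lines_words` is hypothetical, words are taken to be whitespace-separated, and the `demo` vector stands in for the contents of a file.

```r
# Hypothetical helper: number of lines and whitespace-separated words
# in a character vector holding one line of text per element.
count_lines_words <- function(lines) {
  words_per_line <- vapply(
    strsplit(lines, "[[:space:]]+"),
    function(w) sum(nzchar(w)),  # ignore empty strings from leading whitespace
    integer(1)
  )
  c(lines = length(lines), words = sum(words_per_line))
}

# For a real file, `lines` could be read with something like:
#   lines <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
demo <- c("short tweet", "a somewhat longer news sentence here", "")
result <- count_lines_words(demo)
print(result)
```

Dividing `words` by `lines` gives the average number of words per line, which makes the difference between short tweets and longer news or blog entries directly comparable.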
Surprisingly, the number of distinct tokens is the same for “blogs” and “twitter”. One might have expected people who use Twitter to use fewer distinct words than people who write blogs. Similarly, professionals who write news articles could have been expected to use the largest vocabulary.
Among the most frequent words of the “twitter” data set, besides common function words like “and”, “or” and “to”, the words “I” and “you” stand out with a high frequency. These are the most frequent words in the “twitter” data set:
Unsurprisingly, these two words are not as frequent in the “news” data set:
The same holds for “blogs”:
The average number of characters per word is also interesting. Probably because of the character limit, people on Twitter use shorter words (about 4.6 characters per word), while “news” has an average of about 5.25 characters per word.
## [1] 4.605104 5.007488 5.251882
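An average word length like this can be computed from a token vector as sketched below. The helper name `mean_word_length` and the sample tokens are assumptions, and the sketch averages over all token occurrences rather than over distinct words, which may differ from what was done for the report.

```r
# Hypothetical helper: mean number of characters per token.
mean_word_length <- function(tokens) {
  tokens <- tokens[nzchar(tokens)]  # ignore empty strings
  mean(nchar(tokens))
}

# Illustrative tokens only.
demo_tokens <- c("i", "love", "text", "mining")
avg_len <- mean_word_length(demo_tokens)
print(avg_len)
```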
As a next step, separate models will be created for each data set, since the usage of language varies slightly between them. So far, only one model is used for data processing. Several NLP packages will be evaluated after splitting the files into training and test sets, and different training and test sizes will be tried. Profanity will have to be removed from all files, and the limitations of the Shiny app have to be looked into.
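As a rough illustration of two of these planned steps, the sketch below splits lines at random into training and test sets and removes banned words from a token vector. The helper names, the 80/20 split and the placeholder word list are assumptions; a real profanity list would have to come from an external source.

```r
set.seed(42)  # fixed seed so the random split is reproducible

# Hypothetical helper: random split of lines into training and test sets.
split_train_test <- function(lines, train_fraction = 0.8) {
  n_train <- floor(length(lines) * train_fraction)
  train_idx <- sample.int(length(lines), n_train)
  test_idx <- setdiff(seq_along(lines), train_idx)
  list(train = lines[train_idx], test = lines[test_idx])
}

# Hypothetical helper: drop tokens that appear in a list of banned words.
remove_profanity <- function(tokens, banned) {
  tokens[!(tolower(tokens) %in% tolower(banned))]
}

# Illustrative data only.
demo_lines <- paste("line", 1:10)
parts <- split_train_test(demo_lines, train_fraction = 0.8)
print(lengths(parts))

banned_words <- c("darn", "heck")  # placeholder list, not a real profanity list
clean <- remove_profanity(c("well", "Darn", "that", "works"), banned_words)
print(clean)
```

Changing `train_fraction` is one simple way to experiment with different training and test sizes.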