The purpose of this project is to develop a text-prediction algorithm based on three files of English-language text collected from Blogs, News articles and Twitter.
This document describes the analysis of the provided data in preparation for building the prediction algorithm. It first presents a preliminary analysis of the files, counting lines, words and sentences. It then analyses part of the Twitter file for frequent terms, word associations and word clustering.
Because the document includes plots of the analysis, the amount of data analysed has been kept small. A small random sample drawn from a large data set can nevertheless be representative enough for reasonably accurate prediction.
Three source files have been provided. They are as follows.
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
| fileName | linesPerFile | wordsPerFile | avgWordsPerLine |
|---|---|---|---|
| en_US.blogs.txt | 899288 | 36886194 | 41.01711 |
| en_US.news.txt | 1010242 | 33535577 | 33.19559 |
| en_US.twitter.txt | 2360148 | 29414810 | 12.46312 |
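The counts in this table can be reproduced along the following lines. This is a minimal Python sketch, not the code used for the report: it assumes that a "word" is any whitespace-separated token, so its totals may differ slightly from those above if the report tokenised differently. The `FileStats` and `file_stats` names are hypothetical.

```python
from typing import Iterable, NamedTuple


class FileStats(NamedTuple):
    lines: int
    words: int
    avg_words_per_line: float


def file_stats(lines: Iterable[str]) -> FileStats:
    """Count lines and whitespace-separated words in an iterable of lines."""
    n_lines = 0
    n_words = 0
    for line in lines:
        n_lines += 1
        n_words += len(line.split())  # assumption: words are whitespace-delimited
    avg = n_words / n_lines if n_lines else 0.0
    return FileStats(n_lines, n_words, avg)


# Small in-memory example; a file handle can be passed in the same way.
sample = ["the quick brown fox", "jumps", "over the lazy dog"]
print(file_stats(sample))
```

Because the function accepts any iterable of lines, an open file object (for example one opened with `encoding="utf-8"`) can be streamed through it without loading the whole file into memory.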
…
We notice that the average number of words per line for Blogs is about 1.2 times that for News and about 3.3 times that for Twitter. It remains to be determined whether the words per sentence follow a similar pattern (in the raw files, sentences can span several lines and one line can contain more than one sentence). If they do, it would imply that people write longer sentences in Blogs and News articles than in Twitter posts.
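The two ratios in this comparison can be recomputed from the per-line averages in the table. The sketch below is in Python for illustration only, and the `ratio` helper is a hypothetical name.

```python
# Average words per line, copied from the table of file statistics.
avg_words_per_line = {
    "blogs": 41.01711,
    "news": 33.19559,
    "twitter": 12.46312,
}


def ratio(a: str, b: str) -> float:
    """How many times longer the average line in `a` is than in `b`."""
    return avg_words_per_line[a] / avg_words_per_line[b]


print(f"blogs / news:    {ratio('blogs', 'news'):.2f}")
print(f"blogs / twitter: {ratio('blogs', 'twitter'):.2f}")
```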
However, we notice that the total number of words is quite similar across all 3 files, even though the Twitter file has far more lines than the Blogs and News files. This is expected, because each tweet is limited in the number of characters it can contain, whereas there is no such limit on Blogs or News articles.
These graphs show that the number of words per line in the Twitter file is fairly regular, whereas in Blogs and News articles most lines contain very few words and a few lines contain a great many. (The graphs for Blogs and News have been excluded to reduce the size of the file, because RPubs and GitHub give errors.)
The display below gives an idea of the most and least frequently used words in Blogs, News and Twitter. The frequency distribution of the words is also provided separately for each source.
Commonly used words (stop words such as pronouns) have been removed from the word list.
A separate analysis measures the volume of profanity (swear words) used in Blogs, News and Twitter. The profanity word list was obtained from the Internet and may not be complete.
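A word-frequency count with stop-word removal and a profanity tally could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the two word lists are tiny placeholders, the tokeniser drops apostrophes and keeps only alphabetic runs, and profane words are counted and then left out of the frequency table. The report's actual lists and treatment are not shown here.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists for illustration only.
STOP_WORDS = {"i", "you", "he", "she", "it", "we", "they",
              "the", "a", "an", "and", "to", "of", "is"}
PROFANITY = {"darn", "heck"}


def tokenize(text):
    """Lower-case the text, drop apostrophes, and keep alphabetic runs."""
    text = text.lower().replace("'", "")
    return re.findall(r"[a-z]+", text)


def word_frequencies(lines):
    """Return (frequencies excluding stop and profane words, profanity count)."""
    freq = Counter()
    profanity_hits = 0
    for line in lines:
        for word in tokenize(line):
            if word in PROFANITY:
                profanity_hits += 1      # counted, then excluded (assumption)
            elif word not in STOP_WORDS:
                freq[word] += 1
    return freq, profanity_hits


lines = ["I'm glad you like the new day",
         "Heck, the day is good and I like it"]
freq, hits = word_frequencies(lines)
print(freq.most_common(3))
print("profanity uses:", hits)
```

Dropping apostrophes before tokenising turns "I'm" into "im", which is one plausible way for a token such as "im" to appear in a frequency list.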
After taking these initial measurements, we cleaned the documents. The following cleaning steps were applied.
The top 10 most frequently occurring words in the file “en_US.blogs.txt” are provided below.
## WORD FREQ
## 1 about 114832
## 2 out 108098
## 3 up 105541
## 4 just 99524
## 5 like 97912
## 6 more 92065
## 7 time 87438
## 8 get 70480
## 9 know 59505
## 10 now 58780
The top 10 words account for 2.42 percent of the words in the document.
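The percentage can be checked with a few lines of arithmetic. The Python sketch below assumes the denominator is the raw word total reported for en_US.blogs.txt (36886194); that assumption reproduces the stated figure. The `top_share_percent` name is hypothetical.

```python
# Top-10 word counts for Blogs and the file's total word count,
# both copied from the output shown earlier in this document.
top10_blogs = {
    "about": 114832, "out": 108098, "up": 105541, "just": 99524,
    "like": 97912, "more": 92065, "time": 87438, "get": 70480,
    "know": 59505, "now": 58780,
}
total_words_blogs = 36886194  # wordsPerFile for en_US.blogs.txt


def top_share_percent(counts, total):
    """Share of all words taken up by the listed words, in percent."""
    return 100.0 * sum(counts.values()) / total


print(f"{top_share_percent(top10_blogs, total_words_blogs):.2f} percent")  # prints "2.42 percent"
```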
Profanity words are used 20157 times in Blogs.
The frequency distribution of the words used in Blogs is given below (after removing commonly used words and after profanity treatment).
The mean of the word frequency is 42.0271856.
The median of the word frequency is 1.
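A mean far above a median of 1 is what a heavily skewed distribution produces: most distinct words occur once, while a handful occur very often. The Python sketch below illustrates this on an invented toy token list; it is not the report's data.

```python
from collections import Counter
from statistics import mean, median

# Invented toy corpus: one very common word, one moderately common word,
# and five words that each occur exactly once.
tokens = (["time"] * 40 + ["day"] * 5
          + ["zephyr", "quokka", "ennui", "sonder", "petrichor"])

counts = list(Counter(tokens).values())  # frequency of each distinct word

print("mean frequency:  ", mean(counts))
print("median frequency:", median(counts))
```

In this toy case the single frequent word pulls the mean well above the median, the same pattern seen in the figures reported for each file.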
(The graphs for Blogs and News have been excluded to reduce the size of the file, because RPubs and GitHub give errors.)
The number of words removed during the Blog document cleaning process is 21690.
The top 10 most frequently occurring words in the file “en_US.news.txt” are provided below.
## WORD FREQ
## 1 said 250326
## 2 about 89527
## 3 more 87913
## 4 up 72030
## 5 out 71482
## 6 new 70189
## 7 after 62416
## 8 year 57320
## 9 just 52981
## 10 first 52542
The top 10 words account for 2.58 percent of the words in the document.
Profanity words are used 7162 times in News.
The frequency distribution of the words used in News is given below (after removing commonly used words and after profanity treatment).
The mean of the word frequency is 57.9986147.
The median of the word frequency is 2.
(The graphs for Blogs and News have been excluded to reduce the size of the file, because RPubs and GitHub give errors.)
The number of words removed during the News document cleaning process is 7967.
The top 10 most frequently occurring words in the file “en_US.twitter.txt” are provided below.
## WORD FREQ
## 1 im 157871
## 2 just 149580
## 3 like 121279
## 4 out 114020
## 5 up 112830
## 6 get 111901
## 7 love 105430
## 8 good 99549
## 9 about 90929
## 10 day 89815
The top 10 words account for 3.92 percent of the words in the document.
Profanity words are used 85831 times in Twitter.
The frequency distribution of the words used in Twitter is given below (after removing commonly used words and after profanity treatment).
The mean of the word frequency is 35.2041203.
The median of the word frequency is 1.
The number of words removed during the Twitter document cleaning process is 88832.
Sentence analysis is conducted on 100,000 lines of text in each of the files.
The numbers of sentences extracted from the sampled lines of each file are as follows.
This seems to indicate that people write shorter sentences in Blogs and News articles and comparatively longer sentences on Twitter. This could be because people do not write proper sentences on Twitter and may not use sentence-separator characters when writing tweets.
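The effect of missing sentence separators can be seen in a simple splitter. The Python sketch below assumes a sentence ends at ".", "!" or "?" followed by whitespace. That rule is an illustrative assumption, not the report's method, and it will wrongly split abbreviations such as "Dr. Smith". The function names are hypothetical.

```python
import re

# Assumption: a sentence boundary is whitespace preceded by '.', '!' or '?'.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(line):
    """Split one line of text into sentences using the rule above."""
    parts = _SENTENCE_END.split(line.strip())
    return [p for p in parts if p]


def avg_words_per_sentence(lines):
    """Average number of whitespace-separated words per extracted sentence."""
    sentences = [s for line in lines for s in split_sentences(line)]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


blog_like = ["It rained all day. We stayed in. Nobody minded!"]
tweet_like = ["so tired today cant wait for the weekend lol"]

print(split_sentences(blog_like[0]))
print(avg_words_per_sentence(blog_like))
print(avg_words_per_sentence(tweet_like))
```

A line with no terminating punctuation is returned as a single sentence, so unpunctuated tweets raise the measured average sentence length even when the writer intended several short statements.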
Term analysis of the Twitter file is provided below. 1000 sentences from the Twitter file are used for this analysis.
Word associations in the Twitter file are provided below. 1000 sentences from the Twitter file are used for this analysis.
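One common way to measure word association is the correlation between two terms' occurrence counts across sentences. The Python sketch below takes that approach on four invented sentences. The method, the `min_corr` threshold and all function names are assumptions for illustration, since the report's own procedure is not shown.

```python
import math
import re


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def term_vector(term, docs):
    """Number of times `term` occurs in each document (here, each sentence)."""
    return [tokenize(d).count(term) for d in docs]


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / math.sqrt(vx * vy)


def associations(term, docs, min_corr=0.5):
    """Terms whose occurrence correlates with `term` at or above `min_corr`."""
    vocab = {w for d in docs for w in tokenize(d)}
    base = term_vector(term, docs)
    scored = ((w, round(pearson(base, term_vector(w, docs)), 2))
              for w in vocab if w != term)
    return sorted((wc for wc in scored if wc[1] >= min_corr),
                  key=lambda wc: (-wc[1], wc[0]))


docs = [
    "happy birthday to you",
    "happy birthday my friend",
    "good morning my friend",
    "good night",
]
print(associations("happy", docs))
```

Terms that always appear together with the query term score 1.0, and terms that never co-occur with it score below zero and are filtered out by the threshold.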
Clustering analysis is provided below.
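Words can be clustered by treating each word's per-sentence occurrence counts as a vector and merging the closest groups step by step. The Python sketch below uses agglomerative clustering with complete linkage and Euclidean distance on invented sentences. These choices and the function names are assumptions for illustration, not a statement of the method used in this report.

```python
import math
import re


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def term_vectors(docs, terms):
    """Map each term to its vector of per-document counts."""
    tokenized = [tokenize(d) for d in docs]
    return {t: [toks.count(t) for toks in tokenized] for t in terms}


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cluster_terms(vectors, k):
    """Merge the two closest clusters repeatedly until only k remain."""
    clusters = [[t] for t in sorted(vectors)]

    def linkage(c1, c2):
        # Complete linkage: distance between the two farthest members.
        return max(euclidean(vectors[a], vectors[b]) for a in c1 for b in c2)

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


docs = [
    "happy birthday to you",
    "happy birthday my friend",
    "good morning my friend",
    "good night my friend",
]
terms = ["happy", "birthday", "good", "friend"]
print(cluster_terms(term_vectors(docs, terms), k=2))
```

On these sentences the words that share the same contexts end up in the same group, which is the kind of structure a dendrogram of terms is meant to reveal.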