Objective:
The goal of this project is to take unstructured text data, convert it into a structured form, and explore it in order to build a prediction algorithm. The eventual algorithm should suggest the next word based on the current words.
Exploratory Analysis:
The analysis begins by downloading the data set, which consists of files in four languages. The unstructured English-language files were chosen for analysis and stored in a “textData” directory.
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
Data Summary: The files were loaded into R, and the number of lines and words in each file was obtained. The “stringi” library was used to get the initial word counts and the “plotly” library was used to plot the data.
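The counting step can be sketched in base R as follows. This is only an illustration: the vector `txt` is a hypothetical stand-in for the lines of one corpus file, and the whitespace split is an assumed approximation of the word counts that “stringi” produced for the report.

```r
# Hypothetical stand-in for the lines of one corpus file; in practice these
# would be read with readLines() from a file such as en_US.blogs.txt.
txt <- c("The quick brown fox", "jumps over", "the lazy dog today")

# The number of lines is the length of the character vector.
n_lines <- length(txt)

# Split each line on runs of whitespace and count the non-empty pieces.
words_per_line <- vapply(
  strsplit(trimws(txt), "\\s+"),
  function(w) sum(nzchar(w)),
  integer(1)
)
n_words <- sum(words_per_line)

# One-row summary, the kind of table that could then be plotted.
summary_df <- data.frame(lines = n_lines, words = n_words)
print(summary_df)
```

A whitespace split treats punctuation attached to a word as part of that word, so the totals may differ slightly from those of a dedicated word-boundary counter.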
Sampling:
Because the data files are too large to process in full, a random sample of only 1000 lines from the original data was chosen for analysis.
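A minimal sketch of drawing such a sample is shown below. The seed value and the `corpus_lines` vector are assumptions made for illustration; the report does not state how the sample was drawn.

```r
# Assumed seed, used only so that the sample is reproducible.
set.seed(1234)

# Hypothetical stand-in for the lines read from one of the corpus files.
corpus_lines <- sprintf("line %d", 1:5000)

# Never ask for more lines than the file contains.
sample_size <- min(1000, length(corpus_lines))

# Draw the lines without replacement so that no line is picked twice.
sampled <- sample(corpus_lines, size = sample_size, replace = FALSE)

length(sampled)
```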
Cleaning Data: An initial check showed that the data contains URLs, Twitter account names and many quotes. These were cleaned up so that they would not distort the analysis. The “tm” package was used.
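The kind of cleanup described above can be approximated with base R regular expressions. The function `clean_text` and its patterns are hypothetical; the report used the “tm” package, whose exact transformations are not shown, so this is a sketch of the idea and not the original pipeline.

```r
# Hypothetical cleaning helper; the patterns are illustrative assumptions.
clean_text <- function(x) {
  x <- tolower(x)
  # Remove URLs that start with http://, https:// or www.
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)
  # Remove Twitter account names such as @user.
  x <- gsub("@\\w+", " ", x, perl = TRUE)
  # Keep only letters, apostrophes and spaces (this drops quotation marks).
  x <- gsub("[^a-z' ]", " ", x, perl = TRUE)
  # Collapse repeated whitespace and trim both ends.
  x <- gsub("\\s+", " ", x, perl = TRUE)
  trimws(x)
}

clean_text("Check \"this\" out @user http://t.co/abc NOW!!")
```

Apostrophes are kept here so that contractions such as "don't" survive as single tokens; whether to keep them is a design choice, not something the report specifies.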
N Gram Analysis: The unigram, bigram and trigram lists were generated, and bar plots of the top 15 entries for Twitter, blogs and news were plotted. The “RWeka”, “plotly” and “wordcloud” libraries were used for n-gram analysis and plotting.
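To make the n-gram step concrete, here is a small base R sketch that builds n-grams from already cleaned lines and returns the most frequent ones. The helpers `make_ngrams` and `top_ngrams` and the sample lines are hypothetical; the report generated its lists with “RWeka”, so this stands in for that tokenizer and does not reproduce it.

```r
# Build all n-grams from one vector of tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

# Count n-grams across all lines and keep the k most frequent.
top_ngrams <- function(text_lines, n, k = 15) {
  token_list <- strsplit(text_lines, " ", fixed = TRUE)
  grams <- unlist(lapply(token_list, make_ngrams, n = n))
  counts <- sort(table(grams), decreasing = TRUE)
  head(counts, k)
}

# Hypothetical cleaned sample lines.
sample_lines <- c("i love my dog", "i love my cat", "my dog is happy")
top_bigrams <- top_ngrams(sample_lines, n = 2)
print(top_bigrams)
```

N-grams are built per line, so no n-gram spans the boundary between two lines. The resulting frequency table is the kind of input a bar plot or word cloud would take.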
Word cloud for Twitter Unigram words
Bar plot for the top 15 unigram words from Twitter
##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells  720209 38.5    1168576 62.5  1168576 62.5
## Vcells 1061191  8.1    6137255 46.9  9278700 70.8
Word cloud for unigram words from Blogs
Bar plot of the top 15 words used in blogs
##           used (Mb) gc trigger (Mb) max used  (Mb)
## Ncells  720512 38.5    1442291 77.1  1442291  77.1
## Vcells 1062484  8.2    4795907 36.6 18294940 139.6
Word cloud of the frequently used words in News
Bar plot of the top 15 commonly used words in News
##           used (Mb) gc trigger (Mb) max used  (Mb)
## Ncells  720509 38.5    1442291 77.1  1442291  77.1
## Vcells 1062429  8.2   12049857 92.0 18827903 143.7
Word cloud for the bigram words used in Twitter
Bar plot of the top 15 bigram words in Twitter
##           used (Mb) gc trigger (Mb) max used  (Mb)
## Ncells  720697 38.5    1442291 77.1  1442291  77.1
## Vcells 1064337  8.2   12722019 97.1 19878156 151.7
Word cloud of the bigram words used in Blogs
Bar plot of the top 15 bigram words used in blogs
Word cloud of the bigram words used in News
Bar plot of the top 15 bigram words in News
Word cloud of trigram words used in Twitter
Bar plot of top 15 trigram words used in Twitter
Word cloud of trigram words used in Blogs
Bar plot of top 15 trigram words used in Blogs
Word cloud of trigram words used in News
Bar plot of top 15 trigram words used in News
Conclusion: The analysis gives a rough idea of the most frequently used words in Twitter, news and blogs. Because of memory constraints, I chose a very limited random sample of 1000 lines. Using this sample as training data, I plan to build a model, evaluate it on different test sets and then predict the next word. Once this is done, I can build a Shiny application and a presentation that describes its usage.
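As a rough illustration of where this is heading, the sketch below predicts the next word from bigram counts. Everything in it is an assumption: the training lines, the `pairs` table and the `predict_next` helper are hypothetical, and the final model may well use trigrams, smoothing or a different lookup strategy.

```r
# Hypothetical cleaned training lines.
train <- c("i love my dog", "i love my cat", "i love you")

# Build a table of (previous word, next word) pairs from every line.
pairs <- do.call(rbind, lapply(
  strsplit(train, " ", fixed = TRUE),
  function(tok) {
    if (length(tok) < 2) return(NULL)
    data.frame(prev = head(tok, -1), nxt = tail(tok, -1),
               stringsAsFactors = FALSE)
  }
))

# Suggest up to k next words, the most frequent follower first.
predict_next <- function(word, pairs, k = 3) {
  candidates <- pairs$nxt[pairs$prev == word]
  if (length(candidates) == 0) return(character(0))
  counts <- sort(table(candidates), decreasing = TRUE)
  names(head(counts, k))
}

predict_next("love", pairs)
```

A word never seen in the training data returns an empty result here; a practical model would need a fallback, for example suggesting the most common unigrams.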