The goal of this project is to develop an algorithm that predicts the next word and to build a word suggestion application using that algorithm. Source data in four languages (English, German, Russian and Finnish) was provided by the instructors for building the algorithm. Since 1) the English data set is expected to be used for the exercises and 2) I have little knowledge of the other three languages, I use the English data set for this project.
The objectives of this report are to 1) load the given data sets, 2) explore and briefly summarize them, and 3) outline a plan for developing the algorithm and application.
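# Read each English source file into a character vector, one element per line.
# skipNul = TRUE drops any embedded nul characters instead of warning about them.
US_B <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
US_N <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
US_T <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)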
Check the size of the data sets
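A size check of this kind can be sketched in base R as follows. The helper `summarise_lines` and the toy input are assumptions made for illustration, not the code used in this report; in practice the function would be applied to `US_B`, `US_N` and `US_T`.

```r
# Hypothetical helper: summarise a character vector of text lines
# (one element per line, as returned by readLines).
summarise_lines <- function(x) {
  # Rough word count: split each line on runs of whitespace
  words_per_line <- lengths(strsplit(trimws(x), "\\s+"))
  data.frame(
    lines = length(x),
    words = sum(words_per_line),
    longest_line_chars = max(nchar(x)),
    size_mb = as.numeric(object.size(x)) / 1024^2  # in-memory size, not file size
  )
}

# Toy stand-in for one of the data sets
toy <- c("The quick brown fox", "jumps over", "the lazy dog")
toy_summary <- summarise_lines(toy)
print(toy_summary)
```

Calling `rbind(summarise_lines(US_B), summarise_lines(US_N), summarise_lines(US_T))` would give one row per source.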
These data sets are too large to analyze in full within a reasonable time. Therefore, I randomly sample lines from each data set and continue the analysis on the samples. Since processing more than 1000 lines slows the calculations significantly, I select 1000 lines from each data set and use them for the subsequent steps.
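A minimal sketch of such a sampling step is shown below. The function name `sample_lines`, the seed value and the toy input are assumptions for illustration; the report does not state how the random draw was made.

```r
# Hypothetical helper: draw up to n lines at random, without replacement.
sample_lines <- function(x, n = 1000, seed = 1234) {
  set.seed(seed)  # assumed seed, fixed so that the same sample can be drawn again
  x[sample.int(length(x), size = min(n, length(x)))]
}

# Toy stand-in for a large source file
toy <- sprintf("line %d", 1:5000)
S_toy <- sample_lines(toy)
length(S_toy)
```

Fixing the seed means that the downstream frequency tables can be regenerated exactly, which helps when comparing results between runs.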
Check sampled sets
I use the tm package, an R package for text mining.
This cleaning process includes
In my opinion, whether a word or sentence is offensive depends heavily on context. Any word can be used in an offensive way, and some potentially offensive words can be used in non-offensive ways. Detecting and eliminating offensive usage of words requires another level of language processing. Therefore, I limit the filtering to seven unambiguously offensive words (“shit”, “piss”, “fuck”, “cunt”, “cocksucker”, “motherfucker”, “tits”), which I remove from the data sets.
After the cleaning process, the text lines look like the examples below.
Calculate the Term Document Matrix (TDM), which records the number of times each word in the corpus is found in each of the documents, to find the words with the highest frequency of usage.
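A cleaning pipeline of this kind could be sketched with tm as follows. Only the word list comes from the text above; the other steps (lower-casing, removing punctuation and numbers, collapsing whitespace) and the helper name `clean_corpus` are assumptions about what a typical pipeline contains, not a record of the exact steps used here.

```r
library(tm)

# The seven words named in the text above
profanity <- c("shit", "piss", "fuck", "cunt", "cocksucker", "motherfucker", "tits")

# Hypothetical helper: build a corpus from text lines and clean it.
clean_corpus <- function(lines) {
  corpus <- VCorpus(VectorSource(lines))
  # Lower-case first so that removeWords also matches capitalised forms
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  # removeWords matches whole words only, so longer words containing
  # one of these strings are left untouched
  corpus <- tm_map(corpus, removeWords, profanity)
  # Collapse the gaps left behind by the removals
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

cleaned <- clean_corpus(c("Well, PISS happens 24 times!"))
as.character(cleaned[[1]])
```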
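The TDM step can be illustrated with a small sketch. The three toy documents stand in for the sampled Blog, News and Twitter texts and are assumptions for illustration, as is the choice of `wordLengths = c(1, Inf)`, which keeps short words that `TermDocumentMatrix` would otherwise drop by default.

```r
library(tm)

# Toy stand-ins for the three sampled sources (illustration only)
docs <- c(
  blog    = "the cat and the hat",
  news    = "the market and the bank",
  twitter = "the cat lol"
)

corpus <- VCorpus(VectorSource(unname(docs)))

# Rows are terms, columns are documents
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))
m <- as.matrix(tdm)
colnames(m) <- names(docs)

# Total count of each word across all documents, highest first
freq <- sort(rowSums(m), decreasing = TRUE)
head(freq, 20)
```

Keeping the per-document columns in `m` makes it possible to compare the same word across sources later, while `freq` gives the overall ranking.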
Top 20 words with the highest frequency of usage:
Compare the frequency of usage among the data sources (Blog, News and Twitter) by plotting the 200 words with the highest frequency of usage. Note that “the” and “and” had markedly high frequencies (“the”: Blog 1973, News 1911, Twitter 392; “and”: Blog 1124, News 870, Twitter 189); therefore, these two words were excluded from the graphs. The X and Y axes show the frequency of usage of each word in the respective data sources.
The word frequencies in the Blog and News sources are more similar to each other than either is to those in Twitter.
Top 20 bigrams with the highest frequency of usage:
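One way to put a number on this kind of visual comparison is a rank correlation between sources, sketched below in base R. The counts are made up for illustration and are not results from this report; the helper `top_words` and the use of Spearman correlation are likewise assumptions, not the method used for the graphs.

```r
# Made-up counts for illustration only; rows are words, columns are sources.
freq <- cbind(
  Blog    = c(the = 50, and = 30, said = 4, people = 6, time = 8, love = 5, lol = 0, rt = 0),
  News    = c(the = 48, and = 25, said = 9, people = 5, time = 6, love = 2, lol = 0, rt = 0),
  Twitter = c(the = 10, and = 5,  said = 1, people = 2, time = 3, love = 7, lol = 6, rt = 8)
)

# Hypothetical helper: drop dominant words, then keep the top_n words by total count.
top_words <- function(freq, top_n = 200, exclude = c("the", "and")) {
  kept <- freq[!rownames(freq) %in% exclude, , drop = FALSE]
  kept <- kept[order(rowSums(kept), decreasing = TRUE), , drop = FALSE]
  head(kept, top_n)
}

top <- top_words(freq)

# Pairwise rank correlation between the three sources
source_cor <- cor(top, method = "spearman")
round(source_cor, 2)

# pairs(top) would draw the corresponding scatter plots
```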
How many bigrams were collected?
Compare bigram counts among the data sources (Blog, News, and Twitter): plot the 200 bigrams with the highest frequencies.
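Bigram extraction and counting can be sketched in base R as below. The helper `make_bigrams` and the toy lines are assumptions for illustration; this is a plain-R stand-in and not necessarily the tokenizer used in this report.

```r
# Hypothetical helper: build all adjacent word pairs from one cleaned line.
make_bigrams <- function(line) {
  w <- strsplit(trimws(line), "\\s+")[[1]]
  if (length(w) < 2) return(character(0))  # a single word yields no bigram
  paste(head(w, -1), tail(w, -1))
}

# Toy cleaned lines (illustration only)
toy_lines <- c("i love this song", "i love you", "ok")

bigrams <- unlist(lapply(toy_lines, make_bigrams))
bigram_counts <- sort(table(bigrams), decreasing = TRUE)

length(bigrams)        # total number of bigrams collected
length(bigram_counts)  # number of distinct bigrams
head(bigram_counts, 20)
```

Bigrams are built per line so that the last word of one line is never paired with the first word of the next.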
From the data exploration, I found that:
Because of this uniqueness of Twitter, I plan to develop an algorithm to predict the next word for the Twitter service. Since Twitter limits the number of characters in a post, I expect abbreviations to be more common in this service, and I expect emoticons and emojis to be used often as well. I will spend the next several weeks taking these Twitter-specific features into account while developing the algorithm and the application.