This document describes the current state of work on an interactive system for text prediction based on dynamic input text. It is a project where concepts from different areas are applied: statistics, regression, machine learning, exploratory data analysis, data cleaning, and the development of web data products. The datasets come from the company Swiftkey in conjunction with Johns Hopkins University and Coursera. Swiftkey builds this kind of prediction system for interactive keyboards on Android, iPhone, and other commercial platforms. The idea is to explore new approaches to text prediction in order to help users more accurately. Four datasets are provided for this project, in different languages: German, English, Finnish and Russian. Each group contains three kinds of corpora: text from blogs, from news, and from Twitter. The present article focuses on the English version of the dataset.
Conceptually, the three kinds of corpora provide a variety of writing styles. For example, blogs are normally written in the first person, as if speaking to an audience. News, on the other hand, is written more formally, generally in the third person. Finally, Twitter text is the most informal and abbreviated style, sometimes even using single letters to represent words. Therefore, we expect to find some particular characteristics in each part of the data.
Besides the writing style, there are the fields or topics treated in the different texts or paragraphs. This is also a predictor variable for inferring the next words in sentences. Therefore, in order to predict the next words of a sentence being written in real time, a previous analysis is necessary. To be able to carry out an analysis of the texts, our first task is to put the data into a form that can be analyzed quantitatively. This means transforming the database from an unstructured form into a more structured one, a Corpus: removing uppercase letters, punctuation, and some words that add little information, and finding equivalences between complete words, words with typos, abbreviations, etc.
The dataset originally comes from HC Corpora - www.corpora.heliohost.org and has been obtained from this link, made available through a collaboration between the company Swiftkey and Johns Hopkins University - Coursera, as said before.
After obtaining the data we manipulate it using the tm (Text Mining) package available at CRAN.r-project.org.
The complete dataset contains:
* en_US.blogs.txt : Number of Lines –> 899,288 - Number of Words –> 37,334,690
* en_US.news.txt : Number of Lines –> 1,010,242 - Number of Words –> 34,372,720
* en_US.twitter.txt : Number of Lines –> 2,360,148 - Number of Words –> 30,374,206
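For reference, counts like these can be reproduced with a few lines of R; the sketch below uses the blogs file as an example, and splitting on whitespace is only an approximation of the word count:
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
length(blogs)                            # number of lines
sum(lengths(strsplit(blogs, "\\s+")))    # approximate number of words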
In later sections we create TermDocumentMatrices, which correspond to tables holding the data in structured form. Nevertheless, we prefer to show the data in plots rather than exposing the raw tables.
Our first action is to randomly reduce the volume of data in order to carry out the first explorations of this project. We take approximately 300,000 words from random lines of each of the 3 texts into the new files: a) en_US.twitter.brief.txt, b) en_US.news.brief.txt, and c) en_US.blogs.brief.txt. This is around 1% of the total volume of data. The Big Data approach, considering the whole (or a larger amount) of the data, is one of the next steps in the development of this project after this report.
We created a Virtual Corpus via VCorpus from the tm package, and then manipulated the data by: 1) converting all letters to lowercase, 2) removing punctuation, and 3) regularizing the excess of white space generated by the previous modifications.
The following lines take a didactic approach, showing the result of each manipulation step.
The subsetting was done on the OS X command line with:
# $ sort -r en_US.blogs.txt | tail -10000 > ../en_US_brief/en_US.blogs.brief.txt
# $ sort -r en_US.news.txt | tail -10000 > ../en_US_brief/en_US.news.brief.txt
# $ sort -r en_US.twitter.txt | tail -20000 > ../en_US_brief/en_US.twitter.brief.txt
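The same kind of random subset can also be drawn directly in R; a minimal sketch, assuming a 1% sampling fraction and the file names used above:
set.seed(1234)                                           # for reproducibility
lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
brief <- sample(lines, round(0.01 * length(lines)))      # random 1% of the lines
writeLines(brief, "../en_US_brief/en_US.blogs.brief.txt")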
Creating the Virtual Corpus:
library(tm) # requires install.packages("tm")
#library(SnowballC)
source <- "/Users/naifoproject/Documents/Courses/Data_Science_Specialization/00_capstone_project/data/final/en_US_brief/"
db <- VCorpus(DirSource(source, encoding = "UTF-8"), readerControl = list(language = "english"))
Sampling one paragraph in order to track that the changes are applied correctly:
db_cropped <- db
db_cropped[[1]][[1]][[3]]
## [1] "\"Yes she will, Seven.\" I retort, \"I think the Borg Queen knows how to enjoy a party a lot more than you do. If you don't like it, loosen up and let Annika enjoy it. This will be her first Enterprise Christmas Party.\""
Converting to lowercase:
db_cropped <- tm_map(db_cropped, content_transformer(tolower))
db_cropped[[1]][[1]][[3]]
## [1] "\"yes she will, seven.\" i retort, \"i think the borg queen knows how to enjoy a party a lot more than you do. if you don't like it, loosen up and let annika enjoy it. this will be her first enterprise christmas party.\""
Removing punctuation:
db_cropped[[1]][[1]]<- gsub( "[[:punct:]]"," ",db_cropped[[1]][[1]])
db_cropped[[2]][[1]]<- gsub( "[[:punct:]]"," ",db_cropped[[2]][[1]])
db_cropped[[3]][[1]]<- gsub( "[[:punct:]]"," ",db_cropped[[3]][[1]])
db_cropped[[1]][[1]][[3]]
## [1] " yes she will seven i retort i think the borg queen knows how to enjoy a party a lot more than you do if you don t like it loosen up and let annika enjoy it this will be her first enterprise christmas party "
Cleaning excess of white spaces:
db_cropped <- tm_map(db_cropped, stripWhitespace)
db_cropped[[1]][[1]][[3]]
## [1] " yes she will seven i retort i think the borg queen knows how to enjoy a party a lot more than you do if you don t like it loosen up and let annika enjoy it this will be her first enterprise christmas party "
At each step we have taken one sentence from the Corpus to verify that our changes are really applied, as seen above, until reaching a regular text which is technically easier to analyze.
We are interested in knowing the single words in each document of the Corpus, just to get a first feeling for the data.
From our previous intuition, there should be differences in the most common words among texts coming from news, which are very formally written, blogs, and Twitter posts, the latter being written more informally, generally in the first person, and directed to a single individual or a reduced target set of individuals.
At first we considered the possibility of merging all the texts into a single text Corpus for prediction. However, we found it interesting to keep the data separated, at least for the moment, considering the possibility that the Shiny Web App to be developed could have a double or triple input field simulating the input of text for news, blogs, or Twitter posts. It is not yet decided, but it would also be possible to define only two input fields: one for formal text (news and blogs) and one for informal writing (Twitter posts).
We can see below the frequencies of the most common words in each document: a) en_US.blogs.brief.txt, b) en_US.news.brief.txt, and c) en_US.twitter.brief.txt.
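The frequencies shown come from a TermDocumentMatrix built on the cleaned Corpus; a sketch of how the top words of one document can be extracted and plotted (the column index and the number of words shown are illustrative choices of ours):
tdm <- TermDocumentMatrix(db_cropped)
m <- as.matrix(tdm)
freq_blogs <- sort(m[, 1], decreasing = TRUE)    # column 1: en_US.blogs.brief.txt
barplot(head(freq_blogs, 20), las = 2, main = "Most frequent words - blogs")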
It is interesting to see how writing in the first person, with sentences directed mainly at one individual, gives the Twitter texts an excess of the word “you”, as well as words absent from the other texts, like “lol”.
Then we need to start thinking about how the single words shown above combine linearly in the texts to construct phrases. To gain insight into this, we analyze N-Grams, which are contiguous sequences of N words present in the texts. Besides the single-word approach shown above, we construct N=2 and N=3 N-Grams and plot the frequencies of the most common combinations for each case in the figures below. For this purpose we use the RWeka package, which provides facilities for N-Gram construction.
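A sketch of the construction, where only the control values are our own choice; the same pattern with min = 3, max = 3 yields the 3-Grams:
library(RWeka) # requires install.packages("RWeka")
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm2 <- TermDocumentMatrix(db_cropped, control = list(tokenize = BigramTokenizer))
freq2 <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)   # 2-Gram frequencies over all 3 documents
head(freq2, 10)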
For the 2-Grams we can start seeing a strong difference between the Twitter texts and the other texts (blogs and news). Our attention is drawn to the 2-Gram “a lot”, which is much more common in news than in the other writings, a fact we are going to investigate. We see in general, even with samples of the same size, that Twitter posts lack 2-Grams that are very common in the other writing styles. We believe this is due to the limitation on total words and the consequent abbreviated writing style used by people to communicate the maximum information with the minimum quantity of words. A particularly interesting characteristic of the Twitter texts is the high frequency of the pair “i m”, confirming that those texts are written mainly in the first person.
Analyzing the 3-Grams, we confirm the same characteristics found in the 2-Grams for the Twitter writings. We have used a log scale for the y axis in order to compensate for the large difference of the 3-Gram “a lot of”, which exceeds the other 3-Grams by far.
Our conclusion from this exploration is that it would be of much interest to implement a Shiny Web App that can predict separately in a Twitter-style input form and in a formal input form, given the differences among the texts in our Corpus.
We will conduct further analysis in the following stages of this project.
This project is ongoing with the support of Johns Hopkins University-Coursera and Swiftkey. Some remote meetings are being scheduled to clarify points and facts about this project.
We identify and highlight here some of the points that we will have to address during the next weeks:
For this report we have used around 1% of the dataset, taken randomly. We believe this has been representative for the current purposes. However, to improve accuracy and obtain better performance from the prediction algorithms, we will have to evaluate how much of the data it is worth considering.
Based on the previous point, we might have to explore off-memory methods to manipulate the datasets. We will explore the use of the filehash package together with permanent corpora from the tm package, as well as other approaches such as databases (e.g. MySQL). This is because we have tried to process the whole dataset and, on limited laptop or desktop computers, this has been difficult due to computing time and memory constraints.
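A minimal sketch of such a permanent (on-disk) corpus with tm and filehash, where the database file name is an assumption of ours and source is the directory defined earlier:
library(filehash) # requires install.packages("filehash")
db_perm <- PCorpus(DirSource(source, encoding = "UTF-8"),
                   readerControl = list(language = "english"),
                   dbControl = list(dbName = "en_US_corpus.db", dbType = "DB1"))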
Finally, regarding this point, and for the sake of speed, we will also explore the possibility of using Hadoop to perform distributed manipulation of the dataset.
The prediction model is still under study and is not presented in detail in this document. We are aiming at a probabilistic model based on Bayes' rule or Markov chains, but we would like to experiment with other approaches, always giving priority to the ones that allow the quickest real-time response in the Shiny Web App to be implemented. We aim to get a direct feeling for the benchmarking of this type of prediction problem.
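As an illustration of the Markov-chain direction only, a minimal sketch of a next-word lookup over the 2-Gram counts computed earlier (freq2 in the sketch above); the function name and logic are our own illustrative choices, not the final model:
predict_next <- function(word, bigram_freq) {
  # keep the 2-Grams whose first token equals the given word
  candidates <- bigram_freq[grepl(paste0("^", word, " "), names(bigram_freq))]
  if (length(candidates) == 0) return(NA_character_)
  # return the second token of the most frequent matching 2-Gram
  strsplit(names(candidates)[which.max(candidates)], " ")[[1]][2]
}
predict_next("a", freq2)   # usage: most frequent continuation of "a" in the sample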
As mentioned before, we plan to develop a simple Shiny Web App with two input forms:
* One to predict the next word for a user writing Twitter posts.
* One to predict the next word for a user writing more formal texts, like a blog or news.
* For learning purposes we are planning this time to deploy our Shiny application on our own server, where Shiny Server will be installed, and to compare its performance with hosting the application at the shinyapps.io site.