This is the Week 02 Milestone Report for the Data Science Specialization. The goal of this report is to present the characteristics of the dataset that will later be used to build a predictive algorithm, which will be embedded in a Shiny application.
Specifically, it presents an exploratory analysis of a collection of English texts derived from blogs, news articles, and Twitter posts.
This analysis will be the basis of the predictive algorithm.
The dataset used in this project is provided by the course and comprises texts from a variety of sources in several languages. To download the dataset:
```r
if (!file.exists("Coursera-SwiftKey.zip")) {
  # Download the corpus provided by the course
  download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip")
  # Extract the text files into the data/ directory
  system("unzip Coursera-SwiftKey.zip final/* -d data")
}
```
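Once unzipped, the English files can be read into R. The sketch below assumes the files were extracted under data/final/en_US/ by the unzip call above:

```r
# Read the full English corpora line by line (skipNul avoids warnings about embedded nulls)
blogs   <- readLines("data/final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("data/final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("data/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```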
For this report, only the English texts from the available sources (blogs, news, and Twitter) will be used. The following table shows summary statistics for the datasets:
Source | Size | Number of texts | Total number of words | Avg. words per text | Median words per text | Std. dev. of words per text | Max. words per text | Min. words per text |
---|---|---|---|---|---|---|---|---|
Blogs | 255.4 MB | 899,288 | 37,153,277 | 41.31 | 28 | 46.16 | 6,629 | 0 |
News | 257.3 MB | 1,010,242 | 34,197,137 | 33.85 | 31 | 22.48 | 1,792 | 0 |
Twitter | 319 MB | 2,360,148 | 29,706,971 | 12.59 | 12 | 6.82 | 47 | 1 |
The first thing to notice is the size of the datasets: there are more than 800 MB of text data, and the datasets contain more than 4 million texts. The majority of texts comes from Twitter, which alone contains more texts than blogs and news combined. Another important observation is that although the blogs dataset comprises the smallest number of texts among the sources, it has the largest total word count and the largest average number of words per text.
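As an illustration, statistics like those in the table above could be computed along the following lines (a sketch only; stri_count_words from the stringi package is one of several reasonable word-count definitions, so exact numbers may differ slightly):

```r
library(stringi)

# Per-source summary: number of texts and words-per-text statistics
summarize_source <- function(x) {
  words <- stri_count_words(x)
  c(texts = length(x), total_words = sum(words), mean = mean(words),
    median = median(words), sd = sd(words), max = max(words), min = min(words))
}

rbind(Blogs   = summarize_source(blogs),
      News    = summarize_source(news),
      Twitter = summarize_source(twitter))
```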
Because the datasets are large, this report only considers a fraction of the texts from each source. Specifically, the analysis below uses a random sample of 1% of the texts from each source.
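A sketch of that sampling step (the seed and helper function below are illustrative choices, not necessarily the exact ones used to produce the figures):

```r
set.seed(2024)  # for reproducibility

# Keep each line of a source with probability 1%
sample_lines <- function(x, frac = 0.01) x[as.logical(rbinom(length(x), 1, frac))]

blogs_sample   <- sample_lines(blogs)
news_sample    <- sample_lines(news)
twitter_sample <- sample_lines(twitter)
```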
One of the most important things to do when working with text data is to analyze not only the most frequent words, but also the word combinations used in the texts.
To do that, the texts will be tokenized into n-grams, with stop words removed. After that, the most frequent unigrams, bigrams, and trigrams will be analyzed.
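One possible implementation of the unigram step, sketched here with the tidytext and dplyr packages (other tokenizers would work just as well):

```r
library(dplyr)
library(tidytext)

# Combine the samples into one table, keeping track of the source
texts <- tibble(
  source = rep(c("blogs", "news", "twitter"),
               c(length(blogs_sample), length(news_sample), length(twitter_sample))),
  text   = c(blogs_sample, news_sample, twitter_sample)
)

# Unigrams: tokenize into single words and remove stop words
unigrams <- texts %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
```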
The following figure shows the 10 most frequent words used in texts from all sources.
As we can see, 'love' and 'day' are among the most common words used in texts from all sources combined. Notice that because stop words are removed from the analysis, commonly used pronouns ('I', 'you', 'we') and articles ('the', 'a', 'an') do not appear in the figure above, although they might be the most used words in the texts.
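For reference, a figure like this one can be drawn with ggplot2 from the unigram counts built earlier (a sketch, not the exact plotting code used here):

```r
library(ggplot2)

unigrams %>%
  slice_max(n, n = 10) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency",
       title = "10 most frequent words (stop words removed)")
```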
Now, let's see how the most frequent words differ across sources:
Some of the 10 most used words are shared across sources, but some words that appear in the Twitter texts are specific to this platform, such as 'lol' and 'rt'. This is an interesting finding, as it might require special attention when constructing the predictive algorithm.
Since the ultimate goal of this project is to shed some light on how the prediction model will be constructed, let's look at the most used combinations of words in this dataset. The following figure shows the 10 most frequently used bigrams:
Notice that when analyzing combinations of words, it is better not to remove stop words, since many combinations involve these kinds of terms. In fact, the figure above shows that the 10 most used two-word combinations involve stop words. The next figure shows a similar pattern when analyzing these word combinations for each source separately:
Now, let's look at the most frequently used three-word combinations in these texts:
Apparently, the most used combinations also involve stop words. Let's see whether this pattern holds when looking across sources:
As we can see, most of the word combinations involve stop words. This is important to note, as stop words might be essential for the construction of the predictive algorithm.
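For reference, the bigram and trigram counts behind these figures can be obtained by extending the tidytext sketch from above, this time keeping stop words:

```r
# Two- and three-word combinations, with stop words deliberately kept
bigrams <- texts %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE)

trigrams <- texts %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%
  count(trigram, sort = TRUE)

head(bigrams, 10)   # the 10 most frequent two-word combinations
```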
The next steps of the project are to construct the prediction model and deploy it in a Shiny application.
The predictive algorithm will be based on a model that uses n-gram relative frequencies. Bigram and trigram models will be the basis for looking up the most likely words, with unigrams as a fallback.
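As a rough illustration of that idea, the lookup could work along these lines, assuming the unigram, bigram, and trigram count tables built earlier (the final model may handle smoothing and unseen words differently):

```r
library(stringr)

predict_next_word <- function(input, n_suggestions = 3) {
  tokens   <- str_split(str_to_lower(str_squish(input)), " ")[[1]]
  last_two <- paste(tail(tokens, 2), collapse = " ")
  last_one <- tail(tokens, 1)

  # Look for trigrams starting with the last two words, then back off to bigrams
  cand <- filter(trigrams, str_starts(trigram, paste0(last_two, " ")))
  if (nrow(cand) == 0) cand <- filter(bigrams, str_starts(bigram, paste0(last_one, " ")))
  if (nrow(cand) == 0) return(head(unigrams$word, n_suggestions))

  # Return the last word of the most frequent matching n-grams
  word(pull(slice_max(cand, n, n = n_suggestions), 1), -1)
}

predict_next_word("thanks for the")
```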
The Shiny application will provide an interface where the user supplies text in an input box, and the app will use the predictive algorithm to suggest the most likely next expression.
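A minimal skeleton of such an app might look as follows (illustrative only; the final interface will likely differ):

```r
library(shiny)

ui <- fluidPage(
  textInput("user_text", "Type a sentence:"),
  textOutput("suggestion")
)

server <- function(input, output) {
  output$suggestion <- renderText({
    req(input$user_text)
    paste(predict_next_word(input$user_text), collapse = " | ")
  })
}

shinyApp(ui, server)
```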