This is a milestone report for the Johns Hopkins School of Public Health Data Science Specialization Capstone project on SwiftKey. The following steps are taken to explore the data before building the SwiftKey text prediction application.
Three data files are given: “en_US.news.txt”, “en_US.blogs.txt” and “en_US.twitter.txt”. The code used to read the files into R is shown below:
# Read all data.
con <- file("final data/en_US/en_US.news.txt", open = "rb")
news <- readLines(con, encoding = "UTF-8")
close(con)
blogs <- readLines("final data/en_US/en_US.blogs.txt", encoding = "UTF-8")
tweets <- readLines("final data/en_US/en_US.twitter.txt", encoding = "UTF-8")
Notice that the news data is read differently. Using the same readLines() call as for the other files results in only about 7% of the news data being read, most likely because an embedded control character is treated as an end-of-file marker when the file is opened in text mode. Opening the connection in binary mode (open = "rb") as shown above ensures that all of the given data is read.
Here are the libraries used:
# Load Libraries
library(tm)
library(quanteda)
library(wordcloud)
library(ggplot2)
library(RWeka)
library(dplyr)
The next few sections show some plots of the data.
Each piece of news, each blog post, and each tweet is referred to as a document. The histogram above shows the number of documents in each dataset. The blogs and news datasets each contain close to 1 million documents, while the Twitter data has over 2 million.
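A count like this can be produced along the following lines (a minimal sketch rather than the code used for the report; it assumes the news, blogs and tweets vectors read in earlier):
# Sketch: count the documents (lines) read from each file and plot them.
doc_counts <- data.frame(
  dataset = c("blogs", "news", "twitter"),
  documents = c(length(blogs), length(news), length(tweets))
)
ggplot(doc_counts, aes(x = dataset, y = documents)) +
  geom_col() +
  labs(title = "Number of documents per dataset", x = "Dataset", y = "Documents")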
Next, the documents in each dataset are tokenized. Tokenizing means breaking each document in the dataset into individual words.
The next graph shows the total number of unique words in each dataset and across all datasets combined. Given that there are many ways to express something in English, the number of unique words used in a dataset reflects the variety of input that a user may type. It is therefore useful to examine which dataset provides the most unique words per document.
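One way to do this with the quanteda package is sketched below (an assumed approach, not necessarily the code behind the plots):
# Sketch: tokenize the blogs data and count unique words with quanteda.
blog_tokens <- tokens(blogs, remove_punct = TRUE)   # split documents into words
blog_dfm    <- dfm(blog_tokens)                     # document-feature matrix

nfeat(blog_dfm)                  # total number of unique words in the blogs dataset
nfeat(blog_dfm) / ndoc(blog_dfm) # unique words per document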
The blogs dataset has the most unique words per document, followed by news. Blogs outperform the Twitter and combined datasets by more than a factor of two. It may be worthwhile to evaluate whether the blogs data alone is sufficient to build an accurate SwiftKey prediction algorithm, since it provides the most information.
A common procedure in text analysis is to break the documents down into sequences of n consecutive words. These sequences are called n-grams, where n is a positive integer. For example, when n equals 1 the prefix “n-” is replaced with “uni-” and the result is a unigram. The unigrams therefore represent the different single words used in the documents of the dataset.
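Unigram frequencies can be tabulated in the same quanteda framework, for example (again a sketch under the same assumptions):
# Sketch: unigram frequencies for the news dataset.
news_tokens <- tokens(news, remove_punct = TRUE, remove_numbers = TRUE)
news_dfm    <- dfm(news_tokens)

topfeatures(news_dfm, 20)  # the 20 most frequent single words and their counts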
The graphic above is called a word cloud; it consists of the words that occur at least 13,000 times in the news dataset. Words are sized according to how frequently they appear in the documents of the dataset. There are 150 words that meet this criterion. “said” is the most frequent word in the news dataset, with 250,348 occurrences.
Next, here are the words with a frequency of at least 60,000 in the combined dataset. There are a total of 77 words in the word cloud. “will” is the most frequent word in the combined dataset, with 314,977 occurrences.
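The word clouds can be drawn from the unigram frequencies with the wordcloud package, roughly as follows (illustrative only; it reuses the news_dfm object from the sketch above):
# Sketch: word cloud of the most frequent words in the news dataset.
news_top <- topfeatures(news_dfm, 150)  # 150 most frequent words and their counts
wordcloud(words = names(news_top), freq = news_top,
          min.freq = 13000, random.order = FALSE)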
When n is 2, the n-gram is called a bigram; when n is 3, it is called a trigram. Similarly, the documents in the news dataset are broken down into sequences of two or three consecutive words for bigrams and trigrams respectively. The 50 most frequent n-grams of each type are shown as word clouds below.
The most common bigram is “in the”, with 1,380 occurrences.
The most common trigram is “one of the”, with 107 occurrences.
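The bigram and trigram word clouds can be built in a similar way, for example (a sketch under the same assumptions, reusing the news_tokens object from the earlier sketch):
# Sketch: bigram and trigram frequencies for the news dataset.
news_bigrams  <- tokens_ngrams(news_tokens, n = 2, concatenator = " ")
news_trigrams <- tokens_ngrams(news_tokens, n = 3, concatenator = " ")

top_bigrams  <- topfeatures(dfm(news_bigrams), 50)   # 50 most frequent bigrams
top_trigrams <- topfeatures(dfm(news_trigrams), 50)  # 50 most frequent trigrams

wordcloud(words = names(top_bigrams), freq = top_bigrams, random.order = FALSE)
wordcloud(words = names(top_trigrams), freq = top_trigrams, random.order = FALSE)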
This report shows the steps in which the three datasets are read, broken down, explored and analyzed before the SwiftKey prediction application is developed. The blogs dataset alone provides more information, measured as unique words per document, than the other individual datasets and the combined dataset.
If you have any questions, please contact the author at his blog.