This project is part of the capstone of the Data Science Specialization, in which we are asked to download text data provided by SwiftKey, do an exploratory analysis, build a model to predict the next word, and create a Shiny app that takes a phrase (multiple words) as input and, once the user clicks submit, predicts the next word. This milestone report is an intermediate R Markdown report that describes, in plain language, plots, and code, the exploratory analysis of the course data set.
The data sets were downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip . For the purpose of this report, only the English texts will be used. These texts come from three different sources: blogs, news and tweets.
# Set the working directory where the data is located
setwd("~/R_Coursera/Capstone Project")

# Reading the datasets
# TWITTER - reading and loading the dataset
con <- file("en_US.twitter.txt", "rb")
twitter <- readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)

# BLOGS - reading and loading the dataset
con <- file("en_US.blogs.txt", "rb")
blogs <- readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)

# NEWS - reading and loading the dataset
con <- file("en_US.news.txt", "rb")
news <- readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)
Below we have a summary statistics table showing some characteristics of the datasets. They are big files: together they take up more than 800 megabytes. In terms of word counts, each file has more than 30 million words, and together they add up to more than 100 million words. This is a huge number of words, so preprocessing to reduce it will be a very important step. In terms of number of lines, the twitter dataset has more than 2 million, while the blogs dataset has the smallest size and the smallest number of lines, almost 900,000. These statistics show that this project is a computationally demanding job and that some memory and processing optimization will be necessary.
| Dataset | Size (bytes) | Word count | Line count | Max line length (chars) | Total characters |
|---------|-------------:|-----------:|-----------:|------------------------:|-----------------:|
| twitter | 316,037,600  | 30,093,410 | 2,360,148  | 140                     | 162,096,241      |
| blogs   | 260,564,320  | 37,546,246 | 899,288    | 40,833                  | 206,824,505      |
| news    | 261,759,048  | 34,762,395 | 1,010,242  | 11,384                  | 203,223,159      |
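As a minimal sketch, the statistics above could be computed as follows; the use of stringi::stri_count_words() for word counting and object.size() for the size column are assumptions, so the exact numbers may differ slightly from the table.

library(stringi)

# Compute size, word, line and character statistics for a character vector.
# object.size() and stri_count_words() are assumed choices.
corpus_stats <- function(x) {
  c(Size       = as.numeric(object.size(x)),
    WordCounts = sum(stri_count_words(x)),
    LineCounts = length(x),
    MaxLength  = max(nchar(x)),
    Length     = sum(nchar(x)))
}

rbind(twitter = corpus_stats(twitter),
      blogs   = corpus_stats(blogs),
      news    = corpus_stats(news))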
Considering that the data sets are very big, and that we do not need all the data to train a good model to predict the next word, we are going to create a data sample. We will do some cleaning work before starting the text analysis: the sample is cleaned row by row, getting rid of punctuation, numbers, hyphens and symbols. It is also important to lowercase all the words in order to avoid counting the same word twice.
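A minimal sketch of this sampling and cleaning step is shown below; the random seed, the sample fraction and the regex-based cleaning rules are assumptions, chosen only to illustrate the approach.

set.seed(1234)                       # reproducible sample (seed is an assumption)
sample_frac <- 0.005                 # illustrative fraction, roughly matching the sample below
data_sample <- c(sample(twitter, round(length(twitter) * sample_frac)),
                 sample(blogs,   round(length(blogs)   * sample_frac)),
                 sample(news,    round(length(news)    * sample_frac)))

clean_text <- function(x) {
  x <- tolower(x)                    # lowercase to avoid duplicated terms
  x <- gsub("[0-9]+", " ", x)        # remove numbers
  x <- gsub("[[:punct:]]+", " ", x)  # remove punctuation, hyphens and symbols
  x <- gsub("\\s+", " ", x)          # collapse repeated whitespace
  trimws(x)
}
data_sample <- clean_text(data_sample)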
The table below shows some summary statistics for the data sample that will be used in the next steps of the process.
| Dataset     | Size (bytes) | Word count | Line count |
|-------------|-------------:|-----------:|-----------:|
| Data sample | 4,198,576    | 509,761    | 21,347     |
Our data sample was tokenized and cleaned in the previous steps. Now it is time to create the unigrams (terms based on a single word) and plot a graph displaying the 20 most frequent words in the data sample.
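A sketch of how the unigrams and the frequency plot could be produced is shown below; the use of the quanteda and ggplot2 packages is an assumption.

library(quanteda)
library(ggplot2)

# Tokenize the sample (already lower-cased and cleaned above) and build a
# document-feature matrix of unigrams
tokens_1 <- tokens(data_sample)
dfm_1 <- dfm(tokens_1)
top_1 <- topfeatures(dfm_1, 20)      # 20 most frequent unigrams

ggplot(data.frame(word = names(top_1), freq = unname(top_1)),
       aes(x = reorder(word, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unigram", y = "Frequency", title = "20 most frequent unigrams")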
Below you can see the 20 most frequent bigrams (terms based on two consecutive words) found in our data sample.
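Assuming the same quanteda workflow, the bigrams could be generated with tokens_ngrams():

# Build bigrams from the unigram tokens and extract the 20 most frequent ones
tokens_2 <- tokens_ngrams(tokens_1, n = 2, concatenator = " ")
dfm_2 <- dfm(tokens_2)
top_2 <- topfeatures(dfm_2, 20)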
The next plot displays the 20 most frequent trigrams (terms based on three consecutive words) in our data sample.
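The trigrams follow the same pattern, just changing n to 3:

# Trigrams: same workflow with n = 3
tokens_3 <- tokens_ngrams(tokens_1, n = 3, concatenator = " ")
dfm_3 <- dfm(tokens_3)
top_3 <- topfeatures(dfm_3, 20)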
The strategy for creating the prediction algorithm and the future Shiny app consists of finding a way to reduce the number of features in the document-feature matrix (DFM). The dimensionality of the DFM grows very fast as we move to higher-order n-grams, and this can hurt both prediction quality and computer performance.
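One possible way to do this, assuming the quanteda workflow sketched above, is to drop rare n-grams with dfm_trim(); the minimum term frequency of 5 is an illustrative assumption.

# Keep only n-grams that occur at least 5 times (threshold is an assumption)
dfm_2_trimmed <- dfm_trim(dfm_2, min_termfreq = 5)
dim(dfm_2)          # dimensions before trimming
dim(dfm_2_trimmed)  # dimensions after trimming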
Also, I believe it is better to do neither word stemming nor stopword removal, since the app should predict actual words, including stopwords. Keeping them should improve the prediction accuracy.
I intend to research methods to reduce dimensionality and to use the most significant n-grams to better train the prediction algorithm.
The Shiny app should have a clean and simple interface and should alert users that stopwords and profanity were not removed.