Objective

This project is part of the capstone project of the Data Science Specialization, in which we are asked to download text data provided by SwiftKey, do an exploratory analysis, build a model to predict the next word, and create a Shiny app that takes a phrase (multiple words) as input and, once the user clicks submit, predicts the next word. This milestone report is an intermediate R Markdown report that describes, in plain language, plots, and code, the exploratory analysis of the course data set.

Downloading and Loading Data Sets

The data sets were downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip . For the purpose of this report, only the texts in English will be used. These texts come from three different sources: blogs, news and tweets.
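For reproducibility, a minimal sketch of the download step is shown below; the local zip file name used as destfile is an assumption, and the English files are assumed to be moved into the working directory before they are read in the next chunk.

# Download and unpack the corpus provided by SwiftKey (run once)
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")
}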

# Set the working directory where the data is located
setwd("~/R_Coursera/Capstone Project")
# Reading the datasets
# TWITTER - reading and loading dataset
con <- file("en_US.twitter.txt", "rb")
twitter <- readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)
# BLOGS - reading and loading dataset
con <- file("en_US.blogs.txt", "rb")
blogs <- readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)
# NEWS - reading and loading dataset
con <- file("en_US.news.txt", "rb")
news <- readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)

Basic summary statistics of the datasets

Below is a summary statistics table showing some characteristics of the datasets (Size is given in bytes). These are large files: together they take up more than 800 megabytes. In terms of word counts, each file has more than 30 million words, and together they sum up to more than 100 million words. This is a huge number of words, so preprocessing to reduce it will be a very important step. In terms of number of lines, the twitter dataset has more than 2 million, while the blogs dataset has the smallest size and the fewest lines, almost 900,000. These statistics show that this project is a computationally demanding job and that some memory and processing optimization will be necessary.

##              Size WordCounts LineCounts MaxLength    Length
## twitter 316037600   30093410    2360148       140 162096241
## blogs   260564320   37546246     899288     40833 206824505
## news    261759048   34762395    1010242     11384 203223159
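As a reference, statistics like these can be computed roughly as follows. This is a sketch that assumes the stringi package, takes Size as the in-memory object size in bytes, and uses summarize_text as a helper introduced here purely for illustration.

library(stringi)
# Helper: object size, word count, line count, longest line, total characters
summarize_text <- function(x) {
  c(Size       = as.numeric(object.size(x)),
    WordCounts = sum(stri_count_words(x)),
    LineCounts = length(x),
    MaxLength  = max(nchar(x)),
    Length     = sum(nchar(x)))
}
rbind(twitter = summarize_text(twitter),
      blogs   = summarize_text(blogs),
      news    = summarize_text(news))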

Creating and cleaning the data sample

Considering that the data sets are very large, and that we do not need all the data to train a good model to predict the next word, we are going to create a data sample. We will do some cleaning work before starting the text analysis: it is necessary to clean the dataset, row by row, getting rid of punctuation, numbers, hyphens and symbols. It is also important to convert all words to lower case in order to avoid counting the same word twice.
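A minimal sketch of one way to build and clean such a sample is shown below; the 1% sampling rate, the fixed seed and the clean_text helper are illustrative assumptions, not the final choices.

set.seed(1234)
# Keep roughly 1% of the lines from each source
sample_text <- c(sample(twitter, round(length(twitter) * 0.01)),
                 sample(blogs,   round(length(blogs)   * 0.01)),
                 sample(news,    round(length(news)    * 0.01)))
# Basic cleaning: lower case, then drop numbers, punctuation, hyphens and symbols
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[0-9]+", " ", x)
  x <- gsub("[[:punct:]]+", " ", x)
  x <- gsub("[^a-z ]", " ", x)   # remove any remaining symbols
  gsub("\\s+", " ", trimws(x))   # collapse extra whitespace
}
sample_text <- clean_text(sample_text)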

The table below shows summary statistics for the data sample that will be used in the next steps of the process.

##                Size WordCounts LineCounts
## Data Sample 4198576     509761      21347

Interesting findings

Our data sample was tokenized and cleaned in the steps above. Now it is time to create the unigrams (terms based on a single word) and plot a graph displaying the 20 most frequent words in the data sample.
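A sketch of how these unigram frequencies can be obtained and plotted, assuming the quanteda and ggplot2 packages are used (function names follow recent quanteda versions):

library(quanteda)
library(ggplot2)
# Tokenize the cleaned sample and build a document-feature matrix of unigrams
toks <- tokens(sample_text, remove_punct = TRUE, remove_numbers = TRUE)
top_uni <- topfeatures(dfm(toks), 20)
uni_df  <- data.frame(term = names(top_uni), freq = as.numeric(top_uni))
ggplot(uni_df, aes(x = reorder(term, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unigram", y = "Frequency", title = "20 most frequent unigrams")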

Below you see the 20 most frequent bigrams (terms based on two consecutive words) found in our data sample.

The next plot displays the 20 most frequent trigrams (terms based on three consecutive words) in our data sample.
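Under the same quanteda assumption, the bigram and trigram frequencies can be computed by first joining consecutive tokens with tokens_ngrams():

# Bigrams: pairs of consecutive tokens, joined by a space
dfm2 <- dfm(tokens_ngrams(toks, n = 2, concatenator = " "))
topfeatures(dfm2, 20)
# Trigrams: triples of consecutive tokens
dfm3 <- dfm(tokens_ngrams(toks, n = 3, concatenator = " "))
topfeatures(dfm3, 20)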

Plans for creating the prediction algorithm and Shiny app

The strategy for creating the prediction algorithm and the future Shiny app consists of finding a way to reduce the number of features in the document-feature matrix (DFM). The dimensionality of the DFM grows very fast, and this can impact both the quality of the predictions and the computing performance.
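One simple option along these lines is to drop very rare n-grams from the DFM, for example with quanteda's dfm_trim(); the threshold below is a placeholder, not a final choice.

# Keep only n-grams that appear at least 5 times in the sample
dfm3_small <- dfm_trim(dfm3, min_termfreq = 5)
dim(dfm3)        # dimensions before trimming
dim(dfm3_small)  # dimensions after trimming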

Also, I believe it is better to do neither word stemming nor stopword removal. This should improve the prediction accuracy.

I intend to research methods to reduce dimensionality and to use the most significant words to better train the prediction algorithm.

The Shiny app should have a clean and simple interface and should alert users that stopwords and profanity were not removed.