Introduction/Background

The goal of this partnership project with SwiftKey is to create a predictive text algorithm, wrapped in an intuitive user interface, that suggests subsequent words based on the words a user has already entered. The app will allow users to generate text content much more quickly than they could without it.

The bedrock for this eventual app is a large amount of text data gleaned from online sources. The purpose of this report is to a) show how the data was obtained, loaded, cleaned, and preprocessed, and b) offer a summary of that text data and present some exploratory analysis and visualizations.

Downloading, Reading, and Preprocessing the Data

The source of the data is an online repository of English Language news, blogs, and Twitter feeds (located at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip). In order to work with this data in an R environment, it must be downloaded from the source, unzipped, and read into a corpus (a structured collection of text documents used primarily for text mining and predictive text analysis), built here with the tm package. This was achieved with the following R code chunk:

library(tm)  # provides VCorpus() and DirSource()

# Download the zipped data set (only if it isn't already on disk)
URLcapdata <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
outDir     <- "C:/Users/Dobbs.Dobbs/Desktop/Git_Clone/Data Science Capstone Data"
capdata    <- file.path(outDir, "capdata.zip")
if (!file.exists(capdata)) {
  download.file(URLcapdata, capdata)
}

# Unzip the archive into the data directory
unzip(capdata, exdir = outDir)

# Make corpus from the English-language files
ENcorpus <- VCorpus(DirSource(file.path(outDir, "final/en_US")))

Data gathered in this fashion is rarely useful as-is; it must first be cleaned and preprocessed. Profanity, punctuation, numbers, and unrecognizable characters add nothing to the analysis and can cause problems downstream. The full details of the cleaning and preprocessing are omitted from this report, but a brief sketch of the approach appears below, and the complete code can be provided to interested parties.
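The sketch below illustrates the kind of transformations applied, using the tm package. It assumes a hypothetical character vector profanityList holding the terms to filter out; the actual pipeline used for this project is more involved.

# Minimal cleaning sketch (not the full pipeline used here);
# 'profanityList' is a hypothetical character vector of words to remove
ENcorpus <- tm_map(ENcorpus, content_transformer(tolower))  # lowercase all text
ENcorpus <- tm_map(ENcorpus, removePunctuation)             # strip punctuation
ENcorpus <- tm_map(ENcorpus, removeNumbers)                 # strip digits
ENcorpus <- tm_map(ENcorpus, removeWords, profanityList)    # filter profanity
ENcorpus <- tm_map(ENcorpus, stripWhitespace)               # collapse extra whitespace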

Data Summary and Visualization

The next step toward the predictive text app is some exploratory analysis of the corpus. How large are the files we’re dealing with? How many words are in the sample? Which words appear most often? Which words appear together most often?

Even after cleaning and preprocessing, the sample contains a large amount of data. A larger corpus takes longer to process, but it should yield greater confidence in the predictive text algorithm in the end.

## Statistics for Corpus of Sample Data:
## Corpus Size = 436.5 Mb ; Corpus Word Count = 36032397
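For reference, one way such summary statistics could be computed is sketched below; this is an illustration rather than the exact code used, and the whitespace-based word count is only an approximation.

# Approximate corpus statistics (illustrative sketch)
format(object.size(ENcorpus), units = "Mb")            # corpus size in memory
sum(sapply(ENcorpus, function(doc) {
  length(unlist(strsplit(as.character(doc), "\\s+")))  # rough word count per document
}))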

So, which words appear most frequently, and how many times do they appear? Below is a bar chart of the most commonly encountered terms, specifically those that appeared at least 100,000 times. Below that is another useful visualization, a “word cloud”, in which the most frequent words are drawn in larger, colored type.
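A sketch of how such plots might be produced is shown below. It assumes the ggplot2, wordcloud, and slam packages and is one possible approach rather than the exact code behind the figures; the 100,000 threshold matches the bar chart described above.

library(ggplot2)    # bar chart
library(wordcloud)  # word cloud (attaches RColorBrewer for the palette)

# Term frequencies from a term-document matrix
tdm   <- TermDocumentMatrix(ENcorpus)
freqs <- sort(slam::row_sums(tdm), decreasing = TRUE)

# Bar chart of terms appearing at least 100,000 times
top <- data.frame(term = names(freqs[freqs >= 100000]), count = freqs[freqs >= 100000])
ggplot(top, aes(x = reorder(term, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Term", y = "Frequency")

# Word cloud of the 100 most frequent terms
wordcloud(names(freqs), freqs, max.words = 100, colors = brewer.pal(8, "Dark2"))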

The last piece of exploratory analysis for now is a look at 2-grams and 3-grams: that is, the two- and three-word combinations that appear together most often.
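One way to extract these n-grams is sketched below. It assumes the RWeka package (for its NGramTokenizer) alongside tm and slam; other tokenizers would work just as well.

library(RWeka)  # provides NGramTokenizer

# Tokenizer functions for 2-grams and 3-grams
BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# Term-document matrices built on the n-gram tokenizers
tdm2 <- TermDocumentMatrix(ENcorpus, control = list(tokenize = BigramTokenizer))
tdm3 <- TermDocumentMatrix(ENcorpus, control = list(tokenize = TrigramTokenizer))

# Ten most frequent 2-grams and 3-grams
head(sort(slam::row_sums(tdm2), decreasing = TRUE), 10)
head(sort(slam::row_sums(tdm3), decreasing = TRUE), 10)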

Conclusions

A wealth of text data can be obtained from multiple online sources, and it does not have to be perfect in order to be useful. It can be said with confidence that the project is proceeding as planned, and the end result, an R Shiny predictive text app, is within reach.