This Milestone Report will highlight key deliverables:

  * Data is uploaded and cleaned
  * Summary statistics about the data sets
  * Report any interesting findings
  * Plans for creating a prediction algorithm and Shiny app

Overview

This project analyzes text data provided by SwiftKey, a corporate partner in this capstone that builds a smart keyboard to make typing on mobile devices easier. SwiftKey uses predictive text models to predict what people are going to type next.

The text data sets provided are in four languages:

  1. German (DE)
  2. English (US)
  3. Finnish (FI)
  4. Russian (RU)

For each language there are three types of text data available:

  1. Blogs
  2. News
  3. Twitter

The text data sets are quite large and require a lot of memory and time to work with. However, we can still create a prediction model by sampling a subset of the data that is representative of the whole data set. We will focus on the English data sets to build a predictive text app in Shiny. The goal is a Shiny app that takes text as input and outputs a prediction of the next word.

Load package libraries for RStudio

These libraries will be needed to complete an exploratory data analysis.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(knitr)
library(tokenizers)
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(rJava)
library(RWeka)

Upload the text datasets

Each data file is read into RStudio with readLines(), which returns a character vector with one element per line of text.

setwd("~/Documents/final/en_US")
blog_data <- readLines("~/Documents/final/en_US/en_US.blogs.txt")
news_data <- readLines("~/Documents/final/en_US/en_US.news.txt")
twitter_data <- readLines("~/Documents/final/en_US/en_US.twitter.txt")

Clean up the text data to make it easier to explore

Now that the data is uploaded, the next step is to sample a smaller subset to work with and clean it up. It is also worth looking at the text data to see what it contains. In the table below, 'Original Lines' is the number of lines in the original data set and 'Sample Lines' is the number of lines in the sampled data set. 'Max Line Length' is the maximum number of characters per line; the Twitter value of 140 reflects Twitter's 140-character limit. 'Average Line Length' is the average number of characters per line.

##   Dataset Original_Lines Sample_Lines Max_LineLength Average_LineLength
## 1   Blogs         899288         4496           2014          227.58875
## 2    News        1010242         5051           1892          198.54544
## 3 Twitter        2360148        11801            140           68.82722
## [1] "The sampled text data set contains 21348 lines."
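The sampling code is not shown in this report. The sample sizes in the table are consistent with drawing about 0.5% of the lines from each file, so a minimal sketch under that assumption might look like the following. The helper names `sample_lines` and `summarise_lines`, the seed, and the small stand-in vectors are hypothetical; in the real analysis `blog_data`, `news_data`, and `twitter_data` would take their place. The report also does not say whether the line-length statistics were computed on the full files or on the samples; the sketch uses the samples.

```r
# Hypothetical stand-ins for the full data sets so the sketch runs on its own.
blog_demo    <- rep(c("a short blog line", "a somewhat longer blog line of text"), 1000)
news_demo    <- rep("a news line", 1000)
twitter_demo <- rep("a tweet", 3000)

# Assumption: keep a fixed fraction of lines (0.5% matches the table above).
sample_lines <- function(lines, fraction = 0.005, seed = 1234) {
  set.seed(seed)  # hypothetical seed, only so the sample is reproducible
  lines[sample(length(lines), round(length(lines) * fraction))]
}

# One summary row per data set, mirroring the columns of the table above.
# Assumption: line lengths are measured on the sampled lines.
summarise_lines <- function(name, original, sampled) {
  data.frame(
    Dataset            = name,
    Original_Lines     = length(original),
    Sample_Lines       = length(sampled),
    Max_LineLength     = max(nchar(sampled)),
    Average_LineLength = mean(nchar(sampled))
  )
}

blog_sample    <- sample_lines(blog_demo)
news_sample    <- sample_lines(news_demo)
twitter_sample <- sample_lines(twitter_demo)

# Combine the three samples into one character vector for later cleaning.
text_data <- c(blog_sample, news_sample, twitter_sample)

summary_table <- rbind(
  summarise_lines("Blogs",   blog_demo,    blog_sample),
  summarise_lines("News",    news_demo,    news_sample),
  summarise_lines("Twitter", twitter_demo, twitter_sample)
)
print(summary_table)
print(paste("The sampled text data set contains", length(text_data), "lines."))
```

Sampling the same fraction from each source keeps the proportions of blog, news, and Twitter text in the sample close to those of the full data.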

Tokenizer

Exploring some of the words in the text data. The output below also shows what tokenizing looks like: each line is split into a vector of lower-case words.

# text_data holds the sampled lines (the code that builds it is not shown).
text_corpus <- VCorpus(VectorSource(text_data))

# Normalize the text: lower case, no punctuation or digits, stemmed words, single spaces.
text_corpus <- tm_map(text_corpus, content_transformer(tolower))
text_corpus <- tm_map(text_corpus, removePunctuation)
text_corpus <- tm_map(text_corpus, removeNumbers)
text_corpus <- tm_map(text_corpus, stemDocument)
text_corpus <- tm_map(text_corpus, stripWhitespace)
## # A tibble: 60 × 2
##    word     count
##    <chr>    <dbl>
##  1 a            2
##  2 actually     1
##  3 ago          1
##  4 and          1
##  5 be           1
##  6 being        1
##  7 blog         1
##  8 class        1
##  9 classes      1
## 10 crazy        1
## # … with 50 more rows
## [[1]]
##  [1] "two"     "weeks"   "ago"     "i"       "taught"  "my"      "first"  
##  [8] "class"   "on"      "living"  "off"     "food"    "storage"
## 
## [[2]]
##  [1] "i"          "thought"    "it"         "went"       "pretty"    
##  [6] "well"       "and"        "hopefully"  "i"          "left"      
## [11] "the"        "impression" "of"         "being"      "a"         
## [16] "relatively" "sane"       "person"     "despite"    "some"      
## [21] "of"         "the"        "crazy"      "things"     "i"         
## [26] "do"        
## 
## [[3]]
##  [1] "some"      "of"        "my"        "blog"      "followers" "were"     
##  [7] "there"     "which"     "actually"  "made"      "the"       "whole"    
## [13] "thing"     "so"        "much"      "more"      "fun"      
## 
## [[4]]
##  [1] "i"         "have"      "2"         "more"      "classes"   "scheduled"
##  [7] "so"        "far"       "so"        "hopefully" "there"     "will"     
## [13] "be"        "a"         "lot"       "of"        "people"    "inspired" 
## [19] "to"        "do"        "something" "with"      "their"     "food"     
## [25] "storage"
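The code that produced the word counts and token lists above is not shown. As a rough base R sketch of the same idea (the `tokenizers` package loaded earlier provides `tokenize_words()` for this purpose), the example below lower-cases each line, strips punctuation, splits on whitespace, and counts the words. The function `tokenize_simple` and the two sample lines are hypothetical.

```r
# Two hypothetical lines, loosely modelled on the blog text shown above.
lines_demo <- c("Two weeks ago I taught my first class.",
                "I have 2 more classes scheduled so far!")

# Split each line into lower-case word tokens with punctuation removed.
tokenize_simple <- function(lines) {
  cleaned <- tolower(lines)
  cleaned <- gsub("[[:punct:]]", "", cleaned)
  strsplit(trimws(cleaned), "[[:space:]]+")
}

tokens <- tokenize_simple(lines_demo)

# Count how often each word occurs across all lines, most frequent first.
word_counts <- sort(table(unlist(tokens)), decreasing = TRUE)
word_counts_df <- data.frame(
  word  = names(word_counts),
  count = as.vector(word_counts),
  stringsAsFactors = FALSE
)

print(tokens)
print(head(word_counts_df))
```

The result is a list with one character vector per input line, which is the same shape as the `[[1]]`, `[[2]]`, ... output shown above.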

Plotting the data

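For the exploratory analysis, the text data is tokenized, meaning each line of text is broken up into smaller units called tokens. A token can be a single word (unigram), a sequence of two consecutive words (bigram), or a sequence of three consecutive words (trigram). The three tables below list the ten most frequent terms of each size.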
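To make the idea of n-grams concrete, here is a minimal base R sketch that slides a window of n words over each tokenized line and counts the results. The report loads `RWeka`, whose `NGramTokenizer()` is a common choice for this step; the sketch avoids it only so that it runs without Java. The functions `make_ngrams` and `count_ngrams` and the sample tokens are hypothetical.

```r
# Build every n-gram of size n from one vector of word tokens.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams across a list of tokenized lines, most frequent first.
count_ngrams <- function(token_list, n) {
  grams <- unlist(lapply(token_list, make_ngrams, n = n))
  freq  <- sort(table(grams), decreasing = TRUE)
  data.frame(term = names(freq), freq = as.vector(freq),
             stringsAsFactors = FALSE)
}

# Hypothetical tokenized lines.
token_list <- list(
  c("one", "of", "the", "best"),
  c("one", "of", "the", "few"),
  c("a", "lot", "of", "the", "time")
)

unigrams <- count_ngrams(token_list, 1)
bigrams  <- count_ngrams(token_list, 2)
trigrams <- count_ngrams(token_list, 3)

print(head(bigrams, 3))
print(head(trigrams, 3))
```

N-grams are built per line, so a window never spans the boundary between two unrelated lines.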

term     freq
the     23723
and     11719
for      5513
that     5475
you      4632
with     3536
was      2979
have     2765
this     2627
are      2373

term      freq
of the    2192
in the    2118
to the    1051
for the   1020
on the     970
to be      840
at the     709
and the    623
in a       620
go to      560

term              freq
one of the         175
a lot of           155
thank for the      141
i want to          115
to be a            107
out of the          97
go to be            95
look forward to     89
part of the         78
is go to            76

Prediction Algorithm and Shiny App

The data cleaning so far has lower-cased the text, removed punctuation and numbers, stemmed the words, and collapsed extra whitespace. The model and Shiny app will need to predict the next word based on the previous words in the sentence, and must handle a diverse range of input text. Prediction models will be needed for unigrams, bigrams, and trigrams, and they will need to be developed concurrently so that all three can be included in the Shiny app.
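One way the three models could work together is a simple backoff: try the trigram table first, fall back to the bigram table, and finally return the most common unigram. This is an assumption about the design, not a method the report commits to. The frequency tables below are tiny hand-made examples, and `best_completion` and `predict_next_word` are hypothetical names.

```r
# Hypothetical, hand-made frequency tables; real ones would come from the corpus.
unigrams <- data.frame(term = c("the", "and", "for"),
                       freq = c(30, 20, 10), stringsAsFactors = FALSE)
bigrams  <- data.frame(term = c("of the", "of a", "lot of"),
                       freq = c(9, 4, 5), stringsAsFactors = FALSE)
trigrams <- data.frame(term = c("one of the", "one of a", "a lot of"),
                       freq = c(6, 2, 5), stringsAsFactors = FALSE)

# Find the most frequent n-gram that starts with `prefix` and return its last word.
best_completion <- function(ngrams, prefix) {
  hits <- ngrams[startsWith(ngrams$term, paste0(prefix, " ")), ]
  if (nrow(hits) == 0) return(NA_character_)
  top <- hits$term[which.max(hits$freq)]
  sub("^.* ", "", top)  # keep only the final word
}

# Backoff: last two words -> last word -> most common single word.
predict_next_word <- function(input) {
  cleaned <- gsub("[[:punct:]]", "", tolower(trimws(input)))
  words <- strsplit(cleaned, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  n <- length(words)
  if (n >= 2) {
    guess <- best_completion(trigrams, paste(words[(n - 1):n], collapse = " "))
    if (!is.na(guess)) return(guess)
  }
  if (n >= 1) {
    guess <- best_completion(bigrams, words[n])
    if (!is.na(guess)) return(guess)
  }
  unigrams$term[which.max(unigrams$freq)]
}

print(predict_next_word("He is one of"))  # matched in the trigram table
print(predict_next_word("a whole lot"))   # falls back to the bigram table
print(predict_next_word("xyz"))           # falls back to the unigram table
```

Because the corpus above was stemmed, tables built from it would suggest stems (for example "go" rather than "going"), so whether to keep stemming in the final model is a design choice worth revisiting.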