Milestone Report

This Milestone Report highlights the key deliverables:

* Data is uploaded and cleaned
* Summary statistics about the data sets
* Report any interesting findings
* Plans for creating a prediction algorithm and Shiny app

Overview

This project analyzes text data provided by SwiftKey, a corporate partner in this capstone. SwiftKey builds a smart keyboard that makes it easier for people to type on their mobile devices, and it uses predictive text models to predict what people are going to type next.

The text data sets provided are in four languages:

German (DE)
English (US)
Finnish (FI)
Russian (RU)

For each language there are three types of text data available:

Blogs
News
Twitter

The text data sets are large and take substantial memory and time to process. However, we can still build a prediction model from a sample, provided the sample is representative of the whole data set. We will focus on the English data sets and build a predictive text app with Shiny. The goal is a Shiny app that takes text as input and outputs a prediction of the next word.

Load package libraries for RStudio

These packages are needed for the exploratory data analysis.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(knitr)
library(tokenizers)
library(tm)

## Loading required package: NLP

## 
## Attaching package: 'NLP'

## The following object is masked from 'package:ggplot2':
## 
##     annotate

# dplyr, ggplot2, and tidyr are already attached by tidyverse; NLP is attached by tm
library(rJava)
library(RWeka)

Upload the text datasets

The data files are read into R with readLines(). Each file becomes a character vector with one element per line of text.

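# Read each English file as a character vector, one element per line.
# skipNul = TRUE skips any embedded nul characters instead of warning about them.
data_dir <- "~/Documents/final/en_US"
blog_data <- readLines(file.path(data_dir, "en_US.blogs.txt"), encoding = "UTF-8", skipNul = TRUE)
news_data <- readLines(file.path(data_dir, "en_US.news.txt"), encoding = "UTF-8", skipNul = TRUE)
twitter_data <- readLines(file.path(data_dir, "en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)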

Clean up the text data to make it easier to explore

Now that the data is loaded, the next step is to draw a smaller sample to work with and clean it up. It also helps to look at the shape of the text data first. In the table below, ‘Original_Lines’ is the number of lines in the original file. ‘Sample_Lines’ is the number of lines in the sample, about 0.5% of each file. ‘Max_LineLength’ is the largest number of characters in a single line; for Twitter this is 140, the tweet length limit. ‘Average_LineLength’ is the mean number of characters per line.

##   Dataset Original_Lines Sample_Lines Max_LineLength Average_LineLength
## 1   Blogs         899288         4496           2014          227.58875
## 2    News        1010242         5051           1892          198.54544
## 3 Twitter        2360148        11801            140           68.82722

## [1] "The sampled text data set contains 21348 lines."
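The code that produced this summary is not shown in the report. A minimal base-R sketch of how such a sample and summary table could be built follows. The helper names sample_lines() and summarise_lines(), the seed, and the toy vector are all hypothetical. The 0.5% fraction is inferred from the table, and computing the line lengths on the original lines rather than on the sample is an assumption.

```r
# Sketch only: a toy vector stands in for blog_data, news_data, or twitter_data.
set.seed(1234)  # hypothetical seed so the sample is reproducible

sample_lines <- function(lines, fraction = 0.005) {
  # Keep a random subset of the lines, always at least one
  n <- max(1, round(length(lines) * fraction))
  sample(lines, n)
}

summarise_lines <- function(name, original, sampled) {
  # One summary row per data set; lengths are measured in characters
  data.frame(
    Dataset = name,
    Original_Lines = length(original),
    Sample_Lines = length(sampled),
    Max_LineLength = max(nchar(original)),
    Average_LineLength = mean(nchar(original)),
    stringsAsFactors = FALSE
  )
}

toy_blogs <- rep(c("a short line", "a somewhat longer line of text"), 500)
toy_sample <- sample_lines(toy_blogs)
line_summary <- summarise_lines("Blogs", toy_blogs, toy_sample)
print(line_summary)
```

With the real data, one row per file could be combined with rbind(), and the three samples concatenated with c() to form the text used in the following steps.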

Tokenizer

This step explores some of the words in the text data. The output below also shows what tokenized text looks like.

# text_data is assumed here to be the combined sample of blog, news, and Twitter lines
text_corpus <- VCorpus(VectorSource(text_data))

# Normalize the text: lower-case, remove punctuation and digits, stem, collapse whitespace
text_corpus <- tm_map(text_corpus, content_transformer(tolower))
text_corpus <- tm_map(text_corpus, removePunctuation)
text_corpus <- tm_map(text_corpus, removeNumbers)
text_corpus <- tm_map(text_corpus, stemDocument)
text_corpus <- tm_map(text_corpus, stripWhitespace)

## # A tibble: 60 × 2
##    word     count
##    <chr>    <dbl>
##  1 a            2
##  2 actually     1
##  3 ago          1
##  4 and          1
##  5 be           1
##  6 being        1
##  7 blog         1
##  8 class        1
##  9 classes      1
## 10 crazy        1
## # … with 50 more rows

## [[1]]
##  [1] "two"     "weeks"   "ago"     "i"       "taught"  "my"      "first"  
##  [8] "class"   "on"      "living"  "off"     "food"    "storage"
## 
## [[2]]
##  [1] "i"          "thought"    "it"         "went"       "pretty"    
##  [6] "well"       "and"        "hopefully"  "i"          "left"      
## [11] "the"        "impression" "of"         "being"      "a"         
## [16] "relatively" "sane"       "person"     "despite"    "some"      
## [21] "of"         "the"        "crazy"      "things"     "i"         
## [26] "do"        
## 
## [[3]]
##  [1] "some"      "of"        "my"        "blog"      "followers" "were"     
##  [7] "there"     "which"     "actually"  "made"      "the"       "whole"    
## [13] "thing"     "so"        "much"      "more"      "fun"      
## 
## [[4]]
##  [1] "i"         "have"      "2"         "more"      "classes"   "scheduled"
##  [7] "so"        "far"       "so"        "hopefully" "there"     "will"     
## [13] "be"        "a"         "lot"       "of"        "people"    "inspired" 
## [19] "to"        "do"        "something" "with"      "their"     "food"     
## [25] "storage"
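The tokenizing code itself is not shown above. As a rough illustration of the idea, the base-R sketch below lower-cases each line, drops punctuation, and splits on whitespace, then counts the words. It is a simplified stand-in and not the tokenizers or tm implementation. The function name tokenize_simple() and the two example sentences are hypothetical.

```r
# Sketch only: a simplified word tokenizer written in base R.
tokenize_simple <- function(lines) {
  cleaned <- tolower(lines)
  cleaned <- gsub("[[:punct:]]", "", cleaned)   # drop punctuation characters
  strsplit(trimws(cleaned), "[[:space:]]+")     # one character vector per line
}

example_lines <- c(
  "Two weeks ago I taught my first class.",
  "The class went well, and the next class is scheduled."
)

tokens <- tokenize_simple(example_lines)
print(tokens)

# Word frequencies across all lines, most frequent first
word_counts <- sort(table(unlist(tokens)), decreasing = TRUE)
print(head(word_counts))
```

Each element of tokens corresponds to one input line, which mirrors the list structure of the output shown above.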

Plotting the data

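For the exploratory analysis, the cleaned text is tokenized, that is, split into units called tokens. A token can be a single word (unigram), a sequence of two consecutive words (bigram), or a sequence of three consecutive words (trigram). The tables below list the ten most frequent unigrams, bigrams, and trigrams in the sample, in that order. Because the text was stemmed during cleaning, some entries are stems rather than the original words, for example "thank for the" and "go to be". The unigram table also contains no words shorter than three characters (such as "to" or "of"), most likely because tm drops such words by default when it builds a term-document matrix.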

Unigrams (top 10)

term	freq
the	23723
and	11719
for	5513
that	5475
you	4632
with	3536
was	2979
have	2765
this	2627
are	2373

Bigrams (top 10)

term	freq
of the	2192
in the	2118
to the	1051
for the	1020
on the	970
to be	840
at the	709
and the	623
in a	620
go to	560

Trigrams (top 10)

term	freq
one of the	175
a lot of	155
thank for the	141
i want to	115
to be a	107
out of the	97
go to be	95
look forward to	89
part of the	78
is go to	76
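The code behind these frequency tables is not included in the report. RWeka is loaded earlier, so its n-gram tokenizer was presumably used, but that is an assumption. The base-R sketch below shows the underlying idea on toy data: slide a window of n words over each tokenized line, then count the resulting terms. The names make_ngrams() and ngram_freq() and the toy token list are hypothetical.

```r
# Sketch only: n-gram counting in base R on already tokenized lines.
make_ngrams <- function(tokens, n) {
  # tokens: character vector of words from one line; returns its n-grams
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

ngram_freq <- function(token_list, n, top = 10) {
  # Count n-grams over all lines and return the most frequent ones
  grams <- unlist(lapply(token_list, make_ngrams, n = n))
  counts <- sort(table(grams), decreasing = TRUE)
  head(data.frame(term = names(counts),
                  freq = as.integer(counts),
                  stringsAsFactors = FALSE), top)
}

toy_tokens <- list(
  c("one", "of", "the", "best"),
  c("one", "of", "the", "few"),
  c("part", "of", "the", "plan")
)

bigram_top <- ngram_freq(toy_tokens, 2)
trigram_top <- ngram_freq(toy_tokens, 3)
print(bigram_top)
print(trigram_top)
```

N-grams are built per line, so no term spans the boundary between two unrelated lines.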

Prediction Algorithm and Shiny App

The data cleaning so far has removed punctuation and numbers, converted all text to lower case, stemmed the words, and collapsed extra whitespace. The model and Shiny app will need to predict the next word from the previous words in the sentence, and the model will need to handle a diverse range of input text. The plan is to build unigram, bigram, and trigram models. These models will need to be developed together so that all three can be included in the Shiny app.
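One simple way the three models could work together is a backoff lookup: try the trigram table first, fall back to the bigram table, and finally to the most frequent unigram. The base-R sketch below illustrates that idea on toy data. It is not the final algorithm for this project, it uses raw counts with no smoothing, and build_model(), predict_next(), and the training sentences are hypothetical.

```r
# Sketch only: next-word prediction by backing off from trigrams to unigrams.
build_model <- function(token_list, max_n = 3) {
  # Returns one table per n-gram order with columns prefix, nextword, freq
  lapply(seq_len(max_n), function(n) {
    grams <- unlist(lapply(token_list, function(tok) {
      if (length(tok) < n) return(character(0))
      vapply(seq_len(length(tok) - n + 1),
             function(i) paste(tok[i:(i + n - 1)], collapse = " "),
             character(1))
    }))
    counts <- table(grams)
    words <- strsplit(names(counts), " ")
    data.frame(
      prefix = vapply(words, function(w) paste(head(w, -1), collapse = " "), character(1)),
      nextword = vapply(words, function(w) tail(w, 1), character(1)),
      freq = as.integer(counts),
      stringsAsFactors = FALSE
    )
  })
}

predict_next <- function(model, text) {
  words <- strsplit(tolower(trimws(text)), " +")[[1]]
  for (n in rev(seq_along(model))) {   # highest order first
    k <- n - 1                         # context words needed by this table
    if (length(words) < k) next
    prefix <- paste(tail(words, k), collapse = " ")
    hits <- model[[n]][model[[n]]$prefix == prefix, ]
    if (nrow(hits) > 0) return(hits$nextword[which.max(hits$freq)])
  }
  NA_character_
}

toy_tokens <- list(
  c("i", "want", "to", "go"),
  c("i", "want", "to", "sleep"),
  c("i", "want", "to", "go"),
  c("thanks", "for", "the", "help")
)
model <- build_model(toy_tokens)

print(predict_next(model, "I want to"))   # matched in the trigram table
print(predict_next(model, "to"))          # only one context word: bigram table
print(predict_next(model, "hello there")) # unseen context: most frequent unigram
```

In a Shiny app, a function like predict_next() could be called from the server code each time the input text changes. Because the corpus was stemmed during cleaning, real input would need the same cleaning steps applied before the lookup.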