This Milestone Report will highlight key deliverables:
* Data is uploaded and cleaned
* Summary statistics about the data sets
* Report any interesting findings
* Plans for creating a prediction algorithm and Shiny app
This project will analyze text data provided by SwiftKey, a corporate partner in this capstone. SwiftKey builds a smart keyboard that makes it easier for people to type on their mobile devices, using predictive text models to predict the text people are going to type.
The text data sets provided are in four languages: English, German, Finnish, and Russian.
For each language there are three types of text data available: blogs, news, and Twitter.
The text data sets are quite large, requiring a lot of memory and time to work with. However, we can still create a prediction model by sampling a subset of the data that is representative of the whole dataset. We will focus on the English data sets to build a predictive text app in the Shiny App environment. The goal is to create a Shiny App that takes text as input and outputs a prediction of the next word.
These libraries will be needed to complete an exploratory data analysis.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.1.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(knitr)
library(tokenizers)
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
# dplyr, ggplot2, and tidyr are already attached by tidyverse; NLP is attached by tm.
# RWeka depends on rJava, which requires a Java installation.
library(rJava)
library(RWeka)
The data files are read into R with readLines(). Each file becomes a character vector with one element per line of text, and these vectors appear as values in the RStudio environment pane.
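# Read each English file as a character vector, one element per line.
# skipNul = TRUE drops embedded NUL characters instead of warning about them.
blog_data <- readLines("~/Documents/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news_data <- readLines("~/Documents/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter_data <- readLines("~/Documents/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)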
Now that the data is loaded, the next step is to sample a smaller subset to work with and clean it up. It is also a good idea to look at the text data to see what it contains. In the table below, ‘Original_Lines’ is the number of lines in the original dataset and ‘Sample_Lines’ is the number of lines in the sampled dataset. ‘Max_LineLength’ is the maximum number of characters in a line; for Twitter it is 140, which matches the 140-character tweet limit. ‘Average_LineLength’ is the average number of characters per line.
## Dataset Original_Lines Sample_Lines Max_LineLength Average_LineLength
## 1 Blogs 899288 4496 2014 227.58875
## 2 News 1010242 5051 1892 198.54544
## 3 Twitter 2360148 11801 140 68.82722
## [1] "The sampled text data set contains 21348 lines."
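The report does not show the code behind this table. The sample sizes above are consistent with taking roughly 0.5% of each file, so a minimal sketch of the sampling and summary step might look like the following. The helper names `sample_lines` and `summarise_lines`, the 0.5% rate, and the seed are assumptions for illustration, and it is not stated whether the line lengths were measured on the full files or on the samples; this sketch measures the full vectors. A small toy vector stands in for the real data.

```r
# Hypothetical helper: draw a fixed fraction of lines without replacement.
# The 0.5% rate and the seed are assumptions, not values stated in the report.
sample_lines <- function(lines, rate = 0.005, seed = 1234) {
  set.seed(seed)                              # make the sample reproducible
  n <- max(1, round(length(lines) * rate))    # keep at least one line
  lines[sample.int(length(lines), n)]
}

# Hypothetical helper: one summary row per dataset, mirroring the table columns.
summarise_lines <- function(name, original, sampled) {
  data.frame(
    Dataset            = name,
    Original_Lines     = length(original),
    Sample_Lines       = length(sampled),
    Max_LineLength     = max(nchar(original)),
    Average_LineLength = mean(nchar(original)),
    stringsAsFactors   = FALSE
  )
}

# Toy stand-in for one of the real files (1000 short lines).
toy_data   <- paste("line number", seq_len(1000))
toy_sample <- sample_lines(toy_data)
toy_stats  <- summarise_lines("Toy", toy_data, toy_sample)
print(toy_stats)
```

With the real data, the same two helpers would be called once each for `blog_data`, `news_data`, and `twitter_data`, and the three rows combined with `rbind()`.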
Next, we explore some of the words in the text data. The output also provides an example of what tokenized text looks like.
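# text_data is assumed to hold the combined sample of blog, news, and Twitter lines.
text_corpus <- VCorpus(VectorSource(text_data))
text_corpus <- tm_map(text_corpus, content_transformer(tolower))  # convert to lower case
text_corpus <- tm_map(text_corpus, removePunctuation)             # strip punctuation
text_corpus <- tm_map(text_corpus, removeNumbers)                 # strip digits
text_corpus <- tm_map(text_corpus, stemDocument)                  # reduce words to stems (needs SnowballC)
text_corpus <- tm_map(text_corpus, stripWhitespace)               # collapse repeated whitespace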
## # A tibble: 60 × 2
## word count
## <chr> <dbl>
## 1 a 2
## 2 actually 1
## 3 ago 1
## 4 and 1
## 5 be 1
## 6 being 1
## 7 blog 1
## 8 class 1
## 9 classes 1
## 10 crazy 1
## # … with 50 more rows
## [[1]]
## [1] "two" "weeks" "ago" "i" "taught" "my" "first"
## [8] "class" "on" "living" "off" "food" "storage"
##
## [[2]]
## [1] "i" "thought" "it" "went" "pretty"
## [6] "well" "and" "hopefully" "i" "left"
## [11] "the" "impression" "of" "being" "a"
## [16] "relatively" "sane" "person" "despite" "some"
## [21] "of" "the" "crazy" "things" "i"
## [26] "do"
##
## [[3]]
## [1] "some" "of" "my" "blog" "followers" "were"
## [7] "there" "which" "actually" "made" "the" "whole"
## [13] "thing" "so" "much" "more" "fun"
##
## [[4]]
## [1] "i" "have" "2" "more" "classes" "scheduled"
## [7] "so" "far" "so" "hopefully" "there" "will"
## [13] "be" "a" "lot" "of" "people" "inspired"
## [19] "to" "do" "something" "with" "their" "food"
## [25] "storage"
To do the exploratory analysis, the text data will be tokenized, meaning each line is split into units called tokens. These tokens can be single words (unigrams), sequences of two words (bigrams), or sequences of three words (trigrams). The tables below list the most frequent terms of each kind with their counts.
Unigrams:

| term | freq |
|---|---|
| the | 23723 |
| and | 11719 |
| for | 5513 |
| that | 5475 |
| you | 4632 |
| with | 3536 |
| was | 2979 |
| have | 2765 |
| this | 2627 |
| are | 2373 |

Bigrams:

| term | freq |
|---|---|
| of the | 2192 |
| in the | 2118 |
| to the | 1051 |
| for the | 1020 |
| on the | 970 |
| to be | 840 |
| at the | 709 |
| and the | 623 |
| in a | 620 |
| go to | 560 |

Trigrams:

| term | freq |
|---|---|
| one of the | 175 |
| a lot of | 155 |
| thank for the | 141 |
| i want to | 115 |
| to be a | 107 |
| out of the | 97 |
| go to be | 95 |
| look forward to | 89 |
| part of the | 78 |
| is go to | 76 |

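The code that produced these frequency tables is not shown. The report loads RWeka, which provides an n-gram tokenizer, but to keep the example self-contained the sketch below swaps in plain base R. It illustrates the idea of counting n-grams and is not the code used for the report. The helpers `make_ngrams` and `top_ngrams` and the toy sentences are hypothetical.

```r
# Hypothetical helper: build word n-grams from one line of cleaned text.
make_ngrams <- function(line, n) {
  words <- strsplit(trimws(line), "\\s+")[[1]]   # split on whitespace
  words <- words[nzchar(words)]                  # drop empty strings
  if (length(words) < n) return(character(0))    # line too short for this n
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: count n-grams over all lines, most frequent first.
top_ngrams <- function(lines, n, k = 10) {
  grams <- unlist(lapply(lines, make_ngrams, n = n))
  if (length(grams) == 0) {
    return(data.frame(term = character(0), freq = integer(0),
                      stringsAsFactors = FALSE))
  }
  counts <- sort(table(grams), decreasing = TRUE)
  head(data.frame(term = names(counts),
                  freq = as.integer(counts),
                  stringsAsFactors = FALSE), k)
}

# Toy input standing in for the cleaned sample.
toy_lines <- c("one of the best", "one of the worst", "a lot of the time")
bigram_counts  <- top_ngrams(toy_lines, n = 2, k = 3)
trigram_counts <- top_ngrams(toy_lines, n = 3, k = 3)
print(bigram_counts)
print(trigram_counts)
```

Because the corpus was stemmed before counting, terms in the real tables appear in stemmed form, for example "thank for the" and "go to be".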
The data cleaning so far has removed punctuation and numbers, converted all text to lower case, stemmed the words, and collapsed extra whitespace. The stemming is why terms such as "thank for the" and "go to be" appear in the trigram table. The model and Shiny App will need to predict the next word based on the previous words in the sentence, and the prediction model will need to handle a diverse set of input text. There will need to be a prediction model for unigrams, bigrams, and trigrams. These models will need to be developed concurrently for inclusion in the Shiny App.
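The report does not yet specify how the unigram, bigram, and trigram models will be combined. One common option is a simple back-off lookup: try the last two words against the trigrams, then the last word against the bigrams, then fall back to the most frequent single word. The sketch below assumes that approach; the tables `trigram_tbl` and `bigram_tbl`, the function `predict_next`, and their tiny contents are hypothetical and only loosely borrowed from the frequency tables above.

```r
# Hypothetical lookup tables; a real app would build these from corpus counts.
trigram_tbl <- data.frame(prefix = c("one of", "a lot"),
                          word   = c("the", "of"),
                          stringsAsFactors = FALSE)
bigram_tbl  <- data.frame(prefix = c("of", "to"),
                          word   = c("the", "be"),
                          stringsAsFactors = FALSE)
top_unigram <- "the"

# Predict the next word by backing off from trigrams to bigrams to unigrams.
predict_next <- function(text) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  n <- length(words)
  # Try the last two words against the trigram table first.
  if (n >= 2) {
    key <- paste(words[(n - 1):n], collapse = " ")
    hit <- trigram_tbl$word[trigram_tbl$prefix == key]
    if (length(hit) > 0) return(hit[1])
  }
  # Back off to the last word and the bigram table.
  if (n >= 1) {
    hit <- bigram_tbl$word[bigram_tbl$prefix == words[n]]
    if (length(hit) > 0) return(hit[1])
  }
  # Fall back to the most frequent single word.
  top_unigram
}

predict_next("She is one of")   # matched in the trigram table
predict_next("I want to")       # matched in the bigram table
predict_next("hello")           # no match, unigram fallback
```

In a Shiny App, a function of this shape could be called from the server logic each time the text input changes. Note that the input would need the same cleaning as the corpus (including stemming, if that step is kept) for lookups to match.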