Instructions:

The goal of this project is to demonstrate the data analysis skills learned in the previous modules of the specialization and to show that I am on track to build my word prediction algorithm.

The data was downloaded from the Capstone Dataset provided for the course.

For this report, I will use only the en_US files.

Simulations:

A. Set wd and get data

# Set the working directory, then capture it (setwd() returns the previous directory, not the new one)
setwd("/Users/sexybaboy/Documents/Files/Zetch/Online Courses/Data Science Specialization Feb18/R/Capstone /Milestone Report")
path <- getwd()
# Download and unzip data
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, file.path(path, "Coursera-SwiftKey.zip"))
unzip(zipfile = file.path(path, "Coursera-SwiftKey.zip"))

The data set contains 4 languages (US English, German, Finnish and Russian), each drawn from 3 sources (news, blogs, Twitter).
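
A quick listing of the extracted files (a minimal check, assuming the zip unpacked into the final/ directory that the read commands below expect) confirms this layout:

# List the extracted files with their sizes in MB
files <- list.files("final", recursive = TRUE, full.names = TRUE)
data.frame(file = files, size.MB = round(file.info(files)$size / 1024^2, 1))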

B. Explore, preprocess and sample the data:

blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
library(stringi)
# Get file sizes
blogs.size <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2

# Count words per line in each file
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)

# Summary of the data sets
data.frame(source = c("blogs", "news", "twitter"),
           file.size.MB = c(blogs.size, news.size, twitter.size),
           num.lines = c(length(blogs), length(news), length(twitter)),
           num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
           mean.num.words = c(mean(blogs.words), mean(news.words), mean(twitter.words)))
##    source file.size.MB num.lines num.words mean.num.words
## 1   blogs     200.4242    899288  37546239       41.75107
## 2    news     196.2775   1010242  34762395       34.40997
## 3 twitter     159.3641   2360148  30093413       12.75065

Data sources are summarized according to file size, line count, word count and mean words per line.

library(tm)
# Sample 1% of each source to keep the corpus a manageable size
set.seed(700)
ds <- c(sample(blogs, length(blogs) * 0.01),
        sample(news, length(news) * 0.01),
        sample(twitter, length(twitter) * 0.01))

# Create corpus and clean the data
corpus <- VCorpus(VectorSource(ds))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")  # remove URLs
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")                      # remove @handles
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

C. Tokenization

library(RWeka)
library(ggplot2)
# Use a single core to avoid known issues between RWeka/rJava and parallel tm operations
options(mc.cores = 1)

# Build a frequency table (term and count) from a term-document matrix, for any n-gram size
getFreq <- function(tdm) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}
# RWeka tokenizers for two- and three-word n-grams
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Bar plot of the 30 most frequent terms
Plot <- function(data, label) {
  ggplot(data[1:30, ], aes(reorder(word, -freq), freq)) +
    labs(x = label, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 45, size = 10, hjust = 1)) +
    geom_bar(stat = "identity", fill = I("darkgreen"))
}

# Get frequencies of most common n-grams in data sample
Unigrams <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.9999))
Bigrams <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.9999))
Trigrams <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999))
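
Before plotting, the head of each frequency table gives a quick sanity check on the tokenization (this only inspects the tables built above):

# Most frequent terms in each n-gram table
head(Unigrams, 5)
head(Bigrams, 5)
head(Trigrams, 5)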

D. Data plotting

# Unigrams
Plot(Unigrams, "Top 30 Unigrams")

# Bigrams
Plot(Bigrams, "Top 30 Bigrams")

# Trigrams
Plot(Trigrams, "Top 30 Trigrams")

Prediction and plans for the Shiny app

I plan to use an n-gram model with frequency lookup to estimate the probability of the next word occurring given the previous words. The trigram model will be tried first to predict the next word; if no matching trigram is found, the algorithm will back off to the bigram model and then, if needed, to the unigram model.
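
As a rough illustration of the intended back-off lookup, here is a sketch over the frequency tables built above. The helper name predictNextWord and the input cleaning are assumptions for illustration only; the real model will need smoothing and better handling of unseen input.

# Sketch of a simple back-off lookup over the Unigrams/Bigrams/Trigrams tables.
# predictNextWord() is a hypothetical helper, not the final prediction function.
predictNextWord <- function(phrase) {
  words <- unlist(strsplit(gsub("[^a-z' ]", " ", tolower(phrase)), "\\s+"))
  words <- words[words != ""]
  n <- length(words)
  # Try trigrams whose first two words match the end of the input
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- Trigrams[grepl(paste0("^", prefix, " "), Trigrams$word), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$word[1])))
  }
  # Back off to bigrams whose first word matches the last input word
  if (n >= 1) {
    hits <- Bigrams[grepl(paste0("^", words[n], " "), Bigrams$word), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$word[1])))
  }
  # Fall back to the single most frequent unigram
  as.character(Unigrams$word[1])
}

predictNextWord("thanks for the")

Because stopwords were stripped from this exploratory corpus, the n-gram tables for the final model would likely need to be rebuilt without stopword removal so that common everyday phrases can be predicted; that is a design decision to revisit later.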

For the Shiny app, the plan is to create a simple interface where the user can enter a string of text or a phrase. The prediction model will then suggest the most likely next word (or a short list of candidate words).
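
A minimal sketch of what that interface could look like, assuming a predictNextWord() helper like the one sketched above; names and layout are placeholders, not the final app:

library(shiny)

# Bare-bones UI: one text box for the phrase, one output for the suggested next word
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  textOutput("suggestion")
)

server <- function(input, output) {
  output$suggestion <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("")
    predictNextWord(input$phrase)
  })
}

shinyApp(ui = ui, server = server)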