January 7, 2019

Project Description

The main objective of the capstone project is to build a prediction application, similar to the technology SwiftKey provides on mobile phones, that predicts the next word from the words typed before it. The word combinations were gathered from thousands of text documents from different sources:

  • Twitter
  • Blogs
  • News

The original dataset used for this project can be downloaded directly from the following URL: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
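As a rough sketch of how the archive could be fetched and unpacked from R: the helper name `fetch_dataset` and the `data` destination folder are assumptions made for illustration, not part of the original project code.

```r
# URL taken from the text above.
dataset_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

# Hypothetical helper: download the zip once and extract it into dest_dir.
fetch_dataset <- function(url = dataset_url, dest_dir = "data") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  zip_path <- file.path(dest_dir, basename(url))
  # Skip the download when the archive is already on disk.
  if (!file.exists(zip_path)) {
    download.file(url, destfile = zip_path, mode = "wb")
  }
  # unzip() returns the paths of the extracted files.
  unzip(zip_path, exdir = dest_dir)
}
```

Calling `fetch_dataset()` would download several hundred megabytes, so the function is only defined here and not run.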

The Dataset (File Size)

The original dataset contains text files in several languages, although for the sake of simplicity only the English files are used in this project.

Blogs File Size (Mb):

[1] 200.4242

News File Size (Mb):

[1] 196.2775

Twitter File Size (Mb):

[1] 159.3641
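Sizes like the ones above can be obtained by dividing a file's byte count by 1024^2. The sketch below assumes that convention; the helper name `file_size_mb` is hypothetical, and a temporary file stands in for the real data so the snippet runs without the dataset.

```r
# Hypothetical helper: size of a file on disk in megabytes (1 Mb = 1024^2 bytes).
file_size_mb <- function(path) {
  file.size(path) / 1024^2
}

# Demonstration on a temporary file instead of the real corpus.
tmp <- tempfile(fileext = ".txt")
writeLines(rep("some example text", 1000), tmp)
size_mb <- file_size_mb(tmp)
size_mb
```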

The Dataset (Summary information)

Blogs number of characters distribution:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0    47.0   157.0   231.7   331.0 40835.0 

News number of characters distribution:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      2     111     186     203     270    5760 

Twitter number of characters distribution:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    2.0    37.0    64.0    68.8   100.0   213.0 
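Distributions of this kind can be produced by counting the characters in each line and summarising the counts. A minimal sketch, using a made-up three-line vector (`toy_lines`) in place of the real files:

```r
# Toy stand-in for the lines of one source file (hypothetical data).
toy_lines <- c("short", "a somewhat longer line of text", "mid length line")

# Number of characters in each line.
char_counts <- nchar(toy_lines)

# Min, quartiles, median, mean and max of the per-line character counts.
char_summary <- summary(char_counts)
char_summary
```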

Sampling the Dataset

Because the dataset is very large, it is sampled to avoid scalability issues in the final app.

A sample equal to 20% of the lines of each of the three initial datasets is drawn, and the three samples are joined into a single data.frame.

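set.seed(3007)  # fix the seed so the sample is reproducible

# longblogs, longnews and longtwitter hold the number of lines of each source
# (presumably computed earlier, e.g. with length()); note the draw is with replacement
sam_blogs <- sample(blogs, size = (longblogs / 5), replace = TRUE)
sam_news <- sample(news, size = (longnews / 5), replace = TRUE)
sam_twitter <- sample(twitter, size = (longtwitter / 5), replace = TRUE)

# combine the three samples, save them to disk and wrap them in a data.frame
muestra <- c(sam_blogs, sam_news, sam_twitter)
writeLines(muestra, "muestragrande.txt")
muestra <- as.data.frame(muestra)
names(muestra) <- c("text")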

Number of lines of text of the sample:

[1] 667337

Creating N-grams using Quanteda

Tokenization code:

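library(quanteda)

# split the text into word tokens, dropping numbers, punctuation, symbols and URLs
# (argument names as in the quanteda release used at the time; newer releases changed some of them)
train.tokens <- tokens(muestra$text, what = "word",
                       remove_numbers = TRUE, remove_punct = TRUE,
                       remove_symbols = TRUE, remove_separators = TRUE,
                       remove_twitter = TRUE, remove_hyphens = TRUE,
                       remove_url = TRUE)
train.tokens <- tokens_tolower(train.tokens)

# document-feature matrix; tokens are already lower case
train.tokens.dfm <- dfm(train.tokens, tolower = FALSE)

# total count of each word across the whole sample
unigram_freq <- colSums(train.tokens.dfm)
Unigram <- data.frame(words = names(unigram_freq), count = unigram_freq)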
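To illustrate how n-gram counts can turn into a next-word prediction, here is a small base R sketch. It is not necessarily the method used in the final app: the toy corpus, the regular expression used for splitting, and the helpers `make_bigrams` and `predict_next` are all assumptions made for this example.

```r
# Toy corpus standing in for the sampled text (hypothetical data).
toy_text <- c("i love new york", "i love new ideas", "new york is big")

# Split each line into lowercase word tokens.
toy_tokens <- strsplit(tolower(toy_text), "[^a-z']+")

# Hypothetical helper: turn one token vector into (first, second) word pairs.
make_bigrams <- function(words) {
  words <- words[words != ""]
  if (length(words) < 2) return(NULL)
  data.frame(first = head(words, -1), second = tail(words, -1),
             stringsAsFactors = FALSE)
}
bigrams <- do.call(rbind, lapply(toy_tokens, make_bigrams))

# Count how often each pair occurs.
bigram_counts <- aggregate(count ~ first + second,
                           data = cbind(bigrams, count = 1), FUN = sum)

# Hypothetical helper: most frequent follower of `word`, or NA if unseen.
predict_next <- function(word, counts) {
  hits <- counts[counts$first == tolower(word), ]
  if (nrow(hits) == 0) return(NA_character_)
  as.character(hits$second[which.max(hits$count)])
}

predict_next("love", bigram_counts)
```

On this toy corpus "love" is always followed by "new", so that is the word returned; a word that never appears as the first element of a pair yields `NA`. Longer n-grams follow the same idea, with the lookup key being the last two or three words instead of one.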

Link to milestone report for more info:

http://rpubs.com/chanduatp/454820

Link to the final app:

https://ceche1212.shinyapps.io/predictnextwordfinalcapstone/