Executive Summary

This is the milestone report for the Data Science Capstone project. It contains the following parts:

  1. Downloading the data.

  2. A basic exploratory analysis of the data.

  3. A more in-depth exploratory study of a sample of the data set, to get a sense of what the data look like.

  4. A description of the next steps.

Load the Data

Download the data file and unzip it

# Link to the location of the text data file
UrlLink <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Create the data directory (if it does not exist)
if (!file.exists("data")) {
        dir.create("data")
}
# Download the zip file (if it does not exist)
if (!file.exists("./data/Coursera-SwiftKey.zip")) {
        download.file(UrlLink, destfile = "./data/Coursera-SwiftKey.zip")
}
# Unzip it (if it has not been unzipped yet)
if (!file.exists("./data/final")) {
        unzip("./data/Coursera-SwiftKey.zip", exdir = "./data")
}

Read the texts from the files

# Read the three English-language corpora line by line
en_US.Blogs <- readLines(con = "./data/final/en_US/en_US.blogs.txt")
en_US.news <- readLines(con = "./data/final/en_US/en_US.news.txt")
en_US.twitter <- readLines(con = "./data/final/en_US/en_US.twitter.txt")

Basic Statistics

Build a table of basic statistics for the three data sets: 1. the number of lines, 2. the number of words, 3. the number of characters.

Line, word, and character statistics:

                        en_US.blogs.txt   en_US.news.txt   en_US.twitter.txt
Lines                            899288            77259             2360148
Total Number of Words          37334131          2643969            30373543
Total Number of Chars         208361438         15683765           162384825
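
The statistics above can be reproduced with a short helper. This is a sketch only; the report does not show the code used to build the table, and the stringi package is assumed for word counting.

library(stringi)

# Line, word, and character counts for one character vector
basic_stats <- function(text) {
        c(Lines = length(text),
          Words = sum(stri_count_words(text)),
          Chars = sum(nchar(text)))
}

# One column per data set
cbind(en_US.blogs.txt   = basic_stats(en_US.Blogs),
      en_US.news.txt    = basic_stats(en_US.news),
      en_US.twitter.txt = basic_stats(en_US.twitter))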

Sample for Exploratory Purposes

The amount of data is significant: running the calculations on the entire corpus would take a long time and consume a large amount of memory. A 10% sample of each data set is therefore used for the exploratory analysis.

set.seed(534)
# Take a 10% sample (without replacement) from each data set
Sample.Text <- c(sample(en_US.Blogs,   round(length(en_US.Blogs)   * 0.10), replace = FALSE),
                 sample(en_US.news,    round(length(en_US.news)    * 0.10), replace = FALSE),
                 sample(en_US.twitter, round(length(en_US.twitter) * 0.10), replace = FALSE))

# Free the full data sets to save memory
rm(en_US.Blogs, en_US.news, en_US.twitter)

Data Cleaning

The cleaning process includes:

  1. Tokenizing into sentences (each line may contain more than one sentence).

  2. Removing sentences that contain non-English words or characters.

  3. Removing punctuation.

  4. Removing Twitter @mentions and #hashtags.

  5. Removing URLs.

  6. Removing stop words (stop words are removed for the exploratory step only).

# Libraries used for the cleaning and tokenization steps
library(dplyr)
library(tidytext)
library(quanteda)

# Convert the Sample.Text vector to a data frame
DF <- tibble(text = Sample.Text)
rm(Sample.Text)

# Split into sentences (each line may contain more than one sentence)
Text.Sentences <- DF %>%
  unnest_tokens(sentence, text, token = "sentences")

# Mark non-English text: sentences that cannot be converted to ASCII become NA,
# and any such sentence is removed entirely
Text.Sentences <- iconv(Text.Sentences$sentence, from = "latin1", to = "ASCII")
Text.Sentences <- Text.Sentences[!is.na(Text.Sentences)]

# Remove punctuation, Twitter handles/hashtags, numbers, hyphens, symbols, and URLs
Text.Sentences <- tokens(
    x = tolower(Text.Sentences),
    remove_punct = TRUE,
    remove_twitter = TRUE,
    remove_numbers = TRUE,
    remove_hyphens = TRUE,
    remove_symbols = TRUE,
    remove_url = TRUE)
# Remove English stop words (for the exploratory step)
Text.Sentences <- tokens_remove(Text.Sentences, stopwords("english"))
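
A quick sanity check of the cleaned tokens can be done with quanteda's ntoken(); the snippet below is a sketch and its output is not part of the report.

# Inspect a couple of cleaned, tokenized sentences and the token-count distribution
head(Text.Sentences, 2)
summary(ntoken(Text.Sentences))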

Create 1-Gram Word Statistics

Create the 1-gram word frequency statistics and plot them using ggplot2.
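
The chunk that builds the document-feature matrix is not shown in the report; the sketch below indicates how it could be done with quanteda and ggplot2. The object name dfm.1gram is my own, and the stem argument of dfm() is assumed to be available (the log output below indicates a quanteda version that stems inside dfm()).

library(ggplot2)

# Build a stemmed document-feature matrix from the cleaned tokens
dfm.1gram <- dfm(Text.Sentences, stem = TRUE, verbose = TRUE)

# Plot the 20 most frequent (stemmed) 1-grams
top.1gram <- data.frame(word = names(topfeatures(dfm.1gram, 20)),
                        freq = topfeatures(dfm.1gram, 20))
ggplot(top.1gram, aes(x = reorder(word, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "1-gram (stemmed)", y = "Frequency")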

## Creating a dfm from a tokens input...
##    ... lowercasing
##    ... found 428,783 documents, 128,404 features
##    ... stemming features (English), trimmed 33914 feature variants
##    ... created a 428,783 x 94,490 sparse dfm
##    ... complete. 
## Elapsed time: 3 seconds.

1-Gram Word Cloud
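
The word cloud itself appears as a figure in the rendered report. A sketch of how it could be generated from the dfm.1gram object assumed above, using quanteda's textplot_wordcloud():

# Word cloud of the most frequent (stemmed) 1-grams
textplot_wordcloud(dfm.1gram, max_words = 100)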

Create 2-Gram Word Statistics

Create the 2-gram frequency statistics and plot them using ggplot2.
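
As for the 1-grams, the code is not shown in the report; below is a sketch using quanteda's tokens_ngrams(), with tokens.2gram and dfm.2gram as assumed object names.

# Form 2-grams from the cleaned tokens and count them in a dfm
tokens.2gram <- tokens_ngrams(Text.Sentences, n = 2)
dfm.2gram <- dfm(tokens.2gram, verbose = TRUE)

# Most frequent 2-grams
topfeatures(dfm.2gram, 20)

The same call with n = 3 produces the 3-grams used further below.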

## Creating a dfm from a tokens input...
##    ... lowercasing
##    ... found 428,783 documents, 1,781,719 features
##    ... created a 428,783 x 1,781,719 sparse dfm
##    ... complete. 
## Elapsed time: 9.77 seconds.

2-Gram Word Cloud

Create 3-Gram Word Statistics

Create the 3-gram frequency statistics and plot them using ggplot2.

## Creating a dfm from a tokens input...
##    ... lowercasing
##    ... found 428,783 documents, 2,205,684 features
##    ... created a 428,783 x 2,205,684 sparse dfm
##    ... complete. 
## Elapsed time: 12.6 seconds.

3-Gram Word Cloud

Next steps …

  1. Build a full 1- to 3-gram model and dictionary to be used by the prediction algorithm.
  2. Check whether higher-order n-grams (4-grams and 5-grams) and a larger share of the data can be used.
  3. Select a model and a smoothing method. Although the sample is large, it will not contain every word, so the model must not assign zero probability to unseen words (see the sketch after this list).
  4. Program the model.
  5. Select an efficient way to store the n-grams in memory or in a file (probably data.table).
  6. Build the Shiny server and UI.
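
For item 3, a minimal sketch of add-one (Laplace) smoothing for unigram probabilities, which guarantees that unseen words never receive probability zero. This is only one candidate smoothing method, and the counts vector below is a made-up example.

# Add-one (Laplace) smoothing: every word, seen or unseen, gets a non-zero probability
laplace_prob <- function(word, counts) {
        V <- length(counts)          # vocabulary size
        N <- sum(counts)             # total number of observed tokens
        n_w <- if (word %in% names(counts)) counts[[word]] else 0
        (n_w + 1) / (N + V)
}

counts <- c(the = 50, of = 30, data = 20)
laplace_prob("data", counts)     # (20 + 1) / (100 + 3)
laplace_prob("zebra", counts)    # unseen word: (0 + 1) / (100 + 3)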