The motivation for this project is to explore the SwiftKey text data and to build a model that predicts the next word a user will type. The first step is to download the data and load it into R:
# specify the source and destination of the download
#destination_file <- "Coursera-SwiftKey.zip"
#source_file <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# execute the download
#download.file(source_file, destination_file)
# extract the files from the zip file
#unzip(destination_file)
# set the working directory to the extracted English data
setwd("~/Rworkingdir/Capstone/Coursera-SwiftKey/final/en_US")
# suppress warnings (e.g. about embedded nuls or incomplete final lines)
options(warn=-1)
# read the three English corpora line by line
blogs <- readLines("en_US.blogs.txt")
news <- readLines("en_US.news.txt")
twitter <- readLines("en_US.twitter.txt")
# file sizes in megabytes (MB)
file.info("en_US.blogs.txt")$size / 1024^2
## [1] 200.4242
file.info("en_US.news.txt")$size / 1024^2
## [1] 196.2775
file.info("en_US.twitter.txt")$size / 1024^2
## [1] 159.3641
# number of lines
length(blogs)
## [1] 899288
length(news)
## [1] 77259
length(twitter)
## [1] 2360148
# number of characters per line
summary( nchar(blogs) )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 47.0 157.0 231.7 331.0 40840.0
summary( nchar(news) )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 111 186 203 270 5760
summary( nchar(twitter) )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 37.0 64.0 68.8 100.0 213.0
# more character-level analysis with stringi
library(stringi)
stats_blogs <- stri_stats_general(blogs)
stats_news <- stri_stats_general(news)
stats_twitter <- stri_stats_general(twitter)
# word counts per line
words_blogs <- stri_count_words(blogs)
words_news <- stri_count_words(news)
words_twitter <- stri_count_words(twitter)
# summaries
summary( words_blogs )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 29.00 42.43 61.00 6726.00
summary( words_news )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.87 46.00 1123.00
summary( words_twitter )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 7.0 12.0 12.8 18.0 60.0
# plots
library(ggplot2)
qplot(words_blogs)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
qplot(words_news)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
qplot(words_twitter)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
The summaries show that the blog file contains the most data per line, while the Twitter file has the shortest lines but by far the largest number of lines. Because the original file sizes are similar yet the Twitter sample has many more individual entries, histograms of the entry lengths were plotted. The Twitter text stands out: the number of words per entry stays very low (never above 60 in this sample) while the number of entries is much higher, which indicates that the same words are repeated many more times in that corpus.
1. Split the lines/paragraphs into sentences: it is important to split the lines/paragraphs into sentences before building the n-grams, otherwise the n-gram builder will create erroneous items that cross sentence boundaries. Using "." as a delimiter is possible, but it would be confused by abbreviations such as "Mr." and "Ms.", so a special-purpose sentence splitter, such as the one from the openNLP project, is used (see the sketches after this list).
2. Normalize the words and write the cleaned data: convert all words to lower case; mark profanity for removal later at the n-gram stage rather than removing it right away; attempt to correct minor spelling errors (e.g. replace characters repeated three or more times with a single occurrence); remove all characters that are not single quotes or alphabetic characters (numbers, punctuation, emoticons); and remove extra whitespace (also sketched after this list).
3. Build the n-gram count data using RWeka's NGramTokenizer: only n-grams of order 1 to 3 are built.
4. Apply smoothing and back-off techniques to build the language model.
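A minimal sketch of step 1, assuming the NLP and openNLP packages (and their Java dependency) are available; split_sentences() is an illustrative helper written for this report, not an openNLP function, and the 100-line slice is only for illustration.
library(NLP)
library(openNLP)
# build a sentence-boundary annotator once
sent_annotator <- Maxent_Sent_Token_Annotator(language = "en")
split_sentences <- function(text) {
  s <- as.String(paste(text, collapse = " "))   # one long String object
  bounds <- NLP::annotate(s, sent_annotator)    # sentence boundary annotations
  as.character(s[bounds])                       # one element per sentence
}
blog_sentences <- split_sentences(blogs[1:100])
head(blog_sentences, 3)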
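And a minimal sketch of steps 2 and 3, assuming the RWeka package is installed; clean_text() is an illustrative helper, the profanity marking is omitted for brevity, and the 1,000-line slice is only for illustration.
library(RWeka)
# normalization following the cleaning rules above (profanity marking omitted)
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("([a-z])\\1{2,}", "\\1", x)   # squeeze characters repeated 3+ times
  x <- gsub("[^a-z' ]", " ", x)           # keep only letters, single quotes, spaces
  x <- gsub("\\s+", " ", x)               # collapse extra whitespace
  trimws(x)
}
cleaned <- clean_text(twitter[1:1000])
# n-gram counts via RWeka's NGramTokenizer (here: bigrams and trigrams)
bigrams  <- NGramTokenizer(cleaned, Weka_control(min = 2, max = 2))
trigrams <- NGramTokenizer(cleaned, Weka_control(min = 3, max = 3))
head(sort(table(bigrams), decreasing = TRUE))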
Natural Language Processing (NLP) is a field of computer science, artificial intelligence and linguistics concerned with the interactions between computers and human (natural) languages.
Based on the first examination of the data, the plan is to:
1. Create test and training data sets for each file with a smaller number of records, because of performance issues (see the sampling sketch below).
2. Perform some cleaning and filtering.
3. Split the lines in the given files into their logical units (sentences), for all three files together.
4. Split the generated sentence-based output into smaller sets before storing them.
5. Perform the frequency calculation for the n-grams.
6. Combine the results for the smaller sets into an overall view.
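A minimal sketch of the sampling step, assuming a 5% sample and an 80/20 train/test split; sample_and_split() and both proportions are illustrative choices, not prescribed by the plan.
set.seed(20140720)   # reproducible sampling
sample_and_split <- function(x, sample_frac = 0.05, train_frac = 0.8) {
  x <- sample(x, size = round(length(x) * sample_frac))  # keep a small random sample
  in_train <- runif(length(x)) < train_frac              # random 80/20 assignment
  list(train = x[in_train], test = x[!in_train])
}
blogs_split   <- sample_and_split(blogs)
news_split    <- sample_and_split(news)
twitter_split <- sample_and_split(twitter)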
The next step is finding the n-grams in the sentences at different levels (2-grams, 3-grams and so on) and defining rules such as "when n-grams of different levels exist, which one should be taken" or, the other way round, "when no n-gram of a certain level exists, which one should be taken instead"; a possible look-up rule is sketched below. The prediction will be based on clustering of the most popular words and/or a probability measure.
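A minimal sketch of such a back-off rule, assuming hypothetical data frames trigram_counts, bigram_counts and unigram_counts with columns prefix, word and count built from the n-gram step; predict_next() prefers the highest-order n-gram whose prefix matches the end of the phrase and backs off to lower orders otherwise.
predict_next <- function(phrase, trigram_counts, bigram_counts, unigram_counts) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  n <- length(words)
  # prefer the highest-order n-gram whose prefix matches the end of the phrase
  if (n >= 2) {
    hits <- trigram_counts[trigram_counts$prefix == paste(words[n - 1], words[n]), ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$count)])
  }
  if (n >= 1) {
    hits <- bigram_counts[bigram_counts$prefix == words[n], ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$count)])
  }
  # no matching n-gram of any order: fall back to the most frequent unigram
  unigram_counts$word[which.max(unigram_counts$count)]
}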
The model should be trained on a sample combining all three files. The last step is a Shiny app that provides an interactive front end for the algorithm. For demonstration purposes, the Shiny app will have a text field that interacts with the app's back end: based on the words typed by the user, the application will look up the most probable next words and update the user interface with possible auto-completions (a minimal sketch follows).
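A minimal sketch of the planned Shiny front end; predict_next() and the n-gram count tables are the hypothetical objects from the back-off sketch above, not an existing API.
library(shiny)
ui <- fluidPage(
  titlePanel("Next word prediction"),
  textInput("phrase", "Type a phrase:"),
  textOutput("suggestion")
)
server <- function(input, output) {
  output$suggestion <- renderText({
    if (nchar(input$phrase) == 0) return("")
    predict_next(input$phrase, trigram_counts, bigram_counts, unigram_counts)
  })
}
shinyApp(ui = ui, server = server)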