The motivation for this project is to explore the SwiftKey text data and to build a model that predicts the next word a user will type. The first step is to download the data and load it into R:
# specify the source and destination of the download
#destination_file <- "Coursera-SwiftKey.zip"
#source_file <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# execute the download
#download.file(source_file, destination_file)
# extract the files from the zip file
#unzip(destination_file)
# set the working directory to the extracted English data
setwd("~/Rworkingdir/Capstone/Coursera-SwiftKey/final/en_US")
# suppress warnings (e.g. about embedded nuls or incomplete final lines)
options(warn=-1)
# read the three English corpora line by line
blogs <- readLines("en_US.blogs.txt")
news <- readLines("en_US.news.txt")
twitter <- readLines("en_US.twitter.txt")
# file sizes in megabytes (MB)
file.info("en_US.blogs.txt")$size / 1024^2
## [1] 200.4242
file.info("en_US.news.txt")$size / 1024^2
## [1] 196.2775
file.info("en_US.twitter.txt")$size / 1024^2
## [1] 159.3641
# number of lines
length(blogs)
## [1] 899288
length(news)
## [1] 77259
length(twitter)
## [1] 2360148
# number of characters per line
summary( nchar(blogs) )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 47.0 157.0 231.7 331.0 40840.0
summary( nchar(news) )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 111 186 203 270 5760
summary( nchar(twitter) )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 37.0 64.0 68.8 100.0 213.0
# more character-level analysis with stringi
library(stringi)
stats_blogs <- stri_stats_general(blogs)
stats_news <- stri_stats_general(news)
stats_twitter <- stri_stats_general(twitter)
# word counts per line
words_blogs <- stri_count_words(blogs)
words_news <- stri_count_words(news)
words_twitter <- stri_count_words(twitter)
# summaries
summary( words_blogs )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 29.00 42.43 61.00 6726.00
summary( words_news )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.87 46.00 1123.00
summary( words_twitter )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 7.0 12.0 12.8 18.0 60.0
# plots
library(ggplot2)
qplot(words_blogs)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
qplot(words_news)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
qplot(words_twitter)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
The summaries show that the blog file contains the most data per line, while the Twitter file has the shortest lines but by far the largest number of lines. Because the original file sizes are similar yet the Twitter sample has many more individual entries, histograms of the entry lengths were plotted. The Twitter text stands out: the number of words per entry stays very low (never above 60 in this sample) while the number of entries is much higher, which indicates that the same words are repeated many more times in that corpus.
1. Split the lines/paragraphs into sentences: it is important to split the lines/paragraphs into sentences before building the n-grams, otherwise the n-gram builder will create erroneous items that cross sentence boundaries. Using "." as a delimiter is possible, but it would be confused by abbreviations such as "Mr." and "Ms.", so a special-purpose sentence splitter, such as the one from the openNLP project, is used (see the sketches after this list).
2. Normalize the words and write the cleaned data: convert all words to lower case; mark profanity for removal later at the n-gram stage rather than removing it right away; attempt to correct minor spelling errors (e.g. replace characters repeated three or more times with a single occurrence); remove all characters that are not single quotes or alphabetic characters (numbers, punctuation, emoticons); and remove extra whitespace (also sketched after this list).
3. Build the n-gram count data using RWeka's NGramTokenizer: only n-grams of order 1 to 3 are built.
4. Apply smoothing and back-off techniques to build the language model.
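A minimal sketch of step 1, assuming the NLP and openNLP packages (and their Java dependency) are available; split_sentences() is an illustrative helper written for this report, not an openNLP function, and the 100-line slice is only for illustration.
library(NLP)
library(openNLP)
# build a sentence-boundary annotator once
sent_annotator <- Maxent_Sent_Token_Annotator(language = "en")
split_sentences <- function(text) {
  s <- as.String(paste(text, collapse = " "))   # one long String object
  bounds <- NLP::annotate(s, sent_annotator)    # sentence boundary annotations
  as.character(s[bounds])                       # one element per sentence
}
blog_sentences <- split_sentences(blogs[1:100])
head(blog_sentences, 3)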
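And a minimal sketch of steps 2 and 3, assuming the RWeka package is installed; clean_text() is an illustrative helper, the profanity marking is omitted for brevity, and the 1,000-line slice is only for illustration.
library(RWeka)
# normalization following the cleaning rules above (profanity marking omitted)
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("([a-z])\\1{2,}", "\\1", x)   # squeeze characters repeated 3+ times
  x <- gsub("[^a-z' ]", " ", x)           # keep only letters, single quotes, spaces
  x <- gsub("\\s+", " ", x)               # collapse extra whitespace
  trimws(x)
}
cleaned <- clean_text(twitter[1:1000])
# n-gram counts via RWeka's NGramTokenizer (here: bigrams and trigrams)
bigrams  <- NGramTokenizer(cleaned, Weka_control(min = 2, max = 2))
trigrams <- NGramTokenizer(cleaned, Weka_control(min = 3, max = 3))
head(sort(table(bigrams), decreasing = TRUE))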
Natural Language Processing (NLP) is a field of computer science, artificial intelligence and linguistics concerned with the interactions between computers and human (natural) languages.
Based on the first examination of the data, the plan is to:
1. Create test and training data sets for each file with a smaller number of records, because of performance issues (see the sampling sketch below).
2. Perform some cleaning and filtering.
3. Split the lines in the given files into their logical units (sentences), for all three files together.
4. Split the generated sentence-based output into smaller sets before storing them.
5. Perform the frequency calculation for the n-grams.
6. Combine the results for the smaller sets into an overall view.
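A minimal sketch of the sampling step, assuming a 5% sample and an 80/20 train/test split; sample_and_split() and both proportions are illustrative choices, not prescribed by the plan.
set.seed(20140720)   # reproducible sampling
sample_and_split <- function(x, sample_frac = 0.05, train_frac = 0.8) {
  x <- sample(x, size = round(length(x) * sample_frac))  # keep a small random sample
  in_train <- runif(length(x)) < train_frac              # random 80/20 assignment
  list(train = x[in_train], test = x[!in_train])
}
blogs_split   <- sample_and_split(blogs)
news_split    <- sample_and_split(news)
twitter_split <- sample_and_split(twitter)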
The next step is finding the n-grams in the sentences at different levels (2-grams, 3-grams and so on) and defining rules such as "when n-grams of different levels exist, which one should be taken" or, the other way round, "when no n-gram of a certain level exists, which one should be taken instead"; a possible look-up rule is sketched below. The prediction will be based on clustering of the most popular words and/or a probability measure.
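A minimal sketch of such a back-off rule, assuming hypothetical data frames trigram_counts, bigram_counts and unigram_counts with columns prefix, word and count built from the n-gram step; predict_next() prefers the highest-order n-gram whose prefix matches the end of the phrase and backs off to lower orders otherwise.
predict_next <- function(phrase, trigram_counts, bigram_counts, unigram_counts) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  n <- length(words)
  # prefer the highest-order n-gram whose prefix matches the end of the phrase
  if (n >= 2) {
    hits <- trigram_counts[trigram_counts$prefix == paste(words[n - 1], words[n]), ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$count)])
  }
  if (n >= 1) {
    hits <- bigram_counts[bigram_counts$prefix == words[n], ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$count)])
  }
  # no matching n-gram of any order: fall back to the most frequent unigram
  unigram_counts$word[which.max(unigram_counts$count)]
}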
The model should be trained on a sample combining all three files. The last step is a Shiny app that provides an interactive front end for the algorithm. For demonstration purposes, the Shiny app will have a text field that interacts with the app's back end: based on the words typed by the user, the application will look up the most probable next words and update the user interface with possible auto-completions (a minimal sketch follows).
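A minimal sketch of the planned Shiny front end; predict_next() and the n-gram count tables are the hypothetical objects from the back-off sketch above, not an existing API.
library(shiny)
ui <- fluidPage(
  titlePanel("Next word prediction"),
  textInput("phrase", "Type a phrase:"),
  textOutput("suggestion")
)
server <- function(input, output) {
  output$suggestion <- renderText({
    if (nchar(input$phrase) == 0) return("")
    predict_next(input$phrase, trigram_counts, bigram_counts, unigram_counts)
  })
}
shinyApp(ui = ui, server = server)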