In this report I present an initial exploratory analysis of the capstone dataset.
I will then submit this report to Coursera and receive feedback on my plans for creating a prediction algorithm and Shiny app for the capstone.
The main objective of this course is to apply data science in the area of natural language processing.
The final result of the course will be a Shiny application that accepts text entered by the user and tries to predict what the next word will be.
I would like to thank Coursera and the instructors of this course, who have worked hard to provide, not only in English but in other languages too, a set of files containing text extracted from blogs, news/media sites and Twitter. These files are to be used as input for developing a prediction algorithm that meets the project's objectives.
In the following sections I will analyze a subset of the data provided.
The training dataset, provided by the course instructors, can be downloaded from the following link:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Loading it in R:
# Set working directory
setwd("E:/DataScienceCapStone/")
# Read the datasets (English version only for this project) into variables
DataNews    <- readLines("./data/en_US/en_US.news.txt")
DataBlogs   <- readLines("./data/en_US/en_US.blogs.txt")
DataTwitter <- readLines("./data/en_US/en_US.twitter.txt")
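A note on reading the files: on some platforms readLines() can warn about, or stop at, embedded control characters in these files. If that happens, a commonly used workaround (optional, and not part of the course instructions) is to open the file through a binary-mode connection and skip nul characters:
# Optional: re-read the news file via a binary-mode connection, skipping nul
# characters, if readLines() truncates the file or warns about embedded nuls
con <- file("./data/en_US/en_US.news.txt", open = "rb")
DataNews <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)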
The next task is to explore and prepare the data. Since the data are plain text, as part of this report I will compute the following basic statistics for each dataset (a minimal sketch of the code used appears after this list):
Checking the length (number of lines) of each dataset.
Checking the size of each dataset (in MB).
Checking the maximum number of characters in a single line (the length of the longest line) in each dataset.
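The exact code chunk is not shown here, so the following is a minimal sketch of how these numbers could be obtained, illustrated for the news dataset only; it assumes the sizes reported below are the sizes of the files on disk:
# Number of lines in the dataset
print(paste("Length of the Dataset (News) is =", length(DataNews), "lines"))
# Size of the underlying file, in MB
print(paste("Size of the Dataset (News) is =",
            file.info("./data/en_US/en_US.news.txt")$size / 1024^2, "MB"))
# Length of the longest line, in characters
print(paste("Max characters (length of the longest line seen) in a record in the Dataset (News) is =",
            max(nchar(DataNews))))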
## [1] "Length of the Dataset (News) is = 77259 lines"
## [1] "Length of the Dataset (Blogs) is = 899288 lines"
## [1] "Length of the Dataset (Twitter) is = 2360148 lines"
## [1] "Size of the Dataset (News) is = 196.277512550354 MB"
## [1] "Size of the Dataset (Blogs) is = 200.424207687378 MB"
## [1] "Size of the Dataset (Twitter) is = 159.364068984985 MB"
## [1] "Max characters (length of the longest line seen) in a record in the Dataset (News) is = 5760"
## [1] "Max characters (length of the longest line seen) in a record in the Dataset (Blogs) is = 40835"
## [1] "Max characters (length of the longest line seen) in a record in the Dataset (Twitter) is = 213"
What I noticed in the project’s datasets:
Each of the files is large (as the results above show) and processing them takes a long time, so the available memory needs to be used carefully.
Based on the basic descriptive statistics above, I would rather join the three datasets into one large dataset, without losing the characteristics of any individual file.
With a vocabulary of only about 1% of the unique words, we can cover roughly 91% of all word occurrences in the text.
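This figure is an approximation; as a minimal sketch of how such a coverage number can be checked (using only a small sample of lines for speed and a very simple letter-only tokenization), one could do the following:
# Rough coverage check: what fraction of all word occurrences is covered
# by the most frequent 1% of the vocabulary? (small sample for speed)
sampleText <- tolower(c(DataNews[1:5000], DataBlogs[1:5000], DataTwitter[1:5000]))
words <- unlist(strsplit(sampleText, "[^a-z']+"))
words <- words[words != ""]
freq  <- sort(table(words), decreasing = TRUE)
topN  <- ceiling(0.01 * length(freq))
sum(freq[1:topN]) / sum(freq)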
Text from the three data files is merged and a text corpus is built using R’s “tm” package.
To avoid memory-related issues while preparing this milestone report, and for faster processing, I will use only the first 5000 lines of each dataset in the following steps.
# Load "tm" library
library(tm)
## Loading required package: NLP
# Load 5000 lines from every dataset into one big corpus
mergedfiles <- paste(DataNews[1:5000], DataBlogs[1:5000], DataTwitter[1:5000])
mycorpus <- VCorpus(VectorSource(mergedfiles))
# Remove white spaces from corpus
mycorpus <- tm_map(mycorpus, stripWhitespace)
# Remove digits and numbers from corpus
mycorpus <- tm_map(mycorpus, removeNumbers)
# Remove punctuation from corpus
mycorpus <- tm_map(mycorpus, removePunctuation)
# Remove stopwords from corpus
mycorpus <- tm_map(mycorpus, removeWords, stopwords("english"))
# Stem words in corpus
mycorpus <- tm_map(mycorpus, stemDocument)
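One step not applied above is case folding. Assuming we want “The” and “the” counted as the same token at this exploratory stage (capitalization handling is revisited in the plans at the end of this report), tm's content_transformer can wrap the base tolower function:
# Optionally convert the corpus to lower case so that differently
# capitalized forms of a word are counted together
mycorpus <- tm_map(mycorpus, content_transformer(tolower))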
In the following steps I will build n-grams (2-, 3- and 4-word sequences, together with their frequencies).
For this purpose I will use R’s “RWeka” package.
# Load "RWeka" library
library(RWeka)
# Flatten the corpus into a data frame of plain text
corpusDf <- data.frame(text = unlist(sapply(mycorpus, `[`, "content")),
                       stringsAsFactors = FALSE)
findNGrams <- function(corp, grams) {
  ngram <- NGramTokenizer(corp, Weka_control(min = grams, max = grams,
                                             delimiters = " \\r\\n\\t.,;:\"()?!"))
  ngram2 <- data.frame(table(ngram))
  # Keep only the 100 most frequent n-grams
  ngram3 <- ngram2[order(ngram2$Freq, decreasing = TRUE), ][1:100, ]
  colnames(ngram3) <- c("String", "Count")
  ngram3
}
TwoGrams <- findNGrams(corpusDf, 2)
ThreeGrams <- findNGrams(corpusDf, 3)
FourGrams <- findNGrams(corpusDf, 4)
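As a quick sanity check (not part of the original output), the top of each frequency table can be inspected before plotting:
# Most frequent 2-grams in the sampled corpus
head(TwoGrams, 10)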
In the next step I will create word clouds of the 2-, 3- and 4-grams built in the previous section.
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)
par(mfrow = c(1, 3))
palette <- brewer.pal(8, "Dark2")
wordcloud(TwoGrams[, 1], TwoGrams[, 2], min.freq = 1,
random.order = F, ordered.colors = F, colors = palette)
text(x = 0.5, y = 0, "2-Gram cloud")
wordcloud(ThreeGrams[, 1], ThreeGrams[, 2], min.freq = 1,
random.order = F, ordered.colors = F, colors = palette)
text(x = 0.5, y = 0, "3-Gram cloud")
wordcloud(FourGrams[, 1], FourGrams[,2], min.freq = 1,
random.order = F, ordered.colors = F, colors = palette)
text(x = 0.5, y = 0, "4-Gram cloud")
par(mfrow = c(1, 1))
barplot(TwoGrams[1:20, 2],
cex.names = 0.6, names.arg = TwoGrams[1:20, 1], col = "blue",
main = "Histogram: 2-Grams", las = 2)
barplot(ThreeGrams[1:20, 2],
cex.names = 0.6, names.arg = ThreeGrams[1:20, 1], col = "yellow",
main = "Histogram: 3-Grams", las = 2)
barplot(FourGrams[1:20, 2],
cex.names = 0.6, names.arg = FourGrams[1:20, 1], col = "pink",
main = "Histogram: 4-Grams", las = 2)
My plans for the final project are as follows:
I am going to consider only the English datasets for my project; I cannot understand the other languages, so I will not work on them.
I will use the full corpus built from all the datasets in the project in order to generate better 3-, 4- and 5-grams.
Use better plotting methods and visualizations.
Drop words and n-grams that occur with low frequency.
Do not try to predict numbers, but do try to predict words that may follow a number in general.
Deal with spelling errors. Consider using a dictionary to check spelling, and add terms that appear with some frequency to the dictionary even if they are not “official” words from a standard dictionary; people use these words in practice, so the algorithm should predict them too. Only drop words that are not in the dictionary and have a low frequency in the corpus.
Add rules to clean up emoticons from the corpus.
Train a prediction algorithm based on n-grams (e.g. either using a package with Markov chain functionality or coding my own ad-hoc search functions).
Calculate probabilities, applying add-one smoothing and backing off to lower-order n-grams when a longer match is not found (see the sketch after this list).
Do not remove any words (including stop words or even profanity) from the training data, in order to avoid creating “holes” that could lead to incorrect n-grams and incorrect predictions.
Use a standard list of profanity terms (available for download) to filter profanity before showing predictions to the user (but keep these terms in the database).
Measure memory usage and run time, and look for ways to improve speed and prediction accuracy.
Add capitalized-word rules to the algorithm.
Build a separate model for capitalization (frequency of all caps vs. first-letter capitalization when not at the beginning of a sentence vs. lowercase).
Finally, present the working model through a Shiny web application.
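As a minimal sketch of the kind of lookup described above, and not the final algorithm, the function below works against the TwoGrams and ThreeGrams tables built in this report, applies add-one smoothing to the counts and backs off from 3-grams to 2-grams when no match is found. The name predictNextWord is a placeholder of my own, not something defined by the course.
# Minimal next-word lookup with add-one smoothing and simple back-off.
# Assumes the TwoGrams and ThreeGrams tables (columns String, Count) built above.
predictNextWord <- function(phrase, twoGrams = TwoGrams, threeGrams = ThreeGrams) {
  words <- tolower(strsplit(trimws(phrase), "\\s+")[[1]])
  n <- length(words)

  scoreHits <- function(tab, prefix) {
    hits <- tab[startsWith(tolower(as.character(tab$String)), paste0(prefix, " ")), ]
    hits <- hits[!is.na(hits$Count), ]
    if (nrow(hits) == 0) return(NULL)
    hits$Prob <- (hits$Count + 1) / sum(hits$Count + 1)   # add-one smoothing
    hits[order(hits$Prob, decreasing = TRUE), ]
  }

  # Try 3-grams first (match on the last two words), then back off to 2-grams
  result <- NULL
  if (n >= 2) result <- scoreHits(threeGrams, paste(words[n - 1], words[n]))
  if (is.null(result) && n >= 1) result <- scoreHits(twoGrams, words[n])
  if (is.null(result)) return(NA)

  # Return the predicted next word: the last token of the best-scoring n-gram
  best <- as.character(result$String[1])
  tail(strsplit(best, " ")[[1]], 1)
}

# Example call (the result depends on the sampled, preprocessed corpus):
# predictNextWord("happy new")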