The Capstone project for the Coursera Data Science course consists of building a predictive text model. The model should predict the next word someone wants to type (in English) on a mobile device, which puts constraints on the memory and CPU use of the prediction model.
For this assignment Johns Hopkins works together with SwiftKey (a company that develops predictive text analytics). Some large unstructured text files have been provided. The first task at hand is to get the data, do some exploratory analysis and clean the data.
Suggested tasks are:
1. Loading the data.
2. Sampling the data.
3. Tokenization - identifying appropriate tokens such as words, punctuation, and numbers.
4. Profanity filtering - removing profanity and other words you do not want to predict (a possible approach is sketched below).
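The profanity-filtering step is not shown elsewhere in this report, so here is a minimal sketch of one possible approach. It assumes a plain-text word list badwords.txt with one unwanted word per line; both the file name and the Filter_Profanity() helper are illustrative only.
# Sketch of profanity filtering, assuming a local word list "badwords.txt"
# with one unwanted word per line (file name and helper are illustrative).
BadWords <- readLines("badwords.txt", warn = FALSE)
Filter_Profanity <- function(lines, badwords) {
# build one regular expression that matches any of the unwanted words
pattern <- paste0("\\b(", paste(badwords, collapse = "|"), ")\\b")
# drop every line that contains at least one unwanted word
lines[!grepl(pattern, lines, ignore.case = TRUE)]
}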
First I will set some basic options, then download and unzip the files.
## Initialize settings for analyses
# loading libraries (https://github.com/lgreski/datasciencectacontent/blob/master/markdown/capstone-simplifiedApproach.md)
libs <- c("data.table", "quanteda", "sqldf", "stringi", "stringr", "readtext", "knitr", "ggplot2")
lapply(libs, require, character.only = TRUE)
# keep environment clean --> remove libs
rm(libs)
# Set options
options(stringsAsFactors = FALSE)
# set working directory
setwd(
"H:/Users/Leo/Documents/Studie/Data Science - Johns Hopkins University - 2016-2017/10 Capstone Project"
)
# set path to data
path <-
"H:/Users/Leo/Documents/Studie/Data Science - Johns Hopkins University - 2016-2017/10 Capstone Project/data/final/en_US"
## Downloading the original data
## Create directory if it does not yet exist
if (!file.exists("./data")) {
dir.create("./data")
}
FileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
## if file is not already downloaded, download the file
if (!file.exists("./data/Swiftkey Dataset.zip")) {
download.file(FileUrl, destfile = "./data/Swiftkey Dataset.zip")
}
## Unzip file to the data directory if the directory does not yet exist
if (!file.exists("./data/final")) {
unzip("./data/Swiftkey Dataset.zip", exdir = "./data")
}
# keep environment clean --> remove FileUrl
rm(FileUrl)
Now that we have downloaded the data, let’s take a look at these files.
##      file FileSize_MB LineCount LongestLine_chars WordCount AvgWordsPerLine
## 1   Blogs      200.42    899288             40833  37546246           41.75
## 2    News      196.28   1010242             11384  34762395           34.41
## 3 Twitter      159.36   2360148               140  30093410           12.75
As we can see it is a large dataset, containing more than 4 million lines and over 100 million words in total. The file containing blog data seems to be the most complicated: it has the longest line, the highest word count and the highest average number of words per line. As expected, the Twitter data has shorter lines.
As might be expected, the summary shows that the three files contain very different kinds of texts.
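The summary table above was generated with code that is not shown in this report. Below is a minimal sketch of how such statistics could be computed, assuming the three files were read into the character vectors Blogs, News and Twitter with readLines(); the helper Summarise_File() and the example file name are illustrative only.
# Sketch: summary statistics for one file, assuming its lines are already in memory
Summarise_File <- function(name, lines, file) {
words <- stri_count_words(lines)
data.frame(
file              = name,
FileSize_MB       = round(file.size(file) / 1024 ^ 2, 2),
LineCount         = length(lines),
LongestLine_chars = max(nchar(lines)),
WordCount         = sum(words),
AvgWordsPerLine   = round(sum(words) / length(lines), 2)
)
}
# e.g. Summarise_File("Blogs", Blogs, paste0(path, "/en_US.blogs.txt"))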
The files are pretty big. This will cause issues with CPU and memory usage. For the exploratory data analysis I will use a sample of 10,000 lines from each file. Using the same sample size for each file ensures that the different kinds of text are better represented in the sample than sampling a fixed proportion of each file would.
## standard settings:
set.seed(1903)
SampleSize <- 10000
### draw a random sample of SampleSize lines (without replacement) from each file
BlogSample <- Blogs[sample(length(Blogs), SampleSize)]
NewsSample <- News[sample(length(News), SampleSize)]
TwitterSample <- Twitter[sample(length(Twitter), SampleSize)]
TextSample <- c(BlogSample, NewsSample, TwitterSample)
# keep environment clean --> remove objects from environment
rm(Blogs, News, Twitter, SampleSize)
The following steps are the first ones taken to clean the text:
# Functions from: http://www.mjdenny.com/Text_Processing_In_R.html
#' function to clean a string
Clean_String <- function(string) {
# Lowercase
temp <- tolower(string)
#' Remove everything that is not a letter or whitespace (you may want to keep
#' more, such as numbers, in your actual analyses).
temp <- stringr::str_replace_all(temp, "[^a-zA-Z\\s]", " ")
# Shrink down to just one white space
temp <- stringr::str_replace_all(temp, "[\\s]+", " ")
# Split it
temp <- stringr::str_split(temp, " ")[[1]]
# Get rid of empty strings left over after splitting, if necessary
indexes <- which(temp == "")
if (length(indexes) > 0) {
temp <- temp[-indexes]
}
return(temp)
}
#' function to clean text
Clean_Text_Block <- function(text) {
if (length(text) <= 1) {
# Check to see if there is any text at all with another conditional
if (length(text) == 0) {
cat("There was no text in this document! \n")
to_return <-
list(
num_tokens = 0,
unique_tokens = 0,
text = ""
)
} else {
# If there is one and only one line of text, then tokenize it
clean_text <- Clean_String(text)
num_tok <- length(clean_text)
num_uniq <- length(unique(clean_text))
to_return <-
list(
num_tokens = num_tok,
unique_tokens = num_uniq,
text = clean_text
)
}
} else{
# Get rid of blank lines
indexes <- which(text == "")
if (length(indexes) > 0) {
text <- text[-indexes]
}
# Clean the first line, then loop through the remaining lines and use
# append() to add them to a single vector
clean_text <- Clean_String(text[1])
for (i in 2:length(text)) {
clean_text <- append(clean_text, Clean_String(text[i]))
}
# Calculate the number of tokens and unique tokens and return them in a
# named list object.
num_tok <- length(clean_text)
num_uniq <- length(unique(clean_text))
to_return <-
list(
num_tokens = num_tok,
unique_tokens = num_uniq,
text = clean_text
)
}
return(to_return)
}
# create sample files for the different sources and write them to disk
BlogSampleClean <- Clean_Text_Block(BlogSample)$text
write(BlogSampleClean, file = paste0(path, "/BlogSampleClean.txt"))
NewsSampleClean <- Clean_Text_Block(NewsSample)$text
write(NewsSampleClean, file = paste0(path, "/NewsSampleClean.txt"))
TwitterSampleClean <- Clean_Text_Block(TwitterSample)$text
write(TwitterSampleClean, file = paste0(path, "/TwitterSampleClean.txt"))
TextSampleClean <-
c(BlogSampleClean, NewsSampleClean, TwitterSampleClean)
# TextSampleClean <- Clean_Text_Block(TextSample)$text
# keep environment clean --> remove objects from environment
rm(
TextSample,
Clean_String,
Clean_Text_Block,
BlogSample,
NewsSample,
TwitterSample,
BlogSampleClean,
NewsSampleClean,
TwitterSampleClean
)
Let’s look at the first and last 50 words:
head(TextSampleClean, 50)
## [1] "perfect" "a" "strange" "thing" "happened" "to"
## [7] "me" "in" "between" "the" "first" "season"
## [13] "of" "laid" "and" "this" "second" "season"
## [19] "premiere" "i" "grew" "fonder" "of" "the"
## [25] "show" "this" "doesn" "t" "mean" "my"
## [31] "opinion" "of" "the" "first" "season" "changed"
## [37] "i" "still" "think" "of" "it" "as"
## [43] "a" "drippy" "one" "note" "trying" "too"
## [49] "hard" "dramedy"
tail(TextSampleClean, 50)
## [1] "the" "kohl" "center" "on"
## [5] "friday" "men" "s" "basketball"
## [9] "plays" "at" "pm" "and"
## [13] "the" "women" "play" "at"
## [17] "pm" "thats" "because" "your"
## [21] "martain" "brains" "of" "this"
## [25] "operation" "i" "remember" "seeing"
## [29] "that" "some" "where" "other"
## [33] "then" "your" "bio" "spring"
## [37] "curb" "creative" "leadership" "lecture"
## [41] "features" "siva" "vaidhaynathan" "patriots"
## [45] "broncos" "close" "until" "the"
## [49] "th" "quarter"
Not all words seem to be English words, and for instance “doesn’t” is split into the two words “doesn” and “t”. Also, some words occur multiple times while others do not. For predicting the next word it might not be necessary to have a perfectly cleaned data set, since we will be looking at frequencies of words that are used together.
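One possible refinement, not applied in this report, would be to strip apostrophes before removing the remaining non-letter characters, so that a contraction like “doesn’t” collapses into the single token “doesnt” instead of being split in two. A sketch of such a variant of Clean_String():
# Variant of Clean_String() that drops apostrophes first, so contractions such
# as "doesn't" become a single token ("doesnt") instead of two ("doesn", "t").
Clean_String_Keep_Contractions <- function(string) {
temp <- tolower(string)
temp <- stringr::str_replace_all(temp, "['\u2019]", "")   # remove apostrophes
temp <- stringr::str_replace_all(temp, "[^a-z\\s]", " ")  # keep letters only
temp <- stringr::str_replace_all(temp, "[\\s]+", " ")     # collapse whitespace
temp <- stringr::str_split(temp, " ")[[1]]
temp[temp != ""]                                          # drop empty strings
}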
First let’s look at some tools we can use, like a corpus, n-grams and a document-term matrix.
Wikipedia defines a corpus as: “In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). They are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.” (https://en.wikipedia.org/wiki/Text_corpus)
This is what we need for our analyses.
Definition of an n-gram (according to wikipedia): “an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus.” (https://en.wikipedia.org/wiki/N-gram)
N-grams are typically used for predicting the next item in a sequence in the form of an (n-1)-order Markov model (https://en.wikipedia.org/wiki/N-gram#Applications).
This is exactly what we are trying to achieve in this project.
A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. (https://en.wikipedia.org/wiki/Document-term_matrix)
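To make these three concepts concrete, here is a small illustration with quanteda; the two example sentences are made up for this purpose only.
# Small illustration of a corpus, n-grams and a document-feature (term) matrix
ExampleCorpus <- corpus(c(doc1 = "the cat sat on the mat",
doc2 = "the dog sat on the rug"))
ExampleTokens <- tokens(ExampleCorpus)
# bigrams: contiguous pairs of words, e.g. "the_cat", "cat_sat", ...
tokens_ngrams(ExampleTokens, n = 2L)
# document-feature matrix: how often each word occurs in each document
dfm(ExampleTokens)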
Let’s take a look at the 50 words / n-grams with the highest frequency.
# create unigrams; tokens_ngrams() needs a tokens object, so tokenize the sample first
Toks <- tokens(paste(TextSampleClean, collapse = " "))
NGram1 <- tokens_ngrams(Toks, 1L)
# create a document-term matrix
DocTermMtrx1 <- dfm(NGram1)
# top 50 unigrams
Top50_1 <- topfeatures(DocTermMtrx1, 50)
# Create a data.frame for ggplot
Top50_1_Df <- data.frame(
list(
term = names(Top50_1),
frequency = unname(Top50_1)
)
)
# Sort by reverse frequency order
Top50_1_Df$term <- with(Top50_1_Df, reorder(term, -frequency))
ggplot(Top50_1_Df) + geom_point(aes(x=term, y=frequency)) +
theme(axis.text.x=element_text(angle=90, hjust=1)) +
ggtitle("Top 50 words")
The top 50 words contain a lot of “stopwords”. There are even some tokens that have no meaning on their own, such as “s” and “t”, left over from splitting contractions.
Proportion of words that only occur once: 0.11.
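This proportion was computed with code not shown here; a minimal sketch of the calculation, using the document-term matrix built above (colSums() on a dfm returns the total count of each feature):
# proportion of unigrams that occur exactly once in the sample
UnigramFreq <- colSums(DocTermMtrx1)
round(sum(UnigramFreq == 1) / length(UnigramFreq), 2)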
Now let’s take a look at combinations of words.
The frequencies of bigrams are a lot lower than those of single words. Still, there are some issues that might need to be handled with additional text cleaning.
Proportion of bigrams that only occur once: 0.33.
Proportion of trigrams that only occur once: 0.43.
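The bigram and trigram statistics were computed with code not shown in this report; a minimal sketch, assuming the tokens object Toks created for the unigrams is reused:
# bigrams and trigrams from the same tokens object used for the unigrams
NGram2 <- tokens_ngrams(Toks, 2L)
NGram3 <- tokens_ngrams(Toks, 3L)
DocTermMtrx2 <- dfm(NGram2)
DocTermMtrx3 <- dfm(NGram3)
# proportion of bigrams / trigrams that occur only once
round(mean(colSums(DocTermMtrx2) == 1), 2)
round(mean(colSums(DocTermMtrx3) == 1), 2)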
Combining more words reduces the frequency with which they appear. Also, the proportion of combinations occurring just once increases with the number of words combined in the n-gram.
Let’s also look at the distribution of the n-grams.
The plot shows that a few words account for most of the word occurrences in the text files, and if we switch to bigrams or trigrams we need a bigger base of data to reach the same proportion. But even with bigrams and trigrams we cover approximately 90% of the occurrences with only 50% of the unique bigrams or trigrams. This suggests we do not need every unique instance of an n-gram to get a good prediction.
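The coverage figures can be derived from the sorted n-gram frequencies; a minimal sketch of the calculation for the trigrams (the same approach applies to the unigrams and bigrams), reusing the DocTermMtrx3 object sketched above:
# cumulative coverage: share of all trigram occurrences covered by the most
# frequent unique trigrams
TrigramFreq <- sort(colSums(DocTermMtrx3), decreasing = TRUE)
Coverage <- cumsum(TrigramFreq) / sum(TrigramFreq)
# coverage reached with the top 50% of unique trigrams
Coverage[ceiling(0.5 * length(Coverage))]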
The following steps need to be taken before the final product can be presented: