The Coursera Data Science Specialization intends to provide the basic skills involved in being a data scientist, and the goal of this capstone is to mimic that experience. The project consists of developing a predictive model of text (Predictive Text Analytics) using a dataset provided by the SwiftKey company. The main steps of this exercise are downloading the dataset, understanding it, cleaning it, and performing some basic analysis.
The dataset can be downloaded and uncompressed from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. It consists of 4 folders corresponding to 4 different languages (German, English, Finnish, and Russian), each containing 3 files from 3 different text sources: blogs, news, and Twitter. Our analysis starts with a summary table including, for each file in the bundle, file stats (size in bytes) and data derived from executing the wc command (i.e. line and word counts, and the words-per-line ratio):
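For reference, a minimal sketch of the download step is shown below; it assumes the language folders end up under the 'HC' directory that the rest of the code refers to (the local file name and directory layout are assumptions, and the unpacked top-level folder may need to be renamed to 'HC'):
datasetUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipFile <- "Coursera-SwiftKey.zip"  # local file name chosen for illustration
if (!dir.exists("HC")) {
  download.file(datasetUrl, destfile = zipFile, mode = "wb")
  unzip(zipFile)  # unpack the archive; rename/move the resulting folder to 'HC' if needed
}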
# Get the file stat list from each file
listOfFiles <- dir("HC", recursive = TRUE, full.names = TRUE)
listOfFileInfos <- data.frame(file = listOfFiles, size = file.info(listOfFiles)$size)
listOfFileInfos$sizeInMB <- round(listOfFileInfos$size / (1024 * 1024), digits = 2)
# Create three new columns to be filled in with data from the 'wc' command execution
listOfFileInfos$lineCount <- 0
listOfFileInfos$wordCount <- 0
listOfFileInfos$wordsPerLineRatio <- 0
# Add a column indicating the file language
listOfFileInfos <- listOfFileInfos %>%
  rowwise() %>%
  mutate(language =
           ifelse(str_detect(file, "en_US"), 'English',
             ifelse(str_detect(file, "de_DE"), 'German',
               ifelse(str_detect(file, "fi_FI"), 'Finnish',
                 ifelse(str_detect(file, "ru_RU"), 'Russian', 'not-defined')))))
# Auxiliary function: extracts the line and word counts from a file using the 'wc' command
executeWc <- function(x) {
  wcOutput <- system(paste0("wc ", x), intern = TRUE)
  as.numeric(str_split(wcOutput, boundary("word"))[[1]][1:2])
}
# Complete the file stats with the 'wc' command data
for (index in 1:nrow(listOfFileInfos)) {
  wcCommandResults <- executeWc(listOfFileInfos[index,]$file)
  listOfFileInfos[index,]$lineCount <- wcCommandResults[1]
  listOfFileInfos[index,]$wordCount <- wcCommandResults[2]
  listOfFileInfos[index,]$wordsPerLineRatio <- round(wcCommandResults[2] / wcCommandResults[1], digits = 2)
}
columnNamesToShow <- c('File', 'Size (bytes)', 'Size in MB', 'Line count', 'Word count', 'W/L ratio', 'Language')
# Show a formatted table
kable(listOfFileInfos, col.names = columnNamesToShow) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"),
full_width = FALSE)
| File | Size (bytes) | Size in MB | Line count | Word count | W/L ratio | Language |
|---|---|---|---|---|---|---|
| HC/de_DE/de_DE.blogs.txt | 85459666 | 81.50 | 371440 | 12653185 | 34.07 | German |
| HC/de_DE/de_DE.news.txt | 95591959 | 91.16 | 244743 | 13219388 | 54.01 | German |
| HC/de_DE/de_DE.twitter.txt | 75578341 | 72.08 | 947774 | 11803735 | 12.45 | German |
| HC/en_US/en_US.blogs.txt | 210160014 | 200.42 | 899288 | 37334690 | 41.52 | English |
| HC/en_US/en_US.news.txt | 205811889 | 196.28 | 1010242 | 34372720 | 34.02 | English |
| HC/en_US/en_US.twitter.txt | 167105338 | 159.36 | 2360148 | 30374206 | 12.87 | English |
| HC/fi_FI/fi_FI.blogs.txt | 108503595 | 103.48 | 439785 | 12732013 | 28.95 | Finnish |
| HC/fi_FI/fi_FI.news.txt | 94234350 | 89.87 | 485758 | 10446725 | 21.51 | Finnish |
| HC/fi_FI/fi_FI.twitter.txt | 25331142 | 24.16 | 285214 | 3153003 | 11.05 | Finnish |
| HC/ru_RU/ru_RU.blogs.txt | 116855835 | 111.44 | 337100 | 9691167 | 28.75 | Russian |
| HC/ru_RU/ru_RU.news.txt | 118996424 | 113.48 | 196360 | 9416099 | 47.95 | Russian |
| HC/ru_RU/ru_RU.twitter.txt | 105182346 | 100.31 | 881414 | 9542485 | 10.83 | Russian |
Next, a few sample lines from the Twitter files are shown:
- de_DE.twitter.txt:
connnectionBlogsFile <- file("HC/de_DE/de_DE.twitter.txt", "r")
readLines(connnectionBlogsFile, 3)
## [1] "irgendwas stimmt mut meinem internet am pc nich :("
## [2] "\"Wir haben hier ein angebrochenes Fass Bier!\" habe ich mir auch anders vorgestellt. Fragt sich nur, wer darüber gekotzt hat."
## [3] "Meine Kommilitonen beschweren sich, nie Freizeit zu haben... Anscheinend mache ich was falsch. Naja. Läuft..."
close(twitterFileConnection)
connnectionBlogsFile <- file("HC/en_US/en_US.twitter.txt", "r")
readLines(connnectionBlogsFile, 3)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
close(twitterFileConnection)
connnectionBlogsFile <- file("HC/de_DE/de_DE.twitter.txt", "r")
readLines(connnectionBlogsFile, 3)
## [1] "irgendwas stimmt mut meinem internet am pc nich :("
## [2] "\"Wir haben hier ein angebrochenes Fass Bier!\" habe ich mir auch anders vorgestellt. Fragt sich nur, wer darüber gekotzt hat."
## [3] "Meine Kommilitonen beschweren sich, nie Freizeit zu haben... Anscheinend mache ich was falsch. Naja. Läuft..."
close(connnectionBlogsFile)
Special characters such as question marks are present in the texts, so they should be discarded in later data cleaning stages. The data also contain offensive and profane words, which need to be filtered out as well. In the context of the Capstone project, only the English-language files will be taken into account, that is:
englishFiles <- listOfFileInfos[listOfFileInfos$language == "English",] # Select files in english language
kable(englishFiles, col.names = columnNamesToShow) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"),
full_width = FALSE)
| File | Size (bytes) | Size in MB | Line count | Word count | W/L ratio | Language |
|---|---|---|---|---|---|---|
| HC/en_US/en_US.blogs.txt | 210160014 | 200.42 | 899288 | 37334690 | 41.52 | English |
| HC/en_US/en_US.news.txt | 205811889 | 196.28 | 1010242 | 34372720 | 34.02 | English |
| HC/en_US/en_US.twitter.txt | 167105338 | 159.36 | 2360148 | 30374206 | 12.87 | English |
Only a portion of the data will be used for the initial analysis, so a sample is taken from each of the 3 US file types: blogs, news, and Twitter. A corpus (a collection of documents) is then created from the 3 samples. Once the files are loaded, 1% of each is sampled.
tweets <- readLines('HC/en_US/en_US.twitter.txt', encoding = 'UTF-8', skipNul = TRUE)
tweets <- iconv(tweets, to = "ASCII", sub="")
blogs <- readLines('HC/en_US/en_US.blogs.txt', encoding = 'UTF-8', skipNul = TRUE)
newsFileConnection <- file('HC/en_US/en_US.news.txt', encoding = 'UTF-8', open = 'rb')
news <- readLines(newsFileConnection, skipNul = TRUE)
close(newsFileConnection)
sampledText <- c(
blogs[sample(1:length(blogs),length(blogs)/100)],
news[sample(1:length(news),length(news)/100)],
tweets[sample(1:length(tweets),length(tweets)/100)])
remove(blogs)
remove(tweets)
remove(news)
This section uses the text mining library 'tm' (loaded previously) to perform data cleaning tasks, which are essential in Predictive Text Analytics. The main cleaning steps are:
1. Converting the documents to lowercase
2. Removing punctuation marks
3. Removing numbers
4. Removing stop words (e.g. "and", "or", "not", "is")
5. Removing undesired terms
6. Removing extra whitespace generated by the previous 5 steps
sampledText <- iconv(sampledText, to = "ASCII", sub="")
corpus <- VCorpus(VectorSource(sampledText))
corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 42695
A sample of the first document follows:
writeLines(as.character(corpus[[1]]))
## The last tags are the ones I got from Jacqui for C and D - surprising how random selection got me two tags from the same lady, wonder what the odds were against that happening!
This transformation converts the whole corpus to lowercase, using the tolower() function.
corpus <- tm_map(corpus, content_transformer(tolower))
In this step, punctuation marks and numbers are removed with removePunctuation() and removeNumbers(). Beforehand, special quote characters, URLs, Twitter handles and hashtags, and e-mail addresses are replaced with spaces by a custom toSpace transformer.
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})
corpus <- tm_map(corpus, toSpace, "-")
corpus <- tm_map(corpus, toSpace, ":")
corpus <- tm_map(corpus, toSpace, "`")
corpus <- tm_map(corpus, toSpace, "´")
corpus <- tm_map(corpus, toSpace, " -")
corpus <- tm_map(corpus, toSpace, "[\x82\x91\x92]") # Special single quotes
corpus <- tm_map(corpus, toSpace, '(ftp|http|https)[^([:blank:]|\\"|<|&|#\n\r)]+') # URIs
corpus <- tm_map(corpus, toSpace, '(@|#)[^[:space:]]+') # Twitter handles and hashtags
corpus <- tm_map(corpus, toSpace, '[[:alnum:]._-]+@[[:alnum:].-]+') # E-mail addresses
corpus <- tm_map(corpus, removePunctuation) # Punctuation marks
corpus <- tm_map(corpus, removeNumbers) # Numbers
In this transformation, consecutive whitespace characters are collapsed into a single space.
corpus <- tm_map(corpus, stripWhitespace)
In this step, stop words (for the English language) are removed.
corpus <- tm_map(corpus, removeWords, stopwords("english"))
The capstone project aims to develop a word prediction app, and predicting swear words is not desirable. Therefore, a profanity filtering step is necessary.
swearWordsFileUrl <- 'http://www.frontgatemedia.com/new/wp-content/uploads/2014/03/Terms-to-Block.csv' # URL has list of swear words chosen for filtering
rawSwearWords <- readLines(swearWordsFileUrl)
swearWords <- gsub(',"?', '', rawSwearWords[5:length(rawSwearWords)])
corpus <- tm_map(corpus, removeWords, swearWords)
Stemming is the process of reducing inflected (or sometimes derived) words to their word stem or root, e.g. working and worked to work. This way, words sharing the same root are collapsed into a single term.
corpus <- tm_map(corpus, stemDocument)
This step starts with the creation of Document-Term Matrices (DTMs), which allow finding the occurrences of words in the corpus, that is, which words/combinations have the highest frequencies. Specifically, three DTMs are built: for single words (1-Grams), 2-Grams, and 3-Grams. Then, frequencies are calculated and sorted. As a result, plots displaying the 10 most frequent words/combinations are shown.
# Tokenizers based on the RWeka package
unigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Utility function: returns term frequencies sorted in decreasing order
getNgramFrequencies <- function(dtm) {
sort(colSums(as.matrix(dtm)), decreasing = TRUE)
}
unigramDtm <- DocumentTermMatrix(corpus, control = list(tokenize = unigramTokenizer))
unigramDtm <- removeSparseTerms(unigramDtm, 0.999)
unigramFrequencies <- getNgramFrequencies(unigramDtm)
unigram10Frequencies <- unigramFrequencies[1:10]
unigramFrequenciesDF <- data.frame(word = names(unigram10Frequencies), frequency = as.numeric(unigram10Frequencies))
bigramDtm <- DocumentTermMatrix(corpus, control = list(tokenize = bigramTokenizer))
bigramDtm <- removeSparseTerms(bigramDtm, 0.999)
bigramFrequencies <- getNgramFrequencies(bigramDtm)
bigram10Frequencies <- bigramFrequencies[1:10]
bigramFrequenciesDF <- data.frame(bigram = names(bigram10Frequencies), frequency = as.numeric(bigram10Frequencies))
trigramDtm <- DocumentTermMatrix(corpus, control = list(tokenize = trigramTokenizer))
trigramDtm <- removeSparseTerms(trigramDtm, 0.9999)
trigramFrequencies <- getNgramFrequencies(trigramDtm)
trigram10Frequencies <- trigramFrequencies[1:10]
trigramFrequenciesDF <- data.frame(trigram = names(trigram10Frequencies), frequency = as.numeric(trigram10Frequencies))
kable(unigramFrequenciesDF, col.names = c('Word', 'Frequency')) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"),
full_width = FALSE)
| Word | Frequency |
|---|---|
| one | 3144 |
| will | 3131 |
| said | 3084 |
| like | 3061 |
| get | 2960 |
| just | 2922 |
| time | 2702 |
| can | 2468 |
| day | 2364 |
| year | 2358 |
ggplot(data = unigramFrequenciesDF, aes(reorder(word, -frequency), frequency)) +
geom_bar(stat = "identity") +
ggtitle("Most frequent words") +
xlab("Words") + ylab("Frequency") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
kable(bigramFrequenciesDF, col.names = c('2-Gram', 'Frequency')) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"),
full_width = FALSE)
| 2-Gram | Frequency |
|---|---|
| year old | 236 |
| right now | 234 |
| last year | 228 |
| look like | 220 |
| new york | 204 |
| cant wait | 194 |
| feel like | 189 |
| last night | 177 |
| look forward | 165 |
| high school | 164 |
ggplot(data = bigramFrequenciesDF, aes(reorder(bigram, -frequency), frequency)) +
geom_bar(stat = "identity") +
ggtitle("Most frequent 2-Grams") +
xlab("2-Grams") + ylab("Frequency") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
kable(trigramFrequenciesDF, col.names = c('3-Gram', 'Frequency')) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive", "bordered"),
full_width = FALSE)
| 3-Gram | Frequency |
|---|---|
| cant wait see | 38 |
| new york citi | 33 |
| let us know | 27 |
| happi mother day | 22 |
| presid barack obama | 22 |
| happi new year | 21 |
| im pretti sure | 21 |
| dont even know | 20 |
| cant wait get | 17 |
| cinco de mayo | 17 |
ggplot(data = trigramFrequenciesDF, aes(reorder(trigram, -frequency), frequency)) +
geom_bar(stat = "identity") +
ggtitle("Most frequent 3-Grams") +
xlab("3-Grams") + ylab("Frequency") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Word instances:
sum(unigramFrequencies)
## [1] 409437
Unique words:
length(unigramFrequencies)
## [1] 2040
wordcloud2(data = unigramFrequenciesDF, size = 0.5, shape = 'star')
When loading the data, the "news" file has an incomplete final line, and the "twitter" file seems to contain some null lines. There are also some accented characters that were not completely removed in the cleaning step; it would be cleaner for the algorithm to remove those in a second cleaning round.
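As an illustration, such a second cleaning round could transliterate the remaining accented characters to plain ASCII. The sketch below is only an assumption about how this could be done (the removeAccents helper is introduced here for illustration and is not part of the analysis above; it relies on the stringi package):
library(stringi)
# Hypothetical second cleaning round: transliterate accented characters to ASCII,
# then drop any bytes that are still not plain ASCII
removeAccents <- content_transformer(function(x) {
  x <- stri_trans_general(x, "Latin-ASCII") # e.g. "café" becomes "cafe"
  iconv(x, to = "ASCII", sub = "")
})
corpus <- tm_map(corpus, removeAccents)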
The next steps of the project will be to build a predictive algorithm based on N-Gram lookups, computing the probability of the next word given the preceding words and backing off to a lower order (e.g. from 3-grams to 2-grams, and so forth) when needed. Later, a web app (built with Shiny) will use this algorithm to suggest the next word to the user.
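To illustrate the idea, the sketch below performs a naive backoff lookup over the n-gram frequency vectors computed above. The predictNextWord helper is an assumption introduced here for illustration only (it reuses stringr functions already used above), not the final algorithm:
# Hypothetical next-word lookup: try the 3-gram table first, then back off to
# 2-grams, and finally to the most frequent single word. The frequency vectors
# are already sorted in decreasing order, so the first match is the most frequent one.
predictNextWord <- function(lastWords) {
  tokens <- str_split(str_squish(tolower(lastWords)), " ")[[1]]
  n <- length(tokens)
  if (n >= 2) { # match the last two words against the 3-grams
    prefix <- paste(tokens[(n - 1):n], collapse = " ")
    hits <- trigramFrequencies[startsWith(names(trigramFrequencies), paste0(prefix, " "))]
    if (length(hits) > 0) return(word(names(hits)[1], 3))
  }
  if (n >= 1) { # back off: match the last word against the 2-grams
    hits <- bigramFrequencies[startsWith(names(bigramFrequencies), paste0(tokens[n], " "))]
    if (length(hits) > 0) return(word(names(hits)[1], 2))
  }
  names(unigramFrequencies)[1] # last resort: the most frequent word
}
predictNextWord("cant wait") # with the sample frequencies above, this would return "see"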