This is the interim milestone report for the Coursera Data Science Capstone project. The objective of this report is to demonstrate the work done so far, namely importing the data, exploratory data analysis, data cleansing, and the construction of algorithms to model the relationships between words. In the next phase of the project, we will incorporate natural language processing techniques to build a predictive text application.
The supplied data set for this project (called HC Corpora) contains text snippets from Twitter, news articles and blogs. The data is organized into four language groups: en_US, de_DE, ru_RU and fi_FI.
The starting point for this project is, of course, the data. Here I download the Coursera SwiftKey dataset.
# load the URL for the source data
Data_URL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# check if the Data directory exists, if not create it
if(!file.exists("./Data")) {
  dir.create("./Data")
}
# check if the zip file has already been downloaded, if not download it
if(!file.exists("./Data/Coursera-SwiftKey.zip")) {
  download.file(Data_URL, destfile = "./Data/Coursera-SwiftKey.zip", mode = "wb")
}
# check if the data is already unzipped, if not unzip it now
if(!file.exists("./Data/final")) {
  unzip(zipfile = "./Data/Coursera-SwiftKey.zip", exdir = "./Data")
}
The raw data contains four different language groups: English, Finnish, German and Russian. To improve my German language skills, I will process the German data set.
# load up the German subset of the data
DE_blogs <- readLines("Data/final/de_DE/de_DE.blogs.txt",
                      encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
DE_news <- readLines("Data/final/de_DE/de_DE.news.txt",
                     encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
DE_twitter <- readLines("Data/final/de_DE/de_DE.twitter.txt",
                        encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
Let’s see what the actual data looks like:
head(DE_news, n=3)
## [1] "Das Rezept für ihre Schokobrezln hat die 60-Jährige schon vor 26 Jahren in einer österreichischen Sendung entdeckt. Mittlerweile kommenn sie jedes Jahr zu Weihnachten auf den Teller. Thurner hat sich heuer das erste Mal beim Zuckerguss-Magazin beworben, obwohl sie schon öfter mal Rezepte daraus nach gebacken hat. Sie erzählt: „Ich habe das Rezept für die Schokobrezln noch nie irgendwo gesehen. Da habe ich es eingeschickt.“ Lachend fügt sie hinzu: „Und ich wollte unbedingt diese Schürze.“"
## [2] "Für die Linksparteibewerber ist nun die entscheidende Frage, wie sich SPD- und Grünenwähler in einer solchen Persönlichkeitsentscheidung verhalten. Rot-Rot-Grün auf Kreisebene hatte zum Beispiel Siegried Konieczny 2009 überraschend ins Amt gebracht, vor dem zweiten Wahlgang hatte der CDU-Kandidat seinerzeit noch sehr deutlich geführt. In derartigen Stichwahlen sind jederzeit Überraschungen möglich."
## [3] "Nach Einschätzung des DIW ist das kräftige Plus im dritten Quartal vor allem dem Quartalsauftakt im Juli zu verdanken, da aufgrund der späten Lage der Sommerferien in einigen Bundesländern ein großer Teil der Produktion vorgezogen wurde. Schon im August und September sei die Industrieproduktion dagegen deutlich zurückgegangen. Im Jahresdurchschnitt 2011 dürfte das Wirtschaftswachstum nach Einschätzung des DIW trotzdem rund drei Prozent betragen. Auch der Ökonom Carsten Brzeski von der ING Bank hält dieses Ziel nach wie vor für erreichbar. Allerdings ist auch seine Einschätzung für die kommenden Monate düster."
## Data.File Word.Count Total.Character Size.of.file.in.MB Length.in.Lines
## 1 Twitter 6205913 40729298 81.50069 181958
## 2 Blogs 13375092 93388799 91.16360 244743
## 3 News 11646047 72776717 72.07712 947774
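For reference, a summary table like the one above can be computed along the following lines. This is only a minimal sketch, not the exact code used for the report; the stringi package and the summarise_file helper are assumptions.
# sketch: per-file summary of word count, characters, file size and lines
# (stringi and the summarise_file helper are assumptions, not the original code)
library(stringi)

summarise_file <- function(lines, label, path) {
  data.frame(
    Data.File          = label,
    Word.Count         = sum(stri_count_words(lines)),
    Total.Character    = sum(nchar(lines)),
    Size.of.file.in.MB = file.size(path) / 1024^2,
    Length.in.Lines    = length(lines)
  )
}

rbind(summarise_file(DE_twitter, "Twitter", "Data/final/de_DE/de_DE.twitter.txt"),
      summarise_file(DE_blogs,   "Blogs",   "Data/final/de_DE/de_DE.blogs.txt"),
      summarise_file(DE_news,    "News",    "Data/final/de_DE/de_DE.news.txt"))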
As we can see from the initial analysis, the original data set is very large. To keep memory use and processing time manageable, I will sample 10% of each of the original data sets and combine the subsets into a single, smaller set. As we learned in the lectures, using a smaller sample does not significantly impact accuracy. For more on this, see the recommended text Text Mining with R: A Tidy Approach by Julia Silge and David Robinson. Here is how I set the sampling rate:
# set the sampling rate
sample_rate <- 0.10
# sample each source and combine into one smaller data set
set.seed(1)
DE_sample <- c(sample(DE_blogs, length(DE_blogs) * sample_rate),
               sample(DE_news, length(DE_news) * sample_rate),
               sample(DE_twitter, length(DE_twitter) * sample_rate))
As we saw in the data exploration, the text contains numerous elements which are not important for our purposes: numbers, special symbols, URLs, punctuation, capitalization, stopwords and excess white space. Warning: running the cleaning step may take a very long time to process. At a sampling rate of 10%, it takes approximately 5 minutes on a current (2020) desktop computer.
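The cleaning code itself is not reproduced here. Below is a minimal sketch of the kind of pipeline described above, based on the tm package (which also appears later for stopword removal); the removeURL helper is an assumption.
# sketch of a tm-based cleaning pipeline for the sampled text
# (the removeURL helper is an assumption, not part of tm itself)
library(tm)

corpus <- VCorpus(VectorSource(DE_sample))

removeURL <- content_transformer(function(x) gsub("http[^[:space:]]*", "", x))

corpus <- tm_map(corpus, removeURL)                     # strip URLs
corpus <- tm_map(corpus, content_transformer(tolower))  # remove capitalization
corpus <- tm_map(corpus, removeNumbers)                 # remove numbers
corpus <- tm_map(corpus, removePunctuation)             # remove punctuation and symbols
corpus <- tm_map(corpus, removeWords, stopwords("de"))  # remove German stopwords
corpus <- tm_map(corpus, stripWhitespace)               # collapse excess white space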
To understand the data better, let’s build histograms of the unigrams, bigrams and trigrams.
To do this, we need to tokenize the sample. Tokenization generates a list of the words the sample contains. We can extend the tokenization to produce word pairs, called bigrams; similarly, a group of three consecutive words is a trigram. The tokenization process is discussed in depth in the ebook Text Mining with R: A Tidy Approach by Silge and Robinson.
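The tokenizer code is not reproduced here either. A minimal sketch of one way to build the unigram, bigram and trigram counts with the tidytext package, building on the cleaned corpus from the sketch above (the choice of tidytext is an assumption; the original may use a different tokenizer):
# sketch: n-gram counts with tidytext (an assumption about the tooling)
library(dplyr)
library(tidytext)

# pull the cleaned text back out of the tm corpus
cleaned_df <- tibble(text = sapply(corpus, as.character))

unigrams <- cleaned_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

bigrams <- cleaned_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

trigrams <- cleaned_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  count(trigram, sort = TRUE)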
Warning: running the tokenization may take a very long time to process. At a sampling rate of 10%, it can take up to 10 minutes on a current (2020) desktop computer.
To plot the frequency histograms, I use the ggplot2 library, in the appropriate colours of course.
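As an illustration, here is a minimal sketch of one such plot for the 20 most frequent unigrams, assuming the unigram counts from the tokenization sketch above (the colour choice is arbitrary):
# sketch: bar chart of the 20 most frequent unigrams in the sample
library(ggplot2)

top_unigrams <- head(unigrams, 20)   # unigrams is already sorted by count

ggplot(top_unigrams, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "goldenrod") +
  coord_flip() +
  labs(x = NULL, y = "Frequency",
       title = "Top 20 unigrams in the German sample")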
I wanted to see the effect of keeping the stopwords, so I ran the same algorithm without removing the German stopwords, i.e. without (corpus <- tm_map(corpus, removeWords, stopwords("de"))). The sheer number of hits is remarkable: well over 75 times as many as with the stopwords removed. The second plot terminated early because the data frame exceeded 20 GB. This also illustrates why stopwords are often removed: they carry little meaning for search and prediction purposes.
The primary goal is to develop a predictive algorithm. The n-grams above will be its foundation, unless I develop a better model in the meantime.
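To make the plan concrete, here is a minimal sketch of how the n-gram tables above could drive a simple backoff-style lookup. The predict_next helper is hypothetical and is not the final model.
# sketch: a naive backoff lookup on the trigram and bigram tables
# (predict_next is a hypothetical helper, not the final algorithm)
library(dplyr)
library(stringr)

predict_next <- function(phrase, n_suggestions = 3) {
  words <- str_split(str_to_lower(str_squish(phrase)), " ")[[1]]
  hits <- character(0)

  # try trigrams whose first two words match the last two input words
  if (length(words) >= 2) {
    last_two <- paste(tail(words, 2), collapse = " ")
    matches  <- filter(trigrams, str_detect(trigram, paste0("^", last_two, " ")))
    hits     <- word(matches$trigram, 3)
  }

  # back off to bigrams keyed on the last input word
  if (length(hits) == 0) {
    last_one <- tail(words, 1)
    matches  <- filter(bigrams, str_detect(bigram, paste0("^", last_one, " ")))
    hits     <- word(matches$bigram, 2)
  }

  # the tables are sorted by frequency, so the first hits are the best guesses
  head(hits, n_suggestions)
}

predict_next("guten")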
Secondly, I want to deploy the predictive algorithm as a Shiny application. The app will have a user interface that lets the user enter a word or phrase, and it should then predict the most likely words to follow the input.
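A minimal sketch of what that interface could look like, assuming the hypothetical predict_next helper from the previous sketch:
# sketch of the planned Shiny user interface (not yet implemented)
library(shiny)

ui <- fluidPage(
  titlePanel("German next-word prediction"),
  textInput("phrase", "Type a word or phrase:"),
  verbatimTextOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderPrint({
    req(input$phrase)
    predict_next(input$phrase)   # hypothetical helper sketched above
  })
}

shinyApp(ui = ui, server = server)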
To achieve these goals, the next steps will be to review the data processing done so far (sampling and cleaning), followed by training and testing, and the selection of the prediction procedure and algorithms. Given the size of the raw data, performance optimization may also be required.