This is the interim milestone report for the Coursera Data Science Capstone project. The objective of this report is to demonstrate the work done so far, namely importing the data, exploratory data analysis, data cleansing, and the construction of algorithms to model the relationships between words. In the next phase of the project, we will incorporate natural language processing techniques to build a predictive text application.
The supplied data set for this project (called HC Corpora) contains text snippets from Twitter, news articles and blogs. The data is organized into four language groups: en_US, de_DE, ru_RU and fi_FI.
The starting point for this project is, of course, the data. Here I download the Coursera SwiftKey dataset.
# load the URL for the source data
Data_URL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# check if the Data directory exists, if not create it
if(!file.exists("./Data")) {
  dir.create("./Data")
}
# check if the zip file has already been downloaded, if not download it
if(!file.exists("./Data/Coursera-SwiftKey.zip")) {
  download.file(Data_URL, destfile = "./Data/Coursera-SwiftKey.zip", mode = "wb")
}
# check if the data is already unzipped, if not unzip it now
if(!file.exists("./Data/final")) {
  unzip(zipfile = "./Data/Coursera-SwiftKey.zip", exdir = "./Data")
}
The raw data contains four different language groups: English, Finnish, German and Russian. To improve my German language skills, I will process the German data set.
# load up the German subset of the data
DE_blogs <- readLines("Data/final/de_DE/de_DE.blogs.txt",
                      encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
DE_news <- readLines("Data/final/de_DE/de_DE.news.txt",
                     encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
DE_twitter <- readLines("Data/final/de_DE/de_DE.twitter.txt",
                        encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
Let’s see what the actual data looks like:
head(DE_news, n=3)
## [1] "Das Rezept für ihre Schokobrezln hat die 60-Jährige schon vor 26 Jahren in einer österreichischen Sendung entdeckt. Mittlerweile kommenn sie jedes Jahr zu Weihnachten auf den Teller. Thurner hat sich heuer das erste Mal beim Zuckerguss-Magazin beworben, obwohl sie schon öfter mal Rezepte daraus nach gebacken hat. Sie erzählt: „Ich habe das Rezept für die Schokobrezln noch nie irgendwo gesehen. Da habe ich es eingeschickt.“ Lachend fügt sie hinzu: „Und ich wollte unbedingt diese Schürze.“"
## [2] "Für die Linksparteibewerber ist nun die entscheidende Frage, wie sich SPD- und Grünenwähler in einer solchen Persönlichkeitsentscheidung verhalten. Rot-Rot-Grün auf Kreisebene hatte zum Beispiel Siegried Konieczny 2009 überraschend ins Amt gebracht, vor dem zweiten Wahlgang hatte der CDU-Kandidat seinerzeit noch sehr deutlich geführt. In derartigen Stichwahlen sind jederzeit Überraschungen möglich."
## [3] "Nach Einschätzung des DIW ist das kräftige Plus im dritten Quartal vor allem dem Quartalsauftakt im Juli zu verdanken, da aufgrund der späten Lage der Sommerferien in einigen Bundesländern ein großer Teil der Produktion vorgezogen wurde. Schon im August und September sei die Industrieproduktion dagegen deutlich zurückgegangen. Im Jahresdurchschnitt 2011 dürfte das Wirtschaftswachstum nach Einschätzung des DIW trotzdem rund drei Prozent betragen. Auch der Ökonom Carsten Brzeski von der ING Bank hält dieses Ziel nach wie vor für erreichbar. Allerdings ist auch seine Einschätzung für die kommenden Monate düster."
## Data.File Word.Count Total.Character Size.of.file.in.MB Length.in.Lines
## 1 Twitter 6205913 40729298 81.50069 181958
## 2 Blogs 13375092 93388799 91.16360 244743
## 3 News 11646047 72776717 72.07712 947774
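For reference, a summary table like the one above can be computed along the following lines. This is only a minimal sketch, not the exact code used for the report; the stringi package and the summarise_file helper are assumptions.
# sketch: per-file summary of word count, characters, file size and lines
# (stringi and the summarise_file helper are assumptions, not the original code)
library(stringi)

summarise_file <- function(lines, label, path) {
  data.frame(
    Data.File          = label,
    Word.Count         = sum(stri_count_words(lines)),
    Total.Character    = sum(nchar(lines)),
    Size.of.file.in.MB = file.size(path) / 1024^2,
    Length.in.Lines    = length(lines)
  )
}

rbind(summarise_file(DE_twitter, "Twitter", "Data/final/de_DE/de_DE.twitter.txt"),
      summarise_file(DE_blogs,   "Blogs",   "Data/final/de_DE/de_DE.blogs.txt"),
      summarise_file(DE_news,    "News",    "Data/final/de_DE/de_DE.news.txt"))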
As we can see from the initial analysis, the original data set is very large. To keep memory use and processing time manageable, I will sample 10% of each of the original data sets and combine the subsets into a single, smaller set. As we learned in the lectures, using a smaller sample does not significantly impact accuracy. For more on this, see the recommended text Text Mining with R: A Tidy Approach by Julia Silge and David Robinson. Here is how I set the sampling rate:
# set the sampling rate
sample_rate <- 0.10
# sample each source and combine into one smaller data set
set.seed(1)
DE_sample <- c(sample(DE_blogs, length(DE_blogs) * sample_rate),
               sample(DE_news, length(DE_news) * sample_rate),
               sample(DE_twitter, length(DE_twitter) * sample_rate))
As we saw in the data exploration, the text contains numerous elements which are not important for our purposes: numbers, special symbols, URLs, punctuation, capitalization, stopwords and excess white space. Warning: running the cleaning step may take a very long time to process. At a sampling rate of 10%, it takes approximately 5 minutes on a current (2020) desktop computer.
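The cleaning code itself is not reproduced here. Below is a minimal sketch of the kind of pipeline described above, based on the tm package (which also appears later for stopword removal); the removeURL helper is an assumption.
# sketch of a tm-based cleaning pipeline for the sampled text
# (the removeURL helper is an assumption, not part of tm itself)
library(tm)

corpus <- VCorpus(VectorSource(DE_sample))

removeURL <- content_transformer(function(x) gsub("http[^[:space:]]*", "", x))

corpus <- tm_map(corpus, removeURL)                     # strip URLs
corpus <- tm_map(corpus, content_transformer(tolower))  # remove capitalization
corpus <- tm_map(corpus, removeNumbers)                 # remove numbers
corpus <- tm_map(corpus, removePunctuation)             # remove punctuation and symbols
corpus <- tm_map(corpus, removeWords, stopwords("de"))  # remove German stopwords
corpus <- tm_map(corpus, stripWhitespace)               # collapse excess white space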
To understand the data better, let’s build histograms of the unigrams, bigrams and trigrams.
To do this, we need to tokenize the sample. Tokenization generates a list of the words the sample contains. We can extend the tokenization to produce word pairs, called bigrams; similarly, a group of three consecutive words is a trigram. The tokenization process is discussed in depth in the ebook Text Mining with R: A Tidy Approach by Silge and Robinson.
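The tokenizer code is not reproduced here either. A minimal sketch of one way to build the unigram, bigram and trigram counts with the tidytext package, building on the cleaned corpus from the sketch above (the choice of tidytext is an assumption; the original may use a different tokenizer):
# sketch: n-gram counts with tidytext (an assumption about the tooling)
library(dplyr)
library(tidytext)

# pull the cleaned text back out of the tm corpus
cleaned_df <- tibble(text = sapply(corpus, as.character))

unigrams <- cleaned_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

bigrams <- cleaned_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

trigrams <- cleaned_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  count(trigram, sort = TRUE)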
Warning: running the tokenization may take a very long time to process. At a sampling rate of 10%, it can take up to 10 minutes on a current (2020) desktop computer.
To plot the frequency histograms, I use the ggplot2 library, in the appropriate colours of course.
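As an illustration, here is a minimal sketch of one such plot for the 20 most frequent unigrams, assuming the unigram counts from the tokenization sketch above (the colour choice is arbitrary):
# sketch: bar chart of the 20 most frequent unigrams in the sample
library(ggplot2)

top_unigrams <- head(unigrams, 20)   # unigrams is already sorted by count

ggplot(top_unigrams, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "goldenrod") +
  coord_flip() +
  labs(x = NULL, y = "Frequency",
       title = "Top 20 unigrams in the German sample")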
I wanted to see the effect of keeping the stopwords, so I ran the same algorithm without removing the German stopwords, i.e. without (corpus <- tm_map(corpus, removeWords, stopwords("de"))). The sheer number of hits is remarkable: well over 75 times as many as with the stopwords removed. The second plot terminated early because the data frame exceeded 20 GB. This also illustrates why stopwords are often removed: they carry little meaning for search and prediction purposes.
The primary goal is to develop a predictive algorithm. The n-grams above will be its foundation, unless I develop a better model in the meantime.
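To make the plan concrete, here is a minimal sketch of how the n-gram tables above could drive a simple backoff-style lookup. The predict_next helper is hypothetical and is not the final model.
# sketch: a naive backoff lookup on the trigram and bigram tables
# (predict_next is a hypothetical helper, not the final algorithm)
library(dplyr)
library(stringr)

predict_next <- function(phrase, n_suggestions = 3) {
  words <- str_split(str_to_lower(str_squish(phrase)), " ")[[1]]
  hits <- character(0)

  # try trigrams whose first two words match the last two input words
  if (length(words) >= 2) {
    last_two <- paste(tail(words, 2), collapse = " ")
    matches  <- filter(trigrams, str_detect(trigram, paste0("^", last_two, " ")))
    hits     <- word(matches$trigram, 3)
  }

  # back off to bigrams keyed on the last input word
  if (length(hits) == 0) {
    last_one <- tail(words, 1)
    matches  <- filter(bigrams, str_detect(bigram, paste0("^", last_one, " ")))
    hits     <- word(matches$bigram, 2)
  }

  # the tables are sorted by frequency, so the first hits are the best guesses
  head(hits, n_suggestions)
}

predict_next("guten")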
Secondly, I want to deploy the predictive algorithm as a Shiny application. The app will have a user interface that lets the user enter a word or phrase, and it should then predict the most likely words to follow the input.
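A minimal sketch of what that interface could look like, assuming the hypothetical predict_next helper from the previous sketch:
# sketch of the planned Shiny user interface (not yet implemented)
library(shiny)

ui <- fluidPage(
  titlePanel("German next-word prediction"),
  textInput("phrase", "Type a word or phrase:"),
  verbatimTextOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderPrint({
    req(input$phrase)
    predict_next(input$phrase)   # hypothetical helper sketched above
  })
}

shinyApp(ui = ui, server = server)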
To achieve these goals, the next steps will be to review the data processing done so far (sampling and cleaning), followed by training and testing, and the selection of the prediction procedure and algorithms. Given the size of the raw data, performance optimization may also be required.