Executive Summary

This report was written in the context of the capstone project of the Johns Hopkins Data Science Specialization. Its goal is to document the steps completed so far and to give a short outlook on how the work will proceed. It presents the lessons learned and a basic exploratory analysis.

In particular, a unigram term document matrix was created in order to get an impression of the words in the corpus. Although some effort has already gone into cleaning the corpus, the main finding so far is that there is still a lot of noise in the data (e.g. “words” like “zzolo”). Corpus cleaning will therefore remain an issue in the next phases.

Disclaimer: As suggested by the mentors, the main body of the report contains only a few short, illustrative code sketches; a selection of full code chunks is presented in the appendix to give an impression of the approach.

Loading the Data

This turned out to be rather time consuming and caused some headaches. Eventually, the files of the English database (en_US.blogs.txt, en_US.twitter.txt, en_US.news.txt) were read in with readLines. The en_US.news file contained a few unreadable strings that caused an error and aborted the loading. I could not solve this in R, so I deleted the critical strings in Notepad++.
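A minimal sketch of the loading step is shown below, assuming the files sit in a local final/en_US/ folder (the path is an assumption, not taken from the report); opening the connection in binary mode and setting skipNul = TRUE are common ways to get past embedded null and control characters of the kind described above.

# Illustrative loading sketch; the folder path is an assumption
con <- file("final/en_US/en_US.news.txt", open = "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
# the blogs and twitter files are read in analogously
con <- file("final/en_US/en_US.blogs.txt", open = "rb")
blogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)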

Reducing the Data and File Summary

As the files are of considerable size, the first thing to do was to reduce the amount of data for further analysis. This was done in two steps:

  1. Splitting each data file into a train (~75%) and a test (~25%) set using a random coin flip method (see appendix, code chunk 1)
  2. Drawing a 10% sample from each train set using the same random coin flip method (a sketch of this step is shown below)
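The sampling in step 2 follows the same coin-flip pattern as code chunk 1; a minimal sketch, with object names chosen for illustration and tltrain as produced by code chunk 1:

set.seed(147)
# flip a biased coin (P = 0.1) for every line of the train set and keep the "heads"
keep <- rbinom(length(tltrain), size = 1, prob = 0.1)
tlsample <- tltrain[keep == 1]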

Table 1 summarizes some numeric aspects of this reduction. For the next stages, this rigorous reduction might be eased and a bigger sample considered. I assume there is a trade-off between effective corpus cleaning and initial sample size.

Table 1: Summary of reduction process
status               blog      twitter    news
original size (MB)   210       167        205
original rows        899288    2360148    1010242
train rows           674988    1768906    757770
sample rows          67612     177196     75511

Creating a Corpus for Text Mining

I decided to use the tm package for the text mining part of this report. Time limits did not allow for testing other packages like Quanteda or Tidy Text Mining.

I conducted a series of cleaning steps on the initial corpus in order to obtain a corpus as clean as possible, one that would serve as a good basis for the machine learning part to come. As performance and memory will be important, it is key to have a database that is as small as possible, yet as relevant as possible for prediction.

The cleaning was an iterative process consisting of two basic stages. First, I applied several standard cleaning steps (I credit some valuable ideas to Brandon Kopp) and then created a term document matrix of unigrams for closer word inspection. Second, based on this inspection, I refined the cleaning of the corpus.

Concerning the standard cleaning, I abstained from stemming and from eliminating stopwords, as the goal of the project is not classical semantic or distant reading of the texts, but predicting next words from the preceding words. I did, however, remove profane words using the frontgatemedia.com list suggested by Eric Bruce in the course forum.

The cleaning steps are documented in the appendix, code chunk 2.

Creating Term Document Matrix and Statistics

I created a term document matrix (TDM) of unigrams as a basis for manually inspecting the data and computing some statistics. At this stage I abstained from bi- and trigram tokenization; this will follow in the next stages of the project.
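A minimal sketch of this step with the tm package, assuming the cleaned corpus is stored in corp as in code chunk 2:

library(tm)
# build a term document matrix of unigrams (tm's default single-word tokenization)
tdm <- TermDocumentMatrix(corp)
# overall term frequencies, summed over the three documents and sorted for inspection
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freq, 20)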

I started by drawing heatmaps: one for the 2000 most frequent words, one for words of middle (average) frequency and one for the less frequent words. The top 2000 graph shows a differentiated cluster image, with some overweight of the twitter data and a blank area in the news data in the middle region. The middle and bottom 2000 graphs show large undifferentiated lumps with marked differences between the three data sets. Together, the heatmaps suggest that all three data sets have to be considered for prediction, and that omitting one set would reduce predictive performance.

Heatmaps of Words
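The figures are not reproduced here; the sketch below indicates how such a heatmap can be drawn from the TDM with base R (the cut-off of 2000 terms follows the text, everything else is illustrative):

# matrix with one row per term and one column per data set
m <- as.matrix(tdm)
# select the 2000 most frequent terms and plot their counts as a heatmap
top <- names(sort(rowSums(m), decreasing = TRUE))[1:2000]
heatmap(m[top, ], Colv = NA, scale = "none")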

Table 2 corroborates this, as most words occur in only one data set. This share will most probably shrink with further cleaning, but it still seems relevant.

Table 2: Word Frequencies
             Total words   Words in one file only   Words in two files   Words in all three files
occurrence   156099        104522                   23204                28373
%            100           67                       15                   18
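The figures in Table 2 can be obtained directly from the TDM; a sketch, assuming the term matrix m from the heatmap sketch above:

# for each term, count in how many of the three data sets it occurs at least once
docs_per_term <- rowSums(m > 0)
table(docs_per_term)                                         # terms in 1, 2 or 3 files
round(100 * table(docs_per_term) / length(docs_per_term))    # as percentages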


Finally, the 20 most frequent words in each data set are visualized. The plots show extreme values for a few words like “the”, “and” or “you”. Their predictive value in bi- and trigram combinations will have to be carefully considered.

Plot of Top 20 Words per Data Set
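The plots themselves are omitted here; a sketch for one data set (the column index assumes the corpus documents are ordered blogs, twitter, news):

# 20 most frequent terms in the blogs column, drawn as a horizontal bar plot
top20_blogs <- sort(m[, 1], decreasing = TRUE)[1:20]
barplot(rev(top20_blogs), horiz = TRUE, las = 1,
        main = "Top 20 words (blogs)", xlab = "frequency")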

The Further Steps

In the next stages a simple prediction algorithm will be developed. For this I plan the following steps:

Concluding Remark

I shall be grateful for any criticism and suggestions from peers that help to enhance the results of this undertaking.

Appendix

Code chunk 1: Create train and test set

set.seed(147)
# tl holds the lines of one data file as read in with readLines
n <- length(tl); ntr <- 1; p <- 0.75    # one flip per line, P(train) = 0.75
rb <- rbinom(n, size = ntr, prob = p)   # vector of 0/1 coin flips
r1 <- which(rb == 1)                    # indices flagged for the train set
r0 <- which(rb == 0)                    # indices flagged for the test set
tltrain <- tl[r1]
tltest <- tl[r0]

Code chunk 2: Corpus Cleaning

#Load the tm package used for all corpus transformations
library(tm)
#Transform to lower case
corp <- tm_map(corp, content_transformer(tolower))
#Remove ellipsis, hyphens and forward slashes
corp <- tm_map(corp, content_transformer(gsub), pattern = "-|\\/|\\.\\.\\.", replacement = " ")
#Remove non-ASCII characters (convert to ASCII, substituting unconvertible characters with spaces)
corp <- tm_map(corp, content_transformer(iconv), from = "latin1", to = "ASCII", sub = " ")
#Remove URLs
corp <- tm_map(corp, content_transformer(gsub), pattern = "(http\\:\\/\\/|https\\:\\/\\/)?(www\\.)?[a-z\\.]*?[a-z]*\\.(org|com|gov|net|edu)", replacement = " ")
corp <- tm_map(corp, content_transformer(gsub), pattern = "\\.html|\\.htm|\\.aspx", replacement = " ")
#Remove stray control characters (DEL and SUB)
corp <- tm_map(corp, content_transformer(gsub), pattern = "\177|\032", replacement = " ")
#Remove email addresses
corp <- tm_map(corp, content_transformer(gsub), pattern = "[A-Za-z0-9]*\\@[a-z]*\\.([a-zA-Z]{3}|[a-zA-Z]{2})", replacement = " ")
#Expand the contraction n't to " not" (e.g. don't -> do not)
corp <- tm_map(corp, content_transformer(gsub), pattern = "n't", replacement = " not")
#Replace runs of three or more identical characters with a single instance (e.g. "sooo" -> "so")
library(stringr)
corp <- tm_map(corp, content_transformer(str_replace_all), pattern = "(\\w)\\1{2,}", replacement = "\\1")
#Remove Numbers
corp <- tm_map(corp, removeNumbers)
#Remove Punctuation
corp <- tm_map(corp, removePunctuation)
#Strip White Spaces
corp <- tm_map(corp , stripWhitespace)
#Remove profanity
csv <- "Terms-to-Block.csv"
blocklist <- read.csv(csv, header = FALSE, stringsAsFactors = FALSE, skip = 4)
toblock <- blocklist[, 2]
toblock <- gsub(",", "", toblock)
corp <- tm_map(corp, removeWords, toblock)

#Cleaning after manually inspecting the data in tdm

#Remove words of 16 or more characters
corp <- tm_map(corp, content_transformer(gsub), pattern = "\\b\\w{16,}", replacement = "") 
corp <- tm_map(corp, content_transformer(gsub), pattern = "#", replacement = " ")
corp <- tm_map(corp, content_transformer(gsub), pattern = "\\$", replacement = " ")
corp <- tm_map(corp, content_transformer(gsub), pattern = "<u+0099>", replacement = " ")
corp <- tm_map(corp, content_transformer(gsub), pattern = "&", replacement = " ")
corp <- tm_map(corp, content_transformer(gsub), pattern = "\\*", replacement = " ")
corp <- tm_map(corp, content_transformer(gsub), pattern = "\\[", replacement = " ")
corp <- tm_map(corp, content_transformer(gsub), pattern = "\\]", replacement = " ")
corp <- tm_map(corp, content_transformer(gsub), pattern = "aa", replacement = " a")