This document corresponds to the Milestone Report, the week 2 assignment of the Data Science Capstone course from Coursera. This course, the 10th of the 10 courses comprising the Data Science Specialization from Johns Hopkins University, allows students to create a usable, public data product that can be used to show their skills to potential employers. Projects are drawn from real-world problems and are conducted with industry, government, and academic partners.
The Data Science Capstone project aims to develop a Shiny app that takes a phrase (multiple words) as input and, once the user clicks submit, predicts the next word. To achieve that, the course dataset is used as a training set, and NLP techniques are applied to analyze it and build the corresponding predictive model.
From the course dataset information, the data comes from a corpus called HC Corpora (the original site is not reachable, but an archive of it can be seen at https://web-beta.archive.org/web/20160930083655/http://www.corpora.heliohost.org/aboutcorpus.html). Corpora are collected from publicly available sources by a web crawler.
As said before, the dataset can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. Once downloaded and uncompressed, it consists of 4 folders corresponding to 4 different languages (German, English, Finnish, and Russian), each folder containing 3 files from 3 different text sources (blogs, news, and Twitter):
| # | File |
|---|---|
| 1 | final/de_DE/de_DE.blogs.txt |
| 2 | final/de_DE/de_DE.news.txt |
| 3 | final/de_DE/de_DE.twitter.txt |
| 4 | final/en_US/en_US.blogs.txt |
| 5 | final/en_US/en_US.news.txt |
| 6 | final/en_US/en_US.twitter.txt |
| 7 | final/fi_FI/fi_FI.blogs.txt |
| 8 | final/fi_FI/fi_FI.news.txt |
| 9 | final/fi_FI/fi_FI.twitter.txt |
| 10 | final/ru_RU/ru_RU.blogs.txt |
| 11 | final/ru_RU/ru_RU.news.txt |
| 12 | final/ru_RU/ru_RU.twitter.txt |
Next, a couple of sample lines from the Twitter files are shown:
## [1] "irgendwas stimmt mut meinem internet am pc nich :("
## [2] "\"Wir haben hier ein angebrochenes Fass Bier!\" habe ich mir auch anders vorgestellt. Fragt sich nur, wer darüber gekotzt hat."
## [3] "Meine Kommilitonen beschweren sich, nie Freizeit zu haben... Anscheinend mache ich was falsch. Naja. Läuft..."
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
The texts contain special characters (for example, question marks), which should be taken into account and discarded in later data cleaning stages.
Our analysis starts with a summary table including, for each file in the bundle, file stats (size in bytes) and data derived from running the wc command (i.e. line and word counts, and the words-per-line ratio); a sketch of how such a summary can be computed in R is shown after the table:
| File | Size | Size in MB | Line count | Word count | W/L ratio | Language |
|---|---|---|---|---|---|---|
| final/de_DE/de_DE.blogs.txt | 85459666 | 81.50 | 371440 | 12609981 | 33.95 | German |
| final/de_DE/de_DE.news.txt | 95591959 | 91.16 | 244743 | 13198704 | 53.93 | German |
| final/de_DE/de_DE.twitter.txt | 75578341 | 72.08 | 947774 | 11793502 | 12.44 | German |
| final/en_US/en_US.blogs.txt | 210160014 | 200.42 | 899288 | 37272578 | 41.45 | English |
| final/en_US/en_US.news.txt | 205811889 | 196.28 | 1010242 | 34309642 | 33.96 | English |
| final/en_US/en_US.twitter.txt | 167105338 | 159.36 | 2360148 | 30341028 | 12.86 | English |
| final/fi_FI/fi_FI.blogs.txt | 108503595 | 103.48 | 439785 | 12709831 | 28.90 | Finnish |
| final/fi_FI/fi_FI.news.txt | 94234350 | 89.87 | 485758 | 10406584 | 21.42 | Finnish |
| final/fi_FI/fi_FI.twitter.txt | 25331142 | 24.16 | 285214 | 3151016 | 11.05 | Finnish |
| final/ru_RU/ru_RU.blogs.txt | 116855835 | 111.44 | 337100 | 2044904 | 6.07 | Russian |
| final/ru_RU/ru_RU.news.txt | 118996424 | 113.48 | 196360 | 1801975 | 9.18 | Russian |
| final/ru_RU/ru_RU.twitter.txt | 105182346 | 100.31 | 881414 | 2424427 | 2.75 | Russian |
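As a reference, a minimal sketch of how such a summary could be produced in R is shown below; it assumes the uncompressed files live under final/ in the working directory and that the wc command is available (i.e. a Unix-like system). This is an illustration, not necessarily the exact code used for this report.

```r
# Sketch: size, line count, word count and words-per-line ratio per file
summariseFile <- function(path) {
  sizeBytes <- file.info(path)$size
  # `wc -lw` prints the line and word counts (in that order) for the file
  wcOutput  <- system(paste("wc -lw", path), intern = TRUE)
  counts    <- as.numeric(strsplit(trimws(wcOutput), "\\s+")[[1]][1:2])
  data.frame(File      = path,
             Size      = sizeBytes,
             SizeInMB  = round(sizeBytes / 1024^2, 2),
             LineCount = counts[1],
             WordCount = counts[2],
             WLRatio   = round(counts[2] / counts[1], 2))
}

files <- list.files("final", pattern = "\\.txt$", recursive = TRUE, full.names = TRUE)
do.call(rbind, lapply(files, summariseFile))
```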
In the context of the Capstone project, only the English language files will be taken into account, that is:
| File | Size | Size in MB | Line count | Word count | W/L ratio | Language |
|---|---|---|---|---|---|---|
| final/en_US/en_US.blogs.txt | 210160014 | 200.42 | 899288 | 37272578 | 41.45 | English |
| final/en_US/en_US.news.txt | 205811889 | 196.28 | 1010242 | 34309642 | 33.96 | English |
| final/en_US/en_US.twitter.txt | 167105338 | 159.36 | 2360148 | 30341028 | 12.86 | English |
From this standpoint, one sees that we are going to be dealing with around 556 MB of data. This data size could lead to very slow processing. Because of that, a subset consisting of 1% of the original dataset can be used, as suggested in Task 1 of the course.
Once the dataset and its details have been introduced, the following step is its cleaning. Firstly, all of the text from the blog, news and Twitter files is loaded, using the flag skipNul = TRUE when reading lines in order to skip nulls, and opening en_US.news.txt in 'rb' mode so that the warning “incomplete final line found on …” is avoided. Once the loading task is finished, a sample of 1% of the data is taken.
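A minimal sketch of this loading and sampling step is shown below; the seed value is an arbitrary assumption made here for reproducibility, not necessarily the one used in the original analysis.

```r
# Read the three English files; skipNul = TRUE skips embedded nul characters,
# and the 'rb' connection for the news file avoids the "incomplete final line" warning
blogs   <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
con     <- file("final/en_US/en_US.news.txt", open = "rb")
news    <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# Sample 1% of all lines (seed chosen arbitrarily for reproducibility)
set.seed(1234)
allLines    <- c(blogs, news, twitter)
sampledText <- sample(allLines, round(length(allLines) * 0.01))
```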
This task resulted in an object of about 8 MB, far smaller than the original set. Next, a corpus is created from the sampled text using the tm package, in order to take advantage of the text mining functionality that package provides.
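A sketch of the corpus creation, assuming sampledText holds the sampled lines from the previous step:

```r
library(tm)

# Build a volatile corpus from the sampled lines
corpus <- VCorpus(VectorSource(sampledText))
print(corpus)
```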
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 42695
The corpus has approximately 1017566 words.
A sample of the first 2 documents follows:
## A government report published earlier this week tells how the Secret Service agents involved in the Colombian prostitution scandal paid nine of the 12 women that they took home from the bar on that wild evening.
## However I miss the fun: the buzz of a crowded shop on the last Saturday before Christmas, meeting old friends at drunken book launches and having a good bitch with the publishers' sales reps. I don't think there's much fun in the book trade any more, so perhaps I was lucky to get out while I did.
The cleaning consists of several transformation tasks, described below. Many of these tasks are performed using the transformation operations provided by tm; the rest require some custom coding, using the content_transformer() function from the tm package. The transformations provided by the tm package are:
## [1] "removeNumbers" "removePunctuation" "removeWords"
## [4] "stemDocument" "stripWhitespace"
This transformation converts the whole corpus text to lowercase, using the tolower() function. Next, the sample of the first 2 documents is shown again:
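Since tolower() is a base R function rather than a tm transformation, it has to be wrapped with content_transformer(); a sketch of this step:

```r
# Convert the whole corpus to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
```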
## a government report published earlier this week tells how the secret service agents involved in the colombian prostitution scandal paid nine of the 12 women that they took home from the bar on that wild evening.
## however i miss the fun: the buzz of a crowded shop on the last saturday before christmas, meeting old friends at drunken book launches and having a good bitch with the publishers' sales reps. i don't think there's much fun in the book trade any more, so perhaps i was lucky to get out while i did.
In this step, punctuation and numbers are removed with the removePunctuation() and removeNumbers() operations. For special characters and character sequences (e.g. URLs, email addresses, Twitter users and hashtags, etc.), a custom toSpace() function is used:
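The toSpace() helper is not a tm built-in; a common definition based on content_transformer(), together with illustrative patterns (the exact regular expressions used in the original analysis may differ), is sketched below:

```r
# Replace every match of a pattern with a blank, wrapped as a tm transformation
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# Illustrative patterns (assumptions, not necessarily the ones used in this report)
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://\\S+")  # URLs
corpus <- tm_map(corpus, toSpace, "\\S+@\\S+")            # email addresses
corpus <- tm_map(corpus, toSpace, "[@#]\\S+")             # Twitter users and hashtags

# Built-in tm transformations for punctuation and numbers
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
```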
## a government report published earlier this week tells how the secret service agents involved in the colombian prostitution scandal paid nine of the women that they took home from the bar on that wild evening
## however i miss the fun the buzz of a crowded shop on the last saturday before christmas meeting old friends at drunken book launches and having a good bitch with the publishers sales reps i dont think theres much fun in the book trade any more so perhaps i was lucky to get out while i did
In this transformation, multiple whitespaces are collapsed to a single blank. The operation is performed using the stripWhitespace() transformation:
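In code, this is a single tm_map() call:

```r
# Collapse runs of whitespace into a single blank
corpus <- tm_map(corpus, stripWhitespace)
```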
## a government report published earlier this week tells how the secret service agents involved in the colombian prostitution scandal paid nine of the women that they took home from the bar on that wild evening
## however i miss the fun the buzz of a crowded shop on the last saturday before christmas meeting old friends at drunken book launches and having a good bitch with the publishers sales reps i dont think theres much fun in the book trade any more so perhaps i was lucky to get out while i did
In this case, stop words (for the English language) are removed. The operation applied is removeWords(), fed with the word list provided by stopwords():
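A sketch of this step:

```r
# Remove English stop words ("the", "and", "of", ...)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
```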
## government report published earlier week tells secret service agents involved colombian prostitution scandal paid nine women took home bar wild evening
## however miss fun buzz crowded shop last saturday christmas meeting old friends drunken book launches good bitch publishers sales reps dont think theres much fun book trade perhaps lucky get
The capstone project aims to develop a word prediction app, and one is not interested in predicting swear words. For that reason, a profanity filtering task is necessary. Doing a little research on the Internet, one can find several lists of offensive words to be used for filtering. The list chosen for this task is the one provided in the article “A list of 723 bad words to blacklist & how to use Facebook’s moderation tool”, which can be downloaded from http://www.frontgatemedia.com/new/wp-content/uploads/2014/03/Terms-to-Block.csv. Next, a few samples from the list:
## [1] "fuk" "c-0-c-k" "retard" "gringo" "bod" "jerk0ff"
## [7] "cumshots" "organ" "chink" "fuckup"
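A minimal sketch of how the list might be applied, assuming it has already been downloaded locally as Terms-to-Block.csv and that, once parsed, the terms end up in a character vector (the exact parsing details of this particular CSV, such as header rows or quoting, may need adjustment):

```r
# Load the blacklist; the column selection below is an assumption about
# the CSV layout and may need to be adapted
profanity <- read.csv("Terms-to-Block.csv", stringsAsFactors = FALSE)[[1]]

# Remove the blacklisted terms from the corpus
corpus <- tm_map(corpus, removeWords, profanity)
```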
Stemming is the process of reducing inflected (or sometimes derived) words to their word stem or root, e.g. working and worked to work. By doing this, words with the same root are collapsed into their stem. This process can be performed using the stemDocument() operation.
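In code, this is again a single tm_map() call:

```r
# Reduce words to their stems (e.g. "working", "worked" -> "work")
corpus <- tm_map(corpus, stemDocument)
```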
Finally, the corpus has 562150 words, 455416 fewer than at the beginning.
This step starts with the creation of Document Term Matrices (DTMs), which allow one to find the occurrences of words in the corpus, that is, which words/combinations present the highest frequencies. For this, the DocumentTermMatrix() function from the tm package and N-gram tokenizers from the RWeka package are used. Specifically, three DTMs are built: for single words (1-grams), 2-grams and 3-grams. Once the DTMs are built, frequencies are calculated and sorted. As a result, the 10 most frequent words/combinations for each case are shown below.
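A sketch of how the DTMs and the sorted frequencies might be obtained (variable names are illustrative; converting the DTM to a dense matrix is affordable here only because of the 1% sample):

```r
library(RWeka)

# N-gram tokenizers built on RWeka
bigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# One DTM per n-gram order
dtm1 <- DocumentTermMatrix(corpus)
dtm2 <- DocumentTermMatrix(corpus, control = list(tokenize = bigramTokenizer))
dtm3 <- DocumentTermMatrix(corpus, control = list(tokenize = trigramTokenizer))

# Term frequencies, sorted in decreasing order
freq1 <- sort(colSums(as.matrix(dtm1)), decreasing = TRUE)
freq2 <- sort(colSums(as.matrix(dtm2)), decreasing = TRUE)
freq3 <- sort(colSums(as.matrix(dtm3)), decreasing = TRUE)
head(freq1, 10)
```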
| Word | Frequency |
|---|---|
| will | 3262 |
| one | 3166 |
| just | 3030 |
| get | 3026 |
| said | 3001 |
| like | 2980 |
| time | 2615 |
| can | 2505 |
| year | 2418 |
| day | 2300 |
| 2-Gram | Frequency |
|---|---|
| year old | 244 |
| last year | 238 |
| right now | 225 |
| cant wait | 204 |
| feel like | 186 |
| look like | 185 |
| new york | 183 |
| year ago | 165 |
| dont know | 162 |
| last night | 149 |
| 3-Gram | Frequency |
|---|---|
| cant wait see | 43 |
| happi mother day | 39 |
| let us know | 31 |
| feel like im | 21 |
| im pretti sure | 21 |
| new york citi | 21 |
| dont even know | 19 |
| happi new year | 17 |
| dream come true | 15 |
| gov chris christi | 15 |
Next, the following questions from Task 2 are answered:
Taking advantage of the word frequencies already computed (sorted in descending order), we can iterate over them and obtain the total number of word instances and unique words:
Word instances:
## [1] 407115
Unique words:
## [1] 2021
With this data, the number of unique words needed for a coverage of 50% of all word instances is 287, and for 90% it is 1338.
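A sketch of the coverage computation, assuming freq1 holds the sorted 1-gram frequencies obtained earlier:

```r
# Number of unique words needed to cover a given fraction of all word instances
coverage <- function(freqs, target) {
  cumFraction <- cumsum(freqs) / sum(freqs)
  which(cumFraction >= target)[1]
}

coverage(freq1, 0.50)  # unique words needed for 50% coverage
coverage(freq1, 0.90)  # unique words needed for 90% coverage
```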
The English language does not use special letters or accents (nothing beyond the ASCII characters), and this feature provides a way to detect non-English words. It can be achieved by detecting words containing accents/umlauts (e.g. ó, ü) or non-English letters (e.g. ß or Slavic letters), taking into account that these letters can be found in the character encoding ISO/IEC 8859-1 (also known as Latin-1); a special tag can be set when such a letter is detected. Next, an example of how this can be done:
library(stringr)
library(dplyr)

# Return a data frame with 2 columns, word and valid (TRUE for words in English, FALSE otherwise)
detectNonEnglishWords <- function(line) {
  # Converting a word from Latin-1 to ASCII tags every non-ASCII (i.e. non-English) letter
  convertWord <- function(word) iconv(word, 'ISO8859-1', 'ASCII', sub = '<NON_ENGLISH_LETTER>')
  isNotConvertedWord <- function(word) !str_detect(convertWord(word), '<NON_ENGLISH_LETTER>')
  # Split the line into words and flag each one
  wordsInLine <- str_split(line, boundary("word"))[[1]]
  wordsDF <- data.frame(word = wordsInLine)
  wordsDF <- wordsDF %>%
    rowwise() %>%
    mutate(valid = isNotConvertedWord(word))
  wordsDF
}
An example, applying the text ‘The Fußball is the King of Sports’ (using the German Fußball instead of the English Football):
originalText <- 'The Fußball is the King of Sports'
originalText
## [1] "The Fußball is the King of Sports"
detectNonEnglishWords('The Fußball is the King of Sports')
## # A tibble: 7 x 2
## # Rowwise:
## word valid
## <chr> <lgl>
## 1 The TRUE
## 2 Fußball FALSE
## 3 is TRUE
## 4 the TRUE
## 5 King TRUE
## 6 of TRUE
## 7 Sports TRUE
This function can be used for removing non-English words as well:
# Remove non-English words from a line of text
removeNonEnglishWords <- function(line) {
  wordsDF <- detectNonEnglishWords(line)
  filteredLine <- paste(wordsDF[wordsDF$valid == TRUE, 'word']$word, collapse = " ")
  filteredLine
}
originalText <- 'The Fußball is the King of Sports'
originalText
## [1] "The Fußball is the King of Sports"
removeNonEnglishWords('The Fußball is the King of Sports')
## [1] "The is the King of Sports"