Milestone Report Rubric

Abstract

The goal of this document is to show that I have become familiar with the data and that I am on track to create my prediction algorithm. It explains my exploratory analysis and my goals for the eventual app and algorithm.

The data sets are large, which has been more challenging than expected. The data is now cleaned.

The work ahead will focus on prediction and will be based on n-gram models and a backoff method.

Preface

The motivation for this project is to:

  • Demonstrate that I have downloaded the data and have successfully loaded it in.
  • Create a basic report of summary statistics about the data sets.
  • Report any interesting findings that I have amassed so far.
  • Get feedback on my plans for creating a prediction algorithm and Shiny app.

Understanding the problem

The first step in analyzing any new data set is figuring out:

  1. what data I have and

  2. what standard tools and models are used for that type of data.

Make sure I have downloaded the data from Coursera before starting the exercises. This exercise uses the files named LOCALE.blogs.txt, where LOCALE is each of the four locales en_US, de_DE, ru_RU and fi_FI. The data is from a corpus called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the corpora available. The files have been language filtered but may still contain some foreign text.

In this capstone I will be applying data science in the area of natural language processing. As a first step toward working on this project, I should familiarize myself with Natural Language Processing, Text Mining, and the associated tools in R. Here are some resources that may be helpful to me.

  • Natural language processing Wikipedia page
  • Text mining infrastructure in R
  • CRAN Task View: Natural Language Processing
  • Coursera course on NLP (not in R)

Dataset

This is the training data that will be the basis for most of the capstone. To start, I must download the data from the Coursera site and not from external websites.

Capstone Dataset

My original exploration of the data and modeling steps will be performed on this data set. Later in the capstone, if I find additional data sets that may be useful for building my model, I may use them.

Tasks to accomplish

  • Obtaining the data - Can I download the data and load/manipulate it in R? (A quick check of the downloaded files is sketched after this list.)
  • Familiarizing myself with NLP and text mining - Learn about the basics of natural language processing and how it relates to the data science process I have learned in the Data Science Specialization.
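
If the zip file has been unpacked into the working directory, a quick check (a sketch) lists the English files and their sizes; the final/en_US path matches the layout used in the loading code later in this report:

# A sketch: list the English files and report their sizes in MB
# (assumes the archive was unpacked into ./final/en_US)
enFiles <- list.files("final/en_US", full.names = TRUE)
data.frame(file = basename(enFiles),
           size_MB = round(file.info(enFiles)$size / 1024^2, 1))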

Questions to consider

  • What do the data look like?
  • Where do the data come from?
  • Can I think of any other data sources that might help me in this project?
  • What are the common steps in natural language processing?
  • What are some common issues in the analysis of text data?
  • What is the relationship between NLP and the concepts I have learned in the Specialization?

Data acquisition and cleaning

Large databases of text in a target language are commonly used when generating language models for various purposes. In this exercise, I will use the English database but may consider three other databases in German, Russian and Finnish.

The goal of this task is to get familiar with the databases and do the necessary cleaning. After this exercise, I should understand what real data looks like and how much effort I need to put into cleaning it. Before developing a model for a new language, the first step is to understand the language and its peculiarities with respect to the target application. I can learn to read, speak and write the language. Alternatively, I can study data and learn from existing information about the language through literature and the internet. At the very least, I need to understand how the language is written: writing script, existing input methods, some phonetic knowledge, etc.

Note that the data contain words of offensive and profane meaning. They are left there intentionally to highlight the fact that the developer has to deal with them.

Tasks to accomplish

  1. Tokenization - identifying appropriate tokens such as words, punctuation, and numbers, and writing a function that takes a file as input and returns a tokenized version of it.
  2. Profanity filtering - removing profanity and other words I do not want to predict.

  1. Loading the data in. This dataset is fairly large, and I don’t necessarily need to load the entire dataset in to build my algorithms (see point 2 below). At least initially, I might want to use a smaller subset of the data. Reading in chunks or lines using R’s readLines or scan functions can be useful. I can also loop over each line of text by embedding readLines within a for/while loop, but this may be slower than reading in large chunks at a time. Reading pieces of the file at a time requires the use of a file connection in R (see the sketch after this list).

  2. Sampling. To reiterate, to build models I don’t need to load in and use all of the data. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation to results that would be obtained using all the data. As covered in the inference class, a representative sample can be used to infer facts about a population. I might want to create a separate sub-sample dataset by reading in a random subset of the original data and writing it out to a separate file. That way, I can store the sample and not have to recreate it every time.
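
A minimal sketch of both points, reading the blogs file through a file connection in chunks and keeping a random subset of lines, then writing the sample out so it can be reused (the chunk size, sampling rate and output file name are illustrative):

# Read the file in chunks of 10000 lines through a connection,
# keeping each line with probability 0.05
con <- file("final/en_US/en_US.blogs.txt", "r")
keptLines <- character(0)
repeat {
    chunk <- readLines(con, n = 10000)
    if (length(chunk) == 0) break
    keptLines <- c(keptLines, chunk[rbinom(length(chunk), 1, 0.05) == 1])
}
close(con)
# Store the sample so it does not have to be recreated every time
writeLines(keptLines, "sample_en_US.txt")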

Loading libraries, setting seed and setting directory

library(tm)            # text cleaning: removeNumbers, removePunctuation, ...
library(XML)
library(wordcloud)     # word cloud plots
library(RColorBrewer)  # colour palettes for the word clouds
library(caret)
library(NLP)
library(openNLP)
library(RWeka)         # WordTokenizer and NGramTokenizer
library(qdap)          # sent_detect for sentence splitting
library(ggplot2)       # plotting
set.seed(9275)         # make the sampling reproducible

Loading files and doing pre-processing

If you want to repeat all the steps, you need to download the data and unpack it to the current data folder.

The full data set is quite big, more than 500 MB, and contains millions of sentences, so loading it may take a long time!

To make the report more “honest”, all the data was loaded first; I then sampled from each source and combined the samples.

setwd("H:/Aulas/Data Science/Módulo 10 - Data Science Capstone/Dados/Coursera-SwiftKey")
enBlogs <- readLines(paste(getwd(),"/final/en_US/en_US.blogs.txt",sep=""))
enNews <- readLines(paste(getwd(),"/final/en_US/en_US.news.txt",sep=""))
enTwitter <- readLines(paste(getwd(),"/final/en_US/en_US.twitter.txt",sep=""))
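
Basic summary statistics for the three sources can be computed directly from these vectors (a sketch; the exact numbers depend on the downloaded files):

# Line counts and longest line per source
data.frame(source = c("blogs", "news", "twitter"),
           lines = c(length(enBlogs), length(enNews), length(enTwitter)),
           maxChars = c(max(nchar(enBlogs)), max(nchar(enNews)), max(nchar(enTwitter))))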

# Draw 1000 random lines from each source and combine them
# (note: the combined object "sample" masks base::sample from here on)
sampleBlogs <- sample(enBlogs,1000)
sampleNews <- sample(enNews,1000)
sampleTwitter <- sample(enTwitter,1000)
sample <- c(sampleBlogs,sampleNews,sampleTwitter)
# Split the combined sample into sentences (qdap::sent_detect)
txt <- sent_detect(sample)
# Free memory: the full data sets are no longer needed
remove(sampleBlogs,sampleNews,sampleTwitter,enBlogs,enNews,enTwitter,sample)

Removing everything we do not need

txt <- removeNumbers(txt)                        # drop digits
txt <- removePunctuation(txt)                    # drop punctuation
txt <- stripWhitespace(txt)                      # collapse repeated whitespace
txt <- tolower(txt)                              # lower-case everything
txt <- txt[which(txt!="")]                       # drop empty strings
txt <- data.frame(txt,stringsAsFactors = FALSE)
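
Profanity filtering, listed in the tasks above, is not applied in this chunk. A minimal sketch using tm’s removeWords, assuming a local word list badwords.txt with one word per line (the file name is a placeholder):

# Hypothetical profanity list, one word per line
badwords <- readLines("badwords.txt")
txt$txt <- removeWords(txt$txt, badwords)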

Making ordered data frames of 1-grams, 2-grams, 3-grams

words <- WordTokenizer(txt)   # individual word tokens
grams <- NGramTokenizer(txt)  # n-grams of length 1 to 3 in one vector

# Locate where the 2-grams (index i) and the 1-grams (index j) start,
# assuming the 3-grams come first in the NGramTokenizer output
for(i in 1:length(grams)) {
    if(length(WordTokenizer(grams[i]))==2) break
}
for(j in 1:length(grams)) {
    if(length(WordTokenizer(grams[j]))==1) break
}

onegrams <- data.frame(table(words))
onegrams <- onegrams[order(onegrams$Freq, decreasing = TRUE),]
bigrams <- data.frame(table(grams[i:(j-1)]))
bigrams <- bigrams[order(bigrams$Freq, decreasing = TRUE),]
trigrams <- data.frame(table(grams[1:(i-1)]))
trigrams <- trigrams[order(trigrams$Freq, decreasing = TRUE),]
remove(i,j,grams)
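
The index arithmetic above depends on the order in which NGramTokenizer returns its output. An alternative (a sketch) is to request each n-gram size explicitly with Weka_control; the object names are chosen to avoid overwriting the tables above:

# Build each n-gram table directly instead of splitting one combined vector
bigramTokens <- NGramTokenizer(txt, Weka_control(min = 2, max = 2))
trigramTokens <- NGramTokenizer(txt, Weka_control(min = 3, max = 3))
bigrams2 <- data.frame(table(bigramTokens))
trigrams2 <- data.frame(table(trigramTokens))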

Some words are more frequent than others - what are the distributions of word frequencies?

wordcloud(words, scale=c(5,0.1), max.words=100, random.order=FALSE, 
          rot.per=0.5, use.r.layout=FALSE, colors=brewer.pal(8,"Accent"))

wordcloud(onegrams$words, onegrams$Freq, scale=c(5,0.5), max.words=300, 
          random.order=FALSE, rot.per=0.5, use.r.layout=FALSE, 
          colors=brewer.pal(8,"Accent"))

The first graph shows the distribution of words in the corpora excluding very common words such as “the”, “a”, “of”, “to”, etc. The second graph shows the distribution of all single words. The frequencies range from 1 to 3796.
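
Since ggplot2 is already loaded, the most frequent unigrams can also be shown as a simple bar chart (a sketch using the onegrams table built above):

# Bar chart of the 20 most frequent unigrams in the sample
top20 <- head(onegrams, 20)
ggplot(top20, aes(x = reorder(words, Freq), y = Freq)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    coord_flip() +
    labs(x = "Word", y = "Frequency")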

What are the frequencies of 2-grams and 3-grams in the dataset?

barplot(bigrams[1:20,2],col="red",
        names.arg = bigrams$Var1[1:20],srt = 45,
        space=0.1, xlim=c(0,20),las=2)

barplot(trigrams[1:20,2],col="blue",
        names.arg = trigrams$Var1[1:20],srt = 45,
        space=0.1, xlim=c(0,20),las=2)

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

# Walk down the frequency-sorted unigrams until 50% of all
# word instances in the sample are covered
sumCover <- 0
for(i in 1:length(onegrams$Freq)) {
    sumCover <- sumCover + onegrams$Freq[i]
    if(sumCover >= 0.5*sum(onegrams$Freq)){break}
}
print(i)
## [1] 146
# Repeat for 90% coverage
sumCover <- 0
for(i in 1:length(onegrams$Freq)) {
    sumCover <- sumCover + onegrams$Freq[i]
    if(sumCover >= 0.9*sum(onegrams$Freq)){break}
}
print(i)
## [1] 5532

Based on this sample, 146 words are needed to cover 50% of all word instances and 5,532 words are needed to cover 90% of all word instances in the language.
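
The same cut-offs can be computed more concisely with cumsum (a sketch equivalent to the loops above):

# Cumulative share of word instances covered by the frequency-sorted unigrams
coverage <- cumsum(onegrams$Freq) / sum(onegrams$Freq)
which(coverage >= 0.5)[1]   # words needed for 50% coverage
which(coverage >= 0.9)[1]   # words needed for 90% coverage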

How do you evaluate how many of the words come from foreign languages?

It seems to me that the best way is to compare the text with a well-known dictionary. This is also a way to remove “rude” words. Nevertheless, such foreign words are rare and their impact is small, so I do not need to take them into account here.
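
As a rough sketch of that comparison, assuming english_words is a character vector holding an English word list (loaded, for example, from a local dictionary file):

# Share of tokens in the sample that do not appear in the dictionary
# (english_words is a placeholder for whatever word list is used)
foreignShare <- mean(!(tolower(words) %in% english_words))
foreignShare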

Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

  1. Prediction based on the location (traditions, holidays, names, places, etc.)
  2. Learning the writing style of the author
  3. Using an additional dictionary of n-grams: first, remove low-frequency words from the dictionary, then use the rest for better prediction of n-grams (a minimal sketch of this pruning step follows).
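
A minimal sketch of the pruning step from item 3, dropping n-grams that occur only once before building the prediction tables (the threshold is illustrative):

# Keep only n-grams seen more than once
onegramsPruned <- onegrams[onegrams$Freq > 1, ]
bigramsPruned <- bigrams[bigrams$Freq > 1, ]
trigramsPruned <- trigrams[trigrams$Freq > 1, ]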