Milestone Report Rubric

Abstract

The goal of this document is to show that I have become comfortable working with the data and that I am on track to create my prediction algorithm. It explains my exploratory analysis and my goals for the eventual app and algorithm.

The data sets are large, which has been more challenging than expected. The data is now cleaned.

The remaining work will focus on prediction and will be based on n-gram models and a backoff method.

Preface

The motivation for this project is to:

  • Demonstrate that I have downloaded the data and have successfully loaded it in.
  • Create a basic report of summary statistics about the data sets.
  • Report any interesting findings that I have amassed so far.
  • Get feedback on my plans for creating a prediction algorithm and Shiny app.

Understanding the problem

The first step in analyzing any new data set is figuring out:

  1. what data I have and

  2. which standard tools and models are used for that type of data.

I made sure I had downloaded the data from Coursera before starting the exercises. This exercise uses the files named LOCALE.blogs.txt, where LOCALE is each of the four locales en_US, de_DE, ru_RU and fi_FI. The data is from a corpus called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the corpora available. The files have been language filtered but may still contain some foreign text.

In this capstone I will be applying data science in the area of natural language processing. As a first step toward working on this project, I should familiarize myself with Natural Language Processing, Text Mining, and the associated tools in R. Here are some resources that may be helpful to me.

  • Natural language processing Wikipedia page
  • Text mining infrastructure in R
  • CRAN Task View: Natural Language Processing
  • Coursera course on NLP (not in R)

Dataset

This is the training data that will be the basis for most of the capstone. To start, I must download the data from the Coursera site and not from external websites.

Capstone Dataset

My original exploration of the data and modeling steps will be performed on this data set. Later in the capstone, if I find additional data sets that may be useful for building my model, I may use them.

Tasks to accomplish

  • Obtaining the data - Can I download the data and load/manipulate it in R?
  • Familiarizing myself with NLP and text mining - Learn about the basics of natural language processing and how it relates to the data science process I have learned in the Data Science Specialization.

Questions to consider

  1. What do the data look like?
  2. Where do the data come from?
  3. Can you think of any other data sources that might help you in this project?
  4. What are the common steps in natural language processing?
  5. What are some common issues in the analysis of text data?
  6. What is the relationship between NLP and the concepts you have learned in the Specialization?

Data acquisition and cleaning

Large databases of text in a target language are commonly used when generating language models for various purposes. In this exercise, I will use the English database but may consider the three other databases in German, Russian and Finnish.

The goal of this task is to get familiar with the databases and do the necessary cleaning. After this exercise, I should understand what real data looks like and how much effort I need to put into cleaning it. When I start working on a new language, the first thing to do is to understand the language and its peculiarities with respect to my target. I can learn to read, speak and write the language. Alternatively, I can study data and learn from existing information about the language through literature and the internet. At the very least, I need to understand how the language is written: writing script, existing input methods, some phonetic knowledge, etc.

Note that the data contain words of offensive and profane meaning. They are left there intentionally to highlight the fact that the developer has to deal with them.

Tasks to accomplish

  1. Tokenization - identifying appropriate tokens such as words, punctuation, and numbers, and writing a function that takes a file as input and returns a tokenized version of it (a sketch follows below).
  2. Profanity filtering - removing profanity and other words I do not want to predict.
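As a first pass at the tokenization task, a minimal sketch of such a function could look like the following; the helper name tokenizeFile is my own, for illustration only, and a real version would need to handle punctuation and encodings more carefully.

# Minimal sketch: read a file and return a vector of lowercase word tokens.
# tokenizeFile is a hypothetical helper name used only for illustration.
tokenizeFile <- function(path, n = -1L) {
    lines <- tolower(readLines(path, n = n, skipNul = TRUE))  # read up to n lines and lowercase them
    tokens <- unlist(strsplit(lines, "[^a-z']+"))             # split on anything that is not a letter or apostrophe
    tokens[tokens != ""]                                      # drop empty strings left by the split
}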

  1. Loading the data in. This dataset is fairly large, and I don’t necessarily need to load the entire dataset in to build my algorithms (see point 2 below). At least initially, I might want to use a smaller subset of the data. Reading in chunks or lines using R’s readLines or scan functions can be useful. I can also loop over each line of text by embedding readLines within a for/while loop, but this may be slower than reading in large chunks at a time. Reading pieces of the file at a time will require the use of a file connection in R.

  2. Sampling. To reiterate, to build models I don’t need to load in and use all of the data. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation to results that would be obtained using all the data. I recall from my inference class how a representative sample can be used to infer facts about a population. I might want to create a separate sub-sample dataset by reading in a random subset of the original data and writing it out to a separate file. That way, I can store the sample and not have to recreate it every time (a sketch of this follows after this list).
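As a rough sketch of points 1 and 2 above (the chunk size, the 5% sampling rate and the output file name sample_en_US.txt are my own illustrative choices), I could read the Twitter file through a connection in chunks and keep a random subset of lines:

# Sketch: read the file in chunks through a connection and keep roughly 5% of the lines.
set.seed(1234)                                                # make the sample reproducible
conSample <- file("en_US.twitter.txt", "r")
out <- file("sample_en_US.txt", "w")
repeat {
    chunk <- readLines(conSample, n = 10000, skipNul = TRUE)  # read 10,000 lines at a time
    if (length(chunk) == 0) break                             # stop at the end of the file
    keep <- rbinom(length(chunk), 1, 0.05) == 1               # flip a biased coin for each line
    writeLines(chunk[keep], out)                              # write the kept lines to the sample file
}
close(conSample); close(out)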

Setting the working directory and opening file connections

setwd("H:/Aulas/Data Science/Módulo 10 - Data Science Capstone/Dados/Coursera-SwiftKey/final/en_US/")

con <- file("en_US.twitter.txt", "r")   # Twitter data
con2 <- file("en_US.blogs.txt", "r")    # blogs data
con3 <- file("en_US.news.txt", "r")     # news data
badwords <- file("badwords.txt")        # profanity word list

Loading libraries

library(stringi)
library(ggplot2)
library(magrittr)
library(markdown)
library(RWeka)
library(openNLP)
library(wordcloud)
library(tm)
library(NLP)
library(qdap)

Twitter dataset

setwd("H:/Aulas/Data Science/Módulo 10 - Data Science Capstone/Dados/Coursera-SwiftKey/final/en_US/")
system("wc -l en_US.twitter.txt")
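Since my data folder sits on a Windows drive, the wc call above may not be available; a portable (if slower and more memory-hungry) alternative is to count the lines directly in R:

length(readLines("en_US.twitter.txt", skipNul = TRUE))   # portable line count in R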

Small sample data sets from the corpora

setwd("H:/Aulas/Data Science/Módulo 10 - Data Science Capstone/Dados/Coursera-SwiftKey/final/en_US/")
fewTwitter <- readLines(con, 4000)    # first 4,000 lines of the Twitter data
fewBlogs <- readLines(con2, 4000)     # first 4,000 lines of the blogs data
fewNews <- readLines(con3, 4000)      # first 4,000 lines of the news data
fewData <- c(fewTwitter, fewBlogs, fewNews)   # combine the three samples into one character vector

I identify appropriate tokens such as words, punctuation, and numbers, and I remove profane words. I use the tm package to clean the data. I use a list of profanity words to filter out, downloaded from GitHub (https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en). I break the text lines into sentences, as needed when building bi-grams.

fewData <- sent_detect(fewData, language = "en", model = NULL)   # split the combined text into sentences

I built a clean main corpus.

corpus <- VCorpus(VectorSource(fewData)) # Build the main corpus
corpus <- tm_map(corpus, removeNumbers) # remove numbers
corpus <- tm_map(corpus, stripWhitespace) # remove whitespaces
corpus <- tm_map(corpus, content_transformer(tolower)) #lowercase all contents
corpus <- tm_map(corpus, removePunctuation) # remove punctuation

Removing the bad words

profanewords <- readLines(badwords)                    # read the profanity list from the file connection
corpus <- tm_map(corpus, removeWords, profanewords)    # remove the profane words from the corpus

Converting the corpus to a data frame

cleanData<-data.frame(text=unlist(sapply(corpus, `[`, "content")), stringsAsFactors=F)

Single-word tokenization and bi-gram and tri-gram sets for analysis with RWeka.

singletok <- NGramTokenizer(cleanData$text, Weka_control(min = 1, max = 1))
bitok <- NGramTokenizer(cleanData$text, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
tritok <- NGramTokenizer(cleanData$text, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))
bitritok <- c(tritok, bitok)   # combined bi- and tri-gram set

Exploratory analysis

The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships I observe in the data and prepare to build my first linguistic models.

Tasks to accomplish

  1. Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
  2. Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

Questions to consider

  1. Some words are more frequent than others - what are the distributions of word frequencies?
  2. What are the frequencies of 2-grams and 3-grams in the dataset?
  3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
  4. How do you evaluate how many of the words come from foreign languages? (a rough check is sketched after this list)
  5. Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?
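For question 4, a rough check I could run (only a sketch, using the stringi package loaded above) is to flag tokens that contain non-ASCII characters as candidate foreign words:

nonAscii <- !stri_enc_isascii(singletok)      # tokens containing non-ASCII characters
sum(nonAscii) / length(singletok)             # proportion of tokens that are candidate foreign words
head(singletok[nonAscii])                     # inspect a few examples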

I prepare data frames of the tokens ordered by frequency.

single <- data.frame(table(singletok))
bitoke <- data.frame(table(bitok))
tritoke <- data.frame(table(tritok))
singlesort <- single[order(single$Freq,decreasing = TRUE),]
bitoksort <- bitoke[order(bitoke$Freq,decreasing = TRUE),]
tritoksort <- tritoke[order(tritoke$Freq,decreasing = TRUE),]
singleFrec <- singlesort[1:15,]
colnames(singleFrec) <- c("Word","Frequency")
bitoksortFrec<- bitoksort[1:15,]
colnames(bitoksortFrec) <- c("Word","Frequency")
tritoksortFrec <- tritoksort[1:15,]
colnames(tritoksortFrec) <- c("Word","Frequency")

Top 15 single words by frequency (displayed in alphabetical order).

The first analysis we will perform is graphical. This will show us which words are the most frequent and what their frequencies are. ggplot2 was used to plot the frequencies.

ggplot(singleFrec, aes(x=Word, y=Frequency)) + geom_bar(stat="identity", fill="red", colour = "red") + geom_text(aes(label=Frequency), vjust=-0.1) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Top 15 bi-grams by frequency (displayed in alphabetical order).

Next, we will do the same for bi-grams, i.e. two word combinations. We follow exactly the same process, but this time the tokenizer uses n = 2.

ggplot(bitoksortFrec, aes(x=Word, y=Frequency)) + geom_bar(stat="identity", fill="lightblue", colour = "blue") + geom_text(aes(label=Frequency), vjust=-0.1) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Top 15 tri-grams by frequency (displayed in alphabetical order).

Finally, we will follow exactly the same process for trigrams, i.e. three word combinations.

ggplot(tritoksortFrec, aes(x=Word, y=Frequency)) + geom_bar(stat="identity", fill="lightyellow", colour = "yellow") + geom_text(aes(label=Frequency), vjust=-0.1) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

I made a series from 10% to 90% to find how many words are needed to cover each percentage of all word instances, and I built a coverage function.

woperc <- function(percentage) {
    # Return how many of the most frequent words are needed to cover
    # the given proportion (e.g. 0.5) of all word instances.
    totalwords <- sum(singlesort$Freq)
    percent = 0
    cumsum = 0
    i = 1
    while (percent < percentage)
    {
        cumsum = cumsum + singlesort$Freq[i]
        percent = cumsum/totalwords
        i = i + 1
    }
    return(i - 1)   # i was incremented once more after the last word was added
}

Also, I made a plot showing how many words, taken in decreasing order of frequency, are needed to reach each coverage percentage.

percents <- c(10,20,30,40,50,60,70,80,90)
timeswordsAppears <- c(woperc(0.1), woperc(0.2), woperc(0.3), woperc(0.4), woperc(0.5), woperc(0.6), woperc(0.7), woperc(0.8), woperc(0.9))
qplot(percents, timeswordsAppears, geom=c("line","point")) + geom_text(aes(label=timeswordsAppears), hjust=1.35, vjust=-0.1) + scale_x_continuous(breaks=c(10,20,30,40,50,60,70,80,90), labels=c(10,20,30,40,50,60,70,80,90))

Conclusions

The objective of this report was to build my first simple model of the relationship between words. This was the first step in building a predictive text mining application. I will start with simple models and explore more complicated modeling techniques in the future.

Using the exploratory analysis I performed, I will now build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words. In addition, I will build a model to handle unseen n-grams.
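As a sketch of the planned backoff step (the function name predictNext and the direct use of the sorted frequency tables built above are illustrative choices, not the final implementation, and no smoothing is applied), the idea is to look for a matching tri-gram first, fall back to a bi-gram, and finally to the most frequent single word:

# Sketch of a naive backoff predictor built on the frequency tables above.
predictNext <- function(w1, w2) {
    tri <- tritoksort[grepl(paste0("^", w1, " ", w2, " "), tritoksort$tritok), ]   # tri-grams starting with "w1 w2"
    if (nrow(tri) > 0) return(sub(paste0("^", w1, " ", w2, " "), "", tri$tritok[1]))
    bi <- bitoksort[grepl(paste0("^", w2, " "), bitoksort$bitok), ]                # back off to bi-grams starting with "w2"
    if (nrow(bi) > 0) return(sub(paste0("^", w2, " "), "", bi$bitok[1]))
    as.character(singlesort$singletok[1])                                          # back off to the most frequent single word
}
predictNext("thanks", "for")   # example call with two lowercase context words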