Milestone Report Rubric

Abstract

This document shows that I have become familiar with the data and that I am on track to create my prediction algorithm. It explains my exploratory analysis and my goals for the eventual app and algorithm.

The data sets are large, which has made the work more challenging than expected. The data have now been cleaned.

The prediction work will be based on n-grams and a backoff method.

Preface

The motivation for this project is to:

  1. Demonstrate that I have downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that I have amassed so far.
  4. Get feedback on my plans for creating a prediction algorithm and Shiny app.

Understanding the problem

The first step in analyzing any new data set is figuring out (a) what data I have and (b) what the standard tools and models used for that type of data are. I must make sure I have downloaded the data from Coursera before starting the exercises. This exercise uses the files named LOCALE.blogs.txt, where LOCALE is each of the four locales en_US, de_DE, ru_RU and fi_FI. The data is from a corpus called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the corpora available. The files have been language filtered but may still contain some foreign text.

In this capstone I will be applying data science in the area of natural language processing. As a first step toward working on this project, I should familiarize myself with Natural Language Processing, Text Mining, and the associated tools in R. Here are some resources that may be helpful to me.

  • Natural language processing Wikipedia page
  • Text mining infrastructure in R
  • CRAN Task View: Natural Language Processing
  • Coursera course on NLP (not in R)

Dataset

This is the training data to get me started; it will be the basis for most of the capstone. To start, I must download the data from the Coursera site and not from external websites.

Capstone Dataset: my original exploration of the data and modeling steps will be performed on this data set. Later in the capstone, if I find additional data sets that may be useful for building my model, I may use them.

Tasks to accomplish

  • Obtaining the data - Can I download the data and load/manipulate it in R?
  • Familiarizing myself with NLP and text mining - Learn about the basics of natural language processing and how it relates to the data science process I have learned in the Data Science Specialization.

Questions to consider

  1. What do the data look like?
  2. Where do the data come from?
  3. Can you think of any other data sources that might help you in this project?
  4. What are the common steps in natural language processing?
  5. What are some common issues in the analysis of text data?
  6. What is the relationship between NLP and the concepts you have learned in the Specialization?

Data acquisition and cleaning

Large databases comprising text in a target language are commonly used when generating language models for various purposes. In this exercise, I will use the English database but may consider the three other databases in German, Russian and Finnish.

The goal of this task is to get familiar with the databases and do the necessary cleaning. After this exercise, I should understand what real data looks like and how much effort I need to put into cleaning it. When I start working with a new language, the first step is to understand the language and its peculiarities with respect to my target. I can learn to read, speak and write the language. Alternatively, I can study data and learn from existing information about the language through literature and the internet. At the very least, I need to understand how the language is written: writing script, existing input methods, some phonetic knowledge, etc.

Note that the data contain words of offensive and profane meaning. They are left there intentionally to highlight the fact that the developer has to work on them.

Tasks to accomplish

  • Tokenization - identifying appropriate tokens such as words, punctuation, and numbers. Writing a function that takes a file as input and returns a tokenized version of it.
  • Profanity filtering - removing profanity and other words I do not want to predict.
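
A minimal sketch of such a function in base R, under simplifying assumptions: it keeps only lowercase word tokens (numbers and punctuation are dropped rather than kept as tokens) and removes any token found in a profanity vector. The name tokenize_file and its arguments are my own illustration, not part of any package.

```r
# Hypothetical helper: tokenize a text file into lowercase word tokens
# and drop any token that appears in a profanity list.
tokenize_file <- function(path, profanity = character(0)) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  lines <- tolower(lines)
  # Keep letters and apostrophes; every other run of characters
  # (digits, punctuation, whitespace) becomes a single separator.
  lines <- gsub("[^a-z']+", " ", lines)
  tokens <- unlist(strsplit(lines, " ", fixed = TRUE))
  tokens <- tokens[nzchar(tokens)]     # drop empty strings
  tokens[!tokens %in% profanity]       # profanity filter
}

# Usage on a small temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("Go Cubs Go!", "That darn game, 5 to 3."), tmp)
tokens <- tokenize_file(tmp, profanity = c("darn"))
print(tokens)
```

Filtering token by token only catches single-word entries; multi-word entries in a profanity list would need a separate pass.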

  1. Loading the data in.
    This dataset is fairly large. I don’t necessarily need to load the entire dataset to build my algorithms (see point 2 below). At least initially, I might want to use a smaller subset of the data. Reading in chunks or lines using R’s readLines or scan functions can be useful. I can also loop over each line of text by embedding readLines within a for/while loop, but this may be slower than reading in large chunks at a time. Reading pieces of the file at a time requires the use of a file connection in R.

  2. Sampling. To reiterate, I don’t need to load and use all of the data to build models. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation to the results that would be obtained using all the data. I recall from my inference class how a representative sample can be used to infer facts about a population. I might want to create a separate sub-sample dataset by reading in a random subset of the original data and writing it out to a separate file. That way, I can store the sample and not have to recreate it every time.
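
The two points above can be sketched together in base R: read the file in chunks through a file connection and keep each line with a fixed probability. The function name sample_lines, the default chunk size and the seed are assumptions of mine for illustration; the temporary file stands in for a real corpus file such as en_US.twitter.txt.

```r
# Hypothetical helper: read a large text file in chunks through a file
# connection and keep each line with probability `prob`.
sample_lines <- function(path, prob = 0.05, chunk_size = 10000, seed = 1234) {
  set.seed(seed)                       # make the sample reproducible
  con <- file(path, "r")
  on.exit(close(con))                  # always release the connection
  kept <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0) break      # end of file reached
    keep <- rbinom(length(chunk), size = 1, prob = prob) == 1
    kept <- c(kept, chunk[keep])
  }
  kept
}

# Usage on a small temporary file standing in for a corpus file
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:1000), tmp)
sampled <- sample_lines(tmp, prob = 0.1, chunk_size = 200)

# Store the sample so it does not have to be recreated every time
out_file <- tempfile(fileext = ".txt")
writeLines(sampled, out_file)
```

Sampling line by line keeps memory use bounded by the chunk size, at the cost of not knowing the exact sample size in advance.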

Open connection and download the files

# fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip?accessType=DOWNLOAD"

# download.file(fileUrl, 
# destfile = "/Users/administrador/Specialization/capstone/SwiftKey.zip", 
# method = "curl")
list.files("C:/capstone")
## [1] "capstone002.html"  "capstone002.Rmd"   "capstone002_files"
## [4] "final"             "SwiftKey.zip"
dateDownloaded <- date()
dateDownloaded
## [1] "Mon Mar 30 22:47:09 2015"

Set the path, unzip the file and list its contents

setwd("C:/capstone/")
unzip("SwiftKey.zip")
unzip("SwiftKey.zip", list =TRUE)
##                             Name    Length                Date
## 1                         final/         0 2014-07-22 10:10:00
## 2                   final/de_DE/         0 2014-07-22 10:10:00
## 3  final/de_DE/de_DE.twitter.txt  75578341 2014-07-22 10:11:00
## 4    final/de_DE/de_DE.blogs.txt  85459666 2014-07-22 10:11:00
## 5     final/de_DE/de_DE.news.txt  95591959 2014-07-22 10:11:00
## 6                   final/ru_RU/         0 2014-07-22 10:10:00
## 7    final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8     final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9  final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10                  final/en_US/         0 2014-07-22 10:10:00
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12    final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13   final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14                  final/fi_FI/         0 2014-07-22 10:10:00
## 15    final/fi_FI/fi_FI.news.txt  94234350 2014-07-22 10:11:00
## 16   final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17 final/fi_FI/fi_FI.twitter.txt  25331142 2014-07-22 10:10:00

Connecting the data

      setwd("C:/capstone/final/en_US")
      list.files("C:/capstone/final/en_US")
## [1] "badwords.txt"      "en_US.blogs.txt"   "en_US.news.txt"   
## [4] "en_US.twitter.txt"
      con <- file("en_US.twitter.txt", "r")
      con2 <- file("en_US.blogs.txt", "r")
      con3 <- file("en_US.news.txt", "r")
      badwords <- file("badwords.txt")

Other connections

# con <- file(file.choose(), "r")

Exploring the data

 readLines(con,1)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
 readLines(con,5)
## [1] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [2] "they've decided its more fun if I don't."                                                                       
## [3] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                           
## [4] "Words from a complete stranger! Made my birthday even better :)"                                                
## [5] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
# close(con) ## close  connection
# unlink("en_US.twitter.txt") ### Delete the file connected

Loading necessary libraries

library(stringi)
library(ggplot2)
library(magrittr)
library(markdown)
library(RWeka)
library(openNLP)
library(wordcloud)
## Loading required package: RColorBrewer
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(NLP)
library(qdap)
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## Loading required package: qdapTools
## WARNING: Rtools is required to build R packages, but no version of Rtools compatible with R 3.1.3 was found. (Only the following incompatible version(s) of Rtools were found:3.3)
## 
## Please download and install Rtools 3.1 from http://cran.r-project.org/bin/windows/Rtools/ and then run find_rtools().
## 
## Attaching package: 'qdap'
## 
## The following objects are masked from 'package:tm':
## 
##     as.DocumentTermMatrix, as.TermDocumentMatrix
## 
## The following object is masked from 'package:magrittr':
## 
##     %>%
## 
## The following object is masked from 'package:base':
## 
##     Filter

Dataset Twitter

system("wc -l en_US.twitter.txt")
## Warning: running command 'wc -l en_US.twitter.txt' had status 127
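
Status 127 usually means the shell could not find the wc command, which is common on Windows. A portable alternative is to compute the same kind of summary statistics directly in R. The sketch below is a minimal version using base R only; file_summary is a hypothetical helper name, and the word count is a rough one based on splitting at whitespace.

```r
# Hypothetical helper: basic summary statistics for one text file,
# computed in R so that no external `wc` command is needed.
file_summary <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
  # Rough word count: split each line on runs of whitespace.
  words_per_line <- lengths(strsplit(trimws(lines), "[[:space:]]+"))
  data.frame(
    file = basename(path),
    size_mb = file.info(path)$size / 1024^2,
    lines = length(lines),
    words = sum(words_per_line),
    longest_line = max(nchar(lines, type = "chars", allowNA = TRUE), na.rm = TRUE),
    stringsAsFactors = FALSE
  )
}

# Usage on a small temporary file standing in for en_US.twitter.txt
tmp <- tempfile(fileext = ".txt")
writeLines(c("How are you?", "Go Cubs Go!"), tmp)
stats <- file_summary(tmp)
print(stats)
```

For the full corpus files this reads everything into memory at once, so it may be preferable to combine it with chunked reading.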

Corpora Tiny Datasets

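fewTwitter <- readLines(con, 4000)   # next 4000 lines from the open Twitter connection
fewBlogs <- readLines(con2, 4000)    # first 4000 lines of the blogs file
fewNews <- readLines(con3, 4000)     # first 4000 lines of the news file
# Stack the three samples into one character vector; paste() would instead
# glue line i of each source together into a single string.
fewData <- c(fewTwitter, fewBlogs, fewNews)
#badwords2 <- readLines(badwords)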

I identify appropriate tokens such as words, punctuation, and numbers, and I remove profane words. I use the tm package to clean the data. I downloaded a list of profane words to filter out from GitHub (https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en). I break the text lines into sentences, so that n-grams such as bi-grams are built within sentences.

fewData <- sent_detect(fewData, language = "en", model = NULL)

I built the clean main corpus

corpus <- VCorpus(VectorSource(fewData)) # Build the main corpus
corpus <- tm_map(corpus, removeNumbers) # remove numbers
corpus <- tm_map(corpus, stripWhitespace) # remove whitespaces
corpus <- tm_map(corpus, content_transformer(tolower)) #lowercase all contents
corpus <- tm_map(corpus, removePunctuation) # remove special characters

Removing the bad words

# removeWords needs a character vector, so read the words from the badwords connection first
profanewords <- readLines(badwords, warn = FALSE)
corpus <- tm_map(corpus, removeWords, profanewords)

Converting the corpus to a data frame for use with the RWeka package

cleanData <- data.frame(text = unlist(sapply(corpus, `[`, "content")), stringsAsFactors = FALSE)

Single-word tokens, bi-gram sets and tri-gram sets for analysis, built with RWeka.

# Tokenize the text column; each delimiters string must stay on a single line
singletok <- NGramTokenizer(cleanData$text, Weka_control(min = 1, max = 1))
bitok <- NGramTokenizer(cleanData$text, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
tritok <- NGramTokenizer(cleanData$text, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))
bitritok <- paste(tritok,bitok)
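
To make clear what these tokenizers produce, here is a small base-R sketch that builds n-grams from a vector of word tokens by sliding a window of n words. It is an illustration only, not how RWeka is implemented, and make_ngrams is a hypothetical name.

```r
# Hypothetical helper: build n-grams from a vector of word tokens by
# sliding a window of n consecutive words over the vector.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens <- c("thanks", "for", "the", "rt")
bigrams <- make_ngrams(tokens, 2)   # "thanks for" "for the" "the rt"
trigrams <- make_ngrams(tokens, 3)  # "thanks for the" "for the rt"
print(bigrams)
print(trigrams)
```

Applying the helper to one sentence at a time keeps n-grams from spanning sentence boundaries.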

Exploratory analysis

The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships I observe in the data and prepare to build my first linguistic models.

Tasks to accomplish

  1. Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
  2. Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

Questions to consider

  1. Some words are more frequent than others - what are the distributions of word frequencies?
  2. What are the frequencies of 2-grams and 3-grams in the dataset?
  3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
  4. How do you evaluate how many of the words come from foreign languages?
  5. Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?
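
For question 4, one crude heuristic I could try is to flag every token that contains a character outside a-z and the apostrophe, and report the share of flagged tokens. This is only a rough proxy under my own assumptions: it also flags digits and accented English loanwords, and it misses foreign words written in plain ASCII. The function name non_english_share is hypothetical.

```r
# Hypothetical heuristic: share of tokens containing any character other
# than a-z or an apostrophe, as a rough proxy for foreign-language words.
non_english_share <- function(tokens) {
  tokens <- tolower(tokens)
  suspicious <- grepl("[^a-z']", tokens)  # TRUE when a token looks non-English
  mean(suspicious)
}

tokens <- c("the", "caf\u00e9", "game", "\u00fcber", "cubs")
share <- non_english_share(tokens)  # 2 of the 5 tokens are flagged
print(share)
```

A dictionary lookup against an English word list would likely be more accurate, at the cost of needing an extra data source.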

I prepare data frames of words and n-grams sorted by frequency.

single <- data.frame(table(singletok))
bitoke <- data.frame(table(bitok))
tritoke <- data.frame(table(tritok))
singlesort <- single[order(single$Freq,decreasing = TRUE),]
bitoksort <- bitoke[order(bitoke$Freq,decreasing = TRUE),]
tritoksort <- tritoke[order(tritoke$Freq,decreasing = TRUE),]
singleFrec <- singlesort[1:15,]
colnames(singleFrec) <- c("Word","Frequency")
bitoksortFrec<- bitoksort[1:15,]
colnames(bitoksortFrec) <- c("Word","Frequency")
tritoksortFrec <- tritoksort[1:15,]
colnames(tritoksortFrec) <- c("Word","Frequency")

The 15 most frequent single words, shown in alphabetical order.

ggplot(singleFrec, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "red", colour = "pink") + geom_text(aes(label = Frequency), vjust = -0.1) + theme(axis.text.x = element_text(angle = 45, hjust = 2))

The 15 most frequent bi-grams, shown in alphabetical order.

ggplot(bitoksortFrec, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "blue", colour = "green") + geom_text(aes(label = Frequency), vjust = -0.1) + theme(axis.text.x = element_text(angle = 45, hjust = 2))

The 15 most frequent tri-grams, shown in alphabetical order.

ggplot(tritoksortFrec, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "yellow", colour = "black") + geom_text(aes(label = Frequency), vjust = -0.1) + theme(axis.text.x = element_text(angle = 45, hjust = 2))

I built a function that finds how many of the most frequent words are needed to cover a given percentage of all word occurrences, and I evaluated it for a series from 10% to 90%.

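woperc <- function(percentage) {
  # Number of top-ranked words needed to cover `percentage` (between 0 and 1)
  # of all word occurrences in singlesort.
  totalwords <- sum(singlesort$Freq)
  percent <- 0
  running <- 0   # running total; named so it does not mask base::cumsum
  i <- 0
  while (percent < percentage) {
    i <- i + 1
    running <- running + singlesort$Freq[i]
    percent <- running / totalwords
  }
  return(i)      # count of words used, not the index of the next word
}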
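
The same idea can be written without an explicit loop by taking a cumulative sum over the sorted frequencies. The sketch below is a standalone variant that takes a frequency vector as an argument instead of reading singlesort; the name words_for_coverage and the toy counts are my own illustration.

```r
# Hypothetical vectorized variant: number of top-ranked words needed to
# cover a given share (between 0 and 1) of all word occurrences.
words_for_coverage <- function(freq, share) {
  freq <- sort(freq, decreasing = TRUE)
  covered <- cumsum(freq) / sum(freq)    # cumulative share after each word
  unname(which(covered >= share)[1])     # first rank that reaches the target
}

# Toy frequency table: cumulative shares are 0.50, 0.80, 0.95, 1.00
freq <- c(the = 50, to = 30, and = 15, cubs = 5)
words_for_coverage(freq, 0.5)  # 1
words_for_coverage(freq, 0.9)  # 3
```

With the real data, the call would be along the lines of words_for_coverage(singlesort$Freq, 0.5).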

I also plotted how the number of words needed grows with the percentage of text covered, using the frequency-sorted data set.

percents <- c(10,20,30,40,50,60,70,80,90)
# Words needed for each coverage level; woperc expects a fraction between 0 and 1
timeswordsAppears <- sapply(percents / 100, woperc)
# percents is numeric, so the x axis needs a continuous scale
qplot(percents, timeswordsAppears, geom = c("line", "point")) + geom_text(aes(label = timeswordsAppears), hjust = 1.35, vjust = -0.1) + scale_x_continuous(breaks = percents, labels = percents)

Finding the most common words

#corpterMatrix <- DocumentTermMatrix(corpus)  # needs the corpus, not the cleanData data frame
#inspect(corpterMatrix[1:3,1:4])

#findFreqTerms(corpterMatrix,3000)

Conclusions

The objective of this report was to take the first step toward a simple model of the relationship between words, which is in turn the first step in building a predictive text mining application. I will explore simple models first and move to more complicated modeling techniques in the future.

I learned to build basic n-gram sets. Using the exploratory analysis I performed, I will build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words. In addition, I will build a model to handle unseen n-grams.
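
As a rough illustration of the planned approach, the sketch below predicts the next word from raw tri-gram, bi-gram and single-word counts, backing off to a shorter context when the longer one has not been seen. It is a simplified sketch under my own assumptions (no smoothing or discounting, at least three tokens of training text), not the final model; build_counts and predict_next are hypothetical names.

```r
# Hypothetical sketch: count unigrams, bigrams and trigrams from a token
# vector (assumes at least three tokens).
build_counts <- function(tokens) {
  n <- length(tokens)
  list(
    uni = table(tokens),
    bi  = table(paste(tokens[-n], tokens[-1])),
    tri = table(paste(tokens[1:(n - 2)], tokens[2:(n - 1)], tokens[3:n]))
  )
}

# Predict the next word, backing off from a 2-word context to a 1-word
# context, and finally to the most frequent single word.
predict_next <- function(counts, context) {
  context <- tolower(context)
  for (k in c(2, 1)) {
    if (length(context) < k) next
    prefix <- paste(utils::tail(context, k), collapse = " ")
    tab <- if (k == 2) counts$tri else counts$bi
    hits <- tab[startsWith(names(tab), paste0(prefix, " "))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(sub(".* ", "", best))      # last word of the best n-gram
    }
  }
  # Unseen context: fall back to the most frequent single word.
  names(counts$uni)[which.max(counts$uni)]
}

tokens <- c("i", "love", "the", "cubs", "and", "i", "love", "the",
            "game", "and", "the", "cubs")
counts <- build_counts(tokens)
predict_next(counts, c("i", "love"))     # tri-gram match: "the"
predict_next(counts, c("watch", "the"))  # backs off to bi-grams: "cubs"
predict_next(counts, c("zzz"))           # unseen: most frequent word "the"
```

A fuller backoff model would weight the lower-order estimates rather than simply picking the most frequent continuation.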