The goal of this document is to show that I have become familiar with the data and that I am on track to create my prediction algorithm. It describes my exploratory analysis and my goals for the eventual app and algorithm.
The data sets are large, which has been more challenging than expected. The data is now cleaned.
The prediction work will be based on n-grams and a backoff method.
The motivation for this project is to demonstrate that I have downloaded and cleaned the data, to summarize my exploratory findings, and to outline my plans for the prediction algorithm and app.
The first step in analyzing any new data set is figuring out: (a) what data I have and (b) what the standard tools and models for that type of data are. I must make sure I have downloaded the data from Coursera before starting the exercises. This exercise uses the files named LOCALE.blogs.txt, where LOCALE is each of the four locales: en_US, de_DE, ru_RU and fi_FI. The data is from a corpus called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the corpora available. The files have been language filtered but may still contain some foreign text.
In this capstone I will be applying data science in the area of natural language processing. As a first step toward working on this project, I should familiarize myself with Natural Language Processing, Text Mining, and the associated tools in R.
Dataset
This training data will be the basis for most of the capstone. I must download the data from the Coursera site, not from external websites.
My original exploration of the data and modeling steps will be performed on this Capstone Dataset. Later in the capstone, if I find additional data sets that may be useful for building my model, I may use them.
Questions to consider
What do the data look like?
Where do the data come from?
Can you think of any other data sources that might help you in this project?
What are the common steps in natural language processing?
What are some common issues in the analysis of text data?
What is the relationship between NLP and the concepts you have learned in the Specialization?
Large databases of text in a target language are commonly used when generating language models for various purposes. In this exercise, I will use the English database but may consider the three other databases in German, Russian and Finnish.
The goal of this task is to get familiar with the databases and do the necessary cleaning. After this exercise, I should understand what real data looks like and how much effort is needed to clean it. When starting on a new language, the first thing is to understand the language and its peculiarities with respect to my target. I can learn to read, speak and write the language, or I can study data and learn from existing information about the language through literature and the internet. At the very least, I need to understand how the language is written: writing script, existing input methods, some phonetic knowledge, etc.
Note that the data contain words of offensive and profane meaning. They are left there intentionally to highlight the fact that the developer has to handle them.
Tasks to accomplish:
1. Tokenization - identifying appropriate tokens such as words, punctuation, and numbers, and writing a function that takes a file as input and returns a tokenized version of it (see the sketch after this list).
2. Profanity filtering - removing profanity and other words I do not want to predict.
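A minimal sketch of such a tokenizer, assuming plain-text input; the function name, the lowercasing step and the token pattern are my own illustrative choices, not fixed by the assignment.
tokenizeFile <- function(path, n = -1L) {
  lines <- readLines(path, n = n, encoding = "UTF-8")
  text <- tolower(lines)
  # split on any run of characters that is not a letter, digit or apostrophe
  tokens <- unlist(strsplit(text, "[^a-z0-9']+"))
  tokens[tokens != ""]  # drop empty strings left by leading delimiters
}
# e.g. head(tokenizeFile("en_US.twitter.txt", n = 100), 20)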
Loading the data in.
This dataset is fairly large. I don't necessarily need to load the entire dataset in to build my algorithms (see the sampling point below). At least initially, I might want to use a smaller subset of the data. Reading in chunks or lines using R's readLines or scan functions can be useful. I can also loop over each line of text by embedding readLines within a for/while loop, but this may be slower than reading in large chunks at a time. Reading pieces of the file at a time requires the use of a file connection in R, as in the sketch below.
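A sketch of reading in chunks through a file connection (the file name, chunk size and variable names are illustrative):
conn <- file("en_US.twitter.txt", "r")
nlines <- 0
repeat {
  chunk <- readLines(conn, 10000)   # read 10,000 lines at a time
  if (length(chunk) == 0) break     # stop at the end of the file
  nlines <- nlines + length(chunk)  # ...process `chunk` here...
}
close(conn)
nlines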
Sampling. To reiterate, to build models I don't need to load in and use all of the data. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation to results that would be obtained using all the data. Recall from the inference class how a representative sample can be used to infer facts about a population. I might want to create a separate sub-sample dataset by reading in a random subset of the original data and writing it out to a separate file, so I can store the sample and not have to recreate it every time; the sketch below shows one way.
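One way to build such a sub-sample (a sketch; the 5% keep probability and the file names are arbitrary illustrative choices):
set.seed(1234)  # make the sample reproducible
alllines <- readLines("en_US.twitter.txt")
keep <- rbinom(length(alllines), size = 1, prob = 0.05) == 1
writeLines(alllines[keep], "en_US.twitter.sample.txt")  # store the sample for reuse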
Open connection and download the files
# fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/
# Coursera-SwiftKey.zip?accessType=DOWNLOAD"
# download.file(fileUrl,
# destfile = "/Users/administrador/Specialization/capstone/SwiftKey.zip",
# method = "curl")
list.files("C:/capstone")
## [1] "capstone002.html" "capstone002.Rmd" "capstone002_files"
## [4] "final" "SwiftKey.zip"
dateDownloaded <- date()
dateDownloaded
## [1] "Mon Mar 30 22:47:09 2015"
Set the path, unzip the file and list its contents.
setwd("C:/capstone/")
unzip("SwiftKey.zip")
unzip("SwiftKey.zip", list =TRUE)
## Name Length Date
## 1 final/ 0 2014-07-22 10:10:00
## 2 final/de_DE/ 0 2014-07-22 10:10:00
## 3 final/de_DE/de_DE.twitter.txt 75578341 2014-07-22 10:11:00
## 4 final/de_DE/de_DE.blogs.txt 85459666 2014-07-22 10:11:00
## 5 final/de_DE/de_DE.news.txt 95591959 2014-07-22 10:11:00
## 6 final/ru_RU/ 0 2014-07-22 10:10:00
## 7 final/ru_RU/ru_RU.blogs.txt 116855835 2014-07-22 10:12:00
## 8 final/ru_RU/ru_RU.news.txt 118996424 2014-07-22 10:12:00
## 9 final/ru_RU/ru_RU.twitter.txt 105182346 2014-07-22 10:12:00
## 10 final/en_US/ 0 2014-07-22 10:10:00
## 11 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
## 12 final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
## 13 final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
## 14 final/fi_FI/ 0 2014-07-22 10:10:00
## 15 final/fi_FI/fi_FI.news.txt 94234350 2014-07-22 10:11:00
## 16 final/fi_FI/fi_FI.blogs.txt 108503595 2014-07-22 10:12:00
## 17 final/fi_FI/fi_FI.twitter.txt 25331142 2014-07-22 10:10:00
Connecting to the data
setwd("C:/capstone/final/en_US")
list.files("C:/capstone/final/en_US")
## [1] "badwords.txt" "en_US.blogs.txt" "en_US.news.txt"
## [4] "en_US.twitter.txt"
con <- file("en_US.twitter.txt", "r")
con2 <- file("en_US.blogs.txt", "r")
con3 <- file("en_US.news.txt", "r")
badwords <- file("badwords.txt")
Other connections
# con <- file(file.choose(), "r")
Exploring the data
readLines(con,1)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
readLines(con,5)
## [1] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [2] "they've decided its more fun if I don't."
## [3] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [4] "Words from a complete stranger! Made my birthday even better :)"
## [5] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
# close(con) ## close connection
# unlink("en_US.twitter.txt") ### Delete the file connected
Loading the necessary libraries
library(stringi)
library(ggplot2)
library(magrittr)
library(markdown)
library(RWeka)
library(openNLP)
library(wordcloud)
## Loading required package: RColorBrewer
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
library(NLP)
library(qdap)
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## Loading required package: qdapTools
## WARNING: Rtools is required to build R packages, but no version of Rtools compatible with R 3.1.3 was found. (Only the following incompatible version(s) of Rtools were found:3.3)
##
## Please download and install Rtools 3.1 from http://cran.r-project.org/bin/windows/Rtools/ and then run find_rtools().
##
## Attaching package: 'qdap'
##
## The following objects are masked from 'package:tm':
##
## as.DocumentTermMatrix, as.TermDocumentMatrix
##
## The following object is masked from 'package:magrittr':
##
## %>%
##
## The following object is masked from 'package:base':
##
## Filter
The Twitter dataset
system("wc -l en_US.twitter.txt")
## Warning: running command 'wc -l en_US.twitter.txt' had status 127
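The wc command fails with status 127 because it is not available on this Windows setup. A portable alternative (a sketch, using the already loaded stringi package) is to compute the counts in R itself:
twitterLines <- readLines("en_US.twitter.txt")  # may warn about embedded nulls
length(twitterLines)                            # number of lines
stri_stats_general(twitterLines)                # lines, characters, whitespace counts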
Tiny datasets from the corpora
fewTwitter <- readLines(con,4000) #(file("en_US.twitter.txt","r"), 4000)
fewBlogs <- readLines(con2,4000) #(file("en_US.blogs.txt","r"), 4000)
fewNews <- readLines(con3,4000) #(file("en_US.news.txt","r"), 4000)
fewData <- c(fewTwitter, fewBlogs, fewNews)  # combine the three samples into one vector; paste() would glue unrelated lines together element-wise
#badwords2 <- readLines(badwords)
I identify appropriate tokens such as words, punctuation, and numbers, and I remove profanity words. I use the tm package to clean the data. I use a list of profanity words downloaded from GitHub to filter them out (https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en). I break the text lines into sentences so that n-grams such as bi-grams do not cross sentence boundaries.
fewData <- sent_detect(fewData, language = "en", model = NULL)
I build a clean main corpus.
corpus <- VCorpus(VectorSource(fewData)) # Build the main corpus
corpus <- tm_map(corpus, removeNumbers) # remove numbers
corpus <- tm_map(corpus, stripWhitespace) # remove whitespaces
corpus <- tm_map(corpus, content_transformer(tolower)) #lowercase all contents
corpus <- tm_map(corpus, removePunctuation) # remove special characters
Removing the bad words
profanewords <- readLines(badwords)  # read the bad-word list into a character vector
corpus <- tm_map(corpus, removeWords, profanewords)
Converting the corpus to a data frame for tokenization with the RWeka package
cleanData<-data.frame(text=unlist(sapply(corpus, `[`, "content")), stringsAsFactors=F)
Single-word tokens, bi-gram sets and tri-gram sets for analysis with RWeka.
singletok <- NGramTokenizer(cleanData, Weka_control(min = 1, max = 1))
bitok <- NGramTokenizer(cleanData, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
tritok <- NGramTokenizer(cleanData, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))
bitritok <- c(tritok, bitok)  # combined tri-gram and bi-gram vector; paste() would recycle and pair unrelated n-grams
The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships I observe in the data and prepare to build my first linguistic models.
I prepare data frames of tokens ordered by frequency.
single <- data.frame(table(singletok))
bitoke <- data.frame(table(bitok))
tritoke <- data.frame(table(tritok))
singlesort <- single[order(single$Freq,decreasing = TRUE),]
bitoksort <- bitoke[order(bitoke$Freq,decreasing = TRUE),]
tritoksort <- tritoke[order(tritoke$Freq,decreasing = TRUE),]
singleFrec <- singlesort[1:15,]
colnames(singleFrec) <- c("Word","Frequency")
bitoksortFrec<- bitoksort[1:15,]
colnames(bitoksortFrec) <- c("Word","Frequency")
tritoksortFrec <- tritoksort[1:15,]
colnames(tritoksortFrec) <- c("Word","Frequency")
The 15 most frequent single words (the x-axis is in alphabetical order).
ggplot(singleFrec, aes(x=Word, y=Frequency)) + geom_bar(stat="identity", fill="red", colour = "pink") + geom_text(aes(label=Frequency), vjust=-0.1) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
The 15 most frequent bi-grams (the x-axis is in alphabetical order).
ggplot(bitoksortFrec, aes(x=Word, y=Frequency)) + geom_bar(stat="identity", fill="blue", colour = "green") + geom_text(aes(label=Frequency), vjust=-0.1) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
The 15 most frequent tri-grams (the x-axis is in alphabetical order).
ggplot(tritoksortFrec, aes(x=Word, y=Frequency)) + geom_bar(stat="identity", fill="yellow", colour = "black") + geom_text(aes(label=Frequency), vjust=-0.1) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
I compute, for coverage levels from 10% to 90%, how many of the most frequent words are needed to cover that percentage of all word instances, using the function below.
woperc <- function(percentage) {
  totalwords <- sum(singlesort$Freq)
  percent <- 0
  cumsum <- 0
  i <- 0
  while (percent < percentage) {
    i <- i + 1
    cumsum <- cumsum + singlesort$Freq[i]
    percent <- cumsum / totalwords
  }
  return(i)  # number of the most frequent words needed to reach the coverage level
}
I also made a plot showing, for each coverage percentage, how many words (taken in decreasing order of frequency) are needed, based on the frequency data sets.
percents <- c(10,20,30,40,50,60,70,80,90)
timeswordsAppears <- sapply(percents / 100, woperc)
qplot(percents, timeswordsAppears, geom=c("line","point")) + geom_text(aes(label=timeswordsAppears), hjust=1.35, vjust=-0.1) + scale_x_continuous(breaks=percents)
Finding the most common words
#corpterMatrix <- DocumentTermMatrix(corpus)  # the DTM must be built from the corpus, not the data frame
#inspect(corpterMatrix[1:3,1:4])
#findFreqTerms(corpterMatrix,3000)
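The same idea as a runnable sketch, built from the tm corpus rather than the data frame; the frequency threshold of 200 is an illustrative choice:
corpterMatrix <- DocumentTermMatrix(corpus)  # term counts per document
inspect(corpterMatrix[1:3, 1:4])             # peek at a small corner of the matrix
findFreqTerms(corpterMatrix, 200)            # terms appearing at least 200 times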
The objective of this report was to build my first simple model of the relationship between words, the first step toward a predictive text mining application. I will explore simple models first and move to more complicated modeling techniques in the future.
Using the exploratory analysis performed here, I learned to build a basic n-gram model for predicting a word from the previous 1, 2, or 3 words. In addition, I will build a model that handles unseen n-grams via the backoff method mentioned above; a first sketch follows.
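As a preview of the backoff method, here is a minimal sketch (my own illustration, not the final algorithm). It uses the sorted n-gram tables built earlier: look for the most frequent tri-gram continuing the last two words, back off to bi-grams continuing the last word, and finally fall back to the single most frequent word.
predictNext <- function(phrase) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(words)
  if (n == 0) return(as.character(singlesort$singletok[1]))
  if (n >= 2) {  # try tri-grams whose first two words match the last two typed
    pat <- paste0("^", words[n - 1], " ", words[n], " ")
    hits <- tritoksort[grepl(pat, tritoksort$tritok), ]
    if (nrow(hits) > 0) return(sub(pat, "", as.character(hits$tritok[1])))
  }
  pat <- paste0("^", words[n], " ")  # back off to bi-grams matching the last word
  hits <- bitoksort[grepl(pat, bitoksort$bitok), ]
  if (nrow(hits) > 0) return(sub(pat, "", as.character(hits$bitok[1])))
  as.character(singlesort$singletok[1])  # final fallback: most frequent single word
}
predictNext("thanks for")  # e.g. predicts the most frequent continuation in the sample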