The goal of this document is to show that I have become familiar with the data and that I am on track to create my prediction algorithm. It explains my exploratory analysis and my goals for the eventual app and algorithm.
The data sets are large, which has been more challenging than expected, but the data has now been cleaned.
The prediction work will be based on an n-gram model combined with a backoff method.
The motivation for this project is to:
Demonstrate that I have downloaded the data and have successfully loaded it in.
Create a basic report of summary statistics about the data sets.
Report any interesting findings that I have amassed so far.
Get feedback on my plans for creating a prediction algorithm and Shiny app.
The first step in analyzing any new data set is figuring out:
what data I have and
what are the standard tools and models used for that type of data.
Make sure I have downloaded the data from Coursera before heading for the exercises. This exercise uses the files named LOCALE.blogs.txt, where LOCALE is each of the four locales en_US, de_DE, ru_RU and fi_FI. The data is from a corpus called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the corpora available. The files have been language filtered but may still contain some foreign text.
In this capstone I will be applying data science in the area of natural language processing. As a first step toward working on this project, I should familiarize myself with Natural Language Processing, Text Mining, and the associated tools in R. Here are some resources that may be helpful to me.
Dataset
This is the training data that will be the basis for most of the capstone. To start, I must download the data from the Coursera site and not from external websites.
Capstone Dataset
My original exploration of the data and modeling steps will be performed on this data set. Later in the capstone, if I find additional data sets that may be useful for building my model, I may use them.
Tasks to accomplish
Questions to consider
What do the data look like?
Where do the data come from?
Can I think of any other data sources that might help in this project?
What are the common steps in natural language processing?
What are some common issues in the analysis of text data?
What is the relationship between NLP and the concepts I have learned in the Specialization?
Large databases comprising text in a target language are commonly used when generating language models for various purposes. In this exercise, I will use the English database but may consider the three other databases in German, Russian and Finnish.
The goal of this task is to get familiar with the databases and do the necessary cleaning. After this exercise, I should understand what real data looks like and how much effort I need to put into cleaning it. When starting to work with a new language, the first thing is to understand the language and its peculiarities with respect to my target. I can learn to read, speak and write the language, or I can study data and learn from existing information about the language through literature and the internet. At the very least, I need to understand how the language is written: writing script, existing input methods, some phonetic knowledge, etc.
Note that the data contain words of offensive and profane meaning. They are left there intentionally to highlight the fact that the developer has to deal with them.
Tasks to accomplish
1. Tokenization - identifying appropriate tokens such as words, punctuation, and numbers, and writing a function that takes a file as input and returns a tokenized version of it (a rough sketch of such a function follows this list).
2. Profanity filtering - removing profanity and other words I do not want to predict.
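A minimal sketch of such a tokenizer, assuming plain UTF-8 text files and a simple letters-and-apostrophes definition of a word; the function name is illustrative, and the actual cleaning is done with the tm/RWeka pipeline shown later.
tokenizeFile <- function(path, n = -1L) {
  # read the file (or its first n lines), lowercase it, and split into word tokens
  lines <- tolower(readLines(path, n = n, encoding = "UTF-8", skipNul = TRUE))
  tokens <- unlist(strsplit(lines, "[^a-z']+"))  # split on anything that is not a letter or apostrophe
  tokens[tokens != ""]                           # drop empty tokens
}
# Example: head(tokenizeFile("en_US.twitter.txt", n = 1000))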
Loading the data in. This dataset is fairly large, and I don't necessarily need to load the entire dataset in to build my algorithms (see the Sampling point below). At least initially, I might want to use a smaller subset of the data. Reading in chunks or lines using R's readLines or scan functions can be useful. I can also loop over each line of text by embedding readLines within a for/while loop, but this may be slower than reading in large chunks at a time. Reading pieces of the file at a time will require the use of a file connection in R.
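For illustration, a sketch of reading a file piece by piece through a connection; the 10,000-line chunk size is an arbitrary assumption, and the working directory is assumed to already point at the en_US data folder (as in the code further below).
conChunks <- file("en_US.blogs.txt", "r")                # open a read connection
repeat {
  chunk <- readLines(conChunks, 10000, skipNul = TRUE)   # read the next 10,000 lines
  if (length(chunk) == 0) break                          # stop at the end of the file
  # ... process the chunk here (count lines, tokenize, sample, etc.) ...
}
close(conChunks)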
Sampling. To reiterate, to build models I don't need to load in and use all of the data. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation to results that would be obtained using all the data. Recalling my inference class, a representative sample can be used to infer facts about a population. I might want to create a separate sub-sample dataset by reading in a random subset of the original data and writing it out to a separate file. That way, I can store the sample and not have to recreate it every time.
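A minimal sketch of creating such a sub-sample, assuming a 5% sampling rate and an output file name of my own choosing (and the working directory set to the data folder):
set.seed(1234)                                                  # make the sample reproducible
allTwitter <- readLines("en_US.twitter.txt", skipNul = TRUE)    # read the file once, only to create the sample
keep <- rbinom(length(allTwitter), size = 1, prob = 0.05) == 1  # flip a biased coin for each line
writeLines(allTwitter[keep], "en_US.twitter.sample.txt")        # store the sample for reuse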
setwd("H:/Aulas/Data Science/Módulo 10 - Data Science Capstone/Dados/Coursera-SwiftKey/final/en_US/")
con <- file("en_US.twitter.txt", "r")
con2 <- file("en_US.blogs.txt", "r")
con3 <- file("en_US.news.txt", "r")
badwords <- file("badwords.txt")
library(stringi)
library(ggplot2)
library(magrittr)
library(markdown)
library(RWeka)
library(openNLP)
library(wordcloud)
library(tm)
library(NLP)
library(qdap)
setwd("H:/Aulas/Data Science/Módulo 10 - Data Science Capstone/Dados/Coursera-SwiftKey/final/en_US/")
system("wc -l en_US.twitter.txt")
setwd("H:/Aulas/Data Science/Módulo 10 - Data Science Capstone/Dados/Coursera-SwiftKey/final/en_US/")
fewTwitter <- readLines(con, 4000)   # read the first 4000 lines of each source
fewBlogs <- readLines(con2, 4000)
fewNews <- readLines(con3, 4000)
fewData <- c(fewTwitter, fewBlogs, fewNews)   # combine the three samples into one character vector
I identify appropriate tokens such as words, punctuation, and numbers, and I remove profanity words. I use the tm package to clean the data. To filter out profanity, I use a list of profane words downloaded from GitHub: https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en. I break the text lines into sentences, as needed when building bi-grams.
fewData <- sent_detect(fewData, language = "en", model = NULL)   # split the text into sentences (qdap)
I build a clean main corpus.
corpus <- VCorpus(VectorSource(fewData)) # Build the main corpus
corpus <- tm_map(corpus, removeNumbers) # remove numbers
corpus <- tm_map(corpus, stripWhitespace) # remove whitespaces
corpus <- tm_map(corpus, content_transformer(tolower)) #lowercase all contents
corpus <- tm_map(corpus, removePunctuation) # remove special characters
Removing the bad words
profanewordsvector <- readLines(badwords)                  # read the profanity word list
corpus <- tm_map(corpus, removeWords, profanewordsvector)  # remove the bad words from the corpus
Converting the corpus to a data frame for use with the RWeka tokenizers.
cleanData<-data.frame(text=unlist(sapply(corpus, `[`, "content")), stringsAsFactors=F)
Single-word, bi-gram and tri-gram tokenization for analysis with RWeka.
singletok <- NGramTokenizer(cleanData, Weka_control(min = 1, max = 1))
bitok <- NGramTokenizer(cleanData, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
tritok <- NGramTokenizer(cleanData, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))
bitritok <- c(bitok, tritok)   # combined bi-gram and tri-gram tokens
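As a quick sanity check on the tokenization (output not shown; the counts depend on the sample size chosen above):
head(singletok)    # first few single-word tokens
head(bitok)        # first few bi-grams
head(tritok)       # first few tri-grams
length(singletok)  # number of single-word tokens in the sample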
The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships I observe in the data and prepare to build my first linguistic models.
Tasks to accomplish
Questions to consider
I prepare data frames of tokens ordered by frequency.
single <- data.frame(table(singletok))
bitoke <- data.frame(table(bitok))
tritoke <- data.frame(table(tritok))
singlesort <- single[order(single$Freq,decreasing = TRUE),]
bitoksort <- bitoke[order(bitoke$Freq,decreasing = TRUE),]
tritoksort <- tritoke[order(tritoke$Freq,decreasing = TRUE),]
singleFrec <- singlesort[1:15,]
colnames(singleFrec) <- c("Word","Frequency")
bitoksortFrec<- bitoksort[1:15,]
colnames(bitoksortFrec) <- c("Word","Frequency")
tritoksortFrec <- tritoksort[1:15,]
colnames(tritoksortFrec) <- c("Word","Frequency")
Top 15 single words by frequency (displayed in alphabetical order on the plot).
The first analysis is a plot showing which words are the most frequent and what their frequencies are. ggplot2 was used to plot the frequencies.
ggplot(singleFrec, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "red", colour = "red") + geom_text(aes(label = Frequency), vjust = -0.1) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Top 15 bi-grams by frequency (displayed in alphabetical order on the plot).
Next, we do the same for bi-grams, i.e. two-word combinations, following exactly the same process but using the bi-gram tokens (min = 2, max = 2).
ggplot(bitoksortFrec, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "lightblue", colour = "blue") + geom_text(aes(label = Frequency), vjust = -0.1) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Top 15 tri-grams by frequency (displayed in alphabetical order on the plot).
Finally, we will follow exactly the same process for trigrams, i.e. three word combinations.
ggplot(tritoksortFrec, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "lightyellow", colour = "yellow") + geom_text(aes(label = Frequency), vjust = -0.1) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
I made a series from 10% to 90% to find how many words are needed to cover each percentage of the text, and I built a coverage function.
woperc <- function(percentage) {
  # Returns how many of the most frequent words are needed to cover the given
  # fraction of all word occurrences (percentage given as a fraction, e.g. 0.5).
  totalwords <- sum(singlesort$Freq)
  percent <- 0
  cumsum <- 0
  i <- 0
  while (percent < percentage) {
    i <- i + 1
    cumsum <- cumsum + singlesort$Freq[i]
    percent <- cumsum / totalwords
  }
  return(i)
}
Also, I made a plot showing how many of the most frequent words are needed to reach each coverage percentage, using the frequency data sets.
percents <- c(10,20,30,40,50,60,70,80,90)
timeswordsAppears <- c(woperc(0.1), woperc(0.2), woperc(0.3), woperc(0.4), woperc(0.5), woperc(0.6), woperc(0.7), woperc(0.8), woperc(0.9))
qplot(percents, timeswordsAppears, geom = c("line", "point")) + geom_text(aes(label = timeswordsAppears), hjust = 1.35, vjust = -0.1) + scale_x_continuous(breaks = percents)
The objective of this report was to build my first simple model of the relationships between words. This was the first step in building a predictive text mining application. I will explore simple models first and move on to more complicated modeling techniques in the future.
Using the exploratory analysis I performed, I will build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words. In addition, I will build a model to handle unseen n-grams using a backoff approach; a rough sketch of this idea follows.
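This is only a first sketch of the planned backoff idea, not the final model: the function name predictNext is my own, it reuses the sorted frequency tables built above, and it simply falls back from tri-grams to bi-grams to the most frequent single word without any weighting of the backed-off counts.
predictNext <- function(phrase) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(words)
  if (n >= 2) {
    # try tri-grams whose first two words match the last two words typed
    prefix <- paste(words[n - 1], words[n])
    hits <- tritoksort[startsWith(as.character(tritoksort$tritok), paste0(prefix, " ")), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$tritok[1])))
  }
  if (n >= 1) {
    # back off to bi-grams whose first word matches the last word typed
    hits <- bitoksort[startsWith(as.character(bitoksort$bitok), paste0(words[n], " ")), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$bitok[1])))
  }
  # final fallback: the most frequent single word overall
  as.character(singlesort$singletok[1])
}
# Example: predictNext("thanks for the")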