The Milestone Project, from the Coursera Data Science Capstone, is the first task of week two.
Using text mining (NLP) techniques, the goal is to produce a short, concise report on the main characteristics of three databases full of text: blogs, news and Twitter posts.
The databases can be downloaded at the following link.
library(dplyr)
library(knitr)
library(stringi)
library(tm)
library(magrittr)
library(ggplot2)
library(RWeka)
library(kableExtra)
library(wordcloud)
library(RColorBrewer)
library(plotly)
To download the databases, we first need a place to store them; therefore, the folder for the project and the folder for the databases are created from R.
getwd()
setwd("C:/Users/aleja/Documents/Cursos/Coursera R pratices")
if(!file.exists("./Milestone_Project")){
dir.create("./Milestone_Project")
}
if(!file.exists("./swiftkey_data")){
dir.create("./swiftkey_data")
}
Now, we can download the databases.
"https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"-> url1
if(!file.exists("SwiftKey_data.zip")){
download.file(url1,destfile="./swiftkey_data/SwiftKey_data.zip",mode = "wb")
}
With the data downloaded in ZIP format, it needs to be unzipped.
unzip(zipfile="./swiftkey_data/SwiftKey_data.zip",exdir="./swiftkey_data")
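As a quick optional check (a small addition, not part of the original write-up), the extracted English files can be listed to confirm the unzip worked as expected:
# list the extracted English files; the en_US folder path matches the directories used below
list.files("./swiftkey_data/final/en_US")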
In this step, the information is extracted and processed. The news file needs a different treatment, since it contains many special characters, so it is read through a binary connection.
dir_tw <-"./swiftkey_data/final/en_US/en_US.twitter.txt"
dir_bl <-"./swiftkey_data/final/en_US/en_US.blogs.txt"
dir_nw<-"./swiftkey_data/final/en_US/en_US.news.txt"
readLines(dir_tw,warn=FALSE,encoding="UTF-8")->twitter
readLines(dir_bl,warn=FALSE,encoding="UTF-8")->blog
con_nw <- file(dir_nw, open="rb")   # binary connection: the news file contains special characters
readLines(con_nw,warn=FALSE,encoding="UTF-8")->news
close(con_nw)
Next, we count the words in each file and get the file sizes.
stri_count_words(blog)->nwords_bl
stri_count_words(news)->nwords_nw
stri_count_words(twitter)->nwords_tw
file.info("./swiftkey_data/final/en_US/en_US.blogs.txt")$size / 1024 ^ 2->blog.size
file.info("./swiftkey_data/final/en_US/en_US.news.txt")$size / 1024 ^ 2 ->news.size
file.info("./swiftkey_data/final/en_US/en_US.twitter.txt")$size / 1024 ^ 2->twitter.size
A short summary of the datasets.
summary_info<-data.frame(file.size.MB = c(blog.size, news.size, twitter.size),
nlines = c(length(blog), length(news), length(twitter)),
nwords = c(sum(nwords_bl), sum(nwords_nw), sum(nwords_tw)))
rownames(summary_info)<-c("Blog","News","Twitter")
colnames(summary_info)<-c("Size in MB","Number of Lines","Number of Words")
summary_info
It can be seen that the largest file is the blogs file, which contains 37,546,239 words. The file with the most lines of text is the Twitter file.
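The per-line word counts computed above can also be summarized to compare how long a typical entry is in each source; this is an optional sketch built on the nwords_* vectors already created:
# optional: average and maximum number of words per line in each source
words_per_line <- data.frame(mean.words = c(mean(nwords_bl), mean(nwords_nw), mean(nwords_tw)),
                             max.words  = c(max(nwords_bl), max(nwords_nw), max(nwords_tw)))
rownames(words_per_line) <- c("Blog", "News", "Twitter")
words_per_line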
To perform an excellent exploratory data analysis for text mining, it is essential to clean the databases.
1. Remove common English stop words
2. Remove punctuation marks
3. Convert to plain text documents
4. Remove URL, Twitter handle and email patterns by converting them to spaces, using a custom content transformer
5. Convert all words to lowercase
6. Remove numbers
7. Trim whitespace
In addition to this process, a 1% sample will be taken from each of the three datasets.
set.seed(123)
sample_data<-c(sample(blog, length(blog)*0.01),
sample(news, length(news)*0.01),
sample(twitter, length(twitter)*0.01))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- VCorpus(VectorSource(sample_data))
corpus<- corpus %>%
tm_map(toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+") %>%
tm_map( toSpace, "@[^\\s]+") %>%
tm_map(content_transformer(tolower)) %>%
tm_map(removeWords, stopwords("en")) %>%
tm_map(removePunctuation) %>%
tm_map(removeNumbers) %>%
tm_map(stripWhitespace)   # documents remain PlainTextDocuments throughout, so no extra conversion is needed
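To verify that the cleaning worked as expected, the first cleaned documents can be inspected directly; this is a quick optional check using tm's content() accessor:
# optional: inspect the content of the first two cleaned documents
content(corpus[[1]])
content(corpus[[2]])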
Visually representing the content of a text document is one of the most important tasks in the field of text mining. As a data scientist or NLP specialist, not only we explore the content of documents from different aspects and at different levels of details, but also we summarize a single document, show the words and topics, detect events, and create storylines. (Source)
setwd("C:/Users/aleja/Documents/Cursos/Coursera R pratices/Milestone_Project")
saveRDS(corpus, file = "./swiftkey_data/final/en_US/en_US.corpus.rds")
corpus<-readRDS(file = "./swiftkey_data/final/en_US/en_US.corpus.rds")
corpusText <- data.frame(text = unlist(sapply(corpus, '[', "content")),
stringsAsFactors = FALSE)
writeLines(corpusText$text, "./swiftkey_data/final/en_US/en_US.corpus.txt")
Finally, you can see the content of the first ten documents belonging to the corpus. Remember that a 1% sample was taken from the original datasets.
kable(head(corpusText$text, 10),
row.names = FALSE,
col.names = NULL,
align = c("l"),
caption = "First 10 Documents") %>% kable_styling(position = "left")
| bruschetta however missed mark instead manageable twobite crostini huge slices grilled bread heaped toppings tomato cannellini beans roasted peppers goat cheese |
| walden pond mt rainier big sur everglades forth |
| despite laws banning cell phones driving increased awareness dangers ’s common fact cell phone use driving still widespread occurrence perhaps discouraging issue much distracted driving occurs amongst young drivers safety concern also might indicate problem deeply rooted future generations |
| ghosts goblins |
| now can write specific post information day week preplan things bit love love one place love finally got another little area life organized love things going get easier now got act together |
| trying pin photos muslin walls bit tricky |
| rosso fruiting around bored pent cold feel good |
| lastly anyone seen new harry potter movie planning go word advice take kleenex especially read books able appreciate happening deeper level yes changes differences still wonderful ending series want go back |
| generally enjoyed movie things get nose mostly involved inclusion footage henry king film train robbery sequence blends fairly seamlessly another example proved especially distracting well filmed northfield raid lead flies men falling around frank jesse take time ride alley divest long dusters might well ask two men caught firefight pause well answer ’re see recycled footage power fonda riding plate glass window later jumping horses cliff – heroes hadn’t wearing dusters ’ film now scenes great first time round smacks certain cheapness wheel another problem end film know bob ford going shoot jesse stands chair straighten picture well ford gives back head point blank range instead dropping floor like sack potatoes jesse swivels around glare reproachfully assassin succumbing wound bah |
| accessories martha stewart floral border punch marvy notched corner punch stampin corner rounder punch stampin grosgrain ribbon flowers prima flowers stickles glitter glue waterfall clear effects embossing glaze white liquid pearls lace sewing basket |
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. (Source)
An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. (Source)
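Before building the full term-document matrices, the idea can be illustrated on a toy sentence with RWeka's NGramTokenizer, the same tokenizer used below (a small illustrative example, not part of the original analysis):
# illustration only: extract the bigrams of a toy sentence
NGramTokenizer("this is a short example sentence", Weka_control(min = 2, max = 2))
# expected bigrams: "this is", "is a", "a short", "short example", "example sentence"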
Unigrams:
unigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
unigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = unigramTokenizer))
unigramMatrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(unigramMatrix, 0.99))), decreasing = TRUE)
unigramMatrixFreqdf <- data.frame(word = names(unigramMatrixFreq), freq = unigramMatrixFreq)
save(unigramMatrixFreqdf, file = "./swiftkey_data/unigramMatrixFreqdf.Rda")
load("./swiftkey_data/unigramMatrixFreqdf.Rda")
g1 <- ggplot(unigramMatrixFreqdf[1:20,], aes(x = reorder(word, -freq), y = freq)) +
geom_bar(stat = "identity", fill="firebrick")+
theme_light()+
xlab("")+
ylab("Frequency") +
theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
axis.text.x = element_text(hjust = 1.0, angle = 90),
axis.text.y = element_text(hjust = 0.5, vjust = 0.5))+
ggtitle("Most Common Unigrams")
#g1
ggplotly(g1, dynamicTicks = T, tooltip = ("y"))
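Since the wordcloud and RColorBrewer packages are already loaded, the same unigram frequencies can also be displayed as a word cloud; this is an optional visualization sketch based on the unigramMatrixFreqdf table built above:
# optional: word cloud of the most frequent unigrams
set.seed(123)
wordcloud(words = as.character(unigramMatrixFreqdf$word),
          freq = unigramMatrixFreqdf$freq,
          max.words = 100,
          random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))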
Bigrams:
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = bigramTokenizer))
# eliminate sparse terms for each n-gram and get frequencies of most common n-grams
bigramMatrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(bigramMatrix, 0.999))), decreasing = TRUE)
bigramMatrixFreqdf <- data.frame(word = names(bigramMatrixFreq), freq = bigramMatrixFreq)
save(bigramMatrixFreqdf, file = "./swiftkey_data/bigramMatrixFreqdf.Rda")
load("./swiftkey_data/bigramMatrixFreqdf.Rda")
g2 <- ggplot(bigramMatrixFreqdf[1:20,], aes(x = reorder(word, -freq), y = freq))+
geom_bar(stat = "identity", fill = "firebrick")+
theme_light()+
xlab("")+
ylab("Frequency")+
theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
axis.text.x = element_text(hjust = 1.0, angle = 45),
axis.text.y = element_text(hjust = 0.5, vjust = 0.5))+
ggtitle("Most Common Bigrams")
#g2
ggplotly(g2, dynamicTicks = T, tooltip = ("y"))
Trigrams:
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
trigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = trigramTokenizer))
# eliminate sparse terms for each n-gram and get frequencies of most common n-grams
trigramMatrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(trigramMatrix, 0.9999))), decreasing = TRUE)
trigramMatrixFreqdf <- data.frame(word = names(trigramMatrixFreq), freq = trigramMatrixFreq)
save(trigramMatrixFreqdf, file = "./swiftkey_data/trigramMatrixFreqdf.Rda")
load("./swiftkey_data/trigramMatrixFreqdf.Rda")
g3 <- ggplot(trigramMatrixFreqdf[1:20,], aes(x = reorder(word, -freq), y = freq))+
geom_bar(stat = "identity", fill = "firebrick")+
theme_light()+
xlab("")+
ylab("Frequency")+
theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
axis.text.x = element_text(hjust = 1.0, angle = 55),
axis.text.y = element_text(hjust = 0.5, vjust = 0.5))+
ggtitle("Most Common Trigrams")
#g3
ggplotly(g3, dynamicTicks = T, tooltip = ("y"))
This concludes the first project of the Data Science Capstone. The final deliverable of the capstone is a predictive algorithm deployed as a Shiny app with a user interface.
The predictive algorithm for the Shiny app will use an n-gram model with frequency lookups, similar to the exploratory analysis above.
The intention is to build the algorithm on these n-grams so that the next word is predicted based on the training data.
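As a very rough sketch of that frequency-lookup idea (an assumption about how the app could work, not the final algorithm), the bigram table built above can already suggest the most frequent words that follow a given word; predict_next_word is a hypothetical helper introduced here only for illustration:
# minimal sketch of a bigram frequency lookup (illustrative, not the final model)
predict_next_word <- function(first_word, n = 3) {
  pattern <- paste0("^", first_word, " ")          # naive match: assumes no regex metacharacters in the input
  matches <- bigramMatrixFreqdf[grepl(pattern, bigramMatrixFreqdf$word), ]
  matches <- matches[order(-matches$freq), ][seq_len(min(n, nrow(matches))), ]
  sub(pattern, "", matches$word)                   # return only the predicted second word
}
predict_next_word("last")   # inspect the top suggested continuations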