In this assignment we demonstrate an understanding of working with data by loading it into R, generating summary statistics, cleaning it and performing exploratory analysis. Because of the volume of data involved, we work with samples of the full text to keep processing manageable.
Cleaning the data in this context includes stripping white space, removing punctuation, removing numbers, converting all text to lower case and dropping empty strings.
The data used for this assignment comes from SwiftKey and can be downloaded from http://bit.ly/1NBh6QS. To ensure that we are using the correct dataset, we always fetch it from the original location.
setwd("H:/Data Science/Capstone")
#Create the data directory if it does not already exist
if (!dir.exists("data")) dir.create("data")
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
datafile <- "Coursera-SwiftKey.zip"
path <- paste("data", datafile, sep="/")
if (!file.exists(path)){ #check if the file already exists before download
download.file(url, destfile=path)
}
unzip(zipfile=path, exdir="data") #Unzip the downloaded file in the data folder
The downloaded archive also includes text files for German, Finnish and Russian. We will, however, focus on the English (en_US) text. We list the English files and, for each one, report its size, number of lines, longest line and word count.
en_us.files <- list.files(path="../data", recursive=T, pattern=".*en_.*.txt")
l <- lapply(paste("../data", en_us.files, sep="/"), function(f) {
file.size <- file.info(f)[1]/1024/1024
con <- file(f, open="r")
lines <- readLines(con)
nchars <- lapply(lines, nchar)
maxchars <- which.max(nchars)
nwords <- sum(sapply(strsplit(lines, "\\s+"), length))
close(con)
return(c(f, format(round(file.size, 2), nsmall=2), length(lines), maxchars, nwords))
})
df <- data.frame(matrix(unlist(l), nrow=length(l), byrow=T))
colnames(df) <- c("File Name", "Size(MB)", "Num of Lines", "Longest Line", "Num of Words")
df
##                               File Name Size(MB) Num of Lines Longest Line Num of Words
## 1   ../data/final/en_US/en_US.blogs.txt   200.42       899288       483415     37334441
## 2    ../data/final/en_US/en_US.news.txt   196.28        77259        14556      2643972
## 3 ../data/final/en_US/en_US.twitter.txt   159.36      2360148      1484357     30373792
The exploratory analysis helps us identify non-English words in the text, common words and phrases (n-grams), their frequencies and any other information that helps us better understand the data we are working with. To perform the analysis, we start by loading the required packages.
Note that some of these packages can be troublesome to install under R version 3.3.1. If you run into problems, install and use “pacman” to handle the installation for you.
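The library() calls are not shown here; based on the functions used in the rest of this report, the packages loaded are presumably along these lines (an assumption, adjust as needed):
library(tm) #removeNumbers, removePunctuation, stripWhitespace
library(RWeka) #WordTokenizer, NGramTokenizer
library(qdap) #sent_detect
library(wordcloud) #wordcloud
library(RColorBrewer) #brewer.pal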
Because the data above is quite large and would overwhelm our computing resources, we pick out a sample to work with.
fpath <- "../data/final" #data location
enBlogs <- readLines(paste(fpath,"/en_US/en_US.blogs.txt",sep=""))
enNews <- readLines(paste(fpath,"/en_US/en_US.news.txt",sep=""))
enTwitter <- readLines(paste(fpath,"/en_US/en_US.twitter.txt",sep=""))
set.seed(12345) #for reproducible sampling
sBlogs <- sample(enBlogs,2000) #2000 random lines from each source
sNews <- sample(enNews,2000)
sTwitter <- sample(enTwitter,2000)
sample <- c(sBlogs,sNews,sTwitter)
txt <- sent_detect(sample) #split the sampled lines into sentences (qdap)
remove(sBlogs,sNews,sTwitter,enBlogs,enNews,enTwitter,sample) #free up memory
Once we have our sample, we proceed to clean it up:
txt <- removeNumbers(txt)
txt <- removePunctuation(txt)
txt <- stripWhitespace(txt)
txt <- tolower(txt)
txt <- txt[which(txt!="")]
txt <- data.frame(txt,stringsAsFactors = FALSE)
Then, using the tokenizer functions from the “RWeka” package, we break the data down into 1-gram, 2-gram and 3-gram chunks. With its default settings, NGramTokenizer returns 1-, 2- and 3-grams together in a single vector, so we scan for the positions at which the 2-grams and then the 1-grams begin in order to split them apart.
words<-WordTokenizer(txt) #individual words (1-grams)
grams<-NGramTokenizer(txt) #1-, 2- and 3-grams returned together in one vector
#find the first position that holds a 2-gram...
for(i in 1:length(grams))
{if(length(WordTokenizer(grams[i]))==2) break}
#...and the first position that holds a 1-gram
for(j in 1:length(grams))
{if(length(WordTokenizer(grams[j]))==1) break}
onegrams <- data.frame(table(words))
onegrams <- onegrams[order(onegrams$Freq, decreasing = TRUE),]
bigrams <- data.frame(table(grams[i:(j-1)])) #assumes the 2-grams sit between those two positions
bigrams <- bigrams[order(bigrams$Freq, decreasing = TRUE),]
trigrams <- data.frame(table(grams[1:(i-1)])) #assumes the 3-grams come first
trigrams <- trigrams[order(trigrams$Freq, decreasing = TRUE),]
remove(i,j,grams)
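An alternative, arguably more transparent, way to build the bigram and trigram tables (a sketch only, not what produced the tables and figures used below) is to ask NGramTokenizer for each size explicitly via Weka_control, which avoids relying on the ordering of the combined output:
#sketch: build the bigram and trigram tables by requesting each size directly
bigramTokens <- NGramTokenizer(txt$txt, Weka_control(min=2, max=2))
trigramTokens <- NGramTokenizer(txt$txt, Weka_control(min=3, max=3))
bigramFreq <- sort(table(bigramTokens), decreasing=TRUE)
trigramFreq <- sort(table(trigramTokens), decreasing=TRUE)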
Now that we have broken down our text into smaller chunks, we want to know which terms appear more frequently than others. To find out, we use word clouds to visualize the variation in frequency.
wordcloud(words, scale=c(5,0.1), max.words=100, random.order=FALSE,
rot.per=0.5, use.r.layout=FALSE, colors=brewer.pal(8,"Accent"))
wordcloud(onegrams$words, onegrams$Freq, scale=c(5,0.5), max.words=300, random.order=FALSE,
rot.per=0.5, use.r.layout=FALSE, colors=brewer.pal(8,"Accent"))
The first graph shows the distribution of the words without stop words and the second shows the frequency distribution of the words including stop words.
We then look at the frequencies of the bigrams (2-grams) and trigrams (3-grams), using ordinary bar plots of the top 20 of each.
barplot(bigrams[1:20,2],col="darkgoldenrod", #top 20 bigrams by frequency
names.arg = bigrams$Var1[1:20],
space=0.1, xlim=c(0,20),las=2) #las=2 rotates the labels
Then for the trigrams:
barplot(trigrams[1:20,2],col="cyan4", #top 20 trigrams by frequency
names.arg = trigrams$Var1[1:20],
space=0.1, xlim=c(0,20),las=2)
Now that we have a better understanding of our text, we can outline the strategy for building our predictive model and Shiny app.
The predictive model/algorithm will take the entered text, clean it and extract the preceding one to n-1 words. It will then use a simple backoff approach, in combination with weighting, to build a list of probable next words.
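As a rough illustration of that backoff idea (a sketch only, not the final model), the hypothetical function below assumes the onegrams, bigrams and trigrams frequency tables built earlier in this report; the 0.4 backoff weight is a placeholder chosen purely for illustration.
predictNext <- function(input, n = 3) {
  #clean the input the same way the training text was cleaned
  tokens <- WordTokenizer(stripWhitespace(tolower(removePunctuation(input))))
  candidates <- data.frame(word = character(0), score = numeric(0), stringsAsFactors = FALSE)
  #first look for trigrams that start with the last two words typed
  if (length(tokens) >= 2) {
    prefix <- paste(tail(tokens, 2), collapse = " ")
    hits <- trigrams[grepl(paste0("^", prefix, " "), trigrams$Var1), ]
    if (nrow(hits) > 0)
      candidates <- rbind(candidates, data.frame(word = sub(".* ", "", hits$Var1),
                                                 score = 1.0 * hits$Freq, stringsAsFactors = FALSE))
  }
  #back off to bigrams that start with the last word, down-weighted
  if (length(tokens) >= 1) {
    prefix <- tail(tokens, 1)
    hits <- bigrams[grepl(paste0("^", prefix, " "), bigrams$Var1), ]
    if (nrow(hits) > 0)
      candidates <- rbind(candidates, data.frame(word = sub(".* ", "", hits$Var1),
                                                 score = 0.4 * hits$Freq, stringsAsFactors = FALSE))
  }
  #if nothing matched, fall back to the most frequent single words
  if (nrow(candidates) == 0)
    candidates <- data.frame(word = as.character(head(onegrams$words, n)),
                             score = as.numeric(head(onegrams$Freq, n)), stringsAsFactors = FALSE)
  head(candidates[order(-candidates$score), "word"], n)
}
For a phrase such as predictNext("thanks for the"), this would return up to three candidate next words, preferring trigram matches and falling back to bigrams and then to the most frequent single words.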
The Shiny app will have a simple UI that lets the user type text into a single text box. Typing will trigger the algorithm, which will then suggest next words that the user can select from.
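A minimal sketch of that app, assuming the hypothetical predictNext() function sketched above (the final UI and server logic may well differ):
library(shiny)
ui <- fluidPage(
  titlePanel("Next word prediction"),
  textInput("phrase", "Type your text here:"),
  verbatimTextOutput("suggestions")
)
server <- function(input, output) {
  output$suggestions <- renderPrint({
    req(input$phrase) #wait until the user has typed something
    predictNext(input$phrase)
  })
}
shinyApp(ui=ui, server=server)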