This is a milestone report for the capstone project of the Data Science Specialization. The goal of this report is to show that we have become familiar with the data. We will do some exploratory analysis and outline our plan for the eventual app and algorithm. The major tasks of this report are: 1. download the data and load it in; 2. create a basic report of summary statistics about the data; 3. use 1-grams, 2-grams, and 3-grams to understand the frequencies of words and word pairs; 4. discuss plans for creating a prediction algorithm and Shiny app.
We first set our working directory, download the data, and unzip it. The data was collected in four languages: English, Russian, German, and Finnish. Since the data sources are newspapers and magazines, blogs, and Twitter, there are three text files (blogs, news, and twitter) for each language. In this report, we are only interested in the English files.
setwd("C:/Users/Bell/Desktop/coursera/capstone")
fileUrl <-"https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(fileUrl, destfile = "./Coursera-SwiftKey.zip", method = "curl")
}
unzip("./Coursera-SwiftKey.zip")
First, we check the size of each file in megabytes.
fz1 <- file.info("final/en_US/en_US.blogs.txt")$size / 1024^2
fz2 <- file.info("final/en_US/en_US.news.txt")$size / 1024^2
fz3 <- file.info("final/en_US/en_US.twitter.txt")$size / 1024^2
filesizeMB <- rbind(fz1, fz2, fz3)
Then, we load the three English text files using the readLines function.
con1 <- file("final/en_US/en_US.blogs.txt", "rb")
blogs <- readLines(con1, encoding = "UTF-8")
close(con1)
#-------
con2 <- file("final/en_US/en_US.news.txt", "rb")
news <- readLines(con2, encoding = "UTF-8")
close(con2)
#------
con3 <- file("final/en_US/en_US.twitter.txt", "rb")
twitter <- readLines(con3, encoding = "UTF-8")
close(con3)
Next, we count the lines and the words for each file.
ls1 <- length(blogs)
ls2 <- length(news)
ls3 <- length(twitter)
linecount <- rbind(ls1, ls2, ls3)
#word count of each file
library(stringi)
ws1 <- sum(stri_count_words(blogs))
ws2 <- sum(stri_count_words(news))
ws3 <- sum(stri_count_words(twitter))
wordcount <- rbind(ws1, ws2, ws3)
Then, we get the file names and combine all the information together.
filenames<- Sys.glob("final/en_US/*.txt")
r <- data.frame(filenames,filesizeMB, linecount, wordcount)
rownames(r) <- c(1,2,3)
print(r)
## filenames filesizeMB linecount wordcount
## 1 final/en_US/en_US.blogs.txt 200.4242 899288 37546246
## 2 final/en_US/en_US.news.txt 196.2775 1010242 34762395
## 3 final/en_US/en_US.twitter.txt 159.3641 2360148 30093369
From the above summary, we can see that these text files are very large, at roughly 200 MB, 196 MB, and 159 MB respectively. They also contain a huge number of lines and words, so analyzing the full dataset would be too time consuming. Therefore, we will randomly sample 5% of the lines in each text file. The three sampled files are combined and written to a separate directory for further analysis.
set.seed(12345)
s_blogs <- sample(blogs, ls1*0.05)
s_news <- sample(news, ls2*0.05)
s_twitter <- sample(twitter, ls3*0.05)
s_Data <- c(s_blogs,s_news,s_twitter)
if (!dir.exists("sample")) dir.create("sample")  # make sure the output directory exists
writeLines(s_Data, "sample/sampleData.txt")
rm(blogs, news, twitter, s_blogs, s_news, s_twitter, s_Data)
Now, we are ready for data cleaning and tokenization. We first need to construct a corpus from our overall sampled text file.
library(NLP)
library(tm)
library(RWeka)
cname <- file.path("C:/Users/Bell", "Desktop", "Coursera", "capstone", "sample")
docs <- Corpus(DirSource(cname))
The raw text contains characters, symbols, and words that may not provide useful information, so we need to clean the data first. We will remove numbers, punctuation, extra white space, and so on, and we will also remove English stopwords. Although we randomly sampled the original dataset, our corpus is still around 40 MB, so we chain the cleaning steps with pipes (%>%) to keep the code concise.
library(magrittr)
library(SnowballC)
docsf <- docs %>%
    tm_map(content_transformer(tolower)) %>%  # convert to lowercase
    tm_map(removePunctuation) %>%             # remove punctuation
    tm_map(removeNumbers) %>%                 # remove numbers
    tm_map(stemDocument) %>%                  # stem common word endings
    tm_map(stripWhitespace) %>%               # strip extra white space
    tm_map(removeWords, stopwords("en"))      # remove English stopwords
After cleaning the data, we tokenize it and do some exploratory analysis. We will use a unigram (1-gram) model to explore the most frequent words in the data.
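To make the tokenization step concrete, here is a minimal illustration of what RWeka's NGramTokenizer produces for a short made-up sentence (the sentence is our own example, not taken from the corpus):
# tokenize a toy sentence into all 1-, 2-, and 3-grams (illustration only)
NGramTokenizer("thanks for the follow", Weka_control(min = 1, max = 3))
# the result should contain every contiguous 1-, 2-, and 3-word window of the
# sentence, e.g. "thanks for", "for the", and "the follow" among the bigrams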
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
dtm_uni <- DocumentTermMatrix(docsf, control=list(tokenize=UnigramTokenizer))
dtm_uni_freq <-dtm_uni %>%
removeSparseTerms(0.2) %>%
as.matrix %>%
colSums %>%
sort(decreasing=TRUE)
dtm_uni_freq_d <- data.frame(word = names(dtm_uni_freq), freq = dtm_uni_freq)
head(dtm_uni_freq_d, 10)
## word freq
## will will 15527
## said said 15196
## just just 15091
## one one 13682
## like like 13206
## can can 12305
## get get 11556
## time time 9615
## new new 9582
## good good 8885
Then, we use a bigram (2-gram) model to determine the most frequent two-word phrases.
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
dtm_bi <- DocumentTermMatrix(docsf, control=list(tokenize=BigramTokenizer))
dtm_bi_freq <-dtm_bi %>%
removeSparseTerms(0.2) %>%
as.matrix %>%
colSums %>%
sort(decreasing=TRUE)
dtm_bi_freq_d <- data.frame(word = names(dtm_bi_freq), freq = dtm_bi_freq)
head(dtm_bi_freq_d, 10)
## word freq
## right now right now 1220
## last year last year 970
## new york new york 970
## cant wait cant wait 930
## dont know dont know 778
## last night last night 719
## high school high school 715
## years ago years ago 690
## last week last week 642
## feel like feel like 621
Next, we use a trigram (3-gram) model to look at the most frequent three-word phrases.
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
dtm_tri <- DocumentTermMatrix(docsf, control=list(tokenize=TrigramTokenizer))
dtm_tri_freq <-dtm_tri %>%
removeSparseTerms(0.2) %>%
as.matrix %>%
colSums %>%
sort(decreasing=TRUE)
dtm_tri_freq_d <- data.frame(word = names(dtm_tri_freq), freq = dtm_tri_freq)
head(dtm_tri_freq_d, 10)
## word freq
## happy mothers day happy mothers day 174
## cant wait see cant wait see 165
## let us know let us know 116
## new york city new york city 109
## happy new year happy new year 89
## two years ago two years ago 87
## cinco de mayo cinco de mayo 76
## im pretty sure im pretty sure 72
## dont even know dont even know 66
## world war ii world war ii 65
Finally, we visualize the results in bar plots of the top 10 most frequent terms from each n-gram matrix.
library(ggplot2)
library(gridExtra)
frequ10 <- head(dtm_uni_freq_d, 10)
freqb10 <- head(dtm_bi_freq_d, 10)
freqt10 <- head(dtm_tri_freq_d, 10)
up <- ggplot(frequ10, aes(x=reorder(word,freq), y=freq)) +
geom_bar(stat="identity") +
theme_bw() +
coord_flip() +
theme(axis.title.y = element_blank()) +
labs(y="Frequency", title="Top 10 frequent words in Unigram matrix")
bp <-ggplot(freqb10, aes(x=reorder(word,freq), y=freq)) +
geom_bar(stat="identity") +
theme_bw() +
coord_flip() +
theme(axis.title.y = element_blank()) +
labs(y="Frequency", title="Top 10 frequent two-word phrases in Bigram matrix")
tp <- ggplot(freqt10, aes(x=reorder(word,freq), y=freq)) +
geom_bar(stat="identity") +
theme_bw() +
coord_flip() +
theme(axis.title.y = element_blank()) +
labs(y="Frequency", title="Top 10 frequent three-word phrases in Trigram matrix")
grid.arrange(up, bp, tp, ncol = 1, nrow = 3)
It is really interesting to look at the most frequent words and phrases occurring in the documents, and we did find some interesting things. First, in the cleaning process, we had to remove stopwords such as “the”, “and”, and “are”. These kinds of words do not provide much useful information, but they occur very frequently: when we initially did not remove them, the most frequent word was “the”, with a count of 236543. Second, we realized that it is very important to look at the data as two-word and three-word phrases. From our 3-gram model, the most frequent phrases are “happy mothers day”, “cant wait see”, “let us know”, “new york city”, etc. This is very interesting and useful information, and it is also very important for predicting the next word in a sentence.
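As a quick cross-check of such raw stopword counts, the sampled file can be scanned directly with stringi, without going through the tm pipeline (a minimal sketch; it assumes the file sample/sampleData.txt written above is still on disk):
library(stringi)
s_check <- readLines("sample/sampleData.txt", encoding = "UTF-8")
# whole-word counts of a few common stopwords in the uncleaned sample
sapply(c("the", "and", "are"),
       function(w) sum(stri_count_regex(tolower(s_check), paste0("\\b", w, "\\b"))))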
We plan to use n-gram models to build our prediction algorithm. We will use 2-grams and 3-grams first to understand the distribution of words and the relationships between words in the corpora. Then, we may add a 4-gram model to find the most likely next word. This strategy has not been finalized. For the Shiny app, we will create a simple interface where the user can input a string of text, and our prediction model will return a list of the most likely next words, as sketched below.
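As a rough illustration of the direction we have in mind, the frequency tables built above could already drive a simple back-off lookup: first try to match the last two words of the input against the trigram table, then back off to the bigram table using only the last word, and finally fall back to the most frequent unigrams. The sketch below is illustrative only; predictNextWord is a hypothetical helper name, and it assumes the data frames dtm_uni_freq_d, dtm_bi_freq_d, and dtm_tri_freq_d from this report are still in memory.
# Minimal back-off sketch built on the n-gram frequency tables above (illustrative only)
predictNextWord <- function(input, n = 3) {
    words <- unlist(strsplit(tolower(input), "\\s+"))
    words <- words[words != ""]
    # 1) try the trigram table: match the last two words of the input
    if (length(words) >= 2) {
        prefix <- paste(tail(words, 2), collapse = " ")
        hits <- dtm_tri_freq_d[startsWith(as.character(dtm_tri_freq_d$word),
                                          paste0(prefix, " ")), ]
        if (nrow(hits) > 0)
            return(head(sapply(strsplit(as.character(hits$word), " "), tail, 1), n))
    }
    # 2) back off to the bigram table: match only the last word
    if (length(words) >= 1) {
        prefix <- tail(words, 1)
        hits <- dtm_bi_freq_d[startsWith(as.character(dtm_bi_freq_d$word),
                                         paste0(prefix, " ")), ]
        if (nrow(hits) > 0)
            return(head(sapply(strsplit(as.character(hits$word), " "), tail, 1), n))
    }
    # 3) final fallback: the most frequent unigrams overall
    head(as.character(dtm_uni_freq_d$word), n)
}
predictNextWord("happy mothers")  # should suggest "day" given the trigram counts above
For the real predictor we would probably rebuild the n-gram tables without removing stopwords or stemming, since words such as “the” and “to” are exactly the kind of words a next-word predictor needs to be able to suggest. For the Shiny app, the interface could be as simple as a text input plus a text output (again only a bare-bones sketch, not the final app):
# bare-bones Shiny sketch wiring the hypothetical predictNextWord() to a text box
library(shiny)
ui <- fluidPage(
    titlePanel("Next-word prediction"),
    textInput("phrase", "Type a phrase:"),
    textOutput("suggestions")
)
server <- function(input, output) {
    # show the top suggestions for whatever the user has typed so far
    output$suggestions <- renderText({
        paste(predictNextWord(input$phrase), collapse = ", ")
    })
}
shinyApp(ui = ui, server = server)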