We use smartphones everywhere. Whether at home or at the beach, we see people fumbling with their smartphones, and it is not an isolated phenomenon. We can do almost everything with these gadgets nowadays: phone calls, internet, messaging, GPS and music playback are just some of the functionalities our phones offer. Alongside this increase in use, people also want to spend less time on non-crucial steps, such as typing. This capstone project is made in conjunction with Swiftkey[1]. They apply natural language processing techniques to vast amounts of text in order to help us type, guessing the next words to be typed or correcting our misspelled words. We analysed 3 datasets, from U.S. news, blogs and Twitter, in order to create a language model similar to the ones used by Swiftkey. This first report covers only the pre-processing step, in which we obtain the data, sample it and gather a few descriptive statistics about it.
The dataset from Coursera is provided as .txt files, compressed into a zip archive to reduce its size. The dataset contains files in several languages so that people around the world can use it; in this case, we chose the English language files.
Because of its size and for testing purposes, we manually downloaded and decompressed the file into our working directory. We selected three files: “en_US.blogs.txt”, with approximately 200 MB, “en_US.news.txt”, with approximately 196 MB, and “en_US.twitter.txt”, with approximately 159 MB. In the R code, we set the working directory, read each file and estimated its word count by counting the runs of non-word characters in it (a rough proxy for word boundaries). For the Twitter data, because of the presence of non-UTF-8 characters, such as emoticons, we used the “iconv” function to remove them.
# set working directory
setwd("C:\\Users\\Caio\\Documents\\Coursera\\Data Science Capstone\\DataSet 0")
# read blogs data: 899,299 lines
blogsData <- readLines(file("en_US.blogs.txt", encoding = "UTF-8"))
blogNWords <- sum(sapply(gregexpr("\\W+", blogsData), length))
print(paste("Number of lines for blogs dataset:",blogNWords,"words",sep = " "))
## [1] "Number of lines for blogs dataset: 38222279 words"
# read news data: 77,259 lines
newsData <- readLines(file("en_US.news.txt", encoding = "UTF-8"))
newsNWords <- sum(sapply(gregexpr("\\W+", newsData), length))
print(paste("Number of lines for news dataset:",newsNWords,"words",sep = " "))
## [1] "Number of lines for news dataset: 2748070 words"
# read twitter data: 2,360,148 lines
twitterData <- readLines(file("en_US.twitter.txt", encoding = "UTF-8"))
# remove non-UTF-8 symbols, such as emoticons
twitterData <- iconv(twitterData, from = "latin1", to = "UTF-8", sub="")
twitterNWords <- sum(sapply(gregexpr("\\W+", twitterData), length))
print(paste("Number of lines for twitter dataset:",twitterNWords,"words",sep = " "))
## [1] "Number of lines for twitter dataset: 30513860 words"
Because of the size of the files, we need to create a randomized sample of them in order to process the data in a feasible time. We decided to extract 10,000 lines from each file. Finally, we saved each sample as an RDS file, since RDS files are compressed and quick to reload. As we are not going to use the original datasets any further, we remove them from memory.
n <- 10000
set.seed(3235) # for reproducibility purposes
# sample n line indexes from each dataset, without replacement,
# and keep only the selected lines
filterIndexes <- sample(length(blogsData), n, replace = FALSE)
blogsData <- blogsData[filterIndexes]

filterIndexes <- sample(length(newsData), n, replace = FALSE)
newsData <- newsData[filterIndexes]

filterIndexes <- sample(length(twitterData), n, replace = FALSE)
twitterData <- twitterData[filterIndexes]

# save the samples as compressed RDS files
saveRDS(blogsData,'SampleBlogData.rds')
saveRDS(newsData,'SampleNewsData.rds')
saveRDS(twitterData,'SampleTwitterData.rds')
rm(blogsData); rm(newsData); rm(twitterData); # memory release
For the exploratory analysis, we used a few additional packages.
Note that some of these packages depend on other resources, such as additional packages or environments (e.g. Java for RWeka).
#load libraries
library(tm) # core text mining package
library(SnowballC) # word stemming
library(wordcloud) # word clouds
library(slam) # sparse matrix arithmetic
library(RWeka) # N-gram creation
The basic data structure for text mining with the tm package is the Corpus object, so we convert each sampled dataset into a Corpus.
#load data
corpusB <- Corpus(VectorSource(readRDS("SampleBlogData.rds")))
corpusN <- Corpus(VectorSource(readRDS("SampleNewsData.rds")))
corpusT <- Corpus(VectorSource(readRDS("SampleTwitterData.rds")))
The next step was to apply the basic transformations used in text mining to each corpus: converting to lower case, removing punctuation, numbers and stop words, stemming, and, finally, building the document-term matrix, the data structure on which the rest of our processing is done. An example of this type of processing can be seen here.
# list containing the three corpora
corpusVector <- list(corpusB, corpusN, corpusT)
myTdmn <- list() # used to store the document-term matrices
# memory deallocation
rm(corpusB); rm(corpusN); rm(corpusT);
for (i in 1:length(corpusVector)){
# transform to lower case
corpusVector[[i]] <- tm_map(corpusVector[[i]], tolower)
# remove punctuation
corpusVector[[i]] <- tm_map(corpusVector[[i]], removePunctuation)
# remove numbers
corpusVector[[i]] <- tm_map(corpusVector[[i]], removeNumbers)
# remove english stop words
corpusVector[[i]] <- tm_map(corpusVector[[i]], removeWords,stopwords("english"))
# stem words, keeping only their roots
corpusVector[[i]] <- tm_map(corpusVector[[i]], stemDocument)
# transform to plain text
corpusVector[[i]] <- tm_map(corpusVector[[i]], PlainTextDocument)
# build the document-term matrix (rows = documents, columns = term frequencies)
myTdmn[[i]] <- DocumentTermMatrix(corpusVector[[i]], control=list(wordLengths=c(0,Inf)))
}
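As a quick sanity check (a minimal sketch reusing the myTdmn list built above), we can print the dimensions of each document-term matrix: the number of rows should match the 10,000 sampled documents, and the number of columns gives the number of distinct stems in each corpus.
# sanity check: rows = sampled documents, columns = distinct stems
for (i in 1:length(myTdmn)){
print(dim(myTdmn[[i]]))
}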
One way to display the most frequently used words in a text is a word cloud, in which each word is drawn with a size proportional to how often it is used. From these clouds we can already see some interesting patterns:
- Indirect speech (e.g. "said") and more formal words in the news word cloud
- A wider variety of words in the blogs dataset
- More positive sentiment words in the Twitter dataset
# create word cloud
# note: col_sums() comes from the slam package and computes column sums of a sparse matrix
set.seed(3366) # for reproducibility
par(mfrow = c(1,3)) # all 3 plots in 1 row
titles = c("Blog Wordcloud", "News Wordcloud", "Twitter Wordcloud")
for(i in 1:3){
# words and its frequencies
wordcloud(words=colnames(myTdmn[[i]]), freq=col_sums(myTdmn[[i]]),
scale = c(3,1),max.words = 100,random.order = F,
rot.per = 0.35,use.r.layout = F, colors = brewer.pal(8,"Dark2"))
title(titles[i])
}
Fig. 1: Word clouds as a visual representation of word usage. More frequently used words appear larger and closer to the center. Fewer colors in a word cloud indicate a less complex text structure (one dominant word with just a few others around it).
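As a numeric complement to the clouds (a minimal sketch reusing the myTdmn and titles objects defined above), we can also list the ten most frequent stems in each corpus:
# ten most frequent stems per corpus
for (i in 1:3){
print(titles[i])
print(head(sort(col_sums(myTdmn[[i]]), decreasing = TRUE), 10))
}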
N-grams are contiguous sequences of n terms taken from a given sample of text or speech. They are widely used in probabilistic language models because of their efficiency and simplicity. We analyzed the most frequent bigrams in each of the 3 datasets.
# NGramTokenizer is a function from the RWeka package, passed as a control
# option to the DocumentTermMatrix function.
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dsNames <- c("Blog","News","Twitter")
par(mfrow = c(3,1))
for(i in 1:3){
# create document term matrix for 2-gram words
bigramDTM <- DocumentTermMatrix(corpusVector[[i]],
control = list(tokenize = bigramTokenizer))
# get the 40 most frequent bigrams
bigramTermsCount <- sort(col_sums(bigramDTM),decreasing = TRUE)[1:40]
# create barplot
bar <- barplot(bigramTermsCount, axes = FALSE,axisnames = FALSE,
density = bigramTermsCount+30,
border = "red",
ylab="Frequency", ylim = c(0,max(bigramTermsCount)+9),
main = paste("Frequency of bigrams for",
dsNames[i],
"data set",
sep = " "))
# rotate x labels
text(bar, par("usr")[3], labels = names(bigramTermsCount),
srt = 45, adj = c(1.1,1.1), xpd = TRUE, cex = 0.9)
# add frequency number to the top of each bar
text(bar, y = bigramTermsCount, label = bigramTermsCount, pos = 3,
cex = 0.8, col = "red")
# make y axis appear
axis(2)
}
Fig. 2: The 40 most frequent bigrams in each dataset.
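As a preview of how these bigram counts could later feed the prediction model (a minimal sketch, not part of the pre-processing itself; predictNextWord is a hypothetical helper introduced here, and the code assumes corpusVector, myTdmn and bigramTokenizer from the chunks above are still in memory), a maximum-likelihood estimate simply divides each bigram count by the count of its first word, P(w2 | w1) = count(w1 w2) / count(w1):
# minimal sketch: maximum-likelihood next-word guess from bigram counts
# P(w2 | w1) = count(w1 w2) / count(w1)
predictNextWord <- function(firstWord, bigramCounts, unigramCounts){
# keep only the bigrams that start with the given word
candidates <- bigramCounts[startsWith(names(bigramCounts), paste0(firstWord, " "))]
if (length(candidates) == 0 || !(firstWord %in% names(unigramCounts))) return(NA_character_)
# conditional probability of each continuation
probs <- candidates / unigramCounts[firstWord]
# return the second word of the most probable bigram
strsplit(names(which.max(probs)), " ")[[1]][2]
}
# example with the Twitter sample (note that the vocabulary is stemmed)
bigramCounts <- col_sums(DocumentTermMatrix(corpusVector[[3]],
control = list(tokenize = bigramTokenizer)))
unigramCounts <- col_sums(myTdmn[[3]])
predictNextWord("love", bigramCounts, unigramCounts)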
From a complexity point of view, clustering the datasets can show us which corpus uses a wider variety of words together. The hierarchical clustering displays which words can be grouped together, i.e. which words were found together more frequently within the corpus. From these plots, we can see:
dsNames <- c("Blog","News","Twitter")
par(mfrow = c(1,3))
for(i in 1:3){
# transpose to a term-document matrix and drop very sparse terms
# (keep only terms present in at least ~3% of the documents)
sparseDTM <- removeSparseTerms(t(myTdmn[[i]]), sparse = 0.97)
sparseDTM <- as.matrix(sparseDTM)
distMatrix <- dist(scale(sparseDTM))
# do hierarchical clustering using ward's method
cluster <- hclust(distMatrix,method = "ward.D2")
plot(cluster, cex = 0.9,
xlab = "Hierarchical Clusterization",
main = paste("Estructure complexity for ",
dsNames[i],
"data set",
sep = " "))
}
Fig. 3: Hierarchical clustering from each dataset.
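To inspect the actual groups rather than just the dendrograms (a minimal sketch, assuming the cluster object from the last loop iteration, i.e. the Twitter data, is still in memory), cutree cuts the tree at a chosen number of clusters and returns the group of each word:
# cut the last dendrogram (Twitter) into 4 groups of words
groups <- cutree(cluster, k = 4)
# list the words belonging to each group
split(names(groups), groups)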
The following ideas can be deduced from these data:
For the next part of this capstone project, some ideas are:
Note: All the code is available in this public Dropbox folder; just remember to change the working directory =]
Caio H. K. Miyashiro - Brazil