This document is part of my Data Science Specialisation Capstone, where I am looking at predictive text. Mobile phones, for instance, present you with three word options to select from while you are writing a text.
The goal of the capstone project is to build a predictive text product running on my own predictive text algorithm. To do this I will rely on three files containing text from Twitter, blogs and news, which I use to gain knowledge of the structure of the data.
This milestone report aims to demonstrate my ability to work with the data by carrying out an exploratory data analysis. Statistics on the data files are collected and preliminary graphs displayed.
The data consist of a zip file containing four folders. The analysis presented here uses the three files in the en_US folder. The data can be downloaded here.
First we load the data into three different variables:
set.seed(69) # Set seed for repeatability
con <- file("en_US.blogs.txt", "r")
blogs = readLines(con, encoding = "UTF-8")
close(con)
con <- file("en_US.news.txt", "r")
news = readLines(con, encoding = "UTF-8")
close(con)
con <- file("en_US.twitter.txt", "r")
twitter = readLines(con, encoding = "UTF-8")
close(con)
In this section I explore some characteristics of the data files, such as file size, number of lines, number of words and number of characters. To explore the data further I create three sample corpora, one for each original file. Using these samples I collect n-gram frequency statistics for each file.
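The code below assumes a handful of packages are available. A minimal setup sketch (the package choices are inferred from the functions used later: str_count from stringr, VCorpus and TermDocumentMatrix from tm, NGramTokenizer from RWeka, and ggplot2/gridExtra for the figures):
library(stringr)    # str_count, used for word counting
library(tm)         # VCorpus, tm_map and TermDocumentMatrix
library(RWeka)      # NGramTokenizer for the n-gram tokenisers
library(ggplot2)    # bar charts of n-gram frequencies
library(gridExtra)  # grid.arrange for side-by-side plots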
Function to count the number of words in a string (by default only alphabetic sequences are counted; with pseudo = TRUE whitespace-separated tokens are counted instead):
# require(stringr)
nwords <- function(string, pseudo = F) {
ifelse(pseudo, pattern <- "\\S+", pattern <- "[[:alpha:]]+")
str_count(string, pattern)
}
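A quick illustration of the two counting modes (an added example, not part of the original analysis):
nwords("It's 5 o'clock!")              # 4: counts the alphabetic runs "It", "s", "o", "clock"
nwords("It's 5 o'clock!", pseudo = T)  # 3: counts whitespace-separated tokens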
Function to count the number of lines in a file:
nlines <- function(FileName) {
f <- file(FileName, open = "rb")
nlines <- 0L
while (length(chunk <- readBin(f, "raw", 65536)) > 0) {
nlines <- nlines + sum(chunk == as.raw(10L))
}
return(nlines)
}
Calculating data file statistics.
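The per-file quantities used in the table are not computed in the chunks shown above; a sketch of how they can be obtained for the blogs file (news and twitter follow the same pattern; the exact expressions are my reconstruction rather than the original code):
blogs_size = file.info("en_US.blogs.txt")$size/1024^2  # file size in MB
NoLinesBlogs = nlines("en_US.blogs.txt")               # number of lines
no_words_blog = sum(nwords(blogs))                     # total number of words
blogs_characters = sum(nchar(blogs))                   # total number of characters
longestLine_blogs = max(nchar(blogs))                  # length of the longest line in characters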
Here is a table of statistics for the three data files.
summary_table = data.frame(data_set = c("blogs", "news", "twitter"),
    size_MB = c(blogs_size, news_size, twitter_size),
    no_lines = c(NoLinesBlogs, NoLinesNews, NoLinesTwitter),
    no_words = c(no_words_blog, no_words_news, no_words_twitter),
    no_char = c(blogs_characters, news_characters, twitter_characters),
    longest_line_char = c(longestLine_blogs, longestLine_news, longestLine_twitter))
print(summary_table)
  data_set  size_MB no_lines no_words   no_char longest_line_char
1    blogs 200.4242   899288 37874365 206824505             40833
2     news 196.2775  1010242  2662071  15639408              5760
3  twitter 159.3641  2360148 30556095 162096031               140
Due to the size of the data files we will create a corpus containing a sample of 5,000 lines from each data file.
Function to create a corpus from a 5,000-line sample of a text vector:
# A function to create a corpus: takes a character vector of text
# lines, samples 5,000 of them and converts the sample into a
# corpus via the VCorpus method
CreateCorpus <- function(TextLines) {
    sample_corpus <- sample(TextLines, 5000)
    vecSource = VectorSource(sample_corpus)
    corpus = VCorpus(vecSource)
    return(corpus)
}
Function to clean a corpus by removing stop words, punctuation, numbers, profanity and non-ASCII characters:
# A function to clean a corpus: takes a corpus and returns it with
# stop words, punctuation, numbers, profanity and non-ASCII removed
profanity <- readLines("http://www.bannedwordlist.com/lists/swearWords.txt")
CleanCorpus <- function(CorpusName) {
corpus.ng = tm_map(CorpusName, removeWords, c(stopwords(),
"s", "ve"))
corpus.ng = tm_map(corpus.ng, removePunctuation)
corpus.ng = tm_map(corpus.ng, removeNumbers)
corpus.ng = tm_map(corpus.ng, removeWords, profanity)
RemoveNonASCII = function(x) gsub("[^ -~]", "", x)
corpus.ng = tm_map(corpus.ng, content_transformer(RemoveNonASCII))
return(corpus.ng)
}
Function to create a unigram term-document matrix:
CreateUnigram <- function(CorpusName) {
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1,
max = 1))
tdm.unigram = TermDocumentMatrix(CorpusName, control = list(tokenize = UnigramTokenizer))
return(tdm.unigram)
}
Function to create a bigram term-document matrix:
CreateBigrams <- function(CorpusName) {
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2,
max = 2))
tdm.bigram = TermDocumentMatrix(CorpusName, control = list(tokenize = BigramTokenizer))
return(tdm.bigram)
}
Function to create a trigram term-document matrix:
CreateTrigrams <- function(CorpusName) {
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3,
max = 3))
tdm.trigram = TermDocumentMatrix(CorpusName, control = list(tokenize = TrigramTokenizer))
return(tdm.trigram)
}
Function to sort n-gram frequencies in decreasing order:
SortFreq <- function(tdm) {
freq = sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
freq.df = data.frame(word = names(freq), freq = freq)
return(freq.df)
}
In this section I use the functions defined above to calculate and sort n-grams for the sample corpora.
The n-grams for the blogs sample are:
corpus.blogs = CreateCorpus(blogs)
corpus.blogs = CleanCorpus(corpus.blogs)
# Create Unigrams
blogs.tdm.unigram = CreateUnigram(corpus.blogs)
blogs.df.unigram = SortFreq(blogs.tdm.unigram)
head(blogs.df.unigram, 5)
word freq
the the 1009
one one 717
will will 618
just just 575
like like 533
# Create Bigrams
blogs.tdm.bigram = CreateBigrams(corpus.blogs)
blogs.df.bigram = SortFreq(blogs.tdm.bigram)
head(blogs.df.bigram, 5)
word freq
i think i think 133
i know i know 128
i can i can 88
i love i love 81
i will i will 73
# Create Trigrams
blogs.tdm.trigram = CreateTrigrams(corpus.blogs)
blogs.df.trigram = SortFreq(blogs.tdm.trigram)
head(blogs.df.trigram, 5)
word freq
i know i i know i 23
i think i i think i 20
i dont know i dont know 14
i thought i i thought i 14
i feel like i feel like 11
The n-grams for the twitter sample are:
corpus.twitter = CreateCorpus(twitter)
corpus.twitter = CleanCorpus(corpus.twitter)
# Create Unigrams
twitter.tdm.unigram = CreateUnigram(corpus.twitter)
twitter.df.unigram = SortFreq(twitter.tdm.unigram)
head(twitter.df.unigram, 5)
word freq
just just 323
like like 270
get get 248
the the 231
good good 210
# Create Bigrams
twitter.tdm.bigram = CreateBigrams(corpus.twitter)
twitter.df.bigram = SortFreq(twitter.tdm.bigram)
head(twitter.df.bigram, 5)
word freq
i love i love 54
i just i just 44
i think i think 43
i can i can 42
i know i know 42
# Create Trigrams
twitter.tdm.trigram = CreateTrigrams(corpus.twitter)
twitter.df.trigram = SortFreq(twitter.tdm.trigram)
head(twitter.df.trigram, 5)
word freq
i know i i know i 7
i wish i i wish i 6
come see us come see us 5
i feel like i feel like 5
cake cake cake cake cake cake 4
The n-grams for the news sample are:
corpus.news = CreateCorpus(news)
corpus.news = CleanCorpus(corpus.news)
# Create Unigrams
news.tdm.unigram = CreateUnigram(corpus.news)
news.df.unigram = SortFreq(news.tdm.unigram)
head(news.df.unigram, 5)
word freq
the the 1237
said said 1194
will will 558
one one 417
new new 333
# Create Bigrams
news.tdm.bigram = CreateBigrams(corpus.news)
news.df.bigram = SortFreq(news.tdm.bigram)
head(news.df.bigram, 5)
word freq
i think i think 81
new york new york 58
last year last year 51
said i said i 45
high school high school 43
# Create Trigrams
news.tdm.trigram = CreateTrigrams(corpus.news)
news.df.trigram = SortFreq(news.tdm.trigram)
head(news.df.trigram, 5)
word freq
a year ago a year ago 10
four years ago four years ago 8
president barack obama president barack obama 8
world war ii world war ii 8
new york city new york city 7
In this section I build some "pretty" exploratory graphs to visualise and compare the n-grams of the three samples. For the unigrams:
p1 = ggplot(head(blogs.df.unigram, 15), aes(reorder(word, freq),
freq)) + geom_bar(stat = "identity", fill = "blue") + coord_flip() +
xlab("Unigrams") + ylab("Frequency") + ggtitle("Blogs") +
theme_bw()
p2 = ggplot(head(twitter.df.unigram, 15), aes(reorder(word, freq),
freq)) + geom_bar(stat = "identity", fill = "blue") + coord_flip() +
xlab("Unigrams") + ylab("Frequency") + ggtitle("Twitter") +
theme_bw()
p3 = ggplot(head(news.df.unigram, 15), aes(reorder(word, freq),
freq)) + geom_bar(stat = "identity", fill = "blue") + coord_flip() +
xlab("Unigrams") + ylab("Frequency") + ggtitle("News") +
theme_bw()
grid.arrange(p1, p2, p3, nrow = 1)
For the bigrams:
p1 = ggplot(head(blogs.df.bigram, 15), aes(reorder(word, freq),
freq)) + geom_bar(stat = "identity", fill = "blue") + coord_flip() +
xlab("Bigrams") + ylab("Frequency") + ggtitle("Blogs") +
theme_bw()
p2 = ggplot(head(twitter.df.bigram, 15), aes(reorder(word, freq),
freq)) + geom_bar(stat = "identity", fill = "blue") + coord_flip() +
xlab("Bigrams") + ylab("Frequency") + ggtitle("Twitter") +
theme_bw()
p3 = ggplot(head(news.df.bigram, 15), aes(reorder(word, freq),
freq)) + geom_bar(stat = "identity", fill = "blue") + coord_flip() +
xlab("Bigrams") + ylab("Frequency") + ggtitle("News") + theme_bw()
grid.arrange(p1, p2, p3, nrow = 1)
For the tri-grams:
p1 = ggplot(head(blogs.df.trigram, 15), aes(reorder(word, freq),
freq)) + geom_bar(stat = "identity", fill = "blue") + coord_flip() +
xlab("Trigrams") + ylab("Frequency") + ggtitle("Blogs") +
theme_bw()
p2 = ggplot(head(twitter.df.trigram, 15), aes(reorder(word, freq),
freq)) + geom_bar(stat = "identity", fill = "blue") + coord_flip() +
xlab("Trigrams") + ylab("Frequency") + ggtitle("Twitter") +
theme_bw()
p3 = ggplot(head(news.df.trigram, 15), aes(reorder(word, freq),
freq)) + geom_bar(stat = "identity", fill = "blue") + coord_flip() +
xlab("Trigrams") + ylab("Frequency") + ggtitle("News") +
theme_bw()
grid.arrange(p1, p2, p3, nrow = 1)
An EDA was carried out with the data provided for the Capstone of the Data Science Specialisation. The data consisted of three text files containing text from blogs, news and Twitter. Statistics for the files were calculated. The blogs file was the biggest, at about 200.42 MB; the Twitter file was the smallest, at about 159.36 MB. However, the Twitter file had the highest number of lines (2360148). The blogs file had the highest number of characters and also contained the longest line. The longest tweet had 140 characters, which is to be expected given Twitter's character limit.
Three sample corpora were created to make the exploratory data analysis tractable. The corpora were cleaned of stop words, profanity, punctuation, numbers and non-ASCII characters. N-grams (uni-, bi- and trigrams) were created for each of the three samples. Several unigrams appear in the top-15 lists of all three sample corpora. Tweets mention more feeling words, such as "happy" and "love".
The bigrams for blogs and Twitter are quite similar, typically "i" plus a verb, e.g. "i know", "i think", "i want", "i love", "i will". The bigrams for news are quite different: they contain place names, e.g. "san francisco", "new jersey", "st louis" and "white house".
The pattern seen for blogs and tweets, in which the pronoun "i" is followed by a verb, is repeated in the trigrams. Trigrams in news include "president barack obama", "the last time" and, oddly, "just pig about".
Further work will be carried out to create a predictive model for a text-prediction application. The model will incorporate knowledge of the structure of the data as well as some linguistics.
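As a first illustration of where this is heading, a minimal frequency-based backoff lookup over the n-gram tables built above could look like the sketch below. PredictNextWord is a hypothetical helper, not part of the milestone code, and the final model will need to handle smoothing and unseen n-grams properly.
# Sketch only: predict the next word from sorted trigram/bigram
# frequency tables (e.g. blogs.df.trigram, blogs.df.bigram)
PredictNextWord <- function(phrase, trigram.df, bigram.df) {
    tokens <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
    # Try trigrams whose first two words match the end of the phrase
    prefix <- paste0(paste(tokens, collapse = " "), " ")
    hits <- trigram.df[startsWith(as.character(trigram.df$word), prefix), ]
    # Back off to bigrams matching only the last word of the phrase
    if (nrow(hits) == 0) {
        hits <- bigram.df[startsWith(as.character(bigram.df$word),
            paste0(tail(tokens, 1), " ")), ]
    }
    if (nrow(hits) == 0) return(NA)
    # The tables are already sorted by frequency, so take the top match
    # and return its final word
    tail(strsplit(as.character(hits$word[1]), " ")[[1]], 1)
}
PredictNextWord("i think", blogs.df.trigram, blogs.df.bigram)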