In this data exploration we analyse the “Corpora” dataset, a publicly available collection of tweets, blog posts and news articles gathered online. The goal of this analysis is to present a basic overview of the dataset in terms of its properties and content.
We work with three forms of text: blog posts, tweets and news articles.
The next section describes the data collection process, the section after that presents the results of the data exploration, and we finish with conclusions and references.
For our analysis we use the R language.
The data can be obtained [here](https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip).
# download & unzip
destination_file <- "Coursera-SwiftKey.zip"
source_file <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(source_file, destination_file)
unzip(destination_file)
The zip file contains the following files:
list.files(recursive = TRUE, full.names = TRUE)
## [1] "./Coursera-SwiftKey.zip" "./Exploration_Analysis.Rmd"
## [3] "./final/de_DE/de_DE.blogs.txt" "./final/de_DE/de_DE.news.txt"
## [5] "./final/de_DE/de_DE.twitter.txt" "./final/en_US/en_US.blogs.txt"
## [7] "./final/en_US/en_US.news.txt" "./final/en_US/en_US.twitter.txt"
## [9] "./final/fi_FI/fi_FI.blogs.txt" "./final/fi_FI/fi_FI.news.txt"
## [11] "./final/fi_FI/fi_FI.twitter.txt" "./final/ru_RU/ru_RU.blogs.txt"
## [13] "./final/ru_RU/ru_RU.news.txt" "./final/ru_RU/ru_RU.twitter.txt"
## [15] "./sample/blogs.txt" "./sample/news.txt"
## [17] "./sample/twitter.txt"
The corpora data were collected from publicly available sources by a web crawler. The crawler checks the language, so that the collected texts mainly consist of the desired language.
The dataset contains these languages:
list.files(path = "./final" )
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
We will use only the en_US data. Every language contains three types of entries (listed below for en_US):
* blogs
* tweets
* news
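As a quick check, the three en_US files can be listed directly (a minimal sketch, assuming the archive was unzipped into ./final as above):
list.files(path = "./final/en_US")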
Sizes of the files in megabytes:
file.size("final/en_US/en_US.blogs.txt")/1024/1024
## [1] 200.4242
file.size("final/en_US/en_US.twitter.txt")/1024/1024
## [1] 159.3641
file.size("final/en_US/en_US.news.txt")/1024/1024
## [1] 196.2775
Let's now import those files. As they are quite large, we will read them in line by line.
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8", warn = FALSE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8", warn = FALSE)
# binary mode import (a text-mode readLines may stop early on embedded special characters)
con <- file("final/en_US/en_US.news.txt", open="rb")
news <- readLines(con, encoding="UTF-8", warn = FALSE)
close(con)
rm(con)
We are going to need these libraries to make data analysis easier:
library(tm)
## Loading required package: NLP
library(SnowballC)
library(stringi)
library(wordcloud)
## Loading required package: RColorBrewer
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(RWeka)
First, let's do some basic line-level analysis of the files:
stri_stats_general( blogs )
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 206824382 170389539
stri_stats_general( news )
## Lines LinesNEmpty Chars CharsNWhite
## 1010242 1010242 203223154 169860866
stri_stats_general( twitter )
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162096031 134082634
We can see that there are over 4 million lines in total and that none of the files contain blank lines.
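As a cross-check, the total line count can be computed from the same statistics (a minimal sketch; stri_stats_general comes from the stringi package loaded above):
# sum the "Lines" entries reported for the three files
total_lines <- stri_stats_general(blogs)["Lines"] +
  stri_stats_general(news)["Lines"] +
  stri_stats_general(twitter)["Lines"]
total_lines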
As the dataset is quite large, we will use only 80% of the data to analyse the words it contains.
# create sample dir if not exists
dir.create("./sample/", showWarnings = FALSE)
#sample size
size <- .8
set.seed(50)
blogs_sample <- sample(blogs, size = as.integer(length(blogs)*size))
news_sample <- sample(news, size = as.integer(length(news)*size))
twitter_sample <- sample(twitter, size = as.integer(length(twitter)*size))
# write the samples as plain text so they can be read back in as a corpus
writeLines(blogs_sample, "./sample/blogs.txt")
writeLines(news_sample, "./sample/news.txt")
writeLines(twitter_sample, "./sample/twitter.txt")
Now let's analyse the words in these three datasets with the help of the tm (text mining) library.
As an initial step we need to prepare the data, i.e. do some sort of normalization:
* remove non-English characters
* remove unwanted content (whitespace, punctuation)
* remove capitalization
* remove numbers
* remove stop words (as they do not carry any interesting information about the content)
* stem words (to reduce the various forms of a word to its essential stem)
cp <- VCorpus(DirSource("./sample/"))
cp <- tm_map(cp,content_transformer(function(row) iconv(row, "latin2", "ASCII", sub="")))
cp <- tm_map(cp,content_transformer(stripWhitespace))
cp <- tm_map(cp,content_transformer(removePunctuation))
cp <- tm_map(cp,content_transformer(tolower))
cp <- tm_map(cp,content_transformer(removeNumbers))
cp <- tm_map(cp,removeWords,stopwords("english"))
cp <- tm_map(cp,content_transformer(stemDocument))
dtm <- DocumentTermMatrix(cp)
dtm
## <<DocumentTermMatrix (documents: 3, terms: 26091)>>
## Non-/sparse entries: 35970/42303
## Sparsity : 54%
## Maximal term length: 35
## Weighting : term frequency (tf)
For presentation we use a wordcloud of the 50 most frequent words across those files, together with a histogram of those words.
# Conversion to dataframe
m <- as.matrix(dtm)
freq <- sort(colSums(m),decreasing = TRUE)
df <- data.frame(word=names(freq), freq=freq)
# Subset only to 50 most frequent words
df_top50 <- df[1:50,]
# Histogram
ggplot(df_top50,aes(word,freq)) +
geom_bar(stat="identity", fill="lightblue") +
theme(axis.text.x=element_text(angle=90,vjust=0.2)) +
ggtitle("Words Frequency")
# Wordcloud
wordcloud(names(freq), freq,scale=c(3,.1), max.words = 50, colors = brewer.pal(9,"BuGn"))
Next we take a look at bi-grams and tri-grams. For this analysis we will use the RWeka package.
Let's first look at bi-grams (the top 10 bi-grams with the most occurrences):
# Data Preparation
BiGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm_bg <- DocumentTermMatrix(cp, control = list(tokenize = BiGramTokenizer))
m_bg <- as.matrix(dtm_bg)
freq_bg <- sort(colSums(m_bg),decreasing = TRUE)
df_bg <- data.frame(word=names(freq_bg), freq=freq_bg)
df_bg_top10 <- df_bg[1:10,]
# Histogram
ggplot(df_bg_top10 ,aes(word, freq)) +
geom_bar(stat="identity", fill="lightgreen") +
theme(axis.text.x=element_text(angle=90,vjust=0.2)) +
ggtitle("Bi-Grams Frequency")
We also look for frequent tri-grams (the top 10 tri-grams with the most occurrences):
# Data Preparation
TriGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm_tg <- DocumentTermMatrix(cp, control = list(tokenize = TriGramTokenizer))
m_tg <- as.matrix(dtm_tg)
freq_tg <- sort(colSums(m_tg),decreasing = TRUE)
df_tg <- data.frame(word=names(freq_tg), freq=freq_tg)
df_tg_top10 <- df_tg[1:10,]
# Histogram
ggplot(df_tg_top10 ,aes(word, freq)) +
geom_bar(stat="identity", fill="cyan") +
theme(axis.text.x=element_text(angle=90,vjust=0.2)) +
ggtitle("Tri-Grams Frequency")
We analysed the corpora dataset. The files are around 200 megabytes (MB) each.
We find that the English blogs and news parts consist of about 1 million items each, while the Twitter file consists of over 2 million items.
Finally, we found the most common words after some light preprocessing of the three individual files (stripping punctuation, normalizing capitalization and stemming). The most common are “one”, “will”, “said” and “time”. We also made a wordcloud of the most common words.
The most common bi-grams (two words occurring together) are “last year”, “new york” and “year ago”, and the most common tri-grams (three words occurring together) are “caprera hotel venic”, “hotel venic italy” and “italy lake holiday”.
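For reference, the exact terms and counts behind these observations can be read from the frequency tables built earlier; a minimal sketch using the df, df_bg and df_tg data frames defined above:
head(df, 5)     # most frequent single words
head(df_bg, 5)  # most frequent bi-grams
head(df_tg, 5)  # most frequent tri-grams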