In this data exploration we analyse the “Corpora” dataset, a publicly available collection of tweets, blog posts and news articles gathered online. The goal of this analysis is to present a basic overview of the dataset in terms of its properties and content.
We work with three forms of text: blog posts, tweets and news articles.
The next section describes the data collection process, the section after that presents the results of the data exploration, and we finish with conclusions and references.
For our analysis we use the R language.
The data can be obtained [here](https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip).
# download & unzip
destination_file <- "Coursera-SwiftKey.zip"
source_file <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(source_file, destination_file)
unzip(destination_file)
The zip file contains the following files:
list.files(recursive = TRUE, full.names = TRUE)
## [1] "./Coursera-SwiftKey.zip" "./Exploration_Analysis.Rmd"
## [3] "./final/de_DE/de_DE.blogs.txt" "./final/de_DE/de_DE.news.txt"
## [5] "./final/de_DE/de_DE.twitter.txt" "./final/en_US/en_US.blogs.txt"
## [7] "./final/en_US/en_US.news.txt" "./final/en_US/en_US.twitter.txt"
## [9] "./final/fi_FI/fi_FI.blogs.txt" "./final/fi_FI/fi_FI.news.txt"
## [11] "./final/fi_FI/fi_FI.twitter.txt" "./final/ru_RU/ru_RU.blogs.txt"
## [13] "./final/ru_RU/ru_RU.news.txt" "./final/ru_RU/ru_RU.twitter.txt"
## [15] "./sample/blogs.txt" "./sample/news.txt"
## [17] "./sample/twitter.txt"
The corpora data were collected from publicly available sources by a web crawler. The crawler checks the language, so that the collected texts mainly consist of the desired language.
The dataset contains these languages:
list.files(path = "./final" )
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
We will use only the en_US data. Every language contains three types of entries (listed below for en_US):
* blogs
* tweets
* news
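As a quick check, the three en_US files can be listed directly (a minimal sketch, assuming the archive was unzipped into ./final as above):
list.files(path = "./final/en_US")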
Sizes of the files in megabytes:
file.size("final/en_US/en_US.blogs.txt")/1024/1024
## [1] 200.4242
file.size("final/en_US/en_US.twitter.txt")/1024/1024
## [1] 159.3641
file.size("final/en_US/en_US.news.txt")/1024/1024
## [1] 196.2775
Let's now import those files. As they are quite large, we will read them in line by line.
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8", warn = FALSE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8", warn = FALSE)
# binary mode import (a text-mode readLines may stop early on embedded special characters)
con <- file("final/en_US/en_US.news.txt", open="rb")
news <- readLines(con, encoding="UTF-8", warn = FALSE)
close(con)
rm(con)
We are going to need these libraries to make data analysis easier:
library(tm)
## Loading required package: NLP
library(SnowballC)
library(stringi)
library(wordcloud)
## Loading required package: RColorBrewer
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(RWeka)
First, let's do some basic line-level analysis of the files:
stri_stats_general( blogs )
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 206824382 170389539
stri_stats_general( news )
## Lines LinesNEmpty Chars CharsNWhite
## 1010242 1010242 203223154 169860866
stri_stats_general( twitter )
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162096031 134082634
We can see that there are over 4 million lines in total and that none of the files contain blank lines.
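As a cross-check, the total line count can be computed from the same statistics (a minimal sketch; stri_stats_general comes from the stringi package loaded above):
# sum the "Lines" entries reported for the three files
total_lines <- stri_stats_general(blogs)["Lines"] +
  stri_stats_general(news)["Lines"] +
  stri_stats_general(twitter)["Lines"]
total_lines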
As the dataset is quite large, we will use only 80% of the data to analyse the words it contains.
# create sample dir if not exists
dir.create("./sample/", showWarnings = FALSE)
#sample size
size <- .8
set.seed(50)
blogs_sample <- sample(blogs, size = as.integer(length(blogs)*size))
news_sample <- sample(news, size = as.integer(length(news)*size))
twitter_sample <- sample(twitter, size = as.integer(length(twitter)*size))
# write the samples as plain text so they can be read back in as a corpus
writeLines(blogs_sample, "./sample/blogs.txt")
writeLines(news_sample, "./sample/news.txt")
writeLines(twitter_sample, "./sample/twitter.txt")
Now let's analyse the words in these three datasets with the help of the tm (text mining) library.
As an initial step we need to prepare the data, i.e. do some sort of normalization:
* remove non-English characters
* remove unwanted content (whitespace, punctuation)
* remove capitalization
* remove numbers
* remove stop words (as they do not carry any interesting information about the content)
* stem words (to reduce the various forms of a word to its essential stem)
cp <- VCorpus(DirSource("./sample/"))
cp <- tm_map(cp,content_transformer(function(row) iconv(row, "latin2", "ASCII", sub="")))
cp <- tm_map(cp,content_transformer(stripWhitespace))
cp <- tm_map(cp,content_transformer(removePunctuation))
cp <- tm_map(cp,content_transformer(tolower))
cp <- tm_map(cp,content_transformer(removeNumbers))
cp <- tm_map(cp,removeWords,stopwords("english"))
cp <- tm_map(cp,content_transformer(stemDocument))
dtm <- DocumentTermMatrix(cp)
dtm
## <<DocumentTermMatrix (documents: 3, terms: 26091)>>
## Non-/sparse entries: 35970/42303
## Sparsity : 54%
## Maximal term length: 35
## Weighting : term frequency (tf)
For presentation we use a wordcloud of the 50 most frequent words across those files, together with a histogram of those words.
# Conversion to dataframe
m <- as.matrix(dtm)
freq <- sort(colSums(m),decreasing = TRUE)
df <- data.frame(word=names(freq), freq=freq)
# Subset only to 50 most frequent words
df_top50 <- df[1:50,]
# Histogram
ggplot(df_top50,aes(word,freq)) +
geom_bar(stat="identity", fill="lightblue") +
theme(axis.text.x=element_text(angle=90,vjust=0.2)) +
ggtitle("Words Frequency")
# Wordcloud
wordcloud(names(freq), freq,scale=c(3,.1), max.words = 50, colors = brewer.pal(9,"BuGn"))
Next we take a look at bi-grams and tri-grams. For this analysis we will use the RWeka package.
Let's first look at bi-grams (the top 10 bi-grams with the most occurrences):
# Data Preparation
BiGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm_bg <- DocumentTermMatrix(cp, control = list(tokenize = BiGramTokenizer))
m_bg <- as.matrix(dtm_bg)
freq_bg <- sort(colSums(m_bg),decreasing = TRUE)
df_bg <- data.frame(word=names(freq_bg), freq=freq_bg)
df_bg_top10 <- df_bg[1:10,]
# Histogram
ggplot(df_bg_top10 ,aes(word, freq)) +
geom_bar(stat="identity", fill="lightgreen") +
theme(axis.text.x=element_text(angle=90,vjust=0.2)) +
ggtitle("Bi-Grams Frequency")
We also look for frequent tri-grams (the top 10 tri-grams with the most occurrences):
# Data Preparation
TriGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm_tg <- DocumentTermMatrix(cp, control = list(tokenize = TriGramTokenizer))
m_tg <- as.matrix(dtm_tg)
freq_tg <- sort(colSums(m_tg),decreasing = TRUE)
df_tg <- data.frame(word=names(freq_tg), freq=freq_tg)
df_tg_top10 <- df_tg[1:10,]
# Histogram
ggplot(df_tg_top10 ,aes(word, freq)) +
geom_bar(stat="identity", fill="cyan") +
theme(axis.text.x=element_text(angle=90,vjust=0.2)) +
ggtitle("Tri-Grams Frequency")
We analysed the corpora dataset. The files are around 200 megabytes (MB) each.
We find that the English blogs and news parts consist of about 1 million items each, while the Twitter file consists of over 2 million items.
Finally, we found the most common words after some light preprocessing of the three individual files (stripping punctuation, normalizing capitalization and stemming). The most common are “one”, “will”, “said” and “time”. We also made a wordcloud of the most common words.
The most common bi-grams (two words occurring together) are “last year”, “new york” and “year ago”, and the most common tri-grams (three words occurring together) are “caprera hotel venic”, “hotel venic italy” and “italy lake holiday”.
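For reference, the exact terms and counts behind these observations can be read from the frequency tables built earlier; a minimal sketch using the df, df_bg and df_tg data frames defined above:
head(df, 5)     # most frequent single words
head(df_bg, 5)  # most frequent bi-grams
head(df_tg, 5)  # most frequent tri-grams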