Project overview

The aim of this milestone report is to demonstrate exploratory analysis of the corpus files and to outline the goals for the predictive algorithm and the data application. The document describes only the major features of the data and briefly summarizes plans for creating the prediction algorithm and the Shiny app.

The motivation for this project is to:

  1. Demonstrate that the data was successfully downloaded and loaded in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that have been amassed so far.
  4. Get feedback on the plans for creating a prediction algorithm and Shiny app.

0.The corpora (textual data set) overview

The data for the analysis was downloaded from the course link and contains corpora for four languages: DE, US, FI and RU. Each language in its turn comprises three kinds of text: blogs, news and Twitter. The data was collected with the help of special software, a web crawler.

For the analysis only the ‘en_US’ corpus will be used, since existing text-mining libraries handle English better than the other languages.
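For orientation, the sketch below shows the expected layout of the unzipped archive (shown as comments, not as executed output; the ./Coursera-SwiftKey path is an assumption that matches the code in sections 1.b and 1.c).

####################################
# Expected layout of the unzipped
# archive (sketch, comments only)
####################################

# list.files("./Coursera-SwiftKey/final")
#   "de_DE" "en_US" "fi_FI" "ru_RU"

# list.files("./Coursera-SwiftKey/final/en_US")
#   "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"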

The report contains the following steps:

  1. Initial preparations
  • Library installation
  • Data downloading
  • Data loading
  2. Summary statistics analysis
  • Visualization and conclusion
  • Summary table
  • Plot 1. “Size of the corpora, MB”
  • Plot 2. “Distribution of lines over words per line (wpl)” - BOXPLOTS
  • Plot 3. “Distribution of lines over words per line (wpl)” - DISTRIBUTIONS
  • Summary statistics conclusion
  3. Samples preparation
  • Sampling
  • Bad words list preparation
  • Cleaning / Tokenization / Creating a Document-Feature Matrix
  4. Initial analysis of N-gram frequencies
  • Top 10 most frequent ‘blogs’ unigrams plot
  • Top 10 most frequent ‘blogs’ bigrams plot
  • Top 10 most frequent ‘blogs’ trigrams plot

  • Top 10 most frequent ‘news’ unigrams plot
  • Top 10 most frequent ‘news’ bigrams plot
  • Top 10 most frequent ‘news’ trigrams plot

  • Top 10 most frequent ‘twitter’ unigrams plot
  • Top 10 most frequent ‘twitter’ bigrams plot
  • Top 10 most frequent ‘twitter’ trigrams plot

  5. Summary and plans for further work

1. Initial preparations

1.a Library installation

Before starting the work, the following libraries are loaded (the install.packages() calls are commented out and can be run if a library is missing):

#install.packages("dplyr")
library(dplyr)

#install.packages("magrittr")
library(magrittr)

#install.packages("stringi")
library(stringi)

#install.packages("kableExtra")
library(kableExtra)

#install.packages("quanteda")
library(quanteda)

#install.packages("ggplot2")
library(ggplot2)

#install.packages("RColorBrewer")
library(RColorBrewer)

1.b Data downloading

corpusURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
fileName <- URLdecode(corpusURL) %>% basename()



#####################
# Download the data
# Uncomment if needed
#####################

#if(!file.exists(fileName)) {
#   download.file(corpusURL, fileName, method = "curl")
#   print("File was downloaded successfully")
#} else {
#   cat("The file [", fileName, "] already exists!")
#}


#####################
# Unzip the file
# Uncomment if needed
#####################

#library(tools)
#dirName <- file_path_sans_ext(fileName)

#if (!file.exists(dirName)) { 
#  dir.create(dirName)
#  unzip(fileName, exdir = dirName)
#  print ("File was unzipped successfully")
#}

1.c Data loading

pathToFiles = "./Coursera-SwiftKey/final/en_US"


blogsFileName = "en_US.blogs.txt"
newsFileName = "en_US.news.txt"
twitterFileName = "en_US.twitter.txt"


#########################
# Build the file paths
#########################

blogsFilePath = file.path(pathToFiles, blogsFileName)
newsFilePath = file.path(pathToFiles, newsFileName)
twitterFilePath = file.path(pathToFiles, twitterFileName)


################
# Read the files
################

blogsVector = readLines(blogsFilePath, encoding = "UTF-8", skipNul = TRUE)
newsVector = readLines(newsFilePath, encoding = "UTF-8", skipNul = TRUE)
twitterVector = readLines(twitterFilePath, encoding = "UTF-8", skipNul = TRUE)

2. Summary statistics analysis

In this part the following statistics will be shown (a toy illustration of the stringi helpers used for the corpus-level statistics follows the list):

  1. File-level statistics:
    • Sizes of the files in MB
  2. Corpus-level statistics:
    • Number of lines
    • Min / max / median / mean line length (in words per line)
    • Distribution of line length (in words per line)
    • Number of characters

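As a toy illustration (not part of the analysis), the snippet below shows what the two stringi helpers used for the corpus-level statistics return; toyLines is a made-up example vector.

#############################################
# Toy illustration of the stringi helpers
# (example only, not part of the analysis)
#############################################

toyLines <- c("one two three", "four five")

stri_stats_general(toyLines)  # named vector: Lines, LinesNEmpty, Chars, CharsNWhite
stri_count_words(toyLines)    # 3 2 - the number of words in each line (wpl)
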
####################
# Sizes of the files
####################

blogsFileSize = file.size(blogsFilePath) / 1024^2
newsFileSize = file.size(newsFilePath) / 1024^2
twitterFileSize = file.size(twitterFilePath) / 1024^2

2.a Visualization and conclusion

###########################
# Line's summary statistics
###########################

blogsStatsGeneral = stri_stats_general(blogsVector)
blogsLineNumber = blogsStatsGeneral[["Lines"]]
blogsCharNumber = blogsStatsGeneral[["Chars"]]

newsStatsGeneral = stri_stats_general(newsVector)
newsLineNumber = newsStatsGeneral[["Lines"]]
newsCharNumber = newsStatsGeneral[["Chars"]]

twitterStatsGeneral = stri_stats_general(twitterVector)
twitterLineNumber = twitterStatsGeneral[["Lines"]]
twitterCharNumber = twitterStatsGeneral[["Chars"]]

blogsWordsPerLineVector = stri_count_words(blogsVector)
newsWordsPerLineVector = stri_count_words(newsVector)
twitterWordsPerLineVector = stri_count_words(twitterVector)


#Calculate summary statistics for lines
blogsSummary <- summary(blogsWordsPerLineVector)
newsSummary <- summary(newsWordsPerLineVector)
twitterSummary <- summary(twitterWordsPerLineVector)


#Collect information to the summary table
summaryTable <- data.frame(File = c("en_US.blogs.txt",
                                      "en_US.news.txt",
                                      "en_US.twitter.txt"),
                           
                           Size_mb = c(round(blogsFileSize, 1),
                                          round(newsFileSize, 1),
                                          round(twitterFileSize, 1)),
                           
                           Number_of_lines_M = c(round(blogsLineNumber/10^6, 1),
                                                       round(newsLineNumber/10^6, 1),
                                                       round(twitterLineNumber/10^6, 1)),
                           
                           Number_of_words_M = c(round(sum(blogsWordsPerLineVector)/10^6, 1),
                                                       round(sum(newsWordsPerLineVector)/10^6, 1),
                                                       round(sum(twitterWordsPerLineVector)/10^6, 1)),
                           
                           Median_number_of_words = c(round(blogsSummary["Median"], 1),
                                                           round(newsSummary["Median"], 1),
                                                           round(twitterSummary["Median"], 1)),
                           
                           Number_of_characters_M = c(round(blogsCharNumber/10^6, 1),
                                                            round(newsCharNumber/10^6, 1),
                                                             round(twitterCharNumber/10^6, 1)))

The table below shows summary statistics for each corpus. The ‘twitter’ corpus has the shortest lines (in words per line).

kable(summaryTable) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
File                Size_mb  Number_of_lines_M  Number_of_words_M  Median_number_of_words  Number_of_characters_M
en_US.blogs.txt       200.4                0.9               37.5                      28                   206.8
en_US.news.txt        196.3                1.0               34.8                      32                   203.2
en_US.twitter.txt     159.4                2.4               30.1                      12                   162.1

Sizes of the corpora files.

####################
# Visualization part
####################

barplot(c(blogsFileSize,
          newsFileSize,
          twitterFileSize), 
        main="Plot 1. Sizes of the corpora files",
        names.arg=c("Blogs", "News", "Twitter"),
        ylab="Plot 1. Size of the corpora, mb",
        col =  brewer.pal(3, "Set2"))

Summary statistics of words per line (quantiles). Since there are outliers, the y-axis is limited to 150 words per line.

boxplot(list(newsWordsPerLineVector,
             blogsWordsPerLineVector,
             twitterWordsPerLineVector),
        col = c(rgb(0.2,1,0,0.3),
                rgb(0,0,1,0.3),
                rgb(0.8,0,0,0.3)),
        ylim=c(0, 150),
        names = c("News", "Blogs", "Twitter"),
        ylab = "Words per line, units",
        main="Plot 2.Distribution of lines over words per line (wpl)")

Summary statistics of words per line (distributions). The three histograms are overlaid, and the vertical lines mark the modes discussed in section 2.b.

hist(blogsWordsPerLineVector,
     breaks=max(blogsWordsPerLineVector) - min(blogsWordsPerLineVector),
     col=rgb(0,0,1,0.3),
     border = NA,
     xlim = c(0, 100),
     ylim = c(0, 140000),
     xlab="Ranges of words per line, wpl", 
     ylab="Number of lines, units",
     main="Plot 3. Distribution of lines over words per line (wpl)"
)

hist(twitterWordsPerLineVector,
     breaks = max(twitterWordsPerLineVector) - min(twitterWordsPerLineVector),
     col=rgb(0.2,1,0,0.3),
     border = NA,
     xlab="Ranges of words per line, wpl", 
     ylab="Number of lines, units",
     main="Distribution of lines over words per line (wpl)",
     add=TRUE
)

hist(newsWordsPerLineVector,
     breaks = max(newsWordsPerLineVector) - min(newsWordsPerLineVector),
     xlim = c(0, 100),
     ylim = c(0, 140000),
     col=rgb(0.8,0,0,0.3),
     border = NA,
     xlab="Ranges of words per line, wpl", 
     ylab="Number of lines, units",
     main="Distribution of lines over words per line (wpl)",
     add=TRUE
)

legend("topright",
       legend=c("Blogs",  "Twitter", "News"),
       col = c(rgb(0,0,1,0.3),
               rgb(0.2,1,0,0.3),
               rgb(0.8,0,0,0.3)
             ),
       pt.cex=2,
       pch=15
)

abline(v = c(4, 6, 27))  # vertical lines at the modes discussed in section 2.b

2.b Summary statistics conclusion

From Plot 2 we can see that lines in the ‘news’ corpus are the longest in words per line (median = 32 words per line), while lines in the ‘twitter’ corpus are the shortest (median = 12 words per line), as was expected before the analysis. The same plot shows that the ‘blogs’ corpus has the most distant outlier: 6726 words per line. Such a long line is a concatenation of many sentences; we could split it into smaller sentences, but our aim here was to analyze the raw corpora. Plot 3 shows a pronounced right skew of the ‘blogs’ corpus with mode = 4; the ‘twitter’ distribution is markedly concentrated around its mode = 6, which is a result of the narrow range of words per line [1 : 47]; the ‘news’ corpus is the closest to a normal distribution, with mode = 27, a shape that follows from the standard format of news articles.
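The specific figures quoted above can be double-checked with a few one-liners. This is only a sanity-check sketch, assuming the *WordsPerLineVector objects from section 2.a are still in memory; the modes are the values marked by the vertical lines in Plot 3.

############################################
# Sanity-check sketch for the figures above
# (assumes the *WordsPerLineVector objects
# from section 2.a are still in memory)
############################################

max(blogsWordsPerLineVector)                    # most distant outlier (reported as 6726 wpl)
which.max(tabulate(blogsWordsPerLineVector))    # mode of the 'blogs' wpl distribution
which.max(tabulate(twitterWordsPerLineVector))  # mode of the 'twitter' wpl distribution
which.max(tabulate(newsWordsPerLineVector))     # mode of the 'news' wpl distribution
range(twitterWordsPerLineVector)                # narrow 'twitter' range (reported as 1 to 47)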

3. Samples preparation

This part describes the following steps:

  a. Sampling
  b. Bad words list preparation
  c. Cleaning / Tokenization / Creating a Document-Feature Matrix

3.a Sampling

Since the corpora are too large to comfortably keep in the memory of a typical home computer, each corpus was sampled at random; the sample size is 2% of the original corpus.

sampleFraction <- 0.02

blogsSampleSize <- round(length(blogsVector) * sampleFraction)
newsSampleSize <- round(length(newsVector) * sampleFraction)
twitterSampleSize <- round(length(twitterVector) * sampleFraction)


set.seed(1)
blogsSample <- sample(blogsVector, blogsSampleSize, replace = FALSE)
newsSample <- sample(newsVector, newsSampleSize, replace = FALSE)
twitterSample <- sample(twitterVector, twitterSampleSize, replace = FALSE)


#Full corpora are no longer needed, so delete them to free memory
rm(blogsVector)
rm(newsVector)
rm(twitterVector)
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  2178130 116.4    8170217 436.4   6668759  356.2
## Vcells 11642313  88.9  110521409 843.3 138133767 1053.9

3.b Bad words list preparation

In order to perform profanity filtering we use a list of bad words downloaded from the following link: https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/badwordslist/badwords.txt. The file was downloaded beforehand and placed in the directory with the corpora.

badWordsfileName = "badwords.txt"
badWordsFilePath = file.path("./Coursera-SwiftKey/final/en_US", badWordsfileName)

#read the file
badWordsVector = readLines(badWordsFilePath, encoding = "UTF-8", skipNul = TRUE)

#clean the list: strip the characters *, | , $ and ( so the words can later be combined into a regex pattern
badWordsVector <- gsub("[*|$|(]", "", badWordsVector)

3.c Cleaning / Tokenization / Creating a Document-Feature Matrix

In this part we will use the quanteda package, which allows us to clean, tokenize, create a document-feature matrix (DFM) and even split the text into n-grams in just a couple of commands.

# Clean, tokenize, create a document-feature matrix and n-grams

# blogs

# remove URLs, then drop lines containing profanity
blogsSample <- gsub("(https?|ftp)://\\S*", "", blogsSample, ignore.case = TRUE)
blogsSample <- blogsSample[!grepl(paste(badWordsVector, collapse="|"), blogsSample)]

blogsTokenized <- tokens(blogsSample, 
                         remove_numbers = TRUE,
                         remove_punct = TRUE,
                         remove_hyphens = FALSE,
                         remove_twitter = TRUE) %>%
                             tokens_remove(c(stopwords("english")))

blogsDfm <- blogsTokenized %>% dfm(ngrams = 1:3)


# news

# remove URLs, then drop lines containing profanity
newsSample <- gsub("(https?|ftp)://\\S*", "", newsSample, ignore.case = TRUE)
newsSample <- newsSample[!grepl(paste(badWordsVector, collapse="|"), newsSample)]

newsTokenized <- tokens(newsSample, 
                         remove_numbers = TRUE,
                         remove_punct = TRUE,
                         remove_hyphens = FALSE,
                         remove_twitter = TRUE) %>%
                             tokens_remove(c(stopwords("english")))

newsDfm <- newsTokenized %>% dfm(ngrams = 1:3)


# twitter

# remove URLs, then drop lines containing profanity
twitterSample <- gsub("(https?|ftp)://\\S*", "", twitterSample, ignore.case = TRUE)
twitterSample <- twitterSample[!grepl(paste(badWordsVector, collapse="|"), twitterSample)]

twitterTokenized <- tokens(twitterSample, 
                           remove_numbers = TRUE,
                           remove_punct = TRUE,
                           remove_hyphens = FALSE,
                           remove_twitter = TRUE) %>%
                             tokens_remove(c(stopwords("english")))

twitterDfm <- twitterTokenized %>% dfm(ngrams = 1:3)


#Sampled corpora are no longer needed, so delete them to free memory

rm(blogsSample)
rm(newsSample)
rm(twitterSample)
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  3540810 189.1    8170217 436.4   8170217  436.4
## Vcells 19959257 152.3   88417127 674.6 138133767 1053.9

4. Initial analysis of N-gram frequencies

The document-feature matrices created in step 3.c contain unigrams, bigrams and trigrams simultaneously. Since quanteda joins the tokens of an n-gram with an underscore (“_”), we can now split each matrix by selecting features with zero, one or two underscores, extracting the n-grams for further analysis.

#blogs n-gram matrices

blogsUnigram <- dfm_select(blogsDfm, pattern = "\\b[a-z]*[^_]\\b", valuetype = "regex")
blogsBigram <- dfm_select(blogsDfm, pattern = "\\b[a-z]*_[a-z]*\\b", valuetype = "regex")
blogsTrigram <- dfm_select(blogsDfm, pattern = "\\b[a-z]*_[a-z]*_[a-z]*\\b", valuetype = "regex")

#blogsDfm is no longer needed, so delete it to free memory

rm(blogsDfm)
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  3540775 189.1    8170217 436.4   8170217  436.4
## Vcells 20032572 152.9   70733701 539.7 138133767 1053.9
blogsUnigramFreq <- textstat_frequency(blogsUnigram, n = 10)
blogsBigramFreq <- textstat_frequency(blogsBigram, n = 10)
blogsTrigramFreq <- textstat_frequency(blogsTrigram, n = 10)




#news n-gram matrices

newsUnigram <- dfm_select(newsDfm, pattern = "\\b[a-z]*[^_]\\b", valuetype = "regex")
newsBigram <- dfm_select(newsDfm, pattern = "\\b[a-z]*_[a-z]*\\b", valuetype = "regex")
newsTrigram <- dfm_select(newsDfm, pattern = "\\b[a-z]*_[a-z]*_[a-z]*\\b", valuetype = "regex")

#newsDfm is no longer needed, so delete it to free memory

rm(newsDfm)
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  3570712 190.7    8170217 436.4   8170217  436.4
## Vcells 20246884 154.5   70733701 539.7 138133767 1053.9
newsUnigramFreq <- textstat_frequency(newsUnigram, n = 10)
newsBigramFreq <- textstat_frequency(newsBigram, n = 10)
newsTrigramFreq <- textstat_frequency(newsTrigram, n = 10)




#twitter n-gram matrices

twitterUnigram <- dfm_select(twitterDfm, pattern = "\\b[a-z]*[^_]\\b", valuetype = "regex")
twitterBigram <- dfm_select(twitterDfm, pattern = "\\b[a-z]*_[a-z]*\\b", valuetype = "regex")
twitterTrigram <- dfm_select(twitterDfm, pattern = "\\b[a-z]*_[a-z]*_[a-z]*\\b", valuetype = "regex")

#twitterDfm is no longer needed, so delete it to free memory

rm(twitterDfm)
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  3564156 190.4    8170217 436.4   8170217  436.4
## Vcells 20356498 155.4   70733701 539.7 138133767 1053.9
twitterUnigramFreq <- textstat_frequency(twitterUnigram, n = 10)
twitterBigramFreq <- textstat_frequency(twitterBigram, n = 10)
twitterTrigramFreq <- textstat_frequency(twitterTrigram, n = 10)

The following plots show the 10 most frequent n-grams of each type for each corpus.

#Plots for blogs

blogsUnigramPlot <- ggplot(data = blogsUnigramFreq, 
                           mapping = aes(x = reorder(feature, frequency), y = frequency)) +
                    geom_bar(stat = "identity") +
                    coord_flip() +
                    labs(x = "Unigrams", y = "Frequency")

blogsBigramPlot <- ggplot(data = blogsBigramFreq, 
                           mapping = aes(x = reorder(feature, frequency), y = frequency)) +
                    geom_bar(stat = "identity") +
                    coord_flip() +
                    labs(x = "Bigrams", y = "Frequency")

blogsTrigramPlot <- ggplot(data = blogsTrigramFreq, 
                           mapping = aes(x = reorder(feature, frequency), y = frequency)) +
                    geom_bar(stat = "identity") +
                    coord_flip() +
                    labs(x = "Trigrams", y = "Frequency")



blogsUnigramPlot

blogsBigramPlot

blogsTrigramPlot

#Plots for news

newsUnigramPlot <- ggplot(data = newsUnigramFreq, 
                           mapping = aes(x = reorder(feature, frequency), y = frequency)) +
                    geom_bar(stat = "identity") +
                    coord_flip() +
                    labs(x = "Unigrams", y = "Frequency")

newsBigramPlot <- ggplot(data = newsBigramFreq, 
                           mapping = aes(x = reorder(feature, frequency), y = frequency)) +
                    geom_bar(stat = "identity") +
                    coord_flip() +
                    labs(x = "Bigrams", y = "Frequency")

newsTrigramPlot <- ggplot(data = newsTrigramFreq, 
                           mapping = aes(x = reorder(feature, frequency), y = frequency)) +
                    geom_bar(stat = "identity") +
                    coord_flip() +
                    labs(x = "Trigrams", y = "Frequency")



newsUnigramPlot

newsBigramPlot

newsTrigramPlot

#Plots for twitter

twitterUnigramPlot <- ggplot(data = twitterUnigramFreq, 
                           mapping = aes(x = reorder(feature, frequency), y = frequency)) +
                    geom_bar(stat = "identity") +
                    coord_flip() +
                    labs(x = "Unigrams", y = "Frequency")

twitterBigramPlot <- ggplot(data = twitterBigramFreq, 
                           mapping = aes(x = reorder(feature, frequency), y = frequency)) +
                    geom_bar(stat = "identity") +
                    coord_flip() +
                    labs(x = "Bigrams", y = "Frequency")

twitterTrigramPlot <- ggplot(data = twitterTrigramFreq, 
                           mapping = aes(x = reorder(feature, frequency), y = frequency)) +
                    geom_bar(stat = "identity") +
                    coord_flip() +
                    labs(x = "Trigrams", y = "Frequency")



twitterUnigramPlot

twitterBigramPlot

twitterTrigramPlot

5. Summary and plans for further work

In this work we performed the following operations:

  • the corpora were downloaded
  • the corpora were read into memory
  • summary statistics were collected (file-level and corpus-level)
  • the corpora were pre-processed for further analysis (cleaning, tokenization, creation of document-feature matrices and building of n-grams)
  • the n-grams were analyzed with a frequency-based approach

The next steps in building the application are:

  • the corpora will be re-processed (stop words will not be removed, in order to keep the natural flow of the language)
  • the n-gram model will be analyzed in detail and validated (a rough sketch of the kind of lookup such a model performs is given below)
  • the application will be developed, validated and deployed on the Shiny server.
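As a very rough illustration of the planned n-gram approach (not the final algorithm), the sketch below shows a frequency-based lookup with a simple back-off from trigrams to bigrams. predictNextWord is a hypothetical helper; it assumes frequency tables shaped like the textstat_frequency() output from section 4, built over all features rather than only the top 10.

############################################################
# Rough sketch of a frequency-based next-word lookup with a
# simple trigram -> bigram back-off (hypothetical helper,
# not the final algorithm)
############################################################

predictNextWord <- function(prefix, trigramFreq, bigramFreq) {
  words   <- strsplit(tolower(prefix), "\\s+")[[1]]
  lastTwo <- paste(tail(words, 2), collapse = "_")
  lastOne <- tail(words, 1)

  # try trigrams whose first two words match the prefix, then back off to bigrams
  hit <- trigramFreq[grepl(paste0("^", lastTwo, "_"), trigramFreq$feature), ]
  if (nrow(hit) == 0) {
    hit <- bigramFreq[grepl(paste0("^", lastOne, "_"), bigramFreq$feature), ]
  }
  if (nrow(hit) == 0) return(NA_character_)

  # return the last word of the most frequent matching n-gram
  best <- hit$feature[which.max(hit$frequency)]
  tail(strsplit(best, "_")[[1]], 1)
}

# e.g. predictNextWord("one of", blogsTrigramFreq, blogsBigramFreq)

A production version would also need a unigram fallback for unseen prefixes, handling of ties, and some form of smoothing.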