The aim of this milestone report is to demonstrate exploratory analysis of the corpora files and to outline the goals for the predictive algorithm and the data application. The document describes only the major features of the data and briefly summarizes the plans for building the prediction algorithm and the Shiny app.
The motivation for this project is to:
The data for the analysis was downloaded from the course link and consists of corpora for four locales: DE, US, FI and RU. Each locale's corpus in turn comprises three kinds of texts: blogs, news and twitter. The data was collected with a web crawler.
Only the 'en_US' corpus will be used for the analysis, since existing text-mining libraries handle English better than the other languages.
The report contains the following steps:
Top 10 most frequent 'blogs' trigrams plot
Top 10 most frequent 'news' trigrams plot
Top 10 most frequent 'twitter' trigrams plot
Before starting the work, the following libraries are loaded (uncomment the install.packages lines if they are not installed yet):
#install.packages("dplyr")
library(dplyr)
#install.packages("magrittr")
library(magrittr)
#install.packages("stringi")
library(stringi)
#install.packages("kableExtra")
library(kableExtra)
#install.packages("quanteda")
library(quanteda)
#install.packages("ggplot2")
library(ggplot2)
#install.packages("RColorBrewer")
library(RColorBrewer)
corpusURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
fileName <- URLdecode(corpusURL) %>% basename()
#####################
# Download the data
# Uncomment if needed
#####################
#if(!file.exists(fileName)) {
# download.file(corpusURL, fileName, method = "curl")
# print ("File was downloaded successfully")
#} else {
# cat("The file [", fileName, "] already exists!")
#}
#####################
# Unzip the file
# Uncomment if needed
#####################
#library(tools)
#dirName <- file_path_sans_ext(fileName)
#if (!file.exists(dirName)) {
# dir.create(dirName)
# unzip(fileName, exdir = dirName)
# print ("File was unzipped successfully")
#}
pathToFiles = "./Coursera-SwiftKey/final/en_US"
blogsFileName = "en_US.blogs.txt"
newsFileName = "en_US.news.txt"
twitterFileName = "en_US.twitter.txt"
#########################
# Create file connections
#########################
blogsFilePath = file.path(pathToFiles, blogsFileName)
newsFilePath = file.path(pathToFiles, newsFileName)
twitterFilePath = file.path(pathToFiles, twitterFileName)
###############
# Read the file
###############
blogsVector = readLines(blogsFilePath, encoding = "UTF-8", skipNul = TRUE)
newsVector = readLines(newsFilePath, encoding = "UTF-8", skipNul = TRUE)
twitterVector = readLines(twitterFilePath, encoding = "UTF-8", skipNul = TRUE)
In this part the following statistics are shown: 1. File-level statistics: sizes of the files in MB. 2. Line-level statistics: number of lines, words and characters, and the median number of words per line.
####################
# Sizes of the files
####################
blogsFileSize = file.size(blogsFilePath) / 1024^2
newsFileSize = file.size(newsFilePath) / 1024^2
twitterFileSize = file.size(twitterFilePath) / 1024^2
###########################
# Line's summary statistics
###########################
blogsStatsGeneral = stri_stats_general(blogsVector)
blogsLineNumber = blogsStatsGeneral[["Lines"]]
blogsCharNumber = blogsStatsGeneral[["Chars"]]
newsStatsGeneral = stri_stats_general(newsVector)
newsLineNumber = newsStatsGeneral[["Lines"]]
newsCharNumber = newsStatsGeneral[["Chars"]]
twitterStatsGeneral = stri_stats_general(twitterVector)
twitterLineNumber = twitterStatsGeneral[["Lines"]]
twitterCharNumber = twitterStatsGeneral[["Chars"]]
blogsWordsPerLineVector = stri_count_words(blogsVector)
newsWordsPerLineVector = stri_count_words(newsVector)
twitterWordsPerLineVector = stri_count_words(twitterVector)
#Calculate summary statistics for lines
blogsSummary <- summary(blogsWordsPerLineVector)
newsSummary <- summary(newsWordsPerLineVector)
twitterSummary <- summary(twitterWordsPerLineVector)
#Collect information to the summary table
summaryTable <- data.frame(File = c("en_US.blogs.txt",
"en_US.news.txt",
"en_US.twitter.txt"),
Size_mb = c(round(blogsFileSize, 1),
round(newsFileSize, 1),
round(twitterFileSize, 1)),
Number_of_lines_M = c(round(blogsLineNumber/10^6, 1),
round(newsLineNumber/10^6, 1),
round(twitterLineNumber/10^6, 1)),
Number_of_words_M = c(round(sum(blogsWordsPerLineVector)/10^6, 1),
round(sum(newsWordsPerLineVector)/10^6, 1),
round(sum(twitterWordsPerLineVector)/10^6, 1)),
Median_number_of_words = c(round(blogsSummary["Median"], 1),
round(newsSummary["Median"], 1),
round(twitterSummary["Median"], 1)),
Number_of_characters_M = c(round(blogsCharNumber/10^6, 1),
round(newsCharNumber/10^6, 1),
round(twitterCharNumber/10^6, 1)))
The table below shows summary statistics for each corpus. The Twitter corpus has the shortest lines in terms of words per line.
kable(summaryTable) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
File | Size_mb | Number_of_lines_M | Number_of_words_M | Median_number_of_words | Number_of_characters_M |
---|---|---|---|---|---|
en_US.blogs.txt | 200.4 | 0.9 | 37.5 | 28 | 206.8 |
en_US.news.txt | 196.3 | 1.0 | 34.8 | 32 | 203.2 |
en_US.twitter.txt | 159.4 | 2.4 | 30.1 | 12 | 162.1 |
Sizes of the corpora files.
####################
# Visualization part
####################
barplot(c(blogsFileSize,
newsFileSize,
twitterFileSize),
main="Plot 1. Sizes of the corpora files",
names.arg=c("Blogs", "News", "Twitter"),
ylab="Plot 1. Size of the corpora, mb",
col = brewer.pal(3, "Set2"))
Summary statistics of words per line (quantiles). Since there are outliers, the y-axis is limited.
boxplot(list(newsWordsPerLineVector,
blogsWordsPerLineVector,
twitterWordsPerLineVector),
col = c(rgb(0.2,1,0,0.3),
rgb(0,0,1,0.3),
rgb(0.8,0,0,0.3)),
ylim=c(0, 150),
names = c("News", "Blogs", "Twitter"),
ylab = "Words per line, units",
main="Plot 2.Distribution of lines over words per line (wpl)")
Summary statistics of words per line (distributions).
hist(blogsWordsPerLineVector,
breaks=max(blogsWordsPerLineVector) - min(blogsWordsPerLineVector),
col=rgb(0,0,1,0.3),
border = NA,
xlim = c(0, 100),
ylim = c(0, 140000),
xlab="Ranges of words per line, wpl",
ylab="Number of lines, units",
main="Plot 3. Distribution of lines over words per line (wpl)"
)
hist(twitterWordsPerLineVector,
breaks = max(twitterWordsPerLineVector) - min(twitterWordsPerLineVector),
col=rgb(0.2,1,0,0.3),
border = NA,
xlab="Ranges of words per line, wpl",
ylab="Number of lines, units",
main="Distribution of lines over words per line (wpl)",
add=TRUE
)
hist(newsWordsPerLineVector,
breaks = max(newsWordsPerLineVector) - min(newsWordsPerLineVector),
xlim = c(0, 100),
ylim = c(0, 140000),
col=rgb(0.8,0,0,0.3),
border = NA,
xlab="Ranges of words per line, wpl",
ylab="Number of lines, units",
main="Distribution of lines over words per line (wpl)",
add=TRUE
)
legend("topright",
legend=c("Blogs", "Twitter", "News"),
col = c(rgb(0,0,1,0.3),
rgb(0.2,1,0,0.3),
rgb(0.8,0,0,0.3)
),
pt.cex=2,
pch=15
)
abline(v = c(4, 6, 27))
From Plot 2 we can see that lines in the 'news' corpus are the longest in words per line (median = 32 words per line), and lines in the 'twitter' corpus are the shortest (median = 12 words per line), as expected before the analysis. The same plot shows that the 'blogs' corpus has the most distant outlier: 6726 words per line. Such a long line is a combination of many sentences; we could split it into smaller sentences, but our aim was to analyze the raw corpora. Plot 3 shows a pronounced right skew for the 'blogs' corpus, with mode = 4; the 'twitter' distribution is markedly concentrated around its mode = 6, a result of the narrow range of words per line [1 : 47]; the 'news' corpus is closest to a normal distribution, with mode = 27, a shape that follows from the standard form of news articles.
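To make the point about that outlier more concrete, here is a minimal sketch (not part of the main pipeline) showing how the longest 'blogs' line could be located and split into sentences with the already loaded stringi package; it assumes blogsVector and blogsWordsPerLineVector are still in memory, which holds here because they are removed only in the sampling step.
# Locate the longest 'blogs' line and split it into sentences (illustration only)
longestIndex <- which.max(blogsWordsPerLineVector)
longestLine <- blogsVector[longestIndex]
stri_count_words(longestLine) # about 6726 words in this single line
blogsSentences <- unlist(stri_split_boundaries(longestLine, type = "sentence"))
length(blogsSentences) # number of sentences packed into that line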
This part describes the following steps: a. Sampling b. Bad-words list preparation c. Cleaning / Tokenization / Creating a document-feature matrix
Since the corpora are too large to keep in the memory of a typical home computer, they were randomly sampled; the sample size is 2% of the original data.
sampleFraction <- 0.02
blogsSampleSize <- round(length(blogsVector) * sampleFraction)
newsSampleSize <- round(length(newsVector) * sampleFraction)
twitterSampleSize <- round(length(twitterVector) * sampleFraction)
set.seed(1)
blogsSample <- sample(blogsVector, blogsSampleSize, replace = FALSE)
newsSample <- sample(newsVector, newsSampleSize, replace = FALSE)
twitterSample <- sample(twitterVector, twitterSampleSize, replace = FALSE)
#Full corpora are no longer needed, so we delete them to free memory
rm(blogsVector)
rm(newsVector)
rm(twitterVector)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2178130 116.4 8170217 436.4 6668759 356.2
## Vcells 11642313 88.9 110521409 843.3 138133767 1053.9
To perform profanity filtering we use a list of bad words downloaded from https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/badwordslist/badwords.txt. The file was already downloaded and placed in the directory with the corpora.
badWordsfileName = "badwords.txt"
badWordsFilePath = file.path("./Coursera-SwiftKey/final/en_US", badWordsfileName)
#read the file
badWordsVector = readLines(badWordsFilePath, encoding = "UTF-8", skipNul = TRUE)
#clean the file: strip characters that would break the regex match (*, |, $, () from the bad-word patterns
badWordsVector <- gsub("[*|$|(]", "", badWordsVector)
In this part we use the quanteda package, which allows us to clean, tokenize, create a document-feature matrix (DFM) and even split the text into n-grams in just a few commands.
# Clean, tokenize, create a Data-Feature Matrix and N-grams
# blogs
blogsSample <- gsub("(f|F|h|Ht)(tp)(s?)(://)(\\S*)", "", blogsSample)
blogsSample <- blogsSample[!grepl(paste(badWordsVector, collapse="|"), blogsSample)]
blogsTokenized <- tokens(blogsSample,
remove_numbers = TRUE,
remove_punct = TRUE,
remove_hyphens = FALSE,
remove_twitter = TRUE) %>%
tokens_remove(c(stopwords("english")))
blogsDfm <- blogsTokenized %>% dfm(ngrams = 1:3)
# news
newsSample <- gsub("(f|ht)tps?://\\S*", "", newsSample, ignore.case = TRUE) # remove URLs
newsSample <- newsSample[!grepl(paste(badWordsVector, collapse="|"), newsSample)]
newsTokenized <- tokens(newsSample,
remove_numbers = TRUE,
remove_punct = TRUE,
remove_hyphens = FALSE,
remove_twitter = TRUE) %>%
tokens_remove(c(stopwords("english")))
newsDfm <- newsTokenized %>% dfm(ngrams = 1:3)
# twitter
twitterSample <- gsub("(f|ht)tps?://\\S*", "", twitterSample, ignore.case = TRUE) # remove URLs
twitterSample <- twitterSample[!grepl(paste(badWordsVector, collapse="|"), twitterSample)]
twitterTokenized <- tokens(twitterSample,
remove_numbers = TRUE,
remove_punct = TRUE,
remove_hyphens = FALSE,
remove_twitter = TRUE) %>%
tokens_remove(c(stopwords("english")))
twitterDfm <- twitterTokenized %>% dfm(ngrams = 1:3)
#Sample corpora are no longer needed, so we delete them to free memory
rm(blogsSample)
rm(newsSample)
rm(twitterSample)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 3540810 189.1 8170217 436.4 8170217 436.4
## Vcells 19959257 152.3 88417127 674.6 138133767 1053.9
The document-feature matrices created in step 3.c contain unigrams, bigrams and trigrams simultaneously; now we split each matrix to extract these n-grams for further analysis.
#blogs n-grams matrixes
blogsUnigram <- dfm_select(blogsDfm, pattern = "\\b[a-z]*[^_]\\b", valuetype = "regex")
blogsBigram <- dfm_select(blogsDfm, pattern = "\\b[a-z]*_[a-z]*\\b", valuetype = "regex")
blogsTrigram <- dfm_select(blogsDfm, pattern = "\\b[a-z]*_[a-z]*_[a-z]*\\b", valuetype = "regex")
#blogsDfm is no longer needed, so we delete it to free memory
rm(blogsDfm)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 3540775 189.1 8170217 436.4 8170217 436.4
## Vcells 20032572 152.9 70733701 539.7 138133767 1053.9
blogsUnigramFreq <- textstat_frequency(blogsUnigram, n = 10)
blogsBigramFreq <- textstat_frequency(blogsBigram, n = 10)
blogsTrigramFreq <- textstat_frequency(blogsTrigram, n = 10)
#news n-grams matrixes
newsUnigram <- dfm_select(newsDfm, pattern = "\\b[a-z]*[^_]\\b", valuetype = "regex")
newsBigram <- dfm_select(newsDfm, pattern = "\\b[a-z]*_[a-z]*\\b", valuetype = "regex")
newsTrigram <- dfm_select(newsDfm, pattern = "\\b[a-z]*_[a-z]*_[a-z]*\\b", valuetype = "regex")
#newsDfm is no longer needed, so we delete it to free memory
rm(newsDfm)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 3570712 190.7 8170217 436.4 8170217 436.4
## Vcells 20246884 154.5 70733701 539.7 138133767 1053.9
newsUnigramFreq <- textstat_frequency(newsUnigram, n = 10)
newsBigramFreq <- textstat_frequency(newsBigram, n = 10)
newsTrigramFreq <- textstat_frequency(newsTrigram, n = 10)
#twitter n-grams matrixes
twitterUnigram <- dfm_select(twitterDfm, pattern = "\\b[a-z]*[^_]\\b", valuetype = "regex")
twitterBigram <- dfm_select(twitterDfm, pattern = "\\b[a-z]*_[a-z]*\\b", valuetype = "regex")
twitterTrigram <- dfm_select(twitterDfm, pattern = "\\b[a-z]*_[a-z]*_[a-z]*\\b", valuetype = "regex")
#twitterDfm is no longer needed, so we delete it to free memory
rm(twitterDfm)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 3564156 190.4 8170217 436.4 8170217 436.4
## Vcells 20356498 155.4 70733701 539.7 138133767 1053.9
twitterUnigramFreq <- textstat_frequency(twitterUnigram, n = 10)
twitterBigramFreq <- textstat_frequency(twitterBigram, n = 10)
twitterTrigramFreq <- textstat_frequency(twitterTrigram, n = 10)
The following plots demonstrate the frequencies of n-grams for each corpus.
#Plots for blogs
blogsUnigramPlot <- ggplot(data = blogsUnigramFreq,
mapping = aes(x = reorder(feature, frequency), y = frequency)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(x = "Unigrams", y = "Frequency")
blogsBigramPlot <- ggplot(data = blogsBigramFreq,
mapping = aes(x = reorder(feature, frequency), y = frequency)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(x = "Bigrams", y = "Frequency")
blogsTrigramPlot <- ggplot(data = blogsTrigramFreq,
mapping = aes(x = reorder(feature, frequency), y = frequency)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(x = "Trigrams", y = "Frequency")
blogsUnigramPlot
blogsBigramPlot
blogsTrigramPlot
#Plots for news
newsUnigramPlot <- ggplot(data = newsUnigramFreq,
mapping = aes(x = reorder(feature, frequency), y = frequency)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(x = "Unigrams", y = "Frequency")
newsBigramPlot <- ggplot(data = newsBigramFreq,
mapping = aes(x = reorder(feature, frequency), y = frequency)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(x = "Bigrams", y = "Frequency")
newsTrigramPlot <- ggplot(data = newsTrigramFreq,
mapping = aes(x = reorder(feature, frequency), y = frequency)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(x = "Trigrams", y = "Frequency")
newsUnigramPlot
newsBigramPlot
newsTrigramPlot
#Plots for twitter
twitterUnigramPlot <- ggplot(data = twitterUnigramFreq,
mapping = aes(x = reorder(feature, frequency), y = frequency)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(x = "Unigrams", y = "Frequency")
twitterBigramPlot <- ggplot(data = twitterBigramFreq,
mapping = aes(x = reorder(feature, frequency), y = frequency)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(x = "Bigrams", y = "Frequency")
twitterTrigramPlot <- ggplot(data = twitterTrigramFreq,
mapping = aes(x = reorder(feature, frequency), y = frequency)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(x = "Trigrams", y = "Frequency")
twitterUnigramPlot
twitterBigramPlot
twitterTrigramPlot
In this work we performed the following operations:
- the corpora were downloaded
- the corpora were read into memory
- summary statistics were collected (file-level and data-level)
- the corpora were pre-processed for further analysis (cleaning, tokenization, creation of a document-feature matrix and building n-grams)
- the n-grams were analyzed with a frequency approach
The next steps in building the application are:
- the corpora will be re-processed (stop-words will not be deleted, in order to keep the natural flow of the language)
- the n-gram model will be analysed in detail and validated
- the application will be developed, validated and deployed on the Shiny server.
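To make the planned n-gram model more concrete, below is a minimal sketch of a simple frequency back-off predictor. It assumes hypothetical frequency tables trigramFreq and bigramFreq in the format returned by textstat_frequency() (columns feature and frequency, with tokens joined by "_"), built on the re-processed corpus without stop-word removal; it is an illustration of the idea, not the final implementation.
# A simple back-off next-word predictor over n-gram frequency tables (sketch)
predictNextWord <- function(phrase, trigramFreq, bigramFreq) {
  words <- tolower(unlist(strsplit(trimws(phrase), "\\s+")))
  n <- length(words)
  # 1. Try the trigram table: match the last two words of the phrase
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n], sep = "_")
    hits <- trigramFreq[grepl(paste0("^", prefix, "_"), trigramFreq$feature), ]
    if (nrow(hits) > 0) {
      return(sub(".*_", "", hits$feature[which.max(hits$frequency)]))
    }
  }
  # 2. Back off to the bigram table: match the last word only
  hits <- bigramFreq[grepl(paste0("^", words[n], "_"), bigramFreq$feature), ]
  if (nrow(hits) > 0) {
    return(sub(".*_", "", hits$feature[which.max(hits$frequency)]))
  }
  # 3. No match found
  NA_character_
}
# Example call (hypothetical tables): predictNextWord("thanks for the", trigramFreq, bigramFreq)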