This document is a milestone report for a Data Science course. It describes an exploratory data analysis of three large English-language data sets from the USA (news, blogs, twitter), obtained from the course site. The data manipulation steps are described first, followed by the analysis outcomes and the plans for the future development of a Shiny application. The whole procedure was carried out in multiple steps in order to avoid issues caused by limited memory capacity.
As a first step, the required libraries were loaded and the data were read into R.
lapply(c("readr", "ggplot2", "R.utils", "dplyr", "quanteda"), library, character.only = T)
# Read the three raw text files (tab-delimited, no header row)
en_US_news <- read_delim("en_US.news.txt", "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
en_US_blogs <- read_delim("en_US.blogs.txt", "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
en_US_twitter <- read_delim("en_US.twitter.txt", "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
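As a side note, since these are plain text files with one post per line, base readLines() is a possible alternative for the initial load as well (it is used later for the sampled files); this is only a sketch, not the approach taken in this report.
# Alternative load (not used here): read the raw lines directly;
# skipNul = TRUE guards against any embedded null characters.
# en_US_twitter_lines <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)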
Due to the large sample sizes, the original data sets had to be reduced. Systematic sampling was used, as there was no reason to assume any serial dependency among the posts: starting from the first row, every 20th row was included in the sample, which means that approximately 5% of the data were used in these analyses.
# News: keep every 20th row and save the sample
selectionvector_news <- seq(1, 1000107, by = 20)
news <- en_US_news[selectionvector_news,]
t1 <- as.data.frame(news)
write.table(t1, "news1.txt")
# Blogs: keep every 20th row and save the sample
selectionvector_blogs <- seq(1, 878689, by = 20)
blogs <- en_US_blogs[selectionvector_blogs,]
t2 <- as.data.frame(blogs)
write.table(t2, "blogs1.txt")
# Twitter: keep every 20th row and save the sample
selectionvector_twitter <- seq(1, 2342938, by = 20)
twitter <- en_US_twitter[selectionvector_twitter,]
t3 <- as.data.frame(twitter)
write.table(t3, "twitter1.txt")
As can be seen from the table below (object sizes are in bytes), the sampled data sets contain about 5% of the rows of the original data.
# Compare object sizes (in bytes) and row counts before and after sampling
rows_original <- c(nrow(en_US_news), nrow(en_US_blogs), nrow(en_US_twitter))
size_original <- c(object.size(en_US_news), object.size(en_US_blogs), object.size(en_US_twitter))
rows_shrinked <- c(nrow(t1), nrow(t2), nrow(t3))
size_shrinked <- c(object.size(t1), object.size(t2), object.size(t3))
table <- data.frame(size_original, size_shrinked, rows_original, rows_shrinked)
row.names(table) <- c("news", "blogs", "twitter")
table
## size_original size_shrinked rows_original rows_shrinked
## news 269165648 13436512 1000107 50006
## blogs 266420112 13242304 878689 43935
## twitter 333218880 16857344 2342938 117147
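The size columns above are reported in bytes. As a side note, object.size() output can also be formatted in megabytes, assuming the original objects are still in memory:
# Report object sizes in megabytes instead of bytes.
format(object.size(en_US_news), units = "MB")
format(object.size(t1), units = "MB")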
In order to avoid constant crashes of the software due to limited memory capacity, the samples were saved as separate files (above), while the original, large objects were removed from the workspace.
rm(en_US_news)
rm(en_US_blogs)
rm(en_US_twitter)
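Optionally (this step was not part of the original run), an explicit garbage collection can be requested after removing the large objects, so that R returns the freed memory sooner:
# Ask R to release the memory freed by the rm() calls above.
gc()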
The sample files were then re-read as lines; digits and punctuation were removed and all letters were converted to lowercase.
news1 <- readLines("news1.txt")
blogs1 <- readLines("blogs1.txt")
twitter1 <- readLines("twitter1.txt")
news2 <- gsub("[0-9]", "", news1)
news2 <- gsub("[[:punct:]]", "", news2)
news2 <- tolower(news2)
blogs2 <- gsub("[0-9]", "", blogs1)
blogs2 <- gsub("[[:punct:]]", "", blogs2)
blogs2 <- tolower(blogs2)
twitter2 <- gsub("[0-9]", "", twitter1)
twitter2 <- gsub("[[:punct:]]", "", twitter2)
twitter2 <- tolower(twitter2)
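As a side note, the three cleaning steps could be wrapped into a single helper; this is only a sketch with a hypothetical name, producing the same result as the repeated gsub() and tolower() calls above.
# Hypothetical helper bundling the cleaning steps used above.
clean_text <- function(lines) {
  lines <- gsub("[0-9]", "", lines)        # remove digits
  lines <- gsub("[[:punct:]]", "", lines)  # remove punctuation
  tolower(lines)                           # convert to lowercase
}
# Example, equivalent to the news chunk above:
# news2 <- clean_text(news1)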
For the further analyses, the quanteda package was used, which allows operations on tokens and the construction of document-feature matrices. First, the texts were tokenized, followed by the removal of any leftover punctuation and of English stopwords. One-word n-grams (unigrams) were created first, followed by two-word n-grams (bigrams).
# Unigram (n = 1) document-feature matrices
nDfm1 <- tokens(news2) %>%
tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
tokens_remove(stopwords("english"), padding = TRUE) %>%
tokens_ngrams(n = 1) %>%
dfm()
bDfm1 <- tokens(blogs2) %>%
tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
tokens_remove(stopwords("english"), padding = TRUE) %>%
tokens_ngrams(n = 1) %>%
dfm()
tDfm1 <- tokens(twitter2) %>%
tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
tokens_remove(stopwords("english"), padding = TRUE) %>%
tokens_ngrams(n = 1) %>%
dfm()
# Bigram (n = 2) document-feature matrices
nDfm2 <- tokens(news2) %>%
tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
tokens_remove(stopwords("english"), padding = TRUE) %>%
tokens_ngrams(n = 2) %>%
dfm()
bDfm2 <- tokens(blogs2) %>%
tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
tokens_remove(stopwords("english"), padding = TRUE) %>%
tokens_ngrams(n = 2) %>%
dfm()
tDfm2 <- tokens(twitter2) %>%
tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
tokens_remove(stopwords("english"), padding = TRUE) %>%
tokens_ngrams(n = 2) %>%
dfm()
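As a side note, the six pipelines above differ only in the input text and the n-gram length, so they could be expressed through a single hypothetical wrapper (a sketch, not part of the original run):
# Hypothetical wrapper around the pipeline used above: tokenize, drop
# punctuation and English stopwords, form n-grams, and build a dfm.
build_dfm <- function(txt, n = 1) {
  tokens(txt) %>%
    tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
    tokens_remove(stopwords("english"), padding = TRUE) %>%
    tokens_ngrams(n = n) %>%
    dfm()
}
# Example, equivalent to the chunks above:
# nDfm1 <- build_dfm(news2, 1); nDfm2 <- build_dfm(news2, 2)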
After tokenization, a quick look was taken at the size of each document-feature matrix. Due to the many crashes of the system, I kept everything as simple as possible. Since dim() on a document-feature matrix returns the number of lines and the number of features, the values printed below are the numbers of unique features (word types) in the news, blogs and twitter samples, respectively.
dim(nDfm1)[2]
## [1] 70303
dim(bDfm1)[2]
## [1] 69097
dim(tDfm1)[2]
## [1] 69592
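As a side note, quanteda also offers direct accessors for these counts; a small sketch for the news unigram dfm (the same calls work for bDfm1 and tDfm1):
ndoc(nDfm1)         # number of documents (lines) in the sample
nfeat(nDfm1)        # number of unique features, same as dim(nDfm1)[2]
sum(ntoken(nDfm1))  # total number of tokens across all lines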
In this section, the results for the single-word n-grams are presented first, followed by the two-word n-grams.
n1 <- data.frame(topfeatures(nDfm1, 20))
n1$word <- rownames(n1)
colnames(n1)[1] <- "frequency"
n1$word <- factor(n1$word, levels = n1$word[order(desc(n1$frequency))])
ggplot(n1, aes(x = word, y = frequency)) + geom_col(fill = "green") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + ggtitle("News")
b1 <- data.frame(topfeatures(bDfm1, 20))
b1$word <- rownames(b1)
colnames(b1)[1] <- "frequency"
b1$word <- factor(b1$word, levels = b1$word[order(desc(b1$frequency))])
ggplot(b1, aes(x = word, y = frequency)) + geom_col(fill = "red") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + ggtitle("Blogs")
t1 <- data.frame(topfeatures(tDfm1, 20))
t1$word <- rownames(t1)
colnames(t1)[1] <- "frequency"
t1$word <- factor(t1$word, levels = t1$word[order(desc(t1$frequency))])
ggplot(t1, aes(x = word, y = frequency)) + geom_col(fill = "light blue") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + ggtitle("Twitter")
As can be seen from the graphs above, although the lists differ with respect to the most frequent words, some words appear on several of them. In the context of this project, the two-word phrases are more interesting, as they give a glimpse of the relationships between words.
n2 <- data.frame(topfeatures(nDfm2, 20))
n2$word <- rownames(n2)
colnames(n2)[1] <- "frequency"
n2$word <- factor(n2$word, levels = n2$word[order(desc(n2$frequency))])
ggplot(n2, aes(x = word, y = frequency)) + geom_col(fill = "green") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + ggtitle ("News")
b2 <- data.frame(topfeatures(bDfm2, 20))
b2$word <- rownames(b2)
colnames(b2)[1] <- "frequency"
b2$word <- factor(b2$word, levels = b2$word[order(desc(b2$frequency))])
ggplot(b2, aes(x = word, y = frequency)) + geom_col(fill = "red") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + ggtitle("Blogs")
t2 <- data.frame(topfeatures(tDfm2, 20))
t2$word <- rownames(t2)
colnames(t2)[1] <- "frequency"
t2$word <- factor(t2$word, levels = t2$word[order(desc(t2$frequency))])
ggplot(t2, aes(x = word, y = frequency)) + geom_col(fill = "light blue") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + ggtitle("Twitter")
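As a side note, the six plotting chunks follow the same pattern and could be collected into one hypothetical helper (a sketch, equivalent to the calls above):
# Hypothetical helper: plot the k most frequent features of a dfm.
plot_top_features <- function(x, k = 20, fill = "grey", title = "") {
  d <- data.frame(topfeatures(x, k))
  d$word <- rownames(d)
  colnames(d)[1] <- "frequency"
  d$word <- factor(d$word, levels = d$word[order(desc(d$frequency))])
  ggplot(d, aes(x = word, y = frequency)) +
    geom_col(fill = fill) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
    ggtitle(title)
}
# Example: plot_top_features(tDfm2, fill = "light blue", title = "Twitter")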
The plots above, presenting the two-word phrases, indicate that:
- certain words appear together relatively often, and
- one word can often be paired with multiple words.
This kind of information is very useful for planning future analyses.
The corpus above was kept in three separate parts in order to provide a variety of sources for training the predictive models. Because of the many crashes my computer suffered during this task, I did not attempt to build three-word phrases, although I intend to use them in the future. In the most basic terms: if two words often appear together, the second word can be predicted from the first, and the more often they appear together, the stronger their relationship. Consequently, an algorithm can be built that uses the first word to generate three possible alternatives for the second word, based on the strength of their relationship, i.e. how often they appear together. If my computer allows it, I intend to use three-word phrases for prediction whenever possible, so that the first two words are used to predict the third word. When this is not possible (e.g. for two-word sentences), two-word phrases will be used. In order to present some options to users when they are starting a sentence, one-word n-grams could be formed from the words that most often appear at the beginning of sentences. The options are unlimited, unlike my computational power, so I plan to use simple solutions in order to create an effective application.
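As a rough illustration of the idea described above (only a sketch with hypothetical names, not the final algorithm), the bigram frequencies already computed could be used to suggest the three most likely second words for a given first word:
# Hypothetical sketch: suggest the k most frequent continuations of a word,
# based on a named vector of bigram counts such as topfeatures(nDfm2, 10000)
# (quanteda joins n-gram parts with "_" by default). Assumes a plain word
# without regex metacharacters.
predict_next <- function(first_word, bigram_freq, k = 3) {
  prefix <- paste0("^", first_word, "_")
  hits <- bigram_freq[grepl(prefix, names(bigram_freq))]
  if (length(hits) == 0) return(character(0))
  hits <- sort(hits, decreasing = TRUE)
  sub(prefix, "", names(head(hits, k)))   # drop the first word, keep candidates
}
# Example: predict_next("right", topfeatures(nDfm2, 10000))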