This document is a milestone report for a Data Science course. It describes an exploratory data analysis of three large English-language data sets from the USA (news, blogs, twitter), obtained from the course site. The data manipulation steps are described first, followed by the analysis outcomes and the plans for the future development of a Shiny application. The whole procedure was carried out in multiple steps in order to avoid issues caused by limited memory capacity.
As a first step, the required libraries were loaded and the data were read into R.
lapply(c("readr", "ggplot2", "R.utils", "dplyr", "quanteda"), library, character.only = T)
# Read the three raw text files (tab-delimited, no header row)
en_US_news <- read_delim("en_US.news.txt", "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
en_US_blogs <- read_delim("en_US.blogs.txt", "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
en_US_twitter <- read_delim("en_US.twitter.txt", "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
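As a side note, since these are plain text files with one post per line, base readLines() is a possible alternative for the initial load as well (it is used later for the sampled files); this is only a sketch, not the approach taken in this report.
# Alternative load (not used here): read the raw lines directly;
# skipNul = TRUE guards against any embedded null characters.
# en_US_twitter_lines <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)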
Due to the large sample sizes, the original data sets had to be reduced. Systematic sampling was used, as there was no reason to assume any serial dependency among the posts: starting from the first row, every 20th row was included in the sample, which means that approximately 5% of the data were used in these analyses.
# News: keep every 20th row and save the sample
selectionvector_news <- seq(1, 1000107, by = 20)
news <- en_US_news[selectionvector_news,]
t1 <- as.data.frame(news)
write.table(t1, "news1.txt")
# Blogs: keep every 20th row and save the sample
selectionvector_blogs <- seq(1, 878689, by = 20)
blogs <- en_US_blogs[selectionvector_blogs,]
t2 <- as.data.frame(blogs)
write.table(t2, "blogs1.txt")
# Twitter: keep every 20th row and save the sample
selectionvector_twitter <- seq(1, 2342938, by = 20)
twitter <- en_US_twitter[selectionvector_twitter,]
t3 <- as.data.frame(twitter)
write.table(t3, "twitter1.txt")
As can be seen from the table below (object sizes are in bytes), the sampled data sets contain about 5% of the rows of the original data.
# Compare object sizes (in bytes) and row counts before and after sampling
rows_original <- c(nrow(en_US_news), nrow(en_US_blogs), nrow(en_US_twitter))
size_original <- c(object.size(en_US_news), object.size(en_US_blogs), object.size(en_US_twitter))
rows_shrinked <- c(nrow(t1), nrow(t2), nrow(t3))
size_shrinked <- c(object.size(t1), object.size(t2), object.size(t3))
table <- data.frame(size_original, size_shrinked, rows_original, rows_shrinked)
row.names(table) <- c("news", "blogs", "twitter")
table
## size_original size_shrinked rows_original rows_shrinked
## news 269165648 13436512 1000107 50006
## blogs 266420112 13242304 878689 43935
## twitter 333218880 16857344 2342938 117147
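The size columns above are reported in bytes. As a side note, object.size() output can also be formatted in megabytes, assuming the original objects are still in memory:
# Report object sizes in megabytes instead of bytes.
format(object.size(en_US_news), units = "MB")
format(object.size(t1), units = "MB")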
In order to avoid constant crashes of the software due to limited memory capacity, the samples were saved as separate files (above), while the original, large objects were removed from the workspace.
rm(en_US_news)
rm(en_US_blogs)
rm(en_US_twitter)
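Optionally (this step was not part of the original run), an explicit garbage collection can be requested after removing the large objects, so that R returns the freed memory sooner:
# Ask R to release the memory freed by the rm() calls above.
gc()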
The sample files were then re-read as lines; digits and punctuation were removed and all letters were converted to lowercase.
news1 <- readLines("news1.txt")
blogs1 <- readLines("blogs1.txt")
twitter1 <- readLines("twitter1.txt")
news2 <- gsub("[0-9]", "", news1)
news2 <- gsub("[[:punct:]]", "", news2)
news2 <- tolower(news2)
blogs2 <- gsub("[0-9]", "", blogs1)
blogs2 <- gsub("[[:punct:]]", "", blogs2)
blogs2 <- tolower(blogs2)
twitter2 <- gsub("[0-9]", "", twitter1)
twitter2 <- gsub("[[:punct:]]", "", twitter2)
twitter2 <- tolower(twitter2)
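As a side note, the three cleaning steps could be wrapped into a single helper; this is only a sketch with a hypothetical name, producing the same result as the repeated gsub() and tolower() calls above.
# Hypothetical helper bundling the cleaning steps used above.
clean_text <- function(lines) {
  lines <- gsub("[0-9]", "", lines)        # remove digits
  lines <- gsub("[[:punct:]]", "", lines)  # remove punctuation
  tolower(lines)                           # convert to lowercase
}
# Example, equivalent to the news chunk above:
# news2 <- clean_text(news1)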
For the further analyses, the quanteda package was used, which allows operations on tokens and the construction of document-feature matrices. First, the texts were tokenized, followed by the removal of any leftover punctuation and of English stopwords. One-word n-grams (unigrams) were created first, followed by two-word n-grams (bigrams).
# Unigram (n = 1) document-feature matrices
nDfm1 <- tokens(news2) %>%
tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
tokens_remove(stopwords("english"), padding = TRUE) %>%
tokens_ngrams(n = 1) %>%
dfm()
bDfm1 <- tokens(blogs2) %>%
tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
tokens_remove(stopwords("english"), padding = TRUE) %>%
tokens_ngrams(n = 1) %>%
dfm()
tDfm1 <- tokens(twitter2) %>%
tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
tokens_remove(stopwords("english"), padding = TRUE) %>%
tokens_ngrams(n = 1) %>%
dfm()
# Bigram (n = 2) document-feature matrices
nDfm2 <- tokens(news2) %>%
tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
tokens_remove(stopwords("english"), padding = TRUE) %>%
tokens_ngrams(n = 2) %>%
dfm()
bDfm2 <- tokens(blogs2) %>%
tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
tokens_remove(stopwords("english"), padding = TRUE) %>%
tokens_ngrams(n = 2) %>%
dfm()
tDfm2 <- tokens(twitter2) %>%
tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
tokens_remove(stopwords("english"), padding = TRUE) %>%
tokens_ngrams(n = 2) %>%
dfm()
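As a side note, the six pipelines above differ only in the input text and the n-gram length, so they could be expressed through a single hypothetical wrapper (a sketch, not part of the original run):
# Hypothetical wrapper around the pipeline used above: tokenize, drop
# punctuation and English stopwords, form n-grams, and build a dfm.
build_dfm <- function(txt, n = 1) {
  tokens(txt) %>%
    tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
    tokens_remove(stopwords("english"), padding = TRUE) %>%
    tokens_ngrams(n = n) %>%
    dfm()
}
# Example, equivalent to the chunks above:
# nDfm1 <- build_dfm(news2, 1); nDfm2 <- build_dfm(news2, 2)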
After tokenization, a quick look was taken at the size of each document-feature matrix. Due to the many crashes of the system, I kept everything as simple as possible. Since dim() on a document-feature matrix returns the number of lines and the number of features, the values printed below are the numbers of unique features (word types) in the news, blogs and twitter samples, respectively.
dim(nDfm1)[2]
## [1] 70303
dim(bDfm1)[2]
## [1] 69097
dim(tDfm1)[2]
## [1] 69592
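As a side note, quanteda also offers direct accessors for these counts; a small sketch for the news unigram dfm (the same calls work for bDfm1 and tDfm1):
ndoc(nDfm1)         # number of documents (lines) in the sample
nfeat(nDfm1)        # number of unique features, same as dim(nDfm1)[2]
sum(ntoken(nDfm1))  # total number of tokens across all lines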
In this section, the results for the single-word n-grams are presented first, followed by the two-word n-grams.
n1 <- data.frame(topfeatures(nDfm1, 20))
n1$word <- rownames(n1)
colnames(n1)[1] <- "frequency"
n1$word <- factor(n1$word, levels = n1$word[order(desc(n1$frequency))])
ggplot(n1, aes(x = word, y = frequency)) + geom_col(fill = "green") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + ggtitle("News")
b1 <- data.frame(topfeatures(bDfm1, 20))
b1$word <- rownames(b1)
colnames(b1)[1] <- "frequency"
b1$word <- factor(b1$word, levels = b1$word[order(desc(b1$frequency))])
ggplot(b1, aes(x = word, y = frequency)) + geom_col(fill = "red") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + ggtitle("Blogs")
t1 <- data.frame(topfeatures(tDfm1, 20))
t1$word <- rownames(t1)
colnames(t1)[1] <- "frequency"
t1$word <- factor(t1$word, levels = t1$word[order(desc(t1$frequency))])
ggplot(t1, aes(x = word, y = frequency)) + geom_col(fill = "light blue") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + ggtitle("Twitter")
As can be seen from the graphs above, although the lists differ with respect to the most frequent words, some words appear on several of them. In the context of this project, the two-word phrases are more interesting, as they give a glimpse of the relationships between words.
n2 <- data.frame(topfeatures(nDfm2, 20))
n2$word <- rownames(n2)
colnames(n2)[1] <- "frequency"
n2$word <- factor(n2$word, levels = n2$word[order(desc(n2$frequency))])
ggplot(n2, aes(x = word, y = frequency)) + geom_col(fill = "green") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + ggtitle ("News")
b2 <- data.frame(topfeatures(bDfm2, 20))
b2$word <- rownames(b2)
colnames(b2)[1] <- "frequency"
b2$word <- factor(b2$word, levels = b2$word[order(desc(b2$frequency))])
ggplot(b2, aes(x = word, y = frequency)) + geom_col(fill = "red") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + ggtitle("Blogs")
t2 <- data.frame(topfeatures(tDfm2, 20))
t2$word <- rownames(t2)
colnames(t2)[1] <- "frequency"
t2$word <- factor(t2$word, levels = t2$word[order(desc(t2$frequency))])
ggplot(t2, aes(x = word, y = frequency)) + geom_col(fill = "light blue") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + ggtitle("Twitter")
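As a side note, the six plotting chunks follow the same pattern and could be collected into one hypothetical helper (a sketch, equivalent to the calls above):
# Hypothetical helper: plot the k most frequent features of a dfm.
plot_top_features <- function(x, k = 20, fill = "grey", title = "") {
  d <- data.frame(topfeatures(x, k))
  d$word <- rownames(d)
  colnames(d)[1] <- "frequency"
  d$word <- factor(d$word, levels = d$word[order(desc(d$frequency))])
  ggplot(d, aes(x = word, y = frequency)) +
    geom_col(fill = fill) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
    ggtitle(title)
}
# Example: plot_top_features(tDfm2, fill = "light blue", title = "Twitter")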
The plots above, presenting the two-word phrases, indicate that:
- certain words appear together relatively often, and
- one word can often be paired with multiple words.
This kind of information is very useful for planning future analyses.
The corpus above was kept in three separate parts in order to provide a variety of sources for training the predictive models. Because of the many crashes my computer suffered during this task, I did not attempt to build three-word phrases, although I intend to use them in the future. In the most basic terms: if two words often appear together, the second word can be predicted from the first, and the more often they appear together, the stronger their relationship. Consequently, an algorithm can be built that uses the first word to generate three possible alternatives for the second word, based on the strength of their relationship, i.e. how often they appear together. If my computer allows it, I intend to use three-word phrases for prediction whenever possible, so that the first two words are used to predict the third word. When this is not possible (e.g. for two-word sentences), two-word phrases will be used. In order to present some options to users when they are starting a sentence, one-word n-grams could be formed from the words that most often appear at the beginning of sentences. The options are unlimited, unlike my computational power, so I plan to use simple solutions in order to create an effective application.
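As a rough illustration of the idea described above (only a sketch with hypothetical names, not the final algorithm), the bigram frequencies already computed could be used to suggest the three most likely second words for a given first word:
# Hypothetical sketch: suggest the k most frequent continuations of a word,
# based on a named vector of bigram counts such as topfeatures(nDfm2, 10000)
# (quanteda joins n-gram parts with "_" by default). Assumes a plain word
# without regex metacharacters.
predict_next <- function(first_word, bigram_freq, k = 3) {
  prefix <- paste0("^", first_word, "_")
  hits <- bigram_freq[grepl(prefix, names(bigram_freq))]
  if (length(hits) == 0) return(character(0))
  hits <- sort(hits, decreasing = TRUE)
  sub(prefix, "", names(head(hits, k)))   # drop the first word, keep candidates
}
# Example: predict_next("right", topfeatures(nDfm2, 10000))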