Summary

The goal of this assignment is to explore three datasets. The datasets come from different sources: news, blogs and Twitter. I'll briefly describe only the major features of the data.

Basic summaries of the three files

Blogs

Line_count    Word_count    Mean_of_word_count
    899288      37546246              41.75108

Example blog post

## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."

News

Line_count    Word_count    Mean_of_word_count
     77259       2674536              34.61779

Example news item

## [1] "He wasn't home alone, apparently."

Twitter

Line_count    Word_count    Mean_of_word_count
   2360148      30093410              12.75065

Example tweet

## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."

The most common words in all sentences

As you can see below, the most common word is “just”, which appears 2589 times. Next comes “get”, which appears 2532 times. We also see the words “like” and “one”, which appear 2481 and 2363 times respectively.

feature    frequency    rank    docfreq
just            2589       1       2414
get             2532       2       2259
like            2481       3       2248
one             2363       4       2014
go              2129       5       1929
time            1973       6       1758
love            1921       7       1718
can             1883       8       1681

A word cloud to visualize the text data

A word cloud is a graphical representation of frequently used words in the normalized text. The size of each word in the picture indicates how often that word occurs in the entire text. (The code that draws the cloud is in the appendix.)

Let’s see n-grams

The general idea is that you can look at each pair (or triple, set of four, etc.) of words that occur next to each other. In a large corpus you’re likely to see “the red” and “red apple” several times, but far less likely to see “apple red” or “red the”. This makes n-grams useful for predicting the next word while typing.

These co-occurring words are known as “n-grams”, where “n” is a number saying how many consecutive words are considered. (Unigrams are single words, bigrams are two words, trigrams are three words, 4-grams are four words, etc.) A toy example is sketched below.
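
To make this concrete, here is a minimal sketch using quanteda’s tokens() and tokens_ngrams() on an invented sentence (the sentence and the toy variable name are my own, not taken from the corpus):

library(quanteda)
toy <- tokens("the red apple fell from the tree")
# Adjacent pairs: "the_red" "red_apple" "apple_fell" ...
tokens_ngrams(toy, n = 2L)
# Adjacent triples: "the_red_apple" "red_apple_fell" ...
tokens_ngrams(toy, n = 3L)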

feature    frequency    rank    docfreq
of_the          2577       1       2044
in_the          2394       2       2082
to_the          1412       3       1260
for_the         1344       4       1290
on_the          1225       5       1122
to_be           1148       6       1067
at_the           905       7        848
i_have           817       8        733

Next Step

In the next step I will use this knowledge to build a predictive text product. Predictive text helps speed up writing: the algorithm I will apply lets your device guess the next word as you type.
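
As a rough illustration of the idea (not the final algorithm), a simple bigram lookup can already suggest a next word. The helper below is hypothetical, not part of quanteda; it assumes the freq.ng2 table built in the appendix, whose features look like “of_the”:

# Hypothetical helper: suggest the next word from the bigram
# frequency table (freq.ng2) built in the appendix
predict_next <- function(word, bigram_freq) {
  prefix <- paste0(tolower(word), "_")
  hits <- bigram_freq[startsWith(bigram_freq$feature, prefix), ]
  if (nrow(hits) == 0) return(NA_character_)
  # textstat_frequency() output is sorted by frequency, so the
  # first match is the most frequent continuation
  sub(prefix, "", hits$feature[1], fixed = TRUE)
}

predict_next("of", freq.ng2)   # most likely "the"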

Appendix

Load the libraries
library(quanteda)
library(dplyr)
library(stringi)
library(ggplot2)
library(RColorBrewer)
library(formattable)
Download the data from the Internet and unzip the file
if(!file.exists("dataset.zip")){
  url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(url = url, destfile = "dataset.zip", mode = "wb")
  unzip(zipfile = "dataset.zip")
  rm(url)
}
# Read each corpus as UTF-8, skipping embedded NUL characters
con <- file("final/en_US/en_US.twitter.txt", "r")
twitter <- readLines(con = con, encoding = "UTF-8", skipNul = TRUE)
close(con)

con <- file("final/en_US/en_US.news.txt", "r")
news <- readLines(con = con, encoding = "UTF-8", skipNul = TRUE)
close(con)

con <- file("final/en_US/en_US.blogs.txt", "r")
blogs <- readLines(con = con, encoding = "UTF-8", skipNul = TRUE)
close(con)
Basic summaries of the three files
blogSummary <- data.frame(
  "Line_count" = length(blogs),
  "Word_count" = sum(stri_count_words(blogs)),
  "Mean_of_word_count" =  mean(stri_count_words(blogs))
)
blogSummary %>% formattable()

blogs[1]
newsSummary <- data.frame(
  "Line_count" = length(news),
  "Word_count" = sum(stri_count_words(news)),
  "Mean_of_word_count" =  mean(stri_count_words(news))
)
newsSummary %>% formattable()

news[1]
twitterSummary <- data.frame(
  "Line_count" = length(twitter),
  "Word_count" = sum(stri_count_words(twitter)),
  "Mean_of_word_count" =  mean(stri_count_words(twitter))
)
twitterSummary %>% formattable()

twitter[1]
Randomly sample documents

Because the files are large, I took a random 1% sample of the combined text.

set.seed(123)
texts <- c(blogs, news, twitter)
# Draw a 1% sample of line indices without replacement
# (sample.idx avoids shadowing base::sample)
sample.idx <- sample(x = seq_along(texts), size = length(texts) * 0.01, replace = FALSE)
texts.sample <- texts[sample.idx]
The most common words in all sentences
# Tokenize into word unigrams, dropping numbers, punctuation,
# symbols, separators and hyphens
tokens.ng1 <- tokens(x = texts.sample, what = "word",
                     remove_numbers = TRUE, remove_punct = TRUE,
                     remove_symbols = TRUE, remove_separators = TRUE,
                     remove_hyphens = TRUE)

# Build a document-feature matrix: lowercase, stem, drop stopwords
dfm.ng1 <- dfm(x = tokens.ng1, tolower = TRUE, stem = TRUE, remove = stopwords())


freq.ng1 <- textstat_frequency(dfm.ng1)
head(freq.ng1, n = 8) %>% formattable()

freq.ng1 %>%
  arrange(desc(frequency)) %>%
  head(10) %>%
  ggplot(aes(x = reorder(feature, frequency), y = frequency, fill = feature)) +
  geom_bar(stat = "identity") +
  xlab(label = "") +
  theme(legend.position = "none")
A word cloud to visualize the text data
set.seed(123)
textplot_wordcloud(x = dfm.ng1, random.color = TRUE, rot.per = .25, max.words = 70, 
                   random.order = FALSE,  colors = brewer.pal(8, "Dark2"))
Let’s see n-grams
# Form bigrams from the unigram tokens
tokens.ng2 <- tokens_ngrams(x = tokens.ng1, n = 2L)

# Build a document-feature matrix of bigrams
dfm.ng2 <- dfm(x = tokens.ng2, tolower = TRUE, stem = TRUE, remove = stopwords())


freq.ng2 <- textstat_frequency(dfm.ng2)
head(freq.ng2, n = 8) %>% formattable()

freq.ng2 %>%
  arrange(desc(frequency)) %>%
  head(10) %>%
  ggplot(aes(x = reorder(feature, frequency), y = frequency, fill = feature)) +
  geom_bar(stat = "identity") +
  xlab(label = "") +
  theme(legend.position = "none")