The goal of this assignment is to explore three datasets. The datasets come from different sources: news, blogs, and Twitter. I will briefly describe only the major features of the data.
Blogs summary:

| Line_count | Word_count | Mean_words_per_line |
|---|---|---|
| 899288 | 37546246 | 41.75108 |

Example blog post:
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
News summary:

| Line_count | Word_count | Mean_words_per_line |
|---|---|---|
| 77259 | 2674536 | 34.61779 |

Example news item:
## [1] "He wasn't home alone, apparently."
Twitter summary:

| Line_count | Word_count | Mean_words_per_line |
|---|---|---|
| 2360148 | 30093410 | 12.75065 |

Example tweet:
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
As you can see below, the most common word is “just”, which appears 2589 times. Next comes “get”, which appears 2532 times. We also see the words “like” and “one”, which appear 2481 and 2363 times respectively.
| feature | frequency | rank | docfreq |
|---|---|---|---|
| just | 2589 | 1 | 2414 |
| get | 2532 | 2 | 2259 |
| like | 2481 | 3 | 2248 |
| one | 2363 | 4 | 2014 |
| go | 2129 | 5 | 1929 |
| time | 1973 | 6 | 1758 |
| love | 1921 | 7 | 1718 |
| can | 1883 | 8 | 1681 |
A word cloud is a graphical representation of the most frequently used words in the normalized text. The size of each word in the picture indicates how often that word occurs in the entire text.
The general idea is that you can look at each pair (or triple, set of four, etc.) of words that occur next to each other. In a large corpus, you are likely to see “the red” and “red apple” several times, but much less likely to see “apple red” and “red the”. This can be used to predict the next word as a user types.
These co-occurring words are known as “n-grams”, where “n” is a number indicating how many consecutive words are considered. (Unigrams are single words, bigrams are two words, trigrams are three words, 4-grams are four words, etc.)
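As a toy illustration (a made-up sentence, not the project data), here is how quanteda’s tokens_ngrams() turns a tokenized sentence into bigrams:

library(quanteda)
toy <- tokens("you can look at each pair of words")
tokens_ngrams(toy, n = 2L)
# yields: "you_can" "can_look" "look_at" "at_each" "each_pair" "pair_of" "of_words"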
| feature | frequency | rank | docfreq |
|---|---|---|---|
| of_the | 2577 | 1 | 2044 |
| in_the | 2394 | 2 | 2082 |
| to_the | 1412 | 3 | 1260 |
| for_the | 1344 | 4 | 1290 |
| on_the | 1225 | 5 | 1122 |
| to_be | 1148 | 6 | 1067 |
| at_the | 905 | 7 | 848 |
| i_have | 817 | 8 | 733 |
In the next step I will use this knowledge to build a predictive text product. Predictive text is useful for speeding up writing: the algorithm I will apply lets your device guess the next word.
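As a minimal sketch of that idea (not a final implementation), a next-word suggester can simply look up the most frequent bigrams that start with the word just typed. It assumes the freq.ng2 table built in the code below; predict_next is a hypothetical helper, not part of any package:

predict_next <- function(word, bigram_freq, n = 3) {
  # bigram features look like "to_the"; keep those starting with the typed word
  prefix <- paste0(tolower(word), "_")
  candidates <- bigram_freq[startsWith(bigram_freq$feature, prefix), ]
  # textstat_frequency() returns rows sorted by frequency, so the top rows
  # are the best guesses; strip the prefix to recover the predicted words
  head(sub(prefix, "", candidates$feature, fixed = TRUE), n)
}
# e.g. predict_next("to", freq.ng2) would suggest "the" before "be",
# matching the bigram table above.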
library(quanteda)
library(quanteda.textstats)  # textstat_frequency() lives here in quanteda >= 3
library(quanteda.textplots)  # textplot_wordcloud() lives here in quanteda >= 3
library(dplyr)
library(stringi)
library(ggplot2)
library(RColorBrewer)
library(formattable)
# Download and unzip the corpus once, if not already present
if (!file.exists("dataset.zip")) {
  url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(url = url, destfile = "dataset.zip", mode = "wb")
  unzip(zipfile = "dataset.zip")
  rm(url)
}
# Read each corpus line by line (UTF-8, skipping embedded nulls)
con <- file("final/en_US/en_US.twitter.txt", "r")
twitter <- readLines(con = con, encoding = "UTF-8", skipNul = TRUE)
close(con)
con <- file("final/en_US/en_US.news.txt", "r")
news <- readLines(con = con, encoding = "UTF-8", skipNul = TRUE)
close(con)
con <- file("final/en_US/en_US.blogs.txt", "r")
blogs <- readLines(con = con, encoding = "UTF-8", skipNul = TRUE)
close(con)
blogSummary <- data.frame(
  "Line_count" = length(blogs),
  "Word_count" = sum(stri_count_words(blogs)),
  "Mean_words_per_line" = mean(stri_count_words(blogs))
)
blogSummary %>% formattable()
blogs[1]
newsSummary <- data.frame(
  "Line_count" = length(news),
  "Word_count" = sum(stri_count_words(news)),
  "Mean_words_per_line" = mean(stri_count_words(news))
)
newsSummary %>% formattable()
news[1]
twitterSummary <- data.frame(
  "Line_count" = length(twitter),
  "Word_count" = sum(stri_count_words(twitter)),
  "Mean_words_per_line" = mean(stri_count_words(twitter))
)
twitterSummary %>% formattable()
twitter[1]
Because the files are large, I took a 1% random sample of the combined text.
set.seed(123)
texts <- c(blogs, news, twitter)
# avoid shadowing base::sample(); keep 1% of the lines
idx <- sample(x = seq_along(texts), size = floor(length(texts) * 0.01), replace = FALSE)
texts.sample <- texts[idx]
tokens.ng1 <- tokens(x = texts.sample, what = "word",
                     remove_numbers = TRUE, remove_punct = TRUE,
                     remove_symbols = TRUE, remove_separators = TRUE,
                     split_hyphens = TRUE)  # replaces remove_hyphens (quanteda >= 2)
# lowercase, drop stopwords, stem, then build the document-feature matrix
dfm.ng1 <- tokens.ng1 %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("en")) %>%
  tokens_wordstem() %>%
  dfm()
freq.ng1 <- textstat_frequency(dfm.ng1)
head(freq.ng1, n = 8) %>% formattable()
freq.ng1 %>%
  arrange(desc(frequency)) %>%
  head(10) %>%
  ggplot(aes(x = reorder(feature, frequency), y = frequency, fill = feature)) +
  geom_col() +
  xlab(label = "") +
  theme(legend.position = "none")
set.seed(123)
# argument names follow quanteda.textplots (rotation, max_words, etc.)
textplot_wordcloud(x = dfm.ng1, random_color = TRUE, rotation = 0.25, max_words = 70,
                   random_order = FALSE, color = brewer.pal(8, "Dark2"))
tokens.ng2 <- tokens_ngrams(x = tokens.ng1, n = 2L)
# stopwords are kept here, so frequent pairs such as of_the show up (as in the table above)
dfm.ng2 <- tokens.ng2 %>%
  tokens_tolower() %>%
  tokens_wordstem() %>%
  dfm()
freq.ng2 <- textstat_frequency(dfm.ng2)
head(freq.ng2, n = 8) %>% formattable()
freq.ng2 %>%
  arrange(desc(frequency)) %>%
  head(10) %>%
  ggplot(aes(x = reorder(feature, frequency), y = frequency, fill = feature)) +
  geom_col() +
  xlab(label = "") +
  theme(legend.position = "none")