The goal of this project is to demonstrate that I've become comfortable working with the data and that I'm on track to create the prediction algorithm:
library(tidyverse) ## dplyr, tidyr, tibble, ggplot2
library(tidytext)  ## unnest_tokens(), stop_words

file_dest <- "./source/Coursera-SwiftKey.zip"
file_src <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
## Save time, download only if you haven't
if(!dir.exists("./source")) dir.create("./source")
if(!file.exists(file_dest)) {
  download.file(file_src, file_dest)
}
## Save time, unzip only if you haven't
if(!dir.exists("./source/final")) {
  unzip(file_dest, exdir = "./source/")
}
list.files("./source/final/en_US/")
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
## Read as UTF-8 and create tibbles (dataframes)
## skipNul = TRUE guards against embedded nul characters in these files
text_blog <- readLines("./source/final/en_US/en_US.blogs.txt", encoding = "UTF-8")
df_text_blog <- tibble(text = text_blog)
text_news <- readLines("./source/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
df_text_news <- tibble(text = text_news)
text_twit <- readLines("./source/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
df_text_twit <- tibble(text = text_twit)
Familiarize: questions to consider
Let’s describe the breadth of the data using two measures: lines of text and file size in MB.
## Make a data frame for graphing.
mb_plot <- data.frame(
  Text = c("Blog", "News", "Tweets"),
  Lines = c(
    nrow(df_text_blog),
    nrow(df_text_news),
    nrow(df_text_twit)
  ),
  Mb = c(
    round(file.size("./source/final/en_US/en_US.blogs.txt") / 1024^2),
    round(file.size("./source/final/en_US/en_US.news.txt") / 1024^2),
    round(file.size("./source/final/en_US/en_US.twitter.txt") / 1024^2)
  )
) %>%
  ggplot(aes(Lines, Mb)) +
  geom_point(colour = "#08457e", size = 3) +
  geom_text(aes(label = Text), vjust = -0.75) +
  ylab("File size (MB)") +
  ylim(150, 215) +
  xlab("Lines of text") +
  theme_minimal() +
  ggtitle("Lines of text and file sizes for blogs, news and tweets")
mb_plot
Note that News and Blogs have a greater file size relative to the number of lines they contain. This is likely due to the 140-character limit Twitter imposed at the time.
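We can sanity-check this by comparing average characters per line (a quick sketch using the vectors read above):
## Mean characters per line for each corpus
sapply(
  list(Blog = text_blog, News = text_news, Tweets = text_twit),
  function(x) round(mean(nchar(x)))
)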
Text will need to be arranged into tokens of one, two and three words (a token being a meaningful unit of text). These n-grams capture relationships between words as they appear together. Below are the frequencies for the blog text; you’ll notice the high incidence of stop words, i.e. commonly used words.
For the sake of brevity, Blog will be treated as a simple word frequency, News as a bi-gram and Twitter as a tri-gram. Further analysis will include all analyses on all three texts.
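As a minimal illustration of what unnest_tokens returns (a toy sentence, not drawn from the corpora):
## Toy example: tokenizing one sentence
toy <- tibble(text = "the quick brown fox jumps")
toy %>% unnest_tokens(word, text)                             ## unigrams: "the", "quick", ...
toy %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)  ## bigrams: "the quick", "quick brown", ...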
## Word frequencies
blog_words <- df_text_blog %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  top_n(25) %>%
  ggplot(aes(reorder(word, n), n)) +
  geom_col(fill = "#08457e") +
  ggtitle("Top 25 Blog Words - Frequencies") +
  xlab("Words") +
  ylab("Frequency") +
  coord_flip()
## Selecting by n
blog_words
A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2. The frequency distribution of every bigram in a string is commonly used for simple statistical analysis of text in many applications, including in computational linguistics, cryptography, speech recognition, and so on. ([Wikipedia](https://en.wikipedia.org/wiki/Bigram))
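The separate/filter/unite pattern used below drops any bigram containing a stop word. A minimal sketch of the idea on made-up bigrams:
## Toy example: removing bigrams that contain stop words
tibble(bigram = c("of the", "united states", "in a", "social media")) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ")  ## keeps "united states" and "social media"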
news_bigrams <- df_text_news %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>% # lines too short for a bigram yield NA
  separate(bigram, c("word1", "word2"), sep = " ") %>% # Sep to filter stop words
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ") %>% # Re-combine
  count(bigram, sort = TRUE) %>%
  top_n(25) %>%
  ggplot(aes(reorder(bigram, n), n)) +
  geom_col(fill = "#08457e") +
  ggtitle("Top 25 News Bigrams") +
  xlab("Bigrams") +
  ylab("Frequency") +
  coord_flip()
## Selecting by n
news_bigrams
Trigrams are a special case of the n-gram, where n is 3. They are often used in natural language processing for performing statistical analysis of texts. ([Wikipedia](https://en.wikipedia.org/wiki/Trigram))
twit_trigrams <- df_text_twit %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>% # lines too short for a trigram yield NA
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>% # Sep to filter stop words
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!word3 %in% stop_words$word) %>%
  unite(trigram, word1, word2, word3, sep = " ") %>% # Re-combine
  count(trigram, sort = TRUE) %>%
  top_n(25) %>%
  ggplot(aes(reorder(trigram, n), n)) +
  geom_col(fill = "#08457e") +
  ggtitle("Top 25 Twitter Trigrams") +
  xlab("Trigrams") +
  ylab("Frequency") +
  coord_flip()
## Selecting by n
twit_trigrams
“The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm.” This has been accomplished by downloading, exploring and profiling the data. Word frequencies and n-grams were calculated and visualized. This establishes a base for the next phase of research.
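As a preview of where this leads (a minimal sketch, not the final algorithm: bigram_counts is a toy stand-in for counts computed as above, and predict_next is a hypothetical helper):
## Toy sketch: predicting the next word from bigram counts
bigram_counts <- tibble(
  word1 = c("happy", "happy", "good", "good"),
  word2 = c("birthday", "hour", "morning", "luck"),
  n = c(50, 20, 40, 15)
)
predict_next <- function(prev_word, counts, top = 3) {
  counts %>%
    filter(word1 == prev_word) %>%
    slice_max(n, n = top) %>%
    pull(word2)
}
predict_next("happy", bigram_counts)  ## "birthday" then "hour"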