The goal of this project is simply to demonstrate that I have become familiar with the data and that I am on track to create the prediction algorithm. To that end, I use a code-plus-comments style, where a non-data-scientist reader is expected to skip the code and focus on the key findings summarised in text, tables and plots.
The data was downloaded from the provided Coursera link (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) and unzipped into the project working directory. The following table displays the number of lines, the number of words and the number of unique words in each dataset. Note that this is a fairly large dataset, so after downloading and applying some transformations an RDS object was created in order to save time when rerunning and iterating on the analyses.
library(tidytext)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggVennDiagram)
library(knitr)
rm(list = ls())
# data downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
# unzip("./input_files/Coursera-SwiftKey.zip")
# load data (the first time, a tibble of all datasets is created and saved, see the code after
# the "else" clause; on subsequent runs the saved RDS is loaded)
if (file.exists("tidy_base.RDS")) {
  tidy_base <- readRDS("tidy_base.RDS")
} else {
  en_us.blogs   <- readLines("./input_files/final/en_US/en_US.blogs.txt", encoding = "UTF-8")
  en_us.news    <- readLines("./input_files/final/en_US/en_US.news.txt", encoding = "UTF-8")
  en_us.twitter <- readLines("./input_files/final/en_US/en_US.twitter.txt", encoding = "UTF-8")
  blogs <- tibble(corpus = "en_us.blogs",
                  number = 1:length(en_us.blogs),
                  text = en_us.blogs)
  news <- tibble(corpus = "en_us.news",
                 number = 1:length(en_us.news),
                 text = en_us.news)
  twitter <- tibble(corpus = "en_us.twitter",
                    number = 1:length(en_us.twitter),
                    text = en_us.twitter)
  rm(en_us.blogs, en_us.news, en_us.twitter)
  base <- rbind(blogs, news, twitter)
  # unnest_tokens() converts words to lowercase by default
  tidy_base <- unnest_tokens(base, wording, text)
  saveRDS(tidy_base, "tidy_base.RDS")
  rm(base, blogs, news, twitter)
}
overview <- tidy_base %>%
  group_by(corpus) %>%
  summarise(
    `number of lines` = max(number),
    `number of words` = n(),
    `average of words per line` = `number of words` / `number of lines`,
    `number of unique words` = n_distinct(wording)
  )
kable(overview)
| corpus | number of lines | number of words | average of words per line | number of unique words |
|---|---|---|---|---|
| en_us.blogs | 899288 | 37546250 | 41.75109 | 320008 |
| en_us.news | 77259 | 2674536 | 34.61779 | 86620 |
| en_us.twitter | 2360148 | 30093372 | 12.75063 | 370388 |
Some things to note from this quick summary:

- There are around 70 million words across the three corpora.
- The average number of words per line differs considerably between the three datasets: it is largest in blogs, followed by news, and smallest in twitter.
- The number of distinct words used is around 86 thousand in news, but over 300 thousand in both blogs and twitter. This is somewhat interesting, as there are an estimated 170 thousand words currently in use in English (https://englishlive.ef.com/blog/language-lab/many-words-english-language/). So either there are lots of misspellings (which is consistent with news in general receiving more review and spell-checking than blogs and twitter), or there is some other explanation. Misspellings, acronyms and the like seem the most plausible cause.
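As a rough check of the misspelling/rare-token hypothesis, the sketch below (using the tidy_base object created above; counting words that appear exactly once is an arbitrary but convenient proxy for noise) tallies how much of each corpus's vocabulary consists of one-off tokens:

# sketch: how many distinct words appear exactly once in each corpus?
# one-off tokens are a rough proxy for typos, acronyms and other noise
tidy_base %>%
  count(corpus, wording, name = "times_used") %>%
  group_by(corpus) %>%
  summarise(
    distinct_words = n(),
    used_once = sum(times_used == 1),
    share_used_once = used_once / distinct_words
  )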
One key aspect of the predictive algorithm is the context in which it is meant to be used, and how wide the underlying dictionary of words and n-grams should be. In this section I briefly explore word frequencies and the number of words shared between the corpora, with the aim of finding the best trade-off between a lightweight lookup table and broad coverage of possible uses. The following plots explore these key ideas:
# text summary function: word counts, optionally filtered to one corpus,
# trimmed to the words covering the requested cumulative percentage
text_summary <- function(db, percent = 0.9, corpus_filter = "") {
  if (corpus_filter == "") {
    summary <- db %>%
      group_by(wording) %>%
      summarise(number = n())
  } else {
    summary <- db %>%
      filter(corpus == corpus_filter) %>%
      group_by(wording) %>%
      summarise(number = n())
  }
  summary <- summary %>%
    arrange(desc(number)) %>%
    mutate(
      log10_number = log10(number),
      perc = number / sum(number),
      cumperc = cumsum(number) / sum(number),
      order = rank(desc(number))
    )
  summary <- summary %>%
    filter(cumperc <= percent)
  summary
}
general_summary_all <- text_summary(tidy_base, 1)
word_count_plot <- ggplot(general_summary_all,
                          aes(y = log10_number, x = 1:nrow(general_summary_all))) +
  geom_line() +
  ggtitle("Word frequency in combined corpus") +
  xlab("Word order of frequency of appearance") +
  ylab("Number of times used - log10 scale")
word_count_plot
kable(head(general_summary_all))
| wording | number | log10_number | perc | cumperc | order |
|---|---|---|---|---|---|
| the | 2949277 | 6.469716 | 0.0419443 | 0.0419443 | 1 |
| to | 1927842 | 6.285071 | 0.0274176 | 0.0693618 | 2 |
| and | 1601543 | 6.204539 | 0.0227770 | 0.0921388 | 3 |
| a | 1579063 | 6.198399 | 0.0224573 | 0.1145961 | 4 |
| i | 1510481 | 6.179115 | 0.0214819 | 0.1360779 | 5 |
| of | 1295749 | 6.112521 | 0.0184280 | 0.1545059 | 6 |
The previous plot and summary show that a relatively small number of words captures most of the language used; the frequencies span so many orders of magnitude that a log scale is needed to display them.
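To quantify this, the cumulative percentages in general_summary_all can be used to count how many distinct words are needed to reach a given share of all word occurrences (a small sketch; the 50% and 90% thresholds are chosen here purely for illustration):

# sketch: how many distinct words cover 50% and 90% of all word occurrences
general_summary_all %>%
  summarise(
    words_for_50_percent = sum(cumperc <= 0.5),
    words_for_90_percent = sum(cumperc <= 0.9),
    total_distinct_words = n()
  )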
Now, going a bit further:
# part 1
## Are different words being used most commonly in the different sources?
general_summary_90_blogs   <- text_summary(tidy_base, .9, "en_us.blogs")
general_summary_90_twitter <- text_summary(tidy_base, .9, "en_us.twitter")
general_summary_90_news    <- text_summary(tidy_base, .9, "en_us.news")
# bind words
summary_90_combined <- rbind(general_summary_90_blogs %>% select(wording),
                             general_summary_90_news %>% select(wording),
                             general_summary_90_twitter %>% select(wording))
summary_90_combined <- summary_90_combined %>%
  group_by(wording) %>%
  summarise()
summary_90_combined <- summary_90_combined %>%
  mutate(blogs = wording %in% general_summary_90_blogs$wording,
         news = wording %in% general_summary_90_news$wording,
         twitter = wording %in% general_summary_90_twitter$wording,
         concordance = (blogs + news + twitter) / 3,
         where_concordance = case_when(
           concordance == 1 ~ "all",
           blogs == T & twitter == T ~ "blogs - twitter",
           blogs == T & news == T ~ "blogs - news",
           news == T & twitter == T ~ "news - twitter",
           blogs == T ~ "only blogs",
           twitter == T ~ "only twitter",
           news == T ~ "only news",
           TRUE ~ "there is a mistake in formula"
         ))
table_summary <- summary_90_combined %>%
  group_by(where_concordance) %>%
  summarise(
    number_of_words = n(),
    percent = n() / nrow(summary_90_combined)
  )
venn_list <- list(
  twitter = as.vector(general_summary_90_twitter$wording),
  blogs = as.vector(general_summary_90_blogs$wording),
  news = as.vector(general_summary_90_news$wording)
)
four_thousand <- summary_90_combined %>%
  filter(where_concordance == "all")
# share of a corpus's word occurrences covered by the words shared by all three corpora
four_thousand_corpus <- function(db, comparison) {
  filtered <- db %>%
    filter(wording %in% comparison$wording) %>%
    summarise(percentage = sum(perc))
  filtered
}
how_many_four <- tibble(
  Blogs = as.numeric(four_thousand_corpus(general_summary_90_blogs, four_thousand)),
  News = as.numeric(four_thousand_corpus(general_summary_90_news, four_thousand)),
  Twitter = as.numeric(four_thousand_corpus(general_summary_90_twitter, four_thousand))
)
ggVennDiagram(venn_list)+
scale_fill_gradient(low = "steelblue", high = "indianred2")
kable(table_summary)
| where_concordance | number_of_words | percent |
|---|---|---|
| all | 4032 | 0.3887014 |
| blogs - news | 1553 | 0.1497156 |
| blogs - twitter | 391 | 0.0376940 |
| news - twitter | 397 | 0.0382724 |
| only blogs | 802 | 0.0773161 |
| only news | 2458 | 0.2369613 |
| only twitter | 740 | 0.0713391 |
kable(how_many_four)
| Blogs | News | Twitter |
|---|---|---|
| 0.8467016 | 0.8053804 | 0.8404615 |
So here are a few interesting findings:

- About 4,000 words (roughly 39% of the combined 90%-coverage vocabulary) are common to the top-90% word lists of all three corpora.
- Those shared words alone cover roughly 80-85% of the word occurrences in each individual corpus (about 84.7% in blogs, 80.5% in news and 84.0% in twitter).
The data will be subset to the roughly 4,000 most common words that represent 80% or more of each of the three datasets, and these will be used to create the n-grams for the algorithm. The actual procedure, if memory allows, will be to unnest n-grams across all datasets and then filter them using these words as triggers; if that is not possible, a sample of lines will be used. The lowercase conversion will be kept for making predictions. Three-word n-grams will be matched first, then two-word n-grams, and finally single words as a fallback. Profanity will be cleansed only from the predictions offered, not corrected in the input. When no n-gram matches the input, the most common words will be suggested.
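As a rough illustration of the planned backoff logic only (a sketch, not the final implementation: the trigram_counts and bigram_counts tables and their columns w1, w2, w3 and n are hypothetical, since the n-gram tables have not been built yet), the lookup could work along these lines:

# sketch of the planned backoff: try trigrams, then bigrams, then fall back to
# the overall most common words; trigram_counts / bigram_counts are hypothetical
# tables with columns w1, w2 (, w3) for the words and n for the frequency
predict_next <- function(prev2, prev1, trigram_counts, bigram_counts, top_words, k = 3) {
  hits <- trigram_counts %>%
    filter(w1 == prev2, w2 == prev1) %>%
    arrange(desc(n)) %>%
    head(k)
  if (nrow(hits) > 0) return(hits$w3)
  hits <- bigram_counts %>%
    filter(w1 == prev1) %>%
    arrange(desc(n)) %>%
    head(k)
  if (nrow(hits) > 0) return(hits$w2)
  # no n-gram match: suggest the overall most common words
  head(top_words$wording, k)
}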