Milestone

The goal of this milestone is to show that I have become familiar with the data and that I am on track to build the prediction algorithm. To do that I use a code-plus-commentary style: a non-data-scientist reader can skip the code and focus on the key findings summarised in text, tables and plots.

Data download and basic description

The data was downloaded from the provided Coursera link (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) and unzipped into the project working directory. The following table shows the number of lines, the number of words and the number of unique words in each of the three datasets. Note that this is a fairly large dataset, so after downloading and applying some transformations an RDS object was created to save time when rerunning and iterating on the analyses.

library(tidytext)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggVennDiagram)
library(knitr)

rm(list = ls())

#data downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

#unzip("./input_files/Coursera-SwiftKey.zip")
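#a possible way to script the download instead of doing it manually (a sketch, not run here):
#if(!file.exists("./input_files/Coursera-SwiftKey.zip")){
#  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
#                destfile = "./input_files/Coursera-SwiftKey.zip", mode = "wb")
#}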

#load data: the first time, a tibble combining all datasets is created and saved (code in the "else" branch); on subsequent runs the saved RDS is loaded instead


if(file.exists("tidy_base.RDS")){
  tidy_base <- readRDS("tidy_base.RDS")
} else {
  en_us.blogs <- readLines("./input_files/final/en_US/en_US.blogs.txt", encoding = "UTF-8")
  en_us.news <- readLines("./input_files/final/en_US/en_US.news.txt", encoding = "UTF-8")
  en_us.twitter <- readLines("./input_files/final/en_US/en_US.twitter.txt", encoding = "UTF-8")
  
  blogs <- tibble(corpus = "en_us.blogs",
                  number = 1:length(en_us.blogs),
                  text = en_us.blogs)
  
  news <- tibble(corpus = "en_us.news",
                 number = 1:length(en_us.news),
                 text = en_us.news)
  
  twitter <- tibble(corpus = "en_us.twitter",
                    number = 1:length(en_us.twitter),
                    text = en_us.twitter)
  
  rm(en_us.blogs, en_us.news, en_us.twitter)
  
  base <- rbind(blogs, news, twitter)
  
  #tokenise into one word per row; unnest_tokens converts words to lowercase by default
  tidy_base <- unnest_tokens(base, wording, text)
  
  saveRDS(tidy_base, "tidy_base.RDS")
  
  rm(base, blogs, news, twitter)
}

overview <- tidy_base %>% 
  group_by(corpus) %>% 
  summarise(
    `number of lines` = max(number),
    `number of words` = n(),
    `average of words per line` = `number of words` / `number of lines`,
    `number of unique words` = n_distinct(wording)
  )

kable(overview)
corpus          number of lines   number of words   average of words per line   number of unique words
en_us.blogs              899288          37546250                    41.75109                   320008
en_us.news                77259           2674536                    34.61779                    86620
en_us.twitter           2360148          30093372                    12.75063                   370388

Some things to note regarding this quick summary:

- There are around 70 million words across the three corpora.
- The average number of words per line differs markedly between the three datasets: it is largest in blogs, followed by news and finally twitter.
- The number of distinct words is around 86 thousand in news, but over 300 thousand in both blogs and twitter. This is somewhat interesting, as there are an estimated 170 thousand words currently in use in English (https://englishlive.ef.com/blog/language-lab/many-words-english-language/). So either there are lots of misspellings (which would be consistent with news generally receiving more editing and spell-checking than blogs and twitter) or there is some other explanation. Misspellings, acronyms and similar non-dictionary tokens seem the most plausible cause.
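One quick way to probe this idea, as a rough sketch rather than a definitive check (the regular expression below only approximates what counts as a "dictionary-like" token), is to count how many of the distinct tokens in each corpus contain digits or other non-letter characters:

#rough check: share of distinct tokens per corpus containing characters other than letters/apostrophes
tidy_base %>% 
  distinct(corpus, wording) %>% 
  group_by(corpus) %>% 
  summarise(
    distinct_tokens = n(),
    non_letter_tokens = sum(str_detect(wording, "[^a-z']")),
    share = non_letter_tokens / distinct_tokens
  )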

Word frequencies and further exploration

One key aspect of the predictive algorithm is the context in which it is meant to be used, and how wide the underlying dictionary of words and n-grams should be. In this section I briefly explore word frequencies and the number of words shared between the corpora, with the idea of finding the best trade-off between a lightweight lookup table and broad coverage of possible inputs. The following plots and tables explore these ideas:

#text_summary: word frequency table for the whole corpus or a single source,
#keeping only the most frequent words up to a given cumulative share of occurrences
text_summary <- function(db, percent = 0.9, corpus_filter = ""){
  
  #optionally restrict to a single corpus
  if(corpus_filter != ""){
    db <- db %>% 
      filter(corpus == corpus_filter)
  }
  
  summary <- db %>% 
    group_by(wording) %>% 
    summarise(
      number = n()
    ) %>% 
    arrange(desc(number)) %>% 
    mutate(
      log10_number = log10(number),
      perc = number / sum(number),
      cumperc = cumsum(number) / sum(number),
      order = row_number() #frequency rank; row_number() avoids fractional ranks for ties
    ) %>% 
    filter(cumperc <= percent)
  
  summary
}

general_summary_all <- text_summary(tidy_base, 1)

word_count_plot <- ggplot(general_summary_all, aes(y = log10_number, x = order)) +
  geom_line()+
  ggtitle("Word frequency in combined corpus")+
  xlab("Word order of frequency of appearance")+
  ylab("Number of times used - log10 scale")

word_count_plot

kable(head(general_summary_all))
wording    number   log10_number        perc     cumperc   order
the       2949277       6.469716   0.0419443   0.0419443       1
to        1927842       6.285071   0.0274176   0.0693618       2
and       1601543       6.204539   0.0227770   0.0921388       3
a         1579063       6.198399   0.0224573   0.1145961       4
i         1510481       6.179115   0.0214819   0.1360779       5
of        1295749       6.112521   0.0184280   0.1545059       6

From the previous plot and summary it can be seen that a small number of words captures most of the language actually used; the counts span such a wide range that a log10 scale is needed just to display them.
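To put a rough number on that claim, here is a small sketch using the general_summary_all table computed above, counting how many distinct words are needed to cover 50% and 90% of all word occurrences:

#rough coverage check: number of distinct words accounting for 50% / 90% of all occurrences
general_summary_all %>% 
  summarise(
    `words covering 50%` = sum(cumperc <= 0.5),
    `words covering 90%` = sum(cumperc <= 0.9)
  )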

Now, going a bit further:

#part 1
##Are different words used most frequently in the different sources?
general_summary_90_blogs <- text_summary(tidy_base, .9, "en_us.blogs")
general_summary_90_twitter <- text_summary(tidy_base, .9, "en_us.twitter")
general_summary_90_news <- text_summary(tidy_base, .9, "en_us.news")


#combine the three 90% word lists and keep each distinct word once
summary_90_combined <- rbind(general_summary_90_blogs %>% 
                               select(wording),
                             general_summary_90_news %>% 
                               select(wording),
                             general_summary_90_twitter %>% 
                               select(wording)) %>% 
  distinct(wording)

summary_90_combined <- summary_90_combined %>% 
  mutate(blogs = wording %in% general_summary_90_blogs$wording,
         news = wording %in% general_summary_90_news$wording,
         twitter = wording %in% general_summary_90_twitter$wording,
         concordance = (blogs + news + twitter)/3,
         where_concordance = case_when(
           concordance == 1 ~ "all",
           blogs == T & twitter == T ~ "blogs - twitter",
           blogs == T & news == T ~ "blogs - news",
           news == T & twitter == T ~ "news - twitter",
           blogs == T ~ "only blogs",
           twitter == T ~ "only twitter",
           news == T ~ "only news",
           TRUE ~ "there is a mistake in formula"
         ))

table_summary <- summary_90_combined %>% 
  group_by(where_concordance) %>% 
  summarise(
    number_of_words = n(),
    percent = n() / nrow(summary_90_combined)
  )

venn_list <- list(
  twitter = as.vector(general_summary_90_twitter$wording),
  blogs = as.vector(general_summary_90_blogs$wording),
  news = as.vector(general_summary_90_news$wording)
)


four_thousand <- summary_90_combined %>% 
  filter(where_concordance == "all")

#four_thousand_corpus: share of a corpus' word occurrences covered by the words in `comparison`
four_thousand_corpus <- function(db, comparison){
  filtered <- db %>% 
    filter(wording %in% comparison$wording) %>% 
    summarise(
      percentage = sum(perc)
    )
  
  filtered
}


how_many_four <- tibble(
  Blogs = as.numeric(four_thousand_corpus(general_summary_90_blogs, four_thousand)),
  News = as.numeric(four_thousand_corpus(general_summary_90_news, four_thousand)),
  Twitter = as.numeric(four_thousand_corpus(general_summary_90_twitter, four_thousand))
)

ggVennDiagram(venn_list)+
  scale_fill_gradient(low = "steelblue", high = "indianred2")

kable(table_summary)
where_concordance   number_of_words     percent
all                            4032   0.3887014
blogs - news                   1553   0.1497156
blogs - twitter                 391   0.0376940
news - twitter                  397   0.0382724
only blogs                      802   0.0773161
only news                      2458   0.2369613
only twitter                    740   0.0713391
kable(how_many_four)
    Blogs        News     Twitter
0.8467016   0.8053804   0.8404615

So here are a few interesting findings:

- About 4,000 words (4,032) appear in the top-90% vocabulary of all three sources, roughly 39% of the combined list; news contributes by far the largest share of source-specific frequent words.
- Those roughly 4,000 shared words alone cover most everyday usage: about 85% of word occurrences in blogs, 81% in news and 84% in twitter.
- This suggests that a relatively small, shared dictionary can cover most of the text in all three sources, which is encouraging for keeping the prediction tables lightweight.

Further plans

The data will be subset using the roughly 4,000 most common words that cover 80% or more of each of the three datasets, and this vocabulary will be used to build the n-grams for the algorithm. The intended procedure, if memory allows, is to unnest n-grams across all datasets and then filter them using these words as triggers; if that is not feasible, a sample of lines will be used instead. Lowercase conversion will be kept for predictions. Three-word n-grams will be tried first, then two-word n-grams, and finally single words as a fallback. Profanity will be filtered from the predictions only, not corrected in the input. When no n-gram matches the input, the most common words will be suggested.
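To make the planned backoff concrete, here is a minimal sketch of how the n-gram tables could be built with tidytext and queried, assuming base is the tibble of raw lines created in the loading chunk above; names such as trigram_counts and predict_next are hypothetical, and the final implementation may differ (for example, it will also apply the vocabulary and profanity filters described above).

#minimal sketch of the planned n-gram backoff (illustrative only)
trigram_counts <- base %>% 
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>% 
  filter(!is.na(trigram)) %>% 
  separate(trigram, into = c("word1", "word2", "word3"), sep = " ") %>% 
  count(word1, word2, word3, sort = TRUE)

bigram_counts <- base %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
  filter(!is.na(bigram)) %>% 
  separate(bigram, into = c("word1", "word2"), sep = " ") %>% 
  count(word1, word2, sort = TRUE)

unigram_counts <- base %>% 
  unnest_tokens(word, text) %>% 
  count(word, sort = TRUE)

#given the last two words typed, look for a trigram match, then back off to bigrams,
#and finally fall back to the overall most common word
predict_next <- function(w1, w2){
  hit <- trigram_counts %>% filter(word1 == w1, word2 == w2)
  if(nrow(hit) > 0) return(hit$word3[1])
  hit <- bigram_counts %>% filter(word1 == w2)
  if(nrow(hit) > 0) return(hit$word2[1])
  unigram_counts$word[1]
}

predict_next("one", "of")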