Milestone

The goal of this milestone is to show that I have become familiar with the data and that I am on track to build the prediction algorithm. To do that I use a code-plus-commentary style: a non-data-scientist reader can skip the code and focus on the key findings summarised in text, tables and plots.

Data download and basic description

The data was downloaded from the provided Coursera link (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) and unzipped into the project working directory. The following table shows the number of lines, the number of words and the number of unique words in each of the three datasets. Note that this is a fairly large dataset, so after downloading and applying some transformations an RDS object was created to save time when rerunning and iterating on the analyses.

library(tidytext)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggVennDiagram)
library(knitr)

rm(list = ls())

#data downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

#unzip("./input_files/Coursera-SwiftKey.zip")
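#a possible way to script the download instead of doing it manually (a sketch, not run here):
#if(!file.exists("./input_files/Coursera-SwiftKey.zip")){
#  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
#                destfile = "./input_files/Coursera-SwiftKey.zip", mode = "wb")
#}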

#load data: the first time, a tibble combining all datasets is created and saved (code in the "else" branch); on subsequent runs the saved RDS is loaded instead


if(file.exists("tidy_base.RDS")){
  tidy_base <- readRDS("tidy_base.RDS")
} else {
  en_us.blogs <- readLines("./input_files/final/en_US/en_US.blogs.txt", encoding = "UTF-8")
  en_us.news <- readLines("./input_files/final/en_US/en_US.news.txt", encoding = "UTF-8")
  en_us.twitter <- readLines("./input_files/final/en_US/en_US.twitter.txt", encoding = "UTF-8")
  
  blogs <- tibble(corpus = "en_us.blogs",
                  number = 1:length(en_us.blogs),
                  text = en_us.blogs)
  
  news <- tibble(corpus = "en_us.news",
                 number = 1:length(en_us.news),
                 text = en_us.news)
  
  twitter <- tibble(corpus = "en_us.twitter",
                    number = 1:length(en_us.twitter),
                    text = en_us.twitter)
  
  rm(en_us.blogs, en_us.news, en_us.twitter)
  
  base <- rbind(blogs, news, twitter)
  
  #tokenise into one word per row; unnest_tokens converts words to lowercase by default
  tidy_base <- unnest_tokens(base, wording, text)
  
  saveRDS(tidy_base, "tidy_base.RDS")
  
  rm(base, blogs, news, twitter)
}

overview <- tidy_base %>% 
  group_by(corpus) %>% 
  summarise(
    `number of lines` = max(number),
    `number of words` = n(),
    `average of words per line` = `number of words` / `number of lines`,
    `number of unique words` = n_distinct(wording)
  )

kable(overview)
corpus          number of lines   number of words   average of words per line   number of unique words
en_us.blogs              899288          37546250                    41.75109                   320008
en_us.news                77259           2674536                    34.61779                    86620
en_us.twitter           2360148          30093372                    12.75063                   370388

Some things to note regarding this quick summary:

- There are around 70 million words across the three corpora.
- The average number of words per line differs markedly between the three datasets: it is largest in blogs, followed by news and finally twitter.
- The number of distinct words is around 86 thousand in news, but over 300 thousand in both blogs and twitter. This is somewhat interesting, as there are an estimated 170 thousand words currently in use in English (https://englishlive.ef.com/blog/language-lab/many-words-english-language/). So either there are lots of misspellings (which would be consistent with news generally receiving more editing and spell-checking than blogs and twitter) or there is some other explanation. Misspellings, acronyms and similar non-dictionary tokens seem the most plausible cause.
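One quick way to probe this idea, as a rough sketch rather than a definitive check (the regular expression below only approximates what counts as a "dictionary-like" token), is to count how many of the distinct tokens in each corpus contain digits or other non-letter characters:

#rough check: share of distinct tokens per corpus containing characters other than letters/apostrophes
tidy_base %>% 
  distinct(corpus, wording) %>% 
  group_by(corpus) %>% 
  summarise(
    distinct_tokens = n(),
    non_letter_tokens = sum(str_detect(wording, "[^a-z']")),
    share = non_letter_tokens / distinct_tokens
  )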

Word frequencies and further exploration

One key aspect of the predictive algorithm is the context in which it is meant to be used, and how wide the underlying dictionary of words and n-grams should be. In this section I briefly explore word frequencies and the number of words shared between the corpora, with the idea of finding the best trade-off between a lightweight lookup table and broad coverage of possible inputs. The following plots and tables explore these ideas:

#text_summary: word frequency table for the whole corpus or a single source,
#keeping only the most frequent words up to a given cumulative share of occurrences
text_summary <- function(db, percent = 0.9, corpus_filter = ""){
  
  #optionally restrict to a single corpus
  if(corpus_filter != ""){
    db <- db %>% 
      filter(corpus == corpus_filter)
  }
  
  summary <- db %>% 
    group_by(wording) %>% 
    summarise(
      number = n()
    ) %>% 
    arrange(desc(number)) %>% 
    mutate(
      log10_number = log10(number),
      perc = number / sum(number),
      cumperc = cumsum(number) / sum(number),
      order = row_number() #frequency rank; row_number() avoids fractional ranks for ties
    ) %>% 
    filter(cumperc <= percent)
  
  summary
}

general_summary_all <- text_summary(tidy_base, 1)

word_count_plot <- ggplot(general_summary_all, aes(y = log10_number, x = order)) +
  geom_line()+
  ggtitle("Word frequency in combined corpus")+
  xlab("Word order of frequency of appearance")+
  ylab("Number of times used - log10 scale")

word_count_plot

kable(head(general_summary_all))
wording    number   log10_number        perc     cumperc   order
the       2949277       6.469716   0.0419443   0.0419443       1
to        1927842       6.285071   0.0274176   0.0693618       2
and       1601543       6.204539   0.0227770   0.0921388       3
a         1579063       6.198399   0.0224573   0.1145961       4
i         1510481       6.179115   0.0214819   0.1360779       5
of        1295749       6.112521   0.0184280   0.1545059       6

From the previous plot and summary it can be seen that a small number of words captures most of the language actually used; the counts span such a wide range that a log10 scale is needed just to display them.
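To put a rough number on that claim, here is a small sketch using the general_summary_all table computed above, counting how many distinct words are needed to cover 50% and 90% of all word occurrences:

#rough coverage check: number of distinct words accounting for 50% / 90% of all occurrences
general_summary_all %>% 
  summarise(
    `words covering 50%` = sum(cumperc <= 0.5),
    `words covering 90%` = sum(cumperc <= 0.9)
  )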

Now, going a bit further:

#part 1
##Are different words used most frequently in the different sources?
general_summary_90_blogs <- text_summary(tidy_base, .9, "en_us.blogs")
general_summary_90_twitter <- text_summary(tidy_base, .9, "en_us.twitter")
general_summary_90_news <- text_summary(tidy_base, .9, "en_us.news")


#combine the three 90% word lists and keep each distinct word once
summary_90_combined <- rbind(general_summary_90_blogs %>% 
                               select(wording),
                             general_summary_90_news %>% 
                               select(wording),
                             general_summary_90_twitter %>% 
                               select(wording)) %>% 
  distinct(wording)

summary_90_combined <- summary_90_combined %>% 
  mutate(blogs = wording %in% general_summary_90_blogs$wording,
         news = wording %in% general_summary_90_news$wording,
         twitter = wording %in% general_summary_90_twitter$wording,
         concordance = (blogs + news + twitter)/3,
         where_concordance = case_when(
           concordance == 1 ~ "all",
           blogs == T & twitter == T ~ "blogs - twitter",
           blogs == T & news == T ~ "blogs - news",
           news == T & twitter == T ~ "news - twitter",
           blogs == T ~ "only blogs",
           twitter == T ~ "only twitter",
           news == T ~ "only news",
           TRUE ~ "there is a mistake in formula"
         ))

table_summary <- summary_90_combined %>% 
  group_by(where_concordance) %>% 
  summarise(
    number_of_words = n(),
    percent = n() / nrow(summary_90_combined)
  )

venn_list <- list(
  twitter = as.vector(general_summary_90_twitter$wording),
  blogs = as.vector(general_summary_90_blogs$wording),
  news = as.vector(general_summary_90_news$wording)
)


four_thousand <- summary_90_combined %>% 
  filter(where_concordance == "all")

#four_thousand_corpus: share of a corpus' word occurrences covered by the words in `comparison`
four_thousand_corpus <- function(db, comparison){
  filtered <- db %>% 
    filter(wording %in% comparison$wording) %>% 
    summarise(
      percentage = sum(perc)
    )
  
  filtered
}


how_many_four <- tibble(
  Blogs = as.numeric(four_thousand_corpus(general_summary_90_blogs, four_thousand)),
  News = as.numeric(four_thousand_corpus(general_summary_90_news, four_thousand)),
  Twitter = as.numeric(four_thousand_corpus(general_summary_90_twitter, four_thousand))
)

ggVennDiagram(venn_list)+
  scale_fill_gradient(low = "steelblue", high = "indianred2")

kable(table_summary)
where_concordance   number_of_words     percent
all                            4032   0.3887014
blogs - news                   1553   0.1497156
blogs - twitter                 391   0.0376940
news - twitter                  397   0.0382724
only blogs                      802   0.0773161
only news                      2458   0.2369613
only twitter                    740   0.0713391
kable(how_many_four)
    Blogs        News     Twitter
0.8467016   0.8053804   0.8404615

So here are a few interesting findings:

- About 4,000 words (4,032) appear in the top-90% vocabulary of all three sources, roughly 39% of the combined list; news contributes by far the largest share of source-specific frequent words.
- Those roughly 4,000 shared words alone cover most everyday usage: about 85% of word occurrences in blogs, 81% in news and 84% in twitter.
- This suggests that a relatively small, shared dictionary can cover most of the text in all three sources, which is encouraging for keeping the prediction tables lightweight.

Further plans

The data will be subset using the roughly 4,000 most common words that cover 80% or more of each of the three datasets, and this vocabulary will be used to build the n-grams for the algorithm. The intended procedure, if memory allows, is to unnest n-grams across all datasets and then filter them using these words as triggers; if that is not feasible, a sample of lines will be used instead. Lowercase conversion will be kept for predictions. Three-word n-grams will be tried first, then two-word n-grams, and finally single words as a fallback. Profanity will be filtered from the predictions only, not corrected in the input. When no n-gram matches the input, the most common words will be suggested.
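To make the planned backoff concrete, here is a minimal sketch of how the n-gram tables could be built with tidytext and queried, assuming base is the tibble of raw lines created in the loading chunk above; names such as trigram_counts and predict_next are hypothetical, and the final implementation may differ (for example, it will also apply the vocabulary and profanity filters described above).

#minimal sketch of the planned n-gram backoff (illustrative only)
trigram_counts <- base %>% 
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>% 
  filter(!is.na(trigram)) %>% 
  separate(trigram, into = c("word1", "word2", "word3"), sep = " ") %>% 
  count(word1, word2, word3, sort = TRUE)

bigram_counts <- base %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
  filter(!is.na(bigram)) %>% 
  separate(bigram, into = c("word1", "word2"), sep = " ") %>% 
  count(word1, word2, sort = TRUE)

unigram_counts <- base %>% 
  unnest_tokens(word, text) %>% 
  count(word, sort = TRUE)

#given the last two words typed, look for a trigram match, then back off to bigrams,
#and finally fall back to the overall most common word
predict_next <- function(w1, w2){
  hit <- trigram_counts %>% filter(word1 == w1, word2 == w2)
  if(nrow(hit) > 0) return(hit$word3[1])
  hit <- bigram_counts %>% filter(word1 == w2)
  if(nrow(hit) > 0) return(hit$word2[1])
  unigram_counts$word[1]
}

predict_next("one", "of")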