## Introduction
The goal of this work is to produce a Shiny app that performs word prediction. The model will be based on three different kinds of text: tweets, blog posts, and news articles.
This document summarizes the exploratory data analysis and the first models, and gives an outlook on future plans.
## Reading and sampling from the data
Load the required packages and save the paths to the data files:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.0.1 ✓ dplyr 1.0.0
## ✓ tidyr 1.1.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(tidytext)
library(stringr)
library(knitr)
library(markovchain)
## Package: markovchain
## Version: 0.8.5
## Date: 2020-05-21
## BugReport: http://github.com/spedygiorgio/markovchain/issues
path_b <- "~/rsconnect/Coursera/final/en_US/en_US.blogs.txt"
path_t <- "~/rsconnect/Coursera/final/en_US/en_US.twitter.txt"
path_n <- "~/rsconnect/Coursera/final/en_US/en_US.news.txt"
#twitter
set.seed(1)
#read in all data
all_t <- readLines(path_t)
## Warning in readLines(path_t): line 167155 appears to contain an embedded nul
## Warning in readLines(path_t): line 268547 appears to contain an embedded nul
## Warning in readLines(path_t): line 1274086 appears to contain an embedded nul
## Warning in readLines(path_t): line 1759032 appears to contain an embedded nul
#sample 5 k indexes
ind_sample_t <- sample(length(all_t), 5000)
#make the sample data frame
sample_t <- all_t[ind_sample_t] %>% as.data.frame() %>% cbind(rep("t", length(ind_sample_t)), ind_sample_t, .)
colnames(sample_t) <- c("corpus", "index", "text")
#blogs
all_b <- readLines(path_b)
#sample 5 k indexes
ind_sample_b <- sample(length(all_b), 5000)
#make the sample data frame
sample_b <- all_b[ind_sample_b] %>% as.data.frame() %>% cbind(rep("b", length(ind_sample_b)), ind_sample_b, .)
colnames(sample_b) <- c("corpus", "index", "text")
#news
all_n <- readLines(path_n)
#sample 5 k indexes
ind_sample_n <- sample(length(all_n), 5000)
#make the sample data frame
sample_n <- all_n[ind_sample_n] %>% as.data.frame() %>% cbind(rep("n", length(ind_sample_n)), ind_sample_n, .)
colnames(sample_n) <- c("corpus", "index", "text")
#combine
sample_comb <- bind_rows(sample_t, sample_b, sample_n)
Split the text into single words:
sample_comb_word <- sample_comb %>% unnest_tokens(word, text)
Calculate mean number of words per record:
sample_comb_word %>% group_by(corpus) %>%
summarise(tot_words = n(), mean_word_per_rec = tot_words/5000)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 3
## corpus tot_words mean_word_per_rec
## <chr> <int> <dbl>
## 1 b 213110 42.6
## 2 n 172269 34.5
## 3 t 63561 12.7
Tweets are quite a bit shorter than blogs and news articles.
## Single Word Counts
Show a histogram of the 100 most used words in tweets:
sample_comb_word %>% filter(corpus == "t") %>%
count(word, sort = TRUE) %>%
arrange(desc(n)) %>%
.[1:100,] %>%
ggplot(aes(x = reorder(word, desc(n)), y = n)) +
geom_bar(stat = "identity") +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0, size = 6)) +
xlab("word")
Show the distribution of the count for single words:
sample_comb_word %>% filter(corpus == "t") %>%
count(word, sort = TRUE) %>%
select(n) %>%
summary()
## n
## Min. : 1.000
## 1st Qu.: 1.000
## Median : 1.000
## Mean : 6.273
## 3rd Qu.: 3.000
## Max. :1924.000
We see that the distribution is very skewed: most words appear only once, while a few words appear very often.
This is the same for the other corpora as well; I will not demonstrate it here.
## TF-IDF
I also ran an analysis via TF-IDF, which can be used to find words that are important in one corpus compared to the others. This was also quite interesting and showed, for example, that the twitter data uses more informal and generally shorter words than the other two corpora. The results are probably not that useful for this task, so I will not show them here.
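For completeness, a minimal sketch of how such a TF-IDF ranking could be computed with tidytext's bind_tf_idf() on the tokenized sample from above (no output shown here):
#count words per corpus and compute TF-IDF; a high tf_idf marks words that
#are characteristic of one corpus compared to the others
word_tf_idf <- sample_comb_word %>%
  count(corpus, word, sort = TRUE) %>%
  bind_tf_idf(word, corpus, n) %>%
  arrange(desc(tf_idf))
#ten most twitter-specific words in this sample
word_tf_idf %>% filter(corpus == "t") %>% slice_head(n = 10)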
## N-Grams
Produce bigrams (all combinations of two words that follow each other in the text):
sample_comb_bigram <- sample_comb %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
Count the different bigrams from the twitter data and show their distribution:
sample_comb_bigram %>% filter(corpus == "t") %>%
count(bigram, sort = TRUE) %>%
select(n) %>%
summary()
## n
## Min. : 1.000
## 1st Qu.: 1.000
## Median : 1.000
## Mean : 1.454
## 3rd Qu.: 1.000
## Max. :157.000
We can see that the distribution is even more skewed than for the counts of single words (the mean is closer to 1 and the third quartile is now also 1).
Let's repeat the same for trigrams:
sample_comb_trigram <- sample_comb %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)
Count the different trigrams from the twitter data and show their distribution:
sample_comb_trigram %>% filter(corpus == "t") %>%
count(trigram, sort = TRUE) %>%
select(n) %>%
summary()
## n
## Min. : 1.000
## 1st Qu.: 1.000
## Median : 1.000
## Mean : 1.063
## 3rd Qu.: 1.000
## Max. :177.000
For trigrams it is even more extreme.
## Plans for the Prediction Model
In order to predict the next word from up to three previous words, I will take samples of all three corpora such that the number of records from each corpus is inversely proportional to its mean number of words per record, so that I get roughly the same amount of words from each corpus. I will take as much data as possible (I still have to determine what amount is reasonable; so far I have only worked with small samples).
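A minimal sketch of what such a sampling plan could look like, using the mean words per record from the summary table above (the word budget is an assumed placeholder, not a final choice):
#approximate mean words per record per corpus (from the summary above)
mean_words <- c(t = 12.7, b = 42.6, n = 34.5)
#assumed word budget per corpus; the real value still has to be determined
target_words <- 1e6
#number of records to sample from each corpus
n_records <- round(target_words / mean_words)
n_records
#e.g. for the twitter corpus:
ind_sample_t2 <- sample(length(all_t), min(n_records["t"], length(all_t)))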
From these samples I will then produce lists of n-grams (2-4) as well as a list of all single words (“1-grams”) and count the occurrences of each token (= n-gram).
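A sketch of how these count tables could be built with tidytext; the helper function and the list layout are illustrative, not the final implementation:
#build a count table of n-grams of a given order from a sample data frame
count_ngrams <- function(df, n) {
  if (n == 1) {
    tokens <- df %>% unnest_tokens(ngram, text)
  } else {
    tokens <- df %>% unnest_tokens(ngram, text, token = "ngrams", n = n)
  }
  tokens %>% filter(!is.na(ngram)) %>% count(ngram, sort = TRUE)
}
#one table per order: 1-grams up to 4-grams
ngram_counts <- lapply(1:4, count_ngrams, df = sample_comb)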
We will then use these n-grams to provide possible next words. For example, if two previous words are given, we will search the list of 3-grams for instances where they match the first two words; the third word of each matching 3-gram is then a possible prediction for the next word.
We can also assign a probability to each of these candidate predictions by counting the occurrences of the specific 3-gram and comparing that count to the total number of 3-grams whose first two words match the two given words.
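A small sketch of this lookup on the trigram sample from above; splitting each trigram into separate word columns and the example phrase "thanks for" are just for illustration:
#count the trigrams and split them into their three words
trigram_counts <- sample_comb_trigram %>%
  filter(corpus == "t", !is.na(trigram)) %>%
  count(trigram, sort = TRUE) %>%
  separate(trigram, into = c("w1", "w2", "w3"), sep = " ")
#candidate next words after the (hypothetical) input "thanks for"
trigram_counts %>%
  filter(w1 == "thanks", w2 == "for") %>%
  mutate(prob = n / sum(n)) %>%
  slice_max(prob, n = 3)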
In order to decrease the size of my model, I will delete all n-grams whose (n-1)-gram occurred only once.
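One way to read that rule, continuing the sketch above: drop every trigram whose leading bigram occurs only once in the trigram table.
#keep only trigrams whose first two words (the 2-gram prefix) were seen more than once
trigram_pruned <- trigram_counts %>%
  group_by(w1, w2) %>%
  filter(sum(n) > 1) %>%
  ungroup()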
The prediction function will start searching the highest-order n-gram table, depending on the number of given words, and provide the three most likely predictions. If there are no (or not enough) matches, the function will search the table of n-grams one order lower. If no previous word is matched at all, the predictions will be sampled from the word frequency table, with probabilities according to the word counts (e.g. the most frequent word is the most likely to be sampled).
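A much simplified sketch of such a backoff lookup; the list ngram_tables (2- to 4-gram tables with columns prefix_key, next_word and count) and the table word_counts (columns word and n) are assumptions about how the data will be stored:
predict_next <- function(prev_words, ngram_tables, word_counts, k = 3) {
  #try the highest order first: table i predicts from i previous words
  for (i in rev(seq_along(ngram_tables))) {
    if (length(prev_words) < i) next
    prefix <- paste(tail(prev_words, i), collapse = " ")
    hits <- ngram_tables[[i]] %>%
      filter(prefix_key == prefix) %>%
      slice_max(count, n = k, with_ties = FALSE)
    #enough matches at this order: return the k most frequent continuations
    if (nrow(hits) >= k) return(hits$next_word)
  }
  #no (or not enough) matches anywhere: sample according to word frequency
  sample(word_counts$word, k, prob = word_counts$n)
}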
The goodness of the model will be assessed by running the function on a number of 1-/2-/3-grams produced from a test sample (data that was not used to build the prediction function). The results will then be compared to the actual next word that occurred in the text, and we can calculate in how many instances one of the three suggestions was correct.
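A sketch of how this top-3 accuracy could be computed; a test_set with a prefix column (the previous words as one string) and the true next_word is an assumed layout for the test data:
top3_accuracy <- function(test_set, ngram_tables, word_counts) {
  correct <- mapply(function(prefix, truth) {
    preds <- predict_next(strsplit(prefix, " ")[[1]], ngram_tables, word_counts)
    truth %in% preds
  }, test_set$prefix, test_set$next_word)
  #share of test cases where one of the three suggestions was the actual next word
  mean(correct)
}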
At the end I will produce a small Shiny applet where text can be entered and three suggestions will be provided by the final model.
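A minimal sketch of what that applet could look like; it assumes that predict_next() and its lookup tables are available in the app's environment:
library(shiny)
ui <- fluidPage(
  textInput("user_text", "Enter some text:"),
  textOutput("suggestions")
)
server <- function(input, output) {
  output$suggestions <- renderText({
    words <- unlist(strsplit(tolower(input$user_text), "\\s+"))
    if (length(words) == 0) return("")
    #show the three suggestions separated by vertical bars
    paste(predict_next(words, ngram_tables, word_counts), collapse = " | ")
  })
}
shinyApp(ui, server)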