suppressPackageStartupMessages({
  library(tidytext)
  library(tidyverse)
  library(stringr)
  library(knitr)
  library(wordcloud)
  library(ngram)
})

To hide the code complexity, I have written an R script that calculates the required statistics and stores the outputs in RDS files. If you are interested in viewing the R script, click here

repo_summary <- readRDS("clean_repos/repo_summary.rds")
tidy_repo <- readRDS("clean_repos/tidy_repo.rds")
cover_90  <- readRDS("clean_repos/cover_90.rds")
bigram_cover_90   <- readRDS("clean_repos/bigram_cover_90.rds")
trigram_cover_90  <- readRDS("clean_repos/trigram_cover_90.rds")

Introduction

This project analyzes the HC Corpora dataset with the goal of building an n-gram model for next-word prediction. In this report, I summarize the exploratory data analysis I have conducted on the data.

File Summary

The data files provided are: blogs, news, and twitter. Here are a few basic statistics on each file.

f_names   f_size (MB)    f_lines     n_char        n_words      pct_n_char  pct_lines  pct_words
blogs     200.4242         899,288   208,361,438   37,334,131   0.54        0.27       0.53
news      196.2775          77,259    15,683,765    2,643,969   0.04        0.02       0.04
twitter   159.3641       2,360,148   162,385,035   30,373,583   0.42        0.71       0.43
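These figures come from the helper script linked above; here is a minimal sketch of how they can be computed. The data/final/en_US path and file names are assumptions, not taken from the script itself.

#' File summary stats (sketch; the data/final/en_US path is an assumption)
f_names <- c("blogs", "news", "twitter")
repo_summary <- purrr::map_dfr(f_names, function(f) {
  path  <- file.path("data/final/en_US", paste0("en_US.", f, ".txt"))
  lines <- readLines(path, skipNul = TRUE)
  tibble(f_names = f,
         f_size  = file.size(path) / 1024^2,          # size in MB
         f_lines = length(lines),
         n_char  = sum(nchar(lines)),
         n_words = sum(str_count(lines, "\\S+")))
}) %>%
  mutate(pct_n_char = round(n_char / sum(n_char), 2),
         pct_lines  = round(f_lines / sum(f_lines), 2),
         pct_words  = round(n_words / sum(n_words), 2))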

To speed up processing, I sampled 10% of the lines from each file. Each sample was cleaned and tokenized into uni-, bi-, and tri-grams. To further speed up the model, I subsetted each n-gram table to the most frequent n-grams that together cover 90% of all occurrences in the sample, as sketched below.
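The exact cleaning steps live in the helper script; this is a minimal sketch of the two speed-ups. raw_lines is a hypothetical character vector standing in for one file's lines, and tidy_repo is the one-word-per-row table loaded above.

#' Sample ~10% of lines (raw_lines is a hypothetical stand-in)
set.seed(1234)
sampled <- sample(raw_lines, length(raw_lines) * 0.10)

#' Keep the most frequent words that together cover 90% of occurrences
cover_90 <- tidy_repo %>%
  count(word, sort = TRUE) %>%
  mutate(proportion = n / sum(n),
         coverage   = cumsum(proportion)) %>%
  filter(coverage <= 0.90)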

Uni-grams, Word Cloud

Next, we will create a word cloud to see the most frequent words in the data.

#' Word cloud
cover_90 %>%
  with(wordcloud(word, n, max.words = 100,
                 colors = brewer.pal(8, 'Dark2'), random.order = FALSE))

Uni-grams, By Source

Now, let’s look at how word frequencies compare across the three sources.

#' Word distribution by source
freq <- tidy_repo %>%
  count(source, word) %>%
  group_by(source) %>%
  mutate(proportion = n / sum(n)) %>%  # within-source relative frequency
  ungroup() %>%
  arrange(desc(proportion), desc(n))
freq %>%
  filter(proportion > 0.002) %>% 
  mutate(word = reorder(word, proportion)) %>% 
  ggplot(aes(word, proportion)) +
  geom_col(fill="blue") + 
  xlab(NULL) + 
  coord_flip() +
  theme_light() +
  facet_grid(~source, scales = "free")

Uni-gram Distribution

Distributions were created for each set of n-grams, based on relative frequency.

#' Word distribution
cover_90 %>%
  top_n(20, proportion) %>%
  mutate(word = reorder(word, proportion)) %>%
  ggplot(aes(word, proportion)) +
  geom_col(fill="blue") +
  xlab(NULL) +
  theme_light() +
  coord_flip()

Bi-gram Distribution

#' Bigram distribution
bigram_cover_90 %>%
  top_n(20, proportion) %>%
  mutate(bigram = reorder(bigram, proportion)) %>%
  ggplot(aes(bigram, proportion)) +
  geom_col(fill="blue") +
  xlab(NULL) +
  theme_light() +
  coord_flip()

Tri-gram Distribution

#' Trigram distribution
trigram_cover_90 %>%
  top_n(20, proportion) %>%
  mutate(trigram = reorder(trigram, proportion)) %>%
  ggplot(aes(trigram, proportion)) +
  geom_col(fill="blue") +
  xlab(NULL) +
  theme_light() +
  coord_flip()

N-gram Prediction Model

For the n-gram prediction model, I am going to use the bi-gram and tri-gram tables created above as the basis for prediction. The user will input a word, and the model will find the bi-gram with the greatest relative frequency that starts with that word. Similarly, the tri-gram table will be used to make predictions from two-word entries, and so on.

trigrams_separated <- trigram_cover_90 %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ")
knitr::kable(head(trigrams_separated))
word1    word2   word3       n   proportion   coverage
NA       NA      NA      19499   0.0035005    0.0035005
thanks   for     the      2449   0.0004396    0.0039401
one      of      the      2115   0.0003797    0.0043198
a        lot     of       1967   0.0003531    0.0046729
i        want    to       1365   0.0002450    0.0049180
to       be      a        1311   0.0002354    0.0051533

The table above shows the tri-grams separated by word and arranged by relative frequency. When the user inputs two words, the model matches them against word1 and word2 and returns the word3 with the greatest relative frequency. Cases where there is no match, or where more than two words are entered, will be completed at random. A sketch of this lookup follows.
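Here is a minimal sketch of that lookup, using the trigrams_separated table above. The fallback that samples a frequent uni-gram is one possible reading of "random completion", not necessarily the final model's behavior.

#' Predict the third word from the first two (sketch)
predict_trigram <- function(w1, w2) {
  hit <- trigrams_separated %>%
    filter(word1 == w1, word2 == w2) %>%
    arrange(desc(proportion)) %>%
    slice(1)
  if (nrow(hit) == 0) {
    #' No match: fall back to a random frequent uni-gram (assumed behavior)
    return(sample(cover_90$word, 1, prob = cover_90$proportion))
  }
  hit$word3
}
predict_trigram("a", "lot")   # should return "of", per the table above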