library(tokenizers)
library(SnowballC)
library(ggplot2)
library(RColorBrewer)
library(gridExtra)
library(tidyverse)
library(ellipsis)
library(dplyr)
library(tidytext)
library(quanteda)
library(quanteda.textplots)
library(quanteda.textstats)
library(knitr)

Text Prediction Milestone Report

To develop a text prediction algorithm, text files from news outlets, blogs, and Twitter were taken as the backbone data. The table below shows the magnitude of each text file. The text in the files was collected by a web crawler that stores text from various sources to enhance search algorithms. We save these scrubbed text files so they can be broken into a corpus to develop analytical models.

The Text Files

The text files range from 167 to 210 MB, with well over 4 million cumulative lines of text. Twitter has the shortest documents because of Twitter’s 140-character format, but it makes up for that with the largest number of documents in its dataset. The inverse is true for the blogs: the file is the largest of the three yet contains the fewest documents, so its documents are much longer on average.

Because of their magnitude, a sample of only 500,000 documents will be taken. This should be more than enough to support an efficient and accurate prediction algorithm.

twit_pth <- "/Users/TylerShirley 1/Desktop/Learn to Program/Pirate Language/Capstone/Lang_Capstone/en_US.twitter.txt"
usblg_pth <- "/Users/TylerShirley 1/Desktop/Learn to Program/Pirate Language/Capstone/Lang_Capstone/en_US.blogs.txt"
news_pth <- "/Users/TylerShirley 1/Desktop/Learn to Program/Pirate Language/Capstone/Lang_Capstone/en_US.news.txt"

#twitter info
  twit_size <- file.size(twit_pth)/10^6
  twit_lines <- readLines(twit_pth, encoding="UTF-8")
  twit_lngth <- length(twit_lines)
  twit_char <- which.max(nchar(twit_lines))

#blog info
  usblg_size <- file.size(usblg_pth)/10^6
  usblg_lines <- readLines(usblg_pth, encoding="UTF-8")
  usblg_lngth <- length(usblg_lines)
  usblg_char <- which.max(nchar(usblg_lines))

#news info
  news_size <- file.size(news_pth)/10^6
  news_lines <- readLines(news_pth, encoding="UTF-8")
  news_lngth <- length(news_lines)
  news_char <- which.max(nchar(news_lines))
  
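# Note: which.max() above returns the line number of the longest document,
# not its character count; max(nchar(x)) would give the count itself.
# check.names = FALSE keeps the column labels exactly as written.
corp_data <- data.frame("Source" = c("Twitter", "US Blog", "News"),
  "Size (MB)" = c(twit_size, usblg_size, news_size),
  "Lines" = c(twit_lngth, usblg_lngth, news_lngth),
  "Longest Line (Index)" = c(twit_char, usblg_char, news_char),
  check.names = FALSE)

kable(corp_data)

|Source  | Size (MB)|   Lines| Longest Line (Index)|
|:-------|---------:|-------:|--------------------:|
|Twitter |  167.1053| 2360148|                   26|
|US Blog |  210.1600|  899288|               483415|
|News    |  205.8119| 1010242|               123628|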

Tokenize the Corpus

The three text files are combined and then sampled to help with computational efficiency. Next, the sampled corpus is tokenized to divide each document in the corpus into a list of words. Tokenization is the way text is broken into its most basic elements: punctuation, numbers, symbols, words, and spaces. For this algorithm, numbers, punctuation, symbols, and very common English words (stop words) are omitted. Profanity has also been removed during tokenization of the sample corpus.
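The report does not show the code for this step, so the following is only a minimal sketch of how the sampling and tokenization described above could be done with quanteda. The toy `all_lines` vector, the seed, and the `profanity` list are assumptions made for illustration; the name `dfm_unigrams` is chosen to match the object used later in the coverage section.

```r
library(quanteda)

# Toy stand-in for the combined Twitter, blog, and news lines (assumption)
all_lines <- c("I am going to New York City right now!",
               "Let us know what you think about the 2 new rules.",
               "Happy New Year to everyone in New York City")

# Sample at most 500,000 documents; the seed makes the sample reproducible
set.seed(1234)
n_sample <- min(500000, length(all_lines))
samp <- sample(all_lines, n_sample)

# Hypothetical placeholder: the report does not say which profanity list was used
profanity <- c("darn", "heck")

# Split into word tokens, dropping punctuation, numbers, and symbols
toks <- tokens(samp, remove_punct = TRUE, remove_numbers = TRUE,
               remove_symbols = TRUE)
toks <- tokens_tolower(toks)

# Remove common English stop words and the profanity list
toks <- tokens_remove(toks, c(stopwords("en"), profanity))

# Document-feature matrix of single words
dfm_unigrams <- dfm(toks)
```

Lowercasing before stop-word removal is a design choice in this sketch; it keeps "The" and "the" from being counted as different words.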

n-grams

Once the corpus was tokenized, word sequences were counted to determine the most common words and the most common two- and three-word phrases. These sequences are called n-grams, where n is the number of consecutive words in the group. For example, finding the top 2-grams means parsing through every two-word phrase in the corpus and counting how often each one occurs.
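As an illustration of the counting step, here is a small sketch that builds 2-grams from a toy set of sentences and ranks them by frequency. The sentences are made up, and the object names are assumptions; the report's own n-gram code is not shown.

```r
library(quanteda)
library(quanteda.textstats)

# Toy documents (made up for illustration)
toks <- tokens(c("right now is the time",
                 "call me right now",
                 "new york city"),
               remove_punct = TRUE)

# Join each pair of adjacent words into one feature, e.g. "right_now"
bigrams <- tokens_ngrams(toks, n = 2)
dfm_bigrams <- dfm(bigrams)

# Frequency table sorted from most to least common
bigram_freq <- textstat_frequency(dfm_bigrams)
head(bigram_freq, 5)
```

Passing `n = 3` to `tokens_ngrams()` would produce the three-word phrases in the same way.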

Because the common English words were removed, there are no “the”s or “and”s in the tokenized list. The most common single word in the tokenized list is “just”, the most common 2-gram is “right now”, and the most common 3-gram is “new york city”.

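|feature | frequency| rank| docfreq|group |
|:-------|---------:|----:|-------:|:-----|
|just    |     35860|    1|   33189|all   |
|said    |     35652|    2|   32455|all   |
|one     |     34342|    3|   30209|all   |
|like    |     31679|    4|   28402|all   |
|can     |     29089|    5|   25785|all   |

|feature     | frequency| rank| docfreq|group |
|:-----------|---------:|----:|-------:|:-----|
|right_now   |      2856|    1|    2812|all   |
|new_york    |      2325|    2|    2166|all   |
|last_year   |      2167|    3|    2112|all   |
|last_night  |      1780|    4|    1759|all   |
|high_school |      1679|    5|    1562|all   |

|feature                    | frequency| rank| docfreq|group |
|:--------------------------|---------:|----:|-------:|:-----|
|new_york_city              |       304|    1|     301|all   |
|let_us_know                |       278|    2|     277|all   |
|vested_interests_vested    |       251|    3|       1|all   |
|interests_vested_interests |       251|    3|       1|all   |
|happy_new_year             |       223|    5|     220|all   |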

Tokenized graphs

The top 20 words are shown in chart form. The top unigrams, bigrams, and trigrams are all plotted to give a perspective on the common words and phrases used in these text documents. Note that as n grows, the frequency of any individual n-gram goes down: the top unigram appears 35,860 times, the top bigram 2,856 times, and the top trigram only 304 times.

Word clouds are also created to give another visual representation of frequency. This is a less quantitative way of interpreting the data, but it is an aesthetically pleasing way to present it to a large audience.
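The plotting code is not included in this report, so the sketch below only suggests one way the bar chart and word cloud could be produced with ggplot2 and quanteda.textplots. The toy sentences, the object names, and the `min_count = 1` setting (needed only because the toy corpus is tiny) are assumptions.

```r
library(quanteda)
library(quanteda.textstats)
library(quanteda.textplots)
library(ggplot2)

# Toy documents (made up for illustration)
toks <- tokens(c("just one more time", "just like last night",
                 "one more like that", "just say it"),
               remove_punct = TRUE)
dfm_toy <- dfm(toks)

# Top 20 features by frequency (fewer if the corpus has fewer features)
top_words <- head(textstat_frequency(dfm_toy), 20)

# Horizontal bar chart, most frequent word at the top
p <- ggplot(top_words, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Frequency", title = "Top unigrams")
print(p)

# Word cloud of the same document-feature matrix
set.seed(1234)  # word placement is random
textplot_wordcloud(dfm_toy, min_count = 1, max_words = 100)
```

The same two calls can be repeated on the bigram and trigram matrices to get the other charts.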

Coverage uniqueness

In the tokenized corpus it is apparent that some words are very common and others very rare. The method below explores this by finding how many distinct words make up 50%, 75%, 90%, and 95% of the total number of word occurrences in the corpus.

# Cover different percentages of the language used

freq <- textstat_frequency(dfm_unigrams)
inst_per <- c(50, 75, 90, 95)

# Share of all word occurrences per word, and the running total
freq_tot <- sum(freq$frequency)
freq$percent <- round(freq$frequency / freq_tot * 100, 5)
freq <- freq %>%
  mutate(cum_per = cumsum(percent))

# Empty frame with the same columns, one row per threshold
cuts <- as.data.frame(matrix(ncol = ncol(freq)))
colnames(cuts) <- colnames(freq)

# First word (by rank) at which the cumulative share exceeds each threshold
for (i in seq_along(inst_per)) {
  cuts[i, ] <- head(filter(freq, cum_per > inst_per[i]), 1)
}

# Keep only the rank (number of words) and the cumulative percent
cuts <- cuts %>%
  select(rank, cum_per) %>%
  rename(Words = rank, Cumulative_Percent = cum_per)

kable(cuts)

| Words| Cumulative_Percent|
|-----:|------------------:|
|  1057|           50.00608|
|  4838|           75.00257|
| 18247|           90.00009|
| 42581|           95.00002|

The results above show how many distinct words are needed to cover each share of the corpus. Only 1,057 distinct words account for 50% of all word occurrences in the tokenized sample, while 42,581 are needed to reach 95%. This makes sense: a small core of common words is used everywhere, and the long tail of more specific vocabulary is needed only for the particular things an author wants to convey.
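To make the arithmetic behind the table concrete, here is a small worked example in base R on made-up word counts. The counts and variable names are assumptions, and the sketch uses `>=` (reaching the threshold) where the report's code uses `>`.

```r
# Toy word counts (made up), already sorted from most to least frequent
counts <- c(40, 30, 15, 10, 5)

# Cumulative share of all word occurrences, in percent
cum_share <- 100 * cumsum(counts) / sum(counts)

# For each target, the number of top words needed to reach that share
targets <- c(50, 75, 90, 95)
words_needed <- sapply(targets, function(p) which(cum_share >= p)[1])

data.frame(Target = targets, Words = words_needed)
```

Here the top two words already cover 70% of the occurrences, so two words are enough for the 50% target, while the 90% and 95% targets both need four.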

Next steps

- Create a Markov Network for 2- and 3-grams
- Create a general vocabulary for 1-grams
- Create a way to handle misspelled words
- Create a Shiny app for word predictions