library(tokenizers)
library(SnowballC)
library(ggplot2)
library(RColorBrewer)
library(gridExtra)
library(tidyverse)
library(ellipsis)
library(dplyr)
library(tidytext)
library(quanteda)
library(quanteda.textplots)
library(quanteda.textstats)
library(knitr)

Text Prediction Milestone Report

To develop a text prediction algorithm, text files from news outlets, blogs, and Twitter were used as the underlying data. The table below shows the size of each text file. The text in the files was collected by a web crawler that stores text from various sources to improve search algorithms. We save these scrubbed text files so they can be broken into a corpus for developing analytical models.

The Text Files

The text files range from 167 to 210 MB and together hold well over 4 million lines of text. Twitter documents are short because of Twitter's 140-character limit, but the Twitter file makes up for that with the largest number of documents. The reverse holds for the blogs file, which has the fewest documents. Note that the last column of the table below is the line index of the longest document in each file (the value returned by which.max), not a character count, so it should not be read as a maximum document length.

Because the files are so large, only a sample of 500,000 documents will be taken. This should be more than enough to build an efficient and accurate prediction algorithm.

twit_pth <- "/Users/TylerShirley 1/Desktop/Learn to Program/Pirate Language/Capstone/Lang_Capstone/en_US.twitter.txt"
usblg_pth <- "/Users/TylerShirley 1/Desktop/Learn to Program/Pirate Language/Capstone/Lang_Capstone/en_US.blogs.txt"
news_pth <- "/Users/TylerShirley 1/Desktop/Learn to Program/Pirate Language/Capstone/Lang_Capstone/en_US.news.txt"

# Twitter: file size in MB, lines of text, line count, and position of the longest line
twit_size <- file.size(twit_pth) / 10^6
twit_lines <- readLines(twit_pth, encoding = "UTF-8")
twit_lngth <- length(twit_lines)
# which.max() returns the index of the longest line, not its length;
# max(nchar(twit_lines)) would give the character count instead
twit_char <- which.max(nchar(twit_lines))

# Blogs: same measurements
usblg_size <- file.size(usblg_pth) / 10^6
usblg_lines <- readLines(usblg_pth, encoding = "UTF-8")
usblg_lngth <- length(usblg_lines)
usblg_char <- which.max(nchar(usblg_lines))

# News: same measurements
news_size <- file.size(news_pth) / 10^6
news_lines <- readLines(news_pth, encoding = "UTF-8")
news_lngth <- length(news_lines)
news_char <- which.max(nchar(news_lines))

# check.names = FALSE keeps the readable column names instead of "Size..Mb." etc.
corp_data <- data.frame(
  "Source" = c("Twitter", "US Blog", "News"),
  "Size (MB)" = c(twit_size, usblg_size, news_size),
  "Lines" = c(twit_lngth, usblg_lngth, news_lngth),
  "Longest Line (index)" = c(twit_char, usblg_char, news_char),
  check.names = FALSE
)

kable(corp_data)

Source	Size (MB)	Lines	Longest Line (index)
Twitter	167.1053	2360148	26
US Blog	210.1600	899288	483415
News	205.8119	1010242	123628
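
The sampling step described above could look roughly like the following. This is a minimal base-R sketch, not the report's own code: the helper name `sample_corpus`, the seed value, and the toy vectors are assumptions made for illustration. In the report the combined line vectors would be passed in with n = 500000.

```r
# Hypothetical helper: draw a reproducible random sample of documents
sample_corpus <- function(lines, n, seed = 2021) {
  set.seed(seed)                  # fixed seed so the sample can be reproduced
  n <- min(n, length(lines))      # never ask for more documents than exist
  sample(lines, size = n)         # sampling without replacement
}

# Toy stand-ins for the Twitter, blog, and news line vectors
toy_twitter <- c("tweet a", "tweet b", "tweet c", "tweet d")
toy_blogs   <- c("blog a", "blog b")
toy_news    <- c("news a", "news b", "news c")

# Combine the sources first, then sample from the combined set
all_lines    <- c(toy_twitter, toy_blogs, toy_news)
sample_lines <- sample_corpus(all_lines, n = 5)
```

Sampling after combining means each source contributes roughly in proportion to its number of documents, so Twitter would dominate a sample drawn this way.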

Tokenize the Corpus

The three text files are combined and then sampled to keep computation manageable. Next, the sampled corpus is tokenized, which divides each document into a list of words. Tokenization breaks text into its most basic elements: punctuation, numbers, symbols, words, and spaces. For this algorithm, numbers, punctuation, symbols, and very common English words (stop words) are omitted. Profanity has also been removed during tokenization of the sample corpus.
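
The tokenization code is not shown in this report, so the following is only a sketch of how these steps might be written with quanteda. The toy documents, the object names, and the two-word `profanity` list are placeholders and assumptions; the `dfm_unigrams` object used later in the report is presumably built along similar lines.

```r
library(quanteda)

# Toy documents standing in for the sampled corpus
toy_docs <- c(
  d1 = "The weather in New York is great right now!",
  d2 = "I visited New York City in 2021, darn it was cold."
)

# Split into tokens, dropping punctuation, numbers, and symbols
toy_toks <- tokens(corpus(toy_docs),
                   remove_punct = TRUE,
                   remove_numbers = TRUE,
                   remove_symbols = TRUE)
toy_toks <- tokens_tolower(toy_toks)

# Remove very common English words
toy_toks <- tokens_remove(toy_toks, pattern = stopwords("en"))

# Remove profanity; this short list is a hypothetical placeholder
profanity <- c("darn", "heck")
toy_toks <- tokens_remove(toy_toks, pattern = profanity)

# Document-feature matrix of single words
toy_dfm <- dfm(toy_toks)
```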

n-grams

Once the corpus was tokenized, word sequences were counted to find the most common single words and the most common two- and three-word phrases. These sequences are called n-grams, where n is the number of consecutive words in the group. For example, finding the top 2-grams means stepping through every pair of adjacent words in the corpus and counting how often each pair occurs.
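
To make the idea concrete, here is a small base-R sketch that builds and counts n-grams from a toy token list. The function name `make_ngrams` and the tokens are made up for illustration. quanteda offers `tokens_ngrams()` for the same job on a real corpus, though the report does not show which call it used.

```r
# Toy token stream, already lower-cased and cleaned
toy_tokens <- c("happy", "new", "year", "happy", "new", "day")

# Hypothetical helper: join every run of n consecutive tokens with "_"
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

bigrams  <- make_ngrams(toy_tokens, 2)
trigrams <- make_ngrams(toy_tokens, 3)

# Count each distinct bigram and sort from most to least frequent
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
```

Here `happy_new` occurs twice and every other bigram once, so it comes out on top of `bigram_counts`.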

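Because the common English words were removed, there are no "the"s or "and"s in the tokenized list. The most common single word is therefore "just", the most common 2-gram is "right now", and the most common 3-gram is "new york city". The three tables below list the top five unigrams, bigrams, and trigrams. Note that the two "vested interests" trigrams each come from a single document (a docfreq of 1), so they reflect one repetitive document rather than common usage.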

feature	frequency	rank	docfreq	group
just	35860	1	33189	all
said	35652	2	32455	all
one	34342	3	30209	all
like	31679	4	28402	all
can	29089	5	25785	all

feature	frequency	rank	docfreq	group
right_now	2856	1	2812	all
new_york	2325	2	2166	all
last_year	2167	3	2112	all
last_night	1780	4	1759	all
high_school	1679	5	1562	all

feature	frequency	rank	docfreq	group
new_york_city	304	1	301	all
let_us_know	278	2	277	all
vested_interests_vested	251	3	1	all
interests_vested_interests	251	3	1	all
happy_new_year	223	5	220	all

Tokenized graphs

The top 20 words are shown in chart form. The top unigrams, bigrams, and trigrams are all plotted to give a sense of the common words and phrases used in these text documents. Note that frequencies drop as n grows: the most common unigram appears 35,860 times, the most common bigram 2,856 times, and the most common trigram only 304 times.

Word clouds are also created to give another visual representation of frequency. They are a less quantitative way of interpreting the data, but they are an aesthetically appealing way to present it to a large audience.
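
The plotting code is not included in this report, so the following is only a sketch of how a frequency bar chart might be drawn with ggplot2. It uses the five unigram counts from the table earlier in the report rather than the full top 20. The colour and labels are arbitrary choices.

```r
library(ggplot2)

# Top five unigrams and their counts, copied from the unigram table in this report
top_words <- data.frame(
  feature   = c("just", "said", "one", "like", "can"),
  frequency = c(35860, 35652, 34342, 31679, 29089)
)

# Horizontal bar chart, with the most frequent word at the top
p <- ggplot(top_words, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = "Word", y = "Frequency", title = "Most frequent unigrams")

print(p)
```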

Coverage uniqueness

In the tokenized corpus it is apparent that some words are very common and others very rare. The code below explores this by finding how many distinct words account for 50%, 75%, 90%, and 95% of all word occurrences in the corpus.

# Find how many distinct words cover different percentages of the language used

freq <- textstat_frequency(dfm_unigrams)
inst_per <- c(50, 75, 90, 95)

# Percentage of all word occurrences contributed by each word, with a running total
freq_tot <- sum(freq$frequency)
freq <- freq %>%
  mutate(percent = round(frequency / freq_tot * 100, 5),
         cum_per = cumsum(percent))

# For each threshold, find the first word whose cumulative percentage exceeds it
cut_rows <- sapply(inst_per, function(p) which(freq$cum_per > p)[1])

cuts <- data.frame(Words = freq$rank[cut_rows],
                   Cumulative_Percent = freq$cum_per[cut_rows])

kable(cuts)

Words	Cumulative_Percent
1057	50.00608
4838	75.00257
18247	90.00009
42581	95.00002

The results above show how many distinct words are needed to cover each share of the corpus. Only 1,057 distinct words account for 50% of all word occurrences, while 42,581 are needed to reach 95%. This makes sense: a small set of general-purpose words is used constantly, and more specific vocabulary is needed only occasionally for what an author wants to convey.

Next steps

Create a Markov network for 2-grams and 3-grams
Create a general vocabulary for 1-grams
Create a way to handle misspelled words
Create a Shiny app for word predictions
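
As a rough idea of what the first step could involve, the base-R sketch below treats the "Markov network" as a first-order Markov chain over 2-grams. That reading is an assumption, since the model has not been built yet. The token stream and the helper `predict_next` are made up for illustration.

```r
# Toy token stream standing in for the tokenized corpus
toy_tokens <- c("happy", "new", "year", "happy", "new", "day",
                "happy", "new", "year")

# Pair each word with the word that follows it and count the transitions
current_word <- head(toy_tokens, -1)
next_word    <- tail(toy_tokens, -1)
transitions  <- table(current_word, next_word)

# Normalise each row so it holds P(next word | current word)
probs <- unclass(prop.table(transitions, margin = 1))

# Hypothetical helper: return the most likely next word, or NA for unseen words
predict_next <- function(word, probs) {
  if (!(word %in% rownames(probs))) return(NA_character_)
  colnames(probs)[which.max(probs[word, ])]
}

predict_next("new", probs)
```

In this toy stream "new" is followed by "year" twice and by "day" once, so "year" is predicted. A 3-gram version would condition on the previous two words instead of one.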

Text Prediction Milestone Report

Tyler Shirley

6/14/2021
