library(tokenizers)
library(SnowballC)
library(ggplot2)
library(RColorBrewer)
library(gridExtra)
library(tidyverse)
library(ellipsis)
library(dplyr)
library(tidytext)
library(quanteda)
library(quanteda.textplots)
library(quanteda.textstats)
library(knitr)
To develop a text prediction algorithm, text files from news outlets, blogs, and Twitter were taken as the backbone data. The table below shows the size of each text file. The text in the files was collected by a web crawler that stores text from various sources to enhance search algorithms. These scrubbed text files are saved so they can be broken into a corpus for developing analytical models.
The text files range from 167 to 210 MB and together contain well over 4 million lines of text. Twitter has the shortest documents, averaging about 71 bytes per line because of Twitter’s 140-character format, but it makes up for that with the largest number of documents in its dataset. The inverse is true for the blogs, which have the fewest documents but the longest on average, at about 234 bytes per line. Note that the Max Characters column is computed with which.max(), so it reports the position of the longest line in each file rather than that line’s length.
Because of their size, a sample of only 500,000 documents will be taken. This should be more than enough to support an efficient and accurate prediction algorithm.
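As a minimal sketch of the sampling step, assuming the three files have already been read into character vectors, the lines can be pooled and a fixed-size random sample drawn without replacement. The toy vectors, the seed, and the names `all_lines`, `sample_size`, and `samp_lines` are illustrative and not taken from the report.

```r
# Toy stand-ins for the three sources; in the report these would be the
# vectors returned by readLines() for each file.
twit_lines  <- c("just landed", "right now", "happy new year")
usblg_lines <- c("a long blog post about cooking", "another post")
news_lines  <- c("the council voted last night", "new york city news")

# Pool all documents into one character vector.
all_lines <- c(twit_lines, usblg_lines, news_lines)

# Fix the seed so the sample is reproducible (the seed value is arbitrary).
set.seed(1234)

# Draw at most `sample_size` documents without replacement.
sample_size <- 5   # the report describes 500,000 on the full data
samp_lines  <- sample(all_lines, size = min(sample_size, length(all_lines)))

length(samp_lines)
```

Capping the size with `min()` keeps the call from failing when the pooled data is smaller than the requested sample.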
twit_pth <- "/Users/TylerShirley 1/Desktop/Learn to Program/Pirate Language/Capstone/Lang_Capstone/en_US.twitter.txt"
usblg_pth <- "/Users/TylerShirley 1/Desktop/Learn to Program/Pirate Language/Capstone/Lang_Capstone/en_US.blogs.txt"
news_pth <- "/Users/TylerShirley 1/Desktop/Learn to Program/Pirate Language/Capstone/Lang_Capstone/en_US.news.txt"
# Twitter: file size (MB), number of lines, and position of the longest line
twit_size <- file.size(twit_pth) / 10^6
twit_lines <- readLines(twit_pth, encoding = "UTF-8")
twit_lngth <- length(twit_lines)
# NOTE: which.max() returns the index of the longest line, not its length;
# max(nchar(twit_lines)) would give the character count instead.
twit_char <- which.max(nchar(twit_lines))
# Blogs: same statistics
usblg_size <- file.size(usblg_pth) / 10^6
usblg_lines <- readLines(usblg_pth, encoding = "UTF-8")
usblg_lngth <- length(usblg_lines)
usblg_char <- which.max(nchar(usblg_lines))
# News: same statistics
news_size <- file.size(news_pth) / 10^6
news_lines <- readLines(news_pth, encoding = "UTF-8")
news_lngth <- length(news_lines)
news_char <- which.max(nchar(news_lines))
# Combine the per-source statistics into one summary table
corp_data <- data.frame("Source" = c("Twitter", "US Blog", "News"),
                        "Size (Mb)" = c(twit_size, usblg_size, news_size),
                        "Length" = c(twit_lngth, usblg_lngth, news_lngth),
                        "Max Characters" = c(twit_char, usblg_char, news_char))
kable(corp_data)
| Source | Size (MB) | Length | Max Characters |
|---|---|---|---|
| Twitter | 167.1053 | 2360148 | 26 |
| US Blog | 210.1600 | 899288 | 483415 |
| News | 205.8119 | 1010242 | 123628 |

The three text files are combined and then sampled to keep computation efficient. Next, the sampled corpus is tokenized to divide each document into a list of words. Tokenization is the process of breaking text into its most basic elements: punctuation, numbers, symbols, words, and spaces. For this algorithm, numbers, punctuation, symbols, and very common English words are omitted. Profanity has also been removed during tokenization of the sample corpus.
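The tokenization described above might look roughly like the following with quanteda. This is a sketch on toy data: the report does not show its tokenization call, so the exact options are assumptions, and `bad_words` is a hypothetical one-word stand-in for whatever profanity list was actually used.

```r
library(quanteda)

# Toy sample standing in for the sampled corpus.
samp_lines <- c("I just saw 3 movies in New York City!",
                "Right now, the darn weather is great.")

# Hypothetical profanity list; the report does not say which list was used.
bad_words <- c("darn")

# Split each document into tokens, dropping punctuation, numbers, and symbols.
toks <- tokens(corpus(samp_lines),
               remove_punct = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE)

# Lowercase, then drop common English stopwords and the profanity list.
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, c(stopwords("en"), bad_words))

as.list(toks)
```

Lowercasing before removal means a single lowercase stopword list matches regardless of how the word was capitalized in the source text.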
Once the corpus was tokenized, counts of word sequences were needed to determine the most common single words and the most common two- and three-word phrases. These sequences are called n-grams, where n is the number of consecutive words in the group. For example, finding the most common 2-grams means parsing every two-word phrase in the corpus and counting how often each one occurs.
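To make the n-gram idea concrete, here is a small sketch that builds and counts 2-grams with quanteda on three toy sentences. The sentences and variable names are made up for illustration, and stopwords are left in to keep the example short.

```r
library(quanteda)
library(quanteda.textstats)

docs <- c("happy new year from new york city",
          "new york city is busy right now",
          "right now new york is cold")

toks <- tokens(docs)

# Build all contiguous two-word sequences; the words are joined with "_".
bigrams <- tokens_ngrams(toks, n = 2, concatenator = "_")

# Count how often each bigram occurs across the corpus.
dfm_bigrams <- dfm(bigrams)
top_bigrams <- textstat_frequency(dfm_bigrams, n = 3)

# "new_york" appears once in each toy document, so it ranks first.
top_bigrams[, c("feature", "frequency")]
```

Changing `n = 2` to `n = 3` in `tokens_ngrams()` would produce 3-grams in the same way.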
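Because the common English words were removed, there are no “the”s or “and”s in the tokenized list. The most common single word in the tokenized list is “just”, the most common 2-gram is “right now”, and the most common 3-gram is “new york city”. Two of the trigrams in the table, “vested interests vested” and “interests vested interests”, each occur 251 times but in only one document, so they reflect a single repetitive document rather than common usage.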
Top unigrams:

| feature | frequency | rank | docfreq | group |
|---|---|---|---|---|
| just | 35860 | 1 | 33189 | all |
| said | 35652 | 2 | 32455 | all |
| one | 34342 | 3 | 30209 | all |
| like | 31679 | 4 | 28402 | all |
| can | 29089 | 5 | 25785 | all |

Top bigrams:

| feature | frequency | rank | docfreq | group |
|---|---|---|---|---|
| right_now | 2856 | 1 | 2812 | all |
| new_york | 2325 | 2 | 2166 | all |
| last_year | 2167 | 3 | 2112 | all |
| last_night | 1780 | 4 | 1759 | all |
| high_school | 1679 | 5 | 1562 | all |

Top trigrams:

| feature | frequency | rank | docfreq | group |
|---|---|---|---|---|
| new_york_city | 304 | 1 | 301 | all |
| let_us_know | 278 | 2 | 277 | all |
| vested_interests_vested | 251 | 3 | 1 | all |
| interests_vested_interests | 251 | 3 | 1 | all |
| happy_new_year | 223 | 5 | 220 | all |

The top 20 words are shown in chart form. The top unigrams, bigrams, and trigrams are all plotted to give a perspective on the common words and phrases used in these text documents. Note that as n grows, the frequency of each n-gram goes down, because longer phrases repeat less often.
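A frequency chart of this kind can be sketched with ggplot2 as follows. The five words and counts are copied from the unigram table earlier in this report; the plot styling (colour, labels, orientation) is an assumption, since the original plotting code is not shown.

```r
library(ggplot2)

# Top five unigrams and their counts, copied from the unigram frequency table.
top_words <- data.frame(
  feature   = c("just", "said", "one", "like", "can"),
  frequency = c(35860, 35652, 34342, 31679, 29089)
)

# Horizontal bar chart, ordered so the most frequent word is at the top.
p <- ggplot(top_words, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = "Word", y = "Frequency", title = "Most common unigrams")

print(p)
```

The same code works for bigrams and trigrams by swapping in the corresponding frequency table.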
Word clouds are also created to give another visual representation of frequency. This is a less quantitative way of interpreting the data, but it is an aesthetically appealing way to present data to a large audience.
In the tokenized corpus it is apparent that some words are very common and others very rare. The method below explores this by finding how many distinct words make up 50%, 75%, 90%, and 95% of the total number of word instances in the corpus.
# Find how many of the most frequent words cover 50%, 75%, 90%, and 95% of all word instances
freq <- as.data.frame(textstat_frequency(dfm_unigrams))
inst_per <- c(50, 75, 90, 95)
freq_tot <- sum(freq$frequency)
freq$percent <- round(freq$frequency / freq_tot * 100, 5)
freq <- freq %>%
  mutate(cum_per = cumsum(percent))
# For each target, keep the first row whose cumulative percentage exceeds it.
# rank is used as the word count; tied frequencies share a rank in textstat_frequency().
cuts <- bind_rows(lapply(inst_per, function(p) head(filter(freq, cum_per > p), 1))) %>%
  select(Words = rank, Cumulative_Percent = cum_per)
kable(cuts)
| Words | Cumulative_Percent |
|---|---|
| 1057 | 50.00608 |
| 4838 | 75.00257 |
| 18247 | 90.00009 |
| 42581 | 95.00002 |

The results above show how many distinct words are needed to cover different percentages of the corpus. Only 1057 different words are needed to account for 50% of all word instances in the corpus, while more than 42,000 are needed to reach 95%. This makes sense: a small core of common words is used constantly, while the more specific vocabulary an author needs to convey a particular idea appears far less often.
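The coverage calculation can also be illustrated on a small vector of counts using only base R. The function `words_for_coverage()` and the toy counts are hypothetical, written for this illustration, and are not part of the report's code.

```r
# Number of most-frequent words needed to reach each coverage target.
# `counts` is a vector of word counts; `targets` are fractions in (0, 1].
words_for_coverage <- function(counts, targets) {
  counts  <- sort(counts, decreasing = TRUE)
  cum_cov <- cumsum(counts) / sum(counts)
  # First position at which cumulative coverage reaches each target.
  sapply(targets, function(t) which(cum_cov >= t)[1])
}

# Toy counts: a few common words and a tail of rare ones (total = 100).
toy_counts <- c(40, 25, 12, 8, 6, 4, 2, 1, 1, 1)

# Cumulative coverage is 40%, 65%, 77%, 85%, 91%, ... so the targets
# 50%, 75%, and 90% are reached after 2, 3, and 5 words respectively.
words_for_coverage(toy_counts, c(0.50, 0.75, 0.90))
```

Counting positions in the sorted vector avoids the tie issue that arises when a rank column is used as the word count.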
- Create a Markov network for 2-grams and 3-grams
- Create a general vocabulary for 1-grams
- Create a way to handle misspelled words
- Create a Shiny app for word predictions
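
As a rough idea of what the first planned item could look like, the sketch below builds a first-order Markov model from bigram counts in base R and uses it to suggest the next word. It is only one simple way to approach the plan: the function names `build_bigram_model()` and `predict_next()` are hypothetical, and there is no smoothing, backoff to unigrams, or handling of 3-grams.

```r
# Build a first-order Markov model: for each word, count which words follow it.
build_bigram_model <- function(docs) {
  pair_list <- lapply(strsplit(tolower(docs), "\\s+"), function(w) {
    if (length(w) < 2) return(NULL)
    data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, Filter(Negate(is.null), pair_list))
  # Transition counts: rows are the current word, columns the next word.
  table(pairs$prev, pairs$nxt)
}

# Predict the most likely next word; NA if the word was never seen.
predict_next <- function(model, word) {
  word <- tolower(word)
  if (!(word %in% rownames(model))) return(NA_character_)
  colnames(model)[which.max(model[word, ])]
}

docs  <- c("happy new year", "new york city", "new york times", "right now")
model <- build_bigram_model(docs)

predict_next(model, "new")    # "york" follows "new" twice, "year" once
predict_next(model, "zebra")  # unseen word, so NA
```

Returning `NA` for unseen words marks the place where a fallback to the general 1-gram vocabulary from the second item could be plugged in.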