So far, in the early phases of creating a predictive text algorithm, the data sets have been loaded into R, sampled, checked for word frequencies, and plotted for visual representation.
The data file was unzipped and each of the English data sets was read in using read.table. With the sep argument (the field separator) set equal to the newline character (\n), read.table created a data frame with one line of text per row. The integers printed on the left are simply the row names, an increasing sequence. The single data column (labelled V1) holds the complete character string of the associated line of text.
I did run into an issue with reading in the en_US.news.txt file. Embedded in the text were 3 special characters that forced read.table to quit early. I ended up reading in the file three separate times. After each run I checked the number of lines read to see where it stopped. Then I went back into Notepad to find the corresponding line, manually deleted the special character, and reran read.table to find the next one. I had to do that three times before I had the complete data set.
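One way to avoid the manual cleanup, offered as a sketch rather than what was done for this report, is to read the file with readLines over a binary-mode connection. I am assuming the special characters were control characters such as 0x1A (SUB), which a text-mode read on Windows can treat as end-of-file. The helper name `read_lines_binary` and the temporary demo file are made up for illustration.

```r
# Hypothetical helper: read every line of a text file through a binary-mode
# connection so that stray control characters do not end the read early.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Self-contained demo: a temporary file with an embedded SUB (0x1A) character
# in the middle line stands in for en_US.news.txt.
demo_path <- tempfile(fileext = ".txt")
writeBin(charToRaw("first line\nsecond \x1a line\nthird line\n"), demo_path)

demo_lines <- read_lines_binary(demo_path)
length(demo_lines)  # all three lines are read

# Same one-column shape (V1) that read.table produced in this report
demo_txt <- data.frame(V1 = demo_lines, stringsAsFactors = FALSE)
```

Because readLines does no quote or field processing, it also avoids surprises from apostrophes and quotation marks inside the text.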
Once each of the 3 data sets was read in by read.table, I took a sample from each. I used the sample_frac function from dplyr to extract 0.5% of the total number of lines.
Using the unnest_tokens function from the tidytext package, I broke the sample data sets into 1 word per row. With dplyr's count function I also counted the number of times each word appeared in the entire sample, and then sorted the words from largest to smallest count. The ten most frequent words in the news sample were:
| word | n |
|---|---|
| the | 823 |
| and | 405 |
| to | 393 |
| a | 360 |
| of | 322 |
| in | 253 |
| that | 142 |
| for | 131 |
| on | 118 |
| with | 117 |

| word | n.news | n.blogs | n.twitter | total |
|---|---|---|---|---|
From this table we can see that the most frequent words are the usual suspects: words that add little to the meaning of a sentence. To address this, the plan is to start with the stop_words list in the tidytext package and remove the most common words from each of the data sets. I may need to add words to, or remove words from, that pre-packaged list.
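The planned stop-word step could look something like the sketch below. It uses a toy two-line sample in place of the real data, and the extra stop word ("played") and the "custom" lexicon label are made-up examples of how the pre-packaged list might be extended.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for one of the sampled data sets (one line of text per row)
toy_sample <- tibble(V1 = c("The cat sat on the mat",
                            "A dog and a cat played in the garden"))

# Hypothetical additions to the pre-packaged list; "custom" is just a label
extra_stops <- tibble(word = "played", lexicon = "custom")
all_stops <- bind_rows(stop_words, extra_stops)

toy_counts <- toy_sample %>%
  unnest_tokens(word, V1) %>%            # one lowercase word per row
  anti_join(all_stops, by = "word") %>%  # drop every word found in the stop list
  count(word, sort = TRUE)

toy_counts
```

Removing a word from the list works the other way around, for example by filtering stop_words with `!word %in% c(...)` before the anti_join.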
Thinking about the distinction between the types of writing inside each data set makes me want to analyze the emotional content of the sentences. I would like to imagine that a “news” sentence would be more factual and less emotional. “Blogs” and “Tweets”, I believe, tend to be much more emotionally written. Once I start breaking the data into n-grams I will also look at whether emotion is correlated with word choice.
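A first pass at the emotional-content idea might look like the sketch below. It assumes the Bing lexicon that tidytext exposes through get_sentiments("bing"); the choice of lexicon and the three toy sentences are my own placeholders, not results from the data.

```r
library(dplyr)
library(tidytext)

# Toy stand-ins for lines from the three samples, tagged with their source
toy_lines <- tibble(
  source = c("news", "blogs", "twitter"),
  V1 = c("The council met on Tuesday at noon",
         "I love this wonderful recipe",
         "worst commute ever, I hate traffic")
)

toy_sentiment <- toy_lines %>%
  unnest_tokens(word, V1) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep only scored words
  count(source, sentiment)

toy_sentiment
```

On the real samples the counts would need to be divided by the total number of words per source before the three data sets could be compared fairly.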
I have not encountered many swear words so far, or many non-English words. From the perspective of writing lines of text on a cellular phone, one solution may be to group recipients by language. Most people would use less formal language with friends, and may also use more expletives with friends. When writing to a professional colleague, by contrast, the language used is formal and possibly very specialized. Cellular phones can already arrange contacts into groups, so arranging contacts by style of language might improve the predictiveness of an algorithm…
# Set-Up
knitr::opts_chunk$set(cache = TRUE)
knitr::opts_chunk$set(warning = FALSE)
# Libraries
library(readr)
library(tidyverse)
library(tidytext)
library(broom)
# Read-In
# Each file is read as one character column (V1), one line of text per row.
# Note: quote = "'\"" makes read.table treat ' and " as quoting characters, so a
# line with an unbalanced quote can swallow the lines that follow it;
# quote = "" would turn quoting off.
twit_txt <- read.table("./en_US.twitter.txt", colClasses = "character",
                       sep = "\n", fill = TRUE, comment.char = "",
                       quote = "'\"", encoding = "UTF-8", skipNul = TRUE,
                       numerals = "no.loss")
news_txt <- read.table("./en_US.news.txt", colClasses = "character",
                       sep = "\n", comment.char = "", quote = "'\"",
                       fill = TRUE, encoding = "UTF-8", skipNul = TRUE,
                       numerals = "no.loss", allowEscapes = TRUE)
blogs_txt <- read.table("./en_US.blogs.txt", colClasses = "character",
                        sep = "\n", comment.char = "", quote = "'\"",
                        fill = TRUE, encoding = "UTF-8",
                        skipNul = TRUE, numerals = "no.loss",
                        allowEscapes = TRUE)
# Samples
set.seed(42)
twitter_sample <- twit_txt %>% sample_frac(0.005)
news_sample <- news_txt %>% sample_frac(0.005)
blogs_sample <- blogs_txt %>% sample_frac(0.005)
# Tokens
news_df <- news_sample %>%
  unnest_tokens(word, V1) %>%
  count(word, sort = TRUE)
blogs_df <- blogs_sample %>%
  unnest_tokens(word, V1) %>%
  count(word, sort = TRUE)
twitter_df <- twitter_sample %>%
  unnest_tokens(word, V1) %>%
  count(word, sort = TRUE)
# Token Table Example
knitr::kable(news_df[1:10,])
# Count Plots
news_df %>%
  filter(n > 100) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
  geom_col() +
  labs(y = NULL, title = "News Words")
blogs_df %>%
  filter(n > 1000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
  geom_col() +
  labs(y = NULL, title = "Blogs Words")
twitter_df %>%
  filter(n > 1000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
  geom_col() +
  labs(y = NULL, title = "Twitter Words")
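# Side-by-Side Comparison Table
# Re-tokenize each sample and keep only words above a per-source count threshold.
# Note: the news token table tops out at 823 ("the"), so n > 1000 keeps no news
# words for that sample and the joined table comes out empty; lower it to get rows.
news_df <- news_sample %>%
  unnest_tokens(word, V1) %>%
  count(word, sort = TRUE) %>%
  filter(n > 1000)
blogs_df <- blogs_sample %>%
  unnest_tokens(word, V1) %>%
  count(word, sort = TRUE) %>%
  filter(n > 800)
twitter_df <- twitter_sample %>%
  unnest_tokens(word, V1) %>%
  count(word, sort = TRUE) %>%
  filter(n > 599)
joint_df <-
  left_join(news_df, blogs_df, by = "word", suffix = c(".news", ".blogs")) %>%
  left_join(twitter_df, by = "word") %>%
  rename(n.twitter = n) %>%
  na.omit() %>%  # keep only words present in all three filtered lists
  mutate(total = rowSums(across(where(is.numeric))))
knitr::kable(joint_df)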
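As an alternative sketch for the comparison table, the three samples could be stacked with a source label and counted in one pass, which avoids the separate per-source thresholds. This is not the method used above. The toy data frames and the name `toy_joint` are placeholders; on the real data the inputs would be news_sample, blogs_sample, and twitter_sample.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Toy stand-ins for the three sampled data frames
toy_news    <- tibble(V1 = "the mayor said the vote passed")
toy_blogs   <- tibble(V1 = "the recipe said to stir the pot")
toy_twitter <- tibble(V1 = c("the game tonight", "said it before"))

toy_joint <- bind_rows(news = toy_news, blogs = toy_blogs,
                       twitter = toy_twitter, .id = "source") %>%
  unnest_tokens(word, V1) %>%
  count(source, word) %>%                      # one count per source and word
  pivot_wider(names_from = source, values_from = n,
              names_prefix = "n.", values_fill = 0) %>%
  mutate(total = n.news + n.blogs + n.twitter) %>%
  arrange(desc(total))                         # most frequent words first

knitr::kable(head(toy_joint, 10))
```

Words missing from a source get a count of 0 here instead of being dropped, so the table keeps every word and can be trimmed with head() or a single filter on total.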