This report summarizes the preliminary analysis I've done for my Swiftkey text mining and analytics project.
The goal of the project is to develop a Shiny application that can be used to predict the next word in a string of text. So far, the project is made up of three scripts: one that processes the raw data, one that performs exploratory data analysis, and one that models the data.
The code for these scripts is displayed in the appendix.
Before I begin manipulating the data into corpus format, let's take a quick look at the summary statistics. For each data source I've calculated the number of lines, the number of characters, and the file size:
File | Number of lines | Number of characters | File size |
---|---|---|---|
blogs | 899288 | 206824382 | 210.16 mb |
news | 1010242 | 203223154 | 205.81 mb |
twitter | 2360148 | 162096031 | 167.11 mb |
The datasets are too large to use in their entirety, so I had to strike a balance between ensuring I had a large enough sample size and ensuring I could manipulate the data without crashing R. To do this, I chose to take the first 30,000 lines from each of the three data sources.
1.5m words makes for a more than sufficient sample size, and still allows for reasonable run times when modeling. The raw data (written to the dataSamples folder so that it can be presented here) looks like this:
blogs | news | twitter |
---|---|---|
so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home. | The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15. | So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;) |
With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one’s life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE’s new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy! | And when it’s often difficult to predict a law’s impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse? | Words from a complete stranger! Made my birthday even better :) |
If you have an alternative argument, let’s hear it! :) | There was a certain amount of scoffing going around a few years ago when the NFL decided to move the draft from the weekend to prime time – eventually splitting off the first round to a separate day. | First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go! |
To prepare the data for Exploratory Data Analysis (EDA), I made the following manipulations:

- Converted the text to corpus format using the `tm` package. Corpora represent a collection of texts, with VCorpus (short for volatile corpus) denoting that the texts are held entirely in memory.
- Removed numbers and punctuation, stripped extra whitespace, converted the text to lowercase, tokenized it into one word per row, and removed stopwords.
- Stemmed each word using the `SnowballC` package.
- Built bigram, trigram, fourgram, and fivegram columns using the `lead` and `lag` functions. If an n-gram wasn't complete (for example, if there weren't five words in a sentence), the corresponding n-gram was replaced with NA.

Once the above changes are made, the data (written to the tidyData folder) looks like this:
author | datetimestamp | description | heading | id | language | origin | word | stem | bigram | trigram | fourgram | fivegram | dataset |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NA | 2021-11-24 23:09:04 | NA | NA | 1 | en | NA | anyways | anywai | NA | NA | NA | NA | blogs |
NA | 2021-11-24 23:09:04 | NA | NA | 1 | en | NA | going | go | anyways going | anyways going share | NA | NA | blogs |
NA | 2021-11-24 23:09:04 | NA | NA | 1 | en | NA | share | share | going share | going share home | anyways going share home | anyways going share home decor | blogs |
NA | 2021-11-24 23:09:04 | NA | NA | 1 | en | NA | home | home | share home | share home decor | going share home decor | going share home decor inspiration | blogs |
NA | 2021-11-24 23:09:04 | NA | NA | 1 | en | NA | decor | decor | home decor | home decor inspiration | share home decor inspiration | share home decor inspiration storing | blogs |
NA | 2021-11-24 23:09:04 | NA | NA | 1 | en | NA | inspiration | inspir | decor inspiration | decor inspiration storing | home decor inspiration storing | home decor inspiration storing folder | blogs |
NA | 2021-11-24 23:09:04 | NA | NA | 1 | en | NA | storing | store | inspiration storing | inspiration storing folder | decor inspiration storing folder | decor inspiration storing folder puter | blogs |
NA | 2021-11-24 23:09:04 | NA | NA | 1 | en | NA | folder | folder | storing folder | storing folder puter | inspiration storing folder puter | inspiration storing folder puter amazing | blogs |
NA | 2021-11-24 23:09:04 | NA | NA | 1 | en | NA | puter | puter | folder puter | folder puter amazing | storing folder puter amazing | storing folder puter amazing images | blogs |
NA | 2021-11-24 23:09:04 | NA | NA | 1 | en | NA | amazing | amaz | puter amazing | puter amazing images | folder puter amazing images | folder puter amazing images stored | blogs |
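To make the n-gram columns above concrete, here's a minimal sketch of how `lag` and `lead` build them. The toy sentence is for illustration only; the full version lives in the Data Processing script in the appendix.

```r
## Minimal illustration of building n-gram columns with lag()/lead() -
## the real version is in the Data Processing script in the appendix.
library(tidyverse)

tibble(word = c("anyways", "going", "share", "home")) %>%
  mutate(bigram  = paste(lag(word), word, sep = " "),
         trigram = paste(lag(word), word, lead(word), sep = " "))
## n-grams that run off either end of the sentence contain the string "NA";
## those are the ones later replaced with true NAs, which is why the first rows
## of the table above have missing bigrams and trigrams.
```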
To prepare the data for the Stupid Back-Off (SBO) model, I take a different approach:

- Used `pivot_longer` to manipulate the data into a dataframe where each row corresponds to a line from one of the three data sources.
- Split that dataframe into training (80%) and testing (20%) sets and converted each to corpus format.

A different approach is required because the cleaning parameters and tokenization are specified in the actual SBO model. At that stage, I apply the same cleaning and tokenization transformations that were applied in the Data for EDA section. Both the training and testing data are written to the tidyData folder.
The training data looks like this:
 | Text |
---|---|
text1 | ANYONE KNOW HOW TO stop a dog from shedding |
text2 | I’m talking tonight at 930 on I can only assume they’re doing a “World Series Preview” show. |
text3 | Then people start dying. A pit-and-pendulum murder here, a victim possibly walled up in a sewer there. A detective (Luke Evans, in a bland performance) figures out that somebody is imitating the deaths in Poe’s fiction and decides he needs the author’s experise. |
In this section, I build wordclouds, n-gram frequency plots, and TF-IDF frequency plots, and perform sentiment analysis.
Here I've included a sample of the wordclouds I've plotted. I separated the wordclouds by data source to highlight the difference in language each of the three data sources uses.
Unigram wordclouds don't tell too much of a story - they're mostly filled with common words.
Bigram wordclouds tend to be a bit more unique. Here we’re starting to see parts of phrases that you’d expect to see in the news.
Here we can see the trigram wordcloud for the twitter data set. After trigrams, the wordclouds become more sparse, so I haven’t included them in this report.
In this section I’ve included a sample of the n-gram frequency charts I’ve plotted. Again, I’ve separated the plots by data source to highlight the difference in language that each of the three data sources use. Frequency is calculated as the percentage of the total words the n-gram makes up in the data source.
Term frequency-inverse document frequency (TF-IDF) statistics measure how important a word is to a document in a collection of documents. A word's TF-IDF score is calculated by multiplying its frequency within a document (term frequency) by its inverse document frequency, which decreases the weight of commonly used words and increases the weight of words that appear in few documents in the collection.
The TF-IDF frequency plot for the news data set provides an interesting example of this. "Officer" isn't a word that's overly common in everyday conversation, but in the news - where crime is talked about far more often than in other contexts - it's used regularly, leading to a high TF-IDF score.
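As a concrete example of the calculation, here's a minimal sketch using tidytext's `bind_tf_idf` - the `toyCounts` table and its counts are made up for illustration, not taken from the project data.

```r
## A minimal sketch of the TF-IDF calculation - the counts below are made up.
library(tidyverse)
library(tidytext)

toyCounts <- tribble(
  ~dataset,  ~word,      ~n,
  "news",    "officer",  50,
  "news",    "time",    200,
  "blogs",   "time",    180,
  "twitter", "time",    220
)

toyCounts %>%
  bind_tf_idf(word, dataset, n)
## "officer" appears in only one of the three documents, so its idf (log(3/1))
## and therefore its tf-idf are high. "time" appears in all three documents, so
## its idf is log(3/3) = 0 and its tf-idf is 0, no matter how often it's used.
```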
To analyze sentiment, I define the sentiment of an entire text as the sum of the sentiment scores of its individual words.
There are several general-purpose lexicons that can be used to determine sentiment:

- `AFINN` assigns words a score between -5 and 5, with larger numbers being considered more positive.
- `bing` categorizes words as either "positive" or "negative".
- `nrc` categorizes words as either "yes" or "no" for a variety of sentiments (e.g. happiness, fear).

I use the `AFINN` lexicon to determine the average sentiment of each of the three data sources.
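As a rough sketch, the AFINN approach boils down to joining each word to its score and summarizing by data source; this assumes the `tidyTextData` object built in the Data Processing script is already loaded.

```r
## A minimal sketch of the AFINN sentiment calculation, assuming tidyTextData
## (built in the Data Processing script) is already loaded.
library(tidyverse)
library(tidytext)

tidyTextData %>%
  inner_join(get_sentiments("afinn"), by = "word") %>% # attach each word's -5..5 score
  group_by(dataset) %>%
  summarise(averageSentiment = mean(value), # average word-level sentiment per source
            totalSentiment   = sum(value))  # summed sentiment per source
```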
All three data sources have a generally positive sentiment. Blogs and news have a nearly identical average sentiment, although the news has more variance in sentiment. On average, twitter is the most positive of the three data sources.
This is an interesting result, because I tend to think of the news as a generally negative source of information - I would’ve expected its average sentiment to be lower.
In this section I build an SBO model that predicts the next word in a sentence from the words that precede it, using n-grams up to five words long.
SBO’s are referred to as “stupid” because of their relatively simple algorithm, but this doesn’t mean they’re an ineffective predictor of the next word in a sentence. SBO’s work as follows:
The model is able to predict the next word in a sentence with ~19% accuracy. While this isn’t perfect, it’s a serviceable baseline that can act as a point of reference while I look into more advanced methodology.
The end goal here is to build a predictive algorithm and deploy it as a Shiny app, where the app will take a phrase as input and output its prediction of the next word.
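As a rough sketch of what that could look like, the app below wraps the SBO predictor in a text input and a text output. It assumes the trained `sboPredictor` object from the modeling script is available in the app's session; the UI layout is just a placeholder.

```r
## A minimal sketch of the planned Shiny app, assuming sboPredictor from the
## modeling script is available in the session.
library(shiny)
library(sbo)

ui <- fluidPage(
  textInput("phrase", "Type a phrase:"),
  textOutput("nextWord")
)

server <- function(input, output) {
  output$nextWord <- renderText({
    req(input$phrase)
    ## predict() returns the top L (here 3) candidate next words
    paste(predict(sboPredictor, input$phrase), collapse = " | ")
  })
}

shinyApp(ui, server)
```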
While I could accomplish this with the SBO model detailed above, I'd love to use some more advanced natural language processing such as a Long Short-Term Memory (LSTM) network. Unfortunately, the `keras` and `tensorflow` packages are primarily built for Python - and while there are wrappers to make them work in R, I've been unable to use them without my ten-year-old Mac's keyboard getting hotter than the surface of the sun.
In the coming weeks, I'll be looking into LSTM and other Recurrent Neural Network (RNN) techniques to determine whether any of them will be feasible given my laptop's computational constraints.
The appendix below displays the code I’ve written to process, explore, and model the data.
###################
# Data Processing #
###################
## This script takes the first 30k lines from each of the three data sources
## and tokenizes them using the "tm" and "tidytext" packages.
## Load packages
library(tidyverse)
library(tm)
library(tidytext)
library(stopwords)
library(quanteda)
library(rsample)
## Set working directory
setwd("/Users/kevinroche22/RData/SwiftkeyTextMiningAndAnalytics/rawData")
## List of file names
fileNames <- list.files(getwd())
## Read in raw data, capping the number of lines at 30k (huge files - this sample size should suffice)
rawTextData <- fileNames %>% map_dfc(function(file) {
## Applies readLines functions to each of the three files
readr::read_lines(file, skip = 3, n_max = 30000)
})
## Name data
names(rawTextData) <- c("blogs", "news", "twitter")
## Write to dataSamples folder
write_csv(rawTextData, "/Users/kevinroche22/RData/SwiftkeyTextMiningAndAnalytics/dataSamples/rawDataSample.csv")
################
# Data for EDA #
################
## Format and clean using tm, stopwords and tidytext
tidyTextData <- unique(names(rawTextData)) %>% map_dfr(function(name) {
## Progress check
print(name)
## Format and build out word groupings
VCorpus(VectorSource(rawTextData[[name]])) %>% # Change to corpus format
tm_map(removeNumbers) %>% # Remove numbers
tm_map(removePunctuation) %>% # Remove punctuation
tm_map(stripWhitespace) %>% # Strip whitespace
tm_map(content_transformer(tolower)) %>% # Make lowercase
tidy() %>% # tidy returns a tbl_df with one-row-per-document
unnest_tokens(word, text) %>% # Splits text column into word tokens, flattening the table into one-token-per-row
anti_join(get_stopwords(source = "snowball"), by = "word") %>%
group_by(id) %>% # otherwise we'd get word strings across sentences
mutate(stem = SnowballC::wordStem(word),
bigram = paste(lag(word), word, sep = " "),
trigram = paste(lag(word), word, lead(word), sep = " "),
fourgram = paste(lag(word, n = 2), lag(word), word, lead(word), sep = " "),
fivegram = paste(lag(word, n = 2), lag(word), word, lead(word), lead(word, n = 2), sep = " "),
dataset = name) %>% # identifier for which dataset text came from
ungroup()
})
## Replace word groupings that aren't full with NA - faster using base R than piping into the mapping above
is.na(tidyTextData$bigram) <- str_detect(tidyTextData$bigram, "NA")
is.na(tidyTextData$trigram) <- str_detect(tidyTextData$trigram, "NA")
is.na(tidyTextData$fourgram) <- str_detect(tidyTextData$fourgram, "NA")
is.na(tidyTextData$fivegram) <- str_detect(tidyTextData$fivegram, "NA")
######################
# Data for SBO Model #
######################
## Pivot data longer
rawTextData <- rawTextData %>%
pivot_longer(everything(), values_to = "text") %>%
select(text)
## Split into training and testing
splitRawTextData <- rawTextData %>%
initial_split(prop = 0.8)
trainData <- training(splitRawTextData)
testData <- testing(splitRawTextData)
## Convert to corpus
tidyTrainData <- trainData$text %>%
corpus()
tidyTestData <- testData$text %>%
corpus()
#####################################
## Write results to tidyData folder #
#####################################
write_rds(tidyTextData, "/Users/kevinroche22/RData/SwiftkeyTextMiningAndAnalytics/tidyData/tidyData.rds")
write_rds(tidyTrainData, "/Users/kevinroche22/RData/SwiftkeyTextMiningAndAnalytics/tidyData/tidyTrainData.rds")
write_rds(tidyTestData, "/Users/kevinroche22/RData/SwiftkeyTextMiningAndAnalytics/tidyData/tidyTestData.rds")
#############################
# Exploratory Data Analysis #
#############################
## In this section we compute and plot the top 10 most common single words,
## word pairs, word trios, word quartets, and word quintets, and then plot
## their associated word clouds. We also create summary stats, build tf-idf
## plots and perform sentiment analysis.
## Load packages
library(tidyverse)
library(ggplot2)
library(ggthemes)
library(wordcloud2)
library(tidytext)
## Set working directory
setwd("/Users/kevinroche22/RData/SwiftkeyTextMiningAndAnalytics/EDACharts/")
## List of file names for tidy data
tidyDataFolder <- "/Users/kevinroche22/RData/SwiftkeyTextMiningAndAnalytics/tidyData/"
## Read in tidy data
tidyTextData <- read_rds(file = paste0(tidyDataFolder, "tidyData.rds"))
## List of file names for raw data
rawDataFileNames <- paste0("/Users/kevinroche22/RData/SwiftkeyTextMiningAndAnalytics/rawData/",
list.files("/Users/kevinroche22/RData/SwiftkeyTextMiningAndAnalytics/rawData/"))
## Read in raw data
rawTextData <- rawDataFileNames %>% map_dfc(function(file) {
## Applies readLines functions to each of the three files
readr::read_lines(file, skip = 3, n_max = 30000)
})
## Name raw data
names(rawTextData) <- c("blogs", "news", "twitter")
#################
# Summary Stats #
#################
## Calculate file size
fileSizes <- rawDataFileNames %>% map_dfc(function(file) {
## Return file size
paste0(round(file.size(file) / 1000000, 2), " mb")
})
## Summarize raw data
rawDataSummary <- rawDataFileNames %>% map_dfr(function(file) {
## Applies readLines functions to each of the three files
rawFile <- readLines(file)
## Return stats
stringi::stri_stats_general(rawFile)
})
## Add descriptive column and reorder
rawDataSummary <- rawDataSummary %>%
mutate(file = c("blogs", "news", "twitter")) %>%
bind_cols(fileSize = t(fileSizes)) %>%
relocate(file) %>%
select(-c("LinesNEmpty", "CharsNWhite")) %>% # Not particularly useful information imo
as.matrix() %>%
as.data.frame() # Has to be in df format to write but first needed to convert to matrix due to transpose
## Name columns
names(rawDataSummary) <- c("file", "numberOfLines", "numberOfCharacters", "fileSize")
## Write to folder
write_csv(rawDataSummary, "/Users/kevinroche22/RData/SwiftkeyTextMiningAndAnalytics/dataSamples/summaryStats.csv")
############
# Unigrams #
############
## Plot top 10 most commonly used words (%) from each source
unigramFreqPlots <- unique(tidyTextData$dataset) %>% map(function(name) {
## Progress check
print(name)
## Build plot
tidyTextData %>%
filter(dataset == name) %>%
count(word) %>%
mutate(percent = 100*(n/sum(n))) %>%
select(-n) %>% # drop the raw count column (slice(-n) would drop rows, not the column)
slice_max(percent, n = 10) %>%
ggplot(aes(x = reorder(word, percent), y = percent, fill = percent)) +
geom_bar(stat = "identity") +
coord_flip() +
ggtitle(paste0("Top 10 Most Frequently Used Words - ", tools::toTitleCase(name))) +
labs(x = "Word", y = "Percentage Of Total") +
theme_bw() +
scale_color_tableau() +
theme(legend.position = "none")
})
## Name frequency plots
names(unigramFreqPlots) <- c("blogs", "news", "twitter")
## Write to EDA folder
pwalk(list(filename = paste0("frequencyPlots/unigram/Top10FrequencyWords - ", names(unigramFreqPlots), ".png"),
plot = unigramFreqPlots),
ggsave)
## Plot word clouds from each source and write to folder
unigramWordClouds <- unique(tidyTextData$dataset) %>% map(function(name) {
## Progress check
print(name)
## Manipulate data
countWords <- tidyTextData %>%
filter(dataset == name) %>%
count(word) %>%
filter(n >= 3) %>%
slice_max(n, n = 100)
## Build plot
wordcloud2(data = countWords,
size = 1,
backgroundColor = "black",
shape = "circle")
})
## Name wordclouds
names(unigramWordClouds) <- c("blogs", "news", "twitter")
## Write to folder
unique(names(unigramWordClouds)) %>% map(function(name) {
## Progress check
print(name)
htmlwidgets::saveWidget(unigramWordClouds[[name]],
paste0("wordClouds/unigram/unigramWordClouds - ", name, ".html"), selfcontained = F)
webshot::webshot(paste0("wordClouds/unigram/unigramWordClouds - ", name, ".html"),
paste0("wordClouds/unigram/unigramWordClouds - ", name, ".png"),
vwidth = 600,
vheight = 600,
delay = 10)
})
###########
# Bigrams #
###########
## Plot top 10 most commonly used pairs of words (%) from each source
bigramFreqPlots <- unique(tidyTextData$dataset) %>% map(function(name) {
## Progress check
print(name)
## Build plot
tidyTextData %>%
filter(dataset == name,
!(is.na(bigram))) %>%
count(bigram) %>%
mutate(percent = 100*(n/sum(n))) %>%
select(-n) %>% # drop the raw count column (slice(-n) would drop rows, not the column)
slice_max(percent, n = 10) %>%
ggplot(aes(x = reorder(bigram, percent), y = percent, fill = percent)) +
geom_bar(stat = "identity") +
coord_flip() +
ggtitle(paste0("Top 10 Most Frequently Used Bigrams - ", tools::toTitleCase(name))) +
labs(x = "Bigram", y = "Percentage Of Total") +
theme_bw() +
scale_color_tableau() +
theme(legend.position = "none")
})
## Name frequency plots
names(bigramFreqPlots) <- c("blogs", "news", "twitter")
## Write to EDA folder
pwalk(list(filename = paste0("frequencyPlots/bigram/Top10FrequencyPairs - ", names(bigramFreqPlots), ".png"),
plot = bigramFreqPlots),
ggsave)
## Plot word clouds from each source
bigramWordClouds <- unique(tidyTextData$dataset) %>% map(function(name) {
## Progress check
print(name)
## Manipulate data
countWords <- tidyTextData %>%
filter(dataset == name,
!(is.na(bigram))) %>%
count(bigram) %>%
filter(n >= 3) %>%
slice_max(n, n = 100)
## Build plot
wordcloud2(data = countWords,
size = 1,
backgroundColor = "black",
shape = "circle")
})
## Name wordclouds
names(bigramWordClouds) <- c("blogs", "news", "twitter")
## Write to folder
unique(names(bigramWordClouds)) %>% map(function(name) {
## Progress check
print(name)
htmlwidgets::saveWidget(bigramWordClouds[[name]],
paste0("wordClouds/bigram/bigramWordClouds - ", name, ".html"), selfcontained = F)
webshot::webshot(paste0("wordClouds/bigram/bigramWordClouds - ", name, ".html"),
paste0("wordClouds/bigram/bigramWordClouds - ", name, ".png"),
vwidth = 600,
vheight = 600,
delay = 10)
})
###########
# 3-grams #
###########
## Plot top 10 most commonly used trios of words (%) from each source
trigramFreqPlots <- unique(tidyTextData$dataset) %>% map(function(name) {
## Progress check
print(name)
## Build plot
tidyTextData %>%
filter(dataset == name,
!(is.na(trigram))) %>%
count(trigram) %>%
mutate(percent = 100*(n/sum(n))) %>%
select(-n) %>% # drop the raw count column (slice(-n) would drop rows, not the column)
slice_max(percent, n = 10) %>%
ggplot(aes(x = reorder(trigram, percent), y = percent, fill = percent)) +
geom_bar(stat = "identity") +
coord_flip() +
ggtitle(paste0("Top 10 Most Frequently Used Trios Of Words - ", tools::toTitleCase(name))) +
labs(x = "Word Trio", y = "Percentage Of Total") +
theme_bw() +
scale_color_tableau() +
theme(legend.position = "none")
})
## Name frequency plots
names(trigramFreqPlots) <- c("blogs", "news", "twitter")
## Write to EDA folder
pwalk(list(filename = paste0("frequencyPlots/trigram/Top10FrequencyTrios - ", names(trigramFreqPlots), ".png"),
plot = trigramFreqPlots),
ggsave)
## Plot word clouds from each source
trigramWordClouds <- unique(tidyTextData$dataset) %>% map(function(name) {
## Progress check
print(name)
## Manipulate data
countWords <- tidyTextData %>%
filter(dataset == name,
!(is.na(trigram))) %>%
count(trigram) %>%
filter(n >= 3) %>%
slice_max(n, n = 100)
## Build plot
wordcloud2(data = countWords,
size = 1,
backgroundColor = "black",
shape = "circle")
})
## Name wordclouds
names(trigramWordClouds) <- c("blogs", "news", "twitter")
## Write to folder
unique(names(trigramWordClouds)) %>% map(function(name) {
## Progress check
print(name)
htmlwidgets::saveWidget(trigramWordClouds[[name]],
paste0("wordClouds/trigram/trigramWordClouds - ", name, ".html"), selfcontained = F)
webshot::webshot(paste0("wordClouds/trigram/trigramWordClouds - ", name, ".html"),
paste0("wordClouds/trigram/trigramWordClouds - ", name, ".png"),
vwidth = 600,
vheight = 600,
delay = 10)
})
###########
# 4-grams #
###########
## Plot top 10 most commonly used quartets of words (%) from each source
fourgramFreqPlots <- unique(tidyTextData$dataset) %>% map(function(name) {
## Progress check
print(name)
## Build plot
tidyTextData %>%
filter(dataset == name,
!(is.na(fourgram))) %>%
count(fourgram) %>%
mutate(percent = 100*(n/sum(n))) %>%
select(-n) %>% # drop the raw count column (slice(-n) would drop rows, not the column)
slice_max(percent, n = 10) %>%
ggplot(aes(x = reorder(fourgram, percent), y = percent, fill = percent)) +
geom_bar(stat = "identity") +
coord_flip() +
ggtitle(paste0("Top 10 Most Frequently Used Quartets Of Words - ", tools::toTitleCase(name))) +
labs(x = "Word Quartet", y = "Percentage Of Total") +
theme_bw() +
scale_color_tableau() +
theme(legend.position = "none")
})
## Name frequency plots
names(fourgramFreqPlots) <- c("blogs", "news", "twitter")
## Write to EDA folder
pwalk(list(filename = paste0("frequencyPlots/fourgram/Top10FrequencyQuartets - ", names(fourgramFreqPlots), ".png"),
plot = fourgramFreqPlots),
ggsave)
## Plot word clouds from each source
fourgramWordClouds <- unique(tidyTextData$dataset) %>% map(function(name) {
## Progress check
print(name)
## Manipulate data
countWords <- tidyTextData %>%
filter(dataset == name,
!(is.na(fourgram))) %>%
count(fourgram) %>%
filter(n >= 2) %>%
slice_max(n, n = 100)
## Build plot
wordcloud2(data = countWords,
size = 1,
backgroundColor = "black",
shape = "circle")
})
## Name wordcloud
names(fourgramWordClouds) <- c("blogs", "news", "twitter")
## Write to folder
unique(names(fourgramWordClouds)) %>% map(function(name) {
## Progress check
print(name)
htmlwidgets::saveWidget(fourgramWordClouds[[name]],
paste0("wordClouds/fourgram/fourgramWordClouds - ", name, ".html"), selfcontained = F)
webshot::webshot(paste0("wordClouds/fourgram/fourgramWordClouds - ", name, ".html"),
paste0("wordClouds/fourgram/fourgramWordClouds - ", name, ".png"),
vwidth = 600,
vheight = 600,
delay = 10)
})
###########
# 5-grams #
###########
## Plot top 10 most commonly used quintets of words (%) from each source
fivegramFreqPlots <- unique(tidyTextData$dataset) %>% map(function(name) {
## Progress check
print(name)
## Build plot
tidyTextData %>%
filter(dataset == name,
!(is.na(fivegram))) %>%
count(fivegram) %>%
mutate(percent = 100*(n/sum(n))) %>%
select(-n) %>% # drop the raw count column (slice(-n) would drop rows, not the column)
slice_max(percent, n = 10) %>%
ggplot(aes(x = reorder(fivegram, percent), y = percent, fill = percent)) +
geom_bar(stat = "identity") +
coord_flip() +
ggtitle(paste0("Top 10 Most Frequently Used Quintets Of Words - ", tools::toTitleCase(name))) +
labs(x = "Word Quintet", y = "Percentage Of Total") +
theme_bw() +
scale_color_tableau() +
theme(legend.position = "none")
})
## Name frequency plots
names(fivegramFreqPlots) <- c("blogs", "news", "twitter")
## Write to EDA folder
pwalk(list(filename = paste0("frequencyPlots/fivegram/Top10FrequencyQuintets - ", names(fivegramFreqPlots), ".png"),
plot = fivegramFreqPlots),
ggsave)
## Plot word clouds from each source
fivegramWordClouds <- unique(tidyTextData$dataset) %>% map(function(name) {
## Progress check
print(name)
## Manipulate data
countWords <- tidyTextData %>%
filter(dataset == name,
!(is.na(fivegram))) %>%
count(fivegram) %>%
filter(n >= 2) %>%
slice_max(n, n = 100)
## Build plot
wordcloud2(data = countWords,
size = 1,
backgroundColor = "black",
shape = "circle")
})
## Name wordcloud
names(fivegramWordClouds) <- c("blogs", "news", "twitter")
## Write to folder
unique(names(fivegramWordClouds)) %>% map(function(name) {
## Progress check
print(name)
htmlwidgets::saveWidget(fivegramWordClouds[[name]],
paste0("wordClouds/fivegram/fivegramWordClouds - ", name, ".html"), selfcontained = F)
webshot::webshot(paste0("wordClouds/fivegram/fivegramWordClouds - ", name, ".html"),
paste0("wordClouds/fivegram/fivegramWordClouds - ", name, ".png"),
vwidth = 600,
vheight = 600,
delay = 10)
})
######################
# Sentiment Analysis #
######################
## Plot average sentiment scores by data source and write to sentiment analysis folder
(tidyTextData %>%
inner_join(get_sentiments("afinn"), by = c("word" = "word")) %>%
group_by(dataset) %>%
ggplot(aes(x = dataset, y = value, fill = dataset)) +
geom_boxplot() +
ggtitle("Average Sentiment By Data Source") +
labs(x = "Data Source", y = "Sentiment Value") +
theme_bw() +
scale_fill_tableau() +
theme(legend.position = "none")) %>%
ggsave(file = paste0(getwd(), "/sentimentAnalysis/sentimentAnalysis.png"))
################
# TF-IDF Plots #
################
## TF-IDF scores account for how common a word is to determine which words
## are most important (ie. more common than is typical) to a text
## Plot 10 words with highest TF-IDF score
tfidfPlots <- unique(tidyTextData$dataset) %>% map(function(name) {
## Check Progress
print(name)
## Build plot
tidyTextData %>%
group_by(dataset) %>%
count(word) %>%
bind_tf_idf(word, dataset, n) %>%
filter(dataset == name,
n >= 30) %>%
slice_max(tf_idf, n = 10) %>%
ggplot(aes(x = reorder(word, tf_idf), y = tf_idf, fill = tf_idf)) +
geom_bar(stat = "identity") +
coord_flip() +
ggtitle(paste0("Most Important Words By TF-IDF Score - ", tools::toTitleCase(name))) +
labs(x = "Word", y = "Importance") +
theme_bw() +
scale_color_tableau() +
theme(legend.position = "none")
})
## Name TF-IDF Plots
names(tfidfPlots) <- c("blogs", "news", "twitter")
## Write to EDA folder
pwalk(list(filename = paste0("tfidfPlots/TF-IDF Plot - ", names(tfidfPlots), ".png"),
plot = tfidfPlots),
ggsave)
#################
# Data Modeling #
#################
## In this section we build a stupid backoff model (sbo) to predict the
## next word based on the previous n-grams (in this case, I use 5)
## Load packages
library(tidyverse)
library(tidymodels)
library(textrecipes)
library(tensorflow)
library(keras)
library(sbo)
## Set working directory
setwd("/Users/kevinroche22/RData/SwiftkeyTextMiningAndAnalytics/models/")
## List of file names
tidyDataFolder <- "/Users/kevinroche22/RData/SwiftkeyTextMiningAndAnalytics/tidyData/"
## Read in tidy data
trainData <- read_rds(file = paste0(tidyDataFolder, "tidyTrainData.rds"))
testData <- read_rds(file = paste0(tidyDataFolder, "tidyTestData.rds"))
###############################
# Stupid Back Off (SBO) Model #
###############################
## Set seed
set.seed(824) # ripKobe
## Build predictor
sboPredictor <- sbo_predictor(object = trainData, # training data
N = 5, # 5-gram model
dict = target ~ 0.75, # 75% of training corpus used in dictionary
.preprocess = sbo::preprocess, # removes anything non alphanumeric, whitespace, converts to lower, etc.
EOS = ".?!:;", # End-Of-Sentence tokens
lambda = 0.4, # Back-off penalization in SBO algorithm - parameter suggested by authors of methodology
L = 3L, # Number of predictions
filtered = c("<UNK>", "<EOS>") # Exclude the <UNK> and <EOS> tokens from predictions
)
## Evaluate Predictions
sboEvaluation <- eval_sbo_predictor(sboPredictor, test = testData)
## Determine accuracy - ~18%
sboEvaluation %>%
filter(true != "<EOS>") %>%
summarise(accuracyPercentage = (sum(correct)/n())*100,
accuracy = sum(correct)/n(),
uncertaintyPercentage = sqrt(accuracy * (1 - accuracy) / n()))
## Stores next word probability in order of score
sboKGrams <- kgram_freqs(trainData,
N = 5,
dict = target ~ 0.75,
.preprocess = sbo::preprocess,
EOS = ".?!:;")
## Example - predict top 3 words after typing
predict(sboPredictor, "Thanks for having us, we had a great time with")
## Example - all words in dictionary arranged by probability
predict(sboKGrams, "Thanks for having us, we had a great time with")