The data used here come from SwiftKey. This part of the Capstone Project uses a filtered version of the entire data set.
URL: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
The goal of this project is to acquire the data, tokenize it, examine word frequencies, and look for relationships between words.
| File Name | File Size | Total Lines | Total Characters | Total Words |
|---|---|---|---|---|
| en_US.news.txt | 200.989 MB | 338611 | 203026406 | 34760232 |
| en_US.blogs.txt | 205.235 MB | 298518 | 207072587 | 37544554 |
| en_US.twitter.txt | 163.189 MB | 791434 | 162901600 | 30090015 |
From the provided data we can compute the number of words, characters, and lines in each file. One issue with the data set was finding the correct line breaks. Splitting on the newline escape character ("\n") separated most of the data into lines, but a few lines ran to thousands of words. Manual examination of those lines revealed that a different separator had been used within them. This was evident from the subject matter: on a single long line the topics were completely disjoint, which indicates several entries run together. For this reason the sampled data files keep only lines of fewer than 200 words.
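The line-length filter described above can be sketched in base R. This is a minimal illustration on made-up lines, not the code used for the report; the names `lines`, `words_per_line`, and `kept` are hypothetical, and only the 200-word threshold comes from the text.

```r
# Toy lines standing in for the raw text (made-up data).
lines <- c(
  "a short line of text",
  paste(rep("word", 250), collapse = " "),  # an overlong line of 250 words
  "another short line"
)

# Count whitespace-separated words on each line.
words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))

# Keep only lines below the 200-word threshold.
kept <- lines[words_per_line < 200]
length(kept)
```

Counting words instead of characters keeps the threshold comparable across sources whose typical word lengths differ.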
Unsurprisingly, the most frequent words in the corpora are basic words that add little context, such as too, and, the, was, is, etc. The word clouds in Plot 1 show the data sets before and after the stop words were removed. The stop_words data set is part of the tidytext package.
Plot 1: Word Clouds of the unfiltered and filtered datasets.
To remove expletives from the corpora, the stop_words data set will be concatenated with any additional words that should be excluded from the list of possible predicted words.
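A minimal sketch of that concatenation, using a small stand-in for the stop_words data set so that it runs without tidytext. The `word` and `lexicon` columns mirror the real stop_words table; the extra words and all object names are placeholders chosen for illustration.

```r
# Tiny stand-in for tidytext's stop_words (columns: word, lexicon).
stop_words_toy <- data.frame(
  word = c("the", "and", "is"),
  lexicon = "toy",
  stringsAsFactors = FALSE
)

# Hypothetical extra words that should never be offered as predictions.
custom_words <- data.frame(
  word = c("badword1", "badword2"),
  lexicon = "custom",
  stringsAsFactors = FALSE
)

# Concatenate the two lists into one exclusion table.
all_stop_words <- rbind(stop_words_toy, custom_words)

# Drop any candidate prediction that appears in the combined list.
candidates <- c("the", "weather", "badword1", "today")
filtered <- candidates[!candidates %in% all_stop_words$word]
filtered
```

Keeping a `lexicon` label on the added rows makes it easy to separate the custom exclusions from the standard stop words later.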
To aid the prediction capabilities of a model we can use two statistical concepts from natural language processing that quantify how important a word is to a document. Term frequency (tf) measures how frequently a word shows up in a document. Inverse document frequency (idf) decreases the weight of commonly used words and increases the weight of words that are rarely used within a collection of documents. Multiplying the two measures gives the tf-idf: the frequency of a term adjusted for how rarely it is used. The idf is the natural log of the total number of documents divided by the number of documents that contain the term. For a word that appears in every document this ratio is 1, and the natural log of 1 is zero. Therefore, when looking at the tf-idf values of words in a corpus, values at or extremely close to zero belong to words that occur in all or nearly all of the documents, which are typically the most frequently occurring words.
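The calculation can be worked through by hand on a toy example. The sketch below uses base R and made-up counts for three documents; it assumes tf is a term's share of the words in a document and idf is the natural log of (number of documents / number of documents containing the term).

```r
# Toy term counts: rows are terms, columns are documents (made-up numbers).
counts <- matrix(
  c(50, 40, 60,   # "the" appears in all three documents
     0,  8,  0,   # "governor" appears only in news
     5,  0,  0),  # "recipe" appears only in blogs
  nrow = 3, byrow = TRUE,
  dimnames = list(c("the", "governor", "recipe"),
                  c("blogs", "news", "twitter"))
)

# Term frequency: each count divided by the total words in its document.
tf <- sweep(counts, 2, colSums(counts), "/")

# Inverse document frequency: ln(number of documents / documents containing the term).
idf <- log(ncol(counts) / rowSums(counts > 0))

# tf-idf: each row of tf is scaled by that term's idf.
tf_idf <- tf * idf
round(tf_idf, 3)
```

Because "the" occurs in every document, its idf is ln(3/3) = 0 and its tf-idf is zero everywhere, no matter how often it appears.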
Using a tf-idf measurement should aid in predicting what type of document is being written. Further on in this early analysis, definite distinctions appear between the blog, news, and twitter data sets. The tf-idf should help determine what type of text is being written and, hopefully, improve the accuracy of the algorithm's predictions.
Plot 2: the largest tf-idf words in each of the data sets
In Plot 2 there are a few things to note: 1) The blogs data set contains non-American spellings of words, which should be addressed with language detection. 2) The news data set includes the name Kasich, the proper nouns Ohio and Newark, and the word spokeswoman. These terms are highly related to the 2016 U.S. Presidential election and lead us to believe that the data set is limited to one specific period of time and is not representative of “news”. 3) In the Twitter data set there are a few issues: the use of thx instead of thanks, the expletive, and the “N” word. Maybe Twitter should be treated separately from other text databases? The 140-character limit that applied to tweets at the time would have prompted the spelling thx rather than ‘thanks’, but if thx is now endemic to the Twitter-verse, can we safely replace it with the complete spelling? The expletive ‘fuck’ can be discussed further when using sentiment analysis. As for the “N” word, it can be used as a term of endearment in one subset of the population but as a marker of hate in another subset of language users. All indications are that Twitter speak is distinct from other text and should have its own prediction algorithm.
Three language detection packages were tested on the blogs sample data: the cld2 and cld3 packages from Google, and the textcat package. Using only 20% of the sample data, the text was unnested into words, stop_words were removed, and numerical values were omitted. The three language detection packages were then run on this tokenized data.
The tokenized 20% sample data is 1.712e+07 bytes (about 17 MB) in size.
| | user.self | sys.self | elapsed |
|---|---|---|---|
| cld2_time | 0.11 | 0.01 | 0.16 |
| cld3_time | 2.12 | 0.00 | 2.14 |
| textcat_time | 105.52 | 0.09 | 105.83 |
The results in Table 2 suggest that the time the textcat package needs to assign a value makes it impractical for use in a prediction model. The cld2 package would offer the greatest advantage in prediction speed. However, both the cld2 and cld3 packages assign a value of NA to any word they are not sure of. In Table 3 we see that the cld2 package assigns NA to a much higher proportion of words than the cld3 package does.
| package | proportion | total |
|---|---|---|
| cld2 | 0.865 | 76293 |
| cld3 | 0.397 | 76293 |
Even though cld2 is the fastest of the three language detection packages used on this data, cld3 has a higher success rate of assigning a language to a word. Therefore, to increase the success rate of language detection, the cld3 package will be used at the cost of some speed.
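The NA proportion reported in Table 3 is the share of words for which a detector returned no language. A minimal base R sketch on made-up detector output (the `detected` data frame and its values are hypothetical, not results from cld2 or cld3):

```r
# Hypothetical detector output for five words; NA means no language was assigned.
detected <- data.frame(
  cld2 = c("en", NA, NA, NA, "en"),
  cld3 = c("en", "en", NA, "fr", "en"),
  stringsAsFactors = FALSE
)

# Proportion of NA values per detector: the mean of the logical NA indicator.
na_rate <- colMeans(is.na(detected))
na_rate
```

A lower NA rate only shows that a label was assigned more often; it does not show that the assigned label was correct.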
Sentiment analysis allows us to quantify the emotional intent of words. If an algorithm can detect if a complete statement is going to have a negative versus a positive sentiment then it could shrink the size of a predictive word library. At this time in the tidytext package there are several sentiment lexicons that assign unigrams (single words) a score for positive/negative sentiment.
It is important to note that some, if not all, of the lexicon packages are licensed. Users of these packages must agree to a license before downloading the data. In this first step of building a predictive algorithm for the Data Science Specialization through Johns Hopkins and Coursera, I have agreed to the licenses for the AFINN, bing, and nrc lexicons.
While this predictive algorithm is still very much in its starting phase, the analysis covered so far and the steps that follow reinforce the idea that the style and content of a piece of writing are tied to the type of document being written. It would be surprising to find expletives in a statistical analysis paper on viral suppression drugs, but they are very likely to show up in a Tweet or in the personal writing of a blog.
Plot 3: Sentiment analysis of the sample data using the bing, nrc, and AFINN packages.
Plot 3 shows how sentiment analysis can be used to aid word prediction. The nrc and bing packages use a binary format that assigns words to positive or negative, while the AFINN package assigns integer values between -5 and 5, inclusive. On its own, word-level sentiment analysis is not very informative. However, when combined with n-gram analysis it could be very useful for prediction.
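The difference between the binary and the graded scoring schemes can be illustrated with two tiny made-up lexicons. The words and scores below are placeholders chosen for the example, not entries copied from bing, nrc, or AFINN.

```r
tokens <- c("happy", "terrible", "good", "day", "awful")

# Hypothetical binary lexicon (bing/nrc style): each word is positive or negative.
binary_lex <- c(happy = "positive", good = "positive",
                terrible = "negative", awful = "negative")

# Hypothetical graded lexicon (AFINN style): integer scores from -5 to 5.
graded_lex <- c(happy = 3, good = 2, terrible = -3, awful = -3)

# Look up each token; words missing from a lexicon (here "day") yield NA.
labels <- binary_lex[tokens]

# Binary score: count of positive words minus count of negative words.
binary_score <- sum(labels == "positive", na.rm = TRUE) -
  sum(labels == "negative", na.rm = TRUE)

# Graded score: sum of the individual word scores.
graded_score <- sum(graded_lex[tokens], na.rm = TRUE)

c(binary = binary_score, graded = graded_score)
```

With these placeholder values the binary score is neutral while the graded score is negative, which shows how the two schemes can disagree on the same text.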
An n-gram is used to look at the co-occurrence of words, that is, to determine which words tend to immediately follow others. The n in n-gram is a positive integer. If n = 2 (a bi-gram) we can isolate one word and find the most frequently occurring next word. For example, the word not could be followed by words such as like, allow, or sorry. A tri-gram would be three words, a quad-gram four words, etc.
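As a minimal sketch of the bi-gram idea, the base R snippet below pairs each word in a made-up sentence with its successor and then tabulates what follows the word not. The sentence and object names are illustrative only; the report's own code uses unnest_tokens with token = "ngrams".

```r
text <- "i do not like rain and i do not like snow but i do not mind sun"
words <- strsplit(text, " ", fixed = TRUE)[[1]]

# Pair each word with the word that immediately follows it (bi-grams).
bigrams <- data.frame(
  word1 = head(words, -1),
  word2 = tail(words, -1),
  stringsAsFactors = FALSE
)

# Count the words that follow "not", most frequent first.
next_after_not <- sort(table(bigrams$word2[bigrams$word1 == "not"]),
                       decreasing = TRUE)
next_after_not

# The top candidate for a next-word prediction after "not".
names(next_after_not)[1]
```

The same pairing extends to tri-grams by adding a third column offset by two positions.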
Plot 4: most frequent bigrams.
There is not much surprising information in the most frequent bi-gram column plots. In conjunction with sentiment analysis, having the word ink follow distress is unusual. The news data set is most definitely related to the run-up to the 2016 U.S. Presidential election. In the Twitter data, ‘rt’ needs to be examined to determine whether it should be added to the list of stop words.
Plot 5: Most frequent tri-grams.
The blogs tri-gram plot shows that the word amazon appears somewhat frequently, but its relative importance comes from the website address of the marketplace rather than from ordinary usage. The word amazon will need to be added to the list of stop words.
In the Twitter tri-gram plot we see that the word happy tends to appear together with specific dates in the calendar year. When combined with sentiment analysis, maybe happy needs to be part of a date/time variable, along with words like “merry”.
Plot 6: Most frequent bi-grams, with Amazon removed
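In Plot 6 the arrows show the direction of the association between words, from the first word of a bi-gram to the second, and the opacity of each arrow shows the strength of the association: more frequent bi-grams are drawn darker.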
# Setup
knitr::opts_chunk$set(cache = TRUE)
knitr::opts_chunk$set(warning = FALSE)
# Libraries
library(dplyr)
library(readr)
library(tidyverse)
library(tidytext)
library(broom)
library(stringr)
library(stringi)
library(cld2)
library(cld3)
library(textcat)
library(textdata)
library(wordcloud)
library(forcats)
library(igraph)
library(ggraph)
data(stop_words)
# Read_in the files
twitter_txt <- read.table("./en_US.twitter.txt",
colClasses = "character",
sep = "\n",
fill = TRUE,
comment.char = "",
quote = "'\"",
encoding = "UTF-8",
skipNul = TRUE,
numerals = "no.loss")
news_txt <- read.table("./en_US.news.txt",
colClasses = "character",
sep = "\n",
comment.char = "",
quote = "'\"",
fill = TRUE,
encoding = "UTF-8",
skipNul = TRUE,
numerals = "no.loss",
allowEscapes = TRUE)
blogs_txt <- read.table("./en_US.blogs.txt",
colClasses = "character",
sep = "\n",
comment.char = "",
quote = "'\"",
fill = TRUE,
encoding = "UTF-8",
skipNul = TRUE,
numerals = "no.loss",
allowEscapes = TRUE)
# Find important features of the files
file_name <- c("en_US.news.txt","en_US.blogs.txt","en_US.twitter.txt")
file_size <- c("200.989 MB", "205.235MB","163.189MB")
num_lines <- c("338611","298518","791434")
news_char <- sum(str_length(news_txt[,1]))
blog_char <- sum(str_length(blogs_txt[,1]))
twit_char <- sum(str_length(twitter_txt[,1]))
num_char <- c(news_char,blog_char,twit_char)
news_word <- news_txt %>%
unnest_tokens(word,V1) %>%
count(word)
news_word_count <- sum(news_word$n)
blogs_word <- blogs_txt %>%
unnest_tokens(word,V1) %>%
count(word)
blogs_word_count <- sum(blogs_word$n)
twitter_word <- twitter_txt %>%
unnest_tokens(word,V1) %>%
count(word)
twit_word_count <- sum(twitter_word$n)
word_count <- c(news_word_count,blogs_word_count,twit_word_count)
feature_table <- cbind(file_name,file_size,num_lines,num_char,word_count)
feature_table <- data.frame(feature_table)
feature_table <- feature_table %>%
rename(`File Name`= file_name,
`File Size` = file_size,
`Total Lines` = num_lines,
`Total Characters` = num_char,
`Total Words` = word_count)
knitr::kable(feature_table, caption = "Table 1: Features of the Text Data Sets")
set.seed(42)
#Twitter
twitter_sample <- twitter_txt %>% sample_frac(0.05)
#News
news_sample <- news_txt %>% sample_frac(0.05)
#Blogs
blogs_sample <- blogs_txt %>% sample_frac(0.05)
x <- bind_rows("blogs" = blogs_sample,
"news" = news_sample,
"twitter" = twitter_sample,
.id = "source")
# Word Count
# Count whitespace-separated words per line, naming the column directly
word_count <- data.frame(word_count = str_count(x[[2]], "\\S+"))
df <- data.frame(cbind(word_count,x))
df <- df %>% filter(word_count<200) # Filter out lines with too many words
working_df <- df %>% select(-word_count)
word_freq <- working_df %>%
group_by(source) %>%
unnest_tokens(word,V1)
# Unfiltered Word Cloud
word_cloud <- word_freq %>%
filter(!str_detect(word,"\\d")) %>%
group_by(source) %>%
count(word) %>%
ungroup()
# Filtered Word Cloud
word_cloud_filtered <- working_df %>%
group_by(source) %>%
unnest_tokens(word,V1) %>%
anti_join(stop_words) %>%
filter(!str_detect(word,"\\d")) %>%
count(word)
# Unfiltered Word Cloud Graph
word_cloud %>%
with(wordcloud(word,n,max.words = 100))
# Filtered Word Cloud Graph
word_cloud_filtered %>% with(wordcloud(word,n,max.words = 80))
source_words <- working_df %>%
unnest_tokens(word,V1) %>%
anti_join(stop_words) %>%
filter(!str_detect(word,"\\d")) %>%
count(source,word,sort=TRUE)
total_words <- source_words %>%
group_by(source) %>%
summarise(total=sum(n))
source_words <- left_join(source_words,total_words)
source_tf_idf <- source_words %>%
bind_tf_idf(word,source,n)
source_tf_idf %>%
group_by(source) %>%
slice_max(tf_idf, n=10) %>%
ungroup() %>%
ggplot(aes(tf_idf,fct_reorder(word,tf_idf),fill=source))+
geom_col(show.legend = FALSE) +
facet_wrap(~source,ncol = 1,scales = "free")+
labs(x="tf-idf",y=NULL)
# speed of each language package
df_sample <- working_df %>%
sample_frac(0.2) %>%
unnest_tokens(word,V1) %>%
anti_join(stop_words) %>%
filter(!str_detect(word,"\\d")) %>%
count(word, sort=TRUE)
textcat_time <- system.time(textcat(df_sample$word))
cld2_time <- system.time(cld2::detect_language(df_sample$word))
cld3_time <- system.time(cld3::detect_language(df_sample$word))
time_table <- rbind(cld2_time,cld3_time,textcat_time)
time_table <- data.frame(time_table)
time_table <- time_table %>% select(-c(user.child,sys.child))
df_size <- formatC(object.size(working_df),format="e",digits=3)
knitr::kable(time_table, caption = "Table 2: Time (in seconds) that each package needed to assign a language prediction value.")
# Accuracy rates
en_df <- working_df %>%
select(V1) %>%
unnest_tokens(word,V1) %>%
anti_join(stop_words) %>%
filter(!str_detect(word,"\\d")) %>%
distinct()
en_detect <- en_df %>%
mutate(cld2 = cld2::detect_language(text = word),
cld3 = cld3::detect_language(text = word))
x2 <- en_detect %>% select(-word)
en_inacc_prop <- data.frame(colMeans(is.na(x2))) %>%
rename(proportion = colMeans.is.na.x2..)
en_inacc_prop$proportion <- round(en_inacc_prop$proportion,3)
en_inacc_prop <- data.frame(cbind(en_inacc_prop,total = nrow(en_detect)))
en_inacc_prop <- en_inacc_prop %>% rownames_to_column()
en_inacc_prop <- en_inacc_prop %>%rename(package = rowname)
knitr::kable(en_inacc_prop, caption="Table 3: NA assignment rates")
sentiment_df <- cbind(line=1:nrow(working_df),working_df)
filter_sentiment_df <- sentiment_df %>%
group_by(source) %>%
group_by(line) %>%
unnest_tokens(word,V1) %>%
anti_join(stop_words) %>%
filter(!str_detect(word,"\\d")) %>%
count(word)
bing_nrc <- bind_rows(
filter_sentiment_df %>%
inner_join(get_sentiments("bing")) %>%
mutate(method="Bing et al."),
filter_sentiment_df %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive","negative"))
) %>%
mutate(method="NRC")) %>%
count(method,index=line %/% 20,sentiment) %>%
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment=positive-negative)
afinn <- filter_sentiment_df %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index=line %/% 20)%>%
summarise(sentiment=sum(value)) %>%
mutate(method="AFINN")
sentiment_plot <- bind_rows(afinn,bing_nrc) %>%
ggplot(aes(index,sentiment,fill=method))+
geom_col(show.legend = FALSE)+
facet_wrap(~method,ncol=1,scales="free_y")
sentiment_plot
## bigram
bi_gram <- working_df %>%
group_by(source) %>%
unnest_tokens(bigram,V1, token="ngrams",n=2) %>%
ungroup()
bigram_separated <- bi_gram %>%
separate(bigram,c("word1","word2"),sep=" ")
bigram_filter <- bigram_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!str_detect(word1,"\\d")) %>%
filter(!word2 %in% stop_words$word) %>%
filter(!str_detect(word2,"\\d"))
bigram_counts <- bigram_filter %>%
count(word1,word2,sort=TRUE)
bigram_united <- bigram_filter %>%
unite(bigram,word1,word2,sep=" ")
bigram_tf_idf <- bigram_united %>%
count(source,bigram) %>%
bind_tf_idf(bigram,source,n) %>%
arrange(desc(tf_idf))
## trigram
tri_gram <- working_df %>%
group_by(source) %>%
unnest_tokens(trigram,V1, token="ngrams",n=3) %>%
ungroup()
trigram_separated <- tri_gram %>%
separate(trigram,c("word1","word2","word3"),sep=" ")
trigram_filter <- trigram_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!str_detect(word1,"\\d")) %>%
filter(!word2 %in% stop_words$word) %>%
filter(!str_detect(word2,"\\d")) %>%
filter(!word3 %in% stop_words$word) %>%
filter(!str_detect(word3,"\\d"))
trigram_counts <- trigram_filter %>%
count(word1,word2,word3,sort=TRUE)
trigram_united <- trigram_filter %>%
unite(trigram,word1,word2,word3,sep=" ")
trigram_tf_idf <- trigram_united %>%
count(source,trigram) %>%
bind_tf_idf(trigram,source,n) %>%
arrange(desc(tf_idf))
filtered_df <- working_df %>%
unnest_tokens(bigram,V1,token="ngrams",n=2) %>%
separate(bigram,c("word1","word2"),sep=" ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!str_detect(word1,"amazon")) %>%
filter(!str_detect(word1,"\\d")) %>%
filter(!word2 %in% stop_words$word) %>%
filter(!str_detect(word2,"amazon")) %>%
filter(!str_detect(word2,"\\d")) %>%
count(word1,word2,sort=TRUE)
bigram_graph <- filtered_df %>%
filter(n>40) %>%
graph_from_data_frame()
set.seed(2021)
a <- grid::arrow(type="closed",length=unit(0.15,"inches"))
bigram_tf_idf %>%
group_by(source) %>%
slice_max(tf_idf, n=10) %>%
ungroup() %>%
ggplot(aes(tf_idf,fct_reorder(bigram,tf_idf),fill=source))+
geom_col(show.legend = FALSE) +
facet_wrap(~source,ncol = 1,scales = "free")+
labs(x="tf-idf of bigram",y=NULL)
trigram_tf_idf %>%
group_by(source) %>%
slice_max(tf_idf, n=4) %>%
ungroup() %>%
ggplot(aes(tf_idf,fct_reorder(trigram,tf_idf),fill=source))+
geom_col(show.legend = FALSE) +
facet_wrap(~source,ncol = 1,scales = "free")+
labs(x="tf-idf of trigram",y=NULL)
ggraph(bigram_graph,layout="fr")+
geom_edge_link(aes(edge_alpha=n),show.legend=FALSE,
arrow=a,end_cap=circle(0.07,"inches"))+
geom_node_point(color="lightpink",size=5)+
geom_node_text(aes(label=name),vjust=1,hjust=1)+
theme_void()