The first part of the process is loading the libraries needed for text processing and visualization.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)
library(tidytext)
library(wordcloud2)
library(stringr)
library(tidyr)
To make the pre-processing reproducible, a small set of helper functions is defined. The first function loads a text file and stores its lines in a tibble. The following two functions build word and bigram frequency counts with the tidytext package; stop words, numbers and punctuation are removed along the way. A short toy example after the function definitions illustrates the expected output.
load_file <- function(filename) {
  con <- file(filename, "r")
  lines <- readLines(con)
  close(con)
  # one row per line of the source file
  tibble(line = 1:length(lines), text = lines)
}
tokenize_clean <- function(dataframe) {
  dataframe %>%
    # split text into single word tokens
    unnest_tokens(word, text) %>%
    # drop stop words and tokens that start with a digit
    anti_join(stop_words) %>%
    filter(!str_detect(word, "^[0-9]")) %>%
    # lowercase and strip any remaining punctuation
    mutate(word = tolower(word)) %>%
    mutate(word = gsub("[^[:alnum:] ]", "", word)) %>%
    # count words and order the factor levels by frequency for plotting
    count(word, sort = TRUE) %>%
    mutate(word = reorder(word, n)) %>%
    rename(freq = n)
}
tokenize_clean_bigram <- function(dataframe) {
  dataframe %>%
    # split text into two-word (bigram) tokens
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    # strip punctuation, lowercase, then count bigrams
    mutate(bigram = gsub("[^[:alnum:] ]", "", bigram)) %>%
    mutate(bigram = tolower(bigram)) %>%
    count(bigram, sort = TRUE) %>%
    # drop bigrams containing stop words or tokens starting with a digit
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    filter(!word1 %in% stop_words$word) %>%
    filter(!word2 %in% stop_words$word) %>%
    filter(!str_detect(word1, "^[0-9]")) %>%
    filter(!str_detect(word2, "^[0-9]")) %>%
    unite(bigram, word1, word2, sep = " ") %>%
    # order the factor levels by frequency and match the column names
    # returned by tokenize_clean()
    mutate(bigram = reorder(bigram, n)) %>%
    rename(freq = n) %>%
    rename(word = bigram)
}
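As a quick illustration of the output format, the cleaned counts come back as a tibble with a word column (a factor ordered by frequency) and a freq column. The toy tibble below is hypothetical and not part of the corpus.
toy <- tibble(line = 1:2,
              text = c("Happy birthday to you!",
                       "We wish you a happy 2020"))
tokenize_clean(toy)
# -> a tibble with columns word and freq, most frequent word first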
In this part of the analysis the Twitter data is used to create basic charts and interactive word clouds, which are popular tools for text analysis. As described in the function definitions, both single words and two-word ngrams are examined.
In both cases a horizontal bar chart and a word cloud are created.
text_df <- load_file("data/final/en_US/en_US.twitter.txt")
all_words <- text_df %>% summarise(n()) %>% pull()  # number of tweets (lines) loaded
data(stop_words)
twiter <- tokenize_clean(text_df)
## Joining, by = "word"
clean_words <- twiter %>% summarise(sum(freq)) %>% pull()  # total number of cleaned word tokens
In this case 2360148 tweets were loaded, yielding a total of 11963085 cleaned word tokens.
twiter %>%
top_n(15) %>%
ggplot(aes(freq, word)) +
geom_col() +
labs(y = NULL)
## Selecting by freq
twiter %>%
top_n(300) %>%
wordcloud2()
## Selecting by freq
The horizontal bar chart shows that the most common word in the Twitter data is "love", followed by "day".
In this part two-word bigrams are created. Building bigrams on the full Twitter data would take too long, so a random sample of 1 000 000 tweets is used instead.
set.seed(42)
twiter_sample <- sample(1:nrow(text_df), 1000000)
sample_twiter_bigrams <- text_df[twiter_sample,]
twiter_bigrams <- tokenize_clean_bigram(sample_twiter_bigrams)
twiter_bigrams %>%
top_n(15) %>%
ggplot(aes(freq, word)) +
geom_col() +
labs(y = NULL)
## Selecting by freq
twiter_bigrams %>%
top_n(300) %>%
wordcloud2()
## Selecting by freq
The most common bigrams are "happy birthday" and "mothers day".
The final part of this step is removing the data frames and variables that are no longer needed.
rm(text_df, twiter, twiter_sample, sample_twiter_bigrams, twiter_bigrams, all_words, clean_words)
To show that the functions defined above are reusable, the same process is now applied to the news data, following the same steps as for the Twitter data.
news_df <- load_file("data/final/en_US/en_US.news.txt")
all_words <- news_df %>% summarise(n()) %>% pull()  # number of news items (lines) loaded
data(stop_words)
news <- tokenize_clean(news_df)
## Joining, by = "word"
clean_words <- news %>% summarise(sum(freq)) %>% pull()  # total number of cleaned word tokens
In this process 1010242 news items were loaded, yielding a total of 15429597 cleaned word tokens.
news %>%
top_n(15) %>%
ggplot(aes(freq, word)) +
geom_col() +
labs(y = NULL)
## Selecting by freq
news %>%
top_n(300) %>%
wordcloud2()
## Selecting by freq
The two most common words in the news data are "time" and "people", as can be seen in both the horizontal bar chart and the word cloud. As with the Twitter data, a random sample of 1 000 000 news items is used for the bigram analysis below.
set.seed(42)
news_sample <- sample(1:nrow(news_df), 1000000)
sample_news_bigrams <- news_df[news_sample,]
news_bigrams <- tokenize_clean_bigram(sample_news_bigrams)
news_bigrams %>%
top_n(15) %>%
ggplot(aes(freq, word)) +
geom_col() +
labs(y = NULL)
## Selecting by freq
news_bigrams %>%
top_n(300) %>%
wordcloud2()
## Selecting by freq
The most common bigrams in the news data are mostly names of US cities, along with "health care".
The final part of the analysis is removing the variables and data frames to release memory.
rm(news_df, news, news_sample, sample_news_bigrams, news_bigrams, all_words, clean_words)
In the final modeling step similar techniques will be used for sentiment analysis: every tweet or news item will be tokenized and lemmatized before it is scored.
Potentially a different kind of modeling technique will be used as well.
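As an illustration, the snippet below is only a minimal sketch of such a sentiment step, assuming the textstem package for lemmatization and the bing lexicon bundled with tidytext; text_df stands for a tibble freshly loaded with load_file().
library(textstem)

text_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  # reduce every token to its lemma before scoring
  mutate(word = lemmatize_words(word)) %>%
  # keep only words present in the bing sentiment lexicon
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, sort = TRUE)
The resulting counts of positive and negative words per corpus could then feed into the modeling step.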