The first part of the process is loading the libraries needed for text processing and visualization.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)
library(tidytext)
library(wordcloud2)
library(stringr)
library(tidyr)

To make the process reproducible, helper functions for pre-processing the data are defined first. The first function loads a text file and stores its lines in a tibble. The following two functions build frequency counts with the tidytext package; along the way, stop words, tokens starting with digits, and punctuation are removed.

load_file <- function(filename) {

  # Read all lines from the file and return them as a tibble,
  # one row per line, keeping the original line number.
  con <- file(filename, "r")
  lines <- readLines(con)
  close(con)

  tibble(line = seq_along(lines), text = lines)

}
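
A quick sanity check on a temporary file (a hypothetical example, not part of the corpus) shows the shape of the returned tibble:

tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
load_file(tmp)
# returns a 2-row tibble with columns line and text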

tokenize_clean <- function(dataframe) {

  dataframe %>%
    # split the text into single word tokens (lowercased by default)
    unnest_tokens(word, text) %>%
    # remove stop words using the lexicons bundled with tidytext
    anti_join(stop_words) %>%
    # drop tokens that start with a digit
    filter(!str_detect(word, "^[0-9]")) %>%
    # strip any remaining punctuation and symbols
    mutate(word = gsub("[^[:alnum:] ]", "", word)) %>%
    # count word frequencies and order the factor levels for plotting
    count(word, sort = TRUE) %>%
    mutate(word = reorder(word, n)) %>%
    rename(freq = n)

}
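
A small illustration of the expected output, using a hypothetical snippet (not from the corpus); the exact rows depend on the stop-word lexicons bundled with tidytext:

example_df <- tibble(line = 1:2,
                     text = c("Coffee in the morning!",
                              "Coffee and sunshine, coffee."))
tokenize_clean(example_df)
# expected: a tibble with columns word (factor) and freq,
# with "coffee" at the top with freq = 3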

tokenize_clean_bigram <- function(dataframe) {

  dataframe %>%
    # split the text into two-word (bigram) tokens, lowercased by default
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    # lines too short to form a bigram yield NA and are dropped
    filter(!is.na(bigram)) %>%
    # strip punctuation and symbols, then count bigram frequencies
    mutate(bigram = gsub("[^[:alnum:] ]", "", bigram)) %>%
    count(bigram, sort = TRUE) %>%
    # split the bigram apart to filter stop words and number tokens per word
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    filter(!word1 %in% stop_words$word) %>%
    filter(!word2 %in% stop_words$word) %>%
    filter(!str_detect(word1, "^[0-9]")) %>%
    filter(!str_detect(word2, "^[0-9]")) %>%
    # rejoin the words and prepare the columns for plotting
    unite(bigram, word1, word2, sep = " ") %>%
    mutate(bigram = reorder(bigram, n)) %>%
    rename(freq = n, word = bigram)

}
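
The same idea for bigrams, again with a hypothetical snippet:

example_df <- tibble(line = 1,
                     text = "happy birthday and happy birthday again")
tokenize_clean_bigram(example_df)
# expected: "happy birthday" with freq = 2; bigrams containing
# stop words such as "and" are filtered out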

Twitter data

In this part of the analysis we use the Twitter data to create basic charts and interactive word clouds, which are popular tools for text analysis. As described in the function definitions above, we use both single-word tokens and two-word n-grams (bigrams).

In both cases a horizontal bar chart and a word cloud are created.

text_df <- load_file("data/final/en_US/en_US.twitter.txt")

all_words <- text_df %>% summarise(n()) %>% pull()   # number of loaded tweets (rows)

data(stop_words)

twiter <- tokenize_clean(text_df)
## Joining, by = "word"
clean_words <- twiter %>% summarise(sum(freq)) %>% pull()   # total count of cleaned word tokens

In this case, 2,360,148 tweets were loaded, yielding a total of 11,963,085 cleaned word tokens.

Twitter bar plot and word cloud

twiter %>%
  top_n(15) %>%
  ggplot(aes(freq, word)) +
  geom_col() +
  labs(y = NULL)
## Selecting by freq

twiter %>%
  top_n(300) %>%
  wordcloud2()
## Selecting by freq

Based on the horizontal bar chart, the most common word in the Twitter data is "love", with "day" in second place.

Bigrams and word cloud

In this part, two-word bigrams were created. Because creating bigrams from the full Twitter data would take too long, a sampling step was used to select 1,000,000 tweets.

set.seed(42)
twiter_sample <- sample(1:nrow(text_df), 1000000)

sample_twiter_bigrams <- text_df[twiter_sample,]


twiter_bigrams <- tokenize_clean_bigram(sample_twiter_bigrams)

twiter_bigrams %>%
  top_n(15) %>%
  ggplot(aes(freq, word)) +
  geom_col() +
  labs(y = NULL)
## Selecting by freq

twiter_bigrams %>%
  top_n(300) %>%
  wordcloud2()
## Selecting by freq

The most common bigrams are "happy birthday" and "mothers day".

The final part of processing is cleaning up the data frames and variables.

rm(text_df, twiter, twiter_sample, sample_twiter_bigrams,
   twiter_bigrams, all_words, clean_words)
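
Note that rm() only removes the bindings; an optional garbage-collection call (an extra step, not part of the original workflow) can prompt R to release the freed memory sooner:

gc()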

News data processing

To show that the functions defined above are reusable, we also use them to process the news data. The process mirrors the one used for the Twitter data.

news_df <- load_file("data/final/en_US/en_US.news.txt")

all_words <- news_df %>% summarise(n()) %>% pull()   # number of loaded news items (rows)

data(stop_words)

news <- tokenize_clean(news_df)
## Joining, by = "word"
clean_words <- news %>% summarise(sum(freq)) %>% pull()   # total count of cleaned word tokens

In this process, 1,010,242 news items were loaded, yielding a total of 15,429,597 cleaned word tokens.

News bar plot and word cloud

news %>%
  top_n(15) %>%
  ggplot(aes(freq, word)) +
  geom_col() +
  labs(y = NULL)
## Selecting by freq

news %>%
  top_n(300) %>%
  wordcloud2()
## Selecting by freq

The two most common words in the news data are "time" and "people", as visible in both the horizontal bar chart and the word cloud.

Creating bigrams and a word cloud from the news data

set.seed(42)

news_sample <- sample(1:nrow(news_df), 1000000)

sample_news_bigrams <- news_df[news_sample,]

news_bigrams <- tokenize_clean_bigram(sample_news_bigrams)

news_bigrams %>%
  top_n(15) %>%
  ggplot(aes(freq, word)) +
  geom_col() +
  labs(y = NULL)
## Selecting by freq

news_bigrams %>%
  top_n(300) %>%
  wordcloud2()
## Selecting by freq

Most of the common bigrams in the news data are names of US cities, along with "health care".

The final part of the analysis is cleaning up variables and data frames to release memory.

rm(news_df, news, news_sample, sample_news_bigrams,
   news_bigrams, all_words, clean_words)

Further modeling considerations

In the final modeling process we will use techniques similar to those shown in this example to perform sentiment analysis. Every tweet or news item will be tokenized and lemmatized before being fed into the sentiment analysis.
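
As an illustration of how this could look with the packages already loaded here, the sketch below scores words against the Bing lexicon via tidytext's get_sentiments(); the lemmatization step is an assumption and relies on the separate textstem package, and the example input is hypothetical:

library(textstem)   # assumed extra dependency for lemmatization

sentiment_sketch <- function(dataframe) {
  dataframe %>%
    unnest_tokens(word, text) %>%
    mutate(word = lemmatize_words(word)) %>%             # e.g. "loved" -> "love"
    inner_join(get_sentiments("bing"), by = "word") %>%  # label words positive/negative
    count(sentiment)
}

# hypothetical input, not from the corpus:
sentiment_sketch(tibble(line = 1, text = "I loved the happy ending"))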

Other modeling techniques may also be considered.