This report details a possible strategy for developing a predictive text app in R/Shiny. It is written as the milestone report for the Capstone project of the Data Science Specialisation offered by Johns Hopkins University via Coursera.
We are presented with three text files, which contain extracts from English-language news, Twitter and blogs. The purpose of this exercise is to carry out some exploratory data analysis, visualise the data and show how it may be used to create a predictive algorithm. Please note that the algorithm itself is not part of the exercise. The report is requested to be written in language that is easy to understand for a person with no Data Science background, such as a manager.
library(tidytext)
library(wordcloud)
## Loading required package: RColorBrewer
library(tidyverse)
## -- Attaching packages ----------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts -------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(stringr)
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
##
## compose, simplify
## The following object is masked from 'package:tidyr':
##
## crossing
## The following object is masked from 'package:tibble':
##
## as_data_frame
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(ggraph)
setwd('D:/DS_Coursera_JH/Capstone')
A line count is expressly requested in the assignment, although I have to note that I do not see the point of it, as we are going to work with words later, not lines.
length(news <- readLines("en_US.news.txt", warn = FALSE))
## [1] 77259
length(twitter <- readLines("en_US.twitter.txt", warn = FALSE))
## [1] 2360148
length(blogs <- readLines("en_US.blogs.txt", warn = FALSE))
## [1] 899288
rm(news, twitter, blogs)
So, we have three large files of text with 77259, 2360148 and 899288 lines respectively.
As it ultimately makes no difference which file each word comes from, we are going to read all the words into one tidy table, which can then be used for analysis. At this stage we also get rid of very common 'stop words', such as ‘the’, ‘a’, ‘I’ etc., as well as any punctuation and numbers.
txt_files <- list.files(pattern = "\\.txt$")
df <- txt_files %>%
  map_chr(~ read_file(.)) %>%
  tibble(text = .) %>%
  drop_na() %>%
  mutate(filename = txt_files) %>%
  unnest_tokens(word, text)
df <- df %>% mutate(word = gsub(x = word, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = NA)) %>%
drop_na() %>%
anti_join(stop_words)
## Joining, by = "word"
We want to exclude words that are only encountered once. If a word is used more often, like for example ‘fiancée’, then why not leave it in? To that end we can count the least often used words and get rid of them. This also removes things like ‘zzzzzz’ and ‘yaaaaa’, as well as most misspelled words. We will lose some genuine English words like ‘zygodactylous’, but the likelihood of these being typed is very small, so I believe this is acceptable. We also lose made-up words, like ‘fearocracy’. We will call these ‘rare words’.
Below is a graph detailing the most often used words, split by file.
word_counts2 <- df %>%
  group_by(filename, word) %>%
  summarise(Total = n()) %>%
  arrange(desc(Total))
# words that appear only once across all three files combined
rare_words <- df %>%
  count(word, name = "Total") %>%
  filter(Total == 1)
df <- df %>% anti_join(rare_words, by = 'word')
word_counts2 %>%
  group_by(filename) %>%
  top_n(30, Total) %>%
  ggplot(aes(reorder_within(word, Total, filename), Total, fill = filename)) +
  theme(axis.text.x = element_text(angle = 90)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~filename, scales = "free") +
  scale_x_reordered() +
  scale_y_continuous(expand = c(0, 0))
Below is a visualisation of the most often used words across all three files combined.
cloud_count <- df %>%
  count(word, name = "Total", sort = TRUE)
wordcloud(cloud_count$word, cloud_count$Total, max.words = 120)
The idea of predictive text is that when a word is entered, the algorithm should suggest the word most likely to follow it. We can therefore look at the relationships between sequences of words, in our case pairs of adjacent words. These are called ‘bigrams’.
bigrams <- txt_files %>%
  map_chr(~ read_file(.)) %>%
  tibble(text = .) %>%
  drop_na() %>%
  mutate(filename = txt_files) %>%
  unnest_tokens(word, text, token = 'ngrams', n = 2) %>%
  count(word, sort = TRUE) %>%
  separate(word, into = c('first_word', 'second_word'), sep = '\\s') %>%
  anti_join(stop_words, by = c(first_word = 'word')) %>%
  anti_join(stop_words, by = c(second_word = 'word')) %>%
  anti_join(rare_words, by = c(first_word = 'word')) %>%
  anti_join(rare_words, by = c(second_word = 'word')) %>%
  mutate(ngram = paste(first_word, second_word))
bigrams <- bigrams %>% mutate(ngram = gsub(x = ngram, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = NA)) %>%
drop_na()
wordcloud(bigrams$ngram, bigrams$n, max.words = 50)
## Warning in wordcloud(bigrams$ngram, bigrams$n, max.words = 50): san
## francisco could not be fit on page. It will not be plotted.
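The bigram counts above already contain everything a first, very naive predictor would need: given an input word, look up the bigrams that start with it and suggest the most frequent continuations. The algorithm itself is not part of this exercise, but a minimal sketch of the idea is shown below; suggest_next is a hypothetical helper written purely for illustration, not a commitment to how the final app will work.
# Hypothetical helper: return the words most often seen after 'input_word'
suggest_next <- function(input_word, bigram_table, n_suggestions = 3) {
  bigram_table %>%
    filter(first_word == tolower(input_word)) %>%
    arrange(desc(n)) %>%
    head(n_suggestions) %>%
    pull(second_word)
}
# Example: the three words most frequently following 'happy'
suggest_next("happy", bigrams)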
The best way to describe the following visualisation is to quote from “Text Mining with R” by Julia Silge and David Robinson:
“…this is a visualization of a Markov chain, a common model in text processing. In a Markov chain, each choice of word depends only on the previous word. In this case, a random generator following this model might spit out “dear”, then “sir”, then “william/walter/thomas/thomas’s”, by following each word to the most common words that follow it. To make the visualization interpretable, we chose to show only the most common word to word connections, but one could imagine an enormous graph representing all connections that occur in the text."
bigram_graph <- bigrams %>%
filter(n > 1700) %>%
graph_from_data_frame()
ggraph(bigram_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph::ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
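To tie this back to the predictive algorithm, the Markov-chain description quoted above can be made quantitative: the estimated probability of a word given the previous word is simply the bigram count divided by the total count of the first word. The sketch below, which assumes the bigrams table built earlier, is only illustrative; a usable predictor would also need higher-order n-grams and some form of smoothing for word pairs that never occur in the data.
# Estimated transition probabilities: P(second_word | first_word)
transition_probs <- bigrams %>%
  group_by(first_word) %>%
  mutate(prob = n / sum(n)) %>%
  ungroup() %>%
  arrange(first_word, desc(prob))
head(transition_probs)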