This document explains how to load, process and analyse text data files in order to study their word frequency. The data files contain tweets, news articles and blog entries, respectively.
Furthermore, a plan is presented for creating a word prediction algorithm and a Shiny app that implements it.
Throughout this document the following helper functions are used to keep the code cleaner and more maintainable.
# import libraries
library(tidytext)
library(dplyr)
library(ngram)
library(ggplot2)
# Read a file from a path
readFile <- function(filepath) {
con <- file(filepath, "r")
content <- readLines(con, encoding = "UTF-8")
close(con)
content
}
# Tokenize the text in the "name" column of a data frame into single words
tokenize <- function(df) {
  df %>%
    unnest_tokens(output = word, input = name,
                  token = "words", format = "text")
}
# Plot the 15 most frequent words of a sorted word-count data frame
plotWords <- function(df) {
g <- ggplot(df[1:15,], aes(x=reorder(word, -n), y=n, fill=word)) +
geom_bar(stat="identity") +
labs(x = "Most frequent words", y = "Frequency")
g
}
The data files can be downloaded from here. They contain text in four different languages, but this analysis uses only the English files.
The three files contain tweets, news articles and blog entries captured automatically by a web crawler.
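Downloading and unpacking the archive is not shown in this report; a minimal sketch, where dataUrl and the zip file name are placeholders rather than the actual link:
# Download and unpack the corpus (sketch; dataUrl is a placeholder, not the real link)
dataUrl <- "https://example.com/dataset.zip"
if (!dir.exists("final")) {
  download.file(dataUrl, destfile = "dataset.zip", mode = "wb")
  unzip("dataset.zip")
}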
The first step of the analysis is to read the files.
# Read files
twitter <- readFile(".\\final\\en_US\\en_US.twitter.txt")
blogs <- readFile(".\\final\\en_US\\en_US.blogs.txt")
news <- readFile(".\\final\\en_US\\en_US.news.txt")
A valuable piece of information about a text corpus is its size, because size has implications for the subsequent processing steps (memory and CPU usage, for instance).
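If memory or processing time becomes an issue, one option (a sketch only, not applied in the rest of this analysis; twitterSample is a placeholder name) is to inspect object sizes and to work on a random sample of lines:
# Check the in-memory size of a corpus and draw a 10% random sample of its lines (sketch)
format(object.size(twitter), units = "MB")
set.seed(123)
twitterSample <- sample(twitter, size = round(0.1 * length(twitter)))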
Let’s see how many rows and how many words each file contains.
# Count lines
numLinesTwitter <- length(twitter)
numLinesBlogs <- length(blogs)
numLinesNews <- length(news)
# Create single-column data frames; the column "name" holds the text lines
twitter <- data.frame(name = twitter, stringsAsFactors = FALSE)
news <- data.frame(name = news, stringsAsFactors = FALSE)
blogs <- data.frame(name = blogs, stringsAsFactors = FALSE)
# Count Words
twitterWordCount <- wordcount(twitter$name)
blogsWordCount <- wordcount(blogs$name)
newsWordCount <- wordcount(news$name)
| File | # rows | # words |
|---|---|---|
| en_US.twitter.txt | 2,360,148 | 30,373,543 |
| en_US.blogs.txt | 899,288 | 37,334,131 |
| en_US.news.txt | 77,259 | 2,643,969 |
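The same summary can also be assembled programmatically from the variables computed above and rendered with knitr::kable, for instance (a sketch, assuming the knitr package is available):
# Build and render the summary table from the computed counts (sketch)
summaryDf <- data.frame(
  File = c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt"),
  Rows = c(numLinesTwitter, numLinesBlogs, numLinesNews),
  Words = c(twitterWordCount, blogsWordCount, newsWordCount)
)
knitr::kable(summaryDf)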
Now that we know how many words each file contains, let's see how they are distributed and which words are the most frequent in each file.
# tokenize texts
twitterTokens <- tokenize(twitter)
blogsTokens <- tokenize(blogs)
newsTokens <- tokenize(news)
twitterFrequentTokens <- twitterTokens %>% count(word, sort = TRUE)
plotWords(twitterFrequentTokens)
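The same counting and plotting steps apply to the blogs and news tokens; a sketch (the *FrequentTokens names follow the pattern used above, and the plots are not shown here):
# Count and plot the most frequent words in the blogs and news files (sketch)
blogsFrequentTokens <- blogsTokens %>% count(word, sort = TRUE)
newsFrequentTokens <- newsTokens %>% count(word, sort = TRUE)
plotWords(blogsFrequentTokens)
plotWords(newsFrequentTokens)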
Unfortunately, histograms like the one above do not provide much information about the subject of the texts, because the most frequent words are generic words such as articles, conjunctions and pronouns (commonly called stop words).
Let's remove the stop words to get more information about the subjects of the texts.
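The code for this step is not shown in this document; a minimal sketch using the stop_words lexicon shipped with tidytext and dplyr::anti_join (the *NoStop variable names are placeholders):
# Remove stop words from each token set and recount word frequencies (sketch)
data("stop_words")
twitterNoStop <- twitterTokens %>% anti_join(stop_words) %>% count(word, sort = TRUE)
blogsNoStop <- blogsTokens %>% anti_join(stop_words) %>% count(word, sort = TRUE)
newsNoStop <- newsTokens %>% anti_join(stop_words) %>% count(word, sort = TRUE)
plotWords(twitterNoStop)
plotWords(blogsNoStop)
plotWords(newsNoStop)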
The next part of the project is the word prediction algorithm. The first step is to learn more about Natural Language Processing; I plan to take the Introduction to Natural Language Processing in R DataCamp course.
Furthermore, I will read a few book chapters and articles, such as:
– Dan Jurafsky and James H. Martin, Speech and Language Processing
– https://www.r-bloggers.com/natural-language-processing-tutorial/
– https://en.wikipedia.org/wiki/N-gram
– https://towardsdatascience.com/introduction-to-language-models-n-gram-e323081503d9
As far as the word prediction algorithm is concerned, my plan is to:
– decide how to deal with profanity (a simple approach would be to remove profane words altogether)
– decide whether or not to remove stop words (removing them might lead to unwanted predictions)
– investigate whether a stemming algorithm would be useful; stemming strips word endings and returns the word root (see the first sketch after this list)
– create an n-gram model using a technique such as the Katz back-off model for next-word prediction (see the second sketch after this list)
– develop a strategy for dealing with out-of-vocabulary words
– apply the model to the benchmark
– refine the model to improve its prediction accuracy
– implement the model in R
– implement the model in R
– create a Shiny app that suggests new words as the user types
– publish the app so that other users can review and test it
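To make the stemming and n-gram items above more concrete, here are two small sketches based on the data frames built earlier. They are not part of the analysis above: the SnowballC package and the variable names stemCounts and bigramCounts are assumptions, and the bigram table is only the basic ingredient of a back-off model, not the model itself.
# Sketch 1: stem the Twitter tokens with SnowballC and count the stems
library(SnowballC)
stemCounts <- twitterTokens %>%
  mutate(stem = wordStem(word, language = "english")) %>%
  count(stem, sort = TRUE)
# Sketch 2: bigram frequency table, a building block for a Katz back-off model
bigramCounts <- twitter %>%
  unnest_tokens(output = bigram, input = name, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)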