This is a comprehensive project covering the major areas of the Data Science Specialization offered by Johns Hopkins University. It involves working with natural language processing (NLP) and building a predictive text model that can be deployed as a useful data application. This is not an easy task, and each step will be outlined clearly as we proceed.

library(tidyverse)
-- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.3.1 --

v ggplot2 3.3.5     v purrr   0.3.4
v tibble  3.1.8     v dplyr   1.0.7
v tidyr   1.2.0     v stringr 1.4.0
v readr   2.1.3     v forcats 0.5.1

-- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

Our dataset

# Reading in the Twitter corpus.
us_twitter = readLines('./data/Coursera-SwiftKey/final/en_US/en_US.twitter.txt')

# Reading in the blogs corpus.
us_blogs = readLines('./data/Coursera-SwiftKey/final/en_US/en_US.blogs.txt')

# Reading in the news corpus.
us_news = readLines('./data/Coursera-SwiftKey/final/en_US/en_US.news.txt')
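A note in passing: depending on the platform, readLines() can warn about embedded nul characters or an incomplete final line for en_US.news.txt. If that happens, one workaround (a sketch, not part of the original run) is to read the file through a binary connection and skip the nuls:

# Alternative read for the news file, only needed if readLines() warns above.
con = file('./data/Coursera-SwiftKey/final/en_US/en_US.news.txt', open = 'rb')
us_news = readLines(con, encoding = 'UTF-8', skipNul = TRUE)
close(con)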
# Making our data a dataframe
us_twitter_data =
data.frame(
    text = us_twitter
) %>%

# Adding a character-count variable
mutate(no_characters = nchar(text))
list.files('./data/Coursera-SwiftKey/final/en_US')
  1. ‘en_US.blogs.txt’
  2. ‘en_US.news.txt’
  3. ‘en_US.twitter.txt’
  4. ‘find’
  5. ‘pairs_freq.csv’
  6. ‘triple_freq.csv’
  7. ‘word_freq.csv’
lens = c(length(us_twitter), length(us_blogs), length(us_news))
barplot(lens, names.arg = c('us_twitter', 'us_blogs', 'us_news'), col = 'orange', main = 'Lines by Corpus', 
        ylab = 'Number of Lines', xlab = 'Corpus')

[Figure: bar plot of line counts for the Twitter, blogs, and news corpora]

Exploratory Data Analysis

str(us_twitter_data)
'data.frame':   2360148 obs. of  2 variables:
 $ text         : chr  "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long." "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason." "they've decided its more fun if I don't." "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)" ...
 $ no_characters: int  109 111 40 84 63 77 101 51 54 30 ...
summary(us_twitter_data)
     text           no_characters  
 Length:2360148     Min.   :  2.0  
 Class :character   1st Qu.: 37.0  
 Mode  :character   Median : 64.0  
                    Mean   : 68.8  
                    3rd Qu.:100.0  
                    Max.   :213.0  

Regular Expressions

Word Distribution

Our primary tasks are deceptively simple:

- How often do certain words occur?
- How often do certain pairs of words occur together?

This will require the use of regular expressions because we do not know exactly what words we are looking for.

Let us look at the distributions of the words by their length.

# For word lengths 1 to 10, count how many of the first 1,000 tweets
# contain at least one lowercase word of that exact length.
counts = numeric(0)
for (number in 1:10) {
    regex = paste('\\s[a-z]{', number, '}\\s', sep = "")
    count = str_detect(us_twitter_data$text[1:1000], pattern = regex) %>% sum
    counts = c(counts, count)
}
counts
  1. 253
  2. 740
  3. 703
  4. 715
  5. 542
  6. 351
  7. 324
  8. 202
  9. 126
  10. 89
barplot(counts, names.arg = 1:length(counts), col = 'steelblue', xlab = "Word Length", ylab = "Frequency")

[Figure: bar plot of counts by word length]

On its own this is only moderately useful: it shows the relative distribution of word lengths, a pattern that would likely hold for other English text datasets and not just ours. Note also that str_detect() flags tweets containing at least one word of a given length rather than counting the words themselves, so the categories overlap and the counts sum to more than the 1,000 tweets sampled. The overall shape, however, holds throughout the whole dataset.
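For comparison, here is a sketch that counts every matching word rather than the number of tweets containing one, using stringr::str_count() in place of str_detect(). It inherits the same simple regex, so words at the very start or end of a tweet (and some adjacent short words) can still be missed; it is an illustration, not part of the original analysis.

# Count all matches of each word length in the first 1,000 tweets.
counts_all = sapply(1:10, function (number) {
    regex = paste('\\s[a-z]{', number, '}\\s', sep = "")
    str_count(us_twitter_data$text[1:1000], pattern = regex) %>% sum
})
counts_all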

Most Frequent Words

# A function that extracts all matches of a regular expression and returns the ten most frequent.
uniquify = function (string, regex) {
    # Extract words using regular expression
    words = str_extract_all(string, regex)

    # Convert to lowercase
    words = tolower(unlist(words))

    # Count frequency of each word
    word_freq = table(words)

    # Sort by frequency
    sorted_freq = sort(word_freq, decreasing = TRUE)

    # Return the 10 most frequent matches
    head(sorted_freq, 10)
}
if (!file.exists('./data/Coursera-SwiftKey/final/en_US/word_freq.csv')) {

    # Count single-word frequencies and keep the ten most common.
    word_freq = as.data.frame(
                    uniquify(us_twitter_data$text, "\\w+")
                    )

    # Writing this to file.
    write.csv(word_freq, './data/Coursera-SwiftKey/final/en_US/word_freq.csv')
} else {

    # Reading the cached frequencies so word_freq also exists on re-runs.
    word_freq = read.csv('./data/Coursera-SwiftKey/final/en_US/word_freq.csv')
}
start = 1
end = 10
barplot(word_freq$Freq[start:end], names.arg = word_freq$words[start:end], col = 'purple', 
        xlab = 'Word', ylab = 'Word Frequency')

[Figure: bar plot of the ten most frequent words]

A lot of the most common words are not at all surprising; they are the usual ones used every day:

- articles: the, a, and an
- filler words: like
- other commonly used words such as when and today

We will have to take these high-frequency words into account when building the prediction model.
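One way to look past them during exploration is to filter a small stop-word list out of the frequency table. This is a sketch, not part of the original analysis; the stop_words vector below is illustrative, and word_freq is assumed to have the words/Freq columns shown above.

# Drop a handful of illustrative stop words from the frequency table.
stop_words = c('the', 'a', 'an', 'to', 'and', 'i', 'you', 'of', 'in', 'for')
word_freq %>%
    mutate(words = as.character(words)) %>%
    filter(!(words %in% stop_words)) %>%
    arrange(desc(Freq)) %>%
    head(10)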

Frequently Occurring Word Pairs

regex = "\\b\\w+\\s\\w+\\b"
str_view_all(us_twitter_data$text[1:10], regex, match = TRUE)
[HTML widget: highlighted matches of the word-pair pattern in the first ten tweets]
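To see the matched pairs as plain text instead of an HTML widget, str_extract_all() can pull them out directly; a quick sketch on the first few tweets:

# Extract the raw word-pair matches from the first three tweets.
str_extract_all(us_twitter_data$text[1:3], regex)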
if (!file.exists('./data/Coursera-SwiftKey/final/en_US/pairs_freq.csv')) {

    # Count the most frequent word pairs (bigrams).
    pairs_freq =
    as.data.frame(
        uniquify(us_twitter_data$text, "\\b\\w+\\s\\w+\\b")
    )

    # Writing this to file.
    write.csv(pairs_freq, './data/Coursera-SwiftKey/final/en_US/pairs_freq.csv')
} else {

    # Reading the cached frequencies so pairs_freq also exists on re-runs.
    pairs_freq = read.csv('./data/Coursera-SwiftKey/final/en_US/pairs_freq.csv')
}
head(pairs_freq)
barplot(pairs_freq$Freq, names.arg = pairs_freq$words, col = "lightblue", las = 2, ylab = "Word Frequency")

[Figure: bar plot of the most frequent word pairs]

Frequently Occurring Word Groups

Now that we have seen which words appear most often and alongside which other words, we can generalize our pattern matching to find longer groups of words that occur together.
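One way to generalize is to build the regular expression from the desired group size, so the same uniquify() call works for any n. The ngram_regex() helper below is introduced here as an illustration; it is not part of the original code.

# Build a regex matching n whitespace-separated words.
ngram_regex = function (n) {
    paste0('\\b\\w+', strrep('\\s\\w+', n - 1), '\\b')
}

ngram_regex(3)   # produces the same pattern used for the triples below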

if (!file.exists('./data/Coursera-SwiftKey/final/en_US/triple_freq.csv')) {

    # Count the most frequent word triples (trigrams).
    triple_freq =
    as.data.frame(
        uniquify(us_twitter_data$text, "\\b\\w+\\s\\w+\\s\\w+\\b")
    )

    # Writing this to file.
    write.csv(triple_freq, './data/Coursera-SwiftKey/final/en_US/triple_freq.csv')
} else {

    # Reading the cached frequencies so triple_freq also exists on re-runs.
    triple_freq = read.csv('./data/Coursera-SwiftKey/final/en_US/triple_freq.csv')
}
barplot(triple_freq$Freq, names.arg = triple_freq$words, col = "pink", las = 2, ylab = "Word Frequency")

[Figure: bar plot of the most frequent word triples]

head(triple_freq)
A data.frame: 6 × 2
words Freq
<fct> <int>
1 thanks for the 22284
2 thank you for 7725
3 t wait to 7454
4 looking forward to 6621
5 i love you 6344
6 i want to 5138

That concludes our exploratory data analysis.

n-gram Model

Our tasks are to:

- Build an n-gram model based on Markov chains (a first sketch follows below)
- Predict the next word based on the previous one, two, or three words
- Figure out a way to evaluate model performance
- Make the model perform in a reasonable amount of time
- Deal with edge cases such as uncommon words
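As a starting point, here is a minimal sketch of the Markov-chain idea using the bigram table computed above: given a word, look up the most frequent pair that starts with it and return the second word. It assumes pairs_freq has the words/Freq columns shown earlier (and, in this exploration, only the top ten pairs); the real model will need complete n-gram tables, backoff to shorter contexts, and escaping of special characters in the input.

# Predict the next word from a single previous word using the bigram counts.
predict_next = function (word, pairs) {
    matches = pairs %>%
        mutate(words = as.character(words)) %>%
        filter(str_detect(words, paste0('^', word, '\\s'))) %>%
        arrange(desc(Freq))

    if (nrow(matches) == 0) return(NA_character_)

    # The most frequent pair wins; its second word is the prediction.
    str_split(matches$words[1], '\\s')[[1]][2]
}

# Example call (illustrative):
# predict_next('thanks', pairs_freq)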