Getting the Data

Having already downloaded the zip file from the course website, I unzip it and load in the data.

if(!file.exists("./Coursera-SwiftKey")){
  unzip("C:/Users/james/Documents/R/JHU DSS Data Science Capstone DATA/Coursera-SwiftKey.zip", exdir = "./Coursera-SwiftKey")}
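
For reproducibility, the download step could also be scripted. The following is a minimal sketch, assuming the dataset is still available at the Coursera capstone URL (the local paths above would need to be adjusted to match):

zipURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if(!file.exists("./Coursera-SwiftKey.zip")){
  # assumed URL; download in binary mode so the zip is not corrupted on Windows
  download.file(zipURL, "./Coursera-SwiftKey.zip", mode = "wb")}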

Here I display the sizes of the three files in megabytes (MB).

blogsize <- file.size("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt")
twittersize <- file.size("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt")
newssize <- file.size("./Coursera-SwiftKey/final/en_US/en_US.news.txt")
data.frame(Data = c("Blogs", "Twitter", "News"), 
           File_Size = c(blogsize/(1024^2), twittersize/(1024^2), newssize/(1024^2)))
##      Data File_Size
## 1   Blogs  200.4242
## 2 Twitter  159.3641
## 3    News  196.2775

Next I report the number of lines in each dataset, along with the length (in characters) of the longest line in each.

twitter <- readLines("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt")
blogs <- readLines("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt")
# note: en_US.news.txt contains embedded control characters that can cause
# readLines() to stop early on some systems, which explains the comparatively
# low news line count reported below
news <- readLines("./Coursera-SwiftKey/final/en_US/en_US.news.txt")
data.frame(Data = c("Blogs", "Twitter", "News"),
           Number_of_lines_of_text = c(length(blogs), length(twitter), length(news)),
           Longest_line_length = c(max(nchar(blogs)), max(nchar(twitter)), max(nchar(news))))
##      Data Number_of_lines_of_text Longest_line_length
## 1   Blogs                  899288               40835
## 2 Twitter                 2360148                 213
## 3    News                   77259                5760

To avoid slowing down the data analysis process, I will take a 5% sample of each dataset and work with that. I will then combine the samples into one dataset and write it to a new text file, so that later processing can start from this smaller file.

set.seed(210914)

blogs_sample <- sample(blogs, length(blogs)*0.05)
twitter_sample <- sample(twitter, length(twitter)*0.05)
news_sample <- sample(news, length(news)*0.05)
all_sample <- c(blogs_sample, twitter_sample, news_sample)
writeLines(all_sample, "./en_US.all_sample.txt")

head(all_sample, n = 3)
## [1] "G Rated gathers up the most brilliant ideas and inspiring product and present them all to you. Follow their blog for awesome DIY ideas, simple recipes, fashion, crafts, and decor."       
## [2] "The Qualifications to qualify for Il Pay day loans is often a lowest typical earnings of Buck1,000, citizenship of america, being a resident of Il for 5years and up, lowest 18 yrs . old."
## [3] "fabric used this week: -8 yards"
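
In a later session the saved sample can be reloaded directly, skipping the three full corpora, for example:

# reload the combined 5% sample written above
all_sample <- readLines("./en_US.all_sample.txt")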

Cleaning the Data

Next I will create a function that cleans the data by removing hyperlinks, removing numbers and non-English characters or symbols, and converting all letters to lowercase.

cleaningfn1 <- function(data){
  data <- gsub("http[^[:space:]]*", "", data)   # remove hyperlinks
  data <- gsub("[^a-zA-Z[:space:]]", "", data)  # keep only English letters and whitespace
  tolower(data)                                 # convert all letters to lowercase
}
all_sample <- cleaningfn1(all_sample)

I also want to remove profanity, so I will create a profanity variable from an appropriate word list found online.

profanityURL <- "https://raw.githubusercontent.com/RobertJGabriel/Google-profanity-words/master/list.txt"
if(!file.exists("./profanity.txt")){download.file(profanityURL, "./profanity.txt")}
profanity <- read.table("./profanity.txt")
profanity <- as.character(profanity[,1])
head(profanity)
## [1] "4r5e" "5h1t" "5hit" "a55"  "anal" "anus"

The following for-loop illustrates the process of removing profanity from the all_sample dataset, leaving a dataset that is fully cleaned and ready for use. However, in the processing that takes place later on it turns out to be faster and more beneficial not to use this for-loop (profanity is instead filtered out with anti_join after tokenization). I leave it in to show the process I first went through to arrive at a cleaned dataset.

for(x in 1:length(profanity)){
  all_sample <- gsub(paste0(" ", profanity[x], " "), " ", all_sample)  # word within a line
  all_sample <- gsub(paste0(" ", profanity[x], "$"), "", all_sample)   # word at the end of a line
  all_sample <- gsub(paste0("^", profanity[x], " "), "", all_sample)   # word at the start of a line
}
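
For reference, the same filtering can also be done without a loop by collapsing the word list into a single regular-expression pattern. This is only a sketch, and it assumes the profanity list contains no regular-expression metacharacters:

# one-pass alternative to the loop above (sketch; assumes the word list
# contains no regex metacharacters)
profanity_pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
all_sample <- gsub(profanity_pattern, "", all_sample, perl = TRUE)
all_sample <- gsub("\\s+", " ", trimws(all_sample))  # tidy up leftover whitespace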

Exploratory Data Analysis

library(tidyverse)
library(tidytext)
library(stopwords)

Having loaded the necessary libraries, I next create a stopwords variable. At this stage of the analysis it is useful to remove words that are so common that they add little meaning to sentences; removing them gives a more interesting picture of the most common single-word (unigram) occurrences.

profanity <- tibble(profanity) %>% rename(word = 1)
stopwords <- stopwords("en") %>% tibble() %>% rename(word = 1)
# strip apostrophes etc. from the stopwords so that they match the cleaned text
stopwords$word <- gsub("[^a-zA-Z0-9[:space:]]", "", stopwords$word)

Next I use the cleaning function built earlier and the unnest_tokens function to create datasets that are essentially just lists of the single words that occur. The main reason for doing this separately for each dataset (rather than once on the previously cleaned all_sample dataset) is that I want to add a source tag to each word occurrence, so that I can compare the different sources.

blogs_sample <- cleaningfn1(blogs_sample)
blogs_sample_df <- tibble(blogs_sample)
blogs_sample_unnest <- blogs_sample_df %>%
  unnest_tokens(word, blogs_sample) %>%
  mutate(source="blogs") %>%
  anti_join(profanity) %>% 
  anti_join(stopwords)
blogs_sample_unnest
## # A tibble: 932,979 x 2
##    word      source
##    <chr>     <chr> 
##  1 g         blogs 
##  2 rated     blogs 
##  3 gathers   blogs 
##  4 brilliant blogs 
##  5 ideas     blogs 
##  6 inspiring blogs 
##  7 product   blogs 
##  8 present   blogs 
##  9 follow    blogs 
## 10 blog      blogs 
## # ... with 932,969 more rows
twitter_sample <- cleaningfn1(twitter_sample)
twitter_sample_df <- tibble(twitter_sample)
twitter_sample_unnest <- twitter_sample_df %>%
  unnest_tokens(word, twitter_sample) %>%
  mutate(source="twitter") %>%
  anti_join(profanity) %>% 
  anti_join(stopwords)
twitter_sample_unnest
## # A tibble: 816,022 x 2
##    word                           source 
##    <chr>                          <chr>  
##  1 one                            twitter
##  2 year                           twitter
##  3 deal                           twitter
##  4 nutsshould                     twitter
##  5 locked                         twitter
##  6 least                          twitter
##  7 dumbnetworkstheirspendingrules twitter
##  8 constantly                     twitter
##  9 running                        twitter
## 10 time                           twitter
## # ... with 816,012 more rows
news_sample <- cleaningfn1(news_sample)
news_sample_df <- tibble(news_sample)
news_sample_unnest <- news_sample_df %>%
  unnest_tokens(word, news_sample) %>%
  mutate(source="news") %>%
  anti_join(profanity) %>% 
  anti_join(stopwords)
news_sample_unnest
## # A tibble: 72,183 x 2
##    word      source
##    <chr>     <chr> 
##  1 army      news  
##  2 corps     news  
##  3 engineers news  
##  4 blown     news  
##  5 holes     news  
##  6 birds     news  
##  7 point     news  
##  8 levee     news  
##  9 monday    news  
## 10 night     news  
## # ... with 72,173 more rows

Here is the combined dataset of single words, ordered by frequency of occurrence, followed by a bar chart of the 20 most frequent words.

all_sample_unnest <- rbind(blogs_sample_unnest, twitter_sample_unnest, news_sample_unnest) %>%
  count(word) %>%
  arrange(desc(n))
all_sample_unnest
## # A tibble: 113,843 x 2
##    word      n
##    <chr> <int>
##  1 just  12766
##  2 like  11340
##  3 one   10500
##  4 can    9576
##  5 get    9391
##  6 time   8365
##  7 love   7554
##  8 good   7473
##  9 now    7285
## 10 day    7102
## # ... with 113,833 more rows
all_sample_unnest %>% top_n(20) %>%
  ggplot(aes(x=reorder(word, -n), y = n)) + 
  geom_col() + 
  xlab("Word") +
  ylab("Count") +
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) 

Here is a comparison of the 20 most frequent words in each source.

rbind(blogs_sample_unnest, twitter_sample_unnest, news_sample_unnest) %>%
  group_by(source) %>%
  count(word) %>% 
  arrange(desc(n)) %>%
  top_n(20) %>%
  ggplot(aes(x=reorder(word, n), y = n)) + 
  geom_col() + 
  xlab("Word") +
  ylab("Count") +
  coord_flip() +
  facet_grid(~source, scales="free")

Next I will create a bigram dataset, save it as an object for later use, and display a bar chart. Here I have decided to leave the stopwords in, as without them the two-, three- and four-word groups would be mostly grammatically incorrect.

blogs_sample_bigram <- blogs_sample_df %>%
  unnest_tokens(bigram, blogs_sample, token = "ngrams", n=2) %>%
  na.omit()
twitter_sample_bigram <- twitter_sample_df %>%
  unnest_tokens(bigram, twitter_sample, token = "ngrams", n=2) %>%
  na.omit()
news_sample_bigram <- news_sample_df %>%
  unnest_tokens(bigram, news_sample, token = "ngrams", n=2) %>%
  na.omit()
all_sample_bigram <- rbind(blogs_sample_bigram, twitter_sample_bigram, news_sample_bigram) %>%
  count(bigram) %>%
  arrange(desc(n))

if(!file.exists("./bigram_data.rds")){saveRDS(all_sample_bigram, "./bigram_data.rds")}

all_sample_bigram %>% top_n(20) %>%
  ggplot(aes(x=reorder(bigram, -n), y = n)) + 
  geom_col() + 
  xlab("Bigram") +
  ylab("Count") +
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) 

Next, the same for trigrams.

blogs_sample_trigram <- blogs_sample_df %>%
  unnest_tokens(trigram, blogs_sample, token = "ngrams", n=3) %>%
  na.omit()
twitter_sample_trigram <- twitter_sample_df %>%
  unnest_tokens(trigram, twitter_sample, token = "ngrams", n=3) %>%
  na.omit()
news_sample_trigram <- news_sample_df %>%
  unnest_tokens(trigram, news_sample, token = "ngrams", n=3) %>% 
  na.omit()
all_sample_trigram <- rbind(blogs_sample_trigram, twitter_sample_trigram, news_sample_trigram) %>%
  count(trigram) %>%
  arrange(desc(n))

if(!file.exists("./trigram_data.rds")){saveRDS(all_sample_trigram, "./trigram_data.rds")}

all_sample_trigram %>% top_n(20) %>%
  ggplot(aes(x=reorder(trigram, -n), y = n)) + 
  geom_col() + 
  xlab("Trigram") +
  ylab("Count") +
  scale_x_discrete(guide = guide_axis(angle=90)) 

Finally, quadgrams.

blogs_sample_quadgram <- blogs_sample_df %>%
  unnest_tokens(quadgram, blogs_sample, token = "ngrams", n=4) %>%
  na.omit()
twitter_sample_quadgram <- twitter_sample_df %>%
  unnest_tokens(quadgram, twitter_sample, token = "ngrams", n=4) %>%
  na.omit()
news_sample_quadgram <- news_sample_df %>%
  unnest_tokens(quadgram, news_sample, token = "ngrams", n=4) %>% 
  na.omit()
all_sample_quadgram <- rbind(blogs_sample_quadgram, twitter_sample_quadgram, news_sample_quadgram) %>%
  count(quadgram) %>%
  arrange(desc(n))

if(!file.exists("./quadgram_data.rds")){saveRDS(all_sample_quadgram, "./quadgram_data.rds")}

all_sample_quadgram %>% top_n(20) %>%
  ggplot(aes(x=reorder(quadgram, -n), y = n)) + 
  geom_col() + 
  xlab("Quadgram") +
  ylab("Count") +
  scale_x_discrete(guide = guide_axis(angle=90)) 

Let’s see what proportion of the unique words accounts for 50 and 90 percent of all word occurrences.

total_n <- sum(all_sample_unnest$n)
total_word <- nrow(all_sample_unnest)
  
all_sample_cumsum <- all_sample_unnest %>% mutate(proportion = n/total_n, cumsum = cumsum(proportion))

cover_0.5 <- which(all_sample_cumsum$cumsum > 0.5)[1]
cover_0.9 <- which(all_sample_cumsum$cumsum > 0.9)[1]

cover_0.5/total_word
## [1] 0.007440071
cover_0.9/total_word
## [1] 0.1390072
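
One practical use of this coverage calculation is pruning the vocabulary for the eventual model. As a rough sketch (dictionary_90 is just an illustrative name), the words needed for 90 percent coverage could be kept as a reduced dictionary:

# sketch: keep only the words needed for 90% coverage as a reduced dictionary
dictionary_90 <- all_sample_cumsum$word[1:cover_0.9]
length(dictionary_90)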

Basic Model, Rough and Early Draft

First, loading the data from the previously saved objects.

bigram_data <- readRDS("./bigram_data.rds")
trigram_data <- readRDS("./trigram_data.rds")
quadgram_data <- readRDS("./quadgram_data.rds")

Next, building a function that will predict the next word after 1, 2 or 3 words are given.

next_word_predictor <- function(x, firstword=NULL, secondword=NULL, thirdword=NULL){
  # the n-gram tables are sorted by descending frequency, so the first match
  # returned by grep() is the most frequent n-gram starting with the given words
  if(x==2){
    twoword_id <- grep(paste0("^", firstword, " "), bigram_data$bigram)[1]
    first_comb <- bigram_data[twoword_id,][1]
    strsplit(first_comb$bigram, " ")[[1]][2]
  } else if(x==3){
    threeword_id <- grep(paste0("^", firstword, " ", secondword, " "), trigram_data$trigram)[1]
    first_comb <- trigram_data[threeword_id,][1]
    strsplit(first_comb$trigram, " ")[[1]][3]
  } else if(x==4){
    fourword_id <- grep(paste0("^", firstword, " ", secondword, " ", thirdword, " "), quadgram_data$quadgram)[1]
    first_comb <- quadgram_data[fourword_id,][1]
    strsplit(first_comb$quadgram, " ")[[1]][4]
  } else{
    print("")
  }
}

Let’s test the function a few times. Note that the first argument is the n in n-gram, i.e. the position (within the word group) of the word to be predicted. The second, third and fourth arguments are the one, two or three words that precede the word to be predicted.

next_word_predictor(6, "when")
## [1] ""
next_word_predictor(2, "speaking")
## [1] "of"
next_word_predictor(3, "people", "always")
## [1] "assume"
next_word_predictor(4, "why", "do", "they")
## [1] "have"
next_word_predictor(2, "terrible")
## [1] "and"
next_word_predictor(3, "elephants", "are")
## [1] "able"
next_word_predictor(4, "if", "only", "i")
## [1] "could"

Diagnosis

Here I will list some ideas and methods that will be explored to further this project and improve this model.