Introduction

In this report, we shall perform exploratory data analysis on the dataset.
The main goal here is to explain the major features of the data.
Finally, we shall decide on the algorithm for predicting the next word.

Data

Downloading The Data

The data can be found at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.


First, let us download the data and unzip it.

# Define file and URL
zip_file <- "Coursera-Swiftkey.zip"
zip_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

# Check if the file exists
if (!file.exists(zip_file)) {
    download.file(zip_url, destfile = zip_file, method = "curl")
    unzip(zip_file)
}
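
As a quick sanity check, we can list what the archive extracted. The path below assumes the standard layout of the Coursera SwiftKey zip, which unpacks into a final/ directory with one sub-directory per language.

# List the extracted files to verify the unzip worked
list.files("final", recursive = TRUE)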

Data Description

The data is available in 4 languages:

  • English
  • German
  • Finnish
  • Russian

We will only be focusing on the English data for this report.

Apart from that, the data is divided into 3 categories:

  • Blogs
  • News
  • Twitter

The corresponding English files are:

directory <- "final/en_US/"
blog_file <- paste0(directory, "en_US.blogs.txt")
news_file <- paste0(directory, "en_US.news.txt")
twitter_file <- paste0(directory, "en_US.twitter.txt")

How to handle Cases?

Since letters can be in lowercase or uppercase, the same word can be represented in multiple ways, such as the, The, THE, etc. However, all these instances are technically the same word, so there is no reason to treat them as separate words.
Therefore, we shall convert all the words to lowercase.
Note: There are cases where case matters; for example, US and us are different words. However, such words are very few in number and can be ignored.

How to handle Punctuation?

This product is intended for phones, where punctuation is used far less than on a computer. As such, we shall ignore all punctuation marks.
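
As a small illustration of both decisions on a made-up sentence (not taken from the data):

x <- "The US, the UK, and us: THE end."
tolower(x)                           # "the us, the uk, and us: the end."
gsub("[[:punct:]]", "", tolower(x))  # "the us the uk and us the end"

Note how US and us collapse into the same token, which is exactly the trade-off discussed above.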

Sampling the Data

Since the full data set is too large to process comfortably, we shall sample 5% of the lines in each file for our analysis.

p <- 0.05
set.seed(42)  # fix the random seed so the sample is reproducible

read_sample <- function(file, p = 0.05) {
    # skipNul = TRUE guards against embedded nul characters in the raw files
    data <- readLines(file, skipNul = TRUE)
    data <- sample(data, floor(length(data) * p))
    return(data)
}

blog_data <- read_sample(blog_file, p)
news_data <- read_sample(news_file, p)
twitter_data <- read_sample(twitter_file, p)

data <- c(blog_data, news_data, twitter_data)
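
As a sanity check (the exact counts will vary with the random sample), we can confirm how many lines we kept from each source:

# Number of sampled lines per source and in total
c(blogs = length(blog_data), news = length(news_data), twitter = length(twitter_data))
length(data)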

Data Cleaning

As discussed before, we shall clean the data by converting all lines to lower case and removing all punctuation marks.

cleaned_data <- tolower(data)
# Note: this also strips apostrophes, so "don't" becomes "dont"
cleaned_data <- gsub("[[:punct:]]", "", cleaned_data)

Now that we have the data, let us perform some exploratory data analysis on it.

Unigrams

First we shall count the number of occurrences of each word in the data.

# Split each line into words and tabulate the counts
word_count <- table(unlist(strsplit(cleaned_data, " ")))
word_count <- sort(word_count, decreasing = TRUE)
word_count <- word_count[names(word_count) != ""]  # drop empty tokens left by repeated spaces
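
Before plotting, a quick peek at the table (counts will vary with the sample):

head(word_count, 10)  # the ten most frequent words
length(word_count)    # vocabulary size of the sample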

Plot

Let us plot the top 50 words.

library(ggplot2)

top_words <- head(word_count, 50)
top_words <- data.frame(word = names(top_words), freq = as.numeric(top_words))

ggplot(top_words, aes(x = reorder(word, -freq), y = freq)) +
    geom_bar(stat = "identity") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(x = "Word", y = "Frequency", title = "Top 50 Most Frequent Words")

Word Cloud

Let us also plot a word cloud for the top 100 words.

library(wordcloud)
library(RColorBrewer)  # provides brewer.pal()

wordcloud(names(word_count), freq = as.numeric(word_count), max.words = 100, colors = brewer.pal(8, "Dark2"))

Looks like most of the words are common English words like the, and, is, etc. While this is expected, it is not very useful if our algorithm only ever predicts these common stop words. So, we need to come up with a different strategy.

Bigrams

Let us look at combinations of two words (bigrams).

bigrams <- table(unlist(lapply(cleaned_data, function(x) {
    words <- strsplit(x, " ")[[1]]
    words <- words[words != ""]  # Remove empty strings
    if (length(words) < 2) return(NULL)  # Skip if fewer than 2 words
    bigrams <- sapply(1:(length(words) - 1), function(i) {
        paste(words[i], words[i + 1], sep = " ")
    })
    return(bigrams)
})))
bigrams <- sort(bigrams, decreasing = TRUE)
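
Counting is not the end goal: for prediction we will want the bigrams that start with a given word. As a small sketch (the word "going" is only an example), such a lookup could look like this:

# All bigrams whose first word is "going", most frequent first
going_bigrams <- bigrams[startsWith(names(bigrams), "going ")]
head(going_bigrams, 5)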

Plot

Let us plot the top 50 bigrams.

top_bigrams <- head(bigrams, 50)
top_bigrams <- data.frame(bigram = names(top_bigrams), freq = as.numeric(top_bigrams))

ggplot(top_bigrams, aes(x = reorder(bigram, -freq), y = freq)) +
    geom_bar(stat = "identity") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(x = "Bigram", y = "Frequency", title = "Top 50 Most Frequent Bigrams")

Word Cloud

Let us also plot a word cloud for the top 100 bigrams.

wordcloud(names(bigrams), freq = as.numeric(bigrams), max.words = 100, colors = brewer.pal(8, "Dark2"))

Looking at the counts, we now see more meaningful phrases like going to, want to, the world, etc. Let us also look at trigrams.

Trigrams

Let us look at combinations of three words (trigrams).

trigrams <- table(unlist(lapply(cleaned_data, function(x) {
    words <- strsplit(x, " ")[[1]]
    words <- words[words != ""]  # Remove empty strings
    if (length(words) < 3) return(NULL)  # Skip if fewer than 3 words
    trigrams <- sapply(1:(length(words) - 2), function(i) {
        paste(words[i], words[i + 1], words[i + 2], sep = " ")
    })
    return(trigrams)
})))
trigrams <- sort(trigrams, decreasing = TRUE)

Plot

Let us plot the top 20 trigrams.

top_trigrams <- head(trigrams, 20)
top_trigrams <- data.frame(trigram = names(top_trigrams), freq = as.numeric(top_trigrams))

ggplot(top_trigrams, aes(x = reorder(trigram, -freq), y = freq)) +
    geom_bar(stat = "identity") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(x = "Trigram", y = "Frequency", title = "Top 50 Most Frequent Trigrams")

Word Cloud

Let us also plot a word cloud for the top 15 trigrams.

wordcloud(names(trigrams), freq = as.numeric(trigrams), max.words = 15, colors = brewer.pal(8, "Dark2"))

As we can see, trigrams capture even more of the structure of the data, as they can encode more complex relations between words. Now that we have performed our analysis, we can work on our predictive algorithm.

Idea

We shall use the following backoff algorithm (a code sketch follows the list):

  1. If the word to be predicted is the first word in the sentence, choose the most frequent word.
  2. If the word is the second word in the sentence, then look at the previous word and select all the bigrams which begin with it:
    • If no bigrams exist, choose the most frequent word.
    • If bigrams exist, choose the second word of the most frequent bigram.
  3. If the word to be predicted occurs third or later, select all trigrams which begin with the two words before it:
    • If no trigrams exist, fall back to the bigram method described in step 2.
    • If trigrams exist, choose the third word of the most frequent trigram.
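
Below is a minimal sketch of this backoff strategy, assuming the word_count, bigrams, and trigrams tables built earlier; the helper names match_ngram and predict_next are illustrative, not part of any library.

# Most frequent n-gram starting with the given prefix, or NA if none exists
match_ngram <- function(ngram_table, prefix) {
    hits <- ngram_table[startsWith(names(ngram_table), paste0(prefix, " "))]
    if (length(hits) == 0) return(NA_character_)
    names(hits)[which.max(hits)]
}

predict_next <- function(sentence) {
    words <- strsplit(tolower(sentence), " ")[[1]]
    words <- words[words != ""]
    n <- length(words)

    # Step 1: no context yet, so fall back to the most frequent word
    if (n == 0) return(names(word_count)[1])

    # Step 3: with two or more words of context, try trigrams first
    if (n >= 2) {
        hit <- match_ngram(trigrams, paste(words[n - 1], words[n]))
        if (!is.na(hit)) return(strsplit(hit, " ")[[1]][3])
    }

    # Step 2 (also the trigram fallback): try bigrams starting with the last word
    hit <- match_ngram(bigrams, words[n])
    if (!is.na(hit)) return(strsplit(hit, " ")[[1]][2])

    # Final fallback: the most frequent word overall
    names(word_count)[1]
}

predict_next("i want")  # e.g. "to", depending on the sample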