In this report, we shall perform exploratory data analysis on the dataset.
The main goal is to explain the major features of the data.
Finally, we shall decide on an algorithm for predicting the next word.
The data can be found here.
Before anything else, let us first download the data and unzip it.
# Define file and URL
zip_file <- "Coursera-SwiftKey.zip"
zip_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Download and unzip only if the archive is not already present
if (!file.exists(zip_file)) {
  download.file(zip_url, destfile = zip_file, method = "curl")
  unzip(zip_file)
}
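To check that the extraction worked, we can list the English files; the final/en_US/ path below simply mirrors the archive's directory layout used in the rest of this report.
# List the extracted English-language files and their approximate sizes (in MB)
en_files <- list.files("final/en_US", full.names = TRUE)
data.frame(file = basename(en_files), size_MB = round(file.size(en_files) / 1024^2, 1))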
The data is available in four languages; we will only be focusing on the English data for this report.
Apart from that, the data is divided into three categories: blogs, news articles, and tweets.
directory <- "final/en_US/"
blog_file <- paste0(directory, "en_US.blogs.txt")
news_file <- paste0(directory, "en_US.news.txt")
twitter_file <- paste0(directory, "en_US.twitter.txt")
Since letters can be in lowercase or uppercase, the same word can be represented in multiple ways, such as "the", "The", "THE", etc. However, all these instances are technically the same word, so there is no reason to treat them as separate words. Therefore, we shall convert all the words to lowercase.
Note: There are cases where the case matters; for example, "US" and "us" are different words. However, such words are very few in number and can be ignored.
This product is for phones, where punctuation is not used as much as on a computer. As such, we shall ignore all punctuation marks.
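As a quick illustration of what these two cleaning steps do, consider the following made-up line of text (the example string is purely for demonstration):
# Made-up example line, purely to illustrate the two cleaning steps
example <- "The US can't stop, THE world won't stop!"
tolower(gsub("[[:punct:]]", "", example))
# gives: "the us cant stop the world wont stop"
Note that lowercasing collapses "The" and "THE" (and also "US" into "us"), while removing punctuation turns "can't" into "cant"; as discussed above, we accept these minor losses.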
Since the full dataset is too large to process comfortably, we shall sample 5% of the lines in each file for our analysis.
p <- 0.05
set.seed(1234) # fix the random seed so the sample is reproducible (seed value is arbitrary)
read_sample <- function(file, p = 0.05) {
  # skipNul = TRUE suppresses warnings caused by embedded nul characters in the raw files
  data <- readLines(file, skipNul = TRUE)
  # Keep a random sample of roughly p * 100% of the lines
  data <- sample(data, floor(length(data) * p))
  return(data)
}
blog_data <- read_sample(blog_file, p)
news_data <- read_sample(news_file, p)
twitter_data <- read_sample(twitter_file, p)
data <- c(blog_data, news_data, twitter_data)
As discussed before, we shall clean the data by converting all lines to lower case and removing all punctuation marks.
cleaned_data <- tolower(data)
cleaned_data <- gsub("[[:punct:]]", "", cleaned_data)
Now that we have the data, let us perform some exploratory data analysis on it.
First, we shall count the number of occurrences of each word in the data.
word_count <- table(unlist(strsplit(cleaned_data, " ")))
word_count <- sort(word_count, decreasing = TRUE)
word_count <- word_count[names(word_count) != ""]
Let us plot the top 50 words.
library(ggplot2)

top_words <- head(word_count, 50)
top_words <- data.frame(word = names(top_words), freq = as.numeric(top_words))
ggplot(top_words, aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Word", y = "Frequency", title = "Top 50 Most Frequent Words")
Let us also plot a word cloud for the top 100 words.
library(wordcloud)
library(RColorBrewer)

wordcloud(names(word_count), freq = as.numeric(word_count), max.words = 100, colors = brewer.pal(8, "Dark2"))
Looks like most of the words are common English words like "the", "and", "is", etc. While this is expected, it is not very useful if our algorithm only ever predicts these common stop words. So, we need to come up with a different strategy.
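To quantify how dominant these stop words are, we can compute the share of all word occurrences that the top 50 words account for; this is just a quick check on the counts we already have:
# Fraction of all word occurrences covered by the 50 most frequent words
sum(head(word_count, 50)) / sum(word_count)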
Let us look at combinations of two words (bigrams).
bigrams <- table(unlist(lapply(cleaned_data, function(x) {
  words <- strsplit(x, " ")[[1]]
  words <- words[words != ""] # Remove empty strings
  if (length(words) < 2) return(NULL) # Skip if fewer than 2 words
  bigrams <- sapply(1:(length(words) - 1), function(i) {
    paste(words[i], words[i + 1], sep = " ")
  })
  return(bigrams)
})))
bigrams <- sort(bigrams, decreasing = TRUE)
Let us plot the top 50 bigrams.
top_bigrams <- head(bigrams, 50)
top_bigrams <- data.frame(bigram = names(top_bigrams), freq = as.numeric(top_bigrams))
ggplot(top_bigrams, aes(x = reorder(bigram, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Bigram", y = "Frequency", title = "Top 50 Most Frequent Bigrams")
Let us also plot a word cloud for the top 100 bigrams.
wordcloud(names(bigrams), freq = as.numeric(bigrams), max.words = 100, colors = brewer.pal(8, "Dark2"))
Looking at the counts, it looks like we now get more common phrases, such as "going to", "want to", "the world", etc. Let us also look at combinations of three words (trigrams).
trigrams <- table(unlist(lapply(cleaned_data, function(x) {
  words <- strsplit(x, " ")[[1]]
  words <- words[words != ""] # Remove empty strings
  if (length(words) < 3) return(NULL) # Skip if fewer than 3 words
  trigrams <- sapply(1:(length(words) - 2), function(i) {
    paste(words[i], words[i + 1], words[i + 2], sep = " ")
  })
  return(trigrams)
})))
trigrams <- sort(trigrams, decreasing = TRUE)
Let us plot the top 20 trigrams.
top_trigrams <- head(trigrams, 20)
top_trigrams <- data.frame(trigram = names(top_trigrams), freq = as.numeric(top_trigrams))
ggplot(top_trigrams, aes(x = reorder(trigram, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Trigram", y = "Frequency", title = "Top 20 Most Frequent Trigrams")
Let us also plot a word cloud for the top 15 trigrams.
wordcloud(names(trigrams), freq = as.numeric(trigrams), max.words = 15, colors = brewer.pal(8, "Dark2"))
As we can see, trigrams capture even more of the structure of the data, as they can store more complex relations. Now that we have performed our analysis, we can work on our predictive algorithm.
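As a simple illustration of how the n-gram counts above could drive next-word prediction, a hypothetical helper like the one sketched below could look up the most frequent trigram that starts with the user's last two words; this is only a sketch using the trigrams table already in memory, not the final algorithm.
# Illustrative only: predict the next word from the two previous words,
# using the trigram counts computed above
predict_next <- function(prev_two, trigram_counts = trigrams) {
  prefix <- paste0(tolower(prev_two), " ")
  matches <- trigram_counts[startsWith(names(trigram_counts), prefix)]
  if (length(matches) == 0) return(NA_character_) # no matching trigram
  best <- names(matches)[which.max(matches)]
  # Return the third word of the most frequent matching trigram
  tail(strsplit(best, " ")[[1]], 1)
}
predict_next("one of") # returns the most likely third word, if any trigram matches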
We shall use the following algorithm: