Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:
I went to the
the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. The overall goal of the Capstone project is to build a predictive text model using Natural Language Processing (NLP), along with a predictive text application that determines the most likely next word when a user inputs a word or a phrase.
The purpose of this milestone report is to demonstrate how the data was downloaded, imported into R, and cleaned. This report also contains an exploratory analysis of the data including summary statistics about the three separate data sets (blogs, news and tweets), interesting findings discovered along the way, and an outline of the next steps that will be taken toward building the predictive application.
library(tm)
## Warning: package 'tm' was built under R version 4.3.3
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 4.3.1
library(stringi)
## Warning: package 'stringi' was built under R version 4.3.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(pryr)
## Warning: package 'pryr' was built under R version 4.3.3
##
## Attaching package: 'pryr'
## The following object is masked from 'package:dplyr':
##
## where
## The following object is masked from 'package:tm':
##
## inspect
library(RColorBrewer)
## Warning: package 'RColorBrewer' was built under R version 4.3.1
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(SnowballC)
## Warning: package 'SnowballC' was built under R version 4.3.1
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.3.3
library(rlang)
##
## Attaching package: 'rlang'
## The following object is masked from 'package:pryr':
##
## bytes
library(syuzhet)
## Warning: package 'syuzhet' was built under R version 4.3.3
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.3.3
The data for this project was downloaded from the following link and unzipped in the current working directory. https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
For this version we will only look at the English-language files: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.
blogs <- readLines("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
news <- readLines("./Coursera-SwiftKey/final/en_US/en_US.news.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
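The download and unzip step described above is not shown as code in this report. A minimal sketch of how it could be scripted follows. The helper name `fetch_corpus` and its arguments are hypothetical, and it assumes the zip file unpacks to a top-level `final/` folder, which is what the `readLines()` paths above suggest.

```r
# Assumption: the archive contains a top-level "final/" folder, so extracting
# into "Coursera-SwiftKey" yields the paths used by readLines() in this report.
corpus_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

# Hypothetical helper: download and unzip only when the files are missing
fetch_corpus <- function(url = corpus_url,
                         zip_path = "Coursera-SwiftKey.zip",
                         out_dir = "Coursera-SwiftKey") {
  if (!file.exists(zip_path)) {
    download.file(url, destfile = zip_path, mode = "wb")  # binary mode for zip files
  }
  if (!dir.exists(out_dir)) {
    unzip(zip_path, exdir = out_dir)
  }
  # Return the folder that should hold the English-language files
  file.path(out_dir, "final", "en_US")
}
```

Calling `fetch_corpus()` once before reading the files would skip the download on later runs, which helps when the report is knitted repeatedly.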
Let's look at the summary statistics for the three data sets: the in-memory size, the number of lines, the number of characters and the number of words.
stats <- data.frame(
  FileName = c("blogs", "news", "twitter"),
  FileSize = sapply(list(blogs, news, twitter), function(x){format(object.size(x), "MB")}),
  t(rbind(sapply(list(blogs, news, twitter), stri_stats_general),
          Words = sapply(list(blogs, news, twitter), stri_stats_latex)[4,]))
)
stats
## FileName FileSize Lines LinesNEmpty Chars CharsNWhite Words
## 1 blogs 255.4 Mb 899288 899288 206824382 170389539 37570839
## 2 news 19.8 Mb 77259 77259 15639408 13072698 2651432
## 3 twitter 319 Mb 2360148 2360148 162096241 134082806 30451170
From the summary, we can see that the data sets are large: blogs and twitter each occupy more than 250 Mb in memory. We will therefore draw a 1% random sample of lines from each of the three files and work with those samples. We set a seed so that the sampling is reproducible, and then look at the same statistics for the sampled data.
set.seed(100)
sampleSize <- 0.01
blogsSub <- sample(blogs, length(blogs) * sampleSize)
newsSub <- sample(news, length(news) * sampleSize)
twitterSub <- sample(twitter, length(twitter) * sampleSize)
sampleStats <- data.frame(
  FileName = c("blogsSub", "newsSub", "twitterSub"),
  FileSize = sapply(list(blogsSub, newsSub, twitterSub), function(x){format(object.size(x), "MB")}),
  t(rbind(sapply(list(blogsSub, newsSub, twitterSub), stri_stats_general),
          Words = sapply(list(blogsSub, newsSub, twitterSub), stri_stats_latex)[4,])
  )
)
sampleStats
## FileName FileSize Lines LinesNEmpty Chars CharsNWhite Words
## 1 blogsSub 2.6 Mb 8992 8992 2076045 1710312 378094
## 2 newsSub 0.2 Mb 772 772 158851 132664 27124
## 3 twitterSub 3.2 Mb 23601 23601 1619860 1340069 304095
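The sample sizes above are whole numbers even though `length(blogs) * sampleSize` is not (899288 * 0.01 = 8992.88 gives 8992 lines). A small sketch of the same idea with the rounding made explicit is shown below. The helper `sample_lines` is hypothetical, and unlike the code above it resets the seed on every call, so it would not reproduce exactly the same samples as this report.

```r
# Hypothetical helper: draw a reproducible fraction of lines from a character vector
sample_lines <- function(lines, fraction, seed = 100) {
  set.seed(seed)                          # same seed gives the same sample
  n <- floor(length(lines) * fraction)    # make the rounding down explicit
  sample(lines, n)
}

# Toy data standing in for one of the corpus files
toy <- paste("line", 1:1000)
first_draw  <- sample_lines(toy, 0.01)
second_draw <- sample_lines(toy, 0.01)
identical(first_draw, second_draw)
```

Resetting the seed inside the helper makes each sample independent of the order in which the three files are sampled.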
The sample data now needs to be cleaned. The cleaning steps are converting the text to lower case, removing punctuation, removing numbers, stripping excess whitespace and removing stop words.
clean_text <- function(text) {
  text <- tolower(text) # Convert to lowercase
  text <- removePunctuation(text) # Remove punctuation
  text <- removeNumbers(text) # Remove numbers
  text <- stripWhitespace(text) # Remove excess whitespace
  text <- removeWords(text, stopwords("en")) # Remove stopwords
  text <- text[text != ""] # Remove empty elements
  return(text)
}
# Apply cleaning function to each dataset
blogs_clean <- clean_text(blogsSub)
news_clean <- clean_text(newsSub)
twitter_clean <- clean_text(twitterSub)
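One side effect of this ordering is that punctuation is removed before stop words, so a contraction such as "don't" becomes "dont" and no longer matches the stop word list, and curly apostrophes are not treated as punctuation by default. The sketch below illustrates an alternative ordering in base R. It is not the cleaning used for the results in this report. The function `clean_text_alt` and the short `toy_stopwords` vector are hypothetical stand-ins for `clean_text` and `stopwords("en")`.

```r
# Hypothetical stop word list for illustration; the report uses tm::stopwords("en")
toy_stopwords <- c("i", "the", "to", "don't", "it's")

clean_text_alt <- function(text, stop_words = toy_stopwords) {
  text <- tolower(text)
  # Normalise curly apostrophes so contractions match the stop word list
  text <- gsub("[\u2018\u2019]", "'", text)
  # Remove stop words while the apostrophes are still present
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  text <- gsub(pattern, " ", text, perl = TRUE)
  text <- gsub("[[:punct:]]+", "", text)    # remove punctuation
  text <- gsub("[[:digit:]]+", "", text)    # remove numbers
  text <- trimws(gsub("\\s+", " ", text))   # collapse whitespace last
  text[text != ""]                          # drop lines that are now empty
}

clean_text_alt(c("I don\u2019t like the 2 gyms", "It\u2019s 42"))
```

Collapsing whitespace as the final step also avoids the gaps left behind when stop words are removed after `stripWhitespace`.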
Let's take a look at the top 10 words from each of the three sources.
# Function to get top 10 most frequent words
top_10_words <- function(text) {
  text_corpus <- Corpus(VectorSource(text))
  dtm <- DocumentTermMatrix(text_corpus)
  word_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
  return(head(word_freq, 10))
}
# Get top 10 words for each dataset
top_10_blogs <- top_10_words(blogs_clean)
top_10_news <- top_10_words(news_clean)
top_10_twitter <- top_10_words(twitter_clean)
# Print top 10 words
print(top_10_blogs)
## one will like just can time ’s get also know
## 1275 1112 1056 1007 927 849 717 687 587 576
print(top_10_news)
## said will one new state also year can time people
## 198 90 74 57 49 48 46 45 45 42
print(top_10_twitter)
## just like get love good will day can thanks dont
## 1510 1168 1135 1066 981 980 959 894 893 872
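Calling `as.matrix()` on a document-term matrix builds a dense documents-by-terms matrix, which can use a lot of memory on larger samples. As a rough sketch of a lighter alternative, word counts can be taken directly from the cleaned text with base R. The helper `top_n_words` is hypothetical, and its counts may differ slightly from the ones above because, as far as I know, `DocumentTermMatrix` drops terms shorter than three characters by default.

```r
# Hypothetical helper: count whitespace-separated tokens without building a matrix
top_n_words <- function(text, n = 10) {
  tokens <- unlist(strsplit(text, "\\s+"))
  tokens <- tokens[tokens != ""]                     # drop empty tokens
  counts <- table(tokens)
  freq <- setNames(as.integer(counts), names(counts))
  head(sort(freq, decreasing = TRUE), n)             # named integer vector
}

# Toy input standing in for one of the cleaned samples
toy <- c("love good day", "good day", "day")
top_n_words(toy, 2)
```

The result is a named vector like the one returned by `top_10_words`, so its names and values could also be passed to `wordcloud()` as the `words` and `freq` arguments.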
Now that we know the top 10 words, let's plot how many times each of them was used in the sample data.
# Function to plot word frequencies
plot_word_freq <- function(word_freq, title) {
  word_freq_df <- data.frame(
    word = names(word_freq),
    freq = word_freq
  )
  ggplot(word_freq_df, aes(x = reorder(word, freq), y = freq)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    labs(title = title, x = "Words", y = "Frequency") +
    theme_minimal()
}
# Plot word frequencies
plot_word_freq(top_10_blogs, "Top 10 Words in Blogs")
plot_word_freq(top_10_news, "Top 10 Words in News")
plot_word_freq(top_10_twitter, "Top 10 Words in Twitter")
### Word Cloud
#Exploratory data analysis: Word Cloud
wordcloud_data_blogs <- unlist(strsplit(blogs_clean, "\\s+"))
wordcloud(wordcloud_data_blogs, max.words = 100, random.order = FALSE)
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents
wordcloud_data_news <- unlist(strsplit(news_clean, "\\s+"))
wordcloud(wordcloud_data_news, max.words = 100, random.order = FALSE)
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
wordcloud_data_twitter <- unlist(strsplit(twitter_clean, "\\s+"))
wordcloud(wordcloud_data_twitter, max.words = 100, random.order = FALSE)
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
I tried using 20%, 10% and 5% of the data as a sample, but my personal PC wasn't able to handle this, so I used a sample size of 1%. The bar plots and word clouds above highlight the most frequently used words. In the next steps, we will create a prediction model and test it. The model will then be fine-tuned to improve performance, after which a Shiny app will be deployed to return a prediction of the next word based on the word(s) entered by the user.
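To make the idea of next-word prediction concrete, here is a minimal sketch of a bigram model in base R. It is only an illustration under simple assumptions, not the model that will be built for the app. The functions `build_bigrams` and `predict_next` and the toy sentences are hypothetical.

```r
# Hypothetical helper: count adjacent word pairs in whitespace-separated text
build_bigrams <- function(text) {
  pairs <- lapply(strsplit(text, "\\s+"), function(tok) {
    tok <- tok[tok != ""]
    if (length(tok) < 2) return(NULL)      # a line needs two words to form a pair
    data.frame(first = head(tok, -1), second = tail(tok, -1),
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, pairs)
  pairs$n <- 1L
  counts <- aggregate(n ~ first + second, data = pairs, FUN = sum)
  # Most frequent followers first within each leading word
  counts[order(counts$first, -counts$n), ]
}

# Hypothetical helper: return up to k of the most frequent words seen after `word`
predict_next <- function(bigrams, word, k = 3) {
  hits <- bigrams[bigrams$first == tolower(word), ]
  head(as.character(hits$second), k)
}

toy <- c("i went to the gym", "i went to the store",
         "we went to the gym", "to the restaurant")
model <- build_bigrams(toy)
predict_next(model, "the")
```

A sketch like this only looks at the last word typed. Using longer word sequences and a fallback for unseen words would be natural extensions to consider when building the real model.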