Introduction

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:

I went to the

the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. The overall goal of the Capstone project is to build a predictive text model using Natural Language Processing (NLP), along with a predictive text application that will determine the most likely next word when a user inputs a word or a phrase.

The purpose of this milestone report is to demonstrate how the data was downloaded, imported into R, and cleaned. This report also contains an exploratory analysis of the data including summary statistics about the three separate data sets (blogs, news and tweets), interesting findings discovered along the way, and an outline of the next steps that will be taken toward building the predictive application.

Loading the required libraries

library(tm)
library(stringi)
library(dplyr)
library(pryr)
library(RColorBrewer)
library(ggplot2)
library(SnowballC)
library(tidytext)
library(rlang)
library(syuzhet)
library(wordcloud)

Downloading and Loading the Data into R

The data for this project was downloaded from the following link and unzipped into the current working directory: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

In this version we look only at the English-language files: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.
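The download and unzip step described above could be scripted along the following lines. This is a minimal sketch rather than the code actually used for this report: `fetch_corpus` is a hypothetical helper name, and the folder layout after extraction depends on the archive and on `exdir`, so the paths passed to `readLines()` may need adjusting.

```r
# Hypothetical helper: download the archive once, then extract it.
fetch_corpus <- function(url, zip_path = basename(url), exdir = ".") {
  if (!file.exists(zip_path)) {
    # mode = "wb" writes the zip as binary, which matters on Windows
    download.file(url, destfile = zip_path, mode = "wb")
  }
  # unzip() returns the extracted file paths invisibly
  unzip(zip_path, exdir = exdir)
}

corpus_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

# Not run here because the archive is large:
# fetch_corpus(corpus_url)
```

Skipping the download when the zip file already exists keeps repeated knits of the report from fetching the archive again.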

blogs <- readLines("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
news <- readLines("./Coursera-SwiftKey/final/en_US/en_US.news.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt", warn = FALSE, encoding = "UTF-8", skipNul = TRUE)

Generate Summary Statistics

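Let's look at summary statistics for the three datasets: the size of each object, the number of lines, the number of characters and the number of words. Note that FileSize below is the in-memory size of each R object as reported by object.size(), not the size of the file on disk.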

stats <- data.frame(
        FileName = c("blogs", "news", "twitter"),
        FileSize = sapply(list(blogs, news, twitter), function(x){format(object.size(x), "MB")}),
        t(rbind(sapply(list(blogs, news, twitter), stri_stats_general),
        Words = sapply(list(blogs, news, twitter), stri_stats_latex)[4,]))
)

stats
##   FileName FileSize   Lines LinesNEmpty     Chars CharsNWhite    Words
## 1    blogs 255.4 Mb  899288      899288 206824382   170389539 37570839
## 2     news  19.8 Mb   77259       77259  15639408    13072698  2651432
## 3  twitter   319 Mb 2360148     2360148 162096241   134082806 30451170
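The news object has far fewer lines than the other two, and this report does not investigate why. One possibility worth ruling out, offered here only as an assumption, is that a text-mode read stopped early at an embedded control character, which can happen on Windows. Reading through a binary connection is a common way to check. `read_lines_binary` is a hypothetical helper name, not part of the original analysis.

```r
# Hypothetical helper: read all lines through a binary connection so that
# embedded control characters are not treated as end-of-file.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
}

# Possible check against the text-mode read (not run here):
# news_bin <- read_lines_binary("./Coursera-SwiftKey/final/en_US/en_US.news.txt")
# length(news_bin) == length(news)
```

If the two lengths differ, the summary statistics for news would need to be recomputed from the binary read.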

Sample the Data

The summary shows that the three objects are large, so we will work with a 1% random sample of each, stored in three character vectors. We set a seed so that the sampling is reproducible, and then compute the same statistics for the sampled data.

set.seed(100)
sampleSize <- 0.01

blogsSub <- sample(blogs, length(blogs) * sampleSize)
newsSub <- sample(news, length(news) * sampleSize)
twitterSub <- sample(twitter, length(twitter) * sampleSize)

sampleStats <- data.frame(
        FileName = c("blogsSub", "newsSub", "twitterSub"),
        FileSize = sapply(list(blogsSub, newsSub, twitterSub), function(x){format(object.size(x), "MB")}),
        t(rbind(sapply(list(blogsSub, newsSub, twitterSub), stri_stats_general),
        Words = sapply(list(blogsSub, newsSub, twitterSub), stri_stats_latex)[4,])
        )
)

sampleStats
##     FileName FileSize Lines LinesNEmpty   Chars CharsNWhite  Words
## 1   blogsSub   2.6 Mb  8992        8992 2076045     1710312 378094
## 2    newsSub   0.2 Mb   772         772  158851      132664  27124
## 3 twitterSub   3.2 Mb 23601       23601 1619860     1340069 304095
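The sampled line counts are slightly below 1% of the originals (for example 8992 rather than 8992.88 for blogs) because sample() truncates a fractional size. The sketch below makes that rounding explicit and shows that a fixed seed reproduces the same draw. `sample_lines` is a hypothetical helper and the toy input is made up for illustration.

```r
# Hypothetical helper: draw a reproducible random fraction of a character vector.
sample_lines <- function(lines, fraction, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n <- floor(length(lines) * fraction)  # explicit rounding down
  sample(lines, n)
}

# Toy data standing in for one of the three sources
demo <- paste("line", 1:250)

a <- sample_lines(demo, 0.01, seed = 100)
b <- sample_lines(demo, 0.01, seed = 100)

length(a)        # floor(250 * 0.01) = 2
identical(a, b)  # TRUE: the same seed gives the same sample
```

Note that setting the seed inside the helper differs from the code above, which sets it once before the three sample() calls, so the two approaches would not draw identical samples.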

Data Cleaning

The sampled data now needs to be cleaned. Cleaning consists of converting the text to lower case and removing punctuation, numbers, excess whitespace and stop words.

clean_text <- function(text) {
  text <- tolower(text)                             # Convert to lowercase
  text <- removePunctuation(text)                   # Remove punctuation
  text <- removeNumbers(text)                       # Remove numbers
  text <- stripWhitespace(text)                     # Remove excess whitespace
  text <- removeWords(text, stopwords("en"))        # Remove stopwords
  text <- text[text != ""]                          # Remove empty elements
  return(text)
}

# Apply cleaning function to each dataset
blogs_clean <- clean_text(blogsSub)
news_clean <- clean_text(newsSub)
twitter_clean <- clean_text(twitterSub)
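The top-10 lists later in this report include tokens such as ’s and dont. These suggest that curly apostrophes survive the punctuation step and that contractions lose their apostrophe before the stop word filter sees them. The sketch below shows one possible reordering using base R only. It is an illustration under stated assumptions, not the cleaning used for the results in this report: the stop list is a made-up miniature and all function names are hypothetical.

```r
# Hypothetical miniature stop list; a real list would be much longer.
stop_list <- c("the", "i", "don't", "it's")

# Replace the curly apostrophe (U+2019) with a straight one.
normalize_quotes <- function(x) gsub("\u2019", "'", x, fixed = TRUE)

# Remove whole-word matches; assumes the words contain no regex metacharacters.
remove_stop_words <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, "", x, perl = TRUE)
}

clean_reordered <- function(x, words = stop_list) {
  x <- normalize_quotes(x)              # 1. straighten apostrophes
  x <- tolower(x)                       # 2. lower case
  x <- remove_stop_words(x, words)      # 3. stop words, while "don't" still matches
  x <- gsub("[[:punct:]]+", "", x)      # 4. punctuation
  x <- gsub("[[:digit:]]+", "", x)      # 5. numbers
  x <- trimws(gsub("\\s+", " ", x))     # 6. collapse and trim whitespace
  x[x != ""]
}

clean_reordered(c("I don\u2019t like the gym", "It\u2019s 5 o'clock"))
# returns "like gym" "oclock"
```

Removing stop words before punctuation means contractions are matched in their original form, and collapsing whitespace last avoids the gaps left behind by removed words.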

Exploratory Data Analysis

Let's take a look at the top 10 words from each of the three sources.

# Function to get top 10 most frequent words
top_10_words <- function(text) {
  text_corpus <- Corpus(VectorSource(text))
  dtm <- DocumentTermMatrix(text_corpus)
  word_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
  return(head(word_freq, 10))
}

# Get top 10 words for each dataset
top_10_blogs <- top_10_words(blogs_clean)
top_10_news <- top_10_words(news_clean)
top_10_twitter <- top_10_words(twitter_clean)

# Print top 10 words
print(top_10_blogs)
##  one will like just  can time   ’s  get also know 
## 1275 1112 1056 1007  927  849  717  687  587  576
print(top_10_news)
##   said   will    one    new  state   also   year    can   time people 
##    198     90     74     57     49     48     46     45     45     42
print(top_10_twitter)
##   just   like    get   love   good   will    day    can thanks   dont 
##   1510   1168   1135   1066    981    980    959    894    893    872
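Converting the document-term matrix with as.matrix() builds a dense documents-by-terms matrix, which may be one reason larger samples were hard to process on a personal machine. A lighter alternative, sketched here under that assumption, counts tokens directly with table(). `word_counts` is a hypothetical helper, and its results may not match the figures above exactly because DocumentTermMatrix applies its own tokenisation and, by default, a minimum word length.

```r
# Hypothetical helper: top-n token counts without a document-term matrix.
word_counts <- function(text, n = 10, min_chars = 3) {
  tokens <- unlist(strsplit(text, "\\s+"))
  tokens <- tokens[nchar(tokens) >= min_chars]  # also drops empty strings
  head(sort(table(tokens), decreasing = TRUE), n)
}

# Toy data for illustration
toy <- c("apple banana apple", "banana apple kiwi")
word_counts(toy, n = 2)
```

On the cleaned samples this could be called as word_counts(blogs_clean), and the result can be passed to the plotting function after converting it with something like setNames(as.integer(x), names(x)).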

Word Frequencies

Now that we know the top 10 words, let's plot how many times each was used in the sample data.

# Function to plot word frequencies
plot_word_freq <- function(word_freq, title) {
  word_freq_df <- data.frame(
    word = names(word_freq),
    freq = word_freq
  )
  
  ggplot(word_freq_df, aes(x = reorder(word, freq), y = freq)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    labs(title = title, x = "Words", y = "Frequency") +
    theme_minimal()
}

# Plot word frequencies
plot_word_freq(top_10_blogs, "Top 10 Words in Blogs")

plot_word_freq(top_10_news, "Top 10 Words in News")

plot_word_freq(top_10_twitter, "Top 10 Words in Twitter")

Word Cloud

# Exploratory data analysis: word clouds
wordcloud_data_blogs <- unlist(strsplit(blogs_clean, "\\s+"))
wordcloud(wordcloud_data_blogs, max.words = 100, random.order = FALSE)
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

wordcloud_data_news <- unlist(strsplit(news_clean, "\\s+"))
wordcloud(wordcloud_data_news, max.words = 100, random.order = FALSE)
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents

wordcloud_data_twitter <- unlist(strsplit(twitter_clean, "\\s+"))
wordcloud(wordcloud_data_twitter, max.words = 100, random.order = FALSE)
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
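The "transformation drops documents" warnings appear because wordcloud() is called without a freq argument, in which case it builds a tm corpus from the tokens and cleans it again internally. Supplying precomputed frequencies avoids that extra pass. The following is a minimal sketch on toy data, with the plotting call guarded in case the package is not installed.

```r
# Toy tokens standing in for unlist(strsplit(blogs_clean, "\\s+"))
tokens <- c("gym", "store", "gym", "restaurant", "gym", "store")

# Count each token once, most frequent first
freq <- sort(table(tokens), decreasing = TRUE)

if (requireNamespace("wordcloud", quietly = TRUE)) {
  # With words and freq both given, no corpus is built internally
  wordcloud::wordcloud(words = names(freq), freq = as.integer(freq),
                       min.freq = 1, max.words = 100, random.order = FALSE)
}
```

min.freq = 1 is set only so that the tiny toy example draws every word; on the real samples the default threshold may be more appropriate.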

Findings

I tried using 20%, 10% and 5% of the data as the sample, but my personal PC was not able to handle these sizes, so I used a sample size of 1%. The bar plots and word clouds above highlight the most frequently used words. In the next steps, we will build a prediction model and test it. The model will then be fine-tuned to improve performance, after which a Shiny app will be deployed to return a prediction of the next word based on the word(s) entered by the user.
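As an illustration of what the prediction step might look like, the sketch below counts word pairs and returns the most frequent followers of a given word. This report does not commit to a particular model, so the bigram approach, the function names and the toy sentences are all assumptions made for the example.

```r
# Hypothetical helper: count adjacent word pairs in a character vector.
build_bigram_table <- function(lines) {
  tokens <- strsplit(tolower(lines), "\\s+")
  pairs <- lapply(tokens, function(tok) {
    tok <- tok[tok != ""]
    if (length(tok) < 2) return(NULL)
    data.frame(first = head(tok, -1), second = tail(tok, -1),
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, pairs)
  counts <- aggregate(n ~ first + second,
                      data = transform(pairs, n = 1L), FUN = sum)
  # Most frequent follower first within each leading word
  counts[order(counts$first, -counts$n), ]
}

# Hypothetical helper: up to k most frequent words seen after `word`.
predict_next <- function(bigrams, word, k = 3) {
  hits <- bigrams[bigrams$first == tolower(word), ]
  head(as.character(hits$second), k)
}

toy <- c("i went to the gym", "i went to the store", "we went to the gym")
bigrams <- build_bigram_table(toy)
predict_next(bigrams, "the")
# returns "gym" "store"
```

A model built this way only looks at the previous word. Longer contexts and handling of unseen words would need to be addressed before it could back a Shiny app.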