Synopsis

This milestone report presents an initial exploration of the text data used in the Coursera Data Science Capstone project. The purpose of this analysis is to understand the structure, size, and characteristics of the data before building a text prediction model.

The report summarizes key statistics of the datasets and outlines the planned approach for developing a predictive algorithm and an interactive Shiny application.

Getting the data

## Load required libraries
library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr)
library(tm)
## Loading required package: NLP
## Check if data already exists (no downloading during knit)
data_path <- "./projectData/final/en_US"

if (!dir.exists(data_path)) {
  stop("Data not found. Please download and unzip the Coursera-SwiftKey dataset manually.")
}
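
For reference, fetching and unpacking the archive can be done once outside of knitting. The helper below is only a sketch: the function name is ours, and the download URL must be supplied from the course page (it is not hard-coded here).

# Hypothetical helper: download and unzip the Coursera-SwiftKey archive once,
# outside of knitting. Pass the download link provided on the course page.
fetch_swiftkey_data <- function(zip_url, dest_dir = "./projectData") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  zip_file <- file.path(dest_dir, "Coursera-SwiftKey.zip")
  if (!file.exists(zip_file)) {
    download.file(zip_url, destfile = zip_file, mode = "wb")
  }
  unzip(zip_file, exdir = dest_dir)  # expected to produce final/en_US/ among other folders
}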

Once the dataset is in place we can start reading it. Because the files are large, we read them through file connections, which also makes it easy to load only a limited number of lines if needed. Before doing that, we first list all the files in the final/en_US folder. The data sets consist of text from three different sources: 1) News, 2) Blogs, and 3) Twitter feeds. In this project we focus only on the English (US) data sets.

path <- file.path("./projectData/final", "en_US")
files <- list.files(path, recursive = TRUE)
# Let's make a file connection to the twitter data set
con <- file("./projectData/final/en_US/en_US.twitter.txt", "r") 
#lineTwitter<-readLines(con,encoding = "UTF-8", skipNul = TRUE)
lineTwitter<-readLines(con, skipNul = TRUE)
# Close the connection handle when you are done
close(con)
# Let's make a file connection to the blogs data set
con <- file("./projectData/final/en_US/en_US.blogs.txt", "r") 
#lineBlogs<-readLines(con,encoding = "UTF-8", skipNul = TRUE)
lineBlogs<-readLines(con, skipNul = TRUE)
# Close the connection handle when you are done
close(con)
# Let's make a file connection to the news data set
con <- file("./projectData/final/en_US/en_US.news.txt", "r") 
#lineNews<-readLines(con,encoding = "UTF-8", skipNul = TRUE)
lineNews<-readLines(con, skipNul = TRUE)
# Close the connection handle when you are done
close(con)
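
If memory ever becomes a constraint, readLines() can also be capped at a fixed number of lines per connection instead of reading whole files. The helper below is only a sketch; the 100,000-line default is an arbitrary illustrative value, not a project requirement.

# Sketch: read only the first n lines of a file (the cap is an arbitrary illustrative value)
read_some_lines <- function(file_path, n_lines = 100000) {
  con <- file(file_path, "r")
  on.exit(close(con))  # always release the connection handle
  readLines(con, n = n_lines, skipNul = TRUE)
}
# e.g. lineTwitterSample <- read_some_lines("./projectData/final/en_US/en_US.twitter.txt")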

We examine the data sets and summarize our findings (file sizes, line counts, word counts, and mean words per line) below.

library(stringi)
# Get file sizes
lineBlogs.size <- file.info("./projectData/final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
lineNews.size <- file.info("./projectData/final/en_US/en_US.news.txt")$size / 1024 ^ 2
lineTwitter.size <- file.info("./projectData/final/en_US/en_US.twitter.txt")$size / 1024 ^ 2

# Get words in files
lineBlogs.words <- stri_count_words(lineBlogs)
lineNews.words <- stri_count_words(lineNews)
lineTwitter.words <- stri_count_words(lineTwitter)

# Summary of the data sets
data.frame(source = c("blogs", "news", "twitter"),
           file.size.MB = c(lineBlogs.size, lineNews.size, lineTwitter.size),
           num.lines = c(length(lineBlogs), length(lineNews), length(lineTwitter)),
           num.words = c(sum(lineBlogs.words), sum(lineNews.words), sum(lineTwitter.words)),
           mean.num.words = c(mean(lineBlogs.words), mean(lineNews.words), mean(lineTwitter.words)))
##    source file.size.MB num.lines num.words mean.num.words
## 1   blogs     200.4242    899288  37546806       41.75170
## 2    news     196.2775   1010206  34761151       34.40996
## 3 twitter     159.3641   2360148  30096690       12.75204
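
Since knitr is already loaded, the same summary can also be rendered as a formatted table. This is purely presentational; the rounding below is our choice.

# Optional: present the summary with knitr::kable for a cleaner table in the report
summary_df <- data.frame(
  source = c("blogs", "news", "twitter"),
  file.size.MB = round(c(lineBlogs.size, lineNews.size, lineTwitter.size), 1),
  num.lines = c(length(lineBlogs), length(lineNews), length(lineTwitter)),
  num.words = c(sum(lineBlogs.words), sum(lineNews.words), sum(lineTwitter.words)),
  mean.num.words = round(c(mean(lineBlogs.words), mean(lineNews.words), mean(lineTwitter.words)), 2)
)
kable(summary_df, caption = "Summary of the en_US data sets")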

Cleaning The Data

Before performing exploratory analysis, we must first clean the data. This involves removing URLs, Twitter handles and other special characters, punctuation, numbers, excess whitespace, and stopwords, and converting the text to lower case. Since the data sets are quite large, we randomly sample 2% of each source to demonstrate the cleaning and exploratory analysis. Non-ASCII (UTF-8) characters also need to be handled with care at this stage.

library(tm)
# Sample the data
set.seed(5000)
data.sample <- c(sample(lineBlogs, length(lineBlogs) * 0.02),
                 sample(lineNews, length(lineNews) * 0.02),
                 sample(lineTwitter, length(lineTwitter) * 0.02))

# Create corpus and clean the data
corpus <- VCorpus(VectorSource(data.sample))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
corpus <- tm_map(corpus, content_transformer(tolower))  # content_transformer keeps valid PlainTextDocuments
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

Exploratory Analysis

Now it is time to do some exploratory analysis on the data. It would be interesting and helpful to find the most frequently occurring words and n-grams (uni-grams, bi-grams, and tri-grams). We start with the distribution of words per line across the three sources; a sketch of the n-gram frequency counts follows the plot.

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
sample_words <- c(
  stri_count_words(sample(lineBlogs, 5000)),
  stri_count_words(sample(lineNews, 5000)),
  stri_count_words(sample(lineTwitter, 5000))
)

ggplot(data.frame(words = sample_words), aes(x = words)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "black") +
  labs(
    title = "Distribution of Words per Line",
    x = "Number of Words",
    y = "Frequency"
  )
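
To get at the most common uni-, bi-, and tri-grams mentioned above, the cleaned sample can be tokenized and counted. The sketch below uses a simple base-R tokenizer (the function name and the top-10 cutoff are our choices); dedicated tokenizers such as those in RWeka or tidytext would work equally well.

# Sketch: most frequent n-grams in the cleaned sample (n-grams do not cross document boundaries)
corpus_text <- vapply(corpus, function(d) paste(content(d), collapse = " "), character(1))

ngram_freq <- function(texts, n, top = 10) {
  grams <- unlist(lapply(texts, function(txt) {
    tokens <- unlist(strsplit(txt, "\\s+"))
    tokens <- tokens[tokens != ""]
    if (length(tokens) < n) return(character(0))
    vapply(seq_len(length(tokens) - n + 1),
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  freq <- sort(table(grams), decreasing = TRUE)
  head(data.frame(ngram = names(freq), freq = as.integer(freq), row.names = NULL), top)
}

unigram.top <- ngram_freq(corpus_text, 1)
bigram.top  <- ngram_freq(corpus_text, 2)
trigram.top <- ngram_freq(corpus_text, 3)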

Conclusion and Further Planning

This exploratory analysis confirms that the training data is large and varied, containing text from blogs, news articles, and social media. Each source has different writing styles and sentence lengths, which must be considered when building a prediction model.

The next stage of this project will involve creating an n-gram based prediction algorithm with a back-off strategy. This algorithm will then be integrated into a Shiny application that allows users to enter text and receive next-word suggestions in real time.
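
As a preview of that back-off idea, here is a minimal sketch of a next-word lookup over pre-computed n-gram frequency tables. The table layout (columns prefix, word, freq) and the function name are illustrative assumptions, not the final design.

# Sketch of a back-off style next-word lookup over assumed frequency tables
# (each table: columns prefix, word, freq; built from n-gram counts like those above)
predict_next_word <- function(phrase, trigram_tab, bigram_tab, unigram_tab, top = 3) {
  tokens <- tolower(unlist(strsplit(phrase, "\\s+")))
  tokens <- tokens[tokens != ""]
  # 1) try the trigram table, keyed on the last two words
  if (length(tokens) >= 2) {
    prefix <- paste(tail(tokens, 2), collapse = " ")
    hits <- trigram_tab[trigram_tab$prefix == prefix, ]
    if (nrow(hits) > 0) return(head(hits$word[order(-hits$freq)], top))
  }
  # 2) back off to the bigram table, keyed on the last word
  if (length(tokens) >= 1) {
    prefix <- tail(tokens, 1)
    hits <- bigram_tab[bigram_tab$prefix == prefix, ]
    if (nrow(hits) > 0) return(head(hits$word[order(-hits$freq)], top))
  }
  # 3) otherwise fall back to the most frequent unigrams
  head(unigram_tab$word[order(-unigram_tab$freq)], top)
}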