Getting the data
Load required libraries
library(plyr)
library(dplyr)
library(knitr)
library(tm)
Check if data already exists (no downloading during knit)
data_path <- "./project/final/en_US"
if (!dir.exists(data_path)) {
  stop("Data not found. Please download and unzip the Coursera-SwiftKey dataset manually.")
}
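If the data set is not present, it can be fetched once outside the knit step (so the report itself never downloads anything). A minimal sketch, assuming the commonly used Coursera-SwiftKey download URL and the directory layout defined above:
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"   # assumed standard capstone URL
if (!dir.exists(data_path)) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")   # run once, interactively
  unzip("Coursera-SwiftKey.zip", exdir = "./project")                   # the zip extracts to final/en_US, matching data_path
}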
Once the dataset is downloaded, we start reading it. Since this is a huge dataset, we will read it line by line and load only the amount of data we need. Before doing that, let's first list all the files in the /final/en_US dataset folder. The data sets consist of text from three different sources: 1) News, 2) Blogs, and 3) Twitter feeds. In this project, we will focus only on the English (US) data sets.
path <- file.path("./project/final", "en_US")
files <- list.files(path, recursive = TRUE)
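While prototyping, the reads can be capped so that only part of a file is loaded at a time; a minimal sketch reading just the first 1,000 lines of the blogs file (the cap is an arbitrary choice for illustration; the full files are read below for the summary statistics):
con <- file("./project/final/en_US/en_US.blogs.txt", "r")
preview <- readLines(con, n = 1000, skipNul = TRUE)   # read only the first 1,000 lines
close(con)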
Let's make a file connection to the Twitter data set
con <- file("./project/final/en_US/en_US.twitter.txt", "r")
lineTwitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
Close the connection handle when you are done
close(con)
Let's make a file connection to the blogs data set
con <- file("./project/final/en_US/en_US.blogs.txt", "r")
lineBlogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
Close the connection handle when you are done
close(con)
Let's make a file connection to the news data set
con <- file("./project/final/en_US/en_US.news.txt", "r")
lineNews <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
Close the connection handle when you are done
close(con)
We examine the data sets and summarize our findings (file sizes, line counts, word counts, and mean words per line) below.
library(stringi)
Get file sizes
lineBlogs.size <- file.info("./project/final/en_US/en_US.blogs.txt")$size / 1024^2
lineNews.size <- file.info("./project/final/en_US/en_US.news.txt")$size / 1024^2
lineTwitter.size <- file.info("./project/final/en_US/en_US.twitter.txt")$size / 1024^2
Get words in files
lineBlogs.words <- stri_count_words(lineBlogs)
lineNews.words <- stri_count_words(lineNews)
lineTwitter.words <- stri_count_words(lineTwitter)
Summary of the data sets
data.frame(source = c("blogs", "news", "twitter"),
           file.size.MB = c(lineBlogs.size, lineNews.size, lineTwitter.size),
           num.lines = c(length(lineBlogs), length(lineNews), length(lineTwitter)),
           num.words = c(sum(lineBlogs.words), sum(lineNews.words), sum(lineTwitter.words)),
           mean.num.words = c(mean(lineBlogs.words), mean(lineNews.words), mean(lineTwitter.words)))
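For a cleaner table in the knitted report, the same summary could be rendered with knitr::kable() (knitr is already loaded above); a minimal sketch that stores the summary first:
summary_df <- data.frame(source = c("blogs", "news", "twitter"),
                         file.size.MB = c(lineBlogs.size, lineNews.size, lineTwitter.size),
                         num.lines = c(length(lineBlogs), length(lineNews), length(lineTwitter)),
                         num.words = c(sum(lineBlogs.words), sum(lineNews.words), sum(lineTwitter.words)),
                         mean.num.words = c(mean(lineBlogs.words), mean(lineNews.words), mean(lineTwitter.words)))
kable(summary_df, digits = 2, caption = "Summary of the en_US data sets")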
Cleaning The Data
Before performing exploratory analysis, we must clean the data. This involves removing URLs, special characters, punctuation, numbers, excess whitespace, and stopwords, and converting the text to lower case. Since the data sets are quite large, we will randomly sample 2% of the data to demonstrate the data cleaning and exploratory analysis. We also need to take care of stray non-UTF-8 characters; see the sketch after the sampling step.
library(tm)
Sample the data
set.seed(5000)
data.sample <- c(sample(lineBlogs, length(lineBlogs) * 0.02),
                 sample(lineNews, length(lineNews) * 0.02),
                 sample(lineTwitter, length(lineTwitter) * 0.02))
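As noted above, stray non-UTF-8 characters can trip up the tm transformations. One simple (and deliberately aggressive) way to handle them is to drop anything that does not survive conversion to ASCII; a minimal sketch:
data.sample <- iconv(data.sample, from = "UTF-8", to = "ASCII", sub = "")   # drop characters outside ASCII
data.sample <- data.sample[data.sample != ""]                               # discard lines left empty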
Create corpus and clean the data
corpus <- VCorpus(VectorSource(data.sample))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
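To check that the transformations behaved as expected, a few cleaned documents can be printed; a minimal sketch:
writeLines(as.character(corpus[[1]]))   # first cleaned document
writeLines(as.character(corpus[[2]]))   # second cleaned document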
Exploratory Analysis
Now, it's time to do some exploratory analysis on the data. It would be interesting and helpful to find the most frequently occurring words in the data. We first look at the distribution of words per line, and then at the most common n-grams: uni-grams, bi-grams, and tri-grams (see the sketch after the plot).
library(ggplot2)
sample_words <- c(stri_count_words(sample(lineBlogs, 5000)),
                  stri_count_words(sample(lineNews, 5000)),
                  stri_count_words(sample(lineTwitter, 5000)))
ggplot(data.frame(words = sample_words), aes(x = words)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "black") +
  labs(title = "Distribution of Words per Line",
       x = "Number of Words",
       y = "Frequency")
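The n-gram counts mentioned above can be computed from the cleaned corpus with term-document matrices. A minimal sketch using the bigram tokenizer from the tm FAQ (the tri-gram version only changes the 2 to a 3; the frequency thresholds are arbitrary choices for illustration):
BigramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
tdm_uni <- TermDocumentMatrix(corpus)
tdm_bi <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
findFreqTerms(tdm_uni, lowfreq = 500)   # uni-grams appearing at least 500 times
findFreqTerms(tdm_bi, lowfreq = 50)     # bi-grams appearing at least 50 times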
Conclusion and Further Planning
This exploratory analysis confirms that the training data is large
and varied, containing text from blogs, news articles, and social media.
Each source has different writing styles and sentence lengths, which
must be considered when building a prediction model.
The next stage of this project will involve creating an n-gram based
prediction algorithm with a back-off strategy. This algorithm will then
be integrated into a Shiny application that allows users to enter text
and receive next-word suggestions in real time.
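As a rough illustration of the planned back-off idea (not the final implementation), the lookup can fall back from a tri-gram table to a bi-gram table and finally to the overall most frequent word. A minimal sketch, assuming hypothetical named lists tri_freq and bi_freq that map a two-word or one-word prefix to its most likely next word, and uni_freq holding the most frequent words overall:
predict_next <- function(phrase, tri_freq, bi_freq, uni_freq) {
  tokens <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]
  tokens <- tokens[tokens != ""]
  if (length(tokens) >= 2) {
    key <- paste(tail(tokens, 2), collapse = " ")
    if (!is.null(tri_freq[[key]])) return(tri_freq[[key]])   # tri-gram hit
  }
  if (length(tokens) >= 1) {
    key <- tail(tokens, 1)
    if (!is.null(bi_freq[[key]])) return(bi_freq[[key]])     # back off to the bi-gram table
  }
  uni_freq[[1]]                                              # final back-off: most frequent word overall
}
The real lookup tables would be built from the n-gram frequencies computed during the exploratory analysis.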