Executive Summary

This milestone report presents an initial exploration of the text data used in the Coursera Data Science Capstone project. The purpose of this analysis is to understand the structure, size, and characteristics of the data before building a text prediction model. The report summarizes key statistics of the datasets and outlines the planned approach for developing a predictive algorithm and an interactive Shiny application.

Getting the Data

Load required libraries

library(plyr) 
library(dplyr) 
library(knitr) 
library(tm)

Check if data already exists (no downloading during knit)

data_path <-"./project/final/en_US"
if (!dir.exists(data_path)) { stop("Data not found. Please download and unzip the Coursera-SwiftKey dataset manually.") }
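
As a one-time manual setup (outside of knitting), the archive can be fetched and unpacked along the lines of the sketch below; the URL is the one commonly used for the Coursera-SwiftKey dataset and the paths are assumptions that should match your project layout.

zip_url  <- "https://d396qusza40orc.cloudfront.net/dscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "./project/Coursera-SwiftKey.zip"
if (!dir.exists(data_path)) {
  dir.create("./project", showWarnings = FALSE)
  download.file(zip_url, destfile = zip_file, mode = "wb")  # binary mode for the zip archive
  unzip(zip_file, exdir = "./project")                      # archive is assumed to contain final/en_US
}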

Once the dataset is downloaded, we can start reading it. Since the files are large, we read them line by line and later work with only a small sample for the analysis. Before doing that, let's first list all the files in the /final/en_US folder. The data sets consist of text from three different sources: 1) News, 2) Blogs, and 3) Twitter feeds. In this project, we will only focus on the English (US) data sets.

path <- file.path("./project/final", "en_US")
files <- list.files(path, recursive = TRUE)

Let's make a file connection for the Twitter data set

con <- file("./project/final/en_US/en_US.twitter.txt", "r")

lineTwitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)

Close the connection handle when you are done

close(con)

Let's make a file connection for the blog data set

con <- file("./project/final/en_US/en_US.blogs.txt", "r")

lineBlogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)

Close the connection handle when you are done

close(con)

Let's make a file connection for the news data set

con <- file("./project/final/en_US/en_US.news.txt", "r")

lineNews <- readLines(con, encoding = "UTF-8", skipNul = TRUE)

Close the connection handle when you are done

close(con)
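
Reading the complete files uses a fair amount of memory. If that becomes an issue, only the first part of a file can be read by passing the n argument to readLines; the line count and variable name below are purely illustrative.

con <- file("./project/final/en_US/en_US.twitter.txt", "r")
lineTwitterPartial <- readLines(con, n = 100000, encoding = "UTF-8", skipNul = TRUE)  # first 100,000 lines only
close(con)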

We examined the data sets and summarize our findings (file sizes, line counts, word counts, and mean words per line) below.

library(stringi)

Get file sizes

lineBlogs.size <- file.info("./project/final/en_US/en_US.blogs.txt")$size / 1024^2
lineNews.size <- file.info("./project/final/en_US/en_US.news.txt")$size / 1024^2
lineTwitter.size <- file.info("./project/final/en_US/en_US.twitter.txt")$size / 1024^2

Get words in files

lineBlogs.words <- stri_count_words(lineBlogs)
lineNews.words <- stri_count_words(lineNews)
lineTwitter.words <- stri_count_words(lineTwitter)

Summary of the data sets

data.frame(source = c("blogs", "news", "twitter"),
file.size.MB = c(lineBlogs.size, lineNews.size, lineTwitter.size),
num.lines = c(length(lineBlogs), length(lineNews), length(lineTwitter)),
num.words = c(sum(lineBlogs.words), sum(lineNews.words),
sum(lineTwitter.words)),
mean.num.words = c(mean(lineBlogs.words), mean(lineNews.words),
mean(lineTwitter.words)))
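
Since knitr is already loaded, the same summary could also be stored in a data frame and rendered as a formatted table in the knitted report; summary_df is just an illustrative name.

summary_df <- data.frame(source = c("blogs", "news", "twitter"),
                         file.size.MB = c(lineBlogs.size, lineNews.size, lineTwitter.size),
                         num.lines = c(length(lineBlogs), length(lineNews), length(lineTwitter)),
                         num.words = c(sum(lineBlogs.words), sum(lineNews.words), sum(lineTwitter.words)),
                         mean.num.words = c(mean(lineBlogs.words), mean(lineNews.words), mean(lineTwitter.words)))
knitr::kable(summary_df, digits = 1, caption = "Summary of the en_US data sets")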

Cleaning the Data

Before performing exploratory analysis, we must first clean the data. This involves removing URLs, special characters, punctuation, numbers, excess whitespace, and stopwords, and converting the text to lower case. Since the data sets are quite large, we randomly sample 2% of the data to demonstrate the cleaning and exploratory analysis. Non-ASCII (UTF-8) characters also need attention; a sketch for handling them follows the cleaning code below.

library(tm)

Sample the data

set.seed(5000)
data.sample <- c(sample(lineBlogs, length(lineBlogs) * 0.02),
                 sample(lineNews, length(lineNews) * 0.02),
                 sample(lineTwitter, length(lineTwitter) * 0.02))

Create corpus and clean the data

corpus <- VCorpus(VectorSource(data.sample))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
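
The cleaning steps above do not explicitly handle non-ASCII characters. One simple option, assuming stray UTF-8 symbols (emoji, curly quotes, etc.) can simply be dropped, is to convert the sample to ASCII before the VCorpus call above:

# Drop characters that cannot be represented in ASCII; run this before building the corpus
data.sample <- iconv(data.sample, from = "UTF-8", to = "ASCII", sub = "")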

Exploratory Analysis

Now it's time to do some exploratory analysis of the data. It would be interesting and helpful to find the most frequently occurring words in the data. Below we look at the distribution of words per line in the sampled sources; a sketch for listing the most common uni-grams, bi-grams, and tri-grams follows the plot.

library(ggplot2)
sample_words <- c(stri_count_words(sample(lineBlogs, 5000)),
                  stri_count_words(sample(lineNews, 5000)),
                  stri_count_words(sample(lineTwitter, 5000)))
ggplot(data.frame(words = sample_words), aes(x = words)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "black") +
  labs(title = "Distribution of Words per Line",
       x = "Number of Words",
       y = "Frequency")

Conclusion and Further Planning

This exploratory analysis confirms that the training data is large and varied, containing text from blogs, news articles, and social media. Each source has different writing styles and sentence lengths, which must be considered when building a prediction model.

The next stage of this project will involve creating an n-gram based prediction algorithm with a back-off strategy. This algorithm will then be integrated into a Shiny application that allows users to enter text and receive next-word suggestions in real time.
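
As a rough illustration of the intended back-off idea (not the final implementation), the lookup could work as sketched below; the table and function names are hypothetical, and the frequency tables would be pre-computed from n-gram counts like those above.

# Hypothetical pre-computed tables: trigram_tab and bigram_tab map a prefix (two words or one word)
# to a named vector of next-word counts; unigram_tab is a named vector of overall word counts.
predict_next <- function(phrase, trigram_tab, bigram_tab, unigram_tab, top = 3) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  hits <- trigram_tab[[paste(words, collapse = " ")]]      # try the two-word prefix first
  if (is.null(hits)) hits <- bigram_tab[[tail(words, 1)]]  # back off to the last word
  if (is.null(hits)) hits <- unigram_tab                   # back off to overall word frequencies
  names(head(sort(hits, decreasing = TRUE), top))
}

This mirrors a simple back-off strategy: use the longest matching context available and fall back to shorter contexts, and finally to overall word frequencies, when no match is found.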