Introduction

This project aims to develop a predictive text model from a large, unstructured database of the English language. The body of text comes mainly from blogs, news, and Twitter. The data is obtained from a corpus called HC Corpora; the files have been language-filtered but still contain some foreign text.

The method adopted is to analyze the large corpus of text documents to discover structure in the data and how words are put together. A step-wise approach is used: the text data is first cleaned and analyzed, and a predictive text model is then built and sampled from.

The data is downloaded from the source URL and unzipped.

source.url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

# Download and unzip the file if not already available
if (!file.exists("Coursera-SwiftKey.zip")) {
        download.file(source.url, "Coursera-SwiftKey.zip")
        unzip("./Coursera-SwiftKey.zip")
}

Each data set, from blogs, news, and Twitter, is loaded into R with warnings suppressed and embedded nulls skipped where applicable.

# Read each data set into R
data.blogs <- readLines("./final/en_US/en_US.blogs.txt")
data.news <- readLines("./final/en_US/en_US.news.txt", warn = FALSE)
data.twitter <- readLines("./final/en_US/en_US.twitter.txt", skipNul = TRUE)

Data cleaning and transformations

Each of the data sets is converted into a data frame to allow for easy manipulation.

# Convert each character vector to a data frame, keeping the text as character
data.blogs <- data.frame(data.blogs, stringsAsFactors = FALSE)
data.news <- data.frame(data.news, stringsAsFactors = FALSE)
data.twitter <- data.frame(data.twitter, stringsAsFactors = FALSE)

Each data set is marked as blogs, news, or twitter in a column named code, and the three are then row-combined into a single data frame.

data.blogs$code <- "blogs"
data.news$code <- "news"
data.twitter$code <- "twitter"

names(data.blogs) <- c("text", "code")
names(data.news) <- c("text", "code")
names(data.twitter) <- c("text", "code")

data <- rbind.data.frame(data.blogs, data.news, data.twitter)

Additional columns holding the character length and word count of each entry are added to the data frame.

# Count characters and words per entry (splitting on spaces gives a rough word count)
data$num.char <- nchar(data$text)
data$num.words <- sapply(strsplit(data$text, " "), length)

Exploratory data analysis

The code column is converted to a factor, and a summary of the data is extracted.

data$code <- as.factor(data$code)
summary(data)
       text                code            num.char         num.words      
   Length:3336695     blogs  : 899288   Min.   :    1.0   Min.   :   1.00  
   Class :character   news   :  77259   1st Qu.:   39.0   1st Qu.:   7.00  
   Mode  :character   twitter:2360148   Median :   73.0   Median :  14.00  
                                        Mean   :  115.8   Mean   :  21.08  
                                        3rd Qu.:  122.0   3rd Qu.:  22.00  
                                        Max.   :40835.0   Max.   :6630.00

The distributions of the character lengths and word counts are visualized for each of the three sources.
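A minimal sketch of how these plots could be produced with ggplot2 is shown below; the choice of package, the bin count, and the log-scaled x axis are assumptions made here, the last because both measures are heavily right-skewed.

library(ggplot2)

# Character lengths per source (log scale assumed, to handle heavy right skew)
ggplot(data, aes(x = num.char)) +
        geom_histogram(bins = 50) +
        scale_x_log10() +
        facet_wrap(~ code, scales = "free_y") +
        labs(x = "Characters per entry", y = "Count")

# Word counts per source
ggplot(data, aes(x = num.words)) +
        geom_histogram(bins = 50) +
        scale_x_log10() +
        facet_wrap(~ code, scales = "free_y") +
        labs(x = "Words per entry", y = "Count")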

Conclusion

The distribution of character lengths closely mirrors the distribution of word counts across the sources of data. However, the word-count distribution shows a cluster-like pattern that is clearly missing from the character-length distribution. These clusters are likely shortened forms of salutations and other commonly used expressions that users adopt for typing convenience, and they may require further investigation going forward.

Beyond this stage, the data will be prepared and split into training and testing samples for building and testing a predictive text model.
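As a sketch of that split, a simple random sample could be used. The 80/20 proportion, the seed, and the object names below are illustrative assumptions rather than final choices.

# Assumed 80/20 random split into training and testing samples
set.seed(1234)
train.idx <- sample(nrow(data), size = floor(0.8 * nrow(data)))
data.train <- data[train.idx, ]
data.test <- data[-train.idx, ]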