This project aims to develop a predictive text model from a large, unstructured corpus of English text. The text comes mainly from blogs, news articles and Twitter. The data is obtained from a corpus called HC Corpora; the files have been language-filtered but still contain some foreign text.
The approach is to analyze the corpus to discover the structure in the data and how words are put together. The work proceeds step by step: cleaning and analyzing the text data, then building and sampling from a predictive text model.
The data is downloaded from source and unzipped.
source.url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Download the archive only if it is not already available
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(source.url, "Coursera-SwiftKey.zip", mode = "wb")
}
# Unzip only if the extracted folder is not already present
if (!dir.exists("./final")) {
  unzip("./Coursera-SwiftKey.zip")
}
Each data set (blogs, news and Twitter) is loaded into R, with warnings suppressed and embedded nulls skipped where applicable.
# Read data into R
data.blogs <- readLines("./final/en_US/en_US.blogs.txt")
data.news <- readLines("./final/en_US/en_US.news.txt", warn = FALSE)
data.twitter <- readLines("./final/en_US/en_US.twitter.txt", skipNul = TRUE)
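The line count reported for the news file in the summary further down is far lower than for the other two sources. One possible cause, which is an assumption and not something the source confirms, is that a text-mode read stopped early at an embedded control character, as can happen on Windows. A minimal sketch of reading through a binary-mode connection, using a hypothetical helper `read_lines_binary`:

```r
# Hypothetical helper (not part of the original analysis): read a text file
# through a binary-mode connection so that embedded control characters are
# not treated as end-of-file on some platforms.
read_lines_binary <- function(path, encoding = "UTF-8") {
  con <- file(path, open = "rb")
  on.exit(close(con))  # always release the connection
  readLines(con, encoding = encoding, skipNul = TRUE, warn = FALSE)
}

# Example usage, assuming the same paths as above:
# data.news <- read_lines_binary("./final/en_US/en_US.news.txt")
```

If the counts match those from the plain `readLines` calls, the text-mode read was not truncating anything and the original code can be kept as is.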
Each of the data sets is converted into a data frame to allow for easy manipulation.
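# Convert each character vector to a data frame; keep the text as character
# (stringsAsFactors = FALSE is the default only from R 4.0.0 onward)
data.blogs <- data.frame(data.blogs, stringsAsFactors = FALSE)
data.news <- data.frame(data.news, stringsAsFactors = FALSE)
data.twitter <- data.frame(data.twitter, stringsAsFactors = FALSE)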
Each data set is labeled as blogs, news or twitter in a column named code, and the three are row-combined into a single data frame.
data.blogs$code <- "blogs"
data.news$code <- "news"
data.twitter$code <- "twitter"
names(data.blogs) <- c("text", "code")
names(data.news) <- c("text", "code")
names(data.twitter) <- c("text", "code")
data <- rbind.data.frame(data.blogs,data.news, data.twitter)
Additional columns holding the number of characters and the number of words in each line are added to the data frame.
# Count the number of characters and words per line
data$num.char <- nchar(data$text)
data$num.words <- sapply(strsplit(data$text, " "), length)
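Splitting on a single space produces an empty token for every repeated space, so lines with irregular spacing are over-counted. A minimal sketch of a more tolerant count, using a hypothetical helper `count_words` that splits on runs of whitespace:

```r
# Hypothetical helper: count whitespace-delimited tokens per line.
count_words <- function(x) {
  # Trim leading/trailing whitespace so it does not produce empty tokens
  trimmed <- trimws(x)
  # Split on runs of whitespace; an empty string yields zero tokens
  lengths(strsplit(trimmed, "\\s+"))
}

# Made-up illustration data
sample.text <- c("hello world", "  two   spaces  ", "")
count_words(sample.text)
```

Whether this changes the reported word counts depends on how often the corpus contains repeated or leading spaces, which the source does not state.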
The code column is converted to a factor and a summary of the data is extracted.
data$code <- as.factor(data$code)
summary(data)
     text               code            num.char         num.words
 Length:3336695     blogs  : 899288   Min.   :    1.0   Min.   :   1.00
 Class :character   news   :  77259   1st Qu.:   39.0   1st Qu.:   7.00
 Mode  :character   twitter:2360148   Median :   73.0   Median :  14.00
                                      Mean   :  115.8   Mean   :  21.08
                                      3rd Qu.:  122.0   3rd Qu.:  22.00
                                      Max.   :40835.0   Max.   :6630.00
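The summary above pools the three sources. A per-source breakdown can be sketched with base R as follows; the small data frame `toy` is made-up illustration data, not the corpus, and only its column names follow the report.

```r
# Made-up illustration data with the same column names as the report
toy <- data.frame(
  code = factor(c("blogs", "blogs", "news", "twitter", "twitter")),
  num.char = c(120, 80, 200, 30, 50),
  num.words = c(22, 15, 35, 6, 9)
)

# Mean characters and words per line, grouped by source
per.source <- aggregate(cbind(num.char, num.words) ~ code, data = toy, FUN = mean)
per.source
```

Replacing `toy` with the combined data frame would give the same breakdown for the real data; `FUN = median` is a reasonable alternative given the long right tail.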
Visualizing the distribution of character lengths and word counts
The distribution of character lengths is similar in shape to the distribution of word counts across the data sources. However, the word-count distribution shows a cluster-like pattern that is absent from the character-length distribution. These clusters may correspond to shortened salutations and other commonly used expressions that users adopt for typing convenience; this is a hypothesis that requires further investigation.
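Plots of this kind could be produced along the following lines. This is a minimal sketch using base R graphics on made-up data, not the code behind the original figures; the log scale is an assumption chosen to cope with the long right tail.

```r
# Made-up illustration data with the same column names as the report
set.seed(1)
toy <- data.frame(
  code = factor(rep(c("blogs", "news", "twitter"), each = 50)),
  num.char = c(rpois(50, 230), rpois(50, 200), rpois(50, 70)) + 1,
  num.words = c(rpois(50, 40), rpois(50, 35), rpois(50, 13)) + 1
)

# Side-by-side box plots per source; the log scale compresses the long tail
op <- par(mfrow = c(1, 2))
boxplot(num.char ~ code, data = toy, log = "y",
        main = "Characters per line", xlab = "Source", ylab = "Characters")
boxplot(num.words ~ code, data = toy, log = "y",
        main = "Words per line", xlab = "Source", ylab = "Words")
par(op)  # restore the previous plotting layout
```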
Beyond this stage, the data will be prepared and split into training and testing samples for building and testing a predictive text model.
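A minimal sketch of such a split, assuming a simple random 80/20 partition; the proportion, the seed and the helper name `split_train_test` are illustrative choices, not taken from the source.

```r
# Hypothetical helper: randomly partition the rows of a data frame.
split_train_test <- function(df, train.frac = 0.8, seed = 123) {
  set.seed(seed)  # fixed seed so the split is reproducible
  n <- nrow(df)
  train.idx <- sample.int(n, size = floor(train.frac * n))
  test.idx <- setdiff(seq_len(n), train.idx)  # every row not in training
  list(train = df[train.idx, , drop = FALSE],
       test = df[test.idx, , drop = FALSE])
}

# Made-up illustration data
toy <- data.frame(text = paste("line", 1:10), code = "blogs")
parts <- split_train_test(toy)
nrow(parts$train)
nrow(parts$test)
```

Because the three sources differ greatly in size, sampling within each level of code (a stratified split) may be preferable to a single random draw.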