Executive Summary
SwiftKey is an input method for Android and iOS devices such as smartphones and tablets. SwiftKey uses a blend of artificial intelligence technologies that enable it to predict the next word the user intends to type. SwiftKey learns from previous SMS messages and outputs predictions based on the text currently being entered and what it has learned.
This part of the project deals with cleaning the data so that it is ready for exploration and analysis of the content, in order to determine how best to use it for predictive analysis.
Functions
Loading data
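# Packages the functions in this report appear to rely on
library(tm)         # VCorpus, DirSource, readPlain, removeNumbers, removeWords
library(qdapRegex)  # rm_email, rm_emoticon, rm_white_multiple and the other rm_* helpers
library(gsubfn)     # gsubfn
library(stringr)    # str_trim
# Download and unzip the data set if `folder` is missing, then read it into a corpus
Loading = function(folder) {
  wd = getwd()
  zfile = "Coursera-SwiftKey.zip"
  url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  folder = paste(wd, folder, sep = "/")
  zfile = paste(wd, zfile, sep = "/")
  if (!dir.exists(folder)) {
    download.file(url, zfile, mode = "wb") # binary mode keeps the zip intact on Windows
    unzip(zfile)
  }
  docs = VCorpus(DirSource(folder, mode = "text"), readerControl = list(reader = readPlain,
    language = "en"))
  docs
}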
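For a quick look at a single file without building a corpus, a base-R reader is one option. The sketch below is an assumption on my part and not part of the pipeline above: read_lines_sketch is a hypothetical helper name, and opening the connection in binary mode is a common precaution on Windows against control characters cutting a read short.

```r
# Hypothetical helper: read a text file line by line using base R only.
read_lines_sketch <- function(path, n = -1L) {
  con <- file(path, open = "rb")   # binary mode; readLines still splits on LF, CRLF or CR
  on.exit(close(con))              # always release the connection
  readLines(con, n = n, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Usage on a throwaway file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
sample_lines <- read_lines_sketch(tmp)
print(sample_lines)
```

Passing a small n (for example n = 5) returns only the first few lines, which is handy for inspecting the large en_US files.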
Tidying data for numbers, encoding, punctuation, profanity and space trimming
clean = function(data) {
  data = rm_email(data) # remove emails
  data = rm_emoticon(data) # remove emoticons
  data = rm_citation(data) # remove citations
  data = rm_title_name(data) # remove titles
  data = rm_abbreviation(data) # remove abbreviations
  data = rm_date(data) # remove dates
  data = rm_non_ascii(data) # remove non-ASCII characters
  data = gsubfn("http[^ ]*|www[^ ]*", "", data) # remove URLs
  data = tolower(data) # convert to lower case
  data = pclean(data) # remove profanity
  data = removeNumbers(data) # remove numbers
  data = rm_repeated_characters(data) # remove repetitive characters
  data = gsubfn("[][#$%()`*:;\"\\+\\&\\/<=>@^_|~{}=\\-]", "", data) # remove special characters
  data = gsubfn("\\.+|\\s+\\.+", ".", data) # replace repeated periods with a single period
  data = gsubfn("\\?+|\\s+\\?+", "?", data) # replace repeated question marks with a single question mark
  data = gsubfn("\\'+|\\s+\\'?'+", "'", data) # replace repeated apostrophes with a single apostrophe
  data = gsubfn("!+|\\s+!+", "!", data) # replace repeated exclamation marks with a single exclamation mark
  data = gsubfn(",+|\\s+,+", ",", data) # replace repeated commas with a single comma
  data = gsubfn(" i ", " I ", data) # capitalize standalone i
  data = gsubfn(" i've | ive ", " I've ", data) # capitalize i've
  data = gsubfn(" i'm | im ", " I'm ", data) # capitalize i'm
  data = gsubfn(" i'd", " I'd ", data) # capitalize i'd
  data = gsubfn("(^|[.?!][[:space:]])([[:alpha:]])", "\\1\\U\\2", data, perl = TRUE) # capitalize first letter of a sentence or row
  data = ctrim(data) # drop empty rows, collapse repeated white space and trim
  data
}
pclean = function(data, xwords = words) {
  data = removeWords(data, xwords) # remove profanity
  data
}
ctrim = function(data) {
  ddata = as.data.frame(data)
  names(ddata) = c("content")
  ddata = subset(ddata, ddata$content != "") # remove empty rows
  data = as.character(ddata$content)
  data = rm_white_multiple(data)
  data = str_trim(data)
  data
}
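To make the punctuation and capitalization steps easier to follow, here is a minimal base-R sketch of the same idea on a single string. It is an illustration under assumptions, not the clean() function itself: tidy_sentence_sketch is a hypothetical name, base gsub stands in for gsubfn, and white space is collapsed before capitalization (clean() trims at the end), so that a letter after a double space is still treated as a sentence start.

```r
# Hypothetical sketch: collapse repeated punctuation, then capitalize sentence starts.
tidy_sentence_sketch <- function(x) {
  x <- tolower(x)
  x <- gsub("\\.+|\\s+\\.+", ".", x, perl = TRUE)   # "..." or " ..." becomes "."
  x <- gsub("\\?+|\\s+\\?+", "?", x, perl = TRUE)   # "???" becomes "?"
  x <- gsub("!+|\\s+!+", "!", x, perl = TRUE)       # "!!!" becomes "!"
  x <- gsub("[[:space:]]+", " ", x, perl = TRUE)    # collapse runs of white space
  # \\U upper-cases the captured letter; this needs perl = TRUE
  x <- gsub("(^|[.?!][[:space:]])([[:alpha:]])", "\\1\\U\\2", x, perl = TRUE)
  trimws(x)
}

tidied <- tidy_sentence_sketch("wow!!!  is this real???  yes ...")
print(tidied)
# [1] "Wow! Is this real? Yes."
```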
Processing data
Execution
Loading and assessing data
# Loading the profanity word list used for cleaning
myw = Loading("final/words")
words = as.character(strsplit(myw[[1]]$content, "\n"))
words = ctrim(words)
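The word list loaded above is what pclean() passes to removeWords. As a rough illustration of what whole-word removal does, here is a base-R sketch; remove_words_sketch and the sample word list are hypothetical, and the sketch assumes the listed words contain no regular-expression metacharacters.

```r
# Hypothetical sketch: remove listed words only where they appear as whole words.
remove_words_sketch <- function(x, banned) {
  # \\b marks a word boundary, so "darn" does not match inside "darned"
  pattern <- sprintf("\\b(%s)\\b", paste(banned, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

filtered <- remove_words_sketch("that darn cat is darned cute", c("darn", "heck"))
print(filtered)
# [1] "that  cat is darned cute"
```

The removal leaves a double space behind, which is why a white-space collapsing step such as ctrim() follows in the cleaning function.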
Processing
# Cleaning and pre-processing data
exec("final/en_US", "blogs", 1)
exec("final/en_US", "news", 2)
exec("final/en_US", "twitter", 3)
Result: The three files (en_US.blogs, en_US.news and en_US.twitter) were all cleaned as follows:
- Remove emails, emoticons, citations, titles, abbreviations, dates, non-ASCII characters, URLs, profanity, numbers, repetitive characters, punctuation other than . , ' ! ? and extra white space.
- Convert to lower case in order to have consistent casing.
- Replace repeated punctuation (. , ' ! ?) with a single punctuation mark.
- Capitalize the first-person “I”, capitalize the first letter of a sentence (the letter that follows . ! or ?) or row, and trim sentences.
I would like some punctuation to remain in the corpus so that the model can be trained on grammatical correctness.
I noticed that some words are joined without any clear delimiter, such as “#NeedACInSchool”. While I’d like to attempt splitting these phrases into individual words such as “Need AC in School”, I am leaving these words, along with proper nouns found in the document(s), to be dropped later as sparse terms.
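Although no splitting is done in this report, a minimal sketch of how it might be attempted is given here for reference. split_hashtag_sketch is a hypothetical helper; it assumes camel-cased ASCII tags and would have to run before the lower-casing step in clean(), since it relies on the capital letters to find word boundaries.

```r
# Hypothetical sketch: split a camel-cased hashtag into separate words.
split_hashtag_sketch <- function(tag) {
  x <- sub("^#", "", tag)                                        # drop the leading '#'
  x <- gsub("([a-z])([A-Z])", "\\1 \\2", x, perl = TRUE)         # break between lower and upper case
  x <- gsub("([A-Z]+)([A-Z][a-z])", "\\1 \\2", x, perl = TRUE)   # break after an acronym such as "AC"
  x
}

pieces <- split_hashtag_sketch("#NeedACInSchool")
print(pieces)
# [1] "Need AC In School"
```

The original capitalization of each word is kept, so "In" stays capitalized here; tags written entirely in lower case cannot be split this way.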
The statistics captured before and after cleaning serve as evidence that the cleaning took effect. The cleaned corpus is saved in a folder, and it is this data that will be used for exploratory analysis.
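The kind of before-and-after comparison described above can be sketched in base R as follows. corpus_stats_sketch and the two sample vectors are hypothetical and only illustrate the idea; they are not the statistics produced in this analysis.

```r
# Hypothetical sketch: line, word and character counts for a character vector.
corpus_stats_sketch <- function(x) {
  x <- trimws(x)
  tokens <- strsplit(x[nzchar(x)], "[[:space:]]+")   # split non-empty lines on white space
  data.frame(lines = length(x),
             words = sum(lengths(tokens)),
             chars = sum(nchar(x)))
}

before <- c("Hello   world!!!", "", "a b c")   # made-up raw lines
after  <- c("Hello world!", "A b c")           # made-up cleaned lines
stats <- rbind(before = corpus_stats_sketch(before),
               after  = corpus_stats_sketch(after))
print(stats)
```

A drop in line and character counts with a stable word count is what one would expect when cleaning removes empty rows and repeated punctuation without discarding words.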
Environment:
1. OS: Windows 7
2. Tool: R version 3.3.3; RStudio version 1.0.136
3. Publishing tool: RPubs, HTML
4. Data: With thanks to source: http://www.swiftkey.com, http://www.coursera.org, https://www.jhu.edu/
5. Reference: www.stackoverflow.com
6. Analyst: Uma Venkataramani; Date of Analysis: May 2017