Coursera Data Science Capstone - SwiftKey - Data Processing

Executive Summary

SwiftKey is an input method for Android and iOS devices such as smartphones and tablets. It uses a blend of artificial intelligence technologies to predict the next word the user intends to type. SwiftKey learns from previous SMS messages and outputs predictions based on the text currently being entered and on what it has learned.

This part of the project deals with cleaning the data, both to prepare it for exploration and analysis of the content and to determine how best to use it for predictive analysis.

Functions

Loading data

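# Packages used by the functions in this report
library(tm)         # VCorpus, DirSource, readPlain, removeNumbers, removeWords
library(qdapRegex)  # rm_* helpers such as rm_email and rm_white_multiple
library(gsubfn)     # gsubfn
library(stringr)    # str_trim

# Download and unzip the dataset if `folder` is missing, then read its files into a corpus
Loading = function(folder) {
    wd = getwd()
    zfile = "Coursera-SwiftKey.zip"
    url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
    folder = paste(wd, folder, sep = "/")
    zfile = paste(wd, zfile, sep = "/")
    if (!dir.exists(folder)) {
        download.file(url, zfile)
        unzip(zfile)  # extracts into the working directory
    }
    docs = VCorpus(DirSource(folder, mode = "text"), readerControl = list(reader = readPlain, 
        language = "en"))
    docs
}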

Tidying data for numbers, encoding, punctuation, profanity and space trimming

clean = function(data) {
    data = rm_email(data)  #remove emails
    data = rm_emoticon(data)  #remove emoticons
    data = rm_citation(data)  #remove citation
    data = rm_title_name(data)  #remove title
    data = rm_abbreviation(data)  #remove abbreviations
    data = rm_date(data)  #remove dates
    data = rm_non_ascii(data)  #remove non-ascii characters
    data = gsubfn("http[^ ]*|www[^ ]*", "", data)  #removing URLs
    data = tolower(data)  #convert to lower
    data = pclean(data)  #remove profanity
    data = removeNumbers(data)  #remove numbers
    data = rm_repeated_characters(data)  #remove repetitive characters
    data = gsubfn("[][#$%()`*:;\"\\+\\&\\/<=>@^_|~{}=\\-]", "", data)  # remove special characters
    data = gsubfn("\\.+|\\s+\\.+", ".", data)  #replace repetitive periods with a single period
    data = gsubfn("\\?+|\\s+\\?+", "?", data)  #replace repetitive question marks with a single question mark
    data = gsubfn("\\'+|\\s+\\'?'+", "'", data)  #replace repetitive apostrophes with a single apostrophe
    data = gsubfn("!+|\\s+!+", "!", data)  #replace repetitive exclamation marks with a single exclamation mark
    data = gsubfn(",+|\\s+,+", ",", data)  #replace repetitive commas with a single comma
    data = gsubfn(" i ", " I ", data)  #capitalize standalone i
    data = gsubfn(" i've | ive ", " I've ", data)  #capitalize i've
    data = gsubfn(" i'm | im ", " I'm ", data)  #capitalize i'm
    data = gsubfn(" i'd", " I'd ", data)  #capitalize i'd
    data = gsubfn("(^|[.?!][[:space:]])([[:alpha:]])", "\\1\\U\\2", data, perl = TRUE)  #capitalize first letter of a sentence or row
    data = ctrim(data)
    data
}
pclean = function(data, xwords = words) {
    data = removeWords(data, xwords)  #remove profanity
    data
}
ctrim = function(data) {
    ddata = as.data.frame(data)
    names(ddata) = c("content")
    ddata = subset(ddata, ddata$content != "")  #removing empty rows
    data = as.character(ddata$content)
    data = rm_white_multiple(data)
    data = str_trim(data)
    data
}
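
pclean hands the profanity list to removeWords from the tm package. As a rough picture of what whole-word removal involves, here is a minimal base-R sketch. It is not tm's implementation; the name remove_words_sketch is hypothetical, and the sketch assumes the word list contains no regular-expression metacharacters.

```r
# Hypothetical helper: remove whole-word matches of `xwords` from each string in `x`.
# Assumes the entries of `xwords` are plain words (no regex metacharacters).
remove_words_sketch = function(x, xwords) {
    # \\b anchors the match at word boundaries so that longer words are left alone
    pattern = paste0("\\b(", paste(xwords, collapse = "|"), ")\\b")
    gsub(pattern, "", x, perl = TRUE)
}

# "darn" is removed as a whole word; "darned" is untouched
remove_words_sketch("this darn thing is darned heavy", c("darn", "heck"))
# returns "this  thing is darned heavy"
```

The leftover double space is the kind of gap that the white-space collapsing in ctrim cleans up afterwards.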

Processing data

Execution

Loading and assessing data

# Load the profanity word list used for cleaning
myw = Loading("final/words")
words = unlist(strsplit(myw[[1]]$content, "\n"))  # one entry per line; empty lines are dropped
words = ctrim(words)

Processing

# Cleaning and pre-processing data
exec("final/en_US", "blogs", 1)
exec("final/en_US", "news", 2)
exec("final/en_US", "twitter", 3)

Result: The three files (en_US.blogs, en_US.news and en_US.twitter) were all cleaned as follows:
- Remove emails, emoticons, citations, titles, abbreviations, dates, non-ASCII characters, URLs, profanity, numbers, repetitive characters, punctuation other than . , ' ! ? and extra white space
- Convert to lower case for consistent casing
- Replace repeated punctuation (. , ' ! ?) with a single mark
- Capitalize the first person “I”, capitalize the first letter of each sentence (the letter that follows . ! or ?) or row, and trim sentences.
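
The punctuation and capitalization rules listed above can be traced on a single sample string. The following is a minimal sketch that reuses the patterns from clean but swaps in base-R gsub for gsubfn so that it runs without extra packages; the function name normalize_sketch and the sample sentence are made up for illustration, and only a subset of the cleaning steps is covered.

```r
# Hypothetical, reduced version of the punctuation/casing steps in clean(),
# using base gsub() in place of gsubfn().
normalize_sketch = function(x) {
    x = tolower(x)
    x = gsub("\\.+|\\s+\\.+", ".", x)   # runs of periods -> one period
    x = gsub("\\?+|\\s+\\?+", "?", x)   # runs of question marks -> one
    x = gsub("!+|\\s+!+", "!", x)       # runs of exclamation marks -> one
    x = gsub(",+|\\s+,+", ",", x)       # runs of commas -> one
    x = gsub(" i ", " I ", x)           # standalone i
    x = gsub(" i'm | im ", " I'm ", x)  # i'm / im
    # upper-case the first letter of the row and of each sentence
    x = gsub("(^|[.?!][[:space:]])([[:alpha:]])", "\\1\\U\\2", x, perl = TRUE)
    x = gsub("\\s+", " ", x)            # collapse repeated white space
    trimws(x)
}

normalize_sketch("so cool!!! i think  i'm late ... are you coming??")
# returns "So cool! I think I'm late. Are you coming?"
```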

I would like some punctuation to remain in the corpus so that the model can be trained on grammatical correctness.

I noticed that some words are joined without any clear delimiter, such as “#NeedACInSchool”. While I’d like to attempt splitting these phrases into individual words such as “Need AC in School”, for now I am leaving these joined words, along with proper nouns found in the document(s), to be dropped later as sparse terms.
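
For reference, one way such a split might be attempted is sketched below. This is not part of the pipeline in this report; the function name split_hashtag is hypothetical, and the approach assumes the joined words are written in CamelCase, so it would have to run before the text is converted to lower case.

```r
# Hypothetical helper: split a CamelCase hashtag into separate words.
# Relies on letter case, so it must be applied before tolower().
split_hashtag = function(tag) {
    tag = sub("^#", "", tag)
    # lower-case letter or digit followed by an upper-case letter: "dA" -> "d A"
    tag = gsub("([a-z0-9])([A-Z])", "\\1 \\2", tag, perl = TRUE)
    # acronym followed by a capitalized word: "ACIn" -> "AC In"
    tag = gsub("([A-Z]+)([A-Z][a-z])", "\\1 \\2", tag, perl = TRUE)
    tag
}

split_hashtag("#NeedACInSchool")
# returns "Need AC In School"
```

The sketch keeps the original capitalization ("In" rather than "in"); hashtags written entirely in lower case would pass through unsplit.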

The statistics captured before and after cleaning are evidence of the cleaning. The cleaned corpus is saved in a folder, and it is this data that will be used for exploratory analysis.
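
The code that captures these statistics is not shown in this report. A minimal base-R sketch of how line, word and character counts could be compared before and after cleaning follows; the names text_stats and compare_stats and the sample vectors are assumptions made for illustration, not the functions actually used.

```r
# Hypothetical helper: basic size statistics for a character vector of lines.
text_stats = function(x) {
    tokens = strsplit(trimws(x), "\\s+")
    # count non-empty tokens per line so that empty lines contribute zero words
    words = vapply(tokens, function(t) sum(nzchar(t)), integer(1))
    data.frame(lines = length(x), words = sum(words), chars = sum(nchar(x)))
}

# Hypothetical helper: stack the statistics of the raw and cleaned text.
compare_stats = function(raw, cleaned) {
    out = rbind(text_stats(raw), text_stats(cleaned))
    rownames(out) = c("before", "after")
    out
}

# Made-up sample: an empty row and a URL disappear after cleaning
raw = c("Check   this out!!!", "", "visit http://example.com now")
cleaned = c("Check this out!", "Visit now")
compare_stats(raw, cleaned)
```

Fewer lines, words and characters in the "after" row would be consistent with empty rows, URLs and repeated punctuation having been removed.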

Environment:
1. OS: Windows 7
2. Tool: R version 3.3.3; R Studio version 1.0.136
3. Publishing tool: RPubs, HTML
4. Data: With thanks to source: http://www.swiftkey.com, http://www.coursera.org, https://www.jhu.edu/
5. Reference: www.stackoverflow.com
6. Analyst: Uma Venkataramani; Date of Analysis: May 2017