1. Main Report

1.1 Overview

The goal of this project is to provide a basic exploratory (textual) analysis of a set of social media data collated by SwiftKey (source: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip). This document is designed to be concise, and explanation is only provided for the major features of the data. It also gives a brief overview of the plan for creating the prediction algorithm and Shiny app, written so that it is understandable to a non-data-scientist manager.

1.2 Data and Summary Statistics

The files used in this analysis are as follows:

  • A file containing data from US blogs (in English).

  • A file containing data from US news sites (in English).

  • A file containing data from US Twitter feeds (in English).

The key statistics of these files are summarised in the table below.

file_name       file_size_MB   Lines      LinesNEmpty   Chars        CharsNWhite   word_count
en_US.blogs     200.4242        899288     899288       206824382    170389539     37570839
en_US.news      196.2775         77259      77259        15639408     13072698      2651432
en_US.twitter   NA             2360148    2360148       162096241    134082806     30451170

(The NA file size for en_US.twitter is due to a missing ".txt" in the original file.info() call; see Annex 2.2.)

1.3 Data Pre-processing

Data from these files were pre-processed using the following steps:

  1. Random samples from each of these files were taken.

  2. A consolidated sample was constructed by aggregating the 3 random samples, and this was saved to the local drive.

(The objective of steps 1-2 was to make the dataset smaller, and hence require fewer resources to handle, without compromising significantly on accuracy. Additionally, temporary objects that were no longer required were removed to conserve memory.)

  3. A corpus was constructed from the consolidated sample. This was subsequently “cleaned” to remove punctuation and other undesired “noise” that might affect the word frequency analysis.

  4. Textual analysis based on word frequency was conducted on the pre-processed corpus. Graphical plots of single-word (monogram), 2-word (bigram), and 3-word (trigram) frequencies were produced and are shown below.

1.4 Graphical plots of word frequency

1.5 Observations from the graphical plots

Salient observations include:

  • The most frequently cited places were: “new york”, “new jersey”, “los angeles”, and “st louis”.

  • The most popular “time”-related terms were: “last year”, “right now”, “two years ago”, “last week”, “first time”, “next week”, and “everyday”.

  • The most popular events cited were: “happy mothers day” and “happy new year”.

  • There were also frequent references to roles/people: “president barack obama”, “us district judge”, “senior vice president”, and “public relations counsel”.

1.6 Thoughts on Prediction Model

Current thoughts on how to build the keystroke prediction algorithm include:

  1. Analyse word-string patterns in the corpus by frequency association (e.g. “first” is closely associated with “time”; “happy mothers” is closely associated with “day”). These string patterns can be established by focusing on bigrams and trigrams (see the sketch after this list).

  2. Analyse the preceding words typed by the user, and predict the next few words based on the patterns mentioned in (1).

  3. As the user types more words, the model should respond interactively by revising its prediction of the most likely next word.
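
A minimal sketch of the frequency-lookup idea in (1)-(2) is shown below. It assumes the bigram and trigram frequency tables built in the Annex (bigramFreq and trigramFreq, data.tables with an rn column holding the n-gram string and a frequency column); the function name predict_next_word is hypothetical and this is not the final model.

library(data.table)

  #sketch only: look up the last word(s) typed in the trigram table, backing off to the bigram table
predict_next_word <- function(input, bigramFreq, trigramFreq, n = 3) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  last_two <- paste(tail(words, 2), collapse = " ")
  last_one <- tail(words, 1)

  #trigrams whose first two words match the last two words typed
  hits <- trigramFreq[grepl(paste0("^", last_two, " "), rn)]
  if (nrow(hits) == 0) {
    #back off to bigrams whose first word matches the last word typed
    hits <- bigramFreq[grepl(paste0("^", last_one, " "), rn)]
  }
  if (nrow(hits) == 0) return(character(0))

  #return the final word of the n most frequent matching n-grams
  top <- head(hits[order(-hits[[2]])], n)
  sapply(strsplit(top$rn, " "), tail, 1)
}

  #example: predict_next_word("happy mothers", bigramFreq, trigramFreq)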

1.7 Thoughts on Shiny App

The Shiny app will enable users to key in text, and it will generate visualisation plots (either n-gram plots or a word cloud) using the frequency-based prediction model mentioned earlier. These plots will be revised interactively as the user keys in more words.
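
A minimal sketch of how the app might be wired up, assuming a plotting helper (here called plot_ngrams, a hypothetical stand-in for the n-gram / word cloud plotting done in the Annex):

library(shiny); library(ggplot2)

  #hypothetical helper: tokenise the typed text and plot word frequencies
plot_ngrams <- function(text) {
  words <- unlist(strsplit(tolower(text), "\\s+"))
  freq <- as.data.frame(table(words), stringsAsFactors = FALSE)
  ggplot(freq, aes(reorder(words, Freq), Freq)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    coord_flip() + xlab("Words") + ylab("Frequency") + theme_minimal()
}

ui <- fluidPage(
  textInput("user_text", "Type your text here:"),
  plotOutput("ngram_plot")
)

server <- function(input, output) {
  #re-renders automatically each time the user keys in more words
  output$ngram_plot <- renderPlot({
    req(nchar(input$user_text) > 0)
    plot_ngrams(input$user_text)
  })
}

shinyApp(ui, server)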

2. Annex - Code used to generate the plots

2.1 Load data & libraries

  1. Load the social media data.
blog <- readLines("en_US.blogs.txt", encoding="UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding="UTF-8", skipNul = TRUE, warn=FALSE) # warn=FALSE suppresses the incomplete-final-line warning
tweet <- readLines("en_US.twitter.txt", encoding="UTF-8", skipNul = TRUE)
  2. Load the libraries needed to generate the n-grams, word counts, and summary table.
library(dplyr); library(doParallel); library(stringi); library(tm); library(ggplot2); library(wordcloud); library(knitr)

2.2 Summary table for the raw stats

raw_stats <- data.frame(
  file_name=c("en_US.blogs", "en_US.news", "en_US.twitter"),
  file_size_MB = c(file.info("en_US.blogs.txt")$size/(1024^2),
                   file.info("en_US.news.txt")$size/(1024^2),
                   file.info("en_US.twitter")$size/(1024^2)
                   ),
  t(rbind(sapply(list(blog,news,tweet), stri_stats_general),
          word_count=sapply(list(blog, news, tweet), stri_stats_latex)[4,]))
  )
kable(raw_stats)

2.3 Create random sampling of the words used in various social media sites

Create a random sample to get a (small) snapshot of the words used in the text documents.

  #random sampling
set.seed(1000)
sample_blog <- blog[sample(1:length(blog), 10000, replace=FALSE)]
sample_news <- news[sample(1:length(news), 10000, replace=FALSE)]
sample_tweet <- tweet[sample(1:length(tweet), 10000, replace=FALSE)]

  #save sample files to folder (need to manually create a folder called "sample" first, and insert the path). This is not essential, but good practice.
sample_combined <- c(sample_blog, sample_news, sample_tweet)
writeLines(sample_combined, "C:/Users/Andy's Home PC/Documents/Coursera Courses/Data Science/Capstone Project/sample/sample_combined.txt")
# 
# writeLines(sample_blog, "C:/Users/Andy's Home PC/Documents/Coursera Courses/Data Science/Capstone Project/sample/sample_blog.txt")
# 
# writeLines(sample_news, "C:/Users/Andy's Home PC/Documents/Coursera Courses/Data Science/Capstone Project/sample/sample_news.txt")
# 
# writeLines(sample_tweet, "C:/Users/Andy's Home PC/Documents/Coursera Courses/Data Science/Capstone Project/sample/sample_tweet.txt")

  #remove temporary dataframes and variables that we no longer need
rm(blog, news, tweet)
rm(sample_blog, sample_news, sample_tweet)
rm(sample_combined)

2.4 Create Corpus from folder of earlier saved sample(s)

  #folder to saved sample texts
corpus.folder <- file.path("C:\\Users\\Andy's Home PC\\Documents\\Coursera Courses\\Data Science\\Capstone Project\\sample")   #folder path
corpus.folder   #check path to folder
dir(corpus.folder) #list the files in the folder

docs <- VCorpus(DirSource(corpus.folder))
summary(docs)
# inspect(docs[1]) #inspect first text document
# writeLines(as.character(docs[1])) #we can read content of this document
class(docs)

2.5 Pre-processing of corpus - removing non-ASCII characters and other symbols

Note: using the tm_map method produces the following error message even when the content_transformer wrapper is used: Error in UseMethod(“content”, x) : no applicable method for ‘content’ applied to an object of class “character”.

These tm_map methods appear unstable:

- docs <- tm_map(docs, content_transformer(tolower)) #unstable

- docs <- tm_map(docs, content_transformer(removePunctuation)) #unstable

- docs <- tm_map(docs, content_transformer(removeNumbers)) #unstable

- docs <- tm_map(docs, content_transformer(stripWhitespace)) #unstable

To overcome this, I wrote custom pre-processing functions and applied them with tm_map. The result appears to be more stable.

  #remove URL
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
docs <- tm_map(docs, removeURL)

  #removing non-ASCII characters
removeNonASCII <- function(x) iconv(x, "latin1", "ASCII", sub="")
docs <- tm_map(docs, removeNonASCII)

  #remove punctuations
    #write function
removePunct <- function(x) gsub("[[:punct:]]", "", x)
docs <- tm_map(docs, removePunct)

  #removing all special characters except for alphabet and numbers
# removeSpecialChar <- function(x) gsub("[^A-Za-z0-9]", "", x)
# docs <- tm_map(docs, removeSpecialChar)

  #change to lower case
LowerCase <- function(x) sapply(x, tolower)
docs <- tm_map(docs, LowerCase)

  #remove Numbers
removeNum <- function(x) gsub("[[:digit:]]", "", x)
docs <- tm_map(docs, removeNum)

  #remove unnecessary white space, replace with only 1 space
removeSpace <- function(x) gsub("\\s+", " ", x)
docs <- tm_map(docs, removeSpace)


docs <- tm_map(docs, removeWords, stopwords("english")) #remove common English stop words (this tm_map call works as-is)

  #remove profanity words (downloaded the basic list on 20 Oct 2018 from the following source: https://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/ )
profanity <- readLines("C:\\Users\\Andy's Home PC\\Documents\\Coursera Courses\\Data Science\\Capstone Project\\profanity\\profanity.txt")
docs <- tm_map(docs, removeWords, profanity)


docs <- tm_map(docs, PlainTextDocument) #final step: have R treat the processed documents as plain text documents again

2.6 Create Term Document Matrix (TDM) and generate the various plots

1-Gram and its plot

TDM <- TermDocumentMatrix(docs) 
# inspect(TDM)
# dim(TDM)
# terms <-Terms(TDM)
# length(terms)
# unique(Encoding(terms)) #still has [1] "UTF-8"   "unknown"

#remove sparse terms
TDM.common <- removeSparseTerms(TDM, .999)
# dim(TDM.common)
freq <- rowSums(as.matrix(TDM.common))
ord <- order(freq)
# freq[head(ord)]
# freq[tail(ord, n=30)]
wordFreq <- freq[tail(ord, n=30)]
commonTerms <- Terms(TDM.common)
# length(commonTerms)

wordFreq <- as.data.frame(wordFreq)

library(data.table)
wordFreq <- setDT(wordFreq, keep.rownames=TRUE)
wordFreq <- wordFreq[order(wordFreq, decreasing=TRUE),]

#plot the 30 most frequent single words
g <- ggplot(wordFreq, aes(reorder(rn, wordFreq), wordFreq))
g <- g + geom_bar(stat = "identity", fill="#97DAB7") + theme_minimal()
g <- g + coord_flip()
g <- g + ggtitle("Monogram")
g <- g + ylab("Frequency")
g <- g + xlab("Words")
g

2-Gram and its plot

To generate the bigrams and trigrams, the RWeka and rJava libraries are typically needed, but they can be tricky to install. Instead, I used the NLP package (loaded with tm) and wrote the n-gram tokenizer functions below. (For more information: http://tm.r-forge.r-project.org/faq.html#Bigrams)

bigram_tokenizer <- function(x) 
  unlist(lapply(ngrams(words(x), 2), paste, collapse=" "), use.names=FALSE)

trigram_tokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 3), paste, collapse=" "), use.names=FALSE)

Use the bigram tokenizer to transform the text data and generate the bigram plot.

TDM.bigram <- TermDocumentMatrix(docs, control = list(tokenize = bigram_tokenizer))
#dim(TDM.bigram)
TDM.bigram.common <- removeSparseTerms(TDM.bigram, 0.9999)
#dim(TDM.bigram.common)
#Terms(TDM.bigram.common)
freq.bigram <- rowSums(as.matrix(TDM.bigram.common)) #sum up the frequency of each bigram
ord <- order(freq.bigram) #sort in ascending order of frequency
#freq.bigram[head(ord)] #print the 6 least frequent bigrams
# freq.bigram[tail(ord)] #print the 6 most frequent bigrams
# freq.bigram[tail(ord, n=30)] #print the 30 most frequent bigrams
bigramFreq <- freq.bigram[tail(ord, n=30)]

  #transform data for plotting
bigramFreq <- as.data.frame(bigramFreq)
bigramFreq <- setDT(bigramFreq, keep.rownames = TRUE)
bigramFreq <- bigramFreq[order(bigramFreq, decreasing=TRUE),]

  #plot the 30 most frequent bigrams
h <-ggplot(bigramFreq, aes(reorder(rn, bigramFreq), bigramFreq))
h <- h + geom_bar(stat = "identity", fill="steelblue") + theme_minimal()
h <- h + coord_flip()
h <- h + ggtitle("Bigram")
h <- h + ylab("Frequency")
h <- h + xlab("Words (Bigrams)")
h

3-Gram and its plot

Use the trigram tokenizer to transform the data and generate the trigram plot.

TDM.trigram <- TermDocumentMatrix(docs, control=list(tokenize = trigram_tokenizer))
#dim(TDM.trigram)
TDM.trigram.common <- removeSparseTerms(TDM.trigram, 0.999)
# dim(TDM.trigram.common)
freq.trigram <- rowSums(as.matrix(TDM.trigram.common))
ord <- order(freq.trigram)
# freq.trigram[head(ord)]
# freq.trigram[tail(ord, n=30)]
trigramFreq <- freq.trigram[tail(ord, n=30)]


  #transform data for plotting
trigramFreq <- as.data.frame(trigramFreq)
trigramFreq <- setDT(trigramFreq, keep.rownames = TRUE)
trigramFreq <- trigramFreq[order(trigramFreq, decreasing = TRUE),]

  #plot the 30 most frequent trigrams
i <- ggplot(trigramFreq, aes(reorder(rn, trigramFreq), trigramFreq))
i <- i + geom_bar(stat = "identity", fill="#DBCC8E") + theme_minimal()
i <- i + coord_flip()
i <- i + ggtitle("Trigram")
i <- i + ylab("Frequency")
i <- i + xlab("Words (Trigrams)")
i

Plot the n-grams side-by-side

require(gridExtra)
grid.arrange(g,h,i, ncol=3)