1. Main Report
1.1 Overview
The goal of this project is to provide a basic exploratory analysis (by way of textual analysis) of a set of social media data collated by SwiftKey (source: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip). This document is designed to be concise, and explanations are provided only for the major features of the data. It also gives a brief overview of the plan for creating the prediction algorithm and Shiny app, written so that it would be understandable to a non-data-scientist manager.
1.2 Data and Summary Statistics
The files used in this analysis are as follows:
A file containing data from US blogs (in English).
A file containing data from US news sites (in English).
A file containing data from US Twitter feeds (in English).
The key statistics of these files are summarised in the table below:
| file_name | file_size_MB | Lines | LinesNEmpty | Chars | CharsNWhite | word_count |
|---|---|---|---|---|---|---|
| en_US.blogs | 200.4242 | 899288 | 899288 | 206824382 | 170389539 | 37570839 |
| en_US.news | 196.2775 | 77259 | 77259 | 15639408 | 13072698 | 2651432 |
| en_US.twitter | NA | 2360148 | 2360148 | 162096241 | 134082806 | 30451170 |
1.3 Data Pre-processing
Data from these files were pre-processed using the following steps:
1. A random sample was taken from each of the three files.
2. A consolidated sample was constructed by aggregating the three random samples, and this was saved to the local drive. (The objective of steps 1-2 was to make the dataset smaller, and hence less resource-intensive to handle, without significantly compromising accuracy. Temporary objects that were no longer required were also removed to conserve memory.)
3. A corpus was constructed from the consolidated sample. It was then "cleaned" to remove punctuation and other undesired "noise" that might affect the word frequency analysis.
4. Textual analysis based on word frequency was conducted on the pre-processed corpus. Plots of single-word (1-gram), 2-word (bigram), and 3-word (trigram) frequencies were produced, as shown below.
1.4 Graphical plots of word frequency
[Figure: Monogram, Bigram, and Trigram word-frequency plots, generated by the code in section 2.7 of the Annex]
1.5 Observations from the graphical plots
Salient observations include:
The most frequently cited places were: "new york", "new jersey", "los angeles", and "st louis".
The most popular time-related phrases were: "last year", "right now", "two years ago", "last week", "first time", "next week", and "everyday".
The most popular events cited were: "happy mothers day" and "happy new year".
There were also frequent references to roles/people: "president barack obama", "us district judge", "senior vice president", and "public relations counsel".
1.6 Thoughts on Prediction Model
Current thinking on how to build the keystroke prediction algorithm:
1. Analyse word-string patterns in the corpus by frequency of association (e.g. "first" is closely associated with "time"; "happy mothers" is closely associated with "day"). These string patterns can be established by focusing on bigrams and trigrams.
2. Analyse the preceding keywords typed by the user, and predict the next few words based on the patterns mentioned in (1), as illustrated in the sketch below.
3. As the user types more words, the model should respond interactively by revising its prediction of what the next word is most likely to be.
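As an illustration only, the sketch below shows how the bigram frequency table built in the Annex (the bigramFreq data.table from section 2.7, with columns rn and bigramFreq) could drive such a lookup. The helper predict_next_word is hypothetical and written for this sketch; it is not part of the final model.
library(data.table)
# hypothetical helper: suggest the n most likely next words given the last word typed,
# assuming bigramFreq is the data.table built in section 2.7 (rn = "word1 word2", bigramFreq = count)
predict_next_word <- function(last_word, bigram_table, n = 3) {
  # keep only bigrams whose first word matches the last word typed
  candidates <- bigram_table[grepl(paste0("^", last_word, " "), rn)]
  # order by descending frequency and return the second word of the top n bigrams
  candidates <- candidates[order(-bigramFreq)]
  head(sapply(strsplit(candidates$rn, " "), `[`, 2), n)
}
# example: candidate words to follow "first"
# predict_next_word("first", bigramFreq)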
1.7 Thoughts on Shiny App
The Shiny app will enable users to key in text, and it will generate visualisation plots (either n-gram plots or a word cloud) using the frequency-based prediction model described above. These plots will be revised interactively as users key in more words.
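A minimal sketch of how such an app could be structured, assuming the bigram table from the Annex has been pre-computed and saved as bigramFreq.rds (a hypothetical file name chosen for this sketch):
library(shiny)
library(ggplot2)
library(data.table)
# assumed: bigramFreq (the data.table built in section 2.7) was saved earlier with saveRDS()
bigramFreq <- readRDS("bigramFreq.rds")
ui <- fluidPage(
  titlePanel("Next-word explorer (sketch)"),
  textInput("user_text", "Type your text here:"),
  plotOutput("ngram_plot")
)
server <- function(input, output) {
  output$ngram_plot <- renderPlot({
    words <- unlist(strsplit(tolower(input$user_text), "\\s+"))
    req(length(words) > 0)
    last_word <- tail(words, 1)
    # bigrams whose first word matches the last word typed
    matches <- bigramFreq[grepl(paste0("^", last_word, " "), rn)]
    req(nrow(matches) > 0)
    matches <- head(matches[order(-bigramFreq)], 10)
    # bar chart of the most likely continuations, redrawn as the user types
    ggplot(matches, aes(reorder(rn, bigramFreq), bigramFreq)) +
      geom_bar(stat = "identity", fill = "steelblue") +
      coord_flip() + xlab("Candidate bigrams") + ylab("Frequency")
  })
}
shinyApp(ui = ui, server = server)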
2. Annex - Code used to generate the plots
2.1 Load data & libraries
# load social media data
blog <- readLines("en_US.blogs.txt", encoding="UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding="UTF-8", skipNul = TRUE, warn=FALSE) #warn=FALSE suppresses the incomplete-final-line warning
tweet <- readLines("en_US.twitter.txt", encoding="UTF-8", skipNul = TRUE)
# load essential libraries to generate n-grams and word counts
library(dplyr); library(doParallel); library(stringi); library(tm); library(ggplot2); library(wordcloud); library(knitr)
2.2 Summary table for the raw stats
raw_stats <- data.frame(
file_name=c("en_US.blogs", "en_US.news", "en_US.twitter"),
file_size_MB = c(file.info("en_US.blogs.txt")$size/(1024^2),
file.info("en_US.news.txt")$size/(1024^2),
file.info("en_US.twitter")$size/(1024^2)
),
t(rbind(sapply(list(blog,news,tweet), stri_stats_general),
word_count=sapply(list(blog, news, tweet), stri_stats_latex)[4,]))
)
kable(raw_stats)
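2.3 Take random samples and save the consolidated sample
The sampling code itself was not retained in this report. The sketch below reproduces the steps described in section 1.3; the seed, the 5% sampling rate, and the sample/sample.txt output path are illustrative assumptions rather than the values actually used.
# take a random sample from each file (the rate is an assumption for this sketch)
set.seed(1234)
sample_rate <- 0.05
blog_sample  <- sample(blog,  round(length(blog)  * sample_rate))
news_sample  <- sample(news,  round(length(news)  * sample_rate))
tweet_sample <- sample(tweet, round(length(tweet) * sample_rate))
# aggregate the three samples and save the consolidated sample to the local drive
consolidated_sample <- c(blog_sample, news_sample, tweet_sample)
dir.create("sample", showWarnings = FALSE)
writeLines(consolidated_sample, "sample/sample.txt")
# remove temporary objects that are no longer required to conserve memory
rm(blog, news, tweet, blog_sample, news_sample, tweet_sample); gc()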
2.4 Create Corpus from folder of earlier saved sample(s)
#folder to saved sample texts
corpus.folder <- file.path("C:\\Users\\Andy's Home PC\\Documents\\Coursera Courses\\Data Science\\Capstone Project\\sample") #folder path
corpus.folder #check path to folder
dir(corpus.folder) #list the files in the folder
docs <- VCorpus(DirSource(corpus.folder))
summary(docs)
# inspect(docs[1]) #inspect first text document
# writeLines(as.character(docs[1])) #we can read content of this document
class(docs)
2.5 Pre-processing of corpus - removing non-ASCII characters and other symbols
Note: using the standard tm_map transformations produces the following error message, even when the content_transformer wrapper is used: Error in UseMethod("content", x) : no applicable method for 'content' applied to an object of class "character".
These tm_map calls appear unstable:
- docs <- tm_map(docs, content_transformer(tolower)) #unstable
- docs <- tm_map(docs, content_transformer(removePunctuation)) #unstable
- docs <- tm_map(docs, content_transformer(removeNumbers)) #unstable
- docs <- tm_map(docs, content_transformer(stripWhitespace)) #unstable
To overcome this, I wrote my own functions for the pre-processing steps and applied them with tm_map. The result appears to be more stable.
#remove URL
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
docs <- tm_map(docs, removeURL)
#removing non-ASCII characters
removeNonASCII <- function(x) iconv(x, "latin1", "ASCII", sub="")
docs <- tm_map(docs, removeNonASCII)
#remove punctuation
removePunct <- function(x) gsub("[[:punct:]]", "", x)
docs <- tm_map(docs, removePunct)
#removing all special characters except for alphabet and numbers
# removeSpecialChar <- function(x) gsub("[^A-Za-z0-9]", "", x)
# docs <- tm_map(docs, removeSpecialChar)
#change to lower case
LowerCase <- function(x) sapply(x, tolower)
docs <- tm_map(docs, LowerCase)
#remove Numbers
removeNum <- function(x) gsub("[[:digit:]]", "", x)
docs <- tm_map(docs, removeNum)
#remove unnecessary white space, replace with only 1 space
removeSpace <- function(x) gsub("\\s+", " ", x)
docs <- tm_map(docs, removeSpace)
docs <- tm_map(docs, removeWords, stopwords("english")) #ok
#remove profanity words (downloaded the basic list on 20 Oct 2018 from the following source: https://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/ )
profanity <- readLines("C:\\Users\\Andy's Home PC\\Documents\\Coursera Courses\\Data Science\\Capstone Project\\profanity\\profanity.txt")
docs <- tm_map(docs, removeWords, profanity)
docs <- tm_map(docs, PlainTextDocument) #in this final step, we get R to treat processed doc as plain text
2.7 Create Term-Document Matrix (TDM) and generate the various plots
1-Gram and its plot
TDM <- TermDocumentMatrix(docs)
# inspect(TDM)
# dim(TDM)
# terms <-Terms(TDM)
# length(terms)
# unique(Encoding(terms)) #still has [1] "UTF-8" "unknown"
#remove sparse terms
TDM.common <- removeSparseTerms(TDM, .999)
# dim(TDM.common)
freq <- rowSums(as.matrix(TDM.common))
ord <- order(freq)
# freq[head(ord)]
# freq[tail(ord, n=30)]
wordFreq <- freq[tail(ord, n=30)]
commonTerms <- Terms(TDM.common)
# length(commonTerms)
wordFreq <- as.data.frame(wordFreq)
library(data.table)
wordFreq <- setDT(wordFreq, keep.rownames=TRUE)
wordFreq <- wordFreq[order(wordFreq, decreasing=TRUE),]
#plot
g<- ggplot(wordFreq, aes(reorder(rn, wordFreq), wordFreq))
g<- g + geom_bar(stat = "identity", fill="#97DAB7")+ theme_minimal
g <- g + coord_flip()
g<- g + ggtitle("Monogram")
g <- g + ylab("Frequency")
g<- g + xlab("Words")
g
2-Gram and its plot
To generate bigrams and trigrams, the RWeka and rJava libraries are typically used, but they can be tricky to install. Instead, I used the NLP package and created the n-gram tokenizers by writing the functions below. (For more information: http://tm.r-forge.r-project.org/faq.html#Bigrams)
bigram_tokenizer <- function(x)
unlist(lapply(ngrams(words(x), 2), paste, collapse=" "), use.names=FALSE)
trigram_tokenizer <- function(x)
unlist(lapply(ngrams(words(x), 3), paste, collapse=" "), use.names=FALSE)
Use the bigram tokenizer to transform the text data and generate the bigram term-document matrix and plot
TDM.bigram <- TermDocumentMatrix(docs, control = list(tokenize = bigram_tokenizer))
#dim(TDM.bigram)
TDM.bigram.common <- removeSparseTerms(TDM.bigram, 0.9999)
#dim(TDM.bigram.common)
#Terms(TDM.bigram.common)
freq.bigram <- rowSums(as.matrix(TDM.bigram.common)) #sums up the frequency of 2-worded words
ord <- order(freq.bigram) #sorting by ascending order
#freq.bigram[head(ord)] #print the top 6 2-worded words using subsetting (ascending)
# freq.bigram[tail(ord)] #print the last 6 2-worded words using subsetting (descending)
# freq.bigram[tail(ord, n=30)] #print last 30 2-worded words using subsetting (descending)
bigramFreq <- freq.bigram[tail(ord, n=30)]
#transform data for plotting
bigramFreq <- as.data.frame(bigramFreq)
bigramFreq <- setDT(bigramFreq, keep.rownames = TRUE)
bigramFreq <- bigramFreq[order(bigramFreq, decreasing=TRUE),]
#plot the 30 most frequent bigrams
h <-ggplot(bigramFreq, aes(reorder(rn, bigramFreq), bigramFreq))
h <- h + geom_bar(stat = "identity", fill="steelblue") + theme_minimal()
h <- h + coord_flip()
h <- h + ggtitle("Bigram")
h <- h + ylab("Frequency")
h <- h + xlab("Words (Bigrams)")
h
3-Gram and its plot
Use the trigram tokenizer to transform the text data and generate the trigram plot
TDM.trigram <- TermDocumentMatrix(docs, control=list(tokenize = trigram_tokenizer))
#dim(TDM.trigram)
TDM.trigram.common <- removeSparseTerms(TDM.trigram, 0.999)
# dim(TDM.trigram.common)
freq.trigram <- rowSums(as.matrix(TDM.trigram.common))
ord <- order(freq.trigram)
# freq.trigram[head(ord)]
# freq.trigram[tail(ord, n=30)]
trigramFreq <- freq.trigram[tail(ord, n=30)]
#transform data for plotting
trigramFreq <- as.data.frame(trigramFreq)
trigramFreq <- setDT(trigramFreq, keep.rownames = TRUE)
trigramFreq <- trigramFreq[order(trigramFreq, decreasing = TRUE),]
#plot the 30 most frequent trigrams
i <- ggplot(trigramFreq, aes(reorder(rn, trigramFreq), trigramFreq))
i <- i + geom_bar(stat = "identity", fill="#DBCC8E") + theme_minimal()
i <- i + coord_flip()
i <- i + ggtitle("Trigram")
i <- i + ylab("Frequency")
i <- i + xlab("Words (Trigrams)")
i
Plot the n-grams side-by-side
require(gridExtra)
grid.arrange(g,h,i, ncol=3)
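The wordcloud library is loaded in section 2.1 but not used above. Since section 1.7 mentions a word cloud as one of the planned visualisations, the sketch below shows how the 1-gram frequencies computed above (the freq vector) could be displayed as a word cloud; it is illustrative only.
# optional: word cloud of the most frequent single words, reusing freq from the 1-gram step
set.seed(2018)
wordcloud(names(freq), freq, max.words = 100, random.order = FALSE, colors = brewer.pal(8, "Dark2"))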