This is the final course of the Data Science Specialization in R. It brings together all the knowledge and skills learned during the specialization - from understanding data science and installing R and RStudio, to loading, subsetting, wrangling, and exploring data, applying statistical inference, and training and testing our data sets with the applicable machine learning methods.
The capstone is a partnership between Johns Hopkins University and SwiftKey. I used this product back in 2013-2015 and was amazed at the innovation on digital keyboards: you slide your finger across the keyboard without lifting it, and it predicts the word with high accuracy. This course provides a blueprint for achieving the word prediction technology behind it.
The capstone will be evaluated based on the following assessments:
The first week involves understanding the text mining infrastructure in R and exploring the data sets provided by the course. The process I took to understand the subject is as follows:
The exploration leads to more questions. The goal is to optimize the algorithm for either speed or accuracy, and finding the balance between the two is difficult.
Below is the environment needed to examine our data sets.
library(bibtex)
library(knitr)
library(rvest)
library(tidyverse)
library(glue)
library(stringi)
library(caret)
library(spacyr)
library(tidytext)
library(echarts4r)
library(plotly)  # provides ggplotly(), used in the visualizations below
Data sets can be found here, but before we dive into the data, let us define some terminology that is used often in NLP infrastructure: Text Mining, Corpus, and Tokenization.
Amazon product reviews, Yelp, reddit, twitter feeds, facebook, and LinkedIn are some of the sites that are text-mined to provide market research, sentiment analysis, and, in many cases, to build a data product. Examples of data products I use are tradingviews.com and stocktwits.com. One of the features they provide is a measure of bullish or bearish sentiment on publicly traded securities. These are valuable tools for investors and day traders. Rather than reviewing pages and pages of tweets or reddit posts, a person can visualize the sentiment of users based on their comments over a timeframe (i.e. day, week, month, quarter, etc.) in a simple graph.
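To make the idea concrete, here is a minimal sketch of sentiment tallying with tidytext, using a hypothetical set of posts and tidytext's built-in Bing positive/negative lexicon; the real products above work on far larger data and more sophisticated models.
# Hypothetical posts - a toy stand-in for scraped comments
posts <- tibble(
  timeframe = c("day 1", "day 1", "day 2"),
  text = c("great earnings call, very bullish",
           "terrible guidance and a weak outlook",
           "strong growth and happy investors")
)
posts %>%
  unnest_tokens(word, text) %>%                        # one row per word
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep words with a positive/negative label
  count(timeframe, sentiment) %>%
  ggplot(aes(x = timeframe, y = n, fill = sentiment)) +
  geom_col(position = "dodge") +
  labs(title = "Sentiment by Timeframe (toy example)")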
Here is a basic example of text mining:
# Mining our opening paragraph
samp_text <- "This is the final course of the specialization. It combines all the knowledge and skills learned during the course - from understanding data science, to installing R and RStudio, loading, subsetting, wrangling, exploring, using statistical inference, training, and testing our data sets based on applicable machine learning. The capstone is a partnership between Johns Hopkins University and Swiftkey. I have used this product a while back in 2013-2015. I was amazed at the innovation on digital keyboards. The ability to slide your finger across the keyboard without lifting it. It then predicts the word with high accuracy. This course provides a blueprint on how to achieve the word prediction technology behind it."
# 10 most frequent terms
sampdf <- tibble(text = samp_text) %>%
unnest_tokens(word, text) %>% # split words
anti_join(stop_words) %>% # take out "a", "an", "the", etc.
count(word, sort = TRUE) # count occurrences
sampdf[1:10,] %>% kable(caption = 'Text Mining') %>% kableExtra::kable_styling()
| word | n |
|---|---|
| data | 2 |
| word | 2 |
| 2013 | 1 |
| 2015 | 1 |
| ability | 1 |
| accuracy | 1 |
| achieve | 1 |
| amazed | 1 |
| applicable | 1 |
| based | 1 |
sampdf[1:5,] %>% ggplot(aes(y=reorder(word,n), x=n))+geom_col()+labs(y= "Words/Characters", x="Frequency", title = "Text Mining")
We can see that “word” and “data” are the most frequent words in our paragraph, which makes sense since the topic is analyzing text data. Let’s move on to the next topic: the corpus.
A corpus is a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject. Corpora are generally used for statistical linguistic analysis and hypothesis testing (Mayo 2017). A small illustration of the idea is sketched below, followed by the code to summarize our capstone data set, which can be found here.
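As a toy illustration, a corpus in R can be as simple as a character vector with one element per document; the two documents below are made up for this sketch.
# A hypothetical two-document corpus: each element is one document
mini_corpus <- c(
  doc1 = "Text mining turns unstructured text into data we can analyze.",
  doc2 = "A corpus is simply a collection of such documents."
)
stri_count_words(mini_corpus)  # number of words in each document
The real corpus for the capstone is three files per language (blogs, news, twitter), which we download and summarize next.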
## Download file (run at the beginning only) ---------
# if (!file.exists("data")) {
# dir.create("data")
#}
# download.file(
# "https://d396qusza40orc.cloudfront.net/ds scapstone/dataset/Coursera-SwiftKey.zip",
# destfile = "./data/Coursera-SwiftKey.zip"
# )
# unzip("./data/Coursera-SwiftKey.zip", exdir = "./data")
## Combining -----------
files2 <- list.dirs("./data/final")
# relative paths to each text file
lsfile <- paste0(files2[2:5], "/",
                 list.files(files2[2:5]))
# absolute paths to the four language directories
ldir <- normalizePath(files2[2:5])
# full path and file name of each text file
finaldir <- dir(path = ldir, full.names = TRUE)
## Build a table --------------
## Num_Words total number of words in a txt file
Num_Words <- vector("numeric")
## Num_Lines number of lines per txt file
Num_Lines <- vector("numeric")
## Range of words per line
Min_Words <- vector("numeric")
Mean_Words <- vector("numeric")
Max_Words <- vector("numeric")
for (i in seq_along(finaldir)) {
  lines_i <- readLines(finaldir[[i]])       # read each file once
  word_counts <- stri_count_words(lines_i)  # words per line
  Num_Words[i] <- sum(word_counts)
  Mean_Words[i] <- round(mean(word_counts))
  Min_Words[i] <- min(word_counts)
  Max_Words[i] <- max(word_counts)
  Num_Lines[i] <- length(lines_i)
}
# Table -------------
list_files <- tibble(
  Name    = list.files(files2[2:5]),
  Size_MB = round(file.size(finaldir) / 10^6, digits = 2),
  Lines   = Num_Lines,
  Words   = Num_Words,
  Min     = Min_Words,
  Average = Mean_Words,
  Max     = Max_Words
)
# knit to table -----------
kable(list_files, caption = 'Corpus-Collection of Text',
      align = rep('c', times = 7)) %>%
  kableExtra::kable_styling()
| Name | Size_MB | Lines | Words | Min | Average | Max |
|---|---|---|---|---|---|---|
| de_DE.blogs.txt | 85.46 | 371440 | 12682659 | 0 | 34 | 1638 |
| de_DE.news.txt | 95.59 | 244743 | 13375092 | 1 | 55 | 603 |
| de_DE.twitter.txt | 75.58 | 947774 | 11646033 | 0 | 12 | 42 |
| en_US.blogs.txt | 210.16 | 899288 | 37546250 | 0 | 42 | 6726 |
| en_US.news.txt | 205.81 | 1010242 | 34762395 | 1 | 34 | 1796 |
| en_US.twitter.txt | 167.11 | 2360148 | 30093372 | 1 | 13 | 47 |
| fi_FI.blogs.txt | 108.50 | 439785 | 12785318 | 0 | 29 | 2353 |
| fi_FI.news.txt | 94.23 | 485758 | 10532432 | 1 | 22 | 478 |
| fi_FI.twitter.txt | 25.33 | 285214 | 3147083 | 1 | 11 | 44 |
| ru_RU.blogs.txt | 116.86 | 337100 | 9388482 | 1 | 28 | 1197 |
| ru_RU.news.txt | 119.00 | 196360 | 9057248 | 1 | 46 | 1581 |
| ru_RU.twitter.txt | 105.18 | 881414 | 9231328 | 1 | 10 | 36 |
Table 1.1: Corpus-Collection of Text summarizes our 12 collections by file name, size in MB, number of lines, number of words, and the minimum, average, and maximum words per line. The data sets cover four languages - German, English, Finnish, and Russian - each composed of blogs, news, and twitter collections. The German, English, and Finnish files use the Latin alphabet, the Russian files use Cyrillic, and in several cases I saw emoji characters inside the twitter collections.
Let us visualize each file by size (MB), number of lines, and number of words.
list_files %>% ggplot(aes(x=reorder(Name, -Size_MB),
y=Size_MB)) +
geom_col() +
theme(axis.text.x=element_text(angle=90,hjust=.1)) +
labs(x= "File Name", y= "File Size (mb)", title = "File Name and Size (mb)")
Number of lines:
list_files %>% ggplot(aes(x=reorder(Name, -Lines),
y=Lines)) +
geom_col() +
theme(axis.text.x=element_text(angle=90,hjust=.1)) +
labs(x= "File Name", y= "Number of Lines", title = "Number of Lines per File")
Number of Words:
list_files %>% ggplot(aes(x=reorder(Name, -Words),
y=Words)) +
geom_col() +
theme(axis.text.x=element_text(angle=90,hjust=.1)) +
labs(x= "File Name", y= "Number of Words", title = "Number of Words per File")
Tokens are the building blocks of Natural Language Processing. Tokenization is a way of separating text into smaller units using a delimiter such as a space or hyphen. Tokens can be words, characters, or subwords (Pai 2020).
Here is an example of word tokens from the sentence “You are the best”. Assuming space is the delimiter, tokenizing the sentence results in 4 tokens: You_are_the_best.
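In base R this can be sketched with strsplit(), splitting on the space delimiter:
# Splitting on spaces produces the four word tokens
strsplit("You are the best", split = " ")[[1]]
## [1] "You"  "are"  "the"  "best"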
The example provided is pretty straightforward, but there are trickier cases as well, such as words or characters like:
There are plenty of packages in R that can handle these cases. Some of the packages are tm, quanteda, openNLP, spacyr, RWeka, and tidytext. We will use spacyr for this part of the capstone, but in the long run I want to learn more about the Python environment for NLP. Combining these two languages is beneficial, and RStudio provides the necessary IDE to use both languages simultaneously. Let’s go through the basics of our data using spacyr.
Spacyr provides a convenient R wrapper around the Python spaCy package. It offers easy access to the following functionality of spaCy:
library(spacyr)
spacy_initialize(model = "en_core_web_sm")
# reading lines of English blogs, news and tweets
# with a sample size of 10000 starting from the top line.
blogs_samp <- readLines(finaldir[[4]], n=10000)
news_samp <- readLines(finaldir[[5]], n=10000)
tweets_samp <- readLines(finaldir[[6]], n=10000)
head(blogs_samp)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"
## [6] "If you have an alternative argument, let's hear it! :)"
length(blogs_samp)
## [1] 10000
The collection of blogs seems to have no specific topic; the entries are simply a collection of random blog posts, and the same applies to the news and tweets. Every line I’ve reviewed is not connected to the next line. Let’s dig deeper and see what we have inside our data using the spacy_parse() function.
parse_blogs <- spacy_parse(blogs_samp)
head(parse_blogs) %>% kable(caption = 'Spacyr Tokenization') %>% kableExtra::kable_styling()
| doc_id | sentence_id | token_id | token | lemma | pos | entity |
|---|---|---|---|---|---|---|
| text1 | 1 | 1 | In | in | ADP | |
| text1 | 1 | 2 | the | the | DET | DATE_B |
| text1 | 1 | 3 | years | year | NOUN | DATE_I |
| text1 | 1 | 4 | thereafter | thereafter | ADV | |
| text1 | 1 | 5 | , | , | PUNCT | |
| text1 | 1 | 6 | most | most | ADJ |
In the Spacyr Tokenization table, “doc_id = text1” is basically line 1 of our main blog file. The “sentence_id” numbers the sentences within that line: it starts at 1 and increases to 2, 3, 4, etc. each time a sentence ends with “.”, “?”, or “!”, depending on how many sentences are in the line. The token_id is the position of the word or punctuation mark within its sentence. The token is the actual word or punctuation mark, and the lemma is its root word. The “pos” column is the part of speech, such as noun, or adp = adposition (e.g. “in”, “of”, “after”, and “before”). The “entity” column tags persons, dates, numbers, etc.
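To see these columns on something smaller than a blog sample, we can parse one short, made-up sentence pair (this assumes spacy_initialize() has already been run above):
spacy_parse("The dogs were barking loudly. They stopped at noon.") %>%
  kable(caption = 'spacy_parse on a Short Example') %>%
  kableExtra::kable_styling()
Each word gets its lemma (e.g. “dogs” reduces to “dog”), a part-of-speech tag, and, where spaCy recognizes one, an entity tag.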
We can also tokenize our sample and return the output as a data.frame:
parse_blogs_tok <-
spacy_tokenize(blogs_samp,
remove_punct=TRUE,
output="data.frame")
parse_blogs_tok %>%
tail() %>%
kable(caption = "Tail") %>%
kableExtra::kable_styling()
| doc_id | token | |
|---|---|---|
| 420373 | text10000 | USING |
| 420374 | text10000 | THEM |
| 420375 | text10000 | TO |
| 420376 | text10000 | DO |
| 420377 | text10000 | HIS |
| 420378 | text10000 | WILL |
summary(parse_blogs_tok) %>%
kable(caption = "Summary") %>%
kableExtra::kable_styling()
| doc_id | token | |
|---|---|---|
| Length:420378 | Length:420378 | |
| Class :character | Class :character | |
| Mode :character | Mode :character |
We can extract language properties:
parse_blogs_entity <- spacy_parse(blogs_samp,
lemma = FALSE,
entity = TRUE,
nounphrase = TRUE)
entity_extract(parse_blogs_entity) %>%
arrange(doc_id) %>%
tail(10) %>%
kable(caption = "Entity Extraction") %>%
kableExtra::kable_styling()
| doc_id | sentence_id | entity | entity_type | |
|---|---|---|---|---|
| 13666 | text9977 | 1 | the_Young_Inventor_’s_Challenge | ORG |
| 13667 | text9979 | 1 | Savior | ORG |
| 13668 | text998 | 1 | the_North_Pacific | LOC |
| 13669 | text9981 | 1 | Christians | NORP |
| 13670 | text9986 | 1 | Indonesia | GPE |
| 13671 | text9988 | 1 | the_Coffin_Hop | FAC |
| 13672 | text999 | 2 | Pat | PERSON |
| 13673 | text9991 | 1 | M&S | ORG |
| 13674 | text9994 | 1 | Terry_Richardson | PERSON |
| 13675 | text9999 | 1 | Joe_Kingsley | PERSON |
We can extract dates, events and cardinal or ordinal quantities:
entity_extract(parse_blogs_entity,
type = "all") %>%
tail() %>%
kable(caption = "All Entity Extraction") %>%
kableExtra::kable_styling()
| doc_id | sentence_id | entity | entity_type | |
|---|---|---|---|---|
| 22848 | text9991 | 1 | M&S | ORG |
| 22849 | text9992 | 1 | 6:5 | CARDINAL |
| 22850 | text9993 | 2 | last_year | DATE |
| 22851 | text9994 | 1 | Terry_Richardson | PERSON |
| 22852 | text9995 | 2 | Sunday | DATE |
| 22853 | text9999 | 1 | Joe_Kingsley | PERSON |
Spacyr’s consolidation functions can compound multi-word entities into a single “token”:
entity_consolidate(parse_blogs_entity) %>%
tail(10) %>%
kable(caption = "Multi-Word Entities") %>%
kableExtra::kable_styling()
| doc_id | sentence_id | token_id | token | pos | entity_type | |
|---|---|---|---|---|---|---|
| 464512 | text9998 | 4 | 7 | tabs | NOUN | |
| 464513 | text9998 | 4 | 8 | on | ADP | |
| 464514 | text9998 | 4 | 9 | regular | ADJ | |
| 464515 | text9998 | 4 | 10 | commenters | NOUN | |
| 464516 | text9998 | 4 | 11 | on | ADP | |
| 464517 | text9998 | 4 | 12 | PROPN | ||
| 464518 | text9998 | 4 | 13 | . | PUNCT | |
| 464519 | text9999 | 1 | 1 | Writer | NOUN | |
| 464520 | text9999 | 1 | 2 | : | PUNCT | |
| 464521 | text9999 | 1 | 3 | Joe_Kingsley | ENTITY | PERSON |
It can also consolidate noun phrases into a single “token”:
nounphrase_consolidate(parse_blogs_entity) %>%
tail(10) %>%
kable(caption = "Noun Phrase") %>%
kableExtra::kable_styling()
| doc_id | sentence_id | token_id | token | pos | |
|---|---|---|---|---|---|
| 385109 | text9998 | 4 | 6 | keep | VERB |
| 385110 | text9998 | 4 | 7 | tabs | nounphrase |
| 385111 | text9998 | 4 | 8 | on | ADP |
| 385112 | text9998 | 4 | 9 | regular_commenters | nounphrase |
| 385113 | text9998 | 4 | 10 | on | ADP |
| 385114 | text9998 | 4 | 11 | nounphrase | |
| 385115 | text9998 | 4 | 12 | . | PUNCT |
| 385116 | text9999 | 1 | 1 | Writer | nounphrase |
| 385117 | text9999 | 1 | 2 | : | PUNCT |
| 385118 | text9999 | 1 | 3 | JoeKingsley | nounphrase |
Let’s analyze the blog, news, and tweet data sets and visualize them, starting with the blog sample.
x <- blogs_samp
x <- spacy_parse(x)
x$token <- tolower(x$token)
x$lemma <- tolower(x$lemma)
x <- x %>% filter(!pos %in% c("DET", "PUNCT", "ADP", "AUX", "PRON",
                              "CCONJ", "SCONJ", "SYM", "PART", "ADV"),
                  !lemma %in% c("’s", "’", "-", "_", "°"))
x <- x %>% group_by(lemma)
x <- x %>% summarise(count =n())
x <- x %>% arrange(-count)
wcloud <- wordcloud2::wordcloud2(x[1:50,])
markdown_widget(wcloud,path=".",filename="wordcloud.png")
(Figure: word cloud of the 50 most frequent blog lemmas - wordcloud.png)
ggplotly(x[1:50,] %>% ggplot(aes(x=lemma, y=count)) +
geom_point() +
labs(x= "Lemma", y="Count", title = "Blog Text Mining"))
x[1:50,] %>% ggplot(aes(y=reorder(lemma,count), x=count)) +
geom_col() +
labs(x= "Lemma", y="Count", title = "Blog Text Mining")
News Data Set:
x <- news_samp
x <- spacy_parse(x)
x$token <- tolower(x$token)
x$lemma <- tolower(x$lemma)
x <- x %>% filter(!pos %in% c("DET", "PUNCT", "ADP", "AUX", "PRON",
                              "CCONJ", "SCONJ", "SYM", "PART", "ADV"),
                  !lemma %in% c("’s", "’", "-", "_", "°"))
head(x)
## doc_id sentence_id token_id token lemma pos entity
## 1 text2 1 2 st. st. PROPN GPE_B
## 2 text2 1 3 louis louis PROPN GPE_I
## 3 text2 1 4 plant plant NOUN
## 4 text2 1 7 close close VERB
## 5 text2 2 3 die die VERB
## 6 text2 2 5 old old ADJ
x <- x %>% group_by(lemma)
x <- x %>% summarise(count =n())
x <- x %>% arrange(-count)
wcloud<-wordcloud2::wordcloud2(x[1:50,])
markdown_widget(wcloud,path=".",filename="wordcloud1.png")
(Figure: word cloud of the 50 most frequent news lemmas - wordcloud1.png)
ggplotly(x[1:50,] %>% ggplot(aes(x=lemma, y=count)) +
geom_point() +
labs(x= "Lemma", y="Count", title = "News Text Mining"))
x[1:50,] %>% ggplot(aes(y=reorder(lemma,count), x=count)) +
geom_col() +labs(x= "Lemma", y="Count", title = "News Text Mining")
Tweet Data Set:
x <- tweets_samp
x <- spacy_parse(x)
x$token <- tolower(x$token)
x$lemma <- tolower(x$lemma)
x <- x %>% filter(!pos %in% c("DET", "PUNCT", "ADP", "AUX", "PRON",
                              "CCONJ", "SCONJ", "SYM", "PART", "ADV"),
                  !lemma %in% c("’s", "’", "-", "_", "°"))
x <- x %>% group_by(lemma)
x <- x %>% summarise(count =n())
x <- x %>% arrange(-count)
wcloud <- wordcloud2::wordcloud2(x[1:50,])
markdown_widget(wcloud,path=".",filename="wordcloud2.png")
(Figure: word cloud of the 50 most frequent tweet lemmas - wordcloud2.png)
ggplotly(x[1:50,] %>% ggplot(aes(x=lemma, y=count)) +
geom_point() +
labs(x= "Lemma", y="Count", title = "Tweet Text Mining"))
x[1:50,] %>% ggplot(aes(y=reorder(lemma,count), x=count)) +
geom_col() +labs(x= "Lemma", y="Count", title = "Tweet Text Mining")
Now that we have explored the data sets, there are still a lot of questions that keep haunting me. The top three questions are as follows:
7ESL. 2020. “Simple Sentence.” https://7esl.com/simple-sentence/.
Benoit, Kenneth, and Akitaka Matsuo. 2020. “A Guide to Using Spacyr.” https://spacyr.quanteda.io/articles/using_spacyr.html.
Mayo, Matthew. 2017. “Building a Wikipedia Text Corpus for Natural Language Processing.” https://www.kdnuggets.com/2017/11/building-wikipedia-text-corpus-nlp.html#:~:text=In%20linguistics%20and%20NLP%2C%20corpus,of%20corpus)%20may%20be%20useful.
Pai, Aravind. 2020. “What Is Tokenization in NLP? Here’s All You Need to Know.” https://www.analyticsvidhya.com/blog/2020/05/what-is-tokenization-nlp/#:~:text=Tokenization%20is%20a%20common%20task%20in%20Natural%20Language%20Processing%20(NLP).&text=Tokens%20are%20the%20building%20blocks,words%2C%20characters%2C%20or%20subwords.