The goal of this project is simply to demonstrate that you’ve become comfortable working with the data and that you are on track to create your prediction algorithm.
Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm.
This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager.
You should make use of tables and plots to illustrate important summaries of the data set.
The motivation for this project is to:
Demonstrate that you’ve downloaded the data and have successfully loaded it in.
Create a basic report of summary statistics about the data sets.
Report any interesting findings that you have amassed so far.
Get feedback on your plans for creating a prediction algorithm and Shiny app.
Does the link lead to an HTML page describing the exploratory analysis of the training data set?
Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?
Has the data scientist made basic plots, such as histograms to illustrate features of the data?
Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?
This report is the first step in the capstone project for the Data Science Specialization from Johns Hopkins University. The project uses a corpus of blog, news, and Twitter text to predict the next word after a few letters or words are entered at a prompt. This kind of interface integrates well with mobile applications, where input options are limited.
Finally, we will use this report to briefly analyze the data and outline the plan for the predictive model behind the Shiny app.
# A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
# https://www.rdocumentation.org/packages/ggplot2/versions/3.3.0
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
# A framework for text mining applications within R.
# https://www.rdocumentation.org/packages/tm/versions/0.7-7
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
# Basic classes and methods for Natural Language Processing.
# https://www.rdocumentation.org/packages/NLP/versions/0.2-0
library(NLP)
# An interface to the Apache OpenNLP tools (version 1.5.3). The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text written in Java
# https://www.rdocumentation.org/packages/openNLP/versions/0.2-7
library(openNLP)
## Warning: package 'openNLP' was built under R version 3.6.3
# Provides color schemes for maps (and other graphics) designed by Cynthia Brewer as described at http://colorbrewer2.org
# https://www.rdocumentation.org/packages/RColorBrewer/versions/1.1-2
# Amazing lib for color schemes: https://colorbrewer2.org/#type=sequential&scheme=Purples&n=3
library(RColorBrewer)
# Fast, correct, consistent, portable and convenient character string/text processing in every locale and any native encoding
# https://www.rdocumentation.org/packages/stringi/versions/1.4.6
library(stringi)
## Warning: package 'stringi' was built under R version 3.6.2
# Low-level interface to Java VM very much like .C/.Call and friends. Allows creation of objects, calling methods and accessing fields.
# https://www.rdocumentation.org/packages/rJava/versions/0.9-12
library(rJava)
## Warning: package 'rJava' was built under R version 3.6.3
# An R interface to Weka (Version 3.9.3). Weka is a collection of machine learning algorithms for data mining tasks written in Java, containing tools for data pre-processing, classification, regression, clustering, association rules, and visualization
# https://www.rdocumentation.org/packages/RWeka/versions/0.4-42
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.6.3
library(RWekajars) ## required by RWeka
# An R interface to the C 'libstemmer' library that implements Porter's word stemming algorithm for collapsing words to a common root to aid comparison of vocabulary
# https://www.rdocumentation.org/packages/SnowballC/versions/0.7.0
library(SnowballC)
## Warning: package 'SnowballC' was built under R version 3.6.3
# Automates many of the tasks associated with quantitative discourse analysis of transcripts containing discourse including frequency counts of sentence types, words, sentences, turns of talk, syllables and other assorted analysis tasks.
# https://www.rdocumentation.org/packages/qdap/versions/2.3.6
library(qdap)
## Warning: package 'qdap' was built under R version 3.6.3
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
##
## Attaching package: 'qdapRegex'
## The following object is masked from 'package:ggplot2':
##
## %+%
## Loading required package: qdapTools
## Registered S3 methods overwritten by 'qdap':
## method from
## t.DocumentTermMatrix tm
## t.TermDocumentMatrix tm
##
## Attaching package: 'qdap'
## The following objects are masked from 'package:tm':
##
## as.DocumentTermMatrix, as.TermDocumentMatrix
## The following object is masked from 'package:NLP':
##
## ngrams
## The following object is masked from 'package:base':
##
## Filter
The files used for this analysis include only the English-language files from the full corpora: US English blogs (en_US.blogs.txt), US English news (en_US.news.txt), and US English tweets from Twitter (en_US.twitter.txt).
The data will be loaded from the local project folder. The files are included in the GitHub repo.
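If the raw files are not already on disk, they can be downloaded first. Below is a minimal sketch, assuming the dataset URL from the Coursera capstone materials is still valid (verify before use):
# Sketch: fetch and unpack the English files if they are missing
# (URL taken from the Coursera capstone materials; confirm it is current)
if (!file.exists("en_US.blogs.txt")) {
  zip_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(zip_url, destfile = "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip",
        files = c("final/en_US/en_US.blogs.txt",
                  "final/en_US/en_US.news.txt",
                  "final/en_US/en_US.twitter.txt"),
        junkpaths = TRUE)  # drop the folder structure so the files land in the project root
}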
# Open each file in binary mode, read all lines as UTF-8 while skipping
# embedded nulls, and close the connection when done
content_blogs <- file("en_US.blogs.txt",
                      open = "rb")
data_blogs <- readLines(content_blogs,
                        encoding = "UTF-8",
                        skipNul = TRUE)
close(content_blogs)
content_news <- file("en_US.news.txt",
                     open = "rb")
data_news <- readLines(content_news,
                       encoding = "UTF-8",
                       skipNul = TRUE)
close(content_news)
content_twitter <- file("en_US.twitter.txt",
                        open = "rb")
data_twitter <- readLines(content_twitter,
                          encoding = "UTF-8",
                          skipNul = TRUE)
close(content_twitter)
We’ll calculate the size of each file in the corpus, the number of lines and words per file, and the length of the longest line in each file.
blogs_file_size <- file.info("en_US.blogs.txt")$size / 1024 / 1024
blogs_lines <- length(data_blogs)
blogs_words <- sum(stri_count_words(data_blogs))
blogs_line_length <- max(nchar(data_blogs))
paste("Blog File size (MB) = ",blogs_file_size)
## [1] "Blog File size (MB) = 200.424207687378"
paste("Blog Lines = ", blogs_lines)
## [1] "Blog Lines = 899288"
paste("Blog Words = ", blogs_words)
## [1] "Blog Words = 37546239"
paste("Blog Longest line length = ", blogs_line_length)
## [1] "Blog Longest line length = 40833"
news_file_size <- file.info("en_US.news.txt")$size / 1024 / 1024
news_lines <- length(data_news)
news_words <- sum(stri_count_words(data_news))
news_line_length <- max(nchar(data_news))
paste("News File size (MB) = ", news_file_size)
## [1] "News File size (MB) = 196.277512550354"
paste("News Lines = ", news_lines)
## [1] "News Lines = 1010242"
paste("News Words = ", news_words)
## [1] "News Words = 34762395"
paste("News Longest line length = ", news_line_length)
## [1] "News Longest line length = 11384"
twits_file_size <- file.info("en_US.twitter.txt")$size / 1024 / 1024
twits_lines <- length(data_twitter)
twits_words <- sum(stri_count_words(data_twitter))
twits_line_length <- max(nchar(data_twitter))
paste("Twits File size (MB) = ", twits_file_size)
## [1] "Twits File size (MB) = 159.364068984985"
paste("Twits Lines = ", twits_lines)
## [1] "Twits Lines = 2360148"
paste("Twits Words = ", twits_words)
## [1] "Twits Words = 30093413"
paste("Twits Longest line length = ", twits_line_length)
## [1] "Twits Longest line length = 140"
data_summary <- data.frame(
file_names = c("Blogs","News","Twits"),
file_size = c(blogs_file_size, news_file_size, twits_file_size),
line_counts = c(blogs_lines, news_lines, twits_lines),
word_counts = c(blogs_words, news_words, twits_words),
max_line_length = c(blogs_line_length, news_line_length, twits_line_length)
)
data_summary
## file_names file_size line_counts word_counts max_line_length
## 1 Blogs 200.4242 899288 37546239 40833
## 2 News 196.2775 1010242 34762395 11384
## 3 Twits 159.3641 2360148 30093413 140
Data size >>> The blogs file is slightly larger than the news or Twitter files.
Lines >>> The Twitter file has the highest line count but the smallest file size, most likely because each tweet is limited to 140 characters.
Words >>> Blogs contain the most words, surprisingly surpassing the news. One would expect the news data to have more words, since news items are usually in-depth articles, whereas blog posts are often quick expressions of the blogger’s mind.
Line length >>> Blogs also contain the longest line, at 40,833 characters.
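Since the same four summaries are computed three times above, a small helper function could remove the repetition. A minimal sketch (summarize_corpus_file is a hypothetical helper, not used elsewhere in this report; it assumes the data vectors loaded above are still in memory):
# Sketch: compute all four summaries for one file in a single call
summarize_corpus_file <- function(path, lines) {
  data.frame(file_size_mb    = file.info(path)$size / 1024 / 1024,
             line_count      = length(lines),
             word_count      = sum(stri_count_words(lines)),
             max_line_length = max(nchar(lines)))
}
# Example: summarize_corpus_file("en_US.blogs.txt", data_blogs)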
In mobile apps, the amount of data that can be downloaded and stored on the device is limited, so we create a sample of the original data for the algorithm to draw inferences from.
set.seed(7765287)
size_per_sample <- 1000
# Draw 1,000 lines from each source (with replacement) and combine them
blogs_sample <- sample(data_blogs,
                       size = size_per_sample,
                       replace = TRUE)
news_sample <- sample(data_news,
                      size = size_per_sample,
                      replace = TRUE)
twits_sample <- sample(data_twitter,
                       size = size_per_sample,
                       replace = TRUE)
samples <- c(blogs_sample,
             news_sample,
             twits_sample)
paste("Length of samples = ", length(samples))
## [1] "Length of samples = 3000"
writeLines(samples, "samples.txt")
We will use a list of profanity or “bad words” in the English language, such as those banned by major search providers (Google, Yahoo, etc.).
More information can be found here: https://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/
We will then clean the corpus in a series of steps until it is ready for tokenization.
# Read the sample created before and create an initial corpus
samples_file <- file("samples.txt")
samples_lines <- readLines(samples_file)
# Create the corpus based on the sample data
# NOTE: in the most recent version of tm we need to use VCorpus and not Corpus
samples_corpus <- VCorpus(
VectorSource(
samples_lines))
# More info here:
# https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
# Step 1 - convert the corpus to UTF-8
samples_corpus <- tm_map(samples_corpus,
content_transformer(
function(x) iconv(x,
to="UTF-8",
sub="byte")))
# Step 2 - convert the corpus to lowercase
samples_corpus <- tm_map(samples_corpus,
content_transformer(tolower))
# Step 3 - remove punctuation, including - , ; : " ', etc
samples_corpus <- tm_map(samples_corpus,
content_transformer(removePunctuation),
preserve_intra_word_dashes=TRUE)
# Step 4 - filter "bad words"
custom_bad_words <- readLines("custom-bad-words-nlp.txt")
samples_corpus <- tm_map(samples_corpus,
removeWords,
custom_bad_words)
# Step 5 - remove numbers
samples_corpus <- tm_map(samples_corpus,
content_transformer(removeNumbers))
# Step 6 - remove URLs
urls_to_remove <- function(x) gsub("http[[:alnum:]]*", "", x)
samples_corpus <- tm_map(samples_corpus,
content_transformer(urls_to_remove))
# Step 7 - remove stop words in English.
# Note: this might remove words that are needed to understand the content of the sentence
# More info here: https://en.wikipedia.org/wiki/Stop_words
samples_corpus <- tm_map(samples_corpus,
removeWords,
stopwords("english"))
# Step 8 - Remove white spaces
samples_corpus <- tm_map(samples_corpus,
stripWhitespace)
# Step 9 - the final step converts the corpus to plain text
samples_corpus <- tm_map(samples_corpus,
PlainTextDocument)
# Write the cleaned corpus to a CSV file; useful for debugging
write.csv(samples_corpus,
"samples_corpus.csv",
row.names=F)
# Create an R data object to easily reload the corpus
saveRDS(samples_corpus, file = "clean_corpus.RData")
# Print a few lines of the corpus to make sure it was cleaned correctly
for (i in 1:25){
print(samples_corpus[[i]]$content)
}
## [1] "aureolin"
## [1] "grace receive day"
## [1] "luckily someone tough work – pam downtown case contact barnes noble determined figure sand try get books onto ipads need – couple weeks away"
## [1] "yet christopher nolan needed loosen collar inception find humor playfulness can come odd group -developed personalities entering peoples dreams though tom hardys eames comes closest fun job film seems feel ashamed embrace caper weighted crazy wife story device dicaprio already tackled shutter island even end revolutionary road insight relationship merely supposed supply depth make movie serious capital s say final shot movie leaves question whether cobb still dream state reaction matter one way "
## [1] " breakfast room glass box extends beyond main volume capture views indian creek morning sun main vertical element core residence ’ diameter void houses partially suspended stair spirals ’ width ground level ’ width roof garden third level"
## [1] " even wonders nature right empty temperature-controlled abode caught spiders creepy-crawlies moved stuff now theyve returned reinforcementsand extended families awesome"
## [1] "whats moral story go ballet man dont get tricky worth risk wives smarter us"
## [1] "psi worried might sleep horrific brain injury though wear helmet always called nurse hotline apparently everything ok just ice neck take easy command boyfriend give gentle massages sounds good "
## [1] "even though surrounded sweet nothing lovers valentine’s never really moved remember celebrating mother best friend high school years february suddenly got way going home will buy mother artificial rose us will spend day ordinarily best friend’s case ’s tradition give card heart shape stuffs together remembered days gone ’s valentine"
## [1] " metal drunkards metal bros"
## [1] "april "
## [1] "now know need give reasons must support anna hazare beneath anna’s dignity beg make case support fighting abusive corrupt regime still prefer ostriches country nah country vast term even ’m ostrich fact may even ostrich city delhi let’s just keep ostriches city ’s quick recap"
## [1] "behold will make thee new sharp threshing instrument teeth thou shalt thresh mountains beat small shalt make hills chaff thou shalt fan wind shall carry away whirlwind shall scatter thou shalt rejoice lord shalt glory holy one "
## [1] "full-room shots nearly entire house except kids rooms bathrooms"
## [1] " nature consider equality something grasped"
## [1] " lit face many times number"
## [1] " important note one reasons democrats leftists avowed marxists easily steal away rights country “ people” grown largely ignorant rights begin many know rights know abstractly example many citizens know religious freedom yet aren’t familiar way first amendment worded don’t understand amendment doesn’t just recognize -given right religious freedom actually bars government interfering religious exercises words first amendment says “congress shall make law respecting establishment religion prohibiting free exercise thereof” ties hands government hands people yet ignorance ’ve allowed things reversed"
## [1] "janeane garofalo"
## [1] "one instant accident"
## [1] "oddly type judgement sometimes swings way meryl streep portraying toothy bouffant iron lady international screens renewed discussion britains now-frail ex-prime minister feel must mention margaret thatcher neighbour"
## [1] " remember elementary school use walk school knee-deep snow walk home snow even deeper uphill ways school knew teachers principals people basic values parents basically demanded respect effort adhering specific values values biblical one ever complained biblical values laws based promote hard work along compassion"
## [1] "secondly found wanting book end sign good story got last paragraph went back re-read last couple pages just remind ending im spoilers left feeling hope thats never bad thing"
## [1] " economy covered business media newspapers magazines cable channels tends get short-shrift mainstream media well may shortage reporters actually understand economic trends beyond obvious"
## [1] "endre beeing high sugar birthday party"
## [1] " sense loneliness"
file_clean_corpus <- readRDS("clean_corpus.RData")
# load the clean corpus into a data frame
clean_corpus <- data.frame(text =
unlist(
sapply(
file_clean_corpus,
`[`,
"content")),
stringsAsFactors = FALSE)
# clean_corpus  # uncomment to inspect the full data frame
We’re going to read all the words from the corpus and create tokens of one, two, and three words, called n-grams (unigrams, bigrams, and trigrams, respectively).
corpus_unigrams <- NGramTokenizer(clean_corpus,
Weka_control(min = 1,
max = 1,
delimiters = " \\r\\n\\t.,;:\"()?!"))
corpus_unigrams <- data.frame(table(corpus_unigrams))
corpus_unigrams <- corpus_unigrams[
order(
corpus_unigrams$Freq,
decreasing = TRUE),]
names(corpus_unigrams) <- c("Word", "Frequency")
corpus_unigrams$Word <- as.character(corpus_unigrams$Word)
write.csv(corpus_unigrams[
corpus_unigrams$Frequency > 1,
],
"corpus_unigrams.csv",
row.names=F)
corpus_unigrams <- read.csv("corpus_unigrams.csv",
stringsAsFactors = F)
saveRDS(corpus_unigrams, file = "corpus_unigrams.RData")
corpus_unigrams <- readRDS("corpus_unigrams.RData")
ggplt <- ggplot(data = corpus_unigrams[1:5,],
                aes(x = reorder(Word, Frequency),  # order bars by frequency
                    y = Frequency))
ggplt2 <- ggplt +
  geom_bar(stat = "identity") +
  coord_flip() +  # flip once only; a second coord_flip() replaces the first and emits a warning
  ggtitle("Unigram Frequency") +
  xlab("Word")
ggplt3 <- ggplt2 +
  geom_text(aes(label = Frequency),
            hjust = -1) +
  scale_fill_grey() +
  theme_classic()
ggplt3
head(corpus_unigrams)
## Word Frequency
## 1 said 300
## 2 one 276
## 3 will 255
## 4 just 218
## 5 like 212
## 6 can 187
corpus_bigrams <- NGramTokenizer(clean_corpus,
Weka_control(min = 2,
max = 2,
delimiters = " \\r\\n\\t.,;:\"()?!"))
corpus_bigrams <- data.frame(
table(
corpus_bigrams))
corpus_bigrams <- corpus_bigrams[
order(
corpus_bigrams$Freq,
decreasing = TRUE),]
names(corpus_bigrams) <- c("Words","Frequency")
corpus_bigrams$Words <- as.character(corpus_bigrams$Words)
head(corpus_bigrams)
## Words Frequency
## 21544 last year 21
## 26691 new york 19
## 45539 years ago 15
## 11261 dont know 12
## 21514 last night 12
## 26623 new jersey 12
bi_ggplt <- ggplot(data = corpus_bigrams[1:5,],
                   aes(x = reorder(Words, Frequency),  # order bars by frequency
                       y = Frequency))
bi_ggplt2 <- bi_ggplt +
  geom_bar(stat = "identity") +
  coord_flip() +  # flip once only; a second coord_flip() replaces the first and emits a warning
  ggtitle("Bigram Frequency") +
  xlab("Bigram")
bi_ggplt3 <- bi_ggplt2 +
  geom_text(aes(label = Frequency),
            hjust = -1) +
  scale_fill_grey() +
  theme_classic()
bi_ggplt3
# Split each bigram into its two component words
split_words <- strsplit(corpus_bigrams$Words, split = " ")
corpus_bigrams <- transform(corpus_bigrams,
                            word_one = sapply(split_words, "[[", 1),
                            word_two = sapply(split_words, "[[", 2))
corpus_bigrams <- data.frame(word_one = corpus_bigrams$word_one,
word_two = corpus_bigrams$word_two,
frequency = corpus_bigrams$Frequency,
stringsAsFactors = FALSE)
write.csv(corpus_bigrams[
corpus_bigrams$frequency > 1,],
"corpus_bigrams.csv",
row.names = F)
corpus_bigrams <- read.csv("corpus_bigrams.csv",
stringsAsFactors = F)
saveRDS(corpus_bigrams,"corpus_bigrams.RData")
head(corpus_bigrams)
## word_one word_two frequency
## 1 last year 21
## 2 new york 19
## 3 years ago 15
## 4 dont know 12
## 5 last night 12
## 6 new jersey 12
corpus_trigram <- NGramTokenizer(clean_corpus,
Weka_control(min = 3,
max = 3,
delimiters = " \\r\\n\\t.,;:\"()?!"))
corpus_trigram <- data.frame(
table(
corpus_trigram))
corpus_trigram <- corpus_trigram[
order(
corpus_trigram$Freq,
decreasing = TRUE),]
names(corpus_trigram) <- c("Words","Frequency")
corpus_trigram$Words <- as.character(corpus_trigram$Words)
head(corpus_trigram)
## Words Frequency
## 1324 aesthetic correction circuitry 5
## 15301 fort jackson south 5
## 19788 im pretty sure 5
## 20905 jackson south carolina 5
## 19443 human rights abuses 4
## 23008 let us know 4
tri_ggplt <- ggplot(data = corpus_trigram[1:5,],
                    aes(x = reorder(Words, Frequency),  # order bars by frequency
                        y = Frequency))
tri_ggplt2 <- tri_ggplt +
  geom_bar(stat = "identity") +
  coord_flip() +  # flip once only; a second coord_flip() replaces the first and emits a warning
  ggtitle("Trigram Frequency") +
  xlab("Trigram")
tri_ggplt3 <- tri_ggplt2 +
  geom_text(aes(label = Frequency),
            hjust = -1) +
  scale_fill_grey() +
  theme_classic()
tri_ggplt3
split_word_3 <- strsplit(corpus_trigram$Words,
split = " ")
# corpus sample with all trigrams split
head(split_word_3)
## [[1]]
## [1] "aesthetic" "correction" "circuitry"
##
## [[2]]
## [1] "fort" "jackson" "south"
##
## [[3]]
## [1] "im" "pretty" "sure"
##
## [[4]]
## [1] "jackson" "south" "carolina"
##
## [[5]]
## [1] "human" "rights" "abuses"
##
## [[6]]
## [1] "let" "us" "know"
corpus_trigram <- transform(corpus_trigram,
one = sapply(split_word_3,"[[",1),
two = sapply(split_word_3,"[[",2),
three = sapply(split_word_3,"[[",3))
# corpus after transformed with split words
head(corpus_trigram)
## Words Frequency one two three
## 1324 aesthetic correction circuitry 5 aesthetic correction circuitry
## 15301 fort jackson south 5 fort jackson south
## 19788 im pretty sure 5 im pretty sure
## 20905 jackson south carolina 5 jackson south carolina
## 19443 human rights abuses 4 human rights abuses
## 23008 let us know 4 let us know
corpus_trigram <- data.frame(word_one = corpus_trigram$one,
word_two = corpus_trigram$two,
word_three = corpus_trigram$three,
frequency = corpus_trigram$Frequency,
stringsAsFactors = FALSE)
# corpus after added to a new trigram corpus dataframe
head(corpus_trigram)
## word_one word_two word_three frequency
## 1 aesthetic correction circuitry 5
## 2 fort jackson south 5
## 3 im pretty sure 5
## 4 jackson south carolina 5
## 5 human rights abuses 4
## 6 let us know 4
write.csv(corpus_trigram[
corpus_trigram$frequency > 1,],
"corpus_trigram.csv",
row.names = F)
corpus_trigram <- read.csv("corpus_trigram.csv",
stringsAsFactors = F)
saveRDS(corpus_trigram,"corpus_trigram.RData")
# final corpus after saved an R object
head(corpus_trigram)
## word_one word_two word_three frequency
## 1 aesthetic correction circuitry 5
## 2 fort jackson south 5
## 3 im pretty sure 5
## 4 jackson south carolina 5
## 5 human rights abuses 4
## 6 let us know 4
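These n-gram tables are the building blocks for the prediction algorithm. As a preview, here is a minimal sketch of how they could drive next-word prediction with a simple frequency-based backoff (predict_next_word is a hypothetical helper; the final algorithm may differ):
# Sketch: look up the last two words typed in the trigram table and
# back off to the bigram table when no trigram matches
predict_next_word <- function(phrase, n = 3) {
  words <- tolower(strsplit(trimws(phrase), "\\s+")[[1]])
  len <- length(words)
  # Try the trigram table first, matching on the last two words
  if (len >= 2) {
    hits <- corpus_trigram[corpus_trigram$word_one == words[len - 1] &
                           corpus_trigram$word_two == words[len], ]
    if (nrow(hits) > 0)
      return(head(hits[order(-hits$frequency), "word_three"], n))
  }
  # Back off to the bigram table, matching on the last word only
  hits <- corpus_bigrams[corpus_bigrams$word_one == words[len], ]
  if (nrow(hits) > 0)
    return(head(hits[order(-hits$frequency), "word_two"], n))
  character(0)  # no match in either table
}
# Example (results depend on the sampled corpus):
# predict_next_word("new york")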