The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.
The motivation for this project is to:
- Demonstrate that you’ve downloaded the data and have successfully loaded it in.
- Create a basic report of summary statistics about the data sets.
- Report any interesting findings that you amassed so far.
- Get feedback on your plans for creating a prediction algorithm and Shiny app.
Review criteria
Does the link lead to an HTML page describing the exploratory analysis of the training data set?
Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?
Has the data scientist made basic plots, such as histograms to illustrate features of the data?
Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?
library(tidyverse) # data wrangling and plotting
library(stringi)   # string manipulations
library(tm) # text mining functions
zipfile_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(zipfile_url, "Coursera-SwiftKey.zip")
}
if (!dir.exists("Coursera-SwiftKey_unzipped")) {
  unzip(zipfile = "Coursera-SwiftKey.zip", exdir = "Coursera-SwiftKey_unzipped", overwrite = FALSE)
}
# Read the three English (en_US) source files
blogs.txt <- "Coursera-SwiftKey_unzipped/final/en_US/en_US.blogs.txt"
handle <- file(blogs.txt, open = "r")
blogs <- readLines(handle, encoding = "UTF-8", skipNul = TRUE)
close(handle)
# stri_stats_general(blogs) # number of lines and characters
twitter.txt <- "Coursera-SwiftKey_unzipped/final/en_US/en_US.twitter.txt"
handle <- file(twitter.txt, open = "r")
twitter <- readLines(handle, encoding = "UTF-8", skipNul = TRUE)
close(handle)
# stri_stats_general(twitter) # number of lines and characters
news.txt <- "Coursera-SwiftKey_unzipped/final/en_US/en_US.news.txt"
handle <- file(news.txt, open = "rb") # binary mode avoids truncation at embedded control characters
news <- readLines(handle, encoding = "UTF-8", skipNul = TRUE)
close(handle)
# stri_stats_general(news) # number of lines and characters
# Number of lines:
summary(blogs)
## Length Class Mode
## 899288 character character
summary(news)
## Length Class Mode
## 1010242 character character
summary(twitter)
## Length Class Mode
## 2360148 character character
# Word counts per line (one count per line, so the vector lengths equal the line counts; use sum() for total words):
blogs.words <- stringi::stri_count_words(blogs)
length(blogs.words)
## [1] 899288
news.words <- stringi::stri_count_words(news)
length(news.words)
## [1] 1010242
twitter.words <- stringi::stri_count_words(twitter)
length(twitter.words)
## [1] 2360148
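To put the per-line counts in context, the total number of words in each source can be obtained by summing them; a quick check (results not shown here):
# Total number of words per source (sums of the per-line counts above)
sum(blogs.words)
sum(news.words)
sum(twitter.words)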
I removed lines with no words. The table below shows basic statistics for the three sources (the number of items in each source and the average, median and maximum number of words per item). The histogram shows the distribution of words per item for each of the three sources. Both axes are log-scaled, with histogram breaks at powers of 2, for readability.
Blog posts and news articles are the longest items, averaging 41.75 words per post and 34.41 words per article, while tweets averaged only 12.75 words per tweet. The histograms show that a few blog posts reach several thousand words, while very few news articles exceed 1000 words.
Some items span several lines in the source files and some lines contain more than one item, but this has a negligible effect on the analyses that follow.
words <- data.frame(`Source` = c(rep("Blogs", length(blogs.words)),
rep("News", length(news.words)),
rep("Twitter", length(twitter.words))),
`Words` = c(blogs.words, news.words, twitter.words)) %>%
filter(`Words` > 0)
words %>% group_by(`Source`) %>%
summarise(`Lines` = n(),
`Average` = mean(`Words`),
# `Min` = min(`Words`),
`Median` = median(`Words`),
`Max` = max(`Words`))
ggplot(words, aes(x = `Words`)) +
  facet_grid(. ~ `Source`) +
  geom_histogram(aes(fill = `Source`), color = "white", size = 0.2, breaks = 2^(0:13)) +
  scale_x_log10("Words / item",
                limits = c(1, 10^4),
                breaks = 2^(0:13),
                labels = scales::number_format(accuracy = 1, big.mark = " ")) +
  scale_y_log10("Number of items",
                limits = c(1, 10^6),
                breaks = 10^(0:7),
                labels = scales::number_format(accuracy = 1, big.mark = " ")) +
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  theme(panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank())
# For speed of development only: sample ~10,000 lines of text (3,333 from each source)
# In a real scenario, one would analyse all of the data (hopefully on a computer with more than 4 GB of RAM)
set.seed(1)
blogs_sample <- blogs[sample(1:length(blogs), 3333)]
news_sample <- news[sample(1:length(news), 3333)]
twitter_sample <- twitter[sample(1:length(twitter), 3333)]
# Consolidate the three samples into a single character vector
sampleData <- c(blogs_sample, news_sample, twitter_sample)
# textData <- c(blogs, news, twitter)
tm.corpus <- tm::SimpleCorpus(tm::VectorSource(sampleData))
# tm.corpus <- tm::SimpleCorpus(tm::VectorSource(textData))
# Convert everything to the same letter case (lowercase)
tm.corpus <- tm::tm_map(tm.corpus, tm::content_transformer(stringi::stri_trans_tolower))
# Select letters only; replace everything else with spaces (removed later)
tm.corpus <- tm::tm_map(tm.corpus, tm::content_transformer(
function(x) { str_replace_all(x, "[^a-z]", " ") }
))
# Remove unnecessary spaces
tm.corpus <- tm::tm_map(tm.corpus, tm::stripWhitespace)
# Other possible transformations:
# tm.corpus <- tm::tm_map(tm.corpus, tm::removeNumbers)
# tm.corpus <- tm::tm_map(tm.corpus, tm::removePunctuation)
# tm.corpus <- tm::tm_map(tm.corpus, tm::content_transformer(
# function(x) { str_remove_all(x, "[\"\'\`”“™–—’‘]") }
# ))
# tm.corpus <- tm_map(tm.corpus, removeWords, stopwords('english')) # Remove some common but irrelevant words
# head(unlist(sapply(tm.corpus, identity)), 20) # preview
I chose to study n2-grams (pairs of consecutive words).
library(quanteda) # Functions for counting words and n-grams
tokens <- quanteda::tokens(paste(unlist(sapply(tm.corpus, identity)), collapse = " "),
# remove_punct = T,
# remove_symbols = T,
# remove_numbers = T,
# remove_url = T,
# remove_separators = T,
# split_hyphens = F
)
n1_grams <- quanteda::dfm(tokens) %>% quanteda::textstat_frequency() %>%
select(c("feature", "frequency"))
n2_grams <- quanteda::tokens_ngrams(tokens, n = 2, concatenator = " ") %>%
quanteda::dfm() %>% quanteda::textstat_frequency() %>%
select(c("feature", "frequency")) %>%
mutate(`features` = `feature`) %>%
tidyr::separate(col=`features`, into = c("first word", "second word"), sep=" ", remove=T)
S_first <- n2_grams %>%
group_by(`first word`) %>% summarise(S_first = sum(`frequency`)) # %>% arrange(desc(S_first))
n2_grams <- merge(x=n2_grams, y=S_first, by="first word") %>%
mutate(`probability2.1` = `frequency` / `S_first`) %>%
arrange(desc(`probability2.1`), `first word`, `second word`, desc(`frequency`)) %>%
# filter(`probability2.1`<0.5)
select(`feature`, `frequency`, `first word`, `second word`,`first word frequency` = `S_first`, `probability of second word` = `probability2.1`)
Common words like “the”, “and”, “to” or “of” are, unsurprisingly, the most prevalent in the dataset.
The chart below shows the 100 most common words, scaled by their prevalence.
set.seed(1)
top_first <- n1_grams %>% arrange(desc(`frequency`)) %>% dplyr::top_n(100, `frequency`)
wordcloud2::wordcloud2(top_first, size = 1.6, minSize = 0.5)
These words tend to be followed by common words as well.
The chart below shows the probability of the most common second word for each of the 50 most common first words. The n2-grams are ordered by their overall prevalence. For example, there are approximately 1500 occurrences of the n2-gram “of the”, which makes it the most frequent n2-gram; “the” is the word most commonly used after “of” (approximately 22% of the occurrences of “of” are followed by “the”).
top_first <- n1_grams %>% arrange(desc(`frequency`)) %>% dplyr::top_n(50, `frequency`)
top_first <- as.character(top_first$feature)
db <- n2_grams %>% filter(`first word` %in% top_first) %>%
filter(! duplicated(`first word`)) %>% arrange(desc(`frequency`))
db$`feature` = factor(db$`feature`, levels = db$`feature`)
db %>% ggplot(aes(x=`feature`, y=`probability of second word`)) +
geom_bar(aes(fill=`frequency`), stat="identity",width = 0.75) +
scale_y_continuous("Probability of the second word, for the particular first word",
labels = scales::percent_format(1))+
scale_x_discrete("n2-grams (features), sorted by their overall frequency")+
scale_fill_viridis_c("Frequency of\nn2-gram",
breaks = seq(0, 10000, 100), limits=c(0, 1500))+
coord_flip() +
theme(legend.position = c(1, 1), legend.justification = c(1, 1),
legend.background = element_blank(), legend.key.height = unit(15, "mm"))
After this preliminary analysis, I have a good idea of what the data look like. I will build a Shiny app that predicts the next word as the user types. It will predict the first word from the list of n1-grams, and each subsequent word from the list of n2-grams, given the previous word. The algorithm could be extended to n3-grams or higher orders, but I expect diminishing returns.
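As a rough illustration of the planned lookup (a minimal sketch based on the n1_grams and n2_grams tables built above, not the final implementation; the helper name predict_next_word is hypothetical):
# Minimal sketch of the planned next-word lookup; predict_next_word is a hypothetical helper
predict_next_word <- function(previous_word, n_suggestions = 3) {
  if (is.null(previous_word) || previous_word == "") {
    # No context yet: suggest the most frequent single words (n1-grams)
    return(head(n1_grams$feature, n_suggestions))
  }
  candidates <- n2_grams %>%
    filter(`first word` == tolower(previous_word)) %>%
    arrange(desc(`probability of second word`))
  if (nrow(candidates) == 0) {
    # Unseen first word: fall back to the most frequent n1-grams
    return(head(n1_grams$feature, n_suggestions))
  }
  head(candidates$`second word`, n_suggestions)
}
# Example: predict_next_word("of") should rank "the" among its top suggestions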
One possible refinement is to weight the n2-gram probabilities by the prevalence of the second word as an n1-gram. For example, after the word “data”, the algorithm may determine that the best next word is “analysis”. However, since the word “science” is much more prevalent on its own than “analysis”, a better algorithm would increase the score of “science” and decrease that of “analysis”, so the predicted n2-gram would be “data science”. The weighting factor could be a tuning parameter.
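One possible form of this weighting is a linear interpolation between the n2-gram probability and the stand-alone (n1-gram) probability of the candidate second word; a sketch only, where lambda is the hypothetical tuning parameter:
# Sketch of a weighted score; lambda is a hypothetical tuning parameter (0 = pure n2-gram probability)
lambda <- 0.3
scored <- n2_grams %>%
  left_join(n1_grams %>% rename(`second word` = feature, `unigram frequency` = frequency),
            by = "second word") %>%
  mutate(`unigram probability` = `unigram frequency` / sum(n1_grams$frequency),
         `score` = (1 - lambda) * `probability of second word` + lambda * `unigram probability`)
# For a given first word, suggest the second word with the highest score instead of the highest raw probability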