The goal of this project is to demonstrate that I've become comfortable working with the data and that I'm on track to create the prediction algorithm:
library(tidyverse) ## dplyr, tidyr, tibble, ggplot2
library(tidytext)  ## unnest_tokens(), stop_words

file_dest <- "./source/Coursera-SwiftKey.zip"
file_src <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
## Save time, download only if you haven't
if(!dir.exists("./source")) dir.create("./source")
if(!file.exists(file_dest)) {
  download.file(file_src, file_dest)
}
## Save time, unzip only if you haven't
if(!dir.exists("./source/final")) {
  unzip(file_dest, exdir = "./source/")
}
list.files("./source/final/en_US/")
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
## Read as UTF-8 and create tibbles (dataframes)
## skipNul = TRUE guards against embedded nul characters in these files
text_blog <- readLines("./source/final/en_US/en_US.blogs.txt", encoding = "UTF-8")
df_text_blog <- tibble(text = text_blog)
text_news <- readLines("./source/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
df_text_news <- tibble(text = text_news)
text_twit <- readLines("./source/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
df_text_twit <- tibble(text = text_twit)
Familiarize: questions to consider
Let’s describe the breadth of the data using two measures: lines of text and file size in MB.
## Make a data frame for graphing.
mb_plot <- data.frame(
  Text = c("Blog", "News", "Tweets"),
  Lines = c(
    nrow(df_text_blog),
    nrow(df_text_news),
    nrow(df_text_twit)
  ),
  Mb = c(
    round(file.size("./source/final/en_US/en_US.blogs.txt") / 1024^2),
    round(file.size("./source/final/en_US/en_US.news.txt") / 1024^2),
    round(file.size("./source/final/en_US/en_US.twitter.txt") / 1024^2)
  )
) %>%
  ggplot(aes(Lines, Mb)) +
  geom_point(colour = "#08457e", size = 3) +
  geom_text(aes(label = Text), vjust = -0.75) +
  ylab("File size (MB)") +
  ylim(150, 215) +
  xlab("Lines of text") +
  theme_minimal() +
  ggtitle("Lines of text and file sizes for blogs, news and tweets")
mb_plot
Note that News and Blogs have a greater file size relative to the number of lines they contain. This is likely due to the 140-character limit Twitter imposed at the time.
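We can sanity-check this by comparing average characters per line (a quick sketch using the vectors read above):
## Mean characters per line for each corpus
sapply(
  list(Blog = text_blog, News = text_news, Tweets = text_twit),
  function(x) round(mean(nchar(x)))
)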
Text will need to be arranged into tokens of one, two and three words (a token being a meaningful unit of text). These n-grams capture relationships between words as they appear together. Below are the frequencies for the blog text; you’ll notice the high incidence of stop words, i.e. commonly used words.
For the sake of brevity, Blog will be treated as a simple word frequency, News as a bi-gram and Twitter as a tri-gram. Further analysis will include all analyses on all three texts.
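As a minimal illustration of what unnest_tokens returns (a toy sentence, not drawn from the corpora):
## Toy example: tokenizing one sentence
toy <- tibble(text = "the quick brown fox jumps")
toy %>% unnest_tokens(word, text)                             ## unigrams: "the", "quick", ...
toy %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)  ## bigrams: "the quick", "quick brown", ...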
## Word frequencies
blog_words <- df_text_blog %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  top_n(25) %>%
  ggplot(aes(reorder(word, n), n)) +
  geom_col(fill = "#08457e") +
  ggtitle("Top 25 Blog Words - Frequencies") +
  xlab("Words") +
  ylab("Frequency") +
  coord_flip()
## Selecting by n
blog_words
A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2. The frequency distribution of every bigram in a string is commonly used for simple statistical analysis of text in many applications, including in computational linguistics, cryptography, speech recognition, and so on. ([Wikipedia](https://en.wikipedia.org/wiki/Bigram))
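The separate/filter/unite pattern used below drops any bigram containing a stop word. A minimal sketch of the idea on made-up bigrams:
## Toy example: removing bigrams that contain stop words
tibble(bigram = c("of the", "united states", "in a", "social media")) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ")  ## keeps "united states" and "social media"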
news_bigrams <- df_text_news %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>% # lines too short for a bigram yield NA
  separate(bigram, c("word1", "word2"), sep = " ") %>% # Sep to filter stop words
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ") %>% # Re-combine
  count(bigram, sort = TRUE) %>%
  top_n(25) %>%
  ggplot(aes(reorder(bigram, n), n)) +
  geom_col(fill = "#08457e") +
  ggtitle("Top 25 News Bigrams") +
  xlab("Bigrams") +
  ylab("Frequency") +
  coord_flip()
## Selecting by n
news_bigrams
Trigrams are a special case of the n-gram, where n is 3. They are often used in natural language processing for performing statistical analysis of texts. ([Wikipedia](https://en.wikipedia.org/wiki/Trigram))
twit_trigrams <- df_text_twit %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>% # lines too short for a trigram yield NA
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>% # Sep to filter stop words
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!word3 %in% stop_words$word) %>%
  unite(trigram, word1, word2, word3, sep = " ") %>% # Re-combine
  count(trigram, sort = TRUE) %>%
  top_n(25) %>%
  ggplot(aes(reorder(trigram, n), n)) +
  geom_col(fill = "#08457e") +
  ggtitle("Top 25 Twitter Trigrams") +
  xlab("Trigrams") +
  ylab("Frequency") +
  coord_flip()
## Selecting by n
twit_trigrams
“The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm.” This has been accomplished by downloading, exploring and profiling the data. Word frequencies and n-grams were calculated and visualized. This establishes a base for the next phase of research.
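As a preview of where this leads (a minimal sketch, not the final algorithm: bigram_counts is a toy stand-in for counts computed as above, and predict_next is a hypothetical helper):
## Toy sketch: predicting the next word from bigram counts
bigram_counts <- tibble(
  word1 = c("happy", "happy", "good", "good"),
  word2 = c("birthday", "hour", "morning", "luck"),
  n = c(50, 20, 40, 15)
)
predict_next <- function(prev_word, counts, top = 3) {
  counts %>%
    filter(word1 == prev_word) %>%
    slice_max(n, n = top) %>%
    pull(word2)
}
predict_next("happy", bigram_counts)  ## "birthday" then "hour"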