Coursera Capstone Project Milestone Report - Draft

Data Science Specialization from Johns Hopkins University

Author

Daniel Morales

Published

November 25, 2024

YAML Header

---
title: "Coursera Capstone Project Milestone Report"
subtitle: "Data Science Specialization from Johns Hopkins University"
author: "Daniel Morales"
date: last-modified
toc: true
format: 
  html: 
    code-fold: true
    code-summary: "Show the code"
    code-copy: true
execute: 
  cache: true
editor: visual
---

Introduction

This is the Milestone Report for the Capstone Project of the Data Science Specialization offered by Johns Hopkins University on Coursera. The goal of the Capstone Project is to create a Shiny app with a text box that, much like a smartphone keyboard, uses the given data to suggest three options for the next word to be typed.

The goal of this Milestone Report is to show that we can download, explore, and start modeling the data. The data is available to download here, and we will use the English-language files listed below:

en_US.blogs.txt
en_US.news.txt
en_US.twitter.txt

We assume that the data has already been downloaded, unzipped, and placed in the active R working directory.

Setup

We start by loading the required R packages and the data.

library(ggplot2)
library(knitr)
library(readr)
#library(RWeka)
library(stringi)
library(tm)

Loading required package: NLP


Attaching package: 'NLP'

The following object is masked from 'package:ggplot2':

    annotate

library(wordcloud)

Loading required package: RColorBrewer

blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
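One caveat worth checking at this step: on some platforms (notably Windows), `readLines()` on a text-mode connection can stop early when it meets an embedded end-of-file control character, which silently truncates the data. If a line count looks too small for a file's size, reading through a binary-mode connection is a common workaround. The helper below is a minimal sketch; the name `read_lines_binary` is hypothetical and not part of the original report.

```r
# Hypothetical helper: read a text file through a binary-mode connection so
# that embedded control characters do not end the read early.
read_lines_binary <- function(path, encoding = "UTF-8") {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = encoding, skipNul = TRUE, warn = FALSE)
}

# Usage, assuming the file is in the working directory:
# news <- read_lines_binary("en_US.news.txt")
# length(news)  # compare with the line count from the text-mode read
```

Comparing `length()` for both ways of reading is a quick check of whether any lines were lost.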

Data Summary

Let us start by investigating the data we will be using. Basic information about the files and their character counts is provided below, where CPL stands for characters per line.

# Keep the same order as the file names below: blogs, news, twitter
char_count <- list(nchar(blogs), nchar(news), nchar(twitter))

eda_files_chars <- data.frame(
  "File Name" = c("en_US.blogs.txt", 
                  "en_US.news.txt", 
                  "en_US.twitter.txt"),
  "File Size" = paste(round(
    file.info(c("en_US.blogs.txt",
                "en_US.news.txt",
                "en_US.twitter.txt"))$size / 1048576,
    digits = 1
  ), "MB"),
  "Line Count" = sapply(list(blogs,
                             news,
                             twitter), length),
  "Character Count" = sapply(char_count, sum),
  "Min CPL" = sapply(char_count, min),
  "Mean CPL" = sapply(char_count, mean),
  "Max CPL" = sapply(char_count, max),
  check.names = FALSE
)

kable(eda_files_chars, format.args = list(big.mark = ","))

File Name	File Size	Line Count	Character Count	Min CPL	Mean CPL	Max CPL
en_US.blogs.txt	200.4 MB	899,288	206,824,505	1	229.98695	40,833
en_US.news.txt	196.3 MB	77,259	15,639,408	2	202.42830	5,760
en_US.twitter.txt	159.4 MB	2,360,148	162,096,241	2	68.68054	140

As expected, the maximum number of characters per line in the Twitter data set is 140, the tweet length limit in force when the data was collected. Now let us look at word counts, where WPL stands for words per line.

words_per_line <- lapply(list(blogs, news, twitter), stri_count_words)

eda_files_words <- data.frame(
  "File Name" = c("en_US.blogs.txt", 
                  "en_US.news.txt", 
                  "en_US.twitter.txt"),
  "Word Count" = sapply(words_per_line, sum),
  "Min WPL" = sapply(words_per_line, min),
  "Mean WPL" = round(sapply(words_per_line, mean)),
  "Max WPL" = sapply(words_per_line, max),
  check.names = FALSE
)

kable(eda_files_words, format.args = list(big.mark = ","))

File Name	Word Count	Min WPL	Mean WPL	Max WPL
en_US.blogs.txt	37,546,250	0	42	6,726
en_US.news.txt	2,674,536	1	35	1,123
en_US.twitter.txt	30,093,413	1	13	47

Next, we visualize the distribution of words per line in each data set.

ggplot(data.frame(blogs_wpl = words_per_line[[1]]), aes(x = blogs_wpl)) +
  geom_histogram(binwidth = 5) + 
  xlab("Words per Line") +
  ylab("Frequency") +
  ggtitle("US Blogs") +
  theme_bw()

ggplot(data.frame(news_wpl = words_per_line[[2]]), aes(x = news_wpl)) +
  geom_histogram(binwidth = 5) + 
  xlab("Words per Line") +
  ylab("Frequency") +
  ggtitle("US News") +
  theme_bw()

ggplot(data.frame(twitter_wpl = words_per_line[[3]]), aes(x = twitter_wpl)) +
  geom_histogram(binwidth = 5) + 
  xlab("Words per Line") +
  ylab("Frequency") +
  ggtitle("US Twitter") +
  theme_bw()

Preparing the Data

# dataset <- Corpus(VectorSource(c(blogs, news, twitter)))
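Building a corpus from all three files at once can be expensive in memory and time. One possible approach, sketched here in base R, is to draw a small random sample of lines and normalize the text before any further processing. The helper names `sample_lines` and `clean_lines`, the 1% sampling fraction, and the seed are all assumptions made for illustration; the cleaning rule keeps only ASCII letters and straight apostrophes, so accented characters and curly quotes are dropped.

```r
# Hypothetical helper: draw a reproducible random sample of lines.
sample_lines <- function(x, frac = 0.01, seed = 1234) {
  set.seed(seed)
  n <- max(1, round(length(x) * frac))
  x[sample.int(length(x), size = n)]
}

# Hypothetical helper: lower-case the text, replace everything except
# ASCII letters, apostrophes and spaces with a space, then collapse
# repeated whitespace and trim the ends.
clean_lines <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]+", " ", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

example_lines <- c("Hello, World!  It's 2024.", "R is   FUN :)")
clean_lines(example_lines)
# [1] "hello world it's" "r is fun"
```

With the real data, something like `clean_lines(sample_lines(c(blogs, news, twitter)))` would produce a subset small enough to explore interactively.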

Exploratory Data Analysis

# dtm <- TermDocumentMatrix(dataset) 
# matrix <- as.matrix(dtm) 
# words <- sort(rowSums(matrix), decreasing = TRUE) 
# df <- data.frame(word = names(words), freq = words)
# 
# wordcloud(words = df$word,
#           freq = df$freq,
#           min.freq = 1, 
#           max.words = 200, 
#           random.order = FALSE, rot.per = 0.35,
#           colors = brewer.pal(8, "Dark2"))
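As a lighter-weight alternative to the term-document matrix approach commented out above, word frequencies can also be tabulated directly in base R. This is only a sketch on toy data: `word_frequencies` is a hypothetical helper, and the tokenizer simply splits on anything that is not an ASCII letter or apostrophe. The result has the same `word` and `freq` columns that the word cloud call expects.

```r
# Hypothetical helper: tokenize lines and count how often each word occurs.
word_frequencies <- function(lines) {
  tokens <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens)]  # drop empty strings left by the split
  counts <- sort(table(tokens), decreasing = TRUE)
  data.frame(word = names(counts),
             freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

toy <- c("the cat sat on the mat", "the dog sat")
head(word_frequencies(toy), 3)
```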

Grading

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

Demonstrate that you’ve downloaded the data and have successfully loaded it in.
Create a basic report of summary statistics about the data sets.
Report any interesting findings that you have amassed so far.
Get feedback on your plans for creating a prediction algorithm and Shiny app.

Grading Criteria Overview

Does the link lead to an HTML page describing the exploratory analysis of the training data set?
Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?
Has the data scientist made basic plots, such as histograms to illustrate features of the data?
Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?

Exploratory Data Analysis

The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships you observe in the data and prepare to build your first linguistic models.

Tasks to accomplish

Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.
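The word-pair frequencies mentioned above can be sketched in a few lines of base R. The functions `make_ngrams` and `count_ngrams` are hypothetical names, and the example runs on toy sentences rather than the real corpora; n-grams are built within each line so that they never span two unrelated lines.

```r
# Hypothetical helper: return all n-grams of one tokenized line.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: count n-grams over a character vector of lines.
count_ngrams <- function(lines, n) {
  token_list <- strsplit(tolower(lines), "[^a-z']+")
  token_list <- lapply(token_list, function(tok) tok[nzchar(tok)])
  grams <- unlist(lapply(token_list, make_ngrams, n = n))
  sort(table(grams), decreasing = TRUE)
}

toy <- c("the cat sat on the mat", "the cat ran")
count_ngrams(toy, 2)  # "the cat" is the only bigram that occurs twice
```

Calling `count_ngrams(toy, 3)` gives the corresponding 3-gram counts.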

Questions to consider

Some words are more frequent than others - what are the distributions of word frequencies?
What are the frequencies of 2-grams and 3-grams in the dataset?
How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
How do you evaluate how many of the words come from foreign languages?
Can you think of a way to increase the coverage -- identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?
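The coverage question above reduces to a cumulative sum over a frequency-sorted dictionary. The sketch below uses made-up counts purely for illustration, and `words_for_coverage` is a hypothetical helper name.

```r
# Hypothetical helper: smallest number of most-frequent words whose
# cumulative count reaches the requested share of all word instances.
words_for_coverage <- function(freq, coverage) {
  freq <- sort(freq, decreasing = TRUE)
  cumulative_share <- cumsum(freq) / sum(freq)
  unname(which(cumulative_share >= coverage)[1])
}

# Toy counts, not taken from the real data
toy_freq <- c(the = 50, of = 20, and = 15, cat = 10, zebra = 5)
words_for_coverage(toy_freq, 0.5)  # 1
words_for_coverage(toy_freq, 0.9)  # 4
```

On real text, word frequencies are heavily skewed, so the number of words needed typically grows much faster between 50% and 90% coverage than in this toy example.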

Modeling

The goal here is to build your first simple model for the relationship between words. This is the first step in building a predictive text mining application. You will explore simple models and discover more complicated modeling techniques.

Tasks to accomplish

Build basic n-gram model - using the exploratory analysis you performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.
Build a model to handle unseen n-grams - in some cases people will want to type a combination of words that does not appear in the corpora. Build a model to handle cases where a particular n-gram isn’t observed.
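To make the first task concrete, here is a minimal sketch of a bigram model that suggests up to three next words, matching the three options the final app is meant to offer. It is a toy illustration in base R, not the model this report will eventually use; `build_bigram_model` and `predict_next` are hypothetical names, and it assumes at least one input line contains two or more words.

```r
# Hypothetical helper: count (previous word, next word) pairs and sort them
# so that the most frequent continuation of each word comes first.
build_bigram_model <- function(lines) {
  token_list <- strsplit(tolower(lines), "[^a-z']+")
  token_list <- lapply(token_list, function(tok) tok[nzchar(tok)])
  pairs <- do.call(rbind, lapply(token_list, function(tok) {
    if (length(tok) < 2) return(NULL)
    data.frame(prev = head(tok, -1), nxt = tail(tok, -1),
               stringsAsFactors = FALSE)
  }))
  counts <- aggregate(n ~ prev + nxt, data = cbind(pairs, n = 1L), FUN = sum)
  counts[order(counts$prev, -counts$n), ]
}

# Hypothetical helper: return up to k most frequent words seen after `word`.
predict_next <- function(model, word, k = 3) {
  hits <- model[model$prev == tolower(word), ]
  head(hits$nxt, k)
}

toy <- c("i love r", "i love data", "i like r")
model <- build_bigram_model(toy)
predict_next(model, "I")    # "love" "like"
predict_next(model, "zzz")  # character(0): unseen word, no suggestion
```

The empty result for an unseen word is exactly the gap that the second task, handling unseen n-grams, is meant to close.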

Questions to consider

How can you efficiently store an n-gram model (think Markov Chains)?
How can you use the knowledge about word frequencies to make your model smaller and more efficient?
How many parameters do you need (i.e. how big is n in your n-gram model)?
Can you think of simple ways to “smooth” the probabilities (think about giving all n-grams a non-zero probability even if they aren’t observed in the data)?
How do you evaluate whether your model is any good?
How can you use backoff models to estimate the probability of unobserved n-grams?
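One simple way to approach the backoff question is a fixed-penalty scheme often called "stupid backoff": score a candidate word by its relative frequency after the longest observed context, and multiply by a constant each time the context has to be shortened. The sketch below is an illustration under stated assumptions, not a committed design: the function names are hypothetical, the penalty of 0.4 is a commonly used default that should be treated as tunable, and the result is a score rather than a true probability.

```r
# Hypothetical helper: count n-grams of order n as a plain named vector.
ngram_counts <- function(token_list, n) {
  grams <- unlist(lapply(token_list, function(tok) {
    if (length(tok) < n) return(character(0))
    vapply(seq_len(length(tok) - n + 1),
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  setNames(as.numeric(tab), names(tab))
}

# Hypothetical helper: score `word` after `context` (a character vector of
# previous words). `counts` is a list where counts[[n]] holds n-gram counts.
backoff_score <- function(word, context, counts, alpha = 0.4) {
  penalty <- 1
  while (length(context) > 0) {
    n <- length(context) + 1
    num <- counts[[n]][paste(c(context, word), collapse = " ")]
    if (!is.na(num) && num > 0) {
      den <- counts[[n - 1]][paste(context, collapse = " ")]
      return(penalty * unname(num) / unname(den))
    }
    context <- context[-1]       # drop the oldest word and back off
    penalty <- penalty * alpha
  }
  uni <- counts[[1]][word]
  if (is.na(uni)) return(0)
  penalty * unname(uni) / sum(counts[[1]])
}

tokens <- list(c("i", "love", "r"), c("i", "love", "data"), c("you", "love", "r"))
counts <- lapply(1:3, function(n) ngram_counts(tokens, n))
backoff_score("r", c("i", "love"), counts)   # seen trigram: 1 / 2
backoff_score("r", c("we", "love"), counts)  # backs off once: 0.4 * 2 / 3
```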

Hints, tips, and tricks

As you develop your prediction model, two key aspects that you will have to keep in mind are the size and runtime of the algorithm. These are defined as:

Size: the amount of memory (physical RAM) required to run the model in R
Runtime: the amount of time the algorithm takes to make a prediction given the acceptable input

Your goal for this prediction model is to minimize both the size and runtime of the model in order to provide a reasonable experience to the user.

Keep in mind that currently available predictive text models can run on mobile phones, which typically have limited memory and processing power compared to desktop computers. Therefore, you should consider very carefully (1) how much memory is being used by the objects in your workspace; and (2) how much time it is taking to run your model. Ultimately, your model will need to run in a Shiny app that runs on the shinyapps.io server.

Tips, tricks, and hints

Here are a few tools that may be of use to you as you work on your algorithm:

object.size(): this function reports the number of bytes that an R object occupies in memory
Rprof(): this function runs the profiler in R that can be used to determine where bottlenecks in your function may exist. The profr package (available on CRAN) provides some additional tools for visualizing and summarizing profiling data.
gc(): this function runs the garbage collector to retrieve unused RAM for R. In the process it tells you how much memory is currently being used by R.
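A short sketch of how these three tools might be used together is shown below. The `lookup` object is a made-up stand-in for a model table, and the profiled workload is arbitrary; `summaryRprof()` is wrapped in `tryCatch()` because it raises an error when the profiler recorded no samples.

```r
# Toy stand-in for a model object: a named vector of 100,000 entries
lookup <- setNames(seq_len(1e5), paste0("w", seq_len(1e5)))

# Size: memory occupied by the object
size <- object.size(lookup)
format(size, units = "MB")

# Runtime: profile an arbitrary workload and summarize time per function
prof_file <- tempfile()
Rprof(prof_file, interval = 0.01)
for (i in 1:50) sort(runif(1e5))
Rprof(NULL)  # stop profiling
profile <- tryCatch(summaryRprof(prof_file)$by.self,
                    error = function(e) NULL)
head(profile)

# Memory: run the garbage collector and inspect current usage
mem <- gc()
mem[, "used"]
```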

There will likely be a tradeoff that you have to make between size and runtime. For example, an algorithm that requires a lot of memory may run faster, while a slower algorithm may require less memory. You will have to find the right balance between the two in order to provide a good experience to the user.