1. Introduction

This is the Milestone Report for the Data Science Capstone Project. The goal of this phase is to demonstrate that we have successfully loaded the dataset, performed an initial exploratory data analysis, and planned the structure for our prediction model and eventual Shiny app.

The data comes from a corpus called HC Corpora, which contains English text collected from blogs, news articles, and Twitter.


2. Data Summary

We downloaded and read the following three files from the English portion of the corpus: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.

Below is a summary of the line, word, and character counts, and the maximum line length, for each file:

library(stringi)

# Load the data using absolute paths
blogs <- readLines("/Users/drsn/Desktop/Data Science Course and Certificates/Capstone/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("/Users/drsn/Desktop/Data Science Course and Certificates/Capstone/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("/Users/drsn/Desktop/Data Science Course and Certificates/Capstone/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Summary table
summary_df <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(stri_count_words(blogs)),
            sum(stri_count_words(news)),
            sum(stri_count_words(twitter))),
  Characters = c(sum(nchar(blogs)),
                 sum(nchar(news)),
                 sum(nchar(twitter))),
  MaxLineLength = c(max(nchar(blogs)),
                    max(nchar(news)),
                    max(nchar(twitter)))
)
summary_df
##    Source   Lines    Words Characters MaxLineLength
## 1   Blogs  899288 37546250  206824505         40833
## 2    News 1010242 34762395  203223159         11384
## 3 Twitter 2360148 30093413  162096241           140

3. Exploratory Data Analysis

At this stage, only minimal preprocessing was done. In future steps, we will clean the data further by removing numbers and special characters, and by applying standard text normalization such as stopword removal and stemming where needed.
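As a preview, here is a minimal sketch of the kind of cleaning function we have in mind; the exact steps are still open, and clean_text is a placeholder name:

# Sketch of the planned cleaning step: lowercase, strip numbers and
# special characters, collapse whitespace. Exact choices still open.
clean_text <- function(x) {
  x <- tolower(x)                   # normalize case
  x <- gsub("[0-9]+", " ", x)       # remove numbers
  x <- gsub("[^a-z' ]", " ", x)     # remove special characters
  gsub("\\s+", " ", trimws(x))      # collapse extra whitespace
}

clean_text("Call me at 555-0100, ASAP!!")
## [1] "call me at asap"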

We drew a roughly 1% random sample from each data source (each line is kept independently with probability 0.01) and merged the samples into a combined corpus.

set.seed(2025)

# Keep each line independently with probability `prob`, giving a ~1% sample
sample_data <- function(data, prob = 0.01) {
  data[rbinom(length(data), 1, prob) == 1]
}

blogs_sample <- sample_data(blogs)
news_sample <- sample_data(news)
twitter_sample <- sample_data(twitter)

text_sample <- c(blogs_sample, news_sample, twitter_sample)

3.1 Word Frequency (Unigram)

library(dplyr)
library(tidytext)
library(ggplot2)
library(tibble)

sample_tbl <- tibble(line = seq_along(text_sample), text = text_sample)

unigram <- sample_tbl %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

unigram %>%
  slice_max(order_by = n, n = 20) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Words", x = "Words", y = "Frequency")

3.2 Word Pairs (Bigram)

bigram <- sample_tbl %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%   # drop lines too short to form a bigram
  count(bigram, sort = TRUE)

bigram %>%
  slice_max(order_by = n, n = 20) %>%
  ggplot(aes(x = reorder(bigram, n), y = n)) +
  geom_col(fill = "darkgreen") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Bigrams", x = "Bigrams", y = "Frequency")

3.3 Word Triplets (Trigram)

trigram <- sample_tbl %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%  # drop lines too short to form a trigram
  count(trigram, sort = TRUE)

trigram %>%
  slice_max(order_by = n, n = 20) %>%
  ggplot(aes(x = reorder(trigram, n), y = n)) +
  geom_col(fill = "tomato") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Trigrams", x = "Trigrams", y = "Frequency")


4. Plans for Prediction Algorithm and App

The goal of this project is to develop a next-word prediction model, similar to those used in smart keyboards. Based on our exploratory analysis, we will create a model using n-gram frequency tables (1-gram, 2-gram, 3-gram). The key steps will include:

- cleaning the text further (removing numbers and special characters, as noted in Section 3);
- building unigram, bigram, and trigram frequency tables from a training sample;
- looking up an input phrase in those tables to return the most likely next word(s), as sketched below.
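As an illustration only, here is a minimal sketch of how such a lookup might work against the trigram and bigram tables from Section 3, falling back to shorter n-grams when a longer match is missing (a simple backoff). The function name predict_next and the backoff rule are placeholders, not a final design:

# Sketch: look up the last two words in the trigram table, back off to
# the bigram table, then to the most frequent unigrams overall.
library(dplyr)
library(stringr)

predict_next <- function(phrase, k = 3) {
  words <- str_split(str_to_lower(str_trim(phrase)), "\\s+")[[1]]
  last <- length(words)

  # Try the trigram table on the last two words.
  if (last >= 2) {
    prefix <- paste(words[last - 1], words[last])
    hits <- trigram %>%
      filter(str_starts(trigram, fixed(paste0(prefix, " ")))) %>%
      slice_max(order_by = n, n = k)
    if (nrow(hits) > 0) return(word(hits$trigram, 3))
  }

  # Back off to the bigram table on the last word alone.
  hits <- bigram %>%
    filter(str_starts(bigram, fixed(paste0(words[last], " ")))) %>%
    slice_max(order_by = n, n = k)
  if (nrow(hits) > 0) return(word(hits$bigram, 2))

  # Last resort: the most frequent unigrams overall.
  unigram$word[seq_len(k)]
}

predict_next("thanks for the")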

The final model will be deployed as a Shiny web app using shinyapps.io. Users will input one or more words, and the app will return a list of likely next words based on the training data.
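A minimal sketch of what the app skeleton could look like, assuming a predict_next() helper like the one sketched above (the layout and labels are placeholders):

library(shiny)

# Minimal skeleton: one text input, one table of suggested next words.
# Assumes predict_next() and the n-gram tables it reads are available.
ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  tableOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderTable({
    req(nzchar(input$phrase))          # wait until the user types something
    data.frame(Prediction = predict_next(input$phrase))
  })
}

shinyApp(ui = ui, server = server)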


5. Conclusion

This milestone shows that the dataset has been successfully loaded, a representative sample has been created, and an initial exploration of word and n-gram frequencies has been performed. We are now on track to begin building our prediction model and deploying it via Shiny.