This is the Milestone Report for the Data Science Capstone Project. The goal of this phase is to demonstrate that we have successfully loaded the dataset, performed an initial exploratory data analysis, and planned the structure for our prediction model and eventual Shiny app.
The data comes from a corpus called HC Corpora, which contains English text collected from blogs, news articles, and Twitter.
We downloaded and read the following three files:
en_US.blogs.txt
en_US.news.txt
en_US.twitter.txt

Below is a summary of the number of lines, words, characters, and maximum line length in each dataset:
library(stringi)
# Load the raw data files using absolute paths
blogs <- readLines("/Users/drsn/Desktop/Data Science Course and Certificates/Capstone/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("/Users/drsn/Desktop/Data Science Course and Certificates/Capstone/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("/Users/drsn/Desktop/Data Science Course and Certificates/Capstone/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# Summary table of lines, words, characters, and maximum line length
summary_df <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(stri_count_words(blogs)),
            sum(stri_count_words(news)),
            sum(stri_count_words(twitter))),
  Characters = c(sum(nchar(blogs)),
                 sum(nchar(news)),
                 sum(nchar(twitter))),
  MaxLineLength = c(max(nchar(blogs)),
                    max(nchar(news)),
                    max(nchar(twitter)))
)
summary_df
## Source Lines Words Characters MaxLineLength
## 1 Blogs 899288 37546250 206824505 40833
## 2 News 1010242 34762395 203223159 11384
## 3 Twitter 2360148 30093413 162096241 140
At this stage, only minimal preprocessing was done. In future steps, we will clean the data further by removing numbers and special characters and by applying standard text normalization such as stopword removal and, if needed, stemming. A sketch of the planned cleaning step is shown below.
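The helper below is only an illustration of the kind of cleaning we have in mind; the function name clean_text and the exact regular expressions are assumptions that will be refined during model building.

library(stringr)

# Illustrative cleaning helper (assumed name and rules, to be refined later)
clean_text <- function(x) {
  x <- str_to_lower(x)                      # normalize case
  x <- str_replace_all(x, "[0-9]+", " ")    # remove numbers
  x <- str_replace_all(x, "[^a-z' ]", " ")  # remove special characters
  str_squish(x)                             # collapse repeated whitespace
}

# Example: clean_text("Call me at 555-1234!!") returns "call me at"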
We drew a 1% random sample from each data source and merged the samples into a combined corpus.
set.seed(2025)
sample_data <- function(data, prob = 0.01) {
  # Keep each line with probability `prob` via independent Bernoulli draws
  data[rbinom(length(data), 1, prob) == 1]
}
blogs_sample <- sample_data(blogs)
news_sample <- sample_data(news)
twitter_sample <- sample_data(twitter)
text_sample <- c(blogs_sample, news_sample, twitter_sample)
library(dplyr)
library(tidytext)
library(ggplot2)
library(tibble)
sample_tbl <- tibble(line = 1:length(text_sample), text = text_sample)
unigram <- sample_tbl %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

unigram %>%
  slice_max(order_by = n, n = 20) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Words", x = "Words", y = "Frequency")
bigram <- sample_tbl %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%  # drop NA rows from lines too short to form a bigram
  count(bigram, sort = TRUE)

bigram %>%
  slice_max(order_by = n, n = 20) %>%
  ggplot(aes(x = reorder(bigram, n), y = n)) +
  geom_col(fill = "darkgreen") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Bigrams", x = "Bigrams", y = "Frequency")
trigram <- sample_tbl %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%  # drop NA rows from lines too short to form a trigram
  count(trigram, sort = TRUE)

trigram %>%
  slice_max(order_by = n, n = 20) %>%
  ggplot(aes(x = reorder(trigram, n), y = n)) +
  geom_col(fill = "tomato") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Trigrams", x = "Trigrams", y = "Frequency")
The goal of this project is to develop a next-word prediction model, similar to those used in smart keyboards. Based on our exploratory analysis, we will create the model from n-gram frequency tables (unigrams, bigrams, and trigrams). A sketch of one possible lookup strategy is shown below.
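The sketch assumes a simple backoff from trigrams to bigrams to unigrams; the function name predict_next_word, the split column names (w1, w2, next_word), and the tidyr::separate() step are illustrative assumptions rather than the final design.

library(dplyr)
library(tidyr)

# Split the stored n-grams into context words plus the final (predicted) word
bigram_split  <- bigram  %>% separate(bigram,  into = c("w1", "next_word"), sep = " ")
trigram_split <- trigram %>% separate(trigram, into = c("w1", "w2", "next_word"), sep = " ")

# Back off from trigrams to bigrams to unigrams when no match is found
predict_next_word <- function(phrase, top_n = 3) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  len <- length(words)

  if (len >= 2) {
    hits <- trigram_split %>%
      filter(w1 == words[len - 1], w2 == words[len]) %>%
      slice_max(order_by = n, n = top_n)
    if (nrow(hits) > 0) return(hits$next_word)
  }
  if (len >= 1) {
    hits <- bigram_split %>%
      filter(w1 == words[len]) %>%
      slice_max(order_by = n, n = top_n)
    if (nrow(hits) > 0) return(hits$next_word)
  }
  # Fall back to the overall most frequent words
  unigram %>% slice_max(order_by = n, n = top_n) %>% pull(word)
}

# Example call: predict_next_word("thanks for the")

Backing off to shorter n-grams keeps the lookup fast and guarantees a prediction even when the exact context was never seen in the training sample.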
The final model will be deployed as a Shiny web app on shinyapps.io. Users will input one or more words, and the app will return a list of likely next words based on the training data.
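As a rough illustration of the intended app structure (assuming the predict_next_word helper sketched above; the widget labels and layout are placeholders, not the final design):

library(shiny)

# Minimal sketch of the planned app
ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  verbatimTextOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderPrint({
    req(input$phrase)
    predict_next_word(input$phrase)
  })
}

shinyApp(ui, server)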
This milestone shows that the dataset has been successfully loaded, a representative sample has been created, and initial exploration of word, bigram, and trigram frequencies has been performed. We are now on track to begin building our prediction model and deploying it via Shiny.