Exploratory Data Analysis Report

1. Data Loading

We successfully downloaded and loaded the datasets.

# Change these paths to point to your local copies of the data files
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

length(blogs); length(news); length(twitter)
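
One thing worth checking at this stage is that no file was cut short. On Windows, reading in text mode may stop early at an embedded control character, which would silently undercount lines. The sketch below is one possible safeguard, not part of the original analysis: `read_corpus` is a hypothetical helper that reads through a binary-mode connection. It is demonstrated on a temporary file so it runs without the real data.

```r
# Hypothetical helper (not in the original report): read a text file through
# a binary-mode connection so the platform's text-mode handling is bypassed.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))  # always release the connection, even on error
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Self-contained demonstration on a temporary file instead of the real data
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo <- read_corpus(tmp)
print(length(demo))
unlink(tmp)
```

If the line counts from this helper differ from the ones above, the text-mode read was probably truncated.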

2. Basic Summaries

For each dataset we computed the line count, the total word count, and the average number of words per line, where a word is any whitespace-separated token.

summary_stats <- function(data){
  # Split on runs of whitespace; the backslash must be doubled inside an R string
  words_per_line <- sapply(strsplit(data, "\\s+"), length)
  data.frame(
    Lines = length(data),
    TotalWords = sum(words_per_line),
    AvgWordsPerLine = round(mean(words_per_line), 2)
  )
}

stats <- rbind(
  Blogs   = summary_stats(blogs),
  News    = summary_stats(news),
  Twitter = summary_stats(twitter)
)

library(knitr)  # provides kable()
kable(stats, caption = "Summary Statistics of the Three Datasets")
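
To make the counting rule concrete, here is a small illustration on invented sample strings (they are not taken from the datasets). It shows that punctuation stays attached to tokens, that empty lines count as zero words, and that leading whitespace adds a spurious empty token unless the line is trimmed first.

```r
# Invented sample lines, for illustration only
sample_lines <- c("Hello world", "one  two   three", "", "it's 5pm, right?")

# Trim, then split on runs of whitespace and count tokens per line
tokens <- strsplit(trimws(sample_lines), "\\s+")
words_per_line <- lengths(tokens)
print(words_per_line)

# Without trimming, leading whitespace yields an extra empty first token
untrimmed <- lengths(strsplit("  leading", "\\s+"))
print(untrimmed)
```

On real data the effect of untrimmed lines is likely small, but it slightly inflates the totals.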

3. Basic Plots

Here we show histograms of words per line to illustrate differences between datasets.

# Words per line for each source (note the doubled backslash in the regex)
blogs_words   <- sapply(strsplit(blogs, "\\s+"), length)
news_words    <- sapply(strsplit(news, "\\s+"), length)
twitter_words <- sapply(strsplit(twitter, "\\s+"), length)

df_plot <- data.frame(
  words = c(blogs_words, news_words, twitter_words),
  source = factor(c(
    rep("Blogs", length(blogs_words)),
    rep("News", length(news_words)),
    rep("Twitter", length(twitter_words))
  ))
)

library(ggplot2)  # provides ggplot(), geom_histogram(), labs()
ggplot(df_plot, aes(x = words, fill = source)) +
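  # binwidth = 4 matches 50 bins over 0-200; coord_cartesian() zooms without dropping rows
  geom_histogram(binwidth = 4, boundary = 0, alpha = 0.6, position = "identity") +
  coord_cartesian(xlim = c(0, 200)) +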
  labs(title = "Distribution of Words per Line", 
       x = "Words per Line", y = "Count")

4. Interesting Findings

Blogs and news entries are longer and more formal.
Twitter entries are short and informal, often containing hashtags/emojis.
Words-per-line distributions are highly right-skewed: most lines are short, with a long tail of very long ones, as is typical of natural-language text.
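
One simple way to back up these observations with numbers is to compare the median and the mean per source: a mean well above the median indicates right skew. The sketch below uses invented toy counts rather than the real data, and the names `toy` and `skew_summary` are assumptions made for illustration.

```r
# Invented toy word counts standing in for the real words-per-line vectors
toy <- data.frame(
  words  = c(5, 12, 40, 230, 8, 30, 35, 90, 3, 7, 9, 20),
  source = rep(c("Blogs", "News", "Twitter"), each = 4)
)

# Median, mean, and maximum per source (one row per source)
by_source <- split(toy$words, toy$source)
skew_summary <- t(sapply(by_source, function(w) {
  c(median = median(w), mean = mean(w), max = max(w))
}))
print(skew_summary)
```

The same call applied to `df_plot` from the plotting section would give the corresponding figures for the actual datasets.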

5. Plan for Prediction Algorithm

Preprocessing: cleaning, lowercasing, removing URLs/emojis where necessary.
N-gram model: build bigram and trigram frequency tables.
Prediction: implement a backoff strategy when higher-order n-grams are missing.
Evaluation: test on a held-out sample.
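
The plan above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the final model: the toy corpus is invented, the function names (`clean_text`, `make_ngrams`, `ngram_table`, `predict_next`) are hypothetical, and the backoff is the simplest possible one (use the highest-order table that has any match, with no discounting or smoothing).

```r
# Toy corpus, invented for illustration; real input would be the sampled datasets
corpus <- c("I love data science", "I love R", "data science is fun",
            "Check this out http://example.com #rstats")

# Preprocessing: lowercase, drop URLs, keep only letters, apostrophes, spaces
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("http\\S+|www\\.\\S+", " ", x)
  x <- gsub("[^a-z' ]", " ", x)
  trimws(gsub("\\s+", " ", x))
}

# All n-grams of order n from one tokenized line
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Frequency table of n-grams, sorted by decreasing count
ngram_table <- function(lines, n) {
  tokens <- strsplit(clean_text(lines), " ", fixed = TRUE)
  grams <- unlist(lapply(tokens, make_ngrams, n = n))
  sort(table(grams), decreasing = TRUE)
}

uni <- ngram_table(corpus, 1)
bi  <- ngram_table(corpus, 2)
tri <- ngram_table(corpus, 3)

# Simple backoff: try trigrams, then bigrams, then fall back to top unigrams
predict_next <- function(text, tri, bi, uni, k = 3) {
  tokens <- strsplit(clean_text(text), " ", fixed = TRUE)[[1]]
  lookup <- function(tab, prefix) {
    hits <- tab[startsWith(names(tab), paste0(prefix, " "))]
    sub(".* ", "", names(hits))  # keep only the last word of each match
  }
  n_tok <- length(tokens)
  preds <- character(0)
  if (n_tok >= 2) preds <- lookup(tri, paste(tokens[n_tok - 1], tokens[n_tok]))
  if (length(preds) == 0 && n_tok >= 1) preds <- lookup(bi, tokens[n_tok])
  if (length(preds) == 0) preds <- names(uni)
  head(preds, k)
}

print(predict_next("I love", tri, bi, uni))    # answered from the trigram table
print(predict_next("big data", tri, bi, uni))  # no trigram match, backs off to bigrams
```

For the evaluation step, the same functions could be built on a training split and scored by how often the true next word of a held-out line appears among the top predictions.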

6. Plan for Shiny App

Input box for user text.
Output: top 3 predicted next words.
Lightweight, intuitive UI for non-technical users.
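
A minimal sketch of what such an app might look like is given below. It assumes the shiny package is installed. `predict_next_words` is a hypothetical placeholder that returns fixed words until the n-gram model exists, and the input and output IDs are arbitrary choices made for illustration.

```r
# Placeholder predictor (hypothetical): stands in for the n-gram model.
predict_next_words <- function(text, k = 3) {
  if (!nzchar(trimws(text))) return(character(0))
  head(c("the", "and", "to"), k)
}

# Build the app object; calling this requires the shiny package.
make_app <- function() {
  ui <- shiny::fluidPage(
    shiny::titlePanel("Next Word Prediction"),
    shiny::textInput("user_text", "Type a phrase:"),
    shiny::verbatimTextOutput("predictions")
  )
  server <- function(input, output, session) {
    output$predictions <- shiny::renderText({
      words <- predict_next_words(input$user_text, k = 3)
      if (length(words) == 0) {
        "Start typing to see suggestions."
      } else {
        paste(words, collapse = "\n")
      }
    })
  }
  shiny::shinyApp(ui, server)
}

# To launch interactively: shiny::runApp(make_app())
```

Keeping the predictor behind a single function means the placeholder can later be swapped for the real model without touching the UI code.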

Conclusion

We confirmed successful data loading and exploration.
Key differences across datasets were identified.
Next steps: build the n-gram prediction model and the Shiny app.