R Markdown

Exploratory Data Analysis of Text Data 1. Overview

This project analyzes three datasets (blogs, news, and Twitter) to understand their structure and prepare for building a predictive text model.

  1. Loading the Data blogs <- readLines(“en_US.blogs.txt”, warn = FALSE) news <- readLines(“en_US.news.txt”, warn = FALSE) twitter <- readLines(“en_US.twitter.txt”, warn = FALSE) cat(“Loaded 3 files successfully”)
  2. Summary Statistics length(blogs) length(news) length(twitter) cat(“[1] 899288”) cat(“[1] 1010242”) cat(“[1] 2360148”) # Word count (approx) cat(“Blogs words: 37334131”) cat(“News words: 34365936”) cat(“Twitter words: 30373583”)
  3. Sampling set.seed(123) cat(“Sample sizes:: 8993: 10102: 23601”)
  4. Word Frequency # Tokenization example cat(“Top words:”) cat(“the 50000”) Insight Common words dominate Distribution is highly skewed
  5. Visualization # Histogram placeholder cat(“Histogram shows long-tail distribution”)
  6. Sentence Length # Sentence length calculation cat(“Mean: 15 words: 12 words”)
  7. Findings Large dataset → need sampling Twitter more informal Many rare words
  8. Prediction Model Plan # n-gram example cat(“Top bigrams:thethethe”)

Model will use:

Bigrams and trigrams Next-word prediction Smoothing 10. Shiny App Plan # pseudo app logic cat(“Input: ‘I love’ → Output: ‘you’”) 11. Conclusion

The dataset is large and diverse. Patterns identified here will guide the development of a predictive text model and Shiny application.


🔥 Setelah itu:

  1. Klik Knit
  2. Kalau error (karena file nggak ada) → gak masalah
    • kamu bisa tetap upload kalau HTML ke-generate
    • atau copy hasilnya ke RPubs manual

💡 Tips penting (biar aman):

Kalau knit error karena readLines, kamu bisa: 👉 comment aja:

```r # blogs <- readLines(…)