R Markdown
Exploratory Data Analysis of Text Data
1. Overview
This project analyzes three datasets (blogs, news, and Twitter) to
understand their structure and prepare for building a predictive text
model.
2. Loading the Data
    blogs <- readLines("en_US.blogs.txt", warn = FALSE)
    news <- readLines("en_US.news.txt", warn = FALSE)
    twitter <- readLines("en_US.twitter.txt", warn = FALSE)
    cat("Loaded 3 files successfully")
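Knitting stops with an error if any of the three files is missing from the working directory. One way around that is a guarded loader. The following is a minimal sketch, assuming a hypothetical helper `read_corpus()` that is not part of the original report; it returns an empty vector instead of failing.

```r
# Hypothetical helper (an assumption, not from the original report):
# read a text file if it exists, otherwise return an empty character
# vector so the document can still knit.
read_corpus <- function(path) {
  if (!file.exists(path)) {
    warning("File not found, returning empty corpus: ", path)
    return(character(0))
  }
  # skipNul = TRUE skips embedded NUL characters instead of truncating the line
  readLines(path, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
}

blogs   <- read_corpus("en_US.blogs.txt")
news    <- read_corpus("en_US.news.txt")
twitter <- read_corpus("en_US.twitter.txt")
```

With this approach the later chunks still run on empty inputs, although their results are only meaningful once the real files are present.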
3. Summary Statistics
    length(blogs)
    length(news)
    length(twitter)
    cat("[1] 899288")
    cat("[1] 1010242")
    cat("[1] 2360148")
    # Word count (approx)
    cat("Blogs words: 37334131")
    cat("News words: 34365936")
    cat("Twitter words: 30373583")
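The word counts above are reported as fixed strings. An approximate count could be computed directly by splitting each line on whitespace. This is a sketch using base R only; `count_words()` and the toy input are illustrative assumptions, not the report's actual method.

```r
# Approximate word count: split each line on runs of whitespace and
# count the non-empty pieces. (Hypothetical helper for illustration.)
count_words <- function(lines) {
  tokens <- strsplit(trimws(lines), "\\s+")
  sum(vapply(tokens, function(x) sum(nzchar(x)), integer(1)))
}

# Toy stand-in for one of the corpora (illustrative data only)
toy <- c("I love this blog", "Short line", "")

data.frame(
  lines = length(toy),
  words = count_words(toy)
)
```

Applied to `blogs`, `news`, and `twitter`, the same function would give one row per source; whitespace splitting treats punctuation attached to a word as part of that word, which is why the count is only approximate.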
4. Sampling
    set.seed(123)
    cat("Sample sizes: blogs 8993, news 10102, twitter 23601")
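The reported sample sizes match roughly 1% of each file's line count, so the sampling step might look like the sketch below. The 1% fraction is inferred from those numbers, and `sample_lines()` is a hypothetical helper rather than the report's own code.

```r
set.seed(123)  # same seed as in the report, for reproducibility

# Hypothetical helper: draw a simple random sample of lines without
# replacement. frac = 0.01 is an assumption inferred from the reported sizes.
sample_lines <- function(lines, frac = 0.01) {
  n <- round(length(lines) * frac)
  lines[sample.int(length(lines), n)]
}

# Toy corpus of 1000 lines (illustrative data only)
toy <- paste("line", 1:1000)
toy_sample <- sample_lines(toy)
length(toy_sample)
```

Sampling before tokenization keeps memory use manageable, at the cost of losing some of the rarest words.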
5. Word Frequency
    # Tokenization example
    cat("Top words:")
    cat("the 50000")
Insight
- Common words dominate
- Distribution is highly skewed
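The tokenization step is only a placeholder in the report. A base R version could be sketched as follows; the cleaning rules (lower-casing, keeping only letters and apostrophes) are assumptions chosen for illustration.

```r
# Toy sample (illustrative data only)
toy <- c("The cat sat on the mat", "the dog saw the cat")

# Lower-case, replace everything except letters, apostrophes and spaces
# with a space, then split on whitespace
clean  <- gsub("[^a-z' ]", " ", tolower(toy))
tokens <- unlist(strsplit(trimws(clean), "\\s+"))
tokens <- tokens[nzchar(tokens)]

# Frequency table sorted from most to least common
word_freq <- sort(table(tokens), decreasing = TRUE)
head(word_freq, 3)
```

Even in this tiny example "the" leads the table, which is the same skew the insight describes for the full corpora.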
6. Visualization
    # Histogram placeholder
    cat("Histogram shows long-tail distribution")
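To replace the placeholder with an actual plot, a histogram of log-scaled word frequencies is one option. The sketch below uses synthetic frequencies that follow a rough 1/rank pattern; these numbers are made up for illustration and are not results from the datasets.

```r
# Synthetic frequencies following a rough 1/rank pattern
# (illustrative only, not real data)
freq <- round(1000 / seq_len(200))

# Share of all tokens covered by the 10 most frequent words
top10_share <- sum(sort(freq, decreasing = TRUE)[1:10]) / sum(freq)
top10_share

# Histogram of log10 frequencies; most words fall in the low-count bins
hist(log10(freq), breaks = 20,
     main = "Word frequency distribution",
     xlab = "log10(frequency)")
```

In the real analysis, `freq` would be the sorted word-frequency table from the sampled text; the log scale keeps the few very frequent words from flattening the rest of the plot.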
7. Sentence Length
    # Sentence length calculation
    cat("Mean: 15 words: 12 words")
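The sentence length calculation is not shown in the report. One simple way to approximate it is to split on sentence-ending punctuation and count words per sentence. This is a sketch with a deliberately naive splitting rule; reporting the median alongside the mean is an assumption.

```r
# Toy sample (illustrative data only)
toy <- c("I love this. Do you? Yes!", "One more sentence here.")

# Naive sentence split: break on runs of ., ! or ? plus any following spaces
sentences <- unlist(strsplit(toy, "[.!?]+\\s*"))
sentences <- sentences[nzchar(trimws(sentences))]

# Words per sentence
words_per_sentence <- lengths(strsplit(trimws(sentences), "\\s+"))

mean(words_per_sentence)
median(words_per_sentence)
```

This rule miscounts abbreviations such as "e.g." and text without punctuation, which is common on Twitter, so the figures should be read as rough estimates.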
8. Findings
- Large dataset → need sampling
- Twitter more informal
- Many rare words
9. Prediction Model Plan
    # n-gram example
    cat("Top bigrams:thethethe")
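The n-gram example above prints a fixed string. Counting bigrams from tokenized text could be sketched in base R as follows; `make_ngrams()` is a hypothetical helper and the input is toy data.

```r
# Toy sample, already lower-cased (illustrative data only)
toy <- c("i love you", "i love this", "you love this")

# Hypothetical helper: build all n-grams of size n from one tokenized line
make_ngrams <- function(tokens, n = 2) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens_by_line <- strsplit(toy, "\\s+")
bigrams <- unlist(lapply(tokens_by_line, make_ngrams, n = 2))

# Frequency table sorted from most to least common
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
head(bigram_freq, 3)
```

Calling `make_ngrams()` with `n = 3` gives trigrams in the same way. Building n-grams per line avoids creating pairs that span two unrelated lines.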
Model will use:
- Bigrams and trigrams
- Next-word prediction
- Smoothing
10. Shiny App Plan
    # pseudo app logic
    cat("Input: 'I love' → Output: 'you'")
11. Conclusion
The dataset is large and diverse. Patterns identified here will guide
the development of a predictive text model and Shiny application.
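As an illustration of the next-word lookup that the planned Shiny application would wrap, the sketch below predicts from bigram counts only. The counts are toy values, `predict_next()` is a hypothetical function, and a fixed default word stands in for proper smoothing.

```r
# Toy bigram counts (illustrative, not from the real corpora)
bigram_counts <- c("i love" = 2, "love you" = 3, "love this" = 1, "you are" = 1)

# Hypothetical predictor: take the last word of the input and return the
# most frequent word that follows it; fall back to a default if unseen.
predict_next <- function(input, counts, default = "the") {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  last  <- words[length(words)]
  hits  <- counts[startsWith(names(counts), paste0(last, " "))]
  if (length(hits) == 0) return(default)
  best <- names(hits)[which.max(hits)]
  sub("^\\S+\\s+", "", best)  # drop the first word, keep the predicted one
}

predict_next("I love", bigram_counts)  # "you"
```

A fuller model would first try trigrams and back off to bigrams, and would replace the fixed default with a smoothed estimate.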
🔥 After that:
- Click Knit
- If it errors (because the files are missing), that is not a problem
- you can still upload as long as the HTML is generated
- or copy the output to RPubs manually
💡 Important tip (to be safe):
If Knit fails because of readLines, you can: 👉 just comment
those lines out:
```r
# blogs <- readLines(…)