DS Capstone

R Markdown

Exploratory Data Analysis of Text Data

1. Overview

This project analyzes three datasets (blogs, news, and Twitter) to understand their structure and prepare for building a predictive text model.

2. Loading the Data

```r
blogs <- readLines("en_US.blogs.txt", warn = FALSE)
news <- readLines("en_US.news.txt", warn = FALSE)
twitter <- readLines("en_US.twitter.txt", warn = FALSE)
cat("Loaded 3 files successfully")
```
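
Because knitting stops when a data file is missing, a guarded loader can be useful. The following is a minimal sketch, not the report's own code: `read_corpus` is a hypothetical helper, and the demonstration uses a temporary file rather than the real `en_US.*` files.

```r
# Hypothetical helper (not from the original report): read a text file if it
# exists, otherwise warn and return an empty character vector.
read_corpus <- function(path) {
  if (!file.exists(path)) {
    warning("File not found, returning empty corpus: ", path)
    return(character(0))
  }
  # skipNul = TRUE skips embedded NUL bytes, which can occur in scraped text
  readLines(path, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
}

# Demonstration on a temporary file so the sketch runs without the real data
demo_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), demo_path)
demo_lines <- read_corpus(demo_path)

# A missing file yields an empty vector instead of stopping the knit
missing_lines <- suppressWarnings(read_corpus("no_such_file.txt"))
```

With a helper like this, `blogs <- read_corpus("en_US.blogs.txt")` would let the document knit even when the data is absent, at the cost of empty results downstream.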

3. Summary Statistics

```r
length(blogs)
length(news)
length(twitter)
cat("[1] 899288")
cat("[1] 1010242")
cat("[1] 2360148")

# Word count (approx)
cat("Blogs words: 37334131")
cat("News words: 34365936")
cat("Twitter words: 30373583")
```
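
The report prints the counts as fixed strings and does not show how they were obtained. One possible way to compute an approximate word count in base R is sketched below, on toy data and assuming that words are simply whitespace-separated pieces.

```r
# Toy stand-in for one of the corpora (hypothetical data, not the real files)
toy_lines <- c("the cat sat on the mat", "hello world", "")

# Number of lines, as length(blogs) would report for the real data
n_lines <- length(toy_lines)

# Approximate word count: trim each line, split on runs of whitespace,
# and count the pieces; an empty line contributes zero words
words_per_line <- lengths(strsplit(trimws(toy_lines), "\\s+"))
n_words <- sum(words_per_line)

cat("Lines:", n_lines, "Words:", n_words, "\n")
```

Whitespace splitting counts punctuation-only pieces and URLs as words, so the result is only an approximation.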

4. Sampling

```r
set.seed(123)
cat("Sample sizes: blogs 8993, news 10102, twitter 23601")
```
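
The reported sample sizes equal a rounded 1% of each file's line count, so the sampling step may have looked roughly like the sketch below. The 1% fraction and the helper `sample_lines` are assumptions; the source shows only the seed and the resulting sizes.

```r
set.seed(123)  # same seed as the report, so the draw is reproducible

# Hypothetical helper: keep a random fraction of the lines
sample_lines <- function(lines, fraction = 0.01) {
  n_keep <- round(length(lines) * fraction)
  lines[sample.int(length(lines), n_keep)]
}

# Toy corpus of 1000 lines standing in for the real data
toy_corpus <- paste("line", seq_len(1000))
toy_sample <- sample_lines(toy_corpus)

# Rounded 1% of the line counts reported in the summary statistics
expected_sizes <- round(c(blogs = 899288, news = 1010242, twitter = 2360148) * 0.01)
print(expected_sizes)
```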

5. Word Frequency

```r
# Tokenization example
cat("Top words:")
cat("the 50000")
```

Insight

- Common words dominate
- Distribution is highly skewed
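
The tokenization and counting step is only described by a placeholder. A minimal base R sketch on toy text follows; the cleaning rule (keep letters, apostrophes, and spaces) is an assumption rather than the report's method.

```r
# Toy text (hypothetical), standing in for the sampled corpus
toy_text <- c("The cat and the dog", "the bird saw THE cat")

# Lowercase, then replace anything that is not a letter, apostrophe,
# or space with a space
clean <- gsub("[^a-z' ]", " ", tolower(toy_text))

# Split on whitespace and drop empty tokens
tokens <- unlist(strsplit(clean, "\\s+"))
tokens <- tokens[nzchar(tokens)]

# Frequency table sorted from most to least common
word_freq <- sort(table(tokens), decreasing = TRUE)
head(word_freq, 3)
```

Even in this tiny example the most frequent word is "the", which illustrates why a few common words dominate the distribution.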

6. Visualization

```r
# Histogram placeholder
cat("Histogram shows long-tail distribution")
```
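
The histogram itself is not included in the report. As a rough sketch of how a long-tail plot could be produced with base graphics, the example below uses made-up counts and a log scale; both choices are assumptions for illustration.

```r
# Hypothetical word counts: a few very frequent words and many rare ones
toy_counts <- c(the = 50, and = 20, cat = 5, dog = 2,
                bird = 1, saw = 1, mat = 1, sat = 1)

# Histogram of log10 counts; the log scale keeps the rare words visible
h <- hist(log10(toy_counts), breaks = 5,
          main = "Word frequency distribution (toy data)",
          xlab = "log10(word count)")

# Share of distinct words that occur exactly once
share_singletons <- mean(toy_counts == 1)
cat("Share of words seen once:", share_singletons, "\n")
```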

7. Sentence Length

```r
# Sentence length calculation
cat("Mean: 15 words: 12 words")
```
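
The calculation behind these figures is not shown. A minimal sketch, assuming each line is treated as one sentence and words are whitespace-separated, could look like this (toy data):

```r
# Toy sentences (hypothetical), one per line
toy_sentences <- c("I love you", "This is a longer example sentence", "Hi")

# Words per line: trim the ends, split on whitespace, count the pieces
sentence_lengths <- lengths(strsplit(trimws(toy_sentences), "\\s+"))

mean_length <- mean(sentence_lengths)
median_length <- median(sentence_lengths)
cat("Mean:", mean_length, "words; median:", median_length, "words\n")
```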

8. Findings

- Large dataset → need sampling
- Twitter more informal
- Many rare words

9. Prediction Model Plan

```r
# n-gram example
cat("Top bigrams:thethethe")
```
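
To make the n-gram idea concrete, here is a small base R sketch that counts bigrams and predicts the next word from the most frequent continuation. The helpers `make_bigrams` and `predict_next` are hypothetical, the data is a toy example, and neither trigrams nor smoothing are covered.

```r
# Toy training text (hypothetical)
toy_text <- c("i love you", "i love you too", "i love cake")

# Build bigrams within each line so pairs never span two lines
make_bigrams <- function(line) {
  w <- strsplit(tolower(trimws(line)), "\\s+")[[1]]
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}
bigrams <- unlist(lapply(toy_text, make_bigrams))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

# Hypothetical predictor: most frequent word seen after the last input word;
# returns NA when the last word was never seen as a bigram start
predict_next <- function(input, freq) {
  w <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  prefix <- paste0(w[length(w)], " ")
  hits <- freq[startsWith(names(freq), prefix)]
  if (length(hits) == 0) return(NA_character_)
  sub(prefix, "", names(hits)[which.max(hits)], fixed = TRUE)
}

predict_next("I love", bigram_freq)
```

On this toy data the call returns "you", because "love you" occurs twice and "love cake" once. A fuller model would back off from trigrams to bigrams and apply smoothing for unseen pairs.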

Model will use:

- Bigrams and trigrams
- Next-word prediction
- Smoothing

10. Shiny App Plan

```r
# pseudo app logic
cat("Input: 'I love' → Output: 'you'")
```

11. Conclusion

The dataset is large and diverse. Patterns identified here will guide the development of a predictive text model and Shiny application.

Click Knit.

If knitting fails because the data files are missing, that is not a problem:
- you can still upload the report as long as the HTML is generated
- or copy the result to RPubs manually

If knitting fails because of readLines, you can simply comment the call out:

```r
# blogs <- readLines(…)
```