Achievement Report

Student Name: Geeta Yadav
Date: 17 July 2025

  1. Successfully installed R and RStudio on the local system.
  2. Downloaded the en_US dataset (blogs, news, and twitter files).
  3. Loaded all three datasets into R using readLines() for initial exploration.
  4. Counted the lines in each file to understand dataset size.
  5. Calculated word counts and character counts for each dataset.
  6. Checked for and handled encoding issues and special characters.
  7. Cleaned the text: converted to lowercase, removed punctuation and numbers.
  8. Removed stopwords and extra whitespace using tm and stringr packages.
  9. Plotted word frequency using base R and ggplot2.
  10. Created basic word clouds for visualization of common terms.
  11. Investigated term frequency across datasets.
  12. Used sampling to speed up processing of the Twitter data.
  13. Explored bigrams and trigrams using the tidytext package.
  14. Compared frequency of words like “love” and “hate” in tweets.
  15. Identified longest lines in each dataset.
  16. Successfully created an R Markdown (.Rmd) report.
  17. Knitted the report to HTML format in RStudio.
  18. Published the Milestone Report on RPubs.
  19. Submitted the RPubs link and project title on Coursera.
  20. Ready to begin model building and prediction tasks in the next phase.
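
Illustrative sketch: the counting, cleaning, and n-gram steps listed above (items 4, 5, 7, and 13) can be sketched roughly as follows. This is a minimal base R example and not the code from the report itself. It swaps in base R functions for the tm, stringr, and tidytext packages so that it runs without extra installs, and the `sample_lines` vector and the helper names `clean_text` and `make_bigrams` are hypothetical stand-ins, not the real en_US data or actual project functions.

```r
# Hypothetical stand-in for text loaded with readLines(); NOT the real en_US data.
sample_lines <- c(
  "I love R, and I love text mining!",
  "Cleaning 3 datasets takes time.",
  "I hate waiting for slow code"
)

# Basic size statistics: line, word, and character counts.
line_count <- length(sample_lines)
word_count <- sum(lengths(strsplit(trimws(sample_lines), "\\s+")))
char_count <- sum(nchar(sample_lines))

# Cleaning (base R swap-in for tm/stringr): lowercase, replace punctuation
# and digits with spaces, then collapse repeated whitespace.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", " ", x)
  x <- gsub("[[:digit:]]", " ", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}
cleaned <- clean_text(sample_lines)

# Word frequencies across all cleaned lines, most frequent first.
tokens <- unlist(strsplit(cleaned, " ", fixed = TRUE))
word_freq <- sort(table(tokens), decreasing = TRUE)

# Bigrams (base R swap-in for tidytext): pair each word with the next word
# on the same line, so pairs never span two lines.
make_bigrams <- function(line) {
  w <- strsplit(line, " ", fixed = TRUE)[[1]]
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}
bigrams <- unlist(lapply(cleaned, make_bigrams))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

print(head(word_freq, 5))
print(head(bigram_freq, 5))
```

On the real files, `sample_lines` would be replaced by the output of readLines() (or a random sample of it for the larger Twitter file), and stopword removal would be added before counting.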