Achievement Report
Student Name: Geeta Yadav
Date: 17 July 2025
- Successfully installed R and RStudio on the local system.
- Downloaded the en_US dataset (blogs, news, and twitter files).
- Loaded all three datasets into R using readLines() for initial
exploration.
- Performed line count analysis to understand dataset size.
- Calculated word counts and character counts for each dataset.
- Checked for and handled encoding issues and special
characters.
- Cleaned the text by converting it to lowercase and removing
punctuation and numbers.
- Removed stopwords and extra whitespace using the tm and stringr
packages.
- Plotted word frequency using base R and ggplot2.
- Created basic word clouds for visualization of common terms.
- Investigated term frequency across datasets.
- Used sampling to speed up processing of the Twitter data.
- Explored bigrams and trigrams using the tidytext package.
- Compared frequency of words like “love” and “hate” in tweets.
- Identified longest lines in each dataset.
- Successfully created an R Markdown (.Rmd) report.
- Knitted the report to HTML format in RStudio.
- Published the Milestone Report on RPubs.
- Submitted the RPubs link and project title on Coursera.
- Ready to begin model building and prediction tasks in the next
phase.
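
The exploration steps listed above (size summary, cleaning, word frequency, bigrams, and the "love"/"hate" comparison) could be sketched roughly as follows. This is a minimal illustration in base R only, not the code from the published report: the three sample lines are made up and stand in for text read with readLines(), and every variable and function name (sample_lines, make_bigrams, and so on) is hypothetical. The report itself used tm, stringr, and tidytext for these steps.

```r
# Hypothetical stand-in for lines returned by readLines(); the real
# en_US files are not reproduced here.
sample_lines <- c(
  "I love R and I love text mining!",
  "Some people hate cleaning data, 100 percent.",
  "Love it or hate it, the data must be cleaned."
)

# Size summary: number of lines, total characters, longest line.
line_count <- length(sample_lines)
char_count <- sum(nchar(sample_lines))
longest    <- max(nchar(sample_lines))

# Cleaning: lowercase, drop punctuation and digits, squeeze whitespace.
clean <- tolower(sample_lines)
clean <- gsub("[[:punct:]]", "", clean)
clean <- gsub("[[:digit:]]", "", clean)
clean <- trimws(gsub("[[:space:]]+", " ", clean))

# Tokenize on single spaces and tabulate word frequencies.
tokens_by_line <- strsplit(clean, " ", fixed = TRUE)
words      <- unlist(tokens_by_line)
word_count <- length(words)
word_freq  <- sort(table(words), decreasing = TRUE)

# Bigrams are built within each line so that no pair spans two lines.
make_bigrams <- function(tok) {
  if (length(tok) < 2) return(character(0))
  paste(head(tok, -1), tail(tok, -1))
}
bigrams     <- unlist(lapply(tokens_by_line, make_bigrams))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

# Compare two specific words, as in the "love" vs "hate" check.
love_hate <- c(love = sum(words == "love"), hate = sum(words == "hate"))

print(line_count)
print(head(word_freq, 5))
print(head(bigram_freq, 3))
print(love_hate)
```

On the real data the same steps would start from something like readLines() on each file, with a random sample of lines taken first for the larger files to keep processing time manageable.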