I downloaded the training dataset and loaded the English Twitter file, en_US.twitter.txt, into R.
# Point R at the folder that holds the corpus files
setwd("/Users/mac/Downloads/en_US")
# skipNul = TRUE drops embedded null characters that otherwise trip up readLines
twitter_data <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
length(twitter_data) # Total number of tweets
## [1] 2360148
head(twitter_data, 5) # Preview first few tweets
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [5] "Words from a complete stranger! Made my birthday even better :)"
library(stringr)
# Total tweets
num_tweets <- length(twitter_data)
# Word and character counts
word_counts <- str_count(twitter_data, "\\S+") # each run of non-whitespace counts as a word
char_counts <- nchar(twitter_data)
# Summary
summary_df <- data.frame(
Metric = c("Total Tweets", "Average Words per Tweet", "Average Characters per Tweet"),
Value = c(num_tweets, round(mean(word_counts), 2), round(mean(char_counts), 2))
)
print(summary_df)
## Metric Value
## 1 Total Tweets 2360148.00
## 2 Average Words per Tweet 12.87
## 3 Average Characters per Tweet 68.68
library(ggplot2)
ggplot(data.frame(char_counts), aes(x = char_counts)) +
geom_histogram(binwidth = 10, fill = "steelblue") +
labs(title = "Distribution of Tweet Lengths", x = "Characters", y = "Frequency")
# Install tidytext once if it is not already available
if (!requireNamespace("tidytext", quietly = TRUE)) install.packages("tidytext")
library(tidytext)
library(dplyr)
twitter_df <- data.frame(text = twitter_data, stringsAsFactors = FALSE) # keep text as character, not factor
tokens <- twitter_df %>%
  unnest_tokens(word, text) %>%        # one row per word, lowercased by default
  anti_join(stop_words, by = "word")   # drop common English stop words
top_words <- tokens %>%
  count(word, sort = TRUE) %>%
  top_n(20, n) # keep the 20 most frequent words
ggplot(top_words, aes(reorder(word, n), n)) +
geom_col(fill = "darkgreen") +
coord_flip() +
labs(title = "Top 20 Most Common Words", x = "Word", y = "Frequency")
I successfully loaded and explored the en_US.twitter.txt dataset, reviewing the tweet count, average word and character lengths, the distribution of tweet lengths, and the most common words. The data is rich and varied, with plenty of informal language and abbreviations.
Next, I plan to clean and tokenize the text, extract features such as n-grams or TF-IDF scores, and build a predictive model, most likely for sentiment analysis or next-word prediction. I will wrap the model in a Shiny app so users can type a phrase and get real-time predictions. This report confirms the project is on track and ready for feedback before moving forward.
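As a preview of the n-gram step, here is a minimal sketch of bigram counting with tidytext. The sample size, seed, and object names are illustrative choices, not final ones.
library(tidyr)
# Iterating on all 2.36M tweets is slow, so prototype on a random sample
set.seed(42)
sample_df <- data.frame(text = sample(twitter_data, 50000), stringsAsFactors = FALSE)
# Split each tweet into overlapping two-word sequences
bigrams <- sample_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>% # tweets shorter than two words yield NA
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE)
# Given word1, the most frequent word2 is the naive next-word prediction
head(bigrams)
For the Shiny piece, a skeleton along these lines should be enough to start; predict_next_word is a hypothetical placeholder for whatever lookup the final model provides.
library(shiny)
ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)
server <- function(input, output) {
  output$prediction <- renderText({
    predict_next_word(input$phrase) # placeholder: swap in the real model lookup
  })
}
shinyApp(ui = ui, server = server)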