Data Loading

I downloaded the training data and loaded the en_US.twitter.txt file into R with readLines().
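
The download step itself is not shown in the code below; a minimal sketch of it, assuming the Coursera SwiftKey archive URL used in the capstone course (the URL and file paths are assumptions to adapt to your own setup), would be:

# Sketch: fetch and unzip the SwiftKey corpus if it is not already on disk.
# The URL and paths below are assumptions; adjust them for your own machine.
zip_url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"
if (!file.exists(zip_file)) {
  download.file(zip_url, destfile = zip_file, mode = "wb")
  unzip(zip_file)  # extracts the en_US (and other language) text files
}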

# Set the working directory to the folder that contains the dataset
setwd("/Users/mac/Downloads/en_US")

# Read the raw tweets; skipNul = TRUE skips embedded NUL characters in the file
twitter_data <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
length(twitter_data)  # Total number of tweets
## [1] 2360148
head(twitter_data, 5) # Preview first few tweets
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."                                                                       
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                           
## [5] "Words from a complete stranger! Made my birthday even better :)"
library(stringr)

# Total tweets
num_tweets <- length(twitter_data)

# Word and character counts
word_counts <- str_count(twitter_data, "\\S+")
char_counts <- nchar(twitter_data)

# Summary
summary_df <- data.frame(
  Metric = c("Total Tweets", "Average Words per Tweet", "Average Characters per Tweet"),
  Value = c(num_tweets, round(mean(word_counts), 2), round(mean(char_counts), 2))
)
print(summary_df)
##                         Metric      Value
## 1                 Total Tweets 2360148.00
## 2      Average Words per Tweet      12.87
## 3 Average Characters per Tweet      68.68
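
With roughly 2.36 million tweets, some of the later steps (tokenization in particular) can be slow on a laptop. One option, sketched below with an arbitrary sample size, is to explore a random subsample first; the rest of this report still uses the full dataset.

# Optional: take a random subsample for faster experimentation (size is arbitrary)
set.seed(1234)  # for reproducibility
twitter_sample <- sample(twitter_data, size = 100000)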

Text Length Distribution

library(ggplot2)

ggplot(data.frame(char_counts), aes(x = char_counts)) +
  geom_histogram(binwidth = 10, fill = "steelblue") +
  labs(title = "Distribution of Tweet Lengths", x = "Characters", y = "Frequency")
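
Tweets were historically limited to 140 characters, so a quick follow-up check on the char_counts vector computed above shows how much of the distribution sits at or below that limit:

# Quick checks on tweet length using the char_counts vector from above
mean(char_counts <= 140)  # proportion of tweets at or below the classic 140-character limit
max(char_counts)          # length of the longest line in the file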

Word Frequency (Top 20)

# Install tidytext once if it is not already available
if (!requireNamespace("tidytext", quietly = TRUE)) install.packages("tidytext")
library(tidytext)
library(dplyr)

# Tokenize tweets into individual words and remove common English stop words
twitter_df <- data.frame(text = twitter_data)
tokens <- twitter_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# Identify the 20 most frequent remaining words
top_words <- tokens %>%
  count(word, sort = TRUE) %>%
  top_n(20)

ggplot(top_words, aes(reorder(word, n), n)) +
  geom_col(fill = "darkgreen") +
  coord_flip() +
  labs(title = "Top 20 Most Common Words", x = "Word", y = "Frequency")

Summary

I loaded and explored the en_US.twitter.txt dataset, reviewed the tweet count and the average word and character lengths, and visualized the tweet length distribution and the most common words. The data looks rich and varied, with lots of informal language and abbreviations.

Next, I plan to clean and tokenize the text, extract features such as n-grams or TF-IDF, and build a predictive model, most likely for next-word prediction or sentiment analysis. I will wrap the model in a Shiny app so users can type text and receive real-time predictions. This report shows the project is on track and ready for feedback before moving forward.
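
As a preview of the n-gram step, here is a minimal sketch of bigram extraction using the same tidytext workflow; it reuses the twitter_df object from the word-frequency section, and the cleaning steps mentioned above (URL removal, profanity filtering, etc.) are intentionally omitted here.

# Sketch: bigram counts as a starting point for next-word prediction.
# Assumes twitter_df from the word-frequency step above; no cleaning applied yet.
library(tidytext)
library(dplyr)

bigrams <- twitter_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%   # single-word lines produce NA bigrams
  count(bigram, sort = TRUE)

head(bigrams, 10)  # most frequent word pairs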