Introduction

This milestone report demonstrates our progress in understanding the text data and lays the foundation for a prediction algorithm and Shiny app. It summarizes our exploratory data analysis, highlights key findings, and outlines the next steps in the project.

Data Overview

We are working with three text files: "en_US.news.txt", "en_US.twitter.txt", and "en_US.blogs.txt". The primary objectives of this analysis are to:

  1. Demonstrate successful data loading and initial processing.
  2. Provide summary statistics and basic plots for each file.
  3. Highlight interesting findings from the data.
  4. Outline plans for developing a prediction algorithm and Shiny app.

Data Loading and Sampling

library(tm)
library(NLP)
library(RWeka)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)

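# Convert a TermDocumentMatrix to a long (term, document, count) data frame
# directly from its sparse triplet form, so no dense matrix is ever built
tdm_to_df <- function(tdm) {
  data.frame(Terms = Terms(tdm)[tdm$i], Docs = tdm$j, Freq = tdm$v,
             stringsAsFactors = FALSE)
}

# Function to process each file
process_file <- function(file_path, sample_size = 1200) {
  text_data <- readLines(file_path, warn = FALSE, encoding = "UTF-8")
  # Random sample of lines; call set.seed() beforehand for reproducible results
  text_data_sample <- sample(text_data, size = min(sample_size, length(text_data)))
  corpus <- VCorpus(VectorSource(text_data_sample))
  corpus <- tm_map(corpus, content_transformer(tolower))   # Convert to lowercase
  corpus <- tm_map(corpus, removePunctuation)              # Remove punctuation
  corpus <- tm_map(corpus, removeNumbers)                  # Remove numbers
  corpus <- tm_map(corpus, stripWhitespace)                # Remove extra whitespace
  BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
  bigrams <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
  word_tokenizer <- function(x) unlist(strsplit(as.character(x), "\\s+"))
  # wordLengths = c(1, Inf) keeps one- and two-letter words such as "a", "to", "of",
  # which tm drops by default (its default minimum term length is 3)
  word_matrix <- TermDocumentMatrix(corpus, control = list(tokenize = word_tokenizer,
                                                           wordLengths = c(1, Inf)))
  list(
    bigram_df = tdm_to_df(bigrams),
    word_df = tdm_to_df(word_matrix)
  )
}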

# Process each file
set.seed(1234)  # Fix the random seed so the sampled lines are reproducible
files <- c("en_US.news.txt", "en_US.twitter.txt", "en_US.blogs.txt")
results <- lapply(files, process_file)
names(results) <- files
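
To make the bigram step concrete, the sketch below counts bigrams on a few toy sentences using base R only. It is an illustration, not the tokenizer used in this report: the helper name count_bigrams and the toy sentences are assumptions, and the cleaning is a simplified stand-in for the tm transformations above.

```r
# Hypothetical helper (not part of the report's pipeline): count bigrams in base R
count_bigrams <- function(lines) {
  # Approximate the cleaning steps above: lowercase, drop punctuation and digits
  cleaned <- gsub("[[:punct:][:digit:]]+", "", tolower(lines))
  tokens_per_line <- strsplit(trimws(cleaned), "\\s+")
  bigrams <- unlist(lapply(tokens_per_line, function(tokens) {
    tokens <- tokens[nzchar(tokens)]
    if (length(tokens) < 2) return(character(0))
    # Pair each word with the word that follows it on the same line
    paste(head(tokens, -1), tail(tokens, -1))
  }))
  sort(table(bigrams), decreasing = TRUE)
}

toy_counts <- count_bigrams(c("The cat sat.", "The cat ran!", "A dog sat"))
print(toy_counts)
```

Bigrams are formed within each line only, so the last word of one line is never paired with the first word of the next.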

Summary Statistics

Word and Line Counts

# Function to get word and line counts
get_summary_stats <- function(file_path) {
  text_data <- readLines(file_path, warn = FALSE, encoding = "UTF-8")
  list(
    line_count = length(text_data),
    # Split every line on whitespace and total the token counts;
    # lengths() avoids building a large named vector as sapply() would
    word_count = sum(lengths(strsplit(text_data, "\\s+")))
  )
}

# Summary statistics for each file
stats <- lapply(files, get_summary_stats)
data.frame(
  File = files,
  Line_Count = sapply(stats, `[[`, "line_count"),
  Word_Count = sapply(stats, `[[`, "word_count")
)
##                File Line_Count Word_Count
## 1    en_US.news.txt      77259    2643969
## 2 en_US.twitter.txt    2360148   30373543
## 3   en_US.blogs.txt     899288   37334131
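
The line count for en_US.news.txt is much lower than for the other two files. One possible cause, assuming the file was read in text mode on Windows, is an embedded control character that makes readLines stop before the end of the file. The sketch below reads through a binary-mode connection instead; the helper name read_lines_binary and the temporary demo file are assumptions made for illustration, and the counts in the table above have not been recomputed with it.

```r
# Hypothetical helper: read all lines through a binary-mode connection so that
# embedded control characters cannot end the read early
read_lines_binary <- function(file_path) {
  con <- file(file_path, open = "rb")
  on.exit(close(con))
  readLines(con, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
}

# Self-check on a temporary file that contains a SUB (0x1A) character
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("first line\n"),
           as.raw(0x1a),
           charToRaw("second line\nthird line\n")), tmp)
demo_lines <- read_lines_binary(tmp)
length(demo_lines)
```

If the count returned by this helper differs from the one in the table for a given file, the text-mode read was most likely truncated.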

Separate Analysis for Each File

Analysis for “en_US.news.txt”

# Extract data for "en_US.news.txt"
news_results <- results[["en_US.news.txt"]]

# Function to plot top bigrams
plot_top_bigrams <- function(bigram_df, title) {
  # Columns arrive as (term, document, count); rename them for readability
  colnames(bigram_df) <- c("Bigram", "Document", "Frequency")
  # Sum counts across documents so each bigram appears once with its total
  bigram_df <- aggregate(Frequency ~ Bigram, data = bigram_df, FUN = sum)
  bigram_df <- bigram_df[order(-bigram_df$Frequency), ]
  top_bigrams <- head(bigram_df, 20)
  ggplot(top_bigrams, aes(x = reorder(Bigram, Frequency), y = Frequency)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    coord_flip() +
    labs(title = title, x = "Bigram", y = "Frequency") +
    theme_minimal()
}

# Function to plot top words
plot_top_words <- function(word_df, title) {
  # Columns arrive as (term, document, count); rename them for readability
  colnames(word_df) <- c("Word", "Document", "Frequency")
  # Sum counts across documents so each word appears once with its total
  word_df <- aggregate(Frequency ~ Word, data = word_df, FUN = sum)
  word_df <- word_df[order(-word_df$Frequency), ]
  top_words <- head(word_df, 20)
  ggplot(top_words, aes(x = reorder(Word, Frequency), y = Frequency)) +
    geom_bar(stat = "identity", fill = "firebrick") +
    coord_flip() +
    labs(title = title, x = "Word", y = "Frequency") +
    theme_minimal()
}

# Plotting for "en_US.news.txt"
plot_top_bigrams(news_results$bigram_df, "Top Bigrams - en_US.news.txt")

plot_top_words(news_results$word_df, "Top Words - en_US.news.txt")

Analysis for “en_US.blogs.txt”

# Extract data for "en_US.blogs.txt"
blogs_results <- results[["en_US.blogs.txt"]]

# Plotting top 20 bigrams
plot_top_bigrams(blogs_results$bigram_df, "Top Bigrams - en_US.blogs.txt")

# Plotting top 20 words
plot_top_words(blogs_results$word_df, "Top Words - en_US.blogs.txt")

Analysis for “en_US.twitter.txt”

# Extract data for "en_US.twitter.txt"
twitter_results <- results[["en_US.twitter.txt"]]

# Plotting top 20 bigrams
plot_top_bigrams(twitter_results$bigram_df, "Top Bigrams - en_US.twitter.txt")

# Plotting top 20 words
plot_top_words(twitter_results$word_df, "Top Words - en_US.twitter.txt")

Combined Analysis

Word Cloud for Combined Bigrams

# Combine bigram data from all files
combined_bigram_freq <- do.call(rbind, lapply(results, function(x) x$bigram_df))
colnames(combined_bigram_freq) <- c("Bigram", "Document", "Frequency")
combined_bigram_freq <- aggregate(Frequency ~ Bigram, data = combined_bigram_freq, sum)

# Plot word cloud
wordcloud(words = combined_bigram_freq$Bigram, freq = combined_bigram_freq$Frequency, 
          min.freq = 10, scale = c(3, 0.5), colors = brewer.pal(8, "Dark2"))

Interesting Findings

  1. Top Bigrams: The most frequent bigrams (word pairs) show the common phrases and patterns in each dataset.
  2. Top Words: The most frequent individual words identify the key terms in each dataset.
  3. Combined Word Cloud: The word cloud visualizes the most common bigrams across all datasets, giving a high-level overview of frequently occurring phrases.

Future Work

Our next steps include:

  1. Creating a Prediction Algorithm: Develop an algorithm that uses the n-gram models to predict the next word or phrase from user input.
  2. Building a Shiny App: Develop a Shiny app to interactively explore the text data, visualize word and bigram frequencies, and use the prediction algorithm.
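
As a rough illustration of the prediction step, the sketch below looks up the most frequent continuations of the last word typed in a bigram frequency table. It is a minimal sketch under stated assumptions, not the planned algorithm: the table contents are made up, the helper name predict_next_word is hypothetical, and there is no smoothing or back-off to shorter or longer n-grams.

```r
# Hypothetical bigram table; in practice it could come from aggregated bigram counts
bigram_counts <- data.frame(
  Bigram = c("of the", "in the", "of a", "thank you", "thank god"),
  Frequency = c(50, 40, 12, 30, 5),
  stringsAsFactors = FALSE
)

# Split each bigram into its first and second word
parts <- strsplit(bigram_counts$Bigram, " ", fixed = TRUE)
bigram_counts$First <- vapply(parts, `[`, character(1), 1)
bigram_counts$Second <- vapply(parts, `[`, character(1), 2)

# Hypothetical helper: return up to n of the most frequent words that follow
# the last word of the input
predict_next_word <- function(input, bigram_counts, n = 3) {
  tokens <- strsplit(trimws(tolower(input)), "\\s+")[[1]]
  if (length(tokens) == 0) return(character(0))
  last_word <- tokens[length(tokens)]
  candidates <- bigram_counts[bigram_counts$First == last_word, ]
  candidates <- candidates[order(-candidates$Frequency), ]
  head(candidates$Second, n)
}

predict_next_word("I am thinking of", bigram_counts)
```

When the last word never appears as the first word of a bigram, the function returns an empty character vector. A fuller model would need a fallback for that case, for example suggesting the most frequent single words.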

Conclusion

This report provides a snapshot of our progress in understanding the text data. We have loaded and processed the data, computed line and word counts for each file, and visualized the most frequent words and bigrams. Our next steps will focus on building the prediction algorithm and the Shiny app.