Introduction

This document provides an overview of the major features of the dataset used for building a text prediction algorithm. The dataset consists of text files in English from blogs, news, and Twitter. The aim is to explore the data, summarize its key characteristics, and outline plans for developing a predictive model and Shiny application.

Data Summary

Loading and Preprocessing

Below are basic statistics for each file:

# Load necessary libraries
library(stringr)
library(knitr)


# Define file paths
files <- c("./final/en_US/en_US.blogs.txt", "./final/en_US/en_US.news.txt", "./final/en_US/en_US.twitter.txt")

# Calculate basic statistics for each file, reading each file only once
file_stats <- function(file) {
  lines <- readLines(file, warn = FALSE)
  data.frame(
    File = file,
    Lines = length(lines),
    Words = sum(str_count(lines, "\\S+")),  # words = runs of non-whitespace
    Max_Line_Length = max(nchar(lines))
  )
}

# One row per file; row names stay numeric, so the path is not repeated in the table
data_stats <- do.call(rbind, lapply(files, file_stats))

# Display table
kable(data_stats, caption = "Summary Statistics of the Datasets")
Summary Statistics of the Datasets

| File                            |   Lines |    Words | Max_Line_Length |
|:--------------------------------|--------:|---------:|----------------:|
| ./final/en_US/en_US.blogs.txt   |  899288 | 37334131 |           40833 |
| ./final/en_US/en_US.news.txt    | 1010242 | 34372530 |           11384 |
| ./final/en_US/en_US.twitter.txt | 2360148 | 30373543 |             140 |

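The table above shows the number of lines, total word count, and maximum line length (in characters) for each dataset. Twitter has by far the most lines but the fewest words, and none of its lines exceeds 140 characters, while the longest blog line runs to 40,833 characters.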

Exploratory Data Analysis

Exploring the datasets further, we analyze the frequency of words to identify the most common terms. This helps to understand the structure and characteristics of the text data, such as stopword prevalence, informal language usage (e.g., in Twitter), and topical themes.

To achieve this, we tokenize the text into individual words and count how often each word occurs. Below is the process and a visualization of the most frequent words in a sample of the Twitter dataset:

# Example tokenization and word frequency
library(tidytext)
library(ggplot2)
library(dplyr)

# Take the first 5,000 tweets as a quick sample (this is not a random sample)
sample_data <- readLines("./final/en_US/en_US.twitter.txt", n = 5000, warn = FALSE)
tokens <- unnest_tokens(data.frame(text = sample_data, stringsAsFactors = FALSE), word, text)

# Calculate word frequencies and keep the 10 most frequent words
word_freq <- tokens %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 10)

# Plot top 10 words
ggplot(word_freq, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 10 Most Frequent Words", x = "Words", y = "Frequency")

The bar chart above shows the ten most frequent words in the Twitter sample. As expected, common stopwords such as “the,” “and,” and “to” dominate. These words will likely need to be removed or treated differently when building the prediction model, so that it can focus on meaningful terms.

Insights and Observations

From this initial exploration, several observations can be made:

Twitter data contains shorter lines (at most 140 characters) and often includes informal language, hashtags, and emojis, which may require special handling during preprocessing.
The blog dataset includes much longer lines (up to 40,833 characters), suggesting that it may contain more formal and elaborate text than Twitter.
Common stopwords dominate the Twitter sample and are likely to be just as prevalent in the blog and news datasets; they will need to be filtered or downweighted during modeling.
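
One common way to filter stopwords, as the last observation suggests, is an anti-join against a stopword lexicon. The sketch below is only an illustration: it assumes the `stop_words` table that ships with tidytext, and the `toy_text` vector is a made-up stand-in for a sample of tweets, not data from the corpus.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy input standing in for a sample of tweets
toy_text <- c("the cat sat on the mat", "the dog and the cat")

# One row per lowercase word token
toy_tokens <- unnest_tokens(
  data.frame(text = toy_text, stringsAsFactors = FALSE),
  word, text
)

# Drop every token that appears in tidytext's stop_words lexicon,
# then count what remains
content_freq <- toy_tokens %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

print(content_freq)
```

Whether to filter at all is a design choice. For next-word prediction, stopwords are often exactly the words a user types next, so it may make sense to drop them only for exploratory plots and keep them in the n-gram counts.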

Future Plans

With the foundational analysis complete, the next steps involve:

Preprocessing: Cleaning the data by removing stopwords, profanity, URLs, and unnecessary punctuation. Tokenizing the text into unigrams, bigrams, and trigrams for predictive modeling.
Model Building: Developing a predictive model using n-grams and applying smoothing so that rare or unseen word sequences still receive a sensible, nonzero probability.
Shiny Application: Deploying an interactive app that allows users to input text and receive word predictions dynamically. The app will also include options to visualize word frequencies and n-gram statistics.
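
The modeling step above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the planned model: the `corpus` vector is a hypothetical toy corpus, `tokenize` and `predict_next` are names invented for this example, only bigrams are used, and the fallback is a plain backoff to the most frequent unigram rather than a proper smoothing method.

```r
# Hypothetical toy corpus; the real model would be trained on the cleaned datasets
corpus <- c(
  "i love new york",
  "i love new music",
  "new york is big"
)

# Split a line into lowercase word tokens, dropping empty strings
tokenize <- function(line) {
  words <- strsplit(tolower(line), "[^a-z']+")[[1]]
  words[nzchar(words)]
}

token_list <- lapply(corpus, tokenize)

# Unigram counts across the whole corpus, most frequent first
unigram_counts <- sort(table(unlist(token_list)), decreasing = TRUE)

# Build bigrams within each line so that pairs never span two lines
bigrams <- do.call(rbind, lapply(token_list, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(first = head(w, -1), second = tail(w, -1),
             stringsAsFactors = FALSE)
}))

# Predict the next word from the last word typed.
# If that word was never seen as a bigram context, back off to the
# most frequent unigram (a simple backoff, not a smoothing method).
predict_next <- function(text) {
  words <- tokenize(text)
  if (length(words) > 0) {
    last <- words[length(words)]
    candidates <- bigrams$second[bigrams$first == last]
    if (length(candidates) > 0) {
      return(names(sort(table(candidates), decreasing = TRUE))[1])
    }
  }
  names(unigram_counts)[1]
}

print(predict_next("I love"))  # context "love" was seen in the corpus
print(predict_next("zebra"))   # unseen context, falls back to unigrams
```

A Shiny app could call a function like `predict_next` on each change of the text input. Extending the lookup to trigrams would follow the same pattern, keyed on the last two words and backing off to the bigram table first.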