Milestone Report - Anjali Chaudhary

Introduction

This milestone report summarizes progress so far on the text prediction project. It covers loading and exploring the dataset, presents a basic exploratory analysis, and outlines the plan for building a predictive model and a Shiny app.

Data Summary

The dataset consists of three US English (en_US) text files, drawn from blogs, news, and Twitter.

# File names of the three source files (placeholder paths)
blogs <- "en_US.blogs.txt"
news <- "en_US.news.txt"
twitter <- "en_US.twitter.txt"

# Display file sizes in MB (illustrative values, not measured from disk)
file_info <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Size_MB = c(200, 150, 160)  # Placeholder values
)
file_info

##      File Size_MB
## 1   Blogs     200
## 2    News     150
## 3 Twitter     160
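
The sizes above are placeholders. A minimal sketch of how the real values could be read from disk is shown here, assuming the three files sit in the working directory. The helper name `file_size_mb` and the `paths` vector are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: size of each file in megabytes.
# file.size() returns NA for a file that does not exist.
file_size_mb <- function(paths) {
  round(file.size(paths) / 1024^2, 1)
}

# Assumed file locations; adjust to wherever the corpus was unpacked.
paths <- c(
  Blogs = "en_US.blogs.txt",
  News = "en_US.news.txt",
  Twitter = "en_US.twitter.txt"
)

file_info <- data.frame(
  File = names(paths),
  Size_MB = file_size_mb(paths)
)
file_info
```

If a file is missing, its row simply shows NA, which makes a wrong path easy to spot before any heavier processing starts.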

Basic Statistics

# Placeholder example statistics
data_summary <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Word_Count = c(1000000, 900000, 1200000),
  Line_Count = c(80000, 75000, 100000)
)
data_summary

##      File Word_Count Line_Count
## 1   Blogs    1000000      80000
## 2    News     900000      75000
## 3 Twitter    1200000     100000
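
The counts above are also placeholders. One way the real line and word counts might be obtained with base R is sketched below. The helper `count_lines_words` is hypothetical, words are approximated as whitespace-separated tokens, and a small temporary file stands in for the real corpus.

```r
# Hypothetical helper: line and word counts for one text file.
count_lines_words <- function(path) {
  # skipNul = TRUE guards against embedded NUL characters in the input.
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  # Split each line on runs of whitespace and count the non-empty pieces.
  tokens <- strsplit(trimws(lines), "\\s+")
  words <- sum(vapply(tokens, function(x) sum(nzchar(x)), integer(1)))
  data.frame(Line_Count = length(lines), Word_Count = words)
}

# Demo on a tiny temporary file instead of the full corpus.
demo_file <- tempfile(fileext = ".txt")
writeLines(c("the quick brown fox", "jumps over", ""), demo_file)
demo_counts <- count_lines_words(demo_file)
demo_counts
```

Reading a whole file with `readLines()` keeps the sketch short; for files of this size, reading a random sample of lines may be more practical.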

Plots
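
As a rough illustration of the kind of plot this section calls for, the sketch below draws a bar chart of the placeholder word counts from the Basic Statistics table using base graphics. It is an assumed example, not the plot produced for this report.

```r
# Placeholder counts copied from the summary table in this report.
data_summary <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Word_Count = c(1000000, 900000, 1200000),
  Line_Count = c(80000, 75000, 100000)
)

# Bar chart of word counts in millions; barplot() returns the bar midpoints.
bar_mids <- barplot(
  data_summary$Word_Count / 1e6,
  names.arg = data_summary$File,
  ylab = "Words (millions)",
  main = "Word count per file (placeholder values)"
)
```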

Plans for Prediction Algorithm and Shiny App

Clean and preprocess the text
Tokenize the cleaned text into n-grams (unigrams, bigrams, trigrams)
Build a predictive model from the n-gram counts using a back-off technique such as Katz’s back-off or Stupid Backoff
Deploy an interactive Shiny app for real-time next-word prediction
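
The modeling steps in this plan can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the final implementation: the function names (`clean_and_tokenize`, `count_ngrams`, `continuations`, `predict_next`) and the toy corpus are hypothetical, the cleaning rule keeps only ASCII letters and apostrophes, and the scoring follows the Stupid Backoff idea with the commonly used back-off factor of 0.4.

```r
# Lower-case, replace everything except a-z, apostrophes and spaces, then split.
# Note: this simple rule also drops digits and non-ASCII letters.
clean_and_tokenize <- function(lines) {
  lines <- gsub("[^a-z' ]+", " ", tolower(lines))
  tokens <- strsplit(trimws(lines), " +")
  lapply(tokens, function(x) x[nzchar(x)])
}

# Named numeric vector of n-gram counts; names are space-joined n-grams.
count_ngrams <- function(token_list, n) {
  grams <- as.character(unlist(lapply(token_list, function(tok) {
    if (length(tok) < n) return(character(0))
    starts <- seq_len(length(tok) - n + 1)
    vapply(starts, function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  })))
  tab <- table(grams)
  setNames(as.numeric(tab), names(tab))
}

# Relative frequency of each word that follows `prefix` in `grams`.
continuations <- function(prefix, grams, prefix_count) {
  if (length(grams) == 0 || prefix_count == 0) return(numeric(0))
  key <- paste0(prefix, " ")
  hits <- grams[startsWith(names(grams), key)]
  setNames(as.numeric(hits), substring(names(hits), nchar(key) + 1)) / prefix_count
}

# Count lookup that returns 0 for unseen keys.
get_count <- function(counts, key) {
  if (key %in% names(counts)) counts[[key]] else 0
}

# Stupid Backoff style ranking: use trigram scores where available, then
# bigram and unigram scores, each discounted by `alpha` per back-off step.
predict_next <- function(text, uni, bi, tri, alpha = 0.4, top = 3) {
  ctx <- tail(clean_and_tokenize(text)[[1]], 2)
  scores <- numeric(0)
  weight <- 1
  if (length(ctx) == 2) {
    p <- paste(ctx, collapse = " ")
    scores <- weight * continuations(p, tri, get_count(bi, p))
    weight <- weight * alpha
  }
  if (length(ctx) >= 1) {
    p <- tail(ctx, 1)
    s <- weight * continuations(p, bi, get_count(uni, p))
    scores <- c(scores, s[!names(s) %in% names(scores)])
    weight <- weight * alpha
  }
  s <- weight * uni / sum(uni)
  scores <- c(scores, s[!names(s) %in% names(scores)])
  head(sort(scores, decreasing = TRUE), top)
}

# Toy corpus standing in for the cleaned blogs, news, and Twitter text.
corpus <- c("I love data science", "I love data analysis",
            "I love R", "data science is fun")
tokens <- clean_and_tokenize(corpus)
uni <- count_ngrams(tokens, 1)
bi <- count_ngrams(tokens, 2)
tri <- count_ngrams(tokens, 3)

predict_next("I love", uni, bi, tri, top = 2)
```

Stupid Backoff produces scores rather than normalized probabilities, which is usually acceptable when the goal is only to rank candidate next words. Katz's back-off would additionally require discounting, which is why the simpler scheme is shown here.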

Conclusion

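This report covers the initial steps of the project, namely identifying the three source files and outlining their size, word-count, and line-count summaries, and it serves as the foundation for developing the predictive model and the Shiny app.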