Milestone Report

Introduction

The goal of this project is to build a text prediction application that suggests the next word based on user input. This milestone report demonstrates that the training data has been successfully loaded and explored, and outlines initial findings and plans for the prediction algorithm and Shiny application.

Data Description

The data consists of three English-language text files:

- Blogs
- News
- Twitter

These datasets represent different writing styles and text lengths, from longer blog posts to short, informal Twitter messages.

Loading the Data

    library(stringi)
    library(ggplot2)

    # Read each file as UTF-8, skipping embedded NUL characters
    blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
    news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
    twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

Summary Statistics

    # Line and word counts per dataset
    summary_df <- data.frame(
      Dataset = c("Blogs", "News", "Twitter"),
      Lines = c(length(blogs), length(news), length(twitter)),
      Words = c(sum(stri_count_words(blogs)),
                sum(stri_count_words(news)),
                sum(stri_count_words(twitter)))
    )
    summary_df

Visualization

    # Sample 5,000 blog lines (fixed seed for reproducibility) and count words per line
    set.seed(123)
    sample_blogs <- sample(blogs, 5000)
    word_counts <- stri_count_words(sample_blogs)

    ggplot(data.frame(word_counts), aes(word_counts)) +
      geom_histogram(bins = 30) +
      labs(title = "Distribution of Words per Line (Blogs)",
           x = "Words per Line",
           y = "Frequency")

Findings

- Blog posts tend to be longer than Twitter posts.
- Twitter text is short and informal.
- Judging by the line and word counts above, the datasets are large enough to train a prediction model.
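
The length difference noted in these findings can be checked with a per-line word-count summary for each dataset. The following is a minimal sketch, not the analysis actually run for this report. It approximates words by splitting on whitespace in base R rather than using stri_count_words, the helper names words_per_line and compare_corpora are hypothetical, and the toy vectors only stand in for the real files.

```r
# Hypothetical helper: number of whitespace-separated words on each line.
words_per_line <- function(lines) {
  lines <- trimws(lines)
  counts <- lengths(strsplit(lines, "\\s+"))
  counts[lines == ""] <- 0L  # an empty line has zero words
  counts
}

# Hypothetical helper: one summary row per named corpus.
compare_corpora <- function(corpora) {
  rows <- lapply(names(corpora), function(name) {
    counts <- words_per_line(corpora[[name]])
    data.frame(
      Dataset = name,
      MeanWords = mean(counts),
      MedianWords = median(counts),
      MaxWords = max(counts),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Toy stand-ins for the real files, for illustration only.
toy <- list(
  Blogs = c("this is a longer blog style sentence with many words",
            "another fairly long line of text here"),
  Twitter = c("so tired lol", "good morning")
)

comparison <- compare_corpora(toy)
print(comparison)
```

With the real data, the same call would be compare_corpora(list(Blogs = blogs, News = news, Twitter = twitter)), possibly on a random sample of each vector to keep the run time short.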

Future Plans

The next-word prediction algorithm will be based on n-gram models. A Shiny application will be built to allow users to enter text and view predicted next words.
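
To make the n-gram idea concrete, here is a minimal bigram sketch in base R. It is an illustration under simplifying assumptions, not the final algorithm: it uses only bigrams (no higher-order n-grams, backoff, or smoothing), a crude tokenizer that keeps lowercase ASCII letters and apostrophes, and a four-line toy corpus. The function names tokenize, build_bigrams, and predict_next are hypothetical.

```r
# Lowercase a line and split it into word tokens (crude, ASCII-only).
tokenize <- function(line) {
  line <- tolower(line)
  line <- gsub("[^a-z' ]", " ", line)  # keep letters, apostrophes, spaces
  tokens <- strsplit(trimws(line), "\\s+")[[1]]
  tokens[tokens != ""]
}

# Count every adjacent word pair across all lines.
build_bigrams <- function(lines) {
  pairs <- lapply(lines, function(line) {
    tokens <- tokenize(line)
    if (length(tokens) < 2) return(NULL)
    data.frame(first = head(tokens, -1),
               second = tail(tokens, -1),
               n = 1L,
               stringsAsFactors = FALSE)
  })
  bigrams <- do.call(rbind, pairs)
  if (is.null(bigrams)) stop("need at least one line with two or more words")
  counts <- aggregate(n ~ first + second, data = bigrams, FUN = sum)
  # Most frequent continuation first; ties broken alphabetically.
  counts[order(counts$first, -counts$n, counts$second), ]
}

# Suggest up to k words that most often followed the last word typed.
predict_next <- function(model, text, k = 3) {
  tokens <- tokenize(text)
  if (length(tokens) == 0) return(character(0))
  last_word <- tokens[length(tokens)]
  candidates <- model[model$first == last_word, ]
  as.character(head(candidates$second, k))
}

# Toy corpus, for illustration only.
corpus <- c("I love data science", "I love R",
            "I love data", "data science is fun")
model <- build_bigrams(corpus)
print(predict_next(model, "We love"))
```

In the Shiny application, a function like predict_next would be called on the user's current input. A word that never appears in the training data yields no suggestions here, which is one reason a fuller model would need some form of backoff.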

Conclusion

This report confirms that the data has been successfully loaded and explored, and provides a foundation for building the final text prediction app.