The goal of this project is to build a text prediction application that suggests the next word based on user input. This milestone report demonstrates that the training data has been successfully loaded and explored, and outlines initial findings and plans for the prediction algorithm and Shiny application.
The data consists of three English-language text files:

- Blogs
- News
- Twitter
These datasets represent different writing styles and text lengths.
    # Load string-processing and plotting libraries
    library(stringi)
    library(ggplot2)

    # Read each corpus as UTF-8, skipping embedded NUL characters
    blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
    news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
    twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

## Summary Statistics

    # Line and word counts for each dataset
    summary_df <- data.frame(
      Dataset = c("Blogs", "News", "Twitter"),
      Lines   = c(length(blogs), length(news), length(twitter)),
      Words   = c(sum(stri_count_words(blogs)),
                  sum(stri_count_words(news)),
                  sum(stri_count_words(twitter)))
    )
    summary_df

## Visualization

    # Sample 5,000 blog lines (fixed seed for reproducibility) and count words per line
    set.seed(123)
    sample_blogs <- sample(blogs, 5000)
    word_counts  <- stri_count_words(sample_blogs)

    # Histogram of words per line in the blog sample
    ggplot(data.frame(word_counts), aes(word_counts)) +
      geom_histogram(bins = 30) +
      labs(title = "Distribution of Words per Line (Blogs)",
           x = "Words per Line",
           y = "Frequency")

## Findings
The next-word prediction algorithm will be based on n-gram models. A Shiny application will be built to allow users to enter text and view predicted next words.
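As a rough illustration of the n-gram idea, the sketch below builds a bigram (two-word) frequency table in base R and returns the most frequent followers of the last word typed. It is not the planned implementation: the choice of bigrams, the simple regex tokenizer, the toy corpus, and the function names `tokenize`, `build_bigrams`, and `predict_next` are all assumptions made for this example.

```r
# Hypothetical helper: split text into lowercase word tokens (base R only).
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words[nzchar(words)]
}

# Hypothetical helper: count bigrams in a character vector of lines.
# Bigrams are formed within each line, so they never span two lines.
build_bigrams <- function(lines) {
  pairs <- lapply(lines, function(line) {
    w <- tokenize(line)
    if (length(w) < 2) return(NULL)
    data.frame(first = head(w, -1), second = tail(w, -1),
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, Filter(Negate(is.null), pairs))
  counts <- aggregate(n ~ first + second, data = cbind(pairs, n = 1), FUN = sum)
  # Most frequent follower first; ties broken alphabetically
  counts[order(counts$first, -counts$n, counts$second), ]
}

# Hypothetical helper: suggest up to k next words for the last word of `input`.
predict_next <- function(bigrams, input, k = 3) {
  w <- tokenize(input)
  if (length(w) == 0) return(character(0))
  matches <- bigrams[bigrams$first == w[length(w)], ]
  head(as.character(matches$second), k)
}

# Toy corpus standing in for the sampled training data (an assumption)
corpus <- c("the cat sat on the mat",
            "the cat ate the fish",
            "a dog sat on the rug")
bigrams <- build_bigrams(corpus)

predict_next(bigrams, "I saw the", k = 2)
# "cat" "fish"
```

A model for the final app would probably also need higher-order n-grams and a fallback for unseen words. This sketch simply returns an empty result when the last word does not appear in the table.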
This report confirms that the data has been successfully loaded and explored, and provides a foundation for building the final text prediction app.