The goal of this project is to demonstrate familiarity with the text data and to show progress toward building a predictive text model and Shiny web application. This report presents an exploratory analysis of the training data and outlines plans for the final application.
```r
# Install any missing packages, then attach them.
for (pkg in c("stringi", "ggplot2")) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}
library(stringi)
library(ggplot2)
```
The dataset consists of text collected from three sources:

- Blogs
- News
- Twitter

These sources represent different writing styles and text lengths.
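
The code in this report expects three character vectors named `blogs`, `news`, and `twitter`, one element per line of text. A minimal sketch of how they might be loaded is shown here; the helper `read_corpus()` and the file names are assumptions for illustration, not part of the original analysis, so adjust the paths to wherever the data actually lives.

```r
# Hypothetical helper: read one corpus file as a character vector of lines.
# Opening the connection in binary mode avoids early truncation on stray
# control characters, and skipNul drops embedded NUL bytes.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Assumed file names; nothing is loaded unless all three files exist.
paths <- c(
  blogs   = "en_US.blogs.txt",
  news    = "en_US.news.txt",
  twitter = "en_US.twitter.txt"
)

if (all(file.exists(paths))) {
  blogs   <- read_corpus(paths[["blogs"]])
  news    <- read_corpus(paths[["news"]])
  twitter <- read_corpus(paths[["twitter"]])
}
```
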
```r
# Number of lines in each source.
data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines  = c(length(blogs), length(news), length(twitter))
)
```
```r
# Word count for every line, in the order blogs, news, twitter.
words <- c(
  stri_count_words(blogs),
  stri_count_words(news),
  stri_count_words(twitter)
)

# Matching source label for every line (named to avoid masking base::source).
source_label <- c(
  rep("Blogs", length(blogs)),
  rep("News", length(news)),
  rep("Twitter", length(twitter))
)

df <- data.frame(words = words, source = source_label)

# Histogram of words per line, one panel per source.
ggplot(df, aes(words)) +
  geom_histogram(fill = "steelblue", bins = 50) +
  facet_wrap(~source, scales = "free_y") +
  labs(title = "Word Count per Line", x = "Words per Line", y = "Frequency")
```
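
The histogram can be complemented by a small numeric summary per source. The sketch below assumes a data frame shaped like `df` above (a numeric `words` column and a `source` column); the function name `summarise_words()` and the toy data are illustrative only and are not results from the real corpus.

```r
# Hypothetical helper: per-source line count, total, mean and maximum words.
summarise_words <- function(df) {
  by_source <- split(df$words, df$source)
  data.frame(
    Source     = names(by_source),
    Lines      = vapply(by_source, length, integer(1)),
    TotalWords = vapply(by_source, function(w) sum(as.numeric(w)), numeric(1)),
    MeanWords  = vapply(by_source, function(w) mean(as.numeric(w)), numeric(1)),
    MaxWords   = vapply(by_source, function(w) as.numeric(max(w)), numeric(1)),
    row.names  = NULL
  )
}

# Toy data standing in for the real `df`.
df_toy <- data.frame(
  words  = c(10, 20, 5, 7, 3),
  source = c("Blogs", "Blogs", "News", "Twitter", "Twitter")
)

word_summary <- summarise_words(df_toy)
print(word_summary)
```

Calling `summarise_words(df)` on the real data frame would give the same columns for the full corpus.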
## Plan for Prediction Algorithm
The final application will predict the next word based on previously typed words using statistical language models. The model will learn common word patterns from the text data.
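
To make the idea of learning word patterns concrete, here is a minimal sketch of one common statistical approach, a bigram frequency model. It is an assumption for illustration, not necessarily the model the final application will use; the function names `build_bigrams()` and `predict_next()` are hypothetical, the tokenization is deliberately crude (lowercase ASCII letters and apostrophes only), and a fuller model would likely use longer n-grams with some form of backoff or smoothing.

```r
# Count how often each word follows another across all lines.
build_bigrams <- function(lines) {
  tokens <- strsplit(tolower(lines), "[^a-z']+")
  pairs <- lapply(tokens, function(tok) {
    tok <- tok[nzchar(tok)]            # drop empty tokens from leading separators
    if (length(tok) < 2) return(NULL)  # a bigram needs at least two words
    data.frame(first = head(tok, -1), second = tail(tok, -1),
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, pairs)
  if (is.null(pairs)) {
    return(data.frame(first = character(0), second = character(0),
                      n = integer(0), stringsAsFactors = FALSE))
  }
  counts <- aggregate(n ~ first + second, data = cbind(pairs, n = 1L), FUN = sum)
  # Most frequent continuation first within each leading word.
  counts[order(counts$first, -counts$n), ]
}

# Return up to k most frequent words seen after `word`.
predict_next <- function(model, word, k = 3) {
  hits <- model[model$first == tolower(word), ]
  head(as.character(hits$second), k)
}

# Toy corpus standing in for the real training data.
toy_lines <- c("I love data science", "I love R", "I like data")
toy_model <- build_bigrams(toy_lines)
predict_next(toy_model, "I")  # "love" "like"
```

Sorting the table once when the model is built keeps prediction to a simple filter, which matters for a responsive Shiny application.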
## Conclusion
This exploratory analysis confirms readiness to build the prediction model and the Shiny application.