Introduction
The purpose of this project is to demonstrate familiarity with working on large real-world text data and to outline a plan for developing a text prediction algorithm and a Shiny application. At this stage, the focus is on exploratory analysis, understanding the structure of the data, and identifying key patterns.
Data Description
The dataset consists of text data collected from three sources: blogs, news, and Twitter.
Each file contains English text representing different writing styles and vocabulary usage.
Loading the Data
# Load required libraries
library(tm)
library(stringr)

# NOTE: Change paths if your files are in a different location
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
Summary Statistics
# Count words as runs of non-whitespace characters ("\\S+")
data_summary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(str_count(blogs, "\\S+")),
    sum(str_count(news, "\\S+")),
    sum(str_count(twitter, "\\S+"))
  )
)
data_summary
Word Frequency Analysis (Sample)
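# Fix the random seed so the 10,000-line sample is reproducible
set.seed(1234)
sample_text <- tolower(sample(blogs, 10000))
sample_text <- removePunctuation(sample_text)
sample_text <- removeNumbers(sample_text)
sample_text <- stripWhitespace(sample_text)

# Split on whitespace and drop empty tokens left by leading spaces
words <- unlist(strsplit(sample_text, "\\s+"))
words <- words[words != ""]
word_freq <- sort(table(words), decreasing = TRUE)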
head(word_freq, 10)
Interesting Findings
Blogs contain a richer vocabulary than the other two sources. News text is more formal and structured. Twitter text is shorter and more informal. Across the sample, a small number of words account for a large share of all word occurrences.
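The last observation can be quantified by asking how many of the most frequent distinct words are needed to cover a given share of all word occurrences. The following is a minimal sketch on toy data, not a result from the corpus; the helper name words_for_coverage and the toy word vector are illustrative assumptions.

```r
# Hypothetical helper: number of most-frequent distinct words needed to
# account for at least `target` (a share between 0 and 1) of all occurrences.
words_for_coverage <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)
  cum_share <- cumsum(as.numeric(freq)) / sum(freq)
  which(cum_share >= target)[1]
}

# Toy data (made up for illustration): 100 tokens, 13 distinct words
toy_words <- c(
  rep("the", 50), rep("of", 25), rep("and", 15),
  "corpus", "shiny", "ngram", "backoff", "twitter",
  "blog", "news", "word", "text", "model"
)
toy_freq <- table(toy_words)

words_for_coverage(toy_freq, 0.5)  # 1: "the" alone covers half of the tokens
words_for_coverage(toy_freq, 0.8)  # 3: "the", "of", "and" cover 90%
```

Applied to word_freq from the sample above, the same function would show how strongly the word distribution is skewed, which in turn suggests how far the vocabulary could be trimmed in the prediction model.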
Plan for Prediction Algorithm
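The prediction model will use N-gram language modeling. Trigrams will be preferred: the last two words of the input are matched against known trigrams, and if no match is found the model falls back to bigrams and then to unigrams. This approach balances prediction accuracy against memory use and lookup speed.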
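The fallback idea can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the final model: it picks the most frequent continuation by raw count with no smoothing or discounting, and the names tokenize, count_ngrams, predict_next, and the toy corpus are all hypothetical.

```r
# Split text into lower-case word tokens (simplified tokenizer, assumption).
tokenize <- function(text) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  tokens[tokens != ""]
}

# Count n-grams of order n within each line; returns a named integer vector.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    tok <- tokenize(line)
    if (length(tok) < n) return(character(0))
    starts <- seq_len(length(tok) - n + 1)
    vapply(starts, function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  setNames(as.integer(tab), names(tab))
}

# Try trigrams first, then bigrams, then the most frequent unigram.
predict_next <- function(input, model) {
  tok <- tokenize(input)
  for (n in 3:2) {
    k <- n - 1                       # number of context words for this order
    counts <- model[[n]]
    if (length(tok) < k || length(counts) == 0) next
    prefix <- paste0(paste(tail(tok, k), collapse = " "), " ")
    hits <- counts[startsWith(names(counts), prefix)]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }
  names(model[[1]])[which.max(model[[1]])]
}

# Toy corpus (made up for illustration)
corpus <- c("I love data science", "I love data analysis",
            "I love data science projects", "we love text mining")
model <- lapply(1:3, function(n) count_ngrams(corpus, n))

predict_next("I love data", model)   # "science" (trigram match)
predict_next("really love", model)   # "data"    (falls back to bigrams)
predict_next("zebra", model)         # "love"    (falls back to unigrams)
```

A production version would likely store the counts in a keyed table for faster lookup and apply some form of smoothing, but the control flow of the fallback would stay the same.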
Plan for Shiny Application
The Shiny app will allow users to enter text and receive a predicted next word. The interface will be simple, fast, and suitable for non-technical users.
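A minimal sketch of such an interface is shown below, assuming the shiny package is installed. The predictor here is only a placeholder stub that returns a fixed word; predict_next_word, the input id phrase, and the output id prediction are hypothetical names chosen for illustration.

```r
library(shiny)

# Placeholder predictor (hypothetical): a real app would call the n-gram
# model here. Returns an empty string for blank input, else a fixed word.
predict_next_word <- function(text) {
  if (!nzchar(trimws(text))) return("")
  "the"
}

# One text box for the user's phrase and one line showing the prediction
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter text:"),
  textOutput("prediction")
)

# Recompute the prediction whenever the text input changes
server <- function(input, output) {
  output$prediction <- renderText(predict_next_word(input$phrase))
}

# Build the app object; call runApp(app) to launch it in a browser
app <- shinyApp(ui = ui, server = server)
```

Keeping the predictor in a separate function means the interface stays unchanged when the stub is replaced by the actual model.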