Introduction

The purpose of this project is to demonstrate familiarity with large, real-world text data and to outline a plan for developing a text prediction algorithm and a Shiny application. At this stage, the focus is on exploratory analysis: understanding the structure of the data and identifying key patterns.

Data Description

The dataset consists of text data collected from three sources:

Blogs (en_US.blogs.txt), News (en_US.news.txt), and Twitter (en_US.twitter.txt).

Each file contains English text representing different writing styles and vocabulary usage.

Loading the Data

Load required libraries

library(tm)       # removePunctuation(), removeNumbers(), stripWhitespace()
library(stringr)  # str_count()

NOTE: Change paths if your files are in a different location

blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
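The three readLines() calls follow the same pattern, so they could be wrapped in a small helper. The sketch below is one possible way to do that, not part of the original analysis: read_corpus is a hypothetical name, and opening the connection in binary mode ("rb") is an assumption intended to avoid platform-specific truncation at embedded control characters. The demonstration uses a temporary file so that it runs without the real data.

```r
# Hypothetical helper (not part of the original analysis): read a text file
# through an explicit connection and always close it afterwards.
read_corpus <- function(path) {
  con <- file(path, open = "rb")  # binary mode is an assumption, see above
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Self-contained demonstration on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_lines <- read_corpus(tmp)
length(demo_lines)
unlink(tmp)
```

With the real files, the call would look like blogs <- read_corpus("en_US.blogs.txt").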

Summary Statistics

# Count lines and whitespace-separated words in each source
data_summary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines  = c(length(blogs), length(news), length(twitter)),
  Words  = c(
    sum(str_count(blogs, "\\S+")),
    sum(str_count(news, "\\S+")),
    sum(str_count(twitter, "\\S+"))
  )
)

data_summary
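Line and word totals alone do not show how differently the sources are structured. As a possible extension, the sketch below adds the mean number of words per line and the longest line per source. It uses base R only, the three short vectors are made-up stand-ins for blogs, news, and twitter so that the block runs on its own, and line_stats is a hypothetical helper name.

```r
# Made-up stand-in data so the example is self-contained
toy <- list(
  Blogs   = c("a long blog post about many things", "another entry"),
  News    = c("officials said the budget passed on tuesday"),
  Twitter = c("so true", "lol", "see you soon")
)

# Hypothetical helper: line, word, and character statistics for one source
line_stats <- function(lines) {
  # gregexpr() returns -1 when nothing matches, so count positive positions
  words_per_line <- vapply(
    gregexpr("\\S+", lines),
    function(m) sum(m > 0),
    integer(1)
  )
  data.frame(
    Lines     = length(lines),
    Words     = sum(words_per_line),
    MeanWords = mean(words_per_line),
    MaxChars  = max(nchar(lines))
  )
}

# One row per source
toy_summary <- do.call(rbind, lapply(toy, line_stats))
toy_summary
```

On the real data, the same helper could be applied to list(Blogs = blogs, News = news, Twitter = twitter).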

Word Frequency Analysis (Sample)

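# Draw a random sample of blog lines and normalize the text
set.seed(123)  # arbitrary seed so the sample is reproducible
sample_text <- tolower(sample(blogs, 10000))
sample_text <- removePunctuation(sample_text)
sample_text <- removeNumbers(sample_text)
sample_text <- stripWhitespace(sample_text)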

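# Split on whitespace, drop empty tokens, and tabulate word counts
words <- unlist(strsplit(trimws(sample_text), "\\s+"))
words <- words[words != ""]
word_freq <- sort(table(words), decreasing = TRUE)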

head(word_freq, 10)

Interesting Findings

Blogs appear to use the richest vocabulary of the three sources. News text is more formal and structured. Twitter text is shorter and more informal. Across the sample, a small number of words account for a large share of all word occurrences.
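The observation about very frequent words can be quantified by asking how many distinct words are needed to cover a given share of all word occurrences. The following is a minimal sketch on a made-up word list; words_needed is a hypothetical helper, and in the real analysis the sorted word_freq table would take the place of toy_freq.

```r
# Made-up word list standing in for the tokenized sample
toy_words <- c("the", "the", "the", "the", "of", "of", "of",
               "and", "and", "cat")
toy_freq <- sort(table(toy_words), decreasing = TRUE)

# Hypothetical helper: smallest number of top-ranked words whose counts add
# up to at least `coverage` (a proportion between 0 and 1) of all tokens.
# Assumes `freq` is already sorted in decreasing order.
words_needed <- function(freq, coverage) {
  cum_share <- cumsum(as.numeric(freq)) / sum(freq)
  which(cum_share >= coverage)[1]
}

words_needed(toy_freq, 0.5)  # words required to cover half of all tokens
words_needed(toy_freq, 0.8)  # words required to cover 80% of all tokens
```

A small result relative to the total vocabulary size would suggest that the prediction model can drop rare words without losing much coverage.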

Plan for Prediction Algorithm

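The prediction model will use N-gram language modeling. Trigrams will be preferred: the last two words entered are matched against known trigrams, and the most frequent continuation is returned. When no trigram matches, the model will back off to bigrams (using only the last word) and finally to the most frequent unigram. This approach balances accuracy against memory use and response time, since longer N-grams capture more context but require larger lookup tables.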
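To make the back-off idea concrete, here is a minimal sketch in base R. It is not the final model: the four-line corpus is made up, the function names (tokenize, count_ngrams, best_completion, predict_next) are hypothetical, and it uses raw counts with no smoothing or probability discounting.

```r
# Made-up miniature corpus; the real model would be built from the
# cleaned blogs, news, and Twitter text
corpus <- c("i love new york", "i love new shoes",
            "i love new york city", "we love pizza")

# Split one line into lowercase word tokens
tokenize <- function(line) {
  tokens <- strsplit(tolower(trimws(line)), "\\s+")[[1]]
  tokens[tokens != ""]
}

# Count all n-grams of order n; names are the words joined by single spaces
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    tokens <- tokenize(line)
    if (length(tokens) < n) return(character(0))
    starts <- seq_len(length(tokens) - n + 1)
    vapply(starts,
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

unigrams <- count_ngrams(corpus, 1)
bigrams  <- count_ngrams(corpus, 2)
trigrams <- count_ngrams(corpus, 3)

# Last word of the most frequent n-gram that starts with `prefix`,
# or NA when no n-gram matches
best_completion <- function(counts, prefix) {
  hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)
  # counts are sorted, so the first hit is the most frequent one
  sub(".* ", "", names(hits)[1])
}

# Try trigrams first, then bigrams, then fall back to the top unigram
predict_next <- function(text) {
  tokens <- tokenize(text)
  k <- length(tokens)
  if (k >= 2) {
    guess <- best_completion(trigrams, paste(tokens[k - 1], tokens[k]))
    if (!is.na(guess)) return(guess)
  }
  if (k >= 1) {
    guess <- best_completion(bigrams, tokens[k])
    if (!is.na(guess)) return(guess)
  }
  names(unigrams)[1]
}

predict_next("I love new")   # answered from the trigram table
predict_next("they love")    # no trigram match, backs off to bigrams
predict_next("hello world")  # no match at all, falls back to unigrams
```

Scanning every N-gram name with startsWith() is fine for a toy corpus; for the full data set, a keyed lookup table would likely be needed to keep predictions fast.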

Plan for Shiny Application

The Shiny app will allow users to enter text and receive a predicted next word. The interface will be simple, fast, and suitable for non-technical users.
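A minimal sketch of such an app is shown below. It assumes the shiny package is installed. The predict_next_word function and its three-entry lookup table are placeholders invented for this example; the planned N-gram model would replace them. The input and output IDs (user_text, prediction) are likewise arbitrary choices.

```r
library(shiny)

# Placeholder predictor with made-up data; the real app would call the
# N-gram model instead
toy_lookup <- c(thank = "you", of = "the", new = "york")

predict_next_word <- function(text) {
  tokens <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  last_word <- tokens[length(tokens)]
  if (length(last_word) == 1 && last_word %in% names(toy_lookup)) {
    toy_lookup[[last_word]]
  } else {
    "the"  # default guess when the last word is unknown
  }
}

# One text box for input and one line of text for the prediction
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("user_text", "Enter text:"),
  textOutput("prediction")
)

server <- function(input, output, session) {
  output$prediction <- renderText({
    req(input$user_text)  # show nothing until the user has typed something
    paste("Predicted next word:", predict_next_word(input$user_text))
  })
}

# Launch the app only when run interactively
if (interactive()) {
  runApp(shinyApp(ui = ui, server = server))
}
```

Keeping the predictor in a plain function, separate from the UI code, means it can be tested without starting the app.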