Data Science Capstone - Milestone Report

Introduction

This report presents the exploratory analysis of the training data for the Data Science Capstone project. The goal is to understand the dataset and prepare for building a prediction model and Shiny application.

Data Loading

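# Set the working directory to the folder that contains the unzipped "final" data folder.
setwd("C:/Users/91962/OneDrive/Desktop")

# Read each English corpus as UTF-8, skipping embedded NUL characters.
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)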

Summary of Data

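# Count whitespace-separated tokens across all lines of a character vector.
# A line with leading whitespace contributes one empty token, so counts are approximate.
word_count <- function(x) {
  sum(lengths(strsplit(x, "\\s+")))
}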

blogs_words <- word_count(blogs)
news_words <- word_count(news)
twitter_words <- word_count(twitter)

data_summary <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(blogs_words, news_words, twitter_words)
)

data_summary

##   Dataset   Lines    Words
## 1   Blogs  899288 37334131
## 2    News 1010206 34371031
## 3 Twitter 2360148 30373583

Sample Plot

# Distribution of line lengths (in characters) for the blogs file.
blog_chars <- nchar(blogs)
hist(blog_chars, main = "Blogs Character Count", xlab = "Characters per line")

Findings

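The dataset contains three files: blogs, news, and Twitter.
The Twitter file has the most lines (2,360,148), more than blogs and news combined, but the fewest words.
Blogs and news have longer lines than Twitter: about 41.5 and 34.0 words per line on average, compared with about 12.9 for Twitter.
Together the three files contain about 102 million words, which is a large amount of text for building a prediction model.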
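
The comparison of line lengths can be checked from the summary table. The sketch below rebuilds the table from the printed values (rather than re-reading the files) and derives the average words per line. The column name WordsPerLine is introduced here only for illustration.

```r
# Rebuild the summary table from the values printed in "Summary of Data".
data_summary <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010206, 2360148),
  Words = c(37334131, 34371031, 30373583)
)

# Average words per line, rounded to one decimal place.
# WordsPerLine is an illustrative column name, not part of the original table.
data_summary$WordsPerLine <- round(data_summary$Words / data_summary$Lines, 1)

# Total word count across the three files.
total_words <- sum(data_summary$Words)

print(data_summary)
print(total_words)
```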

Plan for Prediction Model

I will use n-gram models to predict the next word.
The text will be cleaned by removing punctuation, numbers, and extra spaces.
Frequent word sequences (n-grams) will be counted in the cleaned text.
The model will predict the next word by matching the last words of the user's input against these counts.
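
The plan above could be sketched in base R roughly as follows. This is a minimal illustration, not the final model: it uses bigrams only, the function names clean_text, build_bigrams, and predict_next are hypothetical, and lowercasing is an added assumption that the plan does not mention.

```r
# Clean text: lowercase (an added assumption), then remove punctuation,
# numbers, and extra whitespace as described in the plan.
# Punctuation becomes a space, so "don't" is split into "don" and "t".
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", " ", x)
  x <- gsub("[[:digit:]]+", " ", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

# Count every pair of adjacent words (bigrams) in a character vector of lines.
build_bigrams <- function(lines) {
  tokens <- strsplit(clean_text(lines), " ", fixed = TRUE)
  # Only lines with at least two words can contribute a bigram.
  tokens <- tokens[lengths(tokens) >= 2]
  pairs <- data.frame(
    first = unlist(lapply(tokens, function(w) w[-length(w)])),
    second = unlist(lapply(tokens, function(w) w[-1])),
    count = 1L,
    stringsAsFactors = FALSE
  )
  # Sum the counts for each distinct (first, second) pair.
  aggregate(count ~ first + second, data = pairs, FUN = sum)
}

# Return up to `top` most frequent words that followed the last input word.
predict_next <- function(bigrams, phrase, top = 3) {
  words <- strsplit(clean_text(phrase), " ", fixed = TRUE)[[1]]
  if (length(words) == 0) return(character(0))
  last_word <- words[length(words)]
  hits <- bigrams[bigrams$first == last_word, ]
  hits <- hits[order(-hits$count), ]
  head(as.character(hits$second), top)
}

# Toy example with made-up sentences (not taken from the dataset).
toy <- c("I love data science.", "I love R!", "I like data, 2 much")
bigrams <- build_bigrams(toy)
print(predict_next(bigrams, "I", top = 2))
```

On the full dataset a sample of lines would likely be needed, since counting all pairs in memory with base R may be slow. Longer n-grams follow the same pattern with more key columns.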

Plan for Shiny App

The Shiny app will provide a text input box for the user.
The user can enter a sentence or phrase.
The app will display predicted next words.
The app will be simple and easy to use.
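
A minimal sketch of such an app is shown below, assuming the shiny package is installed. The function predict_next_words is a hypothetical placeholder with a tiny made-up lookup table. In the real app it would call the n-gram model. The input and output IDs ("phrase", "prediction") are also illustrative.

```r
library(shiny)

# Placeholder predictor (hypothetical): looks up the last word in a tiny
# made-up table. A real app would query the n-gram model instead.
predict_next_words <- function(phrase) {
  lookup <- list(i = c("love", "like"), love = c("data", "r"))
  words <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]
  if (length(words) == 0) return(character(0))
  result <- lookup[[words[length(words)]]]
  if (is.null(result)) character(0) else result
}

# User interface: one text box and one area for the predictions.
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a sentence or phrase:"),
  h4("Predicted next words"),
  textOutput("prediction")
)

# Server logic: recompute the prediction whenever the input text changes.
server <- function(input, output, session) {
  output$prediction <- renderText({
    words <- predict_next_words(input$phrase)
    if (length(words) == 0) {
      "No prediction available"
    } else {
      paste(words, collapse = ", ")
    }
  })
}

# Build the app object; this does not launch it.
app <- shinyApp(ui = ui, server = server)
```

In an interactive R session the app can be started with runApp(app).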