## Introduction
This report presents an exploratory analysis of the text data used in the Data Science Capstone project. The purpose of this analysis is to understand the structure of the data, summarize its main characteristics, and outline plans for building a word prediction algorithm and Shiny application.
## Loading Required Libraries
``` r
library(tm)
## Loading required package: NLP
library(stringi)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
##     annotate
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
```
## Loading the Dataset
``` r
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```
## Basic Summary Statistics
The following table summarizes the number of lines and words in each dataset.
## Summary Statistics Table
``` r
data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  )
)
```

```
##   Dataset   Lines    Words
## 1   Blogs  899288 37546806
## 2    News 1010206 34761151
## 3 Twitter 2360148 30096690
```
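The counts in the table can be turned into an average line length per source, which makes the difference between the corpora easier to see. The sketch below copies the figures from the table rather than recomputing them from the files; the `summary_tbl` name and the `WordsPerLine` column are illustrative additions, not part of the original analysis.

``` r
# Counts copied from the summary table above (not recomputed from the files).
summary_tbl <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010206, 2360148),
  Words = c(37546806, 34761151, 30096690)
)

# Average number of words per line, rounded to two decimals.
summary_tbl$WordsPerLine <- round(summary_tbl$Words / summary_tbl$Lines, 2)
summary_tbl
```

Blog lines average roughly 42 words, news lines roughly 34, and tweets roughly 13, so a fixed number of sampled lines carries far less text when drawn from Twitter than from the other two sources.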
## Sampling the Data
Because the datasets are large, a random sample of 5,000 blog lines is used for exploratory visualization.
``` r
set.seed(123)  # fix the seed so the sample is reproducible
blogs_sample <- sample(blogs, 5000)
blog_words <- stri_count_words(blogs_sample)
```
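If the same sampling step is later applied to the news and Twitter data, it may help to wrap it in a small function. The following is a minimal sketch, not the report's method: `sample_lines` is a hypothetical helper, and the `demo` vector stands in for a real corpus. It caps the sample size at the number of available lines, since `sample.int()` raises an error when asked for more items than exist without replacement.

``` r
# Hypothetical helper: draw up to `n` lines reproducibly.
# Note that set.seed() resets the global RNG state on every call.
sample_lines <- function(lines, n, seed = 123) {
  set.seed(seed)
  size <- min(n, length(lines))        # never request more lines than exist
  lines[sample.int(length(lines), size)]
}

# Toy stand-in for a corpus.
demo <- paste("line", 1:20)

s1 <- sample_lines(demo, 5)
s2 <- sample_lines(demo, 5)
identical(s1, s2)  # same seed, same sample
```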
## Distribution of Words per Blog Line
``` r
ggplot(data.frame(words = blog_words), aes(x = words)) +
  geom_histogram(bins = 30, fill = "steelblue") +
  labs(
    title = "Distribution of Words per Blog Line",
    x = "Words per Line",
    y = "Frequency"
  )
```
## Future Work and Prediction Plan
The next phase of this project will focus on building a word prediction model using n-gram techniques. A Shiny web application will be developed to allow users to interact with the prediction system.
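As a rough illustration of the n-gram idea, the sketch below counts bigrams (adjacent word pairs) in a toy corpus and predicts the most frequent follower of a given word. It uses base R only. The function names (`tokenize`, `build_bigrams`, `predict_next`) and the toy sentences are assumptions made for this example, and the final model may well use longer n-grams, smoothing, or back-off instead.

``` r
# Toy corpus standing in for the sampled lines (hypothetical data).
toy_lines <- c(
  "the cat sat on the mat",
  "the cat ate the fish",
  "the dog sat on the rug"
)

# Lowercase a line and split it on anything that is not a letter or apostrophe.
tokenize <- function(line) {
  tokens <- unlist(strsplit(tolower(line), "[^a-z']+"))
  tokens[nzchar(tokens)]
}

# Count adjacent word pairs within each line.
build_bigrams <- function(lines) {
  pairs <- lapply(lines, function(line) {
    tokens <- tokenize(line)
    if (length(tokens) < 2) return(NULL)
    data.frame(
      first = head(tokens, -1),
      second = tail(tokens, -1),
      stringsAsFactors = FALSE
    )
  })
  pairs <- do.call(rbind, pairs)
  counts <- aggregate(n ~ first + second, data = cbind(pairs, n = 1L), FUN = sum)
  # Most frequent follower first within each leading word.
  counts[order(counts$first, -counts$n, counts$second), ]
}

# Return the most frequent follower of `word`, or NA if the word was never seen.
predict_next <- function(bigrams, word) {
  candidates <- bigrams[bigrams$first == tolower(word), ]
  if (nrow(candidates) == 0) return(NA_character_)
  candidates$second[1]
}

bigrams <- build_bigrams(toy_lines)
predict_next(bigrams, "sat")
```

In this toy corpus "sat" is always followed by "on", so that is the suggestion returned. A Shiny app could call a function like `predict_next` on the last word the user typed, although unseen words would need a fallback such as the overall most common word.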