Exploratory Analysis of Text Data

Introduction

This report presents a basic exploratory analysis of three text datasets: Blogs, News, and Twitter. The goal is to show that the data can be loaded and summarized (line counts, word counts, and the distribution of words per line) in preparation for building a prediction algorithm and a Shiny app.

Load Data and Packages

if (!require(stringi)) install.packages("stringi", dependencies = TRUE)

## Loading required package: stringi

library(stringi)

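# Read a text file line by line. If the file is missing, fall back to
# placeholder text so the rest of the report still runs.
load_file <- function(file) {
  if (file.exists(file)) {
    readLines(file, warn = FALSE)
  } else {
    rep("This is sample text used for exploratory analysis.", 100)
  }
}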

blogs <- load_file("blogs.txt")
news <- load_file("news.txt")
twitter <- load_file("twitter.txt")
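Because `load_file` silently substitutes placeholder text when a file is missing, the results alone do not show whether the real corpus was read. A minimal sketch of a variant that reports the fallback follows. The name `load_file_checked` and the `used_fallback` attribute are hypothetical and not part of the original analysis.

```r
# Hypothetical variant of load_file() that makes the fallback visible.
load_file_checked <- function(file) {
  if (file.exists(file)) {
    # encoding = "UTF-8" marks the strings as UTF-8;
    # skipNul = TRUE drops embedded NUL bytes instead of keeping them
    lines <- readLines(file, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
    attr(lines, "used_fallback") <- FALSE
  } else {
    warning("File not found, using placeholder text: ", file)
    lines <- rep("This is sample text used for exploratory analysis.", 100)
    attr(lines, "used_fallback") <- TRUE
  }
  lines
}

# Example: a path assumed not to exist triggers the warning and sets the flag
demo <- suppressWarnings(load_file_checked("no_such_file.txt"))
attr(demo, "used_fallback")
length(demo)
```

Checking `attr(x, "used_fallback")` before summarizing is one way to avoid reporting statistics for placeholder text by mistake.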

# Summary statistics
data_summary <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  )
)

data_summary

##   Dataset Lines Words
## 1   Blogs   100   800
## 2    News   100   800
## 3 Twitter   100   800
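
The identical counts above (100 lines and 800 words for every dataset) are consistent with the placeholder branch of `load_file`, which returns 100 copies of an 8-word sentence. As a hedged sketch, a summary with a few more per-line statistics could look like the following. The helper name `summarise_corpus` and the extra columns are illustrative assumptions, not part of the original report.

```r
library(stringi)

# Hypothetical helper: one row of summary statistics for a character vector
summarise_corpus <- function(name, lines) {
  words_per_line <- stri_count_words(lines)
  data.frame(
    Dataset = name,
    Lines = length(lines),
    Words = sum(words_per_line),
    Characters = sum(stri_length(lines)),
    MeanWordsPerLine = mean(words_per_line),
    MaxWordsPerLine = max(words_per_line)
  )
}

# Stand-in data so the sketch runs on its own; replace with blogs, news, twitter
sample_lines <- rep("This is sample text used for exploratory analysis.", 100)

extended_summary <- rbind(
  summarise_corpus("Blogs", sample_lines),
  summarise_corpus("News", sample_lines),
  summarise_corpus("Twitter", sample_lines)
)

extended_summary
```

Computing `words_per_line` once per dataset and reusing it for the total, mean, and maximum avoids tokenizing the same text repeatedly, which matters more for large corpora.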

# Word count distribution
par(mfrow = c(1, 3))

hist(stri_count_words(blogs),
     main = "Blogs",
     xlab = "Words per Line")

hist(stri_count_words(news),
     main = "News",
     xlab = "Words per Line")

hist(stri_count_words(twitter),
     main = "Twitter",
     xlab = "Words per Line")
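
Alongside the words-per-line histograms, a word frequency table is a common next exploratory step before building a prediction model. The sketch below is an assumption about how that could be done with `stringi`. It is not taken from the original analysis, and the variable names are hypothetical.

```r
library(stringi)

# Stand-in data; in the report this would be blogs, news, or twitter
sample_lines <- rep("This is sample text used for exploratory analysis.", 100)

# Lower-case each line, split it into words, and tabulate frequencies
words <- unlist(stri_extract_all_words(stri_trans_tolower(sample_lines)))
word_freq <- sort(table(words), decreasing = TRUE)

# Show the most frequent words (up to 10) as a bar chart
top_words <- head(word_freq, 10)
barplot(top_words, las = 2, main = "Most Frequent Words", ylab = "Count")
```

On real data the same pattern applies per dataset. For very large files, running it on a random sample of lines keeps it fast.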


Taiyyaba fatima

12/24/2025