This report presents a basic exploratory analysis of three text datasets: Blogs, News, and Twitter. The goal is to demonstrate that the data has been successfully loaded and explored in preparation for building a prediction algorithm and Shiny app.
if (!require(stringi)) install.packages("stringi", dependencies = TRUE)
## Loading required package: stringi
library(stringi)
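# Read a text file line by line. If the file is missing, fall back to
# placeholder text so the report still knits.
load_file <- function(file) {
  if (file.exists(file)) {
    readLines(file, warn = FALSE)
  } else {
    rep("This is sample text used for exploratory analysis.", 100)
  }
}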
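The loader above uses the default text-mode `readLines` call. As a minimal sketch, assuming the real files are UTF-8 encoded and may contain embedded NUL or other control characters, a more defensive reader could look like the following. The name `load_file_utf8` and the temporary demo file are hypothetical and not part of the original report.

```r
# Hypothetical variant of load_file(); assumes UTF-8 input.
load_file_utf8 <- function(file) {
  # Open in binary mode so control characters in the text are not
  # mistaken for end-of-file, which can happen in text mode on Windows.
  con <- file(file, open = "rb")
  on.exit(close(con))
  # skipNul = TRUE drops embedded NUL bytes instead of truncating lines.
  readLines(con, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
}

# Demo on a temporary file so the sketch runs without the real datasets.
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
demo_lines <- load_file_utf8(tmp)
length(demo_lines)
```

Whether this is needed depends on the actual files; for clean ASCII input it behaves the same as the simpler call.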
blogs <- load_file("blogs.txt")
news <- load_file("news.txt")
twitter <- load_file("twitter.txt")
# Summary statistics
data_summary <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  )
)
data_summary
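##   Dataset Lines Words
## 1   Blogs   100   800
## 2    News   100   800
## 3 Twitter   100   800
Each dataset shows 100 lines and 800 words. That matches the 100-line, 8-word placeholder that load_file returns when a file is not found, so these figures describe the fallback text rather than the real corpora.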
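The summary counts lines and words only. As a rough sketch, character-level figures could be added with `stri_length` from the same package. The helper name `line_stats` and its column names are hypothetical, and the input here is the placeholder sentence rather than real data.

```r
library(stringi)

# Hypothetical helper: character totals for a vector of text lines.
line_stats <- function(x) {
  chars <- stri_length(x)  # number of characters in each line
  data.frame(
    Characters   = sum(chars),
    LongestLine  = max(chars),
    WordsPerLine = mean(stri_count_words(x))
  )
}

# Same placeholder text that load_file() falls back to.
sample_text <- rep("This is sample text used for exploratory analysis.", 100)
stats <- line_stats(sample_text)
stats
```

Applying `line_stats` to `blogs`, `news`, and `twitter` and binding the rows would extend the summary table in the same style.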
# Word count distribution: one histogram per dataset, side by side
op <- par(mfrow = c(1, 3))
hist(stri_count_words(blogs),
     main = "Blogs",
     xlab = "Words per Line")
hist(stri_count_words(news),
     main = "News",
     xlab = "Words per Line")
hist(stri_count_words(twitter),
     main = "Twitter",
     xlab = "Words per Line")
par(op)  # restore the previous plotting layout
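The histograms show the shape of the words-per-line distribution. A numeric view of the same quantity can be sketched with base R's `quantile`. The three demo lines below are made up for illustration, and the choice of the 50th and 90th percentiles is an assumption.

```r
library(stringi)

# Made-up demo lines with 3, 2, and 1 words respectively.
demo_lines <- c("one two three", "four five", "six")
words_per_line <- stri_count_words(demo_lines)

# Median and 90th percentile of words per line.
q <- quantile(words_per_line, probs = c(0.5, 0.9))
q
```

On the real datasets, replacing `demo_lines` with `blogs`, `news`, or `twitter` would give figures that can be compared directly across sources.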