This report summarizes my exploratory data analysis for the Data Science Capstone project.
The goal is to show that the data has been downloaded, loaded, and explored, and to outline plans for the prediction algorithm.
Important: adjust the file paths below if your data is in a different location.
# Example paths used in the course dataset
# If files are in folder 'final/en_US/' use these exact names
blogs <- readLines("final/en_US/en_US.blogs.txt", warn = FALSE)
news <- readLines("final/en_US/en_US.news.txt", warn = FALSE)
twitter <- readLines("final/en_US/en_US.twitter.txt", warn = FALSE)
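If the line counts on your machine come out lower than expected, one possible cause is a text-mode read stopping early at an embedded control character. The sketch below reads through a binary connection instead. `read_corpus_file` is a hypothetical helper name introduced here for illustration; it is not part of the course materials or of the code above.

```r
# Hypothetical helper (an assumption, not the report's own loader):
# read a text file through a binary connection, mark it as UTF-8,
# and skip embedded NUL bytes instead of warning about them.
read_corpus_file <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Example usage with the same path as above:
# blogs <- read_corpus_file("final/en_US/en_US.blogs.txt")
```

The plain `readLines()` calls above are fine whenever they return the full line counts; this variant is only a fallback.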
# Basic Summaries
library(stringi)
data_summary <- data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(stri_count_words(blogs)),
            sum(stri_count_words(news)),
            sum(stri_count_words(twitter)))
)
data_summary
##      File   Lines    Words
## 1   Blogs  899288 37546250
## 2    News 1010242 34762395
## 3 Twitter 2360148 30093372
set.seed(123)
sample_data <- c(
  sample(blogs, min(2000, length(blogs))),
  sample(news, min(2000, length(news))),
  sample(twitter, min(2000, length(twitter)))
)
length(sample_data)
## [1] 6000
library(tm)
corpus <- VCorpus(VectorSource(sample_data))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
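# Note: by default TermDocumentMatrix() keeps only terms of 3 or more characters
# (wordLengths = c(3, Inf)), so short words such as "to" and "of" are not counted.
tdm <- TermDocumentMatrix(corpus)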
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
freq_df <- data.frame(term = names(freq), freq = freq)
head(freq_df, 10)
##      term freq
## the   the 8724
## and   and 4557
## that that 1906
## for   for 1868
## with with 1330
## you   you 1296
## was   was 1128
## have have  943
## this this  899
## are   are  849
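A frequency table like this can also show how concentrated the vocabulary is, which matters when deciding how many terms a prediction model needs to keep. The sketch below counts how many of the most frequent terms are needed to cover a given share of all word occurrences. `words_for_coverage` is a hypothetical helper name, and the toy counts are made up for illustration; they are not results from the corpus.

```r
# Hypothetical helper (an assumption, not part of the original analysis):
# number of top-ranked terms needed to reach `target` share of all occurrences.
words_for_coverage <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)      # most frequent terms first
  cum_share <- cumsum(freq) / sum(freq)      # running share of all occurrences
  unname(which(cum_share >= target)[1])      # first rank that reaches the target
}

# Toy counts for illustration only
toy_freq <- c(the = 50, and = 30, that = 15, `for` = 5)
words_for_coverage(toy_freq, 0.5)
words_for_coverage(toy_freq, 0.9)
```

With the real data the same call would be `words_for_coverage(freq, 0.5)`, using the `freq` vector computed above.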
library(ggplot2)
# Plot the 15 most frequent terms in the sample
top_terms <- head(freq_df, 15)
ggplot(top_terms, aes(x = reorder(term, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "", y = "Frequency", title = "Top words (sample)")
predictNextWord() function.
This milestone demonstrates:
- Data was loaded and inspected.
- Basic counts and small plots are provided.
- A clear roadmap for the prediction algorithm and Shiny app is included.
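To make the roadmap more concrete, here is a minimal sketch of one way a next-word predictor could work: count word pairs (bigrams) and suggest the words that most often follow the input word. This is an assumption for illustration, not the report's actual algorithm. `build_bigram_table` and `predict_next_word_sketch` are hypothetical names, and a real model would likely add longer n-grams, backoff, and smoothing.

```r
# Hypothetical sketch, not the report's implementation: a bigram lookup model.
build_bigram_table <- function(lines) {
  # Lowercase, replace everything except letters, apostrophes and spaces
  clean  <- gsub("[^a-z' ]", " ", tolower(lines))
  tokens <- strsplit(trimws(clean), " +")
  # Within each line, pair every word with the word that follows it
  firsts  <- unlist(lapply(tokens, function(w) if (length(w) > 1) w[-length(w)]))
  seconds <- unlist(lapply(tokens, function(w) if (length(w) > 1) w[-1]))
  # Count each "first second" pair, then split the key back into two columns
  key   <- table(paste(firsts, seconds))
  parts <- strsplit(names(key), " ", fixed = TRUE)
  data.frame(
    first  = vapply(parts, `[`, "", 1),
    second = vapply(parts, `[`, "", 2),
    n      = as.integer(key),
    stringsAsFactors = FALSE
  )
}

predict_next_word_sketch <- function(bigrams, word, n = 3) {
  # Keep bigrams that start with the input word, most frequent first
  hits <- bigrams[bigrams$first == tolower(word), ]
  head(hits$second[order(-hits$n)], n)
}

# Toy example; with the real data, pass sample_data instead
toy_lines <- c("I love data science", "I love R", "I like data")
toy_bigrams <- build_bigram_table(toy_lines)
predict_next_word_sketch(toy_bigrams, "I", n = 1)
```

A word that never appears as the first half of a bigram returns an empty character vector, so a full implementation would need a fallback such as the most frequent words overall.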