Introduction
This Milestone Report shows that I have successfully loaded the data and performed basic exploratory analysis for the Capstone project.
Data loading
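# Packages used for plotting and string handling
library(ggplot2)
library(stringi)

# Read each source file as a character vector, one element per line
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8")
news <- readLines("en_US.news.txt", encoding = "UTF-8")
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8")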
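Large text dumps sometimes contain embedded NUL bytes or control characters, which can trigger warnings or cut a text-mode read short on some platforms. A minimal sketch of a more defensive reader follows; `read_corpus` is a hypothetical helper name, not part of the original analysis, and it assumes the files are UTF-8 encoded as above.

```r
# Hypothetical helper (not part of the original report): read a text file
# through a binary-mode connection and skip embedded NUL bytes.
read_corpus <- function(path) {
  con <- file(path, open = "rb")   # binary mode avoids platform-specific text handling
  on.exit(close(con))              # close the connection even if reading fails
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}
```

It could be used in place of the direct calls, for example `twitter <- read_corpus("en_US.twitter.txt")`.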
Basic summary
# Number of lines in each source
length(blogs)
length(news)
length(twitter)

# Distribution of characters per line in each source
summary(nchar(blogs))
summary(nchar(news))
summary(nchar(twitter))
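The per-source figures above can also be gathered into a single table for side-by-side comparison. The following is a minimal sketch in base R; `summarize_corpus` is a hypothetical helper name, it assumes every source has at least one line, and it counts words by splitting on whitespace, which is only a rough approximation.

```r
# Hypothetical helper (not part of the original report): one summary row per source.
# `corpora` is a named list of non-empty character vectors, one element per line.
summarize_corpus <- function(corpora) {
  per_source <- function(f, type) unname(vapply(corpora, f, type))
  data.frame(
    source     = names(corpora),
    lines      = per_source(length, integer(1)),
    max_chars  = per_source(function(x) max(nchar(x)), integer(1)),
    mean_chars = per_source(function(x) mean(nchar(x)), numeric(1)),
    # Rough word count: split each trimmed line on runs of whitespace
    words      = per_source(
      function(x) sum(lengths(strsplit(trimws(x), "\\s+"))),
      integer(1)
    ),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}
```

With the vectors loaded earlier, the call would be `summarize_corpus(list(blogs = blogs, news = news, twitter = twitter))`.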
Histogram of Twitter line lengths
# Number of characters in each tweet
twitter_lengths <- stri_length(twitter)

# Histogram of tweet lengths using 30 bins
qplot(twitter_lengths, bins = 30, main = "Line Lengths in Twitter")
Conclusion
The data was loaded successfully and basic summaries were generated. As expected, Twitter lines tend to be shorter than blog and news lines. Further cleaning and n-gram analysis will follow in the next steps of the Capstone project.