Introduction

This Milestone Report shows that I have successfully loaded the data and performed basic exploratory analysis for the Capstone project.

Data loading

library(ggplot2)
library(stringi)

blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8")
news <- readLines("en_US.news.txt", encoding = "UTF-8")
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8")
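On some platforms, reading a large text file in text mode can stop early or emit warnings when the file contains embedded control characters or NUL bytes. A minimal sketch of a more defensive reader follows, assuming a binary-mode connection is acceptable for these files. The helper name read_lines_binary and the temporary demo file are hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not from the original report): read a text file
# through a binary-mode connection and skip embedded NUL bytes.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))  # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Demonstration on a small temporary file, so the sketch runs
# without the Capstone data set being present.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), demo_path)
demo_lines <- read_lines_binary(demo_path)
length(demo_lines)
```

If this approach were adopted, the three readLines calls could be replaced by calls such as read_lines_binary("en_US.news.txt"), and the resulting line counts compared against those from the plain readLines version.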

Basic summary

length(blogs)
length(news)
length(twitter)

summary(nchar(blogs))
summary(nchar(news))
summary(nchar(twitter))
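The separate length and summary calls can also be collected into one table, which makes the three sources easier to compare side by side. The following is only a sketch under stated assumptions: summarize_corpus is a hypothetical helper, the word count is a rough whitespace split in base R rather than a proper tokenizer, and the two small sample vectors stand in for the real blogs, news, and twitter objects.

```r
# Hypothetical helper (not from the original report): one summary row
# per corpus, using base R only.
summarize_corpus <- function(lines) {
  chars <- nchar(lines)
  # Rough word count: split each line on runs of whitespace.
  words <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    lines = length(lines),
    words = sum(words),
    median_chars = median(chars),
    max_chars = max(chars)
  )
}

# Stand-in data; replace with list(blogs = blogs, news = news, twitter = twitter).
sample_corpora <- list(
  blogs = c("A longer blog style sentence with several words.", "Another post."),
  twitter = c("short tweet", "lol")
)

# One row per corpus, with the list names used as row names.
summary_table <- do.call(rbind, lapply(sample_corpora, summarize_corpus))
summary_table
```

For a more careful word count, stri_count_words from the stringi package, which this report already loads, could be substituted for the whitespace split.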

Histogram of Twitter line lengths

twitter_lengths <- stri_length(twitter)
qplot(twitter_lengths, bins = 30, main = "Line Lengths in Twitter")
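Recent ggplot2 releases mark qplot as deprecated, so the same histogram may be better written with ggplot and geom_histogram. The sketch below assumes that substitution is wanted. The numeric vector is made-up stand-in data so the example runs on its own, and the names length_df and length_plot and the axis labels are illustrative choices rather than part of the original report.

```r
library(ggplot2)

# Stand-in values; in the report this would be stri_length(twitter).
twitter_lengths <- c(12, 45, 80, 140, 33, 27, 95, 60)

# ggplot() expects a data frame, so wrap the vector in one.
length_df <- data.frame(chars = twitter_lengths)

# Same 30-bin histogram and title as the qplot call.
length_plot <- ggplot(length_df, aes(x = chars)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Line Lengths in Twitter",
    x = "Characters per line",
    y = "Number of lines"
  )

print(length_plot)
```

Building the plot as an object also makes it straightforward to add layers later, for example the blog and news line lengths for comparison.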

Conclusion

The data has been successfully loaded and basic summaries were generated. Twitter lines tend to be shorter, as expected, and further cleaning and n-gram analysis will be performed in the next steps of the Capstone Project.