The goal of this project is simply to demonstrate that you have become comfortable working with the data and that you are on track to create your prediction algorithm.
There are four objectives for this report:

1. Demonstrate that you have downloaded the data and successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you have amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
# set the working directory to the folder containing the en_US.* files
setwd("C:\\Users\\Vadim Katsemba\\Documents")
twitter.en <- readLines(t <- file("en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)
blogs.en <- readLines(b <- file("en_US.blogs.txt"), encoding = "UTF-8", skipNul = TRUE)
news.en <- readLines(n <- file("en_US.news.txt"), encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines(n <- file("en_US.news.txt"), encoding = "UTF-8",
## skipNul = TRUE): incomplete final line found on 'en_US.news.txt'
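The warning above means only that en_US.news.txt does not end with a newline; the file is still read in full. If desired, a common workaround is to open the connection in binary mode, as in this sketch (the connection variable is illustrative):

con <- file("en_US.news.txt", open = "rb")  # binary mode avoids the incomplete-final-line warning
news.en <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)  # explicitly close the connection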
library(stringi)
stri_stats_general(twitter.en)
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162096248 134082806
stri_stats_general(blogs.en)
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 206824382 170389539
stri_stats_general(news.en)
## Lines LinesNEmpty Chars CharsNWhite
## 77259 77259 15639408 13072698
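For an easier side-by-side comparison, the same counts could be gathered into a single table; a minimal sketch (the file.stats name is illustrative):

# assemble the stri_stats_general() results into one summary matrix
file.stats <- rbind(twitter = stri_stats_general(twitter.en),
                    blogs   = stri_stats_general(blogs.en),
                    news    = stri_stats_general(news.en))
file.stats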
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.5
twitter.words <- stri_count_words(twitter.en)
summary(twitter.words)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 12.75 18.00 47.00
ggplot() + aes(twitter.words, fill = ..count..) + geom_histogram(bins = 20) + scale_fill_gradient("count", low = "green", high = "red")
blogs.words <- stri_count_words(blogs.en)
summary(blogs.words)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 28.00 41.75 60.00 6726.00
ggplot() + aes(blogs.words, fill = ..count..) + geom_histogram(bins = 20) + scale_fill_gradient("count", low = "red", high = "purple")
news.words <- stri_count_words(news.en)
summary(news.words)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.62 46.00 1123.00
ggplot() + aes(news.words, fill = ..count..) + geom_histogram(bins = 20) + scale_fill_gradient("count", low = "purple", high = "blue")
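Because the blog and news word counts are heavily right-skewed (maxima of 6726 and 1123 words versus medians of 28 and 32), a log-scaled x axis may show the bulk of the distribution more clearly; a sketch for the blog counts:

# log-scale the x axis; zero-word lines are dropped before taking logs
ggplot() + aes(blogs.words[blogs.words > 0]) +
    geom_histogram(bins = 20, fill = "red") +
    scale_x_log10()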
twitter.sample <- sample(twitter.en, 1000)
twitter.sample[100:102]
## [1] "thanks! RT : , You guys seriously rocked it on the new #Sounders spots. Outstanding work, you magnificent basterds!"
## [2] "Doesn't Senator Rest have some cookies to bake at home?"
## [3] "Are people seriously fighting over which artists they are fans of? Really?"
blog.sample <- sample(blogs.en, 1000)
blog.sample[125]
## [1] "7 doll"
news.sample <- sample(news.en, 1000)
news.sample[75]
## [1] "But the criminal charges could eventually be dismissed if Lagozzino pursues a treatment plan developed by OHSU Hospital, Portland Police and the county's mental health services, according to Norm Frink, Multnomah County chief deputy district attorney."
It is immediately apparent that the word counts in the Twitter file are distributed very differently from those in the News and Blogs files, because Twitter imposes a character limit on tweets.
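One way to make that comparison explicit would be to plot the three word-count distributions side by side; a rough sketch (the word.counts data frame is illustrative, and the y axis is capped so the long blog/news tails do not dominate):

# combine the per-line word counts and compare the three sources in one plot
word.counts <- data.frame(
    words  = c(twitter.words, blogs.words, news.words),
    source = rep(c("twitter", "blogs", "news"),
                 c(length(twitter.words), length(blogs.words), length(news.words))))
ggplot(word.counts, aes(source, words)) +
    geom_boxplot() +
    coord_cartesian(ylim = c(0, 100))  # cap the axis for readability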