Summary

The goal of this report is to show that you've become comfortable working with the data and that you are on track to create your prediction algorithm.

There are four objectives for this report:

  1. Demonstrate that you've downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings amassed so far.
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Loading the data

setwd("C:\\Users\\Vadim Katsemba\\Documents")
twitter.en <- readLines(t <- file("en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)
blogs.en <- readLines(b <- file("en_US.blogs.txt"), encoding = "UTF-8", skipNul = TRUE)
news.en <- readLines(n <- file("en_US.news.txt"), encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines(n <- file("en_US.news.txt"), encoding = "UTF-8",
## skipNul = TRUE): incomplete final line found on 'en_US.news.txt'
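The warning about an incomplete final line is harmless, but it can be avoided (and the file connection closed cleanly, which the inline file() calls above do not do) by opening the file in binary mode before reading; a minimal sketch for the news file:

# Reading through a binary-mode connection avoids the incomplete-final-line warning
con <- file("en_US.news.txt", open = "rb")
news.en <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)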

Summary Statistics

Summary of each dataset

library(stringi)
# General line and character counts for each corpus
stri_stats_general(twitter.en)
##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162096248   134082806
stri_stats_general(blogs.en)
##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   206824382   170389539
stri_stats_general(news.en)
##       Lines LinesNEmpty       Chars CharsNWhite 
##       77259       77259    15639408    13072698
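For an easier side-by-side comparison, the three sets of statistics can be stacked into a single table; a small sketch reusing the vectors loaded above:

# Bind the per-corpus statistics into one matrix
corpus.stats <- rbind(twitter = stri_stats_general(twitter.en),
                      blogs = stri_stats_general(blogs.en),
                      news = stri_stats_general(news.en))
corpus.stats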

Twitter word count distribution

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.5
# Words per tweet
twitter.words <- stri_count_words(twitter.en)
summary(twitter.words)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.75   18.00   47.00
ggplot() + aes(twitter.words, fill = ..count..) + geom_histogram(bins = 20) + scale_fill_gradient("count", low = "green", high = "red")

Blogs word count distribution

blogs.words <- stri_count_words(blogs.en)
summary(blogs.words)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.75   60.00 6726.00
ggplot() + aes(blogs.words, fill = ..count..) + geom_histogram(bins = 20) + scale_fill_gradient("count", low = "red", high = "purple")
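The maximum of 6726 words stands out against a median of 28, so it is worth checking whether such outliers are genuine prose or noise; a quick sketch using the objects above:

# Inspect the single longest blog entry
blogs.en[which.max(blogs.words)]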

News word count distribution

news.words <- stri_count_words(news.en)
summary(news.words)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.62   46.00 1123.00
ggplot() + aes(news.words, fill = ..count..) + geom_histogram(bins = 20) + scale_fill_gradient("count", low = "purple", high = "blue")

Samples

# Draw 1000 random lines from each corpus
twitter.sample <- sample(twitter.en, 1000)
twitter.sample[100:102]
## [1] "thanks! RT : , You guys seriously rocked it on the new #Sounders spots. Outstanding work, you magnificent basterds!"
## [2] "Doesn't Senator Rest have some cookies to bake at home?"                                                            
## [3] "Are people seriously fighting over which artists they are fans of? Really?"
blog.sample <- sample(blogs.en, 1000)
blog.sample[125]
## [1] "7 doll"
news.sample <- sample(news.en, 1000)
news.sample[75]
## [1] "But the criminal charges could eventually be dismissed if Lagozzino pursues a treatment plan developed by OHSU Hospital, Portland Police and the county's mental health services, according to Norm Frink, Multnomah County chief deputy district attorney."

Findings

It is immediately apparent that the word counts in the Twitter file are distributed quite differently from those in the News and Blogs files. This is expected: Twitter enforces a character limit on each message, which caps how many words a tweet can contain.
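This can be checked directly from the number of characters per line (tweets were capped at 140 characters when this corpus was collected); a quick sketch using the loaded vectors:

# Longest line in each corpus, in characters
max(nchar(twitter.en))
max(nchar(blogs.en))
max(nchar(news.en))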

Next Steps

  1. Clean the Twitter, Blogs and News files.
  2. Tokenize the cleaned text into n-grams (unigrams, bigrams, trigrams); see the sketch after this list.
  3. Train and test the prediction models.
  4. Evaluate the final models for performance and accuracy.
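As a rough illustration of step 2, bigrams can be built with base R alone; this is only a sketch on the small Twitter sample from above, not the final tokenization pipeline:

# Split each sampled tweet into lowercase word tokens
tokens <- lapply(strsplit(tolower(twitter.sample), "[^a-z']+"), function(w) w[w != ""])
# Paste adjacent tokens together to form bigrams
bigrams <- unlist(lapply(tokens, function(w) {
    if (length(w) < 2) return(character(0))
    paste(head(w, -1), tail(w, -1))
}))
# Ten most frequent bigrams in the sample
head(sort(table(bigrams), decreasing = TRUE), 10)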