Milestone Report of the Capstone Project

The goal of this project is to show that you have become familiar with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified, and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

1. Demonstrate that you've downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Review criteria:

1. Demonstrate that you've downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

1. Libraries and data download

Load the required libraries.

```r
require('tm')
## Loading required package: tm
## Warning: package 'tm' was built under R version 3.5.3
## Loading required package: NLP
require('ggplot2')
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.5.3
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
require('stringi')
## Loading required package: stringi
## Warning: package 'stringi' was built under R version 3.5.3
```

```r
setwd("C:/Users/ELIZABETH/Documents/R/Capstone")

file.blog    <- "C:/Users/ELIZABETH/Documents/R/Capstone Project/Capstone/en_US/en_US.blogs.txt"
file.twitter <- "C:/Users/ELIZABETH/Documents/R/Capstone Project/Capstone/en_US/en_US.twitter.txt"
file.web     <- "C:/Users/ELIZABETH/Documents/R/Capstone Project/Capstone/en_US/en_US.news.txt"

lines.blog    <- readLines(file(file.blog))
lines.twitter <- readLines(file(file.twitter))
```

```
## Warning in readLines(file(file.twitter)): line 167155 appears to contain an
## embedded nul

## Warning in readLines(file(file.twitter)): line 268547 appears to contain an
## embedded nul

## Warning in readLines(file(file.twitter)): line 1274086 appears to contain an
## embedded nul

## Warning in readLines(file(file.twitter)): line 1759032 appears to contain an
## embedded nul
```

```r
lines.web <- readLines(file(file.web))
```

```
## Warning in readLines(file(file.web)): incomplete final line found on
## 'C:/Users/ELIZABETH/Documents/R/Capstone Project/Capstone/en_US/en_US.news.txt'
```
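These warnings are harmless for this report, but as a side note they can be avoided: `readLines()` has a `skipNul` argument for embedded nul characters, and opening the connection in binary mode sidesteps the incomplete-final-line warning. A minimal sketch (reusing the file paths defined above, not re-run here):

```r
# Optional: re-read the files while suppressing the warnings above.
# skipNul = TRUE drops embedded nul characters in the Twitter file;
# opening the news file in binary mode ("rb") avoids the
# "incomplete final line" warning.
con.twitter   <- file(file.twitter, open = "rb")
lines.twitter <- readLines(con.twitter, skipNul = TRUE, encoding = "UTF-8")
close(con.twitter)

con.web   <- file(file.web, open = "rb")
lines.web <- readLines(con.web, encoding = "UTF-8")
close(con.web)
```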

2. Exploratory analysis

Get the number of lines in each data source.

```r
length(lines.blog)
## [1] 899288
length(lines.twitter)
## [1] 2360148
length(lines.web)
## [1] 77259
```

Get the number of words per line and summarize it for each source.

```r
# Get number of words per line
nwords.blog    <- stri_count_words(lines.blog)
nwords.twitter <- stri_count_words(lines.twitter)
nwords.web     <- stri_count_words(lines.web)

# Summary
summary(nwords.blog)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   29.00   42.43   61.00 6726.00
summary(nwords.twitter)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     7.0    12.0    12.8    18.0    60.0
summary(nwords.web)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.87   46.00 1123.00
```
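For readability, the line counts and word-count summaries can also be gathered into a single overview table. A small sketch using only the objects already created above:

```r
# Collect line counts and per-line word-count summaries in one data frame
overview <- data.frame(
  source       = c("blogs", "twitter", "web"),
  lines        = c(length(lines.blog), length(lines.twitter), length(lines.web)),
  median_words = c(median(nwords.blog), median(nwords.twitter), median(nwords.web)),
  mean_words   = c(mean(nwords.blog), mean(nwords.twitter), mean(nwords.web)),
  max_words    = c(max(nwords.blog), max(nwords.twitter), max(nwords.web))
)
overview
```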

Further exploration of a sample

Here I prepare a random sample of 10% of the lines from each source. This reduces processing time while keeping enough cases to remain representative of the original data.

```r
slines.blog    <- sample(lines.blog,    0.1 * length(lines.blog))
slines.twitter <- sample(lines.twitter, 0.1 * length(lines.twitter))
slines.web     <- sample(lines.web,     0.1 * length(lines.web))

snwords.blog    <- stri_count_words(slines.blog)
snwords.twitter <- stri_count_words(slines.twitter)
snwords.web     <- stri_count_words(slines.web)

df.nwords.all <- data.frame(
  nword = c(snwords.blog, snwords.twitter, snwords.web),
  type  = c(rep("blog",    length(snwords.blog)),
            rep("twitter", length(snwords.twitter)),
            rep("web",     length(snwords.web)))
)
```

Plot the probability density of the number of words per line for each source.

```r
ggplot(data = df.nwords.all) +
  geom_density(aes(nword)) +
  facet_wrap(~type, nrow = 3) +
  xlim(0, 500)
## Warning: Removed 94 rows containing non-finite values (stat_density).
```

Create a corpus and clean it to see which words occur most often. This is applied only to the web (news) sample as an example, since the computation is heavy.

```r
webCorpus <- Corpus(VectorSource(slines.web))
webCorpus <- tm_map(webCorpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(webCorpus, content_transformer(tolower)):
## transformation drops documents
webCorpus <- tm_map(webCorpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(webCorpus, removePunctuation):
## transformation drops documents
webCorpus <- tm_map(webCorpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(webCorpus, removeNumbers): transformation
## drops documents
webDTM <- TermDocumentMatrix(webCorpus, control = list(minWordLength = 1))
mWeb <- as.matrix(webDTM)
webOrder <- sort(rowSums(mWeb), decreasing = TRUE)
```

Finally, display the 10 most frequent words and the 10 least frequent words (each appearing only once).

```r
head(webOrder, 10)
##   the   and  that   for   you   was  with  this  have   are 
## 16281  9431  4031  3168  2565  2529  2451  2214  1977  1778
tail(webOrder, 10)
##      ambient      fridges       tapped    moralitys       refine 
##            1            1            1            1            1 
##      divulge suburbsbased        andre       rookie        topps 
##            1            1            1            1            1
```
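As the output shows, the most frequent terms are function words ("the", "and", "that", ...). If we wanted to surface content words instead, the tm package can remove standard English stop words before building the term-document matrix. A minimal sketch (not run here):

```r
# Optional: strip common English stop words before counting terms,
# so content words rather than function words dominate the top of the list.
webCorpusNoStop <- tm_map(webCorpus, removeWords, stopwords("english"))
webDTMNoStop    <- TermDocumentMatrix(webCorpusNoStop, control = list(minWordLength = 1))
head(sort(rowSums(as.matrix(webDTMNoStop)), decreasing = TRUE), 10)
```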

Conclusions

We have performed an exploratory analysis of the three data sources. The main findings:

- The number of lines and the distribution of words per line differ notably between sources. The density plot shows that the distribution for Twitter is much narrower and more peaked than for the blog and web sources, which is expected given Twitter's character limit: tweets have to be more concise.
- Looking at word frequencies, the most common terms are function words such as articles, conjunctions and prepositions, while a long tail of rare words appears only once.
- These frequencies can be used to estimate the marginal probability of each word, which can complement the conditional probability of a word given the previous word(s) in the prediction model, as sketched below.
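To make that last point concrete, here is a minimal sketch of how word and bigram counts could be turned into conditional probabilities for next-word prediction. It reuses the sampled web lines from above; the helper name `predict_next` and the simple whitespace tokenization are illustrative assumptions, not the final algorithm:

```r
# Sketch: estimate P(next word | previous word) from bigram counts
# in the sampled web lines. Tokenization is deliberately simple and
# bigrams are allowed to cross line boundaries, which is good enough
# for an illustration.
tokens <- unlist(stri_extract_all_words(stri_trans_tolower(slines.web)))
tokens <- tokens[!is.na(tokens)]

# Unigram (marginal) counts and bigram counts
unigrams <- table(tokens)
bigrams  <- table(paste(head(tokens, -1), tail(tokens, -1)))

# Conditional probability of candidate next words given the previous word
predict_next <- function(prev, top = 5) {
  cand  <- bigrams[startsWith(names(bigrams), paste0(prev, " "))]
  probs <- sort(cand / unigrams[[prev]], decreasing = TRUE)
  head(probs, top)
}

predict_next("the")
```

The eventual Shiny app would wrap a lookup of this kind behind a simple text input, falling back on the marginal word probabilities when a bigram has not been observed.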