The goal of this project is to show that you have become comfortable working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise, explaining only the major features of the data you have identified and briefly summarizing your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:
1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you have amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
Include the required libraries
require('tm')
## Loading required package: tm
## Warning: package 'tm' was built under R version 3.5.3
## Loading required package: NLP
require('ggplot2')
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.5.3
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
require('stringi')
## Loading required package: stringi
## Warning: package 'stringi' was built under R version 3.5.3
```r
setwd("C:/Users/ELIZABETH/Documents/R/Capstone")

file.blog    <- "C:/Users/ELIZABETH/Documents/R/Capstone Project/Capstone/en_US/en_US.blogs.txt"
file.twitter <- "C:/Users/ELIZABETH/Documents/R/Capstone Project/Capstone/en_US/en_US.twitter.txt"
file.web     <- "C:/Users/ELIZABETH/Documents/R/Capstone Project/Capstone/en_US/en_US.news.txt"

lines.blog    <- readLines(file(file.blog))
lines.twitter <- readLines(file(file.twitter))
```
## Warning in readLines(file(file.twitter)): line 167155 appears to contain an embedded nul
## Warning in readLines(file(file.twitter)): line 268547 appears to contain an embedded nul
## Warning in readLines(file(file.twitter)): line 1274086 appears to contain an embedded nul
## Warning in readLines(file(file.twitter)): line 1759032 appears to contain an embedded nul
lines.web <- readLines(file(file.web))
## Warning in readLines(file(file.web)): incomplete final line found on 'C:/Users/ELIZABETH/Documents/R/Capstone Project/Capstone/en_US/en_US.news.txt'
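These warnings are informational: a few Twitter lines contain embedded nul characters and the news file lacks a final newline. If one wanted to silence the nul warnings, readLines() accepts a skipNul argument. The snippet below is an alternative shown for reference only; it is not what was run above.

```r
# Alternative read that drops embedded nul characters instead of warning about them
con.twitter <- file(file.twitter, open = "r")
lines.twitter <- readLines(con.twitter, skipNul = TRUE)
close(con.twitter)
```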
Get the number of lines in each data source
length(lines.blog)
## [1] 899288
length(lines.twitter)
## [1] 2360148
length(lines.web)
## [1] 77259
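For reference, these line counts can also be collected into a single table. This is a small convenience sketch using the objects defined above:

```r
# Collect the line counts into one small reference table
data.frame(source = c("blogs", "twitter", "news"),
           lines  = c(length(lines.blog), length(lines.twitter), length(lines.web)))
```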
Get the number of words per line for each source and present a summary
# Get number of words per line
nwords.blog <- stri_count_words(lines.blog)
nwords.twitter <- stri_count_words(lines.twitter)
nwords.web <- stri_count_words(lines.web)
# Summary
summary(nwords.blog)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 29.00 42.43 61.00 6726.00
summary(nwords.twitter)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 7.0 12.0 12.8 18.0 60.0
summary(nwords.web)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.87 46.00 1123.00
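To compare the three distributions at a glance, the summaries can also be stacked into one table. This is a small sketch using the word counts computed above:

```r
# Side-by-side comparison of the words-per-line distributions
rbind(blogs   = summary(nwords.blog),
      twitter = summary(nwords.twitter),
      news    = summary(nwords.web))
```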
Here I prepare a random sample of 10% of the lines provided by each source. This is done to decrease processing time while keeping enough cases to remain representative of the original data.
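Note that sample() draws a random subset, so the figures will vary between runs. For a reproducible report one could fix the random seed before sampling; this is a suggestion, not something applied in the results below.

```r
# Suggestion: fix the RNG seed so the 10% samples (and all downstream counts) are reproducible
set.seed(1234)
```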
slines.blog <- sample(lines.blog, 0.1*length(lines.blog))
slines.twitter <- sample(lines.twitter, 0.1*length(lines.twitter))
slines.web <- sample(lines.web, 0.1*length(lines.web))
snwords.blog <- stri_count_words(slines.blog)
snwords.twitter <- stri_count_words(slines.twitter)
snwords.web <- stri_count_words(slines.web)
df.nwords.all <- data.frame(nword = c(snwords.blog, snwords.twitter, snwords.web),
                            type = c(rep("blog", length(snwords.blog)),
                                     rep("twitter", length(snwords.twitter)),
                                     rep("web", length(snwords.web))))
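As a quick numeric check on the sampled data frame, the median number of words per line per source can be computed directly. A small sketch:

```r
# Median words per line by source in the 10% samples
aggregate(nword ~ type, data = df.nwords.all, FUN = median)
```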
Plot the probability density of the number of words per line for each source
ggplot(data = df.nwords.all) + geom_density(aes(nword)) + facet_wrap(~type, nrow = 3) + xlim(0,500)
## Warning: Removed 94 rows containing non-finite values (stat_density).
Create a corpus and clean it to see which words occur most often. We apply this only to the web (news) sample as an example, since the computation is heavy.
webCorpus = Corpus(VectorSource(slines.web))
webCorpus = tm_map(webCorpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(webCorpus, content_transformer(tolower)):
## transformation drops documents
webCorpus = tm_map(webCorpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(webCorpus, removePunctuation):
## transformation drops documents
webCorpus = tm_map(webCorpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(webCorpus, removeNumbers): transformation
## drops documents
webDTM = TermDocumentMatrix(webCorpus,
control = list(minWordLength = 1))
mWeb = as.matrix(webDTM)
webOrder <- sort(rowSums(mWeb), decreasing = TRUE)
Finally, display the 10 most frequent words and the 10 least frequent words (those that appear only once)
head(webOrder, 10)
## the and that for you was with this have are
## 16281 9431 4031 3168 2565 2529 2451 2214 1977 1778
tail(webOrder, 10)
## ambient fridges tapped moralitys refine
## 1 1 1 1 1
## divulge suburbsbased andre rookie topps
## 1 1 1 1 1
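Since the most frequent terms are common function words, a further cleaning step could remove English stop words and extra whitespace before building the term-document matrix. The sketch below shows that optional step; it was not applied to the results above.

```r
# Optional extra cleaning: strip English stop words and collapse whitespace (not applied above)
webCorpus.clean <- tm_map(webCorpus, removeWords, stopwords("english"))
webCorpus.clean <- tm_map(webCorpus.clean, stripWhitespace)
```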
We have performed an exploratory analysis. Some important findings:
1. Lines per source and words per line: the density plot shows that the distribution of words per line for Twitter is much narrower than for the blog and web sources. This is likely caused by Twitter’s character limit, which forces posts to be more concise.
2. Word frequencies: as expected, the most common words are function words such as articles, conjunctions, and prepositions, while many rare words appear only once. These counts can be used to estimate the marginal probability of each word, which the prediction model can combine with conditional probabilities that depend on the previous word (see the sketch below).
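As a first illustration of the planned approach, the sketch below estimates the conditional probability of a word given the previous word from bigram counts in the sampled news lines. It uses only stringi and base R, ignores sentence and line boundaries, and is meant as an illustration of the idea rather than the final algorithm.

```r
# Sketch: estimate P(next word | previous word) from bigram counts in the sampled news lines
words <- unlist(stri_extract_all_words(stri_trans_tolower(slines.web)))
words <- words[!is.na(words)]                      # drop empty lines that yield NA
bigrams <- paste(head(words, -1), tail(words, -1)) # consecutive word pairs

bigram.counts <- sort(table(bigrams), decreasing = TRUE)
word.counts   <- table(words)

# Divide each bigram count by the count of its first word
first.word <- stri_extract_first_words(names(bigram.counts))
cond.prob  <- as.numeric(bigram.counts) / as.numeric(word.counts[first.word])

# Most frequent bigrams with their estimated conditional probabilities
head(data.frame(bigram = names(bigram.counts),
                count  = as.numeric(bigram.counts),
                p.cond = round(cond.prob, 3)), 10)
```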