The goal of this project is just to display that you've gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.

The motivation for this project is to:

1. Demonstrate that you've downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
1) Does the link lead to an HTML page describing the exploratory analysis of the training data set?
2) Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?
3) Has the data scientist made basic plots, such as histograms to illustrate features of the data?
4) Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?
library(RCurl)
library(knitr)
library(RColorBrewer)
library(stringi)
library(wordcloud)
library(ggplot2)
library(ngram)
library(NLP)
library(tm)
library(slam)
library(xtable)
library(quanteda)
The data can be downloaded from this link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
From this data set I will use the three English (en_US) files: blogs, news, and Twitter.
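If the files are not already on disk, here is a minimal sketch of one way to fetch them (the destination paths are an assumption; the zip unpacks into a final/en_US/ folder, so the three .txt files may need to be moved into ~/Downloads/ to match the paths used below):
# Sketch only: download and unzip the capstone data (paths are assumptions)
dest <- path.expand("~/Downloads/Coursera-SwiftKey.zip")
if (!file.exists(dest)) {
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                  destfile = dest)
    unzip(dest, exdir = path.expand("~/Downloads"))
}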
setwd("~/Downloads/")
news1 <- "en_US.news.txt"
news <- file("en_US.news.txt", open="r")
news_data <- readLines(news); close(news)
blogs1 <- "en_US.blogs.txt"
blogs <- file("en_US.blogs.txt", open="r")
blogs_data <- readLines(blogs); close(blogs)
twitter1 <- "en_US.twitter.txt"
twitter <- file("en_US.twitter.txt", open="r")
twitter_data <- readLines(twitter); close(twitter)
## Warning in readLines(twitter): line 167155 appears to contain an embedded nul
## Warning in readLines(twitter): line 268547 appears to contain an embedded nul
## Warning in readLines(twitter): line 1274086 appears to contain an embedded nul
## Warning in readLines(twitter): line 1759032 appears to contain an embedded nul
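These embedded-nul warnings are harmless for this analysis. If desired, they could be avoided by re-reading the Twitter file with skipNul = TRUE (a minimal, optional sketch, not used below):
# Optional: skip embedded nul characters to silence the warnings above
twitter <- file("en_US.twitter.txt", open="r")
twitter_data <- readLines(twitter, skipNul = TRUE); close(twitter)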
# First, a small 1,000-line sample of each file (a larger sample is built below)
dsize <- 1000
setwd("~/Downloads/")
blogs   <- readLines("en_US.blogs.txt", dsize)
news    <- readLines("en_US.news.txt", dsize)
twitter <- readLines("en_US.twitter.txt", dsize)
corpus  <- Corpus(VectorSource(c(blogs, news, twitter)))
Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables? Yes!
Below are the basic summaries of the three files (blogs, news, and Twitter): the number of lines and the number of words in each, presented as a table.
# Line counts
len_b <- length(blogs_data)
len_n <- length(news_data)
len_t <- length(twitter_data)
# Word counts
word_b <- sum(stri_count_words(blogs_data))
word_n <- sum(stri_count_words(news_data))
word_t <- sum(stri_count_words(twitter_data))
summary_df <- data.frame(c("Blogs", "News", "Twitter"),
                         c(len_b, len_n, len_t),
                         c(word_b, word_n, word_t))
kable(summary_df, col.names = c('Type', 'No. of Lines', 'No. of Words'))
| Type | No. of Lines | No. of Words |
|---|---|---|
| Blogs | 899288 | 37546239 |
| News | 1010242 | 34762395 |
| Twitter | 2360148 | 30093372 |
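The summary could also include the size of each file on disk; a small sketch, assuming the three files are still in ~/Downloads:
# File sizes in megabytes (sketch; paths are an assumption)
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
size_mb <- round(file.info(file.path("~/Downloads", files))$size / 1024^2, 1)
kable(data.frame(Type = c("Blogs", "News", "Twitter"), Size_MB = size_mb))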
The full data set is large and can be hard for R to handle in memory, so I will work with a 5,000-line sample from each file (a random-sampling alternative is sketched after the code below).
# Larger 5,000-line sample used for the rest of the analysis
dsize <- 5000
setwd("~/Downloads/")
blogs   <- readLines("en_US.blogs.txt", dsize)
news    <- readLines("en_US.news.txt", dsize)
twitter <- readLines("en_US.twitter.txt", dsize)
corpus  <- Corpus(VectorSource(c(blogs, news, twitter)))
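readLines(..., dsize) keeps only the first 5,000 lines of each file, which may not be representative. A hedged alternative (object names here are illustrative and not used later) is to draw a random sample from the full vectors read earlier:
# Alternative sketch: random 5,000-line samples instead of the first 5,000 lines
set.seed(1234)   # for reproducibility
blogs_sample   <- sample(blogs_data,   dsize)
news_sample    <- sample(news_data,    dsize)
twitter_sample <- sample(twitter_data, dsize)
corpus_sample  <- Corpus(VectorSource(c(blogs_sample, news_sample, twitter_sample)))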
Next I clean the data: remove punctuation, numbers, and extra whitespace, drop English stop words, and convert everything to lower case.
data12 <- Corpus(VectorSource(c(blogs, news, twitter)))
data12 <- tm_map(data12, removePunctuation)                  # remove punctuation
data12 <- tm_map(data12, removeNumbers)                      # remove digits
data12 <- tm_map(data12, stripWhitespace)                    # collapse extra whitespace
data12 <- tm_map(data12, removeWords, stopwords("english"))  # drop English stop words
data12 <- tm_map(data12, content_transformer(tolower))       # lower-case everything
# Note: lower-casing before stop-word removal would also catch capitalised stop words
Now I start extracting useful information from the cleaned data, using a word cloud and a histogram of the most frequent terms.
# Gram-1: unigram term-document matrix
gram1 <- TermDocumentMatrix(data12)
wordMatrix <- as.data.frame(as.matrix(gram1))
v <- sort(rowSums(wordMatrix), decreasing = TRUE)   # total frequency of each term
d <- data.frame(word = names(v), freq = v)
top_words <- d[1:20, ]                              # 20 most frequent unigrams
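The same idea extends to 2-grams and 3-grams, which the prediction algorithm will rely on. Since quanteda is already loaded, here is a hedged sketch of bigram counts on the raw 5,000-line samples (object names are illustrative):
# Sketch: bigram frequencies with quanteda on the raw samples
toks <- tokens(c(blogs, news, twitter), remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))
bigrams <- tokens_ngrams(toks, n = 2, concatenator = " ")
topfeatures(dfm(bigrams), 20)   # 20 most frequent word pairs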
The word cloud is a quick way to see which words are used the most.
wordcloud(top_words$word, top_words$freq, max.words=5000, colors=brewer.pal(8,"Accent"))
The histogram below shows the frequency of each of the 20 words used in the word cloud.
g_unigram <- ggplot(top_words, aes(x=reorder(word, freq), y=freq)) +
    geom_bar(stat="identity", fill="blue") +
    ggtitle("gram1") +
    xlab("Unigrams") + ylab("Freq") +
    theme(axis.text.x=element_text(angle=90, hjust=1))
g_unigram
You can see which words are used the most across the blogs, news, and Twitter samples. With this data it is also possible to look at key words to detect the feelings of the user, and the same frequency counts will drive the next-word prediction algorithm and Shiny app.
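As a rough illustration of where this is heading (a simplified assumption, not the final algorithm): once 2-gram and 3-gram frequency tables exist, next-word prediction reduces to a lookup of the most frequent continuations, which is what the Shiny app will expose.
# Simplified sketch of next-word prediction from a bigram frequency table.
# bigram_freq is a hypothetical data frame with columns word1, word2 and freq.
predict_next <- function(word, bigram_freq, n = 3) {
    hits <- bigram_freq[bigram_freq$word1 == tolower(word), ]
    hits <- hits[order(-hits$freq), ]
    head(hits$word2, n)   # the n most frequent followers
}
# Hypothetical usage: predict_next("happy", bigram_freq)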