Capstone Project

The goal of this project is to display that you've gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on RPubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise, explain only the major features of the data you have identified, and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

1. Demonstrate that you've downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you have amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

1) Does the link lead to an HTML page describing the exploratory analysis of the training data set?
2) Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?
3) Has the data scientist made basic plots, such as histograms to illustrate features of the data?
4) Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?

Loading libraries

library(RCurl)
## Loading required package: bitops
library(knitr)
library(RColorBrewer)
library(stringi)
library(wordcloud)
library(ggplot2)
library(ngram)
library(NLP)
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(tm)
library(slam)
library(xtable)
library(quanteda)
## Package version: 1.5.1
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following objects are masked from 'package:tm':
## 
##     as.DocumentTermMatrix, stopwords
## The following object is masked from 'package:utils':
## 
##     View

Data

You can get the data with this link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

From this data set I am going to use the Twitter, blogs, and news files.
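If the zip file is not already on disk, it can also be fetched directly from R. The short sketch below is only illustrative (it assumes a writable working directory and was not run for this report):

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
    unzip("Coursera-SwiftKey.zip")   # the archive unpacks into a final/ folder, one subfolder per language
}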

setwd("~/Downloads/")
news1 <- "en_US.news.txt"
news <- file("en_US.news.txt", open="r")
news_data <- readLines(news); close(news)

blogs1 <- "en_US.blogs.txt"
blogs <- file("en_US.blogs.txt", open="r")
blogs_data <- readLines(blogs); close(blogs) 

twitter1 <- "en_US.twitter.txt"
twitter <- file("en_US.twitter.txt", open="r")
twitter_data <- readLines(twitter); close(twitter)
## Warning in readLines(twitter): line 167155 appears to contain an embedded nul
## Warning in readLines(twitter): line 268547 appears to contain an embedded nul
## Warning in readLines(twitter): line 1274086 appears to contain an embedded nul
## Warning in readLines(twitter): line 1759032 appears to contain an embedded nul
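The embedded-nul warnings are harmless for these counts. If desired, they could be avoided by re-reading the file with skipNul turned on, for example:

twitter_data <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)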

Reading a sample of the data

Here I read the first 1,000 lines of each file and build a small corpus.

dsize<-1000
setwd("~/Downloads/")
blogs <- readLines("en_US.blogs.txt",dsize)
news <- readLines("en_US.news.txt",dsize)
twitter <- readLines("en_US.twitter.txt",dsize)
corpus <- Corpus(VectorSource(c(blogs, news, twitter)))

Summary of the data

Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables? Yes!

Below are basic summaries of the three files (blogs, news, and Twitter): line counts, word counts, and a basic table.

len_b<-length(blogs_data)
len_n<-length(news_data)
len_t<-length(twitter_data)

word_b <-sum(stri_count_words(blogs_data))
word_n <-sum(stri_count_words(news_data))
word_t <-sum(stri_count_words(twitter_data))

summary<-data.frame(c("Blogs","News", "Twitter"), c(len_b, len_n, len_t), c(word_b,word_n,word_t))
kable(summary, col.names=c('Type', 'No of Lines', 'No of words'))
Type       No of Lines   No of words
Blogs           899288      37546239
News           1010242      34762395
Twitter        2360148      30093372
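The file sizes could be added to the same table with file.info(). The extra column below is just an illustrative sketch and assumes the three files are in the working directory:

size_mb <- file.info(c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))$size / 1024^2
summary$size <- round(size_mb, 1)
kable(summary, col.names=c('Type', 'No of Lines', 'No of words', 'Size (MB)'))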

Cutting the data

The full data set is quite large and might be hard for R to handle, so I will work with a smaller sample.

dsize<-5000
setwd("~/Downloads/")
blogs <- readLines("en_US.blogs.txt",dsize)
news <- readLines("en_US.news.txt",dsize)
twitter <- readLines("en_US.twitter.txt",dsize)
corpus <- Corpus(VectorSource(c(blogs, news, twitter)))
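Taking the first dsize lines is the simplest option. A random sample of the full files would be more representative; a possible alternative, reusing the *_data vectors read earlier (the sample_* names are just illustrative), is:

set.seed(1234)
sample_blogs   <- sample(blogs_data,   dsize)
sample_news    <- sample(news_data,    dsize)
sample_twitter <- sample(twitter_data, dsize)
corpus_random  <- Corpus(VectorSource(c(sample_blogs, sample_news, sample_twitter)))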

Cleaning and sampling the data

I am going to clean the sampled data: removing punctuation, numbers, and extra whitespace, dropping English stop words, and converting the text to lower case.

data12 <- Corpus(VectorSource(c(blogs, news, twitter)))
data12 <- tm_map(data12, removePunctuation)
## Warning in tm_map.SimpleCorpus(data12, removePunctuation): transformation drops
## documents
data12 <- tm_map(data12, removeNumbers)
## Warning in tm_map.SimpleCorpus(data12, removeNumbers): transformation drops
## documents
data12 <- tm_map(data12, stripWhitespace)
## Warning in tm_map.SimpleCorpus(data12, stripWhitespace): transformation drops
## documents
data12 <- tm_map(data12, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(data12, removeWords, stopwords("english")):
## transformation drops documents
data12 <- tm_map(data12, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(data12, content_transformer(tolower)):
## transformation drops documents

Analyzing the Data

I will start working with the data to extract useful information, using a word cloud and a histogram.

# Unigram (1-gram) term-document matrix
gram1 <- TermDocumentMatrix(data12)
wordMatrix <- as.data.frame(as.matrix(gram1))
v <- sort(rowSums(wordMatrix), decreasing = TRUE)   # total frequency of each word
d <- data.frame(word = names(v), freq = v)
plot <- d[1:20, ]                                   # top 20 words

A word cloud is a great way to see which words are used the most.

wordcloud(plot$word, plot$freq, max.words=5000, colors=brewer.pal(8,"Accent"))

The histogram below shows the frequency of each of the top 20 words used in the word cloud.

g_unigram <- ggplot(plot, aes(x=reorder(word, freq),y=freq)) + 
        geom_bar(stat="identity", fill="blue") + 
        ggtitle("Top 20 unigrams") + 
        xlab("Unigrams") + ylab("Freq") + 
        theme(axis.text.x=element_text(angle=90, hjust=1))
g_unigram
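Looking ahead to the prediction algorithm, the same idea extends to pairs of words. Below is a sketch of how bigram counts might be computed with the quanteda package loaded earlier (object names such as bigram_dfm are just illustrative):

toks <- tokens(tolower(c(blogs, news, twitter)),
               remove_punct = TRUE, remove_numbers = TRUE)
bigram_dfm <- dfm(tokens_ngrams(toks, n = 2, concatenator = " "))
topfeatures(bigram_dfm, 20)   # the 20 most frequent word pairs in the sample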

Conclusion

The plots show which words are used most often in the blogs, news, and Twitter samples. With this data it is possible to look at key words to detect the feelings of the user, and these word frequencies will be the starting point for the n-gram prediction algorithm and the Shiny app.
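As a rough illustration of how the eventual algorithm could use such counts to suggest the next word, the hypothetical helper below looks up the most frequent bigrams that start with the previous word (assuming a named bigram frequency vector like the one sketched above):

# hypothetical helper: given the previous word, return the k words that most
# often followed it in the bigram counts
predict_next <- function(prev_word, bigram_freq, k = 3) {
    hits <- bigram_freq[grepl(paste0("^", prev_word, " "), names(bigram_freq))]
    gsub(paste0("^", prev_word, " "), "", names(head(sort(hits, decreasing = TRUE), k)))
}
# example usage: predict_next("of", topfeatures(bigram_dfm, 10000))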