Milestones Capstone Project

Objective

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.

The motivation for this project is to:

Demonstrate that you’ve downloaded the data and have successfully loaded it in.
Create a basic report of summary statistics about the data sets.
Report any interesting findings that you amassed so far.
Get feedback on your plans for creating a prediction algorithm and Shiny app.

Loading Libraries

library(tm)
library(RWeka)
library(openNLP)
library(tau)
library(Rstem)
library(SnowballC)
library(quanteda)
library(stringr)
library(slam)
library(stylo)

Setting the working directory and loading the files (the summaries below are for news, blogs and twitter, in that order):

##    Length     Class      Mode 
##     77259 character character

##    Length     Class      Mode 
##    899288 character character

##    Length     Class      Mode 
##   2360148 character character
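The code for this loading step is not shown in the report. A minimal sketch of how such files could be read follows, assuming a hypothetical helper `read_lines_utf8` and a small temporary file standing in for the real data. Opening the connection in binary mode (`"rb"`) is a precaution: on Windows, text-mode reading can stop early at an embedded control character, which is one possible explanation for the news count being far lower than the other two.

```r
# Hypothetical helper (not from the report): read a text file as UTF-8
# through a binary connection, skipping embedded NUL bytes.
read_lines_utf8 <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Demonstration on a small temporary file (stand-in for the real data files)
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo <- read_lines_utf8(tmp)
length(demo)

# Assumed usage with the capstone files:
# news   <- read_lines_utf8("en_US.news.txt")
# blogs  <- read_lines_utf8("en_US.blogs.txt")
# tweets <- read_lines_utf8("en_US.twitter.txt")
```

If the news file was indeed truncated, re-reading it this way would change the line and word counts reported for it.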

Graphical analysis of data

First, I analyse the number of lines in each file.

library(ggplot2)
# Line counts per source, in the same order as the labels
numlines <- data.frame(numlines = c(length(blogs), length(news), length(tweets)),
                       names = c("blogs", "news", "twitter"))
ggplot(numlines, aes(x = names, y = numlines)) +
  geom_bar(stat = 'identity', fill = "blue", color = 'blue') +
  xlab('File source') + ylab('Total No. of Lines') +
  ggtitle('Total Line Count per File Source')

The Twitter file has the most lines, followed by blogs.

library(R.utils)


# Approximate word counts: each run of non-word characters counts as one separator
news.words <- sum(sapply(gregexpr("\\W+", news), length) + 1)
blogs.words <- sum(sapply(gregexpr("\\W+", blogs), length) + 1)
tweets.words <- sum(sapply(gregexpr("\\W+", tweets), length) + 1)

numwords <- data.frame(numwords = c(news.words, blogs.words, tweets.words),
                       names = c("news", "blogs", "twitter"))
numwords

##   numwords   names
## 1  2837489    news
## 2 39386844   blogs
## 3 32874052 twitter

ggplot(numwords, aes(x = names, y = numwords)) +
  geom_bar(stat = 'identity', fill = "red", color = 'red') +
  xlab('File source') + ylab('Total No. of words') +
  ggtitle('Total Words Count per File Source')

Twitter and blogs each contain a huge number of words: 32,874,052 and 39,386,844 respectively, against 2,837,489 for news.
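These word counts come from counting runs of non-word characters, which is only an approximation: apostrophes and trailing punctuation also count as separators. The small sketch below compares that approach with a whitespace-based count on two short sentences. The function names `count_words_regex` and `count_words_split` are hypothetical, introduced only for this illustration.

```r
# Same idea as the counts above: separators matched by \W+, plus one per line
count_words_regex <- function(x) {
  sum(sapply(gregexpr("\\W+", x), length) + 1)
}

# Alternative (an assumption, not the report's method): split on whitespace only
count_words_split <- function(x) {
  tokens <- strsplit(trimws(x), "\\s+")
  sum(lengths(tokens))
}

toy <- c("Went walking to the store",
         "Now it's time to get back to work: music!!")

regex.count <- count_words_regex(toy)
split.count <- count_words_split(toy)

# The regex-based count is higher because "it's" and "music!!" add separators
c(regex = regex.count, split = split.count)
```

On real text the gap between the two counts depends on how much punctuation each source contains, so the totals above are best read as orders of magnitude.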

Next, I analyse the size of the files to check whether they are too big.

## Analyse size of files (in MB)

# Named data.dir to avoid masking base::dir()
data.dir <- "D:/personal/data science/Capstone Project/final/en_US"
size.news <- file.info(file.path(data.dir, "en_US.news.txt"))$size / 1000^2
size.blogs <- file.info(file.path(data.dir, "en_US.blogs.txt"))$size / 1000^2
size.tweets <- file.info(file.path(data.dir, "en_US.twitter.txt"))$size / 1000^2

sizefiles <- data.frame(sizefiles = c(size.news, size.blogs, size.tweets),
                        names = c("news", "blogs", "twitter"))
ggplot(sizefiles, aes(x = names, y = sizefiles)) +
  geom_bar(stat = 'identity', fill = "green", color = 'green') +
  xlab('File source') + ylab('Size of each file (MB)') +
  ggtitle('Total size per File Source')

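All the files are too big to process in full, so I need to work with a smaller sample.

# Draw 1,000 random lines from each source
SampleTweets <- sample(tweets, 1000)
SampleBlogs <- sample(blogs, 1000)
SampleNews <- sample(news, 1000)
# paste() joins element-wise: each of the 1,000 entries holds one blog, one news and one tweet line
total.samples <- paste(SampleBlogs, SampleNews, SampleTweets, sep = " ")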
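Because `sample()` is random, the sample changes on every run. A sketch of a reproducible variant follows, assuming a fixed seed (the value 1234 is arbitrary) and small stand-in vectors in place of the real data. It uses `c()` rather than `paste()`, which keeps every sampled line as its own entry, if that is what is wanted.

```r
set.seed(1234)  # arbitrary seed, an assumption for illustration

# Stand-in data; in the report these would be blogs, news and tweets
blogs.demo  <- paste("blog line", 1:50)
news.demo   <- paste("news line", 1:50)
tweets.demo <- paste("tweet", 1:50)

n <- 10  # sample size per source (1000 in the report)
sample.blogs  <- sample(blogs.demo, n)
sample.news   <- sample(news.demo, n)
sample.tweets <- sample(tweets.demo, n)

# c() concatenates, giving 3 * n separate lines
all.samples <- c(sample.blogs, sample.news, sample.tweets)
length(all.samples)
```

Keeping the lines separate matters for n-grams: pasting a blog line, a news line and a tweet together creates word sequences that span unrelated texts.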

Checking Data

summary(total.samples)

##    Length     Class      Mode 
##      1000 character character

## The first three sampled tweets
SampleTweets[1:3]

## [1] "My brother's gas in this heated car with the windows up<<<<<"
## [2] "Went walking to the store"                                   
## [3] "Now it's time to get back to work: music!!"

SampleBlogs[220]

## [1] "Another aspect of sharing is finding those other readers who may have understood it in same or different way. In this way a small thought evolves in different ways and takes on multiple shapes and can give rise to many new thoughts as well. When I share books with somebody, I am never talking about a single book. My thoughts are a mix of multiple books that I might have read thereby creating some unique combinations that I myself could not realize until I started sharing."

SampleNews[50]

## [1] "At least this time, viola player Max Raimi found a bright side to the constant gridlock, observing: “You couldn’t have a drive-by shooting here.”"

After checking the data, cleaning is the first step. Stopwords won't be removed, however, because the goal of the Capstone Project is a prediction algorithm, which must also be able to predict those common words.
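A minimal base-R sketch of such a cleaning step is shown below, assuming a hypothetical helper `clean_text`. It lowercases the text, drops non-ASCII characters, removes digits and punctuation except apostrophes, collapses whitespace, and leaves stopwords in place. The planned version would likely rely on `tm` rather than this function.

```r
# Hypothetical cleaning helper (an illustration, not the final pipeline)
clean_text <- function(x) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")  # drop non-ASCII characters
  x <- tolower(x)
  x <- gsub("[^a-z' ]+", " ", x)  # keep only letters, apostrophes and spaces
  x <- gsub(" +", " ", x)         # collapse repeated spaces
  trimws(x)
}

cleaned <- clean_text(c("Went walking to the store",
                        "Now it's time to get back to work: music!!"))
cleaned
```

Apostrophes are kept so that contractions such as "it's" survive as single tokens, which seems preferable for next-word prediction.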

Planning of next steps

The next steps will be:

To clean the text: remove punctuation, extra whitespace and strange characters, apply stemming, etc.
To create a Corpus.
To create unigram, bigram and trigram functions.
To create a TermDocumentMatrix.
To create a frequency file.
To develop a predictive model.
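The n-gram, frequency and prediction steps above can be sketched in base R as follows. This is only an illustration under stated assumptions: `make_ngrams` and `predict_next` are hypothetical helpers, the three-sentence corpus is made up, and the planned implementation would use `tm` and `RWeka` on the real sample rather than these functions.

```r
# Build all n-grams of size n from one cleaned sentence
make_ngrams <- function(sentence, n) {
  words <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy corpus (made up for illustration), already cleaned and lowercased
corpus <- c("i want to go home", "i want to sleep", "we want to go out")

# Frequency table of bigrams, most frequent first
bigrams <- unlist(lapply(corpus, make_ngrams, n = 2))
bigram.freq <- sort(table(bigrams), decreasing = TRUE)

# Naive predictor: the most frequent word that follows `word` in the bigrams
predict_next <- function(word, freq) {
  hits <- freq[startsWith(names(freq), paste0(word, " "))]
  if (length(hits) == 0) return(NA_character_)
  sub("^[^ ]+ ", "", names(hits)[which.max(hits)])
}

prediction <- predict_next("to", bigram.freq)
prediction
```

A fuller model would probably look up trigrams first and fall back to bigrams and unigrams when no match is found. That back-off scheme is an assumption about one reasonable design, not something the plan above commits to.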


Gonzalo Andres Moreno

December 29, 2015
