We all know that knowledge is power, and human civilization is a journey of exploring knowledge. Every day we gain knowledge and experience and become more intelligent. We are all part of a global village, connected by the World Wide Web, so communication plays an ever larger role in this civilization, and nowadays we are exploring many new avenues of communication. Many of these newer forms of communication are informal: they do not follow standard rules and often lack proper structure. We no longer need to be face to face to communicate; it can be virtual, and we no longer need to stand in front of someone to show our emotions or sentiments; we can do it over the internet. The problem is that when we wish to analyze this indirect communication and extract its gist, the task is very difficult, because we are human and our emotions are complex. It becomes even more difficult when we wish to analyze those human emotions with machines, because machines do not have intelligence of their own; we have to give them artificial intelligence so that they can analyze emotions.

Text mining, natural language processing (NLP), and sentiment analysis are areas of study that combine linguistics and machine learning in an attempt to extract the main sentiment from a body of text. NLP has many sub-areas of focus, as described on its Wikipedia page, all with the same end goal: computers that can understand information just as a human being would. Implementing this requires knowledge of linguistics, statistics, and programming. The end goal of the Data Science Specialization Capstone Project is to produce a predictive text algorithm in R that, based on a user's text input, suggests the next most likely word.
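As a rough illustration of where this report is heading, the toy sketch below frames next-word prediction as a simple bigram lookup. The table bigram_freq and the function predict_next_word are made-up names with invented counts, purely for illustration; the real lookup tables will be built from the corpora loaded below.

# Toy sketch only: next-word prediction as a bigram lookup.
# The counts here are invented; the real model will be trained on the corpora read in below.
bigram_freq <- data.frame(
  first  = c("thank", "thank", "how", "how"),
  second = c("you", "goodness", "are", "is"),
  count  = c(50, 5, 30, 10),
  stringsAsFactors = FALSE
)

predict_next_word <- function(word, freq = bigram_freq) {
  candidates <- freq[freq$first == tolower(word), ]
  if (nrow(candidates) == 0) return(NA_character_)
  candidates$second[which.max(candidates$count)]
}

predict_next_word("thank")  # should return "you"

We begin by reading the three English corpora (blogs, news, and Twitter) line by line.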
blogs <- readLines("en_US.blogs.txt")
news <- readLines("en_US.news.txt")
## Warning in readLines("en_US.news.txt"): incomplete final line found on
## 'en_US.news.txt'
twitter <- readLines("en_US.twitter.txt")
## Warning in readLines("en_US.twitter.txt"): line 167155 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 268547 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1274086 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1759032 appears to contain
## an embedded nul
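These warnings are not fatal, but one possible alternative is to re-read the files with an explicit encoding and skipNul = TRUE, which silences the embedded-nul warnings and helps non-ASCII characters display correctly:

# Optional alternative: declare UTF-8 and skip embedded nuls while reading.
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)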
We should also check the structure of the data.
str(blogs)
## chr [1:899288] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."| __truncated__ ...
str(twitter)
## chr [1:2360148] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long." ...
library(NLP)
library(tm)
library(SnowballC)
library(caTools)
library(ggplot2)
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
# Take a reproducible 1% random sample of each corpus to keep the analysis manageable
set.seed(123)
split.blogs <- sample.split(blogs, SplitRatio = 0.01)
blogs.sample <- subset(blogs, split.blogs == TRUE)
split.news <- sample.split(news, SplitRatio = 0.01)
news.sample <- subset(news, split.news == TRUE)
split.twitter <- sample.split(twitter, SplitRatio = 0.01)
twitter.sample <- subset(twitter, split.twitter == TRUE)
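With tm and SnowballC already loaded, a natural next step is to combine the samples into a corpus and clean it before tokenization. The sketch below is one reasonable cleaning pipeline, not a final choice; sample.all and corpus are placeholder names.

# Sketch: combine the 1% samples into a tm corpus and apply basic cleaning.
sample.all <- c(blogs.sample, news.sample, twitter.sample)
corpus <- VCorpus(VectorSource(sample.all))
corpus <- tm_map(corpus, content_transformer(tolower))  # lower-case everything
corpus <- tm_map(corpus, removePunctuation)             # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                  # drop digits
corpus <- tm_map(corpus, stripWhitespace)                # collapse extra whitespace
# Stemming with SnowballC is optional for a next-word prediction model:
# corpus <- tm_map(corpus, stemDocument)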
# File sizes in megabytes
file.info("en_US.blogs.txt")$size / 1024^2
## [1] 200.4242
file.info("en_US.news.txt")$size / 1024^2
## [1] 196.2775
file.info("en_US.twitter.txt")$size / 1024^2
## [1] 159.3641
# Number of lines in each corpus
length(blogs)
## [1] 899288
length(news)
## [1] 77259
length(twitter)
## [1] 2360148
# Length of the longest line, in characters
max(nchar(blogs))
## [1] 40835
max(nchar(news))
## [1] 5760
max(nchar(twitter))
## [1] 213
# Quick check: how often do tweets mention "love" versus "hate"?
love_count <- sum(grepl("love", twitter))
hate_count <- sum(grepl("hate", twitter))
love_count / hate_count
## [1] 4.108592
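Finally, building on the corpus sketched above, term frequencies in the sample could be examined with ggplot2 (loaded earlier but not yet used). This is again only a sketch; tdm, freq, and top20 are illustrative names.

# Sketch: top terms in the cleaned sample corpus, plotted with ggplot2.
tdm <- TermDocumentMatrix(corpus)
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)  # slam is installed as a dependency of tm
top20 <- data.frame(word = names(freq)[1:20], count = as.numeric(freq[1:20]))
ggplot(top20, aes(x = reorder(word, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency", title = "Top 20 terms in the 1% sample")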