We all know that knowledge is power, and human civilization is a journey of exploring knowledge. Every day we gain knowledge and experience and become more intelligent. We are all part of a global village, connected by the World Wide Web, so communication plays an ever larger role in this civilization, and nowadays we are exploring many new avenues of communication. Many of these newer forms of communication are informal: they do not follow standard rules and often lack proper structure. We no longer need to be face to face to communicate; it can be virtual, and we no longer need to stand in front of someone to show our emotions or sentiments; we can do it over the internet. The problem is that when we wish to analyze this indirect communication and extract its gist, the task is very difficult, because we are human and our emotions are complex. It becomes even more difficult when we wish to analyze those human emotions with machines, because machines do not have intelligence of their own; we have to give them artificial intelligence so that they can analyze emotions.

Text mining, natural language processing (NLP), and sentiment analysis are areas of study that combine linguistics and machine learning in an attempt to extract the main sentiment from a body of text. NLP has many sub-areas of focus, as described on its Wikipedia page, all with the same end goal: computers that can understand information just as a human being would. Implementing this requires knowledge of linguistics, statistics, and programming. The end goal of the Data Science Specialization Capstone Project is to produce a predictive text algorithm in R that, based on a user's text input, suggests the next most likely word.
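As a rough illustration of where this report is heading, the toy sketch below frames next-word prediction as a simple bigram lookup. The table bigram_freq and the function predict_next_word are made-up names with invented counts, purely for illustration; the real lookup tables will be built from the corpora loaded below.

# Toy sketch only: next-word prediction as a bigram lookup.
# The counts here are invented; the real model will be trained on the corpora read in below.
bigram_freq <- data.frame(
  first  = c("thank", "thank", "how", "how"),
  second = c("you", "goodness", "are", "is"),
  count  = c(50, 5, 30, 10),
  stringsAsFactors = FALSE
)

predict_next_word <- function(word, freq = bigram_freq) {
  candidates <- freq[freq$first == tolower(word), ]
  if (nrow(candidates) == 0) return(NA_character_)
  candidates$second[which.max(candidates$count)]
}

predict_next_word("thank")  # should return "you"

We begin by reading the three English corpora (blogs, news, and Twitter) line by line.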
blogs <- readLines("en_US.blogs.txt")
news <- readLines("en_US.news.txt")
## Warning in readLines("en_US.news.txt"): incomplete final line found on
## 'en_US.news.txt'
twitter <- readLines("en_US.twitter.txt")
## Warning in readLines("en_US.twitter.txt"): line 167155 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 268547 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1274086 appears to contain
## an embedded nul
## Warning in readLines("en_US.twitter.txt"): line 1759032 appears to contain
## an embedded nul
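These warnings are not fatal, but one possible alternative is to re-read the files with an explicit encoding and skipNul = TRUE, which silences the embedded-nul warnings and helps non-ASCII characters display correctly:

# Optional alternative: declare UTF-8 and skip embedded nuls while reading.
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)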
We should also check the structure of the data.
str(blogs)
## chr [1:899288] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."| __truncated__ ...
str(twitter)
## chr [1:2360148] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long." ...
library(NLP)
library(tm)
library(SnowballC)
library(caTools)
library(ggplot2)
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
# Take a reproducible 1% random sample of each corpus to keep the analysis manageable
set.seed(123)
split.blogs <- sample.split(blogs, SplitRatio = 0.01)
blogs.sample <- subset(blogs, split.blogs == TRUE)
split.news <- sample.split(news, SplitRatio = 0.01)
news.sample <- subset(news, split.news == TRUE)
split.twitter <- sample.split(twitter, SplitRatio = 0.01)
twitter.sample <- subset(twitter, split.twitter == TRUE)
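With tm and SnowballC already loaded, a natural next step is to combine the samples into a corpus and clean it before tokenization. The sketch below is one reasonable cleaning pipeline, not a final choice; sample.all and corpus are placeholder names.

# Sketch: combine the 1% samples into a tm corpus and apply basic cleaning.
sample.all <- c(blogs.sample, news.sample, twitter.sample)
corpus <- VCorpus(VectorSource(sample.all))
corpus <- tm_map(corpus, content_transformer(tolower))  # lower-case everything
corpus <- tm_map(corpus, removePunctuation)             # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                  # drop digits
corpus <- tm_map(corpus, stripWhitespace)                # collapse extra whitespace
# Stemming with SnowballC is optional for a next-word prediction model:
# corpus <- tm_map(corpus, stemDocument)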
# File sizes in megabytes
file.info("en_US.blogs.txt")$size / 1024^2
## [1] 200.4242
file.info("en_US.news.txt")$size / 1024^2
## [1] 196.2775
file.info("en_US.twitter.txt")$size / 1024^2
## [1] 159.3641
# Number of lines in each corpus
length(blogs)
## [1] 899288
length(news)
## [1] 77259
length(twitter)
## [1] 2360148
# Length of the longest line, in characters
max(nchar(blogs))
## [1] 40835
max(nchar(news))
## [1] 5760
max(nchar(twitter))
## [1] 213
# Quick check: how often do tweets mention "love" versus "hate"?
love_count <- sum(grepl("love", twitter))
hate_count <- sum(grepl("hate", twitter))
love_count / hate_count
## [1] 4.108592
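Finally, building on the corpus sketched above, term frequencies in the sample could be examined with ggplot2 (loaded earlier but not yet used). This is again only a sketch; tdm, freq, and top20 are illustrative names.

# Sketch: top terms in the cleaned sample corpus, plotted with ggplot2.
tdm <- TermDocumentMatrix(corpus)
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)  # slam is installed as a dependency of tm
top20 <- data.frame(word = names(freq)[1:20], count = as.numeric(freq[1:20]))
ggplot(top20, aes(x = reorder(word, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency", title = "Top 20 terms in the 1% sample")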