The goal of this task is to write a report that can be used as the basis for a final product or perhaps a submission to the Yelp dataset challenge. In addition, you need to create a 5-slide deck using RStudio Presenter tools to describe and promote your analysis.
Write a 5-page report using R Markdown that describes your question/problem, how you used the dataset, the analysis you conducted, and the conclusions that you drew from the analysis. Your report must have the following sections clearly labelled:
Introduction - A description of the question/problem and the rationale for studying it
Methods and Data - Describe how you used the data and the type of analytic methods that you used; it’s okay to be a bit technical here but clarity is important
Results - Describe what you found through your analysis of the data.
Discussion - Explain how you interpret the results of your analysis and what the implications are for your question/problem.
The final report must be in PDF format using standard page sizes. Please be considerate with respect to using readable font sizes and page margins.
# lower-case x; return NA instead of stopping if tolower() signals an error
try.error = function(x)
{
  tryCatch(tolower(x), error = function(e) NA)
}
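The error-handling pattern used here can be seen in isolation with a small sketch. The wrapper name `safe_apply` and the failing function are hypothetical and exist only for illustration; they are not part of the analysis.

```r
# Hypothetical wrapper (illustration only): run f(x) and fall back to NA
# if f signals an error, instead of stopping the whole script.
safe_apply <- function(f, x) {
  tryCatch(f(x), error = function(e) NA)
}

# Normal case: the value of f(x) is returned unchanged.
ok <- safe_apply(tolower, "Mon Ami Gabi")

# Error case: the handler swallows the error and NA is returned.
failed <- safe_apply(function(x) stop("boom"), "Mon Ami Gabi")

print(ok)
print(failed)
```

Returning `NA` rather than stopping lets a later step filter out the inputs that could not be processed.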
suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(xlsx)))
suppressWarnings(suppressMessages(library(sentiment)))
suppressWarnings(suppressMessages(library(wordcloud)))
setwd("D:/Google Drive/Coursera/capstone_yelt")
# generate n random alphanumeric strings of the given length
RndStr <- function(n = 1, length = 12)
{
  randomString <- character(n) # initialize vector
  for (i in seq_len(n))
  {
    randomString[i] <- paste(sample(c(0:9, letters, LETTERS),
                                    length, replace = TRUE),
                             collapse = "")
  }
  return(randomString)
}
library(RODBC)
## Warning: package 'RODBC' was built under R version 3.2.2
conn <- odbcConnect(dsn = "capstone", uid = "hdfs", pwd = "")
sqlQuery(conn, "ADD JAR /CML/lib/lib/hive-serdes-1.0-SNAPSHOT.jar;")
## character(0)
sqlQuery(conn, "set mapred.job.priority='VERY_HIGH';")
## character(0)
df <-
sqlQuery(
conn, "select a.user_id, a.date ,
regexp_replace(a.`text`, '\\\n|\\\r','') as review,
regexp_replace(b.`text`, '\\\n|\\\r','') as tips
from review a left join tip b on (a.business_id=b.business_id) and (a.date=b.date) and (a.user_id=b.user_id)
where a.business_id = '4bEjOyTaDG24SY5TxsaUNQ'")
odbcClose(conn)
head(df)
## user_id date
## 1 H43f_rD3czq0UqY-4zhWjA 2005-10-10
## 2 pW91HUnVz6ssLZ4dY-ztyQ 2005-12-02
## 3 aoxuw-XpJYIX1-R0TUS7CQ 2006-01-07
## 4 91e8Cg7Vqj7G5XEXK3uaYA 2006-05-17
## 5 3lPLOLeeJVyJFtLphJ6xzA 2006-06-22
## 6 2Z653F4UvqDIFrY52skFwA 2006-07-24
## review
## 1 If you enjoy a little people watching with your dining, sit out on the outside terrace and watch Las Vegas tourist walk the strip as they celebrate their winnings or rave about the fabulous Bellagio fountains. As the restaurant is directly across from th
## 2 Though heartbroken and a bit aimless on my 22nd birthday, a meal at Mon Ami Gabi really helped cheer me up.I imagine that if you go there relatively happy, you'd really enjoy yourself!
## 3 The food and wine was amazing, but the super high price is going to make this a 4-star. Although I must admit that the price was a direct result of ordering numerous glasses of wine rather than a bottle. Dumb. But it was the best meal we had in Vegas,
## 4 Definitely one of my favorites on the Strip! I've been here several times and it has never disappointed! The food here is wonderful as well as the service. Their steaks are fabulous, as well as their seafood. Worth a visit if you've never been!
## 5 What great friends I have..... We ate at Mon Ami Gabi translation (My friend Gabi) last Sat. night. It was awesome! It is located in the (wait for it..... Paris Hotel). What more can a grl who loves Paris ask for? Me and my 6 grl friends had a delicio
## 6 Very good. There is a bunch of tourist trap bullshit in Vegas. This is not it. The food is fantastic for the price. The bloody mary bar is a lot of fun. Definitely recommended.
## tips
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
#write.csv(queryResult,paste(RndStr(),".csv",sep = ""))
#df <- read.xlsx("abi_tips_review.xls",1)
# keep only the review text and clean it as a character vector
review <- as.character(df$review)
review <- gsub("@\\w+", "", review)         # drop @mentions before "@" is stripped as punctuation
review <- gsub("[[:digit:]]", "", review)   # drop digits
review <- gsub("[[:punct:]]", "", review)   # drop punctuation
review <- gsub("[ \t]{2,}", " ", review)    # collapse repeated spaces/tabs to one space
review <- gsub("^\\s+|\\s+$", "", review)   # trim leading and trailing whitespace
# lower-case each review; reviews that fail become NA and are dropped
review <- sapply(review, try.error, USE.NAMES = FALSE)
review <- review[!is.na(review)]
df <- data.frame(review = review, stringsAsFactors = FALSE)
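The cleaning steps can be tried on a single string without the database connection. The sketch below is an illustration only: `clean_review` and the sample text are hypothetical, and collapsing repeated whitespace to a single space (rather than removing it) is an assumption about the intended behaviour.

```r
# Hypothetical helper (illustration only): applies the same kinds of
# substitutions to a character vector using base R.
clean_review <- function(x) {
  x <- gsub("@\\w+", "", x)          # drop @mentions while "@" is still present
  x <- gsub("[[:digit:]]", "", x)    # drop digits
  x <- gsub("[[:punct:]]", "", x)    # drop punctuation
  x <- gsub("[ \t]{2,}", " ", x)     # collapse runs of spaces/tabs to one space
  x <- gsub("^\\s+|\\s+$", "", x)    # trim leading and trailing whitespace
  tolower(x)
}

# Made-up sample text, not taken from the Yelp data.
sample_text <- "  Great steak, 5 stars!!  Thanks @gabi  "
cleaned <- clean_review(sample_text)
print(cleaned)
```

Running the mention pattern before the punctuation pass matters: once `@` has been stripped as punctuation, `@\\w+` can no longer match.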
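# classify the emotion of each review with the naive Bayes classifier
class_emo = classify_emotion(df, algorithm="bayes", prior=1.0)
# get emotion best fit; reviews with no best fit are labelled "unknown"
emotion = class_emo[,7]
emotion[is.na(emotion)] = "unknown"
# classify the polarity of each review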
class_pol = classify_polarity(df, algorithm="bayes")
# get polarity best fit
polarity = class_pol[,4]
# data frame with results
sent_df = data.frame(text=df, emotion=emotion,
polarity=polarity, stringsAsFactors=FALSE)
sent_df = within(sent_df,
emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))
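The reordering step sorts the emotion levels by frequency so that the bar chart shows the most common category first. A minimal sketch with made-up labels (not classifier output) shows the effect; the names `labels` and `labels_f` are hypothetical.

```r
# Toy labels standing in for the classifier output (made up for illustration).
labels <- c("joy", "anger", "joy", "unknown", "joy", "unknown")

# Count each label, sort the counts from largest to smallest,
# and use that order as the factor levels.
ordered_levels <- names(sort(table(labels), decreasing = TRUE))
labels_f <- factor(labels, levels = ordered_levels)

print(levels(labels_f))
```

Because ggplot2 draws discrete categories in level order, a factor built this way yields bars in descending order of count.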
# plot distribution of emotions
ggplot(sent_df, aes(x=emotion)) +
  geom_bar(aes(y=..count.., fill=emotion)) +
  scale_fill_brewer(palette="Dark2") +
  labs(x="emotion categories", y="number of reviews")
# plot distribution of polarity
ggplot(sent_df, aes(x=polarity)) +
  geom_bar(aes(y=..count.., fill=polarity)) +
  scale_fill_brewer(palette="RdGy") +
  labs(x="polarity categories", y="number of reviews")
# build one document per emotion by pasting together its reviews
emos = levels(factor(sent_df$emotion))
nemo = length(emos)
emo.docs = rep("", nemo)
for (i in seq_len(nemo))
{
  tmp = df$review[emotion == emos[i]]
  emo.docs[i] = paste(tmp, collapse=" ")
}
# remove stopwords
emo.docs = removeWords(emo.docs, stopwords("english"))
# create corpus
corpus = Corpus(VectorSource(emo.docs))
tdm = TermDocumentMatrix(corpus)
tdm = as.matrix(tdm)
colnames(tdm) = emos
# comparison word cloud
suppressWarnings(suppressMessages(comparison.cloud(tdm, colors = brewer.pal(nemo, "Dark2"),
scale = c(3,.5), random.order = FALSE, title.size = 1.5)))
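The matrix passed to `comparison.cloud` has one row per term and one column per emotion, with word counts in the cells. The base-R sketch below builds such a matrix for two made-up documents to show that layout; it does not use `tm`, and the names `docs`, `vocab`, and `toy_tdm` are hypothetical.

```r
# Made-up documents, one per emotion category (illustration only).
docs <- c(joy = "great food great service", anger = "slow service cold food")

# Split each document into words and collect the sorted vocabulary.
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each document.
# Rows are terms, columns are documents.
toy_tdm <- vapply(
  tokens,
  function(words) tabulate(match(words, vocab), nbins = length(vocab)),
  integer(length(vocab))
)
rownames(toy_tdm) <- vocab

print(toy_tdm)
```

Terms that are frequent in one column and rare in the others are the ones a comparison cloud draws largest for that category.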
Write 5 slides using RStudio Presenter that summarize your report for a general audience. Explain why the problem you presented is interesting and how your analysis directly addresses that problem. The slide deck should be published as a viewable HTML presentation hosted on R Pubs or GitHub, and you should be able to provide a link to the presentation so that others can view it.