Click here for Presentation Video
As consumers, we want to spend on worth-buy products. While in-store staff and the sellers will always say good things,
we want to know more about the product we are interested on, from both aspects - the good and the bad.
One of many others way to gain this info, is through experience from existing user which is tend to be would be more reliable.
Using this application, user can do online searching for a product type offered by different available brands by giving the necessary inputs to the application. From that point, the application will do the search and provide a list for the requested product. User can scroll over the list and do screening on the product’s brief specification, and see what available in the market.
While it is good to see what are gain and drawback of what each brand offering, some people might already have specific selection in mind, but instead just the matter of looking for reviews from other users to be firm on own decision.
For this scenario, filter can be used to shorten the list down and get the user to the product on interest directly.
Once the selection made from the list, this application will display 2 word clouds from text analysis on user reviews. On the top, word cloud from the positive reviews, and at the bottom side, word cloud from negative reviews.
Most of the common rating systems only show the average rating scale from 1 to 5.
However, this type of rating showcase is too general and does not showing the actual feedback in detail manner replace with "However, this type of rating showcase is too general and does not show the actual feedback in detailed manner.
For a potential customer to know the actual comments/reviews of the specific product, the customer need to browse through the comment section and read comments per entry one by one.
This process is time consuming and the customer might missed out the critical feedback due to time-constraint.
1. To clean and import the data set into Postgres DB.
2. To analyze the data set and provide brief details in different plots format.
3. To help user understand the common sentiments of the products of interest by generating word cloud based on positive and negative reviews.
Before proceeding with sentiment analysis, we have performed data analysis to understand the data set:
con <- dbConnect(RPostgres::Postgres(), dbname='pds-project',
host = 'pdsgrouproject.carynrwc1v04.ap-southeast-1.rds.amazonaws.com',
port = 5432,
user = 'pdsgroupproject',
password = 'pdsadmin123', bigint="integer")
1: Total number of products in data set
sql <- paste("select count(title) from amazon_prod_reviews",
sep="")
#cat(sql,"\n")
res <- dbGetQuery(con, sql)
res$count
## [1] 615747
2: Top 10 brands having most number of 5 star ratings:
sql <- paste("select brand, count(overall) from amazon_prod_reviews apr where overall = 5 group by brand, overall order by count desc limit 10",
sep="")
#cat(sql,"\n")
res <- dbGetQuery(con, sql)
table <- with(res, table(brand, count))
res %>%
mutate(brand = fct_reorder(brand, desc(count))) %>%
ggplot(aes(x=brand, y=count)) +
geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4) +
coord_flip() +
xlab("") +
theme_bw()
3: Distribution of review counts by ratings.
sql <- paste("select overall, count(overall) from amazon_prod_reviews apr group by overall order by overall",
sep="")
#cat(sql,"\n")
res <- dbGetQuery(con, sql)
res %>%
ggplot(aes(x=overall, y=count)) +
geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4) +
#scale_y_binned() +
theme_bw() +
scale_y_continuous(labels = comma)
4: Top 5 brands having most number of products:
sql <- paste("select brand, count(title) from amazon_prod_reviews apr group by brand, title order by count desc limit 5",
sep="")
res <- dbGetQuery(con, sql)
data <- data.frame(
category=c(res$brand),
count=c(res$count)
)
data$fraction <- data$count / sum(data$count)
data$ymax <- cumsum(data$fraction)
data$ymin <- c(0, head(data$ymax, n=-1))
data$labelPosition <- (data$ymax + data$ymin) / 2
data$label <- paste0(data$category, "\n value: ", data$count)
ggplot(data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=category)) +
geom_rect() +
geom_text( x=2, aes(y=labelPosition, label=label, color=category), size=3) +
scale_fill_brewer(palette=3) +
scale_color_brewer(palette=10) +
coord_polar(theta="y") +
xlim(c(-1, 4)) +
theme_void() +
theme(legend.position = "none")
library(shiny)
library(DT)
library(DBI)
library(wordcloud)
library(colourpicker)
library(dplyr)
library(tidytext)
library(stringr)
library(shinythemes)
shinyApp(
con <- dbConnect(RPostgres::Postgres(), dbname='pds-project',
host = 'pdsgrouproject.carynrwc1v04.ap-southeast-1.rds.amazonaws.com',
port = 5432,
user = 'pdsgroupproject',
password = 'pdsadmin123'),
ui = fluidPage(
titlePanel("Amazon Product Reviews"),
p("Amazon Product Reviews", id= "title"),
tags$head(
tags$script(HTML('
Shiny.addCustomMessageHandler("updatechildrow", function(data) {
//console.log(data);
reviews = data[1];
reviews_count = reviews["overall"].length;
dataset = [];
for (i=0; i<reviews_count; i++) {
d = [];
d.push(reviews["overall"][i]);
d.push(reviews["vote"][i]);
d.push(reviews["summary"][i]);
d.push(reviews["reviewtext"][i]);
dataset.push(d);
}
$("div[id*=\'child_details\']:not(#child_details+data[0])").parent().parent().remove();
$("tr.selected").removeClass("selected");
$("#child_details"+data[0]).DataTable({
data: dataset,
columns: [
{ title: "Rating" },
{ title: "Votes" },
{ title: "Summary" },
{ title: "Review" }
]
});
})
'))
),
sidebarLayout(
sidebarPanel(
textInput("txtSearch", "Search product:"),
selectInput("selectBrand", "Brand", c('All')),
actionButton("btnSearch", "Search"),
plotOutput("good_review"),
plotOutput("bad_review")
),
mainPanel(
dataTableOutput("tblProducts")
)
)),
server = function(input, output, session) {
brands <- observeEvent(input$txtSearch, {
if (input$txtSearch != "") {
sql <- paste("SELECT DISTINCT brand FROM amazon_prod_reviews dwd ",
"WHERE UPPER(title) LIKE '%", toupper(input$txtSearch), "%' ",
"ORDER BY brand",
sep="")
#cat(sql,"\n")
res <- dbGetQuery(con, sql)
# Update the selection for Brand. Include the 'All' option to select all brands.
updateSelectInput(session, "selectBrand", choices=c('All',res), selected='All')
}
})
# Update data table with selected brand.
df <- eventReactive(input$btnSearch, {
if (input$txtSearch != "") {
sql <- paste("SELECT AVG(overall) overall_avg, COUNT(overall) review_count, ",
"title, brand FROM amazon_prod_reviews ",
"WHERE UPPER(title) LIKE '%", toupper(input$txtSearch), "%' ",
sep="")
# If 'All' is not selected, construct the SQL to search for only the specific brand.
if (input$selectBrand != 'All') {
sql <- paste(sql,
"AND brand = ", dbQuoteString(con, input$selectBrand), " ",
sep="")
}
sql <- paste(sql,
"GROUP BY Title, Brand ORDER BY review_count DESC, overall_avg DESC ",
# "LIMIT 10 ",
sep="")
#cat(sql,"\n")
res <- dbGetQuery(con, sql)
}
})
# dataset for good review
datasource <- eventReactive(input$productselect,{
wordCloudDf <- data.frame()
sql <- paste("SELECT overall, summary FROM amazon_prod_reviews ",
"WHERE title = '", input$productselect[5], "' ",
"AND overall >= 3",
sep="")
wordCloudDf <- dbGetQuery(con, sql)
good_t_w <-
wordCloudDf %>%
select(summary)
tokens <- good_t_w %>%
unnest_tokens(word, summary) %>%
count(word, sort = TRUE) %>%
ungroup()
tokens_clean <- tokens %>%
anti_join(stop_words)
nums <- tokens_clean %>% filter(str_detect(word, "^[0-9]")) %>% select(word) %>% unique()
tokens_clean <- tokens_clean %>%
anti_join(nums, by = "word")
tokens_clean <- tokens_clean %>%
anti_join(data.frame(word=c("star","stars")), by = "word")
return(tokens_clean)
})
# dataset for bad review
datasource_v2 <- eventReactive(input$productselect,{
sql <- paste("SELECT overall, summary FROM amazon_prod_reviews ",
"WHERE title = '", input$productselect[5], "' ",
"AND overall < 3",
sep="")
wordCloudDf <- dbGetQuery(con, sql)
bad_t_w <-
wordCloudDf %>%
select(summary)
tokens2 <- bad_t_w %>%
unnest_tokens(word, summary) %>%
count(word, sort = TRUE) %>%
ungroup()
tokens_clean2 <- tokens2 %>%
anti_join(stop_words)
nums <- tokens_clean2 %>% filter(str_detect(word, "^[0-9]")) %>% select(word) %>% unique()
tokens_clean2 <- tokens_clean2 %>%
anti_join(nums, by = "word")
tokens_clean2 <- tokens_clean2 %>%
anti_join(data.frame(word=c("star","stars")), by = "word")
return(tokens_clean2)
})
# wordcloud for good review
output$good_review <- renderPlot({
x <- datasource()$word
y <- datasource()$n
minfreq_bigram <- 2
wordcloud(x,y,random.order=FALSE,scale = c(3,0.5),min.freq = minfreq_bigram,colors = brewer.pal(8,"Dark2"),
max.words=50)
})
# wordcloud for bad review
output$bad_review <- renderPlot({
x <- datasource_v2()$word
y <- datasource_v2()$n
minfreq_bigram <- 2
wordcloud(x,y,random.order=FALSE,scale = c(3,0.5),min.freq = minfreq_bigram,colors = brewer.pal(8,"Dark2"),
max.words=50)
})
output$tblProducts <- renderDataTable(
{ df() },
filter = list(position='top', clear=FALSE),
callback = JS("
var format = function(d) {
Shiny.setInputValue('productselect', [d.index(), d.data()]);
return d.data();
}
table.on('click', 'td', function() {
var tr = $(this).closest('tr');
var row = table.row(tr);
var rowData = row.data();
var index = row.index();
if (row.child.isShown()) {
row.child.hide();
}
else {
row.child('<table id=\"child_details'+index+'\"></table>').show();
format(row);
}
});
")
)
prod <- observeEvent(input$productselect, {
# print(input$productselect[1])
# print(input$productselect[5])
sql <- paste("SELECT overall, vote, summary, reviewtext FROM amazon_prod_reviews ",
"WHERE title=",dbQuoteString(con,input$productselect[5]), sep="")
res <- dbGetQuery(con, sql)
session$sendCustomMessage("updatechildrow", list(input$productselect[1], res))
})
},
options = list(height = 1000)
)
In conclusion, we are able to define and classify positive and negative sentiments based on customer comments and ratings. We are able to visualize the words by using word cloud. We have found some reviews may not reflect the overall rating for the product. Our current method of sentiment analysis is able to simplify the ratings of 5 categories to a polarity of 2 categories. Hence, the classification accuracy had improved. However, the method needs to be refined by possibly creating custom dictionaries to improve the sentiment classification especially those reviews that use high frequency of negative words but convey positive sentiments overall.