Group Project

Amazon Product Reviews

INTRODUCTION

Click here for Presentation Video
As consumers, we want to spend on worth-buy products. While in-store staff and the sellers will always say good things,
we want to know more about the product we are interested on, from both aspects - the good and the bad.
One of many others way to gain this info, is through experience from existing user which is tend to be would be more reliable.

Using this application, user can do online searching for a product type offered by different available brands by giving the necessary inputs to the application. From that point, the application will do the search and provide a list for the requested product. User can scroll over the list and do screening on the product’s brief specification, and see what available in the market.

While it is good to see what are gain and drawback of what each brand offering, some people might already have specific selection in mind, but instead just the matter of looking for reviews from other users to be firm on own decision.
For this scenario, filter can be used to shorten the list down and get the user to the product on interest directly.

Once the selection made from the list, this application will display 2 word clouds from text analysis on user reviews. On the top, word cloud from the positive reviews, and at the bottom side, word cloud from negative reviews.

PROBLEM STATEMENT

Most of the common rating systems only show the average rating scale from 1 to 5.
However, this type of rating showcase is too general and does not showing the actual feedback in detail manner replace with "However, this type of rating showcase is too general and does not show the actual feedback in detailed manner.

For a potential customer to know the actual comments/reviews of the specific product, the customer need to browse through the comment section and read comments per entry one by one.

This process is time consuming and the customer might missed out the critical feedback due to time-constraint.

OBJECTIVE

1. To clean and import the data set into Postgres DB.
2. To analyze the data set and provide brief details in different plots format.
3. To help user understand the common sentiments of the products of interest by generating word cloud based on positive and negative reviews.

DATA SOURCE & CLEANING

Click here for Data Source

The pre-processing activity begins by converting the format of the original data into csv.
Original dataset consists of 2 files: product metadata, and product reviews. the 2 files are merged into 1 file based on the ASIN column (the product ID).
The next step is to identify columns that bear no relevance to the analysis being done and drop it from the dataset. These columns are either fully empty, or do not contain any useful information.
Next data clean-up are performed. Example, “titles” column containing HTML tags are removed and columns which records are too sparsely populated are removed as well
Reviews are categorized accordingly. Reviews are ranged from 1 to 5 where 1 being worst and 5 being best. Reviews with 1-3 stars are categorized as poor and 4-5 stars are good reviews. This will be used to generate word clouds for ‘good’ reviews and ‘poor’ reviews.
Text pre-processing for sentiment analysis are performed by tokenization of review texts and removal of stop words.

DATA ANALYSIS

Before proceeding with sentiment analysis, we have performed data analysis to understand the data set:

  con <- dbConnect(RPostgres::Postgres(), dbname='pds-project',
                   host = 'pdsgrouproject.carynrwc1v04.ap-southeast-1.rds.amazonaws.com',
                   port = 5432,
                   user = 'pdsgroupproject',
                   password = 'pdsadmin123', bigint="integer")

1: Total number of products in data set

        sql <- paste("select count(title) from amazon_prod_reviews",
                     sep="")
        #cat(sql,"\n")
        res <- dbGetQuery(con, sql)
        res$count

## [1] 615747

2: Top 10 brands having most number of 5 star ratings:

        sql <- paste("select brand, count(overall) from amazon_prod_reviews apr where overall = 5 group by brand, overall order by count desc limit 10",
                     sep="")
        #cat(sql,"\n")
        res <- dbGetQuery(con, sql)
        table <- with(res, table(brand, count))
        res %>%
          mutate(brand = fct_reorder(brand, desc(count))) %>%
          ggplot(aes(x=brand, y=count)) + 
          geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4) +
          coord_flip() +
          xlab("") +
          theme_bw()

3: Distribution of review counts by ratings.

        sql <- paste("select overall, count(overall) from amazon_prod_reviews apr group by overall order by overall",
                     sep="")
        #cat(sql,"\n")
        res <- dbGetQuery(con, sql)
        res %>%
          ggplot(aes(x=overall, y=count)) +
          geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4) +
          #scale_y_binned() +
          theme_bw() +
          scale_y_continuous(labels = comma)

4: Top 5 brands having most number of products:

        sql <- paste("select brand, count(title) from amazon_prod_reviews apr  group by brand, title order by count desc limit 5",
                     sep="")
        res <- dbGetQuery(con, sql)
        data <- data.frame(
          category=c(res$brand),
          count=c(res$count)
        )
        data$fraction <- data$count / sum(data$count)

        data$ymax <- cumsum(data$fraction)

        data$ymin <- c(0, head(data$ymax, n=-1))

        data$labelPosition <- (data$ymax + data$ymin) / 2

        data$label <- paste0(data$category, "\n value: ", data$count)

        ggplot(data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=category)) +
          geom_rect() +
          geom_text( x=2, aes(y=labelPosition, label=label, color=category), size=3) + 
          scale_fill_brewer(palette=3) +
          scale_color_brewer(palette=10) +
          coord_polar(theta="y") +
          xlim(c(-1, 4)) +
          theme_void() +
          theme(legend.position = "none")

SHINY APP & CODE

Guidelines

User can search the interested appliances in the search column.“, strong(”Eg: Refrigerator, Stove, Oven, Washer, Dryer, Humidfier"), User can choose to narrow the search by selecting the brand of ideal appliances from dropdown menu.
A table with the selected type of appliances will be listed based on the review counts in desending order.
User can further select any product from the table product to view the word cloud. Table of the review’s summary also will be shown once click on the product.
Click here to open shiny app in new tab

NOTE: Due to limited free resources (CPU, Memory, etc) the app response is slow, do give sometime to let it load, search and generate the wordcloud. *** Do open the RMARKDOWN in browser to open the embedded shiny app properly.

Shiny App Screenshot

Shiny App

Shiny App Code

library(shiny)
library(DT)
library(DBI)
library(wordcloud)
library(colourpicker)
library(dplyr)
library(tidytext)
library(stringr)
library(shinythemes)
shinyApp(
  con <- dbConnect(RPostgres::Postgres(), dbname='pds-project',
                   host = 'pdsgrouproject.carynrwc1v04.ap-southeast-1.rds.amazonaws.com',
                   port = 5432,
                   user = 'pdsgroupproject',
                   password = 'pdsadmin123'),
  ui = fluidPage(
    titlePanel("Amazon Product Reviews"),
    p("Amazon Product Reviews", id= "title"),
    tags$head(
      tags$script(HTML('
        Shiny.addCustomMessageHandler("updatechildrow", function(data) {
            //console.log(data);
            reviews = data[1];
            reviews_count = reviews["overall"].length;
            
            dataset = [];
            
            for (i=0; i<reviews_count; i++) {
                d = [];
                d.push(reviews["overall"][i]);
                d.push(reviews["vote"][i]);
                d.push(reviews["summary"][i]);
                d.push(reviews["reviewtext"][i]);
                dataset.push(d);
            }
            $("div[id*=\'child_details\']:not(#child_details+data[0])").parent().parent().remove();
            $("tr.selected").removeClass("selected");
            $("#child_details"+data[0]).DataTable({
                data: dataset,
                columns: [
                    { title: "Rating" },
                    { title: "Votes" },
                    { title: "Summary" },
                    { title: "Review" }
                ]
            });
        })
        '))
    ),
    sidebarLayout(
      sidebarPanel(
        textInput("txtSearch", "Search product:"),
        selectInput("selectBrand", "Brand", c('All')),
        actionButton("btnSearch", "Search"),
        plotOutput("good_review"),
        plotOutput("bad_review")
      ),
      mainPanel(
        dataTableOutput("tblProducts")
      )
    )),
  server = function(input, output, session) {
    brands <- observeEvent(input$txtSearch, {
      if (input$txtSearch != "") {
        sql <- paste("SELECT DISTINCT brand FROM amazon_prod_reviews dwd ",
                     "WHERE UPPER(title) LIKE '%", toupper(input$txtSearch), "%' ",
                     "ORDER BY brand",
                     sep="")
        #cat(sql,"\n")
        res <- dbGetQuery(con, sql)
        
        # Update the selection for Brand. Include the 'All' option to select all brands.
        updateSelectInput(session, "selectBrand", choices=c('All',res), selected='All')
      }
    })
    # Update data table with selected brand.
    df <- eventReactive(input$btnSearch, {
      if (input$txtSearch != "") {
        sql <- paste("SELECT AVG(overall) overall_avg, COUNT(overall) review_count, ",
                     "title, brand FROM amazon_prod_reviews ",
                     "WHERE UPPER(title) LIKE '%", toupper(input$txtSearch), "%' ",
                     sep="")
        
        # If 'All' is not selected, construct the SQL to search for only the specific brand.
        if (input$selectBrand != 'All') {
          sql <- paste(sql,
                       "AND brand = ", dbQuoteString(con, input$selectBrand), " ",
                       sep="")
        }
        sql <- paste(sql,
                     "GROUP BY Title, Brand ORDER BY review_count DESC, overall_avg DESC ",
                     #                         "LIMIT 10 ",
                     sep="")
        
        #cat(sql,"\n")
        res <- dbGetQuery(con, sql)
      }
    })
    # dataset for good review
    
    datasource <- eventReactive(input$productselect,{
      wordCloudDf <- data.frame()
      sql <- paste("SELECT overall, summary FROM amazon_prod_reviews ",
                   "WHERE title = '", input$productselect[5], "' ",
                   "AND overall >= 3",
                   sep="")
        wordCloudDf <- dbGetQuery(con, sql)
        good_t_w <- 
            wordCloudDf %>% 
            select(summary)
        tokens <- good_t_w %>% 
            unnest_tokens(word, summary) %>% 
            count(word, sort = TRUE) %>% 
            ungroup()
        tokens_clean <- tokens %>%
            anti_join(stop_words)
        nums <- tokens_clean %>% filter(str_detect(word, "^[0-9]")) %>% select(word) %>% unique()
        tokens_clean <- tokens_clean %>% 
            anti_join(nums, by = "word")
        tokens_clean <- tokens_clean %>% 
          anti_join(data.frame(word=c("star","stars")), by = "word")
        return(tokens_clean)
    })
    # dataset for bad review
    
    datasource_v2 <- eventReactive(input$productselect,{
      sql <- paste("SELECT overall, summary FROM amazon_prod_reviews ",
                   "WHERE title = '", input$productselect[5], "' ",
                   "AND overall < 3",
                   sep="")
        wordCloudDf <- dbGetQuery(con, sql)
        bad_t_w <- 
            wordCloudDf %>% 
            select(summary)
        tokens2 <- bad_t_w %>% 
            unnest_tokens(word, summary) %>% 
            count(word, sort = TRUE) %>% 
            ungroup()
        tokens_clean2 <- tokens2 %>%
            anti_join(stop_words)
        nums <- tokens_clean2 %>% filter(str_detect(word, "^[0-9]")) %>% select(word) %>% unique()
        tokens_clean2 <- tokens_clean2 %>% 
            anti_join(nums, by = "word")
        tokens_clean2 <- tokens_clean2 %>% 
          anti_join(data.frame(word=c("star","stars")), by = "word")
        return(tokens_clean2)
    })
    # wordcloud for good review
    
    output$good_review <- renderPlot({
        x <- datasource()$word
        y <- datasource()$n
        minfreq_bigram <- 2
        
        wordcloud(x,y,random.order=FALSE,scale = c(3,0.5),min.freq = minfreq_bigram,colors = brewer.pal(8,"Dark2"),
                  max.words=50)
        
    })
    
    # wordcloud for bad review
    
    output$bad_review <- renderPlot({
        x <- datasource_v2()$word
        y <- datasource_v2()$n
        minfreq_bigram <- 2
        
        wordcloud(x,y,random.order=FALSE,scale = c(3,0.5),min.freq = minfreq_bigram,colors = brewer.pal(8,"Dark2"),
                  max.words=50)
        
    })
    output$tblProducts <- renderDataTable(
      { df() },
      filter = list(position='top', clear=FALSE),
      callback = JS("
            var format = function(d) { 
                Shiny.setInputValue('productselect', [d.index(), d.data()]);
                return d.data();
            }
            
            table.on('click', 'td', function() {
                var tr = $(this).closest('tr');
                var row = table.row(tr);
                var rowData = row.data();
                
                var index = row.index();
                
                if (row.child.isShown()) { 
                    row.child.hide();
                }
                else {
                    row.child('<table id=\"child_details'+index+'\"></table>').show();
                    format(row);
                }
            });
        ")
    )
    
    prod <- observeEvent(input$productselect, {
      # print(input$productselect[1])
      # print(input$productselect[5])
      
      sql <- paste("SELECT overall, vote, summary, reviewtext FROM amazon_prod_reviews ",
                   "WHERE title=",dbQuoteString(con,input$productselect[5]), sep="")
      res <- dbGetQuery(con, sql)
      session$sendCustomMessage("updatechildrow", list(input$productselect[1], res))
    })
  },
  options = list(height = 1000)
)

CONCLUSION

In conclusion, we are able to define and classify positive and negative sentiments based on customer comments and ratings. We are able to visualize the words by using word cloud. We have found some reviews may not reflect the overall rating for the product. Our current method of sentiment analysis is able to simplify the ratings of 5 categories to a polarity of 2 categories. Hence, the classification accuracy had improved. However, the method needs to be refined by possibly creating custom dictionaries to improve the sentiment classification especially those reviews that use high frequency of negative words but convey positive sentiments overall.