Amazon Best-Selling Products by Rating and Review Count

Dataset Background

Amazon is an American multinational technology company based in Seattle, Washington, focused on e-commerce, cloud computing, digital streaming, and artificial intelligence. It began as an online marketplace for books and later expanded to sell electronics, software, video games, apparel, furniture, food, toys, and jewelry. Amazon is now the second-largest private employer in the United States and one of the most valuable companies in the world.

The Amazon Best Seller June 2021 Products dataset lists, for each product, its category, unique product code (ASIN), product link, number of sellers on the listing, rank, rating, review count, and price.

The dataset was created to help companies identify the best-selling products (Best Sellers) of June 2021. In this report I carry out exploratory data analysis (EDA), data preparation, and data visualization to compare the best sellers by rating and review count.

Analysis Questions

Which product category has the highest and the lowest average rating?
Which product category has the most and the fewest reviews (Reviews Count) on average?
Which product category has the highest and the lowest average price (Price)?

Importing Libraries

library(heatmaply)

## Warning: package 'heatmaply' was built under R version 4.1.2

## Loading required package: plotly

## Warning: package 'plotly' was built under R version 4.1.2

## Loading required package: ggplot2

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

## Loading required package: viridis

## Loading required package: viridisLite

library(visdat)

## Warning: package 'visdat' was built under R version 4.1.2

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v tibble  3.1.2     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## v purrr   0.3.4

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks plotly::filter(), stats::filter()
## x dplyr::lag()    masks stats::lag()

library(skimr)

## Warning: package 'skimr' was built under R version 4.1.2

library(DataExplorer)

## Warning: package 'DataExplorer' was built under R version 4.1.2

library(dplyr)
library(ggplot2)
library(tidyr)
library(readr)
library(tibble)
library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

Loading the Dataset

df <- read.csv("~/RStudio PSDS/Amazon_Best_Seller_2021_June.csv", stringsAsFactors=TRUE)
head(df)

##         ASIN    Category
## 1 B079QHML21 Electronics
## 2 B07FZ8S74R Electronics
## 3 B07XJ8C8F5 Electronics
## 4 B07WVFCVJN Electronics
## 5 B08YT2N5SX Electronics
## 6 B07TMJ1R3X Electronics
##                                                                                      Product.Link
## 1 https://www.amazon.com/gp/offer-listing/B079QHML21/ref=dp_olp_new_mbc?ie=UTF8&amp;condition=all
## 2 https://www.amazon.com/gp/offer-listing/B07FZ8S74R/ref=dp_olp_new_mbc?ie=UTF8&amp;condition=all
## 3 https://www.amazon.com/gp/offer-listing/B07XJ8C8F5/ref=dp_olp_new_mbc?ie=UTF8&amp;condition=all
## 4 https://www.amazon.com/gp/offer-listing/B07WVFCVJN/ref=dp_olp_new_mbc?ie=UTF8&amp;condition=all
## 5 https://www.amazon.com/gp/offer-listing/B08YT2N5SX/ref=dp_olp_new_mbc?ie=UTF8&amp;condition=all
## 6 https://www.amazon.com/gp/offer-listing/B07TMJ1R3X/ref=dp_olp_new_mbc?ie=UTF8&amp;condition=all
##   No.of.Sellers Rank Rating Reviews.Count   Price
## 1     1 Sellers   #1    4.7       640,721 $39.99 
## 2     1 Sellers   #2    4.7       854,114 $34.99 
## 3     1 Sellers   #3    4.7       267,821 $44.99 
## 4    27 Sellers   #4    4.8       114,267 $28.48 
## 5     1 Sellers   #5    4.7       267,821 $49.99 
## 6     1 Sellers   #6    4.6       100,278 $89.99

Understanding the Data

ASIN = Unique product code
Category = Product category
Product Link = Link to the product
No of Sellers = Number of sellers on each product listing
Rank = Product rank
Rating = Product rating
Reviews Count = Total number of reviews for a product
Price = Product price

EDA (Exploratory Data Analysis)

Data Dimensions

dim(df)

## [1] 707   8

Variables in the Dataset

names(df)

## [1] "ASIN"          "Category"      "Product.Link"  "No.of.Sellers"
## [5] "Rank"          "Rating"        "Reviews.Count" "Price"

str(df)

## 'data.frame':    707 obs. of  8 variables:
##  $ ASIN         : Factor w/ 653 levels "080241270X","1250080401",..: 241 284 418 405 621 379 415 519 225 510 ...
##  $ Category     : Factor w/ 7 levels "Books","Camera & Photo",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Product.Link : Factor w/ 653 levels "https://www.amazon.com/gp/offer-listing/0062060627/ref=dp_olp_new_mbc?ie=UTF8&amp;condition=all",..: 241 284 418 405 621 379 415 519 225 510 ...
##  $ No.of.Sellers: Factor w/ 61 levels "1 Sellers","10 Sellers",..: 1 1 1 26 1 1 1 1 52 1 ...
##  $ Rank         : Factor w/ 100 levels "#1","#10","#100",..: 1 13 24 35 46 57 68 79 90 2 ...
##  $ Rating       : num  4.7 4.7 4.7 4.8 4.7 4.6 4.5 4.5 4.7 4.7 ...
##  $ Reviews.Count: Factor w/ 546 levels "1","1,022","1,030",..: 459 523 236 63 236 51 62 12 522 367 ...
##  $ Price        : Factor w/ 287 levels "$0.88 ","$10.00 ",..: 181 168 194 147 208 275 168 63 178 129 ...

Converting column types so that the numeric distribution plots and the pair plot run without errors

# Convert 'No.of.Sellers' from factor to numeric by removing the text ' Sellers'
df$No.of.Sellers <- as.numeric(gsub(" Sellers","",df$No.of.Sellers))
# Convert 'Rank' from factor to numeric by removing the '#' symbol
df$Rank <- as.numeric(gsub("#","",df$Rank))
# Convert 'Reviews.Count' from factor to numeric by removing the thousands separator ','
df$Reviews.Count <- as.numeric(gsub(",", "", df$Reviews.Count))
# Convert 'Price' from factor to numeric: strip every non-alphanumeric character
# ('$', the decimal point, spaces), then divide by 100 to restore the cents.
# This assumes every price is written with exactly two decimal places.
df$Price <- (as.numeric(gsub("[^[:alnum:]]","",df$Price)))/100
head(df)

##         ASIN    Category
## 1 B079QHML21 Electronics
## 2 B07FZ8S74R Electronics
## 3 B07XJ8C8F5 Electronics
## 4 B07WVFCVJN Electronics
## 5 B08YT2N5SX Electronics
## 6 B07TMJ1R3X Electronics
##                                                                                      Product.Link
## 1 https://www.amazon.com/gp/offer-listing/B079QHML21/ref=dp_olp_new_mbc?ie=UTF8&amp;condition=all
## 2 https://www.amazon.com/gp/offer-listing/B07FZ8S74R/ref=dp_olp_new_mbc?ie=UTF8&amp;condition=all
## 3 https://www.amazon.com/gp/offer-listing/B07XJ8C8F5/ref=dp_olp_new_mbc?ie=UTF8&amp;condition=all
## 4 https://www.amazon.com/gp/offer-listing/B07WVFCVJN/ref=dp_olp_new_mbc?ie=UTF8&amp;condition=all
## 5 https://www.amazon.com/gp/offer-listing/B08YT2N5SX/ref=dp_olp_new_mbc?ie=UTF8&amp;condition=all
## 6 https://www.amazon.com/gp/offer-listing/B07TMJ1R3X/ref=dp_olp_new_mbc?ie=UTF8&amp;condition=all
##   No.of.Sellers Rank Rating Reviews.Count Price
## 1             1    1    4.7        640721 39.99
## 2             1    2    4.7        854114 34.99
## 3             1    3    4.7        267821 44.99
## 4            27    4    4.8        114267 28.48
## 5             1    5    4.7        267821 49.99
## 6             1    6    4.6        100278 89.99
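
The conversions above all follow the same pattern: remove the non-numeric text, then call as.numeric(). A minimal sketch of that pattern as one reusable function is shown below. The helper name parse_number_text is hypothetical and not part of the original analysis, and unlike the Price step above it keeps the decimal point rather than dividing by 100, so it does not rely on every price having two decimal places. The sample values are copied from the head(df) output.

```r
# Hypothetical helper (not in the original analysis): drop everything except
# digits and the decimal point, then convert the remaining text to numeric.
parse_number_text <- function(x) {
  # as.character() makes the factor-to-text step explicit; calling
  # as.numeric() directly on a factor would return its integer level codes.
  as.numeric(gsub("[^0-9.]", "", as.character(x)))
}

# Sample values copied from the head(df) output, stored as factors to mimic
# read.csv(..., stringsAsFactors = TRUE).
sellers <- factor(c("1 Sellers", "27 Sellers"))
reviews <- factor(c("640,721", "114,267"))
price   <- factor(c("$39.99 ", "$28.48 "))

parse_number_text(sellers)
parse_number_text(reviews)
parse_number_text(price)
```

A value that contains two numbers, such as a price range, would still need separate handling; this sketch only covers the single-number formats visible in the sample rows.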

Checking the structure of the updated data

str(df)

## 'data.frame':    707 obs. of  8 variables:
##  $ ASIN         : Factor w/ 653 levels "080241270X","1250080401",..: 241 284 418 405 621 379 415 519 225 510 ...
##  $ Category     : Factor w/ 7 levels "Books","Camera & Photo",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Product.Link : Factor w/ 653 levels "https://www.amazon.com/gp/offer-listing/0062060627/ref=dp_olp_new_mbc?ie=UTF8&amp;condition=all",..: 241 284 418 405 621 379 415 519 225 510 ...
##  $ No.of.Sellers: num  1 1 1 27 1 1 1 1 67 1 ...
##  $ Rank         : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Rating       : num  4.7 4.7 4.7 4.8 4.7 4.6 4.5 4.5 4.7 4.7 ...
##  $ Reviews.Count: num  640721 854114 267821 114267 267821 ...
##  $ Price        : num  40 35 45 28.5 50 ...

Checking for missing values in the updated data

sapply(df, function(x) sum(is.na(x)))

##          ASIN      Category  Product.Link No.of.Sellers          Rank 
##             0             0             0             0             0 
##        Rating Reviews.Count         Price 
##             0             0             0

vis_miss(df)
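
Missing values are not the only data-quality issue worth checking. The str(df) output reports 653 distinct ASIN levels across 707 rows, so some products appear more than once. A minimal sketch of how repeated ASINs could be counted and inspected follows; the toy data frame and its values are made up for illustration and only stand in for df.

```r
# Toy data frame standing in for df; the values are illustrative only.
toy <- data.frame(
  ASIN     = c("A1", "A2", "A1", "A3"),
  Category = c("Electronics", "Electronics", "Video Games", "Books")
)

# Number of rows whose ASIN already appeared earlier in the data.
n_repeated <- sum(duplicated(toy$ASIN))

# Every row that shares its ASIN with at least one other row.
repeated_rows <- toy[toy$ASIN %in% toy$ASIN[duplicated(toy$ASIN)], ]

n_repeated
repeated_rows
```

Whether repeated ASINs should be dropped depends on the question: a product listed in two categories may legitimately count toward both category averages.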

Checking for Outliers

num_cols <- unlist(lapply(df, is.numeric))
df_num <- df[ , num_cols]  
boxplot(df_num)
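
Because the numeric columns have very different scales, a single boxplot can make the small-scale columns hard to read. As a complement, the outliers can be counted per column. The sketch below uses boxplot.stats(), which applies the same 1.5 × IQR whisker rule that boxplot() uses by default; the helper name count_outliers and the toy values are assumptions for illustration.

```r
# Hypothetical helper: count the values that fall outside the 1.5 * IQR
# whiskers, i.e. the points boxplot() would draw individually.
count_outliers <- function(v) {
  length(boxplot.stats(v)$out)
}

# Toy numeric data standing in for df_num; the values are illustrative only.
toy_num <- data.frame(
  Rating = c(4.5, 4.6, 4.7, 4.7, 4.8, 1.4),
  Price  = c(10, 12, 14, 15, 16, 899)
)

# One outlier count per column.
outlier_counts <- sapply(toy_num, count_outliers)
outlier_counts
```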

Examining Correlations

plot_correlation(df_num)
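
Columns such as Reviews.Count and Price are strongly right-skewed, and a linear (Pearson) correlation can understate a relationship that is monotonic but not linear. A rank-based (Spearman) correlation is one way to cross-check. The sketch below uses base R cor() on made-up values; it is not a claim about the correlations in the actual dataset.

```r
# Toy data: review counts grow by a factor of ten at each step while the
# rating rises steadily (illustrative values only).
toy <- data.frame(
  Reviews.Count = c(10, 100, 1000, 10000, 100000),
  Rating        = c(4.1, 4.3, 4.4, 4.6, 4.7)
)

# Pearson measures linear association on the raw values.
pearson <- cor(toy$Reviews.Count, toy$Rating, method = "pearson")

# Spearman measures monotonic association on the ranks.
spearman <- cor(toy$Reviews.Count, toy$Rating, method = "spearman")

c(pearson = pearson, spearman = spearman)
```

Here the relationship is perfectly monotonic, so the Spearman coefficient reaches 1 while the Pearson coefficient stays lower.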

Summary Statistics

summary(df)

##          ASIN                          Category  
##  B081RJ8DW1:  3   Books                    : 70  
##  B095NWYQBC:  3   Camera & Photo           :100  
##  B001E1Y5O6:  2   Clothing, Shoes & Jewelry:100  
##  B01171X0UW:  2   Electronics              :147  
##  B0148NNKTC:  2   Gift Cards               :100  
##  B071FXZBMV:  2   Toys & Games             : 95  
##  (Other)   :693   Video Games              : 95  
##                                                                                           Product.Link
##  https://www.amazon.com/gp/offer-listing/B081RJ8DW1/ref=dp_olp_new_mbc?ie=UTF8&amp;condition=all:  3  
##  https://www.amazon.com/gp/offer-listing/B095NWYQBC/ref=dp_olp_new_mbc?ie=UTF8&amp;condition=all:  3  
##  https://www.amazon.com/gp/offer-listing/B001E1Y5O6/ref=dp_olp_new_mbc?ie=UTF8&amp;condition=all:  2  
##  https://www.amazon.com/gp/offer-listing/B01171X0UW/ref=dp_olp_new_mbc?ie=UTF8&amp;condition=all:  2  
##  https://www.amazon.com/gp/offer-listing/B0148NNKTC/ref=dp_olp_new_mbc?ie=UTF8&amp;condition=all:  2  
##  https://www.amazon.com/gp/offer-listing/B071FXZBMV/ref=dp_olp_new_mbc?ie=UTF8&amp;condition=all:  2  
##  (Other)                                                                                        :693  
##  No.of.Sellers          Rank            Rating      Reviews.Count   
##  Min.   :  1.000   Min.   :  1.00   Min.   :1.400   Min.   :     1  
##  1st Qu.:  1.000   1st Qu.: 27.00   1st Qu.:4.500   1st Qu.:  5138  
##  Median :  1.000   Median : 52.00   Median :4.700   Median : 18023  
##  Mean   :  8.334   Mean   : 51.36   Mean   :4.593   Mean   : 77006  
##  3rd Qu.:  5.000   3rd Qu.: 75.50   3rd Qu.:4.800   3rd Qu.: 49594  
##  Max.   :214.000   Max.   :100.00   Max.   :5.000   Max.   :854114  
##                                                                     
##      Price       
##  Min.   :  0.88  
##  1st Qu.: 13.99  
##  Median : 25.99  
##  Mean   : 55.69  
##  3rd Qu.: 50.00  
##  Max.   :899.00  
##

Distribution Plots of the Numeric Variables

d <- melt(df_num)

## No id variables; using all as measure variables

ggplot(d,aes(x = value)) + 
    facet_wrap(~variable,scales = "free_x") + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
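
The message above suggests choosing the bins explicitly. For heavily skewed columns such as Reviews.Count, a log scale usually matters even more than the bin count. The sketch below illustrates the effect with base R hist() and plot = FALSE so that only the bin counts are computed; the same idea carries over to geom_histogram() through its bins or binwidth arguments together with scale_x_log10(). The sample values are illustrative only.

```r
# Toy review counts spanning several orders of magnitude (illustrative only).
reviews <- c(1, 12, 150, 900, 5138, 18023, 49594, 114267, 854114)

# Raw scale with 30 equal-width bins: most values pile up in the first bin.
raw_bins <- hist(reviews,
                 breaks = seq(0, max(reviews), length.out = 31),
                 plot = FALSE)

# log10 scale with one bin per order of magnitude: the values spread out.
log_bins <- hist(log10(reviews), breaks = 0:6, plot = FALSE)

raw_bins$counts[1]  # how many values land in the first raw-scale bin
log_bins$counts     # counts per order of magnitude
```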

Pairplot

pairs.panels(df_num, 
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )

Data Preparation

Building the data frame for the visualization that answers the first question

x <- group_by(df, Category)
x <- summarize(x, Rating = mean(Rating, na.rm = TRUE ))
x

## # A tibble: 7 x 2
##   Category                  Rating
##   <fct>                      <dbl>
## 1 Books                       4.67
## 2 Camera & Photo              4.32
## 3 Clothing, Shoes & Jewelry   4.47
## 4 Electronics                 4.62
## 5 Gift Cards                  4.81
## 6 Toys & Games                4.65
## 7 Video Games                 4.63

Building the data frame for the visualization that answers the second question

y <- group_by(df, Category)
y <- summarize(y, Reviews.Count = mean(Reviews.Count, na.rm = TRUE ))
y

## # A tibble: 7 x 2
##   Category                  Reviews.Count
##   <fct>                             <dbl>
## 1 Books                            25899.
## 2 Camera & Photo                    7218.
## 3 Clothing, Shoes & Jewelry        28686.
## 4 Electronics                      76938.
## 5 Gift Cards                      334868.
## 6 Toys & Games                     17687.
## 7 Video Games                      26979.

Building the data frame for the visualization that answers the third question

z <- group_by(df, Category)
z <- summarize(z, Price = mean(Price, na.rm = TRUE ))
z

## # A tibble: 7 x 2
##   Category                  Price
##   <fct>                     <dbl>
## 1 Books                      11.5
## 2 Camera & Photo             50.6
## 3 Clothing, Shoes & Jewelry  17.6
## 4 Electronics               135. 
## 5 Gift Cards                 44.0
## 6 Toys & Games               17.3
## 7 Video Games                61.0
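
The three summaries above repeat the same group-and-average step. A minimal sketch of computing them in one pass with dplyr's across() is shown below, with the median added next to the mean because the review counts and prices are skewed. The toy data and the name category_summary are assumptions for illustration; this is an alternative formulation, not the code used to produce the tables in this report.

```r
library(dplyr)

# Toy data standing in for df; the values are illustrative only.
toy <- data.frame(
  Category      = c("Books", "Books", "Books",
                    "Electronics", "Electronics", "Electronics"),
  Rating        = c(4.6, 4.7, 4.8, 4.5, 4.6, 4.7),
  Reviews.Count = c(100, 200, 3000, 1000, 2000, 60000),
  Price         = c(10, 12, 14, 30, 40, 500)
)

# One row per category, with mean and median columns for each metric,
# named like Rating_mean, Rating_median, and so on.
category_summary <- toy %>%
  group_by(Category) %>%
  summarise(
    across(c(Rating, Reviews.Count, Price),
           list(mean = mean, median = median),
           .names = "{.col}_{.fn}"),
    .groups = "drop"
  )

category_summary
```

When the mean and the median of a category differ widely, a few extreme products are driving the average, which is worth keeping in mind when reading the bar charts.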

Data Visualization

1. Which product category has the highest and the lowest average rating?

A <- plot_ly(
     x = x$Category,     
     y = x$Rating,
     type = "bar"
)
A <- A %>% layout(title = "Average Rating per Category",
                         xaxis = list(title = "Category"),
                         yaxis = list (title = "Rating"))
A

Judged by the category averages, Gift Cards has the highest rating (4.81), while Camera & Photo has the lowest (4.32).

2. Which product category has the most and the fewest reviews (Reviews Count) on average?

B <- plot_ly(
     x = y$Category,     
     y = y$Reviews.Count,
     type = "bar"
)
B <- B %>% layout(title = "Average Review Count per Category",
                         xaxis = list(title = "Category"),
                         yaxis = list (title = "Reviews Count"))
B

Judged by the category averages, Gift Cards receives the most reviews (about 334,868 per product), while Camera & Photo receives the fewest (about 7,218).

3. Which product category has the highest and the lowest average price (Price)?

C <- plot_ly(
  x = z$Category,     
  y = z$Price,
  type = "bar"
)
C <- C %>% layout(title = "Average Price per Category",
                          xaxis = list(title = "Category"),
                          yaxis = list (title = "Price"))
C

Judged by the category averages, Electronics is the most expensive category (about $135), while Books is the cheapest (about $11.50).
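
The three answers can also be read off programmatically rather than from the charts. The sketch below rebuilds the category means from the summary tables in this report (rounded as printed) and looks up the highest and lowest category for each metric with which.max() and which.min(). The names means and extremes are hypothetical helpers introduced for illustration.

```r
# Category means copied from the summary tables in this report (as printed).
means <- data.frame(
  Category = c("Books", "Camera & Photo", "Clothing, Shoes & Jewelry",
               "Electronics", "Gift Cards", "Toys & Games", "Video Games"),
  Rating        = c(4.67, 4.32, 4.47, 4.62, 4.81, 4.65, 4.63),
  Reviews.Count = c(25899, 7218, 28686, 76938, 334868, 17687, 26979),
  Price         = c(11.5, 50.6, 17.6, 135, 44.0, 17.3, 61.0)
)

# Hypothetical helper: name the categories with the highest and lowest
# value of the given metric column.
extremes <- function(metric) {
  c(highest = means$Category[which.max(means[[metric]])],
    lowest  = means$Category[which.min(means[[metric]])])
}

# A 2 x 3 character matrix: rows "highest"/"lowest", one column per metric.
result <- sapply(c("Rating", "Reviews.Count", "Price"), extremes)
result
```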

Data Analysis Using R

Yulia Candra Dewi

1/12/2022