Amazon is an American multinational technology company based in Seattle, Washington, focused on e-commerce, cloud computing, digital streaming, and artificial intelligence. It started as an online marketplace for books and later expanded to sell electronics, software, video games, clothing, furniture, food, toys, and jewelry. Today Amazon is the second-largest private employer in the United States and one of the most valuable companies in the world.
The Amazon Best Seller June 2021 Products dataset lists, for each product, its category, its unique product code (ASIN), the product link, the number of sellers, the product's rank, its rating, its review count, and its price.
The dataset was created to help companies find the best-selling products (Best Sellers) of June 2021. Here I therefore carry out EDA, data preparation, and data visualization to compare the product categories on Rating, Reviews Count, and Price, guided by three questions:
1. Which product category has the highest and the lowest Rating, judged by the average?
2. Which product category receives the most and the fewest reviews (Reviews Count), judged by the average?
3. Which product category has the highest and the lowest price (Price), judged by the average?
library(heatmaply)
library(visdat)
library(tidyverse)
library(skimr)
library(DataExplorer)
library(dplyr)
library(ggplot2)
library(tidyr)
library(readr)
library(tibble)
library(reshape2)
library(psych)
df <- read.csv("~/RStudio PSDS/Amazon_Best_Seller_2021_June.csv", stringsAsFactors=TRUE)
head(df)
## ASIN Category
## 1 B079QHML21 Electronics
## 2 B07FZ8S74R Electronics
## 3 B07XJ8C8F5 Electronics
## 4 B07WVFCVJN Electronics
## 5 B08YT2N5SX Electronics
## 6 B07TMJ1R3X Electronics
## Product.Link
## 1 https://www.amazon.com/gp/offer-listing/B079QHML21/ref=dp_olp_new_mbc?ie=UTF8&condition=all
## 2 https://www.amazon.com/gp/offer-listing/B07FZ8S74R/ref=dp_olp_new_mbc?ie=UTF8&condition=all
## 3 https://www.amazon.com/gp/offer-listing/B07XJ8C8F5/ref=dp_olp_new_mbc?ie=UTF8&condition=all
## 4 https://www.amazon.com/gp/offer-listing/B07WVFCVJN/ref=dp_olp_new_mbc?ie=UTF8&condition=all
## 5 https://www.amazon.com/gp/offer-listing/B08YT2N5SX/ref=dp_olp_new_mbc?ie=UTF8&condition=all
## 6 https://www.amazon.com/gp/offer-listing/B07TMJ1R3X/ref=dp_olp_new_mbc?ie=UTF8&condition=all
## No.of.Sellers Rank Rating Reviews.Count Price
## 1 1 Sellers #1 4.7 640,721 $39.99
## 2 1 Sellers #2 4.7 854,114 $34.99
## 3 1 Sellers #3 4.7 267,821 $44.99
## 4 27 Sellers #4 4.8 114,267 $28.48
## 5 1 Sellers #5 4.7 267,821 $49.99
## 6 1 Sellers #6 4.6 100,278 $89.99
dim(df)
## [1] 707 8
names(df)
## [1] "ASIN" "Category" "Product.Link" "No.of.Sellers"
## [5] "Rank" "Rating" "Reviews.Count" "Price"
str(df)
## 'data.frame': 707 obs. of 8 variables:
## $ ASIN : Factor w/ 653 levels "080241270X","1250080401",..: 241 284 418 405 621 379 415 519 225 510 ...
## $ Category : Factor w/ 7 levels "Books","Camera & Photo",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Product.Link : Factor w/ 653 levels "https://www.amazon.com/gp/offer-listing/0062060627/ref=dp_olp_new_mbc?ie=UTF8&condition=all",..: 241 284 418 405 621 379 415 519 225 510 ...
## $ No.of.Sellers: Factor w/ 61 levels "1 Sellers","10 Sellers",..: 1 1 1 26 1 1 1 1 52 1 ...
## $ Rank : Factor w/ 100 levels "#1","#10","#100",..: 1 13 24 35 46 57 68 79 90 2 ...
## $ Rating : num 4.7 4.7 4.7 4.8 4.7 4.6 4.5 4.5 4.7 4.7 ...
## $ Reviews.Count: Factor w/ 546 levels "1","1,022","1,030",..: 459 523 236 63 236 51 62 12 522 367 ...
## $ Price : Factor w/ 287 levels "$0.88 ","$10.00 ",..: 181 168 194 147 208 275 168 63 178 129 ...
# Convert 'No.of.Sellers' from factor to numeric by removing the text " Sellers"
df$No.of.Sellers <- as.numeric(gsub(" Sellers", "", df$No.of.Sellers))
# Convert 'Rank' from factor to numeric by removing the '#' symbol
df$Rank <- as.numeric(gsub("#", "", df$Rank))
# Convert 'Reviews.Count' from factor to numeric by removing the thousands separator ','
df$Reviews.Count <- as.numeric(gsub(",", "", df$Reviews.Count))
# Convert 'Price' from factor to numeric by removing everything except digits and the
# decimal point ('$', thousands separators, trailing spaces); keeping the decimal point
# avoids having to divide by 100, which would be wrong for prices without two decimals
df$Price <- as.numeric(gsub("[^0-9.]", "", df$Price))
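The conversions above assume that every raw value follows the expected pattern (for example "27 Sellers" or "$28.48 "). A minimal sketch of a sanity check, using a small made-up data frame rather than the real file (`toy` and the helper `to_number` are hypothetical names, not part of the original analysis), strips the non-numeric characters and then confirms that no value turned into `NA` during the conversion:

```r
# Hypothetical helper: keep only digits and the decimal point, then convert.
to_number <- function(x) {
  as.numeric(gsub("[^0-9.]", "", as.character(x)))
}

# Made-up rows that mimic the raw column formats shown by head(df).
toy <- data.frame(
  No.of.Sellers = c("1 Sellers", "27 Sellers"),
  Rank          = c("#1", "#4"),
  Reviews.Count = c("640,721", "114,267"),
  Price         = c("$39.99 ", "$28.48 "),
  stringsAsFactors = TRUE
)

# Apply the helper to every column and rebuild a data frame.
cleaned <- as.data.frame(lapply(toy, to_number))

# Fail early if any value could not be parsed as a number.
stopifnot(!anyNA(cleaned))
str(cleaned)
```

Running the same `anyNA()` check on the real `df` right after the conversion would flag unexpected formats, such as a price range, before they reach the summaries.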
head(df)
## ASIN Category
## 1 B079QHML21 Electronics
## 2 B07FZ8S74R Electronics
## 3 B07XJ8C8F5 Electronics
## 4 B07WVFCVJN Electronics
## 5 B08YT2N5SX Electronics
## 6 B07TMJ1R3X Electronics
## Product.Link
## 1 https://www.amazon.com/gp/offer-listing/B079QHML21/ref=dp_olp_new_mbc?ie=UTF8&condition=all
## 2 https://www.amazon.com/gp/offer-listing/B07FZ8S74R/ref=dp_olp_new_mbc?ie=UTF8&condition=all
## 3 https://www.amazon.com/gp/offer-listing/B07XJ8C8F5/ref=dp_olp_new_mbc?ie=UTF8&condition=all
## 4 https://www.amazon.com/gp/offer-listing/B07WVFCVJN/ref=dp_olp_new_mbc?ie=UTF8&condition=all
## 5 https://www.amazon.com/gp/offer-listing/B08YT2N5SX/ref=dp_olp_new_mbc?ie=UTF8&condition=all
## 6 https://www.amazon.com/gp/offer-listing/B07TMJ1R3X/ref=dp_olp_new_mbc?ie=UTF8&condition=all
## No.of.Sellers Rank Rating Reviews.Count Price
## 1 1 1 4.7 640721 39.99
## 2 1 2 4.7 854114 34.99
## 3 1 3 4.7 267821 44.99
## 4 27 4 4.8 114267 28.48
## 5 1 5 4.7 267821 49.99
## 6 1 6 4.6 100278 89.99
str(df)
## 'data.frame': 707 obs. of 8 variables:
## $ ASIN : Factor w/ 653 levels "080241270X","1250080401",..: 241 284 418 405 621 379 415 519 225 510 ...
## $ Category : Factor w/ 7 levels "Books","Camera & Photo",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Product.Link : Factor w/ 653 levels "https://www.amazon.com/gp/offer-listing/0062060627/ref=dp_olp_new_mbc?ie=UTF8&condition=all",..: 241 284 418 405 621 379 415 519 225 510 ...
## $ No.of.Sellers: num 1 1 1 27 1 1 1 1 67 1 ...
## $ Rank : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Rating : num 4.7 4.7 4.7 4.8 4.7 4.6 4.5 4.5 4.7 4.7 ...
## $ Reviews.Count: num 640721 854114 267821 114267 267821 ...
## $ Price : num 40 35 45 28.5 50 ...
sapply(df, function(x) sum(is.na(x)))
## ASIN Category Product.Link No.of.Sellers Rank
## 0 0 0 0 0
## Rating Reviews.Count Price
## 0 0 0
vis_miss(df)
num_cols <- unlist(lapply(df, is.numeric))
df_num <- df[ , num_cols]
boxplot(df_num)
plot_correlation(df_num)
summary(df)
## ASIN Category
## B081RJ8DW1: 3 Books : 70
## B095NWYQBC: 3 Camera & Photo :100
## B001E1Y5O6: 2 Clothing, Shoes & Jewelry:100
## B01171X0UW: 2 Electronics :147
## B0148NNKTC: 2 Gift Cards :100
## B071FXZBMV: 2 Toys & Games : 95
## (Other) :693 Video Games : 95
## Product.Link
## https://www.amazon.com/gp/offer-listing/B081RJ8DW1/ref=dp_olp_new_mbc?ie=UTF8&condition=all: 3
## https://www.amazon.com/gp/offer-listing/B095NWYQBC/ref=dp_olp_new_mbc?ie=UTF8&condition=all: 3
## https://www.amazon.com/gp/offer-listing/B001E1Y5O6/ref=dp_olp_new_mbc?ie=UTF8&condition=all: 2
## https://www.amazon.com/gp/offer-listing/B01171X0UW/ref=dp_olp_new_mbc?ie=UTF8&condition=all: 2
## https://www.amazon.com/gp/offer-listing/B0148NNKTC/ref=dp_olp_new_mbc?ie=UTF8&condition=all: 2
## https://www.amazon.com/gp/offer-listing/B071FXZBMV/ref=dp_olp_new_mbc?ie=UTF8&condition=all: 2
## (Other) :693
## No.of.Sellers Rank Rating Reviews.Count
## Min. : 1.000 Min. : 1.00 Min. :1.400 Min. : 1
## 1st Qu.: 1.000 1st Qu.: 27.00 1st Qu.:4.500 1st Qu.: 5138
## Median : 1.000 Median : 52.00 Median :4.700 Median : 18023
## Mean : 8.334 Mean : 51.36 Mean :4.593 Mean : 77006
## 3rd Qu.: 5.000 3rd Qu.: 75.50 3rd Qu.:4.800 3rd Qu.: 49594
## Max. :214.000 Max. :100.00 Max. :5.000 Max. :854114
##
## Price
## Min. : 0.88
## 1st Qu.: 13.99
## Median : 25.99
## Mean : 55.69
## 3rd Qu.: 50.00
## Max. :899.00
##
d <- melt(df_num)
## No id variables; using all as measure variables
ggplot(d,aes(x = value)) +
facet_wrap(~variable,scales = "free_x") + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
pairs.panels(df_num,
method = "pearson", # correlation method
hist.col = "#00AFBB",
density = TRUE, # show density plots
ellipses = TRUE # show correlation ellipses
)
Build the data frame for the visualization that answers the first question: the mean Rating per category.
x <- group_by(df, Category)
x <- summarize(x, Rating = mean(Rating, na.rm = TRUE ))
x
## # A tibble: 7 x 2
## Category Rating
## <fct> <dbl>
## 1 Books 4.67
## 2 Camera & Photo 4.32
## 3 Clothing, Shoes & Jewelry 4.47
## 4 Electronics 4.62
## 5 Gift Cards 4.81
## 6 Toys & Games 4.65
## 7 Video Games 4.63
Build the data frame for the visualization that answers the second question: the mean Reviews Count per category.
y <- group_by(df, Category)
y <- summarize(y, Reviews.Count = mean(Reviews.Count, na.rm = TRUE ))
y
## # A tibble: 7 x 2
## Category Reviews.Count
## <fct> <dbl>
## 1 Books 25899.
## 2 Camera & Photo 7218.
## 3 Clothing, Shoes & Jewelry 28686.
## 4 Electronics 76938.
## 5 Gift Cards 334868.
## 6 Toys & Games 17687.
## 7 Video Games 26979.
Build the data frame for the visualization that answers the third question: the mean Price per category.
z <- group_by(df, Category)
z <- summarize(z, Price = mean(Price, na.rm = TRUE ))
z
## # A tibble: 7 x 2
## Category Price
## <fct> <dbl>
## 1 Books 11.5
## 2 Camera & Photo 50.6
## 3 Clothing, Shoes & Jewelry 17.6
## 4 Electronics 135.
## 5 Gift Cards 44.0
## 6 Toys & Games 17.3
## 7 Video Games 61.0
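The three summary tables share the same grouping, so they could also be computed in a single pass. The sketch below assumes dplyr 1.0.0 or later (for `across()` and the `.groups` argument) and uses a small made-up data frame (`toy`) in place of `df`; the name `category_means` is hypothetical:

```r
library(dplyr)

# Made-up data standing in for the cleaned df (values are illustrative only).
toy <- data.frame(
  Category      = c("Books", "Books", "Electronics", "Electronics"),
  Rating        = c(4.6, 4.8, 4.5, 4.7),
  Reviews.Count = c(1000, 3000, 50000, 70000),
  Price         = c(10, 14, 100, 200)
)

# One grouped summary returning the mean of all three measures per category.
category_means <- toy %>%
  group_by(Category) %>%
  summarise(
    across(c(Rating, Reviews.Count, Price), ~ mean(.x, na.rm = TRUE)),
    .groups = "drop"
  )

category_means
```

A single table keeps the three measures aligned by category, at the cost of having to select one column per chart later on.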
1. Which product category has the highest and the lowest Rating, judged by the average?
A <- plot_ly(
  x = x$Category,
  y = x$Rating,
  type = "bar"
)
A <- A %>% layout(title = "Average Rating per Category",
                  xaxis = list(title = "Category"),
                  yaxis = list(title = "Rating"))
A
Judged by the averages, the Gift Cards category has the highest mean Rating (4.81), while Camera & Photo has the lowest (4.32).
2. Which product category receives the most and the fewest reviews (Reviews Count), judged by the average?
B <- plot_ly(
  x = y$Category,
  y = y$Reviews.Count,
  type = "bar"
)
B <- B %>% layout(title = "Average Reviews Count per Category",
                  xaxis = list(title = "Category"),
                  yaxis = list(title = "Reviews Count"))
B
Judged by the averages, Gift Cards receive the most reviews (about 334,868 per product), while Camera & Photo receives the fewest (about 7,218 per product).
3. Which product category has the highest and the lowest price (Price), judged by the average?
C <- plot_ly(
  x = z$Category,
  y = z$Price,
  type = "bar"
)
C <- C %>% layout(title = "Average Price per Category",
                  xaxis = list(title = "Category"),
                  yaxis = list(title = "Price"))
C
Judged by the averages, Electronics is the most expensive category (about $135 per product), while Books is the cheapest (about $11.50 per product).
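The three conclusions can also be read off programmatically instead of from the charts. A minimal sketch, re-entering the rounded per-category means printed in the summary tables earlier (the data frame `means` and the helper `extremes` are hypothetical names, not part of the original analysis):

```r
# Rounded per-category means copied from the printed summary tables.
means <- data.frame(
  Category = c("Books", "Camera & Photo", "Clothing, Shoes & Jewelry",
               "Electronics", "Gift Cards", "Toys & Games", "Video Games"),
  Rating        = c(4.67, 4.32, 4.47, 4.62, 4.81, 4.65, 4.63),
  Reviews.Count = c(25899, 7218, 28686, 76938, 334868, 17687, 26979),
  Price         = c(11.5, 50.6, 17.6, 135, 44.0, 17.3, 61.0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: categories with the highest and lowest mean of one measure.
extremes <- function(measure) {
  c(highest = means$Category[which.max(means[[measure]])],
    lowest  = means$Category[which.min(means[[measure]])])
}

# Character matrix with one column per measure and rows "highest" / "lowest".
sapply(c("Rating", "Reviews.Count", "Price"), extremes)
```

With these inputs the result agrees with the text: Gift Cards and Camera & Photo for both Rating and Reviews Count, and Electronics and Books for Price.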