Greetings! Welcome to my Learning By Building (LBB). This LBB talks about Herbal Products. First, we need to store/call/assign the data to Herb_1 and take a peek with head().
Herb_1 <- read.csv("Iherb categories.csv")
head(Herb_1)Then, we need to inspect the data before explore further more.
dim(Herb_1)## [1] 2148 15
names(Herb_1)## [1] "product_id" "product_name" "product_company" "product_flavour"
## [5] "category" "product_size" "product_price" "product_rate"
## [9] "one_star" "two_star" "three_star" "four_star"
## [13] "five_star" "reviews_number" "product_quality"
Well, after inspect the data, we had 2.148 products and 15 columns that represented by product names, the company, kind of product’s flavour, category, size, price, product’s overall rate, product’s rate in one star, two star, three star, four star, and five star, how many reviews of the product, and the quality (poor, good, or excellent). After that, we need to know is there any missing value or there isn’t.
colSums(is.na(Herb_1))## product_id product_name product_company product_flavour category
## 0 0 0 0 0
## product_size product_price product_rate one_star two_star
## 0 0 0 0 0
## three_star four_star five_star reviews_number product_quality
## 0 0 0 0 0
Great! There are no missing value in this data. After that, we need to check the data type of each column.
str(Herb_1)## 'data.frame': 2148 obs. of 15 variables:
## $ product_id : int 88411 96659 96973 41988 69658 41989 5500 30664 5494 6871 ...
## $ product_name : chr "Matcha Road, Matcha + Collagen, Grass-Fed Collagen Peptides, 7.9 oz (224 g)" "California Gold Nutrition, Superfoods, Matcha Green Tea Powder, 4 oz (114 g)" "Matcha Road, Matcha + Vitamin C, Trial Pack (Citrus Ginger, Dragon Fruit, Original), 3 Packets, 0.18 oz (5 g) Each" "Yogi Tea, Organic, Honey Lavender Stress Relief, Caffeine Free, 16 Tea Bags, 1.02 oz (29 g)" ...
## $ product_company: chr "Matcha Road" "California Gold Nutrition" "Matcha Road" "Yogi Tea" ...
## $ product_flavour: chr " Matcha + Collagen, Grass-Fed Collagen Peptides," " Superfoods, Matcha Green Tea Powder," " Matcha + Vitamin C, Trial Pack (Citrus Ginger, Dragon Fruit, Original), 3 Packets," " Organic, Honey Lavender Stress Relief, Caffeine Free, 16 Tea Bags," ...
## $ category : chr "Tea" "Tea" "Tea" "Tea" ...
## $ product_size : chr "224 g" "114 g" "5 g" "29 g" ...
## $ product_price : chr "39.21" "26.96" "8.99" "6.5" ...
## $ product_rate : num 4.5 3.8 2.9 4.5 4.4 4.5 4 4.1 4.6 4.4 ...
## $ one_star : int 4 5 8 33 60 28 73 107 26 14 ...
## $ two_star : int 2 1 3 39 31 25 61 95 14 29 ...
## $ three_star : int 2 4 1 78 65 80 119 293 56 39 ...
## $ four_star : int 7 10 3 175 108 146 204 501 106 116 ...
## $ five_star : int 56 14 7 731 632 731 492 1066 762 354 ...
## $ reviews_number : int 71 34 22 1056 897 1010 949 2062 964 552 ...
## $ product_quality: chr "excellent" "good" "good" "excellent" ...
There are some columns that have incorrect data type, thus we had to clean the data (by remove some unnecessary characters) and change the incorrect data type.
Herb_1$product_size <- sub(pattern = " g",replacement = "", Herb_1$product_size)
head(unique(Herb_1$product_size))## [1] "224" "114" "5" "29" "170" "24"
Drop row 335, because it’s product price is not available (should be a missing value).
Herb_1 <- Herb_1[-335,]
str(Herb_1)## 'data.frame': 2147 obs. of 15 variables:
## $ product_id : int 88411 96659 96973 41988 69658 41989 5500 30664 5494 6871 ...
## $ product_name : chr "Matcha Road, Matcha + Collagen, Grass-Fed Collagen Peptides, 7.9 oz (224 g)" "California Gold Nutrition, Superfoods, Matcha Green Tea Powder, 4 oz (114 g)" "Matcha Road, Matcha + Vitamin C, Trial Pack (Citrus Ginger, Dragon Fruit, Original), 3 Packets, 0.18 oz (5 g) Each" "Yogi Tea, Organic, Honey Lavender Stress Relief, Caffeine Free, 16 Tea Bags, 1.02 oz (29 g)" ...
## $ product_company: chr "Matcha Road" "California Gold Nutrition" "Matcha Road" "Yogi Tea" ...
## $ product_flavour: chr " Matcha + Collagen, Grass-Fed Collagen Peptides," " Superfoods, Matcha Green Tea Powder," " Matcha + Vitamin C, Trial Pack (Citrus Ginger, Dragon Fruit, Original), 3 Packets," " Organic, Honey Lavender Stress Relief, Caffeine Free, 16 Tea Bags," ...
## $ category : chr "Tea" "Tea" "Tea" "Tea" ...
## $ product_size : chr "224" "114" "5" "29" ...
## $ product_price : chr "39.21" "26.96" "8.99" "6.5" ...
## $ product_rate : num 4.5 3.8 2.9 4.5 4.4 4.5 4 4.1 4.6 4.4 ...
## $ one_star : int 4 5 8 33 60 28 73 107 26 14 ...
## $ two_star : int 2 1 3 39 31 25 61 95 14 29 ...
## $ three_star : int 2 4 1 78 65 80 119 293 56 39 ...
## $ four_star : int 7 10 3 175 108 146 204 501 106 116 ...
## $ five_star : int 56 14 7 731 632 731 492 1066 762 354 ...
## $ reviews_number : int 71 34 22 1056 897 1010 949 2062 964 552 ...
## $ product_quality: chr "excellent" "good" "good" "excellent" ...
Then, change data type.
Herb_1$product_company <- as.factor(Herb_1$product_company)
Herb_1$category <- as.factor(Herb_1$category)
Herb_1$product_quality <- as.factor(Herb_1$product_quality)
Herb_1$product_size <- as.numeric(Herb_1$product_size)## Warning: NAs introduced by coercion
Herb_1$product_price <- as.numeric(Herb_1$product_price)## Warning: NAs introduced by coercion
str(Herb_1)## 'data.frame': 2147 obs. of 15 variables:
## $ product_id : int 88411 96659 96973 41988 69658 41989 5500 30664 5494 6871 ...
## $ product_name : chr "Matcha Road, Matcha + Collagen, Grass-Fed Collagen Peptides, 7.9 oz (224 g)" "California Gold Nutrition, Superfoods, Matcha Green Tea Powder, 4 oz (114 g)" "Matcha Road, Matcha + Vitamin C, Trial Pack (Citrus Ginger, Dragon Fruit, Original), 3 Packets, 0.18 oz (5 g) Each" "Yogi Tea, Organic, Honey Lavender Stress Relief, Caffeine Free, 16 Tea Bags, 1.02 oz (29 g)" ...
## $ product_company: Factor w/ 223 levels "21st Century",..: 116 24 116 218 123 218 218 218 218 201 ...
## $ product_flavour: chr " Matcha + Collagen, Grass-Fed Collagen Peptides," " Superfoods, Matcha Green Tea Powder," " Matcha + Vitamin C, Trial Pack (Citrus Ginger, Dragon Fruit, Original), 3 Packets," " Organic, Honey Lavender Stress Relief, Caffeine Free, 16 Tea Bags," ...
## $ category : Factor w/ 12 levels " Breakfast"," Herb and Spices",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ product_size : num 224 114 5 29 170 29 24 32 32 24 ...
## $ product_price : num 39.21 26.96 8.99 6.5 24.49 ...
## $ product_rate : num 4.5 3.8 2.9 4.5 4.4 4.5 4 4.1 4.6 4.4 ...
## $ one_star : int 4 5 8 33 60 28 73 107 26 14 ...
## $ two_star : int 2 1 3 39 31 25 61 95 14 29 ...
## $ three_star : int 2 4 1 78 65 80 119 293 56 39 ...
## $ four_star : int 7 10 3 175 108 146 204 501 106 116 ...
## $ five_star : int 56 14 7 731 632 731 492 1066 762 354 ...
## $ reviews_number : int 71 34 22 1056 897 1010 949 2062 964 552 ...
## $ product_quality: Factor w/ 3 levels "excellent","good",..: 1 2 2 1 1 1 1 1 1 1 ...
summary(Herb_1)## product_id product_name product_company
## Min. : 374 Length:2147 Frontier Natural Products: 183
## 1st Qu.:32400 Class :character Bob's Red Mill : 77
## Median :58719 Mode :character Twinings : 75
## Mean :56278 Simply Organic : 73
## 3rd Qu.:81954 Now Foods : 56
## Max. :99554 Traditional Medicinals : 45
## (Other) :1638
## product_flavour category product_size
## Length:2147 Tea :679 Min. : 0.0
## Class :character Herb and Spices :268 1st Qu.: 40.0
## Mode :character Chocolate & Candies snack:209 Median : 100.0
## Fruit & Vegetables :199 Mean : 193.3
## vinegars & Oils :195 3rd Qu.: 340.0
## Breakfast :187 Max. :2268.0
## (Other) :410 NA's :6
## product_price product_rate one_star two_star
## Min. : 3.15 Min. :1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 7.52 1st Qu.:4.200 1st Qu.: 1.000 1st Qu.: 0.000
## Median :11.06 Median :4.400 Median : 3.000 Median : 2.000
## Mean :15.42 Mean :4.302 Mean : 8.546 Mean : 6.225
## 3rd Qu.:19.82 3rd Qu.:4.600 3rd Qu.: 8.000 3rd Qu.: 6.000
## Max. :96.98 Max. :5.000 Max. :267.000 Max. :184.000
## NA's :175
## three_star four_star five_star reviews_number
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 1.0
## 1st Qu.: 1.00 1st Qu.: 3.00 1st Qu.: 13.0 1st Qu.: 20.0
## Median : 5.00 Median : 10.00 Median : 39.0 Median : 61.0
## Mean : 15.06 Mean : 32.04 Mean : 139.7 Mean : 201.5
## 3rd Qu.: 14.00 3rd Qu.: 27.50 3rd Qu.: 106.0 3rd Qu.: 161.5
## Max. :643.00 Max. :1349.00 Max. :6936.0 Max. :8585.0
##
## product_quality
## excellent:1861
## good : 264
## poor : 22
##
##
##
##
Oops, there are missing values in product_price, we need to remove it first. Up to this point, if there’s any missing value in our data exist after inspect data (missing value is not exist when inspect the data), there’s a chance that the contents of the row/column is not same/equal to the actual contents of row/column.
Herb_1 <- na.omit(Herb_1)
dim(Herb_1)## [1] 1966 15
The missing values are dropped/removed. Then we will start exploration on data.
Brief explanation
summary(Herb_1)## product_id product_name product_company
## Min. : 374 Length:1966 Frontier Natural Products: 178
## 1st Qu.:32651 Class :character Simply Organic : 72
## Median :58780 Mode :character Twinings : 58
## Mean :56479 Now Foods : 55
## 3rd Qu.:82036 Bob's Red Mill : 54
## Max. :99554 Traditional Medicinals : 44
## (Other) :1505
## product_flavour category product_size
## Length:1966 Tea :610 Min. : 0.0
## Class :character Herb and Spices :255 1st Qu.: 40.0
## Mode :character Chocolate & Candies snack:197 Median : 100.0
## vinegars & Oils :188 Mean : 193.8
## Fruit & Vegetables :173 3rd Qu.: 340.0
## Breakfast :154 Max. :2268.0
## (Other) :389
## product_price product_rate one_star two_star
## Min. : 3.150 Min. :1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 7.522 1st Qu.:4.200 1st Qu.: 1.000 1st Qu.: 0.000
## Median :11.070 Median :4.400 Median : 3.000 Median : 2.000
## Mean :15.436 Mean :4.309 Mean : 8.894 Mean : 6.402
## 3rd Qu.:19.850 3rd Qu.:4.600 3rd Qu.: 8.000 3rd Qu.: 6.000
## Max. :96.980 Max. :5.000 Max. :267.000 Max. :184.000
##
## three_star four_star five_star reviews_number
## Min. : 0.00 Min. : 0.00 Min. : 0 Min. : 1.0
## 1st Qu.: 1.00 1st Qu.: 3.00 1st Qu.: 14 1st Qu.: 22.0
## Median : 5.00 Median : 11.00 Median : 42 Median : 65.0
## Mean : 15.56 Mean : 33.16 Mean : 146 Mean : 209.2
## 3rd Qu.: 15.00 3rd Qu.: 28.75 3rd Qu.: 111 3rd Qu.: 173.0
## Max. :643.00 Max. :1349.00 Max. :6936 Max. :8585.0
##
## product_quality
## excellent:1717
## good : 231
## poor : 18
##
##
##
##
Brief explanation summary :
We know that summary above are talks several things, but there are surely some business insight that we don’t know if we’re only observe from the summary. Before go to the next step, let’s create a new columns filled by sales result of each product by reviews number.
Herb_1$sales <- Herb_1$product_price * Herb_1$reviews_number
head(Herb_1)Ok let’s we move to the next step, we’re gonna do some exploration through the Herbal Product’s data.
cat_agg <- aggregate(sales ~ category, Herb_1, FUN = sum)
cat_agg[order(-cat_agg$sales), ]Vinnegars & Oils are the most popular to purchased on Herbal Product or there’s a chance that the price of Vinergars & Oils product id more expensive that herbal Tea product. Then, let’s we take a look deeper into Vinegar & Oils products.
vinegars <- Herb_1[Herb_1$category == "vinegars & Oils", ]
head(vinegars)dim(vinegars)## [1] 188 16
There are 188 vinnegars & oils products on dataset. Which company/brand have higher sales than other?
vin_agg <- aggregate(cbind(sales, reviews_number) ~ product_company + product_quality, vinegars, FUN = sum)
head(vin_agg[order(-vin_agg$sales), ], 5)Table above shows that vinnegars & oils from Nutiva has highest value both on sales and reviews_number. This means vinnegars & oils from Nutiva are preferable than other product/brand. Other insight that we can conclude from table above is reviews_number of Jarrow Formulas vinnegars & oils is higher than Garden of Life vinnegars & oils. This mean vinnegars & oils from Garden of Life is more expensive than vinnegars & oils from Jarrow Formulas (since sales value obtained from product_price * reviews_number).
Let’s take a peek which Brand/Company is the most recognized on Tea Product and let’s we take a peek what about the sales of top 5 popular herbal tea product.
tea <- Herb_1[Herb_1$category == "Tea",]
head(tea)dim(tea)## [1] 610 16
Then, what about the product quality of herbal tea products? Let’s take a look at it.
table(tea$product_quality)##
## excellent good poor
## 542 63 5
There are 610 of herbal tea products and 542 among them have excellent product quality. Thus, top 5 most recognized/popular herbal tea product and have excellent product quality is :
tea_agg <- aggregate(cbind(reviews_number, sales) ~ product_company + product_quality, tea, FUN = sum)
head(tea_agg[order(-tea_agg$reviews_number), ], 5)Table above shows top 5 most recognized/popular of herbal tea product. Yogi Tea placed on the first place by 14,735 review. But if we take a look at sales column, Harney & Sons herbal tea placed on the first place (have higher sales than Yogi Tea product’s). It means there are possibilities that some of Harney & sons customer’s repeated order or the price of Harney & Sons products is higher that the Yogi Tea Product.
Now, let’s we dive deeper which brand has overall quality product based on customer review is excellent, by product_quality from the dataset. First, we’re gonna subset the products that has excellent reviews.
excellent <- Herb_1[Herb_1$product_quality == "excellent", ]
dim(excellent)## [1] 1717 16
There are 1717 products has excellent review from the customer. Alright, now we’re gonna explore more through the dataset.
cat_agg <- aggregate(reviews_number ~ category, data = excellent, FUN = mean)
cat_agg[order(-cat_agg$reviews_number), ]On the first place there are Vinnegars & Oils and we already know from the previous exploratory from the sales. This means that many people tried to live a healthy life by using herbal oils and vinnegars. On the second place there are Cocoa. This could indicate that there are many people support herbal cocoa products. Why? We know that actually, the origin of sweetness in chocolate is not came from the cocoa it self, but there are artificial sweeteners added. This Cocoa is exclude chocolate bar or snacks we know. This Cocoa here is more like cocoa powder for drink and cooking purpose.
cocoa <- excellent[excellent$category == "Cocoa", ]
cocoa_agg <- aggregate(reviews_number ~ product_company, cocoa, FUN = mean)
head(cocoa_agg[order(-cocoa_agg$reviews_number), ], 5)Table above shows that cocoa product from Nestle company is preferable than other cocoa brand from other company. After cocoa product, let’s we explore on coffee product since coffee is the most consumed beverage.
coffee <- excellent[excellent$category == "Coffee", ]
coffee_agg <- aggregate(reviews_number ~ product_company, coffee, FUN = mean)
head(coffee_agg[order(-coffee_agg$reviews_number), ], 5)Okay, so based on table above coffee from California Gold Nutrition is most preferable coffee than other coffee from other brand. If we look at the difference between California Gold Nutrition product and Dandy Blend product (which placed at 2nd), the difference is huge. It means, there’s possibilities California Gold Nutrition product is known than other product.