Welcome

Greetings! Welcome to my Learning By Building (LBB). This LBB talks about Herbal Products. First, we need to store/call/assign the data to Herb_1 and take a peek with head().

Herb_1 <- read.csv("Iherb categories.csv")
head(Herb_1)

Then, we need to inspect the data before explore further more.

Inspect Dataset

dim(Herb_1)
## [1] 2148   15
names(Herb_1)
##  [1] "product_id"      "product_name"    "product_company" "product_flavour"
##  [5] "category"        "product_size"    "product_price"   "product_rate"   
##  [9] "one_star"        "two_star"        "three_star"      "four_star"      
## [13] "five_star"       "reviews_number"  "product_quality"

Well, after inspect the data, we had 2.148 products and 15 columns that represented by product names, the company, kind of product’s flavour, category, size, price, product’s overall rate, product’s rate in one star, two star, three star, four star, and five star, how many reviews of the product, and the quality (poor, good, or excellent). After that, we need to know is there any missing value or there isn’t.

colSums(is.na(Herb_1))
##      product_id    product_name product_company product_flavour        category 
##               0               0               0               0               0 
##    product_size   product_price    product_rate        one_star        two_star 
##               0               0               0               0               0 
##      three_star       four_star       five_star  reviews_number product_quality 
##               0               0               0               0               0

Great! There are no missing value in this data. After that, we need to check the data type of each column.

str(Herb_1)
## 'data.frame':    2148 obs. of  15 variables:
##  $ product_id     : int  88411 96659 96973 41988 69658 41989 5500 30664 5494 6871 ...
##  $ product_name   : chr  "Matcha Road, Matcha + Collagen, Grass-Fed Collagen Peptides, 7.9 oz (224 g)" "California Gold Nutrition, Superfoods, Matcha Green Tea Powder, 4 oz (114 g)" "Matcha Road, Matcha + Vitamin C, Trial Pack (Citrus Ginger, Dragon Fruit, Original), 3 Packets, 0.18 oz (5 g) Each" "Yogi Tea, Organic, Honey Lavender Stress Relief, Caffeine Free, 16 Tea Bags, 1.02 oz (29 g)" ...
##  $ product_company: chr  "Matcha Road" "California Gold Nutrition" "Matcha Road" "Yogi Tea" ...
##  $ product_flavour: chr  " Matcha + Collagen, Grass-Fed Collagen Peptides," " Superfoods, Matcha Green Tea Powder," " Matcha + Vitamin C, Trial Pack (Citrus Ginger, Dragon Fruit, Original), 3 Packets," " Organic, Honey Lavender Stress Relief, Caffeine Free, 16 Tea Bags," ...
##  $ category       : chr  "Tea" "Tea" "Tea" "Tea" ...
##  $ product_size   : chr  "224 g" "114 g" "5 g" "29 g" ...
##  $ product_price  : chr  "39.21" "26.96" "8.99" "6.5" ...
##  $ product_rate   : num  4.5 3.8 2.9 4.5 4.4 4.5 4 4.1 4.6 4.4 ...
##  $ one_star       : int  4 5 8 33 60 28 73 107 26 14 ...
##  $ two_star       : int  2 1 3 39 31 25 61 95 14 29 ...
##  $ three_star     : int  2 4 1 78 65 80 119 293 56 39 ...
##  $ four_star      : int  7 10 3 175 108 146 204 501 106 116 ...
##  $ five_star      : int  56 14 7 731 632 731 492 1066 762 354 ...
##  $ reviews_number : int  71 34 22 1056 897 1010 949 2062 964 552 ...
##  $ product_quality: chr  "excellent" "good" "good" "excellent" ...

There are some columns that have incorrect data type, thus we had to clean the data (by remove some unnecessary characters) and change the incorrect data type.

Data Cleaning

Herb_1$product_size <- sub(pattern = " g",replacement = "", Herb_1$product_size)
head(unique(Herb_1$product_size))
## [1] "224" "114" "5"   "29"  "170" "24"

Drop row 335, because it’s product price is not available (should be a missing value).

Herb_1 <- Herb_1[-335,]
str(Herb_1)
## 'data.frame':    2147 obs. of  15 variables:
##  $ product_id     : int  88411 96659 96973 41988 69658 41989 5500 30664 5494 6871 ...
##  $ product_name   : chr  "Matcha Road, Matcha + Collagen, Grass-Fed Collagen Peptides, 7.9 oz (224 g)" "California Gold Nutrition, Superfoods, Matcha Green Tea Powder, 4 oz (114 g)" "Matcha Road, Matcha + Vitamin C, Trial Pack (Citrus Ginger, Dragon Fruit, Original), 3 Packets, 0.18 oz (5 g) Each" "Yogi Tea, Organic, Honey Lavender Stress Relief, Caffeine Free, 16 Tea Bags, 1.02 oz (29 g)" ...
##  $ product_company: chr  "Matcha Road" "California Gold Nutrition" "Matcha Road" "Yogi Tea" ...
##  $ product_flavour: chr  " Matcha + Collagen, Grass-Fed Collagen Peptides," " Superfoods, Matcha Green Tea Powder," " Matcha + Vitamin C, Trial Pack (Citrus Ginger, Dragon Fruit, Original), 3 Packets," " Organic, Honey Lavender Stress Relief, Caffeine Free, 16 Tea Bags," ...
##  $ category       : chr  "Tea" "Tea" "Tea" "Tea" ...
##  $ product_size   : chr  "224" "114" "5" "29" ...
##  $ product_price  : chr  "39.21" "26.96" "8.99" "6.5" ...
##  $ product_rate   : num  4.5 3.8 2.9 4.5 4.4 4.5 4 4.1 4.6 4.4 ...
##  $ one_star       : int  4 5 8 33 60 28 73 107 26 14 ...
##  $ two_star       : int  2 1 3 39 31 25 61 95 14 29 ...
##  $ three_star     : int  2 4 1 78 65 80 119 293 56 39 ...
##  $ four_star      : int  7 10 3 175 108 146 204 501 106 116 ...
##  $ five_star      : int  56 14 7 731 632 731 492 1066 762 354 ...
##  $ reviews_number : int  71 34 22 1056 897 1010 949 2062 964 552 ...
##  $ product_quality: chr  "excellent" "good" "good" "excellent" ...

Then, change data type.

Herb_1$product_company <- as.factor(Herb_1$product_company)
Herb_1$category <- as.factor(Herb_1$category)
Herb_1$product_quality <- as.factor(Herb_1$product_quality)
Herb_1$product_size <- as.numeric(Herb_1$product_size)
## Warning: NAs introduced by coercion
Herb_1$product_price <- as.numeric(Herb_1$product_price)
## Warning: NAs introduced by coercion
str(Herb_1)
## 'data.frame':    2147 obs. of  15 variables:
##  $ product_id     : int  88411 96659 96973 41988 69658 41989 5500 30664 5494 6871 ...
##  $ product_name   : chr  "Matcha Road, Matcha + Collagen, Grass-Fed Collagen Peptides, 7.9 oz (224 g)" "California Gold Nutrition, Superfoods, Matcha Green Tea Powder, 4 oz (114 g)" "Matcha Road, Matcha + Vitamin C, Trial Pack (Citrus Ginger, Dragon Fruit, Original), 3 Packets, 0.18 oz (5 g) Each" "Yogi Tea, Organic, Honey Lavender Stress Relief, Caffeine Free, 16 Tea Bags, 1.02 oz (29 g)" ...
##  $ product_company: Factor w/ 223 levels "21st Century",..: 116 24 116 218 123 218 218 218 218 201 ...
##  $ product_flavour: chr  " Matcha + Collagen, Grass-Fed Collagen Peptides," " Superfoods, Matcha Green Tea Powder," " Matcha + Vitamin C, Trial Pack (Citrus Ginger, Dragon Fruit, Original), 3 Packets," " Organic, Honey Lavender Stress Relief, Caffeine Free, 16 Tea Bags," ...
##  $ category       : Factor w/ 12 levels " Breakfast"," Herb and Spices",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ product_size   : num  224 114 5 29 170 29 24 32 32 24 ...
##  $ product_price  : num  39.21 26.96 8.99 6.5 24.49 ...
##  $ product_rate   : num  4.5 3.8 2.9 4.5 4.4 4.5 4 4.1 4.6 4.4 ...
##  $ one_star       : int  4 5 8 33 60 28 73 107 26 14 ...
##  $ two_star       : int  2 1 3 39 31 25 61 95 14 29 ...
##  $ three_star     : int  2 4 1 78 65 80 119 293 56 39 ...
##  $ four_star      : int  7 10 3 175 108 146 204 501 106 116 ...
##  $ five_star      : int  56 14 7 731 632 731 492 1066 762 354 ...
##  $ reviews_number : int  71 34 22 1056 897 1010 949 2062 964 552 ...
##  $ product_quality: Factor w/ 3 levels "excellent","good",..: 1 2 2 1 1 1 1 1 1 1 ...
summary(Herb_1)
##    product_id    product_name                        product_company
##  Min.   :  374   Length:2147        Frontier Natural Products: 183  
##  1st Qu.:32400   Class :character   Bob's Red Mill           :  77  
##  Median :58719   Mode  :character   Twinings                 :  75  
##  Mean   :56278                      Simply Organic           :  73  
##  3rd Qu.:81954                      Now Foods                :  56  
##  Max.   :99554                      Traditional Medicinals   :  45  
##                                     (Other)                  :1638  
##  product_flavour                         category    product_size   
##  Length:2147        Tea                      :679   Min.   :   0.0  
##  Class :character    Herb and Spices         :268   1st Qu.:  40.0  
##  Mode  :character   Chocolate & Candies snack:209   Median : 100.0  
##                     Fruit & Vegetables       :199   Mean   : 193.3  
##                     vinegars & Oils          :195   3rd Qu.: 340.0  
##                      Breakfast               :187   Max.   :2268.0  
##                     (Other)                  :410   NA's   :6       
##  product_price    product_rate      one_star          two_star      
##  Min.   : 3.15   Min.   :1.000   Min.   :  0.000   Min.   :  0.000  
##  1st Qu.: 7.52   1st Qu.:4.200   1st Qu.:  1.000   1st Qu.:  0.000  
##  Median :11.06   Median :4.400   Median :  3.000   Median :  2.000  
##  Mean   :15.42   Mean   :4.302   Mean   :  8.546   Mean   :  6.225  
##  3rd Qu.:19.82   3rd Qu.:4.600   3rd Qu.:  8.000   3rd Qu.:  6.000  
##  Max.   :96.98   Max.   :5.000   Max.   :267.000   Max.   :184.000  
##  NA's   :175                                                        
##    three_star       four_star         five_star      reviews_number  
##  Min.   :  0.00   Min.   :   0.00   Min.   :   0.0   Min.   :   1.0  
##  1st Qu.:  1.00   1st Qu.:   3.00   1st Qu.:  13.0   1st Qu.:  20.0  
##  Median :  5.00   Median :  10.00   Median :  39.0   Median :  61.0  
##  Mean   : 15.06   Mean   :  32.04   Mean   : 139.7   Mean   : 201.5  
##  3rd Qu.: 14.00   3rd Qu.:  27.50   3rd Qu.: 106.0   3rd Qu.: 161.5  
##  Max.   :643.00   Max.   :1349.00   Max.   :6936.0   Max.   :8585.0  
##                                                                      
##   product_quality
##  excellent:1861  
##  good     : 264  
##  poor     :  22  
##                  
##                  
##                  
## 

Oops, there are missing values in product_price, we need to remove it first. Up to this point, if there’s any missing value in our data exist after inspect data (missing value is not exist when inspect the data), there’s a chance that the contents of the row/column is not same/equal to the actual contents of row/column.

Herb_1 <- na.omit(Herb_1)
dim(Herb_1)
## [1] 1966   15

The missing values are dropped/removed. Then we will start exploration on data.

Data Exploration

Brief explanation

summary(Herb_1)
##    product_id    product_name                        product_company
##  Min.   :  374   Length:1966        Frontier Natural Products: 178  
##  1st Qu.:32651   Class :character   Simply Organic           :  72  
##  Median :58780   Mode  :character   Twinings                 :  58  
##  Mean   :56479                      Now Foods                :  55  
##  3rd Qu.:82036                      Bob's Red Mill           :  54  
##  Max.   :99554                      Traditional Medicinals   :  44  
##                                     (Other)                  :1505  
##  product_flavour                         category    product_size   
##  Length:1966        Tea                      :610   Min.   :   0.0  
##  Class :character    Herb and Spices         :255   1st Qu.:  40.0  
##  Mode  :character   Chocolate & Candies snack:197   Median : 100.0  
##                     vinegars & Oils          :188   Mean   : 193.8  
##                     Fruit & Vegetables       :173   3rd Qu.: 340.0  
##                      Breakfast               :154   Max.   :2268.0  
##                     (Other)                  :389                   
##  product_price     product_rate      one_star          two_star      
##  Min.   : 3.150   Min.   :1.000   Min.   :  0.000   Min.   :  0.000  
##  1st Qu.: 7.522   1st Qu.:4.200   1st Qu.:  1.000   1st Qu.:  0.000  
##  Median :11.070   Median :4.400   Median :  3.000   Median :  2.000  
##  Mean   :15.436   Mean   :4.309   Mean   :  8.894   Mean   :  6.402  
##  3rd Qu.:19.850   3rd Qu.:4.600   3rd Qu.:  8.000   3rd Qu.:  6.000  
##  Max.   :96.980   Max.   :5.000   Max.   :267.000   Max.   :184.000  
##                                                                      
##    three_star       four_star         five_star    reviews_number  
##  Min.   :  0.00   Min.   :   0.00   Min.   :   0   Min.   :   1.0  
##  1st Qu.:  1.00   1st Qu.:   3.00   1st Qu.:  14   1st Qu.:  22.0  
##  Median :  5.00   Median :  11.00   Median :  42   Median :  65.0  
##  Mean   : 15.56   Mean   :  33.16   Mean   : 146   Mean   : 209.2  
##  3rd Qu.: 15.00   3rd Qu.:  28.75   3rd Qu.: 111   3rd Qu.: 173.0  
##  Max.   :643.00   Max.   :1349.00   Max.   :6936   Max.   :8585.0  
##                                                                    
##   product_quality
##  excellent:1717  
##  good     : 231  
##  poor     :  18  
##                  
##                  
##                  
## 

Brief explanation summary :

  1. There are at least 1.972 transaction of herbal products of all categories.
  2. Products from Frontier Natural Products company is the most purchased than other companies.
  3. Tea is the most popular/purchased than other product categories (612 transactions) and followed by Herb and Spices (255 transaction).
  4. There are 1717 customers reviewed excellent on herbal product.
  5. At least, 1 customer leave any review for the herbal product.

We know that summary above are talks several things, but there are surely some business insight that we don’t know if we’re only observe from the summary. Before go to the next step, let’s create a new columns filled by sales result of each product by reviews number.

Herb_1$sales <- Herb_1$product_price * Herb_1$reviews_number
head(Herb_1)

Ok let’s we move to the next step, we’re gonna do some exploration through the Herbal Product’s data.

Which Category have the Highest Sales?

cat_agg <- aggregate(sales ~ category, Herb_1, FUN = sum)
cat_agg[order(-cat_agg$sales), ]

Vinnegars & Oils are the most popular to purchased on Herbal Product or there’s a chance that the price of Vinergars & Oils product id more expensive that herbal Tea product. Then, let’s we take a look deeper into Vinegar & Oils products.

Top 5 Most Recognized Brand - Vinnegars & Oils

vinegars <- Herb_1[Herb_1$category == "vinegars & Oils", ]
head(vinegars)
dim(vinegars)
## [1] 188  16

There are 188 vinnegars & oils products on dataset. Which company/brand have higher sales than other?

vin_agg <- aggregate(cbind(sales, reviews_number) ~ product_company + product_quality, vinegars, FUN = sum)
head(vin_agg[order(-vin_agg$sales), ], 5)

Table above shows that vinnegars & oils from Nutiva has highest value both on sales and reviews_number. This means vinnegars & oils from Nutiva are preferable than other product/brand. Other insight that we can conclude from table above is reviews_number of Jarrow Formulas vinnegars & oils is higher than Garden of Life vinnegars & oils. This mean vinnegars & oils from Garden of Life is more expensive than vinnegars & oils from Jarrow Formulas (since sales value obtained from product_price * reviews_number).

Top 5 Most Recognized Brand - Tea

Let’s take a peek which Brand/Company is the most recognized on Tea Product and let’s we take a peek what about the sales of top 5 popular herbal tea product.

tea <- Herb_1[Herb_1$category == "Tea",]
head(tea)
dim(tea)
## [1] 610  16

Then, what about the product quality of herbal tea products? Let’s take a look at it.

table(tea$product_quality)
## 
## excellent      good      poor 
##       542        63         5

There are 610 of herbal tea products and 542 among them have excellent product quality. Thus, top 5 most recognized/popular herbal tea product and have excellent product quality is :

tea_agg <- aggregate(cbind(reviews_number, sales) ~ product_company + product_quality, tea, FUN = sum)
head(tea_agg[order(-tea_agg$reviews_number), ], 5)

Table above shows top 5 most recognized/popular of herbal tea product. Yogi Tea placed on the first place by 14,735 review. But if we take a look at sales column, Harney & Sons herbal tea placed on the first place (have higher sales than Yogi Tea product’s). It means there are possibilities that some of Harney & sons customer’s repeated order or the price of Harney & Sons products is higher that the Yogi Tea Product.

Which Brand and Products by Product Review?

Now, let’s we dive deeper which brand has overall quality product based on customer review is excellent, by product_quality from the dataset. First, we’re gonna subset the products that has excellent reviews.

excellent <- Herb_1[Herb_1$product_quality == "excellent", ]
dim(excellent)
## [1] 1717   16

There are 1717 products has excellent review from the customer. Alright, now we’re gonna explore more through the dataset.

Which category product has the highest average reviews?

cat_agg <- aggregate(reviews_number ~ category, data = excellent, FUN = mean)
cat_agg[order(-cat_agg$reviews_number), ]

On the first place there are Vinnegars & Oils and we already know from the previous exploratory from the sales. This means that many people tried to live a healthy life by using herbal oils and vinnegars. On the second place there are Cocoa. This could indicate that there are many people support herbal cocoa products. Why? We know that actually, the origin of sweetness in chocolate is not came from the cocoa it self, but there are artificial sweeteners added. This Cocoa is exclude chocolate bar or snacks we know. This Cocoa here is more like cocoa powder for drink and cooking purpose.

Which Cocoa Brand has the highest average review?

cocoa <- excellent[excellent$category == "Cocoa", ]

cocoa_agg <- aggregate(reviews_number ~ product_company, cocoa, FUN = mean)
head(cocoa_agg[order(-cocoa_agg$reviews_number), ], 5)

Table above shows that cocoa product from Nestle company is preferable than other cocoa brand from other company. After cocoa product, let’s we explore on coffee product since coffee is the most consumed beverage.

Which Coffee Brand has the highest acerage review?

coffee <- excellent[excellent$category == "Coffee", ]

coffee_agg <- aggregate(reviews_number ~ product_company, coffee, FUN = mean)
head(coffee_agg[order(-coffee_agg$reviews_number), ], 5)

Okay, so based on table above coffee from California Gold Nutrition is most preferable coffee than other coffee from other brand. If we look at the difference between California Gold Nutrition product and Dandy Blend product (which placed at 2nd), the difference is huge. It means, there’s possibilities California Gold Nutrition product is known than other product.

Overall Summary

  • From 1966 rows or products that we had, 1717 product that have an excellent review.
  • Products from Frontier Natural Products company is the most purchased than other companies.
  • Vinegars & Oils has the highest average reviews with excellent review on product and total sales. Vinegars & oils from Nutiva is the best known and have lots of buyer than other product.
  • There are 610 of herbal tea products and 542 among them have excellent product quality. Among of Tea product, Yogi Tea brand is the best known than other brand (actually the writer have tried and approve it).
  • For Cocoa product, Nestle brand is the best known, followed by Now Food brand with a slight difference.
  • The most popular and has the highest average review number of coffee product is from California Gold Nutrition brand. Based on previous chapter, there are huge difference between California Gold Nutrition brand and Dandy Blend brand. It could be coffee product from California Gold Nutrition more famous than other product.