BUSINESS ANALYTICS PROJECT - GROUP 14

GROUP MEMBERS:

Anurag Chavan (5921740) Anuj Goyal (6003087) Bhavesh Abhang (5935814) Prithvi Thakkar (6017215)

PART 1: INTRODUCTION

Business analytics is the study of data using operations and statistical analysis and has become an integral part towards the success of many organizations. Basically, from the business and managerial perspective, business analytics is a vital element which is used by managers to gain knowledge regarding their organization which aid them in making insightful decisions. As a result, business analytics provides managers with key information to predicting the future of the organization’s sales (Fouladirad, Neal, Ituarte, Alexander, & Ghareeb, n.d.).

The modern-day retailer is no longer just a seller of products but also a data analyst. eCommerce data analysis has become a vital tool for gaining insights into customers, market trends, and competitive intelligence. By leveraging data analysis, retailers can gain a better understanding of their customer base, identify best sellers in their product categories, and make informed decisions on pricing, marketing, and inventory levels (Hillel, 2023).

The global eCommerce market is expected to be worth $6.3 trillion in 2024, up from $5.8 trillion in 2023. This continuous growth makes eCommerce one of the most competitive industries. The heightened competitiveness has pushed businesses to find ways to gain an edge over their competitors. The best option they have is to turn to the vast amounts of data shared by customers (Poddębniak, 2024).

Amazon’s business is based on the Internet and e-commerce, totally focused on the customer with its own logistics system, and in which any person or company, however small or large, can sell their products through a platform 100% optimized and easy to use where you can manage everything related to your account, brand and product. Its evolution as a company has a lot to do with the application of the Long Tail economic model; have an almost infinite supply, long tail, to meet high demand and provide each consumer with the product they are looking for, no matter how minority it may be (Liftingroup, n.d.).

In addition to its e-commerce platform, Amazon provides Amazon Web Services (AWS) for cloud hosting solutions, helping businesses, non-profit organizations, and government agencies deliver low-cost websites and applications. Amazon also offers streaming services such as Amazon Prime Video and Amazon Music (Liftingroup, n.d.).

However, this project concentrates on its eCommerce business, which offers retailers the opportunity to expand and internationalize their businesses by connecting with new clients and generating strong visibility.

BUSINESS PROBLEM

The business problem aims to analyze the current data to understand which products in which categories are trending and best sellers. With the help business analysis on the existing data, businesses can gain a better understanding of consumer preferences and build more effective products and marketing by analysing data on what things are popular. This insight would be extremely helpful for the vendors on the website to gain an advantage and enhance sales in the current competitive E-Commerce market.

REFERENCES

Fouladirad, M., Neal, J., Ituarte, J. V., Alexander, J., & Ghareeb, A. (n.d.). Entertaining data: Business analytics and Netflix. Retrieved from https://www.researchgate.net/profile/Joshua-Alexander-4/publication/333998454_Entertaining_Data_Business_Analytics_and_Netflix/links/5d121fe692851cf4404a5a9a/Entertaining-Data-Business-Analytics-and-Netflix.pdf
Hillel, O. (2023, January 18). How to use eCommerce data analysis for best seller intelligence. DataCluster. Retrieved from
https://datacluster.com/ecommerce-data-analysis-for-best-seller-intelligence#:~:text=It%20involves%20gathering%20and%20analyzing,for%20product%20innovation%20and%20development
Poddębniak, M. (2024, June 20). What is ecommerce analytics and how can you use it to grow your business. Piwik PRO. Retrieved from https://piwik.pro/blog/what-is-ecommerce-analytics-and-how-can-you-use-it-to-grow-your-business/
Liftingroup. (n.d.). Amazon and its business model and growth. Retrieved from https://www.liftingroup.com/en/expertise/business-model-amazon/

INSTALLING PACKAGES

In this tutorial, we will use the following packages. These are bundles of functions that extend base R functionality in numerous ways.

lattice
- It is a powerful package for creating trellis graphs, which are useful for visualizing multivariate data. It is based on the Grid graphics system.
ggplot2
- It is a widely-used package for creating complex and multi-layered graphics. It is based on the Grammar of Graphics, which provides a systematic way to build plots layer by layer.
plotly
- It is an interactive plotting library that can be used to create web-based interactive graphs. It integrates seamlessly with ggplot2.
stringr
- It provides a cohesive set of functions designed for easy and consistent manipulation of strings, leveraging regular expressions and offering robust handling of text data.
gridExtra
- It provides functions to arrange multiple grid-based graphics in a single layout, facilitating the combination of different plots and graphical objects in various configurations.
apaTables
- This package allows you to generate tables according to the American Psychological Association’s (APA) style. We use this package to create a correlation matrix that is more informative (and nicer to the eye) compared to the standard correlation matrix that is generated by the base package in R.
C50
- This is the core package we will use in the workbook on classification in which we generate decision trees using the C5.0 algorithm. In this tutorial, we’ll generate such a decision tree when we predict the generated clusters with the attributes used for finding these clusters (this is called ‘semi-supervised learning’). We will delve further into decision trees in the next tutorial. In this tutorial, we use them to assess the importance of the different predictors.
cluster
- The cluster-package contains a toolbox for performing various cluster analyses. Although the clusters in this tutorial are determined using the base functionality available in R, we use the silhouette-function from this package to assess the quality of the found clusters.
factoextra
- This package makes it easy to extract and visualize the output of exploratory multivariate data analysis, of which cluster analysis is one of the options. We’ll use the fviz_cluster-function from this package to get a projection of the found clusters in a 2-dimensional plane.
gmodels
- This package allows us to create a table that cross-tabulates the found clusters with the ‘defaulted’-variable. As we will explain later, this table aims to create targeted communication campaigns.
reshape2
- This package contains functions designed to make it easier to reshape your data. It offers a streamlined approach to pivoting data frames between wide format (where different variables are stored in separate columns) and long format (where variables are stacked in a single column but distinguished by another identifier column). This capability is instrumental in data analysis and visualization tasks where the data structure needs to be modified to suit specific analysis or plotting functions.
Hmisc
- This package provides robust data analysis and statistical modelling tools, especially in the biomedical field. In this tutorial, we use its functionality to produce concise summaries of data frames, including descriptive statistics essential for preliminary data analysis. For example, the describe() function that we use in this tutorial offers detailed summaries for each variable in a dataset, including counts of missing values, unique values, and basic statistical measures.
randomForest
- This is a popular and powerful tool for performing random forest analysis. This is a type of ensemble learning method used for classification and regression tasks. The random forest algorithm creates a “forest” of decision trees using random subsets of the data and then aggregates their predictions to improve accuracy and control over-fitting.
rsample
- This package has a function for partitioning the data that is much easier to digest than the code needed if we want to do this with the basic R functionality. We expect you to be able to adjust this code when required.
plotirx
- It offers a variety of specialized plotting functions that extend the base graphics capabilities, providing tools for creating plots such as pie charts, radial plots, and other custom visualizations.
plotrix
- It is a part of the tidyverse and provides a set of functions for data manipulation, offering intuitive and efficient tools for filtering, selecting, arranging, mutating, and summarizing data frames.
magick
- It provides a powerful and comprehensive suite of functions for image processing, enabling users to read, write, manipulate, and analyze images in various formats.

# List of packages
packages <- c("ggplot2", "gridExtra", "corrplot", "stringr", "lattice", "dplyr", 
              "fuzzyjoin", "httr", "magick", "C50", "rsample", "gmodels", 
              "randomForestExplainer", "party", "randomForest", "caret", "effects", 
              "pROC", "textreg", "coefplot", "vip", "visreg", "texreg")

# Install packages that are not already installed
install.packages(setdiff(packages, rownames(installed.packages())))

# Check if Rtools is needed and prompt installation for Windows users
if (.Platform$OS.type == "windows" && !requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
  devtools::install_github("r-lib/pkgbuild")
  pkgbuild::check_rtools()
  if (!pkgbuild::has_rtools()) {
    stop("Rtools is required but not installed. Please install Rtools from https://cran.rstudio.com/bin/windows/Rtools/")
  }
}

# Visualization packages
suppressWarnings(library(ggplot2))
suppressWarnings(library(gridExtra))
suppressWarnings(library(corrplot))

## corrplot 0.92 loaded

# Data Cleaning packages
suppressWarnings(library(stringr))
suppressWarnings(library(lattice))
suppressWarnings(library(dplyr))

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:gridExtra':
## 
##     combine

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

suppressWarnings(library(fuzzyjoin))
suppressWarnings(library(httr))
suppressWarnings(library(magick))

## Linking to ImageMagick 6.9.12.93
## Enabled features: cairo, fontconfig, freetype, heic, lcms, pango, raw, rsvg, webp
## Disabled features: fftw, ghostscript, x11

# Decision Tree Analysis (Classification) packages
suppressWarnings(library(C50))
suppressWarnings(library(rsample))
suppressWarnings(library(gmodels))
suppressWarnings(library(randomForestExplainer))

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

suppressWarnings(library(party))

## Loading required package: grid

## Loading required package: mvtnorm

## Loading required package: modeltools

## Loading required package: stats4

## Loading required package: strucchange

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: sandwich

## 
## Attaching package: 'strucchange'

## The following object is masked from 'package:stringr':
## 
##     boundary

## 
## Attaching package: 'party'

## The following object is masked from 'package:dplyr':
## 
##     where

suppressWarnings(library(randomForest))

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:gridExtra':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

# Logistic Regression packages
suppressWarnings(library(caret))

## 
## Attaching package: 'caret'

## The following object is masked from 'package:httr':
## 
##     progress

suppressWarnings(library(effects))

## Loading required package: carData

## Use the command
##     lattice::trellis.par.set(effectsTheme())
##   to customize lattice options for effects plots.
## See ?effectsTheme for details.

suppressWarnings(library(pROC))

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following object is masked from 'package:gmodels':
## 
##     ci

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

suppressWarnings(library(textreg))
suppressWarnings(library(coefplot))
suppressWarnings(library(vip))

## 
## Attaching package: 'vip'

## The following object is masked from 'package:utils':
## 
##     vi

suppressWarnings(library(visreg))
suppressWarnings(library(texreg))

## Version:  1.39.3
## Date:     2023-11-09
## Author:   Philip Leifeld (University of Essex)
## 
## Consider submitting praise using the praise or praise_interactive functions.
## Please cite the JSS article in your publications -- see citation("texreg").

# Check loaded packages
loaded_packages <- c("ggplot2", "gridExtra", "corrplot", "stringr", "lattice", "dplyr", 
                     "fuzzyjoin", "httr", "magick", "C50", "rsample", "gmodels", 
                     "randomForestExplainer", "party", "randomForest", "caret", "effects", 
                     "pROC", "textreg", "coefplot", "vip", "visreg", "texreg")

loaded_successfully <- loaded_packages[sapply(loaded_packages, require, character.only = TRUE, quietly = TRUE)]
not_loaded <- setdiff(loaded_packages, loaded_successfully)

if (length(not_loaded) > 0) {
  warning("The following packages were not loaded successfully: ", paste(not_loaded, collapse = ", "))
}

DATASET SETUP

In this assignment, we’ll be using Amazon Products UK 2023 Dataset for our analysis.

amazondata <- read.csv("amz_uk_processed_data.csv")

PART 2: DATA EXPLORATION & PREPARATION

2.1. DATA EXPLORATION

In data exploration, we want to know the data that we want to work with. Here we can use relevant R function and data visualizations to explore the data and variables. We have manually created the department variable to broadly categorize existing segments so that we could reduce the number of categories (230) to segments (18). This is a part of data preparation but we need this before we start exploring and vizualization.

# Create a mapping of original categories to new categories
category_mapping <- list(
  "industrial and scientific" = c("Lab & Scientific Products", "Industrial Electrical", "Test & Measurement", "3D Printing & Scanning", "Abrasive & Finishing Products", "Professional Medical Supplies", "Cutting Tools", "Material Handling Products", "Packaging & Shipping Supplies", "Agricultural Equipment & Supplies", "Construction Machinery"),
  
  "toys and games" = c("Toys and Games", "Games & Game Accessories", "Dolls & Accessories", "Sports Toys & Outdoor", "Kids' Play Vehicles", "Kids' Play Figures", "Baby & Toddler Toys", "Learning & Education Toys", "Toy Advent Calendars", "Electronic Toys", "Building & Construction Toys", "Remote & App-Controlled Devices", "Kids' Dress Up & Pretend Play", "Soft Toys", "Jigsaws & Puzzles"),
  
  "sports and outdoors" = c("Sports & Outdoors", "Ski Goggles", "Skiing Poles", "Hiking Hand & Foot Warmers", "Women's Sports & Outdoor Shoes", "Cycling Shoes", "Ski Clothing", "Men's Sports & Outdoor Shoes", "Darts & Dartboards", "Table Tennis", "Billiard, Snooker & Pool", "Bowling", "Trampolines & Accessories", "Ballet & Dancing Footwear", "Sports Supplements"),
  
  "health and household" = c("Health & Personal Care", "Thermometers & Meteorological Instruments", "Household Batteries, Chargers & Accessories", "Vacuums & Floorcare", "Large Appliances", "Ironing & Steamers"),
  
  "arts and crafts" = c("Arts & Crafts", "Kids' Art & Craft Supplies", "Art & Craft Supplies", "Pens, Pencils & Writing Supplies"),
  
  "electronics" = c("Wearable Technology", "PC Gaming Accessories", "USB Gadgets", "Projectors", "CPUs", "Computer Screws", "3D Printers", "Barebone PCs", "Mobile Phone Accessories", "Computers, Components & Accessories", "Home Audio Record Players", "Radios & Boomboxes", "Car & Vehicle Electronics", "Monitor Accessories", "Cables & Accessories", "Computer Cases", "KVM Switches", "Printers & Accessories", "Telephones, VoIP & Accessories", "Surveillance Cameras", "Tripods & Monopods", "Mobile Phones & Communication", "Electrical Power Accessories", "Tablet Accessories", "Keyboards, Mice & Input Devices", "Headphones & Earphones", "Smartwatches", "Office Electronics", "Internal TV Tuner & Video Capture Cards", "Headphones, Earphones & Accessories", "Camera & Photo Accessories", "Home Cinema, TV & Video", "Hi-Fi & Home Audio Accessories", "Portable Sound & Video Products", "USB Hubs", "Computer Audio & Video Accessories", "Adapters", "Computer & Server Racks", "Hard Drive Accessories", "Networking Devices", "Data Storage", "Mobile Phones & Smartphones", "Digital Cameras", "Film Cameras", "Binoculars, Telescopes & Optics", "Hi-Fi Receivers & Separates", "Hi-Fi Speakers"),
  
  "automotive" = c("Car & Motorbike", "Motorbike Clothing", "Motorbike Accessories", "Motorbike Batteries", "Motorbike Boots & Luggage", "Motorbike Exhaust & Exhaust Systems", "Motorbike Handlebars, Controls & Grips", "Motorbike Lighting", "Motorbike Seat Covers", "Motorbike Electrical & Batteries"),
  
  "baby products" = c("Baby", "Handmade Baby Products"),
  
  "patio and garden" = c("Plants, Seeds & Bulbs", "Garden Furniture & Accessories", "Bird & Wildlife Care", "Mowers & Outdoor Power Tools", "Outdoor Heaters & Fire Pits", "Garden Décor", "Gardening", "Outdoor Cooking", "Decking & Fencing", "Pools, Hot Tubs & Supplies", "Garden Storage & Housing", "Garden Tools & Watering Equipment"),
  
  "home and kitchen" = c("Handmade Home & Kitchen Products", "Hardware", "Storage & Home Organisation", "Fireplaces, Stoves & Accessories", "Furniture & Lighting", "Storage & Organisation", "Living Room Furniture", "Bedding & Linen", "Curtain & Blind Accessories", "Home Brewing & Wine Making", "Tableware", "Kitchen Storage & Organisation", "Kitchen Tools & Gadgets", "Cookware", "Water Coolers, Filters & Cartridges", "Beer, Wine & Spirits", "Small Kitchen Appliances", "Kitchen & Bath Fixtures", "Home Fragrance", "Window Treatments", "Home Entertainment Furniture", "Dining Room Furniture", "Home Bar Furniture", "Kitchen Linen", "Mattress Pads & Toppers", "Children's Bedding", "Bedding Accessories", "Decorative Artificial Flora", "Candles & Holders", "Signs & Plaques", "Home Office Furniture", "Inflatable Beds, Pillows & Accessories", "Bedding Collections", "Bakeware", "Grocery", "Photo Frames", "Rugs, Pads & Protectors", "Mirrors", "Clocks", "Doormats", "Decorative Home Accessories", "Boxes & Organisers", "Slipcovers", "Vases", "Bedroom Furniture", "Hallway Furniture", "Coffee, Tea & Espresso"),
  
  "bath and body" = c("Bath & Body", "Bathroom Lighting", "Rough Plumbing", "Bathroom Furniture", "Bathroom Linen"),
  
  "tools and home improvement" = c("Light Bulbs", "Heating, Cooling & Air Quality", "Power & Hand Tools", "Painting Supplies, Tools & Wall Treatments", "Building Supplies", "Safety & Security", "Lights and switches", "Plugs", "Electrical", "Lighting", "Outdoor Lighting", "Torches", "Indoor Lighting", "String Lights"),
  
  "beauty" = c("Fragrances", "Skin Care", "Beauty", "Manicure & Pedicure Products", "Make-up", "Hair Care"),
  
  "clothing and accessories" = c("Boys", "Girls", "Handmade Clothing, Shoes & Accessories", "Men", "Luggage and travel gear", "Women"),
  
  "health and wellness" = c("Professional Medical Supplies", "Sports Supplements"),
  
  "pet supplies" = c("Pet Supplies"),
  
  "fashion" = c("Handmade Jewellery")
)

# Function to map original category to new category
map_category <- function(original_category) {
  for (new_cat in names(category_mapping)) {
    if (original_category %in% category_mapping[[new_cat]]) {
      return(new_cat)
    }
  }
  return("Other")  # For any categories that don't match
}

# Create new variable with mapped categories
amazondata$department <- sapply(amazondata$categoryName, map_category)

# This code gives us an overview of list of variables availabe to us. There are total 11 variables available to us to gain an insight.

colnames(amazondata)

##  [1] "asin"              "title"             "imgUrl"           
##  [4] "productURL"        "stars"             "reviews"          
##  [7] "price"             "isBestSeller"      "boughtInLastMonth"
## [10] "categoryName"      "department"

head(amazondata)

##         asin
## 1 B09B96TG33
## 2 B01HTH3C8S
## 3 B09B8YWXDF
## 4 B09B8T5VGV
## 5 B09WX6QD65
## 6 B09B97WSLF
##                                                                                                                                                                                                title
## 1                                                                                Echo Dot (5th generation, 2022 release) | Big vibrant sound Wi-Fi and Bluetooth smart speaker with Alexa | Charcoal
## 2 Anker Soundcore mini, Super-Portable Bluetooth Speaker with 15-Hour Playtime, 66-Foot Bluetooth Range, Wireless Speaker with Enhanced Bass, Noise-Cancelling Microphone, for Outdoor, Travel, Home
## 3                                                                           Echo Dot (5th generation, 2022 release) | Big vibrant sound Wi-Fi and Bluetooth smart speaker with Alexa | Deep Sea Blue
## 4                                                                 Echo Dot with clock (5th generation, 2022 release) | Bigger vibrant sound Wi-Fi and Bluetooth smart speaker and Alexa | Cloud Blue
## 5                                                                                                  Introducing Echo Pop | Full sound compact Wi-Fi and Bluetooth smart speaker with Alexa | Charcoal
## 6                                                              Echo Dot with clock (5th generation, 2022 release) | Bigger vibrant sound Wi-Fi and Bluetooth smart speaker and Alexa | Glacier White
##                                                           imgUrl
## 1 https://m.media-amazon.com/images/I/71C3lbbeLsL._AC_UL320_.jpg
## 2 https://m.media-amazon.com/images/I/61c5rSxwP0L._AC_UL320_.jpg
## 3 https://m.media-amazon.com/images/I/61j3SEUjMJL._AC_UL320_.jpg
## 4 https://m.media-amazon.com/images/I/71yf6yTNWSL._AC_UL320_.jpg
## 5 https://m.media-amazon.com/images/I/613dEoF9-rL._AC_UL320_.jpg
## 6 https://m.media-amazon.com/images/I/71bnkM5q8CL._AC_UL320_.jpg
##                               productURL stars reviews price isBestSeller
## 1 https://www.amazon.co.uk/dp/B09B96TG33   4.7   15308 21.99        False
## 2 https://www.amazon.co.uk/dp/B01HTH3C8S   4.7   98099 23.99         True
## 3 https://www.amazon.co.uk/dp/B09B8YWXDF   4.7   15308 21.99        False
## 4 https://www.amazon.co.uk/dp/B09B8T5VGV   4.7    7205 31.99        False
## 5 https://www.amazon.co.uk/dp/B09WX6QD65   4.6    1881 17.99        False
## 6 https://www.amazon.co.uk/dp/B09B97WSLF   4.7    7205 31.99        False
##   boughtInLastMonth   categoryName  department
## 1                 0 Hi-Fi Speakers electronics
## 2                 0 Hi-Fi Speakers electronics
## 3                 0 Hi-Fi Speakers electronics
## 4                 0 Hi-Fi Speakers electronics
## 5                 0 Hi-Fi Speakers electronics
## 6                 0 Hi-Fi Speakers electronics

# The summary() function provides a detailed summary of the various elements within an object, most commonly a data frame or a vector like Minimum, 1st Quartile, Median, Mean, 3rd Quartile, and Maximum.

summary(amazondata)

##      asin              title              imgUrl           productURL       
##  Length:2222742     Length:2222742     Length:2222742     Length:2222742    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      stars          reviews              price           isBestSeller      
##  Min.   :0.000   Min.   :      0.0   Min.   :     0.00   Length:2222742    
##  1st Qu.:0.000   1st Qu.:      0.0   1st Qu.:    10.00   Class :character  
##  Median :0.000   Median :      0.0   Median :    19.90   Mode  :character  
##  Mean   :2.032   Mean   :    382.2   Mean   :    94.26                     
##  3rd Qu.:4.400   3rd Qu.:     44.0   3rd Qu.:    47.71                     
##  Max.   :5.000   Max.   :1356658.0   Max.   :100000.00                     
##  boughtInLastMonth  categoryName        department       
##  Min.   :    0.00   Length:2222742     Length:2222742    
##  1st Qu.:    0.00   Class :character   Class :character  
##  Median :    0.00   Mode  :character   Mode  :character  
##  Mean   :   18.57                                        
##  3rd Qu.:    0.00                                        
##  Max.   :50000.00

2.1.1 EXPLORATION OF NUMERIC VARIABLES

Let us now check the skewness by finding mean, median, and standard variation of every numeric variable. It helps in addressing any potential unreliability of data effectively.

# Summary and Standard Deviation of stars
stars_summary <- summary(amazondata$stars)
stars_sd <- sd(amazondata$stars)
cat("Summary of stars:\n", stars_summary, "\n")

## Summary of stars:
##  0 0 0 2.03187 4.4 5

cat("Standard Deviation of stars: ", stars_sd, "\n\n")

## Standard Deviation of stars:  2.185497

# Summary and Standard Deviation of reviews
reviews_summary <- summary(amazondata$reviews)
reviews_sd <- sd(amazondata$reviews)
cat("Summary of reviews:\n", reviews_summary, "\n")

## Summary of reviews:
##  0 0 0 382.1617 44 1356658

cat("Standard Deviation of reviews: ", reviews_sd, "\n\n")

## Standard Deviation of reviews:  5020.752

# Summary and Standard Deviation of price
price_summary <- summary(amazondata$price)
price_sd <- sd(amazondata$price)
cat("Summary of price:\n", price_summary, "\n")

## Summary of price:
##  0 10 19.9 94.25737 47.71 1e+05

cat("Standard Deviation of price: ", price_sd, "\n\n")

## Standard Deviation of price:  360.6225

# Summary and Standard Deviation of boughtInLastMonth
boughtInLastMonth_summary <- summary(amazondata$boughtInLastMonth)
boughtInLastMonth_sd <- sd(amazondata$boughtInLastMonth)
cat("Summary of boughtInLastMonth:\n", boughtInLastMonth_summary, "\n")

## Summary of boughtInLastMonth:
##  0 0 0 18.56902 0 50000

cat("Standard Deviation of boughtInLastMonth: ", boughtInLastMonth_sd, "\n")

## Standard Deviation of boughtInLastMonth:  191.903

# This shows us that there are no null values in the data set.

colSums(is.na(amazondata))

##              asin             title            imgUrl        productURL 
##                 0                 0                 0                 0 
##             stars           reviews             price      isBestSeller 
##                 0                 0                 0                 0 
## boughtInLastMonth      categoryName        department 
##                 0                 0                 0

# The str function in R provides a compact and readable summary of the structure of an R object. 

str(amazondata)

## 'data.frame':    2222742 obs. of  11 variables:
##  $ asin             : chr  "B09B96TG33" "B01HTH3C8S" "B09B8YWXDF" "B09B8T5VGV" ...
##  $ title            : chr  "Echo Dot (5th generation, 2022 release) | Big vibrant sound Wi-Fi and Bluetooth smart speaker with Alexa | Charcoal" "Anker Soundcore mini, Super-Portable Bluetooth Speaker with 15-Hour Playtime, 66-Foot Bluetooth Range, Wireless"| __truncated__ "Echo Dot (5th generation, 2022 release) | Big vibrant sound Wi-Fi and Bluetooth smart speaker with Alexa | Deep Sea Blue" "Echo Dot with clock (5th generation, 2022 release) | Bigger vibrant sound Wi-Fi and Bluetooth smart speaker and"| __truncated__ ...
##  $ imgUrl           : chr  "https://m.media-amazon.com/images/I/71C3lbbeLsL._AC_UL320_.jpg" "https://m.media-amazon.com/images/I/61c5rSxwP0L._AC_UL320_.jpg" "https://m.media-amazon.com/images/I/61j3SEUjMJL._AC_UL320_.jpg" "https://m.media-amazon.com/images/I/71yf6yTNWSL._AC_UL320_.jpg" ...
##  $ productURL       : chr  "https://www.amazon.co.uk/dp/B09B96TG33" "https://www.amazon.co.uk/dp/B01HTH3C8S" "https://www.amazon.co.uk/dp/B09B8YWXDF" "https://www.amazon.co.uk/dp/B09B8T5VGV" ...
##  $ stars            : num  4.7 4.7 4.7 4.7 4.6 4.7 4.7 4.7 4.7 4.5 ...
##  $ reviews          : int  15308 98099 15308 7205 1881 7205 15308 103673 29909 16014 ...
##  $ price            : num  22 24 22 32 18 ...
##  $ isBestSeller     : chr  "False" "True" "False" "False" ...
##  $ boughtInLastMonth: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ categoryName     : chr  "Hi-Fi Speakers" "Hi-Fi Speakers" "Hi-Fi Speakers" "Hi-Fi Speakers" ...
##  $ department       : chr  "electronics" "electronics" "electronics" "electronics" ...

2.1.2. EXPLORING THE OUTLIERS (NUMERIC VARIABLES)

To explore the outliers, we need to explore the summary about the numerical variables. In total we have five numerical variables as can be seen from the above. So, we find mean and standard deviation and check the skewness using the ggplot and histogram function. Visualizing it will help us to check the data reliability for future complications.

# Load necessary libraries
library(ggplot2)
library(gridExtra)


# Define the columns for which boxplots will be created
variables <- c("stars", "reviews", "price", "boughtInLastMonth")

# Initialize a list to store plots
plots_list <- list()

# Create boxplots for each variable
for (var in variables) {
  # Create the plot for the current variable
  plot <- ggplot(amazondata, aes(y = .data[[var]])) +
    geom_boxplot() +
    ggtitle(paste("Boxplot of", var)) +
    theme(plot.title = element_text(hjust = 0.5))
  
  # Add the plot to the list
  plots_list[[length(plots_list) + 1]] <- plot
}

# Arrange the plots in a 2 by 3 grid
grid.arrange(grobs = plots_list, ncol = 3, padding = unit(5, "lines"))

# Initialize a data frame to store outliers
outliers <- data.frame(Variable = character(), Outlier_Value = numeric(), stringsAsFactors = FALSE)

# Define a function to find outliers for each variable
find_outliers <- function(var) {
  # Calculate quartiles
  q1 <- quantile(amazondata[[var]], 0.25)
  q3 <- quantile(amazondata[[var]], 0.75)
  
  # Calculate IQR (inter-quartile range)
  iqr <- q3 - q1
  
  # Define upper and lower bounds for outliers
  upper_bound <- q3 + 1.5 * iqr
  lower_bound <- q1 - 1.5 * iqr
  
  # Find outliers
  var_outliers <- amazondata[amazondata[[var]] > upper_bound | amazondata[[var]] < lower_bound, var]
  
  # Check if outliers exist
  if (length(var_outliers) > 0) {
    # Add outliers to the outliers data frame
    outliers <<- rbind(outliers, data.frame(Variable = var, Outlier_Value = var_outliers))
  }
}

# Apply the function to each variable
for (var in variables) {
  find_outliers(var)
}

#print (outliers)

Outlier Identification from Box Plots

stars: The distribution of the ‘stars’ variable shows a concentration of ratings around 0 to 4.4 (IQR=4.4) stars, with a median at 0 indicating a notable number of items with 0 stars. The lack of outliers suggests a fairly uniform distribution without extreme values, reflecting a typical range of ratings within the dataset.
reviews: The reviews variable has median value 0, which is indicative of very poor central tendency of data. Due to the interquartile range (IQR=44.0), about half of the data points are found at its lower level around 0 reviews. However, due to presence of so many outlying figures, it means most items have few reviews while some have very high counts making everything skewed toward right side. Reviews distribution is highly right skewed because mean>median.
price: The median value of the price variable is 19.90 which points towards low data central tendency. A close clustering of the middle 50 per cent of values around zero is hinted through the interquartile range (IQR=84.26). It is clear that there are many outlier which show most products are cheap while some others have been sold at an extremely expensive price that makes its composition extremely skewed. Prices are skewed to the right (mean>median) and they are mostly located in the lower part of the range.
boughtInLastMonth: The boxplot shows that the majority of the data points for purchases in the last month are clustered outside IQR=0 with a median of 0. There are a significant number of outliers with very high purchase numbers. The presence of many outliers suggests a large variability in the number of purchases.

# Define the columns for which histograms will be created
columns <- c("stars", "reviews", "price", "boughtInLastMonth")

# Initialize a list to store the plots
hist_plots <- list()

# Create histograms for each variable
for (col in columns) {
  # Create the histogram plot for the current variable
  hist_plot <- ggplot(amazondata, aes(x = !!sym(col))) +
    geom_histogram() +
    ggtitle(paste("Distribution of", col)) +
    theme(plot.title = element_text(hjust = 0.5), 
          axis.text.x = element_text(angle = 45, hjust = 1))
  
  # Add the histogram plot to the list
  hist_plots[[length(hist_plots) + 1]] <- hist_plot
}

# Arrange the histograms in a grid
grid.arrange(grobs = hist_plots, ncol = 3)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Outlier Identification from Histogram

stars: There are many products having a star rating close to 4 or 5. There is a significant number of products with a rating of 0, which might indicate that these products have not been rated yet. The distribution is right-skewed with most of the data points concentrated on the higher end of the star ratings.
reviews: Most products have very few reviews, with a significant number of products having zero reviews. The distribution is heavily right-skewed, indicating that a small number of products receive a very high number of reviews.
price: A large number of products have a price of zero, which might indicate missing data or free products. For the non-zero prices, the distribution is right-skewed, indicating that higher-priced products are less common.
boughtInLastMonth: A significant number of products have zero purchases in the last month. The distribution is heavily right-skewed, with very few products having high purchase numbers.

2.1.3. DATA VISUALIZATION

We can see from the outliers that there are many zero values in the 4 numeric variables seen above. The ‘Zero” values are of no use to us and we cannot perform any operations on it or derive any insights. So, we will identify the zero values and bind them together and finally eliminate them. We will find the rows with all numeric variables viz reviews, price, listPrice, boughtInLastMonth have values as ’0’. We will not take stars variables into consideration as it can be logically assumed that it directly is related to reviews. If there is no presence of other 4 numeric variables, stars can be ignored for convenience.

Both the histograms and the boxplot highlight a common trend: a significant number of products have minimal engagement (in terms of ratings, reviews, prices, and purchases), while a smaller subset of products exhibits very high engagement. This indicates a typical long-tail distribution common in e-commerce platforms, where a small number of products dominate sales and engagement metrics.

The zero values for stars, reviews, price, and purchases indicate the need for data cleaning. Products with zero values for these variables might be excluded or treated separately to avoid skewing the analysis.

This analysis provides a clear picture of the data distribution, helping to identify potential areas for deeper investigation and data cleaning efforts.

FINDING TOTAL NUMBER OF PRODUCTS WITH ZERO VALUES

Total number of products with zero values is 2062874 in the variables reviews, price and boughtInLastMonth. We will remove this in further steps.

library(dplyr)

# Define the columns to check for zero values
check_columns <- c("reviews", "price", "boughtInLastMonth", "isBestSeller")

# Extract rows where any of the specified columns contain zeros
zero_value_subset <- amazondata %>% 
  filter(if_any(all_of(check_columns), ~ . == 0))

# Calculate the total number of products with zero values
total_zero_value_products <- nrow(zero_value_subset)

# Print the total number of products with zero values
cat("Total number of products with zero values:", total_zero_value_products, "\n")

## Total number of products with zero values: 2062874

# Optionally, print the subset with zero values for inspection
# print("Subset with zero values:")
# print(zero_value_subset)

# Find duplicates in the subset
duplicates <- zero_value_subset %>% 
  duplicated() %>% 
  which()

# Extract the duplicate rows
duplicate_rows <- zero_value_subset[duplicates, ]

PERCENTAGE OF ZERO AND NON ZERO VALUES IN THE VARIABLES

The graph below indicates the number of products with zero values as a percentage in a particular variable. In price, we can see that there is a missing non-zero variable which indicates that there is no 0 price in the dataset.

# Load necessary libraries
library(ggplot2)
library(reshape2)

# Calculate the percentages of zero and non-zero values
percentages <- sapply(amazondata[, c("stars", "reviews", "price", "boughtInLastMonth")], function(x) {
  num_zeros <- sum(x == 0, na.rm = TRUE)
  total_values <- sum(!is.na(x))
  percentage_zeros <- (num_zeros / total_values) * 100
  percentage_non_zeros <- 100 - percentage_zeros
  c(zero = percentage_zeros, non_zero = percentage_non_zeros)
})

# Convert to data frame and reshape for ggplot2
percentages_df <- as.data.frame(t(percentages))
percentages_df$variable <- rownames(percentages_df)
percentages_long <- melt(percentages_df, id.vars = "variable", variable.name = "type", value.name = "percentage")

# Create the bar graph
ggplot(percentages_long, aes(x = variable, y = percentage, fill = type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Percentages of Zero and Non-Zero Values in the variables",
       x = "Variable",
       y = "Percentage",
       fill = "Value Type") +
  theme_minimal() +
  theme(legend.position = "bottom")

# Calculate the percentage and counts of products with both stars and reviews equal to 0
num_both_zeros <- sum(amazondata$stars == 0 & amazondata$reviews == 0, na.rm = TRUE)
total_products <- nrow(amazondata)
percentage_both_zeros <- (num_both_zeros / total_products) * 100

# Print the percentage and counts of both zeros
cat("Percentage of products with both stars and reviews equal to 0:", percentage_both_zeros, "%\n")

## Percentage of products with both stars and reviews equal to 0: 52.83654 %

cat("Number of products with both stars and reviews equal to 0:", num_both_zeros, "\n")

## Number of products with both stars and reviews equal to 0: 1174420

# Create a data frame for plotting
zeros_data <- data.frame(Variable = "Stars and Reviews", Percentage = percentage_both_zeros)

# Create the bar graph
ggplot(zeros_data, aes(x = Variable, y = Percentage, fill = Variable)) +
  geom_bar(stat = "identity") +
  labs(title = "Percentage of Products with Both Stars and Reviews Equal to 0",
       x = "Variable",
       y = "Percentage") +
  theme_minimal() +
  theme(legend.position = "none")

This output indicates that the value of both stars and reviews is around 52% of the total values.

2.1.4. EXPLORING NON-NUMERICAL VARIABLES

We have total 6 non numerical variables.

VISUALIZATION 1: BAR PLOT OF TOTAL SALES QUANTITY OF DEPARTMENTS

The graph below indicates the total sales quantity sold on amazon by respective department.

library(ggplot2)

# Count the occurrences of each category
category_counts <- table(amazondata$department)
category_counts_amazondata <- as.data.frame(category_counts)

# Create a bar plot
ggplot(category_counts_amazondata, aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity") +
  labs(title = "Product Categories", x = "Department", y = "Sales Quantity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(hjust = 3))

Now, we will find top 5 categories which have highest percentage of best seller products. Post that, we will vizualize using the top 5 category and comparing total sales and best seller sales in this category. In turn, this will help us make a firm decision on which category products can achieve best seller ranking and help in faster decision making.

VISUALIZATION 2: TOP 5 DEPARTMENTS BY MOST SALES IN NUMBER

# Select top 5 departments

# Calculate the frequency of each category

category_counts <- table(amazondata$department)

# Sort the categories by frequency in descending order
sorted_categories <- sort(category_counts, decreasing = TRUE)

# Select the top 10 categories
top_categories <- names(sorted_categories)[1:5]

print(top_categories)

## [1] "sports and outdoors" "Other"               "electronics"        
## [4] "home and kitchen"    "toys and games"

VISUALIZATION 3: TABLE OF DEPARTMENTS FOR BEST SELLING PRODUCTS BY PERCENTAGE

The table below indicates all departments with their bestseller quantity and product quantity sold. From this, we have put selected top 5 departments with maximum bestseller percentage.

# Load necessary library
library(dplyr)

# Group by categoryName and calculate counts
category <- amazondata %>%
  group_by(department) %>%
  summarise(
    isBestSeller_Count = sum(isBestSeller == "True"),  # Count of best sellers
    Product_Count = n(),  # Total products in each category
    Percentage_BestSeller = round((isBestSeller_Count / Product_Count) * 100, 2)  # Percentage of best sellers
  ) %>%
  arrange(desc(Percentage_BestSeller))  # Arrange by Percentage_BestSeller in descending order

# Output the updated data frame
category

## # A tibble: 17 × 4
##    department             isBestSeller_Count Product_Count Percentage_BestSeller
##    <chr>                               <int>         <int>                 <dbl>
##  1 pet supplies                          284          9430                  3.01
##  2 health and household                  668         34638                  1.93
##  3 baby products                         192         15463                  1.24
##  4 arts and crafts                       238         24537                  0.97
##  5 home and kitchen                     1039        139299                  0.75
##  6 tools and home improv…                524         71476                  0.73
##  7 clothing and accessor…                576         85798                  0.67
##  8 industrial and scient…                119         23040                  0.52
##  9 patio and garden                      266         78286                  0.34
## 10 automotive                            107         38149                  0.28
## 11 bath and body                          68         27422                  0.25
## 12 Other                                 710        335613                  0.21
## 13 beauty                                193         92439                  0.21
## 14 toys and games                        250        124193                  0.2 
## 15 electronics                           395        247967                  0.16
## 16 fashion                                10         12420                  0.08
## 17 sports and outdoors                   379        862572                  0.04

# Select top 5 departments with highest Percentage_BestSeller
top_departments <- head(category, 5)

# Set up the plotting area with wider margins and larger plot size
par(mar = c(5, 5, 4, 4) + 0.5)
options(repr.plot.width=10, repr.plot.height=8)

# Calculate percentages for pie chart
product_percentages <- top_departments$Product_Count / sum(top_departments$Product_Count) * 100
best_seller_percentages <- top_departments$isBestSeller_Count / sum(top_departments$isBestSeller_Count) * 100

# Create a pie chart for Product_Count
pie(product_percentages, 
    labels = paste(top_departments$department, " (", round(product_percentages, 1), "%)"), 
    main = "Product Count by Department in Top 5 Departments",
    col = rainbow(length(product_percentages)),
    cex = 0.8,  # Adjust label size
    radius = 0.9,  # Adjust pie size
    clockwise = TRUE)

This code essentially creates a pie chart showing the distribution of Product_Count across the top 5 departments with the highest Percentage_BestSeller values. Home and kitchen has maximum product count among the bestseller products.

VISUALIZATION 4: SCATTER PLOT OF PRICE VS REVIEWS

library(ggplot2)
plot(amazondata$price, amazondata$reviews, pch = 16, col = "blue", cex = 1,
     main = "Price vs. Number of Reviews", xlab = "Price", ylab = "Number of Reviews",
     xlim = c(min(amazondata$price, na.rm = TRUE), max(amazondata$price, na.rm = TRUE)),
     ylim = c(min(amazondata$reviews, na.rm = TRUE), max(amazondata$reviews, na.rm = TRUE)))

# Add a grid
grid()

# Add a legend
legend("topright", legend = "Data Points", pch = 16, col = "black", cex = 1)

VISUALIZATION 5: SCATTER PLOT OF STARS VS PRICE

plot(amazondata$stars, amazondata$price, pch = 16, col = "blue", cex = 1,
     main = "Stars vs. Price", xlab = "Stars", ylab = "Price",
     xlim = c(min(amazondata$stars, na.rm = TRUE), max(amazondata$stars, na.rm = TRUE)),
     ylim = c(min(amazondata$price, na.rm = TRUE), max(amazondata$price, na.rm = TRUE)))

# Add a grid
grid()

# Add a legend
legend("topright", legend = "Data Points", pch = 16, col = "black", cex = 1)

VISUALIZATION 6: SCATTER PLOT OF STARS VS REVIEWS

plot(amazondata$stars, amazondata$reviews, pch = 16, col = "blue", cex = 1,
     main = "Stars vs. Reviews", xlab = "Stars", ylab = "Reviews",
     xlim = c(min(amazondata$stars, na.rm = TRUE), max(amazondata$stars, na.rm = TRUE)),
     ylim = c(min(amazondata$reviews, na.rm = TRUE), max(amazondata$reviews, na.rm = TRUE)))

# Add a grid
grid()

# Add a legend
legend("topright", legend = "Data Points", pch = 16, col = "black", cex = 1)

The following scatter plots illustrate a positive association between star ratings and both product reviews and prices. Products with more reviews tend to have higher star ratings, while the price distribution across different star ratings remains relatively even. This suggests that higher-rated products garner more attention and potentially higher prices.

2.2 DATA CLEANING AND PREPARATION

2.2.1. DEALING WITH OUTLIERS

# Remove rows where no sales were made last month
paste("Number of Rows that will be deleted, because no sales were made last month",sum(amazondata$boughtInLastMonth == 0))

## [1] "Number of Rows that will be deleted, because no sales were made last month 2061427"

amazondata <- amazondata[amazondata$boughtInLastMonth != 0.00, ] 

# Remove products that did not receive any reviews
amazondata <- amazondata[amazondata$reviews != 0.00, ] 

paste("Finally, the final dataset contains rows: ", nrow(amazondata))

## [1] "Finally, the final dataset contains rows:  159868"

# Find the entries in department with blankspace

indices_with_blankspace <- grep("\\s", amazondata$department)
departments_with_blankspace <- amazondata[indices_with_blankspace, ]
length(unique(departments_with_blankspace$department))

## [1] 12

amazondata$cdepartment <- gsub("  ", " & ", amazondata$department)
# print(amazondata$department)

2.2.2. FEATURE ENGINEERING

Make non-numeric variable numeric. The first step is IsBestSeller to make into binary. Yes=1 and No=0

amazondata$isBestSeller <- ifelse(amazondata$isBestSeller == "True", 1, 0)
glimpse(amazondata)

## Rows: 159,868
## Columns: 12
## $ asin              <chr> "B01N4V4X5M", "B07WVP26FR", "B083K77PZL", "B09J47Y78…
## $ title             <chr> "Upgraded, Anker Soundcore Boost Bluetooth Speaker w…
## $ imgUrl            <chr> "https://m.media-amazon.com/images/I/71ZVrRZLTcL._AC…
## $ productURL        <chr> "https://www.amazon.co.uk/dp/B01N4V4X5M", "https://w…
## $ stars             <dbl> 4.7, 4.4, 4.6, 4.4, 4.7, 4.3, 4.3, 4.6, 4.5, 4.4, 4.…
## $ reviews           <int> 29387, 2493, 2979, 804, 3208, 1279, 15963, 4641, 202…
## $ price             <dbl> 39.98, 15.99, 65.59, 22.99, 34.99, 21.99, 39.90, 39.…
## $ isBestSeller      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ boughtInLastMonth <int> 600, 1000, 200, 200, 200, 100, 100, 100, 50, 50, 50,…
## $ categoryName      <chr> "Hi-Fi Speakers", "Hi-Fi Speakers", "Hi-Fi Speakers"…
## $ department        <chr> "electronics", "electronics", "electronics", "electr…
## $ cdepartment       <chr> "electronics", "electronics", "electronics", "electr…

2.2.3.: DATA TRANSFORMATION

We will apply log transformation for continuous variables so that it skews less on the right. This results in minimizing the effect of high values while dispersing low ones across a certain range which messes up the symmetry of their distribution and so make models perform better.

Classification trees use the natural logarithm to balance out predictor variables. Consequently, this method leads to a fairer set of splits and improved decision boundaries, which all together promote an increase in prediction accuracy through attenuation of prejudice against majority classes.

Log transformation ensures that the effect of outliers is diminished and enhances a linear relationship between predictor variables and the log odds of the response variable in Logistic Regression. By doing this, logistic regression coefficients are easier to interpret and more dependable leading to more precise probability estimates for class membership.

Next, our focus will be on numerical variables which are indicating skewness in previous histograms. In this process, title length and categorical/binary variables will be omitted.

# Create faceted histograms
columns <- c( "price", "boughtInLastMonth", "reviews")

# Create a list to store the plots
plots <- list()

# Create histograms for each variable
for (col in columns) {
  plot <- ggplot(amazondata, aes(x = !!sym(col))) +
    geom_histogram() +
    ggtitle(paste(col, "Distribution")) +
    theme(plot.title = element_text(hjust = 0.5, size = unit(10, "mm")), 
          axis.text.x = element_text(angle = 45, hjust = 1))

  
  plots[[length(plots) + 1]] <- plot
}

# Arrange histograms in a grid
grid.arrange(grobs = plots, ncol = 4, width = 30, height = 20)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

After the log transformation, we visualize the dataset in histograms to confirm that the skewness of these variables have been successfully addressed.

# Apply log transformation and save to new variables
amazondata$reviews_log <- log(1 + amazondata$reviews)
amazondata$price_log <- log(1 + amazondata$price)
amazondata$boughtInLastMonth_log <- log(1 + amazondata$boughtInLastMonth)



# Create faceted histograms
columns <- c("reviews_log", "price_log", "boughtInLastMonth_log","stars")

# Create a list to store the plots
plots <- list()

# Create histograms for each variable
for (col in columns) {
  plot <- ggplot(amazondata, aes(x = !!sym(col))) +
    geom_histogram() +
    ggtitle(paste(col, "Distribution")) +
    theme(plot.title = element_text(hjust = 0.5, size = unit(8, "mm")), 
          axis.text.x = element_text(angle = 45, hjust = 1))
  
  plots[[length(plots) + 1]] <- plot
}

# Arrange histograms in a grid
grid.arrange(grobs = plots, ncol = 3,  width = 20, height = 20)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

No normalization or standardization is needed for binary categorical variables. So ‘isBestSeller’ do not need to be normalized or standardized.

Data normalization is useful when working with algorithms sensitive to feature scaling (such as k-nearest neighbours or neural networks). Normalizing will make these algorithms converge faster. Normalization is also useful when features have different units or scales or when you need values in a bounded interval such as [0, 1] or another specific range.

Based on the above reason, the following variables will be normalized using min-max scaling approach. price stars listPrice reviews boughtInLastMonth titleLength

# Define a function to perform min-max scaling
min_max_scaling <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Apply min-max scaling to the specified variables and add them to the original dataframe
amazondata <- amazondata %>%
  mutate(price_log_normalized = min_max_scaling(price_log),
         stars_normalized = min_max_scaling(stars),
         reviews_log_normalized = min_max_scaling(reviews_log),
         boughtInLastMonth_log_normalized = min_max_scaling(boughtInLastMonth_log),
      )

# Visualize the normalized variables to confirm they're on the same scale.

columns <- c("price_log_normalized", "stars_normalized", "reviews_log_normalized", "boughtInLastMonth_log_normalized")

# Create a list to store the plots
plots <- list()

# Create histograms for each variable
for (col in columns) {
  plot <- ggplot(amazondata, aes(x = !!sym(col))) +
    geom_histogram() +
    ggtitle(paste(col, "Distribution")) +
    theme(plot.title = element_text(hjust = 0.5, size = unit(8, "mm")))
  
  plots[[length(plots) + 1]] <- plot
}

# Arrange histograms in a grid
grid.arrange(grobs = plots, ncol = 3, width = 30, height = 20)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

At this point, we have created quite a few new variables. Based on the following list of column names, we will choose the variables that can be used in our classification tree and logistic regression models.

2.2.4. INITIAL PLOTTING

Just to remind again, the goal of the research is to give answer to “where should sellers focus on (department)” and “how to be best seller”. To make it crystal clear, let us examine the number of Best Seller in each of the category.

# Load necessary library
library(ggplot2)

# Plot 1: Horizontal bar plot of products against category type
plot2 <- ggplot(amazondata, aes(y = department)) +
  geom_bar() +
  labs(title = "Products per Department", x = "Count", y = "Department") +
  theme(plot.title = element_text(hjust = 0.5, size = 14))

# Plot 2: Horizontal bar plot of bestsellers per category
plot1 <- ggplot(amazondata, aes(y = department, fill = isBestSeller)) +
  geom_bar(position = "dodge") +
  labs(title = "Bestsellers per Department", x = "Count", y = "Department") +
  theme(plot.title = element_text(hjust = 0.5, size = 14))


# Display the plots
print(plot1)

## Warning: The following aesthetics were dropped during statistical transformation: fill.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

print(plot2)

# Load necessary libraries
library(ggplot2)
library(dplyr)

# Calculate the percentage of bestsellers per category
category_bestseller_pct <- amazondata %>%
  group_by(department) %>%
  summarize(total = n(),
            bestsellers = sum(isBestSeller == TRUE)) %>%
  mutate(percentage = (bestsellers / total) * 100) %>%
  arrange(desc(percentage))

# Print the dataframe to check values
print(category_bestseller_pct)

## # A tibble: 17 × 4
##    department                 total bestsellers percentage
##    <chr>                      <int>       <int>      <dbl>
##  1 fashion                       56           4      7.14 
##  2 health and household       11082         617      5.57 
##  3 sports and outdoors         5222         230      4.40 
##  4 automotive                  1882          75      3.99 
##  5 industrial and scientific   2064          72      3.49 
##  6 clothing and accessories    9767         329      3.37 
##  7 patio and garden            6486         200      3.08 
##  8 pet supplies                9100         276      3.03 
##  9 baby products               4909         144      2.93 
## 10 home and kitchen           27424         796      2.90 
## 11 Other                      15475         345      2.23 
## 12 arts and crafts            10747         219      2.04 
## 13 tools and home improvement  9517         176      1.85 
## 14 toys and games             15227         224      1.47 
## 15 bath and body               2978          41      1.38 
## 16 electronics                 4676          62      1.33 
## 17 beauty                     23256         176      0.757

# Plot: Horizontal bar plot of percentage of bestsellers per category
plot3 <- ggplot(category_bestseller_pct, aes(y = reorder(department, percentage), x = percentage)) +
  geom_bar(stat = "identity", fill = "magenta") +
  labs(title = "Percentage of Bestsellers per Department", x = "Percentage", y = "Department") +
  theme(plot.title = element_text(hjust = 0.5, size = 14), 
        axis.text.y = element_text(size = 12))

# Display the plot
print(plot3)

2.2.5. REFLECTIONS

When exploring and preparing data, we comprehend all variables for the purposes of identifying issues that may occur as well as come up with solutions aimed at solving these problems. It is very important to clean up and refine datasets before using them for modeling purposes as that makes them more reliable sources for predictive models. Cleaning up a dataset entails getting rid of zeros or null values, dropping irrelevant groups, changing the names of categories so that they form one final document.

PART 3: DATA MODELING

Finally, since the dataset is ready to be modelled, we can assign it to a more intuitive name. In this part, we will use two modelling techniques to reach the goal of this analytics:

Classification using Decision Tree Methods

Logistic Regression

# Load necessary libraries
library(ggplot2)
library(dplyr)

# Define min-max scaling function
min_max_scaling <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

# Create new variable "titleLength"
amazondata <- amazondata %>%
  mutate(titleLength = nchar(title))

# View the first few rows to confirm the new variable
head(amazondata[c("title", "titleLength")])

##                                                                                                                                                                                                        title
## 30                           Upgraded, Anker Soundcore Boost Bluetooth Speaker with Well-Balanced Sound, BassUp, 12H Playtime, USB-C, IPX7 Waterproof, with Customizable EQ via App, Wireless Stereo Pairing
## 33                                                                               Kolaura Portable Wireless Speaker, Bluetooth 5.0 Speaker with 3D Stereo HiFi Bass, 1500mAh Battery, 12 Hour Playtime (Blue)
## 93     W-KING Bluetooth Speaker, 50W Speakers Wireless Bluetooth 5.0 With Deep Bass, IPX6 Waterproof Loud Bluetooth Speaker With 40H Playback/Two Portable Speakers Pairing/TF Card/EQ/NFC for Outdoor Party
## 144                          XSOUND Portable Bluetooth Speaker Wireless 24W wireless bluetooth speaker with Bluetooth 5.2, enhanced bass, stereo sound, 24h playtime, usb c, IPX7 Waterproof outdoor speaker
## 161                                                                                   soundcore Anker Mini 3 Bluetooth Speaker, BassUp and PartyCast Technology, USB-C, Waterproof IPX7, and Customizable EQ
## 230 HEYSONG Shower Speaker, Waterpoof Portable Bluetooth Speakers Wireless with LED Lights, IP67, 36H Playtime, Rich Bass Small Speaker Bluetooth for Camping, Kayak, Beach, Bathroom, Travel, Gifts for Men
##     titleLength
## 30          175
## 33          123
## 93          197
## 144         175
## 161         118
## 230         200

# Summary statistics of the new variable
summary(amazondata$titleLength)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    78.0   133.0   125.8   175.0   534.0

# Apply min-max scaling to "titleLength"
amazondata <- amazondata %>%
  mutate(titleLength_normalized = min_max_scaling(titleLength))

# View the first few rows to confirm the new variable
head(amazondata[c("title", "titleLength", "titleLength_normalized")])

##                                                                                                                                                                                                        title
## 30                           Upgraded, Anker Soundcore Boost Bluetooth Speaker with Well-Balanced Sound, BassUp, 12H Playtime, USB-C, IPX7 Waterproof, with Customizable EQ via App, Wireless Stereo Pairing
## 33                                                                               Kolaura Portable Wireless Speaker, Bluetooth 5.0 Speaker with 3D Stereo HiFi Bass, 1500mAh Battery, 12 Hour Playtime (Blue)
## 93     W-KING Bluetooth Speaker, 50W Speakers Wireless Bluetooth 5.0 With Deep Bass, IPX6 Waterproof Loud Bluetooth Speaker With 40H Playback/Two Portable Speakers Pairing/TF Card/EQ/NFC for Outdoor Party
## 144                          XSOUND Portable Bluetooth Speaker Wireless 24W wireless bluetooth speaker with Bluetooth 5.2, enhanced bass, stereo sound, 24h playtime, usb c, IPX7 Waterproof outdoor speaker
## 161                                                                                   soundcore Anker Mini 3 Bluetooth Speaker, BassUp and PartyCast Technology, USB-C, Waterproof IPX7, and Customizable EQ
## 230 HEYSONG Shower Speaker, Waterpoof Portable Bluetooth Speakers Wireless with LED Lights, IP67, 36H Playtime, Rich Bass Small Speaker Bluetooth for Camping, Kayak, Beach, Bathroom, Travel, Gifts for Men
##     titleLength titleLength_normalized
## 30          175              0.3264540
## 33          123              0.2288931
## 93          197              0.3677298
## 144         175              0.3264540
## 161         118              0.2195122
## 230         200              0.3733583

df_model <- amazondata

3.1. MODELLING PREPARATION

We perform a correlation analysis to understand the relationships between our key variables. This will help us identify any variables that are strongly correlated, allowing us to group them in advance. This simplifies the interpretation of clusters in our later analysis. We will use the apa.cor.table function from the apaTables package for this task.

# Define columns of interest
columns <- c("boughtInLastMonth", "reviews", "price",  "titleLength", "stars")

# Generate correlation table using apaTables package
apaTables::apa.cor.table(df_model[columns])

## 
## 
## Means, standard deviations, and correlations with confidence intervals
##  
## 
##   Variable             M       SD      1            2            3         
##   1. boughtInLastMonth 257.28  670.48                                      
##                                                                            
##   2. reviews           1634.10 5420.54 .26**                               
##                                        [.26, .27]                          
##                                                                            
##   3. price             16.94   25.55   -.04**       .03**                  
##                                        [-.04, -.03] [.02, .03]             
##                                                                            
##   4. titleLength       125.83  53.17   -.03**       -.05**       .03**     
##                                        [-.03, -.02] [-.05, -.04] [.03, .04]
##                                                                            
##   5. stars             4.39    0.34    .06**        .07**        .04**     
##                                        [.06, .07]   [.07, .08]   [.04, .05]
##                                                                            
##   4           
##               
##               
##               
##               
##               
##               
##               
##               
##               
##               
##               
##   -.13**      
##   [-.13, -.12]
##               
## 
## Note. M and SD are used to represent mean and standard deviation, respectively.
## Values in square brackets indicate the 95% confidence interval.
## The confidence interval is a plausible range of population correlations 
## that could have caused the sample correlation (Cumming, 2014).
##  * indicates p < .05. ** indicates p < .01.
##

To improve visual clarity, we will employ the corrplot function. In this plot, an “X” indicates a correlation that is not statistically significant. More comprehensive details on these correlations will be provided in the modeling section. For now, we can see that most variables show some degree of linear correlation, although generally, the correlations are weak.

# Select the specified columns from the DataFrame
subset_df <- df_model[, columns]

# Compute the correlation matrix for the selected columns
correlation_matrix <- cor(subset_df)

# Generate a correlation matrix with p-values
res <- cor.mtest(subset_df)

# Visualize the correlation matrix using a heatmap
corrplot(
  correlation_matrix, method = "color", type = "upper", 
  addCoef.col = "navy", tl.col = "black",
  tl.srt = 45, tl.cex = 0.8, order = "hclust", 
  diag = FALSE, outline = TRUE, 
  title = "Correlation Heatmap of Selected Features", p.mat = res$p
)

3.2. LOGISTIC REGRESSION

Logistic regression has been selected due to its robust capability to analyze multiple explanatory variables simultaneously while mitigating the impact of confounding factors (Sperandei, 2014). Particularly suited for binary classification tasks, such as predicting whether a product is a best seller or not, logistic regression offers a powerful analytical framework.

Steps in Logistic Regression Modeling

To maintain clarity throughout this extensive modeling process, we will provide a structured overview of the logistic regression approach:

Data Preparation: Initially, we will subset the dataset and prepare for a 70:30 train-test split.
Executing the Regression Model: Two main dataframes will be utilized: df_mod (comprising all departments) and df_modeling (containing data from the top 5 departments). This approach ensures a broad overview followed by focused analysis on the key departments addressing the primary business challenges.
Extracting Coefficients: The regression coefficients represent the relative importance of variables within each department.
Plotting the Coefficients: Visual representation of the coefficients will provide insights into variable impact.
Analyzing Z-Values: Evaluating the statistical significance (typically |z| > 1.96 for a 95% confidence interval) helps understand the predictors’ effect on outcomes, while controlling for other variables.
Plotting Partial Effects: To enhance model accuracy, understanding the partial effects of variables when others are held constant is crucial.
Model Evaluation: Assessment metrics such as AUC-ROC curves and Confusion Matrices will gauge model performance.
Further Improvements: Post-evaluation, adjustments can be made to optimize model performance, especially in minimizing Type II errors (predicting ‘not best seller’ when it’s actually a ‘best seller’). This includes defining an optimal threshold using ROC curves to enhance recall performance.

#STEP 1: Preparing Dataset for logistic regression

We will prepare two dataframes for logistic regression: df_mod and df_modeling. df_modeling will be a subset focusing on the top 5 departments based on the percentage of best-selling products within each department.

df_modeling_all <- df_model
df_mod <- df_model
df_modeling <- subset(df_model, department %in% c("fashion", "health and household", "automotive", "sports and outdoors", "industrial and scientific"))
department_dfs <- split(df_modeling, df_modeling$department)

We will prepare the dataset for training and testing our logistic regression model using a 70% training dataset and a 30% testing dataset split. To ensure reproducibility, we will set a seed for the random number generator used in this dataset split. This split allows us to train the logistic regression model on a majority of the data and evaluate its performance on unseen test data.

#Preparing dataset for training and testing
set.seed(123)
split_ratio <- 0.7

#Dataset training and testing for all department dataset
indices <- sample(1:nrow(df_modeling_all), size = floor(split_ratio * nrow(df_modeling_all)))
df_train_all <- df_modeling_all[indices, ]
df_test_all <- df_modeling_all[-indices, ]

#Dataset training and testing for top 5 department
train_dfs <- list()
test_dfs <- list()

# Looping through each department
for (department in names(department_dfs)) {
  df <- department_dfs[[department]]
  indices <- sample(1:nrow(df), size = floor(split_ratio * nrow(df)))
  
  # Subsetting the data frame into training and testing sets
  train_df <- df[indices, ]
  test_df <- df[-indices, ]
  
  # storing the dataset 
  train_dfs[[department]] <- train_df
  test_dfs[[department]] <- test_df
}

#STEP 2: Executing logistic regression

# define an empty list to store model summaries for each department
model_summaries <- list()
models <- list()
coef_plots <- list()

# model
model <- glm(isBestSeller ~  price_log_normalized + 
                 reviews_log_normalized + titleLength_normalized + stars_normalized + boughtInLastMonth, 
               data = df_train_all, family = binomial)

# Storing the summary 
model_summaries[["all"]] <- summary(model)
models[["all"]] <- model

# Plotting coefficients for the current model with department name annotation
  coef_plots[["all"]] <- coefplot::coefplot(
    model,
    title = paste("Coefficients Plot for all department"),
    ylab = "Variables",
    xlab = "Estimated Coefficient"
  )

# Looping through each department
for (department in names(train_dfs)) {
  # Getting the training data frame for the current department
  df_training <- train_dfs[[department]]
  
  # Fitting the logistic regression model
  model <- glm(isBestSeller ~  price_log_normalized + 
                 reviews_log_normalized + titleLength_normalized + stars_normalized + boughtInLastMonth, 
               data = df_training, family = binomial)
  
  # Storing to df
  model_summaries[[department]] <- summary(model)
  models[[department]] <- model
  
  # Plotting coefficients
  coef_plots[[department]] <- coefplot::coefplot(
    model,
    title = paste("Coefficients Plot for", department),
    ylab = "Variables",
    xlab = "Estimated Coefficient"
  )
}

print(model_summaries)

## $all
## 
## Call:
## glm(formula = isBestSeller ~ price_log_normalized + reviews_log_normalized + 
##     titleLength_normalized + stars_normalized + boughtInLastMonth, 
##     family = binomial, data = df_train_all)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -5.996e+00  2.731e-01 -21.957  < 2e-16 ***
## price_log_normalized    2.094e-01  1.849e-01   1.132  0.25750    
## reviews_log_normalized  4.959e+00  1.516e-01  32.715  < 2e-16 ***
## titleLength_normalized  5.628e-01  2.030e-01   2.773  0.00556 ** 
## stars_normalized       -4.992e-01  3.073e-01  -1.625  0.10420    
## boughtInLastMonth       4.333e-04  1.465e-05  29.579  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26160  on 111906  degrees of freedom
## Residual deviance: 22949  on 111901  degrees of freedom
## AIC: 22961
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## $automotive
## 
## Call:
## glm(formula = isBestSeller ~ price_log_normalized + reviews_log_normalized + 
##     titleLength_normalized + stars_normalized + boughtInLastMonth, 
##     family = binomial, data = df_training)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -5.034137   1.970603  -2.555   0.0106 *  
## price_log_normalized    0.585397   1.586702   0.369   0.7122    
## reviews_log_normalized  2.528969   1.196899   2.113   0.0346 *  
## titleLength_normalized -0.836206   1.503518  -0.556   0.5781    
## stars_normalized        0.186155   2.316589   0.080   0.9360    
## boughtInLastMonth       0.001732   0.000365   4.746 2.07e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 444.40  on 1316  degrees of freedom
## Residual deviance: 384.62  on 1311  degrees of freedom
## AIC: 396.62
## 
## Number of Fisher Scoring iterations: 6
## 
## 
## $fashion
## 
## Call:
## glm(formula = isBestSeller ~ price_log_normalized + reviews_log_normalized + 
##     titleLength_normalized + stars_normalized + boughtInLastMonth, 
##     family = binomial, data = df_training)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)  
## (Intercept)             1.036e+01  8.157e+00   1.270   0.2040  
## price_log_normalized    1.378e+01  1.214e+01   1.134   0.2566  
## reviews_log_normalized -9.437e+00  7.371e+00  -1.280   0.2004  
## titleLength_normalized  2.459e+00  8.227e+00   0.299   0.7650  
## stars_normalized       -1.696e+01  1.001e+01  -1.694   0.0903 .
## boughtInLastMonth       1.526e-04  1.175e-02   0.013   0.9896  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21.153  on 38  degrees of freedom
## Residual deviance: 16.336  on 33  degrees of freedom
## AIC: 28.336
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## $`health and household`
## 
## Call:
## glm(formula = isBestSeller ~ price_log_normalized + reviews_log_normalized + 
##     titleLength_normalized + stars_normalized + boughtInLastMonth, 
##     family = binomial, data = df_training)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -4.290e+00  7.187e-01  -5.968 2.39e-09 ***
## price_log_normalized    6.976e-01  4.261e-01   1.637   0.1016    
## reviews_log_normalized  2.843e+00  4.396e-01   6.466 1.00e-10 ***
## titleLength_normalized  1.487e+00  5.774e-01   2.575   0.0100 *  
## stars_normalized       -1.317e+00  7.932e-01  -1.660   0.0968 .  
## boughtInLastMonth       3.813e-04  2.687e-05  14.188  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3289.2  on 7756  degrees of freedom
## Residual deviance: 2869.9  on 7751  degrees of freedom
## AIC: 2881.9
## 
## Number of Fisher Scoring iterations: 6
## 
## 
## $`industrial and scientific`
## 
## Call:
## glm(formula = isBestSeller ~ price_log_normalized + reviews_log_normalized + 
##     titleLength_normalized + stars_normalized + boughtInLastMonth, 
##     family = binomial, data = df_training)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -7.2905767  1.9213258  -3.795 0.000148 ***
## price_log_normalized    0.7585714  1.6288786   0.466 0.641429    
## reviews_log_normalized  3.3564192  1.1851491   2.832 0.004625 ** 
## titleLength_normalized  4.6617282  1.7041723   2.735 0.006229 ** 
## stars_normalized        0.6865408  2.0181164   0.340 0.733714    
## boughtInLastMonth       0.0008986  0.0001752   5.129 2.92e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 427.89  on 1443  degrees of freedom
## Residual deviance: 348.54  on 1438  degrees of freedom
## AIC: 360.54
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## $`sports and outdoors`
## 
## Call:
## glm(formula = isBestSeller ~ price_log_normalized + reviews_log_normalized + 
##     titleLength_normalized + stars_normalized + boughtInLastMonth, 
##     family = binomial, data = df_training)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -5.1012911  1.1311462  -4.510 6.49e-06 ***
## price_log_normalized    0.6729650  0.7375298   0.912  0.36153    
## reviews_log_normalized  4.5825325  0.6438306   7.118 1.10e-12 ***
## titleLength_normalized -0.6362049  0.8106429  -0.785  0.43256    
## stars_normalized       -0.4269741  1.3036513  -0.328  0.74327    
## boughtInLastMonth       0.0007325  0.0002394   3.060  0.00221 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1283.1  on 3654  degrees of freedom
## Residual deviance: 1192.4  on 3649  degrees of freedom
## AIC: 1204.4
## 
## Number of Fisher Scoring iterations: 6

Interpretation of Logistic Regression Models

All Departments Model: The model for all departments reveals several significant predictors of a product becoming a bestseller. The number of reviews (reviews_log_normalized) and the amount bought in the last month (boughtInLastMonth) are highly significant positive predictors (p < 0.001). Title length (titleLength_normalized) is also a significant positive predictor (p < 0.01). The star rating (stars_normalized) shows a marginally significant negative association (p < 0.1), while price is not a significant factor. This suggests that customer engagement (through reviews and recent purchases) and product presentation (title length) are key drivers of bestseller status across all departments.
Automotive Department: In the automotive category, the strongest predictor is whether the product was bought in the last month (boughtInLastMonth, p < 0.001). The number of reviews (reviews_log_normalized) is marginally significant (p < 0.05). Other factors, including price, title length, and star rating, do not show significant associations. This emphasizes the importance of recent purchase behavior in determining bestseller status for automotive products.
Fashion Department: The fashion department model shows no statistically significant predictors at the conventional p < 0.05 level. The star rating (stars_normalized) is marginally significant (p < 0.1) with a negative association. However, the small sample size (39 observations) likely contributes to the lack of significant results and makes these findings less reliable. More data would be needed to draw robust conclusions for this department.
Health and Household Department: In this category, the number of reviews (reviews_log_normalized) and recent purchases (boughtInLastMonth) are highly significant positive predictors (p < 0.001). Title length (titleLength_normalized) is also a significant positive predictor (p < 0.05). The star rating (stars_normalized) shows a marginally significant negative association (p < 0.1). Price is not a significant factor. This suggests that customer engagement and product visibility are key drivers of bestseller status in the health and household category.
Industrial and Scientific Department: For industrial and scientific products, the number of reviews (reviews_log_normalized), title length (titleLength_normalized), and recent purchases (boughtInLastMonth) are all significant positive predictors (p < 0.01 or lower). Price and star rating do not show significant associations. This indicates that product visibility, customer engagement, and recent purchase behavior are important factors in determining bestseller status in this category.
Sports and Outdoors Department: In the sports and outdoors category, the number of reviews (reviews_log_normalized) is a highly significant positive predictor (p < 0.001), and recent purchases (boughtInLastMonth) is also a significant positive predictor (p < 0.01). Price, title length, and star rating do not show significant associations. This suggests that customer engagement through reviews and recent purchase behavior are the primary drivers of bestseller status for sports and outdoor products.

General Observations: 1. Across most categories, the number of reviews and recent purchase behavior consistently emerge as important predictors of bestseller status. 2. The importance of other factors (price, title length, star rating) varies by department, suggesting that bestseller dynamics may differ across product categories. 3. Some departments, particularly Fashion, have very small sample sizes, which affects the reliability of the results and calls for caution in interpretation. 4. The models generally show good fit to the data, as indicated by improvements in deviance and AIC compared to null models. 5. The normalization of variables makes direct interpretation of effect sizes challenging without knowing the specifics of the normalization process. 6. These models reveal associations between the predictors and bestseller status, but do not necessarily imply causation. Other unmeasured factors could also be influencing these relationships.

#STEP 3: Extracting coefficient

Extracting coefficients provides insights into the relative importance of variables within each department. These coefficients quantify how strongly each predictor influences the likelihood of a product becoming a best seller.

# Initializing an empty data frame to store coefficients from all models
all_coefs <- data.frame(
  Department = character(),
  Term = character(),
  Estimate = numeric(),
  Pr.value = numeric(),
  stringsAsFactors = FALSE  # Avoid factors to make data manipulation easier
)

# Looping through each department and extract coefficients
for (department in c("all", names(train_dfs))) {
  # Extracting the coefficients table
  coefs <- summary(models[[department]])$coefficients
  
  # Creating a temporary data frame to store coefficients for current model
  temp_df <- data.frame(
    Department = department,
    Term = rownames(coefs),
    Estimate = coefs[, "Estimate"],
    Pr.value = coefs[, "Pr(>|z|)"],
    stringsAsFactors = FALSE
  )
  
  # Binding the temporary data frame to the main data frame
  all_coefs <- rbind(all_coefs, temp_df)
}

# Printing the final coefficients table
print(all_coefs)

##                                        Department                   Term
## (Intercept)                                   all            (Intercept)
## price_log_normalized                          all   price_log_normalized
## reviews_log_normalized                        all reviews_log_normalized
## titleLength_normalized                        all titleLength_normalized
## stars_normalized                              all       stars_normalized
## boughtInLastMonth                             all      boughtInLastMonth
## (Intercept)1                           automotive            (Intercept)
## price_log_normalized1                  automotive   price_log_normalized
## reviews_log_normalized1                automotive reviews_log_normalized
## titleLength_normalized1                automotive titleLength_normalized
## stars_normalized1                      automotive       stars_normalized
## boughtInLastMonth1                     automotive      boughtInLastMonth
## (Intercept)2                              fashion            (Intercept)
## price_log_normalized2                     fashion   price_log_normalized
## reviews_log_normalized2                   fashion reviews_log_normalized
## titleLength_normalized2                   fashion titleLength_normalized
## stars_normalized2                         fashion       stars_normalized
## boughtInLastMonth2                        fashion      boughtInLastMonth
## (Intercept)3                 health and household            (Intercept)
## price_log_normalized3        health and household   price_log_normalized
## reviews_log_normalized3      health and household reviews_log_normalized
## titleLength_normalized3      health and household titleLength_normalized
## stars_normalized3            health and household       stars_normalized
## boughtInLastMonth3           health and household      boughtInLastMonth
## (Intercept)4            industrial and scientific            (Intercept)
## price_log_normalized4   industrial and scientific   price_log_normalized
## reviews_log_normalized4 industrial and scientific reviews_log_normalized
## titleLength_normalized4 industrial and scientific titleLength_normalized
## stars_normalized4       industrial and scientific       stars_normalized
## boughtInLastMonth4      industrial and scientific      boughtInLastMonth
## (Intercept)5                  sports and outdoors            (Intercept)
## price_log_normalized5         sports and outdoors   price_log_normalized
## reviews_log_normalized5       sports and outdoors reviews_log_normalized
## titleLength_normalized5       sports and outdoors titleLength_normalized
## stars_normalized5             sports and outdoors       stars_normalized
## boughtInLastMonth5            sports and outdoors      boughtInLastMonth
##                              Estimate      Pr.value
## (Intercept)             -5.995897e+00 7.396551e-107
## price_log_normalized     2.094143e-01  2.574981e-01
## reviews_log_normalized   4.959000e+00 9.627752e-235
## titleLength_normalized   5.627646e-01  5.561241e-03
## stars_normalized        -4.992396e-01  1.041951e-01
## boughtInLastMonth        4.333419e-04 2.781977e-192
## (Intercept)1            -5.034137e+00  1.063047e-02
## price_log_normalized1    5.853966e-01  7.121730e-01
## reviews_log_normalized1  2.528969e+00  3.460631e-02
## titleLength_normalized1 -8.362064e-01  5.780971e-01
## stars_normalized1        1.861545e-01  9.359532e-01
## boughtInLastMonth1       1.732279e-03  2.070541e-06
## (Intercept)2             1.036138e+01  2.040041e-01
## price_log_normalized2    1.377640e+01  2.566334e-01
## reviews_log_normalized2 -9.437315e+00  2.004034e-01
## titleLength_normalized2  2.459187e+00  7.650048e-01
## stars_normalized2       -1.695802e+01  9.027500e-02
## boughtInLastMonth2       1.525688e-04  9.896379e-01
## (Intercept)3            -4.289785e+00  2.394680e-09
## price_log_normalized3    6.976084e-01  1.015600e-01
## reviews_log_normalized3  2.842821e+00  1.003731e-10
## titleLength_normalized3  1.487127e+00  1.001416e-02
## stars_normalized3       -1.317049e+00  9.681652e-02
## boughtInLastMonth3       3.812598e-04  1.094556e-45
## (Intercept)4            -7.290577e+00  1.479084e-04
## price_log_normalized4    7.585714e-01  6.414291e-01
## reviews_log_normalized4  3.356419e+00  4.624846e-03
## titleLength_normalized4  4.661728e+00  6.228950e-03
## stars_normalized4        6.865408e-01  7.337143e-01
## boughtInLastMonth4       8.986412e-04  2.919520e-07
## (Intercept)5            -5.101291e+00  6.487583e-06
## price_log_normalized5    6.729650e-01  3.615276e-01
## reviews_log_normalized5  4.582532e+00  1.098178e-12
## titleLength_normalized5 -6.362049e-01  4.325619e-01
## stars_normalized5       -4.269741e-01  7.432733e-01
## boughtInLastMonth5       7.324904e-04  2.212057e-03

#STEP 4: PLOTTING COEFFICIENTS

Coefficient plots allow us to assess both the direction (positive or negative) and the magnitude of effects. These plots enable us to identify which variables hold the greatest influence within each department.

# Looping through each department and display its coefficient
for (department_name in names(coef_plots)) {
  plot <- coef_plots[[department_name]]
  print(plot)
}

#Step 5: Displaying Z Values for All Logistic Regression Models

The z-value is obtained by dividing the regression coefficient (estimate) by its standard error. A positive z-score signifies that the value related to a predictor variable is above the mean average, whereas a negative z-score indicates it is below the mean.

filtered_all_coefs1 <- all_coefs[all_coefs$Department %in% c("all","fashion", "health and household", "automotive", "sports and outdoors", "industrial and scientific"), ]

# Reshaping the data frame to a wide format
wide_df1 <- reshape(filtered_all_coefs1,
                   timevar = "Term",
                   idvar = "Department",
                   direction = "wide",
                   v.names = "Pr.value")

## Warning in reshapeWide(data, idvar = idvar, timevar = timevar, varying =
## varying, : some constant variables (Estimate) are really varying

# Cleaning up the names by removing the prefix ("z.value.") added by the reshape function
names(wide_df1) <- sub("", "", names(wide_df1))

# Setting the row names to the Department names 
rownames(wide_df1) <- wide_df1$Department
wide_df1$Department <- NULL  # Remove the now redundant Department column
wide_df1$"Pr.value.(Intercept)"<- NULL  # Remove the now redundant Department column
wide_df1$Estimate<- NULL  # Remove the now redundant Department column

# Showing the updated data frame with rounded values (excluding one specific column)
wide_df1

##                           Pr.value.price_log_normalized
## all                                           0.2574981
## automotive                                    0.7121730
## fashion                                       0.2566334
## health and household                          0.1015600
## industrial and scientific                     0.6414291
## sports and outdoors                           0.3615276
##                           Pr.value.reviews_log_normalized
## all                                         9.627752e-235
## automotive                                   3.460631e-02
## fashion                                      2.004034e-01
## health and household                         1.003731e-10
## industrial and scientific                    4.624846e-03
## sports and outdoors                          1.098178e-12
##                           Pr.value.titleLength_normalized
## all                                           0.005561241
## automotive                                    0.578097082
## fashion                                       0.765004778
## health and household                          0.010014159
## industrial and scientific                     0.006228950
## sports and outdoors                           0.432561925
##                           Pr.value.stars_normalized Pr.value.boughtInLastMonth
## all                                      0.10419508              2.781977e-192
## automotive                               0.93595319               2.070541e-06
## fashion                                  0.09027500               9.896379e-01
## health and household                     0.09681652               1.094556e-45
## industrial and scientific                0.73371427               2.919520e-07
## sports and outdoors                      0.74327331               2.212057e-03

filtered_all_coefs2 <- all_coefs[all_coefs$Department %in% c("all", "fashion", "health and household", "automotive", "sports and outdoors", "industrial and scientific"), ]

# Reshaping the data frame to a wide format
wide_df2 <- reshape(filtered_all_coefs2,
                   timevar = "Term",
                   idvar = "Department",
                   direction = "wide",
                   v.names = "Estimate")

## Warning in reshapeWide(data, idvar = idvar, timevar = timevar, varying =
## varying, : some constant variables (Pr.value) are really varying

# Cleaning up the names by removing the prefix ("z.value.") added by the reshape function
names(wide_df2) <- sub("", "", names(wide_df2))

# Setting the row names to the Department names 
rownames(wide_df2) <- wide_df2$Department
wide_df2$Department <- NULL  # Remove the now redundant Department column
wide_df2$"Estimate.(Intercept)"<- NULL  # Remove the now redundant Department column
wide_df2$"Pr.value"<- NULL  # Remove the now redundant Department column

# Showing the updated data frame with rounded values (excluding one specific column)
wide_df2

##                           Estimate.price_log_normalized
## all                                           0.2094143
## automotive                                    0.5853966
## fashion                                      13.7763958
## health and household                          0.6976084
## industrial and scientific                     0.7585714
## sports and outdoors                           0.6729650
##                           Estimate.reviews_log_normalized
## all                                              4.959000
## automotive                                       2.528969
## fashion                                         -9.437315
## health and household                             2.842821
## industrial and scientific                        3.356419
## sports and outdoors                              4.582532
##                           Estimate.titleLength_normalized
## all                                             0.5627646
## automotive                                     -0.8362064
## fashion                                         2.4591873
## health and household                            1.4871271
## industrial and scientific                       4.6617282
## sports and outdoors                            -0.6362049
##                           Estimate.stars_normalized Estimate.boughtInLastMonth
## all                                      -0.4992396               0.0004333419
## automotive                                0.1861545               0.0017322794
## fashion                                 -16.9580210               0.0001525688
## health and household                     -1.3170493               0.0003812598
## industrial and scientific                 0.6865408               0.0008986412
## sports and outdoors                      -0.4269741               0.0007324904

Calculating the absolute z-score

Although the z-score alone does not directly reveal the importance of variables, comparing the absolute values of z-scores across predictors can provide insights into their relative significance. Generally, higher absolute z-scores suggest stronger effects.

z_scores <- list()
for (department in names(models)) {
  coefficients <- coef(models[[department]])
  
  # Calculating z-scores for each coefficient
  z_scores[[department]] <- scale(coefficients, center = TRUE, scale = TRUE)
}

z_scores <- list()
for (department in names(models)) {
  # Getting the coefficients from the model summary
  coefficients <- coef(models[[department]])
  
  z_scores[[department]] <- abs(scale(coefficients[-1], center = TRUE, scale = TRUE))
}

z_score_plots <- list()

# Looping through each department's z-scores
for (department in names(z_scores)) {
  variables <- names(models[[department]]$coefficients)[-1]
  
  # Creating a data frame with variable names and absolute z-scores
  z_scores_df <- data.frame(
    Variable = variables,
    Abs_Z_Score = unname(z_scores[[department]])
  )
  
  # Creating a bar plot for absolute z-scores
  z_score_plot <- ggplot(z_scores_df, aes(x = reorder(Variable, Abs_Z_Score), y = Abs_Z_Score)) +
    geom_bar(stat = "identity", fill = "green") +
    labs(title = paste("Absolute Z-Scores of Coefficients for", department),
         x = "Variables",
         y = "Absolute Z-Score") +
    theme_minimal() +
    coord_flip()

  z_score_plots[[department]] <- z_score_plot
  
  
}

# Displaying the plots
for (plot in z_score_plots) {
  print(plot)
}

#Step 6: Plotting the Partial Effect

In addition to the coefficients discussed in the previous section, the partial effect measures how individual predictor variables influence the predicted probability of a positive outcome in a logistic regression model. It quantifies the change in the predicted probability of the outcome for each unit change in the predictor variable, holding all other variables constant.

create_and_display_plots <- function(department, model) {
  # First, we need to compute the partial effects
  partial_effects <- allEffects(model)
  
  # Create ggplot objects for each effect
  plots <- lapply(partial_effects, function(effect) {
    effect_df <- as.data.frame(effect)
    p <- ggplot(effect_df, aes(x = eval(as.name(names(effect_df)[1])), y = fit)) +
      geom_line() +
      geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
      labs(title = paste(names(effect_df)[1]), x = names(effect_df)[1], y = "Effect") +
      theme_minimal() +
      theme(
        plot.title = element_text(size = 8),    # Adjust size: plot titles
        axis.title = element_text(size = 6),    # Adjust size: axis titles
        axis.text.x = element_text(size = 8),    # Adjust size: x-axis text
        axis.text.y = element_text(size = 8)     # Adjust size: y-axis text
      )
    return(p)
  })

  # Combine plots into a grid
  plot_grid <- marrangeGrob(plots, nrow = 2, ncol = 3, top = department)

  # Print the plot grid explicitly
  grid::grid.draw(plot_grid)
}

# Loop through each department model and display plots
for (department in names(models)) {
  model <- models[[department]]
  create_and_display_plots(department, model)
}

#Step 7: Calculating Model Evaluation To evaluate the logistic regression model, we will use ROC (Receiver Operating Characteristic) and AUC (Area Under the ROC Curve).

ROC Curve: This plot shows the trade-off between the true positive rate (sensitivity) and false positive rate (1 - specificity) for different threshold values. It helps visualize how well the model can distinguish between classes.
AUC (Area Under the ROC Curve): AUC quantifies the overall performance of the classifier across all possible thresholds. A higher AUC value indicates better model performance in distinguishing between the two classes.

In this research, we will plot the ROC curve to analyze the model’s performance, and derive the AUC from observing the ROC plot.

model_evaluations <- list()

predicted_probabilities <- predict(models[["all"]], df_test_all, type = "response")
actual_labels_all <- df_test_all$isBestSeller

roc_curve <- roc(actual_labels_all, predicted_probabilities)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

auc_value <- auc(roc_curve)
model_evaluations[["all"]] <- list(
AUC = auc_value,
PredictedProbabilities = predicted_probabilities,
ActualLabels = actual_labels_all
)

# Looping through each department's test dataset
for (department in names(test_dfs)) {
  # Getting the test data frame for the current department
  df_test <- test_dfs[[department]]
  
  # Getting the logistic regression model for the current department
  model_logistic <- models[[department]]
  
  # Predicting probabilities of being a best seller using the model
  predicted_probabilities <- predict(model_logistic, df_test, type = "response")
  
  # Evaluating the model's performance
  actual_labels <- df_test$isBestSeller  
  
  # Explicitly setting factor levels
  actual_labels <- factor(actual_labels, levels = c(0, 1))
  
  roc_curve <- roc(actual_labels, predicted_probabilities)
  auc_value <- auc(roc_curve)
  
  # Storing the model evaluation results
  model_evaluations[[department]] <- list(
    AUC = auc_value,
    PredictedProbabilities = predicted_probabilities,
    ActualLabels = actual_labels
  )
}

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

#Step 8: Plotting ROC Curve for Further Improvement

The ROC curve visually represents the balance between the true positive rate (sensitivity) and the false positive rate (1 - specificity) for various thresholds in a binary classification model. It helps to understand how well the model can differentiate between positive and negative cases across different decision boundaries.

The AUC (Area Under the ROC Curve) serves as a summary metric of the ROC curve’s effectiveness. It measures the overall accuracy of the model, with an AUC of 0.5 reflecting random guessing and an AUC of 1.0 indicating perfect classification.

Upon reviewing the ROC curves for all departments and the top 5 departments, it was found that only the Electronics department has an AUC value below 0.5. This implies that the model’s performance is weakest for the Electronics department compared to the other five departments.

# Width and height for the ROC plot
width <- 15
height <- 15
# Creating an empty list to store ROC curve plots for each department
roc_plots <- list()
# Looping through each department's model evaluations
for (department in names(model_evaluations)) {
  # Getting the predicted probabilities and actual labels for the current department
  predicted_probs <- model_evaluations[[department]]$PredictedProbabilities
  actual_labels <- model_evaluations[[department]]$ActualLabels
  # Calculating the ROC curve
  roc_curve <- roc(actual_labels, predicted_probs)
  # Getting the AUC value from model_evaluations list
  auc_value <- model_evaluations[[department]]$AUC
  # Plotting the ROC curve with AUC value in the title and adjusted size
roc_plot <- plot(roc_curve, main = paste("ROC Curve for", department, "(AUC =", round(auc_value, 2), ")"), width = width, height = height, cex.main = 0.8)
roc_plots[[department]] <- roc_plot
}

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

After assessing the ROC curves, all models (from all departments to the top 5 departments) exhibit relatively high AUC values (>0.5). The strongest models are observed in the all-department, fashion, automotive, and Sports-and-Outdoors departments. Given that none of the AUC values are close to zero (<0.3), we can proceed to evaluate the confusion matrix, comparing true values with predicted values.

#Step 9: Recalculating Evaluation Metrics

Evaluation Metrics:

Accuracy: Measures the proportion of correctly predicted instances (both true positives and true negatives) out of all instances. While straightforward, accuracy can be misleading in imbalanced datasets where one class dominates.

Precision: Quantifies the proportion of true positive predictions out of all positive predictions (both true positives and false positives). High precision indicates that when the model predicts a positive instance, it is likely to be correct. This metric is crucial in scenarios where false positives are costly.

Recall: Measures the proportion of true positive predictions out of all actual positive instances (both true positives and false negatives). High recall indicates that the model captures most of the actual positive instances. It is essential when missing positive cases (false negatives) are critical, such as in disease diagnosis.

F1 Score: The harmonic mean of precision and recall. It balances both metrics and provides a single score that considers both false positives and false negatives. The F1 score is useful when there is a need to strike a balance between precision and recall.

metrics <- list()
evaluation_results <- list()

# Defining a function to calculate the confusion matrix
calculate_confusion_matrix <- function(predicted_labels, actual_labels) {
  TP_A <- sum(predicted_labels == 1 & actual_labels == 1)
  TN_A <- sum(predicted_labels == 0 & actual_labels == 0)
  FP_A <- sum(predicted_labels == 1 & actual_labels == 0)
  FN_A <- sum(predicted_labels == 0 & actual_labels == 1)
  
  confusion_matrix <- matrix(c(TP_A, FN_A, FP_A, TN_A), nrow = 2, byrow = TRUE,
                             dimnames = list(c("Actual Positive (1)", "Actual Negative (0)"),
                                             c("Predicted Positive", "Predicted Negative")))
  return(confusion_matrix)
}

# Looping through each department's model evaluations
for (department in names(model_evaluations)) {
  # Getting the predicted probabilities and actual labels for the current department
  predicted_probs <- model_evaluations[[department]]$PredictedProbabilities
  actual_labels <- model_evaluations[[department]]$ActualLabels
  
  # Calculating predicted labels based on a threshold (e.g., 0.5 for binary classification)
  predicted_labels <- ifelse(predicted_probs >= 0.5, 1, 0)
  
  # Make confusion matrix
  conf_matrix_A <- calculate_confusion_matrix(predicted_labels, actual_labels)
  
  # Calculating evaluation metrics
  accuracy_A <- sum(diag(conf_matrix_A)) / sum(conf_matrix_A)
  precision_A <- conf_matrix_A[1, 1] / sum(conf_matrix_A[1, ])
  recall_A <- conf_matrix_A[1, 1] / sum(conf_matrix_A[, 1])
  f1_score_A <- 2 * (precision_A * recall_A) / (precision_A + recall_A)
  
  # Storing the evaluation metrics and confusion matrix in the metrics list
  metrics[[department]] <- list(
    Accuracy = accuracy_A,
    Precision = precision_A,
    Recall = recall_A,
    F1_Score = f1_score_A,
    Confusion_Matrix = conf_matrix_A
  )
}
for (department in names(metrics)) {
  cat("Department:", department, "\n")
  cat("Accuracy:", metrics[[department]]$Accuracy, "\n")
  cat("Precision:", metrics[[department]]$Precision, "\n")
  cat("Recall:", metrics[[department]]$Recall, "\n")
  cat("F1 Score:", metrics[[department]]$F1_Score, "\n")
  cat("Confusion Matrix:\n")
  print(metrics[[department]]$Confusion_Matrix)
  cat("\n")
}

## Department: all 
## Accuracy: 0.9749171 
## Precision: 0.03700589 
## Recall: 0.4313725 
## F1 Score: 0.06816421 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                 44               1145
## Actual Negative (0)                 58              46714
## 
## Department: automotive 
## Accuracy: 0.9646018 
## Precision: 0.1363636 
## Recall: 0.75 
## F1 Score: 0.2307692 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                  3                 19
## Actual Negative (0)                  1                542
## 
## Department: fashion 
## Accuracy: 0.9411765 
## Precision: 1 
## Recall: 0.5 
## F1 Score: 0.6666667 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                  1                  0
## Actual Negative (0)                  1                 15
## 
## Department: health and household 
## Accuracy: 0.9428571 
## Precision: 0.09326425 
## Recall: 0.5454545 
## F1 Score: 0.159292 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                 18                175
## Actual Negative (0)                 15               3117
## 
## Department: industrial and scientific 
## Accuracy: 0.9645161 
## Precision: 0.1304348 
## Recall: 0.6 
## F1 Score: 0.2142857 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                  3                 20
## Actual Negative (0)                  2                595
## 
## Department: sports and outdoors 
## Accuracy: 0.9514997 
## Precision: 0 
## Recall: 0 
## F1 Score: NaN 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                  0                 75
## Actual Negative (0)                  1               1491

To enhance model performance, we aim to define an optimal threshold for the logistic regression model using the ROC curve to maximize recall and minimize Type II errors (false negatives).

Concept:

After analyzing the Confusion Matrix, particularly focusing on Type II errors (isBestSeller=“yes” predicted as “no”), we recognize the need to improve recall. In this business context, false negatives (Type II errors) are critical as they imply missing potential Best Seller products, impacting client benefits.

Approach:

The optimal threshold determines the probability output where the model classifies an instance as “isBestSeller=‘yes’”. Instead of relying on a fixed threshold (e.g., >0.5), which may not suit all models, we use the ROC curve to identify the threshold that maximizes recall. This method helps mitigate Type II errors by prioritizing the correct identification of true positives (Best Sellers).

Example Scenario:

For instance, setting a threshold based on the ROC curve might lead us to lower the probability threshold for predicting “isBestSeller=‘yes’” to capture more true positive cases, thus reducing false negatives.

By optimizing the threshold in this manner, we aim to enhance the model’s ability to accurately identify potential Best Sellers, aligning with business objectives and maximizing client benefits.

# Looping through each department's ROC curve plots
for (department in names(roc_plots)) {
  # Get the ROC curve data from the roc_plots list
  roc_curve <- roc_plots[[department]]
  
  # Finding the optimal threshold for maximizing recall (sensitivity)
  optimal_threshold <- coords(roc_curve, "best", ret = "threshold", maximize = "sensitivity")$threshold
  
  # Getting the predicted probabilities and actual labels for the current department
  predicted_probs <- model_evaluations[[department]]$PredictedProbabilities
  actual_labels <- model_evaluations[[department]]$ActualLabels
  
  # Calculating predicted labels based on the optimal threshold
  predicted_labels <- ifelse(predicted_probs >= optimal_threshold, 1, 0)
  
  # Creating confusion matrix
  TP <- sum(predicted_labels == 1 & actual_labels == 1)
  FN <- sum(predicted_labels == 0 & actual_labels == 1)
  FP <- sum(predicted_labels == 1 & actual_labels == 0)
  TN <- sum(predicted_labels == 0 & actual_labels == 0)
  
  # Creating confusion matrix with labels
  confusion_matrix <- matrix(c(TP, FN, FP, TN), nrow = 2, byrow = TRUE)
  colnames(confusion_matrix) <- c("Predicted Positive", "Predicted Negative")
  rownames(confusion_matrix) <- c("Actual Positive (1)", "Actual Negative (0)")
  
  # Calculating evaluation metrics
  accuracy <- (TP + TN) / (TP + TN + FP + FN)
  precision <- TP / (TP + FP)
  recall <- TP / (TP + FN)
  f1_score <- 2 * (precision * recall) / (precision + recall)
  
  # Storing the evaluation results in the evaluation_results list
  evaluation_results[[department]] <- list(
    Threshold = optimal_threshold,
    ConfusionMatrix = confusion_matrix,
    Accuracy = accuracy,
    Precision = precision,
    Recall = recall,
    F1_Score = f1_score
  )
}
# Displaying the evaluation results for each department
for (department in names(evaluation_results)) {
  cat("Department:", department, "\n")
  cat("Optimal Threshold:", evaluation_results[[department]]$Threshold, "\n")
  cat("Confusion Matrix:\n")
  print(evaluation_results[[department]]$ConfusionMatrix)
  cat("Accuracy:", evaluation_results[[department]]$Accuracy, "\n")
  cat("Precision:", evaluation_results[[department]]$Precision, "\n")
  cat("Recall:", evaluation_results[[department]]$Recall, "\n")
  cat("F1 Score:", evaluation_results[[department]]$F1_Score, "\n\n")
}

## Department: all 
## Optimal Threshold: 0.02791888 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                744                445
## Actual Negative (0)              11417              35355
## Accuracy: 0.752674 
## Precision: 0.06117918 
## Recall: 0.6257359 
## F1 Score: 0.1114607 
## 
## Department: automotive 
## Optimal Threshold: 0.03612456 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                 17                  5
## Actual Negative (0)                143                400
## Accuracy: 0.7380531 
## Precision: 0.10625 
## Recall: 0.7727273 
## F1 Score: 0.1868132 
## 
## Department: fashion 
## Optimal Threshold: 0.8623695 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                  1                  0
## Actual Negative (0)                  0                 16
## Accuracy: 1 
## Precision: 1 
## Recall: 1 
## F1 Score: 1 
## 
## Department: health and household 
## Optimal Threshold: 0.05193374 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                126                 67
## Actual Negative (0)                842               2290
## Accuracy: 0.7266165 
## Precision: 0.1301653 
## Recall: 0.6528497 
## F1 Score: 0.2170543 
## 
## Department: industrial and scientific 
## Optimal Threshold: 0.0188599 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                 19                  4
## Actual Negative (0)                304                293
## Accuracy: 0.5032258 
## Precision: 0.05882353 
## Recall: 0.826087 
## F1 Score: 0.1098266 
## 
## Department: sports and outdoors 
## Optimal Threshold: 0.05185521 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                 49                 26
## Actual Negative (0)                357               1135
## Accuracy: 0.7555839 
## Precision: 0.1206897 
## Recall: 0.6533333 
## F1 Score: 0.2037422

Following the determination of the optimal threshold to maximize recall using the ROC curve, we present the evaluation metrics:

Upon applying the optimal threshold derived from the ROC curve, we present the evaluation metrics:

# Load the knitr package
library(knitr)

# Creating a data frame from evaluation_results
evaluation_df <- data.frame(
  Optimal_Threshold = sapply(evaluation_results, function(x) x$Threshold),
  Accuracy = sapply(evaluation_results, function(x) x$Accuracy),
  Precision = sapply(evaluation_results, function(x) x$Precision),
  Recall = sapply(evaluation_results, function(x) x$Recall),
  F1_Score = sapply(evaluation_results, function(x) x$F1_Score)
)

# Convert the columns to numeric (if not already)
evaluation_df$Recall <- as.numeric(as.character(evaluation_df$Recall))
evaluation_df$Accuracy <- as.numeric(as.character(evaluation_df$Accuracy))
evaluation_df$Precision <- as.numeric(as.character(evaluation_df$Precision))
evaluation_df$F1_Score <- as.numeric(as.character(evaluation_df$F1_Score))

# Sorting the data frame by recall in descending order
evaluation_df <- evaluation_df[order(-evaluation_df$Recall),]

# Showing the sorted table
kable(evaluation_df, caption = "Evaluation Results Sorted by Recall (Descending)")

Evaluation Results Sorted by Recall (Descending)
	Optimal_Threshold	Accuracy	Precision	Recall	F1_Score
fashion	0.8623695	1.0000000	1.0000000	1.0000000	1.0000000
industrial and scientific	0.0188599	0.5032258	0.0588235	0.8260870	0.1098266
automotive	0.0361246	0.7380531	0.1062500	0.7727273	0.1868132
sports and outdoors	0.0518552	0.7555839	0.1206897	0.6533333	0.2037422
health and household	0.0519337	0.7266165	0.1301653	0.6528497	0.2170543
all	0.0279189	0.7526740	0.0611792	0.6257359	0.1114607

# Load necessary libraries
library(ggplot2)
library(reshape2)

# Creating the data frame from the evaluation results
evaluation_df <- data.frame(
  Department = c("fashion", "industrial and scientific", "automotive", "sports and outdoors", "health and household", "all"),
  Optimal_Threshold = c(0.8623695, 0.0188599, 0.0361246, 0.0518552, 0.0519337, 0.0279189),
  Accuracy = c(1.0000000, 0.5032258, 0.7380531, 0.7555839, 0.7266165, 0.7526740),
  Precision = c(1.0000000, 0.0588235, 0.1062500, 0.1206897, 0.1301653, 0.0611790),
  Recall = c(1.0000000, 0.8260870, 0.7727273, 0.6533333, 0.6528497, 0.6257359),
  F1_Score = c(1.0000000, 0.1098266, 0.1868132, 0.2037422, 0.2170543, 0.1114607)
)

# Melt the data frame for easier plotting with ggplot2
evaluation_melted <- melt(evaluation_df, id.vars = "Department")

# Plot the metrics
ggplot(evaluation_melted, aes(x = Department, y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~variable, scales = "free_y") +
  ggtitle("Evaluation Metrics by Department") +
  ylab("Value") +
  theme_minimal() +
  theme(legend.position = "bottom", legend.title = element_blank(), 
        axis.text.x = element_text(angle = 45, hjust = 1))

3.3. DECISION TREE ANALYSIS

#First, since isBestSeller is numeric boolean 0 and 1, we will change it to "no" and "yes" respectively.
df_model$isBestSeller <- as.character(df_model$isBestSeller)
df_model$isBestSeller <- ifelse(df_model$isBestSeller == 1, "yes", "no")
df_model$department <- gsub("&| |,", "", df_model$department)

vrs <- c("department", "isBestSeller")
df_model[ vrs ] <- lapply(df_model[ vrs ], factor)
remove(vrs)

column_for_base <- c("isBestSeller", "department", "stars", "reviews", "price", "boughtInLastMonth", "titleLength")
column_for_log <- c("isBestSeller", "department", "stars", "reviews_log", "price_log", "boughtInLastMonth_log", "titleLength")
column_for_norm <- c("isBestSeller", "department", "stars_normalized", "reviews_log_normalized", "price_log_normalized","boughtInLastMonth_log", "titleLength_normalized")

# Create the subset of the dataframe
df_model_base <- df_model[column_for_base]
df_model_log <- df_model[column_for_log]
df_model_norm <- df_model[column_for_norm]

create_train_test_split <- function(data, prop) {
  set.seed(34567819)
  
  split <- rsample::initial_split(data, prop = 0.7)
  training_set <- training(split)
  testing_set <- testing(split)
  
  #output
  list(training = training_set, testing = testing_set)
}


main_source = df_model_base
train_test_df <- create_train_test_split(main_source, 0.7)
training <- train_test_df$training
testing <- train_test_df$testing


model <- C50::C5.0(isBestSeller ~., 
                   data = training)
summary(model)

## 
## Call:
## C5.0.formula(formula = isBestSeller ~ ., data = training)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sat Jul 13 02:03:18 2024
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 111907 cases (7 attributes) from undefined.data
## 
## Decision tree:
## 
## boughtInLastMonth <= 600: no (102450/1527)
## boughtInLastMonth > 600:
## :...boughtInLastMonth <= 3000: no (8707/953)
##     boughtInLastMonth > 3000:
##     :...boughtInLastMonth <= 7000: no (561/175)
##         boughtInLastMonth > 7000:
##         :...department in {artsandcrafts,bathandbody,electronics,
##             :              patioandgarden,sportsandoutdoors}: no (18/5)
##             department in {automotive,babyproducts,clothingandaccessories,
##             :              fashion,healthandhousehold,industrialandscientific,
##             :              Other,toolsandhomeimprovement,
##             :              toysandgames}: yes (82/28)
##             department = beauty:
##             :...stars <= 4.3: no (3)
##             :   stars > 4.3: yes (23/10)
##             department = homeandkitchen:
##             :...reviews <= 17445: no (29/9)
##             :   reviews > 17445: yes (14/2)
##             department = petsupplies:
##             :...titleLength <= 88: no (8)
##                 titleLength > 88: yes (12/4)
## 
## 
## Evaluation on training data (111907 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      11 2713( 2.4%)   <<
## 
## 
##      (a)    (b)    <-classified as
##    -----  -----
##   109107     44    (a): class no
##     2669     87    (b): class yes
## 
## 
##  Attribute usage:
## 
##  100.00% boughtInLastMonth
##    0.17% department
##    0.04% reviews
##    0.02% stars
##    0.02% titleLength
## 
## 
## Time: 0.5 secs

model_boost <- C5.0(isBestSeller ~.,
                    data = main_source,
                    trials = 10)

# Predicting on the testing data


pred.test <- predict(model_boost, testing)
CrossTable(testing$isBestSeller, pred.test,
           prop.chisq = FALSE,
           prop.c = FALSE,
           prop.r = FALSE,
           prop.t = FALSE,
           dnn = c("Actual", "Predicted"))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |-------------------------|
## 
##  
## Total Observations in Table:  47961 
## 
##  
##              | Predicted 
##       Actual |        no |       yes | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |     46709 |        22 |     46731 | 
## -------------|-----------|-----------|-----------|
##          yes |      1194 |        36 |      1230 | 
## -------------|-----------|-----------|-----------|
## Column Total |     47903 |        58 |     47961 | 
## -------------|-----------|-----------|-----------|
## 
##

input_no_tree = 50
model.forest <- randomForest::randomForest(isBestSeller ~., 
                                           data = main_source,
                                           ntree = input_no_tree, 
                                           mtry = 2, 
                                           replace = TRUE,
                                           importance = TRUE)
                                           
pred.test <- predict(model.forest, testing)
CrossTable(testing$isBestSeller, pred.test,
           prop.chisq = FALSE,
           prop.c = FALSE,
           prop.r = FALSE,
           prop.t = FALSE,
           dnn = c("Actual", "Predicted"))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |-------------------------|
## 
##  
## Total Observations in Table:  47961 
## 
##  
##              | Predicted 
##       Actual |        no |       yes | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |     46731 |         0 |     46731 | 
## -------------|-----------|-----------|-----------|
##          yes |        83 |      1147 |      1230 | 
## -------------|-----------|-----------|-----------|
## Column Total |     46814 |      1147 |     47961 | 
## -------------|-----------|-----------|-----------|
## 
##

input_no_tree = 100
model.forest <- randomForest::randomForest(isBestSeller ~., 
                                           data = main_source,
                                           ntree = input_no_tree, 
                                           mtry = 2, 
                                           replace = TRUE,
                                           importance = TRUE)

pred.test <- predict(model.forest, testing)
CrossTable(testing$isBestSeller, pred.test,
           prop.chisq = FALSE,
           prop.c = FALSE,
           prop.r = FALSE,
           prop.t = FALSE,
           dnn = c("Actual", "Predicted"))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |-------------------------|
## 
##  
## Total Observations in Table:  47961 
## 
##  
##              | Predicted 
##       Actual |        no |       yes | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |     46731 |         0 |     46731 | 
## -------------|-----------|-----------|-----------|
##          yes |        32 |      1198 |      1230 | 
## -------------|-----------|-----------|-----------|
## Column Total |     46763 |      1198 |     47961 | 
## -------------|-----------|-----------|-----------|
## 
##

DECISION TREE:

The model output shows that 111907 cases (70% of 159868 cases )were utilized to grow the tree and number of features: 7. The decision tree can be interpreted as follows: i. If ‘boughtInLastMonth’ is less than/equal to 600 then it is not a best seller. ii. Else, ‘boughtInLastMonth’ is less than/equal to 3000, then it is not a best seller. iii. Else, ‘boughtInLastMonth’ is less than/equal to 7000, then it is not a best seller; and if greater then it is a best seller. The confusion matrix:

1.⁠ ⁠True positives: the number of cases where the model correctly predicts the positive class 2.⁠ ⁠True negatives: the number of cases where the model correctly predicts the negative class 3.⁠ ⁠False positives: the number of cases where the model incorrectly predicts the negative class as positive 4.⁠ ⁠False negatives: the number of cases where the model incorrectly predicts the positive class as negative In this model, we want to predict department cases with best-selling products, hence best selling product is our positive class. The attribute usage shows high importance of ‘boughtInLastMonth’ with relevance of 100%. To further enhance the result, we evaluate by applying it to unseen testing data, and boosting by setting number of trials to 10, 50 & 100. Initially, the model has an accuracy of ((TP+TN)/n) = ((46709+36)/47961) = 97.46%. However, upon further boosting we get 99.82% and 99.99% accuracy respectively.

# Load necessary libraries
library(C50)
library(randomForest)
library(caret)
library(ggplot2)
library(reshape2)

# User's code for decision tree models
main_source = df_model_base
train_test_df <- create_train_test_split(main_source, 0.7)
training <- train_test_df$training
testing <- train_test_df$testing

# Boosted Decision Tree Model
model_boost <- C5.0(isBestSeller ~., data = main_source, trials = 10)

# Predicting on the testing data
pred.test_boost <- predict(model_boost, testing)
CrossTable(testing$isBestSeller, pred.test_boost, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, prop.t = FALSE, dnn = c("Actual", "Predicted"))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |-------------------------|
## 
##  
## Total Observations in Table:  47961 
## 
##  
##              | Predicted 
##       Actual |        no |       yes | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |     46709 |        22 |     46731 | 
## -------------|-----------|-----------|-----------|
##          yes |      1194 |        36 |      1230 | 
## -------------|-----------|-----------|-----------|
## Column Total |     47903 |        58 |     47961 | 
## -------------|-----------|-----------|-----------|
## 
##

# Random Forest Model with 50 Trees
input_no_tree = 50
model_forest_50 <- randomForest::randomForest(isBestSeller ~., data = main_source, ntree = input_no_tree, mtry = 2, replace = TRUE, importance = TRUE)

# Predicting on the testing data
pred.test_forest_50 <- predict(model_forest_50, testing)
CrossTable(testing$isBestSeller, pred.test_forest_50, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, prop.t = FALSE, dnn = c("Actual", "Predicted"))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |-------------------------|
## 
##  
## Total Observations in Table:  47961 
## 
##  
##              | Predicted 
##       Actual |        no |       yes | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |     46731 |         0 |     46731 | 
## -------------|-----------|-----------|-----------|
##          yes |        91 |      1139 |      1230 | 
## -------------|-----------|-----------|-----------|
## Column Total |     46822 |      1139 |     47961 | 
## -------------|-----------|-----------|-----------|
## 
##

# Random Forest Model with 100 Trees
input_no_tree = 100
model_forest_100 <- randomForest::randomForest(isBestSeller ~., data = main_source, ntree = input_no_tree, mtry = 2, replace = TRUE, importance = TRUE)

# Predicting on the testing data
pred.test_forest_100 <- predict(model_forest_100, testing)
CrossTable(testing$isBestSeller, pred.test_forest_100, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, prop.t = FALSE, dnn = c("Actual", "Predicted"))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |-------------------------|
## 
##  
## Total Observations in Table:  47961 
## 
##  
##              | Predicted 
##       Actual |        no |       yes | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |     46731 |         0 |     46731 | 
## -------------|-----------|-----------|-----------|
##          yes |        34 |      1196 |      1230 | 
## -------------|-----------|-----------|-----------|
## Column Total |     46765 |      1196 |     47961 | 
## -------------|-----------|-----------|-----------|
## 
##

# Function to calculate performance metrics
evaluate_model <- function(actual, predicted) {
  confusion <- confusionMatrix(as.factor(predicted), as.factor(actual))
  accuracy <- confusion$overall['Accuracy']
  precision <- confusion$byClass['Pos Pred Value']
  recall <- confusion$byClass['Sensitivity']
  f1 <- 2 * ((precision * recall) / (precision + recall))
  return(c(accuracy, precision, recall, f1))
}

# Evaluate models
metrics <- data.frame(
  Model = c("Boosted Decision Tree", "Random Forest (50 Trees)", "Random Forest (100 Trees)"),
  Accuracy = numeric(3),
  Precision = numeric(3),
  Recall = numeric(3),
  F1_Score = numeric(3)
)

# Boosted Decision Tree
metrics[1, 2:5] <- evaluate_model(testing$isBestSeller, pred.test_boost)

# Random Forest (50 Trees)
metrics[2, 2:5] <- evaluate_model(testing$isBestSeller, pred.test_forest_50)

# Random Forest (100 Trees)
metrics[3, 2:5] <- evaluate_model(testing$isBestSeller, pred.test_forest_100)

# Melt the data frame for easier plotting with ggplot2
metrics_melted <- melt(metrics, id.vars = "Model")

# Plot the metrics with values displayed on the bars
ggplot(metrics_melted, aes(x = Model, y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge", color = "black") +
  geom_text(aes(label = round(value, 2)), vjust = -0.5, position = position_dodge(0.9), size = 3) +
  facet_wrap(~variable, scales = "free_y") +
  ggtitle("Evaluation Metrics by Model") +
  ylab("Value") +
  xlab("Model") +
  theme_minimal() +
  theme(legend.position = "bottom", legend.title = element_blank(), 
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Set3") +  # Use a color palette for better differentiation
  scale_y_continuous(expand = expansion(mult = c(0, 0.2)))  # Adjust y-axis limits for better spacing

PART 4: CONCLUSION & RECOMMENDATIONS

4.1. CONCLUSION

# Load necessary libraries
library(ggplot2)
library(reshape2)

# Create data frame for logistic regression (conversation 1)
logistic_evaluation_df <- data.frame(
  Model = "Logistic Regression",
  Accuracy = 0.7526740,
  Precision = 0.0611790,
  Recall = 0.6257359,
  F1_Score = 0.1114607
)

# Create data frame for decision tree models (conversation 2)
decision_tree_metrics <- data.frame(
  Model = c("Boosted Decision Tree", "Random Forest (50 Trees)", "Random Forest (100 Trees)"),
  Accuracy = c(0.97, 1.00, 1.00),
  Precision = c(0.98, 1.00, 1.00),
  Recall = c(1.00, 1.00, 1.00),
  F1_Score = c(0.99, 1.00, 1.00)
)

# Melt the data frames for easier plotting with ggplot2
logistic_melted <- melt(logistic_evaluation_df, id.vars = "Model")
decision_tree_melted <- melt(decision_tree_metrics, id.vars = "Model")

# Combine the melted data frames
combined_melted <- rbind(logistic_melted, decision_tree_melted)

# Plot the combined metrics with values displayed on the bars
ggplot(combined_melted, aes(x = Model, y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge", color = "black") +
  geom_text(aes(label = round(value, 2)), vjust = -0.5, position = position_dodge(0.9), size = 3) +
  facet_wrap(~variable, scales = "free_y") +
  ggtitle("Comparison of Evaluation Metrics by Model - Decision Tree, Random Forest(50/100) and Logistic Regression") +
  ylab("Value") +
  xlab("Model") +
  theme_minimal() +
  theme(legend.position = "bottom", legend.title = element_blank(), 
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Set3") +  # Use a color palette for better differentiation
  scale_y_continuous(expand = expansion(mult = c(0, 0.2)))  # Adjust y-axis limits for better spacing

To analyze the performance of various evaluation models using factors of Accuracy, Recall, Precision and F1 score. We have utilized models of Logistic Regression, Boosted Decision Tree, Random Forest (50 Trees and 100 Trees). a. Accuracy: This determines the model’s quality of prediction. b. Precision: This determines the probability of getting true positive predictions of total positive predictions and as less false positive results as possible. c. Recall: This defines the proportion true positive predictions to total number of actual positive cases achieved by the model, i.e. ability of the model to find all positive cases. d. F1 score: This is a combination of precision and recall, that provides a single metric accounting for both. We can clearly observe that Random Forest and Decision Tree have comparatively much greater accuracy, precision, recall and F1 score, than the Logistic Regression.

4.2. RECOMMENDATIONS

Upon analyzing Amazon UK’s e-commerce dataset, we can suggest following strategic recommendations to target the business objectives:

Business Question 1: A new upcoming vendor should target which department for better sales probability and performance?

As observed, the vendor can focus on following five “Best Seller” departments:

1.⁠ ⁠Sports & Outdoors

2.⁠ ⁠Automotive

3.⁠ ⁠Industrial and Scientific

4.⁠ ⁠Health and Household

5.⁠ ⁠Fashion

Targeting these categories can help vendors improve their possibilities of tapping the best-selling market.

Business Question 2: Based on the analysis of the best-selling products in the top 5 departments, what price range should a vendor aim for to optimize sales in each department?

# Load necessary libraries
library(dplyr)

# Assume df_model is the main dataframe used in the previous code

# Step 1: Identify the top 5 departments based on the percentage of best-selling products
top_departments <- df_model %>%
  group_by(department) %>%
  summarise(
    isBestSeller_Count = sum(isBestSeller == "yes"),  # Count of best sellers
    Product_Count = n(),  # Total products in each department
    Percentage_BestSeller = round((isBestSeller_Count / Product_Count) * 100, 2)  # Percentage of best sellers
  ) %>%
  arrange(desc(Percentage_BestSeller)) %>%
  head(5) %>%
  pull(department)  # Extract the department names

# Step 2: For each top department, find the product that sold the most and its price
best_selling_products <- df_model %>%
  filter(department %in% top_departments) %>%
  group_by(department, title) %>%
  summarise(
    Total_Sold = sum(boughtInLastMonth),
    Price = mean(price)  # Assuming price is the same for all records of a product
  ) %>%
  arrange(department, desc(Total_Sold)) %>%
  slice(1)  # Select the top product for each department

## `summarise()` has grouped output by 'department'. You can override using the
## `.groups` argument.

# Display the results
best_selling_products

## # A tibble: 5 × 4
## # Groups:   department [5]
##   department              title                                 Total_Sold Price
##   <fct>                   <chr>                                      <int> <dbl>
## 1 automotive              Prestone PSCW0039A Screen Wash for C…       5000  4.5 
## 2 fashion                 A little wish for starting uni wish …        300  3.75
## 3 healthandhousehold      Amazon Brand – Mama Bear Sensitive U…      40000 12.6 
## 4 industrialandscientific Gloveman Clear Vinyl Gloves (Box of …      10000  5.99
## 5 sportsandoutdoors       Warrior Supplements Essentials Creat…      10000 12.0

# Load necessary library
library(ggplot2)

# Create the data frame
best_selling_products <- data.frame(
  department = c("automotive", "fashion", "healthandhousehold", "industrialandscientific", "sportsandoutdoors"),
  title = c("Prestone PSCW0039A Screen Wash for Cars, Concentrate makes up to 50 Litres of Screenwash, protects to −18C, 2.5 Litre, Yellow",
            "A little wish for starting uni wish string bracelet | Gift for Uni | Good luck at university gift | Uni survival gift",
            "Amazon Brand – Mama Bear Sensitive Unscented Baby Wipes, 1008 Count (18 Packs of 56)",
            "Gloveman Clear Vinyl Gloves (Box of 100) (Medium)",
            "Warrior Supplements Essentials Creatine Powder, 300 g"),
  Total_Sold = c(5000, 300, 40000, 10000, 10000),
  Price = c(4.50, 3.75, 12.60, 5.99, 11.95)
)

# Plot the data
ggplot(best_selling_products, aes(x = department, y = Price, fill = department)) +
  geom_bar(stat = "identity", width = 0.6, color = "black") +
  geom_text(aes(label = paste0("£", Price)), vjust = -0.5, size = 5) +
  labs(title = "Optimal Price Points for Best-Selling Products in Top Departments",
       x = "Department",
       y = "Price (in £)") +
  ylim(0, 15) +  # Increase the upper limit of y-axis to avoid cutting off the label
  theme_minimal() +
  theme(legend.position = "none", 
        axis.text.x = element_text(angle = 45, hjust = 1, size = 12),
        axis.title.x = element_text(size = 14),
        axis.title.y = element_text(size = 14),
        plot.title = element_text(size = 16, face = "bold"))

Based on the analysis of the best-selling products in the top 5 departments, vendors should aim to sell their products within the following price ranges to aim for maximum sales in each department:

Automotive : Around £4.50 Fashion: Around £3.75 Health and Household: Around £12.60 Industrial and Scientific: Around £5.99 Sports and Outdoors: Around £11.95

These price points are derived from the highest-selling products within each respective department, providing a benchmark for optimal pricing strategies.

Business Question 3: Which features influence products to become a bestseller on Amazon UK?

The vendors should focus on improving following tactics to tap potential market better and efficiently:

Customer Reviews: The number of reviews acts as a significant predictor of bestseller status. Vendors can enhance this by encouraging customers by sending feedback emails and providing incentives for their efforts, and additionally react to those feedback to enhance credibility and trust.

Recent Sales: The ‘boughtInLastMonth’ variable is highly influential criteria for most of the departments. Vendors can focus on targeted marketing using coupons or special deals to increase the probability of recent sales and eventually having a multiplicative effect.

Refine Product Titles: Vendors can focus on creative product title name that improves the attractiveness and quick easy information transfer to grab attention

Price: Interestingly, price is not a significant factor in most departments, except for a positive association in the Sports and Outdoors category.

Business Question 4: What features must be concentrated on to maximize probability of product becoming a bestseller?

1.⁠ ⁠Price Factors:

Fashion Category: High-priced items tend to move towards becoming bestsellers.

- Recommendation: Work on improving your quality and the brand rather than simply always trying to undercut. Emphasize the premium features and Unique Selling Proposition in your products.

Other Categories: ⁠Observation - Price is not working as the key factor.

- Recommendation: Maintain and improve the display of products. Marketing and make the product showcase informative.

2.⁠ ⁠Review and Rating Strategy: Observation - High star rating does not guarantee success. The number of reviews and recent sales performance, on the other hand, is a much more telling sign of the success of a product.

- ⁠Recommendation ⁠Across Categories: Follow upwith the customer by email after sale is completed, in pursuit of a review on the product. Focus marketing on products with a larger number of recent sales, though not necessarily a higher rating.

Scanning through the products that enjoy ratings slightly below the average and which sell well, after identifying the drivers. This could be competitive pricing or good promotions; it could also be even a product feature. Businesses will thus have to clarify these factors to fit the strategies better with what drives bestseller status within respective categories.

Group 14_Business_Analytics_Amazon_UK

2024-07-12