BUSINESS ANALYTICS PROJECT - GROUP 14

Group Members:

Anurag Chavan (5921740) Anuj Goyal (6003087) Bhavesh Abhang (5935814) Prithvi Thakker

PART 1: INTRODUCTION

Business analytics is the study of data using operations research and statistical analysis, and it has become integral to the success of many organizations. From a business and managerial perspective, it is a vital tool that managers use to gain knowledge about their organization and to make insightful decisions. As a result, business analytics gives managers key information for predicting the future of the organization’s sales (Fouladirad, Neal, Ituarte, Alexander, & Ghareeb, n.d.).

The modern-day retailer is no longer just a seller of products but also a data analyst. eCommerce data analysis has become a vital tool for gaining insights into customers, market trends, and competitive intelligence. By leveraging data analysis, retailers can gain a better understanding of their customer base, identify best sellers in their product categories, and make informed decisions on pricing, marketing, and inventory levels (Hillel, 2023).

The global eCommerce market is expected to be worth $6.3 trillion in 2024, up from $5.8 trillion in 2023. This continuous growth makes eCommerce one of the most competitive industries. The heightened competitiveness has pushed businesses to find ways to gain an edge over their competitors. The best option they have is to turn to the vast amounts of data shared by customers (Poddębniak, 2024).

Amazon’s business is built on the Internet and e-commerce. It is entirely customer-focused, operates its own logistics system, and allows any person or company, however small or large, to sell products through a highly optimized, easy-to-use platform where everything related to the account, brand, and products can be managed. Its evolution as a company owes much to the Long Tail economic model: maintaining an almost infinite supply (the long tail) to meet high demand and provide each consumer with the product they are looking for, however niche it may be (Liftingroup, n.d.).

In addition to its e-commerce platform, Amazon provides Amazon Web Services (AWS) for cloud hosting solutions, helping businesses, non-profit organizations, and government agencies deliver low-cost websites and applications. Amazon also offers streaming services such as Amazon Prime Video and Amazon Music (Liftingroup, n.d.).

However, this project concentrates on its eCommerce business, which offers retailers the opportunity to expand and internationalize their businesses by connecting with new clients and generating strong visibility.

BUSINESS PROBLEM

The business problem is to analyze the current data to understand which products, and in which categories, are trending and best sellers. With the help of business analysis of the existing data, businesses can gain a better understanding of consumer preferences and build more effective products and marketing by analysing which items are popular. This insight would be extremely helpful for vendors on the platform seeking to gain an advantage and increase sales in the highly competitive e-commerce market.

REFERENCES

  1. Fouladirad, M., Neal, J., Ituarte, J. V., Alexander, J., & Ghareeb, A. (n.d.). Entertaining data: Business analytics and Netflix. Retrieved from https://www.researchgate.net/profile/Joshua-Alexander-4/publication/333998454_Entertaining_Data_Business_Analytics_and_Netflix/links/5d121fe692851cf4404a5a9a/Entertaining-Data-Business-Analytics-and-Netflix.pdf

  2. Hillel, O. (2023, January 18). How to use eCommerce data analysis for best seller intelligence. DataCluster. Retrieved from
    https://datacluster.com/ecommerce-data-analysis-for-best-seller-intelligence#:~:text=It%20involves%20gathering%20and%20analyzing,for%20product%20innovation%20and%20development

  3. Poddębniak, M. (2024, June 20). What is ecommerce analytics and how can you use it to grow your business. Piwik PRO. Retrieved from https://piwik.pro/blog/what-is-ecommerce-analytics-and-how-can-you-use-it-to-grow-your-business/

  4. Liftingroup. (n.d.). Amazon and its business model and growth. Retrieved from https://www.liftingroup.com/en/expertise/business-model-amazon/

Installing Packages

In this tutorial, we will use the following packages. These are bundles of functions that extend base R functionality in numerous ways.

# List of packages
packages <- c("ggplot2", "gridExtra", "corrplot", "stringr", "lattice", "dplyr", 
              "fuzzyjoin", "httr", "magick", "C50", "rsample", "gmodels", 
              "randomForestExplainer", "party", "randomForest", "caret", "effects", 
              "pROC", "textreg", "coefplot", "vip", "visreg", "texreg")

# Install packages that are not already installed
install.packages(setdiff(packages, rownames(installed.packages())))

# Check if Rtools is needed and prompt installation for Windows users
if (.Platform$OS.type == "windows" && !requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
  devtools::install_github("r-lib/pkgbuild")
  pkgbuild::check_rtools()
  if (!pkgbuild::has_rtools()) {
    stop("Rtools is required but not installed. Please install Rtools from https://cran.rstudio.com/bin/windows/Rtools/")
  }
}

# Visualization packages
suppressWarnings(library(ggplot2))
suppressWarnings(library(gridExtra))
suppressWarnings(library(corrplot))
## corrplot 0.92 loaded
# Data Cleaning packages
suppressWarnings(library(stringr))
suppressWarnings(library(lattice))
suppressWarnings(library(dplyr))
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
suppressWarnings(library(fuzzyjoin))
suppressWarnings(library(httr))
suppressWarnings(library(magick))
## Linking to ImageMagick 6.9.12.93
## Enabled features: cairo, fontconfig, freetype, heic, lcms, pango, raw, rsvg, webp
## Disabled features: fftw, ghostscript, x11
# Decision Tree Analysis (Classification) packages
suppressWarnings(library(C50))
suppressWarnings(library(rsample))
suppressWarnings(library(gmodels))
suppressWarnings(library(randomForestExplainer))
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
suppressWarnings(library(party))
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## 
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
## 
##     boundary
## 
## Attaching package: 'party'
## The following object is masked from 'package:dplyr':
## 
##     where
suppressWarnings(library(randomForest))
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Logistic Regression packages
suppressWarnings(library(caret))
## 
## Attaching package: 'caret'
## The following object is masked from 'package:httr':
## 
##     progress
suppressWarnings(library(effects))
## Loading required package: carData
## Use the command
##     lattice::trellis.par.set(effectsTheme())
##   to customize lattice options for effects plots.
## See ?effectsTheme for details.
suppressWarnings(library(pROC))
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following object is masked from 'package:gmodels':
## 
##     ci
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
suppressWarnings(library(textreg))
suppressWarnings(library(coefplot))
suppressWarnings(library(vip))
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
suppressWarnings(library(visreg))
suppressWarnings(library(texreg))
## Version:  1.39.3
## Date:     2023-11-09
## Author:   Philip Leifeld (University of Essex)
## 
## Consider submitting praise using the praise or praise_interactive functions.
## Please cite the JSS article in your publications -- see citation("texreg").
# Check loaded packages
loaded_packages <- c("ggplot2", "gridExtra", "corrplot", "stringr", "lattice", "dplyr", 
                     "fuzzyjoin", "httr", "magick", "C50", "rsample", "gmodels", 
                     "randomForestExplainer", "party", "randomForest", "caret", "effects", 
                     "pROC", "textreg", "coefplot", "vip", "visreg", "texreg")

loaded_successfully <- loaded_packages[sapply(loaded_packages, require, character.only = TRUE, quietly = TRUE)]
not_loaded <- setdiff(loaded_packages, loaded_successfully)

if (length(not_loaded) > 0) {
  warning("The following packages were not loaded successfully: ", paste(not_loaded, collapse = ", "))
}

DATASET SETUP

In this assignment, we’ll be using Amazon Products UK 2023 Dataset for our analysis.

dataset_init <- read.csv("amz_uk_processed_data.csv")

PART 2: DATA EXPLORATION & PREPARATION

PART 2.1: DATA EXPLORATION

In data exploration, we want to get to know the data that we will work with. Here we use relevant R functions and data visualizations to explore the data and variables. We have manually created the department variable to broadly group the existing categories into segments, reducing the number of categories (230) to 18 departments. This is part of data preparation, but we need it before we start exploring and visualizing.

# Create a mapping of original categories to new categories
category_mapping <- list(
  "industrial and scientific" = c("Lab & Scientific Products", "Industrial Electrical", "Test & Measurement", "3D Printing & Scanning", "Abrasive & Finishing Products", "Professional Medical Supplies", "Cutting Tools", "Material Handling Products", "Packaging & Shipping Supplies", "Agricultural Equipment & Supplies", "Construction Machinery"),
  
  "toys and games" = c("Toys and Games", "Games & Game Accessories", "Dolls & Accessories", "Sports Toys & Outdoor", "Kids' Play Vehicles", "Kids' Play Figures", "Baby & Toddler Toys", "Learning & Education Toys", "Toy Advent Calendars", "Electronic Toys", "Building & Construction Toys", "Remote & App-Controlled Devices", "Kids' Dress Up & Pretend Play", "Soft Toys", "Jigsaws & Puzzles"),
  
  "sports and outdoors" = c("Sports & Outdoors", "Ski Goggles", "Skiing Poles", "Hiking Hand & Foot Warmers", "Women's Sports & Outdoor Shoes", "Cycling Shoes", "Ski Clothing", "Men's Sports & Outdoor Shoes", "Darts & Dartboards", "Table Tennis", "Billiard, Snooker & Pool", "Bowling", "Trampolines & Accessories", "Ballet & Dancing Footwear", "Sports Supplements"),
  
  "health and household" = c("Health & Personal Care", "Thermometers & Meteorological Instruments", "Household Batteries, Chargers & Accessories", "Vacuums & Floorcare", "Large Appliances", "Ironing & Steamers"),
  
  "arts and crafts" = c("Arts & Crafts", "Kids' Art & Craft Supplies", "Art & Craft Supplies", "Pens, Pencils & Writing Supplies"),
  
  "electronics" = c("Wearable Technology", "PC Gaming Accessories", "USB Gadgets", "Projectors", "CPUs", "Computer Screws", "3D Printers", "Barebone PCs", "Mobile Phone Accessories", "Computers, Components & Accessories", "Home Audio Record Players", "Radios & Boomboxes", "Car & Vehicle Electronics", "Monitor Accessories", "Cables & Accessories", "Computer Cases", "KVM Switches", "Printers & Accessories", "Telephones, VoIP & Accessories", "Surveillance Cameras", "Tripods & Monopods", "Mobile Phones & Communication", "Electrical Power Accessories", "Tablet Accessories", "Keyboards, Mice & Input Devices", "Headphones & Earphones", "Smartwatches", "Office Electronics", "Internal TV Tuner & Video Capture Cards", "Headphones, Earphones & Accessories", "Camera & Photo Accessories", "Home Cinema, TV & Video", "Hi-Fi & Home Audio Accessories", "Portable Sound & Video Products", "USB Hubs", "Computer Audio & Video Accessories", "Adapters", "Computer & Server Racks", "Hard Drive Accessories", "Networking Devices", "Data Storage", "Mobile Phones & Smartphones", "Digital Cameras", "Film Cameras", "Binoculars, Telescopes & Optics", "Hi-Fi Receivers & Separates", "Hi-Fi Speakers"),
  
  "automotive" = c("Car & Motorbike", "Motorbike Clothing", "Motorbike Accessories", "Motorbike Batteries", "Motorbike Boots & Luggage", "Motorbike Exhaust & Exhaust Systems", "Motorbike Handlebars, Controls & Grips", "Motorbike Lighting", "Motorbike Seat Covers", "Motorbike Electrical & Batteries"),
  
  "baby products" = c("Baby", "Handmade Baby Products"),
  
  "patio and garden" = c("Plants, Seeds & Bulbs", "Garden Furniture & Accessories", "Bird & Wildlife Care", "Mowers & Outdoor Power Tools", "Outdoor Heaters & Fire Pits", "Garden Décor", "Gardening", "Outdoor Cooking", "Decking & Fencing", "Pools, Hot Tubs & Supplies", "Garden Storage & Housing", "Garden Tools & Watering Equipment"),
  
  "home and kitchen" = c("Handmade Home & Kitchen Products", "Hardware", "Storage & Home Organisation", "Fireplaces, Stoves & Accessories", "Furniture & Lighting", "Storage & Organisation", "Living Room Furniture", "Bedding & Linen", "Curtain & Blind Accessories", "Home Brewing & Wine Making", "Tableware", "Kitchen Storage & Organisation", "Kitchen Tools & Gadgets", "Cookware", "Water Coolers, Filters & Cartridges", "Beer, Wine & Spirits", "Small Kitchen Appliances", "Kitchen & Bath Fixtures", "Home Fragrance", "Window Treatments", "Home Entertainment Furniture", "Dining Room Furniture", "Home Bar Furniture", "Kitchen Linen", "Mattress Pads & Toppers", "Children's Bedding", "Bedding Accessories", "Decorative Artificial Flora", "Candles & Holders", "Signs & Plaques", "Home Office Furniture", "Inflatable Beds, Pillows & Accessories", "Bedding Collections", "Bakeware", "Grocery", "Photo Frames", "Rugs, Pads & Protectors", "Mirrors", "Clocks", "Doormats", "Decorative Home Accessories", "Boxes & Organisers", "Slipcovers", "Vases", "Bedroom Furniture", "Hallway Furniture", "Coffee, Tea & Espresso"),
  
  "bath and body" = c("Bath & Body", "Bathroom Lighting", "Rough Plumbing", "Bathroom Furniture", "Bathroom Linen"),
  
  "tools and home improvement" = c("Light Bulbs", "Heating, Cooling & Air Quality", "Power & Hand Tools", "Painting Supplies, Tools & Wall Treatments", "Building Supplies", "Safety & Security", "Lights and switches", "Plugs", "Electrical", "Lighting", "Outdoor Lighting", "Torches", "Indoor Lighting", "String Lights"),
  
  "beauty" = c("Fragrances", "Skin Care", "Beauty", "Manicure & Pedicure Products", "Make-up", "Hair Care"),
  
  "clothing and accessories" = c("Boys", "Girls", "Handmade Clothing, Shoes & Accessories", "Men", "Luggage and travel gear", "Women"),
  
  "health and wellness" = c("Professional Medical Supplies", "Sports Supplements"),
  
  "pet supplies" = c("Pet Supplies"),
  
  "fashion" = c("Handmade Jewellery")
)

# Function to map original category to new category
map_category <- function(original_category) {
  for (new_cat in names(category_mapping)) {
    if (original_category %in% category_mapping[[new_cat]]) {
      return(new_cat)
    }
  }
  return("Other")  # For any categories that don't match
}

# Create new variable with mapped categories
dataset_init$department <- sapply(dataset_init$categoryName, map_category)
# This code gives us an overview of the list of variables available to us. There are a total of 11 variables from which to gain insight.

colnames(dataset_init)
##  [1] "asin"              "title"             "imgUrl"           
##  [4] "productURL"        "stars"             "reviews"          
##  [7] "price"             "isBestSeller"      "boughtInLastMonth"
## [10] "categoryName"      "department"
head(dataset_init)
##         asin
## 1 B09B96TG33
## 2 B01HTH3C8S
## 3 B09B8YWXDF
## 4 B09B8T5VGV
## 5 B09WX6QD65
## 6 B09B97WSLF
##                                                                                                                                                                                                title
## 1                                                                                Echo Dot (5th generation, 2022 release) | Big vibrant sound Wi-Fi and Bluetooth smart speaker with Alexa | Charcoal
## 2 Anker Soundcore mini, Super-Portable Bluetooth Speaker with 15-Hour Playtime, 66-Foot Bluetooth Range, Wireless Speaker with Enhanced Bass, Noise-Cancelling Microphone, for Outdoor, Travel, Home
## 3                                                                           Echo Dot (5th generation, 2022 release) | Big vibrant sound Wi-Fi and Bluetooth smart speaker with Alexa | Deep Sea Blue
## 4                                                                 Echo Dot with clock (5th generation, 2022 release) | Bigger vibrant sound Wi-Fi and Bluetooth smart speaker and Alexa | Cloud Blue
## 5                                                                                                  Introducing Echo Pop | Full sound compact Wi-Fi and Bluetooth smart speaker with Alexa | Charcoal
## 6                                                              Echo Dot with clock (5th generation, 2022 release) | Bigger vibrant sound Wi-Fi and Bluetooth smart speaker and Alexa | Glacier White
##                                                           imgUrl
## 1 https://m.media-amazon.com/images/I/71C3lbbeLsL._AC_UL320_.jpg
## 2 https://m.media-amazon.com/images/I/61c5rSxwP0L._AC_UL320_.jpg
## 3 https://m.media-amazon.com/images/I/61j3SEUjMJL._AC_UL320_.jpg
## 4 https://m.media-amazon.com/images/I/71yf6yTNWSL._AC_UL320_.jpg
## 5 https://m.media-amazon.com/images/I/613dEoF9-rL._AC_UL320_.jpg
## 6 https://m.media-amazon.com/images/I/71bnkM5q8CL._AC_UL320_.jpg
##                               productURL stars reviews price isBestSeller
## 1 https://www.amazon.co.uk/dp/B09B96TG33   4.7   15308 21.99        False
## 2 https://www.amazon.co.uk/dp/B01HTH3C8S   4.7   98099 23.99         True
## 3 https://www.amazon.co.uk/dp/B09B8YWXDF   4.7   15308 21.99        False
## 4 https://www.amazon.co.uk/dp/B09B8T5VGV   4.7    7205 31.99        False
## 5 https://www.amazon.co.uk/dp/B09WX6QD65   4.6    1881 17.99        False
## 6 https://www.amazon.co.uk/dp/B09B97WSLF   4.7    7205 31.99        False
##   boughtInLastMonth   categoryName  department
## 1                 0 Hi-Fi Speakers electronics
## 2                 0 Hi-Fi Speakers electronics
## 3                 0 Hi-Fi Speakers electronics
## 4                 0 Hi-Fi Speakers electronics
## 5                 0 Hi-Fi Speakers electronics
## 6                 0 Hi-Fi Speakers electronics
# The summary() function provides a summary of the elements within an object, most commonly a data frame or vector, reporting the Minimum, 1st Quartile, Median, Mean, 3rd Quartile, and Maximum for numeric columns.

summary(dataset_init)
##      asin              title              imgUrl           productURL       
##  Length:2222742     Length:2222742     Length:2222742     Length:2222742    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      stars          reviews              price           isBestSeller      
##  Min.   :0.000   Min.   :      0.0   Min.   :     0.00   Length:2222742    
##  1st Qu.:0.000   1st Qu.:      0.0   1st Qu.:    10.00   Class :character  
##  Median :0.000   Median :      0.0   Median :    19.90   Mode  :character  
##  Mean   :2.032   Mean   :    382.2   Mean   :    94.26                     
##  3rd Qu.:4.400   3rd Qu.:     44.0   3rd Qu.:    47.71                     
##  Max.   :5.000   Max.   :1356658.0   Max.   :100000.00                     
##  boughtInLastMonth  categoryName        department       
##  Min.   :    0.00   Length:2222742     Length:2222742    
##  1st Qu.:    0.00   Class :character   Class :character  
##  Median :    0.00   Mode  :character   Mode  :character  
##  Mean   :   18.57                                        
##  3rd Qu.:    0.00                                        
##  Max.   :50000.00

PART 2.1.1: EXPLORATION OF NUMERIC VARIABLES

Let us now check the skewness by finding the mean, median, and standard deviation of every numeric variable. This helps in identifying and addressing any potential unreliability in the data.

# Summary and Standard Deviation of stars
stars_summary <- summary(dataset_init$stars)
stars_sd <- sd(dataset_init$stars)
cat("Summary of stars:\n", stars_summary, "\n")
## Summary of stars:
##  0 0 0 2.03187 4.4 5
cat("Standard Deviation of stars: ", stars_sd, "\n\n")
## Standard Deviation of stars:  2.185497
# Summary and Standard Deviation of reviews
reviews_summary <- summary(dataset_init$reviews)
reviews_sd <- sd(dataset_init$reviews)
cat("Summary of reviews:\n", reviews_summary, "\n")
## Summary of reviews:
##  0 0 0 382.1617 44 1356658
cat("Standard Deviation of reviews: ", reviews_sd, "\n\n")
## Standard Deviation of reviews:  5020.752
# Summary and Standard Deviation of price
price_summary <- summary(dataset_init$price)
price_sd <- sd(dataset_init$price)
cat("Summary of price:\n", price_summary, "\n")
## Summary of price:
##  0 10 19.9 94.25737 47.71 1e+05
cat("Standard Deviation of price: ", price_sd, "\n\n")
## Standard Deviation of price:  360.6225
# Summary and Standard Deviation of boughtInLastMonth
boughtInLastMonth_summary <- summary(dataset_init$boughtInLastMonth)
boughtInLastMonth_sd <- sd(dataset_init$boughtInLastMonth)
cat("Summary of boughtInLastMonth:\n", boughtInLastMonth_summary, "\n")
## Summary of boughtInLastMonth:
##  0 0 0 18.56902 0 50000
cat("Standard Deviation of boughtInLastMonth: ", boughtInLastMonth_sd, "\n")
## Standard Deviation of boughtInLastMonth:  191.903
# This shows us that there are no null values in the data set.

colSums(is.na(dataset_init))
##              asin             title            imgUrl        productURL 
##                 0                 0                 0                 0 
##             stars           reviews             price      isBestSeller 
##                 0                 0                 0                 0 
## boughtInLastMonth      categoryName        department 
##                 0                 0                 0
# The str function in R provides a compact and readable summary of the structure of an R object. 

str(dataset_init)
## 'data.frame':    2222742 obs. of  11 variables:
##  $ asin             : chr  "B09B96TG33" "B01HTH3C8S" "B09B8YWXDF" "B09B8T5VGV" ...
##  $ title            : chr  "Echo Dot (5th generation, 2022 release) | Big vibrant sound Wi-Fi and Bluetooth smart speaker with Alexa | Charcoal" "Anker Soundcore mini, Super-Portable Bluetooth Speaker with 15-Hour Playtime, 66-Foot Bluetooth Range, Wireless"| __truncated__ "Echo Dot (5th generation, 2022 release) | Big vibrant sound Wi-Fi and Bluetooth smart speaker with Alexa | Deep Sea Blue" "Echo Dot with clock (5th generation, 2022 release) | Bigger vibrant sound Wi-Fi and Bluetooth smart speaker and"| __truncated__ ...
##  $ imgUrl           : chr  "https://m.media-amazon.com/images/I/71C3lbbeLsL._AC_UL320_.jpg" "https://m.media-amazon.com/images/I/61c5rSxwP0L._AC_UL320_.jpg" "https://m.media-amazon.com/images/I/61j3SEUjMJL._AC_UL320_.jpg" "https://m.media-amazon.com/images/I/71yf6yTNWSL._AC_UL320_.jpg" ...
##  $ productURL       : chr  "https://www.amazon.co.uk/dp/B09B96TG33" "https://www.amazon.co.uk/dp/B01HTH3C8S" "https://www.amazon.co.uk/dp/B09B8YWXDF" "https://www.amazon.co.uk/dp/B09B8T5VGV" ...
##  $ stars            : num  4.7 4.7 4.7 4.7 4.6 4.7 4.7 4.7 4.7 4.5 ...
##  $ reviews          : int  15308 98099 15308 7205 1881 7205 15308 103673 29909 16014 ...
##  $ price            : num  22 24 22 32 18 ...
##  $ isBestSeller     : chr  "False" "True" "False" "False" ...
##  $ boughtInLastMonth: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ categoryName     : chr  "Hi-Fi Speakers" "Hi-Fi Speakers" "Hi-Fi Speakers" "Hi-Fi Speakers" ...
##  $ department       : chr  "electronics" "electronics" "electronics" "electronics" ...

Exploring the outliers (Numeric Variables)

To explore the outliers, we need to examine summaries of the numeric variables. In total we have four numeric variables, as can be seen above. We find the mean and standard deviation and check the skewness using ggplot boxplots and histograms. Visualizing the data helps us assess its reliability and avoid complications later.

BOXPLOT

# Load necessary libraries
library(ggplot2)
library(gridExtra)


# Define the columns for which boxplots will be created
variables <- c("stars", "reviews", "price", "boughtInLastMonth")

# Initialize a list to store plots
plots_list <- list()

# Create boxplots for each variable
for (var in variables) {
  # Create the plot for the current variable
  plot <- ggplot(dataset_init, aes(y = .data[[var]])) +
    geom_boxplot() +
    ggtitle(paste("Boxplot of", var)) +
    theme(plot.title = element_text(hjust = 0.5))
  
  # Add the plot to the list
  plots_list[[length(plots_list) + 1]] <- plot
}

# Arrange the plots in a 2 by 3 grid
grid.arrange(grobs = plots_list, ncol = 3, padding = unit(5, "lines"))

# Initialize a data frame to store outliers
outliers <- data.frame(Variable = character(), Outlier_Value = numeric(), stringsAsFactors = FALSE)

# Define a function to find outliers for each variable
find_outliers <- function(var) {
  # Calculate quartiles
  q1 <- quantile(dataset_init[[var]], 0.25)
  q3 <- quantile(dataset_init[[var]], 0.75)
  
  # Calculate IQR (inter-quartile range)
  iqr <- q3 - q1
  
  # Define upper and lower bounds for outliers
  upper_bound <- q3 + 1.5 * iqr
  lower_bound <- q1 - 1.5 * iqr
  
  # Find outliers
  var_outliers <- dataset_init[dataset_init[[var]] > upper_bound | dataset_init[[var]] < lower_bound, var]
  
  # Check if outliers exist
  if (length(var_outliers) > 0) {
    # Add outliers to the outliers data frame
    outliers <<- rbind(outliers, data.frame(Variable = var, Outlier_Value = var_outliers))
  }
}

# Apply the function to each variable
for (var in variables) {
  find_outliers(var)
}

#print (outliers)

Outlier Identification from Box Plots

  1. stars: Ratings range from 0 to 5, with a median of 0 and a third quartile of 4.4 (IQR = 4.4), indicating a large number of unrated (0-star) items alongside a cluster of highly rated products. The boxplot shows no outliers, since the IQR-based whiskers already cover the full 0 to 5 rating scale.

  2. reviews: The reviews variable has a median of 0, so at least half of the products have no reviews at all. The interquartile range (IQR = 44.0) shows that the middle 50% of values lie close to zero. The many high outliers mean that most items have few reviews while a small number have very high counts, so the distribution is strongly right-skewed (mean > median).

  3. price: The median price is 19.90, with the middle 50% of values between 10.00 and 47.71 (IQR = 37.71). There are many high outliers, showing that most products are inexpensive while some sell at extremely high prices; the distribution is therefore strongly right-skewed (mean > median) and concentrated at the lower end of the range.

  4. boughtInLastMonth: The boxplot shows a median of 0 and an IQR of 0, so most products recorded no purchases last month and every non-zero value appears as an outlier. The large number of high outliers indicates substantial variability in purchase counts. (The sketch below tallies these outliers per variable.)
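
As a rough cross-check of the points above, the following sketch (which assumes the `outliers` data frame populated by find_outliers() earlier) tallies the IQR-based outliers per variable and compares each variable's mean with its median as a simple indicator of right skew.

# Count IQR-based outliers per variable (sketch)
table(outliers$Variable)

# Compare mean and median for each numeric variable; mean > median suggests right skew
round(sapply(dataset_init[, variables], function(x) c(mean = mean(x), median = median(x))), 2)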

HISTOGRAM

# Define the columns for which histograms will be created
columns <- c("stars", "reviews", "price", "boughtInLastMonth")

# Initialize a list to store the plots
hist_plots <- list()

# Create histograms for each variable
for (col in columns) {
  # Create the histogram plot for the current variable
  hist_plot <- ggplot(dataset_init, aes(x = !!sym(col))) +
    geom_histogram() +
    ggtitle(paste("Distribution of", col)) +
    theme(plot.title = element_text(hjust = 0.5), 
          axis.text.x = element_text(angle = 45, hjust = 1))
  
  # Add the histogram plot to the list
  hist_plots[[length(hist_plots) + 1]] <- hist_plot
}

# Arrange the histograms in a grid
grid.arrange(grobs = hist_plots, ncol = 3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Outlier Identification from Histogram:

  1. stars: Many products have a star rating close to 4 or 5, while a significant number have a rating of 0, which likely indicates products that have not been rated yet. The distribution is therefore bimodal: a spike at 0 and a concentration of rated products at the higher end of the scale.

  2. reviews: Most products have very few reviews, with a significant number of products having zero reviews. The distribution is heavily right-skewed, indicating that a small number of products receive a very high number of reviews.

  3. price: Prices are concentrated at the low end of the range, and the distribution is heavily right-skewed, indicating that higher-priced products are comparatively rare.

  4. boughtInLastMonth: A significant number of products have zero purchases in the last month. The distribution is heavily right-skewed, with very few products having high purchase numbers.

Data Visualization

We can see from the outliers that there are many zero values in the four numeric variables shown above. These zero values are of no use to us; we cannot perform meaningful operations on them or derive insights from them. So we will identify the rows with zero values, bind them together, and finally eliminate them. Specifically, we will find the rows in which any of the numeric variables reviews, price, or boughtInLastMonth is 0. We do not take the stars variable into consideration, as it can logically be assumed to be directly related to reviews: if the other numeric variables are zero, stars can be ignored for convenience.

Both the histograms and the boxplot highlight a common trend: a significant number of products have minimal engagement (in terms of ratings, reviews, prices, and purchases), while a smaller subset of products exhibits very high engagement. This indicates a typical long-tail distribution common in e-commerce platforms, where a small number of products dominate sales and engagement metrics.

The zero values for stars, reviews, price, and purchases indicate the need for data cleaning. Products with zero values for these variables might be excluded or treated separately to avoid skewing the analysis.

This analysis provides a clear picture of the data distribution, helping to identify potential areas for deeper investigation and data cleaning efforts.

FINDING TOTAL NUMBER OF PRODUCTS WITH ZERO VALUES

In total, 2,062,874 products have a zero value in at least one of the variables reviews, price, and boughtInLastMonth. We will remove these rows in the following steps.

library(dplyr)

# Define the columns to check for zero values
check_columns <- c("reviews", "price", "boughtInLastMonth", "isBestSeller")

# Extract rows where any of the specified columns contain zeros
zero_value_subset <- dataset_init %>% 
  filter(if_any(all_of(check_columns), ~ . == 0))

# Calculate the total number of products with zero values
total_zero_value_products <- nrow(zero_value_subset)

# Print the total number of products with zero values
cat("Total number of products with zero values:", total_zero_value_products, "\n")
## Total number of products with zero values: 2062874
# Optionally, print the subset with zero values for inspection
# print("Subset with zero values:")
# print(zero_value_subset)

# Find duplicates in the subset
duplicates <- zero_value_subset %>% 
  duplicated() %>% 
  which()

# Extract the duplicate rows
duplicate_rows <- zero_value_subset[duplicates, ]

PERCENTAGE OF ZERO AND NON ZERO VALUES IN THE VARIABLES

The graph below shows, for each variable, the percentage of products with zero and non-zero values. For price, the zero bar is essentially absent, indicating that there are virtually no products with a price of 0 in the dataset.

# Load necessary libraries
library(ggplot2)
library(reshape2)

# Calculate the percentages of zero and non-zero values
percentages <- sapply(dataset_init[, c("stars", "reviews", "price", "boughtInLastMonth")], function(x) {
  num_zeros <- sum(x == 0, na.rm = TRUE)
  total_values <- sum(!is.na(x))
  percentage_zeros <- (num_zeros / total_values) * 100
  percentage_non_zeros <- 100 - percentage_zeros
  c(zero = percentage_zeros, non_zero = percentage_non_zeros)
})

# Convert to data frame and reshape for ggplot2
percentages_df <- as.data.frame(t(percentages))
percentages_df$variable <- rownames(percentages_df)
percentages_long <- melt(percentages_df, id.vars = "variable", variable.name = "type", value.name = "percentage")

# Create the bar graph
ggplot(percentages_long, aes(x = variable, y = percentage, fill = type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Percentages of Zero and Non-Zero Values in the variables",
       x = "Variable",
       y = "Percentage",
       fill = "Value Type") +
  theme_minimal() +
  theme(legend.position = "bottom")

# Calculate the percentage and counts of products with both stars and reviews equal to 0
num_both_zeros <- sum(dataset_init$stars == 0 & dataset_init$reviews == 0, na.rm = TRUE)
total_products <- nrow(dataset_init)
percentage_both_zeros <- (num_both_zeros / total_products) * 100

# Print the percentage and counts of both zeros
cat("Percentage of products with both stars and reviews equal to 0:", percentage_both_zeros, "%\n")
## Percentage of products with both stars and reviews equal to 0: 52.83654 %
cat("Number of products with both stars and reviews equal to 0:", num_both_zeros, "\n")
## Number of products with both stars and reviews equal to 0: 1174420
# Create a data frame for plotting
zeros_data <- data.frame(Variable = "Stars and Reviews", Percentage = percentage_both_zeros)

# Create the bar graph
ggplot(zeros_data, aes(x = Variable, y = Percentage, fill = Variable)) +
  geom_bar(stat = "identity") +
  labs(title = "Percentage of Products with Both Stars and Reviews Equal to 0",
       x = "Variable",
       y = "Percentage") +
  theme_minimal() +
  theme(legend.position = "none")

This output indicates that around 52% of all products have both stars and reviews equal to 0.

EXPLORING NON-NUMERICAL VARIABLES

We have six non-numeric variables in the original data, plus the derived department variable.

VISUALIZATION 1: BAR PLOT OF PRODUCT COUNT BY DEPARTMENT

The graph below shows the number of products listed on Amazon in each department.

library(ggplot2)

# Count the occurrences of each category
category_counts <- table(dataset_init$department)
category_counts_dataset_init <- as.data.frame(category_counts)

# Create a bar plot
ggplot(category_counts_dataset_init, aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity") +
  labs(title = "Product Categories", x = "Department", y = "Sales Quantity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(hjust = 3))

Now, we will find the top 5 departments with the highest percentage of best-seller products. After that, we will visualize these top 5 departments, comparing total product counts and best-seller counts within them. In turn, this will help vendors decide in which departments products are most likely to achieve best-seller ranking, supporting faster decision making.

VISUALIZATION 2: TOP 5 DEPARTMENTS BY NUMBER OF PRODUCTS

# Select top 5 departments

# Calculate the frequency of each category

category_counts <- table(dataset_init$department)

# Sort the categories by frequency in descending order
sorted_categories <- sort(category_counts, decreasing = TRUE)

# Select the top 5 categories
top_categories <- names(sorted_categories)[1:5]

print(top_categories)
## [1] "sports and outdoors" "Other"               "electronics"        
## [4] "home and kitchen"    "toys and games"

VISUALIZATION 3: TABLE OF DEPARTMENTS FOR BEST SELLING PRODUCTS BY PERCENTAGE

The table below lists all departments with their best-seller counts and total product counts. From this, we select the top 5 departments with the highest best-seller percentage.

# Load necessary library
library(dplyr)

# Group by categoryName and calculate counts
category <- dataset_init %>%
  group_by(department) %>%
  summarise(
    isBestSeller_Count = sum(isBestSeller == "True"),  # Count of best sellers
    Product_Count = n(),  # Total products in each category
    Percentage_BestSeller = round((isBestSeller_Count / Product_Count) * 100, 2)  # Percentage of best sellers
  ) %>%
  arrange(desc(Percentage_BestSeller))  # Arrange by Percentage_BestSeller in descending order

# Output the updated data frame
category
## # A tibble: 17 × 4
##    department             isBestSeller_Count Product_Count Percentage_BestSeller
##    <chr>                               <int>         <int>                 <dbl>
##  1 pet supplies                          284          9430                  3.01
##  2 health and household                  668         34638                  1.93
##  3 baby products                         192         15463                  1.24
##  4 arts and crafts                       238         24537                  0.97
##  5 home and kitchen                     1039        139299                  0.75
##  6 tools and home improv…                524         71476                  0.73
##  7 clothing and accessor…                576         85798                  0.67
##  8 industrial and scient…                119         23040                  0.52
##  9 patio and garden                      266         78286                  0.34
## 10 automotive                            107         38149                  0.28
## 11 bath and body                          68         27422                  0.25
## 12 Other                                 710        335613                  0.21
## 13 beauty                                193         92439                  0.21
## 14 toys and games                        250        124193                  0.2 
## 15 electronics                           395        247967                  0.16
## 16 fashion                                10         12420                  0.08
## 17 sports and outdoors                   379        862572                  0.04
# Select top 5 departments with highest Percentage_BestSeller
top_departments <- head(category, 5)

# Set up the plotting area with wider margins and larger plot size
par(mar = c(5, 5, 4, 4) + 0.5)
options(repr.plot.width=10, repr.plot.height=8)

# Calculate percentages for pie chart
product_percentages <- top_departments$Product_Count / sum(top_departments$Product_Count) * 100
best_seller_percentages <- top_departments$isBestSeller_Count / sum(top_departments$isBestSeller_Count) * 100

# Create a pie chart for Product_Count
pie(product_percentages, 
    labels = paste(top_departments$department, " (", round(product_percentages, 1), "%)"), 
    main = "Product Count by Department in Top 5 Departments",
    col = rainbow(length(product_percentages)),
    cex = 0.8,  # Adjust label size
    radius = 0.9,  # Adjust pie size
    clockwise = TRUE)

This code creates a pie chart showing the distribution of Product_Count across the top 5 departments with the highest Percentage_BestSeller values. Home and kitchen has the largest product count among these five departments.
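
For completeness, the best-seller counts computed above (best_seller_percentages) can be shown in the same way; this is a sketch rather than part of the original output, reusing the objects already created.

# Sketch: companion pie chart of best-seller counts for the same top 5 departments
pie(best_seller_percentages,
    labels = paste(top_departments$department, " (", round(best_seller_percentages, 1), "%)"),
    main = "Best Seller Count by Department in Top 5 Departments",
    col = rainbow(length(best_seller_percentages)),
    cex = 0.8,
    radius = 0.9,
    clockwise = TRUE)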

VISUALIZATION 4: SCATTER PLOT OF PRICE VS REVIEWS

library(ggplot2)
plot(dataset_init$price, dataset_init$reviews, pch = 16, col = "blue", cex = 1,
     main = "Price vs. Number of Reviews", xlab = "Price", ylab = "Number of Reviews",
     xlim = c(min(dataset_init$price, na.rm = TRUE), max(dataset_init$price, na.rm = TRUE)),
     ylim = c(min(dataset_init$reviews, na.rm = TRUE), max(dataset_init$reviews, na.rm = TRUE)))

# Add a grid
grid()

# Add a legend
legend("topright", legend = "Data Points", pch = 16, col = "black", cex = 1)

VISUALIZATION 5: SCATTER PLOT OF STARS VS PRICE

plot(dataset_init$stars, dataset_init$price, pch = 16, col = "blue", cex = 1,
     main = "Stars vs. Price", xlab = "Stars", ylab = "Price",
     xlim = c(min(dataset_init$stars, na.rm = TRUE), max(dataset_init$stars, na.rm = TRUE)),
     ylim = c(min(dataset_init$price, na.rm = TRUE), max(dataset_init$price, na.rm = TRUE)))

# Add a grid
grid()

# Add a legend
legend("topright", legend = "Data Points", pch = 16, col = "black", cex = 1)

VISUALIZATION 6: SCATTER PLOT OF STARS VS REVIEWS

plot(dataset_init$stars, dataset_init$reviews, pch = 16, col = "blue", cex = 1,
     main = "Stars vs. Reviews", xlab = "Stars", ylab = "Reviews",
     xlim = c(min(dataset_init$stars, na.rm = TRUE), max(dataset_init$stars, na.rm = TRUE)),
     ylim = c(min(dataset_init$reviews, na.rm = TRUE), max(dataset_init$reviews, na.rm = TRUE)))

# Add a grid
grid()

# Add a legend
legend("topright", legend = "Data Points", pch = 16, col = "black", cex = 1)

The scatter plots above suggest a positive association between star ratings and the number of reviews: products with more reviews tend to have higher star ratings. Prices, in contrast, are spread fairly evenly across the different star ratings. This suggests that higher-rated products attract more attention, while the relationship between rating and price appears weaker.
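
To put rough numbers on these visual impressions, a quick correlation check can be run on the raw data. This is a sketch only; a formal correlation table on the cleaned data appears in Section 3.1.

# Sketch: pairwise Pearson correlations between stars, reviews and price
round(cor(dataset_init[, c("stars", "reviews", "price")]), 3)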

SECTION 2.2: DATA CLEANING AND PREPARATION

DEALING WITH OUTLIERS

# Remove rows where no sales were made last month
paste("Number of Rows that will be deleted, because no sales were made last month",sum(dataset_init$boughtInLastMonth == 0))
## [1] "Number of Rows that will be deleted, because no sales were made last month 2061427"
dataset_init <- dataset_init[dataset_init$boughtInLastMonth != 0.00, ] 

# Remove products that did not receive any reviews
dataset_init <- dataset_init[dataset_init$reviews != 0.00, ] 

paste("Finally, the final dataset contains rows: ", nrow(dataset_init))
## [1] "Finally, the final dataset contains rows:  159868"
# Find the entries in department with blankspace

indices_with_blankspace <- grep("\\s", dataset_init$department)
departments_with_blankspace <- dataset_init[indices_with_blankspace, ]
length(unique(departments_with_blankspace$department))
## [1] 12
dataset_init$cdepartment <- gsub("  ", " & ", dataset_init$department)
# print(dataset_init$department)

FEATURE ENGINEERING

Make non-numeric variables numeric. The first step is to encode isBestSeller as a binary variable: True = 1 and False = 0.

dataset_init$isBestSeller <- ifelse(dataset_init$isBestSeller == "True", 1, 0)
glimpse(dataset_init)
## Rows: 159,868
## Columns: 12
## $ asin              <chr> "B01N4V4X5M", "B07WVP26FR", "B083K77PZL", "B09J47Y78…
## $ title             <chr> "Upgraded, Anker Soundcore Boost Bluetooth Speaker w…
## $ imgUrl            <chr> "https://m.media-amazon.com/images/I/71ZVrRZLTcL._AC…
## $ productURL        <chr> "https://www.amazon.co.uk/dp/B01N4V4X5M", "https://w…
## $ stars             <dbl> 4.7, 4.4, 4.6, 4.4, 4.7, 4.3, 4.3, 4.6, 4.5, 4.4, 4.…
## $ reviews           <int> 29387, 2493, 2979, 804, 3208, 1279, 15963, 4641, 202…
## $ price             <dbl> 39.98, 15.99, 65.59, 22.99, 34.99, 21.99, 39.90, 39.…
## $ isBestSeller      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ boughtInLastMonth <int> 600, 1000, 200, 200, 200, 100, 100, 100, 50, 50, 50,…
## $ categoryName      <chr> "Hi-Fi Speakers", "Hi-Fi Speakers", "Hi-Fi Speakers"…
## $ department        <chr> "electronics", "electronics", "electronics", "electr…
## $ cdepartment       <chr> "electronics", "electronics", "electronics", "electr…

2.4: Data transformation

To address the right-skewness of the continuous variables, log transformation will be applied. It is done by reducing the magnitude of larger values and spreading out smaller values. By applying log transformation, we aim to make the distribution of continuous variables more symmetric, which can improve the performance of the statistical models.

Classification Tree: Log transformation can help balance the distribution of predictor variables, making splits more equitable and leading to better decision boundaries. This can potentially result in more accurate predictions by reducing bias towards dominant classes.

Logistic Regression: Log transformation can mitigate the influence of outliers and improve the linearity assumption between predictor variables and the log odds of the response variable. This can enhance the interpretability and stability of the logistic regression coefficients, leading to more reliable estimates of the probability of class membership.
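
As a small illustration (a sketch, not part of the original analysis) of why log(1 + x) is appropriate here: it leaves zeros at zero while compressing very large counts, which is exactly what the heavily right-skewed variables need.

# Sketch: log(1 + x) compresses large values while keeping 0 at 0
x <- c(0, 1, 10, 100, 10000, 1000000)
round(log(1 + x), 2)
# Base R also provides log1p(x), a numerically stable equivalent of log(1 + x)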

Next, we focus on numeric variables that have shown skewness in the earlier histograms. Title length and categorical/binary variables would be excluded from this process.

# Create faceted histograms
columns <- c( "price", "boughtInLastMonth", "reviews")

# Create a list to store the plots
plots <- list()

# Create histograms for each variable
for (col in columns) {
  plot <- ggplot(dataset_init, aes(x = !!sym(col))) +
    geom_histogram() +
    ggtitle(paste(col, "Distribution")) +
    theme(plot.title = element_text(hjust = 0.5, size = unit(10, "mm")), 
          axis.text.x = element_text(angle = 45, hjust = 1))

  
  plots[[length(plots) + 1]] <- plot
}

# Arrange histograms in a grid
grid.arrange(grobs = plots, ncol = 4, width = 30, height = 20)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

After the log transformation, we visualize the dataset in histograms to confirm that the skewness of these variables has been successfully addressed.

# Apply log transformation and save to new variables
dataset_init$reviews_log <- log(1 + dataset_init$reviews)
dataset_init$price_log <- log(1 + dataset_init$price)
dataset_init$boughtInLastMonth_log <- log(1 + dataset_init$boughtInLastMonth)



# Create faceted histograms
columns <- c("reviews_log", "price_log", "boughtInLastMonth_log","stars")

# Create a list to store the plots
plots <- list()

# Create histograms for each variable
for (col in columns) {
  plot <- ggplot(dataset_init, aes(x = !!sym(col))) +
    geom_histogram() +
    ggtitle(paste(col, "Distribution")) +
    theme(plot.title = element_text(hjust = 0.5, size = unit(8, "mm")), 
          axis.text.x = element_text(angle = 45, hjust = 1))
  
  plots[[length(plots) + 1]] <- plot
}

# Arrange histograms in a grid
grid.arrange(grobs = plots, ncol = 3,  width = 20, height = 20)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

No normalization or standardization is needed for binary categorical variables, so 'isBestSeller' does not need to be normalized or standardized.

Data normalization is useful when working with algorithms sensitive to feature scaling (such as k-nearest neighbours or neural networks). Normalizing will make these algorithms converge faster. Normalization is also useful when features have different units or scales or when you need values in a bounded interval such as [0, 1] or another specific range.

Based on the above reasoning, the following variables will be normalized using the min-max scaling approach: price_log, stars, reviews_log, and boughtInLastMonth_log.

# Define a function to perform min-max scaling
min_max_scaling <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Apply min-max scaling to the specified variables and add them to the original dataframe
dataset_init <- dataset_init %>%
  mutate(price_log_normalized = min_max_scaling(price_log),
         stars_normalized = min_max_scaling(stars),
         reviews_log_normalized = min_max_scaling(reviews_log),
         boughtInLastMonth_log_normalized = min_max_scaling(boughtInLastMonth_log),
      )

# Visualize the normalized variables to confirm they're on the same scale.

columns <- c("price_log_normalized", "stars_normalized", "reviews_log_normalized", "boughtInLastMonth_log_normalized")

# Create a list to store the plots
plots <- list()

# Create histograms for each variable
for (col in columns) {
  plot <- ggplot(dataset_init, aes(x = !!sym(col))) +
    geom_histogram() +
    ggtitle(paste(col, "Distribution")) +
    theme(plot.title = element_text(hjust = 0.5, size = unit(8, "mm")))
  
  plots[[length(plots) + 1]] <- plot
}

# Arrange histograms in a grid
grid.arrange(grobs = plots, ncol = 3, width = 30, height = 20)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

At this point, we have created quite a few new variables. Based on the following list of column names, we will choose the variables that can be used in our classification tree and logistic regression models.
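
A quick way to review the full list of columns before selecting model inputs is shown below (a minimal sketch; output omitted).

# Sketch: list all columns, including the log-transformed and normalized versions created above
colnames(dataset_init)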

2.5: Initial plotting

As a reminder, the goal of this research is to answer two questions: which departments sellers should focus on, and how a product can become a best seller. To make this concrete, let us examine the number of best sellers in each department.

# Load necessary library
library(ggplot2)

# plot2: Horizontal bar plot of products per department
plot2 <- ggplot(dataset_init, aes(y = department)) +
  geom_bar() +
  labs(title = "Products per Department", x = "Count", y = "Department") +
  theme(plot.title = element_text(hjust = 0.5, size = 14))

# plot1: Horizontal bar plot of bestsellers per department
plot1 <- ggplot(dataset_init, aes(y = department, fill = isBestSeller)) +
  geom_bar(position = "dodge") +
  labs(title = "Bestsellers per Department", x = "Count", y = "Department") +
  theme(plot.title = element_text(hjust = 0.5, size = 14))


# Display the plots
print(plot1)
## Warning: The following aesthetics were dropped during statistical transformation: fill.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

print(plot2)

# Load necessary libraries
library(ggplot2)
library(dplyr)

# Calculate the percentage of bestsellers per category
category_bestseller_pct <- dataset_init %>%
  group_by(department) %>%
  summarize(total = n(),
            bestsellers = sum(isBestSeller == TRUE)) %>%
  mutate(percentage = (bestsellers / total) * 100) %>%
  arrange(desc(percentage))

# Print the dataframe to check values
print(category_bestseller_pct)
## # A tibble: 17 × 4
##    department                 total bestsellers percentage
##    <chr>                      <int>       <int>      <dbl>
##  1 fashion                       56           4      7.14 
##  2 health and household       11082         617      5.57 
##  3 sports and outdoors         5222         230      4.40 
##  4 automotive                  1882          75      3.99 
##  5 industrial and scientific   2064          72      3.49 
##  6 clothing and accessories    9767         329      3.37 
##  7 patio and garden            6486         200      3.08 
##  8 pet supplies                9100         276      3.03 
##  9 baby products               4909         144      2.93 
## 10 home and kitchen           27424         796      2.90 
## 11 Other                      15475         345      2.23 
## 12 arts and crafts            10747         219      2.04 
## 13 tools and home improvement  9517         176      1.85 
## 14 toys and games             15227         224      1.47 
## 15 bath and body               2978          41      1.38 
## 16 electronics                 4676          62      1.33 
## 17 beauty                     23256         176      0.757
# Plot: Horizontal bar plot of percentage of bestsellers per category
plot3 <- ggplot(category_bestseller_pct, aes(y = reorder(department, percentage), x = percentage)) +
  geom_bar(stat = "identity", fill = "magenta") +
  labs(title = "Percentage of Bestsellers per Category", x = "Percentage", y = "Category") +
  theme(plot.title = element_text(hjust = 0.5, size = 14), 
        axis.text.y = element_text(size = 12))

# Display the plot
print(plot3)

2.6: Reflections:

Through this critical phase of data exploration and preparation, we concluded that exploring and understanding each variable is an important step, as it lets us identify the challenges in the data and the strategy to overcome them. Moreover, generating a clean and useful dataset is essential to provide reliable input for the subsequent modelling activities. Activities such as removing zero or NA values, deleting unimportant categories, and renaming categories are necessary to produce the final dataset.

SECTION 3: DATA MODELING

Finally, since the dataset is ready to be modelled, we assign it to a more intuitive name. In this part, we will use two modelling techniques to reach the goal of this analysis (a minimal preparation sketch follows this list):

Classification using Decision Tree Methods

Logistic Regression
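
As a minimal sketch of the modelling preparation (assuming a standard 70/30 train/test split with the rsample package loaded earlier, and isBestSeller treated as a factor outcome; the actual variable selection and model specifications are handled separately):

# Sketch: split the cleaned data into training and test sets with rsample
set.seed(123)
model_data <- dataset_init
model_data$isBestSeller <- factor(model_data$isBestSeller, levels = c(0, 1))

split_obj  <- initial_split(model_data, prop = 0.7)
train_data <- training(split_obj)
test_data  <- testing(split_obj)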

# Load necessary libraries
library(ggplot2)
library(dplyr)

# Define min-max scaling function
min_max_scaling <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

# Create new variable "titleLength"
dataset_init <- dataset_init %>%
  mutate(titleLength = nchar(title))

# View the first few rows to confirm the new variable
head(dataset_init[c("title", "titleLength")])
##                                                                                                                                                                                                        title
## 30                           Upgraded, Anker Soundcore Boost Bluetooth Speaker with Well-Balanced Sound, BassUp, 12H Playtime, USB-C, IPX7 Waterproof, with Customizable EQ via App, Wireless Stereo Pairing
## 33                                                                               Kolaura Portable Wireless Speaker, Bluetooth 5.0 Speaker with 3D Stereo HiFi Bass, 1500mAh Battery, 12 Hour Playtime (Blue)
## 93     W-KING Bluetooth Speaker, 50W Speakers Wireless Bluetooth 5.0 With Deep Bass, IPX6 Waterproof Loud Bluetooth Speaker With 40H Playback/Two Portable Speakers Pairing/TF Card/EQ/NFC for Outdoor Party
## 144                          XSOUND Portable Bluetooth Speaker Wireless 24W wireless bluetooth speaker with Bluetooth 5.2, enhanced bass, stereo sound, 24h playtime, usb c, IPX7 Waterproof outdoor speaker
## 161                                                                                   soundcore Anker Mini 3 Bluetooth Speaker, BassUp and PartyCast Technology, USB-C, Waterproof IPX7, and Customizable EQ
## 230 HEYSONG Shower Speaker, Waterpoof Portable Bluetooth Speakers Wireless with LED Lights, IP67, 36H Playtime, Rich Bass Small Speaker Bluetooth for Camping, Kayak, Beach, Bathroom, Travel, Gifts for Men
##     titleLength
## 30          175
## 33          123
## 93          197
## 144         175
## 161         118
## 230         200
# Summary statistics of the new variable
summary(dataset_init$titleLength)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    78.0   133.0   125.8   175.0   534.0
# Apply min-max scaling to "titleLength"
dataset_init <- dataset_init %>%
  mutate(titleLength_normalized = min_max_scaling(titleLength))

# View the first few rows to confirm the new variable
head(dataset_init[c("title", "titleLength", "titleLength_normalized")])
##                                                                                                                                                                                                        title
## 30                           Upgraded, Anker Soundcore Boost Bluetooth Speaker with Well-Balanced Sound, BassUp, 12H Playtime, USB-C, IPX7 Waterproof, with Customizable EQ via App, Wireless Stereo Pairing
## 33                                                                               Kolaura Portable Wireless Speaker, Bluetooth 5.0 Speaker with 3D Stereo HiFi Bass, 1500mAh Battery, 12 Hour Playtime (Blue)
## 93     W-KING Bluetooth Speaker, 50W Speakers Wireless Bluetooth 5.0 With Deep Bass, IPX6 Waterproof Loud Bluetooth Speaker With 40H Playback/Two Portable Speakers Pairing/TF Card/EQ/NFC for Outdoor Party
## 144                          XSOUND Portable Bluetooth Speaker Wireless 24W wireless bluetooth speaker with Bluetooth 5.2, enhanced bass, stereo sound, 24h playtime, usb c, IPX7 Waterproof outdoor speaker
## 161                                                                                   soundcore Anker Mini 3 Bluetooth Speaker, BassUp and PartyCast Technology, USB-C, Waterproof IPX7, and Customizable EQ
## 230 HEYSONG Shower Speaker, Waterpoof Portable Bluetooth Speakers Wireless with LED Lights, IP67, 36H Playtime, Rich Bass Small Speaker Bluetooth for Camping, Kayak, Beach, Bathroom, Travel, Gifts for Men
##     titleLength titleLength_normalized
## 30          175              0.3264540
## 33          123              0.2288931
## 93          197              0.3677298
## 144         175              0.3264540
## 161         118              0.2195122
## 230         200              0.3733583
df_model <- dataset_init

#3.1: Modelling preparation

We perform a correlation analysis to understand the relationships between our key variables. This helps us identify any variables that are strongly correlated so we can account for them in advance, which simplifies the interpretation of the models in our later analysis. We will use the apa.cor.table function from the apaTables package for this task.

# Define columns of interest
columns <- c("boughtInLastMonth", "reviews", "price",  "titleLength", "stars")

# Generate correlation table using apaTables package
apaTables::apa.cor.table(df_model[columns])
## 
## 
## Means, standard deviations, and correlations with confidence intervals
##  
## 
##   Variable             M       SD      1            2            3         
##   1. boughtInLastMonth 257.28  670.48                                      
##                                                                            
##   2. reviews           1634.10 5420.54 .26**                               
##                                        [.26, .27]                          
##                                                                            
##   3. price             16.94   25.55   -.04**       .03**                  
##                                        [-.04, -.03] [.02, .03]             
##                                                                            
##   4. titleLength       125.83  53.17   -.03**       -.05**       .03**     
##                                        [-.03, -.02] [-.05, -.04] [.03, .04]
##                                                                            
##   5. stars             4.39    0.34    .06**        .07**        .04**     
##                                        [.06, .07]   [.07, .08]   [.04, .05]
##                                                                            
##   4           
##               
##               
##               
##               
##               
##               
##               
##               
##               
##               
##               
##   -.13**      
##   [-.13, -.12]
##               
## 
## Note. M and SD are used to represent mean and standard deviation, respectively.
## Values in square brackets indicate the 95% confidence interval.
## The confidence interval is a plausible range of population correlations 
## that could have caused the sample correlation (Cumming, 2014).
##  * indicates p < .05. ** indicates p < .01.
## 

To improve visual clarity, we will employ the corrplot function. In this plot, an “X” indicates a correlation that is not statistically significant. More comprehensive details on these correlations will be provided in the modeling section. For now, we can see that most variables show some degree of linear correlation, although generally, the correlations are weak.

# cor.mtest() and corrplot() below are provided by the corrplot package
library(corrplot)

# Select the specified columns from the DataFrame
subset_df <- df_model[, columns]

# Compute the correlation matrix for the selected columns
correlation_matrix <- cor(subset_df)

# Generate a correlation matrix with p-values
res <- cor.mtest(subset_df)

# Visualize the correlation matrix using a heatmap
corrplot(
  correlation_matrix, method = "color", type = "upper", 
  addCoef.col = "navy", tl.col = "black",
  tl.srt = 45, tl.cex = 0.8, order = "hclust", 
  diag = FALSE, outline = TRUE, 
  title = "Correlation Heatmap of Selected Features", p.mat = res$p
)

#3.2: LOGISTIC REGRESSION

Logistic regression has been selected due to its robust capability to analyze multiple explanatory variables simultaneously while mitigating the impact of confounding factors (Sperandei, 2014). Particularly suited for binary classification tasks, such as predicting whether a product is a best seller or not, logistic regression offers a powerful analytical framework.
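
Before building the departmental models, the sketch below illustrates (on simulated, hypothetical data, not on our dataset) what a logistic regression estimates: the log-odds of the outcome as a linear function of the predictors, which glm() converts back to probabilities through the inverse-logit (plogis) link.

# Minimal illustration of the logistic link on simulated data (hypothetical values)
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-1 + 2 * x))    # true log-odds: -1 + 2 * x
m_demo <- glm(y ~ x, family = binomial)
coef(m_demo)                               # estimated coefficients on the log-odds scale
# Predicted probabilities are the inverse-logit of the linear predictor
all.equal(predict(m_demo, type = "response"),
          plogis(predict(m_demo, type = "link")))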

Steps in Logistic Regression Modeling

To maintain clarity throughout this extensive modeling process, we will provide a structured overview of the logistic regression approach:

  1. Data Preparation: Initially, we will subset the dataset and prepare for a 70:30 train-test split.

  2. Executing the Regression Model: Two main dataframes will be utilized: df_modeling_all (comprising all departments) and df_modeling (containing data from the top 5 departments). This approach ensures a broad overview followed by a focused analysis of the key departments addressing the primary business challenges.

  3. Extracting Coefficients: The regression coefficients represent the relative importance of variables within each department.

  4. Plotting the Coefficients: Visual representation of the coefficients will provide insights into variable impact.

  5. Analyzing Z-Values: Evaluating the statistical significance (typically |z| > 1.96 for a 95% confidence interval) helps understand the predictors’ effect on outcomes, while controlling for other variables.

  6. Plotting Partial Effects: To enhance model accuracy, understanding the partial effects of variables when others are held constant is crucial.

  7. Model Evaluation: Assessment metrics such as AUC-ROC curves and Confusion Matrices will gauge model performance.

  8. Further Improvements: Post-evaluation, adjustments can be made to optimize model performance, especially in minimizing Type II errors (predicting ‘not best seller’ when it’s actually a ‘best seller’). This includes defining an optimal threshold using ROC curves to enhance recall performance.

#STEP 1: Preparing Dataset for logistic regression

We will prepare two dataframes for logistic regression: df_modeling_all (all departments) and df_modeling, a subset focusing on the top 5 departments based on the percentage of best-selling products within each department.

df_modeling_all <- df_model
df_modeling <- subset(df_model, department %in% c("fashion", "health and household", "automotive", "sports and outdoors", "industrial and scientific"))
department_dfs <- split(df_modeling, df_modeling$department)

We will prepare the dataset for training and testing our logistic regression model using a 70% training dataset and a 30% testing dataset split. To ensure reproducibility, we will set a seed for the random number generator used in this dataset split. This split allows us to train the logistic regression model on a majority of the data and evaluate its performance on unseen test data.

#Preparing dataset for training and testing
set.seed(123)
split_ratio <- 0.7

#Dataset training and testing for all department dataset
indices <- sample(1:nrow(df_modeling_all), size = floor(split_ratio * nrow(df_modeling_all)))
df_train_all <- df_modeling_all[indices, ]
df_test_all <- df_modeling_all[-indices, ]

#Dataset training and testing for top 5 department
train_dfs <- list()
test_dfs <- list()

# Looping through each department
for (department in names(department_dfs)) {
  df <- department_dfs[[department]]
  indices <- sample(1:nrow(df), size = floor(split_ratio * nrow(df)))
  
  # Subsetting the data frame into training and testing sets
  train_df <- df[indices, ]
  test_df <- df[-indices, ]
  
  # storing the dataset 
  train_dfs[[department]] <- train_df
  test_dfs[[department]] <- test_df
}

#STEP 2: Executing logistic regression

# define an empty list to store model summaries for each department
model_summaries <- list()
models <- list()
coef_plots <- list()

# model
model <- glm(isBestSeller ~  price_log_normalized + 
                 reviews_log_normalized + titleLength_normalized + stars_normalized + boughtInLastMonth, 
               data = df_train_all, family = binomial)

# Storing the summary 
model_summaries[["all"]] <- summary(model)
models[["all"]] <- model

# Plotting coefficients for the all-departments model
coef_plots[["all"]] <- coefplot::coefplot(
  model,
  title = "Coefficients Plot for all departments",
  ylab = "Variables",
  xlab = "Estimated Coefficient"
)

# Looping through each department
for (department in names(train_dfs)) {
  # Getting the training data frame for the current department
  df_training <- train_dfs[[department]]
  
  # Fitting the logistic regression model
  model <- glm(isBestSeller ~  price_log_normalized + 
                 reviews_log_normalized + titleLength_normalized + stars_normalized + boughtInLastMonth, 
               data = df_training, family = binomial)
  
  # Storing to df
  model_summaries[[department]] <- summary(model)
  models[[department]] <- model
  
  # Plotting coefficients
  coef_plots[[department]] <- coefplot::coefplot(
    model,
    title = paste("Coefficients Plot for", department),
    ylab = "Variables",
    xlab = "Estimated Coefficient"
  )
}

print(model_summaries)
## $all
## 
## Call:
## glm(formula = isBestSeller ~ price_log_normalized + reviews_log_normalized + 
##     titleLength_normalized + stars_normalized + boughtInLastMonth, 
##     family = binomial, data = df_train_all)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -5.996e+00  2.731e-01 -21.957  < 2e-16 ***
## price_log_normalized    2.094e-01  1.849e-01   1.132  0.25750    
## reviews_log_normalized  4.959e+00  1.516e-01  32.715  < 2e-16 ***
## titleLength_normalized  5.628e-01  2.030e-01   2.773  0.00556 ** 
## stars_normalized       -4.992e-01  3.073e-01  -1.625  0.10420    
## boughtInLastMonth       4.333e-04  1.465e-05  29.579  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26160  on 111906  degrees of freedom
## Residual deviance: 22949  on 111901  degrees of freedom
## AIC: 22961
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## $automotive
## 
## Call:
## glm(formula = isBestSeller ~ price_log_normalized + reviews_log_normalized + 
##     titleLength_normalized + stars_normalized + boughtInLastMonth, 
##     family = binomial, data = df_training)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -5.034137   1.970603  -2.555   0.0106 *  
## price_log_normalized    0.585397   1.586702   0.369   0.7122    
## reviews_log_normalized  2.528969   1.196899   2.113   0.0346 *  
## titleLength_normalized -0.836206   1.503518  -0.556   0.5781    
## stars_normalized        0.186155   2.316589   0.080   0.9360    
## boughtInLastMonth       0.001732   0.000365   4.746 2.07e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 444.40  on 1316  degrees of freedom
## Residual deviance: 384.62  on 1311  degrees of freedom
## AIC: 396.62
## 
## Number of Fisher Scoring iterations: 6
## 
## 
## $fashion
## 
## Call:
## glm(formula = isBestSeller ~ price_log_normalized + reviews_log_normalized + 
##     titleLength_normalized + stars_normalized + boughtInLastMonth, 
##     family = binomial, data = df_training)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)  
## (Intercept)             1.036e+01  8.157e+00   1.270   0.2040  
## price_log_normalized    1.378e+01  1.214e+01   1.134   0.2566  
## reviews_log_normalized -9.437e+00  7.371e+00  -1.280   0.2004  
## titleLength_normalized  2.459e+00  8.227e+00   0.299   0.7650  
## stars_normalized       -1.696e+01  1.001e+01  -1.694   0.0903 .
## boughtInLastMonth       1.526e-04  1.175e-02   0.013   0.9896  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21.153  on 38  degrees of freedom
## Residual deviance: 16.336  on 33  degrees of freedom
## AIC: 28.336
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## $`health and household`
## 
## Call:
## glm(formula = isBestSeller ~ price_log_normalized + reviews_log_normalized + 
##     titleLength_normalized + stars_normalized + boughtInLastMonth, 
##     family = binomial, data = df_training)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -4.290e+00  7.187e-01  -5.968 2.39e-09 ***
## price_log_normalized    6.976e-01  4.261e-01   1.637   0.1016    
## reviews_log_normalized  2.843e+00  4.396e-01   6.466 1.00e-10 ***
## titleLength_normalized  1.487e+00  5.774e-01   2.575   0.0100 *  
## stars_normalized       -1.317e+00  7.932e-01  -1.660   0.0968 .  
## boughtInLastMonth       3.813e-04  2.687e-05  14.188  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3289.2  on 7756  degrees of freedom
## Residual deviance: 2869.9  on 7751  degrees of freedom
## AIC: 2881.9
## 
## Number of Fisher Scoring iterations: 6
## 
## 
## $`industrial and scientific`
## 
## Call:
## glm(formula = isBestSeller ~ price_log_normalized + reviews_log_normalized + 
##     titleLength_normalized + stars_normalized + boughtInLastMonth, 
##     family = binomial, data = df_training)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -7.2905767  1.9213258  -3.795 0.000148 ***
## price_log_normalized    0.7585714  1.6288786   0.466 0.641429    
## reviews_log_normalized  3.3564192  1.1851491   2.832 0.004625 ** 
## titleLength_normalized  4.6617282  1.7041723   2.735 0.006229 ** 
## stars_normalized        0.6865408  2.0181164   0.340 0.733714    
## boughtInLastMonth       0.0008986  0.0001752   5.129 2.92e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 427.89  on 1443  degrees of freedom
## Residual deviance: 348.54  on 1438  degrees of freedom
## AIC: 360.54
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## $`sports and outdoors`
## 
## Call:
## glm(formula = isBestSeller ~ price_log_normalized + reviews_log_normalized + 
##     titleLength_normalized + stars_normalized + boughtInLastMonth, 
##     family = binomial, data = df_training)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -5.1012911  1.1311462  -4.510 6.49e-06 ***
## price_log_normalized    0.6729650  0.7375298   0.912  0.36153    
## reviews_log_normalized  4.5825325  0.6438306   7.118 1.10e-12 ***
## titleLength_normalized -0.6362049  0.8106429  -0.785  0.43256    
## stars_normalized       -0.4269741  1.3036513  -0.328  0.74327    
## boughtInLastMonth       0.0007325  0.0002394   3.060  0.00221 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1283.1  on 3654  degrees of freedom
## Residual deviance: 1192.4  on 3649  degrees of freedom
## AIC: 1204.4
## 
## Number of Fisher Scoring iterations: 6

Interpretation of Logistic Regression Models

  1. All Departments Model: The model for all departments reveals several significant predictors of a product becoming a bestseller. The number of reviews (reviews_log_normalized) and the amount bought in the last month (boughtInLastMonth) are highly significant positive predictors (p < 0.001). Title length (titleLength_normalized) is also a significant positive predictor (p < 0.01). The star rating (stars_normalized) shows a marginally significant negative association (p < 0.1), while price is not a significant factor. This suggests that customer engagement (through reviews and recent purchases) and product presentation (title length) are key drivers of bestseller status across all departments.

  2. Automotive Department: In the automotive category, the strongest predictor is whether the product was bought in the last month (boughtInLastMonth, p < 0.001). The number of reviews (reviews_log_normalized) is also significant at the 5% level (p < 0.05). Other factors, including price, title length, and star rating, do not show significant associations. This emphasizes the importance of recent purchase behavior in determining bestseller status for automotive products.

  3. Fashion Department: The fashion department model shows no statistically significant predictors at the conventional p < 0.05 level. The star rating (stars_normalized) is marginally significant (p < 0.1) with a negative association. However, the small sample size (39 observations) likely contributes to the lack of significant results and makes these findings less reliable. More data would be needed to draw robust conclusions for this department.

  4. Health and Household Department: In this category, the number of reviews (reviews_log_normalized) and recent purchases (boughtInLastMonth) are highly significant positive predictors (p < 0.001). Title length (titleLength_normalized) is also a significant positive predictor (p < 0.05). The star rating (stars_normalized) shows a marginally significant negative association (p < 0.1). Price is not a significant factor. This suggests that customer engagement and product visibility are key drivers of bestseller status in the health and household category.

  5. Industrial and Scientific Department: For industrial and scientific products, the number of reviews (reviews_log_normalized), title length (titleLength_normalized), and recent purchases (boughtInLastMonth) are all significant positive predictors (p < 0.01 or lower). Price and star rating do not show significant associations. This indicates that product visibility, customer engagement, and recent purchase behavior are important factors in determining bestseller status in this category.

  6. Sports and Outdoors Department: In the sports and outdoors category, the number of reviews (reviews_log_normalized) is a highly significant positive predictor (p < 0.001), and recent purchases (boughtInLastMonth) is also a significant positive predictor (p < 0.01). Price, title length, and star rating do not show significant associations. This suggests that customer engagement through reviews and recent purchase behavior are the primary drivers of bestseller status for sports and outdoor products.

General Observations:

  1. Across most categories, the number of reviews and recent purchase behavior consistently emerge as important predictors of bestseller status.

  2. The importance of other factors (price, title length, star rating) varies by department, suggesting that bestseller dynamics may differ across product categories.

  3. Some departments, particularly Fashion, have very small sample sizes, which affects the reliability of the results and calls for caution in interpretation.

  4. The models generally show good fit to the data, as indicated by improvements in deviance and AIC compared to null models.

  5. The normalization of variables makes direct interpretation of effect sizes challenging without knowing the specifics of the normalization process.

  6. These models reveal associations between the predictors and bestseller status, but do not necessarily imply causation. Other unmeasured factors could also be influencing these relationships.

#STEP 3: Extracting coefficient

Extracting coefficients provides insights into the relative importance of variables within each department. These coefficients quantify how strongly each predictor influences the likelihood of a product becoming a best seller.
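
As a quick reading aid (a minimal sketch reusing the models fitted above), exponentiating a logistic regression coefficient converts it from the log-odds scale to an odds ratio: values above 1 increase the odds of being a best seller, values below 1 decrease them.

# Odds ratios for the all-departments model: exp() of the log-odds coefficients
exp(coef(models[["all"]]))
# Example: the exponentiated boughtInLastMonth coefficient is the multiplicative change
# in the odds of best-seller status for a one-unit increase, holding the other
# predictors fixed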

# Initializing an empty data frame to store coefficients from all models
all_coefs <- data.frame(
  Department = character(),
  Term = character(),
  Estimate = numeric(),
  Pr.value = numeric(),
  stringsAsFactors = FALSE  # Avoid factors to make data manipulation easier
)

# Looping through each department and extract coefficients
for (department in c("all", names(train_dfs))) {
  # Extracting the coefficients table
  coefs <- summary(models[[department]])$coefficients
  
  # Creating a temporary data frame to store coefficients for current model
  temp_df <- data.frame(
    Department = department,
    Term = rownames(coefs),
    Estimate = coefs[, "Estimate"],
    Pr.value = coefs[, "Pr(>|z|)"],
    stringsAsFactors = FALSE
  )
  
  # Binding the temporary data frame to the main data frame
  all_coefs <- rbind(all_coefs, temp_df)
}

# Printing the final coefficients table
print(all_coefs)
##                                        Department                   Term
## (Intercept)                                   all            (Intercept)
## price_log_normalized                          all   price_log_normalized
## reviews_log_normalized                        all reviews_log_normalized
## titleLength_normalized                        all titleLength_normalized
## stars_normalized                              all       stars_normalized
## boughtInLastMonth                             all      boughtInLastMonth
## (Intercept)1                           automotive            (Intercept)
## price_log_normalized1                  automotive   price_log_normalized
## reviews_log_normalized1                automotive reviews_log_normalized
## titleLength_normalized1                automotive titleLength_normalized
## stars_normalized1                      automotive       stars_normalized
## boughtInLastMonth1                     automotive      boughtInLastMonth
## (Intercept)2                              fashion            (Intercept)
## price_log_normalized2                     fashion   price_log_normalized
## reviews_log_normalized2                   fashion reviews_log_normalized
## titleLength_normalized2                   fashion titleLength_normalized
## stars_normalized2                         fashion       stars_normalized
## boughtInLastMonth2                        fashion      boughtInLastMonth
## (Intercept)3                 health and household            (Intercept)
## price_log_normalized3        health and household   price_log_normalized
## reviews_log_normalized3      health and household reviews_log_normalized
## titleLength_normalized3      health and household titleLength_normalized
## stars_normalized3            health and household       stars_normalized
## boughtInLastMonth3           health and household      boughtInLastMonth
## (Intercept)4            industrial and scientific            (Intercept)
## price_log_normalized4   industrial and scientific   price_log_normalized
## reviews_log_normalized4 industrial and scientific reviews_log_normalized
## titleLength_normalized4 industrial and scientific titleLength_normalized
## stars_normalized4       industrial and scientific       stars_normalized
## boughtInLastMonth4      industrial and scientific      boughtInLastMonth
## (Intercept)5                  sports and outdoors            (Intercept)
## price_log_normalized5         sports and outdoors   price_log_normalized
## reviews_log_normalized5       sports and outdoors reviews_log_normalized
## titleLength_normalized5       sports and outdoors titleLength_normalized
## stars_normalized5             sports and outdoors       stars_normalized
## boughtInLastMonth5            sports and outdoors      boughtInLastMonth
##                              Estimate      Pr.value
## (Intercept)             -5.995897e+00 7.396551e-107
## price_log_normalized     2.094143e-01  2.574981e-01
## reviews_log_normalized   4.959000e+00 9.627752e-235
## titleLength_normalized   5.627646e-01  5.561241e-03
## stars_normalized        -4.992396e-01  1.041951e-01
## boughtInLastMonth        4.333419e-04 2.781977e-192
## (Intercept)1            -5.034137e+00  1.063047e-02
## price_log_normalized1    5.853966e-01  7.121730e-01
## reviews_log_normalized1  2.528969e+00  3.460631e-02
## titleLength_normalized1 -8.362064e-01  5.780971e-01
## stars_normalized1        1.861545e-01  9.359532e-01
## boughtInLastMonth1       1.732279e-03  2.070541e-06
## (Intercept)2             1.036138e+01  2.040041e-01
## price_log_normalized2    1.377640e+01  2.566334e-01
## reviews_log_normalized2 -9.437315e+00  2.004034e-01
## titleLength_normalized2  2.459187e+00  7.650048e-01
## stars_normalized2       -1.695802e+01  9.027500e-02
## boughtInLastMonth2       1.525688e-04  9.896379e-01
## (Intercept)3            -4.289785e+00  2.394680e-09
## price_log_normalized3    6.976084e-01  1.015600e-01
## reviews_log_normalized3  2.842821e+00  1.003731e-10
## titleLength_normalized3  1.487127e+00  1.001416e-02
## stars_normalized3       -1.317049e+00  9.681652e-02
## boughtInLastMonth3       3.812598e-04  1.094556e-45
## (Intercept)4            -7.290577e+00  1.479084e-04
## price_log_normalized4    7.585714e-01  6.414291e-01
## reviews_log_normalized4  3.356419e+00  4.624846e-03
## titleLength_normalized4  4.661728e+00  6.228950e-03
## stars_normalized4        6.865408e-01  7.337143e-01
## boughtInLastMonth4       8.986412e-04  2.919520e-07
## (Intercept)5            -5.101291e+00  6.487583e-06
## price_log_normalized5    6.729650e-01  3.615276e-01
## reviews_log_normalized5  4.582532e+00  1.098178e-12
## titleLength_normalized5 -6.362049e-01  4.325619e-01
## stars_normalized5       -4.269741e-01  7.432733e-01
## boughtInLastMonth5       7.324904e-04  2.212057e-03

#STEP 4: PLOTTING COEFFICIENTS

Coefficient plots allow us to assess both the direction (positive or negative) and the magnitude of effects. These plots enable us to identify which variables hold the greatest influence within each department.

# Looping through each department and display its coefficient
for (department_name in names(coef_plots)) {
  plot <- coef_plots[[department_name]]
  print(plot)
}

#Step 5: Displaying Z Values for All Logistic Regression Models

The z-value is obtained by dividing the regression coefficient (estimate) by its standard error. A positive z-value indicates that the predictor is positively associated with the odds of best-seller status, whereas a negative z-value indicates a negative association; the larger the absolute z-value, the stronger the evidence that the coefficient differs from zero. Below we first tabulate the p-values and coefficient estimates for each department, and then compare standardized absolute coefficient scores.
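
As a small check (assuming the models fitted above), the z-values reported by summary() can be reproduced directly from the estimates and standard errors:

# z value = Estimate / Std. Error, as reported in the glm summary table
coefs_all <- summary(models[["all"]])$coefficients
coefs_all[, "Estimate"] / coefs_all[, "Std. Error"]
coefs_all[, "z value"]   # the same values, taken straight from the summary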

filtered_all_coefs1 <- all_coefs[all_coefs$Department %in% c("all","fashion", "health and household", "automotive", "sports and outdoors", "industrial and scientific"), ]

# Reshaping the data frame to a wide format
wide_df1 <- reshape(filtered_all_coefs1,
                   timevar = "Term",
                   idvar = "Department",
                   direction = "wide",
                   v.names = "Pr.value")
## Warning in reshapeWide(data, idvar = idvar, timevar = timevar, varying =
## varying, : some constant variables (Estimate) are really varying
# The reshape() call prefixes the new columns with "Pr.value."; we keep these names as-is

# Setting the row names to the Department names
rownames(wide_df1) <- wide_df1$Department
wide_df1$Department <- NULL               # Remove the now redundant Department column
wide_df1$"Pr.value.(Intercept)" <- NULL   # Drop the intercept p-value column
wide_df1$Estimate <- NULL                 # Drop the leftover Estimate column

# Showing the p-values of each predictor by department
wide_df1
##                           Pr.value.price_log_normalized
## all                                           0.2574981
## automotive                                    0.7121730
## fashion                                       0.2566334
## health and household                          0.1015600
## industrial and scientific                     0.6414291
## sports and outdoors                           0.3615276
##                           Pr.value.reviews_log_normalized
## all                                         9.627752e-235
## automotive                                   3.460631e-02
## fashion                                      2.004034e-01
## health and household                         1.003731e-10
## industrial and scientific                    4.624846e-03
## sports and outdoors                          1.098178e-12
##                           Pr.value.titleLength_normalized
## all                                           0.005561241
## automotive                                    0.578097082
## fashion                                       0.765004778
## health and household                          0.010014159
## industrial and scientific                     0.006228950
## sports and outdoors                           0.432561925
##                           Pr.value.stars_normalized Pr.value.boughtInLastMonth
## all                                      0.10419508              2.781977e-192
## automotive                               0.93595319               2.070541e-06
## fashion                                  0.09027500               9.896379e-01
## health and household                     0.09681652               1.094556e-45
## industrial and scientific                0.73371427               2.919520e-07
## sports and outdoors                      0.74327331               2.212057e-03
filtered_all_coefs2 <- all_coefs[all_coefs$Department %in% c("all", "fashion", "health and household", "automotive", "sports and outdoors", "industrial and scientific"), ]

# Reshaping the data frame to a wide format
wide_df2 <- reshape(filtered_all_coefs2,
                   timevar = "Term",
                   idvar = "Department",
                   direction = "wide",
                   v.names = "Estimate")
## Warning in reshapeWide(data, idvar = idvar, timevar = timevar, varying =
## varying, : some constant variables (Pr.value) are really varying
# The reshape() call prefixes the new columns with "Estimate."; we keep these names as-is

# Setting the row names to the Department names
rownames(wide_df2) <- wide_df2$Department
wide_df2$Department <- NULL              # Remove the now redundant Department column
wide_df2$"Estimate.(Intercept)" <- NULL  # Drop the intercept estimate column
wide_df2$"Pr.value" <- NULL              # Drop the leftover Pr.value column

# Showing the coefficient estimates of each predictor by department
wide_df2
##                           Estimate.price_log_normalized
## all                                           0.2094143
## automotive                                    0.5853966
## fashion                                      13.7763958
## health and household                          0.6976084
## industrial and scientific                     0.7585714
## sports and outdoors                           0.6729650
##                           Estimate.reviews_log_normalized
## all                                              4.959000
## automotive                                       2.528969
## fashion                                         -9.437315
## health and household                             2.842821
## industrial and scientific                        3.356419
## sports and outdoors                              4.582532
##                           Estimate.titleLength_normalized
## all                                             0.5627646
## automotive                                     -0.8362064
## fashion                                         2.4591873
## health and household                            1.4871271
## industrial and scientific                       4.6617282
## sports and outdoors                            -0.6362049
##                           Estimate.stars_normalized Estimate.boughtInLastMonth
## all                                      -0.4992396               0.0004333419
## automotive                                0.1861545               0.0017322794
## fashion                                 -16.9580210               0.0001525688
## health and household                     -1.3170493               0.0003812598
## industrial and scientific                 0.6865408               0.0008986412
## sports and outdoors                      -0.4269741               0.0007324904

Calculating the absolute z-score

Although a standardized score alone does not directly reveal the importance of a variable, comparing the absolute values of these scores across predictors provides insight into their relative influence. Here we standardize each department's coefficient estimates (excluding the intercept) and compare their absolute values; higher absolute scores suggest relatively stronger effects within that department's model.

# Standardizing each department's non-intercept coefficient estimates and taking
# absolute values, so their relative magnitudes can be compared on a common scale
z_scores <- list()
for (department in names(models)) {
  # Getting the coefficients from the fitted model
  coefficients <- coef(models[[department]])
  
  # Scaling the coefficients (excluding the intercept) and taking the absolute value
  z_scores[[department]] <- abs(scale(coefficients[-1], center = TRUE, scale = TRUE))
}

z_score_plots <- list()

# Looping through each department's z-scores
for (department in names(z_scores)) {
  variables <- names(models[[department]]$coefficients)[-1]
  
  # Creating a data frame with variable names and absolute z-scores
  z_scores_df <- data.frame(
    Variable = variables,
    Abs_Z_Score = unname(z_scores[[department]])
  )
  
  # Creating a bar plot for absolute z-scores
  z_score_plot <- ggplot(z_scores_df, aes(x = reorder(Variable, Abs_Z_Score), y = Abs_Z_Score)) +
    geom_bar(stat = "identity", fill = "green") +
    labs(title = paste("Absolute Z-Scores of Coefficients for", department),
         x = "Variables",
         y = "Absolute Z-Score") +
    theme_minimal() +
    coord_flip()

  z_score_plots[[department]] <- z_score_plot
  
  
}

# Displaying the plots
for (plot in z_score_plots) {
  print(plot)
}

#Step 6: Plotting the Partial Effect

In addition to the coefficients discussed in the previous section, the partial effect measures how individual predictor variables influence the predicted probability of a positive outcome in a logistic regression model. It quantifies the change in the predicted probability of the outcome for each unit change in the predictor variable, holding all other variables constant.
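
As a hand-rolled illustration of the same idea (a sketch that reuses the objects created above), we can vary one predictor over a grid while fixing the remaining predictors at their medians in the training data, and read off the predicted probability of best-seller status:

# Partial effect of reviews_log_normalized in the all-departments model, holding
# the other predictors at their median training values
grid_df <- data.frame(
  reviews_log_normalized = seq(0, 1, length.out = 50),
  price_log_normalized   = median(df_train_all$price_log_normalized),
  titleLength_normalized = median(df_train_all$titleLength_normalized),
  stars_normalized       = median(df_train_all$stars_normalized),
  boughtInLastMonth      = median(df_train_all$boughtInLastMonth)
)
grid_df$predicted_prob <- predict(models[["all"]], newdata = grid_df, type = "response")
head(grid_df[c("reviews_log_normalized", "predicted_prob")])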

create_and_display_plots <- function(department, model) {
  # Computing the partial effects (allEffects() is provided by the effects package)
  partial_effects <- effects::allEffects(model)
  
  # Create ggplot objects for each effect
  plots <- lapply(partial_effects, function(effect) {
    effect_df <- as.data.frame(effect)
    p <- ggplot(effect_df, aes(x = eval(as.name(names(effect_df)[1])), y = fit)) +
      geom_line() +
      geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
      labs(title = paste(names(effect_df)[1]), x = names(effect_df)[1], y = "Effect") +
      theme_minimal() +
      theme(
        plot.title = element_text(size = 8),    # Adjust size: plot titles
        axis.title = element_text(size = 6),    # Adjust size: axis titles
        axis.text.x = element_text(size = 8),    # Adjust size: x-axis text
        axis.text.y = element_text(size = 8)     # Adjust size: y-axis text
      )
    return(p)
  })

  # Combine plots into a grid (marrangeGrob() is provided by the gridExtra package)
  plot_grid <- gridExtra::marrangeGrob(plots, nrow = 2, ncol = 3, top = department)

  # Print the plot grid explicitly
  grid::grid.draw(plot_grid)
}

# Loop through each department model and display plots
for (department in names(models)) {
  model <- models[[department]]
  create_and_display_plots(department, model)
}

#Step 7: Calculating Model Evaluation

To evaluate the logistic regression model, we will use the ROC (Receiver Operating Characteristic) curve and the AUC (Area Under the ROC Curve).

  • ROC Curve: This plot shows the trade-off between the true positive rate (sensitivity) and false positive rate (1 - specificity) for different threshold values. It helps visualize how well the model can distinguish between classes.

  • AUC (Area Under the ROC Curve): AUC quantifies the overall performance of the classifier across all possible thresholds. A higher AUC value indicates better model performance in distinguishing between the two classes.

In this analysis, we plot the ROC curve for each model to assess performance and report the corresponding AUC value.

# roc() and auc() below are provided by the pROC package
model_evaluations <- list()

predicted_probabilities <- predict(models[["all"]], df_test_all, type = "response")
actual_labels_all <- df_test_all$isBestSeller

roc_curve <- roc(actual_labels_all, predicted_probabilities)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_value <- auc(roc_curve)
model_evaluations[["all"]] <- list(
  AUC = auc_value,
  PredictedProbabilities = predicted_probabilities,
  ActualLabels = actual_labels_all
)

# Looping through each department's test dataset
for (department in names(test_dfs)) {
  # Getting the test data frame for the current department
  df_test <- test_dfs[[department]]
  
  # Getting the logistic regression model for the current department
  model_logistic <- models[[department]]
  
  # Predicting probabilities of being a best seller using the model
  predicted_probabilities <- predict(model_logistic, df_test, type = "response")
  
  # Evaluating the model's performance
  actual_labels <- df_test$isBestSeller  
  
  # Explicitly setting factor levels
  actual_labels <- factor(actual_labels, levels = c(0, 1))
  
  roc_curve <- roc(actual_labels, predicted_probabilities)
  auc_value <- auc(roc_curve)
  
  # Storing the model evaluation results
  model_evaluations[[department]] <- list(
    AUC = auc_value,
    PredictedProbabilities = predicted_probabilities,
    ActualLabels = actual_labels
  )
}
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

#Step 8: Plotting ROC Curve for Further Improvement

The ROC curve visually represents the balance between the true positive rate (sensitivity) and the false positive rate (1 - specificity) for various thresholds in a binary classification model. It helps to understand how well the model can differentiate between positive and negative cases across different decision boundaries.

The AUC (Area Under the ROC Curve) serves as a summary metric of the ROC curve’s effectiveness. It measures the overall accuracy of the model, with an AUC of 0.5 reflecting random guessing and an AUC of 1.0 indicating perfect classification.
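
An equivalent way to read the AUC (a minimal check, reusing the predictions stored above): it equals the probability that a randomly chosen best seller receives a higher predicted probability than a randomly chosen non best seller, which can be computed from the ranks of the predictions:

# Rank-based (Mann-Whitney) computation of AUC for the all-departments model
probs_all  <- model_evaluations[["all"]]$PredictedProbabilities
labels_all <- model_evaluations[["all"]]$ActualLabels
pos <- probs_all[labels_all == 1]
neg <- probs_all[labels_all == 0]
r <- rank(c(pos, neg))
(sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2) /
  (length(pos) * length(neg))
# This value should closely match auc(roc_curve) for the all-departments model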

Upon reviewing the ROC curves for the all-departments model and the top 5 departments, every model achieves an AUC above 0.5, indicating better-than-random discrimination; the relative AUC values show for which departments the model separates best sellers from other products most reliably.

# Width and height for the ROC plot
width <- 15
height <- 15
# Creating an empty list to store ROC curve plots for each department
roc_plots <- list()
# Looping through each department's model evaluations
for (department in names(model_evaluations)) {
  # Getting the predicted probabilities and actual labels for the current department
  predicted_probs <- model_evaluations[[department]]$PredictedProbabilities
  actual_labels <- model_evaluations[[department]]$ActualLabels
  # Calculating the ROC curve
  roc_curve <- roc(actual_labels, predicted_probs)
  # Getting the AUC value from model_evaluations list
  auc_value <- model_evaluations[[department]]$AUC
  # Plotting the ROC curve with AUC value in the title and adjusted size
roc_plot <- plot(roc_curve, main = paste("ROC Curve for", department, "(AUC =", round(auc_value, 2), ")"), width = width, height = height, cex.main = 0.8)
roc_plots[[department]] <- roc_plot
}
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

After assessing the ROC curves, all models (the all-departments model and the five department models) exhibit AUC values above 0.5. The strongest discrimination is observed for the all-departments, fashion, automotive, and sports and outdoors models. Since none of the models performs worse than random guessing, we proceed to evaluate the confusion matrices, comparing true values with predicted values.

#Step 9: Recalculating Evaluation Metrics

Evaluation Metrics:

Accuracy: Measures the proportion of correctly predicted instances (both true positives and true negatives) out of all instances. While straightforward, accuracy can be misleading in imbalanced datasets where one class dominates.

Precision: Quantifies the proportion of true positive predictions out of all positive predictions (both true positives and false positives). High precision indicates that when the model predicts a positive instance, it is likely to be correct. This metric is crucial in scenarios where false positives are costly.

Recall: Measures the proportion of true positive predictions out of all actual positive instances (both true positives and false negatives). High recall indicates that the model captures most of the actual positive instances. It is essential when missing positive cases (false negatives) are critical, such as in disease diagnosis.

F1 Score: The harmonic mean of precision and recall. It balances both metrics and provides a single score that considers both false positives and false negatives. The F1 score is useful when there is a need to strike a balance between precision and recall.
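
As a toy illustration of the accuracy caveat above (hypothetical counts, not from our dataset): on a sample with 97 non best sellers and 3 best sellers, a model that always predicts "not a best seller" reaches 97% accuracy while missing every best seller.

# Accuracy can look strong on imbalanced data even when recall is zero
actual    <- c(rep(0, 97), rep(1, 3))
predicted <- rep(0, 100)
mean(predicted == actual)                              # accuracy = 0.97
sum(predicted == 1 & actual == 1) / sum(actual == 1)   # recall   = 0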

metrics <- list()
evaluation_results <- list()

# Defining a function to calculate the confusion matrix
calculate_confusion_matrix <- function(predicted_labels, actual_labels) {
  TP_A <- sum(predicted_labels == 1 & actual_labels == 1)
  TN_A <- sum(predicted_labels == 0 & actual_labels == 0)
  FP_A <- sum(predicted_labels == 1 & actual_labels == 0)
  FN_A <- sum(predicted_labels == 0 & actual_labels == 1)
  
  confusion_matrix <- matrix(c(TP_A, FN_A, FP_A, TN_A), nrow = 2, byrow = TRUE,
                             dimnames = list(c("Actual Positive (1)", "Actual Negative (0)"),
                                             c("Predicted Positive", "Predicted Negative")))
  return(confusion_matrix)
}

# Looping through each department's model evaluations
for (department in names(model_evaluations)) {
  # Getting the predicted probabilities and actual labels for the current department
  predicted_probs <- model_evaluations[[department]]$PredictedProbabilities
  actual_labels <- model_evaluations[[department]]$ActualLabels
  
  # Calculating predicted labels based on a threshold (e.g., 0.5 for binary classification)
  predicted_labels <- ifelse(predicted_probs >= 0.5, 1, 0)
  
  # Make confusion matrix
  conf_matrix_A <- calculate_confusion_matrix(predicted_labels, actual_labels)
  
  # Calculating evaluation metrics
  # (rows of the confusion matrix are actual classes, columns are predicted classes)
  accuracy_A <- sum(diag(conf_matrix_A)) / sum(conf_matrix_A)
  precision_A <- conf_matrix_A[1, 1] / sum(conf_matrix_A[, 1])   # TP / (TP + FP)
  recall_A <- conf_matrix_A[1, 1] / sum(conf_matrix_A[1, ])      # TP / (TP + FN)
  f1_score_A <- 2 * (precision_A * recall_A) / (precision_A + recall_A)
  
  # Storing the evaluation metrics and confusion matrix in the metrics list
  metrics[[department]] <- list(
    Accuracy = accuracy_A,
    Precision = precision_A,
    Recall = recall_A,
    F1_Score = f1_score_A,
    Confusion_Matrix = conf_matrix_A
  )
}
for (department in names(metrics)) {
  cat("Department:", department, "\n")
  cat("Accuracy:", metrics[[department]]$Accuracy, "\n")
  cat("Precision:", metrics[[department]]$Precision, "\n")
  cat("Recall:", metrics[[department]]$Recall, "\n")
  cat("F1 Score:", metrics[[department]]$F1_Score, "\n")
  cat("Confusion Matrix:\n")
  print(metrics[[department]]$Confusion_Matrix)
  cat("\n")
}
## Department: all 
## Accuracy: 0.9749171 
## Precision: 0.4313725 
## Recall: 0.03700589 
## F1 Score: 0.06816421 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                 44               1145
## Actual Negative (0)                 58              46714
## 
## Department: automotive 
## Accuracy: 0.9646018 
## Precision: 0.75 
## Recall: 0.1363636 
## F1 Score: 0.2307692 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                  3                 19
## Actual Negative (0)                  1                542
## 
## Department: fashion 
## Accuracy: 0.9411765 
## Precision: 0.5 
## Recall: 1 
## F1 Score: 0.6666667 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                  1                  0
## Actual Negative (0)                  1                 15
## 
## Department: health and household 
## Accuracy: 0.9428571 
## Precision: 0.5454545 
## Recall: 0.09326425 
## F1 Score: 0.159292 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                 18                175
## Actual Negative (0)                 15               3117
## 
## Department: industrial and scientific 
## Accuracy: 0.9645161 
## Precision: 0.6 
## Recall: 0.1304348 
## F1 Score: 0.2142857 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                  3                 20
## Actual Negative (0)                  2                595
## 
## Department: sports and outdoors 
## Accuracy: 0.9514997 
## Precision: 0 
## Recall: 0 
## F1 Score: NaN 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                  0                 75
## Actual Negative (0)                  1               1491

To enhance model performance, we aim to define an optimal threshold for the logistic regression model using the ROC curve to maximize recall and minimize Type II errors (false negatives).

Concept:

After analyzing the Confusion Matrix, particularly focusing on Type II errors (isBestSeller=“yes” predicted as “no”), we recognize the need to improve recall. In this business context, false negatives (Type II errors) are critical as they imply missing potential Best Seller products, impacting client benefits.

Approach:

The optimal threshold determines the probability output where the model classifies an instance as “isBestSeller=‘yes’”. Instead of relying on a fixed threshold (e.g., >0.5), which may not suit all models, we use the ROC curve to identify the threshold that maximizes recall. This method helps mitigate Type II errors by prioritizing the correct identification of true positives (Best Sellers).

Example Scenario:

For instance, setting a threshold based on the ROC curve might lead us to lower the probability threshold for predicting “isBestSeller=‘yes’” to capture more true positive cases, thus reducing false negatives.

By optimizing the threshold in this manner, we aim to enhance the model’s ability to accurately identify potential Best Sellers, aligning with business objectives and maximizing client benefits.
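
As a small sketch of this idea (reusing the all-departments test predictions and 0/1 labels stored above), recall increases as the classification threshold is lowered, because more products are flagged as potential best sellers:

# Recall of the all-departments model at a few candidate thresholds
probs_all  <- model_evaluations[["all"]]$PredictedProbabilities
labels_all <- model_evaluations[["all"]]$ActualLabels
sapply(c(0.5, 0.1, 0.05, 0.02), function(th) {
  pred <- as.numeric(probs_all >= th)
  sum(pred == 1 & labels_all == 1) / sum(labels_all == 1)   # recall at threshold th
})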

# Looping through each department's ROC curve plots
for (department in names(roc_plots)) {
  # Get the ROC curve data from the roc_plots list
  roc_curve <- roc_plots[[department]]
  
  # Finding the optimal threshold for maximizing recall (sensitivity)
  optimal_threshold <- coords(roc_curve, "best", ret = "threshold", maximize = "sensitivity")$threshold
  
  # Getting the predicted probabilities and actual labels for the current department
  predicted_probs <- model_evaluations[[department]]$PredictedProbabilities
  actual_labels <- model_evaluations[[department]]$ActualLabels
  
  # Calculating predicted labels based on the optimal threshold
  predicted_labels <- ifelse(predicted_probs >= optimal_threshold, 1, 0)
  
  # Creating confusion matrix
  TP <- sum(predicted_labels == 1 & actual_labels == 1)
  FN <- sum(predicted_labels == 0 & actual_labels == 1)
  FP <- sum(predicted_labels == 1 & actual_labels == 0)
  TN <- sum(predicted_labels == 0 & actual_labels == 0)
  
  # Creating confusion matrix with labels
  confusion_matrix <- matrix(c(TP, FN, FP, TN), nrow = 2, byrow = TRUE)
  colnames(confusion_matrix) <- c("Predicted Positive", "Predicted Negative")
  rownames(confusion_matrix) <- c("Actual Positive (1)", "Actual Negative (0)")
  
  # Calculating evaluation metrics
  accuracy <- (TP + TN) / (TP + TN + FP + FN)
  precision <- TP / (TP + FP)
  recall <- TP / (TP + FN)
  f1_score <- 2 * (precision * recall) / (precision + recall)
  
  # Storing the evaluation results in the evaluation_results list
  evaluation_results[[department]] <- list(
    Threshold = optimal_threshold,
    ConfusionMatrix = confusion_matrix,
    Accuracy = accuracy,
    Precision = precision,
    Recall = recall,
    F1_Score = f1_score
  )
}
# Displaying the evaluation results for each department
for (department in names(evaluation_results)) {
  cat("Department:", department, "\n")
  cat("Optimal Threshold:", evaluation_results[[department]]$Threshold, "\n")
  cat("Confusion Matrix:\n")
  print(evaluation_results[[department]]$ConfusionMatrix)
  cat("Accuracy:", evaluation_results[[department]]$Accuracy, "\n")
  cat("Precision:", evaluation_results[[department]]$Precision, "\n")
  cat("Recall:", evaluation_results[[department]]$Recall, "\n")
  cat("F1 Score:", evaluation_results[[department]]$F1_Score, "\n\n")
}
## Department: all 
## Optimal Threshold: 0.02791888 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                744                445
## Actual Negative (0)              11417              35355
## Accuracy: 0.752674 
## Precision: 0.06117918 
## Recall: 0.6257359 
## F1 Score: 0.1114607 
## 
## Department: automotive 
## Optimal Threshold: 0.03612456 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                 17                  5
## Actual Negative (0)                143                400
## Accuracy: 0.7380531 
## Precision: 0.10625 
## Recall: 0.7727273 
## F1 Score: 0.1868132 
## 
## Department: fashion 
## Optimal Threshold: 0.8623695 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                  1                  0
## Actual Negative (0)                  0                 16
## Accuracy: 1 
## Precision: 1 
## Recall: 1 
## F1 Score: 1 
## 
## Department: health and household 
## Optimal Threshold: 0.05193374 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                126                 67
## Actual Negative (0)                842               2290
## Accuracy: 0.7266165 
## Precision: 0.1301653 
## Recall: 0.6528497 
## F1 Score: 0.2170543 
## 
## Department: industrial and scientific 
## Optimal Threshold: 0.0188599 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                 19                  4
## Actual Negative (0)                304                293
## Accuracy: 0.5032258 
## Precision: 0.05882353 
## Recall: 0.826087 
## F1 Score: 0.1098266 
## 
## Department: sports and outdoors 
## Optimal Threshold: 0.05185521 
## Confusion Matrix:
##                     Predicted Positive Predicted Negative
## Actual Positive (1)                 49                 26
## Actual Negative (0)                357               1135
## Accuracy: 0.7555839 
## Precision: 0.1206897 
## Recall: 0.6533333 
## F1 Score: 0.2037422

Following the determination of the optimal threshold that maximizes recall using the ROC curve, we present the resulting evaluation metrics:

# Load the knitr package
library(knitr)

# Creating a data frame from evaluation_results
evaluation_df <- data.frame(
  Optimal_Threshold = sapply(evaluation_results, function(x) x$Threshold),
  Accuracy = sapply(evaluation_results, function(x) x$Accuracy),
  Precision = sapply(evaluation_results, function(x) x$Precision),
  Recall = sapply(evaluation_results, function(x) x$Recall),
  F1_Score = sapply(evaluation_results, function(x) x$F1_Score)
)

# Convert the columns to numeric (if not already)
evaluation_df$Recall <- as.numeric(as.character(evaluation_df$Recall))
evaluation_df$Accuracy <- as.numeric(as.character(evaluation_df$Accuracy))
evaluation_df$Precision <- as.numeric(as.character(evaluation_df$Precision))
evaluation_df$F1_Score <- as.numeric(as.character(evaluation_df$F1_Score))

# Sorting the data frame by recall in descending order
evaluation_df <- evaluation_df[order(-evaluation_df$Recall),]

# Showing the sorted table
kable(evaluation_df, caption = "Evaluation Results Sorted by Recall (Descending)")
Table: Evaluation Results Sorted by Recall (Descending)

| Department                | Optimal_Threshold |  Accuracy | Precision |    Recall |  F1_Score |
|---------------------------|------------------:|----------:|----------:|----------:|----------:|
| fashion                   |         0.8623695 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 |
| industrial and scientific |         0.0188599 | 0.5032258 | 0.0588235 | 0.8260870 | 0.1098266 |
| automotive                |         0.0361246 | 0.7380531 | 0.1062500 | 0.7727273 | 0.1868132 |
| sports and outdoors       |         0.0518552 | 0.7555839 | 0.1206897 | 0.6533333 | 0.2037422 |
| health and household      |         0.0519337 | 0.7266165 | 0.1301653 | 0.6528497 | 0.2170543 |
| all                       |         0.0279189 | 0.7526740 | 0.0611792 | 0.6257359 | 0.1114607 |

#3.3: DECISION TREE ANALYSIS

#First, since isBestSeller is numeric boolean 0 and 1, we will change it to "no" and "yes" respectively.
df_model$isBestSeller <- as.character(df_model$isBestSeller)
df_model$isBestSeller <- ifelse(df_model$isBestSeller == 1, "yes", "no")
df_model$department <- gsub("&| |,", "", df_model$department)
vrs <- c("department", "isBestSeller")
df_model[ vrs ] <- lapply(df_model[ vrs ], factor)
remove(vrs)

column_for_base <- c("isBestSeller", "department", "stars", "reviews", "price", "boughtInLastMonth", "titleLength")
column_for_log <- c("isBestSeller", "department", "stars", "reviews_log", "price_log", "boughtInLastMonth_log", "titleLength")
column_for_norm <- c("isBestSeller", "department", "stars_normalized", "reviews_log_normalized", "price_log_normalized","boughtInLastMonth_log", "titleLength_normalized")

# Create the subset of the dataframe
df_model_base <- df_model[column_for_base]
df_model_log <- df_model[column_for_log]
df_model_norm <- df_model[column_for_norm]
create_train_test_split <- function(data, prop) {
  set.seed(34567819)
  
  # Splitting using the supplied proportion (e.g., prop = 0.7 for a 70:30 split)
  split <- rsample::initial_split(data, prop = prop)
  training_set <- rsample::training(split)
  testing_set <- rsample::testing(split)
  
  # Output both sets as a named list
  list(training = training_set, testing = testing_set)
}


main_source = df_model_base
train_test_df <- create_train_test_split(main_source, 0.7)
training <- train_test_df$training
testing <- train_test_df$testing


model <- C50::C5.0(isBestSeller ~., 
                   data = training)
summary(model)
## 
## Call:
## C5.0.formula(formula = isBestSeller ~ ., data = training)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Fri Jul 12 17:44:01 2024
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 111907 cases (7 attributes) from undefined.data
## 
## Decision tree:
## 
## boughtInLastMonth <= 600: no (102450/1527)
## boughtInLastMonth > 600:
## :...boughtInLastMonth <= 3000: no (8707/953)
##     boughtInLastMonth > 3000:
##     :...boughtInLastMonth <= 7000: no (561/175)
##         boughtInLastMonth > 7000:
##         :...department in {artsandcrafts,bathandbody,electronics,
##             :              patioandgarden,sportsandoutdoors}: no (18/5)
##             department in {automotive,babyproducts,clothingandaccessories,
##             :              fashion,healthandhousehold,industrialandscientific,
##             :              Other,toolsandhomeimprovement,
##             :              toysandgames}: yes (82/28)
##             department = beauty:
##             :...stars <= 4.3: no (3)
##             :   stars > 4.3: yes (23/10)
##             department = homeandkitchen:
##             :...reviews <= 17445: no (29/9)
##             :   reviews > 17445: yes (14/2)
##             department = petsupplies:
##             :...titleLength <= 88: no (8)
##                 titleLength > 88: yes (12/4)
## 
## 
## Evaluation on training data (111907 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      11 2713( 2.4%)   <<
## 
## 
##      (a)    (b)    <-classified as
##    -----  -----
##   109107     44    (a): class no
##     2669     87    (b): class yes
## 
## 
##  Attribute usage:
## 
##  100.00% boughtInLastMonth
##    0.17% department
##    0.04% reviews
##    0.02% stars
##    0.02% titleLength
## 
## 
## Time: 0.5 secs
model_boost <- C5.0(isBestSeller ~.,
                    data = main_source,
                    trials = 10)

# Predicting on the testing data


pred.test <- predict(model_boost, testing)
CrossTable(testing$isBestSeller, pred.test,
           prop.chisq = FALSE,
           prop.c = FALSE,
           prop.r = FALSE,
           prop.t = FALSE,
           dnn = c("Actual", "Predicted"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |-------------------------|
## 
##  
## Total Observations in Table:  47961 
## 
##  
##              | Predicted 
##       Actual |        no |       yes | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |     46709 |        22 |     46731 | 
## -------------|-----------|-----------|-----------|
##          yes |      1194 |        36 |      1230 | 
## -------------|-----------|-----------|-----------|
## Column Total |     47903 |        58 |     47961 | 
## -------------|-----------|-----------|-----------|
## 
## 
input_no_tree = 50
model.forest <- randomForest::randomForest(isBestSeller ~., 
                                           data = main_source,
                                           ntree = input_no_tree, 
                                           mtry = 2, 
                                           replace = TRUE,
                                           importance = TRUE)
                                           
pred.test <- predict(model.forest, testing)
CrossTable(testing$isBestSeller, pred.test,
           prop.chisq = FALSE,
           prop.c = FALSE,
           prop.r = FALSE,
           prop.t = FALSE,
           dnn = c("Actual", "Predicted"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |-------------------------|
## 
##  
## Total Observations in Table:  47961 
## 
##  
##              | Predicted 
##       Actual |        no |       yes | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |     46731 |         0 |     46731 | 
## -------------|-----------|-----------|-----------|
##          yes |        83 |      1147 |      1230 | 
## -------------|-----------|-----------|-----------|
## Column Total |     46814 |      1147 |     47961 | 
## -------------|-----------|-----------|-----------|
## 
## 
input_no_tree = 100
model.forest <- randomForest::randomForest(isBestSeller ~., 
                                           data = main_source,
                                           ntree = input_no_tree, 
                                           mtry = 2, 
                                           replace = TRUE,
                                           importance = TRUE)

pred.test <- predict(model.forest, testing)
CrossTable(testing$isBestSeller, pred.test,
           prop.chisq = FALSE,
           prop.c = FALSE,
           prop.r = FALSE,
           prop.t = FALSE,
           dnn = c("Actual", "Predicted"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |-------------------------|
## 
##  
## Total Observations in Table:  47961 
## 
##  
##              | Predicted 
##       Actual |        no |       yes | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |     46731 |         0 |     46731 | 
## -------------|-----------|-----------|-----------|
##          yes |        32 |      1198 |      1230 | 
## -------------|-----------|-----------|-----------|
## Column Total |     46763 |      1198 |     47961 | 
## -------------|-----------|-----------|-----------|
## 
## 
# Extract variable importance
importance <- importance(model.forest)
print(importance)
##                         no       yes MeanDecreaseAccuracy MeanDecreaseGini
## department        70.53341 -3.500829             72.63421         678.6039
## stars             34.65850 -2.968269             32.71886         719.7659
## reviews           12.66761 20.257280             19.64993        2021.1274
## price             27.09167  4.692663             27.82373        1612.6877
## boughtInLastMonth 37.93067 73.661688             49.50979         890.9311
## titleLength       20.92008  3.163913             21.22089        1547.6062
# Plot
varImpPlot(model.forest)

# Initialize an empty list to store evaluation metrics for decision tree models
tree_evaluations <- list()

# Evaluate the performance of the C5.0 model
predictions <- predict(model, testing, type = "class")
confusion_matrix <- table(Predicted = predictions, Actual = testing$isBestSeller)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
precision <- confusion_matrix[2,2] / sum(confusion_matrix[2,])
recall <- confusion_matrix[2,2] / sum(confusion_matrix[,2])
f1_score <- 2 * (precision * recall) / (precision + recall)

# Calculate ROC and AUC for the C5.0 model
probabilities <- predict(model, testing, type = "prob")[,2]
roc_curve <- pROC::roc(testing$isBestSeller, probabilities)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
auc <- pROC::auc(roc_curve)

# Store the metrics in the tree_evaluations list
tree_evaluations[["All"]] <- list(
  Accuracy = accuracy,
  Precision = precision,
  Recall = recall,
  F1_Score = f1_score,
  AUC = auc
)

# Repeat the evaluation for the boosted and random forest models if applicable
# Boosted model
predictions_boost <- predict(model_boost, testing, type = "class")
confusion_matrix_boost <- table(Predicted = predictions_boost, Actual = testing$isBestSeller)
accuracy_boost <- sum(diag(confusion_matrix_boost)) / sum(confusion_matrix_boost)
precision_boost <- confusion_matrix_boost[2,2] / sum(confusion_matrix_boost[2,])
recall_boost <- confusion_matrix_boost[2,2] / sum(confusion_matrix_boost[,2])
f1_score_boost <- 2 * (precision_boost * recall_boost) / (precision_boost + recall_boost)
probabilities_boost <- predict(model_boost, testing, type = "prob")[,2]
roc_curve_boost <- pROC::roc(testing$isBestSeller, probabilities_boost)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
auc_boost <- pROC::auc(roc_curve_boost)

tree_evaluations[["Boosted"]] <- list(
  Accuracy = accuracy_boost,
  Precision = precision_boost,
  Recall = recall_boost,
  F1_Score = f1_score_boost,
  AUC = auc_boost
)

# Random forest model
predictions_rf <- predict(model.forest, testing, type = "class")
confusion_matrix_rf <- table(Predicted = predictions_rf, Actual = testing$isBestSeller)
accuracy_rf <- sum(diag(confusion_matrix_rf)) / sum(confusion_matrix_rf)
precision_rf <- confusion_matrix_rf[2,2] / sum(confusion_matrix_rf[2,])
recall_rf <- confusion_matrix_rf[2,2] / sum(confusion_matrix_rf[,2])
f1_score_rf <- 2 * (precision_rf * recall_rf) / (precision_rf + recall_rf)
probabilities_rf <- predict(model.forest, testing, type = "prob")[,2]
roc_curve_rf <- pROC::roc(testing$isBestSeller, probabilities_rf)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
auc_rf <- pROC::auc(roc_curve_rf)

tree_evaluations[["Random Forest"]] <- list(
  Accuracy = accuracy_rf,
  Precision = precision_rf,
  Recall = recall_rf,
  F1_Score = f1_score_rf,
  AUC = auc_rf
)

4. CONCLUSION

# Combine evaluation metrics from logistic regression and decision tree models

# Initialize an empty data frame to store evaluation metrics
evaluation_summary <- data.frame(
  Model = character(),
  Department = character(),
  Accuracy = numeric(),
  Precision = numeric(),
  Recall = numeric(),
  F1_Score = numeric(),
  AUC = numeric(),
  stringsAsFactors = FALSE
)

# Add evaluation metrics for logistic regression models
for (department in names(metrics)) {
  evaluation_summary <- rbind(evaluation_summary, data.frame(
    Model = "Logistic Regression",
    Department = department,
    Accuracy = metrics[[department]]$Accuracy,
    Precision = metrics[[department]]$Precision,
    Recall = metrics[[department]]$Recall,
    F1_Score = metrics[[department]]$F1_Score,
    AUC = model_evaluations[[department]]$AUC
  ))
}

# Add evaluation metrics for decision tree models
for (department in names(tree_evaluations)) {
  evaluation_summary <- rbind(evaluation_summary, data.frame(
    Model = "Decision Tree",
    Department = department,
    Accuracy = tree_evaluations[[department]]$Accuracy,
    Precision = tree_evaluations[[department]]$Precision,
    Recall = tree_evaluations[[department]]$Recall,
    F1_Score = tree_evaluations[[department]]$F1_Score,
    AUC = tree_evaluations[[department]]$AUC
  ))
}

# Display the evaluation summary
print(evaluation_summary)
##                 Model                Department  Accuracy  Precision     Recall
## 1 Logistic Regression                       all 0.9749171 0.03700589 0.43137255
## 2 Logistic Regression                automotive 0.9646018 0.13636364 0.75000000
## 3 Logistic Regression                   fashion 0.9411765 1.00000000 0.50000000
## 4 Logistic Regression      health and household 0.9428571 0.09326425 0.54545455
## 5 Logistic Regression industrial and scientific 0.9645161 0.13043478 0.60000000
## 6 Logistic Regression       sports and outdoors 0.9514997 0.00000000 0.00000000
## 7       Decision Tree                       All 0.9745210 0.57142857 0.02601626
## 8       Decision Tree                   Boosted 0.9746461 0.62068966 0.02926829
## 9       Decision Tree             Random Forest 0.9993536 1.00000000 0.97479675
##     F1_Score       AUC
## 1 0.06816421 0.7591143
## 2 0.23076923 0.7885485
## 3 0.66666667 1.0000000
## 4 0.15929204 0.7473134
## 5 0.21428571 0.6964533
## 6        NaN 0.7648525
## 7 0.04976672 0.6864733
## 8 0.05590062 0.7033899
## 9 0.98723755 1.0000000
# Load necessary library
library(ggplot2)

# Create bar plots for comparison of evaluation metrics

# Accuracy comparison
accuracy_plot <- ggplot(evaluation_summary, aes(x = Department, y = Accuracy, fill = Model)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Accuracy Comparison", x = "Department", y = "Accuracy") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Precision comparison
precision_plot <- ggplot(evaluation_summary, aes(x = Department, y = Precision, fill = Model)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Precision Comparison", x = "Department", y = "Precision") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Recall comparison
recall_plot <- ggplot(evaluation_summary, aes(x = Department, y = Recall, fill = Model)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Recall Comparison", x = "Department", y = "Recall") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# F1 Score comparison
f1_score_plot <- ggplot(evaluation_summary, aes(x = Department, y = F1_Score, fill = Model)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "F1 Score Comparison", x = "Department", y = "F1 Score") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# AUC comparison
auc_plot <- ggplot(evaluation_summary, aes(x = Department, y = AUC, fill = Model)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "AUC Comparison", x = "Department", y = "AUC") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Display the plots
print(accuracy_plot)

print(precision_plot)

print(recall_plot)

print(f1_score_plot)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_bar()`).

print(auc_plot)

CONCLUSION

The comparison between the decision tree and logistic regression models helps determine the performance and accuracy of the two approaches. Performance is measured in terms of accuracy, precision, recall, F1 score and AUC (area under the ROC curve), and each department is evaluated against these metrics. The utility of each metric is explained below, with a small worked example after the list:

  1. Accuracy: the proportion of all predictions, positive and negative, that the model gets right.

  2. Precision: the proportion of predicted positives that are true positives; high precision means few false positives.

  3. Recall: the proportion of actual positive cases that the model correctly identifies, i.e. the model's ability to find all positive cases.

  4. F1 score: the harmonic mean of precision and recall, providing a single metric that accounts for both.

  5. AUC: the model's overall ability to discriminate between the classes across all possible thresholds; the closer the value is to 1, the better the model separates best sellers from non-best sellers.
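For illustration, the short sketch below recomputes these metrics from the Sports and Outdoors confusion matrix printed earlier (TP = 49, FN = 26, FP = 357, TN = 1135); the variable names are ours and not part of the modelling code above.

# Worked example: evaluation metrics from the Sports and Outdoors confusion matrix
TP <- 49; FN <- 26; FP <- 357; TN <- 1135

accuracy  <- (TP + TN) / (TP + TN + FP + FN)                 # 0.7556
precision <- TP / (TP + FP)                                  # 0.1207
recall    <- TP / (TP + FN)                                  # 0.6533
f1_score  <- 2 * precision * recall / (precision + recall)   # 0.2037

round(c(Accuracy = accuracy, Precision = precision, Recall = recall, F1 = f1_score), 4)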

In the comparison charts, logistic regression generally outperforms the single and boosted decision trees, which can be observed by comparing the blue bars (Logistic Regression) with the red bars (Decision Tree). The AUC for the plain and boosted decision trees is only around 0.69-0.70, whereas logistic regression reaches roughly 0.70-0.79 across the individual departments (Automotive, Health and Household, Industrial and Scientific, Sports and Outdoors) and 1.00 for Fashion, suggesting that it is the more reliable model for identifying best-selling products. The random forest is the exception: it achieves near-perfect metrics, although this partly reflects the fact that it was trained on the full dataset rather than only the training split.

  1. Departmental Insights:
    • The Fashion department has the highest AUC (1.00) for Logistic Regression, indicating very strong performance in this category.
    • The Industrial and Scientific department has the lowest logistic regression AUC (about 0.70), with Health and Household close behind, suggesting that predicting best sellers in these categories is more challenging.

DECISION TREE: The model output shows that 111907 cases (70% of the 159868 cases) were used to grow the tree, with 7 attributes (6 predictors plus the target). The decision tree can be read as follows (see the sketch after this paragraph):
  i. If 'boughtInLastMonth' is less than or equal to 600, the product is not a best seller.
  ii. Otherwise, if 'boughtInLastMonth' is less than or equal to 3000, it is not a best seller.
  iii. Otherwise, if 'boughtInLastMonth' is less than or equal to 7000, it is not a best seller.
  iv. Above 7000, the prediction depends on the department, with further splits on stars, reviews and title length (for example, beauty products are classed as best sellers only when stars exceed 4.3).
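To make the traversal concrete, the sketch below translates the tree's top-level splits into plain R; the thresholds (600, 3000, 7000) are taken directly from the printed tree, while the example value of boughtInLastMonth is hypothetical.

# Sketch: the tree's top-level rules as an if/else chain (illustrative only)
bought_in_last_month <- 8500   # hypothetical product

if (bought_in_last_month <= 600) {
  prediction <- "no"
} else if (bought_in_last_month <= 3000) {
  prediction <- "no"
} else if (bought_in_last_month <= 7000) {
  prediction <- "no"
} else {
  # Above 7000 the tree branches on department, then on stars, reviews or titleLength
  # (e.g. beauty products additionally need stars > 4.3 to be classed "yes")
  prediction <- "depends on department and further splits"
}
prediction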

The confusion matrix:

  1. True positives: the number of cases where the model correctly predicts the positive class
  2. True negatives: the number of cases where the model correctly predicts the negative class
  3. False positives: the number of cases where the model incorrectly predicts the negative class as positive
  4. False negatives: the number of cases where the model incorrectly predicts the positive class as negative

In this model we want to predict best-selling products, so the best-seller class ("yes") is our positive class. The attribute usage shows that 'boughtInLastMonth' is by far the most important predictor, appearing in 100% of cases. To further assess the result, we evaluate on the unseen testing data, apply boosting with 10 trials, and grow random forests with 50 and 100 trees. The boosted model achieves a test accuracy of ((TP+TN)/n) = ((46709+36)/47961) ≈ 97.46%, while the random forests reach approximately 99.83% and 99.93% respectively (a quick check of these figures appears below). Note, however, that the boosted and random forest models were fitted on the full dataset (main_source) rather than only on the training split, so their test-set accuracies are likely optimistic.
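The accuracy figures quoted above can be verified directly from the three confusion matrices printed earlier (boosted C5.0 on the testing data, and the two random forests):

# Quick check of test accuracy from the printed confusion matrices
n <- 47961
acc_boost <- (46709 +   36) / n   # boosted C5.0, trials = 10   -> ~0.9746
acc_rf50  <- (46731 + 1147) / n   # random forest, ntree = 50   -> ~0.9983
acc_rf100 <- (46731 + 1198) / n   # random forest, ntree = 100  -> ~0.9993
round(c(boosted = acc_boost, rf50 = acc_rf50, rf100 = acc_rf100), 4)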

RECOMMENDATION

Based on our comprehensive analysis of Amazon UK's e-commerce data, we can derive strategic recommendations to address our key business questions:

Business Question 1: What factors contribute to a product becoming a bestseller on Amazon UK?

Focus on Customer Reviews: Across most departments, the number of reviews consistently emerged as a significant predictor of bestseller status. We recommend:

Implement strategies to encourage customer reviews, such as follow-up emails or incentives for leaving feedback. Engage with customer reviews promptly, addressing both positive and negative feedback to show active seller involvement.

Optimize Recent Sales: The ‘boughtInLastMonth’ variable is a strong predictor in many departments. While this is partly an effect rather than a cause, it suggests:

Implement targeted marketing campaigns to boost initial sales when launching new products. Use Amazon’s promotional tools like Lightning Deals or Coupons to drive short-term sales spikes.

Refine Product Titles: Title length showed significance in some departments, particularly Industrial and Scientific. We recommend:

Craft informative, keyword-rich titles that balance detail with readability. Tailor title strategies to specific departments, as the impact varies across categories.

Business Question 2: How do these factors vary across different product categories?

Category-Specific Strategies:

  • Automotive: Focus on driving recent sales and increasing reviews.
  • Health and Household: Prioritize gathering customer reviews, optimizing title length, and maintaining recent sales.
  • Industrial and Scientific: Emphasize detailed, informative product titles and focus on gathering reviews and recent sales.
  • Sports and Outdoors: Concentrate on increasing review numbers and maintaining steady recent sales.
  • Fashion: Our analysis showed that higher-priced items with more reviews tend to become bestsellers; focus on quality products and encouraging reviews.

Price Considerations:

In the Fashion category, there was a trend towards higher-priced items being bestsellers. This suggests focusing on quality and brand value rather than competing solely on price. For other categories, price wasn’t consistently a significant factor, indicating that other aspects of the product and its presentation may be more crucial for success.

Star Ratings: Star ratings showed mixed results across categories, sometimes with negative associations. This suggests:

Don’t rely solely on high star ratings for success. Focus more on the quantity of reviews and recent sales performance. Investigate products with slightly lower ratings but high sales to understand what other factors are driving their success.

Business Question 3: How can sellers optimize their strategies to increase the likelihood of achieving bestseller status?

Leverage Machine Learning Insights: Our Random Forest and C5.0 models showed good predictive power. Use these insights to:

Prioritize efforts on the most influential factors identified by the models for each category. Regularly update and refine strategies based on new data and model predictions.

Balanced Approach: While focusing on key factors, maintain a balanced approach:

Ensure product quality to maintain positive reviews and ratings. Use competitive pricing strategies, but don’t rely on price alone to drive bestseller status. Continuously monitor and adapt to category-specific trends and consumer behaviors.

Data-Driven Decision Making:

Regularly analyze product performance data to identify trends and opportunities. Use A/B testing for elements like product titles and pricing to optimize listings. Stay informed about Amazon’s algorithm updates and adjust strategies accordingly.

Department-Specific Recommendations:

  • All Departments: Focus on increasing reviews and recent sales.
  • Automotive: Aim for consistent monthly sales of around 300 units for bestseller products.
  • Fashion: Emphasize higher-quality, higher-priced items and focus on gathering reviews.
  • Health and Household: Balance efforts between gathering reviews, optimizing titles, and maintaining sales.
  • Industrial and Scientific: Pay special attention to crafting detailed, informative product titles.
  • Sports and Outdoors: Prioritize increasing review numbers and maintaining steady recent sales.

By implementing these recommendations, sellers can enhance their chances of achieving bestseller status on Amazon UK. Remember that the e-commerce landscape is dynamic, and strategies should be regularly reviewed and adjusted based on the latest data and market trends.