Section 1: Introduction & Objectives
Section 2: Data understanding
Section 3: Data preparation
Section 4: Exploratory Data Analysis
- 4.1 Data Visualization
- 4.2 Conclusions and Insights
Section 5: Modeling
- 5.1 Regression Model: Predicting Product Sales
- 5.2 Classification Model: Classifying Sales as High & Low
Section 6: Evaluation
- 6.1 Evaluate Results
- 6.2 Review Process
Section 7: Deployment
- 7.1 Plan Deployment
- 7.2 Interactive Dashboard
Conclusion

Section 1: Introduction & Objectives

1.1 Background

1. Lazada, a leading e-commerce platform in Southeast Asia, attracts global sellers and consumers through diverse promotional activities.

2. For merchants, understanding the effectiveness of these promotions is crucial for optimizing marketing strategies and enhancing platform competitiveness.

1.2 Objectives

This analysis focuses on Lazada’s cross-border e-commerce promotions. It assesses how factors such as discount intensity and brand influence affect sales, providing data-driven insights to support marketing decisions in cross-border e-commerce.

The questions we are going to address are:

1. Regression

Predict product sales and analyze the impact of factors such as discount rates, ratings, and the number of reviews.

2. Classification

Identify which categories of products are more likely to achieve high sales through promotions.

1.3 Who Will Benefit

1. Cross-border merchants

Merchants who are looking to enter the cross-border e-commerce market or to make their promotions more effective.

2. E-commerce platforms

Companies aiming to refine their promotional policies and increase market share (e.g., Lazada, Shopee, Taobao).

3. Supply chain and logistics service providers

Providers seeking to better predict demand fluctuations, optimize inventory management, and improve delivery efficiency.

1.4 Data Source

https://www.kaggle.com/datasets/phucvr/lazada-crawl

Section 2: Data understanding

2.1 Dataset overview

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readxl)
data <- read_excel("./wqd7004_dataset.xlsx") 
head(data)

## # A tibble: 6 × 14
##       itemId name     brandName category originalPrice priceShow itemSoldCntShow
##        <dbl> <chr>    <chr>     <chr>            <dbl>     <dbl>           <dbl>
## 1  342230105 Origina… Xiaomi    Cellula…       3590000   3466863              72
## 2  464752465 Xiaomi … Xiaomi    Cellula…       5200000   2850000               5
## 3  709428868 Xiaomi … Xiaomi    Cellula…       2690000   1757000             288
## 4  895822931 Xiaomi … Xiaomi    Cellula…       2690000   1849000              52
## 5 1205582653 Phone X… Xiaomi    Cellula…       4890000   4890000              10
## 6 1277481768 Xiaomi … Xiaomi    Tablet         5090000   3999000              17
## # ℹ 7 more variables: discount <dbl>, ratingScore <dbl>, review <dbl>,
## #   location <chr>, sellerName <chr>, sellerId <dbl>, brandId <dbl>

#The number of rows and columns

nrow(data)

## [1] 3586

ncol(data)

## [1] 14

dim(data)

## [1] 3586   14

names(data)

##  [1] "itemId"          "name"            "brandName"       "category"       
##  [5] "originalPrice"   "priceShow"       "itemSoldCntShow" "discount"       
##  [9] "ratingScore"     "review"          "location"        "sellerName"     
## [13] "sellerId"        "brandId"

#Type of data
sapply(data, class)

##          itemId            name       brandName        category   originalPrice 
##       "numeric"     "character"     "character"     "character"       "numeric" 
##       priceShow itemSoldCntShow        discount     ratingScore          review 
##       "numeric"       "numeric"       "numeric"       "numeric"       "numeric" 
##        location      sellerName        sellerId         brandId 
##     "character"     "character"       "numeric"       "numeric"

#Data distribution
summary(data)

##      itemId              name            brandName           category        
##  Min.   :2.542e+08   Length:3586        Length:3586        Length:3586       
##  1st Qu.:2.273e+09   Class :character   Class :character   Class :character  
##  Median :2.427e+09   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2.377e+09                                                           
##  3rd Qu.:2.593e+09                                                           
##  Max.   :2.671e+09                                                           
##  originalPrice        priceShow        itemSoldCntShow       discount    
##  Min.   :    1399   Min.   :    1399   Min.   :     0.0   Min.   : 0.00  
##  1st Qu.: 1200000   1st Qu.:  837328   1st Qu.:     0.0   1st Qu.:15.00  
##  Median : 2150299   Median : 1131232   Median :     5.0   Median :43.00  
##  Mean   : 3834455   Mean   : 2813634   Mean   :   262.8   Mean   :33.45  
##  3rd Qu.: 3388000   3rd Qu.: 2596000   3rd Qu.:    57.0   3rd Qu.:50.00  
##  Max.   :44490000   Max.   :39890000   Max.   :128000.0   Max.   :72.00  
##   ratingScore        review          location          sellerName       
##  Min.   :0.000   Min.   :   0.00   Length:3586        Length:3586       
##  1st Qu.:0.000   1st Qu.:   0.00   Class :character   Class :character  
##  Median :4.622   Median :   1.00   Mode  :character   Mode  :character  
##  Mean   :2.686   Mean   :  26.46                                        
##  3rd Qu.:5.000   3rd Qu.:  41.00                                        
##  Max.   :5.000   Max.   :2886.00                                        
##     sellerId            brandId         
##  Min.   :1.276e+04   Min.   :      667  
##  1st Qu.:2.002e+11   1st Qu.:    65074  
##  Median :2.003e+11   Median :   134666  
##  Mean   :1.800e+11   Mean   : 59893744  
##  3rd Qu.:2.003e+11   3rd Qu.:127167201  
##  Max.   :2.007e+11   Max.   :127221667

2.2 Verify Data Quality

2.2.1 Check missing value

missing_val <- colSums(is.na(data))
print(missing_val)

##          itemId            name       brandName        category   originalPrice 
##               0               0               0               0               0 
##       priceShow itemSoldCntShow        discount     ratingScore          review 
##               0               0               0               0               0 
##        location      sellerName        sellerId         brandId 
##               0               0               0               0

2.2.2 Check validity of the dataset

#check itemId
itemId_positive_int <- all(data$itemId > 0 & data$itemId %% 1 == 0, na.rm = TRUE)
itemId_unique <- length(data$itemId) == length(unique(data$itemId))

if (itemId_positive_int && itemId_unique) {
  cat("itemId:All values are positive integers and unique.\n")
} else {
  cat("itemId:Non-positive-integer or duplicate values exist.\n")
}

## itemId:All values are positive integers and unique.

##check priceShow&originalPrice
priceShow_positive<-all(data$priceShow > 0,na.rm = TRUE)
originalPrice_positive<-all(data$originalPrice > 0,na.rm = TRUE)

if (priceShow_positive & originalPrice_positive) {
  cat("priceShow: All values are positive.\n")
  cat("originalPrice: All values are positive.\n")
} else {
  cat("Non-positive values exist.")
}

## priceShow: All values are positive.
## originalPrice: All values are positive.

##check discount
if(all(data$discount>=0 & data$discount<100)){
  cat("discount:All values are between 0 and 100.\n")
}else{
  cat("discount:An invalid value exists.\n")
}

## discount:All values are between 0 and 100.

##check ratingScore
if(all(data$ratingScore>=0 & data$ratingScore<=5)){
  cat("ratingScore:All values are between 0 and 5.\n")
}else{
  cat("ratingScore:An invalid value exists.\n")
}

## ratingScore:All values are between 0 and 5.

##check Review and itemSoldCntShow
review<-all(data$review >= 0 & data$review %% 1 == 0,na.rm = TRUE)
itemSoldCntShow<-all(data$itemSoldCntShow >= 0 & data$itemSoldCntShow %% 1 == 0,na.rm = TRUE)


if (review & itemSoldCntShow) {
  cat("review: All values are non-negative integers.\n")
  cat("itemSoldCntShow: All values are non-negative integers.\n")
} else {
  cat("An invalid value exists.")
}

## review: All values are non-negative integers.
## itemSoldCntShow: All values are non-negative integers.

2.3 Univariate analysis

library(ggplot2)
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(rlang)

barplot <- function(data, column_name) {
  ggplot(data, aes(x = !!sym(column_name))) +  # use !!sym() instead of aes_string()
    geom_bar(fill = "skyblue", color = "red") +
    labs(title = paste(column_name), x = column_name, y = "Frequency") +
    theme_minimal()
}

p1 <- barplot(data, "originalPrice")
p2 <- barplot(data, "priceShow")
p3 <- barplot(data, "itemSoldCntShow")
p4 <- barplot(data, "discount")
p5 <- barplot(data, "review")
p6 <- barplot(data, "ratingScore")

grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 3)

2.4 Multivariate analysis

library(ggplot2)
library(reshape2)
library(readxl)
head(data)

## # A tibble: 6 × 14
##       itemId name     brandName category originalPrice priceShow itemSoldCntShow
##        <dbl> <chr>    <chr>     <chr>            <dbl>     <dbl>           <dbl>
## 1  342230105 Origina… Xiaomi    Cellula…       3590000   3466863              72
## 2  464752465 Xiaomi … Xiaomi    Cellula…       5200000   2850000               5
## 3  709428868 Xiaomi … Xiaomi    Cellula…       2690000   1757000             288
## 4  895822931 Xiaomi … Xiaomi    Cellula…       2690000   1849000              52
## 5 1205582653 Phone X… Xiaomi    Cellula…       4890000   4890000              10
## 6 1277481768 Xiaomi … Xiaomi    Tablet         5090000   3999000              17
## # ℹ 7 more variables: discount <dbl>, ratingScore <dbl>, review <dbl>,
## #   location <chr>, sellerName <chr>, sellerId <dbl>, brandId <dbl>

numeric_vars <- data[, c("itemId","sellerId","brandId","originalPrice","itemSoldCntShow", "priceShow","ratingScore", "review","discount")]
cor_matrix <- cor(numeric_vars, use = "complete.obs")  # use only rows without missing values
melted <- melt(cor_matrix)

ggplot(melted, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = sprintf("%.2f", value)), color = "black", size = 3) +
  scale_fill_gradient2(low = "purple", high = "red", mid = "white", midpoint = 0, limit = c(-1, 1)) +
  labs(title = "Correlation heat map of numeric variables", x = "Variable", y = "Variable", fill = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
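
A heat map of nine variables can be hard to scan, so it may help to list the pairs ranked by the strength of their correlation. The following is a minimal sketch in base R. The data frame toy_df and its values are invented for illustration and are not the Lazada data.

```r
# Toy data standing in for the numeric columns (values are invented)
toy_df <- data.frame(
  price    = c(10, 20, 30, 40, 50),
  discount = c(50, 40, 35, 20, 10),
  sold     = c(5, 3, 8, 2, 7)
)

cor_mat <- cor(toy_df, use = "complete.obs")

# Keep each pair once by taking the upper triangle only
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)

# Sort by absolute correlation, strongest first
pairs <- pairs[order(-abs(pairs$r)), ]
print(pairs)
```

In this toy example the price and discount pair comes out on top with a strongly negative correlation. Applied to the real correlation matrix, the same ranking would also show how much of the structure comes from identifier columns such as itemId, which carry no business meaning.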

Section 3: Data preparation

3.1 Clean and Prepare Data

# Load necessary libraries  
library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:reshape2':
## 
##     smiths

library(knitr)  
library(stringr)  
library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(caret)  # For data set partitioning

## Loading required package: lattice

# Step 1: Clean column names  
names(data) <- gsub(" ", "_", names(data))  # Replace spaces with underscores  

# Step 2: Convert columns to appropriate data types  
data$category <- as.factor(data$category)  
data$originalPrice <- as.numeric(data$originalPrice)  
data$priceShow <- as.numeric(data$priceShow)  
data$itemSoldCntShow <- as.numeric(data$itemSoldCntShow)  
data$discount <- as.numeric(data$discount)  
data$ratingScore <- as.numeric(data$ratingScore)  
data$review <- as.numeric(data$review)  

# Step 3: Handle missing values  
data <- data %>%  
  filter(!is.na(originalPrice) & !is.na(priceShow) &  
         !is.na(itemSoldCntShow) & !is.na(discount))  

# Step 4: Handle outliers  
## Function to remove outliers based on the 1.5 * IQR rule  
remove_outliers <- function(df, var) {  
  q1 <- quantile(df[[var]], 0.25)  
  q3 <- quantile(df[[var]], 0.75)  
  iqr <- q3 - q1  # lower-case name avoids masking stats::IQR()  
  
  lower_bound <- q1 - 1.5 * iqr  
  upper_bound <- q3 + 1.5 * iqr  
  
  df %>% filter(.data[[var]] >= lower_bound & .data[[var]] <= upper_bound)  
}  

## Loop through numeric columns and remove outliers  
## Note: this includes the identifier columns (itemId, sellerId, brandId),  
## and the bounds for each column are computed on the rows that survived  
## the filtering of the previous columns  
numeric_columns <- sapply(data, is.numeric)  
for (col in names(data)[numeric_columns]) {  
  data <- remove_outliers(data, col)  
}  

# Step 5: Create new feature variable (Sales Score) with normalization and weights  
data <- data %>%  
  mutate(  
    normalized_sales = (itemSoldCntShow - min(itemSoldCntShow)) /   
                       (max(itemSoldCntShow) - min(itemSoldCntShow)),  
    normalized_rating = (ratingScore - min(ratingScore)) /   
                        (max(ratingScore) - min(ratingScore)),  
    
    sales_score = (normalized_sales * 0.7) + (normalized_rating * 0.3)  # Setting weights
  )

# Step 6: Clean 'sellerName' and handle 'name' column  
data <- data %>%  
  mutate(  
    sellerName = str_replace_all(sellerName, "[\u4e00-\u9fa5]", ""),  # Remove Chinese characters  
    sellerName = str_trim(sellerName),  # Trim whitespace  
    name = ifelse(nchar(name) > 10, paste0(substr(name, 1, 10), "..."), name)  # Truncate 'name' column  
  ) 

# Step 7: Display cleaned data  
kable(head(data), format = "html", caption = "Prepared Dataset") %>%  
  kable_styling("striped", full_width = FALSE)

Prepared Dataset
itemId	name	brandName	category	originalPrice	priceShow	itemSoldCntShow	discount	ratingScore	review	location	sellerName	sellerId	brandId	normalized_sales	normalized_rating	sales_score
2033044000	Xiaomi MiM…	Xiaomi	Cellular phone	1600000	930468	45	42	4.652	23	Ho Chi Minh	TAY NGUYEN STORE 47	200169939195	4348	0.2571429	0.9304	0.45912
2033070723	Xiaomi Mi8…	Xiaomi	Cellular phone	3150000	1674657	30	47	5.000	20	Ho Chi Minh	TAY NGUYEN STORE 47	200169939195	4348	0.1714286	1.0000	0.42000
2033123603	Xiaomi MiM…	Xiaomi	Cellular phone	1500000	930468	37	38	4.611	18	Ho Chi Minh	TAY NGUYEN STORE 47	200169939195	4348	0.2114286	0.9222	0.42466
2033161146	Xiaomi Mi8…	Xiaomi	Cellular phone	4050000	2141288	28	47	5.000	17	Ho Chi Minh	TAY NGUYEN STORE 47	200169939195	4348	0.1600000	1.0000	0.41200
2033266624	Genuine Xi…	Xiaomi	Cellular phone	1999000	1116748	82	44	4.815	65	Ho Chi Minh	TAY NGUYEN STORE 47	200169939195	4348	0.4685714	0.9630	0.61690
2033406937	Genuine Xi…	Xiaomi	Cellular phone	2850000	1535878	28	46	5.000	16	Ho Chi Minh	TAY NGUYEN STORE 47	200169939195	4348	0.1600000	1.0000	0.41200

# Step 8: Save the cleaned dataset  
write.csv(data, "prepared_dataset.csv", row.names = FALSE)
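
The IQR rule used in Step 4 can be checked by hand on a small vector. The sketch below uses invented sales counts (the vector x is an assumption for illustration) and shows how the bounds are derived and which values survive.

```r
# Invented sales counts with one extreme value
x <- c(0, 2, 5, 7, 9, 12, 15, 400)

q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] are treated as outliers
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

kept <- x[x >= lower & x <= upper]
print(c(lower = lower, upper = upper))
print(kept)
```

Here only the value 400 falls outside the bounds. Because the loop in Step 4 applies this rule to one column after another, a row is dropped as soon as it is an outlier in any numeric column, which explains why the prepared dataset is considerably smaller than the original 3586 rows.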

3.2 Split the Dataset

# Split the dataset into training and testing sets  
set.seed(123)  # For reproducibility
train_index <- createDataPartition(data$sales_score, p = .8,  # 80% As a training set  
                                    list = FALSE,   
                                    times = 1)  # List is FALSE Return index

train_data <- data[train_index, ]  # Training Data  
test_data <- data[-train_index, ]   # Test Data
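
The idea behind the 80/20 split can be sketched in base R. The version below is a plain random split on invented data (toy and its columns are assumptions), not the sampling that createDataPartition() performs. It simply checks that the two parts do not overlap and compares their score distributions.

```r
set.seed(123)  # for reproducibility

# Invented data standing in for the prepared dataset
toy <- data.frame(id = 1:200, score = runif(200))

# Draw 80% of the row indices at random for training
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.8 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# The two parts should be disjoint
stopifnot(length(intersect(toy_train$id, toy_test$id)) == 0)

# Compare the quartiles of the score in each part
print(rbind(train = quantile(toy_train$score),
            test  = quantile(toy_test$score)))
```

If the quartiles of the two parts differ strongly, the test set may not be representative of the training data, and a different seed or a stratified split is worth considering.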

3.3 Display Prepared Datasets

# Display the first few rows of the prepared training data  
kable(head(train_data), format = "html", caption = "Training Dataset") %>%  
  kable_styling("striped", full_width = FALSE)

Training Dataset
itemId	name	brandName	category	originalPrice	priceShow	itemSoldCntShow	discount	ratingScore	review	location	sellerName	sellerId	brandId	normalized_sales	normalized_rating	sales_score
2033044000	Xiaomi MiM…	Xiaomi	Cellular phone	1600000	930468	45	42	4.652	23	Ho Chi Minh	TAY NGUYEN STORE 47	200169939195	4348	0.2571429	0.9304	0.45912
2033123603	Xiaomi MiM…	Xiaomi	Cellular phone	1500000	930468	37	38	4.611	18	Ho Chi Minh	TAY NGUYEN STORE 47	200169939195	4348	0.2114286	0.9222	0.42466
2033266624	Genuine Xi…	Xiaomi	Cellular phone	1999000	1116748	82	44	4.815	65	Ho Chi Minh	TAY NGUYEN STORE 47	200169939195	4348	0.4685714	0.9630	0.61690
2033406937	Genuine Xi…	Xiaomi	Cellular phone	2850000	1535878	28	46	5.000	16	Ho Chi Minh	TAY NGUYEN STORE 47	200169939195	4348	0.1600000	1.0000	0.41200
2034907442	Genuine Xi…	Xiaomi	Cellular phone	1599000	930468	114	42	5.000	96	Ho Chi Minh	TAY NGUYEN STORE 47	200169939195	4348	0.6514286	1.0000	0.75600
2035034952	Genuine Xi…	Xiaomi	Cellular phone	1350000	837328	38	38	4.684	19	Ho Chi Minh	TAY NGUYEN STORE 47	200169939195	4348	0.2171429	0.9368	0.43304

# Display the first few rows of the prepared testing data  
kable(head(test_data), format = "html", caption = "Testing Dataset") %>%  
  kable_styling("striped", full_width = FALSE)

Testing Dataset
itemId	name	brandName	category	originalPrice	priceShow	itemSoldCntShow	discount	ratingScore	review	location	sellerName	sellerId	brandId	normalized_sales	normalized_rating	sales_score
2033070723	Xiaomi Mi8…	Xiaomi	Cellular phone	3150000	1674657	30	47	5.000	20	Ho Chi Minh	TAY NGUYEN STORE 47	200169939195	4348	0.1714286	1.0000	0.42000
2033161146	Xiaomi Mi8…	Xiaomi	Cellular phone	4050000	2141288	28	47	5.000	17	Ho Chi Minh	TAY NGUYEN STORE 47	200169939195	4348	0.1600000	1.0000	0.41200
2035335256	Xiaomi Red…	Xiaomi	Cellular phone	2400000	1208957	75	50	4.862	65	Ho Chi Minh	TAY NGUYEN STORE 47	200169939195	4348	0.4285714	0.9724	0.59172
2035948704	Xiaomi Mim…	Xiaomi	Cellular phone	1100000	604478	34	45	4.545	22	Ho Chi Minh	TAY NGUYEN STORE 47	200169939195	4348	0.1942857	0.9090	0.40870
2036324094	Genuine Xi…	Xiaomi	Cellular phone	1950000	929537	35	52	5.000	18	Ho Chi Minh	TAY NGUYEN STORE 47	200169939195	4348	0.2000000	1.0000	0.44000
2036369720	Xiaomi Red…	Xiaomi	Cellular phone	668000	361573	141	46	4.956	114	Ho Chi Minh	TAY NGUYEN STORE 47	200169939195	4348	0.8057143	0.9912	0.86136

Section 4: Exploratory Data Analysis

4.1 Data Visualization

# Load necessary libraries  
library(readr)  
library(ggplot2)  
library(dplyr)  
library(corrplot)  

# 1. Load the dataset  
data <- read_csv("prepared_dataset.csv")  

# 2. Explore Key Features  

## 2.1 Distribution of Sales Score  
ggplot(data, aes(x = sales_score)) +  
  geom_histogram(bins = 30, fill = "blue", color = "black") +  
  labs(title = "Distribution of Sales Score", x = "Sales Score", y = "Frequency") +  
  theme_minimal()

## 2.2 Sales Score vs Discount  
ggplot(data, aes(x = discount, y = sales_score)) +  
  geom_point(alpha = 0.5, color = "red") +  
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  
  labs(title = "Sales Score vs Discount", x = "Discount (%)", y = "Sales Score") +  
  theme_minimal()

## 2.3 Sales by Category  
ggplot(data, aes(x = category, y = itemSoldCntShow)) +  
  geom_col(fill = "purple", color = "black") +  
  labs(title = "Sales by Category", x = "Category", y = "Units Sold") +  
  theme_minimal() +  # apply the complete theme first so it does not reset the rotated axis text  
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## 2.4 Price vs Sales Relationship  
ggplot(data, aes(x = originalPrice, y = itemSoldCntShow)) +  
  geom_point(alpha = 0.5, color = "green") +  
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  
  labs(title = "Price vs Sales Relationship", x = "Original Price", y = "Units Sold") +  
  theme_minimal()

## 2.5 Discount vs Sales Relationship  
ggplot(data, aes(x = discount, y = itemSoldCntShow)) +  
  geom_point(alpha = 0.5, color = "red") +  
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  
  labs(title = "Discount vs Sales Relationship", x = "Discount (%)", y = "Units Sold") +  
  theme_minimal()

## 2.6 Correlation Analysis  
# Select numeric columns  
numeric_data <- data %>% select(where(is.numeric))  
# Calculate the correlation matrix  
correlation_matrix <- cor(numeric_data, use = "complete.obs")  

# Set up the graphical parameters  
par(mar = c(5, 4, 5, 1))  # Top margin increased further to create space for the title  

# Create the correlation plot without title  
corrplot(correlation_matrix, method = "circle", type = "upper",   
         tl.col = "black", tl.srt = 45,   
         addgrid.col = NA)  # Optionally remove grid lines for clarity  

# Add the title manually  
title(main = "Correlation Matrix of Numeric Features", cex.main = 1.5, line = 3)

4.2 Conclusions and Insights

1. Sales Score Distribution

The histogram of sales scores shows that most products tend to have lower sales scores, indicating that many products may not be performing well in the market.
This suggests potential areas for improvement in product offerings or marketing strategies.

2. Impact of Discount on Sales Score

The scatter plot illustrates a positive correlation between discount rates and sales scores.
As discounts increase, sales scores also tend to rise.
This suggests that offering discounts can effectively enhance sales performance, especially for products that struggle to gain traction.

3. Sales Performance by Category

The bar chart depicting sales by category reveals significant disparities among different categories.
Certain categories have notably higher sales volumes, which can inform targeted marketing efforts, inventory management, and product development strategies.

4. Price and Sales Relationship

The relationship between the original price and units sold indicates that lower-priced products generally tend to sell better.
However, some premium-priced items still achieve strong sales, suggesting that brand loyalty or perceived quality can influence purchasing decisions.

5. Discount and Sales Relationship

The analysis shows a positive relationship between discount percentages and units sold.
Implementing effective discount strategies can drive higher sales figures, particularly in competitive markets.

6. Correlation Analysis

The correlation matrix highlights strong relationships between numerical features.
Specifically, sales scores, discount rates, and product ratings exhibit positive correlations with sales volume, indicating that higher quality products with better ratings and discounts can lead to increased sales.
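
Part of the correlation between sales_score and units sold is built in, because sales_score is a weighted sum of the min-max normalized itemSoldCntShow (weight 0.7) and ratingScore (weight 0.3), as defined in Step 5 of Section 3.1. The sketch below recomputes the score for the first row of the prepared table. The ranges 0 to 175 and 0 to 5 are inferred from that row and are an assumption, not values read from the data.

```r
# Min-max scaling with an explicit range
min_max <- function(x, lo, hi) (x - lo) / (hi - lo)

# Ranges are inferred from the first row of the prepared table
# (an assumption): 45 / 175 = 0.2571429 and 4.652 / 5 = 0.9304
normalized_sales  <- min_max(45, lo = 0, hi = 175)
normalized_rating <- min_max(4.652, lo = 0, hi = 5)

# Weights 0.7 and 0.3 as set in Step 5
sales_score <- 0.7 * normalized_sales + 0.3 * normalized_rating
print(round(sales_score, 5))
```

The result agrees with the sales_score of 0.45912 shown for that row. Since 70% of the score comes directly from units sold, a high correlation between the two says little about the data itself.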

Section 5: Modeling

5.1 Regression Model: Predicting Product Sales

library(caret)
library(randomForest)

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:gridExtra':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

set.seed(123)
# Training and Testing Sets
train_index <- createDataPartition(data$itemSoldCntShow, p = 0.8, list = FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]

# Build a Linear Regression Model
lm_model <- lm(itemSoldCntShow ~ discount + priceShow + originalPrice + ratingScore + review, data = train_data)

# Evaluation: Assessing Model Performance
summary(lm_model)

## 
## Call:
## lm(formula = itemSoldCntShow ~ discount + priceShow + originalPrice + 
##     ratingScore + review, data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.544 -12.069  -1.299   1.462 109.309 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.579e-01  1.323e+01   0.035    0.972    
## discount       1.731e-01  2.883e-01   0.601    0.548    
## priceShow      7.915e-06  1.529e-05   0.518    0.605    
## originalPrice -7.947e-06  8.179e-06  -0.972    0.331    
## ratingScore    3.336e+00  3.229e-01  10.330   <2e-16 ***
## review         1.062e+00  2.429e-02  43.716   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.94 on 1509 degrees of freedom
## Multiple R-squared:  0.8228, Adjusted R-squared:  0.8222 
## F-statistic:  1401 on 5 and 1509 DF,  p-value: < 2.2e-16

# Prediction on the Test Set
lm_predictions <- predict(lm_model, newdata = test_data)

# Calculate Mean Squared Error (MSE)
lm_mse <- mean((test_data$itemSoldCntShow - lm_predictions)^2)
cat("Regression Model MSE:", lm_mse, "\n")

## Regression Model MSE: 319.625
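
MSE is expressed in squared units, so it is often easier to interpret alongside the root mean squared error (RMSE), the mean absolute error (MAE), and a baseline that always predicts the mean. The sketch below uses invented actual and predicted vectors (assumptions for illustration) and also takes the square root of the MSE reported above.

```r
# Invented values for illustration only
actual    <- c(10, 0, 25, 7, 40)
predicted <- c(12, 3, 20, 7, 35)

err  <- actual - predicted
mse  <- mean(err^2)       # mean squared error
rmse <- sqrt(mse)         # same units as the target
mae  <- mean(abs(err))    # less sensitive to large errors

# Baseline: always predict the mean of the actual values
baseline_mse <- mean((actual - mean(actual))^2)

print(c(mse = mse, rmse = rmse, mae = mae, baseline_mse = baseline_mse))

# RMSE implied by the MSE reported for the regression model
report_rmse <- sqrt(319.625)
print(round(report_rmse, 2))
```

The reported MSE corresponds to an RMSE of roughly 17.9 units sold, which is close to the residual standard error of 17.94 on the training data.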

5.2 Classification Model: Classifying Sales as High & Low

# Custom classification threshold (e.g., defining high sales as greater than the average sales)
threshold <- mean(data$itemSoldCntShow, na.rm = TRUE)
data$Sales_Class <- ifelse(data$itemSoldCntShow > threshold, "High", "Low")
data$Sales_Class <- as.factor(data$Sales_Class)  # Convert to Categorical Variable

library(caret)
library(randomForest)
library(ggplot2)
library(lattice)
# Training and Testing Sets
set.seed(123)
train_index_class <- createDataPartition(data$Sales_Class, p = 0.8, list = FALSE)
train_data_class <- data[train_index_class, ]
test_data_class <- data[-train_index_class, ]

# Random Forest Classification Model
rf_model <- randomForest(Sales_Class ~ discount + priceShow + originalPrice + ratingScore + review, data = train_data_class, ntree = 100)

# Evaluation: Checking Classification Accuracy
rf_predictions <- predict(rf_model, newdata = test_data_class)
confusion_matrix <- confusionMatrix(rf_predictions, test_data_class$Sales_Class)

# Output Model Evaluation Results
print(confusion_matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High Low
##       High  179   6
##       Low     4 189
##                                           
##                Accuracy : 0.9735          
##                  95% CI : (0.9519, 0.9872)
##     No Information Rate : 0.5159          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9471          
##                                           
##  Mcnemar's Test P-Value : 0.7518          
##                                           
##             Sensitivity : 0.9781          
##             Specificity : 0.9692          
##          Pos Pred Value : 0.9676          
##          Neg Pred Value : 0.9793          
##              Prevalence : 0.4841          
##          Detection Rate : 0.4735          
##    Detection Prevalence : 0.4894          
##       Balanced Accuracy : 0.9737          
##                                           
##        'Positive' Class : High            
##
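
The statistics in this output follow directly from the four counts in the confusion matrix. As a check, the sketch below rebuilds the matrix in base R and recomputes the main metrics, with "High" as the positive class.

```r
# Counts copied from the confusion matrix above
cm <- matrix(c(179, 4, 6, 189), nrow = 2,
             dimnames = list(Prediction = c("High", "Low"),
                             Reference  = c("High", "Low")))

tp <- cm["High", "High"]  # predicted High, actually High
fn <- cm["Low",  "High"]  # predicted Low,  actually High
fp <- cm["High", "Low"]   # predicted High, actually Low
tn <- cm["Low",  "Low"]   # predicted Low,  actually Low

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv), 4))
```

The rounded values agree with the Accuracy, Sensitivity, Specificity, and Pos Pred Value lines printed by confusionMatrix().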

# Optimize the Model
# Tune the parameters of the random forest using grid search
library(caret)
tune_grid <- expand.grid(.mtry = c(2, 3), .splitrule = "gini", .min.node.size = c(1, 5, 10)) 

control <- trainControl(method = "cv", number = 5) 
rf_tuned <- train(Sales_Class ~ discount + priceShow + originalPrice + ratingScore + review, data = train_data_class, method = "ranger",
                  trControl = control, tuneGrid = tune_grid)
# Output Best Parameters
print(rf_tuned$bestTune)

##   mtry splitrule min.node.size
## 3    2      gini            10

print(rf_tuned)

## Random Forest 
## 
## 1514 samples
##    5 predictor
##    2 classes: 'High', 'Low' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1211, 1211, 1212, 1210, 1212 
## Resampling results across tuning parameters:
## 
##   mtry  min.node.size  Accuracy   Kappa    
##   2      1             0.9590538  0.9180311
##   2      5             0.9590538  0.9180247
##   2     10             0.9597161  0.9193568
##   3      1             0.9590647  0.9180498
##   3      5             0.9583981  0.9167223
##   3     10             0.9570802  0.9140662
## 
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 2, splitrule = gini
##  and min.node.size = 10.
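
The size of the search explains the six rows in the table above. A short sketch of how the grid expands, using the same parameter values (the column names are written without the leading dots for readability):

```r
# Same candidate values as in the tuning grid above
grid <- expand.grid(mtry = c(2, 3),
                    splitrule = "gini",
                    min.node.size = c(1, 5, 10))
print(grid)

n_folds <- 5
n_fits  <- nrow(grid) * n_folds  # models fitted during cross-validation
print(n_fits)
```

With 2 x 1 x 3 = 6 combinations and 5 folds, 30 models are fitted during resampling. The accuracy differences between the six rows are in the third decimal place, so the choice of mtry = 2 and min.node.size = 10 should not be over-interpreted.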

# Evaluate the Tuned Model on the Test Set
rf_tuned_predictions <- predict(rf_tuned, newdata = test_data_class) 
confusion_matrix_tuned <- confusionMatrix(rf_tuned_predictions, test_data_class$Sales_Class) 

print(confusion_matrix_tuned)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High Low
##       High  178   8
##       Low     5 187
##                                           
##                Accuracy : 0.9656          
##                  95% CI : (0.9419, 0.9816)
##     No Information Rate : 0.5159          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9312          
##                                           
##  Mcnemar's Test P-Value : 0.5791          
##                                           
##             Sensitivity : 0.9727          
##             Specificity : 0.9590          
##          Pos Pred Value : 0.9570          
##          Neg Pred Value : 0.9740          
##              Prevalence : 0.4841          
##          Detection Rate : 0.4709          
##    Detection Prevalence : 0.4921          
##       Balanced Accuracy : 0.9658          
##                                           
##        'Positive' Class : High            
##

# Visualize Important Features
rf_importance <- importance(rf_model)  # renamed so the variable does not mask importance()
varImpPlot(rf_model)

Section 6: Evaluation

6.1 Evaluate Results

# Create evaluation metric visualizations
library(ggplot2)

# Sales prediction model evaluation
predictions_df <- data.frame(
  Actual = test_data$itemSoldCntShow,
  Predicted = lm_predictions
)

ggplot(predictions_df, aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, color = "red") +  # reference line: Predicted = Actual
  labs(title = "Sales Prediction Model: Actual vs Predicted",
       x = "Actual Sales",
       y = "Predicted Sales") +
  theme_minimal()

# Classification model evaluation visualization
plot_confusion_matrix <- function(cm) {
  cm_d <- as.data.frame(cm$table)
  ggplot(cm_d, aes(Prediction, Reference, fill = Freq)) +
    geom_tile() +
    geom_text(aes(label = Freq)) +
    scale_fill_gradient(low = "white", high = "steelblue") +
    labs(title = "Confusion Matrix Heatmap") +
    theme_minimal()
}

plot_confusion_matrix(confusion_matrix_tuned)

6.2 Review Process

# Model performance metrics summary
model_metrics <- data.frame(
  Metric = c("Regression Model MSE", "Classification Accuracy", "Classification Precision", "Classification Recall"),
  # unname() drops the element names so they do not become row names in the table
  Value = unname(c(
    lm_mse,
    confusion_matrix_tuned$overall["Accuracy"],
    confusion_matrix_tuned$byClass["Precision"],
    confusion_matrix_tuned$byClass["Recall"]
  ))
)

kable(model_metrics, format = "html", caption = "Model Performance Metrics Summary") %>%
  kable_styling("striped", full_width = FALSE)

Model Performance Metrics Summary
Metric	Value
Regression Model MSE	319.6249906
Classification Accuracy	0.9656085
Classification Precision	0.9569892
Classification Recall	0.9726776
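
Precision and recall can be combined into a single F1 score, which is not part of the summary above. A minimal sketch, using the precision and recall values from the table and cross-checking them against the counts of the tuned confusion matrix:

```r
# Values copied from the metrics summary
precision <- 0.9569892
recall    <- 0.9726776

# F1 is the harmonic mean of precision and recall
f1 <- 2 * precision * recall / (precision + recall)

# Cross-check from the tuned confusion matrix counts (positive class: High)
tp <- 178; fp <- 8; fn <- 5
f1_counts <- 2 * tp / (2 * tp + fp + fn)

print(round(c(f1 = f1, f1_counts = f1_counts), 4))
```

Both routes give an F1 of about 0.965 for the "High" class.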

Section 7: Deployment

7.1 Plan Deployment

# Create prediction function for deployment
predict_sales <- function(discount, price, original_price, rating, reviews) {
  new_data <- data.frame(
    discount = discount,
    priceShow = price,
    originalPrice = original_price,
    ratingScore = rating,
    review = reviews
  )
  predicted_sales <- predict(lm_model, newdata = new_data)
  return(predicted_sales)
}

# Test prediction function
test_prediction <- predict_sales(
  discount = 20,
  price = 100,
  original_price = 120,
  rating = 4.5,
  reviews = 100
)

cat("Test Prediction Result:", test_prediction)

## Test Prediction Result: 125.1366
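
The test call uses a sale price of 100 and an original price of 120, while the summary in Section 2.1 shows prices from 1399 upward, so the model is extrapolating here. Before deployment it may be worth flagging inputs that fall outside the range of the data. The sketch below is one way to do this. The helper check_range and the list train_ranges are hypothetical and not part of the original pipeline.

```r
# Hypothetical helper: returns the names of inputs that fall outside
# the ranges supplied in `ranges`
check_range <- function(new_row, ranges) {
  outside <- vapply(names(ranges), function(v) {
    new_row[[v]] < ranges[[v]][1] || new_row[[v]] > ranges[[v]][2]
  }, logical(1))
  names(ranges)[outside]
}

# Ranges taken from the raw-data summary in Section 2.1 (illustrative;
# the cleaned training data has narrower ranges)
train_ranges <- list(
  priceShow     = c(1399, 39890000),
  originalPrice = c(1399, 44490000),
  discount      = c(0, 72)
)

new_row <- list(priceShow = 100, originalPrice = 120, discount = 20)
flagged <- check_range(new_row, train_ranges)
print(flagged)
```

For the test inputs this flags priceShow and originalPrice. In practice the ranges would be computed from train_data rather than typed in by hand.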

7.2 Interactive Dashboard

library(shiny)
library(shinydashboard)

## 
## Attaching package: 'shinydashboard'

## The following object is masked from 'package:graphics':
## 
##     box

# UI
ui <- dashboardPage(
  dashboardHeader(title = "Lazada Sales Analysis Dashboard"),
  dashboardSidebar(
    sidebarMenu(
      menuItem("Sales Prediction", tabName = "prediction"),
      menuItem("Sales Classification", tabName = "classification")
    )
  ),
  dashboardBody(
    tabItems(
      tabItem(tabName = "prediction",
        fluidRow(
          box(
            title = "Input Parameters",
            sliderInput("discount", "Discount Rate (%)", 0, 100, 20),
            numericInput("price", "Sale Price", 100),
            numericInput("original_price", "Original Price", 120),
            sliderInput("rating", "Rating", 0, 5, 4.5),
            numericInput("reviews", "Number of Reviews", 100)
          ),
          box(
            title = "Prediction Results",
            verbatimTextOutput("sales_prediction")
          )
        )
      ),
      tabItem(tabName = "classification",
        fluidRow(
          box(plotOutput("confusion_matrix_plot")),
          box(plotOutput("feature_importance_plot"))
        )
      )
    )
  )
)

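# Server
server <- function(input, output) {
  output$sales_prediction <- renderText({
    # Wait until every numeric field has a value (a cleared numericInput is NA)
    req(input$price, input$original_price, input$reviews)
    pred <- predict_sales(
      input$discount,
      input$price,
      input$original_price,
      input$rating,
      input$reviews
    )
    paste("Predicted Sales:", round(pred, 2))
  })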
  
  output$confusion_matrix_plot <- renderPlot({
    plot_confusion_matrix(confusion_matrix_tuned)
  })
  
  output$feature_importance_plot <- renderPlot({
    varImpPlot(rf_model)
  })
}

# Run the Shiny app
shinyApp(ui, server)

Shiny applications not supported in static R Markdown documents
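
The message above appears because a knitted static HTML file has no R process behind it to serve the app. Two common ways to run the dashboard are to add `runtime: shiny` to the document's YAML header and serve it with `rmarkdown::run()`, or to move the UI and server code into a standalone `app.R`. In the second case the fitted objects have to be saved first and reloaded by the app. The following is a minimal sketch of that save and restore step. It uses a toy model, a hypothetical target column (`sold`), and a hypothetical file location, since the real `lm_model` is fitted in Section 5.1.

```r
# Toy stand-in for the fitted model; column names and data are illustrative only.
set.seed(1)
toy <- data.frame(discount = runif(50, 0, 60), review = rpois(50, 20))
toy$sold <- 5 + 0.8 * toy$discount + 1.5 * toy$review + rnorm(50)
toy_model <- lm(sold ~ discount + review, data = toy)

# Persist the fitted model so a standalone app.R can load it at startup.
model_path <- file.path(tempdir(), "lm_model.rds")  # hypothetical location
saveRDS(toy_model, model_path)

# In app.R, restore the model before calling shinyApp().
restored <- readRDS(model_path)

# The restored model should give the same prediction as the original.
new_obs <- data.frame(discount = 20, review = 100)
same <- isTRUE(all.equal(predict(toy_model, new_obs), predict(restored, new_obs)))
```

The same approach would apply to `rf_model` and `confusion_matrix_tuned`, which the classification tab needs.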

Conclusion

1. Regression Analysis

The regression analysis aimed to predict product sales and examine the impact of features such as discount intensity and price on sales. Key conclusions are as follows:

Impact of Discount Intensity:
- Discount intensity is one of the primary factors affecting sales. Higher discounts often lead to significant sales increases, but excessive discounts may compress profit margins. A balanced approach is recommended.
- The regression coefficient on discount estimates how much sales change for each one-unit change in discount with the other inputs held fixed, which gives merchants a quantitative starting point for choosing a discount range.

Impact of Price:
- Price has a significant negative correlation with sales: lower-priced products tend to record higher sales.
- For high-end products (with higher prices), sales are more likely driven by brand influence and other marketing strategies rather than discounts alone.

Role of Other Factors:
- Customer reviews and ratings significantly influence purchase decisions. Highly rated products are more attractive to consumers, especially in competitive categories.
- Product visibility during promotions (e.g., through advertisements or recommendations) is also likely to affect sales, although it is not one of the model's inputs, so its effect is not measured in this analysis.
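
To make the reading of coefficients concrete, here is a small sketch on synthetic data. The data, the effect sizes, and the target name `sold` are invented for illustration and are not the report's results; only the feature names follow the inputs of `predict_sales()`.

```r
set.seed(42)
n <- 200
sim <- data.frame(
  discount    = runif(n, 0, 60),   # percent
  priceShow   = runif(n, 5, 200),
  ratingScore = runif(n, 3, 5),
  review      = rpois(n, 30)
)

# Synthetic target with known effects: +0.5 per discount point, -0.1 per price unit.
sim$sold <- 20 + 0.5 * sim$discount - 0.1 * sim$priceShow +
  4 * sim$ratingScore + 0.3 * sim$review + rnorm(n, sd = 2)

fit <- lm(sold ~ discount + priceShow + ratingScore + review, data = sim)

# Estimate and p-value for each term.
coefs <- coef(summary(fit))
coefs[, c("Estimate", "Pr(>|t|)")]

# Expected change in sales for a discount 10 points deeper, other inputs fixed.
delta_10 <- 10 * coef(fit)["discount"]
```

The fitted coefficients should land close to the values used to generate the data, which is what makes them readable as "units of sales per unit of the feature". On real data the same reading holds only to the extent that the linear model is an adequate description.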

2. Classification Analysis

The classification analysis aimed to categorize products based on sales volume and examine the factors that make products more likely to achieve high sales. Key findings include:

Characteristics of High-Sales Products:
- High Ratings and More Reviews: High-sales products typically have high user ratings (e.g., above 4.5) and numerous reviews, indicating that positive customer feedback is crucial for driving sales.
- Moderate Price Range: High-sales products are often priced within a range acceptable to most consumers; excessively high or low prices can negatively impact sales.
- Participation in Promotions: Products involved in discount campaigns or bundled offers tend to attract more attention and purchases.

Characteristics of Low-Sales Products:
- Lack of Consumer Trust: Products with low ratings (below 3.5) struggle to attract buyers.
- Higher Prices: Products with higher prices see limited sales growth even during promotions.
- Highly Competitive Categories: In competitive categories (e.g., electronics), products without a significant price advantage or strong brand appeal are more likely to have low sales.
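
A profile like the one above can be produced by labelling each product and comparing feature averages per class. The sketch below uses synthetic data and assumes a median split for the High/Low label; the threshold actually used in Section 5.2 may differ, and the column names `sold` and `sales_class` are hypothetical.

```r
set.seed(7)
n <- 300
products <- data.frame(
  ratingScore = round(runif(n, 2.5, 5), 1),
  review      = rpois(n, 25),
  priceShow   = runif(n, 5, 300)
)

# Synthetic sales that rise with rating and reviews and fall with price.
products$sold <- pmax(0, round(10 * products$ratingScore + 0.8 * products$review -
                                 0.05 * products$priceShow + rnorm(n, sd = 5)))

# Hypothetical labelling rule: "High" means sales above the median.
threshold <- median(products$sold)
products$sales_class <- factor(ifelse(products$sold > threshold, "High", "Low"),
                               levels = c("High", "Low"))

# Mean of each feature within each class.
profile <- aggregate(cbind(ratingScore, review, priceShow) ~ sales_class,
                     data = products, FUN = mean)
profile
```

Comparing class means is descriptive only; it shows how the two groups differ, not which feature causes the difference.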

3. Actionable Insights

For Merchants:
- Optimize Promotional Strategies: Based on regression analysis, merchants can dynamically adjust discount intensity to ensure promotions boost sales without compromising profit margins.
- Focus on Customer Experience: Improve product quality and encourage positive customer reviews to increase ratings.
- Pricing Strategies: Use classification analysis results to position product prices effectively, avoiding prices that are too high or too low.

For E-commerce Platforms:
- Enhance Recommendation Systems: Prioritize displaying products with high ratings and numerous reviews to increase the visibility of high-sales products.
- Data-Driven Decision Support: Use platform data to analyze the effectiveness of promotions across different categories and provide personalized recommendations to merchants.

For Consumers:
- Consumers are more likely to choose products with high ratings, substantial discounts, and moderate pricing, indicating that promotions and word-of-mouth significantly influence purchasing decisions.
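
The trade-off between sales volume and margin mentioned under "Optimize Promotional Strategies" can be illustrated with a simple grid search. Every number below is hypothetical, including the unit cost, which is not part of the Lazada dataset; in practice the sales response would come from the fitted regression model.

```r
# All values are hypothetical and chosen only to illustrate the calculation.
original_price <- 120  # list price
unit_cost      <- 60   # cost per unit (not available in the dataset)
base_sales     <- 50   # expected units sold at 0% discount
sales_per_pt   <- 3    # extra units per discount percentage point

# Evaluate profit on a grid of candidate discounts.
discounts <- seq(0, 50, by = 5)
units     <- base_sales + sales_per_pt * discounts
price     <- original_price * (1 - discounts / 100)
profit    <- (price - unit_cost) * units

grid <- data.frame(discount = discounts, units = units, price = price, profit = profit)

# Candidate discount with the highest profit.
best <- grid[which.max(grid$profit), ]
best
```

With these inputs, units sold keep rising as the discount deepens while profit peaks at a 15% discount and then falls, which is the pattern behind the recommendation to balance discount depth against margin.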

Lazada Cross-Border E-Commerce Promotion Analysis

2024-12-18