Introduction

In the rapidly evolving world of e-commerce, understanding customer behavior is crucial for the growth and sustainability of online businesses. One of the key metrics that businesses need to monitor closely is customer churn, also known as customer attrition.

Customer churn, also called customer attrition, occurs when a customer stops doing business with a company or ends the relationship. Churn prediction uses data analysis and machine learning techniques to estimate the likelihood that a customer will leave a service or product.

In the context of e-commerce, customer churn prediction can help businesses identify potential churners early on and take appropriate actions to retain them, thereby saving on the cost of acquiring new customers. This process involves analyzing various customer-related factors such as purchase history, customer engagement, product preference, and many others.

In this project, a machine learning model will be developed to predict customer churn for an e-commerce business. By leveraging historical customer data, our model aims to provide accurate predictions that can help the business mitigate customer loss and enhance customer retention strategies. Let’s dive in!

Objectives

  1. To perform a comprehensive analysis of the provided customer data to extract insights into customer behavior and characteristics.

  2. To construct a machine learning model that can accurately predict customers who are likely to churn. This model should leverage the provided variables to identify at-risk customers.

Step 1: Data Collection & Data Importing

library(readxl)
# Read the "E Comm" sheet from the Excel workbook
E_Commerce_Dataset <- read_excel("E Commerce Dataset.xlsx", 
    sheet = "E Comm")

Dataset Overview

A dataset overview provides a summary of the dataset’s characteristics, including the number of observations and variables, the types of variables (numerical, categorical), and a brief description of what these variables represent.

The features are listed below:

  1. CustomerID: Unique customer ID
  2. Churn: Churn flag (1 = churned, 0 = retained)
  3. Tenure: Tenure of the customer with the organization
  4. PreferredLoginDevice: Preferred login device of the customer
  5. CityTier: City tier
  6. WarehouseToHome: Distance between the warehouse and the customer's home
  7. PreferredPaymentMode: Preferred payment method of the customer
  8. Gender: Gender of the customer
  9. HourSpendOnApp: Number of hours spent on the mobile application or website
  10. NumberOfDeviceRegistered: Total number of devices registered by the customer
  11. PreferedOrderCat: Preferred order category of the customer in the last month
  12. SatisfactionScore: Customer's satisfaction score for the service
  13. MaritalStatus: Marital status of the customer
  14. NumberOfAddress: Total number of addresses added by the customer
  15. Complain: Whether the customer raised a complaint in the last month
  16. OrderAmountHikeFromlastYear: Percentage increase in order amount from last year
  17. CouponUsed: Total number of coupons used in the last month
  18. OrderCount: Total number of orders placed in the last month
  19. DaySinceLastOrder: Days since the customer's last order
  20. CashbackAmount: Average cashback in the last month

The details of the dataset:

summary(E_Commerce_Dataset)
##    CustomerID        Churn            Tenure      PreferredLoginDevice
##  Min.   :50001   Min.   :0.0000   Min.   : 0.00   Length:5630         
##  1st Qu.:51408   1st Qu.:0.0000   1st Qu.: 2.00   Class :character    
##  Median :52816   Median :0.0000   Median : 9.00   Mode  :character    
##  Mean   :52816   Mean   :0.1684   Mean   :10.19                       
##  3rd Qu.:54223   3rd Qu.:0.0000   3rd Qu.:16.00                       
##  Max.   :55630   Max.   :1.0000   Max.   :61.00                       
##                                   NA's   :264                         
##     CityTier     WarehouseToHome  PreferredPaymentMode    Gender         
##  Min.   :1.000   Min.   :  5.00   Length:5630          Length:5630       
##  1st Qu.:1.000   1st Qu.:  9.00   Class :character     Class :character  
##  Median :1.000   Median : 14.00   Mode  :character     Mode  :character  
##  Mean   :1.655   Mean   : 15.64                                          
##  3rd Qu.:3.000   3rd Qu.: 20.00                                          
##  Max.   :3.000   Max.   :127.00                                          
##                  NA's   :251                                             
##  HourSpendOnApp  NumberOfDeviceRegistered PreferedOrderCat   SatisfactionScore
##  Min.   :0.000   Min.   :1.000            Length:5630        Min.   :1.000    
##  1st Qu.:2.000   1st Qu.:3.000            Class :character   1st Qu.:2.000    
##  Median :3.000   Median :4.000            Mode  :character   Median :3.000    
##  Mean   :2.932   Mean   :3.689                               Mean   :3.067    
##  3rd Qu.:3.000   3rd Qu.:4.000                               3rd Qu.:4.000    
##  Max.   :5.000   Max.   :6.000                               Max.   :5.000    
##  NA's   :255                                                                  
##  MaritalStatus      NumberOfAddress     Complain     
##  Length:5630        Min.   : 1.000   Min.   :0.0000  
##  Class :character   1st Qu.: 2.000   1st Qu.:0.0000  
##  Mode  :character   Median : 3.000   Median :0.0000  
##                     Mean   : 4.214   Mean   :0.2849  
##                     3rd Qu.: 6.000   3rd Qu.:1.0000  
##                     Max.   :22.000   Max.   :1.0000  
##                                                      
##  OrderAmountHikeFromlastYear   CouponUsed       OrderCount    
##  Min.   :11.00               Min.   : 0.000   Min.   : 1.000  
##  1st Qu.:13.00               1st Qu.: 1.000   1st Qu.: 1.000  
##  Median :15.00               Median : 1.000   Median : 2.000  
##  Mean   :15.71               Mean   : 1.751   Mean   : 3.008  
##  3rd Qu.:18.00               3rd Qu.: 2.000   3rd Qu.: 3.000  
##  Max.   :26.00               Max.   :16.000   Max.   :16.000  
##  NA's   :265                 NA's   :256      NA's   :258     
##  DaySinceLastOrder CashbackAmount 
##  Min.   : 0.000    Min.   :  0.0  
##  1st Qu.: 2.000    1st Qu.:145.8  
##  Median : 3.000    Median :163.3  
##  Mean   : 4.543    Mean   :177.2  
##  3rd Qu.: 7.000    3rd Qu.:196.4  
##  Max.   :46.000    Max.   :325.0  
##  NA's   :307
str(E_Commerce_Dataset)
## tibble [5,630 × 20] (S3: tbl_df/tbl/data.frame)
##  $ CustomerID                 : num [1:5630] 50001 50002 50003 50004 50005 ...
##  $ Churn                      : num [1:5630] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tenure                     : num [1:5630] 4 NA NA 0 0 0 NA NA 13 NA ...
##  $ PreferredLoginDevice       : chr [1:5630] "Mobile Phone" "Phone" "Phone" "Phone" ...
##  $ CityTier                   : num [1:5630] 3 1 1 3 1 1 3 1 3 1 ...
##  $ WarehouseToHome            : num [1:5630] 6 8 30 15 12 22 11 6 9 31 ...
##  $ PreferredPaymentMode       : chr [1:5630] "Debit Card" "UPI" "Debit Card" "Debit Card" ...
##  $ Gender                     : chr [1:5630] "Female" "Male" "Male" "Male" ...
##  $ HourSpendOnApp             : num [1:5630] 3 3 2 2 NA 3 2 3 NA 2 ...
##  $ NumberOfDeviceRegistered   : num [1:5630] 3 4 4 4 3 5 3 3 4 5 ...
##  $ PreferedOrderCat           : chr [1:5630] "Laptop & Accessory" "Mobile" "Mobile" "Laptop & Accessory" ...
##  $ SatisfactionScore          : num [1:5630] 2 3 3 5 5 5 2 2 3 3 ...
##  $ MaritalStatus              : chr [1:5630] "Single" "Single" "Single" "Single" ...
##  $ NumberOfAddress            : num [1:5630] 9 7 6 8 3 2 4 3 2 2 ...
##  $ Complain                   : num [1:5630] 1 1 1 0 0 1 0 1 1 0 ...
##  $ OrderAmountHikeFromlastYear: num [1:5630] 11 15 14 23 11 22 14 16 14 12 ...
##  $ CouponUsed                 : num [1:5630] 1 0 0 0 1 4 0 2 0 1 ...
##  $ OrderCount                 : num [1:5630] 1 1 1 1 1 6 1 2 1 1 ...
##  $ DaySinceLastOrder          : num [1:5630] 5 0 3 3 3 7 0 0 2 1 ...
##  $ CashbackAmount             : num [1:5630] 160 121 120 134 130 ...
head(E_Commerce_Dataset)
## # A tibble: 6 × 20
##   CustomerID Churn Tenure PreferredLoginDevice CityTier WarehouseToHome
##        <dbl> <dbl>  <dbl> <chr>                   <dbl>           <dbl>
## 1      50001     1      4 Mobile Phone                3               6
## 2      50002     1     NA Phone                       1               8
## 3      50003     1     NA Phone                       1              30
## 4      50004     1      0 Phone                       3              15
## 5      50005     1      0 Phone                       1              12
## 6      50006     1      0 Computer                    1              22
## # ℹ 14 more variables: PreferredPaymentMode <chr>, Gender <chr>,
## #   HourSpendOnApp <dbl>, NumberOfDeviceRegistered <dbl>,
## #   PreferedOrderCat <chr>, SatisfactionScore <dbl>, MaritalStatus <chr>,
## #   NumberOfAddress <dbl>, Complain <dbl>, OrderAmountHikeFromlastYear <dbl>,
## #   CouponUsed <dbl>, OrderCount <dbl>, DaySinceLastOrder <dbl>,
## #   CashbackAmount <dbl>

Step 2: Data Cleaning

Data cleaning is the process of identifying and correcting or removing errors, inaccuracies, and inconsistencies in datasets to improve data quality and reliability for further analysis.

The summary of the data to be cleaned:

Missing values - Single-value imputation

  1. Tenure: Impute with the median
  2. WarehouseToHome: Impute with the mode
  3. HourSpendOnApp: Impute with the mean
  4. OrderAmountHikeFromlastYear: Impute with the mean
  5. CouponUsed: Impute with the mean
  6. OrderCount: Impute with the mean
  7. DaySinceLastOrder: Impute with the median
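
The per-column walkthroughs below apply this plan step by step; as a compact reference, a minimal helper that performs the same single-value imputation might look like this (a sketch, assuming the DescTools package for Mode(); impute_na is a hypothetical name):

library(DescTools)

impute_na <- function(x, method = c("mean", "median", "mode")) {
  # Fill NA entries of a numeric vector with the chosen summary statistic
  method <- match.arg(method)
  fill <- switch(method,
                 mean   = round(mean(x, na.rm = TRUE)),  # rounded, as in the steps below
                 median = median(x, na.rm = TRUE),
                 mode   = Mode(x, na.rm = TRUE)[1])
  x[is.na(x)] <- fill
  x
}

# Example: E_Commerce_Dataset$Tenure <- impute_na(E_Commerce_Dataset$Tenure, "median")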

Inconsistent data - Replace with the correct category

  1. PreferredLoginDevice: Phone -> Mobile Phone
  2. PreferredPaymentMode: CC -> Credit Card; COD -> Cash On Delivery
  3. PreferedOrderCat: Mobile -> Mobile Phone
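
These replacements are applied one at a time with replace() in the steps below; for reference, an equivalent one-shot version using dplyr::recode would look like this (a sketch, not executed here):

library(dplyr)
E_Commerce_Dataset <- E_Commerce_Dataset %>%
  mutate(PreferredLoginDevice = recode(PreferredLoginDevice, "Phone" = "Mobile Phone"),
         PreferredPaymentMode = recode(PreferredPaymentMode,
                                       "CC" = "Credit Card", "COD" = "Cash on Delivery"),
         PreferedOrderCat     = recode(PreferedOrderCat, "Mobile" = "Mobile Phone"))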

Data cleaning for missing values - Tenure

We first plot a histogram of the Tenure column, so we can cross-check whether the distribution changes much after cleaning.

hist(E_Commerce_Dataset$Tenure)

The histogram is right-skewed. To clean this column, the NA values will be replaced with the median, which is 9 as shown in the summary.

E_Commerce_Dataset$Tenure[is.na(E_Commerce_Dataset$Tenure)] <- 9
summary(E_Commerce_Dataset$Tenure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    3.00    9.00   10.13   15.00   61.00
hist(E_Commerce_Dataset$Tenure)

Cross-checking against the details of Tenure in the summary above, the difference is minimal and the NA values have been imputed with the median.

Data cleaning for missing values - Warehousetohome

hist(E_Commerce_Dataset$WarehouseToHome)

The histogram is right-skewed. For WarehouseToHome, we will use the mode to replace the NA values, so we first need to calculate it.

library("DescTools")
wth_mode <- Mode(E_Commerce_Dataset$WarehouseToHome, na.rm=TRUE)
wth_mode
## [1] 9
## attr(,"freq")
## [1] 559

The mode is 9, with a frequency of 559.

E_Commerce_Dataset$WarehouseToHome[is.na(E_Commerce_Dataset$WarehouseToHome)] <- 9
summary(E_Commerce_Dataset$WarehouseToHome)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00    9.00   13.00   15.34   20.00  127.00

Compared with the values before cleaning, the difference is minimal and the NA values have been imputed with the mode.

Data cleaning for missing values - HourSpendOnApp

hist(E_Commerce_Dataset$HourSpendOnApp)

The histogram is left-skewed. To clean this column, the NA values will be replaced with the mean, which rounds to 3 as shown in the summary.

E_Commerce_Dataset$HourSpendOnApp[is.na(E_Commerce_Dataset$HourSpendOnApp)] <- 3
summary(E_Commerce_Dataset$HourSpendOnApp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   2.935   3.000   5.000
hist(E_Commerce_Dataset$HourSpendOnApp)

Compared with the values before cleaning, the difference is minimal and the NA values have been imputed with the mean.

Data cleaning for missing values - OrderAmountHikeFromlastYear

hist(E_Commerce_Dataset$OrderAmountHikeFromlastYear)

The histogram is right-skewed. To clean this column, the NA values will be replaced with the mean, which is 15.71 (rounded to 16) as shown in the summary.

E_Commerce_Dataset$OrderAmountHikeFromlastYear[is.na(E_Commerce_Dataset$OrderAmountHikeFromlastYear)] <- 16
summary(E_Commerce_Dataset$OrderAmountHikeFromlastYear)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   13.00   15.00   15.72   18.00   26.00

Let’s do a quick cross-check using a histogram to see whether there is a large difference compared to the data before cleaning.

hist(E_Commerce_Dataset$OrderAmountHikeFromlastYear)

Compared with the histogram before cleaning, the difference is minimal and the NA values have been imputed with the mean.

Data cleaning for missing values - CouponUsed

hist(E_Commerce_Dataset$CouponUsed)

The histogram is right-skewed. To clean this column, the NA values will be replaced with the mean, which is 1.751 (rounded to 2) as shown in the summary.

E_Commerce_Dataset$CouponUsed[is.na(E_Commerce_Dataset$CouponUsed)] <- 2
summary(E_Commerce_Dataset$CouponUsed)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   1.000   1.762   2.000  16.000

Let’s do a quick cross-check using a histogram to see whether there is a large difference compared to the data before cleaning.

hist(E_Commerce_Dataset$CouponUsed)

Compared with the histogram before cleaning, the difference is minimal and the NA values have been imputed with the mean.

Data cleaning for missing values - OrderCount

hist(E_Commerce_Dataset$OrderCount)

The histogram is right-skewed. To clean this column, the NA values will be replaced with the mean, which rounds to 3 as shown in the summary.

E_Commerce_Dataset$OrderCount[is.na(E_Commerce_Dataset$OrderCount)] <- 3
summary(E_Commerce_Dataset$OrderCount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   3.008   3.000  16.000

Let’s do a quick cross-check using a histogram to see whether there is a large difference compared to the data before cleaning.

hist(E_Commerce_Dataset$OrderCount)

Compared with the histogram before cleaning, the difference is very minimal (almost no change) and the NA values have been imputed with the mean.

Data cleaning for missing values - DaySinceLastOrder

hist(E_Commerce_Dataset$DaySinceLastOrder)

The histogram is right-skewed. To clean this column, the NA values will be replaced with the median, which is 3 as shown in the summary.

E_Commerce_Dataset$DaySinceLastOrder[is.na(E_Commerce_Dataset$DaySinceLastOrder)] <- 3
summary(E_Commerce_Dataset$DaySinceLastOrder)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   4.459   7.000  46.000

Let’s do a quick cross-check using a histogram to see whether there is a large difference compared to the data before cleaning.

hist(E_Commerce_Dataset$DaySinceLastOrder)

Compared with the histogram before cleaning, the difference is minimal and the NA values have been imputed with the median.

To verify the cleaned data, we check the number of rows and columns to confirm that no records were deleted, then sum the total count of NA values.

nrow(E_Commerce_Dataset)
## [1] 5630
ncol(E_Commerce_Dataset)
## [1] 20
sum(is.na(E_Commerce_Dataset))
## [1] 0

Rows = 5630 (all rows retained) | Columns = 20 (all columns retained) | NA count = 0 (no NA values remain in the cleaned dataset)

Now, let’s cross-check with the summary of the data after cleaning to confirm there are no more NA values.

summary(E_Commerce_Dataset)
##    CustomerID        Churn            Tenure      PreferredLoginDevice
##  Min.   :50001   Min.   :0.0000   Min.   : 0.00   Length:5630         
##  1st Qu.:51408   1st Qu.:0.0000   1st Qu.: 3.00   Class :character    
##  Median :52816   Median :0.0000   Median : 9.00   Mode  :character    
##  Mean   :52816   Mean   :0.1684   Mean   :10.13                       
##  3rd Qu.:54223   3rd Qu.:0.0000   3rd Qu.:15.00                       
##  Max.   :55630   Max.   :1.0000   Max.   :61.00                       
##     CityTier     WarehouseToHome  PreferredPaymentMode    Gender         
##  Min.   :1.000   Min.   :  5.00   Length:5630          Length:5630       
##  1st Qu.:1.000   1st Qu.:  9.00   Class :character     Class :character  
##  Median :1.000   Median : 13.00   Mode  :character     Mode  :character  
##  Mean   :1.655   Mean   : 15.34                                          
##  3rd Qu.:3.000   3rd Qu.: 20.00                                          
##  Max.   :3.000   Max.   :127.00                                          
##  HourSpendOnApp  NumberOfDeviceRegistered PreferedOrderCat   SatisfactionScore
##  Min.   :0.000   Min.   :1.000            Length:5630        Min.   :1.000    
##  1st Qu.:2.000   1st Qu.:3.000            Class :character   1st Qu.:2.000    
##  Median :3.000   Median :4.000            Mode  :character   Median :3.000    
##  Mean   :2.935   Mean   :3.689                               Mean   :3.067    
##  3rd Qu.:3.000   3rd Qu.:4.000                               3rd Qu.:4.000    
##  Max.   :5.000   Max.   :6.000                               Max.   :5.000    
##  MaritalStatus      NumberOfAddress     Complain     
##  Length:5630        Min.   : 1.000   Min.   :0.0000  
##  Class :character   1st Qu.: 2.000   1st Qu.:0.0000  
##  Mode  :character   Median : 3.000   Median :0.0000  
##                     Mean   : 4.214   Mean   :0.2849  
##                     3rd Qu.: 6.000   3rd Qu.:1.0000  
##                     Max.   :22.000   Max.   :1.0000  
##  OrderAmountHikeFromlastYear   CouponUsed       OrderCount    
##  Min.   :11.00               Min.   : 0.000   Min.   : 1.000  
##  1st Qu.:13.00               1st Qu.: 1.000   1st Qu.: 1.000  
##  Median :15.00               Median : 1.000   Median : 2.000  
##  Mean   :15.72               Mean   : 1.762   Mean   : 3.008  
##  3rd Qu.:18.00               3rd Qu.: 2.000   3rd Qu.: 3.000  
##  Max.   :26.00               Max.   :16.000   Max.   :16.000  
##  DaySinceLastOrder CashbackAmount 
##  Min.   : 0.000    Min.   :  0.0  
##  1st Qu.: 2.000    1st Qu.:145.8  
##  Median : 3.000    Median :163.3  
##  Mean   : 4.459    Mean   :177.2  
##  3rd Qu.: 7.000    3rd Qu.:196.4  
##  Max.   :46.000    Max.   :325.0

After cross-checking, there are no more NA values. Next, there is some inconsistent data to adjust.

Dealing with inconsistent data

The summary of categorical variables:

for (i in names(E_Commerce_Dataset)) {
  # Check if the column is of type 'character' (assumed categorical)
  if (class(E_Commerce_Dataset[[i]]) == 'character') {
    # Print value counts
    cat(i, ":")
    print(table(E_Commerce_Dataset[[i]]))
    cat("**********************************************************\n")
  }
}
## PreferredLoginDevice :
##     Computer Mobile Phone        Phone 
##         1634         2765         1231 
## **********************************************************
## PreferredPaymentMode :
## Cash on Delivery               CC              COD      Credit Card 
##              149              273              365             1501 
##       Debit Card         E wallet              UPI 
##             2314              614              414 
## **********************************************************
## Gender :
## Female   Male 
##   2246   3384 
## **********************************************************
## PreferedOrderCat :
##            Fashion            Grocery Laptop & Accessory             Mobile 
##                826                410               2050                809 
##       Mobile Phone             Others 
##               1271                264 
## **********************************************************
## MaritalStatus :
## Divorced  Married   Single 
##      848     2986     1796 
## **********************************************************

Data Cleaning for inconsistent data - PreferredLoginDevice

Replacing "Phone" with "Mobile Phone" in PreferredLoginDevice:

E_Commerce_Dataset$PreferredLoginDevice <- replace(E_Commerce_Dataset$PreferredLoginDevice, E_Commerce_Dataset$PreferredLoginDevice == "Phone", "Mobile Phone")
head(E_Commerce_Dataset$PreferredLoginDevice)
## [1] "Mobile Phone" "Mobile Phone" "Mobile Phone" "Mobile Phone" "Mobile Phone"
## [6] "Computer"

Data Cleaning for inconsistent data - PreferredPaymentMode

Replacing "CC" and "COD" with "Credit Card" and "Cash on Delivery" in PreferredPaymentMode:

E_Commerce_Dataset$PreferredPaymentMode <- replace(E_Commerce_Dataset$PreferredPaymentMode, E_Commerce_Dataset$PreferredPaymentMode == "CC", "Credit Card")
E_Commerce_Dataset$PreferredPaymentMode <- replace(E_Commerce_Dataset$PreferredPaymentMode, E_Commerce_Dataset$PreferredPaymentMode == "COD", "Cash on Delivery")
head(E_Commerce_Dataset$PreferredPaymentMode, 15)
##  [1] "Debit Card"       "UPI"              "Debit Card"       "Debit Card"      
##  [5] "Credit Card"      "Debit Card"       "Cash on Delivery" "Credit Card"     
##  [9] "E wallet"         "Debit Card"       "Cash on Delivery" "Debit Card"      
## [13] "Cash on Delivery" "Credit Card"      "Credit Card"

Data Cleaning for inconsistent data - PreferedOrderCat

Replacing "Mobile" with "Mobile Phone" in PreferedOrderCat:

E_Commerce_Dataset$PreferedOrderCat <- replace(E_Commerce_Dataset$PreferedOrderCat, E_Commerce_Dataset$PreferedOrderCat == "Mobile", "Mobile Phone")
head(E_Commerce_Dataset$PreferedOrderCat)
## [1] "Laptop & Accessory" "Mobile Phone"       "Mobile Phone"      
## [4] "Laptop & Accessory" "Mobile Phone"       "Mobile Phone"

Here we cross-check whether any inconsistencies remain in the dataset.

for (i in names(E_Commerce_Dataset)) {
  # Check if the column is of type 'character' (assumed categorical)
  if (class(E_Commerce_Dataset[[i]]) == 'character') {
    # Print value counts
    cat(i, ":")
    print(table(E_Commerce_Dataset[[i]]))
    cat("********************\n")
  }
}
## PreferredLoginDevice :
##     Computer Mobile Phone 
##         1634         3996 
## ********************
## PreferredPaymentMode :
## Cash on Delivery      Credit Card       Debit Card         E wallet 
##              514             1774             2314              614 
##              UPI 
##              414 
## ********************
## Gender :
## Female   Male 
##   2246   3384 
## ********************
## PreferedOrderCat :
##            Fashion            Grocery Laptop & Accessory       Mobile Phone 
##                826                410               2050               2080 
##             Others 
##                264 
## ********************
## MaritalStatus :
## Divorced  Married   Single 
##      848     2986     1796 
## ********************

Therefore, the dataset is considered clean.

Last but not least, the cleaned data is exported for the other group members to use.

# Note: by default write.csv also writes row names, which reappear on import
# as an extra "X" index column; row.names = FALSE would avoid this
write.csv(E_Commerce_Dataset, "ecomm_cleaned_data.csv")

Step 3: Exploratory Data Analysis

  1. Is there a relationship between Gender and Churn? & Which Gender has more Orders?
  2. What is the distribution of satisfaction score for churned and retained customer?
  3. Is there correlation between customer complaints and Churn?
  4. Is there any relationship between coupon used and churn?
  5. Which marital status has the highest churn rate?
  6. Correlation matrix between variables

Load and import cleaned data and packages

# read.csv() is part of base R, so no extra package is needed for the import
ecomm_cleaned_data <- read.csv("ecomm_cleaned_data.csv")

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(knitr)
library(tidyr)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(treemap)
library(ggcorrplot)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths

1. Is there a relationship between Gender and Churn? & Which Gender has more Orders?

#### Which Gender has more Orders?
# Churn == 0 means no churn; Churn == 1 means churn

# Group by 'Gender' and compute the mean 'OrderCount' to see which gender orders more
gender_orders <- ecomm_cleaned_data %>%
  group_by(Gender) %>%
  summarize(mean_order_count = mean(OrderCount))

# Create a bar plot using ggplot2
ggplot(gender_orders, aes(x = Gender, y = mean_order_count)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Mean Order Count by Gender", x = "Gender", y = "Mean Order Count")

Female and male customers have similar mean order counts.

Is there a relationship between Gender and Churn?

# count how many customers of each gender churned / did not churn
ecomm_churn_gender <- ecomm_cleaned_data %>% count(Gender, Churn)

female_churn <- select( filter(ecomm_churn_gender, Gender == 'Female' & Churn == '1'), n)
female_non_churn <- select( filter(ecomm_churn_gender, Gender == 'Female' & Churn == '0'), n)
male_churn <- select( filter(ecomm_churn_gender, Gender == 'Male' & Churn == '1'), n)
male_non_churn <- select( filter(ecomm_churn_gender, Gender == 'Male' & Churn == '0'), n)

# percentage of churned females among all females
percentageF <- female_churn /(female_churn + female_non_churn) * 100

# percentage of churned males among all males
percentageM <- male_churn /(male_churn + male_non_churn) * 100

paste("Percentage of the leaving females out of total females is", percentageF)
## [1] "Percentage of the leaving females out of total females is 15.4942119323241"
paste("Percentage of the leaving males out of total males is", percentageM)
## [1] "Percentage of the leaving males out of total males is 17.7304964539007"

Within each gender, the churn rates are similar: about 15.5% of females versus 17.7% of males.

Create a pie chart with percentages:

# keep only churned customers (Churn == 1)
ecomm_churn_1_gender <- filter(ecomm_churn_gender, Churn == '1')

# each gender's percentage share of all churned customers
ecomm_churn_1_gender$percentage <- (ecomm_churn_1_gender$n / sum(ecomm_churn_1_gender$n)) * 100

ecomm_churn_1_gender
##   Gender Churn   n percentage
## 1 Female     1 348   36.70886
## 2   Male     1 600   63.29114
ggplot(ecomm_churn_1_gender, aes(x = "", y = percentage, fill = Gender)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  geom_text(aes(label = paste0(round(percentage,2), "%")), position = position_stack(vjust = 0.5)) +
  theme_void() +
  labs(title = "Pie Chart with Percentages")

As the chart shows, males make up the larger share of churned customers: 63.3% of churners are male, versus 36.7% female.

2. What is the distribution of satisfaction score for churned and retained customer?

Step A: Data Transformation

First, the counts of churned and retained customers for each SatisfactionScore are tabulated.

Satchurn <- ecomm_cleaned_data %>% select(SatisfactionScore, Churn) %>% count(SatisfactionScore, Churn, name = "Count") %>% arrange(SatisfactionScore, Churn)
print(Satchurn)
##    SatisfactionScore Churn Count
## 1                  1     0  1030
## 2                  1     1   134
## 3                  2     0   512
## 4                  2     1    74
## 5                  3     0  1406
## 6                  3     1   292
## 7                  4     0   890
## 8                  4     1   184
## 9                  5     0   844
## 10                 5     1   264

Step B: Data Visualization

Using the table generated in Step A, a barplot is generated for analysis.

ggplot(Satchurn, aes(x = SatisfactionScore, y = Count, group = Churn, fill = factor(Churn))) +
  geom_bar(position = "dodge", stat = "identity") +
  geom_text(aes(label = Count), position = position_dodge(0.9), vjust = -0.25) +
  labs(fill = "Churn")

Interpretation of result: counterintuitively, churn does not decline as satisfaction scores rise. Customers with the highest score (5) churn at a higher rate (264 of 1,108, about 24%) than those with the lowest score (134 of 1,164, about 12%), a pattern consistent with the small positive Churn-SatisfactionScore correlation seen later, as the sketch below makes concrete.
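
The churn rate per score can be derived from the Satchurn table (a sketch):

# churned count / total count within each satisfaction score
Satchurn %>%
  group_by(SatisfactionScore) %>%
  summarise(ChurnRate = Count[Churn == 1] / sum(Count))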

3. Is there correlation between customer complaints and Churn?

Group by the "Complain" column and count the occurrences of complain vs. no complain:

complain_count <- count(ecomm_cleaned_data, Complain) %>%
  rename(count = n)

complain_count
##   Complain count
## 1        0  4026
## 2        1  1604
complain_churn <- count(ecomm_cleaned_data, Complain, Churn) %>% 
  rename (count = n) %>% 
  mutate (Complain = ifelse(Complain == 0, 'No Complain', 'Complain')) %>%
  mutate (Churn = ifelse(Churn == 0, 'No Churn', 'Churn')) %>%
  mutate(count=as.numeric(count))

complain_churn
##      Complain    Churn count
## 1 No Complain No Churn  3586
## 2 No Complain    Churn   440
## 3    Complain No Churn  1096
## 4    Complain    Churn   508
fig_complain_churn <- treemap(
  complain_churn,
  index = c("Complain", "Churn"),
  vSize = "count",
  draw = TRUE,
  vColor = "Complain",
  palette = c('No Complain' = '#FF0000', 'Complain' = '#0000FF'),  # Use hexadecimal color codes
  title = 'Complain Vs Churn'
)
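
The treemap makes the group sizes visible; numerically, the churn rate per group follows from the complain_churn table (a sketch):

# churned count / total count within each complain group
complain_churn %>%
  group_by(Complain) %>%
  summarise(ChurnRate = count[Churn == "Churn"] / sum(count))

About 31.7% of complaining customers churned (508 of 1,604) versus about 10.9% of non-complainers (440 of 4,026), so complaints are strongly associated with churn.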

4. Is there any relationship between coupon used and churn?

Step A: Data Transformation

coupon_churnrate <- ecomm_cleaned_data %>% select(CouponUsed, Churn) %>% count(CouponUsed, Churn, name = "Count") %>% arrange(CouponUsed, Churn)
print(coupon_churnrate)
##    CouponUsed Churn Count
## 1           0     0   844
## 2           0     1   186
## 3           1     0  1727
## 4           1     1   378
## 5           2     0  1309
## 6           2     1   230
## 7           3     0   281
## 8           3     1    46
## 9           4     0   167
## 10          4     1    30
## 11          5     0   106
## 12          5     1    23
## 13          6     0    90
## 14          6     1    18
## 15          7     0    71
## 16          7     1    18
## 17          8     0    33
## 18          8     1     9
## 19          9     0    11
## 20          9     1     2
## 21         10     0    11
## 22         10     1     3
## 23         11     0    10
## 24         11     1     2
## 25         12     0     8
## 26         12     1     1
## 27         13     0     8
## 28         14     0     5
## 29         15     1     1
## 30         16     0     1
## 31         16     1     1

Step B: Data Visualization

ggplot(coupon_churnrate, aes(x = CouponUsed, y = Count, group = Churn, fill = factor(Churn))) +
  geom_bar(position = "dodge", stat = "identity") +
  labs(fill = "Churn")
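
A churn-rate view of the same table clarifies the picture (a sketch):

# churned count / total count for each coupon-usage level
coupon_churnrate %>%
  group_by(CouponUsed) %>%
  summarise(ChurnRate = sum(Count[Churn == 1]) / sum(Count))

Most customers used 0-2 coupons, and churn rates stay within a similar range (roughly 15-18%) across those common usage levels, so coupon usage alone shows no strong monotonic relationship with churn.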

5. Which marital status has the highest churn rate?

marital_status <- ecomm_cleaned_data %>%
  group_by(Churn, MaritalStatus) %>%
  summarise(Count = n(), .groups = 'drop') %>%
  arrange(Churn, desc(Count))
 
marital_status
## # A tibble: 6 × 3
##   Churn MaritalStatus Count
##   <int> <chr>         <int>
## 1     0 Married        2642
## 2     0 Single         1316
## 3     0 Divorced        724
## 4     1 Single          480
## 5     1 Married         344
## 6     1 Divorced        124
# Convert 'Churn' to a factor
ecomm_cleaned_data_copy <- ecomm_cleaned_data
ecomm_cleaned_data_copy$Churn <- as.factor(ecomm_cleaned_data_copy$Churn)

# Determine the maximum value for the y-axis
max_count <- max(table(ecomm_cleaned_data_copy$MaritalStatus, ecomm_cleaned_data_copy$Churn))

churn_by_marital_status <- ecomm_cleaned_data_copy %>%
  ggplot(aes(x = MaritalStatus, fill = Churn)) +
  geom_bar(position = position_dodge()) +
  scale_fill_manual(values = c("blue", "red")) +
  scale_y_continuous(breaks = seq(0, max_count, by = 500)) +
  labs(title = "Churn Count by Marital Status", 
       x = "Marital Status", 
       y = "Churn Count")

churn_by_marital_status

Single customers represent over half of the total customer churn at 480 (50.6%), despite married individuals forming the largest customer segment.
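
Since the question asks about the churn rate, it also helps to normalize within each status using the table above (a sketch):

# churned count / total count within each marital status
marital_status %>%
  group_by(MaritalStatus) %>%
  summarise(ChurnRate = Count[Churn == 1] / sum(Count))

This gives roughly 26.7% for Single (480 of 1,796), 14.6% for Divorced (124 of 848), and 11.5% for Married (344 of 2,986), so single customers also have the highest churn rate.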

6. Correlation matrix between variables

To replace the character variable with binary data (0,1,2,3)

ecomm_cleaned_data <- ecomm_cleaned_data %>%
  mutate(across(c(PreferredLoginDevice, PreferredPaymentMode, Gender, PreferedOrderCat, MaritalStatus),
                ~ as.numeric(as.factor(.))))
head(ecomm_cleaned_data)
##   X CustomerID Churn Tenure PreferredLoginDevice CityTier WarehouseToHome
## 1 1      50001     1      4                    2        3               6
## 2 2      50002     1      9                    2        1               8
## 3 3      50003     1      9                    2        1              30
## 4 4      50004     1      0                    2        3              15
## 5 5      50005     1      0                    2        1              12
## 6 6      50006     1      0                    1        1              22
##   PreferredPaymentMode Gender HourSpendOnApp NumberOfDeviceRegistered
## 1                    3      1              3                        3
## 2                    5      2              3                        4
## 3                    3      2              2                        4
## 4                    3      2              2                        4
## 5                    2      2              3                        3
## 6                    3      1              3                        5
##   PreferedOrderCat SatisfactionScore MaritalStatus NumberOfAddress Complain
## 1                3                 2             3               9        1
## 2                4                 3             3               7        1
## 3                4                 3             3               6        1
## 4                3                 5             3               8        0
## 5                4                 5             3               3        0
## 6                4                 5             3               2        1
##   OrderAmountHikeFromlastYear CouponUsed OrderCount DaySinceLastOrder
## 1                          11          1          1                 5
## 2                          15          0          1                 0
## 3                          14          0          1                 3
## 4                          23          0          1                 3
## 5                          11          1          1                 3
## 6                          22          4          6                 7
##   CashbackAmount
## 1         159.93
## 2         120.90
## 3         120.28
## 4         134.07
## 5         129.60
## 6         139.19

The numeric labels assigned to the character variables:

# Create a data frame
data <- data.frame(
  "Variable" = c("PreferredLoginDevice", "PreferredLoginDevice", "PreferredPaymentMode", "PreferredPaymentMode", "PreferredPaymentMode", "PreferredPaymentMode", "PreferredPaymentMode", "Gender", "Gender", "PreferedOrderCat", "PreferedOrderCat", "PreferedOrderCat", "PreferedOrderCat", "PreferedOrderCat", "MaritalStatus", "MaritalStatus", "MaritalStatus"),
  "Indicator" = c("Computer", "Mobile Phone", "Cash on Delivery", "Credit Card", "Debit Card", "E Wallet", "UPI", "Female", "Male", "Fashion", "Grocery", "Laptop & Accessory", "Mobile Phone", "Others", "Divorced", "Married", "Single"),
  "Numerical_Label" = c(1, 2, 1, 2, 3, 4, 5, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3)
)

# Print the data frame
print(data)
##                Variable          Indicator Numerical_Label
## 1  PreferredLoginDevice           Computer               1
## 2  PreferredLoginDevice       Mobile Phone               2
## 3  PreferredPaymentMode   Cash on Delivery               1
## 4  PreferredPaymentMode        Credit Card               2
## 5  PreferredPaymentMode         Debit Card               3
## 6  PreferredPaymentMode           E Wallet               4
## 7  PreferredPaymentMode                UPI               5
## 8                Gender             Female               1
## 9                Gender               Male               2
## 10     PreferedOrderCat            Fashion               1
## 11     PreferedOrderCat            Grocery               2
## 12     PreferedOrderCat Laptop & Accessory               3
## 13     PreferedOrderCat       Mobile Phone               4
## 14     PreferedOrderCat             Others               5
## 15        MaritalStatus           Divorced               1
## 16        MaritalStatus            Married               2
## 17        MaritalStatus             Single               3

Create correlation matrix of the data

# Create the correlation matrix; note that "X" (the row index written by
# write.csv) and CustomerID are identifiers, so their correlations are not
# meaningful and they could be dropped beforehand
corr_mat <- round(cor(ecomm_cleaned_data), 3)
head(corr_mat)
##                           X CustomerID  Churn Tenure PreferredLoginDevice
## X                     1.000      1.000 -0.019  0.035               -0.003
## CustomerID            1.000      1.000 -0.019  0.035               -0.003
## Churn                -0.019     -0.019  1.000 -0.338               -0.051
## Tenure                0.035      0.035 -0.338  1.000                0.041
## PreferredLoginDevice -0.003     -0.003 -0.051  0.041                1.000
## CityTier              0.003      0.003  0.085 -0.058                0.002
##                      CityTier WarehouseToHome PreferredPaymentMode Gender
## X                       0.003           0.087                0.006  0.004
## CustomerID              0.003           0.087                0.006  0.004
## Churn                   0.085           0.057                0.002  0.029
## Tenure                 -0.058           0.001               -0.009 -0.046
## PreferredLoginDevice    0.002          -0.022                0.007 -0.016
## CityTier                1.000           0.023                0.246 -0.025
##                      HourSpendOnApp NumberOfDeviceRegistered PreferedOrderCat
## X                             0.580                    0.411            0.000
## CustomerID                    0.580                    0.411            0.000
## Churn                         0.019                    0.108            0.105
## Tenure                       -0.017                   -0.021           -0.189
## PreferredLoginDevice          0.020                   -0.021            0.005
## CityTier                     -0.010                    0.028           -0.171
##                      SatisfactionScore MaritalStatus NumberOfAddress Complain
## X                               -0.033         0.124           0.161   -0.010
## CustomerID                      -0.033         0.124           0.161   -0.010
## Churn                            0.105         0.140           0.044    0.250
## Tenure                          -0.014        -0.105           0.235   -0.021
## PreferredLoginDevice             0.036        -0.029           0.026    0.001
## CityTier                        -0.012         0.007          -0.029    0.003
##                      OrderAmountHikeFromlastYear CouponUsed OrderCount
## X                                          0.116      0.227      0.136
## CustomerID                                 0.116      0.227      0.136
## Churn                                     -0.011     -0.010     -0.028
## Tenure                                     0.011      0.130      0.179
## PreferredLoginDevice                      -0.010     -0.010     -0.016
## CityTier                                  -0.032      0.020      0.033
##                      DaySinceLastOrder CashbackAmount
## X                                0.099          0.217
## CustomerID                       0.099          0.217
## Churn                           -0.156         -0.154
## Tenure                           0.174          0.468
## PreferredLoginDevice             0.002          0.048
## CityTier                         0.010          0.056

Reshape the correlation matrix into long format for plotting

melted_corr_mat <- melt(corr_mat)
head(melted_corr_mat)
##                   Var1 Var2  value
## 1                    X    X  1.000
## 2           CustomerID    X  1.000
## 3                Churn    X -0.019
## 4               Tenure    X  0.035
## 5 PreferredLoginDevice    X -0.003
## 6             CityTier    X  0.003

Plotting the correlation heatmap

ggplot(data = melted_corr_mat, aes(x=Var1, y=Var2, fill = value)) + geom_tile() + geom_text(aes(Var2,Var1,label = value), color = "black", size = 1.5) + scale_fill_gradient2(low = "turquoise", high = "yellow", mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab", 
                       name="Pearson\nCorrelation") +
  theme_minimal() +theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
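
As an aside, the ggcorrplot package loaded earlier can draw an equivalent heatmap in one line (a sketch):

# Lower-triangle correlation heatmap with value labels
ggcorrplot(corr_mat, type = "lower", lab = TRUE, lab_size = 1.5)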

Step 4: Data Modelling

Load required packages

library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.1-8
library(class)
library(e1071)
library(rpart)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:plotly':
## 
##     slice
## The following object is masked from 'package:dplyr':
## 
##     slice
library(adabag)
## Loading required package: caret
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following objects are masked from 'package:DescTools':
## 
##     MAE, RMSE
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following object is masked from 'package:DescTools':
## 
##     %:%
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
library(smotefamily)
library(mlr)
## Loading required package: ParamHelpers
## Warning message: 'mlr' is in 'maintenance-only' mode since July 2019.
## Future development will only happen in 'mlr3'
## (<https://mlr3.mlr-org.com>). Due to the focus on 'mlr3' there might be
## uncaught bugs meanwhile in {mlr} - please consider switching.
## 
## Attaching package: 'mlr'
## The following object is masked from 'package:caret':
## 
##     train
## The following object is masked from 'package:e1071':
## 
##     impute
library(ROSE)
## Loaded ROSE 0.0-4
library(caTools)
library(caret)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
## 
##     select
## The following object is masked from 'package:dplyr':
## 
##     select
library(ada) 
library(Metrics)
## 
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
## 
##     precision, recall
## The following object is masked from 'package:pROC':
## 
##     auc
if (!requireNamespace("ROCR", quietly = TRUE)) {
  install.packages("ROCR")
}
library(ROCR)
## 
## Attaching package: 'ROCR'
## The following object is masked from 'package:mlr':
## 
##     performance
library(ggplot2)

Handling Imbalanced Data:

# Keep only the numeric columns (after label encoding, all columns are numeric)
numeric_columns <- sapply(ecomm_cleaned_data, is.numeric)
ecomm_numeric <- ecomm_cleaned_data[, numeric_columns]

# Separate predictors and response variable
X <- ecomm_numeric[, -which(names(ecomm_numeric) == "Churn")]
Y <- ecomm_numeric$Churn

# Rebalance the classes with ROSE (Random Over-Sampling Examples)
rose_result <- ROSE(Y ~ ., data = cbind(Y, X), seed = 42, p = 0.7, N = 2 * sum(Y == 1))

# Extracting the oversampled data
x_over <- rose_result$data[, -1]
y_over <- rose_result$data[, 1]

# Print the shape

cat("Oversampled data shape:", nrow(x_over), "rows x", ncol(x_over), "columns;", length(y_over), "labels\n")
## Oversampled data shape: 1896 rows x 20 columns; 1896 labels

Data Splitting:

### Split the data
set.seed(42)  # For reproducibility
# Create indices for train and test sets
indices <- sample(1:nrow(x_over), 0.7 * nrow(x_over))

# Split the data
x_train <- x_over[indices, ]
x_test <- x_over[-indices, ]
y_train <- y_over[indices]
y_test <- y_over[-indices]

# Display the total number of samples
cat("Total number of samples:", nrow(x_over), "\n")
## Total number of samples: 1896
# Display the number of samples in the training set
cat("Number of samples in the training set:", nrow(x_train), "\n")
## Number of samples in the training set: 1327
# Display the number of samples in the testing set
cat("Number of samples in the testing set:", nrow(x_test), "\n")
## Number of samples in the testing set: 569
# Standardize all features to a common scale. Note that scale() performs
# z-score standardization; to avoid information leakage, the test set would
# ideally be scaled with the training set's means and SDs, though here each
# set is scaled independently.
MN <- function(data) {
  scaled_data <- scale(data)
  return(scaled_data)
}


# Scale the training data

x_train_scaled <- MN(x_train)
# Scale the testing data
x_test_scaled <- MN(x_test)

1. AdaBoost Algorithm

The AdaBoost (Adaptive Boosting) algorithm is a machine learning technique that combines multiple weak classifiers into a strong classifier by iteratively upweighting misclassified instances so that later classifiers focus on them.

AdaBoost_train = ada(x = x_train_scaled, y = y_train)
# Caution: fitting a second model on the test set and then evaluating that
# same set inflates the test score; standard practice would be
# predict(AdaBoost_train, as.data.frame(x_test_scaled))
AdaBoost_test = ada(x = x_test_scaled, y = y_test)

# Predict on train and test data
ada_pred_Train  =  predict(AdaBoost_train, as.data.frame(x_train_scaled)) 
ada_pred_Test  =  predict(AdaBoost_test, as.data.frame(x_test_scaled))

# Train data - Find accuracy   
ada_cm_Train = table(y_train, ada_pred_Train)
ada_accu_Train= sum(diag(ada_cm_Train))/sum(ada_cm_Train)

# Test data - Find accuracy   
ada_cm_Test = table(y_test, ada_pred_Test)
ada_accu_Test= sum(diag(ada_cm_Test))/sum(ada_cm_Test)

cat('Using model: ', 'Ada Boost Classifier', '\n')
## Using model:  Ada Boost Classifier
print(paste('Accuracy Train Score:', ada_accu_Train))
## [1] "Accuracy Train Score: 0.938206480783723"
print(paste('Accuracy Test Score:', ada_accu_Test))
## [1] "Accuracy Test Score: 0.952548330404218"

2. Random Forest Algorithm

The Random Forest algorithm is a machine learning technique that constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. It is known for its robustness, simplicity, and ability to handle large datasets with high dimensionality.

suppressWarnings({
# Ensure y_train and y_test are factors for classification
y_train <- as.factor(y_train)
y_test <- as.factor(y_test)

# Fit the model with revised parameters
rf_model <- randomForest(x_train_scaled, y_train, ntree=500, importance=TRUE)

# Predictions
y_pred_train <- predict(rf_model, x_train_scaled)
y_pred_test <- predict(rf_model, x_test_scaled)

# Calculate accuracy
rf_accu_Train <- sum(y_train == y_pred_train) / length(y_train)
rf_accu_Test <- sum(y_test == y_pred_test) / length(y_test)

cat('Using Model: ', 'Random Forest', '\n')
print(paste("Training Accuracy:", rf_accu_Train))
print(paste("Test Accuracy:", rf_accu_Test))
})
## Using Model:  Random Forest 
## [1] "Training Accuracy: 1"
## [1] "Test Accuracy: 0.796133567662566"

3. Logistic Regression Model

The Logistic Regression model is a statistical method used for binary classification problems that predicts the probability of an instance belonging to a particular class by fitting data to a logistic function.

logistic_model <- glm(y_train ~ ., data = as.data.frame(x_train_scaled), family = binomial)

# Make predictions on the training and testing set
train_predictions <- predict(logistic_model, newdata = as.data.frame(x_train_scaled), type = "response")
test_predictions <- predict(logistic_model, newdata = as.data.frame(x_test_scaled), type = "response")

# Convert predicted probabilities to binary predictions (0 or 1)
train_pred_class <- ifelse(train_predictions > 0.5, 1, 0)
test_pred_class <- ifelse(test_predictions > 0.5, 1, 0)

# Evaluate the training and testing accuracy
train_accuracy <- sum(train_pred_class == y_train) / length(y_train)
test_accuracy <- sum(test_pred_class == y_test) / length(y_test)
cat("Using Model: Logistic Regression", "\n")
## Using Model: Logistic Regression
print(paste("Training Accuracy:", train_accuracy))
## [1] "Training Accuracy: 0.800301431801055"
print(paste("Testing Accuracy:", test_accuracy))
## [1] "Testing Accuracy: 0.801405975395431"

Model Evaluation

Model evaluation is the process of assessing the performance of a machine learning model using certain metrics, such as accuracy, precision, recall, or F1 score, to determine how well the model has learned from the training data and can generalize to unseen data.

1. AdaBoost Algorithm

# Build the ROC curve from the hard 0/1 class predictions; this yields a
# single operating point (a probability-based sketch follows this subsection)

# Generate ROC curve
roc_curve <- roc(y_test, as.numeric(as.character(ada_pred_Test)))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot ROC curve
plot(roc_curve, main="ROC Curve")

# Calculate roc_auc
roc_auc <- as.numeric(roc_curve$auc)
cat("ROC AUC:", roc_auc, "\n")
## ROC AUC: 0.9252441
# Create a confusion matrix
conf_matrix <- confusionMatrix(as.factor(ada_pred_Test), as.factor(y_test))
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 145   3
##          1  24 397
##                                           
##                Accuracy : 0.9525          
##                  95% CI : (0.9317, 0.9685)
##     No Information Rate : 0.703           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8821          
##                                           
##  Mcnemar's Test P-Value : 0.0001186       
##                                           
##             Sensitivity : 0.8580          
##             Specificity : 0.9925          
##          Pos Pred Value : 0.9797          
##          Neg Pred Value : 0.9430          
##              Prevalence : 0.2970          
##          Detection Rate : 0.2548          
##    Detection Prevalence : 0.2601          
##       Balanced Accuracy : 0.9252          
##                                           
##        'Positive' Class : 0               
## 
# Convert confusion matrix to data frame for plotting
cm <- as.data.frame(as.table(conf_matrix))

ggplot(data = cm, aes(x = Reference, y = Prediction)) +
  geom_tile(aes(fill = log(Freq)), colour = "white") +
  geom_text(aes(label = sprintf("%1.0f", Freq)), vjust = 1) +
  scale_fill_gradient(low = "yellow", high = "turquoise") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1)) +
  labs(fill = "Log(Frequency)")
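
For a fuller ROC curve, class probabilities can be used instead of hard labels. A sketch, assuming predict.ada supports type = "probs" (with column 2 holding the probability of class 1) and scoring the test set with the model trained on the training data:

# Probability-based ROC for AdaBoost (sketch)
ada_test_probs <- predict(AdaBoost_train, as.data.frame(x_test_scaled), type = "probs")[, 2]
roc_prob <- roc(y_test, ada_test_probs)
plot(roc_prob, main = "ROC Curve (AdaBoost, probabilities)")
pROC::auc(roc_prob)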

2. Random Forest Algorithm

# Generate ROC curve
roc_curve <- roc(y_test, as.numeric(as.character(y_pred_test)))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot ROC curve
plot(roc_curve, main="ROC Curve")

# Calculate roc_auc
roc_auc <- as.numeric(roc_curve$auc)
cat("ROC AUC:", roc_auc, "\n")
## ROC AUC: 0.6926849
# Create a confusion matrix
conf_matrix <- confusionMatrix(as.factor(y_pred_test), as.factor(y_test))
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  74  21
##          1  95 379
##                                           
##                Accuracy : 0.7961          
##                  95% CI : (0.7607, 0.8285)
##     No Information Rate : 0.703           
##     P-Value [Acc > NIR] : 3.086e-07       
##                                           
##                   Kappa : 0.4411          
##                                           
##  Mcnemar's Test P-Value : 1.220e-11       
##                                           
##             Sensitivity : 0.4379          
##             Specificity : 0.9475          
##          Pos Pred Value : 0.7789          
##          Neg Pred Value : 0.7996          
##              Prevalence : 0.2970          
##          Detection Rate : 0.1301          
##    Detection Prevalence : 0.1670          
##       Balanced Accuracy : 0.6927          
##                                           
##        'Positive' Class : 0               
## 
# Convert confusion matrix to data frame for plotting
cm <- as.data.frame(as.table(conf_matrix))

ggplot(data = cm, aes(x = Reference, y = Prediction)) +
  geom_tile(aes(fill = log(Freq)), colour = "white") +
  geom_text(aes(label = sprintf("%1.0f", Freq)), vjust = 1) +
  scale_fill_gradient(low = "yellow", high = "turquoise") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1)) +
  labs(fill = "Log(Frequency)")
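
The same probability-based view applies to the random forest; randomForest's predict() supports type = "prob", whose second column is the probability of class 1 (a sketch):

# Probability-based ROC for Random Forest (sketch)
rf_test_probs <- predict(rf_model, x_test_scaled, type = "prob")[, 2]
roc_prob <- roc(y_test, rf_test_probs)
plot(roc_prob, main = "ROC Curve (Random Forest, probabilities)")
pROC::auc(roc_prob)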

3. Logistic Regression Model

# Generate ROC curve
roc_curve <- roc(y_test, as.numeric(as.character(test_pred_class)))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot ROC curve
plot(roc_curve, main="ROC Curve")

# Calculate roc_auc
roc_auc <- as.numeric(roc_curve$auc)
cat("ROC AUC:", roc_auc, "\n")
## ROC AUC: 0.720355
# Create a confusion matrix
conf_matrix <- confusionMatrix(as.factor(test_pred_class), as.factor(y_test))
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  88  32
##          1  81 368
##                                           
##                Accuracy : 0.8014          
##                  95% CI : (0.7662, 0.8334)
##     No Information Rate : 0.703           
##     P-Value [Acc > NIR] : 6.313e-08       
##                                           
##                   Kappa : 0.481           
##                                           
##  Mcnemar's Test P-Value : 6.318e-06       
##                                           
##             Sensitivity : 0.5207          
##             Specificity : 0.9200          
##          Pos Pred Value : 0.7333          
##          Neg Pred Value : 0.8196          
##              Prevalence : 0.2970          
##          Detection Rate : 0.1547          
##    Detection Prevalence : 0.2109          
##       Balanced Accuracy : 0.7204          
##                                           
##        'Positive' Class : 0               
## 
# Convert confusion matrix to data frame for plotting
cm <- as.data.frame(as.table(conf_matrix))

ggplot(data = cm, aes(x = Reference, y = Prediction)) +
  geom_tile(aes(fill = log(Freq)), colour = "white") +
  geom_text(aes(label = sprintf("%1.0f", Freq)), vjust = 1) +
  scale_fill_gradient(low = "yellow", high = "turquoise") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1)) +
  labs(fill = "Log(Frequency)")
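
For the logistic model no extra step is needed: test_predictions already holds fitted probabilities, so the ROC curve can use them directly (a sketch):

# Probability-based ROC for Logistic Regression (sketch)
roc_prob <- roc(y_test, test_predictions)
plot(roc_prob, main = "ROC Curve (Logistic Regression, probabilities)")
pROC::auc(roc_prob)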

Model Performance Comparison

all_models_df <- data.frame(
  model_name = c('Random Forest', 'Ada Boost','Logistic Regression'),
  Train_Accuracy = c(rf_accu_Train, ada_accu_Train, train_accuracy),
  Test_Accuracy = c(rf_accu_Test, ada_accu_Test, test_accuracy)
)

all_models_df
##            model_name Train_Accuracy Test_Accuracy
## 1       Random Forest      1.0000000     0.7961336
## 2           Ada Boost      0.9382065     0.9525483
## 3 Logistic Regression      0.8003014     0.8014060
ggp_train <- ggplot(all_models_df, aes(x = model_name, y = Train_Accuracy, fill = model_name)) +
  geom_bar(stat = "identity") +
  ggtitle("Models Vs Train Accuracies") +
  coord_flip() +
  theme_minimal()

ggp_train

ggp_test <- ggplot(all_models_df, aes(x = model_name, y = Test_Accuracy, fill = model_name)) +
  geom_bar(stat = "identity") +
  ggtitle("Models Vs Test Accuracies") +
  coord_flip() +
  theme_minimal()

ggp_test

Interpretation of the Model Performance:

By test accuracy: Ada Boost (0.953) > Logistic Regression (0.801) > Random Forest (0.796). Two caveats temper this ranking: the AdaBoost test score is inflated because the evaluated model was trained on the test set itself, and the Random Forest's perfect training accuracy against a much lower test accuracy points to overfitting.

Export to ShinyApp

For deployment, the trained model is saved to disk.

# Export AdaBoost model
saveRDS(AdaBoost_train, file = "AdaBoost_model.rds")

# Print a message indicating successful export
cat('AdaBoost model exported successfully.\n')
## AdaBoost model exported successfully.
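
On the Shiny side, the exported file can be reloaded with readRDS (a sketch; the input object new_customer is hypothetical):

# In the Shiny app's server code (hypothetical sketch)
AdaBoost_model <- readRDS("AdaBoost_model.rds")
# new_customer must be a data frame with the same scaled feature columns
# used in training:
# predict(AdaBoost_model, new_customer)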

Data Product

The ShinyApp link: https://spniqe-jie0ying-tan.shinyapps.io/downloads/