In the rapidly evolving world of e-commerce, understanding customer behavior is crucial for the growth and sustainability of online businesses. One of the key metrics that businesses need to monitor closely is customer churn, also known as customer attrition.
Customer churn, also called customer attrition, occurs when a customer stops doing business with, or ends their relationship with, a company. Churn prediction uses data analysis and machine learning techniques to estimate the likelihood that a customer will leave a service or product.
In the context of e-commerce, customer churn prediction can help businesses identify potential churners early on and take appropriate actions to retain them, thereby saving on the cost of acquiring new customers. This process involves analyzing various customer-related factors such as purchase history, customer engagement, product preference, and many others.
In this project, a machine learning model will be developed to predict customer churn for an e-commerce business. By leveraging historical customer data, our model aims to provide accurate predictions that can help the business mitigate customer loss and enhance customer retention strategies. Let’s dive in!
To perform a comprehensive analysis of the provided customer data to extract insights into customer behavior and characteristics.
To construct a machine learning model that can accurately predict which customers are likely to churn. This model should leverage the provided variables to identify at-risk customers.
library(readxl)
E_Commerce_Dataset <- read_excel("E Commerce Dataset.xlsx",
sheet = "E Comm")
A dataset overview provides a summary of the dataset’s characteristics, including the number of observations and variables, the types of variables (numerical, categorical), and a brief description of what these variables represent.
summary(E_Commerce_Dataset)
## CustomerID Churn Tenure PreferredLoginDevice
## Min. :50001 Min. :0.0000 Min. : 0.00 Length:5630
## 1st Qu.:51408 1st Qu.:0.0000 1st Qu.: 2.00 Class :character
## Median :52816 Median :0.0000 Median : 9.00 Mode :character
## Mean :52816 Mean :0.1684 Mean :10.19
## 3rd Qu.:54223 3rd Qu.:0.0000 3rd Qu.:16.00
## Max. :55630 Max. :1.0000 Max. :61.00
## NA's :264
## CityTier WarehouseToHome PreferredPaymentMode Gender
## Min. :1.000 Min. : 5.00 Length:5630 Length:5630
## 1st Qu.:1.000 1st Qu.: 9.00 Class :character Class :character
## Median :1.000 Median : 14.00 Mode :character Mode :character
## Mean :1.655 Mean : 15.64
## 3rd Qu.:3.000 3rd Qu.: 20.00
## Max. :3.000 Max. :127.00
## NA's :251
## HourSpendOnApp NumberOfDeviceRegistered PreferedOrderCat SatisfactionScore
## Min. :0.000 Min. :1.000 Length:5630 Min. :1.000
## 1st Qu.:2.000 1st Qu.:3.000 Class :character 1st Qu.:2.000
## Median :3.000 Median :4.000 Mode :character Median :3.000
## Mean :2.932 Mean :3.689 Mean :3.067
## 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :6.000 Max. :5.000
## NA's :255
## MaritalStatus NumberOfAddress Complain
## Length:5630 Min. : 1.000 Min. :0.0000
## Class :character 1st Qu.: 2.000 1st Qu.:0.0000
## Mode :character Median : 3.000 Median :0.0000
## Mean : 4.214 Mean :0.2849
## 3rd Qu.: 6.000 3rd Qu.:1.0000
## Max. :22.000 Max. :1.0000
##
## OrderAmountHikeFromlastYear CouponUsed OrderCount
## Min. :11.00 Min. : 0.000 Min. : 1.000
## 1st Qu.:13.00 1st Qu.: 1.000 1st Qu.: 1.000
## Median :15.00 Median : 1.000 Median : 2.000
## Mean :15.71 Mean : 1.751 Mean : 3.008
## 3rd Qu.:18.00 3rd Qu.: 2.000 3rd Qu.: 3.000
## Max. :26.00 Max. :16.000 Max. :16.000
## NA's :265 NA's :256 NA's :258
## DaySinceLastOrder CashbackAmount
## Min. : 0.000 Min. : 0.0
## 1st Qu.: 2.000 1st Qu.:145.8
## Median : 3.000 Median :163.3
## Mean : 4.543 Mean :177.2
## 3rd Qu.: 7.000 3rd Qu.:196.4
## Max. :46.000 Max. :325.0
## NA's :307
str(E_Commerce_Dataset)
## tibble [5,630 × 20] (S3: tbl_df/tbl/data.frame)
## $ CustomerID : num [1:5630] 50001 50002 50003 50004 50005 ...
## $ Churn : num [1:5630] 1 1 1 1 1 1 1 1 1 1 ...
## $ Tenure : num [1:5630] 4 NA NA 0 0 0 NA NA 13 NA ...
## $ PreferredLoginDevice : chr [1:5630] "Mobile Phone" "Phone" "Phone" "Phone" ...
## $ CityTier : num [1:5630] 3 1 1 3 1 1 3 1 3 1 ...
## $ WarehouseToHome : num [1:5630] 6 8 30 15 12 22 11 6 9 31 ...
## $ PreferredPaymentMode : chr [1:5630] "Debit Card" "UPI" "Debit Card" "Debit Card" ...
## $ Gender : chr [1:5630] "Female" "Male" "Male" "Male" ...
## $ HourSpendOnApp : num [1:5630] 3 3 2 2 NA 3 2 3 NA 2 ...
## $ NumberOfDeviceRegistered : num [1:5630] 3 4 4 4 3 5 3 3 4 5 ...
## $ PreferedOrderCat : chr [1:5630] "Laptop & Accessory" "Mobile" "Mobile" "Laptop & Accessory" ...
## $ SatisfactionScore : num [1:5630] 2 3 3 5 5 5 2 2 3 3 ...
## $ MaritalStatus : chr [1:5630] "Single" "Single" "Single" "Single" ...
## $ NumberOfAddress : num [1:5630] 9 7 6 8 3 2 4 3 2 2 ...
## $ Complain : num [1:5630] 1 1 1 0 0 1 0 1 1 0 ...
## $ OrderAmountHikeFromlastYear: num [1:5630] 11 15 14 23 11 22 14 16 14 12 ...
## $ CouponUsed : num [1:5630] 1 0 0 0 1 4 0 2 0 1 ...
## $ OrderCount : num [1:5630] 1 1 1 1 1 6 1 2 1 1 ...
## $ DaySinceLastOrder : num [1:5630] 5 0 3 3 3 7 0 0 2 1 ...
## $ CashbackAmount : num [1:5630] 160 121 120 134 130 ...
head(E_Commerce_Dataset)
## # A tibble: 6 × 20
## CustomerID Churn Tenure PreferredLoginDevice CityTier WarehouseToHome
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 50001 1 4 Mobile Phone 3 6
## 2 50002 1 NA Phone 1 8
## 3 50003 1 NA Phone 1 30
## 4 50004 1 0 Phone 3 15
## 5 50005 1 0 Phone 1 12
## 6 50006 1 0 Computer 1 22
## # ℹ 14 more variables: PreferredPaymentMode <chr>, Gender <chr>,
## # HourSpendOnApp <dbl>, NumberOfDeviceRegistered <dbl>,
## # PreferedOrderCat <chr>, SatisfactionScore <dbl>, MaritalStatus <chr>,
## # NumberOfAddress <dbl>, Complain <dbl>, OrderAmountHikeFromlastYear <dbl>,
## # CouponUsed <dbl>, OrderCount <dbl>, DaySinceLastOrder <dbl>,
## # CashbackAmount <dbl>
Data cleaning is the process of identifying and correcting or removing errors, inaccuracies, and inconsistencies in datasets to improve data quality and reliability for further analysis.
Missing values - replaced with a single representative value (the column's mean, median, or mode, chosen according to its distribution; a minimal sketch of this replacement logic follows below)
Inconsistent data - replaced with the correct category label
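A minimal sketch of the replacement logic used throughout the cleaning steps; the helper name impute_na is ours and does not appear elsewhere in the report:
# Minimal sketch: replace NA values in a numeric vector with a chosen statistic.
# The helper name impute_na is illustrative only; the cleaning below applies the
# same idea column by column, with values taken from summary().
impute_na <- function(x, value) {
  x[is.na(x)] <- value
  x
}
# Example (median imputation, as done for Tenure below):
# E_Commerce_Dataset$Tenure <- impute_na(E_Commerce_Dataset$Tenure,
#                                        median(E_Commerce_Dataset$Tenure, na.rm = TRUE))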
A histogram of the Tenure column is plotted first so that its distribution can be compared before and after cleaning.
hist(E_Commerce_Dataset$Tenure)
The histogram is right-skewed. To clean this column, the NA values will be replaced with the median, which is 9 as shown in the summary.
E_Commerce_Dataset$Tenure[is.na(E_Commerce_Dataset$Tenure)] <- 9
summary(E_Commerce_Dataset$Tenure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 3.00 9.00 10.13 15.00 61.00
hist(E_Commerce_Dataset$Tenure)
Cross-checking against the Tenure summary above, the difference is minimal and the NA values have been replaced with the median.
hist(E_Commerce_Dataset$WarehouseToHome)
This histogram is also right-skewed. For WarehouseToHome, the mode will be used to replace the NA values, so the mode needs to be calculated first.
library("DescTools")
wth_mode <- Mode(E_Commerce_Dataset$WarehouseToHome, na.rm=TRUE)
wth_mode
## [1] 9
## attr(,"freq")
## [1] 559
The mode is 9, with a frequency of 559.
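For reference, the same statistic can be computed without DescTools; a minimal base-R sketch (the helper name mode_value is ours):
# Base-R sketch of the statistical mode (most frequent value)
mode_value <- function(x) {
  x <- x[!is.na(x)]
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}
# mode_value(E_Commerce_Dataset$WarehouseToHome)  # expected to return 9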
E_Commerce_Dataset$WarehouseToHome[is.na(E_Commerce_Dataset$WarehouseToHome)] <- 9
summary(E_Commerce_Dataset$WarehouseToHome)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 9.00 13.00 15.34 20.00 127.00
Compared with the values before cleaning, the difference is minimal and the NA values have been replaced with the mode.
hist(E_Commerce_Dataset$HourSpendOnApp)
This histogram is left-skewed. To clean this column, the NA values will be replaced with the mean, which is 2.932 (rounded to 3) as shown in the summary.
E_Commerce_Dataset$HourSpendOnApp[is.na(E_Commerce_Dataset$HourSpendOnApp)] <- 3
summary(E_Commerce_Dataset$HourSpendOnApp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 2.935 3.000 5.000
hist(E_Commerce_Dataset$HourSpendOnApp)
Compared with the values before cleaning, the difference is minimal and the NA values have been replaced with the mean.
hist(E_Commerce_Dataset$OrderAmountHikeFromlastYear)
This histogram is right-skewed. To clean this column, the NA values will be replaced with the mean, which is 15.71 (rounded up to 16) as shown in the summary.
E_Commerce_Dataset$OrderAmountHikeFromlastYear[is.na(E_Commerce_Dataset$OrderAmountHikeFromlastYear)] <- 16
summary(E_Commerce_Dataset$OrderAmountHikeFromlastYear)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 13.00 15.00 15.72 18.00 26.00
Let’s do a quick cross-check with a histogram to see whether there is a large difference compared to the data before cleaning.
hist(E_Commerce_Dataset$OrderAmountHikeFromlastYear)
Compared with the histogram before cleaning, the difference is minimal and the NA values have been replaced with the mean.
hist(E_Commerce_Dataset$CouponUsed)
This histogram is right-skewed. To clean this column, the NA values will be replaced with the mean, which is 1.751 (rounded up to 2) as shown in the summary.
E_Commerce_Dataset$CouponUsed[is.na(E_Commerce_Dataset$CouponUsed)] <- 2
summary(E_Commerce_Dataset$CouponUsed)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 1.762 2.000 16.000
Let’s do a quick cross-check with a histogram to see whether there is a large difference compared to the data before cleaning.
hist(E_Commerce_Dataset$CouponUsed)
Compared with the histogram before cleaning, the difference is minimal and the NA values have been replaced with the mean.
hist(E_Commerce_Dataset$OrderCount)
This histogram is right-skewed. To clean this column, the NA values will be replaced with the mean, which is 3.008 (rounded to 3) as shown in the summary.
E_Commerce_Dataset$OrderCount[is.na(E_Commerce_Dataset$OrderCount)] <- 3
summary(E_Commerce_Dataset$OrderCount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 3.008 3.000 16.000
Let’s do a quick cross-check with a histogram to see whether there is a large difference compared to the data before cleaning.
hist(E_Commerce_Dataset$OrderCount)
Compared with the histogram before cleaning, the difference is negligible (almost no change) and the NA values have been replaced with the mean.
hist(E_Commerce_Dataset$DaySinceLastOrder)
This histogram is right-skewed. To clean this column, the NA values will be replaced with the median, which is 3 as shown in the summary.
E_Commerce_Dataset$DaySinceLastOrder[is.na(E_Commerce_Dataset$DaySinceLastOrder)] <- 3
summary(E_Commerce_Dataset$DaySinceLastOrder)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 4.459 7.000 46.000
Let’s do a quick cross-check with a histogram to see whether there is a large difference compared to the data before cleaning.
hist(E_Commerce_Dataset$DaySinceLastOrder)
Compared with the histogram before cleaning, the difference is minimal and the NA values have been replaced with the median.
To summarize the cleaned data, the number of rows and columns is checked to confirm that no data was deleted, followed by the total count of remaining NA values.
nrow(E_Commerce_Dataset)
## [1] 5630
ncol(E_Commerce_Dataset)
## [1] 20
sum(is.na(E_Commerce_Dataset))
## [1] 0
Rows = 5630 (all rows remain) | Columns = 20 (all columns remain) | NA count = 0 (no NA values remain in the cleaned dataset)
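If a per-column breakdown is preferred over the single total above, a quick sketch:
# Count of remaining NA values in each column (all should be zero after cleaning)
colSums(is.na(E_Commerce_Dataset))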
Now, let’s cross-check the summary of the data after cleaning to confirm that no NA values remain.
summary(E_Commerce_Dataset)
## CustomerID Churn Tenure PreferredLoginDevice
## Min. :50001 Min. :0.0000 Min. : 0.00 Length:5630
## 1st Qu.:51408 1st Qu.:0.0000 1st Qu.: 3.00 Class :character
## Median :52816 Median :0.0000 Median : 9.00 Mode :character
## Mean :52816 Mean :0.1684 Mean :10.13
## 3rd Qu.:54223 3rd Qu.:0.0000 3rd Qu.:15.00
## Max. :55630 Max. :1.0000 Max. :61.00
## CityTier WarehouseToHome PreferredPaymentMode Gender
## Min. :1.000 Min. : 5.00 Length:5630 Length:5630
## 1st Qu.:1.000 1st Qu.: 9.00 Class :character Class :character
## Median :1.000 Median : 13.00 Mode :character Mode :character
## Mean :1.655 Mean : 15.34
## 3rd Qu.:3.000 3rd Qu.: 20.00
## Max. :3.000 Max. :127.00
## HourSpendOnApp NumberOfDeviceRegistered PreferedOrderCat SatisfactionScore
## Min. :0.000 Min. :1.000 Length:5630 Min. :1.000
## 1st Qu.:2.000 1st Qu.:3.000 Class :character 1st Qu.:2.000
## Median :3.000 Median :4.000 Mode :character Median :3.000
## Mean :2.935 Mean :3.689 Mean :3.067
## 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :6.000 Max. :5.000
## MaritalStatus NumberOfAddress Complain
## Length:5630 Min. : 1.000 Min. :0.0000
## Class :character 1st Qu.: 2.000 1st Qu.:0.0000
## Mode :character Median : 3.000 Median :0.0000
## Mean : 4.214 Mean :0.2849
## 3rd Qu.: 6.000 3rd Qu.:1.0000
## Max. :22.000 Max. :1.0000
## OrderAmountHikeFromlastYear CouponUsed OrderCount
## Min. :11.00 Min. : 0.000 Min. : 1.000
## 1st Qu.:13.00 1st Qu.: 1.000 1st Qu.: 1.000
## Median :15.00 Median : 1.000 Median : 2.000
## Mean :15.72 Mean : 1.762 Mean : 3.008
## 3rd Qu.:18.00 3rd Qu.: 2.000 3rd Qu.: 3.000
## Max. :26.00 Max. :16.000 Max. :16.000
## DaySinceLastOrder CashbackAmount
## Min. : 0.000 Min. : 0.0
## 1st Qu.: 2.000 1st Qu.:145.8
## Median : 3.000 Median :163.3
## Mean : 4.459 Mean :177.2
## 3rd Qu.: 7.000 3rd Qu.:196.4
## Max. :46.000 Max. :325.0
After cross-checking, there are no more NA values. Next, some inconsistent categorical values need to be adjusted.
The summary of categorical variables:
for (i in names(E_Commerce_Dataset)) {
  # Check if the column is of type 'character' (assumed categorical)
  if (class(E_Commerce_Dataset[[i]]) == 'character') {
    # Print value counts
    cat(i, ":")
    print(table(E_Commerce_Dataset[[i]]))
    cat("**********************************************************\n")
  }
}
## PreferredLoginDevice :
## Computer Mobile Phone Phone
## 1634 2765 1231
## **********************************************************
## PreferredPaymentMode :
## Cash on Delivery CC COD Credit Card
## 149 273 365 1501
## Debit Card E wallet UPI
## 2314 614 414
## **********************************************************
## Gender :
## Female Male
## 2246 3384
## **********************************************************
## PreferedOrderCat :
## Fashion Grocery Laptop & Accessory Mobile
## 826 410 2050 809
## Mobile Phone Others
## 1271 264
## **********************************************************
## MaritalStatus :
## Divorced Married Single
## 848 2986 1796
## **********************************************************
Replacing "Phone" with "Mobile Phone" in PreferredLoginDevice
E_Commerce_Dataset$PreferredLoginDevice <- replace(E_Commerce_Dataset$PreferredLoginDevice, E_Commerce_Dataset$PreferredLoginDevice == "Phone", "Mobile Phone")
head(E_Commerce_Dataset$PreferredLoginDevice)
## [1] "Mobile Phone" "Mobile Phone" "Mobile Phone" "Mobile Phone" "Mobile Phone"
## [6] "Computer"
Replacing "CC" and "COD" with "Credit Card" and "Cash on Delivery" in PreferredPaymentMode
E_Commerce_Dataset$PreferredPaymentMode <- replace(E_Commerce_Dataset$PreferredPaymentMode, E_Commerce_Dataset$PreferredPaymentMode == "CC", "Credit Card")
E_Commerce_Dataset$PreferredPaymentMode <- replace(E_Commerce_Dataset$PreferredPaymentMode, E_Commerce_Dataset$PreferredPaymentMode == "COD", "Cash on Delivery")
head(E_Commerce_Dataset$PreferredPaymentMode, 15)
## [1] "Debit Card" "UPI" "Debit Card" "Debit Card"
## [5] "Credit Card" "Debit Card" "Cash on Delivery" "Credit Card"
## [9] "E wallet" "Debit Card" "Cash on Delivery" "Debit Card"
## [13] "Cash on Delivery" "Credit Card" "Credit Card"
Replacing "Mobile" with "Mobile Phone" in PreferedOrderCat
E_Commerce_Dataset$PreferedOrderCat <- replace(E_Commerce_Dataset$PreferedOrderCat, E_Commerce_Dataset$PreferedOrderCat == "Mobile", "Mobile Phone")
head(E_Commerce_Dataset$PreferedOrderCat)
## [1] "Laptop & Accessory" "Mobile Phone" "Mobile Phone"
## [4] "Laptop & Accessory" "Mobile Phone" "Mobile Phone"
Here we cross-check whether any inconsistencies remain in the dataset.
for (i in names(E_Commerce_Dataset)) {
  # Check if the column is of type 'character' (assumed categorical)
  if (class(E_Commerce_Dataset[[i]]) == 'character') {
    # Print value counts
    cat(i, ":")
    print(table(E_Commerce_Dataset[[i]]))
    cat("********************\n")
  }
}
## PreferredLoginDevice :
## Computer Mobile Phone
## 1634 3996
## ********************
## PreferredPaymentMode :
## Cash on Delivery Credit Card Debit Card E wallet
## 514 1774 2314 614
## UPI
## 414
## ********************
## Gender :
## Female Male
## 2246 3384
## ********************
## PreferedOrderCat :
## Fashion Grocery Laptop & Accessory Mobile Phone
## 826 410 2050 2080
## Others
## 264
## ********************
## MaritalStatus :
## Divorced Married Single
## 848 2986 1796
## ********************
Therefore, the dataset is considered clean.
Last but not least, the cleaned data is exported so the rest of the group can use it.
write.csv(E_Commerce_Dataset, "ecomm_cleaned_data.csv")
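Note that write.csv() writes row names by default, so the re-imported file gains an extra index column named X (it appears next to CustomerID in the modelling section later). If that column is not wanted, a sketch of the alternative:
# Optional: export without the row-name index column
write.csv(E_Commerce_Dataset, "ecomm_cleaned_data.csv", row.names = FALSE)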
Load the cleaned data and the required packages
library(readxl)
ecomm_cleaned_data <- read.csv("ecomm_cleaned_data.csv")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(knitr)
library(tidyr)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(treemap)
library(ggcorrplot)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
#### Which Gender has more Orders?
## ['Churn']=='0' = No Churn
## ['Churn']=='1' = Churn
# Group by 'Gender' and calculate the mean of 'OrderCount' and get who have more order
gender_orders <- ecomm_cleaned_data %>%
group_by(Gender) %>%
summarize(mean_order_count = mean(OrderCount))
# Create a bar plot using ggplot2
ggplot(gender_orders, aes(x = Gender, y = mean_order_count)) +
geom_bar(stat = "identity", fill = "skyblue", color = "black") +
labs(title = "Mean Order Count by Gender", x = "Gender", y = "Mean Order Count")
Female and male customers have similar mean order counts.
### Is there a relationship between Gender and Churn?
# grouping count how many gender have churn/no-churn
ecomm_churn_gender <- ecomm_cleaned_data %>% count(Gender, Churn)
female_churn <- select( filter(ecomm_churn_gender, Gender == 'Female' & Churn == '1'), n)
female_non_churn <- select( filter(ecomm_churn_gender, Gender == 'Female' & Churn == '0'), n)
male_churn <- select( filter(ecomm_churn_gender, Gender == 'Male' & Churn == '1'), n)
male_non_churn <- select( filter(ecomm_churn_gender, Gender == 'Male' & Churn == '0'), n)
#the percentage of the leaving females out of the females
percentageF <- female_churn /(female_churn + female_non_churn) * 100
#the percentage of the leaving males out of the males
percentageM <- male_churn /(male_churn + male_non_churn) * 100
paste("Percentage of the leaving females out of total females is", percentageF)
## [1] "Percentage of the leaving females out of total females is 15.4942119323241"
paste("Percentage of the leaving males out of total males is", percentageM)
## [1] "Percentage of the leaving males out of total males is 17.7304964539007"
There is not a big difference between the male and female churn rates within their own gender groups (17.7% vs 15.5%).
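The same within-gender churn percentages can be computed more compactly with a grouped summary; a sketch using the same cleaned data:
# Churn rate within each gender, straight from the cleaned data
ecomm_cleaned_data %>%
  group_by(Gender) %>%
  summarise(churn_rate_pct = mean(Churn) * 100)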
## Create pie chart with percentages
#only want who churn = 1
ecomm_churn_1_gender <- filter( ecomm_churn_gender, Churn=='1')
#calc for percentage who is churn
ecomm_churn_1_gender$percentage <- (ecomm_churn_1_gender$n / sum(ecomm_churn_1_gender$n )) * 100
ecomm_churn_1_gender
## Gender Churn n percentage
## 1 Female 1 348 36.70886
## 2 Male 1 600 63.29114
ggplot(ecomm_churn_1_gender, aes(x = "", y = percentage, fill = Gender)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
geom_text(aes(label = paste0(round(percentage,2), "%")), position = position_stack(vjust = 0.5)) +
theme_void() +
labs(title = "Pie Chart with Percentages")
As we can see, males are more likely to churn: the chart shows that 63.3% of churned customers are male, compared with 36.7% female.
Step A: Data Transformation
First, the count of churned and non-churned customers for each SatisfactionScore is tabulated.
Satchurn <- ecomm_cleaned_data %>% select(SatisfactionScore, Churn) %>% count(SatisfactionScore, Churn, name = "Count") %>% arrange(dplyr::desc("Churn"))
print(Satchurn)
## SatisfactionScore Churn Count
## 1 1 0 1030
## 2 1 1 134
## 3 2 0 512
## 4 2 1 74
## 5 3 0 1406
## 6 3 1 292
## 7 4 0 890
## 8 4 1 184
## 9 5 0 844
## 10 5 1 264
Step B: Data Visualization
Using the table generated in Step A, a barplot is generated for analysis.
ggplot(Satchurn, aes(x = SatisfactionScore, y = Count, group = Churn, fill = Churn)) +
  geom_bar(position = "dodge", stat = "identity") +
  geom_text(
    aes(label = Count),
    position = position_dodge(0.9),
    vjust = 0
  )
Interpretation of result: counterintuitively, the churn rate rises with the satisfaction score in this dataset (roughly 11.5% of customers with a score of 1 churn versus roughly 23.8% of those with a score of 5), so a high score alone does not guarantee retention.
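The per-score churn rates behind this reading can be verified directly from the cleaned data; a sketch:
# Churn rate for each satisfaction score
ecomm_cleaned_data %>%
  group_by(SatisfactionScore) %>%
  summarise(churn_rate_pct = round(mean(Churn) * 100, 1))
# From the counts above this works out to roughly 11.5, 12.6, 17.2, 17.1 and 23.8
# percent for scores 1 through 5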
Group the data by the "Complain" column and count how many customers complained and how many did not.
complain_count <- count(ecomm_cleaned_data, Complain) %>%
rename (churn = n)
complain_count
## Complain churn
## 1 0 4026
## 2 1 1604
complain_churn <- count(ecomm_cleaned_data, Complain, Churn) %>%
rename (count = n) %>%
mutate (Complain = ifelse(Complain == 0, 'No Complain', 'Complain')) %>%
mutate (Churn = ifelse(Churn == 0, 'No Churn', 'Churn')) %>%
mutate(count=as.numeric(count))
complain_churn
## Complain Churn count
## 1 No Complain No Churn 3586
## 2 No Complain Churn 440
## 3 Complain No Churn 1096
## 4 Complain Churn 508
complain_churn <- as.data.frame(complain_churn)
complain_churn
## Complain Churn count
## 1 No Complain No Churn 3586
## 2 No Complain Churn 440
## 3 Complain No Churn 1096
## 4 Complain Churn 508
fig_complain_churn <- treemap(
complain_churn,
index = c("Complain", "Churn"),
vSize = "count",
draw = TRUE,
vColor = "Complain",
palette = c('No Complain' = '#FF0000', 'Complain' = '#0000FF'), # Use hexadecimal color codes
title = 'Complain Vs Churn'
)
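To put a number on what the treemap shows, the churn rate can be compared between complainers and non-complainers; a sketch based on the counts above:
# Churn rate for customers with and without a complaint
ecomm_cleaned_data %>%
  group_by(Complain) %>%
  summarise(churn_rate_pct = round(mean(Churn) * 100, 1))
# From the counts above: roughly 10.9% churn without a complaint vs 31.7% with one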
Step A: Data Transformation
coupon_churnrate <- ecomm_cleaned_data %>% select(CouponUsed, Churn) %>% count(CouponUsed, Churn, name = "Count") %>% arrange(dplyr::desc("Churn"))
print(coupon_churnrate)
## CouponUsed Churn Count
## 1 0 0 844
## 2 0 1 186
## 3 1 0 1727
## 4 1 1 378
## 5 2 0 1309
## 6 2 1 230
## 7 3 0 281
## 8 3 1 46
## 9 4 0 167
## 10 4 1 30
## 11 5 0 106
## 12 5 1 23
## 13 6 0 90
## 14 6 1 18
## 15 7 0 71
## 16 7 1 18
## 17 8 0 33
## 18 8 1 9
## 19 9 0 11
## 20 9 1 2
## 21 10 0 11
## 22 10 1 3
## 23 11 0 10
## 24 11 1 2
## 25 12 0 8
## 26 12 1 1
## 27 13 0 8
## 28 14 0 5
## 29 15 1 1
## 30 16 0 1
## 31 16 1 1
Step B: Data Visualization
ggplot(coupon_churnrate, aes(x = CouponUsed, y = Count, group = Churn, fill = Churn)) + geom_bar(position = "dodge", stat = "identity")
marital_status <- ecomm_cleaned_data %>%
group_by(Churn, MaritalStatus) %>%
summarise(Count = n(), .groups = 'drop') %>%
arrange(Churn, desc(Count))
marital_status
## # A tibble: 6 × 3
## Churn MaritalStatus Count
## <int> <chr> <int>
## 1 0 Married 2642
## 2 0 Single 1316
## 3 0 Divorced 724
## 4 1 Single 480
## 5 1 Married 344
## 6 1 Divorced 124
# Convert 'Churn' to a factor
ecomm_cleaned_data_copy <- ecomm_cleaned_data
ecomm_cleaned_data_copy$Churn <- as.factor(ecomm_cleaned_data_copy$Churn)
# Determine the maximum value for the y-axis
max_count <- max(table(ecomm_cleaned_data_copy$MaritalStatus, ecomm_cleaned_data_copy$Churn))
churn_by_marital_status <- ecomm_cleaned_data_copy %>%
ggplot(aes(x = MaritalStatus, fill = Churn)) +
geom_bar(position = position_dodge()) +
scale_fill_manual(values = c("blue", "red")) +
scale_y_continuous(breaks = seq(0, max_count, by = 500)) +
labs(title = "Churn Count by Marital Status",
x = "Marital Status",
y = "Churn Count")
churn_by_marital_status
Single customers represent over half of the total customer churn at 480 (50.6%), despite married individuals forming the largest customer segment.
To replace the character variables with numeric labels (integer encoding)
ecomm_cleaned_data <- ecomm_cleaned_data %>%
mutate_at(vars(PreferredLoginDevice, PreferredPaymentMode, Gender, PreferedOrderCat, MaritalStatus),
~as.numeric(as.factor(.)))
head(ecomm_cleaned_data)
## X CustomerID Churn Tenure PreferredLoginDevice CityTier WarehouseToHome
## 1 1 50001 1 4 2 3 6
## 2 2 50002 1 9 2 1 8
## 3 3 50003 1 9 2 1 30
## 4 4 50004 1 0 2 3 15
## 5 5 50005 1 0 2 1 12
## 6 6 50006 1 0 1 1 22
## PreferredPaymentMode Gender HourSpendOnApp NumberOfDeviceRegistered
## 1 3 1 3 3
## 2 5 2 3 4
## 3 3 2 2 4
## 4 3 2 2 4
## 5 2 2 3 3
## 6 3 1 3 5
## PreferedOrderCat SatisfactionScore MaritalStatus NumberOfAddress Complain
## 1 3 2 3 9 1
## 2 4 3 3 7 1
## 3 4 3 3 6 1
## 4 3 5 3 8 0
## 5 4 5 3 3 0
## 6 4 5 3 2 1
## OrderAmountHikeFromlastYear CouponUsed OrderCount DaySinceLastOrder
## 1 11 1 1 5
## 2 15 0 1 0
## 3 14 0 1 3
## 4 23 0 1 3
## 5 11 1 1 3
## 6 22 4 6 7
## CashbackAmount
## 1 159.93
## 2 120.90
## 3 120.28
## 4 134.07
## 5 129.60
## 6 139.19
The numeric labels assigned to each character variable's categories:
# Create a data frame
data <- data.frame(
"Variable" = c("PreferredLoginDevice", "PreferredLoginDevice", "PreferredPaymentMode", "PreferredPaymentMode", "PreferredPaymentMode", "PreferredPaymentMode", "PreferredPaymentMode", "Gender", "Gender", "PreferedOrderCat", "PreferedOrderCat", "PreferedOrderCat", "PreferedOrderCat", "PreferedOrderCat", "MaritalStatus", "MaritalStatus", "MaritalStatus"),
"Indicator" = c("Computer", "Mobile Phone", "Cash on Delivery", "Credit Card", "Debit Card", "E Wallet", "UPI", "Female", "Male", "Fashion", "Grocery", "Laptop & Accessory", "Mobile Phone", "Others", "Divorced", "Married", "Single"),
"Numerical_Label" = c(1, 2, 1, 2, 3, 4, 5, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3)
)
# Print the data frame
print(data)
## Variable Indicator Numerical_Label
## 1 PreferredLoginDevice Computer 1
## 2 PreferredLoginDevice Mobile Phone 2
## 3 PreferredPaymentMode Cash on Delivery 1
## 4 PreferredPaymentMode Credit Card 2
## 5 PreferredPaymentMode Debit Card 3
## 6 PreferredPaymentMode E Wallet 4
## 7 PreferredPaymentMode UPI 5
## 8 Gender Female 1
## 9 Gender Male 2
## 10 PreferedOrderCat Fashion 1
## 11 PreferedOrderCat Grocery 2
## 12 PreferedOrderCat Laptop & Accessory 3
## 13 PreferedOrderCat Mobile Phone 4
## 14 PreferedOrderCat Others 5
## 15 MaritalStatus Divorced 1
## 16 MaritalStatus Married 2
## 17 MaritalStatus Single 3
Create correlation matrix of the data
# Create correlation matrix of the data
corr_mat <- round(cor(ecomm_cleaned_data), 3)
head(corr_mat)
## X CustomerID Churn Tenure PreferredLoginDevice
## X 1.000 1.000 -0.019 0.035 -0.003
## CustomerID 1.000 1.000 -0.019 0.035 -0.003
## Churn -0.019 -0.019 1.000 -0.338 -0.051
## Tenure 0.035 0.035 -0.338 1.000 0.041
## PreferredLoginDevice -0.003 -0.003 -0.051 0.041 1.000
## CityTier 0.003 0.003 0.085 -0.058 0.002
## CityTier WarehouseToHome PreferredPaymentMode Gender
## X 0.003 0.087 0.006 0.004
## CustomerID 0.003 0.087 0.006 0.004
## Churn 0.085 0.057 0.002 0.029
## Tenure -0.058 0.001 -0.009 -0.046
## PreferredLoginDevice 0.002 -0.022 0.007 -0.016
## CityTier 1.000 0.023 0.246 -0.025
## HourSpendOnApp NumberOfDeviceRegistered PreferedOrderCat
## X 0.580 0.411 0.000
## CustomerID 0.580 0.411 0.000
## Churn 0.019 0.108 0.105
## Tenure -0.017 -0.021 -0.189
## PreferredLoginDevice 0.020 -0.021 0.005
## CityTier -0.010 0.028 -0.171
## SatisfactionScore MaritalStatus NumberOfAddress Complain
## X -0.033 0.124 0.161 -0.010
## CustomerID -0.033 0.124 0.161 -0.010
## Churn 0.105 0.140 0.044 0.250
## Tenure -0.014 -0.105 0.235 -0.021
## PreferredLoginDevice 0.036 -0.029 0.026 0.001
## CityTier -0.012 0.007 -0.029 0.003
## OrderAmountHikeFromlastYear CouponUsed OrderCount
## X 0.116 0.227 0.136
## CustomerID 0.116 0.227 0.136
## Churn -0.011 -0.010 -0.028
## Tenure 0.011 0.130 0.179
## PreferredLoginDevice -0.010 -0.010 -0.016
## CityTier -0.032 0.020 0.033
## DaySinceLastOrder CashbackAmount
## X 0.099 0.217
## CustomerID 0.099 0.217
## Churn -0.156 -0.154
## Tenure 0.174 0.468
## PreferredLoginDevice 0.002 0.048
## CityTier 0.010 0.056
Reshape the correlation matrix into long format for plotting
melted_corr_mat <- melt(corr_mat)
head(melted_corr_mat)
## Var1 Var2 value
## 1 X X 1.000
## 2 CustomerID X 1.000
## 3 Churn X -0.019
## 4 Tenure X 0.035
## 5 PreferredLoginDevice X -0.003
## 6 CityTier X 0.003
Plotting the correlation heatmap
ggplot(data = melted_corr_mat, aes(x=Var1, y=Var2, fill = value)) + geom_tile() + geom_text(aes(Var2,Var1,label = value), color = "black", size = 1.5) + scale_fill_gradient2(low = "turquoise", high = "yellow", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal() +theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
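One thing the heatmap makes visible is that X (the row index written out by write.csv) and CustomerID are perfectly correlated with each other and carry no behavioural signal, so a reasonable refinement before modelling would be to drop them; a sketch:
# Optional sketch: drop the identifier columns before correlation analysis / modelling
ecomm_model_data <- ecomm_cleaned_data[, !(names(ecomm_cleaned_data) %in% c("X", "CustomerID"))]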
Load required packages
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-8
library(class)
library(e1071)
library(rpart)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:plotly':
##
## slice
## The following object is masked from 'package:dplyr':
##
## slice
library(adabag)
## Loading required package: caret
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following objects are masked from 'package:DescTools':
##
## MAE, RMSE
## Loading required package: foreach
##
## Attaching package: 'foreach'
## The following object is masked from 'package:DescTools':
##
## %:%
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
library(smotefamily)
library(mlr)
## Loading required package: ParamHelpers
## Warning message: 'mlr' is in 'maintenance-only' mode since July 2019.
## Future development will only happen in 'mlr3'
## (<https://mlr3.mlr-org.com>). Due to the focus on 'mlr3' there might be
## uncaught bugs meanwhile in {mlr} - please consider switching.
##
## Attaching package: 'mlr'
## The following object is masked from 'package:caret':
##
## train
## The following object is masked from 'package:e1071':
##
## impute
library(ROSE)
## Loaded ROSE 0.0-4
library(caTools)
library(caret)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
##
## select
## The following object is masked from 'package:dplyr':
##
## select
library(ada)
library(Metrics)
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
## The following object is masked from 'package:pROC':
##
## auc
if (!requireNamespace("ROCR", quietly = TRUE)) {
install.packages("ROCR")
}
library(ROCR)
##
## Attaching package: 'ROCR'
## The following object is masked from 'package:mlr':
##
## performance
library(ggplot2)
Handling Imbalanced Data:
# Keep only the numeric columns (all predictors and the target are numeric at this point)
numeric_columns <- sapply(ecomm_cleaned_data, is.numeric)
ecomm_numeric <- ecomm_cleaned_data[, numeric_columns]
# Separate predictors and response variable
X <- ecomm_numeric[, -which(names(ecomm_numeric) == "Churn")]
Y <- ecomm_numeric$Churn
# Rebalance the classes by oversampling with ROSE
rose_result <- ROSE(Y ~ ., data = cbind(Y, X), seed = 42, p = 0.7, N = 2 * sum(Y == 1))
# Extracting the oversampled data
x_over <- rose_result$data[, -1]
y_over <- rose_result$data[, 1]
# Print the shape
print(paste("Oversampled data shape: ", dim(x_over), length(y_over)))
## [1] "Oversampled data shape: 1896 1896" "Oversampled data shape: 20 1896"
Data Splitting:
### Split the data
set.seed(42) # For reproducibility
# Create indices for train and test sets
indices <- sample(1:nrow(x_over), 0.7 * nrow(x_over))
# Split the data
x_train <- x_over[indices, ]
x_test <- x_over[-indices, ]
y_train <- y_over[indices]
y_test <- y_over[-indices]
# Display the total number of samples
cat("Total number of samples:", nrow(x_over), "\n")
## Total number of samples: 1896
# Display the number of samples in the training set
cat("Number of samples in the training set:", nrow(x_train), "\n")
## Number of samples in the training set: 1327
# Display the number of samples in the testing set
cat("Number of samples in the testing set:", nrow(x_test), "\n")
## Number of samples in the testing set: 569
# Standardize all features (z-score scaling via scale()) so they are on a common scale
MN <- function(data) {
scaled_data <- scale(data)
return(scaled_data)
}
# Scale the training data
x_train_scaled <- MN(x_train)
# Scale the testing data
x_test_scaled <- MN(x_test)
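Note that the test set above is scaled with its own mean and standard deviation. A common alternative is to reuse the statistics estimated on the training set so that preprocessing stays independent of the test data; a sketch (with new variable names so the results reported below are unchanged):
# Alternative sketch: scale the test set using the training-set centering and scaling values
train_scaled <- scale(x_train)
test_scaled_with_train_stats <- scale(x_test,
                                      center = attr(train_scaled, "scaled:center"),
                                      scale  = attr(train_scaled, "scaled:scale"))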
The AdaBoost (Adaptive Boosting) algorithm is a machine learning technique that combines multiple weak classifiers to form a strong classifier, by iteratively training the classifiers on misclassified instances to improve accuracy.
AdaBoost_train = ada(x = x_train_scaled, y = y_train)
AdaBoost_test = ada(x = x_test_scaled, y = y_test)
# Predict on train and test data
ada_pred_Train = predict(AdaBoost_train, as.data.frame(x_train_scaled))
ada_pred_Test = predict(AdaBoost_test, as.data.frame(x_test_scaled))
# Train data - Find accuracy
ada_cm_Train = table(y_train, ada_pred_Train)
ada_accu_Train= sum(diag(ada_cm_Train))/sum(ada_cm_Train)
# Test data - Find accuracy
ada_cm_Test = table(y_test, ada_pred_Test)
ada_accu_Test= sum(diag(ada_cm_Test))/sum(ada_cm_Test)
cat('Using model: ', 'Ada Boost Classifier', '\n')
## Using model: Ada Boost Classifier
print(paste('Accuracy Train Score:', ada_accu_Train))
## [1] "Accuracy Train Score: 0.938206480783723"
print(paste('Accuracy Test Score:', ada_accu_Test))
## [1] "Accuracy Test Score: 0.952548330404218"
The Random Forest algorithm is a machine learning technique that constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. It is known for its robustness, simplicity, and ability to handle large datasets with high dimensionality.
suppressWarnings({
# Ensure y_train and y_test are factors for classification
y_train <- as.factor(y_train)
y_test <- as.factor(y_test)
# Fit the model with revised parameters
rf_model <- randomForest(x_train_scaled, y_train, ntree=500, importance=TRUE)
# Predictions
y_pred_train <- predict(rf_model, x_train_scaled)
y_pred_test <- predict(rf_model, x_test_scaled)
# Calculate accuracy
rf_accu_Train <- sum(y_train == y_pred_train) / length(y_train)
rf_accu_Test <- sum(y_test == y_pred_test) / length(y_test)
cat('Using Model: ', 'Random Forest', '\n')
print(paste("Training Accuracy:", rf_accu_Train))
print(paste("Test Accuracy:", rf_accu_Test))
})
## Using Model: Random Forest
## [1] "Training Accuracy: 1"
## [1] "Test Accuracy: 0.796133567662566"
The Logistic Regression model is a statistical method used for binary classification problems that predicts the probability of an instance belonging to a particular class by fitting data to a logistic function.
logistic_model <- glm(y_train ~ ., data = as.data.frame(x_train_scaled), family = binomial)
# Make predictions on the training and testing set
train_predictions <- predict(logistic_model, newdata = as.data.frame(x_train_scaled), type = "response")
test_predictions <- predict(logistic_model, newdata = as.data.frame(x_test_scaled), type = "response")
# Convert predicted probabilities to binary predictions (0 or 1)
train_pred_class <- ifelse(train_predictions > 0.5, 1, 0)
test_pred_class <- ifelse(test_predictions > 0.5, 1, 0)
# Evaluate the training and testing accuracy
train_accuracy <- sum(train_pred_class == y_train) / length(y_train)
test_accuracy <- sum(test_pred_class == y_test) / length(y_test)
cat("Using Model: Logistic Regression", "\n")
## Using Model: Logistic Regression
print(paste("Training Accuracy:", train_accuracy))
## [1] "Training Accuracy: 0.800301431801055"
print(paste("Testing Accuracy:", test_accuracy))
## [1] "Testing Accuracy: 0.801405975395431"
Model evaluation is the process of assessing the performance of a machine learning model using certain metrics, such as accuracy, precision, recall, or F1 score, to determine how well the model has learned from the training data and can generalize to unseen data.
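Besides accuracy, the precision, recall and F1 score mentioned here can be computed from any confusion matrix; a minimal sketch (the helper prf is ours), with churn (1) treated as the positive class:
# Sketch: precision, recall and F1 with churn (1) as the positive class
prf <- function(actual, predicted, positive = 1) {
  actual    <- as.numeric(as.character(actual))
  predicted <- as.numeric(as.character(predicted))
  tp <- sum(actual == positive & predicted == positive)
  fp <- sum(actual != positive & predicted == positive)
  fn <- sum(actual == positive & predicted != positive)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, f1 = f1)
}
# Example: prf(y_test, ada_pred_Test)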
# y_test and ada_pred_Test are vectors of the same length
# Generate ROC curve
roc_curve <- roc(y_test, as.numeric(as.character(ada_pred_Test)))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot ROC curve
plot(roc_curve, main="ROC Curve")
# Calculate roc_auc
roc_auc <- as.numeric(roc_curve$auc)
cat("ROC AUC:", roc_auc, "\n")
## ROC AUC: 0.9252441
# Create a confusion matrix
conf_matrix <- confusionMatrix(as.factor(ada_pred_Test), as.factor(y_test))
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 145 3
## 1 24 397
##
## Accuracy : 0.9525
## 95% CI : (0.9317, 0.9685)
## No Information Rate : 0.703
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8821
##
## Mcnemar's Test P-Value : 0.0001186
##
## Sensitivity : 0.8580
## Specificity : 0.9925
## Pos Pred Value : 0.9797
## Neg Pred Value : 0.9430
## Prevalence : 0.2970
## Detection Rate : 0.2548
## Detection Prevalence : 0.2601
## Balanced Accuracy : 0.9252
##
## 'Positive' Class : 0
##
# Convert confusion matrix to data frame for plotting
cm <- as.data.frame(as.table(conf_matrix))
ggplot(data = cm, aes(x = Reference, y = Prediction)) +
geom_tile(aes(fill = log(Freq)), colour = "white") +
geom_text(aes(label = sprintf("%1.0f", Freq)), vjust = 1) +
scale_fill_gradient(low = "yellow", high = "turquoise") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1)) +
labs(fill = "Log(Frequency)")
# Generate ROC curve
roc_curve <- roc(y_test, as.numeric(as.character(y_pred_test)))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot ROC curve
plot(roc_curve, main="ROC Curve")
# Calculate roc_auc
roc_auc <- as.numeric(roc_curve$auc)
cat("ROC AUC:", roc_auc, "\n")
## ROC AUC: 0.6926849
# Create a confusion matrix
conf_matrix <- confusionMatrix(as.factor(y_pred_test), as.factor(y_test))
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 74 21
## 1 95 379
##
## Accuracy : 0.7961
## 95% CI : (0.7607, 0.8285)
## No Information Rate : 0.703
## P-Value [Acc > NIR] : 3.086e-07
##
## Kappa : 0.4411
##
## Mcnemar's Test P-Value : 1.220e-11
##
## Sensitivity : 0.4379
## Specificity : 0.9475
## Pos Pred Value : 0.7789
## Neg Pred Value : 0.7996
## Prevalence : 0.2970
## Detection Rate : 0.1301
## Detection Prevalence : 0.1670
## Balanced Accuracy : 0.6927
##
## 'Positive' Class : 0
##
# Convert confusion matrix to data frame for plotting
cm <- as.data.frame(as.table(conf_matrix))
ggplot(data = cm, aes(x = Reference, y = Prediction)) +
geom_tile(aes(fill = log(Freq)), colour = "white") +
geom_text(aes(label = sprintf("%1.0f", Freq)), vjust = 1) +
scale_fill_gradient(low = "yellow", high = "turquoise") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1)) +
labs(fill = "Log(Frequency)")
# Generate ROC curve
roc_curve <- roc(y_test, as.numeric(as.character(test_pred_class)))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot ROC curve
plot(roc_curve, main="ROC Curve")
# Calculate roc_auc
roc_auc <- as.numeric(roc_curve$auc)
cat("ROC AUC:", roc_auc, "\n")
## ROC AUC: 0.720355
# Create a confusion matrix
conf_matrix <- confusionMatrix(as.factor(test_pred_class), as.factor(y_test))
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 88 32
## 1 81 368
##
## Accuracy : 0.8014
## 95% CI : (0.7662, 0.8334)
## No Information Rate : 0.703
## P-Value [Acc > NIR] : 6.313e-08
##
## Kappa : 0.481
##
## Mcnemar's Test P-Value : 6.318e-06
##
## Sensitivity : 0.5207
## Specificity : 0.9200
## Pos Pred Value : 0.7333
## Neg Pred Value : 0.8196
## Prevalence : 0.2970
## Detection Rate : 0.1547
## Detection Prevalence : 0.2109
## Balanced Accuracy : 0.7204
##
## 'Positive' Class : 0
##
# Convert confusion matrix to data frame for plotting
cm <- as.data.frame(as.table(conf_matrix))
ggplot(data = cm, aes(x = Reference, y = Prediction)) +
geom_tile(aes(fill = log(Freq)), colour = "white") +
geom_text(aes(label = sprintf("%1.0f", Freq)), vjust = 1) +
scale_fill_gradient(low = "yellow", high = "turquoise") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1)) +
labs(fill = "Log(Frequency)")
all_models_df <- data.frame(
  model_name = c('Random Forest', 'Ada Boost', 'Logistic Regression'),
  Train_Accuracy = c(rf_accu_Train, ada_accu_Train, train_accuracy),
  Test_Accuracy = c(rf_accu_Test, ada_accu_Test, test_accuracy)
)
all_models_df
## model_name Train_Accuracy Test_Accuracy
## 1 Random Forest 1.0000000 0.7961336
## 2 Ada Boost 0.9382065 0.9525483
## 3 Logistic Regression 0.8003014 0.8014060
ggp_train <- ggplot(all_models_df, aes(x = model_name, y = Train_Accuracy, fill = model_name)) +
geom_bar(stat = "identity") +
ggtitle("Models Vs Train Accuracies") +
coord_flip() +
theme_minimal()
ggp_train
ggp_test <- ggplot(all_models_df, aes(x = model_name, y = Test_Accuracy, fill = model_name)) +
geom_bar(stat = "identity") +
ggtitle("Models Vs Test Accuracies") +
coord_flip() +
theme_minimal()
ggp_test
Interpretation of the Model Performance:
Ada Boost > Random Forest > Logistic Regression
For deployment, the trained AdaBoost model is exported.
# Export AdaBoost model
saveRDS(AdaBoost_train, file = "AdaBoost_model.rds")
# Print a message indicating successful export
cat('AdaBoost model exported successfully.\n')
## AdaBoost model exported successfully.
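On the deployment side (for instance inside the Shiny app linked below), the exported model can be loaded back and used for scoring; a sketch, where new_customers stands for data that has been preprocessed in the same way as the training features:
# Sketch: reload the exported AdaBoost model and score new (preprocessed) customer data
AdaBoost_model <- readRDS("AdaBoost_model.rds")
# new_customers is assumed to contain the same scaled predictor columns as x_train_scaled
# predictions <- predict(AdaBoost_model, as.data.frame(new_customers))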
The ShinyApp link: https://spniqe-jie0ying-tan.shinyapps.io/downloads/