This project is created by Vijayendher Gatla, K Sai Karthikeya, and Khizar Mohamed Zubair Sait.
In today’s digital era, maximizing
Return On Investment (ROI) in digital advertising is
paramount for businesses to thrive in a competitive and cost-oriented
market.
Leveraging data science techniques, this project analyzes advertisement performance using a comprehensive marketing dataset encompassing acquisition cost, ROI, customer segment, posting date, target audience age, and more. After meticulous preprocessing, several models are trained, including SVR for regression and ARIMA for time series analysis, to identify top-performing advertisements across different platforms.
These insights enable publishers to strategically allocate resources, ensuring optimal ROI and maximizing profitability in digital advertising endeavors.
library(dplyr) # Used to manipulate data in R
library(readr) # Used to read CSV files
library(DataExplorer) # Used to explore data
library(DT) # Used to display tables in R Markdown
library(ggplot2) # Used to display different plots and graphs
library(heatmaply) # Used for heatmaps
library(kableExtra) # Used to style tables in R Markdown
library(tidyr) # Used to separate and reshape columns
library(caret) # Used for regression model handling
library(e1071) # Used for building support vector models
library(cluster) # Used for building clustering models
library(forecast) # Used for time series analysis
The dataset used in this project is the Marketing Campaign Dataset, downloaded from Kaggle. It consists of 200,000 rows and 16 columns. Here's a view of the data:
data <- read.csv("C:/Users/vijay/Downloads/marketing_campaign_dataset.csv", stringsAsFactors = FALSE)
DT::datatable(head(data,5),
rownames = FALSE,
options=list(
scrollX=TRUE
))
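The DataExplorer package loaded above can also give a quick automated overview of the raw data. This is a minimal sketch, assuming the same data object read above; it is not part of the original preprocessing:

# Hedged sketch: automated structure and missingness overview with DataExplorer
plot_intro(data)   # proportions of discrete/continuous columns and missing values
plot_missing(data) # per-column missing-value profile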
Here’s a tabular summary of the different columns in the dataset:
data <- read.csv("C:/Users/vijay/Downloads/marketing_campaign_dataset.csv", stringsAsFactors = FALSE)
summary_table <- summary(data)
kable(summary_table, format = "markdown") %>%
kable_styling() %>%
scroll_box(width = "100%")
| Campaign_ID | Company | Campaign_Type | Target_Audience | Duration | Channel_Used | Conversion_Rate | Acquisition_Cost | ROI | Location | Language | Clicks | Impressions | Engagement_Score | Customer_Segment | Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1 | Length:200000 | Length:200000 | Length:200000 | Length:200000 | Length:200000 | Min. :0.01000 | Length:200000 | Min. :2.000 | Length:200000 | Length:200000 | Min. : 100.0 | Min. : 1000 | Min. : 1.000 | Length:200000 | Length:200000 |
| 1st Qu.: 50001 | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.:0.05000 | Class :character | 1st Qu.:3.500 | Class :character | Class :character | 1st Qu.: 325.0 | 1st Qu.: 3266 | 1st Qu.: 3.000 | Class :character | Class :character |
| Median :100001 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median :0.08000 | Mode :character | Median :5.010 | Mode :character | Mode :character | Median : 550.0 | Median : 5518 | Median : 5.000 | Mode :character | Mode :character |
| Mean :100001 | NA | NA | NA | NA | NA | Mean :0.08007 | NA | Mean :5.002 | NA | NA | Mean : 549.8 | Mean : 5507 | Mean : 5.495 | NA | NA |
| 3rd Qu.:150000 | NA | NA | NA | NA | NA | 3rd Qu.:0.12000 | NA | 3rd Qu.:6.510 | NA | NA | 3rd Qu.: 775.0 | 3rd Qu.: 7753 | 3rd Qu.: 8.000 | NA | NA |
| Max. :200000 | NA | NA | NA | NA | NA | Max. :0.15000 | NA | Max. :8.000 | NA | NA | Max. :1000.0 | Max. :10000 | Max. :10.000 | NA | NA |
Two columns, Target_Audience_Age_Min and Target_Audience_Age_Max, are derived from the Target_Audience attribute.
Some rows of Target_Audience contain the value All Ages, which is first replaced by the placeholder All a-b.
Here’s the modified dataset:
#Duplicating the data
data <- read.csv("C:/Users/vijay/Downloads/marketing_campaign_dataset.csv", stringsAsFactors = FALSE)
data_new <-data #make a copy of input data
data_new$Target_Audience[data_new$Target_Audience=="All Ages"]<-"All a-b"
DT::datatable(head(data_new, 5),
rownames = FALSE,
options = list(
pageLength = 10,
scrollX=TRUE))
The placeholder a is then replaced with the minimum observed age in Target_Audience_Age_Min, and b with the maximum observed age in Target_Audience_Age_Max.
Here’s the modified dataset:
data_new <- data_new %>%
separate(Target_Audience, into = c("Target_Audience_Gender", "Target_Audience_Age"), sep = " ")
data_new <- data_new %>%
separate(Target_Audience_Age, into = c("Target_Audience_Age_Min", "Target_Audience_Age_Max"), sep = "-")
#replacing a
min_age <- min(data_new$Target_Audience_Age_Min[data_new$Target_Audience_Age_Min != "a"], na.rm = TRUE)
# Replace "a" with the minimum value
data_new$Target_Audience_Age_Min[data_new$Target_Audience_Age_Min == "a"] <- min_age
#replacing b
max_age <- max(data_new$Target_Audience_Age_Max[data_new$Target_Audience_Age_Max != "b"], na.rm = TRUE)
# Replace "a" with the minimum value
data_new$Target_Audience_Age_Max[data_new$Target_Audience_Age_Max == "b"] <- max_age
DT::datatable(head(data_new, 5),
rownames = FALSE,
options = list(
pageLength = 10, scrollX=TRUE))
Acquisition_Cost contains a dollar sign ($), which is not helpful for analysis and is removed with the help of regular expressions.
Duration is a string of the form 30 days; the days suffix is removed and the column is renamed Duration_In_Days.
data_new <- data_new %>%
mutate(Duration_In_Days = as.numeric(sub(" days", "", Duration)))
# Drop the original Duration column
data_new <- data_new %>%
select(-Duration)
non_numeric_values <- data_new$Acquisition_Cost[!grepl("^\\$?[0-9,.]*$", data_new$Acquisition_Cost, perl = TRUE)]
# If there are non-numeric values, print them
if (length(non_numeric_values) > 0) {
print(paste("Non-numeric values in Acquisition_Cost column:", unique(non_numeric_values)))
}
# Modify the Acquisition_Cost column
data_new <- data_new %>%
mutate(Acquisition_Cost = as.numeric(gsub("[^0-9.]", "", Acquisition_Cost)))
# Print the modified data
DT::datatable(head(data_new, 10),
rownames = FALSE,
options = list(
pageLength = 10,
scrollX=TRUE))
Additional columns, Target_Audience_Age_Mean, Cost Per Click (CPC), Click Through Rate (CTR), Cost Per Acquisition (CPA), and Cost Per Impression (CPI), are derived from existing data columns using simple formulas and added as new columns in the dataset.
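The derivations, read directly from the mutate() call below, are:

\[
\begin{aligned}
\text{Target\_Audience\_Age\_Mean} &= \frac{\text{Target\_Audience\_Age\_Min} + \text{Target\_Audience\_Age\_Max}}{2}\\[2pt]
\text{CPC} &= \frac{\text{Acquisition\_Cost}}{\text{Clicks}}, \qquad
\text{CTR} = \frac{\text{Clicks}}{\text{Impressions}} \times 100\\[2pt]
\text{CPA} &= \frac{\text{Acquisition\_Cost}}{\text{Conversion\_Rate} \times \text{Impressions}}, \qquad
\text{CPI} = \frac{\text{Acquisition\_Cost}}{\text{Impressions}}
\end{aligned}
\]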
The modified data is now exported to a new csv file.
#Reading the Dataset
data <- read.csv("C:/Users/vijay/Desktop/PMA/final_data.csv", stringsAsFactors = FALSE)
# Calculate CPC, CTR, CPA, and CPI
data <- data %>%
mutate(
# CPC (Cost Per Click)
Target_Audience_Age_Mean=(Target_Audience_Age_Min+Target_Audience_Age_Max)/2,
CPC = Acquisition_Cost / Clicks,
# CTR (Click Through Rate)
CTR = (Clicks / Impressions) * 100,
# CPA (Cost Per Acquisition)
CPA = Acquisition_Cost / (Conversion_Rate * Impressions),
# CPI (Cost Per Impression)
CPI = Acquisition_Cost / Impressions
)
DT::datatable(head(data, 5), rownames = FALSE, options = list(pageLength = 10,scrollX=TRUE))
write.csv(data, "final_modified_data.csv",row.names=FALSE)
Clicks and Impressions have a weak positive
correlation (0.2). This means that a higher number of impressions is
associated with only a small increase in clicks.
Cost per Click (CPC) and Acquisition Cost
have a moderate positive correlation (0.5). This means that an increase
in CPC is associated with an increase in
Acquisition Cost.
ROI (Return on Investment) has a weak negative
correlation with both Clicks (-0.2) and
Impressions (-0.3). This means that there might be a small
decrease in ROI as the number of clicks or impressions increases.
It is important to note that correlation does not imply causation. There could be other factors influencing these metrics.
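These correlation values refer to the numeric columns of the engineered dataset. A minimal sketch of how such a correlation heatmap can be produced with the already-loaded heatmaply package is shown below; the exact column selection is an assumption:

# Hedged sketch: correlation heatmap over selected numeric campaign metrics
numeric_cols <- c("Clicks", "Impressions", "ROI", "Conversion_Rate",
                  "Acquisition_Cost", "CPC", "CTR", "CPA", "CPI")
cor_matrix <- cor(data[, numeric_cols], use = "complete.obs")
heatmaply(cor_matrix) # interactive correlation heatmap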
To understand the dataset thoroughly, different plots are drawn so that inferences can be made.
The plots are created using the ggplot2 package.
The channel with the highest CPC is Facebook, followed by Google Ads and then Email; Facebook's CPC density is shifted furthest toward higher values.
There is a cluster of channels with a lower CPC around 0.01, which includes Email, Instagram and YouTube, so clicks on these channels are more likely to come at a CPC of around 0.01.
marketing_data <- read.csv("C:/Users/vijay/Desktop/PMA/final_modified_data.csv", stringsAsFactors = FALSE)
# Subset the first 1000 records
subset_data <- marketing_data[1:1000, ]
# Create a density plot for CPC vs Channel Used
ggplot(subset_data, aes(x = CPC, fill = Channel_Used)) +
geom_density(alpha = 0.5) +
labs(title = "Density Plot of CPC by Channel Used") +
facet_wrap(~Channel_Used, scales = "free") +
theme_minimal()
There is a higher density of clicks for Email and Facebook at a lower log(CPC) compared to Google Ads, YouTube, Instagram and Website.
This means that for these channels, it is more likely to get clicks at a lower cost per click.
The density is spread out more for Facebook and Google Ads compared to the other channels. This suggests that there is a wider range of CPCs for clicks on these channels.
# Apply log transformation to CPC
subset_data$log_CPC <- log(subset_data$CPC + 1)
# Create a density plot for log(CPC) vs Channel Used
ggplot(subset_data, aes(x = log_CPC, fill = Channel_Used)) +
geom_density(alpha = 0.5) +
labs(title = "Density Plot of log(CPC) by Channel Used") +
facet_wrap(~Channel_Used, scales = "free") +
theme_minimal()
Email appears to have the highest CTR overall, with a
cluster of density around 0.04.
Facebook and Google Ads also show a cluster of density around 0.04, but it appears to be lower than Email.
The remaining channels, YouTube, Instagram and Website, all show a
lower density of CTR spread across a wider range of values.
This suggests that CTR for these channels is less
concentrated and more likely to vary.
# Create a density plot for CTR vs Channel Used
ggplot(subset_data, aes(x = CTR, fill = Channel_Used)) +
geom_density(alpha = 0.5) +
labs(title = "Density Plot of CTR by Channel Used") +
facet_wrap(~Channel_Used, scales = "free") +
theme_minimal()
Email and Facebook show a higher density at lower log(CTR) values compared to Google Ads, YouTube, Instagram and Website, meaning their click-through rates are concentrated at the lower end of the range.
The density is more spread out for Facebook and Google Ads compared to the other channels. This suggests that there is a wider range of click-through rates on these channels.
# Apply log transformation to CTR
subset_data$log_CTR <- log(subset_data$CTR + 1)
# Create a density plot for log(CTR) vs Channel Used
ggplot(subset_data, aes(x = log_CTR, fill = Channel_Used)) +
geom_density(alpha = 0.5) +
labs(title = "Density Plot of log(CTR) by Channel Used") +
facet_wrap(~Channel_Used, scales = "free") +
theme_minimal()
The channel with the lowest CPA is Email, followed by Facebook and then YouTube; the CPA density for Email is concentrated at the lowest values.
There is a cluster of channels with a higher CPA around 0.005, which includes Google Ads, Instagram and Website, so acquisitions on these channels tend to come at a CPA of around 0.005.
# Create a density plot for CPA vs Channel Used
ggplot(subset_data, aes(x = CPA, fill = Channel_Used)) +
geom_density(alpha = 0.5) +
labs(title = "Density Plot of CPA by Channel Used") +
facet_wrap(~Channel_Used, scales = "free") +
theme_minimal()
There is a higher density for Email and Facebook at lower log(CPA) values compared to Google Ads, YouTube, Instagram and Website.
This means that for these channels, acquisitions tend to come at a lower cost per acquisition.
The density is spread out more for Facebook and Google Ads compared to the other channels.
# Apply log transformation to CPA
subset_data$log_CPA <- log(subset_data$CPA + 1)
# Create a density plot for log(CPA) vs Channel Used
ggplot(subset_data, aes(x = log_CPA, fill = Channel_Used)) +
geom_density(alpha = 0.5) +
labs(title = "Density Plot of log(CPA) by Channel Used") +
facet_wrap(~Channel_Used, scales = "free") +
theme_minimal()
The channels seem to target audiences with a mean age between 25 and 40 years old.
Facebook and YouTube appear to target a slightly younger audience, with a higher density around 25 years old.
Email and Instagram appear to target a slightly older audience with a higher density around 30 years old.
# Calculate mean age by channel used
mean_age_by_channel <- aggregate(Target_Audience_Age_Mean ~ Channel_Used, data = subset_data, FUN = mean)
# Create a density plot for mean age distribution vs. Channel Used
ggplot(subset_data, aes(x = Target_Audience_Age_Mean, fill = Channel_Used)) +
geom_density(alpha = 0.5) +
labs(title = "Density Plot of Mean Age by Channel Used") +
facet_wrap(~Channel_Used, scales = "free") +
theme_minimal()
Younger age groups (18-24) tend to have a higher CTR across all social media platforms compared to older age groups. This suggests that younger users are more likely to click on ads compared to older users.
Instagram appears to have the highest CTR for almost all age groups. This means that users are more likely to click on ads on Instagram compared to other platforms.
Facebook seems to have a higher CTR for younger age groups (18-24) compared to Email and Google Ads. However, for older age groups, Email and Google Ads appear to have a similar or slightly higher CTR than Facebook.
The CTR for all social media platforms appears to decrease as the age group increases. This suggests that targeting younger audiences might be more effective for campaigns that rely on clicks, as the summary sketch below illustrates.
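The claims above can be checked numerically. This is a minimal sketch, assuming the engineered columns from final_modified_data.csv; the age brackets are illustrative and not part of the original analysis:

# Hedged sketch: mean CTR by (illustrative) age bracket and channel
marketing_data <- read.csv("C:/Users/vijay/Desktop/PMA/final_modified_data.csv", stringsAsFactors = FALSE)
ctr_by_age <- marketing_data %>%
  mutate(Age_Group = cut(Target_Audience_Age_Mean,
                         breaks = c(17, 24, 34, 44, 55),
                         labels = c("18-24", "25-34", "35-44", "45-55"))) %>%
  group_by(Channel_Used, Age_Group) %>%
  summarize(Mean_CTR = mean(CTR, na.rm = TRUE), .groups = "drop")
DT::datatable(ctr_by_age, rownames = FALSE, options = list(scrollX = TRUE))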
# Plot the age distribution for each channel used
marketing_data <- read.csv("C:/Users/vijay/Desktop/PMA/final_modified_data.csv", stringsAsFactors = FALSE)
ggplot(marketing_data, aes(x = Target_Audience_Age_Mean, fill = Channel_Used)) +
geom_histogram(binwidth = 5, alpha = 0.5, position = "identity") +
labs(title = "Age Distribution for Different Social Media Platforms by CTR", x = "Age", y = "Frequency") +
facet_wrap(~Channel_Used, scales = "free", drop = TRUE) +
theme_minimal()
There are four customer segments: Fashionistas, Foodies, Health & Wellness, and Outdoor Adventurers.
Each customer segment seems to have a different age distribution across social media platforms.
It appears that Facebook and Instagram tend to have a wider reach across all age groups for most customer segments.
There is a trend where some platforms are more popular with younger age groups (20-30) while others are more popular with older age groups (30-40).
ggplot(marketing_data, aes(x = Target_Audience_Age_Mean, fill = Customer_Segment)) +
geom_histogram(binwidth = 5, alpha = 0.5, position = "identity") +
labs(title = "Age Distribution for Different Social Media Platforms by CTR", x = "Age", y = "Frequency") +
facet_wrap(~Customer_Segment, scales = "free", drop = TRUE) +
theme_minimal()
A regression model lets us numerically predict a target value; here the target is ROI, the most important metric for evaluating the performance of an advertisement.
Linear regression is a very simple model that fits the data with a straight line. As the actual-versus-predicted graph shows, it does not perform well here.
The poor performance can be attributed to the non-linearity of, and the dependencies among, the variables in the dataset.
# Load the data
marketing_data <- read.csv("C:/Users/vijay/Desktop/PMA/final_data.csv", stringsAsFactors = FALSE)
# Remove 'Date' column
marketing_data <- subset(marketing_data, select = -c(Date, Campaign_ID, Company, Clicks, Engagement_Score))
# Take the first 1000 records
subset_data <- head(marketing_data, 1000)
# Set seed for reproducibility
set.seed(123)
# Split data into train and test sets
trainIndex <- createDataPartition(subset_data$ROI, p = 0.8, list = FALSE)
train_data <- subset_data[trainIndex, ]
test_data <- subset_data[-trainIndex, ]
# Train regression model
conversion_model <- train(ROI ~ ., data = train_data, method = "lm")
# Predict on test set
predictions <- predict(conversion_model, newdata = test_data)
# Create a data frame for actual vs. predicted values
results <- data.frame(Actual = test_data$ROI, Predicted = predictions)
# Plot actual vs. predicted graph
ggplot(results, aes(x = Actual, y = Predicted)) +
geom_point(color = "blue", alpha = 0.5) +
geom_abline(intercept = 0, slope = 1, color = "red", linewidth = 1.5) +
labs(title = "Actual vs. Predicted ROI",
x = "Actual ROI", y = "Predicted ROI") +
theme_minimal()
# Calculate R-squared
rsquared <- cor(results$Actual, results$Predicted)^2
# Calculate MSE
mse <- mean((results$Actual - results$Predicted)^2)
# Calculate RMSE
rmse <- sqrt(mse)
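The code that prints these metrics is not shown in the original chunk; a minimal sketch consistent with the output below (mirroring the cat() block used later for the SVR model) would be:

# Hedged sketch: print the linear-regression metrics computed above
{
  cat("R-squared:", rsquared, "\n")
  cat("MSE:", mse, "\n")
  cat("RMSE:", rmse, "\n")
}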
There is only a very weak positive relationship between the actual ROI and the predicted ROI: the near-zero R-squared reported below shows that campaigns with a higher predicted ROI are barely more likely to have a higher actual ROI.
However, the data points are widely scattered around the trend line, which indicates that the predictions are not accurate. Points below the trend line are campaigns whose predicted ROI was lower than the actual ROI, and points above the line are campaigns whose predicted ROI was higher than the actual ROI.
## R-squared: 0.0003636811
## MSE: 3.066079
## RMSE: 1.751022
A Support Vector Regression (SVR) model is then built as an alternative to the linear regression model.
The SVR model is built in two stages: first on a 1,000-record subset and then on the full dataset.
# Load your dataset
data <- read.csv("C:/Users/vijay/Desktop/PMA/new code/final_data.csv",nrows=1000)
# Select relevant numerical columns
relevant_columns <- c("Acquisition_Cost", "ROI", "Clicks", "Impressions", "Engagement_Score", "Duration_In_Days")
# Subset the data with relevant columns
data_subset <- data[, relevant_columns]
# Set seed for reproducibility
set.seed(123)
# Split data into training and test sets (80% training, 20% test)
train_index <- sample(1:nrow(data_subset), 0.8 * nrow(data_subset))
train_data <- data_subset[train_index, ]
test_data <- data_subset[-train_index, ]
# Split predictors and target variable for training and test sets
X_train <- train_data[, -1] # All columns except the first (Acquisition_Cost)
Y_train <- train_data$ROI
X_test <- test_data[, -1]
Y_test <- test_data$ROI
# Scale the predictors for training and test sets
X_train_scaled <- scale(X_train)
X_test_scaled <- scale(X_test)
# Train SVR model on training data
svr_model <- svm(Y_train ~ ., data = as.data.frame(X_train_scaled), kernel = "radial")
# Predict ROI for test data
predicted_ROI <- predict(svr_model, X_test_scaled)
# Calculate RMSE for test data
rmse_test <- sqrt(mean((Y_test - predicted_ROI)^2))
# Calculate R-squared for test data
rsquared_test <- cor(Y_test, predicted_ROI)^2
# Calculate MSE for test data
mse_test <- mean((Y_test - predicted_ROI)^2)
# Create a data frame with actual and predicted ROI values
plot_data <- data.frame(Actual_ROI = Y_test, Predicted_ROI = predicted_ROI)
# Plot predicted vs. actual graph using ggplot
ggplot(plot_data, aes(x = Actual_ROI, y = Predicted_ROI)) +
geom_point(color = "blue", alpha = 0.4) +
geom_abline(intercept = 0, slope = 1, color = "red", linewidth = 1.5) +
labs(title = "Predicted vs. Actual ROI (Test Data)",
x = "Actual ROI", y = "Predicted ROI") +
theme_minimal()
summary(svr_model)
##
## Call:
## svm(formula = Y_train ~ ., data = as.data.frame(X_train_scaled),
## kernel = "radial")
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.2
## epsilon: 0.1
##
##
## Number of Support Vectors: 112
There are more data points above the diagonal line than below it, which means the model tends to slightly overestimate the ROI for most campaigns.
There is a cluster of data points in the bottom left corner of the graph, where both the predicted and actual ROI are low.
There are also a few data points scattered in the top right corner of the graph, where both the predicted and actual ROI are high.
## Test Data Metrics:
## RMSE: 0.1603113
## R-squared: 0.9914833
## MSE: 0.0256997
data <- read.csv("C:/Users/vijay/Desktop/PMA/new code/final_data.csv")
# Select relevant numerical columns
relevant_columns <- c("Acquisition_Cost", "ROI", "Clicks", "Impressions", "Engagement_Score", "Duration_In_Days")
# Subset the data with relevant columns
data_subset <- data[, relevant_columns]
# Set seed for reproducibility
set.seed(123)
# Split data into training and test sets (80% training, 20% test)
train_index <- sample(1:nrow(data_subset), 0.8 * nrow(data_subset))
train_data <- data_subset[train_index, ]
test_data <- data_subset[-train_index, ]
# Split predictors and target variable for training and test sets
X_train <- train_data[, -1] # All columns except the first (Acquisition_Cost)
Y_train <- train_data$ROI
X_test <- test_data[, -1]
Y_test <- test_data$ROI
# Scale the predictors for training and test sets
X_train_scaled <- scale(X_train)
X_test_scaled <- scale(X_test)
# Train SVR model on training data
svr_model <- svm(Y_train ~ ., data = as.data.frame(X_train_scaled), kernel = "radial")
# Predict ROI for test data
predicted_ROI <- predict(svr_model, X_test_scaled)
# Calculate RMSE for test data
rmse_test <- sqrt(mean((Y_test - predicted_ROI)^2))
# Calculate R-squared for test data
rsquared_test <- cor(Y_test, predicted_ROI)^2
# Calculate MSE for test data
mse_test <- mean((Y_test - predicted_ROI)^2)
# Create a data frame with actual and predicted ROI values
plot_data <- data.frame(Actual_ROI = Y_test, Predicted_ROI = predicted_ROI)
# Plot predicted vs. actual graph using ggplot
ggplot(plot_data, aes(x = Actual_ROI, y = Predicted_ROI)) +
geom_point(color = "blue", alpha = 0.2) +
geom_abline(intercept = 0, slope = 1, color = "red", linewidth = 1.5) +
labs(title = "Predicted vs. Actual ROI (Test Data)",
x = "Actual ROI", y = "Predicted ROI") +
theme_minimal()
summary(svr_model)
##
## Call:
## svm(formula = Y_train ~ ., data = as.data.frame(X_train_scaled),
## kernel = "radial")
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.2
## epsilon: 0.1
##
##
## Number of Support Vectors: 231
There are more data points above the diagonal line than below it, which means the model tends to slightly overestimate the ROI for most campaigns.
There is a cluster of data points in the bottom left corner of the
graph, where both the predicted and actual ROI
are low.
There are also a few data points scattered in the top right corner of
the graph, where both the predicted and actual ROI are
high.
# Print the performance metrics for the test data
{
cat("Test Data Metrics:\n")
cat("RMSE:", rmse_test, "\n")
cat("R-squared:", rsquared_test, "\n")
cat("MSE:", mse_test, "\n")
}
## Test Data Metrics:
## RMSE: 0.07173009
## R-squared: 0.9986862
## MSE: 0.005145205
We use k-means clustering, based on Clicks and Impressions.
The idea is that when a user wants to publish an ad on a specific platform, they set an impressions goal, i.e., the number of impressions (views) they want to achieve within a particular period of time.
data <- read.csv("C:/Users/vijay/Desktop/PMA/final_data.csv", stringsAsFactors = FALSE)
# Selecting the input features
input_features <- c("Campaign_Type", "Target_Audience_Gender", "Target_Audience_Age_Min",
"Target_Audience_Age_Max", "Channel_Used", "Location", "Language",
"Clicks", "Impressions", "Customer_Segment", "Duration_In_Days")
# Subsetting the dataset with selected features
data_subset <- data[input_features]
# Convert all variables to numeric
data_subset <- apply(data_subset, 2, as.numeric)
# Handling missing values by imputing with mean
data_subset[is.na(data_subset)] <- mean(data_subset, na.rm = TRUE)
# Sampling a subset of the data (adjust the size as needed)
set.seed(123)
sampled_indices <- sample(nrow(data_subset), 1000)
data_subset <- data_subset[sampled_indices, ]
# Performing k-means clustering
k <- 5
set.seed(123)
kmeans_model <- kmeans(data_subset, centers = k)
# Plotting the clusters
plot(data_subset[, c("Clicks", "Impressions")], col = kmeans_model$cluster,
main = "K-means Clustering of Marketing Data (Audience Engagement)",
xlab = "Clicks", ylab = "Impressions")
points(kmeans_model$centers[, c("Clicks", "Impressions")], col = 1:k, pch = 19, cex = 2, bg = "white")
legend("topright", legend = paste("Zone", 1:k), col = 1:k, pch = 19, cex = 1, title = "Aud_Eng")
## Calculate silhouette score
silhouette_score <- silhouette(kmeans_model$cluster, dist(data_subset))
cat("Silhouette Score:", mean(silhouette_score[, "sil_width"]), "\n")
## Silhouette Score: 0.5112601
This model indicates roughly how many clicks are required to achieve a given number of impressions: the higher the clicks, the higher the number of impressions or views. The clusters represent levels of audience engagement, and we divide them into zones; a zone tells the client which engagement category their ad targeting falls under.
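The choice of k = 5 is an assumption. This is a minimal sketch, not part of the original analysis, that compares average silhouette widths for a few candidate values of k on the same sampled data:

# Hedged sketch: compare average silhouette width for several candidate k values
for (k_try in 2:6) {
  set.seed(123)
  km <- kmeans(data_subset, centers = k_try)
  sil <- silhouette(km$cluster, dist(data_subset))
  cat("k =", k_try, "average silhouette width:", mean(sil[, "sil_width"]), "\n")
}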
The residuals for Facebook and YouTube appear to be more volatile than those for the other channels. This suggests that there may be more unpredictable factors affecting these campaigns.
The residuals for Instagram and Website seem to be more stable over time. This suggests that the ARIMA model may be fitting these channels better.
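The error metric reported for each channel below is the mean absolute error (MAE) of the in-sample residuals:

\[
\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left| y_t - \hat{y}_t \right|
\]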
# Read the dataset
marketing_data <- read.csv("C:/Users/vijay/Desktop/PMA/final_modified_data.csv", stringsAsFactors = FALSE)
# Convert 'Date' column to Date format
marketing_data$Date <- as.Date(marketing_data$Date, format = "%d-%m-%Y")
# Take the first 10,000 records
#sampled_data <- head(marketing_data, 100000)
# Subset data for each unique channel
channels <- unique(marketing_data$Channel_Used)
# Create a list to store plots for each channel
plots_list <- list()
# Create a list to store MAE for each channel
mae_list <- list()
# Perform ARIMA analysis and plot for each channel
for (channel in channels) {
channel_data <- filter(marketing_data, Channel_Used == channel)
# Aggregate data to monthly level and calculate average ROI
monthly_avg_roi <- channel_data %>%
group_by(month = format(Date, "%Y-%m")) %>%
summarize(avg_roi = mean(ROI, na.rm = TRUE))
# ARIMA analysis
arima_model <- auto.arima(monthly_avg_roi$avg_roi)
# Calculate MAE
arima_accuracy <- accuracy(arima_model)
mae <- mean(abs(arima_accuracy[, "MAE"]))
mae_list[[channel]] <- mae
# Plot monthly average ROI against date and store the plot in the list
plots_list[[channel]] <- autoplot(arima_model$residuals) +
labs(x = "Date", y = "Residuals", title = paste("ARIMA Residuals Plot for", channel))
}
# Combine plots separately for each channel using facet_wrap
plots_combined <- do.call(gridExtra::grid.arrange, c(plots_list, ncol = 2))
# Print the MAE for each channel
for (channel in channels) {
cat("MAE for", channel, ":", mae_list[[channel]], "\n")
}
## MAE for Google Ads : 0.02502657
## MAE for YouTube : 0.02329913
## MAE for Instagram : 0.03148257
## MAE for Website : 0.02331701
## MAE for Facebook : 0.03417161
## MAE for Email : 0.02325383
The residuals for Health & Wellness and Tech Enthusiasts seem to be more stable than those for Fashionistas and Outdoor Adventurers.
This suggests that the ARIMA model may be fitting the data for Health & Wellness and Tech Enthusiasts better.
There is a positive spike in the residuals around period 8 for Outdoor Adventurers. This could indicate an unexpected event that increased the average ROI for that segment at the time.
# Read the dataset
marketing_data <- read.csv("C:/Users/vijay/Desktop/PMA/final_modified_data.csv", stringsAsFactors = FALSE)
# Convert 'Date' column to Date format
marketing_data$Date <- as.Date(marketing_data$Date, format = "%d-%m-%Y")
# Take a random subset of 10,000 records
#sampled_data <- head(marketing_data, 10000)
# Subset data for each unique segment
segments <- unique(marketing_data$Customer_Segment)
# Create a list to store plots for each segment
plots_list <- list()
# Create a list to store MAE for each segment
mae_list <- list()
# Perform ARIMA analysis and plot for each segment
for (segment in segments) {
segment_data <- filter(marketing_data, Customer_Segment == segment)
# Aggregate data to monthly level and calculate average ROI
monthly_avg_roi <- segment_data %>%
group_by(month = format(Date, "%Y-%m")) %>%
summarize(avg_roi = mean(ROI, na.rm = TRUE))
# ARIMA analysis
arima_model <- auto.arima(monthly_avg_roi$avg_roi)
# Calculate MAE
arima_accuracy <- accuracy(arima_model)
mae <- mean(abs(arima_accuracy[, "MAE"]))
mae_list[[segment]] <- mae
# Plot monthly average ROI against date and store the plot in the list
plots_list[[segment]] <- autoplot(arima_model$residuals) +
labs(x = "Date", y = "Residuals", title = paste("ARIMA Residuals Plot for", segment))
}
# Combine plots separately for each segment using facet_wrap
plots_combined <- do.call(gridExtra::grid.arrange, c(plots_list, ncol = 2))
# Print the MAE for each segment
for (segment in segments) {
cat("MAE for", segment, ":", mae_list[[segment]], "\n")
}
## MAE for Health & Wellness : 0.01641105
## MAE for Fashionistas : 0.02314913
## MAE for Outdoor Adventurers : 0.01843864
## MAE for Foodies : 0.02775127
## MAE for Tech Enthusiasts : 0.02678632
The study underscores the pivotal role of data-driven insights in
crafting successful advertising strategies, emphasizing the significance
of metrics like Click-Through Rate (CTR),
Cost Per Click (CPC), and
Return on Investment (ROI). By employing various machine
learning models tailored to extract key information for advertisers, it
navigates through the intricate landscape of advertisement optimization.
Through regression models focused on ROI and time series analysis
elucidating the dynamic nature of advertisement effectiveness across
different durations and channels, the study
provides actionable intelligence for advertisers seeking to maximize
their returns.
In today’s evolving economic landscape, the study underscores the
imperative of leveraging data science for informed decision-making in
advertising. It delineates how discerning analysis of numerical
correlations and the deployment of advanced models can empower
advertisers to identify optimal social media channels and
strategic timing for ad placement, ultimately enhancing
ROI. For aspiring advertisers, this research serves as a compass,
guiding them towards efficacious advertising practices grounded in
empirical evidence and predictive analytics.