This project was created by \(Vijayendher\) \(Gatla\), \(K\) \(Sai\) \(Karthikeya\), and \(Khizar\) \(Mohamed\) \(Zubair\) \(Sait\).

About the Project

In today’s digital era, maximizing Return On Investment (ROI) in digital advertising is paramount for businesses to thrive in a competitive and cost-oriented market.

Leveraging data science techniques, this project analyzes \(Advertisement\) \(Performance\) using a comprehensive marketing dataset encompassing acquisition cost, ROI, customer segment, posting date, target audience age, and more. After meticulous preprocessing, several regression and time-series models are trained, including \(SVR\) and \(ARIMA\), to identify top-performing advertisements across different platforms.

These insights enable publishers to strategically allocate resources, ensuring optimal ROI and maximizing profitability in digital advertising endeavors.

Libraries Required

library(dplyr) # Used to manipulate data in R
library(readr) # Used to read CSV files
library(DataExplorer) # Used to explore data
library(DT) # Used to display tables in R Markdown
library(ggplot2) # Used to display different plots and graphs
library(heatmaply) # Used for heatmaps
library(kableExtra) # Used to style markdown tables
library(tidyr) # Used to separate and reshape columns
library(caret) # Used for regression model handling
library(e1071) # Used for building Support Vector Regression models
library(cluster) # Used for building clustering models
library(forecast) # Used for time series analysis

Dataset Description

The dataset used in this project is the \(Marketing\) \(Campaign\) \(Dataset\), downloaded from Kaggle.

The dataset consists of 200,000 rows and 16 columns. Here's a view of the data:

data <- read.csv("C:/Users/vijay/Downloads/marketing_campaign_dataset.csv", stringsAsFactors = FALSE)
DT::datatable(head(data,5),
              rownames = FALSE,
              options=list(
                scrollX=TRUE
              ))

Summary of the Data

Here’s the tabular representation of different metrics for the different columns in the dataset:

data <- read.csv("C:/Users/vijay/Downloads/marketing_campaign_dataset.csv", stringsAsFactors = FALSE)
summary_table <- summary(data)

kable(summary_table, format = "markdown") %>%
  kable_styling() %>%
  scroll_box(width = "100%")
Summary statistics of the numeric columns:

Statistic    Campaign_ID   Conversion_Rate     ROI     Clicks   Impressions   Engagement_Score
Min.                   1           0.01000   2.000      100.0          1000              1.000
1st Qu.            50001           0.05000   3.500      325.0          3266              3.000
Median            100001           0.08000   5.010      550.0          5518              5.000
Mean              100001           0.08007   5.002      549.8          5507              5.495
3rd Qu.           150000           0.12000   6.510      775.0          7753              8.000
Max.              200000           0.15000   8.000     1000.0         10000             10.000

The remaining columns (Company, Campaign_Type, Target_Audience, Duration, Channel_Used, Acquisition_Cost, Location, Language, Customer_Segment and Date) are character columns with 200,000 values each.

Preprocessing

Modifying Audience


Two columns, Target_Audience_Age_Min and Target_Audience_Age_Max, are derived from the Target_Audience attribute.

Some rows of Target_Audience contain the value All Ages, which is first replaced by All a-b.

Here’s the modified dataset:

#Duplicating the data
data <- read.csv("C:/Users/vijay/Downloads/marketing_campaign_dataset.csv", stringsAsFactors = FALSE)
data_new <- data # Make a copy of the input data
data_new$Target_Audience[data_new$Target_Audience == "All Ages"] <- "All a-b"

DT::datatable(head(data_new, 5),
              rownames = FALSE,
              options = list(
                pageLength = 10,
                 scrollX=TRUE))

The placeholder a is then replaced with the minimum observed age in Target_Audience_Age_Min, and b with the maximum observed age in Target_Audience_Age_Max.

Here’s the modified dataset:

data_new <- data_new %>%
  separate(Target_Audience, into = c("Target_Audience_Gender", "Target_Audience_Age"), sep = " ")

data_new <- data_new %>%
  separate(Target_Audience_Age, into = c("Target_Audience_Age_Min", "Target_Audience_Age_Max"), sep = "-")

#replacing a
min_age <- min(data_new$Target_Audience_Age_Min[data_new$Target_Audience_Age_Min != "a"], na.rm = TRUE)

# Replace "a" with the minimum value
data_new$Target_Audience_Age_Min[data_new$Target_Audience_Age_Min == "a"] <- min_age
#replacing b
max_age <- max(data_new$Target_Audience_Age_Max[data_new$Target_Audience_Age_Max != "b"], na.rm = TRUE)

# Replace "a" with the minimum value
data_new$Target_Audience_Age_Max[data_new$Target_Audience_Age_Max == "b"] <- max_age

DT::datatable(head(data_new, 5),
              rownames = FALSE,
              options = list(
                pageLength = 10, scrollX=TRUE))

Acquisition Cost and Duration


\(Acquisition\) \(Cost\) contains a dollar literal \((\$)\) and comma separators, which are not useful for analysis and are removed with the help of regular expressions before the column is converted to numeric.

\(Duration\) is stored as a string such as 30 days; the days suffix is not needed for the analysis, so it is stripped, the value converted to numeric, and the column renamed Duration_In_Days.

data_new <- data_new %>%
  mutate(Duration_In_Days = as.numeric(sub(" days", "", Duration)))

# Drop the original Duration column
data_new <- data_new %>%
  select(-Duration)
non_numeric_values <- data_new$Acquisition_Cost[!grepl("^\\$?[0-9,.]*$", data_new$Acquisition_Cost, perl = TRUE)]
# If there are non-numeric values, print them
if (length(non_numeric_values) > 0) {
  print(paste("Non-numeric values in Acquisition_Cost column:", unique(non_numeric_values)))
}

# Modify the Acquisition_Cost column
data_new <- data_new %>%
  mutate(Acquisition_Cost = as.numeric(gsub("[^0-9.]", "", Acquisition_Cost)))

# Print the modified data

DT::datatable(head(data_new, 10),
              rownames = FALSE,
              options = list(
                pageLength = 10,
                scrollX=TRUE))

Additional Metrics Derivation


Additional columns named Target_Audience_Age_Mean, Cost Per Click (CPC), Cost Per Acquisition (CPA), Click Through Rate (CTR) and Cost Per Impression (CPI) are derived from existing columns using simple mathematical formulas and added as new columns in the dataset.
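
These derived metrics follow directly from columns already present in the data, as implemented in the chunk below:

\[ \text{Target\_Audience\_Age\_Mean} = \frac{\text{Age}_{\min} + \text{Age}_{\max}}{2}, \qquad \text{CPC} = \frac{\text{Acquisition\_Cost}}{\text{Clicks}} \]

\[ \text{CTR} = \frac{\text{Clicks}}{\text{Impressions}} \times 100, \qquad \text{CPA} = \frac{\text{Acquisition\_Cost}}{\text{Conversion\_Rate} \times \text{Impressions}}, \qquad \text{CPI} = \frac{\text{Acquisition\_Cost}}{\text{Impressions}} \]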

The modified data is now exported to a new csv file.

#Reading the Dataset
data <- read.csv("C:/Users/vijay/Desktop/PMA/final_data.csv", stringsAsFactors = FALSE)

# Calculate CPC, CTR, CPA, and CPI
data <- data %>%
  mutate(
    # CPC (Cost Per Click)
    Target_Audience_Age_Mean=(Target_Audience_Age_Min+Target_Audience_Age_Max)/2,
    
    CPC = Acquisition_Cost / Clicks,
    
    # CTR (Click Through Rate)
    CTR = (Clicks / Impressions) * 100,
    
    # CPA (Cost Per Acquisition)
    CPA = Acquisition_Cost / (Conversion_Rate * Impressions),
    
    # CPI (Cost Per Impression)
    CPI = Acquisition_Cost / Impressions
  )

DT::datatable(head(data, 5), rownames = FALSE, options = list(pageLength = 10,scrollX=TRUE))
write.csv(data, "final_modified_data.csv",row.names=FALSE)

Correlation Between Variables

Clicks and Impressions have a weak positive correlation (0.2). This means that a higher number of impressions is associated with only a slightly higher number of clicks.

Cost per Click (CPC) and Acquisition Cost have a moderate positive correlation (0.5). This means that an increase in CPC is associated with an increase in Acquisition Cost.

ROI (Return on Investment) has a weak negative correlation with both Clicks (-0.2) and Impressions (-0.3). This means that there might be a small decrease in ROI as the number of clicks or impressions increases.

It is important to note that correlation does not imply causation. There could be other factors influencing these metrics.
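
The correlation values above can be reproduced with the heatmaply package loaded earlier; the sketch below is illustrative and assumes the derived columns in final_modified_data.csv (the exact column selection is an assumption, not part of the original analysis).

# Sketch: correlation heatmap of the numeric metrics (column choice is an assumption)
num_data <- read.csv("C:/Users/vijay/Desktop/PMA/final_modified_data.csv", stringsAsFactors = FALSE)
num_cols <- c("Clicks", "Impressions", "CPC", "CPA", "CTR", "Acquisition_Cost", "ROI")
corr_matrix <- cor(num_data[, num_cols], use = "complete.obs")
heatmaply_cor(corr_matrix, main = "Correlation Between Variables")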

Exploratory Data Analysis

To understand the dataset thoroughly, different plots are drawn so that inferences can be made.

The plots are created using the ggplot2 package for data visualization.

Analysis of CPC

The channel with the highest CPC is Facebook, followed by Google Ads and then Email; that is, clicks on Facebook tend to cost the most.

There is a cluster of channels with a lower CPC around 0.01, which includes Email, Instagram and YouTube. It appears that it is more likely to get a click on these channels for a CPC of around 0.01.

marketing_data <- read.csv("C:/Users/vijay/Desktop/PMA/final_modified_data.csv", stringsAsFactors = FALSE)

# Subset the first 1000 records
subset_data <- marketing_data[1:1000, ]

# Create a density plot for CPC vs Channel Used
ggplot(subset_data, aes(x = CPC, fill = Channel_Used)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of CPC by Channel Used") +
  facet_wrap(~Channel_Used, scales = "free") +
  theme_minimal()

There is a higher density of clicks for Email and Facebook at a lower log(CPC) compared to Google Ads, YouTube, Instagram and Website.

This means that for these channels, it is more likely to get clicks at a lower cost per click.

The density is spread out more for Facebook and Google Ads compared to the other channels. This suggests that there is a wider range of CPCs for clicks on these channels.

# Apply log transformation to CPC
subset_data$log_CPC <- log(subset_data$CPC + 1) 


# Create a density plot for log(CPC) vs Channel Used
ggplot(subset_data, aes(x = log_CPC, fill = Channel_Used)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of log(CPC) by Channel Used") +
  facet_wrap(~Channel_Used, scales = "free") +
  theme_minimal()

Analysis of CTR

Email appears to have the highest CTR overall, with a cluster of density around 0.04.

Facebook and Google Ads also show a cluster of density around 0.04, but it appears to be lower than Email.

The remaining channels, YouTube, Instagram and Website, all show a lower density of CTR spread across a wider range of values. This suggests that CTR for these channels is less concentrated and more likely to vary.

# Create a density plot for CTR vs Channel Used
ggplot(subset_data, aes(x = CTR, fill = Channel_Used)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of CTR by Channel Used") +
  facet_wrap(~Channel_Used, scales = "free") +
  theme_minimal()

There is a higher density for Email and Facebook at lower log(CTR) values compared to Google Ads, YouTube, Instagram and Website.

This means that for Email and Facebook, clicks are more likely to come at a lower click-through rate.

The density is spread out more for Facebook and Google Ads compared to the other channels. This suggests that there is a wider range of click-through rates for clicks on these channels.

# Apply log transformation to CTR
subset_data$log_CTR <- log(subset_data$CTR + 1) 

# Create a density plot for log(CTR) vs Channel Used
ggplot(subset_data, aes(x = log_CTR, fill = Channel_Used)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of log(CTR) by Channel Used") +
  facet_wrap(~Channel_Used, scales = "free") +
  theme_minimal()

Analysis of CPA

The channel with the lowest CPA is Email, followed by Facebook and then YouTube; that is, acquisitions through Email tend to be the cheapest.

There is a cluster of channels with a higher CPA around 0.005, which includes Google Ads, Instagram and Website. It appears that it is more likely to get an acquisition on these channels for a CPA of around 0.005.

# Create a density plot for CPA vs Channel Used
ggplot(subset_data, aes(x = CPA, fill = Channel_Used)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of CPA by Channel Used") +
  facet_wrap(~Channel_Used, scales = "free") +
  theme_minimal()

There is a higher density for Email and Facebook at lower log(CPA) values compared to Google Ads, YouTube, Instagram and Website.

This means that for these channels, acquisitions are more likely to come at a lower cost per acquisition.

The density is spread out more for Facebook and Google Ads compared to the other channels.

# Apply log transformation to CPA
subset_data$log_CPA <- log(subset_data$CPA + 1)

# Create a density plot for log(CPA) vs Channel Used
ggplot(subset_data, aes(x = log_CPA, fill = Channel_Used)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of log(CPA) by Channel Used") +
  facet_wrap(~Channel_Used, scales = "free") +
  theme_minimal()

Analysis Of Mean Age

The channels seem to target audiences with a mean age between 25 and 40 years old.

Facebook and YouTube appear to target a slightly younger audience, with a higher density around 25 years old.

Email and Instagram appear to target a slightly older audience with a higher density around 30 years old.

# Calculate mean age by channel used
mean_age_by_channel <- aggregate(Target_Audience_Age_Mean ~ Channel_Used, data = subset_data, FUN = mean)

# Create a density plot for mean age distribution vs. Channel Used
ggplot(subset_data, aes(x = Target_Audience_Age_Mean, fill = Channel_Used)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of Mean Age by Channel Used") +
  facet_wrap(~Channel_Used, scales = "free") +
  theme_minimal()

Younger age groups (18-24) tend to have a higher CTR across all social media platforms compared to older age groups. This suggests that younger users are more likely to click on ads compared to older users.

Instagram appears to have the highest CTR for almost all age groups. This means that users are more likely to click on ads on Instagram compared to other platforms.

Facebook seems to have a higher CTR for younger age groups (18-24) compared to Email and Google Ads. However, for older age groups, Email and Google Ads appear to have a similar or slightly higher CTR than Facebook.

The CTR for all social media platforms appears to decrease as the age group increases. This suggests that targeting younger audiences might be more effective for campaigns that rely on clicks.
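
As a rough check on these observations, mean CTR can be tabulated by age band and channel. The sketch below is illustrative; the age breaks are an assumption and not part of the original analysis.

# Sketch: mean CTR by age band and channel (age breaks are an illustrative assumption)
marketing_data <- read.csv("C:/Users/vijay/Desktop/PMA/final_modified_data.csv", stringsAsFactors = FALSE)
ctr_by_age <- marketing_data %>%
  mutate(Age_Band = cut(Target_Audience_Age_Mean, breaks = c(18, 25, 30, 35, 40, 60), include.lowest = TRUE)) %>%
  group_by(Age_Band, Channel_Used) %>%
  summarize(Mean_CTR = mean(CTR, na.rm = TRUE), .groups = "drop")
ctr_by_age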

# Read the full dataset for the age distribution plots
marketing_data <- read.csv("C:/Users/vijay/Desktop/PMA/final_modified_data.csv", stringsAsFactors = FALSE)

ggplot(marketing_data, aes(x = Target_Audience_Age_Mean, fill = Channel_Used)) +
  geom_histogram(binwidth = 5, alpha = 0.5, position = "identity") +
  labs(title = "Age Distribution for Different Social Media Platforms by CTR", x = "Age", y = "Frequency") +
  facet_wrap(~Channel_Used, scales = "free", drop = TRUE) +
  theme_minimal()

There are five customer segments: Fashionistas, Foodies, Health & Wellness, Outdoor Adventurers, and Tech Enthusiasts.

Each customer segment seems to have a different age distribution across social media platforms.

It appears that Facebook and Instagram tend to have a wider reach across all age groups for most customer segments.

There is a trend where some customer segments skew towards younger age groups (20-30) while others skew towards older age groups (30-40).

ggplot(marketing_data, aes(x = Target_Audience_Age_Mean, fill = Customer_Segment)) +
  geom_histogram(binwidth = 5, alpha = 0.5, position = "identity") +
  labs(title = "Age Distribution for Different Social Media Platforms by CTR", x = "Age", y = "Frequency") +
  facet_wrap(~Customer_Segment, scales = "free", drop = TRUE) +
  theme_minimal()

Model Building

Linear Regression

A regression model predicts a numeric value; here the target is ROI, the most important metric for evaluating the performance of an advertisement.

Linear regression is a simple model that fits a straight line to the data. As the actual-versus-predicted graph shows, it does not perform well here.

This below-par performance can be attributed to non-linearity in, and dependencies among, the variables in the dataset.

# Load the data
marketing_data <- read.csv("C:/Users/vijay/Desktop/PMA/final_data.csv", stringsAsFactors = FALSE)

# Remove identifier and unused columns (Date, Campaign_ID, Company, Clicks, Engagement_Score)
marketing_data <- subset(marketing_data, select = -c(Date, Campaign_ID, Company, Clicks, Engagement_Score))

# Take the first 1000 records
subset_data <- head(marketing_data, 1000)

# Set seed for reproducibility
set.seed(123)

# Split data into train and test sets
trainIndex <- createDataPartition(subset_data$ROI, p = 0.8, list = FALSE)
train_data <- subset_data[trainIndex, ]
test_data <- subset_data[-trainIndex, ]

# Train regression model
conversion_model <- train(ROI ~ ., data = train_data, method = "lm")

# Predict on test set
predictions <- predict(conversion_model, newdata = test_data)

# Create a data frame for actual vs. predicted values
results <- data.frame(Actual = test_data$ROI, Predicted = predictions)

# Plot actual vs. predicted graph
ggplot(results, aes(x = Actual, y = Predicted)) +
  geom_point(color = "blue", alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, color = "red", linewidth = 1.5) +
  labs(title = "Actual vs. Predicted ROI",
       x = "Actual ROI", y = "Predicted ROI") +
  theme_minimal()

# Calculate R-squared
rsquared <- cor(results$Actual, results$Predicted)^2
# Calculate MSE
mse <- mean((results$Actual - results$Predicted)^2)
# Calculate RMSE
rmse <- sqrt(mse)

There is a positive correlation between the actual ROI and predicted ROI. This means that there is a tendency for campaigns with a higher predicted ROI to also have a higher actual ROI.

However, the data points are scattered around the positive trend line, which indicates that the prediction is not always accurate. Some data points fall below the trend line, meaning the predicted ROI was higher than the actual ROI for those campaigns.

There are also data points above the trend line. This means that the predicted ROI was lower than the actual ROI for these campaigns.
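
The metrics reported below can be printed in the same way as for the SVR models later in the document; a minimal sketch using the values computed above:

# Print the performance metrics for the test data
{
cat("R-squared:", rsquared, "\n")
cat("MSE:", mse, "\n")
cat("RMSE:", rmse, "\n")
}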

## R-squared: 0.0003636811 
## MSE: 3.066079 
## RMSE: 1.751022

Support Vector Regression

A Support Vector Regression (SVR) model is then built as an alternative to the linear regression model.

The SVR model is built in two stages:

  • First, a subset of 1000 records is taken in order to analyse the performance of the model.
# Load your dataset
data <- read.csv("C:/Users/vijay/Desktop/PMA/new code/final_data.csv",nrows=1000)

# Select relevant numerical columns
relevant_columns <- c("Acquisition_Cost", "ROI", "Clicks", "Impressions", "Engagement_Score", "Duration_In_Days")

# Subset the data with relevant columns
data_subset <- data[, relevant_columns]

# Set seed for reproducibility
set.seed(123)

# Split data into training and test sets (80% training, 20% test)
train_index <- sample(1:nrow(data_subset), 0.8 * nrow(data_subset))
train_data <- data_subset[train_index, ]
test_data <- data_subset[-train_index, ]

# Split predictors and target variable for training and test sets
X_train <- train_data[, -1]  # Drop the first column (Acquisition_Cost)
Y_train <- train_data$ROI
X_test <- test_data[, -1]
Y_test <- test_data$ROI

# Scale the predictors for training and test sets
X_train_scaled <- scale(X_train)
X_test_scaled <- scale(X_test)

# Train SVR model on training data
svr_model <- svm(Y_train ~ ., data = as.data.frame(X_train_scaled), kernel = "radial")

# Predict ROI for test data
predicted_ROI <- predict(svr_model, X_test_scaled)

# Calculate RMSE for test data
rmse_test <- sqrt(mean((Y_test - predicted_ROI)^2))

# Calculate R-squared for test data
rsquared_test <- cor(Y_test, predicted_ROI)^2

# Calculate MSE for test data
mse_test <- mean((Y_test - predicted_ROI)^2)

# Create a data frame with actual and predicted ROI values
plot_data <- data.frame(Actual_ROI = Y_test, Predicted_ROI = predicted_ROI)

# Plot predicted vs. actual graph using ggplot
ggplot(plot_data, aes(x = Actual_ROI, y = Predicted_ROI)) +
  geom_point(color = "blue", alpha = 0.4) +
  geom_abline(intercept = 0, slope = 1, color = "red", linewidth = 1.5) +
  labs(title = "Predicted vs. Actual ROI (Test Data)",
       x = "Actual ROI", y = "Predicted ROI") +
  theme_minimal()

summary(svr_model)
## 
## Call:
## svm(formula = Y_train ~ ., data = as.data.frame(X_train_scaled), 
##     kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.2 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  112

The predicted ROI is generally lower than the actual ROI: there are more data points below the diagonal line than above it. This means that the model tends to underestimate the ROI for most campaigns.

There is a cluster of data points in the bottom left corner of the graph, where both the predicted and actual ROI are low.

There are also a few data points scattered in the top right corner of the graph, where both the predicted and actual ROI are high.

## Test Data Metrics:
## RMSE: 0.1603113 
## R-squared: 0.9914833 
## MSE: 0.0256997
  • As there is an increase in performance with the SVR model, the model is then trained on the complete dataset.
data <- read.csv("C:/Users/vijay/Desktop/PMA/new code/final_data.csv")
# Select relevant numerical columns
relevant_columns <- c("Acquisition_Cost", "ROI", "Clicks", "Impressions", "Engagement_Score", "Duration_In_Days")

# Subset the data with relevant columns
data_subset <- data[, relevant_columns]

# Set seed for reproducibility
set.seed(123)

# Split data into training and test sets (80% training, 20% test)
train_index <- sample(1:nrow(data_subset), 0.8 * nrow(data_subset))
train_data <- data_subset[train_index, ]
test_data <- data_subset[-train_index, ]

# Split predictors and target variable for training and test sets
X_train <- train_data[, -1]  # Drop the first column (Acquisition_Cost)
Y_train <- train_data$ROI
X_test <- test_data[, -1]
Y_test <- test_data$ROI

# Scale the predictors for training and test sets
X_train_scaled <- scale(X_train)
X_test_scaled <- scale(X_test)

# Train SVR model on training data
svr_model <- svm(Y_train ~ ., data = as.data.frame(X_train_scaled), kernel = "radial")

# Predict ROI for test data
predicted_ROI <- predict(svr_model, X_test_scaled)

# Calculate RMSE for test data
rmse_test <- sqrt(mean((Y_test - predicted_ROI)^2))

# Calculate R-squared for test data
rsquared_test <- cor(Y_test, predicted_ROI)^2

# Calculate MSE for test data
mse_test <- mean((Y_test - predicted_ROI)^2)

# Create a data frame with actual and predicted ROI values
plot_data <- data.frame(Actual_ROI = Y_test, Predicted_ROI = predicted_ROI)

# Plot predicted vs. actual graph using ggplot
ggplot(plot_data, aes(x = Actual_ROI, y = Predicted_ROI)) +
  geom_point(color = "blue", alpha = 0.2) +
  geom_abline(intercept = 0, slope = 1, color = "red", linewidth = 1.5) +
  labs(title = "Predicted vs. Actual ROI (Test Data)",
       x = "Actual ROI", y = "Predicted ROI") +
  theme_minimal()

summary(svr_model)
## 
## Call:
## svm(formula = Y_train ~ ., data = as.data.frame(X_train_scaled), 
##     kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.2 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  231

The predicted ROI is generally lower than the actual ROI: there are more data points below the diagonal line than above it. This means that the model tends to underestimate the ROI for most campaigns.

There is a cluster of data points in the bottom left corner of the graph, where both the predicted and actual ROI are low.

There are also a few data points scattered in the top right corner of the graph, where both the predicted and actual ROI are high.

# Print the performance metrics for the test data
{
cat("Test Data Metrics:\n")
cat("RMSE:", rmse_test, "\n")
cat("R-squared:", rsquared_test, "\n")
cat("MSE:", mse_test, "\n")
}
## Test Data Metrics:
## RMSE: 0.07173009 
## R-squared: 0.9986862 
## MSE: 0.005145205

Clustering

We used k-means for the clustering part, based on Clicks and Impressions.

What we derive from this model is that when a user wants to publish an ad on a specific platform, they set an impressions goal, i.e. the number of impressions (views) they want to achieve within a particular period of time.

data <- read.csv("C:/Users/vijay/Desktop/PMA/final_data.csv", stringsAsFactors = FALSE)

# Selecting the input features
input_features <- c("Campaign_Type", "Target_Audience_Gender", "Target_Audience_Age_Min", 
                    "Target_Audience_Age_Max", "Channel_Used", "Location", "Language", 
                    "Clicks", "Impressions", "Customer_Segment", "Duration_In_Days")

# Subsetting the dataset with selected features
data_subset <- data[input_features]

# Convert all variables to numeric (character columns become NA and are imputed below)
data_subset <- apply(data_subset, 2, as.numeric)

# Handling missing values by imputing with mean
data_subset[is.na(data_subset)] <- mean(data_subset, na.rm = TRUE)

# Sampling a subset of the data (adjust the size as needed)
set.seed(123)  
sampled_indices <- sample(nrow(data_subset), 1000) 
data_subset <- data_subset[sampled_indices, ]

# Performing k-means clustering
k <- 5 
set.seed(123)
kmeans_model <- kmeans(data_subset, centers = k)

# Plotting the clusters
plot(data_subset[, c("Clicks", "Impressions")], col = kmeans_model$cluster, 
     main = "K-means Clustering of Marketing Data (Audience Engagement)",
     xlab = "Clicks", ylab = "Impressions")
points(kmeans_model$centers[, c("Clicks", "Impressions")], col = 1:k, pch = 19, cex = 2, bg = "white")
legend("topright", legend = paste("Zone", 1:k), col = 1:k, pch = 19, cex = 1, title = "Aud_Eng")

## Calculate silhouette score
silhouette_score <- silhouette(kmeans_model$cluster, dist(data_subset))
cat("Silhouette Score:", mean(silhouette_score[, "sil_width"]), "\n")
## Silhouette Score: 0.5112601

This model indicates the minimum number of clicks required to achieve a given number of impressions: the higher the clicks, the higher the number of impressions or views. The clusters represent levels of audience engagement and are divided into zones; a zone tells the client under which category to target the ads.
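
As an illustration of how a client could use these zones, the fitted cluster centers can be queried for a planned campaign; the sketch below is a hedged example, and the goal values are assumptions rather than data from the project.

# Sketch: map a hypothetical clicks/impressions goal to its nearest zone
# using the centers of the kmeans_model fitted above (goal values are assumptions)
new_goal <- c(Clicks = 600, Impressions = 7000)
centers <- kmeans_model$centers[, c("Clicks", "Impressions")]
distances <- apply(centers, 1, function(center) sqrt(sum((center - new_goal)^2)))
cat("Nearest audience-engagement zone:", which.min(distances), "\n")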

Time Series Analysis

ARIMA analysis of ROI throughout the year for the different channels on which ads are published.

The residuals for Facebook and YouTube appear to be more volatile than the residuals for the other channels. This suggests that there may be more unpredictable factors affecting these channels.

The residuals for Instagram and Website seem to be more stable over time. This suggests that the ARIMA model may be fitting these channels better.

  • The metric used here is \(Mean\) \(Absolute\) \(Error\) \((MAE)\), defined in the formula below, to better understand the error rate of the different variables under consideration.
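
For a series of \(n\) monthly observations with actual values \(y_t\) and fitted values \(\hat{y}_t\):

\[ \text{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|y_t - \hat{y}_t\right| \]
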
# Read the dataset
marketing_data <- read.csv("C:/Users/vijay/Desktop/PMA/final_modified_data.csv", stringsAsFactors = FALSE)

# Convert 'Date' column to Date format
marketing_data$Date <- as.Date(marketing_data$Date, format = "%d-%m-%Y")

# Take the first 10,000 records
#sampled_data <- head(marketing_data, 100000)

# Subset data for each unique channel
channels <- unique(marketing_data$Channel_Used)

# Create a list to store plots for each channel
plots_list <- list()

# Create a list to store MAE for each channel
mae_list <- list()

# Perform ARIMA analysis and plot for each channel
for (channel in channels) {
  channel_data <- filter(marketing_data, Channel_Used == channel)
  
  # Aggregate data to monthly level and calculate average ROI
  monthly_avg_roi <- channel_data %>%
    group_by(month = format(Date, "%Y-%m")) %>%
    summarize(avg_roi = mean(ROI, na.rm = TRUE))
  
  # ARIMA analysis
  arima_model <- auto.arima(monthly_avg_roi$avg_roi)
  
  # Calculate MAE
  arima_accuracy <- accuracy(arima_model)
  mae <- mean(abs(arima_accuracy[, "MAE"]))
  mae_list[[channel]] <- mae
  
  # Plot monthly average ROI against date and store the plot in the list
  plots_list[[channel]] <- autoplot(arima_model$residuals) +
    labs(x = "Date", y = "Residuals", title = paste("ARIMA Residuals Plot for", channel))
}

# Arrange the per-channel plots in a grid
plots_combined <- do.call(gridExtra::grid.arrange, c(plots_list, ncol = 2))

# Print the MAE for each channel
for (channel in channels) {
  cat("MAE for", channel, ":", mae_list[[channel]], "\n")
}
## MAE for Google Ads : 0.02502657 
## MAE for YouTube : 0.02329913 
## MAE for Instagram : 0.03148257 
## MAE for Website : 0.02331701 
## MAE for Facebook : 0.03417161 
## MAE for Email : 0.02325383

ARIMA analysis of ROI throughout the year for the different customer segments of ads published.

The residuals for Health & Wellness and Tech Enthusiasts seem to be more stable than the residuals for Fashionistas and Outdoor Adventurers.

This suggests that the ARIMA model may be fitting the data for Health & Wellness and Tech Enthusiasts better.

There is a positive spike in the residuals around period 8 for Outdoor Adventurers. This could indicate an unexpected event that increased ROI for this segment at that time.

# Read the dataset
marketing_data <- read.csv("C:/Users/vijay/Desktop/PMA/final_modified_data.csv", stringsAsFactors = FALSE)

# Convert 'Date' column to Date format
marketing_data$Date <- as.Date(marketing_data$Date, format = "%d-%m-%Y")

# Take a random subset of 10,000 records
#sampled_data <- head(marketing_data, 10000)

# Subset data for each unique segment
segments <- unique(marketing_data$Customer_Segment)

# Create a list to store plots for each segment
plots_list <- list()

# Create a list to store MAE for each segment
mae_list <- list()

# Perform ARIMA analysis and plot for each segment
for (segment in segments) {
  segment_data <- filter(marketing_data, Customer_Segment == segment)
  
  # Aggregate data to monthly level and calculate average ROI
  monthly_avg_roi <- segment_data %>%
    group_by(month = format(Date, "%Y-%m")) %>%
    summarize(avg_roi = mean(ROI, na.rm = TRUE))
  
  # ARIMA analysis
  arima_model <- auto.arima(monthly_avg_roi$avg_roi)
  
  # Calculate MAE
  arima_accuracy <- accuracy(arima_model)
  mae <- mean(abs(arima_accuracy[, "MAE"]))
  mae_list[[segment]] <- mae
  
  # Plot monthly average ROI against date and store the plot in the list
  plots_list[[segment]] <- autoplot(arima_model$residuals) +
    labs(x = "Date", y = "Residuals", title = paste("ARIMA Residuals Plot for", segment))
}

# Arrange the per-segment plots in a grid
plots_combined <- do.call(gridExtra::grid.arrange, c(plots_list, ncol = 2))

# Print the MAE for each segment
for (segment in segments) {
  cat("MAE for", segment, ":", mae_list[[segment]], "\n")
}
## MAE for Health & Wellness : 0.01641105 
## MAE for Fashionistas : 0.02314913 
## MAE for Outdoor Adventurers : 0.01843864 
## MAE for Foodies : 0.02775127 
## MAE for Tech Enthusiasts : 0.02678632

Conclusion

The study underscores the pivotal role of data-driven insights in crafting successful advertising strategies, emphasizing the significance of metrics like Click-Through Rate (CTR), Cost Per Click (CPC), and Return on Investment (ROI). By employing various machine learning models tailored to extract key information for advertisers, it navigates through the intricate landscape of advertisement optimization. Through regression models focused on ROI and time series analysis elucidating the dynamic nature of advertisement effectiveness across different durations and channels, the study provides actionable intelligence for advertisers seeking to maximize their returns.

In today’s evolving economic landscape, the study underscores the imperative of leveraging data science for informed decision-making in advertising. It delineates how discerning analysis of numerical correlations and the deployment of advanced models can empower advertisers to identify optimal social media channels and strategic timing for ad placement, ultimately enhancing ROI. For aspiring advertisers, this research serves as a compass, guiding them towards efficacious advertising practices grounded in empirical evidence and predictive analytics.