Proposal

Background

Digital AIM Media is a multimedia company that provides a range of digital advertising services, such as social media marketing, search engine optimization, and email marketing, as well as print media services throughout the United States. The company consists of three major subsidiaries: AIM Media Texas, AIM Media Midwest, and AIM Media Indiana. AIM Media Indiana, the focus of this project and formerly known as Home News Enterprises, is an American printer and publisher of daily and weekly newspapers in central Indiana. Its flagship newspaper is The Republic in Columbus, and its other newspaper holdings cover small cities and counties south and east of Indianapolis. Excluding the 2012 acquisition of the Seymour Tribune, the company has an overall circulation of 55,000. The company operates presses in Columbus and Greenfield, Indiana, both of which offer commercial printing services in addition to printing its own newspapers. The company also has editorial and business offices in Columbus, Franklin (Daily Journal), Nashville (Brown County Democrat), Pendleton (Times-Post), Seymour (Tribune), and Greenfield (Daily Reporter).

Digital AIM Media is committed to providing its clients with innovative and effective digital advertising solutions. The company uses cutting-edge technology and data analytics to optimize campaign performance and deliver a good return on investment for its clients, measured as traffic generated relative to budget spent. With the decline of print media and the growth of digital media, Digital AIM Media is well positioned to capitalize on the growing demand for digital advertising solutions.

Business Problem

The U.S. Census Bureau’s Service Annual Survey (SAS) estimated that between 2002 and 2020, Newspaper Publishers suffered an estimated revenue drop of 52.0%. Estimated revenue for Periodical Publishing (magazines, etc.) and for Video Tape and Disc Rental also fell, by 40.5% and 88.5% respectively. Total estimated weekday circulation of U.S. daily newspapers was 55.8 million in 2000 and dropped to 24.2 million by 2020 (Grundy, 2022). Digital media, on the other hand, is cost-effective and immediate. It can be less expensive than print media, depending on the graphic design, and campaigns and content can be produced, launched, and updated faster than in print. Digital media is also interactive and can generate consumer information. According to Statista, revenue in the Digital Newspapers & Magazines segment is projected to reach $38.60 billion in 2023. Revenue is expected to show an annual growth rate (CAGR 2023-2027) of 2.73%, resulting in a projected market volume of $42.99 billion by 2027. In the same segment, the number of users is expected to reach 1,881.9 million by 2027 (Statista, 2023).

With the decline of print media and the growth of the digital media industry in mind, Digital AIM Media, henceforth known as ‘the company’, has decided to conduct research on the setups of digital advertising campaigns to determine which variables in a campaign have the largest impact on its performance and outcome.

Some of the hypotheses to be researched include:

  1. Do the specific months during which a campaign is active affect its performance (e.g., click-through rate, impression volume)?

  2. Is the time of day a factor affecting the performance of a digital advertising campaign?

  3. Does one platform (Google, Facebook, Simpli.fi) perform better than the others when evaluated under the same metrics?

Initial Plan

The first step of this analysis involves obtaining an understanding of the data that is currently accessible. Exploratory data analysis will be performed on the datasets to identify variables relevant to the analysis, such as impressions, clicks, click-through rate, engagement rate, and ROI, and to remove redundant variables. Hypotheses will then be created based on the observed data to determine whether details such as time of day, day of the week, week of the month, month of the year, targeted demographics, age group, and gender affect the outcome of a digital advertising campaign, and if so, to what degree these variables affect overall performance. Starting with the Google platform, these observations will also be repeated for Facebook and Simpli.fi campaigns.

Once statistical trends have been identified on all digital advertising platforms, descriptive analysis will be done to outline the respective trends and provide feedback on best practices that would help optimize future campaigns. Cluster analysis will also be done to identify whether similar trends in performance vary between platforms or whether the performance of a campaign is independent of the platform. Data visualization techniques such as bar charts and graphs will be used to help simplify the final draft of the report.

Finally, predictive analysis can be done to determine whether the findings above are statistically sound, or whether the research proves inconclusive and the performance and outcome of a digital campaign is unrelated to the variables observed in the datasets.

Stakeholders

Stakeholder/Group | Position | Level of Influence | Level of Interest
Richard Clark | VP & Publisher | High | High
John Senger | Director of Digital Sales | High | High
Sales | Digital Marketing Managers | High | High
Sales | Legacy Print Sales Representatives | Low | Low
Operations | Ad Designers | Low | Medium
Operations | Campaign Processors | Low | Medium
Client | Advertising Customers | High | High
Client | Readers | Low | Low

Richard, the Vice President & Publisher of the organization, is a stakeholder who can affect or be affected by the organization’s actions or decisions. Richard is a high-priority stakeholder due to his seniority and decision-making authority within the organization. Richard’s needs and expectations may include achieving the organization’s strategic goals, increasing revenue and profits, improving the quality of publications, and maintaining a positive reputation. Richard may be interested in increasing the organization’s market share, expanding its reach, and enhancing its competitive position. His concerns may include maintaining editorial integrity, complying with industry regulations, and managing risks associated with digital disruption. Overall, Richard’s position as Vice President & Publisher makes him a critical stakeholder in the organization. By understanding his needs and concerns and developing effective strategies to engage him, the organization can build a positive relationship with him and increase the likelihood of achieving its goals.

John is likely to have a high interest in the research since he is the Director of Digital Sales, and the research is focused on the setups for digital advertising campaigns. As part of his role, he is responsible for maximizing the revenue generated from digital sales, so he would be interested in understanding which variables have the most significant impact on campaign performance. As the Director of Digital Sales, John has a significant level of influence over the digital sales strategy and the types of campaigns that are executed. He is likely to have a say in the research questions, methodology, and data sources used in the research. The research could have a significant impact on John’s role, as it could provide insights into which variables have the most significant impact on campaign performance. This information could be used to optimize campaign setups and improve the overall revenue generated from digital sales. Overall, John is a key stakeholder in this research on digital advertising campaigns. His high level of interest and influence in digital sales make his support and engagement critical to the success of the research. Engaging with him regularly and keeping him informed throughout the research process is essential to ensuring that he remains supportive and engaged in the initiative.

As the primary stakeholders responsible for designing and implementing digital advertising campaigns, the Digital Marketing Managers would have a high level of influence on the research. They are likely to have input into the research questions, methodology, and data sources used in the research. The Digital Marketing Managers would have a high interest in this research, as they are responsible for designing, implementing, and monitoring digital advertising campaigns. The research would provide valuable insights into the variables that impact campaign performance, which they can use to optimize campaign performance and generate better results. The research could have a significant impact on the Digital Marketing Managers’ roles, as the insights gained could be used to improve campaign performance and generate better results. This, in turn, could impact the company’s revenue and profitability. Overall, the Digital Marketing Managers are critical stakeholders in this research. Their high level of interest and influence in designing and implementing digital advertising campaigns make their support and engagement essential to the success of the research.

Print sales representatives are a group of stakeholders who may be impacted by the research being conducted by Digital AIM Media. As the company focuses on digital advertising campaigns, print sales representatives may feel threatened as their primary source of revenue is in print media. Print sales representatives have a moderate level of influence. While they may not have decision-making authority within the company, they have an influential role in the organization as they are the primary source of revenue for print media. As such, they may have the ability to influence the company’s decisions regarding their job roles and responsibilities. Print sales representatives have a direct interest in the research being conducted by Digital AIM Media. The outcome of the research will determine the company’s future direction and success, which could potentially affect the sales representatives’ job security and income.

Customers play a critical role in the success of any business, and their satisfaction is essential to retain them and gain new customers. Customers are interested in the quality of the digital advertising campaigns and the value they receive from them. They want their advertising campaigns to be effective in reaching their target audience and generating a positive ROI. Customers can influence the reputation of Digital AIM Media by sharing their experience with others. They can leave reviews and ratings on social media and other platforms that can impact the company’s reputation. Overall, customers are a critical stakeholder for Digital AIM Media, and understanding their perspectives, needs, and expectations can help the company improve its services and build long-term relationships with its customers.

Data Preparation

The data used in this research was obtained from the three major digital marketing platforms that the company utilizes: Google, Facebook, and Simpli.fi. The original purpose of the data is to gain a better understanding of the performance of each campaign by year, season, traffic, demographic, and budget, which helps contextualize the data and ensures that it is used appropriately. The data was collected in 2023, which allows the entirety of 2021 and 2022 to be completely logged and provides a more accurate reflection of current trends and behaviors. Because the number of variables available in the original datasets was very large, data cleaning was applied to ensure that only the appropriate variables are used to answer the research questions. Since the data was sourced from different platforms, the datasets arrived in different formats. Data cleaning was therefore necessary to correct errors and inconsistencies in the data and to identify and remove missing values and outliers.

Datasets

Figure 1: Overview of Digital AIM Media IN Google Advertising Performance in 2022

Figure 2: Google Campaigns dataset

Figure 3: Overview of Facebook Campaign Data in 2022

Figure 4: Account Performance by Campaign on Simpli.fi in 2022

Based on the research questions and hypotheses, the following datasets may be used in this research:

  1. Digital advertising data from 2021 and 2022: This dataset would provide information on the performance of digital campaigns after the pandemic. It would include metrics such as click-through rates, conversion rates, and impressions.

  2. Monthly digital advertising data: This dataset would provide information on the performance of digital campaigns during different months throughout the year. It would include metrics such as click-through rates, conversion rates, and impressions.

  3. Demographic data: This dataset would provide information on the characteristics of the audience targeted by digital campaigns. It would include information such as age, gender, income level, and location.

  4. Platform-specific advertising data: This dataset would provide information on the performance of digital campaigns on specific advertising platforms such as Google, Facebook, and Simpli.fi. It would include metrics such as click-through rates, conversion rates, and impressions.

  5. A/B testing data: This dataset would provide information on the results of A/B testing campaigns, which involves testing different variables to determine which ones have a greater impact on the performance of a digital campaign. It would include metrics such as click-through rates, conversion rates, and impressions for each variation tested.

Data Cleaning

  1. Removing unnecessary columns:
    • This step involves removing columns that are not relevant to the research question. These columns could include irrelevant data or data that has no bearing on the research question. Removing these columns will help reduce the size of the dataset and make it easier to work with.
  2. Check for and remove duplicates:
    • Duplicate data can skew the results of data analysis, and it is important to remove them before analysis. This step involves identifying and removing any duplicate data in the dataset. Duplicates can occur due to data entry errors or technical issues and need to be removed from the dataset to ensure data accuracy.
  3. Handle missing values:
    • Missing values can cause issues with data analysis, and therefore it is necessary to handle them appropriately. This step will involve identifying missing values in the dataset and then determining how best to handle them. This could involve imputing values, removing rows or columns with missing values, or leaving the missing values as they are.
  4. Check for and remove outliers:
    • Outliers can have a significant impact on data analysis, and therefore it is necessary to identify and remove them before analysis. This step involves identifying outliers and determining the best way to handle them, whether by removing them or adjusting them.
  5. Standardize data types:
    • This step involves ensuring that all data types in the dataset are standardized, which will make it easier to analyze the data. For example, ensuring that all dates are in the same format and that numerical data is in the same units.
  6. Ensure consistency across datasets:
    • This step involves ensuring that the datasets from different platforms are consistent with each other. For example, ensuring that the variables in each dataset are named consistently. Overall, these data cleaning steps are important to ensure that the dataset is accurate, consistent, and can be used for meaningful analysis.
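
As a reference, a minimal R sketch of these cleaning steps is shown below. The column names (Campaign, Day, Impressions, Clicks, Cost) are hypothetical placeholders and would be replaced with the actual variables from each platform's export.

# Hypothetical sketch of the cleaning steps above; column names are placeholders
library(dplyr)

clean_campaign_data <- function(df) {
  df %>%
    select(Campaign, Day, Impressions, Clicks, Cost) %>%  # keep only relevant columns
    distinct() %>%                                        # remove exact duplicate rows
    na.omit() %>%                                         # drop rows with missing values
    mutate(Day  = as.Date(Day),                           # standardize data types
           Cost = as.numeric(gsub(",", "", Cost)))
}

# A simple outlier rule: drop rows more than 3 standard deviations from the mean cost
remove_cost_outliers <- function(df) {
  df[abs(df$Cost - mean(df$Cost)) <= 3 * sd(df$Cost), ]
}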

Data Summary

The initial Google datasets included a large number of unnecessary variables such as Account Name, Campaign Name, Ad Group, Final URL, Headline, Long Headline, Headline 2, Description, Business name, Call to Action, and more. The majority of these variables contained no data. As such, data cleaning was necessary to ensure that these datasets were primed for analysis. With the objective of determining the correlation between time and traffic, only variables such as day and campaign performance by day were kept in the cleaned dataset. All null values were removed from the dataset.

With Facebook and Simpli.fi, both platforms allow the datasets to be cleaned prior to downloading the xlsx files. As such, both platforms provided datasets that were mostly clean and organized. Once the datasets for both platforms were obtained, the last step was to ensure that all six datasets were consistently formatted.
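
One simple way to verify consistent formatting is to compare column names across the exports before analysis. The sketch below is only illustrative and assumes the cleaned data frames are already loaded under hypothetical object names.

# Hypothetical consistency check across the six cleaned exports
datasets <- list(google_21 = google_21, google_22 = google_22,
                 facebook_21 = facebook_21, facebook_22 = facebook_22,
                 simplifi_21 = simplifi_21, simplifi_22 = simplifi_22)

reference_names <- names(datasets[[1]])
for (nm in names(datasets)) {
  extra   <- setdiff(names(datasets[[nm]]), reference_names)  # columns not in the reference
  missing <- setdiff(reference_names, names(datasets[[nm]]))  # reference columns not present
  if (length(extra) > 0 || length(missing) > 0) {
    message(nm, " differs from the reference layout")
  }
}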

Constraints

Some of the constraints for this research include limited access to specific data, such as the conversion rate in a digital campaign. The majority of the digital advertising campaigns were run without setting up conversion tracking with the clients. While it may be possible to measure campaign performance metrics such as click-through rates or engagement, it is more challenging to measure the impact of these metrics on broader business outcomes such as revenue or customer lifetime value. Without a clear understanding of this impact, it may be difficult to assess the true value of a campaign.

There are many external factors that can influence the success of a digital advertising campaign, including changes in the competitive landscape, consumer behavior, and broader economic trends. This may prove challenging to the research and result in limitations and affect the accuracy of the findings.

The digital advertising industry is constantly evolving, with new technologies, platforms, and trends emerging all the time. This can make it difficult to design research that is relevant and up-to-date, and to draw conclusions that will remain valid in the future.

Analysis

Analysis Techniques

  1. Exploratory Analysis
    • Exploratory analysis is an approach to data analysis that is used to gain an initial understanding of the data and identify potential patterns or trends. In the context of digital advertising campaigns, exploratory analysis could be used to identify variables that are likely to have an impact on campaign performance, such as ad format, ad placement, or target audience. Exploratory analysis can involve various statistical techniques, such as regression analysis or correlation analysis, to identify relationships between variables.
  2. Descriptive Analysis
    • Descriptive analysis is a technique that is used to summarize and describe the data in a meaningful way. In the context of digital advertising campaigns, descriptive analysis could be used to summarize the performance metrics of individual campaigns, such as click-through rates or conversion rates, and identify trends or patterns in the data. This type of analysis could be used to develop benchmarks or norms for different types of campaigns or industries.
  3. Cluster Analysis
    • Cluster analysis is a technique used to identify groups of observations in a dataset that are similar to each other. In the context of digital advertising campaigns, cluster analysis could be used to group campaigns that have similar performance characteristics, such as similar click-through rates or conversion rates. This type of analysis could be used to identify patterns in the data and develop insights into the types of campaigns that are most likely to be successful.
  4. Time-series Analysis
    • Time-series analysis is a statistical technique used to analyze and model data that is collected over time. It involves studying the patterns, trends, and changes in the data over a specific time period, and can be used to make forecasts or predictions about future values. Time-series analysis can be applied to a wide range of fields, such as finance, economics, engineering, environmental science, and many others. Some common methods used in time-series analysis include trend analysis, seasonal analysis, smoothing, and forecasting.
  5. Data Visualization
    • Data visualization is a technique that involves presenting data in visual formats, such as charts, graphs, or maps. Data visualization can be a powerful tool in digital advertising campaigns, as it can help to identify patterns or trends in the data that might not be immediately apparent from raw data. For example, visualizations could be used to show the relationship between ad placement and conversion rates or to highlight the performance of different types of ads.

Overall, the use of these analytics techniques could help Digital AIM Media to gain a better understanding of the variables that impact digital advertising campaign performance, develop benchmarks or norms for different types of campaigns or industries, and identify trends or patterns in the data that could inform future campaigns.
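
To illustrate the time-series technique described above, the sketch below decomposes a synthetic monthly impressions series into trend and seasonal components and produces a short forecast. The data is generated purely for illustration and is not one of the project datasets.

# Synthetic monthly impressions (in thousands) with a seasonal pattern
set.seed(42)
monthly_impressions <- 150 + 10 * sin(2 * pi * (1:36) / 12) + rnorm(36, sd = 5)

# Build a monthly time series starting January 2020
impr_ts <- ts(monthly_impressions, start = c(2020, 1), frequency = 12)

# Classical decomposition into trend, seasonal, and random components
impr_decomp <- decompose(impr_ts)
plot(impr_decomp)

# A simple 6-month forecast using Holt-Winters exponential smoothing
hw_fit <- HoltWinters(impr_ts)
predict(hw_fit, n.ahead = 6)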

Packages

readxl - The readxl package is an R package that provides functions to read data from Excel files (.xls and .xlsx). This package allows users to import data from Excel spreadsheets into R as data frames. The readxl package is especially useful when working with data that has been collected or recorded in an Excel file. By importing the data into R using readxl, users can leverage the powerful data manipulation and analysis capabilities of R to clean, analyze, and visualize their data. Once you have loaded the readxl package into your R session using library(readxl), you can use the functions provided by the package, such as read_excel() and excel_sheets(), to read data from Excel files and extract information about the spreadsheets contained in the file.
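
A brief usage sketch of readxl is shown below; the file path is a hypothetical placeholder.

library(readxl)

# List the worksheets contained in a workbook (hypothetical path)
excel_sheets("campaign_performance.xlsx")

# Read the first sheet into a data frame, treating the first row as column names
perf <- read_excel("campaign_performance.xlsx", sheet = 1, col_names = TRUE)
head(perf)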

dplyr - The dplyr package is a popular R package designed to facilitate data manipulation tasks such as filtering, selecting, grouping, summarizing, and joining data. It provides a consistent set of functions for manipulating data in a way that is both easy to read and write. The primary purpose of the dplyr package is to provide a set of functions that make it easy to work with data frames in R. The package is designed to streamline common data manipulation tasks and reduce the amount of code required to perform them. dplyr is also designed to work seamlessly with other R packages that are commonly used in data analysis, such as tidyr and ggplot2, and is part of the tidyverse, a collection of R packages designed to work together to make data analysis more efficient and intuitive. Some of the key functions provided by dplyr include the following (a short usage sketch follows the list):

  • filter(): Used to subset rows based on a condition or set of conditions.
  • select(): Used to select a subset of columns from a data frame.
  • mutate(): Used to add new variables or modify existing variables in a data frame.
  • summarize(): Used to compute summary statistics for one or more variables in a data frame.
  • group_by(): Used to group a data frame by one or more variables.
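
The short sketch below chains several of these functions on a small made-up data frame (not one of the project datasets).

library(dplyr)

# Small made-up example data frame
ads <- data.frame(
  platform    = c("Google", "Google", "Facebook", "Facebook"),
  impressions = c(12000, 8000, 15000, 9000),
  clicks      = c(240, 90, 300, 60)
)

ads %>%
  mutate(ctr = clicks / impressions) %>%   # add a click-through-rate column
  filter(impressions > 8500) %>%           # keep rows above an impression threshold
  group_by(platform) %>%                   # group by platform
  summarize(mean_ctr = mean(ctr))          # summarize CTR per platform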

fpc - The fpc package in R provides tools for clustering and classification of data. The package implements several clustering algorithms and provides functions for cluster validation and cluster comparison. The primary purpose of the fpc package is to provide a range of clustering methods that can be used to group similar objects together in a dataset. The fpc package also provides functions for cluster validation, which is the process of evaluating the quality of clustering results. The package includes functions for computing cluster indices such as the silhouette width and Dunn’s index, which can be used to assess the quality of clustering results. In addition, the fpc package provides functions for cluster comparison, which is the process of comparing clustering results obtained with different algorithms or parameter settings. The package includes functions for computing the adjusted Rand index and the Fowlkes-Mallows index, which can be used to compare clustering results. The package includes functions for:

  • K-means clustering: a popular clustering algorithm that partitions a dataset into k clusters based on the similarity between data points.
  • Hierarchical clustering: a clustering algorithm that builds a tree-like structure to represent the relationships between data points.
  • Model-based clustering: a clustering algorithm that uses statistical models to represent the data and estimate the number of clusters.
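
The sketch below illustrates cluster validation with fpc, using the built-in iris measurements purely as stand-in data.

library(fpc)

# Use the numeric iris measurements as stand-in data
data(iris)
X <- scale(iris[, 1:4])

set.seed(123)
km <- kmeans(X, centers = 3, nstart = 25)

# Validate the clustering: average silhouette width and Dunn's index
stats <- cluster.stats(dist(X), km$cluster)
stats$avg.silwidth
stats$dunn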

ggplot2 - The ggplot2 package in R is a powerful and flexible data visualization package that allows users to create customized and publication-quality plots. It is built on the philosophy of the Grammar of Graphics, which is a system for creating visualizations that can be described in terms of a set of building blocks. The primary purpose of the ggplot2 package is to provide a system for creating visualizations that are both aesthetically pleasing and informative. The package includes a set of functions for creating a wide range of plots, including scatter plots, line plots, bar plots, and more. ggplot2 provides a layered approach to building visualizations, where users can add individual layers to a plot to create more complex visualizations. This approach makes it easy to customize the appearance of a plot and add additional information to it. Some of the key features of the ggplot2 package include:

  • The ability to create complex visualizations with multiple layers.
  • The ability to customize the appearance of plots using a wide range of parameters.
  • The ability to use different data types (e.g. categorical, numerical) and plot them in various ways.
  • The ability to add annotations and text labels to plots to provide additional information.
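
The short sketch below illustrates the layered approach on a made-up data frame.

library(ggplot2)

# Made-up example data
ads <- data.frame(
  day         = factor(c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
                       levels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")),
  impressions = c(9500, 14000, 15200, 15800, 15500, 13000, 10200)
)

ggplot(ads, aes(x = day, y = impressions)) +
  geom_col(fill = "steelblue") +                    # bar layer
  labs(title = "Mean Impressions by Day of Week",   # annotation layer
       x = "Day of Week", y = "Mean Impressions") +
  theme_minimal()                                   # appearance layer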

tidyverse - The tidyverse package in R is a collection of packages designed to provide a consistent and coherent approach to data manipulation, data visualization, and data analysis. The package includes several popular packages, including dplyr, ggplot2, and tidyr, among others. The primary purpose of the tidyverse package is to provide a set of tools for working with data that are consistent, intuitive, and efficient. The package is built on the principles of tidy data, which is a way of organizing and structuring data to facilitate analysis. In addition to these core packages, the tidyverse package also includes several other packages that are useful for data analysis, such as readr for reading and writing data, purrr for functional programming, and stringr for string manipulation. Some of the key features of the tidyverse package include:

  • A consistent syntax and set of functions for working with data frames.
  • A focus on data manipulation tasks, such as filtering, selecting, and transforming data.
  • A range of functions and tools for data visualization, including the popular ggplot2 package.
  • A suite of functions for data cleaning and preparation, including the tidyr package.

lubridate - The lubridate package is an R package that helps with working with dates and times. It provides a set of functions that make it easier to parse, manipulate, and format dates and times. The package was designed to overcome some of the difficulties associated with working with dates and times in R, such as inconsistencies in date formats and time zones. The lubridate package provides a number of useful functions for working with dates and times, including functions for extracting different components of a date or time (such as year, month, day, hour, minute, and second), functions for creating new dates and times, and functions for performing arithmetic operations on dates and times (such as adding or subtracting time intervals). It also includes functions for dealing with time zones, leap years, and other common issues that arise when working with dates and times.
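
A brief lubridate sketch with made-up dates:

library(lubridate)

# Parse a character date and extract components
d <- ymd("2022-07-04")
year(d)
month(d, label = TRUE)
wday(d, label = TRUE)

# Date arithmetic: the same day one month and one week later
d + months(1) + weeks(1)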

Google Analysis

# Load the first dataset
library(readxl)
Google22 <- read_excel("C:/Users/Lynx/Documents/MSDA/MSDA 624 - Analytics Capstone/Google Targeted Display Performance 2022 Cleaned.xlsx")
Google21 <- read_excel("C:/Users/Lynx/Documents/MSDA/MSDA 624 - Analytics Capstone/Google Targeted Display Performance 2021 Cleaned.xlsx")

Linear Regression

The purpose of running a linear regression model is to identify the relationship between two continuous variables - a predictor variable and a response variable. The model estimates the linear relationship between the predictor variable and the response variable by fitting a line through the data points and determining the slope and intercept of that line. The slope of the line represents the change in the response variable per unit change in the predictor variable, and the intercept represents the value of the response variable when the predictor variable is zero.

The first step is to create a new data frame LM1 by making a copy of the original Google21 data frame. The code then removes irrelevant variables from LM1 by keeping all columns except the first four, using the negative index notation -(1:4).

Next, the code converts the character-type variables that represent costs on each day of the week to numeric type using the as.numeric() function. The variables being converted are: Sunday_Cost (Converted currency), Monday_Cost (Converted currency), Tuesday_Cost (Converted currency), Wednesday_Cost (Converted currency), Thursday_Cost (Converted currency), Friday_Cost (Converted currency), and Saturday_Cost (Converted currency).

The code then removes any rows that have null values in LM1 using the na.omit() function. Finally, the code checks that there are no null values remaining in LM1 by using the sum() function to count the number of null values and dim() function to get the dimensions of the data frame.

# Create a new data frame
LM1 <- Google21

# Remove irrelevant variables
LM1 <- LM1[,-(1:4)]

# Converting chr type variables to num type variables
LM1$`Sunday_Cost (Converted currency)` <- as.numeric(LM1$`Sunday_Cost (Converted currency)`)
## Warning: NAs introduced by coercion
LM1$`Monday_Cost (Converted currency)` <- as.numeric(LM1$`Monday_Cost (Converted currency)`)
## Warning: NAs introduced by coercion
LM1$`Tuesday_Cost (Converted currency)` <- as.numeric(LM1$`Tuesday_Cost (Converted currency)`)
## Warning: NAs introduced by coercion
LM1$`Wednesday_Cost (Converted currency)` <- as.numeric(LM1$`Wednesday_Cost (Converted currency)`)
## Warning: NAs introduced by coercion
LM1$`Thursday_Cost (Converted currency)` <- as.numeric(LM1$`Thursday_Cost (Converted currency)`)
## Warning: NAs introduced by coercion
LM1$`Friday_Cost (Converted currency)` <- as.numeric(LM1$`Friday_Cost (Converted currency)`)
## Warning: NAs introduced by coercion
LM1$`Saturday_Cost (Converted currency)` <- as.numeric(LM1$`Saturday_Cost (Converted currency)`)
## Warning: NAs introduced by coercion
# Remove null values
LM1 <- na.omit(LM1)

# Ensure that new data frame is correct
sum(is.na(LM1))
## [1] 0
dim(LM1)
## [1] 598  28
# Create a new data frame
LM2 <- Google22

# Remove irrelevant variables
LM2 <- LM2[,-(1:4)]

# Converting chr type variables to num type variables
LM2$`Sunday_Cost (Converted currency)` <- as.numeric(LM2$`Sunday_Cost (Converted currency)`)
## Warning: NAs introduced by coercion
LM2$`Monday_Cost (Converted currency)` <- as.numeric(LM2$`Monday_Cost (Converted currency)`)
## Warning: NAs introduced by coercion
LM2$`Tuesday_Cost (Converted currency)` <- as.numeric(LM2$`Tuesday_Cost (Converted currency)`)
## Warning: NAs introduced by coercion
LM2$`Wednesday_Cost (Converted currency)` <- as.numeric(LM2$`Wednesday_Cost (Converted currency)`)
## Warning: NAs introduced by coercion
LM2$`Thursday_Cost (Converted currency)` <- as.numeric(LM2$`Thursday_Cost (Converted currency)`)
## Warning: NAs introduced by coercion
LM2$`Friday_Cost (Converted currency)` <- as.numeric(LM2$`Friday_Cost (Converted currency)`)
## Warning: NAs introduced by coercion
LM2$`Saturday_Cost (Converted currency)` <- as.numeric(LM2$`Saturday_Cost (Converted currency)`)
## Warning: NAs introduced by coercion
# Remove null values
LM2 <- na.omit(LM2)

# Ensure that new data frame is correct
sum(is.na(LM2))
## [1] 0
dim(LM2)
## [1] 562  28

A function named perform_regression is created below. It performs linear regression on a given data frame (data) for a specified day of the week (day), using response and predictor variables identified by their column-name suffixes (response_var_suffix and predictor_var_suffix, respectively). The purpose of this function is to simplify the process of performing linear regression on a subset of the data by selecting the relevant columns, converting them to numeric variables, and fitting the regression model with the lm() function. The function then returns the model object and its summary statistics for further analysis.

The function uses the dplyr package to manipulate the data. It first constructs the response and predictor column names by concatenating the specified day of the week with the respective suffixes, then selects those columns. The selected columns are converted to numeric variables using the mutate_all() function, which applies as.numeric() (after stripping commas) to each column. The regression model is then fit using the lm() function, with the response variable referenced via double square brackets. Finally, the function returns a list containing the model object and the summary statistics generated by summary().

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
perform_regression <- function(data, day, response_var_suffix, predictor_var_suffix) {
  # Select columns with specified day of the week
  response_var <- paste0(day, response_var_suffix)
  predictor_var <- paste0(day, predictor_var_suffix)
  
  # Convert selected columns to numeric variables
  data <- data %>%
    select(all_of(c(response_var, predictor_var))) %>%
    mutate_all(~as.numeric(sub(",", "", .)))
  
  # Fit the linear regression model
  model <- lm(data[[response_var]] ~ data[[predictor_var]], data = data)
  
  # Get summary statistics for the model
  summary <- summary(model)
  
  # Return the model object and summary statistics
  return(list(model = model, summary = summary))
}
# Create a vector of days of the week
days <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")

# Create an empty list to store the regression results
results_list <- list()

# Loop through the days of the week and perform the regression for each day
for (day in days) {
  response_var_suffix <- "_Cost (Converted currency)"
  predictor_var_suffix <- "_Impr."
  results <- perform_regression(data = LM1, day = day, response_var_suffix = response_var_suffix, predictor_var_suffix = predictor_var_suffix)
  results_list[[day]] <- results
}

# Print the list of regression results
results_list
## $Sunday
## $Sunday$model
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Coefficients:
##           (Intercept)  data[[predictor_var]]  
##             1.166e+01              9.195e-04  
## 
## 
## $Sunday$summary
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -57.47 -11.62 -10.34  -4.27 451.65 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.166e+01  1.821e+00   6.404 3.07e-10 ***
## data[[predictor_var]] 9.195e-04  1.109e-04   8.290 7.54e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.92 on 596 degrees of freedom
## Multiple R-squared:  0.1034, Adjusted R-squared:  0.1019 
## F-statistic: 68.72 on 1 and 596 DF,  p-value: 7.543e-16
## 
## 
## 
## $Monday
## $Monday$model
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Coefficients:
##           (Intercept)  data[[predictor_var]]  
##             1.666e+01              8.117e-04  
## 
## 
## $Monday$summary
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -64.16  -16.49  -14.97   -8.48 1337.81 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.666e+01  3.238e+00   5.147 3.60e-07 ***
## data[[predictor_var]] 8.117e-04  2.053e-04   3.954 8.59e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 70.5 on 596 degrees of freedom
## Multiple R-squared:  0.02557,    Adjusted R-squared:  0.02393 
## F-statistic: 15.64 on 1 and 596 DF,  p-value: 8.594e-05
## 
## 
## 
## $Tuesday
## $Tuesday$model
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Coefficients:
##           (Intercept)  data[[predictor_var]]  
##             1.503e+01              9.164e-04  
## 
## 
## $Tuesday$summary
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -53.21  -14.97  -13.60   -7.67 1126.05 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.503e+01  2.844e+00   5.286 1.76e-07 ***
## data[[predictor_var]] 9.164e-04  1.653e-04   5.545 4.42e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 62.09 on 596 degrees of freedom
## Multiple R-squared:  0.04906,    Adjusted R-squared:  0.04746 
## F-statistic: 30.75 on 1 and 596 DF,  p-value: 4.424e-08
## 
## 
## 
## $Wednesday
## $Wednesday$model
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Coefficients:
##           (Intercept)  data[[predictor_var]]  
##             1.487e+01              8.724e-04  
## 
## 
## $Wednesday$summary
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -45.60  -14.78  -13.23   -6.67 1110.85 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.487e+01  2.764e+00   5.380 1.07e-07 ***
## data[[predictor_var]] 8.724e-04  1.640e-04   5.318 1.49e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.1 on 596 degrees of freedom
## Multiple R-squared:  0.0453, Adjusted R-squared:  0.0437 
## F-statistic: 28.28 on 1 and 596 DF,  p-value: 1.485e-07
## 
## 
## 
## $Thursday
## $Thursday$model
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Coefficients:
##           (Intercept)  data[[predictor_var]]  
##             1.485e+01              9.347e-04  
## 
## 
## $Thursday$summary
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -60.90  -14.76  -13.32   -6.91 1046.13 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.485e+01  2.691e+00   5.518 5.11e-08 ***
## data[[predictor_var]] 9.347e-04  1.555e-04   6.009 3.25e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 58.79 on 596 degrees of freedom
## Multiple R-squared:  0.05713,    Adjusted R-squared:  0.05554 
## F-statistic: 36.11 on 1 and 596 DF,  p-value: 3.249e-09
## 
## 
## 
## $Friday
## $Friday$model
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Coefficients:
##           (Intercept)  data[[predictor_var]]  
##             1.423e+01              9.196e-04  
## 
## 
## $Friday$summary
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -53.44 -14.11 -12.60  -5.80 928.61 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.423e+01  2.440e+00   5.833 8.91e-09 ***
## data[[predictor_var]] 9.196e-04  1.498e-04   6.137 1.53e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53.09 on 596 degrees of freedom
## Multiple R-squared:  0.05944,    Adjusted R-squared:  0.05786 
## F-statistic: 37.66 on 1 and 596 DF,  p-value: 1.533e-09
## 
## 
## 
## $Saturday
## $Saturday$model
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Coefficients:
##           (Intercept)  data[[predictor_var]]  
##             11.622186               0.001022  
## 
## 
## $Saturday$summary
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56.75 -11.63 -10.54  -4.95 512.39 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.162e+01  1.895e+00   6.132 1.58e-09 ***
## data[[predictor_var]] 1.022e-03  1.201e-04   8.507  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.48 on 596 degrees of freedom
## Multiple R-squared:  0.1083, Adjusted R-squared:  0.1068 
## F-statistic: 72.38 on 1 and 596 DF,  p-value: < 2.2e-16
# Create an empty data frame to store the results
results_df <- data.frame(day = character(),
                         coefficient = numeric(),
                         p_value = numeric(),
                         r_squared = numeric(),
                         rmse = numeric(),
                         mae = numeric(),
                         stringsAsFactors = FALSE)

# Loop through the days of the week and extract the coefficients and p-values
for (day in days) {
  model <- results_list[[day]]$model
  coefficient <- coef(model)[2] # extract the coefficient for the predictor variable
  p_value <- summary(model)$coefficients[2, 4] # extract the p-value for the predictor variable
  r_squared <- summary(model)$r.squared # extract the R-squared
  y_pred <- predict(model) # make predictions using the model
  y_actual <- model$model[,1] # extract the actual values of the response variable
  rmse <- sqrt(mean((y_pred - y_actual)^2)) # calculate RMSE
  mae <- mean(abs(y_pred - y_actual)) # calculate MAE
  
  # Add the results to the data frame
  results_df <- rbind(results_df, data.frame(day = day,
                                              coefficient = coefficient,
                                              p_value = p_value,
                                              r_squared = r_squared,
                                              rmse = rmse,
                                              mae = mae,
                                              stringsAsFactors = FALSE))
}

# Print the data frame
results_df
##                              day  coefficient      p_value  r_squared     rmse
## data[[predictor_var]]     Sunday 0.0009195307 7.542562e-16 0.10338424 39.84876
## data[[predictor_var]]1    Monday 0.0008117363 8.593542e-05 0.02556670 70.38329
## data[[predictor_var]]2   Tuesday 0.0009163553 4.423855e-08 0.04905650 61.98704
## data[[predictor_var]]3 Wednesday 0.0008723951 1.485166e-07 0.04530360 60.00043
## data[[predictor_var]]4  Thursday 0.0009346791 3.248669e-09 0.05712562 58.69094
## data[[predictor_var]]5    Friday 0.0009196212 1.533090e-09 0.05943921 53.00570
## data[[predictor_var]]6  Saturday 0.0010221222 1.439645e-16 0.10828620 41.41427
##                             mae
## data[[predictor_var]]  16.66939
## data[[predictor_var]]1 23.80347
## data[[predictor_var]]2 21.71434
## data[[predictor_var]]3 20.91117
## data[[predictor_var]]4 21.19850
## data[[predictor_var]]5 19.95894
## data[[predictor_var]]6 16.94232

Based on the results, all days of the week in 2021 have statistically significant coefficients (p-values < 0.05), indicating a positive relationship between daily impressions and daily cost for every day of the week. Since cost is the response variable and impressions the predictor, each coefficient represents the additional cost per impression; the values are small, ranging from roughly 0.0008 to 0.0010 (about $0.80 to $1.00 per thousand impressions). The R-squared values are low, however, so impressions alone explain only a small share of the variation in daily cost.
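
As a quick illustration, using the fitted 2021 Sunday model (intercept ≈ 11.66, slope ≈ 0.00092), the expected Sunday cost at a given impression volume can be computed directly:

# Expected Sunday cost (2021 model) at 50,000 impressions
intercept <- 11.66
slope     <- 0.00092
intercept + slope * 50000   # roughly 57.7 (converted currency units)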

# Create a vector of days of the week
days <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")

# Create an empty list to store the regression results
results_list <- list()

# Loop through the days of the week and perform the regression for each day
for (day in days) {
  response_var_suffix <- "_Cost (Converted currency)"
  predictor_var_suffix <- "_Impr."
  results <- perform_regression(data = LM2, day = day, response_var_suffix = response_var_suffix, predictor_var_suffix = predictor_var_suffix)
  results_list[[day]] <- results
}

# Print the list of regression results
results_list
## $Sunday
## $Sunday$model
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Coefficients:
##           (Intercept)  data[[predictor_var]]  
##              9.940824               0.001056  
## 
## 
## $Sunday$summary
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65.71 -10.30  -9.89  -6.47 404.76 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           9.9408244  1.7761015   5.597 3.42e-08 ***
## data[[predictor_var]] 0.0010556  0.0001051  10.041  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.22 on 560 degrees of freedom
## Multiple R-squared:  0.1526, Adjusted R-squared:  0.1511 
## F-statistic: 100.8 on 1 and 560 DF,  p-value: < 2.2e-16
## 
## 
## 
## $Monday
## $Monday$model
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Coefficients:
##           (Intercept)  data[[predictor_var]]  
##             1.695e+01              8.835e-04  
## 
## 
## $Monday$summary
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -53.92  -17.02  -16.24  -11.81 1278.27 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.695e+01  3.481e+00   4.869 1.46e-06 ***
## data[[predictor_var]] 8.835e-04  2.049e-04   4.311 1.92e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 72.29 on 560 degrees of freedom
## Multiple R-squared:  0.03213,    Adjusted R-squared:  0.0304 
## F-statistic: 18.59 on 1 and 560 DF,  p-value: 1.918e-05
## 
## 
## 
## $Tuesday
## $Tuesday$model
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Coefficients:
##           (Intercept)  data[[predictor_var]]  
##             1.525e+01              9.212e-04  
## 
## 
## $Tuesday$summary
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -53.98  -15.38  -14.74  -10.37 1092.93 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.525e+01  3.056e+00   4.990 8.09e-07 ***
## data[[predictor_var]] 9.212e-04  1.753e-04   5.255 2.11e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63.46 on 560 degrees of freedom
## Multiple R-squared:  0.04699,    Adjusted R-squared:  0.04529 
## F-statistic: 27.61 on 1 and 560 DF,  p-value: 2.113e-07
## 
## 
## 
## $Wednesday
## $Wednesday$model
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Coefficients:
##           (Intercept)  data[[predictor_var]]  
##              13.11868                0.00114  
## 
## 
## $Wednesday$summary
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -72.42  -14.14  -13.20  -10.14 1025.10 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.312e+01  2.944e+00   4.456 1.01e-05 ***
## data[[predictor_var]] 1.140e-03  1.676e-04   6.804 2.62e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.17 on 560 degrees of freedom
## Multiple R-squared:  0.07636,    Adjusted R-squared:  0.07471 
## F-statistic:  46.3 on 1 and 560 DF,  p-value: 2.62e-11
## 
## 
## 
## $Thursday
## $Thursday$model
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Coefficients:
##           (Intercept)  data[[predictor_var]]  
##              13.51785                0.00107  
## 
## 
## $Thursday$summary
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56.94 -14.11 -13.53  -9.89 982.36 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.352e+01  2.917e+00   4.633 4.48e-06 ***
## data[[predictor_var]] 1.070e-03  1.716e-04   6.238 8.72e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.31 on 560 degrees of freedom
## Multiple R-squared:  0.06498,    Adjusted R-squared:  0.06331 
## F-statistic: 38.92 on 1 and 560 DF,  p-value: 8.724e-10
## 
## 
## 
## $Friday
## $Friday$model
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Coefficients:
##           (Intercept)  data[[predictor_var]]  
##             12.264344               0.001068  
## 
## 
## $Friday$summary
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -57.22 -12.83 -12.22  -8.77 789.62 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.226e+01  2.478e+00   4.948 9.91e-07 ***
## data[[predictor_var]] 1.068e-03  1.438e-04   7.425 4.23e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.43 on 560 degrees of freedom
## Multiple R-squared:  0.08962,    Adjusted R-squared:  0.08799 
## F-statistic: 55.13 on 1 and 560 DF,  p-value: 4.229e-13
## 
## 
## 
## $Saturday
## $Saturday$model
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Coefficients:
##           (Intercept)  data[[predictor_var]]  
##              9.804400               0.001146  
## 
## 
## $Saturday$summary
## 
## Call:
## lm(formula = data[[response_var]] ~ data[[predictor_var]], data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -80.39 -10.72  -9.83  -6.93 543.35 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           9.8044001  1.9698122   4.977 8.59e-07 ***
## data[[predictor_var]] 0.0011462  0.0001094  10.481  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.42 on 560 degrees of freedom
## Multiple R-squared:  0.164,  Adjusted R-squared:  0.1625 
## F-statistic: 109.9 on 1 and 560 DF,  p-value: < 2.2e-16
# Create an empty data frame to store the results
results_df <- data.frame(day = character(),
                         coefficient = numeric(),
                         p_value = numeric(),
                         r_squared = numeric(),
                         rmse = numeric(),
                         mae = numeric(),
                         stringsAsFactors = FALSE)

# Loop through the days of the week and extract the coefficients and p-values
for (day in days) {
  model <- results_list[[day]]$model
  coefficient <- coef(model)[2] # extract the coefficient for the predictor variable
  p_value <- summary(model)$coefficients[2, 4] # extract the p-value for the predictor variable
  r_squared <- summary(model)$r.squared # extract the R-squared
  y_pred <- predict(model) # make predictions using the model
  y_actual <- model$model[,1] # extract the actual values of the response variable
  rmse <- sqrt(mean((y_pred - y_actual)^2)) # calculate RMSE
  mae <- mean(abs(y_pred - y_actual)) # calculate MAE
  
  # Add the results to the data frame
  results_df <- rbind(results_df, data.frame(day = day,
                                              coefficient = coefficient,
                                              p_value = p_value,
                                              r_squared = r_squared,
                                              rmse = rmse,
                                              mae = mae,
                                              stringsAsFactors = FALSE))
}

# Print the data frame
results_df
##                              day  coefficient      p_value  r_squared     rmse
## data[[predictor_var]]     Sunday 0.0010556489 6.309236e-22 0.15257568 37.15553
## data[[predictor_var]]1    Monday 0.0008835492 1.917530e-05 0.03212581 72.15905
## data[[predictor_var]]2   Tuesday 0.0009212464 2.112782e-07 0.04698710 63.34668
## data[[predictor_var]]3 Wednesday 0.0011403059 2.619766e-11 0.07635873 61.06511
## data[[predictor_var]]4  Thursday 0.0010703763 8.724210e-10 0.06497926 60.20143
## data[[predictor_var]]5    Friday 0.0010680510 4.228972e-13 0.08962036 51.33713
## data[[predictor_var]]6  Saturday 0.0011461575 1.363949e-23 0.16399547 41.34690
##                             mae
## data[[predictor_var]]  16.98111
## data[[predictor_var]]1 26.69051
## data[[predictor_var]]2 24.12989
## data[[predictor_var]]3 23.56519
## data[[predictor_var]]4 23.19116
## data[[predictor_var]]5 20.84235
## data[[predictor_var]]6 17.82836

Based on the results, all days of the week in 2022 also have statistically significant coefficients (p-values < 0.05), again indicating a positive relationship between daily impressions and daily cost for every day of the week. The coefficients, representing the additional cost per impression, remain small, ranging from roughly 0.0009 to 0.0011 (about $0.90 to $1.15 per thousand impressions), slightly higher than in 2021.

Descriptive Analysis on Impressions and Clicks

The code chunk below creates bar plots that display the mean values of two variables, impressions and clicks, by day of the week. The data used to create the bar plots is contained in the LM1 data frame, which holds the number of impressions and clicks recorded for each campaign on each day of the week.

The first line of code calculates the means for each of the variables of interest (“Sunday_Impr.”, “Monday_Impr.”, etc.) using the colMeans() function.

The next two lines create the actual bar plots using the barplot() function. The first barplot() call creates a bar plot of the mean impressions by day of the week, and the second call creates a bar plot of the mean clicks by day of the week. The names.arg argument specifies the labels for the x-axis (i.e., the days of the week), while the main and ylab arguments specify the title and y-axis label for each plot, respectively.

# Calculate means for each day of the week
means <- colMeans(LM1[, c("Sunday_Impr.", "Monday_Impr.", "Tuesday_Impr.", "Wednesday_Impr.", "Thursday_Impr.", "Friday_Impr.", "Saturday_Impr.", "Sunday_Clicks", "Monday_Clicks", "Tuesday_Clicks", "Wednesday_Clicks", "Thursday_Clicks", "Friday_Clicks", "Saturday_Clicks")])

# Create bar plot
barplot(means[1:7], names.arg = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"), 
        main = "Mean Impressions by Day of Week in 2021", xlab = "Day of Week", ylab = "Mean Impressions")

barplot(means[8:14], names.arg = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"), 
        main = "Mean Clicks by Day of Week in 2021", xlab = "Day of Week", ylab = "Mean Clicks")

The first bar plot shows the mean number of impressions by day of the week. From the plot, it can be observed that impressions tend to be highest on Thursdays and lowest on Saturdays, with a general bell-curve trend from Monday to Sunday. The second bar plot shows the mean number of clicks by day of the week. Clicks tend to be highest on Tuesdays and lowest on Sundays, peaking on Tuesdays, dipping midweek, and rising again on Thursdays.

Overall, these plots suggest that the day of the week may have a significant impact on the number of impressions and clicks that an advertisement receives, with higher numbers on weekdays compared to weekends in the year 2021.
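
Rather than reading the peaks off the plots, the same means vector can be queried directly. A small sanity check, assuming the means object from the chunk above:

# Identify the days with the highest and lowest mean impressions and clicks in 2021
names(which.max(means[1:7]))   # day with the highest mean impressions
names(which.min(means[1:7]))   # day with the lowest mean impressions
names(which.max(means[8:14]))  # day with the highest mean clicks
names(which.min(means[8:14]))  # day with the lowest mean clicks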

# Calculate means for each day of the week
means <- colMeans(LM2[, c("Sunday_Impr.", "Monday_Impr.", "Tuesday_Impr.", "Wednesday_Impr.", "Thursday_Impr.", "Friday_Impr.", "Saturday_Impr.", "Sunday_Clicks", "Monday_Clicks", "Tuesday_Clicks", "Wednesday_Clicks", "Thursday_Clicks", "Friday_Clicks", "Saturday_Clicks")])

# Create bar plot
barplot(means[1:7], names.arg = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"), 
        main = "Mean Impressions by Day of Week in 2022", xlab = "Day of Week", ylab = "Mean Impressions")

barplot(means[8:14], names.arg = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"), 
        main = "Mean Clicks by Day of Week in 2022", xlab = "Day of Week", ylab = "Mean Clicks")

From the first plot, it can be observed that impressions tend to be highest on Wednesdays and lowest on Sundays, with a general bell-curve trend peaking mid-week. The second bar plot shows the mean number of clicks by day of the week: clicks tend to be highest on Mondays and lowest on Sundays, with a general decreasing trend from Monday through Sunday.

Overall, these plots suggest that the day of the week may have a significant impact on the number of impressions and clicks that an advertisement receives, with higher numbers at the beginning of the week compared to the weekend in 2022.

K-means Clustering

K-means clustering is used to group the data points into clusters so that the average number of clicks per day within each cluster can be analyzed and compared. This provides insight into which observations tend to have higher or lower click rates, depending on the cluster they belong to.

To begin, the optimal number of clusters to be used in the K-means clustering is determined using the elbow method. The elbow method involves calculating the within-cluster sum of squares (WSS) for different numbers of clusters and then plotting the WSS against the number of clusters. The “elbow” in the plot represents the point of diminishing returns where the addition of more clusters does not significantly reduce the WSS. This elbow point is considered to be the optimal number of clusters.

In the code, the kmeans function is used to fit a clustering model for each value of i between 1 and 10, and the total within-cluster sum of squares is calculated using the tot.withinss attribute of the resulting kmeans object. The wss vector is then updated with these values. Finally, a plot is generated with the number of clusters on the x-axis and the WSS on the y-axis to visually identify the elbow point.

# Find optimal number of clusters using the elbow method
set.seed(123)
wss <- c()
for(i in 1:10) {
  kmeans_model <- kmeans(LM1, centers = i, nstart = 10)
  wss[i] <- kmeans_model$tot.withinss
}

plot(1:10, wss, type = "b", xlab = "Number of clusters", ylab = "Within groups sum of squares")

From the plot above, 3 has been determined to be the optimal number of clusters to be used in the k-means cluster analysis.

library(fpc)
# Create k-means clustering model with 3 clusters
set.seed(123)
kmeans_model1 <- kmeans(LM1[, 1:21], 3)
kmeans_model1
## K-means clustering with 3 clusters of sizes 18, 82, 498
## 
## Cluster means:
##   Sunday_Impr. Sunday_Clicks  Sunday_CTR Sunday_Cost (Converted currency)
## 1    71560.944     227.16667 0.002755556                         80.28111
## 2    22536.878      39.02439 0.001720732                         26.92012
## 3     2436.271      17.45382 0.024028313                         14.69847
##   Monday_Impr. Monday_Clicks  Monday_CTR Monday_Cost (Converted currency)
## 1     67569.56     198.72222 0.002716667                         74.57389
## 2     22241.74      35.70732 0.001629268                         26.67476
## 3      2515.53      22.07229 0.025269076                         19.91982
##   Tuesday_Impr. Tuesday_Clicks Tuesday_CTR Tuesday_Cost (Converted currency)
## 1     74146.222      224.94444 0.002716667                          87.67833
## 2     23934.037       40.00000 0.001651220                          28.89463
## 3      2687.331       22.27711 0.026073896                          18.65444
##   Wednesday_Impr. Wednesday_Clicks Wednesday_CTR
## 1       71717.722        210.00000    0.00260000
## 2       24092.268         37.28049    0.00154878
## 3        2701.319         21.62048    0.02684538
##   Wednesday_Cost (Converted currency) Thursday_Impr. Thursday_Clicks
## 1                            80.85722      74771.778       229.27778
## 2                            29.40646      23860.720        38.53659
## 3                            18.17271       2701.341        21.64056
##   Thursday_CTR Thursday_Cost (Converted currency) Friday_Impr.
## 1  0.002727778                           88.43944    70125.778
## 2  0.001598780                           29.57732    23184.098
## 3  0.029521486                           18.48765     2566.333
## 
## Clustering vector:
##   [1] 3 2 3 3 3 2 3 3 3 3 3 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 3 3
##  [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 2 3 2 3 3 3 3 3 2 3 2 3 3 3 3 3 3 3 3 3 2 3
##  [75] 3 2 3 1 3 3 3 2 2 2 3 2 3 3 2 3 3 3 1 1 3 2 3 3 3 3 3 1 3 3 1 3 3 2 3 3 3
## [112] 1 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 1 3 3 2 3 3 3 3 3 3 3
## [186] 3 3 3 2 1 3 1 3 3 3 3 3 2 2 3 3 3 2 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [223] 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 2 3 3 2 3 3 2 3 3 3 2 3 3 3 3
## [260] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [297] 3 3 3 2 3 3 3 2 2 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [334] 3 3 3 3 3 3 3 3 3 3 2 3 2 2 3 3 3 2 3 1 1 2 1 3 2 3 3 2 3 3 3 2 2 3 3 3 3
## [371] 2 3 2 3 3 3 2 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [408] 3 3 3 3 3 3 2 2 3 1 3 2 1 2 3 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [445] 3 3 2 3 3 2 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 2 2 3
## [482] 2 3 3 2 3 3 3 3 2 3 3 2 3 2 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 2 3
## [519] 3 3 3 2 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 2 1 2 3 2 1 2 3 3 3 3 3 3 3 3 3
## [556] 2 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3
## [593] 3 3 3 3 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 73507656765 43649770295 34350576403
##  (between_SS / total_SS =  80.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
# Plot the clusters using the first two principal components
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.3
LM1_pca <- prcomp(LM1, center = TRUE, scale. = TRUE)
LM1_pca_df <- data.frame(x = LM1_pca$x[,1], y = LM1_pca$x[,2], cluster = as.factor(kmeans_model1$cluster))

ggplot(LM1_pca_df, aes(x = x, y = y, color = cluster)) + 
  geom_point(size = 3) + 
  labs(title = "K-Means Clustering of Google 2021 data", x = "PC1", y = "PC2")

Based on the clustering results for 2021, there appear to be three distinct patterns in the data. The smallest cluster (18 observations) has the highest mean values for impressions and clicks, together with relatively low click-through rates and the highest costs. This cluster likely represents campaigns that generate high levels of traffic but with comparatively low engagement per impression.

The mid-sized cluster (82 observations) has lower mean impressions, clicks, and costs than the smallest cluster, and also the lowest click-through rates, indicating campaigns with moderate traffic but limited engagement.

The largest cluster (498 observations) has the lowest mean values for impressions, clicks, and costs, but by far the highest click-through rates. This cluster represents the many smaller campaigns whose clicks are proportionally of higher quality, albeit at much lower traffic volume.
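
The sizes and centers stored in the fitted kmeans object can be used to verify this reading of the output. A brief check, assuming kmeans_model1 as fitted above (the column names match those printed in the cluster means):

# Cluster sizes and a few representative center values (Sunday metrics)
kmeans_model1$size
round(kmeans_model1$centers[, c("Sunday_Impr.", "Sunday_Clicks", "Sunday_CTR")], 4)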

# Repeat the steps for Google22 Dataset
set.seed(123)
wss <- c()
for(i in 1:10) {
  kmeans_model <- kmeans(LM2, centers = i, nstart = 10)
  wss[i] <- kmeans_model$tot.withinss
}

plot(1:10, wss, type = "b", xlab = "Number of clusters", ylab = "Within groups sum of squares")

# Create k-means clustering model with 3 clusters
set.seed(123)
kmeans_model2 <- kmeans(LM2[, 1:21], 3)
kmeans_model2
## K-means clustering with 3 clusters of sizes 10, 473, 79
## 
## Cluster means:
##   Sunday_Impr. Sunday_Clicks  Sunday_CTR Sunday_Cost (Converted currency)
## 1    89137.500     102.20000 0.001250000                         97.27500
## 2     3171.218      19.81607 0.024317336                         13.79116
## 3    25909.304      53.24051 0.001536709                         35.13861
##   Monday_Impr. Monday_Clicks Monday_CTR Monday_Cost (Converted currency)
## 1    88507.100     100.80000 0.00131000                        101.89000
## 2     3463.461      26.19662 0.02425032                         20.73063
## 3    26343.759      49.86076 0.00148481                         35.06354
##   Tuesday_Impr. Tuesday_Clicks Tuesday_CTR Tuesday_Cost (Converted currency)
## 1     90681.100      113.20000 0.001370000                          99.13400
## 2      3601.533       24.90063 0.023829387                          19.35167
## 3     26802.329       51.31646 0.001462025                          35.20165
##   Wednesday_Impr. Wednesday_Clicks Wednesday_CTR
## 1       90346.100        112.20000   0.001370000
## 2        3616.469         24.97252   0.024582241
## 3       27069.759         54.58228   0.001472152
##   Wednesday_Cost (Converted currency) Thursday_Impr. Thursday_Clicks
## 1                            98.87700      85629.500        95.70000
## 2                            18.42095       3639.719        24.47357
## 3                            39.11633      26584.696        54.44304
##   Thursday_CTR Thursday_Cost (Converted currency) Friday_Impr.
## 1  0.001300000                           94.45000    88505.600
## 2  0.024173150                           18.31748     3592.423
## 3  0.001459494                           37.91975    26556.759
## 
## Clustering vector:
##   [1] 2 2 2 3 3 2 1 1 2 3 2 2 2 3 2 2 2 3 2 2 2 3 2 2 3 2 2 2 3 2 2 2 2 2 2 2 2
##  [38] 2 2 2 3 2 3 2 2 2 2 2 3 2 2 2 2 2 2 3 2 1 3 1 3 3 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 3 2 3 2 2 3 2 2 2 1 2 3 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2
## [112] 2 2 2 2 2 2 2 2 2 3 2 2 2 1 3 3 2 2 2 2 2 2 2 3 2 2 2 2 3 2 2 2 2 2 2 2 2
## [149] 2 2 2 2 2 3 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 3 2 2 2 3 2 2 2 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 2 2 2 2 3 2 3 2 2 2 2 3 3 2 2 2 2 2 3 2 3 2 2 2 2 2 3 2 2 2 2 2
## [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 3
## [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 3 2 2 2 2 2 2 2 2 2 2 3 2 3 2 2 2 2
## [297] 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2
## [334] 2 2 2 3 3 2 2 2 2 2 2 2 3 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [371] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 2 2 2 2 2 3
## [408] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3
## [445] 2 2 2 2 3 2 3 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 1 3 1 2 2 2 3 2 2 2 3 2 2 2
## [482] 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 3 3 3 2 2 2 2 2 2 2 2 2 2
## [519] 3 2 2 2 3 2 3 3 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 3 3 2 2 2 2 2
## [556] 2 2 2 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 51572190595 46144174666 55947051159
##  (between_SS / total_SS =  79.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
LM2_pca <- prcomp(LM2, center = TRUE, scale. = TRUE)
LM2_pca_df <- data.frame(x = LM2_pca$x[,1], y = LM2_pca$x[,2], cluster = as.factor(kmeans_model2$cluster))

ggplot(LM2_pca_df, aes(x = x, y = y, color = cluster)) + 
  geom_point(size = 3) + 
  labs(title = "K-Means Clustering of Google 2022 data", x = "PC1", y = "PC2")

Cluster 1 (10 observations) has the highest average impressions, clicks, and cost across all days, but a very low click-through rate (CTR). This cluster may represent large campaigns whose audiences are widely exposed to the ads but rarely engage with them, resulting in a low CTR despite the high spend.

Cluster 2 (473 observations) has the lowest average impressions, clicks, and cost across all days, yet the highest CTR. This cluster may represent the many smaller campaigns whose audiences see fewer ads but engage with them at a much higher rate.

Cluster 3 (79 observations) sits between the other two, with moderate impressions, clicks, and cost and a low CTR, similar in profile to Cluster 1 but at a smaller scale.

Overall, the k-means clustering analysis has provided insights into the relationships between impressions, clicks, and costs in an advertising campaign. By grouping the data into clusters, advertisers can optimize their campaigns based on their goals and priorities. For example, campaigns that prioritize traffic volume may focus on strategies that increase impressions and clicks, while campaigns that prioritize click quality may focus on optimizing their targeting and ad copy.

Dendrogram

The cluster dendrogram is a graphical representation of the hierarchical clustering analysis performed on the LM1 dataset. The dendrogram shows the relationships between the observations (or samples) based on the first 21 variables in the dataset.

Each observation begins as its own leaf at the bottom of the dendrogram, and the height at which two branches merge indicates the distance between the groups being joined. The lower the merge height, the more similar the observations are to each other.

The dendrogram is divided into clusters based on a cutoff value, which is determined by the analyst. In this case, the rect.hclust() function was used to draw a rectangle around three clusters. These clusters can be interpreted as groups of observations that are more similar to each other than they are to observations in other clusters.

Google21_hc <- hclust(dist(LM1[,1:21]))
plot(Google21_hc)
rect.hclust(Google21_hc, k = 3)

The dendrogram indicates that the data are not clearly separated into three distinct clusters, but rather form a continuum of values. The three clusters identified by rect.hclust() overlap or contain many data points that are difficult to assign to a single cluster. This suggests that the variables may not have a clear underlying structure and that grouping the data into three clusters may be somewhat arbitrary or subjective. Cluster dendrograms are therefore not reproduced for the other datasets.
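
One way to quantify how closely the hierarchical grouping agrees with the k-means solution is to cut the dendrogram at k = 3 with cutree() and cross-tabulate the two sets of labels. A sketch, assuming Google21_hc and kmeans_model1 from the chunks above:

# Cut the dendrogram into 3 clusters and compare the assignments with k-means
hc_clusters <- cutree(Google21_hc, k = 3)
table(hierarchical = hc_clusters, kmeans = kmeans_model1$cluster)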

Facebook Analysis

Facebook21 <- read_excel("C:/Users/Lynx/Documents/MSDA/MSDA 624 - Analytics Capstone/2021-Facebook-Performance Formatted.xlsx")
Facebook22 <- read_excel("C:/Users/Lynx/Documents/MSDA/MSDA 624 - Analytics Capstone/2022-Facebook-Performance Formatted.xlsx")

Exploratory Analysis

The analysis of the Facebook datasets begins by examining how impressions and clicks vary with the day of the week. The code creates a new data frame called FB21 from the existing Facebook21 data frame, keeping only the columns for “Day of the Week”, “Impressions”, and “Clicks (all)”. By creating an ordered factor variable, the days of the week are assigned a specific order. This is important for later analysis because it allows the days of the week to be compared chronologically rather than alphabetically.

The factor() function takes the existing “Day of the Week” column in FB21, and assigns it new levels based on the specified order: Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, and Saturday. The ordered = TRUE argument ensures that the factor variable is ordered.

# create a new dataframe with only relevant columns
FB21 <- Facebook21 %>% select(`Day of the Week`, Impressions, `Clicks (all)`)

# create ordered factor variable for day of the week
FB21$`Day of the Week` <- factor(FB21$`Day of the Week`, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"), ordered = TRUE)
# create a new dataframe with only relevant columns
FB22 <- Facebook22 %>% select(`Day of the Week`, Impressions, `Clicks (all)`)

# create ordered factor variable for day of the week
FB22$`Day of the Week` <- factor(FB22$`Day of the Week`, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"), ordered = TRUE)
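
A quick check that the ordering was applied, and that comparisons now follow the week rather than the alphabet, assuming the FB21 data frame created above:

# Verify the chronological ordering of the factor levels
levels(FB21$`Day of the Week`)

# Ordered factors support inequality comparisons, e.g. rows from the first half of the week
head(FB21[FB21$`Day of the Week` < "Thursday", ])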

The purpose of this code is to combine two data frames (FB21 and FB22) containing Facebook ad campaign data for different time periods, and then calculate the mean impressions and clicks by day of the week and dataset. The new dataset column is created to identify which data frame each row belongs to. The group_by function is used to group the data by day of the week and dataset, and the summarize function calculates the mean of impressions and clicks for each group. Finally, ggplot is used to create two bar charts to compare the mean impressions and clicks by day of the week and dataset.

# create a new column to indicate which dataset each row belongs to
FB21$dataset <- "FB21"
FB22$dataset <- "FB22"

# combine the datasets
FB_combined <- rbind(FB21, FB22)

# calculate mean impressions and clicks by day of the week and dataset
summary_combined <- FB_combined %>% 
  group_by(`Day of the Week`, dataset) %>% 
  summarize(mean_impressions = mean(Impressions),
            mean_clicks = mean(`Clicks (all)`))
## `summarise()` has grouped output by 'Day of the Week'. You can override using
## the `.groups` argument.
# plot mean impressions by day of the week and dataset
ggplot(summary_combined, aes(x = `Day of the Week`, y = mean_impressions, fill = dataset)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Mean impressions by day of the week and dataset", fill = "Dataset")

# plot mean clicks by day of the week and dataset
ggplot(summary_combined, aes(x = `Day of the Week`, y = mean_clicks, fill = dataset)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Mean clicks by day of the week and dataset", fill = "Dataset")

The results show the mean impressions and mean clicks for each day of the week and dataset (FB21 and FB22). The Day of the Week column is an ordered factor variable with seven levels (Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, and Saturday) that represents the day of the week. The dataset column indicates whether the row belongs to FB21 or FB22. The mean_impressions column shows the mean number of impressions for each day of the week and dataset, while the mean_clicks column shows the mean number of clicks for each day of the week and dataset.

The results suggest that there were slightly more impressions in 2021 than in 2022; however, there were more clicks in 2022. This could mean that while the number of people receiving ads has slightly declined, the quality of the Facebook audience being served ads has increased over the year.
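
Collapsing the day-level summary back down to one row per dataset makes this year-over-year comparison explicit. A short sketch, assuming the summary_combined data frame from the chunk above:

# Average the day-of-week means within each dataset to compare 2021 vs. 2022 overall
summary_combined %>%
  group_by(dataset) %>%
  summarize(overall_mean_impressions = mean(mean_impressions),
            overall_mean_clicks = mean(mean_clicks))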

Time-series Analysis

This code analyzes the performance of Facebook ads by the hour of the day in which they were shown. First, the Day and Time of day (ad account time zone) columns in the Facebook21 data frame are converted to date and time formats using the as.Date() and as.POSIXct() functions. Next, a new data frame, summary_data, groups the data by hour of the day and calculates the total number of impressions and clicks for each hour. Finally, ggplot() is used to create two plots: one showing total impressions by hour of day and another showing total clicks by hour of day. The geom_col() function creates the column charts, and the ggtitle(), xlab(), and ylab() functions add the titles and axis labels.

# Load necessary libraries
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ✔ readr     2.1.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Convert Day and Time of day to Date and Time data types
FB21$Day <- as.Date(Facebook21$Day)
FB21$`Time of day` <- as.POSIXct(strptime(Facebook21$`Time of day (ad account time zone)`, "%H:%M:%S"))

# Create a new data frame to group the data by hour and summarize impressions and clicks
summary_data <- FB21 %>%
  group_by(hour = hour(`Time of day`)) %>%
  summarize(total_impressions = sum(Impressions), total_clicks = sum(`Clicks (all)`)) %>%
  arrange(desc(total_impressions))

# Print the summary data
summary_data
## # A tibble: 24 × 3
##     hour total_impressions total_clicks
##    <int>             <dbl>        <dbl>
##  1    21           1316263        10046
##  2    20           1272344         9519
##  3    19           1174671         9201
##  4    22           1137608         8313
##  5    18           1077379         8135
##  6    17            995464         7430
##  7    16            952480         6930
##  8    15            899882         6648
##  9    23            863145         5749
## 10    14            860553         6182
## # ℹ 14 more rows
# Plot the data to visualize the results
ggplot(summary_data, aes(x = hour, y = total_impressions)) +
  geom_col(fill = "blue") +
  ggtitle("Total Impressions by Hour of Day") +
  xlab("Hour of Day") +
  ylab("Total Impressions")

ggplot(summary_data, aes(x = hour, y = total_clicks)) +
  geom_col(fill = "green") +
  ggtitle("Total Clicks by Hour of Day") +
  xlab("Hour of Day") +
  ylab("Total Clicks")

It can be observed from the plot that ad impressions begin to rise around 12 PM, reach their peak at 9 PM, and then decline toward midnight. This may indicate that the targeted audience for these ads is most active during those hours. The plot of total clicks by hour of day shows a similar trend, also peaking at 9 PM before dropping off sharply, indicating that the audience is most likely to click on the ads during these peak evening hours and far less so during the rest of the day.
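
The summary table can also show whether the evening peak in volume coincides with a higher click-through rate, rather than just higher raw counts. A quick sketch, assuming summary_data as computed above:

# Click-through rate by hour: do the high-traffic evening hours also convert impressions to clicks more often?
summary_data %>%
  mutate(ctr = total_clicks / total_impressions) %>%
  arrange(desc(ctr)) %>%
  head(5)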

# Convert Day and Time of day to Date and Time data types
FB22$Day <- as.Date(Facebook22$Day)
FB22$`Time of day` <- as.POSIXct(strptime(Facebook22$`Time of day (ad account time zone)`, "%H:%M:%S"))

# Create a new data frame to group the data by hour and summarize impressions and clicks
summary_data2 <- FB22 %>%
  group_by(hour = hour(`Time of day`)) %>%
  summarize(total_impressions = sum(Impressions), total_clicks = sum(`Clicks (all)`)) %>%
  arrange(desc(total_impressions))

# Print the summary data
summary_data2
## # A tibble: 24 × 3
##     hour total_impressions total_clicks
##    <int>             <dbl>        <dbl>
##  1    21           1143179         9950
##  2    20           1125237         9465
##  3    19           1048500         8874
##  4    18            965974         8203
##  5    22            962205         8207
##  6    17            893509         7292
##  7    16            847821         6833
##  8    15            790457         6396
##  9    14            749190         5937
## 10    13            724430         5513
## # ℹ 14 more rows
# Plot the data to visualize the results
ggplot(summary_data2, aes(x = hour, y = total_impressions)) +
  geom_col(fill = "blue") +
  ggtitle("Total Impressions by Hour of Day") +
  xlab("Hour of Day") +
  ylab("Total Impressions")

ggplot(summary_data2, aes(x = hour, y = total_clicks)) +
  geom_col(fill = "green") +
  ggtitle("Total Clicks by Hour of Day") +
  xlab("Hour of Day") +
  ylab("Total Clicks")

Comparing the results between 2021 and 2022, the overall pattern of total impressions and clicks across the hours of the day is very similar in both years. The peak hour is 9-10 PM (hour 21) in both 2021 and 2022, followed by 8-9 PM (hour 20) and 7-8 PM (hour 19). The main differences are in magnitude and in the shape of the evening tail: total impressions are lower in 2022 for every one of the top hours, while clicks at the peak hour are nearly unchanged (9,950 in 2022 versus 10,046 in 2021). In addition, the 6-7 PM hour (hour 18) edges ahead of the 10-11 PM hour (hour 22) in 2022, and the early afternoon (hour 13) appears among the top ten hours in 2022 where the late-night hour 23 appeared in 2021.

Overall, Facebook ad volume appears to have declined modestly in 2022 while clicks held roughly steady, and the hourly distribution shifted slightly toward earlier evening and afternoon hours. However, more analysis would be needed to determine whether these changes are statistically significant or due to random variation.
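
As a first step toward that comparison, the two hourly summaries can be joined and the year-over-year change computed per hour. A starting-point sketch, assuming summary_data (2021) and summary_data2 (2022) as built above:

# Join the 2021 and 2022 hourly summaries and compute the change in impressions and clicks
hourly_change <- summary_data %>%
  inner_join(summary_data2, by = "hour", suffix = c("_2021", "_2022")) %>%
  mutate(impressions_change = total_impressions_2022 - total_impressions_2021,
         clicks_change = total_clicks_2022 - total_clicks_2021) %>%
  arrange(hour)
hourly_change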

Simpli.fi Analysis

Simp21 <- read_excel("C:/Users/Lynx/Documents/MSDA/MSDA 624 - Analytics Capstone/Simpli.fi Hourly Performance by Campaign by Day 2021.xlsx")
Simp22 <- read_excel("C:/Users/Lynx/Documents/MSDA/MSDA 624 - Analytics Capstone/Simpli.fi Hourly Performance by Campaign by Day 2022.xlsx")

Descriptive Analysis

The final set of analyses is conducted to determine whether there is a specific time of day with a higher volume of impressions and clicks in a Simpli.fi campaign. The code chunks below examine the average and total impressions and clicks per hour of the day for Simpli.fi in both 2021 and 2022.

The first code chunk calculates the average impressions and clicks per hour of the day by grouping the data by the hour of the day using the group_by() function. Then, the summarize() function is used to calculate the mean of the Performance Impressions and Performance Clicks columns. The data is then arranged by the hour of the day and plotted using ggplot() with geom_line() to show the distribution of the average impressions and clicks per hour. The scale_color_manual() function sets the colors for the lines and scale_y_log10() sets the y-axis to a log scale.

Simp21 %>%
  group_by(`Time Event Hour`) %>%
  summarize(avg_impressions = mean(`Performance Impressions`),
            avg_clicks = mean(`Performance Clicks`)) %>%
  arrange(`Time Event Hour`) %>%
  ggplot(aes(x = `Time Event Hour`)) +
  geom_line(aes(y = avg_impressions, color = "Average Impressions")) +
  geom_line(aes(y = avg_clicks, color = "Average Clicks")) +
  scale_color_manual(values = c("blue", "green")) +
  labs(title = "Distribution of Average Impressions and Clicks per Hour",
       x = "Hour of the Day",
       y = "Average Count",
       color = "Metric") +
  scale_y_log10()

The second code chunk calculates the total impressions and clicks per hour of the day by grouping the data by the hour of the day using the group_by() function. Then, the summarize() function is used to calculate the sum of the Performance Impressions and Performance Clicks columns. The data is then arranged by the hour of the day and plotted using ggplot() with geom_line() to show the distribution of the total impressions and clicks per hour. The scale_color_manual() function sets the colors for the lines and scale_y_sqrt() sets the y-axis to a square root scale.

Simp21 %>%
  group_by(`Time Event Hour`) %>%
  summarize(total_impressions = sum(`Performance Impressions`),
            total_clicks = sum(`Performance Clicks`)) %>%
  arrange(`Time Event Hour`)%>%
  ggplot(aes(x = `Time Event Hour`)) +
  geom_line(aes(y = total_impressions, color = "Total Impressions")) +
  geom_line(aes(y = total_clicks, color = "Total Clicks")) +
  scale_color_manual(values = c("blue", "green")) +
  labs(title = "Distribution of Total Impressions and Clicks per Hour",
       x = "Hour of the Day",
       y = "Total Count",
       color = "Metric") +
  scale_y_sqrt()

Simp22 %>%
  group_by(`Time Event Hour`) %>%
  summarize(avg_impressions = mean(`Performance Impressions`),
            avg_clicks = mean(`Performance Clicks`)) %>%
  arrange(`Time Event Hour`) %>%
  ggplot(aes(x = `Time Event Hour`)) +
  geom_line(aes(y = avg_impressions, color = "Average Impressions")) +
  geom_line(aes(y = avg_clicks, color = "Average Clicks")) +
  scale_color_manual(values = c("blue", "green")) +
  labs(title = "Distribution of Average Impressions and Clicks per Hour",
       x = "Hour of the Day",
       y = "Average Count",
       color = "Metric") +
  scale_y_log10()

Simp22 %>%
  group_by(`Time Event Hour`) %>%
  summarize(total_impressions = sum(`Performance Impressions`),
            total_clicks = sum(`Performance Clicks`)) %>%
  arrange(`Time Event Hour`)%>%
  ggplot(aes(x = `Time Event Hour`)) +
  geom_line(aes(y = total_impressions, color = "Total Impressions")) +
  geom_line(aes(y = total_clicks, color = "Total Clicks")) +
  scale_color_manual(values = c("blue", "green")) +
  labs(title = "Distribution of Total Impressions and Clicks per Hour",
       x = "Hour of the Day",
       y = "Total Count",
       color = "Metric") +
  scale_y_sqrt()

Overall, the average impressions and clicks for both years were relatively similar, with little fluctuation across the hours of the day. The total number of clicks follows a similar trendline to the average number of clicks for both 2021 and 2022. However, total impressions in both 2021 and 2022 show a similar fluctuation: impressions decrease in the hours approaching midnight before picking up again shortly after 4 AM. This could be attributed to a variety of factors, such as changes in user behavior at certain times of the day, changes in ad targeting or bidding strategies during those hours, or changes in the number of competitors advertising on the platform at those times. Further analysis and data would be required to determine the exact cause of the fluctuation.
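
One follow-up that can be run on the existing columns is an hourly click-through rate for each year, which would show whether the overnight dip in impressions also changes how often the remaining impressions are clicked. A sketch, assuming the Simp21 and Simp22 data frames loaded above:

# Hourly click-through rate for Simpli.fi in 2021 and 2022
simp_ctr21 <- Simp21 %>%
  group_by(`Time Event Hour`) %>%
  summarize(ctr = sum(`Performance Clicks`) / sum(`Performance Impressions`))

simp_ctr22 <- Simp22 %>%
  group_by(`Time Event Hour`) %>%
  summarize(ctr = sum(`Performance Clicks`) / sum(`Performance Impressions`))

ggplot() +
  geom_line(data = simp_ctr21, aes(x = `Time Event Hour`, y = ctr, color = "2021")) +
  geom_line(data = simp_ctr22, aes(x = `Time Event Hour`, y = ctr, color = "2022")) +
  labs(title = "Simpli.fi Click-Through Rate by Hour", x = "Hour of the Day", y = "CTR", color = "Year")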

Comparing the results between platforms

# Combine the variables containing clicks for each day of the week into a total
Google21$Total_Clicks <- rowSums(Google21[, c("Sunday_Clicks", "Monday_Clicks", "Tuesday_Clicks", 
                                            "Wednesday_Clicks", "Thursday_Clicks", "Friday_Clicks", 
                                            "Saturday_Clicks")])

# Calculate the mean clicks for each dataset
mean_clicks21 <- c(mean(Google21$Total_Clicks), mean(Facebook21$`Clicks (all)`), mean(Simp21$`Performance Clicks`))

# Create a data frame with the mean clicks and dataset names
df_clicks21 <- data.frame(Dataset = c("Google 2021", "Facebook 2021", "Simpli.fi 2021"), Mean_Clicks = mean_clicks21)

# Create a bar chart of mean clicks by dataset
ggplot(df_clicks21, aes(x = Dataset, y = Mean_Clicks, fill = Dataset)) +
  geom_bar(stat = "identity") +
  ggtitle("Mean Clicks by Dataset") +
  xlab("Dataset") +
  ylab("Mean Clicks") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_sqrt()

# Combine the variables containing clicks for each day of the week into a total
Google22$Total_Clicks <- rowSums(Google22[, c("Sunday_Clicks", "Monday_Clicks", "Tuesday_Clicks", 
                                            "Wednesday_Clicks", "Thursday_Clicks", "Friday_Clicks", 
                                            "Saturday_Clicks")])

# Calculate the mean clicks for each dataset
mean_clicks22 <- c(mean(Google22$Total_Clicks), mean(Facebook22$`Clicks (all)`), mean(Simp22$`Performance Clicks`))

# Create a data frame with the mean clicks and dataset names
df_clicks22 <- data.frame(Dataset = c("Google 2022", "Facebook 2022", "Simpli.fi 2022"), Mean_Clicks = mean_clicks22)

# Create a bar chart of mean clicks by dataset
ggplot(df_clicks22, aes(x = Dataset, y = Mean_Clicks, fill = Dataset)) +
  geom_bar(stat = "identity") +
  ggtitle("Mean Clicks by Dataset") +
  xlab("Dataset") +
  ylab("Mean Clicks") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_sqrt()

Conclusion

In conclusion, data on Google, Facebook, and Simpli.fi advertising campaigns for 2021 and 2022 were analyzed. The data was presented through various plots and charts to help understand the trends in impressions and clicks for each platform. It can be observed that there were fluctuations in the number of impressions and clicks for all platforms throughout the year, with some months having higher values than others.

For Google, linear regression analysis was used to determine the relationship between two continuous variables. A new data frame, LM1, was created by copying the original data frame, removing irrelevant variables, and converting character-type variables to numeric. Linear regression was then performed using a function called perform_regression, given a specified day of the week, a response variable, and a predictor variable. Based on the results, all days of the week in both 2021 and 2022 have statistically significant coefficients, indicating a positive relationship between ad cost and performance impressions on every day of the week. Furthermore, the analyses showed that the time of day has an impact on the number of impressions and clicks on the Facebook and Simpli.fi platforms. The total number of impressions for both years followed a similar trend, with a decrease in impressions closer to midnight before picking up again shortly after 4 AM. In contrast, the total number of clicks showed a consistent pattern throughout the day with only minor fluctuations.

The conclusion that Google performed best of the three platforms in terms of mean clicks is based solely on the numbers and statistics presented in this analysis. However, it is important to keep in mind that this is only a surface-level analysis, and several factors could be influencing the outcome. One of the main factors to consider is the difference in the number of data points for each platform: the analysis showed that Google had a significantly higher number of data points than Facebook and Simpli.fi, which could skew the results in Google's favor. It is also possible that the higher number of data points for Google reflects a higher frequency of activity on that platform during the period covered by the dataset. Another important factor is the nature of the different platforms. Each platform has its own user base, user behavior, and ad targeting options; for example, Google Ads are primarily search-based, while Facebook Ads are primarily social media-based. This means that the effectiveness of each platform may vary depending on the specific goals and target audience of the advertising campaign.
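
The imbalance in data volume mentioned above can be checked directly by counting the rows in each source data frame. A simple check, assuming the objects loaded earlier in the report (note that the platforms report at different granularities, so row counts reflect reporting structure as well as activity):

# Number of observations available for each platform and year
c(Google_2021 = nrow(Google21), Google_2022 = nrow(Google22),
  Facebook_2021 = nrow(Facebook21), Facebook_2022 = nrow(Facebook22),
  Simplifi_2021 = nrow(Simp21), Simplifi_2022 = nrow(Simp22))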

To expand on the conclusion, it is important to note that the analyses presented in this report are just a starting point for understanding the performance of digital advertising campaigns. There are many other factors that can influence the success of a campaign, such as ad creative, targeting, and audience segmentation. Additionally, the results of these analyses may not be generalizable to all industries or business types, as the effectiveness of digital advertising can vary greatly depending on the specific context. That being said, the data presented do provide some valuable insights into the performance of Google, Facebook, and Simpli.fi advertising campaigns. The results suggest that ad cost has a positive relationship with performance impressions across all days of the week for Google ads, while the time of day is a significant factor in the performance of Facebook and Simpli.fi ads. Marketers can use this information to optimize their advertising strategies, such as by increasing their ad spend during peak hours or targeting specific days of the week for their campaigns.

It is also worth noting that the analyses presented focused on a limited set of metrics, namely impressions and clicks. While these are important metrics for evaluating the success of a campaign, they are not the only ones that matter. Marketers should also consider metrics such as conversion rates, bounce rates, and return on investment (ROI) when assessing the effectiveness of their campaigns.

In summary, the data presented provides some valuable insights into the performance of digital advertising campaigns on Google, Facebook, and Simpli.fi. Marketers can use this information to make informed decisions about their advertising strategies, but should also keep in mind the limitations of the analyses presented and the importance of considering a range of metrics when evaluating campaign success.

References