EDA Case: Return on Investment in Facebook Ads

MSB 325 - Introduction to Business Analytics (for entrepreneurs)

Author

Josh Mickelson

Published

December 18, 2024

1 Overview

In this case, you will analyze real-world Facebook ad data to uncover insights about ad performance. Imagine you’ve been engaged by Alexa, the founder of Brightwave, a young e-commerce company. She is reviewing ad performance data to guide strategic decisions on budget allocation and targeting. Your task is to help her uncover key insights and actionable recommendations.

This analysis is designed to help you:

Use AI tools to support data analysis and interpretation.
Engage critically with AI outputs, validating and expanding upon them.
Apply insights in practical, decision-making contexts for a client.

Instructions:

Use AI Extensively: For each analysis step, use AI to help generate code, interpret results, and suggest further analysis. Practice prompt writing, asking for code, asking follow-up questions for clarity, and validating AI output.
Show Your Work: Document both the code and the AI prompts you use to generate it. This will show your engagement and help build a record of your analytical process.
Critically Interpret: After each analysis step, write an interpretation in your own words. Consider how reliable the AI’s interpretation is, what alternative views might exist, and what real-world decisions Alexa could make based on the insights.

2 Data

This dataset contains key metrics and calculated fields related to the performance of Facebook ads run by Brightwave, a young e-commerce company founded by Alexa. Brightwave is testing multiple ad campaigns to drive brand awareness, generate leads, and boost sales.

The data specifically focuses on the campaign with Campaign ID 1178, which is Brightwave’s largest ad campaign in terms of both ad spend and conversions. Alexa chose to run this campaign with significant investment, targeting a wide audience to determine the best approach for reaching potential customers. By analyzing this campaign, Alexa hopes to refine future ad strategies to improve engagement and maximize the return on ad spend (ROAS).

The dataset includes both raw metrics directly downloaded from Facebook Ads Manager and calculated fields that provide insights into ad performance. Below is an explanation of each variable included in the dataset:

2.1 Variables

ad_id: Unique identifier for each ad in the dataset, used to track and analyze individual ad performance within Campaign ID 1178.
brightwave_campaign_id: Identifier for the campaign within Brightwave’s internal system (this column appears as xyz_campaign_id in the CSV file and in the output below), where Campaign ID 1178 is the focus of this analysis due to its significant budget and conversion results.
fb_campaign_id: Identifier used by Facebook to track the ad’s campaign across various Facebook platforms.
age: Age group targeted by the ad (e.g., “30-34”), helping Brightwave understand which demographics are most engaged.
gender: Gender targeted by the ad, indicated as “M” for male or “F” for female, allowing analysis of gender-based engagement and conversions.
interest: Interest category targeted by the ad, represented as a numeric code. This helps Alexa analyze which types of interests are associated with better ad performance.
Impressions: The number of times the ad was shown to users, providing insight into the campaign’s reach and visibility on Facebook.
Clicks: The number of times users clicked on the ad, which indicates the ad’s effectiveness in driving interest and engagement.
Spent: The total amount spent on the ad in dollars, reflecting Brightwave’s investment in reaching its target audience.
Leads: Number of users who showed interest in the product or service (e.g., signed up or expressed interest) but did not necessarily make a purchase. Leads are important for tracking initial interest in the brand.
Conversions: Number of users who completed a high-value action, such as a purchase, after clicking on the ad. This is Brightwave’s primary performance metric, as it reflects actual sales and revenue generated.
CTR (Click-Through Rate): The percentage of impressions that resulted in a click. Calculated as (Clicks / Impressions) * 100. A higher CTR indicates that the ad content resonates well with the audience.
CPC (Cost Per Click): Average cost per click on the ad. Calculated as Spent / Clicks. This metric shows how cost-efficient the ad is in generating clicks.
Lead_Value: Fixed value per lead, set at 5 dollars to represent the assumed value of each lead. Although not directly generating revenue, leads are valuable for understanding potential future customers.
Conversion_Value: Fixed value per conversion, set at 100 dollars to represent the assumed value of each conversion or purchase. This value is used in calculations to estimate total revenue generated from conversions.
Total_Conversion_Value: Total value generated from conversions (purchases), calculated as Conversions * Conversion_Value. This represents the total estimated revenue from Campaign ID 1178’s conversions.
CPA (Cost Per Acquisition): Average cost per conversion. Calculated as Spent / Conversions. CPA provides insight into how much Brightwave spends to generate a single sale or conversion.
ROAS (Return on Ad Spend): Return on ad spend, showing the revenue generated per dollar spent. Calculated as (Conversions * Conversion_Value) / Spent. ROAS is a key metric for understanding the profitability of the ad spend, with higher values indicating a more effective use of budget.
CPM (Cost Per Thousand Impressions): Cost per thousand impressions, calculated as (Spent / Impressions) * 1000. CPM indicates the cost efficiency of reaching the audience in terms of exposure.
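
The formulas above can be sketched in R as follows. This is a minimal illustration on a toy data frame with made-up values; toy_ads and the helper safe_divide() are hypothetical names, not part of the dataset or any package. It assumes that a ratio with a zero denominator should be recorded as NA, which is one plausible explanation for missing values in CPC, CPA, and ROAS.

```r
# Toy data: made-up values for illustration only, not rows from fb_ad_data.
toy_ads <- data.frame(
  Impressions = c(100000, 50000, 8000),
  Clicks      = c(20, 10, 0),
  Spent       = c(30, 16, 0),
  Conversions = c(2, 0, 0)
)

# Hypothetical helper: divide, but return NA when the denominator is zero
# instead of Inf or NaN.
safe_divide <- function(numerator, denominator) {
  ifelse(denominator == 0, NA_real_, numerator / denominator)
}

conversion_value <- 100  # assumed dollars per conversion, as defined above

toy_ads$CTR  <- safe_divide(toy_ads$Clicks, toy_ads$Impressions) * 100
toy_ads$CPC  <- safe_divide(toy_ads$Spent, toy_ads$Clicks)
toy_ads$CPA  <- safe_divide(toy_ads$Spent, toy_ads$Conversions)
toy_ads$ROAS <- safe_divide(toy_ads$Conversions * conversion_value, toy_ads$Spent)
toy_ads$CPM  <- safe_divide(toy_ads$Spent, toy_ads$Impressions) * 1000

print(toy_ads)
```

The third toy ad has no clicks and no spend, so its CPC, CPA, and ROAS come out as NA while its CTR and CPM are 0.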

2.2 Import Data

To begin your analysis, download the dataset CSV file (fb_ad_data.csv) provided in Canvas and load it into R. Use AI-generated code, the code below, or the RStudio interface to import the file and name it fb_ad_data:

library(tidyverse)
# Replace "path/to/your/file.csv" with the actual path to the downloaded file.
# read_csv() (from readr, loaded with the tidyverse) returns a tibble,
# which matches the output shown below.
fb_ad_data <- read_csv("path/to/your/file.csv")

# Preview the first few rows to ensure it loaded correctly
head(fb_ad_data)

Ensure that the file is saved in your working directory, or use the full file path in the read_csv() function.

Rows: 625 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): age, gender
dbl (17): ad_id, xyz_campaign_id, fb_campaign_id, interest, Impressions, Cli...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 625 × 19
     ad_id xyz_campaign_id fb_campaign_id age   gender interest Impressions
     <dbl>           <dbl>          <dbl> <chr> <chr>     <dbl>       <dbl>
 1 1121091            1178         144531 30-34 M            10     1194718
 2 1121092            1178         144531 30-34 M            10      637648
 3 1121094            1178         144531 30-34 M            10       24362
 4 1121095            1178         144531 30-34 M            10      459690
 5 1121096            1178         144531 30-34 M            10      750060
 6 1121097            1178         144532 30-34 M            15       30068
 7 1121098            1178         144532 30-34 M            15     1267550
 8 1121100            1178         144532 30-34 M            15     3052003
 9 1121101            1178         144532 30-34 M            15       29945
10 1121102            1178         144532 30-34 M            15      357856
# ℹ 615 more rows
# ℹ 12 more variables: Clicks <dbl>, Spent <dbl>, Leads <dbl>,
#   Conversions <dbl>, CTR <dbl>, CPC <dbl>, Lead_Value <dbl>,
#   Conversion_Value <dbl>, Total_Conversion_Value <dbl>, CPA <dbl>,
#   ROAS <dbl>, CPM <dbl>

3 Instructions for Analysis and Submission

Complete this .qmd file by:
- Filling in the code chunks with your EDA code using R and the tidyverse.
- Filling in the response sections with your interpretation of the results as prompted in the document.
Follow each prompt carefully: For each question, provide both the code (in the designated code chunk) and a written interpretation (in the designated response section) as outlined in this document.
1. Document Prompts and Code: As you use AI, document each prompt and resulting code, along with any changes you make. This demonstrates your engagement and allows you to track what worked well and what needed adjustment.
2. Emphasize Real-World Relevance: In each reflection, relate your insights back to Alexa’s decision-making at Brightwave. Consider how these insights could guide Brightwave’s campaigns, budgeting, or audience targeting.
Render the file to PDF:
- Option 1: Render the .qmd file to an HTML file first, then convert it to a PDF using your browser’s print-to-PDF feature. This option may be the most straightforward as long as it generates a PDF with multiple pages.
- Option 2: Render the .qmd file to a Word .docx file, then convert it to a PDF. This option seems to be the most reliable but has an extra step.
- Option 3: Render the .qmd file directly to a PDF document if you have the necessary tools installed (e.g., LaTeX).
Upload your final PDF to Gradescope through the Canvas assignment:
- Ensure that your PDF is multi-page and fully readable. A single-page PDF will be unreadable in Gradescope.
- Use Gradescope’s interface to label the questions within the PDF, making it easier for grading.

4 Case Study: EDA of Facebook Ads

This assignment analyzes the ad campaign data from Brightwave, focusing on the company’s largest campaign (Campaign ID 1178). Follow the prompts below to complete your analysis. Answer in the designated code chunks and response sections.

4.1 Initial Data Exploration

AI Prompt(s)

Use AI to generate initial code for exploring and understanding the data structure. Begin with a prompt like: “Generate R and tidyverse code to provide summary statistics and visualizations for an initial exploration of my dataset.”

Generate R and tidyverse code to provide summary statistics and visualizations for an initial exploration of a generic dataset named fb_ad_data.

R code to get standard deviations of all numeric variables from a dataset named “fb_ad_data”

Data Analysis

Begin with a general examination of the data. Use AI-generated code to:

Generate summary statistics (e.g., mean, median, standard deviation) for each variable.
Create visualizations (e.g., histograms for continuous variables or bar plots for categorical variables) to understand each variable’s distribution.

Paste your AI-generated code below, and run it to get a foundational view of the dataset. Make sure your code covers all variables so that you have a complete picture before diving into more specific analyses in later sections.
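
As a rough sketch of what such exploration code might look like, the block below summarizes a toy tibble with made-up values. The name toy_ads is hypothetical; with the real data, fb_ad_data would take its place.

```r
library(dplyr)

# Toy data: made-up values for illustration only, not rows from fb_ad_data.
toy_ads <- tibble(
  age    = c("30-34", "30-34", "35-39", "40-44", "45-49", "45-49"),
  gender = c("M", "F", "M", "F", "M", "F"),
  Clicks = c(12, 31, 78, 5, 40, 0),
  Spent  = c(19.1, 48.6, 120.9, 8.2, 60.3, 0)
)

# Quartiles and mean for numeric columns; length and class for character columns
summary(toy_ads)

# Standard deviation of every numeric column in one call
sd_table <- toy_ads %>%
  summarise(across(where(is.numeric), ~ sd(.x, na.rm = TRUE)))

# Frequency counts for a categorical column
age_counts <- toy_ads %>% count(age)

print(sd_table)
print(age_counts)
```

Using across(where(is.numeric), ...) avoids naming each column by hand, so the same call covers all numeric variables in a wider dataset.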


     ad_id         xyz_campaign_id fb_campaign_id       age           
 Min.   :1121091   Min.   :1178    Min.   :144531   Length:625        
 1st Qu.:1121381   1st Qu.:1178    1st Qu.:144587   Class :character  
 Median :1121753   Median :1178    Median :144649   Mode  :character  
 Mean   :1150944   Mean   :1178    Mean   :149996                     
 3rd Qu.:1122165   3rd Qu.:1178    3rd Qu.:144718                     
 Max.   :1314415   Max.   :1178    Max.   :179982                     
                                                                      
    gender             interest       Impressions          Clicks      
 Length:625         Min.   :  2.00   Min.   :   5264   Min.   :  0.00  
 Class :character   1st Qu.: 19.00   1st Qu.:  87043   1st Qu.: 12.00  
 Mode  :character   Median : 27.00   Median : 188758   Median : 31.00  
                    Mean   : 39.43   Mean   : 327718   Mean   : 57.71  
                    3rd Qu.: 63.00   3rd Qu.: 436943   3rd Qu.: 78.00  
                    Max.   :114.00   Max.   :3052003   Max.   :421.00  
                                                                       
     Spent            Leads        Conversions          CTR         
 Min.   :  0.00   Min.   : 0.00   Min.   : 0.000   Min.   :0.00000  
 1st Qu.: 19.11   1st Qu.: 1.00   1st Qu.: 0.000   1st Qu.:0.01219  
 Median : 48.55   Median : 2.00   Median : 1.000   Median :0.01559  
 Mean   : 89.06   Mean   : 4.27   Mean   : 1.395   Mean   :0.01622  
 3rd Qu.:120.88   3rd Qu.: 5.00   3rd Qu.: 2.000   3rd Qu.:0.02044  
 Max.   :639.95   Max.   :60.00   Max.   :21.000   Max.   :0.03761  
                                                                    
      CPC          Lead_Value Conversion_Value Total_Conversion_Value
 Min.   :1.145   Min.   :5    Min.   :100      Min.   :   0.0        
 1st Qu.:1.442   1st Qu.:5    1st Qu.:100      1st Qu.:   0.0        
 Median :1.540   Median :5    Median :100      Median : 100.0        
 Mean   :1.572   Mean   :5    Mean   :100      Mean   : 139.5        
 3rd Qu.:1.700   3rd Qu.:5    3rd Qu.:100      3rd Qu.: 200.0        
 Max.   :2.212   Max.   :5    Max.   :100      Max.   :2100.0        
 NA's   :12                                                          
      CPA              ROAS             CPM        
 Min.   :  0.00   Min.   : 0.000   Min.   :0.0000  
 1st Qu.: 20.59   1st Qu.: 0.000   1st Qu.:0.2009  
 Median : 39.43   Median : 1.181   Median :0.2515  
 Mean   : 58.41   Mean   : 3.275   Mean   :0.2506  
 3rd Qu.: 75.23   3rd Qu.: 2.956   3rd Qu.:0.3102  
 Max.   :352.45   Max.   :67.114   Max.   :0.5040  
 NA's   :237      NA's   :12

# A tibble: 1 × 17
   ad_id xyz_campaign_id fb_campaign_id interest Impressions Clicks Spent Leads
   <dbl>           <dbl>          <dbl>    <dbl>       <dbl>  <dbl> <dbl> <dbl>
1 69242.               0         12682.     32.4     365787.   67.3  102.  5.67
# ℹ 9 more variables: Conversions <dbl>, CTR <dbl>, CPC <dbl>,
#   Lead_Value <dbl>, Conversion_Value <dbl>, Total_Conversion_Value <dbl>,
#   CPA <dbl>, ROAS <dbl>, CPM <dbl>

Warning: Removed 261 rows containing non-finite outside the scale range
(`stat_bin()`).
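
The warning above reports 261 removed values, which matches the sum of the NA counts in the summary output (12 for CPC, 237 for CPA, 12 for ROAS). One way to check where such values come from is to count non-finite entries per column. The sketch below uses a toy data frame with made-up values; toy_ads is a hypothetical name.

```r
# Toy data: made-up values for illustration only, not rows from fb_ad_data.
toy_ads <- data.frame(
  Clicks      = c(20, 0, 10),
  Spent       = c(30, 0, 16),
  Conversions = c(2, 0, 0)
)

# Plain division by zero produces NaN (0 / 0) or Inf (16 / 0) in R.
toy_ads$CPC <- toy_ads$Spent / toy_ads$Clicks
toy_ads$CPA <- toy_ads$Spent / toy_ads$Conversions

# Count NA, NaN, and Inf values in each column.
non_finite_counts <- sapply(toy_ads, function(x) sum(!is.finite(x)))
print(non_finite_counts)
```

Columns with a nonzero count are the ones a histogram would silently drop rows from.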

Insights and Recommendations

After viewing the summary statistics and distributions:

Return on Ad Spend (ROAS) and Cost Per Acquisition (CPA) are the two variables I want to analyze most, because to me they are the strongest indicators of a successful ad campaign. More specifically, I want to learn more about the ad that produced a ROAS of 67.114. If I could consistently produce that sort of return, I’d be a very wealthy man. Both variables have a strong right skew. Why?
Almost all of the numeric variables skew right. This is somewhat expected: nearly every variable has a lower bound of 0 but no practical upper bound. Most values cluster at the low end, with a long tail of high outliers. For example, why did one ad pull roughly 9x the average impressions (and more than 16x the median)?
Looking at the Clicks histogram, most of Alexa’s ads get few clicks (the median is 31), while a select few get significantly more (up to 421). I see little reason to keep running ads that bring in very little traffic; the budget is better spent on those that produce the most. Median spend is just below $50 per ad. If you are running A/B tests on many different ads, what is the smallest ad spend you can budget per ad and still get reliable data?

4.2 Univariate Analysis for Key Variables

AI Prompt(s)

Use AI to help generate code for univariate exploration. Start with prompts like:

“Generate R code to calculate mean, median, standard deviation, and skewness for each of my selected variables.”
“Create a histogram and boxplot for [variable] to visualize its distribution and identify potential outliers.”

Generate R code to calculate mean, median, standard deviation, and skewness for each of my variables. Create a histogram and boxplot for [variable] to visualize its distribution and identify potential outliers.

Is there a way to rewrite the code for Histogram and Boxplot for a Specific Variable so that I can make a list of variables, and you will generate a histogram and box plot for all of them?

Data Analysis

Choose 4-6 variables that you believe are central to ad performance, such as Spent, CTR, or Conversions.

I selected Spent, Clicks, CTR, and Conversions.

Note: Typically, we’re interested in identifying both “causal” variables that might drive performance (e.g., Spent, Impressions) and “outcome” variables that measure performance directly (e.g., CTR, Conversions). This will help you choose variables to analyze the factors influencing ad success and evaluate the ad’s effectiveness.

For each selected variable, conduct the following analyses:

Summary Statistics: Calculate mean, median, standard deviation, and skewness for each variable. (I did this for every variable)
Visualizations: Use a histogram or boxplot to display the distribution and identify patterns (e.g., symmetry, skewness). (I did this for the 4 variables I selected.)
Outlier Detection: Identify outliers using boxplots and/or statistical methods (e.g., Z-scores or IQR). Outliers can significantly impact insights and decisions. (Outliers are visible in boxplots.)
Normality Test (optional): Since some statistical tests assume normal distribution, test for normality using Q-Q plots or the Shapiro-Wilk test for each variable. If non-normal, note this in your interpretation, as it may impact analysis later. (Since this is optional, I eyeballed the histograms first. CTR looks roughly normal, while Spent, Clicks, and Conversions are clearly right-skewed, so I skipped the formal tests and note the non-normality in my interpretation.)
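
The skewness and IQR-based outlier steps listed above can be sketched in base R without extra packages. The functions skewness() and iqr_outliers() below are hypothetical helpers written for this illustration, and the values are made up. Packages that provide a skewness function may use slightly different formulas, so results can differ in the later decimals.

```r
# Toy values: made up for illustration only, not taken from fb_ad_data.
clicks <- c(0, 5, 12, 20, 31, 33, 40, 78, 95, 421)

# Moment-based skewness: mean cubed deviation divided by the
# (population) standard deviation cubed.
skewness <- function(x) {
  x <- x[!is.na(x)]
  deviations <- x - mean(x)
  mean(deviations^3) / mean(deviations^2)^1.5
}

# Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles.
iqr_outliers <- function(x) {
  x <- x[!is.na(x)]
  quartiles <- unname(quantile(x, c(0.25, 0.75)))
  fence <- 1.5 * (quartiles[2] - quartiles[1])
  x[x < quartiles[1] - fence | x > quartiles[2] + fence]
}

print(skewness(clicks))      # positive for a right-skewed variable
print(iqr_outliers(clicks))  # values beyond the fences
```

A positive skewness together with outliers only on the high side is the numeric counterpart of the long right tail seen in a histogram.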

variable                        mean      median           sd    skewness
ad_id                        1150944     1121753     69241.79    1.933862
xyz_campaign_id                 1178        1178            0         NaN
fb_campaign_id              149996.2      144649     12681.87    1.933879
interest                     39.4288          27     32.39938    1.243481
Impressions                 327717.9      188758     365786.9    2.309914
Clicks                       57.7088          31     67.30733    1.933214
Spent                       89.05944       48.55     102.3865    1.947806
Leads                         4.2704           2      5.67062    3.858039
Conversions                   1.3952           1     2.199718     3.76473
CTR                       0.01622111  0.01559341  0.006148356  0.05231848
CPC                         1.572494        1.54    0.1689531   0.4806416
Lead_Value                         5           5            0         NaN
Conversion_Value                 100         100            0         NaN
Total_Conversion_Value        139.52         100     219.9718     3.76473
CPA                         58.40748    39.43292     58.01629    2.130119
ROAS                        3.274892    1.181195     7.600636    5.075876
CPM                        0.2505577   0.2514511    0.0846274  -0.4085157

Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
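
The deprecation warning above comes from aes_string(). One way to loop over a list of variable names without it is the .data pronoun inside aes(). The sketch below assumes ggplot2 is installed; plot_distribution() is a hypothetical helper and toy_ads holds made-up values.

```r
library(ggplot2)

# Toy data: made-up values for illustration only, not rows from fb_ad_data.
toy_ads <- data.frame(
  Spent  = c(19.1, 48.6, 120.9, 8.2, 60.3, 300.5),
  Clicks = c(12, 31, 78, 5, 40, 210)
)

# Hypothetical helper: build one histogram and one boxplot for a column
# given its name as a string. .data[[variable]] replaces aes_string().
plot_distribution <- function(data, variable) {
  list(
    histogram = ggplot(data, aes(x = .data[[variable]])) +
      geom_histogram(bins = 10) +
      labs(title = paste("Histogram of", variable), x = variable),
    boxplot = ggplot(data, aes(y = .data[[variable]])) +
      geom_boxplot() +
      labs(title = paste("Boxplot of", variable), y = variable)
  )
}

variables <- c("Spent", "Clicks")
plots <- lapply(variables, function(v) plot_distribution(toy_ads, v))
names(plots) <- variables
```

Each plot can then be displayed by name, for example print(plots$Spent$histogram).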

Insights and Recommendations

For Spent, it appears that Alexa tested many ads on a low budget and ran fewer at a higher budget. Most ads cost under $100 (the median is $48.55), but there are plenty of high-spend outliers, up to $639.95. This could mean one of two things. Alexa could have run some ads longer than intended (e.g., forgetting to stop them). If that is the case, I highly encourage closer monitoring; Facebook also lets you set automatic budget caps on ad spend, which can help. Alternatively, Alexa could have left only the most successful ads running for longer. I am hoping that it is the latter. If so, keep doing what you are doing. Spend could also fluctuate with seasonal changes in ad prices, such as around the holidays.
For Clicks, the histogram looks very similar to Spent. It is strongly right-skewed: a few ads are high performers, while most drew relatively little traffic. The median was 31 clicks, with a maximum of 421. To be totally frank, looking at these variables on their own shows the general pattern but not the reason for it. This simply motivates me to begin bivariate analysis! Are the ads with the most clicks also the ads with the highest spend? Or are there some ads that Alexa spent very little on but still got amazing results? If so, there is room to capitalize by running more ads like those.
Click-Through Rate has a fairly normal distribution (skewness of about 0.05). There are two notable high outliers, at around 0.035 and 0.037, which stand out from the rest of this campaign. Also of note are the 12 or so ads with a CTR of 0. In other words, they drew 0 clicks and were totally unsuccessful. I recommend looking for what those ads have in common and then avoiding those commonalities. They clearly aren’t working.
Conversions is yet another right-skewed histogram. Many ads didn’t result in any conversions whatsoever, but there are a couple of exciting results. The median was only 1 conversion, but the maximum was a significant outlier at 21! If this ad had a number of impressions comparable to the median ad, that is incredible. But until we do bivariate analysis, we can’t really be sure: this result could simply reflect a higher budget and more impressions.

4.3 Bivariate Analysis for Key Numeric Relationships

AI Prompt(s)

Use AI to generate R and tidyverse code for your bivariate analysis using scatter plots, trend lines, and correlation coefficients for key numeric relationships. Use prompts like:

“Generate R code to create a scatter plot with a trend line for [variable 1] vs. [variable 2].”
“Calculate the Pearson or Spearman correlation for [variable 1] and [variable 2].”

“Generate R code to create a scatter plot with a trend line for Clicks vs. Conversions.

Calculate the Pearson or Spearman correlation for Clicks and Conversions.”

“Generate R code to create a scatter plot with a trend line for CTR vs. Impressions.

Calculate the Pearson or Spearman correlation for CTR and Impressions.”

use a t-test for comparing means of a numeric variable between two categories (e.g., male vs. female). CTR - Gender.

use ANOVA for comparing means of a numeric variable across multiple categories (e.g., different age groups). CTR - Age groups.

Chi-Square Testing for Categorical-Categorical Relationships: If you observe an association between two categorical variables (e.g., gender and age groups), use a chi-square test to evaluate whether these variables are independent.

Data Analysis

Identify Core Numeric-Numeric Relationships:

Select 2–3 pairs of variables you believe have a meaningful relationship (e.g., Spent vs. Conversions, CTR vs. Impressions). I chose Clicks vs. Conversions, and CTR vs. Impressions.
Visualize Each Pair: Use scatter plots with trend lines to examine the direction and strength of these relationships. Done.
Calculate Correlations: Calculate Pearson or Spearman correlation coefficients, as appropriate, to assess the strength of the relationships. Did both.
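
To illustrate the difference between the two coefficients: Pearson measures linear association, while Spearman is computed on ranks and so measures monotonic association. The sketch below uses made-up values (toy_x and toy_y are hypothetical names) in which y always rises with x but not in a straight line.

```r
# Toy values: made up for illustration only, not taken from fb_ad_data.
toy_x <- 1:10
toy_y <- toy_x^3  # strictly increasing, but curved

pearson  <- cor(toy_x, toy_y, method = "pearson")
spearman <- cor(toy_x, toy_y, method = "spearman")

# Missing values make cor() return NA unless incomplete pairs are dropped.
gap_x <- c(toy_x, NA)
gap_y <- c(toy_y, 5)
pearson_with_gap   <- cor(gap_x, gap_y)
pearson_complete   <- cor(gap_x, gap_y, use = "complete.obs")

print(c(pearson = pearson, spearman = spearman))
```

When Spearman is clearly higher than Pearson, as in this toy case, the relationship is likely monotonic but curved. The use = "complete.obs" argument matters for columns that contain NAs, such as CPC, CPA, and ROAS in the summary output.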

Statistical Testing for Numeric-Categorical Relationships (Verification):

If your univariate EDA suggests differences in a numeric variable (e.g., CTR) across categories (e.g., age or gender), use a t-test or ANOVA to confirm these differences.
Instructions:
- use a t-test for comparing means of a numeric variable between two categories (e.g., male vs. female). CTR - Gender.
- use ANOVA for comparing means of a numeric variable across multiple categories (e.g., different age groups). CTR - Age groups.
If you observe notable differences between groups, report these insights to Alexa, as they could inform targeted marketing strategies. Will do. There are statistically significant differences.
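
A minimal sketch of these two tests on made-up values follows (toy_ads is a hypothetical name). Note that R's t.test() runs Welch's test by default; the t-test output in this section is labeled Two Sample t-test with df = 623, which suggests the pooled-variance form (var.equal = TRUE) was run. Tukey's HSD is added here as one common follow-up for seeing which groups differ; it is not part of the original analysis.

```r
# Toy data: made-up values for illustration only, not rows from fb_ad_data.
toy_ads <- data.frame(
  gender = factor(rep(c("F", "M"), each = 6)),
  age    = factor(rep(c("30-34", "35-39", "40-44"), times = 4)),
  CTR    = c(0.021, 0.019, 0.022, 0.018, 0.020, 0.023,
             0.013, 0.012, 0.015, 0.011, 0.014, 0.016)
)

# Welch's t-test (R's default) does not assume equal variances.
welch <- t.test(CTR ~ gender, data = toy_ads)

# var.equal = TRUE gives the pooled-variance two-sample t-test.
pooled <- t.test(CTR ~ gender, data = toy_ads, var.equal = TRUE)

# One-way ANOVA for a numeric variable across more than two groups.
age_anova <- aov(CTR ~ age, data = toy_ads)
print(summary(age_anova))

# Pairwise comparisons between age groups.
tukey <- TukeyHSD(age_anova)
print(tukey)
```

A significant ANOVA only says that at least one group mean differs, so a pairwise follow-up or a grouped box plot is needed to say which one.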

Chi-Square Testing for Categorical-Categorical Relationships:

If you observe an association between two categorical variables (e.g., gender and age groups), use a chi-square test to evaluate whether these variables are independent.
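
As a sketch, the chi-square test can be reproduced directly from the gender-by-age counts reported in this section's output, without the raw data. The object name contingency_table follows the name shown in that output.

```r
# Counts of ads by gender and age group, copied from the table in this section.
contingency_table <- matrix(
  c( 89, 57, 60, 70,
    112, 90, 69, 78),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    gender = c("F", "M"),
    age    = c("30-34", "35-39", "40-44", "45-49")
  )
)

result <- chisq.test(contingency_table)
print(result)

# Expected counts under independence, for comparison with the observed counts.
print(round(result$expected, 1))
```

Comparing the expected and observed counts shows where any departure from independence would come from; here they are close in every cell.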

`geom_smooth()` using formula = 'y ~ x'

Pearson Correlation:  0.5202851

Spearman Correlation:  0.4571634

`geom_smooth()` using formula = 'y ~ x'

Pearson Correlation:  0.2026069

Spearman Correlation:  0.3444558


    Two Sample t-test

data:  CTR by gender
t = 14.841, df = 623, p-value < 2.2e-16
alternative hypothesis: true difference in means between group F and group M is not equal to 0
95 percent confidence interval:
 0.005486152 0.007159425
sample estimates:
mean in group F mean in group M 
     0.01975175      0.01342897

             Df   Sum Sq  Mean Sq F value Pr(>F)    
age           3 0.007443 0.002481   95.42 <2e-16 ***
Residuals   621 0.016146 0.000026                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

   
    30-34 35-39 40-44 45-49
  F    89    57    60    70
  M   112    90    69    78


    Pearson's Chi-squared test

data:  contingency_table
X-squared = 2.6095, df = 3, p-value = 0.4558

Insights and Recommendations

Interpret Direction and Strength of Relationships: For each variable pair:

Describe the direction (positive or negative) and strength of the relationship.
Reflect on what the relationship might imply for ad performance. For example, a strong positive correlation between Spent and Conversions might indicate that increased spending directly drives more conversions.

Clicks and Conversions have a moderately positive correlation (Pearson 0.52, Spearman 0.46), with a likely linear component. This is to be expected: once you get people onto your site, some of them will likely convert. I chose these two variables to get a better grasp of the efficacy of Alexa’s on-site sales funnel. If she regularly gets people onto her site who leave without buying anything, much more work needs to be done there. Customers clearly had some interest when they clicked on the ad, so if an ad pulls hundreds of clicks without a single conversion (as the scatter plot shows many times), something is likely wrong on Alexa’s site: she can get people there, but often falls short of a sale. She should invest serious work in improving her site. I would also suggest Alexa look at the specific ads that fall well above the trend line. These ads clearly reach people with the purchasing power and willingness to buy. For the other ads, the audience may have had its interest piqued but lost interest upon learning more about the product.

CTR and Impressions have a weakly positive correlation (Pearson 0.20, Spearman 0.34). Because the Spearman coefficient is higher than the Pearson coefficient, the relationship is probably monotonic but not linear. Either way, the weak correlation implies that increasing impression volume has only a marginal effect on Click-Through Rate, so simply scaling up the ads is unlikely to improve CTR much. If Alexa wants to increase CTR, she’ll need to focus on other variables, specifically ones that help her narrow down her target customer. The next step is to analyze the data by demographics to see whether there are specific subsets of the population to target.

The t-test comparing CTR by gender and the ANOVA comparing CTR by age group both returned statistically significant differences (p < 0.001 in each case). In other words, the groups do not perform equally: mean CTR was 0.0198 for ads targeting women versus 0.0134 for ads targeting men. I’d like to draw particular attention to the F value of the ANOVA (95.42). Such a high F value indicates that CTR varies far more between age groups than within them. The next step I would recommend is to create scatter plots colored by demographic group, which will make it much clearer where the differences are.

Briefly, the chi-square test found no significant association between gender and age in this dataset (p = 0.456), so they are likely independent. In other words, the ads are spread across age groups in similar proportions for both genders, so a comparison by gender is not distorted by an uneven age mix, and vice versa.

4.4 Demographic Comparisons

AI Prompt(s)

Use AI to generate R and tidyverse code for your demographic comparisons. In this section, you’ll examine how key numeric relationships vary across demographic categories. Provide the AI prompt(s) you used to generate code for this analysis, numbering multiple iterations clearly.

Copy your AI prompts here:

Data Analysis

Visualize Relationships Across Demographic Categories:
- Choose key variable pairs previously analyzed (e.g., Spent vs. Conversions, CTR vs. Impressions) and examine how these relationships differ across demographic categories such as age or gender.
- Use Color in Scatter Plots: For each pair, create scatter plots with trend lines, using color to represent different demographic categories (e.g., separate lines or color points by age groups or gender).
- Group Comparison Visualizations: Use box plots or faceted scatter plots to visually assess differences in key numeric variables (e.g., CTR, Conversions) across categories.
Statistical Testing for Demographic Differences:
- Conduct statistical tests to confirm whether differences observed across demographic categories are statistically significant. Did this in previous section, the answer is yes.
- Use t-tests or ANOVA for comparing numeric variables across demographic groups. Did this in previous section, the answer is yes, significant.
- Use chi-square tests to examine relationships between categorical demographic variables, if applicable. Did this in the previous section; the categorical variables are unlikely to be dependent.
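
A minimal sketch of the colored and faceted scatter plots described above, assuming ggplot2 is installed. The data are made-up values and toy_ads is a hypothetical name. Passing formula = y ~ x explicitly to geom_smooth() also silences the message that otherwise appears once per plot.

```r
library(ggplot2)

# Toy data: made-up values for illustration only, not rows from fb_ad_data.
toy_ads <- data.frame(
  Clicks      = c(10, 25, 40, 60, 80, 15, 30, 55, 70, 90),
  Conversions = c(1, 2, 3, 5, 6, 0, 1, 1, 2, 2),
  gender      = rep(c("M", "F"), each = 5),
  age         = rep(c("30-34", "35-39"), times = 5)
)

# One scatter plot with a separate color and linear trend line per gender.
by_gender <- ggplot(toy_ads, aes(x = Clicks, y = Conversions, color = gender)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Clicks vs. Conversions by gender")

# The same relationship split into one panel per age group.
by_age <- ggplot(toy_ads, aes(x = Clicks, y = Conversions)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ age) +
  labs(title = "Clicks vs. Conversions by age group")
```

Color works well for two or three groups on one panel; facets are usually easier to read once there are four or more, as with the age groups in this dataset.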

`geom_smooth()` using formula = 'y ~ x'

Insights and Recommendations

Interpret Demographic Differences:

When splitting Clicks vs. Conversions into demographics, a lot more becomes clear. Almost every single point of data below the line of best fit is from a woman. In other words, most women that click on Alexa’s ad never end up buying anything. For this reasoning, even if targeting women increases overall clicks, it’s not worth it if none of them ever end up buying anything off of Alexa’s site. Alexa should target men in her ads moving forward.

When looking at age, ads shown to younger age groups tend to convert more often, regardless of how many clicks a given ad receives. Part of this may be due to technological savvy, and part may be that the ads simply resonate more with younger audiences; both explanations are speculative, since the data does not test them. We will come back to this when looking at CTR vs. Impressions.

Looking at CTR vs. Impressions, CTR is markedly higher for ads targeting women. Taken on its own, this would suggest dropping ads targeting men and targeting only women. This is an important lesson in analyzing data from the right perspective, though. As noted above, only some of the people who click on an ad go on to convert, and women are significantly underrepresented among those who do. I would advise Alexa to look at all metrics, but to base decisions on the ones that affect the bottom line. CTR is useful, but Conversions and ROAS will always be the more important metrics.

Analyzing CTR vs. Impressions by age shows something similar. CTR is much higher for older demographics than for younger ones, but actual conversions tend to come from younger demographics. Even though their CTR is lower, the increase in conversions more than makes up for it in this case. Were the CTR differences greater, this might not be true. Taking both comparisons together, I would advise Alexa to target her ads at men aged 30-39. From the faceted box plots, women aged 30-34 may also be a successful subset of customers.

Increased ad spending on these select demographics will likely yield more conversions per click, per impression, and per dollar of ad spend. In other words, Alexa’s return on ad spend is likely to increase. If I were to analyze another relationship for this assignment, it would be the one between ROAS, age, and gender. That would be the next thing to look at!
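
The contrast drawn above between CTR and conversions can be made concrete by computing both rates per segment. The following is a minimal sketch, assuming columns named `age`, `gender`, `Impressions`, `Clicks`, `Conversions`, and `Spent`; the numbers are invented purely to show the calculation and are not Brightwave's data. ROAS is left out because this section does not define how revenue is measured.

```r
library(dplyr)

# Invented rows for illustration only; column names are assumptions.
ads <- data.frame(
  age         = rep(c("30-34", "35-39"), each = 4),
  gender      = rep(c("F", "M"), times = 4),
  Impressions = c(1000, 1000, 2000, 2000, 1000, 1000, 2000, 2000),
  Clicks      = c(30, 20, 60, 40, 30, 20, 60, 40),
  Conversions = c(1, 2, 2, 4, 1, 2, 2, 4),
  Spent       = c(15, 10, 30, 20, 15, 10, 30, 20)
)

segment_summary <- ads %>%
  group_by(age, gender) %>%
  # Sum the raw counts first so each rate is a ratio of totals, not a mean of ratios
  summarise(
    impressions = sum(Impressions),
    clicks      = sum(Clicks),
    conversions = sum(Conversions),
    spent       = sum(Spent),
    .groups     = "drop"
  ) %>%
  mutate(
    ctr                 = clicks / impressions,   # click-through rate
    conv_per_click      = conversions / clicks,   # share of clicks that convert
    cost_per_conversion = spent / conversions     # ad spend per conversion
  ) %>%
  arrange(cost_per_conversion)

segment_summary
```

Sorting by `cost_per_conversion` rather than `ctr` puts the segments that matter for the bottom line at the top, even when their click-through rate is lower.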

4.5 Final Synthesis and Recommendations

AI Prompt(s)

Use AI to help synthesize your findings and support your recommendations. Provide AI prompt(s) that assist with generating high-level summaries, insights, and practical recommendations. Include any prompts used to refine or validate these recommendations.

This is the text from a report I am compiling for a project. Analyze it, and then complete answers to all points in section 4.5 (Failed to read PDF)

Ok, in that case, your instructions remain the same, but I will copy and paste the document text.

I used this to generate the bullets; most of the long-form writing was my own.

Key Insights

Summarize Core Findings:
- Male audiences aged 30-39 show the highest conversion rates despite lower CTR, and thus should be targeted more
- Female audiences demonstrate higher CTR but significantly lower conversion rates
- Age group 30-34 shows particularly strong performance across metrics, and should be focused on
- There’s a notable disparity between click rates and conversion rates, suggesting potential issues with the sales funnel

Reflection and Practical Considerations

Refine Recommendations:

All recommendations seem feasible and easy to implement. Most consist of advice to simply target demographics more effectively. Rather than adding costs, this should lower them by cutting spend on segments that rarely convert. Alexa’s goal with this campaign was to discover the best marketing strategy moving forward: who to target. She has that data now and simply needs to act on it. The only other recommendations I made were for additional data to analyze. I skipped those analyses only for the sake of time; if this were a real business I was invested in, I would do a more complete analysis of every variable.

As mentioned above, the biggest limitation of this analysis is its scope. I was instructed to only focus on a couple sets of variables. If the project had a more holistic scope, I’m sure you could learn so much more! A/B testing is always a solid method to gather more data if needed.

Identify Future Data Needs:
- A few ads massively outperformed the others, like the one with an ROAS of 67.114. I would run ads like those again to see whether this was one-off behavior or could be replicated. That would be a top priority for me.
- Customer lifetime value by demographic segment
- Seasonal conversion rate variations

Risks and Validation Considerations:

Gradual Implementation:

Begin with small-scale A/B tests before full rollout
Monitor key metrics weekly to catch any negative trends
Maintain some diversity in targeting to avoid over-dependence
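
One way to read the result of such a small-scale A/B test is a two-proportion test on conversions per click. The sketch below uses base R's `prop.test()`; the arm names (`control`, `targeted`) and all counts are hypothetical, made up only to show the mechanics.

```r
# Hypothetical counts, invented for illustration: clicks and conversions per test arm
clicks      <- c(control = 1200, targeted = 1150)
conversions <- c(control = 36,   targeted = 52)

# Two-sided test of whether the two conversion rates differ
ab_test <- prop.test(x = conversions, n = clicks)

ab_test$estimate   # observed conversion rate in each arm
ab_test$conf.int   # confidence interval for the difference in rates
ab_test$p.value    # evidence against "no difference"
```

A small p-value together with a confidence interval that excludes zero would support rolling the targeted variant out more widely; otherwise the test should keep running or be redesigned.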

Validation Methods:

Implement conversion tracking across the entire sales funnel
Conduct regular cohort analysis by demographic segment
Test landing page variants with different audience segments (especially given the gap between clicks and conversions noted above)