RESEARCH QUESTION:

Does the age of the target audience influence their propensity to click on the ad?

Variables needed: Age of the target audience & Clicks on the ad.

# read.csv() is part of base R (utils), so no extra package is needed here
mydata <- read.csv("./KAG_conversion_data.csv", sep = ",")

head(mydata)
##    ad_id xyz_campaign_id fb_campaign_id   age gender interest Impressions
## 1 708746             916         103916 30-34      M       15        7350
## 2 708749             916         103917 30-34      M       16       17861
## 3 708771             916         103920 30-34      M       20         693
## 4 708815             916         103928 30-34      M       28        4259
## 5 708818             916         103928 30-34      M       28        4133
## 6 708820             916         103929 30-34      M       29        1915
##   Clicks Spent Total_Conversion Approved_Conversion
## 1      1  1.43                2                   1
## 2      2  1.82                2                   0
## 3      0  0.00                1                   0
## 4      1  1.25                1                   0
## 5      1  1.29                1                   1
## 6      0  0.00                1                   1

Data source: Sales Conversion Optimization “KAG_conversion_data.csv” (April, 2016). Kaggle: https://www.kaggle.com/datasets/loveall/clicks-conversion-tracking

Unit of observation: A single ad published by an anonymous organization.

Sample size: 1143 observations.

VARIABLES:

  1. ad_id: Unique ID number that identifies the ad. String of 6-7 digits. Categorical Nominal.
  2. xyz_campaign_id: ID associated with each ad campaign of the company. String of 3-4 digits. Ads that share the same value belong to the same campaign. Categorical Nominal.
  3. fb_campaign_id: ID that Facebook uses to track each campaign. String of 6 digits. Categorical Nominal.
  4. age: Age range (5-year interval) of the individuals to whom an ad is shown in the specific campaign. There are four age groups: 30-34, 35-39, 40-44, and 45-49. Since the age intervals have a natural order, this variable is Categorical Ordinal.
  5. gender: Gender of the people to whom an ad is shown in the specific campaign, labelled “M” for males and “F” for females. Categorical Nominal.
  6. interest: Code that specifies the interest category of the people to whom the ad is shown. String of 1-3 digits. Categorical Nominal.
  7. Impressions: Number of times the ad was shown to the targeted group of people. Numeric Ratio.
  8. Clicks: Number of times people clicked on the ad. Numeric Ratio.
  9. Spent: Amount of money the company paid Facebook to display the specific ad on the platform. Measured in euros (€). Numeric Ratio.
  10. Total_Conversion: Total number of people who, after seeing the ad, performed an action regarding the product displayed. Numeric Ratio.
  11. Approved_Conversion: Total number of people who actually purchased the product after seeing the ad on Facebook. Numeric Ratio.

Next, we convert the character variables to factors and remove any rows with missing values.

mydata$genderfactor <- factor(mydata$gender, levels = c("M","F"), labels = c("M","F"))
mydata$agefactor <- factor(mydata$age, levels= c("30-34", "35-39", "40-44", "45-49"), labels = c("30-34", "35-39", "40-44", "45-49") )
cleandata <- na.omit(mydata)
head(cleandata)
##    ad_id xyz_campaign_id fb_campaign_id   age gender interest Impressions
## 1 708746             916         103916 30-34      M       15        7350
## 2 708749             916         103917 30-34      M       16       17861
## 3 708771             916         103920 30-34      M       20         693
## 4 708815             916         103928 30-34      M       28        4259
## 5 708818             916         103928 30-34      M       28        4133
## 6 708820             916         103929 30-34      M       29        1915
##   Clicks Spent Total_Conversion Approved_Conversion genderfactor agefactor
## 1      1  1.43                2                   1            M     30-34
## 2      2  1.82                2                   0            M     30-34
## 3      0  0.00                1                   0            M     30-34
## 4      1  1.25                1                   0            M     30-34
## 5      1  1.29                1                   1            M     30-34
## 6      0  0.00                1                   1            M     30-34
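
Before relying on na.omit(), it can help to see how many values are actually missing. The following is a minimal sketch on a made-up data frame (the `toy` object and its values are hypothetical, not taken from the dataset); it counts missing values per column and the number of rows that na.omit() removes.

```r
# Toy data frame (hypothetical values) with one missing click count.
toy <- data.frame(
  age    = c("30-34", "35-39", "40-44", "45-49"),
  Clicks = c(1, NA, 0, 4)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# na.omit() drops every row that contains at least one NA.
toy_clean <- na.omit(toy)
rows_dropped <- nrow(toy) - nrow(toy_clean)
print(rows_dropped)
```

In this report the summary of cleandata still shows 1143 rows, which matches the stated sample size, so no rows appear to have been dropped.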
library(psych)

summary(cleandata)
##      ad_id         xyz_campaign_id fb_campaign_id       age           
##  Min.   : 708746   Min.   : 916    Min.   :103916   Length:1143       
##  1st Qu.: 777633   1st Qu.: 936    1st Qu.:115716   Class :character  
##  Median :1121185   Median :1178    Median :144549   Mode  :character  
##  Mean   : 987261   Mean   :1067    Mean   :133784                     
##  3rd Qu.:1121805   3rd Qu.:1178    3rd Qu.:144658                     
##  Max.   :1314415   Max.   :1178    Max.   :179982                     
##     gender             interest       Impressions          Clicks      
##  Length:1143        Min.   :  2.00   Min.   :     87   Min.   :  0.00  
##  Class :character   1st Qu.: 16.00   1st Qu.:   6504   1st Qu.:  1.00  
##  Mode  :character   Median : 25.00   Median :  51509   Median :  8.00  
##                     Mean   : 32.77   Mean   : 186732   Mean   : 33.39  
##                     3rd Qu.: 31.00   3rd Qu.: 221769   3rd Qu.: 37.50  
##                     Max.   :114.00   Max.   :3052003   Max.   :421.00  
##      Spent        Total_Conversion Approved_Conversion genderfactor agefactor  
##  Min.   :  0.00   Min.   : 0.000   Min.   : 0.000      M:592        30-34:426  
##  1st Qu.:  1.48   1st Qu.: 1.000   1st Qu.: 0.000      F:551        35-39:248  
##  Median : 12.37   Median : 1.000   Median : 1.000                   40-44:210  
##  Mean   : 51.36   Mean   : 2.856   Mean   : 0.944                   45-49:259  
##  3rd Qu.: 60.02   3rd Qu.: 3.000   3rd Qu.: 1.000                              
##  Max.   :639.95   Max.   :60.000   Max.   :21.000

ADDITIONAL INFORMATION ON VARIABLES:

  1. Ids: All ad ids lie between 708746 and 1314415, all campaign ids between 916 and 1178, and all Facebook campaign ids between 103916 and 179982.
  2. Age: agefactor shows that the most common age range is 30-34 (426 ads) and the least common is 40-44 (210 ads).
  3. Impressions: The mean number of impressions is 186732, while the median is 51509; this suggests a right-skewed distribution in which a few ads have a very high impression count.
  4. Clicks: The median (8) is far below the mean (33.39) and the maximum is 421, which again points to a right-skewed, non-normal distribution in which a few ads receive very many clicks.
  5. Gender: Both genders are targeted by a similar number of ads, with slightly more ads aimed at males (592) than at females (551).

To test the hypothesis, the natural choice is a one-way analysis of variance (ANOVA), because we compare a numeric outcome (Clicks) across 4 independent age groups. If the assumptions of ANOVA are not met, we will use its non-parametric alternative, the Kruskal-Wallis rank sum test.

First, we check whether the click counts are normally distributed within each age group.

#install.packages("ggpubr")
library(ggpubr)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggboxplot(cleandata, x = "agefactor", y = "Clicks", xlab = "Age Group")

The distribution in every age group appears right-skewed, and the distributions differ slightly from one another. Nevertheless, we will perform a Shapiro-Wilk test to confirm this impression.

FOR SHAPIRO TEST: H0 -> The click counts within each age group are normally distributed. H1 -> The click counts within at least one age group do not follow a normal distribution.

library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
cleandata %>%
  group_by(agefactor) %>%
  shapiro_test(Clicks)
## # A tibble: 4 × 4
##   agefactor variable statistic        p
##   <fct>     <chr>        <dbl>    <dbl>
## 1 30-34     Clicks       0.515 1.68e-32
## 2 35-39     Clicks       0.644 1.88e-22
## 3 40-44     Clicks       0.685 1.36e-19
## 4 45-49     Clicks       0.733 3.92e-20

We reject the null hypothesis at p < 0.001 in every age group.
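
The same per-group check can be done with base R only. The sketch below is an illustration on simulated data (the `toy` data frame, its two groups and the exponential rates are assumptions made for the example, not part of the dataset): it runs shapiro.test() within each group and flags the groups where normality is rejected.

```r
set.seed(1)  # reproducible toy data

# Hypothetical right-skewed values for two made-up groups.
toy <- data.frame(
  group = rep(c("A", "B"), each = 200),
  count = c(rexp(200, rate = 1 / 10), rexp(200, rate = 1 / 30))
)

# Run shapiro.test() within each group and keep only the p-values.
p_values <- sapply(split(toy$count, toy$group),
                   function(x) shapiro.test(x)$p.value)
print(p_values)

# TRUE where normality is rejected at the 5% level.
print(p_values < 0.05)
```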

Since the normality assumption is violated, a non-parametric test is needed, so we proceed with the Kruskal-Wallis rank sum test.

Strictly speaking, homoscedasticity no longer needs to be checked once the parametric test has been ruled out, but Levene's test is shown here for completeness:

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
leveneTest(cleandata$Clicks, group = cleandata$agefactor)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value   Pr(>F)    
## group    3  14.322 3.65e-09 ***
##       1139                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We compute summary statistics for each group to see how they differ.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
cleandata %>%
  group_by(agefactor) %>%
  get_summary_stats(Clicks, type = "median_iqr")
## # A tibble: 4 × 5
##   agefactor variable     n median   iqr
##   <fct>     <fct>    <dbl>  <dbl> <dbl>
## 1 30-34     Clicks     426      3  22  
## 2 35-39     Clicks     248      9  34.5
## 3 40-44     Clicks     210     13  42.5
## 4 45-49     Clicks     259     20  73.5

The groups show differences that could be significant: the median number of clicks rises with age, from 3 in the 30-34 group to 20 in the 45-49 group.

To check whether the distribution of clicks actually differs between the age groups, a Kruskal-Wallis rank sum test should be performed.

FOR KRUSKAL-WALLIS RANK SUM TEST: H0 -> The distributions of the click counts across different age groups are equal. H1 -> At least one age group’s distribution of click counts differs from the others.

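# Response (Clicks) on the left-hand side, grouping factor (agefactor) on the right.
kruskal.test(Clicks ~ agefactor, data = cleandata)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  agefactor by Clicks
## Kruskal-Wallis chi-squared = 256.34, df = 182, p-value = 0.0002336

The output printed above was obtained with the two sides of the formula swapped (agefactor ~ Clicks), as its "data:" line shows. It therefore grouped the observations by the 183 distinct click values (df = 182) rather than by the 4 age groups (df = 3), so its statistic and p-value do not test the stated hypotheses and need to be regenerated with the corrected call. With the correct specification, rejecting the null hypothesis at the 0.05 level would mean that the distribution of click counts differs in at least one age group.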
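
In the formula interface of kruskal.test(), the numeric response goes on the left and the grouping factor on the right, and the degrees of freedom equal the number of groups minus one. A minimal sketch on made-up data (the `toy` data frame and its values are hypothetical) shows what to expect; for the four age groups in this analysis the same call shape should report df = 3.

```r
# Toy data (hypothetical): 3 groups with 4 observations each.
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 4)),
  y     = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
)

# Response on the left, grouping factor on the right.
res <- kruskal.test(y ~ group, data = toy)

# Degrees of freedom = number of groups - 1 (2 here, since there are 3 groups).
print(res$parameter)
print(res$p.value)
```

A df that is much larger than the number of groups minus one is a quick sign that the two sides of the formula have been swapped.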

To assess the strength of the difference, we calculate the effect size.

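# Response (Clicks) ~ grouping factor (agefactor).
kruskal_effsize(Clicks ~ agefactor, data = cleandata)
## # A tibble: 1 × 5
##   .y.           n effsize method  magnitude
## * <chr>     <int>   <dbl> <chr>   <ord>    
## 1 agefactor  1143  0.0774 eta2[H] moderate

The printed effect size (eta2[H] = 0.0774, labelled moderate) was computed with the two sides of the formula swapped (agefactor ~ Clicks), as the ".y." column shows. It should be recomputed with the corrected call before it is interpreted as the strength of association between age group and number of clicks.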
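
The rstatix documentation defines this effect size as eta2[H] = (H - k + 1) / (n - k), where H is the Kruskal-Wallis statistic, k the number of groups and n the total number of observations, and it labels values from 0.06 to below 0.14 as moderate. The sketch below (the helper name `eta2_h` is hypothetical) plugs in the figures from the printed outputs. The printed value is reproduced with k = 183, that is, the printed df of 182 plus one, rather than with the 4 age groups.

```r
# eta-squared based on the H statistic:
#   eta2[H] = (H - k + 1) / (n - k)
# H = Kruskal-Wallis statistic, k = number of groups, n = total observations.
eta2_h <- function(H, k, n) {
  (H - k + 1) / (n - k)
}

# Values read from the printed outputs: H = 256.34, df = 182 (so k = 183), n = 1143.
printed <- eta2_h(H = 256.34, k = 182 + 1, n = 1143)
print(round(printed, 4))
```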

To see which groups differ from one another, we run pairwise Wilcoxon rank sum tests on the click counts, with a Bonferroni correction for the six comparisons. This test is appropriate because the samples are independent.

FOR EACH PAIRWISE WILCOXON RANK SUM TEST: H0 -> The distribution of click counts is the same in the two age groups being compared. H1 -> The distribution of click counts differs between the two age groups being compared.

groups_check <- wilcox_test(Clicks ~ agefactor, 
                      paired = FALSE,
                      p.adjust.method = "bonferroni",
                      data = cleandata)

groups_check
## # A tibble: 6 × 9
##   .y.    group1 group2    n1    n2 statistic        p    p.adj p.adj.signif
## * <chr>  <chr>  <chr>  <int> <int>     <dbl>    <dbl>    <dbl> <chr>       
## 1 Clicks 30-34  35-39    426   248    42416. 1.69e- 5 1.01e- 4 ***         
## 2 Clicks 30-34  40-44    426   210    31048  2.6 e-10 1.56e- 9 ****        
## 3 Clicks 30-34  45-49    426   259    34512. 1.31e-16 7.86e-16 ****        
## 4 Clicks 35-39  40-44    248   210    22942. 2.8 e- 2 1.67e- 1 ns          
## 5 Clicks 35-39  45-49    248   259    25351  3.98e- 5 2.39e- 4 ***         
## 6 Clicks 40-44  45-49    210   259    24424. 5.7 e- 2 3.44e- 1 ns

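Because a Bonferroni correction was requested, the adjusted p-values (p.adj) are the ones to interpret. We reject the null hypothesis in four of the six comparisons (p.adj < 0.001): the 30-34 group differs from each of the three older groups, and the 35-39 group differs from the 45-49 group. We cannot reject the null hypothesis for 35-39 vs 40-44 (p = 0.028 unadjusted, p.adj = 0.167) or for 40-44 vs 45-49 (p = 0.057 unadjusted, p.adj = 0.344).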
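
The Bonferroni adjustment multiplies each raw p-value by the number of comparisons (six here) and caps the result at 1. The sketch below applies base R's p.adjust() to the unadjusted p-values as printed in the table; because those printed values are rounded, the adjusted results can differ from the p.adj column in the third significant digit.

```r
# Unadjusted p-values copied from the printed pairwise table (rounded as printed).
p_raw <- c(
  "30-34 vs 35-39" = 1.69e-5,
  "30-34 vs 40-44" = 2.6e-10,
  "30-34 vs 45-49" = 1.31e-16,
  "35-39 vs 40-44" = 2.8e-2,
  "35-39 vs 45-49" = 3.98e-5,
  "40-44 vs 45-49" = 5.7e-2
)

# Bonferroni: multiply each p-value by the number of comparisons, capped at 1.
p_adj <- p.adjust(p_raw, method = "bonferroni")
print(signif(p_adj, 3))

# Pairs that remain significant at the 5% level after adjustment.
print(names(p_adj)[p_adj < 0.05])
```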

CONCLUSION:

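After performing the analysis, the Bonferroni-corrected post hoc Wilcoxon rank sum tests show that click counts differ between age groups: the 30-34 group differs from each of the three older groups, and the 35-39 group differs from the 45-49 group (all adjusted p < 0.001). Two comparisons are not significant at the 0.05 level after correction: 35-39 vs 40-44 (p = 0.028 unadjusted, adjusted p = 0.167) and 40-44 vs 45-49 (p = 0.057 unadjusted, adjusted p = 0.344). Together with the group medians, which rise from 3 clicks (30-34) to 20 clicks (45-49), this suggests that ads shown to older age groups receive more clicks. The omnibus figures printed above (χ^2 = 256.34, df = 182, p < 0.001; η^2[H] = 0.077, moderate) come from calls in which agefactor was the response, so they should be recomputed with Clicks as the response before being reported.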