RESEARCH QUESTION:
Does the age of the target audience influence their propensity to click on the ad?
Variables needed: Age of the target audience & Clicks on the ad.
# read.csv() is part of base R, so no extra package is needed to load the CSV
mydata <- read.csv("./KAG_conversion_data.csv")
head(mydata)
## ad_id xyz_campaign_id fb_campaign_id age gender interest Impressions
## 1 708746 916 103916 30-34 M 15 7350
## 2 708749 916 103917 30-34 M 16 17861
## 3 708771 916 103920 30-34 M 20 693
## 4 708815 916 103928 30-34 M 28 4259
## 5 708818 916 103928 30-34 M 28 4133
## 6 708820 916 103929 30-34 M 29 1915
## Clicks Spent Total_Conversion Approved_Conversion
## 1 1 1.43 2 1
## 2 2 1.82 2 0
## 3 0 0.00 1 0
## 4 1 1.25 1 0
## 5 1 1.29 1 1
## 6 0 0.00 1 1
Data source: Sales Conversion Optimization “KAG_conversion_data.csv” (April, 2016). Kaggle: https://www.kaggle.com/datasets/loveall/clicks-conversion-tracking
Unit of observation: A single ad published by an anonymous organization.
Sample size: 1143 observations.
VARIABLES:
Next, we clean the data and convert the character variables to factors.
mydata$genderfactor <- factor(mydata$gender, levels = c("M","F"), labels = c("M","F"))
mydata$agefactor <- factor(mydata$age, levels= c("30-34", "35-39", "40-44", "45-49"), labels = c("30-34", "35-39", "40-44", "45-49") )
cleandata <- na.omit(mydata)
head(cleandata)
## ad_id xyz_campaign_id fb_campaign_id age gender interest Impressions
## 1 708746 916 103916 30-34 M 15 7350
## 2 708749 916 103917 30-34 M 16 17861
## 3 708771 916 103920 30-34 M 20 693
## 4 708815 916 103928 30-34 M 28 4259
## 5 708818 916 103928 30-34 M 28 4133
## 6 708820 916 103929 30-34 M 29 1915
## Clicks Spent Total_Conversion Approved_Conversion genderfactor agefactor
## 1 1 1.43 2 1 M 30-34
## 2 2 1.82 2 0 M 30-34
## 3 0 0.00 1 0 M 30-34
## 4 1 1.25 1 0 M 30-34
## 5 1 1.29 1 1 M 30-34
## 6 0 0.00 1 1 M 30-34
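One thing worth checking at this step: factor() silently turns any value that is not listed in `levels` into NA, and na.omit() then drops that row. In this report the sample stays at 1143 rows, so nothing was lost, but the behaviour is easy to verify. The following is a minimal sketch on made-up values (the `age` vector and the "50-54" entry are hypothetical, not taken from the Kaggle file):

```r
# Toy stand-in for the age column (hypothetical values, not the Kaggle data)
age <- c("30-34", "35-39", "40-44", "45-49", "30-34", "50-54")
age_levels <- c("30-34", "35-39", "40-44", "45-49")

# Same conversion as in the report; labels default to the levels
agefactor <- factor(age, levels = age_levels)

# Values that are not listed in `levels` become NA without a warning
n_unmatched <- sum(is.na(agefactor) & !is.na(age))
n_unmatched

# Tabulate including NA to see what na.omit() would remove
table(agefactor, useNA = "ifany")
```

Comparing nrow() before and after na.omit() gives the same information on the real data.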
library(psych)
summary(cleandata)
## ad_id xyz_campaign_id fb_campaign_id age
## Min. : 708746 Min. : 916 Min. :103916 Length:1143
## 1st Qu.: 777633 1st Qu.: 936 1st Qu.:115716 Class :character
## Median :1121185 Median :1178 Median :144549 Mode :character
## Mean : 987261 Mean :1067 Mean :133784
## 3rd Qu.:1121805 3rd Qu.:1178 3rd Qu.:144658
## Max. :1314415 Max. :1178 Max. :179982
## gender interest Impressions Clicks
## Length:1143 Min. : 2.00 Min. : 87 Min. : 0.00
## Class :character 1st Qu.: 16.00 1st Qu.: 6504 1st Qu.: 1.00
## Mode :character Median : 25.00 Median : 51509 Median : 8.00
## Mean : 32.77 Mean : 186732 Mean : 33.39
## 3rd Qu.: 31.00 3rd Qu.: 221769 3rd Qu.: 37.50
## Max. :114.00 Max. :3052003 Max. :421.00
## Spent Total_Conversion Approved_Conversion genderfactor agefactor
## Min. : 0.00 Min. : 0.000 Min. : 0.000 M:592 30-34:426
## 1st Qu.: 1.48 1st Qu.: 1.000 1st Qu.: 0.000 F:551 35-39:248
## Median : 12.37 Median : 1.000 Median : 1.000 40-44:210
## Mean : 51.36 Mean : 2.856 Mean : 0.944 45-49:259
## 3rd Qu.: 60.02 3rd Qu.: 3.000 3rd Qu.: 1.000
## Max. :639.95 Max. :60.000 Max. :21.000
ADDITIONAL INFORMATION ON VARIABLES:
To test the hypothesis, we plan a one-way analysis of variance (ANOVA), because the explanatory variable consists of four independent age groups. If the ANOVA assumptions are not met, we will use the non-parametric Kruskal-Wallis rank sum test instead.
First, we check whether the click counts are normally distributed within each age group.
#install.packages("ggpubr")
library(ggpubr)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggboxplot(cleandata, x = "agefactor", y = "Clicks", xlab = "Age Group")
The distribution in every age group appears right-skewed, and the groups differ slightly from one another. Nevertheless, we will perform a Shapiro-Wilk test to confirm this impression.
FOR SHAPIRO TEST: H0 -> The click counts within each age group are normally distributed. H1 -> The click counts within at least one age group do not follow a normal distribution.
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
cleandata %>%
group_by(agefactor) %>%
shapiro_test(Clicks)
## # A tibble: 4 × 4
## agefactor variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 30-34 Clicks 0.515 1.68e-32
## 2 35-39 Clicks 0.644 1.88e-22
## 3 40-44 Clicks 0.685 1.36e-19
## 4 45-49 Clicks 0.733 3.92e-20
We reject the null hypothesis at p < 0.001 in all age groups.
Since the normality assumption is violated, a non-parametric test is needed. We therefore proceed with the Kruskal-Wallis rank sum test.
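The per-group normality check can also be written with base R only. This is a minimal sketch on toy data, assuming a data frame with one numeric column and one grouping factor; the names `toy`, `group` and `value` are hypothetical, and the values are deterministic quantiles rather than the ad data:

```r
# Toy data (hypothetical): deterministic quantiles, so the result is reproducible
toy <- data.frame(
  group = factor(rep(c("skewed", "normal"), each = 50),
                 levels = c("skewed", "normal")),
  value = c(qexp(ppoints(50), rate = 0.1),         # right-skewed group
            qnorm(ppoints(50), mean = 10, sd = 2)) # bell-shaped group
)

# Run shapiro.test() separately on each group and keep only the p-values
p_values <- sapply(split(toy$value, toy$group),
                   function(x) shapiro.test(x)$p.value)
p_values

# Groups whose p-value falls below 0.05 are flagged as non-normal
names(p_values)[p_values < 0.05]
```

A single non-normal group is enough to rule out the parametric test, which is the decision rule applied above.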
There is no need to check the homoscedasticity assumption, since we have already ruled out the parametric test. The check is shown here only for completeness:
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
leveneTest(cleandata$Clicks, group = cleandata$agefactor)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 14.322 3.65e-09 ***
## 1139
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We compute summary statistics for each group to see how the groups differ.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
cleandata %>%
group_by(agefactor) %>%
get_summary_stats(Clicks, type = "median_iqr")
## # A tibble: 4 × 5
## agefactor variable n median iqr
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 30-34 Clicks 426 3 22
## 2 35-39 Clicks 248 9 34.5
## 3 40-44 Clicks 210 13 42.5
## 4 45-49 Clicks 259 20 73.5
The groups appear to differ, possibly significantly: the median number of clicks rises with each older age group.
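Median and IQR are reported here rather than mean and standard deviation because the click counts are strongly right-skewed. A minimal base R sketch on invented numbers shows why; the `clicks` vector is hypothetical and only illustrates the effect of one extreme value:

```r
# Toy click counts (hypothetical): four small values and one outlier
clicks <- c(0, 1, 1, 2, 30)

mean_clicks   <- mean(clicks)    # pulled upwards by the outlier
median_clicks <- median(clicks)  # stays at the typical value
iqr_clicks    <- IQR(clicks)     # spread of the middle 50% of the data

c(mean = mean_clicks, median = median_clicks, iqr = iqr_clicks)
```

The mean lands well above four of the five observations, while the median and IQR still describe the bulk of the data.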
To check whether the distribution actually differs between the age groups, a Kruskal-Wallis rank sum test should be performed.
FOR KRUSKAL-WALLIS RANK SUM TEST: H0 -> The distributions of the click counts across different age groups are equal. H1 -> At least one age group’s distribution of click counts differs from the others.
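kruskal.test(Clicks ~ agefactor, data = cleandata) # response on the left, grouping factor on the right
##
## Kruskal-Wallis rank sum test
##
## data: agefactor by Clicks
## Kruskal-Wallis chi-squared = 256.34, df = 182, p-value = 0.0002336
Note that the output printed above comes from a run with the formula reversed (agefactor ~ Clicks): it reads "agefactor by Clicks" and reports df = 182, whereas a test across four age groups has df = 3. It should be regenerated with Clicks as the response. Taken as printed, the output rejects the null hypothesis at p = 0.0002336, which would indicate differences in click counts across the groups.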
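In R's formula interface the response goes on the left and the grouping factor on the right, and the degrees of freedom of the Kruskal-Wallis test equal the number of groups minus one. A minimal sketch on toy data makes this easy to check; the data frame `toy` and its values are hypothetical:

```r
# Toy data (hypothetical): three groups of five click counts each
toy <- data.frame(
  clicks = c(1, 0, 2, 3, 1,
             4, 6, 5, 7, 9,
             10, 12, 8, 15, 11),
  group  = factor(rep(c("g1", "g2", "g3"), each = 5))
)

# Response ~ grouping factor
res <- kruskal.test(clicks ~ group, data = toy)

res$parameter  # degrees of freedom: number of groups - 1
res$p.value    # small here, because the three groups barely overlap
```

A quick sanity check on any real run is therefore that df matches the number of factor levels minus one.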
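To gauge the strength of the difference, we compute the effect size.
kruskal_effsize(Clicks ~ agefactor, data = cleandata) # same formula orientation as the test
## # A tibble: 1 × 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 agefactor 1143 0.0774 eta2[H] moderate
The printed value (eta2[H] = 0.0774, labelled moderate) comes from a call with the formula reversed, as the .y. column shows agefactor rather than Clicks. It should be regenerated with Clicks as the response before it is read as the degree of association between age group and number of clicks.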
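The eta2[H] value can be reproduced by hand from the Kruskal-Wallis statistic. The sketch below assumes the formula given in the rstatix documentation, eta2[H] = (H - k + 1) / (n - k), where H is the test statistic, k the number of groups the test saw (df + 1) and n the number of observations. The helper name `eta2_h` is hypothetical:

```r
# Hypothetical helper implementing eta2[H] = (H - k + 1) / (n - k)
eta2_h <- function(H, k, n) {
  (H - k + 1) / (n - k)
}

# Figures taken from the printed output: H = 256.34, df = 182 (so k = 183), n = 1143
effsize <- eta2_h(H = 256.34, k = 183, n = 1143)
round(effsize, 4)
```

Plugging in the printed figures gives roughly 0.0774, which matches the reported effect size only when k = 183. This confirms that the value is tied to the df = 182 run of the test rather than to a four-group comparison.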
To check which groups differ from one another, we perform pairwise Wilcoxon rank sum tests on the click counts of these groups. This test is used because the samples are independent.
FOR WILCOXON RANK SUM TEST (applied to each pair): H0 -> The two groups being compared have the same distribution of click counts. H1 -> The two groups being compared differ in their distribution of click counts.
groups_check <- wilcox_test(Clicks ~ agefactor,
paired = FALSE,
p.adjust.method = "bonferroni",
data = cleandata)
groups_check
## # A tibble: 6 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 Clicks 30-34 35-39 426 248 42416. 1.69e- 5 1.01e- 4 ***
## 2 Clicks 30-34 40-44 426 210 31048 2.6 e-10 1.56e- 9 ****
## 3 Clicks 30-34 45-49 426 259 34512. 1.31e-16 7.86e-16 ****
## 4 Clicks 35-39 40-44 248 210 22942. 2.8 e- 2 1.67e- 1 ns
## 5 Clicks 35-39 45-49 248 259 25351 3.98e- 5 2.39e- 4 ***
## 6 Clicks 40-44 45-49 210 259 24424. 5.7 e- 2 3.44e- 1 ns
After Bonferroni adjustment, we reject the null hypothesis for four of the six pairs (adjusted p < 0.001), meaning that those groups differ in their distribution of clicks. We cannot reject it for 35-39 vs 40-44 (p = 0.028, adjusted p = 0.167) or for 40-44 vs 45-49 (p = 0.057, adjusted p = 0.344). The first of these pairs falls below 0.05 only before the correction.
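The p.adj column can be recovered from the raw p-values with base R's p.adjust(). The sketch below uses the rounded p-values printed in the table, so the adjusted values agree with the table only up to rounding; the vector names are hypothetical labels:

```r
# Raw p-values as printed in the pairwise table (rounded)
p_raw <- c(
  "30-34 vs 35-39" = 1.69e-5,
  "30-34 vs 40-44" = 2.6e-10,
  "30-34 vs 45-49" = 1.31e-16,
  "35-39 vs 40-44" = 2.8e-2,
  "35-39 vs 45-49" = 3.98e-5,
  "40-44 vs 45-49" = 5.7e-2
)

# Bonferroni: multiply each p-value by the number of comparisons (6), capped at 1
p_adj <- p.adjust(p_raw, method = "bonferroni")
signif(p_adj, 3)

# Pairs that are not significant at the 0.05 level after adjustment
not_significant <- names(p_adj)[p_adj >= 0.05]
not_significant
```

Significance decisions should be read from the adjusted column, since that is what the bonferroni setting in the call asks for.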
CONCLUSION:
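After performing the analysis, the Kruskal-Wallis output reported above indicates that the distribution of click counts differs across age groups (χ^2 = 256.34, p < 0.001), with a moderate effect size (η^2 = 0.077). Both figures come from calls with the formula written as agefactor ~ Clicks (df = 182 rather than 3), so they should be recomputed with Clicks as the response before being reported. Post hoc pairwise Wilcoxon tests with Bonferroni correction show significant differences for four of the six pairs (adjusted p < 0.001). The two exceptions are 35-39 vs 40-44 (p = 0.028 unadjusted, adjusted p = 0.167) and 40-44 vs 45-49 (p = 0.057 unadjusted, adjusted p = 0.344), neither of which falls below the 0.05 significance level after correction.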