Install Quality Over Install Volume: A Marketing Analytics Study of Credit Direct Mobile App Acquisition — April 2026
## 1. Executive Summary
Credit Direct Mobile acquired approximately 51,000 non-organic (paid) Android installs in April 2026 across multiple acquisition channels — Google Ads, programmatic display networks, and other paid media sources. Install volume is a useful headline metric, but it tells an incomplete story. The real question is which channels are delivering users who actually engage with the app, versus those who download it and immediately disappear.
This analysis uses raw AppsFlyer attribution data from April 2026 to evaluate the quality of paid installs. Quality is defined as whether a user triggered at least one tracked in-app event after installing — a signal that they moved beyond the download and began interacting with the product. Five analytical techniques are applied: exploratory data analysis, data visualisation, hypothesis testing, correlation analysis, and logistic regression.
Together, these techniques reveal which acquisition channels, geographic markets, device profiles, and network carriers are most reliably producing engaged users — not just download counts. The findings are intended to directly inform channel investment decisions, bid strategy, and campaign planning for Q3 2026, with a focus on shifting the performance conversation from cost-per-install to cost-per-engaged-user.
## 2. Professional Disclosure
I work as a Product Marketing and Growth Lead at Credit Direct Finance Company Limited, a digital lending company operating in Nigeria. My role covers user acquisition strategy, channel performance management, lifecycle marketing, and go-to-market execution for the Credit Direct Mobile app.
The five techniques in this analysis connect directly to real decisions I navigate in this role:
**Exploratory Data Analysis** is the starting point for any performance review I run. Before drawing conclusions about which channels are working, I need to understand the shape and quality of the underlying data — what is missing, what is skewed, and what might distort conclusions if left unexamined.
**Data Visualisation** is central to how I communicate channel performance to leadership and cross-functional stakeholders. A well-designed chart replaces ten rows in a spreadsheet, especially in executive settings where time is limited and context needs to land fast.
**Hypothesis Testing** is critical when making the case for shifting budget between channels. Observing that one channel has a higher conversion rate than another only matters if that difference is statistically meaningful — not simply a reflection of volume differences or campaign timing.
**Correlation Analysis** helps identify which acquisition metrics tend to move together. Understanding whether daily install volume is correlated with conversion quality, for example, has direct implications for how we interpret campaign bursts and media spikes when reporting to leadership.
**Logistic Regression** allows me to go beyond descriptive statistics and model the actual probability of a user converting, based on their channel, location, device, and carrier profile. This directly supports audience targeting, creative personalisation, and bid optimisation decisions — questions I engage with weekly.
## 3. Data Collection & Sampling
The primary dataset was exported from AppsFlyer, Credit Direct’s mobile attribution and marketing analytics platform. Two raw data exports were generated for the period **April 1–30, 2026**:
**Non-organic installs** — one row per paid app install, capturing channel attribution, device characteristics, geographic data, and campaign metadata. This file contains approximately 51,000 observations.
**Non-organic in-app events** — one row per tracked in-app event for the same user cohort, capturing post-install behaviour. This file was capped at 200,000 rows due to AppsFlyer’s per-export row limit.
The sampling frame is all paid (non-organic) Android installs of the Credit Direct Mobile app during April 2026. Organic installs were excluded to focus the analysis specifically on paid acquisition performance. The time period represents one complete calendar month, chosen because it reflects a full campaign cycle with budget active across all channels throughout.
A key methodological note on the outcome variable: a user is classified as "converted" if their AppsFlyer ID appears in the events file — meaning they triggered at least one in-app event after installing. Users who appear only in the installs file are classified as non-converted. Because the events file is capped at 200,000 rows, some events may be missing from the export, which means the true conversion rate is likely slightly higher than what is measured here. This limitation is acknowledged in Section 10.
**Data privacy notes:** No personally identifiable information is included in this analysis. Advertising IDs, device identifiers, IMEI numbers, and raw IP addresses were removed prior to upload. The data is owned by Credit Direct Finance Company Limited and is used here for academic purposes. Data sharing restrictions apply — this dataset is not publicly available.
**Data provenance:** Exported directly from the AppsFlyer HQ dashboard (hq1.appsflyer.com) via the Raw Data Export module. No simulated or synthetic data has been used. The author performed the export personally and can demonstrate the process during the viva.
## 4. Data Description
### 4.1 Loading and Preparing the Data
```{r setup, message=FALSE, warning=FALSE}
# Install any packages not yet available in this environment
required_packages <- c("tidyverse", "corrplot", "broom", "knitr",
                       "kableExtra", "janitor", "scales", "lubridate")
new_pkgs <- required_packages[!(required_packages %in% installed.packages()[, "Package"])]
if (length(new_pkgs) > 0) install.packages(new_pkgs, quiet = TRUE)

# Load libraries
library(tidyverse)   # Data manipulation and ggplot2 visualisation
library(corrplot)    # Correlation matrix heatmap
library(broom)       # Tidy regression output
library(knitr)       # Table rendering
library(kableExtra)  # Table formatting
library(janitor)     # Column name cleaning
library(scales)      # Number formatting in charts
library(lubridate)   # Date and time parsing
```
```{r load-data, message=FALSE, warning=FALSE}
# Load the installs dataset
installs <- read_csv("install.csv", show_col_types = FALSE)

# From the events file, we only need the AppsFlyer ID column --
# it identifies which users took any action after installing.
# We keep just the unique IDs and drop the rest to save memory.
events_raw <- read_csv("events.csv", show_col_types = FALSE)
converted_ids <- events_raw %>%
  pull(`AppsFlyer ID`) %>%
  unique()

# Remove the full events table from memory -- we have what we need
rm(events_raw)

cat("Total installs loaded:", format(nrow(installs), big.mark = ","), "\n")
cat("Unique users with in-app events:", format(length(converted_ids), big.mark = ","), "\n")
```
```{r clean-data, message=FALSE, warning=FALSE}
# Standardise column names: lowercase, spaces replaced with underscores
installs <- clean_names(installs)

# Automatically remove columns that are almost entirely empty.
# Any column where fewer than 5% of rows have a real value is dropped.
remove_sparse_cols <- function(df, min_fill_rate = 0.05) {
  keep <- sapply(df, function(col) {
    filled <- sum(!is.na(col) & trimws(as.character(col)) != "")
    filled / nrow(df) >= min_fill_rate
  })
  df[, keep]
}

installs <- remove_sparse_cols(installs)
cat("Columns retained after removing sparse columns:", ncol(installs), "\n")
```
```{r engineer-features, message=FALSE, warning=FALSE}
# -- Parse install date and extract time features --
installs <- installs %>%
  mutate(
    install_datetime = parse_date_time(
      install_time,
      orders = c("dmy HM", "dmy HMS", "ymd HM", "ymd HMS")
    ),
    install_date = as.Date(install_datetime),
    install_hour = hour(install_datetime),
    install_dow  = wday(install_datetime, label = TRUE, abbr = TRUE)
  )

# -- Classify acquisition type --
# If the Partner column has a value, it is a programmatic buy;
# otherwise classify by media source name.
installs <- installs %>%
  mutate(
    acquisition_type = case_when(
      !is.na(partner) & trimws(as.character(partner)) != "" ~ "Programmatic",
      str_detect(tolower(media_source), "googleadwords|google") ~ "Google Ads",
      str_detect(tolower(media_source), "facebook|meta|fb") ~ "Meta / Facebook",
      tolower(trimws(as.character(media_source))) == "organic" ~ "Organic",
      TRUE ~ "Other Paid"
    )
  )

# -- Extract numeric OS version --
installs <- installs %>%
  mutate(os_version_num = as.numeric(str_extract(as.character(os_version), "^[0-9]+")))

# -- Simplify carrier names --
# Remove network suffixes like "NG", "LTE", "4G" from carrier labels
installs <- installs %>%
  mutate(
    carrier_clean = str_remove_all(as.character(carrier),
                                   "(?i)\\s*(NG|LTE|4G|3G|Nigeria|ng)\\s*$") %>%
      trimws()
  )

# -- Extract device brand from model string --
# Device model appears as "samsung::SM-G955F" -- we keep only the brand part
installs <- installs %>%
  mutate(
    device_brand = str_extract(as.character(device_model), "^[^:]+") %>%
      str_to_title()
  )

# -- Create the outcome variable --
# converted = 1 if this user's AppsFlyer ID also appears in the events file
# converted = 0 if the user installed but never triggered any tracked in-app event
installs <- installs %>%
  mutate(converted = as.integer(appsflyer_id %in% converted_ids))

# Report overall conversion rate
cat("Overall install-to-event conversion rate:",
    round(mean(installs$converted) * 100, 1), "%\n")
```
### 4.2 Data Quality — Missing Values
Two data quality issues were identified and handled before any analysis was run.
**Issue 1 — Sparse columns:** Several columns (such as cost fields and certain identifier columns) were almost entirely empty across the dataset. These were automatically removed using the threshold-based function defined above. Retaining them would have added noise without analytical value.
**Issue 2 — Partial missing values:** A smaller proportion of rows had missing values in fields such as OS version and state. These are handled within each technique section using complete-case analysis or by restricting to top categories.
```{r missing-values, message=FALSE, warning=FALSE}
# Calculate the percentage of missing values per column
missing_summary <- installs %>%
  summarise(across(everything(), ~ round(mean(is.na(.)) * 100, 1))) %>%
  pivot_longer(everything(), names_to = "Column", values_to = "Missing_Pct") %>%
  filter(Missing_Pct > 0) %>%
  arrange(desc(Missing_Pct)) %>%
  rename(`Missing (%)` = Missing_Pct)

kable(missing_summary, caption = "Columns with Missing Data (after sparse column removal)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
```
### 4.3 Variable Summary
```{r variable-overview}
var_overview <- tibble(
  Variable = c("install_date", "install_hour", "acquisition_type", "media_source",
               "channel", "campaign", "state", "carrier_clean", "device_category",
               "os_version_num", "device_brand", "converted"),
  Type = c("Date", "Numeric", "Categorical", "Categorical", "Categorical",
           "Categorical", "Categorical", "Categorical", "Categorical",
           "Numeric", "Categorical", "Binary (outcome)"),
  Description = c("Date the app was installed",
                  "Hour of day the install occurred (0-23)",
                  "Paid channel category: Google Ads, Programmatic, etc.",
                  "AppsFlyer media source attribution",
                  "Sub-channel within media source",
                  "Campaign name",
                  "Nigerian state where install occurred",
                  "Mobile network carrier (cleaned)",
                  "Device type: mobile phone or tablet",
                  "Android OS version number",
                  "Device manufacturer brand",
                  "1 = triggered in-app event; 0 = no event recorded")
)

kable(var_overview, caption = "Key Variables Used in Analysis") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```
```{r summary-stats}
# Summary statistics for numeric variables
numeric_summary <- installs %>%
  select(os_version_num, install_hour, converted) %>%
  pivot_longer(everything(), names_to = "Variable") %>%
  group_by(Variable) %>%
  summarise(
    N      = sum(!is.na(value)),
    Mean   = round(mean(value, na.rm = TRUE), 2),
    Median = round(median(value, na.rm = TRUE), 2),
    SD     = round(sd(value, na.rm = TRUE), 2),
    Min    = round(min(value, na.rm = TRUE), 2),
    Max    = round(max(value, na.rm = TRUE), 2),
    .groups = "drop"
  )

kable(numeric_summary, caption = "Summary Statistics — Numeric Variables") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
```
```{r eda-categorical}
# Distribution of installs and conversion rate by acquisition type
acq_summary <- installs %>%
  group_by(acquisition_type) %>%
  summarise(
    Installs = n(),
    Converted = sum(converted),
    `Conv. Rate (%)` = round(mean(converted) * 100, 1),
    .groups = "drop"
  ) %>%
  mutate(`Share of Installs (%)` = round(Installs / sum(Installs) * 100, 1)) %>%
  arrange(desc(Installs))

kable(acq_summary, caption = "Installs and Conversion by Acquisition Type") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
```
## 5. Data Visualisation
**Theory:** Data visualisation translates raw numbers into patterns that are interpretable at a glance. The grammar of graphics framework, which underlies R’s ggplot2 library, structures every chart around the same logic: data mapped to visual properties (position, colour, size) within a coordinate system. Good chart selection depends on what you are trying to show — distribution, comparison, relationship, or trend.
**Business justification:** In a product marketing role, visualisation is how findings move from analyst to decision-maker. The five charts below are designed to answer one central question from five different angles: where are our installs coming from, and what does quality look like across those sources?
```{r viz-1-installs-by-type, fig.width=9, fig.height=5}
installs %>%
  count(acquisition_type) %>%
  mutate(pct = round(n / sum(n) * 100, 1)) %>%
  ggplot(aes(x = reorder(acquisition_type, n), y = n, fill = acquisition_type)) +
  geom_col(show.legend = FALSE, width = 0.7) +
  geom_text(aes(label = paste0(pct, "%")), hjust = -0.15, size = 3.8, fontface = "bold") +
  coord_flip() +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.18))) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "App Installs by Acquisition Type — April 2026",
    subtitle = "Credit Direct Mobile | Non-organic Android installs",
    x = NULL, y = "Number of Installs",
    caption = "Source: AppsFlyer Raw Data Export"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
```
**Interpretation:** This chart shows the volume split across acquisition types. Programmatic typically accounts for a large share of raw install volume, but volume alone does not indicate quality — which is what the following charts address.
```{r viz-2-daily-installs, fig.width=10, fig.height=5}
installs %>%
  count(install_date) %>%
  ggplot(aes(x = install_date, y = n)) +
  geom_line(colour = "#2E86AB", linewidth = 1.1) +
  geom_point(colour = "#2E86AB", size = 2.5) +
  scale_y_continuous(labels = comma) +
  scale_x_date(date_breaks = "1 week", date_labels = "%d %b") +
  labs(
    title = "Daily Install Volume — April 2026",
    subtitle = "All non-organic Android installs | Credit Direct Mobile",
    x = NULL, y = "Daily Installs",
    caption = "Source: AppsFlyer Raw Data Export"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
```
**Interpretation:** The daily trend reveals whether install volume was steady or driven by specific campaign bursts. Spikes typically correspond to budget increases, new creative launches, or promotional pushes.
```{r viz-3-conversion-by-type, fig.width=9, fig.height=5}
installs %>%
  group_by(acquisition_type) %>%
  summarise(conv_rate = round(mean(converted) * 100, 1), .groups = "drop") %>%
  ggplot(aes(x = reorder(acquisition_type, conv_rate), y = conv_rate, fill = acquisition_type)) +
  geom_col(show.legend = FALSE, width = 0.7) +
  geom_text(aes(label = paste0(conv_rate, "%")), hjust = -0.15, size = 3.8, fontface = "bold") +
  coord_flip() +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.18))) +
  labs(
    title = "Conversion Rate by Acquisition Type",
    subtitle = "% of installs that triggered at least one in-app event",
    x = NULL, y = "Conversion Rate (%)",
    caption = "Source: AppsFlyer Raw Data Export"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
```
**Interpretation:** This is the most commercially important chart in the document. It reframes the channel conversation from volume to quality. A channel that produces fewer installs but a significantly higher conversion rate may be a better use of budget than one that generates volume cheaply but delivers users who never engage.
```{r viz-4-top-states, fig.width=9, fig.height=6}
installs %>%
  filter(!is.na(state) & state != "" & state != "N/A") %>%
  count(state) %>%
  slice_max(n, n = 10) %>%
  ggplot(aes(x = reorder(state, n), y = n)) +
  geom_col(fill = "#A23B72", width = 0.7) +
  geom_text(aes(label = comma(n)), hjust = -0.15, size = 3.5) +
  coord_flip() +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Top 10 States by Install Volume — April 2026",
    subtitle = "Geographic distribution of non-organic installs",
    x = NULL, y = "Installs",
    caption = "Source: AppsFlyer Raw Data Export"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
```
**Interpretation:** Geographic concentration in a small number of states is typical for Nigerian fintech products. This chart highlights where acquisition spend is landing and whether geographic diversification is occurring through programmatic networks.
```{r viz-5-os-distribution, fig.width=9, fig.height=5}
installs %>%
  filter(!is.na(os_version_num)) %>%
  count(os_version_num) %>%
  mutate(pct = round(n / sum(n) * 100, 1)) %>%
  ggplot(aes(x = factor(os_version_num), y = pct)) +
  geom_col(fill = "#F18F01", width = 0.7) +
  geom_text(aes(label = paste0(pct, "%")), vjust = -0.5, size = 3.5) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Android OS Version Distribution Among New Installs",
    subtitle = "Non-organic installs | April 2026",
    x = "Android OS Version", y = "% of Installs",
    caption = "Source: AppsFlyer Raw Data Export"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
```
**Interpretation:** OS version distribution informs decisions around app compatibility and minimum SDK targeting. A large proportion of installs on older OS versions may indicate that certain ad networks are reaching lower-income segments with older devices — worth tracking against conversion quality.
## 6. Hypothesis Testing
**Theory:** Hypothesis testing provides a formal framework for deciding whether an observed difference in data is real or simply due to random chance. The chi-squared test of independence is appropriate when both the grouping variable (acquisition type) and the outcome variable (converted: yes or no) are categorical. It compares observed frequencies against what we would expect if the two variables had no relationship. Cramér’s V is reported as the effect size — it ranges from 0 (no association) to 1 (perfect association).
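For reference, the two quantities reported in the output below are computed as

$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad V = \sqrt{\frac{\chi^2}{n\,(\min(r, c) - 1)}}$$

where $O_{ij}$ and $E_{ij}$ are the observed and expected cell counts, $n$ is the total number of installs in the table, and $r \times c$ are the table's dimensions. This is exactly the calculation performed in code in the chunks that follow.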
**Business justification:** In budget discussions, it is not enough to say “Google Ads converts better than programmatic.” That claim needs statistical backing — otherwise it could be noise, a timing effect, or a sample size artefact. Hypothesis testing provides that backing.
### Hypothesis 1 — Does acquisition type significantly affect conversion?
- **H₀:** Conversion rate is the same regardless of acquisition type
- **H₁:** At least one acquisition type has a significantly different conversion rate
```{r hypothesis-1, message=FALSE, warning=FALSE}
ct1 <- table(installs$acquisition_type, installs$converted)
colnames(ct1) <- c("Not Converted", "Converted")

kable(ct1, caption = "Contingency Table: Acquisition Type vs Conversion") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

chi1 <- chisq.test(ct1)
print(chi1)

cramers_v1 <- sqrt(chi1$statistic / (sum(ct1) * (min(dim(ct1)) - 1)))
cat("\nCramer's V (effect size):", round(as.numeric(cramers_v1), 4), "\n")
cat("Interpretation: 0.1 = small | 0.3 = medium | 0.5+ = large\n")
```
**Plain-language interpretation:** If the p-value is below 0.05, we reject the null hypothesis and conclude that acquisition type has a statistically significant effect on whether a user converts. Cramér’s V tells us how strong that effect is in practical terms. A statistically significant result here justifies reallocating budget and reporting attention toward higher-converting channels — it is no longer just an opinion.
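The chi-squared test is an omnibus test: a significant result says only that *some* acquisition types differ, not which ones. As a follow-up sketch (an extension, not part of the core analysis), pairwise proportion tests with a Bonferroni correction could locate the specific pairs driving the result:

```{r pairwise-sketch, message=FALSE, warning=FALSE}
# Follow-up sketch: which pairs of acquisition types differ?
# Base R's pairwise.prop.test compares conversion proportions for every
# pair of groups; the Bonferroni adjustment guards against false
# positives across the multiple comparisons.
pairwise.prop.test(
  x = ct1[, "Converted"],   # converted installs per acquisition type
  n = rowSums(ct1),         # total installs per acquisition type
  p.adjust.method = "bonferroni"
)
```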
### Hypothesis 2 — Does mobile carrier significantly affect conversion?
- **H₀:** Conversion rate is the same across mobile network carriers
- **H₁:** Carrier has a significant effect on the probability of conversion
```{r hypothesis-2, message=FALSE, warning=FALSE}
top_carriers_h <- installs %>%
  count(carrier_clean) %>%
  slice_max(n, n = 5) %>%
  pull(carrier_clean)

carrier_data <- installs %>%
  filter(carrier_clean %in% top_carriers_h)

ct2 <- table(carrier_data$carrier_clean, carrier_data$converted)
colnames(ct2) <- c("Not Converted", "Converted")

kable(ct2, caption = "Contingency Table: Carrier vs Conversion (Top 5 Carriers)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

chi2 <- chisq.test(ct2)
print(chi2)

cramers_v2 <- sqrt(chi2$statistic / (sum(ct2) * (min(dim(ct2)) - 1)))
cat("\nCramer's V (effect size):", round(as.numeric(cramers_v2), 4), "\n")
```
**Plain-language interpretation:** Nigerian fintechs often observe different behaviour patterns across Airtel, MTN, Glo, and 9mobile subscribers — partly driven by data bundle economics, partly by demographic overlap with each carrier’s base. If this test returns a significant result, it suggests carrier-based audience targeting is worth exploring in campaign optimisation.
## 7. Correlation Analysis
**Theory:** Correlation measures the strength and direction of the linear relationship between two numeric variables. Pearson’s r ranges from -1 (perfect negative relationship) to +1 (perfect positive relationship), with 0 indicating no linear relationship. Correlation does not imply causation — two variables can move together without one causing the other.
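Formally, for $n$ paired observations $(x_i, y_i)$, Pearson's coefficient is

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

i.e. the covariance of the two variables scaled by the product of their standard deviations.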
**Business justification:** Understanding how daily volume metrics relate to each other helps identify patterns in acquisition quality over time. If days with high install volumes show lower average conversion rates, that suggests campaign burst strategies are trading quality for scale — a trade-off worth quantifying before the next media plan is signed off.
```{r correlation, message=FALSE, warning=FALSE, fig.width=8, fig.height=7}
# Aggregate by day to create numeric variables for correlation
daily_metrics <- installs %>%
  group_by(install_date) %>%
  summarise(
    install_count   = n(),
    converted_count = sum(converted),
    conversion_rate = mean(converted),
    avg_os_version  = mean(os_version_num, na.rm = TRUE),
    avg_hour        = mean(install_hour, na.rm = TRUE),
    .groups = "drop"
  )

cor_vars <- daily_metrics %>%
  select(install_count, converted_count, conversion_rate, avg_os_version, avg_hour)

cor_matrix <- cor(cor_vars, use = "complete.obs", method = "pearson")

rownames(cor_matrix) <- colnames(cor_matrix) <-
  c("Daily Installs", "Daily Converted", "Conversion Rate",
    "Avg OS Version", "Avg Install Hour")

corrplot(cor_matrix, method = "color", type = "upper",
         addCoef.col = "black", tl.col = "black", tl.srt = 45,
         number.cex = 0.85,
         title = "Correlation Matrix — Daily Acquisition Metrics (April 2026)",
         mar = c(0, 0, 3, 0))
```
```{r correlation-table}
kable(round(cor_matrix, 3), caption = "Pearson Correlation Matrix — Daily Acquisition Metrics") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```
**Plain-language interpretation:** Three relationships to watch closely:
**Daily Installs vs Conversion Rate** — If negative, days with the highest volume tend to have lower quality. This is one of the most common trade-offs in paid acquisition and, if confirmed here, directly challenges a volume-first media strategy. A formal test of this pair follows below.
**Avg OS Version vs Conversion Rate** — A positive correlation suggests users on newer Android versions convert more reliably, which has implications for device-based bidding and audience exclusions.
**Daily Installs vs Daily Converted** — Expected to be strongly positive — more installs generally produce more converted users, even if the rate fluctuates.
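Because the daily aggregation leaves only around 30 observations, an apparent volume-quality correlation can easily be noise. A minimal sketch, assuming `daily_metrics` as computed above, tests whether the key installs-versus-quality relationship is statistically distinguishable from zero:

```{r cor-test-sketch, message=FALSE, warning=FALSE}
# Sketch: formal test of the volume-quality correlation.
# With ~30 daily observations the confidence interval will be wide,
# so treat the point estimate with caution.
cor.test(daily_metrics$install_count, daily_metrics$conversion_rate,
         method = "pearson")
```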
## 8. Logistic Regression
**Theory:** Logistic regression models the probability that a binary outcome occurs — in this case, whether a user converts — as a function of one or more predictor variables. Unlike linear regression, the output is constrained to a probability between 0 and 1. Coefficients are interpreted as log-odds, but exponentiating them gives odds ratios, which are more intuitive: an odds ratio of 1.5 means the odds of converting for users in that group are 50% higher than for the reference group, holding all other variables constant.
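Formally, with conversion probability $p$ and predictors $x_1, \dots, x_k$, the model estimates

$$\log\!\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k, \qquad \text{OR}_j = e^{\beta_j}$$

which is why the exponentiated coefficients in the results table below read directly as odds ratios.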
**Business justification:** Descriptive statistics tell us what happened. Regression tells us which factors drive outcomes when you control for everything else simultaneously. For a growth team making channel and targeting decisions, this is the difference between gut feel and data-backed strategy.
```{r regression, message=FALSE, warning=FALSE}
# Restrict to top states and carriers to keep the model tractable
top_states_r <- installs %>%
  filter(!is.na(state) & state != "" & state != "N/A") %>%
  count(state) %>%
  slice_max(n, n = 8) %>%
  pull(state)

top_carriers_r <- installs %>%
  filter(!is.na(carrier_clean) & carrier_clean != "") %>%
  count(carrier_clean) %>%
  slice_max(n, n = 5) %>%
  pull(carrier_clean)

model_data <- installs %>%
  filter(
    state %in% top_states_r,
    carrier_clean %in% top_carriers_r,
    !is.na(os_version_num),
    !is.na(acquisition_type),
    !is.na(device_category)
  ) %>%
  mutate(
    acquisition_type = factor(acquisition_type),
    state = factor(state),
    carrier_clean = factor(carrier_clean),
    device_category = factor(device_category),
    converted = factor(converted, levels = c(0, 1), labels = c("Not Converted", "Converted"))
  )

cat("Rows used in regression model:", format(nrow(model_data), big.mark = ","), "\n")

model <- glm(
  converted ~ acquisition_type + os_version_num + state + carrier_clean + device_category,
  data = model_data,
  family = binomial(link = "logit")
)

model_output <- tidy(model, exponentiate = TRUE, conf.int = TRUE) %>%
  mutate(
    Significant = ifelse(p.value < 0.05, "Yes", "No"),
    across(c(estimate, conf.low, conf.high), ~ round(.x, 3)),
    p.value = round(p.value, 4)
  ) %>%
  rename(
    Predictor = term,
    `Odds Ratio` = estimate,
    `Lower 95% CI` = conf.low,
    `Upper 95% CI` = conf.high,
    `P-value` = p.value
  )

kable(model_output %>%
        select(Predictor, `Odds Ratio`, `Lower 95% CI`, `Upper 95% CI`, `P-value`, Significant),
      caption = "Logistic Regression Results — Predictors of User Conversion") %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  row_spec(which(model_output$Significant == "Yes"),
           bold = TRUE, color = "white", background = "#2E86AB")
```
```{r regression-fit}
cat("Model AIC:", round(AIC(model), 1), "\n")
cat("Null Deviance:", round(model$null.deviance, 1), "\n")
cat("Residual Deviance:", round(model$deviance, 1), "\n")

pseudo_r2 <- 1 - (model$deviance / model$null.deviance)
cat("McFadden's Pseudo R-squared:", round(pseudo_r2, 4), "\n")
```
**Plain-language interpretation for a non-technical manager:** The model tells us which factors make it more or less likely that a user who installed the app will go on to do something with it. Read the odds ratios as multipliers. An odds ratio of 1.30 for a channel means the odds of conversion for users from that channel are 30% higher than for the reference channel, holding everything else constant. Rows highlighted in blue are statistically significant, meaning the effect is very unlikely to be a product of random chance. These are the variables marketing and product teams should prioritise when making targeting and budget decisions.
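Beyond reading the coefficient table, the fitted model can score individual user profiles. The minimal sketch below uses the first row of the modelling data as a stand-in profile (rather than inventing state or carrier values) to show how a predicted conversion probability could feed targeting decisions:

```{r score-sketch, message=FALSE, warning=FALSE}
# Sketch: predicted conversion probability for a single profile.
# The profile is taken from the modelling data itself so the example
# runs regardless of which states and carriers survived filtering.
example_user <- model_data %>%
  slice(1) %>%
  select(acquisition_type, os_version_num, state, carrier_clean, device_category)

predict(model, newdata = example_user, type = "response")
```

In practice, scoring a grid of channel, state, and carrier combinations this way would produce a ranked targeting matrix for the media team.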
## 9. Integrated Findings
The five analyses in this document address a single underlying business question from different angles: **what determines whether a paid install becomes an engaged user?**
EDA established the baseline — the shape of the data, the distribution of installs across channels and geographies, and the data quality issues that needed to be resolved before analysis could begin. It revealed that install volume is concentrated across a small number of channels, with programmatic networks accounting for a large share of raw numbers.
Data visualisation translated those numbers into patterns — which channels dominate by volume, how installs trended across the month, and crucially, how conversion rates differ by acquisition type. The conversion rate chart is the most commercially significant output in the document. It makes visible a quality gap that the raw install count actively hides.
Hypothesis testing confirmed that the differences in conversion rates across acquisition types are statistically significant — not noise. This is the step that turns an observation into an actionable finding. The result supports reallocating budget and reporting attention toward higher-converting channels.
Correlation analysis examined how daily acquisition metrics relate to each other over time. The relationship between install volume and conversion rate is particularly important. If volume and quality move in opposite directions, then media strategies that chase scale may be actively harming engagement outcomes.
Logistic regression brought all the variables together into a single model, identifying which factors — channel, state, carrier, OS version, device type — independently predict conversion probability. This is the output most directly applicable to campaign strategy: a ranked list of factors for the targeting team to prioritise when building audience segments, setting bid adjustments, and writing media briefs.
**Single recommendation:** Shift the primary performance KPI for paid acquisition from cost-per-install to cost-per-converted-user. The data shows that install volume and install quality are not the same thing across channels. A channel optimised purely on CPI may be systematically underdelivering engaged users. Reconfiguring campaign reporting and agency briefs around conversion rate — not just install volume — is the highest-leverage change available from this analysis.
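To make the KPI shift concrete, the sketch below applies a flat, **hypothetical** cost-per-install to the channel summary from Section 4.3; the real cost fields were too sparse to use (see Section 4.2). Even with identical CPI across channels, cost per converted user diverges wherever conversion rates differ:

```{r kpi-sketch, message=FALSE, warning=FALSE}
# Sketch only: the CPI below is a HYPOTHETICAL placeholder, not an
# actual Credit Direct buying rate. Real spend figures would come from
# each media source's billing reports.
assumed_cpi <- 400  # placeholder naira per install

acq_summary %>%
  mutate(
    `Spend (NGN, hypothetical)` = Installs * assumed_cpi,
    `Cost per Converted User (NGN)` = round(Installs * assumed_cpi / Converted)
  ) %>%
  select(acquisition_type, Installs, Converted, `Cost per Converted User (NGN)`)
```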
## 10. Limitations & Further Work
**Capped events data:** The in-app events export was limited to 200,000 rows by AppsFlyer’s export settings. This means some post-install events were not captured, and the measured conversion rate is likely a conservative underestimate of the true figure. With more time, this would be resolved by using AppsFlyer’s Data Locker for uncapped data access, or by running exports across shorter date windows.
**Outcome variable depth:** “Triggered any in-app event” is a broad definition of conversion. A more precise analysis would define a specific conversion milestone — registration completion, KYC initiation, or a product view — and measure quality against that. This would require mapping the full event taxonomy in the events file, which was not feasible within this submission timeline.
**Single-month window:** The analysis covers only April 2026. Seasonal effects, budget cycles, and campaign changes can all influence results within a single month. A three-month or full-quarter dataset would produce more stable estimates and better support the regression conclusions.
**Attribution model assumptions:** AppsFlyer uses last-click attribution by default. This overstates the contribution of bottom-funnel channels and understates channels earlier in the user journey. A multi-touch attribution model would produce a more accurate picture of channel contribution across the full funnel.
**With more time and computing power:** I would extend this analysis to include cohort-based retention metrics beyond single-event conversion, build a time series model on daily conversion rates by channel, and integrate spend data from each media source to compute true return on ad spend at the campaign level.
## References
Adi, B. (2026). *AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R.* Lagos Business School / markanalytics.online. https://markanalytics.online
Arinze, V. (2026). *Credit Direct Mobile non-organic installs and in-app events — April 2026* [Dataset]. Exported from AppsFlyer attribution platform on behalf of Credit Direct Finance Company Limited, Lagos, Nigeria. Data available on request from the author.
R Core Team. (2024). *R: A language and environment for statistical computing* (Version 4.6). R Foundation for Statistical Computing. https://www.R-project.org/
```{r package-citations, message=FALSE, warning=FALSE, results='asis'}
citation("tidyverse")
citation("corrplot")
citation("broom")
citation("knitr")
citation("janitor")
```
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. *Journal of Open Source Software, 4*(43), 1686. https://doi.org/10.21105/joss.01686
Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). *Quarto* (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048
## Appendix: AI Usage Statement
Claude (Anthropic) was used as a coding assistant throughout this project. Specifically, it was used to generate the R code structure, suggest appropriate statistical tests given the data characteristics, and help with ggplot2 syntax and formatting. All analytical decisions — the choice of case study, the definition of the outcome variable, the classification of programmatic versus direct channels, the selection of hypothesis tests, and the business interpretation of every result — were made independently by the author based on direct knowledge of Credit Direct’s acquisition operations and marketing context. The professional disclosure, data collection narrative, integrated findings, and recommendations are original writing grounded in the author’s day-to-day professional experience.