Introduction

This tutorial explores the relationship between customer income and wine spending using a marketing campaign dataset. We use the first 100 observations from a simulated version of the “Customer Personality Analysis” dataset on Kaggle, which includes consumer demographics and past purchase behavior; only two variables are needed here, Income and MntWines (the amount spent on wine). The goal is to determine whether higher income predicts higher wine spending, which could inform targeted marketing strategies.

Data Preparation

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load data
data <- read_csv("marketing_campaign_updated.csv")
## Rows: 100 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): Income, MntWines
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View structure
head(data)
## # A tibble: 6 × 2
##   Income MntWines
##    <dbl>    <dbl>
## 1 63891.     279.
## 2 73811.     324.
## 3 47733.     165.
## 4 57039.     238.
## 5 63791.     256.
## 6 90764.     334.
summary(data)
##      Income         MntWines    
##  Min.   :38460   Min.   :128.7  
##  1st Qu.:54564   1st Qu.:209.4  
##  Median :63638   Median :239.1  
##  Mean   :63202   Mean   :242.0  
##  3rd Qu.:70327   3rd Qu.:276.9  
##  Max.   :90764   Max.   :336.1

Exploratory Data Analysis

# Histograms
ggplot(data, aes(x = Income)) +
  geom_histogram(binwidth = 5000, fill = "steelblue", color = "black") +
  theme_minimal() +
  labs(title = "Income Distribution")

ggplot(data, aes(x = MntWines)) +
  geom_histogram(binwidth = 20, fill = "seagreen", color = "black") +
  theme_minimal() +
  labs(title = "Wine Spending Distribution")

Regression Analysis

# Simple linear regression: wine spending (response) on income (predictor)
model <- lm(MntWines ~ Income, data = data)
summary(model)
## 
## Call:
## lm(formula = MntWines ~ Income, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.433 -18.543   4.877  17.497  53.620 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.141e+01  1.398e+01   1.532    0.129    
## Income      3.491e-03  2.179e-04  16.022   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.2 on 98 degrees of freedom
## Multiple R-squared:  0.7237, Adjusted R-squared:  0.7209 
## F-statistic: 256.7 on 1 and 98 DF,  p-value: < 2.2e-16
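
To read the coefficient table in practical terms, the fitted line can be evaluated directly from the reported estimates. The sketch below hard-codes the rounded intercept and slope from the summary output, so its results agree with `predict(model, ...)` only up to rounding. The helper name `predict_wine` is hypothetical and used purely for illustration.

```r
# Rounded estimates copied from the summary(model) output above
intercept <- 21.41      # (Intercept)
slope     <- 0.003491   # Income

# Hypothetical helper: fitted wine spending for a given income
predict_wine <- function(income) {
  intercept + slope * income
}

# Fitted spending at the sample mean income (63202, from summary(data))
predict_wine(63202)

# Change in fitted spending per additional 1,000 of income
slope * 1000

# Consistency checks on the reported statistics:
# with a single predictor, F equals the squared t value of the slope,
# and R-squared equals F / (F + residual degrees of freedom)
t_value   <- 16.022
f_stat    <- t_value^2
r_squared <- f_stat / (f_stat + 98)
round(f_stat, 1)
round(r_squared, 4)
```

The fitted value at the mean income comes out at about 242, close to the mean of MntWines (242.0), as expected because a least-squares line with an intercept passes through the sample means. The last two values reproduce the reported F-statistic (256.7) and Multiple R-squared (0.7237).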

Visualization

# Scatter plot with regression line
ggplot(data, aes(x = Income, y = MntWines)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Income vs Wine Spending", x = "Income", y = "Wine Spending") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
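
A fitted line alone does not show whether the linear model's assumptions are reasonable. A common follow-up is a residuals-versus-fitted plot. The sketch below assumes the `data` and `model` objects from the earlier steps; only so that it runs on its own, it falls back to a simulated stand-in (hypothetical values loosely matching the summaries above, not the tutorial's data) when no fitted model is found. The names `diagnostics` and `residual_plot` are illustrative.

```r
library(ggplot2)

# Fallback so the block runs standalone: a SIMULATED stand-in, not the real data
if (!exists("model") || !inherits(model, "lm")) {
  set.seed(42)
  data <- data.frame(Income = rnorm(100, mean = 63000, sd = 11000))
  data$MntWines <- 21.4 + 0.0035 * data$Income + rnorm(100, sd = 24)
  model <- lm(MntWines ~ Income, data = data)
}

# Collect fitted values and residuals from the model
diagnostics <- data.frame(
  fitted   = fitted(model),
  residual = resid(model)
)

# Residuals vs fitted: look for curvature or a funnel shape
residual_plot <- ggplot(diagnostics, aes(x = fitted, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs Fitted", x = "Fitted Wine Spending", y = "Residual") +
  theme_minimal()

residual_plot
```

Points scattered evenly around the dashed zero line, with no visible curve or widening spread, would be consistent with a linear relationship and roughly constant error variance.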

Conclusion

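The regression model shows a positive linear relationship between income and wine spending: the estimated slope of 0.003491 means that each additional 1,000 units of income is associated with roughly 3.5 more units of wine spending, and the effect is highly significant (p < 2e-16). In this sample the relationship is strong rather than modest: income alone explains about 72% of the variance in wine spending (R-squared = 0.7237), although the residual standard error of 24.2 shows that other factors also influence spending. Because the data are simulated and limited to 100 observations, these results are illustrative and do not establish causation; in practice, a finding like this could guide marketing efforts by identifying income-based customer segments for targeted promotions.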
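
One simple way to form the income-based segments mentioned above is to split customers at the income quartiles. The sketch below is only an illustration: it hard-codes the quartile values printed by `summary(data)` (which are rounded for display), and the helper name `income_segment` and the segment labels are hypothetical.

```r
# Approximate quartile breaks for Income, copied from summary(data) above
income_breaks <- c(38460, 54564, 63638, 70327, 90764)

# Hypothetical helper: assign each income to a quartile-based segment
income_segment <- function(income) {
  cut(
    income,
    breaks = income_breaks,
    labels = c("Q1 (lowest)", "Q2", "Q3", "Q4 (highest)"),
    include.lowest = TRUE
  )
}

# Example with three incomes from head(data), rounded as printed
income_segment(c(47733, 63891, 90764))
```

Incomes outside the observed range (below 38460 or above 90764) are returned as `NA`, so a real segmentation would need breaks chosen for the full customer base.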
