This tutorial examines the relationship between customer income and wine spending in a marketing campaign dataset. The data are the first 100 observations from a simulated version of the “Customer Personality Analysis” dataset on Kaggle, which includes consumer demographics and past purchase behavior. Only two columns are used here: Income and MntWines (the amount spent on wine). The goal is to test whether higher income predicts higher wine spending, which could inform targeted marketing strategies.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load data
data <- read_csv("marketing_campaign_updated.csv")
## Rows: 100 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): Income, MntWines
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View structure
head(data)
## # A tibble: 6 × 2
## Income MntWines
## <dbl> <dbl>
## 1 63891. 279.
## 2 73811. 324.
## 3 47733. 165.
## 4 57039. 238.
## 5 63791. 256.
## 6 90764. 334.
summary(data)
## Income MntWines
## Min. :38460 Min. :128.7
## 1st Qu.:54564 1st Qu.:209.4
## Median :63638 Median :239.1
## Mean :63202 Mean :242.0
## 3rd Qu.:70327 3rd Qu.:276.9
## Max. :90764 Max. :336.1
# Histograms
ggplot(data, aes(x = Income)) +
  geom_histogram(binwidth = 5000, fill = "steelblue", color = "black") +
  theme_minimal() +
  labs(title = "Income Distribution")
ggplot(data, aes(x = MntWines)) +
  geom_histogram(binwidth = 20, fill = "seagreen", color = "black") +
  theme_minimal() +
  labs(title = "Wine Spending Distribution")
# Linear regression
model <- lm(MntWines ~ Income, data = data)
summary(model)
##
## Call:
## lm(formula = MntWines ~ Income, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.433 -18.543 4.877 17.497 53.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.141e+01 1.398e+01 1.532 0.129
## Income 3.491e-03 2.179e-04 16.022 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.2 on 98 degrees of freedom
## Multiple R-squared: 0.7237, Adjusted R-squared: 0.7209
## F-statistic: 256.7 on 1 and 98 DF, p-value: < 2.2e-16
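The summary output above can be cross-checked by hand. The sketch below uses only the rounded values printed in the coefficient table, so its results are approximate. It rebuilds a 95% confidence interval for the Income slope, recovers the correlation from R-squared, and confirms that the squared t statistic matches the F statistic. The variable names are illustrative and not part of the original analysis.

```r
# Values copied from the printed summary(model) output (rounded)
slope     <- 3.491e-03
slope_se  <- 2.179e-04
df_resid  <- 98
r_squared <- 0.7237

# 95% confidence interval for the slope: estimate +/- t critical value * SE
t_crit   <- qt(0.975, df = df_resid)
slope_ci <- slope + c(-1, 1) * t_crit * slope_se

# In simple linear regression, |r| = sqrt(R^2); the sign follows the slope
r <- sign(slope) * sqrt(r_squared)

# With a single predictor, the squared t statistic equals the F statistic
f_from_t <- (slope / slope_se)^2

print(slope_ci)
print(r)
print(f_from_t)
```

With the fitted object available, `confint(model)` and `cor(data$Income, data$MntWines)` should give the same quantities without the rounding error.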
# Scatter plot with regression line
ggplot(data, aes(x = Income, y = MntWines)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Income vs Wine Spending", x = "Income", y = "Wine Spending") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
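The regression model shows a strong positive linear relationship between income and wine spending in this simulated sample. The Income coefficient is about 0.0035 and highly significant (p < 2e-16), so each additional 1,000 units of income is associated with roughly 3.5 more units of wine spending. Income alone explains about 72% of the variance in wine spending (R-squared = 0.72); the remaining 28% reflects factors not included in this model. The intercept is not statistically significant (p = 0.129). These results support using income-based customer segments as a starting point for targeted promotions.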
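To make the fitted line concrete, the sketch below plugs the rounded coefficients from the regression output into the model equation. The helper `predict_wine()` and the example income of 60,000 are hypothetical, chosen only for illustration, and the numbers are approximate because the coefficients are rounded.

```r
# Rounded coefficients copied from the regression output
intercept <- 21.41
slope     <- 0.003491

# Hypothetical helper: predicted wine spending for a given income
predict_wine <- function(income) {
  intercept + slope * income
}

# Expected change in wine spending per 10,000 units of income
effect_per_10k <- slope * 10000

# Predicted wine spending for an illustrative income of 60,000
pred_60k <- predict_wine(60000)

print(effect_per_10k)
print(pred_60k)
```

With the fitted object available, `predict(model, newdata = data.frame(Income = 60000))` should return the unrounded equivalent. The residual standard error of 24.2 gives a sense of how far individual customers typically fall from such a prediction.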