if (!"pacman" %in% installed.packages()[, "Package"]) {
install.packages("pacman", dependencies = TRUE)
library("pacman", character.only = TRUE)
}
pacman::p_load("here")
knitr::opts_knit$set(root.dir = here::here())
pacman::p_load("readr")
clv_data <- read_csv("./data/clv_data.csv")
## Rows: 500 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): purchase_frequency, customer_lifetime_value
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(clv_data)
## # A tibble: 6 × 2
## purchase_frequency customer_lifetime_value
## <dbl> <dbl>
## 1 3 110.
## 2 7 190.
## 3 6 160.
## 4 2 94.4
## 5 4 133.
## 6 8 223.
View the Dimensions
The dim() function returns the number of observations (rows) and the number of variables (columns).
dim(clv_data)
## [1] 500 2
View the Data Types
sapply(clv_data, class)
## purchase_frequency customer_lifetime_value
## "numeric" "numeric"
View the Variability
Measuring the variability in the dataset is important because the amount of variability determines how well you can generalize results from the sample to a new observation in the population. Low variability is ideal because it means that you can better predict information about the population based on the sample data.
sapply(clv_data, sd)
## purchase_frequency customer_lifetime_value
## 2.036393 40.525498
Kurtosis measures the "tailedness" of a distribution, i.e., how prone it is to producing outliers. There are different formulas for calculating kurtosis; type = 2 below corresponds to the sample excess kurtosis used by SAS and SPSS.
pacman::p_load("e1071")
sapply(clv_data, kurtosis, type = 2)
## purchase_frequency customer_lifetime_value
## -0.1220038 -0.1484811
The skewness is used to identify the asymmetry of the distribution of results. Similar to kurtosis, there are several ways of computing the skewness.
sapply(clv_data, skewness, type = 2)
## purchase_frequency customer_lifetime_value
## -0.04021915 -0.01608242
Covariance is a statistical measure that indicates the direction of the linear relationship between two variables. It assesses whether increases in one variable correspond to increases or decreases in another. It can be (illustrated in the sketch below):
Positive Covariance: the two variables tend to increase or decrease together
Negative Covariance: one variable tends to increase as the other decreases
Zero Covariance: there is no linear relationship between the variables
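The following minimal sketch, using small made-up vectors (not from the CLV dataset), illustrates the three cases:
x <- c(1, 2, 3, 4, 5)
cov(x, c(2, 4, 6, 8, 10))  # positive covariance: both variables increase together
cov(x, c(10, 8, 6, 4, 2))  # negative covariance: one increases as the other decreases
cov(x, c(5, 5, 5, 5, 5))   # zero covariance: no linear relationship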
cov(clv_data, method = "spearman")
## purchase_frequency customer_lifetime_value
## purchase_frequency 20409.91 20235.73
## customer_lifetime_value 20235.73 20874.99
A strong correlation between variables enables us to better predict the value of the dependent variable from the value of the independent variable, whereas a weak correlation does not. We can measure the statistical significance of the correlation using Spearman’s rank correlation rho.
cor.test(clv_data$customer_lifetime_value, clv_data$purchase_frequency, method = "spearman")
## Warning in cor.test.default(clv_data$customer_lifetime_value,
## clv_data$purchase_frequency, : Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: clv_data$customer_lifetime_value and clv_data$purchase_frequency
## S = 409190, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9803588
To view the correlation of all variables
cor(clv_data, method = "spearman")
## purchase_frequency customer_lifetime_value
## purchase_frequency 1.0000000 0.9803588
## customer_lifetime_value 0.9803588 1.0000000
Basic visualizations that can be applied include a histogram, a box and whisker plot, missing data plot, correlation plot, and a scatter plot.
par(mfrow = c(1, 2))
for (i in 1:2) {
  if (is.numeric(clv_data[[i]])) {
    hist(clv_data[[i]],
         main = names(clv_data)[i],
         xlab = names(clv_data)[i])
  } else {
    message(paste("Column", names(clv_data)[i], "is not numeric and will be skipped."))
  }
}
par(mfrow = c(1, 2))
for (i in 1:2) {
  if (is.numeric(clv_data[[i]])) {
    boxplot(clv_data[[i]], main = names(clv_data)[i])
  } else {
    message(paste("Column", names(clv_data)[i], "is not numeric and will be skipped."))
  }
}
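A missing data plot was listed among the basic visualizations above but is not demonstrated on this dataset. As a sketch, missingness can be summarised with base R and, assuming the naniar package is available, visualised with vis_miss():
# Count missing values per column
colSums(is.na(clv_data))
# Visualize missingness per cell (assumes the naniar package)
pacman::p_load("naniar")
vis_miss(clv_data)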
pacman::p_load("ggcorrplot")
ggcorrplot(cor(clv_data))
pairs(clv_data$customer_lifetime_value ~ ., data = clv_data, col = clv_data$customer_lifetime_value)
pacman::p_load("ggplot2")
ggplot(clv_data,
       aes(x = purchase_frequency, y = customer_lifetime_value)) +
  geom_point() +
  geom_smooth(method = lm) +
  labs(
    title = "Relationship between Customer Lifetime Value and Purchase Frequency",
    x = "Purchase Frequency",
    y = "Customer Lifetime Value"
  )
## `geom_smooth()` using formula = 'y ~ x'
We then apply a simple linear regression as a statistical test for regression. View the summary of the model.
slr_test <- lm(customer_lifetime_value ~ purchase_frequency, data = clv_data)
summary(slr_test)
##
## Call:
## lm(formula = customer_lifetime_value ~ purchase_frequency, data = clv_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.1176 -5.6169 -0.0491 5.6618 20.4837
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.2538 0.9042 57.79 <2e-16 ***
## purchase_frequency 19.5356 0.1700 114.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.734 on 498 degrees of freedom
## Multiple R-squared: 0.9637, Adjusted R-squared: 0.9636
## F-statistic: 1.32e+04 on 1 and 498 DF, p-value: < 2.2e-16
Diagnostic EDA is performed to validate that the regression assumptions are true with respect to the statistical test. Validating the regression assumption in turn ensures that the statistical tests applied are appropriate for the data and helps to prevent incorrect conclusions.
The test of linearity is used to assess whether the relationship between the dependent variables and the independent variables is linear. This is necessary given that linearity is one of the key assumptions of statistical tests of regression and verifying it is crucial for ensuring the validity of the model’s estimates and predictions.
plot(slr_test, which = 1)
This test is necessary to confirm that each observation is independent of the others. It helps to identify autocorrelation that is introduced when the data is collected over a close period of time or when one observation is related to another observation. Autocorrelation leads to underestimated standard errors and inflated t-statistics.
pacman::p_load("lmtest")
dwtest(slr_test)
##
## Durbin-Watson test
##
## data: slr_test
## DW = 1.9104, p-value = 0.1573
## alternative hypothesis: true autocorrelation is greater than 0
The test of normality assesses whether the residuals are normally distributed, i.e., most residuals (errors) are close to zero and large errors are rare. A Q-Q plot can be used to conduct the test of normality. A Q-Q plot is a scatterplot of the quantiles of the residuals against the quantiles of a normal distribution. Quantiles are statistical values that divide a dataset or probability distribution into equal-sized intervals. They help in understanding how data is distributed by marking specific points that separate the data into groups of equal size.
plot(slr_test, which = 2)
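The Q-Q plot is a visual check. As a possible quantitative complement (not part of the original analysis), a Shapiro-Wilk test can be applied to the residuals:
# Shapiro-Wilk test of normality (H0: the residuals are normally distributed)
shapiro.test(residuals(slr_test))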
Homoscedasticity requires that the spread of residuals should be constant across all levels of the independent variable. A scale-location plot (a.k.a. spread-location plot) can be used to conduct a test of homoscedasticity.
The x-axis shows the fitted (predicted) values from the model and the y-axis shows the square root of the standardized residuals. The red line is added to help visualize any patterns.
In a model with homoscedastic errors (equal variance across all predicted values):
• Points should be randomly scattered around a horizontal line
• The smooth line should be approximately horizontal
• The vertical spread of points should be roughly equal across all fitted values
• No obvious patterns, funnels, or trends should be visible
plot(slr_test, which = 3)
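As a possible quantitative complement to the scale-location plot (not part of the original analysis), the Breusch-Pagan test from the lmtest package, already loaded for the Durbin-Watson test, can be applied:
# Breusch-Pagan test (H0: the residuals have constant variance, i.e., homoscedasticity)
pacman::p_load("lmtest")
bptest(slr_test)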
The graphical representations of the various tests of assumptions should be accompanied by quantitative values. The gvlma package (Global Validation of Linear Models Assumptions) is useful for this purpose.
pacman::p_load("gvlma")
gvlma_results <- gvlma(slr_test)
summary(gvlma_results)
##
## Call:
## lm(formula = customer_lifetime_value ~ purchase_frequency, data = clv_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.1176 -5.6169 -0.0491 5.6618 20.4837
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.2538 0.9042 57.79 <2e-16 ***
## purchase_frequency 19.5356 0.1700 114.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.734 on 498 degrees of freedom
## Multiple R-squared: 0.9637, Adjusted R-squared: 0.9636
## F-statistic: 1.32e+04 on 1 and 498 DF, p-value: < 2.2e-16
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = slr_test)
##
## Value p-value Decision
## Global Stat 5.08943 0.27824 Assumptions acceptable.
## Skewness 0.03973 0.84201 Assumptions acceptable.
## Kurtosis 3.61252 0.05735 Assumptions acceptable.
## Link Function 0.01459 0.90385 Assumptions acceptable.
## Heteroscedasticity 1.42258 0.23298 Assumptions acceptable.
We can interpret the results of the statistical test with more confidence if the tests of assumptions are successful.
A simple linear regression was conducted on data from 500 observations (N = 500) to examine the relationship between customer lifetime value (CLV) and purchase frequency. The results indicated that purchase frequency significantly predicted CLV, b = 19.54, 95% CI [19.20, 19.87], SE = 0.17, t(498) = 114.91, p < .001. The model explained 96.37% of the variance in CLV (R² = .96, F(1, 498) = 13,200, p < .001). For every unit increase in purchase frequency, CLV increased by approximately 19.54 units. The intercept was 52.25, 95% CI [50.48, 54.03], and the residual standard error was 7.73, indicating strong predictive accuracy.
t-Statistic
t(498) = 114.91: This is the calculated t-value for the purchase_frequency coefficient. It quantifies how many standard errors the estimated coefficient (19.54) deviates from zero. A larger t-value (e.g., >2) indicates stronger evidence against the null hypothesis (i.e., that the coefficient is zero).
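As a sketch, the t-value can be reproduced from the fitted model by dividing the coefficient estimate by its standard error:
# Coefficient table: estimate, standard error, t value and p-value
coefs <- coef(summary(slr_test))
coefs["purchase_frequency", ]
# The t value is the estimate divided by its standard error (approximately 114.91)
coefs["purchase_frequency", "Estimate"] / coefs["purchase_frequency", "Std. Error"]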
Degrees of Freedom
Degrees of freedom refers to the number of values in a calculation that are free to vary. It is essentially a measure of how much independent information is available for estimating a statistical parameter.
For example: Imagine you need to calculate the average height of 5 people, and you know the sum of all their heights is 340 inches. If you know the heights of 4 of these people (65, 70, 68, and 72 inches), you can automatically determine the height of the fifth person without measuring them: 340 - (65 + 70 + 68 + 72) = 65 inches. In this example, even though there are 5 people, you only have 4 degrees of freedom because once you know 4 heights and the total, the 5th height is no longer “free to vary” – it is determined by the other values.
Degrees of freedom affects the shape of sampling distributions, which in turn influences p-values and critical values used in hypothesis testing.
df = 498:
This reflects the sample size adjusted for the number of parameters estimated in the model. For a simple linear regression (one independent variable plus an intercept), df is calculated as n − 2, where n is the total sample size. Here, 498 = 500 − 2.
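The residual degrees of freedom can be read directly from the fitted model and checked against n − 2:
# Residual degrees of freedom stored in the fitted model (498)
df.residual(slr_test)
# Equivalent to the number of observations minus the two estimated parameters
nrow(clv_data) - 2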
The t-statistic evaluates whether purchase frequency has a statistically significant relationship with customer lifetime value. With t(498)=114.91 and p < .001, the result is highly significant, rejecting the null hypothesis. This means purchase frequency strongly predicts CLV in the population.
F-Statistic
F(1, 498) = 13,200, p < .001. The analysis yielded an F-statistic of 13,200, with 1 degree of freedom in the numerator (the number of predictors) and 498 degrees of freedom in the denominator (F(1, 498) = 13,200). The numerator degrees of freedom correspond to the single predictor variable (purchase frequency), while the denominator degrees of freedom are derived from the total number of observations (500) minus the number of predictors (1) and minus 1 for the intercept (500 - 1 - 1 = 498).
The p-value associated with this F-statistic was less than .001 (p < .001). Following APA style guidelines, exact p-values are reported unless they fall below .001, in which case “p < .001” is used. The low p-value (any p-value < .05 is considered low) indicates that the overall regression model is statistically significant.
The F-test in regression evaluates whether the variance explained by the model is significantly greater than the unexplained variance (error). Think of the F-statistic as a ratio of “signal” (useful prediction) to “noise” (unexplained variation). The higher this ratio, the more confident you can be that your model is capturing something real. The larger the F-Statistic, the better the model’s performance.
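As a sketch, the F-statistic can be extracted from the model summary; with a single predictor it also equals the square of that predictor's t-value:
# F-statistic (value, numerator df, denominator df)
summary(slr_test)$fstatistic
# For simple linear regression, F equals the squared t value of the predictor
coef(summary(slr_test))["purchase_frequency", "t value"]^2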
Coefficient of Determination (R²)
The R-squared value represents the proportion of the total variation in the dependent variable (CLV) that can be attributed to or explained by the independent variable (purchase frequency). An R-squared of 0.96 indicates that approximately 96% of the variability in customer lifetime value can be explained by its linear relationship with purchase frequency. An R-squared value approaching 1 signifies that the regression line closely aligns with the observed data points.
Adjusted R-squared is a modified statistic that accounts for the number of predictors in the model; for simple linear regression with only one predictor, the difference between R-squared and adjusted R-squared is usually negligible.
Multiple R-squared: Measures the proportion of variance in Y explained by X (e.g., R² = 0.6 means 60% of sales variance is explained by advertisement expenditure). The multiple R-squared value always increases (or at least never decreases) when you add more independent variables.
Adjusted R-squared: Also measures the proportion of variance in Y explained by X, however, it introduces a penalty based on the number of independent variables relative to the sample size.
The difference between multiple R-squared and adjusted R-squared is negligible in cases where there is only 1 independent variable.
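As a sketch, R-squared can be reproduced from the residual and total sums of squares:
# R-squared from first principles: 1 - SS_residual / SS_total (approximately 0.9637)
y <- clv_data$customer_lifetime_value
1 - sum(residuals(slr_test)^2) / sum((y - mean(y))^2)
# For simple linear regression this equals the squared Pearson correlation
cor(clv_data$purchase_frequency, y)^2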
Residual Standard Error
The residual standard error quantifies the average magnitude of the errors (residuals), which are the discrepancies between the observed CLV values and the CLV values predicted by the regression model. It represents the standard deviation of the data points around the regression line. The residual standard error was 7.73 on 498 degrees of freedom. A residual standard error of 7.73 indicates that, on average, the predicted customer lifetime value from the model deviates from the actual observed value by approximately 7.73 units.
A smaller residual standard error implies that the data points are more tightly clustered around the regression line, indicating a more precise model.
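As a sketch, the residual standard error can be reproduced from the residual sum of squares and the residual degrees of freedom:
# Residual standard error: sqrt(SS_residual / residual df), approximately 7.73
sqrt(sum(residuals(slr_test)^2) / df.residual(slr_test))
# The same value is stored in the model summary
summary(slr_test)$sigma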
Confidence Interval
A 95% confidence interval (CI) for a parameter—such as a regression coefficient—provides a range that, under repeated sampling, would contain the true (but unknown) population parameter 95% of the time. Analogy: Imagine shooting arrows at a target. If you drew a circle around where 95% of your arrows landed, that circle is like a confidence interval—it captures the region in which your “shots” (i.e., estimates from different samples) tend to fall. Uncertainty quantification: A CI communicates your estimate’s precision—narrower intervals imply more precise estimates (often due to larger samples or less variability), whereas wider intervals indicate greater uncertainty about the true value.
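As a sketch, the 95% confidence intervals reported above can be obtained directly from the fitted model with confint():
# 95% confidence intervals for the intercept and the slope (e.g., slope CI of about [19.20, 19.87])
confint(slr_test, level = 0.95)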
The strength of the relationship highlights the critical importance of customer retention. Initiatives that effectively encourage repeat purchases appear to be a primary driver of customer lifetime value based on this analysis. This understanding can guide the allocation of resources towards strategies that foster customer loyalty and encourage repeat business.
The model employed is a simple linear regression, which only considers the linear relationship between purchase frequency and CLV. Other potentially influential factors that are not included in this model could also play a significant role in determining CLV, e.g., the average monetary value of each purchase.