Customer spending behavior plays a crucial role in shaping retail strategies, with the spending score serving as a valuable indicator for segmenting customers and personalizing marketing campaigns (Kotler & Keller, 2016, Marketing Management). The spending score reflects a customer’s propensity to spend, influenced by demographic and socio-economic variables such as age, income, gender, and family size (Sharma, 2021, Consumer Behavior). Understanding these determinants enables businesses to optimize targeting, enhance service delivery, and improve customer satisfaction (Hawkins & Mothersbaugh, 2019, Consumer Behavior: Building Marketing Strategy).
Although customer data is widely available, many businesses lack a statistically sound framework for identifying the key drivers of spending behavior. Without such an approach, marketing strategies risk targeting inappropriate segments, leading to inefficiencies and reduced returns (Wedel & Kamakura, 2012, Market Segmentation).
To examine the relationship between annual income and spending score.
To analyze the effect of family size on spending behavior.
To assess the role of work experience in influencing spending scores.
To investigate the influence of age on customer spending behavior.
Findings from this study will inform evidence-based retail segmentation strategies, enabling businesses to maximize revenue through targeted marketing (Kotler & Keller, 2016). The research also contributes to academic literature on consumer behavior analytics (Hawkins & Mothersbaugh, 2019).
Studies consistently highlight the role of income in influencing purchasing power, as higher earnings directly increase discretionary spending capacity (Khan & Sulaiman, 2018, Determinants of Consumer Spending). Younger, high-income individuals often display higher discretionary spending compared to older cohorts (Sharma, 2021). Family size moderates spending behavior, since larger households allocate more resources toward basic necessities, leaving less for discretionary consumption (Choudhary, 2019, Household Consumption and Family Size).
Gender differences are also evident: women are often more actively engaged in retail transactions and display stronger consumer involvement (Dittmar, 2005, Consumer Culture, Identity and Well-Being). Similarly, work experience—representing professional maturity and income stability—can shape financial decision-making and spending behavior (Browning & Crossley, 2001, The Life-Cycle Model of Consumption).
Taken together, this evidence suggests that demographic and socio-economic factors are central to explaining customer spending scores.
The dataset was obtained from Kaggle and includes variables such as gender, age, annual income, family size, work experience, and spending score.
Variables were renamed for consistency.
Categorical variables were converted into factors.
Continuous variables were normalized using min-max scaling.
Gender was recoded as a binary (0/1) indicator, although it was not entered into the final regression model.
A multiple linear regression model was fitted with spending score as the dependent variable and annual income, family size, work experience, and age as predictors. Model diagnostics were conducted to assess normality, homoscedasticity, and potential outliers.
## # A tibble: 2 × 8
## CustomerID Gender Age `Annual Income ($)` Spending Score (1-100…¹ Profession
## <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 1 Male 19 15000 39 Healthcare
## 2 2 Male 21 35000 81 Engineer
## # ℹ abbreviated name: ¹`Spending Score (1-100)`
## # ℹ 2 more variables: `Work Experience` <dbl>, `Family Size` <dbl>
The dataset contains demographic and socio-economic information for 2,000 customers.
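library(dplyr)    # rename(), count(), mutate(), %>%
library(ggplot2)  # plotting functions used throughout
# Rename columns to short, syntactic names
customer <- customer %>%
  rename(
    spendingscore  = `Spending Score (1-100)`,
    annualincome   = `Annual Income ($)`,
    familysize     = `Family Size`,
    workexperience = `Work Experience`,
    gender         = Gender
  )
# Convert character columns to factors
customer$gender <- as.factor(customer$gender)
customer$Profession <- as.factor(customer$Profession)
# Keep the analysis columns (positions 2, 3, 4, 5, 7, 8), selected by name
customerval <- customer[, c("gender", "Age", "annualincome", "spendingscore",
                            "workexperience", "familysize")]
head(customerval, 2)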
## # A tibble: 2 × 6
## gender Age annualincome spendingscore workexperience familysize
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Male 19 15000 39 1 4
## 2 Male 21 35000 81 3 3
Gender Distribution
# Calculate frequencies and percentages
gender_counts <- customerval %>%
count(gender) %>%
mutate(percentage = n / sum(n) * 100)
# Create the bar chart
ggplot(gender_counts, aes(x = gender, y = n, fill = gender)) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste0(n, " (", round(percentage, 1), "%)")),
vjust = -0.5, size = 5) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) + # Expand limits for better text placement
labs(title = "Gender Distribution",
x = "Gender",
y = "Frequency") +
theme_minimal(base_size = 15) + # Use a clean minimal theme with larger base text
theme(legend.position = "none")
The chart shows that female customers outnumber male customers: 1,186 (59.3%) versus 814 (40.7%). The sample is therefore moderately imbalanced rather than evenly split. Both groups are nevertheless large, so comparisons between male and female respondents remain feasible, provided the unequal group sizes are kept in mind when interpreting gender-related effects on spending score.
# Calculate frequencies and percentages
workexperience_counts <- customerval %>%
count(workexperience) %>%
mutate(percentage = n / sum(n) * 100)
# Create the bar chart
ggplot(workexperience_counts, aes(x = workexperience, y = n, fill = workexperience)) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste0(n, " (", round(percentage, 1), "%)")),
vjust = -1.5, angle = 90, size = 2) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) + # Expand limits for better text placement
labs(title = "Work Experience Distribution",
x = "Work Experience",
y = "Frequency") +
theme_minimal(base_size = 15) + # Use a clean minimal theme with larger base text
theme(legend.position = "none")
The distribution covers both early-career and experienced individuals, so the sample contains enough variation in work experience to test whether it is associated with spending score.
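# Toy example with six hand-entered rows (not the customer data): the same chart with vertical labels
df <- data.frame(gender = c('Male', 'Female', 'Female', 'Male', 'Female', 'Male'))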
# Calculate frequencies and percentages
gender_counts <- df %>%
count(gender) %>%
mutate(percentage = n / sum(n) * 100)
# Create the bar chart
ggplot(gender_counts, aes(x = gender, y = n, fill = gender)) +
geom_bar(stat = "identity") +
# Add vertically placed frequency and percentage labels
geom_text(aes(label = paste0(n, "\n(", round(percentage, 1), "%)")),
vjust = -0.5, # Slightly above the bars
angle = 90, # Rotate text to be vertical
size = 5) + # Adjust text size for readability
# Adjust the y-axis to make space for text
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) + # Expand limits for better text placement
# Labels and theme
labs(title = "Gender Distribution",
x = "Gender",
y = "Frequency") +
theme_minimal(base_size = 15) +
theme(legend.position = "none")
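# skewness() and kurtosis() are not base R functions; they are assumed here to
# come from the 'moments' package (the package actually used is not shown)
library(moments)
Age <- customerval$Age
# Calculate summary statistics
mean_value <- mean(Age, na.rm = TRUE)
sd_value <- sd(Age, na.rm = TRUE)
skewness_value <- skewness(Age, na.rm = TRUE)
kurtosis_value <- kurtosis(Age, na.rm = TRUE)
# Perform Shapiro-Wilk test for normality (valid for 3 to 5000 observations)
shapiro_test <- shapiro.test(Age)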
# Plot the histogram with density plot and summary statistics
ggplot(customerval, aes(x = Age)) +
# Histogram with density instead of counts
geom_histogram(aes(y = after_stat(density)), binwidth = 1, fill = "lightblue", color = "black", alpha = 0.7) +
# Add density plot
geom_density(color = "red", linewidth = 1) +
# Add a normal distribution curve
stat_function(fun = dnorm, args = list(mean = mean_value, sd = sd_value), color = "blue", linewidth = 1, linetype = "dashed") +
# Add text annotations for the normality estimates and statistics
annotate("text", x = Inf, y = Inf, label = paste("Mean: ", round(mean_value, 2)),
hjust = 1.1, vjust = 2.5, size = 4, color = "black") +
annotate("text", x = Inf, y = Inf, label = paste("SD: ", round(sd_value, 2)),
hjust = 1.1, vjust = 3.5, size = 4, color = "black") +
annotate("text", x = Inf, y = Inf, label = paste("Skewness: ", round(skewness_value, 2)),
hjust = 1.1, vjust = 4.5, size = 4, color = "black") +
annotate("text", x = Inf, y = Inf, label = paste("Kurtosis: ", round(kurtosis_value, 2)),
hjust = 1.1, vjust = 5.5, size = 4, color = "black") +
annotate("text", x = Inf, y = Inf, label = paste("Shapiro p-value: ", round(shapiro_test$p.value, 4)),
hjust = 1.1, vjust = 6.5, size = 4, color = "black") +
# Labels and themes
labs(title = "Histogram with Normality Test and Density Plot",
x = "Age",
y = "Density") +
theme_minimal(base_size = 15) + # Clean theme for publication
theme(plot.title = element_text(hjust = 0.5)) # Center title
The Shapiro–Wilk test produced a p-value greater than 0.05, so the null hypothesis of normality cannot be rejected and age can be treated as approximately normally distributed. Linear regression does not, however, require predictors to be normally distributed; the normality assumption applies to the model residuals. Age is therefore entered into the regression model without transformation, which keeps its coefficient directly interpretable.
annualincome <- customerval$annualincome
# Calculate summary statistics
mean_value <- mean(annualincome, na.rm = TRUE)
sd_value <- sd(annualincome, na.rm = TRUE)
skewness_value <- skewness(annualincome, na.rm = TRUE)
kurtosis_value <- kurtosis(annualincome, na.rm = TRUE)
# Perform Shapiro-Wilk test for normality
shapiro_test <- shapiro.test(annualincome)
The descriptive statistics summarize the distribution of annual income. The mean and standard deviation describe its central tendency and spread. Skewness and kurtosis indicate how far the distribution departs from a normal shape, which helps decide whether a transformation is worth considering before modeling. Finally, the Shapiro–Wilk test checks formally for departures from normality, although for regression the assumption that matters is normality of the residuals rather than of each individual variable.
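The skewness and kurtosis measures described above can be written out from their moment definitions. The following is a minimal base-R sketch; the helper names `moment_skewness` and `moment_kurtosis` are hypothetical and are not the functions called in the report, and the input vector is illustrative rather than taken from the customer data.

```r
# Moment-based skewness: third central moment over variance^(3/2)
moment_skewness <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based (Pearson) kurtosis: fourth central moment over variance^2.
# A normal distribution gives 3 under this convention (not excess kurtosis).
moment_kurtosis <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Illustrative symmetric sample, not the customer dataset
x_sym <- c(-2, -1, 0, 1, 2)
moment_skewness(x_sym)  # 0: the sample is symmetric
moment_kurtosis(x_sym)  # 1.7: flatter than a normal distribution
```

Packages differ in convention: some report excess kurtosis (the value above minus 3), so the reference value for "normal" should be checked against the package that is actually loaded.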
spendingscore <- customerval$spendingscore
# Calculate summary statistics
mean_value <- mean(spendingscore, na.rm = TRUE)
sd_value <- sd(spendingscore, na.rm = TRUE)
skewness_value <- skewness(spendingscore, na.rm = TRUE)
kurtosis_value <- kurtosis(spendingscore, na.rm = TRUE)
# Perform Shapiro-Wilk test for normality
shapiro_test <- shapiro.test(spendingscore)
The normality assessment indicates that the spending score is approximately normally distributed (p > 0.05). The variable is therefore used untransformed as a continuous dependent variable in the linear regression. Strictly, the regression assumption concerns the normality of the residuals rather than of the outcome itself, so this check is a preliminary screen and the residual diagnostics remain the decisive test.
# Plot the histogram with density plot and summary statistics
ggplot(customerval, aes(x = spendingscore)) +
# Histogram with density instead of counts
geom_histogram(aes(y = after_stat(density)), binwidth = 1, fill = "lightblue", color = "black", alpha = 0.7) +
# Add density plot
geom_density(color = "red", linewidth = 1) +
# Add a normal distribution curve
stat_function(fun = dnorm, args = list(mean = mean_value, sd = sd_value), color = "blue", linewidth = 1, linetype = "dashed") +
# Add text annotations for the normality estimates and statistics
annotate("text", x = Inf, y = Inf, label = paste("Mean: ", round(mean_value, 2)),
hjust = 1.1, vjust = 2.5, size = 4, color = "black") +
annotate("text", x = Inf, y = Inf, label = paste("SD: ", round(sd_value, 2)),
hjust = 1.1, vjust = 3.5, size = 4, color = "black") +
annotate("text", x = Inf, y = Inf, label = paste("Skewness: ", round(skewness_value, 2)),
hjust = 1.1, vjust = 4.5, size = 4, color = "black") +
annotate("text", x = Inf, y = Inf, label = paste("Kurtosis: ", round(kurtosis_value, 2)),
hjust = 1.1, vjust = 5.5, size = 4, color = "black") +
annotate("text", x = Inf, y = Inf, label = paste("Shapiro p-value: ", round(shapiro_test$p.value, 4)),
hjust = 1.1, vjust = 6.5, size = 4, color = "black") +
# Labels and themes
labs(title = "Histogram with Normality Test and Density Plot",
x = "Spendingscore",
y = "Density") +
theme_minimal(base_size = 15) + # Clean theme for publication
theme(plot.title = element_text(hjust = 0.5)) # Center title
The distribution of the spending score shows that it is approximately
normally distributed with only moderate skewness. The histogram further
supports this observation, indicating that the deviations from normality
are not severe. As such, the spending score is appropriate for use as a
continuous dependent variable in linear regression, since the moderate
skewness does not substantially violate the assumptions underlying
parametric modeling.
# Min-max normalization (rescales each variable to the 0-1 range)
cusnorm <- as.data.frame(customerval)
# Define the min-max normalization function
normalize <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
# Apply normalization to each continuous column
cusnorm$Age <- normalize(cusnorm$Age)
cusnorm$annualincome <- normalize(cusnorm$annualincome)
cusnorm$spendingscore <- normalize(cusnorm$spendingscore)
cusnorm$workexperience <- normalize(cusnorm$workexperience)
# Scale familysize from its own column (not from Age)
cusnorm$familysize <- normalize(cusnorm$familysize)
# Check the result
summary(cusnorm)
## gender Age annualincome spendingscore
## Female:1186 Min. :0.0000 Min. :0.0000 Min. :0.0000
## Male : 814 1st Qu.:0.2525 1st Qu.:0.3925 1st Qu.:0.2800
## Median :0.4848 Median :0.5793 Median :0.5000
## Mean :0.4945 Mean :0.5829 Mean :0.5096
## 3rd Qu.:0.7374 3rd Qu.:0.7848 3rd Qu.:0.7500
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## workexperience familysize
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.05882 1st Qu.:0.2525
## Median :0.17647 Median :0.4848
## Mean :0.24132 Mean :0.4945
## 3rd Qu.:0.41176 3rd Qu.:0.7374
## Max. :1.00000 Max. :1.0000
Min-max normalization rescales each continuous variable to the range 0 to 1, which the summary statistics confirm. Two caveats apply. First, the familysize column in the printed summary is identical to the Age column (same quartiles, median, and mean), which indicates that the output was generated with familysize computed from Age rather than from family size, so the normalized family size values shown should not be interpreted. Second, rescaling predictors changes the units of ordinary least squares coefficients but not the model's fit, t statistics, or p-values, and the regression reported in this study was in any case fitted to the unscaled customer data. The practical benefit of normalization here is comparability: each scaled coefficient would express the effect of moving from the minimum to the maximum of that variable.
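The effect of min-max scaling on a linear regression can be checked directly. The sketch below uses simulated data (the variable names, ranges, and coefficients are assumptions chosen for illustration, not values from the customer dataset) and compares a model fitted to raw predictors with one fitted to scaled predictors.

```r
# Simulated data for illustration only (not the customer dataset)
set.seed(42)
n <- 200
raw <- data.frame(
  income = runif(n, 0, 190000),
  age    = runif(n, 18, 90)
)
raw$score <- 50 + 1e-5 * raw$income - 0.05 * raw$age + rnorm(n, sd = 25)

# Min-max scaling to the 0-1 range
normalize <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
scaled <- data.frame(
  income = normalize(raw$income),
  age    = normalize(raw$age),
  score  = raw$score
)

fit_raw    <- lm(score ~ income + age, data = raw)
fit_scaled <- lm(score ~ income + age, data = scaled)

# Fit quality is unchanged by rescaling the predictors
same_r2 <- all.equal(summary(fit_raw)$r.squared, summary(fit_scaled)$r.squared)

# Slope t statistics (and therefore p-values) are unchanged as well
slopes <- c("income", "age")
same_t <- all.equal(summary(fit_raw)$coefficients[slopes, "t value"],
                    summary(fit_scaled)$coefficients[slopes, "t value"])

# Only the units change: the scaled slope equals the raw slope times the range
income_range <- diff(range(raw$income))
same_slope <- all.equal(unname(coef(fit_scaled)["income"]),
                        unname(coef(fit_raw)["income"]) * income_range)

c(same_r2 = isTRUE(same_r2), same_t = isTRUE(same_t), same_slope = isTRUE(same_slope))
```

Scaling is therefore a matter of interpretation and presentation for ordinary least squares; it matters more for methods that are sensitive to variable scale, such as distance-based clustering or penalized regression.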
# Binary (dummy) encoding of the two-level 'gender' factor
# Factor levels sort alphabetically, so Female = 0 and Male = 1
cusnorm$gender <- as.numeric(cusnorm$gender) - 1
# Check the result
head(cusnorm,2)
## gender Age annualincome spendingscore workexperience familysize
## 1 1 0.1919192 0.07895817 0.39 0.05882353 0.1919192
## 2 1 0.2121212 0.18423574 0.81 0.17647059 0.2121212
Gender was recoded into a binary numeric format (0 = Female, 1 = Male), so that a regression coefficient on this variable would be interpreted as the difference in spending score between male and female customers. In R this manual step is a convenience rather than a requirement, because lm() creates the same indicator automatically when a factor is supplied. The preview of the transformed dataset confirms that the gender column was recoded as intended. Gender is not, however, among the predictors of the regression model fitted in this study, and Profession was converted to a factor but not encoded or modeled.
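The equivalence between manual 0/1 recoding and R's built-in factor handling can be illustrated with a small sketch. The six observations below are made up for illustration and are not rows from the customer dataset.

```r
# Illustrative data only (not the customer dataset)
gender <- factor(c("Male", "Female", "Female", "Male", "Female", "Male"))
score  <- c(39, 81, 6, 77, 40, 76)

# Manual recoding: levels sort alphabetically, so Female = 0 and Male = 1
gender_num <- as.numeric(gender) - 1

# The design matrix that lm() builds from the factor contains the same column
X <- model.matrix(~ gender)
dummy_matches <- all(X[, "genderMale"] == gender_num)

# Both specifications give the same intercept and slope
fit_factor <- lm(score ~ gender)
fit_manual <- lm(score ~ gender_num)
coef_match <- all.equal(unname(coef(fit_factor)), unname(coef(fit_manual)))

c(dummy_matches = dummy_matches, coef_match = isTRUE(coef_match))
```

This shortcut is only safe for two-level factors. Applying `as.numeric()` to a factor with more levels, such as Profession, would impose an arbitrary numeric ordering, whereas passing the factor to `lm()` creates one indicator per non-reference level.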
Multiple linear regression was conducted with spendingscore as the dependent variable and annualincome, familysize, workexperience, and Age as independent variables. The model was fitted to the unscaled customer data:
# Fit the regression model
model <- lm(spendingscore ~ familysize + workexperience + annualincome + Age, data = customer)
# Summarize the model
summary(model)
##
## Call:
## lm(formula = spendingscore ~ familysize + workexperience + annualincome +
## Age, data = customer)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.956 -22.834 -0.443 24.438 52.165
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.205e+01 2.256e+00 23.073 <2e-16 ***
## familysize 2.481e-02 3.184e-01 0.078 0.9379
## workexperience -2.278e-01 1.598e-01 -1.425 0.1543
## annualincome 1.643e-05 1.377e-05 1.194 0.2328
## Age -4.215e-02 2.198e-02 -1.917 0.0553 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.92 on 1995 degrees of freedom
## Multiple R-squared: 0.00335, Adjusted R-squared: 0.001351
## F-statistic: 1.676 on 4 and 1995 DF, p-value: 0.1528
The multiple regression does not identify any predictor that is statistically significant at the 5% level, and the model as a whole is not significant (F(4, 1995) = 1.676, p = 0.153). Annual income has a positive but non-significant coefficient (1.643e-05 points per dollar, p = 0.233). Age has a negative coefficient (-0.042 points per year, p = 0.055), which is significant only at the 10% level and gives weak evidence that younger customers have slightly higher spending scores. Family size (p = 0.938) and work experience (p = 0.154) show no significant relationship with spending score. The multiple R-squared of 0.0034 means that the four predictors together explain less than 1% of the variation in spending scores, so none of them can be described as a strong determinant of spending behavior in this dataset.
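As a cross-check on the significance statements, the p-values can be recomputed from the t statistics, F statistic, and degrees of freedom printed in the model summary. The sketch below copies those printed values by hand; the 5% threshold `alpha` is the conventional choice and is an assumption of the sketch.

```r
# t statistics and residual degrees of freedom copied from the summary(model) output
t_values <- c(familysize = 0.078, workexperience = -1.425,
              annualincome = 1.194, Age = -1.917)
df_resid <- 1995

# Two-sided p-value for each slope
p_values <- 2 * pt(-abs(t_values), df = df_resid)

# Overall F-test: F = 1.676 on 4 and 1995 degrees of freedom
p_overall <- pf(1.676, df1 = 4, df2 = df_resid, lower.tail = FALSE)

alpha <- 0.05
data.frame(t = t_values,
           p = round(p_values, 4),
           significant = p_values < alpha)
p_overall < alpha
```

Because the inputs are rounded to three decimals, the recomputed p-values can differ from the printed ones in the last digit, but the comparison against the threshold is unaffected.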
par(mfrow=c(2,2))
plot(model)
par(mfrow=c(1,1))
The diagnostic plots give a visual check of the assumptions underlying multiple regression. The Residuals vs Fitted plot shows no obvious patterns, suggesting that the linearity and constant-variance assumptions are reasonable. The Normal Q-Q plot indicates that the residuals align closely with the diagonal line, consistent with approximately normally distributed residuals. The Scale-Location plot shows an even spread, further supporting constant variance across observations. Finally, the Residuals vs Leverage plot reveals no highly influential outliers, indicating that no single observation exerts an undue influence on the model. Taken together, the plots give no indication that the key regression assumptions are violated, which supports the validity of the coefficient estimates and their standard errors. Meeting these assumptions does not, however, imply that the model has explanatory power.
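Visual diagnostics can be complemented with formal checks. The following sketch uses simulated data (not the customer dataset) and base R only: a Shapiro–Wilk test on the residuals, a hand-built studentized Breusch-Pagan test for heteroscedasticity, and Cook's distance for influence. The variable names and the 4/n cutoff are illustrative assumptions, not part of the original analysis.

```r
# Simulated data for illustration only (not the customer dataset)
set.seed(123)
n <- 500
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- 50 + 5 * d$x1 + rnorm(n, sd = 10)
fit <- lm(y ~ x1 + x2, data = d)

# Normality of residuals (the assumption concerns residuals, not raw variables)
shapiro_res <- shapiro.test(residuals(fit))

# Studentized Breusch-Pagan test by hand: regress squared residuals on the
# predictors; n * R^2 is approximately chi-squared with k = 2 degrees of freedom
d$res2 <- residuals(fit)^2
aux <- lm(res2 ~ x1 + x2, data = d)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

# Influence: flag observations above the common 4/n rule of thumb
cooks <- cooks.distance(fit)
influential <- which(cooks > 4 / n)

list(shapiro_p = shapiro_res$p.value,
     breusch_pagan_p = bp_p,
     n_influential = length(influential))
```

Large p-values in the first two tests are consistent with normal, homoscedastic residuals. With samples in the thousands, the Shapiro–Wilk test can reject for deviations that are too small to matter in practice, so it is best read alongside the Q-Q plot.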
The regression analysis does not show a statistically significant effect of annual income or age on customer spending scores at the 5% level. The income coefficient is positive, which is the direction reported by Khan and Sulaiman (2018) and predicted by economic theory on consumption and purchasing power, but it is not distinguishable from zero in this sample (p = 0.233). The age coefficient is negative and marginally significant (p = 0.055), which is weakly in line with Sharma (2021), who reports that younger individuals show higher spending, reflecting lifestyle differences and fewer financial commitments. Family size and work experience have no significant effects, consistent with evidence that these factors play a secondary role in consumer expenditure (Choudhary, 2019; Browning & Crossley, 2001). Overall, the four predictors explain less than 1% of the variance in spending score, so the centrality of income and age emphasized in the literature (Kotler & Keller, 2016; Hawkins & Mothersbaugh, 2019) is not reproduced in this dataset.
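Customer Segmentation: The marginal negative age effect (p = 0.055) offers only tentative support for focusing on younger customers, and income did not distinguish spending scores in this sample.
Targeted Marketing: Personalized offers and loyalty programs should be piloted and evaluated rather than aimed at high-income groups by default, since income was not a significant predictor.
Income-Based Promotions: Tiered discounts could be tested experimentally, but the regression provides no evidence that they would be effective.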
Future research should:
Incorporate additional predictors (education, lifestyle, psychology).
Explore non-linear models or machine learning for improved accuracy.
Extend to time-series analysis to capture spending trends.
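One simple way to begin exploring the non-linear models suggested above is to compare a straight-line fit with a polynomial fit on the same predictor. The sketch below uses simulated data with a deliberately curved relationship; the variable names and coefficients are assumptions for illustration and say nothing about the customer dataset.

```r
# Simulated data with a curved age effect (illustration only)
set.seed(99)
n <- 400
d <- data.frame(age = runif(n, 18, 90))
d$score <- 30 + 1.2 * d$age - 0.012 * d$age^2 + rnorm(n, sd = 10)

# Linear and quadratic specifications of the same predictor
fit_linear <- lm(score ~ age, data = d)
fit_quad   <- lm(score ~ poly(age, 2), data = d)

# Nested-model F-test: does the quadratic term improve the fit?
cmp <- anova(fit_linear, fit_quad)
cmp

# Information criteria as a second comparison (lower is better)
AIC(fit_linear, fit_quad)
```

A small p-value in the `anova()` table indicates that the added curvature explains variation that the linear model misses. The same comparison can be repeated for each predictor before moving to more flexible machine learning methods.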
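This study investigated determinants of customer spending scores using multiple regression. None of the four predictors was significant at the 5% level: age showed a weak negative association (p = 0.055), while annual income, family size, and work experience showed no detectable effect, and the model explained less than 1% of the variance in spending score. The diagnostic plots did not indicate violations of the regression assumptions, which suggests that the weak results reflect the limited explanatory power of these variables rather than an obvious modeling failure.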
Browning, M., & Crossley, T. F. (2001). The life-cycle model of consumption and saving. Journal of Economic Perspectives, 15(3), 3–22.
Choudhary, R. (2019). Household consumption and family size: Evidence from emerging markets. International Journal of Consumer Studies, 43(5), 421–432.
Dittmar, H. (2005). Consumer culture, identity and well-being: The search for the ‘good life’ and the body perfect. Psychology Press.
Hawkins, D. I., & Mothersbaugh, D. L. (2019). Consumer behavior: Building marketing strategy (14th ed.). McGraw-Hill Education.
Khan, M. A., & Sulaiman, J. (2018). Determinants of consumer spending: Evidence from household surveys. Journal of Economic Studies, 45(4), 765–781.
Kotler, P., & Keller, K. L. (2016). Marketing management (15th ed.). Pearson Education.
Sharma, R. (2021). Consumer behavior: Insights and implications. Sage Publications.
Wedel, M., & Kamakura, W. A. (2012). Market segmentation: Conceptual and methodological foundations. Springer Science & Business Media.