| Step | Action | Description |
|---|---|---|
| 1. Load Data | Read CSV file into R. | Imports dataset for analysis. |
| 2. Handle Missing Values | Remove rows with missing
Satisfaction Level. |
Ensures clean dataset. |
| 3. Convert to Factors | Convert categorical variables to factors. | Prepares data for modeling. |
# 0.1 Load dataset using read_csv from tidyverse
df <- read_csv("C:\\Users\\Shaik Fawaz\\Documents\\E-commerce Customer Behavior - Sheet1.csv")
# 0.2 Clean data: Remove NAs in the target factor column and convert types in a single pipeline
df <- df %>%
# Remove rows missing the primary target variable
drop_na(`Satisfaction Level`) %>%
# Convert necessary character columns to factor for modeling
mutate(
Gender = as.factor(Gender),
City = as.factor(City),
`Membership Type` = as.factor(`Membership Type`),
`Discount Applied` = as.factor(`Discount Applied`),
`Satisfaction Level` = as.factor(`Satisfaction Level`)
)
cat("Data successfully loaded and cleaned.\n")
## Data successfully loaded and cleaned.
str(df)
## tibble [348 × 11] (S3: tbl_df/tbl/data.frame)
## $ Customer ID : num [1:348] 101 102 103 104 105 106 107 108 109 110 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 1 2 2 1 1 2 1 2 ...
## $ Age : num [1:348] 29 34 43 30 27 37 31 35 41 28 ...
## $ City : Factor w/ 6 levels "Chicago","Houston",..: 5 3 1 6 4 2 5 3 1 6 ...
## $ Membership Type : Factor w/ 3 levels "Bronze","Gold",..: 2 3 1 2 3 1 2 3 1 2 ...
## $ Total Spend : num [1:348] 1120 780 511 1480 720 ...
## $ Items Purchased : num [1:348] 14 11 9 19 13 8 15 12 10 21 ...
## $ Average Rating : num [1:348] 4.6 4.1 3.4 4.7 4 3.1 4.5 4.2 3.6 4.8 ...
## $ Discount Applied : Factor w/ 2 levels "FALSE","TRUE": 2 1 2 1 2 1 2 1 2 1 ...
## $ Days Since Last Purchase: num [1:348] 25 18 42 12 55 22 28 14 40 9 ...
## $ Satisfaction Level : Factor w/ 3 levels "Neutral","Satisfied",..: 2 1 3 2 3 1 2 1 3 2 ...
Output Description: This output confirms the
successful loading and cleaning of the data. The str(df)
command shows the structure of the resulting dataset: all relevant
categorical columns (like Gender, Membership
Type, and Satisfaction Level) are correctly
converted to the factor data type, and the numeric
variables are available for analysis. The number of observations is
slightly reduced from the original dataset due to the removal of rows
with missing Satisfaction Level.
Objective: To summarize customer spending behavior and understand data distribution.
| Step | Action | Description |
|---|---|---|
| 1.1 Summary Stats | Calculate Mean, Median, and SD. | Measures data central tendency and spread. |
| 1.2 Frequency Plot | Create histogram for Total Spend. |
Visualizes spending distribution. |
# Overall summary of the dataset
summary(df)
## Customer ID Gender Age City
## Min. :101.0 Female:173 Min. :26.00 Chicago :58
## 1st Qu.:188.8 Male :175 1st Qu.:30.00 Houston :56
## Median :276.5 Median :32.00 Los Angeles :59
## Mean :275.9 Mean :33.58 Miami :58
## 3rd Qu.:363.2 3rd Qu.:37.00 New York :59
## Max. :450.0 Max. :43.00 San Francisco:58
## Membership Type Total Spend Items Purchased Average Rating
## Bronze:114 Min. : 410.8 Min. : 7.00 Min. :3.000
## Gold :117 1st Qu.: 505.8 1st Qu.: 9.00 1st Qu.:3.500
## Silver:117 Median : 780.2 Median :12.00 Median :4.100
## Mean : 847.8 Mean :12.63 Mean :4.024
## 3rd Qu.:1160.6 3rd Qu.:15.00 3rd Qu.:4.500
## Max. :1520.1 Max. :21.00 Max. :4.900
## Discount Applied Days Since Last Purchase Satisfaction Level
## FALSE:173 Min. : 9.00 Neutral :107
## TRUE :175 1st Qu.:15.00 Satisfied :125
## Median :23.00 Unsatisfied:116
## Mean :26.61
## 3rd Qu.:38.00
## Max. :63.00
# Calculate key descriptive statistics for numeric columns using a pipe
df %>%
select_if(is.numeric) %>%
summarise(
across(everything(),
list(Mean = mean, Median = median, SD = sd),
na.rm = TRUE,
.names = "{.col}_{.fn}")
) %>%
# Use pivot_longer to present stats in a readable table format
pivot_longer(everything(), names_to = "Statistic", values_to = "Value")
## # A tibble: 18 × 2
## Statistic Value
## <chr> <dbl>
## 1 Customer ID_Mean 276.
## 2 Customer ID_Median 276.
## 3 Customer ID_SD 101.
## 4 Age_Mean 33.6
## 5 Age_Median 32
## 6 Age_SD 4.88
## 7 Total Spend_Mean 848.
## 8 Total Spend_Median 780.
## 9 Total Spend_SD 362.
## 10 Items Purchased_Mean 12.6
## 11 Items Purchased_Median 12
## 12 Items Purchased_SD 4.15
## 13 Average Rating_Mean 4.02
## 14 Average Rating_Median 4.1
## 15 Average Rating_SD 0.579
## 16 Days Since Last Purchase_Mean 26.6
## 17 Days Since Last Purchase_Median 23
## 18 Days Since Last Purchase_SD 13.5
# Histogram of Total Spend: Check for normality and outliers
ggplot(df, aes(x = `Total Spend`)) +
geom_histogram(bins = 20, fill = "steelblue", color = "black") +
labs(title = "Distribution of Total Spend", x = "Total Spend ($)", y = "Frequency of Customers") +
theme_minimal()
Output Description: The first output,
summary(df), displays the five-number summary (Min,
Q1, Median, Q3, Max) and Mean for all numeric
variables (like Age, Total Spend, and
Items Purchased). The subsequent table provides a clear
summary of Mean, Median, and Standard Deviation (SD)
for key numeric columns, structured for easy comparison. The
Histogram of Total Spend typically shows a distribution
that might be slightly skewed right (indicating a small number of
customers with very high spending) or approximately normal, illustrating
the concentration of customer spending.
Objective: To identify which factors most influence customer spending.
| Step | Action | Description |
|---|---|---|
| 2.1 Build Model | Predict Total Spend using predictors. |
Quantifies effects of predictors. |
| 2.2 Interpret Coefficients | Examine p-values and R². | Determines significance and fit. |
# Build multiple linear regression model
# Total Spend is the dependent variable (target)
lm_model <- lm(`Total Spend` ~ Age + `Items Purchased` + `Membership Type` +
`Discount Applied` + `Average Rating`,
data = df)
cat("Summary of Linear Regression Model:\n")
## Summary of Linear Regression Model:
# Display model results, focusing on coefficients, R-squared, and p-values
summary(lm_model)
##
## Call:
## lm(formula = `Total Spend` ~ Age + `Items Purchased` + `Membership Type` +
## `Discount Applied` + `Average Rating`, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.27 -19.67 -1.36 15.86 68.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -189.0576 27.1938 -6.952 1.84e-11 ***
## Age 8.1156 0.5210 15.577 < 2e-16 ***
## `Items Purchased` 42.1572 0.9597 43.929 < 2e-16 ***
## `Membership Type`Gold 519.3501 12.4802 41.614 < 2e-16 ***
## `Membership Type`Silver 198.7341 8.3796 23.716 < 2e-16 ***
## `Discount Applied`TRUE -82.0024 2.5889 -31.674 < 2e-16 ***
## `Average Rating` 7.8595 9.3604 0.840 0.402
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.85 on 341 degrees of freedom
## Multiple R-squared: 0.9961, Adjusted R-squared: 0.996
## F-statistic: 1.443e+04 on 6 and 341 DF, p-value: < 2.2e-16
Output Description: The
summary(lm_model) output provides the core results of the
multiple linear regression. It includes:
Items Purchased or higher Membership Type
tiers) suggests an increase in Total Spend for a unit
change in that variable. The p-values
(Pr(>|t|)) show the statistical significance. Predictors
with p-values less than 0.05 are considered significant drivers of
spending.Objective: To test if Membership Type
or City significantly affect Total Spend.
| Step | Action | Description |
|---|---|---|
| 3.1 Run ANOVA | Fit model with aov(). |
Tests mean differences among groups. |
| 3.2 Interpret Results | Review F-values and p-values. | Identifies significant group effects. |
# Test the impact of Membership Type and City (and their interaction) on Total Spend
anova_model <- aov(`Total Spend` ~ `Membership Type` * City, data = df)
cat("ANOVA Table:\n")
## ANOVA Table:
# Display the ANOVA results
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## `Membership Type` 2 42183508 21091754 35097 <2e-16 ***
## City 3 3005981 1001994 1667 <2e-16 ***
## Residuals 342 205528 601
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Output Description: The ANOVA Table provides the results of the two-way analysis of variance. It will show the Sum of Squares, Degrees of Freedom, Mean Squares, F-statistic, and p-value for three main effects:
Membership Type: Tests if the mean Total
Spend is different across the various membership tiers.City: Tests if the mean Total Spend is
different across different cities.Membership Type:City: Tests for a statistical
interaction—i.e., if the effect of membership type on spending
differs depending on the city. A low p-value (e.g., <
0.05) for any factor indicates a statistically
significant difference in Total Spend
attributable to that grouping variable.Objective: To group customers into segments based on their spending patterns.
| Step | Action | Description |
|---|---|---|
| 4.1 Scale Features | Standardize numeric variables. | Normalizes data for clustering. |
| 4.2 Apply K-Means | Run clustering with k=3. | Creates customer segments. |
| 4.3 Summarize Clusters | Calculate average metrics per cluster. | Describes each group’s characteristics. |
# Select and scale key numeric features for clustering
data_cluster <- df %>%
select(`Total Spend`, `Items Purchased`, `Days Since Last Purchase`, `Average Rating`) %>%
scale()
# Set seed for reproducibility and run K-Means (using k=3 segments)
set.seed(42)
kmeans_result <- kmeans(data_cluster, centers = 3, nstart = 25)
# Append cluster assignment back to the main data frame
df$Cluster <- as.factor(kmeans_result$cluster)
# Summarize the characteristics of each cluster
cluster_summary <- df %>%
group_by(Cluster) %>%
summarise(
Count = n(),
Avg_Spend = mean(`Total Spend`),
Avg_Items = mean(`Items Purchased`),
Avg_Rating = mean(`Average Rating`),
.groups = 'drop' # Recommended for clean grouping
)
cat("\nCluster Summary:\n")
##
## Cluster Summary:
print(cluster_summary)
## # A tibble: 3 × 5
## Cluster Count Avg_Spend Avg_Items Avg_Rating
## <fct> <int> <dbl> <dbl> <dbl>
## 1 1 59 1456. 19.9 4.81
## 2 2 117 983. 13.4 4.35
## 3 3 172 547. 9.57 3.53
Output Description: The
print(cluster_summary) output is a table (data frame) that
characterizes the three identified customer segments. It displays the
count, average Total Spend, average Items Purchased, and average
Average Rating for each Cluster (1, 2, and 3).
This summary allows for meaningful interpretation: for example, the
cluster with the highest average spend and highest average items
purchased can be labeled the “High-Value” segment,
while the cluster with the lowest averages can be labeled the
“Lapsed/Low-Value” segment.
Objective: To visualize patterns in customer data for better understanding.
| Step | Action | Description |
|---|---|---|
| 5.1 Boxplot | Compare Total Spend by Membership
Type. |
Highlights variation in spend levels. |
| 5.2 Scatter Plot | Plot Age vs Total Spend. |
Shows age-based spending behavior. |
| 5.3 Bar Chart | Visualize satisfaction across cities. | Compares satisfaction by location. |
# --- Step 5.1: Boxplot: Spending by Membership Tier ---
ggplot(df, aes(x = `Membership Type`, y = `Total Spend`, fill = `Membership Type`)) +
geom_boxplot() +
labs(title = "Total Spend Distribution by Membership Type", x = "Membership Type", y = "Total Spend ($)") +
theme_minimal()
# --- Step 5.2: Scatter Plot: Age vs Spend colored by Satisfaction ---
ggplot(df, aes(x = Age, y = `Total Spend`, color = `Satisfaction Level`)) +
geom_point(alpha = 0.6) + # Use alpha for better visibility of overlapping points
geom_smooth(method = "lm", se = FALSE, color = "darkgrey") + # Add linear trend line
labs(title = "Age vs Total Spend, colored by Satisfaction", x = "Age", y = "Total Spend ($)") +
theme_light()
# --- Step 5.3: Bar Chart: Satisfaction across different Cities ---
ggplot(df, aes(x = `City`, fill = `Satisfaction Level`)) +
geom_bar(position = "dodge") +
labs(title = "Customer Satisfaction Level by City", x = "City", y = "Count") +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate axis labels for readability
Output Description: This section generates three insightful visualizations:
Key Insights from the Dataset:
Spending Behavior:
Regression Results:
ANOVA Findings:
Customer Segmentation:
Visual Insights:
Conclusion
This R Markdown report demonstrates how Descriptive Analytics can transform raw e-commerce data into actionable insights. Through data cleaning, regression, ANOVA, clustering, and visualization, we discovered: