| Step | Action | Description |
|---|---|---|
| 0.1 Load Data | Read CSV file into R. | Imports dataset for analysis. |
| 0.2 Handle Missing Values | Remove rows with missing Satisfaction Level. | Ensures clean dataset. |
| 0.3 Convert to Factors | Convert categorical variables to factors. | Prepares data for modeling. |
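# Load packages (assumed: the tidyverse supplies read_csv, %>%, drop_na, mutate, and ggplot)
library(tidyverse)
# 0.1 Load dataset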
df <- read_csv("C:\\Users\\Shaik Fawaz\\Documents\\E-commerce Customer Behavior - Sheet1.csv")
# 0.2 Handle missing Satisfaction Level
df_clean <- df %>%
  drop_na(`Satisfaction Level`)
# 0.3 Convert character columns to factors
df_clean <- df_clean %>%
  mutate(
    Gender = as.factor(Gender),
    City = as.factor(City),
    `Membership Type` = as.factor(`Membership Type`),
    `Discount Applied` = as.factor(`Discount Applied`),
    `Satisfaction Level` = as.factor(`Satisfaction Level`)
  )
cat("Data successfully loaded and cleaned.\n")
## Data successfully loaded and cleaned.
str(df_clean)
## tibble [348 × 11] (S3: tbl_df/tbl/data.frame)
## $ Customer ID : num [1:348] 101 102 103 104 105 106 107 108 109 110 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 1 2 2 1 1 2 1 2 ...
## $ Age : num [1:348] 29 34 43 30 27 37 31 35 41 28 ...
## $ City : Factor w/ 6 levels "Chicago","Houston",..: 5 3 1 6 4 2 5 3 1 6 ...
## $ Membership Type : Factor w/ 3 levels "Bronze","Gold",..: 2 3 1 2 3 1 2 3 1 2 ...
## $ Total Spend : num [1:348] 1120 780 511 1480 720 ...
## $ Items Purchased : num [1:348] 14 11 9 19 13 8 15 12 10 21 ...
## $ Average Rating : num [1:348] 4.6 4.1 3.4 4.7 4 3.1 4.5 4.2 3.6 4.8 ...
## $ Discount Applied : Factor w/ 2 levels "FALSE","TRUE": 2 1 2 1 2 1 2 1 2 1 ...
## $ Days Since Last Purchase: num [1:348] 25 18 42 12 55 22 28 14 40 9 ...
## $ Satisfaction Level : Factor w/ 3 levels "Neutral","Satisfied",..: 2 1 3 2 3 1 2 1 3 2 ...
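Step 0.3 names each categorical column explicitly. As a minimal sketch, assuming a small made-up data frame (`toy` and its values are hypothetical, not rows from the dataset), the same conversion can be applied to every character column at once using base R only:

```r
# Hypothetical toy data standing in for the real CSV (values are made up)
toy <- data.frame(
  Gender = c("Female", "Male", "Female"),
  City = c("Chicago", "Miami", "Chicago"),
  Age = c(29, 34, 43),
  stringsAsFactors = FALSE
)

# Find the character columns, then convert only those to factors
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

# Gender and City are now factors; Age is still numeric
str(toy)
```

Listing columns by name, as the report does, remains the safer choice when some character columns (free text, identifiers) should stay as strings.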
Objective: To summarize customer spending behavior and understand data distribution.
| Step | Action | Description |
|---|---|---|
| 1.1 Summary Stats | Calculate Mean, Median, and SD. | Measures data central tendency and spread. |
| 1.2 Frequency Plot | Create histogram for Total Spend. | Visualizes spending distribution. |
# Generate descriptive statistics
summary(df_clean)
## Customer ID Gender Age City
## Min. :101.0 Female:173 Min. :26.00 Chicago :58
## 1st Qu.:188.8 Male :175 1st Qu.:30.00 Houston :56
## Median :276.5 Median :32.00 Los Angeles :59
## Mean :275.9 Mean :33.58 Miami :58
## 3rd Qu.:363.2 3rd Qu.:37.00 New York :59
## Max. :450.0 Max. :43.00 San Francisco:58
## Membership Type Total Spend Items Purchased Average Rating
## Bronze:114 Min. : 410.8 Min. : 7.00 Min. :3.000
## Gold :117 1st Qu.: 505.8 1st Qu.: 9.00 1st Qu.:3.500
## Silver:117 Median : 780.2 Median :12.00 Median :4.100
## Mean : 847.8 Mean :12.63 Mean :4.024
## 3rd Qu.:1160.6 3rd Qu.:15.00 3rd Qu.:4.500
## Max. :1520.1 Max. :21.00 Max. :4.900
## Discount Applied Days Since Last Purchase Satisfaction Level
## FALSE:173 Min. : 9.00 Neutral :107
## TRUE :175 1st Qu.:15.00 Satisfied :125
## Median :23.00 Unsatisfied:116
## Mean :26.61
## 3rd Qu.:38.00
## Max. :63.00
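# Keep numeric columns only (select_if() is superseded; where() is the current idiom)
numeric_cols <- df_clean %>% select(where(is.numeric))
# Mean, median, and standard deviation for each numeric column
data.frame(
  Mean = sapply(numeric_cols, mean, na.rm = TRUE),
  Median = sapply(numeric_cols, median, na.rm = TRUE),
  SD = sapply(numeric_cols, sd, na.rm = TRUE)
)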
## Mean Median SD
## Customer ID 275.887931 276.5 101.3046114
## Age 33.577586 32.0 4.8780238
## Total Spend 847.793103 780.2 361.6923754
## Items Purchased 12.632184 12.0 4.1460789
## Average Rating 4.023563 4.1 0.5791447
## Days Since Last Purchase 26.614943 23.0 13.4747498
# Histogram of Total Spend
ggplot(df_clean, aes(x = `Total Spend`)) +
  geom_histogram(bins = 20, fill = "steelblue", color = "black") +
  labs(title = "Distribution of Total Spend", x = "Total Spend", y = "Frequency")
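Output Description: The summary statistics show the central tendency and spread of the numeric columns. For Total Spend, the mean (847.8) is higher than the median (780.2) and the standard deviation is large (361.7), which points to a right-skewed distribution in which a smaller group of high spenders pulls the average up. The histogram illustrates how spending is distributed, that is, whether most customers spend moderately while a few account for the highest totals. Customer ID is an identifier, so its mean and SD carry no analytical meaning.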
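A gap between mean and median can be turned into a single number with a skewness coefficient. The sketch below is illustrative only: the `spend` vector is made up, and `skewness()` is a hypothetical helper that implements the moment-based formula in base R rather than a function from the report.

```r
# Hypothetical spend values (made up for illustration only)
spend <- c(420, 480, 510, 530, 610, 780, 790, 1120, 1480, 1520)

# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 1.5
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# A positive value indicates a longer right tail (a few large spenders)
c(mean = mean(spend), median = median(spend), skewness = skewness(spend))
```

Applied to the report's data, the same helper could be called as `skewness(df_clean$"Total Spend")`, assuming `df_clean` is in scope.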
Objective: To identify which factors most influence customer spending.
| Step | Action | Description |
|---|---|---|
| 2.1 Build Model | Predict Total Spend using predictors. | Quantifies effects of predictors. |
| 2.2 Interpret Coefficients | Examine p-values and R². | Determines significance and fit. |
# Build multiple linear regression model
lm_model <- lm(
  `Total Spend` ~ Age + `Items Purchased` + `Membership Type` +
    `Discount Applied` + `Average Rating`,
  data = df_clean
)
cat("Summary of Linear Regression Model:\n")
## Summary of Linear Regression Model:
summary(lm_model)
##
## Call:
## lm(formula = `Total Spend` ~ Age + `Items Purchased` + `Membership Type` +
## `Discount Applied` + `Average Rating`, data = df_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.27 -19.67 -1.36 15.86 68.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -189.0576 27.1938 -6.952 1.84e-11 ***
## Age 8.1156 0.5210 15.577 < 2e-16 ***
## `Items Purchased` 42.1572 0.9597 43.929 < 2e-16 ***
## `Membership Type`Gold 519.3501 12.4802 41.614 < 2e-16 ***
## `Membership Type`Silver 198.7341 8.3796 23.716 < 2e-16 ***
## `Discount Applied`TRUE -82.0024 2.5889 -31.674 < 2e-16 ***
## `Average Rating` 7.8595 9.3604 0.840 0.402
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.85 on 341 degrees of freedom
## Multiple R-squared: 0.9961, Adjusted R-squared: 0.996
## F-statistic: 1.443e+04 on 6 and 341 DF, p-value: < 2.2e-16
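Output Description: The regression summary shows which variables significantly affect Total Spend. Age, Items Purchased, Membership Type, and Discount Applied are all significant (p < 2e-16), while Average Rating is not (p = 0.402). Holding the other predictors fixed, each additional item purchased is associated with about 42 more in Total Spend, Gold members spend about 519 more and Silver members about 199 more than Bronze members (the baseline level), and an applied discount is associated with about 82 less. The R-squared of 0.9961 means the predictors explain about 99.6% of the variation in spending.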
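Reading significance off the printed summary works for one model, but the same information can be pulled out programmatically. The following is a sketch under stated assumptions: it uses the built-in `mtcars` data as a stand-in for the report's data, and the column names `estimate`, `std_error`, `t_value`, `p_value`, and `significant` are chosen here for illustration.

```r
# Built-in mtcars data is used only as a stand-in for the report's data
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# summary()$coefficients is a matrix with columns:
# Estimate, Std. Error, t value, Pr(>|t|)
coefs <- as.data.frame(summary(fit)$coefficients)
names(coefs) <- c("estimate", "std_error", "t_value", "p_value")

# Flag terms significant at the 5% level and list them by p-value
coefs$significant <- coefs$p_value < 0.05
print(coefs[order(coefs$p_value), ])

# Overall fit: R-squared and adjusted R-squared
r2 <- summary(fit)$r.squared
adj_r2 <- summary(fit)$adj.r.squared
print(c(r_squared = r2, adj_r_squared = adj_r2))
```

Swapping in `lm_model` for `fit` would produce the equivalent table for the report's regression, assuming that object exists in the session.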
Objective: To test whether Membership Type or City significantly affects Total Spend.
| Step | Action | Description |
|---|---|---|
| 3.1 Run ANOVA | Fit model with aov(). | Tests mean differences among groups. |
| 3.2 Interpret Results | Review F-values and p-values. | Identifies significant group effects. |
anova_model <- aov(`Total Spend` ~ `Membership Type` * City, data = df_clean)
cat("ANOVA Table:\n")
## ANOVA Table:
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## `Membership Type` 2 42183508 21091754 35097 <2e-16 ***
## City 3 3005981 1001994 1667 <2e-16 ***
## Residuals 342 205528 601
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
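Output Description: The ANOVA table shows whether spending differs significantly across membership levels or cities. Both Membership Type and City have p-values below 2e-16, well under the 0.05 threshold, so both factors have statistically significant effects on Total Spend. Note that the model was specified with an interaction (`Membership Type` * City), yet the table contains no interaction row and City has only 3 degrees of freedom rather than the 5 expected for six cities. This pattern suggests that each city occurs with only one membership type in this dataset, so the City effect cannot be fully separated from the Membership Type effect.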
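Before interpreting a two-factor ANOVA with an interaction, it can help to confirm that every combination of the two factors is actually observed. A minimal sketch, assuming a made-up data frame (`toy`, `membership`, and `city` are hypothetical names and values, not the report's data):

```r
# Hypothetical toy design (made up): each city has a single membership type
toy <- data.frame(
  membership = c("Gold", "Gold", "Silver", "Silver", "Bronze", "Bronze"),
  city = c("New York", "New York", "Miami", "Miami", "Chicago", "Chicago")
)

# Cross-tabulate the two factors; zeros mark combinations never observed
cells <- table(toy$membership, toy$city)
print(cells)

# TRUE means some combinations are missing, so effects that depend on
# those combinations (such as the interaction) cannot be estimated
has_empty_cells <- any(cells == 0)
print(has_empty_cells)
```

For the report's data the equivalent check would be `table(df_clean$"Membership Type", df_clean$City)`, assuming `df_clean` is available.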
Objective: To group customers into segments based on their spending patterns.
| Step | Action | Description |
|---|---|---|
| 4.1 Scale Features | Standardize numeric variables. | Normalizes data for clustering. |
| 4.2 Apply K-Means | Run clustering with k=3. | Creates customer segments. |
| 4.3 Summarize Clusters | Calculate average metrics per cluster. | Describes each group’s characteristics. |
data_cluster <- df_clean %>%
  select(`Total Spend`, `Items Purchased`, `Days Since Last Purchase`, `Average Rating`) %>%
  scale()
set.seed(42)
kmeans_result <- kmeans(data_cluster, centers = 3, nstart = 25)
df_clean$Cluster <- as.factor(kmeans_result$cluster)
cluster_summary <- df_clean %>%
  group_by(Cluster) %>%
  summarise(
    Avg_Spend = mean(`Total Spend`),
    Avg_Items = mean(`Items Purchased`),
    Avg_Rating = mean(`Average Rating`)
  )
cat("\nCluster Summary:\n")
##
## Cluster Summary:
print(cluster_summary)
## # A tibble: 3 × 4
## Cluster Avg_Spend Avg_Items Avg_Rating
## <fct> <dbl> <dbl> <dbl>
## 1 1 1456. 19.9 4.81
## 2 2 983. 13.4 4.35
## 3 3 547. 9.57 3.53
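Output Description: The cluster summary shows three distinct customer groups. Cluster 1 contains high spenders (average spend about 1456, about 19.9 items, average rating 4.81), Cluster 2 sits in the middle (about 983, 13.4 items, rating 4.35), and Cluster 3 contains low spenders (about 547, 9.57 items, rating 3.53). Average rating rises with spend across the three clusters. This segmentation can help target different marketing strategies.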
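The report fixes k = 3 without showing how that value compares with alternatives. One common heuristic is the elbow method, sketched below. This is an illustration rather than the report's procedure: it uses the built-in `iris` measurements as a stand-in for the scaled features, and the range `1:6` is an arbitrary choice.

```r
# Built-in iris measurements stand in for the report's scaled features
features <- scale(iris[, 1:4])

set.seed(42)
k_values <- 1:6

# Total within-cluster sum of squares for each candidate k
wss <- sapply(k_values, function(k) {
  kmeans(features, centers = k, nstart = 25)$tot.withinss
})

# Look for the "elbow" where adding clusters stops reducing WSS much
plot(k_values, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

Replacing `features` with `data_cluster` would apply the same check to the report's four scaled variables, assuming that object is in scope.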
Objective: To visualize patterns in customer data for better understanding.
| Step | Action | Description |
|---|---|---|
| 5.1 Boxplot | Compare Total Spend by Membership Type. | Highlights variation in spend levels. |
| 5.2 Scatter Plot | Plot Age vs Total Spend. | Shows age-based spending behavior. |
| 5.3 Bar Chart | Visualize satisfaction across cities. | Compares satisfaction by location. |
# --- Step 5.1: Boxplot ---
ggplot(df_clean, aes(x = `Membership Type`, y = `Total Spend`, fill = `Membership Type`)) +
  geom_boxplot() +
  labs(title = "Total Spend by Membership Type", x = "Membership Type", y = "Total Spend")
# --- Step 5.2: Scatter Plot ---
ggplot(df_clean, aes(x = Age, y = `Total Spend`, color = `Satisfaction Level`)) +
  geom_point() +
  labs(title = "Age vs Total Spend", x = "Age", y = "Total Spend")
# --- Step 5.3: Bar Chart ---
ggplot(df_clean, aes(x = City, fill = `Satisfaction Level`)) +
  geom_bar(position = "dodge") +
  labs(title = "Satisfaction Level by City", x = "City", y = "Count")
Output Description:
Key Insights from the Dataset:
Spending Behavior:
Regression Results:
ANOVA Findings:
Customer Segmentation:
Visual Insights:
Conclusion
This R Markdown report demonstrates how Descriptive Analytics can transform raw e-commerce data into actionable insights. Through data cleaning, regression, ANOVA, clustering, and visualization, we discovered:
```