This analysis examined customer purchasing behavior using a cleaned and validated dataset of 3,900 transactions. The objective was to identify key trends, customer segments, and factors influencing purchase amounts to support business planning and decision-making.
Results indicate that sales are driven primarily by product category and seasonal effects, while customer demographics and discount usage show limited influence on average purchase value. These findings suggest opportunities to optimize product focus and seasonal strategies rather than broad discount-based promotions.
This report documents the data quality assessment conducted on the consumer behavior dataset prior to any cleaning or analysis.
Missing values were assessed for all variables. Columns with high missingness may affect analytical reliability and require specific treatment during data cleaning.
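As an illustration of this kind of assessment, the following base R sketch tabulates the count and percentage of missing values per column. The data frame and its column names are made up for the example and do not reproduce the actual dataset.

```r
# Toy data frame standing in for the real dataset (values are made up).
df <- data.frame(
  age = c(25, 40, NA, 31),
  review_rating = c(4.1, NA, NA, 3.5),
  gender = c("Male", "Female", "Male", NA)
)

# Count and percentage of missing values per column.
missing_summary <- data.frame(
  column = names(df),
  n_missing = colSums(is.na(df)),
  pct_missing = round(100 * colMeans(is.na(df)), 1),
  row.names = NULL
)

# Sort so the columns with the most missing values come first.
missing_summary <- missing_summary[order(-missing_summary$pct_missing), ]
print(missing_summary)
```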
Duplicate checks were conducted at both full-record and business-key levels to determine whether repeated entries represent data errors or legitimate repeated transactions.
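A minimal sketch of the two levels of duplicate check, assuming `customer_id` serves as the business key. The key choice and the toy data are assumptions made for illustration.

```r
# Toy data (made up): customer 2 has an exact repeat, customer 3 a distinct one.
df <- data.frame(
  customer_id = c(1, 2, 2, 3, 3),
  item_purchased = c("Shirt", "Boots", "Boots", "Hat", "Scarf"),
  purchase_amount_usd = c(30, 80, 80, 20, 25)
)

# Full-record level: rows identical in every column to an earlier row.
n_full_dupes <- sum(duplicated(df))

# Business-key level: rows sharing the key with an earlier row.
n_key_dupes <- sum(duplicated(df$customer_id))

# Rows that repeat a key but differ elsewhere are candidates for
# legitimate repeat transactions rather than data errors.
key_repeats <- df[duplicated(df$customer_id) & !duplicated(df), ]
print(key_repeats)
```

Comparing the two counts separates likely entry errors (exact copies) from repeat purchases that merely share a key.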
Logical and business-rule consistency checks were performed to assess whether values fall within expected and meaningful ranges.
The following aspects were evaluated:

- Numeric ranges (e.g. age, purchase amounts)
- Categorical consistency (e.g. gender, payment method)
- Cross-variable logic (e.g. discounts vs purchase amounts)
These checks help identify data entry errors and implausible combinations that may affect analysis.
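The three kinds of check can be sketched in base R as follows. The column names, the age bounds, and the low-value threshold of 25 are illustrative assumptions, not the rules used in the assessment.

```r
# Toy data (made up) containing one problem of each kind.
df <- data.frame(
  age = c(25, 40, 130, 31),
  purchase_amount_usd = c(50, 0, 75, 20),
  gender = c("Male", "male", "M", "Female"),
  discount_applied = c("Yes", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Numeric ranges: row indices with out-of-range values (bounds are assumptions).
bad_age <- which(df$age < 15 | df$age > 100)
bad_amount <- which(df$purchase_amount_usd <= 0)

# Categorical consistency: listing distinct labels exposes variants
# such as "Male", "male", and "M".
gender_labels <- table(df$gender)

# Cross-variable logic: discounted purchases below a hypothetical threshold.
low_value_discount <- which(
  df$discount_applied == "Yes" & df$purchase_amount_usd < 25
)
```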
Outliers were assessed across all numeric variables using boxplots,
histograms, and the IQR method.
Several numeric variables exhibit extreme values, which may influence
summary statistics and modeling results.
These observations will inform the outlier treatment strategy applied during the data cleaning phase (Task 2).
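For reference, the IQR method flags values lying more than 1.5 times the interquartile range below the first quartile or above the third. A minimal sketch follows; the helper name `flag_iqr_outliers` and the sample values are hypothetical.

```r
# Return TRUE for values outside the fences Q1 - k*IQR and Q3 + k*IQR.
flag_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Made-up purchase amounts with one extreme value.
amounts <- c(20, 25, 30, 35, 40, 45, 500)
outlier_flags <- flag_iqr_outliers(amounts)
amounts[outlier_flags]
```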
| Data Quality Dimension | Column(s) Affected | Observation / Issue Detected | Severity | Potential Impact | Proposed Remediation |
|---|---|---|---|---|---|
| Structural consistency | Column names | Column names contain spaces and special characters | Low | Code readability issues; increased risk of errors | Standardize column names to snake_case |
| Missing values | Review Rating | Approximately ____% of values are missing | Medium | Bias in customer satisfaction analysis | Evaluate imputation strategy or exclude from certain analyses |
| Duplicates | Customer ID | No exact duplicate records detected | Low | Minimal impact on analysis | No action required |
| Validity (numeric ranges) | Age | All values fall within expected range (15–100) | Low | No significant impact | No action required |
| Validity (numeric ranges) | Purchase Amount (USD) | No zero or negative values detected | Low | No revenue distortion | No action required |
| Inconsistency (categorical) | Gender | Multiple representations observed (e.g. Male, male, M) | Medium | Incorrect segmentation and aggregation | Standardize category labels |
| Inconsistency (categorical) | Payment Method | Minor inconsistencies in naming | Low | Aggregation inaccuracies | Standardize text case |
| Cross-variable consistency | Discount Applied vs Purchase Amount | Discounts applied to very low purchase values observed | Medium | Misleading discount effectiveness analysis | Review and apply rule-based cleaning |
| Outliers | Purchase Amount (USD) | Extreme high values detected via boxplots | Medium | Inflated averages and totals | Cap outliers using IQR method |
| Outliers | Previous Purchases | Skewed distribution with extreme values | Low | Minor effect on modeling | Consider transformation or capping |
This section documents the data cleaning and preparation steps applied to the dataset following the data quality assessment (Task 1). All cleaning actions were guided by previously identified issues and were implemented in a reproducible manner using R.
Missing values were handled based on variable type, proportion of missingness, and business relevance:
Forward and backward filling methods were not applied, as the dataset does not represent time-ordered or sequential observations.
Duplicate records were assessed and treated as follows:
This approach ensures data integrity while preserving legitimate transactional behavior.
To ensure consistency and improve analytical reliability:
These steps address structural and categorical inconsistencies identified during data quality analysis.
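One way such standardization might look in base R is sketched below. The helper `to_snake_case` and the label mapping are assumptions made for illustration, not the report's actual code.

```r
# Convert arbitrary column names to snake_case.
to_snake_case <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("[^a-z0-9]+", "_", x)  # runs of non-alphanumerics become "_"
  gsub("^_+|_+$", "", x)           # strip leading and trailing underscores
}

# Toy data (made up) with a messy column name and inconsistent labels.
df <- data.frame(
  `Purchase Amount (USD)` = c(30, 80, 20),
  Gender = c("Male", "male ", "M"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
names(df) <- to_snake_case(names(df))

# Map label variants to one canonical form (mapping is an assumption).
# Labels missing from the map become NA, which makes them easy to spot.
gender_map <- c(male = "Male", m = "Male", female = "Female", f = "Female")
df$gender <- unname(gender_map[tolower(trimws(df$gender))])
```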
Data validity rules and outlier treatment were applied to ensure realistic and meaningful values:
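The IQR-based capping proposed in the assessment table could be sketched as below. The helper name `cap_iqr` and the multiplier of 1.5 are assumptions; the report does not state the exact fences used.

```r
# Clamp values to the fences Q1 - k*IQR and Q3 + k*IQR.
cap_iqr <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  pmin(pmax(x, q[1] - k * iqr), q[2] + k * iqr)
}

# Made-up purchase amounts with one extreme value.
amounts <- c(20, 25, 30, 35, 40, 45, 500)
capped <- cap_iqr(amounts)
```

Capping keeps every row in the dataset, so totals and counts by segment stay comparable, while limiting the pull of extreme values on averages.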
Additional preparation steps were undertaken to support analysis:
- `promo_code_used` and `discount_applied` contained identical information. To avoid duplication, `promo_code_used` was removed and `discount_applied` was retained.

Following cleaning and preparation, a single consolidated, analysis-ready dataset was produced. The cleaned dataset was exported for use in subsequent trend analysis, modeling, and business reporting, as well as for integration with SQL-based systems and dashboards.
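A minimal sketch of the redundancy check and export, using toy data. The output file name and location are hypothetical.

```r
# Toy data (made up) in which the two columns agree on every row.
df <- data.frame(
  customer_id = 1:4,
  discount_applied = c("Yes", "No", "Yes", "No"),
  promo_code_used = c("Yes", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Drop the column only if it matches discount_applied in every row.
if (identical(df$promo_code_used, df$discount_applied)) {
  df$promo_code_used <- NULL
}

# Export the cleaned data as CSV (path is a placeholder).
out_path <- file.path(tempdir(), "consumer_behavior_clean.csv")
write.csv(df, out_path, row.names = FALSE)
```

Guarding the removal with `identical()` means the column is kept if the two variables ever diverge in a refreshed extract.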
Analysis of the cleaned dataset revealed clear patterns in customer purchasing behavior. Sales performance varies significantly across product categories and customer age groups, with certain segments contributing disproportionately to total revenue.
Customer satisfaction, as reflected in review ratings, is positively associated with higher purchase values (Figure 3). Additionally, discounted transactions show slightly lower median purchase values than non-discounted transactions (Figure 4).
A simple regression-based projection suggests that discount application and customer characteristics such as age and purchase history influence purchase amounts. These insights provide a basis for targeted marketing strategies and business planning.
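The general shape of such a projection can be sketched with `lm()`. The model specification is not given in this report, so the formula, the column names, and the simulated data below are all assumptions, and the output bears no relation to the reported findings.

```r
set.seed(42)
n <- 200

# Simulated predictors (not the real dataset).
sim <- data.frame(
  age = sample(18:70, n, replace = TRUE),
  previous_purchases = sample(0:50, n, replace = TRUE),
  discount_applied = factor(sample(c("No", "Yes"), n, replace = TRUE))
)

# Simulated response with arbitrary effects chosen for the example.
sim$purchase_amount_usd <- 60 + 0.1 * sim$age -
  3 * (sim$discount_applied == "Yes") + rnorm(n, sd = 20)

# Linear model of purchase amount on discount use and customer traits.
fit <- lm(purchase_amount_usd ~ age + previous_purchases + discount_applied,
          data = sim)
print(summary(fit)$coefficients)

# Projected purchase amount for a hypothetical customer.
new_customer <- data.frame(
  age = 35,
  previous_purchases = 10,
  discount_applied = factor("Yes", levels = levels(sim$discount_applied))
)
projected <- predict(fit, newdata = new_customer)
```

The coefficient table shows the estimated direction and size of each effect; whether an effect is practically meaningful depends on its magnitude relative to typical purchase amounts, not only on its p-value.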
Figure 1: Clothing and Accessories account for the largest share of total revenue, indicating priority areas for inventory and marketing focus.
Figure 2: Average purchase values remain consistent
across age groups, suggesting that customer volume rather than spending
intensity drives differences in total sales.
Figure 3: Transactions with higher customer review
ratings show increased average purchase values, highlighting the revenue
impact of customer experience.
Figure 4: Discounted transactions exhibit slightly
lower median purchase values, suggesting that discounts increase volume
rather than transaction value.
Figure 5: Average purchase amounts are highest during
Fall and Winter, indicating optimal periods for targeted marketing and
inventory planning.