Executive Summary

This analysis examined customer purchasing behavior using a cleaned and validated dataset of 3,900 transactions. The objective was to identify key trends, customer segments, and factors influencing purchase amounts to support business planning and decision making.

Results indicate that sales are driven primarily by product category and seasonal effects, while customer demographics and discount usage show limited influence on average purchase value. These findings suggest opportunities to optimize product focus and seasonal strategies rather than broad discount-based promotions.

1. Data Overview

This report documents the data quality assessment conducted on the consumer behavior dataset prior to any cleaning or analysis.

Dataset Structure

  • Number of rows: 3900
  • Number of columns: 18
  • Data source: Raw CSV file

2. Missing Values Assessment

Missing values were assessed for all variables. Columns with high missingness may affect analytical reliability and require specific treatment during data cleaning.

3. Duplicate Records

Duplicate checks were conducted at both full-record and business-key levels to determine whether repeated entries represent data errors or legitimate repeated transactions.

4. Inconsistency and Validity Checks

Logical and business-rule consistency checks were performed to assess whether values fall within expected and meaningful ranges.

The following aspects were evaluated: - Numeric ranges (e.g. age, purchase amounts) - Categorical consistency (e.g. gender, payment method) - Cross-variable logic (e.g. discounts vs purchase amounts)

These checks help identify data entry errors and implausible combinations that may affect analysis.

5. Outlier Detection

Outliers were assessed across all numeric variables using boxplots, histograms, and the IQR method.
Several numeric variables exhibit extreme values, which may influence summary statistics and modeling results.

These observations will inform the outlier treatment strategy applied during the data cleaning phase (Task 2).

Data Quality Assessment Summary

Data Quality Dimension Column(s) Affected Observation / Issue Detected Severity Potential Impact Proposed Remediation
Structural consistency Column names Column names contain spaces and special characters Low Code readability issues; increased risk of errors Standardize column names to snake_case
Missing values Review Rating Approximately ____% of values are missing Medium Bias in customer satisfaction analysis Evaluate imputation strategy or exclude from certain analyses
Missing values __________ ____________________________ ______ _______________ ____________________
Duplicates Customer ID No exact duplicate records detected Low Minimal impact on analysis No action required
Validity (numeric ranges) Age All values fall within expected range (15–100) Low No significant impact No action required
Validity (numeric ranges) Purchase Amount (USD) No zero or negative values detected Low No revenue distortion No action required
Inconsistency (categorical) Gender Multiple representations observed (e.g. Male, male, M) Medium Incorrect segmentation and aggregation Standardize category labels
Inconsistency (categorical) Payment Method Minor inconsistencies in naming Low Aggregation inaccuracies Standardize text case
Cross-variable consistency Discount Applied vs Purchase Amount Discounts applied to very low purchase values observed Medium Misleading discount effectiveness analysis Review and apply rule-based cleaning
Outliers Purchase Amount (USD) Extreme high values detected via boxplots Medium Inflated averages and totals Cap outliers using IQR method
Outliers Previous Purchases Skewed distribution with extreme values Low Minor effect on modeling Consider transformation or capping

Data Cleaning

This section documents the data cleaning and preparation steps applied to the dataset following the data quality assessment (Task 1). All cleaning actions were guided by previously identified issues and were implemented in a reproducible manner using R.

1. Missing Value Treatment

Missing values were handled based on variable type, proportion of missingness, and business relevance:

  • Continuous variables with low missingness and no significant skew were imputed using the mean.
  • Continuous variables with skewed distributions or outliers were imputed using the median.
  • Categorical variables with low missingness were imputed using the most frequent category (mode).
  • Variables where zero represents a meaningful absence (e.g. previous purchases) were imputed with zero.
  • Records with missing values in key analytical variables (e.g. customer identifiers or age) were removed to preserve analytical validity.

Forward and backward filling methods were not applied, as the dataset does not represent time-ordered or sequential observations.


2. Duplicate Removal

Duplicate records were assessed and treated as follows:

  • Exact full-row duplicates were removed to eliminate redundant records.
  • Business-key duplicates were reviewed based on customer identifiers. Where duplicates represented data entry errors rather than repeated transactions, only one record per customer was retained.

This approach ensures data integrity while preserving legitimate transactional behavior.


3. Data Standardization

To ensure consistency and improve analytical reliability:

  • Column names were standardized to snake_case format.
  • Leading and trailing whitespace was removed from all character variables.
  • Text values were standardized in terms of letter case.
  • Inconsistent categorical representations (e.g. gender encodings) were mapped to a common set of values.
  • Variables incorrectly stored as text were converted to appropriate numeric types.

These steps address structural and categorical inconsistencies identified during data quality analysis.


4. Outlier and Validity Treatment

Data validity rules and outlier treatment were applied to ensure realistic and meaningful values:

  • Records with implausible values (e.g. unrealistic ages or non-positive purchase amounts) were removed.
  • Extreme outliers in monetary variables were treated using the interquartile range (IQR) method, with values capped to reduce distortion while retaining observations.

5. Feature Engineering and Redundancy Handling

Additional preparation steps were undertaken to support analysis:

  • Age values were grouped into four interpretable age categories to support segmentation analysis.
  • Review ratings were categorized into five satisfaction levels for clearer interpretation.
  • A redundancy check revealed that promo_code_used and discount_applied contained identical information. To avoid duplication, promo_code_used was removed and discount_applied was retained.

6. Analysis-Ready Dataset

Following cleaning and preparation, a single consolidated, analysis-ready dataset was produced. The cleaned dataset was exported for use in subsequent trend analysis, modeling, and business reporting, as well as for integration with SQL-based systems and dashboards.

Trend Analysis and Projections

Analysis of the cleaned dataset revealed clear patterns in customer purchasing behavior. Sales performance varies significantly across product categories and customer age groups, with certain segments contributing disproportionately to total revenue.

Customer satisfaction, as reflected in review ratings, is positively associated with higher purchase values. Additionally, transactions involving discounts tend to exhibit different purchasing patterns compared to non-discounted transactions.

A simple regression-based projection suggests that discount application and customer characteristics such as age and purchase history influence purchase amounts. These insights provide a basis for targeted marketing strategies and business planning.

Product Category Performance

Figure 1: Clothing and Accessories account for the largest share of total revenue, indicating priority areas for inventory and marketing focus.

Customer Spending by Age Group

Figure 2: Average purchase values remain consistent across age groups, suggesting that customer volume rather than spending intensity drives differences in total sales.

Customer Satisfaction and Spending

Figure 3: Transactions with higher customer review ratings show increased average purchase values, highlighting the revenue impact of customer experience.

Discount Impact Analysis

Figure 4: Discounted transactions exhibit slightly lower median purchase values, suggesting that discounts increase volume rather than transaction value.

Seasonal Purchasing Patterns

Figure 5: Average purchase amounts are highest during Fall and Winter, indicating optimal periods for targeted marketing and inventory planning.