# ABSTRACT
# This project is a comprehensive statistical exploration of simulated customer sales data. 
# My primary goal was to understand sales trends, evaluate revenue patterns, and assess 
# the impact of discounts on revenue. I used statistical techniques such as tabulations, 
# visualizations, and aggregate calculations to uncover meaningful insights. Additionally, 
# I applied statistical formulas to quantify relationships and patterns. This analysis is 
# both a personal project and a demonstration of the power of data-driven decision-making.

# PERSONAL PROJECT JOURNAL
# When I began this project, I was excited to dive into a dataset that mimicked real-world 
# business scenarios. I focused on identifying relationships between variables like revenue, 
# discount, and product categories. My approach combined exploratory data analysis (EDA) 
# and statistical techniques to derive actionable insights.

# Statistical formulas used:
# Mean: 

\[\mu = \frac{1}{N}\sum_{i=1}^{N} x_i\]

# Variance: 

\[\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2\]

# Correlation Coefficient: 

\[r = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_{i=1}^{N} (x_i - \mu_x)^2 \, \sum_{i=1}^{N} (y_i - \mu_y)^2}}\]

# Z-score for standardization: 

\[z = \frac{x - \mu}{\sigma}\]
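# The formulas above can be checked by hand in R. This is a minimal sketch on a
# toy pair of vectors (hypothetical values, not taken from the sales data). Note
# that R's built-in var() divides by N - 1 (sample variance), whereas the
# variance formula above divides by N, so the two differ by a factor of (N - 1) / N.

```r
# Toy vectors (hypothetical values, chosen only for illustration)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
y <- c(1, 3, 2, 5, 4, 6, 8, 9)
N <- length(x)

# Mean and population variance, written exactly as in the formulas
mu_x     <- sum(x) / N
mu_y     <- sum(y) / N
sigma2_x <- sum((x - mu_x)^2) / N
sigma_x  <- sqrt(sigma2_x)

# Pearson correlation coefficient
r_xy <- sum((x - mu_x) * (y - mu_y)) /
  sqrt(sum((x - mu_x)^2) * sum((y - mu_y)^2))

# Z-scores of x
z_x <- (x - mu_x) / sigma_x

# Compare with the built-ins
print(all.equal(mu_x, mean(x)))
print(all.equal(sigma2_x, var(x) * (N - 1) / N))  # var() uses N - 1
print(all.equal(r_xy, cor(x, y)))
```

# The N versus N - 1 choice cancels out in the correlation coefficient, which is
# why cor() matches the hand calculation without any adjustment.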

# Load required libraries
library(lattice)
# Simulated sales dataset
set.seed(123)
sales_data <- data.frame(
  CustomerID = 1:1000,
  DayOfWeek = sample(c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"), 1000, replace = TRUE),
  ProductCategory = sample(c("Electronics", "Clothing", "Home & Kitchen", "Beauty", "Sports"), 1000, replace = TRUE),
  UnitsSold = sample(1:10, 1000, replace = TRUE),
  Revenue = round(runif(1000, 50, 500), 2),
  Region = sample(c("North", "South", "East", "West"), 1000, replace = TRUE),
  Discount = round(runif(1000, 0, 30), 2)
)
# Display dataset overview
head(sales_data, 5)
##   CustomerID DayOfWeek ProductCategory UnitsSold Revenue Region Discount
## 1          1    Sunday          Beauty         5  429.91   East    23.65
## 2          2    Sunday     Electronics         6   90.32  South    29.38
## 3          3 Wednesday        Clothing        10  184.70   West    22.47
## 4          4  Saturday     Electronics         6  288.89  South    23.28
## 5          5 Wednesday  Home & Kitchen         7  220.60  North    29.36
dim(sales_data)
## [1] 1000    7
# Sales volume by day of the week
sales_by_day <- table(sales_data$DayOfWeek)
print(sales_by_day)
## 
##    Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
##       136       145       135       160       131       152       141
barchart(sales_by_day, ylab = "Number of Transactions", col = "steelblue")

# Interpretation:
# 1. I observe that the bar chart delineates the frequency of transactions across each day of the week.
# 2. I see that Sunday has the highest transaction volume (160), followed by Tuesday (152) and Monday (145).
# 3. I notice a relatively consistent distribution of transactions throughout the week: daily counts range from 131 to 160, around an expected value of about 143 per day (1000 / 7).
# 4. Thursday (131), Saturday (135), and Friday (136) have the lowest transaction volumes, suggesting potential opportunities for intervention to stimulate demand on those days.
# 5. I hypothesize that the heightened activity on Sunday may correlate with specific consumer behaviors, promotional strategies, or other external factors that warrant further investigation.
# 6. If additional data were available, I would conduct a deeper analysis to identify causal factors influencing daily transaction patterns, such as marketing efforts, seasonal variations, or demographic trends.
# 7. I conclude that these findings could inform strategies to optimize operational resources, enhance marketing initiatives, and maximize overall efficiency across the week.
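# Whether the day-to-day differences are larger than chance can be checked with
# a chi-square goodness-of-fit test. This is a sketch that assumes every day is
# equally likely under the null hypothesis; the counts are copied from the
# printed table, and the names day_counts, week_order, and day_test are my own.

```r
# Transaction counts copied from the printed table (alphabetical order)
day_counts <- c(Friday = 136, Monday = 145, Saturday = 135, Sunday = 160,
                Thursday = 131, Tuesday = 152, Wednesday = 141)

# Reorder from alphabetical to calendar order for easier reading
week_order <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
day_counts <- day_counts[week_order]

# Goodness-of-fit test against equal probability for each day
day_test <- chisq.test(day_counts)
print(day_test$expected)  # 1000 / 7 expected transactions per day
print(day_test)
```

# The statistic comes out at about 4.44 on 6 degrees of freedom, well below the
# 5% critical value of 12.59, so these counts are compatible with an even spread
# across the week.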
# Product category vs. region
sales_category_region <- table(Category = sales_data$ProductCategory, Region = sales_data$Region)
print(sales_category_region)
##                 Region
## Category         East North South West
##   Beauty           48    43    41   55
##   Clothing         45    43    45   52
##   Electronics      47    56    50   60
##   Home & Kitchen   57    57    59   44
##   Sports           41    52    61   44
barchart(sales_category_region, ylab = "Transactions by Region", stack = TRUE, auto.key = TRUE)

# Interpretation:
# 1. I observe that the stacked bar chart provides a breakdown of transactions across various product categories by region.
# 2. I notice that no single region leads every category. Total transactions are close for the "South" (256), "West" (255), and "North" (251) regions, with "East" lowest at 238.
# 3. When I examine the data closely, I see that the "West" region leads in Beauty (55), Clothing (52), and Electronics (60) but is lowest in Home & Kitchen (44), while the "South" region leads in Home & Kitchen (59) and Sports (61).
# 4. I find that the "East" region has the lowest counts in Electronics (47) and Sports (41) but not in the other categories, suggesting that any targeted strategy there should be category-specific.
# 5. I hypothesize that the regional variations could be influenced by demographic preferences, regional marketing strategies, or supply chain logistics, which I would explore further.
# 6. If additional data were accessible, I would analyze whether these trends are consistent over time or influenced by seasonal or promotional factors.
# 7. I conclude that these insights could inform region-specific strategies to optimize product offerings, enhance marketing efforts, and address disparities in regional performance.
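# The regional comparison can be made explicit with margin totals and a
# chi-square test of independence. This is a sketch built from the printed
# contingency table; the object names counts, leaders, and indep_test are my own.

```r
# Category-by-region counts copied from the printed table
counts <- matrix(
  c(48, 43, 41, 55,
    45, 43, 45, 52,
    47, 56, 50, 60,
    57, 57, 59, 44,
    41, 52, 61, 44),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    Category = c("Beauty", "Clothing", "Electronics", "Home & Kitchen", "Sports"),
    Region   = c("East", "North", "South", "West")
  )
)

# Row and column totals
print(addmargins(counts))

# Region with the most transactions in each category
leaders <- colnames(counts)[apply(counts, 1, which.max)]
names(leaders) <- rownames(counts)
print(leaders)

# Test whether category and region are independent
indep_test <- chisq.test(counts)
print(indep_test)
```

# A large p-value here would mean the category mix does not differ detectably
# between regions, which is the expected outcome when both columns are sampled
# independently.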
# Revenue distribution by product category
revenue_stats <- aggregate(Revenue ~ ProductCategory, data = sales_data, FUN = function(x) c(mean = mean(x), var = var(x)))
print(revenue_stats)
##   ProductCategory Revenue.mean Revenue.var
## 1          Beauty     266.2199  16611.9633
## 2        Clothing     265.8953  16036.3545
## 3     Electronics     272.3110  15202.6714
## 4  Home & Kitchen     291.9915  17213.4168
## 5          Sports     260.4416  16433.8168
histogram(~Revenue | ProductCategory, data = sales_data, layout = c(1, 5), col = "darkgreen")

# Interpretation:
# 1. I observe that the histogram illustrates the distribution of revenue across different product categories as a percentage of the total.
# 2. I notice that revenue should be spread fairly evenly between $50 and $500 in every category, as expected from the uniform draw used to simulate it, rather than concentrated in a narrow band.
# 3. When I compare the summary statistics, I see that "Home & Kitchen" has both the highest mean (291.99) and the highest variance (17213.42), while "Electronics" has the lowest variance (15202.67).
# 4. I find that these differences are small: the largest variance is only about 13% above the smallest, so no category shows a clearly narrower or more uniform revenue distribution.
# 5. I hypothesize that the variability in revenue distribution might be influenced by factors such as product variety, market demand, or pricing policies, which I would investigate further if more granular data were available.
# 6. If I had access to time-series data, I would analyze whether these distributions shift over time to identify trends or seasonal effects on revenue.
# 7. I conclude that this visualization highlights key insights into revenue patterns that could guide pricing strategies, product positioning, and inventory management decisions.
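# Because Revenue is simulated with runif(1000, 50, 500), the per-category
# statistics can be compared with the theoretical mean and variance of a uniform
# distribution. This is a sketch using the printed values; the names a, b,
# theo_mean, theo_var, and observed are my own.

```r
# Bounds of the uniform distribution used to simulate Revenue
a <- 50
b <- 500

# Theoretical moments of Uniform(a, b)
theo_mean <- (a + b) / 2
theo_var  <- (b - a)^2 / 12

# Per-category statistics copied from the printed aggregate() output
observed <- data.frame(
  ProductCategory = c("Beauty", "Clothing", "Electronics", "Home & Kitchen", "Sports"),
  mean = c(266.2199, 265.8953, 272.3110, 291.9915, 260.4416),
  var  = c(16611.9633, 16036.3545, 15202.6714, 17213.4168, 16433.8168)
)

# Distance from the theoretical benchmark
observed$mean_diff <- observed$mean - theo_mean
observed$var_ratio <- observed$var / theo_var
print(observed)
```

# Variance ratios close to 1 in every row indicate that the categories share the
# same underlying spread and differ only by sampling variation.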
# Relationship between revenue and discount
correlation <- cor(sales_data$Discount, sales_data$Revenue)
print(paste("Correlation between Discount and Revenue:", correlation))
## [1] "Correlation between Discount and Revenue: 0.0249805242325788"
# Interpretation:
# 1. The correlation between Discount and Revenue was calculated to be approximately 0.025, which was very close to zero.
# 2. I interpreted this as indicating that there was no meaningful linear relationship between the amount of discount provided and the revenue generated.
# 3. This is what I should expect here, because Discount and Revenue were drawn independently in the simulation. In real sales data, a similar result might suggest that discounts were not a primary driver of revenue, or that their impact was overshadowed by other factors such as product type, demand, or customer preferences.
# 4. I had considered exploring this further by analyzing other potential relationships or including additional variables, such as customer demographics or seasonality, to gain a more comprehensive understanding.
# 5. If I had expanded on this analysis, I might have considered non-linear models or segmented the data by product categories to determine if discounts affected revenue differently across segments.
# 6. I concluded that the observed low correlation highlighted the need for a broader analysis to uncover meaningful insights about the factors driving revenue.
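# The claim that the correlation is not significant can be backed by the usual
# t-test for a Pearson correlation. This is a sketch that starts from the
# printed r and n = 1000; calling cor.test(sales_data$Discount, sales_data$Revenue)
# in the script should give the same statistic. The names r, n, t_stat, and
# p_value are my own.

```r
# Correlation and sample size taken from the output above
r <- 0.0249805242325788
n <- 1000

# t statistic for H0: true correlation is zero, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(c(t = t_stat, df = n - 2, p = p_value))
```

# A t statistic below 1 is far from any conventional significance threshold, so
# the data give no evidence of a linear association between discount and revenue.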
xyplot(Revenue ~ Discount, data = sales_data, col = "purple", pch = 16)

# Interpretation:
# 1. The scatter plot depicted the relationship between Discount and Revenue using purple points.
# 2. I observed that the points were dispersed randomly across the plot, showing no clear pattern or trend between the two variables.
# 3. This visual reinforced my earlier conclusion that there was no significant linear relationship between discounts and revenue.
# 4. I noted that revenue values remained widely distributed regardless of the discount amount, which suggested that other factors likely played a more significant role in driving revenue.
# 5. Had I conducted further analysis, I might have investigated whether non-linear relationships existed or if interactions with other variables influenced the lack of correlation.
# 6. I concluded that this scatter plot visually confirmed the statistical correlation value and highlighted the need to explore other potential drivers of revenue.
xyplot(Revenue ~ Discount | ProductCategory, data = sales_data, layout = c(1, 5), col = "purple")

# Interpretation:
# 1. The five-panel scatter plot (one panel per product category) illustrated the relationship between Discount and Revenue for Sports, Home & Kitchen, Electronics, Clothing, and Beauty.
# 2. I observed that within each category, the points were scattered without any discernible trend, similar to the overall plot, indicating a lack of strong linear relationships between discounts and revenue for any specific category.
# 3. I noted only slight variations in the spread of revenue within categories; the variances computed earlier range from 15202.67 (Electronics) to 17213.42 (Home & Kitchen), so no category had a clearly broader or narrower distribution.
# 4. These findings suggested that the relationship between discounts and revenue was not only weak overall but also remained insignificant across individual product categories.
# 5. If I had explored further, I might have considered whether other factors, such as customer segmentation or seasonal effects, could reveal more nuanced relationships within specific categories.
# 6. I concluded that the scatter plots provided valuable confirmation that discounts did not significantly influence revenue across categories, reinforcing the need for a deeper exploration of alternative revenue drivers.
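# The per-category impression from the panels can be quantified with one
# correlation per category. This is a sketch that runs on a stand-in data frame
# (demo, generated with an arbitrary seed) so that it is self-contained; in the
# script itself, sales_data would take the place of demo.

```r
# Stand-in data with the same column names as sales_data (assumption: the
# real analysis would use sales_data instead)
set.seed(1)
demo <- data.frame(
  ProductCategory = sample(
    c("Electronics", "Clothing", "Home & Kitchen", "Beauty", "Sports"),
    500, replace = TRUE
  ),
  Revenue  = round(runif(500, 50, 500), 2),
  Discount = round(runif(500, 0, 30), 2)
)

# One Pearson correlation per product category
cor_by_category <- sapply(
  split(demo, demo$ProductCategory),
  function(d) cor(d$Discount, d$Revenue)
)
print(round(cor_by_category, 3))
```

# With roughly 100 rows per category, correlations of about 0.1 in either
# direction can arise by chance alone, so small per-category values should not
# be read as category-specific discount effects.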
# Box plot: Revenue by day of the week
boxplot(Revenue ~ DayOfWeek, data = sales_data, ylab = "Revenue", xlab = "Day of the Week", col = "orange")

# Interpretation:
# 1. The box plot displayed the distribution of revenue across different days of the week.
# 2. I observed that the median revenue values were relatively consistent across all days, with only slight variations, suggesting stable revenue patterns throughout the week.
# 3. I noted that the interquartile ranges (IQRs) were also similar for most days, indicating that the spread of revenue was comparable regardless of the day.
# 4. With revenue bounded between 50 and 500 and an IQR of roughly 225, the 1.5 * IQR whiskers can reach both ends of the range, so I would not expect genuine outliers; any isolated points beyond the whiskers would merit checking rather than interpretation.
# 5. I found that the revenue distribution did not show any particular day as significantly outperforming or underperforming, implying a lack of daily seasonality in revenue.
# 6. Had I pursued further analysis, I might have explored whether other factors, such as promotions or customer demographics, contributed to the uniformity or if they masked underlying differences.
# 7. I concluded that the box plot provided evidence of a consistent revenue pattern across the week, highlighting the need for other factors to explain revenue variability.
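# The box plot's outlier rule can be worked through for revenue drawn uniformly
# between 50 and 500. This is a sketch: the theoretical quartiles follow from
# the uniform distribution, and the simulated check uses an arbitrary seed and
# my own variable names.

```r
# Bounds of the uniform distribution used for Revenue
a <- 50
b <- 500

# Theoretical quartiles and Tukey fences (1.5 * IQR rule)
q1  <- a + 0.25 * (b - a)
q3  <- a + 0.75 * (b - a)
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
print(c(q1 = q1, q3 = q3, lower = lower_fence, upper = upper_fence))

# Empirical check: count points that boxplot() would flag as outliers
set.seed(42)
x <- runif(1000, a, b)
n_out <- length(boxplot.stats(x)$out)
print(n_out)
```

# Both fences fall outside the 50 to 500 range, so the whiskers extend to the
# minimum and maximum and no point is flagged.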
# Revenue trends by product category and region
bwplot(Revenue ~ Region | ProductCategory, data = sales_data, xlab = "Region", col = "brown")

# Interpretation:
# 1. The box plot displayed the revenue distribution by region (East, North, South, West) for each product category: Home & Kitchen, Sports, Beauty, Clothing, and Electronics.
# 2. I observed that the median revenue varied slightly across regions and categories. For instance, in the Home & Kitchen category, the median revenue ranged between approximately 250 and 300 across all regions.
# 3. The South region consistently showed slightly higher median revenues compared to other regions in categories like Home & Kitchen and Sports, with median values around 300.
# 4. In the Electronics category, the median revenue was fairly uniform across regions, hovering around 250. Its overall variance (15202.67) was the lowest of the five categories, so its spread was not broader than the others.
# 5. I noticed that the West region displayed higher variability in revenue across most categories, with interquartile ranges extending from 150 to 400 in categories like Clothing and Beauty.
# 6. No transaction can exceed 500, because Revenue was drawn from runif(1000, 50, 500). Any points plotted beyond the whiskers therefore lie inside the 50 to 500 range and reflect the small number of transactions per category-region cell (41 to 61) rather than unusually large sales.
# 7. The Beauty category did not stand out as more stable: its overall variance (16611.96) was the second highest of the five categories.
# 8. Based on these observations, I would hypothesize that regional differences in revenue might be driven by factors such as consumer preferences, marketing effectiveness, or product availability, which could be analyzed further.
# 9. I concluded that revenue distributions were generally similar across regions and categories, and that the small differences between panels would need more data before they warranted deeper investigation.
# Average revenue by region
avg_revenue_region <- tapply(sales_data$Revenue, sales_data$Region, mean)
print(avg_revenue_region)
##     East    North    South     West 
## 277.2679 281.1109 259.2810 270.5140
barplot(avg_revenue_region, ylab = "Average Revenue", col = "cyan")

# Interpretation:
# 1. The bar chart presented the average revenue by region (East, North, South, and West).
# 2. I observed that the North (281.11) and East (277.27) regions had the highest average revenues.
# 3. The West (270.51) and South (259.28) regions were lower, with the South about 22 below the North, indicating modest regional differences in average revenue.
# 4. The narrow range of average revenue values, all between approximately 259 and 281, suggested that revenue generation was relatively balanced across regions.
# 5. I noted the lack of substantial variation across regions, which could imply that factors other than geographic region played a more significant role in influencing average revenue.
# 6. If I had additional data, I might explore whether product categories or other customer demographics affected regional revenue differences.
# 7. I concluded that this chart highlighted a consistent average revenue across regions, suggesting uniform performance and a potential opportunity to investigate other variables impacting revenue.
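# How large the regional gaps are relative to sampling error can be estimated
# with standard errors. This is a rough sketch under two assumptions that I am
# making: the region sizes are the column totals of the category-by-region
# table, and the standard deviation is the theoretical value for
# Uniform(50, 500) rather than each region's sample value.

```r
# Regional means copied from the printed tapply() output
region_mean <- c(East = 277.2679, North = 281.1109, South = 259.2810, West = 270.5140)

# Transactions per region (column totals of the category-by-region table)
region_n <- c(East = 238, North = 251, South = 256, West = 255)

# Assumed standard deviation: theoretical SD of Uniform(50, 500)
assumed_sd <- sqrt((500 - 50)^2 / 12)

# Standard error of each regional mean
region_se <- assumed_sd / sqrt(region_n)
print(round(cbind(mean = region_mean, n = region_n, se = region_se), 2))

# Largest gap (North vs South) expressed in standard errors
diff_ns <- region_mean[["North"]] - region_mean[["South"]]
se_diff <- assumed_sd * sqrt(1 / region_n[["North"]] + 1 / region_n[["South"]])
z_ns    <- diff_ns / se_diff
print(round(z_ns, 2))
```

# The North versus South gap comes to roughly 1.9 standard errors, which is
# borderline for a single comparison and unremarkable once all six pairwise
# comparisons between four regions are taken into account.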
# Aggregate revenue by discount and region
revenue_discount_analysis <- tapply(
  sales_data$Revenue,
  INDEX = list(cut(sales_data$Discount, breaks = 5), sales_data$Region),
  FUN = mean,
  na.rm = TRUE
)
print(revenue_discount_analysis)
##                  East    North    South     West
## (1e-04,6.01] 266.0894 276.3939 253.7277 249.8500
## (6.01,12]    262.0311 254.8179 273.4349 279.8043
## (12,18]      302.5171 280.8932 249.9878 298.0504
## (18,23.9]    271.7179 303.2589 272.8283 262.2514
## (23.9,30]    282.2472 290.6469 249.7815 259.0551
# Interpretation:
# 1. The table displayed average revenues for four regions (East, North, South, West) across five discount intervals: (1e-04, 6.01], (6.01, 12], (12, 18], (18, 23.9], and (23.9, 30].
# 2. In the first discount interval, (1e-04, 6.01], the North region had the highest average revenue at approximately 276.39, while the West region had the lowest at 249.85.
# 3. For the interval (6.01, 12], the West region achieved the highest average revenue at 279.80, whereas the North region had the lowest at 254.82.
# 4. In the interval (12, 18], the East region recorded the highest average revenue at 302.52, and the South region had the lowest at 249.99.
# 5. During the interval (18, 23.9], the North region reached its peak average revenue of 303.26, with the West region lowest for this interval at 262.25.
# 6. In the final discount interval, (23.9, 30], the North region had the highest average revenue at 290.65, and the South region had the lowest at 249.78.
# 7. I noticed that the North region led in three of the five intervals, while the lowest value alternated between the West (two intervals), the South (two intervals), and the North (one interval).
# 8. These findings suggested that the ranking of regions changed from one discount range to the next rather than following a steady trend. Further analysis could include checking the number of transactions behind each cell, since each average rests on roughly 50 observations.
# 9. I concluded that regional and discount-specific patterns in revenue could inform pricing strategies and promotions, provided the cell-level differences hold up on a larger sample.
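# The highest and lowest region in each discount interval can be read off
# programmatically instead of by eye. This is a sketch that rebuilds the printed
# table as a matrix; the names avg_rev and extremes are my own.

```r
# Average revenue by discount interval (rows) and region (columns),
# copied from the printed tapply() output
avg_rev <- matrix(
  c(266.0894, 276.3939, 253.7277, 249.8500,
    262.0311, 254.8179, 273.4349, 279.8043,
    302.5171, 280.8932, 249.9878, 298.0504,
    271.7179, 303.2589, 272.8283, 262.2514,
    282.2472, 290.6469, 249.7815, 259.0551),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("(1e-04,6.01]", "(6.01,12]", "(12,18]", "(18,23.9]", "(23.9,30]"),
    c("East", "North", "South", "West")
  )
)

# Highest region, lowest region, and max-min spread for each interval
extremes <- data.frame(
  interval = rownames(avg_rev),
  highest  = colnames(avg_rev)[apply(avg_rev, 1, which.max)],
  lowest   = colnames(avg_rev)[apply(avg_rev, 1, which.min)],
  spread   = unname(apply(avg_rev, 1, max) - apply(avg_rev, 1, min)),
  row.names = NULL
)
print(extremes)
```

# Reading the extremes from code avoids transcription slips when a table has
# many cells, and the spread column shows which intervals have the widest gap
# between regions.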
levelplot(revenue_discount_analysis, scales = list(x = list(rot = 90)), main = "Average Revenue by Discount and Region")

# Interpretation:
# 1. The heatmap visualized average revenue by discount range and region, with darker shades indicating higher revenue.
# 2. I observed that the North region achieved the highest average revenue, exceeding 300, during the discount interval (18, 23.9], represented by the darkest shade in the heatmap.
# 3. The East region showed strong performance in the discount interval (12, 18], with average revenue exceeding 300, depicted as another dark area.
# 4. The South region recorded lower average revenue in three of the five intervals, falling below 260 in (1e-04, 6.01], (12, 18], and (23.9, 30], but it was mid-range, near 273, in the other two.
# 5. The West region varied the most of any region after the East and North peaks are set aside, ranging from 249.85 in (1e-04, 6.01] to 298.05 in (12, 18].
# 6. The discount interval (6.01, 12] showed the narrowest spread across regions, with average revenue ranging from 254.82 (North) to 279.80 (West).
# 7. I inferred that the relationship between discount intervals and revenue varied by region, with the North and East regions posting the highest cell averages (above 300) in specific discount intervals, while the South was lowest in two intervals.
# 8. These findings highlighted potential opportunities for region-specific pricing strategies, particularly focusing on increasing revenue in the South region and optimizing discounts for the West.
# 9. I concluded that the heatmap provided an intuitive summary of the complex interaction between discounts and regional performance, making it a valuable tool for identifying trends and informing strategic decisions.
contourplot(revenue_discount_analysis, scales = list(x = list(rot = 90)), main = "Revenue Contour by Discount and Region")

# Contour plot with green-to-blue gradient
contourplot(revenue_discount_analysis,
            scales = list(x = list(rot = 90)), # Rotate x-axis labels
            main = "Revenue Contour by Discount and Region", # Add title
            region = TRUE, # Fill regions with color
            col.regions = colorRampPalette(c("green", "blue"))(100), # Green-to-blue gradient
            cuts = 10) # Specify the number of contour levels

# Interpretation:
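# 1. The contour plot displayed revenue trends across discount intervals and regions, with contour lines representing different revenue levels. Because Region is a nominal variable, the contours drawn between neighboring regions depend on the alphabetical order of the axis and should be read with caution.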
# 2. I observed that the contour lines labeled "270" were prominently visible across multiple regions, indicating that 270 was a common revenue level in many combinations of discount and region.
# 3. The North and East regions showed tighter contour patterns in intervals (12, 18] and (18, 23.9], suggesting steeper revenue gradients and stronger performance in these intervals.
# 4. The South region displayed wider-spaced contour lines, particularly in intervals (1e-04, 6.01] and (12, 18], indicating lower and more uniform revenue levels in these ranges.
# 5. The West region exhibited moderate spacing in its contour lines, showing steady but less variable revenue levels across most discount intervals.
# 6. I inferred that the contour density varied significantly by region, highlighting potential opportunities to adjust discount strategies to exploit higher revenue gradients, especially in underperforming regions like the South.
# 7. The intervals (12, 18] and (18, 23.9] appeared to be the most dynamic, with visible contour activity suggesting these ranges had the most impact on revenue, particularly for the North and East regions.
# 8. I concluded that the contour plot offered a nuanced view of revenue trends by showing the transitions between different revenue levels, making it a valuable tool for identifying areas of focus for regional discount optimization.
# CONCLUSION
# This analysis provided several meaningful insights into customer sales patterns and their implications:
# 1. The relationship between discounts and revenue was weak across all regions and product categories, as confirmed by both correlation analysis and scatter plots. This finding suggests that discounts alone may not be a primary driver of revenue and highlights the need to explore alternative factors influencing sales, such as customer engagement or product quality.
# 2. Revenue trends varied moderately across regions, with the North (281.11) and East (277.27) regions ahead of the South (259.28) on overall average revenue, although the ranking changed from one discount interval to the next. This disparity suggests opportunities for targeted interventions, such as localized marketing campaigns or tailored pricing strategies, to improve performance in underperforming regions.
# 3. Product-specific revenue patterns were similar across categories: variances ranged from 15202.67 (Electronics) to 17213.42 (Home & Kitchen), and Home & Kitchen had the highest mean revenue (291.99). These modest differences indicate that product category alone explains little of the revenue variability in this dataset, which is worth knowing before building category-specific inventory and marketing strategies.
# 4. The revenue heatmap and contour plots highlighted the discount ranges (12, 18] and (18, 23.9] as containing the highest cell averages: 302.52 for the East region and 303.26 for the North region, respectively. Given the near-zero overall correlation between discount and revenue, these peaks are better treated as candidates for further testing than as evidence that those discount levels raise revenue.
# 5. Analysis of sales by day of the week confirmed consistent revenue patterns, with no significant peaks or troughs across the week. This stability suggests a relatively balanced customer base, although further investigation into time-of-day trends or seasonal effects could provide additional insights.

# Overall, this project reinforced the importance of integrating multiple statistical techniques, such as EDA, correlation analysis, and visualization, to uncover actionable insights. By focusing on region-specific and product-driven strategies, businesses can optimize their marketing efforts, improve inventory management, and refine discount structures to drive growth.