Data Dive — Sampling and Drawing Conclusions

Loading the “Supermart” CSV file from the desktop

df <- read.csv("/Users/ramyaamudapakula/Desktop/Sem1/Statistics/Data Proposal/Supermart.csv")

Creating Five Random Samples

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Setting seed to reproduce the randomly generated numbers
set.seed(152)

# Creating 5 random samples, each with half as many rows as the original dataframe df
# (rows are drawn with replacement, so a row can appear more than once in a sample)
df_1 <- df[sample(nrow(df), nrow(df) * 0.5, replace = TRUE), ]
df_2 <- df[sample(nrow(df), nrow(df) * 0.5, replace = TRUE), ]
df_3 <- df[sample(nrow(df), nrow(df) * 0.5, replace = TRUE), ]
df_4 <- df[sample(nrow(df), nrow(df) * 0.5, replace = TRUE), ]
df_5 <- df[sample(nrow(df), nrow(df) * 0.5, replace = TRUE), ]

Five random subsamples of the original dataset were created by sampling with replacement, each half the size of the original. This simulates collecting data from a population, with each subsample representing a different set of observations.
- Random subsamples let us observe how much the data vary from one sample to the next and how that variation affects our conclusions.
- This helps us judge how robust our analyses are and whether the conclusions drawn from the data are reliable.
- The subsamples also let us check whether specific patterns or outliers appear consistently across multiple subsamples.
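
The repeated sampling step can also be written as a loop. The following is a minimal sketch, assuming a synthetic data frame `toy_df` in place of the Supermart data (its categories and values are made up for illustration); `draw_indices` is a hypothetical helper and not part of the original analysis.

```r
# Toy stand-in for the Supermart data (synthetic values, for illustration only)
set.seed(152)
toy_df <- data.frame(
  Category = sample(c("A", "B", "C"), 200, replace = TRUE),
  Sales    = round(runif(200, 500, 2500)),
  Profit   = round(runif(200, 50, 700))
)

# Hypothetical helper: draw floor(frac * n) row indices with replacement
draw_indices <- function(n, frac = 0.5) {
  sample(n, floor(n * frac), replace = TRUE)
}

# Draw five index sets, then build the five subsamples from them
index_sets <- lapply(1:5, function(i) draw_indices(nrow(toy_df)))
subsamples <- lapply(index_sets, function(idx) toy_df[idx, ])
names(subsamples) <- paste0("df_", 1:5)

# Every subsample has half as many rows as the toy data
n_rows <- sapply(subsamples, nrow)
print(n_rows)

# Sampling with replacement means some rows are drawn more than once
n_repeats <- sapply(index_sets, function(idx) sum(duplicated(idx)))
print(n_repeats)
```

Keeping the index sets separate from the subsamples makes the repeated draws easy to count, since R renames duplicated rows when it builds the subsample.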

Scrutinizing the Subsamples

# Summary statistics for each subsample involving sales and profit
df_1_summary <- df_1 %>% group_by(Category) %>% summarise(mean_sales = mean(Sales), mean_profit = mean(Profit))
df_2_summary <- df_2 %>% group_by(Category) %>% summarise(mean_sales = mean(Sales), mean_profit = mean(Profit))
df_3_summary <- df_3 %>% group_by(Category) %>% summarise(mean_sales = mean(Sales), mean_profit = mean(Profit))
df_4_summary <- df_4 %>% group_by(Category) %>% summarise(mean_sales = mean(Sales), mean_profit = mean(Profit))
df_5_summary <- df_5 %>% group_by(Category) %>% summarise(mean_sales = mean(Sales), mean_profit = mean(Profit))

Mean sales and mean profit were calculated for each category in each subsample.
- Comparing these summary statistics across subsamples helps us identify anomalies, variation, and consistent patterns.
- In particular, it shows whether any category is anomalous in every subsample rather than in just one.
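
One way to make the cross-subsample comparison easier is to stack the per-category summaries into a single table and measure how far each category's mean moves between subsamples. This is a sketch under assumptions: `toy_df` is synthetic, and base R `aggregate()` is used so the example has no package dependencies (the same idea could be expressed with dplyr).

```r
# Toy stand-in for the Supermart data (synthetic values, for illustration only)
set.seed(152)
toy_df <- data.frame(
  Category = sample(c("A", "B", "C"), 300, replace = TRUE),
  Sales    = runif(300, 500, 2500),
  Profit   = runif(300, 50, 700)
)

# Per-category means for each of five subsamples, stacked into one data frame
summaries <- do.call(rbind, lapply(1:5, function(i) {
  idx <- sample(nrow(toy_df), floor(nrow(toy_df) * 0.5), replace = TRUE)
  out <- aggregate(cbind(Sales, Profit) ~ Category, data = toy_df[idx, ], FUN = mean)
  names(out) <- c("Category", "mean_sales", "mean_profit")
  out$subsample <- i
  out
}))

# Range (max minus min) of mean sales across the five subsamples, per category
spread <- aggregate(mean_sales ~ Category, data = summaries,
                    FUN = function(x) max(x) - min(x))
names(spread)[2] <- "range_mean_sales"
print(spread)
```

A category whose range is much larger than the others would be a candidate for a closer look.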

Comparing Mean Sales and Profit Across the Subsamples

print("Comparison of Mean Sales:")

## [1] "Comparison of Mean Sales:"

print(df_1_summary$mean_sales)

## [1] 1507.032 1466.677 1483.359 1518.921 1461.846 1462.049 1428.198

print(df_2_summary$mean_sales)

## [1] 1495.330 1439.864 1512.428 1495.571 1507.055 1515.701 1479.570

print(df_3_summary$mean_sales)

## [1] 1509.623 1512.824 1513.024 1493.252 1509.506 1475.470 1449.583

print(df_4_summary$mean_sales)

## [1] 1460.200 1499.401 1501.062 1532.741 1462.291 1502.186 1455.330

print(df_5_summary$mean_sales)

## [1] 1497.083 1563.486 1504.905 1533.447 1448.548 1521.509 1455.548

print("Comparison of Mean Profit:")

## [1] "Comparison of Mean Profit:"

print(df_1_summary$mean_profit)

## [1] 367.4792 375.2277 362.1324 388.8974 371.2112 365.8106 370.8583

print(df_2_summary$mean_profit)

## [1] 370.0728 358.1563 375.5169 372.4790 380.4734 366.4238 367.2287

print(df_3_summary$mean_profit)

## [1] 377.8166 380.9210 374.5275 374.4415 394.4164 360.6025 380.5086

print(df_4_summary$mean_profit)

## [1] 369.7720 373.8926 365.8074 380.0621 372.3385 367.6175 374.5466

print(df_5_summary$mean_profit)

## [1] 378.3998 392.3479 378.3548 395.1303 368.3957 372.8682 374.7902

Based on the output, the mean sales and mean profit values for the different categories are relatively close to one another in every subsample: mean sales stay between roughly 1,428 and 1,563, and mean profit between roughly 358 and 395. However, the absence of unusually high or low means does not rule out the presence of outliers, because a mean taken over many rows can absorb a few extreme values. A more comprehensive analysis (for example, formal statistical tests) is needed to identify potential outliers or anomalies.
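
As one possible follow-up, the sketch below applies the common 1.5 × IQR rule within each category. It is an illustration on synthetic data with one planted extreme value, not a result from the Supermart file; `flag_outliers` is a hypothetical helper name, and other outlier criteria would be equally reasonable.

```r
# Synthetic data: two categories, with one extreme value planted in category B
set.seed(152)
toy_df <- data.frame(
  Category = rep(c("A", "B"), each = 50),
  Sales    = c(rnorm(50, 1500, 100), rnorm(49, 1500, 100), 5000)
)

# Hypothetical helper: TRUE for values beyond 1.5 * IQR from the quartiles
flag_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Apply the rule within each category (ave() returns 0/1, so compare with 1)
toy_df$is_outlier <- ave(toy_df$Sales, toy_df$Category, FUN = flag_outliers) == 1

# The category means alone do not single out the planted value
print(aggregate(Sales ~ Category, data = toy_df, FUN = mean))

# The row-level flag does
print(toy_df[toy_df$is_outlier, ])
```

The planted value shifts the mean of category B only modestly, which is why comparing means is not enough to detect it.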

Bar Charts Comparing the Mean Sales and Mean Profit for Different Categories in Each Subsample

library(ggplot2)
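# Bar chart of one summary column (e.g. "mean_sales") by category.
# `variable` is a column name given as a string, so it is converted to a
# symbol with sym() and unquoted with !! inside aes().
# Note: the title does not say which subsample is plotted.
plot_mean_comparison <- function(subsample, variable, y_label) {
  ggplot(subsample, aes(x = Category, y = !!sym(variable), fill = Category)) +
    geom_col() +
    labs(title = paste("Mean", y_label, "Comparison in Subsample"),
         x = "Category", y = y_label) +
    theme_minimal()
}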

# Plotting mean sales comparison
plot_mean_comparison(df_1_summary, "mean_sales", "Sales")

plot_mean_comparison(df_2_summary, "mean_sales", "Sales")

plot_mean_comparison(df_3_summary, "mean_sales", "Sales")

plot_mean_comparison(df_4_summary, "mean_sales", "Sales")

plot_mean_comparison(df_5_summary, "mean_sales", "Sales")

# Plotting mean profit comparison
plot_mean_comparison(df_1_summary, "mean_profit", "Profit")

plot_mean_comparison(df_2_summary, "mean_profit", "Profit")

plot_mean_comparison(df_3_summary, "mean_profit", "Profit")

plot_mean_comparison(df_4_summary, "mean_profit", "Profit")

plot_mean_comparison(df_5_summary, "mean_profit", "Profit")

Monte Carlo Simulations

# Setting up a Monte Carlo simulation for mean sales 
mc_mean_sales <- replicate(35, {
  sample_df <- df[sample(nrow(df), nrow(df) * 0.5, replace = TRUE), ]
  mean(sample_df$Sales)
})

# Setting up a Monte Carlo simulation for mean profit
mc_mean_profit <- replicate(35, {
  sample_df <- df[sample(nrow(df), nrow(df) * 0.5, replace = TRUE), ]
  mean(sample_df$Profit)
})

print(mc_mean_sales)

##  [1] 1489.479 1494.651 1505.398 1515.970 1496.712 1494.094 1500.013 1510.788
##  [9] 1492.735 1496.046 1487.115 1491.825 1501.310 1487.253 1497.005 1484.312
## [17] 1491.371 1493.313 1497.221 1505.757 1495.156 1502.584 1495.490 1509.479
## [25] 1479.518 1489.733 1503.818 1494.741 1503.130 1497.834 1514.458 1489.675
## [33] 1487.475 1491.852 1485.434

print(mc_mean_profit)

##  [1] 371.4028 379.0675 370.0918 372.8428 372.4092 373.0099 373.5840 375.8404
##  [9] 373.1203 382.8953 375.6755 371.0936 368.0518 377.5400 366.1775 372.5269
## [17] 378.0448 371.5496 379.2460 381.0560 377.1214 376.3493 367.3559 377.1804
## [25] 370.0226 381.5760 377.0757 370.8002 377.8924 376.4857 374.8925 373.4676
## [33] 371.7665 369.3838 369.2948

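Monte Carlo simulations were run to understand how mean sales and mean profit vary under repeated random sampling: each of the 35 replications draws a half-size sample with replacement and records its overall mean. The simulated means stay within a narrow band (about 1,480 to 1,516 for sales and about 366 to 383 for profit), which is consistent with the values seen in the five subsamples. The band is narrower than the spread of the category-level means because each simulated mean averages over all categories at once. This suggests that conclusions about the overall means are unlikely to depend on which subsample happened to be drawn.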
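
The variability of the simulated means can also be summarized numerically. The sketch below is a minimal example on a synthetic vector `toy_sales` (an assumption standing in for `df$Sales`); it reports the centre, standard deviation, and a 2.5% to 97.5% quantile band of the simulated means.

```r
# Synthetic sales values (for illustration only)
set.seed(152)
toy_sales <- runif(1000, 500, 2500)

# 35 replications: mean of a half-size sample drawn with replacement
n_reps <- 35
sim_means <- replicate(n_reps, {
  mean(sample(toy_sales, length(toy_sales) * 0.5, replace = TRUE))
})

# Numerical summary of the simulated distribution
sim_summary <- c(
  centre = mean(sim_means),
  spread = sd(sim_means),
  lower  = unname(quantile(sim_means, 0.025)),
  upper  = unname(quantile(sim_means, 0.975))
)
print(round(sim_summary, 2))

# Mean of the full toy data, for comparison with the simulated centre
print(mean(toy_sales))
```

With only 35 replications the quantile band is rough; increasing `n_reps` would give a smoother picture of the sampling distribution.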

# Plotting the distributions (using histograms)
par(mfrow = c(1, 2))
hist(mc_mean_sales, main = "Monte Carlo Simulation - Mean Sales")
hist(mc_mean_profit, main = "Monte Carlo Simulation - Mean Profit")