Data Dive — Group By and Probabilities

options(repos = c(CRAN = "https://cran.rstudio.com"))

Loading the “Supermart” CSV file from the Desktop with ‘read.csv’ into the data frame ‘data’, which the combination checks later in this report use.

data <- read.csv("/Users/ramyaamudapakula/Desktop/Sem1/Statistics/Data Proposal/Supermart.csv")

Installing and loading the “dplyr” package which is important for data manipulation and transformation.

install.packages("dplyr")

## 
## The downloaded binary packages are in
##  /var/folders/pv/kll1prqs39jc2wvhvd31dfmr0000gn/T//Rtmp5wqijG/downloaded_packages

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Loading the “tidyverse” package, which attaches a set of packages for data manipulation, visualization, and analysis (including “dplyr”, already loaded above).

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.4     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Installing and loading the “data.table” package to use its efficient data manipulation capabilities.

install.packages("data.table")

## 
## The downloaded binary packages are in
##  /var/folders/pv/kll1prqs39jc2wvhvd31dfmr0000gn/T//Rtmp5wqijG/downloaded_packages

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year

## The following object is masked from 'package:purrr':
## 
##     transpose

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

Using the ‘fread’ function from the “data.table” package to read the same “Supermart.csv” file and assigning the resulting data table to ‘df’, the object used for the grouped summaries below.

df <- fread("/Users/ramyaamudapakula/Desktop/Sem1/Statistics/Data Proposal/Supermart.csv")

Grouping by Category

df1 <- df %>%
  group_by(Category) %>%
  summarize(Count = n(), AvgSales = mean(Sales), AvgProfit = mean(Profit))
df1

## # A tibble: 7 × 4
##   Category          Count AvgSales AvgProfit
##   <chr>             <int>    <dbl>     <dbl>
## 1 Bakery             1413    1495.      374.
## 2 Beverages          1400    1490.      375.
## 3 Eggs, Meat & Fish  1490    1522.      381.
## 4 Food Grains        1398    1513.      379.
## 5 Fruits & Veggies   1418    1481.      374.
## 6 Oil & Masala       1361    1498.      366.
## 7 Snacks             1514    1478.      375.

Grouping the data frame df by the ‘Category’ column and then calculating summary statistics (count, average sales, and average profit) for each category.

Grouping by Region

df2 <- df %>%
  group_by(Region) %>%
  summarize(Count = n(), AvgSales = mean(Sales), AvgProfit = mean(Profit))
df2

## # A tibble: 5 × 4
##   Region  Count AvgSales AvgProfit
##   <chr>   <int>    <dbl>     <dbl>
## 1 Central  2323    1493.      369.
## 2 East     2848    1492.      377.
## 3 North       1    1254       401.
## 4 South    1619    1507.      385.
## 5 West     3203    1498.      372.

Grouping the data frame df by the ‘Region’ column and then calculating summary statistics (count, average sales, and average profit) for each region.

Grouping by Discount Level

df$DiscountLevel <- cut(df$Discount, breaks = c(0, 0.2, 0.4, 1), labels = c("Low", "Medium", "High"))

df3 <- df %>%
  group_by(DiscountLevel) %>%
  summarize(Count = n(), AvgSales = mean(Sales), AvgProfit = mean(Profit))
df3

## # A tibble: 2 × 4
##   DiscountLevel Count AvgSales AvgProfit
##   <fct>         <int>    <dbl>     <dbl>
## 1 Low            4082    1499.      374.
## 2 Medium         5912    1495.      376.

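Creating a new categorical variable, ‘DiscountLevel’, from the ‘Discount’ column with the ‘cut’ function, then grouping the data frame by it and calculating summary statistics (count, average sales, and average profit) for each discount level. With these breaks the bins are (0, 0.2] for ‘Low’, (0.2, 0.4] for ‘Medium’, and (0.4, 1] for ‘High’. Only ‘Low’ and ‘Medium’ appear in the output because no record has a discount above 0.4.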
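The way ‘cut’ treats bin edges can be checked on a few values. The sketch below is an illustration only: the discount values are hypothetical, not taken from Supermart.csv, and the breaks and labels repeat those used above.

```r
# Hypothetical discount values for illustration; not taken from Supermart.csv
discount <- c(0, 0.10, 0.20, 0.25, 0.40)
breaks <- c(0, 0.2, 0.4, 1)
labels <- c("Low", "Medium", "High")

# Default intervals are (0, 0.2], (0.2, 0.4], (0.4, 1], so a discount of 0 becomes NA
lvl_default <- cut(discount, breaks = breaks, labels = labels)

# include.lowest = TRUE closes the first interval on the left: [0, 0.2]
lvl_closed <- cut(discount, breaks = breaks, labels = labels, include.lowest = TRUE)

# table() reports empty levels too, so an unused "High" bin stays visible
print(table(lvl_default, useNA = "ifany"))
print(table(lvl_closed))
```

If the real ‘Discount’ column ever contained exact zeros, the default call would silently move those rows into an NA group, so ‘include.lowest = TRUE’ may be the safer choice.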

Calculating probabilities and assigning special tags to lowest probability groups

df1$Probability <- df1$Count / sum(df1$Count)
df2$Probability <- df2$Count / sum(df2$Count)
df3$Probability <- df3$Count / sum(df3$Count)

# Identifying the lowest probability group(s) in each summary table
min_prob_group1 <- df1$Category[df1$Probability == min(df1$Probability)]
min_prob_group2 <- df2$Region[df2$Probability == min(df2$Probability)]
min_prob_group3 <- df3$DiscountLevel[df3$Probability == min(df3$Probability)]

# Adding a special tag column to the original data frame
df$SpecialTag1 <- ifelse(df$Category %in% min_prob_group1, "LowestProbCategory", "OtherCategories")
df$SpecialTag2 <- ifelse(df$Region %in% min_prob_group2, "LowestProbRegion", "OtherRegions")
df$SpecialTag3 <- ifelse(df$DiscountLevel %in% min_prob_group3, "LowestProbDiscountLevel", "OtherDiscountLevels")

Probability is calculated as the count of each group divided by the sum of all counts, which gives the group’s relative share of the dataset. The group (or groups, in case of a tie) with the lowest probability is then identified within each summary table. Finally, three new columns (SpecialTag1, SpecialTag2, SpecialTag3) are added to the original data frame (df) to mark whether each row belongs to the lowest probability group or to the other groups, in terms of category, region, and discount level respectively.
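The probability calculation can be reproduced without the CSV file. This minimal sketch uses the counts printed in the Category summary table of this report; the object names ‘counts’, ‘prob’, and ‘lowest’ are introduced here for illustration.

```r
# Counts copied from the Category summary table in this report
counts <- c("Bakery" = 1413, "Beverages" = 1400, "Eggs, Meat & Fish" = 1490,
            "Food Grains" = 1398, "Fruits & Veggies" = 1418,
            "Oil & Masala" = 1361, "Snacks" = 1514)

# Relative frequency of each group: count divided by the total count
prob <- counts / sum(counts)

# Every group tied for the minimum is returned, matching the == min() pattern
lowest <- names(prob)[prob == min(prob)]

print(round(prob, 3))
print(lowest)
```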

Category Distribution Visualization

cat_conclusion <- paste("The category with the lowest probability is:", min_prob_group1)
library(ggplot2)
ggplot(df1, aes(x = Category, y = Count)) +
  geom_bar(stat = "identity") +
  labs(title = "Distribution of Products by Category", subtitle = cat_conclusion, x = "Category", y = "Count")

The ‘Oil & Masala’ category has the smallest probability: 1361 of 9994 records, about 13.6%.

Interpretation in Terms of Probability: A randomly chosen record is slightly less likely to be an ‘Oil & Masala’ purchase than a purchase from any other single category. The gap is small, since even the largest category, ‘Snacks’, accounts for only about 15.1% of records.
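One way to judge whether the category shares differ from an even split by more than chance would allow is a chi-square goodness-of-fit test. The sketch below is an addition, not part of the original analysis. It reuses the counts from the Category summary table and assumes equal expected shares across the seven categories.

```r
# Counts copied from the Category summary table in this report
counts <- c("Bakery" = 1413, "Beverages" = 1400, "Eggs, Meat & Fish" = 1490,
            "Food Grains" = 1398, "Fruits & Veggies" = 1418,
            "Oil & Masala" = 1361, "Snacks" = 1514)

# On a plain vector of counts, chisq.test() tests against equal shares by default
gof <- chisq.test(counts)
print(gof)

# Expected count per category under an even split (total / 7)
print(gof$expected)
```

A large p-value here would suggest the lowest probability category is not clearly under-represented; a small one would suggest the shares are uneven, without saying which category drives the difference.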

Region Distribution Visualization

region_conclusion <- paste("The region with the lowest probability is:", min_prob_group2)

# Visualization
ggplot(df2, aes(x = Region, y = Count)) +
  geom_bar(stat = "identity") +
  labs(title = "Distribution of Products by Region", subtitle = region_conclusion, x = "Region", y = "Count")

The ‘North’ region has the smallest probability: it contributes a single record out of 9994, about 0.01%.

Interpretation in Terms of Probability: The chance that a randomly chosen record comes from the ‘North’ region is close to zero. With only one record, this may reflect a data-entry anomaly or a region outside the store’s usual coverage rather than genuinely low demand, and the averages reported for ‘North’ rest on a single order.
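Groups with very few records are worth flagging before their averages are interpreted. The sketch below uses the counts from the Region summary table; the threshold ‘min_n’ of 30 is a hypothetical cutoff chosen for illustration, not a value from the original analysis.

```r
# Counts copied from the Region summary table in this report
region_counts <- c(Central = 2323, East = 2848, North = 1, South = 1619, West = 3203)

# Hypothetical minimum group size; an assumption for this illustration
min_n <- 30

summary_tbl <- data.frame(
  Region = names(region_counts),
  Count = unname(region_counts),
  Share = round(unname(region_counts) / sum(region_counts), 4),
  TooFewRecords = unname(region_counts) < min_n,
  stringsAsFactors = FALSE
)
print(summary_tbl)
```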

Discount Level Distribution Visualization

discount_conclusion <- paste("The discount level with the lowest probability is:", min_prob_group3)

# Visualization
ggplot(df3, aes(x = DiscountLevel, y = Count)) +
  geom_bar(stat = "identity") +
  labs(title = "Distribution of Products by Discount Level", subtitle = discount_conclusion, x = "Discount Level", y = "Count")

The ‘Low’ discount level has the smallest probability among the levels that occur: 4082 of 9994 records, about 40.8%, against about 59.2% for ‘Medium’.

Interpretation in Terms of Probability: A randomly chosen record is less likely to carry a ‘Low’ discount than a ‘Medium’ one. The ‘High’ level has no records at all, so it is absent from the summary rather than reported as the lowest probability level.

Testable Hypothesis: Checking if there is a significant difference in sales between the lowest probability category and other categories.

# Subset data for the lowest probability category and other categories
lowest_prob_category_data <- df[df$SpecialTag1 == "LowestProbCategory", ]
other_category_data <- df[df$SpecialTag1 == "OtherCategories", ]

# Performing t-test
t_test_result <- t.test(lowest_prob_category_data$Sales, other_category_data$Sales)
print(t_test_result)

## 
##  Welch Two Sample t-test
## 
## data:  lowest_prob_category_data$Sales and other_category_data$Sales
## t = 0.078645, df = 1801.3, p-value = 0.9373
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -32.06220  34.74093
## sample estimates:
## mean of x mean of y 
##  1497.753  1496.414

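Interpretation of the T-Test Results

Null Hypothesis (H₀): There is no difference in mean sales between the lowest probability category and the other categories.

Alternative Hypothesis (H₁): There is a difference in mean sales between the lowest probability category and the other categories.

T-Test Summary:
T-Value: 0.078645
Degrees of Freedom (df): 1801.3
P-Value: 0.9373

Conclusion: The p-value of 0.9373 is far above the usual 0.05 threshold, and the 95 percent confidence interval for the difference in means (-32.06 to 34.74) contains 0, so the test fails to reject H₀. Mean sales per record in the lowest probability category (1497.75) are statistically indistinguishable from those in the other categories (1496.41).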
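A p-value and confidence interval like those reported above can be read off the object returned by ‘t.test’. The sketch below runs on simulated data, so the two groups, their sizes, and the 0.05 significance level ‘alpha’ are assumptions for illustration and do not reproduce the Supermart figures.

```r
set.seed(1)

# Simulated sales for two groups with the same true mean (hypothetical data)
group_a <- rnorm(200, mean = 1500, sd = 500)
group_b <- rnorm(800, mean = 1500, sd = 500)

# t.test() runs Welch's unequal-variance test by default
res <- t.test(group_a, group_b)

alpha <- 0.05                                   # assumed significance level
reject_h0 <- res$p.value < alpha                # TRUE means H0 is rejected
ci_contains_zero <- res$conf.int[1] <= 0 && res$conf.int[2] >= 0

print(res$p.value)
print(res$conf.int)
print(reject_h0)
```

For a two-sided test, rejecting H₀ at the 0.05 level and a 95 percent confidence interval that excludes 0 are the same decision, which is why both checks agree.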

Possible Factors Influencing the Results:
Because the t-test finds no difference in mean sales, the factors below are candidates for explaining why the lowest probability category has fewer records, not a difference in sales per record.
Seasonal Variation: Products in the lowest probability category may have seasonal demand patterns, leading to fluctuations in sales throughout the year. Other categories might be more consistent in demand.
Marketing and Visibility: Insufficient marketing efforts for the lowest probability category could result in lower visibility among customers. Other categories may benefit from more effective marketing strategies.
Pricing and Discounts: Pricing strategies, including discounts and promotions, may differ between the lowest probability category and others. Customers may be more attracted to categories with higher discounts.
Product Availability: Limited availability of products in the lowest probability category could contribute to fewer purchases. Other categories might have a more extensive and varied product assortment.

Further investigations can include:
Exploratory Analysis: Conduct exploratory data analysis to understand the distribution of sales within each category.
Additional Variables: Include additional relevant variables that might influence sales, such as marketing efforts, product popularity, or customer demographics.
Temporal Analysis: Explore whether sales patterns change over time, and consider temporal factors that may impact the results.
Customer Feedback: Collect customer feedback or conduct surveys to understand perceptions and preferences related to the lowest probability category.

Checking for Missing Combinations

existing_combinations <- unique(data[, c("Category", "Region")])
all_combinations <- expand.grid(Category = unique(data$Category), Region = unique(data$Region))
missing_combinations <- anti_join(all_combinations, existing_combinations)

## Joining with `by = join_by(Category, Region)`

print(missing_combinations)

##            Category Region
## 1         Beverages  North
## 2       Food Grains  North
## 3  Fruits & Veggies  North
## 4            Bakery  North
## 5            Snacks  North
## 6 Eggs, Meat & Fish  North

Possible Reasons for Missing Combinations:
Single North Record: The “North” region contains only one record, in the “Oil & Masala” category, so every other category is necessarily missing for “North”. The six missing combinations are the remaining six categories paired with “North”.
Data Coverage: “North” may be a data-entry anomaly or a region the supermarket does not normally serve, which would explain the gaps better than limited product variety or regional preferences.
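Missing combinations can also be found with a base R cross-tabulation, where they show up as zero cells. This is an alternative sketch, not the method used above; the data frame ‘toy’ is hypothetical and only mimics the situation of a region with a single record.

```r
# Hypothetical records: "North" appears once, in "Oil & Masala" only
toy <- data.frame(
  Category = c("Bakery", "Bakery", "Snacks", "Snacks", "Oil & Masala", "Oil & Masala"),
  Region   = c("East", "West", "East", "West", "East", "North"),
  stringsAsFactors = FALSE
)

# Cross-tabulate; combinations that never occur have a count of 0
xt <- table(toy$Category, toy$Region, dnn = c("Category", "Region"))

# Long format: one row per Category-Region pair with its frequency
combos <- as.data.frame(xt, stringsAsFactors = FALSE)
missing <- combos[combos$Freq == 0, c("Category", "Region")]

print(xt)
print(missing)
```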

Displaying Most and Least Common Combinations

# Analyze the frequency of each combination
combination_counts <- data %>%
  group_by(Category, Region) %>%
  summarise(Count = n())

## `summarise()` has grouped output by 'Category'. You can override using the
## `.groups` argument.

# Display most common and least common combinations
most_common_combination <- combination_counts[which.max(combination_counts$Count), ]
least_common_combination <- combination_counts[which.min(combination_counts$Count), ]

print(paste("Most Common Combination: ", most_common_combination$Category, " in ", most_common_combination$Region))

## [1] "Most Common Combination:  Eggs, Meat & Fish  in  West"

print(paste("Least Common Combination: ", least_common_combination$Category, " in ", least_common_combination$Region))

## [1] "Least Common Combination:  Oil & Masala  in  North"

Interpretation:
Most Common Combination: “Eggs, Meat & Fish” in the “West” region. This combination may be popular due to regional preferences or a diverse range of products in the “Eggs, Meat & Fish” category in the “West” region. “West” is also the region with the most records overall (3203), which raises the count of every category there.
Least Common Combination: “Oil & Masala” in the “North” region. This is the only record in “North”, so its low count most likely reflects how little data exists for that region rather than anything specific to “Oil & Masala” products.

Possible Reasons for Combinations’ Frequency:
Regional Preferences: Certain combinations may align well with the preferences of customers in specific regions, leading to higher frequencies.
Product Availability: Combinations with more available and diverse products are likely to be more common.
Market Trends: Consumer trends and market dynamics can influence the popularity of certain combinations over time.
Cultural Factors: Cultural factors may play a role in the demand for specific combinations in different regions.
Marketing Strategies: Effective marketing strategies may boost the sales of certain combinations, making them more common.
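When several combinations share the lowest count, ‘which.min’ returns only the first of them. The sketch below shows one way to keep ties, using ‘count’, ‘slice_max’, and ‘slice_min’ from “dplyr”; the data frame ‘toy’ is hypothetical and stands in for the real data.

```r
library(dplyr)

# Hypothetical records for illustration; not taken from Supermart.csv
toy <- data.frame(
  Category = c("Bakery", "Bakery", "Bakery", "Snacks", "Snacks", "Oil & Masala"),
  Region   = c("West", "West", "East", "West", "East", "North"),
  stringsAsFactors = FALSE
)

# count() groups, tallies, and returns an ungrouped result in one step
combo_counts <- toy %>% count(Category, Region, name = "Count")

# with_ties = TRUE keeps every combination tied for the extreme value
most_common  <- combo_counts %>% slice_max(Count, n = 1, with_ties = TRUE)
least_common <- combo_counts %>% slice_min(Count, n = 1, with_ties = TRUE)

print(most_common)
print(least_common)
```

Because ‘count’ returns an ungrouped result, it also avoids the grouped-output message that ‘summarise’ prints.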

Visualizing One of the Combinations

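# First Category-Region row of combination_counts, kept for reference
chosen_combination <- combination_counts[1, ]
# Dodged bars show the record count of every Category-Region pair, not only chosen_combination
ggplot(data, aes(x = Category, fill = Region)) +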
  geom_bar(position = "dodge") +
  ggtitle("Category and Region Combinations") +
  theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))