options(repos = c(CRAN = "https://cran.rstudio.com"))
data <- read.csv("/Users/ramyaamudapakula/Desktop/Sem1/Statistics/Data Proposal/Supermart.csv")
install.packages("dplyr")
##
## The downloaded binary packages are in
## /var/folders/pv/kll1prqs39jc2wvhvd31dfmr0000gn/T//Rtmp5wqijG/downloaded_packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ ggplot2 3.4.4 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
install.packages("data.table")
##
## The downloaded binary packages are in
## /var/folders/pv/kll1prqs39jc2wvhvd31dfmr0000gn/T//Rtmp5wqijG/downloaded_packages
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following object is masked from 'package:purrr':
##
## transpose
## The following objects are masked from 'package:dplyr':
##
## between, first, last
df <- fread("/Users/ramyaamudapakula/Desktop/Sem1/Statistics/Data Proposal/Supermart.csv")
df1 <- df %>%
group_by(Category) %>%
summarize(Count = n(), AvgSales = mean(Sales), AvgProfit = mean(Profit))
df1
## # A tibble: 7 × 4
## Category Count AvgSales AvgProfit
## <chr> <int> <dbl> <dbl>
## 1 Bakery 1413 1495. 374.
## 2 Beverages 1400 1490. 375.
## 3 Eggs, Meat & Fish 1490 1522. 381.
## 4 Food Grains 1398 1513. 379.
## 5 Fruits & Veggies 1418 1481. 374.
## 6 Oil & Masala 1361 1498. 366.
## 7 Snacks 1514 1478. 375.
This groups the data frame df by the ‘Category’ column and calculates summary statistics (count, average sales, and average profit) for each category.
df2 <- df %>%
group_by(Region) %>%
summarize(Count = n(), AvgSales = mean(Sales), AvgProfit = mean(Profit))
df2
## # A tibble: 5 × 4
## Region Count AvgSales AvgProfit
## <chr> <int> <dbl> <dbl>
## 1 Central 2323 1493. 369.
## 2 East 2848 1492. 377.
## 3 North 1 1254 401.
## 4 South 1619 1507. 385.
## 5 West 3203 1498. 372.
This groups the data frame df by the ‘Region’ column and calculates summary statistics (count, average sales, and average profit) for each region.
Grouping by Discount Level
df$DiscountLevel <- cut(df$Discount, breaks = c(0, 0.2, 0.4, 1), labels = c("Low", "Medium", "High"))
df3 <- df %>%
group_by(DiscountLevel) %>%
summarize(Count = n(), AvgSales = mean(Sales), AvgProfit = mean(Profit))
df3
## # A tibble: 2 × 4
## DiscountLevel Count AvgSales AvgProfit
## <fct> <int> <dbl> <dbl>
## 1 Low 4082 1499. 374.
## 2 Medium 5912 1495. 376.
This creates a new categorical variable, ‘DiscountLevel’, from the ‘Discount’ column using the ‘cut’ function, then groups the data frame by this variable and calculates summary statistics (count, average sales, and average profit) for each discount level. Only ‘Low’ and ‘Medium’ appear in the output because no discount in the data falls in the ‘High’ bin (above 0.4).
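To make the binning rule concrete, here is a minimal sketch of how ‘cut’ assigns labels at the interval edges. The discount values are made up for illustration and are not taken from Supermart.csv; the names `discount`, `level`, and `level_closed` are hypothetical.

```r
# Toy discount values (hypothetical, not taken from Supermart.csv)
discount <- c(0, 0.1, 0.2, 0.21, 0.35, 0.4)

# Same breaks and labels as above; the intervals are (0, 0.2], (0.2, 0.4], (0.4, 1]
level <- cut(discount, breaks = c(0, 0.2, 0.4, 1),
             labels = c("Low", "Medium", "High"))
print(level)

# With include.lowest = TRUE the first interval becomes [0, 0.2],
# so a discount of exactly 0 is labelled "Low" instead of NA
level_closed <- cut(discount, breaks = c(0, 0.2, 0.4, 1),
                    labels = c("Low", "Medium", "High"),
                    include.lowest = TRUE)
print(level_closed)

# table() keeps empty factor levels, so an unused "High" bin shows up as 0
print(table(level_closed))
```

Because the intervals are closed on the right by default, a discount of exactly 0.2 counts as ‘Low’ and exactly 0.4 as ‘Medium’. Whether that matches the intended definition of the levels is a judgment call for the analysis.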
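# Category with the fewest rows, i.e. the lowest relative frequency
min_prob_group1 <- df1$Category[which.min(df1$Count)]
cat_conclusion <- paste("The category with the lowest probability is:", min_prob_group1)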
library(ggplot2)
ggplot(df1, aes(x = Category, y = Count)) +
geom_bar(stat = "identity") +
labs(title = "Distribution of Products by Category", subtitle = cat_conclusion, x = "Category", y = "Count")
The ‘Oil & Masala’ category has the smallest probability: it accounts for 1,361 of the 9,994 rows (about 13.6%), making it the least frequent category in the data.
Interpretation in Terms of Probability: A randomly chosen transaction is less likely to fall in ‘Oil & Masala’ than in any other category. The gap is small, however: each category accounts for between about 13.6% and 15.1% of the rows.
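The "probability" discussed here is the relative frequency of each category. As a small sketch, it can be computed directly from the counts printed in the df1 output above; the variable names `counts`, `probs`, and `min_prob_category` are illustrative and do not appear in the original analysis.

```r
# Row counts per category, copied from the df1 output above
counts <- c("Bakery" = 1413, "Beverages" = 1400, "Eggs, Meat & Fish" = 1490,
            "Food Grains" = 1398, "Fruits & Veggies" = 1418,
            "Oil & Masala" = 1361, "Snacks" = 1514)

# Empirical probability of each category = count / total number of rows
probs <- counts / sum(counts)
print(round(probs, 3))

# Name of the category with the smallest empirical probability
min_prob_category <- names(probs)[which.min(probs)]
print(min_prob_category)
```

The same pattern applies to the region and discount-level counts.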
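# Region with the fewest rows, i.e. the lowest relative frequency
min_prob_group2 <- df2$Region[which.min(df2$Count)]
region_conclusion <- paste("The region with the lowest probability is:", min_prob_group2)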
# Visualization
ggplot(df2, aes(x = Region, y = Count)) +
geom_bar(stat = "identity") +
labs(title = "Distribution of Products by Region", subtitle = region_conclusion, x = "Region", y = "Count")
The ‘North’ region has the smallest probability: it accounts for only 1 of the 9,994 rows.
Interpretation in Terms of Probability: A randomly chosen transaction almost never comes from the ‘North’ region. With a single record, this may reflect a data-entry or coverage issue rather than genuinely lower sales, so the region's averages should not be compared with those of the other regions.
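# Discount level with the fewest rows, i.e. the lowest relative frequency
min_prob_group3 <- as.character(df3$DiscountLevel[which.min(df3$Count)])
discount_conclusion <- paste("The discount level with the lowest probability is:", min_prob_group3)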
# Visualization
ggplot(df3, aes(x = DiscountLevel, y = Count)) +
geom_bar(stat = "identity") +
labs(title = "Distribution of Products by Discount Level", subtitle = discount_conclusion, x = "Discount Level", y = "Count")
The ‘Low’ discount level has the smallest probability among the observed levels: 4,082 rows, versus 5,912 for ‘Medium’. No rows fall in the ‘High’ level.
Interpretation in Terms of Probability: A randomly chosen transaction is less likely to carry a ‘Low’ discount (about 41% of rows) than a ‘Medium’ discount (about 59% of rows).
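# Subset data for the lowest probability category and other categories
# (assumes df already has a SpecialTag1 column tagging each row; it is not created in the code shown here)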
lowest_prob_category_data <- df[df$SpecialTag1 == "LowestProbCategory", ]
other_category_data <- df[df$SpecialTag1 == "OtherCategories", ]
# Performing t-test
t_test_result <- t.test(lowest_prob_category_data$Sales, other_category_data$Sales)
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: lowest_prob_category_data$Sales and other_category_data$Sales
## t = 0.078645, df = 1801.3, p-value = 0.9373
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -32.06220 34.74093
## sample estimates:
## mean of x mean of y
## 1497.753 1496.414
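Interpretation of the T-Test Results
Null Hypothesis (H₀): Mean sales in the lowest probability category are equal to mean sales in the other categories.
Alternative Hypothesis (H₁): Mean sales in the lowest probability category differ from mean sales in the other categories.
T-Test Summary:
T-Value: 0.078645
Degrees of Freedom (df): 1801.3
P-Value: 0.9373
Conclusion: The p-value (0.9373) is far above 0.05, so we fail to reject H₀. The 95% confidence interval for the difference in means (-32.06 to 34.74) contains 0, and the two sample means (1497.753 and 1496.414) are nearly identical.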
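As a sanity check, the p-value and confidence interval can be approximately reproduced from the printed summary alone. This is a sketch that assumes only the four numbers shown in the output; the variable names are illustrative. Because the printed means are rounded, the interval should land close to, but not exactly on, the reported bounds.

```r
# Values copied from the printed Welch t-test output above
t_value <- 0.078645
dof     <- 1801.3
mean_x  <- 1497.753
mean_y  <- 1496.414

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = dof)

# Standard error implied by t = (difference in means) / SE
diff_means <- mean_x - mean_y
std_error  <- diff_means / t_value

# 95% confidence interval for the difference in means
ci <- diff_means + c(-1, 1) * qt(0.975, df = dof) * std_error

print(round(p_value, 4))
print(round(ci, 2))
```

A difference of roughly 1.3 against a standard error of roughly 17 explains why the test statistic is so close to zero.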
Possible Factors Influencing the Results:
Seasonal Variation: Products in the lowest probability category may have seasonal demand patterns, leading to fluctuations in sales throughout the year. Other categories might be more consistent in demand.
Marketing and Visibility: Insufficient marketing efforts for the lowest probability category could result in lower visibility among customers. Other categories may benefit from more effective marketing strategies.
Pricing and Discounts: Pricing strategies, including discounts and promotions, may differ between the lowest probability category and others. Customers may be more attracted to categories with higher discounts.
Product Availability: Limited availability of products in the lowest probability category could contribute to lower sales. Other categories might have a more extensive and varied product assortment.
Further investigations can include:
Exploratory Analysis: Conduct exploratory data analysis to understand the distribution of sales within each category.
Additional Variables: Include additional relevant variables that might influence sales, such as marketing efforts, product popularity, or customer demographics.
Temporal Analysis: Explore whether sales patterns change over time, and consider temporal factors that may impact the results.
Customer Feedback: Collect customer feedback or conduct surveys to understand perceptions and preferences related to the lowest probability category.
existing_combinations <- unique(data[, c("Category", "Region")])
all_combinations <- expand.grid(Category = unique(data$Category), Region = unique(data$Region))
missing_combinations <- anti_join(all_combinations, existing_combinations)
## Joining with `by = join_by(Category, Region)`
print(missing_combinations)
## Category Region
## 1 Beverages North
## 2 Food Grains North
## 3 Fruits & Veggies North
## 4 Bakery North
## 5 Snacks North
## 6 Eggs, Meat & Fish North
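Possible Reasons for Missing Combinations:
All six missing combinations involve the “North” region. That region has a single record, in the “Oil & Masala” category, so every other category is necessarily absent there.
Limited Data Coverage: The “North” region may be barely represented in the data set, for example because of a data-entry error or because the supermarket records very few sales there.
Limited Product Variety: The supermarket might offer few products outside the “Oil & Masala” category in the “North” region.
Regional Preferences: Consumer preferences in the “North” region may not align with the other categories, although a single record is too little evidence to support this.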
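The same idea can be illustrated without any packages by cross-tabulating the two columns and keeping the cells with a count of zero. This is a sketch on made-up rows, not on Supermart.csv, and it uses base R's table() in place of the expand.grid and anti_join approach above; the names `toy`, `pair_counts`, and `missing_pairs` are hypothetical.

```r
# Toy data (hypothetical rows, not from Supermart.csv)
toy <- data.frame(
  Category = c("Bakery", "Bakery", "Snacks", "Oil & Masala"),
  Region   = c("West",   "East",   "West",   "North"),
  stringsAsFactors = FALSE
)

# Cross-tabulate every Category x Region pair; unobserved pairs get a count of 0
pair_counts <- as.data.frame(
  table(Category = toy$Category, Region = toy$Region),
  stringsAsFactors = FALSE
)

# Pairs that never occur in the data
missing_pairs <- pair_counts[pair_counts$Freq == 0, c("Category", "Region")]
print(missing_pairs)
```

A region with very few rows, like North in the toy data, dominates the list of missing pairs, which mirrors the pattern in the output above.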
# Analyze the frequency of each combination
combination_counts <- data %>%
group_by(Category, Region) %>%
summarise(Count = n())
## `summarise()` has grouped output by 'Category'. You can override using the
## `.groups` argument.
# Display most common and least common combinations
most_common_combination <- combination_counts[which.max(combination_counts$Count), ]
least_common_combination <- combination_counts[which.min(combination_counts$Count), ]
print(paste("Most Common Combination: ", most_common_combination$Category, " in ", most_common_combination$Region))
## [1] "Most Common Combination: Eggs, Meat & Fish in West"
print(paste("Least Common Combination: ", least_common_combination$Category, " in ", least_common_combination$Region))
## [1] "Least Common Combination: Oil & Masala in North"
Interpretation:
Most Common Combination: “Eggs, Meat & Fish” in the “West” region. This combination may be popular due to regional preferences or a diverse range of products in the “Eggs, Meat & Fish” category in the “West” region.
Least Common Combination: “Oil & Masala” in the “North” region. This combination might be less common due to specific regional preferences, limited availability of “Oil & Masala” products, or a market trend where these products are not as popular in the “North” region.
Possible Reasons for Combinations’ Frequency:
Regional Preferences: Certain combinations may align well with the preferences of customers in specific regions, leading to higher frequencies.
Product Availability: Combinations with more available and diverse products are likely to be more common.
Market Trends: Consumer trends and market dynamics can influence the popularity of certain combinations over time.
Cultural Factors: Cultural factors may play a role in the demand for specific combinations in different regions.
Marketing Strategies: Effective marketing strategies may boost the sales of certain combinations, making them more common.
chosen_combination <- combination_counts[1, ]
ggplot(data, aes(x = Category, fill = Region)) +
geom_bar(position = "dodge") +
ggtitle("Category and Region Combinations") +
theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))