Data Dive - Summaries

options(repos = c(CRAN = "https://cran.rstudio.com"))

Loading the ‘Supermart’ CSV file located on desktop

data <- read.csv("/Users/ramyaamudapakula/Desktop/Sem1/Statistics/Data Proposal/Supermart.csv")

Here, we are reading a CSV file named “Supermart.csv” located on the desktop and storing its data in a variable named “data” for further analysis.

Summarizing the data of Supermart Grocery Sales

summary_data<-summary(data)
summary_data

##    OrderID          CustomerName         Category         SubCategory       
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      City            OrderDate            Region              Sales     
##  Length:9994        Length:9994        Length:9994        Min.   : 500  
##  Class :character   Class :character   Class :character   1st Qu.:1000  
##  Mode  :character   Mode  :character   Mode  :character   Median :1498  
##                                                           Mean   :1497  
##                                                           3rd Qu.:1995  
##                                                           Max.   :2500  
##     Discount          Profit           State          
##  Min.   :0.1000   Min.   :  25.25   Length:9994       
##  1st Qu.:0.1600   1st Qu.: 180.02   Class :character  
##  Median :0.2300   Median : 320.78   Mode  :character  
##  Mean   :0.2268   Mean   : 374.94                     
##  3rd Qu.:0.2900   3rd Qu.: 525.63                     
##  Max.   :0.3500   Max.   :1120.95

The summary output displays key statistics for numerical columns like Sales, Discount, and Profit, along with information on the distribution and mode for categorical columns such as OrderID, CustomerName, Category, and City in the dataset.

Summary of Sales column

summary_sales<- summary(data$Sales)
summary_sales

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     500    1000    1498    1497    1995    2500

Summary for the “Sales” column indicates a range from 500 to 2500, with a median of 1498 and an average (mean) of 1497, offering insights into the distribution and central tendency of sales values in the dataset.

Summary of Profits column

summary_profit <- summary(data$Profit)
summary_profit

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25.25  180.02  320.78  374.94  525.63 1120.95

Summary for the “Profit” column indicates a profit range from 25.25 to 1120.95, with a median of 320.78 and an average (mean) of 374.94, offering insights into the distribution and central tendency of profit values in the dataset.

Summary of Category column

summary_category <- summary(data$Category)
summary_category

##    Length     Class      Mode 
##      9994 character character

Summary of unique values and counts in the categorical column ‘Category’

categorical_column <- "Category"
cat_summary <- table(data[[categorical_column]])
cat_summary

## 
##            Bakery         Beverages Eggs, Meat & Fish       Food Grains 
##              1413              1400              1490              1398 
##  Fruits & Veggies      Oil & Masala            Snacks 
##              1418              1361              1514

The “Category” column summary shows the count for each unique category, offering insights into the distribution of product categories in the dataset.

Enhanced Summary of ‘Category’ column

cat_summary_better <- data.frame(Value = names(cat_summary), Count = as.vector(cat_summary))
cat_summary_better

##               Value Count
## 1            Bakery  1413
## 2         Beverages  1400
## 3 Eggs, Meat & Fish  1490
## 4       Food Grains  1398
## 5  Fruits & Veggies  1418
## 6      Oil & Masala  1361
## 7            Snacks  1514

The enhanced summary for the “Category” column, now structured as a data frame, displays each unique category alongside its count.

Combining both numerical and categorical summaries

numerical_summary <- lapply(data[, c("Sales", "Profit")], function(x) {
  c(
    min = min(x),
    max = max(x),
    mean = mean(x),
    median = median(x),
    quantile_25 = quantile(x, 0.25),
    quantile_75 = quantile(x, 0.75)
  )
})
combined_summary <- list(numeric = numerical_summary, categorical = cat_summary_better)
knitr::kable(combined_summary)

	x
min	500.000
max	2500.000
mean	1496.596
median	1498.000
quantile_25.25%	1000.000
quantile_75.75%	1994.750

	x
min	25.2500
max	1120.9500
mean	374.9371
median	320.7800
quantile_25.25%	180.0225
quantile_75.75%	525.6275

Value	Count
Bakery	1413
Beverages	1400
Eggs, Meat & Fish	1490
Food Grains	1398
Fruits & Veggies	1418
Oil & Masala	1361
Snacks	1514

The combined summary includes statistics for “Sales” and “Profit,” along with counts for product categories offering a concise overview of both numerical and categorical features in the dataset.

Novel questions based on the column summaries, data documentation, and project goals:

1. Which sub-categories exhibit the highest and lowest sales performance?

sub_category_sales <- aggregate(Sales ~ SubCategory, data = data, sum)
sorted_sub_categories <- sub_category_sales[order(sub_category_sales$Sales, decreasing = TRUE), ]
print(sorted_sub_categories[1, ])

##      SubCategory   Sales
## 14 Health Drinks 1051439

print(sorted_sub_categories[nrow(sorted_sub_categories), ])

##    SubCategory  Sales
## 21        Rice 498323

The top-selling subcategory is “Health Drinks” with total sales amounting to 1,051,439, while the lowest-selling subcategory is “Rice” with sales of 498,323. These findings highlight variations in sales performance among different product subcategories in the dataset.

2. Are there specific regions that contribute significantly to the overall profit margin?

profit_by_region <- aggregate(Profit ~ Region, data = data, sum)
print(profit_by_region)

##    Region     Profit
## 1 Central  856806.84
## 2    East 1074345.58
## 3   North     401.28
## 4   South  623562.89
## 5    West 1192004.61

max_profit_index <- which.max(profit_by_region$Profit)
region_with_max_profit <- profit_by_region$Region[max_profit_index]
print(region_with_max_profit)

## [1] "West"

The region with the highest total profit is “West”, amounting to $1,192,004.61. This result is obtained by aggregating profits by region, showcasing regional variations in profit contribution.

3. What is the relationship between the discount offered and the resulting profit in each category?

agg_data <- aggregate(cbind(Profit, Discount) ~ Category, data = data, mean)
# Calculate correlation between aggregated profit and discount
correlation <- cor(agg_data$Profit, agg_data$Discount)
print(correlation)

## [1] 0.384205

A correlation coefficient of 0.4025159 indicates a positive correlation between the mean profit and mean discount for each category. This suggests that as the mean discount increases, the mean profit also tends to increase, and vice versa. However, the strength of this correlation is moderate, as it is less than 0.5.

4. How do different regions compare in terms of the average discount offered on products?

avg_discount_by_region <- aggregate(Discount ~ Region, data = data, mean)
print(avg_discount_by_region)

##    Region  Discount
## 1 Central 0.2287258
## 2    East 0.2276721
## 3   North 0.1200000
## 4   South 0.2267758
## 5    West 0.2247268

These results, obtained by aggregating discounts by region, reveal variations in discounting practices across different regions in the dataset.

Visual summaries (i.e., visualizations) of 2 or more columns of ‘Supermart’ data

install.packages("RColorBrewer")

## 
## The downloaded binary packages are in
##  /var/folders/pv/kll1prqs39jc2wvhvd31dfmr0000gn/T//Rtmp862tVy/downloaded_packages

install.packages("reshape2")

## 
## The downloaded binary packages are in
##  /var/folders/pv/kll1prqs39jc2wvhvd31dfmr0000gn/T//Rtmp862tVy/downloaded_packages

The code installs the “RColorBrewer” and “reshape2” packages in R, which are useful for color palettes in visualizations and reshaping data.

Scatter plot of Sales vs. Profit with colors representing Categories

library(ggplot2)
library(RColorBrewer)
category_colors <- c("Bakery" = "yellow", "Beverages" = "brown", "Eggs, Meat & Fish" = "pink", "Food Grains" = "black", "Fruits & Veggies" = "darkgreen", "Oil & Masala" = "red", "Sancks" = "purple")
ggplot(data, aes(x = Sales, y = Profit, color = Category)) +
  geom_point() +
  scale_color_manual(values = category_colors) +
  labs(title = "Sales vs. Profit by Category",
       x = "Sales", y = "Profit")

The scatter plot reveals varying relationships between Sales and Profit across product categories. Further exploration may involve investigating outliers, calculating correlations, and tailoring strategies for specific categories to optimize business performance.

Histogram of Sales with different colors for each category

category_colors <- c("Bakery" = "yellow", "Beverages" = "brown", "Eggs, Meat & Fish" = "pink", "Food Grains" = "black", "Fruits & Veggies" = "darkgreen", "Oil & Masala" = "red", "Sancks" = "purple")
ggplot(data, aes(x = Sales, fill = Category)) +
  geom_histogram(binwidth = 500, position = "identity", color = "white", alpha = 0.7) +
  scale_fill_manual(values = category_colors) +
  labs(title = "Distribution of Sales by Category",
       x = "Sales", y = "Frequency") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank()
  )

The histogram illustrates the distribution of Sales, categorized by product type. It helps in recognizing sales patterns within each category. Further exploration may involve assessing the skewness of distributions and understanding the impact of category-specific sales variations on overall business performance.

Line plot showing the trend of Sales over time (assuming ‘OrderDate’ is a date variable)

data$OrderDate <- as.Date(data$OrderDate) #taking OrderDate as date variable
category_colors <- c("Bakery" = "yellow", "Beverages" = "brown", "Eggs, Meat & Fish" = "pink", "Food Grains" = "black", "Fruits & Veggies" = "darkgreen", "Oil & Masala" = "red", "Sancks" = "purple")

# Line plot showing the trend of Sales over time
ggplot(data, aes(x = OrderDate, y = Sales, color = Category)) +
  geom_line(linewidth = 1.5) +
  scale_color_manual(values = category_colors) +
  labs(title = "Trend of Sales Over Time by Category",
       x = "Order Date", y = "Sales") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12)
  )

## Warning: Removed 5952 rows containing missing values (`geom_line()`).

The line plot visualizes the trend of Sales over time, categorized by product type, with distinct colors for each category. This representation allows for the identification of temporal patterns in Sales within different categories.

A heatmap visualizing the correlation between Sales and Profit

library(ggplot2)
library(reshape2)  # Loading reshape2 for melt function

# Calculate the correlation matrix
correlation_matrix <- cor(data[, c("Sales", "Profit")])

# Reshape the correlation matrix into long format
correlation_data <- melt(correlation_matrix)

# Create a correlation heatmap
ggplot(data = correlation_data, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "violet", high = "purple") +
  theme_minimal() +
  labs(title = "Correlation Heatmap for Sales and Profit",
       x = "Variables", y = "Variables")

The correlation heatmap visually represents the relationships between Sales and Profit, offering insights into their correlation patterns. This exploration facilitates understanding the strength and direction of the association between these variables.