options(repos = c(CRAN = "https://cran.rstudio.com"))
data <- read.csv("/Users/ramyaamudapakula/Desktop/Sem1/Statistics/Data Proposal/Supermart.csv")
Here, we read the CSV file “Supermart.csv” into a data frame named “data” for further analysis.
summary_data<-summary(data)
summary_data
## OrderID CustomerName Category SubCategory
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## City OrderDate Region Sales
## Length:9994 Length:9994 Length:9994 Min. : 500
## Class :character Class :character Class :character 1st Qu.:1000
## Mode :character Mode :character Mode :character Median :1498
## Mean :1497
## 3rd Qu.:1995
## Max. :2500
## Discount Profit State
## Min. :0.1000 Min. : 25.25 Length:9994
## 1st Qu.:0.1600 1st Qu.: 180.02 Class :character
## Median :0.2300 Median : 320.78 Mode :character
## Mean :0.2268 Mean : 374.94
## 3rd Qu.:0.2900 3rd Qu.: 525.63
## Max. :0.3500 Max. :1120.95
The summary output displays key statistics (minimum, quartiles, mean, and maximum) for the numerical columns Sales, Discount, and Profit. For character columns such as OrderID, CustomerName, Category, and City, it reports only the length, class, and storage mode, not the distribution of values.
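Because summary() says little about character columns, one option is to convert them to factors first, so that summary() reports a count per level. The sketch below uses a small made-up data frame (`toy` and its values are assumptions, not rows from Supermart.csv).

```r
# Toy data standing in for the real file; all values are made up.
toy <- data.frame(
  Category = c("Snacks", "Bakery", "Snacks", "Beverages"),
  Region   = c("West", "East", "West", "South"),
  Sales    = c(1200, 800, 1500, 950),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

# summary() now reports per-level counts for Category and Region.
summary(toy)
category_counts <- summary(toy$Category)
category_counts
```

The same two conversion lines could be applied to `data`, although high-cardinality columns such as OrderID are probably better left as character.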
summary_sales<- summary(data$Sales)
summary_sales
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 500 1000 1498 1497 1995 2500
The “Sales” column ranges from 500 to 2500, with a median of 1498 and a mean of 1497. The nearly identical mean and median suggest a roughly symmetric distribution of sales values.
summary_profit <- summary(data$Profit)
summary_profit
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.25 180.02 320.78 374.94 525.63 1120.95
The “Profit” column ranges from 25.25 to 1120.95, with a median of 320.78 and a mean of 374.94. The mean exceeds the median, which points to a right-skewed distribution: a smaller number of high-profit orders pulls the average up.
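The gap between mean and median can be turned into a rough skewness check. The sketch below uses made-up profit values, and `skewness()` is a hypothetical helper defined here (the moment-based form, third central moment divided by the 1.5 power of the second), not a base R function.

```r
# Made-up, right-skewed profit values (not taken from the dataset).
profit <- c(30, 120, 180, 250, 320, 400, 520, 700, 1100)

# A positive difference hints at a right skew.
mean_minus_median <- mean(profit) - median(profit)
mean_minus_median

# Hypothetical helper: moment-based skewness (population form).
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Positive for right-skewed data, close to 0 for symmetric data.
skewness(profit)
```

Calling `skewness(data$Profit)` would give the corresponding figure for the real column.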
summary_category <- summary(data$Category)
summary_category
## Length Class Mode
## 9994 character character
categorical_column <- "Category"
cat_summary <- table(data[[categorical_column]])
cat_summary
##
## Bakery Beverages Eggs, Meat & Fish Food Grains
## 1413 1400 1490 1398
## Fruits & Veggies Oil & Masala Snacks
## 1418 1361 1514
The “Category” column summary shows the count for each unique category, offering insights into the distribution of product categories in the dataset.
cat_summary_better <- data.frame(Value = names(cat_summary), Count = as.vector(cat_summary))
cat_summary_better
## Value Count
## 1 Bakery 1413
## 2 Beverages 1400
## 3 Eggs, Meat & Fish 1490
## 4 Food Grains 1398
## 5 Fruits & Veggies 1418
## 6 Oil & Masala 1361
## 7 Snacks 1514
The enhanced summary for the “Category” column, now structured as a data frame, displays each unique category alongside its count.
numerical_summary <- lapply(data[, c("Sales", "Profit")], function(x) {
  # unname() stops quantile() from appending "25%" / "75%" to the element names
  c(
    min = min(x),
    max = max(x),
    mean = mean(x),
    median = median(x),
    quantile_25 = unname(quantile(x, 0.25)),
    quantile_75 = unname(quantile(x, 0.75))
  )
})
combined_summary <- list(numeric = numerical_summary, categorical = cat_summary_better)
knitr::kable(combined_summary)
The combined summary includes statistics for “Sales” and “Profit,” along with counts for product categories, offering a concise overview of both numerical and categorical features in the dataset.
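A nested list does not render well as a table, so one possible alternative is to collect the numeric statistics into a matrix with one row per variable, which table functions typically handle directly. This is only a sketch: `toy` holds made-up values and `summarise_numeric()` is a hypothetical helper.

```r
# Made-up values standing in for the Sales and Profit columns.
toy <- data.frame(
  Sales  = c(500, 1000, 1500, 2000, 2500),
  Profit = c(25, 180, 320, 525, 1120)
)

# Hypothetical helper: six summary statistics with clean names.
summarise_numeric <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  c(
    min    = min(x, na.rm = TRUE),
    q25    = q[1],
    median = median(x, na.rm = TRUE),
    mean   = mean(x, na.rm = TRUE),
    q75    = q[2],
    max    = max(x, na.rm = TRUE)
  )
}

# sapply() returns statistics in rows; t() puts one variable per row.
numeric_table <- t(sapply(toy, summarise_numeric))
numeric_table
```

The resulting matrix could then be passed to a table renderer, with the category counts shown as a second, separate table.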
sub_category_sales <- aggregate(Sales ~ SubCategory, data = data, sum)
sorted_sub_categories <- sub_category_sales[order(sub_category_sales$Sales, decreasing = TRUE), ]
print(sorted_sub_categories[1, ])
## SubCategory Sales
## 14 Health Drinks 1051439
print(sorted_sub_categories[nrow(sorted_sub_categories), ])
## SubCategory Sales
## 21 Rice 498323
The top-selling subcategory is “Health Drinks” with total sales amounting to 1,051,439, while the lowest-selling subcategory is “Rice” with sales of 498,323. These findings highlight variations in sales performance among different product subcategories in the dataset.
profit_by_region <- aggregate(Profit ~ Region, data = data, sum)
print(profit_by_region)
## Region Profit
## 1 Central 856806.84
## 2 East 1074345.58
## 3 North 401.28
## 4 South 623562.89
## 5 West 1192004.61
max_profit_index <- which.max(profit_by_region$Profit)
region_with_max_profit <- profit_by_region$Region[max_profit_index]
print(region_with_max_profit)
## [1] "West"
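The region with the highest total profit is “West”, amounting to $1,192,004.61, followed by East and Central. This result is obtained by aggregating profits by region. North contributes only 401.28, far less than any other region, which suggests that it has very few orders rather than unusually poor performance.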
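Totals by group are easier to interpret next to the group sizes. The sketch below, on made-up rows, computes the total, the number of orders, and the mean profit per region in a single aggregate() call; the column names it produces (`Profit.total`, `Profit.orders`, `Profit.mean`) follow from the names chosen inside the anonymous function.

```r
# Made-up rows; the real data has 9994 orders.
toy <- data.frame(
  Region = c("West", "West", "East", "East", "East", "North"),
  Profit = c(400, 600, 300, 350, 250, 401.28)
)

# Return several statistics per region at once.
profit_summary <- aggregate(
  Profit ~ Region, data = toy,
  FUN = function(x) c(total = sum(x), orders = length(x), mean = mean(x))
)

# aggregate() stores the three values in one matrix column; flatten it
# into ordinary columns named Profit.total, Profit.orders, Profit.mean.
profit_summary <- do.call(data.frame, profit_summary)
profit_summary
```

A region with a single order, like North in this toy example, shows up immediately in the `Profit.orders` column.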
agg_data <- aggregate(cbind(Profit, Discount) ~ Category, data = data, mean)
# Calculate correlation between aggregated profit and discount
correlation <- cor(agg_data$Profit, agg_data$Discount)
print(correlation)
## [1] 0.384205
A correlation coefficient of 0.384 indicates a weak-to-moderate positive correlation between mean profit and mean discount across the seven categories: categories with a higher mean discount tend to have a higher mean profit. Because the coefficient is computed from only seven category averages, it says little about the relationship at the level of individual orders.
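Correlations computed on group means can differ a lot from correlations computed on individual rows. The sketch below uses made-up data, constructed so that the two disagree, to illustrate the comparison; none of the numbers come from the dataset.

```r
# Made-up data: within each category, higher discounts go with lower profit,
# but categories with higher average discounts have higher average profit.
toy <- data.frame(
  Category = rep(c("A", "B", "C"), each = 4),
  Discount = c(0.10, 0.15, 0.20, 0.25,
               0.12, 0.18, 0.24, 0.30,
               0.15, 0.20, 0.30, 0.35),
  Profit   = c(300, 280, 260, 240,
               420, 400, 390, 360,
               520, 510, 480, 470)
)

# Correlation across individual rows.
row_level <- cor(toy$Discount, toy$Profit)

# Correlation across category means, as in the analysis above.
agg <- aggregate(cbind(Profit, Discount) ~ Category, data = toy, FUN = mean)
category_level <- cor(agg$Profit, agg$Discount)

c(row_level = row_level, category_level = category_level)

# cor.test() adds a p-value and confidence interval for the row-level value.
row_test <- cor.test(toy$Discount, toy$Profit)
row_test$p.value
```

Running both versions on `data` would show whether the category-level figure also holds for individual orders.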
avg_discount_by_region <- aggregate(Discount ~ Region, data = data, mean)
print(avg_discount_by_region)
## Region Discount
## 1 Central 0.2287258
## 2 East 0.2276721
## 3 North 0.1200000
## 4 South 0.2267758
## 5 West 0.2247268
Aggregating discounts by region shows that the average discount is nearly identical in Central, East, South, and West (about 0.22 to 0.23), while North stands out with a lower average of 0.12.
install.packages("RColorBrewer")
##
## The downloaded binary packages are in
## /var/folders/pv/kll1prqs39jc2wvhvd31dfmr0000gn/T//Rtmp862tVy/downloaded_packages
install.packages("reshape2")
##
## The downloaded binary packages are in
## /var/folders/pv/kll1prqs39jc2wvhvd31dfmr0000gn/T//Rtmp862tVy/downloaded_packages
The code installs the “RColorBrewer” and “reshape2” packages, which provide color palettes for visualizations and functions for reshaping data, respectively.
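Calling install.packages() unconditionally reinstalls the packages on every run. A possible alternative is to check first which packages are missing. In the sketch below, `missing_packages()` is a hypothetical helper, and the install call is guarded by interactive() so that a non-interactive run only reports what is missing.

```r
# Hypothetical helper: return the packages that cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("RColorBrewer", "reshape2")
to_install <- missing_packages(needed)

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Only install when running interactively, never during an automated run.
  if (interactive()) {
    install.packages(to_install)
  }
}
```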
library(ggplot2)
library(RColorBrewer)
category_colors <- c("Bakery" = "yellow", "Beverages" = "brown", "Eggs, Meat & Fish" = "pink", "Food Grains" = "black", "Fruits & Veggies" = "darkgreen", "Oil & Masala" = "red", "Snacks" = "purple")
ggplot(data, aes(x = Sales, y = Profit, color = Category)) +
geom_point() +
scale_color_manual(values = category_colors) +
labs(title = "Sales vs. Profit by Category",
x = "Sales", y = "Profit")
The scatter plot reveals varying relationships between Sales and Profit across product categories. Further exploration may involve investigating outliers, calculating correlations, and tailoring strategies for specific categories to optimize business performance.
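A named palette only colors the categories whose names match exactly, so a quick consistency check before plotting can catch misspelled names. The sketch below uses a made-up subset of categories and a palette with one deliberately misspelled name.

```r
# Made-up subset of category values.
categories <- c("Bakery", "Beverages", "Snacks", "Bakery")

# Palette with one deliberately misspelled name ("Sancks").
palette <- c("Bakery" = "yellow", "Beverages" = "brown", "Sancks" = "purple")

# Categories present in the data but without a color.
uncolored <- setdiff(unique(categories), names(palette))

# Palette entries that match no category in the data.
unused <- setdiff(names(palette), unique(categories))

uncolored
unused
```

With the real objects, the same check would be `setdiff(unique(data$Category), names(category_colors))`, which should return an empty vector when every category has a color.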
category_colors <- c("Bakery" = "yellow", "Beverages" = "brown", "Eggs, Meat & Fish" = "pink", "Food Grains" = "black", "Fruits & Veggies" = "darkgreen", "Oil & Masala" = "red", "Snacks" = "purple")
ggplot(data, aes(x = Sales, fill = Category)) +
geom_histogram(binwidth = 500, position = "identity", color = "white", alpha = 0.7) +
scale_fill_manual(values = category_colors) +
labs(title = "Distribution of Sales by Category",
x = "Sales", y = "Frequency") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
legend.title = element_text(size = 14),
legend.text = element_text(size = 12),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank()
)
The histogram illustrates the distribution of Sales, categorized by product type. It helps in recognizing sales patterns within each category. Further exploration may involve assessing the skewness of distributions and understanding the impact of category-specific sales variations on overall business performance.
data$OrderDate <- as.Date(data$OrderDate) #taking OrderDate as date variable
category_colors <- c("Bakery" = "yellow", "Beverages" = "brown", "Eggs, Meat & Fish" = "pink", "Food Grains" = "black", "Fruits & Veggies" = "darkgreen", "Oil & Masala" = "red", "Snacks" = "purple")
# Line plot showing the trend of Sales over time
ggplot(data, aes(x = OrderDate, y = Sales, color = Category)) +
geom_line(linewidth = 1.5) +
scale_color_manual(values = category_colors) +
labs(title = "Trend of Sales Over Time by Category",
x = "Order Date", y = "Sales") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
legend.title = element_text(size = 14),
legend.text = element_text(size = 12)
)
## Warning: Removed 5952 rows containing missing values (`geom_line()`).
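The line plot shows Sales over time, with one colored line per category, which helps identify temporal patterns within each category. The warning shows that 5952 of the 9994 rows were dropped because of missing values. Since the summary above reports no missing values for Sales, these most likely come from OrderDate strings that as.Date() could not parse, so the plot covers less than half of the data.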
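Missing dates after as.Date() often mean that the column mixes several date formats. The sketch below shows one way to handle that case and then aggregate sales by month before plotting. The two formats used here (`%m-%d-%Y` and `%m/%d/%Y`) and all values are assumptions for illustration; the formats in the real file should be checked first.

```r
# Made-up order dates in two different formats.
toy <- data.frame(
  OrderDate = c("11-08-2017", "6/12/2017", "10-11-2016", "6/20/2017"),
  Sales     = c(1200, 800, 1500, 950),
  stringsAsFactors = FALSE
)

# Try the first format, then retry the failures with the second one.
parsed <- as.Date(toy$OrderDate, format = "%m-%d-%Y")
needs_retry <- is.na(parsed)
parsed[needs_retry] <- as.Date(toy$OrderDate[needs_retry], format = "%m/%d/%Y")
toy$OrderDate <- parsed

# Count the dates that still could not be parsed.
sum(is.na(toy$OrderDate))

# Aggregate to monthly totals so a line plot has one point per month.
toy$Month <- format(toy$OrderDate, "%Y-%m")
monthly_sales <- aggregate(Sales ~ Month, data = toy, FUN = sum)
monthly_sales
```

Plotting monthly totals instead of individual orders also tends to give a smoother, more readable trend line.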
library(ggplot2)
library(reshape2) # Loading reshape2 for melt function
# Calculate the correlation matrix
correlation_matrix <- cor(data[, c("Sales", "Profit")])
# Reshape the correlation matrix into long format
correlation_data <- melt(correlation_matrix)
# Create a correlation heatmap
ggplot(data = correlation_data, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "violet", high = "purple") +
theme_minimal() +
labs(title = "Correlation Heatmap for Sales and Profit",
x = "Variables", y = "Variables")
The correlation heatmap shows the correlation between Sales and Profit. With only two variables it has four tiles: the diagonal tiles are always 1, and the two off-diagonal tiles both show the single Sales-Profit correlation coefficient, so printing that value directly conveys the same information.
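The long format that the heatmap needs can also be produced in base R, without reshape2. The sketch below, on made-up values, converts a correlation matrix into the same three-column layout (Var1, Var2, value) using as.table() and as.data.frame().

```r
# Made-up values standing in for the Sales and Profit columns.
toy <- data.frame(
  Sales  = c(500, 1000, 1500, 2000, 2500),
  Profit = c(25, 180, 320, 525, 1120)
)

correlation_matrix <- cor(toy)

# as.table() + as.data.frame() yields one row per matrix cell;
# responseName sets the name of the value column.
correlation_long <- as.data.frame(as.table(correlation_matrix),
                                  responseName = "value")
correlation_long

# The single off-diagonal coefficient carries all the information here.
sales_profit_cor <- correlation_matrix["Sales", "Profit"]
sales_profit_cor
```

The resulting `correlation_long` data frame has the same column names that the ggplot() call above expects from melt().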