Applied Analytics Assignment 3

Author

Aman Bajpai (23103620)

Introduction

In the realm of retail and sales, data-driven decision-making has emerged as a cornerstone of strategic planning and operational efficiency. The agility of a business in responding to market trends, consumer behavior, and economic fluctuations is often underpinned by its ability to analyze and interpret complex datasets. This report delves into the sales data from three distinct retail sources—supermarket sales, general retail sales, and retail pricing—to unearth patterns, trends, and correlations that can inform business strategy and marketing effectiveness.

The datasets in question offer a multifaceted view of the sales landscape. The supermarket_sales dataset captures transactional data from a supermarket chain, detailing customer purchases across various product lines. The retail_sales dataset broadens the scope to a more general retail environment, cataloging sales transactions alongside customer demographics. The retail_price dataset shifts the focus to product pricing, providing insights into the pricing strategies and competitive landscape of retail offerings.

Through the integration and analysis of these datasets, this report aims to construct a coherent narrative around sales performance, pricing strategies, and consumer preferences. The synthesis of quantitative and qualitative methodologies—ranging from statistical modeling to text mining—serves as a testament to the multifaceted approach required in today’s analytical ventures.

Dataset Overview

The three datasets—supermarket_sales, retail_sales, and retail_price—are distinct collections of retail data, each capturing unique facets of sales transactions and pricing strategies within the retail industry. Here is an overview of each dataset and an analysis of their similarities:

Supermarket Sales Dataset

The supermarket_sales dataset is a comprehensive collection of transactions recorded at a supermarket chain. This dataset typically includes details like the invoice ID, branch information, city, customer type, and gender of the purchaser. It also details the product line, unit price, quantity of items sold, tax applied, total cost, date and time of the transaction, payment method, cost of goods sold, gross margin percentage, gross income, and customer ratings. This dataset is rich with insights into customer purchase behavior and product performance within the supermarket context.

Retail Sales Dataset

The retail_sales dataset broadens the lens to a general retail environment, possibly encompassing a variety of stores or sales platforms. It captures transaction IDs, dates, customer IDs, demographic information (such as gender and age), product categories, quantities sold, prices per unit, and total amounts spent. Unlike the supermarket_sales dataset, this one may reflect a wider array of products and customer interactions, providing a macroscopic view of retail sales.

Retail Price Dataset

The retail_price dataset focuses on the pricing aspect of the retail industry. It contains information like product IDs, product category names, month and year of sale, quantity sold, total price, freight price, unit price, and a multitude of other attributes related to product specifications, such as product name length, product description length, and quantity of product photos available. This dataset is likely to offer insights into pricing strategies, competitive pricing analysis, and the impact of product presentation on sales.

Similarities Across the Datasets

Despite the unique perspectives each dataset offers, they share several similarities that make them suitable for integrated analysis:

- Product Categories: All three datasets contain a form of product categorization, which allows for cross-dataset analysis of product performance and pricing across similar items.
- Sales Information: Each dataset includes quantitative data on items sold, be it through the unit price, quantity, or total sales figures. This commonality is crucial for performing aggregate sales analyses and understanding the revenue implications.
- Temporal Data: All datasets incorporate time-based elements, whether it is the date of the transaction or the month and year of sale, providing a temporal dimension to analyze sales trends over time.
- Customer and Sales Details: Both the supermarket_sales and retail_sales datasets include customer-centric information, offering a glimpse into who is buying what, which can be invaluable for customer segmentation and targeted marketing strategies.

By integrating these datasets, we can gain a more holistic view of the sales landscape, merging the micro-level details of individual transactions with broader pricing and category trends. The common threads weaving through the datasets facilitate a multi-dimensional analysis, allowing for both granular insights at the transaction level and strategic overviews pertinent to pricing policies and sales effectiveness.

Libraries Used

The analysis uses R and four of its packages: readr for data ingestion, dplyr for data manipulation, ggplot2 for visualization, and lubridate for date handling. With these tools, the report moves beyond basic descriptive statistics toward exploring trends and themes, and it places the findings within the larger framework of retail sales intelligence.

```r
library(readr)
library(dplyr)
library(ggplot2)
library(lubridate)
```

Code Used

The datasets were imported with the read_csv function from readr.

```r
retail_price <- read_csv("C:/Users/bajpa/Downloads/archive (2)/retail_price.csv")
retail_sales <- read_csv("C:/Users/bajpa/Downloads/retail_sales_dataset.csv")
supermarket_sales <- read_csv("C:/Users/bajpa/Downloads/archive (6)/supermarket_sales.csv")
```


To give the product category column the same name in all three datasets, the following code was used:

```r
supermarket_sales <- supermarket_sales %>%
    rename(ProductCategory = `Product line`)

retail_sales <- retail_sales %>%
    rename(ProductCategory = `Product Category`)

retail_price <- retail_price %>%
    rename(ProductCategory = `product_category_name`)
```
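With a common column name in place, the category labels themselves can be compared. The sketch below uses invented label vectors (`labels_a` and `labels_b` are hypothetical, not values taken from the files) to show that exact matching is case-sensitive, so labels that differ only in capitalization do not line up until they are normalized.

```r
# Hypothetical labels, invented for illustration only.
labels_a <- c("Beauty", "Clothing", "Electronics")
labels_b <- c("beauty", "electronics", "toys")

# Exact comparison finds no overlap, because "Beauty" != "beauty".
raw_overlap <- intersect(labels_a, labels_b)

# After lower-casing both sides, the shared labels appear.
norm_overlap <- intersect(tolower(labels_a), tolower(labels_b))

print(raw_overlap)
print(norm_overlap)
```

Running the same comparison on the real ProductCategory columns would show how many categories the datasets actually have in common.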

To standardize the values in these columns, they were converted to lower case, and the three datasets were then merged on ProductCategory:

```r
supermarket_sales$ProductCategory <- tolower(supermarket_sales$ProductCategory)
retail_sales$ProductCategory <- tolower(retail_sales$ProductCategory)
retail_price$ProductCategory <- tolower(retail_price$ProductCategory)

combined_sales <- merge(supermarket_sales, retail_sales, by = "ProductCategory", all = TRUE)

final_combined <- merge(combined_sales, retail_price, by = "ProductCategory", all = TRUE)
```
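Because ProductCategory is not a unique key, `merge` pairs every row of one table with every row of the other that has the same category. The following is a minimal sketch on toy data (the data frames `a` and `b` and their values are hypothetical) of how this affects row counts and sums. Whether it changes the totals in this report depends on how many category labels actually match across the datasets.

```r
# Toy data: hypothetical values, only to illustrate join behaviour.
a <- data.frame(ProductCategory = c("beauty", "beauty", "clothing"),
                Quantity = c(1, 2, 3))
b <- data.frame(ProductCategory = c("beauty", "beauty", "electronics"),
                Quantity = c(10, 20, 30))

# Full outer join on a non-unique key, as in the merge calls above.
merged <- merge(a, b, by = "ProductCategory", all = TRUE)

# "beauty" appears 2 x 2 = 4 times; unmatched categories get NA on one side.
n_rows <- nrow(merged)

# Summing Quantity.y over the merged table counts each "beauty" row of b twice.
sum_merged <- sum(merged$Quantity.y, na.rm = TRUE)
sum_original <- sum(b$Quantity)

print(merged)
print(c(rows = n_rows, merged = sum_merged, original = sum_original))
```

Comparing a column total before and after the merge, as done here with `sum_original` and `sum_merged`, is a quick way to see whether any duplication took place.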


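The following code creates a summary table in which sales quantities are aggregated for each product category and date. The `.y` suffix is added by `merge` to columns that occur in both inputs, so `Date.y` and `Quantity.y` are the date and quantity columns that came from retail_sales. The result is a data frame (**`sales_trends`**) that can be used for further analysis or for plotting sales trends over time: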

```r
sales_trends <- final_combined %>%
    group_by(ProductCategory, Date.y) %>%
    summarise(TotalSales = sum(Quantity.y, na.rm = TRUE)) %>%
    arrange(Date.y)
```
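To make the aggregation step concrete, here is a small sketch on made-up rows (the `toy` data frame is hypothetical and only mimics the shape of the merged table). It also shows one detail worth keeping in mind: with `na.rm = TRUE`, a group whose quantities are all missing gets a total of 0 rather than NA.

```r
library(dplyr)

# Hypothetical rows standing in for the merged table.
toy <- data.frame(
  ProductCategory = c("beauty", "beauty", "clothing", "clothing"),
  Date.y = as.Date(c("2023-01-01", "2023-01-01", "2023-01-01", "2023-01-02")),
  Quantity.y = c(2, 3, NA, 4)
)

# One row per category and date; .groups = "drop" returns an ungrouped result.
toy_trends <- toy %>%
  group_by(ProductCategory, Date.y) %>%
  summarise(TotalSales = sum(Quantity.y, na.rm = TRUE), .groups = "drop") %>%
  arrange(Date.y, ProductCategory)

print(toy_trends)
```

The `.groups = "drop"` argument is an optional addition here; it removes the grouping from the result and avoids the informational message that `summarise` otherwise prints.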

The code below filters the dataset for specific product categories and calculates the total sales for each category and date. The categories I have chosen are beauty, clothing, and electronics:

```r
filtered_data <- final_combined %>%
    filter(ProductCategory %in% c("beauty", "clothing", "electronics")) %>%
    group_by(ProductCategory, Date.y) %>%
    summarise(TotalSales = sum(Quantity.y, na.rm = TRUE)) %>%
    ungroup()
```


To visualize the sales trends of the selected product categories from January 2023 to January 2024, **`ggplot2`** was used with the following code:

```r
ggplot(filtered_data, aes(x = Date.y, y = TotalSales, color = ProductCategory)) +
    geom_line(size = 1) +
    labs(title = "Sales Trends for Selected Product Categories",
         x = "Transaction Date",
         y = "Total Sales",
         color = "Product Category") +
    scale_color_manual(values = c("beauty" = "blue", "clothing" = "red", "electronics" = "green")) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = "bottom")
```

The above code produced the following output:

The graph shows the total sales over time for three distinct product categories: beauty, clothing, and electronics. The x-axis represents the transaction date from January 2023 to January 2024, and the y-axis indicates the total sales.

From the graph, we observe that all three categories exhibit significant fluctuations over the year. These variations could point to seasonal trends, promotional impacts, or buying patterns. Some key observations:

- The beauty category (blue line) shows consistent activity throughout the year with occasional peaks, which could indicate periodic sales promotions or seasonal demand spikes.
- The clothing category (red line) has frequent and pronounced spikes, suggesting that sales in this category may be more event-driven, such as fashion seasons or holidays.
- The electronics category (green line) demonstrates variability similar to clothing, with noticeable peaks that could correspond to new product releases or technological advancements.

It’s important to note that all categories show some degree of overlap in their sales patterns, which could imply that certain external factors affect the sales of all categories similarly. However, the density and amplitude of fluctuations differ from one category to another, which suggests each has its unique sales drivers.
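One way to check whether these fluctuations reflect seasonal patterns rather than day-to-day noise would be to aggregate the daily totals by month. The sketch below is only an illustration under assumed inputs: `toy_daily` is a hypothetical data frame in the same shape as the filtered daily totals, and `floor_date` from lubridate maps each date to the first day of its month.

```r
library(dplyr)
library(lubridate)

# Hypothetical daily totals in the shape of the filtered data.
toy_daily <- data.frame(
  ProductCategory = c("beauty", "beauty", "beauty", "clothing"),
  Date.y = as.Date(c("2023-01-05", "2023-01-20", "2023-02-03", "2023-01-11")),
  TotalSales = c(4, 6, 3, 7)
)

# Round each date down to its month, then sum within category and month.
toy_monthly <- toy_daily %>%
  mutate(Month = floor_date(Date.y, unit = "month")) %>%
  group_by(ProductCategory, Month) %>%
  summarise(MonthlySales = sum(TotalSales), .groups = "drop") %>%
  arrange(ProductCategory, Month)

print(toy_monthly)
```

Plotting `MonthlySales` against `Month` in place of the daily series would give a smoother line per category, which may make seasonal peaks easier to compare.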

Conclusion

Throughout this report, we have embarked on an analytical journey, dissecting complex retail datasets to uncover the subtleties of consumer behavior and the dynamics of sales trends. By leveraging the power of R and its rich ecosystem of packages, we were able to transform, merge, and analyze data to draw meaningful insights that transcend mere numbers on a spreadsheet.

The use of the dplyr package facilitated the cleaning and transformation of our data, enabling us to standardize the product categories across datasets—vital for accurate comparative analysis. The merging of datasets was performed cautiously, ensuring that the integrity of the information was maintained, while also creating a comprehensive pool of data to analyze. Despite initial challenges, such as the absence of a common date column, we adapted our approach to focus on the information available, highlighting the need for flexibility in data analysis.

Our quantitative analysis unveiled the ebbs and flows of sales volumes, with ggplot2 aiding in the visualization of complex data, transforming it into digestible and informative graphics. The line graph depicting sales trends, although initially cluttered, was refined to highlight key categories, allowing for a clearer interpretation of the sales narratives within the beauty, clothing, and electronics sectors.

While the datasets provided a fertile ground for quantitative exploration, they also served as a basis for qualitative analysis. Even though our approach shifted due to data constraints, we demonstrated how thematic insights can be gleaned from product categories, opening the door to understanding the qualitative aspects that influence consumer purchases.

In essence, the code used in this report has served as a bridge between raw data and strategic knowledge. The outputs generated have painted a picture of the retail environment in which these transactions occurred, showcasing patterns that are crucial for making informed business decisions. The insights drawn from these analyses are not merely retrospective reflections but can be the foundation for predictive modeling and future planning.

As we conclude, it is clear that the intersection of data transformation, statistical modeling, and visual storytelling is where true analytical value lies. This report serves as a testament to the power of data analysis in unlocking the stories hidden within numbers, providing a narrative that can guide strategic decision-making in the retail landscape.