Introduction

In this workshop, we will analyze a supermarket sales dataset to uncover insights about sales performance, customer behavior, and trends. We will cover:

  1. Data loading and cleaning
  2. Exploratory data analysis (EDA)
  3. Data visualization with ggplot2
  4. Insights generation through data analysis

Load the Dataset

We will start by loading the supermarket_sales.csv dataset into R. Ensure the file is in your working directory or adjust the path accordingly.

library(readr)

# Load dataset
sales_data <- read_csv("C:/Users/tural/Downloads/supermarket_sales.csv")
## Rows: 1000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): Invoice ID, Branch, City, Customer type, Gender, Product line, Dat...
## dbl  (8): Unit price, Quantity, Tax 5%, Total, cogs, gross margin percentage...
## time (1): Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display first few rows of the dataset
head(sales_data)
## # A tibble: 6 × 17
##   `Invoice ID` Branch City    `Customer type` Gender `Product line` `Unit price`
##   <chr>        <chr>  <chr>   <chr>           <chr>  <chr>                 <dbl>
## 1 750-67-8428  A      Yangon  Member          Female Health and be…         74.7
## 2 226-31-3081  C      Naypyi… Normal          Female Electronic ac…         15.3
## 3 631-41-3108  A      Yangon  Normal          Male   Home and life…         46.3
## 4 123-19-1176  A      Yangon  Member          Male   Health and be…         58.2
## 5 373-73-7910  A      Yangon  Normal          Male   Sports and tr…         86.3
## 6 699-14-3026  C      Naypyi… Normal          Male   Electronic ac…         85.4
## # ℹ 10 more variables: Quantity <dbl>, `Tax 5%` <dbl>, Total <dbl>, Date <chr>,
## #   Time <time>, Payment <chr>, cogs <dbl>, `gross margin percentage` <dbl>,
## #   `gross income` <dbl>, Rating <dbl>

Dataset Overview:

The dataset contains 1,000 observations and 17 variables. Key variables include: - City: Location of the supermarket - Product.line: Categories of products - Total: Total sales amount - Rating: Customer rating of the service


Data Cleaning and Preparation

Before analysis, we will clean and prepare the data. This step ensures accurate and meaningful results.

Check for Missing Values

# Check for missing values
colSums(is.na(sales_data))
##              Invoice ID                  Branch                    City 
##                       0                       0                       0 
##           Customer type                  Gender            Product line 
##                       0                       0                       0 
##              Unit price                Quantity                  Tax 5% 
##                       0                       0                       0 
##                   Total                    Date                    Time 
##                       0                       0                       0 
##                 Payment                    cogs gross margin percentage 
##                       0                       0                       0 
##            gross income                  Rating 
##                       0                       0

If any missing values are found, handle them accordingly (e.g., by removing rows or imputing values).

Convert Date and Time

Convert the Date column to a proper date format and the Time column to a time format.

# Convert Date to Date format
sales_data$Date <- as.Date(sales_data$Date, format = "%m/%d/%Y")

# Convert Time to POSIXct format
sales_data$Time <- as.POSIXct(sales_data$Time, format = "%H:%M")

# Verify changes
str(sales_data)
## spc_tbl_ [1,000 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Invoice ID             : chr [1:1000] "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
##  $ Branch                 : chr [1:1000] "A" "C" "A" "A" ...
##  $ City                   : chr [1:1000] "Yangon" "Naypyitaw" "Yangon" "Yangon" ...
##  $ Customer type          : chr [1:1000] "Member" "Normal" "Normal" "Member" ...
##  $ Gender                 : chr [1:1000] "Female" "Female" "Male" "Male" ...
##  $ Product line           : chr [1:1000] "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
##  $ Unit price             : num [1:1000] 74.7 15.3 46.3 58.2 86.3 ...
##  $ Quantity               : num [1:1000] 7 5 7 8 7 7 6 10 2 3 ...
##  $ Tax 5%                 : num [1:1000] 26.14 3.82 16.22 23.29 30.21 ...
##  $ Total                  : num [1:1000] 549 80.2 340.5 489 634.4 ...
##  $ Date                   : Date[1:1000], format: "2019-01-05" "2019-03-08" ...
##  $ Time                   : POSIXct[1:1000], format: "1970-01-01 13:08:00" "1970-01-01 10:29:00" ...
##  $ Payment                : chr [1:1000] "Ewallet" "Cash" "Credit card" "Ewallet" ...
##  $ cogs                   : num [1:1000] 522.8 76.4 324.3 465.8 604.2 ...
##  $ gross margin percentage: num [1:1000] 4.76 4.76 4.76 4.76 4.76 ...
##  $ gross income           : num [1:1000] 26.14 3.82 16.22 23.29 30.21 ...
##  $ Rating                 : num [1:1000] 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Invoice ID` = col_character(),
##   ..   Branch = col_character(),
##   ..   City = col_character(),
##   ..   `Customer type` = col_character(),
##   ..   Gender = col_character(),
##   ..   `Product line` = col_character(),
##   ..   `Unit price` = col_double(),
##   ..   Quantity = col_double(),
##   ..   `Tax 5%` = col_double(),
##   ..   Total = col_double(),
##   ..   Date = col_character(),
##   ..   Time = col_time(format = ""),
##   ..   Payment = col_character(),
##   ..   cogs = col_double(),
##   ..   `gross margin percentage` = col_double(),
##   ..   `gross income` = col_double(),
##   ..   Rating = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Exploratory Data Analysis (EDA)

Summary Statistics

Get a quick summary of the dataset to understand the range and distribution of values.

# Summary of the dataset
summary(sales_data)
##   Invoice ID           Branch              City           Customer type     
##  Length:1000        Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     Gender          Product line         Unit price       Quantity    
##  Length:1000        Length:1000        Min.   :10.08   Min.   : 1.00  
##  Class :character   Class :character   1st Qu.:32.88   1st Qu.: 3.00  
##  Mode  :character   Mode  :character   Median :55.23   Median : 5.00  
##                                        Mean   :55.67   Mean   : 5.51  
##                                        3rd Qu.:77.94   3rd Qu.: 8.00  
##                                        Max.   :99.96   Max.   :10.00  
##      Tax 5%            Total              Date           
##  Min.   : 0.5085   Min.   :  10.68   Min.   :2019-01-01  
##  1st Qu.: 5.9249   1st Qu.: 124.42   1st Qu.:2019-01-24  
##  Median :12.0880   Median : 253.85   Median :2019-02-13  
##  Mean   :15.3794   Mean   : 322.97   Mean   :2019-02-14  
##  3rd Qu.:22.4453   3rd Qu.: 471.35   3rd Qu.:2019-03-08  
##  Max.   :49.6500   Max.   :1042.65   Max.   :2019-03-30  
##       Time                          Payment               cogs       
##  Min.   :1970-01-01 10:00:00.00   Length:1000        Min.   : 10.17  
##  1st Qu.:1970-01-01 12:43:00.00   Class :character   1st Qu.:118.50  
##  Median :1970-01-01 15:19:00.00   Mode  :character   Median :241.76  
##  Mean   :1970-01-01 15:24:41.87                      Mean   :307.59  
##  3rd Qu.:1970-01-01 18:15:00.00                      3rd Qu.:448.90  
##  Max.   :1970-01-01 20:59:00.00                      Max.   :993.00  
##  gross margin percentage  gross income         Rating      
##  Min.   :4.762           Min.   : 0.5085   Min.   : 4.000  
##  1st Qu.:4.762           1st Qu.: 5.9249   1st Qu.: 5.500  
##  Median :4.762           Median :12.0880   Median : 7.000  
##  Mean   :4.762           Mean   :15.3794   Mean   : 6.973  
##  3rd Qu.:4.762           3rd Qu.:22.4453   3rd Qu.: 8.500  
##  Max.   :4.762           Max.   :49.6500   Max.   :10.000

Total Sales by City

Analyze total sales for each city.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Total sales by city
sales_by_city <- sales_data %>%
  group_by(City) %>%
  summarise(TotalSales = sum(Total))

sales_by_city
## # A tibble: 3 × 2
##   City      TotalSales
##   <chr>          <dbl>
## 1 Mandalay     106198.
## 2 Naypyitaw    110569.
## 3 Yangon       106200.

Data Visualization

We will use the ggplot2 package to create visualizations that highlight key trends in the data.

Sales by City

library(ggplot2)

# Bar chart of total sales by city
ggplot(sales_by_city, aes(x = City, y = TotalSales, fill = City)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Sales by City", x = "City", y = "Total Sales")

Ratings by Product Line

# Box plot of ratings by product line
ggplot(sales_data, aes(x = `Product line`, y = Rating, fill = `Product line`)) +
  geom_boxplot() +
  labs(title = "Ratings by Product Line", x = "Product Line", y = "Rating") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Payment Method Distribution

# Calculate payment method distribution
payment_distribution <- sales_data %>%
  group_by(Payment) %>%
  summarise(Count = n())

# Pie chart of payment method distribution
ggplot(payment_distribution, aes(x = "", y = Count, fill = Payment)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  labs(title = "Payment Method Distribution")


Student Challenge

Questions for Analysis

  1. Which customer type (Member or Normal) spends more on average?
  2. Which product line generates the highest total sales?
  3. Create a time series plot of total sales over time.

Hints and Code Snippets

Average Spending by Customer Type

library(dplyr)
names(sales_data)
##  [1] "Invoice ID"              "Branch"                 
##  [3] "City"                    "Customer type"          
##  [5] "Gender"                  "Product line"           
##  [7] "Unit price"              "Quantity"               
##  [9] "Tax 5%"                  "Total"                  
## [11] "Date"                    "Time"                   
## [13] "Payment"                 "cogs"                   
## [15] "gross margin percentage" "gross income"           
## [17] "Rating"
# Standardize column names
sales_data <- sales_data %>%
  rename(Customer.type = 'Customer type')

# Retry the analysis
avg_spending <- sales_data %>%
  group_by(Customer.type) %>%
  summarise(AverageSpending = mean(Total, na.rm = TRUE))

Total Sales by Product Line

# Total sales by product line
sales_by_product <- sales_data %>%
  group_by(`Product line`) %>%
  summarise(TotalSales = sum(Total))

sales_by_product
## # A tibble: 6 × 2
##   `Product line`         TotalSales
##   <chr>                       <dbl>
## 1 Electronic accessories     54338.
## 2 Fashion accessories        54306.
## 3 Food and beverages         56145.
## 4 Health and beauty          49194.
## 5 Home and lifestyle         53862.
## 6 Sports and travel          55123.

Time Series Plot of Total Sales

# Aggregate total sales by date
sales_by_date <- sales_data %>%
  group_by(Date) %>%
  summarise(TotalSales = sum(Total))

# Time series plot
ggplot(sales_by_date, aes(x = Date, y = TotalSales)) +
  geom_line(color = "blue") +
  labs(title = "Total Sales Over Time", x = "Date", y = "Total Sales")


Conclusion

In this workshop, we explored a supermarket sales dataset to uncover insights using R. Students are encouraged to try additional analyses and visualizations to deepen their understanding of the dataset.