In this workshop, we will analyze a supermarket sales dataset to uncover insights about sales performance, customer behavior, and trends. We will cover:
ggplot2
We will start by loading the supermarket_sales.csv
dataset into R. Ensure the file is in your working directory or adjust
the path accordingly.
library(readr)
# Load dataset
sales_data <- read_csv("C:/Users/tural/Downloads/supermarket_sales.csv")
## Rows: 1000 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Invoice ID, Branch, City, Customer type, Gender, Product line, Dat...
## dbl (8): Unit price, Quantity, Tax 5%, Total, cogs, gross margin percentage...
## time (1): Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display first few rows of the dataset
head(sales_data)
## # A tibble: 6 × 17
## `Invoice ID` Branch City `Customer type` Gender `Product line` `Unit price`
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 750-67-8428 A Yangon Member Female Health and be… 74.7
## 2 226-31-3081 C Naypyi… Normal Female Electronic ac… 15.3
## 3 631-41-3108 A Yangon Normal Male Home and life… 46.3
## 4 123-19-1176 A Yangon Member Male Health and be… 58.2
## 5 373-73-7910 A Yangon Normal Male Sports and tr… 86.3
## 6 699-14-3026 C Naypyi… Normal Male Electronic ac… 85.4
## # ℹ 10 more variables: Quantity <dbl>, `Tax 5%` <dbl>, Total <dbl>, Date <chr>,
## # Time <time>, Payment <chr>, cogs <dbl>, `gross margin percentage` <dbl>,
## # `gross income` <dbl>, Rating <dbl>
Dataset Overview:
The dataset contains 1,000 observations and 17 variables. Key
variables include: - City
: Location of the supermarket -
Product.line
: Categories of products - Total
:
Total sales amount - Rating
: Customer rating of the
service
Before analysis, we will clean and prepare the data. This step ensures accurate and meaningful results.
# Check for missing values
colSums(is.na(sales_data))
## Invoice ID Branch City
## 0 0 0
## Customer type Gender Product line
## 0 0 0
## Unit price Quantity Tax 5%
## 0 0 0
## Total Date Time
## 0 0 0
## Payment cogs gross margin percentage
## 0 0 0
## gross income Rating
## 0 0
If any missing values are found, handle them accordingly (e.g., by removing rows or imputing values).
Convert the Date
column to a proper date format and the
Time
column to a time format.
# Convert Date to Date format
sales_data$Date <- as.Date(sales_data$Date, format = "%m/%d/%Y")
# Convert Time to POSIXct format
sales_data$Time <- as.POSIXct(sales_data$Time, format = "%H:%M")
# Verify changes
str(sales_data)
## spc_tbl_ [1,000 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Invoice ID : chr [1:1000] "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
## $ Branch : chr [1:1000] "A" "C" "A" "A" ...
## $ City : chr [1:1000] "Yangon" "Naypyitaw" "Yangon" "Yangon" ...
## $ Customer type : chr [1:1000] "Member" "Normal" "Normal" "Member" ...
## $ Gender : chr [1:1000] "Female" "Female" "Male" "Male" ...
## $ Product line : chr [1:1000] "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
## $ Unit price : num [1:1000] 74.7 15.3 46.3 58.2 86.3 ...
## $ Quantity : num [1:1000] 7 5 7 8 7 7 6 10 2 3 ...
## $ Tax 5% : num [1:1000] 26.14 3.82 16.22 23.29 30.21 ...
## $ Total : num [1:1000] 549 80.2 340.5 489 634.4 ...
## $ Date : Date[1:1000], format: "2019-01-05" "2019-03-08" ...
## $ Time : POSIXct[1:1000], format: "1970-01-01 13:08:00" "1970-01-01 10:29:00" ...
## $ Payment : chr [1:1000] "Ewallet" "Cash" "Credit card" "Ewallet" ...
## $ cogs : num [1:1000] 522.8 76.4 324.3 465.8 604.2 ...
## $ gross margin percentage: num [1:1000] 4.76 4.76 4.76 4.76 4.76 ...
## $ gross income : num [1:1000] 26.14 3.82 16.22 23.29 30.21 ...
## $ Rating : num [1:1000] 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
## - attr(*, "spec")=
## .. cols(
## .. `Invoice ID` = col_character(),
## .. Branch = col_character(),
## .. City = col_character(),
## .. `Customer type` = col_character(),
## .. Gender = col_character(),
## .. `Product line` = col_character(),
## .. `Unit price` = col_double(),
## .. Quantity = col_double(),
## .. `Tax 5%` = col_double(),
## .. Total = col_double(),
## .. Date = col_character(),
## .. Time = col_time(format = ""),
## .. Payment = col_character(),
## .. cogs = col_double(),
## .. `gross margin percentage` = col_double(),
## .. `gross income` = col_double(),
## .. Rating = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Get a quick summary of the dataset to understand the range and distribution of values.
# Summary of the dataset
summary(sales_data)
## Invoice ID Branch City Customer type
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Gender Product line Unit price Quantity
## Length:1000 Length:1000 Min. :10.08 Min. : 1.00
## Class :character Class :character 1st Qu.:32.88 1st Qu.: 3.00
## Mode :character Mode :character Median :55.23 Median : 5.00
## Mean :55.67 Mean : 5.51
## 3rd Qu.:77.94 3rd Qu.: 8.00
## Max. :99.96 Max. :10.00
## Tax 5% Total Date
## Min. : 0.5085 Min. : 10.68 Min. :2019-01-01
## 1st Qu.: 5.9249 1st Qu.: 124.42 1st Qu.:2019-01-24
## Median :12.0880 Median : 253.85 Median :2019-02-13
## Mean :15.3794 Mean : 322.97 Mean :2019-02-14
## 3rd Qu.:22.4453 3rd Qu.: 471.35 3rd Qu.:2019-03-08
## Max. :49.6500 Max. :1042.65 Max. :2019-03-30
## Time Payment cogs
## Min. :1970-01-01 10:00:00.00 Length:1000 Min. : 10.17
## 1st Qu.:1970-01-01 12:43:00.00 Class :character 1st Qu.:118.50
## Median :1970-01-01 15:19:00.00 Mode :character Median :241.76
## Mean :1970-01-01 15:24:41.87 Mean :307.59
## 3rd Qu.:1970-01-01 18:15:00.00 3rd Qu.:448.90
## Max. :1970-01-01 20:59:00.00 Max. :993.00
## gross margin percentage gross income Rating
## Min. :4.762 Min. : 0.5085 Min. : 4.000
## 1st Qu.:4.762 1st Qu.: 5.9249 1st Qu.: 5.500
## Median :4.762 Median :12.0880 Median : 7.000
## Mean :4.762 Mean :15.3794 Mean : 6.973
## 3rd Qu.:4.762 3rd Qu.:22.4453 3rd Qu.: 8.500
## Max. :4.762 Max. :49.6500 Max. :10.000
Analyze total sales for each city.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Total sales by city
sales_by_city <- sales_data %>%
group_by(City) %>%
summarise(TotalSales = sum(Total))
sales_by_city
## # A tibble: 3 × 2
## City TotalSales
## <chr> <dbl>
## 1 Mandalay 106198.
## 2 Naypyitaw 110569.
## 3 Yangon 106200.
We will use the ggplot2
package to create visualizations
that highlight key trends in the data.
library(ggplot2)
# Bar chart of total sales by city
ggplot(sales_by_city, aes(x = City, y = TotalSales, fill = City)) +
geom_bar(stat = "identity") +
labs(title = "Total Sales by City", x = "City", y = "Total Sales")
# Box plot of ratings by product line
ggplot(sales_data, aes(x = `Product line`, y = Rating, fill = `Product line`)) +
geom_boxplot() +
labs(title = "Ratings by Product Line", x = "Product Line", y = "Rating") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Calculate payment method distribution
payment_distribution <- sales_data %>%
group_by(Payment) %>%
summarise(Count = n())
# Pie chart of payment method distribution
ggplot(payment_distribution, aes(x = "", y = Count, fill = Payment)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y") +
labs(title = "Payment Method Distribution")
library(dplyr)
names(sales_data)
## [1] "Invoice ID" "Branch"
## [3] "City" "Customer type"
## [5] "Gender" "Product line"
## [7] "Unit price" "Quantity"
## [9] "Tax 5%" "Total"
## [11] "Date" "Time"
## [13] "Payment" "cogs"
## [15] "gross margin percentage" "gross income"
## [17] "Rating"
# Standardize column names
sales_data <- sales_data %>%
rename(Customer.type = 'Customer type')
# Retry the analysis
avg_spending <- sales_data %>%
group_by(Customer.type) %>%
summarise(AverageSpending = mean(Total, na.rm = TRUE))
# Total sales by product line
sales_by_product <- sales_data %>%
group_by(`Product line`) %>%
summarise(TotalSales = sum(Total))
sales_by_product
## # A tibble: 6 × 2
## `Product line` TotalSales
## <chr> <dbl>
## 1 Electronic accessories 54338.
## 2 Fashion accessories 54306.
## 3 Food and beverages 56145.
## 4 Health and beauty 49194.
## 5 Home and lifestyle 53862.
## 6 Sports and travel 55123.
# Aggregate total sales by date
sales_by_date <- sales_data %>%
group_by(Date) %>%
summarise(TotalSales = sum(Total))
# Time series plot
ggplot(sales_by_date, aes(x = Date, y = TotalSales)) +
geom_line(color = "blue") +
labs(title = "Total Sales Over Time", x = "Date", y = "Total Sales")
In this workshop, we explored a supermarket sales dataset to uncover insights using R. Students are encouraged to try additional analyses and visualizations to deepen their understanding of the dataset.