library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# dplyr is already attached by library(tidyverse) above
retail <- read.csv("retail_sales_dataset.csv")
head(retail, 5)
##   Transaction.ID       Date Customer.ID Gender Age Product.Category Quantity
## 1              1 2023-11-24     CUST001   Male  34           Beauty        3
## 2              2 2023-02-27     CUST002 Female  26         Clothing        2
## 3              3 2023-01-13     CUST003   Male  50      Electronics        1
## 4              4 2023-05-21     CUST004   Male  37         Clothing        1
## 5              5 2023-05-06     CUST005   Male  30           Beauty        2
##   Price.per.Unit Total.Amount
## 1             50          150
## 2            500         1000
## 3             30           30
## 4            500          500
## 5             50          100
rows <- nrow(retail)
cols <- ncol(retail)
cat("The number of rows in the dataset is", rows, "and the number of columns is", cols)
## The number of rows in the dataset is 1000 and the number of columns is 9
data_type <- class(retail)
data_type
## [1] "data.frame"
summary(retail)
##  Transaction.ID       Date           Customer.ID           Gender
##  Min.   :   1.0   Length:1000        Length:1000        Length:1000
##  1st Qu.: 250.8   Class :character   Class :character   Class :character
##  Median : 500.5   Mode  :character   Mode  :character   Mode  :character
##  Mean   : 500.5
##  3rd Qu.: 750.2
##  Max.   :1000.0
##       Age        Product.Category      Quantity     Price.per.Unit
##  Min.   :18.00   Length:1000        Min.   :1.000   Min.   : 25.0
##  1st Qu.:29.00   Class :character   1st Qu.:1.000   1st Qu.: 30.0
##  Median :42.00   Mode  :character   Median :3.000   Median : 50.0
##  Mean   :41.39                      Mean   :2.514   Mean   :179.9
##  3rd Qu.:53.00                      3rd Qu.:4.000   3rd Qu.:300.0
##  Max.   :64.00                      Max.   :4.000   Max.   :500.0
##   Total.Amount
##  Min.   :  25
##  1st Qu.:  60
##  Median : 135
##  Mean   : 456
##  3rd Qu.: 900
##  Max.   :2000
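The summary above reports `Date`, `Gender`, and `Product.Category` as character columns, so `summary()` can only show their length. A minimal sketch of one way to convert them is given below. It uses a small stand-in data frame named `toy` (a hypothetical name) built from the five rows printed by `head()`, and it assumes the dates are always in `YYYY-MM-DD` form.

```r
# Stand-in for the real data: the five rows shown by head(retail, 5)
toy <- data.frame(
  Date = c("2023-11-24", "2023-02-27", "2023-01-13", "2023-05-21", "2023-05-06"),
  Gender = c("Male", "Female", "Male", "Male", "Male"),
  Product.Category = c("Beauty", "Clothing", "Electronics", "Clothing", "Beauty"),
  stringsAsFactors = FALSE
)

# Parse the ISO-formatted strings into Date objects
toy$Date <- as.Date(toy$Date, format = "%Y-%m-%d")

# Treat the grouping columns as categorical variables
toy$Gender <- factor(toy$Gender)
toy$Product.Category <- factor(toy$Product.Category)

# summary() now reports a date range and per-level counts
str(toy)
summary(toy)
```

The same three assignments could be applied to `retail` itself, provided every date string in the file parses cleanly.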
na_missing <- sum(is.na(retail))
cat("The number of NAs in the dataset is", na_missing)
## The number of NAs in the dataset is 0
table(complete.cases(retail))
##
## TRUE
## 1000
library(visdat)
vis_miss(retail)
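The checks above give a single total for the whole data frame. When a dataset does contain gaps, a per-column breakdown is usually more informative. The sketch below illustrates this on a hypothetical data frame `toy` with two missing values introduced on purpose; it is not the retail data, which has none.

```r
# Hypothetical toy data with two deliberately missing values
toy <- data.frame(
  Age = c(34, NA, 50, 37),
  Gender = c("Male", "Female", NA, "Male")
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Share of rows that have at least one missing value
incomplete_share <- mean(!complete.cases(toy))
incomplete_share
```

Running `colSums(is.na(retail))` on the real data should return zero for every column, in line with the total of 0 reported above.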
library(ggplot2)
ggplot(retail, aes(x = Age)) +
  geom_histogram(color = "red", fill = "pink") +
  labs(title = "Histogram for Age in Retail Data", x = "Age", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histogram shows the distribution of customer ages. The distribution shows no pronounced skew in either direction, which agrees with the summary statistics: the mean age (41.39) is close to the median (42).
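The visual impression of symmetry can be backed by a number. Base R has no built-in skewness function, so the sketch below defines a small helper, `sample_skewness` (a hypothetical name), using the moment-based definition. The `toy_age` vector is made-up, evenly spaced data, not the real ages.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5.
# Values near 0 suggest a symmetric distribution.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Hypothetical ages spread evenly over 18 to 64 (not the real data)
toy_age <- seq(18, 64, by = 2)

# Symmetric input: skewness and mean-minus-median are both zero
sample_skewness(toy_age)
mean(toy_age) - median(toy_age)

# A right-skewed vector gives a positive value
sample_skewness(c(1, 1, 1, 10))
```

Applied to the real data, `sample_skewness(retail$Age)` would be expected to land close to zero if the histogram's impression holds.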
ggplot(retail, aes(x = Product.Category, y = Total.Amount, fill = Product.Category)) +
  geom_boxplot() +
  labs(title = "Distribution of Total Amount Paid per Category", x = "Category", y = "Total Amount ($)") +
  scale_fill_discrete(name = "Product Category")
In the figure above, the median total amount paid in each category is low relative to the maximum of $2,000. Across all categories the median is $135 while the mean is $456, which indicates that most transactions are for small amounts and a smaller number of large transactions pull the mean upward. Outliers appear only in the Clothing category.
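The medians and outliers read off the boxplot can also be computed directly. The sketch below does this on a hypothetical data frame `toy` with made-up amounts, using `boxplot.stats()` and its default 1.5 × IQR rule. Note that `geom_boxplot()` computes its quartiles slightly differently from `boxplot.stats()`, so borderline points may be classified differently in the plot.

```r
# Hypothetical toy transactions (not the real data)
toy <- data.frame(
  Product.Category = rep(c("Beauty", "Clothing"), each = 6),
  Total.Amount = c(25, 30, 50, 60, 100, 150,
                   25, 30, 50, 60, 100, 2000)
)

# Median amount per category
medians <- tapply(toy$Total.Amount, toy$Product.Category, median)
medians

# Values lying more than 1.5 * IQR beyond the hinges, per category
outliers <- tapply(toy$Total.Amount, toy$Product.Category,
                   function(x) boxplot.stats(x)$out)
outliers
```

In this toy example both categories share the same median, and only the 2000 value in Clothing is flagged. The same two `tapply()` calls could be run on `retail$Total.Amount` and `retail$Product.Category`.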
ggplot(retail, aes(x = Product.Category, fill = Gender)) +
  geom_bar(position = "dodge") +
  labs(title = "Male vs Female Transactions per Category", x = "Category", y = "Number of Transactions") +
  geom_label(stat = "count", aes(label = after_stat(count)),
             position = position_dodge(width = 0.9), vjust = 1.2)
In the figure above, women make more Beauty purchases than men. Men make more Clothing and Electronics purchases than women, but only slightly: 3 more transactions in Clothing and 2 more in Electronics. Note that the bars count transactions, not the quantity of items bought.
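The differences quoted above were read from the bar labels, and they can also be computed from a cross-tabulation. The sketch below shows the idea on a hypothetical eight-row data frame `toy`; the values are made up and do not reproduce the real counts.

```r
# Hypothetical toy transactions (not the real data)
toy <- data.frame(
  Product.Category = c("Beauty", "Beauty", "Beauty", "Clothing", "Clothing",
                       "Electronics", "Electronics", "Electronics"),
  Gender = c("Female", "Female", "Male", "Male", "Female",
             "Male", "Male", "Female")
)

# Cross-tabulate transactions: one row per category, one column per gender
counts <- table(toy$Product.Category, toy$Gender)
counts

# Male minus Female count in each category; positive means more men
gap <- counts[, "Male"] - counts[, "Female"]
gap
```

Replacing `toy` with `retail` in the `table()` call would give the counts behind the chart, and `gap` would then show the Clothing and Electronics differences directly.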