Homework 1

Dataset Information

The dataset was obtained from Kaggle, where it was uploaded by Mohammad Talib. The page lists no release date, but shows that the dataset was last updated 2 years ago.
Dataset URL: https://www.kaggle.com/datasets/mohammadtalib786/retail-sales-dataset/data

Importing the data and displaying the first 5 rows:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# dplyr is already attached by library(tidyverse), so it is not loaded again
retail <- read.csv("retail_sales_dataset.csv")
head(retail, 5)

##   Transaction.ID       Date Customer.ID Gender Age Product.Category Quantity
## 1              1 2023-11-24     CUST001   Male  34           Beauty        3
## 2              2 2023-02-27     CUST002 Female  26         Clothing        2
## 3              3 2023-01-13     CUST003   Male  50      Electronics        1
## 4              4 2023-05-21     CUST004   Male  37         Clothing        1
## 5              5 2023-05-06     CUST005   Male  30           Beauty        2
##   Price.per.Unit Total.Amount
## 1             50          150
## 2            500         1000
## 3             30           30
## 4            500          500
## 5             50          100

Data Quality

Data Size and Data Types

rows <- nrow(retail)
cols <- ncol(retail)
cat("The dataset has", rows, "rows and", cols, "columns.\n")

## The dataset has 1000 rows and 9 columns.

# class() reports the type of the object as a whole, not of its columns
datatype <- class(retail)
datatype

## [1] "data.frame"
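
Because class(retail) only describes the object as a whole, the column types are worth checking separately. The following is a minimal sketch, assuming a small stand-in data frame (retail_sample, a hypothetical name) rebuilt from a few columns of the five rows printed above; on the full data the same calls would be applied to retail. It also converts Date, which summary() reports as character, to a proper Date.

```r
# Stand-in for the real data: a few columns of the first five rows shown by head(retail, 5)
retail_sample <- data.frame(
  Transaction.ID = 1:5,
  Date           = c("2023-11-24", "2023-02-27", "2023-01-13", "2023-05-21", "2023-05-06"),
  Gender         = c("Male", "Female", "Male", "Male", "Male"),
  Age            = c(34L, 26L, 50L, 37L, 30L),
  stringsAsFactors = FALSE
)

# One class per column, returned as a named character vector
column_types <- sapply(retail_sample, class)
print(column_types)

# Dates in YYYY-MM-DD form are parsed by as.Date() without a format string
retail_sample$Date <- as.Date(retail_sample$Date)
print(class(retail_sample$Date))
```

str(retail) gives a similar overview in one call, together with the first few values of each column.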

Descriptive Statistics

summary(retail)

##  Transaction.ID       Date           Customer.ID           Gender         
##  Min.   :   1.0   Length:1000        Length:1000        Length:1000       
##  1st Qu.: 250.8   Class :character   Class :character   Class :character  
##  Median : 500.5   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 500.5                                                           
##  3rd Qu.: 750.2                                                           
##  Max.   :1000.0                                                           
##       Age        Product.Category      Quantity     Price.per.Unit 
##  Min.   :18.00   Length:1000        Min.   :1.000   Min.   : 25.0  
##  1st Qu.:29.00   Class :character   1st Qu.:1.000   1st Qu.: 30.0  
##  Median :42.00   Mode  :character   Median :3.000   Median : 50.0  
##  Mean   :41.39                      Mean   :2.514   Mean   :179.9  
##  3rd Qu.:53.00                      3rd Qu.:4.000   3rd Qu.:300.0  
##  Max.   :64.00                      Max.   :4.000   Max.   :500.0  
##   Total.Amount 
##  Min.   :  25  
##  1st Qu.:  60  
##  Median : 135  
##  Mean   : 456  
##  3rd Qu.: 900  
##  Max.   :2000
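
One further quality check is whether the amount columns agree with each other. The sketch below assumes that Total.Amount is meant to equal Quantity multiplied by Price.per.Unit; this holds in the five rows printed above but is an assumption, not something the dataset description states. The data frame retail_sample is a hypothetical stand-in built from those five rows.

```r
# Stand-in for the real data: amount columns of the first five rows shown by head(retail, 5)
retail_sample <- data.frame(
  Quantity       = c(3, 2, 1, 1, 2),
  Price.per.Unit = c(50, 500, 30, 500, 50),
  Total.Amount   = c(150, 1000, 30, 500, 100)
)

# Assumed relationship: Total.Amount = Quantity * Price.per.Unit
expected_total <- retail_sample$Quantity * retail_sample$Price.per.Unit

# Indices of rows that violate the assumed relationship
mismatched_rows <- which(retail_sample$Total.Amount != expected_total)
cat("Rows where Total.Amount differs from Quantity * Price.per.Unit:",
    length(mismatched_rows), "\n")
```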

Missing Data

na_missing <- sum(is.na(retail))
cat("The number of NAs in the dataset is", na_missing)

## The number of NAs in the dataset is 0

table(complete.cases(retail))

## 
## TRUE 
## 1000

library(visdat)
vis_miss(retail)
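
The checks above count NA values only. With read.csv() defaults, an empty field in a character column is read as an empty string rather than NA, so it would not be counted. The sketch below shows a per-column NA count alongside a blank-string count, assuming a small made-up data frame (toy, purely illustrative and not taken from the dataset).

```r
# Illustrative toy data (not from the retail dataset): one NA and one blank string
toy <- data.frame(
  Gender = c("Male", "", "Female"),
  Age    = c(34, NA, 50),
  stringsAsFactors = FALSE
)

# NA count per column
na_per_column <- colSums(is.na(toy))

# Blank or whitespace-only strings per character column (0 for other column types)
blank_per_column <- sapply(toy, function(col) {
  if (!is.character(col)) return(0L)
  sum(!is.na(col) & trimws(col) == "")
})

print(na_per_column)
print(blank_per_column)
```

Applied to retail, both vectors being all zeros would support the conclusion that no values are missing.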

Data Visualization

One-Dimensional Visualization

# ggplot2 is already attached by library(tidyverse)
# binwidth is set explicitly so that each bar covers a fixed 5-year age range
ggplot(retail, aes(x = Age)) +
  geom_histogram(binwidth = 5, color = "red", fill = "pink") +
  labs(title = "Histogram for Age in Retail Data", x = "Age", y = "Count")

The histogram shows the distribution of customer ages. There is no clear skew in either direction, which is consistent with the mean (41.39) and median (42) of Age in the summary above being nearly equal.
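
The visual impression of symmetry can be backed by a number. The sketch below is one possible check, assuming a hand-written moment-based skewness helper (sample_skewness, a hypothetical name, not a base R function) and using only the five ages printed by head() above as stand-in input; on the full data the argument would be retail$Age.

```r
# Stand-in input: ages from the first five rows shown by head(retail, 5)
ages_sample <- c(34, 26, 50, 37, 30)

# Hypothetical helper: moment-based skewness (third central moment over variance^1.5)
# Values near 0 suggest symmetry; positive values indicate a longer right tail
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_value <- sample_skewness(ages_sample)

# A mean far above the median is another quick sign of right skew
mean_minus_median <- mean(ages_sample) - median(ages_sample)

print(skew_value)
print(mean_minus_median)
```

Five rows are far too few to say anything about the full distribution; the point is only the shape of the calculation.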

Comparison Visualization

ggplot(retail, aes(x=Product.Category, y=Total.Amount, fill=Product.Category)) + geom_boxplot()+
  labs(title="Distribution of Amounts Paid per Category", x="Categories", y="Total Amounts ($)")+
  scale_fill_discrete(name="Product Categories")

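In the figure above, the median amount paid in each category is low relative to the overall range of the data (across all categories, the summary above gives a median of 135 against a maximum of 2000). This indicates that most transactions are for smaller amounts, with a smaller number of large purchases stretching the upper end. The only outliers appear in the Clothing category.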
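
The medians and outliers read off the boxplot can also be computed directly. The sketch below assumes a stand-in data frame (retail_sample, a hypothetical name) built from the five rows printed by head() above, and a small wrapper (find_outliers, also hypothetical) around base R's boxplot.stats(). Note that boxplot.stats() works from Tukey's hinges, so the points it flags may differ slightly from those drawn by geom_boxplot().

```r
# Stand-in for the real data: first five rows shown by head(retail, 5)
retail_sample <- data.frame(
  Product.Category = c("Beauty", "Clothing", "Electronics", "Clothing", "Beauty"),
  Total.Amount     = c(150, 1000, 30, 500, 100)
)

# Median amount per category
category_medians <- tapply(retail_sample$Total.Amount,
                           retail_sample$Product.Category, median)

# Hypothetical helper: values more than 1.5 * IQR beyond the hinges
find_outliers <- function(x) boxplot.stats(x)$out

# One (possibly empty) vector of outliers per category
outliers_by_category <- tapply(retail_sample$Total.Amount,
                               retail_sample$Product.Category, find_outliers)

print(category_medians)
print(outliers_by_category)
```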

Two-Dimensional Visualization

ggplot(retail, aes(x=Product.Category, fill = Gender)) + geom_bar(position = "dodge")+
  labs(title="Male vs Female per Category", x="Categories", y="Number of Transactions")+
  geom_label(stat = "count", aes(label = after_stat(count)), 
            position = position_dodge(width = 0.9), vjust = 1.2)

The figure above shows that women made more Beauty purchases than men. Men made slightly more Clothing and Electronics purchases than women, but the gaps are small: 3 more transactions in Clothing and 2 more in Electronics.
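
The counts behind the bar labels can be reproduced with a cross-tabulation. The sketch below assumes a stand-in data frame (retail_sample, a hypothetical name) built from the five rows printed by head() above; on the full data the same calls would use retail.

```r
# Stand-in for the real data: first five rows shown by head(retail, 5)
retail_sample <- data.frame(
  Gender           = c("Male", "Female", "Male", "Male", "Male"),
  Product.Category = c("Beauty", "Clothing", "Electronics", "Clothing", "Beauty")
)

# Transactions per category (rows) and gender (columns)
counts <- table(retail_sample$Product.Category, retail_sample$Gender)

# Positive values mean more transactions by men than by women in that category
male_minus_female <- counts[, "Male"] - counts[, "Female"]

print(counts)
print(male_minus_female)
```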