Week 4 Data Dive

Importing libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)

Importing data

df <- read.csv('Auto Sales data.csv')

Generating Random (Sub)samples

## df_1, df_2, df_3: random samples of 1400 rows each, drawn with replacement.
df_1 <- sample_n(df, 1400, replace = TRUE)
df_2 <- sample_n(df, 1400, replace = TRUE)
df_3 <- sample_n(df, 1400, replace = TRUE)
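
Because no seed is set, every knit draws different rows, so the numbers discussed later can shift from run to run. Here is a minimal sketch of how a seed makes the draw repeatable. It assumes a made-up stand-in data frame (`toy_df`, not the auto sales data) and an arbitrary seed value.

```r
library(dplyr)

# Hypothetical stand-in for the auto sales data (2000 made-up rows)
toy_df <- data.frame(ID = 1:2000, SALES = seq(500, 10495, by = 5))

set.seed(42)  # arbitrary seed, chosen only for illustration
s1 <- sample_n(toy_df, 1400, replace = TRUE)

set.seed(42)  # resetting the same seed repeats the same draw
s2 <- sample_n(toy_df, 1400, replace = TRUE)

identical(s1, s2)

# With replace = TRUE some rows are drawn more than once,
# so the number of distinct rows is smaller than 1400
n_distinct(s1$ID)
```

Setting a different seed before each of the three draws would keep the samples distinct from each other while still making the whole report repeatable.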

Differentiating the (Sub)Samples

A Boxplot of each (sub)sample’s SALES column

box_df <- data.frame(Sales_1 = df_1$SALES, Sales_2 = df_2$SALES, Sales_3 = df_3$SALES)

box_df |>
  pivot_longer(everything(), values_to="Value", names_to="Variable") |>
  ggplot() +
  geom_boxplot(
    aes(x=Variable, y=Value)
  )

The medians and the 25th/75th percentiles aren’t too different from each other.
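
To check this comparison with numbers rather than by eye, the same long-format table can be summarised per sample. The centre line of a boxplot is the median, so the sketch reports the median alongside the mean. It assumes a made-up data frame (`toy_box`) shaped like `box_df`; the values are simulated, not the real SALES figures.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # arbitrary seed so the simulated values are repeatable

# Hypothetical stand-in for box_df: three columns of simulated sales values
toy_box <- data.frame(
  Sales_1 = rlnorm(1400, meanlog = 8, sdlog = 0.5),
  Sales_2 = rlnorm(1400, meanlog = 8, sdlog = 0.5),
  Sales_3 = rlnorm(1400, meanlog = 8, sdlog = 0.5)
)

toy_summary <- toy_box |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Value") |>
  group_by(Variable) |>
  summarise(
    q25    = quantile(Value, 0.25, names = FALSE),  # lower edge of the box
    median = median(Value),                         # line inside the box
    q75    = quantile(Value, 0.75, names = FALSE),  # upper edge of the box
    mean   = mean(Value),                           # not drawn by geom_boxplot()
    max    = max(Value)
  )

toy_summary
```

Swapping `toy_box` for `box_df` should give one row per sample with the quartiles, mean, and maximum side by side.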

The most significant difference is in the outliers! The second random sample also has the lowest maximum value. Maybe the first and third random samples pulled the same entry as their top outlier.

I’d still count all the dots as anomalies, since each one lies more than 1.5 × IQR beyond the edge of the box (the default rule geom_boxplot uses to draw points).
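
The dots in a boxplot can also be listed directly. The sketch below applies the usual 1.5 × IQR rule by hand to a short vector of made-up values (not taken from the auto sales data). The result should closely match what the plot draws as points.

```r
# Hypothetical values, chosen only to illustrate the rule
sales <- c(1200, 1500, 1800, 2100, 2400, 2700, 3000, 3300, 3600, 12000)

# First and third quartiles (the edges of the box)
q <- quantile(sales, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Fences: anything outside these limits is drawn as a dot
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

outliers <- sales[sales < lower | sales > upper]
outliers
```

Running the same steps on `df_1$SALES`, `df_2$SALES`, and `df_3$SALES` would show whether two samples share the same extreme entry.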

Accessing some qualitative columns

df_1_group <- df_1 |> 
  group_by(STATUS) |>
  count(STATUS, name = "NUMBER")
df_1_group <- df_1_group[order(df_1_group$STATUS),]
df_1_group
## # A tibble: 6 × 2
## # Groups:   STATUS [6]
##   STATUS     NUMBER
##   <chr>       <int>
## 1 Cancelled      30
## 2 Disputed        6
## 3 In Process     15
## 4 On Hold        17
## 5 Resolved       23
## 6 Shipped      1309
df_2_group <- df_2 |> 
  group_by(STATUS) |>
  count(STATUS, name = "NUMBER")
df_2_group <- df_2_group[order(df_2_group$STATUS),]
df_2_group
## # A tibble: 6 × 2
## # Groups:   STATUS [6]
##   STATUS     NUMBER
##   <chr>       <int>
## 1 Cancelled      34
## 2 Disputed        5
## 3 In Process     31
## 4 On Hold        19
## 5 Resolved       28
## 6 Shipped      1283
df_3_group <- df_3 |> 
  group_by(STATUS) |>
  count(STATUS, name = "NUMBER")
df_3_group <- df_3_group[order(df_3_group$STATUS),]
df_3_group
## # A tibble: 6 × 2
## # Groups:   STATUS [6]
##   STATUS     NUMBER
##   <chr>       <int>
## 1 Cancelled      38
## 2 Disputed        6
## 3 In Process     22
## 4 On Hold        23
## 5 Resolved       21
## 6 Shipped      1290

Wow, the counts are really close across the three samples! I expected the Shipped row to fluctuate as much, in relative terms, as the number of orders In Process does. “Shipped” would be the best status for an order to have, though.

The number of orders “In Process” in the first subsample (15) is rather low, about half the second subsample’s 31, and the number of orders “On Hold” in the third subsample (23) is the highest of the three. I’d mark both of these as anomalies!
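
Putting the three count tables side by side makes these comparisons easier to check. The sketch below retypes the counts printed above by hand. The column names (`Sample_1` and so on) and the `Spread` column are introduced here only for illustration.

```r
# Counts copied from the three printed tables above
counts <- data.frame(
  STATUS   = c("Cancelled", "Disputed", "In Process", "On Hold", "Resolved", "Shipped"),
  Sample_1 = c(30, 6, 15, 17, 23, 1309),
  Sample_2 = c(34, 5, 31, 19, 28, 1283),
  Sample_3 = c(38, 6, 22, 23, 21, 1290)
)
sample_cols <- c("Sample_1", "Sample_2", "Sample_3")

# Each sample has 1400 rows, so counts convert directly to percentages
shares <- counts
shares[sample_cols] <- round(100 * counts[sample_cols] / 1400, 1)

# Spread = largest count minus smallest count for each status
counts$Spread <- apply(counts[sample_cols], 1, max) -
  apply(counts[sample_cols], 1, min)

counts
shares
```

In this table “Shipped” has the largest absolute spread (26 orders), but that is small next to roughly 1,300 orders. The spread of 16 for “In Process” is large next to counts between 15 and 31.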