library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
df <- read.csv('Auto Sales data.csv')
## df_1, df_2, df_3: random samples of 1400 rows each, drawn with replacement.
df_1 <- sample_n(df, 1400, replace = TRUE)
df_2 <- sample_n(df, 1400, replace = TRUE)
df_3 <- sample_n(df, 1400, replace = TRUE)
box_df <- data.frame(Sales_1 = df_1$SALES, Sales_2 = df_2$SALES, Sales_3 = df_3$SALES)
box_df |>
  pivot_longer(everything(), values_to = "Value", names_to = "Variable") |>
  ggplot() +
  geom_boxplot(aes(x = Variable, y = Value))
The medians and the 25th/75th percentiles aren’t too different from each other across the three samples.
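A box plot draws the median and quartiles rather than the mean, so a small numeric summary can back up the visual comparison. This is a minimal sketch: `toy_box` is a made-up stand-in for `box_df`, and the summary column names are my own choice.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for box_df (values are invented, not from the sales data)
toy_box <- data.frame(
  Sales_1 = c(1, 2, 3, 4, 5),
  Sales_2 = c(2, 3, 4, 5, 6)
)

# One row per sample with the quantities a box plot encodes, plus mean and max
summary_tbl <- toy_box |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Value") |>
  group_by(Variable) |>
  summarise(
    q25    = quantile(Value, 0.25, names = FALSE),
    median = median(Value),
    mean   = mean(Value),
    q75    = quantile(Value, 0.75, names = FALSE),
    max    = max(Value)
  )

summary_tbl
```

Swapping `toy_box` for `box_df` should give the same table for the three real samples, including the maximum of each one.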
The most significant difference is in the outliers! The second random sample has the lowest maximum value, too. Maybe the first and third random samples pulled the same entry as an outlier.
I’d still count all the dots as anomalies since they lie more than 1.5 times the interquartile range beyond the quartiles.
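To list the dots instead of reading them off the plot, the 1.5 × IQR rule can be applied directly. A minimal sketch, assuming the default `geom_boxplot()` whisker length of 1.5; `flag_outliers()` and `toy_sales` are hypothetical names introduced only for illustration.

```r
# Flag values that fall outside the box plot whiskers (1.5 * IQR rule).
# k = 1.5 is assumed to match the default whisker length of geom_boxplot().
flag_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Invented values, not taken from the sales data
toy_sales <- c(10, 12, 13, 14, 15, 16, 18, 100)
toy_sales[flag_outliers(toy_sales)]
```

On the real data, `df_1$SALES[flag_outliers(df_1$SALES)]` would return the flagged sales values for the first sample, which makes it possible to check whether two samples share the same extreme entry.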
df_1_group <- df_1 |>
  group_by(STATUS) |>
  count(STATUS, name = "NUMBER")
df_1_group <- df_1_group[order(df_1_group$STATUS),]
df_1_group
## # A tibble: 6 × 2
## # Groups: STATUS [6]
## STATUS NUMBER
## <chr> <int>
## 1 Cancelled 30
## 2 Disputed 6
## 3 In Process 15
## 4 On Hold 17
## 5 Resolved 23
## 6 Shipped 1309
df_2_group <- df_2 |>
  group_by(STATUS) |>
  count(STATUS, name = "NUMBER")
df_2_group <- df_2_group[order(df_2_group$STATUS),]
df_2_group
## # A tibble: 6 × 2
## # Groups: STATUS [6]
## STATUS NUMBER
## <chr> <int>
## 1 Cancelled 34
## 2 Disputed 5
## 3 In Process 31
## 4 On Hold 19
## 5 Resolved 28
## 6 Shipped 1283
df_3_group <- df_3 |>
  group_by(STATUS) |>
  count(STATUS, name = "NUMBER")
df_3_group <- df_3_group[order(df_3_group$STATUS),]
df_3_group
## # A tibble: 6 × 2
## # Groups: STATUS [6]
## STATUS NUMBER
## <chr> <int>
## 1 Cancelled 38
## 2 Disputed 6
## 3 In Process 22
## 4 On Hold 23
## 5 Resolved 21
## 6 Shipped 1290
Wow, they’re really close together! I expected the Shipped row (1283 to 1309) to fluctuate more, like the number of orders In Process (15 to 31). “Shipped” would be the best status for an order to have, though.
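Comparing three separate printouts by eye is error-prone, so it may help to stack the samples and count them into one wide table. This is a sketch under assumptions: `toy_1`, `toy_2` and `toy_3` are invented miniature samples standing in for `df_1`, `df_2` and `df_3`, and the `sample_*` labels are my own.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature samples (invented rows, not from the sales data)
toy_1 <- data.frame(STATUS = c("Shipped", "Shipped", "On Hold"))
toy_2 <- data.frame(STATUS = c("Shipped", "Cancelled", "Shipped"))
toy_3 <- data.frame(STATUS = c("Shipped", "Shipped", "Shipped"))

# Stack the samples with a label column, count per sample and status,
# then spread the samples into columns so each status sits on one row
status_wide <- bind_rows(
  sample_1 = toy_1, sample_2 = toy_2, sample_3 = toy_3,
  .id = "sample"
) |>
  count(sample, STATUS, name = "NUMBER") |>
  pivot_wider(names_from = sample, values_from = NUMBER, values_fill = 0) |>
  arrange(STATUS)

status_wide
```

With the real samples in place of the toy ones, each status gets a single row with its three counts side by side.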
The number of orders “In Process” in the first subsample (15) is rather low, about half of the 31 in the second, and the number of orders “On Hold” in the third subsample (23) is rather high. I’d mark both of these as anomalies!
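One rough way to judge whether a count is unusual is to compare it with the spread that random sampling alone would produce. The sketch below uses the counts printed above and makes one loud assumption: the pooled share of each status across the three samples stands in for its share in the full data set, which is never printed here. Under that assumption each count is approximately binomial with n = 1400.

```r
# Counts copied from the three printed tables
counts <- data.frame(
  STATUS   = c("Cancelled", "Disputed", "In Process", "On Hold", "Resolved", "Shipped"),
  sample_1 = c(30, 6, 15, 17, 23, 1309),
  sample_2 = c(34, 5, 31, 19, 28, 1283),
  sample_3 = c(38, 6, 22, 23, 21, 1290)
)

n <- 1400  # rows per sample

# ASSUMPTION: pooled share across the samples approximates the true share
p_hat    <- rowSums(counts[, -1]) / (3 * n)
expected <- n * p_hat
sd_binom <- sqrt(n * p_hat * (1 - p_hat))

# Standardised distance of each count from its expected value
z <- (as.matrix(counts[, -1]) - expected) / sd_binom
rownames(z) <- counts$STATUS
round(z, 2)
```

Counts whose standardised distance is well beyond about ±2 would be the ones worth a second look; smaller values are in the range that resampling noise tends to produce. Since `sample_n()` draws new rows on every run, calling `set.seed()` with a fixed number before sampling would keep the counts stable between runs.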