Week 5 Data Dive

Importing libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)

Importing data

df <- read.csv('Auto Sales data.csv')

Unclear Element - What (else) can I use DealSize for?

The categorical element DealSize also wasn’t explicitly explained.

In week 3, I assumed it was connected to the SALES column, and I can confirm this by sorting the database by that column.

Through this, not only does

{r} df_deal <- df[, c("QUANTITYORDERED", "DEALSIZE")] df_deal <- df_deal[order(-df_deal$QUANTITYORDERED),] df_deal df_deal_2 <- df[, c("SALES", "DEALSIZE")] df_deal_2 <- df_deal_2[order(-df_deal_2$SALES),] df_deal_2}

Unclear Column - What can I use MSRP for?

The documentation states that MSRP stands for “Manufacturer’s Suggested Retail Price”, and represents the suggested selling price for each item. The unit price also isn’t noted.

I thought to look at the difference between the MSRP and the actual price of each one, which yielded a wide range, and the average price was notably always higher than the MSRP… the company turned up quite the profit!

df_MSRP <- df |>
  group_by(ORDERNUMBER, MSRP, QUANTITYORDERED, PRICEEACH) |>
  reframe(PRICEDIFF = PRICEEACH - MSRP)
df_MSRP <- df_MSRP[order(-df_MSRP$PRICEDIFF),]
df_MSRP
## # A tibble: 2,747 × 5
##    ORDERNUMBER  MSRP QUANTITYORDERED PRICEEACH PRICEDIFF
##          <int> <int>           <int>     <dbl>     <dbl>
##  1       10391    33              33      253.      220.
##  2       10348    50              29      245.      195.
##  3       10325    54              38      233.      179.
##  4       10356    33              48      202.      170.
##  5       10375    72              43      233.      161.
##  6       10382    50              39      201.      151.
##  7       10395    58              45      199.      141.
##  8       10367    60              45      197.      137.
##  9       10359    62              22      196.      134.
## 10       10325    62              28      192.      130.
## # ℹ 2,737 more rows

I then wondered if there was any correlation I could make with the Price Difference and another existing column. The best I could find was with the PRICE_EACH column - and it’s a positive relationship, if any. This would mean a higher the price correlates to a higher markup from the MSRP.

df_MSRP |>
  ggplot() +
  geom_point(
    aes(x=QUANTITYORDERED, y=PRICEDIFF)
  )

df_MSRP |>
  ggplot() +
  geom_point(
    aes(x=MSRP, y=PRICEDIFF)
  )

df_MSRP |>
  ggplot() +
  geom_point(
    aes(x=PRICEEACH, y=PRICEDIFF)
  )

I also split up the MSRP by each type of product and make a boxplot of each.

Classic Cars spanned across the whole range. It covers the outliers that exist in other products and lacks any itself.

box_df <- data.frame(Product = df$PRODUCTLINE, MSRP = df$MSRP)
box_df |>
  ggplot() +
  geom_boxplot(
    aes(x=Product, y=MSRP)
  )

Without a price unit, I can’t really tell if these are good deals or not.

Any Missing Info?

I found no empty cells, but by sorting the raw data by the order number, I found a couple missing order numbers: 10132, 10138, 10157, 10159, and 10160 to name a few. It was also here that I realized multiple rows could exist with the same order number! I guess that’s not a key value…

Some dates were also absent, though I find the missing numbers to be a far larger indicator of an implicitly empty row. There were no NULL or N/A values, so no rows were explicitly missing.

Notable countries that aren’t in any record include China and South Korea, some of the most wealthiest countries in the world.