library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
df <- read.csv('Auto Sales data.csv')
The categorical element DealSize also wasn’t explicitly explained.
In week 3, I assumed it was connected to the SALES column, and I can confirm this by sorting the database by that column.
Through this, not only does
{r} df_deal <- df[, c("QUANTITYORDERED", "DEALSIZE")] df_deal <- df_deal[order(-df_deal$QUANTITYORDERED),] df_deal df_deal_2 <- df[, c("SALES", "DEALSIZE")] df_deal_2 <- df_deal_2[order(-df_deal_2$SALES),] df_deal_2}
The documentation states that MSRP stands for “Manufacturer’s Suggested Retail Price”, and represents the suggested selling price for each item. The unit price also isn’t noted.
I thought to look at the difference between the MSRP and the actual price of each one, which yielded a wide range, and the average price was notably always higher than the MSRP… the company turned up quite the profit!
df_MSRP <- df |>
group_by(ORDERNUMBER, MSRP, QUANTITYORDERED, PRICEEACH) |>
reframe(PRICEDIFF = PRICEEACH - MSRP)
df_MSRP <- df_MSRP[order(-df_MSRP$PRICEDIFF),]
df_MSRP
## # A tibble: 2,747 × 5
## ORDERNUMBER MSRP QUANTITYORDERED PRICEEACH PRICEDIFF
## <int> <int> <int> <dbl> <dbl>
## 1 10391 33 33 253. 220.
## 2 10348 50 29 245. 195.
## 3 10325 54 38 233. 179.
## 4 10356 33 48 202. 170.
## 5 10375 72 43 233. 161.
## 6 10382 50 39 201. 151.
## 7 10395 58 45 199. 141.
## 8 10367 60 45 197. 137.
## 9 10359 62 22 196. 134.
## 10 10325 62 28 192. 130.
## # ℹ 2,737 more rows
I then wondered if there was any correlation I could make with the Price Difference and another existing column. The best I could find was with the PRICE_EACH column - and it’s a positive relationship, if any. This would mean a higher the price correlates to a higher markup from the MSRP.
df_MSRP |>
ggplot() +
geom_point(
aes(x=QUANTITYORDERED, y=PRICEDIFF)
)
df_MSRP |>
ggplot() +
geom_point(
aes(x=MSRP, y=PRICEDIFF)
)
df_MSRP |>
ggplot() +
geom_point(
aes(x=PRICEEACH, y=PRICEDIFF)
)
I also split up the MSRP by each type of product and make a boxplot of each.
Classic Cars spanned across the whole range. It covers the outliers that exist in other products and lacks any itself.
box_df <- data.frame(Product = df$PRODUCTLINE, MSRP = df$MSRP)
box_df |>
ggplot() +
geom_boxplot(
aes(x=Product, y=MSRP)
)
Without a price unit, I can’t really tell if these are good deals or not.
I found no empty cells, but by sorting the raw data by the order number, I found a couple missing order numbers: 10132, 10138, 10157, 10159, and 10160 to name a few. It was also here that I realized multiple rows could exist with the same order number! I guess that’s not a key value…
Some dates were also absent, though I find the missing numbers to be a far larger indicator of an implicitly empty row. There were no NULL or N/A values, so no rows were explicitly missing.
Notable countries that aren’t in any record include China and South Korea, some of the most wealthiest countries in the world.