Week 3 Data Dive

Importing libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)

Importing data

df <- read.csv('Auto Sales data.csv')

Group-by Data Frames

Frame 1 - How many sales were gathered from each country?

countrySales <- df |> 
  group_by(COUNTRY) |>
  summarize(TOTALSALES = sum(SALES))
countrySales
## # A tibble: 19 × 2
##    COUNTRY     TOTALSALES
##    <chr>            <dbl>
##  1 Australia      630623.
##  2 Austria        202063.
##  3 Belgium        108413.
##  4 Canada         224079.
##  5 Denmark        245637.
##  6 Finland        329582.
##  7 France        1110917.
##  8 Germany        220472.
##  9 Ireland         57756.
## 10 Italy          374674.
## 11 Japan          188168.
## 12 Norway         307464.
## 13 Philippines     94016.
## 14 Singapore      288488.
## 15 Spain         1215687.
## 16 Sweden         210014.
## 17 Switzerland    117714.
## 18 UK             478880.
## 19 USA           3355576.
countrySales <- countrySales |>
  tail(10)
ggplot(
  data = countrySales,
  mapping = aes(x = COUNTRY, y = TOTALSALES)
) +
  geom_bar(
    stat = "identity",
  ) +
  labs(
    title = "Total Sales by Country" 
      )

It makes sense that USA would spend the most on automobiles. I wasn’t expecting that much of a lead, though.

Frame 2 - There’s a lot of products, but which is the most and least expensive…?

df |> 
  group_by(PRODUCTLINE) |>
  summarize(CHEAPEST = min(PRICEEACH), COSLIEST = max(PRICEEACH))
## # A tibble: 7 × 3
##   PRODUCTLINE      CHEAPEST COSLIEST
##   <chr>               <dbl>    <dbl>
## 1 Classic Cars         28.3     249.
## 2 Motorcycles          32.6     230.
## 3 Planes               40.2     233.
## 4 Ships                43.7     173.
## 5 Trains               36.1     199.
## 6 Trucks and Buses     35.7     233.
## 7 Vintage Cars         26.9     253.

Frame 3 - How about the size of orders between companies?

companySales <- df |> 
  group_by(CUSTOMERNAME, DEALSIZE) |>
  count(DEALSIZE, name = "DEALS") |>
  ungroup() |>
  complete(CUSTOMERNAME, DEALSIZE, fill = list(DEALS = 0, freq = 0))
companySales <- companySales[order(companySales$DEALS),]
## companySales <- companySales[order(companySales$DEALS, decreasing = TRUE)]
companySales
## # A tibble: 267 × 3
##    CUSTOMERNAME                       DEALSIZE DEALS
##    <chr>                              <chr>    <int>
##  1 Atelier graphique                  Large        0
##  2 Auto-Moto Classics Inc.            Large        0
##  3 Bavarian Collectables Imports, Co. Large        0
##  4 Boards & Toys Co.                  Large        0
##  5 Cambridge Collectables Co.         Large        0
##  6 Double Decker Gift Stores, Ltd     Large        0
##  7 Enaco Distributors                 Large        0
##  8 Gift Ideas Corp.                   Large        0
##  9 Gifts4AllAges.com                  Large        0
## 10 giftsbymail.co.uk                  Large        0
## # ℹ 257 more rows

There’s 267 rows and 3 different sizes of deal… 267 / 3 = 89 different customers on record.

21 of them have not placed Large orders! That’s nearly a quarter of the roster… if we were to choose a customer at random, there’s a (21/89 ~23.6%) chance they don’t have a Large order on record.

I noticed that they tend to have a fittingly large value in the Sales column (>7000 units or so?). Perhaps most of them are from countries with lower total sales.

How good is the connection between the total sales of a country and the sizes of orders a company there has placed?

Combination Data Frame

comboFrame <- df |> 
  group_by(COUNTRY, PRODUCTLINE) |>
  count(PRODUCTLINE, name = "ORDERS") |>
  ungroup() |>
  complete(COUNTRY, PRODUCTLINE, fill = list(ORDERS = 0, freq = 0))
## comboFrame <- comboFrame[order(comboFrame$ORDERS),]
comboFrame
## # A tibble: 133 × 3
##    COUNTRY   PRODUCTLINE      ORDERS
##    <chr>     <chr>             <int>
##  1 Australia Classic Cars         53
##  2 Australia Motorcycles          26
##  3 Australia Planes               25
##  4 Australia Ships                 2
##  5 Australia Trains                1
##  6 Australia Trucks and Buses     20
##  7 Australia Vintage Cars         58
##  8 Austria   Classic Cars         25
##  9 Austria   Motorcycles           5
## 10 Austria   Planes                6
## # ℹ 123 more rows

Similar to the last group-by data frame, I sorted this one with the lowest values of its continuous variable first. A couple combinations are not present, like Austria not having Trains shipped, and the Philippines not having ever purchased 3 of the 7 types of products…

And then there’s Switzerland, only having purchased Classic Cars (31 of them). They have one of the lowest total sales, and it’s just for one type of product.

On the other end of the spectrum, there’s Spain with 120 orders of Classic Card. USA, France, Spain and Australia appear frequently on the last 10 rows, too!

comboFrame <- comboFrame |>
  head(42)

ggplot(
  data = comboFrame,
  mapping = aes(x = COUNTRY, y = ORDERS, fill = PRODUCTLINE)
) +
  geom_bar(
    stat = "identity",
    position = "dodge"
  )