orders <- read_excel("orders.xlsx", sheet = "Orders")
items <- read_excel("orders.xlsx", sheet = "Items")Homework3
Data Analysis Tools
Working with erronous files
- Importing Data
- Inspect the orders sheet
glimpse(orders)Rows: 949
Columns: 6
$ `order date` <chr> "02/01/2021", "05/01/2021", "06/01/2021", "06/01/2021"…
$ `shipping date` <chr> "04/01/2021", "06/01/2021", "08/01/2021", "06/01/2021"…
$ destination <chr> "Spain", "France", "Spain", "Spain", "Spain", "Portuga…
$ order <chr> "TZZ_02", "CXW_01", "AXC_07", "AXC_10", "AXC_20", "TZZ…
$ units <dbl> 14, 16, 11, 10, 14, 8, 9, 13, 10, 8, 10, 8, 15, 8, 12,…
$ charge <chr> "72.5", "435.5", "97.5", "60.2", "42,7", "73.3", "87.9…
From doing glimpse(orders) we can see the type of each observation. For example, ’order date’, ‘shipping date’, destination, order and charge are characters and units is double. We must be careful because we can not use this table unless we change the format of all columns.
orders <- orders %>%
mutate(
order_date= dmy(`order date`),
shipping_date= dmy(`shipping date`),
destination = factor(destination),
order = factor(order),
charge = as.double(charge))Warning: There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `shipping_date = dmy(`shipping date`)`.
Caused by warning:
! 7 failed to parse.
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
orders# A tibble: 949 × 8
`order date` `shipping date` destination order units charge order_date
<chr> <chr> <fct> <fct> <dbl> <dbl> <date>
1 02/01/2021 04/01/2021 Spain TZZ_02 14 72.5 2021-01-02
2 05/01/2021 06/01/2021 France CXW_01 16 436. 2021-01-05
3 06/01/2021 08/01/2021 Spain AXC_07 11 97.5 2021-01-06
4 06/01/2021 06/01/2021 Spain AXC_10 10 60.2 2021-01-06
5 06/01/2021 07/01/2021 Spain AXC_20 14 NA 2021-01-06
6 08/01/2021 08/01/2021 Portugal TZZ_07 8 73.3 2021-01-08
7 13/01/2021 14/01/2021 Spain AXC_17 9 87.9 2021-01-13
8 14/01/2021 14/01/2021 Spain AXC_03 13 83.6 2021-01-14
9 14/01/2021 15/01/2021 Spain CXW_17 10 185. 2021-01-14
10 17/01/2021 18/01/2021 France AXC_30 8 86.6 2021-01-17
# ℹ 939 more rows
# ℹ 1 more variable: shipping_date <date>
orders <- orders %>% rename(
order_date = order_date,
shipping_date =shipping_date)We changed the columns to the correct format so the data works properly.The dates were converted to real date values, the categories (destination and order) were changed to factors, and the charge was converted to a number.
Finally, the date columns were renamed to make them easier to use later.
- Summary inspectation
summary(orders) order date shipping date destination order
Length:949 Length:949 Andorra : 16 AXC_01 : 24
Class :character Class :character France :113 CXW_15 : 20
Mode :character Mode :character Lmandia : 1 TZZ_08 : 20
Portugal:126 AXC_22 : 19
Spain :692 TZZ_02 : 19
SVERIGE : 1 AXC_04 : 18
(Other):829
units charge order_date shipping_date
Min. : 2.00 Min. : 2.00 Min. :2021-01-02 Min. :2021-01-04
1st Qu.: 8.00 1st Qu.: 54.02 1st Qu.:2022-01-06 1st Qu.:2022-01-03
Median :10.00 Median : 95.25 Median :2023-02-19 Median :2023-02-20
Mean :10.08 Mean : 188.61 Mean :2022-12-20 Mean :2022-12-20
3rd Qu.:12.00 3rd Qu.: 141.65 3rd Qu.:2023-12-12 3rd Qu.:2023-12-14
Max. :23.00 Max. :20160.00 Max. :2024-11-16 Max. :2024-11-17
NA's :7 NA's :7
orders %>% count(destination)# A tibble: 6 × 2
destination n
<fct> <int>
1 Andorra 16
2 France 113
3 Lmandia 1
4 Portugal 126
5 Spain 692
6 SVERIGE 1
orders %>% count(order)# A tibble: 79 × 2
order n
<fct> <int>
1 AXC_01 24
2 AXC_02 16
3 AXC_03 11
4 AXC_04 18
5 AXC_05 8
6 AXC_06 7
7 AXC_07 12
8 AXC_08 8
9 AXC_09 8
10 AXC_10 15
# ℹ 69 more rows
orders %>% count(order_date)# A tibble: 525 × 2
order_date n
<date> <int>
1 2021-01-02 1
2 2021-01-05 1
3 2021-01-06 3
4 2021-01-08 1
5 2021-01-13 1
6 2021-01-14 2
7 2021-01-17 1
8 2021-01-18 1
9 2021-01-23 2
10 2021-01-26 7
# ℹ 515 more rows
We checked the data to understand it better. Using summary() we saw the minimum, maximum, and general statistics. Using count() we checked how many orders there are for each destination, each product, and each date. This helps us see patterns and possible mistakes in the data.
- Plotting a timeline
timeline_full <- orders %>%
count(order_date) %>%
complete(
order_date = seq.Date(min(order_date, na.rm = TRUE),
max(order_date, na.rm = TRUE),
by = "day"),
fill = list(n = 0)
)
#HOW MANY DAYS WITHOUT ORDERS
sum(timeline_full$n == 0)[1] 890
ggplotly(
ggplot(timeline_full, aes(x = order_date, y = n, group = 1, color = n)) +
geom_line(color = "grey") +
geom_point(size = 1.5) +
scale_color_viridis_c(option = "magma") +
labs(
title = "Timeline of Orders",
x = "Order date",
y = "Number of orders",
color = "Orders"
) +
theme_minimal()
)ggplotly(orders %>%
ggplot(aes(order_date, shipping_date)) +
geom_point(color = "#99522F")+
labs(
title = "Timeline of orders",
x = "Order date",
y = "Shipping date"
))We created two timeline plots to check the order dates. The first plot shows how many orders were made each day, and we can see that some days have no orders, which means there are missing dates in the timeline. The second plot compares order dates with shipping dates, and it shows that all points follow a normal pattern with no big gaps. This helps us see if the data is complete and if the dates make sense.
- Shipping times
orders_with_time <- orders %>%
mutate(
shipping_time = shipping_date - order_date
)
orders_with_time# A tibble: 949 × 9
`order date` `shipping date` destination order units charge order_date
<chr> <chr> <fct> <fct> <dbl> <dbl> <date>
1 02/01/2021 04/01/2021 Spain TZZ_02 14 72.5 2021-01-02
2 05/01/2021 06/01/2021 France CXW_01 16 436. 2021-01-05
3 06/01/2021 08/01/2021 Spain AXC_07 11 97.5 2021-01-06
4 06/01/2021 06/01/2021 Spain AXC_10 10 60.2 2021-01-06
5 06/01/2021 07/01/2021 Spain AXC_20 14 NA 2021-01-06
6 08/01/2021 08/01/2021 Portugal TZZ_07 8 73.3 2021-01-08
7 13/01/2021 14/01/2021 Spain AXC_17 9 87.9 2021-01-13
8 14/01/2021 14/01/2021 Spain AXC_03 13 83.6 2021-01-14
9 14/01/2021 15/01/2021 Spain CXW_17 10 185. 2021-01-14
10 17/01/2021 18/01/2021 France AXC_30 8 86.6 2021-01-17
# ℹ 939 more rows
# ℹ 2 more variables: shipping_date <date>, shipping_time <drtn>
orders_with_time %>%
select(shipping_time) %>%
summary() shipping_time
Min. :0.0000 days
1st Qu.:0.0000 days
Median :1.0000 days
Mean :0.9161 days
3rd Qu.:1.0000 days
Max. :4.0000 days
NA's :7
outliers <- orders_with_time %>%
summarise(
q1 = quantile(shipping_time, 0.25, na.rm = TRUE),
q3 = quantile(shipping_time, 0.75, na.rm = TRUE),
IQR = q3 - q1,
out_low = q1 - 1.5 * IQR,
out_high = q3 + 1.5 * IQR
)
outliers# A tibble: 1 × 5
q1 q3 IQR out_low out_high
<drtn> <drtn> <drtn> <drtn> <drtn>
1 0 days 1 days 1 days -1.5 days 2.5 days
orders_with_time %>%
filter(shipping_time > outliers$out_high) %>%
summarise(n = n())# A tibble: 1 × 1
n
<int>
1 62
We calculated the shipping time by subtracting the order date from the shipping date. Most shipping times are between 0 and 2 days, which looks normal. The IQR rule marks some values as outliers, but these are only a few orders that took 3–4 days. They are not real errors — they are simply slower shipments, so the outliers appear because the normal range of shipping is very small.
- Boxplot of change
ggplotly(
ggplot(orders, aes(x = "", y = charge)) +
geom_boxplot(
fill = "#9E542F",
color = "#003049",
) +
coord_flip(ylim = c(
quantile(orders$charge, 0.01, na.rm = TRUE),
quantile(orders$charge, 0.99, na.rm = TRUE)
)) +
theme_minimal() +
labs(
x = "Distribution",
y = "Charge (zoomed)",
title = "Boxplot of Charge"
)
)Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).
The boxplot shows that most charge values are in a normal range, but there are several very high values that appear as outliers. These outliers happen because some charges were entered incorrectly in the dataset, for example extra zeros or wrong numbers. So the unusual values are caused by data-entry mistakes, not by real business differences.
Practice merging
- Calculate new column
items <- items %>%
rename(
order = "Item",
price_per_unit = "Price"
)
new_orders <- orders %>%
left_join(items, by = "order") %>%
mutate(
list_price = price_per_unit * units
) %>%
select(
order,
price_per_unit,
units,
list_price,
charge,
order_date,
shipping_date,
destination
)
new_orders# A tibble: 949 × 8
order price_per_unit units list_price charge order_date shipping_date
<chr> <dbl> <dbl> <dbl> <dbl> <date> <date>
1 TZZ_02 5.19 14 72.7 72.5 2021-01-02 2021-01-04
2 CXW_01 32.6 16 522. 436. 2021-01-05 2021-01-06
3 AXC_07 9.18 11 101. 97.5 2021-01-06 2021-01-08
4 AXC_10 6.4 10 64 60.2 2021-01-06 2021-01-06
5 AXC_20 3.05 14 42.7 NA 2021-01-06 2021-01-07
6 TZZ_07 9.16 8 73.3 73.3 2021-01-08 2021-01-08
7 AXC_17 11.9 9 107. 87.9 2021-01-13 2021-01-14
8 AXC_03 6.43 13 83.6 83.6 2021-01-14 2021-01-14
9 CXW_17 19.0 10 190. 185. 2021-01-14 2021-01-15
10 AXC_30 10.8 8 86.6 86.6 2021-01-17 2021-01-18
# ℹ 939 more rows
# ℹ 1 more variable: destination <fct>
We joined the Orders table with the Items table to get the unit price for each product. Then we created a new column called list_price by multiplying the unit price by the number of units. This gives us the expected total price for each order.
- Calculate a new column
real_orders <- new_orders %>%
rename(invoice = "charge") %>%
mutate(
expected_charge = price_per_unit * units,
expected_charge = round(expected_charge, 1)
) %>%
select(price_per_unit, units, invoice, expected_charge, order_date, destination)
real_orders %>%
dplyr::filter(invoice != expected_charge)# A tibble: 650 × 6
price_per_unit units invoice expected_charge order_date destination
<dbl> <dbl> <dbl> <dbl> <date> <fct>
1 5.19 14 72.5 72.7 2021-01-02 Spain
2 32.6 16 436. 522. 2021-01-05 France
3 9.18 11 97.5 101 2021-01-06 Spain
4 6.4 10 60.2 64 2021-01-06 Spain
5 11.9 9 87.9 108. 2021-01-13 Spain
6 19.0 10 185. 190. 2021-01-14 Spain
7 14.6 10 137. 146. 2021-01-18 Spain
8 11.9 8 92.2 95.5 2021-01-23 Portugal
9 8.87 15 113. 133 2021-01-23 Spain
10 17.2 10 146. 172. 2021-01-26 Spain
# ℹ 640 more rows
real_orders# A tibble: 949 × 6
price_per_unit units invoice expected_charge order_date destination
<dbl> <dbl> <dbl> <dbl> <date> <fct>
1 5.19 14 72.5 72.7 2021-01-02 Spain
2 32.6 16 436. 522. 2021-01-05 France
3 9.18 11 97.5 101 2021-01-06 Spain
4 6.4 10 60.2 64 2021-01-06 Spain
5 3.05 14 NA 42.7 2021-01-06 Spain
6 9.16 8 73.3 73.3 2021-01-08 Portugal
7 11.9 9 87.9 108. 2021-01-13 Spain
8 6.43 13 83.6 83.6 2021-01-14 Spain
9 19.0 10 185. 190. 2021-01-14 Spain
10 10.8 8 86.6 86.6 2021-01-17 France
# ℹ 939 more rows
We created a new column called expected_charge, which is the price we should get by multiplying the unit price and the number of units. Then we compared this value with the actual invoice in the dataset. Most orders match, but some invoices are much higher or lower than the expected amount. These strange values are probably caused by data-entry mistakes, like typing the wrong number or adding an extra zero.
- Window functions
monthly_kpi <- real_orders %>%
mutate(
month = floor_date(order_date, "month")
) %>%
group_by(destination, month) %>%
summarise(
units_shipped = sum(units, na.rm = TRUE),
revenue = sum(expected_charge, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(destination, month)
monthly_summary <- monthly_kpi %>%
group_by(destination) %>%
mutate(
units_mom_change = units_shipped - lag(units_shipped),
revenue_mom_change = revenue- lag(revenue),
units_mom_pct = round((units_mom_change / lag(units_shipped)) * 100, 1),
revenue_mom_pct = round((revenue_mom_change / lag(revenue)) * 100, 1)
)
head(monthly_summary)# A tibble: 6 × 8
# Groups: destination [1]
destination month units_shipped revenue units_mom_change
<fct> <date> <dbl> <dbl> <dbl>
1 Andorra 2021-04-01 8 86.6 NA
2 Andorra 2021-06-01 10 190. 2
3 Andorra 2021-11-01 12 36.6 2
4 Andorra 2022-02-01 12 122. 0
5 Andorra 2022-03-01 6 53.5 -6
6 Andorra 2022-04-01 12 264. 6
# ℹ 3 more variables: revenue_mom_change <dbl>, units_mom_pct <dbl>,
# revenue_mom_pct <dbl>
We first create a month variable from order_date.
Then we group the data by destination and month and calculate two performance indicators:
units_shipped = total units sold
revenue = total expected charge
Next, we use the lag() function to compare each month with the previous month. We compute the month-on-month change (difference) and the percentage change for both units and revenue. This shows if each destination is growing or falling from one month to the next.
Total number of errors found
shipping_errors <- orders_with_time %>%
filter(as.numeric(shipping_time) > 2.5) %>%
summarise(n = n())
date_errors <- orders %>%
summarise(
order_date_errors = sum(is.na(`order date`)),
shipping_date_errors = sum(is.na(`shipping date`))
)
value_errors <- orders %>%
summarise(
negative_units = sum(units < 0, na.rm = TRUE),
negative_charge = sum(charge < 0, na.rm = TRUE)
)
charge_errors <- orders %>%
filter(charge > quantile(charge, 0.99, na.rm = TRUE)) %>%
summarise(n = n())
invoice_errors <- real_orders %>%
filter(invoice != expected_charge) %>%
summarise(n = n())
na_errors <- orders %>%
summarise(total_na = sum(is.na(across(everything()))))
total_errors <- tibble(
date_errors = date_errors$order_date_errors +
date_errors$shipping_date_errors,
negative_values = value_errors$negative_units +
value_errors$negative_charge,
charge_outliers = charge_errors$n,
invoice_mismatches = invoice_errors$n,
shipping_outliers = shipping_errors$n,
total_na_values = na_errors$total_na
)
total_errors# A tibble: 1 × 6
date_errors negative_values charge_outliers invoice_mismatches
<int> <int> <int> <int>
1 0 0 10 650
# ℹ 2 more variables: shipping_outliers <int>, total_na_values <int>
total_number_of_errors <- sum(total_errors)
total_number_of_errors[1] 736
Using all the checks we ran (missing dates, invalid values, charge outliers, invoice mismatches, shipping outliers and total missing fields), the dataset ends up having a total of 736 data-entry errors.
That means the dataset has a lot of manual mistakes, from missing dates, weird values, inconsistent invoice formatting, outlier charges, to incorrect or impossible shipping times. Basically, the data needs a serious clean-up before doing any real analysis.