Homework3

Data Analysis Tools

Author

Raluca Berbece and Khadijah Ismail Masoud

Published

December 9, 2025

Working with erronous files

  1. Importing Data
orders  <- read_excel("orders.xlsx", sheet = "Orders")
items   <- read_excel("orders.xlsx", sheet = "Items")
  1. Inspect the orders sheet
glimpse(orders)
Rows: 949
Columns: 6
$ `order date`    <chr> "02/01/2021", "05/01/2021", "06/01/2021", "06/01/2021"…
$ `shipping date` <chr> "04/01/2021", "06/01/2021", "08/01/2021", "06/01/2021"…
$ destination     <chr> "Spain", "France", "Spain", "Spain", "Spain", "Portuga…
$ order           <chr> "TZZ_02", "CXW_01", "AXC_07", "AXC_10", "AXC_20", "TZZ…
$ units           <dbl> 14, 16, 11, 10, 14, 8, 9, 13, 10, 8, 10, 8, 15, 8, 12,…
$ charge          <chr> "72.5", "435.5", "97.5", "60.2", "42,7", "73.3", "87.9…

From doing glimpse(orders) we can see the type of each observation. For example, ’order date’, ‘shipping date’, destination, order and charge are characters and units is double. We must be careful because we can not use this table unless we change the format of all columns.

orders <- orders %>% 
mutate(
order_date= dmy(`order date`),
shipping_date= dmy(`shipping date`),
destination = factor(destination),
order = factor(order),
charge = as.double(charge))
Warning: There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `shipping_date = dmy(`shipping date`)`.
Caused by warning:
!  7 failed to parse.
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
orders
# A tibble: 949 × 8
   `order date` `shipping date` destination order  units charge order_date
   <chr>        <chr>           <fct>       <fct>  <dbl>  <dbl> <date>    
 1 02/01/2021   04/01/2021      Spain       TZZ_02    14   72.5 2021-01-02
 2 05/01/2021   06/01/2021      France      CXW_01    16  436.  2021-01-05
 3 06/01/2021   08/01/2021      Spain       AXC_07    11   97.5 2021-01-06
 4 06/01/2021   06/01/2021      Spain       AXC_10    10   60.2 2021-01-06
 5 06/01/2021   07/01/2021      Spain       AXC_20    14   NA   2021-01-06
 6 08/01/2021   08/01/2021      Portugal    TZZ_07     8   73.3 2021-01-08
 7 13/01/2021   14/01/2021      Spain       AXC_17     9   87.9 2021-01-13
 8 14/01/2021   14/01/2021      Spain       AXC_03    13   83.6 2021-01-14
 9 14/01/2021   15/01/2021      Spain       CXW_17    10  185.  2021-01-14
10 17/01/2021   18/01/2021      France      AXC_30     8   86.6 2021-01-17
# ℹ 939 more rows
# ℹ 1 more variable: shipping_date <date>
orders <- orders %>% rename(
order_date = order_date,
shipping_date =shipping_date)

We changed the columns to the correct format so the data works properly.The dates were converted to real date values, the categories (destination and order) were changed to factors, and the charge was converted to a number.

Finally, the date columns were renamed to make them easier to use later.

  1. Summary inspectation
summary(orders)
  order date        shipping date        destination      order    
 Length:949         Length:949         Andorra : 16   AXC_01 : 24  
 Class :character   Class :character   France  :113   CXW_15 : 20  
 Mode  :character   Mode  :character   Lmandia :  1   TZZ_08 : 20  
                                       Portugal:126   AXC_22 : 19  
                                       Spain   :692   TZZ_02 : 19  
                                       SVERIGE :  1   AXC_04 : 18  
                                                      (Other):829  
     units           charge           order_date         shipping_date       
 Min.   : 2.00   Min.   :    2.00   Min.   :2021-01-02   Min.   :2021-01-04  
 1st Qu.: 8.00   1st Qu.:   54.02   1st Qu.:2022-01-06   1st Qu.:2022-01-03  
 Median :10.00   Median :   95.25   Median :2023-02-19   Median :2023-02-20  
 Mean   :10.08   Mean   :  188.61   Mean   :2022-12-20   Mean   :2022-12-20  
 3rd Qu.:12.00   3rd Qu.:  141.65   3rd Qu.:2023-12-12   3rd Qu.:2023-12-14  
 Max.   :23.00   Max.   :20160.00   Max.   :2024-11-16   Max.   :2024-11-17  
                 NA's   :7                               NA's   :7           
orders %>% count(destination)
# A tibble: 6 × 2
  destination     n
  <fct>       <int>
1 Andorra        16
2 France        113
3 Lmandia         1
4 Portugal      126
5 Spain         692
6 SVERIGE         1
orders %>% count(order)
# A tibble: 79 × 2
   order      n
   <fct>  <int>
 1 AXC_01    24
 2 AXC_02    16
 3 AXC_03    11
 4 AXC_04    18
 5 AXC_05     8
 6 AXC_06     7
 7 AXC_07    12
 8 AXC_08     8
 9 AXC_09     8
10 AXC_10    15
# ℹ 69 more rows
orders %>% count(order_date)
# A tibble: 525 × 2
   order_date     n
   <date>     <int>
 1 2021-01-02     1
 2 2021-01-05     1
 3 2021-01-06     3
 4 2021-01-08     1
 5 2021-01-13     1
 6 2021-01-14     2
 7 2021-01-17     1
 8 2021-01-18     1
 9 2021-01-23     2
10 2021-01-26     7
# ℹ 515 more rows

We checked the data to understand it better. Using summary() we saw the minimum, maximum, and general statistics. Using count() we checked how many orders there are for each destination, each product, and each date. This helps us see patterns and possible mistakes in the data.

  1. Plotting a timeline
timeline_full <- orders %>%
  count(order_date) %>%
  complete(
    order_date = seq.Date(min(order_date, na.rm = TRUE),
                          max(order_date, na.rm = TRUE),
                          by = "day"),
    fill = list(n = 0)
  )

#HOW MANY DAYS WITHOUT ORDERS
sum(timeline_full$n == 0)
[1] 890
ggplotly(
  ggplot(timeline_full, aes(x = order_date, y = n, group = 1, color = n)) +
    geom_line(color = "grey") +
    geom_point(size = 1.5) +
    scale_color_viridis_c(option = "magma") +
    labs(
      title = "Timeline of Orders",
      x = "Order date",
      y = "Number of orders",
      color = "Orders"
    ) +
    theme_minimal()
)
ggplotly(orders %>% 
ggplot(aes(order_date, shipping_date)) +
  geom_point(color = "#99522F")+
  labs(
    title = "Timeline of orders",
    x = "Order date",
    y = "Shipping date"
  ))

We created two timeline plots to check the order dates. The first plot shows how many orders were made each day, and we can see that some days have no orders, which means there are missing dates in the timeline. The second plot compares order dates with shipping dates, and it shows that all points follow a normal pattern with no big gaps. This helps us see if the data is complete and if the dates make sense.

  1. Shipping times
orders_with_time <- orders %>%
  mutate(
    shipping_time = shipping_date - order_date
  )

orders_with_time
# A tibble: 949 × 9
   `order date` `shipping date` destination order  units charge order_date
   <chr>        <chr>           <fct>       <fct>  <dbl>  <dbl> <date>    
 1 02/01/2021   04/01/2021      Spain       TZZ_02    14   72.5 2021-01-02
 2 05/01/2021   06/01/2021      France      CXW_01    16  436.  2021-01-05
 3 06/01/2021   08/01/2021      Spain       AXC_07    11   97.5 2021-01-06
 4 06/01/2021   06/01/2021      Spain       AXC_10    10   60.2 2021-01-06
 5 06/01/2021   07/01/2021      Spain       AXC_20    14   NA   2021-01-06
 6 08/01/2021   08/01/2021      Portugal    TZZ_07     8   73.3 2021-01-08
 7 13/01/2021   14/01/2021      Spain       AXC_17     9   87.9 2021-01-13
 8 14/01/2021   14/01/2021      Spain       AXC_03    13   83.6 2021-01-14
 9 14/01/2021   15/01/2021      Spain       CXW_17    10  185.  2021-01-14
10 17/01/2021   18/01/2021      France      AXC_30     8   86.6 2021-01-17
# ℹ 939 more rows
# ℹ 2 more variables: shipping_date <date>, shipping_time <drtn>
orders_with_time %>%
  select(shipping_time) %>%
  summary()
 shipping_time        
 Min.   :0.0000 days  
 1st Qu.:0.0000 days  
 Median :1.0000 days  
 Mean   :0.9161 days  
 3rd Qu.:1.0000 days  
 Max.   :4.0000 days  
 NA's   :7            
outliers <- orders_with_time %>%
  summarise(
    q1       = quantile(shipping_time, 0.25, na.rm = TRUE),
    q3       = quantile(shipping_time, 0.75, na.rm = TRUE),
    IQR      = q3 - q1,
    out_low  = q1 - 1.5 * IQR,
    out_high = q3 + 1.5 * IQR
  )

outliers
# A tibble: 1 × 5
  q1     q3     IQR    out_low   out_high
  <drtn> <drtn> <drtn> <drtn>    <drtn>  
1 0 days 1 days 1 days -1.5 days 2.5 days
orders_with_time %>%
  filter(shipping_time > outliers$out_high) %>%
  summarise(n = n())
# A tibble: 1 × 1
      n
  <int>
1    62

We calculated the shipping time by subtracting the order date from the shipping date. Most shipping times are between 0 and 2 days, which looks normal. The IQR rule marks some values as outliers, but these are only a few orders that took 3–4 days. They are not real errors — they are simply slower shipments, so the outliers appear because the normal range of shipping is very small.

  1. Boxplot of change
ggplotly(
  ggplot(orders, aes(x = "", y = charge)) +
    geom_boxplot(
      fill = "#9E542F",
      color = "#003049",
    ) +
    coord_flip(ylim = c(
      quantile(orders$charge, 0.01, na.rm = TRUE),
      quantile(orders$charge, 0.99, na.rm = TRUE)
    )) +
    theme_minimal() +
    labs(
      x = "Distribution",
      y = "Charge (zoomed)",
      title = "Boxplot of Charge"
    )
)
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_boxplot()`).

The boxplot shows that most charge values are in a normal range, but there are several very high values that appear as outliers. These outliers happen because some charges were entered incorrectly in the dataset, for example extra zeros or wrong numbers. So the unusual values are caused by data-entry mistakes, not by real business differences.

Practice merging

  1. Calculate new column
items <- items %>%
  rename(
    order = "Item",
    price_per_unit = "Price"
  )

new_orders <- orders %>%
  left_join(items, by = "order") %>%
  mutate(
    list_price = price_per_unit * units
  ) %>%
  select(
    order,
    price_per_unit,
    units,
    list_price,
    charge,
    order_date,
    shipping_date,
    destination
  )

new_orders
# A tibble: 949 × 8
   order  price_per_unit units list_price charge order_date shipping_date
   <chr>           <dbl> <dbl>      <dbl>  <dbl> <date>     <date>       
 1 TZZ_02           5.19    14       72.7   72.5 2021-01-02 2021-01-04   
 2 CXW_01          32.6     16      522.   436.  2021-01-05 2021-01-06   
 3 AXC_07           9.18    11      101.    97.5 2021-01-06 2021-01-08   
 4 AXC_10           6.4     10       64     60.2 2021-01-06 2021-01-06   
 5 AXC_20           3.05    14       42.7   NA   2021-01-06 2021-01-07   
 6 TZZ_07           9.16     8       73.3   73.3 2021-01-08 2021-01-08   
 7 AXC_17          11.9      9      107.    87.9 2021-01-13 2021-01-14   
 8 AXC_03           6.43    13       83.6   83.6 2021-01-14 2021-01-14   
 9 CXW_17          19.0     10      190.   185.  2021-01-14 2021-01-15   
10 AXC_30          10.8      8       86.6   86.6 2021-01-17 2021-01-18   
# ℹ 939 more rows
# ℹ 1 more variable: destination <fct>

We joined the Orders table with the Items table to get the unit price for each product. Then we created a new column called list_price by multiplying the unit price by the number of units. This gives us the expected total price for each order.

  1. Calculate a new column
real_orders <- new_orders %>%
  rename(invoice = "charge") %>%
  mutate(
    expected_charge = price_per_unit * units,
    expected_charge = round(expected_charge, 1)
  ) %>%
  select(price_per_unit, units, invoice, expected_charge, order_date, destination)

  real_orders %>%
  dplyr::filter(invoice != expected_charge)
# A tibble: 650 × 6
   price_per_unit units invoice expected_charge order_date destination
            <dbl> <dbl>   <dbl>           <dbl> <date>     <fct>      
 1           5.19    14    72.5            72.7 2021-01-02 Spain      
 2          32.6     16   436.            522.  2021-01-05 France     
 3           9.18    11    97.5           101   2021-01-06 Spain      
 4           6.4     10    60.2            64   2021-01-06 Spain      
 5          11.9      9    87.9           108.  2021-01-13 Spain      
 6          19.0     10   185.            190.  2021-01-14 Spain      
 7          14.6     10   137.            146.  2021-01-18 Spain      
 8          11.9      8    92.2            95.5 2021-01-23 Portugal   
 9           8.87    15   113.            133   2021-01-23 Spain      
10          17.2     10   146.            172.  2021-01-26 Spain      
# ℹ 640 more rows
real_orders
# A tibble: 949 × 6
   price_per_unit units invoice expected_charge order_date destination
            <dbl> <dbl>   <dbl>           <dbl> <date>     <fct>      
 1           5.19    14    72.5            72.7 2021-01-02 Spain      
 2          32.6     16   436.            522.  2021-01-05 France     
 3           9.18    11    97.5           101   2021-01-06 Spain      
 4           6.4     10    60.2            64   2021-01-06 Spain      
 5           3.05    14    NA              42.7 2021-01-06 Spain      
 6           9.16     8    73.3            73.3 2021-01-08 Portugal   
 7          11.9      9    87.9           108.  2021-01-13 Spain      
 8           6.43    13    83.6            83.6 2021-01-14 Spain      
 9          19.0     10   185.            190.  2021-01-14 Spain      
10          10.8      8    86.6            86.6 2021-01-17 France     
# ℹ 939 more rows

We created a new column called expected_charge, which is the price we should get by multiplying the unit price and the number of units. Then we compared this value with the actual invoice in the dataset. Most orders match, but some invoices are much higher or lower than the expected amount. These strange values are probably caused by data-entry mistakes, like typing the wrong number or adding an extra zero.

  1. Window functions
monthly_kpi <- real_orders %>%
  mutate(
    month = floor_date(order_date, "month")
  ) %>%
  group_by(destination, month) %>%
  summarise(
    units_shipped = sum(units, na.rm = TRUE),
    revenue  = sum(expected_charge, na.rm = TRUE),
    .groups  = "drop"
  ) %>%
  arrange(destination, month)


monthly_summary <- monthly_kpi %>%
  group_by(destination) %>%
  mutate(
    units_mom_change   = units_shipped - lag(units_shipped),
    revenue_mom_change = revenue- lag(revenue),
    units_mom_pct  = round((units_mom_change / lag(units_shipped)) * 100, 1),
    revenue_mom_pct = round((revenue_mom_change / lag(revenue))      * 100, 1)
  )


head(monthly_summary)
# A tibble: 6 × 8
# Groups:   destination [1]
  destination month      units_shipped revenue units_mom_change
  <fct>       <date>             <dbl>   <dbl>            <dbl>
1 Andorra     2021-04-01             8    86.6               NA
2 Andorra     2021-06-01            10   190.                 2
3 Andorra     2021-11-01            12    36.6                2
4 Andorra     2022-02-01            12   122.                 0
5 Andorra     2022-03-01             6    53.5               -6
6 Andorra     2022-04-01            12   264.                 6
# ℹ 3 more variables: revenue_mom_change <dbl>, units_mom_pct <dbl>,
#   revenue_mom_pct <dbl>

We first create a month variable from order_date.

Then we group the data by destination and month and calculate two performance indicators:

  • units_shipped = total units sold

  • revenue = total expected charge

Next, we use the lag() function to compare each month with the previous month. We compute the month-on-month change (difference) and the percentage change for both units and revenue. This shows if each destination is growing or falling from one month to the next.

Total number of errors found

shipping_errors <- orders_with_time %>%
  filter(as.numeric(shipping_time) > 2.5) %>%
  summarise(n = n())

date_errors <- orders %>%
  summarise(
    order_date_errors    = sum(is.na(`order date`)),
    shipping_date_errors = sum(is.na(`shipping date`))
  )

value_errors <- orders %>%
  summarise(
    negative_units  = sum(units < 0, na.rm = TRUE),
    negative_charge = sum(charge < 0, na.rm = TRUE)
  )

charge_errors <- orders %>%
  filter(charge > quantile(charge, 0.99, na.rm = TRUE)) %>%
  summarise(n = n())

invoice_errors <- real_orders %>%
  filter(invoice != expected_charge) %>%
  summarise(n = n())

na_errors <- orders %>%
  summarise(total_na = sum(is.na(across(everything()))))

total_errors <- tibble(
  date_errors        = date_errors$order_date_errors +
                       date_errors$shipping_date_errors,
  negative_values    = value_errors$negative_units +
                       value_errors$negative_charge,
  charge_outliers    = charge_errors$n,
  invoice_mismatches = invoice_errors$n,
  shipping_outliers  = shipping_errors$n,
  total_na_values    = na_errors$total_na
)

total_errors
# A tibble: 1 × 6
  date_errors negative_values charge_outliers invoice_mismatches
        <int>           <int>           <int>              <int>
1           0               0              10                650
# ℹ 2 more variables: shipping_outliers <int>, total_na_values <int>
total_number_of_errors <- sum(total_errors)
total_number_of_errors
[1] 736

Using all the checks we ran (missing dates, invalid values, charge outliers, invoice mismatches, shipping outliers and total missing fields), the dataset ends up having a total of 736 data-entry errors.

That means the dataset has a lot of manual mistakes, from missing dates, weird values, inconsistent invoice formatting, outlier charges, to incorrect or impossible shipping times. Basically, the data needs a serious clean-up before doing any real analysis.