This is the minimal coding necessary to assemble the various data feeds and sort out the likes of variables naming & new features creation plus some general housekeeping tasks. I will continue to add to this code should the need arise for other features to be created.
The dataset I’m using here accompanies a Redbooks publication called Building 360-Degree Information Applications which is available as a free PDF download. Also available as a free download are the excel files in the Additional Material section to follow along with the exercises in the book.
The data covers 3 & 1/2 years worth of sales orders for the Sample Outdoors Company, a fictitious B2B outdoor equipment retailer enterprise. The data comes with details about the products they sell as well as their customers (which are retailers in their case). The data is a subset of the GSDB database, a realistic and feature-rich database created by IBM® to support demos for their products.
I will start with sales transactions as they are distributed across almost 900 separate excel files.
First, I load and merge them at once from a single folder and bind them by row into an orders file .
orders <- list.files(path = "../00_data/order-details",
pattern = "*.csv",
full.names = T) %>%
map_df(~readr::read_csv(.))
# NOTE that all files have the same format with each single variable
# sitting on the same column in each fileorders %>% glimpse()
## Observations: 446,023
## Variables: 45
## $ order_date <dttm> 2004-01-12, 2004-01-12, 2004-01-12, 200...
## $ sales_staff_code <dbl> 10798, 10798, 10406, 10406, 10406, 10406...
## $ order_method_it <chr> "Reparto vendite", "Reparto vendite", "F...
## $ order_method_en <chr> "Sales visit", "Sales visit", "Fax", "Fa...
## $ order_method_nl <chr> "Vertegenwoordiger", "Vertegenwoordiger"...
## $ order_detail_code <dbl> 1000001, 1000002, 1000013, 1000014, 1000...
## $ return_count <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ order_method_no <chr> "Salgsbesøk", "Salgsbesøk", "Faks", "Fak...
## $ order_method_el <chr> "<U+0395>p<U+03AF>s<U+03BA>e<U+03C8><U+03B7> p<U+03C9><U+03BB><U+03B7>t<U+03AE>", "<U+0395>p<U+03AF>s<U+03BA>e<U+03C8><U+03B7> p<U+03C9><U+03BB><U+03B7>t<U+03AE>", "F...
## $ unit_cost <dbl> 15.62, 49.69, 80.00, 23.53, 176.47, 39.0...
## $ order_method_tc <chr> "<U+696D><U+52D9><U+62DC><U+8A2A>", "<U+696D><U+52D9><U+62DC><U+8A2A>", "<U+50B3><U+771F>", "<U+50B3><U+771F>", "<U+50B3><U+771F>", "<U+50B3><U+771F>", ...
## $ order_detail_code_2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ retailer_name <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ retailer_name_mb <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ order_method_cs <chr> "Návšteva prodejce", "Návšteva prodejce"...
## $ order_method_hu <chr> "Kiszállásos eladás", "Kiszállásos eladá...
## $ order_method_ms <chr> "Lawatan jualan", "Lawatan jualan", "Fak...
## $ order_method_th <chr> "<U+0E40><U+0E22><U+0E35><U+0E48><U+0E22><U+0E21><U+0E25><U+0E39><U+0E01><U+0E04><U+0E49><U+0E32>", "<U+0E40><U+0E22><U+0E35><U+0E48><U+0E22><U+0E21><U+0E25><U+0E39><U+0E01><U+0E04><U+0E49><U+0E32>", "<U+0E41><U+0E1F><U+0E01><U+0E0B><U+0E4C>",...
## $ quantity <dbl> 256, 92, 162, 172, 74, 90, 422, 3252, 11...
## $ order_number_2 <dbl> 100001, 100001, 100002, 100002, 100002, ...
## $ sales_branch_code <dbl> 21, 21, 30, 30, 30, 30, 30, 30, 30, 30, ...
## $ promotion_code <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ order_method_id <chr> "Kunjungan penjualan", "Kunjungan penjua...
## $ order_number <dbl> 100001, 100001, 100002, 100002, 100002, ...
## $ order_method_sc <chr> "<U+9500><U+552E><U+9762><U+8BBF>", "<U+9500><U+552E><U+9762><U+8BBF>", "<U+4F20><U+771F>", "<U+4F20><U+771F>", "<U+4F20><U+771F>", "<U+4F20><U+771F>", ...
## $ order_method_fi <chr> "Myyntikäynti", "Myyntikäynti", "Faksi",...
## $ order_method_ru <chr> "<U+0422><U+043E><U+0440><U+0433>. <U+0430><U+0433><U+0435><U+043D><U+0442>", "<U+0422><U+043E><U+0440><U+0433>. <U+0430><U+0433><U+0435><U+043D><U+0442>", "<U+0424><U+0430><U+043A><U+0441>", "<U+0424>...
## $ retailer_site_code <dbl> 20530, 20530, 20895, 20895, 20895, 20895...
## $ unit_sale_price <dbl> 33.69, 102.30, 111.31, 38.90, 334.43, 75...
## $ order_method_da <chr> "Sælgerbesøg", "Sælgerbesøg", "Fax", "Fa...
## $ order_method_pl <chr> "Wizyta", "Wizyta", "Faks", "Faks", "Fak...
## $ order_method_de <chr> "Verkaufsbesuch", "Verkaufsbesuch", "Fax...
## $ order_method_ko <chr> "<U+BC29><U+BB38><U+D310><U+B9E4>", "<U+BC29><U+BB38><U+D310><U+B9E4>", "<U+D329><U+C2A4>", "<U+D329><U+C2A4>", "<U+D329><U+C2A4>", "<U+D329><U+C2A4>", ...
## $ ship_date <dttm> 2004-01-19, 2004-02-17, 2004-01-19, 200...
## $ order_method_sv <chr> "Säljbesök", "Säljbesök", "Fax", "Fax", ...
## $ order_method_fr <chr> "Visite d'un représentant", "Visite d'un...
## $ order_method_code_2 <dbl> 6, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ unit_price <dbl> 35.09, 110.00, 119.69, 40.52, 359.60, 81...
## $ order_method_pt <chr> "Visita vendas", "Visita vendas", "Fax",...
## $ order_method_es <chr> "Visita de ventas", "Visita de ventas", ...
## $ order_close_date <dttm> 2004-02-17, 2004-02-17, 2004-01-19, 200...
## $ retailer_contact_code <dbl> 3549, 3549, 3915, 3915, 3915, 3915, 3915...
## $ order_method_code <dbl> 6, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ order_method_ja <chr> "<U+5E97><U+982D>", "<U+5E97><U+982D>", "<U+30D5><U+30A1><U+30C3><U+30AF><U+30B9>", "<U+30D5><U+30A1><U+30C3><U+30AF><U+30B9>", "<U+30D5><U+30A1><U+30C3><U+30AF><U+30B9>", "...
## $ product_number <dbl> 41110, 69110, 75110, 76110, 85110, 65110...A bit of a clean up is required as there are 45 orders variables but most of them are not necessary. For instance, order_method is available in many languages but I only need the one in English.
I start by removing all order_method other than English.
orders <-
orders %>%
rename(
order_method = order_method_en,
retailer = retailer_name
) %>%
select(
-contains('order_method_')
) orders %>% glimpse()
## Observations: 446,023
## Variables: 21
## $ order_date <dttm> 2004-01-12, 2004-01-12, 2004-01-12, 200...
## $ sales_staff_code <dbl> 10798, 10798, 10406, 10406, 10406, 10406...
## $ order_method <chr> "Sales visit", "Sales visit", "Fax", "Fa...
## $ order_detail_code <dbl> 1000001, 1000002, 1000013, 1000014, 1000...
## $ return_count <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_cost <dbl> 15.62, 49.69, 80.00, 23.53, 176.47, 39.0...
## $ order_detail_code_2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ retailer <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ retailer_name_mb <chr> "Kavanagh Sports", "Kavanagh Sports", "A...
## $ quantity <dbl> 256, 92, 162, 172, 74, 90, 422, 3252, 11...
## $ order_number_2 <dbl> 100001, 100001, 100002, 100002, 100002, ...
## $ sales_branch_code <dbl> 21, 21, 30, 30, 30, 30, 30, 30, 30, 30, ...
## $ promotion_code <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ order_number <dbl> 100001, 100001, 100002, 100002, 100002, ...
## $ retailer_site_code <dbl> 20530, 20530, 20895, 20895, 20895, 20895...
## $ unit_sale_price <dbl> 33.69, 102.30, 111.31, 38.90, 334.43, 75...
## $ ship_date <dttm> 2004-01-19, 2004-02-17, 2004-01-19, 200...
## $ unit_price <dbl> 35.09, 110.00, 119.69, 40.52, 359.60, 81...
## $ order_close_date <dttm> 2004-02-17, 2004-02-17, 2004-01-19, 200...
## $ retailer_contact_code <dbl> 3549, 3549, 3915, 3915, 3915, 3915, 3915...
## $ product_number <dbl> 41110, 69110, 75110, 76110, 85110, 65110...Then I load the products file, which, as it says on the tin, contains the products information
products %>% glimpse()
## Observations: 274
## Variables: 20
## $ product_description <chr> "The Venue has a fun and fashionable overs...
## $ base_product_number <dbl> 125, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12...
## $ product_line <chr> "Personal Accessories", "Camping Equipment...
## $ production_cost <dbl> 0.00, 4.00, 9.22, 15.93, 5.00, 34.97, 85.1...
## $ product_image <chr> "P65PA3WT10.jpg", "P01CE1CG1.jpg", "P02CE1...
## $ product_type <chr> "Watches", "Cooking Gear", "Cooking Gear",...
## $ color <chr> "Red", "Clear", "Brown", "Unspecified", "S...
## $ gross_margin <dbl> 0.00, 0.33, 0.23, 0.28, 0.28, 0.30, 0.28, ...
## $ product_line_code <dbl> 993, 991, 991, 991, 991, 991, 991, 991, 99...
## $ product_name <chr> "Venue", "TrailChef Water Bag", "TrailChef...
## $ introduction_date <dttm> 2005-04-01, 1995-02-15, 1995-02-15, 1995-...
## $ title <chr> "Venue", "TrailChef Water Bag", "TrailChef...
## $ product_type_code <dbl> 960, 951, 951, 951, 951, 951, 951, 951, 95...
## $ product_size <chr> "One-size", "10 liters", "2 liters", "15-p...
## $ brand <chr> "Relax", "TrailChef", "TrailChef", "TrailC...
## $ discontinued_date <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ product_brand_code <dbl> 756, 701, 701, 701, 701, 701, 701, 701, 70...
## $ product_size_code <dbl> 853, 808, 807, 825, 804, 823, 824, 845, 84...
## $ product_color_code <dbl> 922, 908, 906, 924, 923, 923, 923, 923, 91...
## $ product_number <dbl> 125150, 1110, 2110, 3110, 4110, 5110, 6110...Finally, the retailers file
retailers %>% glimpse()
## Observations: 847
## Variables: 13
## $ REGION_CODE <dbl> 710, 710, 710, 710, 710, 710, 710, 710, 710...
## $ REGION_EN <chr> "Americas", "Americas", "Americas", "Americ...
## $ COUNTRY_CODE <dbl> 1004, 1004, 1004, 1004, 1004, 1004, 1004, 1...
## $ COUNTRY_KEY <dbl> 90002, 90002, 90002, 90002, 90002, 90002, 9...
## $ COUNTRY_EN <chr> "Canada", "Canada", "Canada", "Canada", "Ca...
## $ RETAILER_KEY <dbl> 6789, 6789, 6789, 6790, 6790, 6790, 6790, 6...
## $ RETAILER_CODE <dbl> 1189, 1189, 1189, 1190, 1190, 1190, 1190, 1...
## $ RETAILER_NAME <chr> "Rock Steady", "Rock Steady", "Rock Steady"...
## $ RETAILER_SITE_KEY <dbl> 5001, 5002, 5003, 5004, 5005, 5006, 5007, 5...
## $ RETAILER_SITE_CODE <dbl> 20001, 20002, 20003, 20004, 20005, 20006, 2...
## $ RTL_CITY <chr> "Winnipeg", "Montr\xe9al", "Fredericton", "...
## $ RETAILER_TYPE_CODE <dbl> 6, 6, 6, 8, 8, 8, 8, 3, 2, 2, 2, 2, 2, 1, 7...
## $ RETAILER_TYPE_EN <chr> "Outdoors Shop", "Outdoors Shop", "Outdoors...As R is case sensitive, I’m setting retailers column names to lower-case
Now I can joining orders, products and retailers info into one data frame
One long piece of code to sort all in one go.
I’ve added comments to each block to explain what it’s doing.
orders_tbl <-
# create revenue, total product cost and gross profit
df %>%
mutate(
revenue = quantity * unit_price,
tot_prod_cost = quantity * unit_cost,
gross_profit = revenue - tot_prod_cost
) %>%
# replacing NAs in the return_count variable
replace_na(list(return_count = 0)) %>%
# Shorten product line names for readablility and ease of use in charts and code
mutate(
prod_line = case_when(
product_line == 'Camping Equipment' ~ 'Camping Eqpt',
product_line == 'Golf Equipment' ~ 'Golf Eqpt',
product_line == 'Mountaineering Equipment' ~ 'Mountain Eqpt',
product_line == 'Personal Accessories' ~ 'Personal Acces',
product_line == 'Outdoor Protection' ~ 'Outdoor Prot',
TRUE ~ product_line
),
prod_line_2 = case_when(
product_line == 'Camping Equipment' ~ 'Camping Eqpt',
product_line == 'Golf Equipment' ~ 'Golf Eqpt',
product_line == 'Mountaineering Equipment' ~ 'Mountain Eqpt',
product_line == 'Personal Accessories' ~ 'Personal Acces',
product_line == 'Outdoor Protection' ~ 'Personal Acces',
TRUE ~ product_line
),
# create alternative regional group
region2 = case_when(
country_en %in% c('United Kingdom', 'France', 'Spain',
'Netherlands','Belgium','Switzerland') ~ 'West Europe',
country_en %in% c('Germany', 'Italy', 'Finland',
'Austria','Sweden','Denmark') ~ 'East Europe',
TRUE ~ region_en
)
) %>%
# create financial years
mutate(
ord_date = ymd(order_date),
fin_year = case_when(
between(ord_date, ymd("2004-07-01"), ymd('2005-06-30')) ~ 'FY_04_05',
between(ord_date, ymd("2005-07-01"), ymd('2006-06-30')) ~ 'FY_05_06',
between(ord_date, ymd("2006-07-01"), ymd('2007-06-30')) ~ 'FY_06_07',
TRUE ~ 'other'
),
# create all quarters
quarter_all = case_when(
between(ord_date, ymd("2004-01-01"), ymd('2004-03-31')) ~ '04_Q1',
between(ord_date, ymd("2004-04-01"), ymd('2004-06-30')) ~ '04_Q2',
between(ord_date, ymd("2004-07-01"), ymd('2004-09-30')) ~ '04_Q3',
between(ord_date, ymd("2004-10-01"), ymd('2004-12-31')) ~ '04_Q4',
between(ord_date, ymd("2005-01-01"), ymd('2005-03-31')) ~ '05_Q1',
between(ord_date, ymd("2005-04-01"), ymd('2005-06-30')) ~ '05_Q2',
between(ord_date, ymd("2005-07-01"), ymd('2005-09-30')) ~ '05_Q3',
between(ord_date, ymd("2005-10-01"), ymd('2005-12-31')) ~ '05_Q4',
between(ord_date, ymd("2006-01-01"), ymd('2006-03-31')) ~ '06_Q1',
between(ord_date, ymd("2006-04-01"), ymd('2006-06-30')) ~ '06_Q2',
between(ord_date, ymd("2006-07-01"), ymd('2006-09-30')) ~ '06_Q3',
between(ord_date, ymd("2006-10-01"), ymd('2006-12-31')) ~ '06_Q4',
between(ord_date, ymd("2007-01-01"), ymd('2007-03-31')) ~ '07_Q1',
between(ord_date, ymd("2007-04-01"), ymd('2007-06-30')) ~ '07_Q2',
between(ord_date, ymd("2007-07-01"), ymd('2007-09-30')) ~ '07_Q3',
TRUE ~ 'other'
),
# create selected quarters
quarter_sel = case_when(
between(ord_date, ymd("2004-07-01"), ymd('2004-09-30')) ~ '04_Q3',
between(ord_date, ymd("2004-10-01"), ymd('2004-12-31')) ~ '04_Q4',
between(ord_date, ymd("2005-01-01"), ymd('2005-03-31')) ~ '05_Q1',
between(ord_date, ymd("2005-04-01"), ymd('2005-06-30')) ~ '05_Q2',
between(ord_date, ymd("2005-07-01"), ymd('2005-09-30')) ~ '05_Q3',
between(ord_date, ymd("2005-10-01"), ymd('2005-12-31')) ~ '05_Q4',
between(ord_date, ymd("2006-01-01"), ymd('2006-03-31')) ~ '06_Q1',
between(ord_date, ymd("2006-04-01"), ymd('2006-06-30')) ~ '06_Q2',
between(ord_date, ymd("2006-07-01"), ymd('2006-09-30')) ~ '06_Q3',
between(ord_date, ymd("2006-10-01"), ymd('2006-12-31')) ~ '06_Q4',
between(ord_date, ymd("2007-01-01"), ymd('2007-03-31')) ~ '07_Q1',
between(ord_date, ymd("2007-04-01"), ymd('2007-06-30')) ~ '07_Q2',
TRUE ~ 'other'
)
) %>%
# reorder columns and rename a few
select(
order_number,
order_date,
order_close_date,
order_ship_date = ship_date,
fin_year,
quarter_all,
quarter_sel,
order_method,
retailer,
retailer_code,
retailer_type = retailer_type_en,
region = region_en,
region2,
country = country_en,
city = rtl_city,
promotion_code,
return = return_count,
quantity,
unit_price,
unit_sale_price,
unit_cost,
unit_prod_cost = production_cost,
unit_gross_marg = gross_margin,
revenue,
tot_prod_cost,
gross_profit,
prod_numb = product_number,
prod_line,
prod_line_2,
prod_type = product_type,
prod_name = product_name,
brand,
color,
size = product_size,
intro_date = introduction_date,
halt_date = discontinued_date
) orders_tbl %>% glimpse()
## Observations: 446,023
## Variables: 36
## $ order_number <dbl> 100001, 100001, 100002, 100002, 100002, 10000...
## $ order_date <dttm> 2004-01-12, 2004-01-12, 2004-01-12, 2004-01-...
## $ order_close_date <dttm> 2004-02-17, 2004-02-17, 2004-01-19, 2004-01-...
## $ order_ship_date <dttm> 2004-01-19, 2004-02-17, 2004-01-19, 2004-01-...
## $ fin_year <chr> "other", "other", "other", "other", "other", ...
## $ quarter_all <chr> "04_Q1", "04_Q1", "04_Q1", "04_Q1", "04_Q1", ...
## $ quarter_sel <chr> "other", "other", "other", "other", "other", ...
## $ order_method <chr> "Sales visit", "Sales visit", "Fax", "Fax", "...
## $ retailer <chr> "Kavanagh Sports", "Kavanagh Sports", "Ar fre...
## $ retailer_code <dbl> 1398, 1398, 1760, 1760, 1760, 1760, 1760, 176...
## $ retailer_type <chr> "Sports Store", "Sports Store", "Outdoors Sho...
## $ region <chr> "Americas", "Americas", "Americas", "Americas...
## $ region2 <chr> "Americas", "Americas", "Americas", "Americas...
## $ country <chr> "United States", "United States", "Brazil", "...
## $ city <chr> "San Clara", "San Clara", "Sao Paulo", "Sao P...
## $ promotion_code <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ return <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ quantity <dbl> 256, 92, 162, 172, 74, 90, 422, 3252, 1107, 8...
## $ unit_price <dbl> 35.09, 110.00, 119.69, 40.52, 359.60, 81.55, ...
## $ unit_sale_price <dbl> 33.69, 102.30, 111.31, 38.90, 334.43, 75.84, ...
## $ unit_cost <dbl> 15.62, 49.69, 80.00, 23.53, 176.47, 39.00, 2....
## $ unit_prod_cost <dbl> 24.62, 73.33, 80.00, 23.53, 176.47, 39.00, 3....
## $ unit_gross_marg <dbl> 0.23, 0.33, 0.28, 0.40, 0.40, 0.50, 0.50, 0.6...
## $ revenue <dbl> 8983.04, 10120.00, 19389.78, 6969.44, 26610.4...
## $ tot_prod_cost <dbl> 3998.72, 4571.48, 12960.00, 4047.16, 13058.78...
## $ gross_profit <dbl> 4984.32, 5548.52, 6429.78, 2922.28, 13551.62,...
## $ prod_numb <dbl> 41110, 69110, 75110, 76110, 85110, 65110, 100...
## $ prod_line <chr> "Camping Eqpt", "Personal Acces", "Personal A...
## $ prod_line_2 <chr> "Camping Eqpt", "Personal Acces", "Personal A...
## $ prod_type <chr> "Lanterns", "Eyewear", "Knives", "Knives", "N...
## $ prod_name <chr> "Flicker Lantern", "Polar Ice", "Edge Extreme...
## $ brand <chr> "Firefly", "Polar", "Extreme", "Edge", "Extre...
## $ color <chr> "Orange", "Black", "Red", "Black", "Black", "...
## $ size <chr> "8-hour", "Men's", "14.5 cm", "33 cm", "Unspe...
## $ intro_date <dttm> 2005-01-10, 1995-02-15, 2000-10-26, 1995-02-...
## $ halt_date <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...Last but not least, I can drop all original files and save the orders_tbl
# remove original files
rm(df, products, retailers, orders)
# Save orders as RDS
orders_tbl %>%
write_rds("../00_data/orders_tbl.rds")The full R code can be found on my GitHub profile
NOTE: given their size, the RDS file and the order-details folder had to be compress before uploading them on my Github profile