Soy Lenin Pavón, ayudante de la Facultad de Ciencias de la UNAM. Me gusta meditar, amo a los gatos y me encanta la cafeína. De igual manera, consumo mate de manera regular, mi marca favorita es Rosamonte aunque a veces es un poco amargo.
Mis expectativas del curso son aprender a usar R de una manera profesional y aprender lo suficiente para que mis alumnos no me chamaqueen. De igual manera, he querido entender de forma computacional la diferencia entre inferencia y predicción.
## _
## platform x86_64-apple-darwin20
## arch x86_64
## os darwin20
## system x86_64, darwin20
## status
## major 4
## minor 5.3
## year 2026
## month 03
## day 11
## svn rev 89597
## language R
## version.string R version 4.5.3 (2026-03-11)
## nickname Reassured Reassurer
## [1] 8
## [1] "LENIN"
Primero montaremos nuestro sistema importando
tidyverse
Ahora cargaremos los siguientes datasets -
olist_orders_dataset.csv -
olist_order_items_dataset.csv -
olist_products_dataset.csv
orders <- read_csv("data/olist_orders_dataset.csv")
#> Rows: 99441 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (3): order_id, customer_id, order_status
#> dttm (5): order_purchase_timestamp, order_approved_at, order_delivered_carri...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
items <- read_csv("data/olist_order_items_dataset.csv")
#> Rows: 112650 Columns: 7
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (3): order_id, product_id, seller_id
#> dbl (3): order_item_id, price, freight_value
#> dttm (1): shipping_limit_date
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
products <- read_csv("data/olist_products_dataset.csv")
#> Rows: 32951 Columns: 9
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): product_id, product_category_name
#> dbl (7): product_name_lenght, product_description_lenght, product_photos_qty...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.print(str(orders))
#> spc_tbl_ [99,441 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ order_id : chr [1:99441] "e481f51cbdc54678b7cc49136f2d6af7" "53cdb2fc8bc7dce0b6741e2150273451" "47770eb9100c2d0c44946d9cf07ec65d" "949d5b44dbf5de918fe9c16f97b45f8a" ...
#> $ customer_id : chr [1:99441] "9ef432eb6251297304e76186b10a928d" "b0830fb4747a6c6d20dea0b8c802d7ef" "41ce2a54c0b03bf3443c3d931a367089" "f88197465ea7920adcdbec7375364d82" ...
#> $ order_status : chr [1:99441] "delivered" "delivered" "delivered" "delivered" ...
#> $ order_purchase_timestamp : POSIXct[1:99441], format: "2017-10-02 10:56:33" "2018-07-24 20:41:37" ...
#> $ order_approved_at : POSIXct[1:99441], format: "2017-10-02 11:07:15" "2018-07-26 03:24:27" ...
#> $ order_delivered_carrier_date : POSIXct[1:99441], format: "2017-10-04 19:55:00" "2018-07-26 14:31:00" ...
#> $ order_delivered_customer_date: POSIXct[1:99441], format: "2017-10-10 21:25:13" "2018-08-07 15:27:45" ...
#> $ order_estimated_delivery_date: POSIXct[1:99441], format: "2017-10-18" "2018-08-13" ...
#> - attr(*, "spec")=
#> .. cols(
#> .. order_id = col_character(),
#> .. customer_id = col_character(),
#> .. order_status = col_character(),
#> .. order_purchase_timestamp = col_datetime(format = ""),
#> .. order_approved_at = col_datetime(format = ""),
#> .. order_delivered_carrier_date = col_datetime(format = ""),
#> .. order_delivered_customer_date = col_datetime(format = ""),
#> .. order_estimated_delivery_date = col_datetime(format = "")
#> .. )
#> - attr(*, "problems")=<externalptr>
#> NULL
print(summary(orders))
#> order_id customer_id order_status
#> Length:99441 Length:99441 Length:99441
#> Class :character Class :character Class :character
#> Mode :character Mode :character Mode :character
#>
#>
#>
#>
#> order_purchase_timestamp order_approved_at
#> Min. :2016-09-04 21:15:19 Min. :2016-09-15 12:16:38
#> 1st Qu.:2017-09-12 14:46:19 1st Qu.:2017-09-12 23:24:16
#> Median :2018-01-18 23:04:36 Median :2018-01-19 11:36:13
#> Mean :2017-12-31 08:43:12 Mean :2017-12-31 18:35:24
#> 3rd Qu.:2018-05-04 15:42:16 3rd Qu.:2018-05-04 20:35:10
#> Max. :2018-10-17 17:30:18 Max. :2018-09-03 17:40:06
#> NA's :160
#> order_delivered_carrier_date order_delivered_customer_date
#> Min. :2016-10-08 10:34:01 Min. :2016-10-11 13:46:32
#> 1st Qu.:2017-09-15 22:28:50 1st Qu.:2017-09-25 22:07:22
#> Median :2018-01-24 16:10:58 Median :2018-02-02 19:28:10
#> Mean :2018-01-04 21:49:48 Mean :2018-01-14 12:09:19
#> 3rd Qu.:2018-05-08 13:37:45 3rd Qu.:2018-05-15 22:48:52
#> Max. :2018-09-11 19:48:28 Max. :2018-10-17 13:22:46
#> NA's :1783 NA's :2965
#> order_estimated_delivery_date
#> Min. :2016-09-30 00:00:00
#> 1st Qu.:2017-10-03 00:00:00
#> Median :2018-02-15 00:00:00
#> Mean :2018-01-24 03:08:37
#> 3rd Qu.:2018-05-25 00:00:00
#> Max. :2018-11-12 00:00:00
#>
print(names(orders))
#> [1] "order_id" "customer_id"
#> [3] "order_status" "order_purchase_timestamp"
#> [5] "order_approved_at" "order_delivered_carrier_date"
#> [7] "order_delivered_customer_date" "order_estimated_delivery_date"
print(head(orders))
#> # A tibble: 6 × 8
#> order_id customer_id order_status order_purchase_times…¹ order_approved_at
#> <chr> <chr> <chr> <dttm> <dttm>
#> 1 e481f51cb… 9ef432eb62… delivered 2017-10-02 10:56:33 2017-10-02 11:07:15
#> 2 53cdb2fc8… b0830fb474… delivered 2018-07-24 20:41:37 2018-07-26 03:24:27
#> 3 47770eb91… 41ce2a54c0… delivered 2018-08-08 08:38:49 2018-08-08 08:55:23
#> 4 949d5b44d… f88197465e… delivered 2017-11-18 19:28:06 2017-11-18 19:45:59
#> 5 ad21c59c0… 8ab97904e6… delivered 2018-02-13 21:18:39 2018-02-13 22:20:29
#> 6 a4591c265… 503740e9ca… delivered 2017-07-09 21:57:05 2017-07-09 22:10:13
#> # ℹ abbreviated name: ¹order_purchase_timestamp
#> # ℹ 3 more variables: order_delivered_carrier_date <dttm>,
#> # order_delivered_customer_date <dttm>, order_estimated_delivery_date <dttm>
print(dim(orders))
#> [1] 99441 8print(str(items))
#> spc_tbl_ [112,650 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ order_id : chr [1:112650] "00010242fe8c5a6d1ba2dd792cb16214" "00018f77f2f0320c557190d7a144bdd3" "000229ec398224ef6ca0657da4fc703e" "00024acbcdf0a6daa1e931b038114c75" ...
#> $ order_item_id : num [1:112650] 1 1 1 1 1 1 1 1 1 1 ...
#> $ product_id : chr [1:112650] "4244733e06e7ecb4970a6e2683c13e61" "e5f2d52b802189ee658865ca93d83a8f" "c777355d18b72b67abbeef9df44fd0fd" "7634da152a4610f1595efa32f14722fc" ...
#> $ seller_id : chr [1:112650] "48436dade18ac8b2bce089ec2a041202" "dd7ddc04e1b6c2c614352b383efe2d36" "5b51032eddd242adc84c38acab88f23d" "9d7a1d34a5052409006425275ba1c2b4" ...
#> $ shipping_limit_date: POSIXct[1:112650], format: "2017-09-19 09:45:35" "2017-05-03 11:05:13" ...
#> $ price : num [1:112650] 58.9 239.9 199 13 199.9 ...
#> $ freight_value : num [1:112650] 13.3 19.9 17.9 12.8 18.1 ...
#> - attr(*, "spec")=
#> .. cols(
#> .. order_id = col_character(),
#> .. order_item_id = col_double(),
#> .. product_id = col_character(),
#> .. seller_id = col_character(),
#> .. shipping_limit_date = col_datetime(format = ""),
#> .. price = col_double(),
#> .. freight_value = col_double()
#> .. )
#> - attr(*, "problems")=<externalptr>
#> NULL
print(summary(items))
#> order_id order_item_id product_id seller_id
#> Length:112650 Min. : 1.000 Length:112650 Length:112650
#> Class :character 1st Qu.: 1.000 Class :character Class :character
#> Mode :character Median : 1.000 Mode :character Mode :character
#> Mean : 1.198
#> 3rd Qu.: 1.000
#> Max. :21.000
#> shipping_limit_date price freight_value
#> Min. :2016-09-19 00:15:34 Min. : 0.85 Min. : 0.00
#> 1st Qu.:2017-09-20 20:57:27 1st Qu.: 39.90 1st Qu.: 13.08
#> Median :2018-01-26 13:59:35 Median : 74.99 Median : 16.26
#> Mean :2018-01-07 15:36:52 Mean : 120.65 Mean : 19.99
#> 3rd Qu.:2018-05-10 14:34:00 3rd Qu.: 134.90 3rd Qu.: 21.15
#> Max. :2020-04-09 22:35:08 Max. :6735.00 Max. :409.68
print(names(items))
#> [1] "order_id" "order_item_id" "product_id"
#> [4] "seller_id" "shipping_limit_date" "price"
#> [7] "freight_value"
print(head(items))
#> # A tibble: 6 × 7
#> order_id order_item_id product_id seller_id shipping_limit_date price
#> <chr> <dbl> <chr> <chr> <dttm> <dbl>
#> 1 00010242fe8c5a6d… 1 4244733e0… 48436dad… 2017-09-19 09:45:35 58.9
#> 2 00018f77f2f0320c… 1 e5f2d52b8… dd7ddc04… 2017-05-03 11:05:13 240.
#> 3 000229ec398224ef… 1 c777355d1… 5b51032e… 2018-01-18 14:48:30 199
#> 4 00024acbcdf0a6da… 1 7634da152… 9d7a1d34… 2018-08-15 10:10:18 13.0
#> 5 00042b26cf59d7ce… 1 ac6c36230… df560393… 2017-02-13 13:57:51 200.
#> 6 00048cc3ae777c65… 1 ef92defde… 6426d21a… 2017-05-23 03:55:27 21.9
#> # ℹ 1 more variable: freight_value <dbl>
print(dim(items))
#> [1] 112650 7print(str(products))
#> spc_tbl_ [32,951 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ product_id : chr [1:32951] "1e9e8ef04dbcff4541ed26657ea517e5" "3aa071139cb16b67ca9e5dea641aaa2f" "96bd76ec8810374ed1b65e291975717f" "cef67bcfe19066a932b7673e239eb23d" ...
#> $ product_category_name : chr [1:32951] "perfumaria" "artes" "esporte_lazer" "bebes" ...
#> $ product_name_lenght : num [1:32951] 40 44 46 27 37 60 56 56 57 36 ...
#> $ product_description_lenght: num [1:32951] 287 276 250 261 402 ...
#> $ product_photos_qty : num [1:32951] 1 1 1 1 4 1 4 2 1 1 ...
#> $ product_weight_g : num [1:32951] 225 1000 154 371 625 ...
#> $ product_length_cm : num [1:32951] 16 30 18 26 20 38 70 40 27 17 ...
#> $ product_height_cm : num [1:32951] 10 18 9 4 17 5 24 8 13 10 ...
#> $ product_width_cm : num [1:32951] 14 20 15 26 13 11 44 40 17 12 ...
#> - attr(*, "spec")=
#> .. cols(
#> .. product_id = col_character(),
#> .. product_category_name = col_character(),
#> .. product_name_lenght = col_double(),
#> .. product_description_lenght = col_double(),
#> .. product_photos_qty = col_double(),
#> .. product_weight_g = col_double(),
#> .. product_length_cm = col_double(),
#> .. product_height_cm = col_double(),
#> .. product_width_cm = col_double()
#> .. )
#> - attr(*, "problems")=<externalptr>
#> NULL
print(summary(products))
#> product_id product_category_name product_name_lenght
#> Length:32951 Length:32951 Min. : 5.00
#> Class :character Class :character 1st Qu.:42.00
#> Mode :character Mode :character Median :51.00
#> Mean :48.48
#> 3rd Qu.:57.00
#> Max. :76.00
#> NA's :610
#> product_description_lenght product_photos_qty product_weight_g
#> Min. : 4.0 Min. : 1.000 Min. : 0
#> 1st Qu.: 339.0 1st Qu.: 1.000 1st Qu.: 300
#> Median : 595.0 Median : 1.000 Median : 700
#> Mean : 771.5 Mean : 2.189 Mean : 2276
#> 3rd Qu.: 972.0 3rd Qu.: 3.000 3rd Qu.: 1900
#> Max. :3992.0 Max. :20.000 Max. :40425
#> NA's :610 NA's :610 NA's :2
#> product_length_cm product_height_cm product_width_cm
#> Min. : 7.00 Min. : 2.00 Min. : 6.0
#> 1st Qu.: 18.00 1st Qu.: 8.00 1st Qu.: 15.0
#> Median : 25.00 Median : 13.00 Median : 20.0
#> Mean : 30.82 Mean : 16.94 Mean : 23.2
#> 3rd Qu.: 38.00 3rd Qu.: 21.00 3rd Qu.: 30.0
#> Max. :105.00 Max. :105.00 Max. :118.0
#> NA's :2 NA's :2 NA's :2
print(names(products))
#> [1] "product_id" "product_category_name"
#> [3] "product_name_lenght" "product_description_lenght"
#> [5] "product_photos_qty" "product_weight_g"
#> [7] "product_length_cm" "product_height_cm"
#> [9] "product_width_cm"
print(head(products))
#> # A tibble: 6 × 9
#> product_id product_category_name product_name_lenght product_description_…¹
#> <chr> <chr> <dbl> <dbl>
#> 1 1e9e8ef04dbc… perfumaria 40 287
#> 2 3aa071139cb1… artes 44 276
#> 3 96bd76ec8810… esporte_lazer 46 250
#> 4 cef67bcfe190… bebes 27 261
#> 5 9dc1a7de2744… utilidades_domesticas 37 402
#> 6 41d3672d4792… instrumentos_musicais 60 745
#> # ℹ abbreviated name: ¹product_description_lenght
#> # ℹ 5 more variables: product_photos_qty <dbl>, product_weight_g <dbl>,
#> # product_length_cm <dbl>, product_height_cm <dbl>, product_width_cm <dbl>
print(dim(products))
#> [1] 32951 9Veamos ahora a detalle las columnas en cada dataset
glimpse(orders)
#> Rows: 99,441
#> Columns: 8
#> $ order_id <chr> "e481f51cbdc54678b7cc49136f2d6af7", "53c…
#> $ customer_id <chr> "9ef432eb6251297304e76186b10a928d", "b08…
#> $ order_status <chr> "delivered", "delivered", "delivered", "…
#> $ order_purchase_timestamp <dttm> 2017-10-02 10:56:33, 2018-07-24 20:41:3…
#> $ order_approved_at <dttm> 2017-10-02 11:07:15, 2018-07-26 03:24:2…
#> $ order_delivered_carrier_date <dttm> 2017-10-04 19:55:00, 2018-07-26 14:31:0…
#> $ order_delivered_customer_date <dttm> 2017-10-10 21:25:13, 2018-08-07 15:27:4…
#> $ order_estimated_delivery_date <dttm> 2017-10-18, 2018-08-13, 2018-09-04, 201…glimpse(items)
#> Rows: 112,650
#> Columns: 7
#> $ order_id <chr> "00010242fe8c5a6d1ba2dd792cb16214", "00018f77f2f03…
#> $ order_item_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,…
#> $ product_id <chr> "4244733e06e7ecb4970a6e2683c13e61", "e5f2d52b80218…
#> $ seller_id <chr> "48436dade18ac8b2bce089ec2a041202", "dd7ddc04e1b6c…
#> $ shipping_limit_date <dttm> 2017-09-19 09:45:35, 2017-05-03 11:05:13, 2018-01…
#> $ price <dbl> 58.90, 239.90, 199.00, 12.99, 199.90, 21.90, 19.90…
#> $ freight_value <dbl> 13.29, 19.93, 17.87, 12.79, 18.14, 12.69, 11.85, 7…glimpse(products)
#> Rows: 32,951
#> Columns: 9
#> $ product_id <chr> "1e9e8ef04dbcff4541ed26657ea517e5", "3aa071…
#> $ product_category_name <chr> "perfumaria", "artes", "esporte_lazer", "be…
#> $ product_name_lenght <dbl> 40, 44, 46, 27, 37, 60, 56, 56, 57, 36, 54,…
#> $ product_description_lenght <dbl> 287, 276, 250, 261, 402, 745, 1272, 184, 16…
#> $ product_photos_qty <dbl> 1, 1, 1, 1, 4, 1, 4, 2, 1, 1, 1, 4, 3, 2, 4…
#> $ product_weight_g <dbl> 225, 1000, 154, 371, 625, 200, 18350, 900, …
#> $ product_length_cm <dbl> 16, 30, 18, 26, 20, 38, 70, 40, 27, 17, 16,…
#> $ product_height_cm <dbl> 10, 18, 9, 4, 17, 5, 24, 8, 13, 10, 10, 19,…
#> $ product_width_cm <dbl> 14, 20, 15, 26, 13, 11, 44, 40, 17, 12, 16,…Veamos cuantos datos faltan por dataset
orders |> is.na() |> colSums()
#> order_id customer_id
#> 0 0
#> order_status order_purchase_timestamp
#> 0 0
#> order_approved_at order_delivered_carrier_date
#> 160 1783
#> order_delivered_customer_date order_estimated_delivery_date
#> 2965 0products |> is.na() |> colSums()
#> product_id product_category_name
#> 0 610
#> product_name_lenght product_description_lenght
#> 610 610
#> product_photos_qty product_weight_g
#> 610 2
#> product_length_cm product_height_cm
#> 2 2
#> product_width_cm
#> 2Eliminando las filas que contienen al menos un valor uno de cada dataframe,
Revisando cómo quedaron tras la transformación
glimpse(orders)
#> Rows: 96,461
#> Columns: 8
#> $ order_id <chr> "e481f51cbdc54678b7cc49136f2d6af7", "53c…
#> $ customer_id <chr> "9ef432eb6251297304e76186b10a928d", "b08…
#> $ order_status <chr> "delivered", "delivered", "delivered", "…
#> $ order_purchase_timestamp <dttm> 2017-10-02 10:56:33, 2018-07-24 20:41:3…
#> $ order_approved_at <dttm> 2017-10-02 11:07:15, 2018-07-26 03:24:2…
#> $ order_delivered_carrier_date <dttm> 2017-10-04 19:55:00, 2018-07-26 14:31:0…
#> $ order_delivered_customer_date <dttm> 2017-10-10 21:25:13, 2018-08-07 15:27:4…
#> $ order_estimated_delivery_date <dttm> 2017-10-18, 2018-08-13, 2018-09-04, 201…glimpse(products)
#> Rows: 32,340
#> Columns: 9
#> $ product_id <chr> "1e9e8ef04dbcff4541ed26657ea517e5", "3aa071…
#> $ product_category_name <chr> "perfumaria", "artes", "esporte_lazer", "be…
#> $ product_name_lenght <dbl> 40, 44, 46, 27, 37, 60, 56, 56, 57, 36, 54,…
#> $ product_description_lenght <dbl> 287, 276, 250, 261, 402, 745, 1272, 184, 16…
#> $ product_photos_qty <dbl> 1, 1, 1, 1, 4, 1, 4, 2, 1, 1, 1, 4, 3, 2, 4…
#> $ product_weight_g <dbl> 225, 1000, 154, 371, 625, 200, 18350, 900, …
#> $ product_length_cm <dbl> 16, 30, 18, 26, 20, 38, 70, 40, 27, 17, 16,…
#> $ product_height_cm <dbl> 10, 18, 9, 4, 17, 5, 24, 8, 13, 10, 10, 19,…
#> $ product_width_cm <dbl> 14, 20, 15, 26, 13, 11, 44, 40, 17, 12, 16,…Mantenemos los registros no duplicados
Observemos que en products dos de las columnas cuentan con errores de
ortografía en las columnas de products
product_name_lenghtproduct_description_lenghtrename(products, product_name_length = product_name_lenght)
#> # A tibble: 32,340 × 9
#> product_id product_category_name product_name_length product_description_…¹
#> <chr> <chr> <dbl> <dbl>
#> 1 1e9e8ef04db… perfumaria 40 287
#> 2 3aa071139cb… artes 44 276
#> 3 96bd76ec881… esporte_lazer 46 250
#> 4 cef67bcfe19… bebes 27 261
#> 5 9dc1a7de274… utilidades_domesticas 37 402
#> 6 41d3672d479… instrumentos_musicais 60 745
#> 7 732bd381ad0… cool_stuff 56 1272
#> 8 2548af3e6e7… moveis_decoracao 56 184
#> 9 37cc742be07… eletrodomesticos 57 163
#> 10 8c92109888e… brinquedos 36 1156
#> # ℹ 32,330 more rows
#> # ℹ abbreviated name: ¹product_description_lenght
#> # ℹ 5 more variables: product_photos_qty <dbl>, product_weight_g <dbl>,
#> # product_length_cm <dbl>, product_height_cm <dbl>, product_width_cm <dbl>
rename(products, product_description_length = product_description_lenght)
#> # A tibble: 32,340 × 9
#> product_id product_category_name product_name_lenght product_description_…¹
#> <chr> <chr> <dbl> <dbl>
#> 1 1e9e8ef04db… perfumaria 40 287
#> 2 3aa071139cb… artes 44 276
#> 3 96bd76ec881… esporte_lazer 46 250
#> 4 cef67bcfe19… bebes 27 261
#> 5 9dc1a7de274… utilidades_domesticas 37 402
#> 6 41d3672d479… instrumentos_musicais 60 745
#> 7 732bd381ad0… cool_stuff 56 1272
#> 8 2548af3e6e7… moveis_decoracao 56 184
#> 9 37cc742be07… eletrodomesticos 57 163
#> 10 8c92109888e… brinquedos 36 1156
#> # ℹ 32,330 more rows
#> # ℹ abbreviated name: ¹product_description_length
#> # ℹ 5 more variables: product_photos_qty <dbl>, product_weight_g <dbl>,
#> # product_length_cm <dbl>, product_height_cm <dbl>, product_width_cm <dbl>Para el dataframe products vemos que en la columna
product_category_name se encuentran los siguientes valores
distintos
unique(products$product_category_name)
#> [1] "perfumaria"
#> [2] "artes"
#> [3] "esporte_lazer"
#> [4] "bebes"
#> [5] "utilidades_domesticas"
#> [6] "instrumentos_musicais"
#> [7] "cool_stuff"
#> [8] "moveis_decoracao"
#> [9] "eletrodomesticos"
#> [10] "brinquedos"
#> [11] "cama_mesa_banho"
#> [12] "construcao_ferramentas_seguranca"
#> [13] "informatica_acessorios"
#> [14] "beleza_saude"
#> [15] "malas_acessorios"
#> [16] "ferramentas_jardim"
#> [17] "moveis_escritorio"
#> [18] "automotivo"
#> [19] "eletronicos"
#> [20] "fashion_calcados"
#> [21] "telefonia"
#> [22] "papelaria"
#> [23] "fashion_bolsas_e_acessorios"
#> [24] "pcs"
#> [25] "casa_construcao"
#> [26] "relogios_presentes"
#> [27] "construcao_ferramentas_construcao"
#> [28] "pet_shop"
#> [29] "eletroportateis"
#> [30] "agro_industria_e_comercio"
#> [31] "moveis_sala"
#> [32] "sinalizacao_e_seguranca"
#> [33] "climatizacao"
#> [34] "consoles_games"
#> [35] "livros_interesse_geral"
#> [36] "construcao_ferramentas_ferramentas"
#> [37] "fashion_underwear_e_moda_praia"
#> [38] "fashion_roupa_masculina"
#> [39] "moveis_cozinha_area_de_servico_jantar_e_jardim"
#> [40] "industria_comercio_e_negocios"
#> [41] "telefonia_fixa"
#> [42] "construcao_ferramentas_iluminacao"
#> [43] "livros_tecnicos"
#> [44] "eletrodomesticos_2"
#> [45] "artigos_de_festas"
#> [46] "bebidas"
#> [47] "market_place"
#> [48] "la_cuisine"
#> [49] "construcao_ferramentas_jardim"
#> [50] "fashion_roupa_feminina"
#> [51] "casa_conforto"
#> [52] "audio"
#> [53] "alimentos_bebidas"
#> [54] "musica"
#> [55] "alimentos"
#> [56] "tablets_impressao_imagem"
#> [57] "livros_importados"
#> [58] "portateis_casa_forno_e_cafe"
#> [59] "fashion_esporte"
#> [60] "artigos_de_natal"
#> [61] "fashion_roupa_infanto_juvenil"
#> [62] "dvds_blu_ray"
#> [63] "artes_e_artesanato"
#> [64] "pc_gamer"
#> [65] "moveis_quarto"
#> [66] "cine_foto"
#> [67] "fraldas_higiene"
#> [68] "flores"
#> [69] "casa_conforto_2"
#> [70] "portateis_cozinha_e_preparadores_de_alimentos"
#> [71] "seguros_e_servicos"
#> [72] "moveis_colchao_e_estofado"
#> [73] "cds_dvds_musicais"La convertimos a una variable categórica
products$product_category_name <- as.factor(products$product_category_name)
str(products$product_category_name)
#> Factor w/ 73 levels "agro_industria_e_comercio",..: 63 4 33 10 73 46 27 55 29 13 ...Realizando el proceso análogo en orders para
order_status
La convertimos a una variable categórica
Fusionando orders y products
ip <- full_join(items, products)
#> Joining with `by = join_by(product_id)`
consolidated <- full_join(ip, orders)
#> Joining with `by = join_by(order_id)`
glimpse(consolidated)
#> Rows: 112,650
#> Columns: 22
#> $ order_id <chr> "00010242fe8c5a6d1ba2dd792cb16214", "000…
#> $ order_item_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ product_id <chr> "4244733e06e7ecb4970a6e2683c13e61", "e5f…
#> $ seller_id <chr> "48436dade18ac8b2bce089ec2a041202", "dd7…
#> $ shipping_limit_date <dttm> 2017-09-19 09:45:35, 2017-05-03 11:05:1…
#> $ price <dbl> 58.90, 239.90, 199.00, 12.99, 199.90, 21…
#> $ freight_value <dbl> 13.29, 19.93, 17.87, 12.79, 18.14, 12.69…
#> $ product_category_name <fct> cool_stuff, pet_shop, moveis_decoracao, …
#> $ product_name_lenght <dbl> 58, 56, 59, 42, 59, 36, 52, 39, 59, 52, …
#> $ product_description_lenght <dbl> 598, 239, 695, 480, 409, 558, 815, 1310,…
#> $ product_photos_qty <dbl> 4, 2, 2, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 2…
#> $ product_weight_g <dbl> 650, 30000, 3050, 200, 3750, 450, 200, 1…
#> $ product_length_cm <dbl> 28, 50, 33, 16, 35, 24, 27, 35, 30, 29, …
#> $ product_height_cm <dbl> 9, 30, 13, 10, 40, 8, 5, 75, 12, 3, 25, …
#> $ product_width_cm <dbl> 14, 40, 33, 15, 30, 15, 20, 45, 16, 21, …
#> $ customer_id <chr> "3ce436f183e68e07877b285a838db11a", "f6d…
#> $ order_status <fct> delivered, delivered, delivered, deliver…
#> $ order_purchase_timestamp <dttm> 2017-09-13 08:59:02, 2017-04-26 10:53:0…
#> $ order_approved_at <dttm> 2017-09-13 09:45:35, 2017-04-26 11:05:1…
#> $ order_delivered_carrier_date <dttm> 2017-09-19 18:34:16, 2017-05-04 14:35:0…
#> $ order_delivered_customer_date <dttm> 2017-09-20 23:43:48, 2017-05-12 16:04:2…
#> $ order_estimated_delivery_date <dttm> 2017-09-29, 2017-05-15, 2018-02-05, 201…df_clean <- consolidated %>%
distinct() %>%
drop_na()
glimpse(df_clean)
#> Rows: 108,643
#> Columns: 22
#> $ order_id <chr> "00010242fe8c5a6d1ba2dd792cb16214", "000…
#> $ order_item_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ product_id <chr> "4244733e06e7ecb4970a6e2683c13e61", "e5f…
#> $ seller_id <chr> "48436dade18ac8b2bce089ec2a041202", "dd7…
#> $ shipping_limit_date <dttm> 2017-09-19 09:45:35, 2017-05-03 11:05:1…
#> $ price <dbl> 58.90, 239.90, 199.00, 12.99, 199.90, 21…
#> $ freight_value <dbl> 13.29, 19.93, 17.87, 12.79, 18.14, 12.69…
#> $ product_category_name <fct> cool_stuff, pet_shop, moveis_decoracao, …
#> $ product_name_lenght <dbl> 58, 56, 59, 42, 59, 36, 52, 39, 59, 52, …
#> $ product_description_lenght <dbl> 598, 239, 695, 480, 409, 558, 815, 1310,…
#> $ product_photos_qty <dbl> 4, 2, 2, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 2…
#> $ product_weight_g <dbl> 650, 30000, 3050, 200, 3750, 450, 200, 1…
#> $ product_length_cm <dbl> 28, 50, 33, 16, 35, 24, 27, 35, 30, 29, …
#> $ product_height_cm <dbl> 9, 30, 13, 10, 40, 8, 5, 75, 12, 3, 25, …
#> $ product_width_cm <dbl> 14, 40, 33, 15, 30, 15, 20, 45, 16, 21, …
#> $ customer_id <chr> "3ce436f183e68e07877b285a838db11a", "f6d…
#> $ order_status <fct> delivered, delivered, delivered, deliver…
#> $ order_purchase_timestamp <dttm> 2017-09-13 08:59:02, 2017-04-26 10:53:0…
#> $ order_approved_at <dttm> 2017-09-13 09:45:35, 2017-04-26 11:05:1…
#> $ order_delivered_carrier_date <dttm> 2017-09-19 18:34:16, 2017-05-04 14:35:0…
#> $ order_delivered_customer_date <dttm> 2017-09-20 23:43:48, 2017-05-12 16:04:2…
#> $ order_estimated_delivery_date <dttm> 2017-09-29, 2017-05-15, 2018-02-05, 201…Así, eliminamos registros repetidos y solamente mantenemos filas que no tienen ningún elemento no definido.