Ejercicio 1. Creación de un proyecto en R

Lenin Pavón

2026-06-09

Introducción

Soy Lenin Pavón, ayudante de la Facultad de Ciencias de la UNAM. Me gusta meditar, amo a los gatos y me encanta la cafeína. De igual manera, consumo mate de manera regular, mi marca favorita es Rosamonte aunque a veces es un poco amargo.

Mis expectativas del curso son aprender a usar R de una manera profesional y aprender lo suficiente para que mis alumnos no me chamaqueen. De igual manera, he querido entender de forma computacional la diferencia entre inferencia y predicción.

R.version
##                _                           
## platform       x86_64-apple-darwin20       
## arch           x86_64                      
## os             darwin20                    
## system         x86_64, darwin20            
## status                                     
## major          4                           
## minor          5.3                         
## year           2026                        
## month          03                          
## day            11                          
## svn rev        89597                       
## language       R                           
## version.string R version 4.5.3 (2026-03-11)
## nickname       Reassured Reassurer
mean(as.numeric(1:15))
## [1] 8
toupper("Lenin")
## [1] "LENIN"

Cargando los datos

Primero montaremos nuestro sistema importando tidyverse

Ahora cargaremos los siguientes datasets - olist_orders_dataset.csv - olist_order_items_dataset.csv - olist_products_dataset.csv

orders   <- read_csv("data/olist_orders_dataset.csv")
#> Rows: 99441 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (3): order_id, customer_id, order_status
#> dttm (5): order_purchase_timestamp, order_approved_at, order_delivered_carri...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
items    <- read_csv("data/olist_order_items_dataset.csv")
#> Rows: 112650 Columns: 7
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (3): order_id, product_id, seller_id
#> dbl  (3): order_item_id, price, freight_value
#> dttm (1): shipping_limit_date
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
products <- read_csv("data/olist_products_dataset.csv")
#> Rows: 32951 Columns: 9
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): product_id, product_category_name
#> dbl (7): product_name_lenght, product_description_lenght, product_photos_qty...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Inspección de los datasets

print(str(orders))
#> spc_tbl_ [99,441 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#>  $ order_id                     : chr [1:99441] "e481f51cbdc54678b7cc49136f2d6af7" "53cdb2fc8bc7dce0b6741e2150273451" "47770eb9100c2d0c44946d9cf07ec65d" "949d5b44dbf5de918fe9c16f97b45f8a" ...
#>  $ customer_id                  : chr [1:99441] "9ef432eb6251297304e76186b10a928d" "b0830fb4747a6c6d20dea0b8c802d7ef" "41ce2a54c0b03bf3443c3d931a367089" "f88197465ea7920adcdbec7375364d82" ...
#>  $ order_status                 : chr [1:99441] "delivered" "delivered" "delivered" "delivered" ...
#>  $ order_purchase_timestamp     : POSIXct[1:99441], format: "2017-10-02 10:56:33" "2018-07-24 20:41:37" ...
#>  $ order_approved_at            : POSIXct[1:99441], format: "2017-10-02 11:07:15" "2018-07-26 03:24:27" ...
#>  $ order_delivered_carrier_date : POSIXct[1:99441], format: "2017-10-04 19:55:00" "2018-07-26 14:31:00" ...
#>  $ order_delivered_customer_date: POSIXct[1:99441], format: "2017-10-10 21:25:13" "2018-08-07 15:27:45" ...
#>  $ order_estimated_delivery_date: POSIXct[1:99441], format: "2017-10-18" "2018-08-13" ...
#>  - attr(*, "spec")=
#>   .. cols(
#>   ..   order_id = col_character(),
#>   ..   customer_id = col_character(),
#>   ..   order_status = col_character(),
#>   ..   order_purchase_timestamp = col_datetime(format = ""),
#>   ..   order_approved_at = col_datetime(format = ""),
#>   ..   order_delivered_carrier_date = col_datetime(format = ""),
#>   ..   order_delivered_customer_date = col_datetime(format = ""),
#>   ..   order_estimated_delivery_date = col_datetime(format = "")
#>   .. )
#>  - attr(*, "problems")=<externalptr> 
#> NULL
print(summary(orders))
#>    order_id         customer_id        order_status      
#>  Length:99441       Length:99441       Length:99441      
#>  Class :character   Class :character   Class :character  
#>  Mode  :character   Mode  :character   Mode  :character  
#>                                                          
#>                                                          
#>                                                          
#>                                                          
#>  order_purchase_timestamp      order_approved_at            
#>  Min.   :2016-09-04 21:15:19   Min.   :2016-09-15 12:16:38  
#>  1st Qu.:2017-09-12 14:46:19   1st Qu.:2017-09-12 23:24:16  
#>  Median :2018-01-18 23:04:36   Median :2018-01-19 11:36:13  
#>  Mean   :2017-12-31 08:43:12   Mean   :2017-12-31 18:35:24  
#>  3rd Qu.:2018-05-04 15:42:16   3rd Qu.:2018-05-04 20:35:10  
#>  Max.   :2018-10-17 17:30:18   Max.   :2018-09-03 17:40:06  
#>                                NA's   :160                  
#>  order_delivered_carrier_date  order_delivered_customer_date
#>  Min.   :2016-10-08 10:34:01   Min.   :2016-10-11 13:46:32  
#>  1st Qu.:2017-09-15 22:28:50   1st Qu.:2017-09-25 22:07:22  
#>  Median :2018-01-24 16:10:58   Median :2018-02-02 19:28:10  
#>  Mean   :2018-01-04 21:49:48   Mean   :2018-01-14 12:09:19  
#>  3rd Qu.:2018-05-08 13:37:45   3rd Qu.:2018-05-15 22:48:52  
#>  Max.   :2018-09-11 19:48:28   Max.   :2018-10-17 13:22:46  
#>  NA's   :1783                  NA's   :2965                 
#>  order_estimated_delivery_date
#>  Min.   :2016-09-30 00:00:00  
#>  1st Qu.:2017-10-03 00:00:00  
#>  Median :2018-02-15 00:00:00  
#>  Mean   :2018-01-24 03:08:37  
#>  3rd Qu.:2018-05-25 00:00:00  
#>  Max.   :2018-11-12 00:00:00  
#> 
print(names(orders))
#> [1] "order_id"                      "customer_id"                  
#> [3] "order_status"                  "order_purchase_timestamp"     
#> [5] "order_approved_at"             "order_delivered_carrier_date" 
#> [7] "order_delivered_customer_date" "order_estimated_delivery_date"
print(head(orders))
#> # A tibble: 6 × 8
#>   order_id   customer_id order_status order_purchase_times…¹ order_approved_at  
#>   <chr>      <chr>       <chr>        <dttm>                 <dttm>             
#> 1 e481f51cb… 9ef432eb62… delivered    2017-10-02 10:56:33    2017-10-02 11:07:15
#> 2 53cdb2fc8… b0830fb474… delivered    2018-07-24 20:41:37    2018-07-26 03:24:27
#> 3 47770eb91… 41ce2a54c0… delivered    2018-08-08 08:38:49    2018-08-08 08:55:23
#> 4 949d5b44d… f88197465e… delivered    2017-11-18 19:28:06    2017-11-18 19:45:59
#> 5 ad21c59c0… 8ab97904e6… delivered    2018-02-13 21:18:39    2018-02-13 22:20:29
#> 6 a4591c265… 503740e9ca… delivered    2017-07-09 21:57:05    2017-07-09 22:10:13
#> # ℹ abbreviated name: ¹​order_purchase_timestamp
#> # ℹ 3 more variables: order_delivered_carrier_date <dttm>,
#> #   order_delivered_customer_date <dttm>, order_estimated_delivery_date <dttm>
print(dim(orders))
#> [1] 99441     8
print(str(items))
#> spc_tbl_ [112,650 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#>  $ order_id           : chr [1:112650] "00010242fe8c5a6d1ba2dd792cb16214" "00018f77f2f0320c557190d7a144bdd3" "000229ec398224ef6ca0657da4fc703e" "00024acbcdf0a6daa1e931b038114c75" ...
#>  $ order_item_id      : num [1:112650] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ product_id         : chr [1:112650] "4244733e06e7ecb4970a6e2683c13e61" "e5f2d52b802189ee658865ca93d83a8f" "c777355d18b72b67abbeef9df44fd0fd" "7634da152a4610f1595efa32f14722fc" ...
#>  $ seller_id          : chr [1:112650] "48436dade18ac8b2bce089ec2a041202" "dd7ddc04e1b6c2c614352b383efe2d36" "5b51032eddd242adc84c38acab88f23d" "9d7a1d34a5052409006425275ba1c2b4" ...
#>  $ shipping_limit_date: POSIXct[1:112650], format: "2017-09-19 09:45:35" "2017-05-03 11:05:13" ...
#>  $ price              : num [1:112650] 58.9 239.9 199 13 199.9 ...
#>  $ freight_value      : num [1:112650] 13.3 19.9 17.9 12.8 18.1 ...
#>  - attr(*, "spec")=
#>   .. cols(
#>   ..   order_id = col_character(),
#>   ..   order_item_id = col_double(),
#>   ..   product_id = col_character(),
#>   ..   seller_id = col_character(),
#>   ..   shipping_limit_date = col_datetime(format = ""),
#>   ..   price = col_double(),
#>   ..   freight_value = col_double()
#>   .. )
#>  - attr(*, "problems")=<externalptr> 
#> NULL
print(summary(items))
#>    order_id         order_item_id     product_id         seller_id        
#>  Length:112650      Min.   : 1.000   Length:112650      Length:112650     
#>  Class :character   1st Qu.: 1.000   Class :character   Class :character  
#>  Mode  :character   Median : 1.000   Mode  :character   Mode  :character  
#>                     Mean   : 1.198                                        
#>                     3rd Qu.: 1.000                                        
#>                     Max.   :21.000                                        
#>  shipping_limit_date               price         freight_value   
#>  Min.   :2016-09-19 00:15:34   Min.   :   0.85   Min.   :  0.00  
#>  1st Qu.:2017-09-20 20:57:27   1st Qu.:  39.90   1st Qu.: 13.08  
#>  Median :2018-01-26 13:59:35   Median :  74.99   Median : 16.26  
#>  Mean   :2018-01-07 15:36:52   Mean   : 120.65   Mean   : 19.99  
#>  3rd Qu.:2018-05-10 14:34:00   3rd Qu.: 134.90   3rd Qu.: 21.15  
#>  Max.   :2020-04-09 22:35:08   Max.   :6735.00   Max.   :409.68
print(names(items))
#> [1] "order_id"            "order_item_id"       "product_id"         
#> [4] "seller_id"           "shipping_limit_date" "price"              
#> [7] "freight_value"
print(head(items))
#> # A tibble: 6 × 7
#>   order_id          order_item_id product_id seller_id shipping_limit_date price
#>   <chr>                     <dbl> <chr>      <chr>     <dttm>              <dbl>
#> 1 00010242fe8c5a6d…             1 4244733e0… 48436dad… 2017-09-19 09:45:35  58.9
#> 2 00018f77f2f0320c…             1 e5f2d52b8… dd7ddc04… 2017-05-03 11:05:13 240. 
#> 3 000229ec398224ef…             1 c777355d1… 5b51032e… 2018-01-18 14:48:30 199  
#> 4 00024acbcdf0a6da…             1 7634da152… 9d7a1d34… 2018-08-15 10:10:18  13.0
#> 5 00042b26cf59d7ce…             1 ac6c36230… df560393… 2017-02-13 13:57:51 200. 
#> 6 00048cc3ae777c65…             1 ef92defde… 6426d21a… 2017-05-23 03:55:27  21.9
#> # ℹ 1 more variable: freight_value <dbl>
print(dim(items))
#> [1] 112650      7
print(str(products))
#> spc_tbl_ [32,951 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#>  $ product_id                : chr [1:32951] "1e9e8ef04dbcff4541ed26657ea517e5" "3aa071139cb16b67ca9e5dea641aaa2f" "96bd76ec8810374ed1b65e291975717f" "cef67bcfe19066a932b7673e239eb23d" ...
#>  $ product_category_name     : chr [1:32951] "perfumaria" "artes" "esporte_lazer" "bebes" ...
#>  $ product_name_lenght       : num [1:32951] 40 44 46 27 37 60 56 56 57 36 ...
#>  $ product_description_lenght: num [1:32951] 287 276 250 261 402 ...
#>  $ product_photos_qty        : num [1:32951] 1 1 1 1 4 1 4 2 1 1 ...
#>  $ product_weight_g          : num [1:32951] 225 1000 154 371 625 ...
#>  $ product_length_cm         : num [1:32951] 16 30 18 26 20 38 70 40 27 17 ...
#>  $ product_height_cm         : num [1:32951] 10 18 9 4 17 5 24 8 13 10 ...
#>  $ product_width_cm          : num [1:32951] 14 20 15 26 13 11 44 40 17 12 ...
#>  - attr(*, "spec")=
#>   .. cols(
#>   ..   product_id = col_character(),
#>   ..   product_category_name = col_character(),
#>   ..   product_name_lenght = col_double(),
#>   ..   product_description_lenght = col_double(),
#>   ..   product_photos_qty = col_double(),
#>   ..   product_weight_g = col_double(),
#>   ..   product_length_cm = col_double(),
#>   ..   product_height_cm = col_double(),
#>   ..   product_width_cm = col_double()
#>   .. )
#>  - attr(*, "problems")=<externalptr> 
#> NULL
print(summary(products))
#>   product_id        product_category_name product_name_lenght
#>  Length:32951       Length:32951          Min.   : 5.00      
#>  Class :character   Class :character      1st Qu.:42.00      
#>  Mode  :character   Mode  :character      Median :51.00      
#>                                           Mean   :48.48      
#>                                           3rd Qu.:57.00      
#>                                           Max.   :76.00      
#>                                           NA's   :610        
#>  product_description_lenght product_photos_qty product_weight_g
#>  Min.   :   4.0             Min.   : 1.000     Min.   :    0   
#>  1st Qu.: 339.0             1st Qu.: 1.000     1st Qu.:  300   
#>  Median : 595.0             Median : 1.000     Median :  700   
#>  Mean   : 771.5             Mean   : 2.189     Mean   : 2276   
#>  3rd Qu.: 972.0             3rd Qu.: 3.000     3rd Qu.: 1900   
#>  Max.   :3992.0             Max.   :20.000     Max.   :40425   
#>  NA's   :610                NA's   :610        NA's   :2       
#>  product_length_cm product_height_cm product_width_cm
#>  Min.   :  7.00    Min.   :  2.00    Min.   :  6.0   
#>  1st Qu.: 18.00    1st Qu.:  8.00    1st Qu.: 15.0   
#>  Median : 25.00    Median : 13.00    Median : 20.0   
#>  Mean   : 30.82    Mean   : 16.94    Mean   : 23.2   
#>  3rd Qu.: 38.00    3rd Qu.: 21.00    3rd Qu.: 30.0   
#>  Max.   :105.00    Max.   :105.00    Max.   :118.0   
#>  NA's   :2         NA's   :2         NA's   :2
print(names(products))
#> [1] "product_id"                 "product_category_name"     
#> [3] "product_name_lenght"        "product_description_lenght"
#> [5] "product_photos_qty"         "product_weight_g"          
#> [7] "product_length_cm"          "product_height_cm"         
#> [9] "product_width_cm"
print(head(products))
#> # A tibble: 6 × 9
#>   product_id    product_category_name product_name_lenght product_description_…¹
#>   <chr>         <chr>                               <dbl>                  <dbl>
#> 1 1e9e8ef04dbc… perfumaria                             40                    287
#> 2 3aa071139cb1… artes                                  44                    276
#> 3 96bd76ec8810… esporte_lazer                          46                    250
#> 4 cef67bcfe190… bebes                                  27                    261
#> 5 9dc1a7de2744… utilidades_domesticas                  37                    402
#> 6 41d3672d4792… instrumentos_musicais                  60                    745
#> # ℹ abbreviated name: ¹​product_description_lenght
#> # ℹ 5 more variables: product_photos_qty <dbl>, product_weight_g <dbl>,
#> #   product_length_cm <dbl>, product_height_cm <dbl>, product_width_cm <dbl>
print(dim(products))
#> [1] 32951     9

Veamos ahora a detalle las columnas en cada dataset

glimpse(orders)
#> Rows: 99,441
#> Columns: 8
#> $ order_id                      <chr> "e481f51cbdc54678b7cc49136f2d6af7", "53c…
#> $ customer_id                   <chr> "9ef432eb6251297304e76186b10a928d", "b08…
#> $ order_status                  <chr> "delivered", "delivered", "delivered", "…
#> $ order_purchase_timestamp      <dttm> 2017-10-02 10:56:33, 2018-07-24 20:41:3…
#> $ order_approved_at             <dttm> 2017-10-02 11:07:15, 2018-07-26 03:24:2…
#> $ order_delivered_carrier_date  <dttm> 2017-10-04 19:55:00, 2018-07-26 14:31:0…
#> $ order_delivered_customer_date <dttm> 2017-10-10 21:25:13, 2018-08-07 15:27:4…
#> $ order_estimated_delivery_date <dttm> 2017-10-18, 2018-08-13, 2018-09-04, 201…
glimpse(items)
#> Rows: 112,650
#> Columns: 7
#> $ order_id            <chr> "00010242fe8c5a6d1ba2dd792cb16214", "00018f77f2f03…
#> $ order_item_id       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,…
#> $ product_id          <chr> "4244733e06e7ecb4970a6e2683c13e61", "e5f2d52b80218…
#> $ seller_id           <chr> "48436dade18ac8b2bce089ec2a041202", "dd7ddc04e1b6c…
#> $ shipping_limit_date <dttm> 2017-09-19 09:45:35, 2017-05-03 11:05:13, 2018-01…
#> $ price               <dbl> 58.90, 239.90, 199.00, 12.99, 199.90, 21.90, 19.90…
#> $ freight_value       <dbl> 13.29, 19.93, 17.87, 12.79, 18.14, 12.69, 11.85, 7…
glimpse(products)
#> Rows: 32,951
#> Columns: 9
#> $ product_id                 <chr> "1e9e8ef04dbcff4541ed26657ea517e5", "3aa071…
#> $ product_category_name      <chr> "perfumaria", "artes", "esporte_lazer", "be…
#> $ product_name_lenght        <dbl> 40, 44, 46, 27, 37, 60, 56, 56, 57, 36, 54,…
#> $ product_description_lenght <dbl> 287, 276, 250, 261, 402, 745, 1272, 184, 16…
#> $ product_photos_qty         <dbl> 1, 1, 1, 1, 4, 1, 4, 2, 1, 1, 1, 4, 3, 2, 4…
#> $ product_weight_g           <dbl> 225, 1000, 154, 371, 625, 200, 18350, 900, …
#> $ product_length_cm          <dbl> 16, 30, 18, 26, 20, 38, 70, 40, 27, 17, 16,…
#> $ product_height_cm          <dbl> 10, 18, 9, 4, 17, 5, 24, 8, 13, 10, 10, 19,…
#> $ product_width_cm           <dbl> 14, 20, 15, 26, 13, 11, 44, 40, 17, 12, 16,…

Limpieza de datos

Veamos cuantos datos faltan por dataset

orders |> is.na() |> colSums()
#>                      order_id                   customer_id 
#>                             0                             0 
#>                  order_status      order_purchase_timestamp 
#>                             0                             0 
#>             order_approved_at  order_delivered_carrier_date 
#>                           160                          1783 
#> order_delivered_customer_date order_estimated_delivery_date 
#>                          2965                             0
items |> is.na() |> sum()
#> [1] 0
products |> is.na() |> colSums()
#>                 product_id      product_category_name 
#>                          0                        610 
#>        product_name_lenght product_description_lenght 
#>                        610                        610 
#>         product_photos_qty           product_weight_g 
#>                        610                          2 
#>          product_length_cm          product_height_cm 
#>                          2                          2 
#>           product_width_cm 
#>                          2

Eliminando las filas que contienen al menos un valor uno de cada dataframe,

orders   <- na.omit(orders)
products <- na.omit(products)

Revisando cómo quedaron tras la transformación

glimpse(orders)
#> Rows: 96,461
#> Columns: 8
#> $ order_id                      <chr> "e481f51cbdc54678b7cc49136f2d6af7", "53c…
#> $ customer_id                   <chr> "9ef432eb6251297304e76186b10a928d", "b08…
#> $ order_status                  <chr> "delivered", "delivered", "delivered", "…
#> $ order_purchase_timestamp      <dttm> 2017-10-02 10:56:33, 2018-07-24 20:41:3…
#> $ order_approved_at             <dttm> 2017-10-02 11:07:15, 2018-07-26 03:24:2…
#> $ order_delivered_carrier_date  <dttm> 2017-10-04 19:55:00, 2018-07-26 14:31:0…
#> $ order_delivered_customer_date <dttm> 2017-10-10 21:25:13, 2018-08-07 15:27:4…
#> $ order_estimated_delivery_date <dttm> 2017-10-18, 2018-08-13, 2018-09-04, 201…
glimpse(products)
#> Rows: 32,340
#> Columns: 9
#> $ product_id                 <chr> "1e9e8ef04dbcff4541ed26657ea517e5", "3aa071…
#> $ product_category_name      <chr> "perfumaria", "artes", "esporte_lazer", "be…
#> $ product_name_lenght        <dbl> 40, 44, 46, 27, 37, 60, 56, 56, 57, 36, 54,…
#> $ product_description_lenght <dbl> 287, 276, 250, 261, 402, 745, 1272, 184, 16…
#> $ product_photos_qty         <dbl> 1, 1, 1, 1, 4, 1, 4, 2, 1, 1, 1, 4, 3, 2, 4…
#> $ product_weight_g           <dbl> 225, 1000, 154, 371, 625, 200, 18350, 900, …
#> $ product_length_cm          <dbl> 16, 30, 18, 26, 20, 38, 70, 40, 27, 17, 16,…
#> $ product_height_cm          <dbl> 10, 18, 9, 4, 17, 5, 24, 8, 13, 10, 10, 19,…
#> $ product_width_cm           <dbl> 14, 20, 15, 26, 13, 11, 44, 40, 17, 12, 16,…

Registros duplicados

Mantenemos los registros no duplicados

items <- distinct(items)
orders <- distinct(orders)
products <- distinct(products)

Nombres de columnas

Observemos que en products dos de las columnas cuentan con errores de ortografía en las columnas de products

rename(products, product_name_length = product_name_lenght)
#> # A tibble: 32,340 × 9
#>    product_id   product_category_name product_name_length product_description_…¹
#>    <chr>        <chr>                               <dbl>                  <dbl>
#>  1 1e9e8ef04db… perfumaria                             40                    287
#>  2 3aa071139cb… artes                                  44                    276
#>  3 96bd76ec881… esporte_lazer                          46                    250
#>  4 cef67bcfe19… bebes                                  27                    261
#>  5 9dc1a7de274… utilidades_domesticas                  37                    402
#>  6 41d3672d479… instrumentos_musicais                  60                    745
#>  7 732bd381ad0… cool_stuff                             56                   1272
#>  8 2548af3e6e7… moveis_decoracao                       56                    184
#>  9 37cc742be07… eletrodomesticos                       57                    163
#> 10 8c92109888e… brinquedos                             36                   1156
#> # ℹ 32,330 more rows
#> # ℹ abbreviated name: ¹​product_description_lenght
#> # ℹ 5 more variables: product_photos_qty <dbl>, product_weight_g <dbl>,
#> #   product_length_cm <dbl>, product_height_cm <dbl>, product_width_cm <dbl>
rename(products, product_description_length = product_description_lenght)
#> # A tibble: 32,340 × 9
#>    product_id   product_category_name product_name_lenght product_description_…¹
#>    <chr>        <chr>                               <dbl>                  <dbl>
#>  1 1e9e8ef04db… perfumaria                             40                    287
#>  2 3aa071139cb… artes                                  44                    276
#>  3 96bd76ec881… esporte_lazer                          46                    250
#>  4 cef67bcfe19… bebes                                  27                    261
#>  5 9dc1a7de274… utilidades_domesticas                  37                    402
#>  6 41d3672d479… instrumentos_musicais                  60                    745
#>  7 732bd381ad0… cool_stuff                             56                   1272
#>  8 2548af3e6e7… moveis_decoracao                       56                    184
#>  9 37cc742be07… eletrodomesticos                       57                    163
#> 10 8c92109888e… brinquedos                             36                   1156
#> # ℹ 32,330 more rows
#> # ℹ abbreviated name: ¹​product_description_length
#> # ℹ 5 more variables: product_photos_qty <dbl>, product_weight_g <dbl>,
#> #   product_length_cm <dbl>, product_height_cm <dbl>, product_width_cm <dbl>

Variables numéricas cargadas como texto

Para el dataframe products vemos que en la columna product_category_name se encuentran los siguientes valores distintos

unique(products$product_category_name)
#>  [1] "perfumaria"                                    
#>  [2] "artes"                                         
#>  [3] "esporte_lazer"                                 
#>  [4] "bebes"                                         
#>  [5] "utilidades_domesticas"                         
#>  [6] "instrumentos_musicais"                         
#>  [7] "cool_stuff"                                    
#>  [8] "moveis_decoracao"                              
#>  [9] "eletrodomesticos"                              
#> [10] "brinquedos"                                    
#> [11] "cama_mesa_banho"                               
#> [12] "construcao_ferramentas_seguranca"              
#> [13] "informatica_acessorios"                        
#> [14] "beleza_saude"                                  
#> [15] "malas_acessorios"                              
#> [16] "ferramentas_jardim"                            
#> [17] "moveis_escritorio"                             
#> [18] "automotivo"                                    
#> [19] "eletronicos"                                   
#> [20] "fashion_calcados"                              
#> [21] "telefonia"                                     
#> [22] "papelaria"                                     
#> [23] "fashion_bolsas_e_acessorios"                   
#> [24] "pcs"                                           
#> [25] "casa_construcao"                               
#> [26] "relogios_presentes"                            
#> [27] "construcao_ferramentas_construcao"             
#> [28] "pet_shop"                                      
#> [29] "eletroportateis"                               
#> [30] "agro_industria_e_comercio"                     
#> [31] "moveis_sala"                                   
#> [32] "sinalizacao_e_seguranca"                       
#> [33] "climatizacao"                                  
#> [34] "consoles_games"                                
#> [35] "livros_interesse_geral"                        
#> [36] "construcao_ferramentas_ferramentas"            
#> [37] "fashion_underwear_e_moda_praia"                
#> [38] "fashion_roupa_masculina"                       
#> [39] "moveis_cozinha_area_de_servico_jantar_e_jardim"
#> [40] "industria_comercio_e_negocios"                 
#> [41] "telefonia_fixa"                                
#> [42] "construcao_ferramentas_iluminacao"             
#> [43] "livros_tecnicos"                               
#> [44] "eletrodomesticos_2"                            
#> [45] "artigos_de_festas"                             
#> [46] "bebidas"                                       
#> [47] "market_place"                                  
#> [48] "la_cuisine"                                    
#> [49] "construcao_ferramentas_jardim"                 
#> [50] "fashion_roupa_feminina"                        
#> [51] "casa_conforto"                                 
#> [52] "audio"                                         
#> [53] "alimentos_bebidas"                             
#> [54] "musica"                                        
#> [55] "alimentos"                                     
#> [56] "tablets_impressao_imagem"                      
#> [57] "livros_importados"                             
#> [58] "portateis_casa_forno_e_cafe"                   
#> [59] "fashion_esporte"                               
#> [60] "artigos_de_natal"                              
#> [61] "fashion_roupa_infanto_juvenil"                 
#> [62] "dvds_blu_ray"                                  
#> [63] "artes_e_artesanato"                            
#> [64] "pc_gamer"                                      
#> [65] "moveis_quarto"                                 
#> [66] "cine_foto"                                     
#> [67] "fraldas_higiene"                               
#> [68] "flores"                                        
#> [69] "casa_conforto_2"                               
#> [70] "portateis_cozinha_e_preparadores_de_alimentos" 
#> [71] "seguros_e_servicos"                            
#> [72] "moveis_colchao_e_estofado"                     
#> [73] "cds_dvds_musicais"

La convertimos a una variable categórica

products$product_category_name <- as.factor(products$product_category_name)
str(products$product_category_name)
#>  Factor w/ 73 levels "agro_industria_e_comercio",..: 63 4 33 10 73 46 27 55 29 13 ...

Realizando el proceso análogo en orders para order_status

unique(orders$order_status)
#> [1] "delivered" "canceled"

La convertimos a una variable categórica

orders$order_status <- as.factor(orders$order_status)
str(orders$order_status)
#>  Factor w/ 2 levels "canceled","delivered": 2 2 2 2 2 2 2 2 2 2 ...

Consolidación de datos

Fusionando orders y products

ip <- full_join(items, products)
#> Joining with `by = join_by(product_id)`
consolidated <- full_join(ip, orders)
#> Joining with `by = join_by(order_id)`
glimpse(consolidated)
#> Rows: 112,650
#> Columns: 22
#> $ order_id                      <chr> "00010242fe8c5a6d1ba2dd792cb16214", "000…
#> $ order_item_id                 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ product_id                    <chr> "4244733e06e7ecb4970a6e2683c13e61", "e5f…
#> $ seller_id                     <chr> "48436dade18ac8b2bce089ec2a041202", "dd7…
#> $ shipping_limit_date           <dttm> 2017-09-19 09:45:35, 2017-05-03 11:05:1…
#> $ price                         <dbl> 58.90, 239.90, 199.00, 12.99, 199.90, 21…
#> $ freight_value                 <dbl> 13.29, 19.93, 17.87, 12.79, 18.14, 12.69…
#> $ product_category_name         <fct> cool_stuff, pet_shop, moveis_decoracao, …
#> $ product_name_lenght           <dbl> 58, 56, 59, 42, 59, 36, 52, 39, 59, 52, …
#> $ product_description_lenght    <dbl> 598, 239, 695, 480, 409, 558, 815, 1310,…
#> $ product_photos_qty            <dbl> 4, 2, 2, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 2…
#> $ product_weight_g              <dbl> 650, 30000, 3050, 200, 3750, 450, 200, 1…
#> $ product_length_cm             <dbl> 28, 50, 33, 16, 35, 24, 27, 35, 30, 29, …
#> $ product_height_cm             <dbl> 9, 30, 13, 10, 40, 8, 5, 75, 12, 3, 25, …
#> $ product_width_cm              <dbl> 14, 40, 33, 15, 30, 15, 20, 45, 16, 21, …
#> $ customer_id                   <chr> "3ce436f183e68e07877b285a838db11a", "f6d…
#> $ order_status                  <fct> delivered, delivered, delivered, deliver…
#> $ order_purchase_timestamp      <dttm> 2017-09-13 08:59:02, 2017-04-26 10:53:0…
#> $ order_approved_at             <dttm> 2017-09-13 09:45:35, 2017-04-26 11:05:1…
#> $ order_delivered_carrier_date  <dttm> 2017-09-19 18:34:16, 2017-05-04 14:35:0…
#> $ order_delivered_customer_date <dttm> 2017-09-20 23:43:48, 2017-05-12 16:04:2…
#> $ order_estimated_delivery_date <dttm> 2017-09-29, 2017-05-15, 2018-02-05, 201…

Verificación de resultados

df_clean <- consolidated %>% 
  distinct() %>%
  drop_na()

glimpse(df_clean)
#> Rows: 108,643
#> Columns: 22
#> $ order_id                      <chr> "00010242fe8c5a6d1ba2dd792cb16214", "000…
#> $ order_item_id                 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ product_id                    <chr> "4244733e06e7ecb4970a6e2683c13e61", "e5f…
#> $ seller_id                     <chr> "48436dade18ac8b2bce089ec2a041202", "dd7…
#> $ shipping_limit_date           <dttm> 2017-09-19 09:45:35, 2017-05-03 11:05:1…
#> $ price                         <dbl> 58.90, 239.90, 199.00, 12.99, 199.90, 21…
#> $ freight_value                 <dbl> 13.29, 19.93, 17.87, 12.79, 18.14, 12.69…
#> $ product_category_name         <fct> cool_stuff, pet_shop, moveis_decoracao, …
#> $ product_name_lenght           <dbl> 58, 56, 59, 42, 59, 36, 52, 39, 59, 52, …
#> $ product_description_lenght    <dbl> 598, 239, 695, 480, 409, 558, 815, 1310,…
#> $ product_photos_qty            <dbl> 4, 2, 2, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 2…
#> $ product_weight_g              <dbl> 650, 30000, 3050, 200, 3750, 450, 200, 1…
#> $ product_length_cm             <dbl> 28, 50, 33, 16, 35, 24, 27, 35, 30, 29, …
#> $ product_height_cm             <dbl> 9, 30, 13, 10, 40, 8, 5, 75, 12, 3, 25, …
#> $ product_width_cm              <dbl> 14, 40, 33, 15, 30, 15, 20, 45, 16, 21, …
#> $ customer_id                   <chr> "3ce436f183e68e07877b285a838db11a", "f6d…
#> $ order_status                  <fct> delivered, delivered, delivered, deliver…
#> $ order_purchase_timestamp      <dttm> 2017-09-13 08:59:02, 2017-04-26 10:53:0…
#> $ order_approved_at             <dttm> 2017-09-13 09:45:35, 2017-04-26 11:05:1…
#> $ order_delivered_carrier_date  <dttm> 2017-09-19 18:34:16, 2017-05-04 14:35:0…
#> $ order_delivered_customer_date <dttm> 2017-09-20 23:43:48, 2017-05-12 16:04:2…
#> $ order_estimated_delivery_date <dttm> 2017-09-29, 2017-05-15, 2018-02-05, 201…

Así, eliminamos registros repetidos y solamente mantenemos filas que no tienen ningún elemento no definido.

Almacenamiento de resultados

write_csv(df_clean, "results/1_wrangling_consolidated.csv")