1 Data manipulation - Theory
1.1 Loading the data
1.2 Vocabulary
A few verbs are important:
filter(), slice(), arrange(), select(), mutate(), summarise()
With group_by(), the data can be grouped by one or more variables, so that the verbs that follow operate within each group.
1.2.1 Slice
The slice() verb selects rows of a table by position. It takes a number or a vector of numbers.
To select the 345th row of the “airports” table:
## # A tibble: 1 x 9
## faa name lat lon alt tz dst tzone alt_m
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 CYF Chefornak Airport 60.1 -164. 40 -9 A America/Anchorage 12.2
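The call that produced this output is not shown. As a minimal sketch, assuming a small invented table (`airports_toy` and its values are made up for illustration and are not the real "airports" data), slice() can be used with a single position or with a vector of positions:

```r
library(dplyr)

# Toy table standing in for "airports" (invented values)
airports_toy <- tibble(
  faa = c("AAA", "BBB", "CCC", "DDD", "EEE"),
  alt = c(100, 250, 40, 1044, 523)
)

# A single position: the 3rd row
third_row <- slice(airports_toy, 3)

# A vector of positions: rows 1, 2 and 5
some_rows <- slice(airports_toy, c(1, 2, 5))
```

On the real table, the 345th row would be requested the same way, with `slice(airports, 345)`.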
1.2.2 Filter
filter() keeps the observations (rows) that satisfy conditions on one or more variables.
## # A tibble: 6 x 9
## faa name lat lon alt tz dst tzone alt_m
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New… 318.
## 2 06A Moton Field Municipal… 32.5 -85.7 264 -6 A America/Chi… 80.5
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chi… 244.
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/New… 159.
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/New… 3.35
## 6 0A9 Elizabethton Municipa… 36.4 -82.2 1593 -5 A America/New… 486.
## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 2353 2359 -6 418 442
## 2 2013 1 1 2356 2359 -3 425 437
## 3 2013 1 1 NA 1630 NA NA 1815
## 4 2013 1 1 NA 1935 NA NA 2240
## 5 2013 1 1 NA 1500 NA NA 1825
## 6 2013 1 1 NA 600 NA NA 901
## # … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
## # A tibble: 342 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 857 900 -3 1516 1530
## 2 2013 1 2 909 900 9 1525 1530
## 3 2013 1 3 914 900 14 1504 1530
## 4 2013 1 4 900 900 0 1516 1530
## 5 2013 1 5 858 900 -2 1519 1530
## 6 2013 1 6 1019 900 79 1558 1530
## 7 2013 1 7 1042 900 102 1620 1530
## 8 2013 1 8 901 900 1 1504 1530
## 9 2013 1 9 641 900 1301 1242 1530
## 10 2013 1 10 859 900 -1 1449 1530
## # … with 332 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
## # A tibble: 95,410 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 555 600 -5 913 854
## 6 2013 1 1 558 600 -2 849 851
## 7 2013 1 1 558 600 -2 853 856
## 8 2013 1 1 559 600 -1 941 910
## 9 2013 1 1 600 600 0 851 858
## 10 2013 1 1 601 600 1 844 850
## # … with 95,400 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
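The calls behind the outputs above are not shown. The following sketch, on a small invented table (`flights_toy` and its values are assumptions, not the real "flights" data), illustrates the usual ways of writing conditions in filter():

```r
library(dplyr)

# Toy table reusing a few column names from "flights" (invented values)
flights_toy <- tibble(
  month     = c(1, 1, 2, 2, 3),
  dest      = c("ATL", "BOS", "ATL", "HNL", "BOS"),
  dep_time  = c(517, NA, 533, 857, NA),
  dep_delay = c(2, NA, 4, -3, NA)
)

# A single condition: flights to ATL
to_atl <- filter(flights_toy, dest == "ATL")

# Conditions separated by commas are combined with AND
jan_late <- filter(flights_toy, month == 1, dep_delay > 0)

# %in% tests membership in a set of values
atl_or_bos <- filter(flights_toy, dest %in% c("ATL", "BOS"))

# is.na() keeps the rows where the departure time is missing
no_departure <- filter(flights_toy, is.na(dep_time))
```

Note that a comparison with NA yields NA, and filter() drops those rows: the second January flight is not in `jan_late`.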
1.2.3 Arrange
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
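No explanation survives in this section. As a hedged sketch on invented data (`flights_toy` is an assumption), arrange() sorts rows by one or more columns, and desc() reverses the order:

```r
library(dplyr)

# Toy table (invented values)
flights_toy <- tibble(
  month     = c(2, 1, 1, 3),
  day       = c(5, 20, 3, 1),
  dep_delay = c(10, -2, 45, NA)
)

# Sort by month, then by day within each month (ascending)
by_date <- arrange(flights_toy, month, day)

# Sort by decreasing delay; missing values are placed last
most_delayed_first <- arrange(flights_toy, desc(dep_delay))
```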
1.2.4 Select
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # … with 336,766 more rows
starts_with("abc"), ends_with("xyz") and contains("ijk") select the columns whose names start with "abc", end with "xyz", or contain "ijk".
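A minimal sketch of select() with and without these helpers, assuming an invented table (`flights_toy` and its values are made up; only the column names echo the "flights" outputs):

```r
library(dplyr)

# Toy table (invented values)
flights_toy <- tibble(
  year      = 2013,
  month     = 1,
  day       = 1:3,
  dep_time  = c(517, 533, 542),
  dep_delay = c(2, 4, 2),
  arr_time  = c(830, 850, 923),
  arr_delay = c(11, 20, 33)
)

# Select columns by name
date_cols <- select(flights_toy, year, month, day)

# Select columns by pattern in their names
dep_cols   <- select(flights_toy, starts_with("dep"))
delay_cols <- select(flights_toy, ends_with("delay"))
arr_cols   <- select(flights_toy, contains("arr"))
```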
1.2.5 Rename
The new name goes on the left of the equals sign, here tail_num.
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tail_num <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
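The output above shows tail_num where the other outputs show tailnum. A sketch of a rename() call of that form, on an invented table (`flights_toy` and its values are assumptions):

```r
library(dplyr)

# Toy table (invented values)
flights_toy <- tibble(
  tailnum = c("N001", "N002"),
  dest    = c("IAH", "MIA")
)

# new_name = old_name; the other columns are kept unchanged
renamed <- rename(flights_toy, tail_num = tailnum)
```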
1.2.6 Mutate
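Nothing survives under this heading. As a hedged sketch, mutate() adds new columns (or overwrites existing ones) computed from the others. The example below assumes that the alt_m column seen in the "airports" outputs is the altitude converted from feet to metres; the table `airports_toy` and the column `high` are hypothetical.

```r
library(dplyr)

# Toy table; the two altitudes (in feet) are taken from the outputs above
airports_toy <- tibble(
  faa = c("AAA", "BBB"),
  alt = c(40, 1044)
)

airports_toy <- mutate(
  airports_toy,
  alt_m = alt / 3.28084,  # feet to metres (assumed definition of alt_m)
  high  = alt_m > 300     # a new column can use one created just before
)
```

With this definition, 40 ft gives about 12.2 m and 1044 ft about 318 m, which agrees with the alt_m values printed earlier.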
1.2.7 Summarise
## # A tibble: 12 x 20
## # Groups: month [12]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 2 1 456 500 -4 652 648
## 3 2013 3 1 4 2159 125 318 56
## 4 2013 4 1 454 500 -6 636 640
## 5 2013 5 1 9 1655 434 308 2020
## 6 2013 6 1 2 2359 3 341 350
## 7 2013 7 1 1 2029 212 236 2359
## 8 2013 8 1 12 2130 162 257 14
## 9 2013 9 1 9 2359 10 343 340
## 10 2013 10 1 447 500 -13 614 648
## 11 2013 11 1 5 2359 6 352 345
## 12 2013 12 1 13 2359 14 446 445
## # … with 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, type_retard <chr>
## # A tibble: 336,776 x 21
## # Groups: month [12]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 13 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## # type_retard <chr>, mean_delay_month <dbl>
## # A tibble: 12 x 20
## # Groups: month [12]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## 2 2013 10 14 2042 900 702 2255 1127
## 3 2013 11 3 603 1645 798 829 1913
## 4 2013 12 5 756 1700 896 1058 2020
## 5 2013 2 10 2243 830 853 100 1106
## 6 2013 3 17 2321 810 911 135 1020
## 7 2013 4 10 1100 1900 960 1342 2211
## 8 2013 5 3 1133 2055 878 1250 2215
## 9 2013 6 15 1432 1935 1137 1607 2120
## 10 2013 7 22 845 1600 1005 1044 1815
## 11 2013 8 8 2334 1454 520 120 1710
## 12 2013 9 20 1139 1845 1014 1457 2210
## # … with 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, type_retard <chr>
## # A tibble: 105 x 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 265
## 3 ALB 439
## 4 ANC 8
## 5 ATL 17215
## 6 AUS 2439
## 7 AVL 275
## 8 BDL 443
## 9 BGR 375
## 10 BHM 297
## # … with 95 more rows
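The code for this section is not shown. A minimal sketch on invented data (`flights_toy` is an assumption): summarise() collapses a table into one row of summary statistics, and count() gives the number of rows per value, as in the dest / n output above.

```r
library(dplyr)

# Toy table (invented values)
flights_toy <- tibble(
  dest      = c("ATL", "ATL", "BOS", "ATL", "BOS"),
  dep_delay = c(10, NA, 0, 20, 30)
)

# One row summarising the whole table; na.rm = TRUE ignores missing delays
overall <- summarise(
  flights_toy,
  n_flights  = n(),
  mean_delay = mean(dep_delay, na.rm = TRUE)
)

# Number of rows per destination, in a column named n
per_dest <- count(flights_toy, dest)
```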
1.2.8 Group_by
## # A tibble: 2,313 x 4
## # Groups: month, origin [36]
## month origin dest nb
## <int> <chr> <chr> <int>
## 1 1 EWR ALB 64
## 2 1 EWR ATL 362
## 3 1 EWR AUS 51
## 4 1 EWR AVL 2
## 5 1 EWR BDL 37
## 6 1 EWR BNA 111
## 7 1 EWR BOS 430
## 8 1 EWR BQN 31
## 9 1 EWR BTV 100
## 10 1 EWR BUF 119
## # … with 2,303 more rows
Applying mean() to a condition such as arr_delay > 60 returns a proportion, because TRUE counts as 1 and FALSE as 0.
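As a sketch on invented data (`flights_toy` and the column name `prop_over_60` are assumptions), group_by() followed by summarise() gives one row per group, and the mean of a condition gives the share of rows that satisfy it:

```r
library(dplyr)

# Toy table (invented values)
flights_toy <- tibble(
  month     = c(1, 1, 1, 2, 2),
  origin    = c("EWR", "EWR", "JFK", "EWR", "EWR"),
  arr_delay = c(90, 10, 75, 61, 5)
)

by_month <- flights_toy %>%
  group_by(month) %>%
  summarise(
    nb           = n(),                              # flights per month
    prop_over_60 = mean(arr_delay > 60, na.rm = TRUE) # share delayed > 60 min
  )
```

Here two of the three January flights and one of the two February flights arrive more than 60 minutes late, so the proportions are 2/3 and 1/2.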
1.2.9 Ungroup
ungroup() cancels the grouping set by group_by().
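A small sketch of why this matters, on invented data (`flights_toy` and `mean_delay_all` are assumptions; `mean_delay_month` echoes a column name from the outputs above): the same mutate() gives a per-group result before ungroup() and a whole-table result after it.

```r
library(dplyr)

# Toy table (invented values)
flights_toy <- tibble(
  month     = c(1, 1, 2),
  dep_delay = c(10, 30, 50)
)

with_means <- flights_toy %>%
  group_by(month) %>%
  mutate(mean_delay_month = mean(dep_delay)) %>%  # computed within each month
  ungroup() %>%
  mutate(mean_delay_all = mean(dep_delay))        # computed over all rows
```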
1.2.10 Distinct
distinct() removes duplicate rows.
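A minimal sketch on an invented table (`routes_toy` is an assumption): without further arguments distinct() compares whole rows, and with column names it keeps the distinct values of those columns only.

```r
library(dplyr)

# Toy table with one duplicated row (invented values)
routes_toy <- tibble(
  origin = c("EWR", "EWR", "JFK", "EWR"),
  dest   = c("ATL", "ATL", "BOS", "BOS")
)

# Drop fully duplicated rows
unique_rows <- distinct(routes_toy)

# Distinct values of a single column
unique_origins <- distinct(routes_toy, origin)
```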
1.3 Missing values (NA)
1.3.1 is.na(x)
## # A tibble: 3 x 1
## x
## <dbl>
## 1 1
## 2 NA
## 3 3
## # A tibble: 2 x 1
## x
## <dbl>
## 1 NA
## 2 3
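The code behind these two outputs is not shown. The sketch below is a reconstruction that is consistent with them, not necessarily the original code: a three-value column x, then a filter that keeps the values above 1 together with the missing ones.

```r
library(dplyr)

# The three values shown in the first output
df <- tibble(x = c(1, NA, 3))

# NA > 1 evaluates to NA, so filter() drops that row
only_gt1 <- filter(df, x > 1)

# To keep missing values, ask for them explicitly with is.na()
gt1_or_na <- filter(df, is.na(x) | x > 1)
```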
2 Guided exercises
- Select patients A02, A36 and A49
## # A tibble: 3 x 16
## id haplotype cyp3A5D age_r sexe_r age_d sexe_d rejet_aigu TIF event
## <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 A02 autre NEs 50 M 23 M 1 1110 0
## 2 A36 het NEs 60 F 27 M 0 1020 0
## 3 A49 autre NEs 72 F 32 M 0 1260 0
## # … with 6 more variables: delai_event <dbl>, CYP3A4_1B <chr>,
## # MDR1_C1236T <chr>, MDR1_G2677T <chr>, MDR1_C3435T <chr>, pente_creat <dbl>
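One possible way to write this selection, sketched on a reduced invented table (`patients_toy` is an assumption; the name of the real data set is not given in the text), is filter() with %in%:

```r
library(dplyr)

# Reduced toy table; only id and age_r are kept, and patient A01 is invented
patients_toy <- tibble(
  id    = c("A01", "A02", "A36", "A49", "B58"),
  age_r = c(41, 50, 60, 72, 74)
)

# Keep the rows whose id is one of the three requested identifiers
picked <- filter(patients_toy, id %in% c("A02", "A36", "A49"))
```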
- Select the patients whose creatinine slope is between -1 and 4.
## # A tibble: 72 x 16
## id haplotype cyp3A5D age_r sexe_r age_d sexe_d rejet_aigu TIF event
## <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 A06 autre NEs 35 M 22 M 1 1070 0
## 2 E06 het NEs 38 M 28 M 1 1080 0
## 3 G20 het NEs 73 M 71 M 1 1015 0
## 4 B58 hom Es 74 F 19 F 0 1560 0
## 5 A60 autre NEs 48 F 20 M 1 1483 0
## 6 A62 het NEs 40 M 43 M 1 1345 0
## 7 B80 autre Es 65 F 48 F 0 1410 0
## 8 A51 autre NEs 68 M 30 M 0 890 0
## 9 A54 hom NEs 56 M 41 M 0 2130 0
## 10 E44 hom NEs 75 F 48 M 1 1995 0
## # … with 62 more rows, and 6 more variables: delai_event <dbl>,
## # CYP3A4_1B <chr>, MDR1_C1236T <chr>, MDR1_G2677T <chr>, MDR1_C3435T <chr>,
## # pente_creat <dbl>
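A hedged sketch of this selection on invented values (`patients_toy` is an assumption, and the exercise does not say whether the bounds are included; between() includes both):

```r
library(dplyr)

# Toy table (invented values)
patients_toy <- tibble(
  id          = c("P1", "P2", "P3", "P4"),
  pente_creat = c(-2.5, -1, 0.3, 4.2)
)

# between(x, left, right) is shorthand for x >= left & x <= right
in_range <- filter(patients_toy, between(pente_creat, -1, 4))

# Equivalent form with two explicit conditions
in_range_explicit <- filter(patients_toy, pente_creat >= -1, pente_creat <= 4)
```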
- Select the patients with the extreme ages (the youngest and the oldest)
## # A tibble: 0 x 16
## # … with 16 variables: id <chr>, haplotype <chr>, cyp3A5D <chr>, age_r <dbl>,
## # sexe_r <chr>, age_d <dbl>, sexe_d <chr>, rejet_aigu <dbl>, TIF <dbl>,
## # event <dbl>, delai_event <dbl>, CYP3A4_1B <chr>, MDR1_C1236T <chr>,
## # MDR1_G2677T <chr>, MDR1_C3435T <chr>, pente_creat <dbl>
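The result above is empty. The original code is not shown, so the following is only a guess at what may have happened: combining the two conditions with & asks for a patient who is both the youngest and the oldest, which returns no row, whereas | returns both extremes. The sketch uses an invented table (`patients_toy`) and assumes that the recipient age age_r is the variable of interest.

```r
library(dplyr)

# Toy table (invented values)
patients_toy <- tibble(
  id    = c("P1", "P2", "P3", "P4"),
  age_r = c(35, 74, 19, 56)
)

# OR: rows at the minimum or at the maximum age
extremes <- filter(patients_toy, age_r == min(age_r) | age_r == max(age_r))

# AND: no row can be both the minimum and the maximum here
both_at_once <- filter(patients_toy, age_r == min(age_r) & age_r == max(age_r))
```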
## # A tibble: 253 x 3
## id tif_j log_tif_j
## <chr> <dbl> <dbl>
## 1 3 42.5 3.75
## 2 4 34.4 3.54
## 3 5 42.5 3.75
## 4 7 34.5 3.54
## 5 8 51.9 3.95
## 6 9 61.2 4.11
## 7 12 45 3.81
## 8 14 56.2 4.03
## 9 16 86.2 4.46
## 10 17 43.8 3.78
## # … with 243 more rows
- Create an age_cat variable with 4 categories: 0 to 30, 31 to 50, 51 to 70, and over 70 years.
- Give the sum and the proportion
## # A tibble: 1 x 1
## creat_pos_prop
## <int>
## 1 47
## # A tibble: 1 x 1
## creat_pos_prop
## <dbl>
## 1 18.6
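The last two exercises can be sketched as follows, under several assumptions: the table `patients_toy` and its values are invented, ages are whole numbers, the categories are built on age_r, and the sum and proportion refer to patients with a positive creatinine slope (suggested by the column name creat_pos_prop). The name `creat_pos_n` is hypothetical.

```r
library(dplyr)

# Toy table (invented values)
patients_toy <- tibble(
  age_r       = c(25, 30, 31, 50, 51, 70, 71, 88),
  pente_creat = c(-3.1, 0.4, -0.2, 1.5, -1.0, 0, 2.2, -0.7)
)

# cut() with right-closed intervals: [0,30], (30,50], (50,70], (70,Inf]
patients_toy <- mutate(
  patients_toy,
  age_cat = cut(
    age_r,
    breaks = c(0, 30, 50, 70, Inf),
    labels = c("0-30", "31-50", "51-70", ">70"),
    include.lowest = TRUE
  )
)

# sum() of a condition counts the TRUE values; mean() gives their share
creat_pos <- summarise(
  patients_toy,
  creat_pos_n    = sum(pente_creat > 0),
  creat_pos_prop = 100 * mean(pente_creat > 0)  # expressed as a percentage
)
```

If the real table has the 253 rows of the tibble printed earlier, 47 out of 253 is about 18.6 %, which matches the two values shown above.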