select
Keep the variables name, hair_color, and films.filter
select blonds.filter
select female blonds.mutate
Calculate the metric BMI index.mutate
If BMI is greater than 18.5 and smaller than 24.9, then BMIcat = “healthy”, otherwise BMIcat = “not”.summarize
Calculate mean BMI.group_by
and summarize
Calculate mean BMI by gender.In this exercise you will learn to clean data using the dplyr
package. To this end, you will follow through the codes in one of our e-texts, Data Visualization with R. The given example code below is from Chapter 1.2 Cleaning data.
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
1.2.1 Selecting variables The select
function allows you to limit your dataset to specified variables (columns).
## # A tibble: 87 x 3
## name height gender
## <chr> <int> <chr>
## 1 Luke Skywalker 172 male
## 2 C-3PO 167 <NA>
## 3 R2-D2 96 <NA>
## 4 Darth Vader 202 male
## 5 Leia Organa 150 female
## 6 Owen Lars 178 male
## 7 Beru Whitesun lars 165 female
## 8 R5-D4 97 <NA>
## 9 Biggs Darklighter 183 male
## 10 Obi-Wan Kenobi 182 male
## # … with 77 more rows
## # A tibble: 87 x 9
## name mass hair_color skin_color eye_color birth_year gender homeworld
## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke… 77 blond fair blue 19 male Tatooine
## 2 C-3PO 75 <NA> gold yellow 112 <NA> Tatooine
## 3 R2-D2 32 <NA> white, bl… red 33 <NA> Naboo
## 4 Dart… 136 none white yellow 41.9 male Tatooine
## 5 Leia… 49 brown light brown 19 female Alderaan
## 6 Owen… 120 brown, gr… light blue 52 male Tatooine
## 7 Beru… 75 brown light blue 47 female Tatooine
## 8 R5-D4 32 <NA> white, red red NA <NA> Tatooine
## 9 Bigg… 84 black light brown 24 male Tatooine
## 10 Obi-… 77 auburn, w… fair blue-gray 57 male Stewjon
## # … with 77 more rows, and 1 more variable: species <chr>
## # A tibble: 87 x 11
## name height mass hair_color skin_color eye_color homeworld species
## <chr> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 Luke… 172 77 blond fair blue Tatooine Human
## 2 C-3PO 167 75 <NA> gold yellow Tatooine Droid
## 3 R2-D2 96 32 <NA> white, bl… red Naboo Droid
## 4 Dart… 202 136 none white yellow Tatooine Human
## 5 Leia… 150 49 brown light brown Alderaan Human
## 6 Owen… 178 120 brown, gr… light blue Tatooine Human
## 7 Beru… 165 75 brown light blue Tatooine Human
## 8 R5-D4 97 32 <NA> white, red red Tatooine Droid
## 9 Bigg… 183 84 black light brown Tatooine Human
## 10 Obi-… 182 77 auburn, w… fair blue-gray Stewjon Human
## # … with 77 more rows, and 3 more variables: films <list>,
## # vehicles <list>, starships <list>
1.2.2 Selecting observations The filter
function allows you to limit your dataset to observations (rows) meeting a specific criteria. Multiple criteria can be combined with the & (AND) and | (OR) symbols.
## # A tibble: 19 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Leia… 150 49 brown light brown 19 female
## 2 Beru… 165 75 brown light blue 47 female
## 3 Mon … 150 NA auburn fair blue 48 female
## 4 Shmi… 163 NA black fair brown 72 female
## 5 Ayla… 178 55 none blue hazel 48 female
## 6 Adi … 184 50 none dark blue NA female
## 7 Cordé 157 NA brown light brown NA female
## 8 Lumi… 170 56.2 black yellow blue 58 female
## 9 Barr… 166 50 black yellow blue 40 female
## 10 Dormé 165 NA brown light brown NA female
## 11 Zam … 168 55 blonde fair, gre… yellow NA female
## 12 Taun… 213 NA none grey black NA female
## 13 Joca… 167 NA white fair blue NA female
## 14 R4-P… 96 NA none silver, r… red, blue NA female
## 15 Shaa… 178 57 none red, blue… black NA female
## 16 Sly … 178 48 none pale white NA female
## 17 Rey NA NA brown light hazel NA female
## 18 Capt… NA NA unknown unknown unknown NA female
## 19 Padm… 165 45 brown light brown 46 female
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
## # A tibble: 1 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Leia… 150 49 brown light brown 19 female
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
## # A tibble: 7 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Leia… 150 49 brown light brown 19 female
## 2 Wick… 88 20 brown brown brown 8 male
## 3 Fini… 170 NA blond fair blue 91 male
## 4 Adi … 184 50 none dark blue NA female
## 5 Bail… 191 NA black tan brown 67 male
## 6 Joca… 167 NA white fair blue NA female
## 7 Raym… 188 79 brown light brown NA male
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
## # A tibble: 7 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Leia… 150 49 brown light brown 19 female
## 2 Wick… 88 20 brown brown brown 8 male
## 3 Fini… 170 NA blond fair blue 91 male
## 4 Adi … 184 50 none dark blue NA female
## 5 Bail… 191 NA black tan brown 67 male
## 6 Joca… 167 NA white fair blue NA female
## 7 Raym… 188 79 brown light brown NA male
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
1.2.3 Creating/Recoding variables The mutate
function allows you to create new variables or transform existing ones.
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 67.8 170. blond fair blue 19 male
## 2 C-3PO 65.8 165. <NA> gold yellow 112 <NA>
## 3 R2-D2 37.8 70.6 <NA> white, bl… red 33 <NA>
## 4 Dart… 79.6 300. none white yellow 41.9 male
## 5 Leia… 59.1 108. brown light brown 19 female
## 6 Owen… 70.1 265. brown, gr… light blue 52 male
## 7 Beru… 65.0 165. brown light blue 47 female
## 8 R5-D4 38.2 70.6 <NA> white, red red NA <NA>
## 9 Bigg… 72.1 185. black light brown 24 male
## 10 Obi-… 71.7 170. auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
The ifelse
function (part of base R) can be used for recoding data. The format is ifelse(test, return if TRUE, return if FALSE).
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 6 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>,
## # heightcat <chr>
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold other 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… other 33 <NA>
## 4 Dart… 202 136 none white other 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red other NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair other 57 male
## # … with 77 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… NA 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
1.2.4 Summarizing data The summarize
function can be used to reduce multiple values down to a single value (such as a mean). It is often used in conjunction with the by_group function, to calculate statistics by group. In the code below, the na.rm=TRUE option is used to drop missing values before calculating the means.
## # A tibble: 1 x 2
## mean_ht mean_mass
## <dbl> <dbl>
## 1 174. 97.3
## # A tibble: 5 x 3
## gender mean_ht mean_wt
## <chr> <dbl> <dbl>
## 1 female 165. 54.0
## 2 hermaphrodite 175 1358
## 3 male 179. 81.0
## 4 none 200 140
## 5 <NA> 120 46.3
1.2.5 Using pipes Packages like dplyr
and tidyr
allow you to write your code in a compact format using the pipe %>%
operator. Here is an example.
## # A tibble: 8 x 2
## species mean_ht
## <chr> <dbl>
## 1 Clawdite 168
## 2 Human 160.
## 3 Kaminoan 213
## 4 Mirialan 168
## 5 Tholothian 184
## 6 Togruta 178
## 7 Twi'lek 178
## 8 <NA> 137
## # A tibble: 8 x 2
## species mean_ht
## <chr> <dbl>
## 1 Clawdite 168
## 2 Human 160.
## 3 Kaminoan 213
## 4 Mirialan 168
## 5 Tholothian 184
## 6 Togruta 178
## 7 Twi'lek 178
## 8 <NA> 137
1.2.6 Reshaping data
1.2.7 Missing data
select
Keep the variables name, hair_color, and films.## # A tibble: 87 x 3
## name hair_color films
## <chr> <chr> <list>
## 1 Luke Skywalker blond <chr [5]>
## 2 C-3PO <NA> <chr [6]>
## 3 R2-D2 <NA> <chr [7]>
## 4 Darth Vader none <chr [4]>
## 5 Leia Organa brown <chr [5]>
## 6 Owen Lars brown, grey <chr [3]>
## 7 Beru Whitesun lars brown <chr [3]>
## 8 R5-D4 <NA> <chr [1]>
## 9 Biggs Darklighter black <chr [1]>
## 10 Obi-Wan Kenobi auburn, white <chr [6]>
## # … with 77 more rows
filter
select blonds.## # A tibble: 3 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 Anak… 188 84 blond fair blue 41.9 male
## 3 Fini… 170 NA blond fair blue 91 male
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
filter
select female blonds.## # A tibble: 0 x 13
## # … with 13 variables: name <chr>, height <int>, mass <dbl>,
## # hair_color <chr>, skin_color <chr>, eye_color <chr>, birth_year <dbl>,
## # gender <chr>, homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
mutate
Calculate the metric BMI index.Hint: BMI = Weight (KG) ÷ Height (Metres²)
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 6 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>,
## # BMI <dbl>
mutate
If BMI is greater than 18.5 and smaller than 24.9, then BMIcat = “healthy”, otherwise BMIcat = “not”.## # A tibble: 87 x 15
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 7 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>,
## # BMI <dbl>, BMIcat <chr>
summarize
Calculate mean BMI.## # A tibble: 1 x 1
## BMI_avg
## <dbl>
## 1 32.0
group_by
and summarize
Calculate mean BMI by gender.## # A tibble: 5 x 2
## gender BMI_avg
## <chr> <dbl>
## 1 female 18.8
## 2 hermaphrodite 443.
## 3 male 25.7
## 4 none 35
## 5 <NA> 31.9
Hint: Use message
, echo
and results
in the chunk options. Refer to the RMarkdown Reference Guide.