select
Keep the variables name, hair_color, and films.filter
select blonds.filter
select female blonds.mutate
Calculate the metric BMI index.mutate
If BMI is greater than 18.5 and smaller than 24.9, then BMIcat = “healthy”, otherwise BMIcat = “not”.`summarize
Calculate mean BMI.group_by
and summarize
Calculate mean BMI by gender.In this exercise you will learn to clean data using the dplyr
package. To this end, you will follow through the codes in one of our e-texts, Data Visualization with R. The given example code below is from Chapter 1.2 Cleaning data.
# Load package
library(tidyverse)
# Import data
data(starwars)
starwars
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
1.2.1 Selecting variables The select
function allows you to limit your dataset to specified variables (columns).
# keep the variables name, height, and gender
select(starwars, name, height, gender)
## # A tibble: 87 x 3
## name height gender
## <chr> <int> <chr>
## 1 Luke Skywalker 172 male
## 2 C-3PO 167 <NA>
## 3 R2-D2 96 <NA>
## 4 Darth Vader 202 male
## 5 Leia Organa 150 female
## 6 Owen Lars 178 male
## 7 Beru Whitesun lars 165 female
## 8 R5-D4 97 <NA>
## 9 Biggs Darklighter 183 male
## 10 Obi-Wan Kenobi 182 male
## # … with 77 more rows
# keep the variables name and all variables
# between mass and species inclusive
select(starwars, name, mass:species)
## # A tibble: 87 x 9
## name mass hair_color skin_color eye_color birth_year gender homeworld
## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke… 77 blond fair blue 19 male Tatooine
## 2 C-3PO 75 <NA> gold yellow 112 <NA> Tatooine
## 3 R2-D2 32 <NA> white, bl… red 33 <NA> Naboo
## 4 Dart… 136 none white yellow 41.9 male Tatooine
## 5 Leia… 49 brown light brown 19 female Alderaan
## 6 Owen… 120 brown, gr… light blue 52 male Tatooine
## 7 Beru… 75 brown light blue 47 female Tatooine
## 8 R5-D4 32 <NA> white, red red NA <NA> Tatooine
## 9 Bigg… 84 black light brown 24 male Tatooine
## 10 Obi-… 77 auburn, w… fair blue-gray 57 male Stewjon
## # … with 77 more rows, and 1 more variable: species <chr>
# keep all variables except birth_year and gender
select(starwars, -birth_year, -gender)
## # A tibble: 87 x 11
## name height mass hair_color skin_color eye_color homeworld species
## <chr> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 Luke… 172 77 blond fair blue Tatooine Human
## 2 C-3PO 167 75 <NA> gold yellow Tatooine Droid
## 3 R2-D2 96 32 <NA> white, bl… red Naboo Droid
## 4 Dart… 202 136 none white yellow Tatooine Human
## 5 Leia… 150 49 brown light brown Alderaan Human
## 6 Owen… 178 120 brown, gr… light blue Tatooine Human
## 7 Beru… 165 75 brown light blue Tatooine Human
## 8 R5-D4 97 32 <NA> white, red red Tatooine Droid
## 9 Bigg… 183 84 black light brown Tatooine Human
## 10 Obi-… 182 77 auburn, w… fair blue-gray Stewjon Human
## # … with 77 more rows, and 3 more variables: films <list>,
## # vehicles <list>, starships <list>
1.2.2 Selecting observations The filter
function allows you to limit your dataset to observations (rows) meeting a specific criteria. Multiple criteria can be combined with the & (AND) and | (OR) symbols.
# select females
filter(starwars,
gender == "female")
## # A tibble: 19 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Leia… 150 49 brown light brown 19 female
## 2 Beru… 165 75 brown light blue 47 female
## 3 Mon … 150 NA auburn fair blue 48 female
## 4 Shmi… 163 NA black fair brown 72 female
## 5 Ayla… 178 55 none blue hazel 48 female
## 6 Adi … 184 50 none dark blue NA female
## 7 Cordé 157 NA brown light brown NA female
## 8 Lumi… 170 56.2 black yellow blue 58 female
## 9 Barr… 166 50 black yellow blue 40 female
## 10 Dormé 165 NA brown light brown NA female
## 11 Zam … 168 55 blonde fair, gre… yellow NA female
## 12 Taun… 213 NA none grey black NA female
## 13 Joca… 167 NA white fair blue NA female
## 14 R4-P… 96 NA none silver, r… red, blue NA female
## 15 Shaa… 178 57 none red, blue… black NA female
## 16 Sly … 178 48 none pale white NA female
## 17 Rey NA NA brown light hazel NA female
## 18 Capt… NA NA unknown unknown unknown NA female
## 19 Padm… 165 45 brown light brown 46 female
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
# select females that are from Alderaan
filter(starwars,
gender == "female" & homeworld == "Alderaan")
## # A tibble: 1 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Leia… 150 49 brown light brown 19 female
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
# select individuals that are from
# Alderaan, Coruscant, or Endor
filter(starwars,
homeworld == "Alderaan" | homeworld == "Coruscant" | homeworld == "Endor")
## # A tibble: 7 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Leia… 150 49 brown light brown 19 female
## 2 Wick… 88 20 brown brown brown 8 male
## 3 Fini… 170 NA blond fair blue 91 male
## 4 Adi … 184 50 none dark blue NA female
## 5 Bail… 191 NA black tan brown 67 male
## 6 Joca… 167 NA white fair blue NA female
## 7 Raym… 188 79 brown light brown NA male
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
# this can be written more succinctly as
filter(starwars,
homeworld %in% c("Alderaan", "Coruscant", "Endor"))
## # A tibble: 7 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Leia… 150 49 brown light brown 19 female
## 2 Wick… 88 20 brown brown brown 8 male
## 3 Fini… 170 NA blond fair blue 91 male
## 4 Adi … 184 50 none dark blue NA female
## 5 Bail… 191 NA black tan brown 67 male
## 6 Joca… 167 NA white fair blue NA female
## 7 Raym… 188 79 brown light brown NA male
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
1.2.3 Creating/Recoding variables The mutate
function allows you to create new variables or transform existing ones.
# convert height in centimeters to inches,
# and mass in kilograms to pounds
mutate(starwars,
height = height * 0.394,
mass = mass * 2.205)
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 67.8 170. blond fair blue 19 male
## 2 C-3PO 65.8 165. <NA> gold yellow 112 <NA>
## 3 R2-D2 37.8 70.6 <NA> white, bl… red 33 <NA>
## 4 Dart… 79.6 300. none white yellow 41.9 male
## 5 Leia… 59.1 108. brown light brown 19 female
## 6 Owen… 70.1 265. brown, gr… light blue 52 male
## 7 Beru… 65.0 165. brown light blue 47 female
## 8 R5-D4 38.2 70.6 <NA> white, red red NA <NA>
## 9 Bigg… 72.1 185. black light brown 24 male
## 10 Obi-… 71.7 170. auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
The ifelse
function (part of base R) can be used for recoding data. The format is ifelse(test, return if TRUE, return if FALSE).
# if height is greater than 180
# then heightcat = "tall",
# otherwise heightcat = "short"
mutate(starwars,
heightcat = ifelse(height > 180, "tall", "short"))
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 6 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>,
## # heightcat <chr>
# convert any eye color that is not
# black, blue or brown, to other
mutate(starwars,
eye_color = ifelse(eye_color %in% c("black", "blue", "brown"),eye_color,"other"))
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold other 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… other 33 <NA>
## 4 Dart… 202 136 none white other 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red other NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair other 57 male
## # … with 77 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
# set heights greater than 200 or
# less than 75 to missing
mutate(starwars,
height = ifelse(height < 75 | height > 200,NA,height))
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… NA 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
1.2.4 Summarizing data The summarize
function can be used to reduce multiple values down to a single value (such as a mean). It is often used in conjunction with the by_group function, to calculate statistics by group. In the code below, the na.rm=TRUE option is used to drop missing values before calculating the means.
# calculate mean height and mass
newdata <- summarize(starwars,
mean_ht = mean(height, na.rm=TRUE),
mean_mass = mean(mass, na.rm=TRUE))
newdata
## # A tibble: 1 x 2
## mean_ht mean_mass
## <dbl> <dbl>
## 1 174. 97.3
# calculate mean height and weight by gender
newdata <- group_by(starwars, gender)
newdata <- summarize(newdata,
mean_ht = mean(height, na.rm=TRUE),
mean_wt = mean(mass, na.rm=TRUE))
newdata
## # A tibble: 5 x 3
## gender mean_ht mean_wt
## <chr> <dbl> <dbl>
## 1 female 165. 54.0
## 2 hermaphrodite 175 1358
## 3 male 179. 81.0
## 4 none 200 140
## 5 <NA> 120 46.3
1.2.5 Using pipes Packages like dplyr
and tidyr
allow you to write your code in a compact format using the pipe %>%
operator. Here is an example.
# calculate the mean height for women by species
newdata <- filter(starwars,
gender == "female")
newdata <- group_by(newdata, species)
newdata <- summarize(newdata,
mean_ht = mean(height, na.rm = TRUE))
newdata
## # A tibble: 8 x 2
## species mean_ht
## <chr> <dbl>
## 1 Clawdite 168
## 2 Human 160.
## 3 Kaminoan 213
## 4 Mirialan 168
## 5 Tholothian 184
## 6 Togruta 178
## 7 Twi'lek 178
## 8 <NA> 137
# this can be written as
newdata <- starwars %>%
filter(gender == "female") %>%
group_by(species) %>%
summarize(mean_ht = mean(height, na.rm = TRUE))
newdata
## # A tibble: 8 x 2
## species mean_ht
## <chr> <dbl>
## 1 Clawdite 168
## 2 Human 160.
## 3 Kaminoan 213
## 4 Mirialan 168
## 5 Tholothian 184
## 6 Togruta 178
## 7 Twi'lek 178
## 8 <NA> 137
1.2.6 Reshaping data
1.2.7 Missing data
select
Keep the variables name, hair_color, and films.select(starwars, name, hair_color, films)
## # A tibble: 87 x 3
## name hair_color films
## <chr> <chr> <list>
## 1 Luke Skywalker blond <chr [5]>
## 2 C-3PO <NA> <chr [6]>
## 3 R2-D2 <NA> <chr [7]>
## 4 Darth Vader none <chr [4]>
## 5 Leia Organa brown <chr [5]>
## 6 Owen Lars brown, grey <chr [3]>
## 7 Beru Whitesun lars brown <chr [3]>
## 8 R5-D4 <NA> <chr [1]>
## 9 Biggs Darklighter black <chr [1]>
## 10 Obi-Wan Kenobi auburn, white <chr [6]>
## # … with 77 more rows
filter
select blonds.filter(starwars, hair_color == "blonds")
## # A tibble: 0 x 13
## # … with 13 variables: name <chr>, height <int>, mass <dbl>,
## # hair_color <chr>, skin_color <chr>, eye_color <chr>, birth_year <dbl>,
## # gender <chr>, homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
filter
select female blonds.filter(starwars, hair_color == "blonds" & gender == "female")
## # A tibble: 0 x 13
## # … with 13 variables: name <chr>, height <int>, mass <dbl>,
## # hair_color <chr>, skin_color <chr>, eye_color <chr>, birth_year <dbl>,
## # gender <chr>, homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
mutate
Calculate the metric BMI index.Hint: BMI = Weight (KG) ÷ Height (Metres²)
newdata <- mutate(starwars,
BMI = mass / (height/100)^2)
newdata
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 6 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>,
## # BMI <dbl>
mutate
If BMI is greater than 18.5 and smaller than 24.9, then BMIcat = “healthy”, otherwise BMIcat = “not”.`newdata <- mutate(newdata,
BMIcat = ifelse(BMI> 18.5 & BMI < 24.9,
"healthy" ,
"not"))
summarize
Calculate mean BMI.summarize(newdata, BMIaverage = mean(BMI, na.rm=TRUE))
## # A tibble: 1 x 1
## BMIaverage
## <dbl>
## 1 32.0
group_by
and summarize
Calculate mean BMI by gender.newdata %>%
group_by(gender) %>%
summarize(avg_BMI = mean(BMI, na.rm=TRUE))
## # A tibble: 5 x 2
## gender avg_BMI
## <chr> <dbl>
## 1 female 18.8
## 2 hermaphrodite 443.
## 3 male 25.7
## 4 none 35
## 5 <NA> 31.9
newdata <- group_by(newdata, gender)
newdata <- summarize(newdata,
avg_BMI = mean(BMI, na.rm=TRUE))
Hint: Use message
, echo
and results
in the chunk options. Refer to the RMarkdown Reference Guide.