| itle: “Cleaning data” |
| uthor: “Nathan Barnhart” |
| ate: “9/16/2019” |
| utput: |
| html_document: |
| toc: TRUE |
In this exercise you will learn to clean data using the dplyr package. To this end, you will follow through the codes in one of our e-texts, Data Visualization with R. The given example code below is from Chapter 1.2 Cleaning data.
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
1.2.1 Selecting variables The select function allows you to limit your dataset to specified variables (columns).
## # A tibble: 87 x 3
## name height gender
## <chr> <int> <chr>
## 1 Luke Skywalker 172 male
## 2 C-3PO 167 <NA>
## 3 R2-D2 96 <NA>
## 4 Darth Vader 202 male
## 5 Leia Organa 150 female
## 6 Owen Lars 178 male
## 7 Beru Whitesun lars 165 female
## 8 R5-D4 97 <NA>
## 9 Biggs Darklighter 183 male
## 10 Obi-Wan Kenobi 182 male
## # … with 77 more rows
## # A tibble: 87 x 9
## name mass hair_color skin_color eye_color birth_year gender homeworld
## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke… 77 blond fair blue 19 male Tatooine
## 2 C-3PO 75 <NA> gold yellow 112 <NA> Tatooine
## 3 R2-D2 32 <NA> white, bl… red 33 <NA> Naboo
## 4 Dart… 136 none white yellow 41.9 male Tatooine
## 5 Leia… 49 brown light brown 19 female Alderaan
## 6 Owen… 120 brown, gr… light blue 52 male Tatooine
## 7 Beru… 75 brown light blue 47 female Tatooine
## 8 R5-D4 32 <NA> white, red red NA <NA> Tatooine
## 9 Bigg… 84 black light brown 24 male Tatooine
## 10 Obi-… 77 auburn, w… fair blue-gray 57 male Stewjon
## # … with 77 more rows, and 1 more variable: species <chr>
## # A tibble: 87 x 11
## name height mass hair_color skin_color eye_color homeworld species
## <chr> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 Luke… 172 77 blond fair blue Tatooine Human
## 2 C-3PO 167 75 <NA> gold yellow Tatooine Droid
## 3 R2-D2 96 32 <NA> white, bl… red Naboo Droid
## 4 Dart… 202 136 none white yellow Tatooine Human
## 5 Leia… 150 49 brown light brown Alderaan Human
## 6 Owen… 178 120 brown, gr… light blue Tatooine Human
## 7 Beru… 165 75 brown light blue Tatooine Human
## 8 R5-D4 97 32 <NA> white, red red Tatooine Droid
## 9 Bigg… 183 84 black light brown Tatooine Human
## 10 Obi-… 182 77 auburn, w… fair blue-gray Stewjon Human
## # … with 77 more rows, and 3 more variables: films <list>,
## # vehicles <list>, starships <list>
1.2.2 Selecting observations The filter function allows you to limit your dataset to observations (rows) meeting a specific criteria. Multiple criteria can be combined with the & (AND) and | (OR) symbols.
## # A tibble: 19 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Leia… 150 49 brown light brown 19 female
## 2 Beru… 165 75 brown light blue 47 female
## 3 Mon … 150 NA auburn fair blue 48 female
## 4 Shmi… 163 NA black fair brown 72 female
## 5 Ayla… 178 55 none blue hazel 48 female
## 6 Adi … 184 50 none dark blue NA female
## 7 Cordé 157 NA brown light brown NA female
## 8 Lumi… 170 56.2 black yellow blue 58 female
## 9 Barr… 166 50 black yellow blue 40 female
## 10 Dormé 165 NA brown light brown NA female
## 11 Zam … 168 55 blonde fair, gre… yellow NA female
## 12 Taun… 213 NA none grey black NA female
## 13 Joca… 167 NA white fair blue NA female
## 14 R4-P… 96 NA none silver, r… red, blue NA female
## 15 Shaa… 178 57 none red, blue… black NA female
## 16 Sly … 178 48 none pale white NA female
## 17 Rey NA NA brown light hazel NA female
## 18 Capt… NA NA unknown unknown unknown NA female
## 19 Padm… 165 45 brown light brown 46 female
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
## # A tibble: 1 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Leia… 150 49 brown light brown 19 female
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
## # A tibble: 7 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Leia… 150 49 brown light brown 19 female
## 2 Wick… 88 20 brown brown brown 8 male
## 3 Fini… 170 NA blond fair blue 91 male
## 4 Adi … 184 50 none dark blue NA female
## 5 Bail… 191 NA black tan brown 67 male
## 6 Joca… 167 NA white fair blue NA female
## 7 Raym… 188 79 brown light brown NA male
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
## # A tibble: 7 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Leia… 150 49 brown light brown 19 female
## 2 Wick… 88 20 brown brown brown 8 male
## 3 Fini… 170 NA blond fair blue 91 male
## 4 Adi … 184 50 none dark blue NA female
## 5 Bail… 191 NA black tan brown 67 male
## 6 Joca… 167 NA white fair blue NA female
## 7 Raym… 188 79 brown light brown NA male
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
1.2.3 Creating/Recoding variables The mutate function allows you to create new variables or transform existing ones.
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 67.8 170. blond fair blue 19 male
## 2 C-3PO 65.8 165. <NA> gold yellow 112 <NA>
## 3 R2-D2 37.8 70.6 <NA> white, bl… red 33 <NA>
## 4 Dart… 79.6 300. none white yellow 41.9 male
## 5 Leia… 59.1 108. brown light brown 19 female
## 6 Owen… 70.1 265. brown, gr… light blue 52 male
## 7 Beru… 65.0 165. brown light blue 47 female
## 8 R5-D4 38.2 70.6 <NA> white, red red NA <NA>
## 9 Bigg… 72.1 185. black light brown 24 male
## 10 Obi-… 71.7 170. auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
The ifelse function (part of base R) can be used for recoding data. The format is ifelse(test, return if TRUE, return if FALSE).
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 6 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>,
## # heightcat <chr>
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold other 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… other 33 <NA>
## 4 Dart… 202 136 none white other 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red other NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair other 57 male
## # … with 77 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… NA 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 5 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
1.2.4 Summarizing data The summarize function can be used to reduce multiple values down to a single value (such as a mean). It is often used in conjunction with the by_group function, to calculate statistics by group. In the code below, the na.rm=TRUE option is used to drop missing values before calculating the means.
## # A tibble: 1 x 2
## mean_ht mean_mass
## <dbl> <dbl>
## 1 174. 97.3
## # A tibble: 5 x 3
## gender mean_ht mean_wt
## <chr> <dbl> <dbl>
## 1 female 165. 54.0
## 2 hermaphrodite 175 1358
## 3 male 179. 81.0
## 4 none 200 140
## 5 <NA> 120 46.3
1.2.5 Using pipes Packages like dplyr and tidyr allow you to write your code in a compact format using the pipe %>% operator. Here is an example.
## # A tibble: 8 x 2
## species mean_ht
## <chr> <dbl>
## 1 Clawdite 168
## 2 Human 160.
## 3 Kaminoan 213
## 4 Mirialan 168
## 5 Tholothian 184
## 6 Togruta 178
## 7 Twi'lek 178
## 8 <NA> 137
## # A tibble: 8 x 2
## species mean_ht
## <chr> <dbl>
## 1 Clawdite 168
## 2 Human 160.
## 3 Kaminoan 213
## 4 Mirialan 168
## 5 Tholothian 184
## 6 Togruta 178
## 7 Twi'lek 178
## 8 <NA> 137
1.2.6 Reshaping data
1.2.7 Missing data
select Keep the variables name, hair_color, and films.## # A tibble: 87 x 3
## name hair_color films
## <chr> <chr> <list>
## 1 Luke Skywalker blond <chr [5]>
## 2 C-3PO <NA> <chr [6]>
## 3 R2-D2 <NA> <chr [7]>
## 4 Darth Vader none <chr [4]>
## 5 Leia Organa brown <chr [5]>
## 6 Owen Lars brown, grey <chr [3]>
## 7 Beru Whitesun lars brown <chr [3]>
## 8 R5-D4 <NA> <chr [1]>
## 9 Biggs Darklighter black <chr [1]>
## 10 Obi-Wan Kenobi auburn, white <chr [6]>
## # … with 77 more rows
filter select blonds.## # A tibble: 3 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 Anak… 188 84 blond fair blue 41.9 male
## 3 Fini… 170 NA blond fair blue 91 male
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
filter select female blonds.## # A tibble: 0 x 13
## # … with 13 variables: name <chr>, height <int>, mass <dbl>,
## # hair_color <chr>, skin_color <chr>, eye_color <chr>, birth_year <dbl>,
## # gender <chr>, homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
mutate Calculate the metric BMI index.Hint: BMI = Weight (KG) ÷ Height (Metres²)
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 6 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>,
## # BMI <dbl>
mutate If BMI is greater than 18.5 and smaller than 24.9, then BMIcat = “healthy”, otherwise BMIcat = “not”.## # A tibble: 87 x 15
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 7 more variables: homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>,
## # BMI <dbl>, BMIcat <chr>
summarize Calculate mean BMI.## # A tibble: 1 x 1
## BMI_tot
## <dbl>
## 1 32.0
group_by and summarize Calculate mean BMI by gender.## # A tibble: 5 x 2
## gender BMI_avg
## <chr> <dbl>
## 1 female 18.8
## 2 hermaphrodite 443.
## 3 male 25.7
## 4 none 35
## 5 <NA> 31.9
```
Hint: Use message, echo and results in the chunk options. Refer to the RMarkdown Reference Guide.