In this exercise you will learn to clean data using the dplyr package. To this end, you will follow through the codes in one of our e-texts, Data Visualization with R. The given example code below is from Chapter 1.2 Cleaning data.

## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
##  4 Dart…    202   136 none       white      yellow          41.9 male  
##  5 Leia…    150    49 brown      light      brown           19   female
##  6 Owen…    178   120 brown, gr… light      blue            52   male  
##  7 Beru…    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
##  9 Bigg…    183    84 black      light      brown           24   male  
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
## # … with 77 more rows, and 5 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>

1.2.1 Selecting variables The select function allows you to limit your dataset to specified variables (columns).

## # A tibble: 87 x 3
##    name               height gender
##    <chr>               <int> <chr> 
##  1 Luke Skywalker        172 male  
##  2 C-3PO                 167 <NA>  
##  3 R2-D2                  96 <NA>  
##  4 Darth Vader           202 male  
##  5 Leia Organa           150 female
##  6 Owen Lars             178 male  
##  7 Beru Whitesun lars    165 female
##  8 R5-D4                  97 <NA>  
##  9 Biggs Darklighter     183 male  
## 10 Obi-Wan Kenobi        182 male  
## # … with 77 more rows
## # A tibble: 87 x 9
##    name   mass hair_color skin_color eye_color birth_year gender homeworld
##    <chr> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>    
##  1 Luke…    77 blond      fair       blue            19   male   Tatooine 
##  2 C-3PO    75 <NA>       gold       yellow         112   <NA>   Tatooine 
##  3 R2-D2    32 <NA>       white, bl… red             33   <NA>   Naboo    
##  4 Dart…   136 none       white      yellow          41.9 male   Tatooine 
##  5 Leia…    49 brown      light      brown           19   female Alderaan 
##  6 Owen…   120 brown, gr… light      blue            52   male   Tatooine 
##  7 Beru…    75 brown      light      blue            47   female Tatooine 
##  8 R5-D4    32 <NA>       white, red red             NA   <NA>   Tatooine 
##  9 Bigg…    84 black      light      brown           24   male   Tatooine 
## 10 Obi-…    77 auburn, w… fair       blue-gray       57   male   Stewjon  
## # … with 77 more rows, and 1 more variable: species <chr>
## # A tibble: 87 x 11
##    name  height  mass hair_color skin_color eye_color homeworld species
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>     <chr>     <chr>  
##  1 Luke…    172    77 blond      fair       blue      Tatooine  Human  
##  2 C-3PO    167    75 <NA>       gold       yellow    Tatooine  Droid  
##  3 R2-D2     96    32 <NA>       white, bl… red       Naboo     Droid  
##  4 Dart…    202   136 none       white      yellow    Tatooine  Human  
##  5 Leia…    150    49 brown      light      brown     Alderaan  Human  
##  6 Owen…    178   120 brown, gr… light      blue      Tatooine  Human  
##  7 Beru…    165    75 brown      light      blue      Tatooine  Human  
##  8 R5-D4     97    32 <NA>       white, red red       Tatooine  Droid  
##  9 Bigg…    183    84 black      light      brown     Tatooine  Human  
## 10 Obi-…    182    77 auburn, w… fair       blue-gray Stewjon   Human  
## # … with 77 more rows, and 3 more variables: films <list>,
## #   vehicles <list>, starships <list>

1.2.2 Selecting observations The filter function allows you to limit your dataset to observations (rows) meeting a specific criteria. Multiple criteria can be combined with the & (AND) and | (OR) symbols.

## # A tibble: 19 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Leia…    150  49   brown      light      brown             19 female
##  2 Beru…    165  75   brown      light      blue              47 female
##  3 Mon …    150  NA   auburn     fair       blue              48 female
##  4 Shmi…    163  NA   black      fair       brown             72 female
##  5 Ayla…    178  55   none       blue       hazel             48 female
##  6 Adi …    184  50   none       dark       blue              NA female
##  7 Cordé    157  NA   brown      light      brown             NA female
##  8 Lumi…    170  56.2 black      yellow     blue              58 female
##  9 Barr…    166  50   black      yellow     blue              40 female
## 10 Dormé    165  NA   brown      light      brown             NA female
## 11 Zam …    168  55   blonde     fair, gre… yellow            NA female
## 12 Taun…    213  NA   none       grey       black             NA female
## 13 Joca…    167  NA   white      fair       blue              NA female
## 14 R4-P…     96  NA   none       silver, r… red, blue         NA female
## 15 Shaa…    178  57   none       red, blue… black             NA female
## 16 Sly …    178  48   none       pale       white             NA female
## 17 Rey       NA  NA   brown      light      hazel             NA female
## 18 Capt…     NA  NA   unknown    unknown    unknown           NA female
## 19 Padm…    165  45   brown      light      brown             46 female
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
## # A tibble: 1 x 13
##   name  height  mass hair_color skin_color eye_color birth_year gender
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
## 1 Leia…    150    49 brown      light      brown             19 female
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
## # A tibble: 7 x 13
##   name  height  mass hair_color skin_color eye_color birth_year gender
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
## 1 Leia…    150    49 brown      light      brown             19 female
## 2 Wick…     88    20 brown      brown      brown              8 male  
## 3 Fini…    170    NA blond      fair       blue              91 male  
## 4 Adi …    184    50 none       dark       blue              NA female
## 5 Bail…    191    NA black      tan        brown             67 male  
## 6 Joca…    167    NA white      fair       blue              NA female
## 7 Raym…    188    79 brown      light      brown             NA male  
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
## # A tibble: 7 x 13
##   name  height  mass hair_color skin_color eye_color birth_year gender
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
## 1 Leia…    150    49 brown      light      brown             19 female
## 2 Wick…     88    20 brown      brown      brown              8 male  
## 3 Fini…    170    NA blond      fair       blue              91 male  
## 4 Adi …    184    50 none       dark       blue              NA female
## 5 Bail…    191    NA black      tan        brown             67 male  
## 6 Joca…    167    NA white      fair       blue              NA female
## 7 Raym…    188    79 brown      light      brown             NA male  
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

1.2.3 Creating/Recoding variables The mutate function allows you to create new variables or transform existing ones.

## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <dbl> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…   67.8 170.  blond      fair       blue            19   male  
##  2 C-3PO   65.8 165.  <NA>       gold       yellow         112   <NA>  
##  3 R2-D2   37.8  70.6 <NA>       white, bl… red             33   <NA>  
##  4 Dart…   79.6 300.  none       white      yellow          41.9 male  
##  5 Leia…   59.1 108.  brown      light      brown           19   female
##  6 Owen…   70.1 265.  brown, gr… light      blue            52   male  
##  7 Beru…   65.0 165.  brown      light      blue            47   female
##  8 R5-D4   38.2  70.6 <NA>       white, red red             NA   <NA>  
##  9 Bigg…   72.1 185.  black      light      brown           24   male  
## 10 Obi-…   71.7 170.  auburn, w… fair       blue-gray       57   male  
## # … with 77 more rows, and 5 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>

The ifelse function (part of base R) can be used for recoding data. The format is ifelse(test, return if TRUE, return if FALSE).

## # A tibble: 87 x 14
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
##  4 Dart…    202   136 none       white      yellow          41.9 male  
##  5 Leia…    150    49 brown      light      brown           19   female
##  6 Owen…    178   120 brown, gr… light      blue            52   male  
##  7 Beru…    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
##  9 Bigg…    183    84 black      light      brown           24   male  
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
## # … with 77 more rows, and 6 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>,
## #   heightcat <chr>
## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 <NA>       gold       other          112   <NA>  
##  3 R2-D2     96    32 <NA>       white, bl… other           33   <NA>  
##  4 Dart…    202   136 none       white      other           41.9 male  
##  5 Leia…    150    49 brown      light      brown           19   female
##  6 Owen…    178   120 brown, gr… light      blue            52   male  
##  7 Beru…    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red other           NA   <NA>  
##  9 Bigg…    183    84 black      light      brown           24   male  
## 10 Obi-…    182    77 auburn, w… fair       other           57   male  
## # … with 77 more rows, and 5 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>
## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
##  4 Dart…     NA   136 none       white      yellow          41.9 male  
##  5 Leia…    150    49 brown      light      brown           19   female
##  6 Owen…    178   120 brown, gr… light      blue            52   male  
##  7 Beru…    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
##  9 Bigg…    183    84 black      light      brown           24   male  
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
## # … with 77 more rows, and 5 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>

1.2.4 Summarizing data The summarize function can be used to reduce multiple values down to a single value (such as a mean). It is often used in conjunction with the by_group function, to calculate statistics by group. In the code below, the na.rm=TRUE option is used to drop missing values before calculating the means.

## # A tibble: 1 x 2
##   mean_ht mean_mass
##     <dbl>     <dbl>
## 1    174.      97.3
## # A tibble: 5 x 3
##   gender        mean_ht mean_wt
##   <chr>           <dbl>   <dbl>
## 1 female           165.    54.0
## 2 hermaphrodite    175   1358  
## 3 male             179.    81.0
## 4 none             200    140  
## 5 <NA>             120     46.3

1.2.5 Using pipes Packages like dplyr and tidyr allow you to write your code in a compact format using the pipe %>% operator. Here is an example.

## # A tibble: 8 x 2
##   species    mean_ht
##   <chr>        <dbl>
## 1 Clawdite      168 
## 2 Human         160.
## 3 Kaminoan      213 
## 4 Mirialan      168 
## 5 Tholothian    184 
## 6 Togruta       178 
## 7 Twi'lek       178 
## 8 <NA>          137
## # A tibble: 8 x 2
##   species    mean_ht
##   <chr>        <dbl>
## 1 Clawdite      168 
## 2 Human         160.
## 3 Kaminoan      213 
## 4 Mirialan      168 
## 5 Tholothian    184 
## 6 Togruta       178 
## 7 Twi'lek       178 
## 8 <NA>          137

1.2.6 Reshaping data

1.2.7 Missing data

Q1 select Keep the variables name, hair_color, and films.

## # A tibble: 87 x 3
##    name               hair_color    films    
##    <chr>              <chr>         <list>   
##  1 Luke Skywalker     blond         <chr [5]>
##  2 C-3PO              <NA>          <chr [6]>
##  3 R2-D2              <NA>          <chr [7]>
##  4 Darth Vader        none          <chr [4]>
##  5 Leia Organa        brown         <chr [5]>
##  6 Owen Lars          brown, grey   <chr [3]>
##  7 Beru Whitesun lars brown         <chr [3]>
##  8 R5-D4              <NA>          <chr [1]>
##  9 Biggs Darklighter  black         <chr [1]>
## 10 Obi-Wan Kenobi     auburn, white <chr [6]>
## # … with 77 more rows

Q2 filter select blonds.

## # A tibble: 3 x 13
##   name  height  mass hair_color skin_color eye_color birth_year gender
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
## 1 Luke…    172    77 blond      fair       blue            19   male  
## 2 Anak…    188    84 blond      fair       blue            41.9 male  
## 3 Fini…    170    NA blond      fair       blue            91   male  
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Q3 filter select female blonds.

## # A tibble: 0 x 13
## # … with 13 variables: name <chr>, height <int>, mass <dbl>,
## #   hair_color <chr>, skin_color <chr>, eye_color <chr>, birth_year <dbl>,
## #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Q4 mutate Calculate the metric BMI index.

Hint: BMI = Weight (KG) ÷ Height (Metres²)

## # A tibble: 87 x 14
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
##  4 Dart…    202   136 none       white      yellow          41.9 male  
##  5 Leia…    150    49 brown      light      brown           19   female
##  6 Owen…    178   120 brown, gr… light      blue            52   male  
##  7 Beru…    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
##  9 Bigg…    183    84 black      light      brown           24   male  
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
## # … with 77 more rows, and 6 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>,
## #   BMI <dbl>

Q5 mutate If BMI is greater than 18.5 and smaller than 24.9, then BMIcat = “healthy”, otherwise BMIcat = “not”.

## # A tibble: 87 x 15
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
##  4 Dart…    202   136 none       white      yellow          41.9 male  
##  5 Leia…    150    49 brown      light      brown           19   female
##  6 Owen…    178   120 brown, gr… light      blue            52   male  
##  7 Beru…    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
##  9 Bigg…    183    84 black      light      brown           24   male  
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
## # … with 77 more rows, and 7 more variables: homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>,
## #   BMI <dbl>, BMIcat <chr>

Q6 summarize Calculate mean BMI.

## # A tibble: 1 x 1
##   BMI_avg
##     <dbl>
## 1    32.0

Q7 group_by and summarize Calculate mean BMI by gender.

## # A tibble: 5 x 2
##   gender        BMI_avg
##   <chr>           <dbl>
## 1 female           18.8
## 2 hermaphrodite   443. 
## 3 male             25.7
## 4 none             35  
## 5 <NA>             31.9

Q8 Hide the messages and the code, but display results of the code from the webpage.

Hint: Use message, echo and results in the chunk options. Refer to the RMarkdown Reference Guide.

Q9 Display the title and your name correctly at the top of the webpage.

Q10 Use the correct slug.