hunting missing penguins

Jen Richmond

23/06/2021

The palmerpenguins dataset is a good one to play with when learning how to deal with missing values.

load packages

library(tidyverse)
library(naniar) # useful for visualising missing values
library(palmerpenguins) # get penguin data

load penguin data

penguins <- penguins

How to find out where the missing values are…

The vis_miss() function is useful for visualising whether you have missing values and in which variables.

vis_miss(penguins)

Looks like most of the missing values are in the sex variable, but a couple of penguins are missing values in other variables.

How to find out how many missing values there are…

The summary() function will tell us how many penguins are missing values in each variable.

summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

So we have 11 penguins missing sex data, and two missing bill, flipper, and body mass measurements.
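If you just want the NA counts without the rest of the summary, here is a minimal sketch using base R. The object name na_counts is my own choice for illustration, not something from the packages.

```r
library(palmerpenguins)

# is.na() returns a TRUE/FALSE matrix the same shape as the data;
# colSums() adds up the TRUEs in each column, giving NAs per variable
na_counts <- colSums(is.na(penguins))
na_counts
```

This should show the same counts that summary() reports. The naniar package also has miss_var_summary(), which returns a similar count per variable as a tibble.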

How to exclude ALL the penguins with any NAs…

One option is to remove data from ALL the penguins that are missing values on any variables.

Use na.omit() to remove all the penguins with any NAs

penguins_omit <- penguins %>%
  na.omit()

# visualise missing values in filtered penguins
vis_miss(penguins_omit)

Result: after using na.omit() all the missing values are gone.

count(penguins)
## # A tibble: 1 x 1
##       n
##   <int>
## 1   344
count(penguins_omit)
## # A tibble: 1 x 1
##       n
##   <int>
## 1   333

We have gone from 344 penguins to 333 penguins.
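If you prefer to stay within the tidyverse, drop_na() from tidyr is an alternative to na.omit(). This is a sketch of the same step, assuming you want the same behaviour; penguins_drop is just a name I picked for the example.

```r
library(tidyverse)
library(palmerpenguins)

# drop_na() with no arguments removes every row that has an NA in any column
penguins_drop <- penguins %>%
  drop_na()

# check how many penguins are left
nrow(penguins_drop)
```

With no columns named, drop_na() should leave you with the same 333 penguins that na.omit() does.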

How to exclude only NAs in a particular variable…

Removing all the rows that contain any NAs can be a little extreme. In this case, we might not care about the missing sex data, but just want to exclude penguins who are missing bill measurements. In that case, you can use filter() with is.na() to keep only the observations that are NOT NA.

# filter to only include obs where bill length is NOT NA
# is.na() tests for missing values; ! flips it to "not missing"

penguins_filter_na <- penguins %>%
  filter(!is.na(bill_length_mm))

# visualise missing values in filtered penguins
vis_miss(penguins_filter_na)

Result: We have filtered out only penguins missing bill data, but left in those missing sex data. We only lost 2 penguins.

count(penguins)
## # A tibble: 1 x 1
##       n
##   <int>
## 1   344
count(penguins_filter_na)
## # A tibble: 1 x 1
##       n
##   <int>
## 1   342
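It is worth knowing why NAs need a little care when filtering: comparing anything to NA returns NA rather than TRUE or FALSE. The sketch below shows that, and also shows drop_na() with a column name, which is another way to drop NAs in just one variable. The name penguins_bill is only for illustration.

```r
library(tidyverse)
library(palmerpenguins)

# comparing to NA gives NA back, not TRUE or FALSE
na_check <- NA == NA
na_check

# naming a column in drop_na() removes only the rows missing that variable
penguins_bill <- penguins %>%
  drop_na(bill_length_mm)

nrow(penguins_bill)
```

This should keep the same 342 penguins as the filter() approach, with the missing sex values left in.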

dealing with NA when using functions

Some functions, like mean(), don’t like NAs, but it is easy to tell R to ignore them.

You can see that when I try to calculate the mean bill length on the original data, it gives me NA because I haven’t told it what to do with the NAs.

penguins %>% 
  summarise(mean_bill = mean(bill_length_mm))
## # A tibble: 1 x 1
##   mean_bill
##       <dbl>
## 1        NA

You can use na.rm = TRUE (aka NA remove) to tell R to ignore the NAs.

penguins %>% 
  summarise(mean_bill = mean(bill_length_mm, na.rm = TRUE))
## # A tibble: 1 x 1
##   mean_bill
##       <dbl>
## 1      43.9
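Because na.rm = TRUE drops values silently, it can be handy to report how many were dropped alongside the mean. Here is a minimal sketch that does this by species; the column names mean_bill and n_missing and the object bill_by_species are my own choices.

```r
library(tidyverse)
library(palmerpenguins)

bill_by_species <- penguins %>%
  group_by(species) %>%
  summarise(
    # mean of the non-missing bill lengths in each species
    mean_bill = mean(bill_length_mm, na.rm = TRUE),
    # how many bill lengths were missing (and so ignored) in each species
    n_missing = sum(is.na(bill_length_mm))
  )

bill_by_species
```

The n_missing column makes it obvious when a mean is based on fewer penguins than you expected.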

Good luck!!