The palmerpenguins dataset is a good one to play with when learning how to deal with missing values.
load packages
library(tidyverse)
library(naniar) # useful for visualising missing values
library(palmerpenguins) # get penguin data
load penguin data
penguins <- penguins
How to find out where the missing values are…
The vis_miss() function is useful for visualising whether you have missing values and in which variables.
vis_miss(penguins)
Looks like most of the missing values are in the sex variable, but a couple of penguins are missing values in other variables.
How to find out how many missing values there are…
The summary() function will tell us how many penguins are missing values in each variable.
summary(penguins)
##       species          island    bill_length_mm  bill_depth_mm
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30
##                                  Mean   :43.92   Mean   :17.15
##                                  3rd Qu.:48.50   3rd Qu.:18.70
##                                  Max.   :59.60   Max.   :21.50
##                                  NA's   :2       NA's   :2
##  flipper_length_mm  body_mass_g       sex           year
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007
##  Median :197.0     Median :4050   NA's  : 11   Median :2008
##  Mean   :200.9     Mean   :4202                Mean   :2008
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009
##  Max.   :231.0     Max.   :6300                Max.   :2009
##  NA's   :2         NA's   :2
So we have 11 penguins missing sex data, and two missing bill, flipper, and body mass measurements.
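If you only want the counts, without the rest of the summary, a minimal sketch in base R is to count the NAs column by column. The object names na_counts and total_na are just illustrative.

```r
library(palmerpenguins)

# is.na() marks every missing cell as TRUE; colSums() adds them up per column
na_counts <- colSums(is.na(penguins))
na_counts

# total number of missing cells across the whole data frame
total_na <- sum(is.na(penguins))
total_na
```

The per-column counts should line up with the NA's rows in the summary() output above.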
How to exclude ALL the penguins with any NAs…
One option is to remove data from ALL the penguins that are missing values on any variables.
Use na.omit() to remove all the penguins with any NAs
penguins_omit <- penguins %>%
  na.omit()
# visualise missing values in filtered penguins
vis_miss(penguins_omit)
Result: after using na.omit() all the missing values are gone.
count(penguins)
## # A tibble: 1 x 1
##       n
##   <int>
## 1   344
count(penguins_omit)
## # A tibble: 1 x 1
##       n
##   <int>
## 1   333
We have gone from 344 penguins to 333 penguins.
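As an aside, tidyr has its own version of this called drop_na(), and base R's complete.cases() lets you look at the rows that get thrown away before you commit to dropping them. This is just a sketch; the names penguins_drop and penguins_incomplete are made up for the example.

```r
library(tidyr)
library(palmerpenguins)

# drop_na() with no column arguments drops every row that has an NA anywhere
penguins_drop <- drop_na(penguins)
nrow(penguins_drop)

# complete.cases() is TRUE for rows with no NAs,
# so negating it keeps only the rows that would be dropped
penguins_incomplete <- penguins[!complete.cases(penguins), ]
nrow(penguins_incomplete)
```

Looking at penguins_incomplete is a quick way to check whether the penguins you are about to lose have anything in common.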
How to exclude only NAs in a particular variable…
Removing all the rows that contain any NAs can be a little extreme. In this case, we might not care about the missing sex data, but just want to exclude penguins who are missing bill measurements. In that case, you can filter() to include only observations that are NOT NA.
# filter to only include obs where bill length is NOT NA
penguins_filter_na <- penguins %>%
  filter(!is.na(bill_length_mm))
# visualise missing values in filtered penguins
vis_miss(penguins_filter_na)
Result: We have filtered out only penguins missing bill data, but left in those missing sex data. We only lost 2 penguins.
count(penguins)
## # A tibble: 1 x 1
##       n
##   <int>
## 1   344
count(penguins_filter_na)
## # A tibble: 1 x 1
##       n
##   <int>
## 1   342
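One thing worth knowing when filtering on NAs: in R, NA is not an ordinary value you can compare against, so the usual way to test for it is is.na(). Here is a small sketch with a made-up vector x (the numbers are not from the penguins data) showing the difference.

```r
# toy vector with one missing value (hypothetical numbers)
x <- c(39.1, NA, 40.3)

# comparing anything with NA gives NA, not TRUE or FALSE
x == NA

# is.na() gives a proper TRUE/FALSE for every element
is.na(x)

# so !is.na() can be used to keep only the non-missing values
x[!is.na(x)]
```

This is why a filter() on missing values is normally written with !is.na(variable) rather than with == or !=.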
dealing with NA when using functions
Some functions, like mean(), don’t like NAs, but it is easy to tell R to ignore them.
You can see that when I try and calculate the mean bill length on the original data, it gives me NA because I haven’t told it what to do with the NAs.
penguins %>%
  summarise(mean_bill = mean(bill_length_mm))
## # A tibble: 1 x 1
##   mean_bill
##       <dbl>
## 1        NA
You can use na.rm = TRUE (aka NA remove) to tell R to ignore the NAs.
penguins %>%
  summarise(mean_bill = mean(bill_length_mm, na.rm = TRUE))
## # A tibble: 1 x 1
##   mean_bill
##       <dbl>
## 1      43.9
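The same trick works inside grouped summaries. As a rough sketch, you could get the mean bill length for each species and, alongside it, a count of how many values were ignored, so you know what na.rm = TRUE quietly dropped. The names bill_by_species and n_missing are only for illustration.

```r
library(dplyr)
library(palmerpenguins)

bill_by_species <- penguins %>%
  group_by(species) %>%
  summarise(
    # how many bill lengths are missing in this species
    n_missing = sum(is.na(bill_length_mm)),
    # mean of the bill lengths that are present
    mean_bill = mean(bill_length_mm, na.rm = TRUE)
  )
bill_by_species
```

Keeping the n_missing column next to the mean makes it obvious when a summary is based on fewer penguins than you expected.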
Good luck!!