Ways to deal with missing data in R
1 Using Base R
Say we have a vector with missing data:
nums <- c(1, 4, NA, 2, 8, NA)
I can find which elements are NA:
is.na(nums)
[1] FALSE FALSE TRUE FALSE FALSE TRUE
It returns a vector of logical values (TRUEs and FALSEs), one per element, that marks where the NAs are.
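Since is.na() gives a logical vector rather than positions, here is a minimal sketch of getting the numeric positions and a count of the missing values. It assumes base R's which() and sum() are an acceptable way to do that; the names na_positions and na_count are made up for illustration.

```r
# Same example vector as in the text
nums <- c(1, 4, NA, 2, 8, NA)

# which() converts the TRUE entries of a logical vector into integer positions
na_positions <- which(is.na(nums))
print(na_positions)  # 3 6

# sum() treats TRUE as 1, so this counts the missing values
na_count <- sum(is.na(nums))
print(na_count)      # 2
```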
In R, we can index vectors by using square brackets. For example, if I want to find the second number in the vector:
nums[2]
[1] 4
And if I want to index a range:
nums[2:4]
[1] 4 NA 2
I can also pick and choose which ones I want, in whatever order, and even with repeats, by using a vector of indices:
nums[c(1, 5, 4, 4)]
[1] 1 8 2 2
And I can even use a vector of logical values of the same length to do the same thing. For example, this logical indexing only returns the 3rd and 5th elements (using “T” for “TRUE” and “F” for “FALSE” for brevity):
nums[c(F, F, T, F, T, F)]
[1] NA 8
So this means we can use the is.na() function to create a vector of logical values, and use that vector to index. For example, to only keep the missing values:
nums[is.na(nums)]
[1] NA NA
… it returns the two missing values. What about doing the opposite?
nums[!is.na(nums)]
[1] 1 4 2 8
The exclamation point inverts the logical vector (turning TRUEs into FALSEs and vice versa), so we are indexing only the non-missing values.
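The inverted vector can also be combined with other conditions. The sketch below is only an illustration (the threshold 3 and the names with_na and without_na are arbitrary): a comparison involving NA yields NA, and indexing with NA returns NA, so adding !is.na() to the condition keeps the missing values out of the result.

```r
# Same example vector as in the text
nums <- c(1, 4, NA, 2, 8, NA)

# nums > 3 is NA wherever nums is NA, and an NA index returns NA
with_na <- nums[nums > 3]
print(with_na)     # 4 NA 8 NA

# Combining the condition with !is.na() drops the missing values
without_na <- nums[!is.na(nums) & nums > 3]
print(without_na)  # 4 8
```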
Another thing about R is that you can assign to an indexed selection to modify the original object. For example, I can replace the second element in my vector with a new value:
nums[2] <- 50
See how the object nums now has the value 50 in the second position? We can use this to replace missing values! For example, replace all missing values with a 0:
nums[is.na(nums)] <- 0
This method also works on dataframes, or dataframe columns.
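As a rough sketch of the dataframe case, the example below uses two small made-up dataframes (df and df2 are hypothetical, not from the post). It shows the same indexed assignment on a single column, and on a whole dataframe using the logical matrix that is.na() returns for dataframes.

```r
# Hypothetical dataframe, invented for illustration
df <- data.frame(
  id = 1:4,
  score = c(10, NA, 7, NA)
)

# Replace the NAs in a single column
df$score[is.na(df$score)] <- 0
print(df$score)  # 10 0 7 0

# Second hypothetical dataframe: replace every NA in all columns at once
df2 <- data.frame(a = c(1, NA), b = c(NA, 2))
df2[is.na(df2)] <- 0
print(df2)
```

Using one value for the whole dataframe only makes sense when every column can sensibly hold it; for mixed column types, replacing column by column is the safer choice.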
All of the above is done with some of the most basic building blocks of R, but there are other options.
2 Using Base R’s replace() function
Let’s bring back the original vector:
nums <- c(1, 4, NA, 2, 8, NA)
We can do the same as above, but using a function instead of square-bracket indexing. The replace() function takes an object, the indices to replace, and the value to replace them with:
replace(nums, is.na(nums), 0)
[1] 1 4 0 2 8 0
Note that this only prints the result in the console; nums itself is unchanged. You could assign the result to a new object:
nums_no_na <- replace(nums, is.na(nums), 0)
…or update the original object:
nums <- replace(nums, is.na(nums), 0)
This method also works on dataframes / dataframe columns.
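A minimal sketch of replace() applied to a dataframe column, and to a whole dataframe, might look like this. The dataframe df is made up for illustration, and both columns are numeric so that a single replacement value fits everywhere.

```r
# Hypothetical dataframe, invented for illustration
df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

# Replace NAs in one column and assign the result back to that column
df$a <- replace(df$a, is.na(df$a), 0)
print(df$a)  # 1 0 3

# replace() also accepts the dataframe itself, with is.na(df) as the index
df_filled <- replace(df, is.na(df), 0)
print(df_filled)
```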
3 Using tidyr’s replace_na()
The tidyr package has a replace_na() function which can be used similarly to replace():
nums <- c(1, 4, NA, 2, 8, NA)
library(tidyr)
replace_na(nums, 0)
[1] 1 4 0 2 8 0
… but it can also be used on dataframes. For example, this dataframe has missing data in the first two columns:
airquality
# A tibble: 153 × 6
   Ozone Solar.R  Wind  Temp Month   Day
   <int>   <int> <dbl> <int> <int> <int>
 1    41     190   7.4    67     5     1
 2    36     118   8      72     5     2
 3    12     149  12.6    74     5     3
 4    18     313  11.5    62     5     4
 5    NA      NA  14.3    56     5     5
 6    28      NA  14.9    66     5     6
 7    23     299   8.6    65     5     7
 8    19      99  13.8    59     5     8
 9     8      19  20.1    61     5     9
10    NA     194   8.6    69     5    10
# ℹ 143 more rows
We can specify different replacement values for different columns:
library(tidyr)
replace_na(airquality, list(Ozone = 20, Solar.R = 200))
# A tibble: 153 × 6
   Ozone Solar.R  Wind  Temp Month   Day
   <int>   <int> <dbl> <int> <int> <int>
 1    41     190   7.4    67     5     1
 2    36     118   8      72     5     2
 3    12     149  12.6    74     5     3
 4    18     313  11.5    62     5     4
 5    20     200  14.3    56     5     5
 6    28     200  14.9    66     5     6
 7    23     299   8.6    65     5     7
 8    19      99  13.8    59     5     8
 9     8      19  20.1    61     5     9
10    20     194   8.6    69     5    10
# ℹ 143 more rows
That replaced NAs with 20 in the Ozone column, and with 200 in the Solar.R column.
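The same per-column list also works when columns have different types. Here is a small sketch on a made-up dataframe (df, its columns, and the replacement values are hypothetical), assuming tidyr is installed. Each replacement value should match the type of its column.

```r
library(tidyr)

# Hypothetical dataframe, invented for illustration
df <- data.frame(
  temp = c(21.5, NA, 19.0),
  site = c("north", NA, "south"),
  stringsAsFactors = FALSE
)

# One replacement per column: a number for temp, a string for site
filled <- replace_na(df, list(temp = 0, site = "unknown"))
print(filled)
```

Like replace(), replace_na() returns a new object, so the result has to be assigned if you want to keep it.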
If you want to explore more options, and more in-depth examples, there are many tutorials online. For example, this R Bloggers post