Back to mean imputation
We have the mean of our first column, wing length
mean_wing
## [1] 57.00794
We identify the locations of the NAs using the general form of
which(is.na(x) == TRUE). I’ll separate this out into parts
first to show how it works.
First, create a logical vector of TRUE/FALSE values that is TRUE wherever an NA is present
# call is.na()
is_NA_wing <- is.na(df02$wing)
Now use which() to determine which elements of the
vector contain TRUE
# call which()
i_NA_wing <- which(is_NA_wing == TRUE)
In a single line I can do it like this
i_NA_wing <- which(is.na(df02$wing) == TRUE)
Again, we can do this with bracket notation by referring to the wing
column as column 1:
i_NA_wing <- which(is.na(df02[, 1]) == TRUE)
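To see what each step returns, here is a minimal sketch on a made-up vector. The name `x` and its values are hypothetical and are not taken from df02. It also shows that the `== TRUE` comparison is optional, since is.na() already returns a logical vector.

```r
# Toy vector with two missing values (hypothetical data, not from df02)
x <- c(56, NA, 58, 57, NA)

# Step 1: logical vector, TRUE wherever a value is missing
is_NA_x <- is.na(x)
is_NA_x

# Step 2: positions of the TRUE elements
i_NA_x <- which(is_NA_x)
i_NA_x

# The "== TRUE" comparison is redundant; both forms give the same indices
identical(which(is.na(x) == TRUE), which(is.na(x)))
```

The shorter form which(is.na(x)) is what you will usually see in other people's code.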
This vector of indices can then pull out the NAs from the
dataframe:
df02$wing[i_NA_wing]
## [1] NA NA NA NA NA NA NA NA NA NA
I can then assign the mean wing value to the NAs
df02$wing[i_NA_wing] <- mean_wing
With a column index it's done like this
df02[i_NA_wing, 1] <- mean_wing
Note that the mean before doing this (stored in
mean_wing) and after this is the same. I can check this
with a logical comparison
# use == to compare the two values
mean_wing == mean(df02$wing)
## [1] TRUE
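The mean stays the same because each filled-in value equals the mean of the observed values, so the sum and the count grow in the same proportion. The sketch below illustrates this on a made-up vector (`x` and `mean_x` are hypothetical names, not part of the df02 workflow). Since the two means are computed from different floating-point operations, `==` may in some cases report FALSE for values that differ only by rounding, so the sketch uses all.equal(), which compares with a small tolerance.

```r
# Toy vector with two missing values (hypothetical data, not from df02)
x <- c(56, NA, 58, 57, NA)

# Mean of the observed values only
mean_x <- mean(x, na.rm = TRUE)

# Replace the NAs with that mean
x[which(is.na(x))] <- mean_x

# Compare the mean before and after, allowing for rounding error
isTRUE(all.equal(mean_x, mean(x)))
```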
Since our dataframe is small we can easily do mean imputation on the
remaining features.
First we need the means of the two remaining columns
# call mean() on the columns
# and set na.rm = TRUE
mean_bill <- mean(df02$bill, na.rm = TRUE)
mean_weight <- mean(df02$weight, na.rm = TRUE)
Of course, we can do this with column indices too
# Set the column indices to 2 and 3
mean_bill <- mean(df02[, 2], na.rm = TRUE)
mean_weight <- mean(df02[, 3], na.rm = TRUE)
We now need the locations of the NAs
i_NA_bill <- which(is.na(df02$bill) == TRUE)
i_NA_weight <- which(is.na(df02$weight) == TRUE)
Or with column indices
# Set the column indices to be 2 and 3
i_NA_bill <- which(is.na(df02[, 2]) == TRUE)
i_NA_weight <- which(is.na(df02[, 3]) == TRUE)
We can check that we are getting just NAs by using our i_NA_ vectors
to access the elements of the columns.
df02$bill[i_NA_bill]
## [1] NA NA NA NA NA NA NA NA NA NA
df02$weight[i_NA_weight]
## [1] NA NA NA NA NA NA NA NA NA NA NA NA
Now do the replacement of the NAs
df02$bill[i_NA_bill] <- mean_bill
df02$weight[i_NA_weight] <- mean_weight
For completeness, let’s do this with column indices
# rows go before the comma, the column index (2 or 3) after it
df02[i_NA_bill, 2] <- mean_bill
df02[i_NA_weight, 3] <- mean_weight
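When assigning with bracket notation, the order inside the brackets matters: the row indices go before the comma and the column index goes after it, as in df[rows, column]. Putting the column number in the row position makes R try to overwrite an entire row, which produces warnings about replacement lengths. Here is a minimal sketch of the correct order on a toy data frame. The names `toy`, `i_NA`, and `mean_toy_bill` and all values are hypothetical.

```r
# Toy data frame (hypothetical values, not df02)
toy <- data.frame(wing = c(56, 58, 57),
                  bill = c(8.4, NA, 9.0))

# Row index of the missing bill value
i_NA <- which(is.na(toy$bill))

# Mean of the observed bill values
mean_toy_bill <- mean(toy$bill, na.rm = TRUE)

# Rows first, then column: toy[rows, column]
toy[i_NA, 2] <- mean_toy_bill

# The same cell can also be addressed by column name
toy[i_NA, "bill"]
```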
We can check that there are now values in these rows
df02$bill[i_NA_bill]
## [1] 8.782063 8.782063 8.782063 8.782063 8.782063 8.782063 8.782063 8.782063
## [9] 8.782063 8.782063
df02$weight[i_NA_weight]
## [1] 17.40328 17.40328 17.40328 17.40328 17.40328 17.40328 17.40328 17.40328
## [9] 17.40328 17.40328 17.40328 17.40328
Calling summary on the data shows us that everything is filled in -
no more NAs are reported in the summary output.
summary(df02)
##       wing            bill           weight
##  Min.   :53.00   Min.   :7.900   Min.   :14.5
##  1st Qu.:56.00   1st Qu.:8.400   1st Qu.:16.0
##  Median :57.00   Median :8.782   Median :17.4
##  Mean   :57.01   Mean   :8.782   Mean   :17.4
##  3rd Qu.:58.00   3rd Qu.:9.120   3rd Qu.:18.5
##  Max.   :60.00   Max.   :9.900   Max.   :21.7
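With more than a few columns, repeating these steps by hand gets tedious. One possible way to generalize, sketched here on a toy data frame rather than df02, is to loop over the column indices and apply the same three steps to each column. The names `toy`, `mean_j`, and `i_NA_j` and all values are hypothetical, and the sketch assumes every column is numeric.

```r
# Toy data frame standing in for df02 (hypothetical values)
toy <- data.frame(wing   = c(56, NA, 58),
                  bill   = c(8.4, 9.0, NA),
                  weight = c(NA, 16.0, 18.0))

# Repeat the same steps for every column
for (j in seq_along(toy)) {
  mean_j <- mean(toy[, j], na.rm = TRUE)   # mean of the observed values
  i_NA_j <- which(is.na(toy[, j]))         # row indices of the NAs
  toy[i_NA_j, j] <- mean_j                 # fill them in
}

# Count the remaining NAs in each column; each count should be zero
colSums(is.na(toy))
```

colSums(is.na(toy)) gives a more direct check than reading through the summary() output, since it reports the number of NAs left in each column.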