This code through explores how to use the ‘mean()’ function safely in R to compute averages in real datasets. It focuses on three common situations: basic numeric vectors, missing values (‘NA’), and TRUE/FALSE data.
Specifically, we’ll explain and demonstrate:
This topic is valuable because ‘mean()’ is everywhere in daily applied work. Understanding how it behaves with NAs and logical vectors helps you avoid confusing results.
Specifically, you’ll learn how to:
Here, we’ll show how ‘mean()’ behaves in different situations.
A basic example shows how to compute the average of a numeric vector and demonstrates the underlying arithmetic.
# Weekly matcha spending - mean of a numeric vector
matcha_spending <- c(6.50, 5.00, 5.50, 8.00, 6.00, 6.00)
# Compute the mean
mean(matcha_spending)
## [1] 6.166667
# Check the math
sum_matcha_spending <- sum(matcha_spending)
length_matcha_spending <- length(matcha_spending)
sum_matcha_spending / length_matcha_spending
## [1] 6.166667
This section builds on basic R ideas (numeric vectors, TRUE/FALSE conditions, and missing values) and makes the steps of using ‘mean()’ explicit.
Handling missing values with na.rm
Real datasets often have missing values, so it is important to understand ‘na.rm’. If a vector contains at least one missing value, ‘mean()’ returns NA instead of a meaningful result. Setting ‘na.rm = TRUE’ tells R to drop the missing elements before computing the average.
# Self reported stress scores for a small survey
stress_scores <- c(85, 80, NA, 75, 93)
mean(stress_scores) # returns NA
## [1] NA
mean(stress_scores, na.rm = TRUE) # drops the NA first
## [1] 83.25
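Because ‘na.rm = TRUE’ silently shrinks the sample, it can help to check how many values are missing before dropping them. The following is a minimal sketch that reuses the stress_scores vector from above (redefined so the block runs on its own); the names n_missing, share_missing, mean_observed, and all_missing are illustrative and not part of the original walkthrough.

```r
# Same survey scores as above, with one missing value
stress_scores <- c(85, 80, NA, 75, 93)

# Count the missing values before removing them
n_missing <- sum(is.na(stress_scores))

# Share of values that are missing (mean of a logical vector)
share_missing <- mean(is.na(stress_scores))

# Mean of the observed values only
mean_observed <- mean(stress_scores, na.rm = TRUE)

# Edge case: if every value is missing, nothing is left to average
# and mean() returns NaN rather than NA
all_missing <- mean(c(NA_real_, NA_real_), na.rm = TRUE)

n_missing
share_missing
mean_observed
all_missing
```

Here one of the five scores is missing, so the reported average describes four respondents rather than five.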
What’s more, ‘mean()’ can also be used with logical vectors. In R, TRUE is treated as 1 and FALSE as 0, so the mean of a logical vector is the proportion of values that are TRUE.
# Was your dog the goodest boy/girl today?
goodest_dog <- c(TRUE, TRUE, FALSE, TRUE, NA, TRUE, TRUE, FALSE)
# Mean of logical vector with na.rm
# gives the proportion of goodest days
mean(goodest_dog, na.rm = TRUE)
## [1] 0.7142857
# Let's check the math
# Count the TRUE values
sum_goodest <- sum(goodest_dog, na.rm = TRUE)
# Number of non-missing values
n_days <- sum(!is.na(goodest_dog))
sum_goodest / n_days
## [1] 0.7142857
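Because a comparison returns a logical vector, the same idea gives the share of values that meet a condition. This is a small sketch that reuses the matcha spending numbers from earlier; the 6.00 threshold and the names over_six and share_over_six are arbitrary choices for illustration.

```r
# Weekly matcha spending, as in the first example
matcha_spending <- c(6.50, 5.00, 5.50, 8.00, 6.00, 6.00)

# TRUE for each week with spending strictly above 6.00
over_six <- matcha_spending > 6

# Proportion of weeks above 6.00
share_over_six <- mean(over_six)
share_over_six
```

Two of the six weeks exceed 6.00, so the result is one third.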
Most notably, ‘mean()’ is valuable for summarizing groups of observations.
Let’s look at the built-in dataset Loblolly.
Loblolly is a data frame with 84 rows and 3 columns: height, age, and Seed (the seed source label). Trees with the same seed label come from the same seed source, so we can compare average height and age across seed sources.
# Load dplyr for %>%, group_by(), summarise(), and n()
library(dplyr)

# Create a grouped summary of Loblolly
loblolly_summary <- Loblolly %>%
  # Group rows by seed source
  group_by(Seed) %>%
  # For each seed group, let's summarize
  summarise(
    mean_height = mean(height, na.rm = TRUE), # Avg height in seed group, ignoring missing values
    mean_age = mean(age, na.rm = TRUE),       # Avg age in seed group, ignoring missing values
    n_trees = n()                             # Number of trees (rows) in seed group
  )
print(loblolly_summary)
## # A tibble: 14 × 4
##    Seed  mean_height mean_age n_trees
##    <ord>       <dbl>    <dbl>   <int>
##  1 329          30.3       13       6
##  2 327          30.6       13       6
##  3 325          31.9       13       6
##  4 307          31.3       13       6
##  5 331          31.0       13       6
##  6 311          31.7       13       6
##  7 315          32.4       13       6
##  8 321          31.2       13       6
##  9 319          32.9       13       6
## 10 301          33.2       13       6
## 11 323          33.6       13       6
## 12 309          33.8       13       6
## 13 303          34.1       13       6
## 14 305          35.1       13       6
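For comparison, similar per-seed averages can be computed in base R without loading dplyr. This is a sketch of one alternative, not the approach used above: it relies on tapply(), which applies a function to the values of one vector within each level of a grouping factor. The name height_by_seed is illustrative.

```r
# Loblolly ships with R in the datasets package, so no library() call is needed

# Mean height within each seed source; na.rm = TRUE is passed through to mean()
height_by_seed <- tapply(Loblolly$height, Loblolly$Seed, mean, na.rm = TRUE)

# One mean per seed source, rounded for display
round(height_by_seed, 1)
```

The result is a named vector rather than a tibble, which is convenient for a quick look but less so when several summary columns are needed at once.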
Learn more about ‘mean()’ with the following:
This code through references and cites the following sources:
DataCamp (2024). How to use na.rm to handle missing values in R. https://www.datacamp.com/tutorial/na-rm-in-r
DataCamp (2025). R mean() function: Get started with averages. https://www.datacamp.com/tutorial/r-mean-function
Carpentries Incubator (2022). Data summaries with dplyr. https://carpentries-incubator.github.io/r-tidyverse-4-datasets/instructor/07-data-summaries.html