Introduction

This code through explores how to use the ‘mean()’ function safely in R to compute averages in real datasets. It focuses on three common situations: basic numeric vectors, missing values (‘NA’), and TRUE/FALSE data.


Content Overview

Specifically, we’ll explain and demonstrate:

  • How ‘mean()’ relates to simple arithmetic
  • How ‘na.rm’ changes the behavior of ‘mean()’ when there are missing values
  • How ‘mean()’ of a logical vector is interpreted as a proportion of TRUE values


Why You Should Care

This topic is valuable because ‘mean()’ appears throughout everyday applied work. Understanding how it behaves with ‘NA’ values and logical vectors helps you avoid confusing results.


Learning Objectives

Specifically, you’ll learn how to:

  • Compute simple means of numeric vectors
  • Use ‘na.rm = TRUE’ to deal with missing values
  • Interpret the mean of TRUE/FALSE data as a proportion
  • Apply ‘mean()’ inside a grouped summary using the built-in Loblolly data set



Using ‘mean()’

Here, we’ll show how ‘mean()’ behaves in different situations.


Basic Example

A basic example shows how to compute the average of a numeric vector and demonstrates the underlying arithmetic: the sum of the values divided by how many values there are.

# Weekly matcha spending - mean of a numeric vector

matcha_spending <- c(6.50, 5.00, 5.50, 8.00, 6.00, 6.00)

# Compute the mean

mean(matcha_spending)
## [1] 6.166667
# Check the math

sum_matcha_spending <- sum(matcha_spending)
length_matcha_spending <- length(matcha_spending)
sum_matcha_spending / length_matcha_spending
## [1] 6.166667
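
When checking the math in a script rather than by eye, comparing two computed averages with ‘==’ can occasionally fail because of floating-point rounding. Here is a minimal sketch of a tolerance-based check, assuming base R’s ‘all.equal()’ is acceptable for this purpose. The names ‘by_function’ and ‘by_hand’ are ours, introduced only for illustration.

```r
# Re-create the spending vector from the basic example
matcha_spending <- c(6.50, 5.00, 5.50, 8.00, 6.00, 6.00)

# Two routes to the same average (illustrative names)
by_function <- mean(matcha_spending)
by_hand     <- sum(matcha_spending) / length(matcha_spending)

# all.equal() compares numbers with a small tolerance;
# isTRUE() reduces its result to a single TRUE or FALSE
isTRUE(all.equal(by_function, by_hand))
```

For this vector the check should come back TRUE.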


Further Exposition

This section builds on basic R ideas (numeric vectors, TRUE/FALSE conditions, and missing values) and makes each step of using ‘mean()’ explicit.


Handling missing values with na.rm

Real datasets often contain missing values, so it is important to understand ‘na.rm’. If a vector contains at least one missing value, ‘mean()’ returns ‘NA’ rather than a number, because the average of an unknown value is itself unknown. Setting ‘na.rm = TRUE’ tells R to drop the missing values before computing the average.

# Self reported stress scores for a small survey

stress_scores <- c(85, 80, NA, 75, 93)

mean(stress_scores)     # returns NA
## [1] NA
# Tell R to ignore the missing value when computing the mean

mean(stress_scores, na.rm = TRUE)     
## [1] 83.25
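
Because ‘na.rm = TRUE’ drops values silently, it can be worth checking how many observations were actually used. This is a minimal sketch using base R’s ‘is.na()’; the names ‘n_missing’ and ‘n_used’ are ours and do not appear in the example above.

```r
# Same survey vector as in the example above
stress_scores <- c(85, 80, NA, 75, 93)

# is.na() returns TRUE for each missing element;
# sum() of a logical vector counts the TRUE values
n_missing <- sum(is.na(stress_scores))   # how many values are missing
n_used    <- sum(!is.na(stress_scores))  # how many values the mean is based on

n_missing
n_used
```

Here one value is missing, so the mean of 83.25 is based on four scores rather than five.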


‘mean()’ can also be used with logical vectors. In R, TRUE is treated as 1 and FALSE as 0, so the mean of a logical vector is the proportion of values that are TRUE.

# Was your dog the goodest boy/girl today?

goodest_dog <- c(TRUE, TRUE, FALSE, TRUE, NA, TRUE, TRUE, FALSE)

# Mean of logical vector with na.rm 
# gives the proportion of goodest days

mean(goodest_dog, na.rm = TRUE)
## [1] 0.7142857
# Let's check the math

sum_goodest <- sum(goodest_dog, na.rm = TRUE)     # counts TRUEs

n_days <- length(goodest_dog[!is.na(goodest_dog)])    # number of non-missing values

sum_goodest / n_days
## [1] 0.7142857
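
The same idea applies to any TRUE/FALSE condition, not only to vectors that are logical to begin with. As a rough sketch, reusing the two numeric vectors from earlier, a comparison creates a logical vector and ‘mean()’ turns it into a proportion. The thresholds (6 and 80) and the names ‘over_six’ and ‘high_stress’ are arbitrary choices for illustration.

```r
# Vectors from the earlier examples
matcha_spending <- c(6.50, 5.00, 5.50, 8.00, 6.00, 6.00)
stress_scores   <- c(85, 80, NA, 75, 93)

# A comparison returns a logical vector (TRUE where the condition holds)
over_six <- matcha_spending > 6

# Proportion of weeks with spending strictly above 6
mean(over_six)

# A comparison with NA yields NA, so na.rm = TRUE is still needed
high_stress <- stress_scores >= 80
mean(high_stress, na.rm = TRUE)
```

Two of the six weeks exceed 6, giving one third, and three of the four non-missing scores are at least 80, giving 0.75.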


Advanced Examples

‘mean()’ is especially valuable for summarizing groups of observations.

Let’s look at the built-in data set Loblolly.

Loblolly is a data frame with 84 rows and 3 columns: height, age, and Seed (the seed label). Trees with the same seed label come from the same seed source, so we can compare their average height and age.

# Let's take a look at the data

head(Loblolly)
# Load dplyr, which provides %>%, group_by(), summarise(), and n()

library(dplyr)

# Create a grouped summary of Loblolly

loblolly_summary <- Loblolly %>%
  
  # Group rows by seed source
  
  group_by(Seed) %>%
  
  # For each seed group, let's summarize
  
  summarise(
    mean_height = mean(height, na.rm = TRUE),  # Avg height in seed group, ignoring missing values
    mean_age    = mean(age,    na.rm = TRUE),  # Avg age in seed group, ignoring missing values
    n_trees     = n()                          # Number of trees (rows) in seed group
  )

print(loblolly_summary)
## # A tibble: 14 × 4
##    Seed  mean_height mean_age n_trees
##    <ord>       <dbl>    <dbl>   <int>
##  1 329          30.3       13       6
##  2 327          30.6       13       6
##  3 325          31.9       13       6
##  4 307          31.3       13       6
##  5 331          31.0       13       6
##  6 311          31.7       13       6
##  7 315          32.4       13       6
##  8 321          31.2       13       6
##  9 319          32.9       13       6
## 10 301          33.2       13       6
## 11 323          33.6       13       6
## 12 309          33.8       13       6
## 13 303          34.1       13       6
## 14 305          35.1       13       6
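
For readers without dplyr installed, a similar per-group average can be sketched in base R with ‘tapply()’. This is an alternative we are adding for comparison, not the method used above. The copy ‘trees’ and its artificially inserted missing height are hypothetical, included only to show why ‘na.rm = TRUE’ matters inside a grouped summary.

```r
# Mean height within each level of Seed, using base R only
height_by_seed <- tapply(Loblolly$height, Loblolly$Seed, mean, na.rm = TRUE)
round(height_by_seed, 1)

# Hypothetical scenario: one height measurement is missing
trees <- Loblolly
trees$height[1] <- NA   # row 1 belongs to seed source 301

# Without na.rm = TRUE, the affected group's mean becomes NA
with_na <- tapply(trees$height, trees$Seed, mean)
names(with_na)[is.na(with_na)]

# With na.rm = TRUE, that group is averaged over its remaining trees
without_na <- tapply(trees$height, trees$Seed, mean, na.rm = TRUE)
without_na[["301"]]
```

Only the group containing the missing value is affected. The other thirteen seed sources keep the same averages either way.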



Further Resources

Learn more about ‘mean()’ with the following:




Works Cited

This code through references and cites the following sources: