Concepts in Measurement R Example: Means and SDs

Setup

Please ensure you’ve downloaded R and RStudio. Once everything is installed, open RStudio—this is where we will be doing our calculations.

Installing Packages

For this brief example we will be using the Big Five Inventory (BFI) dataset from the psych package. To install the psych package you can type the following into your Console pane:

# Additionally we will install the other packages that psych recommends by setting dependencies = TRUE
install.packages("psych", dependencies = TRUE)

Getting Item Data

Now that we’ve installed psych, we can import the data by loading the psych package with library():

# load all objects from the psych library
library(psych)

# assign the bfi dataset to the object idata
idata <- bfi

Alternatively, and typically a better choice, we can call the bfi data explicitly with psych::bfi, which does not attach the whole package:

idata <- psych::bfi
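The :: notation works for any installed package, not just psych. As a minimal sketch, the lines below use the datasets package (which ships with R) rather than psych, so they run without installing anything; the object names cars_explicit and cars_attached are made up for illustration.

```r
# Access an object with package::name without attaching the package.
cars_explicit <- datasets::mtcars

# After library(), the same object is reachable by its bare name.
library(datasets)
cars_attached <- mtcars

# Both routes return the same data.frame.
identical(cars_explicit, cars_attached)
```

The explicit form makes it obvious where an object comes from, which helps when two packages export objects with the same name.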

Dataframes

The R equivalent to a dataset is a data.frame. A data.frame has two dimensions: rows and columns. Additionally, each column of a dataframe can be a different type. For example, our data (now represented as the object idata) has columns which are all of the integer type, as the output of str(idata) below shows. We could, however, have columns that were character (e.g., text), logical (TRUE/FALSE), etc.

'data.frame':   2800 obs. of  28 variables:
 $ A1       : int  2 2 5 4 2 6 2 4 4 2 ...
 $ A2       : int  4 4 4 4 3 6 5 3 3 5 ...
 $ A3       : int  3 5 5 6 3 5 5 1 6 6 ...
 $ A4       : int  4 2 4 5 4 6 3 5 3 6 ...
 $ A5       : int  4 5 4 5 5 5 5 1 3 5 ...
 $ C1       : int  2 5 4 4 4 6 5 3 6 6 ...
 $ C2       : int  3 4 5 4 4 6 4 2 6 5 ...
 $ C3       : int  3 4 4 3 5 6 4 4 3 6 ...
 $ C4       : int  4 3 2 5 3 1 2 2 4 2 ...
 $ C5       : int  4 4 5 5 2 3 3 4 5 1 ...
 $ E1       : int  3 1 2 5 2 2 4 3 5 2 ...
 $ E2       : int  3 1 4 3 2 1 3 6 3 2 ...
 $ E3       : int  3 6 4 4 5 6 4 4 NA 4 ...
 $ E4       : int  4 4 4 4 4 5 5 2 4 5 ...
 $ E5       : int  4 3 5 4 5 6 5 1 3 5 ...
 $ N1       : int  3 3 4 2 2 3 1 6 5 5 ...
 $ N2       : int  4 3 5 5 3 5 2 3 5 5 ...
 $ N3       : int  2 3 4 2 4 2 2 2 2 5 ...
 $ N4       : int  2 5 2 4 4 2 1 6 3 2 ...
 $ N5       : int  3 5 3 1 3 3 1 4 3 4 ...
 $ O1       : int  3 4 4 3 3 4 5 3 6 5 ...
 $ O2       : int  6 2 2 3 3 3 2 2 6 1 ...
 $ O3       : int  3 4 5 4 4 5 5 4 6 5 ...
 $ O4       : int  4 3 5 3 3 6 6 5 6 5 ...
 $ O5       : int  3 3 2 5 3 1 1 3 1 2 ...
 $ gender   : int  1 2 2 2 1 2 1 1 1 2 ...
 $ education: int  NA NA NA NA NA 3 NA 2 1 NA ...
 $ age      : int  16 18 17 17 17 21 18 19 19 17 ...
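To see columns of different types side by side, here is a small sketch with a hypothetical data.frame called toy (its column names and values are invented for illustration and are not part of the bfi data).

```r
# Hypothetical toy data: three columns, each of a different type.
toy <- data.frame(
  item1   = c(2L, 5L, NA),        # integer
  label   = c("a", "b", "c"),     # character
  retired = c(TRUE, FALSE, TRUE), # logical
  stringsAsFactors = FALSE        # keep text as character in older R versions
)

dim(toy)            # number of rows, then number of columns
sapply(toy, class)  # the type of each column
str(toy)            # compact overview, in the same format as the output above
```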

To access things within a data structure we can use brackets [] after the object. If the object has multiple dimensions (e.g., rows and columns) we can use a comma within the brackets to access the various dimensions (e.g., [rows, columns]). This process is known as indexing.

Indexing Columns

You can index a single column in various ways in R. For instance, by using a numeric index:

# select the 5th column
idata[, 5]
4 5 4 5 5 5 5 1 3 5 5 5 4 6 1 3 5 5 3 5 2 5 6 5 5 ...

or, since dataframes have named columns, by name:

# select the age
idata[, "age"]
16 18 17 17 17 21 18 19 19 17 21 16 16 16 17 17 17 17 16 17 17 17 68 27 18 ...
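Base R offers a few more ways to pull out a single column. The sketch below uses a hypothetical three-row data.frame named toy (not the bfi data) to show that they return the same vector, and how drop = FALSE keeps the result as a data.frame.

```r
# Hypothetical toy data with two named columns.
toy <- data.frame(A1 = c(2L, 2L, 5L), age = c(16L, 18L, 17L))

by_position <- toy[, 2]       # numeric index
by_name     <- toy[, "age"]   # name inside single brackets
by_dollar   <- toy$age        # $ shortcut
by_double   <- toy[["age"]]   # double brackets

# All four return the same integer vector.
identical(by_position, by_name)

# drop = FALSE keeps the result as a one-column data.frame.
as_frame <- toy[, "age", drop = FALSE]
is.data.frame(as_frame)
```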

Indexing Rows

We can index rows in the same manner; however, we must now place the index before the comma.

# select the 10th row
idata[10 ,]
      A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O2 O3 O4
61633  2  5  6  6  5  6  5  6  2  1  2  2  4  5  5  5  5  5  2  4  5  1  5  5
      O5 gender education age
61633  2      2        NA  17

or, since our dataframe has row names, by row name:

# select participant 61688
idata["61688", ]
      A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O2 O3 O4
61688  1  6  6  6  6  6  6  6  1  1  1  1  1  6  6  1  1  1  1  1  6  1  6  6
      O5 gender education age
61688  1      1        NA  30

Indexing Multiple Rows & Columns

R also allows you to index several columns and/or rows by passing a vector of values as the index.

# selecting 4th and 9th column
idata[, c(4, 9)]
      A4 C4
61617  4  4
61618  2  3
61620  4  2
61621  5  5
61622  4  3
61623  6  1
61624  3  2
61629  5  2
61630  3  4
61633  6  2
61634  6  3
61636  5  4
61637  6  2
61639  6  2
61640  2  2
61643  6  3
61650  2  4
61651  4  4
61653  4  4
61654  5  5

# selecting columns 2 through 8; no need to use c()
idata[, 2:8]
      A2 A3 A4 A5 C1 C2 C3
61617  4  3  4  4  2  3  3
61618  4  5  2  5  5  4  4
61620  4  5  4  4  4  5  4
61621  4  6  5  5  4  4  3
61622  3  3  4  5  4  4  5
61623  6  5  6  5  6  6  6
61624  5  5  3  5  5  4  4
61629  3  1  5  1  3  2  4
61630  3  6  3  3  6  6  3
61633  5  6  6  5  6  5  6
61634  4  5  6  5  4  3  5
61636  5  5  5  5  5  4  5
61637  5  5  6  4  5  4  3
61639  5  5  6  6  4  4  4
61640  5  2  2  1  5  5  5
61643  3  6  6  3  5  5  5
61650  6  6  2  5  4  4  4
61651  5  5  4  5  5  5  5
61653  4  5  4  3  5  4  5
61654  4  6  5  5  1  1  1
# selecting first 10 rows
idata[1:10, ]
      A1 A2 A3 A4 A5 C1 C2   ... N5 O1 O2 O3 O4 O5 gender education age
61617  2  4  3  4  4  2  3  ...   3  3  6  3  4  3      1        NA  16
61618  2  4  5  2  5  5  4  ...   5  4  2  4  3  3      2        NA  18
61620  5  4  5  4  4  4  5  ...   3  4  2  5  5  2      2        NA  17
61621  4  4  6  5  5  4  4  ...   1  3  3  4  3  5      2        NA  17
61622  2  3  3  4  5  4  4  ...   3  3  3  4  3  3      1        NA  17
61623  6  6  5  6  5  6  6  ...   3  4  3  5  6  1      2         3  21
61624  2  5  5  3  5  5  4  ...   1  5  2  5  6  1      1        NA  18
61629  4  3  1  5  1  3  2  ...   4  3  2  4  5  3      1         2  19
61630  4  3  6  3  3  6  6  ...   3  6  6  6  6  1      1         1  19
61633  2  5  6  6  5  6  5  ...   4  5  1  5  5  2      2        NA  17
# selecting rows 15, 19, 30 and columns gender, age, and C2
idata[c(15, 19, 30), c("gender", "age", "C2")]
      gender age C2
61640      1  17  5
61653      2  16  4
61673      2  18  5
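An index can also be a logical vector with one TRUE/FALSE value per row, which lets us select rows by a condition instead of by position. This is a minimal sketch on a hypothetical data.frame named toy; the row names and values are invented for illustration.

```r
# Hypothetical toy data with row names standing in for participant IDs.
toy <- data.frame(
  gender = c(1L, 2L, 2L, 1L),
  age    = c(16L, 21L, 17L, 30L),
  row.names = c("p1", "p2", "p3", "p4")
)

# A logical vector: TRUE where the condition holds.
is_adult <- toy$age >= 18

# Rows where the index is TRUE are kept.
adults <- toy[is_adult, ]
adults
```

If the column used in the condition contains NAs, the logical index contains NAs as well; wrapping the condition in which() is one common way to drop those rows.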

Filtering & Subsetting

We can also select parts of our dataframe by indexing the things we want and assigning them to a new (or existing) object—this is known as filtering or subsetting. Since, in this example, we are focusing on just the personality items, we need to remove the last 3 columns. If we use a numeric index we can leverage -c(columns we don't want) (note the - symbol) to remove the columns we don’t want.

Important: To actually save or keep our filtered dataframe we must assign it to either a new object or our existing object. If the latter, the old contents of that object are overwritten. It’s usually best practice to keep your raw/unfiltered dataset as an object in your environment. In our example, the raw data are always available from the package (i.e., psych::bfi), so we can replace our object idata with a filtered version of the dataframe. Our dataframe has 28 columns and we need to remove columns 26 through 28 (i.e., the gender, education, and age columns).

# replace our idata with a subset of the idata object (without columns 26, 27, 28)
idata <- idata[, -c(26,27,28)]

If you subset or filter your dataframe by names you need to use slightly different notation.

# this basically translates to: give me all the columns that aren't age, education, or gender
idata <- idata[, !names(idata) %in% c("age", "education", "gender")]
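The two approaches can be compared on a small scale. The sketch below assumes a hypothetical data.frame called raw (not the bfi data) and stores each filtered version under a new name, so the unfiltered object stays available.

```r
# Hypothetical raw data: two items plus two demographic columns.
raw <- data.frame(
  A1     = c(2L, 5L),
  A2     = c(4L, 4L),
  gender = c(1L, 2L),
  age    = c(16L, 17L)
)

# Drop columns 3 and 4 by position (note the minus sign).
by_number <- raw[, -c(3, 4)]

# Drop the same columns by name.
by_name <- raw[, !names(raw) %in% c("gender", "age")]

identical(by_number, by_name)  # both routes give the same result
names(raw)                     # raw itself is unchanged
```

Dropping by name is a bit more verbose, but it keeps working if the column order changes.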

Calculating Item Means and SDs

The mean and sd Functions

With this in mind, we can calculate the mean of the first column with the mean() function. We have some missing values (NAs), so we must use an additional argument in the mean() function: the na.rm argument, which we will set to TRUE.

# mean of the first column after removing the NA values
mean(x = idata[, 1], na.rm = TRUE)
[1] 2.413434

The function to calculate the standard deviation sd() works the same way; we need to remove the NAs before calculating the standard deviation.

# standard deviation of the first column after removing the NA values
sd(x = idata[, 1], na.rm = TRUE)
[1] 1.407737
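To see why na.rm matters, here is a short sketch with a hypothetical vector named scores that contains one missing value.

```r
# Hypothetical item responses with one missing value.
scores <- c(2L, 4L, NA, 6L)

mean(scores)                # NA, because one value is missing
mean(scores, na.rm = TRUE)  # 4, the mean of 2, 4, and 6
sd(scores, na.rm = TRUE)    # 2
```

By default both functions return NA as soon as any value is missing, which is a useful reminder to check how much data is actually missing before removing it.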

Diving Deeper into mean and sd

We can open the help documentation for mean and sd by typing the function name after a ?, so ?mean or ?sd. To illustrate, under the Arguments section the help documentation for sd reads (mean has a similar restriction on x):

x a numeric vector or an R object but not a factor coercible to numeric by as.double(x).

Right now this may not make a lot of sense, but as you learn the vocabulary of R, you’ll find the help documentation becomes increasingly helpful. Specifically, the statement above tells us the type of object that the argument x can be. Let’s try to calculate the means for the first two columns.

mean(x = idata[, 1:2], na.rm = TRUE)
Warning in mean.default(x = idata[, 1:2], na.rm = TRUE): argument is not numeric
or logical: returning NA

Notice how we receive a warning (which may or may not make sense yet). The issue occurred because x needs to be a single column (i.e., a numeric vector) and not multiple columns.
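One way to check what we are about to pass to mean() is to ask R what kind of object it is. A minimal sketch, again with a hypothetical data.frame named toy:

```r
# Hypothetical toy data with two item columns.
toy <- data.frame(A1 = c(2L, 2L, 5L), A2 = c(4L, NA, 4L))

# A single column is a numeric (here integer) vector.
one_column <- toy[, 1]
is.numeric(one_column)

# Two columns are still a data.frame, which is not numeric.
two_columns <- toy[, 1:2]
is.numeric(two_columns)
is.data.frame(two_columns)
```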

The implication is that we may only pass one column at a time to the mean() and sd() functions. This may seem daunting, as it means we would need to manually pass each column to the mean and sd functions, which would be a lot of repetition.

# column 1
mean(x = idata[, 1], na.rm = TRUE)
sd(x = idata[, 1], na.rm = TRUE)

# column 2
mean(x = idata[, 2], na.rm = TRUE)
sd(x = idata[, 2], na.rm = TRUE)

# ...

# column 25
mean(x = idata[, 25], na.rm = TRUE)
sd(x = idata[, 25], na.rm = TRUE)

Luckily, R is very good at repeating steps for us, even if things change slightly during each step. One programming construct we could use is called a for loop: for each column in our data frame, calculate the mean and standard deviation. We could use a for loop, but R already has several functions that are a bit more convenient. These functions all fall under a group known as the apply family.
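For comparison, a for loop version might look like the sketch below. It uses a hypothetical three-column data.frame named toy rather than idata, and the result vectors col_means and col_sds are names chosen for illustration.

```r
# Hypothetical toy data with three item columns.
toy <- data.frame(
  A1 = c(2L, 2L, 5L),
  A2 = c(4L, NA, 4L),
  A3 = c(3L, 5L, 5L)
)

# Create empty, named result vectors with one slot per column.
col_means <- numeric(ncol(toy))
col_sds   <- numeric(ncol(toy))
names(col_means) <- names(toy)
names(col_sds)   <- names(toy)

# Visit each column in turn and store its mean and SD.
for (j in seq_along(toy)) {
  col_means[j] <- mean(toy[, j], na.rm = TRUE)
  col_sds[j]   <- sd(toy[, j], na.rm = TRUE)
}

col_means
col_sds
```

The loop works, but it needs bookkeeping (creating the result vectors, tracking the index j) that the apply functions handle for us.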

Applying Functions

All of the apply functions will apply some function to each component of a data structure; for data.frames, the components are usually columns. sapply, for example, will apply a function to each column in a data.frame or each component of a list. apply will apply a function to either the rows or columns of a matrix or data.frame.
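The members of the family differ mainly in the shape of what they return. The sketch below compares lapply (another member of the family, not used elsewhere in this example), sapply, and apply on a hypothetical data.frame named toy.

```r
# Hypothetical toy data with two item columns.
toy <- data.frame(A1 = c(2L, 2L, 5L), A2 = c(4L, NA, 4L))

# lapply always returns a list, one element per column.
as_list <- lapply(toy, mean, na.rm = TRUE)

# sapply simplifies that list to a named vector when it can.
as_vector <- sapply(toy, mean, na.rm = TRUE)

# apply converts the data.frame to a matrix first, then works over columns.
via_apply <- apply(toy, MARGIN = 2, FUN = mean, na.rm = TRUE)

class(as_list)
as_vector
via_apply
```

Because apply converts its input to a matrix, it is best suited to data.frames whose columns all share one type, as is the case for idata.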

Column-Wise Means

# using sapply
sapply(idata, FUN = mean, na.rm = TRUE)
      A1       A2       A3       A4       A5       C1       C2       C3 
2.413434 4.802380 4.603821 4.699748 4.560345 4.502339 4.369957 4.303957 
      C4       C5       E1       E2       E3       E4       E5       N1 
2.553353 3.296695 2.974433 3.141882 4.000721 4.422429 4.416337 2.929086 
      N2       N3       N4       N5       O1       O2       O3       O4 
3.507737 3.216565 3.185601 2.969686 4.816055 2.713214 4.438312 4.892319 
      O5 
2.489568 
# using apply (we set MARGIN = 2 to denote columns)
apply(idata, MARGIN = 2, FUN = mean, na.rm = TRUE)
      A1       A2       A3       A4       A5       C1       C2       C3 
2.413434 4.802380 4.603821 4.699748 4.560345 4.502339 4.369957 4.303957 
      C4       C5       E1       E2       E3       E4       E5       N1 
2.553353 3.296695 2.974433 3.141882 4.000721 4.422429 4.416337 2.929086 
      N2       N3       N4       N5       O1       O2       O3       O4 
3.507737 3.216565 3.185601 2.969686 4.816055 2.713214 4.438312 4.892319 
      O5 
2.489568 

Column-Wise SDs

By setting FUN = sd we can calculate the standard deviation of each column.

# using sapply
sapply(idata, FUN = sd, na.rm = TRUE)
      A1       A2       A3       A4       A5       C1       C2       C3 
1.407737 1.172020 1.301834 1.479633 1.258512 1.241347 1.318347 1.288552 
      C4       C5       E1       E2       E3       E4       E5       N1 
1.375118 1.628542 1.631505 1.605210 1.352719 1.457517 1.334768 1.570917 
      N2       N3       N4       N5       O1       O2       O3       O4 
1.525944 1.602902 1.569685 1.618647 1.129530 1.565152 1.220901 1.221250 
      O5 
1.327959 
# using apply (we set MARGIN = 2 to denote columns)
apply(idata, MARGIN = 2, FUN = sd, na.rm = TRUE)
      A1       A2       A3       A4       A5       C1       C2       C3 
1.407737 1.172020 1.301834 1.479633 1.258512 1.241347 1.318347 1.288552 
      C4       C5       E1       E2       E3       E4       E5       N1 
1.375118 1.628542 1.631505 1.605210 1.352719 1.457517 1.334768 1.570917 
      N2       N3       N4       N5       O1       O2       O3       O4 
1.525944 1.602902 1.569685 1.618647 1.129530 1.565152 1.220901 1.221250 
      O5 
1.327959 

Row-Wise Calculations

The MARGIN argument in the apply function allows us to determine whether we would like to apply the FUN row-wise or column-wise. By setting MARGIN = 1 we can get all the row means. Alternatively, for row means, you can use the rowMeans function (which is probably a bit faster). Sadly, base R has no equivalent rowSD function.

# using apply (we set MARGIN = 1 to denote rows)
# here we just do the first 20 rows
apply(idata[1:20, ], MARGIN = 1, FUN = mean, na.rm = TRUE)
   61617    61618    61620    61621    61622    61623    61624    61629 
3.320000 3.520000 3.880000 3.880000 3.400000 4.160000 3.400000 3.320000 
   61630    61633    61634    61636    61637    61639    61640    61643 
4.208333 4.040000 3.720000 4.208333 3.280000 3.480000 3.560000 4.360000 
   61650    61651    61653    61654 
4.080000 4.360000 3.920000 3.480000 
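Since base R has no row-wise SD function, apply with MARGIN = 1 can fill the gap. A minimal sketch on a hypothetical data.frame named toy, which also checks that rowMeans agrees with the apply version:

```r
# Hypothetical toy data: three participants, three items.
toy <- data.frame(
  A1 = c(2L, 2L, 5L),
  A2 = c(4L, NA, 4L),
  A3 = c(3L, 5L, 5L),
  row.names = c("p1", "p2", "p3")
)

# Row means two ways: apply over rows, or the dedicated rowMeans function.
row_means_apply <- apply(toy, MARGIN = 1, FUN = mean, na.rm = TRUE)
row_means_fast  <- rowMeans(toy, na.rm = TRUE)
all.equal(row_means_apply, row_means_fast)

# Row SDs: apply sd over rows.
row_sds <- apply(toy, MARGIN = 1, FUN = sd, na.rm = TRUE)
row_sds
```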

Both Means and SDs at Once

The apply family also allows you to pass anonymous or user-defined functions to the FUN argument. These are just custom functions that we can write ourselves to do some desired behavior. For instance, the anonymous function below removes NAs from a vector, then returns the mean AND standard deviation.

function(i) {
    i <- i[!is.na(i)]
    return(c(MEAN = mean(i), SD = sd(i)))
}

We can use this as the FUN argument directly. Note: function() can use any symbol(s) as the argument names; just know that this symbol will represent the data within the function. To demonstrate, below we use x instead of i:

# run sapply with our anonymous function
sapply(idata, FUN = function(x) {
    x <- x[!is.na(x)]
    return(c(MEAN = mean(x), SD = sd(x)))
})
           A1      A2       A3       A4       A5       C1       C2       C3
MEAN 2.413434 4.80238 4.603821 4.699748 4.560345 4.502339 4.369957 4.303957
SD   1.407737 1.17202 1.301834 1.479633 1.258512 1.241347 1.318347 1.288552
           C4       C5       E1       E2       E3       E4       E5       N1
MEAN 2.553353 3.296695 2.974433 3.141882 4.000721 4.422429 4.416337 2.929086
SD   1.375118 1.628542 1.631505 1.605210 1.352719 1.457517 1.334768 1.570917
           N2       N3       N4       N5       O1       O2       O3       O4
MEAN 3.507737 3.216565 3.185601 2.969686 4.816055 2.713214 4.438312 4.892319
SD   1.525944 1.602902 1.569685 1.618647 1.129530 1.565152 1.220901 1.221250
           O5
MEAN 2.489568
SD   1.327959

We can also give the function a name (making it user-defined) and then use it with sapply:

# turn our anonymous function into a user-defined one by giving it a name
mean_sd <- function(x) {
    x <- x[!is.na(x)]
    return(c(MEAN = mean(x), SD = sd(x)))
}

# run it with sapply
sapply(idata, mean_sd)
           A1      A2       A3       A4       A5       C1       C2       C3
MEAN 2.413434 4.80238 4.603821 4.699748 4.560345 4.502339 4.369957 4.303957
SD   1.407737 1.17202 1.301834 1.479633 1.258512 1.241347 1.318347 1.288552
           C4       C5       E1       E2       E3       E4       E5       N1
MEAN 2.553353 3.296695 2.974433 3.141882 4.000721 4.422429 4.416337 2.929086
SD   1.375118 1.628542 1.631505 1.605210 1.352719 1.457517 1.334768 1.570917
           N2       N3       N4       N5       O1       O2       O3       O4
MEAN 3.507737 3.216565 3.185601 2.969686 4.816055 2.713214 4.438312 4.892319
SD   1.525944 1.602902 1.569685 1.618647 1.129530 1.565152 1.220901 1.221250
           O5
MEAN 2.489568
SD   1.327959
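A named function can also be called on its own, which is a handy way to check that it behaves as expected before handing it to sapply. The sketch below repeats the definition of mean_sd so that it runs by itself; the input vector is made up for illustration.

```r
# Same definition as above, repeated so this block is self-contained.
mean_sd <- function(x) {
    x <- x[!is.na(x)]
    return(c(MEAN = mean(x), SD = sd(x)))
}

# Call it directly on a hypothetical vector with one missing value.
result <- mean_sd(c(2, 4, NA, 6))
result

# The result is a named vector, so elements can be indexed by name.
result["MEAN"]
```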

Don’t forget to assign the output to a new object

Thus far, I have not been assigning the output to a new object using the <- operator. In other words, R is executing the code and printing the result, but not keeping it afterwards.

Don’t forget to assign data you want to a new object.

# saving our means and sds to a new object
idata_m_sd <- sapply(idata, mean_sd)

idata_m_sd
           A1      A2       A3       A4       A5       C1       C2       C3
MEAN 2.413434 4.80238 4.603821 4.699748 4.560345 4.502339 4.369957 4.303957
SD   1.407737 1.17202 1.301834 1.479633 1.258512 1.241347 1.318347 1.288552
           C4       C5       E1       E2       E3       E4       E5       N1
MEAN 2.553353 3.296695 2.974433 3.141882 4.000721 4.422429 4.416337 2.929086
SD   1.375118 1.628542 1.631505 1.605210 1.352719 1.457517 1.334768 1.570917
           N2       N3       N4       N5       O1       O2       O3       O4
MEAN 3.507737 3.216565 3.185601 2.969686 4.816055 2.713214 4.438312 4.892319
SD   1.525944 1.602902 1.569685 1.618647 1.129530 1.565152 1.220901 1.221250
           O5
MEAN 2.489568
SD   1.327959
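Once the result is stored, it can be reshaped or indexed like any other matrix. The sketch below is one possible follow-up, shown on a hypothetical data.frame named toy so that it runs without psych; the names toy_m_sd and item_table are invented for illustration.

```r
# Hypothetical toy data and the same helper function as above.
toy <- data.frame(A1 = c(2L, 2L, 5L), A2 = c(4L, NA, 4L))

mean_sd <- function(x) {
    x <- x[!is.na(x)]
    return(c(MEAN = mean(x), SD = sd(x)))
}

# sapply returns a matrix: rows are MEAN and SD, columns are items.
toy_m_sd <- sapply(toy, mean_sd)

# Transpose so each item is a row, then round to two decimals.
item_table <- round(t(toy_m_sd), 2)
item_table

# Index the saved results by row and column name.
item_table["A1", "MEAN"]
```

Putting items in rows tends to be easier to read when there are many items, as with the 25 columns in idata.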