Concepts in Measurement R Example: Means and SDs
Setup
Please ensure you’ve downloaded R and RStudio. Once everything is installed, open RStudio; this is where we will be doing our calculations.
Installing Packages
For this brief example we will be using the Big Five Inventory (BFI) dataset from the psych package. To install the psych package you can type the following into your Console pane:
# Additionally we will install the other packages that psych recommends by setting dependencies = TRUE
install.packages("psych", dependencies = TRUE)
Getting Item Data
Now that we’ve installed psych we can import the data, either by loading the psych package with library():
# load all objects from the psych library
library(psych)
# assign the bfi dataset to the object idata
idata <- bfi
Or, typically a better alternative, we can explicitly call the bfi data by using psych::bfi:
idata <- psych::bfi
Dataframes
The R equivalent to a dataset is a data.frame. A data.frame has two dimensions: rows and columns. Additionally, each column of a data.frame can be a different type. For example, our data (now represented as the object idata) has columns that are all of the integer type. We could, however, have columns that are character (e.g., text), logical (TRUE/FALSE), etc. Calling str(idata) prints the structure of the object:
'data.frame': 2800 obs. of 28 variables:
$ A1 : int 2 2 5 4 2 6 2 4 4 2 ...
$ A2 : int 4 4 4 4 3 6 5 3 3 5 ...
$ A3 : int 3 5 5 6 3 5 5 1 6 6 ...
$ A4 : int 4 2 4 5 4 6 3 5 3 6 ...
$ A5 : int 4 5 4 5 5 5 5 1 3 5 ...
$ C1 : int 2 5 4 4 4 6 5 3 6 6 ...
$ C2 : int 3 4 5 4 4 6 4 2 6 5 ...
$ C3 : int 3 4 4 3 5 6 4 4 3 6 ...
$ C4 : int 4 3 2 5 3 1 2 2 4 2 ...
$ C5 : int 4 4 5 5 2 3 3 4 5 1 ...
$ E1 : int 3 1 2 5 2 2 4 3 5 2 ...
$ E2 : int 3 1 4 3 2 1 3 6 3 2 ...
$ E3 : int 3 6 4 4 5 6 4 4 NA 4 ...
$ E4 : int 4 4 4 4 4 5 5 2 4 5 ...
$ E5 : int 4 3 5 4 5 6 5 1 3 5 ...
$ N1 : int 3 3 4 2 2 3 1 6 5 5 ...
$ N2 : int 4 3 5 5 3 5 2 3 5 5 ...
$ N3 : int 2 3 4 2 4 2 2 2 2 5 ...
$ N4 : int 2 5 2 4 4 2 1 6 3 2 ...
$ N5 : int 3 5 3 1 3 3 1 4 3 4 ...
$ O1 : int 3 4 4 3 3 4 5 3 6 5 ...
$ O2 : int 6 2 2 3 3 3 2 2 6 1 ...
$ O3 : int 3 4 5 4 4 5 5 4 6 5 ...
$ O4 : int 4 3 5 3 3 6 6 5 6 5 ...
$ O5 : int 3 3 2 5 3 1 1 3 1 2 ...
$ gender : int 1 2 2 2 1 2 1 1 1 2 ...
$ education: int NA NA NA NA NA 3 NA 2 1 NA ...
$ age : int 16 18 17 17 17 21 18 19 19 17 ...
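As a small illustration of columns holding different types, the sketch below builds a made-up data.frame. The object toy and its column names are hypothetical and are not part of the bfi data.

```r
# A small, made-up data.frame (not part of the bfi data) with mixed column types
toy <- data.frame(
  id    = c("p1", "p2", "p3"),   # character column
  item1 = c(2L, 5L, NA),         # integer column with one missing value
  adult = c(TRUE, FALSE, TRUE),  # logical column
  stringsAsFactors = FALSE       # keep text as character in older R versions too
)

toy_dim   <- dim(toy)            # number of rows, then number of columns
toy_types <- sapply(toy, class)  # the type of each column

toy_dim
toy_types
str(toy)  # compact summary in the same layout as the output above
```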
To access things within a data structure we can use brackets [] after the object. If the object has multiple dimensions (e.g., rows and columns) we can use a comma within the brackets to access the various dimensions (e.g., [rows, columns]). This process is known as indexing.
Indexing Columns
You can index a single column in various ways in R. For instance, by using a numeric index:
# select the 5th column
idata[, 5]
4 5 4 5 5 5 5 1 3 5 5 5 4 6 1 3 5 5 3 5 2 5 6 5 5 ...
or, since dataframes have named columns, by name:
# select the age column
idata[, "age"]
16 18 17 17 17 21 18 19 19 17 21 16 16 16 17 17 17 17 16 17 17 17 68 27 18 ...
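Base R offers a few more ways to pull out a single column. The sketch below compares them on a small made-up data.frame (toy is hypothetical, not the bfi data); the same notation should work on idata.

```r
# Made-up data used only to compare the different ways of indexing one column
toy <- data.frame(A1 = c(2L, 2L, 5L), age = c(16L, 18L, 17L))

by_position <- toy[, 2]       # numeric index, as above
by_name     <- toy[, "age"]   # column name inside brackets, as above
by_dollar   <- toy$age        # the $ shorthand
by_double   <- toy[["age"]]   # double brackets

# All four return the same integer vector
identical(by_position, by_dollar)

# drop = FALSE keeps the result as a one-column data.frame instead of a vector
one_col_df <- toy[, "age", drop = FALSE]
class(one_col_df)
```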
Indexing Rows
We can index rows in the same manner; however, we must now place the index before the comma.
# select the 10th row
idata[10, ]
A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O2 O3 O4
61633 2 5 6 6 5 6 5 6 2 1 2 2 4 5 5 5 5 5 2 4 5 1 5 5
O5 gender education age
61633 2 2 NA 17
or, since our dataframe has row names, by row name:
# select participant 61688
idata["61688", ]
A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O2 O3 O4
61688 1 6 6 6 6 6 6 6 1 1 1 1 1 6 6 1 1 1 1 1 6 1 6 6
O5 gender education age
61688 1 1 NA 30
Indexing Multiple Rows & Columns
R also allows you to index several columns and/or rows by passing a vector of values as the index.
# selecting 4th and 9th column
idata[, c(4, 9)]
A4 C4
61617 4 4
61618 2 3
61620 4 2
61621 5 5
61622 4 3
61623 6 1
61624 3 2
61629 5 2
61630 3 4
61633 6 2
61634 6 3
61636 5 4
61637 6 2
61639 6 2
61640 2 2
61643 6 3
61650 2 4
61651 4 4
61653 4 4
61654 5 5
# selecting columns 2 through 8; no need to use c()
idata[, 2:8]
A2 A3 A4 A5 C1 C2 C3
61617 4 3 4 4 2 3 3
61618 4 5 2 5 5 4 4
61620 4 5 4 4 4 5 4
61621 4 6 5 5 4 4 3
61622 3 3 4 5 4 4 5
61623 6 5 6 5 6 6 6
61624 5 5 3 5 5 4 4
61629 3 1 5 1 3 2 4
61630 3 6 3 3 6 6 3
61633 5 6 6 5 6 5 6
61634 4 5 6 5 4 3 5
61636 5 5 5 5 5 4 5
61637 5 5 6 4 5 4 3
61639 5 5 6 6 4 4 4
61640 5 2 2 1 5 5 5
61643 3 6 6 3 5 5 5
61650 6 6 2 5 4 4 4
61651 5 5 4 5 5 5 5
61653 4 5 4 3 5 4 5
61654 4 6 5 5 1 1 1
# selecting first 10 rows
idata[1:10, ]
A1 A2 A3 A4 A5 C1 C2 ... N5 O1 O2 O3 O4 O5 gender education age
61617 2 4 3 4 4 2 3 ... 3 3 6 3 4 3 1 NA 16
61618 2 4 5 2 5 5 4 ... 5 4 2 4 3 3 2 NA 18
61620 5 4 5 4 4 4 5 ... 3 4 2 5 5 2 2 NA 17
61621 4 4 6 5 5 4 4 ... 1 3 3 4 3 5 2 NA 17
61622 2 3 3 4 5 4 4 ... 3 3 3 4 3 3 1 NA 17
61623 6 6 5 6 5 6 6 ... 3 4 3 5 6 1 2 3 21
61624 2 5 5 3 5 5 4 ... 1 5 2 5 6 1 1 NA 18
61629 4 3 1 5 1 3 2 ... 4 3 2 4 5 3 1 2 19
61630 4 3 6 3 3 6 6 ... 3 6 6 6 6 1 1 1 19
61633 2 5 6 6 5 6 5 ... 4 5 1 5 5 2 2 NA 17
# selecting rows 15, 19, 30 and columns gender, age, and C2
idata[c(15, 19, 30), c("gender", "age", "C2")]
gender age C2
61640 1 17 5
61653 2 16 4
61673 2 18 5
Filtering & Subsetting
We can also select parts of our dataframe by indexing the things we want and assigning them to a new (or existing) object—this is known as filtering or subsetting. Since, in this example, we are focusing on just the personality items, we need to remove the last 3 columns. If we use a numeric index we can leverage -c(columns we don't want) (note the - symbol) to remove the columns we don’t want.
Important: To actually save or keep our filtered dataframe we must assign it to either a new object, or replace our existing object. If the latter, our old dataset will go away. It’s usually best practice to keep your raw/unfiltered dataset as an object in your environment. In our example, the raw data remain available from the package (i.e., psych::bfi), so we can replace our object idata with a filtered version of the dataframe. Our dataframe has 28 columns and we need to remove columns 26 through 28 (i.e., the gender, education, and age columns).
# replace our idata with a subset of the idata object (without columns 26, 27, 28)
idata <- idata[, -c(26, 27, 28)]
If you subset or filter your dataframe by names you need to use slightly different notation.
# this basically translates to: give me all the columns that aren't age, education, or gender
idata <- idata[, !names(idata) %in% c("age", "education", "gender")]
Calculating Item Means and SDs
The mean and sd Functions
With this in mind, we can calculate the mean for the first column by using the mean() function. We have some missing values (NAs), so we must use an additional argument in the mean() function: the na.rm argument, which we will set to TRUE.
# mean of the first column after removing the NA values
mean(x = idata[, 1], na.rm = TRUE)
[1] 2.413434
The function to calculate the standard deviation sd() works the same way; we need to remove the NAs before calculating the standard deviation.
# standard deviation of the first column after removing the NA values
sd(x = idata[, 1], na.rm = TRUE)
[1] 1.407737
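To see why na.rm matters, the short sketch below uses a made-up vector (x is hypothetical, not taken from the bfi data) with one missing value.

```r
# A made-up numeric vector with one missing value
x <- c(2, 4, NA, 6)

mean_with_na <- mean(x)                # the NA propagates, so the result is NA
mean_no_na   <- mean(x, na.rm = TRUE)  # mean of 2, 4, 6
sd_no_na     <- sd(x, na.rm = TRUE)    # standard deviation of 2, 4, 6

mean_with_na
mean_no_na
sd_no_na
```

Without na.rm = TRUE a single NA is enough to make the whole result NA, which is why the argument is set in every call in this example.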
Diving Deeper into mean and sd
We can look at the help documentation of mean and sd by typing a ? before the function name, so ?mean or ?sd. To illustrate, under the Arguments section the help documentation for sd reads:
x: a numeric vector or an R object but not a factor coercible to numeric by as.double(x).
Right now this may not make a lot of sense, but as you learn the vocabulary of R, you’ll find the help documentation becomes increasingly helpful. Specifically, the statement above tells us the type of object that the argument x can be. Let’s try to calculate the means for the first two columns.
mean(x = idata[, 1:2], na.rm = TRUE)
Warning in mean.default(x = idata[, 1:2], na.rm = TRUE): argument is not numeric or logical: returning NA
Notice how we receive a warning (which may or may not make sense). The issue occurred because x needs to be a single column (i.e., a numeric vector) and not multiple columns.
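The difference between one column and several columns can be checked directly. The sketch below uses a made-up data.frame (toy is hypothetical); the same checks should give the same answers for idata[, 1] and idata[, 1:2].

```r
# Made-up data used only to compare a single column with a pair of columns
toy <- data.frame(A1 = c(2L, 2L, 5L), A2 = c(4L, 4L, NA))

one_col  <- toy[, 1]    # a single column comes back as a plain vector
two_cols <- toy[, 1:2]  # several columns come back as a data.frame

is.numeric(one_col)      # TRUE: mean() and sd() can work with this
is.data.frame(two_cols)  # TRUE
is.numeric(two_cols)     # FALSE: this is what triggers the warning
```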
The implication is that we may only pass one column at a time to the mean() and sd() functions. This may seem daunting, as it means we would need to manually pass each column to the mean and sd functions, which would be a lot of repetition.
# column 1
mean(x = idata[, 1], na.rm = TRUE)
sd(x = idata[, 1], na.rm = TRUE)
# column 2
mean(x = idata[, 2], na.rm = TRUE)
sd(x = idata[, 2], na.rm = TRUE)
# ...
# column 25
mean(x = idata[, 25], na.rm = TRUE)
sd(x = idata[, 25], na.rm = TRUE)
Luckily, R is very good at repeating steps for us, even if things change slightly during each step. The programming construct we could use is called a for loop: for each column in our data frame, calculate the mean and standard deviation. We could use a for loop, but R already has several functions that are a bit more convenient. These functions all fall under a group known as the apply family.
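For comparison, a for loop over the columns might look like the sketch below. It runs on a made-up data.frame (toy and item_means are hypothetical names), and it is only one of several reasonable ways to write such a loop.

```r
# Made-up data with three item columns
toy <- data.frame(A1 = c(2L, 2L, 5L), A2 = c(4L, 4L, NA), A3 = c(3L, 5L, 5L))

# Create an empty, named vector to hold one mean per column
item_means <- numeric(ncol(toy))
names(item_means) <- names(toy)

# Visit each column position in turn and store that column's mean
for (j in seq_along(toy)) {
  item_means[j] <- mean(toy[, j], na.rm = TRUE)
}

item_means
```

The loop needs a container set up in advance and an explicit index, which is the bookkeeping the apply family takes care of.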
Applying Functions
All of the apply functions will apply some function to each component of a data structure; for data.frames these components are usually columns. sapply, for example, will apply a function to each column in a data.frame or each component of a list. apply will apply a function to either the rows or columns in a matrix or data.frame.
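To make the difference between sapply and apply concrete, here is a minimal sketch on a made-up list and matrix (scores and m are hypothetical objects, unrelated to the bfi data).

```r
# sapply: one result per component of a list
scores <- list(a = c(1, 2, 3), b = c(2, 4, NA))
list_means <- sapply(scores, FUN = mean, na.rm = TRUE)

# apply: choose rows (MARGIN = 1) or columns (MARGIN = 2) of a matrix
m <- matrix(1:6, nrow = 2)                     # 2 rows, 3 columns
row_totals <- apply(m, MARGIN = 1, FUN = sum)  # one value per row
col_totals <- apply(m, MARGIN = 2, FUN = sum)  # one value per column

list_means
row_totals
col_totals
```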
Column-Wise Means
# using sapply
sapply(idata, FUN = mean, na.rm = TRUE)
A1 A2 A3 A4 A5 C1 C2 C3
2.413434 4.802380 4.603821 4.699748 4.560345 4.502339 4.369957 4.303957
C4 C5 E1 E2 E3 E4 E5 N1
2.553353 3.296695 2.974433 3.141882 4.000721 4.422429 4.416337 2.929086
N2 N3 N4 N5 O1 O2 O3 O4
3.507737 3.216565 3.185601 2.969686 4.816055 2.713214 4.438312 4.892319
O5
2.489568
# using apply (we set MARGIN = 2 to denote columns)
apply(idata, MARGIN = 2, FUN = mean, na.rm = TRUE)
A1 A2 A3 A4 A5 C1 C2 C3
2.413434 4.802380 4.603821 4.699748 4.560345 4.502339 4.369957 4.303957
C4 C5 E1 E2 E3 E4 E5 N1
2.553353 3.296695 2.974433 3.141882 4.000721 4.422429 4.416337 2.929086
N2 N3 N4 N5 O1 O2 O3 O4
3.507737 3.216565 3.185601 2.969686 4.816055 2.713214 4.438312 4.892319
O5
2.489568
Column-Wise SDs
By setting FUN = sd we can calculate the standard deviation of each column.
# using sapply
sapply(idata, FUN = sd, na.rm = TRUE)
A1 A2 A3 A4 A5 C1 C2 C3
1.407737 1.172020 1.301834 1.479633 1.258512 1.241347 1.318347 1.288552
C4 C5 E1 E2 E3 E4 E5 N1
1.375118 1.628542 1.631505 1.605210 1.352719 1.457517 1.334768 1.570917
N2 N3 N4 N5 O1 O2 O3 O4
1.525944 1.602902 1.569685 1.618647 1.129530 1.565152 1.220901 1.221250
O5
1.327959
# using apply (we set MARGIN = 2 to denote columns)
apply(idata, MARGIN = 2, FUN = sd, na.rm = TRUE)
A1 A2 A3 A4 A5 C1 C2 C3
1.407737 1.172020 1.301834 1.479633 1.258512 1.241347 1.318347 1.288552
C4 C5 E1 E2 E3 E4 E5 N1
1.375118 1.628542 1.631505 1.605210 1.352719 1.457517 1.334768 1.570917
N2 N3 N4 N5 O1 O2 O3 O4
1.525944 1.602902 1.569685 1.618647 1.129530 1.565152 1.220901 1.221250
O5
1.327959
Row-Wise Calculations
The MARGIN argument in the apply function allows us to determine whether we would like to apply the FUN row-wise or column-wise. By setting MARGIN = 1 we can get all the row means. Alternatively, for row means, you can use the rowMeans function (which is probably a bit faster). Sadly, base R has no rowSD function.
# using apply (we set MARGIN = 1 to denote rows)
# here we just do the first 20 rows
apply(idata[1:20, ], MARGIN = 1, FUN = mean, na.rm = TRUE)
61617 61618 61620 61621 61622 61623 61624 61629
3.320000 3.520000 3.880000 3.880000 3.400000 4.160000 3.400000 3.320000
61630 61633 61634 61636 61637 61639 61640 61643
4.208333 4.040000 3.720000 4.208333 3.280000 3.480000 3.560000 4.360000
61650 61651 61653 61654
4.080000 4.360000 3.920000 3.480000
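Since there is no built-in row-wise SD function, apply with MARGIN = 1 and FUN = sd can fill the gap. The sketch below uses a made-up data.frame with hypothetical participant IDs as row names, and also checks that rowMeans agrees with apply.

```r
# Made-up data: three participants (rows) and three items (columns)
toy <- data.frame(
  A1 = c(2L, 2L, 5L),
  A2 = c(4L, 4L, NA),
  A3 = c(3L, 6L, 5L),
  row.names = c("p1", "p2", "p3")
)

# Row means, computed two ways
row_means_apply <- apply(toy, MARGIN = 1, FUN = mean, na.rm = TRUE)
row_means_fast  <- rowMeans(toy, na.rm = TRUE)

# Row standard deviations via apply
row_sds <- apply(toy, MARGIN = 1, FUN = sd, na.rm = TRUE)

row_means_apply
row_means_fast
row_sds
```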
Both Means and SDs at Once
The apply family also allows you to pass an anonymous or user-defined function to the FUN argument. These are just custom functions that we can write ourselves to do some desired behavior. For instance, the anonymous function below removes NAs from a vector then returns the mean AND standard deviation.
function(i) {
  i <- i[!is.na(i)]
  return(c(MEAN = mean(i), SD = sd(i)))
}
We can use this as the FUN argument directly. Note: function() can use any symbol(s) as the argument; just know that this symbol will represent the data within the function. To demonstrate below, we use x instead of i:
# run sapply with our anonymous function
sapply(idata, FUN = function(x) {
  x <- x[!is.na(x)]
  return(c(MEAN = mean(x), SD = sd(x)))
})
A1 A2 A3 A4 A5 C1 C2 C3
MEAN 2.413434 4.80238 4.603821 4.699748 4.560345 4.502339 4.369957 4.303957
SD 1.407737 1.17202 1.301834 1.479633 1.258512 1.241347 1.318347 1.288552
C4 C5 E1 E2 E3 E4 E5 N1
MEAN 2.553353 3.296695 2.974433 3.141882 4.000721 4.422429 4.416337 2.929086
SD 1.375118 1.628542 1.631505 1.605210 1.352719 1.457517 1.334768 1.570917
N2 N3 N4 N5 O1 O2 O3 O4
MEAN 3.507737 3.216565 3.185601 2.969686 4.816055 2.713214 4.438312 4.892319
SD 1.525944 1.602902 1.569685 1.618647 1.129530 1.565152 1.220901 1.221250
O5
MEAN 2.489568
SD 1.327959
We can also give the function a name (making it user-defined) and then use it with sapply:
# turn our anonymous function into a user-defined one by giving it a name
mean_sd <- function(x) {
  x <- x[!is.na(x)]
  return(c(MEAN = mean(x), SD = sd(x)))
}
# run it with sapply
sapply(idata, mean_sd)
A1 A2 A3 A4 A5 C1 C2 C3
MEAN 2.413434 4.80238 4.603821 4.699748 4.560345 4.502339 4.369957 4.303957
SD 1.407737 1.17202 1.301834 1.479633 1.258512 1.241347 1.318347 1.288552
C4 C5 E1 E2 E3 E4 E5 N1
MEAN 2.553353 3.296695 2.974433 3.141882 4.000721 4.422429 4.416337 2.929086
SD 1.375118 1.628542 1.631505 1.605210 1.352719 1.457517 1.334768 1.570917
N2 N3 N4 N5 O1 O2 O3 O4
MEAN 3.507737 3.216565 3.185601 2.969686 4.816055 2.713214 4.438312 4.892319
SD 1.525944 1.602902 1.569685 1.618647 1.129530 1.565152 1.220901 1.221250
O5
MEAN 2.489568
SD 1.327959
Don’t forget to assign the output to a new object
Thus far, I have not been assigning the output to a new object using the <- operator. In other words, R is executing the code and printing the result, but nothing is kept afterwards.
To keep any output you want to use later, assign it to a new object.
# saving out means and sds to a new object
idata_m_sd <- sapply(idata, mean_sd)
idata_m_sd
A1 A2 A3 A4 A5 C1 C2 C3
MEAN 2.413434 4.80238 4.603821 4.699748 4.560345 4.502339 4.369957 4.303957
SD 1.407737 1.17202 1.301834 1.479633 1.258512 1.241347 1.318347 1.288552
C4 C5 E1 E2 E3 E4 E5 N1
MEAN 2.553353 3.296695 2.974433 3.141882 4.000721 4.422429 4.416337 2.929086
SD 1.375118 1.628542 1.631505 1.605210 1.352719 1.457517 1.334768 1.570917
N2 N3 N4 N5 O1 O2 O3 O4
MEAN 3.507737 3.216565 3.185601 2.969686 4.816055 2.713214 4.438312 4.892319
SD 1.525944 1.602902 1.569685 1.618647 1.129530 1.565152 1.220901 1.221250
O5
MEAN 2.489568
SD 1.327959
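Once the result is saved, it can be indexed like any other two-dimensional object. The sketch below repeats the mean_sd definition on a made-up data.frame (toy, toy_m_sd, and item_table are hypothetical names) and shows a few ways of working with the saved matrix; the same indexing should apply to idata_m_sd.

```r
# Same helper as above: drop NAs, then return the mean and SD
mean_sd <- function(x) {
  x <- x[!is.na(x)]
  c(MEAN = mean(x), SD = sd(x))
}

# Made-up data with two item columns
toy <- data.frame(A1 = c(2L, 4L, 6L), A2 = c(4L, 4L, NA))
toy_m_sd <- sapply(toy, mean_sd)  # a matrix: rows MEAN and SD, one column per item

a1_mean <- toy_m_sd["MEAN", "A1"]  # a single cell, by row and column name
all_sds <- toy_m_sd["SD", ]        # the SD of every item

# Transpose so that each item is a row, then convert to a data.frame
item_table <- as.data.frame(t(toy_m_sd))
item_table
```

Transposing is optional; it simply gives a layout with one row per item, which can be easier to read when there are many items.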