1 Before we get started

1.1 Recap: What we learned in the previous tutorial

In the last tutorial, we learned a few things to get you started in using R:

  • How to download R and RStudio
  • How to begin using an R Markdown document
  • Basic functions in R
  • How to import data into R and view that data
  • How to look at summary statistics for individual variables

Having trouble remembering what exactly an R Markdown is? Want some more resources for learning R?

  • Review what an R Markdown is here.
  • Explore further resources for learning R here.

1.2 Read in our data

We’ll need this later:

library(tidyverse)
library(magrittr) # for %<>% pipe
write_csv(mtcars, "mtcars.csv")
d <- read_csv("mtcars.csv")
d1 <- read_csv("carid.csv")
d2 <- read_csv("cqi.csv")
d <- cbind(d, d1, d2)
d$wt <- d$wt*1000
d %<>% select(carID, mpg, everything())

2 Part I: The Universe of Packages

One of the greatest advantages of using R is that it is open source. This means that practically anyone can create content for R, and as they do, new functionality becomes available in R, often sooner than in other statistical programs such as SPSS or SAS. This community-based content comes in the form of packages. A package is a collection of related functions developed by people in the R community. Sometimes, a “package” can be a collection of other packages that provide useful functions.

If you are interested in learning more about packages and are looking for recommendations of a few useful ones, check out this website. It covers what we have talked about here and more!

2.1 How do I download and use packages?

2.1.1 Loading packages

You only need to install a package once, but you need to load it with the library() function every time you start R. The distinction is that install.packages() downloads the files from CRAN and places them in a location where R can find them, while library() makes the functions defined in the package available for you to call.

To use install.packages(), you must put the name of the package in quotation marks (e.g. install.packages("ggplot2")). library(package) does not need quotation marks; it loads the package from where it is stored on your computer so that you can use the functions within it. The code below also shows some useful functions for checking out the packages you have in your library and learning more about them.

# install.packages("tidyverse")
# install.packages("ggplot2")
# install.packages("Hmisc")
# install.packages("psych")
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
?psych # gives some help for the package; description, etc.
## starting httpd help server ...
##  done
packageDescription("psych") # Shows a description with the information of who created the package, where it is stored in your computer, etc.
## Package: psych
## Version: 2.0.8
## Date: 2020-8-30
## Title: Procedures for Psychological, Psychometric, and Personality
##         Research
## Authors@R: person("William", "Revelle", role =c("aut","cre"),
##         email="revelle@northwestern.edu", comment=c(ORCID =
##         "0000-0003-4880-9610") )
## Description: A general purpose toolbox for personality, psychometric
##         theory and experimental psychology.  Functions are primarily
##         for multivariate analysis and scale construction using factor
##         analysis, principal component analysis, cluster analysis and
##         reliability analysis, although others provide basic descriptive
##         statistics. Item Response Theory is done using factor analysis
##         of tetrachoric and polychoric correlations. Functions for
##         analyzing data at multiple levels include within and between
##         group statistics, including correlations and factor analysis.
##         Functions for simulating and testing particular item and test
##         structures are included. Several functions serve as a useful
##         front end for structural equation modeling.  Graphical displays
##         of path diagrams, factor analysis and structural equation
##         models are created using basic graphics. Some of the functions
##         are written to support a book on psychometric theory as well as
##         publications in personality research. For more information, see
##         the <https://personality-project.org/r/> web page.
## License: GPL (>= 2)
## Imports: mnormt,parallel,stats,graphics,grDevices,methods,lattice,nlme
## Suggests: psychTools, GPArotation, lavaan, lme4, Rcsdp, graph,
##         Rgraphviz
## LazyData: yes
## ByteCompile: TRUE
## URL: https://personality-project.org/r/psych/
##         https://personality-project.org/r/psych-manual.pdf
## NeedsCompilation: no
## Packaged: 2020-09-04 18:48:06 UTC; WR
## Author: William Revelle [aut, cre]
##         (<https://orcid.org/0000-0003-4880-9610>)
## Maintainer: William Revelle <revelle@northwestern.edu>
## Repository: CRAN
## Date/Publication: 2020-09-04 19:50:02 UTC
## Built: R 4.0.2; ; 2020-09-13 18:17:25 UTC; windows
## 
## -- File: C:/Users/cbwin/Documents/R/win-library/4.0/psych/Meta/package.rds
help(package = "psych") # among other things, lists each of the functions in the package. 

library() # Shows a list of all packages you have in libraries with short description

browseVignettes(package = "psych") # shows you how to use some of the common functions within the package

vignette("dplyr") # calls helpful vignettes for the dplyr package
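
The install-once, load-every-session pattern described above can be sketched as follows. This is only an illustration: the list of packages in needed is an assumption, and requireNamespace() is used here simply to check whether a package is already installed before deciding whether to call install.packages().

```r
# Packages this script needs (an assumed list, for illustration only)
needed <- c("dplyr", "magrittr")

# requireNamespace() returns TRUE if a package is installed, without attaching it
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Install only what is missing (a one-time step per computer)
if (any(!installed)) {
  install.packages(needed[!installed])
}

# Attach the packages (needed again in every new R session)
library(dplyr)
library(magrittr)
```

Whether you automate the check like this or keep a commented-out install.packages() line is a matter of taste; both make it clear which packages the script depends on.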

A word of caution: Sometimes, the order in which you load packages from their library matters. Consider the following code:

library(psych)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
## 
##     describe
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
describe(d$mpg)
## d$mpg 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       32        0       25    0.999    20.09    6.796    12.00    14.34 
##      .25      .50      .75      .90      .95 
##    15.43    19.20    22.80    30.09    31.30 
## 
## lowest : 10.4 13.3 14.3 14.7 15.0, highest: 26.0 27.3 30.4 32.4 33.9
detach(package:Hmisc) # unloading packages from your R session
detach(package:psych)
library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
## 
##     describe
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describe(d$mpg)
##    vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
## X1    1 32 20.09 6.03   19.2    19.7 5.41 10.4 33.9  23.5 0.61    -0.37 1.07

What did you notice? Even though we used the exact same command to describe the variable mpg, we get different output! Herein lies one problem with packages: different packages can use the same name for their functions, especially when it’s something generic, like describe(). When we load packages that contain functions with the same name, R cannot know which one we want to use, so it uses the function from the most recently loaded package (the earlier one is “masked”, as the messages above report): in the first chunk R used the function from the Hmisc package, and in the second chunk it used the function from the psych package.

If we want to use both functions, we can force which package R pulls the function from using two colons: package::function:

Hmisc::describe(d$mpg)
## d$mpg 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       32        0       25    0.999    20.09    6.796    12.00    14.34 
##      .25      .50      .75      .90      .95 
##    15.43    19.20    22.80    30.09    31.30 
## 
## lowest : 10.4 13.3 14.3 14.7 15.0, highest: 26.0 27.3 30.4 32.4 33.9
psych::describe(d$mpg)
##    vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
## X1    1 32 20.09 6.03   19.2    19.7 5.41 10.4 33.9  23.5 0.61    -0.37 1.07
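
The same masking behaviour can be reproduced without installing anything. The sketch below is a toy illustration, not part of the original example: it defines a hypothetical function of our own called median(), which hides the one in the stats package, and then uses package::function to reach the original.

```r
# A hypothetical user-defined function that shares its name with stats::median()
median <- function(x) "my own median()"

median(c(1, 5, 9))         # our version is found first and hides the stats one
stats::median(c(1, 5, 9))  # package::function always reaches the stats version

# Remove our version so that median() refers to stats::median() again
rm(median)
median(c(1, 5, 9))
```

If you are unsure which names are currently masked in your session, base R's conflicts(detail = TRUE) lists them.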

2.2 Good practices with packages

When dealing with packages, it is important that we do so carefully and intentionally so that they operate correctly within our R script. Here are some good practices for dealing with packages:

  • Because you only install packages once, comment out the install.packages() code once you have installed them on your computer. However, it’s a good idea to leave it in your code; it makes the code more easily understandable and reproducible by others (and yourself) later.
  • Keep your commented out install.packages() code, as well as your library() code, in a chunk at the beginning of your Markdown. This reduces any problems with trying to run code without first loading packages and keeps them in one central location, which you can easily reference. This also helps you understand if you have created an issue by loading packages like Hmisc and psych in the wrong order.
  • Next to each library() call, write a comment for why you have that library loaded.
  • When you come across problems with installing or loading packages, make sure to check out the console for messages from R on what’s wrong; these can often be useful.

2.3 What are some useful packages?

# install.packages(c("tidyverse", "Hmisc", "dplyr"))

3 Part II: Tidy data manipulation using dplyr

One of the most useful packages you can use is a universe of packages called tidyverse (that is, a “universe of ‘tidy’ packages”). You can see a list of the packages included in tidyverse and learn more about them here.

dplyr, one package within tidyverse, is especially useful for data manipulation. But what exactly is data manipulation? Data manipulation is the process of preparing your data for analysis. Rarely (if ever) do we deal with data that is “clean”, that is, perfectly organized and ready for analysis. So, you have to clean up your data until it is in good enough condition to analyze easily and clearly. This involves tasks like deleting unnecessary variables, recoding variables, and creating new variables. We can also manipulate our data to present it in a new way, say, by looking at means of an outcome across gender instead of the overall sample mean.

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. Here are some of the main features of the package:

  • select() selects variables (columns), and can rename them as it selects (only the selected columns are kept).
  • filter() selects rows that fit one or more logical expressions.
  • rename() renames columns and keeps all the other columns unchanged (note the similarity to select()).
  • mutate() adds new variables that are functions of existing variables.
  • summarise() reduces multiple values down to a single summary.
  • arrange() changes the ordering of the rows.

These all combine naturally with group_by() which allows you to perform any operation by group. You can learn more about them in vignette("dplyr").

As well as these single-table verbs, dplyr also provides a variety of two-table verbs, which you can learn about in vignette("two-table").

3.1 Select certain variables, i.e. columns (select)

select allows you to subset your dataset by column (i.e., by variable). If we wanted our dataset to only include information on the miles per gallon and number of cylinders, we would use select in the following way:

#select(dataset, variable 1, variable 2)
d.new <- select(d, mpg, cyl)
head(d.new)
##    mpg cyl
## 1 21.0   6
## 2 21.0   6
## 3 22.8   4
## 4 21.4   6
## 5 18.7   8
## 6 18.1   6

Try it for yourself: create a dataset that consists only of the number of gears and the weight of each car.

3.2 Filter to certain observations, i.e. rows (filter)

filter allows you to subset your dataset by row (i.e., by observation). If we wanted our dataset to only include cars that have an automatic transmission (am == 0), we would use filter in the following way:

#filter(dataset, variable == condition)
d.auto <- filter(d, am == 0)
head(d.auto)
##      carID  mpg cyl  disp  hp drat   wt  qsec vs am gear carb qual_eng
## 1 89829393 21.4   6 258.0 110 3.08 3215 19.44  1  0    3    1        4
## 2 96393949 18.7   8 360.0 175 3.15 3440 17.02  0  0    3    2        5
## 3 38278732 18.1   6 225.0 105 2.76 3460 20.22  1  0    3    1        4
## 4 98312322 14.3   8 360.0 245 3.21 3570 15.84  0  0    3    4        6
## 5 35991694 24.4   4 146.7  62 3.69 3190 20.00  1  0    4    2        5
## 6 93784711 22.8   4 140.8  95 3.92 3150 22.90  1  0    4    2        3
##   qual_trans qual_bod car_prob
## 1          2        2        2
## 2          4        5        4
## 3          5        6        3
## 4          7        4        6
## 5          1        7        4
## 6          3        7        2

Try it for yourself: create a dataset that only includes cars that get more than 25 miles per gallon.
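
filter() also accepts more than one logical expression. The sketch below uses R's built-in mtcars data rather than d (an assumption made so that it runs without the CSV files), and the object names are made up for illustration.

```r
library(dplyr)

# Conditions separated by commas must all be TRUE (logical AND)
auto_efficient <- filter(mtcars, am == 0, mpg > 20)

# Use | when either condition is enough (logical OR)
four_or_six <- filter(mtcars, cyl == 4 | cyl == 6)

# %in% is a compact way to match several values of one variable
four_or_six_b <- filter(mtcars, cyl %in% c(4, 6))
```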

3.3 Rename your variables (rename)

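rename allows you to change the names of variables (columns) while keeping every column in the dataset. If we wanted to give wt and cyl more descriptive names, we would use rename in the following way: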

#rename(dataset, new name = old name)
rename(
  d, 
  weight = wt,
  cylinders = cyl
)

As a shortcut, you can also rename variables as you are selecting them, using select(), instead of having to do both steps separately.
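
A minimal sketch of that shortcut, again using the built-in mtcars data instead of d (an assumption made so the example is self-contained):

```r
library(dplyr)

# select() keeps only the named columns and renames them as new_name = old_name
renamed <- select(mtcars, weight = wt, cylinders = cyl)
names(renamed)

# rename() changes the same names but keeps every other column as well
renamed_all <- rename(mtcars, weight = wt, cylinders = cyl)
ncol(renamed_all)
```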

3.4 Create a new variable based on another variable (or set of variables) (mutate)

mutate allows you to create new variables using existing variables. You can change the type of a variable, combine old variables, or calculate statistics of old variables.

If we wanted to turn the gear variable into a factor, create a variable for the mean miles per gallon, and compute a variable representing the square of car weight, we would use mutate in the following way:

mutate(
  d, 
  gear = factor(gear),
  mpg_mean = mean(mpg, na.rm = TRUE),
  wt_sq = wt^2
)
##       carID  mpg cyl  disp  hp drat   wt  qsec vs am gear carb qual_eng
## 1  37747171 21.0   6 160.0 110 3.90 2620 16.46  0  1    4    4        6
## 2  31796991 21.0   6 160.0 110 3.90 2875 17.02  0  1    4    4        1
## 3  72673293 22.8   4 108.0  93 3.85 2320 18.61  1  1    4    1        2
## 4  89829393 21.4   6 258.0 110 3.08 3215 19.44  1  0    3    1        4
## 5  96393949 18.7   8 360.0 175 3.15 3440 17.02  0  0    3    2        5
## 6  38278732 18.1   6 225.0 105 2.76 3460 20.22  1  0    3    1        4
## 7  98312322 14.3   8 360.0 245 3.21 3570 15.84  0  0    3    4        6
## 8  35991694 24.4   4 146.7  62 3.69 3190 20.00  1  0    4    2        5
## 9  93784711 22.8   4 140.8  95 3.92 3150 22.90  1  0    4    2        3
## 10 45281321 19.2   6 167.6 123 3.92 3440 18.30  1  0    4    4        3
## 11 91316755 17.8   6 167.6 123 3.92 3440 18.90  1  0    4    4        1
## 12 28844296 16.4   8 275.8 180 3.07 4070 17.40  0  0    3    3        1
## 13 63232163 17.3   8 275.8 180 3.07 3730 17.60  0  0    3    3        7
## 14 38421526 15.2   8 275.8 180 3.07 3780 18.00  0  0    3    3        3
## 15 24466383 10.4   8 472.0 205 2.93 5250 17.98  0  0    3    4        5
## 16 97133637 10.4   8 460.0 215 3.00 5424 17.82  0  0    3    4        5
## 17 36825326 14.7   8 440.0 230 3.23 5345 17.42  0  0    3    4        5
## 18 33873446 32.4   4  78.7  66 4.08 2200 19.47  1  1    4    1        6
## 19 79797615 30.4   4  75.7  52 4.93 1615 18.52  1  1    4    2        7
## 20 34178155 33.9   4  71.1  65 4.22 1835 19.90  1  1    4    1        3
## 21 48249435 21.5   4 120.1  97 3.70 2465 20.01  1  0    3    1        1
## 22 37845749 15.5   8 318.0 150 2.76 3520 16.87  0  0    3    2        5
## 23 26692995 15.2   8 304.0 150 3.15 3435 17.30  0  0    3    2        6
## 24 62229968 13.3   8 350.0 245 3.73 3840 15.41  0  0    3    4        4
## 25 34936485 19.2   8 400.0 175 3.08 3845 17.05  0  0    3    2        2
## 26 44187828 27.3   4  79.0  66 4.08 1935 18.90  1  1    4    1        2
## 27 96766418 26.0   4 120.3  91 4.43 2140 16.70  0  1    5    2        1
## 28 66566612 30.4   4  95.1 113 3.77 1513 16.90  1  1    5    2        6
## 29 36635492 15.8   8 351.0 264 4.22 3170 14.50  0  1    5    4        2
## 30 52695278 19.7   6 145.0 175 3.62 2770 15.50  0  1    5    6        7
## 31 67465929 15.0   8 301.0 335 3.54 3570 14.60  0  1    5    8        6
## 32 72536977 21.4   4 121.0 109 4.11 2780 18.60  1  1    4    2        4
##    qual_trans qual_bod car_prob mpg_mean    wt_sq
## 1           2        1        2 20.09062  6864400
## 2           4        1        6 20.09062  8265625
## 3           2        1        6 20.09062  5382400
## 4           2        2        2 20.09062 10336225
## 5           4        5        4 20.09062 11833600
## 6           5        6        3 20.09062 11971600
## 7           7        4        6 20.09062 12744900
## 8           1        7        4 20.09062 10176100
## 9           3        7        2 20.09062  9922500
## 10          7        2        2 20.09062 11833600
## 11          6        6        6 20.09062 11833600
## 12          5        4        6 20.09062 16564900
## 13          5        1        7 20.09062 13912900
## 14          3        2        3 20.09062 14288400
## 15          2        7        3 20.09062 27562500
## 16          5        2        6 20.09062 29419776
## 17          4        2        5 20.09062 28569025
## 18          3        3        4 20.09062  4840000
## 19          7        6        3 20.09062  2608225
## 20          6        4        2 20.09062  3367225
## 21          1        2        6 20.09062  6076225
## 22          3        7        3 20.09062 12390400
## 23          1        5        5 20.09062 11799225
## 24          7        2        7 20.09062 14745600
## 25          2        7        4 20.09062 14784025
## 26          5        6        5 20.09062  3744225
## 27          5        4        7 20.09062  4579600
## 28          6        7        5 20.09062  2289169
## 29          5        4        1 20.09062 10048900
## 30          6        4        2 20.09062  7672900
## 31          6        5        3 20.09062 12744900
## 32          5        6        6 20.09062  7728400

Try it for yourself: create a new variable that tells us the amount of horsepower (hp) each car has per cylinder (cyl). Name this variable hp_cyl

3.5 Get summary statistics of a variable (summarise)

Summarise allows us to calculate summary statistics for specific variables in our dataset. For instance, if we wanted to know what the mean miles per gallon of all the cars in our dataset was, we would use summarise in the following way:

#summarise(dataset, summary_variable = function(old_variable))
summarise(d, mpg_mean = mean(mpg, na.rm = TRUE))
##   mpg_mean
## 1 20.09062

FYI, na.rm = TRUE tells the mean function to remove all missing values (NAs) before doing the calculation (‘na.rm’ is short for “NA remove”).
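
To see why this matters, here is a small sketch with a made-up vector that contains one missing value:

```r
scores <- c(4, NA, 6)

mean(scores)                # NA: one missing value makes the whole mean unknown
mean(scores, na.rm = TRUE)  # 5: the NA is dropped and the mean of 4 and 6 is taken
```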

Try it for yourself: calculate the median miles per gallon

3.6 Doing a series of transformations - using a ‘pipe’ (%>%)

Sometimes we want to use multiple functions one after another to do several transformations of the data. For example, we might want to select certain variables and then filter to certain observations and save this new subset of data to a new dataset. Instead of typing it all out several times, we can use a convenient trick called a ‘pipe’, which functions as an ‘and then’ statement in your code.

Pipes can be a really useful tool to use throughout your coding in R, and there are multiple types of pipes (we’ll learn more about other types in a minute).

Below, we’re going to create a new dataset d2 using the old dataset d, and then select mpg, disp, and gear, and then filter to only include cars with 4 gears.

d2 <- 
  d %>% 
  select(mpg, disp, gear) %>% 
  filter(gear == 4)
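
A pipe is just another way of writing nested function calls: x %>% f(y) means f(x, y). The sketch below, which uses the built-in mtcars data rather than d (an assumption so that it runs on its own), writes the same steps both ways and checks that the results match.

```r
library(dplyr)

# Nested calls: read from the inside out
nested <- filter(select(mtcars, mpg, disp, gear), gear == 4)

# Piped calls: read from top to bottom
piped <- mtcars %>%
  select(mpg, disp, gear) %>%
  filter(gear == 4)

identical(nested, piped)
```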

Try it for yourself: create a new dataset d3 using d and then select only cyl, hp, and wt, and then filter to cars with 8 cylinders, and then calculate the mean horsepower for these cars.

3.6.1 Another pipe: Compound assignment (%<>%)

Run each of the lines below separately and see what happens. After running each line, check out d in your global environment (top right part of the screen). What’s different?

d %>% mutate(
  lwt = log(wt)
)

d %<>% mutate(
  lwt = log(wt)
)

That’s right, the %<>% back-assigns the output of the operations to overwrite d. That is, whereas a normal pipe only works one way (using d, it computes a new variable and shows us the output), this compound assignment pipe computes a new variable and then assigns the result back to d. Think of the extra < in the operator as saying, “And then assign back to the object on the left side of the operator”. So, any time you want to create a new variable and save it in the data frame, this operator is a convenient shortcut.
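
The difference is easiest to see on a small object. This sketch uses a made-up numeric vector x instead of a data frame:

```r
library(magrittr)

x <- c(1, 4, 9)

x %>% sqrt()   # prints the square roots; x itself is unchanged
x              # still 1 4 9

x %<>% sqrt()  # shorthand for x <- x %>% sqrt()
x              # now 1 2 3
```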

3.6.2 And one more: The money pipe (%$%)

How might we rewrite the following code using a pipe?

with(d, plot(mpg ~ wt))

Though we might intuit that we could say, d %>% plot(mpg ~ wt), that actually wouldn’t work. We would have to write it as follows:

d %$% plot(mpg ~ wt)

Just like the $ that we use when calling a variable from a dataset (e.g. d$mpg), this pipe helps call the data and variables for certain functions.

The %$% pipe exposes the variables in a data set by name to functions that don’t take the data set as their first argument. When plot() is given a formula, the data is not its first argument, so you would normally have to use the with() function to call the data, with(d, plot(mpg ~ wt)), in order to avoid using d$ before each variable. If we use the normal pipe operator (%>%) to call the data (d %>% plot(mpg ~ wt)), d is passed in as the first argument and plot() still isn’t able to recognize the variables. Instead, we can use %$%, which allows us to refer to the variables of the data set directly, even in functions without a data argument.
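
The same idea works for any function that expects plain vectors. As a sketch (using the built-in mtcars data as a stand-in for d), cor() takes two vectors and has no data argument, so these three lines are equivalent ways of computing the correlation between mpg and wt:

```r
library(magrittr)

cor(mtcars$mpg, mtcars$wt)   # spelling out the data frame each time
with(mtcars, cor(mpg, wt))   # with() exposes the columns by name
mtcars %$% cor(mpg, wt)      # %$% does the same in pipe form
```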

Learn more about pipes:

3.7 Summarize by group (group_by)

If you want to calculate a statistic by group in your data, you can use group_by, a pipe, and summarise. So, for example, if we wanted to calculate the mean miles per gallon for cars with automatic vs. manual transmissions, we would use group_by in the following way:

d %>% 
  group_by(am) %>% 
  summarise(mean_mpg = mean(mpg, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##      am mean_mpg
##   <dbl>    <dbl>
## 1     0     17.1
## 2     1     24.4

Try it for yourself: calculate the average weight of cars based on the number of cylinders that they have.
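
summarise() can compute several statistics per group in one call. The sketch below uses the built-in mtcars data in place of d (an assumption), and sets the .groups argument mentioned in the message above so that the result comes back ungrouped; it assumes dplyr 1.0.0 or later, where that argument exists.

```r
library(dplyr)

by_am <- mtcars %>%
  group_by(am) %>%
  summarise(
    n        = n(),                       # number of cars in each group
    mean_mpg = mean(mpg, na.rm = TRUE),   # group mean
    sd_mpg   = sd(mpg, na.rm = TRUE),     # group standard deviation
    .groups  = "drop"                     # return an ungrouped result
  )
by_am
```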

3.8 Bringing it all together

3.8.1 How can we use these tools to build effective code?

With these tools, we can now build many combinations of functions to manipulate our data how we want. Consider one example below. We are interested in examining the means of mpg and wt by number of cylinders (cyl) and transmission type (am), but only for those groups of cars whose mean mpg is above 20 (i.e. moderately well performing cars). Here is one way we could write this code, utilizing the functions group_by(), select(), summarise(), and filter().

grouped_cars <- group_by(d, cyl, am)
cars_data <- select(grouped_cars, cyl, am, wt, mpg)
summarized_mpg <- summarise(cars_data, 
                wt.mean = mean(wt, na.rm = TRUE), 
                mpg.mean = mean(mpg, na.rm = TRUE))
## `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
final_result <- filter(summarized_mpg, mpg.mean > 20)

final_result
## # A tibble: 3 x 4
## # Groups:   cyl [2]
##     cyl    am wt.mean mpg.mean
##   <dbl> <dbl>   <dbl>    <dbl>
## 1     4     0   2935      22.9
## 2     4     1   2042.     28.1
## 3     6     1   2755      20.6

Now consider the code below; though it looks a bit different, it does the exact same thing as the chunk above! This new code, however, utilizes pipes %>% to make the code more succinct, more organized, and more intuitive. Instead of creating many new objects, it simply says, “Using data set d, group variables by cyl and am and then–using only cyl, am, wt, and mpg– summarize the means of each group, and then only keep rows where the mean of mpg is above 20.” Each line of code builds upon the last, allowing us to get the same result as above with less code.

d %>% 
    group_by(cyl, am) %>% 
    select(cyl, am, wt, mpg) %>% 
    summarise(wt.mean = mean(wt, na.rm = TRUE), mpg.mean = mean(mpg, na.rm = TRUE)) %>% 
    filter(mpg.mean > 20)
## `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
## # A tibble: 3 x 4
## # Groups:   cyl [2]
##     cyl    am wt.mean mpg.mean
##   <dbl> <dbl>   <dbl>    <dbl>
## 1     4     0   2935      22.9
## 2     4     1   2042.     28.1
## 3     6     1   2755      20.6

3.8.2 Solving our most common data manipulation challenges using the grammar of dplyr

Now that we have these tools from dplyr and have learned the basic ‘grammar’ of data manipulation, we can solve some of our common data manipulation challenges:

1. Recoding variables

Our data set includes 4 items of car quality as rated by expert mechanics. Each car gets a score from 1 (Very poor) to 7 (Excellent) for engine quality (qual_eng), transmission quality (qual_trans), and body quality (qual_bod). Additionally, mechanics were asked, “How frequently do you estimate this car is apt to have problems?” on a scale from 1 (not at all frequently) to 7 (extremely frequently) (car_prob). Together, these 4 items create the Car Quality Index (CQI).

In order to combine these items into an index, we must first change the values for the variable car_prob: whereas the other variables indicate higher car quality with higher values, car_prob indicates lower car quality with higher values, and thus must be recoded.

You can reverse-score Likert-type items by subtracting each value from one more than the maximum of the scale. Thus, a score that was 7, subtracted from 8, now becomes 1; 6 becomes 2; 5 becomes 3; and so on.

d %<>% mutate(
    car_prob_r = 8 - car_prob
  )
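
The "one more than the maximum" rule is a special case of subtracting each score from (lowest value + highest value), which also works for scales that do not start at 1. A small sketch, where reverse_score() is a hypothetical helper written for this illustration and not a function from any package:

```r
# Hypothetical helper: reverse-score an item on a scale running from low to high
reverse_score <- function(x, low = 1, high = 7) {
  (low + high) - x
}

reverse_score(1:7)  # 7 6 5 4 3 2 1
```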

2. Creating composite variables

Now we can combine our 4 items into an index (i.e. a composite variable that is the mean of ratings for each car). To make this easiest, a friend of mine made a handy function for computing composite variables. Here, we will call it gen_comp() (i.e., ‘generate composite’). There are two steps to creating a composite variable:

  1. Create a vector (i.e. a grouped list) of each of the items that will go into the composite. You can use this vector as an object to compute descriptives, build tables, etc. with all of the variables in the vector at once. It’s also handy for creating composite variables.

  2. Input the new variable name and the name of the vector it is pulling from into the function gen_comp().

## The code for our new function (no need--at this point--to understand this code completely)
gen_comp <- function(data, comp, vector){
   comp <- enquo(comp)
   data %>% 
       rowwise() %>% 
       mutate(!!quo_name(comp) := mean(c(!!!vector), na.rm = TRUE)) %>% 
       ungroup()
}

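## Step 1
## Note: this vector uses the original car_prob item, and the output shown below was produced that way;
## substitute car_prob_r to use the reverse-scored item in the composite
vector_cqi <- quos(qual_eng, qual_trans, qual_bod, car_prob)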

## Step 2
d %<>% gen_comp(comp = cqi, vector = vector_cqi)
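
If the tidy-evaluation code inside gen_comp() feels opaque, the underlying idea (a row-wise mean across the item columns) can also be sketched in base R with rowMeans(). This is an alternative illustration, not the function used in this tutorial, and the ratings below are made-up values for two cars.

```r
# Toy ratings for two cars (made-up values, for illustration only)
items <- data.frame(
  qual_eng   = c(6, 1),
  qual_trans = c(2, 4),
  qual_bod   = c(1, NA)
)

# Row-wise mean across the item columns, ignoring missing ratings
item_names <- c("qual_eng", "qual_trans", "qual_bod")
items$composite <- rowMeans(items[, item_names], na.rm = TRUE)

items$composite  # 3.0 2.5
```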

3. Centering

Another common manipulation we may want to do is mean centering our data. Mean centering means that the mean becomes 0 and all other scores are presented in terms of their relative distance from the mean (negative for below the mean and positive for above the mean).

Here, I’ve created a really simple function (var.center) that takes advantage of the function scale() for mean centering our data. Normally, scale() will standardize the data (mean zero with each score representing how many standard deviations it is away from the mean). However, we can tell it to just subtract the mean from scores by indicating scale = FALSE. To make it easier, we have just made a function that does this, but only requires you to input the variables you want to operate on. For other ways to center and for the original code of this custom function, check out this website.

Alternatively, you can center old school: compute means for each variable and subtract that mean from the variable.

## Code for the new function
var.center <- function(x) {
    scale(x, scale = FALSE)
}

## example of new function
d %<>% mutate(
  wt_c2 = var.center(wt),
  cqi_c2 = var.center(cqi)
)

## doing it the old school way
d %<>% mutate(
  mean_wt = mean(wt),
  wt_c = wt - mean_wt,
  mean_cqi = mean(cqi),
  cqi_c = cqi - mean_cqi
)

d %>% select(wt_c, wt_c2, cqi_c, cqi_c2) #Notice that both methods give you the same results
## # A tibble: 32 x 4
##       wt_c wt_c2[,1]  cqi_c cqi_c2[,1]
##      <dbl>     <dbl>  <dbl>      <dbl>
##  1 -597.     -597.   -1.40      -1.40 
##  2 -342.     -342.   -1.15      -1.15 
##  3 -897.     -897.   -1.40      -1.40 
##  4   -2.25     -2.25 -1.65      -1.65 
##  5  223.      223.    0.352      0.352
##  6  243.      243.    0.352      0.352
##  7  353.      353.    1.60       1.60 
##  8  -27.2     -27.2   0.102      0.102
##  9  -67.2     -67.2  -0.398     -0.398
## 10  223.      223.   -0.648     -0.648
## # ... with 22 more rows
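
The [,1] in the column headers above appears because scale() returns a one-column matrix rather than a plain vector. A minimal sketch with a made-up vector shows this, along with one way (as.numeric()) to get an ordinary vector back if you prefer:

```r
x <- c(2, 4, 6)

centered <- scale(x, scale = FALSE)  # subtracts the mean (4) from each value
is.matrix(centered)                  # TRUE: scale() returns a one-column matrix

as.numeric(centered)                 # -2 0 2, as a plain numeric vector
x - mean(x)                          # the "old school" result, for comparison
```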

And many other useful transformations and manipulations:

# rescaling, computing the log
d %<>% mutate(
  wt_s = wt/1000, # rescaling weight by dividing by 1000
  lmpg = log(mpg) # creating the log of mpg
)

Here are some other useful things you can now do:

library(knitr) # for kable() function (creates nice tables)

# Note: wt is in pounds here (it was multiplied by 1000 above), so wt > 3.5 keeps every car
d %>% filter(wt > 3.5) %>%
           group_by(cyl, am) %>% 
           summarise(mn = mean(mpg))
## `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
## # A tibble: 6 x 3
## # Groups:   cyl [3]
##     cyl    am    mn
##   <dbl> <dbl> <dbl>
## 1     4     0  22.9
## 2     4     1  28.1
## 3     6     0  19.1
## 4     6     1  20.6
## 5     8     0  15.0
## 6     8     1  15.4
d %>% select(starts_with("qual")) %>% summary
##     qual_eng   qual_trans       qual_bod    
##  Min.   :1   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2   1st Qu.:2.750   1st Qu.:2.000  
##  Median :4   Median :5.000   Median :4.000  
##  Mean   :4   Mean   :4.219   Mean   :4.125  
##  3rd Qu.:6   3rd Qu.:6.000   3rd Qu.:6.000  
##  Max.   :7   Max.   :7.000   Max.   :7.000
d %>% select(starts_with("qual")) %>% describe %>% round(., digits = 2) %>% kable
            vars   n  mean    sd  median  trimmed   mad  min  max  range   skew  kurtosis    se
qual_eng       1  32  4.00  1.98       4     4.00  2.97    1    7      6  -0.14     -1.35  0.35
qual_trans     2  32  4.22  1.91       5     4.27  2.22    1    7      6  -0.20     -1.26  0.34
qual_bod       3  32  4.12  2.14       4     4.15  2.97    1    7      6  -0.04     -1.50  0.38
d %>% select(!!!vector_cqi) %>% describe %>% round(., digits = 2) %>% kable
            vars   n  mean    sd  median  trimmed   mad  min  max  range   skew  kurtosis    se
qual_eng       1  32  4.00  1.98       4     4.00  2.97    1    7      6  -0.14     -1.35  0.35
qual_trans     2  32  4.22  1.91       5     4.27  2.22    1    7      6  -0.20     -1.26  0.34
qual_bod       3  32  4.12  2.14       4     4.15  2.97    1    7      6  -0.04     -1.50  0.38
car_prob       4  32  4.25  1.80       4     4.23  2.97    1    7      6  -0.04     -1.41  0.32
d %>% 
  rowwise() %>% #rowwise() used to do an operation by rows rather than columns--row means, etc.
  mutate(mymean=mean(c(cyl,mpg))) %>% 
  select(cyl, mpg, mymean)
## # A tibble: 32 x 3
## # Rowwise: 
##      cyl   mpg mymean
##    <dbl> <dbl>  <dbl>
##  1     6  21     13.5
##  2     6  21     13.5
##  3     4  22.8   13.4
##  4     6  21.4   13.7
##  5     8  18.7   13.4
##  6     6  18.1   12.0
##  7     8  14.3   11.2
##  8     4  24.4   14.2
##  9     4  22.8   13.4
## 10     6  19.2   12.6
## # ... with 22 more rows
##Other examples of what we use mutate for in psychology...creating new variables, 

d %>% select(!!!vector_cqi) %>% as.matrix() %>% rcorr()
##            qual_eng qual_trans qual_bod car_prob
## qual_eng       1.00       0.11     0.08    -0.21
## qual_trans     0.11       1.00     0.01     0.10
## qual_bod       0.08       0.01     1.00    -0.21
## car_prob      -0.21       0.10    -0.21     1.00
## 
## n= 32 
## 
## 
## P
##            qual_eng qual_trans qual_bod car_prob
## qual_eng            0.5472     0.6487   0.2528  
## qual_trans 0.5472              0.9615   0.6004  
## qual_bod   0.6487   0.9615              0.2483  
## car_prob   0.2528   0.6004     0.2483

Though dplyr manages to solve most of the data manipulation challenges we will come across with just a handful of functions, those functions may still be confusing or difficult to remember at first. If only there were a way to remember all of them... Wait, there’s a [cheatsheet] for dplyr, too?? Wow, RStudio, you have really outdone yourself this time.

3.8.3 Changing variable type

Remember the four variable types we talked about last time? We often need to change a variable from one type to another: each type behaves differently in analyses, so we have to make sure we have the right variable type for what we are trying to do.

For example, we might want to treat some variables as qualitative, nominal factors rather than continuous, numeric variables. In R, we must specify which variables to treat as factors if the levels (i.e., unique values) of the variable are composed of numbers instead of strings. How a column is interpreted on import depends on its contents and on the import function. If the values of a variable (e.g., “ID”) start with a letter (e.g., “subject1”, “subject2”), read_csv() will read the variable as character (base R’s read.csv() converted such columns to factors automatically before R 4.0). If the values are numbers (e.g., “1”, “2”), R will automatically interpret the variable as numeric. If you want the variable interpreted differently, you have to tell R.

For instance, the variable mpg is continuous, but am is not. However, since the levels of am are indicated with numbers, we must tell R to treat am as a factor:

## the function factor() converts to a factor and the option labels specifies names to assign to the levels
d %<>% 
  mutate(am = factor(am, labels = c("Auto", "Manual")))

#Other alternative
#d$am = factor(d$am, labels = c("Auto", "Manual")) 
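Going in the other direction, from a factor back to numbers, has a well-known pitfall that is worth seeing once. The sketch below uses a small made-up vector (the values are hypothetical, not from our data set):

```r
# Toy vector of numeric codes that ended up stored as a factor
codes <- factor(c("10", "20", "10", "30"))

# as.numeric() on a factor returns the internal level positions...
level_positions <- as.numeric(codes)

# ...so convert to character first to recover the original values
original_values <- as.numeric(as.character(codes))

level_positions   # 1 2 1 3
original_values   # 10 20 10 30
```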

Now we can look at the structure of the d data frame again, to make sure am is now a factor:

head(str(d)) # str() prints the full structure itself and returns NULL, so head() has no effect here; str(d) alone gives the same listing
## tibble [32 x 26] (S3: tbl_df/tbl/data.frame)
##  $ carID     : num [1:32] 37747171 31796991 72673293 89829393 96393949 ...
##  $ mpg       : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl       : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
##  $ disp      : num [1:32] 160 160 108 258 360 ...
##  $ hp        : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
##  $ drat      : num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt        : num [1:32] 2620 2875 2320 3215 3440 ...
##  $ qsec      : num [1:32] 16.5 17 18.6 19.4 17 ...
##  $ vs        : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
##  $ am        : Factor w/ 2 levels "Auto","Manual": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear      : num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
##  $ carb      : num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
##  $ qual_eng  : num [1:32] 6 1 2 4 5 4 6 5 3 3 ...
##  $ qual_trans: num [1:32] 2 4 2 2 4 5 7 1 3 7 ...
##  $ qual_bod  : num [1:32] 1 1 1 2 5 6 4 7 7 2 ...
##  $ car_prob  : num [1:32] 2 6 6 2 4 3 6 4 2 2 ...
##  $ car_prob_r: num [1:32] 6 2 2 6 4 5 2 4 6 6 ...
##  $ cqi       : num [1:32] 2.75 3 2.75 2.5 4.5 4.5 5.75 4.25 3.75 3.5 ...
##  $ wt_c2     : num [1:32, 1] -597.25 -342.25 -897.25 -2.25 222.75 ...
##   ..- attr(*, "scaled:center")= num 3217
##  $ cqi_c2    : num [1:32, 1] -1.398 -1.148 -1.398 -1.648 0.352 ...
##   ..- attr(*, "scaled:center")= num 4.15
##  $ mean_wt   : num [1:32] 3217 3217 3217 3217 3217 ...
##  $ wt_c      : num [1:32] -597.25 -342.25 -897.25 -2.25 222.75 ...
##  $ mean_cqi  : num [1:32] 4.15 4.15 4.15 4.15 4.15 ...
##  $ cqi_c     : num [1:32] -1.398 -1.148 -1.398 -1.648 0.352 ...
##  $ wt_s      : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
##  $ lmpg      : num [1:32] 3.04 3.04 3.13 3.06 2.93 ...
## NULL

3.8.4 Creating factors from continuous variables

Sometimes, we may want to break up a continuous variable into intervals (e.g., for age: 18-24, 25-30, 31+). In our dataset, for simplicity, we might want to look at gas mileage as an ordered factor, not a quantitative variable. By making mpg into a factor, we are able to group cars into categories based on their respective mpg. So let us create a new factor, mpg_cat, which can be ‘Low’, ‘Medium’, or ‘High’. Given the mpg variable, we can create a new categorical variable (i.e., factor) by specifying breaks at specific intervals (see below).

Here, we use the dplyr function case_when(), which says, “when this is true,” (the left side of the ~), “make this happen” (the right side of the ~). So, for the first line, “when mpg is less than 17, give the new variable, mpg_cat the value ‘Low’.” Then we use the function ordered() to specify that low, medium, and high go in a specific order (as opposed to levels like red, blue, and yellow, which have no inherent order).

d %<>% 
  mutate(
    mpg_cat = case_when(mpg < 17 ~ "Low",
                        mpg >= 17 & mpg < 24 ~ "Medium",
                        mpg >= 24 ~ "High"),
    mpg_cat = ordered(mpg_cat, levels = c("Low", "Medium", "High")))

These break points result in 3 mpg categories: below 17, 17 up to (but not including) 24, and 24 and up. We can also visualize these groups:

d %$% plot(mpg ~ mpg_cat)
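The same categories can also be built with base R's `cut()`, which takes the break points directly. This is just an alternative sketch, not a replacement for the `case_when()` approach above; it uses the built-in `mtcars` data as a stand-in for `d`:

```r
# Break mpg into three ordered categories with cut().
# right = FALSE makes each interval include its lower bound: [17, 24)
mpg_cat2 <- cut(mtcars$mpg,
                breaks = c(-Inf, 17, 24, Inf),
                labels = c("Low", "Medium", "High"),
                right = FALSE,
                ordered_result = TRUE)

table(mpg_cat2)
```

`cut()` is compact when the categories are simple ranges of one variable; `case_when()` is easier to extend when the conditions involve several variables.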

3.9 Saving your data and going home

After you have manipulated your variables, you will save a new dataset. This can then be used in data analysis. It is best to create one markdown for data manipulation, clean your data, and then save a new, totally clean and ready-to-go data file to use for your analyses in a new markdown.

Something important: Unlike SPSS and Excel, the default in R is not to save your computed variables. When we work with R, we import data into a new space (a data frame) and then work within that space. In SPSS and Excel, you’re always editing the source file and therefore all of your changes can be saved. If you want to keep your computed variables, you’ll need to write a new file. But fear not: there’s a simple command that does just that. Let’s say we want to save our newly computed variables and cleaned data set into a permanent R data file. We’d do this:

write_rds(d, 'data_clean.rds')

Notice that we saved our new data file as an .rds file; this is a file extension designating an R data file. When we want to read in the data in a new markdown, we can use the code read_rds('data_clean.rds').
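One reason to prefer .rds over .csv for cleaned data is that variable types survive the round trip. The sketch below illustrates this with a tiny made-up data frame and temporary files (the data and file paths are hypothetical, chosen only for the demonstration):

```r
library(readr)

# Tiny made-up data frame with an ordered factor
toy <- data.frame(
  mpg = c(15, 20, 30),
  mpg_cat = ordered(c("Low", "Medium", "High"),
                    levels = c("Low", "Medium", "High"))
)

# Temporary files so nothing is written into the working directory
rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

write_rds(toy, rds_path)
write_csv(toy, csv_path)

from_rds <- read_rds(rds_path)
from_csv <- read_csv(csv_path)

class(from_rds$mpg_cat)  # still an ordered factor
class(from_csv$mpg_cat)  # plain character; the level ordering is lost
```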

4 Advanced Data manipulation: Reshaping

4.1 Reshaping and labeling dataframes using tidyr

Let’s load in some data that has both between- and within-subject factors. This data file is called kv0.csv. Here, attention (attnr) is a between-subjects factor with 2 levels, attnr = ‘divided’ or ‘focused’, and there are 10 subjects (subidr) at each level. Also, each subject solved anagrams at 3 levels of difficulty, indexed by the number of possible solutions (num = 1, 2, or 3; a within-subjects variable). Each subject’s score at each level of num is recorded. This is a repeated measures design. How does score depend on attnr and num?

As you can see, this dataframe is in wide-form (sometimes called short-form), meaning that the within-subject observations are displayed in separate columns, and each subject occupies a single row.

To convert this dataframe to long-form, we can use the gather() function from the tidyr package. The ‘id’ variables are those that stay the same for each subject, and the ‘measure’ variables are the repeated measures on each subject.

The first two arguments are the name of the category variable (key), and the name of the variable that will contain the scores (value). The next arguments are the variables that you don’t want gathered into the new key and value columns, but would like to keep the same for each subject. Use a minus sign (-) before these variables to express that you want to keep them as is. Experiment with just using key and value, to see how that works.

dl <- d %>% gather(num,score,-subidr,-attnr)
head(dl)
##   subidr   attnr  num score
## 1      1 divided num1     2
## 2      2 divided num1     3
## 3      3 divided num1     3
## 4      4 divided num1     5
## 5      5 divided num1     4
## 6      6 divided num1     5
str(dl)
## 'data.frame':    60 obs. of  4 variables:
##  $ subidr: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ attnr : chr  "divided" "divided" "divided" "divided" ...
##  $ num   : chr  "num1" "num1" "num1" "num1" ...
##  $ score : num  2 3 3 5 4 5 5 5 2 6 ...

Basically, we now have a long-form dataframe with 3 rows for each subject.
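In tidyr 1.0.0 and later, gather() still works but is superseded by pivot_longer(), so you may see that function in newer code. Here is a hedged sketch of the same reshaping on a small made-up data frame (the scores are invented for illustration; kv0.csv is not reproduced here):

```r
library(tidyr)

# Toy wide-form data frame with invented scores, shaped like kv0.csv
wide <- data.frame(
  subidr = 1:3,
  attnr  = c("divided", "divided", "focused"),
  num1   = c(2, 3, 5),
  num2   = c(4, 4, 6),
  num3   = c(5, 7, 8)
)

# Stack the num1..num3 columns into a key column (num) and a value column (score)
long <- pivot_longer(wide,
                     cols = c(num1, num2, num3),
                     names_to = "num",
                     values_to = "score")

long
```

One visible difference: pivot_longer() keeps each subject's rows together, whereas gather() stacks the result one original column at a time.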

If you would like to go back to a wide data format, use the spread() function, which reverses gather().

dw <- dl %>% spread(num,score)

Now compare dw and d to see if they are equivalent.
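One way to make that comparison is all.equal(), which returns TRUE or a description of the differences it finds. The sketch below runs the full round trip on a small made-up data frame (invented scores, standing in for the real data):

```r
library(tidyr)

# Toy wide-form data frame with invented scores
wide <- data.frame(
  subidr = 1:3,
  attnr  = c("divided", "divided", "focused"),
  num1   = c(2, 3, 5),
  num2   = c(4, 4, 6),
  num3   = c(5, 7, 8)
)

long <- gather(wide, num, score, -subidr, -attnr)  # wide -> long
back <- spread(long, num, score)                   # long -> wide again

# TRUE if the round trip reproduced the original, otherwise a list of differences
all.equal(wide, back)
```

With real data, the row or column order may differ after the round trip, so it can help to sort both data frames the same way before comparing.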

5 Review: End Notes

5.1 What’s an R Markdown again?

This is the main kind of document that I use in RStudio, and it’s the primary advantage of RStudio over the base R console. R Markdown allows you to create a file with a mix of R code and regular text, which is useful if you want to have explanations of your code alongside the code itself. This document, for example, is an R Markdown document. It is also useful because you can export your R Markdown file to an html page or a pdf, which comes in handy when you want to share your code or a report of your analyses with someone who doesn’t have R. If you’re interested in learning more about the functionality of R Markdown, you can visit this webpage.

R Markdowns use chunks to run code. A chunk is designated by starting with ```{r} and ending with ```. This is where you will write your code. A new chunk can be created by pressing COMMAND + ALT + I on Mac, or CONTROL + ALT + I on PC.

You can run lines of code by highlighting them, and pressing COMMAND + ENTER on Mac, or CONTROL + ENTER on PC. If you want to run a whole chunk of code, you can press COMMAND + ALT + C on Mac, or CONTROL + ALT + C on PC. Alternatively, you can run a chunk of code by clicking the green right-facing arrow at the top-right corner of each chunk. The downward-facing arrow directly left of the green arrow will run all code up to that point.

5.2 Some useful resources to continue your learning

A useful resource, in my opinion, is the Stack Overflow website. Because this is a general-purpose resource for programming help, it will be useful to use the R tag ([R]) in your queries. A related resource is the statistics Stack Exchange, which is like Stack Overflow but focused more on the underlying statistical issues.

One of the best resources for learning how to use R well, in a “tidy” way, is R for Data Science (R4DS). This contains a good intro to using dplyr, as well as a solid general intro to R.