In this book, we will be using the statistical computing environment R. At its core, R is a programming language that specializes in statistics and data analysis. Like all modern programming languages, R is more than just a compiler or interpreter that translates human-writable formal statements into something a computer understands and can execute. R comes with a complete set of standard libraries that cover the usual basics, as well as a collection of common statistical routines, for example the t-test or functions to compute mean and variance. Even regression analysis and plotting are included with R. Finally, R ships with a rudimentary programming environment, with a console, a simple editor and a help system.
Most of what comes packaged with R we will set aside. At its very core, R is a brilliantly designed programming language for the task, but much of the standard library is an inconsistent and outdated mess. For example, R comes with a set of commands to import data from various formats (SPSS, CSV, etc.). Some of these commands produce objects called data.frames, whereas others return lists of lists. As another example, to sort a data frame by participant number, one has to write the following syntax soup:
D[order(D$Part),]
In the past few years, Hadley Wickham has almost single-handedly started an initiative known as the tidyverse. The tidyverse is a growing collection of libraries grounded in a coherent and powerful set of principles for data management. One of the core packages is dplyr, which introduces a rich, yet rather generic, set of commands for data manipulation. The above sorting of a data frame would be written as
arrange(D, Part)
One of Wickham's first and best-known contributions is the ggplot system for graphics. Think of R's legacy graphics system as a zoo of individual routines, one for boxplots, another one for scatterplots, and so forth. Like animals in a zoo, they live in different habitats with practically no interaction. ggplot implements a rather abstract framework for data plots, where all pieces can be combined in a myriad of ways, using a simple and consistent syntax.
Where to get these gems? R is an open-source system and has spawned an open ecosystem for statistical computing. Thousands of extensions for R have been made available by data scientists for data scientists. The majority of these packages are available through the Comprehensive R Archive Network (CRAN). For the common user it suffices to think of CRAN as an internet catalogue of packages that can be searched, and from which desired packages can be downloaded and installed in an instant.
Finally, R comes with a very rudimentary programming environment that carries the questionable charm of the early 1990s. While several alternatives exist, most R users will feel most comfortable with the programming environment Rstudio. It is the most user-friendly and (at the same time!) most feature-rich software for programming in R. Only some die-hard Emacs users might disagree and continue to strain their tendons using the ESS extension.
The next sections describe how you can set up a fully functional environment and verify that it works. Subsequently, we will get to know the basics of programming in R.
First, make sure that you have R and Rstudio downloaded and installed on your computer. R can be obtained from CRAN and Rstudio from the Rstudio website. Make sure to select the version for your operating system.
If you fully own the computer you are working with, meaning that you have administrator rights, just do the usual: download and run the setup. If everything is fine, you will find R and Rstudio installed under C:\Program Files\ and both appear in your computer's Start menu. You can then proceed directly to the installation of packages.
In corporate environments, two issues can arise with the installation: first, a user may not have administrator rights to install programs to the common path C:\Program Files\. Second, the home directory may reside on a network drive, which is likely to cause trouble when installing packages.
If you have no administrator rights, you must choose your home directory during the setup. If that is a local directory (e.g., C:/Users/YourName/), this should work fine and you can proceed to the installation of packages.
If your home directory (i.e. My Documents) is located on a network drive, this is likely to cause trouble. In such a case, you must install R and Rstudio to a local directory (on your computer's hard drive) where you have full read/write access. In the following, it is assumed that this directory is D:/Users/YourName/:
- Create a directory D:/Users/YourName/R/. This is where both programs, as well as packages, will reside.
- Inside it, create a subdirectory Rlibrary, where all additional packages will reside (R comes with a pack of standard packages, which live in a read-only system directory).
- In Rstudio, open File -> New File -> Text file and save the file as .Rprofile in D:/Users/YourName/R/, with the content shown below.
- In Rstudio, go to Tools -> Global options -> General -> Default working directory and select D:/Users/YourName/R/.

## .Rprofile
options(stringsAsFactors = FALSE)
.First <- function(){
  RHOME <<- getwd()
  cat("\nLoading .Rprofile in", getwd(), "\n")
  ## add the personal library to the library search path
  ## (note the "/" separating the path from the directory name)
  .libPaths(c(paste0(RHOME, "/Rlibrary"), .libPaths()))
}
.Last <- function(){
  cat("\nGoodbye at ", date(), "\n")
}
With the above steps you have created a customized start-up profile for R. The profile primarily sets the library path to point to a directory on the computer's local drive. As you own this directory, R can install packages there without admin rights. In the second part, you configured Rstudio's default working directory, which is where R, invoked from Rstudio, searches for the .Rprofile.
After closing and reopening Rstudio, you should see a message in the console window saying:
Loading .Rprofile in D:/Users/YourName/R/
That means that R has found the .Rprofile file and loaded it at start-up. The .Rprofile file primarily sets the path of the library, which is the collection of packages you install yourself. Whether this was successful can be checked by going to the console window in Rstudio, typing the command below and hitting Enter.
.libPaths()
If your installation went fine, you should see an output like the following. If the output lacks the first entry, your installation was not successful and you need to check all the above steps.
[1] "D:/Users/YourName/R/Rlibrary"     "C:/Program Files/R/R-3.3.0/library"
While R comes with a set of standard packages, thousands of packages are available to enhance functionality for every purpose you can think of. Most packages are available from the Comprehensive R Archive Network (CRAN).
For example, the package foreign is delivered with R and provides functions to read data files in various formats, e.g. SPSS files. The package haven is rather new, with enhanced functionality and usability. It is not delivered with R, so we have to fetch it ourselves.
Generally, packages need to be installed only once on your system, but loaded every time you need them. Installation is fairly straightforward once your R environment has been set up correctly and you have an internet connection.
In this book we will use a number of additional packages from CRAN. The listing below loads all required packages using the library(package) command. The package tidyverse is a meta-package that installs and loads a number of modern packages. The packages used in this book are:
## tidyverse
library(tidyverse)
## data manipulation
library(openxlsx)
## plotting
library(gridExtra)
## regression models
library(rstanarm)
## other
library(devtools) ## only needed for installing from Github
library(knitr)
## non-CRAN packages
library(mascutils)
library(bayr)
We start by checking the packages:
Open a new R script via File --> New file --> R script, copy the listing above into it and run the lines one by one with Ctrl-Return. As a first-time user with a fresh installation, you will now see error messages like:

Error in library(tidyverse) : there is no package called 'tidyverse'
This means that the respective package is not yet present in your R library. Before you can use the package tidyverse, you have to install it from CRAN. For doing so, use the built-in package management in Rstudio, which fetches packages from the web and is to be found in the Packages tab. The first time, you may have to select a repository and refresh the package list before you can find and install packages. Then click Install, enter the names of the missing package(s) and install them.
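If you prefer typing over clicking, the same can be done from the console with the base R command install.packages. A minimal sketch, using the CRAN packages listed above:

## install the CRAN packages used in this book (needed only once)
install.packages(c("tidyverse", "openxlsx", "gridExtra",
                   "rstanarm", "devtools", "knitr"))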
Finally, run the complete code block at once by selecting it and pressing Ctrl-Return. You will see some output in the console, which you should check once again. If the output does not contain any error messages (like the one above), you have successfully installed and loaded all packages.
Note that R has a somewhat idiosyncratic jargon: many languages, such as Java or Python, call "libraries" what R calls "packages". The library in R is strictly the set of packages installed on your computer, and the library command loads a package from the library.
Two packages, mascutils and bayr, are written by the author of this book. They have not yet been submitted to CRAN, but are available on Github, a hosting platform for version-controlled software projects. Fortunately, with the help of the devtools package it is rather easy to install these packages, too. Just go to the Rstudio console and type:
library(devtools)
install_github("schmettow/mascutils")
install_github("schmettow/bayr")
Again, you have to do this only once after installing R; afterwards you can load the packages with the library command. Only when a package gets an update to add functionality or fix bugs do you need to run these commands again.
After you have set up your R environment, you are ready to run your first R program:
## Simulation of a data set with 100 participants
## in two between-subject conditions
N <- 100
levels <- c("control","experimental")
Group <- rep(levels, N/2)
age <- round(runif(N, 18, 35),0)
outcome <- rnorm(N, 42 * 5, 10) + (Group == "experimental") * 42
Experiment <- data_frame(Group, age, outcome)
summary(Experiment)
| Group | age | outcome |
|---|---|---|
| Length:100 | Min. :18.0 | Min. :177 |
| Class :character | 1st Qu.:22.0 | 1st Qu.:211 |
| Mode :character | Median :28.0 | Median :225 |
| NA | Mean :26.7 | Mean :230 |
| NA | 3rd Qu.:31.0 | 3rd Qu.:251 |
| NA | Max. :35.0 | Max. :273 |
## Plotting the distribution of outcome
Experiment %>%
ggplot( aes(x = outcome) ) +
geom_density(fill = 1)
## ... outcome by group
Experiment %>%
ggplot( aes(fill = Group, x = outcome) ) +
geom_density()
## ... statistical model comparing the groups
model <- stan_glm(formula = outcome ~ Group,
data = Experiment)
fixef(model)
| model | type | nonlin | fixef | re_factor | re_entity | center | lower | upper |
|---|---|---|---|---|---|---|---|---|
| object | fixef | NA | Intercept | NA | NA | 209.1 | 206.2 | 212 |
| object | fixef | NA | Groupexperimental | NA | NA | 40.8 | 36.5 | 45 |
Observe the Console, Environment and Plots panes while the program runs. Did you spot the output of the summary(Experiment) command? Congratulations! You have just run your first statistical analysis in R.
Isn't it amazing that in fewer than 20 simple statements we have just reached the level of a second-year bachelor student? Still, you may find the R output a little inconvenient, as you may want to save the output of your data analysis. Not long ago, that really was an issue, but in the past few years R has become a great tool for reproducible research. The simplest procedure for saving your analysis for print or sharing is to:
- Save the R script with Ctrl-S, selecting a directory and name.
- Choose File --> Compile notebook and select Word as the format.
- Click Compile.

A new Word document should appear that shows all code and output. Now you can copy-and-paste the graphics into another document or a presentation.
Getting started with Rstudio (presentation)
Getting started with Rstudio (ebook)
The Rstudio cheat sheets are a collection of beautifully crafted cheat sheets for ggplot, dplyr and more. I suggest you print the data wrangling and ggplot cheat sheets and always keep them on your desk.
The tidyverse is a meta package that loads all core tidyverse packages by Hadley Wickham.
cRazy as in ‘idiosyncrasy’
This book is for applied researchers in design sciences, whose frequent task is to analyze data and report it to stakeholders. Consequently, the way I use R in this book capitalizes on interactive data analysis and reporting. As it turns out, a small fraction of R, mostly from the tidyverse, is sufficient to write R code that is effective and fully transparent. In most cases, a short chain of simple data transformations tidies the raw data, which can then be pushed into a modelling or graphics engine that will do the hard work. We will not bother (ourselves and others) with the usual programming concepts such as conditionals, loops or the somewhat eccentric approaches to functional programming. At the same time, we can almost ignore all the clever and advanced routines that underlie statistical inference and the production of graphics, as others have done the hard work for us.
Let me first give you a global account of programming in R, albeit that might be more interesting for readers who have experience with other programming languages (and styles). R serves three purposes: interactive data analysis, creating data analysis reports and developing new statistical routines.
With R, one typically works through a data analysis interactively. The analysis is often a rather routine series of steps, such as loading the data, checking and cleaning it, transforming it, and finally summarizing, plotting or modelling it.
A program in R is usually developed iteratively: once you have loaded and checked your data, you progress to the next step of your analysis, test it and proceed. At every step, one or more new objects are created in the environment, capturing intermediate and final results of the analysis.
As R is an interpreted language, there are no tedious compile-and-run cycles in everyday R programming. You develop the analysis as it happens. It is even normal to jump back and forth in an R program while building it.
R is a way to report and archive what precisely you have been doing with your data. In statistics, mathematical formulas are the common form of unambiguously describing a statistical model. For example, the following equation defines a linear regression model between the observed outcome \(y\) and the predictor \(x\):
\[ \mu_i = \beta_0 + \beta_1x_i\\ y_i \sim N(\mu_i, \sigma) \]
As we will see later [CLM], in R's formula language the same model is unambiguously specified as:
y ~ x
R is currently the lingua franca of statistical computing. As a programming language, R has the same precision as mathematical notation, but is more expressive. You can specify complex models, but also graphics and the steps of data checking and preparation. As an example, the following code identifies all scores that are more than an order of magnitude higher than the average score and marks them as invalid. We have just applied our own rule of outlier removal to the data. Others may consider the rule invalid or arbitrary, at best. But, at least, we are fully transparent about what we have done. As you type your data analysis, you formally report what precisely you have been doing to the data.
scores <- c(1,2,4,3,6,5,50,1,2,5,800)
mean_score <- mean(scores)
invalids <- scores > (mean_score * 10)
scores[invalids] <- NA
Finally, R is a way to develop and share statistical programs. Thousands of packages in the R ecosystem cover almost any statistical problem you can imagine. Countless researchers and developers have written statistical routines that can be re-used. As a programming language, R has been designed for that particular purpose. Under the hood of R, a set of generic, yet powerful, principles purr along to make it a convenient language for typical problems in statistical computation. Most readers will have a background in general-purpose programming languages, like Java, Python or C++. The following three paragraphs point out where R differs from such languages.
Functional programming means that functions are first-class citizens in R. Most notably, almost every operation in R takes an input and creates another output object. In standard R programming there is no such thing as changing an object in place by setting its attributes. Furthermore, there exist so-called higher-order functions: functions that apply other functions. For example, the function aaply in the plyr package applies another function, say sum, to every column or row of a matrix. Such higher-order functions allow for incredibly compact code and make loops practically unnecessary.
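To give a flavour of this style, here is a minimal sketch using base R's apply, which does the same job as plyr's aaply in this simple case:

M <- matrix(1:6, nrow = 2)  ## a 2 x 3 matrix
apply(M, 2, sum)            ## apply sum to every column (MARGIN = 2)
## [1]  3  7 11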
Most programming languages count from zero. If you have worked with Python or Java before, you know the nasty "index out of bounds" errors that arise. They are easy to fix, but still annoying. Mathematicians, statisticians and basically everyone else, young or old, starts counting from 1. And so does R.
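A one-liner demonstrates the difference:

x <- c("first", "second", "third")
x[1]  ## in R, index 1 really is the first element
## [1] "first"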
For folks who like object orientation, R is a good choice, too. It even gives you the choice of several object-orientation systems. The most basic mechanism is the so-called S3 system, which is rather limited and even a bit informal. S3 mainly provides a simple overloading mechanism; technically, it is a method dispatcher, more or less. When you later find out that you can call the function summary on a great variety of R objects and it always does something useful, that is S3 at work. However, owing to the functional paradigm, almost all R routines go without side effects: if you modify an attribute of an object, you are in fact working on a copy, while the original stays untouched.
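The following is a minimal sketch of how S3 dispatch works; the class name usability_test and its fields are made up for illustration:

## an object becomes a member of a class by getting a class attribute
test_result <- list(participants = 20, mean_ToT = 106)
class(test_result) <- "usability_test"

## a print method for that class: just a function with a special name
print.usability_test <- function(x, ...) {
  cat("Usability test with", x$participants,
      "participants, mean ToT =", x$mean_ToT, "\n")
  invisible(x)
}

test_result  ## R dispatches to print.usability_test behind the scenes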
Any statistical analysis can be thought of as a production chain. You take the raw data and process it into a neat data table, which you feed into graphics and regression engines or summarize by other means. At almost every step there is an input and an output object.
Objects are a basic feature of R. They are temporary storage places in the computer's memory. Objects always have a name chosen by the programmer. By its name, a stored object can be retrieved at any time. Two basic operations apply to any object: an object is stored by assigning it a name; it is retrieved by calling its name. If you wanted to store the number of observations in your data set under the name N_obs, you would use the assignment operator <-. The name of the object goes left of the operator, the assigned value right of it.
N_obs <- 100
Now, that the value is stored, you can call it any time by simply calling its name:
N_obs
## [1] 100
Just calling the name prints the value to Console. In typical interactive programming sessions with R, this is already quite useful. But, you can do much more with this mechanism.
Often, what you want is to do calculations with the value. For example, you have a repeated-measures study and want to calculate the average number of observations per participant. For this you need the number of observations and the number of participants. The code below creates both objects, does the calculation (right of <-) and stores the result in another object, avg_N_Obs:
N_Obs <- 100
N_Part <- 25
avg_N_Obs <- N_Obs/N_Part
avg_N_Obs
## [1] 4
Objects can exist without a name, but then they are volatile: they cannot be used any further. The following arithmetic operation does create an object, a single number. For a moment or so, this number exists somewhere in your computer's memory, but once it is printed to the screen, it is gone. Of course, the same expression can be called again, resulting in the same number. But, strictly, it is a different object.
N_Obs/N_Part ## gone after printing
## [1] 4
N_Obs/N_Part ## same value, different object
## [1] 4
There is even a formal way to show that the two numbers, although having the same value, are located at different memory addresses. This is just for demonstration purposes and you will rarely use it in everyday programming tasks:
tracemem(N_Obs/N_Part)
## [1] "<0000000012B2C280>"
tracemem(N_Obs/N_Part)
## [1] "<0000000012ABDA30>"
Notice the [1] that R puts in front of the single value when printing it? This is an index. Unlike in other programming languages, all basic data types in R are vectors. Vectors are containers for storing many values of the same type. The values are addressed by their index; N_Obs[1] calls the first element. In R, indices start counting at 1, which differs from most other languages that start at zero. And if you have a single value only, it is just a vector of length one.
For statistical programming, having vectors as basic data types makes perfect sense. Any statistical data is a collection of values. What holds for data is also true for functions applied to data. Practically all frequently used mathematical functions work on vectors; take the following example:
X <- rnorm(100, 2, 1)
mean(X)
## [1] 1.88
The rnorm command produces a vector of length 100 from three input values. More precisely, it does 100 random draws from a Normal distribution with mean 2 and a standard deviation of 1. The mean command takes the collection and reduces it to a single value. By the way, this is precisely what we call a statistic: a single quantity that characterizes a collection of quantities.
Objects can be of various classes. In R the common basic classes are: logical, factor, character, integer and numeric. Besides these, programmers can define their own complex classes, for example to store the results of regression models.
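You can always query the class of an object with the class function:

class(TRUE)      ## "logical"
class("control") ## "character"
class(3L)        ## "integer"
class(3.14)      ## "numeric"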
Objects of type logical store the two values TRUE and FALSE, for example the presence or absence of a treatment, or passed and failed test items. With Boolean operators one can compute new logical values, for example:
Apple <- TRUE
Pear <- FALSE
Apple & Pear ## and
## [1] FALSE
Apple | Pear ## or
## [1] TRUE
More generally, logical values can be used for categorization when there are only two categories, which is called a dichotomous variable. For example, gender is usually coded as a vector of characters ("m", "f", "f"), but one can always do something like the following:
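A minimal sketch, with object names of my choosing:

gender <- c("m", "f", "f")
is_female <- gender == "f"  ## derive a logical (dichotomous) variable
is_female
## [1] FALSE  TRUE  TRUE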
Programmers are lazy folks when it comes to typing, therefore R allows you to abbreviate TRUE and FALSE as T and F. As a consequence, one should never assign anything to the names reserved for logical values, so do not do any of the following:
## never do this
T <- "All philosophers have beards" ## a character object, see below
F <- "All gods existed before the big bang"
42 * F ## throws an error: non-numeric argument
The class numeric stores real numbers and is therefore abundant in statistical programming. All the usual arithmetic operations apply:
a <- 1.0
b <- a + 1
sqrt(b)
## [1] 1.41
Objects of class integer are more specific, as they store natural numbers only. These often occur as counts, ranks or indices.
friends <- c(anton = 1,
berta = 3,
carlo = 2)
order(friends)
## [1] 1 3 2
The usual arithmetic operations apply, although the result of an operation may no longer be an integer, but numeric:
N <- 3
sum_of_scores <- 32
mean_score <- 32/3
mean_score
## [1] 10.7
class(mean_score)
## [1] "numeric"
Surprisingly, logical values can be used in arithmetic expressions, too. When R encounters TRUE in an arithmetic context, it replaces it with 1, and FALSE with 0. Used with multiplication, this acts like an on/off switch, which we will use in [LM: dummy variables].
TRUE + TRUE
## [1] 2
sqrt(3 + TRUE)
## [1] 2
is_car <- c(TRUE, TRUE, FALSE) ## bicycle otherwise
wheels <- 2 + is_car * 2
wheels
## [1] 4 4 2
Data sets usually contain variables that are not numeric, but partition the data into groups. For example, we frequently group observations by the following:

- Part: participant
- Condition: experimental condition
- Design: one of several designs
- Education: level of education (e.g., low, middle or high)

Two object types apply for grouping observations: factor and character. While factors specialize in grouping observations, character objects can also be used to store longer text, say the description of a usability problem. The following identifies two conditions in a study, say a comparison of designs A and B. Note how the factor identifies its levels when called to the screen:
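A minimal sketch of code that would produce the output shown below (the object names are assumptions of mine):

Condition_chr <- c("A", "B", "B", "A")  ## character vector
Condition_chr
Condition_fct <- factor(Condition_chr)  ## factor with levels A and B
Condition_fct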
## [1] "A" "B" "B" "A"
## [1] A B B A
## Levels: A B
Statistical analysis deals with real-world data, which is often messy. Frequently, a planned observation could not be recorded because the participant decided to quit, the equipment did not work properly or the internet connection broke down. Users of certain legacy statistics packages got used to coding missing observations as -999 and then declaring this a missing value. In R, missing values are first-class citizens: every vector of any class can contain missing values, which are identified as NA.
Most basic data manipulation functions, like mean, are aware of missing values and act conservatively in their presence: the programmer has to explicitly ask for NA values to be ignored. While this appears slightly inconvenient, it helps to catch errors early in data analysis and creates better transparency.
clicks <- c(3, NA, 2)
mean(clicks)
## [1] NA
mean(clicks, na.rm = T)
## [1] 2.5
This book is about programming and statistics at the same time. Unfortunately, there are a few terms that have a particular meaning in both domains. One of those is "variable". In statistics, a variable usually is a property we have recorded, say the body length of persons, or their gender. In general programming, a variable is a place in the computer's memory where results can be stored and recalled. Fortunately, R avoids confusion by calling what programmers usually call a variable an object.
R comes with a whole bunch of functions for creating and summarizing data. Let me first introduce you to functions that produce exactly one number to characterize a vector. The following functions do that with the vector X, each in its own way:
length(X)
sum(X)
mean(X)
var(X)
sd(X)
min(X)
max(X)
median(X)
These functions are transformations of data. The input to these transformations is X, which is given as an argument to the function. Other functions require more than one argument. The quantile function is routinely used to summarize a variable. Recall that X has been drawn from a Normal distribution with mean \(\mu=2\) and standard deviation \(\sigma = 1\). All Normal distributions have the property that about 66% of the probability mass lies within the range \(\mu-\sigma\) to \(\mu+\sigma\). That means, in turn: about 17% of observations fall below \(\mu-\sigma\) and 66 + 17 = 83% fall below \(\mu+\sigma\). The value below which a given proportion of observations falls is called a quantile. The quantile function operates on X, but takes a vector of proportions as its second argument:
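A sketch of the corresponding call, assuming the vector X from above and the proportions 0.17 and 0.83:

quantile(X, c(0.17, 0.83))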
## 17% 83%
## 1.09 2.81
Most functions in R have optional arguments that let you change how the function behaves. The basic mathematical functions all have the optional argument na.rm. This is a switch that determines how the function deals with missing values (NA). Many optional arguments have defaults. The default of na.rm is FALSE ("return NA in case of NAs in the vector"). By setting it to TRUE, missing values are removed before the operation.
B <- c(1, 2, NA, 3)
mean(B)
mean(B, na.rm = TRUE)
More complex routines in R often have an abundance of parameters, most of which fortunately have reasonable defaults. To give a more complex example, the first call of stan_glm below performs a Bayesian estimation of the grand mean model. The second fits a Poisson grand mean model with 5000 iterations per MCMC chain. As the seed has been fixed, every subsequent run will produce the exact same chains. My apologies for the jargon!
D_1 <- data_frame(X = rnorm(20, 2, 1))
M_1 <- stan_glm(X ~ 1,
                data = D_1)

D_2 <- data_frame(X = rpois(20, 2))
M_2 <- stan_glm(X ~ 1,
                family = poisson,
                seed = 42,
                iter = 5000,
                data = D_2)
R comes with the usual set of arithmetic operators, like +, -, / and more. In fact, an operator is just a function. The sum of two numbers can indeed be written in two ways:
1 + 2
## [1] 3
`+`(1,2)
## [1] 3
The second form is a function call that takes two numbers as input and returns a third. It is just a different syntax, known as Polish (prefix) notation. I will not use it in this book.
Another set of commonly used operators are the logical operators, which implement Boolean algebra. The most common ones are shown in the truth table below:
| A | B | !A (not) | A & B (and) | A \| B (or) | A == B (equals) |
|---|---|---|---|---|---|
| T | T | F | T | T | T |
| T | F | F | F | T | F |
| F | T | T | F | T | F |
| F | F | T | F | F | T |
Be careful not to confuse the Boolean "and" and "or" with their common natural-language use. If you ask: "Can you buy apples or pears on the market?", the natural answer would be: "both". The Boolean answer is: TRUE. In a requirements document you could state: "This app is for children and adults". Read as a Boolean statement about a single user, this would be FALSE, because, strictly, no one can be a child and an adult at the same time. A correct Boolean statement would be: "The envisioned user can be an adult or a child".
Further Boolean operators exist, but they can be derived from the three above. For example, the exclusive OR, "either A or B", can be written as (A | B) & !(A & B). This term only becomes TRUE when A or B is TRUE, but not both.
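A quick check in the console confirms this; base R also provides the exclusive OR as the function xor:

A <- TRUE
B <- FALSE
(A | B) & !(A & B)  ## exclusive OR, written out
## [1] TRUE
xor(A, B)           ## the same, as a base R function
## [1] TRUE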
In the data analysis workflow, Boolean logic is frequently used for filtering data, and we will re-encounter it when covering data transformation.
Most behavioral research collects real data to work with. As behavioral researchers are obsessed with finding associations between variables, real data usually contains several. If you have a sample of observations (e.g. participants) and every case has the same variables (measures or group memberships), the data is stored in a table structure, where columns are variables and rows are observations.
R provides the data.frame object to store such variable-by-observation tables. Data frames are tables whose columns represent statistical variables. Variables have names and can be of different data types, as they usually appear in empirical research. In many cases data frames are imported into R, as they represent real data; above, we have instead created one by simulation. In either case, we usually want some initial inspection of a freshly harvested data frame.
Data frames are objects. By just calling a data frame, it gets printed to the screen. This is a good moment to reflect on one idiosyncrasy of R, which is a feature, of course. R is primarily used interactively, which has the immense advantage that the programmer can incrementally build up the data analysis. This implies that the programmer often wants to quickly check what currently is in an object. Now, observe what happens when we call Experiment:
Experiment
| Group | age | outcome |
|---|---|---|
| control | 30 | 204 |
| experimental | 28 | 238 |
| control | 18 | 199 |
| experimental | 22 | 231 |
| control | 24 | 204 |
| experimental | 26 | 266 |
| control | 23 | 194 |
| experimental | 33 | 244 |
| control | 29 | 230 |
| experimental | 22 | 253 |

(… with 90 more rows)
This is useful information printed to the screen: we see the sample size, the names of the variables and their classes, and the first ten observations as examples. Obviously, this is not the data itself, but some sort of summary. It would be a complete disaster if R passed this summary on when the call is part of an assignment or another operation on the data, for example: NewExp <- Experiment. Apparently, R is aware of whether a called object is part of an operation or purely for the programmer's eyes. For any object called on the console in an interactive session, R silently applies the print function to the object. The following gives precisely the same result as above:
print(Experiment)
Such print functions exist for dozens of classes of objects. They represent what the developers thought would be the most salient or useful thing to print in an interactive session. By virtue of the object system, R finds the right print function for the object at hand.
However, when working interactively, most of the time you do a data transformation, assign it to a new object and want to check the result immediately. As the assignment itself is always silent (unless an error occurs), you are forced to type a print statement afterwards. There is a shortcut for that: if you wrap the assignment statement in parentheses, the result is printed to the screen. As convenient as this is, especially when creating longer data transformation chains, it introduces some visual clutter. In this book I use it sparingly, at most.
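For example (the object name mean_outcome is mine):

## wrapping the assignment in parentheses assigns and prints in one go
(mean_outcome <- mean(Experiment$outcome))
## [1] 230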
The dedicated print function does what, way back, a developer thought would be the most useful information in most cases. Depending on the situation, it can provide too much, too little or just the wrong information. For example, when simulating a data set, as in [First Exercise], we want to check whether the resulting data frame has the desired length and whether the variables have the correct class. The standard implementation of print dumps the whole data frame to the screen, which can be much more than needed. The newer dplyr implementation data_frame (from the tidyverse, also called tibble), which you have already seen in action, is much better suited for this situation, as it displays the dimensions and variable classes. In contrast, if the job is to spot whether there are many missing values (NA), the classic print performs better.
Whatever the question is, several commands are available to look at a data frame from different perspectives. Another command that is implemented for a variety of classes is summary. For data frames, it produces an overview with descriptive statistics for all variables (i.e. columns). Particularly useful for initial data screening is that missing values are listed per variable.
summary(Experiment)
| Group | age | outcome |
|---|---|---|
| Length:100 | Min. :18.0 | Min. :177 |
| Class :character | 1st Qu.:22.0 | 1st Qu.:211 |
| Mode :character | Median :28.0 | Median :225 |
| NA | Mean :26.7 | Mean :230 |
| NA | 3rd Qu.:31.0 | 3rd Qu.:251 |
| NA | Max. :35.0 | Max. :273 |
The str (structure) command works on any R object and displays the hierarchical structure (if there is one):
str(Experiment)
## Classes 'tbl_df', 'tbl' and 'data.frame': 100 obs. of 3 variables:
## $ Group : chr "control" "experimental" "control" "experimental" ...
## $ age : num 30 28 18 22 24 26 23 33 29 22 ...
## $ outcome: num 204 238 199 231 204 ...
Data frames store variables, and statistical procedures operate on those variables. So we need ways of accessing and manipulating statistical variables, and we will have plenty. First, recall that in R the basic object types are all vectors: you can store as many elements as you want in an object, as long as they are of the same class.
Internally, data frames are a collection of “vertical” vectors that are equally long. Being a collection of vectors, the variables of a data frame can be of different classes, like character, factor or numeric. In the most basic case, you want to calculate a statistic for a single variable out of a data frame. The $ operator pulls the variable out as a vector:
mean(Experiment$outcome)
## [1] 230
As data frames are rectangular structures, you can also access individual values by their addresses. The following commands call a single cell, a slice of rows from one column, and a complete row, respectively:
Experiment[ 1, 3]
| outcome |
|---|
| 204 |
Experiment[1:3, 2]
| age |
|---|
| 30 |
| 28 |
| 18 |
Experiment[ 1, ]
| Group | age | outcome |
|---|---|---|
| control | 30 | 204 |
Addressing one or more elements in square brackets always requires two indices: first the row(s), then the column(s). As odd as it looks, one or both indices can be left empty, which simply means: take all rows (or all columns). Even the expression Experiment[,] is fully valid and will just return the whole data frame.
There is an important difference, however, between R's classic data.frame and dplyr's data_frame implementation: when using single square brackets on dplyr data frames, one always gets a data frame back. That is very predictable behavior, and very much unlike the classic: with data.frame, when the addressed elements span multiple columns, like Experiment[, 1:2], the result is a data.frame object, too. However, when slicing a single column, the result is a vector:
Exp_classic <- as.data.frame(Experiment)
Exp_classic[1:2,1:2] ## data.frame
| Group | age |
|---|---|
| control | 30 |
| experimental | 28 |
Exp_classic[1,] ## data.frame
| Group | age | outcome |
|---|---|---|
| control | 30 | 204 |
Exp_classic[,1] ## vector
## [1] "control" "experimental" "control" "experimental"
## [5] "control" "experimental" "control" "experimental"
## [9] "control" "experimental" "control" "experimental"
## [13] "control" "experimental" "control" "experimental"
## [17] "control" "experimental" "control" "experimental"
## [21] "control" "experimental" "control" "experimental"
## [25] "control" "experimental" "control" "experimental"
## [29] "control" "experimental" "control" "experimental"
## [33] "control" "experimental" "control" "experimental"
## [37] "control" "experimental" "control" "experimental"
## [41] "control" "experimental" "control" "experimental"
## [45] "control" "experimental" "control" "experimental"
## [49] "control" "experimental" "control" "experimental"
## [53] "control" "experimental" "control" "experimental"
## [57] "control" "experimental" "control" "experimental"
## [61] "control" "experimental" "control" "experimental"
## [65] "control" "experimental" "control" "experimental"
## [69] "control" "experimental" "control" "experimental"
## [73] "control" "experimental" "control" "experimental"
## [77] "control" "experimental" "control" "experimental"
## [81] "control" "experimental" "control" "experimental"
## [85] "control" "experimental" "control" "experimental"
## [89] "control" "experimental" "control" "experimental"
## [93] "control" "experimental" "control" "experimental"
## [97] "control" "experimental" "control" "experimental"
Predictability and a few other useful tweaks made me prefer data_frame over data.frame. But many third-party packages continue to produce classic data.frame objects. For example, there is an alternative to the package readxl, called openxlsx, which reads (and writes) Excel files. It returns classic data.frames, which can easily be converted as follows:
D_foo <-
  read.xlsx("foo.xlsx") %>%
  as_data_frame()
While data_frame[] behaves perfectly predictably, in that it always returns a data frame even when the result is just a single column or cell, sometimes one wants to truly extract a vector. From a data_frame, a single column can be extracted as a vector using double square brackets, or using the $ operator:
Experiment[[1]] ## vector
Experiment$Group ## the same
Sometimes, it may be necessary to change values in a data frame. For example, a few outliers have been discovered during data screening, and the researcher decides to mark them as missing values. The syntax for indexing elements in a data frame can be used in conjunction with the assignment operator <-. In the example below, we make the simulated experiment more realistic by injecting a few outliers. Then we discard these outliers by setting them all to NA.
## injecting
Experiment[2, "outcome"] <- 660
Experiment[6, "outcome"] <- 987
Experiment[c(1,3), "age"] <- -99
## printing first few observations
head(Experiment)
| Group | age | outcome |
|---|---|---|
| control | -99 | 204 |
| experimental | 28 | 660 |
| control | -99 | 199 |
| experimental | 22 | 231 |
| control | 24 | 204 |
| experimental | 26 | 987 |
## setting to NA
Experiment[c(2, 6),"outcome"] <- NA
Experiment[c(1, 3),"age"] <- NA
Besides the injection itself, note two more features of addressing data frame elements. The first is that vectors can be used to address multiple rows, e.g. rows 2 and 6. In fact, the range operator 1:3 we used above is just a convenient way of creating the vector c(1, 2, 3). Although not shown in the example, this works for columns alike.
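A quick check in the console shows that the two notations produce the same vector:

1:3
## [1] 1 2 3
c(1, 2, 3)
## [1] 1 2 3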
The careful reader may also have noted another oddity in the above example. With Experiment[c(2, 6), "outcome"] we addressed two elements, but the right-hand side of <- is only one value. That is a basic mechanism of R known as recycling: when the left-hand side is longer than the right-hand side, the right-hand side is recycled as many times as needed. Many basic functions in R work like this, and it can be quite useful. For example, you may want to create a vector of 20 random numbers where the observations alternate between two different means:
rnorm(20, mean = c(1,2), sd = 1)
The above example recycles the two mean values ten times, creating an alternating pattern. Strictly speaking, the sd = 1 parameter is recycled, too, to the full length of 20. While recycling often comes in handy, it can also lead to hard-to-spot programming errors. So, it is good advice to be aware of this mechanism and always check the input to vectorized functions carefully.
R lets you import data from almost every conceivable source, given that you have installed and loaded the appropriate packages (foreign, haven, or openxlsx for Excel files). Besides that, R has its own file format for storing data, the .Rda file. With these files you can save data frames (and any other R object), using the save(data, file = "some_file.Rda") and load(file = "some_file.Rda") commands.
Few people create their data tables directly in R. More commonly, data sets exist as Excel (.xlsx) or SPSS (.sav) files. Moreover, data can be produced by electronic measurement devices (e.g. electrodermal response measures) or programmed experiments, which provide data in various forms, for example as .csv (comma-separated values) files. All these files can be opened with the following commands:
## Text files (read_csv from readr, part of the tidyverse)
Experiment <-
  read_csv("Data/Experiment.csv")
## Excel (read_excel from readxl)
Experiment <-
  read_excel("Data/Experiment.xlsx", sheet = 1)
## SPSS (read_sav from haven)
Experiment <-
  read_sav("Data/Experiment.sav")
## SPSS (read.spss from foreign)
Experiment <-
  read.spss("Data/Experiment.sav", to.data.frame = TRUE)
Note that I gave two options for reading SPSS files. The first (with an underscore) is from the newer haven package (part of the tidyverse). With some SPSS files, I have experienced problems with this command, as it does not always convert SPSS's labelled data type (which is almost the same as an R factor). The alternative is the classic read.spss command, which works almost always, but by default it creates a list of lists, which is not what you typically want. With the extra argument, as shown, it behaves as expected.
Remember, data frames are objects and volatile as such: if you end your session, they are gone. Once you have your data frame imported and cleaned, you may want to store it in a file. As with reading, many commands are available for writing all kinds of data formats. If you are lucky enough to have a complete R-based workflow, you can conveniently use R's own format for storing data, Rda files. For storing a data frame and then reading it back in (in your next session), simply do:
save(Experiment, file = "Data/Experiment.Rda")
load(file = "Data/Experiment.Rda")
Note that with save and load, all objects are restored under their original names, without using any assignments. Take care, as load will silently overwrite any existing object with the same name. Another thing to note is that for the save command you have to explicitly name the file argument and provide the file path as a character string. In Rstudio, begin typing file = "", put the cursor between the quotation marks and hit Tab, which opens a small dialogue for navigating to the desired directory.
Once you have loaded, prepared and started to analyze a data frame in R, there is little reason to go back to any legacy program. Still, the haven and foreign packages contain commands to write to various file formats. I will leave that as an exercise for the reader.
This book features more than a dozen case studies. Every case will be encountered several times, and multiple objects are created along the way: data sets, regression models, graphics, tables, you name it. That posed the problem of naming the objects so that they are unique. I could have chosen object names like BrowsingAB_M_1, AUP_M_1, etc. But this is not what you would normally do when working on one study at a time. Moreover, every letter you add to a line of code makes it more prone to errors and less likely that you, dear reader, type it in and try it out yourself.
For these reasons, all cases are enclosed in case environments and provided with this book. To get a case environment to work in your session, it has to be loaded from the respective R data file first:
load("BrowsingAB.Rda")
In R, environments are containers for collections of objects. If an object BAB1 is placed in an environment BrowsingAB, it can be called as BrowsingAB$BAB1. This way, no brevity is gained. Another way to access objects in an environment is to attach the environment first:
attach(BrowsingAB)
BAB1
| Obs | Part | Task | Design | Gender | Education | age | Far_sighted | Small_font | ToT | clicks | returns |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | A | M | Middle | 49 | TRUE | FALSE | 189.6 | 12 | 4 |
| 2 | 2 | 1 | A | M | Low | 36 | FALSE | FALSE | 247.0 | 15 | 4 |
| 3 | 3 | 1 | A | F | Low | 66 | FALSE | FALSE | 226.3 | 14 | 2 |
| 4 | 4 | 1 | A | M | Middle | 38 | FALSE | FALSE | 250.8 | 10 | 6 |
| 5 | 5 | 1 | A | M | High | 50 | TRUE | FALSE | 134.7 | 15 | 2 |
| 6 | 6 | 1 | A | M | High | 48 | TRUE | FALSE | 182.1 | 5 | 2 |
| 7 | 7 | 1 | A | M | High | 66 | FALSE | FALSE | 197.5 | 8 | 3 |
| 8 | 8 | 1 | A | F | High | 29 | FALSE | FALSE | 221.0 | 11 | 3 |
| 9 | 9 | 1 | A | M | Middle | 46 | FALSE | FALSE | 151.4 | 10 | 1 |
| 10 | 10 | 1 | A | M | Low | 51 | FALSE | FALSE | 206.6 | 10 | 2 |

(… with further rows omitted)
| 199 | 199 | 1 | B | M | High | 74 | TRUE | TRUE | 290.9 | 16 | 3 |
| 200 | 200 | 1 | B | M | High | 47 | FALSE | TRUE | 145.4 | 9 | 2 |
detach(BrowsingAB)
attach(Reading)
D_1
| Part | font_size | font_color | mu | ToT |
|---|---|---|---|---|
| 1 | 10pt | gray | 60 | 66.9 |
| 2 | 12pt | gray | 48 | 45.2 |
| 3 | 10pt | gray | 60 | 61.8 |
| 4 | 12pt | gray | 48 | 51.2 |
| 5 | 10pt | gray | 60 | 62.0 |
| 6 | 12pt | gray | 48 | 47.5 |
| 7 | 10pt | gray | 60 | 67.6 |
| 8 | 12pt | gray | 48 | 47.5 |
| 9 | 10pt | gray | 60 | 70.1 |
| 10 | 12pt | gray | 48 | 47.7 |
| 11 | 10pt | gray | 60 | 66.5 |
| 12 | 12pt | gray | 48 | 59.4 |
| 13 | 10pt | gray | 60 | 53.1 |
| 14 | 12pt | gray | 48 | 46.6 |
| 15 | 10pt | gray | 60 | 59.3 |
| 16 | 12pt | gray | 48 | 51.2 |
| 17 | 10pt | gray | 60 | 58.6 |
| 18 | 12pt | gray | 48 | 34.7 |
| 19 | 10pt | gray | 60 | 47.8 |
| 20 | 12pt | gray | 48 | 54.6 |
| 21 | 10pt | black | 50 | 48.5 |
| 22 | 12pt | black | 46 | 37.1 |
| 23 | 10pt | black | 50 | 49.1 |
| 24 | 12pt | black | 46 | 52.1 |
| 25 | 10pt | black | 50 | 59.5 |
| 26 | 12pt | black | 46 | 43.8 |
| 27 | 10pt | black | 50 | 48.7 |
| 28 | 12pt | black | 46 | 37.2 |
| 29 | 10pt | black | 50 | 52.3 |
| 30 | 12pt | black | 46 | 42.8 |
| 31 | 10pt | black | 50 | 52.3 |
| 32 | 12pt | black | 46 | 49.5 |
| 33 | 10pt | black | 50 | 55.2 |
| 34 | 12pt | black | 46 | 43.0 |
| 35 | 10pt | black | 50 | 52.5 |
| 36 | 12pt | black | 46 | 37.4 |
| 37 | 10pt | black | 50 | 46.1 |
| 38 | 12pt | black | 46 | 41.7 |
| 39 | 10pt | black | 50 | 37.9 |
| 40 | 12pt | black | 46 | 46.2 |
Calling attach gets you into the namespace of the environment (formally correct: the namespace gets imported into your working environment). All objects in that namespace become immediately visible by their mere name. The detach command leaves the environment again. When working with the case environments, I strongly recommend detaching one environment before attaching another.
All case environments provided with this book contain one or more data sets. Many of the cases are synthetic data that has been generated by a simulation function. This function, routinely called simulate, is provided with the case environment, too. That gives you the freedom to produce your own data sets with the same structure, but different effects. Generally, calling the simulation function without any further arguments exactly reproduces the synthetic data set provided with the case environment.
simulate() ## reproduces the data frame D_1
| Part | font_size | font_color | mu | ToT |
|---|---|---|---|---|
| 1 | 10pt | gray | 60 | 66.9 |
| 2 | 12pt | gray | 48 | 45.2 |
| 3 | 10pt | gray | 60 | 61.8 |
| 4 | 12pt | gray | 48 | 51.2 |
| 5 | 10pt | gray | 60 | 62.0 |
| 6 | 12pt | gray | 48 | 47.5 |
| 7 | 10pt | gray | 60 | 67.6 |
| 8 | 12pt | gray | 48 | 47.5 |
| 9 | 10pt | gray | 60 | 70.1 |
| 10 | 12pt | gray | 48 | 47.7 |
| 11 | 10pt | gray | 60 | 66.5 |
| 12 | 12pt | gray | 48 | 59.4 |
| 13 | 10pt | gray | 60 | 53.1 |
| 14 | 12pt | gray | 48 | 46.6 |
| 15 | 10pt | gray | 60 | 59.3 |
| 16 | 12pt | gray | 48 | 51.2 |
| 17 | 10pt | gray | 60 | 58.6 |
| 18 | 12pt | gray | 48 | 34.7 |
| 19 | 10pt | gray | 60 | 47.8 |
| 20 | 12pt | gray | 48 | 54.6 |
| 21 | 10pt | black | 50 | 48.5 |
| 22 | 12pt | black | 46 | 37.1 |
| 23 | 10pt | black | 50 | 49.1 |
| 24 | 12pt | black | 46 | 52.1 |
| 25 | 10pt | black | 50 | 59.5 |
| 26 | 12pt | black | 46 | 43.8 |
| 27 | 10pt | black | 50 | 48.7 |
| 28 | 12pt | black | 46 | 37.2 |
| 29 | 10pt | black | 50 | 52.3 |
| 30 | 12pt | black | 46 | 42.8 |
| 31 | 10pt | black | 50 | 52.3 |
| 32 | 12pt | black | 46 | 49.5 |
| 33 | 10pt | black | 50 | 55.2 |
| 34 | 12pt | black | 46 | 43.0 |
| 35 | 10pt | black | 50 | 52.5 |
| 36 | 12pt | black | 46 | 37.4 |
| 37 | 10pt | black | 50 | 46.1 |
| 38 | 12pt | black | 46 | 41.7 |
| 39 | 10pt | black | 50 | 37.9 |
| 40 | 12pt | black | 46 | 46.2 |
simulate(N = 6) ## simulates a new data set with only 6 participants
| Part | font_size | font_color | mu | ToT |
|---|---|---|---|---|
| 1 | 10pt | gray | 60 | 66.9 |
| 2 | 12pt | gray | 48 | 45.2 |
| 3 | 10pt | gray | 60 | 61.8 |
| 4 | 12pt | black | 46 | 49.2 |
| 5 | 10pt | black | 50 | 52.0 |
| 6 | 12pt | black | 46 | 45.5 |
Furthermore, once you delve deeper into R, you can critically inspect the simulation function’s code for its behavioral and psychological assumptions (working through the later chapters on data management and simulation will help).
simulate ## calling a function without parentheses prints its code
## function(N = 40,
## beta = c(Intercpt = 60,
## fnt_size_12 = -12,
## fnt_color_blk = -10,
## ia_blk_12 = 8),
## sigma = 5,
## seed = 42)
## {
## set.seed(seed)
## out <-
## data_frame(Part = 1:N,
## font_size = factor(rep(c(1, 2), N/2),
## levels = c(1,2),
## labels = c("10pt", "12pt")),
## font_color = factor(c(rep(1, N/2), rep(2, N/2)),
## levels = c(1,2),
## labels = c("gray", "black"))) %>%
## mutate( mu = beta[1] +
## beta[2] * (font_size == "12pt") +
## beta[3] * (font_color == "black") +
## beta[4] * (font_color == "black") * (font_size == "12pt"),
## ToT = rnorm(N, mu, sigma))
##
## class(out) <- append(class(out), "sim_tbl")
## attr(out, "coef") <- list(beta = beta,
## sigma = sigma)
## attr(out, "seed") <- seed
##
## out
## }
detach(Reading)
Finally, the case environments contain all objects that have been created throughout this book. This is especially useful for the regression models, as fitting these can take from a few seconds to hours.
Note that working with environments is a tricky business. Creating these case environments in an efficient way was more difficult than you may think. Therefore, I do not recommend using environments in everyday data analysis, with one exception: at any moment, the current working environment contains all objects that have been created so far. That is precisely the set of objects shown in the Environment pane of Rstudio (or call ls() for a listing). Saving all objects and retrieving them when returning from a break is as easy as:
save.image(file = "my_data_analysis.Rda") ## saves all objects in the working environment
## have a break
load("my_data_analysis.Rda")
In addition, Rstudio can be configured to save the current workspace on exit and reload it on the next start. When working on just one data analysis over a longer period of time, this can be a good choice.
Throughout this book, data sets are structured according to the rule of one-row-per-observation of the dependent variable (the ORPO rule). Many researchers still organize their data tables as one-row-per-participant, as is required by some legacy statistics programs. This is fine in research with non-repeated measures, but it will not work with modern regression models, such as linear mixed-effects models. Consider a study where two designs were evaluated by three participants using a self-report item, like "How easy to use is the interface?" The wrong way of structuring the data would be:
| Part | A | B |
|---|---|---|
| 1 | 7 | 4 |
| 2 | 3 | 4 |
| 3 | 4 | 6 |
The correct way of setting up the data frame is:
| Part | Design | response |
|---|---|---|
| 1 | A | 7 |
| 2 | A | 3 |
| 3 | A | 4 |
| 1 | B | 4 |
| 2 | B | 4 |
| 3 | B | 6 |
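How does one get from the wide format to the ORPO format? A minimal sketch of such a conversion, using the tidyr function pivot_longer (this is an illustration of one possible way, not code from this book):

library(tibble)
library(dplyr)
library(tidyr)

## the wide table from above, entered by hand
D_wide <- tribble(
  ~Part, ~A, ~B,
      1,  7,  4,
      2,  3,  4,
      3,  4,  6)

## stack the two design columns into one ORPO-style data frame
D_wide %>%
  pivot_longer(cols = c(A, B),
               names_to = "Design",
               values_to = "response") %>%
  arrange(Design, Part)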
The ORPO rule dictates another principle: every row should have a unique identifier. In the example above, every observation is uniquely identified by the participant identifier and the design condition. If we extend the example slightly, it becomes immediately apparent why the ORPO rule is justified. Imagine the study actually asked three participants to rate two different tasks on two different designs using three self-report items. By the ORPO rule, we can easily extend the data frame as below (showing a random selection of rows). I leave it to the reader to figure out how to press such a data set into the wide legacy format.
Using identifiers is good practice for several reasons. First, it reduces problems during manual data entry. Second, it allows you to efficiently record data in multi-method experiments and join the parts automatically, as shown in chapter [DM]. Lastly, the identifiers become statistically interesting in themselves when we turn to linear mixed-effects models and the notion of members of a population. Throughout the book I will use standard names for recurring identifier variables in design research, such as Part (participants), Task, Design, Item (self-report items) and Obs (individual observations).
Note that these entities usually get numerical identifiers, but these numbers are just labels. Throughout, such variables are written with an uppercase initial when they identify entities rather than represent real numbers. An exception is the trial order in experiments with massive repeated measures, which gets a numerical type to allow exploring effects over time, such as learning, training and fatigue.
| Part | Task | Design | Item | response |
|---|---|---|---|---|
| 1 | 1 | A | 2 | 1 |
| 1 | 1 | A | 3 | 1 |
| 1 | 2 | B | 2 | 5 |
| 1 | 2 | B | 3 | 7 |
| 2 | 2 | B | 3 | 6 |
| 3 | 2 | B | 1 | 7 |
Do you wonder about the strange use of %>% in my code above? This is the tidy way of programming data transformations in R.
The so-called magrittr (or pipe) operator %>% is part of the dplyr/tidyr framework for data manipulation. It chains steps of data manipulation by connecting transformation functions. In the following, we will first see a few basic examples. Later, we will proceed to longer transformation chains and see how graceful dplyr piping is compared to the classic data transformation syntax in R.
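To get a feel for what the operator does, consider this minimal illustration (using the Experiment data frame that appears in the next example): the pipe takes the object on its left and passes it as the first argument to the function on its right, so both lines below do the same.

## classic call: the data frame is the first argument
summarize(Experiment, mean_outcome = mean(outcome))
## piped call: Experiment is passed as the first argument of summarize
Experiment %>% summarize(mean_outcome = mean(outcome))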
Importing data from any of the possible sources will typically give a data frame. However, the researcher often wants to select or rename variables in the data frame. Say you want the variable Group to be called Condition, omit the variable age and store the new data frame as Exp. The select command does all this. In the following code, the data frame Experiment is piped into select. The variable Group is renamed to Condition, and the variable outcome is taken as-is. All other variables are discarded.
Exp <- Experiment %>%
select(Condition = Group, outcome)
Another frequent step in data analysis is cleaning the data of missing values and outliers. In this example, we first "inject" a few missing values into age (coded as -99) and a few outliers (>500) into the outcome variable; this uses some R commands that you do not need to understand yet. Then we reuse the above code for renaming (this time keeping age on board) and add some filtering steps (shown after the sketch below).
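The injection code itself is not printed here. Purely as an illustration (this is not the book's original code), such broken values could be planted as follows:

## hypothetical injection of broken values, for illustration only
set.seed(42)
Experiment$age[sample(nrow(Experiment), 3)] <- -99      # a few missing ages, coded -99
Experiment$outcome[sample(nrow(Experiment), 2)] <- 999  # a few outliers above 500

With the broken values in place, renaming and filtering proceed as follows: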
## rename, then filtering
Exp <-
Experiment %>%
select(Condition = Group, age, outcome) %>%
filter(outcome < 500) %>%
filter(age != -99)
Exp
| Condition | age | outcome |
|---|---|---|
| experimental | 22 | 231 |
| control | 24 | 204 |
| control | 23 | 194 |
| experimental | 33 | 244 |
| control | 29 | 230 |
| experimental | 22 | 253 |
| control | 32 | 210 |
| experimental | 22 | 255 |
| control | 19 | 213 |
| experimental | 31 | 260 |
| control | 31 | 221 |
| experimental | 26 | 242 |
| control | 28 | 214 |
| experimental | 31 | 251 |
| control | 22 | 216 |
| experimental | 30 | 248 |
| control | 21 | 213 |
| experimental | 33 | 248 |
| control | 28 | 211 |
| experimental | 33 | 248 |
| control | 28 | 219 |
| experimental | 31 | 249 |
| control | 33 | 203 |
| experimental | 18 | 251 |
| control | 34 | 219 |
| experimental | 19 | 248 |
| control | 33 | 202 |
| experimental | 18 | 249 |
| control | 29 | 207 |
| experimental | 27 | 244 |
| control | 28 | 209 |
| experimental | 23 | 257 |
| control | 29 | 211 |
| experimental | 21 | 250 |
| control | 32 | 222 |
| experimental | 23 | 256 |
| control | 20 | 211 |
| experimental | 22 | 243 |
| control | 34 | 215 |
| experimental | 34 | 254 |
| control | 35 | 177 |
| experimental | 20 | 219 |
| control | 28 | 206 |
| experimental | 29 | 268 |
| control | 31 | 178 |
| experimental | 30 | 251 |
| control | 33 | 212 |
| experimental | 34 | 253 |
| control | 30 | 222 |
| experimental | 33 | 273 |
| control | 22 | 210 |
| experimental | 31 | 262 |
| control | 28 | 215 |
| experimental | 35 | 253 |
| control | 19 | 203 |
| experimental | 19 | 252 |
| control | 23 | 201 |
| experimental | 19 | 243 |
| control | 22 | 199 |
| experimental | 23 | 242 |
| control | 19 | 207 |
| experimental | 23 | 255 |
| control | 28 | 202 |
| experimental | 18 | 228 |
| control | 20 | 193 |
| experimental | 28 | 241 |
| control | 22 | 204 |
| experimental | 24 | 256 |
| control | 23 | 201 |
| experimental | 30 | 236 |
| control | 18 | 222 |
| experimental | 31 | 237 |
| control | 24 | 221 |
| experimental | 19 | 258 |
| control | 24 | 215 |
| experimental | 26 | 255 |
| control | 20 | 216 |
| experimental | 35 | 233 |
| control | 32 | 213 |
| experimental | 32 | 268 |
| control | 32 | 209 |
| experimental | 31 | 251 |
| control | 33 | 220 |
| experimental | 32 | 251 |
| control | 28 | 212 |
| experimental | 29 | 255 |
| control | 30 | 213 |
| experimental | 20 | 249 |
| control | 33 | 209 |
| experimental | 32 | 270 |
| control | 31 | 209 |
| experimental | 18 | 231 |
| control | 25 | 209 |
| experimental | 19 | 254 |
| control | 32 | 211 |
| experimental | 24 | 271 |
During data preparation and analysis, new variables are created routinely. For example, a covariate is often centered (shifted by its mean) before it enters a linear regression:
mean_age <- mean(Exp$age)
Exp <- Exp %>%
mutate(age_cntrd = age - mean_age)
Exp
| Condition | age | outcome | age_cntrd |
|---|---|---|---|
| experimental | 22 | 231 | -4.719 |
| control | 24 | 204 | -2.719 |
| control | 23 | 194 | -3.719 |
| experimental | 33 | 244 | 6.281 |
| control | 29 | 230 | 2.281 |
| experimental | 22 | 253 | -4.719 |
| control | 32 | 210 | 5.281 |
| experimental | 22 | 255 | -4.719 |
| control | 19 | 213 | -7.719 |
| experimental | 31 | 260 | 4.281 |
| control | 31 | 221 | 4.281 |
| experimental | 26 | 242 | -0.719 |
| control | 28 | 214 | 1.281 |
| experimental | 31 | 251 | 4.281 |
| control | 22 | 216 | -4.719 |
| experimental | 30 | 248 | 3.281 |
| control | 21 | 213 | -5.719 |
| experimental | 33 | 248 | 6.281 |
| control | 28 | 211 | 1.281 |
| experimental | 33 | 248 | 6.281 |
| control | 28 | 219 | 1.281 |
| experimental | 31 | 249 | 4.281 |
| control | 33 | 203 | 6.281 |
| experimental | 18 | 251 | -8.719 |
| control | 34 | 219 | 7.281 |
| experimental | 19 | 248 | -7.719 |
| control | 33 | 202 | 6.281 |
| experimental | 18 | 249 | -8.719 |
| control | 29 | 207 | 2.281 |
| experimental | 27 | 244 | 0.281 |
| control | 28 | 209 | 1.281 |
| experimental | 23 | 257 | -3.719 |
| control | 29 | 211 | 2.281 |
| experimental | 21 | 250 | -5.719 |
| control | 32 | 222 | 5.281 |
| experimental | 23 | 256 | -3.719 |
| control | 20 | 211 | -6.719 |
| experimental | 22 | 243 | -4.719 |
| control | 34 | 215 | 7.281 |
| experimental | 34 | 254 | 7.281 |
| control | 35 | 177 | 8.281 |
| experimental | 20 | 219 | -6.719 |
| control | 28 | 206 | 1.281 |
| experimental | 29 | 268 | 2.281 |
| control | 31 | 178 | 4.281 |
| experimental | 30 | 251 | 3.281 |
| control | 33 | 212 | 6.281 |
| experimental | 34 | 253 | 7.281 |
| control | 30 | 222 | 3.281 |
| experimental | 33 | 273 | 6.281 |
| control | 22 | 210 | -4.719 |
| experimental | 31 | 262 | 4.281 |
| control | 28 | 215 | 1.281 |
| experimental | 35 | 253 | 8.281 |
| control | 19 | 203 | -7.719 |
| experimental | 19 | 252 | -7.719 |
| control | 23 | 201 | -3.719 |
| experimental | 19 | 243 | -7.719 |
| control | 22 | 199 | -4.719 |
| experimental | 23 | 242 | -3.719 |
| control | 19 | 207 | -7.719 |
| experimental | 23 | 255 | -3.719 |
| control | 28 | 202 | 1.281 |
| experimental | 18 | 228 | -8.719 |
| control | 20 | 193 | -6.719 |
| experimental | 28 | 241 | 1.281 |
| control | 22 | 204 | -4.719 |
| experimental | 24 | 256 | -2.719 |
| control | 23 | 201 | -3.719 |
| experimental | 30 | 236 | 3.281 |
| control | 18 | 222 | -8.719 |
| experimental | 31 | 237 | 4.281 |
| control | 24 | 221 | -2.719 |
| experimental | 19 | 258 | -7.719 |
| control | 24 | 215 | -2.719 |
| experimental | 26 | 255 | -0.719 |
| control | 20 | 216 | -6.719 |
| experimental | 35 | 233 | 8.281 |
| control | 32 | 213 | 5.281 |
| experimental | 32 | 268 | 5.281 |
| control | 32 | 209 | 5.281 |
| experimental | 31 | 251 | 4.281 |
| control | 33 | 220 | 6.281 |
| experimental | 32 | 251 | 5.281 |
| control | 28 | 212 | 1.281 |
| experimental | 29 | 255 | 2.281 |
| control | 30 | 213 | 3.281 |
| experimental | 20 | 249 | -6.719 |
| control | 33 | 209 | 6.281 |
| experimental | 32 | 270 | 5.281 |
| control | 31 | 209 | 4.281 |
| experimental | 18 | 231 | -8.719 |
| control | 25 | 209 | -1.719 |
| experimental | 19 | 254 | -7.719 |
| control | 32 | 211 | 5.281 |
| experimental | 24 | 271 | -2.719 |
Finally, for the descriptive statistics part of your report, you probably want to summarize the outcome variable per experimental condition. The following chain of commands first groups the data frame, then computes means and standard deviations. At every step, a data frame is piped into another command, which processes the data frame and outputs a data frame.
Exp %>%
group_by(Condition) %>%
summarize(mean = mean(outcome),
sd = sd(outcome) )
| Condition | mean | sd |
|---|---|---|
| control | 209 | 10.1 |
| experimental | 250 | 11.2 |
Good statistical graphics can vastly improve your own and your readers' understanding of data and results. This book exclusively introduces the modern ggplot2 graphics system of R, which is based on the grammar of graphics.
Every plot starts with piping a data frame into the ggplot(aes(...)) command. The aes(...) argument of ggplot creates the aesthetics, which is a mapping between variables and features of the plot (and only remotely has something to do with beauty). Review once again the code that produces the piles-of-sugar plot: the aesthetics map the variable Group to the fill color, whereas outcome is mapped to the x axis. For a plot to be valid, there must be at least one layer with a geometry. That example uses the density geometry, which calculates the density and maps it to the y axis.
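For orientation, a sketch of what such a plot command looks like (assuming the Experiment data frame with the variables Group and outcome, as before):

Experiment %>%
  ggplot(aes(x = outcome, fill = Group)) +  # map outcome to x, Group to fill color
  geom_density()                            # one layer: the density geometry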
The ggplot2 plotting system knows a full set of geometries, such as:

- geom_point()
- geom_smooth()
- geom_histogram()
- geom_boxplot()
- geom_violin()

For a brief demonstration of ggplot's basic functionality, we use the BAB1 data set of the BrowsingAB case. We attach the case environment and take a first look at the data:
attach(BrowsingAB)
BAB1
| Obs | Part | Task | Design | Gender | Education | age | Far_sighted | Small_font | ToT | clicks | returns |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | A | M | Middle | 49 | TRUE | FALSE | 189.6 | 12 | 4 |
| 2 | 2 | 1 | A | M | Low | 36 | FALSE | FALSE | 247.0 | 15 | 4 |
| 3 | 3 | 1 | A | F | Low | 66 | FALSE | FALSE | 226.3 | 14 | 2 |
| 4 | 4 | 1 | A | M | Middle | 38 | FALSE | FALSE | 250.8 | 10 | 6 |
| 5 | 5 | 1 | A | M | High | 50 | TRUE | FALSE | 134.7 | 15 | 2 |
| 6 | 6 | 1 | A | M | High | 48 | TRUE | FALSE | 182.1 | 5 | 2 |
| 7 | 7 | 1 | A | M | High | 66 | FALSE | FALSE | 197.5 | 8 | 3 |
| 8 | 8 | 1 | A | F | High | 29 | FALSE | FALSE | 221.0 | 11 | 3 |
| 9 | 9 | 1 | A | M | Middle | 46 | FALSE | FALSE | 151.4 | 10 | 1 |
| 10 | 10 | 1 | A | M | Low | 51 | FALSE | FALSE | 206.6 | 10 | 2 |
| 11 | 11 | 1 | A | F | Middle | 48 | TRUE | FALSE | 189.7 | 13 | 2 |
| 12 | 12 | 1 | A | M | High | 51 | TRUE | FALSE | 164.5 | 4 | 3 |
| 13 | 13 | 1 | A | M | High | 61 | FALSE | FALSE | 161.7 | 8 | 2 |
| 14 | 14 | 1 | A | F | Middle | 41 | TRUE | FALSE | 176.6 | 18 | 8 |
| 15 | 15 | 1 | A | F | Middle | 70 | TRUE | FALSE | 163.8 | 12 | 3 |
| 16 | 16 | 1 | A | M | Middle | 75 | TRUE | FALSE | 141.3 | 11 | 1 |
| 17 | 17 | 1 | A | M | Low | 26 | FALSE | FALSE | 212.0 | 13 | 1 |
| 18 | 18 | 1 | A | F | Middle | 32 | FALSE | FALSE | 259.8 | 17 | 5 |
| 19 | 19 | 1 | A | F | Middle | 49 | FALSE | FALSE | 227.8 | 7 | 1 |
| 20 | 20 | 1 | A | M | High | 39 | FALSE | FALSE | 204.4 | 11 | 8 |
| 21 | 21 | 1 | A | M | Middle | 48 | FALSE | FALSE | 141.9 | 12 | 3 |
| 22 | 22 | 1 | A | F | Middle | 55 | FALSE | FALSE | 182.3 | 9 | 2 |
| 23 | 23 | 1 | A | M | Middle | 33 | FALSE | FALSE | 189.5 | 11 | 2 |
| 24 | 24 | 1 | A | M | Middle | 64 | FALSE | FALSE | 209.2 | 10 | 5 |
| 25 | 25 | 1 | A | F | High | 29 | FALSE | FALSE | 195.5 | 6 | 1 |
| 26 | 26 | 1 | A | M | Middle | 61 | TRUE | FALSE | 185.8 | 13 | 1 |
| 27 | 27 | 1 | A | F | High | 35 | FALSE | FALSE | 196.3 | 11 | 2 |
| 28 | 28 | 1 | A | M | High | 45 | FALSE | FALSE | 153.0 | 12 | 2 |
| 29 | 29 | 1 | A | F | Low | 63 | FALSE | FALSE | 195.9 | 8 | 2 |
| 30 | 30 | 1 | A | M | High | 48 | FALSE | FALSE | 268.2 | 11 | 1 |
| 31 | 31 | 1 | A | M | High | 40 | FALSE | FALSE | 183.5 | 7 | 6 |
| 32 | 32 | 1 | A | M | Middle | 34 | TRUE | FALSE | 195.6 | 12 | 0 |
| 33 | 33 | 1 | A | F | Middle | 32 | TRUE | FALSE | 151.6 | 12 | 3 |
| 34 | 34 | 1 | A | M | High | 34 | TRUE | FALSE | 206.7 | 9 | 1 |
| 35 | 35 | 1 | A | F | High | 53 | FALSE | FALSE | 244.9 | 12 | 3 |
| 36 | 36 | 1 | A | M | Middle | 62 | FALSE | FALSE | 125.0 | 10 | 1 |
| 37 | 37 | 1 | A | F | High | 26 | FALSE | FALSE | 193.0 | 8 | 4 |
| 38 | 38 | 1 | A | F | Low | 51 | TRUE | FALSE | 215.5 | 17 | 2 |
| 39 | 39 | 1 | A | M | High | 71 | TRUE | FALSE | 230.2 | 11 | 2 |
| 40 | 40 | 1 | A | M | Middle | 62 | TRUE | FALSE | 243.2 | 11 | 1 |
| 41 | 41 | 1 | A | F | Low | 38 | TRUE | FALSE | 223.0 | 9 | 3 |
| 42 | 42 | 1 | A | F | Low | 26 | FALSE | FALSE | 252.3 | 14 | 2 |
| 43 | 43 | 1 | A | F | Middle | 54 | FALSE | FALSE | 189.8 | 15 | 1 |
| 44 | 44 | 1 | A | M | High | 69 | TRUE | FALSE | 234.1 | 8 | 4 |
| 45 | 45 | 1 | A | F | High | 51 | TRUE | FALSE | 152.2 | 8 | 0 |
| 46 | 46 | 1 | A | M | High | 61 | FALSE | FALSE | 245.2 | 13 | 1 |
| 47 | 47 | 1 | A | M | Low | 49 | FALSE | FALSE | 217.9 | 8 | 1 |
| 48 | 48 | 1 | A | M | High | 41 | TRUE | FALSE | 173.6 | 8 | 2 |
| 49 | 49 | 1 | A | M | Low | 77 | FALSE | FALSE | 274.9 | 15 | 3 |
| 50 | 50 | 1 | A | M | Low | 75 | FALSE | FALSE | 166.8 | 8 | 0 |
| 51 | 51 | 1 | A | F | High | 32 | FALSE | FALSE | 170.6 | 10 | 1 |
| 52 | 52 | 1 | A | F | Low | 77 | TRUE | FALSE | 212.1 | 12 | 3 |
| 53 | 53 | 1 | A | F | High | 72 | FALSE | FALSE | 193.2 | 17 | 4 |
| 54 | 54 | 1 | A | M | Middle | 61 | FALSE | FALSE | 212.4 | 6 | 3 |
| 55 | 55 | 1 | A | F | High | 50 | FALSE | FALSE | 185.8 | 11 | 3 |
| 56 | 56 | 1 | A | M | High | 48 | FALSE | FALSE | 205.0 | 7 | 1 |
| 57 | 57 | 1 | A | M | Low | 32 | FALSE | FALSE | 233.4 | 13 | 3 |
| 58 | 58 | 1 | A | F | Low | 26 | FALSE | FALSE | 290.6 | 9 | 3 |
| 59 | 59 | 1 | A | F | Low | 48 | TRUE | FALSE | 181.2 | 6 | 2 |
| 60 | 60 | 1 | A | M | High | 38 | TRUE | FALSE | 236.9 | 14 | 7 |
| 61 | 61 | 1 | A | M | High | 66 | TRUE | FALSE | 261.5 | 10 | 1 |
| 62 | 62 | 1 | A | M | Middle | 74 | TRUE | FALSE | 211.4 | 5 | 3 |
| 63 | 63 | 1 | A | M | Middle | 40 | FALSE | FALSE | 185.4 | 9 | 2 |
| 64 | 64 | 1 | A | M | Low | 69 | FALSE | FALSE | 178.4 | 9 | 2 |
| 65 | 65 | 1 | A | M | Middle | 33 | FALSE | FALSE | 220.2 | 11 | 2 |
| 66 | 66 | 1 | A | F | Low | 65 | TRUE | FALSE | 311.3 | 13 | 9 |
| 67 | 67 | 1 | A | F | Middle | 53 | TRUE | FALSE | 222.8 | 9 | 3 |
| 68 | 68 | 1 | A | M | Low | 45 | FALSE | FALSE | 167.8 | 16 | 4 |
| 69 | 69 | 1 | A | M | Low | 70 | FALSE | FALSE | 188.4 | 12 | 5 |
| 70 | 70 | 1 | A | F | Low | 37 | FALSE | FALSE | 239.8 | 15 | 4 |
| 71 | 71 | 1 | A | F | High | 48 | FALSE | FALSE | 148.9 | 9 | 3 |
| 72 | 72 | 1 | A | F | Middle | 66 | FALSE | FALSE | 251.4 | 12 | 6 |
| 73 | 73 | 1 | A | F | Middle | 24 | FALSE | FALSE | 102.0 | 12 | 1 |
| 74 | 74 | 1 | A | F | Middle | 43 | FALSE | FALSE | 253.4 | 17 | 6 |
| 75 | 75 | 1 | A | F | Middle | 25 | FALSE | FALSE | 210.8 | 12 | 6 |
| 76 | 76 | 1 | A | M | Low | 27 | FALSE | FALSE | 212.9 | 9 | 1 |
| 77 | 77 | 1 | A | F | High | 33 | TRUE | FALSE | 157.3 | 10 | 3 |
| 78 | 78 | 1 | A | F | Middle | 36 | TRUE | FALSE | 220.2 | 8 | 4 |
| 79 | 79 | 1 | A | M | High | 37 | TRUE | FALSE | 225.8 | 10 | 3 |
| 80 | 80 | 1 | A | F | High | 42 | TRUE | FALSE | 196.1 | 6 | 2 |
| 81 | 81 | 1 | A | M | High | 49 | FALSE | FALSE | 145.2 | 10 | 0 |
| 82 | 82 | 1 | A | F | High | 72 | TRUE | FALSE | 122.5 | 10 | 1 |
| 83 | 83 | 1 | A | F | Low | 64 | FALSE | FALSE | 189.9 | 7 | 1 |
| 84 | 84 | 1 | A | M | Low | 77 | TRUE | FALSE | 272.6 | 6 | 2 |
| 85 | 85 | 1 | A | M | High | 39 | FALSE | FALSE | 178.2 | 10 | 2 |
| 86 | 86 | 1 | A | M | High | 69 | FALSE | FALSE | 178.5 | 6 | 0 |
| 87 | 87 | 1 | A | F | High | 27 | FALSE | FALSE | 181.3 | 11 | 4 |
| 88 | 88 | 1 | A | F | High | 27 | FALSE | FALSE | 176.3 | 6 | 4 |
| 89 | 89 | 1 | A | F | Low | 63 | FALSE | FALSE | 211.0 | 9 | 1 |
| 90 | 90 | 1 | A | F | Low | 52 | FALSE | FALSE | 193.0 | 14 | 1 |
| 91 | 91 | 1 | A | M | Low | 56 | TRUE | FALSE | 222.9 | 10 | 10 |
| 92 | 92 | 1 | A | F | High | 27 | FALSE | FALSE | 184.8 | 5 | 4 |
| 93 | 93 | 1 | A | F | Low | 72 | TRUE | FALSE | 268.8 | 15 | 0 |
| 94 | 94 | 1 | A | M | Low | 70 | FALSE | FALSE | 212.5 | 9 | 4 |
| 95 | 95 | 1 | A | M | Low | 22 | FALSE | FALSE | 215.4 | 9 | 2 |
| 96 | 96 | 1 | A | M | Low | 62 | FALSE | FALSE | 275.4 | 12 | 0 |
| 97 | 97 | 1 | A | F | Middle | 52 | FALSE | FALSE | 182.7 | 9 | 2 |
| 98 | 98 | 1 | A | M | Low | 76 | TRUE | FALSE | 261.1 | 11 | 2 |
| 99 | 99 | 1 | A | M | High | 74 | TRUE | FALSE | 231.1 | 8 | 2 |
| 100 | 100 | 1 | A | M | High | 47 | FALSE | FALSE | 208.6 | 6 | 4 |
| 101 | 101 | 1 | B | M | Middle | 49 | TRUE | TRUE | 282.7 | 14 | 2 |
| 102 | 102 | 1 | B | M | Low | 36 | FALSE | TRUE | 196.4 | 7 | 2 |
| 103 | 103 | 1 | B | F | Low | 66 | FALSE | TRUE | 183.6 | 9 | 3 |
| 104 | 104 | 1 | B | M | Middle | 38 | FALSE | TRUE | 181.1 | 5 | 0 |
| 105 | 105 | 1 | B | M | High | 50 | TRUE | TRUE | 192.5 | 14 | 1 |
| 106 | 106 | 1 | B | M | High | 48 | TRUE | TRUE | 182.0 | 8 | 9 |
| 107 | 107 | 1 | B | M | High | 66 | FALSE | TRUE | 122.3 | 8 | 0 |
| 108 | 108 | 1 | B | F | High | 29 | FALSE | TRUE | 102.2 | 6 | 2 |
| 109 | 109 | 1 | B | M | Middle | 46 | FALSE | TRUE | 198.6 | 1 | 2 |
| 110 | 110 | 1 | B | M | Low | 51 | FALSE | TRUE | 223.4 | 10 | 2 |
| 111 | 111 | 1 | B | F | Middle | 48 | TRUE | TRUE | 217.9 | 11 | 3 |
| 112 | 112 | 1 | B | M | High | 51 | TRUE | TRUE | 180.5 | 5 | 7 |
| 113 | 113 | 1 | B | M | High | 61 | FALSE | TRUE | 144.9 | 5 | 2 |
| 114 | 114 | 1 | B | F | Middle | 41 | TRUE | TRUE | 284.4 | 8 | 6 |
| 115 | 115 | 1 | B | F | Middle | 70 | TRUE | TRUE | 269.8 | 7 | 4 |
| 116 | 116 | 1 | B | M | Middle | 75 | TRUE | TRUE | 185.0 | 10 | 4 |
| 117 | 117 | 1 | B | M | Low | 26 | FALSE | TRUE | 146.6 | 6 | 2 |
| 118 | 118 | 1 | B | F | Middle | 32 | FALSE | TRUE | 262.8 | 7 | 0 |
| 119 | 119 | 1 | B | F | Middle | 49 | FALSE | TRUE | 142.9 | 3 | 3 |
| 120 | 120 | 1 | B | M | High | 39 | FALSE | TRUE | 196.9 | 7 | 1 |
| 121 | 121 | 1 | B | M | Middle | 48 | FALSE | TRUE | 130.9 | 10 | 0 |
| 122 | 122 | 1 | B | F | Middle | 55 | FALSE | TRUE | 163.0 | 6 | 2 |
| 123 | 123 | 1 | B | M | Middle | 33 | FALSE | TRUE | 172.5 | 8 | 1 |
| 124 | 124 | 1 | B | M | Middle | 64 | FALSE | TRUE | 255.7 | 13 | 1 |
| 125 | 125 | 1 | B | F | High | 29 | FALSE | TRUE | 168.8 | 9 | 2 |
| 126 | 126 | 1 | B | M | Middle | 61 | TRUE | TRUE | 192.5 | 11 | 1 |
| 127 | 127 | 1 | B | F | High | 35 | FALSE | TRUE | 113.9 | 5 | 1 |
| 128 | 128 | 1 | B | M | High | 45 | FALSE | TRUE | 119.9 | 3 | 0 |
| 129 | 129 | 1 | B | F | Low | 63 | FALSE | TRUE | 115.6 | 11 | 4 |
| 130 | 130 | 1 | B | M | High | 48 | FALSE | TRUE | 100.2 | 11 | 0 |
| 131 | 131 | 1 | B | M | High | 40 | FALSE | TRUE | 124.4 | 6 | 1 |
| 132 | 132 | 1 | B | M | Middle | 34 | TRUE | TRUE | 203.4 | 15 | 5 |
| 133 | 133 | 1 | B | F | Middle | 32 | TRUE | TRUE | 259.9 | 14 | 4 |
| 134 | 134 | 1 | B | M | High | 34 | TRUE | TRUE | 218.0 | 12 | 4 |
| 135 | 135 | 1 | B | F | High | 53 | FALSE | TRUE | 186.9 | 15 | 1 |
| 136 | 136 | 1 | B | M | Middle | 62 | FALSE | TRUE | 153.9 | 5 | 0 |
| 137 | 137 | 1 | B | F | High | 26 | FALSE | TRUE | 116.2 | 6 | 0 |
| 138 | 138 | 1 | B | F | Low | 51 | TRUE | TRUE | 312.4 | 14 | 3 |
| 139 | 139 | 1 | B | M | High | 71 | TRUE | TRUE | 180.0 | 15 | 2 |
| 140 | 140 | 1 | B | M | Middle | 62 | TRUE | TRUE | 229.8 | 10 | 1 |
| 141 | 141 | 1 | B | F | Low | 38 | TRUE | TRUE | 225.0 | 12 | 3 |
| 142 | 142 | 1 | B | F | Low | 26 | FALSE | TRUE | 165.8 | 8 | 3 |
| 143 | 143 | 1 | B | F | Middle | 54 | FALSE | TRUE | 213.1 | 11 | 0 |
| 144 | 144 | 1 | B | M | High | 69 | TRUE | TRUE | 236.1 | 15 | 2 |
| 145 | 145 | 1 | B | F | High | 51 | TRUE | TRUE | 245.4 | 12 | 1 |
| 146 | 146 | 1 | B | M | High | 61 | FALSE | TRUE | 202.4 | 6 | 4 |
| 147 | 147 | 1 | B | M | Low | 49 | FALSE | TRUE | 157.6 | 14 | 0 |
| 148 | 148 | 1 | B | M | High | 41 | TRUE | TRUE | 201.0 | 8 | 4 |
| 149 | 149 | 1 | B | M | Low | 77 | FALSE | TRUE | 263.6 | 17 | 4 |
| 150 | 150 | 1 | B | M | Low | 75 | FALSE | TRUE | 138.8 | 9 | 0 |
| 151 | 151 | 1 | B | F | High | 32 | FALSE | TRUE | 151.7 | 4 | 3 |
| 152 | 152 | 1 | B | F | Low | 77 | TRUE | TRUE | 232.6 | 13 | 2 |
| 153 | 153 | 1 | B | F | High | 72 | FALSE | TRUE | 212.4 | 7 | 1 |
| 154 | 154 | 1 | B | M | Middle | 61 | FALSE | TRUE | 194.3 | 6 | 0 |
| 155 | 155 | 1 | B | F | High | 50 | FALSE | TRUE | 151.4 | 6 | 2 |
| 156 | 156 | 1 | B | M | High | 48 | FALSE | TRUE | 170.8 | 5 | 1 |
| 157 | 157 | 1 | B | M | Low | 32 | FALSE | TRUE | 126.4 | 11 | 2 |
| 158 | 158 | 1 | B | F | Low | 26 | FALSE | TRUE | 178.9 | 5 | 2 |
| 159 | 159 | 1 | B | F | Low | 48 | TRUE | TRUE | 187.1 | 11 | 1 |
| 160 | 160 | 1 | B | M | High | 38 | TRUE | TRUE | 249.1 | 9 | 1 |
| 161 | 161 | 1 | B | M | High | 66 | TRUE | TRUE | 218.9 | 9 | 3 |
| 162 | 162 | 1 | B | M | Middle | 74 | TRUE | TRUE | 251.3 | 13 | 4 |
| 163 | 163 | 1 | B | M | Middle | 40 | FALSE | TRUE | 172.4 | 16 | 2 |
| 164 | 164 | 1 | B | M | Low | 69 | FALSE | TRUE | 147.6 | 15 | 4 |
| 165 | 165 | 1 | B | M | Middle | 33 | FALSE | TRUE | 114.8 | 9 | 3 |
| 166 | 166 | 1 | B | F | Low | 65 | TRUE | TRUE | 303.6 | 21 | 6 |
| 167 | 167 | 1 | B | F | Middle | 53 | TRUE | TRUE | 211.2 | 7 | 2 |
| 168 | 168 | 1 | B | M | Low | 45 | FALSE | TRUE | 174.7 | 9 | 1 |
| 169 | 169 | 1 | B | M | Low | 70 | FALSE | TRUE | 195.1 | 11 | 4 |
| 170 | 170 | 1 | B | F | Low | 37 | FALSE | TRUE | 174.1 | 11 | 1 |
| 171 | 171 | 1 | B | F | High | 48 | FALSE | TRUE | 141.4 | 6 | 1 |
| 172 | 172 | 1 | B | F | Middle | 66 | FALSE | TRUE | 168.5 | 7 | 4 |
| 173 | 173 | 1 | B | F | Middle | 24 | FALSE | TRUE | 64.4 | 2 | 2 |
| 174 | 174 | 1 | B | F | Middle | 43 | FALSE | TRUE | 195.4 | 15 | 3 |
| 175 | 175 | 1 | B | F | Middle | 25 | FALSE | TRUE | 207.9 | 13 | 3 |
| 176 | 176 | 1 | B | M | Low | 27 | FALSE | TRUE | 221.7 | 7 | 1 |
| 177 | 177 | 1 | B | F | High | 33 | TRUE | TRUE | 174.3 | 10 | 2 |
| 178 | 178 | 1 | B | F | Middle | 36 | TRUE | TRUE | 235.2 | 12 | 3 |
| 179 | 179 | 1 | B | M | High | 37 | TRUE | TRUE | 243.8 | 3 | 3 |
| 180 | 180 | 1 | B | F | High | 42 | TRUE | TRUE | 209.2 | 15 | 4 |
| 181 | 181 | 1 | B | M | High | 49 | FALSE | TRUE | 141.2 | 8 | 2 |
| 182 | 182 | 1 | B | F | High | 72 | TRUE | TRUE | 158.9 | 15 | 1 |
| 183 | 183 | 1 | B | F | Low | 64 | FALSE | TRUE | 167.4 | 9 | 2 |
| 184 | 184 | 1 | B | M | Low | 77 | TRUE | TRUE | 286.4 | 15 | 4 |
| 185 | 185 | 1 | B | M | High | 39 | FALSE | TRUE | 96.9 | 5 | 4 |
| 186 | 186 | 1 | B | M | High | 69 | FALSE | TRUE | 166.0 | 7 | 0 |
| 187 | 187 | 1 | B | F | High | 27 | FALSE | TRUE | 113.9 | 9 | 3 |
| 188 | 188 | 1 | B | F | High | 27 | FALSE | TRUE | 145.5 | 10 | 0 |
| 189 | 189 | 1 | B | F | Low | 63 | FALSE | TRUE | 180.3 | 12 | 1 |
| 190 | 190 | 1 | B | F | Low | 52 | FALSE | TRUE | 156.1 | 6 | 3 |
| 191 | 191 | 1 | B | M | Low | 56 | TRUE | TRUE | 260.2 | 17 | 6 |
| 192 | 192 | 1 | B | F | High | 27 | FALSE | TRUE | 151.3 | 8 | 0 |
| 193 | 193 | 1 | B | F | Low | 72 | TRUE | TRUE | 265.6 | 10 | 3 |
| 194 | 194 | 1 | B | M | Low | 70 | FALSE | TRUE | 169.4 | 6 | 0 |
| 195 | 195 | 1 | B | M | Low | 22 | FALSE | TRUE | 150.0 | 8 | 1 |
| 196 | 196 | 1 | B | M | Low | 62 | FALSE | TRUE | 205.1 | 12 | 1 |
| 197 | 197 | 1 | B | F | Middle | 52 | FALSE | TRUE | 226.9 | 11 | 2 |
| 198 | 198 | 1 | B | M | Low | 76 | TRUE | TRUE | 214.2 | 10 | 1 |
| 199 | 199 | 1 | B | M | High | 74 | TRUE | TRUE | 290.9 | 16 | 3 |
| 200 | 200 | 1 | B | M | High | 47 | FALSE | TRUE | 145.4 | 9 | 2 |
The BrowsingAB case is a virtual series of studies in which two websites were compared by how long it takes users to complete a given task, the time-on-task (ToT). Besides the design factor, a number of additional variables exist that could also play a role for ToT. We explore the data set with ggplot.
We begin with a plot that shows the association between participant age and time-on-task (ToT). Both variables are metric, which suggests putting them on a 2D plane with coordinates x and y: a scatter plot.
BAB1 %>%
ggplot(aes(x = age, y = ToT)) +
geom_point()
Let’s take a look at the elements of the command chain: The first two lines pipe the data frame into the ggplot engine.
BAB1 %>%
ggplot(...)
At that moment, the ggplot engine "knows" which variables the data frame contains and hence which are available for the plot. It does not yet know which variables are being used, and how. The next step, usually, is to choose a basic geometry (there exist more than 30) and put it on a layer. The scatter plot geometry of ggplot is geom_point:
BAB1 %>%
ggplot(...) +
geom_point()
The last step is the aesthetic mapping, which tells ggplot the variables to use and how to map them to aesthetic properties of the geometry. The basic properties of points in a coordinate system are the x and y-positions:
BAB1 %>%
ggplot(aes(x = age, y = ToT)) +
geom_point()
The function aes creates a mapping in which the aesthetic for each variable is given. When we call aes directly, we see that it is just a table of mappings:
aes(x = age, y = ToT)
## * x -> age
## * y -> ToT
One tiny detail in the above chain has not yet been explained: the +. When choosing the geometry, you actually add a layer to the plot. This is, of course, not the literal mathematical sum. Technically, what the author of the ggplot2 package did was to overload the + operator. A large set of ggplot functions can be combined in a myriad of ways, just using +. The overloaded + in ggplot is a brilliant analogy: you can chain ggplot functions indefinitely, just as you can build arbitrarily long sums. You can also store a ggplot object and modify it later by adding further functions. The analogy has its limits, though: unlike sums, order matters in ggplot combinations. The first element in the chain is always ggplot, and layers are drawn on top of each other.
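As a minimal sketch of this property, a plot can be stored in a variable, extended later by adding further functions, and rendered by calling the object:

## store an unfinished plot ...
G <- BAB1 %>%
  ggplot(aes(x = age, y = ToT))
## ... add a layer later ...
G <- G + geom_point()
## ... and render it by calling the object
G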
Let's move on to a slightly different situation that calls for a different geometry. Say we are interested in the distribution of the time-on-task measures under the two designs. We need a geometry that visualizes the distribution of a quantitative variable split by a grouping variable (a factor). The box plot does the job:
BAB1 %>%
ggplot(aes(x = Design, y = ToT)) +
geom_boxplot()
The box plot maps ToT to y (again). The factor Design is represented as a split on the x axis. Interestingly, the box plot does not represent the data as raw values, as the scatter plot does. The geometry actually performs an analysis on ToT, which produces five statistics: minimum, first quartile, median, third quartile and maximum. These statistics define the vertical positions of the box edges and whisker end points.
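These five statistics can also be computed directly with a summarize chain, which is roughly what the box plot geometry does internally (a sketch; geom_boxplot additionally treats extreme points as outliers):

BAB1 %>%
  group_by(Design) %>%
  summarize(min    = min(ToT),
            lower  = quantile(ToT, .25),
            median = median(ToT),
            upper  = quantile(ToT, .75),
            max    = max(ToT))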
Now we combine all three variables in one plot: how does the association between ToT and age differ by design? As we have two quantitative variables, we stay with the scatter plot for now. Since we intend to separate the groups, we need a property of points that distinguishes them. Points offer several additional aesthetics, such as color, size and shape. We choose color and add it to the aesthetic mapping in aes:
BAB1 %>%
ggplot(aes(x = age, y = ToT, color = Design)) +
geom_point()
Now we can distinguish the groups visually, but there is too much clutter to discover any relation. With the box plot we saw that some geometries do not represent the raw data but summaries (statistics) of the data. For scatter plots, a geometry that does the job of summarizing the trend is geom_smooth. This geometry summarizes a cloud of points by drawing a LOESS-smoothed line through it. Note how the color mapping is applied to all geometry layers.
BAB1 %>%
ggplot(aes(x = age, y = ToT, color = Design)) +
geom_point() +
geom_smooth()
We see a highly interesting pattern: the association between age and ToT follows two slightly different mirrored sigmoid curves.
Now that we have represented three variables with properties of geometries, what if we wanted to add a fourth one, say education level? Formally, we could use yet another aesthetic, say the shape of points, to represent it. You can easily imagine that this would no longer result in a clear figure. In situations with many factors, or factors with many levels, it is impossible to reasonably represent them all in one plot. The alternative is to use faceting. A facet splits the data by a grouping variable and creates a separate plot for every group:
BAB1 %>%
ggplot(aes(x = age, y = ToT, color = Design)) +
geom_point() +
geom_smooth() +
facet_grid(Education ~ .)
See how the facet_grid command takes a formula instead of just a variable name. This makes faceting the primary choice for high-dimensional situations. For example, we may also choose to represent both factors, Design and Education, by facets:
BAB1 %>%
ggplot(aes(x = age, y = ToT, color = Design)) +
geom_point() +
geom_smooth() +
facet_grid(Education ~ Design)
Note how the color aesthetic, although now redundant, is kept. It is possible to map several aesthetics (or facets) to one variable, but not vice versa.
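For example (a sketch), the factor Design can be mapped to the color and the shape of points at the same time:

BAB1 %>%
  ggplot(aes(x = age, y = ToT,
             color = Design,    # Design mapped to color ...
             shape = Design)) + # ... and to point shape
  geom_point()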
Above, we have seen examples of functions that boil a vector down to a single statistic, like the mean. R also has functions that summarize data in more complex ways. One function with a wide range of applications is the lm command, which fits regression models to data (provided as data frames).
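As a brief sketch of this classic engine (reusing the Exp data frame from the previous section; the stan_glm engine used below shares the same formula language):

## ordinary least-squares fit of outcome on Condition and age
M_lm <- lm(outcome ~ Condition + age, data = Exp)
coef(M_lm)  # extract the estimated coefficients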
In the following, we will use another simulated data frame Exp to demonstrate linear models. To make this more interesting, we simulate Exp in a slightly advanced way, with quantitative associations between variables. Note how the expected value \(\mu\) is created from the variables Condition and age. The last step adds (somewhat) realistic noise to the measures by drawing from a normal distribution with mean \(\mu\).
N_Obs <- 20
Exp <-
data_frame(Obs = 1:N_Obs,
Condition = rep(c("Experimental", "Control"),
N_Obs/2),
age = runif(N_Obs, 18, 35),
mu = 200 + (Condition == "Control") * 50 + age * 1,
outcome = rnorm(N_Obs, mu, 10))
The experiment involves two groups, which in classic statistics would clearly point to what is commonly referred to as ANOVA. As will turn out in [LM], old-fashioned ANOVA can be replaced by a rather simple regression model, which I call the comparison of groups model (CGM). The estimation of regression models is done by a regression engine, which basically is a (very powerful) R command. The specification of any regression model is given in R's formula language. Learning this formula language is key to unleashing the power of regression models in R. We can fit a CGM on the data frame Exp using the regression engine stan_glm. The desired model estimates the effect of Condition on outcome. This produces a regression object that contains an abundance of information, much of it of lesser interest for now. (One piece of information it does not contain is F statistics and p-values, which is why, strictly speaking, it is not an ANOVA!) The foremost question is how strong the difference between the groups is. The fixef command extracts the parameter estimates from the model to answer this question.
M_1 <-
stan_glm(outcome ~ Condition,
data = Exp)
fixef(M_1)
| model | type | nonlin | fixef | re_factor | re_entity | center | lower | upper |
|---|---|---|---|---|---|---|---|---|
| object | fixef | NA | Intercept | NA | NA | 209.5 | 206.4 | 212.4 |
| object | fixef | NA | Conditionexperimental | NA | NA | 40.3 | 35.9 | 44.8 |
Another classic model is linear regression, where outcome is predicted by a metric variable, say age. The stan_glm regression engine is truly multi-purpose and does the job with grace:
M_2 <-
stan_glm(outcome ~ age,
data = Exp)
fixef(M_2)
| model | type | nonlin | fixef | re_factor | re_entity | center | lower | upper |
|---|---|---|---|---|---|---|---|---|
| object | fixef | NA | Intercept | NA | NA | 229.676 | 204.37 | 253.835 |
| object | fixef | NA | age | NA | NA | 0.004 | -0.87 | 0.903 |
If you are interested in both predictors at the same time, you can combine them in one model with the following formula:
M_3 <-
stan_glm(outcome ~ Condition + age,
data = Exp)
fixef(M_3)
| model | type | nonlin | fixef | re_factor | re_entity | center | lower | upper |
|---|---|---|---|---|---|---|---|---|
| object | fixef | NA | Intercept | NA | NA | 201.502 | 190.145 | 212.808 |
| object | fixef | NA | Conditionexperimental | NA | NA | 40.633 | 36.207 | 44.855 |
| object | fixef | NA | age | NA | NA | 0.296 | -0.109 | 0.693 |
A statistical model has several components, for example the coefficients and residuals. Models are complex objects, from which a variety of inferences can be made. For example, the coefficient estimates can be extracted and used for prediction. This is what fixef() does in the above code.
A number of functions can be used to extract certain aspects of the model. For example:
- fixef(model) extracts the linear effects
- residuals(model) extracts the measurement errors
- predict(model) extracts the expected values

These will all be covered in later chapters.
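As a quick preview (a sketch, output omitted), these extractors can already be applied to the model M_3 from above:

fixef(M_3)            # the estimated linear effects
head(residuals(M_3))  # the first few residuals
head(predict(M_3))    # the first few expected values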
As you have seen throughout this chapter, with R you can effectively manage data, create expressive graphics and conveniently estimate statistical models. Then usually comes the painful moment where all of this needs to be assembled into a comprehensive report. That pain is a thing of the past, because in a modern R environment you can write whole reports, even whole books, without much hassle. The basic principle is that a report is written as a plain text document in which chunks of R code are embedded. During compilation, the R chunks are executed and, by default, code and output are inserted and formatted along with the text.
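A minimal sketch of such a report source (assuming the R Markdown format, a common choice; this is an illustration, not a file from the book): everything between the chunk markers is executed as R code, the rest is ordinary text.

---
title: "BrowsingAB: comparing two websites"
output: html_document
---

The two designs were compared on time-on-task (ToT):

```{r}
library(tidyverse)
BAB1 %>%                    # assuming BAB1 has been loaded
  group_by(Design) %>%
  summarize(mean_ToT = mean(ToT))
```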
In the book package (directory /Data) you will find the data set of the (virtual) study BAB1, which we will be using in the coming chapters. This data comes as a comma-separated value file with the file ending .csv. Load this file into R using the read_csv command and inspect the data frame.
Find the documentation of the packages haven and readr. Pick two import functions: the one that is most useful for you and the one that you consider most exotic.
We have seen how to extract a data frame column as a vector using the double square brackets. There seems to be no such option to extract an individual row as a vector. Why? (Think about object types).
Search the web for geometries that are useful for plotting associations between grouping variables (factors) and metric variables. Try them all on the BAB1 data frame. Compare the geometries with respect to which properties of the data they convey.
Like data frames and regression results, plots produced by ggplot are complex objects, too. Create an arbitrary plot, store it in a variable and inspect it using class, summary and str. In addition, what happens when you assign the plot to a variable and what happens when you call the variable?
Revisit the python-swallowed-camel plot and check out how the aesthetic mapping is created. The plot uses a density geometry. Change it into a histogram. Then produce a box plot that shows the two conditions (think carefully about the mappings of x and y).
Use the data set BAB5 in BrowsingAB. It contains a follow-up experiment, where participants had to do five different tasks on the website. Plot the association between age and ToT by task, using color. Then put Task on a facet grid, and use color to represent Design again.
Use the data set BAB5 in BrowsingAB. Using a transformation chain, take the sum or average of participants’ ToT. Then run a few simple regression models.
Use your own data. Drop an Excel and/or an SPSS file into the same directory as your current R file. Read the data into a data frame and summarize what is in the data frame. Use ggplot to create one or more exploratory graphs. Then use dplyr summarize to create summary statistics.
R for Data Science is a book co-authored by Hadley “Tidy” Wickham.
ggplot2 Version of Figures in “25 Recipes for Getting Started with R” for readers who are familiar with the legacy plotting commands in R
Introduction to dplyr for Faster Data Manipulation in R introduces dplyr, the next generation R interface for data manipulation, which is used extensively in this book.
Quick-R is a comprehensive introduction to many common statistical techniques with R.
Code as manuscript features a small set of lessons with code examples, assignments and further resources, for if you are in a hurry.