This document is written using R Markdown. The source code is available in 01-intro-lab-answers.R.
Here are the number of words recalled by 5 subjects. Factor levels are given as: Neg, Neu, Pos (http://personality-project.org/).
Jim, 32, 15, 45
Victor, 30, 13, 40
Faye, 26, 12, 42
Ron, 22, 10, 38
Jason, 29, 8, 35
dfrm, will have 3 columns named subject, condition, and words.If you feel comfortable with IO operations in R, you can work directly with the data file words.dat.
There are three variables in total, and subject and condition are considered as categorical variables. Each subject has three records for words. The best way to store such data is to create a data frame with three columns, but we will need to repeat conditions across subjects. This can be done using the rep() command.
subject <- factor(rep(c("Jim", "Victor", "Faye", "Ron", "Jason"), each = 3))
cond <- factor(rep(c("Neg", "Neu", "Pos"), 5))
words <- c(32, 15, 45, 30, 13, 40, 26, 12, 42, 22, 10, 38, 29, 8, 35)
dfrm <- data.frame(subject, cond, words)
head(dfrm)
## subject cond words
## 1 Jim Neg 32
## 2 Jim Neu 15
## 3 Jim Pos 45
## 4 Victor Neg 30
## 5 Victor Neu 13
## 6 Victor Pos 40
We can subset the data frame by filtering data matching the following criteria:
dfrm[subject == "Victor", ]
## subject cond words
## 4 Victor Neg 30
## 5 Victor Neu 13
## 6 Victor Pos 40
Note that this filter applies on rows, hence it appears before the comma in brackets. Leaving right part (after the comma) blank means that R will return all columns of the data frame. If we only want a specific column, we can use an expression like this:
dfrm[subject == "Victor", "words"]
## [1] 30 13 40
mean(dfrm[subject == "Victor", "words"])
## [1] 27.67
As we will see later, it is more convenient to use the subset() command, which allows to apply filters on rows and columns in a friendly manner, e.g.
subset(dfrm, subject == "Victor", "words")
To compute means for each condition, we can use the same process and filter the data frame with logical operators. For instance, to get the mean in the neutral condition (Neu), we would use
mean(dfrm[cond == "Neu", "words"])
## [1] 11.6
However, we would need to repeat this instruction three times (i.e., for each level of the factor). A faster way to compute marginal means is to aggregate the data like this:
aggregate(words ~ cond, data = dfrm, FUN = mean)
## cond words
## 1 Neg 27.8
## 2 Neu 11.6
## 3 Pos 40.0
Remark: It is possible to read the data directly rather than creating a data frame manually. Here is a possible solution :
tmp <- read.table("../data/words.dat", sep = ",", skip = 2, header = FALSE, col.names = c("subject",
"Neg", "Neu", "Pos"))
## Warning: readTableHeader a trouvé une ligne finale incomplète dans
## '../data/words.dat'
library(reshape2)
dfrm <- melt(tmp)
## Using subject as id variables
head(dfrm)
## subject variable value
## 1 Jim Neg 32
## 2 Victor Neg 30
## 3 Faye Neg 26
## 4 Ron Neg 22
## 5 Jason Neg 29
## 6 Jim Neu 15
The file brain_size.csv contains data from a study on the relationship between brain size and weight and intelligence (Willerman et al. 1991).
The data are stored in a CSV file, where records (fileds) are separated by semicolons. The first six lines of brain_size.csv are displayed below:
"";"Gender";"FSIQ";"VIQ";"PIQ";"Weight";"Height";"MRI_Count"
"1";"Female";133;132;124;"118";"64.5";816932
"2";"Male";140;150;124;".";"72.5";1001121
"3";"Male";139;123;150;"143";"73.3";1038437
"4";"Male";133;129;128;"172";"68.8";965353
"5";"Female";137;132;134;"147";"65.0";951545
"6";"Female";99;90;110;"146";"69.0";928799
The read.csv2() command can be used to read this kind of file. If comma were to be used to separate records, we would use read.csv() instead. You will have to change the pathname depending on where brain_size.csv is located in your file system. Alternatively, you can update the working directory to point to the folder where this file was saved.
d <- read.csv2("../data/brain_size.csv")
head(d)
## X Gender FSIQ VIQ PIQ Weight Height MRI_Count
## 1 1 Female 133 132 124 118 64.5 816932
## 2 2 Male 140 150 124 . 72.5 1001121
## 3 3 Male 139 123 150 143 73.3 1038437
## 4 4 Male 133 129 128 172 68.8 965353
## 5 5 Female 137 132 134 147 65.0 951545
## 6 6 Female 99 90 110 146 69.0 928799
As can be seen, the first column is not really interesting and can be safely removed.
d <- d[, -1]
head(d)
## Gender FSIQ VIQ PIQ Weight Height MRI_Count
## 1 Female 133 132 124 118 64.5 816932
## 2 Male 140 150 124 . 72.5 1001121
## 3 Male 139 123 150 143 73.3 1038437
## 4 Male 133 129 128 172 68.8 965353
## 5 Female 137 132 134 147 65.0 951545
## 6 Female 99 90 110 146 69.0 928799
Note that R automatically creates unique identifier for statistical units arranged in rows: they are called rownames. Variables are stored as columns headers and are known as colnames, or names in the case of a data frame:
names(d)
## [1] "Gender" "FSIQ" "VIQ" "PIQ" "Weight" "Height"
## [7] "MRI_Count"
A convenient command to display a general summary of the data is the summary() function. For categorical data, il will display a table of counts, and for numerical variables a 6-number summary (range, IQR and measures of central tendency). The summary() command is also useful to spot possible coding problem.
summary(d)
## Gender FSIQ VIQ PIQ Weight
## Female:20 Min. : 77.0 Min. : 71 Min. : 72.0 . : 2
## Male :20 1st Qu.: 89.8 1st Qu.: 90 1st Qu.: 88.2 118 : 2
## Median :116.5 Median :113 Median :115.0 127 : 2
## Mean :113.5 Mean :112 Mean :111.0 143 : 2
## 3rd Qu.:135.5 3rd Qu.:130 3rd Qu.:128.0 146 : 2
## Max. :144.0 Max. :150 Max. :150.0 155 : 2
## (Other):28
## Height MRI_Count
## 64.5 : 4 Min. : 790619
## 66.5 : 3 1st Qu.: 855918
## 68.0 : 3 Median : 905399
## 69.0 : 3 Mean : 908755
## 63.0 : 2 3rd Qu.: 950078
## 66.0 : 2 Max. :1079549
## (Other):23
In this case, we can notice that some variables are not treated as numerical variables by R, e.g. Weight and Height. This can be explained by the presence of missing values encoded as “.”, whereas R usually expects NA or blanks. We could update the reading command to specify how missing data are coded. In this case, we would write
d <- read.csv2("../data/brain_size.csv", na.strings = ".")
However, it is still possible to convert these two variables as numeric as follows:
d$Weight <- as.numeric(as.character(d$Weight))
## Warning: NAs introduits lors de la conversion automatique
d$Height <- as.numeric(as.character(d$Height))
## Warning: NAs introduits lors de la conversion automatique
summary(d)
## Gender FSIQ VIQ PIQ Weight
## Female:20 Min. : 77.0 Min. : 71 Min. : 72.0 Min. :106
## Male :20 1st Qu.: 89.8 1st Qu.: 90 1st Qu.: 88.2 1st Qu.:135
## Median :116.5 Median :113 Median :115.0 Median :146
## Mean :113.5 Mean :112 Mean :111.0 Mean :151
## 3rd Qu.:135.5 3rd Qu.:130 3rd Qu.:128.0 3rd Qu.:172
## Max. :144.0 Max. :150 Max. :150.0 Max. :192
## NA's :2
## Height MRI_Count
## Min. :62.0 Min. : 790619
## 1st Qu.:66.0 1st Qu.: 855918
## Median :68.0 Median : 905399
## Mean :68.5 Mean : 908755
## 3rd Qu.:70.5 3rd Qu.: 950078
## Max. :77.0 Max. :1079549
## NA's :1
It is important to first recode raw values to characters, and then to numeric values. Now, we can confirm that there are 3 missing observations in total. Another way to count missing data is to rely on the is.na() command. It returns TRUE (value is set as missing) or FALSE, which can be counted in R as any integer (TRUE=1). E.g.,
sum(is.na(d$Weight))
## [1] 2
table(is.na(d$Weight))
##
## FALSE TRUE
## 38 2
The summary() command also gives the number of males and females. We could also use table() directly:
table(d[, "Gender"]) # or table(d[,1])
##
## Female Male
## 20 20
Average full IQ is computed as follows:
mean(d$FSIQ)
## [1] 113.5
Since there's no missing data, we do not have to add the add.rm=TRUE option. If we want to restrict the analysis to males, we can use the following command:
mean(d$FSIQ[d$Gender == "Male"])
## [1] 115
Likewise, we would use mean(d$FSIQ[d$Gender == "Female"]) to compute mean FSQI for females. But, as described in Application 1, it is often easier to use built-in aggregating command like this:
aggregate(FSIQ ~ Gender, d, mean)
## Gender FSIQ
## 1 Female 111.9
## 2 Male 115.0
Finally, average MRI counts in log units can be computed as follows:
mean(log(d$MRI_Count, base = 10)) # or we can use log10
## [1] 5.957
Willerman L, Schultz R, Rutledge J and Bigler E (1991). “In Vivo Brain Size and Intelligence.” Intelligence, 15, pp. 223-228.