Data analysis and visualization using R

November 2015

Complex datatypes and IO

Contents

Matrices
Factors
Lists
Data frames
Reading dataframes from file (first iteration)
Plotting with dataframes

Matrices

Matrices are vectors with dimensions

We will not go into detail on them in this course; this one slide is all they get
This does not mean they are unimportant; they are just not the focus here

m <- matrix(1:10, nrow = 2, ncol = 5); m

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10

v <- 1:10; dim(v) <- c(2, 5); v

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
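As a small sketch of what the dimensions buy you: elements of a matrix can be addressed by row and column. The name `m_by_row` is only introduced here for illustration.

```r
# Build the same 2 x 5 matrix as above
m <- matrix(1:10, nrow = 2, ncol = 5)

dim(m)      # dimensions: number of rows, then number of columns
m[2, 3]     # single element: row 2, column 3
m[, 5]      # all rows of column 5 (returned as a plain vector)
m[1, ]      # all columns of row 1

# Values are filled column by column unless byrow = TRUE is given
m_by_row <- matrix(1:10, nrow = 2, ncol = 5, byrow = TRUE)
m_by_row[2, 1]
```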

Factors

Although factors are not really complex, I saved them for this presentation because they have some strange behaviour
Factors are used to represent data on nominal or ordinal scales
A nominal scale has no order; an ordinal scale does
These functions are relevant
- factor(x)
- as.factor(x)
- factor(x, levels = my_levels)

Character to factor

Suppose you have surveyed the eye colors in your classroom and found these values

eye_colors <- c("green", "blue", "brown", "brown", "blue", "brown", "brown", "brown", "blue", "brown", "green", "brown", "brown", "blue", "blue", "brown")

Next, you would like to plot or tabulate these findings

Plot character data

Simply plotting gives an error

plot(eye_colors)

Error in plot.window(...): need finite 'ylim' values

Plot factor data

Plotting a character vector converted to a factor is easy

eye_colors <- as.factor(eye_colors)
plot(eye_colors)

Tabulate factor data

Factors are also really easy to tabulate and filter

table(eye_colors)

eye_colors
 blue brown green 
    5     9     2

sum(eye_colors == "blue")

[1] 5
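Much of the "strange behaviour" of factors comes from how they are stored: as integer codes that point into a vector of levels. A minimal sketch, using only the first five survey values (the name `eye_sample` is made up for this example):

```r
# First five values of the eye color survey, converted to a factor
eye_sample <- factor(c("green", "blue", "brown", "brown", "blue"))

levels(eye_sample)       # the distinct categories, sorted alphabetically
as.integer(eye_sample)   # the internal codes: positions in levels()

# Filtering a factor keeps all levels, even those no longer present
only_blue <- eye_sample[eye_sample == "blue"]
levels(only_blue)
```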

Defining levels

Especially when working with ordinal scales, defining the order of the factor levels is useful
By default, R sorts the levels (alphabetically for text, numerically for numbers)
You can even define levels that do not occur in the data, as shown in the next slide

Factors with ordinal scale

classSizes <- factor(
    c("big","small","huge","huge","small","big","small","big"),
    levels = c("small", "normal", "big", "huge"), ordered = TRUE)
plot(classSizes)
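The level "normal" does not occur in the data, but it is still part of the factor. A short sketch of how this shows up when tabulating, and how `droplevels()` removes unused levels again:

```r
classSizes <- factor(
    c("big", "small", "huge", "huge", "small", "big", "small", "big"),
    levels = c("small", "normal", "big", "huge"), ordered = TRUE)

# The unused level is counted as zero instead of being left out
table(classSizes)

# droplevels() returns a copy without the levels that do not occur
table(droplevels(classSizes))
```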

Calculations with factors in Ordinal scale

When you have an ordered factor, you can do some calculations with it

classSizes < "big"

[1] FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE

sum(classSizes == "huge")

[1] 2
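Because the levels are ordered, functions that rely on an ordering also work. This is a sketch of a few of them, not an exhaustive list:

```r
classSizes <- factor(
    c("big", "small", "huge", "huge", "small", "big", "small", "big"),
    levels = c("small", "normal", "big", "huge"), ordered = TRUE)

max(classSizes)                    # the highest level that occurs
sort(classSizes)                   # sorted by level order, not alphabetically
classSizes[classSizes >= "big"]    # keep only "big" and "huge"
```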

Convert existing factors

When you already have an unordered factor, you can make it ordered by using the function ordered() together with the levels vector

classSizes <- factor(c("big","small","huge","huge","small","big","small","big"))
classSizes <- ordered(classSizes, levels = c("small", "big", "huge"))
classSizes

[1] big   small huge  huge  small big   small big  
Levels: small < big < huge

Working with factors

Factors are used all the time, e.g. for defining treated/untreated groups. That's why R knows how to deal with them so well (here: a boxplot of weight per diet):

with(ChickWeight, plot(weight ~ Diet))

Lists

A list is an ordered collection of elements, typically vectors
These elements can have differing types and lengths
Accessing a single list element is done with double brackets: [[]]

List action

x <- c(2, 3, 1)
y <- c("foo", "bar")
l <- list(x, y); l

[[1]]
[1] 2 3 1

[[2]]
[1] "foo" "bar"

l[[2]]

[1] "foo" "bar"

l[[1]][2]

[1] 3

Named list elements (1)

Lists can also have named elements

x <- c(2, 3, 1)
y <- c("foo", "bar")
l <- list("numbers" = x, "words" = y)
l

$numbers
[1] 2 3 1

$words
[1] "foo" "bar"

Named list elements (2)

Accessing named elements can be done in three ways

l[[2]]        # index

[1] "foo" "bar"

l[["words"]]  # name of element with double brackets

[1] "foo" "bar"

l$words       # name of element with dollar sign

[1] "foo" "bar"

Named list elements (3)

Accessing named elements has its limitations

select <- "words"
l[[select]] ## OK

[1] "foo" "bar"

l$select ## fails: looks for an element literally named "select"

NULL
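This is why double brackets are the form to use whenever the element name is held in a variable, for instance in a loop. A minimal sketch (the loop variable `element_name` is just an illustration):

```r
l <- list("numbers" = c(2, 3, 1), "words" = c("foo", "bar"))

# Visit every element by name; l$element_name would not work here
for (element_name in names(l)) {
    cat(element_name, "has", length(l[[element_name]]), "elements\n")
}

# The dollar sign takes the name literally, so this yields NULL
is.null(l$element_name)
```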

Single versus double brackets on lists

Single brackets on a list return a list; double brackets return the element itself (here a vector)

l[[2]]

[1] "foo" "bar"

l[2]

$words
[1] "foo" "bar"

l["words"]

$words
[1] "foo" "bar"

Single vs. double brackets on lists (2)

This behaviour can become awkward

l["words"]$words

[1] "foo" "bar"

l[2]["words"][1]$words  ## legal, but mind-bending

[1] "foo" "bar"
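A quick way to see the difference is to inspect what each form returns. This sketch only uses `class()` and `length()`:

```r
l <- list("numbers" = c(2, 3, 1), "words" = c("foo", "bar"))

class(l[2])      # a list that holds one element
class(l[[2]])    # the character vector itself
length(l[2])     # number of list elements selected
length(l[[2]])   # number of values in the vector

# Single brackets can select several elements at once; double brackets cannot
both <- l[c("numbers", "words")]
names(both)
```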

Arrays

Arrays are vectors with a dimensions (dim) attribute
They can also be created using the array() function
An array with 2 dimensions is a matrix

x <- 1:10
dim(x) <- c(2, 5)
x

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10

class(x)

[1] "matrix"

a <- array(data = 1:12, dim = c(2, 3, 2))
# same as "a <- 1:12; dim(a) <- c(2, 3, 2)"
rownames(a) <- c("foo", "bar")
a

, , 1

    [,1] [,2] [,3]
foo    1    3    5
bar    2    4    6

, , 2

    [,1] [,2] [,3]
foo    7    9   11
bar    8   10   12

class(a)

[1] "array"
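Indexing an array works like indexing a matrix, with one index per dimension. A small sketch using the same array:

```r
a <- array(data = 1:12, dim = c(2, 3, 2))
rownames(a) <- c("foo", "bar")

dim(a)            # size of each of the three dimensions
a["bar", 3, 2]    # row "bar", column 3, in the second slice
a[, , 1]          # the complete first slice

# A single slice of a 3-dimensional array is an ordinary matrix
is.matrix(a[, , 1])
```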

Dataframes

data.frame rules all

In practice you will work with data frames >95% of the time
Let's meet and greet

geneNames <- c("P53","BRCA1","VAMP1", "FHIT")
sig <- c(TRUE, TRUE, FALSE, FALSE)
meanExp <- c(4.5, 7.3, 5.4, 2.4)
genes <- data.frame(
    "name" = geneNames,  
    "significant" = sig,  
    "expression" = meanExp)  
genes

   name significant expression
1   P53        TRUE        4.5
2 BRCA1        TRUE        7.3
3 VAMP1       FALSE        5.4
4  FHIT       FALSE        2.4

genes[2,1]      #row 2, column 1

[1] BRCA1
Levels: BRCA1 FHIT P53 VAMP1

genes[, 1:2]    #columns 1 and 2

   name significant
1   P53        TRUE
2 BRCA1        TRUE
3 VAMP1       FALSE
4  FHIT       FALSE

genes[1:2]      #columns 1 and 2 (!)

   name significant
1   P53        TRUE
2 BRCA1        TRUE
3 VAMP1       FALSE
4  FHIT       FALSE

genes[1:2,]     #row 1 and 2

   name significant expression
1   P53        TRUE        4.5
2 BRCA1        TRUE        7.3

genes[c("name", "expression")]  #columns "name" and "expression"

   name expression
1   P53        4.5
2 BRCA1        7.3
3 VAMP1        5.4
4  FHIT        2.4

genes$name      #column "name"

[1] P53   BRCA1 VAMP1 FHIT 
Levels: BRCA1 FHIT P53 VAMP1

Selections on dataframes summarized

In general, selections on dataframes are done in this form:
my_data[row_sel, col_sel]
where row_sel and col_sel can be
- a single index
- a numerical vector
- a character vector of row/column names
- a logical vector (of the same length as the number of rows/columns!)
- empty (for all rows/columns)
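The logical vector is the form used most in practice, because a comparison on a column produces exactly such a vector. A sketch with the genes data (rebuilt here so the example stands on its own):

```r
genes <- data.frame(
    name = c("P53", "BRCA1", "VAMP1", "FHIT"),
    significant = c(TRUE, TRUE, FALSE, FALSE),
    expression = c(4.5, 7.3, 5.4, 2.4))

# Rows where the logical column is TRUE, all columns
genes[genes$significant, ]

# Rows with expression above 5, two columns selected by name
genes[genes$expression > 5, c("name", "expression")]

# Conditions can be combined with & (and) or | (or)
genes[genes$significant & genes$expression > 5, "name"]
```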

A dataframe is (sort of) a list of vectors

genes[["name"]] ## select column w. double brackets like list

[1] P53   BRCA1 VAMP1 FHIT 
Levels: BRCA1 FHIT P53 VAMP1

class(genes) ## its class is data.frame, not list (although it is built on one)

[1] "data.frame"

str(genes)

'data.frame':   4 obs. of  3 variables:
 $ name       : Factor w/ 4 levels "BRCA1","FHIT",..: 3 1 4 2
 $ significant: logi  TRUE TRUE FALSE FALSE
 $ expression : num  4.5 7.3 5.4 2.4
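The "sort of" can be made concrete: the class is `data.frame`, but the list functions still apply, one element per column. In this sketch `stringsAsFactors = TRUE` is set explicitly, because from R 4.0.0 on text columns are no longer converted to factors by default; with that setting the output matches the slides.

```r
genes <- data.frame(
    name = c("P53", "BRCA1", "VAMP1", "FHIT"),
    significant = c(TRUE, TRUE, FALSE, FALSE),
    expression = c(4.5, 7.3, 5.4, 2.4),
    stringsAsFactors = TRUE)  # reproduce the factor column shown above

is.list(genes)         # a data frame passes the list test
length(genes)          # length counts columns, as for a list
sapply(genes, class)   # apply a function to every column
```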

Reading from file

Loading data frames from file

In real life, data in dataframes is often loaded from file
The most used data transfer & storage format is text (tab- or comma-separated)
Here is an example data set in a file ("whale_selenium.txt")

whale liver.Se tooth.Se  
1 6.23 140.16  
2 6.79 133.32  
3 7.92 135.34  
...  
19 41.23 206.30  
20 45.47 141.31

Reading the whale data

whale.selenium <- read.table("data/whale_selenium.txt")
head(whale.selenium)

     V1       V2       V3
1 whale liver.Se tooth.Se
2     1     6.23   140.16
3     2     6.79   133.32
4     3     7.92   135.34
5     4     8.02   127.82
6     5     9.34   108.67

When loading the data in the standard way,
- there is no special consideration for the header line
- the separator is assumed to be white space (one or more spaces or tabs)
- the decimal is assumed to be a dot "."

Here, it is specified that
- the first line is a header line
- the first column contains the row names

whale.selenium <- read.table(
    file = "data/whale_selenium.txt",
    header = TRUE,
    row.names = 1)
summary(whale.selenium)

    liver.Se         tooth.Se    
 Min.   : 6.230   Min.   :108.7  
 1st Qu.: 9.835   1st Qu.:134.8  
 Median :14.905   Median :143.4  
 Mean   :20.685   Mean   :156.6  
 3rd Qu.:33.633   3rd Qu.:175.1  
 Max.   :45.470   Max.   :245.1
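To experiment with these settings without a file on disk, `read.table()` also accepts the data as a string through its `text` argument. The sketch below inlines the first three whales. The semicolon/decimal-comma variant is a hypothetical rewrite of the same values, only there to show the `sep` and `dec` arguments.

```r
# First lines of the whale data, inlined so the example is self-contained
whale_text <- "whale liver.Se tooth.Se
1 6.23 140.16
2 6.79 133.32
3 7.92 135.34"

# Standard way: the header line is read as an ordinary data row
no_header <- read.table(text = whale_text)
names(no_header)

# With header and row names specified, the columns are numeric
whales <- read.table(text = whale_text, header = TRUE, row.names = 1)
str(whales)

# Hypothetical variant: semicolon-separated with decimal commas
whale_text_eu <- "whale;liver.Se;tooth.Se
1;6,23;140,16
2;6,79;133,32"
whales_eu <- read.table(text = whale_text_eu, header = TRUE,
                        sep = ";", dec = ",", row.names = 1)
whales_eu
```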

Ready to rumble

plot(
    whale.selenium$liver.Se, whale.selenium$tooth.Se,
    xlab = "liver Selenium", ylab = "tooth Selenium")
abline(lm(whale.selenium$tooth.Se ~ whale.selenium$liver.Se))

or, with a smoother:

scatter.smooth(
    whale.selenium$liver.Se, whale.selenium$tooth.Se,
    xlab = "liver Selenium", ylab = "tooth Selenium")
abline(lm(whale.selenium$tooth.Se ~ whale.selenium$liver.Se))
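The repeated `whale.selenium$` can be avoided with the formula interface and a `data` argument, which both `plot()` and `lm()` accept. A sketch using only the first five whales from the file, so that it runs without the data file:

```r
# First five whales, typed in from the listing above
whale.selenium <- data.frame(
    liver.Se = c(6.23, 6.79, 7.92, 8.02, 9.34),
    tooth.Se = c(140.16, 133.32, 135.34, 127.82, 108.67))

# Fit once, keep the model, and reuse it for the regression line
fit <- lm(tooth.Se ~ liver.Se, data = whale.selenium)

plot(tooth.Se ~ liver.Se, data = whale.selenium,
     xlab = "liver Selenium", ylab = "tooth Selenium")
abline(fit)

coef(fit)   # intercept and slope of the fitted line
```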

Advanced file reading

More advanced file reading will be dealt with in a later presentation.

Basic DF manipulations

Changing column names

names(whale.selenium) <- c("liver", "tooth")
head(whale.selenium, n=2)

  liver  tooth
1  6.23 140.16
2  6.79 133.32

##or
colnames(whale.selenium) <- c("brrrr", "gross")
head(whale.selenium, n=2)

  brrrr  gross
1  6.23 140.16
2  6.79 133.32
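Both forms above replace all names at once. To rename a single column, index into the names vector first. A minimal sketch on a two-row stand-in for the whale data:

```r
# Two-row stand-in for the whale data
whale.selenium <- data.frame(liver = c(6.23, 6.79), tooth = c(140.16, 133.32))

# Rename by position
names(whale.selenium)[1] <- "liver.Se"

# Rename by current name, which is safer if the column order may change
names(whale.selenium)[names(whale.selenium) == "tooth"] <- "tooth.Se"

names(whale.selenium)
```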

Adding columns

You can add columns to an existing dataframe

## add simulated stomach data
whale.selenium$stomach <- rnorm(nrow(whale.selenium), 42, 6) 
head(whale.selenium, n=2)

  liver  tooth  stomach
1  6.23 140.16 39.81500
2  6.79 133.32 29.05594

# or
cbind(whale.selenium, "a_code" = rep(1:2, length.out = nrow(whale.selenium)))

   liver  tooth  stomach a_code
1   6.23 140.16 39.81500      1
2   6.79 133.32 29.05594      2
3   7.92 135.34 48.58592      1
4   8.02 127.82 42.10717      2
5   9.34 108.67 41.56941      1
6  10.00 146.22 42.19763      2
7  10.57 131.18 57.02038      1
8  11.04 145.51 41.67049      2
9  12.36 163.24 34.87155      1
10 14.53 136.55 40.26866      2
11 15.28 112.63 42.58383      1
12 18.68 245.07 39.74285      2
13 22.08 140.48 40.61575      1
14 27.55 177.93 37.94400      2
15 32.83 160.73 47.22081      1
16 36.04 227.60 44.88895      2
17 37.74 177.69 36.34232      1
18 40.00 174.23 45.49380      2
19 41.23 206.30 39.41568      1
20 45.47 141.31 46.95745      2

Adding rows: `rbind()`

Adding rows is similar (continued on next slide)

myData1 <- data.frame(colA = 1:3, colB = c("a", "b", "c")); myData1

  colA colB
1    1    a
2    2    b
3    3    c

myData2 <- data.frame(colA = 4:5, colB = c("d", "e")); myData2

  colA colB
1    4    d
2    5    e

myDataComplete <- rbind(myData1, myData2)
myDataComplete

  colA colB
1    1    a
2    2    b
3    3    c
4    4    d
5    5    e

Note that the column names of both dataframes need to match for this operation to succeed!
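A sketch of what "need to match" means in practice: the columns are matched by name, so a different column order is fine, but a different name is an error. The data frames `myData3` and `myData4` are made up for this example.

```r
myData1 <- data.frame(colA = 1:3, colB = c("a", "b", "c"))

# Same names in a different order: rbind() matches columns by name
myData3 <- data.frame(colB = c("d", "e"), colA = 4:5)
combined <- rbind(myData1, myData3)
combined

# A different column name: rbind() stops with an error,
# caught here so the message can be inspected
myData4 <- data.frame(colA = 6L, colC = "f")
result <- tryCatch(
    rbind(myData1, myData4),
    error = function(e) conditionMessage(e))
result
```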

Getting a summary

summary(whale.selenium) ## gives min, quartiles, median, mean and max of each column

     liver            tooth          stomach     
 Min.   : 6.230   Min.   :108.7   Min.   :29.06  
 1st Qu.: 9.835   1st Qu.:134.8   1st Qu.:39.66  
 Median :14.905   Median :143.4   Median :41.62  
 Mean   :20.685   Mean   :156.6   Mean   :41.92  
 3rd Qu.:33.633   3rd Qu.:175.1   3rd Qu.:45.04  
 Max.   :45.470   Max.   :245.1   Max.   :57.02

Getting the dimensions of a dataframe

dim(whale.selenium)

[1] 20  3

A more readable selection

You can also use subset() to make both column and row selections
This is a more readable alternative to [ , ]
Note that you don't even need to put quotes around the column names

##select rows for which Solar.R is available
head(subset(airquality, subset = !is.na(Solar.R)))

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
7    23     299  8.6   65     5   7
8    19      99 13.8   59     5   8

`subset()` cont.

## select two columns only
head(subset(airquality, select = c(Ozone, Solar.R)))

  Ozone Solar.R
1    41     190
2    36     118
3    12     149
4    18     313
5    NA      NA
6    28      NA

`subset()` cont.

## combine row and column selection
head(subset(airquality, subset = !is.na(Solar.R), select = c(Ozone, Solar.R)))

  Ozone Solar.R
1    41     190
2    36     118
3    12     149
4    18     313
7    23     299
8    19      99

`subset()` cont.

## shorthand
subset(airquality, Day == 1, select = -Temp)

    Ozone Solar.R Wind Month Day
1      41     190  7.4     5   1
32     NA     286  8.6     6   1
62    135     269  4.1     7   1
93     39      83  6.9     8   1
124    96     167  6.9     9   1

subset() can do more sophisticated selections; see ?subset or search online
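For comparison, here is a sketch of the same selection written with `subset()` and with `[ , ]`, followed by one difference worth knowing: `subset()` drops rows for which the condition is `NA`, while a logical index with `NA` in it returns rows filled with `NA`.

```r
# The same selection, written both ways
by_subset <- subset(airquality, subset = !is.na(Solar.R),
                    select = c(Ozone, Solar.R))
by_bracket <- airquality[!is.na(airquality$Solar.R), c("Ozone", "Solar.R")]
all.equal(by_subset, by_bracket)

# Ozone contains missing values, so the comparison yields NA for some rows
high_subset <- subset(airquality, Ozone > 100)
high_bracket <- airquality[airquality$Ozone > 100, ]

nrow(high_subset)    # only rows where the condition is TRUE
nrow(high_bracket)   # also one all-NA row per missing Ozone value
```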
