Data Types

Matrices
Data Frames
Factors
Missing Values

Matrices

Matrices are vectors with a dimension attribute. The dimension attribute is itself an integer vector of length 2 (nrow, ncol).

m <- matrix(nrow = 2, ncol = 3)
m

##      [,1] [,2] [,3]
## [1,]   NA   NA   NA
## [2,]   NA   NA   NA

dim(m)

## [1] 2 3

attributes(m)

## $dim
## [1] 2 3

Matrices are constructed column-wise, so entries can be thought of as starting in the “upper left” corner and running down the columns.

m <- matrix(1:6, nrow = 2, ncol = 3)
m

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
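To make the column-wise fill order concrete, the sketch below indexes the same matrix by position and contrasts it with the byrow argument of matrix(), which the example above does not use. The name m_by_row is illustrative only.

```r
# Column-wise (default) fill: values run down each column in turn.
m <- matrix(1:6, nrow = 2, ncol = 3)
m[2, 1]          # 2, the second value of 1:6
m[1, 2]          # 3, the third value of 1:6

# byrow = TRUE fills across the rows instead (m_by_row is an illustrative name).
m_by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m_by_row[1, 2]   # 2
m_by_row[2, 1]   # 4
```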

Matrices can also be created directly from vectors by adding a dimension attribute.

m <- 1:10
m

##  [1]  1  2  3  4  5  6  7  8  9 10

dim(m) <- c(2, 5)
m

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10

Data Frames

Data frames are represented as a special type of list in which every element must have the same length.
Each element of the list can be thought of as a column, and the length of each element is the number of rows.
Data frames also have a special attribute called row.names.
Data frames are usually created by calling read.table() or read.csv().
A data frame can be converted to a matrix by calling data.matrix().

x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))
x

##   foo   bar
## 1   1  TRUE
## 2   2  TRUE
## 3   3 FALSE
## 4   4 FALSE

nrow(x)

## [1] 4

ncol(x)

## [1] 2
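The list-like structure described above can be inspected directly. The following is a minimal sketch that reuses the same small data frame; dm is an illustrative name for the converted matrix.

```r
x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))

is.list(x)      # TRUE: a data frame is a list of columns
length(x)       # 2, one list element per column
lengths(x)      # every column has length 4, the number of rows
row.names(x)    # "1" "2" "3" "4", the default row names

# data.matrix() converts every column to numbers; TRUE/FALSE become 1/0.
dm <- data.matrix(x)
is.matrix(dm)   # TRUE
dim(dm)         # 4 2
```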

Factors

Factors are used to represent categorical data. Factors can be unordered or ordered. One can think of a factor as an integer vector where each integer has a label.
Using factors with labels is better than using integers because factors are self-describing; having a variable that has values “Male” and “Female” is better than a variable that has values 1 and 2.

x <- factor(c("yes", "yes", "no", "yes", "no")) 
x

## [1] yes yes no  yes no 
## Levels: no yes

table(x)

## x
##  no yes 
##   2   3

unclass(x)

## [1] 2 2 1 2 1
## attr(,"levels")
## [1] "no"  "yes"
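The "integer vector with labels" view can also be checked piece by piece. This short sketch uses only base R functions and the same factor as above.

```r
x <- factor(c("yes", "yes", "no", "yes", "no"))

levels(x)                   # "no" "yes": the labels, sorted by default
as.integer(x)               # 2 2 1 2 1: the underlying integer codes
levels(x)[as.integer(x)]    # maps each code back to its label
```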

The order of the levels can be set using the levels argument to factor(). This can be important in linear modelling because the first level is used as the baseline level.

x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
x

## [1] yes yes no  yes no 
## Levels: yes no
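Setting the levels explicitly also changes the integer codes, and the same argument is how an ordered factor gets its ordering. The sketch below shows both; the sizes data and its level names are made up for illustration.

```r
# Explicit level order: "yes" is now level 1, so the codes flip.
x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
as.integer(x)       # 1 1 2 1 2

# An ordered factor (illustrative data; the level order is an assumption).
sizes <- factor(c("low", "high", "medium", "low"),
                levels = c("low", "medium", "high"),
                ordered = TRUE)
is.ordered(sizes)   # TRUE
sizes < "high"      # TRUE FALSE TRUE TRUE
```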

Missing Values

Missing values are denoted by NA; NaN denotes the result of an undefined mathematical operation.
is.na() tests whether the elements of an object are NA.
is.nan() tests for NaN.
NA values also have a class, so there are integer NA, character NA, etc.
A NaN value is also NA, but the converse is not true.

x <- c(1, 2, NA, 10, 3)
is.na(x)

## [1] FALSE FALSE  TRUE FALSE FALSE

is.nan(x)

## [1] FALSE FALSE FALSE FALSE FALSE

x <- c(1, 2, NaN, NA, 4)
is.na(x)

## [1] FALSE FALSE  TRUE  TRUE FALSE

is.nan(x)

## [1] FALSE FALSE  TRUE FALSE FALSE
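The remaining points about missing values (where NaN comes from, and the fact that NA has a class) can be seen with a few base R calls. This is a minimal sketch; the final line, which uses is.na() to drop missing entries, is an added illustration and not part of the examples above.

```r
0 / 0                   # NaN: an undefined mathematical operation
is.na(0 / 0)            # TRUE: NaN counts as NA
is.nan(NA)              # FALSE: NA is not NaN

# Typed NA constants built into R.
class(NA)               # "logical"
class(NA_integer_)      # "integer"
class(NA_real_)         # "numeric"
class(NA_character_)    # "character"

# Using is.na() to keep only the non-missing values.
x <- c(1, 2, NaN, NA, 4)
x[!is.na(x)]            # 1 2 4
```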