The R data structures used most frequently are vectors, factors, lists, data frames, and matrices.

Vectors

The elements of a vector must all be of the same type. Vectors can be created with the c() combine function.
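As a small illustration of the same-type rule, the sketch below mixes types inside c(); R then coerces every element to one common type. The variable names mixed and numbers are only illustrative.

```r
# Mixing a number, a string, and a logical coerces everything to character
mixed <- c(1, "a", TRUE)
class(mixed)    # "character"
mixed           # "1" "a" "TRUE"

# Mixing numbers and logicals coerces the logicals to numeric (TRUE becomes 1)
numbers <- c(1.5, TRUE, FALSE)
class(numbers)  # "numeric"
```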

Example: We’ll create vectors to store the names of three patients, each patient’s body temperature in degrees Fahrenheit, and each patient’s diagnosis (TRUE if he or she has influenza, FALSE otherwise).

name<-c("Jhon", "Jane", "Steve") # Character vector
temperature<-c(98.1,98.6,101.4) # Numeric vector
status<-c(FALSE, FALSE, TRUE) # Logical vector

Data for each patient can be accessed using his or her position in the set. To obtain the temperature value for patient Jane, simply type:

temperature[2]
## [1] 98.6

To obtain a range of values, use the colon operator:

temperature[2:3]
## [1]  98.6 101.4

Items can be excluded by specifying a negative item number:

temperature[-2]
## [1]  98.1 101.4

It is also possible to specify a logical vector indicating whether or not each item should be included:

temperature[c(TRUE,TRUE,FALSE)]
## [1] 98.1 98.6
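In practice the logical vector is usually produced by a comparison rather than typed by hand. A minimal sketch, assuming an arbitrary cutoff of 100 degrees chosen only for illustration (the name fever is also illustrative):

```r
temperature <- c(98.1, 98.6, 101.4)

# The comparison yields a logical vector: FALSE FALSE TRUE
fever <- temperature > 100

# Using it as an index keeps only the elements where it is TRUE
temperature[fever]   # 101.4
```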

Factors

A factor is a special case of a vector that is used solely for representing categorical or ordinal variables.

gender<-factor(c("MALE","FEMALE","MALE"))
gender
## [1] MALE   FEMALE MALE  
## Levels: FEMALE MALE

The levels comprise the set of possible categories the factor could take, in this case, MALE or FEMALE. When we create factors, we can add additional levels that may not appear in the original data.

blood<-factor(c("O","AB","A"), 
              levels = c("A","B","AB","O"))
blood
## [1] O  AB A 
## Levels: A B AB O

Storing the additional level allows for the possibility of adding patients with the other blood type in the future. It also ensures that if we were to create a table of blood types, we would know that type B exists, despite it not being found in our initial data. We indicate the presence of ordinal data by providing the factor’s levels in the desired order, listed ascending from lowest to highest, and setting the ordered parameter to TRUE.

symptoms<-factor(c("SEVERE", "MILD", "MODERATE"),
                 levels = c("MILD","MODERATE","SEVERE"),
                 ordered = TRUE)
symptoms
## [1] SEVERE   MILD     MODERATE
## Levels: MILD < MODERATE < SEVERE
symptoms>"MODERATE"
## [1]  TRUE FALSE FALSE
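Returning to the blood type factor, the point about unused levels can be checked with table(). The sketch below rebuilds the blood factor so it runs on its own; counts is an illustrative name.

```r
blood <- factor(c("O", "AB", "A"),
                levels = c("A", "B", "AB", "O"))

# table() counts every declared level, including those with no observations
counts <- table(blood)
as.integer(counts["B"])   # 0

# levels() and nlevels() report the declared categories
levels(blood)             # "A" "B" "AB" "O"
nlevels(blood)            # 4
```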

Lists

A list is a data structure, much like a vector, in that it is used for storing an ordered set of elements. However, whereas a vector requires all its elements to be the same type, a list allows different R data types to be collected.

subject1<-list(fullname=name[1],
               temperature=temperature[1],
               status=status[1],
               gender=gender[1],
               blood=blood[1],
               symptoms=symptoms[1])
subject1
## $fullname
## [1] "Jhon"
## 
## $temperature
## [1] 98.1
## 
## $status
## [1] FALSE
## 
## $gender
## [1] MALE
## Levels: FEMALE MALE
## 
## $blood
## [1] O
## Levels: A B AB O
## 
## $symptoms
## [1] SEVERE
## Levels: MILD < MODERATE < SEVERE

Because a list retains order like a vector, its components can be accessed using numeric positions, as shown here for the temperature value:

subject1[2]
## $temperature
## [1] 98.1

The result of using vector-style operators on a list object is another list object, which is a subset of the original list. To instead return a single list item in its native data type, use double brackets when selecting the list component. The following command returns a numeric vector of length one.

subject1[[2]]
## [1] 98.1

It is often better to access list components by name:

subject1$temperature
## [1] 98.1
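The difference between the three access styles is easiest to see by checking the class of each result. A minimal sketch, using a small stand-in list named subject (an illustrative name, not part of the patient data above):

```r
subject <- list(fullname = "Jane", temperature = 98.6)

# Single brackets return a sub-list
class(subject[2])            # "list"

# Double brackets and $ return the element itself
class(subject[[2]])          # "numeric"
class(subject$temperature)   # "numeric"

# Double brackets also accept the component name as a string
subject[["temperature"]]     # 98.6
```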

Data frames

The data frame is a structure analogous to a spreadsheet or database in that it has both rows and columns of data. In R terms, a data frame can be understood as a list of vectors or factors, each having exactly the same number of values. A data frame is literally a list of vector-type objects.

pt_data<-data.frame(name, temperature,status,gender,blood,
                    symptoms,stringsAsFactors = FALSE)
pt_data
##    name temperature status gender blood symptoms
## 1  Jhon        98.1  FALSE   MALE     O   SEVERE
## 2  Jane        98.6  FALSE FEMALE    AB     MILD
## 3 Steve       101.4   TRUE   MALE     A MODERATE

The stringsAsFactors = FALSE parameter keeps character vectors as they are. In versions of R before 4.0.0, omitting this option makes R automatically convert every character vector to a factor; since R 4.0.0, FALSE is the default.
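The effect of the option can be checked directly by building the same data frame both ways. In this sketch the option is set explicitly in both calls, so the result does not depend on the R version; as_character and as_factor are illustrative names.

```r
name <- c("Jhon", "Jane", "Steve")

as_character <- data.frame(name, stringsAsFactors = FALSE)
as_factor    <- data.frame(name, stringsAsFactors = TRUE)

class(as_character$name)   # "character"
class(as_factor$name)      # "factor"
```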

To extract an entire column as a vector:

pt_data$name
## [1] "Jhon"  "Jane"  "Steve"

To extract multiple columns from a data frame, either by name or by position:

pt_data[c("temperature","status")]
##   temperature status
## 1        98.1  FALSE
## 2        98.6  FALSE
## 3       101.4   TRUE
pt_data[2:3]
##   temperature status
## 1        98.1  FALSE
## 2        98.6  FALSE
## 3       101.4   TRUE

Because the data frame is two-dimensional, both the desired rows and columns must be specified. Rows are specified first, followed by a comma, followed by the columns, in a format like this: [rows, columns]. To extract the value in the first row and second column of the patient data frame:

pt_data[1,2]
## [1] 98.1

To extract data from the first and third rows and the second and fourth columns

pt_data[c(1,3),c(2,4)]
##   temperature gender
## 1        98.1   MALE
## 3       101.4   MALE

To extract all rows of the first column

pt_data[ ,1]
## [1] "Jhon"  "Jane"  "Steve"

To extract all columns for the first row

pt_data[1, ]
##   name temperature status gender blood symptoms
## 1 Jhon        98.1  FALSE   MALE     O   SEVERE
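The row position can also be a logical condition, which combines the [rows, columns] format with the logical indexing shown for vectors. A minimal sketch on a reduced copy of the patient data; the cutoff of 100 degrees and the name feverish are assumptions made only for illustration.

```r
pt_data <- data.frame(name = c("Jhon", "Jane", "Steve"),
                      temperature = c(98.1, 98.6, 101.4),
                      stringsAsFactors = FALSE)

# Rows where the condition is TRUE, all columns
feverish <- pt_data[pt_data$temperature > 100, ]
feverish$name                                   # "Steve"

# Same rows, but only the name column
pt_data[pt_data$temperature > 100, "name"]      # "Steve"
```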

We can also create new columns in a data frame. For example, we may need to convert the Fahrenheit temperature readings in the patient data frame to the Celsius scale.

pt_data$temperature_Celsius<-(pt_data$temperature-32)*(5/9)
pt_data
##    name temperature status gender blood symptoms temperature_Celsius
## 1  Jhon        98.1  FALSE   MALE     O   SEVERE            36.72222
## 2  Jane        98.6  FALSE FEMALE    AB     MILD            37.00000
## 3 Steve       101.4   TRUE   MALE     A MODERATE            38.55556
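If the conversion is needed in more than one place, it could be wrapped in a small function. The helper name fahrenheit_to_celsius below is hypothetical; it applies the same formula used above.

```r
# Hypothetical helper: convert degrees Fahrenheit to degrees Celsius
fahrenheit_to_celsius <- function(f) {
  (f - 32) * (5 / 9)
}

temperature <- c(98.1, 98.6, 101.4)

# The function is vectorized, so it converts the whole vector at once
round(fahrenheit_to_celsius(temperature), 1)   # 36.7 37.0 38.6
```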

To view the first and last rows of a data frame:

head(pt_data,2)
##   name temperature status gender blood symptoms temperature_Celsius
## 1 Jhon        98.1  FALSE   MALE     O   SEVERE            36.72222
## 2 Jane        98.6  FALSE FEMALE    AB     MILD            37.00000
tail(pt_data,2)
##    name temperature status gender blood symptoms temperature_Celsius
## 2  Jane        98.6  FALSE FEMALE    AB     MILD            37.00000
## 3 Steve       101.4   TRUE   MALE     A MODERATE            38.55556
attach(pt_data)
## The following objects are masked _by_ .GlobalEnv:
## 
##     blood, gender, name, status, symptoms, temperature

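The attach() function allows us to access the columns of a data frame without having to specify the data frame name, as in the instruction pt_data$name. The message above warns that objects with the same names already exist in the global environment, so those objects, not the data frame columns, are the ones found first.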
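As an alternative to attach() (not the approach used in this tutorial), base R's with() evaluates a single expression using the columns of a data frame, which sidesteps the masking message shown above. A minimal sketch on a reduced copy of the patient data; avg_temp is an illustrative name.

```r
pt_data <- data.frame(name = c("Jhon", "Jane", "Steve"),
                      temperature = c(98.1, 98.6, 101.4),
                      stringsAsFactors = FALSE)

# Column names are looked up inside pt_data for this expression only
avg_temp <- with(pt_data, mean(temperature))
avg_temp

# Equivalent explicit form
mean(pt_data$temperature)
```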

names(pt_data)
## [1] "name"                "temperature"         "status"             
## [4] "gender"              "blood"               "symptoms"           
## [7] "temperature_Celsius"
str(pt_data)
## 'data.frame':    3 obs. of  7 variables:
##  $ name               : chr  "Jhon" "Jane" "Steve"
##  $ temperature        : num  98.1 98.6 101.4
##  $ status             : logi  FALSE FALSE TRUE
##  $ gender             : Factor w/ 2 levels "FEMALE","MALE": 2 1 2
##  $ blood              : Factor w/ 4 levels "A","B","AB","O": 4 3 1
##  $ symptoms           : Ord.factor w/ 3 levels "MILD"<"MODERATE"<..: 3 1 2
##  $ temperature_Celsius: num  36.7 37 38.6

Matrices

A matrix is a data structure that represents a two-dimensional table with rows and columns of data. R matrices can contain only one type of data, although they are most often used for mathematical operations and therefore typically store only numbers.

Requesting two rows creates a matrix with three columns. The values are filled in column by column.

m <- matrix(c(1,2,3,4,5,6),nrow=2)
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

Requesting two columns creates a matrix with three rows.

m <- matrix(c(1,2,3,4,5,6),ncol=2)
m
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
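Both examples fill the matrix one column at a time. To fill it row by row instead, matrix() accepts the byrow parameter; the sketch below uses the illustrative name m_by_row.

```r
# byrow = TRUE places the values along each row before moving to the next
m_by_row <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, byrow = TRUE)
m_by_row
# First row:  1 2 3
# Second row: 4 5 6

dim(m_by_row)   # 2 3
```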

If we want to add names to the column indices (oranges, bananas, and melon) and to the row indices (supermarket and store), we use the dimnames parameter of the matrix() function.

m <- matrix(c(1,2,3,4,5,6),nrow=2,
            dimnames=list(c("supermarket","store"),
                          c("oranges","bananas","melon"))) 
m
##             oranges bananas melon
## supermarket       1       3     5
## store             2       4     6
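Once a matrix has row and column names, its cells can be selected by name as well as by position. A minimal sketch that rebuilds the same matrix so it runs on its own:

```r
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2,
            dimnames = list(c("supermarket", "store"),
                            c("oranges", "bananas", "melon")))

m["store", "bananas"]   # 4
m["supermarket", ]      # the whole first row, as a named vector
rownames(m)             # "supermarket" "store"
colnames(m)             # "oranges" "bananas" "melon"
```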

Removing R data structures

The listing function ls() returns a vector of the names of all data structures currently in memory.

ls()
## [1] "blood"       "gender"      "m"           "name"        "pt_data"    
## [6] "status"      "subject1"    "symptoms"    "temperature"

The remove function rm() can be used to eliminate the m and subject1 objects:

rm(m,subject1)

Combined with the ls() function, rm() can remove every object from the R session:

rm(list = ls())
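Because rm(list = ls()) deletes everything in the workspace, it can be useful to try the pattern somewhere harmless first. In the sketch below, a separate environment named scratch (an illustrative name) stands in for the workspace; both ls() and rm() accept an environment to work on.

```r
# A separate environment stands in for the workspace (illustrative only)
scratch <- new.env()
assign("m", 1:6, envir = scratch)
assign("subject1", list(id = 1), envir = scratch)

ls(scratch)                               # "m" "subject1"
rm("m", envir = scratch)                  # remove one object
ls(scratch)                               # "subject1"
rm(list = ls(scratch), envir = scratch)   # remove everything that is left
length(ls(scratch))                       # 0
```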

The next session is about managing data with R.