The R data structures used most frequently are vectors, factors, lists, data frames, and matrices.
The elements of a vector must all be of the same type. Vectors can be
created with the c() combine function.
Example: We’ll create vectors to store three patients’ names, each
patient’s body temperature in degrees Fahrenheit, and each patient’s
diagnosis (TRUE if he or she has influenza,
FALSE otherwise).
name <- c("Jhon", "Jane", "Steve")       # Character vector
temperature <- c(98.1, 98.6, 101.4)      # Numeric vector
status <- c(FALSE, FALSE, TRUE)          # Logical vector
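The same-type rule can be checked with class(). The sketch below is illustrative only: the mixed vector is a hypothetical example that is not part of the patient data, and it shows that c() coerces mixed inputs to a single common type rather than raising an error.

```r
# The three patient vectors, each holding a single type
name <- c("Jhon", "Jane", "Steve")
temperature <- c(98.1, 98.6, 101.4)
status <- c(FALSE, FALSE, TRUE)

class(name)         # "character"
class(temperature)  # "numeric"
class(status)       # "logical"

# Hypothetical vector mixing a number, a string, and a logical value:
# c() coerces every element to the most general type, here character
mixed <- c(98.1, "Jane", TRUE)
class(mixed)        # "character"
mixed               # "98.1" "Jane" "TRUE"
```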
Data for each patient can be accessed using his or her position in the vector. To obtain the temperature value for patient Jane, simply type:
temperature[2]
## [1] 98.6
To obtain a range of values, use the colon operator:
temperature[2:3]
## [1] 98.6 101.4
Items can be excluded by specifying a negative item number:
temperature[-2]
## [1] 98.1 101.4
It is also possible to specify a logical vector indicating whether or not each item should be included:
temperature[c(TRUE, TRUE, FALSE)]
## [1] 98.1 98.6
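In practice the logical vector is usually computed rather than typed by hand. The following sketch assumes the same three patient vectors; the 100-degree threshold and the fever variable are illustrative choices, not part of the original example.

```r
name <- c("Jhon", "Jane", "Steve")
temperature <- c(98.1, 98.6, 101.4)
status <- c(FALSE, FALSE, TRUE)

# A comparison returns a logical vector of the same length
fever <- temperature > 100   # FALSE FALSE TRUE

temperature[fever]           # 101.4
name[status]                 # "Steve"
name[!status]                # "Jhon" "Jane"

# which() converts the logical vector to numeric positions
which(fever)                 # 3
```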
A factor is a special case of a vector that is used solely for representing categorical or ordinal variables.
gender<-factor(c("MALE","FEMALE","MALE"))
gender
## [1] MALE FEMALE MALE
## Levels: FEMALE MALE
The levels comprise the set of possible categories the factor could take, in this case MALE or FEMALE. When we create factors, we can add additional levels that may not appear in the original data.
blood<-factor(c("O","AB","A"),
levels = c("A","B","AB","O"))
blood
## [1] O AB A
## Levels: A B AB O
Storing the additional level allows for the possibility of adding patients with the other blood type in the future. It also ensures that if we were to create a table of blood types, we would know that type B exists, despite it not being found in our initial data. We indicate ordinal data by providing the factor’s levels in the desired order, listed ascending from lowest to highest, and setting the ordered parameter to TRUE.
symptoms<-factor(c("SEVERE", "MILD", "MODERATE"),
levels = c("MILD","MODERATE","SEVERE"),
ordered = TRUE)
symptoms
## [1] SEVERE MILD MODERATE
## Levels: MILD < MODERATE < SEVERE
symptoms>"MODERATE"
## [1] TRUE FALSE FALSE
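Both factor behaviors described above can be checked with a short sketch. It rebuilds the blood and symptoms factors, then uses table() to show that the unused level B is reported with a count of zero, and max() to pick the highest level of the ordered factor. The variable names blood_counts, worst, and at_least_moderate are illustrative.

```r
blood <- factor(c("O", "AB", "A"),
                levels = c("A", "B", "AB", "O"))

# Unused levels still appear in a frequency table, with count 0
blood_counts <- table(blood)
blood_counts

symptoms <- factor(c("SEVERE", "MILD", "MODERATE"),
                   levels = c("MILD", "MODERATE", "SEVERE"),
                   ordered = TRUE)

# Ordered factors support comparisons and min()/max()
worst <- max(symptoms)                        # SEVERE
at_least_moderate <- symptoms >= "MODERATE"   # TRUE FALSE TRUE
```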
A list is a data structure, much like a vector, in that it is used for storing an ordered set of elements. However, whereas a vector requires all its elements to be the same type, a list allows different R data types to be collected.
subject1<-list(fullname=name[1],
temperature=temperature[1],
status=status[1],
gender=gender[1],
blood=blood[1],
symptoms=symptoms[1])
subject1
## $fullname
## [1] "Jhon"
##
## $temperature
## [1] 98.1
##
## $status
## [1] FALSE
##
## $gender
## [1] MALE
## Levels: FEMALE MALE
##
## $blood
## [1] O
## Levels: A B AB O
##
## $symptoms
## [1] SEVERE
## Levels: MILD < MODERATE < SEVERE
Because a list retains order like a vector, its components can be accessed using numeric positions, as shown here for the temperature value:
subject1[2]
## $temperature
## [1] 98.1
The result of using vector-style operators on a list object is another list object, which is a subset of the original list. To instead return a single list item in its native data type, use double brackets when selecting the list component. The following command returns a numeric vector of length one.
subject1[[2]]
## [1] 98.1
It is often better to access list components by name:
subject1$temperature
## [1] 98.1
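The difference between the access styles is easiest to see by inspecting the class of each result. This is a minimal sketch that uses a shortened, hypothetical version of subject1 with only three components.

```r
# Shortened version of the subject1 list (three components only)
subject1 <- list(fullname = "Jhon",
                 temperature = 98.1,
                 status = FALSE)

sub_list <- subject1[2]     # single brackets: a list of length one
value <- subject1[[2]]      # double brackets: the element itself

class(sub_list)             # "list"
class(value)                # "numeric"

# Double brackets also accept a component name
subject1[["temperature"]]   # 98.1

# Single brackets with a vector of names return a smaller list
subject1[c("temperature", "status")]
```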
The data frame is a structure analogous to a spreadsheet or database in that it has both rows and columns of data. In R terms, a data frame can be understood as a list of vectors or factors, each having exactly the same number of values. A data frame is literally a list of vector-type objects, so the list-access operators shown earlier also work on its columns.
pt_data<-data.frame(name, temperature,status,gender,blood,
symptoms,stringsAsFactors = FALSE)
pt_data
##    name temperature status gender blood symptoms
## 1  Jhon        98.1  FALSE   MALE     O   SEVERE
## 2  Jane        98.6  FALSE FEMALE    AB     MILD
## 3 Steve       101.4   TRUE   MALE     A MODERATE
The stringsAsFactors = FALSE parameter keeps character vectors as
they are. In R versions before 4.0.0, omitting this option causes R to
convert every character vector to a factor automatically; since R 4.0.0
the default is FALSE.
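The effect of the parameter can be verified by building the same data frame twice. The sketch below uses a reduced, two-column data frame; df_chr and df_fct are hypothetical names chosen for the comparison.

```r
name <- c("Jhon", "Jane", "Steve")
temperature <- c(98.1, 98.6, 101.4)

# Character columns are kept as character
df_chr <- data.frame(name, temperature, stringsAsFactors = FALSE)

# Character columns are converted to factors
df_fct <- data.frame(name, temperature, stringsAsFactors = TRUE)

class(df_chr$name)   # "character"
class(df_fct$name)   # "factor"
```

Setting the parameter explicitly, as the example above does, makes the result independent of the R version in use.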
To extract an entire column as a vector, use the $ operator:
pt_data$name
## [1] "Jhon" "Jane" "Steve"
To extract multiple columns from a data frame, specify a vector of column names or column positions:
pt_data[c("temperature","status")]
##   temperature status
## 1        98.1  FALSE
## 2        98.6  FALSE
## 3       101.4   TRUE
pt_data[2:3]
##   temperature status
## 1        98.1  FALSE
## 2        98.6  FALSE
## 3       101.4   TRUE
Because a data frame is two-dimensional, both the desired rows and columns must be specified. Rows are specified first, followed by a comma, followed by the columns, in a format like this: [rows, columns]. To extract the value in the first row and second column of the patient data frame:
pt_data[1,2]
## [1] 98.1
To extract data from the first and third rows and the second and fourth columns:
pt_data[c(1,3),c(2,4)]
##   temperature gender
## 1        98.1   MALE
## 3       101.4   MALE
To extract all rows of the first column, leave the row portion blank:
pt_data[ ,1]
## [1] "Jhon" "Jane" "Steve"
To extract all columns for the first row, leave the column portion blank:
pt_data[1, ]
##   name temperature status gender blood symptoms
## 1 Jhon        98.1  FALSE   MALE     O   SEVERE
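The [rows, columns] format also accepts logical vectors and column names, which allows rows to be selected by condition. This is a sketch on a reduced, three-column version of the patient data frame; the 100-degree threshold and the result names are illustrative.

```r
pt_data <- data.frame(name = c("Jhon", "Jane", "Steve"),
                      temperature = c(98.1, 98.6, 101.4),
                      status = c(FALSE, FALSE, TRUE),
                      stringsAsFactors = FALSE)

# Rows where the condition is TRUE, all columns
feverish <- pt_data[pt_data$temperature > 100, ]

# Names of the patients without influenza (result is a character vector)
healthy_names <- pt_data[!pt_data$status, "name"]

# drop = FALSE keeps a single selected column as a data frame
one_col_df <- pt_data[, "temperature", drop = FALSE]
```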
We can also create new columns in a data frame. For example, we may need to convert the Fahrenheit temperature readings in the patient data frame to the Celsius scale:
pt_data$temperature_Celsius<-(pt_data$temperature-32)*(5/9)
pt_data
##    name temperature status gender blood symptoms temperature_Celsius
## 1  Jhon        98.1  FALSE   MALE     O   SEVERE            36.72222
## 2  Jane        98.6  FALSE FEMALE    AB     MILD            37.00000
## 3 Steve       101.4   TRUE   MALE     A MODERATE            38.55556
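If the conversion is needed in more than one place, it could be wrapped in a small function. The helper name fahrenheit_to_celsius below is hypothetical; it applies the same formula as above and, like most R arithmetic, works element by element on a whole vector.

```r
# Hypothetical helper: same formula as above, vectorized over temp_f
fahrenheit_to_celsius <- function(temp_f) {
  (temp_f - 32) * (5 / 9)
}

temperature <- c(98.1, 98.6, 101.4)
celsius <- fahrenheit_to_celsius(temperature)
round(celsius, 2)            # 36.72 37.00 38.56

# Sanity checks against the fixed points of the two scales
fahrenheit_to_celsius(32)    # 0
fahrenheit_to_celsius(212)   # 100
```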
To view the first and last rows of a data frame, use head() and tail():
head(pt_data,2)
##   name temperature status gender blood symptoms temperature_Celsius
## 1 Jhon        98.1  FALSE   MALE     O   SEVERE            36.72222
## 2 Jane        98.6  FALSE FEMALE    AB     MILD            37.00000
tail(pt_data,2)
##    name temperature status gender blood symptoms temperature_Celsius
## 2  Jane        98.6  FALSE FEMALE    AB     MILD            37.00000
## 3 Steve       101.4   TRUE   MALE     A MODERATE            38.55556
attach(pt_data)
## The following objects are masked _by_ .GlobalEnv:
##
## blood, gender, name, status, symptoms, temperature
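The attach() function lets us access the columns of a data
frame without specifying the data frame name, for example name instead of
pt_data$name. The message above warns that global variables with the same
names already exist and take precedence over the attached columns.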
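Because of this masking, some users prefer base R's with() function, which evaluates a single expression using the data frame's columns and leaves the search path unchanged. The following is a sketch on a reduced version of the patient data frame; mean_temp and hottest are illustrative names.

```r
pt_data <- data.frame(name = c("Jhon", "Jane", "Steve"),
                      temperature = c(98.1, 98.6, 101.4),
                      stringsAsFactors = FALSE)

# Columns are visible by name only inside the with() call
mean_temp <- with(pt_data, mean(temperature))

# Name of the patient with the highest temperature
hottest <- with(pt_data, name[which.max(temperature)])   # "Steve"
```

If attach() has been used, detach(pt_data) removes the data frame from the search path again.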
names(pt_data)
## [1] "name" "temperature" "status"
## [4] "gender" "blood" "symptoms"
## [7] "temperature_Celsius"
str(pt_data)
## 'data.frame': 3 obs. of 7 variables:
## $ name : chr "Jhon" "Jane" "Steve"
## $ temperature : num 98.1 98.6 101.4
## $ status : logi FALSE FALSE TRUE
## $ gender : Factor w/ 2 levels "FEMALE","MALE": 2 1 2
## $ blood : Factor w/ 4 levels "A","B","AB","O": 4 3 1
## $ symptoms : Ord.factor w/ 3 levels "MILD"<"MODERATE"<..: 3 1 2
## $ temperature_Celsius: num 36.7 37 38.6
A matrix is a data structure that represents a two-dimensional table with rows and columns of data. R matrices can contain only one type of data, and they are most often used for mathematical operations and therefore typically store only numbers.
Requesting two rows creates a matrix with three columns.
m <- matrix(c(1,2,3,4,5,6),nrow=2)
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
Requesting two columns creates a matrix with three rows.
m <- matrix(c(1,2,3,4,5,6),ncol=2)
m
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
To name the columns oranges, bananas, and melon and the rows
supermarket and store, we pass a list of row names and column names to
the dimnames parameter of matrix().
m <- matrix(c(1,2,3,4,5,6),nrow=2,
dimnames=list(c("supermarket","store"),
c("oranges","bananas","melon")))
m
##             oranges bananas melon
## supermarket       1       3     5
## store             2       4     6
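As the outputs above show, matrix() fills values column by column by default. The sketch below illustrates the byrow parameter, which fills row by row instead, and shows that a named matrix can be indexed with [rows, columns] using names. The variable m_by_row is an illustrative name.

```r
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2,
            dimnames = list(c("supermarket", "store"),
                            c("oranges", "bananas", "melon")))

# Index by row and column name
m["store", "bananas"]     # 4

# An entire row, returned as a named vector
m["supermarket", ]

dim(m)                    # 2 3

# byrow = TRUE fills the matrix one row at a time
m_by_row <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, byrow = TRUE)
m_by_row[1, ]             # 1 2 3
```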
The listing function ls() returns a vector with the names of all data
structures currently in memory.
ls()
## [1] "blood" "gender" "m" "name" "pt_data"
## [6] "status" "subject1" "symptoms" "temperature"
The remove function rm() can be used to eliminate the m
and subject1 objects:
rm(m,subject1)
Combining rm() with ls() removes every object listed by ls(), which clears the R session:
rm(list = ls())
The next session is about managing data with R.