BST02: Using R for Statistics in Medical Research

Part B: Basic use of R


BASIC R

Using R

R is a command-based procedural language
- write and execute commands
- use and define functions

You may write the commands in the R console (Windows) or in a shell (Linux). You will become more familiar with the syntax as you use it.

scalar = a certain/single observation within a variable/vector/matrix
vector = a variable/column in a matrix
matrix = combination of different vectors, a dataframe/set

In practice examples

Common questions which can be answered using R!
- What is the average age?
- What is the average serum bilirubin?
- What is the average serum cholesterol?
- What is the percentage of females?
- How many missing values do we have for serum cholesterol?

Basics in R

Elementary commands: expressions and assignments
- An expression given as command is evaluated printed and discarded
- An assignment evaluates an expression and passes the value to a variable - the result is not automatically printed

In order to store information, the expression should assign the command (x <- 2098)

You can use R as a calculator!
- Basic arithmetics: +, -, *, /, ^
- Complicated arithmetics

#Tips: - R is case sensitive, e.g.,
- “sex” is different than “Sex”
- Commands are separated by a semi-colon or by a newline
- Comments can be put anywhere, starting with a hashmark #: everything to the end of the line is a comment
- Assign a value to an object by <- or =

#When it is not a numeric value: - Missing values are coded as NA (i.e., not available) is.na()
- Infinity is coded as Inf (plus infinity) or -Inf (minus infinity) is.finite()
- The Null objects are coded as NULL
- Not a number is coded as NaN (Not a Number). Example: 0/0 = NaN

DEMO

Load packages

If you are using the package for the first time, you will have to first install it

# install.packages("survival") 
# install.packages("memisc") 
library(survival)
library(memisc)

Get and view data

Load data set from package

bc <- survival::pbc

# Print the first 6 rows of the data set
head(pbc)

# View data
#view(pbc)

Common questions that can be answered in R

# What is the average `age`?
mean(pbc$age)

# What is the average `serum bilirubin`?
mean(pbc$bili)

# What is the average `serum cholesterol`?
mean(pbc$chol, na.rm = TRUE)

# What is the percentage of `females`? \
# In order to use the function `percent()`, the package `memisc` should be loaded first
percent(pbc$sex)

# What is the average `serum bilirubin` and `serum cholesterol`?
mean(pbc$chol, na.rm = TRUE)
mean(pbc$bili, na.rm = TRUE)

###We obtained a lot of information in R but we did not save anything

"Hello"

# In order to store information, the expression should assign the command
hi <- "Hello"
hi

# We need to define any object before we use it \
# E.g. `number` and `x` are first defined  \
# number
number <- 10
number
# x
x = 1
x

Things to remember!

  • = is different from==, e.g. x == 3
  • R is sensitive, e.g. pbc$Age will not run because there is a typo
  • We need to check the names first using the function names()
names(pbc)
pbc$age

Data checks

# Check for missing data
is.na(x)

# The function `head()` can be used when the output is expected to be long
head(is.na(pbc))

# If you are looking at the whole data set, you need to get a summary
table(is.na(pbc))
is.na(pbc$age)
table(is.na(pbc$age)) 

# Check for infinity data
is.infinite(pbc$age)

Common R objects

Data structure:
- Vectors have the same type of elements

pbc[1:6, c("age")]

vec <- c(1, 2, 3, 4, 5)
vec
  • Matrices have the same type of elements with the same length
pbc[1:6, c("age", "bili", "chol")]

vec <- c(1, 2, 3, 4, 5, 6)
mat <- matrix(vec, 3, 3)
mat

vec <- c(1, 2, 3, 4, 5, 6)
mat <- matrix(vec, 3, 3, byrow = TRUE) # rows will be filled instead of columns
mat
  • Data.frames have elements of different type with the same length
pbc[1:6, c("id", "sex", "bili", "chol")]

dtf <- data.frame(pbc$sex, pbc$age)
dtf[1:3,]
dtf <- data.frame(Gender = pbc$sex, Age = pbc$age)
dtf[1:3,]
  • Lists have elements of different type and length
list(pbc[1:6, c("sex")], pbc[1:2, c("sex", "bili")], pbc$age[1:4])

list1 <- list(vec = c(1:5), mat = pbc[1:2, c("age", "sex")])
list1

DEMO COMMON R OBJECTS

#' ## Load packages 
#' If you are using the package for the first time, you will have to first install it \
# install.packages("survival") 
library(survival)

#' ## Get data
#' Load data set from package
pbc <- survival::pbc

#' ## Vectors \
#' Create vectors from a data set
pbc$age
pbc$time

#' Create other vectors \
#' Continuous
vec1 <- c(1:100)
vec2 <- c(2, 32, 14, 23, 54, 13, 45)

#' Categorical
vec3 <- c("male", "female", "female", "male", "female", "female")

#' Logical
vec4 <- c(TRUE, TRUE, FALSE)

#' ## Matrices \
#' Create matrices from a data set
matrix(pbc$time[1:9], 3, 3)
(age_time <- cbind(pbc$age, pbc$time)) #rows will be bind (column 1 = age, colum 2 - time)
(age_time <- rbind(pbc$age, pbc$time)) #columns will be bind (row 1 = age, row 2 = time)

#' Create other matrices
matrix(1:4, 2, 2)

#' Fill matrix in by row
matrix(1:4, 2, 2, byrow = TRUE) #count in rows

#' TAKE CARE
(Columns_com <- cbind(c(1:4), c(1:8))) #you will count on!

#' ## Data frames \
#' Create data frames from a data set
age_time_sex <- data.frame(pbc$age, pbc$time, pbc$sex)

#' Add names to a data frame
age_time_sex <- data.frame(age = pbc$age, time = pbc$time, sex = pbc$sex)

#' Create other data frames
DF <- data.frame(Age = c(30, 20, 50),
                 Sex = c("male", "female", "female"),
                 Drug = c("yes", "yes", "no"))
head(DF)

DF <- data.frame(Age = runif(30, 1, 80),
                 Sex = sample(1:2, 30, replace = TRUE),
                 Drug = sample(1:3, 30, replace = TRUE))
head(DF)

#' ## Lists 
#' Create lists from a data set
list_pbc <- list(pbc$id, pbc$time)

#' Create other lists
myList <- list(x = c(1:20), sex = c("male", "female")) 
myList
mylist = list(names = c("Jack","Mary"), child.ages = c(4,7,9,10,11))
mylist

PRACTICAL COMMON R OBJECTS

rm(list=ls()) 

# Explore dataset
library(survival)
library(memisc)

heart <- survival::heart
retinopathy <- survival::retinopathy

head(heart)
tail(heart)

head(retinopathy)
tail(retinopathy)

# view vectors
heart$event
heart$age

retinopathy$eye
retinopathy$risk

# create numerical and categorical vector
numbers <- c(34, 24, 19, 23, 16)
numbers_2 <- c(1:200)
treatment <- c("yes", "yes", "no", "no", "no", "yes")

# create matrices and dataframes
matrix(c(heart$id, heart$age), , 2) # empty spot: let the vectors tell you! if you do not specify 2, the columns will be placed directly under each other
data.frame(id = retinopathy$id, type = retinopathy$type, trt = retinopathy$trt) 

# create lists
list(stop_heart = heart$stop, id_reti = retinopathy$id, risk_reti = retinopathy$risk)
list(numbers = numbers, many_numbers = numbers_2, treatment = treatment)
# with =, you can appoint a name to a certain variable in df

Importing data and saving your work

to import data:

  • load(), read.table(), read.csv() (no need for package)
  • package: foreign, function: read.spss(),read.dta()
  • package: Hmisc, function: sas.get()
  • package: openxlsx, function: read.xlsx()
  • package: readxl, function: read_excel()
  • package: haven, function: read_spss()

###Tips: - Short names are prefered over longer names - Try to avoid using names that contain symbols - Avoid spaces in names - Remove any comments in your data set - Make sure that any missing values in your data set are indicated with the same value (or no value)

to save data:

  • save(), write.table(),write.csv() (no need for package, NB the data needs to finish with .RData)
  • package: foreign, function: write.spss(),write.dta()
  • package: openxlsx, function: write.xlsx()

###practical examples #HIER BEN IK

library(survival)
library(openxlsx)

#' ## Get data
#' Load data set from package
pbc <- survival::pbc

#' List all your R objects 
ls()

#' ## Save your work 
#' Take care: first you need to set your working directory (Rstudio: Session -> Set Working Directory -> Choose Directory...) \
#' Use also the function `setwd(...)` to set the working directory
getwd()


dt <- pbc[1:6, c("id", "sex", "bili", "chol")]
p <- 1

#save(dt, p, file = "data.RData") #can save multiple obj. at the same time
#saveRDS(dt, file = "data.RData")

#' ## Load previous work
#load("data.RData")
#readRDS("data.RData")

#' ## Transform a RData data set into xls, csv or text format
#write.csv(dt, "mydata.csv")
#write.table(dt, "mydata.txt", sep="\t")
#write.xlsx(dt, "mydata.xlsx")

From here: copied from Lorenza Dall’aglio

##Data transformation This refers to: - how to round ocntinuous variables - need to specify digits - convert variable type - e.g. numeric to factor - compute/transform new variables - just do the math transformation you need - change from wide to long format or viceversa - wide format = 1 patient per row. - long format = multiple rows for same patient (useful when got longitudinal data)

install.packages("reshape2", print = F)
library(reshape2)
library(survival)
#' ## Get data
#' Load data set from package
pbc <- survival::pbc
pbcseq <- survival::pbcseq

#' ## Round continuous variables
pbc$ast <- round(pbc$ast, digits = 1) #1 decimal
pbc$age <- round(pbc$age, digits = 2) # 2 decimals
head(pbc)

#' ## Create factors
#' Set `sex` as factor with labels `m` for `male` and `f` for `female`
factor(pbc$sex, levels = c("m", "f"),
       labels = c("male","female"))

#' ## Transform continuous variables \
#' Categorize `age` (take as cut off value 40) 
#' 
#' * Step 1: Check whether the age variable of each patient is above 40
pbc$age40higher <- pbc$age > 40

#' * Step 2: Transform a logical variable into a numeric
pbc$age40higher <- as.numeric(pbc$age40higher)
head(pbc)

#' * Step 3: Transform a numeric variable into a factor
pbc$age40higher <- factor(pbc$age40higher, levels = c(0:1),
                              labels = c("young","old"))
head(pbc)

#' Standardize `age`
pbc$ageST <- (pbc$age-mean(pbc$age))/(sd(pbc$age))
head(pbc)


#' ## Wide/long format \
#' ### Long to wide data set
head(pbcseq)

#' Select the first (or last) row of each patient  \ \
#remove the first or last row for each patient - cause you just need one of those
head(pbcseq[!duplicated(pbcseq[c("id")]), ]) #take 1st row for patients - cause you're getting rid of duplicates
head(pbcseq[!duplicated(pbcseq[c("id")], fromLast = TRUE), ]) #here you get the 2nd row for patients - cause getting rid of duplicates from the bottom of dataset! (fromLast = T)

#' We will keep all the observations we need by using the following way \
#' 
#' * Step 1: Obtain how many repeated measurements we have per patient
vec <- table(pbcseq$id)

#' * Step 2: Take a sequence of the visits of each patient
vec2 <- sequence(vec)
pbcseq$visits <- vec2

#' * Step 3: Obtain the wide format `pbcseq` data set using the function `reshape()` from the `reshape2` package
pbcseqWide <- reshape(pbcseq, idvar = c("id"), 
                    drop = c("futime", "status", "trt", "age", "sex", 
                             "day", "ascites", "hepato", "spiders", 
                             "edema", "chol", "albumin", "alk.phos", 
                             "ast", "platelet", "protime", "stage"),
                    timevar = "visits", direction = "wide")

head(pbcseqWide)



#' ### Wide to long data set  
#' Obtain the long format `pbcseqWide` data set using the function `reshape()` from the `reshape2` package
pbcLong <- reshape(pbcseqWide, idvar = c("id"), timevar = "time", 
                    varying = list(names(pbcseqWide)[2:17]),
                    v.names = "bili", direction = "long", times = 1:16)
head(pbcLong)

#' Extra step: Order the data set by `id` number
pbcLong <- pbcLong[order(pbcLong$id),]
head(pbcLong)

Data exploration

this is stuff like mean median sd IQR etc…

library(memisc)
mean(pbc$age)
mean(pbc$age, na.rm = TRUE)
sd(pbc$age)
median(pbc$age)
IQR(pbc$age)
percent(pbc$trt)
#mean for one cat. only
pbc_males <- pbc[pbc$sex == "m", ]
mean(pbc_males$age)

Handling missing data

is.na(pbc$chol) #find missings
DF <- pbc[complete.cases(pbc$chol), ] #find complete cases (i.e. where there are no missings) - here won't have any missing values
pbc_chol_na <- pbc[is.na(pbc$chol) == TRUE, ] #get the rows with missing values 
head(pbc_chol_na)
dim(pbc_chol_na) #get the dimensions where there are missings - this gives n rows with missing 


#check outliers
pbc_out_bili <- pbc[pbc$bili > 25, ] #get the rows of ppl with values > 25 - here considered as outliers 
head(pbc_out_bili)

Data visualisation:

yay plots!! :) below, plots based on basic r functions, ggplot package, and lattice package are shown NB pick either ggplot or lattice to use for your plots cause the packages are so different and it can get confusing!

Important plotting basic functions - plot(): scatter plot (and others) - barplot(): bar plots - boxplot(): box-and-whisker plots - dotchart(): dot plots - hist(): histograms - pie(): pie charts - qqnorm(), qqline(), qqplot(): distribution plots - pairs(): for multivariate data

###Basic plots

library(survival)
library(lattice)
library(ggplot2)
library(emojifont)
library(gtrendsR)

#'## Get data
#' Load data set from package
pbc <- survival::pbc
pbcseq <- survival::pbcseq

#' ## Basic plots

########### CONTINUOUS VARIABLES
#' 1 CONTINUOUS VARIABLE (SCATTERPLOT)
plot(pbc$bili)

#' 2 CONTINUOUS VARIABLES (CORRELATION, SCATTERPLOT)
plot(pbc$age, pbc$bili)

#' add axis labels
plot(pbc$age, pbc$bili, ylab = "Serum bilirubin", xlab = "Age")

#' add axis labels - change size of axis and labels
plot(pbc$age, pbc$bili, ylab = "Serum bilirubin", xlab = "Age", cex.axis = 1.5, cex.lab = 1.9) #the cex.lab = how large labels are, axis is how large axis letters are

#' 2 CONTINUOUS VARIABLES AND 1 CATEGORICAL (SCATTERPLOTS)
#Basic plot with 3 variables - xaxis represents `age`, yaxis represents `serum bilirubin` and colours represent `sex`
plot(pbc$age, pbc$bili, ylab = "Serum bilirubin", xlab = "Age", cex.axis = 1.5, cex.lab = 1.9, col = pbc$sex, pch = 16)
legend(30, 25, legend = c("male", "female"), col = c(1,2), pch = 16) #legend - here we set all stuff for legend

#pch = option for changing circles into dots - 16 = dots
#legend put at x axis = 30 and y axis = 25 (positions)

#' 1 CONTINUOUS VARIABLE (HISTOGRAM)
hist(pbc$bili, breaks = 50) #want 50 breaks

hist(pbc$bili, breaks = seq(min(pbc$bili), max(pbc$bili), length = 20)) #breaks down naturally

#' Multiple panels with multiple histograms
par(mfrow=c(2,2)) #2 rows 2 cols - can fit 4 figures - figures specified below 
hist(pbc$bili, prob=T) #prob = T means we get the probabilities in the histogram
hist(pbc$chol, prob=T)
hist(pbc$albumin, prob=T)
hist(pbc$alk.phos, prob=T)

#' 3 CONTINUOUS VARIABLES (Multivariate plot)
pairs(cbind(pbc$bili, pbc$chol, pbc$albumin))

################ CATEGORICAL VARIABLES
#' 1 CATEGORICAL VARIABLE (BARCHART) - checking the frequency of `males` and `females`
barchart(pbc$sex)

plot(pbc$sex)

#' 1 CONTINUOUS VARIABLE BY 1 CATEGORICAL VARIABLE (BOXPLOT) - cheking the distribution of `age` per `sex` group
boxplot(pbc$age ~ pbc$sex, ylab = "Age", xlab = "Gender")


#' 1 CONTINUOUS VAR. BY 1 CAT. VARIABLE (DENSITY PLOTS) 
#Density plots of `serum bilirubin` per `sex` group to investigate the distribution
#i.e. here we get the distribution curves for bilirubine for males and females separately 
pbc_male_bili <- pbc$bili[pbc$sex == "m"] #so we can divide later by male and female
pbc_female_bili <- pbc$bili[pbc$sex == "f"]
plot(density(pbc_male_bili), col = rgb(0,0,1,0.5), ylim = c(0,0.40), #density function - specify limits
     main = "Density plots", xlab = "bili", ylab = "")
polygon(density(pbc_male_bili), col = rgb(0,0,1,0.5), border = "blue") #here with col =, we try to create a new colour - the last number in () means how transparent you want the colour to be
#polygon fills in the distribution with the colour 
lines(density(pbc_female_bili), col = rgb(1,0,0,0.5)) #this says add lines - cause we want a distribution for females too! here again we want a given colour representing the density of the female data
polygon(density(pbc_female_bili), col = rgb(1,0,0,0.5), border = "red")
legend(5,0.3, c("male", "female"),  #so ppl know which distribution is which
       col = c(rgb(0,0,1,0.5), rgb(1,0,0,0.5)), lty = 1)  


# 1 CATEOGIRCAL VARIABLE with multiple levels (Pie chart)
pbc$status <- factor(pbc$status, levels = c(0, 1, 2),
labels = c("censored", "transplant", "dead"))
pie(prop.table(table(pbc$status)), #Pie chart
col = c("green", "blue", "red"), cex = 2) #based on order above, give a colour to each label

###Lattice family this is suggested by the prof. as the best

######### CONTINUOUS VARIABLES
library(lattice)
library(survival)
library(ggplot2)

pbc <- survival :: pbc
pbcseq <- survival::pbcseq

#correlation bw TWO CONT VARS 
xyplot(bili ~ age, data = pbc, type = "p", lwd = 2)

#smooth evolution, for two cont vars categorised by a cat var. 
xyplot(bili ~ age, group = sex, data = pbc, type = "smooth", lwd = 2, col = c("red", "blue")) #lwd = 2 means thick lines, smooth cause you want smooth lines. NB tghis plot supposed to look weird here! just not the right plot 

#smooth evolution with multiple colours
yplot(bili ~ age, group = sex, data = pbc, type = c("p", "smooth"), lwd = 2, col = c("red", "blue"))

#want smooth evolution and create two diff. panels: done with "|" 
xyplot(bili ~ age | sex, data = pbc, type = c("p", "smooth"), lwd = 2, col = c("red"))   

#per status
#create factor
pbc$status <- factor(pbc$status, levels = c(0, 1, 2), labels = c("censored", "transplant", "dead"))
xyplot(bili ~ age | status, data = pbc, type = c("p", "smooth"), lwd = 2, col = c("red"), layout = c(3,1)) #here layout says 3 cols in 1 row

#INDIVIDUAL PATIENT PLOT (useful with repeated time points)
xyplot(bili ~ day, group = id, data = pbcseq, type ="l", col = "black") #we see the trajectory

#inidividual patient plot per status 
pbcseq$status <- factor(pbcseq$status, levels = c(0, 1, 2), labels = c("censored", "transplant", "dead"))
xyplot(bili ~ day | status, group = id, data = pbcseq, type ="l", col = "black", layout = c(3,1), #group = ID
       grid = TRUE, xlab = "Days", ylab = "Serum bilirubin")

###GGplot family

#one CONT VAR
ggplot(pbc, aes(age, bili, colour = sex)) + #+ sign before specifying what you want
  geom_point() #here you say: I want points! 

#add smooth line 
ggplot(pbc, aes(age, bili, colour = sex)) + 
geom_point(alpha = 0.3) + #alpha = transparency 
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#more stuff
ggplot(pbcseq[pbcseq$id == 93,], aes(day, bili)) +
geom_line() +
geom_smooth(colour = 'blue', span = 0.4) +
labs(title = "Patient 93", subtitle = "Evolution over time", #change labels
     y = "Serum bilirubin", x = "Days")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#boxplot of 2 vars
ggplot(pbc, aes(stage, bili, group = stage)) +
geom_boxplot() +
labs(y = "Serum bilirubin", x = "Stage")
## Warning: Removed 6 rows containing missing values (stat_boxplot).

#density plots
p <- ggplot(pbc, aes(bili, fill = sex)) + #you can save the plot by assigning it to an object
geom_density(alpha = 0.25) 
p 

#change the colour of the filling of the density plot p
p + scale_fill_manual(values=c("#999999", "#E69F00")) #add + and change the colour 

###Have some fun! These are funny things you can put in your plots: - emojis - getting google data - getting twitter data

#emoji plots

set.seed(123)
x1 <- rnorm(10)
y1 <- rnorm(10)
x2 <- rnorm(10)
y2 <- rnorm(10)
plot(x1, y1, cex=0)
text(x1, y1, labels=emoji('heartbeat'), cex=1.5, col='red', family='EmojiOne')
text(x2, y2, labels=emoji('cow'), cex=1.5, col='steelblue', family='EmojiOne')

search_emoji('face')

plot(x1, y1, cex=0)
text(x1, y1, labels=emoji('nerd_face'), cex=1.5, col='red', family='EmojiOne')

#google data :D 
google.trends1 = gtrends(c("feyenoord"), gprop = "web", time = "all")[[1]]

ggplot(data = google.trends1, aes(x = date, y = hits)) +
  geom_line() +
  labs(y = "Feyenoord", x = "Time") +
  ggtitle("Hits on Google")


#use twitter data
rt <- rtweet::search_tweets(q = "#rstats", n = 1000, include_rts = FALSE) #look at #rstats tweets  
cl <- rtweet::search_tweets(q = "#climate", n = 1000, include_rts = FALSE)
bg <- rtweet::search_tweets(q = "@BillGates", n = 1000, include_rts = FALSE)
emc <- rtweet::search_tweets(q = "@ErasmusMC", n = 1000, include_rts = FALSE, by = "months")
lumc <- rtweet::search_tweets(q = "@LUMC_Leiden", n = 1000, include_rts = FALSE)
ts_plot(rt)
ts_plot(cl)
ts_plot(bg)
ts_plot(emc)
ts_plot(lumc)

Recap: Most impo commands

###Useful Summary: Data manipulation Common R objects: - c() - matrix() - data.frame() - list() - cbind(), rbind()

Transformation - factor() - reshape() - order() - complete.cases()

Exploration - is.na() - dim() - mean(), sd() - median(), IQR() - percent()

Visualization - plot(), legend() - hist() - barchart() - boxplot() - xyplot(), ggplot() - par()

Import/Save/Present - head() - save(), saveRDS() - load(), readRDS() - read.csv(), write.csv(), - read.xlsx(), - write.xlsx(), - read.table(), - write.table()

Other - install.packages(), - library() - ls(), objects(), getwd()

##Indexing and Subsetting to index use [] can index by: - position, - VECTOR: e.g. x[2:4] for vector x. to exclude: x[-5] - MATRIX: matrix[2, ] - get second row but all cols - DF: df[2] - you can avoid having the comma - but this will give you the 2nd variable - LIST: works like df - mylist[2] get a list as outcome, mylist[[2]] get a vector out or index with \(: mylist\)y (just get that one df) - name, - VECTOR: e.g. x[c(“foo”, “one”)] - logical, - VECTOR e.g. x[x>7 & x > 9]; x[x > 7 | x > 9] - DF: DF[DF$Y < 30, ]

Double indexing: e.g. pbcDF <- data.frame(pbc) head(pbc[pbc$sex == “m”, 1:7]) here you index both rows and cols

Summary

Vectors [] [""] - for categorical variables

Matrices [,]

Data frames [,] [[]], [] $

Lists [] [[]] $

can try diff. subsetting in the shiny app the prof. created

#' ## Vector indexing
#' Select the 3rd element from vector `age`
pbc$age[3]

#' Select the `sex` of the 10th patient
pbc$sex[10]

#' Remove the 1st element from the `id` vector
pbc$id[-1]

#' From the vector `age`, select patients that are younger than 30
pbc$age[pbc$age < 30]

#' From the vector `age`, select only `female` patients
pbc$age[pbc$sex == "f"]

#' ## Matrix and data frame indexing
#' Select the 3rd column of the pbc data set \ 
pbc[, 3] #output = df

#' Different ways exist to obtain that
pbc[[3]] #output = vector 

#' Select the baseline details of the 5th patient
pbc[pbc$id == 5, ]

#' Select the serum cholesterol for all males
pbc[pbc$sex == "m", "bili"]
pbc$bili[pbc$sex == "m"]

#' Select the `age` for `male` patients or patients that have `serum bilirubin` more than 3
pbc[pbc$sex == "m" | pbc$bili > 5, "age"]
pbc$age[pbc$sex == "m" | pbc$bili > 5]

#' Select the first measurement per patient using the `pbcseq` data set
head(pbcseq[!duplicated(pbcseq[c("id")]), ])

#' Select the last measurement per patient using the `pbcseq` data set
head(pbcseq[!duplicated(pbcseq[c("id")], fromLast = TRUE), ])

#' Select all `male` patients that died
pbc[pbc$sex == "m" & pbc$status == 2, ]      

#' Select `male` patients or patients that died
head(pbc[pbc$sex == "m" | pbc$status == 2, ])

#' Select the `serum bilirubin` measurements only for `female` patients \
#' Calculate the mean and standard deviation \
#' Plot a histogram
pbc[pbc$sex == "f", "bili"]   

mean(pbc[pbc$sex == "f", "bili"])
sd(pbc[pbc$sex == "f", "bili"])

hist(pbc[pbc$sex == "f", "bili"])

#' Select all rows where the `serum bilirubin` measurements are smaller that 10
head(pbc[pbc$bili < 10, ])

#' Make a boxplot of `serum bilirubin` per `sex` using the previous selection
new_pbc <- pbc[pbc$bili < 10, ]
boxplot(new_pbc$bili ~ new_pbc$sex)

#' ## List indexing
#' Create a list with 3 elements: \
#' 
#' * 1st element: all `pbc$id` \
#' * 2nd element: `pbc$bili` for males \
#' * 3rd element: `pbc$age` > 30
myList <- list(pbc$id, pbc$bili[pbc$sex == "m"], pbc$age[pbc$age > 30])

#' Select the second element (the output should be a list)
myList[2]

#' Select the third element (the output should be a vector)
myList[[3]]

#' Select the third element (the output should be a vector) \
#' Then, from the third element, select the elements that are smaller than 20
#' Tips: use baby steps
newData <- myList[[3]] #get df out of list
newData[newData < 20] #then select in the new df the data that is < 20 

#' Create a list with 3 elements and give them names: \
#' 
#' * 1st element - all_id: all `pbc$id` \
#' * 2nd element - bili_male: `pbc$bili` for males \
#' * 3rd element - young: `pbc$age` < 30
myList <- list(all_id = pbc$id, bili_male = pbc$bili[pbc$sex == "m"], young = pbc$age[pbc$age < 30])

#' Select all_id by name indexing
myList$all_id