Principles of Biostatistics and Epidemiology

Introduction

In this first lab, we introduce the basics of R. We start with the RStudio interface and installing packages. Then, in the basic arithmetic part, we show you how to do calculations in R and introduce the concept of data types by creating a dataset.

Start R!

Double-click the RStudio icon on your computer and you will see a window with four panes. You can rearrange the interface in the preferences. When you open RStudio, there are two main regions, the console and the output.

You can type your code in an R script or in the console. Calculated results appear in the console, and the plots you make appear in the output region.

Installing packages and getting help

Many packages extend base R and make your work more productive and reproducible. Let’s install the “tidyverse” package. The packages in the tidyverse share a common philosophy of data and R programming and are designed to work together naturally. To run code, type it in the R script and click “Run”, or type it in the console and press Enter.

install.packages("tidyverse")

Make sure that the package name goes inside quotation marks.

After installing the tidyverse package, we still cannot use it! Imagine that R is a work table with plenty of drawers: installing a package is like buying a tool from the shop and storing it in a drawer. So how do we get the tool onto the table? Once a package is installed, you load it with the library() function.

library(tidyverse) #load the package

## ── Attaching packages ──── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ─────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

detach(package:tidyverse) #unload the package
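A package only needs to be installed once, so it can be handy to check whether it is already there before installing. The following is a minimal sketch of that idea; the helper name is_installed is hypothetical (it is not part of R or the tidyverse), and the install line is left commented out because it needs an internet connection.

```r
# Hypothetical helper: TRUE if the package can be found, FALSE otherwise.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this check should succeed on any installation.
is_installed("stats")

# Install tidyverse only when it is missing. Uncomment to run:
# if (!is_installed("tidyverse")) install.packages("tidyverse")
```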

Getting help with R

In the output region, you will find a tab called “Help”. Search there for the function you would like to use and you will see detailed information about it, such as its description, its arguments and often some examples. You can also search with the question mark from the R script or the console: put ? in front of the function name and run it, e.g., ?library. Another way is to give the name, in quotation marks, to the help() function: help("tidyverse").

Many sources can help you solve coding problems. If you are stuck, search for the problem on Google and add “R” to the search query. You can also search on Stack Overflow, though you may have to spend a little time finding someone who had the same problem as you.

Exercise 1: HELP!!!

The help() and ? functions can only find documentation for a package that is already loaded. To make sure the tidyverse package was installed successfully, load it and then use these functions to open its R documentation.

Basic arithmetic

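The examples below use the arithmetic operators in R: + (addition), - (subtraction), * (multiplication), / (division) and ^ (exponentiation), together with the functions sqrt() (square root) and abs() (absolute value).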

In R, the leftward arrow <- is the assignment operator. It stores the value or object on its right side under the name you give on its left side.

#Here is the arithmetic example
#if you want to write a note directly in the R script, use the number sign, "#"
#R ignores everything after the "#" on that line
add<- 1+1
sub<- 2-1
mul<- 2*3
div<- 10/5
exponent<- 3^2
sr<- sqrt(144)
abs<- abs(-21)

In this handout, functions are written in code font followed by parentheses, e.g., sqrt(). Objects created with <- are written in code font without parentheses.

add

## [1] 2

Typing the name of an object prints its value, so you’ll see the result of the addition example in the console.
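When several operators appear in one expression, R follows the usual order of operations, and parentheses change that order. The short sketch below illustrates this, along with two further base R operators not used above, %/% (integer division) and %% (remainder). The object names a and b are arbitrary.

```r
# Multiplication is done before addition
a <- 2 + 3 * 4
a            # 14

# Parentheses are evaluated first
b <- (2 + 3) * 4
b            # 20

# Integer division and remainder
17 %/% 5     # 3
17 %% 5      # 2
```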

Exercise 2: calculate with object

Please convert a height of 180 centimeters to meters.

height<- 180
height/100

## [1] 1.8

Please convert a weight of 160 pounds to kilograms.

weight<- 160
weight/2.205

## [1] 72.56236

Now it’s your turn to calculate your body mass index (BMI)! Store your height (in centimeters) and weight (in kilograms) with the assignment operator.

Hint: BMI is defined as the body mass in kilograms divided by the square of the body height in meters.

height<- ___
weight<- ___

weight/(height/100)^2
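As a worked illustration of the same formula, here is a sketch that does the calculation in two steps. The values 170 and 65 and the object names are made up for the example; they are not real data.

```r
# Hypothetical example values, not real data
height_cm <- 170
weight_kg <- 65

height_m <- height_cm / 100      # centimeters to meters
bmi <- weight_kg / height_m^2    # ^ is applied before /, so no extra parentheses are needed
round(bmi, 1)                    # 22.5
```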

Basic object

R distinguishes between different types of data, which are stored in different kinds of objects. To analyze data correctly in R, we should understand the types of data in the dataset. We will introduce the data types by creating a dataset. Let’s start by creating a vector that stores 12 numbers.

Numbers

id<- 1:12

class(id)

## [1] "integer"

glimpse(id)

##  int [1:12] 1 2 3 4 5 6 7 8 9 10 ...

We make a numeric object, id, which stores the 12 integers from 1 to 12. The class() function returns the type of the data in the console. The glimpse() function, called after class(), shows the structure of the object in a form that fits the width of the console. You’ll find out how useful it is!
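R stores whole numbers created with the colon operator as integers, but numbers typed one by one are stored as doubles unless you say otherwise. The sketch below shows the difference using only base R; the object names x and y are arbitrary, and str() is used as a base R counterpart to glimpse().

```r
id <- 1:12             # the colon operator creates integers
x  <- c(1, 2, 3)       # numbers typed this way are stored as doubles
y  <- c(1L, 2L, 3L)    # the L suffix asks for integers

class(id)              # "integer"
class(x)               # "numeric"
class(y)               # "integer"

# str() shows the structure of an object, much like glimpse()
str(id)
```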

Character string

After creating ids for 12 people, we are going to assign their sex. The c() function combines values into a vector, so we can store a series of categorical values in the sex object.

#this order matches the order of the id object
sex<- c("F", "F", "F", "F", "F", "F", "M", "M", "M", "M", "M", "M")

class(sex)

## [1] "character"

glimpse(sex)

##  chr [1:12] "F" "F" "F" "F" "F" "F" "M" "M" "M" "M" "M" "M"

A character string must go inside quotation marks.
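The quotation marks matter: the same digit is a number without them and a character string with them. A small sketch of the difference, using arbitrary object names a and b:

```r
a <- 1      # a number
b <- "1"    # a character string that happens to contain a digit

class(a)    # "numeric"
class(b)    # "character"

# Arithmetic on b would fail, so convert it to a number first
as.numeric(b) + 1    # 2
```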

Factor

sex<- as.factor(sex)
sex

##  [1] F F F F F F M M M M M M
## Levels: F M

class(sex)

## [1] "factor"

levels(sex)

## [1] "F" "M"

glimpse(sex)

##  Factor w/ 2 levels "F","M": 1 1 1 1 1 1 2 2 2 2 ...

A factor is a data structure that stores categorical values together with their levels. The as.factor() function converts other types of data, such as character strings, into a factor (a categorical, or enumerated, variable).

The levels() function shows the levels of the factor in order, and glimpse() shows each value as the integer code of its level.
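By default the levels are sorted alphabetically, which is why F comes before M above. If you need a different order, base R's factor() function accepts a levels argument. The sketch below uses a shortened, made-up vector and hypothetical object names to show how the integer codes follow the level order.

```r
sex_chr <- c("F", "F", "M", "M")

# Default: levels are sorted alphabetically
sex_default <- as.factor(sex_chr)
levels(sex_default)        # "F" "M"

# factor() lets you choose the order yourself
sex_custom <- factor(sex_chr, levels = c("M", "F"))
levels(sex_custom)         # "M" "F"

# The integer codes follow the level order
as.integer(sex_default)    # 1 1 2 2
as.integer(sex_custom)     # 2 2 1 1

# Count how many values fall in each level
table(sex_default)
```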

Data frame

Now we have two vectors, id and sex. We can combine them into one data frame using tibble(), a function from the tibble package, which is one of the tidyverse packages.

library(tidyverse)
df<- tibble(id, sex)
df

## # A tibble: 12 x 2
##       id sex  
##    <int> <fct>
##  1     1 F    
##  2     2 F    
##  3     3 F    
##  4     4 F    
##  5     5 F    
##  6     6 F    
##  7     7 M    
##  8     8 M    
##  9     9 M    
## 10    10 M    
## 11    11 M    
## 12    12 M

class(df)

## [1] "tbl_df"     "tbl"        "data.frame"

glimpse(df)

## Observations: 12
## Variables: 2
## $ id  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
## $ sex <fct> F, F, F, F, F, F, M, M, M, M, M, M

About class(df): tbl_df inherits from tbl, which inherits from data.frame

A data frame is a table in which each variable has its own column, each observation has its own row and each value has its own cell. To create a data frame, we can use the tibble() function.
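For comparison, base R can build the same kind of table without any package, using data.frame(). This is a sketch with a shortened, made-up dataset; the name df_base is hypothetical. It also shows a few base R functions for checking the rows and columns.

```r
id  <- 1:4
sex <- factor(c("F", "F", "M", "M"))

# data.frame() takes its column names from the vector names
df_base <- data.frame(id, sex)

nrow(df_base)     # number of observations (rows): 4
ncol(df_base)     # number of variables (columns): 2
names(df_base)    # "id" "sex"
str(df_base)      # structure of the whole table
```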

Create variables by `dplyr`

df<- df%>%
  mutate(height=sample(155:185, 12, replace= TRUE),
         weight=sample(60:80, 12, replace= TRUE))
df

## # A tibble: 12 x 4
##       id sex   height weight
##    <int> <fct>  <int>  <int>
##  1     1 F        176     65
##  2     2 F        180     64
##  3     3 F        178     61
##  4     4 F        167     76
##  5     5 F        167     63
##  6     6 F        155     70
##  7     7 M        182     66
##  8     8 M        183     65
##  9     9 M        156     68
## 10    10 M        179     65
## 11    11 M        163     66
## 12    12 M        165     65

You may notice that this code looks different from the code we used at the beginning. Before explaining it, there are some prerequisites to cover.

What is the symbol, `%>%`?

The pipe, %>%, comes from the magrittr package by Stefan Milton Bache. It is designed to help you write code that is easy to read and understand. %>% takes whatever is before it and feeds it into the next step.

In this case, df%>%mutate() means mutate(df,...).
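To see why this helps readability, compare a chain of nested function calls with the same steps written as a pipeline. The sketch below uses a made-up numeric vector. The piped version runs only if the magrittr package is installed, which is an assumption about your setup.

```r
x <- c(4, 9, 16)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(x)), 1)
nested    # 3

# The same steps written with the pipe are read from left to right
if (requireNamespace("magrittr", quietly = TRUE)) {
  `%>%` <- magrittr::`%>%`
  piped <- x %>% sqrt() %>% mean() %>% round(1)
  print(identical(nested, piped))
}
```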

Manipulate or add new variables by `mutate`

mutate() is one of the main functions in dplyr. Its job is to add new columns or overwrite existing ones. The following code therefore means: create a variable called height that is randomly assigned 12 numbers between 155 and 185.

sample(x, size, replace = FALSE) is a base R function that takes a sample of the specified size from the elements of x. Setting replace = TRUE means an element can occur more than once in the sample.

mutate(df, height=sample(155:185, 12, replace= TRUE))

## # A tibble: 12 x 4
##       id sex   height weight
##    <int> <fct>  <int>  <int>
##  1     1 F        163     65
##  2     2 F        164     64
##  3     3 F        167     61
##  4     4 F        158     76
##  5     5 F        161     63
##  6     6 F        165     70
##  7     7 M        167     66
##  8     8 M        180     65
##  9     9 M        171     68
## 10    10 M        181     65
## 11    11 M        174     66
## 12    12 M        181     65
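Because sample() draws at random, the heights printed here differ from those in the earlier table, and your own results will differ again. If you want a random draw to be repeatable, base R's set.seed() fixes the starting point of the random number generator. A minimal sketch, in which the seed value 2019 and the object names are arbitrary choices:

```r
set.seed(2019)    # any number works; it just has to be the same each time
first_draw <- sample(155:185, 12, replace = TRUE)

set.seed(2019)    # same seed, same draw
second_draw <- sample(155:185, 12, replace = TRUE)

identical(first_draw, second_draw)    # TRUE
```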

Exercise 3: add a new column named “age”

Please use mutate to add a new column named age, and randomly assign 12 numbers between 20 and 30 to it.

df<- df%>%
  mutate(
  ___=sample(___:___, ___, replace= TRUE)
         )

Exercise 4: construct the arithmetic equation in mutate( )

Do you remember the BMI equation we constructed before? Please create a new column, BMI, that holds the calculated BMI for everyone.

Appendix

Extraction operator

The extraction operator has three forms: [, [[ and $. The [ form extracts elements from vectors and data frames. The [[ and $ forms extract the contents of a single element. However, $ extracts only by name, so a data frame column without a name cannot be extracted with $.

#example of "["
df[1]

## # A tibble: 12 x 1
##       id
##    <int>
##  1     1
##  2     2
##  3     3
##  4     4
##  5     5
##  6     6
##  7     7
##  8     8
##  9     9
## 10    10
## 11    11
## 12    12

df[2]

## # A tibble: 12 x 1
##    sex  
##    <fct>
##  1 F    
##  2 F    
##  3 F    
##  4 F    
##  5 F    
##  6 F    
##  7 M    
##  8 M    
##  9 M    
## 10 M    
## 11 M    
## 12 M

#example of "[["
df[[1]]

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12

df[[2]]

##  [1] F F F F F F M M M M M M
## Levels: F M

#example of "$"
df$id

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12
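One way to check what each form returns is to look at the class of the result. The sketch below uses a small made-up base R data frame (df_small is a hypothetical name) so that it runs without any package.

```r
df_small <- data.frame(id = 1:3, sex = factor(c("F", "M", "M")))

class(df_small[1])       # "data.frame": still a table, with one column
class(df_small[[1]])     # "integer": the values themselves
class(df_small$sex)      # "factor"

# [[ also accepts a column name
df_small[["sex"]]
```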

Row and column

The extraction operator can also select specific rows and columns of a data frame, using the form data_frame[row, column].

df

## # A tibble: 12 x 6
##       id sex   height weight   age   BMI
##    <int> <fct>  <int>  <int> <int> <dbl>
##  1     1 F        176     65    28  21.0
##  2     2 F        180     64    29  19.8
##  3     3 F        178     61    28  19.3
##  4     4 F        167     76    28  27.3
##  5     5 F        167     63    21  22.6
##  6     6 F        155     70    21  29.1
##  7     7 M        182     66    29  19.9
##  8     8 M        183     65    25  19.4
##  9     9 M        156     68    20  27.9
## 10    10 M        179     65    26  20.3
## 11    11 M        163     66    22  24.8
## 12    12 M        165     65    30  23.9

df[3,5]

## # A tibble: 1 x 1
##     age
##   <int>
## 1    28

The df we created before has 12 rows and 6 columns. To extract the age in the third row, put row 3 and column 5 into the extraction operator: df[3, 5].

What about extracting all the information in the seventh row?

df[7, ]

## # A tibble: 1 x 6
##      id sex   height weight   age   BMI
##   <int> <fct>  <int>  <int> <int> <dbl>
## 1     7 M        182     66    29  19.9
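Rows and columns can also be selected by column name or by a condition instead of by position. The sketch below uses a small made-up base R data frame; the names df_small and males are hypothetical. Note that a base R data frame returns a plain value for a single cell, whereas the tibble above returns a 1 x 1 tibble.

```r
df_small <- data.frame(
  id  = 1:4,
  sex = factor(c("F", "F", "M", "M")),
  age = c(28L, 29L, 25L, 20L)
)

# Select a cell by column name instead of column number
df_small[3, "age"]    # 25

# Keep only the rows where a condition is TRUE
males <- df_small[df_small$sex == "M", ]
males

# Several rows and columns at once
df_small[1:2, c("id", "age")]
```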

Descriptive statistics

Import and export data

Every data analysis starts with loading the dataset into the analysis tool. At the end of the analysis, we usually want to save the tidy, transformed data. R has many packages for importing different kinds of data; here I use comma separated values as the example.

The comma separated values

Import

Importing comma separated values (csv) is simple: get the path of the file and pass it to the read.csv() function. Use the assignment operator (<-) to store the imported data as a dataset.

REVEAL<- read.csv("/Users/liupochen_macbook/Desktop/MGH_EPI/REVEAL.csv")

Make sure the path goes inside quotation marks and that folders are separated by forward slashes (/), not backslashes.

Export

Exporting data is almost the same as importing it. The difference is that you have to tell the function which dataset to export: give the name of the dataset as the first argument of write.csv(), e.g., write.csv(name_of_dataset, "path").

write.csv(REVEAL, "/Users/liupochen_macbook/Desktop/MGH_EPI/REVEAL.csv")
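To try exporting and importing without touching your own files, you can write to a temporary file and read it back. This is a sketch with a made-up data frame; df_small, path and df_back are hypothetical names. The row.names = FALSE argument is an optional setting that keeps write.csv() from adding an extra column of row numbers.

```r
df_small <- data.frame(id = 1:3, height = c(176L, 180L, 178L))

# tempfile() gives a throwaway path, so nothing on your disk is overwritten
path <- tempfile(fileext = ".csv")

# Export, then import the same file again
write.csv(df_small, path, row.names = FALSE)
df_back <- read.csv(path)
df_back
```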

Lab session 1: Measuring disease occurrence and descriptive epidemiology

Bo-Chen Liu

2019/8/26
