In this first lab, we introduce the basics of R. We start with a tour of the RStudio interface and how to install packages. In the basic arithmetic part, we show you how to do calculations in R, and then we introduce the concept of data types by creating a dataset.
You can type your code in an R script or in the console. Calculated results appear in the console, and any plots you make appear in the output region.
Many packages extend the basic functions of R and make your work more productive and reproducible. Now, let’s install the “tidyverse” package. The packages in the tidyverse share a common philosophy of data and R programming, and are designed to work together naturally. To run code, type it in the R script and press the “Run” button, or type it in the console and press Enter.
install.packages("tidyverse")
Please make sure that the package name goes inside quotation marks.
After installing the tidyverse package, we still cannot use it yet! You can imagine R as a work table with plenty of drawers: installing a package is like buying a tool from the shop and storing it in a drawer. So, how do we get the tool onto the table? Once a package is installed, you load it with the library() function.
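If you rerun a script often, you may not want to reinstall a package every time. The sketch below checks first whether a package is available. The helper name is_installed() is hypothetical (it is not part of R), and the install call is only shown as a comment.

```r
# is_installed() is a hypothetical helper name, not a built-in function
is_installed <- function(pkg) {
  # requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")  # stats ships with R, so this should be TRUE

# Install the tidyverse only when it is missing (uncomment to run):
# if (!is_installed("tidyverse")) install.packages("tidyverse")
```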
library(tidyverse) #load the package
## ── Attaching packages ──── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
detach(package:tidyverse) #unload the package
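To check which packages are currently on the “table”, base R offers search(). As a small sketch (using the stats package, which ships with R, as a stand-in), the :: operator also lets you call a single function from an installed package without loading it with library().

```r
# search() lists the attached packages and environments, in lookup order
attached <- search()
"package:stats" %in% attached   # stats is attached by default

# package::function() calls one function without library()
middle <- stats::median(c(1, 5, 9))
middle
```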
In the output region, there is a tab named “Help”. There you can search for a function and see detailed information about it, such as its description, its arguments, and often some examples. You can also search from the R script or the console with the question mark: put ? in front of the function name and run it, e.g., ?library(). Another way is to pass the name, in quotation marks, to help(): help("tidyverse").
Many sources can help you solve coding problems. If you’re stuck, search for the problem on Google and add “R” to the search query. You can also search on Stack Overflow, though it may take a little time to find someone who had the same problem as you.
By default, the help() function and the ? operator only find a function when its package is already loaded. To make sure the tidyverse package was successfully installed and loaded, please use them to open its R documentation.
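If a package is installed but not loaded, help() can still find a page when you name the package. The sketch below uses sd() from the stats package purely as an illustration; the object names page and arg_names are made up.

```r
# Name the package explicitly, so the page is found even if the package is not loaded.
# At the console, run the call on its own (without the assignment) to open the page.
page <- help("sd", package = "stats")

# formals() lists the arguments of a function without opening the help page
arg_names <- names(formals(sd))
arg_names
```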
The following table describes the arithmetic operators in R.
In R, the leftwards arrow <- is the assignment operator. It stores the value or object on its right-hand side under the name you give on its left-hand side.
#Here is the arithmetic example
#to write a note directly in the R script, use the number sign, "#"
#R will ignore everything after the "#" on that line
add<- 1+1
sub<- 2-1
mul<- 2*3
div<- 10/5
exponent<- 3^2
sr<- sqrt(144)
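absolute<- abs(-21) #named "absolute" so the object does not share its name with the abs() function
In this lab, functions are written in code font followed by parentheses, e.g., sqrt(). An object created with <- is written in code font without parentheses, e.g., add.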
add
## [1] 2
Running the name of an object prints its value, so you’ll see the result of the addition example on the console.
Please convert a height of 180 centimeters to meters (1.8 m).
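The arithmetic operators follow the usual order of operations, and parentheses change it. A short sketch follows; the object names are made up for illustration, and %/% and %% are two further base R operators that are not used in the example above.

```r
no_parens   <- 2 + 3 * 4     # multiplication first: 14
with_parens <- (2 + 3) * 4   # parentheses first: 20
quotient    <- 7 %/% 3       # integer division: 2
remainder   <- 7 %% 3        # remainder (modulo): 1
power_first <- -2^2          # ^ binds tighter than the minus sign: -4
```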
height<- 180
height/100
## [1] 1.8
Please convert the weight from pounds to kilograms.
weight<- 160
weight/2.205
## [1] 72.56236
Now, it’s your turn to calculate your body mass index (BMI)! Use the assignment operator to store your height (in centimeters) and your weight (in kilograms).
Hint: BMI is defined as the body mass in kilograms divided by the square of the body height in meters.
height<- ___
weight<- ___
weight/(height/100)^2
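The BMI formula can also be wrapped in a small function, so that the unit conversion happens in one place. This is only a sketch: bmi() and its argument names are hypothetical, and it assumes height in centimeters and weight in kilograms.

```r
# bmi() is a hypothetical helper: weight in kilograms, height in centimeters
bmi <- function(weight_kg, height_cm) {
  height_m <- height_cm / 100   # centimeters to meters
  weight_kg / height_m^2
}

bmi(81, 180)   # 81 / 1.8^2 = 25

# A weight given in pounds must be converted first (1 kg is about 2.205 lb)
bmi(160 / 2.205, 180)
```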
R distinguishes between the different types of data stored in objects. To analyze data correctly in R, we should understand the types of data in the dataset. We will introduce the data types by creating a dataset. Let’s start by creating a vector that stores 12 numbers.
id<- 1:12
class(id)
## [1] "integer"
glimpse(id)
## int [1:12] 1 2 3 4 5 6 7 8 9 10 ...
We made a numeric object, id, which stores the 12 integers from 1 to 12. The class() function returns the type of the data on the console. The glimpse() function, called after class(), shows the structure of the object in a form that fits the width of the console. You’ll find out how useful it is!
After creating ids for 12 people, we’re going to assign their sex. The c() function combines values into a vector. That is, we can store a series of categorical values in the sex object.
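To get a feel for what class() reports, here is a short sketch with a few common types. The object names are made up for illustration.

```r
whole   <- 1:3            # the colon operator creates integers
decimal <- c(1.5, 2, 3)   # numbers with decimals are stored as "numeric"
text    <- "F"            # anything in quotation marks is "character"
flag    <- TRUE           # TRUE and FALSE are "logical"

class(whole)    # "integer"
class(decimal)  # "numeric"
class(text)     # "character"
class(flag)     # "logical"

# A vector holds only one type: mixing numbers and text gives character
mixed <- c(1, "F")
class(mixed)    # "character"
```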
#this order will match to the order of the id object
sex<- c("F", "F", "F", "F", "F", "F", "M", "M", "M", "M" ,"M", "M")
class(sex)
## [1] "character"
glimpse(sex)
## chr [1:12] "F" "F" "F" "F" "F" "F" "M" "M" "M" "M" "M" "M"
A character value must go inside quotation marks.
sex<- as.factor(sex)
sex
## [1] F F F F F F M M M M M M
## Levels: F M
class(sex)
## [1] "factor"
levels(sex)
## [1] "F" "M"
glimpse(sex)
## Factor w/ 2 levels "F","M": 1 1 1 1 1 1 2 2 2 2 ...
A factor is a data structure that stores a categorical variable together with its levels. The as.factor() function converts other types of data, such as the character vector here, into a factor (also called a categorical or enumerated variable).
The levels() function shows the levels of the factor in order; furthermore, glimpse() shows each value as the integer position of its level (here 1 for F and 2 for M).
Now we have two vectors, id and sex. We can combine these two vectors into one data frame. We’re going to use the tibble() function from the tibble package, which is one of the tidyverse packages.
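By default, as.factor() orders the levels alphabetically. If another order is wanted, factor() accepts the levels explicitly. A small sketch with made-up values:

```r
answers <- c("M", "F", "M", "F", "F")

default_order <- as.factor(answers)
levels(default_order)       # alphabetical: "F" "M"

custom_order <- factor(answers, levels = c("M", "F"))
levels(custom_order)        # as requested: "M" "F"

# The integer codes behind the labels follow the level order
as.integer(default_order)   # 2 1 2 1 1
as.integer(custom_order)    # 1 2 1 2 2

# table() counts how many values fall into each level
counts <- table(default_order)
counts
```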
library(tidyverse)
df<- tibble(id, sex)
df
## # A tibble: 12 x 2
## id sex
## <int> <fct>
## 1 1 F
## 2 2 F
## 3 3 F
## 4 4 F
## 5 5 F
## 6 6 F
## 7 7 M
## 8 8 M
## 9 9 M
## 10 10 M
## 11 11 M
## 12 12 M
class(df)
## [1] "tbl_df" "tbl" "data.frame"
glimpse(df)
## Observations: 12
## Variables: 2
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
## $ sex <fct> F, F, F, F, F, F, M, M, M, M, M, M
About class(df): tbl_df inherits from tbl, which inherits from data.frame
A data frame is a table in which each variable has its own column, each observation has its own row, and each value has its own cell. To create a data frame, we can use the tibble() function.
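tibble() also accepts name = value pairs, which sets the column names explicitly. A minimal sketch with three made-up rows (the object name people is hypothetical):

```r
library(tibble)

people <- tibble(
  id  = 1:3,
  sex = factor(c("F", "M", "F"))
)

people
nrow(people)    # number of observations (rows)
ncol(people)    # number of variables (columns)
names(people)   # column names
```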
df<- df%>%
  mutate(height=sample(155:185, 12, replace= TRUE),
         weight=sample(60:80, 12, replace= TRUE))
df
## # A tibble: 12 x 4
## id sex height weight
## <int> <fct> <int> <int>
## 1 1 F 176 65
## 2 2 F 180 64
## 3 3 F 178 61
## 4 4 F 167 76
## 5 5 F 167 63
## 6 6 F 155 70
## 7 7 M 182 66
## 8 8 M 183 65
## 9 9 M 156 68
## 10 10 M 179 65
## 11 11 M 163 66
## 12 12 M 165 65
In this script, you may notice that the R code looks different from the code we used at the beginning. Before explaining the code above, there are some prerequisites to cover.
%>%: The pipe, %>%, comes from the magrittr package by Stefan Milton Bache. The pipe is designed to help you write code that is easy to read and understand. %>% takes whatever is before it and feeds it into the next step.
In this case, df%>%mutate() means mutate(df,...).
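To see the pipe at work on something smaller than a data frame, here is a sketch with a plain numeric vector (the object names are made up). Both versions compute the same value; the piped one reads from left to right.

```r
library(dplyr)   # loading dplyr also makes %>% available

numbers <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- sum(sqrt(numbers))

# Piped calls are read from left to right
piped <- numbers %>% sqrt() %>% sum()

nested   # 1 + 2 + 3 = 6
piped
```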
mutate(): mutate() is one of the main functions of dplyr. Its job is to add new columns. Therefore, the following code means: create a new variable called height, filled with 12 numbers drawn at random between 155 and 185.
sample(x, size, replace= FALSE) is a base R function that takes a sample of the specified size from the elements of x. replace= TRUE means sampling with replacement, so the same value can occur more than once in the sample.
mutate(df, height=sample(155:185, 12, replace= TRUE))
## # A tibble: 12 x 4
## id sex height weight
## <int> <fct> <int> <int>
## 1 1 F 163 65
## 2 2 F 164 64
## 3 3 F 167 61
## 4 4 F 158 76
## 5 5 F 161 63
## 6 6 F 165 70
## 7 7 M 167 66
## 8 8 M 180 65
## 9 9 M 171 68
## 10 10 M 181 65
## 11 11 M 174 66
## 12 12 M 181 65
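Because sample() draws at random, the numbers change every time the code runs, which is why the heights in the two outputs above differ. If you need the same draw again, base R's set.seed() fixes the random number generator. A sketch (the seed value 2024 is arbitrary):

```r
set.seed(2024)   # any integer works; the same seed gives the same draw
first_draw <- sample(155:185, 12, replace = TRUE)

set.seed(2024)
second_draw <- sample(155:185, 12, replace = TRUE)

identical(first_draw, second_draw)   # TRUE

# Without replace = TRUE, no value can be drawn twice
no_repeats <- sample(155:185, 12)
anyDuplicated(no_repeats)            # 0 means no duplicates
```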
Please use mutate() to add a new column named age, and randomly assign 12 numbers between 20 and 30 to it.
df<- df%>%
mutate(
___=sample(___:___, ___, replace= TRUE)
)
Do you remember the BMI equation we constructed before? Please create a new column, BMI, that holds the BMI calculated for everyone.
The extraction operator has different forms: [, [[ and $. The first form, [, extracts elements from vectors and data frames and keeps the container, so df[1] is still a data frame. The second and third forms, [[ and $, extract the contents of a single element, so df[[1]] is a vector. However, $ only extracts by name; therefore, a column cannot be extracted with $ unless you refer to it by its name.
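One possible shape of the solution is sketched here on a small made-up tibble rather than on df, so that the exercise stays open. The object name toy and its values are hypothetical; inside mutate(), existing columns can be used directly by name.

```r
library(dplyr)

toy <- tibble(
  height = c(180, 160),   # centimeters
  weight = c(81, 64)      # kilograms
)

# The new column is computed row by row from the existing columns
toy <- toy %>%
  mutate(BMI = weight / (height / 100)^2)

toy
```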
#example of "["
df[1]
## # A tibble: 12 x 1
## id
## <int>
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
## 10 10
## 11 11
## 12 12
df[2]
## # A tibble: 12 x 1
## sex
## <fct>
## 1 F
## 2 F
## 3 F
## 4 F
## 5 F
## 6 F
## 7 M
## 8 M
## 9 M
## 10 M
## 11 M
## 12 M
#example of "[["
df[[1]]
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
df[[2]]
## [1] F F F F F F M M M M M M
## Levels: F M
#example of "$"
df$id
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
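The difference between the three forms is easiest to see by asking for the class of each result. A sketch on a small made-up tibble (small is a hypothetical name):

```r
library(tibble)

small <- tibble(id = 1:3, sex = factor(c("F", "M", "F")))

class(small[1])       # still a tibble with one column
class(small[[1]])     # "integer": the column itself
class(small$sex)      # "factor"

# [ and [[ also accept the column name in quotation marks
small["sex"]
small[["sex"]]
identical(small[["sex"]], small$sex)   # TRUE
```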
The extraction operator can also take both the row and the column you want. The following form shows the selection of rows and columns from a data frame: data_frame[row, column].
df
## # A tibble: 12 x 6
## id sex height weight age BMI
## <int> <fct> <int> <int> <int> <dbl>
## 1 1 F 176 65 28 21.0
## 2 2 F 180 64 29 19.8
## 3 3 F 178 61 28 19.3
## 4 4 F 167 76 28 27.3
## 5 5 F 167 63 21 22.6
## 6 6 F 155 70 21 29.1
## 7 7 M 182 66 29 19.9
## 8 8 M 183 65 25 19.4
## 9 9 M 156 68 20 27.9
## 10 10 M 179 65 26 20.3
## 11 11 M 163 66 22 24.8
## 12 12 M 165 65 30 23.9
df[3,5]
## # A tibble: 1 x 1
## age
## <int>
## 1 28
The df we created before has 12 rows and 6 columns. To extract the age of the third row, put row 3 and column 5 into the extraction operator: df[3,5].
What about extracting all the information in the seventh row? Leave the column position empty.
df[7, ]
## # A tibble: 1 x 6
## id sex height weight age BMI
## <int> <fct> <int> <int> <int> <dbl>
## 1 7 M 182 66 29 19.9
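Rows and columns can also be selected by name, by several positions at once, or by a condition. A sketch on a small made-up tibble (small, males and the values are hypothetical):

```r
library(tibble)

small <- tibble(
  id  = 1:4,
  sex = factor(c("F", "F", "M", "M")),
  age = c(28L, 21L, 29L, 25L)
)

small[2, "age"]             # row 2, column "age": a 1 x 1 tibble
small$age[2]                # the same value as a plain number
small[c(1, 3), ]            # rows 1 and 3, all columns
small[, c("id", "age")]     # all rows, two columns

# A logical condition keeps the rows where it is TRUE
males <- small[small$sex == "M", ]
males
```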
Every data analysis starts with loading the dataset into the analysis tool. At the end of the analysis, we usually want to save the tidied and transformed data. There are many packages for R to import several kinds of data, but here I take comma separated values as the example.
#### The comma separated values
Importing comma separated values (csv) is very simple: the only thing you need is the path of the data file, which you pass to the read.csv() function. You can use the assignment operator (<-) to store the imported data as a dataset.
REVEAL<- read.csv("/Users/liupochen_macbook/Desktop/MGH_EPI/REVEAL.csv")
Please make sure that the path goes inside quotation marks and that folders are separated by a forward slash (/), not a single backslash.
Exporting data is almost the same as importing it. The difference is that you have to tell the function which dataset you want to export. Therefore, put the name of the dataset as the first argument of the write.csv() function, e.g., write.csv(name_of_dataset, "path").
write.csv(REVEAL, "/Users/liupochen_macbook/Desktop/MGH_EPI/REVEAL.csv")
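A round trip can be tried safely with a temporary file. This sketch uses a made-up data frame and tempdir(), so it does not touch your own files. row.names = FALSE is an optional argument of write.csv() that keeps the row numbers out of the file.

```r
toy <- data.frame(id = 1:3, height = c(176, 180, 178))

# file.path() joins folder and file names with forward slashes
path <- file.path(tempdir(), "toy.csv")

write.csv(toy, path, row.names = FALSE)   # export
toy_back <- read.csv(path)                # import

toy_back
names(toy_back)   # "id" "height"
```

Note that writing to the same path you imported from replaces the original file, so a new file name is usually the safer choice for exported data.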