Tutorial 1: Introduction to RStudio and Base R

Introduction to R:

R is a free, open-source programming language for conducting statistical analysis and visualisation. It will be the platform and language we will be using for the class. To download R, go to the following website: https://cran.r-project.org/

You will see the links for downloading R.

Click on the link relevant to your operating system and follow the instructions below:

1) Microsoft Windows:
    i. After clicking on the link, select “base” since you are installing R for the first time.
    ii. Follow the Windows installation instructions. Choose “Accept Default Startup Options”.
2) Mac OS X:
    i. After clicking on the link, select the most up-to-date version (“R-3.6.1.pkg”). 
    ii. Follow the OS X installation instructions.

Introduction to RStudio:

RStudio is an Integrated Development Environment (IDE) for R. An IDE is a software application that provides a full set of facilities for programming tasks such as writing, compiling, debugging, and running code. RStudio has an easy-to-understand interface that groups your tasks by window. The windows are arranged so that you can see your code, notes, variables, graphs, etc. on a single screen. We shall elaborate on this later in the tutorial. Before that, let us first download and install the software.

1) Click on the link below to get to the RStudio download site: https://rstudio.com/products/rstudio/download/

2) Scroll down until you see several links.

3) Select the installer for “macOS 10.12+ (64-bit)” or “Windows 10/8/7 (64-bit)”.

4) Follow the installation instructions for your respective OS.

After you have installed the software, open it. You will see an interface. The interface consists of 4 windows:

1) Console (bottom left) – the main window for typing code for immediate execution / testing.
2) Global environment (top right) – shows the data and variables that you have imported / created in your current environment.
3) Script (top left) – stores and organises code and notes. This is where we type and save our code for future use.
4) Bottom right – this pane has multiple tabs showing:
    i. The files in your current directory;
    ii. Any graph that you plot; and
    iii. The list of packages installed on your computer.
    

The convenient thing about RStudio is that many tasks (e.g. opening files, installing packages, etc.) can be done just by clicking the mouse. The code corresponding to the action will appear in the “Console” panel. Click on the “Packages” tab in the bottom right panel. To install the “tidyverse” package, click the “Install” button in that tab and enter the package name, or type:

install.packages("tidyverse", repos = "http://cran.us.r-project.org")
## 
## The downloaded binary packages are in
##  /var/folders/ml/7cpckwn97dn5ws_msfs1zg5c0000gn/T//RtmpoMc9Ub/downloaded_packages

Installing a package does not load it. To use a package, you should load it with the following line of code:

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1.9000     ✔ purrr   0.3.3     
## ✔ tibble  2.1.3          ✔ dplyr   0.8.3     
## ✔ tidyr   1.0.0          ✔ stringr 1.4.0     
## ✔ readr   1.3.1          ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
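
The “Conflicts” lines above mean that, once tidyverse is loaded, a plain filter() or lag() call refers to the dplyr version rather than the one in the stats package. As a minimal sketch (the moving-average values are an illustration and not part of the tutorial data), the package::function notation can be used to pick a specific version explicitly:

```r
# The package::function form states exactly which package's function to use.
# stats::filter() is the function that dplyr::filter() masks; here it
# computes a 3-point moving average over the numbers 1 to 5.
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))

# Convert the result to a plain numeric vector for printing
as.numeric(moving_avg)  # NA 2 3 4 NA
```

In the rest of this tutorial, filter() always means dplyr's filter(), which selects rows of a data frame.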

Click on the history tab in the top right panel.

You’ll notice that “library(tidyverse)” also appears here. The history tab keeps track of every command executed during your session.

Creating a Data Table in Script

Copy and paste the following code into your script (top left panel):

names <- c("Earth", "Jupiter", "Mars", "Mercury", "Neptune", "Pluto", "Saturn", "Uranus", "Venus")

position <- c(3, 5, 4, 1, 8, 9, 6, 7, 2)

orbit <- c("365.256 days", "11.862 yrs", "685.758 days", "87.97 days", "164.8 yrs", "247.94 yrs", "29.4571 yrs", "84.0205 yrs", "224.7 days")

day <- c(1, 0.41, 1.026, 58.646, 0.671, -6.387, 0.426, -0.718, -243.019)

mass <- c(1, 317.8, 0.107, 0.055, 17.147, 0.00218, 95.159, 14.536, 0.815)

moons <- c(1, 79, 2, 0, 14, 0, 53, 27, 0)

habitable <- c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)

Run the script. You do this by selecting every line of code you want to run, then clicking the “Run” button on the top right of the panel. You have created 7 vectors with the names “names”, “position”, “orbit”, “day”, “mass”, “moons” and “habitable”.

Textbook Reference:

Pages 10-11 of the textbook / reading talk about vectors: how to create them, commands for inspecting their properties, indexing particular elements in a vector, and creating subsets from them.
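
As a brief illustration of the indexing and subsetting mentioned above, here is a minimal sketch using the same values as the “position” vector. The particular selections are only examples:

```r
# Same values as the tutorial's "position" vector
position <- c(3, 5, 4, 1, 8, 9, 6, 7, 2)

length(position)        # number of elements: 9
position[1]             # first element (R counts from 1): 3
position[c(1, 3)]       # first and third elements: 3 4
position[position > 5]  # subset of elements greater than 5: 8 9 6 7
```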

Now look at the other panels. What do you notice?

1) The same lines of code are repeated in the “Console” panel. What the script does is take the selected lines, and run them through the console.

2) At the same time, you’ll also notice that these lines are recorded in your history.
3) Click on the “Environment” tab in the top right panel. You’ll also notice a list of new variables. These are the vectors you created using the script above.

The tab also provides some additional details on each vector / variable. The most important are “type” and “length”.

i. Type: There are three primary types of data: 1) “numeric”; 2) “character”; and 3) “logical”. Numeric data are numbers; they can be “integer” or “double” (with decimal places). Character data are text and are always written within quotation marks (“”). Logical data are TRUE / FALSE values. Note that R is case sensitive, so TRUE / FALSE must be in CAPS with no quotation marks.
ii. Length: Note that every vector has the same length (9 elements). When combining vectors into tables, it is important that they are the same length.
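
The type and length shown in the “Environment” tab can also be checked from the console. The sketch below uses three short stand-in vectors (hypothetical names and values, not the tutorial's full data) to show the commands:

```r
# Stand-in vectors, one for each primary type
planet_name  <- c("Earth", "Mars")
moon_count   <- c(1, 2)
is_habitable <- c(TRUE, FALSE)

class(planet_name)   # "character"
class(moon_count)    # "numeric"
typeof(moon_count)   # "double": numbers are stored as doubles by default
typeof(2L)           # "integer": the L suffix requests an integer
class(is_habitable)  # "logical"
class("TRUE")        # "character": the quotation marks make it text

length(moon_count)   # 2
```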

Copy and paste the following line of code to your script and run it:

solar_system <- data.frame(planet_names = names, position_in_system = position, orbit, day_length = day, mass_earths = mass, no_of_moons = moons, habitable)

After running this line, what do you notice?

1) In the “Environment” tab, you see a new entry “solar_system”. Notice that the type for “solar_system” is not numeric, character or logical but rather “data.frame”.

Data frames are specialised tables in which the vectors are organised in columns as variables, with the rows containing the set of values for each variable.

2) As is always the case, the line you just ran will also be reflected in your “Console” and captured in your “History”.
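
To make the structure of a data frame more concrete, here is a minimal sketch with a small stand-in table (the name “planets” and its three rows are assumptions for illustration). It also shows what happens when the vectors have different lengths:

```r
# A small stand-in data frame built from two vectors of equal length
planets <- data.frame(
  planet = c("Earth", "Mars", "Venus"),
  moons  = c(1, 2, 0),
  stringsAsFactors = FALSE  # keep text columns as character in older R versions
)

nrow(planets)   # number of rows: 3
ncol(planets)   # number of columns: 2
names(planets)  # column names: "planet" "moons"
planets$moons   # one column, returned as a vector: 1 2 0
str(planets)    # compact summary of each column's type

# Vectors of length 3 and 2 cannot be combined, so data.frame() stops with
# an error. tryCatch() captures the error message instead of halting.
mismatch <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) conditionMessage(e)
)
mismatch
```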

Type solar_system in your console to show the data table (you can also type it into the script and run it).

solar_system
##   planet_names position_in_system        orbit day_length mass_earths
## 1        Earth                  3 365.256 days      1.000     1.00000
## 2      Jupiter                  5   11.862 yrs      0.410   317.80000
## 3         Mars                  4 685.758 days      1.026     0.10700
## 4      Mercury                  1   87.97 days     58.646     0.05500
## 5      Neptune                  8    164.8 yrs      0.671    17.14700
## 6        Pluto                  9   247.94 yrs     -6.387     0.00218
## 7       Saturn                  6  29.4571 yrs      0.426    95.15900
## 8       Uranus                  7  84.0205 yrs     -0.718    14.53600
## 9        Venus                  2   224.7 days   -243.019     0.81500
##   no_of_moons habitable
## 1           1      TRUE
## 2          79     FALSE
## 3           2     FALSE
## 4           0     FALSE
## 5          14     FALSE
## 6           0     FALSE
## 7          53     FALSE
## 8          27     FALSE
## 9           0     FALSE

Go to the solar_system line in the “Environment” panel. You’ll notice a table symbol at the right end of that line.

Click on this symbol. You’ll notice a new tab appearing in your “Script” panel, labelled solar_system. This shows the table you’ve created. You can filter and sort data using this table. However, before we proceed further, let’s first try to save your work.

Exporting Your Table to Excel

Copy and paste the following lines of code to your script:

install.packages("openxlsx")
library(openxlsx)

write.xlsx(solar_system, "solar_system.xlsx")

Run the lines. The first two lines install an Excel conversion package called “openxlsx” and load it so that we can access its commands, of which we are interested in “write.xlsx”. The last line writes the data frame into an Excel file.

Package libraries: Packages are user-contributed code, functions and data that extend R’s capabilities and make it simpler to use. Put simply, they are shortcuts that allow you to perform complex tasks with just a single line of code.
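
A package only needs to be installed once, but it must be loaded in every new session. As a hedged sketch, the helper below (“ensure_installed” is a hypothetical name, not part of R or of any package) installs a package only when it is missing. The demonstration uses “stats”, which ships with R, so that nothing is downloaded:

```r
# Hypothetical helper: install a package only if it is not already available
ensure_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can now be used
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" is always present, so this call installs nothing
available <- ensure_installed("stats")
available

# In this tutorial you would instead run:
#   ensure_installed("openxlsx")
#   library(openxlsx)
```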

Textbook Reference: Page 25 of the textbook / reading gives some background as well as an example of loading the “ggplot2” graphing package.

Click on the “Packages” tab in the bottom right panel. Scroll down the list of packages and look for the package called “openxlsx” which you’ve just installed.

You can see a brief description of the package, which is to “Read, Write and Edit xlsx Files”. Click on the blue “openxlsx” word. This immediately opens the help tab, which gives a description of the commands in this package. Scroll down the list until you find the write.xlsx command. Read its description.

You can even click on it to see the syntax for writing the command.

After executing your script, go to the “Files” tab in the bottom right panel. Scroll down to the bottom. You’ll see that a new Excel file has been created titled “solar_system.xlsx”. You can open the file and see that the table is the same as that of your data frame.

Data Manipulation Using Base R and Tidyverse

Copy and paste the following lines of code to your script and run them:

install.packages("tidyverse")
library(tidyverse)

“Tidyverse” is actually a family of packages that deals with a range of R functions from data manipulation to analysis and visualisation. You can see the downloaded packages in the “Console” panel.

We will be using some functions from “dplyr” in this section and plotting graphs using “ggplot2” in a later tutorial.

Rearranging the Table Using “arrange()”:

Type “solar_system” in your command line (“Console” panel) and press Enter. Look at the table that appears. You’ll notice that the planets are arranged in alphabetical order. What if we want to arrange them by their distance from the sun? We will need to sort them according to their “position_in_system” (column 2). Copy and paste the following lines of code to your script and run them:

solar_system_sorted <- arrange(solar_system, position_in_system)
solar_system_sorted

You’ll notice that you have created a new data frame “solar_system_sorted”, where the planets are now arranged in order.
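
For comparison, the same kind of sort can be sketched in base R with order(), which returns the row numbers in sorted order. The small “solar” table below is a stand-in with assumed column names, not the full tutorial data:

```r
# Stand-in table with three planets in alphabetical order
solar <- data.frame(
  planet   = c("Earth", "Mars", "Venus"),
  position = c(3, 4, 2),
  stringsAsFactors = FALSE
)

# order() gives the row numbers that put "position" in ascending order
row_order <- order(solar$position)
row_order                  # 3 1 2

# Use those row numbers to rearrange the rows
solar_sorted <- solar[row_order, ]
solar_sorted$planet        # "Venus" "Earth" "Mars"

# decreasing = TRUE sorts from the farthest planet to the nearest
solar[order(solar$position, decreasing = TRUE), "planet"]  # "Mars" "Earth" "Venus"
```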

Adding New Columns to the Data Frame Using “mutate()”:

Say we are interested in finding out if any of the planets besides Earth could potentially be colonised (maybe using SpaceX). To find the answer, we first need to obtain additional information. Copy and paste the following lines of code to your script and run them:

temp_high_C <- c(427, 465, 58, 20, 35700, 11700, 4737, 7000, -218)
temp_low_C <- c(-173, 465, -88, -153, -145, -178, -224, -218, -240)
main_comp <- c("rock", "rock", "rock", "rock", "gas", "gas", "gas", "gas", "rock")

colony_unsorted <- solar_system_sorted %>%
  mutate(temp_high_C = temp_high_C, temp_low_C = temp_low_C, main_comp = main_comp)
colony_unsorted

You have now created a new data frame titled “colony_unsorted” with three new columns of data added (high temperature in Celsius, low temperature in Celsius, and main composition).
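
Note that the new values are matched to the rows purely by position, which is why the three vectors above list the planets in orbital order and are added to the sorted data frame. A minimal base R sketch of the same idea, using a stand-in table with assumed names, is:

```r
# Stand-in table already sorted by position (Venus, Earth, Mars)
sorted <- data.frame(
  planet   = c("Venus", "Earth", "Mars"),
  position = c(2, 3, 4),
  stringsAsFactors = FALSE
)

# The new values must be listed in the same order as the rows
sorted$temp_high_C <- c(465, 58, 20)

# Each planet now carries its own value
sorted$temp_high_C[sorted$planet == "Mars"]  # 20
names(sorted)                                # "planet" "position" "temp_high_C"
```

If the values were listed in a different order from the rows, R would not complain, but the temperatures would be attached to the wrong planets.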

You’ll notice the code above uses a strange symbol “%>%”. This symbol is known as the “Pipe Operator”. It enables you to chain / link multiple functions together (like a pipe) and means “take A and do B”. Therefore, in the code above, it means “take solar_system_sorted and mutate ……”.
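
To illustrate “take A and do B”, the sketch below writes the same step with and without the pipe. It assumes the dplyr package is installed, and the small “planets” table is a stand-in for illustration:

```r
library(dplyr)

# Stand-in table
planets <- data.frame(
  planet   = c("Earth", "Mars", "Venus"),
  position = c(3, 4, 2),
  stringsAsFactors = FALSE
)

# Without the pipe: the data frame is the first argument of the function
nested <- arrange(planets, position)

# With the pipe: the data frame on the left is passed in as the first argument
piped <- planets %>% arrange(position)

identical(nested, piped)  # TRUE

# Pipes can be chained: sort first, then keep rows with position above 2
chained <- planets %>%
  arrange(position) %>%
  filter(position > 2)
chained$planet            # "Earth" "Mars"
```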

Filtering the Data Using “filter()”:

Now let’s filter the data to identify planets that could potentially be hospitable to humans. There are four criteria that must be met: 1) The planet must be a rock; 2) The planet must not be too hot; 3) The planet must not be too cold; and 4) The planet must not be Earth (since we’re already staying there). Copy and paste the following lines of code to your script:

colony_filtered <- colony_unsorted %>%
  filter(temp_high_C < 200 & temp_low_C > -200 & main_comp == "rock" & habitable != TRUE)
colony_filtered

Run the code. You have created a data frame “colony_filtered” showing the planets that meet all four criteria above.

Mars is the only planet identified. Therefore, we hope that SpaceX succeeds!

Once again, the pipe operator is used in the code above. Here it means “take colony_unsorted and filter ……”. You’ll also notice the “&” symbol. This is a Boolean operator that means “and”; it indicates that ALL four criteria must be satisfied. If only some of the conditions need to be satisfied, you can use the “or” operator, which in R is represented by the symbol “|”.

Textbook Reference:

You can check page 6 of the textbook / reading for the different types of Boolean and comparison operators. They include:

1) “&&” – “and” for single TRUE / FALSE values (use “&” to compare whole vectors or columns element by element, as in filter() above).
2) “||” – “or” for single TRUE / FALSE values (“|” is the element-by-element version).
3) “!” – “not”, which turns TRUE into FALSE and vice versa.
4) “==” – equal to.
5) “>=”, “<=” – greater than or equal to, and less than or equal to.
6) “!=” – not equal to.
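
The way these operators combine can be seen by building the TRUE / FALSE vector step by step. The sketch below uses base R and a three-row stand-in table (the name “colony” is an assumption, and the values are copied from the rows for Venus, Earth and Mars):

```r
# Stand-in table with the columns used in the filter
colony <- data.frame(
  planet      = c("Venus", "Earth", "Mars"),
  temp_high_C = c(465, 58, 20),
  temp_low_C  = c(465, -88, -153),
  main_comp   = c("rock", "rock", "rock"),
  habitable   = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Each comparison gives one TRUE / FALSE value per row
colony$temp_high_C < 200  # FALSE TRUE TRUE

# "&" keeps a row only when every condition is TRUE for that row
keep_all <- colony$temp_high_C < 200 &
  colony$temp_low_C > -200 &
  colony$main_comp == "rock" &
  !colony$habitable
colony[keep_all, "planet"]     # "Mars"

# "|" keeps a row when at least one condition is TRUE
keep_either <- colony$temp_high_C < 200 | colony$habitable
colony[keep_either, "planet"]  # "Earth" "Mars"

# "&&" and "||" work on single values
TRUE && FALSE  # FALSE
TRUE || FALSE  # TRUE
```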

Now that you have obtained your results, let us try to save the data in Excel. Use the following code:

write.xlsx(colony_filtered, "solar_system.xlsx")

Open “solar_system.xlsx”. What do you notice? “solar_system.xlsx” contains only one sheet with your filtered results above. The “solar_system” table is gone. The code has overwritten your old file because the file names are the same. You can either save the data in a separate file, or try the following code:

write.xlsx(list(solar_system, colony_filtered), file = "solar_system.xlsx")

Open “solar_system.xlsx”. Now you have two sheets in the file with the two tables you created.

Finally, click on the small save button at the top left of your “Script” panel to save your script file.
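
As a follow-up to the two-sheet export above, the sheets can be given meaningful names by naming the elements of the list. The sketch below assumes the openxlsx package is installed. It uses two small stand-in tables and writes to a temporary file (both are assumptions for illustration) so that your “solar_system.xlsx” is left untouched:

```r
library(openxlsx)

# Stand-in tables
all_planets <- data.frame(
  planet = c("Earth", "Mars"),
  moons  = c(1, 2),
  stringsAsFactors = FALSE
)
candidates <- all_planets[all_planets$planet == "Mars", ]

# Temporary output file for the demonstration
out_file <- tempfile(fileext = ".xlsx")

# The names in the list become the sheet names in the workbook
write.xlsx(list(all_planets = all_planets, candidates = candidates), file = out_file)

# Check the sheet names and read one sheet back into R
sheet_names <- getSheetNames(out_file)
sheet_names
read_back <- read.xlsx(out_file, sheet = "candidates")
read_back
```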