This session, we will focus on taking baby steps with R.
We will learn how to:
The final step involves three aspects of the data that I think are important to look for.
These are:
You will need to sign up for an RStudio Cloud account when you click on the link for our project here.
R can be unforgiving. It is case-sensitive, and if you don’t get function and object names exactly right, it will not understand what you are trying to tell it. R, although very sophisticated, is a dumb machine! It is also useful to litter your code with comments. To write one, simply put a # sign in front of your text.
You will notice I have some comments to help orient me in the code on RStudio.
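As a small illustration of commenting and case-sensitivity, here is a minimal sketch. The object name ages and its values are made up for this example and are not part of our dataset.

```r
# Everything after a # sign is ignored by R, so use it for notes to yourself.

# Store three made-up ages in an object called ages
ages <- c(23, 31, 27)

mean(ages)  # a comment can also follow code on the same line

# R is case-sensitive: typing mean(Ages) would fail here,
# because we created ages, not Ages.
```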
RStudio has four different panes, and at first it can be a bit intimidating. Let’s explore the different parts of the program before we start.
It seems strange, but the Source pane sends commands to the Console or Terminal. This is so that you have a record of what you’ve done in your Source file, which is plain text. The Console is like the brains of the operation and executes your commands. The upper right pane shows the data.frames or datasets that we are working on, and the bottom right pane has different tabs for our files, plots, packages and help.
R is like a ‘go-kart’ that many, many, many people have contributed to and turned into a Rolls-Royce. Basic mathematical and statistical functions are ‘built-in’ to R, but libraries are little ‘packages’ of code with functions that allow us to extend the use of R.
Four libraries will help us achieve our goals today:
rio, which helps us import a variety of datasets. See here for more information.
magrittr, which allows us to use the pipe operator in R. More information about the magrittr package is available here. This will be explained later.
visdat, which helps us see missing data in our data.frame in a visual manner.
skimr, which is a package for summarising our data.frame. See more information here.
To install libraries, we need to be connected to the internet!
To install and load libraries in R we use the following code:
install.packages("rio")
library(rio)
This is the simple way. If you have already installed rio, you can avoid re-installing it with a bit more code that I’ll use quite frequently:
if(!require(rio)) install.packages("rio")
library(rio)
This is a bit complex, but it illustrates a useful idea. In computing, we can make the computer do something only if a certain condition is fulfilled. This is called conditional logic. It is vitally important.
The if(!require(rio)) bit is saying “have a look and see whether you already have the rio package”. The require(rio) bit tries to load rio and returns a value of TRUE if the package is installed on your computer and loads successfully, and FALSE if not. We only want the computer to install the library if it is not there. If it is not there, the result of require(rio) will be FALSE. The if function only executes the code that follows it when the condition is TRUE, so we need to turn the TRUE into a FALSE and vice versa. This is what the ! does.
Don’t worry if you don’t get this. It is a bit of magic that saves you a bit of time loading new libraries! The if statement means that the install.packages() code only runs if you don’t have rio already installed. When you have lots of packages to load, you’ll find this very handy!
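To see the if and ! logic on its own, here is a minimal sketch that does not install anything. The objects package_is_available and action are made up for this example; package_is_available simply stands in for the TRUE or FALSE that require(rio) would return.

```r
# A made-up flag standing in for the result of require(rio)
package_is_available <- FALSE

# ! flips TRUE to FALSE and FALSE to TRUE
!package_is_available

# Record what happens, so we can look at it afterwards
action <- "nothing to do"

# The code after if() runs only when the condition in the brackets is TRUE
if (!package_is_available) {
  action <- "install the package"
}

action
```

Try changing package_is_available to TRUE and running the code again: the line inside the if statement is then skipped.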
The library(rio) bit makes sure that R can find the function we are after. For example, if we want to use the import() function from the rio package but haven’t typed library(rio), we have to tell R where to find the import() function with the following code:
rio::import("./path/to/csv or Excel file.csv")
With the library call, we can now more simply type:
import("./path/to/csv or Excel file.csv")
However, this assumes that no other loaded library has its own import() function. If one does, R uses the version from the library that was loaded most recently. This introduces a little bit of ambiguity, which one may wish to avoid by writing rio::import().
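The ambiguity can be seen without installing anything. In this minimal sketch we deliberately create our own (wrong) function with the same name as median() from the stats package, which ships with R. The objects values and our fake median are made up for illustration.

```r
# Make our own function that shares its name with stats::median()
# (a deliberately wrong version, purely for illustration)
median <- function(x) "this is not the real median"

values <- c(1, 5, 9)

median(values)         # R finds our version first, in the global environment
stats::median(values)  # the package::function form always uses the stats version

# Tidy up so that median() refers to the stats version again
rm(median)
median(values)
```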
You can find out which libraries R is searching in a session by typing:
searchpaths()
The paths are hierarchical: .GlobalEnv (the global environment, which is shown in the top right pane in RStudio) is searched first, then tools:rstudio, then (on my Windows machine) C:/Program Files/R/R-4.1.2/library/stats, etc.
Once library(rio) is called, R is now constantly looking inside the rio library for any functions, so if you ask for the import() function, R will now find it there. This applies to any other function you want to call without referencing the library first.
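Here is a minimal sketch of how library() changes the search path. It uses tools, a package that ships with R, rather than rio, so that nothing needs to be installed; the file name is only a piece of text here and no file is read.

```r
# Is the tools package on the search path yet?
"package:tools" %in% search()

# This works either way, because we name the package explicitly
tools::file_ext("heart.csv")

# Attach tools to the search path
library(tools)
"package:tools" %in% search()

# R can now find file_ext() without being told where it lives
file_ext("heart.csv")
```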
To make things a bit confusing, require() also tells R that you want to use the functions in this package (but only if the package is already there, not if require() fails and the package has to be installed). There is no harm in adding the library() code to make doubly sure that R can find the function inside your package.
Now, write some code to install the magrittr, visdat, ggplot2 and skimr packages. Use the if, require, install.packages and library functions.
Write the code in the editor (the top left pane) and use “CTRL + ENTER” to execute it.
You’ll see the code execute in the Console, in the bottom pane.
We will import some data from the internet. It is a cot-death dataset. We will talk about what it means later, but for now will just dive in to have a look at the data. It was a case-control study to look for risk factors for cot-death in the 1980s.
We will use the following code:
df <- rio::import("https://flexiblelearning.auckland.ac.nz/data-analysis/menu/1/files/simple_sids_epiinfo2.xlsx")
This effectively takes our Excel sheet from the web and pulls it into our computer. The <- is an assignment operator, which gives our new spreadsheet the name df. I’ve chosen the name df because the technical word for a spreadsheet in R is a data.frame.
Here, "https://flexiblelearning.auckland.ac.nz/data-analysis/menu/1/files/simple_sids_epiinfo2.xlsx" is an argument (or modifier) to import, the function that does the importing. The argument tells rio::import where to look for the dataset.
You will see our object up on the top right pane (Environment/History).
You can then double click it to see what it looks like in a mode similar to Excel.
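To recap the two ideas used in the import code, assignment with <- and arguments to a function, here is a minimal sketch that needs no internet connection. The toy data.frame and its values are made up for this example.

```r
# Build a tiny made-up data.frame and give it the name toy using <-
toy <- data.frame(id = 1:3, age = c(23, 31, 27))

class(toy)        # confirms that toy is a data.frame

# head() is the function; toy and n = 2 are its arguments
head(toy, n = 2)  # the n argument controls how many rows are shown
```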
If you are uncertain of all the different options for the rio::import function, type the following into the editor and execute.
?rio::import
You’ll see the help documentation pop up in the bottom right-hand pane. This is really useful!
We can now have a look for duplicates in the df object.
df[duplicated(df), ]
## Pipe
df[duplicated(df),] %>% nrow
## Old-fashioned
nrow(df[duplicated(df),])
## Remove duplicate rows from data.frame
df <- df[!duplicated(df), ]
Here, we are using square brackets [] to subset the df object. The code in the square brackets is divided by the comma: what comes before the comma acts on rows, and what comes after it acts on columns.
We are essentially saying to R, “look for any duplicated rows in the df object”. This means looking across all columns, because we haven’t limited it further.
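It can help to try this on a data.frame small enough to check by eye. This is a minimal sketch using made-up data, in which the third row is an exact copy of the first.

```r
# A made-up data.frame in which row 3 is an exact copy of row 1
toy <- data.frame(id  = c(1, 2, 1, 3),
                  age = c(25, 30, 25, 30))

duplicated(toy)          # one TRUE or FALSE per row; only row 3 is TRUE

toy[duplicated(toy), ]   # before the comma: keep only the duplicated rows
toy[, "age"]             # after the comma: keep only the age column

# ! keeps the rows that are NOT duplicates
toy_unique <- toy[!duplicated(toy), ]
nrow(toy_unique)         # 3 rows remain
```

Note that duplicated() flags the second and later copies of a row, not the first, which is why removing the flagged rows still leaves one copy behind.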
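If we instead wrote:
df[duplicated(df$Mother_age), ]
df[duplicated(df$Mother_age), ] %>% nrow
then R would only look for repeated values in the Mother_age column, rather than for whole rows that are repeated. Here, the $ sign is used to indicate a column within the data.frame.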
Interpret the output. Note the use of the %>% operator. This takes the output of the code on its left (the duplicated rows in the data.frame) and passes it to the function on its right, which then counts them. This is nice for avoiding loads of brackets. The alternative, nesting one function inside another as in the old-fashioned version, is a bit more confusing.
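The difference between duplicated rows and duplicated values in one column, and the equivalence of the piped and nested forms, can be sketched with made-up data. The pipe part only runs if the magrittr package is installed.

```r
# Made-up data: every row is different, but two ages are repeated
toy <- data.frame(id = 1:4, age = c(25, 30, 25, 30))

sum(duplicated(toy))               # 0: no row is an exact copy of another

# Nested (old-fashioned) form, read from the inside out
nrow(toy[duplicated(toy$age), ])   # 2: rows 3 and 4 repeat an earlier age

# Piped form, read from left to right; it gives the same count
if (requireNamespace("magrittr", quietly = TRUE)) {
  library(magrittr)
  toy[duplicated(toy$age), ] %>% nrow
}
```

A repeated value in a single column is usually not a problem in itself: two people can easily share the same age.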
Here’s the code to look at the ranges of values of a data.frame.
df %>% skim
Interpret the output.
What is the object and which is the function in the code here?
Are the ranges sensible?
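If you want to cross-check ranges without any extra packages, base R can do a similar job. This is a minimal sketch on made-up data (the column names and values are invented and are not from our dataset), with one implausible value planted in it.

```r
# Made-up data with one implausible age (a mother aged 3) and one missing value
toy <- data.frame(mother_age   = c(24, 3, 31, NA),
                  birth_weight = c(3200, 2900, 3500, 3100))

# Minimum, quartiles, mean, maximum and number of NAs for each column
summary(toy)

# range() returns the smallest and largest values;
# na.rm = TRUE tells it to skip missing values
range(toy$mother_age, na.rm = TRUE)
```

A minimum of 3 for a mother’s age is the kind of value that should prompt a second look at the data.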
visdat::vis_miss(df) +
  ggplot2::theme(plot.margin = ggplot2::unit(c(1, 3, 1, 1), "cm"))
Interpret the plot. See if you can figure out how to sort the columns by their missingness. You’ll have to execute ?vis_miss.
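The plot can be backed up with numbers using base R alone. This is a minimal sketch on made-up data (the column names and values are invented), counting the missing values in each column.

```r
# Made-up data with some missing values (NA)
toy <- data.frame(id           = 1:4,
                  mother_age   = c(24, NA, 31, NA),
                  birth_weight = c(3200, 2900, NA, 3100))

is.na(toy)            # TRUE wherever a value is missing
colSums(is.na(toy))   # number of missing values in each column
colMeans(is.na(toy))  # proportion of missing values in each column
```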
In the data folder is the heart.csv file.
Hint:
df <- rio::import("./data/heart.csv")