2. Tutorial for POPLHLTH 304: Getting started with R

Simon Thornley

15 March, 2022


Out of my depth

Getting started with R

This session, we will focus on taking baby steps with R.

We will learn how to:

  • load libraries
  • import data
  • check the integrity of our data

The final step involves three aspects of the data that I think are important to look for.

These are:

  • duplicates
  • missing data
  • ranges

Before we start

You will need to sign up for an Rstudio cloud account when you click on the link for our project here.

R can be unforgiving. It is case-sensitive, and if you don’t get function and object names exactly right, it will not understand what you are trying to tell it. R, although very sophisticated, is a dumb machine! It is also useful to litter your code with comments. To write one, simply put a # sign in front of what you are writing; R ignores the rest of that line.
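As a small illustration of both points, here is a sketch in which the object name my_number is made up for this example. Comments are ignored by R, and names must match exactly, including upper and lower case.

```r
# Everything after a # sign is a comment; R ignores it
my_number <- 5        # hypothetical object name, chosen for illustration

my_number * 2         # works, because the name matches exactly

# R treats My_Number as a different name from my_number.
# exists() checks for a name without stopping the script with an error.
exists("my_number")   # TRUE
exists("My_Number")   # FALSE
```

Typing My_Number on its own would instead give an "object not found" error, which is usually the first sign of a capitalisation slip.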

You will notice I have some comments to help orient me in the code on RStudio.

RStudio has four different parts and at first it can be a bit intimidating. Let’s explore the different parts of the program before we start.

Orientating yourself to RStudio

It seems strange, but the Source pane sends commands to the Console or Terminal. This is so that you have a record of what you’ve done as your Source file, which is plain text. The Console is like the brains of the operation and executes your commands. The right upper pane shows the data.frames or datasets that we are working on and the bottom right pane has different tabs for our files, plots, packages and help.

Loading libraries

R is like a ‘go-kart’ that many, many, many people have contributed to and turned into a Rolls-Royce. Basic mathematical and statistical functions are ‘built-in’ to R, but libraries are little ‘packages’ of code with functions that allow us to extend the use of R.

Four libraries that will help us achieve our goals today are:

rio

which helps us import a variety of datasets. See here for more information.

magrittr

allows us to use the pipe operator in R. More information about the magrittr package is available here. This will be explained later.

visdat

This helps us see missing data in our data.frame in a visual manner.

skimr

skimr is a package for summarising our data.frame. See more information here.

To install libraries, we need to be connected to the internet!

To install and load libraries in R we use the following code:

install.packages("rio")
library(rio)

This is the simple way. If you have already installed rio, you can avoid re-installing it with a bit more code that I’ll use quite frequently:

if(!require(rio)) install.packages("rio")
library(rio)

This is a bit complex, but illustrates a useful function. In computing we can make the computer do stuff if a certain condition is fulfilled. This is called conditional logic. It is vitally important.

The

if(!require(rio)) 

bit, is saying “have a look and see whether you already have the rio package”. The

require(rio) 

bit tries to load rio and returns a value of TRUE if the package is installed on your computer, and FALSE if not. We only want the computer to install the library if it is not there. If it is not there, the result of require(rio) will be FALSE. The if statement only executes the code that follows it when its condition is TRUE, so we need to turn the FALSE into a TRUE and vice versa. This is what the ! does. Don’t worry if you don’t get this. It is a bit of magic that saves you a bit of time loading new libraries! The if statement means that the install.packages() code only runs if you don’t have rio already installed. When you have lots of packages to load, you’ll find this very handy!
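To see the logic on its own, without installing anything, here is a minimal sketch. The name package_found is hypothetical and simply stands in for the TRUE or FALSE value that require() returns.

```r
# A stand-in for the TRUE/FALSE value that require() returns;
# 'package_found' is a hypothetical name used only for this sketch
package_found <- FALSE

# !package_found flips FALSE to TRUE, so the code after if() runs
if (!package_found) message("Not found, so this is where we would install it")

package_found <- TRUE

# Now !package_found is FALSE, so nothing happens
if (!package_found) message("This line is never shown")
```

Only the first message is shown, which mirrors install.packages() running only when the package is missing.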

The

library(rio) 

bit makes sure that the R console can find the function we are after. For example, if we want to use the import() function from the rio package but haven’t typed library(rio), we have to tell R where to find the import() function with the following code:

rio::import("./path/to/csv or Excel file.csv") 

With the library call, we can now more simply type:

import("./path/to/csv or Excel file.csv") 

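However, this assumes that no library loaded later has its own import() function. R searches the most recently loaded library first, so a later library with a function of the same name would be used instead. This introduces a little bit of ambiguity, which one may wish to avoid by writing rio::import().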

You can find out where R is looking for functions in a session by typing

searchpaths()

The paths are hierarchical: the first, .GlobalEnv (the global environment, which is shown in the top right pane in RStudio), is searched first, then tools:rstudio, then (on my Windows machine) C:/Program Files/R/R-4.1.2/library/stats, etc.

Once library(rio) is called, R is now constantly looking inside the rio library for any functions, so if you ask for the import() function, R will now find it there. This applies to any other function you want to call without referencing the library first.
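The search order can be tried out with a small sketch. It uses stats4, a package that ships with R, in place of rio so that nothing needs installing, and it deliberately defines a throwaway function called sd in the global environment to show which version R finds first. Both choices are assumptions made for illustration only.

```r
# stats4 ships with R, so it is used here instead of rio (no install needed)
library(stats4)

# search() lists the names of the places R looks, in order
search()

# Objects in the global environment are found first.
# 'sd' below is our own function that hides stats::sd; it is only an illustration.
sd <- function(x) "my own sd"

sd(c(2, 4, 6))          # finds our version in .GlobalEnv first
stats::sd(c(2, 4, 6))   # package::function asks for the stats version: 2

rm(sd)                  # remove our version so sd() means stats::sd again
```

Writing the package name followed by :: always picks out one specific version, whatever else is on the search path.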

To make things a bit confusing, require() also loads the package so that R can use its functions, but only if the package is already installed. If require() fails and install.packages() has to run, the package is installed but not yet loaded, so the library() call afterwards makes sure that R can find the functions inside your package either way.

Now, write some code to install the magrittr, visdat, ggplot2 and skimr packages. Use the if, require, install.packages and library functions.

Write them in the editor (top left-hand pane) and use “CTRL + ENTER” to execute the code.

You’ll see the code execute in the Console (bottom left pane).

Import our data

We will import some data from the internet. It is a cot-death dataset. We will talk about what it means later, but for now will just dive in to have a look at the data. It was a case-control study to look for risk factors for cot-death in the 1980s.

We will use the following code:

df <- rio::import("https://flexiblelearning.auckland.ac.nz/data-analysis/menu/1/files/simple_sids_epiinfo2.xlsx")

This effectively takes our Excel sheet from the web and pulls it into our computer. The <- is an assignment operator which names our new spreadsheet df. I’ve chosen the name df because the technical word for a spreadsheet in R is a data.frame.

Here, "https://flexiblelearning.auckland.ac.nz/data-analysis/menu/1/files/simple_sids_epiinfo2.xlsx" is an argument, or modifier, to the import function, which does the importing. The argument tells rio::import where to look for the dataset.

You will see our object up on the top right pane (Environment/History).

You can then double click it to see what it looks like in a mode similar to Excel.
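If you prefer typing to clicking, a few base R functions give a similar quick look. This is a sketch on a tiny made-up data.frame (toy_df and its columns are hypothetical, not the cot-death data); the same functions could be run on df.

```r
# A tiny made-up data.frame standing in for an imported dataset
toy_df <- data.frame(
  id     = c(1, 2, 3),
  weight = c(3.2, 2.9, 3.6)
)

dim(toy_df)    # number of rows, then number of columns
names(toy_df)  # column names
head(toy_df)   # first few rows
str(toy_df)    # type of each column and a preview of its values
```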

If you are uncertain of all the different options for the rio::import function, type the following into the editor and execute.

?rio::import

You’ll see help documentation pop-up in the bottom right-hand pane. This is really useful!

Check for duplicates

We can now have a look for duplicates in the df object.

df[duplicated(df), ]

## Pipe

df[duplicated(df),] %>% nrow

## Old-fashioned

nrow(df[duplicated(df),])

## Remove duplicate rows from data.frame

df <- df[!duplicated(df), ]

Here, we are using square brackets [] to subset the df object. The code in the square brackets is divided by the comma. What comes before the comma selects rows and what comes after it selects columns.

We are essentially saying to R, “look for any duplicated rows in the df object”. This means looking across all columns, because we haven’t limited it further.
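The same ideas can be tried on a small example. The data.frame below is made up (toy_df, id and age are hypothetical names), with one row repeated on purpose.

```r
# Made-up data: rows 2 and 4 are identical across every column
toy_df <- data.frame(
  id  = c(1, 2, 3, 2),
  age = c(25, 31, 25, 31)
)

toy_df[2, ]        # before the comma: row 2, all columns
toy_df[, "age"]    # after the comma: the age column, all rows

duplicated(toy_df)              # FALSE FALSE FALSE TRUE
toy_df[duplicated(toy_df), ]    # keeps only the rows flagged TRUE (row 4)
toy_df[!duplicated(toy_df), ]   # ! flips the flags, keeping rows 1 to 3
```

Note that duplicated() flags only the second and later copies of a row, so the first copy is kept when duplicates are removed.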

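If we instead wrote the following, R would only compare values in the Mother_age column, flagging every row whose Mother_age has already appeared in an earlier row: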

df[duplicated(df$Mother_age), ]

df[duplicated(df$Mother_age),] %>% nrow

Here, the $ sign is used to indicate a column within the data.frame.

Interpret the output. Note the use of the %>% operator. This takes the output of the expression on its left (the duplicated rows in the data.frame) and passes it to the function on its right, which then counts them. This is nice for avoiding loads of brackets. The alternative way of doing it, nesting one function inside another as in the ‘old-fashioned’ code above, is a bit more confusing.
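A sketch with made-up numbers (values is a hypothetical name) shows that the two styles give the same answer. It assumes the magrittr package is installed.

```r
library(magrittr)  # provides the %>% pipe used in this tutorial

values <- c(1, 4, 9)  # made-up numbers for illustration

# Nested brackets: read from the inside out
sum(sqrt(values))          # 6

# With the pipe: read from left to right
values %>% sqrt %>% sum    # 6
```

The piped version reads in the order the steps happen: take the values, take their square roots, then add them up.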

Ranges

Here’s the code to look at the ranges of values of a data.frame.

df %>% skim

Interpret the output.

What is the object and which is the function in the code here?

Are the ranges sensible?
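For a single column, base R can answer the same question without any extra packages. The data below are made up (toy_df and its columns are hypothetical), with one implausible value planted to show what to look for.

```r
# Made-up data with one implausible value (a mother aged 250)
toy_df <- data.frame(
  mother_age   = c(22, 31, 250, 27),
  birth_weight = c(3100, 2800, 3500, 3300)
)

# range() returns the smallest and largest value of a column
range(toy_df$mother_age)     # 22 250

# summary() gives the minimum, quartiles, mean and maximum of every column
summary(toy_df)
```

A maximum of 250 for an age is the kind of value that should send you back to check the original records.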

Missing data

# Plot the missing values and widen the right-hand margin to 3 cm
visdat::vis_miss(df) +
  ggplot2::theme(plot.margin = ggplot2::unit(c(1, 3, 1, 1), "cm"))

Interpret the plot. See if you can figure out how to sort the columns by their missingness. You’ll have to execute ?vis_miss.
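If you want numbers to go with the plot, base R can count the missing values in each column. This is a sketch on made-up data (toy_df and its columns are hypothetical), not the cot-death dataset.

```r
# Made-up data with some missing values (NA)
toy_df <- data.frame(
  id     = c(1, 2, 3, 4),
  age    = c(25, NA, 31, NA),
  weight = c(3.2, 2.9, NA, 3.6)
)

# is.na() marks each missing cell as TRUE; colSums() counts the TRUEs per column
colSums(is.na(toy_df))    # id 0, age 2, weight 1

# Proportion of missing values in each column
colMeans(is.na(toy_df))   # id 0, age 0.5, weight 0.25
```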

Homework

In the data folder is the heart.csv file.

Import the data into R

Hint:

df <- rio::import("./data/heart.csv")

Check the data for duplicates

Check the missing values

Check the range of the variables using the skimr package.