R is a programming language that can carry out sophisticated analyses and we simply need to learn the language R speaks.

Course Aims

This short course aims to introduce the key basics of programming in R (a key skill for a health data scientist).
The emphasis is on the fundamental principles of writing scripts in R and how they are applied in practice.
Commands around reading, manipulating, analysing and visualising health data will be introduced and hands-on practice will consolidate learning.

Brief description of the course

One of the key skills for working with health data is the ability to use the R programming language to manipulate, analyse and visualise data. R’s usage is varied: it has been used extensively in academia, but less so in the healthcare sector (in particular the NHS) and industry. R is open source and free, which makes it attractive to use, particularly when finances are a concern. There are also other well-documented advantages to using R (e.g., the blog post at http://monkeysuncle.stanford.edu/?p=367).

R is primarily used in two modules of the MSc in Health Data Science programme. So that individuals can gain an insight into R and become familiar with its syntax before the Masters programme, this short course provides the key basics needed to start an individual’s journey with R and a basis from which to explore its other capabilities.
It is a pre-requisite for the Fundamentals of Statistics and Mathematics in Health Data Science module and will run alongside the existing short course of the same name, which introduces the key statistical concepts required for the course. It will not introduce you to an extensive list of commands, nor will it make you an expert R programmer. What it will do is show some of the most commonly used commands and functions in the field, and good practice for writing scripts.
Some statistics will inevitably be covered, but at an introductory level, and are by no means the focus of the course.

This course is aimed at individuals who have no knowledge of R and/or have limited or no programming experience, and who wish to work with health data.

Intended Learning outcomes

By category of outcome, students should be able to:

  • A. Knowledge and understanding
    - LO1: To know key constructs in the R-programming language that read, manipulate, analyse and visualise data
    - LO2: To know how to put small scripts together to work with health data

  • B. Intellectual skills
    - LO3: Design/develop a script to analyse health data

  • C. Practical skills
    - LO4: Perform simple key commands in R
    - LO5: Write simple, but complete R scripts

  • D. Transferable skills and personal qualities
    - LO6: Transfer knowledge and practical skills between datasets/tasks

Introduction

The idea of these sessions is to provide an introduction to using the statistical computing package known as R. This includes how to read data into R, perform various calculations, obtain summary statistics for data and carry out simple analyses. You should read and work through the given notes and seek clarification and help when required from one of the staff in the room.

R is a free, open-source statistical environment which was originally created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is now developed by the R Development Core Team, which was formed in 1997. Version 1.0.0 of the software was released in 2000. It is an implementation of the S language, which was originally developed by J. M. Chambers and colleagues at AT&T’s Bell Laboratories from the mid-1970s. There is a commercial version of S called S-PLUS which is sold by TIBCO Software Inc. S-PLUS is rather expensive but works in a very similar way to the freely available R, so why not use R?

There are a number of textbooks available which discuss using R for statistical data analysis. A good introductory one is Crawley, M. J. (2005) Statistics, An Introduction using R. (Wiley), while a favourite of ours is Matloff, N. (2009). The Art of R Programming (No Starch Press).

What is R?

The command language for R is a computer programming language but the syntax is fairly straightforward and it has many built-in statistical functions. The language can be easily extended with user-written functions. R also has very good graphing facilities.

Obtaining and Installing R and RStudio

To demonstrate and use R, we use RStudio, an integrated development environment (IDE) for the R statistical programming language. It is a tool that can help you do your work better and faster: it includes docked windows for your console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.

You can download and install the current version of R (version 3.3.1 as of 21-06-2016) for free on your own computer by clicking on the link below.

http://www.stats.bris.ac.uk/R/

You can either run the installer directly, or save it to your computer and then run it to install R. When installing, you can accept the default settings. Under ‘Documentation’ you can download the document entitled ‘An Introduction to R’ by W. N. Venables, D. M. Smith and the R Development Core Team (2006), which gives a clear introduction to the language and information on how to use R for statistical analysis and graphics. This manual is also available through the software at Help > Manuals (in PDF) > An Introduction to R. You can download it as a PDF file and keep it on your personal computer for reference.

Similarly, to install RStudio, click on the link

https://www.rstudio.com/products/rstudio/download/

Fig 1. Screenshot of RStudio upon opening

Straight into R

We can think of R as a sophisticated calculator with its own language, and we need to learn to communicate with our new friend. Type anything into the console at the prompt, and R will evaluate it and print the answer.

In this session, let’s practise using R by typing the commands given in the notes below at the \(>\) prompt, pressing the ENTER key after each one and then observing the output.

Let’s try some simple calculations. Type the following commands into the console and check that your results match up (with the results displayed after \(\#\#\) [1]).
R works with vectors and the [1] at the beginning of each results line indicates that the answer is the first element of a vector, which in each instance below is of length 1.

9+8
## [1] 17
3*5
## [1] 15
8+(3*6)
## [1] 26

R’s operator precedence rules are conventional: multiplication and division are carried out before addition and subtraction.

2+4*20/10
## [1] 10
4^3
## [1] 64
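As a small illustration of these precedence rules (a sketch using only base R arithmetic, with numbers chosen for this example), brackets override the default order, and exponentiation is carried out before a leading minus sign is applied:

```r
2 + 4 * 20 / 10      # multiplication and division first: 2 + 8
## [1] 10
(2 + 4) * 20 / 10    # brackets force the addition to happen first
## [1] 12
-2^2                 # read as -(2^2)
## [1] -4
(-2)^2               # brackets make the base negative
## [1] 4
2^3^2                # repeated powers are evaluated right to left: 2^(3^2)
## [1] 512
```

If in doubt, adding brackets costs nothing and makes the intended order obvious to the reader.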

The standard mathematical constants and functions are built in, such as \(\pi = 3.14159\ldots\), exp(), sin(), cos() and tan().

pi 
## [1] 3.141593
sqrt(pi)  # Find the sqrt of pi
## [1] 1.772454
pi*(10^2)
## [1] 314.1593
cos(2*pi)  
## [1] 1
a <- 9
a*2
## [1] 18
b <- c(8, 1, 3)  # to input a vector we use the syntax c( ) with commas
b*3  # R performs elementwise multiplication
## [1] 24  3  9
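To make the idea of element-wise arithmetic a little more concrete, here is a minimal sketch. The variable names and the temperature values are made up for illustration and do not come from any dataset used in this course.

```r
temp_c <- c(36.5, 37.2, 38.9)    # hypothetical body temperatures in Celsius
temp_f <- temp_c * 9 / 5 + 32    # the same arithmetic is applied to every element
temp_f
## [1]  97.70  98.96 102.02
length(temp_f)                   # the result has one value per input value
## [1] 3
temp_c > 37.5                    # comparisons are element-wise too
## [1] FALSE FALSE  TRUE
sum(temp_c > 37.5)               # TRUE counts as 1, so this counts the values above 37.5
## [1] 1
```

Because R applies the operation to the whole vector at once, there is no need to write a loop for calculations like these.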

Tips

  • The \(\#\) symbol can be used in the console to make comments. R ignores the rest of the line after the \(\#\) symbol.

  • A useful tip is that previous session commands can be accessed via the up and down arrow keys on the keyboard. This can save time typing as these can be edited and reused.

  • You can ask R to print a list of the variables already saved in your environment. Simply type

ls()
## [1] "a" "b"

To remove a variable from the environment, for example b, type

rm(b)

To clear your R environment and remove all variables/previously saved data, use

rm(list=ls())

which removes everything in the list that is printed when we type \(\verb|ls()|\). Can you see the logic of this? Don’t worry: this line of code is only used at the start of a session, and it is the most difficult we will come across in the course.
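The logic can be seen in a smaller example. The sketch below uses made-up variable names; it shows that the list argument of rm() accepts a vector of names in quotes, which is exactly the kind of vector that ls() returns.

```r
patient_age    <- 54     # hypothetical values, for illustration only
patient_weight <- 80
patient_height <- 1.75

# rm() accepts a character vector of names through its 'list' argument
rm(list = c("patient_age", "patient_weight"))

exists("patient_age")      # has been removed
## [1] FALSE
exists("patient_height")   # was not in the list, so it is still there
## [1] TRUE

is.character(ls())         # ls() returns a character vector of names
## [1] TRUE
```

Passing ls() as the list therefore hands rm() the name of every variable in the environment.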

I need some help!

The help and support section of R is an invaluable resource that has contributed to the popularity of R. Help is easily accessed by clicking on the ‘Help’ tab of the bottom right window in RStudio. The description at the top of a help page can sometimes be taxing to understand, but there are usually examples at the bottom of the page.

If you’re struggling to find help because you are unsure of the function to search for, then typing

help.search("linear model")

will search for help files for functions that have something to do with “linear model”. Finally, the quality and quantity of help for R online is particularly great, and a Google search beginning with an R, e.g. “R linear model”, usually returns a relevant solution to your problem.
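A few other base R commands can be useful alongside help.search(). This is a short sketch rather than a complete list, and the function names searched for are just examples.

```r
args(log)          # show a function's arguments without opening its help page
apropos("mean")    # names of available objects that contain "mean"
example("mean")    # run the examples found at the bottom of the help page
# ?mean            # shortcut for help("mean"); opens the help page itself
```

The apropos() command is handy when you remember part of a function's name but not all of it.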

Try to find out more about the logarithmic functions and answer the following:

Q1. Can you compute the values of \(\log_e 10\) and \(\log_2 20\)?

Hint, try

help.search("logarithm")

Q2. Can you compute the critical value of \(\chi^2_4\) at 99.5% confidence to more accuracy than the values in the figure below (up to 7 decimal places) and also \(\chi^2_8\) for \(\alpha=0.975\)?

Hint: use the \(\chi^2\) quantile function, \(\verb|qchisq|\):

help(qchisq)                   # qchisq documentation 
Check your answers with the table below.
Fig 2. Chi-squared distribution table
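As a sketch of how the quantile function is called, here it is for a different case from the questions above: the familiar 5% critical value with one degree of freedom. The first argument is the probability to the left of the critical value unless lower.tail = FALSE is given.

```r
qchisq(0.95, df = 1)                       # 95% of the distribution lies below this value
## [1] 3.841459
qchisq(0.05, df = 1, lower.tail = FALSE)   # the same value, stated as 5% in the upper tail
## [1] 3.841459
pchisq(qchisq(0.95, df = 1), df = 1)       # pchisq() goes the other way, from value to probability
## [1] 0.95
```

Whether a table quotes the lower-tail or the upper-tail probability varies, so it is worth checking which one you are passing in.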

Good Practice

The syntax can become complicated, and therefore we must ensure that our code is readable and reproducible by others.

The key piece of advice here is to comment any lines of code that are not completely obvious. Entire commented lines should begin with # and one space. A comment at the end of a line of code should be preceded by two spaces, then a hash, then another space, as above.

Your style of writing will be personal, but our recommendations for ‘good’ code are:

  1. Use <- and not = for assigning variables
  2. Leave a space after each comma
  3. Leave a space either side of operators, e.g. <-, + and /
  4. No more than 5 lines of code at once

The general point is to be consistent throughout your coding.

A comprehensive guide can be found at https://google.github.io/styleguide/Rguide.xml
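To see what these recommendations look like in practice, compare the two versions below. Both compute the same thing; the names and numbers are invented for this sketch.

```r
# Harder to read: '=' for assignment, no spaces, no comments
w=c(70,82,65);h=c(1.70,1.80,1.60);b=w/h^2

# Easier to read: '<-' for assignment, spaces and comments
weight_kg <- c(70, 82, 65)  # hypothetical weights in kilograms
height_m <- c(1.70, 1.80, 1.60)  # hypothetical heights in metres
bmi <- weight_kg / height_m^2  # body mass index, one value per person
```

R treats the two versions identically, so the difference is purely for the benefit of whoever reads the code next, which is often you.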

Let’s start working in R

To begin a session, first

  1. Go to Session > Set Working Directory > Choose Directory and create a folder where you will work from today. This folder should include any datasets you wish to use.
  2. Type into your console

rm(list=ls())

to remove any previously saved data.

  3. Create a new file using File > New File > R Script. We call this an R script. Type or paste your commands into it and save the file; it will have the extension .r or .R so that it is recognised as an R script and opened in R rather than in Word, Excel, etc.

Locate the ‘Run’ button on the right-hand side of the top left window and, with your cursor on the line that you wish to run, click Run. Alternatively, press Ctrl+ENTER on Windows (Cmd+ENTER on a Mac).

Your code will be run (executed) in the Console (bottom left window).

We will continue to work with our R script for the remainder of the session to allow you to refer back at a later date. Please annotate and comment your code throughout the day using the \(\#\) symbol as above.
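As a sketch of what such a script might contain, the example below reuses commands from earlier in this session. The file name and the choice of commands are assumptions made for illustration.

```r
# first_script.R  (hypothetical file name)
# Purpose: a first R script, annotated with comments

getwd()  # confirm which folder R is working from

# Simple calculations from earlier in these notes
a <- 9
a * 2  # double the value stored in a

b <- c(8, 1, 3)  # a vector with three elements
b * 3  # multiply every element by 3
```

Running the script line by line with the Run button should reproduce the results you saw when typing the same commands into the console.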

Installing and Loading R Packages

Part of the reason R has become so popular is the vast array of packages available at the CRAN and Bioconductor repositories. In the last few years, the number of packages has grown rapidly!

Installing these R packages couldn’t be easier (especially in RStudio). Suppose you want to install the \(\verb|ggplot2|\) package, which is a hugely popular package for creating nice-looking graphics. We type the following into the R console

install.packages("ggplot2")

Alternatively, in RStudio you can simply click on the Packages tab in the bottom right corner and then Install (Packages > Install). Type \(\verb|ggplot2|\) into the box ‘Packages (separate multiple with space or comma)’ and ensure that ‘Install dependencies’ is checked (it is by default).

After completing either of these methods, \(\verb|ggplot2|\) is installed in your library. When you want to use it, you can either type

library("ggplot2")

or check the box in the list under Packages.
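A package only needs to be installed once, but it must be loaded in every new session. One way to avoid reinstalling it each time a script is run is sketched below. The helper name install_if_missing is made up for this example and is not part of R; it relies on the base function requireNamespace(), which reports whether a package is available.

```r
# Hypothetical helper: install a package only if it is not already installed
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  invisible(requireNamespace(pkg, quietly = TRUE))  # TRUE if pkg can now be loaded
}

install_if_missing("stats")      # "stats" ships with R, so nothing is downloaded
# install_if_missing("ggplot2")  # would download ggplot2 the first time only
```

After this, library("ggplot2") can be called as usual to load the package.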

Can I create my own functions? [User-written functions]

R has many built-in functions that carry out most of the simplest tasks we require, such as computing the mean or variance. However, the R programming language also allows us to write our own functions.

The basic format for a function is

NAME_OF_FUNCTION <- function( INPUT FOR FUNCTION ) {

  OUR R CODE

  return(OUTPUT)
}

We demonstrate this using the mean function. To find the mean of a vector we add up all of its elements and divide by the number of elements in the vector, \[ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i,\] where \(n\) is the number of elements in the vector.

Consider the vector \(x\)

x <- c(5, 1, 7, 9, 1, 6, 10, 40, 12, 2) 

then a few lines of R code that will allow us to compute the mean of \(x\) would be

n <- length(x)  # Find the number of elements in x and save as n
n  # print n
## [1] 10
sum_of_x <- sum(x)  # add all the elements of x and save as sum_of_x
sum_of_x  # print results
## [1] 93
x_bar <- sum_of_x/n

We have computed the mean of \(x\) as

x_bar
## [1] 9.3

Check that your result is correct by comparing it with R’s built-in \(\verb|mean|\)

mean(x)
## [1] 9.3

Let’s make these lines of code into a function. We will call this function \(\verb|our_mean_function|\):

our_mean_function <- function(x) {
  n <- length(x)  # how long is the vector x?
  sum_of_x <- sum(x)  # add up all elements
  x_bar <- sum_of_x / n  # divide the sum by n
  return(x_bar)  # return the answer
}

Run this function definition in the console and type \(\verb|ls()|\). You can see that this function is now available to use. To use your function, type

our_mean_function(x)
## [1] 9.3

Voila! We have created magic.
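It is good practice to try a new function on more than one input. The sketch below repeats the function definition so that it runs on its own, then tries it on a second, made-up vector and on a vector containing a missing value.

```r
# Repeated from above so that this block runs on its own
our_mean_function <- function(x) {
  n <- length(x)
  sum_of_x <- sum(x)
  x_bar <- sum_of_x / n
  return(x_bar)
}

y <- c(2, 4, 6, 8)               # hypothetical second vector
our_mean_function(y)
## [1] 5
all.equal(our_mean_function(y), mean(y))   # agrees with the built-in function
## [1] TRUE
our_mean_function(c(1, NA, 3))   # a missing value makes the sum, and so the mean, NA
## [1] NA
```

The last line is a reminder that real data often contain missing values; the built-in mean() behaves the same way unless it is told to remove them with na.rm = TRUE.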

Can you create a function to compute the variance of \(x\)? The formula to compute the sample variance is \[s^2= \frac{1}{n-1} \sum_{i=1}^n (x_i-\bar{x})^2.\] Hint: you can compare your solution to

var(x)
## [1] 130.6778

Loops and conditional statements

A for loop in R can be written as

for (i in arr) {expr1; expr2 ...}

It goes through the vector arr one element at a time, setting i to each element in turn, and executes the group of commands inside the \(\{ \dots \}\) in each cycle. The \(\verb|break|\) statement can be used to terminate the loop early. If you don’t want to terminate the whole loop, but just skip the current cycle, the \(\verb|next|\) statement can do that.

A simple loop is constructed as follows

for (i in 1:10) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

Loops can be used to build up more complicated code that calls other functions

for (i in 1:10) {
  print(our_mean_function(x + i))
}
## [1] 10.3
## [1] 11.3
## [1] 12.3
## [1] 13.3
## [1] 14.3
## [1] 15.3
## [1] 16.3
## [1] 17.3
## [1] 18.3
## [1] 19.3
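
A loop can also be combined with an if statement so that a command is only run when a condition holds:

for (i in 1:10) {
  if (i == 2) print(our_mean_function(x + i))
}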
## [1] 11.3
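The break and next statements described above, and the else branch of an if statement, can be sketched as follows. The variable names and the cut-off values are chosen purely for illustration; %% gives the remainder after division, so i %% 2 == 0 is TRUE for even numbers.

```r
odd_numbers <- c()                      # start with an empty vector
for (i in 1:10) {
  if (i %% 2 == 0) {
    next                                # skip even numbers and move to the next cycle
  }
  if (i > 7) {
    break                               # stop the loop completely once i exceeds 7
  }
  odd_numbers <- c(odd_numbers, i)      # keep the values that reach this line
}
odd_numbers
## [1] 1 3 5 7

age <- 70                               # hypothetical value
if (age >= 65) {
  group <- "65 and over"
} else {
  group <- "under 65"
}
group
## [1] "65 and over"
```

Note that 9 never appears in the result: it is odd, so it passes the first test, but it is greater than 7, so the loop stops there.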

Data Analysis

Inputting Data

Reading data into R is straightforward. First make sure your data files are saved in the same place your R session is working from. This location can be found with the function \(\verb|getwd()|\).

getwd()               # get the name of your current working directory
## [1] "/Users/Hannah/Introduction to R"

The command to import the data file from this location depends on the type of file you are importing. Quite frequently, the sample data is in Excel format. For this, we can use the function read.xlsx from the xlsx package. It reads from an Excel spreadsheet and returns a data frame. The following shows how to load an Excel spreadsheet named “CHD.xlsx” and save it as a data frame called CHD.

install.packages("xlsx")  # First install the xlsx R package
library(xlsx)                   # load the xlsx package 
## Loading required package: rJava
## Loading required package: xlsxjars
CHD <- read.xlsx("CHD.xlsx",  sheetIndex = 1)  # read in the data from sheet number 1
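Data are also often supplied as comma-separated (CSV) files, which base R can read without any extra packages. The sketch below first writes a tiny made-up dataset to a temporary file so that it runs on its own; the values and column names are invented for illustration and are not taken from the CHD data.

```r
# Write a tiny made-up dataset to a temporary CSV file
csv_file <- tempfile(fileext = ".csv")
demo <- data.frame(age = c(52, 63, 46),
                   tobacco = c(12.0, 0.01, 0.08),
                   famhist = c("Present", "Absent", "Present"))
write.csv(demo, csv_file, row.names = FALSE)

# Read it back in; read.csv() returns a data frame, just as read.xlsx() does
demo_in <- read.csv(csv_file)
dim(demo_in)   # 3 rows and 3 columns
## [1] 3 3
```

With a real file saved in your working directory, the call would simply name that file, for example read.csv("CHD.csv") if a CSV export with that (hypothetical) name existed.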

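If you know your dataset is large then viewing all of your data will not be helpful. However, you can check that your data has been imported correctly with commands such as \(\verb|head()|\), \(\verb|dim()|\) and \(\verb|str()|\), which show the first few rows, the dimensions and the structure of the dataset.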
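A minimal sketch of such checks is shown below. It uses airquality, a dataset that is built into R, so that it runs without any files; with your own data you would put the name of your data frame (e.g. CHD) in its place.

```r
head(airquality, 3)   # the first three rows
dim(airquality)       # number of rows, then number of columns
## [1] 153   6
names(airquality)     # the column names
str(airquality)       # the type of each column and its first few values
summary(airquality)   # a numerical summary of each column, including counts of NAs
```

These commands print only a few lines however large the dataset is, which makes them a safer first check than viewing everything.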

If your data is relatively small (less than 20 columns and 50 rows) then it may be sensible to view your dataset. You can do this using

View(CHD)

where a new window will open displaying your dataset. Alternatively, you can view the data directly in the console by simply typing the name of your dataset, i.e., \(\verb|CHD|\).

The joys of cleaning data

Real-life data is messy, and statisticians and data scientists spend a large proportion of their time cleaning data.

To extract a variable from your data set such as tobacco, you require the $ sign, e.g. CHD$tobacco extracts the variable tobacco from the CHD dataset.

To check the type of a variable you can use \(\verb|str()|\) or, more directly, \(\verb|class()|\):

class(CHD$obesity)
## [1] "numeric"
class(CHD$famhist)
## [1] "factor"

If you wish to change the type of a variable x, you can use

as.numeric(x)
as.character(x)
as.factor(x)
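The sketch below shows these conversions on small made-up vectors, including one point that often catches people out: converting a factor straight to numeric returns its internal level codes rather than the values you see printed.

```r
famhist <- c("Present", "Absent", "Present")   # made-up character values
class(famhist)
## [1] "character"
famhist_f <- as.factor(famhist)
levels(famhist_f)                  # levels are sorted alphabetically
## [1] "Absent"  "Present"
as.numeric(famhist_f)              # the underlying level codes, not the labels
## [1] 2 1 2

years <- as.factor(c("2015", "2016", "2015"))  # numbers stored as a factor
as.numeric(years)                  # level codes again
## [1] 1 2 1
as.numeric(as.character(years))    # convert via character to recover the values
## [1] 2015 2016 2015
```

Checking class() before and after a conversion is a quick way to confirm that it did what you intended.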

Data Analysis

To find summary statistics

mean(CHD$tobacco)
## [1] 3.635649
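Other summary statistics are obtained in the same way. The sketch below uses the built-in airquality dataset rather than the CHD data so that it runs on its own; one of its columns contains missing values, which is common in health data.

```r
summary(airquality$Temp)                 # minimum, quartiles, mean and maximum
median(airquality$Temp)
sd(airquality$Temp)                      # standard deviation

mean(airquality$Ozone)                   # NA, because Ozone has missing values
## [1] NA
mean(airquality$Ozone, na.rm = TRUE)     # remove the missing values first

# Mean temperature within each month
tapply(airquality$Temp, airquality$Month, mean)
```

The na.rm = TRUE argument is available in most of the basic summary functions, including sum(), median(), sd() and var().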

Visualising Data

R has a very rich set of graphics facilities. The top-level R home page, http://www.r-project.org/, has some colorful examples, and there is a very nice display of examples in the R Graph Gallery found online at http://www.r-graph-gallery.com. An entire book, R Graphics by Paul Murrell (Chapman and Hall, 2005), is devoted to the subject.

Simple plots can be produced with the \(\verb|plot|\) function:
plot(CHD$age, CHD$tobacco, xlab = "Age (years)", ylab = "Tobacco (grams smoked per week)", col= "CadetBlue", pch=20)
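Two other base R plots that are often useful for a first look at data are the histogram and the boxplot. The sketch below uses the built-in airquality dataset so that it runs without the CHD file; the axis labels and titles are our own choices.

```r
# Histogram of a single numeric variable; the result is stored so the bin counts can be inspected
h <- hist(airquality$Temp,
          xlab = "Temperature (degrees F)",
          main = "Histogram of temperature",
          col = "cadetblue")
h$counts   # number of observations in each bar

# Boxplots of a numeric variable split by a grouping variable
boxplot(Temp ~ Month, data = airquality,
        xlab = "Month", ylab = "Temperature (degrees F)")
```

The formula Temp ~ Month can be read as "temperature by month" and the same notation appears again when fitting models in R.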

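R allows you to visualise data in many different ways. For example, the \(\verb|ggmap|\) package can draw maps. Note that more recent versions of \(\verb|ggmap|\) require a Google Maps API key to be registered with \(\verb|register_google()|\) before the commands below will run.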

# install.packages("ggmap")
library("ggmap")
## Loading required package: ggplot2
qmap(location = "Manchester")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Manchester&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Manchester&sensor=false

qmap(location = "Manchester", zoom = 15)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Manchester&zoom=15&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Manchester&sensor=false

usa_center <- as.numeric(geocode("United States"))
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=United%20States&sensor=false
USAMap <- ggmap(get_googlemap(center = usa_center, scale = 2, zoom = 4), extent = "normal")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=37.09024,-95.712891&zoom=4&size=640x640&scale=2&maptype=terrain&sensor=false
USAMap

Saving your R session

You can save the contents of your current workspace by using the menu item File>Save Workspace. You will then be asked to choose a filename. Add the extension .RData to the filename you specify. Your command history can also be saved using File>Save History and then specifying a filename with the .Rhistory extension. At a later date you can resume this saved session by opening R and choosing File>Load Workspace and then File>Load History. In both cases you select the appropriate previously saved files. If you load a previously saved workspace, then do not use the \(\verb|rm(list=ls())|\) command as this will remove your saved variables.

However, I advise against saving your workspace, as it results in duplicates of your datasets, and if your R script is kept with the data then you can always reproduce the results in a second.
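If one particular result takes a long time to compute, a middle way is to save just that object instead of the whole workspace. The sketch below uses the base functions saveRDS() and readRDS(); the object and its values are made up, and a temporary file is used so that the example does not write into your working directory.

```r
results <- data.frame(id = 1:3, score = c(0.2, 0.5, 0.9))  # made-up object to keep

rds_file <- tempfile(fileext = ".rds")   # temporary file; use your own file name in practice
saveRDS(results, rds_file)               # save this single object

restored <- readRDS(rds_file)            # read it back, under any name you like
identical(results, restored)
## [1] TRUE
```

Unlike loading a workspace, readRDS() returns the object so you choose what to call it, which avoids silently overwriting a variable that already exists.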

Acknowledgements

Thank you to Dr Peter Foster for providing the skeleton structure of these notes. Peter has delivered a 4 part series for the past 5 years to the incoming MSc statistics students for the School of Mathematics, University of Manchester.

Solutions

var_student <- function(x) {
  n <- length(x)
  sum_of_x <- sum(x)
  x_bar <- sum_of_x / n
  n_1 <- n - 1
  xx2 <- (x - x_bar)^2
  variance <- sum(xx2) / n_1
  return(variance)
}

var_student(x)
## [1] 130.6778
# OR in 1 line
(sum((x-(sum(x)/length(x)))^2)/(length(x)-1))
## [1] 130.6778