R is a programming language that can carry out sophisticated analyses and today we will simply learn the language of R.

Course details

Course aims

This short course aims to introduce the key basics of programming in R (a key skill for a health data scientist).
The emphasis is on the fundamental principles of writing scripts in R and how they are applied in practice.
Commands around reading, manipulating, analysing and visualising health data will be introduced and hands-on practice will consolidate learning.

Course timetable

10:00 | Welcome
10:10 | Session 1: Intro to programming in R (and RMarkdown)
11:30 | Tea/Coffee Break
11:45 | Session 2: Loading data and visualisation
12:30 | Lunch
13:00 | Session 3: Creating your own functions
14:00 | Final Remarks and End of Session

Contact Details

Dr. Hannah Lennon hannah.lennon@manchester.ac.uk
Dr. Paolo Fraccaro paolo.fraccaro@manchester.ac.uk
Glen Martin glen.martin@manchester.ac.uk
Dr. Heather Robinson heather.robinson-2@manchester.ac.uk
Kieran O’Malley kieran.omalley@manchester.ac.uk

Contributors

All teaching material is provided by Hannah Lennon with the help of Paraskevi Pericleous, Paolo Fraccaro and Kieran O’Malley.

Course description

One of the key skills highlighted to be able to work with health data is to have knowledge of and be able to use the R-programming language in order to manipulate, analyse and visualise data. R’s usage is varied: it has been used extensively in academia, but not in the healthcare sector (in particular the NHS) and industry. R is open source and free, which makes it attractive to use, particularly when finances are of concern. There are also other well-documented positives to using R (e.g., blog post http://monkeysuncle.stanford.edu/?p=367).

This course is aimed at individuals who have no knowledge of R and/or have limited or no programming experience, and who wish to work with health data.

Intended learning outcomes

• To know key constructs in the R-programming language that read, manipulate, analyse and visualise data

• To know how to put small scripts together to work with data

• Design/develop a script to analyse data

• Perform simple key commands in R

• Write simple, but complete R scripts

• Transfer knowledge and practical skills between datasets/tasks

Programming in R

The idea of this sessions is to provide an introduction to using the statistical computing package known as R. This includes how to read data into R, perform various calculation, obtain summary statistics for data and carry out simple analyses. You should read and work through the given notes and seek clarification and help when required from one of the course facilitators in the room.

R is a free, open-source statistical environment which was originally created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is now developed by the R Development Core Team. The first version of the software was released in 1997. It is an implementation of the S language that was originally developed by J. M. Chambers at AT&T’s Bell Laboratories in the late 1980’s. There is a commercial version of S called S-PLUS which is sold by TIBCO Software Inc. S-PLUS is rather expensive but works in a very similar way to the freely available R, so why not use R?

There are a number of textbooks available which discuss using R for statistical data analysis. A good introductory one is

‘Statistics, An Introduction using R. (Wiley)’, Crawley, M. J. (2005)

while a favourite of ours is

‘The Art of R Programming (No starch press)’ Matloff, N. (2009).

The full textbook of

‘R for Data Science’, Garrett Grolemund and Hadley Wickham (2016),

is freely available online and can be found at http://r4ds.had.co.nz/index.html.

What is R?

The command language for R is a computer programming language but the syntax is fairly straightforward and it has many built-in statistical functions. The language can be easily extended with user-written functions. R also has very good graphing facilities.

R is a great place to start a data science journey because it is an environment designed from the ground up to support data science.

R is not just a programming language, but it is also an interactive environment for doing data science. To support interaction, R is a much more flexible language than many of its peers.

This flexibility comes with its downsides, but the big upside is how easy it is to evolve tailored grammars for specific parts of the data science process. These mini languages help you think about problems as a data scientist, while supporting fluent interaction between your brain and the computer.

Installing R and RStudio

To demonstrate and use R, we use RStudio for the R statistical programming language. It is a tool that can help you work better and faster and includes docked windows for your console and syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.

You can download and install a copy of the latest version of R for free on your own computer. You can download and install the current version of R by clicking on the link below.

http://www.stats.bris.ac.uk/R/

You can either run the program, or save the program to your computer and then run it to install R. When installing, you can accept the default settings. Under ‘Documentation’ you can download the document entitled ‘An Introduction to R’ by W. N. Venables, D. M. Smith and the R Development Core Team (2006) which gives a clear introduction to the language and information on how to use R for doing statistical analysis and graphics. This manual is also available through the software at Help > Manuals (in PDF)> An Introduction to R. You can download it as a PDF file and keep it on your personal computer for reference.

Similarly, to install RStudio, click on the link

https://www.rstudio.com/products/rstudio/download/

Fig 1. Screenshot of Studio upon opening

Fig 1. Screenshot of Studio upon opening

Straight into R

We can think of R as a sophisticated calculator with its own language and we need to learn communicate with our new friend. You can type anything into the console at the prompt, and R will evaluate it and print the answer.
For now, all you need to know is that you type R code in the console pane, and press enter to run it. You’ll learn more as we go along!

Let’s try some simple calculations. Type the following commands in blue into the console following the \(>\) prompt, followed by the ENTER key, and check that your results match up (with the results displayed after \(\#\#\) [1]).

R works with vectors and the [1] at the beginning of each results line indicates that the answer is the first element of a vector of length 1.

Note: A vector is a sequence of data elements of the same basic type, e.g. a row of numbers or a row of names.
We can consider a vector as simply a row of data. Members in a vector are officially called components.

# An addition
5 + 5 
## [1] 10
# A subtraction
5 - 5 
## [1] 0
# A multiplication
3 * 5
## [1] 15
 # A division
(5 + 5) / 2 
## [1] 5
# Brackets can be used to separate the equations
8+(3*6)
## [1] 26
# 4 cubed, also known as 4 to the power of three
4^3
## [1] 64

R operator precedence rules follow conventional mathematical rules, e.g. BIDMAS,

2+4*20/10
## [1] 10

i.e.2+((4*20)/10), \[2+\Big(\frac{(4 \times 20)}{10}\Big).\]

We can comparing values using the less than, greather than and double equal signs (\(>\), \(<\), \(>=\), \(==\) and these will return a logical value, which is either ‘TRUE’ or ‘FALSE’ We can use the exclamation mark to mean NOT. Therefore \(!<\) means not less than.
Note: these are called Boolean statements and they can only produce either TRUE or FALSE.

          #     a == b  equal to
          #     a != b  unequal to
          #     a <  b  less than
          #     a >  a  bigger than
          #     a >= a  bigger or equal to
54>45 # greater than
## [1] TRUE
1<2 # less than
## [1] TRUE
15==15 # exactly equal to
## [1] TRUE
50>=51 # greater than or equal to
## [1] FALSE

The standard mathematical constants and functions are built-in, such as \(\pi\)=3.14159…, \(\sqrt(2)\), exp(), sin(), cos(), tan() e.t.c.,

  • R is case sensitive.

  • Commands are separated either by a ; or by a newline.

  • The \(\#\) character can be used to make comments. R doesn’t execute the rest of the line after the \(\#\) symbol - it ignores it.

  • Previous session commands can be accessed via the up and down arrow keys on the keyboard. This can save time typing as these can be edited and reused.

pi 
## [1] 3.141593
sqrt(pi)  # Find the sqrt of pi
## [1] 1.772454
pi*(10^2)
## [1] 314.1593
cos(2*pi)  
## [1] 1

Some symbols have a special meaning in R, e.g. the semicolon

1:8  # produces a sequence of 1 to 8
## [1] 1 2 3 4 5 6 7 8

Assigning values to a variable

We can create variables that R will store using the assign arrow, \(\verb|<-|\) to assign a value to a variable. This is a less than sign and a hyphen with no space between them. Note that we do not use \(\verb|=|\). We use \(\verb|=|\) syntax when inside brackets only and then we would call it a local variable.

a <- 4
3*a
## [1] 12
a^2
## [1] 16

There are different ways to assign values to a variable (or placeholder). The most common is the arrow method \(<-\) that assigns a value to the variable it is pointing at.

x <- 2  # assigns a numeric value to a variable
y = 15  # -- same effect BUT DO NOT USE THIS!!!
assign('d', 7) # -- same effect
color <- 'green' # assigns a character value to a variable

Once we created, we can calculate with variables

b <- x + y
f <- d^x
g <- x < y

The command \(c\) is a function that combines values to a vector

c(2,3,4) 
## [1] 2 3 4
c(d, color, 'Flowers') 
## [1] "7"       "green"   "Flowers"

As mentioned, a vector is a sequence of data elements of the same basic type. Members in a vector are officially called components. Here is a vector containing three numeric values 8, 1 and 3. We can consider a vector as simply a row of data.

R has built in functions for simple statistical measures

          vector <- c(x,y,d,f,b, 50.3, 7^2) 
          length(vector)
## [1] 7
          sum(vector)
## [1] 189.3
          mean(vector)
## [1] 27.04286
          max(vector)
## [1] 50.3
          min(vector)
## [1] 2
          median(vector)
## [1] 17
          summary(vector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   11.00   17.00   27.04   49.00   50.30

We can consider a vector as simply a row of data. Here is a vector containing three numeric values 8, 1 and 3.

b <- c(8, 1, 3)  # to input a vector we use the syntax c( ) with commas
b*3  # R performs componentwise multiplication
## [1] 24  3  9

To extract the second component of a vector we can use square brackets

b[2]  # extract the second component of the vector b
## [1] 1

The \(\verb|c|\) function can also be used inside square brackets to combine values of common type together to form a vector. For example, it can be used to access two components of \(\verb|b|\), e.g. the second and third

b[c(2, 3) ]# extracts the second and third component of the vector b
## [1] 1 3

You will notice that the following will produce an error

b[ 2, 3 ]

To understand why the error is produced, let’s create vectors of health care data, e.g blood pressure, age and gender of 6 patients

id     <- c("N198","N805","N333","N117","N195","N298")
gender <- c(1, 0, 1, 1, 0, 1)  # 0 denotes male, 1 denotes female
age    <- c(30, 60, 26, 75, 19, 60)
blood  <- c(0.4, 0.2, 0.6, 0.2, 0.8, 0.1)

Vectors can be arranged in rows or columns to form a structure similar to a matrix. We can combine them together as follows

cbind(id, gender, age, blood)
##      id     gender age  blood
## [1,] "N198" "1"    "30" "0.4"
## [2,] "N805" "0"    "60" "0.2"
## [3,] "N333" "1"    "26" "0.6"
## [4,] "N117" "1"    "75" "0.2"
## [5,] "N195" "0"    "19" "0.8"
## [6,] "N298" "1"    "60" "0.1"
rbind(id, gender, age, blood)
##        [,1]   [,2]   [,3]   [,4]   [,5]   [,6]  
## id     "N198" "N805" "N333" "N117" "N195" "N298"
## gender "1"    "0"    "1"    "1"    "0"    "1"   
## age    "30"   "60"   "26"   "75"   "19"   "60"  
## blood  "0.4"  "0.2"  "0.6"  "0.2"  "0.8"  "0.1"
health_data <- cbind(id, gender, age, blood)

Alternatively, the data frame keeps your data organised

my_data <- data.frame(id, gender, age, blood) 
my_data <- data.frame(ID=id, Sex=gender, Age=age, Blood=blood) # or specify your names
my_data
##     ID Sex Age Blood
## 1 N198   1  30   0.4
## 2 N805   0  60   0.2
## 3 N333   1  26   0.6
## 4 N117   1  75   0.2
## 5 N195   0  19   0.8
## 6 N298   1  60   0.1

The commands we have used above create a data structure called a data frame, which is a list of variables of the same number of rows with unique row names, given class “data.frame”. If no variables are included, the row names determine the number of rows.

          summary(my_data)
##     ID         Sex              Age         Blood       
##  N117:1   Min.   :0.0000   Min.   :19   Min.   :0.1000  
##  N195:1   1st Qu.:0.2500   1st Qu.:27   1st Qu.:0.2000  
##  N198:1   Median :1.0000   Median :45   Median :0.3000  
##  N298:1   Mean   :0.6667   Mean   :45   Mean   :0.3833  
##  N333:1   3rd Qu.:1.0000   3rd Qu.:60   3rd Qu.:0.5500  
##  N805:1   Max.   :1.0000   Max.   :75   Max.   :0.8000
          class(my_data)
## [1] "data.frame"
          head(my_data) # first six lines
##     ID Sex Age Blood
## 1 N198   1  30   0.4
## 2 N805   0  60   0.2
## 3 N333   1  26   0.6
## 4 N117   1  75   0.2
## 5 N195   0  19   0.8
## 6 N298   1  60   0.1
          tail(my_data) # last six lines
##     ID Sex Age Blood
## 1 N198   1  30   0.4
## 2 N805   0  60   0.2
## 3 N333   1  26   0.6
## 4 N117   1  75   0.2
## 5 N195   0  19   0.8
## 6 N298   1  60   0.1
          names(my_data) # returns the column names of the data.frame
## [1] "ID"    "Sex"   "Age"   "Blood"
          nrow(my_data)  # number of rows
## [1] 6
          ncol(my_data)  # number of columns
## [1] 4

Similar to (x,y) coordinates, the matrix indicies always read [ROWS, COLUMNS]. To extract a single cell value from the second row and third column, we type

health_data[c(2,3)]
## [1] "N805" "N333"
my_data[2, 3]
## [1] 60

We prefer to work with data frames from now on.

Omitting column values implies all columns; here all columns in row 2

my_data[2, ]
##     ID Sex Age Blood
## 2 N805   0  60   0.2
my_data[2, 1:ncol(my_data)] # The same command
##     ID Sex Age Blood
## 2 N805   0  60   0.2

Omitting row value implies all rows; here all rows in column 3

We can also use ranges - rows 2 and 3, columns 2 and 3

my_data[2:3, 2:3]
##   Sex Age
## 2   0  60
## 3   1  26

Q1A. What is the difference between cbind() and rbind()?

Q1B. We found out that the blood pressure instrument is under-recording each measure and all measuremtent incorrect by 0.1. How would you add 0.1 to all values in the blood vector?

Q1C. We found out that the first patient is 33 years old. How would you change the first element of the vector age to 33 years?

Which variables are stored in my R environment?

You can ask R to print a list of which variables are saved in your environment already, simply type

ls()  # returns all values that are in your workspace
##  [1] "a"           "age"         "b"           "blood"       "color"      
##  [6] "d"           "f"           "g"           "gender"      "health_data"
## [11] "id"          "my_data"     "vector"      "x"           "y"

These can also be seen in RStudio by clicking on the Environment tab where a summary of the variables is shown.

To remove a variable from the environment, for example \(\verb|b|\), type

rm(b) # removes b
rm(x,y,d,f,b,g) # removes individual values
## Warning in rm(x, y, d, f, b, g): object 'b' not found

Note: To clear your R environment and remove all variables/previously saved data, use

rm(list=ls())  # Can you see the logic of this? 

This removes the list of variables which is printed when we type the command \(\verb|ls()|\). Don’t worry too much about this line of code. It is only used when you first begin working in R and is the most difficult we will come across in the workshop.

At the beginning of each R session

To begin a session, first
+ 1. Go to
Session > Set Working Directory > Choose Directory
and create a folder where you will work from today. This folder should include any datasets you wish you use.
+ 2. Type into your console

rm(list=ls())

to remove any previously saved data that might automatically loaded at the beginning of the session.

R code can be entered into the command line directly or saved to a script, which can be run inside a session using the source function. This is the best practice.
+ 3. Create a new file using File > New File > R Script We call this an R script and will have the extension .r or .R when you save it. Copy and paste the following

###############################################################
## Introduction to Programming in R, Hannah Lennon 
## Health eResearch Centre, The University of Manchester
## 1st June 2017
## 

rm(list=ls())  # Clean our R environment
x <- c(5, 1, 7, 9, 3, 6, 10, 14, 12, 8)  # insert the row of data manually
ls()  # list all variables saved in our R environment
mean(x)  # Find the mean
var(x)  # Variance 
sd(x)  # Standard deviation 
max(x) 
min(x)
range(x) 
hist(x, breaks = 4, col="red") # a histogram with 2 bins coloured red

and save this file. The extension will become .r or .R to recognise that this is an R script and should be opened in R rather than word or excel e.t.c. Locate the ‘Run’ button in the right hand side of the top left window and with your cursor on the line that you wish to run, click run. Alternatively, press Ctrl+ENTER on windows (cmd+ENTER on a mac).

Screenshot of Studio during use

Screenshot of Studio during use

Your code will be ran/exectuted in the Console (bottom left window).

We will continue to work with our R script for the remainder of the session to allow you to refer back at a later date. Please annotate and comment your code throughout the day using the \(\#\) symbol as above.

Good practice

The syntax can become complicated and therefore we must ensure that our code is readable and reproduciable to others.

The key piece of advice here is to comment any lines of code that are not completely obvious. Entire commented lines should begin with # and one space. Individual lines should be commented using two spaces, hash then another space, as above.

Your style of writing will be personal but our recommendations for a ‘good’ style code
+ 1. Use <- and not = for assigning variables
+ 2. Leave a space after each comma
+ 3. Leave a space between operators, e.g. + and * and / e.t.c
+ 4. No more than 5 lines of code without a new line

The general point is to be consistent throughout your coding.

A comprehensive guide can be found at https://google.github.io/styleguide/Rguide.xml

Installing and loading R packages to add more functions

You’ll also need to install some R packages. An R package is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R.

Part of the reason R has become so popular is the vast array of packages of user-written functions available at the cran and bioconductor repositories. In the last few years, the number of packages has grown exponentially!

Installing these R packages couldn’t be easier (especially in RStudio). Note we only install a packge ONCE. Let’s suppose you want to install the \(\verb|ggplot2|\) package which is a hugely popular package for creating nice looking graphics. Well nothing could be easier. We type the following into the R console

install.packages("ggplot2")

Alternatively in RStudio, you can simply click on the Packages tab in the bottom right corner and then Install (Packages > Install). Type \(\verb|ggplot2|\) into the box ‘Packages (separate multiple with space or comma)’ and ensure the ‘Install dependencies’ is checked (It is by default).

By completing either of these methods, \(\verb|ggplot2|\) is installed in your library when you want to use it you can either type

library("ggplot2")

or check the box in the list under Packages.

The cowplot package is an add-on to ggplot2. It is meant to provide a publication-ready theme for ggplot2, one that requires a minimum amount of fiddling with sizes of axis labels, plot backgrounds, etc https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html

The range of R packages that are contributed to R is huge. Some packages allow more indepth statistical analysis, whereas some allow data to be imported. Some allow advanced graphics while some import data directly from the internet. Popular R packages (as listed by Garrett Grolemund at https://support.rstudio.com/hc/en-us/articles/201057987-Quick-list-of-useful-R-packages).

I need some help!

The help and support section of R is an invaluable resource that has contributed to the popularity of R. Help is easily accessed by clicking on the Help tab of the bottom right window in RStudio under ‘Help’. The first description can sometimes be taxing to understand but there is always an example at the bottom of the page.

If you’re struggling to find help because you are unsure of the function to search for, type

help.search("paste")

will search for help files for functions that have something to do with “paste” (like glue or cut & paste). Finally, the quality and quantity of help for R online is particularly great and a google search beginning with an R e.g. “R paste” usually returns the most relevant solution to your problem.

Exercise 1. Can you use the paste() function to create a list of all the dates in January

Hint: Use the words \(\verb|January|\) and \(\verb|1:31|\) to obtain the output below

##  [1] "1January"  "2January"  "3January"  "4January"  "5January" 
##  [6] "6January"  "7January"  "8January"  "9January"  "10January"
## [11] "11January" "12January" "13January" "14January" "15January"
## [16] "16January" "17January" "18January" "19January" "20January"
## [21] "21January" "22January" "23January" "24January" "25January"
## [26] "26January" "27January" "28January" "29January" "30January"
## [31] "31January"

RMarkdown

For the rest of our session, I’d like to create an RMarkdown document for you to use as a template for future use. All of todays notes have been created in R using RMarkdown.

R Markdown documents are fully reproducible (they can be automatically regenerated whenever underlying R code or data changes).

Available output formats include – HTML,
– PDFs,
– MS Word documents,
– Beamer,
– HTML5 presentations,
– Tufte handouts,
– R package vignettes,
– and even entire websites.

My work flow = Analysis & Figures in R + Writing (Papers & Presentations) in Latex http://kieranhealy.org/blog/archives/2014/01/23/plain-text/.

A cheat sheet of all the information is available here https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf.

For more information on R Markdown, see http://rmarkdown.rstudio.com.

STEP 1

File > New File > R Markdown (This creates the template with the necessary info at the top of the .Rmd file)

STEP 2

Code is written in chunks (Insert R chunk) and all text around the chunks. The hash symbol is used to display a heading

STEP 3

Click the Knit to html/word or pdf and R creates a well presented document for you.

Example

---
title: "Example"
author: "Hannah Lennon"
date: "01/06/2017"
output: html_document
---

## R Markdown
Here is an example of summarising the car data
#```{r}
#summary(cars)
#```
and here's my plot
#```{r}
#plot(cars)
#```

Exercise: Set up your own .Rmd file to continue the rest of the session working with

Loading Data into R and Visulisation

R’s built-in datasets

R has many datasets built in or that are available though loading packages. They are very useful when beginning the R learning curve to practise on.

These can be accessed using the \(\verb|data()|\) command

data()

Once chosen, a dataset can be loaded into our environment by

data(iris)
ls()

Importing data

To demonstrate loading Data into R and Visulisation we are going to use a clinical dataset from a heart disease hospital which is interested in coronary heart disease (CHD). The dataset called CHD and the variables in the data are as follows:

Name Description
sbp systolic blood pressure
tobacco cumulative tobacco (kg)
ldl low density lipoprotein cholesterol
adiposity a measure of body shape
famhist family history of heart disease (Present=1, Absent=0)
typea type-A behaviour
obesity BMI (kg/m^2)
alcohol current alcohol consumption
age current age in years
chd coronary heart disease (yes=1/no=0)

Reading data into R is straight forward. First make sure your data files are saved in the same place your R session is working from. This can be found with the function \(\verb|getwd()|\).

getwd()               # get the name of your current working directory
## [1] "/Users/Hannah/Desktop/Intro to R course 1st June"

The command to import the datafile from this location depends on the type of file you are importing.

Loading from an .csv File

The following shows how to load an Excel spreadsheet named “CHD.csv” or “CHD.xlsx” and saves it as a dataframe called CHD also.

Typically we load .csv (Comma separated values) files but R also allows many other types, such as excel files

CHD <- read.csv2("CHD.csv", sep=",") # or we can load a csv (Comma separated values) files 

Loading from an Excel File

Quite frequently, the sample data is in Excel format. For this, we can use the function read.xlsx from the xlsx package. It reads from an Excel spreadsheet and returns a data frame.

install.packages("xlsx")  # First install the xlsx R package
library(xlsx)                   # load the xlsx package 
CHD <- read.xlsx("CHD.xlsx",  sheetIndex = 1)  # read in the data from sheet number 1

Loading from a SAS File

install.packages("sas7bdat")  # First install the sas7bdat R package
library(sas7bdat)                   # load the sas7bdat( package 
CHD <- read.xlsx("CHD.sas7bdat")  # read in the data from sheet number 1

Loading from a Stata File

Many different data types (including SAS) can be imported using the foreign package (i.e, foreign to R).

We can use the package \(foreign\) to load stata files in

library(foreign)
read.dta(CHD.dta)

Also, you can also write Stata \(.dta\) files from R (if your coauthors or journals insist on having \(.dta\) data).

Suppose your R data frame has the name \(my_data\) and you want to save the file to the C: drive as \(my_data.dta\) After loading the foreign library, Here’s the syntax:

library(foreign)
write.dta(my_data, "C://my_data.dta")

Visualising your data

If you have prior knowledge that your dataset is large, then viewing your data will not be helpful and may slow your computer. However, as a check that your data is imported correctly you can use the following commands

What is the dimension of CHD?

dim(CHD)
## [1] 462  10

What is the structure of my data? In the format: Names/Variable type/Example of the first few values, ….

str(CHD)
## 'data.frame':    462 obs. of  10 variables:
##  $ sbp      : int  160 144 118 170 134 132 142 114 114 132 ...
##  $ tobacco  : Factor w/ 214 levels "0","0.01","0.02",..: 81 2 9 196 90 183 146 147 1 1 ...
##  $ ldl      : Factor w/ 329 levels "0.98","1.07",..: 239 165 100 269 101 270 96 172 126 242 ...
##  $ adiposity: Factor w/ 408 levels "10.05","10.29",..: 146 247 311 386 232 369 53 39 87 291 ...
##  $ famhist  : Factor w/ 2 levels "Absent","Present": 2 1 2 2 2 2 1 2 2 2 ...
##  $ typea    : int  49 55 52 51 60 62 59 62 49 69 ...
##  $ obesity  : Factor w/ 400 levels "14.7","17.75",..: 177 306 313 367 202 351 38 100 162 335 ...
##  $ alcohol  : Factor w/ 249 levels "0","0.19","0.26",..: 249 78 137 104 196 46 84 209 81 1 ...
##  $ age      : int  52 63 46 58 49 45 38 58 29 53 ...
##  $ chd      : int  1 1 0 1 1 0 0 1 0 1 ...

What are the names of my columns?

colnames(CHD) 
##  [1] "sbp"       "tobacco"   "ldl"       "adiposity" "famhist"  
##  [6] "typea"     "obesity"   "alcohol"   "age"       "chd"

The function \(\verb|head()|\) and \(\verb|tail()|\) displays the top 6 and bottom 6 lines of the dataset

head(CHD)
##   sbp tobacco  ldl adiposity famhist typea obesity alcohol age chd
## 1 160      12 5.73     23.11 Present    49    25.3    97.2  52   1
## 2 144    0.01 4.41     28.61  Absent    55   28.87    2.06  63   1
## 3 118    0.08 3.48     32.28 Present    52   29.14    3.81  46   0
## 4 170     7.5 6.41     38.03 Present    51   31.99   24.26  58   1
## 5 134    13.6  3.5     27.78 Present    60   25.99   57.34  49   1
## 6 132     6.2 6.47     36.21 Present    62   30.77   14.14  45   0
tail(CHD)
##     sbp tobacco   ldl adiposity famhist typea obesity alcohol age chd
## 457 170     0.4  4.11     42.06 Present    56    33.1    2.06  57   0
## 458 214     0.4  5.98     31.72  Absent    64   28.45       0  58   0
## 459 182     4.2  4.41      32.1  Absent    52   28.61   18.72  52   1
## 460 108       3  1.59     15.23  Absent    40   20.09   26.64  55   0
## 461 118     5.4 11.61     30.79  Absent    64   27.35   23.97  40   0
## 462 132       0  4.82     33.41 Present    62    14.7       0  46   1

If your data is relatively small (less than 20 columns and 50 rows) then it may be sensible to view your dataset. You can do this using

# View(CHD)

where a new window will open displaying your dataset or you can view the data directly in the console simply type the name of your dataset, i.e., \(\verb|CHD|\).

Understanding the type of variables

R has a wide variety of data types including scalars, vectors (numerical, character, logical), matrices, data frames, and lists.

To check the type of variable you can use \(\verb|str( )|\) to list all types of variables or directly we can use \(\verb|class( )|\),

class(CHD$obesity);
## [1] "factor"
class(CHD$famhist);
## [1] "factor"

If you wish to check the type of variable then you can use

is.numeric();
is.character();
is.factor();

and to change the type of variable

as.numeric();
as.character();
as.factor();

The joys of cleaning data

Real life data is messy and statisticians and data scientists spend a large proportion of their time cleaning data. There is no ‘one size fits all’, when it comes to data. Data scientists, however, need to be sceptical about the data that they have. For instance, negative time and age can ring a bell. As well as, missing values, the reason that they may be missing, and any kind of values that they may differ and the reason that they may differ. These are only a few simple examples of what we need to deal with.

# Change class of variables to numeric
CHD$obesity  <- as.numeric(as.character(CHD$obesity))
CHD$tobacco  <- as.numeric(as.character(CHD$tobacco))
CHD$ldl      <- as.numeric(as.character(CHD$ldl))
CHD$adiposity <- as.numeric(as.character(CHD$adiposity))
CHD$alcohol  <- as.numeric(as.character(CHD$alcohol))

Visualising data

R has a very rich set of graphics facilities.
The top-level R home page, http://www.r-project.org/, has some colorful examples, and there is a very nice display of examples in the R Graph Gallery found online at http://www.r-graph-gallery.com.
An entire book, ‘R Graphics’ by Paul Murrell (Chapman and Hall, 2005), is devoted to the subject.

Simple plots can be made using the functions
- \(\verb|hist()|\) a histogram
- \(\verb|boxplot()|\) a boxplot
- \(\verb|plot(density())|\) density plot
- \(\verb|contour()|\) contour plot

The function \(\verb|plot()|\) is the most generic and many types can be specifed i.e. \(\verb|plot(x, y, type="p")|\) is a plot with points (e.g. a scatterplot).
Possible types are
- “p” for points,
- “l” for lines,
- “b” for both,
- “o” for both ‘overplotted’,
- “h” for ‘histogram’ like (or ‘high-density’) vertical lines,
- “s” for stair steps.

To create a histogram of the distribution of BMI we can use the following commands. Q. Can you describe the difference between each of the two commands?

hist(CHD$obesity, col="Red", breaks=25, freq = T, xlab="BMI (kg/m^2)")

hist(CHD$obesity, col="Red", breaks=25, freq = F, xlab="BMI (kg/m^2)")

To create a boxplot the distribution of BMI for individuals with coronary heart disease and those without. We can create groups using the following commands.

group1 <- CHD[CHD$chd==0, ]  # Extract rows where chd==0 is true
group2 <- CHD[CHD$chd==1, ]  # Extract rows where chd==1 is true
boxplot(group1$obesity, group2$obesity, col="CadetBlue", pch=20, names=c("CHD Absent", "CHD Present"), ylab="BMI (kg/m^2)")

Q4. Can you create a boxplot of age against family history of CHD?

To exlore the bivariate relationships of tobacco use with age, we plot age against tobaco using the commands
plot(CHD$age, CHD$tobacco, xlab = "Age (years)", ylab = "Tobacco (grams smoked per week)",
type="p",  col= "CadetBlue", pch=20)

Q5. Can you produce a scatter plot to assess the relationship between blood pressure (sbp) and BMI (obesity)?
a) Have you labelled your axes?
b) Have you added a title to your plot?

Note to add lines or points to a plot, we can use the commands \(\verb|lines()|\) and \(\verb|points()|\) which work the same as the \(\verb|plot()|\), only they do not create a new plot, they add lines or points to an existing plot. The function \(\verb|abline()|\) is also very useful.

Extracting data

Recap: Similar to (x,y) coordinates, the matrix/data frame indicies always read [ROWS, COLUMNS].

We can also access variables directly by using their names, either with |object[ ,“variable”] notation or object$variable notation.

To extract a variable from your data set such as \(\verb|tobacco|\), you require the $ sign, e.g. CHD$tobacco extracts the variable tobacco from the \(\verb|CHD|\) dataset.

To get the first 10 rows of variable \(\verb|tobacco|\) we can use two methods

CHD[1:10, "tobacco"]  # Extract rows 1 to 10 of column tobacco
##  [1] 12.00  0.01  0.08  7.50 13.60  6.20  4.05  4.08  0.00  0.00
CHD$tobacco[1:10]   # Extract column tobacco and then extract elements 1 to 10
##  [1] 12.00  0.01  0.08  7.50 13.60  6.20  4.05  4.08  0.00  0.00

The \(\verb|c|\) function is widely used to combine values of common type together to form a vector. For example, it can be used to access non-sequential rows and columns from a data frame.

CHD[c(1,3,5), 1]  # get column 1 for rows 1, 3 and 5
## [1] 160 118 134

and we can get row 1 and row 6 values for variables age, sbp and chd status

CHD[c(1,6), c("age", "sbp", "chd")]
##   age sbp chd
## 1  52 160   1
## 6  45 132   0

We can also use Boolean functions introduced earlier to extract all patients with a CHD diagnosis

patients_with_chd <- CHD[CHD$chd==1, ]  # Extract rows where chd==1 is true
head(patients_with_chd)  # display only the head because there's a large number
##    sbp tobacco  ldl adiposity famhist typea obesity alcohol age chd
## 1  160   12.00 5.73     23.11 Present    49   25.30   97.20  52   1
## 2  144    0.01 4.41     28.61  Absent    55   28.87    2.06  63   1
## 4  170    7.50 6.41     38.03 Present    51   31.99   24.26  58   1
## 5  134   13.60 3.50     27.78 Present    60   25.99   57.34  49   1
## 8  114    4.08 4.59     14.60 Present    62   23.11    6.72  58   1
## 10 132    0.00 5.80     30.96 Present    69   30.11    0.00  53   1

Q. How many patients have CHD?

Or all patients under 16 years old

CHD[CHD$age < 16, ]
##    sbp tobacco  ldl adiposity famhist typea obesity alcohol age chd
## 14 132       0 1.87     17.21  Absent    49   23.63    0.97  15   0
## 43 120       0 1.07     16.02  Absent    47   22.15    0.00  15   0
## 71 118       0 3.67     12.13  Absent    51   19.15    0.60  15   0

If there were no variable names, or we wanted to change the names, we could use \(\verb|colnames|\)

colnames(CHD)
##  [1] "sbp"       "tobacco"   "ldl"       "adiposity" "famhist"  
##  [6] "typea"     "obesity"   "alcohol"   "age"       "chd"

To change one variable name, we can extract the elements and reassign another value to it

colnames(CHD)[4] <- "Body shape measure"

Here the new value is a character string of value “Body shape measure”.

Summarising data

To find summary statistics we can use the built in R functions

mean(CHD$tobacco)
## [1] 3.635649
mean(CHD$tobacco)
## [1] 3.635649

The \(\verb|summary|\) function is a generic function to summarise many types of R objects, including datasets. When used on a dataset, \(\verb|summary|\) returns distributional summaries of variables in the dataset.

summary(CHD)
##       sbp           tobacco             ldl         Body shape measure
##  Min.   :101.0   Min.   : 0.0000   Min.   : 0.980   Min.   : 6.74     
##  1st Qu.:124.0   1st Qu.: 0.0525   1st Qu.: 3.283   1st Qu.:19.77     
##  Median :134.0   Median : 2.0000   Median : 4.340   Median :26.11     
##  Mean   :138.3   Mean   : 3.6356   Mean   : 4.740   Mean   :25.41     
##  3rd Qu.:148.0   3rd Qu.: 5.5000   3rd Qu.: 5.790   3rd Qu.:31.23     
##  Max.   :218.0   Max.   :31.2000   Max.   :15.330   Max.   :42.49     
##     famhist        typea         obesity         alcohol      
##  Absent :270   Min.   :13.0   Min.   :14.70   Min.   :  0.00  
##  Present:192   1st Qu.:47.0   1st Qu.:22.98   1st Qu.:  0.51  
##                Median :53.0   Median :25.80   Median :  7.51  
##                Mean   :53.1   Mean   :26.04   Mean   : 17.04  
##                3rd Qu.:60.0   3rd Qu.:28.50   3rd Qu.: 23.89  
##                Max.   :78.0   Max.   :46.58   Max.   :147.19  
##       age             chd        
##  Min.   :15.00   Min.   :0.0000  
##  1st Qu.:31.00   1st Qu.:0.0000  
##  Median :45.00   Median :0.0000  
##  Mean   :42.82   Mean   :0.3463  
##  3rd Qu.:55.00   3rd Qu.:1.0000  
##  Max.   :64.00   Max.   :1.0000

If we want conditional summaries, for example only for those patients over 50 years old (age >= 50), we first subset the data using \(\verb|filter|\) from the \(\verb|dplyr|\) package, then summarise as usual.

R permits nested function calls, where the results of one function are passed directly as an argument to another function. Here, filter returns a dataset containing observations where \(\verb|age >= 50|\). This data subset is then passed to summary to obtain distributions of the variables in the subset.

#install.packages("dplyr")
library(dplyr)
summary(filter(CHD, age >= 50))
##       sbp           tobacco            ldl         Body shape measure
##  Min.   :108.0   Min.   : 0.000   Min.   : 0.980   Min.   :12.33     
##  1st Qu.:130.0   1st Qu.: 1.325   1st Qu.: 3.908   1st Qu.:25.29     
##  Median :144.0   Median : 4.200   Median : 5.065   Median :30.11     
##  Mean   :147.3   Mean   : 5.644   Mean   : 5.287   Mean   :29.69     
##  3rd Qu.:162.0   3rd Qu.: 8.200   3rd Qu.: 6.383   3rd Qu.:34.51     
##  Max.   :216.0   Max.   :31.200   Max.   :14.160   Max.   :42.49     
##     famhist       typea          obesity         alcohol      
##  Absent :85   Min.   :20.00   Min.   :18.36   Min.   :  0.00  
##  Present:95   1st Qu.:45.00   1st Qu.:24.30   1st Qu.:  0.00  
##               Median :52.00   Median :26.84   Median :  7.73  
##               Mean   :51.37   Mean   :26.91   Mean   : 16.84  
##               3rd Qu.:58.25   3rd Qu.:29.03   3rd Qu.: 24.26  
##               Max.   :78.00   Max.   :45.72   Max.   :120.03  
##       age             chd      
##  Min.   :50.00   Min.   :0.00  
##  1st Qu.:54.00   1st Qu.:0.00  
##  Median :58.00   Median :1.00  
##  Mean   :57.42   Mean   :0.55  
##  3rd Qu.:61.00   3rd Qu.:1.00  
##  Max.   :64.00   Max.   :1.00

You can tabulate your data also, once you have checked it is stored as a factor.

table(CHD$chd)
## 
##   0   1 
## 302 160

Exercise: Can you sort the CHD data in increasing order of the age variable?

Can I create my own functions in R? [User-written functions]

R has many built-in functions that carry out most of the simplest tasks we require such as the mean, variance. However, the R programming language allows us to write our own functions.

An example of a function with and input and an output.

An example of a function with and input and an output.

The basic format for a function is

NAME_OF_FUNCTION <- function( INPUT FOR FUNCTION ){
                    
                    OUR R CODE
                    
                    
                    return(OUTPUT)
                    }

For example, let’s code a function which multiplies any number that you give it by 3 and another function which when you give it two numbers, it adds them for you.

f <- function(x) {
  3 * x
}

f( 5 )
## [1] 15
f1 <- function(x,y ) {
  x + y
}

f1( 5, 4)
## [1] 9

These examples are simple and of course in this case it is easier to simply write

x <- 5
3 * x
## [1] 15
y <- 4
x + y
## [1] 9

We demonstrate this using the mean function. To find the mean of a vector we add each elements and divide by the number of elelents is the vector \[ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i,\] where \(n\) is the number of elements in the vector.

Consider the vector \(x\)

x <- c(5, 1, 7, 9, 1, 6, 10, 40, 12, 2) 

then a few lines of R code that will allow us to compute the mean of \(x\) would be

n <- length(x)  # Find the number of elements in x and save as n
n  # print n
## [1] 10
sum_of_x <- sum(x)  # add all the elements of a and save as sum_of_a
sum_of_x  # print sum_of_x
## [1] 93
x_bar <- sum_of_x/n  # divide sum_of_x by n

We have computed the mean of \(x\) as

x_bar
## [1] 9.3

Check that your function works correctly by comparing it with R’s built in \(\verb|mean|\)

mean(x)
## [1] 9.3

Let’s make these lines of code into a function, we will call this function \(\verb|our_mean_function|\):

our_mean_function <- function(x){
                    n <- length(x)  # how long is the vector x?
                    sum_of_x <- sum(x)  # add up all elements
                    x_bar <- sum_of_x/n  # divide variable by n
                    return(x_bar)  # return the answer
}

Run this function into the console and type \(\verb|ls()|\). You can see that this function is now available to use. To use your function, type

our_mean_function(x)
## [1] 9.3

Voila! We have created magic.

Can you create a function to compute the variance of \(x\)? The formula to compute the sample variance is var \[s^2= \frac{1}{n-1} \sum_{i=1}^n (x_i-\bar{x})^2,\] Hint:

## 1. Find the mean
## 2. Subtract the mean from x
## 3. Square your answer
## 4. Sum your answer
## 5. Divide your answer by (n-1)

You can compare your solution to

var(x)
## [1] 130.6778

Our functions can then be used and combined with any other functions to build up more complicated functions involving other functions, e.g.

for(i in 1:10) { 
                    if(i == 2) print(our_mean_function(x+i)) 
                    }
## [1] 11.3

Loops and Conditional statements

The for loop of R language can be written as

for (i in 1:n) {line1; line2;  ...}

It goes through the vector \(1,2,3,\dots, n\) every time one element at a time, and executes a group of commands inside the \(\{ line1; line2; \dots \}\) in each cycle.

A simple loop is constructed as follows

for(i in 1:10) {
                    print(i) 
                    }
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

Saving your R session

You can save the contents of your current worksheet by using the menu item File>Save Workspace. You will then be asked to choose a filename. Add the extension \(\verb|.RData|\) to the filename you specify. Your command history can also be saved using File>Save History and then specifying a filename with the .Rhistory extension. At a later date you can resume this saved session by opening R and choosing File > Load Workspace and then File > Load History. In both cases you select the appropriate previously saved files. If you load previously worked spaces, then do not use the \(\verb|rm(list=ls())|\) command as this will remove your saved variables.

However, I advise not to save your area as it results in duplicates of your datasets and if your R code script is kept with the data then you can always replicate the results in a second.