R is a programming language that can carry out sophisticated analyses and we simply need to learn the language R speaks.
This short course aims to introduce the key basics of programming in R (a key skill for a health data scientist).
The emphasis is on the fundamental principles of writing scripts in R and how they are applied in practice.
Commands around reading, manipulating, analysing and visualising health data will be introduced and hands-on practice will consolidate learning.
All teaching material is provided by Dr. Hannah Lennon, Dr. Paraskevi Pericleous and Kieran O’Malley. Thank you to Dr. Peter Foster for providing the skeleton structure of these notes. Peter has delivered a 4 part series for the past 5 years to the incoming MSc statistics students for the School of Mathematics, University of Manchester.
Dr. Hannah Lennon hannah.lennon@manchester.ac.uk
Dr. Paraskevi Pericleous paraskevi.pericleous@manchester.ac.uk
Kieran O’Malley kieran.omalley@manchester.ac.uk
A key skill for working with health data is the ability to use the R programming language to manipulate, analyse and visualise data. R’s usage is varied: it has been used extensively in academia, but not in the healthcare sector (in particular the NHS) or industry. R is open source and free, which makes it attractive to use, particularly when finances are of concern. There are also other well-documented positives to using R (e.g., blog post http://monkeysuncle.stanford.edu/?p=367).
R is primarily used in two modules of the MSc in Health Data Science programme. To give individuals an insight into R and familiarity with its syntax before the Masters programme, this short course provides the key basics needed to start an individual’s journey with R, and a basis from which to explore its other capabilities.
It is a pre-requisite for the Fundamentals of Statistics and Mathematics in Health Data Science module and will run alongside the existing short course of the same name, which introduces the key statistical concepts required for the course. It will not introduce you to an extensive list of commands, nor will it make you an expert R programmer. What it will do is show some of the most commonly used commands and functions in the field, and best practice for writing scripts.
Some statistics will inevitably be covered, but at an introductory level, and are by no means the focus of the course.
This course is aimed at individuals who have no knowledge of R and/or have limited or no programming experience, and who wish to work with health data.
Category of outcome students should be able to:
A. Knowledge and understanding
- LO1: To know key constructs in the R-programming language that read, manipulate, analyse and visualise data
- LO2: To know how to put small scripts together to work with health data
B. Intellectual skills
- LO3: Design/develop a script to analyse health data
C. Practical skills
- LO4: Perform simple key commands in R
- LO5: Write simple, but complete R scripts
D. Transferable skills and personal qualities
- LO6: Transfer knowledge and practical skills between datasets/tasks
The idea of these sessions is to provide an introduction to using the statistical computing package known as R. This includes how to read data into R, perform various calculations, obtain summary statistics for data and carry out simple analyses. You should read and work through the given notes and seek clarification and help when required from one of the staff in the room.
R is a free, open-source statistical environment which was originally created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. Since 1997 it has been developed by the R Core Team, and version 1.0.0 of the software was released in 2000. It is an implementation of the S language, which was originally developed by J. M. Chambers and colleagues at AT&T’s Bell Laboratories from the mid-1970s. There is a commercial version of S called S-PLUS which is sold by TIBCO Software Inc. S-PLUS is rather expensive but works in a very similar way to the freely available R, so why not use R?
There are a number of textbooks available which discuss using R for statistical data analysis. A good introductory one is Crawley, M. J. (2005) Statistics: An Introduction using R (Wiley), while a favourite of ours is Matloff, N. (2009) The Art of R Programming (No Starch Press).
The command language for R is a computer programming language but the syntax is fairly straightforward and it has many built-in statistical functions. The language can be easily extended with user-written functions. R also has very good graphing facilities.
To demonstrate and use R, we use RStudio, an integrated development environment (IDE) for the R statistical programming language. It is a tool that can help you do your work better and faster: it includes docked windows for your console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.
You can download and install the current version of R (version 3.3.1 as of 21-06-2016) for free on your own computer by clicking on the link below.
http://www.stats.bris.ac.uk/R/
You can either run the program, or save the program to your computer and then run it to install R. When installing, you can accept the default settings. Under ‘Documentation’ you can download the document entitled ‘An Introduction to R’ by W. N. Venables, D. M. Smith and the R Development Core Team (2006), which gives a clear introduction to the language and information on how to use R for doing statistical analysis and graphics. This manual is also available through the software at Help > Manuals (in PDF) > An Introduction to R. You can download it as a PDF file and keep it on your personal computer for reference.
Similarly, to install RStudio, click on the link
https://www.rstudio.com/products/rstudio/download/
Fig 1. Screenshot of RStudio upon opening
We can think of R as a sophisticated calculator with its own language, and we need to learn to communicate with our new friend. Type anything into the console at the prompt, and R will evaluate it and print the answer.
In this session, let’s practise using R by copying and typing the commands which are given following the \(>\) prompt in the notes below, followed by the ENTER key, and then observing the output.
Let’s try some simple calculations. Type the following commands in blue into the console and check that your results match up (with the results displayed after \(\#\#\) [1]).
R works with vectors, and the [1] at the beginning of each results line indicates that the answer is the first element of a vector, which in each instance below is of length 1.
9+8
## [1] 17
3*5
## [1] 15
8+(3*6)
## [1] 26
R operator precedence rules are conventional
2+4*20/10
## [1] 10
4^3
## [1] 64
The standard mathematical constants and functions are built in, such as \(\pi\)=3.14159…, exp(), sin(), cos(), tan(), etc.
pi
## [1] 3.141593
sqrt(pi) # Find the sqrt of pi
## [1] 1.772454
pi*(10^2)
## [1] 314.1593
cos(2*pi)
## [1] 1
a <- 4
3*a
## [1] 12
a^2
## [1] 16
1:8 # produces a sequence of 1 to 8
## [1] 1 2 3 4 5 6 7 8
R is case sensitive.
Commands are separated either by a ; or by a newline.
We use \(\verb|<-|\) to assign a value to a variable. We use \(\verb|=|\) only inside brackets, to give values to the named arguments of a function.
The \(\#\) character can be used to make comments. R doesn’t execute the rest of the line after the \(\#\) symbol - it ignores it.
Previous session commands can be accessed via the up and down arrow keys on the keyboard. This can save time typing as these can be edited and reused.
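The points above can be seen together in a short sketch. The variable names used here are illustrative only and are not part of the course material.

```r
# Two commands on one line, separated by a semicolon
height <- 1.8; weight <- 81

# R is case sensitive: Height and height are different variables
Height <- 2.0

# <- assigns a value; = names an argument inside the brackets of a function call
steps <- seq(from = 1, to = 9, by = 2)

height   # still 1.8: assigning to Height did not change height
steps    # the sequence 1 3 5 7 9
```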
The syntax can become complicated and therefore we must ensure that our code is readable and reproducible by others.
The key piece of advice here is to comment any lines of code that are not completely obvious. Entire commented lines should begin with # and one space. Comments placed at the end of a line of code should be separated from the code by two spaces, followed by # and then one space, as above.
Your style of writing will be personal but our recommendations for ‘good’ code are
+ 1. Use <- and not = for assigning variables
+ 2. Leave a space after each comma
+ 3. Leave a space around operators, e.g. <- and + and /, etc.
+ 4. No more than 5 lines of code at once
The general point is to be consistent throughout your coding.
A comprehensive guide can be found at https://google.github.io/styleguide/Rguide.xml
A vector is a sequence of data elements of the same basic type. Members in a vector are officially called components. Nevertheless, we will just call them members in these notes. Here is a vector containing the three numeric values 8, 1 and 3. We can consider a vector as simply a row of data.
b <- c(8, 1, 3) # to input a vector we use the syntax c( ) with commas
b*3 # R performs componentwise multiplication
## [1] 24 3 9
To extract the second component of a vector we use square brackets
b[2]
## [1] 1
The \(\verb|c|\) function can be used inside this to combine values of common type together to form a vector. For example, it can be used to access two components of \(\verb|b|\), e.g. the second and third
b[c(2,3)]
## [1] 1 3
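Square brackets accept more than positive positions. The sketch below uses a separate vector \(\verb|v|\) (a name chosen here so that \(\verb|b|\) is left untouched) to show a few other common forms of indexing.

```r
v <- c(8, 1, 3)

length(v)     # number of components: 3
v[-1]         # a negative index drops that component: 1 3
v[v > 2]      # a condition keeps the components where it is TRUE: 8 3
v[2] <- 10    # square brackets can also be used to replace a component
v             # 8 10 3
```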
Several vectors placed side by side as columns, or stacked as rows, make a matrix. If we combine together vector b and a new vector d, then we can create a matrix
d <- c(1, 2, 5)
cbind(b, d)
## b d
## [1,] 8 1
## [2,] 1 2
## [3,] 3 5
rbind(b, d)
## [,1] [,2] [,3]
## b 8 1 3
## d 1 2 5
Q1. What is the difference between cbind() and rbind()? Explore the operations between two matrices, or between a vector and a matrix.
To ask R to print a list of the variables already saved in your environment, simply type
ls()
## [1] "a" "b" "d"
These can also be seen in RStudio by clicking on the Environment tab where a summary of the variables is shown.
To remove a variable from the environment, for example \(\verb|b|\), type
rm(b)
To clear your R environment and remove all variables/previously saved data, use
rm(list=ls())
which removes every variable named in the list printed by \(\verb|ls()|\). Can you see the logic of this? Don’t worry: this line of code is only used when you first begin working in R, and it is the most difficult we will come across in the course.
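To see the logic on a small scale, the sketch below removes two named variables rather than everything. The variable names are made up for illustration; \(\verb|to_remove|\) plays the role of the character vector that \(\verb|ls()|\) would return.

```r
tmp_one <- 1
tmp_two <- 2

# A character vector of variable names, like the one ls() returns
to_remove <- c("tmp_one", "tmp_two")

rm(list = to_remove)   # remove exactly the variables named in that vector

exists("tmp_one")      # FALSE: the variable has gone
```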
The help and support section of R is an invaluable resource that has contributed to the popularity of R. Help is easily accessed by clicking on the Help tab of the bottom right window in RStudio. The description at the top of a help page can sometimes be taxing to understand, but there is always an example at the bottom of the page.
If you’re struggling to find help because you are unsure of the function to search for, typing
help.search("logarithm")
will search for help files for functions that have something to do with “logarithm”. Finally, the quality and quantity of help for R online is particularly great, and a Google search beginning with “R”, e.g. “R logarithm”, usually returns a relevant solution to your problem.
Try to find out more about the logarithmic functions and answer the following:
Q2. Can you compute the value of \(\log_e 10\)? And of \(\log_2 20\)?
You can carry out a variety of calculations involving parametric probability distributions in R. Some of the common distributions available are
| Distribution | R name |
|---|---|
| Binomial | \(\verb|binom|\) |
| Poisson | \(\verb|pois|\) |
| Geometric | \(\verb|geom|\) |
| Negative binomial | \(\verb|nbinom|\) |
| Uniform | \(\verb|unif|\) |
| Normal | \(\verb|norm|\) |
| Gamma | \(\verb|gamma|\) |
| Chi-squared | \(\verb|chisq|\) |
| Beta | \(\verb|beta|\) |
| Student t | \(\verb|t|\) |
Some of these will be familiar at the moment while others may be less so.
R has four particular functions available for each distribution. These are
| Name | Description |
|---|---|
| dname(x= , other arguments) | Density or probability mass function |
| pname(q= , other arguments) | Cumulative distribution function |
| qname(p= , other arguments) | Quantile function |
| rname(n= , other arguments) | Random deviates |
That is, you prefix the R name of the distribution with one of the letters ‘d’, ‘p’, ‘q’ or ‘r’, depending on what you would like to calculate. You have to specify the values of the parameters of the distribution in the call to the function if you are changing them from any preset default values. Functions may also have other arguments with preset values but you can use ‘help’ in R to check these.
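As a brief sketch of the four prefixes in use, here they are applied to the normal and binomial distributions. The particular input values are chosen only for illustration.

```r
dnorm(0)                           # density of the standard normal at 0: 0.3989423
pnorm(1.96)                        # P(Z <= 1.96) for a standard normal: 0.9750021
qnorm(0.975)                       # value with 97.5% of the distribution below it: 1.959964
dbinom(2, size = 10, prob = 0.5)   # P(exactly 2 successes in 10 fair trials): 0.04394531

set.seed(1)                        # fix the random seed so the draws can be reproduced
rnorm(3, mean = 10, sd = 2)        # three random draws from a normal with mean 10, sd 2
```

Note that \(\verb|pnorm|\) and \(\verb|qnorm|\) undo each other: the quantile of 0.975 is (to rounding) the value 1.96 whose cumulative probability is 0.975.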
Q3. Can you compute the critical value of \(\chi^2_4\) at 99.5% confidence with more accuracy than the values in the figure below (up to 7 decimal places), and also \(\chi^2_8\) for \(\alpha=0.975\)?
Hint: use the \(\chi^2\) quantile function, \(\verb|qchisq|\):
help(qchisq) # qchisq documentation
Check your answers with the table below.
Fig 2. Chi-squared distribution table
To begin a session, first
+ 1. Go to
Session > Set Working Directory > Choose Directory
and create a folder where you will work from today. This folder should include any datasets you wish to use.
+ 2. Type into your console
rm(list=ls())
to remove any previously saved data.
R code can be entered into the command line directly or saved to a script, which can be run inside a session using the source function. This is the best practice.
+ 3. Create a new file using File > New File > R Script. We call this an R script, and it will have the extension .r or .R when you save it. Copy and paste the following
and save this file. The extension .r or .R indicates that this is an R script and should be opened in R rather than Word or Excel etc. Locate the ‘Run’ button on the right-hand side of the top left window and, with your cursor on the line that you wish to run, click Run. Alternatively, press Ctrl+ENTER on Windows (Cmd+ENTER on a Mac).
Fig 3. Screenshot of RStudio during use
Your code will be run (executed) in the Console (bottom left window).
We will continue to work with our R script for the remainder of the session to allow you to refer back at a later date. Please annotate and comment your code throughout the day using the \(\#\) symbol as above.
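A whole script can also be run in one go with the \(\verb|source|\) function mentioned earlier. The sketch below is self-contained: it writes a tiny script to a temporary file purely for demonstration (the file name \(\verb|my_first_script.R|\) is hypothetical; normally you would create and save the script in RStudio) and then runs it.

```r
# Write a two-line script to a temporary file
script_path <- file.path(tempdir(), "my_first_script.R")
writeLines(c("radius <- 10",
             "area <- pi * radius^2  # area of a circle"),
           script_path)

source(script_path)   # run every line of the script, in order

area                  # 314.1593, as computed earlier with pi*(10^2)
```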
You can save the contents of your current workspace by using the menu item File > Save Workspace. You will then be asked to choose a filename. Add the extension \(\verb|.RData|\) to the filename you specify. Your command history can also be saved using File > Save History and then specifying a filename with the .Rhistory extension. At a later date you can resume this saved session by opening R and choosing File > Load Workspace and then File > Load History. In both cases you select the appropriate previously saved files. If you load a previously saved workspace, do not use the \(\verb|rm(list=ls())|\) command, as this will remove your saved variables.
However, I advise against saving your workspace, as it results in duplicates of your datasets; if your R script is kept with the data then you can always replicate the results in a second.
Part of the reason R has become so popular is the vast array of packages available at the CRAN and Bioconductor repositories. In the last few years, the number of packages has grown exponentially!
Installing these R packages couldn’t be easier (especially in RStudio). Note that we only install a package ONCE. Suppose you want to install the \(\verb|ggplot2|\) package, which is a hugely popular package for creating nice-looking graphics. We type the following into the R console
install.packages("ggplot2")
Alternatively in RStudio, you can simply click on the Packages tab in the bottom right corner and then Install (Packages > Install). Type \(\verb|ggplot2|\) into the box ‘Packages (separate multiple with space or comma)’ and ensure the ‘Install dependencies’ is checked (It is by default).
Either method installs \(\verb|ggplot2|\) in your library. When you want to use it, you can either type
library("ggplot2")
or check the box in the list under Packages.
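Because a package is installed once but loaded in every session, it can be handy to check whether it is already installed before loading it. A minimal sketch, using the base R function \(\verb|requireNamespace|\); the variable names \(\verb|pkg|\) and \(\verb|installed|\) are chosen here for illustration.

```r
pkg <- "ggplot2"

# requireNamespace() returns TRUE if the package is installed, FALSE otherwise
installed <- requireNamespace(pkg, quietly = TRUE)

if (installed) {
  library(pkg, character.only = TRUE)   # load it for this session
} else {
  message("Run install.packages(\"", pkg, "\") once, then load it with library()")
}
```

The sketch deliberately does not install anything itself, so it is safe to run without an internet connection.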
The range of packages that are contributed to R is huge. Some packages allow more in-depth statistical analysis, whereas some allow data to be imported. Some allow advanced graphics while some import data directly from the internet.
A favourite of mine is the ‘ggmap’ R package which is short for ‘google maps’. This package directly imports data from Google Maps and allows you to use Google Maps within R! For example,
# install.packages("ggmap")
library("ggmap")
qmap(location = "Manchester, UK")
qmap(location = "Manchester, UK", zoom = 15) # Let's zoom in a little
We can use the imported data as we would any other data set. For example, I can add a data point onto the plot to pinpoint the University of Manchester.
USAMap <- ggmap(get_googlemap(center=usa_center, scale=2, zoom=4), extent="normal") # usa_center must already hold the map centre as c(longitude, latitude)
USAMap
R has many built-in functions that carry out most of the simplest tasks we require, such as computing the mean or the variance. However, the R programming language also allows us to write our own functions.
An example of a function with an input and an output.
The basic format for a function is
NAME_OF_FUNCTION <- function( INPUT FOR FUNCTION ){
OUR R CODE
return(OUTPUT)
}
For example, let’s code a function which multiplies any number that you give it by 3, and another function which adds together any two numbers that you give it.
f <- function(x) {
3 * x
}
f(5)
## [1] 15
f1 <- function(x, y) {
x + y
}
f1(5, 4)
## [1] 9
These examples are simple and of course in this case it is easier to simply write
x <- 5
3 * x
## [1] 15
y <- 4
x + y
## [1] 9
Functions become more useful for longer calculations, which we demonstrate using the mean. To find the mean of a vector we add up its elements and divide by the number of elements in the vector \[ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i,\] where \(n\) is the number of elements in the vector.
Consider the vector \(x\)
x <- c(5, 1, 7, 9, 1, 6, 10, 40, 12, 2)
then a few lines of R code that will allow us to compute the mean of \(x\) would be
n <- length(x) # Find the number of elements in x and save as n
n # print n
## [1] 10
sum_of_x <- sum(x) # add all the elements of x and save as sum_of_x
sum_of_x # print sum_of_x
## [1] 93
x_bar <- sum_of_x/n # divide sum_of_x by n
We have computed the mean of \(x\) as
x_bar
## [1] 9.3
Check that your calculation is correct by comparing it with R’s built-in \(\verb|mean|\)
mean(x)
## [1] 9.3
Let’s make these lines of code into a function, we will call this function \(\verb|our_mean_function|\):
our_mean_function <- function(x){
n <- length(x) # how long is the vector x?
sum_of_x <- sum(x) # add up all elements
x_bar <- sum_of_x/n # divide variable by n
return(x_bar) # return the answer
}
Run this code in the console and type \(\verb|ls()|\). You can see that this function is now available to use. To use your function, type
our_mean_function(x)
## [1] 9.3
Voila! We have created magic.
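The benefit of wrapping the steps in a function is that they can be reused on any vector without retyping them. A short sketch, repeating the definition so that it runs on its own; the input vectors are arbitrary examples.

```r
our_mean_function <- function(x) {
  n <- length(x)           # how long is the vector x?
  sum_of_x <- sum(x)       # add up all elements
  x_bar <- sum_of_x / n    # divide the sum by n
  return(x_bar)            # return the answer
}

our_mean_function(c(2, 4, 6))   # 4
our_mean_function(1:100)        # 50.5
```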
Can you create a function to compute the variance of \(x\)? The formula to compute the sample variance is \[s^2= \frac{1}{n-1} \sum_{i=1}^n (x_i-\bar{x})^2.\] Hint:
## 1. Find the mean
## 2. Subtract the mean from x
## 3. Square your answer
## 4. Sum your answer
## 5. Divide your answer by (n-1)
You can compare your solution to
var(x)
## [1] 130.6778
A for loop in the R language can be written as
for (i in 1:n) {line1; line2; ...}
It goes through the vector \(1,2,3,\dots, n\) one element at a time and, in each cycle, executes the group of commands inside the braces \(\{ line1; line2; \dots \}\).
A simple loop is constructed as follows
for(i in 1:10) {
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
Loops can be used to build up more complicated code involving other functions
for(i in 1:10) {
print(our_mean_function(x+i))
}
## [1] 10.3
## [1] 11.3
## [1] 12.3
## [1] 13.3
## [1] 14.3
## [1] 15.3
## [1] 16.3
## [1] 17.3
## [1] 18.3
## [1] 19.3
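A loop does not have to run over \(\verb|1:n|\); it can run over the elements of any vector. As a sketch, the loop below rebuilds the sum of \(x\) one element at a time, using a running total (the names \(\verb|total|\) and \(\verb|value|\) are chosen here for illustration).

```r
x <- c(5, 1, 7, 9, 1, 6, 10, 40, 12, 2)

total <- 0                  # running total, starts at zero
for (value in x) {          # value takes each element of x in turn
  total <- total + value    # add the current element to the running total
}

total               # 93, the same as sum(x)
total / length(x)   # 9.3, the same as mean(x)
```

In practice \(\verb|sum(x)|\) is shorter and faster; the loop is shown only to make the mechanics visible.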
Boolean statements can only produce either TRUE or FALSE
our_var <- 5
our_var == 5
## [1] TRUE
our_var < 10
## [1] TRUE
our_var > 10
## [1] FALSE
our_var != 5
## [1] FALSE
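!(our_var > 5)
## [1] TRUE
!(our_var < 5)
## [1] TRUE
where we read ! as ‘not’. To negate a comparison, wrap it in brackets and put ! in front. Note that \(\verb|>!|\) and \(\verb|<!|\) are not operators in R: R reads \(\verb|our_var >! 5|\) as \(\verb|our_var > (!5)|\), which compares \(\verb|our_var|\) with FALSE.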
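Comparisons also work on whole vectors, giving one TRUE or FALSE per element, and they can be combined with \(\verb|&|\) (‘and’) and \(\verb/|/\) (‘or’). A small sketch with made-up ages:

```r
ages <- c(25, 52, 63, 46)   # hypothetical ages

ages >= 50               # one answer per element: FALSE TRUE TRUE FALSE
ages >= 50 & ages < 60   # both conditions must hold: FALSE TRUE FALSE FALSE
ages < 30 | ages > 60    # at least one condition must hold: TRUE FALSE TRUE FALSE
sum(ages >= 50)          # TRUE counts as 1, so this counts the ages of 50 or over: 2
```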
Boolean statements can be used to determine whether an expression should be run or not.
our_var <- 5
if (our_var==5)
{
print('Our variable is equal to 5')
}
## [1] "Our variable is equal to 5"
Taking this one step further, in case our condition is not met we can use the ‘else’ statement.
our_var <- 6
if (our_var==5)
{
print('Our variable is equal to 5')
} else
{
print('Our variable is not equal to 5')
}
## [1] "Our variable is not equal to 5"
for(i in 1:10) {
if(i == 2) print(our_mean_function(x+i))
}
## [1] 11.3
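When there are more than two possibilities, conditions can be chained with \(\verb|else if|\). The sketch below sorts a single BMI value into a category; the value is hypothetical and the cut-offs (18.5, 25 and 30) are the commonly quoted adult BMI bands, used here only as an illustration.

```r
bmi <- 27.4   # hypothetical value

if (bmi < 18.5) {
  category <- "underweight"
} else if (bmi < 25) {
  category <- "healthy weight"
} else if (bmi < 30) {
  category <- "overweight"
} else {
  category <- "obese"
}

category   # "overweight"
```

The conditions are checked from top to bottom and only the first one that is TRUE is run, which is why the second condition does not need to repeat \(\verb|bmi >= 18.5|\).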
Reading data into R is straightforward. First make sure your data files are saved in your working directory, the folder your R session is working from. This can be found with the function \(\verb|getwd()|\).
getwd() # get the name of your current working directory
## [1] "/Users/Hannah/Dropbox/Introducing R for Health Data"
The command to import a data file from this location depends on the type of file you are importing. Quite frequently, the sample data is in Excel format. For this, we can use the function read.xlsx from the xlsx package, which reads from an Excel spreadsheet and returns a data frame. Below we show how to load either the Excel spreadsheet “CHD.xlsx” or the comma separated file “CHD.csv” and save it as a data frame called CHD.
This is a clinical dataset. We have data on a series of patients and we are particularly interested in whether or not they have coronary heart disease (CHD). The variables in the data are as follows:
| Name | Description |
|---|---|
| sbp | systolic blood pressure |
| tobacco | cumulative tobacco (kg) |
| ldl | low density lipoprotein cholesterol |
| adiposity | a measure of body shape |
| famhist | family history of heart disease (Present=1, Absent=0) |
| typea | type-A behaviour |
| obesity | BMI |
| alcohol | current alcohol consumption |
| age | current age in years |
| chd | coronary heart disease (yes=1/no=0) |
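Typically we load .csv (comma separated values) files, but R can also read many other types, such as Excel files
CHD <- read.csv2("CHD.csv", sep=",") # load a csv file. Note: read.csv2 assumes a decimal comma, so columns containing decimal points are not read as numbers; read.csv("CHD.csv") would read them as numeric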
install.packages("xlsx") # First install the xlsx R package
library(xlsx) # load the xlsx package
CHD <- read.xlsx("CHD.xlsx", sheetIndex = 1) # read in the data from sheet number 1
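If you do not have the CHD files to hand, the import step can be tried on a tiny file created on the spot. This is a sketch only: the file name \(\verb|toy.csv|\) and its three rows are invented for illustration, and the file is written to a temporary folder.

```r
# Create a small csv file: a header line followed by two rows of data
csv_path <- file.path(tempdir(), "toy.csv")
writeLines(c("sbp,tobacco,famhist",
             "160,12.0,Present",
             "144,0.01,Absent"),
           csv_path)

toy <- read.csv(csv_path)   # comma separated, decimal point

dim(toy)             # 2 rows and 3 columns
class(toy$tobacco)   # "numeric"
```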
If you know your dataset is large then viewing your data will not be helpful and will slow your computer. However, as a check that your data is imported correctly you can use the following commands.
dim(CHD) # What is the dimension of CHD?
## [1] 462 10
str(CHD) # What is the structure of my data? In the format: Names/Variable type/Example of the first few values, ....
## 'data.frame': 462 obs. of 10 variables:
## $ sbp : int 160 144 118 170 134 132 142 114 114 132 ...
## $ tobacco : Factor w/ 214 levels "0","0.01","0.02",..: 81 2 9 196 90 183 146 147 1 1 ...
## $ ldl : Factor w/ 329 levels "0.98","1.07",..: 239 165 100 269 101 270 96 172 126 242 ...
## $ adiposity: Factor w/ 408 levels "10.05","10.29",..: 146 247 311 386 232 369 53 39 87 291 ...
## $ famhist : Factor w/ 2 levels "Absent","Present": 2 1 2 2 2 2 1 2 2 2 ...
## $ typea : int 49 55 52 51 60 62 59 62 49 69 ...
## $ obesity : Factor w/ 400 levels "14.7","17.75",..: 177 306 313 367 202 351 38 100 162 335 ...
## $ alcohol : Factor w/ 249 levels "0","0.19","0.26",..: 249 78 137 104 196 46 84 209 81 1 ...
## $ age : int 52 63 46 58 49 45 38 58 29 53 ...
## $ chd : int 1 1 0 1 1 0 0 1 0 1 ...
colnames(CHD) # What are the names of my columns?
## [1] "sbp" "tobacco" "ldl" "adiposity" "famhist"
## [6] "typea" "obesity" "alcohol" "age" "chd"
The functions \(\verb|head()|\) and \(\verb|tail()|\) display the first 6 and last 6 rows of the dataset
head(CHD)
## sbp tobacco ldl adiposity famhist typea obesity alcohol age chd
## 1 160 12 5.73 23.11 Present 49 25.3 97.2 52 1
## 2 144 0.01 4.41 28.61 Absent 55 28.87 2.06 63 1
## 3 118 0.08 3.48 32.28 Present 52 29.14 3.81 46 0
## 4 170 7.5 6.41 38.03 Present 51 31.99 24.26 58 1
## 5 134 13.6 3.5 27.78 Present 60 25.99 57.34 49 1
## 6 132 6.2 6.47 36.21 Present 62 30.77 14.14 45 0
tail(CHD)
## sbp tobacco ldl adiposity famhist typea obesity alcohol age chd
## 457 170 0.4 4.11 42.06 Present 56 33.1 2.06 57 0
## 458 214 0.4 5.98 31.72 Absent 64 28.45 0 58 0
## 459 182 4.2 4.41 32.1 Absent 52 28.61 18.72 52 1
## 460 108 3 1.59 15.23 Absent 40 20.09 26.64 55 0
## 461 118 5.4 11.61 30.79 Absent 64 27.35 23.97 40 0
## 462 132 0 4.82 33.41 Present 62 14.7 0 46 1
If your data is relatively small (less than 20 columns and 50 rows) then it may be sensible to view your dataset. You can do this using
# View(CHD)
which opens a new window displaying your dataset. Alternatively, you can view the data directly in the console by simply typing the name of your dataset, i.e., \(\verb|CHD|\).
To check the type of variable you can use \(\verb|str( )|\) to list all types of variables or directly we can use \(\verb|class( )|\),
class(CHD$obesity);
## [1] "factor"
class(CHD$famhist);
## [1] "factor"
To test whether a variable is of one particular type, you can use
is.numeric();
is.character();
is.factor();
and to change the type of variable
as.numeric();
as.character();
as.factor();
Real-life data is messy, and statisticians and data scientists spend a large proportion of their time cleaning data. There is no ‘one size fits all’ when it comes to data. Data scientists, however, need to be sceptical about the data that they have. For instance, negative times or ages should ring alarm bells, as should missing values (and the reasons they may be missing) and values that differ from what is expected (and the reasons they may differ). These are only a few simple examples of what we need to deal with.
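A few one-line checks catch many of these problems early. The sketch below uses a small made-up data frame with deliberate errors; none of these values come from the CHD data.

```r
# Hypothetical patient records with deliberate problems
patients <- data.frame(age = c(52, -3, NA, 46),
                       sbp = c(160, 144, 118, NA))

colSums(is.na(patients))           # missing values per column: age 1, sbp 1
which(patients$age < 0)            # rows with an impossible negative age: 2
mean(patients$sbp)                 # NA, because one value is missing
mean(patients$sbp, na.rm = TRUE)   # ignores the missing value: 140.6667
```

Whether it is appropriate to ignore missing values with \(\verb|na.rm = TRUE|\) depends on why they are missing, so treat it as a convenience rather than a fix.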
# Change class of variables to numeric
CHD$obesity <- as.numeric(as.character(CHD$obesity))
CHD$tobacco <- as.numeric(as.character(CHD$tobacco))
CHD$ldl <- as.numeric(as.character(CHD$ldl))
CHD$adiposity <- as.numeric(as.character(CHD$adiposity))
CHD$alcohol <- as.numeric(as.character(CHD$alcohol))
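The reason for going through \(\verb|as.character|\) first is that a factor stores its values as numbered levels, and \(\verb|as.numeric|\) applied directly to a factor returns those level numbers rather than the measurements. A small sketch with made-up values:

```r
bmi <- factor(c("25.3", "14.7", "25.3"))   # numbers that were read in as a factor

is.factor(bmi)                  # TRUE
is.numeric(bmi)                 # FALSE
as.numeric(bmi)                 # 2 1 2: the internal level codes, not the measurements
as.numeric(as.character(bmi))   # 25.3 14.7 25.3: the values we wanted
```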
As with (x,y) coordinates, the order of matrix indices matters: they always read [ROWS, COLUMNS]. To extract a single cell value from the second row and third column, we type
CHD[2, 3]
## [1] 4.41
Omitting column values implies all columns; here all columns in row 2
CHD[2, ]
## sbp tobacco ldl adiposity famhist typea obesity alcohol age chd
## 2 144 0.01 4.41 28.61 Absent 55 28.87 2.06 63 1
Omitting the row value implies all rows; for example, \(\verb|CHD[, 3]|\) returns all rows in column 3
We can also use ranges - rows 2 and 3, columns 2 and 3
CHD[2:3, 2:3]
## tobacco ldl
## 2 0.01 4.41
## 3 0.08 3.48
We can also access variables directly by using their names, either with object[ , "variable"] notation or object$variable notation.
To extract a variable from your data set such as \(\verb|tobacco|\), you can use the $ sign, e.g. CHD$tobacco extracts the variable tobacco from the \(\verb|CHD|\) dataset.
To get the first 10 rows of variable \(\verb|tobacco|\) we can use two methods
CHD[1:10, "tobacco"]
## [1] 12.00 0.01 0.08 7.50 13.60 6.20 4.05 4.08 0.00 0.00
CHD$tobacco[1:10]
## [1] 12.00 0.01 0.08 7.50 13.60 6.20 4.05 4.08 0.00 0.00
The \(\verb|c|\) function is widely used to combine values of common type together to form a vector. For example, it can be used to access non-sequential rows and columns from a data frame.
CHD[c(1,3,5), 1] # get column 1 for rows 1, 3 and 5
## [1] 160 118 134
and we can get row 1 and row 6 values for variables age, sbp and chd status
CHD[c(1,6), c("age", "sbp", "chd")]
## age sbp chd
## 1 52 160 1
## 6 45 132 0
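Rows can also be selected by a condition instead of by position, which is how groups of patients are usually picked out. A sketch on a small invented data frame (the values are illustrative, not taken from the CHD data):

```r
toy <- data.frame(age = c(52, 63, 46, 38),
                  sbp = c(160, 144, 118, 142),
                  chd = c(1, 1, 0, 0))

toy[toy$age > 50, ]                  # all columns, for rows where age is over 50
toy[toy$chd == 1, c("age", "sbp")]   # age and sbp, for rows where chd equals 1
toy$sbp[toy$chd == 0]                # sbp values where chd equals 0: 118 142
```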
If there were no variable names, or we wanted to change the names, we could use \(\verb|colnames|\)
colnames(CHD)
## [1] "sbp" "tobacco" "ldl" "adiposity" "famhist"
## [6] "typea" "obesity" "alcohol" "age" "chd"
To change one variable name, just use indexing
colnames(CHD)[4] <- "Body shape measure"
R has a very rich set of graphics facilities. The top-level R home page, http://www.r-project.org/, has some colorful examples, and there is a very nice display of examples in the R Graph Gallery found online at http://www.r-graph-gallery.com. An entire book, R Graphics by Paul Murrell (Chapman and Hall, 2005), is devoted to the subject.
Simple plots can be made using the functions
- \(\verb|hist()|\) a histogram
- \(\verb|boxplot()|\) a boxplot
- \(\verb|plot(density())|\) density plot
- \(\verb|contour()|\) contour plot
The function \(\verb|plot()|\) is the most generic and many types can be specified, e.g. \(\verb|plot(x, y, type="p")|\) is a plot with points (a scatterplot). Possible types are
- “p” for points,
- “l” for lines,
- “b” for both,
- “o” for both ‘overplotted’,
- “h” for ‘histogram’ like (or ‘high-density’) vertical lines,
- “s” for stair steps.
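As a sketch of how the \(\verb|type|\) argument is used, the following draws five made-up points joined by lines. The data values are hypothetical and chosen only so that the plot has something to show.

```r
# Hypothetical values, for illustration only
age <- c(20, 30, 40, 50, 60)
sbp <- c(118, 124, 131, 139, 148)

plot(age, sbp, type = "b",              # "b" draws both points and lines
     xlab = "Age (years)",
     ylab = "Systolic blood pressure",
     main = "Plot with type = \"b\"",
     col = "CadetBlue", pch = 20)
```

Changing \(\verb|type = "b"|\) to any of the other letters in the list above redraws the same data in that style.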
To create a boxplot comparing the distribution of BMI for individuals with coronary heart disease and those without, you can create groups using the following commands.
group1 <- CHD[CHD$chd==0, ]
group2 <- CHD[CHD$chd==1, ]
boxplot(group1$obesity, group2$obesity, col="CadetBlue", pch=20, names=c("CHD absent", "CHD present"),
ylab="BMI (kg/m^2)")
Q4. Can you create a boxplot of age against family history of CHD?
To explore the bivariate relationship between tobacco use and age, we plot tobacco against age using the commands
plot(CHD$age, CHD$tobacco, xlab = "Age (years)", ylab = "Cumulative tobacco (kg)",
type="p", col= "CadetBlue", pch=20)
Q5. Can you produce a scatter plot to assess the relationship between blood pressure (sbp) and BMI (obesity)?
a) Have you labelled your axes?
b) Have you added a title to your plot?
Q6. To check if the age data is normally distributed, create a histogram with 25 breaks for the age variable of the CHD dataset and superimpose a normal density onto the plot. Can you explain what each line of code is doing?
The distribution of BMI in the CHD dataset
To find summary statistics we can use the built in R functions
mean(CHD$tobacco)
## [1] 3.635649
The \(\verb|summary|\) function is a generic function to summarise many types of R objects, including datasets. When used on a dataset, \(\verb|summary|\) returns distributional summaries of variables in the dataset.
summary(CHD)
## sbp tobacco ldl Body shape measure
## Min. :101.0 Min. : 0.0000 Min. : 0.980 Min. : 6.74
## 1st Qu.:124.0 1st Qu.: 0.0525 1st Qu.: 3.283 1st Qu.:19.77
## Median :134.0 Median : 2.0000 Median : 4.340 Median :26.11
## Mean :138.3 Mean : 3.6356 Mean : 4.740 Mean :25.41
## 3rd Qu.:148.0 3rd Qu.: 5.5000 3rd Qu.: 5.790 3rd Qu.:31.23
## Max. :218.0 Max. :31.2000 Max. :15.330 Max. :42.49
## famhist typea obesity alcohol
## Absent :270 Min. :13.0 Min. :14.70 Min. : 0.00
## Present:192 1st Qu.:47.0 1st Qu.:22.98 1st Qu.: 0.51
## Median :53.0 Median :25.80 Median : 7.51
## Mean :53.1 Mean :26.04 Mean : 17.04
## 3rd Qu.:60.0 3rd Qu.:28.50 3rd Qu.: 23.89
## Max. :78.0 Max. :46.58 Max. :147.19
## age chd
## Min. :15.00 Min. :0.0000
## 1st Qu.:31.00 1st Qu.:0.0000
## Median :45.00 Median :0.0000
## Mean :42.82 Mean :0.3463
## 3rd Qu.:55.00 3rd Qu.:1.0000
## Max. :64.00 Max. :1.0000
If we want conditional summaries, for example only for those patients aged 50 or over (age >= 50), we first subset the data using \(\verb|filter|\) from the \(\verb|dplyr|\) package, then summarise as usual.
R permits nested function calls, where the results of one function are passed directly as an argument to another function. Here, filter returns a dataset containing observations where \(\verb|age >= 50|\). This data subset is then passed to summary to obtain distributions of the variables in the subset.
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
summary(filter(CHD, age >= 50))
## sbp tobacco ldl Body shape measure
## Min. :108.0 Min. : 0.000 Min. : 0.980 Min. :12.33
## 1st Qu.:130.0 1st Qu.: 1.325 1st Qu.: 3.908 1st Qu.:25.29
## Median :144.0 Median : 4.200 Median : 5.065 Median :30.11
## Mean :147.3 Mean : 5.644 Mean : 5.287 Mean :29.69
## 3rd Qu.:162.0 3rd Qu.: 8.200 3rd Qu.: 6.383 3rd Qu.:34.51
## Max. :216.0 Max. :31.200 Max. :14.160 Max. :42.49
## famhist typea obesity alcohol
## Absent :85 Min. :20.00 Min. :18.36 Min. : 0.00
## Present:95 1st Qu.:45.00 1st Qu.:24.30 1st Qu.: 0.00
## Median :52.00 Median :26.84 Median : 7.73
## Mean :51.37 Mean :26.91 Mean : 16.84
## 3rd Qu.:58.25 3rd Qu.:29.03 3rd Qu.: 24.26
## Max. :78.00 Max. :45.72 Max. :120.03
## age chd
## Min. :50.00 Min. :0.00
## 1st Qu.:54.00 1st Qu.:0.00
## Median :58.00 Median :1.00
## Mean :57.42 Mean :0.55
## 3rd Qu.:61.00 3rd Qu.:1.00
## Max. :64.00 Max. :1.00
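The same nested-call idea can be sketched without \(\verb|dplyr|\), using the base R function \(\verb|subset|\). The data frame \(\verb|toy|\) below is a made-up stand-in for \(\verb|CHD|\), used only so the example runs on its own.

```r
# Toy data frame standing in for CHD (values are made up for illustration)
toy <- data.frame(age = c(34, 52, 61, 45, 50),
                  sbp = c(120, 150, 162, 131, 144))

# Two-step version: store the subset, then summarise it
older <- subset(toy, age >= 50)
summary(older)

# Nested version: the inner call builds the subset, the outer call summarises it
summary(subset(toy, age >= 50))
```

Because loading \(\verb|dplyr|\) masks \(\verb|stats::filter|\), writing \(\verb|dplyr::filter(CHD, age >= 50)|\) makes it explicit which function is intended.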
You can also tabulate a categorical variable (a factor, or a 0/1 indicator such as \(\verb|chd|\)) to count the observations in each category.
table(CHD$chd)
##
## 0 1
## 302 160
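Counts can be turned into proportions with \(\verb|prop.table|\). As a small sketch, the vector below is rebuilt from the counts printed above (302 zeros and 160 ones) rather than read from the data file.

```r
# Rebuild a 0/1 vector from the counts shown in the table above
chd <- rep(c(0, 1), times = c(302, 160))

counts <- table(chd)         # frequency of each category
props  <- prop.table(counts) # share of each category

counts
round(props, 3)
```

The proportion of ones equals the mean of a 0/1 variable, which is why it agrees with the mean of \(\verb|chd|\) reported by \(\verb|summary(CHD)|\).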
Q7. Can you perform a hypothesis test using the appropriate test to see whether the mean BMI of individuals with CHD is higher than the mean BMI of the individuals without CHD?
We want to test the hypothesis that individuals with coronary heart disease have a higher BMI than those who do not have coronary heart disease.
A t-test is a statistical procedure for comparing the means of a continuous measure in two populations. Call these group 1 (without CHD) and group 2 (with CHD); the groups may be of different sizes, which we call n1 and n2.
group1 <- CHD[CHD$chd==0, ]
group2 <- CHD[CHD$chd==1, ]
n1 <- nrow(group1) # number of individuals; length() on a data frame counts columns
n2 <- nrow(group2)
The null hypothesis is that there is no difference between the means of the two groups, i.e., the two groups come from the same population, and the one-sided alternative is that the mean of group 2 is higher: \[H_0 : \mu_1 = \mu_2,\] \[H_1 : \mu_1 < \mu_2.\]
The t statistic to test whether the means are different can be calculated as follows: \[ t = \frac{\bar {X}_1 - \bar{X}_2}{s_{X_1 X_2} \cdot \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}\] where \[ s_{X_1 X_2} = \sqrt{\frac{(n_1-1)s_{X_1}^2+(n_2-1)s_{X_2}^2}{n_1+n_2-2}},\] which for group 1 and group 2 is \[t= -2.1576.\]
To test whether the null hypothesis holds, we compare the computed t-statistic with the critical value of the Student’s T-distribution with \(n_1+n_2-2\) degrees of freedom:
qt(0.95, 460)
## [1] 1.648173
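For the one-sided alternative \(\mu_1 < \mu_2\) we reject the null hypothesis when \(t\) falls below the negative of this critical value, \(-1.648\). The test statistic \(t = -2.1576\) is less than \(-1.648\) and therefore we reject the null hypothesis.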
Explicitly, to compute the p-value we use
pt(-2.1576, 460)
## [1] 0.015738
which is less than 0.05 (the pre-specified significance level) indicating we reject the null hypothesis at 5% significance.
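The critical-value rule and the p-value rule always give the same decision. A short sketch, using only the t statistic and degrees of freedom quoted above:

```r
t_obs <- -2.1576   # t statistic reported above
df    <- 460       # n1 + n2 - 2

# Lower-tail critical value at the 5% level; by symmetry this equals -qt(0.95, df)
crit <- qt(0.05, df)

# One-sided p-value: probability of a value at or below t_obs under H0
p_value <- pt(t_obs, df)

t_obs < crit     # TRUE means reject H0 at the 5% level
p_value < 0.05   # the same decision expressed through the p-value
```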
To compute this in R:
t.test(group1$obesity, group2$obesity, var.equal = TRUE, alternative = "less")
##
## Two Sample t-test
##
## data: group1$obesity and group2$obesity
## t = -2.1576, df = 460, p-value = 0.01574
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -0.2090821
## sample estimates:
## mean of x mean of y
## 25.73745 26.62294
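To see how the pooled formula and \(\verb|t.test|\) relate, the sketch below computes the t statistic by hand and compares it with the function's result. The two vectors are made-up BMI values used only for illustration; they are not the CHD groups.

```r
# Made-up BMI values for two groups (illustrative only)
x1 <- c(22.1, 24.3, 25.0, 26.8, 23.5, 27.2)
x2 <- c(25.4, 27.9, 26.1, 29.3, 28.0)

n1 <- length(x1)   # length() counts the elements of a vector
n2 <- length(x2)

# Pooled standard deviation
s_pooled <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))

# t statistic from the pooled two-sample formula
t_manual <- (mean(x1) - mean(x2)) / (s_pooled * sqrt(1 / n1 + 1 / n2))

# The same test via t.test()
res <- t.test(x1, x2, var.equal = TRUE, alternative = "less")

t_manual
unname(res$statistic)   # should agree with t_manual
```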
Let \(Y\) denote the dependent variable (outcome), which we want to predict based on a linear relation with the independent variable \(X\), \[Y = X^T \beta + \varepsilon.\] To fit a linear regression in R we use the \(\verb|lm()|\) command.
Q8. a) Can you use correlation to assess the relationship between blood pressure (sbp) and BMI (obesity)? b) Can you interpret the slope term of the regression results?
Fit a linear regression using \(\verb|lm()|\) and use the \(\verb|summary()|\) command to examine the output, which describes one variable in terms of the others in the dataset, as below. Try to understand the model.
model1 <- lm(obesity ~ sbp, data=CHD)
model1
##
## Call:
## lm(formula = obesity ~ sbp, data = CHD)
##
## Coefficients:
## (Intercept) sbp
## 19.27408 0.04894
summary(model1)
##
## Call:
## lm(formula = obesity ~ sbp, data = CHD)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.0345 -2.7460 -0.4152 2.2928 21.7265
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.27408 1.30182 14.806 < 2e-16 ***
## sbp 0.04894 0.00931 5.257 2.25e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.097 on 460 degrees of freedom
## Multiple R-squared: 0.05668, Adjusted R-squared: 0.05463
## F-statistic: 27.64 on 1 and 460 DF, p-value: 2.245e-07
Let’s interpret the output
Fig 1. Screenshot of Studio during use
We can see \(\hat{\beta_0}=19.3\), \(\hat{\beta_1}=0.05\) and \(\hat{\sigma}=4.1\) (the residual standard error), so for every unit increase in blood pressure, BMI increases on average by 0.05 kg/\(m^2\).
The t-value is the statistic for testing whether the regression coefficient is different from zero (i.e. whether the variable is needed in the model). These t-tests are used to conduct hypothesis tests on the regression coefficients obtained in simple linear regression. A statistic based on the t-distribution is used to test the two-sided hypothesis that the true slope equals some constant value, here zero.
The statements for the hypothesis test are expressed as: \[H_0: \beta_1 = 0, \] \[H_1: \beta_1 \neq 0, \]
The test statistic used for this test is: \[T = \frac{\hat{\beta_1} - 0}{s.e(\hat{\beta_1})},\] where \(\hat{\beta_1}\) is the least squares estimate of \({\beta_1}\), and \(s.e(\hat{\beta_1})\) is its standard error.
Under the null hypothesis, the test statistic \(T\) follows a t-distribution with \((n-2)\) degrees of freedom, where \(n\) is the total number of observations. The null hypothesis, \(H_0\), is not rejected if the calculated value of the test statistic is such that \[ -t_{\alpha/2, n-2} < T < t_{\alpha/2, n-2},\] where we can compute the critical value for the two-sided hypothesis with 5% significance (2.5% either side for a two-tailed test) as
qt(0.975, 460)
## [1] 1.965134
Here \(T = 5.257\) exceeds the critical value 1.965, so we reject the null hypothesis: blood pressure is a statistically significant predictor of BMI.
Alternatively, we can look in terms of the p-value
1-pt(5.257, 460)
## [1] 1.123124e-07
This is the probability in one tail only; doubling it gives the two-sided p-value
2*(1-pt(5.257, 460))
## [1] 2.246248e-07
which is less than our pre-specified 5% significance level, confirming we reject the null hypothesis. In other words, the test indicates the fitted regression model is of value in explaining variations in the observations and a linear relationship exists between BMI and blood pressure.
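The t value and p-value in the coefficient table can be reproduced from the estimate and its standard error. The sketch below does this on simulated data; the variables \(\verb|x|\) and \(\verb|y|\) are made-up stand-ins for \(\verb|sbp|\) and \(\verb|obesity|\), and the simulation settings are assumptions chosen only for illustration.

```r
set.seed(1)
x <- rnorm(50, mean = 135, sd = 20)       # stand-in for sbp (simulated)
y <- 19 + 0.05 * x + rnorm(50, sd = 4)    # stand-in for obesity (simulated)

fit   <- lm(y ~ x)
coefs <- summary(fit)$coefficients        # matrix: Estimate, Std. Error, t value, Pr(>|t|)

# t statistic for H0: slope = 0
t_slope <- coefs["x", "Estimate"] / coefs["x", "Std. Error"]

# Two-sided p-value on n - 2 residual degrees of freedom
p_slope <- 2 * pt(abs(t_slope), df = df.residual(fit), lower.tail = FALSE)

c(t_slope, coefs["x", "t value"])         # the two t values should agree
c(p_slope, coefs["x", "Pr(>|t|)"])        # and so should the p-values
```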
anova(model1)
## Analysis of Variance Table
##
## Response: obesity
## Df Sum Sq Mean Sq F value Pr(>F)
## sbp 1 463.9 463.90 27.637 2.245e-07 ***
## Residuals 460 7721.2 16.79
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA (ANalysis Of VAriance) calculations are displayed in an analysis of variance table, which has the following format for simple linear regression:
| Source | Degrees of Freedom | Sum of Squares | Mean Square | F |
|---|---|---|---|---|
| Model | \(1\) | SSM \(= \sum_{i=1}^n (\hat{y}_i-\bar{y})^2\) | MSM = SSM/DFM | MSM/MSE |
| Error/Residuals | \(n - 2\) | SSE \(= \sum_{i=1}^n (y_i-\hat{y}_i)^2\) | MSE = SSE/DFE | |
| Total | \(n - 1\) | SST \(= \sum_{i=1}^n(y_i-\bar{y})^2\) | SST/DFT | |
The “F” column provides a statistic for testing the null hypothesis that \(\beta_1 = 0\) against the alternative that \(\beta_1 \neq 0\), where \(\beta_1\) is the coefficient of the blood pressure variable. The test statistic is the ratio MSM/MSE, the mean square model term divided by the mean square error term. When the MSM term is large relative to the MSE term, then the ratio is large and there is evidence against the null hypothesis.
For simple linear regression, the statistic MSM/MSE has an \(F\) distribution with degrees of freedom (DoF Model, DoF Residuals) = \((1, n - 2)\)
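The entries of the ANOVA table can be rebuilt from the fitted values and residuals. The sketch below uses simulated data (an assumption for illustration, not the CHD data) and checks the hand-computed F statistic against \(\verb|anova()|\).

```r
set.seed(2)
x <- runif(40, 100, 200)                 # simulated predictor
y <- 20 + 0.05 * x + rnorm(40, sd = 4)   # simulated response

fit <- lm(y ~ x)
n   <- length(y)

ssm <- sum((fitted(fit) - mean(y))^2)    # model sum of squares
sse <- sum(residuals(fit)^2)             # error (residual) sum of squares
sst <- sum((y - mean(y))^2)              # total sum of squares

msm <- ssm / 1                           # model mean square, 1 degree of freedom
mse <- sse / (n - 2)                     # error mean square, n - 2 degrees of freedom

f_manual <- msm / mse
p_manual <- pf(f_manual, 1, n - 2, lower.tail = FALSE)

tab <- anova(fit)
c(f_manual, tab[["F value"]][1])         # should agree
```

In simple linear regression this F statistic is the square of the slope's t value, so both tests give the same p-value.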
The \(R^2\) and Adjusted \(R^2\) Values: For simple linear regression, \(R^2\) is the square of the sample correlation \(r_{xy}\).
For multiple linear regression with intercept (which includes simple linear regression), it is defined as \(R^2 = SSM / SST\).
In either case, \(R^2\) indicates the proportion of variation in the \(y\)-variable that is explained by variation in the \(x\)-variables.
Many researchers prefer the adjusted \(R^2\) value instead, which is penalised for having a large number of parameters in the model: \[\textrm{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p},\] where \(p\) is the number of estimated coefficients, including the intercept.
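Both quantities can be checked numerically. The first part of this sketch uses simulated data (made up for illustration) to show that \(R^2\) equals the squared correlation in simple linear regression. The second part applies the adjusted \(R^2\) formula \(1 - (1 - R^2)(n - 1)/(n - p)\) to the values printed by \(\verb|summary(model1)|\), taking \(p = 2\) for the intercept and slope.

```r
# Part 1: R^2 equals the squared correlation (simulated data, illustration only)
set.seed(3)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)
fit <- lm(y ~ x)

r2_fit <- summary(fit)$r.squared
r2_cor <- cor(x, y)^2
c(r2_fit, r2_cor)                 # should agree

# Part 2: adjusted R^2 for model1, from the values in its printed summary
r2 <- 0.05668                     # Multiple R-squared
n  <- 462                         # 460 residual degrees of freedom + 2 coefficients
p  <- 2                           # intercept and slope
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p)
round(adj_r2, 5)                  # compare with the Adjusted R-squared in the summary
```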
A BMI measure of a 30 year old is likely to be less than a BMI measure of a 50 year old since generally BMI increases with age. Blood pressure can vary with age also. Can you adjust for age in your model - do your results change?
model2 <- lm(obesity ~ sbp + age, data=CHD)
model2
##
## Call:
## lm(formula = obesity ~ sbp + age, data = CHD)
##
## Coefficients:
## (Intercept) sbp age
## 18.97043 0.03018 0.06769
summary(model2)
##
## Call:
## lm(formula = obesity ~ sbp + age, data = CHD)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.369 -2.549 -0.344 2.005 23.018
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.970432 1.272040 14.913 < 2e-16 ***
## sbp 0.030184 0.009862 3.061 0.00234 **
## age 0.067694 0.013836 4.893 1.38e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.999 on 459 degrees of freedom
## Multiple R-squared: 0.1034, Adjusted R-squared: 0.09953
## F-statistic: 26.48 on 2 and 459 DF, p-value: 1.311e-11
We see that age is a significant covariate in the model.
We can formally compare the two nested models to select the more suitable one, using the F-test for nested models:
anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: obesity ~ sbp
## Model 2: obesity ~ sbp + age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 460 7721.2
## 2 459 7338.5 1 382.71 23.937 1.38e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
where we see that the “F” column provides a statistic for testing the null hypothesis that \(\beta_2 = 0\) against the alternative that \(\beta_2 \neq 0\), where \(\beta_2\) is the coefficient of the age variable, because the two models are nested and differ only in the age term. The test statistic is the reduction in the residual sum of squares from adding age, divided by its degrees of freedom (here 1), over the mean square error of the larger model. When this ratio is large, there is evidence against the null hypothesis.
Here the statistic has an \(F\) distribution with \((1, n - 3)\) degrees of freedom, i.e. \((1, 459)\).
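The F statistic in the comparison table can be recovered from the residual sums of squares alone. The sketch below uses only the values printed by \(\verb|anova(model1, model2)|\); small differences in the last digit come from rounding in the printed output.

```r
# Values copied from the anova(model1, model2) table above
rss1 <- 7721.2   # residual sum of squares, obesity ~ sbp
rss2 <- 7338.5   # residual sum of squares, obesity ~ sbp + age
df1  <- 460      # residual degrees of freedom, model 1
df2  <- 459      # residual degrees of freedom, model 2

# Reduction in RSS per extra parameter, relative to the larger model's mean square error
f_stat  <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
p_value <- pf(f_stat, df1 - df2, df2, lower.tail = FALSE)

f_stat
p_value
```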
Logistic regression is used when the dependent variable (outcome) is binary and the probability of it being 1 (or 0) is modelled. In this case, \[\textrm{logit}(P[Y=1])=X^T \beta\] or equivalently \[P[Y=1]=\frac{\exp{[X^T \beta]}}{1+\exp{[X^T \beta]}}.\] Unlike linear regression, there is no additive error term \(\varepsilon\); the randomness comes from the binary outcome itself.
Suppose we are interested in risk factors for CHD. This would be a common exercise in epidemiology; for example, we may want to predict coronary heart disease in individuals, where the CHD$chd variable is the outcome with values Yes/No (or 0/1).
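The logit and its inverse are easy to write out directly. In this sketch, \(\verb|logit|\) and \(\verb|inv_logit|\) are helper names introduced here for illustration; base R already provides the same transformations as \(\verb|qlogis|\) and \(\verb|plogis|\).

```r
# Helper functions written out from the formulas (names are illustrative)
logit     <- function(p)   log(p / (1 - p))
inv_logit <- function(eta) exp(eta) / (1 + exp(eta))

p   <- c(0.1, 0.5, 0.9)
eta <- logit(p)      # probabilities mapped to the whole real line
inv_logit(eta)       # and back to probabilities

# Built-in equivalents
qlogis(p)            # same as logit(p)
plogis(eta)          # same as inv_logit(eta)
```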
Q9. Perform a logistic regression for chd using glm() function and summary() and interpret your output
model3 <- glm(chd ~ age + obesity, family=binomial(link="logit"), data=CHD)
summary(model3)
##
## Call:
## glm(formula = chd ~ age + obesity, family = binomial(link = "logit"),
## data = CHD)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4401 -0.9227 -0.5384 1.0905 2.2497
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.581465 0.742611 -4.823 1.42e-06 ***
## age 0.063958 0.008674 7.374 1.66e-13 ***
## obesity 0.002523 0.025934 0.097 0.923
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 596.11 on 461 degrees of freedom
## Residual deviance: 525.55 on 459 degrees of freedom
## AIC: 531.55
##
## Number of Fisher Scoring iterations: 4
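One common way to interpret logistic regression coefficients is to exponentiate them, which gives odds ratios. The sketch below uses the estimates printed by \(\verb|summary(model3)|\); the 50-year-old with a BMI of 26 is a hypothetical individual chosen only for illustration.

```r
# Estimates copied from the summary(model3) output above
b0        <- -3.581465
b_age     <-  0.063958
b_obesity <-  0.002523

or_age    <- exp(b_age)        # odds ratio for one extra year of age
or_age_10 <- exp(10 * b_age)   # odds ratio for ten extra years of age
or_age
or_age_10

# Predicted probability of CHD for a hypothetical 50-year-old with BMI 26
eta    <- b0 + b_age * 50 + b_obesity * 26
p_pred <- plogis(eta)          # inverse logit of the linear predictor
p_pred
```

With the fitted object available, \(\verb|exp(coef(model3))|\) gives all the odds ratios at once.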
Can you adjust for other lifestyle factors (smoking and alcohol consumption) too? Include all the variables in the data as predictors. Which variables appear important in predicting CHD?
model4 <- glm(chd ~ ., family=binomial(link="logit"), data=CHD)
summary(model4)
##
## Call:
## glm(formula = chd ~ ., family = binomial(link = "logit"), data = CHD)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7781 -0.8213 -0.4387 0.8889 2.5435
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.1507209 1.3082600 -4.701 2.58e-06 ***
## sbp 0.0065040 0.0057304 1.135 0.256374
## tobacco 0.0793764 0.0266028 2.984 0.002847 **
## ldl 0.1739239 0.0596617 2.915 0.003555 **
## `Body shape measure` 0.0185866 0.0292894 0.635 0.525700
## famhistPresent 0.9253704 0.2278940 4.061 4.90e-05 ***
## typea 0.0395950 0.0123202 3.214 0.001310 **
## obesity -0.0629099 0.0442477 -1.422 0.155095
## alcohol 0.0001217 0.0044832 0.027 0.978350
## age 0.0452253 0.0121298 3.728 0.000193 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 596.11 on 461 degrees of freedom
## Residual deviance: 472.14 on 452 degrees of freedom
## AIC: 492.14
##
## Number of Fisher Scoring iterations: 5
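Because model3 is nested within model4, the two logistic models can be compared through their residual deviances. The sketch below is one possible comparison, a likelihood ratio (deviance) test, using only the numbers printed in the two summaries. It also checks that, for a 0/1 outcome, the reported AIC equals the residual deviance plus twice the number of coefficients.

```r
# Values copied from the summary(model3) and summary(model4) output above
dev3 <- 525.55; df3 <- 459   # residual deviance and degrees of freedom, model3
dev4 <- 472.14; df4 <- 452   # residual deviance and degrees of freedom, model4

lr_stat <- dev3 - dev4       # drop in deviance from adding the extra predictors
lr_df   <- df3 - df4         # number of extra coefficients
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

lr_stat
lr_df
p_value

# AIC = residual deviance + 2 * number of coefficients (binary 0/1 outcome)
aic3 <- dev3 + 2 * 3         # intercept, age, obesity
aic4 <- dev4 + 2 * 10        # intercept plus nine predictors
c(aic3, aic4)                # compare with the AIC lines in the summaries
```

With the fitted objects available, \(\verb|anova(model3, model4, test = "Chisq")|\) performs the same comparison.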