The aim of this lab is to use R to examine the linear relationship between pairs of continuous variables. By the end of this lab you should be able to:
The main new R functions we will use for are cor(), corr.test(), lm() and predict().
Context: A sample of human subjects was asked to memorise a list of disconnected items and the average percent memory retention was calculated at various times during the following week.
Question: Are average memory retention and time significantly related?
Data: The data are available from a worksheet memory.csv where:
retention: The average percent memory retention from the sample.time: The times since memorisation took place (in minutes).We wish to investigate whether the variables average percent memory retention and time since memorisation of the list are related in the population of human subjects. Data are available at 13 time points and we assume that these data are representative of the data that could be obtained from a wider population of similar people.
Launch RStudio, and set the working directory so that it can find the data files used in this lab and save the data into a dataframe called memory:
memory <- read.csv("memory.csv")
Firstly, create a scatterplot of memory retention against time:
plot(memory$time,memory$retention , ylab="Average Memory Retention (%)",xlab="Time Since Memorisation (mins)")
Is there a relationship between memory retention and time?
If so, is it a linear relationship?
If it is not a linear relationship, can we find a transformation of one (or both) of the variables that will give a linear relationship? We will add a variable logtime which is the logarithm of the variable time to the dataframe memory.
memory$logtime <- log(memory$time)
memory
## retention time logtime
## 1 0.84 1 0.000000
## 2 0.71 5 1.609438
## 3 0.61 15 2.708050
## 4 0.56 30 3.401197
## 5 0.54 60 4.094345
## 6 0.47 120 4.787492
## 7 0.45 240 5.480639
## 8 0.38 480 6.173786
## 9 0.36 720 6.579251
## 10 0.26 1440 7.272398
## 11 0.20 2880 7.965546
## 12 0.16 5760 8.658693
## 13 0.08 10080 9.218309
Now produce a plot of memory retention against log of time
plot(memory$logtime,memory$retention , ylab="Average Memory Retention (%)",xlab="Log time Since Memorisation (mins)")
Is the relationship between memory retention and log of time approximately linear?
Since there is a clear linear relationship it is appropriate to estimate the correlation between the variables Memory Retention and Log of Time.
cor(memory$logtime,memory$retention)
## [1] -0.9949259
We now assess the statistical significance of this observed correlation between memory retention and log of time using the function cor.test()
cor.test(memory$logtime,memory$retention)
##
## Pearson's product-moment correlation
##
## data: memory$logtime and memory$retention
## t = -32.798, df = 11, p-value = 2.525e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9985284 -0.9825817
## sample estimates:
## cor
## -0.9949259
This can be interpreted in two ways: by using a p-value for hypothesis testing or by a confidence interval:
Interpret the output both in terms of a hypothesis test and a confidence interval
Context: A survey of students at California Polytechnic State University collected data to investigate the question of whether back aches might be due to carrying heavy backpacks. The backpack weight of the student is taken, alongwith their body weight (measured in lb). It may be assumed that students with a higher body weight will carry a heavier backpack, though this may not be true.
Question: Are backpack weight and body weight significantly related?
Data: The data are available in the Stat2Data package, which can be loaded in as follows:
install.packages("Stat2Data")
library(Stat2Data)
data("Backpack")
Context: The children’s psychiatric unit at Yorkhill Hospital wants to create a screening method to identify unusual levels of melatonin in children. It is believed that children with too little or too much melatonin are more susceptible to depression. The problem is that there is a natural variation in levels of melatonin in children. A doctor at the hospital believes it may be related to age. She studies 19 healthy, non-depressed children of various ages (in months) and records their melatonin levels.
Questions: - Model the relationship between the average melatonin level and age in the wider population of children. - For a 7-year old child (84 months) provide upper and lower bounds for the melatonin level outside of which doctors may have to be worried.
Data: The data for this study are found in the file melatonin.csv where:
melatonin: Melatonin level (mg)age: Age of child (months)We wish to investigate the relationship between average melatonin level and age in healthy children.
Here the response variable is Melatonin level and the explanatory variable is Age.
Load the .csv data into R and produce a scatterplot of the data:
melatonin.dat <- read.csv(file="melatonin.csv")
attach(melatonin.dat) #allows direct access to columns of the dataset
melatonin.dat
## melatonin age
## 1 4.1 34
## 2 5.4 20
## 3 3.0 67
## 4 2.1 94
## 5 3.4 65
## 6 5.4 29
## 7 2.8 57
## 8 2.4 89
## 9 5.0 14
## 10 2.1 84
## 11 1.4 97
## 12 3.5 63
## 13 4.8 28
## 14 3.0 72
## 15 3.7 45
## 16 2.6 74
## 17 3.0 81
## 18 3.5 50
## 19 2.4 82
plot(age, melatonin, xlab="Age of Child (months)", ylab="Melatonin (mg)")
From the scatterplot, we can deduce:
In the data description we have seen that there is an approximate linear relationship between age and melatonin levels. Linear regression is a method used to describe this relationship by a linear function. Use the command lm() to perform this regression:
lm(melatonin~age,data=melatonin.dat)
##
## Call:
## lm(formula = melatonin ~ age, data = melatonin.dat)
##
## Coefficients:
## (Intercept) age
## 5.90529 -0.04245
The output of the regression can be saved like so
melatonin.lm <- lm(melatonin~age,data=melatonin.dat)
Before we make any conclusions from the regression output in melatonin.lm, we should first check that the assumptions underlying the regression analysis are reasonable.
A linear regression model makes three fundamental assumptions:
Linearity: The relationship between the dependent and independent variables is linear.
Constant variance (heteroscedasicity): The residuals have a constant variance.
Normality: The residuals of the linear regression are normal.
We can check the first two assumptions by looking at a plot of the residuals versus the fitted values. For the third assumption, we need to produce a normal probability plot (Q-Q plot) of the standardised residuals.
par(mfrow=c(2,2))
plot(melatonin.lm)
The code produces four plots. Here, we will only focus on the first two to check all the assumptions.
NOTE: If you want to reproduce only the first two plots, use the command par(mfrow=c(1,2)) to obtain these. Once done, you will need to hit enter afterwards (as R automatically produces the other two plots).
From the plot, it appears that there is no particular pattern in the residuals: they are approximately equally distributed above and below the (horizontal) zero line no matter where you look on the fitted values axis. This is a good indication that the relationship between melatonin level and age can be well approximated by a linear function.
From the plot, the spread of the residuals seems quite constant throughout the range of the fitted values. This suggests that constant variance of the residuals is a reasonable assumption.
The Normal Probability Plot, for all practical purposes is sufficiently close to a straight line, so Normality of the residuals can reasonably be assumed.
Note:
What would be the effect if the residuals deviated from a straight line in the ways illustrated above?
The assumptions undelying the linear regression are reasonably well satisfied so we may proceed to interpret the melatonin.lm output.
summary(melatonin.lm)
##
## Call:
## lm(formula = melatonin ~ age, data = melatonin.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.68588 -0.28911 -0.02473 0.26125 0.72564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.905290 0.219629 26.89 2.27e-15 ***
## age -0.042446 0.003364 -12.62 4.65e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3685 on 17 degrees of freedom
## Multiple R-squared: 0.9035, Adjusted R-squared: 0.8979
## F-statistic: 159.2 on 1 and 17 DF, p-value: 4.646e-10
The coefficient for age is significantly different from zero, because the pvalue associated with it \((4.65\times 10^{-10})\) is less than 0.05. Therefore, age is a useful predictor for melatonin level.
We can use the function confint() to produce a confidence interval for the slope parameter:
confint(melatonin.lm)
## 2.5 % 97.5 %
## (Intercept) 5.4419134 6.36866586
## age -0.0495427 -0.03534901
The confidence interval for age does not contain 0 \((-0.0495,-0.0353)\), so again age is a statistically significant predictor of the melatonin level.
The intercept is also statistically significant with a p-value of \(2.27 \times 10^{-15}\). The CI above also does not include the value 0 \((5.44, 6.37)\) so we do need to include an intercept parameter in the regression model.
The best possible linear function to describe the dependence of the average melatonin level (in mg) on age (in months) is given by: \[ \text{Avg melatonin} = 5.9 -0.0424 \times \text{Age}\] This means that for every month children grow, their melatonin level decreases by 0.0424mg on average.
Note: This linear relationship can only be trusted for ages between 14 and 97 months. Outside this range we should not make any predictions, because we don’t have observations for those values.
The \(R^2\) is 90.4%. This means that 90.4% of the variability in melatonin levels is explained by its relationship with age.
A summary of the model and data is given by producing a fitted line plot using:
par(mfrow=c(1,1))
plot(age,melatonin, xlab="Age of Child (months)", ylab="Melatonin (mg)")
abline(melatonin.lm) # best fitting line
The function predict() includes the option to calculate prediction intervals using linear regression. Because we have already verified the usefulness of a linear relationship between age and melatonin levels, we can proceed without checking the model. We wish to predict the melatonin level for a child of age 84 months.
predict(melatonin.lm,newdata=data.frame(age=84),interval="prediction",level=0.95)
## fit lwr upr
## 1 2.339838 1.524637 3.155039
Looking at the 95% Prediction Interval, this means that a healthy child of 84 monthsis very likely to have a melatonin level somewhere between 1.52 and 3.16 mg, with a best estimate of 2.34 mg.
We can also produce a CI for the ‘position of the line’ for a child aged 84 months:
predict(melatonin.lm, newdata=data.frame(age=84),interval="confidence",level=0.95)
## fit lwr upr
## 1 2.339838 2.094509 2.585167
Hence it is highly likely that the average melatonin level in the wider population of children aged 84 months will lie between 2.09 and 2.58 mg.
We may provide a useful summary of the fitted linear regression model along with prediction bands by producing a fitted line plot with prediction band superimposed. These bands represent the two levels between which we expect the melatonin levels for an individual healthy child to fall.
pi <- predict(melatonin.lm,
data.frame(age=age),level=0.95,interval="prediction")
plot(age,melatonin, xlab="Age of Child (months)", ylab="Melatonin (mg)", pch=16)
abline(melatonin.lm)
ord <-order(age) ## we need to order the values to produce the prediction interval lines
lines(x=age[ord], y=pi[,2][ord], lty=3) # lower PI
lines(x=age[ord], y=pi[,3][ord], lty=3) # upper PI
legend(20,2, c("Fitted", "Prediction"), lty=c(1,3), bty="n")
Thus we could give this plot to a doctor and s/he could obtain a range of values for the typical melatonin levels found in children at any given age within the range 14 and 97 months.
Average melatonin level and age are related in a linear fashion between the age range of 14 and 97 months, such that for every additional month the average melatonin level of the child decreases by 0.0424 mg. This suggests that older children have a lower melatonin level.
A doctor who finds a 7 year-old patient in his office with a melatonin level that is below 1.52 mg or above 3.16 mg should be investigated further, since these levels are unlikely to be observed in a typical non-depressed child.
Context: As part of a larger study on the impact of visual indicators whilst testing grip strength an occupational scientist recorded the right hand grip strength (rgs) measured in Newtons, using a digital dynamometer, of 30 healthy subjects together with their body weight (weight) measured in kilos.
Questions: Model the dependence of right hand grip strength on body weight. Predict the right hand grip strength of a healthy subject with a body weight of 88 kg.
Data: This is held in a CSV file called gripst.csv.