Learning targets

  1. Recall statistical power, type I, and type II errors

  2. Understand the difference between correlation and regression analysis

  3. Be able to conduct and interpret a simple linear regression

Example exam question

The following task illustrates once more the mechanics of a frequentist statistical test: you decide on a test statistic (in practice you would take this from a textbook…), you compare your calculated test statistic to the random distribution of that statistic (for example, random t-values for a given number of degrees of freedom in the case of the t-test), and you then assess how ‘rare’ your calculated value is compared to random numbers following that distribution.

The Chi-squared test is a statistical test you may or may not have heard of. Without knowing how it works, if I tell you that the test statistic for the Chi-squared test is a Chi-squared value that follows a Chi-squared distribution with \(n\) degrees of freedom, would you consider a Chi-squared value of 15 at 2 degrees of freedom a significant test result? To answer this question, follow these steps:

 

## Draw 10000 random Chi-squared values with 2 degrees of freedom
## and look at their distribution
rchi <- rchisq(n = 10000, df = 2)
hist(rchi)

## How likely is a value of 15 or larger under this distribution?
pchisq(q = 15, df = 2, lower.tail = FALSE)
## [1] 0.0005530844
## The probability is well below 0.05, so a Chi-squared value of 15 at
## 2 degrees of freedom would be considered a significant test result.

Type I and type II errors, statistical power

In any statistical test there is a chance that we falsely reject \(H_0\) and accept \(H_A\); this is called a type I error (\(\alpha\) error). On the other hand, if we falsely accept \(H_0\) and reject \(H_A\), this is called a type II error (\(\beta\) error). Both are ‘bad’ of course, but depending on the case there may be reasons to minimise either the type I OR the type II error. Consider a case where the efficacy of a treatment against cancer is tested, but it is known that the treatment has severe side effects. In this case you would want to be absolutely sure that your treatment provides a benefit for your patients, knowing that they will have to cope with the adverse side effects. So, to minimise a type I error (concluding that the treatment works while in reality it doesn’t, or, in other words, by chance getting a sample that lies to the right of the \(\alpha\)-threshold), you would be advised to set the \(\alpha\) threshold as low as possible, e.g. to 1% rather than the usual 5%. On the other hand, if you were to test whether an optional prenatal diagnostic may have a fatal side effect for the embryo, you would want to avoid a type II error (concluding that there is no risk of a fatal side effect while there is one). In such a case you would choose a higher \(\alpha\) threshold to maximise the so-called ‘power’ of the test to detect a difference.
The power of a statistical test is the probability of drawing the correct conclusion if the alternative hypothesis is true (i.e. statistical power is the probability of rejecting \(H_0\) in favour of the true \(H_A\)). It is defined as \(1 - \beta\) (one minus the type II error probability).
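To put numbers on this trade-off, the following small sketch (with made-up sample size, effect size and standard deviation; power.t.test() is used properly in the example further below) computes the power of a two-sample t-test at three different \(\alpha\) thresholds: the lower we set \(\alpha\), the lower the power, i.e. the higher the risk of a type II error.

## Power of a two-sample t-test (n = 10 per group, effect size 1, sd 1)
## at three different alpha thresholds
sapply(c(0.01, 0.05, 0.10), function(a)
  power.t.test(n = 10, delta = 1, sd = 1, sig.level = a)$power)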

Types of errors in statistical hypothesis testing:

                       Accept \(H_0\)                       Reject \(H_0\)
\(H_0\) true           Correct                              Type I error (\(\alpha\) error)
\(H_0\) false          Type II error (\(\beta\) error)      Correct

 

Type I and II statistical errors.

 

Consider the following (entirely fictitious) example: a new internal WiFi antenna for the iPhone is tested. We know it performs better than the old ones used in previous iPhone versions. However, some engineers suspect that it could cause headaches, possibly even tumours, during long phone calls. Therefore, they have collected some preliminary data.

antenna <- rep(c(0, 1), each = 10) # indicator for the two antenna groups (10 test persons each)
headache <- c(0.7, -1.6, -0.2, -1.2, -0.1,  3.4,  3.7,  0.8,  
              0.0,  2.0,  1.9,  0.8,  1.1,  0.1, -0.1,  4.4,  
              5.5,  1.6,  4.6,  3.4) # headache scores
dat <- data.frame(antenna, headache)

headache
 [1]  0.7 -1.6 -0.2 -1.2 -0.1  3.4  3.7  0.8  0.0  2.0  1.9  0.8  1.1  0.1
[15] -0.1  4.4  5.5  1.6  4.6  3.4
## Randomly sample 4 observations
mysample <- sample(headache, size = 4)
sd(mysample)
## [1] 2.054872
## Draw boxplot
boxplot(headache ~ antenna) 

## ca. 1.5 difference between groups
power.t.test(n = 10, delta = 1.4, sd = 0.52)
## 
##      Two-sample t test power calculation 
## 
##               n = 10
##           delta = 1.4
##              sd = 0.52
##       sig.level = 0.05
##           power = 0.9998989
##     alternative = two.sided
## 
## NOTE: n is number in *each* group
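power.t.test() can also be turned around: fix the desired power and let it return the required sample size per group. A small sketch with round numbers taken from above (a group difference of about 1.5 and a more conservative standard deviation of about 2, roughly what the small random sample suggested):

## Required sample size per group for 80% power at the default 5% alpha level
power.t.test(delta = 1.5, sd = 2, power = 0.8)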

Correlation analysis

Correlation analysis is related to linear regression: both are concerned with the relationship between two variables. The difference between the two is that if we assume that one variable causes changes in the other variable, we use regression analysis (see below). If we simply believe that two variables are related but do not suspect a causal relationship, we use correlation analysis. The strength of the linear relationship between two variables can be quantified by their covariance, which is the sum of the products of their respective deviations from their means (often named SSXY, or ‘sum of products’; see the explanations on regression below) divided by the degrees of freedom. It is the equivalent of the variance (mean sum of squares) in the case of a single variable (seen above): \[ cov(x,y)=\frac{\sum{(x-\overline{x})(y-\overline{y})}}{n-1} \] How can this formula be understood intuitively? First, think of a graph where x and y are linearly related (y increases at the same rate as x). In this case, large negative deviations in x are multiplied by large negative deviations in y (yielding large positive products), and these are added to the products of large positive deviations with large positive deviations. The result is of course a large number. When there is no relationship between the variables (see the two right-hand plots produced below), the positive and negative products largely cancel out and you obtain values close to zero. See the following code to illustrate this:

x1 <- 1:10; y1 <- 1:10 # the semicolon simply allows two commands per line
x2 <- 1:10; y2 <- rep(1, times = 10)
set.seed(1234)
x3 <- rnorm(10); y3 <- rnorm(10) # two unrelated random variables

sum((x1 - mean(x1)) * (y1 - mean(y1)))/9 # high value
[1] 9.166667
sum((x2 - mean(x2)) * (y2 - mean(y2)))/9 # very low value, actually 0
[1] 0
sum((x3 - mean(x3)) * (y3 - mean(y3)))/9 # low value
[1] -0.1862983

par(mfrow = c(1, 3))
plot(x1, y1); plot(x2, y2); plot(x3, y3)

Covariance ranges from \(-\infty\) to \(+\infty\), and its value depends on the units in which the variables are measured. This makes it impractical for comparing the degree of correlation between pairs of variables with different units. Therefore, we standardise the metric to obtain a value between -1 (strongly negative relationship) and 1 (strongly positive relationship), with zero indicating no (linear) relationship. This is done by dividing the covariance by the product of the two standard deviations, which gives us the ‘correlation coefficient’ \(r\): \[ r = \frac{cov(x,y)}{\sigma(x) \ \sigma(y)} = \frac{\sum{(x-\overline{x})(y-\overline{y})}}{\sqrt{\sum{(x-\overline{x})^2}}\sqrt{\sum{(y-\overline{y})^2}}} \] We can easily calculate the covariance and the correlation coefficient by hand or by using the cov() and cor() commands in R:

## Covariance
# Calculated by hand
sum((x1 - mean(x1)) * (y1 - mean(y1)))/9 # high value
[1] 9.166667
# Calculated using the built-in function
cov(x = x1, y = y1)
[1] 9.166667

cov(x2, y2) # very low value, actually 0
[1] 0
cov(x3, y3) # low value
[1] -0.1862983

## Correlation coefficient (covariance divided by the product of the standard deviations)
# Calculated by hand
sum((x1 - mean(x1)) * (y1 - mean(y1)))/9 * 1/(sd(x1) * sd(y1)) # high value
[1] 1
# Calculated using the built-in function
cor(x = x1, y = y1)
[1] 1
cor(x = x2, y = y2)
Warning in cor(x = x2, y = y2): the standard deviation is zero
[1] NA
cor(x = x3, y = y3)
[1] -0.1752832

 

We can test for a significant correlation using the cor.test() command in R:  

set.seed(321)
rn1 <- rnorm(10)
rn2 <- rnorm(10)
cor.test(rn1, rn2)
## 
##  Pearson's product-moment correlation
## 
## data:  rn1 and rn2
## t = -0.52261, df = 8, p-value = 0.6154
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7280343  0.5057940
## sample estimates:
##        cor 
## -0.1816951
plot(rn1, rn2)

This tests the correlation between two random samples from a normal distribution centred around 0 with standard deviation 1. This relationship should normally turn out not to be significant, but every so often you will get a p-value smaller than 0.05, indicating that the two variables are significantly correlated at the \(\alpha = 0.05\) level. In this case you commit a type I error, because you know that the two sets of random numbers are not related. How often is that the case? Consider the following code to visualise how often you may fall into the trap of finding a significant correlation where there is none. Note that you do not need to be able to write code like this yourself; it is fine if it simply makes sense to you:

set.seed(123)
cors <- list()
for (i in 1:100)
{
  x <- rnorm(10)
  y <- rnorm(10)
  cors[[i]] <- cbind(x, y)
  if (cor.test(x = x, y = y)$p.value < 0.05)
  {
    print(paste("Test", i, "is significant by chance!"))
  }
} 
[1] "Test 7 is significant by chance!"
[1] "Test 9 is significant by chance!"
[1] "Test 27 is significant by chance!"
[1] "Test 83 is significant by chance!"
[1] "Test 88 is significant by chance!"
## Simply enter the number of the significant test into the double square brackets
## to access the data in the list.
cors[[7]]
                x           y
 [1,]  0.11764660  1.44455086
 [2,] -0.94747461  0.45150405
 [3,] -0.49055744  0.04123292
 [4,] -0.25609219 -0.42249683
 [5,]  1.84386201 -2.05324722
 [6,] -0.65194990  1.13133721
 [7,]  0.23538657 -1.46064007
 [8,]  0.07796085  0.73994751
 [9,] -0.96185663  1.90910357
[10,] -0.07130809 -1.44389316
## ...and we can rerun this correlation test to see whether it is really significant and
## to check if our loop has done the right thing.
cor.test(cors[[7]][, 1], cors[[7]][, 2])

    Pearson's product-moment correlation

data:  cors[[7]][, 1] and cors[[7]][, 2]
t = -2.6134, df = 8, p-value = 0.03097
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.91660788 -0.08557585
sample estimates:
       cor 
-0.6786367 
plot(cors[[7]][, 1], cors[[7]][, 2])

You will find that on average, 5 out of 100 such random tests turn out to be significant at the 5% level. In other words, we accept that in 5% of the cases we could obtain a false positive by chance (reject the null hypothesis when in fact it is true).
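If you do not want to rely on a single run of 100 tests, you can repeat the whole experiment many times and compute the proportion of significant results; the following one-liner is a sketch of that idea (the number of replications is arbitrary):

set.seed(42)
## Proportion of 'significant' correlations between two unrelated random samples;
## in the long run this should be close to the nominal alpha level of 0.05
mean(replicate(2000, cor.test(rnorm(10), rnorm(10))$p.value < 0.05))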

As an exercise, invent an example from your field of research where correlation analysis would be an appropriate tool. The data set should obviously contain two variables that may or may not be related, but changes in one should not cause changes in the other. Calculate the correlation coefficient, either by hand or by searching for an R function that does it for you. If you still struggle with R basics, calculating it by hand will be good practice. Plot your data, test for a significant correlation, and interpret the result.

Linear regression

There are various strategies for explaining and teaching linear regression. I firmly believe that some minimum understanding of what R does when you use lm() (the function in R to run a linear regression or linear model, hence the name) is extremely helpful. When you have two variables, an x-variable that you measured without error (e.g. the fresh weight of leaf samples) and a y-variable that depends on x (e.g. the dry weight of your leaf samples), you can ‘regress’ y on x (see textbooks for the interesting origin of the word ‘regression’). In other words, you try to estimate the linear relationship that exists between x and y. Recall from your maths knowledge that to describe a straight line, you need two pieces of information, the intercept and the slope: \[ \widehat{y} = a + bx \] with the intercept \(a\) and the slope \(b\); \(\widehat{y}\) represents the estimated values of \(y\). Now we want to know the intercept and the slope that best characterise a given relationship. Let’s visualise the example from above, \(x\) being the fresh weight and \(y\) the dry weight of your collected leaf samples:

set.seed(123)
x <- rnorm(10, mean = 10) * 1:10    # fresh weight
y <- rnorm(10, mean = 10) * 1:10/5  # dry weight
## Plot the data
par(mfrow = c(1, 2))
op <- par(mar = c(3.5, 3, 1, 1), mgp = c(2.4, 1, 0)) # reduce fig margins
plot(x, y, xlim = c(0, 120), ylim = c(0, 25))
abline(h = mean(y))
abline(a = 0, b = 0.2)

par(op) # reset graphical setting to default

Note that we can visually estimate the intercept and the slope: the intercept is approximately zero (as it should be in our example) and the slope about 0.2 (if you move one unit to the right, you move about 0.2 up). However, we need an objective measure for the best fit. Bypassing some of the theory, it is intuitive to start pivoting a horizontal line through the point that represents the mean of all y values and find the position where it best fits the data (see figure above). One way to do this is to minimise the sum of the squared residuals \(d\), that is, the vertical distances between the data points and the fitted line (the least squares method):

Total sum of squares (TSS) (left panel) and residual sum of squares (RSS) (right panel).

\[ d = y-\widehat{y} \] with \(\widehat{y}\) the estimated values sitting on the line. We can now replace \(\widehat{y}\) by \[ \widehat{y} = a + bx \] and obtain \[ d = y - a - bx \] Our task therefore is to find the intercept \(a\) and the slope \(b\) that minimise \[ \sum(y - a - bx)^2 \] This is done by setting the partial derivatives of the above expression with respect to \(a\) and \(b\) to zero (as you should recall from calculus); for the slope this reads \[ \frac{\partial\sum(y - a - bx)^2}{\partial b}=-2\sum x \ (y - a - bx) = 0 \] Solving these equations for \(b\), it turns out that (here without proof): \[ b=\frac{SSXY}{SSX} \] with \(SSXY\) the sum of products of the deviations of \(x\) and \(y\) from their means and \(SSX\) the sum of squared deviations of \(x\) from its mean. So let us calculate the slope of our example:

SSXY <- sum((x - mean(x)) * (y - mean(y)))
SSX <- sum((x - mean(x))^2)
b <- SSXY/SSX
b
[1] 0.2065662

By definition, the best-fit line passes through the point \((\overline{x}, \overline{y})\), the mean of both variables. We can therefore calculate the intercept by solving \[ \overline{y} = a + b\overline{x} \] for \(a\):

mean(y) - b * mean(x)
[1] -0.288882

Now that we know what we are doing, we deserve to get the same result much quicker:

m1 <- lm(y ~ x) #fit a linear model of y against x
m1

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
    -0.2889       0.2066  

The lm() command does exactly what we have done above, and you will see that you get the same intercept and slope as we calculated by hand. If you need more information, use the summary() function, a useful function that works on many different object types:

summary(m1)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2925 -0.5734 -0.1133  0.3116  2.2373 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.28888    0.72960  -0.396    0.702    
x            0.20657    0.01196  17.268 1.29e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.018 on 8 degrees of freedom
Multiple R-squared:  0.9739,    Adjusted R-squared:  0.9706 
F-statistic: 298.2 on 1 and 8 DF,  p-value: 1.288e-07
abline(m1)

 

Let us look at the output and try to interpret it. First, the call is repeated and you get some statistics on the distribution of the residuals. Then comes a table with the estimates of the intercept and the slope (here named after the variable \(x\)) and their standard errors. The two parameters are tested for significance, both against the null hypothesis that they are not different from zero (often this test does not make sense and cannot be interpreted sensibly for the intercept). Last, you get the residual standard error, the multiple R-squared (the proportion of variance explained), the adjusted R-squared and an F statistic of the explained vs. total variance. The latter corresponds to the t-test for the slope above. The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model; it increases only if a new term improves the model more than would be expected by chance. The last (and important) step is to check whether the model assumptions hold. For linear regression, the most important assumptions are:

  1. The x-variable is measured without (or with negligible) error.

  2. The variance of the model errors is constant along the range of \(x\) (variance homogeneity).

  3. The model errors are normally distributed.

This brings us to the full notation of the linear regression equation: \[ y = a + bx + \epsilon \quad \quad \quad \epsilon \sim N(0,\sigma^2) \] with \(y\) denoting the response variable, \(x\) the predictor (explanatory variable), \(a\) the intercept, \(b\) the slope and \(\epsilon\) the model errors (random errors), which we estimate using the model residuals. Note that the assumption is that the model errors are normally distributed, not the response variable, and we use the model residuals (an estimate of the model errors) to assess the normality assumption. This is why in practice we have to run the linear regression model first and then inspect the residuals to find out whether we violate the model assumptions! The first assumption cannot be checked post-hoc, but we should verify it beforehand, given the data in question. The other two can be assessed post-hoc. This is best done by applying the plot() function to the model output:

par(mfrow = c(2, 2))
plot(m1)
Model diagnostic plots: residuals vs. fitted (predicted) values (upper left) for assessing the variance homogeneity (homoscedasticity) criterion (no pattern suggests variance homogeneity; variance heterogeneity is commonly expressed as fan-shaped or undulating patterns); a quantile-quantile plot (upper right) contrasting the quantiles of the residuals with the quantiles of a normal distribution with the same mean and standard deviation, to assess the normality criterion (if the sample distribution approximates a normal distribution, the points fall on one line); a scale-location plot (lower left), similar to the first plot but using the square root of the standardised residuals; and a Cook's distance plot (lower right), where the Cook's distance statistic identifies influential points. There is no standard cut-off, but large values (which are labelled with their observation number) suggest potential outliers.

The interpretation of these plots requires some experience. Essentially, you are looking for patterns in the two left-hand plots (e.g. funnel shapes or trends), which would suggest that the variance in \(y\) is not constant along \(x\), violating one of the model assumptions. The bottom left plot is a variation of the top left one, in which the square root of the standardised residuals is used, which sometimes makes it easier to detect patterns. The ‘Q-Q’ plot (top right) examines the distribution of the residuals by plotting their quantiles against those of a normal distribution. If the residuals are approximately normally distributed, they will more or less lie on a straight line. The last plot (bottom right) detects influential data points that may need special attention. ‘Influential’ means that the outcome of the analysis depends on such data points to a very high degree. This can occur because a data point has a high residual (outlier in the \(y\)-space) or high ‘leverage’ (outlier in the \(x\)-space). The Cook’s distance combines the two by excluding one data point at a time and assessing how much the model fit changes. Roughly, we are looking for values greater than one, i.e. points that lie outside the red Cook’s distance contour. We know that the underlying data are normal in our case, but you may still find that the above assumptions appear to be violated (because we got an extreme sample by chance). What options you have in cases where the assumptions are clearly violated is beyond the scope of this introduction.
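If you prefer numbers to complement the visual checks, the quantities behind these plots can also be extracted from the fitted model directly. The following is only a sketch using base R helpers (a formal normality test such as shapiro.test() is one possible check among several, not a requirement):

## Formal test of the normality of the model residuals
shapiro.test(resid(m1))
## Cook's distances for all observations; flag any that exceed the rough cut-off of 1
cd <- cooks.distance(m1)
which(cd > 1)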

As an exercise, use the inbuilt data set ‘cars’ (load it with data(cars)) to see whether the speed at which a car travels influences the distance it needs to stop.

Let’s use the built-in data set cars to run a linear regression. This is actually quite an instructive example, as it highlights the importance of using common sense (and, in real-world examples, also expert knowledge) to judge model results. Here we only have two variables (see ?cars): speed, the speed in mph, and dist, the stopping distance in feet. Recall that the intercept in a linear regression model gives the value of the response variable (here the stopping distance) when the explanatory variable is zero. Now let’s run a standard linear regression on this data set:

plot(dist ~ speed, data = cars)
m1 <- lm(dist ~ speed, data = cars)
summary(m1)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
abline(m1) # model fit to plot

## Eh? Negative intercept? Does that make sense in this context?
## Rescale plot axes to home in on this
plot(dist ~ speed, data = cars, xlim = c(0, 30), ylim = c(-20, 120))
abline(m1) # model fit
abline(v = 0, lty = 3) # vertical line indicating x = 0
abline(h = -17.58, col = "red") # model intercept

OK, this looks pretty good at first glance, but now you need to ask yourself: does this all make sense? The model summary tells us that the intercept is -17.6 feet (see the red horizontal line in the above figure), meaning that when the speed of a car is zero the stopping distance is negative. That is nonsense! If the speed is zero, the stopping distance must be zero as well; there is no such thing as a negative stopping distance. This example highlights the fact that in some cases a linear regression model should be fitted through the origin (i.e. through y = 0 at x = 0). You see now how important it is to scrutinise our model, or perhaps I should say our work in general! Here is how to fit a linear regression model without an intercept.

plot(dist ~ speed, data = cars)
## Refit model without intercept
## There are two alternative notations to remove the intercept
m2 <- lm(dist ~ 0 + speed, data = cars)
m2 <- lm(dist ~ speed - 1, data = cars) # same as above
summary(m2)
## 
## Call:
## lm(formula = dist ~ speed - 1, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.183 -12.637  -5.455   4.590  50.181 
## 
## Coefficients:
##       Estimate Std. Error t value Pr(>|t|)    
## speed   2.9091     0.1414   20.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.26 on 49 degrees of freedom
## Multiple R-squared:  0.8963, Adjusted R-squared:  0.8942 
## F-statistic: 423.5 on 1 and 49 DF,  p-value: < 2.2e-16
abline(m2)

We can get predictions for a particular speed or for an entire range of speeds. See ?predict.lm for help with predictions from a linear model. For those of you who have ever done a calibration in the lab: this is exactly what is running in the background, a linear regression model (or in some cases a nonlinear model, but it is the same story). In order to plot confidence or prediction intervals, we derive model predictions over the whole range of the explanatory variable(s) and tell the predict() function to return those intervals along with the mean model fit, as in the following example, where we first predict for a single value and then for a range of predictor values:

## Model predictions at the observed speed values (the fitted values)
predict(object = m2)
##        1        2        3        4        5        6        7        8 
## 11.63653 11.63653 20.36393 20.36393 23.27306 26.18219 29.09132 29.09132 
##        9       10       11       12       13       14       15       16 
## 29.09132 32.00045 32.00045 34.90959 34.90959 34.90959 34.90959 37.81872 
##       17       18       19       20       21       22       23       24 
## 37.81872 37.81872 37.81872 40.72785 40.72785 40.72785 40.72785 43.63698 
##       25       26       27       28       29       30       31       32 
## 43.63698 43.63698 46.54611 46.54611 49.45525 49.45525 49.45525 52.36438 
##       33       34       35       36       37       38       39       40 
## 52.36438 52.36438 52.36438 55.27351 55.27351 55.27351 58.18264 58.18264 
##       41       42       43       44       45       46       47       48 
## 58.18264 58.18264 58.18264 64.00091 66.91004 69.81917 69.81917 69.81917 
##       49       50 
## 69.81917 72.72830
## Model prediction for one particular speed, say 24 mph
predict(object = m2, newdata = data.frame(speed = 24))
##        1 
## 69.81917
## Add a prediction interval to the mean estimate
predict(object = m2, newdata = data.frame(speed = 24), interval = "prediction")
##        fit      lwr      upr
## 1 69.81917 36.44121 103.1971
## Compute predictions over the range of the original speed values on a very fine grid of
## 100 equally spaced speed values between 3 and 25 mph.
xv <- seq(3, 25, length.out = 100) # xv stands short for new x-values to predict from
## Note in the predict function you need to provide a data frame in which you provide your new
## predictor values to predict from. The naming must be the same as in the original data, i.e.
## the new speed values stored in the xv vector must be called 'speed' in this new data frame.

preds <- predict(object = m2, newdata = data.frame(speed = xv), interval = "prediction")
preds <- as.data.frame(preds)

## Plot the lower and upper bounds of the prediction interval
plot(dist ~ speed, data = cars)
lines(xv, preds$fit) # model fit
# Upper and lower prediction bounds as dashed lines
lines(xv, preds$upr, lty = 2)
lines(xv, preds$lwr, lty = 2)
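For comparison, a confidence interval for the mean fit (rather than for new observations) can be obtained in the same way by setting interval = "confidence"; it will be narrower than the prediction interval shown above:

## Confidence interval for the fitted mean over the same grid of speed values
confs <- as.data.frame(predict(object = m2, newdata = data.frame(speed = xv),
                               interval = "confidence"))
lines(xv, confs$upr, lty = 3, col = "blue")
lines(xv, confs$lwr, lty = 3, col = "blue")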