We will use data on the heart rates and body temperatures of healthy adults to see whether there is an association between body temperature and heart rate. The data consists of body temperatures (in Fahrenheit) and heart rates (in beats per minute) from 130 people.
Let’s view our data for heart rate in terms of body temperature using
the function
gf_point(response_variable ~ explanatory_variable, data=DATA).
We can add labels using the xlab="label for x-axis" and
ylab="label for y-axis"
We see a weak positive linear relationship between body temperature
and heart rate. Let’s calculate the correlation using the
cor( ) function.
## [1] 0.2536564
Do we have an outlier? Does the point at approximately (100.75, 75) influence the value of the correlation? That person might have a fever. Here is code to remove that observation and check to see if it is influential.
# select rows of the data where temp is less than or equal to 100.5"
subset_HRdata <- subset(HRdata, BodyTemp <= 100.5)Re-plot the subset data.
## [1] 0.2536827
The correlation didn’t change much at all. The point near (100.5, 75) is not an influential observation. Therefore we will proceed with the original data.
The general pattern of the points in the scatter plot should follow a linear trend; the pattern should not show curved or other nonlinear patterns.
There should be approximately the same distribution of points above the regression line as below the regression line (symmetry about the regression line).
The variability of the points around the regression line should be similar regardless of the value of the explanatory variable; the variability (spread) of the points around the regression line should not change as you slide along the x-axis (equal variance/standard deviation).
Looking back at the data and regression line plot, the pattern of the points in the scatter plot doesn’t look curved; a linear trend seems reasonable. Second, we note that the points above the regression line appear similar in spread and shape as the points below the regression line (like a mirror image to the pattern). Finally, the variability in heart rates is approximately the same for different body temperatures (e.g., we don’t see that low body temperatures have low variability in heart rates and high body have high variability in heart rates or vice versa). When checking validity conditions, we use logic much like we do when testing hypotheses—we will assume the condition is true unless we see strong evidence that it is not.
Let’s calculate our favorite statistics. In particular the mean and standard deviation of the two variables, body temperature and heart rate, since they will determine the regression line slope and intercept.
\[\widehat{HeartRate} = a + b \times( BodyTemp).\]
## min Q1 median Q3 max mean sd n missing
## 57 69 74 79 89 73.76154 7.062077 130 0
## min Q1 median Q3 max mean sd n missing
## 96.3 97.8 98.3 98.7 100.8 98.24923 0.7331832 130 0
It is always the case that the point \((\bar{x}_{explanatory}, \bar{x}_{response})\) is a point on the line of best fit (the regression line).
This means the point \((\bar{x}_{temp}, \bar{x}_{hr}) = (98.25, 73.76)\) will be a point on our line of best fit.
The values of the two standard deviations and the correlation can be used to calculate the slope of the regression line. The relationship between the correlation \(r\) and the slope \(b\) is given by
\[b = r \frac{s_y}{s_x}.\]
We can use this formula to preview the slope of our linear model
## [1] 2.443492
Now we have the slope \(b=2.443238\) and a point on our regression line \((98.25, 73.76)\), so we can solve to find the intercept \(a\).
\[ y = a + bx \]
\[73.76 = a + 2.443238(98.25)\]
## [1] -166.3096
Putting \(a\) and \(b\) into the slope-intercept linear equation we get
\[ \hat{y} = -166.2881 + 2.443238x. \]
We can also calculate the coefficients \(a\), the intercept, and \(b\), the slope, for our line of best fit
using the linear model command
lm( response ~ predictor, data= ).
##
## Call:
## lm(formula = HeartRate ~ BodyTemp, data = HRdata)
##
## Coefficients:
## (Intercept) BodyTemp
## -166.285 2.443
We get the same line of best fit as seem previously. We can write the equation in the context of our study as:
\[\widehat{HeartRate} = -166.285 + 2.443 \times( BodyTemp).\]
To graph both the data and the line of best fit we will use two new commands:
the command
gf_abline(intercept = , slope = , color= "red") gives
the graph of the line of best fit with the numerical values of the
intercept and slope included after the equal signs.
the %>% is a ‘piping’ command that tells R to
first do the thing before the %>% then do the thing
after, then display both. This command layers the two graphs
(of the data and the line) one on top of the other.
gf_point(HeartRate ~ BodyTemp, data=HRdata) %>%
gf_abline(intercept=-166.285,slope=2.443,color="purple")Our goal is to perform a hypothesis test to determine if the linear relationship given by the regression line is statistically significant.
Null: there is no linear association between body temp and heart rate. Alternative: there is a linear association between body temp and heart rate.
Equivalently, in terms of slope, we let \(\beta\) represent the slope of the regression line for the entire population of healthy adults.
\[H_0: \beta = 0 \\ H_a: \beta \neq 0 \]
Before calculating p-values we calculate the standardized \(t\)-statistic. There are two formulas we can use to calculate the standardized statistic, one uses the correlation \(r\) and the other uses the slope \(b\). Recall that these two numbers, \(r\) and \(b\), are related by a ratio of the standard deviations of our variables.
\[t = \frac{r}{ \sqrt{\frac{1-r^2}{n-2}} }
\textrm{ and } \]
\[t = \frac{b-0}{SE(b)}\]
The first formula for the standardized statistic uses the sample size \(n\) and the correlation coefficient, \(r= 0.2536564\). The denominator \(\sqrt{\frac{1-r^2}{n-2}}\) is the standard error for the correlation coefficient.
The second formula uses the slope \(b=2.443\) and the standard error of the distribution of slopes, \(SE(b)\). The standard error \(SE(b)\) can be obtained from a simulation-based distribution of shuffled slopes or by theory-based techniques shown below.
Using the first formula with correlation, we can calculate \(t\) as follows.
## [1] 2.967156
To calculate a theory-based p-value for inference on the slope we use
information generated by the linear model command lm( ).
Similar to the ANOVA aov( ), this command stores much more
information than is displayed by default. We have seen the output of
slope and intercept. To view the p-value for our hypothesis test we
assign an output name, such as m1 short for model 1, then
we can display a summary of the test using the summary( )
command.
##
## Call:
## lm(formula = HeartRate ~ BodyTemp, data = HRdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.6413 -4.6356 0.3247 4.8304 15.8474
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -166.2847 80.9123 -2.055 0.04190 *
## BodyTemp 2.4432 0.8235 2.967 0.00359 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.858 on 128 degrees of freedom
## Multiple R-squared: 0.06434, Adjusted R-squared: 0.05703
## F-statistic: 8.802 on 1 and 128 DF, p-value: 0.003591
Here is annotated image describing the important parts of the summary.
Notice that this output contains the intercept and slope of the regression line, but the values are stacked vertically instead of displayed horizontally. Also notice that the \(t\)-statistic \(= 2.967\) is the same as we calculated above. Next to the \(t\)-statistic is a \(p\)-value \(= 0.00359\). This represents the probability of seeing a standardized statistic of 2.967 or more extreme if the null-hypothesis is true. Note that the \(P(>|t|)\) is telling us this is a two-sided \(p\)-value. If we are performing a one-sided hypothesis test we would divide this \(p\)-value by 2.
Additionally, the standard error of slope, SEb, can be obtained from the summary output. Using this quantity we can again confirm the calculation of the \(t\) statistic, \(t = \frac{b-0}{SE(b)}\).
## [1] 2.967203
Using the SEb = 0.8235 we can calculate a 2SD
Confidence interval:
# upper and lower endpoints of a 95% confidence interval for the slope beta.
upper = slope+2*SEb
lower = slope-2*SEb
lower## [1] 0.7964916
## [1] 4.090492
We can calculate the theory-based 95% confidence
interval using the confint( ) function applied to
m1.
## 2.5 % 97.5 %
## (Intercept) -326.383620 -6.185819
## BodyTemp 0.813765 4.072711
Notice that the 2SD method gave us an interval that is slightly wider than necessary, but the 2SD method has the benefit of being a quick easy calculation without having to use technology.
Conclusions:
Strength of Evidence with context We have a \(p\)-value of \(= 0.00359\) and a standardized statistic of \(2.967,\) which give us very strong and strong evidence against the null hypothesis respectively. Thus, our our evidence supports the conclusion that there is a statistically significant linear association between body temperature and heart rate.
Estimation with context: Our 2SD confidence interval is \((0.7962, 4.0902)\) and the theory-based 95% confidence interval is (0.814, 4.073). Therefore, we are 95% confident that a one degree increase in body temperature is associated with an average increase of 0.814 to 4.073 heart beats per minute.
Causation and Generalization: Since this is an observational study of 130 healthy adults, we cannot draw any conclusions regarding causation. We also cannot generalize since it was not indicated that our data is a random sample of healthy adults; for all we know it could be a convenience sample of 18-25 year old adults.