#the two packages you need 
library(ggformula)
library(mosaic)

We will use data on the heart rates and body temperatures of healthy adults to see whether there is an association between body temperature and heart rate. The data consists of body temperatures (in Fahrenheit) and heart rates (in beats per minute) from 130 people.

HRdata <- read.table("http://www.isi-stats.com/isi/data/chap10/TempHeart.txt", header=TRUE)

Let’s view our data for heart rate in terms of body temperature using the function gf_point(response_variable ~ explanatory_variable, data=DATA). We can add labels using the xlab="label for x-axis" and ylab="label for y-axis"

gf_point(HeartRate~BodyTemp, data=HRdata, ylab="Heart rate (bpm)", xlab="Body temperature (F)")

We see a weak positive linear relationship between body temperature and heart rate. Let’s calculate the correlation using the cor( ) function.

r = cor(HeartRate~BodyTemp,data=HRdata)
r
## [1] 0.2536564

Do we have an outlier? Does the point at approximately (100.75, 75) influence the value of the correlation? That person might have a fever. Here is code to remove that observation and check to see if it is influential.

# select rows of the data where temp is less than or equal to 100.5"
subset_HRdata <- subset(HRdata, BodyTemp <= 100.5)

Re-plot the subset data.

gf_point(HeartRate ~ BodyTemp, data=subset_HRdata)

r <- cor(HeartRate ~ BodyTemp, data=subset_HRdata)
r
## [1] 0.2536827

The correlation didn’t change much at all. The point near (100.5, 75) is not an influential observation. Therefore we will proceed with the original data.

Validity conditions for regression with slope (or correlation)

Looking back at the data and regression line plot, the pattern of the points in the scatter plot doesn’t look curved; a linear trend seems reasonable. Second, we note that the points above the regression line appear similar in spread and shape as the points below the regression line (like a mirror image to the pattern). Finally, the variability in heart rates is approximately the same for different body temperatures (e.g., we don’t see that low body temperatures have low variability in heart rates and high body have high variability in heart rates or vice versa). When checking validity conditions, we use logic much like we do when testing hypotheses—we will assume the condition is true unless we see strong evidence that it is not.

Regression Line from R calculations

Let’s calculate our favorite statistics. In particular the mean and standard deviation of the two variables, body temperature and heart rate, since they will determine the regression line slope and intercept.

\[\widehat{HeartRate} = a + b \times( BodyTemp).\]

#heart rate statistics
favstats(~HeartRate,data=HRdata)
##  min Q1 median Q3 max     mean       sd   n missing
##   57 69     74 79  89 73.76154 7.062077 130       0
xbar_hr = 73.76154
s_hr = 7.062077
#body temp statistics
favstats(~BodyTemp,data=HRdata)
##   min   Q1 median   Q3   max     mean        sd   n missing
##  96.3 97.8   98.3 98.7 100.8 98.24923 0.7331832 130       0
xbar_temp = 98.24923
s_temp = 0.7331832

It is always the case that the point \((\bar{x}_{explanatory}, \bar{x}_{response})\) is a point on the line of best fit (the regression line).

This means the point \((\bar{x}_{temp}, \bar{x}_{hr}) = (98.25, 73.76)\) will be a point on our line of best fit.

The values of the two standard deviations and the correlation can be used to calculate the slope of the regression line. The relationship between the correlation \(r\) and the slope \(b\) is given by

\[b = r \frac{s_y}{s_x}.\]

We can use this formula to preview the slope of our linear model

slope = r*s_hr/s_temp
slope
## [1] 2.443492

Now we have the slope \(b=2.443238\) and a point on our regression line \((98.25, 73.76)\), so we can solve to find the intercept \(a\).

\[ y = a + bx \]

\[73.76 = a + 2.443238(98.25)\]

a=xbar_hr-slope*xbar_temp
a
## [1] -166.3096

Putting \(a\) and \(b\) into the slope-intercept linear equation we get

\[ \hat{y} = -166.2881 + 2.443238x. \]

Regression Line from R functions

We can also calculate the coefficients \(a\), the intercept, and \(b\), the slope, for our line of best fit using the linear model command lm( response ~ predictor, data= ).

lm(HeartRate~BodyTemp,data=HRdata)
## 
## Call:
## lm(formula = HeartRate ~ BodyTemp, data = HRdata)
## 
## Coefficients:
## (Intercept)     BodyTemp  
##    -166.285        2.443

We get the same line of best fit as seem previously. We can write the equation in the context of our study as:

\[\widehat{HeartRate} = -166.285 + 2.443 \times( BodyTemp).\]

To graph both the data and the line of best fit we will use two new commands:

gf_point(HeartRate ~ BodyTemp, data=HRdata) %>%
  gf_abline(intercept=-166.285,slope=2.443,color="purple")

Hypothesis Test with Slope

Our goal is to perform a hypothesis test to determine if the linear relationship given by the regression line is statistically significant.

Null: there is no linear association between body temp and heart rate. Alternative: there is a linear association between body temp and heart rate.

Equivalently, in terms of slope, we let \(\beta\) represent the slope of the regression line for the entire population of healthy adults.

\[H_0: \beta = 0 \\ H_a: \beta \neq 0 \]

Standardized \(t\)-statistic

Before calculating p-values we calculate the standardized \(t\)-statistic. There are two formulas we can use to calculate the standardized statistic, one uses the correlation \(r\) and the other uses the slope \(b\). Recall that these two numbers, \(r\) and \(b\), are related by a ratio of the standard deviations of our variables.

\[t = \frac{r}{ \sqrt{\frac{1-r^2}{n-2}} } \textrm{ and } \]
\[t = \frac{b-0}{SE(b)}\]

The first formula for the standardized statistic uses the sample size \(n\) and the correlation coefficient, \(r= 0.2536564\). The denominator \(\sqrt{\frac{1-r^2}{n-2}}\) is the standard error for the correlation coefficient.

The second formula uses the slope \(b=2.443\) and the standard error of the distribution of slopes, \(SE(b)\). The standard error \(SE(b)\) can be obtained from a simulation-based distribution of shuffled slopes or by theory-based techniques shown below.

Using the first formula with correlation, we can calculate \(t\) as follows.

n = 130
t_wcor = r/sqrt((1-r^2)/(n-2))
t_wcor
## [1] 2.967156

Inference with slope of regression line

To calculate a theory-based p-value for inference on the slope we use information generated by the linear model command lm( ). Similar to the ANOVA aov( ), this command stores much more information than is displayed by default. We have seen the output of slope and intercept. To view the p-value for our hypothesis test we assign an output name, such as m1 short for model 1, then we can display a summary of the test using the summary( ) command.

m1 <- lm(HeartRate~BodyTemp, data=HRdata)
summary(m1)
## 
## Call:
## lm(formula = HeartRate ~ BodyTemp, data = HRdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.6413  -4.6356   0.3247   4.8304  15.8474 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -166.2847    80.9123  -2.055  0.04190 * 
## BodyTemp       2.4432     0.8235   2.967  0.00359 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.858 on 128 degrees of freedom
## Multiple R-squared:  0.06434,    Adjusted R-squared:  0.05703 
## F-statistic: 8.802 on 1 and 128 DF,  p-value: 0.003591

Here is annotated image describing the important parts of the summary.

Reading the summary of the linear model command
Reading the summary of the linear model command

Notice that this output contains the intercept and slope of the regression line, but the values are stacked vertically instead of displayed horizontally. Also notice that the \(t\)-statistic \(= 2.967\) is the same as we calculated above. Next to the \(t\)-statistic is a \(p\)-value \(= 0.00359\). This represents the probability of seeing a standardized statistic of 2.967 or more extreme if the null-hypothesis is true. Note that the \(P(>|t|)\) is telling us this is a two-sided \(p\)-value. If we are performing a one-sided hypothesis test we would divide this \(p\)-value by 2.

Additionally, the standard error of slope, SEb, can be obtained from the summary output. Using this quantity we can again confirm the calculation of the \(t\) statistic, \(t = \frac{b-0}{SE(b)}\).

SEb = 0.8235
t_wslope = slope/SEb
t_wslope
## [1] 2.967203

Confidence intervals for slope.

Using the SEb = 0.8235 we can calculate a 2SD Confidence interval:

# upper and lower endpoints of a 95% confidence interval for the slope beta.
upper = slope+2*SEb
lower = slope-2*SEb

lower
## [1] 0.7964916
upper
## [1] 4.090492

We can calculate the theory-based 95% confidence interval using the confint( ) function applied to m1.

confint(m1,level=0.95)
##                   2.5 %    97.5 %
## (Intercept) -326.383620 -6.185819
## BodyTemp       0.813765  4.072711

Notice that the 2SD method gave us an interval that is slightly wider than necessary, but the 2SD method has the benefit of being a quick easy calculation without having to use technology.

Conclusions:

Strength of Evidence with context We have a \(p\)-value of \(= 0.00359\) and a standardized statistic of \(2.967,\) which give us very strong and strong evidence against the null hypothesis respectively. Thus, our our evidence supports the conclusion that there is a statistically significant linear association between body temperature and heart rate.

Estimation with context: Our 2SD confidence interval is \((0.7962, 4.0902)\) and the theory-based 95% confidence interval is (0.814, 4.073). Therefore, we are 95% confident that a one degree increase in body temperature is associated with an average increase of 0.814 to 4.073 heart beats per minute.

Causation and Generalization: Since this is an observational study of 130 healthy adults, we cannot draw any conclusions regarding causation. We also cannot generalize since it was not indicated that our data is a random sample of healthy adults; for all we know it could be a convenience sample of 18-25 year old adults.