Data Description: Forbes’ Data on Boiling Points in the Alps

For this part of the exam you will need to use the forbes dataset in the MASS package. It is a data frame with 17 observations on the boiling point of water (bp, in degrees Fahrenheit) and barometric pressure (pres, in inches of mercury).

According to chem.purdue.edu: a liquid boils at a temperature at which its vapor pressure is equal to the pressure of the gas above it. The lower the pressure of a gas above a liquid, the lower the temperature at which the liquid will boil.

# Import the package
library(MASS)

# Load in the data
data("forbes")

# Learn about the data
?forbes

a) Identify the explanatory variable and the response variable for this study.

str(forbes)
## 'data.frame':    17 obs. of  2 variables:
##  $ bp  : num  194 194 198 198 199 ...
##  $ pres: num  20.8 20.8 22.4 22.7 23.1 ...
x <- forbes$pres
y <- forbes$bp

Since the boiling point is based on how much pressure is above the liquid, the explanatory variable would be Pressure (pres) while the response variable is Boiling Point (bp).

b) Use R to create a scatter plot of the two variables. Examine the scatter plot and verbally describe the overall relationship (linear or non-linear, positive or negative).

plot(x, y, xlab = "Barometric pressure (in. Hg)",
     ylab = "Boiling point (degrees F)")

The scatter plot shows a strong, positive, linear relationship: as pressure (x) increases, boiling point (y) also increases, and the points rise at a roughly steady slope rather than curving the way exponentially related data would.
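The visual impression of a strong positive linear relationship can be backed up with the sample correlation coefficient. The following is a minimal sketch, assuming the MASS package is installed; the name `r` is introduced here for illustration and is not part of the original answer.

```r
library(MASS)  # provides the forbes data frame

# Pearson correlation between pressure and boiling point
r <- cor(forbes$pres, forbes$bp)
r
```

A value close to +1 supports the description "strong, positive, linear".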

c) Create a simple linear regression model and add the least squares regression line to the scatter plot. What is the equation of the line?

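# From scratch: least squares estimates for simple linear regression
x_bar <- mean(x)
y_bar <- mean(y)
sxy <- sum((x - x_bar) * (y - y_bar))  # sum of cross-products
sxx <- sum((x - x_bar)^2)              # sum of squared deviations of x
beta_1 <- sxy / sxx                    # slope
beta_0 <- y_bar - beta_1 * x_bar       # intercept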

beta_1
## [1] 1.901784
beta_0
## [1] 155.2965
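# Verifying using lm()
# Note: x and y are workspace vectors, not columns of forbes, so data = forbes is not actually used
mod <- lm(y~x, data = forbes)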
mod$coefficients
## (Intercept)           x 
##  155.296483    1.901784

The equation of the line is: ŷ = 155.296483 + 1.901784x, where x is barometric pressure and ŷ is the predicted boiling point.
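Part c) also asks for the least squares line to be drawn on the scatter plot. One way to do this is with abline(), sketched below under the assumption that the model is refit using the column names bp and pres; the object name `fit` and the axis labels are choices made here for illustration.

```r
library(MASS)  # provides the forbes data frame

# Refit the same simple linear regression using the data frame columns
fit <- lm(bp ~ pres, data = forbes)

# Scatter plot of the raw data
plot(forbes$pres, forbes$bp,
     xlab = "Barometric pressure (inches of mercury)",
     ylab = "Boiling point (degrees Fahrenheit)")

# Overlay the fitted least squares line
abline(fit, col = "red", lwd = 2)
```

Passing the fitted model to abline() draws the line using its estimated intercept and slope, so the plot stays in sync with the model.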

d) Interpret the slope in the context of these data. Provide a five part conclusion for the hypothesis test for slope in the model summary output in R.

summary(mod)
## 
## Call:
## lm(formula = y ~ x, data = forbes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.22687 -0.22178  0.07723  0.19687  0.51001 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 155.29648    0.92734  167.47   <2e-16 ***
## x             1.90178    0.03676   51.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.444 on 15 degrees of freedom
## Multiple R-squared:  0.9944, Adjusted R-squared:  0.9941 
## F-statistic:  2677 on 1 and 15 DF,  p-value: < 2.2e-16

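The slope estimate is 1.90178: for each additional inch of mercury of barometric pressure, the predicted boiling point of water increases by about 1.90 degrees Fahrenheit.

For the test of H0: β1 = 0 against Ha: β1 ≠ 0, the summary output gives t = 51.74 on 15 degrees of freedom, with a p-value of less than 2e-16. At the 0.05 significance level we reject the null hypothesis. There is compelling evidence of a linear relationship between barometric pressure and the boiling point of water.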
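The slope test reported by summary() can be reproduced by hand from the estimate and its standard error. The following is a sketch, assuming the model is refit with the column names bp and pres; the names `fit`, `coefs`, `b1`, `se_b1`, `t_stat`, and `p_value` are introduced here for illustration.

```r
library(MASS)  # provides the forbes data frame

fit <- lm(bp ~ pres, data = forbes)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))
b1    <- coefs["pres", "Estimate"]
se_b1 <- coefs["pres", "Std. Error"]
df    <- df.residual(fit)            # n - 2 residual degrees of freedom

t_stat  <- b1 / se_b1                # test statistic for H0: beta_1 = 0
p_value <- 2 * pt(-abs(t_stat), df)  # two-sided p-value

t_stat
p_value

# 95% confidence interval for the slope
confint(fit, "pres", level = 0.95)
```

The confidence interval gives a range of plausible values for the change in boiling point per inch of mercury, which complements the reject or fail-to-reject decision.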

e) What percent of the variability is explained by the simple linear regression model?

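# Pull the sums of squares from the ANOVA table (named anova_tab to avoid masking anova())
anova_tab <- anova(mod)
ssreg <- anova_tab$`Sum Sq`[1]  # regression sum of squares
ssres <- anova_tab$`Sum Sq`[2]  # residual sum of squares
sstot <- ssreg + ssres          # total sum of squares
R2 <- ssreg / sstot
R2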
## [1] 0.9944282

About 99.44% of the variability in boiling point is explained by the simple linear regression on barometric pressure.
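As a cross-check, the same R-squared can be read directly from the model summary or computed as 1 - SSres/SStot. The following is a sketch, assuming the model is refit with the column names bp and pres; `fit`, `r2_summary`, and `r2_manual` are names introduced here for illustration.

```r
library(MASS)  # provides the forbes data frame

fit <- lm(bp ~ pres, data = forbes)

# R-squared as stored in the model summary
r2_summary <- summary(fit)$r.squared

# R-squared from the residual and total sums of squares
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((forbes$bp - mean(forbes$bp))^2)
r2_manual <- 1 - ss_res / ss_tot

c(summary = r2_summary, manual = r2_manual)
```

Both values should agree with the "Multiple R-squared" line of the summary output.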