Simple Linear Regression

For this learning activity, I used the “faithful” data set, which comes packaged with every installation of R. I used the R Tutorial eBook as a resource for this exercise: http://www.r-tutor.com/elementary-statistics/simple-linear-regression.

For convenience, here is R’s description of the “faithful” data set: “Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.” The data frame has two variables, eruptions and waiting, where eruptions is the duration of an eruption in minutes and waiting is the waiting time to the next eruption in minutes.

library(ggplot2) 

data("faithful") #load data 

#preview data
head(faithful)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55

#fit a linear model with eruptions as the dependent variable
#and waiting as the independent variable
eruption.lm <- lm(eruptions ~ waiting, data=faithful)

#model description 
summary(eruption.lm)
## 
## Call:
## lm(formula = eruptions ~ waiting, data = faithful)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.29917 -0.37689  0.03508  0.34909  1.19329 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.874016   0.160143  -11.70   <2e-16 ***
## waiting      0.075628   0.002219   34.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4965 on 270 degrees of freedom
## Multiple R-squared:  0.8115, Adjusted R-squared:  0.8108 
## F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16
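
#scatter plot of the data with the fitted regression line in red
plot(eruptions ~ waiting, data = faithful)
abline(eruption.lm, col = "red")

#diagnostic plots for the fitted model
plot(eruption.lm)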

Let’s use our model to make a prediction for a hypothetical scenario. How long will the next eruption last, given a waiting time of 75 minutes?

#store our hypothetical waiting time in newdata
newdata <- data.frame(waiting = 75)
predict(eruption.lm, newdata)
##       1 
## 3.79808

So, our model predicts that given we have waited for 75 minutes, the next eruption should last approximately 3.80 minutes.
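
To see where this number comes from, the prediction can be reproduced by hand from the fitted coefficients: -1.874016 + 0.075628 × 75 ≈ 3.798. The sketch below is only an illustration of that arithmetic; the names b, manual, and predicted are my own, and the prediction interval at the end is an optional extra that the text above does not rely on.

```r
# Refit the model so that this sketch runs on its own.
data("faithful")
eruption.lm <- lm(eruptions ~ waiting, data = faithful)

# Fitted coefficients: the intercept and the slope for waiting.
b <- coef(eruption.lm)

# Prediction by hand: intercept + slope * waiting time.
manual <- b[["(Intercept)"]] + b[["waiting"]] * 75

# The same prediction from predict(); unname() drops the row label "1".
predicted <- unname(predict(eruption.lm, data.frame(waiting = 75)))

# Check that both routes agree.
all.equal(manual, predicted)

# Optionally, request a 95% prediction interval around the point estimate.
predict(eruption.lm, data.frame(waiting = 75), interval = "prediction")
```

Computing the value from coef() makes it explicit that predict() simply evaluates the fitted line at the new waiting time.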

Multiple Linear Regression

Performing multiple linear regression in R is nearly identical to performing simple linear regression. Again, I use the examples provided in the R Tutorial eBook referenced at the top of the page.

The example uses the “stackloss” data set, which also comes packaged with R. For convenience, its description follows: “Operational data of a plant for the oxidation of ammonia to nitric acid.” The data frame has four variables: Air.Flow, Water.Temp, Acid.Conc., and stack.loss.

data("stackloss") #load data

head(stackloss) #preview data
##   Air.Flow Water.Temp Acid.Conc. stack.loss
## 1       80         27         89         42
## 2       80         27         88         37
## 3       75         25         90         37
## 4       62         24         87         28
## 5       62         22         87         18
## 6       62         23         87         18

#fit a linear model as in simple linear regression,
#with one term per independent variable
stackloss.lm <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)

#summarize model
summary(stackloss.lm)
## 
## Call:
## lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., 
##     data = stackloss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.2377 -1.7117 -0.4551  2.3614  5.6978 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -39.9197    11.8960  -3.356  0.00375 ** 
## Air.Flow      0.7156     0.1349   5.307  5.8e-05 ***
## Water.Temp    1.2953     0.3680   3.520  0.00263 ** 
## Acid.Conc.   -0.1521     0.1563  -0.973  0.34405    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.243 on 17 degrees of freedom
## Multiple R-squared:  0.9136, Adjusted R-squared:  0.8983 
## F-statistic:  59.9 on 3 and 17 DF,  p-value: 3.016e-09

Notice that we use the same linear model function (lm); only the formula changes, with the independent variables joined by + on the right-hand side.
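
As a small aside on the formula syntax, R also accepts a dot on the right-hand side as shorthand for "all other columns in the data frame". The sketch below assumes nothing beyond base R; the names full and short are my own, chosen only for this illustration.

```r
data("stackloss")

# Formula with every independent variable written out.
full <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)

# "." stands for all columns of the data frame other than the response.
short <- lm(stack.loss ~ ., data = stackloss)

# Both formulas describe the same model, so the coefficients agree.
all.equal(coef(full), coef(short))
```

Writing the terms out is more explicit, while the dot form is convenient when a data frame has many predictor columns.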

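#plot model on a 2x3 grid:
#termplot() draws three panels and plot() draws four diagnostics,
#so the last diagnostic plot starts a new page
par(mfrow=c(2,3))
termplot(stackloss.lm)
plot(stackloss.lm)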

As in the Simple Linear Regression example, let’s supply hypothetical independent variable values:

#supply hypothetical values for independent variables 
newdata <- data.frame(Air.Flow=72,Water.Temp=20,Acid.Conc.=85)

#predict using model given our hypothetical values 
predict(stackloss.lm, newdata)
##        1 
## 24.58173

This means that, given our hypothetical values, the model predicts a stack.loss of approximately 24.58. stack.loss is 10 times the percentage of the ingoing ammonia that escapes from the absorption column unabsorbed, so, according to the documentation, it is an inverse measure of the overall efficiency of the plant.
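
The same value can be checked by hand from the coefficient table: -39.9197 + 0.7156 × 72 + 1.2953 × 20 - 0.1521 × 85 ≈ 24.58. Since stack.loss is 10 times a percentage, this corresponds to roughly 2.46% of the ingoing ammonia escaping unabsorbed. The sketch below reproduces that arithmetic; the names x, manual, predicted, and percent_lost are my own and serve only as an illustration.

```r
# Refit the model so that this sketch runs on its own.
data("stackloss")
stackloss.lm <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)

# Hypothetical inputs with a leading 1 for the intercept,
# in the same order as coef(stackloss.lm).
x <- c(1, 72, 20, 85)

# Prediction by hand: sum of coefficient * input.
manual <- sum(coef(stackloss.lm) * x)

# The same prediction from predict(); unname() drops the row label "1".
newdata <- data.frame(Air.Flow = 72, Water.Temp = 20, Acid.Conc. = 85)
predicted <- unname(predict(stackloss.lm, newdata))

# Check that both routes agree.
all.equal(manual, predicted)

# stack.loss is 10 times the percentage lost, so divide by 10.
percent_lost <- manual / 10
percent_lost
```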

Final Remarks

I provided two examples of linear regression, the first simple and the second multiple. Although I plotted both models in this document, I have not evaluated them here. I will do that, and give more context for the statistical values presented above, in the next activity, Activity 8.