Regression, Part 2

library(resampledata)

Does linear regression make sense?

Of course, not all data sets are modeled well by a line.
A residual plot is one way to assess whether fitting data to a line makes sense.
The residuals are the differences between the observed output and the predicted output: \[ y-\hat{y} \] for each \( x \) in the data set.

plot of chunk unnamed-chunk-2

The (signed) vertical lengths of the line segments are the residuals.
We plot these lengths against \( x \) in a residual plot.
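As a quick check of the definition, here is a minimal sketch that computes residuals by hand as observed minus predicted and compares them with what `resid()` returns. The five data points are made up purely for illustration and are not from any data set used in these slides.

```r
# Hypothetical toy data, invented for this illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)

# Residual = observed output minus predicted output
manual_resid <- y - fitted(fit)

# Should be TRUE: resid() uses the same observed-minus-predicted convention
all.equal(unname(manual_resid), unname(resid(fit)))

# For a least-squares line with an intercept, the residuals sum to
# (numerically) zero
sum(manual_resid)
```

So a positive residual means the point lies above the fitted line, and a negative one means it lies below.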

Example

For the CO\( _2 \) example:

library(resampledata)
linearfit <- lm(Level~Year, data=Maunaloa)
plot(Maunaloa$Year, resid(linearfit))
abline(h=0)

plot of chunk unnamed-chunk-3

What to look for

The points should be randomly scattered around the horizontal line at 0.
Here it's not as random as it might be:
- the positive residuals are at the extreme ends,
- and the negative residuals are in the middle.
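To see where a pattern like this can come from, here is a small sketch with made-up, noise-free data that curve gently upward. The numbers are invented and are not the Mauna Loa measurements; the point is only that forcing a straight line through curved data leaves positive residuals at both ends and negative ones in the middle.

```r
# Made-up, noise-free data with a slight upward curve (not the Maunaloa data)
x <- 1:50
y <- 300 + 1.5 * x + 0.01 * x^2

fit <- lm(y ~ x)   # fit a straight line to data that are not straight
r <- resid(fit)

# Signs of the residuals at the left end, in the middle, and at the right end:
# positive, negative, positive
sign(r[c(1, 25, 50)])

# The residual plot shows the same U shape
plot(x, r)
abline(h = 0)
```

A U-shaped (or upside-down U-shaped) residual plot is a common hint that the relationship bends and a straight line is missing it.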

BB example

Create the residual plot for the BB example.

bbfit <- lm(PercFG~OffReb, data=NBA1617)
plot(NBA1617$OffReb, resid(bbfit))
abline(h=0)

plot of chunk unnamed-chunk-4

This one seems more random to me (though that is a subjective call).
You might remove the outlier: the one with a residual of 20.
(There also appears to be one outlier with a very high number of offensive rebounds.)

One further step

library(resampledata)
linearfit <- lm(Level~Year, data=Maunaloa)
plot(Maunaloa$Year, resid(linearfit))
abline(h=0)
lines(smooth.spline(Maunaloa$Year, resid(linearfit), df=3))

plot of chunk unnamed-chunk-5

We want the spline to be close to a horizontal line.
If it isn't, that suggests a linear regression may not be a good model.

BB example

Add the spline to the BB plot.

bbfit <- lm(PercFG~OffReb, data=NBA1617)
plot(NBA1617$OffReb, resid(bbfit))
#abline(h=0)
lines(smooth.spline(NBA1617$OffReb, resid(bbfit), df=3))

plot of chunk unnamed-chunk-6

So, despite the fact that the correlation for the CO\( _2 \) example is much higher than that of the BB example, this analysis indicates that the BB example is better suited to a linear regression.
Perhaps some other type of regression is better for the carbon dioxide example.

another look

plot(Maunaloa$Level~Maunaloa$Year)
abline(linearfit)

plot of chunk unnamed-chunk-7

Next issue and some new data

The Alelager data set contains information about alcohol and calories in beer:

library(resampledata)
library(tidyverse)
head(Alelager)

  ID Type Alcohol Calories
1  1  Ale    5.50      160
2  2  Ale    5.40      156
3  3  Ale    4.85      146
4  4  Ale    4.50      150
5  5  Ale    5.20      160
6  6  Ale    5.30      174

The first few steps

I'm just going to catch up to where we are with the CO\( _2 \) and BB examples.
I'll use pipes a little to illustrate.

plot

Alelager %>% 
  select(Alcohol, Calories) %>% 
  plot

plot of chunk unnamed-chunk-9

let's remove the outlier

newbeer <- Alelager %>% 
  filter(Alcohol < 6.5)
newbeer %>% 
  select(Alcohol, Calories) %>% 
  plot

plot of chunk unnamed-chunk-10

correlation and r-squared

r <- newbeer %>% 
  select(Alcohol, Calories) %>% 
  cor
r

           Alcohol  Calories
Alcohol  1.0000000 0.6105981
Calories 0.6105981 1.0000000

r^2

         Alcohol Calories
Alcohol  1.00000  0.37283
Calories 0.37283  1.00000

So there is a moderate, positive correlation between alcohol and calories (r is about 0.61).
About 37.3% of the variance in calories is explained by the linear model on alcohol.
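For a regression with a single predictor, the squared correlation is the same number that `lm` reports as R-squared. Here is a minimal sketch of that check. To keep it self-contained it uses only the six rows printed by `head(Alelager)` above, typed in by hand, so its values will not match the full-data output.

```r
# The six rows shown by head(Alelager), typed in by hand (not the full data set)
beer6 <- data.frame(
  Alcohol  = c(5.50, 5.40, 4.85, 4.50, 5.20, 5.30),
  Calories = c(160, 156, 146, 150, 160, 174)
)

r6   <- cor(beer6$Alcohol, beer6$Calories)
fit6 <- lm(Calories ~ Alcohol, data = beer6)

r6^2                      # squared correlation
summary(fit6)$r.squared   # R-squared reported by the model summary

# Should be TRUE for any simple (one-predictor) linear regression
all.equal(r6^2, summary(fit6)$r.squared)
```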

linear model

linearbeer <- lm(Calories~Alcohol, data=newbeer)
linearbeer


Call:
lm(formula = Calories ~ Alcohol, data = newbeer)

Coefficients:
(Intercept)      Alcohol  
      47.41        21.99

According to the model, each increase of 1 percentage point in alcohol goes with a predicted increase of about 21.99 calories.
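The arithmetic behind that reading of the slope can be sketched directly from the printed coefficients. The helper `pred` below is a hypothetical name introduced only for this illustration, and because the coefficients are rounded to two decimals, the results are only as precise as that rounding.

```r
# Coefficients as printed above (rounded to two decimals)
b0 <- 47.41   # intercept
b1 <- 21.99   # slope: calories per percentage point of alcohol

# Hypothetical helper: predicted calories at a given alcohol level
pred <- function(alcohol) b0 + b1 * alcohol

pred(5)             # 47.41 + 21.99 * 5 = 157.36
pred(6) - pred(5)   # moving from 5% to 6% adds exactly the slope, 21.99
```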

residual plot

plot(newbeer$Alcohol, resid(linearbeer))
abline(h=0)
lines(smooth.spline(newbeer$Alcohol, resid(linearbeer), df=3))

plot of chunk unnamed-chunk-13

Analysis of residual plot

The spline is pretty horizontal.
The points seem pretty random.
A linear model seems good.
For what it's worth: when I did this without removing the outlier, a linear model did not seem like a good fit.

Caught up...next steps

Sample

This data set is a sample.
Is the true regression line, that is, the one for all beer, the same as this one?
We can approach this in two ways:
- find a confidence interval for the slope
- find confidence intervals for the predictions (one for each alcohol value)

Confidence interval for slope

confint(linearbeer)

                2.5 %    97.5 %
(Intercept) -8.367596 103.19397
Alcohol     10.948288  33.02796

The interval is computed from a \( t \) distribution, and the formula is pretty messy.
We are 95% confident that the true slope is between 10.95 and 33.03.
Since 0 is not in the interval, we have evidence that there is, in fact, a relationship between alcohol level and calories.
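For anyone curious about the messy part, the slope interval is the estimate plus or minus a \( t \) critical value times the standard error of the slope. Here is a minimal sketch that rebuilds it by hand and compares it with `confint()`. It again uses only the six rows printed by `head(Alelager)`, typed in by hand, so the interval it prints is not the one shown above.

```r
# The six rows shown by head(Alelager), typed in by hand (not the full data set)
beer6 <- data.frame(
  Alcohol  = c(5.50, 5.40, 4.85, 4.50, 5.20, 5.30),
  Calories = c(160, 156, 146, 150, 160, 174)
)
fit6 <- lm(Calories ~ Alcohol, data = beer6)

# Slope estimate and its standard error from the coefficient table
est <- coef(summary(fit6))["Alcohol", "Estimate"]
se  <- coef(summary(fit6))["Alcohol", "Std. Error"]

# t critical value for 95% confidence, with n - 2 degrees of freedom
tcrit <- qt(0.975, df = df.residual(fit6))

manual_ci <- c(est - tcrit * se, est + tcrit * se)
manual_ci
confint(fit6)["Alcohol", ]   # should agree with manual_ci
```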

BB example

Confidence interval for slope

confint(bbfit)

                  2.5 %      97.5 %
(Intercept) 41.06394948 44.58278227
OffReb       0.03241561  0.08282798

Graphically

library(ggformula)
gf_point(Calories~Alcohol, data=newbeer) %>% 
  gf_lm(interval="prediction", fill="skyblue") %>% 
  gf_lm(interval="confidence")

plot of chunk unnamed-chunk-16

The grey strip is a 95% confidence band for where the true regression line lies.
- it reflects the uncertainty in both the slope and the intercept, so it is related to (but not the same as) the confidence interval we found for the slope
The blue strip is a 95% prediction interval: the range where we expect an individual new observation to fall.
- think of each fixed \( x \) value as having its own interval

BB

library(ggformula)
gf_point(PercFG~OffReb, data=NBA1617) %>% 
  gf_lm(interval="prediction", fill="skyblue") %>% 
  gf_lm(interval="confidence")

plot of chunk unnamed-chunk-17

Prediction

first, I found an easier way to “predict”:
for each of the following, we fix Alcohol=5

library(mosaic)  # makeFun() for fitted lm objects comes from the mosaic package
Calories.dist <- makeFun(linearbeer)
Calories.dist(Alcohol=5)

       1 
157.3538

confidence interval for the mean response

Calories.dist(Alcohol=5, interval="confidence")

       fit      lwr     upr
1 157.3538 153.3796 161.328

prediction interval for a single new observation

Calories.dist(Alcohol=5, interval="prediction")

       fit      lwr      upr
1 157.3538 135.3588 179.3488
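The same two intervals can also be requested from base R's `predict()`. Here is a minimal sketch, again using only the six rows printed by `head(Alelager)` (typed in by hand), so the numbers differ from the output above. The names `ci_mean` and `pi_new` are just labels chosen for this illustration.

```r
# The six rows shown by head(Alelager), typed in by hand (not the full data set)
beer6 <- data.frame(
  Alcohol  = c(5.50, 5.40, 4.85, 4.50, 5.20, 5.30),
  Calories = c(160, 156, 146, 150, 160, 174)
)
fit6 <- lm(Calories ~ Alcohol, data = beer6)

new_beer <- data.frame(Alcohol = 5)

# Interval for the mean calories of all beers with 5% alcohol
ci_mean <- predict(fit6, newdata = new_beer, interval = "confidence")

# Interval for the calories of one new beer with 5% alcohol
pi_new <- predict(fit6, newdata = new_beer, interval = "prediction")

ci_mean
pi_new
```

Both intervals are centered on the same fitted value, but the prediction interval is always the wider of the two, because it also has to allow for the scatter of individual beers around the line.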

all in one picture

gf_point(Calories~Alcohol, data=newbeer) %>% 
  gf_lm(interval="prediction", fill="skyblue") %>% 
  gf_lm(interval="confidence")

plot of chunk unnamed-chunk-21

Calories.dist(Alcohol=5)

       1 
157.3538

Calories.dist(Alcohol=5, interval="confidence")

       fit      lwr     upr
1 157.3538 153.3796 161.328

Calories.dist(Alcohol=5, interval="prediction")

       fit      lwr      upr
1 157.3538 135.3588 179.3488

BB

Predict the FG percentage for a player with 200 offensive rebounds.
Find a 95% prediction interval for the FG percentage of a player with 200 offensive rebounds.

FG.dist <- makeFun(bbfit)
FG.dist(OffReb=200)

       1 
54.34773

FG.dist(OffReb=200, interval="prediction")

       fit      lwr     upr
1 54.34773 42.37505 66.3204