Regression_Part_2

library(resampledata)

Does linear regression make sense?

  • Of course, not all data sets are modeled well by a line.
  • A residual plot is one way to assess whether fitting data to a line makes sense.
  • The residuals are the differences between the observed output and the predicted output: \[ y-\hat{y} \] for each \( x \) in the data set. (This is the sign convention used by R's resid().)
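As a quick check of the definition, here is a minimal sketch using R's built-in cars data set (an assumption for illustration; it is not one of the data sets used in these slides). It computes the residuals by hand as observed minus predicted and compares them with what resid() returns.

```r
# Illustration only: built-in cars data, not the data used in the slides
fit <- lm(dist ~ speed, data = cars)

# residual = observed y minus predicted y-hat
by_hand <- cars$dist - fitted(fit)

# resid() follows the same convention, so this should be TRUE
all.equal(unname(by_hand), unname(resid(fit)))

# with an intercept in the model, least-squares residuals sum to (numerically) zero
sum(resid(fit))
```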

plot of chunk unnamed-chunk-2

  • The vertical lengths of the line segments are the residuals.
  • We plot these signed lengths in a residual plot.

Example

For CO\( _2 \) example:

library(resampledata)
linearfit <- lm(Level~Year, data=Maunaloa)
plot(Maunaloa$Year, resid(linearfit))
abline(h=0)

plot of chunk unnamed-chunk-3

What to look for

  • The points should be randomly distributed in the region.
  • Here it's not as random as it might be:
    • the positive residuals are at the extreme ends,
    • and the negative residuals are in the middle.

BB example

Create the residual plot for the BB example.

bbfit <- lm(PercFG~OffReb, data=NBA1617)
plot(NBA1617$OffReb, resid(bbfit))
abline(h=0)

plot of chunk unnamed-chunk-4

  • This one seems more random to me. (subjective)
  • You might remove the outlier: the one with a residual of 20.
  • (there also appears to be one outlier with a very high number of offensive rebounds)

One further step

library(resampledata)
linearfit <- lm(Level~Year, data=Maunaloa)
plot(Maunaloa$Year, resid(linearfit))
abline(h=0)
lines(smooth.spline(Maunaloa$Year, resid(linearfit), df=3))

plot of chunk unnamed-chunk-5

  • We want the spline to be a horizontal line.
  • If it isn't, that suggests a linear regression may not be a good model.
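To see what a flat versus a bent spline looks like, here is a small sketch on simulated data (the data, the seed and the variable names are all made up for illustration). One response is truly linear in x and the other has a quadratic term; the residuals of a straight-line fit are smoothed with the same smooth.spline(..., df=3) call used above.

```r
set.seed(1)                                     # simulated data, illustration only
x <- seq(0, 10, length.out = 100)
y_line  <- 2 + 3 * x + rnorm(100)               # truly linear relationship
y_curve <- 2 + 3 * x + 0.5 * x^2 + rnorm(100)   # curved relationship

fit_line  <- lm(y_line ~ x)
fit_curve <- lm(y_curve ~ x)

# smooth the residuals the same way as in the slides
sp_line  <- smooth.spline(x, resid(fit_line),  df = 3)
sp_curve <- smooth.spline(x, resid(fit_curve), df = 3)

# vertical spread of each spline: close to 0 means roughly horizontal
diff(range(sp_line$y))
diff(range(sp_curve$y))

# side-by-side residual plots
par(mfrow = c(1, 2))
plot(x, resid(fit_line),  main = "linear data"); abline(h = 0); lines(sp_line)
plot(x, resid(fit_curve), main = "curved data"); abline(h = 0); lines(sp_curve)
```

The spread of the spline is only a rough numeric stand-in for "how horizontal it looks"; the plot is still the main tool.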

BB example

  • add the spline to the BB plot
bbfit <- lm(PercFG~OffReb, data=NBA1617)
plot(NBA1617$OffReb, resid(bbfit))
#abline(h=0)
lines(smooth.spline(NBA1617$OffReb, resid(bbfit), df=3))

plot of chunk unnamed-chunk-6

  • So, despite the fact that the correlation for the CO\( _2 \) example is much higher than that of the BB example, this analysis indicates that the BB example is better suited to a linear regression.
  • Perhaps some other type of regression is better for the carbon dioxide example.

another look

plot(Maunaloa$Level~Maunaloa$Year)
abline(linearfit)

plot of chunk unnamed-chunk-7

Next issue and some new data

  • the Alelager data set contains information about alcohol and calories in beer:
library(resampledata)
library(tidyverse)
head(Alelager)
  ID Type Alcohol Calories
1  1  Ale    5.50      160
2  2  Ale    5.40      156
3  3  Ale    4.85      146
4  4  Ale    4.50      150
5  5  Ale    5.20      160
6  6  Ale    5.30      174

The first few steps

  • I'm just going to catch up to where we are with the CO\( _2 \) and BB examples
  • I'll use pipes a little to illustrate

plot

Alelager %>% 
  select(Alcohol, Calories) %>% 
  plot

plot of chunk unnamed-chunk-9

let's remove the outlier

newbeer <- Alelager %>% 
  filter(Alcohol<6.5) 
newbeer%>% 
  select(Alcohol, Calories) %>% 
  plot

plot of chunk unnamed-chunk-10

correlation and r-squared

r <- newbeer %>% 
  select(Alcohol, Calories) %>% 
  cor
r
           Alcohol  Calories
Alcohol  1.0000000 0.6105981
Calories 0.6105981 1.0000000
r^2
         Alcohol Calories
Alcohol  1.00000  0.37283
Calories 0.37283  1.00000
  • So there is a moderate, positive correlation between Alcohol and Calories (r is about 0.61)
  • 37.3% of the variance in Calories is explained by a linear model in Alcohol
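The link between r^2 and "variance explained" can be checked directly. This is a minimal sketch on R's built-in cars data (an assumption for illustration, not the beer data): for a regression with a single predictor, the squared correlation should match the R-squared reported by summary().

```r
# Illustration only: built-in cars data, not the beer data
fit <- lm(dist ~ speed, data = cars)

# cor() on two vectors returns a single number rather than a matrix
r <- cor(cars$speed, cars$dist)

r^2
summary(fit)$r.squared

# equal up to floating-point error (this holds for one-predictor models)
all.equal(r^2, summary(fit)$r.squared)
```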

linear model

linearbeer <- lm(Calories~Alcohol, data=newbeer)
linearbeer

Call:
lm(formula = Calories ~ Alcohol, data = newbeer)

Coefficients:
(Intercept)      Alcohol  
      47.41        21.99  
  • According to the model, each additional percentage point of alcohol is associated with a predicted increase of about 21.99 calories
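With the rounded coefficients above, a beer with 5% alcohol has a predicted 47.41 + 21.99 × 5 ≈ 157.4 calories. The sketch below shows the same arithmetic in code on R's built-in cars data (an assumption for illustration, not the beer data): the difference between predictions one unit apart equals the slope, and a prediction is just intercept plus slope times x.

```r
# Illustration only: built-in cars data, not the beer data
fit <- lm(dist ~ speed, data = cars)
b <- coef(fit)

# predictions at two x values that are one unit apart
p10 <- predict(fit, newdata = data.frame(speed = 10))
p11 <- predict(fit, newdata = data.frame(speed = 11))

unname(p11 - p10)    # change in prediction per one-unit increase in x
unname(b["speed"])   # the slope: same number

# a prediction computed by hand: intercept + slope * x
by_hand <- unname(b["(Intercept)"] + b["speed"] * 10)
by_hand
```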

residual plot

plot(newbeer$Alcohol, resid(linearbeer))
abline(h=0)
lines(smooth.spline(newbeer$Alcohol, resid(linearbeer), df=3))

plot of chunk unnamed-chunk-13

Analysis of residual plot

  • The spline is pretty horizontal
  • The points seem pretty random
  • a linear model seems good
  • for what it's worth - I did this without removing the outlier and a linear model did not seem like a good fit

Caught up...next steps

Sample

  • This data set is a sample
  • Is the true linear regression, that is for all beer, the same as this one?
  • We can answer this two ways
    • find a confidence interval for the slope
    • find confidence intervals for the predictions (one for each alcohol value)

Confidence interval for slope

confint(linearbeer)
                2.5 %    97.5 %
(Intercept) -8.367596 103.19397
Alcohol     10.948288  33.02796
  • This comes from a \( t \) distribution and is pretty messy
  • we are 95% confident that the true slope is between 10.95 and 33.03
  • Since 0 is not in the interval we have evidence that there is, in fact, a relationship between alcohol level and calories
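For the curious, the "messy" \( t \) computation is short in code: the interval is the estimate plus or minus a \( t \) critical value times the standard error. This is a sketch on R's built-in cars data (an assumption for illustration, not the beer data) that rebuilds the slope interval by hand and compares it with confint().

```r
# Illustration only: built-in cars data, not the beer data
fit <- lm(dist ~ speed, data = cars)

# slope estimate and its standard error from the coefficient table
est <- coef(summary(fit))["speed", "Estimate"]
se  <- coef(summary(fit))["speed", "Std. Error"]

# t critical value for a 95% interval, using the residual degrees of freedom
tcrit <- qt(0.975, df = df.residual(fit))

by_hand <- c(est - tcrit * se, est + tcrit * se)
by_hand
confint(fit)["speed", ]   # should match the hand computation
```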

BB example

Confidence interval for slope

confint(bbfit)
                  2.5 %      97.5 %
(Intercept) 41.06394948 44.58278227
OffReb       0.03241561  0.08282798

Graphically

library(ggformula)
gf_point(Calories~Alcohol, data=newbeer) %>% 
  gf_lm(interval="prediction", fill="skyblue") %>% 
  gf_lm(interval="confidence")

plot of chunk unnamed-chunk-16

  • The grey strip is a 95% confidence band for where the true regression line lies
    • it is related to the confidence interval we found for the slope (it also reflects the uncertainty in the intercept)
  • The blue strip is a 95% prediction band: where a new individual observation is expected to fall
    • think of each fixed \( x \) value as having its own interval; the prediction interval is always wider than the confidence interval at the same \( x \)

BB

library(ggformula)
gf_point(PercFG~OffReb, data=NBA1617) %>% 
  gf_lm(interval="prediction", fill="skyblue") %>% 
  gf_lm(interval="confidence")

plot of chunk unnamed-chunk-17

Prediction

  • first, I found an easier way to “predict”:
  • for each of the following, we fix Alcohol=5
library(mosaic)  # provides makeFun() for lm objects
Calories.dist <- makeFun(linearbeer)
Calories.dist(Alcohol=5)
       1 
157.3538 

confidence interval for the mean response

Calories.dist(Alcohol=5, interval="confidence")
       fit      lwr     upr
1 157.3538 153.3796 161.328

prediction interval for a single new observation

Calories.dist(Alcohol=5, interval="prediction")
       fit      lwr      upr
1 157.3538 135.3588 179.3488
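The same kind of output is available from base R's predict() without makeFun(). For the model in these slides the analogous call would be predict(linearbeer, newdata = data.frame(Alcohol = 5), interval = "prediction"). The sketch below uses R's built-in cars data instead (an assumption for illustration) so that it runs on its own, and it compares the two interval types at one fixed x value.

```r
# Illustration only: built-in cars data, not the beer data
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = 15)     # the fixed x value we predict at

# interval for the mean response at speed = 15
ci_mean <- predict(fit, newdata = new, interval = "confidence", level = 0.95)

# interval for a single new observation at speed = 15
pi_new <- predict(fit, newdata = new, interval = "prediction", level = 0.95)

ci_mean
pi_new

# same centre (fit), but the prediction interval is wider
width_ci <- ci_mean[, "upr"] - ci_mean[, "lwr"]
width_pi <- pi_new[, "upr"] - pi_new[, "lwr"]
c(confidence = width_ci, prediction = width_pi)
```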

all in one picture

gf_point(Calories~Alcohol, data=newbeer) %>% 
  gf_lm(interval="prediction", fill="skyblue") %>% 
  gf_lm(interval="confidence")

plot of chunk unnamed-chunk-21

Calories.dist(Alcohol=5)
       1 
157.3538 
Calories.dist(Alcohol=5, interval="confidence")
       fit      lwr     upr
1 157.3538 153.3796 161.328
Calories.dist(Alcohol=5, interval="prediction")
       fit      lwr      upr
1 157.3538 135.3588 179.3488

BB

  • predict the FG percentage for a player with 200 offensive rebounds
  • find a 95% prediction interval for the FG percentage of a player with 200 offensive rebounds
FG.dist <- makeFun(bbfit)
FG.dist(OffReb=200)
       1 
54.34773 
FG.dist(OffReb=200, interval="prediction")
       fit      lwr     upr
1 54.34773 42.37505 66.3204