Example #1: Hikers

The following data represents the body weight (lbs) and backpack weight (lbs) for a group of hikers:

## BACKPACKING
body<-c(120, 187, 109, 103, 131, 165, 159, 116)
backpack<-c(26, 30, 26, 24, 29, 35, 31, 28)

hikers<-data.frame(body, backpack)

Create a scatterplot

library(tidyverse)

ggplot(hikers, aes(body, backpack))+
  geom_point(size=3)+
  theme_bw()+
  xlab("Body Weight (lbs)")+
  ylab("Backpack Weight (lbs)")+
  ggtitle("Scatterplot of Body Weight (lbs) vs Backpack Weight (lbs)")

Describe the scatterplot

Use the following four characteristics to describe the scatterplot:

  • Direction: positive or negative
  • Form: linear or non-linear
  • Strength: strong, moderate, weak, or none
  • Outliers

Example #2: Rollercoasters

People who responded to a July 2004 Discovery Channel poll named the 10 best roller coasters in the United States. The following data show the length of the initial drop (in feet) and the duration of the ride (in seconds) for each coaster.

## ROLLER COASTER
# the data
drop<-c(105, 300, 255, 215, 195, 
        141, 214, 95, 108, 86)

duration<-c(135, 105, 180, 240, 120, 
            65, 140, 90, 160, 90)


# make a dataframe
rollercoaster<-data.frame(drop, duration)

Create a scatterplot

# scatterplot
ggplot(rollercoaster, aes(drop, duration))+
  geom_point(size=3)+
  theme_bw()+
  xlab("Drop (ft)")+
  ylab("Duration (sec)")+
  ggtitle("Scatterplot of Drop (ft) vs Duration (sec)")

Correlation

Correlation is a metric for the strength of the linear relationship between two numeric variables.

Correlation has the following properties:

  • Notation: r
  • Is between -1 (perfect negative) and 1 (perfect positive)
  • Is a symmetric function (i.e., it doesn’t matter in which order the variables enter the equation)

# correlation (is a symmetric function)
cor(drop, duration) #0.3523023
## [1] 0.3523023
cor(duration, drop) #0.3523023
## [1] 0.3523023
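
To make these properties concrete, here is a minimal sketch that computes r by hand as the sum of the products of the z-scores divided by n - 1, then compares it with `cor()`. The names `z_drop`, `z_duration`, and `r_by_hand` are introduced only for this illustration. Because the products are the same in either order, the symmetry follows directly.

```r
# roller coaster data from Example #2
drop <- c(105, 300, 255, 215, 195, 141, 214, 95, 108, 86)
duration <- c(135, 105, 180, 240, 120, 65, 140, 90, 160, 90)

n <- length(drop)

# standardize each variable (sd() uses the n - 1 denominator)
z_drop <- (drop - mean(drop)) / sd(drop)
z_duration <- (duration - mean(duration)) / sd(duration)

# correlation as the average product of z-scores (n - 1 denominator)
r_by_hand <- sum(z_drop * z_duration) / (n - 1)

# should agree with the built-in function
all.equal(r_by_hand, cor(drop, duration))
```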

Summary Statistics

# summary statistics
sd(drop) #74.74579
## [1] 74.74579
mean(drop) #171.4
## [1] 171.4
sd(duration) #51.32955
## [1] 51.32955
mean(duration) #132.5
## [1] 132.5

Simple Linear Regression

We want to describe the relationship between two numeric variables with a mathematical model. Assuming that the relationship between the variables is linear, we use the method of least squares to find the line of best fit.

We call this model simple linear regression:

\[\hat{y}=\hat{\beta}_0+\hat{\beta}_1x\]

Notation:

  • \(\hat{y}\) : the predicted value
  • \(\hat{\beta}_0\) : the y-intercept coefficient estimate
  • \(\hat{\beta}_1\) : the slope coefficient estimate
  • \(x\) : the explanatory variable

Some books use slightly different notation:

  • \(b_0 = \hat{\beta}_0\) : the y-intercept coefficient estimate
  • \(b_1 = \hat{\beta}_1\) : the slope coefficient estimate
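
As a rough illustration of what "least squares" means, the sketch below compares the sum of squared residuals for the fitted line with the same quantity for two slightly different lines. The helper `sse()` and the two shifted lines are hypothetical choices made only for this comparison; any other line would also give a larger value than the fitted one.

```r
# roller coaster data from Example #2
drop <- c(105, 300, 255, 215, 195, 141, 214, 95, 108, 86)
duration <- c(135, 105, 180, 240, 120, 65, 140, 90, 160, 90)

# sum of squared residuals for the line y = b0 + b1 * x (hypothetical helper)
sse <- function(b0, b1) sum((duration - (b0 + b1 * drop))^2)

# least-squares estimates: fit[1] is the intercept, fit[2] is the slope
fit <- unname(coef(lm(duration ~ drop)))

sse_fit <- sse(fit[1], fit[2])             # the least-squares line
sse_steeper <- sse(fit[1], fit[2] + 0.05)  # same intercept, steeper slope
sse_higher <- sse(fit[1] + 5, fit[2])      # same slope, higher intercept

# the fitted line has the smallest sum of squared residuals
c(fit = sse_fit, steeper = sse_steeper, higher = sse_higher)
```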
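
Calculating the Coefficient Estimates by Hand

\[\text{slope}=\hat{\beta}_1 = r \times \frac{s_y}{s_x}\]

\[\text{intercept} = \hat{\beta}_0 = \bar{y}-\hat{\beta}_1\bar{x}\]

Here \(r\) is the correlation, \(s_x\) and \(s_y\) are the standard deviations of the explanatory and response variables, and \(\bar{x}\) and \(\bar{y}\) are their means.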
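
The two formulas can be checked against the summary statistics computed earlier. This is a minimal sketch using the roller coaster data, with duration as the response (y) and drop as the explanatory variable (x); the names `b1` and `b0` are chosen here to match the alternative notation.

```r
# roller coaster data from Example #2
drop <- c(105, 300, 255, 215, 195, 141, 214, 95, 108, 86)
duration <- c(135, 105, 180, 240, 120, 65, 140, 90, 160, 90)

# slope = r * s_y / s_x
r <- cor(drop, duration)
b1 <- r * sd(duration) / sd(drop)

# intercept = y-bar - slope * x-bar
b0 <- mean(duration) - b1 * mean(drop)

c(intercept = b0, slope = b1)
```

With the statistics above this works out to a slope of about 0.2419 and an intercept of about 91.03.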

Finding the SLR Model in R

# simple linear model
mod<-lm(duration~drop)
summary(mod)
## 
## Call:
## lm(formula = duration ~ drop)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -60.15 -23.47 -10.51  25.10  96.95 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  91.0326    42.1480   2.160   0.0628 .
## drop          0.2419     0.2272   1.065   0.3181  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.95 on 8 degrees of freedom
## Multiple R-squared:  0.1241, Adjusted R-squared:  0.01463 
## F-statistic: 1.134 on 1 and 8 DF,  p-value: 0.3181
# mod coefficients
coefficients(mod)
## (Intercept)        drop 
##  91.0325879   0.2419336
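
In simple linear regression, the Multiple R-squared in the summary output is the square of the correlation r. The short sketch below checks this for the roller coaster model; it refits the model so that it runs on its own, and `r_squared` is a name introduced only here.

```r
# roller coaster data from Example #2
drop <- c(105, 300, 255, 215, 195, 141, 214, 95, 108, 86)
duration <- c(135, 105, 180, 240, 120, 65, 140, 90, 160, 90)

mod <- lm(duration ~ drop)

# R-squared as stored in the model summary
r_squared <- summary(mod)$r.squared

# compare with the squared correlation
all.equal(r_squared, cor(drop, duration)^2)
```

Here r is about 0.352, so R-squared is about 0.124: the line explains roughly 12% of the variation in duration.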

Adding the Fitted Line

# scatterplot with lm
ggplot(data=rollercoaster, aes(x=drop, y=duration))+
  geom_point(size=3)+
  theme_bw()+
  xlab("Drop (ft)")+
  ylab("Duration (sec)")+
  ggtitle("Scatterplot of Drop (ft) vs Duration (sec)")+
  geom_abline(slope=mod$coefficients[2], 
              intercept = mod$coefficients[1],
              color="blue", lty=2, lwd=1)
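
The fitted line can also be used to predict the duration for a given drop. The sketch below uses `predict()` with a made-up coaster whose drop is 200 ft, a value chosen for illustration that falls inside the observed range of drops (86 to 300 ft). The names `new_ride` and `predicted_duration` are hypothetical.

```r
# roller coaster data from Example #2
drop <- c(105, 300, 255, 215, 195, 141, 214, 95, 108, 86)
duration <- c(135, 105, 180, 240, 120, 65, 140, 90, 160, 90)
rollercoaster <- data.frame(drop, duration)

# fit the model with the data argument so predict() can match column names
mod <- lm(duration ~ drop, data = rollercoaster)

# hypothetical new coaster with a 200 ft drop
new_ride <- data.frame(drop = 200)

# same as intercept + slope * 200
predicted_duration <- predict(mod, newdata = new_ride)
predicted_duration
```

The prediction is about 139 seconds, which matches plugging 200 into the fitted equation by hand.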

Example #3: Hadley’s Height

The following data represent Hadley’s age (in days) and height (in inches) during her infant wellness check-ups:

## BEWARE OF EXTRAPOLATION
growth<-data.frame(days=c(0, 10, 62, 129),
                   height=c(19.75, 20.0, 23.5, 25.6))

ggplot(growth, aes(days, height))+
  geom_point(size=3)+
  geom_smooth(method="lm", se=FALSE)+
  theme_minimal()
## `geom_smooth()` using formula 'y ~ x'

Fit a linear model

gMod<-lm(height~days, growth)
summary(gMod)
## 
## Call:
## lm(formula = height ~ days, data = growth)
## 
## Residuals:
##        1        2        3        4 
## -0.09162 -0.31344  0.73312 -0.32805 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 19.841624   0.429515   46.20 0.000468 ***
## days         0.047182   0.005987    7.88 0.015725 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6131 on 2 degrees of freedom
## Multiple R-squared:  0.9688, Adjusted R-squared:  0.9532 
## F-statistic:  62.1 on 1 and 2 DF,  p-value: 0.01572

Prediction

Predict her height on her 10th birthday:

# Predict her height on her 10th birthday
19.841624+0.047182*(365*10)
## [1] 192.0559
# In feet
192/12 #16 ft
## [1] 16
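
The same prediction can be made with `predict()`, and it is worth comparing the new age with the range of ages used to fit the model. This is a minimal sketch; `new_age`, `predicted_height`, and `is_extrapolation` are names introduced only for this example, and the model is refit so the block runs on its own.

```r
# Hadley's growth data from Example #3
growth <- data.frame(days = c(0, 10, 62, 129),
                     height = c(19.75, 20.0, 23.5, 25.6))
gMod <- lm(height ~ days, data = growth)

# ages actually observed when fitting the model
range(growth$days)

# age at the 10th birthday (ignoring leap years, as above)
new_age <- data.frame(days = 365 * 10)
predicted_height <- predict(gMod, newdata = new_age)

# TRUE when the new age lies outside the observed ages
is_extrapolation <- new_age$days < min(growth$days) |
  new_age$days > max(growth$days)

predicted_height / 12  # in feet
is_extrapolation
```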

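Beware of Extrapolation

The model was fit to ages between 0 and 129 days, so day 3650 lies far outside the observed data. A 16-foot ten-year-old is not plausible: the linear trend from infancy cannot be trusted that far out. Adding the predicted point to the original data shows how far the prediction sits from the observations: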

pGrowth<-data.frame(days=c(0, 10, 62, 129, 3650),
                    height=c(19.75, 20.0, 23.5, 25.6, 192.0559))

ggplot(pGrowth, aes(days, height))+
  geom_point(size=3)+
  geom_smooth(method="lm", se=FALSE)+
  theme_minimal()
## `geom_smooth()` using formula 'y ~ x'