MATH138: Scatterplots, Correlation, and Simple Linear Regression

Example #1: Hikers

The following data represent the body weight (lbs) and backpack weight (lbs) for a group of hikers:

## BACKPACKING
body<-c(120, 187, 109, 103, 131, 165, 159, 116)
backpack<-c(26, 30, 26, 24, 29, 35, 31, 28)

hikers<-data.frame(body, backpack)

Create a scatterplot

library(tidyverse)

ggplot(hikers, aes(body, backpack))+
  geom_point(size=3)+
  theme_bw()+
  xlab("Body Weight (lbs)")+
  ylab("Backpack Weight (lbs)")+
  ggtitle("Scatterplot of Body Weight (lbs) vs Backpack Weight (lbs)")

Describe the scatterplot

Use the following four characteristics to describe the scatterplot:

Direction: positive or negative
Form: linear or non-linear
Strength: strong, moderate, weak, or none
Outliers

Example #2: Rollercoasters

People who responded to a July 2004 Discovery Channel poll named the 10 best roller coasters in the United States. The following data show the length of the initial drop (in feet) and the duration of the ride (in seconds).

## ROLLER COASTER
# the data
drop<-c(105, 300, 255, 215, 195, 
        141, 214, 95, 108, 86)

duration<-c(135, 105, 180, 240, 120, 
            65, 140, 90, 160, 90)


# make a dataframe
rollercoaster<-data.frame(drop, duration)

Create a scatterplot

# scatterplot
ggplot(rollercoaster, aes(drop, duration))+
  geom_point(size=3)+
  theme_bw()+
  xlab("Drop (ft)")+
  ylab("Duration (sec)")+
  ggtitle("Scatterplot of Drop (ft) vs Duration (sec)")

Correlation

Correlation measures the strength and direction of the linear relationship between two numeric variables.

Correlation has the following properties:

Notation: r
Is between -1 (perfect negative) and 1 (perfect positive)
Is a symmetric function (i.e., it doesn’t matter in which order the variables enter the equation)

# correlation (is a symmetric function)
cor(drop, duration) #0.3523023

## [1] 0.3523023

cor(duration, drop) #0.3523023

## [1] 0.3523023
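
As a minimal sketch of where r comes from: standardize each variable (subtract its mean and divide by its standard deviation), multiply the paired z-scores, add them up, and divide by n - 1. The names z_drop, z_duration, and r_by_hand are illustrative and do not appear in the original notes.

```r
# Roller coaster data from the example above
drop <- c(105, 300, 255, 215, 195, 141, 214, 95, 108, 86)
duration <- c(135, 105, 180, 240, 120, 65, 140, 90, 160, 90)

n <- length(drop)

# Standardize each variable: (value - mean) / standard deviation
z_drop <- (drop - mean(drop)) / sd(drop)
z_duration <- (duration - mean(duration)) / sd(duration)

# Correlation is the sum of the z-score products divided by n - 1
r_by_hand <- sum(z_drop * z_duration) / (n - 1)
r_by_hand

# Should agree with the built-in function
all.equal(r_by_hand, cor(drop, duration))
```

Because the formula multiplies the two sets of z-scores, swapping the variables cannot change the result, which is why correlation is symmetric.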

Summary Statistics

# summary statistics
sd(drop) #74.74579

## [1] 74.74579

mean(drop) #171.4

## [1] 171.4

sd(duration) #51.32955

## [1] 51.32955

mean(duration) #132.5

## [1] 132.5

Simple Linear Regression

We want to describe the relationship between two numeric variables with a mathematical model. Assuming that the relationship between the variables is linear, we use the method of least squares to find the line of best fit.

We call this model simple linear regression:

\[\hat{y}=\hat{\beta}_0+\hat{\beta}_1x\]

Notation:

\(\hat{y}\) : the predicted value
\(\hat{\beta}_0\) : the y-intercept coefficient estimate
\(\hat{\beta}_1\) : the slope coefficient estimate
\(x\) : the explanatory variable

Some books use slightly different notation:

\(b_0 = \hat{\beta}_0\) : the y-intercept coefficient estimate
\(b_1 = \hat{\beta}_1\) : the slope coefficient estimate

Calculating the Coefficient Estimates by Hand

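The estimates can be computed from the correlation \(r\), the standard deviations \(s_x\) and \(s_y\), and the means \(\bar{x}\) and \(\bar{y}\):

\[\text{slope}=\hat{\beta}_1 = r \times \frac{s_y}{s_x}\]

\[\text{intercept} = \hat{\beta}_0 = \bar{y}-\hat{\beta}_1\bar{x}\]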
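
The two formulas can be applied directly to the roller coaster data, with drop as x and duration as y. This is a small sketch; the names r, b1, and b0 are chosen here for illustration.

```r
# Roller coaster data from Example #2
drop <- c(105, 300, 255, 215, 195, 141, 214, 95, 108, 86)
duration <- c(135, 105, 180, 240, 120, 65, 140, 90, 160, 90)

# Ingredients: correlation, standard deviations, and means
r <- cor(drop, duration)

# slope = r * (s_y / s_x)
b1 <- r * sd(duration) / sd(drop)

# intercept = y-bar - slope * x-bar
b0 <- mean(duration) - b1 * mean(drop)

c(intercept = b0, slope = b1)
```

These values should agree with the coefficient estimates that lm() reports for the same data (about 91.03 for the intercept and 0.2419 for the slope).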

Finding the SLR Model in R

# simple linear model
mod<-lm(duration~drop)
summary(mod)

## 
## Call:
## lm(formula = duration ~ drop)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -60.15 -23.47 -10.51  25.10  96.95 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  91.0326    42.1480   2.160   0.0628 .
## drop          0.2419     0.2272   1.065   0.3181  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.95 on 8 degrees of freedom
## Multiple R-squared:  0.1241, Adjusted R-squared:  0.01463 
## F-statistic: 1.134 on 1 and 8 DF,  p-value: 0.3181

# mod coefficients
coefficients(mod)

## (Intercept)        drop 
##  91.0325879   0.2419336
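
Two things in the output can be checked with a short sketch. In simple linear regression, Multiple R-squared equals the square of the correlation r. The fitted model can also predict the duration for a new ride with predict(). The 200 ft drop below is a hypothetical value chosen for illustration; it lies inside the observed range of drops (86 to 300 ft).

```r
# Roller coaster data and model from the example above
drop <- c(105, 300, 255, 215, 195, 141, 214, 95, 108, 86)
duration <- c(135, 105, 180, 240, 120, 65, 140, 90, 160, 90)
rollercoaster <- data.frame(drop, duration)
mod <- lm(duration ~ drop, data = rollercoaster)

# R-squared from the model summary equals the squared correlation
r_squared <- summary(mod)$r.squared
all.equal(r_squared, cor(drop, duration)^2)

# Predicted duration (sec) for a hypothetical ride with a 200 ft drop
new_ride <- data.frame(drop = 200)
pred <- unname(predict(mod, newdata = new_ride))
pred
```

The prediction is just the fitted line evaluated at x = 200: intercept plus slope times 200, or roughly 139 seconds.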

Adding the Fitted Line

# scatterplot with lm
ggplot(data=rollercoaster, aes(x=drop, y=duration))+
  geom_point(size=3)+
  theme_bw()+
  xlab("Drop (ft)")+
  ylab("Duration (sec)")+
  ggtitle("Scatterplot of Drop (ft) vs Duration (sec)")+
  geom_abline(slope=mod$coefficients[2], 
              intercept = mod$coefficients[1],
              color="blue", lty=2, lwd=1)

Example #3: Hadley’s Height

The following data represent Hadley’s age (in days) and height (in inches) during her infant wellness check-ups:

## BEWARE OF EXTRAPOLATION
growth<-data.frame(days=c(0, 10, 62, 129),
                   height=c(19.75, 20.0, 23.5, 25.6))

ggplot(growth, aes(days, height))+
  geom_point(size=3)+
  geom_smooth(method="lm", se=FALSE)+
  theme_minimal()

## `geom_smooth()` using formula 'y ~ x'

Fit a linear model

gMod<-lm(height~days, growth)
summary(gMod)

## 
## Call:
## lm(formula = height ~ days, data = growth)
## 
## Residuals:
##        1        2        3        4 
## -0.09162 -0.31344  0.73312 -0.32805 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 19.841624   0.429515   46.20 0.000468 ***
## days         0.047182   0.005987    7.88 0.015725 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6131 on 2 degrees of freedom
## Multiple R-squared:  0.9688, Adjusted R-squared:  0.9532 
## F-statistic:  62.1 on 1 and 2 DF,  p-value: 0.01572

Prediction

Predict her height on her 10th birthday:

# Predict her height on her 10th birthday
19.841624+0.047182*(365*10)

## [1] 192.0559

# In feet
192/12 #16 ft

## [1] 16
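
As an alternative sketch, the same prediction can be made with predict() rather than typing rounded coefficients by hand. The names tenth_birthday and height_in are illustrative. The result may differ from the hand calculation in the later decimal places because predict() uses the unrounded coefficients.

```r
# Growth data and model from the example above
growth <- data.frame(days = c(0, 10, 62, 129),
                     height = c(19.75, 20.0, 23.5, 25.6))
gMod <- lm(height ~ days, data = growth)

# New data must use the same column name as the explanatory variable
tenth_birthday <- data.frame(days = 365 * 10)

# Predicted height in inches, then converted to feet
height_in <- unname(predict(gMod, newdata = tenth_birthday))
height_in
height_in / 12
```

A predicted height of about 16 feet is clearly unreasonable, which sets up the warning in the next section.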

Beware of Extrapolation

pGrowth<-data.frame(days=c(0, 10, 62, 129, 3650),
                    height=c(19.75, 20.0, 23.5, 25.6, 192.0559))

ggplot(pGrowth, aes(days, height))+
  geom_point(size=3)+
  geom_smooth(method="lm", se=FALSE)+
  theme_minimal()

## `geom_smooth()` using formula 'y ~ x'
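
The model was fit using ages from 0 to 129 days, so a prediction at 3650 days assumes the same linear growth continues far beyond anything observed. One simple safeguard, sketched below, is to check whether a new x value falls inside the range of the data before trusting the prediction. The helper is_extrapolation() is hypothetical and written for this illustration; it is not part of base R or the original notes.

```r
# Growth data from the example above
growth <- data.frame(days = c(0, 10, 62, 129),
                     height = c(19.75, 20.0, 23.5, 25.6))

# Hypothetical helper: TRUE when new_x lies outside the range of the
# x values that were used to fit the model
is_extrapolation <- function(new_x, observed_x) {
  new_x < min(observed_x) | new_x > max(observed_x)
}

is_extrapolation(100, growth$days)       # within 0 to 129 days
is_extrapolation(365 * 10, growth$days)  # far beyond the observed ages
```

A value inside the range is not guaranteed to give a good prediction, but a value far outside it, like the 10th birthday, should always be treated with suspicion.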

Heather Kitada Smalley
