Learning Objectives:

Students will learn to create scatterplots and describe the relationship between two numeric variables.

Example: Take A Hike!

STEP 1: Load the data:

body_wgt<-c(120, 187, 109, 103, 131, 165, 158, 116)
backpack_wgt<-c(26, 30, 26, 24, 29, 35, 31, 28)

backpack_df<-data.frame(body_wgt, backpack_wgt)

Which variable should be the response and which should be the explanatory?

## YOUR ANSWER HERE ##

STEP 2: Scatterplot

library(tidyverse)

ggplot(backpack_df, aes(x=body_wgt, y=backpack_wgt))+
  geom_point()+
  xlab("Body Weight (lb)")+
  ylab("Backpack Weight (lb)")+
  ggtitle("Scatterplot of Backpack Weight vs Body Weight")

STEP 3: Describe

When looking at a scatterplot, you want to be able to describe the overall pattern and look for striking departures from that pattern.

You can describe the overall pattern of a scatterplot by the:

  • direction – positive or negative
  • form – linear or non-linear
  • strength – strong (points follow the pattern closely) or weak (points widely scattered around it)
  • outliers – individual values that fall outside the overall pattern of the relationship

How would you describe the above scatterplot?

### YOUR ANSWER HERE ###

STEP 4: Correlation Coefficient

cor(body_wgt, backpack_wgt)
## [1] 0.7946927
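
To see where this number comes from, here is a minimal sketch that rebuilds the correlation from standardized values (z-scores). The names z_body, z_pack, and r_by_hand are introduced here for illustration only; the data are the same eight hikers from Step 1.

```r
# Same data as Step 1
body_wgt <- c(120, 187, 109, 103, 131, 165, 158, 116)
backpack_wgt <- c(26, 30, 26, 24, 29, 35, 31, 28)

n <- length(body_wgt)

# Standardize each variable: (value - mean) / standard deviation
z_body <- (body_wgt - mean(body_wgt)) / sd(body_wgt)
z_pack <- (backpack_wgt - mean(backpack_wgt)) / sd(backpack_wgt)

# r is the sum of the products of the z-scores, divided by n - 1
r_by_hand <- sum(z_body * z_pack) / (n - 1)

r_by_hand
cor(body_wgt, backpack_wgt)  # should agree with r_by_hand
```

Because r is built from z-scores, it has no units and does not change if the weights are converted from pounds to kilograms.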

What happens when you switch the order of the variables?

## YOU TRY IT

STEP 5: Activity

First: Load in the data

data("anscombe")
str(anscombe)
## 'data.frame':    11 obs. of  8 variables:
##  $ x1: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x2: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x3: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x4: num  8 8 8 8 8 8 8 19 8 8 ...
##  $ y1: num  8.04 6.95 7.58 8.81 8.33 ...
##  $ y2: num  9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
##  $ y3: num  7.46 6.77 12.74 7.11 7.81 ...
##  $ y4: num  6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...

Directions:

If your birthday is:

  • January - March: Use variables x1 and y1
  • April - June: Use variables x2 and y2
  • July - September: Use variables x3 and y3
  • October - December: Use variables x4 and y4

Complete the following tasks:

  • Create a scatterplot and describe it
  • Calculate the mean and standard deviation for both your x and y variables
  • Calculate the correlation coefficient
  • Compare the information you have obtained with your neighbor

cor(anscombe$x2, anscombe$y2)
## [1] 0.8162365
mean(anscombe$x2)
## [1] 9
sd(anscombe$x2)
## [1] 3.316625
mean(anscombe$y2)
## [1] 7.500909
sd(anscombe$y2)
## [1] 2.031657
## SPACE FOR YOUR WORK ##
ggplot(anscombe, aes(x2, y2))+
  geom_point()

lm(y2~x2, data=anscombe)
## 
## Call:
## lm(formula = y2 ~ x2, data = anscombe)
## 
## Coefficients:
## (Intercept)           x2  
##       3.001        0.500
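
For the comparison with your neighbor, it may help to see all four pairs side by side. The sketch below is one possible way to do that in base R; the helper name summarize_pair and the table name anscombe_summary are made up for this example.

```r
data("anscombe")

# Hypothetical helper: summary statistics for one x/y pair
# (for example, i = 2 uses the columns x2 and y2)
summarize_pair <- function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(
    pair   = i,
    mean_x = mean(x),
    sd_x   = sd(x),
    mean_y = mean(y),
    sd_y   = sd(y),
    r      = cor(x, y)
  )
}

# Stack the four one-row summaries into a single table
anscombe_summary <- do.call(rbind, lapply(1:4, summarize_pair))
round(anscombe_summary, 3)
```

If the rows look alike, compare the four scatterplots before deciding that the data sets tell the same story.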

STEP 6: Line of Best Fit

ggplot(backpack_df, aes(x=body_wgt, y=backpack_wgt))+
  geom_point()+
  geom_smooth(method="lm", se=FALSE, color="red", lty=2)+
  xlab("Body Weight (lb)")+
  ylab("Backpack Weight (lb)")+
  ggtitle("Scatterplot of Backpack Weight vs Body Weight")
## `geom_smooth()` using formula 'y ~ x'

What is this line?

# Formula syntax: Y ~ X (response ~ explanatory)
mod<-lm(backpack_wgt~body_wgt, data=backpack_df)
summary(mod)
## 
## Call:
## lm(formula = backpack_wgt ~ body_wgt, data = backpack_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2444 -1.2750  0.1133  0.9308  3.7532 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 16.26493    3.93692   4.131  0.00614 **
## body_wgt     0.09080    0.02831   3.207  0.01844 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.27 on 6 degrees of freedom
## Multiple R-squared:  0.6315, Adjusted R-squared:  0.5701 
## F-statistic: 10.28 on 1 and 6 DF,  p-value: 0.01844
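
The Estimate column can be reproduced from quantities already computed in this lesson. The sketch below assumes the usual least-squares formulas (the slope is r times sd(y) / sd(x), and the line passes through the point of means). The names slope and intercept, and the 150 lb hiker, are illustrative and not part of the original data.

```r
# Same data and model as above
body_wgt <- c(120, 187, 109, 103, 131, 165, 158, 116)
backpack_wgt <- c(26, 30, 26, 24, 29, 35, 31, 28)
backpack_df <- data.frame(body_wgt, backpack_wgt)
mod <- lm(backpack_wgt ~ body_wgt, data = backpack_df)

# Slope: correlation rescaled by the ratio of standard deviations
r <- cor(body_wgt, backpack_wgt)
slope <- r * sd(backpack_wgt) / sd(body_wgt)

# Intercept: chosen so the line passes through (mean of x, mean of y)
intercept <- mean(backpack_wgt) - slope * mean(body_wgt)

c(intercept = intercept, slope = slope)
coef(mod)  # should agree with the hand calculation

# For a one-predictor model, Multiple R-squared is the correlation squared
r^2

# Predicted backpack weight for a hypothetical 150 lb hiker
predict(mod, newdata = data.frame(body_wgt = 150))
```

The slope reads as the predicted change in backpack weight, in pounds, for each additional pound of body weight.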

Diagnostics

Residual Plot:

res_df<-backpack_df%>%
  cbind(res=mod$residuals)

ggplot(res_df, aes(body_wgt, res))+
  geom_point()+
  geom_hline(yintercept=0, color="red", lwd=1, lty=2)+
  ggtitle("Residual Plot")
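
Each point in the residual plot is an observed backpack weight minus the weight the line predicts for that hiker. A minimal sketch of that arithmetic, with fitted_wgt and res_by_hand as illustrative names:

```r
# Same data and model as above
body_wgt <- c(120, 187, 109, 103, 131, 165, 158, 116)
backpack_wgt <- c(26, 30, 26, 24, 29, 35, 31, 28)
backpack_df <- data.frame(body_wgt, backpack_wgt)
mod <- lm(backpack_wgt ~ body_wgt, data = backpack_df)

# Predicted backpack weight for each hiker, read off the fitted line
fitted_wgt <- fitted(mod)

# Residual = observed - predicted
res_by_hand <- backpack_wgt - fitted_wgt

# Should agree with the residuals stored in the model object
all.equal(unname(res_by_hand), unname(mod$residuals))

# Least-squares residuals sum to zero, up to rounding error
sum(res_by_hand)
```

Points above the dashed line are hikers carrying more than the line predicts; points below it are carrying less.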

QQ Plot:

qqnorm(mod$residuals)
qqline(mod$residuals)
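
To make the QQ plot less of a black box, the sketch below rebuilds its coordinates by hand, assuming base R's default plotting positions from ppoints(). The names sample_q, theoretical_q, and qq_points are illustrative.

```r
# Same data and model as above
body_wgt <- c(120, 187, 109, 103, 131, 165, 158, 116)
backpack_wgt <- c(26, 30, 26, 24, 29, 35, 31, 28)
backpack_df <- data.frame(body_wgt, backpack_wgt)
mod <- lm(backpack_wgt ~ body_wgt, data = backpack_df)

res <- unname(mod$residuals)
n <- length(res)

# Sample quantiles: the residuals sorted from smallest to largest
sample_q <- sort(res)

# Theoretical quantiles: where n ordered values from a standard normal
# distribution would be expected to fall
theoretical_q <- qnorm(ppoints(n))

# One row per point on the QQ plot
qq_points <- data.frame(theoretical_q, sample_q)
qq_points

# Check against the coordinates qqnorm() computes (without drawing the plot)
qq <- qqnorm(res, plot.it = FALSE)
all.equal(sort(qq$x), theoretical_q)
```

Plotting sample_q against theoretical_q should give the same points as qqnorm(). If they fall roughly along a straight line, the residuals are approximately normal.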