The following data represents the body weight (lbs) and backpack weight (lbs) for a group of hikers:
## BACKPACKING
body<-c(120, 187, 109, 103, 131, 165, 159, 116)
backpack<-c(26, 30, 26, 24, 29, 35, 31, 28)
hikers<-data.frame(body, backpack)
library(tidyverse)
ggplot(hikers, aes(body, backpack))+
geom_point(size=3)+
theme_bw()+
xlab("Body Weight (lbs)")+
ylab("Backpack Weight (lbs)")+
ggtitle("Scatterplot of Body Weight (lbs) vs Backpack Weight (lbs)")
Use the following four characteristics to describe the scatterplot:
People who responded to a July 2004 Discovery Channel poll named the best 10 roller coasters in the United States. The following data shows the length of the initial drop (in feet) and the duration of the ride (in seconds).
## ROLLER COASTER
# the data
drop<-c(105, 300, 255, 215, 195,
141, 214, 95, 108, 86)
duration<-c(135, 105, 180, 240, 120,
65, 140, 90, 160, 90)
# make a dataframe
rollercoaster<-data.frame(drop, duration)
# scatterplot
ggplot(rollercoaster, aes(drop, duration))+
geom_point(size=3)+
theme_bw()+
xlab("Drop (ft)")+
ylab("Duration (sec)")+
ggtitle("Scatterplot of Drop (ft) vs Duration (sec)")
Correlation is a metric for the strength of the linear relationship between two numeric variables.
Correlation has the following properties:
* Notation: r
* Is between -1 (perfect negative) and 1 (perfect positive)
* Is a symmetric function (i.e., it doesn't matter what order the variables enter the equation)
# correlation (is a symmetric function)
cor(drop, duration) #0.3523023
## [1] 0.3523023
cor(duration, drop) #0.3523023
## [1] 0.3523023
# summary statistics
sd(drop) #74.74579
## [1] 74.74579
mean(drop) #171.4
## [1] 171.4
sd(duration) #51.32955
## [1] 51.32955
mean(duration) #132.5
## [1] 132.5
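As a check on what `cor()` computes, the sketch below rebuilds r from its definition: standardize each variable, multiply the z-scores pairwise, sum, and divide by n - 1. The helper names (`z_drop`, `z_duration`, `r_manual`) are illustrative and not part of the original notes.

```r
# roller coaster data, repeated so the block runs on its own
drop <- c(105, 300, 255, 215, 195, 141, 214, 95, 108, 86)
duration <- c(135, 105, 180, 240, 120, 65, 140, 90, 160, 90)

n <- length(drop)

# z-scores: distance from the mean in units of sample standard deviation
z_drop <- (drop - mean(drop)) / sd(drop)
z_duration <- (duration - mean(duration)) / sd(duration)

# correlation = sum of products of z-scores, divided by n - 1
r_manual <- sum(z_drop * z_duration) / (n - 1)
r_manual

# should agree with the built-in function
all.equal(r_manual, cor(drop, duration))
```

Because the product `z_drop * z_duration` is the same in either order, this form also makes the symmetry property visible.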
We want to describe the relationship between two numeric variables with a mathematical model. Assuming that the relationship between the variables is linear, we use the method of least squares to find the line of best fit.
We call this model simple linear regression: \[\hat{y}=\hat{\beta}_0+\hat{\beta}_1x\]
Notation:
* \(\hat{y}\) : the predicted value
* \(\hat{\beta}_0\) : the y-intercept coefficient estimate
* \(\hat{\beta}_1\) : the slope coefficient estimate
* \(x\) : the explanatory variable
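The least squares estimates can be computed from the correlation \(r\), the sample standard deviations \(s_x\) and \(s_y\), and the sample means \(\bar{x}\) and \(\bar{y}\):
\[\text{slope}=\hat{\beta}_1 = r \times \frac{s_y}{s_x}\] \[\text{intercept} = \hat{\beta}_0 = \bar{y}-\hat{\beta}_1\bar{x}\]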
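The two formulas can be applied directly to the roller coaster summary statistics. This is a minimal sketch, with `b0` and `b1` as illustrative names for the intercept and slope estimates; the result should match what `lm()` reports.

```r
# roller coaster data, repeated so the block runs on its own
drop <- c(105, 300, 255, 215, 195, 141, 214, 95, 108, 86)
duration <- c(135, 105, 180, 240, 120, 65, 140, 90, 160, 90)

# x = drop (explanatory), y = duration (response)
r <- cor(drop, duration)

# slope = r * s_y / s_x
b1 <- r * sd(duration) / sd(drop)

# intercept = ybar - slope * xbar
b0 <- mean(duration) - b1 * mean(drop)

c(intercept = b0, slope = b1)

# compare with the least squares fit
coef(lm(duration ~ drop))
```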
# simple linear model
mod<-lm(duration~drop)
summary(mod)
##
## Call:
## lm(formula = duration ~ drop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.15 -23.47 -10.51 25.10 96.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.0326 42.1480 2.160 0.0628 .
## drop 0.2419 0.2272 1.065 0.3181
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.95 on 8 degrees of freedom
## Multiple R-squared: 0.1241, Adjusted R-squared: 0.01463
## F-statistic: 1.134 on 1 and 8 DF, p-value: 0.3181
# mod coefficients
coefficients(mod)
## (Intercept) drop
## 91.0325879 0.2419336
# scatterplot with lm
ggplot(data=rollercoaster, aes(x=drop, y=duration))+
geom_point(size=3)+
theme_bw()+
xlab("Drop (ft)")+
ylab("Duration (sec)")+
ggtitle("Scatterplot of Drop (ft) vs Duration (sec)")+
geom_abline(slope=mod$coefficients[2],
intercept = mod$coefficients[1],
color="blue", lty=2, lwd=1)
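Once the model is fit, it can be used to predict a duration for a drop that lies inside the observed range (86 to 300 ft). The sketch below uses `predict()` on a hypothetical ride with a 200 ft drop, a value chosen only for illustration, and also checks that in simple linear regression the multiple R-squared equals the squared correlation.

```r
# roller coaster data, repeated so the block runs on its own
drop <- c(105, 300, 255, 215, 195, 141, 214, 95, 108, 86)
duration <- c(135, 105, 180, 240, 120, 65, 140, 90, 160, 90)
rollercoaster <- data.frame(drop, duration)

mod <- lm(duration ~ drop, data = rollercoaster)

# hypothetical new ride with a 200 ft drop (inside the observed range)
new_ride <- data.frame(drop = 200)
pred <- predict(mod, newdata = new_ride)
pred  # roughly 139 seconds: intercept + slope * 200

# R-squared is the square of the correlation when there is one predictor
r_squared <- summary(mod)$r.squared
all.equal(r_squared, cor(drop, duration)^2)
```

The slope of about 0.24 means each additional foot of drop is associated with roughly 0.24 seconds more ride time on average, although with R-squared near 0.12 the line explains little of the variation in duration.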
The following data represent Hadley’s age (in days) and height (in inches) during her infant wellness check-ups:
## BEWARE OF EXTRAPOLATION
growth<-data.frame(days=c(0, 10, 62, 129),
height=c(19.75, 20.0, 23.5, 25.6))
ggplot(growth, aes(days, height))+
geom_point(size=3)+
geom_smooth(method="lm", se=FALSE)+
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
gMod<-lm(height~days, growth)
summary(gMod)
##
## Call:
## lm(formula = height ~ days, data = growth)
##
## Residuals:
## 1 2 3 4
## -0.09162 -0.31344 0.73312 -0.32805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.841624 0.429515 46.20 0.000468 ***
## days 0.047182 0.005987 7.88 0.015725 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6131 on 2 degrees of freedom
## Multiple R-squared: 0.9688, Adjusted R-squared: 0.9532
## F-statistic: 62.1 on 1 and 2 DF, p-value: 0.01572
Predict her height on her 10th birthday:
# Predict her height on her 10th birthday
19.841624+0.047182*(365*10)
## [1] 192.0559
# In feet
192/12 #16 ft
## [1] 16
pGrowth<-data.frame(days=c(0, 10, 62, 129, 3650),
height=c(19.75, 20.0, 23.5, 25.6, 192.0559))
ggplot(pGrowth, aes(days, height))+
geom_point(size=3)+
geom_smooth(method="lm", se=FALSE)+
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
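A predicted height of 16 ft is clearly impossible: the line was fit to ages between 0 and 129 days, and nothing in the data says growth stays linear for ten years. One way to guard against this is to flag predictions whose inputs fall outside the observed range. The helper `flag_extrapolation()` below is a hypothetical function written for this sketch, not part of base R or the original notes.

```r
# growth data, repeated so the block runs on its own
growth <- data.frame(days = c(0, 10, 62, 129),
                     height = c(19.75, 20.0, 23.5, 25.6))
gMod <- lm(height ~ days, data = growth)

# hypothetical helper: predict and mark inputs outside the observed range
flag_extrapolation <- function(model, observed_days, new_days) {
  rng <- range(observed_days)
  data.frame(
    days = new_days,
    predicted_height = unname(predict(model,
                                      newdata = data.frame(days = new_days))),
    extrapolated = new_days < rng[1] | new_days > rng[2]
  )
}

# day 100 is inside the data; day 3650 (10th birthday) is far outside it
res <- flag_extrapolation(gMod, growth$days, c(100, 365 * 10))
res
```

Rows with `extrapolated` equal to `TRUE` should be treated with suspicion, however good the fit looks within the data.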