Students will learn how to use linear regression to model relationships between variables and also explore subgroups.
As the climate grows warmer, we expect many animal species to move toward the poles in an attempt to maintain their preferred temperature range.
Do data on fish in the North Sea confirm this suspicion?
The data are 25 years of mean winter temperatures at the bottom of the North Sea (degrees Celsius) and the center of the distribution of anglerfish (sometimes called monkfish) in degrees of north latitude.
library(tidyverse)
# EXAMPLE: Climate Change and Fish Habitats
# Data on anglerfish distribution
# Explanatory: Temp in C (mean winter temperature at bottom of North Sea)
# Response: Latitude of center for distribution of anglerfish
year<-c(1977:2001)
temp<-c(6.26, 6.26, 6.27, 6.31, 6.34, 6.32, 6.37, 6.39, 6.42,
6.52, 6.68, 6.76, 6.78, 6.89, 6.90, 6.93, 6.98,
7.02, 7.09, 7.13, 7.15, 7.29, 7.34, 7.57, 7.65)
lat<-c(57.20, 57.96, 57.65, 57.59, 58.01, 59.06, 56.85, 56.87, 57.43,
57.72, 57.83, 57.87, 57.48, 58.13, 58.52, 58.48, 57.89,
58.71, 58.07, 58.49, 58.28, 58.49, 58.01, 58.57, 58.90)
# Make a dataframe
fish<-data.frame(year, temp, lat)
library(tidyverse)
# Scatterplot of data looks linear
ggplot(data=fish, aes(temp, lat))+
geom_point()+
ggtitle("Scatterplot of Temp vs Fish Latitude")+
xlab("Temperature (C)")+
ylab("Fish Latitude")+
theme_bw()
Describe the relationship:
# Fit a simple linear model
mod2<-lm(lat~temp)
summary(mod2)
##
## Call:
## lm(formula = lat ~ temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.81309 -0.27207 -0.02401 0.20523 1.43781
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.4524 1.5324 34.229 < 2e-16 ***
## temp 0.8180 0.2254 3.629 0.00141 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4734 on 23 degrees of freedom
## Multiple R-squared: 0.3641, Adjusted R-squared: 0.3364
## F-statistic: 13.17 on 1 and 23 DF, p-value: 0.001408
names(mod2)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
# Scatterplot with fitted line
ggplot(data=fish, aes(temp, lat))+
geom_point()+
ggtitle("Scatterplot of Temp vs Fish Latitude")+
xlab("Temperature (C)")+
ylab("Fish Latitude")+
theme_bw()+
geom_abline(slope=mod2$coefficients[2], intercept=mod2$coefficients[1],
color="blue", lty=2, lwd=1)
When you buy a book off Amazon, you get a quote for how much it costs to ship. This is based on the weight of the book. If you didn’t know the weight of the book, what other characteristics of it could you measure to help predict the weight?
#install.packages("DAAG")
library(DAAG)
data(allbacks)
books <- allbacks[, c(3, 1, 4)]
head(books)
## weight volume cover
## 1 800 885 hb
## 2 950 1016 hb
## 3 1050 1125 hb
## 4 350 239 hb
## 5 750 701 hb
## 6 600 641 hb
# Scatterplot of all books together
ggplot(data=books, aes(x=volume, y=weight))+
geom_point()+
ggtitle("Scatterplot of Volume vs Weight of Books")+
theme_bw()
m3<-lm(weight~volume, data=books)
summary(m3)
##
## Call:
## lm(formula = weight ~ volume, data = books)
##
## Residuals:
## Min 1Q Median 3Q Max
## -189.97 -109.86 38.08 109.73 145.57
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 107.67931 88.37758 1.218 0.245
## volume 0.70864 0.09746 7.271 6.26e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 123.9 on 13 degrees of freedom
## Multiple R-squared: 0.8026, Adjusted R-squared: 0.7875
## F-statistic: 52.87 on 1 and 13 DF, p-value: 6.262e-06
# Scatterplot of all books together and fitted model
ggplot(data=books, aes(x=volume, y=weight))+
geom_point()+
ggtitle("Scatterplot of Volume vs Weight of Books")+
theme_bw()+
geom_abline(slope=m3$coefficients[2], intercept = m3$coefficients[1])
What about considering cover type in the model?
# What about considering cover type
ggplot(data=books, aes(x=volume, y=weight, color=cover))+
geom_point()+
ggtitle("Scatterplot of Volume vs Weight of Books")+
theme_bw()
Would there be different equations for the models?
ggplot(data=books, aes(x=volume, y=weight, color=cover))+
geom_point()+
geom_smooth(method="lm", se=FALSE)+
ggtitle("Scatterplot of Volume vs Weight of Books")+
theme_bw()
## `geom_smooth()` using formula 'y ~ x'
How does this work in the model? This is called an interaction, which
means that the relationship between x
and y
varies depending on the levels of a variable.
It can be modeled by using the following code:
# Include interaction to shift intercept and change slope
m5<-lm(weight~volume*cover, data=books)
summary(m5)
##
## Call:
## lm(formula = weight ~ volume * cover, data = books)
##
## Residuals:
## Min 1Q Median 3Q Max
## -89.67 -32.07 -21.82 17.94 215.91
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 161.58654 86.51918 1.868 0.0887 .
## volume 0.76159 0.09718 7.837 7.94e-06 ***
## coverpb -120.21407 115.65899 -1.039 0.3209
## volume:coverpb -0.07573 0.12802 -0.592 0.5661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80.41 on 11 degrees of freedom
## Multiple R-squared: 0.9297, Adjusted R-squared: 0.9105
## F-statistic: 48.5 on 3 and 11 DF, p-value: 1.245e-06
If you want to learn more about this please consider taking DATA 152 and/or DATA 252.
Based on the graphic published in: https://fivethirtyeight.com/features/how-every-nfl-teams-fans-lean-politically/
# Import data
sports<-read.csv("https://raw.githubusercontent.com/kitadasmalley/FA2020_DataViz/main/data/NFL_fandom_data.csv",
header=TRUE)
head(sports)
## DMA NFL NBA MLB NHL NASCAR CBB CFB PctTrumpVote
## 1 Abilene-Sweetwater TX 0.45 0.21 0.14 0.02 0.04 0.03 0.11 0.79
## 2 Albany GA 0.32 0.30 0.09 0.01 0.08 0.03 0.17 0.59
## 3 Albany-Schenectady-Troy NY 0.40 0.20 0.20 0.08 0.06 0.03 0.04 0.44
## 4 Albuquerque-Santa Fe NM 0.53 0.21 0.11 0.03 0.03 0.04 0.06 0.40
## 5 Alexandria LA 0.42 0.28 0.09 0.01 0.05 0.03 0.12 0.70
## 6 Alpena MI 0.28 0.13 0.21 0.12 0.10 0.07 0.09 0.64
# Tidy the data
## Use gather to create:
### column for sport (categorical variable)
### Column for search interest (numeric - percent)
sportsT<-sports%>%
gather("sport", "searchInterest",-c(DMA, PctTrumpVote))
# Level the sport variable so that its in the right order
sportsT$sport<-factor(sportsT$sport,
level=c("NBA", "MLB", "NHL", "NFL", "CBB", "NASCAR", "CFB"))
ggplot(sportsT, aes(x=PctTrumpVote, y=searchInterest))+
geom_point()+
geom_smooth(method="lm", se=FALSE)+
facet_grid(.~sport)+
ylab("Search Interest")+
xlab("Trump Vote Share")
## `geom_smooth()` using formula 'y ~ x'