Learning Objectives

Students will learn how to use linear regression to model relationships between variables and also explore subgroups.

Example 1: Climate Change and Fish Habitats

As the climate grows warmer, we expect many animal species to move toward the poles in an attempt to maintain their preferred temperature range.

Do data on fish in the North Sea confirm this suspicion?

The data are 25 years of mean winter temperatures at the bottom of the North Sea (degrees Celsius) and the center of the distribution of anglerfish (sometimes called monkfish) in degrees of north latitude.

Step 0: Tidyverse

library(tidyverse)

Step 1: Load the data into R. Create a data frame.

# EXAMPLE: Climate Change and Fish Habitats
# Data on anglerfish distribution
# Explanatory: Temp in C (mean winter temperature at bottom of North Sea)
# Response: Latitude of center for distribution of anglerfish

year<-c(1977:2001)
temp<-c(6.26, 6.26, 6.27, 6.31, 6.34, 6.32, 6.37, 6.39, 6.42, 
        6.52, 6.68, 6.76, 6.78, 6.89, 6.90, 6.93, 6.98, 
        7.02, 7.09, 7.13, 7.15, 7.29, 7.34, 7.57, 7.65)
lat<-c(57.20, 57.96, 57.65, 57.59, 58.01, 59.06, 56.85, 56.87, 57.43,
       57.72, 57.83, 57.87, 57.48, 58.13, 58.52, 58.48, 57.89,
       58.71, 58.07, 58.49, 58.28, 58.49, 58.01, 58.57, 58.90)

# Make a dataframe
fish<-data.frame(year, temp, lat)

Step 2: Look at the data! Create a scatterplot!

library(tidyverse)
# Scatterplot of data looks linear
ggplot(data=fish, aes(temp, lat))+
  geom_point()+
  ggtitle("Scatterplot of Temp vs Fish Latitude")+
  xlab("Temperature (C)")+
  ylab("Fish Latitude")+
  theme_bw()

Describe the relationship:

  • direction – positive or negative
  • form – linear or non-linear
  • strength – strong (points close together) or weak (points spread out)
  • outliers - an individual value that falls outside the overall pattern of the relationship

Step 3: Create a simple linear model

# Fit a simple linear model 
mod2<-lm(lat~temp)
summary(mod2)
## 
## Call:
## lm(formula = lat ~ temp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.81309 -0.27207 -0.02401  0.20523  1.43781 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  52.4524     1.5324  34.229  < 2e-16 ***
## temp          0.8180     0.2254   3.629  0.00141 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4734 on 23 degrees of freedom
## Multiple R-squared:  0.3641, Adjusted R-squared:  0.3364 
## F-statistic: 13.17 on 1 and 23 DF,  p-value: 0.001408
names(mod2)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

Step 4: Add the fitted line to the scatterplot

# Scatterplot with fitted line
ggplot(data=fish, aes(temp, lat))+
  geom_point()+
  ggtitle("Scatterplot of Temp vs Fish Latitude")+
  xlab("Temperature (C)")+
  ylab("Fish Latitude")+
  theme_bw()+
  geom_abline(slope=mod2$coefficients[2], intercept=mod2$coefficients[1],
              color="blue", lty=2, lwd=1)

Example 2: Amazon Shipping

When you buy a book off Amazon, you get a quote for how much it costs to ship. This is based on the weight of the book. If you didn’t know the weight of the book, what other characteristics of it could you measure to help predict the weight?

Step 1: Load the Data

#install.packages("DAAG")
library(DAAG)
data(allbacks)

books <- allbacks[, c(3, 1, 4)]
head(books)
##   weight volume cover
## 1    800    885    hb
## 2    950   1016    hb
## 3   1050   1125    hb
## 4    350    239    hb
## 5    750    701    hb
## 6    600    641    hb

Step 2: Look at the data

# Scatterplot of all books together
ggplot(data=books, aes(x=volume, y=weight))+
  geom_point()+
  ggtitle("Scatterplot of Volume vs Weight of Books")+
  theme_bw()

Step 3: Create a simple linear model

m3<-lm(weight~volume, data=books)
summary(m3)
## 
## Call:
## lm(formula = weight ~ volume, data = books)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -189.97 -109.86   38.08  109.73  145.57 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 107.67931   88.37758   1.218    0.245    
## volume        0.70864    0.09746   7.271 6.26e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 123.9 on 13 degrees of freedom
## Multiple R-squared:  0.8026, Adjusted R-squared:  0.7875 
## F-statistic: 52.87 on 1 and 13 DF,  p-value: 6.262e-06

Step 4: Add the fitted line to the scatterplot

# Scatterplot of all books together and fitted model
ggplot(data=books, aes(x=volume, y=weight))+
  geom_point()+
  ggtitle("Scatterplot of Volume vs Weight of Books")+
  theme_bw()+
  geom_abline(slope=m3$coefficients[2], intercept = m3$coefficients[1])

Step 5: Explore possible subgroups

What about considering cover type in the model?

# What about considering cover type 
ggplot(data=books, aes(x=volume, y=weight, color=cover))+
  geom_point()+
  ggtitle("Scatterplot of Volume vs Weight of Books")+
  theme_bw()

Would there be different equations for the models?

ggplot(data=books, aes(x=volume, y=weight, color=cover))+
  geom_point()+
  geom_smooth(method="lm", se=FALSE)+
  ggtitle("Scatterplot of Volume vs Weight of Books")+
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'

Interactions?!

How does this work in the model? This is called an interaction, which means that the relationship between x and y varies depending on the levels of a variable.

It can be modeled by using the following code:

# Include interaction to shift intercept and change slope
m5<-lm(weight~volume*cover, data=books)
summary(m5)
## 
## Call:
## lm(formula = weight ~ volume * cover, data = books)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -89.67 -32.07 -21.82  17.94 215.91 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     161.58654   86.51918   1.868   0.0887 .  
## volume            0.76159    0.09718   7.837 7.94e-06 ***
## coverpb        -120.21407  115.65899  -1.039   0.3209    
## volume:coverpb   -0.07573    0.12802  -0.592   0.5661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.41 on 11 degrees of freedom
## Multiple R-squared:  0.9297, Adjusted R-squared:  0.9105 
## F-statistic:  48.5 on 3 and 11 DF,  p-value: 1.245e-06

If you want to learn more about this please consider taking DATA 152 and/or DATA 252.

Example 3: How do NFL fans lean politically?

Based on the graphic published in: https://fivethirtyeight.com/features/how-every-nfl-teams-fans-lean-politically/

Step 1: Import the Data

# Import data
sports<-read.csv("https://raw.githubusercontent.com/kitadasmalley/FA2020_DataViz/main/data/NFL_fandom_data.csv", 
                 header=TRUE)

head(sports)
##                          DMA  NFL  NBA  MLB  NHL NASCAR  CBB  CFB PctTrumpVote
## 1      Abilene-Sweetwater TX 0.45 0.21 0.14 0.02   0.04 0.03 0.11         0.79
## 2                  Albany GA 0.32 0.30 0.09 0.01   0.08 0.03 0.17         0.59
## 3 Albany-Schenectady-Troy NY 0.40 0.20 0.20 0.08   0.06 0.03 0.04         0.44
## 4    Albuquerque-Santa Fe NM 0.53 0.21 0.11 0.03   0.03 0.04 0.06         0.40
## 5              Alexandria LA 0.42 0.28 0.09 0.01   0.05 0.03 0.12         0.70
## 6                  Alpena MI 0.28 0.13 0.21 0.12   0.10 0.07 0.09         0.64

Step 2: Tidy the Data

# Tidy the data
## Use gather to create:
### column for sport (categorical variable)
### Column for search interest (numeric - percent)

sportsT<-sports%>%
  gather("sport", "searchInterest",-c(DMA, PctTrumpVote))

Step 3: Relevel

# Level the sport variable so that its in the right order
sportsT$sport<-factor(sportsT$sport, 
                      level=c("NBA", "MLB", "NHL", "NFL", "CBB", "NASCAR", "CFB"))

Step 4: Graph it!

ggplot(sportsT, aes(x=PctTrumpVote, y=searchInterest))+
  geom_point()+
  geom_smooth(method="lm", se=FALSE)+
  facet_grid(.~sport)+
  ylab("Search Interest")+
  xlab("Trump Vote Share")
## `geom_smooth()` using formula 'y ~ x'