Scenario

When studying the penguins on the Palmer islands, how do our quantitative variables interact with each other? Can we use scatterplots and linear regression to find any patterns or relationships between these variables? We will try to explore these questions by using our new tools of linear regression.

Initial Data Analysis

We do some exploratory data analysis to determine more about this data set.

names(PData)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"
dim(PData)
## [1] 333   8
head(PData)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           36.7          19.3               193        3450
## 5 Adelie  Torgersen           39.3          20.6               190        3650
## 6 Adelie  Torgersen           38.9          17.8               181        3625
## # ℹ 2 more variables: sex <fct>, year <int>

We can observe that there are four quantitative variables: bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g. We will call these “bill length bill depth, flipper length, and body mass” in our report.

Our next goal is to see whether a linear regression is an appropriate tool for comparing two variables. If it is, we hope to see a reasonably large correlation coefficient between those two variables (say, greater than 0.5 or less than -0.5). Note that this will not be enough to go on - we’ll still need to see whether the variables have a linear relationship! But it’s a start.

PData %>% 
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
  cor()
##                   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm         1.0000000    -0.2286256         0.6530956   0.5894511
## bill_depth_mm         -0.2286256     1.0000000        -0.5777917  -0.4720157
## flipper_length_mm      0.6530956    -0.5777917         1.0000000   0.8729789
## body_mass_g            0.5894511    -0.4720157         0.8729789   1.0000000

There are a number of variable pairs whose correlation coefficients are large! The largest positive correlation is 0.873 (Body Mass vs Flipper Length) and the largest negative correlation is -0.578 (Bill Depth vs Flipper Length). Another reasonably large correlation coefficient is 0.653 (Bill Length vs Flipper Length). We will focus on these three comparisons. Since Flipper Length is common across all of them, we will make Flipper Length a common x-variable.

We plot all three satterplots to see which ones appear linear and reasonable for linear regression:

ggplot(PData, aes(x=flipper_length_mm, y=body_mass_g))+
  geom_point()

ggplot(PData, aes(x=flipper_length_mm, y=bill_depth_mm))+
  geom_point()

ggplot(PData, aes(x=flipper_length_mm, y=bill_length_mm))+
  geom_point()

(Flipper Length vs Body Mass) and (Flipper Length vs Bill Length) both appear to have strong, positive linear relationships. Both seem appropriate for linear regression. (Flipper Length vs Bill Depth), on the other hand, is a little bit different – it has two “clumps” of data. For now, we will proceed with analyzing the first two graphs, and will come back to will reexamine the “clumps” more in a future lesson.

Bill Length vs Flipper Length

We calculate the linear model for Bill Length and Flipper Length. Here, Bill Length is the y-variable, and Flipper Length is the x-variable.

lm(PData$bill_length_mm~PData$flipper_length_mm)
## 
## Call:
## lm(formula = PData$bill_length_mm ~ PData$flipper_length_mm)
## 
## Coefficients:
##             (Intercept)  PData$flipper_length_mm  
##                 -7.2186                   0.2548
# We get an intercept of -7.2186 and a slope of 0.2548

We see that our intercept was -7.2186 and our slope was 0.2548. We therefore create a new scatterplot with the linear model drawn on top:

ggplot(PData, aes(x=flipper_length_mm, y=bill_length_mm))+
  geom_point()+
  geom_abline(intercept=-7.2649, slope=0.2548, color="pink", size=2)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

We confirm that, visually, this line cuts through the point cloud in a way that reasonably describes the data. What remains is to check the residual plot and confirm that it is “nice and boring.”

Resid_FlipperLength_BillLength <- resid(lm(PData$bill_length_mm~PData$flipper_length_mm))

ggplot(PData, aes(x=flipper_length_mm, y=Resid_FlipperLength_BillLength))+
  geom_point()

The residual plot is nice and boring! Success! We conclude that we can reasonably estimate Bill Length of a Palmer penguin if we know the penguin’s flipper length, using a model of the form (Bill Length) = 0.2548 x (Flipper Length) - 7.2649.

Body Mass vs Flipper Length

BodyMass_Model <- lm(PData$body_mass_g ~ PData$flipper_length_mm)
BodyMass_Model
## 
## Call:
## lm(formula = PData$body_mass_g ~ PData$flipper_length_mm)
## 
## Coefficients:
##             (Intercept)  PData$flipper_length_mm  
##                -5872.09                    50.15

Record the intercept and slope

intercept_bm <- coef(BodyMass_Model)[1]
slope_bm <- coef(BodyMass_Model)[2]

intercept_bm
## (Intercept) 
##   -5872.093
slope_bm
## PData$flipper_length_mm 
##                50.15327

Create scatterplot with regression line

ggplot(PData, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point() +
geom_abline(intercept = intercept_bm, slope = slope_bm, color = "blue", size = 2)

### Create residuals

Resid_FlipperLength_BodyMass <- resid(BodyMass_Model)

Plot residuals

ggplot(PData, aes(x = flipper_length_mm, y = Resid_FlipperLength_BodyMass)) +
geom_point()