When studying the penguins on the Palmer islands, how do our quantitative variables interact with each other? Can we use scatterplots and linear regression to find any patterns or relationships between these variables? We will try to explore these questions by using our new tools of linear regression.
We do some exploratory data analysis to determine more about this data set.
names(PData)
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
dim(PData)
## [1] 333 8
head(PData)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen 36.7 19.3 193 3450
## 5 Adelie Torgersen 39.3 20.6 190 3650
## 6 Adelie Torgersen 38.9 17.8 181 3625
## # ℹ 2 more variables: sex <fct>, year <int>
We can observe that there are four quantitative variables: bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g. We will call these “bill length bill depth, flipper length, and body mass” in our report.
Our next goal is to see whether a linear regression is an appropriate tool for comparing two variables. If it is, we hope to see a reasonably large correlation coefficient between those two variables (say, greater than 0.5 or less than -0.5). Note that this will not be enough to go on - we’ll still need to see whether the variables have a linear relationship! But it’s a start.
PData %>%
select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
cor()
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm 1.0000000 -0.2286256 0.6530956 0.5894511
## bill_depth_mm -0.2286256 1.0000000 -0.5777917 -0.4720157
## flipper_length_mm 0.6530956 -0.5777917 1.0000000 0.8729789
## body_mass_g 0.5894511 -0.4720157 0.8729789 1.0000000
There are a number of variable pairs whose correlation coefficients are large! The largest positive correlation is 0.873 (Body Mass vs Flipper Length) and the largest negative correlation is -0.578 (Bill Depth vs Flipper Length). Another reasonably large correlation coefficient is 0.653 (Bill Length vs Flipper Length). We will focus on these three comparisons. Since Flipper Length is common across all of them, we will make Flipper Length a common x-variable.
We plot all three satterplots to see which ones appear linear and reasonable for linear regression:
ggplot(PData, aes(x=flipper_length_mm, y=body_mass_g))+
geom_point()
ggplot(PData, aes(x=flipper_length_mm, y=bill_depth_mm))+
geom_point()
ggplot(PData, aes(x=flipper_length_mm, y=bill_length_mm))+
geom_point()
(Flipper Length vs Body Mass) and (Flipper Length vs Bill Length) both appear to have strong, positive linear relationships. Both seem appropriate for linear regression. (Flipper Length vs Bill Depth), on the other hand, is a little bit different – it has two “clumps” of data. For now, we will proceed with analyzing the first two graphs, and will come back to will reexamine the “clumps” more in a future lesson.
We calculate the linear model for Bill Length and Flipper Length. Here, Bill Length is the y-variable, and Flipper Length is the x-variable.
lm(PData$bill_length_mm~PData$flipper_length_mm)
##
## Call:
## lm(formula = PData$bill_length_mm ~ PData$flipper_length_mm)
##
## Coefficients:
## (Intercept) PData$flipper_length_mm
## -7.2186 0.2548
# We get an intercept of -7.2186 and a slope of 0.2548
We see that our intercept was -7.2186 and our slope was 0.2548. We therefore create a new scatterplot with the linear model drawn on top:
ggplot(PData, aes(x=flipper_length_mm, y=bill_length_mm))+
geom_point()+
geom_abline(intercept=-7.2649, slope=0.2548, color="pink", size=2)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
We confirm that, visually, this line cuts through the point cloud in a way that reasonably describes the data. What remains is to check the residual plot and confirm that it is “nice and boring.”
Resid_FlipperLength_BillLength <- resid(lm(PData$bill_length_mm~PData$flipper_length_mm))
ggplot(PData, aes(x=flipper_length_mm, y=Resid_FlipperLength_BillLength))+
geom_point()
The residual plot is nice and boring! Success! We conclude that we can reasonably estimate Bill Length of a Palmer penguin if we know the penguin’s flipper length, using a model of the form (Bill Length) = 0.2548 x (Flipper Length) - 7.2649.
BodyMass_Model <- lm(PData$body_mass_g ~ PData$flipper_length_mm)
BodyMass_Model
##
## Call:
## lm(formula = PData$body_mass_g ~ PData$flipper_length_mm)
##
## Coefficients:
## (Intercept) PData$flipper_length_mm
## -5872.09 50.15
intercept_bm <- coef(BodyMass_Model)[1]
slope_bm <- coef(BodyMass_Model)[2]
intercept_bm
## (Intercept)
## -5872.093
slope_bm
## PData$flipper_length_mm
## 50.15327
ggplot(PData, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point() +
geom_abline(intercept = intercept_bm, slope = slope_bm, color = "blue", size = 2)
### Create residuals
Resid_FlipperLength_BodyMass <- resid(BodyMass_Model)
ggplot(PData, aes(x = flipper_length_mm, y = Resid_FlipperLength_BodyMass)) +
geom_point()