Predicting penguin’s body mass using linear regression models

What is a linear regression model

A linear regression model is a predictive analysis tool to explain the relationship between a dependent variable and a independent variable [1]. A multiple linear regression model is the same as a linear regression model, however, it just uses more than one independent variable.

The dependent or response variable is the variable being tested or measured. It is the outcome of the experiment [2].

The explanatory or independent variables are the variables that are changed or controlled to test their effects on the dependent variable [2].

The basic format of a linear regression model equation is: y = a + bx + error [3]

The intercept (a) or otherwise known as the y-intercept, tells us at what point does the regression line cross the y-axis when x = 0. [3]
The slope (b) tells us how much of an increase there is for y when the x increases by one unit (how steep the line is). [3]

Assumptions

There are 4 different assumptions a linear regression model must adhere to [3].

There is a linear relationship between the mean response (y) and the explanatory variable (x)
The errors are independent - there is no connection between how far any two points lie from the regression line
The responses are normally distributed at each level of x
The variance of the responses is equal for all levels of x

The aim of this study

The aim of this study is to try and find which variables are the best to predict the weight of penguins.

We also want to find the relationship between different variables and the weight of penguins

For this study we are assuming that none of the 4 assumptions are violated

Retrieving the data

First we need to import a few libraries for this task.

library(ggplot2)
library(GGally)
library(gridExtra)
library(sjPlot)

For this study we are using the “penguins” data frame from the “palmerpenguins” package [4].

# install.packages("palmerpenguins") #remove the 1st "#" if you need to install the package
library(palmerpenguins)
penguins <- penguins

We need to check the data and check what the data types of each variable is.

str(penguins)

## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

head(penguins)

## # A tibble: 6 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…
## 4 Adelie  Torge…           NA            NA                 NA          NA <NA> 
## 5 Adelie  Torge…           36.7          19.3              193        3450 fema…
## 6 Adelie  Torge…           39.3          20.6              190        3650 male 
## # … with 1 more variable: year <int>

We also need to remove any NA values from the data set.

penguins <- na.omit(penguins)

Exploratory Data Analysis - EDA

Before starting any modelling, we need to explore the data. We need to examine distributions of individual responses and predictors using graphical and numerical summaries, and begin to discover relationships between the variables.

body_mass_plot <- ggplot(data = penguins) + #- distribution of the response variable
  aes(x = body_mass_g) +
  geom_histogram(color = "red", binwidth = 0.5) +
  xlab("Penguin body mass (g)")

species_plot <- ggplot(data = penguins) + #- distribution of the explanatory variable
  aes(x = species) +
  geom_bar(fill = "orange", binwidth = 0.5) +
  xlab("Penguin species") +
  geom_text(aes(label = ..count..), 
            stat = "count", 
            vjust = 1.5, 
            colour = "black")

bill_length_plot <- ggplot(data = penguins) + #- distribution of the explanatory variable
  aes(x = bill_length_mm) +
  geom_histogram(fill = "yellow", colour = "white", binwidth = 0.5) +
  xlab("Penguin bill length (mm)")

bill_depth_plot <- ggplot(data = penguins) + #- distribution of the explanatory variable
  aes(x = bill_depth_mm) +
  geom_histogram(fill = "green", colour = "white", binwidth = 0.5) +
  xlab("Penguin bill depth (mm)")

flipper_length_plot <- ggplot(data = penguins) + #- distribution of the explanatory variable
  aes(x = flipper_length_mm) +
  geom_histogram(fill = "blue", colour = "blue", binwidth = 0.5) +
  xlab("Penguin flipper length (mm)")

penguin_sex_plot <- ggplot(data = penguins) + #- distribution of the explanatory variable
  aes(x = sex) +
  geom_bar(fill = "purple", binwidth = 0.5) +
  xlab("Penguin sex (M of F)") +
  geom_text(aes(label = ..count..), 
            stat = "count", 
            vjust = 1.5, 
            colour = "black")

grid.arrange(body_mass_plot, species_plot, bill_length_plot, bill_depth_plot, flipper_length_plot, penguin_sex_plot)

We next want to examine the numerical and graphical summaries of the relationships between model covariates and responses.

ggpairs(data = penguins,
        columns = c("species", "bill_length_mm", "bill_depth_mm", "flipper_length_mm", "sex", "body_mass_g"))

The relationship between two continuous variables is depicted with scatterplots below the diagonal and correlation coefficients above the diagonal

From this graph, we can see that there is a strong correlation between body mass and bill length, body mass and bill depth, and body mass and flipper length.

Creating the models

First we will check the relationship between each of the bill and flipper variables individually.

model_1 <- lm(body_mass_g ~ bill_length_mm, penguins)
summary(model_1)

## 
## Call:
## lm(formula = body_mass_g ~ bill_length_mm, data = penguins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1759.38  -468.82    27.79   464.20  1641.00 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     388.845    289.817   1.342    0.181    
## bill_length_mm   86.792      6.538  13.276   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 651.4 on 331 degrees of freedom
## Multiple R-squared:  0.3475, Adjusted R-squared:  0.3455 
## F-statistic: 176.2 on 1 and 331 DF,  p-value: < 2.2e-16

model_2 <- lm(body_mass_g ~ bill_depth_mm, penguins)
summary(model_2)

## 
## Call:
## lm(formula = body_mass_g ~ bill_depth_mm, data = penguins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1616.08  -510.38   -68.67   486.51  1811.12 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7519.98     342.33  21.967   <2e-16 ***
## bill_depth_mm  -193.01      19.81  -9.741   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 710.9 on 331 degrees of freedom
## Multiple R-squared:  0.2228, Adjusted R-squared:  0.2205 
## F-statistic: 94.89 on 1 and 331 DF,  p-value: < 2.2e-16

model_3 <- lm(body_mass_g ~ flipper_length_mm, penguins)
summary(model_3)

## 
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm, data = penguins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1057.33  -259.79   -12.24   242.97  1293.89 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -5872.09     310.29  -18.93   <2e-16 ***
## flipper_length_mm    50.15       1.54   32.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 393.3 on 331 degrees of freedom
## Multiple R-squared:  0.7621, Adjusted R-squared:  0.7614 
## F-statistic:  1060 on 1 and 331 DF,  p-value: < 2.2e-16

As you can see, all 3 models are each statistically significant (p-value < 0.05)

Next let’s try adding multiple independent variables to the formula

model_4 <- lm(body_mass_g ~ bill_length_mm + bill_depth_mm, penguins)
summary(model_4)

## 
## Call:
## lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm, data = penguins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1806.74  -456.82    15.33   471.07  1541.34 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3413.452    437.911   7.795 8.50e-14 ***
## bill_length_mm   74.813      6.076  12.313  < 2e-16 ***
## bill_depth_mm  -145.507     16.873  -8.624 2.76e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 589.4 on 330 degrees of freedom
## Multiple R-squared:  0.4675, Adjusted R-squared:  0.4642 
## F-statistic: 144.8 on 2 and 330 DF,  p-value: < 2.2e-16

model_5 <- lm(body_mass_g ~ bill_length_mm + flipper_length_mm, penguins)
summary(model_5)

## 
## Call:
## lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm, 
##     data = penguins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1083.08  -282.65   -17.18   242.95  1293.24 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -5836.299    312.604 -18.670   <2e-16 ***
## bill_length_mm        4.959      5.214   0.951    0.342    
## flipper_length_mm    48.890      2.034  24.034   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 393.4 on 330 degrees of freedom
## Multiple R-squared:  0.7627, Adjusted R-squared:  0.7613 
## F-statistic: 530.4 on 2 and 330 DF,  p-value: < 2.2e-16

model_5 <- lm(body_mass_g ~ bill_depth_mm + flipper_length_mm, penguins)
summary(model_5)

## 
## Call:
## lm(formula = body_mass_g ~ bill_depth_mm + flipper_length_mm, 
##     data = penguins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1031.64  -272.80   -22.72   241.11  1282.80 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -6537.598    545.348 -11.988   <2e-16 ***
## bill_depth_mm        19.878     13.407   1.483    0.139    
## flipper_length_mm    51.767      1.884  27.481   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 392.6 on 330 degrees of freedom
## Multiple R-squared:  0.7637, Adjusted R-squared:  0.7622 
## F-statistic: 533.2 on 2 and 330 DF,  p-value: < 2.2e-16

model_6 <- lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm, penguins)
summary(model_6)

## 
## Call:
## lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm, 
##     data = penguins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1051.37  -284.50   -20.37   241.03  1283.51 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -6445.476    566.130 -11.385   <2e-16 ***
## bill_length_mm        3.293      5.366   0.614    0.540    
## bill_depth_mm        17.836     13.826   1.290    0.198    
## flipper_length_mm    50.762      2.497  20.327   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 393 on 329 degrees of freedom
## Multiple R-squared:  0.7639, Adjusted R-squared:  0.7618 
## F-statistic: 354.9 on 3 and 329 DF,  p-value: < 2.2e-16

Whilst bill length and bill depth are statistically significant when used together, when they are combined with flipper length, they become insignificant.

We can use the “tab_model” function to compare our models

tab_model(model_1, model_2, model_3, model_4, model_5, model_6, dv.labels = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5", "Model 6"))

	Model 1			Model 2			Model 3			Model 4			Model 5			Model 6
Predictors	Estimates	CI	p	Estimates	CI	p	Estimates	CI	p	Estimates	CI	p	Estimates	CI	p	Estimates	CI	p
(Intercept)	388.85	-181.27 – 958.96	0.181	7519.98	6846.57 – 8193.39	<0.001	-5872.09	-6482.47 – -5261.71	<0.001	3413.45	2552.00 – 4274.90	<0.001	-6537.60	-7610.40 – -5464.80	<0.001	-6445.48	-7559.17 – -5331.78	<0.001
bill_length_mm	86.79	73.93 – 99.65	<0.001							74.81	62.86 – 86.76	<0.001				3.29	-7.26 – 13.85	0.540
bill_depth_mm				-193.01	-231.98 – -154.03	<0.001				-145.51	-178.70 – -112.32	<0.001	19.88	-6.50 – 46.25	0.139	17.84	-9.36 – 45.03	0.198
flipper_length_mm							50.15	47.12 – 53.18	<0.001				51.77	48.06 – 55.47	<0.001	50.76	45.85 – 55.67	<0.001
Observations	333			333			333			333			333			333
R² / R² adjusted	0.347 / 0.345			0.223 / 0.220			0.762 / 0.761			0.467 / 0.464			0.764 / 0.762			0.764 / 0.762

From this comparison, Model 3 is our best model. This can be determined through the adjusted R-squared value. Higher values indicated a better performing model because a greater amount of variation in the dependent variable is explained by the independent variable. Even though our Model 5 and Model have a very slightly greater adjusted R-squared value (0.001), both the bill length and bill depth are not significant in either of the models. This mean that they are not as useful in determining the body mass.

Lastly, we need to check the assumptions of our best linear model. This can be done with the code below.

par(mfrow=c(2,2))
plot(model_3)

From this study we can see that flipper length is the most important variable in determining and predicting body mass in penguins (that were used in this study).

References

[1]. What is Linear Regression? - Statistics Solutions [Internet]. Statistics Solutions. Available from: https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/what-is-linear-regression/?__cf_chl_jschl_tk__=pmd_R2lkHiW3t9prFULjSXZHer34Nv8XMJC6EiAb4UGRitU-1634796427-0-gqNtZGzNAlCjcnBszQi9

[2]. Understand the Difference Between Independent and Dependent Variables [Internet]. ThoughtCo. 2021. Available from: https://www.thoughtco.com/independent-and-dependent-variables-differences-606115

[3]. Roback and Legler (2021). Applied Generalized Linear Models and Multilevel Models in R]

[4]. Horst A. palmerpenguins R data package [Internet]. Allisonhorst.github.io. Available from: https://allisonhorst.github.io/palmerpenguins/