A linear regression model is a predictive analysis tool to explain the relationship between a dependent variable and a independent variable [1]. A multiple linear regression model is the same as a linear regression model, however, it just uses more than one independent variable.
The dependent or response variable is the variable being tested or measured. It is the outcome of the experiment [2].
The explanatory or independent variables are the variables that are changed or controlled to test their effects on the dependent variable [2].
The basic format of a linear regression model equation is: y = a + bx + error [3]
The intercept (a) or otherwise known as the y-intercept, tells us at what point does the regression line cross the y-axis when x = 0. [3]
The slope (b) tells us how much of an increase there is for y when the x increases by one unit (how steep the line is). [3]
There are 4 different assumptions a linear regression model must adhere to [3].
There is a linear relationship between the mean response (y) and the explanatory variable (x)
The errors are independent - there is no connection between how far any two points lie from the regression line
The responses are normally distributed at each level of x
The variance of the responses is equal for all levels of x
The aim of this study is to try and find which variables are the best to predict the weight of penguins.
We also want to find the relationship between different variables and the weight of penguins
For this study we are assuming that none of the 4 assumptions are violated
First we need to import a few libraries for this task.
library(ggplot2)
library(GGally)
library(gridExtra)
library(sjPlot)
For this study we are using the “penguins” data frame from the “palmerpenguins” package [4].
# install.packages("palmerpenguins") #remove the 1st "#" if you need to install the package
library(palmerpenguins)
penguins <- penguins
We need to check the data and check what the data types of each variable is.
str(penguins)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
head(penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 39.1 18.7 181 3750 male
## 2 Adelie Torge… 39.5 17.4 186 3800 fema…
## 3 Adelie Torge… 40.3 18 195 3250 fema…
## 4 Adelie Torge… NA NA NA NA <NA>
## 5 Adelie Torge… 36.7 19.3 193 3450 fema…
## 6 Adelie Torge… 39.3 20.6 190 3650 male
## # … with 1 more variable: year <int>
We also need to remove any NA values from the data set.
penguins <- na.omit(penguins)
Before starting any modelling, we need to explore the data. We need to examine distributions of individual responses and predictors using graphical and numerical summaries, and begin to discover relationships between the variables.
body_mass_plot <- ggplot(data = penguins) + #- distribution of the response variable
aes(x = body_mass_g) +
geom_histogram(color = "red", binwidth = 0.5) +
xlab("Penguin body mass (g)")
species_plot <- ggplot(data = penguins) + #- distribution of the explanatory variable
aes(x = species) +
geom_bar(fill = "orange", binwidth = 0.5) +
xlab("Penguin species") +
geom_text(aes(label = ..count..),
stat = "count",
vjust = 1.5,
colour = "black")
bill_length_plot <- ggplot(data = penguins) + #- distribution of the explanatory variable
aes(x = bill_length_mm) +
geom_histogram(fill = "yellow", colour = "white", binwidth = 0.5) +
xlab("Penguin bill length (mm)")
bill_depth_plot <- ggplot(data = penguins) + #- distribution of the explanatory variable
aes(x = bill_depth_mm) +
geom_histogram(fill = "green", colour = "white", binwidth = 0.5) +
xlab("Penguin bill depth (mm)")
flipper_length_plot <- ggplot(data = penguins) + #- distribution of the explanatory variable
aes(x = flipper_length_mm) +
geom_histogram(fill = "blue", colour = "blue", binwidth = 0.5) +
xlab("Penguin flipper length (mm)")
penguin_sex_plot <- ggplot(data = penguins) + #- distribution of the explanatory variable
aes(x = sex) +
geom_bar(fill = "purple", binwidth = 0.5) +
xlab("Penguin sex (M of F)") +
geom_text(aes(label = ..count..),
stat = "count",
vjust = 1.5,
colour = "black")
grid.arrange(body_mass_plot, species_plot, bill_length_plot, bill_depth_plot, flipper_length_plot, penguin_sex_plot)
We next want to examine the numerical and graphical summaries of the relationships between model covariates and responses.
ggpairs(data = penguins,
columns = c("species", "bill_length_mm", "bill_depth_mm", "flipper_length_mm", "sex", "body_mass_g"))
The relationship between two continuous variables is depicted with scatterplots below the diagonal and correlation coefficients above the diagonal
From this graph, we can see that there is a strong correlation between body mass and bill length, body mass and bill depth, and body mass and flipper length.
First we will check the relationship between each of the bill and flipper variables individually.
model_1 <- lm(body_mass_g ~ bill_length_mm, penguins)
summary(model_1)
##
## Call:
## lm(formula = body_mass_g ~ bill_length_mm, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1759.38 -468.82 27.79 464.20 1641.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 388.845 289.817 1.342 0.181
## bill_length_mm 86.792 6.538 13.276 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 651.4 on 331 degrees of freedom
## Multiple R-squared: 0.3475, Adjusted R-squared: 0.3455
## F-statistic: 176.2 on 1 and 331 DF, p-value: < 2.2e-16
model_2 <- lm(body_mass_g ~ bill_depth_mm, penguins)
summary(model_2)
##
## Call:
## lm(formula = body_mass_g ~ bill_depth_mm, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1616.08 -510.38 -68.67 486.51 1811.12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7519.98 342.33 21.967 <2e-16 ***
## bill_depth_mm -193.01 19.81 -9.741 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 710.9 on 331 degrees of freedom
## Multiple R-squared: 0.2228, Adjusted R-squared: 0.2205
## F-statistic: 94.89 on 1 and 331 DF, p-value: < 2.2e-16
model_3 <- lm(body_mass_g ~ flipper_length_mm, penguins)
summary(model_3)
##
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1057.33 -259.79 -12.24 242.97 1293.89
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5872.09 310.29 -18.93 <2e-16 ***
## flipper_length_mm 50.15 1.54 32.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 393.3 on 331 degrees of freedom
## Multiple R-squared: 0.7621, Adjusted R-squared: 0.7614
## F-statistic: 1060 on 1 and 331 DF, p-value: < 2.2e-16
As you can see, all 3 models are each statistically significant (p-value < 0.05)
Next let’s try adding multiple independent variables to the formula
model_4 <- lm(body_mass_g ~ bill_length_mm + bill_depth_mm, penguins)
summary(model_4)
##
## Call:
## lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1806.74 -456.82 15.33 471.07 1541.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3413.452 437.911 7.795 8.50e-14 ***
## bill_length_mm 74.813 6.076 12.313 < 2e-16 ***
## bill_depth_mm -145.507 16.873 -8.624 2.76e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 589.4 on 330 degrees of freedom
## Multiple R-squared: 0.4675, Adjusted R-squared: 0.4642
## F-statistic: 144.8 on 2 and 330 DF, p-value: < 2.2e-16
model_5 <- lm(body_mass_g ~ bill_length_mm + flipper_length_mm, penguins)
summary(model_5)
##
## Call:
## lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm,
## data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1083.08 -282.65 -17.18 242.95 1293.24
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5836.299 312.604 -18.670 <2e-16 ***
## bill_length_mm 4.959 5.214 0.951 0.342
## flipper_length_mm 48.890 2.034 24.034 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 393.4 on 330 degrees of freedom
## Multiple R-squared: 0.7627, Adjusted R-squared: 0.7613
## F-statistic: 530.4 on 2 and 330 DF, p-value: < 2.2e-16
model_5 <- lm(body_mass_g ~ bill_depth_mm + flipper_length_mm, penguins)
summary(model_5)
##
## Call:
## lm(formula = body_mass_g ~ bill_depth_mm + flipper_length_mm,
## data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1031.64 -272.80 -22.72 241.11 1282.80
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6537.598 545.348 -11.988 <2e-16 ***
## bill_depth_mm 19.878 13.407 1.483 0.139
## flipper_length_mm 51.767 1.884 27.481 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 392.6 on 330 degrees of freedom
## Multiple R-squared: 0.7637, Adjusted R-squared: 0.7622
## F-statistic: 533.2 on 2 and 330 DF, p-value: < 2.2e-16
model_6 <- lm(body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm, penguins)
summary(model_6)
##
## Call:
## lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm,
## data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1051.37 -284.50 -20.37 241.03 1283.51
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6445.476 566.130 -11.385 <2e-16 ***
## bill_length_mm 3.293 5.366 0.614 0.540
## bill_depth_mm 17.836 13.826 1.290 0.198
## flipper_length_mm 50.762 2.497 20.327 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 393 on 329 degrees of freedom
## Multiple R-squared: 0.7639, Adjusted R-squared: 0.7618
## F-statistic: 354.9 on 3 and 329 DF, p-value: < 2.2e-16
Whilst bill length and bill depth are statistically significant when used together, when they are combined with flipper length, they become insignificant.
We can use the “tab_model” function to compare our models
tab_model(model_1, model_2, model_3, model_4, model_5, model_6, dv.labels = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5", "Model 6"))
Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Predictors | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p |
(Intercept) | 388.85 | -181.27 – 958.96 | 0.181 | 7519.98 | 6846.57 – 8193.39 | <0.001 | -5872.09 | -6482.47 – -5261.71 | <0.001 | 3413.45 | 2552.00 – 4274.90 | <0.001 | -6537.60 | -7610.40 – -5464.80 | <0.001 | -6445.48 | -7559.17 – -5331.78 | <0.001 |
bill_length_mm | 86.79 | 73.93 – 99.65 | <0.001 | 74.81 | 62.86 – 86.76 | <0.001 | 3.29 | -7.26 – 13.85 | 0.540 | |||||||||
bill_depth_mm | -193.01 | -231.98 – -154.03 | <0.001 | -145.51 | -178.70 – -112.32 | <0.001 | 19.88 | -6.50 – 46.25 | 0.139 | 17.84 | -9.36 – 45.03 | 0.198 | ||||||
flipper_length_mm | 50.15 | 47.12 – 53.18 | <0.001 | 51.77 | 48.06 – 55.47 | <0.001 | 50.76 | 45.85 – 55.67 | <0.001 | |||||||||
Observations | 333 | 333 | 333 | 333 | 333 | 333 | ||||||||||||
R2 / R2 adjusted | 0.347 / 0.345 | 0.223 / 0.220 | 0.762 / 0.761 | 0.467 / 0.464 | 0.764 / 0.762 | 0.764 / 0.762 |
From this comparison, Model 3 is our best model. This can be determined through the adjusted R-squared value. Higher values indicated a better performing model because a greater amount of variation in the dependent variable is explained by the independent variable. Even though our Model 5 and Model have a very slightly greater adjusted R-squared value (0.001), both the bill length and bill depth are not significant in either of the models. This mean that they are not as useful in determining the body mass.
Lastly, we need to check the assumptions of our best linear model. This can be done with the code below.
par(mfrow=c(2,2))
plot(model_3)
From this study we can see that flipper length is the most important variable in determining and predicting body mass in penguins (that were used in this study).
[1]. What is Linear Regression? - Statistics Solutions [Internet]. Statistics Solutions. Available from: https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/what-is-linear-regression/?__cf_chl_jschl_tk__=pmd_R2lkHiW3t9prFULjSXZHer34Nv8XMJC6EiAb4UGRitU-1634796427-0-gqNtZGzNAlCjcnBszQi9
[2]. Understand the Difference Between Independent and Dependent Variables [Internet]. ThoughtCo. 2021. Available from: https://www.thoughtco.com/independent-and-dependent-variables-differences-606115
[3]. Roback and Legler (2021). Applied Generalized Linear Models and Multilevel Models in R]
[4]. Horst A. palmerpenguins R data package [Internet]. Allisonhorst.github.io. Available from: https://allisonhorst.github.io/palmerpenguins/