Source of data: Kaggle (https://www.kaggle.com/code/parulpandey/penguin-dataset-the-new-iris/notebook#Flipperlength-distribution)

mydata <- read.csv("~/Desktop/R STUDIO/penguins_size.csv", header = TRUE, sep = ",", dec = ".")
mydata <- drop_na(mydata)
mydata <- mydata[mydata$sex != ".",]

Description

Description of variables

Research question: How factors: culmen length, culmen depth, flipper length and sex influence the body mass of a penguin?

head(mydata)
##   species    island culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g    sex
## 1  Adelie Torgersen             39.1            18.7               181        3750   MALE
## 2  Adelie Torgersen             39.5            17.4               186        3800 FEMALE
## 3  Adelie Torgersen             40.3            18.0               195        3250 FEMALE
## 4  Adelie Torgersen             36.7            19.3               193        3450 FEMALE
## 5  Adelie Torgersen             39.3            20.6               190        3650   MALE
## 6  Adelie Torgersen             38.9            17.8               181        3625 FEMALE

Definition of the regression model

In the regression model, we will identify what is the relationship between variables in our data:

With the model, we will show what is the magnitude of the relationship and determine its statistical significance - how strong independent variables influence the value of dependent variable. We have choose variable body_mass_g as the dependent variable, and variables culmen_length_mm, culmen_depth_mm, flipper_length_mm and SexF as independent variables, because we want to see how they influence the body mass of a penguin.

Visual presentation of the data

scatterplotMatrix(mydata[ , c(-1,-2,-7,-8,-9,-10, -11)], 
                  smooth = FALSE)  

From the scatterplots we can observe, that variable columen_depth_mm is negatively correlated with all the other variables. Between all other variables (excluding columen_depth_mm) there is positive correlation.

mydata$sexF <- factor(mydata$sex,
                      levels = c("MALE", "FEMALE"),
                      labels = c("M", "F"))

Regression definition

With this regression I would like to find out how culmen length, culmen depth, flipper length and sex of penguin affect their body mass.

fit1 <- lm(body_mass_g ~ culmen_length_mm + culmen_depth_mm + flipper_length_mm + sexF,
           data = mydata)

mydata$StdResid <- round(rstandard(fit1), 3)
mydata$CooksD <- round(cooks.distance(fit1), 3)
mydata$StdFitValues <- round(fitted.values(fit1),3)

Regression assumptions

Multicolinearity
vif(fit1)
##  culmen_length_mm   culmen_depth_mm flipper_length_mm              sexF 
##          1.875634          2.686946          3.364086          1.916248
mean(vif(fit1))
## [1] 2.460728

VIF < 5 mean(VIF) > 1

Normality

-H0: Variable is normally distributed -H1: Variable is not normally distributed

hist(mydata$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(mydata$StdResid) #to check normality of the standard residuals
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.99623, p-value = 0.6164

Normality: from the Shapiro-Wilk normality test, we conclude, that the regression model doesn’t violate the normality assumption -> p-value > 0.05 -> H0 cannot be rejected.

Linearity
scatterplot(y = mydata$StdResid, x = mydata$StdFitValues,
            xlab = "Standardized residuals", 
            ylab = "Standardized fitted values",
            boxplots = FALSE,
            smooth = FALSE,
            regLine = TRUE)

Linearity: based on the scatterplot, we can assume that the linearity assumption is met.

Homoskedasticity
  • H0:The variance is constant
  • H1:The variance is not constant
ols_test_breusch_pagan(fit1)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                  Data                   
##  ---------------------------------------
##  Response : body_mass_g 
##  Variables: fitted values of body_mass_g 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    1.020442 
##  Prob > Chi2   =    0.3124142

Because p-value is bigger than 5%, we cannot reject H0.

Correcting the data

Removing outliers, and units with high impact
head(mydata[order(mydata$StdResid),], 10)
##       species    island culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g    sex sexF StdResid CooksD
## 185 Chinstrap     Dream             46.9            16.6               192        2700 FEMALE    F   -2.742  0.021
## 163 Chinstrap     Dream             50.3            20.0               197        3300   MALE    M   -2.256  0.017
## 124    Adelie Torgersen             44.1            18.0               210        4000   MALE    M   -2.222  0.009
## 202 Chinstrap     Dream             52.2            18.8               197        3450   MALE    M   -2.107  0.017
## 187 Chinstrap     Dream             49.0            19.5               210        3950   MALE    M   -1.962  0.013
## 168 Chinstrap     Dream             48.5            17.5               191        3400   MALE    M   -1.926  0.018
## 116    Adelie Torgersen             37.7            19.8               198        3500   MALE    M   -1.919  0.015
## 126    Adelie Torgersen             43.1            19.2               197        3500   MALE    M   -1.909  0.006
## 139    Adelie     Dream             37.3            16.8               192        3000 FEMALE    F   -1.865  0.006
## 99     Adelie    Biscoe             37.9            18.6               193        2925 FEMALE    F   -1.746  0.009
##     StdFitValues
## 185     3627.814
## 163     4062.354
## 124     4753.704
## 202     4161.236
## 187     4613.161
## 168     4048.811
## 116     4147.739
## 126     4147.991
## 139     3632.952
## 99      3515.422
head(mydata[order(-mydata$StdResid),], 10)
##       species island culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g    sex sexF StdResid CooksD
## 35     Adelie  Dream             39.8            19.1               184        4650   MALE    M    2.927  0.025
## 232    Gentoo Biscoe             49.2            15.2               221        6300   MALE    M    2.628  0.018
## 106    Adelie Biscoe             45.6            20.3               191        4600   MALE    M    2.319  0.014
## 290    Gentoo Biscoe             51.1            16.3               220        6000   MALE    M    2.147  0.010
## 228    Gentoo Biscoe             48.4            14.6               213        5850   MALE    M    2.066  0.017
## 41     Adelie  Dream             39.6            18.8               190        4600   MALE    M    2.010  0.009
## 109    Adelie Biscoe             39.6            20.7               191        3900 FEMALE    F    1.932  0.027
## 246    Gentoo Biscoe             45.1            14.5               207        5050 FEMALE    F    1.931  0.007
## 164 Chinstrap  Dream             58.0            17.8               181        3700 FEMALE    F    1.928  0.069
## 230    Gentoo Biscoe             49.3            15.7               217        5850   MALE    M    1.884  0.007
##     StdFitValues
## 35      3659.549
## 232     5409.959
## 106     3814.518
## 290     5272.012
## 228     5152.868
## 41      3918.796
## 109     3253.026
## 246     4395.178
## 164     3071.576
## 230     5211.379
mydata <- mydata[mydata$StdResid < 2.3 & mydata$StdResid > -2.3,]
hist(mydata$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

head(mydata[order(-mydata$CooksD),], 20)
##       species    island culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g    sex sexF StdResid CooksD
## 164 Chinstrap     Dream             58.0            17.8               181        3700 FEMALE    F    1.928  0.069
## 109    Adelie    Biscoe             39.6            20.7               191        3900 FEMALE    F    1.932  0.027
## 168 Chinstrap     Dream             48.5            17.5               191        3400   MALE    M   -1.926  0.018
## 163 Chinstrap     Dream             50.3            20.0               197        3300   MALE    M   -2.256  0.017
## 195 Chinstrap     Dream             51.5            18.7               187        3250   MALE    M   -1.590  0.017
## 202 Chinstrap     Dream             52.2            18.8               197        3450   MALE    M   -2.107  0.017
## 228    Gentoo    Biscoe             48.4            14.6               213        5850   MALE    M    2.066  0.017
## 116    Adelie Torgersen             37.7            19.8               198        3500   MALE    M   -1.919  0.015
## 137    Adelie     Dream             32.1            15.5               188        3050 FEMALE    F   -1.639  0.013
## 187 Chinstrap     Dream             49.0            19.5               210        3950   MALE    M   -1.962  0.013
## 247    Gentoo    Biscoe             59.6            17.0               230        6050   MALE    M    1.401  0.013
## 88     Adelie     Dream             39.6            18.1               186        4450   MALE    M    1.854  0.012
## 155 Chinstrap     Dream             46.0            18.9               195        4150 FEMALE    F    1.782  0.012
## 210 Chinstrap     Dream             55.8            19.8               207        4000   MALE    M   -1.355  0.011
## 10     Adelie Torgersen             34.6            21.1               198        4400   MALE    M    1.070  0.010
## 16     Adelie    Biscoe             37.8            18.3               174        3400 FEMALE    F    1.763  0.010
## 290    Gentoo    Biscoe             51.1            16.3               220        6000   MALE    M    2.147  0.010
## 321    Gentoo    Biscoe             48.1            15.1               209        5500   MALE    M    1.613  0.010
## 7      Adelie Torgersen             39.2            19.6               195        4675   MALE    M    1.860  0.009
## 41     Adelie     Dream             39.6            18.8               190        4600   MALE    M    2.010  0.009
##     StdFitValues
## 164     3071.576
## 109     3253.026
## 168     4048.811
## 163     4062.354
## 195     3783.216
## 202     4161.236
## 228     5152.868
## 116     4147.739
## 137     3601.672
## 187     4613.161
## 247     5580.214
## 88      3823.755
## 155     3548.384
## 210     4455.022
## 10      4043.043
## 16      2803.790
## 290     5272.012
## 321     4955.220
## 7       4044.986
## 41      3918.796
Removing outliers
mydata <- mydata[!(mydata$CooksD > 0.02),]

Estimation of the final model

Diagnostics

fit1 <- lm(body_mass_g ~ culmen_length_mm + culmen_depth_mm + flipper_length_mm + sexF,
           data = mydata)

summary(fit1)
## 
## Call:
## lm(formula = body_mass_g ~ culmen_length_mm + culmen_depth_mm + 
##     flipper_length_mm + sexF, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -742.08 -241.93    4.98  216.73  739.32 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1748.421    644.510  -2.713  0.00703 ** 
## culmen_length_mm     -3.645      4.668  -0.781  0.43543    
## culmen_depth_mm     -88.924     15.113  -5.884 1.01e-08 ***
## flipper_length_mm    39.295      2.403  16.351  < 2e-16 ***
## sexFF              -533.060     50.127 -10.634  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 324.8 on 322 degrees of freedom
## Multiple R-squared:  0.8366, Adjusted R-squared:  0.8346 
## F-statistic: 412.2 on 4 and 322 DF,  p-value: < 2.2e-16

Conclusion:

Partial regression coefficients: For the model we have four regression coefficients that present the slope:

  • culmen_length_mm: -3.645

  • culmen_depth_mm: -88.924

  • flipper_length_mm: 39.295

  • sexFF: -533.060

  • Variable culmen_length_mm have p-value > 0.05 -> therefore it is not statistically significant.

  • If column_depth_mm increases by 1mm, ceteris paribus, on average the body mass will decrease by 88,942 grams. Coefficient is significant at p < 0,001.

  • If flipper_length_mm increases by 1 mm, ceteris paribus, on average the body mass will increase by 39,295 grams. Coefficient is significant at p < 0,001.

  • If everything else stays the same, on average females have 533 grams less body mass than males. Coefficient is significant at p < 0,001.

Multiple coefficient of determination: The adjusted R-squared has the value of 0.8366, and it indicates that 84% of the variation within variable body_mass_gg, is explained by independent variables culmen_length_mm, culmen_depth_mm, flipper_length_mm and sexFF.In general it explains that our model is fitting data well.

Multiple correlation coefficient:

sqrt(summary(fit1)$r.squared)
## [1] 0.9146733

Calculated as the square root of R-squared, and it equals to 0.9. From this coefficient we can conclude, that the maximum degree of linear relationship between the variable body_mass_gg, and independent variables is 0.9 - strong relationship.

Variance of F-test:

The F-test, evaluates how good is the regression model.

  • H0: ρ2 = 0

  • H1: ρ2 > 0

In our case, the p-value is lower than 0.001, so we can reject the H0, and assume that the model is well structured, we found there is the relationship between dependent and at least one independent variable.