Source of data: Kaggle (https://www.kaggle.com/code/parulpandey/penguin-dataset-the-new-iris/notebook#Flipperlength-distribution)
mydata <- read.csv("~/Desktop/R STUDIO/penguins_size.csv", header = TRUE, sep = ",", dec = ".")
mydata <- drop_na(mydata)
mydata <- mydata[mydata$sex != ".",]
Description of variables
Research question: How factors: culmen length, culmen depth, flipper length and sex influence the body mass of a penguin?
head(mydata)
## species island culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex
## 1 Adelie Torgersen 39.1 18.7 181 3750 MALE
## 2 Adelie Torgersen 39.5 17.4 186 3800 FEMALE
## 3 Adelie Torgersen 40.3 18.0 195 3250 FEMALE
## 4 Adelie Torgersen 36.7 19.3 193 3450 FEMALE
## 5 Adelie Torgersen 39.3 20.6 190 3650 MALE
## 6 Adelie Torgersen 38.9 17.8 181 3625 FEMALE
In the regression model, we will identify what is the relationship between variables in our data:
With the model, we will show what is the magnitude of the relationship and determine its statistical significance - how strong independent variables influence the value of dependent variable. We have choose variable body_mass_g as the dependent variable, and variables culmen_length_mm, culmen_depth_mm, flipper_length_mm and SexF as independent variables, because we want to see how they influence the body mass of a penguin.
scatterplotMatrix(mydata[ , c(-1,-2,-7,-8,-9,-10, -11)],
smooth = FALSE)
From the scatterplots we can observe, that variable columen_depth_mm is negatively correlated with all the other variables. Between all other variables (excluding columen_depth_mm) there is positive correlation.
mydata$sexF <- factor(mydata$sex,
levels = c("MALE", "FEMALE"),
labels = c("M", "F"))
With this regression I would like to find out how culmen length, culmen depth, flipper length and sex of penguin affect their body mass.
fit1 <- lm(body_mass_g ~ culmen_length_mm + culmen_depth_mm + flipper_length_mm + sexF,
data = mydata)
mydata$StdResid <- round(rstandard(fit1), 3)
mydata$CooksD <- round(cooks.distance(fit1), 3)
mydata$StdFitValues <- round(fitted.values(fit1),3)
vif(fit1)
## culmen_length_mm culmen_depth_mm flipper_length_mm sexF
## 1.875634 2.686946 3.364086 1.916248
mean(vif(fit1))
## [1] 2.460728
VIF < 5 mean(VIF) > 1
-H0: Variable is normally distributed -H1: Variable is not normally distributed
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydata$StdResid) #to check normality of the standard residuals
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.99623, p-value = 0.6164
Normality: from the Shapiro-Wilk normality test, we conclude, that the regression model doesn’t violate the normality assumption -> p-value > 0.05 -> H0 cannot be rejected.
scatterplot(y = mydata$StdResid, x = mydata$StdFitValues,
xlab = "Standardized residuals",
ylab = "Standardized fitted values",
boxplots = FALSE,
smooth = FALSE,
regLine = TRUE)
Linearity: based on the scatterplot, we can assume that the linearity assumption is met.
ols_test_breusch_pagan(fit1)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------------
## Response : body_mass_g
## Variables: fitted values of body_mass_g
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 1.020442
## Prob > Chi2 = 0.3124142
Because p-value is bigger than 5%, we cannot reject H0.
head(mydata[order(mydata$StdResid),], 10)
## species island culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex sexF StdResid CooksD
## 185 Chinstrap Dream 46.9 16.6 192 2700 FEMALE F -2.742 0.021
## 163 Chinstrap Dream 50.3 20.0 197 3300 MALE M -2.256 0.017
## 124 Adelie Torgersen 44.1 18.0 210 4000 MALE M -2.222 0.009
## 202 Chinstrap Dream 52.2 18.8 197 3450 MALE M -2.107 0.017
## 187 Chinstrap Dream 49.0 19.5 210 3950 MALE M -1.962 0.013
## 168 Chinstrap Dream 48.5 17.5 191 3400 MALE M -1.926 0.018
## 116 Adelie Torgersen 37.7 19.8 198 3500 MALE M -1.919 0.015
## 126 Adelie Torgersen 43.1 19.2 197 3500 MALE M -1.909 0.006
## 139 Adelie Dream 37.3 16.8 192 3000 FEMALE F -1.865 0.006
## 99 Adelie Biscoe 37.9 18.6 193 2925 FEMALE F -1.746 0.009
## StdFitValues
## 185 3627.814
## 163 4062.354
## 124 4753.704
## 202 4161.236
## 187 4613.161
## 168 4048.811
## 116 4147.739
## 126 4147.991
## 139 3632.952
## 99 3515.422
head(mydata[order(-mydata$StdResid),], 10)
## species island culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex sexF StdResid CooksD
## 35 Adelie Dream 39.8 19.1 184 4650 MALE M 2.927 0.025
## 232 Gentoo Biscoe 49.2 15.2 221 6300 MALE M 2.628 0.018
## 106 Adelie Biscoe 45.6 20.3 191 4600 MALE M 2.319 0.014
## 290 Gentoo Biscoe 51.1 16.3 220 6000 MALE M 2.147 0.010
## 228 Gentoo Biscoe 48.4 14.6 213 5850 MALE M 2.066 0.017
## 41 Adelie Dream 39.6 18.8 190 4600 MALE M 2.010 0.009
## 109 Adelie Biscoe 39.6 20.7 191 3900 FEMALE F 1.932 0.027
## 246 Gentoo Biscoe 45.1 14.5 207 5050 FEMALE F 1.931 0.007
## 164 Chinstrap Dream 58.0 17.8 181 3700 FEMALE F 1.928 0.069
## 230 Gentoo Biscoe 49.3 15.7 217 5850 MALE M 1.884 0.007
## StdFitValues
## 35 3659.549
## 232 5409.959
## 106 3814.518
## 290 5272.012
## 228 5152.868
## 41 3918.796
## 109 3253.026
## 246 4395.178
## 164 3071.576
## 230 5211.379
mydata <- mydata[mydata$StdResid < 2.3 & mydata$StdResid > -2.3,]
hist(mydata$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
head(mydata[order(-mydata$CooksD),], 20)
## species island culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex sexF StdResid CooksD
## 164 Chinstrap Dream 58.0 17.8 181 3700 FEMALE F 1.928 0.069
## 109 Adelie Biscoe 39.6 20.7 191 3900 FEMALE F 1.932 0.027
## 168 Chinstrap Dream 48.5 17.5 191 3400 MALE M -1.926 0.018
## 163 Chinstrap Dream 50.3 20.0 197 3300 MALE M -2.256 0.017
## 195 Chinstrap Dream 51.5 18.7 187 3250 MALE M -1.590 0.017
## 202 Chinstrap Dream 52.2 18.8 197 3450 MALE M -2.107 0.017
## 228 Gentoo Biscoe 48.4 14.6 213 5850 MALE M 2.066 0.017
## 116 Adelie Torgersen 37.7 19.8 198 3500 MALE M -1.919 0.015
## 137 Adelie Dream 32.1 15.5 188 3050 FEMALE F -1.639 0.013
## 187 Chinstrap Dream 49.0 19.5 210 3950 MALE M -1.962 0.013
## 247 Gentoo Biscoe 59.6 17.0 230 6050 MALE M 1.401 0.013
## 88 Adelie Dream 39.6 18.1 186 4450 MALE M 1.854 0.012
## 155 Chinstrap Dream 46.0 18.9 195 4150 FEMALE F 1.782 0.012
## 210 Chinstrap Dream 55.8 19.8 207 4000 MALE M -1.355 0.011
## 10 Adelie Torgersen 34.6 21.1 198 4400 MALE M 1.070 0.010
## 16 Adelie Biscoe 37.8 18.3 174 3400 FEMALE F 1.763 0.010
## 290 Gentoo Biscoe 51.1 16.3 220 6000 MALE M 2.147 0.010
## 321 Gentoo Biscoe 48.1 15.1 209 5500 MALE M 1.613 0.010
## 7 Adelie Torgersen 39.2 19.6 195 4675 MALE M 1.860 0.009
## 41 Adelie Dream 39.6 18.8 190 4600 MALE M 2.010 0.009
## StdFitValues
## 164 3071.576
## 109 3253.026
## 168 4048.811
## 163 4062.354
## 195 3783.216
## 202 4161.236
## 228 5152.868
## 116 4147.739
## 137 3601.672
## 187 4613.161
## 247 5580.214
## 88 3823.755
## 155 3548.384
## 210 4455.022
## 10 4043.043
## 16 2803.790
## 290 5272.012
## 321 4955.220
## 7 4044.986
## 41 3918.796
mydata <- mydata[!(mydata$CooksD > 0.02),]
fit1 <- lm(body_mass_g ~ culmen_length_mm + culmen_depth_mm + flipper_length_mm + sexF,
data = mydata)
summary(fit1)
##
## Call:
## lm(formula = body_mass_g ~ culmen_length_mm + culmen_depth_mm +
## flipper_length_mm + sexF, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -742.08 -241.93 4.98 216.73 739.32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1748.421 644.510 -2.713 0.00703 **
## culmen_length_mm -3.645 4.668 -0.781 0.43543
## culmen_depth_mm -88.924 15.113 -5.884 1.01e-08 ***
## flipper_length_mm 39.295 2.403 16.351 < 2e-16 ***
## sexFF -533.060 50.127 -10.634 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 324.8 on 322 degrees of freedom
## Multiple R-squared: 0.8366, Adjusted R-squared: 0.8346
## F-statistic: 412.2 on 4 and 322 DF, p-value: < 2.2e-16
Conclusion:
Partial regression coefficients: For the model we have four regression coefficients that present the slope:
culmen_length_mm: -3.645
culmen_depth_mm: -88.924
flipper_length_mm: 39.295
sexFF: -533.060
Variable culmen_length_mm have p-value > 0.05 -> therefore it is not statistically significant.
If column_depth_mm increases by 1mm, ceteris paribus, on average the body mass will decrease by 88,942 grams. Coefficient is significant at p < 0,001.
If flipper_length_mm increases by 1 mm, ceteris paribus, on average the body mass will increase by 39,295 grams. Coefficient is significant at p < 0,001.
If everything else stays the same, on average females have 533 grams less body mass than males. Coefficient is significant at p < 0,001.
Multiple coefficient of determination: The adjusted R-squared has the value of 0.8366, and it indicates that 84% of the variation within variable body_mass_gg, is explained by independent variables culmen_length_mm, culmen_depth_mm, flipper_length_mm and sexFF.In general it explains that our model is fitting data well.
Multiple correlation coefficient:
sqrt(summary(fit1)$r.squared)
## [1] 0.9146733
Calculated as the square root of R-squared, and it equals to 0.9. From this coefficient we can conclude, that the maximum degree of linear relationship between the variable body_mass_gg, and independent variables is 0.9 - strong relationship.
Variance of F-test:
The F-test, evaluates how good is the regression model.
H0: ρ2 = 0
H1: ρ2 > 0
In our case, the p-value is lower than 0.001, so we can reject the H0, and assume that the model is well structured, we found there is the relationship between dependent and at least one independent variable.