Tutorial for Diagnostics

Load all the packages we need:

library(ggplot2)
library(dplyr)
library(jtools)
library(car)
library(olsrr)
library(ggthemes)

Remove rows with missing values

df0<-read.csv("penguins.csv")
df0[df0 == ""] <- NA
df1<-na.omit(df0)
for(i in c(1,2,7)){
  df1[,i]<-factor(df1[,i])
}
str(df1)

'data.frame':   333 obs. of  7 variables:
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num  39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
 $ bill_depth_mm    : num  18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
 $ flipper_length_mm: int  181 186 195 193 190 181 195 182 191 198 ...
 $ body_mass_g      : int  3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
 $ sex              : Factor w/ 2 levels "FEMALE","MALE": 2 1 1 1 2 1 2 1 2 2 ...
 - attr(*, "na.action")= 'omit' Named int [1:11] 4 9 10 11 12 48 247 287 325 337 ...
  ..- attr(*, "names")= chr [1:11] "4" "9" "10" "11" ...

Build the regression model

mod0<-lm(body_mass_g~flipper_length_mm+bill_depth_mm+sex,data=df1)
summ(mod0)

Observations	333
Dependent variable	body_mass_g
Type	OLS linear regression

F(3,329)	509.48
R²	0.82
Adj. R²	0.82

	Est.	S.E.	t val.	p
(Intercept)	-2246.83	625.29	-3.59	0.00
flipper_length_mm	38.19	2.08	18.32	0.00
bill_depth_mm	-86.95	15.46	-5.63	0.00
sexMALE	538.08	51.31	10.49	0.00
Standard errors: OLS

Check Multicollinearity

###### VIF 
vif(mod0)

flipper_length_mm     bill_depth_mm               sex 
         2.444470          2.653895          1.891038

x<-vif(mod0)
x[1:3]<5

flipper_length_mm     bill_depth_mm               sex 
             TRUE              TRUE              TRUE

Since they are all smaller than 5, there’s no need to worry about multicollinearity

Specification

##### specification
hat<-fitted(mod0)
mod_link_test<-lm(body_mass_g~hat+I(hat^2),data = df1)
summary(mod_link_test)


Call:
lm(formula = body_mass_g ~ hat + I(hat^2), data = df1)

Residuals:
    Min      1Q  Median      3Q     Max 
-934.54 -232.15  -12.45  223.74 1003.14 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.489e+03  6.394e+02   3.893  0.00012 ***
hat         -1.802e-01  2.999e-01  -0.601  0.54851    
I(hat^2)     1.358e-04  3.439e-05   3.948 9.62e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 332.2 on 330 degrees of freedom
Multiple R-squared:  0.8309,    Adjusted R-squared:  0.8298 
F-statistic: 810.5 on 2 and 330 DF,  p-value: < 2.2e-16

##### hat is not statistically significant, 
##### hat_squared is statistically, we may say that the model may be misspecified

Check Heteroskedasticity

ols_test_breusch_pagan(mod0)


 Breusch Pagan Test for Heteroskedasticity
 -----------------------------------------
 Ho: the variance is constant            
 Ha: the variance is not constant        

                 Data                   
 ---------------------------------------
 Response : body_mass_g 
 Variables: fitted values of body_mass_g 

        Test Summary         
 ----------------------------
 DF            =    1 
 Chi2          =    0.9292917 
 Prob > Chi2   =    0.335047

Check Auto-correlation, in other words, if if erros are independently distributed

durbinWatsonTest(mod0)

 lag Autocorrelation D-W Statistic p-value
   1        0.083443      1.828663   0.092
 Alternative hypothesis: rho != 0

Since it is very close to 2, there’s no need to worry autocorrelation, we may say that the errors are indepently distributed

Diagonsotics Plot

plot(mod0)

ggplot(data=df1,aes(x=flipper_length_mm,y=body_mass_g)) +
  geom_point()+
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)+
  labs(x="Flipper Length (mm)",y="Body Mass (g)")+
  theme_stata()

ggplot(data = df1,aes(x=flipper_length_mm, y = body_mass_g)) +
  geom_point()+  
  geom_smooth(method = "lm",formula=y ~ x,se = FALSE)+  
  labs(x="Flipper Length (mm)",y="Body Mass (g)")+  
  facet_wrap(~ sex, ncol = 2)+
  theme_gdocs()+ggtitle("By Sex")

body_mass_male<-df1[df1$sex=="MALE",]['body_mass_g']
body_mass_female<-df1[df1$sex=="FEMALE",]['body_mass_g']
t.test(body_mass_male,body_mass_female,var.equal=FALSE)


    Welch Two Sample t-test

data:  body_mass_male and body_mass_female
t = 8.5545, df = 323.9, p-value = 4.794e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 526.2453 840.5783
sample estimates:
mean of x mean of y 
 4545.685  3862.273