library(ggplot2)
library(dplyr)
library(jtools)
library(car)
library(olsrr)
library(ggthemes)Tutorial for Diagnostics
Load all the packages we need:
Remove rows with missing values
df0<-read.csv("penguins.csv")
df0[df0 == ""] <- NA
df1<-na.omit(df0)
for(i in c(1,2,7)){
df1[,i]<-factor(df1[,i])
}
str(df1)'data.frame': 333 obs. of 7 variables:
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
$ bill_depth_mm : num 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
$ flipper_length_mm: int 181 186 195 193 190 181 195 182 191 198 ...
$ body_mass_g : int 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
$ sex : Factor w/ 2 levels "FEMALE","MALE": 2 1 1 1 2 1 2 1 2 2 ...
- attr(*, "na.action")= 'omit' Named int [1:11] 4 9 10 11 12 48 247 287 325 337 ...
..- attr(*, "names")= chr [1:11] "4" "9" "10" "11" ...
Build the regression model
mod0<-lm(body_mass_g~flipper_length_mm+bill_depth_mm+sex,data=df1)
summ(mod0)| Observations | 333 |
| Dependent variable | body_mass_g |
| Type | OLS linear regression |
| F(3,329) | 509.48 |
| R² | 0.82 |
| Adj. R² | 0.82 |
| Est. | S.E. | t val. | p | |
|---|---|---|---|---|
| (Intercept) | -2246.83 | 625.29 | -3.59 | 0.00 |
| flipper_length_mm | 38.19 | 2.08 | 18.32 | 0.00 |
| bill_depth_mm | -86.95 | 15.46 | -5.63 | 0.00 |
| sexMALE | 538.08 | 51.31 | 10.49 | 0.00 |
| Standard errors: OLS |
Check Multicollinearity
###### VIF
vif(mod0)flipper_length_mm bill_depth_mm sex
2.444470 2.653895 1.891038
x<-vif(mod0)
x[1:3]<5flipper_length_mm bill_depth_mm sex
TRUE TRUE TRUE
Since they are all smaller than 5, there’s no need to worry about multicollinearity
Specification
##### specification
hat<-fitted(mod0)
mod_link_test<-lm(body_mass_g~hat+I(hat^2),data = df1)
summary(mod_link_test)
Call:
lm(formula = body_mass_g ~ hat + I(hat^2), data = df1)
Residuals:
Min 1Q Median 3Q Max
-934.54 -232.15 -12.45 223.74 1003.14
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.489e+03 6.394e+02 3.893 0.00012 ***
hat -1.802e-01 2.999e-01 -0.601 0.54851
I(hat^2) 1.358e-04 3.439e-05 3.948 9.62e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 332.2 on 330 degrees of freedom
Multiple R-squared: 0.8309, Adjusted R-squared: 0.8298
F-statistic: 810.5 on 2 and 330 DF, p-value: < 2.2e-16
##### hat is not statistically significant,
##### hat_squared is statistically, we may say that the model may be misspecifiedCheck Heteroskedasticity
ols_test_breusch_pagan(mod0)
Breusch Pagan Test for Heteroskedasticity
-----------------------------------------
Ho: the variance is constant
Ha: the variance is not constant
Data
---------------------------------------
Response : body_mass_g
Variables: fitted values of body_mass_g
Test Summary
----------------------------
DF = 1
Chi2 = 0.9292917
Prob > Chi2 = 0.335047
Check Auto-correlation, in other words, if if erros are independently distributed
durbinWatsonTest(mod0) lag Autocorrelation D-W Statistic p-value
1 0.083443 1.828663 0.092
Alternative hypothesis: rho != 0
Since it is very close to 2, there’s no need to worry autocorrelation, we may say that the errors are indepently distributed
Diagonsotics Plot
plot(mod0)ggplot(data=df1,aes(x=flipper_length_mm,y=body_mass_g)) +
geom_point()+
geom_smooth(method = "lm", formula = y ~ x, se = FALSE)+
labs(x="Flipper Length (mm)",y="Body Mass (g)")+
theme_stata() ggplot(data = df1,aes(x=flipper_length_mm, y = body_mass_g)) +
geom_point()+
geom_smooth(method = "lm",formula=y ~ x,se = FALSE)+
labs(x="Flipper Length (mm)",y="Body Mass (g)")+
facet_wrap(~ sex, ncol = 2)+
theme_gdocs()+ggtitle("By Sex")body_mass_male<-df1[df1$sex=="MALE",]['body_mass_g']
body_mass_female<-df1[df1$sex=="FEMALE",]['body_mass_g']
t.test(body_mass_male,body_mass_female,var.equal=FALSE)
Welch Two Sample t-test
data: body_mass_male and body_mass_female
t = 8.5545, df = 323.9, p-value = 4.794e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
526.2453 840.5783
sample estimates:
mean of x mean of y
4545.685 3862.273