Attaching package: 'data.table'
The following objects are masked from 'package:dplyr':
between, first, last
The following object is masked from 'package:purrr':
transpose
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
4 Adelie Torgersen NA NA NA NA <NA> 2007
5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
# … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
penguins<-drop_na(penguins)penguins$year=as.factor(penguins$year) #we are interested in year as a grouping/categorical variable so we will make it a factor
What is the effect of year on bill depth by species?
We will start with a boxplot for a quick check. We would eventually want to calculate means and error bars for the final visualization though! Note that the graph below is a good way to view the interaction of our explanatory variables, which is not what we modeled… We only consider the additive effects (each variable on its own)
# effect of year on bill depth by speciesggplot(data=penguins, aes(x=as.factor(year), y=bill_depth_mm, color=species))+geom_boxplot()+theme_classic()
-t / pvalue tells us whether there is a sig association between the predictor and the outcome variable…
-in stats terms, this tells us whether the beta coef of predictor is significantly different form zero
-coefficient can be interpreted as average effect on y of a one unit increase in predictor, holding all other predictors fixed
Here, we have an additive model and we see from the anova table and the lm summary that there are significant effects of species and sex on bill depth but that there is not effect on year. Next, let’s look at the data again to confirm!
Here, we want to know how well the model represent the data. We need: 1. The R2 value of the model (closer to 1 is best) 2. The p-value of the model (<0.05 is required for there to be a relationship) 3. We can calculate residual standard error. Lower = more accurate!
The R2 and p are in the summary! Below is the formula for RMSE
summary(lm1)
Call:
lm(formula = bill_depth_mm ~ species + sex + year, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-2.07890 -0.56431 -0.00782 0.48485 3.12581
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.71835 0.10760 164.672 <2e-16 ***
speciesChinstrap 0.05649 0.12232 0.462 0.6445
speciesGentoo -3.36375 0.10268 -32.760 <2e-16 ***
sexmale 1.50471 0.09107 16.523 <2e-16 ***
year2008 -0.21053 0.11374 -1.851 0.0651 .
year2009 -0.14416 0.11239 -1.283 0.2005
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8308 on 327 degrees of freedom
Multiple R-squared: 0.8247, Adjusted R-squared: 0.822
F-statistic: 307.6 on 5 and 327 DF, p-value: < 2.2e-16
#RSE: <- LOWER RSE= more accurate the model!sigma(lm1)
[1] 0.8308337
mean(penguins$bill_depth_mm)
[1] 17.16486
sigma(lm1)/mean(penguins$bill_depth_mm)
[1] 0.04840316
#0.048, or 4.8% error rate
We can also get this information from the performance package using model_performance. This function tells us many things, including R2 and RMSE. We will discuss the rest of this later
Note that there are many ways to build a dataframe and plot for these. This is just one example. Here we can visualize that the effects of each variable individually are not very large.
lm2g<-lm2 %>%augment() %>%ggplot(aes(x=bill_length_mm, y=bill_depth_mm, color=species))+geom_point()+geom_line(aes(y=.fitted))+theme_classic()lm3g<-lm3 %>%augment() %>%ggplot(aes(x=bill_length_mm, y=bill_depth_mm, color=species))+geom_point()+geom_line(aes(y=.fitted))+theme_classic()lm2g/lm3g #lm2 has same y int for all! lm3 does not (because of the interaction term!!!)
check_model(lm3) #things look good, but we have super high VIF
Variable `Component` is not in your data frame :/
check_collinearity(lm3) #a table of collinearity results - we would need to remove stuff from the model until the model has VIG that are all below 5 or so (below 10 is fine )
Warning: Model has interaction terms. VIFs might be inflated. You may check
multicollinearity among predictors of a model without interaction terms.
# Check for Multicollinearity
Moderate Correlation
Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
bill_length_mm 9.66 [7.94, 11.81] 3.11 0.10 [0.08, 0.13]
High Correlation
Term VIF VIF 95% CI Increased SE Tolerance
species 49899.43 [40473.88, 61520.04] 223.38 2.00e-05
bill_length_mm:species 64717.85 [52493.21, 79789.42] 254.40 1.55e-05
Tolerance 95% CI
[0.00, 0.00]
[0.00, 0.00]
vif(lm3) #gives us a more useful table- tells us that species and species*bill length are looking bad. SO BAD that we cannot run this model....
there are higher-order terms (interactions) in this model
consider setting type = 'predictor'; see ?vif
What do we do? We remove interaction terms one by one, thereby simplifying the model, until the VIF are low enough to be meaningful (all below 5 is a good rule of thumb)
since the model without the * is just lm2, we are all set.
-This is a comprehensive model check that uses many vars to assess the best model.
-Here, lm2 wins easily! R2 are about the same, RMSE (residual mean square error) is about the same.
-AICweights tell us thje relative likelihood of a model– closer to 1 is best. ***when you look at AIC scores (slightly different from AIC weights, the lower the value, the better)
lm4<-lm(bill_depth_mm ~ bill_length_mm * species * sex, data=penguins)#look at summarysummary(lm4)
# make a pretty graph!lm4 %>%augment() %>%ggplot(aes(x=bill_length_mm, y=bill_depth_mm,color=species))+geom_point()+geom_line(aes(y=.fitted))+theme_classic()
#oops, that isn't quite right. What are we missing?lm4g2<-lm4 %>%augment() %>%ggplot(aes(x=bill_length_mm, y=bill_depth_mm,color=species, shape=sex))+geom_point()+geom_line(aes(y=.fitted))+theme_classic()#ORlm4g3<-lm4 %>%augment() %>%ggplot(aes(x=bill_length_mm, y=bill_depth_mm,color=species, shape=sex))+geom_point()+geom_line(aes(y=.fitted))+theme_classic()+facet_wrap(~sex)#compare graphs!lm4g2/lm4g3