class: center, middle, inverse, title-slide

.title[
# Seminar Week 7
]
.subtitle[
## Recap of Statistical Modelling Techniques
]
.author[
### Ingmar Staude
]

---

# Recap of Statistical Models

What statistical tools do you know by now?

--

- t-test (compare two groups)
- Simple linear regression (one predictor, one outcome)
- ANOVA (compare more than two groups)
- ANCOVA (compare groups while controlling for covariates)
- Separate-slope model (different slopes per group)
- Multiple regression (multiple predictors)

---

# Which model?

You want to know the relationship between predator abundance and prey population size.

--

```r
lm(prey_abundance ~ predator_abundance)
```

---

# Which model?

Are plant communities in novel ecosystems less diverse on average than in historical reference ecosystems?

--

```r
t.test(novel, reference, alternative = "less")
```

```
## 
## 	Welch Two Sample t-test
## 
## data:  novel and reference
## t = -6.3206, df = 37.082, p-value = 1.146e-07
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -3.974375
## sample estimates:
## mean of x mean of y 
##  12.42487  17.84623
```

---

# Alternative model

Can we answer the same question with a different model?

--

```r
model <- lm(diversity ~ ecosystem, data = d)
```

<small>

```r
summary(model)
```

```
## 
## Call:
## lm(formula = diversity ~ ecosystem, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3247 -1.8907 -0.0649  1.9166  4.9359 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         12.4249     0.6065  20.486  < 2e-16 ***
## ecosystemReference   5.4214     0.8577   6.321 2.07e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.712 on 38 degrees of freedom
## Multiple R-squared:  0.5125,	Adjusted R-squared:  0.4997 
## F-statistic: 39.95 on 1 and 38 DF,  p-value: 2.07e-07
```

</small>

---

# Which model?

You want to test whether planting native wildflowers increases pollinator visits in gardens, whilst controlling for surrounding green space.

- Treatment = gardens with wildflower mix
- Control = gardens without wildflower mix
- Plus surrounding green space as covariate

--

```r
lm(pollinator_visits ~ treatment + green_space)
```

---

# Which model?

Is the average time until a conservation policy shows measurable ecological effects longer than a legislative period?

--

```r
t.test(policy_delay, mu = 4) # assuming 4 years as the legislative period
# two-sided by default; alternative = "greater" would test "longer than" directly
```

```
## 
## 	One Sample t-test
## 
## data:  policy_delay
## t = 5.249, df = 19, p-value = 4.573e-05
## alternative hypothesis: true mean is not equal to 4
## 95 percent confidence interval:
##  5.372805 7.193690
## sample estimates:
## mean of x 
##  6.283248
```

---

# Alternative model

Can we answer the same question with a different model?

--

```r
model <- lm(policy_delay ~ 1)
summary(model)
```

```
## 
## Call:
## lm(formula = policy_delay ~ 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2165 -1.2704 -0.0433  0.8142  3.2906 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.283      0.435   14.45 1.07e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.945 on 19 degrees of freedom
```

---

# Which model?

Does plant species richness influence ecosystem productivity *differently* under nutrient enrichment?

--

```r
lm(biomass ~ richness * treatment)
```

---

# Goodness of fit

You’re told: **SSE = 10,000**, **TSS = 100,000**.

- Compute R². What does it mean in plain language?
- Is that enough to claim the model is “good”? Why / why not?
- What does a good metric need to take into account?

---

# Dummy variables

A categorical variable has three groups (A, B, C).

Write an ANOVA model for this predictor in math form using dummy variables.

---
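
# Dummy variables: a sketch in R

One way to check a hand-written dummy-variable model is to look at the design matrix R builds for a three-level factor. This is a minimal sketch on made-up toy data: the names `d`, `group` and `y` and all values are hypothetical and not from the seminar data. It assumes A is the reference level, which is R's default because factor levels are sorted alphabetically.

```r
# Hypothetical toy data: three groups (A, B, C), three made-up observations each
d <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 3)),
  y     = c(4, 5, 6, 7, 8, 9, 1, 2, 3)
)

# Treatment (dummy) coding: A is the reference level, so the design matrix
# holds an intercept column plus one 0/1 column each for B and C
X <- model.matrix(~ group, data = d)
colnames(X)

# Fitted model: y = b0 + b1 * groupB + b2 * groupC + error
fit <- lm(y ~ group, data = d)
coef(fit)
```

With this coding, the intercept estimates the mean of group A, and the two dummy coefficients estimate how far the means of B and C lie from it.

---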