FOR3012 Statistics Module lab 5

Author

Patrick James & Jack Goldman

Published

October 19, 2023

Logistic regression, multiple regression, and model selection

Logistic regression summary

Logistic regression is a modelling technique used when your response variable is binary (0-1) or presented as a proportion between 0 and 1.
Logistic regression is an example of a generalized linear model that cannot be fit using ordinary least squares (OLS).
Instead, the principle of maximum likelihood (beyond the scope of this course) is used to estimate parameters.
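Raw coefficient estimates are on the log-odds (logit) scale. Exponentiating a coefficient using the mathematical constant e (i.e., exp(x)) gives the odds ratio associated with a one-unit increase in the predictor. To understand the effect of the predictor on the probability of an event (e.g., a success occurring), one must apply the inverse logit, exp(x) / (1 + exp(x)), to the full linear predictor (intercept plus coefficient times predictor value).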
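
The conversion from a logistic regression coefficient to an odds ratio and a probability can be sketched as follows. This is a minimal illustration in Python using only the standard library; the intercept and slope are made-up values, not estimates from any data set, and the function name inverse_logit is chosen here for illustration.

```python
import math

def inverse_logit(eta):
    """Convert a linear predictor on the log-odds scale to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical fitted values, chosen only for illustration.
intercept = -1.5  # log-odds of success when the predictor equals 0
slope = 0.8       # change in log-odds per one-unit increase in the predictor

# Exponentiating the slope gives the odds ratio per one-unit increase.
odds_ratio = math.exp(slope)

# Probabilities come from the inverse logit of the full linear predictor.
p_at_0 = inverse_logit(intercept)
p_at_1 = inverse_logit(intercept + slope)

print("odds ratio:", round(odds_ratio, 3))
print("P(success) at x = 0:", round(p_at_0, 3))
print("P(success) at x = 1:", round(p_at_1, 3))
```

Note that the odds ratio is constant across the range of the predictor, whereas the change in probability per unit depends on where along the curve you start.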

Multiple regression summary

Multiple regression is a technique for using several predictor variables to model a single response variable.
The same principles and assumptions from simple linear regression also apply here.
Predictors should be standardized (e.g., using the scale function) prior to being used in a regression model so that coefficients are comparable across predictors with different units of measurement and different variances.
One can test the significance of an entire multiple regression model using an F test.
If the overall model is significant, we can assess and interpret the significance of the individual predictors.
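
Standardizing a predictor means subtracting its mean and dividing by its standard deviation, which is what the R scale function does by default. A minimal Python sketch of the same arithmetic is given below; the two predictor vectors are made-up values for illustration, and the helper name standardize is an assumption of this sketch.

```python
import statistics

def standardize(values):
    """Center on the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 in the denominator
    return [(v - mean) / sd for v in values]

# Made-up predictors on very different scales (illustration only).
elevation_m = [100.0, 250.0, 400.0, 550.0, 700.0]
latitude_deg = [41.9, 42.3, 42.7, 43.5, 44.6]

elevation_z = standardize(elevation_m)
latitude_z = standardize(latitude_deg)

# After standardizing, both predictors have mean 0 and standard deviation 1,
# so a regression coefficient is the change in the response per one SD of the predictor.
print([round(z, 2) for z in elevation_z])
print([round(z, 2) for z in latitude_z])
```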

Model selection

Model selection is the process of identifying the “best” model from a set of potential candidate models.
Model selection is typically based on two premises: parsimony, and maximizing the variance explained while minimizing the number of predictors retained. The two premises are related.
One can compare general linear models fit using OLS with adjusted R2 (AdjR2).
One can compare general and generalized linear models fit using maximum likelihood (e.g., logistic regressions) with AIC. AIC can also be used to compare models fit with OLS.
AIC model selection can be forward, backward, or stepwise.
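
Both criteria reward fit and penalize extra predictors. The sketch below computes adjusted R2 and AIC for two candidate linear models in Python, using only the standard library. It assumes normally distributed errors, so the maximized log-likelihood can be obtained from the residual sum of squares, and it counts the intercept, the slopes, and the error variance as estimated parameters. The sample size and sums of squares are made-up numbers, not results from the ant data.

```python
import math

def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def gaussian_loglik(rss, n):
    """Maximized log-likelihood of a linear model with normal errors."""
    return -0.5 * n * (math.log(2.0 * math.pi) + math.log(rss / n) + 1.0)

def aic(loglik, k):
    """AIC = 2k - 2 ln(L), where k is the number of estimated parameters."""
    return 2.0 * k - 2.0 * loglik

# Hypothetical values for illustration only.
n = 40       # number of observations
tss = 20.0   # total sum of squares of the response
candidates = {"A": (1, 12.0), "B": (3, 11.5)}  # name: (predictors, residual SS)

results = {}
for name, (p, rss) in candidates.items():
    r2 = 1.0 - rss / tss
    k = p + 2  # p slopes + intercept + error variance
    results[name] = (adjusted_r2(r2, n, p), aic(gaussian_loglik(rss, n), k))
    print(name, "AdjR2:", round(results[name][0], 3), "AIC:", round(results[name][1], 2))
```

In this made-up case the two extra predictors in model B reduce the residual sum of squares only slightly, so the simpler model A has both the higher adjusted R2 and the lower AIC.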

EXERCISE

For today’s exercise, we will be working with the ant data that were presented in lecture. These data are taken from Gotelli and Ellison (2002).

Gotelli, N. J., & Ellison, A. M. (2002). Biogeography at a regional scale: determinants of ant species density in New England bogs and forests. Ecology, 83(6), 1604-1609.

The file on Quercus is called “Lab4_ExerciseData_Ants.xlsx”

The data contain six columns:

Site = a three-letter acronym for a location (not important for us)
Lat. = latitude
Long. = longitude
Elev. = elevation
nSpp = number of ant species collected at each location (response variable of interest)
Habitat = a binary variable, either “forest” or “bog”. Note that both habitat types were sampled at every site.

Using the script uploaded to Quercus entitled “FOR3012_AntsScript.r”, follow the steps indicated to conduct an analysis of these data.

ASSIGNMENT 4 – to be done individually, or in pairs.

While the analysis above explored many different model combinations, it did not consider the role of habitat (bog vs. forest). Extend the analysis you just carried out in the exercise by including habitat as a potential predictor, and identify the best model using AIC. Note that a model that contains only Habitat would be identical to an ANOVA; recall that you can also specify an ANOVA model using the lm function.
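
The equivalence between a regression on a single two-level factor and a comparison of group means can be seen with a small numerical sketch. The Python code below (standard library only) dummy-codes habitat as 0/1 and fits a least-squares line by hand. The species counts are made-up numbers for illustration and are not the Gotelli and Ellison data.

```python
import statistics

# Made-up species counts (illustration only).
bog = [3.0, 5.0, 4.0, 6.0]
forest = [8.0, 10.0, 9.0, 13.0]

# Dummy-code habitat: bog = 0 (reference level), forest = 1.
x = [0.0] * len(bog) + [1.0] * len(forest)
y = bog + forest

# Closed-form least-squares estimates for a single predictor.
x_bar = statistics.mean(x)
y_bar = statistics.mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# The intercept is the mean of the reference group and the slope is the
# difference between the two group means, which is what an ANOVA compares.
print("intercept:", intercept, "bog mean:", statistics.mean(bog))
print("slope:", slope, "difference in means:", statistics.mean(forest) - statistics.mean(bog))
```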

In this context, imagine that you have been asked to prepare a report on ant diversity in this area. In your report, include the following elements: description of the context and question being addressed, the nature of the data being used to investigate the question, the method that has been selected, the null hypothesis and assumptions of the model, other pertinent analytical steps (e.g., transformations and scaling) and why they were chosen, your results (e.g., summary table of the final model), and some succinct interpretations and conclusions. Include figures and tables as necessary to illustrate your point.

Please keep your report concise and under 2 pages.

For work done in pairs, a short (2-3 sentence) statement of each author’s contributions is required.

The grading rubric can be found below:


	RUBRIC Element		Points

	Introduction – context and question		10

	Data description – simple – needn’t be complex		5

	Methods: linear regression (1 point); null hypothesis (1.5 points); assumptions (1 point); note on log transformation (1.5 points)		5

	Graphs of individual relationships between response and predictors with appropriate labels and captions		5

	Results - final model table – clearly presented with appropriate caption		5

	Clear interpretation and conclusions that match results.		10

	Overall clarity and organization		5

	Clear authorship contribution statement – make sure all are equally contributing		5

	TOTAL		50