Week 4 Summary

This week we continued to work with Poisson Regression. An important step in doing this is understanding when and why to use Poisson Regression over ordinary linear regression.

Recall the four assumptions of Poisson Regression:
1. Response variable is a count per unit of time or space, described by a Poisson Distribution.
2. The observations must be independent of one another.
3. The mean of a Poisson random variable must be equal to its variance.
4. The log of the mean rate lambda must be a linear function of x.
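
Written out for a single predictor x, the four assumptions can be summarized in one line. This is only a compact restatement: the observation index i and the coefficient names beta_0 and beta_1 are notation assumed here, not taken from the notes above.

```latex
Y_i \overset{\text{indep.}}{\sim} \text{Poisson}(\lambda_i), \qquad
E(Y_i) = \operatorname{Var}(Y_i) = \lambda_i, \qquad
\log(\lambda_i) = \beta_0 + \beta_1 x_i
```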

Assumption #1:
Whenever the response variable is a count, Poisson Regression is likely to be appropriate. Recall that a count takes whole-number values from 0 upward, with no upper bound. A Poisson Distribution is also right skewed, most noticeably when its mean is small. For example, see this histogram of a good candidate response for Poisson Regression.

library(faraway)

data(gala)
hist(gala$Species)
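
How skewed a Poisson Distribution looks depends on its mean (its skewness is 1/sqrt(lambda)). The sketch below uses simulated counts, not the gala data, to compare a small and a large mean. The lambda values, sample size, and seed are arbitrary choices made for illustration.

```r
# Simulated Poisson counts (illustrative values, not the gala data)
set.seed(1)
small_mean <- rpois(1000, lambda = 2)
large_mean <- rpois(1000, lambda = 50)

# Counts are non-negative whole numbers
is_count <- all(small_mean >= 0 & small_mean == round(small_mean))
is_count

# Side-by-side histograms: the small-mean sample should look clearly
# right skewed, the large-mean sample much closer to symmetric
par(mfrow = c(1, 2))
hist(small_mean, main = "lambda = 2")
hist(large_mean, main = "lambda = 50")
par(mfrow = c(1, 1))
```

If a response histogram looks roughly symmetric, that alone does not rule out Poisson Regression, since the mean count may simply be large.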

Assumption #2:
This is also an assumption for ordinary linear regression.

Assumption #3:
For Poisson Regression, we expect the mean of the response to equal its variance. This means that the variance will increase for higher estimated means.

plot(gala$Species~gala$Elevation)

This plot illustrates the relationship described above. Here our response is the number of plant species counted on a given island, and the explanatory variable is the elevation of the island. We can see a positive association: as the elevation goes up, so does the mean number of species. Assuming that this response follows a Poisson Distribution would imply that the variance should also increase for higher elevations. This increase in variance can be seen in the way the points fan out as the elevation increases.
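
To see the mean-variance property on its own, here is a minimal sketch using simulated Poisson counts rather than the gala data. The three lambda values, the sample size, and the seed are assumptions chosen only for illustration.

```r
# Simulate Poisson samples at three different means (illustrative values)
set.seed(42)
lambdas <- c(2, 10, 50)
sims <- sapply(lambdas, function(l) rpois(5000, lambda = l))

# For each simulated sample, compare the sample mean with the sample variance
summary_table <- data.frame(
  lambda   = lambdas,
  mean     = colMeans(sims),
  variance = apply(sims, 2, var)
)
summary_table
```

In each row the mean and the variance should both be close to lambda, so the variance grows along with the mean. The same comparison could be made on real data by grouping observations with similar predictor values and computing the mean and variance of the response within each group.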

Assumption #4:
Finally, we assume the log of the mean response is a linear function of the predictors. For example, compare the two plots below of the number of species against the distance to the nearest island. The first uses the raw counts; the second uses the log of the number of species. Taking the log of the response spreads out the points and makes the relationship between the two variables easier to see.

plot(gala$Species~gala$Nearest)

plot(log(gala$Species)~gala$Nearest)
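
The log-linear assumption is what glm() uses when family = poisson, since the log link is the default for that family. As a hedged sketch, the code below simulates counts whose log mean is exactly linear in x and checks that the fit recovers the coefficients. The values beta0 = 0.5 and beta1 = 0.3, the range of x, and the seed are hypothetical choices, not estimates from the gala data.

```r
# Simulate a response whose log mean is linear in x (hypothetical coefficients)
set.seed(123)
n <- 500
x <- runif(n, min = 0, max = 5)
beta0 <- 0.5
beta1 <- 0.3
y <- rpois(n, lambda = exp(beta0 + beta1 * x))

# Poisson regression; the log link is the default for family = poisson
fit <- glm(y ~ x, family = poisson)
coef(fit)

# On the link (log) scale the fitted means are a straight line in x
log_mu <- predict(fit, type = "link")
plot(x, log_mu, ylab = "log of fitted mean")
```

The same call with the formula Species ~ Nearest and data = gala would fit the corresponding model for the plots above.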

Sam K

3/5/2021