This dataset in package faraway contains the number of plant species on each of 30 islands in the Galapagos Archipelago, along with some attributes of each island. From the dataset, species-area relationship is as follows:

S=cA^z

library(faraway)
data(gala)
library(ggplot2)
ggplot(gala, aes(x = Area, y = Species)) + 
  geom_point(size = 3)

The figure shows species vs area plot. There are a few very large islands and many small ones. If we take the logarithm of both sides of our model

log(S)=log(c)+zlog(A) Then the linear model :

gala.1 <- lm(log(Species)~log(Area),data=gala)
summary(gala.1)
## 
## Call:
## lm(formula = log(Species) ~ log(Area), data = gala)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5442 -0.4001  0.0941  0.5449  1.3752 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.9037     0.1571  18.484  < 2e-16 ***
## log(Area)     0.3886     0.0416   9.342 4.23e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7842 on 28 degrees of freedom
## Multiple R-squared:  0.7571, Adjusted R-squared:  0.7484 
## F-statistic: 87.27 on 1 and 28 DF,  p-value: 4.23e-10
# confidence intervals on the coefficients
confint(gala.1)
##                 2.5 %    97.5 %
## (Intercept) 2.5819216 3.2254846
## log(Area)   0.3033903 0.4738033
# 1. What is the estimated value of z? 0.30
#2.  What is the estimated value of c? 2.58
#3.  Drakare et al. (2005) found the average value of z across many studies to be 0.27. Is the value of z from this dataset significantly different from 0.27? 1
# The value is closer to 0.30
exp(2.9037)
## [1] 18.24151
  1. What is the estimated value of z? The estimated value of z is 0.3886 (slope).

  2. What is the estimated value of c? The estimated value of c is exp(2.9037)=18.24151

  3. Drakare et al. (2005) found the average value of z across many studies to be 0.27. Is the value of z from this dataset significantly different from 0.27? The confidence interval shows the z value 0.27 is less than the lower limit. The estimated value of z is 0.3886 , Therefore the z value from the literature is significantly different since the value is not in the confidence interval.

old.par = par(mfrow=c(2,2))
plot(gala.1)

par(old.par)
# this gives us some nifty tools for extracting things from fitted models
library(broom)
test_distance <- augment(gala.1)
test_distance <- cbind(test_distance, Scruz = gala$Scruz)
# the current version of broom only produces .std.resid -- this may change
ggplot(test_distance, aes(x = Scruz, y = .std.resid)) + 
  geom_point()

The two islands farthest away from Santa cruz have negative residuals.

Another dataset

ggplot(eco, aes(x=income, y = usborn)) +
  geom_point() + 
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

The blue solid line is the regression line of the linear moder . Gray shaded polygon shows the confidance interval Chicking the residuals using ggplot.

eco.1 <- lm(usborn ~ income, data = eco)
eco.test <- augment(eco.1, data = eco)
ggplot(eco.test, aes(x=.fitted, y = .std.resid)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#plotting actual data

ggplot(eco, aes(x=income, y = usborn)) +
  geom_point() +
  geom_smooth(color = "red")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The line is not flat amd CI don’t exclude zero.

## different aesthetic
ggplot(eco.test, aes(sample = .std.resid)) +
  stat_qq() +
  geom_abline(color = "red")

This in not a straightline. Therefore residuals are not following normal distribution.

## scale location plot
ggplot(eco.test, aes(x = .fitted, y = sqrt(abs(.std.resid)))) + 
  geom_point() + 
  geom_smooth() + 
  geom_hline(yintercept  = 1)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Variance is too small for fitted values > 0.95

# leverage is the trace of the hat matrix
ggplot(eco.test, aes(.hat, .std.resid)) +
  geom_vline(size = 2, colour = "white", xintercept = 0) +
  geom_hline(size = 2, colour = "white", yintercept = 0) +
  geom_point(aes(size = .cooksd)) + geom_smooth(se = FALSE)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

All the values for Cook’s distance are small

  1. Think about the possible values for the variable usborn. Do you expect this variable to follow a normal distribution? Why or why not?

The Normal Q-Q plot plots the standardized residuals against the theoretical quantiles from a standard normal distribution. If the residuals are following a normal distribution then the points fall along the straight line in the plot. This is a direct test of the assumption that the residuals are normally distributed. The Q-Q plot does not show the normal distribution. Moreover, the histogram for usborn is seen not normally distributed and it is right skewed.

ggplot(eco, aes(x=usborn)) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# All together
usborn <- lm(usborn~income,data=eco.test)
born.par = par(mfrow=c(2,2))
plot(usborn)

par(born.par)

Back to Galapagos species data Added prediction model

## always start by making a dataframe with the values of your covariates
nd = data.frame(Area=seq(min(gala$Area),max(gala$Area),10))
stats.gala <- glance(gala.1)
pred.1 <- augment(gala.1, newdata = nd, se_fit = TRUE)

# calculate the lower and upper 95% confidence limits on the mean
tcrit <- qt(0.975, df = stats.gala$df.residual)
pred.1$lwr <- with(pred.1, .fitted - tcrit * .se.fit)
pred.1$upr <- with(pred.1, .fitted + tcrit * .se.fit)

## put the raw data plot back first, then add
## lines and ribbons
ggplot(gala, aes(x = log(Area))) + 
  geom_point(aes(y = log(Species)), size = 3) + 
  geom_line(aes(y = .fitted ), 
            data = pred.1, size = 2, color = "blue") + 
  geom_ribbon(aes(ymin = lwr, ymax = upr), 
              data = pred.1, alpha = 0.5)

  1. What size island has a mean # of species equal to 55? To find the area that has a mean of 55 species, the linear model should be changed accordingly The answer is 12.52 km\(^2\)
gala.2 <- lm(log(Area)~log(Species),data=gala)
summary(gala.2)
## 
## Call:
## lm(formula = log(Area) ~ log(Species), data = gala)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4377 -1.1545  0.3568  0.9687  2.9018 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -5.2798     0.7987  -6.611 3.60e-07 ***
## log(Species)   1.9483     0.2086   9.342 4.23e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.756 on 28 degrees of freedom
## Multiple R-squared:  0.7571, Adjusted R-squared:  0.7484 
## F-statistic: 87.27 on 1 and 28 DF,  p-value: 4.23e-10
# confidence inte

A=exp(-5.2798+(1.9483*log(55)))

A
## [1] 12.52451
 #Another way 

new_data <- data.frame(Species = c(55))

# Fiting the linear model
lm1 <- lm(log(Area)~log(Species),data=gala)

#Predicts the future values
pred<-predict(lm1, newdata = new_data)
exp(pred)
##        1 
## 12.52446
pred.1$psigma <- sqrt(stats.gala$sigma^2 + pred.1$.se.fit^2)

pred.1$plwr <- with(pred.1, .fitted - tcrit * psigma)
pred.1$pupr <- with(pred.1, .fitted + tcrit * psigma)

ggplot(gala, aes(x = log(Area))) + 
  geom_point(aes(y = log(Species)), size = 3) + 
  geom_line(aes(y = .fitted ), 
            data = pred.1, size = 2, color = "blue") + 
  geom_ribbon(aes(ymin = lwr, ymax = upr, y = 0), data = pred.1, alpha = 0.25) +
  geom_ribbon(aes(ymin = plwr, ymax = pupr, y = 0), data = pred.1, alpha = 0.25)

The light polygon shows the confidence interval of of observations expected for each value of log(Area)

  1. What size island has a 2.5 % chance of observing < 55 species?

Ans: The lower confidence limit shows that there is 2.5 % chnace for observing less than 55 species in approximately 9.5 km^2 area island

nd = data.frame(Area=12.52446)
pred.55 <- predict(gala.1, newdata = nd, se.fit=TRUE)

pred.55$psigma <- sqrt(stats.gala$sigma^2 + pred.55$se.fit^2)

pred.55$plwr <- with(pred.55, fit - tcrit * psigma)
pred.55$pupr <- with(pred.55, fit + tcrit * psigma)
pred.55$plwr
##        1 
## 2.250976
exp(pred.55$plwr)
##        1 
## 9.497004
ggplot(gala, aes(x = log(Area))) + 
  geom_point(aes(y = log(Species)), size = 3) + 
  geom_line(aes(y = .fitted ), 
            data = pred.1, size = 2, color = "blue") + 
  geom_ribbon(aes(ymin = lwr, ymax = upr), data = pred.1, alpha = 0.25) +
  geom_ribbon(aes(ymin = plwr, ymax = pupr), data = pred.1, alpha = 0.25) +
  coord_trans(x="exp", y="exp")
## Warning in trans$inverse(continuous_range_coord): NaNs produced

## Warning in trans$inverse(continuous_range_coord): NaNs produced
## Warning in self$trans$x$inverse(panel_params$x.range): NaNs produced
## Warning in self$trans$y$inverse(panel_params$y.range): NaNs produced
## Warning in self$trans$x$inverse(panel_params$x.range): NaNs produced
## Warning in self$trans$y$inverse(panel_params$y.range): NaNs produced
## Warning in self$trans$x$inverse(panel_params$x.range): NaNs produced
## Warning in self$trans$y$inverse(panel_params$y.range): NaNs produced
## Warning in self$trans$x$inverse(panel_params$x.range): NaNs produced
## Warning in self$trans$y$inverse(panel_params$y.range): NaNs produced
## Warning in self$trans$x$inverse(panel_params$x.range): NaNs produced
## Warning in self$trans$y$inverse(panel_params$y.range): NaNs produced

ggplot(gala, aes(x = Area)) + 
  geom_point(aes(y = Species), size = 3) + 
  geom_line(aes(y = exp(.fitted) ), 
            data = pred.1, size = 2, color = "blue") + 
  geom_ribbon(aes(ymin = exp(lwr), ymax = exp(upr)), data = pred.1, alpha = 0.25)