Question 1

Island=c("Santa Barbara","Anacapa","San Miguel","San Nicolas","San Clemente","Santa Catalina","Santa Rosa","Santa Cruz")
Area=c(2.6, 2.9, 37, 58, 145, 194, 217, 294)
Dist=c(61, 20, 42, 98, 79, 32, 44, 30)
Native=c(88,190,198,139,272,421,387,480)
Endemic=c(14,22,18,18,47,37,42,45)
Exotic=c(44,75,69,131,110,185,98,170)
Total=c(132,265,267,270,382,604,484,650)
##Coerce Data Vectors Into a Dataframe:
ChannelIslands=data.frame(Island, Area, Dist, Native, Endemic, Exotic, Total)
##Remove the individual vectors now that they are assembled into a dataframe.
rm(Island, Area, Dist, Native, Endemic, Exotic, Total)
##Convert “Island” (the variable containing island names) into a Factor Variable:
ChannelIslands$Island <- factor(ChannelIslands$Island)

library(ggplot2)

ggplot(ChannelIslands, aes(x = Area, y = Total)) +
  geom_point(aes(color=Island, size=Area)) +
  geom_smooth() +
  xlab("Island Area") +
  ylab("Total Species") +
  ggtitle("Scatterplot of Total Species on Island Area")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Question 2

The above plot suggests that as the area of an island increases, the total species also increases.

Question 3

Santa Barbara has the fewest species (132). It also has the smallest area (2.6 km^2): only Anacapa (2.9 km^2) is comparable in size, and every other island is far larger.

Question 4

Santa Cruz has the greatest number of species (650). It also has the largest area (294 km^2) of all the islands.

Question 5

The smooth line suggests that the relationship between total species and island area is linear.

Part 2

mymodel = lm(ChannelIslands$Total ~ ChannelIslands$Area)
summary(mymodel)
## 
## Call:
## lm(formula = ChannelIslands$Total ~ ChannelIslands$Area)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -66.46 -43.94 -11.95  27.23 103.66 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         194.3561    34.5928   5.618 0.001358 ** 
## ChannelIslands$Area   1.5772     0.2205   7.154 0.000376 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63.9 on 6 degrees of freedom
## Multiple R-squared:  0.8951, Adjusted R-squared:  0.8776 
## F-statistic: 51.17 on 1 and 6 DF,  p-value: 0.0003764

Question 6

Based on the linear regression results, for each 1 km^2 increase in island area, the total number of species on an island is expected to rise by 1.5772.

Question 7

Based on the model, an island with an area of 0.0 km^2 is predicted to have 194.3561 total species (the intercept).
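The slope and intercept can be checked with `predict()`. This is a minimal sketch, assuming the model is refit with the `data =` form of `lm()` so that `newdata` works. The names `islands` and `fit` are illustrative and do not appear in the original code.

```r
# Rebuild only the two columns needed (values copied from the data above).
islands <- data.frame(
  Area  = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Total = c(132, 265, 267, 270, 382, 604, 484, 650)
)

# Same regression as mymodel, written with data = so predict() accepts newdata.
fit <- lm(Total ~ Area, data = islands)

# Prediction at zero area is the intercept.
pred_zero <- unname(predict(fit, newdata = data.frame(Area = 0)))

# Predictions 1 km^2 apart differ by exactly the slope.
pred_pair <- predict(fit, newdata = data.frame(Area = c(100, 101)))
unit_change <- unname(diff(pred_pair))

pred_zero
unit_change
```

Writing the formula as `ChannelIslands$Total ~ ChannelIslands$Area` fits the same line, but `predict()` cannot then match `newdata` to the predictor, which is why the sketch uses column names with `data =`.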

Question 8

The t-value for the slope term is 7.154 and the Pr(>|t|) is 0.000376. This means the estimated slope lies 7.154 standard errors away from zero. Since the p-value is less than 0.05, the slope is significantly different from zero, i.e., there is a statistically significant relationship between total species and area.
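To show where these two numbers come from, the t-value can be rebuilt as the estimate divided by its standard error, and the p-value as a two-sided tail area of the t distribution with the residual degrees of freedom. This is a sketch, with `islands` and `fit` as illustrative names rather than objects from the original code.

```r
islands <- data.frame(
  Area  = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Total = c(132, 265, 267, 270, 382, 604, 484, 650)
)
fit <- lm(Total ~ Area, data = islands)

coefs <- summary(fit)$coefficients
est <- coefs["Area", "Estimate"]
se  <- coefs["Area", "Std. Error"]

# t-value: how many standard errors the slope is from zero.
t_value <- est / se

# Two-sided p-value using the residual degrees of freedom (n - 2 = 6).
p_value <- 2 * pt(abs(t_value), df = df.residual(fit), lower.tail = FALSE)

t_value
p_value
```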

Question 9

The multiple R-squared value is 0.8951 which means that 89.51% of the variation in total species is explained by island area.

Question 10

SSY <- sum((ChannelIslands$Total - mean(ChannelIslands$Total))^2)
SSY
## [1] 233469.5

The total sum of squares is 233469.5.

Question 11

SSE <- sum(mymodel$residuals^2)
SSE
## [1] 24501.08

The error sum of squares is 24501.08.

Question 12

1 - (SSE/SSY)
## [1] 0.8950566

The proportionate reduction of SSE relative to SSY is 0.8950566.
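This proportion should equal the multiple R-squared reported by `summary()`. The sketch below checks that under the same data. `islands` and `fit` are illustrative names, not part of the original code.

```r
islands <- data.frame(
  Area  = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Total = c(132, 265, 267, 270, 382, 604, 484, 650)
)
fit <- lm(Total ~ Area, data = islands)

# Total and error sums of squares, as in Questions 10 and 11.
SSY <- sum((islands$Total - mean(islands$Total))^2)
SSE <- sum(resid(fit)^2)

r2_by_hand      <- 1 - SSE / SSY
r2_from_summary <- summary(fit)$r.squared

# TRUE when the two agree up to floating-point tolerance.
all.equal(r2_by_hand, r2_from_summary)
```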

Question 13

((SSY-SSE)/1)/(SSE/(8-1-1))
## [1] 51.17369

This corresponds to the F-statistic, which is 51.17369 on 1 and 6 degrees of freedom. It tests whether area explains enough of the variance in total species to reject the null hypothesis that there is no relationship between the variables. Since this F-statistic is large (p-value: 0.0003764), we can reject this null hypothesis.
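"Quite large" can be made precise by converting the F-statistic to a p-value with `pf()`. In a regression with one predictor, the F-statistic is also the square of the slope's t-value. This is a sketch with illustrative names (`islands`, `fit`).

```r
islands <- data.frame(
  Area  = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Total = c(132, 265, 267, 270, 382, 604, 484, 650)
)
fit <- lm(Total ~ Area, data = islands)

# fstatistic is a named vector: value, numdf, dendf.
f <- summary(fit)$fstatistic
f_value <- unname(f["value"])

# Upper-tail probability of F on (1, 6) degrees of freedom.
p_value <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# With a single predictor, F equals the squared t-value of the slope.
t_slope <- summary(fit)$coefficients["Area", "t value"]

f_value
p_value
t_slope^2
```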

Part 3

Question 14

ggplot(data=ChannelIslands) +
  geom_point(mapping=aes(x=Area, y=Native), color="forestgreen", shape=15, size = 2.5) +
  geom_smooth(mapping=aes(x=Area, y=Native), color="forestgreen") +
  geom_point(mapping=aes(x=Area, y=Endemic), color="dodgerblue", shape=15, size = 2.5) +
  geom_smooth(mapping=aes(x=Area, y=Endemic), color="dodgerblue") +
  geom_point(mapping=aes(x=Area, y=Exotic), color="firebrick1", shape=15, size = 2.5) +
  geom_smooth(mapping=aes(x=Area, y=Exotic), color="firebrick1") +
  theme_gray()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Question 15

For native plants, the slope of richness on area is steeper than for endemic and exotic plants. Exotic plants have a higher intercept than endemic plants, and both have much shallower slopes than native plants.
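The loess curves only allow slopes and intercepts to be compared by eye. One way to put numbers on the comparison is to fit a separate straight line for each group. This is a sketch assuming simple linear fits are an acceptable summary. `islands`, `groups`, and `coef_table` are illustrative names.

```r
islands <- data.frame(
  Area    = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Native  = c(88, 190, 198, 139, 272, 421, 387, 480),
  Endemic = c(14, 22, 18, 18, 47, 37, 42, 45),
  Exotic  = c(44, 75, 69, 131, 110, 185, 98, 170)
)

groups <- c("Native", "Endemic", "Exotic")

# Fit <group> ~ Area for each group and collect intercept and slope.
coef_table <- t(sapply(groups, function(g) {
  coef(lm(reformulate("Area", response = g), data = islands))
}))
colnames(coef_table) <- c("Intercept", "Slope")

coef_table
```

Each row of `coef_table` holds the intercept and slope for one group, so the ordering of the slopes can be read off directly.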

Part 4

Question 16

islanddistance=data.frame(ChannelIslands$Island, mymodel$residuals, ChannelIslands$Dist)
ggplot(data=islanddistance) +
  geom_point(mapping=aes(x=ChannelIslands.Dist, y=mymodel.residuals), color="forestgreen", shape=15, size = 2.5) +
  geom_smooth(mapping=aes(x=ChannelIslands.Dist, y=mymodel.residuals), color="forestgreen") +
  theme_gray()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Question 17

The above plot suggests that as distance increases, the residuals from the species-area model tend to decrease slightly, i.e., more distant islands tend to have fewer species than their area alone would predict.

Question 18

mymodel2 <- lm(mymodel$residuals ~ ChannelIslands$Dist)
summary(mymodel2)
## 
## Call:
## lm(formula = mymodel$residuals ~ ChannelIslands$Dist)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.963 -38.740  -1.323  31.677  80.469 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)           62.778     42.476   1.478    0.190
## ChannelIslands$Dist   -1.237      0.751  -1.647    0.151
## 
## Residual standard error: 53.03 on 6 degrees of freedom
## Multiple R-squared:  0.3114, Adjusted R-squared:  0.1966 
## F-statistic: 2.713 on 1 and 6 DF,  p-value: 0.1506

The slope for the relationship between distance and the residuals from the regression of total species on area is -1.237, the t-value is -1.647, and the R^2 is 0.3114. However, since the p-value is 0.151, which is not less than 0.05, I cannot reject the null hypothesis that there is no relationship between the two variables.
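For a regression with a single predictor, R^2 is the squared Pearson correlation between the two variables, and the sign of the correlation matches the sign of the slope. The sketch below checks this for the residuals and distance. `islands`, `area_fit`, and `dist_fit` are illustrative names.

```r
islands <- data.frame(
  Area  = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Dist  = c(61, 20, 42, 98, 79, 32, 44, 30),
  Total = c(132, 265, 267, 270, 382, 604, 484, 650)
)

# Stage 1: species-area model; keep its residuals.
area_fit <- lm(Total ~ Area, data = islands)
islands$area_resid <- resid(area_fit)

# Stage 2: residuals regressed on distance (same as mymodel2).
dist_fit <- lm(area_resid ~ Dist, data = islands)

# Correlation between residuals and distance; its square is the R^2.
r <- cor(islands$area_resid, islands$Dist)

r
r^2
summary(dist_fit)$r.squared
```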

Question 19

summary(mymodel)
## 
## Call:
## lm(formula = ChannelIslands$Total ~ ChannelIslands$Area)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -66.46 -43.94 -11.95  27.23 103.66 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         194.3561    34.5928   5.618 0.001358 ** 
## ChannelIslands$Area   1.5772     0.2205   7.154 0.000376 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63.9 on 6 degrees of freedom
## Multiple R-squared:  0.8951, Adjusted R-squared:  0.8776 
## F-statistic: 51.17 on 1 and 6 DF,  p-value: 0.0003764
summary(mymodel2)
## 
## Call:
## lm(formula = mymodel$residuals ~ ChannelIslands$Dist)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.963 -38.740  -1.323  31.677  80.469 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)           62.778     42.476   1.478    0.190
## ChannelIslands$Dist   -1.237      0.751  -1.647    0.151
## 
## Residual standard error: 53.03 on 6 degrees of freedom
## Multiple R-squared:  0.3114, Adjusted R-squared:  0.1966 
## F-statistic: 2.713 on 1 and 6 DF,  p-value: 0.1506

89.51% of the variation in total species richness is explained by area, while 31.14% of the remaining (residual) variation is explained by distance, though the distance relationship is not statistically significant (p = 0.151).
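Because the second percentage applies only to the variation left over after area, it helps to express it as a share of the total variation. The sketch below does that arithmetic for this two-stage approach. It is not the R^2 of a single multiple regression, and the names `islands`, `area_fit`, and `dist_fit` are illustrative.

```r
islands <- data.frame(
  Area  = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Dist  = c(61, 20, 42, 98, 79, 32, 44, 30),
  Total = c(132, 265, 267, 270, 382, 604, 484, 650)
)

area_fit <- lm(Total ~ Area, data = islands)
islands$area_resid <- resid(area_fit)
dist_fit <- lm(area_resid ~ Dist, data = islands)

r2_area <- summary(area_fit)$r.squared  # share of total variation due to area
r2_dist <- summary(dist_fit)$r.squared  # share of the leftover variation due to distance

# Distance's share of the TOTAL variation: leftover fraction times r2_dist.
dist_share_of_total <- (1 - r2_area) * r2_dist

# Fraction of total variation accounted for by the two stages together.
combined <- r2_area + dist_share_of_total

dist_share_of_total
combined
```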