Question 1

#load ggplot2, which provides ggplot(), geom_point() and geom_smooth() used below
library(ggplot2)

#instantiating data

#input data for each of 7 columns of data:
Island=c("Santa Barbara","Anacapa","San Miguel","San Nicolas","San Clemente","Santa Catalina","Santa Rosa","Santa Cruz")
Area=c(2.6, 2.9, 37, 58, 145, 194, 217, 294)
Dist=c(61, 20, 42, 98, 79, 32, 44, 30)
Native=c(88,190,198,139,272,421,387,480)
Endemic=c(14,22,18,18,47,37,42,45)
Exotic=c(44,75,69,131,110,185,98,170)
Total=c(132,265,267,270,382,604,484,650)

#coerce data vectors into a data frame
ChannelIslands=data.frame(Island, Area, Dist, Native, Endemic, Exotic, Total)

#remove the individual vectors now that they are assembled into a data frame
rm(Island, Area, Dist, Native, Endemic, Exotic, Total)

#convert "Island" (variable containing island names) into a factor variable
ChannelIslands$Island <- factor(ChannelIslands$Island)

#plot Total Species (y-axis) on Island Area (x-axis)
ggplot(ChannelIslands, aes(x=Area, y=Total)) +
geom_point(aes(color=Total, size=Area )) +
geom_smooth()+
xlab("Island Area") +
ylab("Total Species") +
ggtitle("Scatterplot of Total Species on Area")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Question 2

The greater the island area, the greater the number of plant species found on that island.

Question 3

The island with the fewest plant species is Santa Barbara (132), which has an area of 2.6 square kilometers, the smallest area of the 8 islands.

Question 4

The island with the greatest number of plant species is Santa Cruz (650), which has an area of 294 square kilometers, the largest area of the 8 islands.
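
As a quick check on the two answers above, the islands with the fewest and the most species can be looked up directly instead of being read off the table. This is only a sketch: it re-enters the `Island`, `Area`, and `Total` columns from Question 1 so that it runs on its own, and the object names `islands`, `fewest`, and `most` are my own.

```r
# Rebuild the three columns needed (values copied from Question 1)
islands <- data.frame(
  Island = c("Santa Barbara", "Anacapa", "San Miguel", "San Nicolas",
             "San Clemente", "Santa Catalina", "Santa Rosa", "Santa Cruz"),
  Area   = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Total  = c(132, 265, 267, 270, 382, 604, 484, 650)
)

# Rows holding the smallest and the largest species totals
fewest <- islands[which.min(islands$Total), ]
most   <- islands[which.max(islands$Total), ]

print(fewest)
print(most)
```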

Question 5

The smooth line rises from left to right, which indicates a strong positive relationship between island area and number of plant species: as area (x) increases, total species (y) also increases.
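
The smoother only shows the shape of the trend, so a correlation coefficient is one way to put a number on its strength. The sketch below assumes Pearson's correlation is an acceptable summary here (with only 8 islands it is a rough guide at best). It re-enters the two columns under my own lowercase names `area` and `total` so that it runs on its own.

```r
# Values copied from Question 1
area  <- c(2.6, 2.9, 37, 58, 145, 194, 217, 294)
total <- c(132, 265, 267, 270, 382, 604, 484, 650)

# Pearson correlation between island area and total species
r_total <- cor(area, total)
print(r_total)
```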

mymodel=lm(ChannelIslands$Native ~ ChannelIslands$Area)
summary(mymodel)
## 
## Call:
## lm(formula = ChannelIslands$Native ~ ChannelIslands$Area)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.612 -34.226  -7.542  34.551  61.581 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         124.8303    25.9310   4.814 0.002958 ** 
## ChannelIslands$Area   1.2376     0.1653   7.488 0.000293 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.9 on 6 degrees of freedom
## Multiple R-squared:  0.9033, Adjusted R-squared:  0.8872 
## F-statistic: 56.07 on 1 and 6 DF,  p-value: 0.0002931

Question 6

The slope coefficient of 1.2376 tells us that for each additional square kilometer of island area, the expected number of native species is predicted to increase by about 1.24 in this model.
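
One way to see what the slope means is to compare the model's predictions for two islands whose areas differ by exactly 1 square kilometer. The sketch below refits the same regression using the `data =` argument (a small departure from the `$` form used for `mymodel`). The two areas, 100 and 101, are hypothetical values I picked for illustration, and `d`, `fit`, and `pred` are my own names.

```r
# Values copied from Question 1
d <- data.frame(
  Area   = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Native = c(88, 190, 198, 139, 272, 421, 387, 480)
)

# Same simple regression as mymodel, written with data =
fit   <- lm(Native ~ Area, data = d)
slope <- unname(coef(fit)["Area"])

# Predicted native richness for two hypothetical islands 1 km^2 apart
pred <- predict(fit, newdata = data.frame(Area = c(100, 101)))

print(slope)
print(unname(diff(pred)))  # difference between the two predictions
```

The difference between the two predictions equals the slope, whichever pair of areas 1 square kilometer apart is chosen.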

Question 7

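According to this model, an island with an area of 0.0 square kilometers would be predicted to have about 125 native plant species (the intercept, 124.83). This is an extrapolation, since the smallest island in the data set has an area of 2.6 square kilometers.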

Question 8

The high t-value, 7.488, and very low p-value, 0.000293, indicate that the slope is significantly different from zero. I conclude that island area is a meaningful and statistically significant predictor of native plant richness in this data set.
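
A confidence interval for the slope gives another view of the same result: if the 95% interval excludes zero, that agrees with the significant t-test. This is a sketch using `confint()` from base R on a refit of the same model. The names `d`, `fit`, and `ci` are my own, and the 95% level is my assumption, not something the question asks for.

```r
# Values copied from Question 1
d <- data.frame(
  Area   = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Native = c(88, 190, 198, 139, 272, 421, 387, 480)
)

fit <- lm(Native ~ Area, data = d)

# 95% confidence interval for the Area slope
ci <- confint(fit, "Area", level = 0.95)
print(ci)
```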

Question 9

The Multiple R-squared value, 0.9033, indicates that about 90.33% of the variation in the number of native plant species across the islands can be explained by the linear relationship with island area. Only about 9.7% of the variation in native species remains unexplained. Thus, the size of an island accounts for most of the differences in native plant species among the islands in this data set.
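
In a regression with a single predictor, R-squared is the square of the Pearson correlation between the predictor and the response. The sketch below checks that on these data. It refits the model with `data =`, and the names `d`, `fit`, `r2_model`, and `r2_cor` are my own.

```r
# Values copied from Question 1
d <- data.frame(
  Area   = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Native = c(88, 190, 198, 139, 272, 421, 387, 480)
)

fit <- lm(Native ~ Area, data = d)

r2_model <- summary(fit)$r.squared    # R-squared reported by the model
r2_cor   <- cor(d$Area, d$Native)^2   # squared correlation

print(c(model = r2_model, correlation_squared = r2_cor))
```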

Question 10

The total sum of squares (SSY) value is 142434.9.

SSY <- sum((ChannelIslands$Native - mean(ChannelIslands$Native))^2)

Question 11

The error sum of squares (SSE) value is 13767.48.

SSE <- sum(mymodel$residuals^2)

Question 12

The proportionate reduction of SSE relative to SSY is 0.9033419.

1 - (SSE/SSY)
## [1] 0.9033419

Question 13

The value obtained is 56.07448, which matches this model’s F-statistic (56.07 on 1 and 6 DF).

((SSY - SSE)/1)/(SSE/(8 - 1 - 1))
## [1] 56.07448
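
The same F-statistic can also be pulled from the fitted model instead of being typed in from the printed summary, and with one predictor it equals the square of the slope's t-value. This is a sketch on a refit of the model, with my own names `d`, `fit`, `f_manual`, `f_summary`, and `t_slope`.

```r
# Values copied from Question 1
d <- data.frame(
  Area   = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Native = c(88, 190, 198, 139, 272, 421, 387, 480)
)

fit <- lm(Native ~ Area, data = d)

# Sums of squares, as in Questions 10 and 11
ssy <- sum((d$Native - mean(d$Native))^2)
sse <- sum(residuals(fit)^2)

# F computed by hand, with n taken from the data instead of hard-coded
n        <- nrow(d)
f_manual <- ((ssy - sse) / 1) / (sse / (n - 1 - 1))

# F stored in the model summary, and the slope's t-value
f_summary <- unname(summary(fit)$fstatistic["value"])
t_slope   <- summary(fit)$coefficients["Area", "t value"]

print(c(manual = f_manual, summary = f_summary, t_squared = t_slope^2))
```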

Question 14

ggplot(data = ChannelIslands) +
  #native species: green squares with smoother
  geom_point(mapping = aes(x = Area, y = Native),
             color = "forestgreen", shape = 15, size = 2.5) +
  geom_smooth(mapping = aes(x = Area, y = Native),
              color = "forestgreen") +
  #endemic species: blue points with smoother
  geom_point(mapping = aes(x = Area, y = Endemic),
             color = "dodgerblue") +
  geom_smooth(mapping = aes(x = Area, y = Endemic),
              color = "dodgerblue") +
  #exotic species: red squares with smoother
  geom_point(mapping = aes(x = Area, y = Exotic),
             color = "firebrick1", shape = 15, size = 2.5) +
  geom_smooth(mapping = aes(x = Area, y = Exotic),
              color = "firebrick1") +
  theme_gray()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Question 15

Native species richness increases most steeply with island area. Exotic species richness also increases with area, but its slope is shallower than that of native species. Endemic species richness changes very little as island area increases and stays low on every island (between 14 and 47 species).
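
The comparison above is read off the loess curves by eye. One rough way to back it up is to fit a straight line for each group and compare the slopes. Note that this swaps in linear fits, not the loess smoothers drawn in the plot, so it is only an approximation of what the figure shows. The names `d` and `slopes` are my own.

```r
# Values copied from Question 1
d <- data.frame(
  Area    = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Native  = c(88, 190, 198, 139, 272, 421, 387, 480),
  Endemic = c(14, 22, 18, 18, 47, 37, 42, 45),
  Exotic  = c(44, 75, 69, 131, 110, 185, 98, 170)
)

# Fit <group> ~ Area for each group and keep only the Area slope
slopes <- sapply(c("Native", "Endemic", "Exotic"), function(v) {
  model <- lm(reformulate("Area", response = v), data = d)
  unname(coef(model)["Area"])
})

print(slopes)
```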

Question 16

#pair each island's residual from mymodel with its distance
islanddistance <- data.frame(
  Island = ChannelIslands$Island,
  residuals = mymodel$residuals,
  Dist = ChannelIslands$Dist)

#plot Model Residuals (y-axis) on Island Distance (x-axis)
ggplot(data = islanddistance) +
  geom_point(aes(x = Dist, y = residuals),
             color = "forestgreen", shape = 15, size = 2.5) +
  geom_smooth(aes(x = Dist, y = residuals),
              color = "forestgreen") +
  xlab("Island Distance") +
  ylab("Model Residuals") +
  ggtitle("Scatterplot of Model Residuals on Island Distance") +
  theme_gray()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Question 17

If the residuals showed no pattern against distance, then distance would explain no additional variation beyond what island area already captures. In this case, the residuals do appear to decrease as distance increases. This suggests that distance might also affect the number of native species and that the model may still be missing an important predictor.

Question 18

The slope is negative (-1.41) and significant (p = 0.008), showing that more distant islands have fewer native species than predicted by area alone. This model explains a substantial fraction, 71.5%, of the variation left over from the area model, suggesting distance is an important secondary predictor.

mymodel2 <- lm(mymodel$residuals ~ ChannelIslands$Dist)
summary(mymodel2)
## 
## Call:
## lm(formula = mymodel$residuals ~ ChannelIslands$Dist)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.849 -18.320   8.098  15.904  29.724 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          71.3151    20.4815   3.482  0.01311 * 
## ChannelIslands$Dist  -1.4052     0.3621  -3.880  0.00817 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.57 on 6 degrees of freedom
## Multiple R-squared:  0.7151, Adjusted R-squared:  0.6676 
## F-statistic: 15.06 on 1 and 6 DF,  p-value: 0.008167
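
An alternative to regressing the residuals on distance is to put both predictors into one multiple regression. This is a different technique from the two-step approach used above, so its coefficients and R-squared will not match `mymodel2` exactly. I include it only as a sketch of how the same question could be asked in a single model. The names `d`, `fit_area`, and `fit_both` are my own.

```r
# Values copied from Question 1
d <- data.frame(
  Area   = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Dist   = c(61, 20, 42, 98, 79, 32, 44, 30),
  Native = c(88, 190, 198, 139, 272, 421, 387, 480)
)

# Area-only model (same as mymodel) and a model with both predictors
fit_area <- lm(Native ~ Area, data = d)
fit_both <- lm(Native ~ Area + Dist, data = d)

print(summary(fit_both))

# Compare the fit of the two models
print(c(area_only     = summary(fit_area)$r.squared,
        area_and_dist = summary(fit_both)$r.squared))
```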

Question 19

Area explained 90.3% of the total variance in native richness. Distance explained 71.5% of the remaining variance, which is about 7% of the total. This shows that area is the primary determinant, but distance still has a meaningful secondary effect.
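
The "about 7% of the total" figure is the share of variance left unexplained by area multiplied by the share of that remainder explained by distance. The sketch below redoes that arithmetic from refitted models so the numbers are not typed in by hand. The names `d`, `fit_area`, `fit_resid`, and `share_dist` are my own.

```r
# Values copied from Question 1
d <- data.frame(
  Area   = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Dist   = c(61, 20, 42, 98, 79, 32, 44, 30),
  Native = c(88, 190, 198, 139, 272, 421, 387, 480)
)

# Step 1: native richness on area (as in mymodel)
fit_area <- lm(Native ~ Area, data = d)
r2_area  <- summary(fit_area)$r.squared

# Step 2: residuals from step 1 on distance (as in mymodel2)
d$resid   <- residuals(fit_area)
fit_resid <- lm(resid ~ Dist, data = d)
r2_resid  <- summary(fit_resid)$r.squared

# Share of TOTAL variance attributed to distance in this two-step approach
share_dist <- (1 - r2_area) * r2_resid

print(c(area        = r2_area,
        distance    = share_dist,
        unexplained = 1 - r2_area - share_dist))
```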