Question 1

Island=c("Santa Barbara","Anacapa","San Miguel","San Nicolas","San Clemente","Santa Catalina","Santa Rosa","Santa Cruz")
Area=c(2.6, 2.9, 37, 58, 145, 194, 217, 294)
Dist=c(61, 20, 42, 98, 79, 32, 44, 30)
Native=c(88,190,198,139,272,421,387,480)
Endemic=c(14,22,18,18,47,37,42,45)
Exotic=c(44,75,69,131,110,185,98,170)
Total=c(132,265,267,270,382,604,484,650)
##Coerce Data Vectors Into a Dataframe:
ChannelIslands=data.frame(Island, Area, Dist, Native, Endemic, Exotic, Total)
##Remove the individual vectors now that they are assembled into a dataframe.
rm(Island, Area, Dist, Native, Endemic, Exotic, Total)
##Convert “Island” (the variable containing island names) into a Factor Variable:
ChannelIslands$Island <- factor(ChannelIslands$Island)

library(ggplot2)

ggplot(ChannelIslands, mapping = aes(x = Area, y = Total)) +
  geom_point(aes(color=Island, size=Area)) +
  geom_smooth() +
  xlab("Island Area") +
  ylab("Total Species") +
  ggtitle("Scatterplot of Total Species on Island Area")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Question 2

The plot above suggests a positive relationship: as island area increases, the total number of species also increases.

Question 3

Santa Barbara has the fewest species (132 in total). It also has the smallest area (2.6 km^2). Only Anacapa (2.9 km^2) is comparable in size; every other island is more than ten times larger.

Question 4

Santa Cruz has the greatest number of species (650 in total). It also has the largest area of any island (294 km^2).

Question 5

The smooth (loess) line suggests that the relationship between total species and island area is approximately linear.

Part 2

mymodel = lm(ChannelIslands$Native ~ ChannelIslands$Area)
summary(mymodel)
## 
## Call:
## lm(formula = ChannelIslands$Native ~ ChannelIslands$Area)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.612 -34.226  -7.542  34.551  61.581 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         124.8303    25.9310   4.814 0.002958 ** 
## ChannelIslands$Area   1.2376     0.1653   7.488 0.000293 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.9 on 6 degrees of freedom
## Multiple R-squared:  0.9033, Adjusted R-squared:  0.8872 
## F-statistic: 56.07 on 1 and 6 DF,  p-value: 0.0002931

Question 6

Based on the linear regression results, each 1 km^2 increase in island area is associated with an expected increase of 1.2376 native species.

Question 7

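Based on the model, an island with an area of 0.0 km^2 would be predicted to have 124.8303 native species. This is the intercept, and it is an extrapolation below the smallest observed area (2.6 km^2).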
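The slope and intercept readings above can be checked with predict(). This is a minimal sketch that refits the same model with a data argument; the names islands, fit, at_zero and per_km2 are illustrative, and the areas 100 and 101 are arbitrary values chosen for the example.

```r
# Data copied from the vectors defined at the top of this document.
Area   <- c(2.6, 2.9, 37, 58, 145, 194, 217, 294)
Native <- c(88, 190, 198, 139, 272, 421, 387, 480)
islands <- data.frame(Area, Native)

# Same regression as mymodel, fitted with data = so predict() accepts new areas.
fit <- lm(Native ~ Area, data = islands)

# Intercept: the fitted number of native species at Area = 0.
at_zero <- predict(fit, newdata = data.frame(Area = 0))

# Slope: the change in the fitted value for a 1 km^2 increase in area.
pred <- predict(fit, newdata = data.frame(Area = c(100, 101)))
per_km2 <- unname(pred[2] - pred[1])

coef(fit)
at_zero
per_km2
```

The difference between any two predictions one unit of area apart equals the slope, which is why the choice of 100 and 101 does not matter.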

Question 8

The t-value for the slope term is 7.488 and the Pr(>|t|) is 0.000293. The t-value means that the estimated slope for native species on area lies 7.488 standard errors away from zero. Since the p-value is less than 0.05, the slope is significantly different from zero.
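The t-value and p-value can be reproduced by hand from the estimate and its standard error. The sketch below assumes the usual two-sided t-test with the model's residual degrees of freedom (6 here); the variable names are illustrative.

```r
# Data copied from the vectors defined at the top of this document.
Area   <- c(2.6, 2.9, 37, 58, 145, 194, 217, 294)
Native <- c(88, 190, 198, 139, 272, 421, 387, 480)
fit <- lm(Native ~ Area)

# Pull the slope estimate and its standard error from the coefficient table.
coefs <- summary(fit)$coefficients
est <- coefs["Area", "Estimate"]
se  <- coefs["Area", "Std. Error"]

# t-value: how many standard errors the estimate lies from zero.
t_value <- est / se

# Two-sided p-value from the t distribution with n - 2 degrees of freedom.
p_value <- 2 * pt(-abs(t_value), df = df.residual(fit))

t_value
p_value
```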

Question 9

The multiple R-squared value is 0.9033, which means that 90.33% of the variation in native species is explained by island area.
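For a regression with a single predictor, R-squared is the square of the Pearson correlation between the two variables. A short sketch of that check, with illustrative variable names:

```r
# Data copied from the vectors defined at the top of this document.
Area   <- c(2.6, 2.9, 37, 58, 145, 194, 217, 294)
Native <- c(88, 190, 198, 139, 272, 421, 387, 480)
fit <- lm(Native ~ Area)

# Pearson correlation between area and native species, then its square.
r <- cor(Area, Native)
r_squared <- r^2

# Compare with the value reported by summary().
c(correlation = r, squared = r_squared, from_summary = summary(fit)$r.squared)
```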

Question 10

SSY <- sum((ChannelIslands$Native - mean(ChannelIslands$Native))^2)
SSY
## [1] 142434.9

The total sum of squares is 142434.9.

Question 11

SSE <- sum(mymodel$residuals^2)
SSE
## [1] 13767.48

The error sum of squares is 13767.48.

Question 12

(SSY-SSE)/SSY
## [1] 0.9033419

The proportionate reduction in error from SSY to SSE is 0.9033419, which matches the multiple R-squared reported by summary(mymodel).

Question 13

((SSY-SSE)/1)/(SSE/(8-1-1))
## [1] 56.07448

This corresponds to the F-statistic, 56.07448, on 1 and 6 degrees of freedom. It tests the null hypothesis that there is no linear relationship between native species and area, by comparing the variance explained by the model with the residual variance. Since the p-value reported for this F-statistic is 0.0002931, which is less than 0.05, we can reject the null hypothesis.
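The p-value that goes with this F-statistic can be obtained from the upper tail of the F distribution. This sketch recomputes the statistic from the sums of squares and then calls pf(); the names F_stat and p_value are illustrative.

```r
# Data copied from the vectors defined at the top of this document.
Area   <- c(2.6, 2.9, 37, 58, 145, 194, 217, 294)
Native <- c(88, 190, 198, 139, 272, 421, 387, 480)
fit <- lm(Native ~ Area)

n   <- length(Native)
SSY <- sum((Native - mean(Native))^2)  # total sum of squares
SSE <- sum(residuals(fit)^2)           # error sum of squares

# Explained variance per model df over residual variance per residual df.
F_stat <- ((SSY - SSE) / 1) / (SSE / (n - 2))

# Upper-tail probability of the F distribution with 1 and n - 2 df.
p_value <- pf(F_stat, df1 = 1, df2 = n - 2, lower.tail = FALSE)

F_stat
p_value
```

With one predictor, the F-statistic is the square of the slope's t-value, so both tests give the same p-value.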

Part 3

Question 14

ggplot(data=ChannelIslands) +
  geom_point(mapping=aes(x=Area, y=Native), color="forestgreen", shape=15, size = 2.5) +
  geom_smooth(mapping=aes(x=Area, y=Native), color="forestgreen") +
  geom_point(mapping=aes(x=Area, y=Endemic), color="dodgerblue", shape=15, size = 2.5) +
  geom_smooth(mapping=aes(x=Area, y=Endemic), color="dodgerblue") +
  geom_point(mapping=aes(x=Area, y=Exotic), color="firebrick1", shape=15, size = 2.5) +
  geom_smooth(mapping=aes(x=Area, y=Exotic), color="firebrick1") +
  theme_gray()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Question 15

For native plants, the slope of richness on area is steeper than for endemic or exotic plants. The intercept for exotic plants is higher than for endemic plants, though the two groups appear to have fairly similar slopes in the plot.
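The plot uses loess smooths, so slopes and intercepts are read off by eye. One way to put numbers on the comparison is to fit a separate straight line for each group, as sketched below. This is an assumption about how the comparison could be checked, not part of the original analysis, and the names islands, groups and coef_table are illustrative.

```r
# Data copied from the vectors defined at the top of this document.
Area    <- c(2.6, 2.9, 37, 58, 145, 194, 217, 294)
Native  <- c(88, 190, 198, 139, 272, 421, 387, 480)
Endemic <- c(14, 22, 18, 18, 47, 37, 42, 45)
Exotic  <- c(44, 75, 69, 131, 110, 185, 98, 170)
islands <- data.frame(Area, Native, Endemic, Exotic)

# Fit richness ~ Area once per group and collect the coefficients.
groups <- c("Native", "Endemic", "Exotic")
coef_table <- sapply(groups, function(g) {
  coef(lm(reformulate("Area", response = g), data = islands))
})

# Rows are the intercept and slope; columns are the three groups.
coef_table
```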

Part 4

Question 16

islanddistance=data.frame(ChannelIslands$Island, mymodel$residuals, ChannelIslands$Dist)

ggplot(data=islanddistance) +
  geom_point(mapping=aes(x=ChannelIslands.Dist, y=mymodel.residuals), color="forestgreen", shape=15, size = 2.5) +
  geom_smooth(mapping=aes(x=ChannelIslands.Dist, y=mymodel.residuals), color="forestgreen") +
  xlab("Island Distance") +
  ylab("Model Residuals") +
  ggtitle("Scatterplot of Model Residuals on Island Distance")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Question 17

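The plot above suggests that the residuals decline as distance increases. Islands at short distances tend to have more native species than area alone predicts (positive residuals), while the most distant islands have fewer (negative residuals).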
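Listing the residuals in order of distance makes the pattern in the plot easier to see island by island. A minimal sketch, with resid_table as an illustrative name:

```r
# Data copied from the vectors defined at the top of this document.
Island <- c("Santa Barbara", "Anacapa", "San Miguel", "San Nicolas",
            "San Clemente", "Santa Catalina", "Santa Rosa", "Santa Cruz")
Area   <- c(2.6, 2.9, 37, 58, 145, 194, 217, 294)
Dist   <- c(61, 20, 42, 98, 79, 32, 44, 30)
Native <- c(88, 190, 198, 139, 272, 421, 387, 480)

# Residuals from the native species ~ area regression.
fit <- lm(Native ~ Area)
resid_table <- data.frame(Island, Dist, Residual = round(residuals(fit), 1))

# Sort from the nearest island to the farthest.
resid_table[order(resid_table$Dist), ]

# Direction and strength of the linear association with distance.
cor(Dist, residuals(fit))
```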

Question 18

mymodel2 <- lm(mymodel$residuals ~ ChannelIslands$Dist)
summary(mymodel2)
## 
## Call:
## lm(formula = mymodel$residuals ~ ChannelIslands$Dist)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.849 -18.320   8.098  15.904  29.724 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          71.3151    20.4815   3.482  0.01311 * 
## ChannelIslands$Dist  -1.4052     0.3621  -3.880  0.00817 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.57 on 6 degrees of freedom
## Multiple R-squared:  0.7151, Adjusted R-squared:  0.6676 
## F-statistic: 15.06 on 1 and 6 DF,  p-value: 0.008167

The slope for the relationship between distance and the residuals from the native species on area model is -1.4052, the t-value is -3.880, and the R^2 is 0.7151. Since the p-value is 0.00817, which is less than 0.05, I can reject the null hypothesis that there is no relationship between the two variables. For every 1 km increase in distance, the residual (observed minus area-predicted native species) decreases by 1.4052 species on average.
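One way to read the intercept and slope together is to find the distance at which the fitted residual is zero, that is, where an island is predicted to have exactly the richness its area implies. The sketch below is illustrative only: the names fit_dist and break_even and the distances 20, 50 and 100 are assumptions made for the example.

```r
# Data copied from the vectors defined at the top of this document.
Area   <- c(2.6, 2.9, 37, 58, 145, 194, 217, 294)
Dist   <- c(61, 20, 42, 98, 79, 32, 44, 30)
Native <- c(88, 190, 198, 139, 272, 421, 387, 480)

# First stage: native species on area. Second stage: its residuals on distance.
fit_area   <- lm(Native ~ Area)
resid_area <- residuals(fit_area)
fit_dist   <- lm(resid_area ~ Dist)

# Distance at which the fitted residual crosses zero: -intercept / slope.
b <- coef(fit_dist)
break_even <- unname(-b["(Intercept)"] / b["Dist"])
break_even

# Fitted residuals at a few illustrative distances.
predict(fit_dist, newdata = data.frame(Dist = c(20, 50, 100)))
```

Islands closer than the break-even distance are predicted to sit above the area line, and islands farther away are predicted to sit below it.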

Question 19

summary(mymodel)
## 
## Call:
## lm(formula = ChannelIslands$Native ~ ChannelIslands$Area)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.612 -34.226  -7.542  34.551  61.581 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         124.8303    25.9310   4.814 0.002958 ** 
## ChannelIslands$Area   1.2376     0.1653   7.488 0.000293 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.9 on 6 degrees of freedom
## Multiple R-squared:  0.9033, Adjusted R-squared:  0.8872 
## F-statistic: 56.07 on 1 and 6 DF,  p-value: 0.0002931
summary(mymodel2)
## 
## Call:
## lm(formula = mymodel$residuals ~ ChannelIslands$Dist)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.849 -18.320   8.098  15.904  29.724 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          71.3151    20.4815   3.482  0.01311 * 
## ChannelIslands$Dist  -1.4052     0.3621  -3.880  0.00817 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.57 on 6 degrees of freedom
## Multiple R-squared:  0.7151, Adjusted R-squared:  0.6676 
## F-statistic: 15.06 on 1 and 6 DF,  p-value: 0.008167

Area explains 90.33% of the variation in native richness, and distance explains 71.51% of the residual variation that area leaves unexplained.
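The two percentages refer to different totals, so they cannot simply be added. Under this two-step approach, distance accounts for 71.51% of the 9.67% that area leaves unexplained. The sketch below converts that to a share of the original variation in native richness; the names share_dist and combined are illustrative.

```r
# Data copied from the vectors defined at the top of this document.
Area   <- c(2.6, 2.9, 37, 58, 145, 194, 217, 294)
Dist   <- c(61, 20, 42, 98, 79, 32, 44, 30)
Native <- c(88, 190, 198, 139, 272, 421, 387, 480)

# Same two models as mymodel and mymodel2.
fit_area <- lm(Native ~ Area)
fit_dist <- lm(residuals(fit_area) ~ Dist)

r2_area <- summary(fit_area)$r.squared
r2_dist <- summary(fit_dist)$r.squared

# Share of the ORIGINAL variation picked up by distance in the second step.
share_dist <- (1 - r2_area) * r2_dist

# Share of the original variation accounted for by the two steps together.
combined <- r2_area + share_dist

c(area = r2_area, distance = share_dist, combined = combined)
```

This figure describes the sequential residual approach used here. A multiple regression with both predictors fitted together would not necessarily give the same value.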