Island=c("Santa Barbara","Anacapa","San Miguel","San Nicolas","San Clemente","Santa Catalina","Santa Rosa","Santa Cruz")
Area=c(2.6, 2.9, 37, 58, 145, 194, 217, 294)
Dist=c(61, 20, 42, 98, 79, 32, 44, 30)
Native=c(88,190,198,139,272,421,387,480)
Endemic=c(14,22,18,18,47,37,42,45)
Exotic=c(44,75,69,131,110,185,98,170)
Total=c(132,265,267,270,382,604,484,650)
##Coerce Data Vectors Into a Dataframe:
ChannelIslands=data.frame(Island, Area, Dist, Native, Endemic, Exotic, Total)
##Remove the individual vectors now that they are assembled into a dataframe.
rm(Island, Area, Dist, Native, Endemic, Exotic, Total)
##Convert “Island” (the variable containing island names) into a Factor Variable:
ChannelIslands$Island <- factor(ChannelIslands$Island)
##Load ggplot2 for plotting:
library(ggplot2)
ggplot(ChannelIslands, aes(x = Area, y = Total)) +
geom_point(aes(color=Island, size=Area)) +
geom_smooth() +
xlab("Island Area") +
ylab("Total Species") +
ggtitle("Scatterplot of Total Species on Island Area")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The above plot suggests that as the area of an island increases, the total number of species also tends to increase.
Santa Barbara has the fewest species (132). It also has the smallest area (2.6 km^2), only slightly smaller than Anacapa (2.9 km^2) and far smaller than the other six islands.
Santa Cruz has the greatest number of species (650). It also has the largest area (294 km^2) of all the islands.
The smooth line suggests that the relationship between total species and island area is roughly linear.
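One way to check the linearity suggested by the smooth line is to overlay a straight least-squares line on the same scatterplot. The sketch below assumes ggplot2 is installed and rebuilds only the two columns it needs; the object names `islands` and `p` are introduced here for illustration and are not part of the original analysis.

```r
library(ggplot2)

# Rebuild the two columns used in the plot (values copied from the data above).
islands <- data.frame(
  Area  = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Total = c(132, 265, 267, 270, 382, 604, 484, 650)
)

# Points plus a straight least-squares line instead of the default loess curve.
p <- ggplot(islands, aes(x = Area, y = Total)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  xlab("Island Area") +
  ylab("Total Species")

print(p)
```

If the points scatter evenly around the straight line with no obvious curvature, a linear model is a reasonable choice for these data.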
mymodel = lm(ChannelIslands$Total ~ ChannelIslands$Area)
summary(mymodel)
##
## Call:
## lm(formula = ChannelIslands$Total ~ ChannelIslands$Area)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.46 -43.94 -11.95 27.23 103.66
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 194.3561 34.5928 5.618 0.001358 **
## ChannelIslands$Area 1.5772 0.2205 7.154 0.000376 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63.9 on 6 degrees of freedom
## Multiple R-squared: 0.8951, Adjusted R-squared: 0.8776
## F-statistic: 51.17 on 1 and 6 DF, p-value: 0.0003764
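Based on the linear regression results, each additional 1 km^2 of island area is associated with an expected increase of 1.5772 total species.
Based on the model, an island with 0.0 km^2 area would be predicted to have 194.3561 total species (the intercept). This is an extrapolation below the smallest island in the data (2.6 km^2), so it should not be read literally.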
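To make the slope and intercept concrete, the fitted line can be used to predict richness at a few areas. This is a minimal sketch: the model is refit with a `data` argument so that `predict()` accepts new data, and the areas 0, 100, and 200 km^2 are hypothetical values picked for illustration, not islands in the data set.

```r
# Data copied from the vectors defined above.
islands <- data.frame(
  Area  = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Total = c(132, 265, 267, 270, 382, 604, 484, 650)
)

# Same regression as mymodel, written with a data argument so predict() works on new data.
fit <- lm(Total ~ Area, data = islands)

# Hypothetical areas chosen for illustration.
new_areas <- data.frame(Area = c(0, 100, 200))
pred <- predict(fit, newdata = new_areas)
pred

# Successive predictions 100 km^2 apart differ by 100 times the slope.
diff(pred)
100 * coef(fit)[["Area"]]
```

The prediction at an area of 0 is the intercept, and each step of 100 km^2 adds 100 times the slope.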
The t-value for the slope term is 7.154 and the Pr(>|t|) is 0.000376. This means that the estimated slope lies 7.154 standard errors from zero. Since the p-value is less than 0.05, the slope is significantly different from zero, i.e. the relationship between total species and area is statistically significant.
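The t-value and its p-value can be reproduced by hand from the coefficient table. The sketch below uses the rounded estimate and standard error as printed, so the result may differ from the reported 7.154 in the last digits; the variable names are introduced here for illustration.

```r
# Values copied from the coefficient table printed by summary(mymodel).
estimate  <- 1.5772   # slope for Area
std_error <- 0.2205   # its standard error
df_resid  <- 6        # residual degrees of freedom (8 islands - 2 coefficients)

# t-value: how many standard errors the estimate lies from zero.
t_value <- estimate / std_error

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_value), df = df_resid)

t_value
p_value
```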
The multiple R-squared value is 0.8951, which means that 89.51% of the variation in total species is explained by island area.
SSY <- sum((ChannelIslands$Total - mean(ChannelIslands$Total))^2)
SSY
## [1] 233469.5
The total sum of squares is 233469.5.
SSE <- sum(mymodel$residuals^2)
SSE
## [1] 24501.08
The error sum of squares is 24501.08.
1 - (SSE/SSY)
## [1] 0.8950566
The proportionate reduction of SSE relative to SSY is 0.8950566, which matches the multiple R-squared (0.8951) reported by summary(mymodel).
((SSY-SSE)/1)/(SSE/(8-1-1))
## [1] 51.17369
This corresponds to the F-statistic, which is 51.17369 on 1 and 6 degrees of freedom. It tests the null hypothesis that there is no linear relationship between total species and area by comparing the variance explained by the model to the residual variance. Since this F-statistic is large (p-value = 0.0003764), we can reject this null hypothesis.
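The p-value attached to this F-statistic can be recovered from the F distribution. This is a small sketch using the sums of squares printed above; `n` and `k` are names introduced here for the number of islands and the number of predictors.

```r
# Sums of squares as printed above.
SSY <- 233469.5   # total sum of squares
SSE <- 24501.08   # error (residual) sum of squares
n   <- 8          # number of islands
k   <- 1          # number of predictors (Area)

# F = (explained variation per model df) / (residual variation per residual df).
F_stat <- ((SSY - SSE) / k) / (SSE / (n - k - 1))

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(F_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

F_stat
p_value
```

With a single predictor, the F-statistic is the square of the slope's t-value, which is why both tests report the same p-value.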
ggplot(data=ChannelIslands) +
geom_point(mapping=aes(x=Area, y=Native), color="forestgreen", shape=15, size = 2.5) +
geom_smooth(mapping=aes(x=Area, y=Native), color="forestgreen") +
geom_point(mapping=aes(x=Area, y=Endemic), color="dodgerblue", shape=15, size = 2.5) +
geom_smooth(mapping=aes(x=Area, y=Endemic), color="dodgerblue") +
geom_point(mapping=aes(x=Area, y=Exotic), color="firebrick1", shape=15, size = 2.5) +
geom_smooth(mapping=aes(x=Area, y=Exotic), color="firebrick1") +
theme_gray()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
For Native plants, the slope of richness on area is steeper than for Endemic or Exotic plants. The intercept for Exotic plants is higher than for Endemic plants, though both have much shallower slopes than Native plants and look fairly similar to each other at this scale.
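The visual comparison of slopes and intercepts can be checked by fitting a separate linear regression for each group. This is a sketch under the assumption that a straight line is an adequate summary for each group; the names `islands`, `groups`, and `coefs` are introduced here for illustration.

```r
# Data copied from the vectors defined above.
islands <- data.frame(
  Area    = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Native  = c(88, 190, 198, 139, 272, 421, 387, 480),
  Endemic = c(14, 22, 18, 18, 47, 37, 42, 45),
  Exotic  = c(44, 75, 69, 131, 110, 185, 98, 170)
)

groups <- c("Native", "Endemic", "Exotic")

# Fit richness ~ Area separately for each group and collect the coefficients.
coefs <- sapply(groups, function(g) {
  coef(lm(reformulate("Area", response = g), data = islands))
})

# Rows: (Intercept) and Area (slope); columns: one per group.
coefs
```

Reading across the `Area` row compares the slopes, and reading across the `(Intercept)` row compares the intercepts.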
islanddistance=data.frame(ChannelIslands$Island, mymodel$residuals, ChannelIslands$Dist)
ggplot(data=islanddistance) +
geom_point(mapping=aes(x=ChannelIslands.Dist, y=mymodel.residuals), color="forestgreen", shape=15, size = 2.5) +
geom_smooth(mapping=aes(x=ChannelIslands.Dist, y=mymodel.residuals), color="forestgreen") +
theme_gray()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
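The above plot suggests that as distance increases, the residuals from the species-area model tend to decrease slightly, i.e. more distant islands tend to have somewhat fewer species than their area alone would predict.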
mymodel2 <- lm(mymodel$residuals ~ ChannelIslands$Dist)
summary(mymodel2)
##
## Call:
## lm(formula = mymodel$residuals ~ ChannelIslands$Dist)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.963 -38.740 -1.323 31.677 80.469
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.778 42.476 1.478 0.190
## ChannelIslands$Dist -1.237 0.751 -1.647 0.151
##
## Residual standard error: 53.03 on 6 degrees of freedom
## Multiple R-squared: 0.3114, Adjusted R-squared: 0.1966
## F-statistic: 2.713 on 1 and 6 DF, p-value: 0.1506
The slope for the relationship between distance and the residuals of the total species-area model is -1.237, the t-value is -1.647, and the R^2 is 0.3114. However, since the p-value is 0.151, which is not less than 0.05, I cannot reject the null hypothesis that there is no relationship between the two variables.
summary(mymodel)
##
## Call:
## lm(formula = ChannelIslands$Total ~ ChannelIslands$Area)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.46 -43.94 -11.95 27.23 103.66
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 194.3561 34.5928 5.618 0.001358 **
## ChannelIslands$Area 1.5772 0.2205 7.154 0.000376 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63.9 on 6 degrees of freedom
## Multiple R-squared: 0.8951, Adjusted R-squared: 0.8776
## F-statistic: 51.17 on 1 and 6 DF, p-value: 0.0003764
summary(mymodel2)
##
## Call:
## lm(formula = mymodel$residuals ~ ChannelIslands$Dist)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.963 -38.740 -1.323 31.677 80.469
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.778 42.476 1.478 0.190
## ChannelIslands$Dist -1.237 0.751 -1.647 0.151
##
## Residual standard error: 53.03 on 6 degrees of freedom
## Multiple R-squared: 0.3114, Adjusted R-squared: 0.1966
## F-statistic: 2.713 on 1 and 6 DF, p-value: 0.1506
89.51% of the variation in total richness is explained by area, while 31.14% of the remaining (residual) variation is explained by distance, though this relationship is not statistically significant (p-value = 0.1506).
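Because the second percentage applies only to the variation left over after area, it helps to express it as a share of the original variation in total species. The arithmetic below is a sketch using the two R-squared values printed above; it describes this two-stage residual approach and is not necessarily equal to the R-squared of a single multiple regression on area and distance.

```r
r2_area <- 0.8951   # R-squared of Total ~ Area (mymodel)
r2_dist <- 0.3114   # R-squared of residuals ~ Dist (mymodel2)

# Share of the original variation left unexplained by area.
unexplained_by_area <- 1 - r2_area

# Share of the original variation that distance accounts for in the second stage.
dist_share <- unexplained_by_area * r2_dist

# Combined share accounted for by the two stages.
combined <- r2_area + dist_share

dist_share
combined
```

In other words, distance accounts for roughly 3.3 percentage points of the original variation (0.1049 × 0.3114), bringing the two-stage total to about 92.8%.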