#instantiate the data
#input vectors for each of the 7 columns of data:
Island=c("Santa Barbara","Anacapa","San Miguel","San Nicolas","San Clemente","Santa Catalina","Santa Rosa","Santa Cruz")
Area=c(2.6, 2.9, 37, 58, 145, 194, 217, 294)
Dist=c(61, 20, 42, 98, 79, 32, 44, 30)
Native=c(88,190,198,139,272,421,387,480)
Endemic=c(14,22,18,18,47,37,42,45)
Exotic=c(44,75,69,131,110,185,98,170)
Total=c(132,265,267,270,382,604,484,650)
#coerce data vectors into a data frame
ChannelIslands=data.frame(Island, Area, Dist, Native, Endemic, Exotic, Total)
#remove the individual vectors now that they are assembled into a data frame
rm(Island, Area, Dist, Native, Endemic, Exotic, Total)
#convert Island (the variable containing island names) into a factor variable
ChannelIslands$Island <- factor(ChannelIslands$Island)
#load ggplot2, which provides ggplot() and the geoms used below
library(ggplot2)
#plot Total Species (y-axis) against Island Area (x-axis)
ggplot(ChannelIslands, aes(x=Area, y=Total)) +
geom_point(aes(color=Total, size=Area )) +
geom_smooth()+
xlab("Island Area") +
ylab("Total Species") +
ggtitle("Scatterplot of Total Species on Area")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The greater the island area, the greater the number of plant species found on that island.
The island with the fewest plant species is Santa Barbara (132), which has an area of 2.6 square kilometers, the smallest area of the 8 islands.
The island with the greatest number of plant species is Santa Cruz (650), which has an area of 294 square kilometers, the largest area of the 8 islands.
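The smoothed (loess) line rises with area, which points to a positive association between island area (x) and total species (y): as area increases, the total number of species also increases. The line shows the direction of the trend rather than the strength of the correlation.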
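As a rough check on the strength of this association, Pearson's correlation coefficient can be computed directly. The sketch below re-enters the Area and Total columns in an illustrative data frame called `islands` (a name assumed here, not part of the original analysis) so that it runs on its own.

```r
# Illustrative copy of two columns from the Channel Islands data
# (the name `islands` is an assumption made for this sketch)
islands <- data.frame(
  Area  = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Total = c(132, 265, 267, 270, 382, 604, 484, 650)
)

# Pearson correlation between island area and total species richness
r_area_total <- cor(islands$Area, islands$Total)
print(r_area_total)
```

A value close to 1 would support describing the relationship as strongly positive, although with only 8 islands the estimate is imprecise.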
mymodel=lm(ChannelIslands$Native ~ ChannelIslands$Area)
summary(mymodel)
##
## Call:
## lm(formula = ChannelIslands$Native ~ ChannelIslands$Area)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.612 -34.226 -7.542 34.551 61.581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 124.8303 25.9310 4.814 0.002958 **
## ChannelIslands$Area 1.2376 0.1653 7.488 0.000293 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.9 on 6 degrees of freedom
## Multiple R-squared: 0.9033, Adjusted R-squared: 0.8872
## F-statistic: 56.07 on 1 and 6 DF, p-value: 0.0002931
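The slope coefficient of 1.2376 tells us that for each additional square kilometer of island area, the model predicts about 1.24 more native species.
According to this model, an island with an area of 0.0 square kilometers would be predicted to have about 125 native plant species. This intercept is an extrapolation below the smallest observed island (2.6 square kilometers), so it should not be read as a literal ecological prediction.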
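To see how the slope and intercept combine into a prediction, the following sketch refits the same regression with a `data` argument and calls `predict()` for a hypothetical 100 square kilometer island. The data frame name `islands`, the object names, and the 100 square kilometer value are assumptions made for illustration only.

```r
# Illustrative copy of the Area and Native columns
# (the name `islands` is an assumption made for this sketch)
islands <- data.frame(
  Area   = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Native = c(88, 190, 198, 139, 272, 421, 387, 480)
)

# Same simple linear regression as above, written with a data argument
# so that predict() can be given new Area values
fit <- lm(Native ~ Area, data = islands)
print(coef(fit))

# Predicted native richness for a hypothetical 100 km^2 island:
# intercept + slope * 100
new_island <- data.frame(Area = 100)
pred <- predict(fit, newdata = new_island)
print(pred)
```

Writing the formula as `Native ~ Area` with `data = islands` fits the same model as the `$` form used above, but it lets `predict()` match the `Area` column in `newdata`.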
The high t-value, 7.488, and very low p-value, 0.000293, indicate that the slope is significantly different from zero. This brings me to the conclusion that island area is a meaningful and statistically significant predictor of native plant richness in this data set.
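A confidence interval for the slope gives a complementary view of the same test: if the 95% interval excludes zero, the slope differs from zero at the 5% level. The sketch below uses `confint()` on a refit of the model; the data frame name `islands` and the object names are assumptions made for this example.

```r
# Illustrative copy of the Area and Native columns
# (the name `islands` is an assumption made for this sketch)
islands <- data.frame(
  Area   = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Native = c(88, 190, 198, 139, 272, 421, 387, 480)
)

fit <- lm(Native ~ Area, data = islands)

# 95% confidence interval for the Area slope
# (estimate +/- t quantile on 6 df * standard error)
ci_slope <- confint(fit, "Area", level = 0.95)
print(ci_slope)
```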
The Multiple R-squared value, 0.9033, indicates that about 90.3% of the variation in the number of native plant species across the islands can be explained by the linear relationship with island area. Only about 9.7% of the variation in native species remains unexplained. Thus, the size of an island accounts for most of the differences in native plant species among the islands in this data set.
The total sum of squares (SSY) value is 142434.9.
SSY <- sum((ChannelIslands$Native - mean(ChannelIslands$Native))^2)
The error sum of squares (SSE) value is 13767.48.
SSE <- sum(mymodel$residuals^2)
The proportionate reduction of SSE relative to SSY is 0.9033419, which equals the model's Multiple R-squared.
1 - (SSE/SSY)
## [1] 0.9033419
The F-ratio computed from SSY and SSE, with 1 and 6 degrees of freedom, is 56.07448, which matches this model’s F-statistic.
((SSY - SSE)/1)/(SSE/(8 - 1 - 1))
## [1] 56.07448
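The hand calculation above can be compared programmatically with the value that `summary()` stores. This is a minimal sketch that re-enters the data under the assumed name `islands`; the object names `f_by_hand` and `f_from_summary` are also introduced only for this example.

```r
# Illustrative copy of the Area and Native columns
# (the name `islands` is an assumption made for this sketch)
islands <- data.frame(
  Area   = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Native = c(88, 190, 198, 139, 272, 421, 387, 480)
)

fit <- lm(Native ~ Area, data = islands)

# Total and error sums of squares
ssy <- sum((islands$Native - mean(islands$Native))^2)
sse <- sum(residuals(fit)^2)

# F = (explained SS / predictor df) / (error SS / residual df)
n <- nrow(islands)  # 8 islands
p <- 1              # one predictor
f_by_hand <- ((ssy - sse) / p) / (sse / (n - p - 1))

# F-statistic as stored in the model summary
f_from_summary <- summary(fit)$fstatistic[["value"]]

print(c(by_hand = f_by_hand, from_summary = f_from_summary))
```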
ggplot(data = ChannelIslands) +
  #native species: green squares
  geom_point(mapping = aes(x = Area, y = Native),
             color = "forestgreen", shape = 15, size = 2.5) +
  geom_smooth(mapping = aes(x = Area, y = Native), color = "forestgreen") +
  #endemic species: blue points
  geom_point(mapping = aes(x = Area, y = Endemic), color = "dodgerblue") +
  geom_smooth(mapping = aes(x = Area, y = Endemic), color = "dodgerblue") +
  #exotic species: red squares
  geom_point(mapping = aes(x = Area, y = Exotic),
             color = "firebrick1", shape = 15, size = 2.5) +
  geom_smooth(mapping = aes(x = Area, y = Exotic), color = "firebrick1") +
  theme_gray()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Native species richness rises steeply with island area, as shown by the clearly positive slope of the green curve. Exotic species richness also increases with area, but the slope is much shallower than for native species. Endemic species richness changes very little as island area increases and stays low throughout, ranging from 14 to 47 species.
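The visual comparison of slopes can be backed with numbers by fitting a separate straight line for each species group. Note that this swaps the loess curves in the plot for simple linear fits, so it is only an approximation of what the figure shows; the data frame name `islands` and the vector name `slopes` are assumptions made for this sketch.

```r
# Illustrative copy of the relevant columns
# (the name `islands` is an assumption made for this sketch)
islands <- data.frame(
  Area    = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Native  = c(88, 190, 198, 139, 272, 421, 387, 480),
  Endemic = c(14, 22, 18, 18, 47, 37, 42, 45),
  Exotic  = c(44, 75, 69, 131, 110, 185, 98, 170)
)

# Fit richness ~ area separately for each group and keep only the slope
# (species gained per additional square kilometer)
slopes <- sapply(c("Native", "Endemic", "Exotic"), function(v) {
  coef(lm(islands[[v]] ~ islands$Area))[[2]]
})
print(slopes)
```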
islanddistance <- data.frame(
Island = ChannelIslands$Island,
residuals = mymodel$residuals,
Dist = ChannelIslands$Dist)
ggplot(data = islanddistance) +
geom_point(aes(x = Dist, y = residuals),
color = "forestgreen",
shape = 15, size = 2.5) +
geom_smooth(aes(x = Dist, y = residuals),
color = "forestgreen") +
xlab("Island Distance") +
ylab("Model Residuals") +
ggtitle("Scatterplot of Model Residuals on Island Distance") +
theme_gray()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
If the residuals showed no pattern against distance, island distance would explain no additional variation beyond what island area already captures. In this case, the residuals do appear to show a decreasing trend with distance. This suggests that distance might also affect the number of native species and that the model may still be missing an important predictor.
Regressing the residuals on distance confirms this: the slope is negative (about -1.41) and significant (p = 0.008), showing that more distant islands have fewer native species than predicted by area alone. This second model explains a substantial fraction, 71.5%, of the leftover variation, suggesting distance is an important secondary predictor.
mymodel2 <- lm(mymodel$residuals ~ ChannelIslands$Dist)
summary(mymodel2)
##
## Call:
## lm(formula = mymodel$residuals ~ ChannelIslands$Dist)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.849 -18.320 8.098 15.904 29.724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.3151 20.4815 3.482 0.01311 *
## ChannelIslands$Dist -1.4052 0.3621 -3.880 0.00817 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.57 on 6 degrees of freedom
## Multiple R-squared: 0.7151, Adjusted R-squared: 0.6676
## F-statistic: 15.06 on 1 and 6 DF, p-value: 0.008167
Area explained 90.3% of the total variance in native richness. Distance explained 71.5% of the remaining variance, which is about 7% of the total. This shows that area is the primary determinant, but distance still has a meaningful secondary effect.
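The "about 7% of the total" figure comes from multiplying the share of variance left unexplained by area by the share of that remainder explained by distance. The sketch below reproduces the arithmetic from the two fitted models; the data frame name `islands` and the object names are assumptions made for this example.

```r
# Illustrative copy of the relevant columns
# (the name `islands` is an assumption made for this sketch)
islands <- data.frame(
  Area   = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Dist   = c(61, 20, 42, 98, 79, 32, 44, 30),
  Native = c(88, 190, 198, 139, 272, 421, 387, 480)
)

# Step 1: native richness on area
fit_area <- lm(Native ~ Area, data = islands)
# Step 2: residuals from step 1 on distance
fit_dist <- lm(residuals(fit_area) ~ Dist, data = islands)

r2_area  <- summary(fit_area)$r.squared   # share of total variance explained by area
r2_resid <- summary(fit_dist)$r.squared   # share of the remainder explained by distance

# Share of the total variance attributable to distance in this two-step scheme
share_dist <- (1 - r2_area) * r2_resid
print(round(c(area = r2_area, distance = share_dist), 4))
```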