Island=c("Santa Barbara","Anacapa","San Miguel","San Nicolas","San Clemente","Santa Catalina","Santa Rosa","Santa Cruz")
Area=c(2.6, 2.9, 37, 58, 145, 194, 217, 294)
Dist=c(61, 20, 42, 98, 79, 32, 44, 30)
Native=c(88,190,198,139,272,421,387,480)
Endemic=c(14,22,18,18,47,37,42,45)
Exotic=c(44,75,69,131,110,185,98,170)
Total=c(132,265,267,270,382,604,484,650)
##Coerce Data Vectors Into a Dataframe:
ChannelIslands=data.frame(Island, Area, Dist, Native, Endemic, Exotic, Total)
##Remove the individual vectors now that they are assembled into a dataframe.
rm(Island, Area, Dist, Native, Endemic, Exotic, Total)
##Convert “Island” (the variable containing island names) into a Factor Variable:
ChannelIslands$Island <- factor(ChannelIslands$Island)
library(ggplot2)
ggplot(ChannelIslands, mapping = aes(x = Area, y = Total)) +
geom_point(aes(color=Island, size=Area)) +
geom_smooth() +
xlab("Island Area") +
ylab("Total Species") +
ggtitle("Scatterplot of Total Species on Island Area")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The above plot suggests that as the area of an island increases, the total number of species also increases.
Santa Barbara has the fewest species (132). It also has the smallest area (2.6 km^2); only Anacapa (2.9 km^2) is comparable in size, and every other island is at least ten times larger.
Santa Cruz has the greatest number of species (650). It also has the largest area (294 km^2).
The smooth line suggests that the relationship between total species and island area is approximately linear, although with only eight islands the loess curve is a rough guide.
mymodel = lm(ChannelIslands$Native ~ ChannelIslands$Area)
summary(mymodel)
##
## Call:
## lm(formula = ChannelIslands$Native ~ ChannelIslands$Area)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.612 -34.226 -7.542 34.551 61.581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 124.8303 25.9310 4.814 0.002958 **
## ChannelIslands$Area 1.2376 0.1653 7.488 0.000293 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.9 on 6 degrees of freedom
## Multiple R-squared: 0.9033, Adjusted R-squared: 0.8872
## F-statistic: 56.07 on 1 and 6 DF, p-value: 0.0002931
Based on the linear regression results, for every 1 km^2 increase in island area, the number of native species on an island is expected to rise by 1.2376 species.
Based on the model, an island with 0.0 km^2 area would be predicted to have 124.8303 native species. This intercept is an extrapolation below the smallest island in the data (2.6 km^2).
The t-value for the slope term is 7.488 and the Pr(>|t|) is 0.000293. This means that the estimated slope lies 7.488 standard errors from zero. Since the p-value is less than 0.05, the relationship between native species and area is statistically significant.
The multiple R-squared value is 0.9033, which means that 90.33% of the variation in native species is explained by island area.
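As a worked illustration of the intercept and slope, the sketch below refits the same regression with the `data =` argument, an assumption made here so that `predict()` can accept new areas (the original call indexes `ChannelIslands$` columns directly). The names `islands`, `area_model`, and `new_island` are illustrative, and the 100 km^2 island is hypothetical.

```r
# Rebuild the two columns the model needs (values copied from the data above)
islands <- data.frame(
  Area   = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Native = c(88, 190, 198, 139, 272, 421, 387, 480)
)

# Same regression as mymodel, written with data = so predict() can take newdata
area_model <- lm(Native ~ Area, data = islands)
coef(area_model)

# Hypothetical island of 100 km^2 (not one of the Channel Islands)
new_island <- data.frame(Area = 100)
predicted_native <- predict(area_model, newdata = new_island)
predicted_native
```

The prediction is simply the intercept plus 100 times the slope, about 124.83 + 123.76, or roughly 248.6 native species.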
SSY <- sum((ChannelIslands$Native - mean(ChannelIslands$Native))^2)
SSY
## [1] 142434.9
The total sum of squares is 142434.9.
SSE <- sum(mymodel$residuals^2)
SSE
## [1] 13767.48
The error sum of squares is 13767.48.
(SSY-SSE)/SSY
## [1] 0.9033419
The proportionate reduction in error, (SSY - SSE)/SSY, is 0.9033419, which matches the multiple R-squared reported in the model summary.
((SSY-SSE)/1)/(SSE/(8-1-1))
## [1] 56.07448
This is the F-statistic, 56.07448, which matches the value in the model summary. It tests the null hypothesis that island area explains none of the variation in native species (that is, the slope is zero). With 1 and 6 degrees of freedom, an F this large corresponds to a p-value of 0.0002931, so we can reject the null hypothesis.
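The link between the F-statistic, its p-value, and the slope's t-value can be checked with a few lines of arithmetic. This is a minimal sketch that starts from the rounded sums of squares reported above; the variable names (`F_stat`, `p_value`, `t_slope`, `n`, `p`) are illustrative.

```r
# Sums of squares as reported above (rounded values)
SSY <- 142434.9
SSE <- 13767.48
n <- 8   # number of islands
p <- 1   # number of predictors (Area)

# Mean square explained divided by mean square error
F_stat <- ((SSY - SSE) / p) / (SSE / (n - p - 1))

# Upper-tail probability of the F distribution with 1 and 6 df
p_value <- pf(F_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

# With a single predictor, F equals the squared t-value of the slope
t_slope <- 7.488
c(F = F_stat, t_squared = t_slope^2, p = p_value)
```

Up to rounding, the squared t-value and the F-statistic agree, which is why the slope's p-value and the overall model p-value are the same in a one-predictor regression.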
ggplot(data=ChannelIslands) +
geom_point(mapping=aes(x=Area, y=Native), color="forestgreen", shape=15, size = 2.5) +
geom_smooth(mapping=aes(x=Area, y=Native), color="forestgreen") +
geom_point(mapping=aes(x=Area, y=Endemic), color="dodgerblue", shape=15, size = 2.5) +
geom_smooth(mapping=aes(x=Area, y=Endemic), color="dodgerblue") +
geom_point(mapping=aes(x=Area, y=Exotic), color="firebrick1", shape=15, size = 2.5) +
geom_smooth(mapping=aes(x=Area, y=Exotic), color="firebrick1") +
theme_gray()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
For native plants, the slope of richness on area is steeper than for endemic and exotic plants. The intercept for exotic plants appears somewhat higher than for endemic plants, though in the plot both seem to have fairly similar slopes.
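The visual comparison of slopes and intercepts can be checked numerically by fitting one straight line per group. The sketch below assumes a linear fit is an acceptable stand-in for the loess curves in the plot; `islands`, `groups`, and `fits` are illustrative names.

```r
# Area and the three richness columns (values copied from the data above)
islands <- data.frame(
  Area    = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Native  = c(88, 190, 198, 139, 272, 421, 387, 480),
  Endemic = c(14, 22, 18, 18, 47, 37, 42, 45),
  Exotic  = c(44, 75, 69, 131, 110, 185, 98, 170)
)

groups <- c("Native", "Endemic", "Exotic")

# Fit <group> ~ Area for each group and collect intercept and slope
fits <- sapply(groups, function(g) {
  coef(lm(reformulate("Area", response = g), data = islands))
})

# Rows: (Intercept) and Area (slope); columns: one per group
fits
```

Reading across the `Area` row gives the three slopes side by side, which is a firmer basis for comparison than the overlaid smooths.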
islanddistance=data.frame(ChannelIslands$Island, mymodel$residuals, ChannelIslands$Dist)
ggplot(data=islanddistance) +
geom_point(mapping=aes(x=ChannelIslands.Dist, y=mymodel.residuals), color="forestgreen", shape=15, size = 2.5) +
geom_smooth(mapping=aes(x=ChannelIslands.Dist, y=mymodel.residuals), color="forestgreen") +
xlab("Island Distance") +
ylab("Model Residuals") +
ggtitle("Scatterplot of Model Residuals on Island Distance")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
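The above plot suggests that as distance increases, the residuals from the area model decrease: more distant islands tend to have fewer native species than their area alone predicts, while nearer islands tend to have more.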
mymodel2 <- lm(mymodel$residuals ~ ChannelIslands$Dist)
summary(mymodel2)
##
## Call:
## lm(formula = mymodel$residuals ~ ChannelIslands$Dist)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.849 -18.320 8.098 15.904 29.724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.3151 20.4815 3.482 0.01311 *
## ChannelIslands$Dist -1.4052 0.3621 -3.880 0.00817 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.57 on 6 degrees of freedom
## Multiple R-squared: 0.7151, Adjusted R-squared: 0.6676
## F-statistic: 15.06 on 1 and 6 DF, p-value: 0.008167
The slope for the relationship between distance and the residuals of the native species on area model is -1.4052, the t-value is -3.880, and the R^2 is 0.7151. Since the p-value is 0.00817, which is less than 0.05, I can reject the null hypothesis that there is no relationship between the two variables. For every 1 km increase in distance, the residual decreases by 1.4052 species on average.
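To make the residual slope concrete, the sketch below plugs distances into the fitted line using the coefficients reported in `summary(mymodel2)`. The helper `expected_residual()` and the names `b0` and `b1` are hypothetical, introduced only for this illustration.

```r
# Coefficients as reported in summary(mymodel2)
b0 <- 71.3151    # intercept
b1 <- -1.4052    # change in residual per km of distance

# Hypothetical helper: expected residual (native species above or below
# the area-based prediction) at a given distance in km
expected_residual <- function(dist_km) b0 + b1 * dist_km

# Smallest and largest distances in the data (20 km and 98 km)
expected_residual(c(20, 98))

# Distance at which the expected residual is zero
-b0 / b1
```

A positive value means more native species than area alone predicts, and a negative value means fewer. The zero crossing falls near the mean distance in the data, as expected for a regression on residuals that average zero.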
summary(mymodel)
##
## Call:
## lm(formula = ChannelIslands$Native ~ ChannelIslands$Area)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.612 -34.226 -7.542 34.551 61.581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 124.8303 25.9310 4.814 0.002958 **
## ChannelIslands$Area 1.2376 0.1653 7.488 0.000293 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.9 on 6 degrees of freedom
## Multiple R-squared: 0.9033, Adjusted R-squared: 0.8872
## F-statistic: 56.07 on 1 and 6 DF, p-value: 0.0002931
summary(mymodel2)
##
## Call:
## lm(formula = mymodel$residuals ~ ChannelIslands$Dist)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.849 -18.320 8.098 15.904 29.724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.3151 20.4815 3.482 0.01311 *
## ChannelIslands$Dist -1.4052 0.3621 -3.880 0.00817 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.57 on 6 degrees of freedom
## Multiple R-squared: 0.7151, Adjusted R-squared: 0.6676
## F-statistic: 15.06 on 1 and 6 DF, p-value: 0.008167
90.33% of the variation in native richness is explained by area, while 71.51% of the remaining variance is explained by distance.
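The two percentages can be combined into a single share of the original variation in native richness. This is a rough sketch of the two-stage arithmetic only, using the rounded R-squared values reported above; it is not the R-squared of a multiple regression on area and distance together, which could differ. The variable names are illustrative.

```r
# R-squared values as reported in the two model summaries
r2_area <- 0.9033   # native richness explained by area
r2_dist <- 0.7151   # residual variation explained by distance

# Share of the original variation left over after the area model
unexplained_after_area <- 1 - r2_area

# Share of the original variation picked up by distance in the second stage
share_from_distance <- unexplained_after_area * r2_dist

# Combined share explained by the two-stage approach
total_explained <- r2_area + share_from_distance
c(distance_share = share_from_distance, total = total_explained)
```

Under these assumptions, distance accounts for about 7% of the original variation, and the two stages together account for roughly 97%.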