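library(ggplot2)  # ggplot(), geom_point(), geom_smooth()
library(plotly)   # ggplotly()
library(dplyr)    # %>%, select(), mutate()
Island=c("Santa Barbara","Anacapa","San Miguel","San Nicolas","San Clemente","Santa Catalina","Santa Rosa","Santa Cruz")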
Area=c(2.6, 2.9, 37, 58, 145, 194, 217, 294)
Dist=c(61, 20, 42, 98, 79, 32, 44, 30)
Native=c(88,190,198,139,272,421,387,480)
Endemic=c(14,22,18,18,47,37,42,45)
Exotic=c(44,75,69,131,110,185,98,170)
Total=c(132,265,267,270,382,604,484,650)
ChannelIslands=data.frame(Island, Area, Dist, Native, Endemic, Exotic, Total)
rm(Island, Area, Dist, Native, Endemic, Exotic, Total)
ChannelIslands$Island <- factor(ChannelIslands$Island)
1. Include the plot in the write-up alongside the working code
ggplot(ChannelIslands, aes(x = Area, y = Total)) +
  geom_point(aes(color = Island, size = Area)) +
  geom_smooth() +
  xlab("Island Area (sq. km)") +
  # y is Total (native + exotic species), so the label should not say "Native"
  ylab("Total Number of Plant Species") +
  ggtitle("Scatterplot of Total Species by Island Area")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
2. What does the plot suggest about the relationship between the number of plant species found on different islands and the size of the islands?
The plot suggests that as the size of the island increases, the number of plant species found on the island increases as well. This means there is a positive relationship between the two variables.
3. Which island has the fewest species? How does its size compare to other islands?
Santa Barbara has the fewest species, with 132 total. It is also the smallest island with an area of 2.6 square km.
4. Which island has the greatest number of species? How does its size compare to the other islands?
Santa Cruz has the greatest number of species with a total of 650. It is also the largest island with an area of 294 square km.
5. What does the smooth line tell you about the form (or shape) of the relationship between the two variables?
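The smooth line shows the form of the relationship: it trends upward overall across the range of island areas, so the number of species increases with area rather than scattering randomly. With only eight islands, the loess curve is sensitive to individual points, so any bends in the line should be interpreted cautiously.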
attach(ChannelIslands)
mymodel = lm(Native ~ Area)
summary(mymodel)
##
## Call:
## lm(formula = Native ~ Area)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.612 -34.226 -7.542 34.551 61.581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 124.8303 25.9310 4.814 0.002958 **
## Area 1.2376 0.1653 7.488 0.000293 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.9 on 6 degrees of freedom
## Multiple R-squared: 0.9033, Adjusted R-squared: 0.8872
## F-statistic: 56.07 on 1 and 6 DF, p-value: 0.0002931
6. What does the slope coefficient tell you about how the expected number of native species changes with changes in island area?
The slope coefficient tells me that the expected number of native species increases by about 1.24 species (1.2376) for each additional 1 km2 of island area.
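One way to check this reading of the slope is to compare the model's predictions for two islands that differ by exactly 1 km2. The sketch below refits the same model from the data above; the areas 100 and 101 km2 are arbitrary illustration values, not islands in the dataset, and `fit` and `slope_check` are names introduced here.

```r
# Data copied from the ChannelIslands data frame above
Area   <- c(2.6, 2.9, 37, 58, 145, 194, 217, 294)
Native <- c(88, 190, 198, 139, 272, 421, 387, 480)

fit <- lm(Native ~ Area)

# Predicted native richness for two hypothetical islands 1 km2 apart
pred <- predict(fit, newdata = data.frame(Area = c(100, 101)))

# The difference between the two predictions equals the slope coefficient
slope_check <- unname(diff(pred))
print(slope_check)
print(coef(fit)[["Area"]])
```

Because the model is a straight line, the same difference would be obtained for any pair of areas 1 km2 apart.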
7. According to this model, how many native plant species would you expect to find on an island with a size of 0.0 km2?
The table tells me that the intercept is 124.8303, which means that for an island with a size of 0.0 km2, the model predicts about 125 native plant species. This does not make sense biologically, because an island with no area could not support any plants. The intercept is an extrapolation below the smallest island in the data (Santa Barbara, 2.6 km2), so it is better read as a feature of the fitted line than as a realistic prediction.
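The extrapolation can be made explicit by predicting at an area of 0 and comparing that value with the range of areas actually observed. This is a minimal sketch that refits the model from the data above; `fit`, `pred0`, and `area_range` are names introduced here.

```r
# Data copied from the ChannelIslands data frame above
Area   <- c(2.6, 2.9, 37, 58, 145, 194, 217, 294)
Native <- c(88, 190, 198, 139, 272, 421, 387, 480)

fit <- lm(Native ~ Area)

# Prediction at Area = 0 is just the intercept
pred0 <- unname(predict(fit, newdata = data.frame(Area = 0)))
print(pred0)

# Smallest and largest observed areas; 0 lies below the minimum
area_range <- range(Area)
print(area_range)
```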
8. What do you conclude based on an interpretation of the t-value and Pr(>|t|) for the slope term?
The t-value is 7.488 and the Pr(>|t|) is 0.000293. The t-value is the slope estimate divided by its standard error (1.2376 / 0.1653), so the estimated slope is about 7.5 standard errors away from zero. The Pr(>|t|) value is well below 0.05, which is strong evidence against the null hypothesis that the slope is zero. I conclude that the positive relationship between island area and native species richness is unlikely to be due to chance.
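The t-value and p-value for the slope can be reproduced by hand from the coefficient table. The sketch below uses the rounded estimate and standard error printed in the summary, so the results agree with the summary only up to rounding; the variable names are introduced here.

```r
# Values read from the coefficient table in the model summary
estimate  <- 1.2376   # slope for Area
std_error <- 0.1653   # standard error of the slope
df_resid  <- 6        # residual degrees of freedom (n - 2)

# t-value: how many standard errors the slope is from zero
t_value <- estimate / std_error
print(t_value)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)
print(p_value)
```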
9. Interpret the R2 (Multiple R-squared) value.
The multiple R-squared value is 0.9033, which means that 90.33% of the variation in native species richness among the islands is explained by island area in this model. Combined with the positive slope, this indicates a very strong positive linear relationship between land area and the number of native plant species.
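For a regression with a single explanatory variable, the multiple R-squared equals the squared Pearson correlation between the two variables. A short sketch, using the data above, illustrates this; `r` and `r_squared` are names introduced here.

```r
# Data copied from the ChannelIslands data frame above
Area   <- c(2.6, 2.9, 37, 58, 145, 194, 217, 294)
Native <- c(88, 190, 198, 139, 272, 421, 387, 480)

# Pearson correlation; the sign gives the direction of the relationship
r <- cor(Area, Native)
print(r)

# Squaring it gives the proportion of variance explained
r_squared <- r^2
print(r_squared)
```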
10. In this model, the number of native species is your Y variable. Calculate the total sum of squares (SSY). What is the value?
SSY = sum((Native - mean(Native))^2)
print(SSY)
## [1] 142434.9
The SSY is 142,434.9
11. Calculate the error sum of squares from the model (SSE). There are several ways to access the Y values. The easiest is: mymodel$residuals. What is the value of SSE?
residuals = mymodel$residuals
SSE = sum(residuals^2)
print(SSE)
## [1] 13767.48
The SSE is 13,767.48
12. Calculate the proportionate reduction of SSE relative to SSY. What is it?
1-(SSE/SSY)
## [1] 0.9033419
The proportionate reduction of SSE relative to SSY is 0.9033, which is the same as the multiple R-squared value.
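The match with R-squared follows from the way the total sum of squares splits into a part explained by the regression (called `SSR` here, a name not used in the assignment) and the error part. The sketch below refits the model from the data above and checks that split.

```r
# Data copied from the ChannelIslands data frame above
Area   <- c(2.6, 2.9, 37, 58, 145, 194, 217, 294)
Native <- c(88, 190, 198, 139, 272, 421, 387, 480)

fit <- lm(Native ~ Area)

SSY <- sum((Native - mean(Native))^2)        # total sum of squares
SSE <- sum(resid(fit)^2)                     # error (residual) sum of squares
SSR <- sum((fitted(fit) - mean(Native))^2)   # regression sum of squares

# SSY = SSR + SSE, so 1 - SSE/SSY is the same quantity as SSR/SSY
print(c(SSY = SSY, SSR = SSR, SSE = SSE))
print(SSR / SSY)
```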
13. Calculate the following, where k is the number of explanatory variables in the model (which is 1) and n is the number of islands (8). What is the value obtained, and what element of your model summary does it correspond with.
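# Numerator is the reduction in sum of squares (SSY - SSE) over k = 1;
# denominator is SSE over n - k - 1 = 6
value = ((SSY-SSE)/1)/((SSE)/6)
print(value)
## [1] 56.07448
The value obtained is 56.07, and it corresponds to the F-statistic at the bottom of the model summary (56.07 on 1 and 6 DF).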
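The F-statistic is a ratio of two non-negative quantities, so it is always positive: the reduction in sum of squares per explanatory variable, (SSY - SSE) / k, divided by the error variance, SSE / (n - k - 1). The sketch below uses the SSY and SSE values reported above and also shows that, with one explanatory variable, F equals the square of the slope's t-value (up to rounding). The variable names are introduced here.

```r
# Values reported earlier in this report
SSY <- 142434.9
SSE <- 13767.48
k <- 1   # number of explanatory variables
n <- 8   # number of islands

# F-statistic: explained variation per variable over residual variance
F_value <- ((SSY - SSE) / k) / (SSE / (n - k - 1))
print(F_value)

# Upper-tail p-value from the F distribution on k and n - k - 1 DF
p_value <- pf(F_value, df1 = k, df2 = n - k - 1, lower.tail = FALSE)
print(p_value)

# Squared t-value for the slope, from the coefficient table
t_squared <- 7.488^2
print(t_squared)
```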
14. Include the script and plot in your report.
p = ggplot(data = ChannelIslands) +
  # Native species: green squares
  geom_point(mapping = aes(x = Area, y = Native), color = "forestgreen", shape = 15, size = 2.5) +
  geom_smooth(mapping = aes(x = Area, y = Native), color = "forestgreen") +
  # Endemic species: blue circles
  geom_point(mapping = aes(x = Area, y = Endemic), color = "dodgerblue", shape = 16, size = 2.5) +
  geom_smooth(mapping = aes(x = Area, y = Endemic), color = "dodgerblue") +
  # Exotic species: red triangles
  geom_point(mapping = aes(x = Area, y = Exotic), color = "firebrick1", shape = 17, size = 2.5) +
  geom_smooth(mapping = aes(x = Area, y = Exotic), color = "firebrick1") +
  theme_gray()
ggplotly(p)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
15. Based on the scatterplot you created, how does the slope of the relationship between Richness and Area differ for the three different species types?
Native species richness has the steepest positive slope with area. Exotic species richness also increases with area, but with a shallower slope and more scatter around the trend. Endemic species richness has the lowest values and the shallowest slope, increasing only slightly as area increases.
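The visual comparison can be backed up by fitting a separate straight-line model for each species type and comparing the slopes. This is a sketch under the assumption that a linear fit is an adequate summary for each group (the plot uses loess smooths instead); the data frame name `islands` and the vector `slopes` are introduced here.

```r
# Data copied from the ChannelIslands data frame above
islands <- data.frame(
  Area    = c(2.6, 2.9, 37, 58, 145, 194, 217, 294),
  Native  = c(88, 190, 198, 139, 272, 421, 387, 480),
  Endemic = c(14, 22, 18, 18, 47, 37, 42, 45),
  Exotic  = c(44, 75, 69, 131, 110, 185, 98, 170)
)

# Fit <type> ~ Area for each species type and keep the slope
slopes <- sapply(c("Native", "Endemic", "Exotic"), function(v) {
  fit <- lm(reformulate("Area", response = v), data = islands)
  coef(fit)[["Area"]]
})

# Species gained per additional km2, by species type
print(round(slopes, 3))
```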
Plot the residuals from the model you created in PART 2 on Distance. Interpret the model summary, paying attention to the magnitude and the sign of the slope term, the t-value for the slope term, the F-value and the R2. You’ll need to create a new dataframe with variables Island, Residuals, Distance. Do your plotting and modelling using that new dataframe.
16. Include your script and plot. Interpret the model
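new_df <- ChannelIslands %>%
  select(Island, Dist) %>%
  # Name the new column explicitly and take the residuals straight from the model
  mutate(residuals = mymodel$residuals)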
ggplot(data=new_df) +
geom_point(mapping=aes(x=Dist, y=residuals), color="forestgreen", shape=15, size = 2.5) +
geom_smooth(mapping=aes(x=Dist, y=residuals), color="forestgreen") +
xlab("Distance from the Mainland (km)") +
ylab("Native Plant Species Residuals") +
theme_gray()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
17. What does the plot suggest about how the deviation from modeled richness (based on the model of Native Species on Area) relates to distance from the mainland?
There is a negative relationship between the residuals from the model of native species on area and distance from the mainland. Islands close to the mainland tend to have more native species than their area alone predicts (positive residuals), while islands farther from the mainland tend to have fewer than predicted (negative residuals).
18. Create a new model that regresses the residuals from your original model (from PART 2) on Distance. Include the model summary and interpret the model, focusing on the slope, t-statistic, P>|t|, and R2
# Pass the data frame directly instead of attach(), which avoids masked-object conflicts
part4model = lm(residuals ~ Dist, data = new_df)
summary(part4model)
##
## Call:
## lm(formula = residuals ~ Dist, data = new_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.849 -18.320 8.098 15.904 29.724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.3151 20.4815 3.482 0.01311 *
## Dist -1.4052 0.3621 -3.880 0.00817 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.57 on 6 degrees of freedom
## Multiple R-squared: 0.7151, Adjusted R-squared: 0.6676
## F-statistic: 15.06 on 1 and 6 DF, p-value: 0.008167
Based on the model from the script I ran, the slope coefficient is -1.4052, which means that for each additional 1 km of distance from the mainland, the residual from the area model decreases by about 1.41 species. In other words, more distant islands have fewer native species than their area alone predicts. The intercept of 71.3151 means that an island 0.0 km from the mainland would be expected to have about 71 more native species than the area model predicts. The t-statistic of -3.880 is the slope estimate divided by its standard error (-1.4052 / 0.3621), so the slope lies almost four standard errors below zero. The P>|t| is 0.00817, which is less than 0.05, so we can reject the null hypothesis that the slope is zero; the negative relationship is unlikely to be due to chance alone. Finally, the multiple R-squared is 0.7151, which means that distance explains 71.51% of the variation in the residuals, that is, of the variation left unexplained by area. This is lower than the R-squared of 0.9033 for the area model.
19. What percentage of total variance in Native Richness was explained by Area? What percentage of the remaining variance was explained by Distance?
90.33% of the total variance in Native Richness was explained by Area. Distance explained 71.51% of the remaining variance (9.67%): 9.67 x 0.7151 = 6.92, so Distance accounts for about 6.92% of the total variance.
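The arithmetic behind this partition can be written out step by step. The sketch below uses the two rounded R-squared values from the model summaries above; the variable names are introduced here, and the combined figure describes this two-stage approach (area first, then distance on the residuals) rather than a single model fitted with both predictors.

```r
# R-squared values read from the two model summaries
r2_area       <- 0.9033   # Native ~ Area
r2_dist_resid <- 0.7151   # residuals ~ Dist

# Share of total variance left unexplained by area
remaining <- 1 - r2_area

# Share of total variance explained by distance
dist_share <- remaining * r2_dist_resid

# Share of total variance accounted for by the two stages together
combined <- r2_area + dist_share

print(round(100 * c(remaining = remaining,
                    dist_share = dist_share,
                    combined = combined), 2))
```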