The Iris data set contains three plant species (setosa, virginica, versicolor) and four features measured for each sample. The analysis and graphics below will calculate the basic statistical features of each trait for each species. Next, the code below will determine if Sepal Width is a good predictor of Sepal Length across all species.
data ("iris")
library(dplyr)
library(tidyverse)
library(purrr) #package for iteration
library(AER)
Using the iris data set, calculate the min, max, mean, and median for each variable for trait and for each species.
basic_stats <- iris %>%
group_by(Species) %>%
summarise_all(list(min, max, mean, median)) %>% # specifies the statistics you want from the summarize function
pivot_longer(-Species) %>% # expands the data frame, removing unnecessary columns
pivot_wider(names_from = "name") # widens the data frame, so that it is now restructured into a readable format, with each statistics for each trait of each species
Create the table
knitr::kable(basic_stats, align = "ccccccccccccc", caption = "Figure 1: Basic Statistics for each Trait for each Species of Iris", col.names = c("Species", "Minimum Sepal Length", "Minimum Sepal Width", "Minimum Petal Length", "Minimum Petal Width", "Maximum Sepal Length", "Maximum Sepal Width", "Maximum Petal Length", "Maximum Petal Width", "Mean Sepal Length", "Mean Sepal Width", "Mean Petal Length", "Mean Petal Width","Median Sepal Length", "Median Sepal Width", "Median Petal Length", "Median Petal Width"))
| Species | Minimum Sepal Length | Minimum Sepal Width | Minimum Petal Length | Minimum Petal Width | Maximum Sepal Length | Maximum Sepal Width | Maximum Petal Length | Maximum Petal Width | Mean Sepal Length | Mean Sepal Width | Mean Petal Length | Mean Petal Width | Median Sepal Length | Median Sepal Width | Median Petal Length | Median Petal Width |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| setosa | 4.3 | 2.3 | 1.0 | 0.1 | 5.8 | 4.4 | 1.9 | 0.6 | 5.006 | 3.428 | 1.462 | 0.246 | 5.0 | 3.4 | 1.50 | 0.2 |
| versicolor | 4.9 | 2.0 | 3.0 | 1.0 | 7.0 | 3.4 | 5.1 | 1.8 | 5.936 | 2.770 | 4.260 | 1.326 | 5.9 | 2.8 | 4.35 | 1.3 |
| virginica | 4.9 | 2.2 | 4.5 | 1.4 | 7.9 | 3.8 | 6.9 | 2.5 | 6.588 | 2.974 | 5.552 | 2.026 | 6.5 | 3.0 | 5.55 | 2.0 |
Using the iris data set, determine if Sepal.Width is a good predictor of Sepal.Length across all species.
setosa <- iris %>%
filter(Species == "setosa") # create subset of data to be used as an example when creating the function below
Create a function that will run linear regression analysis and then print the key statistics in a tibble.
lm_and_stats_output <- function(y ,x ,data_frame) {
formula <- as.formula (paste(y, x, sep = " ~ ")) # how to structure the linear regression input
lm_setosa <- lm(formula, data = data_frame) # linear regression
model_summary_setosa <- summary(lm_setosa) #Summarize linear regression
#Pull out stats table
model_statistics <- tibble("intercept" = model_summary_setosa$coefficients[1,1], #intercept
"slope" = model_summary_setosa$coefficients[2,1], #slope
"p_value" = model_summary_setosa$coefficients[2,4], #p-value
"r_squared" = model_summary_setosa$r.squared) #R2
return(model_statistics)
}
lm_and_stats_output(y = "Sepal.Length", x = "Sepal.Width", setosa) # check to see if the function works
Use the function to calculate the statistics for all Iris species.
lm_all <- iris %>%
split(.$Species) %>%
map_df(~lm_and_stats_output(y = "Sepal.Length", x = "Sepal.Width", .), .id = "Species") #map() function from the purrr package runs the regression on each species of Iris
Report the r2, p-value, and slope from your lm() call in a nice table with appropriate column labels.
knitr::kable(lm_all, align = "ccccc", caption = "Figure 2: Linear Regression Demostrates a Positive Relationship Between Sepal Width (independent variable) as a Good Predictor of Sepal Length (dependent varibale) for each Species. ", col.names = c("Species", "Intercept", "Slope", "P-Value", "R Squared"))
| Species | Intercept | Slope | P-Value | R Squared |
|---|---|---|---|---|
| setosa | 2.639001 | 0.6904897 | 0.0000000 | 0.5513756 |
| versicolor | 3.539735 | 0.8650777 | 0.0000877 | 0.2765821 |
| virginica | 3.906836 | 0.9015345 | 0.0008435 | 0.2090573 |
Use ggplot to graph the linear model you just created in 2. You can just use ggplot’s geom_smooth function with method set to “lm” to add your regression line to the scatter plot.
ggplot(iris, aes (Sepal.Width, Sepal.Length, color = Species)) +
geom_point() +
facet_wrap(vars(Species)) +
geom_smooth(method = "lm", formula = y ~ x) +
labs(x = "Sepal Width",
y = "Sepal Length",
caption = "Graph 1: Scatter plot of Sepal Width (indpendent) versus Sepal Length (depenedent) from the iris data set \n for all species. For the setosa species, Sepal Length showed a positive and significant relationship \n with Sepal Width (r2 = 0.5513756, p-value = 6.709843e-10 ). For the versicolor species, \n Sepal Length showed a positive and significant relationship with Sepal Width (r2 = 0.2765821, \n p-value = 8.771860e-05 ). For the virginica species, Sepal Length showed a positive \n and significant relationship with Sepal Width (r2 = 0.2090573, p-value = 8.434625e-04 )."
) +
theme_bw()
Basic statistics show that the virginica species generally has the largest traits, followed by versicolor, then setosa (figure 1). All species of Iris within this data set display a significant positive relationship between Sepal Width as the predictor variable for Sepal Length as the dependent variable. This claim is supported by the positive slopes of the linear regression between these two variables, and the p values below .05 (figure 2). This can also be seen visually in graph 1, where the increase in Sepal Width on the x axis causes the increased Sepal Length on the y axis (graph 1).