Introduction

This code works with R's built-in iris dataset. It calculates summary statistics for each species and tabulates them, then fits a linear model to test the relationship between sepal length and sepal width across Iris species. Finally, the code plots this relationship.

Part 1.

Create a single table from the Iris data set with:

MIN, MAX, MEAN, MEDIAN for each variable (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) for each species

Load packages

library(tidyverse)

Calculate statistics in new tibbles

iris

# Minimum of each measurement, per species
iris_min <- iris %>% 
  group_by(Species) %>% 
  summarise(across(everything(), tibble::lst(min), .names = "{fn}_{col}")) %>% 
  mutate(statistic = "min") %>% 
  rename(sepal_length = min_Sepal.Length,
         sepal_width = min_Sepal.Width,
         petal_length = min_Petal.Length,
         petal_width = min_Petal.Width)

# Maximum of each measurement, per species
iris_max <- iris %>% 
  group_by(Species) %>% 
  summarise(across(everything(), tibble::lst(max), .names = "{fn}_{col}")) %>% 
  mutate(statistic = "max") %>% 
  rename(sepal_length = max_Sepal.Length,
         sepal_width = max_Sepal.Width,
         petal_length = max_Petal.Length,
         petal_width = max_Petal.Width)

# Mean of each measurement, per species
iris_mean <- iris %>% 
  group_by(Species) %>% 
  summarise(across(everything(), tibble::lst(mean), .names = "{fn}_{col}")) %>% 
  mutate(statistic = "mean") %>% 
  rename(sepal_length = mean_Sepal.Length,
         sepal_width = mean_Sepal.Width,
         petal_length = mean_Petal.Length,
         petal_width = mean_Petal.Width)

# Median of each measurement, per species
iris_median <- iris %>% 
  group_by(Species) %>% 
  summarise(across(everything(), tibble::lst(median), .names = "{fn}_{col}")) %>% 
  mutate(statistic = "median") %>% 
  rename(sepal_length = median_Sepal.Length,
         sepal_width = median_Sepal.Width,
         petal_length = median_Petal.Length,
         petal_width = median_Petal.Width)

Combine the statistics tibbles into one

# bind_rows() stacks the four tibbles row-wise. union() is a set operation
# that takes no `by` argument and would drop any duplicated rows.
iris_stats <- bind_rows(iris_min, iris_max, iris_median, iris_mean) %>% 
  relocate(statistic, .after = Species) %>% 
  arrange(Species)
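
The same summary table could also be built in a single pass rather than from four separate tibbles. The following is only a sketch of that alternative, assuming dplyr and tidyr are available (both are attached by tidyverse); the name iris_stats_alt and the "__" separator are illustrative choices, not part of the original analysis, and the resulting columns keep their original names (Sepal.Length and so on) rather than the renamed ones.

```r
library(dplyr)
library(tidyr)

# Compute all four statistics at once; columns are named like "Sepal.Length__min"
iris_stats_alt <- iris %>%
  group_by(Species) %>%
  summarise(
    across(
      everything(),
      list(min = min, max = max, median = median, mean = mean),
      .names = "{.col}__{.fn}"
    ),
    .groups = "drop"
  ) %>%
  # Split "variable__statistic" column names into two key columns
  pivot_longer(
    cols = -Species,
    names_to = c("variable", "statistic"),
    names_sep = "__"
  ) %>%
  # Spread the variables back out so each row is one species/statistic pair
  pivot_wider(names_from = variable, values_from = value)

iris_stats_alt
```

This gives one row per species and statistic, so it can be passed to knitr::kable() in the same way as the combined tibble.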

Create table

library(kableExtra)
library(knitr)

knitr::kable(iris_stats,
             caption = "Table 1. Iris species summary statistics.", 
             col.names = c("Species", "Statistic", "Sepal Length", "Sepal Width", "Petal Length", "Petal Width")) %>%
  kable_minimal(full_width = FALSE, html_font = "Cambria", position = "left") %>%
  collapse_rows(columns = 1, valign = "middle") # Intended to merge the repeated Species cells; this did not work as expected

Table 1. Iris species summary statistics.

| Species    | Statistic | Sepal Length | Sepal Width | Petal Length | Petal Width |
|------------|-----------|--------------|-------------|--------------|-------------|
| setosa     | min       | 4.300        | 2.300       | 1.000        | 0.100       |
| setosa     | max       | 5.800        | 4.400       | 1.900        | 0.600       |
| setosa     | median    | 5.000        | 3.400       | 1.500        | 0.200       |
| setosa     | mean      | 5.006        | 3.428       | 1.462        | 0.246       |
| versicolor | min       | 4.900        | 2.000       | 3.000        | 1.000       |
| versicolor | max       | 7.000        | 3.400       | 5.100        | 1.800       |
| versicolor | median    | 5.900        | 2.800       | 4.350        | 1.300       |
| versicolor | mean      | 5.936        | 2.770       | 4.260        | 1.326       |
| virginica  | min       | 4.900        | 2.200       | 4.500        | 1.400       |
| virginica  | max       | 7.900        | 3.800       | 6.900        | 2.500       |
| virginica  | median    | 6.500        | 3.000       | 5.550        | 2.000       |
| virginica  | mean      | 6.588        | 2.974       | 5.552        | 2.026       |

Part 2.

Is sepal width a good predictor of sepal length across all species?

iris_model <- lm(Sepal.Length ~ Sepal.Width, data = iris)

iris_model_summary <- summary(iris_model)

coefiris <- coef(iris_model_summary)

r.squared <- iris_model_summary$r.squared # r-squared
p.value <- coefiris[2,4] # p-value
slope <- coefiris[2,1] # slope


iris_model_stats <- data.frame(r.squared, p.value, slope)


knitr::kable(iris_model_stats,
             caption = "Table 2. Iris model statistics showing a weak, non-significant negative relationship between Sepal Width (independent variable) and Sepal Length (dependent variable) across species.", 
             col.names = c("R-squared", "p-value", "slope")) %>%
  kable_minimal(full_width = FALSE, html_font = "Cambria", position = "left") 

Table 2. Iris model statistics showing a weak, non-significant negative relationship between Sepal Width (independent variable) and Sepal Length (dependent variable) across species.

| R-squared | p-value   | slope      |
|-----------|-----------|------------|
| 0.0138227 | 0.1518983 | -0.2233611 |
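
As a sanity check on these model statistics, a simple linear regression with one predictor has an R-squared equal to the squared Pearson correlation, and the slope's p-value matches the p-value of the correlation test. The sketch below uses only base R; the names pooled_model, pearson_r and pearson_test are illustrative and not part of the original analysis.

```r
# Refit the pooled model so this check is self-contained
pooled_model <- lm(Sepal.Length ~ Sepal.Width, data = iris)
pooled_summary <- summary(pooled_model)

# Pearson correlation between the two sepal measurements
pearson_r <- cor(iris$Sepal.Width, iris$Sepal.Length)
pearson_test <- cor.test(iris$Sepal.Width, iris$Sepal.Length)

# R-squared of the model should equal the squared correlation
r2_matches <- isTRUE(all.equal(pearson_r^2, pooled_summary$r.squared))

# p-value of the slope should equal the p-value of the correlation test
p_matches <- isTRUE(all.equal(unname(pearson_test$p.value),
                              unname(coef(pooled_summary)[2, 4])))

c(r2_matches = r2_matches, p_matches = p_matches)
```

If both values are TRUE, the table above is consistent with the raw correlation between the two variables.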

Part 3.

Plot the linear model of Sepal Length ~ Sepal Width with ggplot2

ggplot(iris, aes(Sepal.Width, Sepal.Length)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Sepal Width (cm)",
       y = "Sepal Length (cm)") +
  theme_bw()
Figure 1. Relationship between sepal width and sepal length of Iris species. The linear model demonstrates a weak and non-significant negative relationship between sepal width and sepal length across Iris setosa, I. versicolor, and I. virginica (R² = 0.014, p = 0.15). Sepal width was used as the independent variable and sepal length as the dependent variable.
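
For comparison with the pooled plot, the same relationship could be drawn with one fitted line per species. This is only a sketch, assuming ggplot2 is available; mapping colour to Species and the object name species_plot are illustrative choices, not part of the original figure.

```r
library(ggplot2)

# Mapping colour to Species makes geom_smooth() fit a separate line per species
species_plot <- ggplot(iris, aes(Sepal.Width, Sepal.Length, colour = Species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Sepal Width (cm)",
       y = "Sepal Length (cm)",
       colour = "Species") +
  theme_bw()

species_plot
```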

Conclusion

In exploring the table of statistics for each species, Iris virginica appears to have the longest sepals and petals on average, while I. setosa has the shortest and widest sepals and the smallest petals overall. There is no significant relationship between sepal length and sepal width across all species (p = 0.15), so sepal width is not a good predictor of sepal length when the three Iris species are pooled. It would be interesting to explore this relationship at the species level; I would expect a stronger relationship there.
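
One way to follow up on the species-level question is to fit the same model separately for each species and collect the slope, p-value and R-squared. The following is a minimal base R sketch of that idea, not part of the original analysis; the helper fit_one_species and the result name species_model_stats are hypothetical.

```r
# Fit Sepal.Length ~ Sepal.Width for one species and return its key statistics
fit_one_species <- function(species_data) {
  fit_summary <- summary(lm(Sepal.Length ~ Sepal.Width, data = species_data))
  data.frame(
    slope = coef(fit_summary)[2, 1],
    p.value = coef(fit_summary)[2, 4],
    r.squared = fit_summary$r.squared
  )
}

# split() gives one data frame per species; rbind() stacks the per-species results
species_model_stats <- do.call(
  rbind,
  lapply(split(iris, iris$Species), fit_one_species)
)

species_model_stats
```

The row names of the result are the species names, so the table can be compared directly with the pooled statistics in Table 2.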