Introduction

This report works with the built-in iris dataset that ships with R. It calculates summary statistics (minimum, maximum, mean, and median) for each species and tabulates them, then fits a linear model to test the relationship between sepal length and sepal width across iris species. Finally, it plots this relationship.

Part 1.

Create a single table from the Iris data set with:

MIN, MAX, MEAN, MEDIAN for each variable (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) for each species

Load packages

library(tidyverse)

Calculate statistics in new tibbles

iris

# Minimum of each measurement per species
iris_min <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), tibble::lst(min), .names = "{.fn}_{.col}")) %>%
  mutate(statistic = "min") %>%
  rename(sepal_length = min_Sepal.Length,
         sepal_width = min_Sepal.Width,
         petal_length = min_Petal.Length,
         petal_width = min_Petal.Width)

# Maximum of each measurement per species
iris_max <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), tibble::lst(max), .names = "{.fn}_{.col}")) %>%
  mutate(statistic = "max") %>%
  rename(sepal_length = max_Sepal.Length,
         sepal_width = max_Sepal.Width,
         petal_length = max_Petal.Length,
         petal_width = max_Petal.Width)

# Mean of each measurement per species
iris_mean <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), tibble::lst(mean), .names = "{.fn}_{.col}")) %>%
  mutate(statistic = "mean") %>%
  rename(sepal_length = mean_Sepal.Length,
         sepal_width = mean_Sepal.Width,
         petal_length = mean_Petal.Length,
         petal_width = mean_Petal.Width)

# Median of each measurement per species
iris_median <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), tibble::lst(median), .names = "{.fn}_{.col}")) %>%
  mutate(statistic = "median") %>%
  rename(sepal_length = median_Sepal.Length,
         sepal_width = median_Sepal.Width,
         petal_length = median_Petal.Length,
         petal_width = median_Petal.Width)

Combine stats tibbles into one

# bind_rows() stacks the four tibbles; union() is a set operation with no `by`
# argument and would silently drop any duplicated rows.
iris_stats <- bind_rows(iris_min, iris_max, iris_median, iris_mean) %>%
  relocate(statistic, .after = "Species") %>%
  arrange(Species)
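
The four per-statistic tibbles above can also be produced in a single pass. The following is a minimal sketch of that alternative, not the approach used in this report; it assumes dplyr and tidyr are installed, and the object name `iris_stats_alt` and the `__` separator are illustrative choices.

```r
library(dplyr)
library(tidyr)

# Sketch: compute all four statistics in one summarise() call, then reshape
# so that each row is one Species x statistic combination.
# `iris_stats_alt` is a hypothetical name, not an object from the report.
iris_stats_alt <- iris %>%
  group_by(Species) %>%
  summarise(
    across(everything(),
           list(min = min, max = max, median = median, mean = mean),
           .names = "{.col}__{.fn}"),
    .groups = "drop"
  ) %>%
  # Split names such as "Sepal.Length__min" into a variable and a statistic
  pivot_longer(-Species,
               names_to = c("variable", "statistic"),
               names_sep = "__") %>%
  # One column per measured variable, one row per Species x statistic
  pivot_wider(names_from = variable, values_from = value)

iris_stats_alt
```

This keeps the original column names (Sepal.Length and so on) rather than renaming them, so the `col.names` argument of `kable()` would still be needed for display labels.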

Create table

library(kableExtra)
library(knitr)
library(rvest) 

knitr::kable(iris_stats,
             caption = "Table 1. Iris species summary statistics.",
             col.names = c("Species", "Statistic", "Sepal Length", "Sepal Width", "Petal Length", "Petal Width")) %>%
  kable_minimal(full_width = FALSE, html_font = "Cambria", position = "left") %>%
  # Intended to merge the repeated Species cells; I could not get this to work,
  # and the cells remain unmerged in the rendered table.
  collapse_rows(columns = 1, valign = "middle")

Table 1. Iris species summary statistics.
Species	Statistic	Sepal Length	Sepal Width	Petal Length	Petal Width
setosa	min	4.300	2.300	1.000	0.100
setosa	max	5.800	4.400	1.900	0.600
setosa	median	5.000	3.400	1.500	0.200
setosa	mean	5.006	3.428	1.462	0.246
versicolor	min	4.900	2.000	3.000	1.000
versicolor	max	7.000	3.400	5.100	1.800
versicolor	median	5.900	2.800	4.350	1.300
versicolor	mean	5.936	2.770	4.260	1.326
virginica	min	4.900	2.200	4.500	1.400
virginica	max	7.900	3.800	6.900	2.500
virginica	median	6.500	3.000	5.550	2.000
virginica	mean	6.588	2.974	5.552	2.026

Part 2.

Is sepal width a good predictor of sepal length across all species?

iris_model <- lm(Sepal.Length ~ Sepal.Width, data = iris)

iris_model_summary <- summary(iris_model)

coefiris <- coef(iris_model_summary)

r.squared <- iris_model_summary$r.squared # r-squared
p.value <- coefiris[2,4] # p-value
slope <- coefiris[2,1] # slope


iris_model_stats <- data.frame(r.squared, p.value, slope)


knitr::kable(iris_model_stats,
             caption = "Table 2. Iris model statistics showing a weak, non-significant negative relationship between Sepal Width (independent variable) and Sepal Length (dependent variable) across species.",
             col.names = c("R-squared", "p-value", "slope")) %>%
  kable_minimal(full_width = FALSE, html_font = "Cambria", position = "left")

Table 2. Iris model statistics showing a weak, non-significant negative relationship between Sepal Width (independent variable) and Sepal Length (dependent variable) across species.
R-squared	p-value	slope
0.0138227	0.1518983	-0.2233611
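
For a regression with a single predictor, R-squared equals the squared Pearson correlation between the two variables, and the least-squares slope equals cov(x, y) / var(x). The sketch below uses base R to cross-check the values in Table 2 on that basis; the names `fit`, `r`, `r_squared_check`, and `slope_check` are illustrative and not part of the original analysis.

```r
# Cross-check of the Table 2 values using base R only (a sketch).
fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)

# With one predictor, R-squared is the squared Pearson correlation.
r <- cor(iris$Sepal.Width, iris$Sepal.Length)
r_squared_check <- r^2

# The least-squares slope is cov(x, y) / var(x).
slope_check <- cov(iris$Sepal.Width, iris$Sepal.Length) / var(iris$Sepal.Width)

# Both comparisons should return TRUE, up to floating-point tolerance.
all.equal(r_squared_check, summary(fit)$r.squared)
all.equal(slope_check, unname(coef(fit)["Sepal.Width"]))
```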

Part 3.

Graph linear model of Sepal Length ~ Sepal Width in ggplot

ggplot(iris, aes(Sepal.Width, Sepal.Length)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Sepal Width (cm)",
       y = "Sepal Length (cm)") +
  theme_bw()

Figure 1. Relationship between sepal width and sepal length of Iris species. The linear model demonstrates a weak and non-significant negative relationship between sepal width and sepal length across Iris setosa, I. versicolor, and I. virginica (r² = 0.014, p = 0.15). Sepal width was used as the independent variable and sepal length as the dependent variable.

Conclusion

In the table of statistics for each species, Iris virginica has the longest sepals and petals on average, while I. setosa has the shortest and widest sepals and the smallest petals overall. There is no significant relationship between sepal length and sepal width across all species (p = 0.15), so sepal width is not a good predictor of sepal length for the three iris species combined. It would be interesting to explore this relationship at the species level, where I would expect it to be stronger.
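
One way the species-level question could be approached is to fit the same model separately within each species. The following is only a sketch of that follow-up, using base R; `species_fits` and `species_stats` are hypothetical names, and this analysis was not run as part of this report.

```r
# Sketch: fit Sepal.Length ~ Sepal.Width separately for each species.
# `species_fits` and `species_stats` are hypothetical names.
species_fits <- lapply(split(iris, iris$Species), function(d) {
  summary(lm(Sepal.Length ~ Sepal.Width, data = d))
})

# Collect the same three statistics reported in Table 2, one row per species.
species_stats <- data.frame(
  species   = names(species_fits),
  r.squared = sapply(species_fits, function(s) s$r.squared),
  p.value   = sapply(species_fits, function(s) coef(s)[2, 4]),
  slope     = sapply(species_fits, function(s) coef(s)[2, 1]),
  row.names = NULL
)

species_stats
```

Comparing the per-species slopes and R-squared values with the pooled values in Table 2 would show whether the expectation of a stronger within-species relationship holds.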

CS08

Andrew Davies

3/1/2022