Introduction
This code works with R's built-in iris dataset. It calculates summary statistics (minimum, maximum, mean, and median) for each species and tabulates them, then fits a linear model to test whether sepal width predicts sepal length across iris species. Finally, the code plots this relationship.
Part 1.
Create a single table from the Iris data set with:
MIN, MAX, MEAN, MEDIAN for each variable (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) for each species
Load packages
library(tidyverse)
Calculate statistics in new tibbles
iris
iris_min <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), tibble::lst(min), .names = "{fn}_{col}")) %>%
  mutate(statistic = "min") %>%
  rename(sepal_length = min_Sepal.Length,
         sepal_width = min_Sepal.Width,
         petal_length = min_Petal.Length,
         petal_width = min_Petal.Width)
iris_max <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), tibble::lst(max), .names = "{fn}_{col}")) %>%
  mutate(statistic = "max") %>%
  rename(sepal_length = max_Sepal.Length,
         sepal_width = max_Sepal.Width,
         petal_length = max_Petal.Length,
         petal_width = max_Petal.Width)
iris_mean <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), tibble::lst(mean), .names = "{fn}_{col}")) %>%
  mutate(statistic = "mean") %>%
  rename(sepal_length = mean_Sepal.Length,
         sepal_width = mean_Sepal.Width,
         petal_length = mean_Petal.Length,
         petal_width = mean_Petal.Width)
iris_median <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), tibble::lst(median), .names = "{fn}_{col}")) %>%
  mutate(statistic = "median") %>%
  rename(sepal_length = median_Sepal.Length,
         sepal_width = median_Sepal.Width,
         petal_length = median_Petal.Length,
         petal_width = median_Petal.Width)
Combine stats tibbles into one
# union() takes no `by` argument (recent dplyr versions reject it);
# bind_rows() stacks the four tibbles, which all share the same columns.
iris_stats <- bind_rows(iris_min, iris_max, iris_median, iris_mean) %>%
  relocate(statistic, .after = Species) %>%
  arrange(Species)
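The four near-identical pipelines above could probably be collapsed into a single pass by reshaping the data first. The following is only a minimal sketch of that idea, not the approach used in this report: the name `iris_stats_alt` is hypothetical, and it assumes `pivot_longer()` and `pivot_wider()` from tidyr (attached by `library(tidyverse)`).

```r
library(dplyr)
library(tidyr)

# Hypothetical alternative: compute all four statistics in one pipeline.
iris_stats_alt <- iris %>%
  # One row per (flower, measurement) pair.
  pivot_longer(-Species, names_to = "variable", values_to = "value") %>%
  group_by(Species, variable) %>%
  summarise(min = min(value),
            max = max(value),
            median = median(value),
            mean = mean(value),
            .groups = "drop") %>%
  # Turn the four statistic columns into rows ...
  pivot_longer(c(min, max, median, mean),
               names_to = "statistic", values_to = "value") %>%
  # ... and the measurements back into columns.
  pivot_wider(names_from = variable, values_from = value) %>%
  select(Species, statistic,
         Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)

iris_stats_alt
```

This keeps the original column names (`Sepal.Length` and so on) rather than the renamed snake_case ones, so a `rename()` step would still be needed to match the tibbles built above.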
Create table
library(kableExtra)
library(knitr)
library(rvest)
knitr::kable(iris_stats,
             caption = "Table 1. Iris species summary statistics.",
             col.names = c("Species", "Statistic", "Sepal Length", "Sepal Width", "Petal Length", "Petal Width")) %>%
  kable_minimal(full_width = FALSE, html_font = "Cambria", position = "left") %>%
  collapse_rows(columns = 1, valign = "middle") ## Cannot get this to work!!
Table 1. Iris species summary statistics.
| Species | Statistic | Sepal Length | Sepal Width | Petal Length | Petal Width |
|---|---|---|---|---|---|
| setosa | min | 4.300 | 2.300 | 1.000 | 0.100 |
| setosa | max | 5.800 | 4.400 | 1.900 | 0.600 |
| setosa | median | 5.000 | 3.400 | 1.500 | 0.200 |
| setosa | mean | 5.006 | 3.428 | 1.462 | 0.246 |
| versicolor | min | 4.900 | 2.000 | 3.000 | 1.000 |
| versicolor | max | 7.000 | 3.400 | 5.100 | 1.800 |
| versicolor | median | 5.900 | 2.800 | 4.350 | 1.300 |
| versicolor | mean | 5.936 | 2.770 | 4.260 | 1.326 |
| virginica | min | 4.900 | 2.200 | 4.500 | 1.400 |
| virginica | max | 7.900 | 3.800 | 6.900 | 2.500 |
| virginica | median | 6.500 | 3.000 | 5.550 | 2.000 |
| virginica | mean | 6.588 | 2.974 | 5.552 | 2.026 |
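The species name still repeats on every row of Table 1 because the `collapse_rows()` call did not take effect. I have not diagnosed why. One possible workaround, sketched here under the assumption that HTML output is wanted, is kableExtra's `pack_rows()`, which drops the species column and inserts a group label row instead. The names `stats_demo`, `rows_per_species`, and `grouped_table` are hypothetical, and the table is rebuilt with only two statistics so the sketch runs on its own.

```r
library(dplyr)
library(kableExtra)

# Small stand-in for iris_stats (min and max only), rebuilt for this sketch.
stats_demo <- bind_rows(
  iris %>% group_by(Species) %>% summarise(across(everything(), min)) %>%
    mutate(statistic = "min"),
  iris %>% group_by(Species) %>% summarise(across(everything(), max)) %>%
    mutate(statistic = "max")
) %>%
  relocate(statistic, .after = Species) %>%
  arrange(Species)

# Named vector giving the number of table rows under each species label.
species_counts <- table(stats_demo$Species)
rows_per_species <- setNames(as.integer(species_counts), names(species_counts))

grouped_table <- stats_demo %>%
  select(-Species) %>%
  knitr::kable(format = "html",
               col.names = c("Statistic", "Sepal Length", "Sepal Width",
                             "Petal Length", "Petal Width")) %>%
  kable_minimal(full_width = FALSE, position = "left") %>%
  pack_rows(index = rows_per_species)

grouped_table
```

This changes the layout (a label row per species rather than a merged first column), so it is an alternative presentation rather than a fix for `collapse_rows()` itself.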
Part 2.
Is sepal width a good predictor of sepal length across all species?
iris_model <- lm(Sepal.Length ~ Sepal.Width, data = iris)
iris_model_summary <- summary(iris_model)
coefiris <- coef(iris_model_summary)
r.squared <- iris_model_summary$r.squared # r-squared
p.value <- coefiris[2,4] # p-value
slope <- coefiris[2,1] # slope
iris_model_stats <- data.frame(r.squared, p.value, slope)
knitr::kable(iris_model_stats,
             caption = "Table 2. Iris model statistics showing a weak, non-significant negative relationship between Sepal Width (independent variable) and Sepal Length (dependent variable) across species.",
             col.names = c("R-squared", "p-value", "slope")) %>%
  kable_minimal(full_width = FALSE, html_font = "Cambria", position = "left")
Table 2. Iris model statistics showing a weak, non-significant negative relationship between Sepal Width (independent variable) and Sepal Length (dependent variable) across species.
| R-squared | p-value | slope |
|---|---|---|
| 0.0138227 | 0.1518983 | -0.2233611 |
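As a sanity check on the values in Table 2: for a simple linear regression with one predictor, the slope equals cov(x, y) / var(x) and R-squared equals the squared Pearson correlation. The sketch below recomputes both by hand; the variable names (`x`, `y`, `fit`, `slope_by_hand`, `r_squared_by_hand`) are my own and not part of the analysis above.

```r
x <- iris$Sepal.Width   # independent variable
y <- iris$Sepal.Length  # dependent variable

fit <- lm(y ~ x)

# Closed-form results for one-predictor least squares.
slope_by_hand <- cov(x, y) / var(x)
r_squared_by_hand <- cor(x, y)^2

# Both comparisons should agree with lm() up to floating-point tolerance.
all.equal(unname(coef(fit)["x"]), slope_by_hand)
all.equal(summary(fit)$r.squared, r_squared_by_hand)
```

An R-squared this close to zero means sepal width explains only about 1% of the variance in sepal length when the three species are pooled.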
Part 3.
Graph linear model of Sepal Length ~ Sepal Width in ggplot
ggplot(iris, aes(Sepal.Width, Sepal.Length)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Sepal Width (cm)",
       y = "Sepal Length (cm)") +
  theme_bw()
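Because the pooled plot mixes three species, it may hide trends within each one. A variant of the same plot, sketched here as an optional extra and not part of the original assignment, maps colour to `Species` so that `geom_smooth()` fits one line per species. The object name `species_plot` is hypothetical.

```r
library(ggplot2)

# Same axes as the pooled plot, but grouped and coloured by species.
species_plot <- ggplot(iris, aes(Sepal.Width, Sepal.Length, colour = Species)) +
  geom_point() +
  # With a colour grouping, one regression line is fitted per species.
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Sepal Width (cm)",
       y = "Sepal Length (cm)") +
  theme_bw()

species_plot
```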
Conclusion
In exploring the table of statistics for each species, Iris virginica appears to have the longest average sepals and petals, while I. setosa has the shortest and widest sepals and the smallest petals overall. There is no significant relationship between sepal length and sepal width when all species are pooled (R-squared = 0.014, p = 0.152), so sepal width is not a good predictor of sepal length across the three iris species. It would be interesting to explore this relationship at the species level, where I would expect it to be stronger.
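The species-level follow-up suggested above could be started with something like the sketch below, which fits the same model separately for each species and collects the same three statistics as Table 2. This is an illustration only and was not run as part of this report; `species_fits` is a hypothetical name.

```r
# Fit Sepal.Length ~ Sepal.Width within each species (base R only).
species_fits <- do.call(rbind, lapply(split(iris, iris$Species), function(d) {
  s <- summary(lm(Sepal.Length ~ Sepal.Width, data = d))
  data.frame(species   = as.character(d$Species[1]),
             r.squared = s$r.squared,
             p.value   = coef(s)[2, 4],  # p-value for the slope
             slope     = coef(s)[2, 1])  # slope estimate
}))

species_fits
```

Comparing these per-species rows with the pooled values in Table 2 would show whether pooling the species masks a relationship that holds within each one.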