Introduction

This R Markdown document contains code for calculating summary statistics (min, median, mean, max) for each trait of each species in the iris dataset. To assess whether sepal width is a good predictor of sepal length across all three species, the code fits a linear model and plots the regression of sepal length on sepal width (Figure 1). The code also constructs two tables: one for iris traits (Table 1) and one for the iris regression statistics (Table 2).

Code and annotation

Load packages
shhh <- suppressPackageStartupMessages  # alias used to silence package startup messages
shhh(library(tidyverse))  # attaches ggplot2, dplyr, and tidyr, among others
shhh(library(knitr))

Create a new dataframe with the statistical values for each species
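data(iris)
# view(iris)

# Summarise every trait per species, then reshape to one row per
# species / trait / statistic combination.
iris_traits <- iris %>%
  group_by(Species) %>%
  summarise(across(
    .cols = everything(),
    .fns = list(min = min, median = median, mean = mean, max = max)
  )) %>%
  pivot_longer(-Species, names_to = "Trait", values_to = "Observation") %>%
  # Column names look like "Sepal.Length_min"; split them at the underscore
  separate(Trait, c("Trait", "Statistic"), sep = "_")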

Create a table of observations for iris traits
knitr::kable(iris_traits, caption = "Table 1. Iris Traits")
Table 1. Iris Traits
Species Trait Statistic Observation
setosa Sepal.Length min 4.300
setosa Sepal.Length median 5.000
setosa Sepal.Length mean 5.006
setosa Sepal.Length max 5.800
setosa Sepal.Width min 2.300
setosa Sepal.Width median 3.400
setosa Sepal.Width mean 3.428
setosa Sepal.Width max 4.400
setosa Petal.Length min 1.000
setosa Petal.Length median 1.500
setosa Petal.Length mean 1.462
setosa Petal.Length max 1.900
setosa Petal.Width min 0.100
setosa Petal.Width median 0.200
setosa Petal.Width mean 0.246
setosa Petal.Width max 0.600
versicolor Sepal.Length min 4.900
versicolor Sepal.Length median 5.900
versicolor Sepal.Length mean 5.936
versicolor Sepal.Length max 7.000
versicolor Sepal.Width min 2.000
versicolor Sepal.Width median 2.800
versicolor Sepal.Width mean 2.770
versicolor Sepal.Width max 3.400
versicolor Petal.Length min 3.000
versicolor Petal.Length median 4.350
versicolor Petal.Length mean 4.260
versicolor Petal.Length max 5.100
versicolor Petal.Width min 1.000
versicolor Petal.Width median 1.300
versicolor Petal.Width mean 1.326
versicolor Petal.Width max 1.800
virginica Sepal.Length min 4.900
virginica Sepal.Length median 6.500
virginica Sepal.Length mean 6.588
virginica Sepal.Length max 7.900
virginica Sepal.Width min 2.200
virginica Sepal.Width median 3.000
virginica Sepal.Width mean 2.974
virginica Sepal.Width max 3.800
virginica Petal.Length min 4.500
virginica Petal.Length median 5.550
virginica Petal.Length mean 5.552
virginica Petal.Length max 6.900
virginica Petal.Width min 1.400
virginica Petal.Width median 2.000
virginica Petal.Width mean 2.026
virginica Petal.Width max 2.500
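
The entries in Table 1 can be cross-checked without the tidyverse. The following is a minimal sketch in base R, assuming aggregate() and tapply() are acceptable stand-ins for the grouped summary above; the names species_means, stat_fns, and sepal_length_stats are illustrative and do not appear in the original code.

```r
# Cross-check a few Table 1 entries using base R only.
data(iris)

# Mean of every trait per species: one row per species, one column per trait.
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)

# The same four statistics as Table 1, computed for a single trait.
# stat_fns is a hypothetical helper list, not part of the original code.
stat_fns <- list(min = min, median = median, mean = mean, max = max)
sepal_length_stats <- sapply(
  stat_fns,
  function(f) tapply(iris$Sepal.Length, iris$Species, f)
)

# Rows are species, columns are statistics.
print(sepal_length_stats)
```

Each row of sepal_length_stats should agree with the Sepal.Length rows for the matching species in Table 1.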

Determine if Sepal.Width is a good predictor of Sepal.Length across all species (not individually) using a linear model with lm().
####################################################
##Graph data to visualize 
####################################################


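# Scatter plot of the raw data
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point() +
  theme_classic()

# Histogram of sepal width; the distribution looks roughly normal
ggplot(iris, aes(x = Sepal.Width)) +
  geom_histogram(bins = 30) +  # 30 is the ggplot2 default, set explicitly
  theme_classic()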

####################################################
##Model data 
####################################################

#Sepal length as a function of sepal width (both forms are equivalent)
sepal <- lm(Sepal.Length ~ Sepal.Width, data = iris)
# sepal <- lm(Sepal.Length ~ 1 + Sepal.Width, data = iris)

#Summarize model
summary(sepal)
#Pull out stats table
coef(summary(sepal))

#Store the model summary so its statistics can be reused below
sepal_summary <- summary(sepal)

Make a tibble:
iris_table <- tibble("R-squared" = sepal_summary$r.squared,
                     "P-value" = sepal_summary$coefficients[2, 4],  # p-value of the slope
                     "Slope" = sepal_summary$coefficients[2, 1])    # slope estimate
iris_table

Make a table:
knitr::kable(iris_table, align = "ccc",
             caption = "Table 2. Iris Statistics",
             col.names = c("R-squared", "P-value", "Slope"))
Table 2. Iris Statistics
R-squared P-value Slope
0.0138227 0.1518983 -0.2233611
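
In a simple linear regression the slope depends on which variable is the response, so the model formula should match the question being asked. The sketch below, written in base R with illustrative variable names that are not part of the original code, computes the least-squares slope from sample moments as cov(x, y) / var(x) and compares it with lm() for both choices of response.

```r
data(iris)
x <- iris$Sepal.Width   # predictor in the question posed above
y <- iris$Sepal.Length  # response in the question posed above

# Least-squares slope and intercept from sample moments.
slope_y_on_x <- cov(x, y) / var(x)
intercept_y_on_x <- mean(y) - slope_y_on_x * mean(x)

# The same fit from lm(); coef() returns c(intercept, slope).
fit_y_on_x <- lm(Sepal.Length ~ Sepal.Width, data = iris)
print(coef(fit_y_on_x))

# Swapping response and predictor changes the denominator to var(y),
# so the slope is different even though the data are the same.
slope_x_on_y <- cov(x, y) / var(y)
fit_x_on_y <- lm(Sepal.Width ~ Sepal.Length, data = iris)
print(coef(fit_x_on_y))
```

Both slopes share the sign of the covariance, but their magnitudes differ because each is scaled by the variance of its own predictor.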

Use ggplot to graph the linear model

ggplot(iris, aes(Sepal.Width, Sepal.Length)) +
  geom_point(color = "lightgrey") +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Sepal width",
       y = "Sepal length",
       caption = "Figure 1. Scatter plot with linear regression of the dependent variable, sepal length, on the independent variable,\nsepal width, from the iris data set for all species. The fitted slope is negative, but the relationship is not\nstatistically significant (R-squared = 0.01382, p-value = 0.1519).") +
  theme(plot.caption.position = "plot",
        plot.caption = element_text(hjust = 0))
## `geom_smooth()` using formula 'y ~ x'

Conclusion

A linear regression of sepal length on sepal width across all iris species gave a negative slope, meaning that the fitted line predicts shorter sepals as sepal width increases. However, the relationship is not statistically significant (p-value = 0.1519), and R-squared indicates that sepal width explains only ~1.4% of the variation in sepal length. Sepal width is therefore not a good predictor of sepal length when all three species are pooled.
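
The R-squared and p-value quoted above can also be read off a Pearson correlation test, since for a regression with a single predictor R-squared equals the squared correlation coefficient and the slope's t-test matches the correlation test. The following is a minimal sketch using base R's cor.test(); the object names ct, r, and fit are illustrative and not part of the original code.

```r
data(iris)

# Pearson correlation test between the two sepal traits.
ct <- cor.test(iris$Sepal.Width, iris$Sepal.Length)
r <- unname(ct$estimate)

# Summary of the regression of sepal length on sepal width.
fit <- summary(lm(Sepal.Length ~ Sepal.Width, data = iris))

# Compare the correlation-based and regression-based quantities.
cat("r squared (cor.test):", r^2, "\n")
cat("R-squared (lm):      ", fit$r.squared, "\n")
cat("p-value (cor.test):  ", ct$p.value, "\n")
cat("p-value (lm slope):  ", fit$coefficients[2, 4], "\n")
```

The two pairs of numbers agree, which is why a weak correlation here goes hand in hand with a small R-squared.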