1 Introduction

This analysis uses the iris data set. First, summary statistics (minimum, maximum, mean, and median) are calculated for each measurement of each iris species to better understand the data. Next, a linear regression is fitted to determine the extent to which sepal length depends on sepal width across all iris species in the data set. Relevant statistics and a graph of the regression are presented.

2 Load Packages

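library(tidyverse)   # attaches dplyr, tidyr, and ggplot2 (no separate library(dplyr) needed)
library(knitr)       # report generation
library(kableExtra)  # kbl() and table styling helpers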

3 Calculate statistics for each variable of each species in the iris data set

iris_grouped <- iris %>%
  group_by(Species) 

#min values
iris_min <- iris_grouped %>%
  summarise_all(min)%>%
  pivot_longer(-Species, names_to="Value", values_to="Minimum") %>%
  separate(col="Value", into=c("Tissue", "Measurement"))

#max values
iris_max <- iris_grouped %>%
  summarise_all(max)%>%
  pivot_longer(-Species, names_to="Value", values_to="Maximum") %>%
  separate(col="Value", into=c("Tissue", "Measurement"))

#mean values
iris_mean <- iris_grouped %>%
  summarise_all(mean)%>%
  pivot_longer(-Species, names_to="Value", values_to="Mean") %>%
  separate(col="Value", into=c("Tissue", "Measurement"))

#median values
iris_median <- iris_grouped %>%
  summarise_all(median)%>%
  pivot_longer(-Species, names_to="Value", values_to="Median")%>%
  separate(col="Value", into=c("Tissue", "Measurement"))

# join all statistic tables
iris_stats <- iris_min %>%  
  left_join(iris_max, by = c("Species", "Tissue", "Measurement")) %>% 
  left_join(iris_mean, by = c("Species", "Tissue", "Measurement")) %>% 
  left_join(iris_median, by = c("Species", "Tissue", "Measurement"))

#create table
iris_stats%>%
  kbl(caption="Table 1: Iris Statistics") %>%
  kable_classic(full_width=F, 
                html_font = "Cambria", 
                position = "center") %>%
  kable_styling(bootstrap_options="striped")
Table 1: Iris Statistics
Species     Tissue  Measurement  Minimum  Maximum  Mean   Median
setosa      Sepal   Length       4.3      5.8      5.006  5.00
setosa      Sepal   Width        2.3      4.4      3.428  3.40
setosa      Petal   Length       1.0      1.9      1.462  1.50
setosa      Petal   Width        0.1      0.6      0.246  0.20
versicolor  Sepal   Length       4.9      7.0      5.936  5.90
versicolor  Sepal   Width        2.0      3.4      2.770  2.80
versicolor  Petal   Length       3.0      5.1      4.260  4.35
versicolor  Petal   Width        1.0      1.8      1.326  1.30
virginica   Sepal   Length       4.9      7.9      6.588  6.50
virginica   Sepal   Width        2.2      3.8      2.974  3.00
virginica   Petal   Length       4.5      6.9      5.552  5.55
virginica   Petal   Width        1.4      2.5      2.026  2.00
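
The same four summaries can also be computed in a single pass by reshaping to long format before grouping, which avoids building and joining four separate tables. The following is a minimal sketch of that alternative, not the approach used in this report; the names `iris_stats_alt` and `x` are introduced here for illustration only. Rows come out sorted by group, so Petal rows appear before Sepal rows.

```r
library(dplyr)
library(tidyr)

# Reshape first, then summarise each Species/Tissue/Measurement group once.
# iris_stats_alt and x are hypothetical names used only in this sketch.
iris_stats_alt <- iris %>%
  pivot_longer(-Species, names_to = "Value", values_to = "x") %>%
  separate(col = "Value", into = c("Tissue", "Measurement")) %>%
  group_by(Species, Tissue, Measurement) %>%
  summarise(Minimum = min(x),
            Maximum = max(x),
            Mean    = mean(x),
            Median  = median(x),
            .groups = "drop")

iris_stats_alt
```

Because every statistic is computed inside one `summarise()` call, each column name sits next to the function that produces it, which makes a mismatch between a label and its function easier to spot.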

4 Create table of relevant statistics from linear regression of sepal length on sepal width

#create linear model with sepal length as the dependent variable
linmodel <- lm(Sepal.Length ~ Sepal.Width, data=iris)
#create model summary
modsummary <- summary(linmodel)
#take relevant statistics out of model summary
mod_stats <- tibble("slope"     = modsummary$coefficients[2,1], #slope from summary
                    "p value"   = modsummary$coefficients[2,4], #p-value from summary
                    "R squared"  = modsummary$r.squared) #r squared from summary

#create table
mod_stats %>%
  kbl(booktabs=T, caption="Table 2: Statistics for the negative relationship between sepal width (independent variable) and sepal length (dependent variable) for three species of iris plant",
      col.names=c("Slope", "p-value", "R^2^"),
      align=c("ccc")) %>%
  kable_styling()
Table 2: Statistics for the negative relationship between sepal width (independent variable) and sepal length (dependent variable) for three species of iris plant
Slope       p-value    R²
-0.2233611  0.1518983  0.0138227
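
For reference, the quantities reported in Table 2 can be written out for a simple linear regression of sepal length on sepal width. The symbols below (x_i for sepal width, y_i for sepal length, and the coefficients and error term) are notation introduced here for illustration and do not appear in the analysis code.

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

The slope is the estimated change in sepal length per unit increase in sepal width, R² is the share of the variance in sepal length explained by the fitted line, and the p-value tests the null hypothesis that the true slope is zero.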

5 Create plot of linear regression of sepal length on sepal width

#create plot
ggplot(iris, aes(Sepal.Width, Sepal.Length))+
  geom_point()+
  geom_smooth(method="lm", color="black")+
  theme_classic()+
  labs(x="Sepal Width",
       y="Sepal Length", 
       title="Figure 1: Iris Sepal Length as a Function of Sepal Width")+
  theme(plot.title = element_text(hjust=0.5, vjust=3))
Figure 1: Sepal length (dependent variable) as a function of sepal width (independent variable) in the iris data set for all species. The model shows a negative but nonsignificant linear relationship, with sepal length decreasing as sepal width increases (R²=0.014, p-value=0.152).

6 Conclusion

Table 2 and Figure 1 show only a weak negative association between sepal width and sepal length across the iris species. The low R² (0.014) indicates that sepal width explains little of the variation in sepal length, and the p-value (0.152) is above the conventional 0.05 threshold, so the relationship is not statistically significant. This may be a result of combining all iris species in the data set in this analysis. Based on this analysis, sepal width should not be used to predict sepal length in iris species generally.
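
One way to check whether pooling the species affects the result would be to fit the same model separately within each species. The following is a minimal sketch of that follow-up, not part of the analysis above; the helper `fit_one_species` and the object `species_stats` are hypothetical names introduced here.

```r
# Fit Sepal.Length ~ Sepal.Width within one species and return key statistics.
# fit_one_species is a hypothetical helper used only in this sketch.
fit_one_species <- function(d) {
  fit <- lm(Sepal.Length ~ Sepal.Width, data = d)
  s <- summary(fit)
  c(slope     = unname(coef(fit)["Sepal.Width"]),
    p_value   = s$coefficients["Sepal.Width", "Pr(>|t|)"],
    r_squared = s$r.squared)
}

# split() yields one data frame per species; sapply() returns one column per
# species, so t() transposes the result to one row per species.
species_stats <- t(sapply(split(iris, iris$Species), fit_one_species))

species_stats
```

Comparing the per-species slopes and R² values with the pooled values in Table 2 would show whether the weak pooled relationship also holds within each species.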