This analysis is based on the iris data set. First, summary statistics (minimum, maximum, mean, and median) of each measurement are computed for every iris species to better understand the data. Next, a linear regression is fitted to assess the extent to which sepal length depends on sepal width across all iris species in the data set. Relevant statistics and a graph of the regression are presented.
library(tidyverse)
library(dplyr)
library(knitr)
library(kableExtra)
iris_grouped <- iris %>%
group_by(Species)
#min values
iris_min <- iris_grouped %>%
summarise_all(min)%>%
pivot_longer(-Species, names_to="Value", values_to="Minimum") %>%
separate(col="Value", into=c("Tissue", "Measurement"))
#max values
iris_max <- iris_grouped %>%
summarise_all(max)%>%
pivot_longer(-Species, names_to="Value", values_to="Maximum") %>%
separate(col="Value", into=c("Tissue", "Measurement"))
#mean values
iris_mean <- iris_grouped %>%
summarise_all(mean)%>%
pivot_longer(-Species, names_to="Value", values_to="Mean") %>%
separate(col="Value", into=c("Tissue", "Measurement"))
#median values
iris_median <- iris_grouped %>%
summarise_all(median)%>%
pivot_longer(-Species, names_to="Value", values_to="Median")%>%
separate(col="Value", into=c("Tissue", "Measurement"))
# join all statistic tables
iris_stats <- iris_min %>%
left_join(iris_max, by = c("Species", "Tissue", "Measurement")) %>%
left_join(iris_mean, by = c("Species", "Tissue", "Measurement")) %>%
left_join(iris_median, by = c("Species", "Tissue", "Measurement"))
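The four summary tables above repeat the same pipeline with a different function each time, which makes it easy for one copy to drift from the others. As a minimal sketch of an alternative (not the method used in this analysis), the data can be reshaped to long format first and all four statistics computed in a single `summarise()` call. The names `iris_stats_alt` and `x` are placeholders introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Reshape to one row per measurement, then summarise once per group.
iris_stats_alt <- iris %>%
  pivot_longer(-Species, names_to = "Value", values_to = "x") %>%
  # "Sepal.Length" is split on the "." into Tissue = "Sepal", Measurement = "Length"
  separate(col = "Value", into = c("Tissue", "Measurement")) %>%
  group_by(Species, Tissue, Measurement) %>%
  summarise(
    Minimum = min(x),
    Maximum = max(x),
    Mean    = mean(x),
    Median  = median(x),
    .groups = "drop"
  )

print(iris_stats_alt)
```

Because `group_by()` sorts the groups, the rows come out in a different order than in the joined table (Petal before Sepal within each species), but the values should be the same.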
#create table
iris_stats%>%
kbl(caption="Table 1: Iris Statistics") %>%
kable_classic(full_width=F,
html_font = "Cambria",
position = "center") %>%
kable_styling(bootstrap_options="striped")
| Species | Tissue | Measurement | Minimum | Maximum | Mean | Median |
|---|---|---|---|---|---|---|
| setosa | Sepal | Length | 4.3 | 5.8 | 5.006 | 5.00 |
| setosa | Sepal | Width | 2.3 | 4.4 | 3.428 | 3.40 |
| setosa | Petal | Length | 1.0 | 1.9 | 1.462 | 1.50 |
| setosa | Petal | Width | 0.1 | 0.6 | 0.246 | 0.20 |
| versicolor | Sepal | Length | 4.9 | 7.0 | 5.936 | 5.90 |
| versicolor | Sepal | Width | 2.0 | 3.4 | 2.770 | 2.80 |
| versicolor | Petal | Length | 3.0 | 5.1 | 4.260 | 4.35 |
| versicolor | Petal | Width | 1.0 | 1.8 | 1.326 | 1.30 |
| virginica | Sepal | Length | 4.9 | 7.9 | 6.588 | 6.50 |
| virginica | Sepal | Width | 2.2 | 3.8 | 2.974 | 3.00 |
| virginica | Petal | Length | 4.5 | 6.9 | 5.552 | 5.55 |
| virginica | Petal | Width | 1.4 | 2.5 | 2.026 | 2.00 |
#create linear model
#sepal length (dependent variable) modelled on sepal width (independent variable)
linmodel <- lm(Sepal.Length ~ Sepal.Width, data=iris)
#create model summary
modsummary <- summary(linmodel)
#take relevant statistics out of model summary
mod_stats <- tibble("slope" = modsummary$coefficients[2,1], #slope from summary
"p value" = modsummary$coefficients[2,4], #p-value from summary
"R squared" = modsummary$r.squared) #r squared from summary
#create table
mod_stats %>%
kbl(booktabs=TRUE, caption="Table 2: Statistics for the negative relationship between sepal width (independent variable) and sepal length (dependent variable) for three species of iris plant",
col.names=c("Slope", "p-value", "R^2^"),
align="ccc") %>%
kable_styling()
| Slope | p-value | R² |
|---|---|---|
| -0.2233611 | 0.1518983 | 0.0138227 |
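As a sanity check on the p-value and R² reported in Table 2, the sketch below compares them with a Pearson correlation test. In a simple linear regression with one predictor, R² equals the squared correlation and the slope's p-value equals the p-value of the correlation test, and neither quantity depends on which variable is treated as the response (the slope itself does). The object names `fit`, `ct`, and `check` are placeholders introduced here, and the slope row is looked up by name rather than by position.

```r
# Cross-check simple-regression statistics against the Pearson correlation.
fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)
fit_summary <- summary(fit)

# Coefficient matrix row for the predictor, selected by name.
slope_row <- coef(fit_summary)["Sepal.Width", ]

# Pearson correlation test between the same two variables.
ct <- cor.test(iris$Sepal.Width, iris$Sepal.Length)

check <- c(
  r_squared_lm  = fit_summary$r.squared,
  r_squared_cor = unname(ct$estimate)^2,
  p_value_lm    = unname(slope_row["Pr(>|t|)"]),
  p_value_cor   = ct$p.value
)
print(check)
```

If the two R² entries or the two p-value entries disagree, the model summary is probably being indexed incorrectly.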
#create plot
ggplot(iris, aes(Sepal.Width, Sepal.Length))+
geom_point()+
geom_smooth(method="lm", color="black")+
theme_classic()+
labs(x="Sepal Width",
y="Sepal Length",
title="Figure 1: Iris Sepal Length as a Function of Sepal Width")+
theme(plot.title = element_text(hjust=0.5, vjust=3))
Figure 1: Sepal length (dependent variable) as a function of sepal width (independent variable) in the iris data set for all species. The model shows a negative but statistically nonsignificant linear relationship, with sepal length decreasing as sepal width increases (R² = 0.014, p-value = 0.152).
Table 2 and Figure 1 show only a slight negative correlation between sepal width and sepal length across iris species. The low R² value (0.014) indicates that sepal width explains very little of the variation in sepal length, and the p-value (0.152) is above the conventional 0.05 threshold, so the relationship is not statistically significant. This may be a result of combining all iris species in the data set in this analysis. Based on this analysis, sepal width should not be used to predict sepal length in iris species generally.
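One way to explore whether pooling the species is responsible for the weak relationship is to fit the same regression within each species separately. The following is a minimal sketch of that follow-up, not part of the original analysis. It uses base R only, and `per_species` is a name introduced here for illustration.

```r
# Fit Sepal.Length ~ Sepal.Width separately for each species.
per_species <- do.call(rbind, lapply(split(iris, iris$Species), function(d) {
  s <- summary(lm(Sepal.Length ~ Sepal.Width, data = d))
  data.frame(
    Species   = as.character(d$Species[1]),
    slope     = coef(s)["Sepal.Width", "Estimate"],
    p_value   = coef(s)["Sepal.Width", "Pr(>|t|)"],
    r_squared = s$r.squared
  )
}))

print(per_species)
```

Comparing the sign and size of the per-species slopes with the pooled slope in Table 2 would indicate whether the species should be modelled separately or with species included as a term in the model.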