1 Introduction

The Iris data set contains three plant species (setosa, virginica, versicolor) and four features measured for each sample. The analysis and graphics below will calculate the basic statistical features of each trait for each species. Next, the code below will determine if Sepal Width is a good predictor of Sepal Length across all species.

2 Methods

2.1 Upload the data and packages:

 data ("iris")
library(dplyr)
library(tidyverse)
library(purrr) #package for iteration
library(AER)

2.2 Part 1: Basic Statistics of Iris Species

Using the iris data set, calculate the min, max, mean, and median for each variable for trait and for each species.

basic_stats <-  iris %>% 
group_by(Species) %>% 
summarise_all(list(min, max, mean, median)) %>% # specifies the statistics you want from the summarize function
  pivot_longer(-Species) %>% # expands the data frame, removing unnecessary columns
  pivot_wider(names_from = "name") # widens the data frame, so that it is now restructured into a readable format, with each statistics for each trait of each species

Create the table

knitr::kable(basic_stats, align = "ccccccccccccc", caption = "Figure 1: Basic Statistics for each Trait for each Species of Iris", col.names = c("Species", "Minimum Sepal Length", "Minimum Sepal Width", "Minimum Petal Length", "Minimum Petal Width", "Maximum Sepal Length", "Maximum Sepal Width", "Maximum Petal Length", "Maximum Petal Width",  "Mean Sepal Length", "Mean Sepal Width", "Mean Petal Length", "Mean Petal Width","Median Sepal Length", "Median Sepal Width", "Median Petal Length", "Median Petal Width"))

Figure 1: Basic Statistics for each Trait for each Species of Iris
Species	Minimum Sepal Length	Minimum Sepal Width	Minimum Petal Length	Minimum Petal Width	Maximum Sepal Length	Maximum Sepal Width	Maximum Petal Length	Maximum Petal Width	Mean Sepal Length	Mean Sepal Width	Mean Petal Length	Mean Petal Width	Median Sepal Length	Median Sepal Width	Median Petal Length	Median Petal Width
setosa	4.3	2.3	1.0	0.1	5.8	4.4	1.9	0.6	5.006	3.428	1.462	0.246	5.0	3.4	1.50	0.2
versicolor	4.9	2.0	3.0	1.0	7.0	3.4	5.1	1.8	5.936	2.770	4.260	1.326	5.9	2.8	4.35	1.3
virginica	4.9	2.2	4.5	1.4	7.9	3.8	6.9	2.5	6.588	2.974	5.552	2.026	6.5	3.0	5.55	2.0

2.3 Part 2: Linear Regression Analyzing the Relationship Between Sepal.Width and Sepal.Length

Using the iris data set, determine if Sepal.Width is a good predictor of Sepal.Length across all species.

setosa <- iris %>% 
  filter(Species == "setosa") # create subset of data to be used as an example when creating the function below

Create a function that will run linear regression analysis and then print the key statistics in a tibble.

lm_and_stats_output <- function(y ,x ,data_frame) {
  
  formula <- as.formula (paste(y, x, sep = " ~ "))  # how to structure the linear regression input
  
  lm_setosa <- lm(formula, data = data_frame) # linear regression

  model_summary_setosa <- summary(lm_setosa)   #Summarize linear regression
  
  #Pull out stats table
  model_statistics <- tibble("intercept" = model_summary_setosa$coefficients[1,1], #intercept
                             "slope"     = model_summary_setosa$coefficients[2,1], #slope
                             "p_value"   = model_summary_setosa$coefficients[2,4], #p-value
                             "r_squared"  = model_summary_setosa$r.squared)         #R2
  return(model_statistics)
}

lm_and_stats_output(y = "Sepal.Length", x = "Sepal.Width", setosa) # check to see if the function works

Use the function to calculate the statistics for all Iris species.

lm_all <- iris %>% 
  split(.$Species) %>% 
  map_df(~lm_and_stats_output(y = "Sepal.Length", x = "Sepal.Width", .), .id = "Species") #map() function from the purrr package runs the regression on each species of Iris

Report the r2, p-value, and slope from your lm() call in a nice table with appropriate column labels.

knitr::kable(lm_all, align = "ccccc", caption = "Figure 2: Linear Regression Demostrates a Positive Relationship Between Sepal Width (independent variable) as a Good Predictor of  Sepal Length (dependent varibale) for each Species. ", col.names = c("Species", "Intercept", "Slope", "P-Value", "R Squared"))

Figure 2: Linear Regression Demostrates a Positive Relationship Between Sepal Width (independent variable) as a Good Predictor of Sepal Length (dependent varibale) for each Species.
Species	Intercept	Slope	P-Value	R Squared
setosa	2.639001	0.6904897	0.0000000	0.5513756
versicolor	3.539735	0.8650777	0.0000877	0.2765821
virginica	3.906836	0.9015345	0.0008435	0.2090573

3 Graphics

3.1 Part 3: Graphics that Display the Relationship Between Sepal.Width and Sepal.Length

Use ggplot to graph the linear model you just created in 2. You can just use ggplot’s geom_smooth function with method set to “lm” to add your regression line to the scatter plot.

ggplot(iris, aes (Sepal.Width, Sepal.Length, color = Species)) +
  geom_point() +
  facet_wrap(vars(Species)) +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Sepal Width",
       y = "Sepal Length", 
       caption = "Graph 1: Scatter plot of Sepal Width (indpendent) versus Sepal Length (depenedent) from the iris data set \n for all species. For the setosa species, Sepal Length showed a positive  and significant  relationship \n with  Sepal Width  (r2 = 0.5513756,  p-value = 6.709843e-10 ).  For the versicolor species, \n Sepal Length showed a positive  and significant   relationship with Sepal Width (r2 = 0.2765821, \n p-value = 8.771860e-05 ). For the virginica species,  Sepal Length showed a positive  \n and significant relationship with Sepal Width (r2 = 0.2090573, p-value = 8.434625e-04 )."
    ) +
  theme_bw()

4 Conclusion

Basic statistics show that the virginica species generally has the largest traits, followed by versicolor, then setosa (figure 1). All species of Iris within this data set display a significant positive relationship between Sepal Width as the predictor variable for Sepal Length as the dependent variable. This claim is supported by the positive slopes of the linear regression between these two variables, and the p values below .05 (figure 2). This can also be seen visually in graph 1, where the increase in Sepal Width on the x axis causes the increased Sepal Length on the y axis (graph 1).

Assignment 8: Iris Statistics and Linear Model

Kelly Teitel

2/26/2022