Data description

In this project we consider the classical iris data set, which can be found in the R datasets package. It has 150 observations and 5 columns: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.

Help on these data is available in R by running ?iris.
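
As a quick check of the description above, the following minimal sketch loads the data and inspects its size and column types. It assumes the unmodified iris data frame from the datasets package and uses base R only.

```r
# Load the iris data from the datasets package (attached by default in R)
data(iris)

# Number of observations (rows) and columns
dim(iris)

# Column names and types: four numeric measurements and one factor (Species)
str(iris)
```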

Descriptive analysis

First, we compute some descriptive statistics with the summary() function:

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
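
The table above can be reproduced with a single call. This is a minimal sketch, assuming the data frame is the unmodified iris from the datasets package; the object name s is only illustrative.

```r
# Descriptive statistics for every column: quartiles and mean for the
# numeric variables, and counts per level for the factor Species
s <- summary(iris)
print(s)
```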

Second, we use this code

aggregate(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species,
  data = iris, mean)

to get the following table that shows the means of the \(4\) numerical variables for each species.

Means by species
Species      Sepal Length   Sepal Width   Petal Length   Petal Width
setosa              5.006         3.428          1.462         0.246
versicolor          5.936         2.770          4.260         1.326
virginica           6.588         2.974          5.552         2.026

Linear regression

We use the cor() function to compute the Pearson correlation coefficients between all pairs of numeric variables:

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length        1.000      -0.118        0.872       0.818
Sepal.Width        -0.118       1.000       -0.428      -0.366
Petal.Length        0.872      -0.428        1.000       0.963
Petal.Width         0.818      -0.366        0.963       1.000
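
A correlation matrix like the one above can be computed as follows. This is a sketch: the names num_vars and cor_mat are illustrative, and rounding to 3 decimals is assumed from the printed values.

```r
# Keep only the numeric columns (Species is a factor and is excluded)
num_vars <- iris[, sapply(iris, is.numeric)]

# Pearson correlation matrix (the default method of cor()), rounded to 3 decimals
cor_mat <- round(cor(num_vars, method = "pearson"), 3)
print(cor_mat)
```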

Here are \(3\) scatter plots that show the association between Petal.Length and the other numerical variables.

Study of iris flowers
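
The plotting code is not shown in the report, so the following is only one possible way to draw such scatter plots with base R graphics. The panel layout, the axis assignment, and the variable name others are assumptions.

```r
# Three panels side by side; remember the previous layout to restore it later
old_par <- par(mfrow = c(1, 3))

# Plot Petal.Length against each of the other numeric variables
others <- c("Sepal.Length", "Sepal.Width", "Petal.Width")
for (v in others) {
  plot(iris[[v]], iris$Petal.Length,
       xlab = v, ylab = "Petal.Length",
       main = "Study of iris flowers")
}

# Restore the previous graphical parameters
par(old_par)
```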

Next, we would like to explain the variation in sepal length as a function of petal length. To do so, we fit the linear regression reg <- lm(Sepal.Length ~ Petal.Length, data = iris). Here is the summary of this model:

             Estimate Std. Error t value Pr(>|t|)
(Intercept)     4.307      0.078  54.939  < 0.001
Petal.Length    0.409      0.019  21.646  < 0.001


The fitted model’s equation is \[ \text{Sepal.Length} = 4.307 + 0.409 \times \text{Petal.Length}. \]
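
To make the equation concrete, the sketch below refits the model, prints the rounded coefficient table, and predicts the sepal length for one flower. The petal length of 4 is an arbitrary example value, and the names new_flower and pred are illustrative.

```r
# Fit the simple linear regression described in the text
reg <- lm(Sepal.Length ~ Petal.Length, data = iris)

# Coefficient table (estimate, standard error, t value, p-value), rounded
round(coef(summary(reg)), 3)

# Predicted sepal length for a petal length of 4:
# intercept + slope * 4, i.e. roughly 4.307 + 0.409 * 4
new_flower <- data.frame(Petal.Length = 4)
pred <- predict(reg, newdata = new_flower)
print(pred)
```

With the rounded coefficients, the prediction is about 5.94.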

Here are the data with the regression line:
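
The original plotting code is not shown. A minimal base R sketch of such a figure, with the line colour and width chosen arbitrarily, might look like this:

```r
# Fit the regression of sepal length on petal length
reg <- lm(Sepal.Length ~ Petal.Length, data = iris)

# Scatter plot of the observations
plot(Sepal.Length ~ Petal.Length, data = iris,
     xlab = "Petal.Length", ylab = "Sepal.Length")

# Overlay the fitted regression line
abline(reg, col = "red", lwd = 2)
```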

Finally, we use the plot(reg) command to get some graphical representations of the residuals.

Residuals of the regression
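
As a sketch of how these diagnostic plots can be produced, the code below refits the model and calls plot() on it. The 2 x 2 layout is an assumption, chosen so that the four default plots fit on one page.

```r
# Fit the regression of sepal length on petal length
reg <- lm(Sepal.Length ~ Petal.Length, data = iris)

# plot() on an lm object draws four diagnostic plots by default:
# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
old_par <- par(mfrow = c(2, 2))
plot(reg)
par(old_par)

# The residuals are also available numerically
summary(residuals(reg))
```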