Data description
In this project, we consider the classic iris data set from the R datasets package. It has 150 observations and 5 columns: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
Help on these data is available in R by running ?iris.
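As a quick check of the description above, the following sketch inspects the shape and column types of the data. It uses only base R functions; nothing here is taken from the original report's code.

```r
# iris ships with base R (datasets package), so nothing needs to be installed
data(iris)

dim(iris)    # 150 rows, 5 columns
names(iris)  # column names
str(iris)    # four numeric columns and one factor (Species)
```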
Descriptive analysis
First, we compute some descriptive statistics with the summary() function:
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| Min. :4.300 | Min. :2.000 | Min. :1.000 | Min. :0.100 | setosa :50 |
| 1st Qu.:5.100 | 1st Qu.:2.800 | 1st Qu.:1.600 | 1st Qu.:0.300 | versicolor:50 |
| Median :5.800 | Median :3.000 | Median :4.350 | Median :1.300 | virginica :50 |
| Mean :5.843 | Mean :3.057 | Mean :3.758 | Mean :1.199 | |
| 3rd Qu.:6.400 | 3rd Qu.:3.300 | 3rd Qu.:5.100 | 3rd Qu.:1.800 | |
| Max. :7.900 | Max. :4.400 | Max. :6.900 | Max. :2.500 | |
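A minimal sketch of the call behind this table is shown here. How the report rendered the result as a table is not known, so only the base R call is given; the variable name `desc` is chosen for illustration.

```r
# Descriptive statistics for every column of iris
desc <- summary(iris)
desc

# The same function also works on a single column
summary(iris$Sepal.Length)
```

For numeric columns, summary() reports the minimum, quartiles, mean, and maximum; for the factor Species, it reports the count per level.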
Second, we compute the mean of each of the \(4\) numerical variables for each species, which gives the following table.
| Species | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|
| setosa | 5.006 | 3.428 | 1.462 | 0.246 |
| versicolor | 5.936 | 2.770 | 4.260 | 1.326 |
| virginica | 6.588 | 2.974 | 5.552 | 2.026 |
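The code that produced this table is not included in this text. One possible way to obtain the same per-species means, assuming base R's aggregate() rather than whatever the original used, is sketched here; `species_means` is an illustrative name.

```r
# Mean of every numeric column, grouped by Species
# (aggregate() is an assumption; the original code may have used another approach)
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
species_means
```

The formula `. ~ Species` means "all other columns, grouped by Species", so the result has one row per species and one column per numerical variable.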
Linear regression
We use the function cor() to get Pearson’s correlation coefficients between all pairs of numeric variables:
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|
| Sepal.Length | 1.000 | -0.118 | 0.872 | 0.818 |
| Sepal.Width | -0.118 | 1.000 | -0.428 | -0.366 |
| Petal.Length | 0.872 | -0.428 | 1.000 | 0.963 |
| Petal.Width | 0.818 | -0.366 | 0.963 | 1.000 |
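The correlation matrix above can be reproduced along these lines. The variable names `num_vars` and `cor_mat` and the rounding to three decimals are illustrative choices, not taken from the original report.

```r
# Keep only the four numeric columns (Species is a factor)
num_vars <- iris[, 1:4]

# Pearson correlation is the default method of cor()
cor_mat <- cor(num_vars)
round(cor_mat, 3)
```

The matrix is symmetric with ones on the diagonal, so each pair of variables appears twice.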
Here are \(3\) scatter plots that show the association between Petal.Length and the other numerical variables.
Figure: Study of iris flowers (scatter plots of Petal.Length against Sepal.Length, Sepal.Width, and Petal.Width).
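The plotting code is not part of this text. A rough sketch of three such scatter plots with base graphics follows; placing Petal.Length on the horizontal axis and coloring points by species are assumptions, and `others` is an illustrative name.

```r
# The three numerical variables to plot against Petal.Length
others <- c("Sepal.Length", "Sepal.Width", "Petal.Width")

# Three panels side by side; restore the previous layout afterwards
op <- par(mfrow = c(1, 3))
for (v in others) {
  plot(iris$Petal.Length, iris[[v]],
       xlab = "Petal.Length", ylab = v,
       col = as.integer(iris$Species),  # one color per species (assumption)
       pch = 19)
}
par(op)
```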
Now, we would like to explain the variation in sepal length as a function of petal length. To do so, we fit the linear regression model lm(Sepal.Length ~ Petal.Length, data = iris). Here is the summary of this model:
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 4.307 | 0.078 | 54.939 | 0 |
| Petal.Length | 0.409 | 0.019 | 21.646 | 0 |
The model’s equation is
\[ \text{Sepal.Length} = 4.307 + 0.409 \times \text{Petal.Length}. \]
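To connect the equation to the fitted object, the sketch below refits the model, extracts the coefficient table, and evaluates the equation for one flower. The petal length of 4 and the names `b` and `new_flower` are hypothetical, chosen only for illustration.

```r
# Fit the simple linear regression described in the text
mod <- lm(Sepal.Length ~ Petal.Length, data = iris)

# Coefficient table: estimate, standard error, t value, p-value
round(coef(summary(mod)), 3)

# Predicted sepal length for a hypothetical flower with Petal.Length = 4
b <- coef(mod)
new_flower <- data.frame(Petal.Length = 4)
predict(mod, newdata = new_flower)

# The same prediction computed by hand from the model's equation
b[["(Intercept)"]] + b[["Petal.Length"]] * 4
```

With the rounded coefficients, the hand computation is 4.307 + 0.409 × 4 = 5.943; predict() uses the unrounded estimates, so its result can differ slightly in the last digits.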
Here are the data with the regression line:
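A plot of this kind can be sketched with base graphics as follows. The colors, point style, and line width are illustrative choices, and the original figure may have been drawn with a different plotting system.

```r
# Refit the model so this block runs on its own
mod <- lm(Sepal.Length ~ Petal.Length, data = iris)

# Scatter plot of the data
plot(Sepal.Length ~ Petal.Length, data = iris, pch = 19, col = "grey40")

# Overlay the fitted regression line (abline() reads the intercept and slope from mod)
abline(mod, col = "red", lwd = 2)
```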
Finally, we use the plot(mod) command, where mod is the fitted model, to get diagnostic plots of the residuals.
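As a sketch, the diagnostic plots can be shown together in one window. The 2 × 2 layout is an assumption about presentation; plot(mod) itself is the command named in the text.

```r
# Refit the model so this block runs on its own
mod <- lm(Sepal.Length ~ Petal.Length, data = iris)

# By default, plot() on an lm object draws four diagnostic plots;
# a 2 x 2 grid shows them together
op <- par(mfrow = c(2, 2))
plot(mod)
par(op)
```

The four default panels are residuals versus fitted values, a normal Q-Q plot of the residuals, a scale-location plot, and residuals versus leverage.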