Enter your name here: Labdhi Ghelani
For this assignment, we will use the well-known iris dataset, which is included in every R installation (it is available as soon as you start RStudio; just type “iris” and run it to see).
Numerous guides have been written on the exploration of this dataset. Introduced by Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems, iris contains three plant species (setosa, virginica, versicolor) and four features measured for each sample. These features quantify the morphological variation of the iris flower across its three species, with all measurements given in centimeters.
Step 1 - Load the relevant libraries
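A minimal sketch of this step. Only DataExplorer is named in the assignment (Step 2); ggplot2 and dplyr are assumptions about what the plotting and filtering steps might use. Packages that are not installed are reported rather than loaded.

```r
# Packages assumed for this assignment; only DataExplorer is named in the text.
pkgs <- c("DataExplorer", "ggplot2", "dplyr")

# Check which packages are installed before attaching anything.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

# Attach the packages that are available.
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}

# Report anything that still needs install.packages().
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```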
Step 2 - Create a correlation matrix of the Iris dataset using the DataExplorer correlation function we used in class in lab 3. Include only continuous variables in your correlation plot, since factor variables don’t make sense in a correlation plot. (10 points)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
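One way this step could be carried out is sketched below. It assumes the lab 3 function is `DataExplorer::plot_correlation`, which the text does not name explicitly. The coefficients are computed with base R's `cor`, so the numbers are available even when DataExplorer is not installed.

```r
# Keep only the numeric (continuous) columns; this drops the Species factor.
iris_num <- iris[, vapply(iris, is.numeric, logical(1))]

# Pearson correlation matrix, rounded to two decimals.
cor_all <- round(cor(iris_num), 2)
print(cor_all)

# Draw the correlation plot when DataExplorer is installed
# (assumed to be the function used in lab 3).
if (requireNamespace("DataExplorer", quietly = TRUE)) {
  DataExplorer::plot_correlation(iris_num, type = "continuous")
}
```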
Answer the following: What is the correlation coefficient between Petal Length and Petal Width?
The correlation coefficient between Petal Length and Petal Width is 0.96, a very strong positive linear relationship.
How does this compare with the correlation coefficient of Sepal Length and Sepal Width?
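The correlation coefficient between Sepal Length and Sepal Width is -0.12, a weak negative (downhill) linear relationship. This is far weaker than the 0.96 correlation between Petal Length and Petal Width, and it has the opposite sign.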
Step 3 - Create three separate correlation matrices for each species of iris flower (20 points)
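A minimal sketch of the per-species calculation, using base R's `split` and `cor`. The variable names (`by_species`, `cor_by_species`, `sepal_cor`) are illustrative. Each per-species data frame could also be passed to the DataExplorer plotting function to draw the three plots.

```r
# Split the four numeric columns by species, then correlate within each group.
by_species <- split(iris[, 1:4], iris$Species)
cor_by_species <- lapply(by_species, function(d) round(cor(d), 2))

# Pull out Sepal.Length vs. Sepal.Width for each species.
sepal_cor <- vapply(
  cor_by_species,
  function(m) m["Sepal.Length", "Sepal.Width"],
  numeric(1)
)
print(sepal_cor)
```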
Answer the following: Are the correlation coefficients similar or different when comparing Sepal length vs. Sepal Width among the three species of Iris flowers?
The correlation coefficient between Sepal Length and Sepal Width is 0.74 for Setosa, a strong positive (uphill) linear relationship; 0.53 for Versicolor, a moderate positive linear relationship; and 0.46 for Virginica, a weak positive linear relationship. The coefficients therefore differ across the three species, ranging from weak to strong, although all three are positive, unlike the pooled coefficient of -0.12 from Step 2.
Step 4 - Create a box plot of Petal Length by flower species. Make each box plot a different color for each species (10 points)
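The sketch below reproduces the five-number summary with the same `by(..., fivenum)` call that appears in the output, then draws the box plot. It uses base graphics to avoid package dependencies. The original plot may have been made with ggplot2, and the color names are arbitrary choices.

```r
# Five-number summary (min, lower hinge, median, upper hinge, max) per species.
five <- by(iris[, "Petal.Length"], iris[, "Species"], fivenum)
str(five)

# One fill color per species, in factor-level order (colors are arbitrary).
species_cols <- c(setosa = "tomato", versicolor = "seagreen3", virginica = "steelblue")

# Box plot of Petal.Length grouped by Species.
boxplot(
  Petal.Length ~ Species,
  data = iris,
  col  = species_cols,
  main = "Petal Length by Species",
  xlab = "Species",
  ylab = "Petal Length (cm)"
)
```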
## List of 3
## $ setosa : num [1:5] 1 1.4 1.5 1.6 1.9
## $ versicolor: num [1:5] 3 4 4.35 4.6 5.1
## $ virginica : num [1:5] 4.5 5.1 5.55 5.9 6.9
## - attr(*, "dim")= int 3
## - attr(*, "dimnames")=List of 1
## ..$ iris[, "Species"]: chr [1:3] "setosa" "versicolor" "virginica"
## - attr(*, "call")= language by.default(data = iris[, "Petal.Length"], INDICES = iris[, "Species"], FUN = fivenum)
## - attr(*, "class")= chr "by"
Answer the following: What insights can you draw from the box plot you just generated?
Each of the three species (Setosa, Versicolor, and Virginica) is represented by a box with whiskers, the lines extending from the box. Setosa and Versicolor also show outliers; for example, the versicolor minimum of 3 lies below its lower fence of 4 - 1.5 × 0.6 = 3.1, while the virginica extremes of 4.5 and 6.9 fall inside its fences. The dark line in the middle of each box is the median: 1.5 for setosa, 4.35 for versicolor, and 5.55 for virginica. The bottoms of the boxes, at 1.4, 4, and 5.1, mark the first quartiles, and the tops, at 1.6, 4.6, and 5.9, mark the third quartiles. Each box therefore covers the middle 50% of the observations for its species, those between the first and third quartiles (25th and 75th percentiles). The distance from the bottom of a box to its top is the interquartile range (IQR); at 0.2, the IQR for setosa is much smaller than that for versicolor (0.6) or virginica (0.8). Setosa petals are also clearly the shortest: the setosa maximum of 1.9 is below the versicolor minimum of 3.
Step 5 - Create a Scatter jitter plot of Petal Width on the x axis vs. Petal Length on y axis, for the species of flower you identify in your boxplot that has the smallest median Petal Length (15 points)
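A possible approach is sketched here: find the species with the smallest median Petal Length programmatically, then jitter its points. Base R's `jitter` stands in for whatever jitter geometry the original used (likely ggplot2's, which is an assumption). The seed is set only so the random jitter is repeatable.

```r
# Median Petal.Length per species; pick the species with the smallest one.
medians  <- tapply(iris$Petal.Length, iris$Species, median)
smallest <- names(which.min(medians))
subset_df <- iris[iris$Species == smallest, ]

# Jitter adds small random noise so overlapping points become visible.
set.seed(1)
plot(
  jitter(subset_df$Petal.Width),
  jitter(subset_df$Petal.Length),
  pch  = 19,
  main = paste("Petal Width vs. Petal Length:", smallest),
  xlab = "Petal Width (cm)",
  ylab = "Petal Length (cm)"
)
```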
Step 6 - Now switch this plot to scatter point without the jitter. There appears to be an outlier point on the right of the graph that has Petal Width of 0.6. Can you figure out a way to make this point a different color than the rest? (20 points)
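One way to highlight the point is to build a color vector from a logical condition. The sketch below assumes the species is setosa (the smallest median in the box plot) and flags the outlier as the observation whose Petal Width reaches 0.6. The colors red and black are arbitrary choices.

```r
# Restrict to setosa, the species with the smallest median Petal.Length.
setosa <- iris[iris$Species == "setosa", ]

# Flag the point at Petal.Width 0.6; >= avoids an exact floating-point match.
is_outlier <- setosa$Petal.Width >= 0.6
point_cols <- ifelse(is_outlier, "red", "black")

# Plain scatter plot (no jitter) with the flagged point in a different color.
plot(
  setosa$Petal.Width,
  setosa$Petal.Length,
  col  = point_cols,
  pch  = 19,
  main = "Setosa: Petal Width vs. Petal Length",
  xlab = "Petal Width (cm)",
  ylab = "Petal Length (cm)"
)
```

The same idea carries over to ggplot2 by mapping the logical flag to the color aesthetic.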
Step 7 - Finally, create a vertical bar graph that sums observations by flower species after filtering the Iris dataset to only observations with Sepal Length less than 6.
Order your bar graph so that the species with the most records is on the left and the species with the least records is on the right
Make each species bar a different color (25 points)
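These three requirements can be sketched in base R as follows: filter, count with `table`, sort in decreasing order so the largest group is on the left, and pass one color per bar. The original may have used dplyr and ggplot2 instead, and the color names are arbitrary.

```r
# Keep only observations with Sepal.Length below 6.
small <- iris[iris$Sepal.Length < 6, ]

# Count rows per species and order from most to fewest records.
counts <- sort(table(small$Species), decreasing = TRUE)
print(counts)

# Vertical bar graph, one color per species bar (colors are arbitrary).
barplot(
  counts,
  col  = c("tomato", "seagreen3", "steelblue"),
  main = "Observations with Sepal Length < 6",
  xlab = "Species",
  ylab = "Count"
)
```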
## Species Count
## 1 setosa 50
## 2 versicolor 26
## 3 virginica 7
Answer the following: What is the count of observations by species in your graph above?
The counts of observations with Sepal Length less than 6, by species, are as follows: setosa: 50, versicolor: 26, virginica: 7.