Enter your name here: Maojie Li
For this assignment, we will use the famous iris dataset, which is included in every R installation (it is available as soon as you start RStudio; type “iris” and run it to see the data).
Numerous guides have been written on the exploration of this widely known dataset. Introduced by Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems", iris contains three plant species (setosa, virginica, versicolor) and four features measured for each sample. These features quantify the morphological variation of the iris flower across its three species, with all measurements given in centimeters.
Step 1 - Load the relevant libraries
library(ggplot2)
library(tidyr)
library(datasets)
data("iris")
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Step 2 - Create a correlation matrix of the Iris dataset using the DataExplorer correlation function we used in class in lab 3. Include only continuous variables in your correlation plot to avoid confusion as factor variables don’t make sense in a correlation plot (10 points)
library(DataExplorer)
## Warning: package 'DataExplorer' was built under R version 3.5.2
library(corrplot)
## corrplot 0.84 loaded
title="matrix_iris"
# Use only the four numeric columns; Species is a factor
plot_correlation(iris[, 1:4])
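The coefficients shown in the plot can be cross-checked numerically. The following is a minimal sketch using base R's cor() on the four numeric columns; the variable name num_cor is an illustrative choice and not part of the assignment.

```r
# Pearson correlation matrix of the four numeric iris columns (Species excluded)
num_cor <- cor(iris[, 1:4])

# Pull out the two pairs discussed in this step, rounded to two decimals
round(num_cor["Petal.Length", "Petal.Width"], 2)   # 0.96
round(num_cor["Sepal.Length", "Sepal.Width"], 2)   # -0.12
```

Reading the values from the matrix avoids misreading a cell in a crowded plot.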
Answer the following: What is the correlation coefficient between Petal Length and Petal Width?
The correlation coefficient between Petal Length and Petal Width is 0.96.
How does this compare with the correlation coefficient of Sepal Length and Sepal Width?
The correlation coefficient between Sepal Length and Sepal Width is -0.12, which indicates a weak negative relationship. With a coefficient of 0.96, Petal Length and Petal Width are far more strongly correlated than Sepal Length and Sepal Width.
Step 3 - Create three separate correlation matrices for each species of iris flower (20 points)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
m <- levels(iris$Species)
# setosa
title0 <- "Setosa"
setosaCorr <- cor(iris[iris$Species == m[1], 1:4])
corrplot(setosaCorr, method = "number", title = title0, mar = c(0, 0, 1, 0))
# versicolor
versC <- cor(iris[iris$Species == m[2], 1:4])
title1 <- "versicolor"
corrplot(versC, method = "number", title = title1, mar = c(0, 0, 1, 0))
# virginica
veriC <- cor(iris[iris$Species == m[3], 1:4])
title2 <- "virginica"
corrplot(veriC, method = "number", title = title2, mar = c(0, 0, 1, 0))
Answer the following: Are the correlation coefficients similar or different when comparing Sepal Length vs. Sepal Width among the three species of Iris flowers?
The correlation coefficients are similar in direction but differ in strength when comparing Sepal Length vs. Sepal Width among the three species. The coefficient is 0.74 for setosa, 0.53 for versicolor, and 0.46 for virginica. All three are positive, unlike the coefficient of -0.12 for the pooled dataset.
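The per-species coefficients can also be computed in one pass instead of being read off three separate plots. This is a minimal sketch using base R's split() and sapply(); the names by_species and sepal_cor are illustrative.

```r
# Split iris into one data frame per species
by_species <- split(iris, iris$Species)

# Correlation of Sepal.Length vs. Sepal.Width within each species
sepal_cor <- sapply(by_species, function(d) cor(d$Sepal.Length, d$Sepal.Width))

# Named vector with one coefficient per species, rounded to two decimals
round(sepal_cor, 2)
```

The result is a named vector, so the three coefficients can be compared side by side.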
Step 4 - Create a box plot of Petal Length by flower species. Make each box plot a different color for each species (10 points)
library(ggplot2)
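ggplot(data = iris) +
  # Map colour to Species inside aes() instead of passing iris$ columns
  geom_boxplot(mapping = aes(x = Species, y = Petal.Length, color = Species), fill = "white") +
  scale_color_manual(values = c("yellow", "blue", "orange")) +
  labs(x = "Species", y = "Petal Length (cm)") +
  ggtitle("Box plot of Petal Length by flower species")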
Answer the following: What insights can you draw from the box plot you just generated?
Setosa has the lowest median petal length of the three species, and its box sits well below those of versicolor and virginica.
Step 5 - Create a Scatter jitter plot of Petal Width on the x axis vs. Petal Length on y axis, for the species of flower you identify in your boxplot that has the smallest median Petal Length (15 points)
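Before subsetting, the species with the smallest median Petal Length can be confirmed numerically rather than by eye. The sketch below uses base R's tapply(); petal_medians and smallest are illustrative names.

```r
# Median Petal.Length for each species
petal_medians <- tapply(iris$Petal.Length, iris$Species, median)
petal_medians

# Name of the species with the smallest median
smallest <- names(which.min(petal_medians))
smallest   # "setosa"
```

The value of smallest could then be used in place of a hard-coded species name when filtering.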
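seto <- iris[iris$Species == "setosa", c("Petal.Length", "Petal.Width", "Species")]
ggplot(seto) +
  # Refer to columns by name inside aes(); jitter spreads overlapping points
  geom_point(mapping = aes(x = Petal.Width, y = Petal.Length), position = "jitter") +
  labs(x = "Petal Width (cm)", y = "Petal Length (cm)") +
  ggtitle("Petal Width vs. Petal Length - jitter scatter plot")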
Step 6 - Now switch this plot to scatter point without the jitter. There appears to be an outlier point on the right of the graph that has Petal Width of 0.6. Can you figure out a way to make this point a different color than the rest? (20 points)
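ggplot(seto) +
  # The logical expression colours the point at Petal.Width >= 0.6 differently
  geom_point(mapping = aes(x = Petal.Width, y = Petal.Length, colour = Petal.Width >= 0.6)) +
  labs(x = "Petal Width (cm)", y = "Petal Length (cm)", colour = "Petal Width >= 0.6") +
  ggtitle("Petal Width vs. Petal Length - outlier highlighted")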
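An alternative to mapping a logical expression directly is to store the flag in a column and assign explicit colours. The following is only a sketch under that assumption: seto_flag and is_outlier are hypothetical names, and the black/red palette is an arbitrary choice.

```r
# Rebuild the setosa subset so the sketch is self-contained
seto_flag <- iris[iris$Species == "setosa", c("Petal.Length", "Petal.Width")]

# Hypothetical helper column: TRUE for points with Petal.Width >= 0.6
seto_flag$is_outlier <- seto_flag$Petal.Width >= 0.6
table(seto_flag$is_outlier)

# Plot only if ggplot2 is installed; red marks the flagged point
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(seto_flag, aes(x = Petal.Width, y = Petal.Length, colour = is_outlier)) +
    geom_point() +
    scale_colour_manual(values = c("FALSE" = "black", "TRUE" = "red"), name = "Outlier") +
    labs(x = "Petal Width (cm)", y = "Petal Length (cm)") +
    ggtitle("Setosa: Petal Width vs. Petal Length with outlier flagged")
  print(p)
}
```

Keeping the flag as a column makes it easy to count or inspect the flagged rows separately from the plot.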
Step 7 - Finally, create a vertical bar graph that sums observations by flower species after filtering the Iris dataset to only observations with Sepal Length less than 6.
Order your bar graph so that the species with the most records is on the left and the species with the least records is on the right
Make each species bar a different color (25 points)
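sm <- iris[iris$Sepal.Length < 6, c("Petal.Length", "Petal.Width", "Species", "Sepal.Length")]
ggplot(sm) +
  # Map fill to Species so each bar gets its own colour
  geom_bar(mapping = aes(x = Species, fill = Species)) +
  scale_fill_manual(values = c("gray", "blue", "lightblue")) +
  labs(x = "Species", y = "Number of observations") +
  ggtitle("Flowers with Sepal Length less than 6")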
summary(sm)
## Petal.Length Petal.Width Species Sepal.Length
## Min. :1.000 Min. :0.1000 setosa :50 Min. :4.300
## 1st Qu.:1.400 1st Qu.:0.2000 versicolor:26 1st Qu.:4.900
## Median :1.600 Median :0.4000 virginica : 7 Median :5.200
## Mean :2.543 Mean :0.7012 Mean :5.224
## 3rd Qu.:4.000 3rd Qu.:1.2000 3rd Qu.:5.600
## Max. :5.100 Max. :2.4000 Max. :5.900
Answer the following: What are the counts of observations by species in your graph above?
The counts by species are 50 for setosa, 26 for versicolor, and 7 for virginica, for a total of 83 observations.
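The counts, and the left-to-right ordering the step asks for, can be derived from the data rather than relying on the default factor order. This is a minimal sketch in base R; short_sepal and species_counts are illustrative names.

```r
# Keep only observations with Sepal.Length below 6
short_sepal <- iris[iris$Sepal.Length < 6, ]

# Count rows per species and sort from most to fewest
species_counts <- sort(table(short_sepal$Species), decreasing = TRUE)
species_counts

# Reorder factor levels so a bar chart would follow the sorted counts
short_sepal$Species <- factor(short_sepal$Species, levels = names(species_counts))
levels(short_sepal$Species)
```

Here the sorted order happens to match the default level order (setosa, versicolor, virginica), but setting the levels explicitly keeps the bar order correct if the filter changes.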
Before you submit your files, check that each of your ggplots includes a proper TITLE and properly labels the x and y axes
Submit the following two files into the Assignment 3 drop box on Blackboard. Be sure to include your name in the filenames per the examples below.
File 1: This RMD document with your plots and answers completed above Example: Yoni_Dvorkis_ALY6070_Assignment3.Rmd
File 2: HTML version of this file after you click “Knit” above Example: Yoni_Dvorkis_ALY6070_Assignment3.html
Academic Integrity Reminder: Collaborating on this assignment with any other student is prohibited and will be considered a violation of your Academic Integrity Agreement that you signed prior to joining the program.
Good luck!