Enter your name here: Maojie Li

For this assignment, we will use the well-known iris dataset, which ships with every R installation in the built-in datasets package (it is available as soon as you start RStudio; just type “iris” and run it to see).

Numerous guides have been written on exploring this dataset. It was introduced by Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems and contains three plant species (setosa, virginica, versicolor), with four features measured for each sample. These features quantify the morphological variation of the iris flower across its three species; all measurements are given in centimeters.

Step 1 - Load the relevant libraries

library(ggplot2)
library(tidyr)
library(datasets)
data("iris")
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Step 2 - Create a correlation matrix of the Iris dataset using the DataExplorer correlation function we used in class in lab 3. Include only continuous variables in your correlation plot to avoid confusion as factor variables don’t make sense in a correlation plot (10 points)

library(DataExplorer)
## Warning: package 'DataExplorer' was built under R version 3.5.2
library(corrplot)
## corrplot 0.84 loaded
title <- "matrix_iris"
# Keep only the four numeric measurement columns; Species is a factor
plot_correlation(iris[, 1:4])
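
As a cross-check on the plot above, the same coefficients can be computed numerically with base R's `cor()`. This is a minimal sketch, not part of the required DataExplorer workflow; the names `num_cols` and `corr_all` are illustrative.

```r
# Select only the numeric measurement columns (Species is a factor)
num_cols <- iris[, sapply(iris, is.numeric)]

# Pearson correlation matrix, rounded to two decimals for readability
corr_all <- round(cor(num_cols), 2)
print(corr_all)

# Pull out the two pairs discussed in this step
corr_all["Petal.Length", "Petal.Width"]   # 0.96
corr_all["Sepal.Length", "Sepal.Width"]   # -0.12
```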

Answer the following: What is the correlation coefficient between Petal Length and Petal Width?

The correlation coefficient between Petal Length and Petal Width is 0.96.

How does this compare with the correlation coefficient of Sepal Length and Sepal Width?

The correlation coefficient between Sepal Length and Sepal Width is -0.12, which indicates a weak negative relationship. At 0.96, Petal Length and Petal Width are far more strongly (and positively) correlated than Sepal Length and Sepal Width.

Step 3 - Create three separate correlation matrices for each species of iris flower (20 points)

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
m<-levels(iris$Species)
title0<-"Setosa"
setosaCorr<-cor(iris[iris$Species==m[1],1:4])
# Use the species title (title0), not the Step 2 title
corrplot(setosaCorr,method="number",title=title0,mar=c(0,0,1,0))

title1<-"versicolor"
versC<-cor(iris[iris$Species==m[2],1:4])
corrplot(versC,method="number",title=title1,mar=c(0,0,1,0))

title2<-"virginica"
veriC<-cor(iris[iris$Species==m[3],1:4])
corrplot(veriC,method="number",title=title2,mar=c(0,0,1,0))
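
The three per-species matrices can also be summarised in one pass. The sketch below is one possible approach using base R's `split()` and `sapply()`; it pulls out only the Sepal Length vs. Sepal Width coefficient for each species, and the names `by_species` and `sepal_corr` are illustrative.

```r
# Split the four numeric columns into one data frame per species
by_species <- split(iris[, 1:4], iris$Species)

# Correlation between Sepal.Length and Sepal.Width within each species
sepal_corr <- sapply(by_species, function(d) cor(d$Sepal.Length, d$Sepal.Width))

# Named vector in factor-level order: setosa, versicolor, virginica
round(sepal_corr, 2)
```

Rounded to two decimals, these should agree with the Sepal.Length/Sepal.Width cell of each corrplot.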

Answer the following: Are the correlation coefficients similar or different when comparing Sepal length vs. Sepal Width among the three species of Iris flowers?

The correlation coefficients are similar in sign but differ in strength among the three species. For setosa the coefficient is 0.74, for versicolor it is 0.53, and for virginica it is 0.46. All three are positive, unlike the coefficient of -0.12 for the pooled dataset.

Step 4 - Create a box plot of Petal Length by flower species. Make each box plot a different color for each species (10 points)

library(ggplot2)

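ggplot(data=iris)+
  # Map fill to Species inside aes() so each box gets its own color
  geom_boxplot(mapping = aes(x=Species,y=Petal.Length,fill=Species))+
  scale_fill_manual(values=c("yellow","blue","orange"))+
  labs(x="Species",y="Petal Length (cm)")+
  ggtitle("Box plot of Petal Length by flower species")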
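
The center line of each box is the median, so the values the box plot draws can be computed directly. This is a small sketch using base R's `tapply()`; the name `petal_medians` is illustrative.

```r
# Median petal length for each species (the center line of each box)
petal_medians <- tapply(iris$Petal.Length, iris$Species, median)
print(petal_medians)

# Species with the smallest median petal length
names(which.min(petal_medians))
```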

Answer the following: What insights can you draw from the box plot you just generated?

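Setosa has the smallest median petal length of the three species, and its box sits well below the other two without overlapping them. Virginica has the largest median petal length, with versicolor in between.

Step 5 - Create a Scatter jitter plot of Petal Width on the x axis vs. Petal Length on y axis, for the species of flower you identify in your boxplot that has the smallest median Petal Length (15 points)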

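seto<-iris[iris$Species =='setosa',c("Petal.Length","Petal.Width","Species")]
ggplot(seto)+
  # Refer to columns by name inside aes(); ggplot looks them up in seto
  geom_point(mapping=aes(x=Petal.Width,y=Petal.Length),position = "jitter")+
  labs(x="Petal Width (cm)",y="Petal Length (cm)")+
  ggtitle("Setosa: Petal Width vs. Petal Length (jitter scatter plot)")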

Step 6 - Now switch this plot to scatter point without the jitter. There appears to be an outlier point on the right of the graph that has Petal Width of 0.6. Can you figure out a way to make this point a different color than the rest? (20 points)

ggplot(seto)+
  geom_point(mapping=aes(x=Petal.Width,y=Petal.Length,colour=Petal.Width>=0.6))+
  labs(x="Petal Width (cm)",y="Petal Length (cm)",colour="Petal Width >= 0.6")+
  ggtitle("Setosa: Petal Width vs. Petal Length with outlier highlighted")
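
An alternative way to highlight the point is to store the condition in its own column and pick the two colors explicitly. The sketch below assumes a helper column named `is_outlier` (a hypothetical name, not part of the iris dataset) and uses `scale_colour_manual()` to choose the colors.

```r
library(ggplot2)

# Setosa rows only, as in Step 5
seto <- iris[iris$Species == "setosa", c("Petal.Length", "Petal.Width", "Species")]

# Flag the observation(s) with Petal.Width of 0.6 or more (is_outlier is a hypothetical helper column)
seto$is_outlier <- seto$Petal.Width >= 0.6

# Show which row(s) were flagged
print(seto[seto$is_outlier, ])

# Plain scatter plot with the flagged point in red
p <- ggplot(seto, aes(x = Petal.Width, y = Petal.Length, colour = is_outlier)) +
  geom_point() +
  scale_colour_manual(values = c("FALSE" = "black", "TRUE" = "red"),
                      name = "Petal Width >= 0.6") +
  labs(x = "Petal Width (cm)", y = "Petal Length (cm)") +
  ggtitle("Setosa: Petal Width vs. Petal Length with outlier highlighted")
print(p)
```

Keeping the flag as a column makes it easy to confirm that only one observation is being recolored before plotting.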

Step 7 - Finally, create a vertical bar graph that sums observations by flower species after filtering the Iris dataset to only observations with Sepal Length less than 6.

Order your bar graph so that the species with the most records is on the left and the species with the least records is on the right

Make each species bar a different color (25 points)

sm<-iris[iris$Sepal.Length<6,c("Petal.Length","Petal.Width","Species","Sepal.Length")]
ggplot(sm)+
  geom_bar(mapping = aes(x=Species),fill=c("gray","blue","lightblue"))+
  labs(x="Species",y="Number of observations")+
  ggtitle("Observations with Sepal Length less than 6, by species")

summary(sm)
##   Petal.Length    Petal.Width           Species    Sepal.Length  
##  Min.   :1.000   Min.   :0.1000   setosa    :50   Min.   :4.300  
##  1st Qu.:1.400   1st Qu.:0.2000   versicolor:26   1st Qu.:4.900  
##  Median :1.600   Median :0.4000   virginica : 7   Median :5.200  
##  Mean   :2.543   Mean   :0.7012                   Mean   :5.224  
##  3rd Qu.:4.000   3rd Qu.:1.2000                   3rd Qu.:5.600  
##  Max.   :5.100   Max.   :2.4000                   Max.   :5.900

Answer the following: What is the count of observations by species in your graph above?

After filtering, setosa has 50 observations, versicolor has 26, and virginica has 7, for a total of 83.
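
In this filtered data the default factor order happens to match the required most-to-least ordering, but that is a coincidence of the counts. A minimal sketch of making the ordering explicit, assuming the same filter as above, is to sort the counts and rebuild the factor levels from them; the name `counts` is illustrative.

```r
library(ggplot2)

# Keep only observations with Sepal Length less than 6
sm <- iris[iris$Sepal.Length < 6, ]

# Count observations per species, largest first
counts <- sort(table(sm$Species), decreasing = TRUE)
print(counts)

# Reorder the factor levels so bars are drawn from most to fewest records
sm$Species <- factor(sm$Species, levels = names(counts))

p <- ggplot(sm) +
  geom_bar(aes(x = Species, fill = Species)) +
  scale_fill_manual(values = c("gray", "blue", "lightblue")) +
  labs(x = "Species", y = "Number of observations") +
  ggtitle("Observations with Sepal Length less than 6, by species")
print(p)
```

Because the levels are derived from the sorted counts, the bars stay in descending order even if the filter threshold changes.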

Before you submit your files, check that each of your ggplots includes a proper TITLE and properly labels the x and y axes

Submit the following two files into the Assignment 3 drop box on Blackboard. Be sure to include your name in the filenames per the examples below.

File 1: This RMD document with your plots and answers completed above Example: Yoni_Dvorkis_ALY6070_Assignment3.Rmd

File 2: HTML version of this file after you click “Knit” above Example: Yoni_Dvorkis_ALY6070_Assignment3.html

Academic Integrity Reminder: Collaborating on this assignment with any other student is prohibited and will be considered a violation of your Academic Integrity Agreement that you signed prior to joining the program.

Good luck!