# load the libraries
library(ggplot2)
# read the data
iris <- read.csv("iris.csv", row.names = 1, stringsAsFactors = TRUE)
# summary statistics
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
# correlation
cor(iris[,1:4])
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
Petal length has the widest range of the 4 variables (1.0 to 6.9, a spread of 5.9), while sepal width and petal width have the narrowest (2.4 each). Sepal length has the highest mean (5.84) and petal width the lowest (1.20). Overall, sepal length is the largest measurement and petal width the smallest.

Sepal length is strongly positively correlated with petal length (0.87) and petal width (0.82), and only weakly negatively correlated with sepal width (-0.12). Sepal width is negatively correlated with all three other variables: sepal length (-0.12), petal length (-0.43) and petal width (-0.37). Petal length and petal width are very strongly positively correlated with each other (0.96).
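The ranges and means discussed above can be checked directly rather than read off the `summary()` output. This is a minimal sketch, assuming R's built-in `datasets::iris` matches the CSV used in this report; the names `iris_builtin`, `measurements`, `value_ranges` and `value_means` are introduced here for illustration only.

```r
# Assumption: the built-in iris data set matches the CSV read above.
iris_builtin <- datasets::iris

# Keep only the four numeric measurement columns.
measurements <- iris_builtin[, 1:4]

# Range (max minus min) and mean of each variable.
value_ranges <- sapply(measurements, function(x) diff(range(x)))
value_means  <- sapply(measurements, mean)

print(round(value_ranges, 2))
print(round(value_means, 2))

# Variable with the widest range and variable with the lowest mean.
names(which.max(value_ranges))
names(which.min(value_means))
```

Computing the ranges explicitly avoids the mental subtraction of `Min.` from `Max.` that the summary table otherwise requires.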
levels(iris$Species)
## [1] "setosa" "versicolor" "virginica"
# Rename the factors
# Rename all levels, by name
levels(iris$Species) <- list(Setosa="setosa",Versicolor="versicolor", Virginica="virginica")
# Scatterplot
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  ggtitle("Relation between Sepal Length and Sepal Width") +
  xlab("Sepal Length") + ylab("Sepal Width") +
  theme_classic()
ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point() +
  ggtitle("Relation between Petal Length and Petal Width") +
  xlab("Petal Length") + ylab("Petal Width") +
  theme_classic()
# Boxplot
# Species is on the x axis, so the measurement label belongs on the y axis
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  ggtitle("Boxplot of Sepal Length By Species") +
  ylab("Sepal Length") +
  theme_classic()
# The second boxplot must map Sepal.Width to match its title
ggplot(iris, aes(x = Species, y = Sepal.Width, fill = Species)) +
  geom_boxplot() +
  ggtitle("Boxplot of Sepal Width By Species") +
  ylab("Sepal Width") +
  theme_classic()
# Histogram
ggplot(iris, aes(Sepal.Length)) +
  geom_histogram(color = "white") +
  ggtitle("Distribution of Sepal Length") +
  xlab("Sepal Length") +
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Map Sepal.Width so the plot matches its title and axis label
ggplot(iris, aes(Sepal.Width)) +
  geom_histogram(color = "white") +
  ggtitle("Distribution of Sepal Width") +
  xlab("Sepal Width") +
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R Markdown at the end.
# Question: Is there any difference in sepal width among the 3 flower species?
fit <- aov(Sepal.Width~Species, data = iris)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 11.35 5.672 49.16 <2e-16 ***
## Residuals 147 16.96 0.115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
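The F value in the table above is the ratio of the between-species mean square to the residual mean square, and it can be reproduced by hand. The following is a minimal sketch, assuming R's built-in `datasets::iris` matches the CSV used in this report; `iris_builtin` and the intermediate variable names are introduced here for illustration only.

```r
# Assumption: the built-in iris data set matches the CSV read above.
iris_builtin <- datasets::iris
width   <- iris_builtin$Sepal.Width
species <- iris_builtin$Species

# Per-row group mean and the overall mean.
group_mean_per_row <- ave(width, species)
grand_mean <- mean(width)

# Sums of squares: between species and within species (residual).
ss_between <- sum((group_mean_per_row - grand_mean)^2)
ss_within  <- sum((width - group_mean_per_row)^2)

# Degrees of freedom: k - 1 and n - k.
df_between <- nlevels(species) - 1
df_within  <- length(width) - nlevels(species)

# F statistic and its upper-tail p-value.
f_value <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(c(F = f_value, p = p_value))
```

The degrees of freedom computed this way (2 and 147) correspond to the `Df` column of the ANOVA table.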
The box plots of sepal width grouped by species suggest that sepal width differs among the 3 flower species. We ran a one-way ANOVA to test whether this difference is statistically significant. At the 5% level of significance, with F = 49.16 on 2 and 147 degrees of freedom and a p-value < 2e-16, we conclude that the difference in mean sepal width among the 3 species is significant.
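A significant one-way ANOVA only says that at least one species mean differs; it does not say which pairs differ. One common follow-up, not part of the original analysis, is Tukey's honest significant difference test. The sketch below assumes R's built-in `datasets::iris` matches the CSV used in this report (note that its species levels are lowercase), and `iris_builtin`, `fit_builtin` and `tukey` are names introduced here for illustration.

```r
# Assumption: the built-in iris data set matches the CSV read above.
iris_builtin <- datasets::iris

# Refit the same one-way ANOVA on the built-in data.
fit_builtin <- aov(Sepal.Width ~ Species, data = iris_builtin)

# Pairwise comparisons of species means with family-wise 95% intervals.
tukey <- TukeyHSD(fit_builtin, "Species", conf.level = 0.95)
print(tukey)

# Each row is one pair of species; the columns are the mean difference,
# the lower and upper interval bounds, and the adjusted p-value.
pairwise <- tukey$Species
rownames(pairwise)
```

A pair whose interval excludes zero (equivalently, whose adjusted p-value is below 0.05) differs in mean sepal width at the 5% family-wise level.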
url <- "https://raw.githubusercontent.com/jonygeta/iris.csv/master/iris.csv"
iris <- read.csv(url)