# load the libraries
library(ggplot2)
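# read the data; stringsAsFactors = TRUE keeps Species a factor (the default is FALSE since R 4.0)
iris <- read.csv("iris.csv", row.names = 1, stringsAsFactors = TRUE)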

1. Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.

Summary Statistics

# summary statistics
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Correlation

# correlation
cor(iris[,1:4])
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

Petal length has the widest range of the 4 variables (1.0 to 6.9, a spread of 5.9), while sepal width and petal width have the narrowest (2.4 each). Sepal length has the highest mean (5.84) and petal width the lowest (1.20). Sepal length values are the largest of the four measurements and petal width values the smallest.

Sepal length is strongly positively correlated with petal length (0.87) and petal width (0.82), and only weakly negatively correlated with sepal width (-0.12). Sepal width is negatively correlated with all three other variables: weakly with sepal length (-0.12) and moderately with petal length (-0.43) and petal width (-0.37). Petal length and petal width are very strongly positively correlated with each other (0.96).
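The ranges and means discussed above can be checked directly rather than read off the `summary()` table. The sketch below assumes R's built-in `datasets::iris` holds the same values as the CSV used in this report; the names `flowers`, `num`, and `ranges` are illustrative only.

```r
# Assumption: the built-in iris data stands in for the CSV read at the top.
flowers <- datasets::iris
num <- flowers[, 1:4]

# Range (max - min) of each measurement
ranges <- sapply(num, function(x) diff(range(x)))
ranges

# Mean of each measurement, rounded to 2 decimals
round(colMeans(num), 2)

# Mean of each measurement within each species
aggregate(. ~ Species, data = flowers, FUN = mean)
```

The per-species means from `aggregate()` are a useful complement to the overall summary, since the overall figures mix three groups.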

2. Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming or creating a subset of the data set.

Renaming the levels of Species

levels(iris$Species)
## [1] "setosa"     "versicolor" "virginica"
# Rename the factors
# Rename all levels, by name
levels(iris$Species) <- list(Setosa="setosa",Versicolor="versicolor", Virginica="virginica")
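The task also mentions column renaming and subsetting. A minimal sketch of both follows, again assuming the built-in `datasets::iris` as a stand-in for the CSV; the lower-case column names and the object names `flowers` and `versicolor_sepals` are hypothetical choices made for this example.

```r
# Assumption: the built-in iris data stands in for the CSV read at the top.
flowers <- datasets::iris

# Rename the columns (hypothetical snake_case names)
names(flowers) <- c("sepal_length", "sepal_width",
                    "petal_length", "petal_width", "species")

# Subset: keep only versicolor rows and the sepal measurements
versicolor_sepals <- subset(flowers,
                            species == "versicolor",
                            select = c(sepal_length, sepal_width, species))
head(versicolor_sepals)
```

Working on a copy keeps the original `iris` object, and the column names the later plots rely on, unchanged.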

3. Graphics: Please make sure to display at least one scatter plot, box plot, and histogram. Don't be limited to this. Please explore the many other options in R packages such as ggplot2.

# Scatterplot
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  ggtitle("Relation between Sepal Length and Sepal Width") +
  xlab("Sepal Length") +
  ylab("Sepal Width") +
  theme_classic()

ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point() +
  ggtitle("Relation between Petal Length and Petal Width") +
  xlab("Petal Length") +
  ylab("Petal Width") +
  theme_classic()

# Boxplot
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  ggtitle("Boxplot of Sepal Length By Species") +
  xlab("Species") +
  ylab("Sepal Length") +
  theme_classic()

# the second box plot uses Sepal.Width on the y axis, matching its title
ggplot(iris, aes(x = Species, y = Sepal.Width, fill = Species)) +
  geom_boxplot() +
  ggtitle("Boxplot of Sepal Width By Species") +
  xlab("Species") +
  ylab("Sepal Width") +
  theme_classic()

# Histogram
ggplot(iris, aes(Sepal.Length)) +
  geom_histogram(color = "white") +
  ggtitle("Distribution of Sepal Length") +
  xlab("Sepal Length") +
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# plot Sepal.Width here so the data match the title and axis label
ggplot(iris, aes(Sepal.Width)) +
  geom_histogram(color = "white") +
  ggtitle("Distribution of Sepal Width") +
  xlab("Sepal Width") +
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
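The `stat_bin()` message above appears because no bin width was given. One way to address it, and to see how the distribution differs by species, is sketched below. The bin width of 0.2 is an arbitrary choice for illustration, and the built-in `datasets::iris` is assumed to match the CSV; `flowers` and `p` are illustrative names.

```r
library(ggplot2)

# Assumption: the built-in iris data stands in for the CSV read at the top.
flowers <- datasets::iris

# Histogram with an explicit bin width, one panel per species
p <- ggplot(flowers, aes(x = Sepal.Width, fill = Species)) +
  geom_histogram(binwidth = 0.2, color = "white") +
  facet_wrap(~ Species, ncol = 1) +
  ggtitle("Distribution of Sepal Width by Species") +
  xlab("Sepal Width") +
  ylab("Count") +
  theme_classic()
print(p)
```

Stacking the panels in one column puts the three distributions on a shared x axis, which makes shifts between species easier to compare.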

4. Meaningful question for analysis: Please state at the beginning a meaningful question for analysis.

Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R Markdown at the end.

Question: Are there any differences in sepal width among the 3 flower species?

fit <- aov(Sepal.Width~Species, data = iris)
summary(fit)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       2  11.35   5.672   49.16 <2e-16 ***
## Residuals   147  16.96   0.115                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The box plots of sepal width grouped by species suggest that sepal width differs among the 3 flower species. We ran a one-way ANOVA to test whether this difference is statistically significant. With F = 49.16 on 2 and 147 degrees of freedom and a p-value below 2e-16, well under the 5% significance level, we conclude that mean sepal width differs significantly among the 3 species.
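The ANOVA only indicates that at least one species mean differs; it does not say which pairs differ. A possible follow-up, not part of the original analysis, is Tukey's honest significant difference test from base R's `stats` package. The sketch assumes the built-in `datasets::iris` matches the CSV; `flowers`, `fit_sketch`, and `tukey` are illustrative names.

```r
# Assumption: the built-in iris data stands in for the CSV read at the top.
flowers <- datasets::iris

# Same one-way ANOVA model as above, refitted on the stand-in data
fit_sketch <- aov(Sepal.Width ~ Species, data = flowers)

# Pairwise comparisons of species means with 95% family-wise confidence
tukey <- TukeyHSD(fit_sketch, conf.level = 0.95)
tukey
```

Each row of the result gives the estimated difference between two species, its confidence interval, and an adjusted p-value.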

5. BONUS - Place the original csv in a GitHub file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.

url <- "https://raw.githubusercontent.com/jonygeta/iris.csv/master/iris.csv"

iris <- read.csv(url)