# Loads the Iris Dataset! Answer the questions below.
data("iris") # Loads "iris" data
head(iris) # Views the first six rows of the iris data
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# To draw a box plot of a numeric variable grouped by a categorical variable, you can use this code
# boxplot(dependent variable ~ independent variable)
boxplot(iris$Sepal.Length ~ iris$Species)
Figure 1: Boxplot of sepal length (cm) by iris species in the iris dataset.
Raw data are the values we collect, with nothing done to them. Transformed data are the raw values after we apply a function such as a log or square root, usually so that the points follow the Q-Q line more closely and the distribution is closer to normal.
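As a minimal sketch of the difference between raw and transformed data, the code below draws Q-Q plots of one variable before and after a log transformation. `Petal.Length` is only an illustrative choice, and the names `raw_petal` and `log_petal` are made up for this example. Whether the transformation actually brings the points closer to the line has to be judged from the plots; a log transform does not guarantee normality.

```r
# Sketch: compare a raw variable with its log-transformed version.
# Petal.Length is an assumed, illustrative choice of variable.
data("iris")
raw_petal <- iris$Petal.Length
log_petal <- log(raw_petal)  # natural log; all petal lengths are positive

par(mfrow = c(1, 2))  # two plots side by side
qqnorm(raw_petal, main = "Raw Petal.Length")
qqline(raw_petal)
qqnorm(log_petal, main = "log(Petal.Length)")
qqline(log_petal)
par(mfrow = c(1, 1))  # reset the plotting layout
```

Swapping `log()` for `sqrt()` gives the square root version of the same comparison.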
The data points are dependent because the measurements are repeated on the same plants each time. The plants also all belong to the same species, which adds another source of correlation. This means that a standard ANOVA is not appropriate, but more complex methods that account for the dependence can still help us get the answers.
I can tell by looking at the Q-Q line in the middle of each plot and checking which plot has the most points close to the line. In the first graph (Sepal Length), almost all the points fall on the line. In the second graph (Sepal Width), the points are close to the line, but most are not exactly on it. The third graph (Petal Length) is way off because the points form a sort of “S” shape. The fourth graph (Petal Width) is similar, but its points are closer to the line than in the third graph.
qqnorm(iris$Sepal.Length) # This one is the closest to a normal distribution.
qqline(iris$Sepal.Length)
Figure 1: Iris Sepal Length.
qqnorm(iris$Sepal.Width)
qqline(iris$Sepal.Width)
Figure 2: Iris Sepal Width.
qqnorm(iris$Petal.Length) # This one is the farthest from a normal distribution.
qqline(iris$Petal.Length)
Figure 3: Iris Petal Length.
qqnorm(iris$Petal.Width)
qqline(iris$Petal.Width)
Figure 4: Iris Petal Width.
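The four Q-Q plots above are easier to compare when drawn in one grid. The sketch below does that with a loop and, as an optional numeric complement to reading the plots by eye, also collects a Shapiro-Wilk p-value for each variable. The names `numeric_vars` and `shapiro_p` are made up for this example, and the Shapiro-Wilk test is an addition here, not part of the original assignment.

```r
# Sketch: draw all four Q-Q plots in a 2 x 2 grid for side-by-side comparison.
data("iris")
numeric_vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

par(mfrow = c(2, 2))
for (v in numeric_vars) {
  qqnorm(iris[[v]], main = v)
  qqline(iris[[v]])
}
par(mfrow = c(1, 1))  # reset the plotting layout

# Optional: Shapiro-Wilk p-value per variable.
# Small p-values suggest a departure from normality.
shapiro_p <- sapply(iris[numeric_vars], function(x) shapiro.test(x)$p.value)
print(shapiro_p)
```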
The first assumption is independence: the observations, and the way the experiment is set up, should not make one data point depend on another, so there should be no correlations between observations. The second assumption is normality, which means that the data are normally distributed around the mean. To check this, we draw a Q-Q plot, see how close the points are to the Q-Q line, and make adjustments from there. The last assumption is homoscedasticity, which means that the variances are equal, so the spread of the data is the same in each group. I would say yes, because the data pass all three assumptions.
hist(iris$Sepal.Width)
Figure 5: Iris sepal width vs other species.
boxplot(iris$Sepal.Width~iris$Species)
Figure 5: Iris sepal width vs other species.
qqnorm(iris$Sepal.Width) # test for normality
qqline(iris$Sepal.Width)
Figure 5: Iris sepal width vs other species.
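The equal-variance and normality assumptions can also be checked with formal tests alongside the plots. This is a minimal sketch, assuming sepal width is compared across species with a one-way ANOVA; the names `bart` and `fit` are made up for this example. Bartlett's test is itself sensitive to non-normal data, so its result is best read together with the boxplot.

```r
# Sketch: numeric checks to accompany the histogram, boxplot, and Q-Q plot.
data("iris")

# Homoscedasticity: Bartlett's test of equal variances across species.
bart <- bartlett.test(Sepal.Width ~ Species, data = iris)
print(bart)

# Fit a one-way ANOVA (assumed model) and check the residuals for normality.
fit <- aov(Sepal.Width ~ Species, data = iris)
qqnorm(residuals(fit), main = "ANOVA residuals")
qqline(residuals(fit))

# ANOVA table; interpret it only if the assumptions look reasonable.
print(summary(fit))
```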
Please turn in your homework via Sakai by saving and submitting an R Markdown PDF or HTML file from RPubs!