In OpenStats Chapter 1, Exercises, Problem 7, there is a reference to Fisher’s iris data. Discuss the solutions to this problem, and then conduct a descriptive analysis of the data which are conveniently available in R. To access the data in R, simply type “iris.” Investigate any additional R libraries that might help support analysis of this data (e.g., psych package). Share your code and analysis in the discussion forum. This is a graded discussion thread. In order to earn full credit, post your initial response (written or video) to Discussion #1 early in the learning week - no later than Wednesday at 11:59 pm EST; then respond to a minimum of two other posts (text only) from classmates by Sunday at 11:59 pm EST. No late posts are accepted.
Sir Ronald Aylmer Fisher was an English statistician, evolutionary biologist, and geneticist who worked on a data set that contained sepal length and width, and petal length and width from three species of iris flowers (setosa, versicolor and virginica). There were 50 flowers from each species in the data set.
The data include 150 cases: 50 flowers from each of the three species.
There are 4 numerical variables in Sir Fisher’s data set: sepal length, sepal width, petal length, and petal width. These variables are all continuous because a measurement can take any value within a range, limited only by the precision of the measuring tool. Length and width are good examples of continuous variables, just like height.
The only categorical variable is species, which has three levels: setosa, versicolor, and virginica.
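As a quick check of the answers above, here is a minimal sketch in base R that counts the cases, the variable types, and the flowers per species. The object names (n_cases, var_classes, species_counts) are my own and not part of the exercise.

```r
# iris ships with base R (datasets package), so no library() call is needed
n_cases <- nrow(iris)                    # number of cases (rows)
var_classes <- sapply(iris, class)       # class of each column
species_counts <- table(iris$Species)    # flowers per species

print(n_cases)
print(var_classes)
print(species_counts)
str(iris)                                # compact overview of the whole data frame
```

The four measurement columns should come back as numeric and Species as a factor, which matches the numerical/categorical split described above.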
The first thing I did was find a way to separate the iris data by the categorical variable species. I did this by making a subset of the data for each species.
irisSetosa <- subset(iris, Species == "setosa")
irisVersicolor <- subset(iris, Species == "versicolor")
irisVirginica <- subset(iris, Species == "virginica")
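An alternative to writing one subset() call per species is split(), which returns a named list with one data frame per level of the factor. This is only a sketch of another way to get the same three groups; the name iris_by_species is my own.

```r
# split() divides the data frame by the levels of Species
iris_by_species <- split(iris, iris$Species)

names(iris_by_species)           # one list element per species
sapply(iris_by_species, nrow)    # number of rows in each element

# Individual groups are accessed by name, e.g. the setosa flowers:
head(iris_by_species$setosa)
```

Keeping the groups in one list makes it easier to apply the same function to every species with lapply() or sapply().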
After this I ran a summary of each subset to get descriptive statistics such as the measures of center (mean and median), along with the quartiles I would need to make box plots. The box plots help show variability and distribution. Below is the code used for making the summary tables and box plots (the [,1:4] index restricts the summary to the four continuous numeric variables and leaves out Species).
list (summary(irisSetosa[,1:4]))
## [[1]]
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100
## 1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200
## Median :5.000 Median :3.400 Median :1.500 Median :0.200
## Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
## 3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
## Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
list (summary(irisVersicolor[,1:4]))
## [[1]]
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000
## 1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200
## Median :5.900 Median :2.800 Median :4.35 Median :1.300
## Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
## 3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
## Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
list (summary(irisVirginica[,1:4]))
## [[1]]
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400
## 1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800
## Median :6.500 Median :3.000 Median :5.550 Median :2.000
## Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
## 3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
## Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
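summary() reports the center and the quartiles but not the standard deviation. A minimal sketch of getting the per-species mean and standard deviation in one step with aggregate() is below. The object names are my own, and the psych call is optional: it only runs if the psych package happens to be installed.

```r
# Mean and standard deviation of every measurement, grouped by species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
species_sds   <- aggregate(. ~ Species, data = iris, FUN = sd)

print(species_means)
print(species_sds)

# Optional: psych::describeBy() gives a fuller table (n, mean, sd, median,
# skew, kurtosis, ...) per group. Guarded so the script still runs without psych.
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describeBy(iris[, 1:4], group = iris$Species))
}
```

The means from aggregate() should agree with the Mean rows in the summary tables above.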
par(mfrow=c(1,3),mar=c(6,3,2,1))
boxplot(irisSetosa[,1:4], main="Setosa",ylim = c(0,8),las=2)
boxplot(irisVersicolor[,1:4], main="Versicolor",ylim = c(0,8),las=2)
boxplot(irisVirginica[,1:4], main="Virginica",ylim = c(0,8),las=2)
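The three panels above compare the four variables within one species. To compare the species directly on a single variable, the grouping can be flipped. The sketch below draws one panel per measurement with the three species side by side; measure_names and box_stats are names I made up, and the "cm" axis label relies on the iris measurements being recorded in centimeters.

```r
measure_names <- names(iris)[1:4]

# 2 x 2 grid: one panel per measurement, three boxes (species) per panel
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
box_stats <- lapply(measure_names, function(v) {
  boxplot(split(iris[[v]], iris$Species), main = v,
          xlab = "Species", ylab = "cm")
})
names(box_stats) <- measure_names

# boxplot() also returns the numbers it drew; $out holds the points flagged as outliers
box_stats$Petal.Length$out
```

This layout makes it easier to see which measurements separate the species and which overlap.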
One other thing I wanted to do was plot the variables against each other. A scatterplot matrix shows every pair of variables at once, which helps me visualize potential relationships between them. When I first ran the code all the dots were black, so I colored the data points by species.
plot(iris[1:4], col=iris$Species)
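To put numbers on what the scatterplot matrix shows, a correlation matrix of the four measurements can be computed with cor(). This is a rough sketch with my own object names. It also computes the matrix within each species, since pooling three species can make a relationship look stronger or weaker than it is inside any one group.

```r
# Pearson correlations between the four measurements, all species pooled
cor_all <- cor(iris[, 1:4])
print(round(cor_all, 2))

# The same matrix computed separately for each species
cor_by_species <- lapply(split(iris[, 1:4], iris$Species), cor)
print(lapply(cor_by_species, round, 2))
```

Pairs with a coefficient near 1 or -1 in cor_all are the ones worth following up with individual plots.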
Looking at the plots, I think there are relationships between certain variables. To be certain, I would plot each pair of variables individually and then run a linear regression for each species to see whether the variables are related. This would take some time, so I won’t do it for every pair, but I will do it for Petal Width and Petal Length since that is where I see the strongest correlation. Also, while I could run a linear regression for each species individually, I will run just one on all species combined. This will tell me whether there is a statistically significant relationship between Petal Length and Petal Width across all irises in the data rather than within each individual species. If I wanted a more in-depth analysis, I would then run one for each species.
The code below is how I got the graph and the linear regression.
plot(iris$Petal.Length, iris$Petal.Width, xlab = 'Petal length', ylab = 'Petal width',
pch = 21, bg = as.numeric(iris$Species),
main = 'Petal length vs petal width for Iris data')
LR <- lm(formula = iris$Petal.Width ~ iris$Petal.Length)
abline(LR)
The summary for the linear regression is below.
Call:
lm(formula = iris$Petal.Width ~ iris$Petal.Length)

Residuals:
     Min       1Q   Median       3Q      Max
-0.56515 -0.12358 -0.01898  0.13288  0.64272

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       -0.363076   0.039762  -9.131  4.7e-16 ***
iris$Petal.Length  0.415755   0.009582  43.387  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2065 on 148 degrees of freedom
Multiple R-squared: 0.9271, Adjusted R-squared: 0.9266
F-statistic: 1882 on 1 and 148 DF, p-value: < 2.2e-16
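The individual numbers in this output can also be pulled out of the fitted model instead of being read off the printout. The sketch below refits the same model using the data = argument (a stylistic choice on my part; the fit is identical) and checks that, for a regression with one predictor, R-squared is the square of the Pearson correlation.

```r
# Same model as above, written with data = so the coefficient names are shorter
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
fit_summary <- summary(fit)

r_squared <- fit_summary$r.squared                   # coefficient of determination
r <- cor(iris$Petal.Length, iris$Petal.Width)        # Pearson correlation

print(coef(fit))                   # intercept and slope
print(r_squared)
print(all.equal(r_squared, r^2))   # TRUE for simple linear regression
```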
Although there is a lot of important information in this summary, the value I am particularly concerned with is R-squared. It is the coefficient of determination, which gives the proportion of variability in the response that is explained by its linear relationship with the predictor. In this case, about 92.7% of the variability in petal width can be explained by petal length (in other words, the two are highly correlated). This shows a strong association between the two measurements, not that one causes the other.
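For the more in-depth follow-up, the same regression can be fit once per species. This is a minimal sketch, assuming the same Petal.Width ~ Petal.Length model is appropriate within each species; fits_by_species, r2_by_species, and slopes are names I chose for illustration.

```r
# Fit Petal.Width ~ Petal.Length separately within each species
fits_by_species <- lapply(split(iris, iris$Species), function(d) {
  lm(Petal.Width ~ Petal.Length, data = d)
})

# Pull out R-squared and the slope from each fit for a side-by-side comparison
r2_by_species <- sapply(fits_by_species, function(f) summary(f)$r.squared)
slopes <- sapply(fits_by_species, function(f) coef(f)[["Petal.Length"]])

print(round(r2_by_species, 3))
print(round(slopes, 3))
```

If the within-species R-squared values come out well below the pooled value, that would suggest much of the pooled relationship reflects differences between species rather than a trend within each one.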