In OpenStats Chapter 1, Exercises, Problem 7, there is a reference to Fisher’s iris data. Discuss the solutions to this problem, and then conduct a descriptive analysis of the data which are conveniently available in R. To access the data in R, simply type “iris.” Investigate any additional R libraries that might help support analysis of this data (e.g., psych package). Share your code and analysis in the discussion forum. This is a graded discussion thread. In order to earn full credit, post your initial response (written or video) to Discussion #1 early in the learning week - no later than Wednesday at 11:59 pm EST; then respond to a minimum of two other posts (text only) from classmates by Sunday at 11:59 pm EST. No late posts are accepted.

PROBLEM 7

Sir Ronald Aylmer Fisher was an English statistician, evolutionary biologist, and geneticist who worked on a data set that contained sepal length and width, and petal length and width from three species of iris flowers (setosa, versicolor and virginica). There were 50 flowers from each species in the data set.

  1. How many cases were included in the data?

150 cases were included in the data: 50 flowers from each of the three species.

  2. How many numerical variables are included in the data? Indicate what they are, and if they are continuous or discrete.

There are 4 numerical variables in Fisher’s data: Sepal Length, Sepal Width, Petal Length, and Petal Width. All four are continuous, since a length or width can take any value within a range and is limited only by the precision of the measurement. Length and width are good examples of continuous variables, just like height.

  3. How many categorical variables are included in the data, and what are they? List the corresponding levels (categories).

The only categorical variable is species. There are three levels. The levels are setosa, versicolor, and virginica.
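
The three answers above can be checked directly against the built-in data. This is a minimal sketch using only base R; the variable names (n_cases, per_species, num_vars, cat_vars, species_levels) are my own and not part of the exercise.

```r
# iris ships with base R (datasets package), so nothing needs to be installed
n_cases <- nrow(iris)                               # question 1: number of cases
per_species <- table(iris$Species)                  # cases per species

num_vars <- names(iris)[sapply(iris, is.numeric)]   # question 2: numerical columns
cat_vars <- names(iris)[sapply(iris, is.factor)]    # question 3: categorical columns
species_levels <- levels(iris$Species)              # levels of the categorical variable

n_cases
per_species
num_vars
cat_vars
species_levels
```

Note that is.numeric() only shows that a column is stored as a number. Whether it is continuous or discrete is still a judgment about what was measured.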

DESCRIPTIVE DATA

The first step is to separate the iris data by the categorical variable Species. I did this by creating a subset of the data for each species.

irisSetosa <- subset(iris, Species == "setosa")
irisVersicolor <- subset(iris, Species == "versicolor")
irisVirginica <- subset(iris, Species == "virginica")
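
As an alternative sketch (not what I used for the rest of this post), base R's split() builds all three subsets in one call and returns them as a named list. The name by_species is my own.

```r
# split() returns a named list with one data frame per level of Species
by_species <- split(iris, iris$Species)

names(by_species)          # one list element per species
sapply(by_species, nrow)   # number of rows in each subset

# Each element holds the same rows as the matching subset() call
same_as_subset <- isTRUE(all.equal(by_species$setosa,
                                   subset(iris, Species == "setosa")))
same_as_subset
```

The list form is convenient when the same function has to be applied to every species, since it works directly with lapply() and sapply().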

Next I ran a summary of each subset to get descriptive statistics, including measures of center (mean and median) and the quartiles needed to draw a box plot. The box plots help show the variability and distribution of each variable. Below is the code for the summary tables and box plots (the [,1:4] index restricts the summary to the four continuous numerical variables).

summary(irisSetosa[,1:4])
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100  
##  1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200  
##  Median :5.000   Median :3.400   Median :1.500   Median :0.200  
##  Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246  
##  3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300  
##  Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600
summary(irisVersicolor[,1:4])
##   Sepal.Length    Sepal.Width     Petal.Length   Petal.Width   
##  Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000  
##  1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200  
##  Median :5.900   Median :2.800   Median :4.35   Median :1.300  
##  Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326  
##  3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500  
##  Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800
summary(irisVirginica[,1:4])
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400  
##  1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800  
##  Median :6.500   Median :3.000   Median :5.550   Median :2.000  
##  Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026  
##  3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300  
##  Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500
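# Draw the three box plots side by side; the shared ylim keeps the species comparable
par(mfrow = c(1, 3), mar = c(6, 3, 2, 1))
boxplot(irisSetosa[,1:4], main = "Setosa", ylim = c(0, 8), las = 2)
boxplot(irisVersicolor[,1:4], main = "Versicolor", ylim = c(0, 8), las = 2)
boxplot(irisVirginica[,1:4], main = "Virginica", ylim = c(0, 8), las = 2)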
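
The summary() tables report center and quartiles but not the standard deviation. A compact way to get one table per statistic, sketched here with base R's aggregate(), is shown below. The names species_means and species_sds are my own. The psych package offers describeBy() for a similar grouped summary, but I kept this sketch to base R so it runs without extra installs.

```r
# Mean of every numerical variable, one row per species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)

# Standard deviation of every numerical variable, one row per species
species_sds <- aggregate(. ~ Species, data = iris, FUN = sd)

species_means
species_sds
```

The means in species_means should agree with the Mean rows of the three summary tables above.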

I also wanted to plot every variable against every other variable, which helps me visualize potential relationships between each pair. When I first ran the code all the dots were black, so I colored the data points by species.

plot(iris[1:4], col=iris$Species)
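
To put numbers on what the scatterplot matrix shows, a correlation matrix is a natural companion. This is a minimal sketch with base R's cor(); the names cor_all and sepal_cor_by_species are my own, and the Sepal Length and Sepal Width pair is just one I picked for illustration.

```r
# Pearson correlations between the four numerical variables, all species pooled
cor_all <- cor(iris[1:4])
round(cor_all, 2)

# The same correlation computed within each species for one pair of variables
sepal_cor_by_species <- sapply(
  split(iris[1:4], iris$Species),
  function(d) cor(d$Sepal.Length, d$Sepal.Width)
)
round(sepal_cor_by_species, 2)
```

Comparing the pooled value with the within-species values is a useful check, because pooling the species can give a different picture than looking at one species at a time.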

Looking at the plots, I think there are relationships between certain variables. To be certain, I would plot each pair of variables individually and then run a linear regression for each species to see whether the variables are related. This would take some time, so I won’t do it for every pair, but I will do it for Petal Width and Petal Length, since that is where I see the strongest correlation. Also, although I could run a linear regression for each species individually, I will run just one for all species combined. This will tell me whether there is a statistically significant relationship between Petal Length and Petal Width across all irises rather than within each species. For a more in-depth analysis, I would then run one for each species.

The code below is how I got the graph and the linear regression.

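# Scatterplot of petal width against petal length, points filled by species
plot(iris$Petal.Length, iris$Petal.Width, xlab = 'Petal length', ylab = 'Petal width',
     pch = 21, bg = as.numeric(iris$Species),
     main = 'Petal length vs petal width for Iris data')
# Fit petal width as a linear function of petal length, pooling all species
LR <- lm(formula = iris$Petal.Width ~ iris$Petal.Length)
# Add the fitted regression line to the scatterplot
abline(LR)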

The summary for the linear regression is below.

## Call:
## lm(formula = iris$Petal.Width ~ iris$Petal.Length)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56515 -0.12358 -0.01898  0.13288  0.64272 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -0.363076   0.039762  -9.131  4.7e-16 ***
## iris$Petal.Length  0.415755   0.009582  43.387  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2065 on 148 degrees of freedom
## Multiple R-squared:  0.9271, Adjusted R-squared:  0.9266 
## F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16

Although there is a lot of important information in this summary, the value I am particularly concerned with is R-squared, the coefficient of determination, which indicates how much of the variability in one variable is explained by its linear relationship with another. In this case, about 92.7% of the variability in petal width is explained by petal length (in other words, the two are highly correlated).
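
As a follow-up sketch of the more in-depth analysis mentioned earlier, the code below checks that R-squared in a one-predictor regression equals the squared correlation, and then fits the same model separately for each species. The names fit_all, r2_all, r_all, fits, and r2_by_species are my own, and the model is written with a data argument rather than the iris$ form used above, which changes the coefficient labels but not the fit.

```r
# Pooled model: petal width as a linear function of petal length
fit_all <- lm(Petal.Width ~ Petal.Length, data = iris)
r2_all <- summary(fit_all)$r.squared

# With a single predictor, R-squared is the square of the Pearson correlation
r_all <- cor(iris$Petal.Length, iris$Petal.Width)
all.equal(r2_all, r_all^2)

# One model per species, fitted on each element of the split data
fits <- lapply(split(iris, iris$Species),
               function(d) lm(Petal.Width ~ Petal.Length, data = d))
r2_by_species <- sapply(fits, function(f) summary(f)$r.squared)

round(r2_all, 4)
round(r2_by_species, 3)
```

Comparing r2_by_species with r2_all shows how much of the pooled relationship comes from differences between species rather than from the relationship within any one species.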