In OpenStats Chapter 1, Exercises, Problem 7, there is a reference to Fisher’s iris data. Discuss the solutions to this problem, and then conduct a descriptive analysis of the data which are conveniently available in R. To access the data in R, simply type “iris.” Investigate any additional R libraries that might help support analysis of this data (e.g., psych package). Share your code and analysis in the discussion forum. This is a graded discussion thread. In order to earn full credit, post your initial response (written or video) to Discussion #1 early in the learning week - no later than Wednesday at 11:59 pm EST; then respond to a minimum of two other posts (text only) from classmates by Sunday at 11:59 pm EST. No late posts are accepted.

DESCRIPTIVE DATA

The first thing I did was separate the iris data by the categorical variable Species. I did this by making a subset of the data for each species.

irisSetosa <- subset(iris, Species == "setosa")
irisVersicolor <- subset(iris, Species == "versicolor")
irisVirginica <- subset(iris, Species == "virginica")
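As an alternative sketch of the same step, base R's split() can produce all three subsets at once as a named list. The name irisBySpecies is chosen here only for illustration and is not part of the analysis above.

```r
# Split the built-in iris data frame into a named list,
# one data frame per level of Species
irisBySpecies <- split(iris, iris$Species)

# List the element names and the number of rows in each piece
names(irisBySpecies)
sapply(irisBySpecies, nrow)
```

Each element, for example irisBySpecies$setosa, holds the same rows as the matching subset() call.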

After this I ran a summary of each subset to get descriptive statistics, including measures of center such as the mean and median, as well as the quartiles needed to make box plots. The box plots help show variability and distribution. Below is the code used for the summary tables and box plots (the [,1:4] index restricts the summary to the four continuous numeric variables and leaves out Species).

summary(irisSetosa[, 1:4])

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100  
##  1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200  
##  Median :5.000   Median :3.400   Median :1.500   Median :0.200  
##  Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246  
##  3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300  
##  Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600

summary(irisVersicolor[, 1:4])

##   Sepal.Length    Sepal.Width     Petal.Length   Petal.Width   
##  Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000  
##  1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200  
##  Median :5.900   Median :2.800   Median :4.35   Median :1.300  
##  Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326  
##  3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500  
##  Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800

summary(irisVirginica[, 1:4])

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400  
##  1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800  
##  Median :6.500   Median :3.000   Median :5.550   Median :2.000  
##  Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026  
##  3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300  
##  Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500
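The three summaries above can also be condensed into one table per statistic. The following is a minimal sketch using base R's aggregate(); the names speciesMeans and speciesSDs are illustrative. The last step assumes the psych package may or may not be installed, so it is guarded and skipped if the package is missing.

```r
# Mean of each numeric column within each species (base R)
speciesMeans <- aggregate(. ~ Species, data = iris, FUN = mean)
print(speciesMeans)

# Standard deviations, which summary() does not report
speciesSDs <- aggregate(. ~ Species, data = iris, FUN = sd)
print(speciesSDs)

# Optional: psych::describeBy reports n, mean, sd, median, skew, and more
# for each group. This only runs if the psych package is installed.
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describeBy(iris[, 1:4], group = iris$Species))
}
```

The means in speciesMeans should agree with the Mean rows of the summary tables, which gives a quick consistency check.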

par(mfrow=c(1,3),mar=c(6,3,2,1))
boxplot(irisSetosa[,1:4], main="Setosa",ylim = c(0,8),las=2)
boxplot(irisVersicolor[,1:4], main="Versicolor",ylim = c(0,8),las=2)
boxplot(irisVirginica[,1:4], main="Virginica",ylim = c(0,8),las=2)
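To see which observations the box plots would draw as individual outlier points, one option is boxplot.stats(), which applies the same 1.5 * IQR rule that boxplot() uses by default. This is only a sketch for the setosa subset; the names setosa and setosaOutliers are illustrative.

```r
# Rebuild the setosa subset so this snippet runs on its own
setosa <- subset(iris, Species == "setosa")

# For each of the four measurements, collect the values that lie more than
# 1.5 * IQR beyond the hinges (the points boxplot() plots separately)
setosaOutliers <- lapply(setosa[, 1:4], function(x) boxplot.stats(x)$out)
print(setosaOutliers)
```

The same lapply() call can be repeated on the versicolor and virginica subsets.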

One other thing I wanted to do was plot every variable against every other variable, which helps me visualize potential relationships between them. (When I first ran the code all the dots were black, so I colored the data points by species.)

plot(iris[1:4], col=iris$Species)
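The scatterplot matrix can be backed up with numbers. A minimal sketch, using cor() on the four measurements pooled across all species (the name irisCor is illustrative):

```r
# Pairwise Pearson correlations for the four measurements,
# computed over all 150 flowers regardless of species
irisCor <- cor(iris[, 1:4])
print(round(irisCor, 2))
```

Because the species are pooled, a high correlation here can partly reflect differences between species rather than a relationship within any one species.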

Looking at the plots, I think there are relationships between certain variables. To be certain, I would plot each pair of variables individually and then run a linear regression for each species to see whether the variables are related. Doing this for every pair would take some time, so I will only do it for petal width and petal length, where I see the strongest correlation. Also, while I could run a linear regression for each species individually, I will run just one across all species. This tells me whether there is a statistically significant linear relationship between petal length and petal width for irises overall rather than for each species. For a more in-depth analysis, I would then run one for each species.

The code below is how I got the graph and the linear regression.

plot(iris$Petal.Length, iris$Petal.Width, xlab = 'Petal length', ylab = 'Petal width',
pch = 21, bg = as.numeric(iris$Species),
main = 'Petal length vs petal width for Iris data')
LR <- lm(formula = iris$Petal.Width ~ iris$Petal.Length)
abline(LR)

The summary for the linear regression is below.

## Call:
## lm(formula = iris$Petal.Width ~ iris$Petal.Length)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.56515 -0.12358 -0.01898  0.13288  0.64272
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)       -0.363076   0.039762  -9.131  4.7e-16 ***
## iris$Petal.Length  0.415755   0.009582  43.387  < 2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 0.2065 on 148 degrees of freedom
## Multiple R-squared:  0.9271, Adjusted R-squared:  0.9266
## F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16

Although there is a lot of important information in this summary, the value I am most concerned with is R-squared, the coefficient of determination, which gives the proportion of the variability in the response that is explained by its linear relationship with the predictor. In this case, about 92.7% of the variability in petal width is explained by petal length (or, put another way, the two are highly correlated).
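The per-species follow-up mentioned earlier could look roughly like the sketch below. It refits the pooled model with a data argument, checks that R-squared equals the squared correlation (true for any regression with a single predictor), and then fits the same model within each species. The names pooledFit, pooledR2, and speciesR2 are illustrative.

```r
# Pooled fit, written with a data argument instead of iris$ prefixes
pooledFit <- lm(Petal.Width ~ Petal.Length, data = iris)
pooledR2 <- summary(pooledFit)$r.squared

# With one predictor, R-squared is the square of the Pearson correlation
all.equal(pooledR2, cor(iris$Petal.Length, iris$Petal.Width)^2)

# Fit the same model separately within each species and keep only R-squared
speciesR2 <- sapply(split(iris, iris$Species), function(d) {
  summary(lm(Petal.Width ~ Petal.Length, data = d))$r.squared
})
print(round(c(pooled = pooledR2, speciesR2), 3))
```

If the within-species values turn out much lower than the pooled one, that would suggest the pooled fit is driven largely by differences between species.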

Discussion 1

James Lunga

2/2/2021
