This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. Type answers after the -> sign.

library(datasets)
data(iris)   # load the built-in iris data frame (no attach() needed; columns are accessed with $)
str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

What type of variable is Sepal.Length?

-> numerical, continuous

What type of variable is Sepal.Width?

-> numerical, continuous
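
As a quick check on the two answers above, we can ask R how each column is stored. This is only a minimal sketch, and `col_classes` is a name chosen here for illustration. R can tell us that a column is numeric, but deciding that it is continuous rather than discrete still depends on knowing that these are measurements in centimeters.

```r
# Storage class of every column in the built-in iris data frame
col_classes <- sapply(iris, class)
print(col_classes)

# Measurements are stored as numeric; Species is stored as a factor (categorical)
print(is.numeric(iris$Sepal.Length))
print(is.factor(iris$Species))
```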

# From now on, we will only use observations from setosa species. 
setosa <- iris[iris$Species == "setosa", ]

#Get summary statistics for just the sepal lengths
summary(setosa$Sepal.Length)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.300   4.800   5.000   5.006   5.200   5.800

What is the average sepal length (in cm) of the setosa flowers?

-> 5.006

What is the five-number summary for the setosa sepal lengths?

-> Min. = 4.3, 1st Qu. = 4.8, Median = 5.0, 3rd Qu. = 5.2, Max. = 5.8 (all in cm)
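
The five-number summary can also be pulled out on its own, without the mean. The sketch below uses `quantile()` with its default settings, which should agree with the quartiles that `summary()` reports; `five_num` is a name chosen here for illustration.

```r
# Recreate the setosa subset so this chunk runs on its own
setosa <- iris[iris$Species == "setosa", ]

# Minimum, first quartile, median, third quartile, maximum
five_num <- quantile(setosa$Sepal.Length, probs = c(0, 0.25, 0.5, 0.75, 1))
print(five_num)
```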

#Get summary statistics for just the sepal widths
# FINISH THE FOLLOWING CODE BY TYPING setosa$Sepal.Width WITHIN THE PARENTHESES
summary(setosa$Sepal.Width)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.300   3.200   3.400   3.428   3.675   4.400

What is the IQR for the setosa sepal widths?

-> 3.675 - 3.200 = 0.475 cm

Using the interquartile method, are there any outliers?

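-> Lower bound = 3.2 - 1.5 × 0.475 = 2.4875 cm; the minimum (2.3) is below it, so there is at least one low outlier. Upper bound = 3.675 + 1.5 × 0.475 = 4.3875 cm; the maximum (4.4) is above it, so there is at least one high outlier.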
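
The 1.5 × IQR rule can be carried out in R to see exactly which widths fall outside the bounds. This is a minimal sketch; the names `q1`, `q3`, `iqr_width`, `lower_bound`, `upper_bound`, and `outliers` are chosen here for illustration.

```r
# Recreate the setosa subset so this chunk runs on its own
setosa <- iris[iris$Species == "setosa", ]
widths <- setosa$Sepal.Width

# Quartiles and interquartile range (the same value IQR(widths) returns)
q1 <- quantile(widths, 0.25, names = FALSE)
q3 <- quantile(widths, 0.75, names = FALSE)
iqr_width <- q3 - q1

# Bounds of the 1.5 x IQR rule
lower_bound <- q1 - 1.5 * iqr_width
upper_bound <- q3 + 1.5 * iqr_width
print(c(lower = lower_bound, upper = upper_bound))

# Any width outside the bounds is flagged as an outlier
outliers <- widths[widths < lower_bound | widths > upper_bound]
print(outliers)
```

Listing the flagged values directly tells us how many outliers there are on each side, which the summary output alone cannot show.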

#Get standard deviation for just the sepal widths
sd(setosa$Sepal.Width)

## [1] 0.3790644

What is the value of the standard deviation for the sepal widths (round 2 places)?

-> 0.38 cm

What does the standard deviation tell us about the sepal widths?

-> On average, the setosa sepal widths deviate from their mean (3.428 cm) by about 0.38 cm
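
To see where this number comes from, the sample standard deviation can be rebuilt step by step from the deviations around the mean. This is only a sketch of the usual formula (dividing by n - 1); `deviations` and `sd_by_hand` are names chosen here for illustration.

```r
# Recreate the setosa subset so this chunk runs on its own
setosa <- iris[iris$Species == "setosa", ]
widths <- setosa$Sepal.Width

n <- length(widths)
deviations <- widths - mean(widths)               # distance of each width from the mean
sd_by_hand <- sqrt(sum(deviations^2) / (n - 1))   # sample standard deviation

print(sd_by_hand)
print(sd(widths))   # built-in function, for comparison
```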

#Get a histogram for all of the setosa sepal lengths
hist(setosa$Sepal.Length,    #Run histogram function on Sepal Lengths
     main="Histogram of Setosa Sepal Lengths",    #Add title
     xlab="Sepal Lengths",    #Add x-axis label
     border="thistle4",    #Color of bin outlines
     col="hotpink",    #Bin colors
     las=1)    #Draw axis tick labels horizontally

#Get a histogram for all of the setosa sepal WIDTHs
# FINISH THE FOLLOWING CODE BY TYPING setosa$Sepal.Width AFTER hist(
hist(setosa$Sepal.Width,   #Run histogram function on Sepal Widths
     main="Histogram of Setosa Sepal Widths",    #Add title
     xlab="Sepal Widths",    #Add x-axis label
     border="thistle4",    #Color of bin outlines
     col="hotpink",    #Bin colors
     las=1)    #Draw axis tick labels horizontally

What do we notice about these histograms?

-> The distributions of sepal lengths and sepal widths are both unimodal and roughly symmetric

Next, we will create a scatterplot to visualize the relationship between sepal length and sepal width.

plot(Sepal.Length ~ Sepal.Width, data = setosa,
     xlab = "Sepal Widths",
     ylab = "Sepal Lengths",
     main = "Setosa Sepal Length vs. Sepal Width",
     pch  = 20,
     cex  = 2,
     col  = "lightpink")

What do you notice from the scatterplot?

-> We have a positive linear relationship and one outlier.

We can also calculate the Pearson correlation for these two quantitative variables. We will use the function cor() with the method option set to “pearson”.

cor(setosa$Sepal.Length, setosa$Sepal.Width, method = "pearson")

## [1] 0.7425467

What is the correlation coefficient between sepal length and sepal width of setosa?

-> 0.7425467

What is the interpretation of the correlation coefficient?

-> We have a strong positive linear relationship between sepal length and sepal width (r ≈ 0.74).
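
One way to see what the correlation coefficient measures is to compute it from z-scores: standardize both variables, multiply them pairwise, and divide the sum by n - 1. This is a minimal sketch; `z_x`, `z_y`, and `r_by_hand` are names chosen here for illustration.

```r
# Recreate the setosa subset so this chunk runs on its own
setosa <- iris[iris$Species == "setosa", ]
x <- setosa$Sepal.Width
y <- setosa$Sepal.Length

# Standardize each variable (z-scores)
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# Pearson correlation as the sum of z-score products divided by n - 1
r_by_hand <- sum(z_x * z_y) / (length(x) - 1)

print(r_by_hand)
print(cor(x, y, method = "pearson"))   # built-in function, for comparison
```

Because the calculation uses products of z-scores, swapping the two variables gives the same value of r.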

Now that we have calculated the correlation coefficient, we can build a simple linear regression model to predict the sepal length of a setosa flower from its sepal width. In this case, we will treat sepal length as the response variable and sepal width as the explanatory variable.

We use the lm() function to build the model.

# Fit the linear regression with sepal length as the response variable and sepal width as the explanatory variable, using the data frame setosa.
# FINISH THE FOLLOWING CODE BY TYPING Sepal.Length AFTER (, TYPING Sepal.Width AFTER ~, AND TYPING setosa AFTER data =
model <- lm(Sepal.Length ~ Sepal.Width, data = setosa)
summary(model)

## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = setosa)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52476 -0.16286  0.02166  0.13833  0.44428 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.6390     0.3100   8.513 3.74e-11 ***
## Sepal.Width   0.6905     0.0899   7.681 6.71e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2385 on 48 degrees of freedom
## Multiple R-squared:  0.5514, Adjusted R-squared:  0.542 
## F-statistic: 58.99 on 1 and 48 DF,  p-value: 6.71e-10

What is the intercept of the fitted linear regression model? How do you interpret it?

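-> b0 = 2.64. When sepal width is 0 cm, the predicted sepal length is 2.64 cm. A width of 0 cm is far outside the observed widths (2.3 to 4.4 cm), so the intercept has no practical meaning here.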

What is the slope of the fitted linear regression model? How do you interpret it?

-> b1 = 0.69. For every 1 cm increase in sepal width, the predicted sepal length increases by 0.69 cm on average.

What is the linear regression equation for these two variables?

-> predicted sepal length = 2.64 + 0.69 × (sepal width)
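
The slope and intercept reported by lm() can also be recovered from summary statistics, using slope = r × (sd of y / sd of x) and intercept = mean of y - slope × mean of x. The sketch below is only a check on the fitted equation; `b0` and `b1` are names chosen here for illustration.

```r
# Recreate the setosa subset and the model so this chunk runs on its own
setosa <- iris[iris$Species == "setosa", ]
x <- setosa$Sepal.Width    # explanatory variable
y <- setosa$Sepal.Length   # response variable
model <- lm(Sepal.Length ~ Sepal.Width, data = setosa)

# Least-squares slope and intercept from summary statistics
b1 <- cor(x, y) * sd(y) / sd(x)
b0 <- mean(y) - b1 * mean(x)

print(c(intercept = b0, slope = b1))
print(coef(model))   # coefficients from lm(), for comparison
```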

What is the R-Squared Value (coefficient of determination)? Show the relationship between correlation coefficient and R-squared value.

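-> R-squared = 0.5514, or about 0.55. It is the square of the correlation coefficient: r^2 = (0.7425467)^2 ≈ 0.5514.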

What is the name for R-Squared value? How do you interpret it?

-> Coefficient of determination. About 55% of the variation in setosa sepal lengths is explained by the linear relationship with sepal width.
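
The link between r and R-squared can be verified in R in three ways: squaring the correlation, reading the value from the model summary, and computing 1 - SSE/SST from the residuals. This is a minimal sketch; the variable names are chosen here for illustration.

```r
# Recreate the setosa subset and the model so this chunk runs on its own
setosa <- iris[iris$Species == "setosa", ]
model <- lm(Sepal.Length ~ Sepal.Width, data = setosa)
y <- setosa$Sepal.Length

# 1) Square of the Pearson correlation
r <- cor(setosa$Sepal.Length, setosa$Sepal.Width)
r_squared_from_r <- r^2

# 2) Value stored in the model summary
r_squared_from_model <- summary(model)$r.squared

# 3) One minus unexplained variation over total variation
sse <- sum(resid(model)^2)      # sum of squared residuals
sst <- sum((y - mean(y))^2)     # total sum of squares
r_squared_from_ss <- 1 - sse / sst

print(c(r_squared_from_r, r_squared_from_model, r_squared_from_ss))
```

The equality between r^2 and R-squared holds for simple linear regression with one explanatory variable, which is the case here.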

Next, we can generate a residual plot to examine the fitted model. We access the residuals with the $ operator and plot them against sepal width.

plot(setosa$Sepal.Width, model$residuals)
abline(0, 0)

What do you think of this residual plot? Is there anything that flags a possible problem?

-> We do not see any pattern in the residuals. We do have one outlier, which could flag a possible problem.
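
To follow up on a suspicious point, we can look up the observation with the largest residual in absolute value. This is only a sketch, with `res` and `worst` as names chosen here for illustration; the flower it returns is the one the line fits worst, which may or may not be the point that stands out visually in the plot.

```r
# Recreate the setosa subset and the model so this chunk runs on its own
setosa <- iris[iris$Species == "setosa", ]
model <- lm(Sepal.Length ~ Sepal.Width, data = setosa)

res <- resid(model)            # same values as model$residuals
worst <- which.max(abs(res))   # position of the largest absolute residual

# Show that flower's measurements and its residual
print(setosa[worst, c("Sepal.Length", "Sepal.Width")])
print(res[worst])
```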

Next, it would be nice to add the fitted line to the scatterplot. To do so we will use the abline() function.

# Draw the scatterplot with sepal length as the response variable and sepal width as the explanatory variable for the data frame setosa.
plot(Sepal.Length ~ Sepal.Width, data = setosa,
     xlab = "Sepal Widths",
     ylab = "Sepal Lengths",
     main = "Setosa Sepal Length vs. Sepal Width",
     pch  = 20,
     cex  = 2,
     col  = "hotpink")
abline(model, lwd = 3, col = "purple")

Now that we have the fitted line, we can use the linear regression model to predict sepal length from sepal width.

Method 1: Calculate with linear regression equation.

Predict the sepal length for a setosa iris flower with a sepal width of 3.5 cm using the linear regression equation.

-> 2.64 + 0.69 × 3.5 = 5.055, or about 5.06 cm
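
The Method 1 calculation can be reproduced in R by plugging 3.5 into the equation, once with the rounded coefficients and once with the unrounded coefficients stored in the model. This is a minimal sketch; `rounded_prediction` and `exact_prediction` are names chosen here for illustration.

```r
# Recreate the setosa subset and the model so this chunk runs on its own
setosa <- iris[iris$Species == "setosa", ]
model <- lm(Sepal.Length ~ Sepal.Width, data = setosa)

# Using the rounded equation: predicted length = 2.64 + 0.69 * width
rounded_prediction <- 2.64 + 0.69 * 3.5

# Using the unrounded coefficients from the fitted model
b <- coef(model)
exact_prediction <- b[["(Intercept)"]] + b[["Sepal.Width"]] * 3.5

print(rounded_prediction)
print(exact_prediction)
```

The two values differ slightly only because the equation uses coefficients rounded to two decimal places.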

Method 2: Use the R function predict() to predict.

# code to predict sepal length for a flower with sepal width at 3.5 cm.
predict(model, newdata = data.frame(Sepal.Width = 3.5))

##        1 
## 5.055715

# FINISH THE FOLLOWING CODE BY TYPING 5 AFTER =
predict(model, newdata = data.frame(Sepal.Width = 5))

##       1 
## 6.09145

Predict the sepal length for a setosa iris flower with sepal width at 5 cm using R output above.

-> 6.09 cm
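
Before trusting a prediction, it helps to check whether the new width lies inside the range of widths used to fit the model. The sketch below compares 5 cm with the observed range and also asks predict() for a 95% prediction interval; `new_flower`, `width_range`, and `is_extrapolation` are names chosen here for illustration.

```r
# Recreate the setosa subset and the model so this chunk runs on its own
setosa <- iris[iris$Species == "setosa", ]
model <- lm(Sepal.Length ~ Sepal.Width, data = setosa)

new_flower <- data.frame(Sepal.Width = 5)

# Is the new width outside the observed widths?
width_range <- range(setosa$Sepal.Width)
is_extrapolation <- new_flower$Sepal.Width < width_range[1] |
  new_flower$Sepal.Width > width_range[2]
print(width_range)
print(is_extrapolation)

# Point prediction with a 95% prediction interval (columns fit, lwr, upr)
pred <- predict(model, newdata = new_flower, interval = "prediction", level = 0.95)
print(pred)
```

Since 5 cm is wider than any setosa sepal in the data, the 6.09 cm prediction is an extrapolation and should be read with caution.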

Change the author name in the YAML to your name, save your file, knit it to a PDF, name the PDF ‘Filename_yourname’, and then submit it to the D2L assignment folder.

Scatterplots and Regression

Cali Player

Oct 14, 2024
