This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. Type answers after the -> sign.
library(datasets)
data(iris) # Load the built-in iris data set (attach() is not needed; columns are accessed with $ or data =)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
-> numerical, continuous
-> numerical, continuous
# From now on, we will only use observations from the setosa species.
setosa <- iris[iris$Species == "setosa", ]
#Get summary statistics for just the sepal lengths
summary(setosa$Sepal.Length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 4.800 5.000 5.006 5.200 5.800
-> 5.006
-> Min. = 4.3, 1st Qu. = 4.8, Median = 5.0, 3rd Qu. = 5.2, Max. = 5.8
#Get summary statistics for just the sepal widths
# FINISH THE FOLLOWING CODE BY TYPING setosa$Sepal.Width WITHIN THE PARENTHESES
summary(setosa$Sepal.Width)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.300 3.200 3.400 3.428 3.675 4.400
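-> IQR = 3.675 - 3.200 = 0.475 cm
-> LB = 3.2 - 1.5 * 0.475 = 2.4875. The minimum (2.3) is below LB, so we have a lower outlier. UB = 3.675 + 1.5 * 0.475 = 4.3875, about 4.39. The maximum (4.4) is above UB, so we have an upper outlier.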
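The fence arithmetic above can be checked in R. This is a minimal sketch, assuming the 1.5 * IQR rule and R's default quantile method (the one summary() reports); the names q1, q3, lower_fence, upper_fence and outliers are illustrative and not part of the original lab.

```r
# Sketch: 1.5 * IQR outlier fences for setosa sepal widths
data(iris)
setosa <- iris[iris$Species == "setosa", ]
w <- setosa$Sepal.Width

q1 <- unname(quantile(w, 0.25))   # first quartile, same value summary() shows
q3 <- unname(quantile(w, 0.75))   # third quartile
iqr <- q3 - q1                    # interquartile range

lower_fence <- q1 - 1.5 * iqr     # values below this are flagged
upper_fence <- q3 + 1.5 * iqr     # values above this are flagged

# Keep only the observations that fall outside the fences
outliers <- w[w < lower_fence | w > upper_fence]

print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
```

Note that boxplot() uses hinges rather than these quartiles, so a box plot may flag a slightly different set of points.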
#Get standard deviation for just the sepal widths
sd(setosa$Sepal.Width)
## [1] 0.3790644
-> 0.38 cm
-> On average, the setosa sepal widths deviate from the mean by about 0.38 cm.
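To see where that number comes from, the sample standard deviation can be rebuilt by hand. This is a sketch, assuming the usual n - 1 denominator that sd() uses; the name manual_sd is illustrative.

```r
# Sketch: sample standard deviation of setosa sepal widths, computed by hand
data(iris)
setosa <- iris[iris$Species == "setosa", ]
w <- setosa$Sepal.Width

n <- length(w)
deviations <- w - mean(w)                         # distance of each width from the mean
manual_sd <- sqrt(sum(deviations^2) / (n - 1))    # square, average over n - 1, take the root

print(manual_sd)
print(sd(w))   # built-in function, for comparison
```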
#Get a histogram for all of the setosa sepal lengths
hist(setosa$Sepal.Length, #Run histogram function on Sepal Lengths
main="Histogram of Setosa Sepal Lengths", #Add title
xlab="Sepal Lengths", #Add x-axis label
border="thistle4", #Color of bin outlines
col="hotpink", #Bin colors
las=1) #Draw axis tick labels horizontally
#Get a histogram for all of the setosa sepal WIDTHs
# FINISH THE FOLLOWING CODE BY TYPING setosa$Sepal.Width AFTER hist(
hist(setosa$Sepal.Width, #Run histogram function on Sepal Widths
main="Histogram of Setosa Sepal Widths", #Add title
xlab="Sepal Widths", #Add x-axis label
border="thistle4", #Color of bin outlines
col="hotpink", #Bin colors
las=1) #Draw axis tick labels horizontally
Next, we will create a scatterplot to visualize the relationship between sepal length and sepal width.
plot(Sepal.Length ~ Sepal.Width, data = setosa,
xlab = "Sepal Widths",
ylab = "Sepal Lengths",
main = "Setosa Sepal Length vs Sepal Width",
pch = 20,
cex = 2,
col = "lightpink")
-> We have a positive linear relationship and one outlier.
We can also calculate the Pearson correlation for these two
quantitative variables. We will use the function cor() with the
method option set to “pearson”.
cor(setosa$Sepal.Length, setosa$Sepal.Width, method = "pearson")
## [1] 0.7425467
-> 0.7425467
-> We have a strong positive linear relationship.
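For reference, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. This sketch rebuilds it that way and compares it with cor(); the name manual_r is illustrative.

```r
# Sketch: Pearson correlation from its definition, r = cov(x, y) / (sd(x) * sd(y))
data(iris)
setosa <- iris[iris$Species == "setosa", ]
x <- setosa$Sepal.Width
y <- setosa$Sepal.Length

manual_r <- cov(x, y) / (sd(x) * sd(y))

print(manual_r)
print(cor(y, x, method = "pearson"))   # same value; argument order does not matter
```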
Having calculated the correlation coefficient, we can now build a simple linear regression model to predict the sepal length of setosa flowers from their sepal width. In this case, sepal length is the response variable and sepal width is the explanatory variable.
We use the lm() function to build the model.
# Below, fit the linear regression with sepal length as the response variable and sepal width as the explanatory variable for data frame setosa.
# FINISH THE FOLLOWING CODE BY TYPING Sepal.Length AFTER (, TYPING Sepal.Width AFTER ~, TYPING setosa AFTER data =
model <- lm(Sepal.Length ~ Sepal.Width, data = setosa)
summary(model)
##
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = setosa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52476 -0.16286 0.02166 0.13833 0.44428
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6390 0.3100 8.513 3.74e-11 ***
## Sepal.Width 0.6905 0.0899 7.681 6.71e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2385 on 48 degrees of freedom
## Multiple R-squared: 0.5514, Adjusted R-squared: 0.542
## F-statistic: 58.99 on 1 and 48 DF, p-value: 6.71e-10
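-> b0 = 2.64. When sepal width is 0 cm, the predicted sepal length is 2.64 cm (an extrapolation, since no flower has a sepal width of 0).
-> b1 = 0.69. For every 1 cm increase in sepal width, the predicted sepal length increases by 0.69 cm.
-> predicted sepal length = 2.64 + 0.69 * (sepal width)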
-> 0.55
-> This is the coefficient of determination (R-squared). Sepal width explains about 55% of the variation in sepal length.
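The values quoted in these answers can also be pulled out of the fitted model directly rather than read off the printed summary. This is a minimal sketch. It refits the model so that it runs on its own, and it relies on the fact that in simple linear regression R-squared equals the squared Pearson correlation. The names b0, b1 and r2 are illustrative.

```r
# Sketch: extract intercept, slope, and R-squared from the fitted model
data(iris)
setosa <- iris[iris$Species == "setosa", ]
model <- lm(Sepal.Length ~ Sepal.Width, data = setosa)

b0 <- unname(coef(model)["(Intercept)"])   # intercept
b1 <- unname(coef(model)["Sepal.Width"])   # slope
r2 <- summary(model)$r.squared             # coefficient of determination

print(round(c(b0 = b0, b1 = b1, r2 = r2), 2))

# With one explanatory variable, R-squared is the squared correlation
print(cor(setosa$Sepal.Length, setosa$Sepal.Width)^2)
```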
Meanwhile, we can generate a residual plot to examine the fitted
model. We can access the residuals stored in the model using the
$ operator.
plot(setosa$Sepal.Width, model$residuals)
abline(0, 0)
18. What do you think of this residual plot? Is there anything that flags
a possible problem?
-> We do not see any pattern. We do have one outlier, which could flag a possible problem.
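One way to follow up on the outlier is to locate the observation with the largest residual in absolute value. This is a sketch only: it recomputes the residuals as observed minus fitted values, and the names manual_resid and idx are illustrative. Having the largest residual does not by itself make a point an outlier, so the result should be read alongside the plot.

```r
# Sketch: residuals by hand, and the observation farthest from the fitted line
data(iris)
setosa <- iris[iris$Species == "setosa", ]
model <- lm(Sepal.Length ~ Sepal.Width, data = setosa)

# Residual = observed response minus fitted (predicted) response
manual_resid <- setosa$Sepal.Length - fitted(model)

# Position of the residual with the largest absolute value
idx <- which.max(abs(manual_resid))

print(setosa[idx, c("Sepal.Width", "Sepal.Length")])
print(manual_resid[idx])
```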
Next, it would be nice to add the fitted line to the scatterplot. To
do so we will use the abline() function.
# Below, draw the scatterplot with sepal length as the response variable and sepal width as the explanatory variable for data frame setosa.
plot(Sepal.Length ~ Sepal.Width, data = setosa,
xlab = "Sepal Widths",
ylab = "Sepal Lengths",
main = "Setosa Sepal Length vs Sepal Width",
pch = 20,
cex = 2,
col = "hotpink")
abline(model, lwd = 3, col = "purple")
With the fitted line in place, we can use the linear regression model to predict sepal length from sepal width.
Method 1: Calculate with linear regression equation.
-> 2.64 + 0.69 * 3.5 = 5.055, about 5.06 cm
Method 2: Use the R function predict() to predict.
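Method 1 can also be carried out in R with the unrounded coefficients, which avoids rounding error in the hand calculation. This is a sketch that assumes the sepal width of interest is 3.5 cm, the value used in the predict() call in this lab; the names new_width and manual_pred are illustrative.

```r
# Sketch: Method 1 using the stored coefficients instead of rounded values
data(iris)
setosa <- iris[iris$Species == "setosa", ]
model <- lm(Sepal.Length ~ Sepal.Width, data = setosa)

new_width <- 3.5                     # assumed sepal width in cm
b <- unname(coef(model))             # b[1] = intercept, b[2] = slope
manual_pred <- b[1] + b[2] * new_width

print(manual_pred)
```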
# code to predict sepal length for a flower with sepal width at 3.5 cm.
predict(model, newdata = data.frame(Sepal.Width = 3.5))
## 1
## 5.055715
# FINISH THE FOLLOWING CODE BY TYPING 5 AFTER =
predict(model, newdata = data.frame(Sepal.Width = 5))
## 1
## 6.09145
-> 6.09 cm
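Before trusting a prediction, it can help to check whether the new value lies inside the range of the data used to fit the model. This is a sketch of such a check; the names new_width, width_range and inside_range are illustrative. A sepal width of 5 cm is larger than any setosa width in the data, so this prediction is an extrapolation and should be interpreted with caution.

```r
# Sketch: check whether a new sepal width falls inside the observed range
data(iris)
setosa <- iris[iris$Species == "setosa", ]

new_width <- 5                                # value used in the prediction
width_range <- range(setosa$Sepal.Width)      # smallest and largest observed widths

# TRUE means interpolation; FALSE means the prediction extrapolates
inside_range <- new_width >= width_range[1] && new_width <= width_range[2]

print(width_range)
print(inside_range)
```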