Learning Log 3

Iris

Goal: determine whether there is a relationship between petal length and petal width. The dataset is attached so that its columns can be referenced by name in repeated calls.

data(iris)
attach(iris)

View the beginning of the dataset

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

There are 150 observations of 5 variables.
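Since head() only prints six rows, the size of the dataset can be checked directly. A minimal sketch using base R:

```r
data(iris)

# Number of rows (observations), then number of columns (variables)
dim(iris)

# Type of each variable: four numeric measurements and one factor
str(iris)
```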

Scatterplot

A scatterplot with petal length as the explanatory variable and petal width as the response variable.

plot(Petal.Width~Petal.Length, ylab ="Petal Width (cm)",
     xlab ="Petal Length (cm)",
     main = "Relationship between Flower Petal Length and Width")

This scatterplot shows a positive trend: flowers with a larger petal length also tend to have a larger petal width. The points follow the trend fairly closely, so I believe that petal width can be predicted from petal length using a linear relationship.
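The visual impression of a strong linear trend can be backed up with a number. One possible check, not part of the original analysis, is the Pearson correlation between the two measurements. The variable name r is only illustrative:

```r
data(iris)

# Pearson correlation between petal length and petal width.
# Values close to 1 indicate a strong positive linear relationship.
r <- cor(iris$Petal.Length, iris$Petal.Width)
r
```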

Creating a Linear Regression Model

mymod<-lm(Petal.Width~Petal.Length)

This creates a linear model with petal width as the response and petal length as the explanatory variable.

mymod

## 
## Call:
## lm(formula = Petal.Width ~ Petal.Length)
## 
## Coefficients:
##  (Intercept)  Petal.Length  
##      -0.3631        0.4158

This shows that our intercept is -0.3631 and the slope of the regression line is 0.4158. Thus, our regression equation is: yhat = -0.3631 + 0.4158x, where x is the petal length (cm) and yhat is the predicted petal width (cm).
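The coefficients can also be pulled out of the model programmatically instead of being read off the printout. A minimal sketch, refitting the same model under the hypothetical name fit so that the snippet runs on its own:

```r
data(iris)

# Same model as mymod, refit with data = iris so attach() is not required
fit <- lm(Petal.Width ~ Petal.Length, data = iris)

# coef() returns a named numeric vector of the estimated coefficients
b <- coef(fit)
b[["(Intercept)"]]   # intercept of the regression line
b[["Petal.Length"]]  # slope of the regression line
```

Using the stored coefficients avoids the small rounding error that comes from retyping -0.3631 and 0.4158 by hand.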

Interpreting the Linear Model

Looking first at the slope and comparing it to the scatterplot, it makes sense that the slope is positive: flowers with a larger petal length tend to have a larger petal width. Each additional 1 cm of petal length is associated with an increase of about 0.42 cm in predicted petal width.

In contrast, we are not able to give a meaningful interpretation of the intercept. A negative intercept would mean that a flower with a petal length of 0 cm has a negative petal width, which is impossible, so this estimate is not meaningful on its own.

Plotting the Regression Line

plot(Petal.Width~Petal.Length, ylab ="Petal Width (cm)",
     xlab ="Petal Length (cm)")
abline(-0.3631,0.4158)

The regression line accurately depicts how petal width changes as the petal length of a flower increases. The line is less trustworthy near the 0 cm mark, but that is to be expected because there is a lack of data near 0 cm.
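As an alternative to typing the rounded intercept and slope, abline() also accepts a fitted lm object and draws the line from the unrounded coefficients. A sketch, again using the hypothetical name fit for a refit of the same model:

```r
data(iris)
fit <- lm(Petal.Width ~ Petal.Length, data = iris)

# Scatterplot with petal length on the x-axis and petal width on the y-axis
plot(Petal.Width ~ Petal.Length, data = iris,
     xlab = "Petal Length (cm)", ylab = "Petal Width (cm)")

# Passing the model object lets abline() read the coefficients itself
abline(fit)
```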

Prediction

Now we will use our linear regression model to predict the petal width of a flower from its petal length, as a check on the accuracy of the model. We will use the 50th observation. First, we find the petal length and width of this observation.

iris[50,"Petal.Length"]

## [1] 1.4

iris[50,"Petal.Width"]

## [1] 0.2

Now we compare the actual petal width to the value that our model predicts.

-0.3631+0.4158*1.4

## [1] 0.21902

The predicted petal width for the 50th observation is 0.21902 cm. Compared with the actual petal width of 0.2 cm, the prediction is off by about 0.02 cm, so the model is reasonably accurate for this flower.
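The same prediction can be obtained with predict(), which applies the fitted coefficients to new data. A minimal sketch; the names fit, new_flower, and pred are illustrative only:

```r
data(iris)
fit <- lm(Petal.Width ~ Petal.Length, data = iris)

# newdata must contain a column named like the explanatory variable
new_flower <- data.frame(Petal.Length = iris[50, "Petal.Length"])

# Predicted petal width (cm) for a flower with that petal length
pred <- predict(fit, newdata = new_flower)
pred
```

Because predict() uses the unrounded coefficients, its result can differ from the hand calculation in the last few decimal places.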

Residual Calculation

This calculates the residual for the 50th observation, which is the observed petal width minus the predicted petal width.

iris[50,"Petal.Width"]-(-0.3631+0.4158*1.4)

## [1] -0.01902

The prediction overestimated the petal width by 0.01902 cm.

Indexing

This is another way of calculating the residual, using indexing to pull both values from the dataset instead of typing them in.

actual<-iris[50,"Petal.Width"]
predicted<- -0.3631+0.4158*iris[50,"Petal.Length"]
resid<- actual-predicted
resid

## [1] -0.01902

This provides the same residual answer as above.
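The fitted model already stores a residual for every observation, so the 50th one can also be looked up directly. A sketch under the assumption that the model is refit as fit (a hypothetical name) so the snippet is self-contained:

```r
data(iris)
fit <- lm(Petal.Width ~ Petal.Length, data = iris)

# residuals() returns observed minus fitted for each row, in row order
res50 <- residuals(fit)[50]
res50

# Cross-check against the definition: observed minus fitted value
iris[50, "Petal.Width"] - fitted(fit)[50]
```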

Checking Model Assumptions

First, we check whether the errors follow a normal distribution.

myresids <- mymod$residuals
qqnorm(myresids)
qqline(myresids)

This graph does not show any obvious evidence that this assumption is violated.

Next, we check for equal variance (homoscedasticity).

plot(mymod$residuals ~ Petal.Length)
abline(0,0)

The spread above and below the line seems to be consistent throughout the graph, so this model checks out.
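Base R can also produce both diagnostic plots from the model object itself, through the which argument of the plot method for lm objects. This is an optional shortcut rather than the approach used above; fit and op are hypothetical names:

```r
data(iris)
fit <- lm(Petal.Width ~ Petal.Length, data = iris)

# Show two plots side by side, remembering the old graphics settings
op <- par(mfrow = c(1, 2))

plot(fit, which = 1)  # residuals vs fitted values (equal variance check)
plot(fit, which = 2)  # normal Q-Q plot of the residuals (normality check)

# Restore the previous graphics settings
par(op)
```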

Mean Square Error

Now we look at the error of this model using the summary function in R.

summary(mymod)

## 
## Call:
## lm(formula = Petal.Width ~ Petal.Length)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56515 -0.12358 -0.01898  0.13288  0.64272 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.363076   0.039762  -9.131  4.7e-16 ***
## Petal.Length  0.415755   0.009582  43.387  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2065 on 148 degrees of freedom
## Multiple R-squared:  0.9271, Adjusted R-squared:  0.9266 
## F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16

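The residual standard error is 0.2065 on 148 degrees of freedom. This is the square root of the mean square error, so the MSE is 0.2065^2, or about 0.0426. The MSE is the sum of the squared errors between the observed and predicted values divided by the residual degrees of freedom, so the smaller the MSE, the better the fit.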
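The "Residual standard error" line in the summary output is the square root of the mean square error, so the two can be computed and compared directly. A minimal sketch; the names fit, rse, and mse are illustrative:

```r
data(iris)
fit <- lm(Petal.Width ~ Petal.Length, data = iris)

# Residual standard error, as printed by summary()
rse <- sigma(fit)

# Mean square error: sum of squared residuals over residual degrees of freedom
mse <- sum(residuals(fit)^2) / df.residual(fit)

rse
mse
sqrt(mse)  # should match rse
```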