I chose to use the iris data for this assignment. The first thing I learned how to do was attach the data so that I can refer to its parts directly by name.

attach(iris)

I want to be able to explain the relationship between petal lengths and petal widths in this set. Since I’m not sure what those columns of data are called in this particular set, I check that information using the head() function.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Next, I learned how to make a scatter plot to get an initial visual representation of the relationship. I chose to use iris petal width has the explanatory variable and length as the response variable because my hypothesis is that longer petals require the sturdiness of a wider petal. I learned that R will automatically label my axes, but I want to customize them to my exact needs using xlab, ylab, and then main to label the title of the plot.

plot(Petal.Length~Petal.Width, ylab="Petal Length", xlab="Petal Width",
     main="Length of Iris Petals by Width")

Then I learned how to produce a linear regression model from the data set and add the regression line to the plot I just created. I called the model my mod and used the lm function to let R give me the parameters of the model. Then I recalled my plot and added a line using the abline function with the model parameters as inputs.

mymod<-lm(Petal.Length~Petal.Width)
mymod
## 
## Call:
## lm(formula = Petal.Length ~ Petal.Width)
## 
## Coefficients:
## (Intercept)  Petal.Width  
##       1.084        2.230
plot(Petal.Length~Petal.Width, ylab="Petal Length", xlab="Petal Width",
     main="Length of Iris Petals by Width")
abline(1.084,2.230)

Now that I see that the petal length and width have a postive linear relationship, I use my new indexing skills to predict the length of a petal with a certain width. The head() function earlier told me that the petal length data comes from the third column and the petal width data comes from the fourth.

iris[100,4]
## [1] 1.3
iris[100,3]
## [1] 4.1

I see that the 100th observation in the study yielded a petal width of 1.3. Now I want to compare what my linear regression model would predict the length of that petal to be to what the study actually observed. Here, I learned to use R as a simple calculator that crunches numbers out for me. I inserted the observed width of 1.3.

1.084+2.230*1.3
## [1] 3.983

The difference between this sum and the observed value is called the residual. In order to calculate it, I learned how to assign variables to indexed values and then perform operations with them. Here I call up the observed valuess and the length predicted by my model and then subtract the two.

actuallength<-iris[100,3]
width<-iris[100,4]
predictedlength<-1.084+2.230*1.3
oneresid<-actuallength-predictedlength
oneresid
## [1] 0.117

Next, I learned how to check whether or not my prediction error follows a normal distribution. We can make a histogram out of the residuals for each individual piece of data using hist() function after defining a vector that contains all of the residual values.

myresid<-mymod$residuals
hist(myresid, xlab="Residual Value", main="Histogram of Residual Values")

Since it’s hard to determine how closely this data imitates a normal distribution just by the histogram, we can use the qqnorm and qqline functions to see how closely the residuals mimick the normal distribution.

qqnorm(myresid)
qqline(myresid)

I also learned how to check the spread of the data by plotting the explanatory variable against the residuals. This will reveal the variability of the error in relationship to values of the explanatory variabe.

plot(myresid~Petal.Width, ylab = "Residual Values", main="Homoscedasticity Test")
abline(0,0)

This appears problematic to me since the residuals seem to follow a curved trend against our explanatory variable, indicating that the error may not be constant across all x values.

Lastly, I learned how to use the summary() function to find the Mean Square Error and other useful pieces of information that help me understand the relationship between iris petal length and width better.

summary(mymod)
## 
## Call:
## lm(formula = Petal.Length ~ Petal.Width)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.33542 -0.30347 -0.02955  0.25776  1.39453 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.08356    0.07297   14.85   <2e-16 ***
## Petal.Width  2.22994    0.05140   43.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4782 on 148 degrees of freedom
## Multiple R-squared:  0.9271, Adjusted R-squared:  0.9266 
## F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16

Now I feel much more confident in my ability to analyze data using a linear regression model with the help of R!