1. Conduct both correlation (measures strength of linear relationship) and simple regression analysis in R. Discuss what you find in a few lines (strength and direction of the relationship for correlation; interpretation of the slope estimate for regression).

  2. More importantly, interpret the residuals. You can refer to this video, which will be helpful for both the discussion and the assignment, or study this resource if you prefer reading.

  3. Did the Gauss-Markov assumptions hold? Explain in your own words (instead of copying and pasting from the web). These are the 4 OLS assumptions in Chapter 9.3 of the Open Statistics textbook, or you can refer to some simple resources on this too.

df <- Orange
head(df)
##   Tree  age circumference
## 1    1  118            30
## 2    1  484            58
## 3    1  664            87
## 4    1 1004           115
## 5    1 1231           120
## 6    1 1372           142
# The formula age ~ circumference puts age on the y-axis and circumference on the x-axis
plot(age~circumference,
     data = df,
     main = "Growth of Orange Trees",
     xlab = "Circumference (mm)",
     ylab = "Age (days since 1968/12/31)"
     )

cor(df$age, df$circumference)
## [1] 0.9135189
cor(df$age, df$circumference, use = "complete.obs")
## [1] 0.9135189
plot(df, main = "Growth of Orange Trees")

cor(df$age, df$circumference, use = "pairwise.complete.obs")
## [1] 0.9135189
relation <- lm(formula = age~circumference, data = df)
print(summary(relation))
## 
## Call:
## lm(formula = age ~ circumference, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -317.88 -140.90  -17.20   96.54  471.16 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    16.6036    78.1406   0.212    0.833    
## circumference   7.8160     0.6059  12.900 1.93e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 203.1 on 33 degrees of freedom
## Multiple R-squared:  0.8345, Adjusted R-squared:  0.8295 
## F-statistic: 166.4 on 1 and 33 DF,  p-value: 1.931e-14
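The write-up asks for the strength and direction of the correlation and an interpretation of the slope. A minimal sketch of how those numbers can be pulled out of the fitted model follows; the variable names `r`, `slope`, and `ci` are illustrative and not part of the original analysis. With the output above, the slope of about 7.8 means that each additional millimetre of circumference is associated with roughly 7.8 more days of age on average.

```r
# Sketch: extract the quantities needed for the interpretation.
df <- Orange
model <- lm(age ~ circumference, data = df)

# Pearson correlation: the sign gives direction, the magnitude gives strength
r <- cor(df$age, df$circumference)

# Slope estimate and its 95% confidence interval
slope <- coef(model)[["circumference"]]
ci <- confint(model, "circumference", level = 0.95)

# In simple linear regression, the squared correlation equals R-squared
same_r2 <- isTRUE(all.equal(r^2, summary(model)$r.squared))

cat(sprintf("r = %.3f, slope = %.3f days per mm, r^2 matches R-squared: %s\n",
            r, slope, same_r2))
print(ci)
```

The confidence interval gives a range of plausible slope values, which is often more informative in a short write-up than the p-value alone.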
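# Residuals vs fitted values: look for curvature (nonlinearity) or a fan shape (non-constant variance)
model <- lm(formula = age~circumference, data = df)
res <- resid(model)
plot(fitted(model), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0)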

qqnorm(res)
qqline(res) 

plot(density(res))

The residuals appear approximately normal: they follow the Normal Q-Q line quite well, with only a few extreme values towards the upper end of the plot. The residuals vs. fitted plot, however, shows heteroskedasticity, because the spread of the residuals around the zero line is not constant.
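The visual reading of the Q-Q plot can be supplemented with a formal normality test. The sketch below uses the Shapiro-Wilk test from base R (`shapiro.test`); this test was not part of the original analysis, and the name `sw` is illustrative. A large p-value would be consistent with approximately normal residuals, while a small one would suggest a departure from normality.

```r
# Sketch: Shapiro-Wilk test on the regression residuals.
df <- Orange
model <- lm(age ~ circumference, data = df)
res <- resid(model)

# Null hypothesis: the residuals come from a normal distribution
sw <- shapiro.test(res)
print(sw)
```

With only 35 observations the test has limited power, so it is best read alongside the Q-Q plot rather than in place of it.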

From the plotted graphs, the relationship appears nonlinear, so the linearity requirement of the Gauss-Markov assumptions is not met. The nearly normal residuals condition does hold, as the residuals lie close to the normal line in the Q-Q plot. The variability is not constant: in the residuals vs. fitted graph, the points are not evenly spread around the zero line. The observations are treated as independent here, so this assumption is taken to hold, although the data contain repeated measurements on the same trees (the Tree column), which calls for some caution.
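The constant-variability and independence conditions can also be probed numerically. The following is a rough sketch in base R, not a definitive procedure. It hand-rolls a Breusch-Pagan-style check by regressing the squared residuals on the fitted values and comparing n times that R-squared to a chi-squared distribution with 1 degree of freedom. It then summarises the residuals by tree. The names `aux`, `bp_stat`, `bp_p`, and `tree_means` are assumptions introduced for illustration.

```r
# Sketch: numeric checks for constant variance and independence.
df <- Orange
model <- lm(age ~ circumference, data = df)
res <- resid(model)

# Constant variance: auxiliary regression of squared residuals on fitted values
sq_res <- res^2
fit <- fitted(model)
aux <- lm(sq_res ~ fit)
bp_stat <- nrow(df) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
cat(sprintf("BP-style statistic = %.3f, p-value = %.4f\n", bp_stat, bp_p))

# Independence: mean residual per tree. Means far from zero would suggest
# that residuals cluster by tree rather than behaving independently.
tree_means <- tapply(res, df$Tree, mean)
print(round(tree_means, 1))
```

A small p-value in the first check would support the visual impression of non-constant variance, and tree-level means that differ markedly from zero would indicate that the repeated measurements on each tree are not independent.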