You can continue to use the dataset from your last discussion, or pick a new dataset.
?CO2
CO2_df <- CO2
summary(CO2_df)
##      Plant             Type         Treatment       conc          uptake
##  Qn1    : 7   Quebec     :42   nonchilled:42   Min.   :  95   Min.   : 7.70
##  Qn2    : 7   Mississippi:42   chilled   :42   1st Qu.: 175   1st Qu.:17.90
##  Qn3    : 7                                    Median : 350   Median :28.30
##  Qc1    : 7                                    Mean   : 435   Mean   :27.21
##  Qc3    : 7                                    3rd Qu.: 675   3rd Qu.:37.12
##  Qc2    : 7                                    Max.   :1000   Max.   :45.50
##  (Other):42
cor(CO2_df$uptake, CO2_df$conc)
## [1] 0.4851774
CO2_df$conc ~ CO2_df$uptake
## CO2_df$conc ~ CO2_df$uptake
I will be looking at the CO2 dataset, which is “from an experiment on
the cold tolerance of the grass species Echinochloa
crus-galli.” The variables of interest are conc and uptake.
conc is the ambient CO2 concentration (\(\frac{mL}{L}\)), treated as the response (Y)
variable in the model below, and uptake is the rate of CO2 uptake by
the plants (\(\frac{\mu mol}{m^2 \cdot sec}\)),
treated as the predictor (X) variable.
\[ Y_i = \beta_0 + \beta_1X_i + \epsilon_i \]
In the formula above, \(Y_i\) = conc (the response), \(\beta_0\) = the intercept (constant), \(\beta_1\) = the slope on \(X_i\) = uptake, and \(\epsilon_i\) = the random error.
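For reference, the least-squares estimates for a simple linear model of this form have a standard closed form. It is written here with sample means \(\bar{X}\) and \(\bar{Y}\), notation introduced only for this note:
\[ \hat{\beta}_1 = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X} \]
The slope is the sample covariance of X and Y divided by the sample variance of X, and the intercept forces the fitted line through the point \((\bar{X}, \bar{Y})\).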
lm(CO2_df$conc ~ CO2_df$uptake, data = CO2_df)
##
## Call:
## lm(formula = CO2_df$conc ~ CO2_df$uptake, data = CO2_df)
##
## Coefficients:
## (Intercept) CO2_df$uptake
## 73.71 13.28
plot(CO2_df$uptake, CO2_df$conc)
library(ggplot2)
ggplot(CO2_df, aes(x = uptake,
                   y = conc)) +
  geom_point(size = 2,
             shape = 18,
             col = "purple") +
  stat_smooth(method = lm,
              linetype = "dashed",
              col = "red") +
  xlab("CO2 uptake (umol/m^2 sec)") +
  ylab("CO2 ambient concentrations (mL/L)")
## `geom_smooth()` using formula = 'y ~ x'
The slope tells us that for every one-unit increase in the CO2 uptake rate (i.e., a one-unit increase in X), the ambient CO2 concentration is expected to increase by 13.28 mL/L on average. The intercept tells us that at a theoretical uptake rate of zero, the predicted average ambient CO2 concentration is 73.71 mL/L.
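The interpretation of the slope and intercept can be checked numerically with `predict()`. This is a minimal sketch: the object names `fit`, `new_x`, and `pred`, and the uptake values 0, 20, and 21, are chosen here only for illustration.

```r
# Refit the model with conc as the response and uptake as the predictor
CO2_df <- CO2
fit <- lm(conc ~ uptake, data = CO2_df)

# Hypothetical uptake values, picked only to illustrate the interpretation
new_x <- data.frame(uptake = c(0, 20, 21))
pred <- predict(fit, newdata = new_x)

pred[1]            # prediction at uptake = 0, which is the intercept
pred[3] - pred[2]  # change in prediction for a one-unit increase, which is the slope
```

Any two uptake values one unit apart would give the same difference, because the fitted line has a constant slope.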
#slope
cov(CO2_df$uptake, CO2_df$conc)/var(CO2_df$uptake)
## [1] 13.27633
#intercept
#B0 = mean(Y) - B1 (slope) * mean(X)
mean(CO2_df$conc) - (cov(CO2_df$uptake, CO2_df$conc)/var(CO2_df$uptake)) * mean(CO2_df$uptake)
## [1] 73.71
This confirms the lm() output: the covariance/variance formula gives the same slope of 13.28, and the intercept formula gives the same 73.71.
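The same comparison can be made programmatically instead of by eye. The sketch below assumes the model is refit and stored in an object called `fit`; the names `fit`, `b0`, and `b1` are introduced here for illustration.

```r
# Fit the model with conc as the response and uptake as the predictor
CO2_df <- CO2
fit <- lm(conc ~ uptake, data = CO2_df)

# Hand-computed OLS slope: cov(X, Y) / var(X)
b1 <- cov(CO2_df$uptake, CO2_df$conc) / var(CO2_df$uptake)
# Hand-computed OLS intercept: mean(Y) - slope * mean(X)
b0 <- mean(CO2_df$conc) - b1 * mean(CO2_df$uptake)

# unname() drops the coefficient labels so only the numbers are compared
all.equal(unname(coef(fit)), c(b0, b1))
```

`all.equal()` compares within a small numerical tolerance, so it should return `TRUE` even if the two calculations differ in the last few floating-point digits.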
Look at the assumptions of OLS. Please skim through chapter 8 of the Open Statistics textbook, and pay attention to the Gauss-Markov Assumptions part (Full Ideal Conditions of OLS). Do a few Google searches and, in less than 20 lines, try to summarize your findings.
| 1) Linearity - the relationship between X and Y should be approximately linear |
| 2) Nearly normal residuals - the residuals should be approximately normally distributed |
| 3) Constant variability - the variance of the errors should be roughly constant across all values of X |
| 4) Independent observations - each observation should be independent of the others, so that the errors are uncorrelated |
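The conditions listed above are commonly checked with residual plots. The following is a rough sketch using base R graphics, not a complete diagnostic procedure; the object name `fit` is an assumption introduced here.

```r
# Refit the model so the sketch is self-contained
CO2_df <- CO2
fit <- lm(conc ~ uptake, data = CO2_df)

# 1) Linearity and 3) constant variability:
# residuals should scatter evenly around zero with no curve or funnel shape
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# 2) Nearly normal residuals:
# points should fall close to the reference line
qqnorm(rstandard(fit))
qqline(rstandard(fit))

# 4) Independence depends on how the data were collected, not on a plot;
# counting rows per plant shows how many measurements each plant contributes
table(CO2_df$Plant)
```

Calling `plot(fit)` would produce a similar set of built-in diagnostic plots; the manual version is shown here so each plot maps onto one of the listed conditions.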