1 Load Data

Read up on the data.

The CO2 data frame has 84 rows and 5 columns of data from an experiment on the cold tolerance of the grass species Echinochloa crus-galli.

The CO2​ uptake of six plants from Quebec and six plants from Mississippi was measured at several levels of ambient CO2​ concentration.

1.1 Variable Definitions

  • conc - a numeric vector of ambient carbon dioxide concentrations (mL/L).

  • uptake - a numeric vector of carbon dioxide uptake rates

?datasets

library(help = "datasets")

help(CO2)

head(CO2)
## Grouped Data: uptake ~ conc | Plant
##   Plant   Type  Treatment conc uptake
## 1   Qn1 Quebec nonchilled   95   16.0
## 2   Qn1 Quebec nonchilled  175   30.4
## 3   Qn1 Quebec nonchilled  250   34.8
## 4   Qn1 Quebec nonchilled  350   37.2
## 5   Qn1 Quebec nonchilled  500   35.3
## 6   Qn1 Quebec nonchilled  675   39.2
tail(CO2)
## Grouped Data: uptake ~ conc | Plant
##    Plant        Type Treatment conc uptake
## 79   Mc3 Mississippi   chilled  175   18.0
## 80   Mc3 Mississippi   chilled  250   17.9
## 81   Mc3 Mississippi   chilled  350   17.9
## 82   Mc3 Mississippi   chilled  500   17.9
## 83   Mc3 Mississippi   chilled  675   18.9
## 84   Mc3 Mississippi   chilled 1000   19.9

1.2 EDA

str(CO2)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   84 obs. of  5 variables:
##  $ Plant    : Ord.factor w/ 12 levels "Qn1"<"Qn2"<"Qn3"<..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Type     : Factor w/ 2 levels "Quebec","Mississippi": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 1 1 1 1 1 1 1 1 1 ...
##  $ conc     : num  95 175 250 350 500 675 1000 95 175 250 ...
##  $ uptake   : num  16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
##  - attr(*, "formula")=Class 'formula'  language uptake ~ conc | Plant
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "outer")=Class 'formula'  language ~Treatment * Type
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "labels")=List of 2
##   ..$ x: chr "Ambient carbon dioxide concentration"
##   ..$ y: chr "CO2 uptake rate"
##  - attr(*, "units")=List of 2
##   ..$ x: chr "(uL/L)"
##   ..$ y: chr "(umol/m^2 s)"
describe(CO2)
##            vars  n   mean     sd median trimmed    mad  min    max range  skew
## Plant*        1 84   6.50   3.47    6.5    6.50   4.45  1.0   12.0  11.0  0.00
## Type*         2 84   1.50   0.50    1.5    1.50   0.74  1.0    2.0   1.0  0.00
## Treatment*    3 84   1.50   0.50    1.5    1.50   0.74  1.0    2.0   1.0  0.00
## conc          4 84 435.00 295.92  350.0  408.53 259.46 95.0 1000.0 905.0  0.72
## uptake        5 84  27.21  10.81   28.3   27.33  14.83  7.7   45.5  37.8 -0.10
##            kurtosis    se
## Plant*        -1.26  0.38
## Type*         -2.02  0.05
## Treatment*    -2.02  0.05
## conc          -0.68 32.29
## uptake        -1.35  1.18

2 Linear Regression

We will estimate the following regression.

\[ CO2_i = \beta_0 + \beta_1 Uptake_i + \epsilon_i \]

lm(data = CO2, 
   formula =  conc ~ uptake)
## 
## Call:
## lm(formula = conc ~ uptake, data = CO2)
## 
## Coefficients:
## (Intercept)       uptake  
##       73.71        13.28

\(\beta_1\) is 13.28 in the chart above.

2.1 Covariance-Variance formula for slope

An interesting interpretation of the slope parameter.

beta1 <- cov(CO2$conc, CO2$uptake)/var(CO2$uptake)
round(beta1, digits = 2)
## [1] 13.28

NOT A CO-INCIDENCE THAT THE TWO NUMBERS ARE THE SAME !

Formal Proof.

# Plot

?plot
## Help on topic 'plot' was found in the following packages:
## 
##   Package               Library
##   graphics              /Library/Frameworks/R.framework/Versions/4.2/Resources/library
##   base                  /Library/Frameworks/R.framework/Resources/library
## 
## 
## Using the first match ...
plot(x = CO2$uptake, 
     y = CO2$conc,
     xlab = "CO2 Uptake",
     ylab = "C02 Concentration",
     main = "Best Fit Line",
     sub = "",
     bg  = "lightblue",     # a vector of background colors (Graphical Parameters)
     col = "black",         # the colors for lines and points (Graphical Parameters)
     cex = .9,             # a numerical vector giving the amount by which plotting characters and symbols should be scaled relative to the default = 1 (Graphical Parameters)
     pch = 21,              # a vector of plotting characters or symbols (Graphical Parameters) {triangle, empty circle, filled circle, square,...}
     frame = TRUE          # frame.plot    - a logical indicating whether a box should be drawn around the plot.
     )


?abline
abline(reg = lm(data    = CO2, 
                formula =  conc ~ uptake),
       lwd = 2,                    # line width, default = 1
       col = "blue"
       )

2.2 Finding the intercept

We know that \[\begin{align*} y_i & = \beta_0 + \beta_1 x_i \\ <=> \bar{y} & = \beta_0 + \beta_1 \bar{x} \\ \end{align*}\]

Rearrange the terms and solve for \(\beta_0\):

\[ \beta_0 = \bar{y}-\bar{x}*\beta_1 \]

beta0 <- mean(CO2$conc) - beta1 * mean(CO2$uptake)
beta0    # print the value of beta_0
## [1] 73.71
# beta_0 value should match the intercept value from the lm command
lm(data = CO2, 
   formula =  conc ~ uptake)
## 
## Call:
## lm(formula = conc ~ uptake, data = CO2)
## 
## Coefficients:
## (Intercept)       uptake  
##       73.71        13.28

Intercept values are also the same (as expected) !!!

We will plot the data and best fit line below.

?plot
## Help on topic 'plot' was found in the following packages:
## 
##   Package               Library
##   graphics              /Library/Frameworks/R.framework/Versions/4.2/Resources/library
##   base                  /Library/Frameworks/R.framework/Resources/library
## 
## 
## Using the first match ...
plot(x = CO2$uptake, 
     y = CO2$conc,
     xlab = "CO2 Uptake",
     ylim = c(0, 1000),
     xlim = c(0,50),     
     ylab = "C02 Concentration",
     main = "Best Fit Line",
     sub = "",
     bg  = "lightblue",     # a vector of background colors (Graphical Parameters)
     col = "black",         # the colors for lines and points (Graphical Parameters)
     cex = .9,             # a numerical vector giving the amount by which plotting characters and symbols should be scaled relative to the default = 1 (Graphical Parameters)
     pch = 21,              # a vector of plotting characters or symbols (Graphical Parameters) {triangle, empty circle, filled circle, square,...}
     frame = TRUE          # frame.plot    - a logical indicating whether a box should be drawn around the plot.
     )


?abline
abline(reg = lm(data    = CO2, 
                formula =  conc ~ uptake),
       lwd = 2,                    # line width, default = 1
       col = "blue"
       )

3 Caution

This formula works for simple linear regression, where you have one independent variable (X) and one dependent variable (Y). However, when you move to multivariate regression, which involves multiple independent variables, the formula for calculating the coefficients becomes more complex.