Covariance Variance Formula

Load Data

Read up on the data.

The CO2 data frame has 84 rows and 5 columns of data from an experiment on the cold tolerance of the grass species Echinochloa crus-galli.

The CO2​ uptake of six plants from Quebec and six plants from Mississippi was measured at several levels of ambient CO2​ concentration.

Variable Definitions

  • conc - a numeric vector of ambient carbon dioxide concentrations (mL/L).

  • uptake - a numeric vector of carbon dioxide uptake rates

?datasets

library(help = "datasets")

help(CO2)

head(CO2)
  Plant   Type  Treatment conc uptake
1   Qn1 Quebec nonchilled   95   16.0
2   Qn1 Quebec nonchilled  175   30.4
3   Qn1 Quebec nonchilled  250   34.8
4   Qn1 Quebec nonchilled  350   37.2
5   Qn1 Quebec nonchilled  500   35.3
6   Qn1 Quebec nonchilled  675   39.2
tail(CO2)
   Plant        Type Treatment conc uptake
79   Mc3 Mississippi   chilled  175   18.0
80   Mc3 Mississippi   chilled  250   17.9
81   Mc3 Mississippi   chilled  350   17.9
82   Mc3 Mississippi   chilled  500   17.9
83   Mc3 Mississippi   chilled  675   18.9
84   Mc3 Mississippi   chilled 1000   19.9

Exploratory Data Analysis (EDA)

str(CO2)
Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':  84 obs. of  5 variables:
 $ Plant    : Ord.factor w/ 12 levels "Qn1"<"Qn2"<"Qn3"<..: 1 1 1 1 1 1 1 2 2 2 ...
 $ Type     : Factor w/ 2 levels "Quebec","Mississippi": 1 1 1 1 1 1 1 1 1 1 ...
 $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 1 1 1 1 1 1 1 1 1 ...
 $ conc     : num  95 175 250 350 500 675 1000 95 175 250 ...
 $ uptake   : num  16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
 - attr(*, "formula")=Class 'formula'  language uptake ~ conc | Plant
  .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
 - attr(*, "outer")=Class 'formula'  language ~Treatment * Type
  .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
 - attr(*, "labels")=List of 2
  ..$ x: chr "Ambient carbon dioxide concentration"
  ..$ y: chr "CO2 uptake rate"
 - attr(*, "units")=List of 2
  ..$ x: chr "(uL/L)"
  ..$ y: chr "(umol/m^2 s)"
Loading required package: psych
Warning: package 'psych' was built under R version 4.2.3
           vars  n   mean     sd median trimmed    mad  min    max range  skew
Plant*        1 84   6.50   3.47    6.5    6.50   4.45  1.0   12.0  11.0  0.00
Type*         2 84   1.50   0.50    1.5    1.50   0.74  1.0    2.0   1.0  0.00
Treatment*    3 84   1.50   0.50    1.5    1.50   0.74  1.0    2.0   1.0  0.00
conc          4 84 435.00 295.92  350.0  408.53 259.46 95.0 1000.0 905.0  0.72
uptake        5 84  27.21  10.81   28.3   27.33  14.83  7.7   45.5  37.8 -0.10
           kurtosis    se
Plant*        -1.26  0.38
Type*         -2.02  0.05
Treatment*    -2.02  0.05
conc          -0.68 32.29
uptake        -1.35  1.18

Linear Regression

We will estimate the following regression. Use equation mode to type it out.

\[ CO2_i = \beta_0 + \beta_1 Uptake_i + \epsilon_i \]

lm(data = CO2,                   # specifying my data
   formula =  conc ~ uptake)

Call:
lm(formula = conc ~ uptake, data = CO2)

Coefficients:
(Intercept)       uptake  
      73.71        13.28  

\(\beta_1\) is 13.28 in the chart above.

Use summary command on regression output
reg1 <- lm(data = CO2, 
        formula =  conc ~ uptake)

summary(reg1)

Call:
lm(formula = conc ~ uptake, data = CO2)

Residuals:
    Min      1Q  Median      3Q     Max 
-358.75 -177.93  -92.14  140.33  735.11 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   73.710     77.313   0.953    0.343    
uptake        13.276      2.642   5.024 2.91e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 260.3 on 82 degrees of freedom
Multiple R-squared:  0.2354,    Adjusted R-squared:  0.2261 
F-statistic: 25.25 on 1 and 82 DF,  p-value: 2.906e-06

Covariance-Variance formula for slope

An interesting interpretation of the slope parameter.

beta1 <- cov(CO2$conc, CO2$uptake)/var(CO2$uptake)
round(beta1, digits = 2)
[1] 13.28
NOT A CO-INCIDENCE THAT THE TWO NUMBERS ARE THE SAME !

Formal Proof.

# Plot

?plot
Help on topic 'plot' was found in the following packages:

  Package               Library
  graphics              /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
  base                  /Library/Frameworks/R.framework/Resources/library


Using the first match ...
plot(x = CO2$uptake, 
     y = CO2$conc,
     xlab = "CO2 Uptake",
     ylab = "C02 Concentration",
     main = "Best Fit Line",
     sub = "",
     bg  = "lightblue",     # a vector of background colors (Graphical Parameters)
     col = "black",         # the colors for lines and points (Graphical Parameters)
     cex = .9,             # a numerical vector giving the amount by which plotting characters and symbols should be scaled relative to the default = 1 (Graphical Parameters)
     pch = 21,              # a vector of plotting characters or symbols (Graphical Parameters) {triangle, empty circle, filled circle, square,...}
     frame = TRUE          # frame.plot    - a logical indicating whether a box should be drawn around the plot.
     )


?abline
abline(reg = lm(data    = CO2, 
                formula =  conc ~ uptake),
       lwd = 2,                    # line width, default = 1
       col = "blue"
       )

Finding the intercept

We know that \[\begin{align*} y_i & = \beta_0 + \beta_1 x_i \\ <=> \bar{y} & = \beta_0 + \beta_1 \bar{x} \\ \end{align*}\]

Rearrange the terms and solve for \(\beta_0\):

Typing equations in Latex -

\[ \beta_0 = \bar{y}-\bar{x}*\beta_1 \]

beta0 <- mean(CO2$conc) - beta1 * mean(CO2$uptake)
beta0    # print the value of beta_0
[1] 73.71
# beta_0 value should match the intercept value from the lm command
lm(data = CO2, 
   formula =  conc ~ uptake)

Call:
lm(formula = conc ~ uptake, data = CO2)

Coefficients:
(Intercept)       uptake  
      73.71        13.28  

Intercept values are also the same (as expected) !!!

We will plot the data and best fit line below.

?plot
Help on topic 'plot' was found in the following packages:

  Package               Library
  graphics              /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
  base                  /Library/Frameworks/R.framework/Resources/library


Using the first match ...
plot(x = CO2$uptake, 
     y = CO2$conc,
     xlab = "CO2 Uptake",
     ylim = c(0, 1000),
     xlim = c(0,50),     
     ylab = "C02 Concentration",
     main = "Best Fit Line",
     sub = "",
     bg  = "lightblue",     # a vector of background colors (Graphical Parameters)
     col = "black",         # the colors for lines and points (Graphical Parameters)
     cex = .9,             # a numerical vector giving the amount by which plotting characters and symbols should be scaled relative to the default = 1 (Graphical Parameters)
     pch = 21,              # a vector of plotting characters or symbols (Graphical Parameters) {triangle, empty circle, filled circle, square,...}
     frame = TRUE          # frame.plot    - a logical indicating whether a box should be drawn around the plot.
     )


?abline
abline(reg = lm(data    = CO2, 
                formula =  conc ~ uptake),
       lwd = 2,                    # line width, default = 1
       col = "blue"
       )

Caution

This formula works for simple linear regression, where you have one independent variable (X) and one dependent variable (Y). However, when you move to multivariate regression, which involves multiple independent variables, the formula for calculating the coefficients becomes more complex.

Looking Ahead: Multivariate Regression