Variance

Author

AS

mtcars dataset

?mtcars

df <- mtcars

A data frame with 32 observations on 11 (numeric) variables.

[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs Engine (0 = V-shaped, 1 = straight)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors
# Load necessary libraries

library(psych)
library(knitr)

# Get summary statistics for the mtcars dataset
summary_stats <- describe(mtcars)

# Display summary statistics using kable
kable(summary_stats, format = "markdown", digits = 2)
vars n mean sd median trimmed mad min max range skew kurtosis se
mpg 1 32 20.09 6.03 19.20 19.70 5.41 10.40 33.90 23.50 0.61 -0.37 1.07
cyl 2 32 6.19 1.79 6.00 6.23 2.97 4.00 8.00 4.00 -0.17 -1.76 0.32
disp 3 32 230.72 123.94 196.30 222.52 140.48 71.10 472.00 400.90 0.38 -1.21 21.91
hp 4 32 146.69 68.56 123.00 141.19 77.10 52.00 335.00 283.00 0.73 -0.14 12.12
drat 5 32 3.60 0.53 3.70 3.58 0.70 2.76 4.93 2.17 0.27 -0.71 0.09
wt 6 32 3.22 0.98 3.33 3.15 0.77 1.51 5.42 3.91 0.42 -0.02 0.17
qsec 7 32 17.85 1.79 17.71 17.83 1.42 14.50 22.90 8.40 0.37 0.34 0.32
vs 8 32 0.44 0.50 0.00 0.42 0.00 0.00 1.00 1.00 0.24 -2.00 0.09
am 9 32 0.41 0.50 0.00 0.38 0.00 0.00 1.00 1.00 0.36 -1.92 0.09
gear 10 32 3.69 0.74 4.00 3.62 1.48 3.00 5.00 2.00 0.53 -1.07 0.13
carb 11 32 2.81 1.62 2.00 2.65 1.48 1.00 8.00 7.00 1.05 1.26 0.29

Multivariate Regression

reg1 <- lm(data = df, 
           formula = mpg ~ cyl + disp 
           )

summary(reg1)

Call:
lm(formula = mpg ~ cyl + disp, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4213 -2.1722 -0.6362  1.1899  7.0516 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.66099    2.54700  13.609 4.02e-14 ***
cyl         -1.58728    0.71184  -2.230   0.0337 *  
disp        -0.02058    0.01026  -2.007   0.0542 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.055 on 29 degrees of freedom
Multiple R-squared:  0.7596,    Adjusted R-squared:  0.743 
F-statistic: 45.81 on 2 and 29 DF,  p-value: 1.058e-09

Examine Residuals

reg1$residuals
          Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
        -0.84395255         -0.84395255         -3.28885510          1.57324352 
  Hornet Sportabout             Valiant          Duster 360           Merc 240D 
         4.14732774         -2.40601638         -0.25267226         -0.89226849 
           Merc 230            Merc 280           Merc 280C          Merc 450SE 
        -2.61371193         -2.48751693         -3.88751693          0.11418581 
         Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
         1.01418581         -1.08581419         -1.84730532         -2.09430892 
  Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
         1.79401841          5.70804444          3.64629354          7.05160883 
      Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
        -4.33979314          0.08281514         -0.50535572         -1.45850859 
   Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
         5.47067308          0.61421953          0.16432359          4.04561603 
     Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
         1.06207504         -2.45270705         -0.76710662         -4.42126787 
length(reg1$residuals) 
[1] 32
round(x = sum(reg1$residuals), 
      digits = 14
      )
[1] 0
hist(x = reg1$residuals)

round(x = sum(reg1$residuals^2),
      digits = 14
      )
[1] 270.7403

Summary Stats on Residuals

sum(reg1$residuals)
[1] 2.664535e-15
# Display summary statistics using kable
kable(x = psych::describe(reg1$residuals),
      format = "markdown", 
      digits = 2)
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 32 0 2.96 -0.64 -0.21 2.57 -4.42 7.05 11.47 0.67 -0.35 0.52

Interpretation of Coefficients

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
keep <- df |> 
  select(mpg, cyl, disp) 

table(keep$cyl)

 4  6  8 
11  7 14 

Replicating t value for \(\beta_1\)

summary(reg1)

Call:
lm(formula = mpg ~ cyl + disp, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4213 -2.1722 -0.6362  1.1899  7.0516 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.66099    2.54700  13.609 4.02e-14 ***
cyl         -1.58728    0.71184  -2.230   0.0337 *  
disp        -0.02058    0.01026  -2.007   0.0542 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.055 on 29 degrees of freedom
Multiple R-squared:  0.7596,    Adjusted R-squared:  0.743 
F-statistic: 45.81 on 2 and 29 DF,  p-value: 1.058e-09
-1.58728 / 0.71184 
[1] -2.229827
round(x = reg1$coefficients[2] / 0.71184, digits = 3 )
  cyl 
-2.23 

Replicating p value for \(\beta_1\)

The probability that a particular statistical measure of an assumed probability distribution will be as or more extreme than what we observed, given the null is true/

?pt
2 * pt(q = - 2.230, 
      df = 32 - (2+1), 
      lower.tail = TRUE
      )
[1] 0.03365088

Replicating residual standard error for a regression

summary(reg1)

Call:
lm(formula = mpg ~ cyl + disp, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4213 -2.1722 -0.6362  1.1899  7.0516 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.66099    2.54700  13.609 4.02e-14 ***
cyl         -1.58728    0.71184  -2.230   0.0337 *  
disp        -0.02058    0.01026  -2.007   0.0542 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.055 on 29 degrees of freedom
Multiple R-squared:  0.7596,    Adjusted R-squared:  0.743 
F-statistic: 45.81 on 2 and 29 DF,  p-value: 1.058e-09
residuals <- as.numeric(reg1$residuals)
residuals
 [1] -0.84395255 -0.84395255 -3.28885510  1.57324352  4.14732774 -2.40601638
 [7] -0.25267226 -0.89226849 -2.61371193 -2.48751693 -3.88751693  0.11418581
[13]  1.01418581 -1.08581419 -1.84730532 -2.09430892  1.79401841  5.70804444
[19]  3.64629354  7.05160883 -4.33979314  0.08281514 -0.50535572 -1.45850859
[25]  5.47067308  0.61421953  0.16432359  4.04561603  1.06207504 -2.45270705
[31] -0.76710662 -4.42126787
sd(residuals) # adjust for degrees of freedom
[1] 2.955259
 residuals - mean(residuals)                      # deviation from mean of X
 [1] -0.84395255 -0.84395255 -3.28885510  1.57324352  4.14732774 -2.40601638
 [7] -0.25267226 -0.89226849 -2.61371193 -2.48751693 -3.88751693  0.11418581
[13]  1.01418581 -1.08581419 -1.84730532 -2.09430892  1.79401841  5.70804444
[19]  3.64629354  7.05160883 -4.33979314  0.08281514 -0.50535572 -1.45850859
[25]  5.47067308  0.61421953  0.16432359  4.04561603  1.06207504 -2.45270705
[31] -0.76710662 -4.42126787
(residuals - mean(residuals) )^2                  # squared deviation from mean of X
 [1]  0.712255903  0.712255903 10.816567877  2.475095181 17.200327418
 [6]  5.788914815  0.063843269  0.796143058  6.831490036  6.187740499
[11] 15.112787915  0.013038400  1.028572866  1.178992446  3.412536939
[16]  4.386129847  3.218502069 32.581771340 13.295456586 49.725187054
[21] 18.833804476  0.006858348  0.255384408  2.127247306 29.928263937
[26]  0.377265632  0.027002242 16.367009048  1.128003399  6.015771866
[31]  0.588452574 19.547609556
sum( (residuals - mean(residuals))^2 )            # total squared deviation from mean of X (SSR)
[1] 270.7403
sum( (residuals - mean(residuals))^2) / (32 - (2+1)) # average squared deviation from mean of X (variance)
[1] 9.335872
sqrt(sum((residuals - mean(residuals))^2) / (32 -(2+1))) # average squared deviation from mean of X
[1] 3.055466
summary(reg1)

Call:
lm(formula = mpg ~ cyl + disp, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4213 -2.1722 -0.6362  1.1899  7.0516 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.66099    2.54700  13.609 4.02e-14 ***
cyl         -1.58728    0.71184  -2.230   0.0337 *  
disp        -0.02058    0.01026  -2.007   0.0542 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.055 on 29 degrees of freedom
Multiple R-squared:  0.7596,    Adjusted R-squared:  0.743 
F-statistic: 45.81 on 2 and 29 DF,  p-value: 1.058e-09
sqrt(sum(reg1$residuals^2)/29) # since mean of residuals is 0 anyways !!!
[1] 3.055466

Multivariate Linear Regression with matrix algebra

?as.matrix
X <- as.matrix(x = cbind(df$cyl,
                         df$disp) 
               )

# Create a matrix with the additional column of 1s
additional_column <- matrix(data = 1,
                            nrow = nrow(X), # the desired number of rows.
                            ncol = 1        # the desired number of columns.
                            )

# Bind the additional column to the matrix X
X <- cbind(additional_column, X)

class(X)
[1] "matrix" "array" 

Point estimates

Setup

Derivation

 y <- df$mpg
 solve(t(X) %*% X ) %*% t(X) %*% y
            [,1]
[1,] 34.66099474
[2,] -1.58727681
[3,] -0.02058363
 summary(reg1)

Call:
lm(formula = mpg ~ cyl + disp, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4213 -2.1722 -0.6362  1.1899  7.0516 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.66099    2.54700  13.609 4.02e-14 ***
cyl         -1.58728    0.71184  -2.230   0.0337 *  
disp        -0.02058    0.01026  -2.007   0.0542 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.055 on 29 degrees of freedom
Multiple R-squared:  0.7596,    Adjusted R-squared:  0.743 
F-statistic: 45.81 on 2 and 29 DF,  p-value: 1.058e-09

Standard errors

  • \(\sigma^2\) is the variance of the residual vector

Derivation

Lets understand the standard deviatio formula -

\(X^{'}X\)is a symmetric matrix.

?t
X
      [,1] [,2]  [,3]
 [1,]    1    6 160.0
 [2,]    1    6 160.0
 [3,]    1    4 108.0
 [4,]    1    6 258.0
 [5,]    1    8 360.0
 [6,]    1    6 225.0
 [7,]    1    8 360.0
 [8,]    1    4 146.7
 [9,]    1    4 140.8
[10,]    1    6 167.6
[11,]    1    6 167.6
[12,]    1    8 275.8
[13,]    1    8 275.8
[14,]    1    8 275.8
[15,]    1    8 472.0
[16,]    1    8 460.0
[17,]    1    8 440.0
[18,]    1    4  78.7
[19,]    1    4  75.7
[20,]    1    4  71.1
[21,]    1    4 120.1
[22,]    1    8 318.0
[23,]    1    8 304.0
[24,]    1    8 350.0
[25,]    1    8 400.0
[26,]    1    4  79.0
[27,]    1    4 120.3
[28,]    1    4  95.1
[29,]    1    8 351.0
[30,]    1    6 145.0
[31,]    1    8 301.0
[32,]    1    4 121.0
X_transposed <- t(X)  # t returns the transpose of x.
X_transposed
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]  [,8]  [,9] [,10] [,11] [,12] [,13]
[1,]    1    1    1    1    1    1    1   1.0   1.0   1.0   1.0   1.0   1.0
[2,]    6    6    4    6    8    6    8   4.0   4.0   6.0   6.0   8.0   8.0
[3,]  160  160  108  258  360  225  360 146.7 140.8 167.6 167.6 275.8 275.8
     [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
[1,]   1.0     1     1     1   1.0   1.0   1.0   1.0     1     1     1     1
[2,]   8.0     8     8     8   4.0   4.0   4.0   4.0     8     8     8     8
[3,] 275.8   472   460   440  78.7  75.7  71.1 120.1   318   304   350   400
     [,26] [,27] [,28] [,29] [,30] [,31] [,32]
[1,]     1   1.0   1.0     1     1     1     1
[2,]     4   4.0   4.0     8     6     8     4
[3,]    79 120.3  95.1   351   145   301   121
dim(X_transposed)
[1]  3 32
dim(X)
[1] 32  3
X_transposed %*% X # symmetric matrix 
       [,1]    [,2]      [,3]
[1,]   32.0   198.0    7383.1
[2,]  198.0  1324.0   51872.4
[3,] 7383.1 51872.4 2179627.5
solve(X_transposed %*% X ) 
             [,1]          [,2]          [,3]
[1,]  0.694871232 -0.1730656141  1.764992e-03
[2,] -0.173065614  0.0542769093 -7.054934e-04
[3,]  0.001764992 -0.0007054934  1.127006e-05

var_covar_matrix <- solve(X_transposed%*% X ) * sum(reg1$residuals^2) / 29
var_covar_matrix 
            [,1]         [,2]          [,3]
[1,]  6.48722874 -1.615718386  0.0164777387
[2,] -1.61571839  0.506722267 -0.0065863960
[3,]  0.01647774 -0.006586396  0.0001052158
summary(reg1)

Call:
lm(formula = mpg ~ cyl + disp, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4213 -2.1722 -0.6362  1.1899  7.0516 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.66099    2.54700  13.609 4.02e-14 ***
cyl         -1.58728    0.71184  -2.230   0.0337 *  
disp        -0.02058    0.01026  -2.007   0.0542 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.055 on 29 degrees of freedom
Multiple R-squared:  0.7596,    Adjusted R-squared:  0.743 
F-statistic: 45.81 on 2 and 29 DF,  p-value: 1.058e-09
var_covar_matrix[2,2]^.5
[1] 0.7118443
var_covar_matrix[3,3]^.5
[1] 0.01025748
0.506722267  ^.5
[1] 0.7118443
0.0001052158 ^.5
[1] 0.01025748

https://www.scribbr.com/statistics/coefficient-of-determination/