Variance

Author

`mtcars` dataset

?mtcars

df <- mtcars

A data frame with 32 observations on 11 (numeric) variables.

[, 1]	`mpg`	Miles/(US) gallon
[, 2]	`cyl`	Number of cylinders
[, 3]	`disp`	Displacement (cu.in.)
[, 4]	`hp`	Gross horsepower
[, 5]	`drat`	Rear axle ratio
[, 6]	`wt`	Weight (1000 lbs)
[, 7]	`qsec`	1/4 mile time
[, 8]	`vs`	Engine (0 = V-shaped, 1 = straight)
[, 9]	`am`	Transmission (0 = automatic, 1 = manual)
[,10]	`gear`	Number of forward gears
[,11]	`carb`	Number of carburetors

# Load necessary libraries

library(psych)
library(knitr)

# Get summary statistics for the mtcars dataset
summary_stats <- describe(mtcars)

# Display summary statistics using kable
kable(summary_stats, format = "markdown", digits = 2)

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
mpg	1	32	20.09	6.03	19.20	19.70	5.41	10.40	33.90	23.50	0.61	-0.37	1.07
cyl	2	32	6.19	1.79	6.00	6.23	2.97	4.00	8.00	4.00	-0.17	-1.76	0.32
disp	3	32	230.72	123.94	196.30	222.52	140.48	71.10	472.00	400.90	0.38	-1.21	21.91
hp	4	32	146.69	68.56	123.00	141.19	77.10	52.00	335.00	283.00	0.73	-0.14	12.12
drat	5	32	3.60	0.53	3.70	3.58	0.70	2.76	4.93	2.17	0.27	-0.71	0.09
wt	6	32	3.22	0.98	3.33	3.15	0.77	1.51	5.42	3.91	0.42	-0.02	0.17
qsec	7	32	17.85	1.79	17.71	17.83	1.42	14.50	22.90	8.40	0.37	0.34	0.32
vs	8	32	0.44	0.50	0.00	0.42	0.00	0.00	1.00	1.00	0.24	-2.00	0.09
am	9	32	0.41	0.50	0.00	0.38	0.00	0.00	1.00	1.00	0.36	-1.92	0.09
gear	10	32	3.69	0.74	4.00	3.62	1.48	3.00	5.00	2.00	0.53	-1.07	0.13
carb	11	32	2.81	1.62	2.00	2.65	1.48	1.00	8.00	7.00	1.05	1.26	0.29

Multivariate Regression

reg1 <- lm(data = df, 
           formula = mpg ~ cyl + disp 
           )

summary(reg1)


Call:
lm(formula = mpg ~ cyl + disp, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4213 -2.1722 -0.6362  1.1899  7.0516 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.66099    2.54700  13.609 4.02e-14 ***
cyl         -1.58728    0.71184  -2.230   0.0337 *  
disp        -0.02058    0.01026  -2.007   0.0542 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.055 on 29 degrees of freedom
Multiple R-squared:  0.7596,    Adjusted R-squared:  0.743 
F-statistic: 45.81 on 2 and 29 DF,  p-value: 1.058e-09

Examine Residuals

reg1$residuals

          Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
        -0.84395255         -0.84395255         -3.28885510          1.57324352 
  Hornet Sportabout             Valiant          Duster 360           Merc 240D 
         4.14732774         -2.40601638         -0.25267226         -0.89226849 
           Merc 230            Merc 280           Merc 280C          Merc 450SE 
        -2.61371193         -2.48751693         -3.88751693          0.11418581 
         Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
         1.01418581         -1.08581419         -1.84730532         -2.09430892 
  Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
         1.79401841          5.70804444          3.64629354          7.05160883 
      Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
        -4.33979314          0.08281514         -0.50535572         -1.45850859 
   Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
         5.47067308          0.61421953          0.16432359          4.04561603 
     Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
         1.06207504         -2.45270705         -0.76710662         -4.42126787

length(reg1$residuals)

[1] 32

round(x = sum(reg1$residuals), 
      digits = 14
      )

[1] 0

hist(x = reg1$residuals)

round(x = sum(reg1$residuals^2),
      digits = 14
      )

[1] 270.7403

Summary Stats on Residuals

sum(reg1$residuals)

[1] 2.664535e-15

# Display summary statistics using kable
kable(x = psych::describe(reg1$residuals),
      format = "markdown", 
      digits = 2)

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
X1	1	32	0	2.96	-0.64	-0.21	2.57	-4.42	7.05	11.47	0.67	-0.35	0.52

Interpretation of Coefficients

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

keep <- df |> 
  select(mpg, cyl, disp) 

table(keep$cyl)


 4  6  8 
11  7 14

Replicating t value for \(\beta_1\)

summary(reg1)


Call:
lm(formula = mpg ~ cyl + disp, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4213 -2.1722 -0.6362  1.1899  7.0516 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.66099    2.54700  13.609 4.02e-14 ***
cyl         -1.58728    0.71184  -2.230   0.0337 *  
disp        -0.02058    0.01026  -2.007   0.0542 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.055 on 29 degrees of freedom
Multiple R-squared:  0.7596,    Adjusted R-squared:  0.743 
F-statistic: 45.81 on 2 and 29 DF,  p-value: 1.058e-09

-1.58728 / 0.71184

[1] -2.229827

round(x = reg1$coefficients[2] / 0.71184, digits = 3 )

  cyl 
-2.23

Replicating p value for \(\beta_1\)

The probability that a particular statistical measure of an assumed probability distribution will be as or more extreme than what we observed, given the null is true/

?pt
2 * pt(q = - 2.230, 
      df = 32 - (2+1), 
      lower.tail = TRUE
      )

[1] 0.03365088

Replicating residual standard error for a regression

summary(reg1)


Call:
lm(formula = mpg ~ cyl + disp, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4213 -2.1722 -0.6362  1.1899  7.0516 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.66099    2.54700  13.609 4.02e-14 ***
cyl         -1.58728    0.71184  -2.230   0.0337 *  
disp        -0.02058    0.01026  -2.007   0.0542 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.055 on 29 degrees of freedom
Multiple R-squared:  0.7596,    Adjusted R-squared:  0.743 
F-statistic: 45.81 on 2 and 29 DF,  p-value: 1.058e-09

residuals <- as.numeric(reg1$residuals)
residuals

 [1] -0.84395255 -0.84395255 -3.28885510  1.57324352  4.14732774 -2.40601638
 [7] -0.25267226 -0.89226849 -2.61371193 -2.48751693 -3.88751693  0.11418581
[13]  1.01418581 -1.08581419 -1.84730532 -2.09430892  1.79401841  5.70804444
[19]  3.64629354  7.05160883 -4.33979314  0.08281514 -0.50535572 -1.45850859
[25]  5.47067308  0.61421953  0.16432359  4.04561603  1.06207504 -2.45270705
[31] -0.76710662 -4.42126787

sd(residuals) # adjust for degrees of freedom

[1] 2.955259

 residuals - mean(residuals)                      # deviation from mean of X

 [1] -0.84395255 -0.84395255 -3.28885510  1.57324352  4.14732774 -2.40601638
 [7] -0.25267226 -0.89226849 -2.61371193 -2.48751693 -3.88751693  0.11418581
[13]  1.01418581 -1.08581419 -1.84730532 -2.09430892  1.79401841  5.70804444
[19]  3.64629354  7.05160883 -4.33979314  0.08281514 -0.50535572 -1.45850859
[25]  5.47067308  0.61421953  0.16432359  4.04561603  1.06207504 -2.45270705
[31] -0.76710662 -4.42126787

(residuals - mean(residuals) )^2                  # squared deviation from mean of X

 [1]  0.712255903  0.712255903 10.816567877  2.475095181 17.200327418
 [6]  5.788914815  0.063843269  0.796143058  6.831490036  6.187740499
[11] 15.112787915  0.013038400  1.028572866  1.178992446  3.412536939
[16]  4.386129847  3.218502069 32.581771340 13.295456586 49.725187054
[21] 18.833804476  0.006858348  0.255384408  2.127247306 29.928263937
[26]  0.377265632  0.027002242 16.367009048  1.128003399  6.015771866
[31]  0.588452574 19.547609556

sum( (residuals - mean(residuals))^2 )            # total squared deviation from mean of X (SSR)

[1] 270.7403

sum( (residuals - mean(residuals))^2) / (32 - (2+1)) # average squared deviation from mean of X (variance)

[1] 9.335872

sqrt(sum((residuals - mean(residuals))^2) / (32 -(2+1))) # average squared deviation from mean of X

[1] 3.055466

summary(reg1)


Call:
lm(formula = mpg ~ cyl + disp, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4213 -2.1722 -0.6362  1.1899  7.0516 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.66099    2.54700  13.609 4.02e-14 ***
cyl         -1.58728    0.71184  -2.230   0.0337 *  
disp        -0.02058    0.01026  -2.007   0.0542 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.055 on 29 degrees of freedom
Multiple R-squared:  0.7596,    Adjusted R-squared:  0.743 
F-statistic: 45.81 on 2 and 29 DF,  p-value: 1.058e-09

sqrt(sum(reg1$residuals^2)/29) # since mean of residuals is 0 anyways !!!

[1] 3.055466

Multivariate Linear Regression with matrix algebra

?as.matrix
X <- as.matrix(x = cbind(df$cyl,
                         df$disp) 
               )

# Create a matrix with the additional column of 1s
additional_column <- matrix(data = 1,
                            nrow = nrow(X), # the desired number of rows.
                            ncol = 1        # the desired number of columns.
                            )

# Bind the additional column to the matrix X
X <- cbind(additional_column, X)

class(X)

[1] "matrix" "array"

Point estimates

Setup

Derivation

 y <- df$mpg
 solve(t(X) %*% X ) %*% t(X) %*% y

            [,1]
[1,] 34.66099474
[2,] -1.58727681
[3,] -0.02058363

 summary(reg1)


Call:
lm(formula = mpg ~ cyl + disp, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4213 -2.1722 -0.6362  1.1899  7.0516 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.66099    2.54700  13.609 4.02e-14 ***
cyl         -1.58728    0.71184  -2.230   0.0337 *  
disp        -0.02058    0.01026  -2.007   0.0542 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.055 on 29 degrees of freedom
Multiple R-squared:  0.7596,    Adjusted R-squared:  0.743 
F-statistic: 45.81 on 2 and 29 DF,  p-value: 1.058e-09

Standard errors

\(\sigma^2\) is the variance of the residual vector

Derivation

Lets understand the standard deviatio formula -

\(X^{'}X\)is a symmetric matrix.

?t
X

      [,1] [,2]  [,3]
 [1,]    1    6 160.0
 [2,]    1    6 160.0
 [3,]    1    4 108.0
 [4,]    1    6 258.0
 [5,]    1    8 360.0
 [6,]    1    6 225.0
 [7,]    1    8 360.0
 [8,]    1    4 146.7
 [9,]    1    4 140.8
[10,]    1    6 167.6
[11,]    1    6 167.6
[12,]    1    8 275.8
[13,]    1    8 275.8
[14,]    1    8 275.8
[15,]    1    8 472.0
[16,]    1    8 460.0
[17,]    1    8 440.0
[18,]    1    4  78.7
[19,]    1    4  75.7
[20,]    1    4  71.1
[21,]    1    4 120.1
[22,]    1    8 318.0
[23,]    1    8 304.0
[24,]    1    8 350.0
[25,]    1    8 400.0
[26,]    1    4  79.0
[27,]    1    4 120.3
[28,]    1    4  95.1
[29,]    1    8 351.0
[30,]    1    6 145.0
[31,]    1    8 301.0
[32,]    1    4 121.0

X_transposed <- t(X)  # t returns the transpose of x.
X_transposed

     [,1] [,2] [,3] [,4] [,5] [,6] [,7]  [,8]  [,9] [,10] [,11] [,12] [,13]
[1,]    1    1    1    1    1    1    1   1.0   1.0   1.0   1.0   1.0   1.0
[2,]    6    6    4    6    8    6    8   4.0   4.0   6.0   6.0   8.0   8.0
[3,]  160  160  108  258  360  225  360 146.7 140.8 167.6 167.6 275.8 275.8
     [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
[1,]   1.0     1     1     1   1.0   1.0   1.0   1.0     1     1     1     1
[2,]   8.0     8     8     8   4.0   4.0   4.0   4.0     8     8     8     8
[3,] 275.8   472   460   440  78.7  75.7  71.1 120.1   318   304   350   400
     [,26] [,27] [,28] [,29] [,30] [,31] [,32]
[1,]     1   1.0   1.0     1     1     1     1
[2,]     4   4.0   4.0     8     6     8     4
[3,]    79 120.3  95.1   351   145   301   121

dim(X_transposed)

[1]  3 32

dim(X)

[1] 32  3

X_transposed %*% X # symmetric matrix

       [,1]    [,2]      [,3]
[1,]   32.0   198.0    7383.1
[2,]  198.0  1324.0   51872.4
[3,] 7383.1 51872.4 2179627.5

solve(X_transposed %*% X )

             [,1]          [,2]          [,3]
[1,]  0.694871232 -0.1730656141  1.764992e-03
[2,] -0.173065614  0.0542769093 -7.054934e-04
[3,]  0.001764992 -0.0007054934  1.127006e-05

var_covar_matrix <- solve(X_transposed%*% X ) * sum(reg1$residuals^2) / 29
var_covar_matrix

            [,1]         [,2]          [,3]
[1,]  6.48722874 -1.615718386  0.0164777387
[2,] -1.61571839  0.506722267 -0.0065863960
[3,]  0.01647774 -0.006586396  0.0001052158

summary(reg1)


Call:
lm(formula = mpg ~ cyl + disp, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4213 -2.1722 -0.6362  1.1899  7.0516 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.66099    2.54700  13.609 4.02e-14 ***
cyl         -1.58728    0.71184  -2.230   0.0337 *  
disp        -0.02058    0.01026  -2.007   0.0542 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.055 on 29 degrees of freedom
Multiple R-squared:  0.7596,    Adjusted R-squared:  0.743 
F-statistic: 45.81 on 2 and 29 DF,  p-value: 1.058e-09

var_covar_matrix[2,2]^.5

[1] 0.7118443

var_covar_matrix[3,3]^.5

[1] 0.01025748

0.506722267  ^.5

[1] 0.7118443

0.0001052158 ^.5

[1] 0.01025748

https://www.scribbr.com/statistics/coefficient-of-determination/