Variance
mtcars
dataset
?mtcars
<- mtcars df
A data frame with 32 observations on 11 (numeric) variables.
[, 1] | mpg |
Miles/(US) gallon |
[, 2] | cyl |
Number of cylinders |
[, 3] | disp |
Displacement (cu.in.) |
[, 4] | hp |
Gross horsepower |
[, 5] | drat |
Rear axle ratio |
[, 6] | wt |
Weight (1000 lbs) |
[, 7] | qsec |
1/4 mile time |
[, 8] | vs |
Engine (0 = V-shaped, 1 = straight) |
[, 9] | am |
Transmission (0 = automatic, 1 = manual) |
[,10] | gear |
Number of forward gears |
[,11] | carb |
Number of carburetors |
# Load necessary libraries
library(psych)
library(knitr)
# Get summary statistics for the mtcars dataset
<- describe(mtcars)
summary_stats
# Display summary statistics using kable
kable(summary_stats, format = "markdown", digits = 2)
vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mpg | 1 | 32 | 20.09 | 6.03 | 19.20 | 19.70 | 5.41 | 10.40 | 33.90 | 23.50 | 0.61 | -0.37 | 1.07 |
cyl | 2 | 32 | 6.19 | 1.79 | 6.00 | 6.23 | 2.97 | 4.00 | 8.00 | 4.00 | -0.17 | -1.76 | 0.32 |
disp | 3 | 32 | 230.72 | 123.94 | 196.30 | 222.52 | 140.48 | 71.10 | 472.00 | 400.90 | 0.38 | -1.21 | 21.91 |
hp | 4 | 32 | 146.69 | 68.56 | 123.00 | 141.19 | 77.10 | 52.00 | 335.00 | 283.00 | 0.73 | -0.14 | 12.12 |
drat | 5 | 32 | 3.60 | 0.53 | 3.70 | 3.58 | 0.70 | 2.76 | 4.93 | 2.17 | 0.27 | -0.71 | 0.09 |
wt | 6 | 32 | 3.22 | 0.98 | 3.33 | 3.15 | 0.77 | 1.51 | 5.42 | 3.91 | 0.42 | -0.02 | 0.17 |
qsec | 7 | 32 | 17.85 | 1.79 | 17.71 | 17.83 | 1.42 | 14.50 | 22.90 | 8.40 | 0.37 | 0.34 | 0.32 |
vs | 8 | 32 | 0.44 | 0.50 | 0.00 | 0.42 | 0.00 | 0.00 | 1.00 | 1.00 | 0.24 | -2.00 | 0.09 |
am | 9 | 32 | 0.41 | 0.50 | 0.00 | 0.38 | 0.00 | 0.00 | 1.00 | 1.00 | 0.36 | -1.92 | 0.09 |
gear | 10 | 32 | 3.69 | 0.74 | 4.00 | 3.62 | 1.48 | 3.00 | 5.00 | 2.00 | 0.53 | -1.07 | 0.13 |
carb | 11 | 32 | 2.81 | 1.62 | 2.00 | 2.65 | 1.48 | 1.00 | 8.00 | 7.00 | 1.05 | 1.26 | 0.29 |
Multivariate Regression
<- lm(data = df,
reg1 formula = mpg ~ cyl + disp
)
summary(reg1)
Call:
lm(formula = mpg ~ cyl + disp, data = df)
Residuals:
Min 1Q Median 3Q Max
-4.4213 -2.1722 -0.6362 1.1899 7.0516
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.66099 2.54700 13.609 4.02e-14 ***
cyl -1.58728 0.71184 -2.230 0.0337 *
disp -0.02058 0.01026 -2.007 0.0542 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.055 on 29 degrees of freedom
Multiple R-squared: 0.7596, Adjusted R-squared: 0.743
F-statistic: 45.81 on 2 and 29 DF, p-value: 1.058e-09
Examine Residuals
$residuals reg1
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
-0.84395255 -0.84395255 -3.28885510 1.57324352
Hornet Sportabout Valiant Duster 360 Merc 240D
4.14732774 -2.40601638 -0.25267226 -0.89226849
Merc 230 Merc 280 Merc 280C Merc 450SE
-2.61371193 -2.48751693 -3.88751693 0.11418581
Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
1.01418581 -1.08581419 -1.84730532 -2.09430892
Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
1.79401841 5.70804444 3.64629354 7.05160883
Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
-4.33979314 0.08281514 -0.50535572 -1.45850859
Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
5.47067308 0.61421953 0.16432359 4.04561603
Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
1.06207504 -2.45270705 -0.76710662 -4.42126787
length(reg1$residuals)
[1] 32
round(x = sum(reg1$residuals),
digits = 14
)
[1] 0
hist(x = reg1$residuals)
round(x = sum(reg1$residuals^2),
digits = 14
)
[1] 270.7403
Summary Stats on Residuals
sum(reg1$residuals)
[1] 2.664535e-15
# Display summary statistics using kable
kable(x = psych::describe(reg1$residuals),
format = "markdown",
digits = 2)
vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
X1 | 1 | 32 | 0 | 2.96 | -0.64 | -0.21 | 2.57 | -4.42 | 7.05 | 11.47 | 0.67 | -0.35 | 0.52 |
Interpretation of Coefficients
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
<- df |>
keep select(mpg, cyl, disp)
table(keep$cyl)
4 6 8
11 7 14
Replicating t value for \(\beta_1\)
summary(reg1)
Call:
lm(formula = mpg ~ cyl + disp, data = df)
Residuals:
Min 1Q Median 3Q Max
-4.4213 -2.1722 -0.6362 1.1899 7.0516
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.66099 2.54700 13.609 4.02e-14 ***
cyl -1.58728 0.71184 -2.230 0.0337 *
disp -0.02058 0.01026 -2.007 0.0542 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.055 on 29 degrees of freedom
Multiple R-squared: 0.7596, Adjusted R-squared: 0.743
F-statistic: 45.81 on 2 and 29 DF, p-value: 1.058e-09
-1.58728 / 0.71184
[1] -2.229827
round(x = reg1$coefficients[2] / 0.71184, digits = 3 )
cyl
-2.23
Replicating p value for \(\beta_1\)
The probability that a particular statistical measure of an assumed probability distribution will be as or more extreme than what we observed, given the null is true/
?pt2 * pt(q = - 2.230,
df = 32 - (2+1),
lower.tail = TRUE
)
[1] 0.03365088
Replicating residual standard error for a regression
summary(reg1)
Call:
lm(formula = mpg ~ cyl + disp, data = df)
Residuals:
Min 1Q Median 3Q Max
-4.4213 -2.1722 -0.6362 1.1899 7.0516
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.66099 2.54700 13.609 4.02e-14 ***
cyl -1.58728 0.71184 -2.230 0.0337 *
disp -0.02058 0.01026 -2.007 0.0542 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.055 on 29 degrees of freedom
Multiple R-squared: 0.7596, Adjusted R-squared: 0.743
F-statistic: 45.81 on 2 and 29 DF, p-value: 1.058e-09
<- as.numeric(reg1$residuals)
residuals residuals
[1] -0.84395255 -0.84395255 -3.28885510 1.57324352 4.14732774 -2.40601638
[7] -0.25267226 -0.89226849 -2.61371193 -2.48751693 -3.88751693 0.11418581
[13] 1.01418581 -1.08581419 -1.84730532 -2.09430892 1.79401841 5.70804444
[19] 3.64629354 7.05160883 -4.33979314 0.08281514 -0.50535572 -1.45850859
[25] 5.47067308 0.61421953 0.16432359 4.04561603 1.06207504 -2.45270705
[31] -0.76710662 -4.42126787
sd(residuals) # adjust for degrees of freedom
[1] 2.955259
- mean(residuals) # deviation from mean of X residuals
[1] -0.84395255 -0.84395255 -3.28885510 1.57324352 4.14732774 -2.40601638
[7] -0.25267226 -0.89226849 -2.61371193 -2.48751693 -3.88751693 0.11418581
[13] 1.01418581 -1.08581419 -1.84730532 -2.09430892 1.79401841 5.70804444
[19] 3.64629354 7.05160883 -4.33979314 0.08281514 -0.50535572 -1.45850859
[25] 5.47067308 0.61421953 0.16432359 4.04561603 1.06207504 -2.45270705
[31] -0.76710662 -4.42126787
- mean(residuals) )^2 # squared deviation from mean of X (residuals
[1] 0.712255903 0.712255903 10.816567877 2.475095181 17.200327418
[6] 5.788914815 0.063843269 0.796143058 6.831490036 6.187740499
[11] 15.112787915 0.013038400 1.028572866 1.178992446 3.412536939
[16] 4.386129847 3.218502069 32.581771340 13.295456586 49.725187054
[21] 18.833804476 0.006858348 0.255384408 2.127247306 29.928263937
[26] 0.377265632 0.027002242 16.367009048 1.128003399 6.015771866
[31] 0.588452574 19.547609556
sum( (residuals - mean(residuals))^2 ) # total squared deviation from mean of X (SSR)
[1] 270.7403
sum( (residuals - mean(residuals))^2) / (32 - (2+1)) # average squared deviation from mean of X (variance)
[1] 9.335872
sqrt(sum((residuals - mean(residuals))^2) / (32 -(2+1))) # average squared deviation from mean of X
[1] 3.055466
summary(reg1)
Call:
lm(formula = mpg ~ cyl + disp, data = df)
Residuals:
Min 1Q Median 3Q Max
-4.4213 -2.1722 -0.6362 1.1899 7.0516
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.66099 2.54700 13.609 4.02e-14 ***
cyl -1.58728 0.71184 -2.230 0.0337 *
disp -0.02058 0.01026 -2.007 0.0542 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.055 on 29 degrees of freedom
Multiple R-squared: 0.7596, Adjusted R-squared: 0.743
F-statistic: 45.81 on 2 and 29 DF, p-value: 1.058e-09
sqrt(sum(reg1$residuals^2)/29) # since mean of residuals is 0 anyways !!!
[1] 3.055466
Multivariate Linear Regression with matrix algebra
?as.matrix<- as.matrix(x = cbind(df$cyl,
X $disp)
df
)
# Create a matrix with the additional column of 1s
<- matrix(data = 1,
additional_column nrow = nrow(X), # the desired number of rows.
ncol = 1 # the desired number of columns.
)
# Bind the additional column to the matrix X
<- cbind(additional_column, X)
X
class(X)
[1] "matrix" "array"
Point estimates
Setup
Derivation
<- df$mpg
y solve(t(X) %*% X ) %*% t(X) %*% y
[,1]
[1,] 34.66099474
[2,] -1.58727681
[3,] -0.02058363
summary(reg1)
Call:
lm(formula = mpg ~ cyl + disp, data = df)
Residuals:
Min 1Q Median 3Q Max
-4.4213 -2.1722 -0.6362 1.1899 7.0516
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.66099 2.54700 13.609 4.02e-14 ***
cyl -1.58728 0.71184 -2.230 0.0337 *
disp -0.02058 0.01026 -2.007 0.0542 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.055 on 29 degrees of freedom
Multiple R-squared: 0.7596, Adjusted R-squared: 0.743
F-statistic: 45.81 on 2 and 29 DF, p-value: 1.058e-09
Standard errors
- \(\sigma^2\) is the variance of the residual vector
Derivation
Lets understand the standard deviatio formula -
\(X^{'}X\)is a symmetric matrix.
?t X
[,1] [,2] [,3]
[1,] 1 6 160.0
[2,] 1 6 160.0
[3,] 1 4 108.0
[4,] 1 6 258.0
[5,] 1 8 360.0
[6,] 1 6 225.0
[7,] 1 8 360.0
[8,] 1 4 146.7
[9,] 1 4 140.8
[10,] 1 6 167.6
[11,] 1 6 167.6
[12,] 1 8 275.8
[13,] 1 8 275.8
[14,] 1 8 275.8
[15,] 1 8 472.0
[16,] 1 8 460.0
[17,] 1 8 440.0
[18,] 1 4 78.7
[19,] 1 4 75.7
[20,] 1 4 71.1
[21,] 1 4 120.1
[22,] 1 8 318.0
[23,] 1 8 304.0
[24,] 1 8 350.0
[25,] 1 8 400.0
[26,] 1 4 79.0
[27,] 1 4 120.3
[28,] 1 4 95.1
[29,] 1 8 351.0
[30,] 1 6 145.0
[31,] 1 8 301.0
[32,] 1 4 121.0
<- t(X) # t returns the transpose of x.
X_transposed X_transposed
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,] 1 1 1 1 1 1 1 1.0 1.0 1.0 1.0 1.0 1.0
[2,] 6 6 4 6 8 6 8 4.0 4.0 6.0 6.0 8.0 8.0
[3,] 160 160 108 258 360 225 360 146.7 140.8 167.6 167.6 275.8 275.8
[,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
[1,] 1.0 1 1 1 1.0 1.0 1.0 1.0 1 1 1 1
[2,] 8.0 8 8 8 4.0 4.0 4.0 4.0 8 8 8 8
[3,] 275.8 472 460 440 78.7 75.7 71.1 120.1 318 304 350 400
[,26] [,27] [,28] [,29] [,30] [,31] [,32]
[1,] 1 1.0 1.0 1 1 1 1
[2,] 4 4.0 4.0 8 6 8 4
[3,] 79 120.3 95.1 351 145 301 121
dim(X_transposed)
[1] 3 32
dim(X)
[1] 32 3
%*% X # symmetric matrix X_transposed
[,1] [,2] [,3]
[1,] 32.0 198.0 7383.1
[2,] 198.0 1324.0 51872.4
[3,] 7383.1 51872.4 2179627.5
solve(X_transposed %*% X )
[,1] [,2] [,3]
[1,] 0.694871232 -0.1730656141 1.764992e-03
[2,] -0.173065614 0.0542769093 -7.054934e-04
[3,] 0.001764992 -0.0007054934 1.127006e-05
<- solve(X_transposed%*% X ) * sum(reg1$residuals^2) / 29
var_covar_matrix var_covar_matrix
[,1] [,2] [,3]
[1,] 6.48722874 -1.615718386 0.0164777387
[2,] -1.61571839 0.506722267 -0.0065863960
[3,] 0.01647774 -0.006586396 0.0001052158
summary(reg1)
Call:
lm(formula = mpg ~ cyl + disp, data = df)
Residuals:
Min 1Q Median 3Q Max
-4.4213 -2.1722 -0.6362 1.1899 7.0516
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.66099 2.54700 13.609 4.02e-14 ***
cyl -1.58728 0.71184 -2.230 0.0337 *
disp -0.02058 0.01026 -2.007 0.0542 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.055 on 29 degrees of freedom
Multiple R-squared: 0.7596, Adjusted R-squared: 0.743
F-statistic: 45.81 on 2 and 29 DF, p-value: 1.058e-09
2,2]^.5 var_covar_matrix[
[1] 0.7118443
3,3]^.5 var_covar_matrix[
[1] 0.01025748
0.506722267 ^.5
[1] 0.7118443
0.0001052158 ^.5
[1] 0.01025748
https://www.scribbr.com/statistics/coefficient-of-determination/