set.seed(123)
n <- 100
x1 <- rnorm(n, mean = 5, sd = 2)
x2 <- rnorm(n, mean = 0, sd = 1)
x3 <- rnorm(n, mean = -3, sd = 1.5)
x4 <- rnorm(n, mean = 2, sd = 1)
# View the distributions of the simulated predictors:
hist(x1)
hist(x2)
hist(x3)
hist(x4)
# NOTE: the response Y must already exist before building the data frame.
# If it is not defined in an earlier chunk, something like the following
# (an assumption, chosen to be roughly consistent with the fitted
# coefficients below) would do:
# Y <- 2.5 + 1.4 * x1 - 2.1 * x2 + 0.8 * x3 + rnorm(n)
df <- data.frame(Y, x1, x2, x3, x4)
mdl_1 <- lm(Y ~ x1 + x2 + x3, data = df)
mdl_2 <- lm(Y ~ x1 + x2 + x3 + x4, data = df)
summary(mdl_1); print("OOOO-GAHHHH-BOOO-GAHHHHHH"); summary(mdl_2)
##
## Call:
## lm(formula = Y ~ x1 + x2 + x3, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.48374 -0.56822  0.08631  0.69916  2.46661
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.57573    0.33491   7.691 1.27e-11 ***
## x1           1.38852    0.05416  25.636  < 2e-16 ***
## x2          -2.14259    0.10145 -21.120  < 2e-16 ***
## x3           0.76725    0.06935  11.064  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9746 on 96 degrees of freedom
## Multiple R-squared: 0.9263, Adjusted R-squared: 0.924
## F-statistic: 402.3 on 3 and 96 DF, p-value: < 2.2e-16
## [1] "OOOO-GAHHHH-BOOO-GAHHHHHH"
##
## Call:
## lm(formula = Y ~ x1 + x2 + x3 + x4, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.47336 -0.58010  0.07461  0.68778  2.46552
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.62249    0.38676   6.781 1.01e-09 ***
## x1           1.38788    0.05449  25.468  < 2e-16 ***
## x2          -2.14151    0.10204 -20.986  < 2e-16 ***
## x3           0.76636    0.06978  10.982  < 2e-16 ***
## x4          -0.02333    0.09506  -0.245    0.807
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9794 on 95 degrees of freedom
## Multiple R-squared: 0.9264, Adjusted R-squared: 0.9233
## F-statistic: 298.8 on 4 and 95 DF, p-value: < 2.2e-16
Consider that \(\text{MDL}_1\)’s adjusted \(R^2 = 0.924\) and \(\text{MDL}_2\)’s adjusted \(R^2 = 0.9233\), which is only slightly worse. Moreover, the only slope whose p-value fails the significance test is \(x4\)’s.
Furthermore, recall that \(R^2\) measures the proportion of the variance in \(y_i\) explained by the fitted values \(\hat{y}_i\); in other words, it is the share of the variance in the data that the model accounts for.
Following the defining equations:
\[ R^2 = \frac{SS_{reg}}{SST} = 1 - \frac{RSS}{SST} \\ \text{where: }\\ SS_{reg} = \sum^{n}_{i=1}(\hat{y}_i - \bar{y})^2 \\ RSS = \sum^{n}_{i=1}(y_i - \hat{y}_i)^2 \\ SST = \sum^{n}_{i=1}(y_i - \bar{y})^2 \]
\[ \implies \\ R^2 = \frac{\sum^{n}_{i=1}(\hat{y}_i - \bar{y})^2}{\sum^{n}_{i=1}(y_i - \bar{y})^2} = 1 - \frac{\sum^{n}_{i=1}(y_i - \hat{y}_i)^2}{\sum^{n}_{i=1}(y_i - \bar{y})^2} \]
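As a quick sanity check, these quantities can be computed directly from \(\text{MDL}_1\). This is a minimal sketch in base R; the names y_hat, SSreg, RSS, and SST are introduced here only for illustration.

y_hat <- fitted(mdl_1)                       # fitted values
SSreg <- sum((y_hat - mean(df$Y))^2)         # regression sum of squares
RSS   <- sum(residuals(mdl_1)^2)             # residual sum of squares
SST   <- sum((df$Y - mean(df$Y))^2)          # total sum of squares
c(SSreg / SST, 1 - RSS / SST, summary(mdl_1)$r.squared)  # all three should agree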
And consider the adjusted form:
\[ R^{2}_{Adj} = 1 - \frac{\frac{RSS}{n-p-1}}{\frac{SST}{n-1}} \\ \text{where: } n = \text{Sample Size}, \quad p = \text{Number of Predictors} \]
\[ \text{or equivalently, } \\ R^{2}_{Adj} = 1 - \frac{\frac{\text{UNEXPLAINED}}{n-p-1}}{\frac{\text{TOTAL}}{n-1}} \]
The motivation for \(R^{2}_{Adj}\) over \(R^2\) is that plain \(R^2\) never decreases as we add predictors. This can be misleading in model selection because it suggests that a model with more predictors is always better, even when those predictors do not meaningfully help explain the variability in \(Y\). So it is important to use the adjusted form rather than the plain one when comparing models with different numbers of predictors; the penalty on extra terms guards against over-fitting.
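Likewise, the adjusted form can be checked by hand. A small sketch, assuming \(\text{MDL}_1\) with \(n = 100\) observations and \(p = 3\) predictors; n_obs and p are names introduced only for this illustration.

n_obs <- nrow(df)                                   # sample size n
p     <- 3                                          # number of predictors in mdl_1
RSS   <- sum(residuals(mdl_1)^2)                    # residual sum of squares
SST   <- sum((df$Y - mean(df$Y))^2)                 # total sum of squares
1 - (RSS / (n_obs - p - 1)) / (SST / (n_obs - 1))   # should match summary(mdl_1)$adj.r.squared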
anova(mdl_2)
## Analysis of Variance Table
##
## Response: Y
##           Df Sum Sq Mean Sq  F value Pr(>F)
## x1         1 616.94  616.94 643.1754 <2e-16 ***
## x2         1 413.15  413.15 430.7183 <2e-16 ***
## x3         1 116.27  116.27 121.2139 <2e-16 ***
## x4         1   0.06    0.06   0.0602 0.8067
## Residuals 95  91.12    0.96
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Well… well… well, if it isn’t \(x4\) with a supah-dupah high p-value (low F-test value).
In other words: 0.8067 (way above 0.05) \(\implies\) not statistically significant.
Meaning, \(x4\) is USELESS!
It’s time to let go…
Use \(\text{MDL}_1\) ( \(Y \text{ ~ } x1 + x2 + x3\) ) instead.
According to the ANOVA test, our predictor \(x4\) does not contribute to explaining \(Y\) at a significant level.
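The same conclusion can be reached with a direct nested-model comparison: calling anova() on the two fitted models runs the partial F-test for the terms they differ by (here, only \(x4\)), so this is just another view of the table above.

anova(mdl_1, mdl_2)   # partial F-test for adding x4; a large p-value means x4 adds nothing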
x3 <- x1 + rnorm(n, mean = 0, sd = 1)   # overwrite x3 with a noisy copy of x1 (induces multicollinearity)
Y1 <- x1 + x2                           # an exact, noise-free combination of x1 and x2
mdl_3 <- lm(Y1 ~ x1 + x2 + x3)          # x3 entered last
mdl_4 <- lm(Y1 ~ x3 + x2 + x1)          # x3 entered first
anova(mdl_3); print("bing-bing-bong"); anova(mdl_4)
## Warning in anova.lm(mdl_3): ANOVA F-tests on an essentially perfect fit are
## unreliable
## Analysis of Variance Table
##
## Response: Y1
##           Df  Sum Sq Mean Sq    F value Pr(>F)
## x1         1 312.874 312.874 1.5780e+31 <2e-16 ***
## x2         1  92.344  92.344 4.6576e+30 <2e-16 ***
## x3         1   0.000   0.000 1.0130e-01 0.7509
## Residuals 96   0.000   0.000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "bing-bing-bong"
## Warning in anova.lm(mdl_4): ANOVA F-tests on an essentially perfect fit are
## unreliable
## Analysis of Variance Table
##
## Response: Y1
##           Df  Sum Sq Mean Sq    F value    Pr(>F)
## x3         1 260.505 260.505 1.4305e+31 < 2.2e-16 ***
## x2         1  73.698  73.698 4.0468e+30 < 2.2e-16 ***
## x1         1  71.015  71.015 3.8995e+30 < 2.2e-16 ***
## Residuals 96   0.000   0.000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Multicollinearity occurs when one predictor can be linearly predicted by others, causing redundancy and instability in regression estimates.
Consider the model-fit issue first: the warning “ANOVA F-tests on an essentially perfect fit are unreliable” appears because \(Y1\) is an exact, noise-free combination of \(x1\) and \(x2\), so both models fit the data almost perfectly and the residuals are essentially zero. On top of that, the predictors themselves are highly collinear, since \(x3\) is just \(x1\) plus noise.
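One way to quantify the redundancy is the variance inflation factor. Below is a minimal base-R sketch; r2_x3 and vif_x3 are names introduced only for this illustration (car::vif() on the fitted model would report comparable values).

cor(x1, x3)                                      # x3 was built as x1 plus noise, so this is high
r2_x3  <- summary(lm(x3 ~ x1 + x2))$r.squared    # how well the other predictors explain x3
vif_x3 <- 1 / (1 - r2_x3)                        # variance inflation factor for x3
vif_x3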
In \(\text{MDL}_3\), \(x1\) and \(x2\) are highly significant while \(x3\) is not (p ≈ 0.75), suggesting it has little extra explanatory power. In \(\text{MDL}_4\), however, all three variables appear significant. That is a misleading result caused by the collinearity.
ANOVA assigns variance sequentially, meaning the first variable entered absorbs the most explainable variance.
Later variables only explain what remains. If predictors are highly correlated, as with \(x3\) (a noisy version of \(x1\)), the one added last appears insignificant because \(x1\) has already captured most of its contribution. Had \(x3\) been entered first, it would look important, even though it is redundant.
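To see the order effect without the perfect-fit warning, the comparison can be rerun with a noisy response. This is a sketch; Y2 is a hypothetical response introduced only for the illustration. drop1() tests each term after all of the others are already in the model, so its verdict on \(x3\) does not depend on the order of entry.

Y2 <- x1 + x2 + rnorm(n, mean = 0, sd = 1)   # hypothetical noisy response
anova(lm(Y2 ~ x1 + x2 + x3))                 # x3 entered last: little left for it to explain
anova(lm(Y2 ~ x3 + x2 + x1))                 # x3 entered first: it absorbs x1's share
drop1(lm(Y2 ~ x1 + x2 + x3), test = "F")     # order-independent (marginal) F-tests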