Problem 1) Q1 and Q2)

set.seed(123)                        # reproducibility
n  <- 100                            # sample size
x1 <- rnorm(n, mean = 5, sd = 2)     # four independently drawn normal predictors
x2 <- rnorm(n, mean = 0, sd = 1)
x3 <- rnorm(n, mean = -3, sd = 1.5)
x4 <- rnorm(n, mean = 2, sd = 1)

# View the distribution of each predictor :
hist(x1)

hist(x2)

hist(x3)

hist(x4)

Q3)
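The code chunk for Q3 is not shown, but it presumably constructed the response \(Y\). A plausible sketch, roughly consistent with the coefficients recovered in Q6 (the exact equation and noise level here are assumptions, not the original code), would be:

# Assumed response: linear in x1, x2, x3 plus noise; x4 deliberately excluded
Y <- 2.5 + 1.4 * x1 - 2.1 * x2 + 0.8 * x3 + rnorm(n, mean = 0, sd = 1)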

Q4)

df <- data.frame(Y, x1, x2, x3, x4)

Q5)

mdl_1 <- lm(Y ~ x1 + x2 + x3, data = df)
mdl_2 <- lm(Y ~ x1 + x2 + x3 + x4, data = df)

Q6)

summary(mdl_1); print("OOOO-GAHHHH-BOOO-GAHHHHHH");summary(mdl_2)
## 
## Call:
## lm(formula = Y ~ x1 + x2 + x3, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.48374 -0.56822  0.08631  0.69916  2.46661 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.57573    0.33491   7.691 1.27e-11 ***
## x1           1.38852    0.05416  25.636  < 2e-16 ***
## x2          -2.14259    0.10145 -21.120  < 2e-16 ***
## x3           0.76725    0.06935  11.064  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9746 on 96 degrees of freedom
## Multiple R-squared:  0.9263, Adjusted R-squared:  0.924 
## F-statistic: 402.3 on 3 and 96 DF,  p-value: < 2.2e-16
## [1] "OOOO-GAHHHH-BOOO-GAHHHHHH"
## 
## Call:
## lm(formula = Y ~ x1 + x2 + x3 + x4, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.47336 -0.58010  0.07461  0.68778  2.46552 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.62249    0.38676   6.781 1.01e-09 ***
## x1           1.38788    0.05449  25.468  < 2e-16 ***
## x2          -2.14151    0.10204 -20.986  < 2e-16 ***
## x3           0.76636    0.06978  10.982  < 2e-16 ***
## x4          -0.02333    0.09506  -0.245    0.807    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9794 on 95 degrees of freedom
## Multiple R-squared:  0.9264, Adjusted R-squared:  0.9233 
## F-statistic: 298.8 on 4 and 95 DF,  p-value: < 2.2e-16

Consider that \(\text{MDL}_1\)’s adjusted \(R^2 = 0.9240\) while \(\text{MDL}_2\)’s adjusted \(R^2 = 0.9233\), which is slightly worse despite the extra predictor. Moreover, the only slope whose t-test fails is \(x4\)’s (p = 0.807).

Furthermore, recall that \(R^2\) measures the proportion of the variance in \(y_i\) explained by \(\hat{y}_i\); in other words, it is the fraction of the data’s variability accounted for by the model.

Following Equations :

\[ R^2 = \frac{SS_{reg}}{SST} = 1 - \frac{RSS}{SST} \\ \text{For : }\\ SS_{reg} = \sum^{n}_{i}(\hat{y}_i - \bar{y})^2 \\ RSS = \sum^{n}_{i}(y_i - \hat{y}_i)^2 \\ SST = \sum^{n}_{i}(y_i - \bar{y})^2 \]

\[ \implies \\ R^2 = \frac{\sum^{n}_{i}(\hat{y}_i - \bar{y})^2}{\sum^{n}_{i}(y_i - \bar{y})^2}\\ = 1 - \frac{\sum^{n}_{i}(y_i - \hat{y}_i)^2}{\sum^{n}_{i}(y_i - \bar{y})^2} \]

And consider the adjusted form :

\[ R^{2}_{Adj} = 1 - \frac{\frac{RSS}{n-p-1}}{\frac{SST}{n-1}} \\ \text{For : } n = \text{Sample Size}\\ p = \text{Number of Predictors} \]

\[ \text{or equivalently, } \\ R^{2}_{Adj} = 1 - \frac{\frac{\text{UNEXPLAINED}}{n-p-1}}{\frac{\text{TOTAL}}{n-1}} \]

The motivation for \(R^{2}_{Adj}\) over \(R^2\) is that \(R^2\) can never decrease as predictors are added. This can be misleading in model selection because it suggests that a model with more predictors is always at least as good, even if those predictors do not significantly contribute to explaining the variability in \(Y\). So it is important to use the adjusted form when comparing models with different numbers of predictors: the penalty on extra parameters guards against over-fitting.
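As a sanity check, the adjusted \(R^2\) reported above can be reproduced directly from the formula. A minimal sketch, assuming mdl_2 and df from Q4/Q5 are still in the workspace:

# Recompute R^2 and adjusted R^2 for mdl_2 by hand
n_obs <- nrow(df)
p     <- 4                                  # number of predictors in mdl_2
rss   <- sum(residuals(mdl_2)^2)            # unexplained sum of squares
sst   <- sum((df$Y - mean(df$Y))^2)         # total sum of squares
c(r2       = 1 - rss / sst,
  r2_adj   = 1 - (rss / (n_obs - p - 1)) / (sst / (n_obs - 1)),
  reported = summary(mdl_2)$adj.r.squared)  # should agree with the summary output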

Problem 3)

Q1)

anova(mdl_2)
## Analysis of Variance Table
## 
## Response: Y
##           Df Sum Sq Mean Sq  F value Pr(>F)    
## x1         1 616.94  616.94 643.1754 <2e-16 ***
## x2         1 413.15  413.15 430.7183 <2e-16 ***
## x3         1 116.27  116.27 121.2139 <2e-16 ***
## x4         1   0.06    0.06   0.0602 0.8067    
## Residuals 95  91.12    0.96                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Well… well… well, if it isn’t \(x4\) with a supah-dupah high p-value (and a correspondingly low F value).

In other words: 0.8067 (well above 0.05) \(\implies\) not statistically significant.

Meaning, \(x4\) is USELESS!

It’s time to let go…

Use \(\text{MDL}_1\) ( \(Y \text{ ~ } x1 + x2 + x3\) ) instead.

According to the ANOVA test, our predictor \(x4\) does not contribute to explaining \(Y\) at a significant level.
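The same conclusion follows from comparing the two nested models directly. A sketch using a partial F-test on the models fitted in Q5, which should reproduce \(x4\)’s non-significant p-value:

# Partial F-test: does adding x4 to mdl_1 significantly reduce the RSS?
anova(mdl_1, mdl_2)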

Problem 4)

Q1)

x3 <- x1 + rnorm(n, mean = 0, sd = 1)  # x3 is a noisy copy of x1 (induces collinearity)
Y1 <- x1 + x2                          # Y1 is an exact linear combination of x1 and x2 (no noise)

Q2)

mdl_3 <- lm(Y1 ~ x1 + x2 + x3)  # x1 entered before x3
mdl_4 <- lm(Y1 ~ x3 + x2 + x1)  # same predictors, x3 entered first

Q3)

anova(mdl_3); print("bing-bing-bong");anova(mdl_4)
## Warning in anova.lm(mdl_3): ANOVA F-tests on an essentially perfect fit are
## unreliable
## Analysis of Variance Table
## 
## Response: Y1
##           Df  Sum Sq Mean Sq    F value Pr(>F)    
## x1         1 312.874 312.874 1.5780e+31 <2e-16 ***
## x2         1  92.344  92.344 4.6576e+30 <2e-16 ***
## x3         1   0.000   0.000 1.0130e-01 0.7509    
## Residuals 96   0.000   0.000                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "bing-bing-bong"
## Warning in anova.lm(mdl_4): ANOVA F-tests on an essentially perfect fit are
## unreliable
## Analysis of Variance Table
## 
## Response: Y1
##           Df  Sum Sq Mean Sq    F value    Pr(>F)    
## x3         1 260.505 260.505 1.4305e+31 < 2.2e-16 ***
## x2         1  73.698  73.698 4.0468e+30 < 2.2e-16 ***
## x1         1  71.015  71.015 3.8995e+30 < 2.2e-16 ***
## Residuals 96   0.000   0.000                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multicollinearity occurs when one predictor can be linearly predicted by others, causing redundancy and instability in regression estimates.
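One way to quantify this redundancy is the variance inflation factor, \(VIF_j = 1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing one predictor on the others. A base-R sketch for \(x3\) (no extra packages assumed):

# VIF for x3: how well do the other predictors explain it?
r2_x3  <- summary(lm(x3 ~ x1 + x2))$r.squared
vif_x3 <- 1 / (1 - r2_x3)   # large values (roughly > 5-10) flag collinearity
vif_x3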

Consider the model-fit issue first: the warning “ANOVA F-tests on an essentially perfect fit are unreliable” appears because \(Y1 = x1 + x2\) exactly, with no noise term, so the models fit the data almost perfectly and the residuals are essentially zero. With near-zero residual variance, the F statistics become astronomically large and should not be taken at face value.

In \(\text{MDL}_3\), \(x1\) and \(x2\) are highly significant while \(x3\) is not (p ≈ 0.75), suggesting it adds no explanatory power once \(x1\) and \(x2\) are already in the model.

In \(\text{MDL}_4\), however, all three variables appear significant. This is misleading: because \(x3\) (a noisy copy of \(x1\)) is entered first, it absorbs variance that really belongs to \(x1\).

Q4) Why does the order of the independent variables affect the ANOVA results?

The sequential (Type I) sums of squares used by anova() assign variance in the order the terms are entered, so the first variable gets first claim on any variance it shares with later ones.

Later variables can only explain what remains. If predictors are highly correlated, as with \(x3\) (a noisy version of \(x1\)), the one entered last appears insignificant because \(x1\) has already captured most of its contribution.

However, if \(x3\) were entered first, it would seem more important, even though it is redundant with \(x1\). The order-free check sketched below makes this explicit.
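A sketch of a test that does not depend on entry order: drop1() tests each term given all the others, so mdl_3 and mdl_4 give the same answers (though the “essentially perfect fit” warning still applies here).

# Marginal F-tests: each predictor is tested after all the others,
# so the results are identical for mdl_3 and mdl_4
drop1(mdl_3, test = "F")
drop1(mdl_4, test = "F")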