Assignment

#HW1: Find a data set to which you can fit a multiple linear regression and interpret your results.

Multiple linear regression is used when you have one dependent variable (Y) and two or more independent variables (X₁, X₂, …, Xₙ):

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \varepsilon$$

#Here I use the “trees” data set, which is built into R.

str(trees)

## 'data.frame':    31 obs. of  3 variables:
##  $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
##  $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
##  $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...

#I want to predict Volume (dependent variable) from Girth and Height (independent variables).
trees_lm <- lm(Volume ~ Girth + Height, data = trees)
summary(trees_lm)

## 
## Call:
## lm(formula = Volume ~ Girth + Height, data = trees)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4065 -2.6493 -0.2876  2.2003  8.4847 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
## Girth         4.7082     0.2643  17.816  < 2e-16 ***
## Height        0.3393     0.1302   2.607   0.0145 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared:  0.948,  Adjusted R-squared:  0.9442 
## F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16

Regression equation: Volume = -57.9877 + 4.7082(Girth) + 0.3393(Height)
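As a small illustration of how the regression equation is used, the sketch below predicts the volume of one hypothetical tree in two ways: with `predict()` and by plugging the values into the equation by hand. The tree's girth (12 inches) and height (75 feet) are assumed values chosen only for illustration; they are not taken from the assignment.

```r
# Refit the model from the text so this block runs on its own.
trees_lm <- lm(Volume ~ Girth + Height, data = trees)

# Hypothetical tree (assumed values, inside the range of the data).
new_tree <- data.frame(Girth = 12, Height = 75)

# Prediction from the fitted model object.
pred_model <- predict(trees_lm, newdata = new_tree)

# The same prediction written out from the regression equation.
b <- coef(trees_lm)
pred_manual <- b["(Intercept)"] + b["Girth"] * 12 + b["Height"] * 75

print(pred_model)
print(pred_manual)
```

With the rounded coefficients, the hand calculation is -57.9877 + 4.7082(12) + 0.3393(75) ≈ 23.96 cubic feet, which should agree with `predict()` up to rounding.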

#Interpretation

##1.Coefficients (predictors):

Girth (4.71): Holding height constant, each 1-inch increase in a tree’s girth (recorded in this data set as the diameter) is associated with an average increase of about 4.71 cubic feet in timber volume. Girth is the stronger of the two predictors here.

Height (0.34): Holding girth constant, each 1-foot increase in a tree’s height is associated with an average increase of about 0.34 cubic feet in volume.
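The point estimates above come with uncertainty. As a hedged sketch, the block below computes 95% confidence intervals for the coefficients with `confint()` and reproduces them by hand from the estimates and standard errors. The 95% level is an assumption; the assignment does not specify one.

```r
# Refit the model from the text so this block runs on its own.
trees_lm <- lm(Volume ~ Girth + Height, data = trees)

# 95% confidence intervals for the intercept and both slopes.
ci <- confint(trees_lm, level = 0.95)
print(ci)

# The same intervals by hand: estimate +/- t critical value * standard error.
coef_table <- coef(summary(trees_lm))
est <- coef_table[, "Estimate"]
se <- coef_table[, "Std. Error"]
t_crit <- qt(0.975, df = df.residual(trees_lm))
ci_manual <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(ci_manual)
```

If an interval for a slope excludes zero, that matches the coefficient being significant at the 5% level in the summary table.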

##2.Statistical Significance (P-Values):

Girth (p-value < 2e-16): The p-value is effectively zero, so Girth is highly statistically significant (p < 0.05).

Height (p-value = 0.0145): Height has a much smaller t value than Girth (2.607 versus 17.816), but it is still statistically significant at the 5% level (p < 0.05).

Both dimensions contribute to predicting volume.

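##3.Model Accuracy (R^2)

Multiple R-squared: 0.948: About 94.8% of the total variation in tree volume in this sample is explained by these two physical dimensions. The adjusted R-squared (0.9442), which penalizes the number of predictors, is almost the same. This indicates a very strong fit to these 31 trees; it does not by itself show that the model assumptions hold, so the residuals should still be checked.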

F-statistic p-value (< 2.2e-16): The overall model is highly significant: the model with Girth and Height fits significantly better than a model with no predictors, so at least one of the two is related to Volume.
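The fit statistics discussed above can be pulled out of the model object directly. The sketch below extracts R-squared and adjusted R-squared, recomputes the adjusted value from its usual definition, and reruns the overall F-test as a comparison against an intercept-only model. The object names (`null_lm`, `f_test`, and so on) are illustrative.

```r
# Refit the model from the text so this block runs on its own.
trees_lm <- lm(Volume ~ Girth + Height, data = trees)
s <- summary(trees_lm)

r2 <- s$r.squared
adj_r2 <- s$adj.r.squared

# Adjusted R-squared from its definition:
#   1 - (1 - R^2) * (n - 1) / (n - p - 1), with p predictors.
n <- nrow(trees)
p <- 2
adj_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(c(r2 = r2, adj_r2 = adj_r2, adj_manual = adj_manual))

# Overall F-test: intercept-only model versus the two-predictor model.
null_lm <- lm(Volume ~ 1, data = trees)
f_test <- anova(null_lm, trees_lm)
print(f_test)
```

The F value in the `anova()` table should match the F-statistic reported at the bottom of `summary(trees_lm)`.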

#HW2: Read about variable selection methods

#Why do we need to select variables for a model?

When you build a multiple linear regression model, you often start with a large pool of potential predictor variables. Putting every variable into the model is usually a bad idea: it can lead to overfitting, multicollinearity, and complex models that are difficult to interpret.

Variable selection is the process of finding a subset of the candidate variables: ideally the fewest predictors that still give high predictive accuracy.

#1.Stepwise Selection Methods

##A.Forward Selection:

You start with an empty model (just the intercept, no predictors). The algorithm tests every candidate variable individually and adds the one with the lowest p-value or the lowest resulting AIC (Akaike Information Criterion). It repeats this process, adding the next best variable each time, until no remaining variable improves the model according to the chosen criterion.
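A minimal sketch of forward selection by AIC, using base R's `add1()` and `step()`. Because `trees` has only two predictors, this illustration uses the built-in `mtcars` data instead; the choice of `mpg` as the response and of `wt`, `hp`, `disp`, and `qsec` as candidates is an assumption made only for the example.

```r
# Start from the intercept-only model.
null_lm <- lm(mpg ~ 1, data = mtcars)

# Candidate predictors the search is allowed to add (assumed for illustration).
candidates <- ~ wt + hp + disp + qsec

# One forward step by hand: AIC of the model after adding each candidate alone.
first_step <- add1(null_lm, scope = candidates)
print(first_step)

# Full forward selection: keep adding the best variable while AIC improves.
forward_lm <- step(null_lm,
                   scope = list(lower = ~ 1, upper = candidates),
                   direction = "forward", trace = 0)
print(formula(forward_lm))
```

The row with the smallest AIC in the `add1()` table is the variable that forward selection adds first.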

##B.Backward Elimination (Highly Recommended)

You start with a full model containing all candidate predictor variables. The algorithm finds the variable with the highest (least significant) p-value, or the one whose removal lowers AIC the most, and drops it. It repeats this process until every variable left in the model is statistically significant, or until dropping any further variable would no longer lower AIC.
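A minimal sketch of backward elimination by AIC, using base R's `drop1()` and `step()`. The built-in `mtcars` data and the four predictors used here are assumptions chosen for illustration, since `trees` has too few variables to make elimination interesting.

```r
# Start from the full model with all candidate predictors (assumed set).
full_lm <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)

# One backward step by hand: AIC of the model after dropping each term alone.
first_drop <- drop1(full_lm)
print(first_drop)

# Full backward elimination: keep dropping terms while AIC improves.
backward_lm <- step(full_lm, direction = "backward", trace = 0)
print(formula(backward_lm))
```

Note that `step()` eliminates by AIC rather than by p-value, so a retained variable is not guaranteed to be significant at the 5% level.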

##C.Bidirectional Stepwise Selection

A combination of both. It starts like Forward Selection, but every time a new variable is added, the algorithm checks whether any previously added variable has become redundant (for example, because it is collinear with the new one) and removes it if so.
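A minimal sketch of bidirectional selection with `step(direction = "both")`, again on the built-in `mtcars` data with an assumed set of candidate predictors. The `anova` component of the result records each add or drop step the search took.

```r
# Start from the intercept-only model, as in forward selection.
null_lm <- lm(mpg ~ 1, data = mtcars)

# At every step the search may add a candidate or drop an earlier one.
both_lm <- step(null_lm,
                scope = list(lower = ~ 1, upper = ~ wt + hp + disp + qsec),
                direction = "both", trace = 0)

# Path taken by the search: one row per step, with the AIC after that step.
print(both_lm$anova)
print(formula(both_lm))
```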

#2.All Subsets Regression (Best Subsets)

Instead of moving step by step, this method evaluates every possible combination of your variables.

How it works: If you have 3 variables (A, B, C), it fits a model with just A, just B, just C, then AB, AC, BC, and finally ABC. It then ranks all of these models using a metric such as adjusted R^2 or Mallows’ Cp and reports the best one.

Pros: It is guaranteed to find the best model under the chosen metric among the candidate variables. It cannot get “trapped” on a poor path the way a step-by-step algorithm can.

Cons: It is computationally expensive. With 20 variables there are 2^{20} = 1,048,576 possible models (counting the intercept-only model).
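The enumeration described above can be sketched in base R for a small number of candidates. This example fits every non-empty subset of four assumed predictors from the built-in `mtcars` data and ranks the models by adjusted R-squared. It is an illustration of the idea only; for larger problems a dedicated best-subsets routine would normally be used.

```r
# Candidate predictors (assumed for illustration); the response is mpg.
predictors <- c("wt", "hp", "disp", "qsec")

# Every non-empty subset of the candidates: 2^4 - 1 = 15 models.
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each model and record its adjusted R-squared.
results <- data.frame(
  model = vapply(subsets, paste, character(1), collapse = " + "),
  adj_r2 = vapply(subsets, function(vars) {
    fit <- lm(reformulate(vars, response = "mpg"), data = mtcars)
    summary(fit)$adj.r.squared
  }, numeric(1))
)

# Rank the models from best to worst by adjusted R-squared.
results <- results[order(results$adj_r2, decreasing = TRUE), ]
print(head(results, 5))
```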

#3.Regularization / Penalization Methods (Modern Data Science Approach)

Instead of making hard binary choices (“keep” or “drop”), these methods add a penalty to the fitting criterion that shrinks the coefficients of less important variables.

##A. LASSO Regression (L1 Regularization)

LASSO adds a penalty proportional to the sum of the absolute values of the coefficients. If a variable contributes little to the prediction, its coefficient can be shrunk to exactly zero.

Why it’s great for selection: Any variable with a coefficient of 0 is effectively dropped from the model, so variable selection happens as part of the fit.
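To show how an L1 penalty produces exact zeros, here is a minimal base-R sketch of LASSO fitted by coordinate descent with soft-thresholding. The function names (`soft_threshold`, `lasso_cd`), the fixed number of iterations, the `mtcars` data, and the penalty values are all assumptions made for illustration. In practice a dedicated package such as glmnet would normally be used instead of hand-written code.

```r
# Soft-thresholding operator: shrinks z toward zero and returns exactly 0
# whenever |z| <= lambda.
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Coordinate descent for the objective
#   (1 / (2n)) * sum((y - X b)^2) + lambda * sum(abs(b))
# with standardized predictors and a centered response (no intercept).
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  n <- nrow(X)
  b <- rep(0, ncol(X))
  for (iter in seq_len(n_iter)) {
    for (j in seq_len(ncol(X))) {
      # Residual with predictor j left out of the current fit.
      partial_resid <- y - drop(X[, -j, drop = FALSE] %*% b[-j])
      rho <- sum(X[, j] * partial_resid) / n
      b[j] <- soft_threshold(rho, lambda) / (sum(X[, j]^2) / n)
    }
  }
  setNames(b, colnames(X))
}

X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp", "qsec")]))
y <- mtcars$mpg - mean(mtcars$mpg)

# Smallest penalty at which every coefficient is zero.
lambda_max <- max(abs(crossprod(X, y))) / nrow(X)

b_small <- lasso_cd(X, y, lambda = 0.01)
b_mid <- lasso_cd(X, y, lambda = lambda_max / 2)
b_max <- lasso_cd(X, y, lambda = lambda_max)
print(round(cbind(small = b_small, mid = b_mid, max = b_max), 3))
```

As the penalty grows, the coefficients shrink and more of them become exactly zero; at `lambda_max` every coefficient is zero.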

##B. Ridge Regression (L2 Regularization)

Similar to LASSO, but the penalty is on the sum of squared coefficients, which shrinks coefficients toward zero without making them exactly zero. It keeps all variables but reduces the influence of the weak ones, so it does not select variables by itself.
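Ridge regression has a closed-form solution, which makes the shrinkage easy to see. The sketch below is a minimal base-R version on standardized predictors and a centered response. The helper name `ridge_fit`, the `mtcars` data, and the grid of penalty values are assumptions made for illustration.

```r
# Closed-form ridge estimate for the objective
#   sum((y - X b)^2) + lambda * sum(b^2)
# which gives b = (X'X + lambda * I)^(-1) X'y.
ridge_fit <- function(X, y, lambda) {
  p <- ncol(X)
  b <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
  setNames(drop(b), colnames(X))
}

X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp", "qsec")]))
y <- mtcars$mpg - mean(mtcars$mpg)

# Coefficients for an increasing penalty; lambda = 0 is ordinary least squares.
lambdas <- c(0, 10, 100, 1000)
coefs <- sapply(lambdas, function(l) ridge_fit(X, y, l))
colnames(coefs) <- paste0("lambda_", lambdas)
print(round(coefs, 3))
```

The coefficients move toward zero as the penalty grows, but none of them is set to exactly zero, which is the contrast with LASSO.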

Assignment_2

Isabelle IZABAYO 20251MBI012

2026-06-04