______________________________________________________

ASSIGNMENT I - Application of Multiple Linear Regression

____________________________________________________________________

Scenario

____________________________________________________________________

This exercise shows how to apply Multiple Linear Regression in R.

A school administration wants to identify the major factors affecting students’ academic performance using data collected from students. The objective is to predict students’ final exam scores from several independent variables: number of hours studied, attendance percentage, and assignment scores.

We use the ‘Student Performance Dataset’ to predict ‘Final Exam Score’ from three independent variables (Hours Studied, Attendance Rate, and Assignment Score). Multiple Linear Regression is used to determine how these variables relate to the final exam results.

____________________________________________________________________

Step-1: Create the Dataset

____________________________________________________________________

# Creating the student dataset
# ----------------------------
student_ds <- data.frame(
  Hr_Studied = c(5,8,2,7,6,9,4,10),
  Attendance = c(80,90,60,85,75,95,70,98),
  Assignment = c(70,85,55,80,78,92,60,96),
  Final_Exam = c(65,88,50,82,76,94,62,98)
)

# Display the student dataset
# ---------------------------
print(student_ds)
##   Hr_Studied Attendance Assignment Final_Exam
## 1          5         80         70         65
## 2          8         90         85         88
## 3          2         60         55         50
## 4          7         85         80         82
## 5          6         75         78         76
## 6          9         95         92         94
## 7          4         70         60         62
## 8         10         98         96         98

____________________________________________________________________

Step-2: Explore the Dataset

____________________________________________________________________

Structure of the Dataset

------------------------

str(student_ds)
## 'data.frame':    8 obs. of  4 variables:
##  $ Hr_Studied: num  5 8 2 7 6 9 4 10
##  $ Attendance: num  80 90 60 85 75 95 70 98
##  $ Assignment: num  70 85 55 80 78 92 60 96
##  $ Final_Exam: num  65 88 50 82 76 94 62 98

Summary Statistics

------------------

summary(student_ds)
##    Hr_Studied       Attendance      Assignment      Final_Exam   
##  Min.   : 2.000   Min.   :60.00   Min.   :55.00   Min.   :50.00  
##  1st Qu.: 4.750   1st Qu.:73.75   1st Qu.:67.50   1st Qu.:64.25  
##  Median : 6.500   Median :82.50   Median :79.00   Median :79.00  
##  Mean   : 6.375   Mean   :81.62   Mean   :77.00   Mean   :76.88  
##  3rd Qu.: 8.250   3rd Qu.:91.25   3rd Qu.:86.75   3rd Qu.:89.50  
##  Max.   :10.000   Max.   :98.00   Max.   :96.00   Max.   :98.00

____________________________________________________________________

Step-3: Fit Multiple Linear Regression Model

____________________________________________________________________

# Building the regression model
#------------------------------
model <- lm(Final_Exam ~ Hr_Studied + Attendance + Assignment, data = student_ds)

# Display model results
# ---------------------
summary(model)
## 
## Call:
## lm(formula = Final_Exam ~ Hr_Studied + Attendance + Assignment, 
##     data = student_ds)
## 
## Residuals:
##       1       2       3       4       5       6       7       8 
## -1.2306  1.1975  0.2227  1.3683 -0.4524  1.0717 -0.2318 -1.9454 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 59.04408   19.98987   2.954   0.0418 *
## Hr_Studied   8.34302    2.22418   3.751   0.0199 *
## Attendance  -0.41186    0.23404  -1.760   0.1533  
## Assignment  -0.02256    0.29796  -0.076   0.9433  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.586 on 4 degrees of freedom
## Multiple R-squared:  0.9949, Adjusted R-squared:  0.9911 
## F-statistic: 260.4 on 3 and 4 DF,  p-value: 4.859e-05

____________________________________________________________________

Step-4: Interpretation of the Results

____________________________________________________________________

Regression Equation

-------------------

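The model follows this equation:

\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 \]

Where:
- \(Y\) = Final Exam Score
- \(X_1\) = Hours Studied
- \(X_2\) = Attendance Rate
- \(X_3\) = Assignment Score
- \(\beta_0\) = intercept, and \(\beta_1, \beta_2, \beta_3\) = coefficients of the three predictors

Substituting the estimates from the Step-3 output gives the fitted equation:

\[ \hat{Y} = 59.044 + 8.343X_1 - 0.412X_2 - 0.023X_3 \]

Only the Hours Studied coefficient is statistically significant at the 5% level (p = 0.0199). The Attendance (p = 0.1533) and Assignment (p = 0.9433) coefficients cannot be distinguished from zero once Hours Studied is in the model.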
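
As a worked check of the equation, the sketch below applies the estimated coefficients by hand to the first student (5 hours, attendance 80, assignment 70) and compares the result with the fitted value R computes. The dataset and model are re-created from Steps 1 and 3 so the snippet runs on its own; the names `b`, `manual_pred`, and `fitted_pred` are introduced here for illustration only.

```r
# Re-create the dataset and model from Steps 1 and 3
student_ds <- data.frame(
  Hr_Studied = c(5, 8, 2, 7, 6, 9, 4, 10),
  Attendance = c(80, 90, 60, 85, 75, 95, 70, 98),
  Assignment = c(70, 85, 55, 80, 78, 92, 60, 96),
  Final_Exam = c(65, 88, 50, 82, 76, 94, 62, 98)
)
model <- lm(Final_Exam ~ Hr_Studied + Attendance + Assignment, data = student_ds)

# Named coefficient vector: (Intercept), Hr_Studied, Attendance, Assignment
b <- coef(model)

# Y-hat = b0 + b1*X1 + b2*X2 + b3*X3 for student 1
manual_pred <- b["(Intercept)"] +
  b["Hr_Studied"] * 5 +
  b["Attendance"] * 80 +
  b["Assignment"] * 70

# Fitted value for the same student as computed by R
fitted_pred <- fitted(model)[1]

print(unname(manual_pred))
print(unname(fitted_pred))
```

Both printed values should agree, which confirms that `fitted()` and `predict()` simply evaluate the regression equation at each student's predictor values.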

____________________________________________________________________

Step-5: Visualize Actual vs Predicted Values

____________________________________________________________________

# Predict final exam scores
# -------------------------
predicted_scores <- predict(model)

# Create comparison dataset
# -------------------------
results <- data.frame(
  Actual = student_ds$Final_Exam,
  Predicted = predicted_scores
)
print(results)
##   Actual Predicted
## 1     65  66.23058
## 2     88  86.80253
## 3     50  49.77727
## 4     82  80.63165
## 5     76  76.45240
## 6     94  92.92828
## 7     62  62.23184
## 8     98  99.94545
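
The same model can also score a student who is not in the dataset. The sketch below is a minimal example, assuming a hypothetical new student (6 hours studied, attendance 85, assignment 75); these values are made up for illustration and lie inside the range of the observed data. It re-creates the data and model so it runs on its own.

```r
# Re-create the dataset and model from Steps 1 and 3
student_ds <- data.frame(
  Hr_Studied = c(5, 8, 2, 7, 6, 9, 4, 10),
  Attendance = c(80, 90, 60, 85, 75, 95, 70, 98),
  Assignment = c(70, 85, 55, 80, 78, 92, 60, 96),
  Final_Exam = c(65, 88, 50, 82, 76, 94, 62, 98)
)
model <- lm(Final_Exam ~ Hr_Studied + Attendance + Assignment, data = student_ds)

# Hypothetical new student (illustrative values, not from the dataset)
new_student <- data.frame(Hr_Studied = 6, Attendance = 85, Assignment = 75)

# Point prediction with a 95% prediction interval
# Columns of the result: fit (prediction), lwr and upr (interval bounds)
pred <- predict(model, newdata = new_student,
                interval = "prediction", level = 0.95)
print(pred)
```

The prediction interval is usually more informative than the point estimate alone, especially here, where the model rests on only 8 observations.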

____________________________________________________________________

Step-6: Scatter Plot of Actual vs Predicted Scores

____________________________________________________________________

# Plot actual vs predicted scores
# -------------------------------
plot(results$Actual, results$Predicted, main = "Actual vs Predicted Final Exam Scores",
     xlab = "Actual Scores", ylab = "Predicted Scores", pch = 19)

# Adding the reference line y = x (predicted equals actual)
# ---------------------------------------------------------
abline(0, 1, col = "yellow", lwd = 2)

## Visualization Explanation
## -------------------------
## - Points close to the yellow line indicate good predictions.
## - Large distances from the line indicate prediction errors.

____________________________________________________________________

Step-7: Model Performance

____________________________________________________________________

# Calculate R-squared
# -------------------
R2 <- cor(results$Actual, results$Predicted)^2
print(R2)
## [1] 0.994905
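# Interpretation
# --------------
# R-squared is the proportion of the variation in final exam scores that the independent variables explain.
# Here R² = 0.9949, so about 99.5% of the variance in final exam scores is explained by the model.
# With only 8 observations and 3 predictors, a value this high should be read with caution.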
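
The squared correlation used above equals R-squared only for a least-squares fit that includes an intercept. As a cross-check, the sketch below computes R-squared from its definition, 1 - SSE/SST, and also the adjusted R-squared, then compares both with the values stored in `summary(model)`. The variable names `sse`, `sst`, `r2`, and `adj_r2` are introduced here for illustration.

```r
# Re-create the dataset and model from Steps 1 and 3
student_ds <- data.frame(
  Hr_Studied = c(5, 8, 2, 7, 6, 9, 4, 10),
  Attendance = c(80, 90, 60, 85, 75, 95, 70, 98),
  Assignment = c(70, 85, 55, 80, 78, 92, 60, 96),
  Final_Exam = c(65, 88, 50, 82, 76, 94, 62, 98)
)
model <- lm(Final_Exam ~ Hr_Studied + Attendance + Assignment, data = student_ds)

actual      <- student_ds$Final_Exam
fitted_vals <- fitted(model)

sse <- sum((actual - fitted_vals)^2)    # unexplained (residual) variation
sst <- sum((actual - mean(actual))^2)   # total variation around the mean
r2  <- 1 - sse / sst

n <- nrow(student_ds)   # number of observations
p <- 3                  # number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(R2 = r2, Adjusted_R2 = adj_r2))
print(c(R2 = summary(model)$r.squared,
        Adjusted_R2 = summary(model)$adj.r.squared))
```

The adjusted value penalizes the number of predictors, which matters when the sample is as small as this one.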

____________________________________________________________________

Step-8: Regression Diagnostics

____________________________________________________________________

This section evaluates whether the multiple linear regression model is appropriate for the dataset. Regression diagnostics help us check important assumptions such as linearity, normality of the residuals, and homoscedasticity, and detect problems such as multicollinearity.

Regression Diagnostic Plots

---------------------------

# Generate regression diagnostic plots
# ------------------------------------
par(mfrow = c(2,2))
plot(model)

Explanation

-----------

The four plots produced are:

  1. Residuals vs Fitted
  2. Normal Q-Q
  3. Scale-Location
  4. Residuals vs Leverage

These plots help evaluate whether the regression assumptions are satisfied.

Residual Analysis

-----------------

# Extract residuals
# -----------------
residuals_model <- residuals(model)

# Plot histogram of residuals
# -----------------------------
hist(residuals_model, main = "Histogram of Residuals", xlab = "Residuals", col = "lightblue", border = "black")

Interpretation

--------------

  • A roughly bell-shaped histogram indicates that residuals are approximately normally distributed.
  • Strong skewness or unusual patterns may indicate violations of normality assumptions.
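
A histogram of 8 residuals is hard to judge by eye, so a formal test can supplement it. The sketch below applies the Shapiro-Wilk test from base R (`shapiro.test`) to the residuals; the choice of this test is an addition for illustration and is not part of the original analysis. With so few observations the test has little power, so a large p-value is only weak evidence of normality.

```r
# Re-create the dataset and model from Steps 1 and 3
student_ds <- data.frame(
  Hr_Studied = c(5, 8, 2, 7, 6, 9, 4, 10),
  Attendance = c(80, 90, 60, 85, 75, 95, 70, 98),
  Assignment = c(70, 85, 55, 80, 78, 92, 60, 96),
  Final_Exam = c(65, 88, 50, 82, 76, 94, 62, 98)
)
model <- lm(Final_Exam ~ Hr_Studied + Attendance + Assignment, data = student_ds)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(model))
print(sw)

# A p-value above 0.05 means normality is not rejected at the 5% level
normality_not_rejected <- sw$p.value > 0.05
print(normality_not_rejected)
```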

____________________________________________________________________

Correlation Matrix

____________________________________________________________________

# Correlation
# ----------------
cor(student_ds)
##            Hr_Studied Attendance Assignment Final_Exam
## Hr_Studied  1.0000000  0.9783321  0.9894338  0.9953252
## Attendance  0.9783321  1.0000000  0.9589653  0.9603002
## Assignment  0.9894338  0.9589653  1.0000000  0.9872695
## Final_Exam  0.9953252  0.9603002  0.9872695  1.0000000

____________________________________________________________________

Scatterplot Matrix

____________________________________________________________________

# Scatterplot
# ----------------
pairs(student_ds)

____________________________________________________________________

Conclusion and Analysis

____________________________________________________________________

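This exercise demonstrated the application of Multiple Linear Regression in R.

The analysis showed that:
- Hours studied is the only predictor with a statistically significant coefficient (8.34, p = 0.0199): students who study more tend to perform better.
- Attendance and assignment scores are each strongly and positively correlated with final exam scores (r = 0.96 and r = 0.99), but their coefficients in the fitted model are not significant (p = 0.1533 and p = 0.9433).
- The predictors are highly correlated with one another (all pairwise correlations above 0.95), so the individual coefficients are unstable and the separate effect of each variable cannot be isolated from only 8 observations.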

______________________________________________________

ASSIGNMENT II - Variable Selection Methods

____________________________________________________________________

I. Introduction

____________________________________________________________________

Variable selection, also known as feature selection, is the process of identifying and choosing the most relevant independent variables (predictors) to include in a regression model.
It is an important step in statistical modeling and machine learning.
Its main purposes are to improve model accuracy, reduce complexity, avoid overfitting, and simplify interpretation, which leads to simpler, faster, and more interpretable models.

Overfitting occurs when a model is too complex and captures noise in the data rather than the underlying pattern. By keeping only the most relevant variables, variable selection improves the model’s ability to generalize to new, unseen data.

In R, several variable selection methods are commonly used in regression analysis.
These methods help identify which variables contribute significantly to predicting the dependent variable.

____________________________________________________________________

II. Importance of Variable Selection

____________________________________________________________________

Variable selection is important because:

  • It improves prediction performance.
  • It reduces unnecessary variables.
  • It can reduce multicollinearity problems.
  • It makes models easier to interpret.
  • It reduces computational cost.

A model with too many irrelevant variables may become complex and less accurate.

____________________________________________________________________

III. Common Variable Selection Methods in R

____________________________________________________________________

1. Forward Selection

--------------------

Forward selection starts with no variables in the model.
Variables are then added one at a time according to a selection criterion, such as statistical significance or the AIC used by the step() function in R.

Process:

  • Start with an empty model.
  • Add the most significant variable.
  • Continue adding variables until no significant improvement occurs.

Advantages:

  • Simple to understand.
  • Useful when there are many variables.

Disadvantages:

  • May miss combinations of important variables.

Example in R:

# Load the built-in mtcars dataset
# --------------------------------
data <- mtcars

# Full regression model
# ---------------------
full_model <- lm(mpg ~ wt + hp + disp + cyl, data = data)

# Forward selection 
# ------------------
model_forward <- step(lm(mpg ~ 1, data = data), scope = formula(full_model), direction = "forward")
## Start:  AIC=115.94
## mpg ~ 1
## 
##        Df Sum of Sq     RSS     AIC
## + wt    1    847.73  278.32  73.217
## + cyl   1    817.71  308.33  76.494
## + disp  1    808.89  317.16  77.397
## + hp    1    678.37  447.67  88.427
## <none>              1126.05 115.943
## 
## Step:  AIC=73.22
## mpg ~ wt
## 
##        Df Sum of Sq    RSS    AIC
## + cyl   1    87.150 191.17 63.198
## + hp    1    83.274 195.05 63.840
## + disp  1    31.639 246.68 71.356
## <none>              278.32 73.217
## 
## Step:  AIC=63.2
## mpg ~ wt + cyl
## 
##        Df Sum of Sq    RSS    AIC
## + hp    1   14.5514 176.62 62.665
## <none>              191.17 63.198
## + disp  1    2.6796 188.49 64.746
## 
## Step:  AIC=62.66
## mpg ~ wt + cyl + hp
## 
##        Df Sum of Sq    RSS    AIC
## <none>              176.62 62.665
## + disp  1    6.1762 170.44 63.526
summary(model_forward)
## 
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9290 -1.5598 -0.5311  1.1850  5.8986 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
## wt          -3.16697    0.74058  -4.276 0.000199 ***
## cyl         -0.94162    0.55092  -1.709 0.098480 .  
## hp          -0.01804    0.01188  -1.519 0.140015    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared:  0.8431, Adjusted R-squared:  0.8263 
## F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11
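
To see what a single forward step does, the first step of the trace above can be reproduced with `add1()`, which fits every one-variable extension of the current model and reports its AIC. This is a minimal sketch; the names `null_model`, `first_step`, and `best` are introduced here for illustration.

```r
# Start from the intercept-only model
null_model <- lm(mpg ~ 1, data = mtcars)

# Evaluate each candidate predictor when added on its own
first_step <- add1(null_model, scope = ~ wt + hp + disp + cyl)
print(first_step)

# The candidate with the lowest AIC is the one forward selection adds first
best <- rownames(first_step)[which.min(first_step$AIC)]
print(best)
```

Repeating this call on the updated model, until the `<none>` row has the lowest AIC, is what `step(..., direction = "forward")` automates.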

2. Backward Elimination

-----------------------

Backward elimination starts with all variables included in the model.
The least significant variables are removed one by one.

Process:

  • Start with all predictors.
  • Remove the least significant variable.
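  • Continue until no removal improves the model: under a p-value rule, until all remaining variables are significant; with step() in R, until the AIC stops decreasing.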

Advantages:

  • Considers all variables initially.
  • Often produces strong models.

Disadvantages:

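  • Requires enough observations to fit the full model, so it is less suitable when there are many predictors relative to the sample size.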

Example in R:

# Backward elimination
# ---------------------
model_backward <- step(full_model, direction = "backward")
## Start:  AIC=63.53
## mpg ~ wt + hp + disp + cyl
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1     6.176 176.62 62.665
## <none>              170.44 63.526
## - hp    1    18.048 188.49 64.746
## - cyl   1    24.546 194.99 65.831
## - wt    1    90.925 261.37 75.206
## 
## Step:  AIC=62.66
## mpg ~ wt + hp + cyl
## 
##        Df Sum of Sq    RSS    AIC
## <none>              176.62 62.665
## - hp    1    14.551 191.17 63.198
## - cyl   1    18.427 195.05 63.840
## - wt    1   115.354 291.98 76.750
summary(model_backward)
## 
## Call:
## lm(formula = mpg ~ wt + hp + cyl, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9290 -1.5598 -0.5311  1.1850  5.8986 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
## wt          -3.16697    0.74058  -4.276 0.000199 ***
## hp          -0.01804    0.01188  -1.519 0.140015    
## cyl         -0.94162    0.55092  -1.709 0.098480 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared:  0.8431, Adjusted R-squared:  0.8263 
## F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11
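
Note that `step()` stopped with `hp` (p = 0.140) and `cyl` (p = 0.098) still in the model, because it compares models by AIC rather than by p-values. A p-value-based version of backward elimination can be sketched with `drop1(..., test = "F")`, as below; the 0.05 threshold and the names `d1`, `d2`, and `reduced` are assumptions made for illustration.

```r
# Full model, as in the example above
full_model <- lm(mpg ~ wt + hp + disp + cyl, data = mtcars)

# F-test for dropping each variable from the full model
d1 <- drop1(full_model, test = "F")
print(d1)

# The variable with the largest p-value is the first removal candidate
weakest_1 <- rownames(d1)[which.max(d1[["Pr(>F)"]])]
print(weakest_1)

# Remove it and test the remaining variables again
reduced <- update(full_model, . ~ . - disp)
d2 <- drop1(reduced, test = "F")
print(d2)

weakest_2 <- rownames(d2)[which.max(d2[["Pr(>F)"]])]
p_weakest_2 <- max(d2[["Pr(>F)"]], na.rm = TRUE)
print(weakest_2)
print(p_weakest_2 > 0.05)  # TRUE means a 5% rule would keep eliminating
```

Under a 5% significance rule the elimination would continue past the model chosen by AIC, which shows that the two criteria need not select the same variables.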

3. Stepwise Selection

---------------------

Stepwise selection combines forward selection and backward elimination.
Variables can be added or removed during the process.

Advantages:

  • More flexible.
  • Often produces better models.

Disadvantages:

  • Can still produce unstable results in some datasets.

Example in R:

# Stepwise selection
# ---------------------
model_stepwise <- step(full_model, direction = "both")
## Start:  AIC=63.53
## mpg ~ wt + hp + disp + cyl
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1     6.176 176.62 62.665
## <none>              170.44 63.526
## - hp    1    18.048 188.49 64.746
## - cyl   1    24.546 194.99 65.831
## - wt    1    90.925 261.37 75.206
## 
## Step:  AIC=62.66
## mpg ~ wt + hp + cyl
## 
##        Df Sum of Sq    RSS    AIC
## <none>              176.62 62.665
## - hp    1    14.551 191.17 63.198
## + disp  1     6.176 170.44 63.526
## - cyl   1    18.427 195.05 63.840
## - wt    1   115.354 291.98 76.750
summary(model_stepwise)
## 
## Call:
## lm(formula = mpg ~ wt + hp + cyl, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9290 -1.5598 -0.5311  1.1850  5.8986 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
## wt          -3.16697    0.74058  -4.276 0.000199 ***
## hp          -0.01804    0.01188  -1.519 0.140015    
## cyl         -0.94162    0.55092  -1.709 0.098480 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared:  0.8431, Adjusted R-squared:  0.8263 
## F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11

____________________________________________________________________

IV. Criteria Used in Variable Selection

____________________________________________________________________

Different statistical criteria are used to evaluate models during variable selection.

1. AIC (Akaike Information Criterion)

  • AIC measures model quality while penalizing complexity.
  • Lower AIC values indicate better models.

2. BIC (Bayesian Information Criterion)

  • BIC is similar to AIC but penalizes complex models more heavily.

3. Adjusted R-Squared

  • Adjusted R-squared measures model performance while considering the number of variables.
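
For reference, AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of estimated parameters, n the number of observations, and L the maximized likelihood. The sketch below compares the full `mtcars` model with the three-variable model selected earlier on all three criteria, and then reruns backward elimination with the BIC penalty by passing `k = log(n)` to `step()`. The names `reduced_model`, `comparison`, and `model_bic` are introduced here for illustration.

```r
full_model    <- lm(mpg ~ wt + hp + disp + cyl, data = mtcars)
reduced_model <- lm(mpg ~ wt + hp + cyl, data = mtcars)

# Side-by-side comparison: lower AIC/BIC is better, higher adjusted R2 is better
comparison <- data.frame(
  model  = c("full", "reduced"),
  AIC    = c(AIC(full_model), AIC(reduced_model)),
  BIC    = c(BIC(full_model), BIC(reduced_model)),
  adj_R2 = c(summary(full_model)$adj.r.squared,
             summary(reduced_model)$adj.r.squared)
)
print(comparison)

# Backward elimination with the BIC penalty instead of the default AIC penalty
n <- nrow(mtcars)
model_bic <- step(full_model, direction = "backward", k = log(n), trace = 0)
print(formula(model_bic))
```

The values returned by `AIC()` differ from those printed in the `step()` trace by a constant that is the same for every model fitted to the same data, so the ranking of models is unchanged.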

____________________________________________________________________

V. Multicollinearity and VIF

____________________________________________________________________

Variable selection also helps reduce multicollinearity, which occurs when independent variables are highly correlated with each other.
High multicollinearity inflates the standard errors of the coefficient estimates, which can affect model stability.

Variance Inflation Factor (VIF) is commonly used to detect multicollinearity.
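
The VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing that predictor on all the other predictors. The sketch below computes it in base R for the four `mtcars` predictors used earlier; `vif_manual` is a hypothetical helper written for this illustration, not a built-in function. The `vif()` function in the `car` package is a commonly used alternative.

```r
# Hypothetical helper: VIF for each predictor, computed from its definition
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    # Regress predictor p on the remaining predictors
    aux_fit <- lm(reformulate(others, response = p), data = data)
    r2 <- summary(aux_fit)$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(mtcars, c("wt", "hp", "disp", "cyl"))
print(round(vifs, 2))
```

A VIF of 1 means a predictor is uncorrelated with the others. Values above roughly 5 to 10 are a common rule of thumb for problematic multicollinearity, although the exact cutoff is a matter of convention.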

____________________________________________________________________

VI. Applications of Variable Selection

____________________________________________________________________

Variable selection is widely used in:

  • Healthcare prediction models
  • Financial forecasting
  • Stock market analysis
  • Marketing analytics
  • Educational performance analysis

____________________________________________________________________

VII. Conclusion

____________________________________________________________________

Variable selection methods are essential in regression modeling and machine learning.
They help identify the most important variables while improving model simplicity and prediction accuracy.

In R, methods such as forward selection, backward elimination, and stepwise selection are commonly used to build efficient regression models.
Proper variable selection leads to better interpretation, reduced multicollinearity, and improved model performance.
