Part A: Statistical Learning Regression and Classification Exercise A.3
Author
1601014
Chapter 3: Linear Regression
What it is: A simple but powerful tool used to predict a number (like price, sales, or temperature) based on the value of one or more other variables.
The Core Idea: It finds the best-fitting straight line (or a flat plane, with multiple variables) through your data points. This line is defined by the equation:
Y = β₀ + β₁X + ε
Y: The thing you want to predict (the “response”).
X: The thing you’re using to make the prediction (the “predictor”).
β₀ (Intercept): The predicted value of Y when X is zero.
β₁ (Slope): The average change in Y for a one-unit change in X.
ε (Error): The stuff the model can’t explain (random noise).
How it works: The “best” line is the one that minimizes the sum of squared differences between the actual data points and the points on the line. This is called the “least squares” method.
What you get out of it:
Predictions: Plug in a new X value, get a predicted Y value.
Relationships: The slope (β₁) tells you the strength and direction of the relationship. A positive slope means Y increases with X. A negative slope means Y decreases with X.
Significance: Statistical tests (p-values) tell you if the relationship you see is likely real or just a fluke.
In a nutshell: Linear regression is the go-to method for answering the question, “How much does Y change when X changes?” It’s the fundamental starting point for predicting numerical outcomes.
3.1 Simple Linear Regression
The model:
Y = β₀ + β₁X + ε
The coefficients are estimated using the least squares approach.
Key interpretations:
β₀ (intercept): the expected value of Y when X = 0
β₁ (slope): the average change in Y for a one-unit change in X
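To make this concrete, here is a minimal sketch (simulated data, not part of the lab) of fitting the model with lm() and reading off the estimated coefficients:
# Simulate a predictor and a response with a known linear relationship plus noise
set.seed(1)
x <- rnorm(100)                      # predictor
y <- 2 + 3 * x + rnorm(100)          # response with intercept 2, slope 3, and random error
fit <- lm(y ~ x)                     # least squares fit
coef(fit)                            # estimated intercept and slope (should be near 2 and 3)
summary(fit)                         # standard errors, t-statistics, p-values, R-squared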
3.2 Multiple Linear Regression
What it is: An extension of simple linear regression that allows you to predict a single numerical outcome (Y) based on the values of two or more predictor variables (X₁, X₂, …, Xₚ).
The Core Idea: Instead of fitting a best-fit line, multiple linear regression fits a best-fit multi-dimensional hyperplane to the data. The model is represented by the equation:
Y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ + ε
Y: The response variable you want to predict.
X₁, X₂, …, Xₚ: The predictor variables.
β₀: The intercept (the value of Y when all predictors are zero).
β₁, β₂, …, βₚ: The slope coefficients. Each one represents the average change in Y for a one-unit increase in its corresponding predictor, holding all other predictors constant. This is the most important concept.
ε: The error term (random noise).
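As a brief sketch (using the Boston data that the lab in Section 3.4 loads; the choice of lstat and age as predictors here is just for illustration), a multiple regression is fit with the same lm() call, and each slope is interpreted holding the other predictor fixed:
# Multiple linear regression of median home value on two predictors
library(MASS)                                  # provides the Boston data set
fit2 <- lm(medv ~ lstat + age, data = Boston)
summary(fit2)  # each slope: average change in medv per one-unit change, holding the other predictor fixed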
3.3 Other Considerations in the Regression Model
1. Qualitative Predictors (How to Use Categories)
You are not limited to numerical predictors. You can include categorical variables (e.g., Gender, Region, OwnsHouse).
How it’s done: You create dummy variables (0/1 indicators) for each category.
Example: For a predictor OwnsHouse with levels Yes/No, you create a new variable: X = 1 if Yes, 0 if No.
Interpretation: The coefficient for a dummy variable represents the average difference in the response between that category and the baseline (or reference) category (the one coded as 0).
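A minimal sketch with a hypothetical OwnsHouse predictor (simulated here, not from the book's data) shows that lm() handles the dummy coding automatically when the variable is a factor:
# Hypothetical two-level categorical predictor
set.seed(2)
OwnsHouse <- factor(sample(c("No", "Yes"), 100, replace = TRUE))
income <- 50 + 10 * (OwnsHouse == "Yes") + rnorm(100, sd = 5)   # simulated response
# lm() creates the 0/1 dummy (baseline = "No"); the coefficient OwnsHouseYes
# estimates the average difference in income between the two categories
summary(lm(income ~ OwnsHouse))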
2. Extending the Model: Interactions & Non-Linearity
The standard model makes two big assumptions: additivity (each predictor’s effect on the response does not depend on the values of the other predictors) and linearity (a one-unit change in a predictor has the same effect regardless of its current value). This section shows how to relax these assumptions when needed.
A. Interactions (Synergy)
What it is: When the effect of one predictor depends on the level of another predictor.
Example: The effectiveness of TV advertising on Sales might depend on the amount spent on Radio advertising. They work together synergistically.
How it’s done: Add a new predictor that is the product of the two original variables: X₁ * X₂.
Interpretation: The coefficient for the interaction term tells you how much the effect of one predictor changes for a one-unit increase in the other predictor.
B. Non-Linear Relationships
What it is: When the relationship between a predictor and the response is curved (e.g., diminishing returns).
Example: In the Auto data, the relationship between Horsepower and MPG is curved: MPG drops steeply as horsepower increases at first and then levels off, so a straight line fits poorly.
How it’s done: You can add polynomial terms (e.g., X², X³) or other transformations (e.g., log(X)) to the model.
Key Point: This is still linear regression because the model is still linear in the coefficients (β₀, β₁, β₂). You are just using non-linear predictors.
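Both extensions use ordinary lm() formula syntax; here is a short sketch with the Boston data (the same interaction and quadratic terms reappear in the lab below):
library(MASS)
# Interaction: lstat * age expands to lstat + age + lstat:age
fit_int  <- lm(medv ~ lstat * age, data = Boston)
# Non-linearity: I(lstat^2) adds a quadratic term; the model is still linear in the coefficients
fit_quad <- lm(medv ~ lstat + I(lstat^2), data = Boston)
summary(fit_int)
summary(fit_quad)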
3. Potential Problems to Diagnose (The Watchlist)
A good data analyst doesn’t just fit a model; they check if it’s a good model. This section lists the top threats to a model’s validity.
Non-linearity: The relationship isn’t a straight line.
Fix: Look at residual plots. Add polynomial or transformed terms.
Correlation of Error Terms: The errors are not independent (common in time series data). This breaks the calculation of standard errors, making you overconfident in your results.
Fix: Plot the residuals in time (or observation) order and look for tracking, where adjacent residuals have similar values.
Non-constant Variance (Heteroscedasticity): The spread of the residuals changes with the predicted value (e.g., a funnel shape in the residual plot).
Fix: Transform the response variable (e.g., use log(Y)).
Outliers: Data points where the response Y is unusual given its predictors.
Fix: Identify them with studentized residuals. Investigate but don’t delete without cause.
High-Leverage Points: Data points where the predictor X is unusual. These points can have an outsized influence on the model fit, pulling the regression line toward them.
Fix: Calculate leverage statistics to identify them.
Collinearity: When two or more predictors are highly correlated with each other. This:
Makes it difficult to separate their individual effects on the response.
Inflates the standard errors of the coefficients, making it hard to find statistically significant effects.
Fix: Calculate the VIF (variance inflation factor) for each predictor. A VIF above 5 or 10 indicates a problem. Solutions include dropping one of the correlated variables or combining them into a single predictor.
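As a sketch of how these checks look in R (using the Boston data and the car package, both of which the lab below loads):
library(MASS)   # Boston data
library(car)    # vif()
fit_all <- lm(medv ~ ., data = Boston)
par(mfrow = c(2, 2))
plot(fit_all)                   # residual, Q-Q, scale-location, and leverage plots
which.max(rstudent(fit_all))    # observation with the largest studentized residual (outlier check)
which.max(hatvalues(fit_all))   # observation with the highest leverage
vif(fit_all)                    # variance inflation factors (collinearity check)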
3.4 Chapter 3 Lab: Linear Regression in R
# Load the MASS package, which contains the Boston dataset
library(MASS)
# Attach the Boston dataset to the R search path
attach(Boston)
# View the first few rows of the dataset
head(Boston)
# Fit a simple linear regression model: medv (median home value) as a function of lstat (lower status population)
lm.fit <- lm(medv ~ lstat, data = Boston)
# Display a detailed summary of the linear model
summary(lm.fit)
Call:
lm(formula = medv ~ lstat, data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.168 -3.990 -1.318 2.034 24.500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.55384 0.56263 61.41 <2e-16 ***
lstat -0.95005 0.03873 -24.53 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
# View the names of the components in the lm.fit object
names(lm.fit)
# Plot the data: lstat vs medv
plot(lstat, medv)
# Add the regression line to the plot
abline(lm.fit)
# Add the regression line with increased line width
abline(lm.fit, lwd = 3)
# Add the regression line in red
abline(lm.fit, lwd = 3, col = "red")
# Plot the data points in red
plot(lstat, medv, col = "red")
# Plot the data points using solid dots
plot(lstat, medv, pch = 20)
# Plot the data points using plus signs
plot(lstat, medv, pch = "+")
# Display all available plotting symbols
plot(1:20, 1:20, pch = 1:20)
# Set up a 2x2 plotting area for diagnostic plots
par(mfrow = c(2, 2))
# Plot diagnostic plots for the linear model (residuals, Q-Q, scale-location, leverage)
plot(lm.fit)
# Plot predicted values vs residuals
plot(predict(lm.fit), residuals(lm.fit))
# Plot predicted values vs studentized residuals
plot(predict(lm.fit), rstudent(lm.fit))
# Plot leverage values
plot(hatvalues(lm.fit))
# Identify the observation with the highest leverage
which.max(hatvalues(lm.fit))
375
375
library(MASS)
library(car)
Loading required package: carData
attach(Boston)
The following objects are masked from Boston (pos = 5):
age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
rm, tax, zn
# Fit the full model with all predictors except 'age'
lm.fit1 <- lm(medv ~ . - age, data = Boston)
# View summary
summary(lm.fit1)
# Interaction terms
summary(lm(medv ~ lstat * age, data = Boston))
Call:
lm(formula = medv ~ lstat * age, data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.806 -4.045 -1.333 2.085 27.552
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.0885359 1.4698355 24.553 < 2e-16 ***
lstat -1.3921168 0.1674555 -8.313 8.78e-16 ***
age -0.0007209 0.0198792 -0.036 0.9711
lstat:age 0.0041560 0.0018518 2.244 0.0252 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.149 on 502 degrees of freedom
Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
3.6.5 Non-linear Transformations of the Predictors
# Fit a second-degree polynomial regression model:
# medv (median home value) as a quadratic function of lstat (lower status population)
lm.fit_poly <- lm(medv ~ lstat + I(lstat^2), data = Boston)
# Display a summary of the polynomial regression model
# (coefficients, standard errors, t-values, p-values, R-squared, and F-statistic)
summary(lm.fit_poly)
Call:
lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.2834 -3.8313 -0.5295 2.3095 25.4148
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.862007 0.872084 49.15 <2e-16 ***
lstat -2.332821 0.123803 -18.84 <2e-16 ***
I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.524 on 503 degrees of freedom
Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
# Use the anova() function to quantify the extent to which the quadratic fit is superior to the linear fit
anova(lm.fit, lm.fit_poly)
Analysis of Variance Table
Model 1: medv ~ lstat
Model 2: medv ~ lstat + I(lstat^2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 504 19472
2 503 15347 1 4125.1 135.2 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Summary
Chapter 4: Classification – Summary of Key Lessons
4.1 Overview of Classification
Classification methods predict discrete class labels such as “Yes” or “No”, “Up” or “Down”.
Evaluation focuses on classification accuracy rather than mean squared error.
4.2 Logistic Regression
Logistic regression models the probability that Y = 1 through the logit (log-odds) transformation: log( p(X) / (1 − p(X)) ) = β₀ + β₁X₁ + … + βₚXₚ, where p(X) = Pr(Y = 1 | X).
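The numerical summary below looks like the output of summarizing the Smarket stock-market data used in the lab; a hedged sketch of the code that would produce it (assuming the ISLR2 package, which provides Smarket):
# Load the Smarket data, attach it, and summarize each variable
library(ISLR2)
attach(Smarket)
summary(Smarket)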
Year Lag1 Lag2 Lag3
Min. :2001 Min. :-4.922000 Min. :-4.922000 Min. :-4.922000
1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500 1st Qu.:-0.640000
Median :2003 Median : 0.039000 Median : 0.039000 Median : 0.038500
Mean :2003 Mean : 0.003834 Mean : 0.003919 Mean : 0.001716
3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.596750
Max. :2005 Max. : 5.733000 Max. : 5.733000 Max. : 5.733000
Lag4 Lag5 Volume Today
Min. :-4.922000 Min. :-4.92200 Min. :0.3561 Min. :-4.922000
1st Qu.:-0.640000 1st Qu.:-0.64000 1st Qu.:1.2574 1st Qu.:-0.639500
Median : 0.038500 Median : 0.03850 Median :1.4229 Median : 0.038500
Mean : 0.001636 Mean : 0.00561 Mean :1.4783 Mean : 0.003138
3rd Qu.: 0.596750 3rd Qu.: 0.59700 3rd Qu.:1.6417 3rd Qu.: 0.596750
Max. : 5.733000 Max. : 5.73300 Max. :3.1525 Max. : 5.733000
Direction
Down:602
Up :648
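The predictions that follow use a fitted object glm.fits that is not shown above; here is a hedged reconstruction, following the ISLR lab, where the logistic regression uses all five lag variables and Volume (consistent with the full-data confusion matrix below):
# Logistic regression of market direction on the lag variables and trading volume
glm.fits <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data = Smarket, family = binomial)
summary(glm.fits)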
# Predicted probabilities of the market going up
glm.probs <- predict(glm.fits, type = "response")
# Convert probabilities to class labels using a 0.5 threshold
glm.pred <- rep("Down", 1250)
glm.pred[glm.probs > 0.5] <- "Up"
# Confusion matrix of predictions against the actual market direction
table(glm.pred, Direction)
Direction
glm.pred Down Up
Down 145 141
Up 457 507
Create training and test sets
# Training set: observations before 2005; test set: 2005
train <- Year < 2005
# Fit logistic regression on Lag1 and Lag2 using only the training data
glm.fits.train <- glm(Direction ~ Lag1 + Lag2, data = Smarket, family = binomial,
                      subset = train)
# Predicted probabilities on the held-out 2005 data
glm.probs.test <- predict(glm.fits.train, newdata = Smarket[!train, ],
                          type = "response")
# Convert to class labels and tabulate against the actual 2005 directions
glm.pred.test <- rep("Down", 252)
glm.pred.test[glm.probs.test > 0.5] <- "Up"
table(glm.pred.test, Direction[!train])
glm.pred.test Down Up
Down 35 35
Up 76 106
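From the confusion matrix above, the overall test-set accuracy is (35 + 106) / 252, about 56%; assuming Smarket is attached as above, it can also be computed directly:
# Fraction of 2005 test days classified correctly
mean(glm.pred.test == Direction[!train])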
4.6.3 Linear Discriminant Analysis
# Fit an LDA model using 'Lag1' and 'Lag2' on the training data
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda.fit
Call:
lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
Prior probabilities of groups:
Down Up
0.491984 0.508016
Group means:
Lag1 Lag2
Down 0.04279022 0.03389409
Up -0.03954635 -0.03132544
Coefficients of linear discriminants:
LD1
Lag1 -0.6420190
Lag2 -0.5135293
plot(lda.fit)
# Get predictions on the test set
lda.pred <- predict(lda.fit, newdata = Smarket[!train, ])
names(lda.pred)
[1] "class" "posterior" "x"
table(lda.pred$class, Direction[!train])
Down Up
Down 35 35
Up 76 106
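The LDA test accuracy follows from the table in the same way, again (35 + 106) / 252, about 56%:
# Fraction of 2005 test days classified correctly by LDA
mean(lda.pred$class == Direction[!train])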
4.6.4 Quadratic Discriminant Analysis
# Fit a QDA model on the training data
qda.fit <- qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
qda.fit
Call:
qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
Prior probabilities of groups:
Down Up
0.491984 0.508016
Group means:
Lag1 Lag2
Down 0.04279022 0.03389409
Up -0.03954635 -0.03132544
# Load the class package, which provides the knn() function
library(class)
# Create matrices of predictors for the training and test data
train.X <- as.matrix(Smarket[train, c("Lag1", "Lag2")])
test.X <- as.matrix(Smarket[!train, c("Lag1", "Lag2")])
train.Direction <- Smarket[train, "Direction"]
# Run KNN with K = 3 and tabulate predictions against the actual test directions
knn.pred <- knn(train.X, test.X, train.Direction, k = 3)
table(knn.pred, Direction[!train])
knn.pred Down Up
Down 48 56
Up 63 85
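For K = 3 the table above gives (48 + 85) / 252, roughly 53% accuracy; it can be computed directly as well:
# Fraction of 2005 test days classified correctly by KNN with K = 3
mean(knn.pred == Direction[!train])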
# Run KNN with K = 1
knn.pred <- knn(train.X, test.X, train.Direction, k = 1)
table(knn.pred, Direction[!train])