Answer:
(a) Flexible method. With a very large sample size and few predictors, there is enough training data for a flexible method to estimate the relationship accurately without overfitting.
(b) Inflexible method. With many predictors and a small sample size, a flexible method would chase noise in the data and overfit.
(c) Flexible method. Since we are told the relationship is highly non-linear, an inflexible (e.g., linear) method would underfit; a flexible method is needed to capture the shape of the relationship (see the simulation sketch below).
(d) Inflexible method. When the variance of the error terms is extremely high, a flexible method would fit the noise itself; an inflexible method smooths over the noise and performs better.
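To make (c) concrete, here is a minimal simulated sketch (hypothetical data, not part of the assignment): when the true relationship is non-linear, a flexible smoother outperforms a linear fit on held-out data.
# Simulated illustration of (c): non-linear truth, moderate noise
set.seed(1)
x = runif(200, 0, 3)
y = sin(2 * x) + rnorm(200, sd = 0.3)        # non-linear signal plus noise
train = sample(200, 100)                     # half the data for training
fit.lin  = lm(y ~ x, subset = train)         # inflexible: straight line
fit.flex = smooth.spline(x[train], y[train]) # flexible: smoothing spline
# Test MSE for each fit; the spline should be noticeably lower here
mean((y[-train] - predict(fit.lin, data.frame(x = x[-train])))^2)
mean((y[-train] - predict(fit.flex, x[-train])$y)^2)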
A. Regression scenario; inference, since the goal is to understand which factors affect CEO salary. n = 500 (firms), p = 3 (profit, number of employees, industry); CEO salary is the response, so it does not count as a predictor.
A. Classification scenario (success or failure); prediction, since we want to forecast whether the new product will succeed (largely based on market conditions). n = 20 (similar products previously launched), p = 13 (the recorded variables; success/failure is the response).
A. Regression (% change in the USD/Euro exchange rate); prediction, based on the weekly stock market changes. n = 52 (weeks), p = 3 (the weekly % changes in the US, British, and German markets).
A. The model for this regression is \(\hat{Y} = 50 + 20X_1 + 0.07X_2 + 35X_3 + 0.01X_4 - 10X_5\), where \(X_1\) = GPA, \(X_2\) = IQ, \(X_3\) = Level (1 for college graduates, 0 for high school graduates), \(X_4\) = GPA × IQ, and \(X_5\) = GPA × Level.
iii: True. With the interaction term, the effect of being a college graduate is \(35 - 10 \times \text{GPA}\), so for a fixed IQ and GPA, college graduates earn more only when GPA < 3.5; once GPA exceeds 3.5, high school graduates are predicted to earn more, despite the 35-unit main effect for college graduates.
Given IQ = 110 and GPA = 4.0 for a college graduate, predict the salary.
X1 = 4 (GPA), X2 = 110 (IQ), X3 = 1 (college graduate), X4 = 4 × 110 = 440, X5 = 4 × 1 = 4
Predicted salary: \(\hat{Y} = 50 + 20(4) + 0.07(110) + 35(1) + 0.01(440) - 10(4) = 137.1\), i.e., $137,100.
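The same calculation can be checked in R (a quick sketch; as in the answer above, salary is taken to be in thousands of dollars, so 137.1 corresponds to $137,100):
# Plug the given values into the fitted model
gpa = 4.0; iq = 110; level = 1   # level = 1 codes a college graduate
50 + 20*gpa + 0.07*iq + 35*level + 0.01*(gpa*iq) - 10*(gpa*level)
## [1] 137.1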
This question involves the Boston data set from the ISLR2 package. Separate the data into a training set and test set. Designate the first 300 observations as the training data, and the remainder (206 obs) as the testing data.
# Split the data here.
library(ISLR2)  # provides the Boston data set
data.train = Boston[1:300, ]
data.test = Boston[301:nrow(Boston), ]
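A quick sanity check on the split (Boston has 506 rows, so the pieces should have 300 and 206 rows):
# Verify the sizes of the two pieces
dim(data.train)
dim(data.test)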
Using the training data, fit four models predicting medv (Y) using the feature lstat (X). The four models include: a linear model, a log model, a square-root model, and a quadratic polynomial model.
For each of the commands, run the summary to display the results.
# Linear model
model1 = lm(medv ~ lstat , data=data.train)
summary(model1)
##
## Call:
## lm(formula = medv ~ lstat, data = data.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.761 -4.406 -1.707 2.737 21.233
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.24931 0.72992 49.66 <2e-16 ***
## lstat -1.00565 0.05901 -17.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.335 on 298 degrees of freedom
## Multiple R-squared: 0.4936, Adjusted R-squared: 0.4919
## F-statistic: 290.4 on 1 and 298 DF, p-value: < 2.2e-16
# Log model
model2 = lm(medv ~ log(lstat) , data=data.train)
summary(model2)
##
## Call:
## lm(formula = medv ~ log(lstat), data = data.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8345 -3.7953 -0.8603 2.4777 22.1341
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.0145 1.1880 43.78 <2e-16 ***
## log(lstat) -12.0330 0.5205 -23.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.326 on 298 degrees of freedom
## Multiple R-squared: 0.642, Adjusted R-squared: 0.6408
## F-statistic: 534.5 on 1 and 298 DF, p-value: < 2.2e-16
# Square root model
model3 = lm(medv ~ sqrt(lstat) , data=data.train)
summary(model3)
##
## Call:
## lm(formula = medv ~ sqrt(lstat), data = data.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.953 -3.922 -1.312 2.634 21.433
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.8269 1.1962 40.82 <2e-16 ***
## sqrt(lstat) -7.4278 0.3656 -20.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.765 on 298 degrees of freedom
## Multiple R-squared: 0.5807, Adjusted R-squared: 0.5793
## F-statistic: 412.7 on 1 and 298 DF, p-value: < 2.2e-16
# Polynomial model
model4 = lm(medv ~ lstat + I(lstat^2) , data=data.train)
summary(model4)
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = data.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.6158 -3.9828 -0.7244 2.3903 21.5045
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.406239 1.075180 42.23 <2e-16 ***
## lstat -2.720028 0.171418 -15.87 <2e-16 ***
## I(lstat^2) 0.060092 0.005742 10.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.424 on 297 degrees of freedom
## Multiple R-squared: 0.63, Adjusted R-squared: 0.6275
## F-statistic: 252.9 on 2 and 297 DF, p-value: < 2.2e-16
For the log model in part A, give an interpretation of the intercept and slope in context.
Answer: For the log model in Part A, the intercept (52.0145) is the predicted medv when log(lstat) equals 0, i.e., when lstat = 1; the slope means that each one-unit increase in log(lstat) is associated with a decrease of about 12.03 in medv (median home value, in $1000s).
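Both parts of this interpretation can be checked numerically (a short sketch using model2 from above):
# The prediction at lstat = 1 (where log(lstat) = 0) recovers the intercept
predict(model2, data.frame(lstat = 1))
# A doubling of lstat shifts the prediction by slope * log(2), about -8.34
coef(model2)["log(lstat)"] * log(2)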
For each of the four models, report the \(R^2\) for the training data. Which model appears to be the best fit on the training data? What if anything can we say about performance on the testing data?
# Extract R2 values
summary(model1)$r.squared
## [1] 0.493566
summary(model2)$r.squared
## [1] 0.6420314
summary(model3)$r.squared
## [1] 0.5806988
summary(model4)$r.squared
## [1] 0.6300208
Answer: R-squared measures the proportion of the variance in the dependent variable (medv) that the model explains on the training data. The highest training R-squared is 0.642, for model2 (the log of the lstat predictor), so the log model fits the training data best. A high training R-squared by itself, however, guarantees nothing about test performance: a flexible model can inflate its training R-squared by fitting noise, and such an overfit model may still predict poorly on the held-out data.
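One way to make this concrete is to compute an analogous R-squared on the test data, i.e. \(1 - \text{RSS}/\text{TSS}\) evaluated on the held-out observations (a sketch; test.r2 is a small helper defined here, not a package function):
# Test-set R^2 for each model, using predictions on data.test
test.r2 = function(model) {
  pred = predict(model, data.test)
  1 - sum((data.test$medv - pred)^2) / sum((data.test$medv - mean(data.test$medv))^2)
}
sapply(list(Linear=model1, Log=model2, Sqrt=model3, Poly=model4), test.r2)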
For each of the models, create residual plots.
# Code to generate plots (residuals vs. fitted values for each model)
par(mfrow = c(2, 2))  # arrange the four plots in a 2x2 grid
plot(model1$fitted.values, model1$residuals, main = "Linear", xlab = "Fitted", ylab = "Residuals")
plot(model2$fitted.values, model2$residuals, main = "Log", xlab = "Fitted", ylab = "Residuals")
plot(model3$fitted.values, model3$residuals, main = "Sqrt", xlab = "Fitted", ylab = "Residuals")
plot(model4$fitted.values, model4$residuals, main = "Poly", xlab = "Fitted", ylab = "Residuals")
Do any of them appear particularly problematic? Explain why.
Answer: The residual plot for model1 (the linear regression) is the most problematic: the residuals show a clear curved pattern against the fitted values, which means a straight line misses the curvature in the medv-lstat relationship. Taking the log of lstat largely removes this pattern and yields a more even scatter. The polynomial model still shows slight curvature, so there may be room for a higher-degree polynomial term (probed below).
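Since the answer above suggests room for a higher-degree term, one quick probe is to fit a cubic and compare its training fit (a sketch; model5 is a name introduced here for illustration):
# Add a cubic term and check whether the training R^2 improves further
model5 = lm(medv ~ poly(lstat, 3), data = data.train)
summary(model5)$r.squared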
Calculate the test MSE for each of the models. Based on the result, which model is preferred?
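For reference, the test MSE computed below is the average squared prediction error over the \(n_{test} = 206\) held-out observations: \(\text{MSE}_{test} = \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} (y_i - \hat{y}_i)^2\).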
# Calculate test MSE
yhat1 = predict(model1, data.test)
yhat2 = predict(model2, data.test)
yhat3 = predict(model3, data.test)
yhat4 = predict(model4, data.test)
mse1 = mean((yhat1-data.test$medv)^2)
mse2 = mean((yhat2-data.test$medv)^2)
mse3 = mean((yhat3-data.test$medv)^2)
mse4 = mean((yhat4-data.test$medv)^2)
data.frame(Model=c("Linear","Log","Sqrt", "Poly"),MSE=c(mse1,mse2,mse3,mse4))
## Model MSE
## 1 Linear 39.26762
## 2 Log 30.83985
## 3 Sqrt 33.52887
## 4 Poly 40.52904
Answer: The lowest test MSE (30.84) belongs to the log model, so it is the preferred model for this study.
Is this result the same as the best training \(R^2\) value? Explain why there might be discrepancies between the training results and testing results. Use one model as a specific example.
Answer: In this case the two criteria happen to agree: the log model has both the highest training R-squared (0.642) and the lowest test MSE (30.84). In general, though, the rankings need not match, because training R-squared always improves as a model becomes more flexible, while test MSE penalizes overfitting; the particular train/test split can also introduce variability. The polynomial model is a specific example of a discrepancy: it has the second-highest training R-squared (0.630) yet the worst test MSE (40.53), suggesting it overfits the training data. Test MSE is the better measure of how a model will perform on new data.
Referring back to part A, note that the linear model is a reduced version of the polynomial model. How would you determine if a polynomial model is an improvement over the linear model? Conduct a test to verify this.
Answer: Because the linear model is nested inside the polynomial model, we can compare them with a partial F-test, run in R via anova(). If the p-value of the F-test is small, we reject the null hypothesis that the extra quadratic term contributes nothing. Here the test gives F = 109.54 with p < 2.2e-16, so model4 (polynomial) significantly improves the fit over model1 (linear).
# Run relevant test here: a partial F-test via anova() on the nested models.
model1 = lm(medv ~ lstat , data=data.train)
model4 = lm(medv ~ lstat + I(lstat^2) , data=data.train)
anova_analysis = anova(model1,model4)
anova_analysis
## Analysis of Variance Table
##
## Model 1: medv ~ lstat
## Model 2: medv ~ lstat + I(lstat^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 298 11961
## 2 297 8738 1 3222.7 109.54 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
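As a check, the F-statistic in the table can be reproduced by hand from the two residual sums of squares (a sketch using the formula for the partial F-test):
# F = ((RSS_reduced - RSS_full) / q) / (RSS_full / df_full), with q = 1 extra term
rss1 = sum(residuals(model1)^2)    # RSS of the linear model, about 11961
rss4 = sum(residuals(model4)^2)    # RSS of the polynomial model, about 8738
((rss1 - rss4) / 1) / (rss4 / 297) # about 109.5, matching the table above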