Problem Set 1

Textbook problems

Answer: 1. (a) Flexible method - A large sample provides enough training data for a flexible method to capture the signal without overfitting.

(b) Inflexible method - With a small sample size, a flexible method is likely to overfit and to pick up noise in the data.
(c) Flexible method - Since the relationship is given to be highly non-linear, a flexible method is needed to capture its shape; an inflexible method would be badly biased.
(d) Inflexible method - When the variance of the error terms is very high, a flexible method tends to fit the noise rather than the signal, so an inflexible method is expected to perform better.
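The trade-off described in these answers can be illustrated with a small simulation. This is only a sketch: the data are simulated, and the names `true_f`, `train`, `test` and `mse` are hypothetical, not part of the textbook problem. Because the straight line is nested inside the degree-10 polynomial, the flexible fit cannot have a higher training MSE; how the test MSEs compare depends on the sample size and the noise level.

```r
# Simulated illustration only; nothing here comes from the textbook data.
set.seed(1)

# Hypothetical true relationship: a non-linear signal plus noise.
true_f <- function(x) sin(2 * x)

n_train <- 30
train <- data.frame(x = runif(n_train, 0, 3))
train$y <- true_f(train$x) + rnorm(n_train, sd = 0.5)

test <- data.frame(x = runif(1000, 0, 3))
test$y <- true_f(test$x) + rnorm(1000, sd = 0.5)

# Inflexible fit (straight line) versus flexible fit (degree-10 polynomial).
fit_inflexible <- lm(y ~ x, data = train)
fit_flexible <- lm(y ~ poly(x, 10), data = train)

# Mean squared error of a fitted model on a given data set.
mse <- function(fit, data) mean((data$y - predict(fit, data))^2)

results <- data.frame(
  model = c("inflexible", "flexible"),
  train_mse = c(mse(fit_inflexible, train), mse(fit_flexible, train)),
  test_mse = c(mse(fit_inflexible, test), mse(fit_flexible, test))
)
print(results)
```

Changing `n_train` or the noise `sd` and rerunning is a quick way to see when the flexible fit starts to lose on the test set.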

Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

A. Regression scenario (CEO salary is a quantitative response), Inference to understand which factors affect the CEO salary. n = 500 (no. of firms) p = 3 (profit, number of employees and industry; CEO salary is the response, not a predictor)

We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

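A. Classification scenario (success or failure), Prediction (we want to know the outcome for the new product). n = 20 products p = 13 (price, marketing budget, competition price and ten other variables; success/failure is the response)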

We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

A. Regression (% change in the USD/Euro), Prediction (based on stock market changes). n = 52 weeks p = 3 (% change in the US, British and German markets; the % change in the USD/Euro is the response)

We have a dataset with five predictors, X1 = GPA, X2 = IQ, X3 = Level (1 for College and 0 for High School), X4 = Interaction between GPA and IQ, and X5 = Interaction between GPA and Level.

A. Model for this regression is: Y-hat = 50 + 20X1 + 0.07X2 + 35X3 + 0.01X4 -10X5

iii: True. For fixed GPA and IQ, the predicted difference between college and high school graduates is 35 - 10 * GPA. This is negative when GPA is above 3.5, so high school graduates earn more on average than college graduates provided that the GPA is high enough.
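The break-even GPA can be checked with a few lines of R. This is a minimal sketch using only the two coefficients from the stated model that involve Level; the names `b3`, `b5` and `college_minus_hs` are hypothetical.

```r
# Coefficients taken from the model stated above.
b3 <- 35    # Level (1 = College, 0 = High School)
b5 <- -10   # GPA x Level interaction

# College minus high school predicted salary at fixed GPA and IQ.
college_minus_hs <- function(gpa) b3 + b5 * gpa

# GPA at which the two groups have the same predicted salary.
break_even_gpa <- -b3 / b5
gap <- college_minus_hs(c(3.0, 3.5, 4.0))

print(break_even_gpa)  # 3.5
print(gap)             # 5 0 -5
```

A positive gap favours college graduates and a negative gap favours high school graduates.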
Given: IQ = 110 & GPA = 4.0 Predict the salary.

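X1 = 4 (GPA) X2 = 110 (IQ) X3 = 1 (since college graduate) X4 = (4 * 110) = 440 X5 = 4 * 1 = 4

Predicted salary, Y-hat = 50 + 20(4) + 0.07(110) + 35(1) + 0.01(440) - 10(4) = 137.1, i.e. $ 137,100 (salary is measured in thousands of dollars)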
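The prediction can be reproduced by plugging the values into the stated model. A minimal sketch, assuming X1 = GPA and X2 = IQ as defined in the question; the vector names `coefs` and `x` are hypothetical.

```r
# Coefficients of the stated model, in the order intercept, X1, ..., X5.
coefs <- c(intercept = 50, gpa = 20, iq = 0.07, level = 35,
           gpa_iq = 0.01, gpa_level = -10)

# Inputs for a college graduate with IQ 110 and GPA 4.0.
gpa <- 4.0
iq <- 110
level <- 1

# Design vector: 1 for the intercept, then X1..X5.
x <- c(1, gpa, iq, level, gpa * iq, gpa * level)

y_hat <- sum(coefs * x)
print(y_hat)  # 137.1
```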

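False. The size of the coefficient β4 = 0.01 does not by itself tell us how much evidence there is for an interaction effect, because the magnitude depends on the scale of the GPA * IQ term. To judge whether the interaction is statistically significant we need the standard error of the coefficient and the resulting t-statistic and p-value.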

Problem 1

This question involves the Boston data set from the ISLR2 package. Separate the data into a training set and test set. Designate the first 300 observations as the training data, and the remainder (206 obs) as the testing data.

# Split the data here.

data.train = Boston[1:300,]
data.test = Boston[301:nrow(Boston),]

Part A

Using the training data, fit four models predicting medv (Y) using the feature lstat (X). The four models include:

Linear model $Y=β_0+β_1 X$
Log model $Y=β_0+β_1 \log(X)$
Square root model $Y=β_0+β_1 \sqrt{X}$
Polynomial model $Y=β_0+β_1 X+β_2 X^2$

For each of the commands, run the summary to display the results.

# Linear model
model1 = lm(medv ~ lstat , data=data.train)
summary(model1)

## 
## Call:
## lm(formula = medv ~ lstat, data = data.train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.761 -4.406 -1.707  2.737 21.233 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.24931    0.72992   49.66   <2e-16 ***
## lstat       -1.00565    0.05901  -17.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.335 on 298 degrees of freedom
## Multiple R-squared:  0.4936, Adjusted R-squared:  0.4919 
## F-statistic: 290.4 on 1 and 298 DF,  p-value: < 2.2e-16

# Log model
model2 = lm(medv ~ log(lstat) , data=data.train)
summary(model2)

## 
## Call:
## lm(formula = medv ~ log(lstat), data = data.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.8345  -3.7953  -0.8603   2.4777  22.1341 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  52.0145     1.1880   43.78   <2e-16 ***
## log(lstat)  -12.0330     0.5205  -23.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.326 on 298 degrees of freedom
## Multiple R-squared:  0.642,  Adjusted R-squared:  0.6408 
## F-statistic: 534.5 on 1 and 298 DF,  p-value: < 2.2e-16

# Square root model
model3 = lm(medv ~ sqrt(lstat) , data=data.train)
summary(model3)

## 
## Call:
## lm(formula = medv ~ sqrt(lstat), data = data.train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.953 -3.922 -1.312  2.634 21.433 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  48.8269     1.1962   40.82   <2e-16 ***
## sqrt(lstat)  -7.4278     0.3656  -20.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.765 on 298 degrees of freedom
## Multiple R-squared:  0.5807, Adjusted R-squared:  0.5793 
## F-statistic: 412.7 on 1 and 298 DF,  p-value: < 2.2e-16

# Polynomial model
model4 = lm(medv ~ lstat + I(lstat^2) , data=data.train)
summary(model4)

## 
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = data.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.6158  -3.9828  -0.7244   2.3903  21.5045 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 45.406239   1.075180   42.23   <2e-16 ***
## lstat       -2.720028   0.171418  -15.87   <2e-16 ***
## I(lstat^2)   0.060092   0.005742   10.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.424 on 297 degrees of freedom
## Multiple R-squared:   0.63,  Adjusted R-squared:  0.6275 
## F-statistic: 252.9 on 2 and 297 DF,  p-value: < 2.2e-16

Part B

For the log model in part A, give an interpretation of the intercept and slope in context.

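Answer: For the log model in Part A, the intercept, 52.0145, is the predicted medv (median home value, in $1000s) when log(lstat) = 0, that is, when lstat = 1. The slope, -12.0330, means that each one-unit increase in log(lstat) is associated with a decrease of about 12.03 units in predicted medv. A one-unit increase in log(lstat) corresponds to multiplying lstat by e (about 2.72), so an equivalent reading is that a 1% increase in lstat is associated with a decrease of roughly 0.12 units in medv.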
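The interpretation of a slope on log(lstat) can be made concrete with the coefficients reported in the Part A summary for the log model (52.0145 and -12.0330). This is a sketch of the arithmetic only; `b0`, `b1` and `predict_medv` are hypothetical names.

```r
# Estimates copied from the summary(model2) output in Part A.
b0 <- 52.0145
b1 <- -12.0330

# Predicted medv for a given lstat under the log model.
predict_medv <- function(lstat) b0 + b1 * log(lstat)

# At lstat = 1, log(lstat) = 0, so the prediction equals the intercept.
print(predict_medv(1))          # 52.0145

# Change in predicted medv for a 1% increase and for a doubling of lstat.
effect_1pct <- b1 * log(1.01)
effect_double <- b1 * log(2)
print(round(effect_1pct, 3))    # -0.12
print(round(effect_double, 2))  # -8.34
```

Because the predictor enters on the log scale, these effects are the same at any starting value of lstat.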

Part C

For each of the four models, report the $R^2$ for the training data. Which model appears to be the best fit on the training data? What if anything can we say about performance on the testing data?

# Extract R2 values
summary(model1)$r.squared

## [1] 0.493566

summary(model2)$r.squared

## [1] 0.6420314

summary(model3)$r.squared

## [1] 0.5806988

summary(model4)$r.squared

## [1] 0.6300208

Answer: R-squared measures the proportion of the variance in the response (medv) that the model explains on the training data. Model2 (the log model) has the highest training R-squared, 0.642, followed by the polynomial (0.630), square root (0.581) and linear (0.494) models, so the log model appears to fit the training data best. Training R-squared alone does not tell us how the models will perform on the testing data: a model that fits the training set well may be overfitting, so test performance has to be checked on the held-out observations.
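As a reminder of what the reported number is, R-squared can be computed by hand as 1 - RSS/TSS. The sketch below uses simulated stand-in data (the variables `x` and `y` are hypothetical, not the Boston columns) and checks that the manual value agrees with `summary()`.

```r
# Simulated stand-in data; x and y are hypothetical, not Boston variables.
set.seed(42)
sim <- data.frame(x = runif(100, 1, 30))
sim$y <- 50 - 12 * log(sim$x) + rnorm(100, sd = 5)

fit <- lm(y ~ log(x), data = sim)

# Residual sum of squares and total sum of squares on the fitting data.
rss <- sum(residuals(fit)^2)
tss <- sum((sim$y - mean(sim$y))^2)

r2_manual <- 1 - rss / tss
r2_summary <- summary(fit)$r.squared
print(c(manual = r2_manual, summary = r2_summary))
```

Both quantities use only the data the model was fitted to, which is why a high value says nothing definite about held-out data.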

Part D

For each of the models, create residual plots.

# Code to generate plots
plot(model1$fitted.values, model1$residuals)

plot(model2$fitted.values, model2$residuals)

plot(model3$fitted.values, model3$residuals)

plot(model4$fitted.values, model4$residuals)

Do any of them appear particularly problematic? Explain why.

Answer: The residual plot for the linear model (model1) shows a clear curved pattern, which indicates that a straight line misses non-linearity in the relationship between lstat and medv. Taking the log of lstat reduces this pattern and gives a more even scatter. The polynomial model's plot still seems to show slight curvature, so a higher-degree polynomial might be worth trying.
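Curvature in a residual plot is easier to judge with a zero reference line and a smoother drawn on top. The following is a sketch on simulated data with a deliberately quadratic relationship (the data frame `sim` and its coefficients are hypothetical); the same three plotting calls could be applied to any of the fitted models.

```r
# Simulated data with a curved relationship; not the Boston data.
set.seed(7)
sim <- data.frame(x = runif(200, 1, 30))
sim$y <- 45 - 2.7 * sim$x + 0.06 * sim$x^2 + rnorm(200, sd = 5)

# Straight-line fit that ignores the curvature.
fit_lin <- lm(y ~ x, data = sim)

# Residuals versus fitted values, with a zero line and a lowess smoother.
plot(fitted(fit_lin), residuals(fit_lin),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Linear fit to curved data")
abline(h = 0, lty = 2)
lines(lowess(fitted(fit_lin), residuals(fit_lin)), col = "red")

# Least squares forces residuals to be uncorrelated with the fitted values,
# so a correlation cannot reveal the curvature; the plot is needed for that.
resid_fit_cor <- cor(residuals(fit_lin), fitted(fit_lin))
print(resid_fit_cor)
```

A smoother that bends away from the dashed line suggests that the model is missing a non-linear term.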

Part E

Calculate the test MSE for each of the models. Based on the result, which model is preferred?

# Calculate test MSE
yhat1 = predict(model1, data.test)
yhat2 = predict(model2, data.test)
yhat3 = predict(model3, data.test)
yhat4 = predict(model4, data.test)

mse1 = mean((yhat1-data.test$medv)^2)
mse2 = mean((yhat2-data.test$medv)^2)
mse3 = mean((yhat3-data.test$medv)^2)
mse4 = mean((yhat4-data.test$medv)^2)

data.frame(Model=c("Linear","Log","Sqrt", "Poly"),MSE=c(mse1,mse2,mse3,mse4))

##    Model      MSE
## 1 Linear 39.26762
## 2    Log 30.83985
## 3   Sqrt 33.52887
## 4   Poly 40.52904

Answer: The log model has the lowest test MSE (30.84), so it is the preferred model for this train/test split.

Is this result the same as the best training $R^2$ value? Explain why there might be discrepancies between the training results and testing results. Use one model as a specific example.

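Answer: The top choice is the same: the log model has both the highest training R-squared (0.642) and the lowest test MSE (30.84). The rest of the ranking changes, though. For example, the polynomial model has the second-highest training R-squared (0.630) but the highest test MSE (40.53), worse even than the linear model (39.27). Discrepancies like this can arise because adding terms to a model never lowers its training R-squared, so training fit can reward overfitting, and because this particular split (the first 300 rows for training, the rest for testing) may produce training and test sets that differ. Test MSE is the better guide to how a model will perform on new data.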
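Placing the training R-squared values from Part C next to the test MSEs from Part E makes the comparison explicit. The sketch below only re-enters the numbers already reported above and ranks them; the data frame name `comparison` is hypothetical.

```r
# Values copied from the Part C and Part E output above.
comparison <- data.frame(
  model = c("Linear", "Log", "Sqrt", "Poly"),
  train_r2 = c(0.493566, 0.6420314, 0.5806988, 0.6300208),
  test_mse = c(39.26762, 30.83985, 33.52887, 40.52904)
)

# Rank 1 is best: highest training R-squared, lowest test MSE.
comparison$rank_train <- rank(-comparison$train_r2)
comparison$rank_test <- rank(comparison$test_mse)

print(comparison)
```

The log model ranks first on both criteria, while the polynomial model drops from second on the training data to last on the test data.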

Part F

Referring back to part A, note that the linear model is a reduced version of the polynomial model. How would you determine if a polynomial model is an improvement over the linear model? Conduct a test to verify this.

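Answer: Because the linear model is nested within the polynomial model, the two can be compared with a partial F-test (anova() in R). The null hypothesis is that the coefficient on lstat^2 is zero, i.e. the extra term does not improve the fit. Here F = 109.54 with a p-value below 2.2e-16, so we reject the null hypothesis and conclude that model4 (polynomial) is a significant improvement over model1 (linear) on the training data.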

# Run relevant test here
# Note: Dear Prof. Aaron, I did take help with this section as I wasn't aware of the concept.
model1 = lm(medv ~ lstat , data=data.train)
model4 = lm(medv ~ lstat + I(lstat^2) , data=data.train)

anova_analysis = anova(model1,model4)

anova_analysis

## Analysis of Variance Table
## 
## Model 1: medv ~ lstat
## Model 2: medv ~ lstat + I(lstat^2)
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    298 11961                                  
## 2    297  8738  1    3222.7 109.54 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
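The F statistic in this table can be reproduced by hand from the two residual sums of squares. This is a sketch using the rounded RSS values printed above, so the result matches the table only approximately; the variable names are hypothetical.

```r
# Rounded values copied from the ANOVA table above.
rss_reduced <- 11961   # linear model, 298 residual df
rss_full <- 8738       # polynomial model, 297 residual df
df_reduced <- 298
df_full <- 297

# Partial F statistic: drop in RSS per extra parameter, relative to the
# residual variance estimate of the full model.
f_stat <- ((rss_reduced - rss_full) / (df_reduced - df_full)) /
  (rss_full / df_full)

# Upper-tail probability under the F distribution with (1, 297) df.
p_value <- pf(f_stat, df_reduced - df_full, df_full, lower.tail = FALSE)

print(round(f_stat, 1))  # 109.5
print(p_value)
```

With a single added term, this F statistic is approximately the square of the t value reported for I(lstat^2) in the polynomial model summary (10.47).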