Statistical Learning Homework 2

Juan Olazaba 10/23/2023

Exercise 1

  1. The data have been loaded. The response variable is logBBB and the predictor (regressor) variables are the columns of bbbDescr.

  2. Yes, several predictors have degenerate (near-zero-variance) distributions. Based on the output below, they are: negative, peoe_vsa.2.1, peoe_vsa.3.1, a_acid, vsa_acid, frac.anion7., and alert.

## [1]  3 16 17 22 25 50 60

##   negative peoe_vsa.2.1 peoe_vsa.3.1 a_acid vsa_acid frac.anion7. alert
## 1        0            0            0      0        0        0.000     0
## 2        0            0            0      0        0        0.001     0
## 3        0            0            0      0        0        0.000     0
## 4        0            0            0      0        0        0.000     0
## 5        0            0            0      0        0        0.000     0
## 6        0            0            0      0        0        0.000     0
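The indices and the preview above were presumably produced along these lines (a minimal sketch; it assumes caret is loaded for nearZeroVar()):

#indices of the near-zero-variance (degenerate) predictors
degenerate_idx <- nearZeroVar(bbbDescr)
print(degenerate_idx)

#first rows of those columns, confirming they are essentially constant
head(bbbDescr[, degenerate_idx])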
  3. Generally speaking, are there strong relationships between the predictor data? If so, how could correlations in the predictor set be reduced? Does this have a dramatic effect on the number of predictors available for modeling?

Yes, roughly a third of the predictors are involved in strong pairwise correlations. Correlation in the predictor set can be reduced by identifying the highly correlated predictors (here, pairwise correlation above 0.75) and removing them. This has a dramatic effect: after dropping the flagged predictors, the predictor set shrinks by almost half (see the column counts sketched after the correlation plots below).

#libraries needed for this chunk (they may already be loaded in a setup chunk)
library(caret)     #nearZeroVar, findCorrelation
library(corrplot)  #corrplot

#remove the degenerate (near-zero-variance) predictors
removed_degdis = bbbDescr[, -nearZeroVar(bbbDescr)]

#correlation matrix of the remaining predictors
cormatrix <- cor(removed_degdis)
corrplot(cormatrix, order = "hclust", tl.cex = .35)

Updated correlation matrix after removing the highly correlated predictors:

#find predictors with pairwise correlation above 0.75 in our correlation matrix
highcor <- findCorrelation(cormatrix, cutoff = 0.75)

cormatrix_updated <- cor(removed_degdis[,-highcor])
corrplot(cormatrix_updated, order = "hclust", tl.cex = .35)
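To quantify the reduction mentioned above, a quick count of the columns at each stage of the filtering (a small sketch using only the objects already defined):

#number of predictors at each stage
ncol(bbbDescr)                    #original predictor set
ncol(removed_degdis)              #after dropping near-zero-variance columns
ncol(removed_degdis[, -highcor])  #after also dropping highly correlated columns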

  4. When the predictor set is large, feature selection can be approached with forward selection, backward elimination, or a stepwise mixture of both. Of the three methods, backward elimination produced the best R-squared, so its summary is printed below (an AIC-based cross-check is sketched after the summary).
#cleaned predictor set (degenerate and highly correlated columns removed)
newdata <- removed_degdis[,-highcor]

#merge the response variable into the cleaned data set
newdata$response <- logBBB

#multiple linear regression on all remaining predictors
datalm = lm(response ~ ., data = newdata)

#p-value-based variable selection with olsrr
library(olsrr)
brain.f = ols_step_forward_p(datalm, penter = 0.05)
brain.b = ols_step_backward_p(datalm, prem = 0.1)
brain.s = ols_step_both_p(datalm, pent = 0.05, prem = 0.1)

#printout of the best variable selection method (backward elimination)
print(brain.b)

#printout of the best variable selection method
print(brain.b)
## 
## 
##                                  Elimination Summary                                  
## -------------------------------------------------------------------------------------
##         Variable                              Adj.                                       
## Step          Removed           R-Square    R-Square     C(p)        AIC        RMSE     
## -------------------------------------------------------------------------------------
##    1    smr_vsa3                  0.7514      0.6475    61.0016    322.0318    0.4627    
##    2    slogp_vsa9                0.7514      0.6499    59.0068    320.0393    0.4611    
##    3    rpcg                      0.7514      0.6523    57.0166    318.0534    0.4596    
##    4    slogp_vsa1                0.7514      0.6546    55.0249    316.0652    0.4580    
##    5    rule.of.5violations       0.7513      0.6569    53.0344    314.0788    0.4565    
##    6    peoe_vsa.3                0.7513      0.6591    51.0627    312.1194    0.4550    
##    7    peoe_vsa.4                0.7513      0.6613    49.0761    310.1386    0.4536    
##    8    slogp_vsa0                0.7513      0.6635    47.0883    308.1562    0.4521    
##    9    most_positive_charge      0.7512      0.6656    45.1044    306.1793    0.4507    
##   10    rncg                      0.7512      0.6677    43.1407    304.2313    0.4493    
##   11    peoe_vsa.0.1              0.7511      0.6697    41.1912    302.3036    0.4479    
##   12    dipole_moment              0.751      0.6717    39.2433    300.3783    0.4465    
##   13    inthb                     0.7509      0.6737    37.2714    298.4184    0.4452    
##   14    mlogp                     0.7508      0.6756    35.3443    296.5229    0.4439    
##   15    prx                       0.7506      0.6773    33.4670    294.6983    0.4427    
##   16    nbasic                    0.7504      0.6791    31.5704    292.8461    0.4415    
##   17    tcsa                      0.7502      0.6809    29.6771    290.9985    0.4402    
##   18    rdta                      0.7501      0.6826    27.7709    289.1324    0.4390    
##   19    peoe_vsa.2                0.7497      0.6841    25.9875    287.4413    0.4380    
##   20    smr_vsa2                   0.749      0.6851    24.4135    286.0474    0.4373    
##   21    smr_vsa7                  0.7483      0.6862    22.7871    284.5775    0.4366    
##   22    n_sp3                     0.7473      0.6868    21.3766    283.4112    0.4361    
##   23    vsa_base                  0.7466      0.6878    19.7869    281.9894    0.4354    
##   24    o_sp3                     0.7442      0.6867    19.1975    281.9653    0.4362    
##   25    chaa2                     0.7427      0.6867    18.1012    281.2213    0.4362    
##   26    n_sp2                     0.7404      0.6857    17.4448    281.0749    0.4369    
##   27    saaa2                     0.7386      0.6855    16.4485    280.4489    0.4371    
##   28    slogp_vsa6                 0.736      0.6841    15.9946    280.5478    0.4380    
##   29    achg                      0.7329      0.6823    15.7666    280.9275    0.4393    
##   30    wncs                      0.7292      0.6797    15.9408    281.8108    0.4411    
##   31    peoe_vsa.6.1              0.7264      0.6782    15.6032    281.9886    0.4421    
##   32    most_negative_charge      0.7239      0.6772    15.0166    281.8225    0.4428    
## -------------------------------------------------------------------------------------
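As a cross-check on the p-value-based backward elimination above, backward elimination can also be run with base R's step(), which uses AIC rather than p-values; since it is a different criterion it will not necessarily retain the same variables. A minimal sketch using the datalm fit from above:

#AIC-based backward elimination (a different criterion than the olsrr procedure above)
datalm.aic <- step(datalm, direction = "backward", trace = 0)
summary(datalm.aic)$adj.r.squared  #adjusted R-squared of the AIC-selected model
length(coef(datalm.aic)) - 1       #number of predictors retained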
  5. Implement different criterion-based procedures to find important predictors.
#best-subset search with leaps::regsubsets; nvmax defaults to 8, so subsets of up
#to 8 variables are reported, and really.big = TRUE permits the large predictor set
library(leaps)
cbp = regsubsets(response ~ ., data = newdata, really.big = TRUE)
cbp.summary = summary(cbp)

names(cbp.summary)
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"
par(mfrow = c(2,2))
plot(cbp.summary$rsq, xlab = "Number of Variables", ylab = "R-squared", type = "b")

plot(cbp.summary$cp, xlab = "Number of Variables", ylab = "Mallow's Cp", type = "b")

plot(cbp.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "b")

  6. The methods do not necessarily agree on feature selection. The best result among the p-value-based procedures in part 4 came from the backward elimination summary, which retained 28 predictors. The regsubsets() fit in part 5, once passed to summary(), reports subsets of at most 8 variables (the function's default nvmax). Matching the two selections shows that the variable sets are not the same; a small sketch of such a comparison is given below.
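A minimal sketch of one way to make that comparison, using cbp.summary from part 5 and the AIC-based backward fit datalm.aic sketched after part 4 (the p-value-based backward set itself would need to be reconstructed from the printed elimination summary):

#variables in the largest subset reported by regsubsets()
best_row  <- nrow(cbp.summary$which)
best_vars <- names(which(cbp.summary$which[best_row, ]))[-1]  #drop "(Intercept)"

#variables retained by the AIC-based backward elimination
kept_vars <- names(coef(datalm.aic))[-1]

#differences between the two selections
setdiff(best_vars, kept_vars)  #chosen by regsubsets but not by backward AIC
setdiff(kept_vars, best_vars)  #kept by backward AIC but not chosen by regsubsets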

Exercise 2

  1. Least squares was used to fit the model below.

x1 = GPA, x2 = IQ, x3 = Level (high school = 0, college = 1), x4 = GPA × IQ interaction, x5 = GPA × Level interaction, y = salary (thousands of USD)

Fitted model: y ≈ 50 + 20x1 + 0.07x2 + 35x3 + 0.01x4 - 10x5

High school (x3 = 0, so x5 = 0): y ≈ 50 + 20x1 + 0.07x2 + 0.01x4
College (x3 = 1, so x5 = x1): y ≈ 50 + 20x1 + 0.07x2 + 35 + 0.01x4 - 10x1

College - High school = 35 - 10x1

Thus, answer (iii) is correct. Comparing the two equations shows that a higher GPA is actually less favorable for college graduates relative to high school graduates: the college-minus-high-school salary difference, 35 - 10x1, becomes negative once GPA exceeds 3.5. In other words, for a fixed IQ and GPA, high school graduates earn more on average than college graduates provided the GPA is high enough.
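A quick numeric check of that crossover, using plain R arithmetic (no data needed):

#college-minus-high-school salary difference (in thousands of USD) by GPA
gpa <- c(3.0, 3.5, 4.0)
35 - 10 * gpa  #positive below a 3.5 GPA, zero at 3.5, negative above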

  2. Using the college equation, y ≈ 50 + 20x1 + 0.07x2 + 35 + 0.01x4 - 10x1, with x1 = 4.0 (GPA), x2 = 110 (IQ), and x4 = 4.0 × 110 = 440:

y ≈ 50 + 20(4.0) + 0.07(110) + 35 + 0.01(440) - 10(4.0)
y ≈ 50 + 80 + 7.7 + 35 + 4.4 - 40
y ≈ 137.1

The predicted salary for this college graduate is approximately $137,100.
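The same arithmetic as a one-line R check:

#predicted salary (in thousands of USD) for a college graduate with GPA 4.0 and IQ 110
50 + 20*4.0 + 0.07*110 + 35*1 + 0.01*(4.0*110) - 10*(4.0*1)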

  3. FALSE. The size of a coefficient by itself tells us nothing about the evidence for an interaction effect; the GPA/IQ coefficient may be small simply because the product GPA × IQ is on a large scale. To judge the term's significance we need its p-value (equivalently, its standard error and t-statistic) to compare against a chosen alpha. As a point estimate, a one-unit increase in the GPA × IQ interaction term is associated with a 0.01 increase in the mean response, holding the other terms fixed.
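If the underlying data were available, that check would be made on the interaction row of the coefficient table. A hypothetical sketch (the data frame salary_data and its column names are invented purely for illustration):

#hypothetical fit reproducing the model structure above
fit <- lm(salary ~ gpa * iq + gpa * level, data = salary_data)
summary(fit)$coefficients["gpa:iq", ]  #estimate, std. error, t value, p-value of the GPA x IQ term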

Exercise 3

  1. Regressing medv on each predictor individually, all predictors appear to be significant: at a significance level of 0.05, every slope p-value falls below the threshold. A scatter-plot correlation matrix is also provided to visualize each predictor's relationship with the response medv, with the pairwise correlations printed in the upper panels.

Only a few of the simple-regression outputs are shown here, since there are 12 predictors.
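A compact way to fit all of the simple regressions and collect the slope p-values (a minimal sketch; it assumes the Boston data frame used here, with medv as the response and the other columns as predictors):

#p-value of the slope from medv ~ x, for every predictor x in Boston
predictors <- setdiff(names(Boston), "medv")
simple_pvals <- sapply(predictors, function(x) {
  fit <- lm(reformulate(x, response = "medv"), data = Boston)
  summary(fit)$coefficients[2, "Pr(>|t|)"]
})
round(simple_pvals, 4)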

## 
## Call:
## lm(formula = medv ~ crim, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.957  -5.449  -2.007   2.512  29.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.03311    0.40914   58.74   <2e-16 ***
## crim        -0.41519    0.04389   -9.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.484 on 504 degrees of freedom
## Multiple R-squared:  0.1508, Adjusted R-squared:  0.1491 
## F-statistic: 89.49 on 1 and 504 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = medv ~ zn, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.918  -5.518  -1.006   2.757  29.082 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.91758    0.42474  49.248   <2e-16 ***
## zn           0.14214    0.01638   8.675   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.587 on 504 degrees of freedom
## Multiple R-squared:  0.1299, Adjusted R-squared:  0.1282 
## F-statistic: 75.26 on 1 and 504 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = medv ~ indus, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.017  -4.917  -1.457   3.180  32.943 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.75490    0.68345   43.54   <2e-16 ***
## indus       -0.64849    0.05226  -12.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.057 on 504 degrees of freedom
## Multiple R-squared:  0.234,  Adjusted R-squared:  0.2325 
## F-statistic:   154 on 1 and 504 DF,  p-value: < 2.2e-16

Plots for predictors in Boston data set.

#helper for pairs(): prints the absolute correlation in the upper panels, sized by its magnitude
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(Boston, lower.panel = panel.smooth, upper.panel = panel.cor,
      gap=0, row1attop=FALSE)

fit.nox <- lm(age ~ nox, data = Boston)

#simple regression fits used for the reference lines (assumed; not shown in the chunks above)
emod <- lm(medv ~ rm, data = Boston)
kmod <- lm(medv ~ lstat, data = Boston)

plot(Boston$rm, Boston$medv, main = "Relationship between Medv and Rm",
     ylab = "median value of owner-occupied homes", xlab = "average number of rooms per dwelling")
abline(emod, col = "red", lwd = 3)
legend("topleft", "Regression", col = "red", lty = 1)

plot(Boston$lstat, Boston$medv, main = "Relationship between Medv and Lstat",
     ylab = "median value of owner-occupied homes", xlab = "lower status of the population (percent)")
abline(kmod, col = "lightblue", lwd = 3)
legend("topleft", "Regression", col = "lightblue", lty = 1)

  2. At the 0.05 significance level we can reject H0 : βj = 0 for crim, zn, chas, nox, rm, dis, rad, tax, ptratio, and lstat. The remaining predictors, indus and age, can be considered for removal from the model; their large p-values indicate they are not statistically significant.
#running multiple regression on all variables
mlrmod <- lm(medv ~ crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+lstat, data = Boston)
print(summary(mlrmod))
## 
## Call:
## lm(formula = medv ~ crim + zn + indus + chas + nox + rm + age + 
##     dis + rad + tax + ptratio + lstat, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1304  -2.7673  -0.5814   1.9414  26.2526 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.617270   4.936039   8.431 3.79e-16 ***
## crim         -0.121389   0.033000  -3.678 0.000261 ***
## zn            0.046963   0.013879   3.384 0.000772 ***
## indus         0.013468   0.062145   0.217 0.828520    
## chas          2.839993   0.870007   3.264 0.001173 ** 
## nox         -18.758022   3.851355  -4.870 1.50e-06 ***
## rm            3.658119   0.420246   8.705  < 2e-16 ***
## age           0.003611   0.013329   0.271 0.786595    
## dis          -1.490754   0.201623  -7.394 6.17e-13 ***
## rad           0.289405   0.066908   4.325 1.84e-05 ***
## tax          -0.012682   0.003801  -3.337 0.000912 ***
## ptratio      -0.937533   0.132206  -7.091 4.63e-12 ***
## lstat        -0.552019   0.050659 -10.897  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.798 on 493 degrees of freedom
## Multiple R-squared:  0.7343, Adjusted R-squared:  0.7278 
## F-statistic: 113.5 on 12 and 493 DF,  p-value: < 2.2e-16
  3. Including all of the predictors in one model substantially improves the R-squared. In the simple linear regressions, most individual predictors explain relatively little of the variation in medv; the fit is much better when indus and age are dropped and the remaining predictors are kept in the multiple regression.

Plotting the univariate regression coefficients against the multiple regression coefficients:
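The data frame df used in the ggplot call below is not constructed in the chunks shown; a minimal sketch of how it could be built (the column names one and two match the plot aesthetics, the simple-regression slopes are recomputed here, and Boston is assumed to contain exactly the predictors used in mlrmod plus medv):

#univariate slope for each predictor paired with its multiple-regression coefficient
predictors <- setdiff(names(Boston), "medv")
uni_coef <- sapply(predictors, function(x)
  unname(coef(lm(reformulate(x, response = "medv"), data = Boston))[2]))
multi_coef <- coef(mlrmod)[predictors]
df <- data.frame(one = uni_coef, two = multi_coef)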

library(ggplot2)

ggplot(df, aes(x = one, y = two)) +
  geom_jitter() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw() +
  labs(x = "Univariate", y = "Multivariate") +
  ggtitle("Univariate Regression Coefficients vs. Multiple Regression Coefficients")
## `geom_smooth()` using formula = 'y ~ x'

  4. Forward selection, backward elimination, and stepwise regression were implemented to find the important predictors. The results confirm the earlier conclusion: keep the ten significant variables and drop the two that are not (indus and age, which are exactly the variables removed in the backward elimination summary). However, p-values alone should not drive variable selection, so further criterion-based checks follow in the next part.
process_forward = ols_step_forward_p(mlrmod, penter = 0.05)
process_backward = ols_step_backward_p(mlrmod, prem = 0.1)
stepwise_selection = ols_step_both_p(mlrmod, pent = 0.05, prem = 0.1)

print(process_forward)
## 
##                              Selection Summary                              
## ---------------------------------------------------------------------------
##         Variable                  Adj.                                         
## Step    Entered     R-Square    R-Square      C(p)         AIC        RMSE     
## ---------------------------------------------------------------------------
##    1    lstat         0.5441      0.5432    343.8481    3288.9750    6.2158    
##    2    rm            0.6386      0.6371    170.6581    3173.5423    5.5403    
##    3    ptratio       0.6786      0.6767     98.3210    3116.0973    5.2294    
##    4    dis           0.6903      0.6878     78.6419    3099.3590    5.1386    
##    5    nox           0.7081      0.7052     47.6477    3071.4386    4.9939    
##    6    chas          0.7158      0.7124     35.3881    3059.9390    4.9326    
##    7    zn            0.7196      0.7157     30.2466    3055.0403    4.9040    
##    8    crim          0.7236      0.7192     24.8229    3049.7679    4.8738    
##    9    rad           0.7273      0.7223     20.0897    3045.0803    4.8466    
##   10    tax           0.7342      0.7289      9.1202    3033.9441    4.7889    
## ---------------------------------------------------------------------------
print(process_backward)
## 
## 
##                            Elimination Summary                             
## --------------------------------------------------------------------------
##         Variable                  Adj.                                        
## Step    Removed     R-Square    R-Square     C(p)         AIC        RMSE     
## --------------------------------------------------------------------------
##    1    indus         0.7343      0.7284    11.0470    3035.8689    4.7934    
##    2    age           0.7342      0.7289     9.1202    3033.9441    4.7889    
## --------------------------------------------------------------------------
print(stepwise_selection)
## 
##                               Stepwise Selection Summary                                
## ---------------------------------------------------------------------------------------
##                      Added/                   Adj.                                         
## Step    Variable    Removed     R-Square    R-Square      C(p)         AIC        RMSE     
## ---------------------------------------------------------------------------------------
##    1     lstat      addition       0.544       0.543    343.8480    3288.9750    6.2158    
##    2       rm       addition       0.639       0.637    170.6580    3173.5423    5.5403    
##    3    ptratio     addition       0.679       0.677     98.3210    3116.0973    5.2294    
##    4      dis       addition       0.690       0.688     78.6420    3099.3590    5.1386    
##    5      nox       addition       0.708       0.705     47.6480    3071.4386    4.9939    
##    6      chas      addition       0.716       0.712     35.3880    3059.9390    4.9326    
##    7       zn       addition       0.720       0.716     30.2470    3055.0403    4.9040    
##    8      crim      addition       0.724       0.719     24.8230    3049.7679    4.8738    
##    9      rad       addition       0.727       0.722     20.0900    3045.0803    4.8466    
##   10      tax       addition       0.734       0.729      9.1200    3033.9441    4.7889    
## ---------------------------------------------------------------------------------------
  5. Implement different criterion-based procedures to find important predictors.
# R-squared plot from a best-subset search; the formula interface keeps medv as the
# response and the remaining columns as the candidate predictors
b1 = Boston[complete.cases(Boston), ]

mod.sel = regsubsets(medv ~ ., data = b1)
reg.summary = summary(mod.sel)
plot(mod.sel, scale = "r2")
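To turn the R-squared plot into concrete choices, the summary object can be queried directly (a small sketch using reg.summary as defined above):

# subset size preferred by each criterion
which.max(reg.summary$adjr2)  # adjusted R-squared
which.min(reg.summary$cp)     # Mallows' Cp
which.min(reg.summary$bic)    # BIC

# coefficients of, e.g., the model preferred by BIC
coef(mod.sel, id = which.min(reg.summary$bic))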

Mallow’s Cp plot

library(faraway)
## 
## Attaching package: 'faraway'

## The following object is masked from 'package:olsrr':
## 
##     hsb

## The following object is masked from 'package:lattice':
## 
##     melanoma
# Cp for all candidate subsets, again with medv as the response
mod1.sel = leaps(x = b1[, names(b1) != "medv"], y = b1$medv)
Cpplot(mod1.sel)

Stepwise AIC selection

# full model, then stepwise search minimizing AIC
smod = lm(medv ~ ., data = Boston)
aicmod = step(smod, trace = 0)
summary(aicmod)
## 
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad + 
##     tax + ptratio + lstat, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1814  -2.7625  -0.6243   1.8448  26.3920 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.451747   4.903283   8.454 3.18e-16 ***
## crim         -0.121665   0.032919  -3.696 0.000244 ***
## zn            0.046191   0.013673   3.378 0.000787 ***
## chas          2.871873   0.862591   3.329 0.000935 ***
## nox         -18.262427   3.565247  -5.122 4.33e-07 ***
## rm            3.672957   0.409127   8.978  < 2e-16 ***
## dis          -1.515951   0.187675  -8.078 5.08e-15 ***
## rad           0.283932   0.063945   4.440 1.11e-05 ***
## tax          -0.012292   0.003407  -3.608 0.000340 ***
## ptratio      -0.930961   0.130423  -7.138 3.39e-12 ***
## lstat        -0.546509   0.047442 -11.519  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.789 on 495 degrees of freedom
## Multiple R-squared:  0.7342, Adjusted R-squared:  0.7289 
## F-statistic: 136.8 on 10 and 495 DF,  p-value: < 2.2e-16

Plots for several criterion-based procedures

k <- ols_step_all_possible(mlrmod)
plot(k, pch = 1)

Best subsets regression: the subset of predictors that best meets a well-defined objective criterion.

ols_step_best_subset(mlrmod)
##                        Best Subsets Regression                        
## ----------------------------------------------------------------------
## Model Index    Predictors
## ----------------------------------------------------------------------
##      1         lstat                                                   
##      2         rm lstat                                                
##      3         rm ptratio lstat                                        
##      4         rm dis ptratio lstat                                    
##      5         nox rm dis ptratio lstat                                
##      6         chas nox rm dis ptratio lstat                           
##      7         zn chas nox rm dis ptratio lstat                        
##      8         crim zn chas nox rm dis ptratio lstat                   
##      9         crim zn nox rm dis rad tax ptratio lstat                
##     10         crim zn chas nox rm dis rad tax ptratio lstat           
##     11         crim zn chas nox rm age dis rad tax ptratio lstat       
##     12         crim zn indus chas nox rm age dis rad tax ptratio lstat 
## ----------------------------------------------------------------------
## 
##                                                        Subsets Regression Summary                                                        
## -----------------------------------------------------------------------------------------------------------------------------------------
##                        Adj.        Pred                                                                                                   
## Model    R-Square    R-Square    R-Square      C(p)         AIC         SBIC          SBC          MSEP         FPE       HSP       APC  
## -----------------------------------------------------------------------------------------------------------------------------------------
##   1        0.5441      0.5432      0.5393    343.8481    3288.9750    1851.0792    3301.6546    19549.6534    38.7884    0.0768    0.4595 
##   2        0.6386      0.6371      0.6298    170.6581    3173.5423    1735.9829    3190.4485    15531.4544    30.8764    0.0611    0.3657 
##   3        0.6786      0.6767      0.6695     98.3210    3116.0973    1678.8739    3137.2300    13837.4817    27.5628    0.0546    0.3265 
##   4        0.6903      0.6878      0.6794     78.6419    3099.3590    1662.1702    3124.7183    13361.0922    26.6659    0.0528    0.3159 
##   5        0.7081      0.7052      0.6963     47.6477    3071.4386    1634.6744    3101.0244    12619.1774    25.2344    0.0500    0.2989 
##   6        0.7158      0.7124      0.7012     35.3881    3059.9390    1623.4022    3093.7513    12311.6340    24.6674    0.0489    0.2922 
##   7        0.7196      0.7157      0.7041     30.2466    3055.0403    1618.6349    3093.0792    12169.3548    24.4298    0.0484    0.2894 
##   8        0.7236      0.7192      0.7069     24.8229    3049.7679    1613.5623    3092.0333    12019.8884    24.1766    0.0479    0.2864 
##   9        0.7283      0.7234      0.7131     18.1627    3043.1500    1607.2571    3089.6419    11840.7778    23.8624    0.0473    0.2827 
##  10        0.7342      0.7289      0.7163      9.1202    3033.9441    1598.5515    3084.6625    11604.8771    23.4323    0.0464    0.2776 
##  11        0.7343      0.7284      0.7146     11.0470    3035.8689    1600.5322    3090.8139    11626.6889    23.5216    0.0466    0.2786 
##  12        0.7343      0.7278      0.7138     13.0000    3037.8207    1602.5391    3096.9922    11649.2106    23.6126    0.0468    0.2797 
## -----------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria 
##  SBIC: Sawa's Bayesian Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
##  MSEP: Estimated error of prediction, assuming multivariate normality 
##  FPE: Final Prediction Error 
##  HSP: Hocking's Sp 
##  APC: Amemiya Prediction Criteria