Juan Olazaba 10/23/2023
The data have been loaded. Response variable: logBBB. Regressor variables: bbbDescr.
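For reference, a minimal sketch of how the data and packages were presumably loaded (assuming the blood-brain barrier data come from caret's BloodBrain data set, which supplies bbbDescr and logBBB):
library(caret)     # nearZeroVar, findCorrelation, and the BloodBrain data
library(corrplot)  # correlation plots
library(olsrr)     # stepwise and best-subsets selection
library(leaps)     # regsubsets
library(ggplot2)   # coefficient plot
data(BloodBrain)   # assumed source: provides bbbDescr (descriptors) and logBBB (response)
dim(bbbDescr)
length(logBBB)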
Yes, several predictors have degenerate (near-zero variance) distributions. These are: negative, peoe_vsa.2.1, peoe_vsa.3.1, a_acid, vsa_acid, frac.anion7., and alert.
## [1] 3 16 17 22 25 50 60
## negative peoe_vsa.2.1 peoe_vsa.3.1 a_acid vsa_acid frac.anion7. alert
## 1 0 0 0 0 0 0.000 0
## 2 0 0 0 0 0 0.001 0
## 3 0 0 0 0 0 0.000 0
## 4 0 0 0 0 0 0.000 0
## 5 0 0 0 0 0 0.000 0
## 6 0 0 0 0 0 0.000 0
Yes, roughly a third of the predictors display substantial correlation with other predictors. Collinearity can be reduced by identifying which predictors are highly correlated and removing them. After removing the highly correlated predictors (pairwise correlation above 0.75), the predictor set is reduced by roughly half.
#Removed degenerate distributions
removed_degdis = bbbDescr[, -nearZeroVar(bbbDescr)]
#Correlation matrix
cormatrix <- cor(removed_degdis)
corrplot(cormatrix, order = "hclust", tl.cex = .35)
Updated correlation matrix with removed predictors.
#Function to find highly correlated predictors within our correlation matrix
highcor <- findCorrelation(cormatrix, .75 )
cormatrix_updated <- cor(removed_degdis[,-highcor])
corrplot(cormatrix_updated, order = "hclust", tl.cex = .35)
#cleaned data
newdata <- removed_degdis[,-highcor]
#merged response variable to cleaned data set
newdata$response <- logBBB
#multiple linear regression
datalm = lm(response ~ ., data = newdata)
#more data preprocessing
brain.f = ols_step_forward_p(datalm, penter = 0.05)
brain.b = ols_step_backward_p(datalm, prem = 0.1)
brain.s = ols_step_both_p(datalm, pent = 0.05, prem = 0.1)
#printout of the best-performing variable selection method (backward elimination)
print(brain.b)
##
##
## Elimination Summary
## -------------------------------------------------------------------------------------
## Variable Adj.
## Step Removed R-Square R-Square C(p) AIC RMSE
## -------------------------------------------------------------------------------------
## 1 smr_vsa3 0.7514 0.6475 61.0016 322.0318 0.4627
## 2 slogp_vsa9 0.7514 0.6499 59.0068 320.0393 0.4611
## 3 rpcg 0.7514 0.6523 57.0166 318.0534 0.4596
## 4 slogp_vsa1 0.7514 0.6546 55.0249 316.0652 0.4580
## 5 rule.of.5violations 0.7513 0.6569 53.0344 314.0788 0.4565
## 6 peoe_vsa.3 0.7513 0.6591 51.0627 312.1194 0.4550
## 7 peoe_vsa.4 0.7513 0.6613 49.0761 310.1386 0.4536
## 8 slogp_vsa0 0.7513 0.6635 47.0883 308.1562 0.4521
## 9 most_positive_charge 0.7512 0.6656 45.1044 306.1793 0.4507
## 10 rncg 0.7512 0.6677 43.1407 304.2313 0.4493
## 11 peoe_vsa.0.1 0.7511 0.6697 41.1912 302.3036 0.4479
## 12 dipole_moment 0.751 0.6717 39.2433 300.3783 0.4465
## 13 inthb 0.7509 0.6737 37.2714 298.4184 0.4452
## 14 mlogp 0.7508 0.6756 35.3443 296.5229 0.4439
## 15 prx 0.7506 0.6773 33.4670 294.6983 0.4427
## 16 nbasic 0.7504 0.6791 31.5704 292.8461 0.4415
## 17 tcsa 0.7502 0.6809 29.6771 290.9985 0.4402
## 18 rdta 0.7501 0.6826 27.7709 289.1324 0.4390
## 19 peoe_vsa.2 0.7497 0.6841 25.9875 287.4413 0.4380
## 20 smr_vsa2 0.749 0.6851 24.4135 286.0474 0.4373
## 21 smr_vsa7 0.7483 0.6862 22.7871 284.5775 0.4366
## 22 n_sp3 0.7473 0.6868 21.3766 283.4112 0.4361
## 23 vsa_base 0.7466 0.6878 19.7869 281.9894 0.4354
## 24 o_sp3 0.7442 0.6867 19.1975 281.9653 0.4362
## 25 chaa2 0.7427 0.6867 18.1012 281.2213 0.4362
## 26 n_sp2 0.7404 0.6857 17.4448 281.0749 0.4369
## 27 saaa2 0.7386 0.6855 16.4485 280.4489 0.4371
## 28 slogp_vsa6 0.736 0.6841 15.9946 280.5478 0.4380
## 29 achg 0.7329 0.6823 15.7666 280.9275 0.4393
## 30 wncs 0.7292 0.6797 15.9408 281.8108 0.4411
## 31 peoe_vsa.6.1 0.7264 0.6782 15.6032 281.9886 0.4421
## 32 most_negative_charge 0.7239 0.6772 15.0166 281.8225 0.4428
## -------------------------------------------------------------------------------------
#best subsets via regsubsets; really.big must be TRUE when there are more than about 50 predictors
cbp = regsubsets(response ~ ., data = newdata, nvmax = 60, really.big = TRUE)
cbp.summary = summary(cbp)
names(cbp.summary)
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"
par(mfrow = c(2,2))
plot(cbp.summary$rsq, xlab = "Number of Variables", ylab = "R-Squared", type = "b")
plot(cbp.summary$cp, xlab = "Number of Variables", ylab = "Mallow's Cp", type = "b")
plot(cbp.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "b")
x1 = GPA, x2 = IQ, x3 = Level (High school = 0, College = 1), x4 = GPA*IQ (interaction), x5 = GPA*Level (interaction), y = Salary (thousands USD)
y ~ 50 + 20x1 + 0.07x2 + 35x3 + 0.01x4 - 10x5
High school (x3 = 0): y ~ 50 + 20x1 + 0.07x2 + 0.01x4
College (x3 = 1): y ~ 50 + 20x1 + 0.07x2 + 35 + 0.01x4 - 10x1 = (high school prediction) + 35 - 10x1
College - High school = 35 - 10x1
Thus, answer (iii) is correct. Comparing the two equations shows that the college premium shrinks as GPA rises: the difference in predicted salary between college and high school graduates, 35 - 10x1, becomes negative once GPA exceeds 3.5. In other words, for a fixed IQ and a sufficiently high GPA, high school graduates earn more on average than college graduates.
Prediction: y ~ 50 + 20(4.0) + 0.07(110) + 35(1) + 0.01(4.0 * 110) - 10(4.0 * 1)
y ~ 50 + 80 + 7.7 + 35 + 4.4 - 40
y ~ 137.1
The predicted salary of a college graduate with IQ 110 and GPA 4.0 is approximately 137.1 thousand USD.
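As a quick sanity check, the same arithmetic in R (a minimal sketch; the coefficients are the fitted values given in the problem, not estimated from data):
#Predicted salary for a college graduate with IQ = 110 and GPA = 4.0
beta <- c(50, 20, 0.07, 35, 0.01, -10)              # intercept, x1, ..., x5
x    <- c(1, 4.0, 110, 1, 4.0 * 110, 4.0 * 1)       # x4 = GPA*IQ, x5 = GPA*Level
sum(beta * x)                                        # 137.1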
Only a few of the simple linear regression outputs are shown, since there are 12 predictors (one univariate regression of medv on each).
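The code behind these summaries is not shown; a minimal sketch of how the univariate fits could be produced in a loop (assuming the ISLR2 version of the Boston data, which has medv plus 12 predictors):
library(ISLR2)                                   # Boston data (assumption)
predictors <- setdiff(names(Boston), "medv")
univ_fits <- lapply(predictors, function(p)
  lm(reformulate(p, response = "medv"), data = Boston))
names(univ_fits) <- predictors
lapply(univ_fits, summary)                       # one summary per predictor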
##
## Call:
## lm(formula = medv ~ crim, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.957 -5.449 -2.007 2.512 29.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.03311 0.40914 58.74 <2e-16 ***
## crim -0.41519 0.04389 -9.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.484 on 504 degrees of freedom
## Multiple R-squared: 0.1508, Adjusted R-squared: 0.1491
## F-statistic: 89.49 on 1 and 504 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = medv ~ zn, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.918 -5.518 -1.006 2.757 29.082
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.91758 0.42474 49.248 <2e-16 ***
## zn 0.14214 0.01638 8.675 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.587 on 504 degrees of freedom
## Multiple R-squared: 0.1299, Adjusted R-squared: 0.1282
## F-statistic: 75.26 on 1 and 504 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = medv ~ indus, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.017 -4.917 -1.457 3.180 32.943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.75490 0.68345 43.54 <2e-16 ***
## indus -0.64849 0.05226 -12.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.057 on 504 degrees of freedom
## Multiple R-squared: 0.234, Adjusted R-squared: 0.2325
## F-statistic: 154 on 1 and 504 DF, p-value: < 2.2e-16
Plots for the predictors in the Boston data set.
#Panel function that prints the absolute pairwise correlation in the upper panels,
#with the text size scaled by the strength of the correlation
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(Boston, lower.panel = panel.smooth, upper.panel = panel.cor,
      gap = 0, row1attop = FALSE)
fit.nox <- lm(age ~ nox, data = Boston)
#Assumed simple regressions behind the fitted lines drawn below (fits not shown in the original code)
emod <- lm(medv ~ rm, data = Boston)
kmod <- lm(medv ~ lstat, data = Boston)
plot(Boston$rm, Boston$medv, main = "Relationship between Medv and Rm",
     ylab = "median value of owner-occupied homes", xlab = "average number of rooms per dwelling")
abline(emod, col = "red", lwd = 3)
legend("topleft", "Regression", col = "red", lty = 1)
plot(Boston$lstat, Boston$medv, main = "Relationship between Medv and Lstat",
     ylab = "median value of owner-occupied homes", xlab = "lower status of the population (percent)")
abline(kmod, col = "lightblue", lwd = 3)
legend("topleft", "Regression", col = "lightblue", lty = 1)
#running multiple regression on all variables
mlrmod <- lm(medv ~ crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+lstat, data = Boston)
print(summary(mlrmod))
##
## Call:
## lm(formula = medv ~ crim + zn + indus + chas + nox + rm + age +
## dis + rad + tax + ptratio + lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1304 -2.7673 -0.5814 1.9414 26.2526
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.617270 4.936039 8.431 3.79e-16 ***
## crim -0.121389 0.033000 -3.678 0.000261 ***
## zn 0.046963 0.013879 3.384 0.000772 ***
## indus 0.013468 0.062145 0.217 0.828520
## chas 2.839993 0.870007 3.264 0.001173 **
## nox -18.758022 3.851355 -4.870 1.50e-06 ***
## rm 3.658119 0.420246 8.705 < 2e-16 ***
## age 0.003611 0.013329 0.271 0.786595
## dis -1.490754 0.201623 -7.394 6.17e-13 ***
## rad 0.289405 0.066908 4.325 1.84e-05 ***
## tax -0.012682 0.003801 -3.337 0.000912 ***
## ptratio -0.937533 0.132206 -7.091 4.63e-12 ***
## lstat -0.552019 0.050659 -10.897 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.798 on 493 degrees of freedom
## Multiple R-squared: 0.7343, Adjusted R-squared: 0.7278
## F-statistic: 113.5 on 12 and 493 DF, p-value: < 2.2e-16
Plotting the univariate regression coefficients against the multiple linear regression coefficients.
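The data frame df used in the plot below is not built in the code shown; a minimal sketch of how it may have been constructed (the column names one and two are taken from the plot call, everything else is an assumption), pairing each predictor's univariate slope with its multiple-regression coefficient:
#one = slope from medv ~ predictor, two = coefficient from the full model mlrmod
preds <- names(coef(mlrmod))[-1]
uni   <- sapply(preds, function(p)
  coef(lm(reformulate(p, response = "medv"), data = Boston))[2])
df <- data.frame(one = uni, two = coef(mlrmod)[-1])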
ggplot(df, aes(x = one, y = two)) +
  geom_jitter() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw() +
  labs(x = "Univariate", y = "Multivariate") +
  ggtitle("Univariate Regression Coefficients vs. Multiple Regression Coefficients")
## `geom_smooth()` using formula = 'y ~ x'
process_forward = ols_step_forward_p(mlrmod, penter = 0.05)
process_backward = ols_step_backward_p(mlrmod, prem = 0.1)
stepwise_selection = ols_step_both_p(mlrmod, pent = 0.05, prem = 0.1)
print(process_forward)
##
## Selection Summary
## ---------------------------------------------------------------------------
## Variable Adj.
## Step Entered R-Square R-Square C(p) AIC RMSE
## ---------------------------------------------------------------------------
## 1 lstat 0.5441 0.5432 343.8481 3288.9750 6.2158
## 2 rm 0.6386 0.6371 170.6581 3173.5423 5.5403
## 3 ptratio 0.6786 0.6767 98.3210 3116.0973 5.2294
## 4 dis 0.6903 0.6878 78.6419 3099.3590 5.1386
## 5 nox 0.7081 0.7052 47.6477 3071.4386 4.9939
## 6 chas 0.7158 0.7124 35.3881 3059.9390 4.9326
## 7 zn 0.7196 0.7157 30.2466 3055.0403 4.9040
## 8 crim 0.7236 0.7192 24.8229 3049.7679 4.8738
## 9 rad 0.7273 0.7223 20.0897 3045.0803 4.8466
## 10 tax 0.7342 0.7289 9.1202 3033.9441 4.7889
## ---------------------------------------------------------------------------
##
##
## Elimination Summary
## --------------------------------------------------------------------------
## Variable Adj.
## Step Removed R-Square R-Square C(p) AIC RMSE
## --------------------------------------------------------------------------
## 1 indus 0.7343 0.7284 11.0470 3035.8689 4.7934
## 2 age 0.7342 0.7289 9.1202 3033.9441 4.7889
## --------------------------------------------------------------------------
##
## Stepwise Selection Summary
## ---------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## ---------------------------------------------------------------------------------------
## 1 lstat addition 0.544 0.543 343.8480 3288.9750 6.2158
## 2 rm addition 0.639 0.637 170.6580 3173.5423 5.5403
## 3 ptratio addition 0.679 0.677 98.3210 3116.0973 5.2294
## 4 dis addition 0.690 0.688 78.6420 3099.3590 5.1386
## 5 nox addition 0.708 0.705 47.6480 3071.4386 4.9939
## 6 chas addition 0.716 0.712 35.3880 3059.9390 4.9326
## 7 zn addition 0.720 0.716 30.2470 3055.0403 4.9040
## 8 crim addition 0.724 0.719 24.8230 3049.7679 4.8738
## 9 rad addition 0.727 0.722 20.0900 3045.0803 4.8466
## 10 tax addition 0.734 0.729 9.1200 3033.9441 4.7889
## ---------------------------------------------------------------------------------------
# r square plot
b1 = Boston[,-13]
b1 = b1[complete.cases(b1),]
mod.sel = regsubsets(b1[,2:12], b1[,1])
reg.summary = summary(mod.sel)
Mallow's Cp plot
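The R-square and Mallow's Cp figures are rendered from reg.summary; a minimal sketch of the plotting calls (axis labels are assumptions):
par(mfrow = c(1, 2))
plot(reg.summary$rsq, xlab = "Number of Variables", ylab = "R-Squared", type = "b")
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Mallow's Cp", type = "b")
points(which.min(reg.summary$cp), min(reg.summary$cp), col = "red", pch = 19)  # smallest Cp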
Stepwise AIC selection
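The call that produced the model below is not shown; a minimal sketch of stepwise AIC selection starting from the full model, using base R's step() (MASS::stepAIC behaves equivalently):
#Stepwise AIC selection starting from the 12-predictor model
aic_mod <- step(mlrmod, direction = "both", trace = FALSE)
summary(aic_mod)   # indus and age are dropped, matching the output below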
##
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
## tax + ptratio + lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1814 -2.7625 -0.6243 1.8448 26.3920
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.451747 4.903283 8.454 3.18e-16 ***
## crim -0.121665 0.032919 -3.696 0.000244 ***
## zn 0.046191 0.013673 3.378 0.000787 ***
## chas 2.871873 0.862591 3.329 0.000935 ***
## nox -18.262427 3.565247 -5.122 4.33e-07 ***
## rm 3.672957 0.409127 8.978 < 2e-16 ***
## dis -1.515951 0.187675 -8.078 5.08e-15 ***
## rad 0.283932 0.063945 4.440 1.11e-05 ***
## tax -0.012292 0.003407 -3.608 0.000340 ***
## ptratio -0.930961 0.130423 -7.138 3.39e-12 ***
## lstat -0.546509 0.047442 -11.519 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.789 on 495 degrees of freedom
## Multiple R-squared: 0.7342, Adjusted R-squared: 0.7289
## F-statistic: 136.8 on 10 and 495 DF, p-value: < 2.2e-16
Plots for several criterion-based procedures
Best subsets regression selects, for each model size, the subset of predictors that best meets a well-defined objective criterion.
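The table below matches the format produced by olsrr's best-subsets routine; a minimal sketch of the call that presumably generated it:
best_sub <- ols_step_best_subset(mlrmod)   # evaluates each model size on R-Square, Cp, AIC, SBC, etc.
print(best_sub)
plot(best_sub)                             # criterion plots across model sizes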
## Best Subsets Regression
## ----------------------------------------------------------------------
## Model Index Predictors
## ----------------------------------------------------------------------
## 1 lstat
## 2 rm lstat
## 3 rm ptratio lstat
## 4 rm dis ptratio lstat
## 5 nox rm dis ptratio lstat
## 6 chas nox rm dis ptratio lstat
## 7 zn chas nox rm dis ptratio lstat
## 8 crim zn chas nox rm dis ptratio lstat
## 9 crim zn nox rm dis rad tax ptratio lstat
## 10 crim zn chas nox rm dis rad tax ptratio lstat
## 11 crim zn chas nox rm age dis rad tax ptratio lstat
## 12 crim zn indus chas nox rm age dis rad tax ptratio lstat
## ----------------------------------------------------------------------
##
## Subsets Regression Summary
## -----------------------------------------------------------------------------------------------------------------------------------------
## Adj. Pred
## Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
## -----------------------------------------------------------------------------------------------------------------------------------------
## 1 0.5441 0.5432 0.5393 343.8481 3288.9750 1851.0792 3301.6546 19549.6534 38.7884 0.0768 0.4595
## 2 0.6386 0.6371 0.6298 170.6581 3173.5423 1735.9829 3190.4485 15531.4544 30.8764 0.0611 0.3657
## 3 0.6786 0.6767 0.6695 98.3210 3116.0973 1678.8739 3137.2300 13837.4817 27.5628 0.0546 0.3265
## 4 0.6903 0.6878 0.6794 78.6419 3099.3590 1662.1702 3124.7183 13361.0922 26.6659 0.0528 0.3159
## 5 0.7081 0.7052 0.6963 47.6477 3071.4386 1634.6744 3101.0244 12619.1774 25.2344 0.0500 0.2989
## 6 0.7158 0.7124 0.7012 35.3881 3059.9390 1623.4022 3093.7513 12311.6340 24.6674 0.0489 0.2922
## 7 0.7196 0.7157 0.7041 30.2466 3055.0403 1618.6349 3093.0792 12169.3548 24.4298 0.0484 0.2894
## 8 0.7236 0.7192 0.7069 24.8229 3049.7679 1613.5623 3092.0333 12019.8884 24.1766 0.0479 0.2864
## 9 0.7283 0.7234 0.7131 18.1627 3043.1500 1607.2571 3089.6419 11840.7778 23.8624 0.0473 0.2827
## 10 0.7342 0.7289 0.7163 9.1202 3033.9441 1598.5515 3084.6625 11604.8771 23.4323 0.0464 0.2776
## 11 0.7343 0.7284 0.7146 11.0470 3035.8689 1600.5322 3090.8139 11626.6889 23.5216 0.0466 0.2786
## 12 0.7343 0.7278 0.7138 13.0000 3037.8207 1602.5391 3096.9922 11649.2106 23.6126 0.0468 0.2797
## -----------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria