These data were extracted from Internal Revenue Service Form 990, which some tax-exempt organizations are required to submit as part of their annual reporting. The Mohawk Valley has 328 tax-exempt organizations with annual revenues of more than $200,000, and organizations above that threshold must file a 990. These data offer a snapshot of the 100 highest-grossing nonprofits in Oneida and Herkimer counties in upstate New York. We will explore these data throughout the term and eventually use the full data set. Although longitudinal data are available, this deliverable focuses on the last full reporting year, 2018.
Even though IRS Form 990 allows for considerable dimensionality, with 32 features, we have elected to use four variables because they offer the most complete data with few missing values. As with for-profit companies, much can be gleaned from four major fiscal reporting categories – revenue, expenses, assets, and liabilities – when measuring the overall health of a tax-exempt organization.
For modeling, we will use the least squares estimator (LSE) – minimizing the sum of squares of the errors – followed by ridge and lasso regression supervised models. The deliverable closes with a holistic comparison of the three. These linear models are considered to be the “go-to as the first algorithm to try, good for very large datasets, and good for very high-dimensionality data” (Müller & Guido, 2016). Ultimately, supervised learning is used whenever we want to predict a certain outcome from a given input with a goal of making accurate predictions for new, never-before-seen data (Müller & Guido, 2016).
df <- read.csv("C:/Users/bjorzech/Desktop/DSC609_W2_NP.csv",stringsAsFactors = FALSE)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
Some preprocessing steps are needed in order to run the models correctly. The name of the organization is not used in the modeling, and the four fiscal variables are converted to numeric for consistency because they load as integers. For the flow of the deliverable, we print the structure and the head to show the first six records in the data set of 100 tax-exempt organizations.
df$revenue <- as.numeric(df$revenue)
df$expenses <- as.numeric(df$expenses)
df$liabilities <- as.numeric(df$liabilities)
df$assets <- as.numeric(df$assets)
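Because as.numeric() coerces any non-numeric strings to NA, a quick sanity check confirms the conversion did not lose values. This is a minimal sketch; the column names are the ones shown in str(df) below.
colSums(is.na(df[, c("revenue", "expenses", "liabilities", "assets")]))  # should be zero for each column if nothing was lost in coercion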
str(df)
## 'data.frame': 100 obs. of 5 variables:
## $ organization: chr "Faxton-St Lukes Healthcare" "St Elizabeth Medical Center" "Trustees of Hamilton College" "Utica College" ...
## $ revenue : num 3.03e+08 2.22e+08 2.00e+08 9.98e+07 9.34e+07 ...
## $ expenses : num 3.04e+08 2.23e+08 1.82e+08 9.39e+07 9.13e+07 ...
## $ liabilities : num 1.08e+08 1.13e+08 2.76e+08 6.91e+07 2.05e+07 ...
## $ assets : num 2.61e+08 1.17e+08 1.39e+09 1.17e+08 4.89e+07 ...
head(df)
## organization revenue expenses liabilities assets
## 1 Faxton-St Lukes Healthcare 302766033 303562622 108477839 260787709
## 2 St Elizabeth Medical Center 221958072 222966808 113017092 117179581
## 3 Trustees of Hamilton College 199948992 182057650 276031568 1393517194
## 4 Utica College 99799776 93885974 69080233 117018066
## 5 Upstate Cerebral Palsy Inc 93374322 91328400 20508445 48870808
## 6 Rome Memorial Hospital Inc 85724662 88871216 27876413 55501103
Additionally, the preprocessing step of normalizing the data is essential. When variables on very different scales are used together, such a transformation is often necessary to avoid having a variable with large values dominate the results of the analysis (Tan et al., 2019, p. 71). Here, revenue, expenses, assets, and liabilities are centered and scaled.
preproc1 <- preProcess(df[,c(2:5)], method=c("center", "scale"))
norm1 <- predict(preproc1, df[,c(2:5)])
summary(norm1)
## revenue expenses liabilities assets
## Min. :-0.39552 Min. :-0.3965 Min. :-0.1675 Min. :-0.2257
## 1st Qu.:-0.38001 1st Qu.:-0.3700 1st Qu.:-0.1659 1st Qu.:-0.2153
## Median :-0.35221 Median :-0.3345 Median :-0.1595 Median :-0.2012
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.08727 3rd Qu.:-0.1178 3rd Qu.:-0.1147 3rd Qu.:-0.1516
## Max. : 6.30925 Max. : 6.4626 Max. : 9.6245 Max. : 7.0674
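For reference, the same centering and scaling can be reproduced with base R's scale(); a minimal sketch (scale() returns a matrix, so it is wrapped back into a data frame):
norm_check <- as.data.frame(scale(df[, 2:5]))  # center and scale the four fiscal variables
summary(norm_check$revenue)                    # should mirror the revenue column of norm1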
The following LSE model was created in order to understand the relationship between the variables. Linear models can be characterized as regression models for which the prediction is a line for a single feature, a plane when using two features, or a hyperplane in higher dimensions (Müller & Guido, 2016). Linear models also carry a higher risk of overfitting when the number of features is large relative to the number of observations.
lse <- lm(revenue ~ ., data=norm1)
summary(lse)
##
## Call:
## lm(formula = revenue ~ ., data = norm1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.09338 -0.01416 -0.00831 -0.00279 0.73190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.350e-16 8.095e-03 0.000 1.000000
## expenses 9.722e-01 9.385e-03 103.592 < 2e-16 ***
## liabilities -2.613e-02 1.564e-02 -1.670 0.098088 .
## assets 6.805e-02 1.688e-02 4.031 0.000111 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08095 on 96 degrees of freedom
## Multiple R-squared: 0.9936, Adjusted R-squared: 0.9934
## F-statistic: 5004 on 3 and 96 DF, p-value: < 2.2e-16
The r-squared is assessed to gauge how well the regression model fits the observed data, that is, how much of the variation in revenue is explained by the independent variables. In this case, the variables show statistical significance save for liabilities. A multiple r-squared of 0.9936 and an adjusted r-squared of 0.9934 indicate a strong model fit.
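These fit statistics can also be read directly from the fitted lm object; a brief sketch:
summary(lse)$r.squared      # multiple r-squared reported above
summary(lse)$adj.r.squared  # adjusted r-squared reported above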
We also plot the variables below, simply to show the relationships among the four features. Plots are used sparingly in this deliverable, and only where they connect to the final outcome in a clear, meaningful way, especially with lasso regression.
library(tidyverse)
## -- Attaching packages -------------------------------------------- tidyverse 1.3.0 --
## v tibble 3.0.1 v dplyr 0.8.5
## v tidyr 1.0.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## v purrr 0.3.4
## -- Conflicts ----------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
pairs(~ revenue + expenses + assets + liabilities, data = norm1,
lower.panel = panel.smooth, main = 'Relationship Between Key Fiscal MV Nonprofit Variables - 2018')
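Since GGally is already loaded, an alternative sketch of the same pairwise view, with correlations and density plots added, could use ggpairs():
GGally::ggpairs(norm1,
title = "Relationship Between Key Fiscal MV Nonprofit Variables - 2018")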
Cross-validation is then performed to measure how well the fit generalizes. Additionally, in the ridge and lasso models that follow, a tuning parameter governs the amount of shrinkage: the coefficient values are shrunk toward zero, and the tuning parameter controls the strength of that penalty. We begin by cross-validating the LSE model with ten folds.
kfold <- trainControl(method = "cv", number = 10)
lse <- train(revenue ~ .,
data = norm1,
trControl = kfold,
method = "lm")
lse
## Linear Regression
##
## 100 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 88, 91, 90, 90, 89, 91, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.1471697 0.9532927 0.05809335
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
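The fold-by-fold metrics behind these averages are stored in the fitted object; a brief sketch:
lse$resample  # RMSE, Rsquared, and MAE for each of the ten folds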
With ridge regression, the coefficients are chosen not only so they predict well on the training data, but also to fit an additional constraint (Müller & Guido, 2016). We also want the magnitude of the coefficients to be as small as possible; that is, all entries of the coefficient vector should be close to zero. Each feature should have as little effect on the outcome as possible while the model still predicts well (Müller & Guido, 2016).
Additionally, the tuning parameter lambda is chosen by cross-validation. When lambda is small, the result is essentially the least squares estimate; as lambda increases, the coefficients are shrunk more strongly toward zero. In glmnet, ridge regression corresponds to the mixing parameter alpha = 0, and a weaker penalty leaves the coefficients less restricted.
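As a cross-check, the same ridge fit can be produced directly with glmnet, which caret's "glmnet" method wraps. This sketch assumes glmnet is installed (caret requires it for this method anyway) and lets cv.glmnet() run its own cross-validation over an internally chosen lambda sequence.
library(glmnet)
x <- as.matrix(norm1[, c("expenses", "liabilities", "assets")])
y <- norm1$revenue
ridge_cv <- cv.glmnet(x, y, alpha = 0)  # alpha = 0 selects the ridge penalty
ridge_cv$lambda.min                     # lambda with the lowest cross-validated error
coef(ridge_cv, s = "lambda.min")        # coefficients at that lambda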
lambda <- 10^seq(10, -2, length = 100)
ridge <- train(revenue ~ .,
data = norm1,
trControl = kfold,
method = "glmnet",
tuneGrid = expand.grid(alpha = 0, lambda = lambda))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
The results are then printed. The warning above arises because, at the largest lambda values in the grid, the coefficients are shrunk to essentially zero and the predictions become constant, so r-squared cannot be computed for those resamples (the NaN rows below).
ridge$results
## alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD
## 1 0 1.000000e-02 0.3613550 0.9295957 0.1524341 0.8065221 0.1992586
## 2 0 1.321941e-02 0.3613550 0.9295957 0.1524341 0.8065221 0.1992586
## 3 0 1.747528e-02 0.3613550 0.9295957 0.1524341 0.8065221 0.1992586
## 4 0 2.310130e-02 0.3613550 0.9295957 0.1524341 0.8065221 0.1992586
## 5 0 3.053856e-02 0.3613550 0.9295957 0.1524341 0.8065221 0.1992586
## 6 0 4.037017e-02 0.3613550 0.9295957 0.1524341 0.8065221 0.1992586
## 7 0 5.336699e-02 0.3613550 0.9295957 0.1524341 0.8065221 0.1992586
## 8 0 7.054802e-02 0.3613550 0.9295957 0.1524341 0.8065221 0.1992586
## 9 0 9.326033e-02 0.3666040 0.9295914 0.1550639 0.8054378 0.1992569
## 10 0 1.232847e-01 0.3903458 0.9295673 0.1673926 0.8263672 0.1990657
## 11 0 1.629751e-01 0.4239037 0.9295595 0.1855841 0.8565065 0.1986461
## 12 0 2.154435e-01 0.4606231 0.9295727 0.2064178 0.8814519 0.1980767
## 13 0 2.848036e-01 0.5000989 0.9296017 0.2300091 0.8998223 0.1973321
## 14 0 3.764936e-01 0.5416043 0.9296407 0.2561987 0.9103199 0.1963787
## 15 0 4.977024e-01 0.5834348 0.9296772 0.2839235 0.9099325 0.1952102
## 16 0 6.579332e-01 0.6238334 0.9296964 0.3123336 0.8969442 0.1938302
## 17 0 8.697490e-01 0.6608256 0.9296822 0.3403882 0.8703305 0.1922650
## 18 0 1.149757e+00 0.6925419 0.9296233 0.3670021 0.8304313 0.1905599
## 19 0 1.519911e+00 0.7177330 0.9295136 0.3912661 0.7801116 0.1887809
## 20 0 2.009233e+00 0.7358387 0.9293560 0.4126738 0.7244693 0.1870033
## 21 0 2.656088e+00 0.7471105 0.9291607 0.4310091 0.6705340 0.1852999
## 22 0 3.511192e+00 0.7524667 0.9289431 0.4460741 0.6258382 0.1837307
## 23 0 4.641589e+00 0.7532591 0.9287197 0.4580472 0.5962824 0.1823356
## 24 0 6.135907e+00 0.7509994 0.9285047 0.4673099 0.5840211 0.1811334
## 25 0 8.111308e+00 0.7471367 0.9283081 0.4743318 0.5868819 0.1801245
## 26 0 1.072267e+01 0.7429514 0.9281358 0.4795773 0.5998662 0.1792964
## 27 0 1.417474e+01 0.7395973 0.9279896 0.4834594 0.6173565 0.1786288
## 28 0 1.873817e+01 0.7382789 0.9278689 0.4863187 0.6344414 0.1780981
## 29 0 2.477076e+01 0.7402739 0.9277710 0.4884219 0.6473512 0.1776811
## 30 0 3.274549e+01 0.7459023 0.9276931 0.4906406 0.6545115 0.1773563
## 31 0 4.328761e+01 0.7534892 0.9276317 0.4976583 0.6574214 0.1771050
## 32 0 5.722368e+01 0.7610440 0.9275839 0.5031059 0.6583671 0.1769117
## 33 0 7.564633e+01 0.7676167 0.9275468 0.5073093 0.6586173 0.1767635
## 34 0 1.000000e+02 0.7729962 0.9275183 0.5105376 0.6586581 0.1766503
## 35 0 1.321941e+02 0.7772678 0.9274964 0.5130082 0.6586499 0.1765640
## 36 0 1.747528e+02 0.7806036 0.9274797 0.5148937 0.6586388 0.1764984
## 37 0 2.310130e+02 0.7831826 0.9274670 0.5163295 0.6586344 0.1764485
## 38 0 3.053856e+02 0.7851638 0.9274573 0.5174213 0.6586362 0.1764107
## 39 0 4.037017e+02 0.7866791 0.9274499 0.5182503 0.6586415 0.1763820
## 40 0 5.336699e+02 0.7878347 0.9274443 0.5188793 0.6586483 0.1763603
## 41 0 7.054802e+02 0.7887140 0.9274401 0.5193561 0.6586552 0.1763438
## 42 0 9.326033e+02 0.7898122 0.9147865 0.5199455 0.6595452 0.1973322
## 43 0 1.232847e+03 0.7914733 NaN 0.5208432 0.6586878 NA
## 44 0 1.629751e+03 0.7914733 NaN 0.5208432 0.6586878 NA
## 45 0 2.154435e+03 0.7914733 NaN 0.5208432 0.6586878 NA
## 46 0 2.848036e+03 0.7914733 NaN 0.5208432 0.6586878 NA
## 47 0 3.764936e+03 0.7914733 NaN 0.5208432 0.6586878 NA
## 48 0 4.977024e+03 0.7914733 NaN 0.5208432 0.6586878 NA
## 49 0 6.579332e+03 0.7914733 NaN 0.5208432 0.6586878 NA
## 50 0 8.697490e+03 0.7914733 NaN 0.5208432 0.6586878 NA
## 51 0 1.149757e+04 0.7914733 NaN 0.5208432 0.6586878 NA
## 52 0 1.519911e+04 0.7914733 NaN 0.5208432 0.6586878 NA
## 53 0 2.009233e+04 0.7914733 NaN 0.5208432 0.6586878 NA
## 54 0 2.656088e+04 0.7914733 NaN 0.5208432 0.6586878 NA
## 55 0 3.511192e+04 0.7914733 NaN 0.5208432 0.6586878 NA
## 56 0 4.641589e+04 0.7914733 NaN 0.5208432 0.6586878 NA
## 57 0 6.135907e+04 0.7914733 NaN 0.5208432 0.6586878 NA
## 58 0 8.111308e+04 0.7914733 NaN 0.5208432 0.6586878 NA
## 59 0 1.072267e+05 0.7914733 NaN 0.5208432 0.6586878 NA
## 60 0 1.417474e+05 0.7914733 NaN 0.5208432 0.6586878 NA
## 61 0 1.873817e+05 0.7914733 NaN 0.5208432 0.6586878 NA
## 62 0 2.477076e+05 0.7914733 NaN 0.5208432 0.6586878 NA
## 63 0 3.274549e+05 0.7914733 NaN 0.5208432 0.6586878 NA
## 64 0 4.328761e+05 0.7914733 NaN 0.5208432 0.6586878 NA
## 65 0 5.722368e+05 0.7914733 NaN 0.5208432 0.6586878 NA
## 66 0 7.564633e+05 0.7914733 NaN 0.5208432 0.6586878 NA
## 67 0 1.000000e+06 0.7914733 NaN 0.5208432 0.6586878 NA
## 68 0 1.321941e+06 0.7914733 NaN 0.5208432 0.6586878 NA
## 69 0 1.747528e+06 0.7914733 NaN 0.5208432 0.6586878 NA
## 70 0 2.310130e+06 0.7914733 NaN 0.5208432 0.6586878 NA
## 71 0 3.053856e+06 0.7914733 NaN 0.5208432 0.6586878 NA
## 72 0 4.037017e+06 0.7914733 NaN 0.5208432 0.6586878 NA
## 73 0 5.336699e+06 0.7914733 NaN 0.5208432 0.6586878 NA
## 74 0 7.054802e+06 0.7914733 NaN 0.5208432 0.6586878 NA
## 75 0 9.326033e+06 0.7914733 NaN 0.5208432 0.6586878 NA
## 76 0 1.232847e+07 0.7914733 NaN 0.5208432 0.6586878 NA
## 77 0 1.629751e+07 0.7914733 NaN 0.5208432 0.6586878 NA
## 78 0 2.154435e+07 0.7914733 NaN 0.5208432 0.6586878 NA
## 79 0 2.848036e+07 0.7914733 NaN 0.5208432 0.6586878 NA
## 80 0 3.764936e+07 0.7914733 NaN 0.5208432 0.6586878 NA
## 81 0 4.977024e+07 0.7914733 NaN 0.5208432 0.6586878 NA
## 82 0 6.579332e+07 0.7914733 NaN 0.5208432 0.6586878 NA
## 83 0 8.697490e+07 0.7914733 NaN 0.5208432 0.6586878 NA
## 84 0 1.149757e+08 0.7914733 NaN 0.5208432 0.6586878 NA
## 85 0 1.519911e+08 0.7914733 NaN 0.5208432 0.6586878 NA
## 86 0 2.009233e+08 0.7914733 NaN 0.5208432 0.6586878 NA
## 87 0 2.656088e+08 0.7914733 NaN 0.5208432 0.6586878 NA
## 88 0 3.511192e+08 0.7914733 NaN 0.5208432 0.6586878 NA
## 89 0 4.641589e+08 0.7914733 NaN 0.5208432 0.6586878 NA
## 90 0 6.135907e+08 0.7914733 NaN 0.5208432 0.6586878 NA
## 91 0 8.111308e+08 0.7914733 NaN 0.5208432 0.6586878 NA
## 92 0 1.072267e+09 0.7914733 NaN 0.5208432 0.6586878 NA
## 93 0 1.417474e+09 0.7914733 NaN 0.5208432 0.6586878 NA
## 94 0 1.873817e+09 0.7914733 NaN 0.5208432 0.6586878 NA
## 95 0 2.477076e+09 0.7914733 NaN 0.5208432 0.6586878 NA
## 96 0 3.274549e+09 0.7914733 NaN 0.5208432 0.6586878 NA
## 97 0 4.328761e+09 0.7914733 NaN 0.5208432 0.6586878 NA
## 98 0 5.722368e+09 0.7914733 NaN 0.5208432 0.6586878 NA
## 99 0 7.564633e+09 0.7914733 NaN 0.5208432 0.6586878 NA
## 100 0 1.000000e+10 0.7914733 NaN 0.5208432 0.6586878 NA
## MAESD
## 1 0.2751543
## 2 0.2751543
## 3 0.2751543
## 4 0.2751543
## 5 0.2751543
## 6 0.2751543
## 7 0.2751543
## 8 0.2751543
## 9 0.2747915
## 10 0.2818401
## 11 0.2918999
## 12 0.3001985
## 13 0.3069780
## 14 0.3122734
## 15 0.3142675
## 16 0.3124791
## 17 0.3067143
## 18 0.2972846
## 19 0.2853558
## 20 0.2729424
## 21 0.2621219
## 22 0.2546186
## 23 0.2515956
## 24 0.2530395
## 25 0.2579722
## 26 0.2650061
## 27 0.2728477
## 28 0.2805532
## 29 0.2875615
## 30 0.2929351
## 31 0.2919988
## 32 0.2914119
## 33 0.2910463
## 34 0.2908186
## 35 0.2906762
## 36 0.2905865
## 37 0.2905292
## 38 0.2904922
## 39 0.2904679
## 40 0.2904516
## 41 0.2904405
## 42 0.2909043
## 43 0.2904129
## 44 0.2904129
## 45 0.2904129
## 46 0.2904129
## 47 0.2904129
## 48 0.2904129
## 49 0.2904129
## 50 0.2904129
## 51 0.2904129
## 52 0.2904129
## 53 0.2904129
## 54 0.2904129
## 55 0.2904129
## 56 0.2904129
## 57 0.2904129
## 58 0.2904129
## 59 0.2904129
## 60 0.2904129
## 61 0.2904129
## 62 0.2904129
## 63 0.2904129
## 64 0.2904129
## 65 0.2904129
## 66 0.2904129
## 67 0.2904129
## 68 0.2904129
## 69 0.2904129
## 70 0.2904129
## 71 0.2904129
## 72 0.2904129
## 73 0.2904129
## 74 0.2904129
## 75 0.2904129
## 76 0.2904129
## 77 0.2904129
## 78 0.2904129
## 79 0.2904129
## 80 0.2904129
## 81 0.2904129
## 82 0.2904129
## 83 0.2904129
## 84 0.2904129
## 85 0.2904129
## 86 0.2904129
## 87 0.2904129
## 88 0.2904129
## 89 0.2904129
## 90 0.2904129
## 91 0.2904129
## 92 0.2904129
## 93 0.2904129
## 94 0.2904129
## 95 0.2904129
## 96 0.2904129
## 97 0.2904129
## 98 0.2904129
## 99 0.2904129
## 100 0.2904129
ridge$bestTune$lambda
## [1] 0.07054802
coef(ridge$finalModel, ridge$bestTune$lambda)
## 4 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 5.520228e-17
## expenses 8.689033e-01
## liabilities -2.549446e-02
## assets 1.042747e-01
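A coefficient-path plot shows how each coefficient shrinks toward zero as log(lambda) increases; a minimal sketch using the underlying glmnet fit stored by caret:
plot(ridge$finalModel, xvar = "lambda", label = TRUE)  # ridge coefficient paths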
Since ridge is a more restricted model, we are less likely to overfit. After cross-validation, the ridge model's RMSE is higher than the LSE's and its r-squared is slightly lower. A less complex model means worse performance on the training set but stronger generalization; this trade-off is regularization (Müller & Guido, 2016).
As with ridge, lasso regression also restricts coefficients to be close to zero, but it uses a different form of regularization (L1). With L1 regularization, some of the coefficients become exactly zero, meaning some features are entirely ignored by the model (Müller & Guido, 2016).
We run the following model, setting alpha to 1 to select the lasso penalty.
lambda <- 10^seq(10, -2, length = 100)
lasso <- train(revenue ~ .,
data = norm1,
trControl = kfold,
method = "glmnet",
tuneGrid = expand.grid(alpha = 1, lambda = lambda))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
Then we print the results.
lasso$results
## alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD
## 1 1 1.000000e-02 0.1223510 0.9870230 0.05215131 0.19773229 0.020506909
## 2 1 1.321941e-02 0.1216803 0.9871751 0.05251851 0.19082354 0.020014069
## 3 1 1.747528e-02 0.1210370 0.9873694 0.05305019 0.18219838 0.019383484
## 4 1 2.310130e-02 0.1202193 0.9876506 0.05417750 0.17063010 0.018479817
## 5 1 3.053856e-02 0.1195040 0.9880307 0.05609905 0.15622031 0.017268451
## 6 1 4.037017e-02 0.1173026 0.9885894 0.05815561 0.13737833 0.015546427
## 7 1 5.336699e-02 0.1147471 0.9893880 0.06083259 0.11467503 0.013197438
## 8 1 7.054802e-02 0.1123525 0.9906039 0.06456895 0.09117263 0.010084490
## 9 1 9.326033e-02 0.1125317 0.9924059 0.07158897 0.08459811 0.007355101
## 10 1 1.232847e-01 0.1322745 0.9927694 0.08636212 0.10142122 0.007236515
## 11 1 1.629751e-01 0.1643882 0.9927077 0.10738597 0.12703656 0.007243937
## 12 1 2.154435e-01 0.2083751 0.9927077 0.13524533 0.16471004 0.007243937
## 13 1 2.848036e-01 0.2680967 0.9927077 0.17210997 0.21756217 0.007243937
## 14 1 3.764936e-01 0.3482924 0.9927077 0.22153992 0.28981851 0.007243937
## 15 1 4.977024e-01 0.4552720 0.9927077 0.28718466 0.38714792 0.007243937
## 16 1 6.579332e-01 0.5974321 0.9927077 0.37434916 0.51716284 0.007243937
## 17 1 8.697490e-01 0.7634657 0.9933457 0.48013618 0.63662180 0.007379350
## 18 1 1.149757e+00 0.8294742 NaN 0.53058569 0.62862168 NA
## 19 1 1.519911e+00 0.8294742 NaN 0.53058569 0.62862168 NA
## 20 1 2.009233e+00 0.8294742 NaN 0.53058569 0.62862168 NA
## 21 1 2.656088e+00 0.8294742 NaN 0.53058569 0.62862168 NA
## 22 1 3.511192e+00 0.8294742 NaN 0.53058569 0.62862168 NA
## 23 1 4.641589e+00 0.8294742 NaN 0.53058569 0.62862168 NA
## 24 1 6.135907e+00 0.8294742 NaN 0.53058569 0.62862168 NA
## 25 1 8.111308e+00 0.8294742 NaN 0.53058569 0.62862168 NA
## 26 1 1.072267e+01 0.8294742 NaN 0.53058569 0.62862168 NA
## 27 1 1.417474e+01 0.8294742 NaN 0.53058569 0.62862168 NA
## 28 1 1.873817e+01 0.8294742 NaN 0.53058569 0.62862168 NA
## 29 1 2.477076e+01 0.8294742 NaN 0.53058569 0.62862168 NA
## 30 1 3.274549e+01 0.8294742 NaN 0.53058569 0.62862168 NA
## 31 1 4.328761e+01 0.8294742 NaN 0.53058569 0.62862168 NA
## 32 1 5.722368e+01 0.8294742 NaN 0.53058569 0.62862168 NA
## 33 1 7.564633e+01 0.8294742 NaN 0.53058569 0.62862168 NA
## 34 1 1.000000e+02 0.8294742 NaN 0.53058569 0.62862168 NA
## 35 1 1.321941e+02 0.8294742 NaN 0.53058569 0.62862168 NA
## 36 1 1.747528e+02 0.8294742 NaN 0.53058569 0.62862168 NA
## 37 1 2.310130e+02 0.8294742 NaN 0.53058569 0.62862168 NA
## 38 1 3.053856e+02 0.8294742 NaN 0.53058569 0.62862168 NA
## 39 1 4.037017e+02 0.8294742 NaN 0.53058569 0.62862168 NA
## 40 1 5.336699e+02 0.8294742 NaN 0.53058569 0.62862168 NA
## 41 1 7.054802e+02 0.8294742 NaN 0.53058569 0.62862168 NA
## 42 1 9.326033e+02 0.8294742 NaN 0.53058569 0.62862168 NA
## 43 1 1.232847e+03 0.8294742 NaN 0.53058569 0.62862168 NA
## 44 1 1.629751e+03 0.8294742 NaN 0.53058569 0.62862168 NA
## 45 1 2.154435e+03 0.8294742 NaN 0.53058569 0.62862168 NA
## 46 1 2.848036e+03 0.8294742 NaN 0.53058569 0.62862168 NA
## 47 1 3.764936e+03 0.8294742 NaN 0.53058569 0.62862168 NA
## 48 1 4.977024e+03 0.8294742 NaN 0.53058569 0.62862168 NA
## 49 1 6.579332e+03 0.8294742 NaN 0.53058569 0.62862168 NA
## 50 1 8.697490e+03 0.8294742 NaN 0.53058569 0.62862168 NA
## 51 1 1.149757e+04 0.8294742 NaN 0.53058569 0.62862168 NA
## 52 1 1.519911e+04 0.8294742 NaN 0.53058569 0.62862168 NA
## 53 1 2.009233e+04 0.8294742 NaN 0.53058569 0.62862168 NA
## 54 1 2.656088e+04 0.8294742 NaN 0.53058569 0.62862168 NA
## 55 1 3.511192e+04 0.8294742 NaN 0.53058569 0.62862168 NA
## 56 1 4.641589e+04 0.8294742 NaN 0.53058569 0.62862168 NA
## 57 1 6.135907e+04 0.8294742 NaN 0.53058569 0.62862168 NA
## 58 1 8.111308e+04 0.8294742 NaN 0.53058569 0.62862168 NA
## 59 1 1.072267e+05 0.8294742 NaN 0.53058569 0.62862168 NA
## 60 1 1.417474e+05 0.8294742 NaN 0.53058569 0.62862168 NA
## 61 1 1.873817e+05 0.8294742 NaN 0.53058569 0.62862168 NA
## 62 1 2.477076e+05 0.8294742 NaN 0.53058569 0.62862168 NA
## 63 1 3.274549e+05 0.8294742 NaN 0.53058569 0.62862168 NA
## 64 1 4.328761e+05 0.8294742 NaN 0.53058569 0.62862168 NA
## 65 1 5.722368e+05 0.8294742 NaN 0.53058569 0.62862168 NA
## 66 1 7.564633e+05 0.8294742 NaN 0.53058569 0.62862168 NA
## 67 1 1.000000e+06 0.8294742 NaN 0.53058569 0.62862168 NA
## 68 1 1.321941e+06 0.8294742 NaN 0.53058569 0.62862168 NA
## 69 1 1.747528e+06 0.8294742 NaN 0.53058569 0.62862168 NA
## 70 1 2.310130e+06 0.8294742 NaN 0.53058569 0.62862168 NA
## 71 1 3.053856e+06 0.8294742 NaN 0.53058569 0.62862168 NA
## 72 1 4.037017e+06 0.8294742 NaN 0.53058569 0.62862168 NA
## 73 1 5.336699e+06 0.8294742 NaN 0.53058569 0.62862168 NA
## 74 1 7.054802e+06 0.8294742 NaN 0.53058569 0.62862168 NA
## 75 1 9.326033e+06 0.8294742 NaN 0.53058569 0.62862168 NA
## 76 1 1.232847e+07 0.8294742 NaN 0.53058569 0.62862168 NA
## 77 1 1.629751e+07 0.8294742 NaN 0.53058569 0.62862168 NA
## 78 1 2.154435e+07 0.8294742 NaN 0.53058569 0.62862168 NA
## 79 1 2.848036e+07 0.8294742 NaN 0.53058569 0.62862168 NA
## 80 1 3.764936e+07 0.8294742 NaN 0.53058569 0.62862168 NA
## 81 1 4.977024e+07 0.8294742 NaN 0.53058569 0.62862168 NA
## 82 1 6.579332e+07 0.8294742 NaN 0.53058569 0.62862168 NA
## 83 1 8.697490e+07 0.8294742 NaN 0.53058569 0.62862168 NA
## 84 1 1.149757e+08 0.8294742 NaN 0.53058569 0.62862168 NA
## 85 1 1.519911e+08 0.8294742 NaN 0.53058569 0.62862168 NA
## 86 1 2.009233e+08 0.8294742 NaN 0.53058569 0.62862168 NA
## 87 1 2.656088e+08 0.8294742 NaN 0.53058569 0.62862168 NA
## 88 1 3.511192e+08 0.8294742 NaN 0.53058569 0.62862168 NA
## 89 1 4.641589e+08 0.8294742 NaN 0.53058569 0.62862168 NA
## 90 1 6.135907e+08 0.8294742 NaN 0.53058569 0.62862168 NA
## 91 1 8.111308e+08 0.8294742 NaN 0.53058569 0.62862168 NA
## 92 1 1.072267e+09 0.8294742 NaN 0.53058569 0.62862168 NA
## 93 1 1.417474e+09 0.8294742 NaN 0.53058569 0.62862168 NA
## 94 1 1.873817e+09 0.8294742 NaN 0.53058569 0.62862168 NA
## 95 1 2.477076e+09 0.8294742 NaN 0.53058569 0.62862168 NA
## 96 1 3.274549e+09 0.8294742 NaN 0.53058569 0.62862168 NA
## 97 1 4.328761e+09 0.8294742 NaN 0.53058569 0.62862168 NA
## 98 1 5.722368e+09 0.8294742 NaN 0.53058569 0.62862168 NA
## 99 1 7.564633e+09 0.8294742 NaN 0.53058569 0.62862168 NA
## 100 1 1.000000e+10 0.8294742 NaN 0.53058569 0.62862168 NA
## MAESD
## 1 0.05945258
## 2 0.05709423
## 3 0.05414339
## 4 0.05057412
## 5 0.04658627
## 6 0.04177232
## 7 0.03657459
## 8 0.03284276
## 9 0.03725870
## 10 0.04749383
## 11 0.06053305
## 12 0.07798126
## 13 0.10127557
## 14 0.13184422
## 15 0.17210579
## 16 0.22507065
## 17 0.27227083
## 18 0.26044216
## 19 0.26044216
## 20 0.26044216
## 21 0.26044216
## 22 0.26044216
## 23 0.26044216
## 24 0.26044216
## 25 0.26044216
## 26 0.26044216
## 27 0.26044216
## 28 0.26044216
## 29 0.26044216
## 30 0.26044216
## 31 0.26044216
## 32 0.26044216
## 33 0.26044216
## 34 0.26044216
## 35 0.26044216
## 36 0.26044216
## 37 0.26044216
## 38 0.26044216
## 39 0.26044216
## 40 0.26044216
## 41 0.26044216
## 42 0.26044216
## 43 0.26044216
## 44 0.26044216
## 45 0.26044216
## 46 0.26044216
## 47 0.26044216
## 48 0.26044216
## 49 0.26044216
## 50 0.26044216
## 51 0.26044216
## 52 0.26044216
## 53 0.26044216
## 54 0.26044216
## 55 0.26044216
## 56 0.26044216
## 57 0.26044216
## 58 0.26044216
## 59 0.26044216
## 60 0.26044216
## 61 0.26044216
## 62 0.26044216
## 63 0.26044216
## 64 0.26044216
## 65 0.26044216
## 66 0.26044216
## 67 0.26044216
## 68 0.26044216
## 69 0.26044216
## 70 0.26044216
## 71 0.26044216
## 72 0.26044216
## 73 0.26044216
## 74 0.26044216
## 75 0.26044216
## 76 0.26044216
## 77 0.26044216
## 78 0.26044216
## 79 0.26044216
## 80 0.26044216
## 81 0.26044216
## 82 0.26044216
## 83 0.26044216
## 84 0.26044216
## 85 0.26044216
## 86 0.26044216
## 87 0.26044216
## 88 0.26044216
## 89 0.26044216
## 90 0.26044216
## 91 0.26044216
## 92 0.26044216
## 93 0.26044216
## 94 0.26044216
## 95 0.26044216
## 96 0.26044216
## 97 0.26044216
## 98 0.26044216
## 99 0.26044216
## 100 0.26044216
lasso$bestTune$lambda
## [1] 0.07054802
coef(lasso$finalModel, lasso$bestTune$lambda)
## 4 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 6.055488e-17
## expenses 9.250404e-01
## liabilities .
## assets .
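The corresponding coefficient-path plot makes the feature elimination visible: liabilities and assets are driven to exactly zero as lambda grows, leaving only expenses. A minimal sketch:
plot(lasso$finalModel, xvar = "lambda", label = TRUE)  # lasso coefficient paths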
A model is often easier to interpret, and can reveal the most important features, when some coefficients are exactly zero (Müller & Guido, 2016); here the lasso retains only expenses. The predictions are about the same on the training set, but the RMSE and r-squared are not as strong as the LSE fit's in-sample values.
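A side-by-side view of the three cross-validated fits at their selected tuning parameters helps frame the choice below. This is a sketch using caret's getTrainPerf(); note that each call to train() drew its own folds, so a strictly fair comparison would share fold indices via trainControl(index = ...).
perf <- rbind(getTrainPerf(lse), getTrainPerf(ridge), getTrainPerf(lasso))
perf$model <- c("LSE", "Ridge", "Lasso")  # label the rows
perf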
Since we are interested in generalization, we choose the ridge regression model over the LSE and the lasso. According to Müller and Guido (2016), ridge regression is usually the preferred choice between ridge and lasso in practice, unless only a few features are expected to be important, and it can be expected to generalize better than the unpenalized LSE.
If ridge regression performed poorly, it would suggest an underfitted model; in this case, it performs well. This helps with future modeling, as we plan to use this data set more in the coming weeks.
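Finally, as a sketch of how the chosen model would score a new, never-before-seen organization (the goal of supervised learning noted at the outset): the values below are hypothetical and already on the centered-and-scaled scale; in practice, predict(preproc1, new_raw_data) would produce them from raw dollar figures.
new_org <- data.frame(expenses = 0.5, liabilities = -0.1, assets = 0.2)  # hypothetical standardized values
predict(ridge, newdata = new_org)  # predicted (standardized) revenue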
Müller, A. C., & Guido, S. (2016). Introduction to machine learning with Python. Sebastopol, CA: O’Reilly Media, Inc.
Tan, P.-N., Steinbach, M., Karpatne, A., & Kumar, V. (2019). Introduction to data mining. New York, NY: Pearson Education, Inc.