Remake the basic linear model with this new dataset. Provide the output from the summary() function.
##
## Call:
## lm(formula = median_house_value ~ ., data = housing2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -572915  -46140   -9843   34424  661633
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)        -31075.059   2147.183  -14.47   <2e-16 ***
## housing_median_age   1856.289     42.277   43.91   <2e-16 ***
## population            -35.082      1.036  -33.87   <2e-16 ***
## households            130.079      3.105   41.89   <2e-16 ***
## median_income       42487.117    320.402  132.61   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 69210 on 19670 degrees of freedom
## Multiple R-squared: 0.4984, Adjusted R-squared: 0.4983
## F-statistic: 4887 on 4 and 19670 DF, p-value: < 2.2e-16
The R-squared for this model (0.4984) is much closer to 0.5 than the R-squared in the first summary in the instructions. I also noticed that the residuals cover a much wider range between the min and max.
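The two quantities compared here can also be pulled out of a fitted model directly instead of being read off the printout. This is a minimal sketch that uses the built-in `mtcars` data as a stand-in, since `housing2` is not reproduced here; the formula and variable names are illustrative assumptions, not the lab's model.

```r
# Stand-in data: mtcars ships with R; the lab would use housing2 instead
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

# R-squared and adjusted R-squared, as shown at the bottom of summary()
r2 <- s$r.squared
adj_r2 <- s$adj.r.squared

# Residual range: the Min and Max entries of the "Residuals" block
res_range <- range(residuals(fit))

print(c(r.squared = r2, adj.r.squared = adj_r2))
print(res_range)
```

Storing these values makes it easier to compare two models side by side than scanning two separate printouts.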
Make a new machine learning task for this new dataset.
## <TaskRegr:housing> (19675 x 5)
## * Target: median_house_value
## * Properties: -
## * Features (4):
## - dbl (4): households, housing_median_age, median_income, population
## <LearnerRegrGlmnet:regr.glmnet>
## * Model: -
## * Parameters: family=gaussian
## * Packages: mlr3, mlr3learners, glmnet
## * Predict Types: [response]
## * Feature Types: logical, integer, numeric
## * Properties: weights
##
## Call: (if (cv) glmnet::cv.glmnet else glmnet::glmnet)(x = data, y = target, alpha = 1, lambda = 0.001)
##
## Df %Dev Lambda
## 1 4 49.77 0.001
I tried running this both with and without set.seed(). With set.seed(),
the points were much farther apart, which suggests poorer performance,
so I decided not to include set.seed(). The grey line lies along the
teal line (linear model), meaning the model has low bias, though perhaps
less so at the low and high ends.
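For context on what `set.seed()` changes: it fixes the random number stream, so the same rows land in the training and test sets on every run. Different seeds give different splits and therefore different scores. The sketch below illustrates this in base R with simulated data; every name and number in it is an assumption made for illustration, not part of the lab.

```r
# Simulated stand-in data; the lab would use housing2
set.seed(1)
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 3 * dat$x + rnorm(n)

# Hold out 20% of rows, fit on the rest, and return the test RMSE
holdout_rmse <- function(seed) {
  set.seed(seed)
  test_idx <- sample(n, size = n / 5)
  fit <- lm(y ~ x, data = dat[-test_idx, ])
  pred <- predict(fit, newdata = dat[test_idx, ])
  sqrt(mean((dat$y[test_idx] - pred)^2))
}

same_a <- holdout_rmse(42)
same_b <- holdout_rmse(42)  # same seed, same split, same score
other  <- holdout_rmse(7)   # different seed, different split

print(c(same_a, same_b, other))
```

Because a single split depends on which rows happen to be held out, averaging over several splits gives a steadier estimate than any one of them.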
## regr.rmse
## 67929.78
## regr.mae
## 51116.15
## regr.bias
## -1711.226
## regr.rmse
## 69544.44
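The three kinds of measure reported above have simple definitions, written out here in base R. The truth and prediction vectors are made-up numbers for illustration, and the sign convention used for bias (truth minus prediction) is an assumption; the documentation for the `regr.bias` measure states the convention it actually uses.

```r
# Made-up values for illustration only (not from the housing data)
truth    <- c(200000, 150000, 320000, 90000)
response <- c(210000, 140000, 300000, 100000)

err <- truth - response  # assumed sign convention: truth minus prediction

rmse <- sqrt(mean(err^2))  # root mean squared error
mae  <- mean(abs(err))     # mean absolute error
bias <- mean(err)          # average signed error

print(c(rmse = rmse, mae = mae, bias = bias))
```

RMSE is never smaller than MAE, which matches the reported 67929.78 versus 51116.15. Bias can be close to zero even when both are large, because positive and negative errors cancel.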
Repeat the resampling strategy with the new task, but with 10-fold cross-validation (i.e., folds = 10). Provide the set of 10 RMSE values and the aggregate RMSE.
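# 10-fold cross-validation: this is the strategy used by resample() below
resampling = rsmp("cv", folds = 10)
# Not assigned to anything, so this only prints a holdout object;
# `resampling` stays 10-fold CV
rsmp("holdout", ratio = 0.8)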
## <ResamplingHoldout>: Holdout
## * Iterations: 1
## * Instantiated: FALSE
## * Parameters: ratio=0.8
resampling$instantiate(task_housing2)
rr = resample(task_housing2, learner, resampling, store_models = TRUE)
## INFO [20:19:14.519] [mlr3] Applying learner 'regr.glmnet' on task 'housing' (iter 1/10)
## INFO [20:19:14.595] [mlr3] Applying learner 'regr.glmnet' on task 'housing' (iter 2/10)
## INFO [20:19:14.631] [mlr3] Applying learner 'regr.glmnet' on task 'housing' (iter 3/10)
## INFO [20:19:14.659] [mlr3] Applying learner 'regr.glmnet' on task 'housing' (iter 4/10)
## INFO [20:19:14.693] [mlr3] Applying learner 'regr.glmnet' on task 'housing' (iter 5/10)
## INFO [20:19:14.721] [mlr3] Applying learner 'regr.glmnet' on task 'housing' (iter 6/10)
## INFO [20:19:14.748] [mlr3] Applying learner 'regr.glmnet' on task 'housing' (iter 7/10)
## INFO [20:19:14.776] [mlr3] Applying learner 'regr.glmnet' on task 'housing' (iter 8/10)
## INFO [20:19:14.811] [mlr3] Applying learner 'regr.glmnet' on task 'housing' (iter 9/10)
## INFO [20:19:14.838] [mlr3] Applying learner 'regr.glmnet' on task 'housing' (iter 10/10)
print(rr)
## <ResampleResult> with 10 resampling iterations
## task_id learner_id resampling_id iteration warnings errors
## housing regr.glmnet cv 1 0 0
## housing regr.glmnet cv 2 0 0
## housing regr.glmnet cv 3 0 0
## housing regr.glmnet cv 4 0 0
## housing regr.glmnet cv 5 0 0
## housing regr.glmnet cv 6 0 0
## housing regr.glmnet cv 7 0 0
## housing regr.glmnet cv 8 0 0
## housing regr.glmnet cv 9 0 0
## housing regr.glmnet cv 10 0 0
# Print the fitted coefficients from each of the 10 fold models
for (i in 1:10) {
  lrn = rr$learners[[i]]
  print(coef(lrn$model))
}
## 5 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -30310.78489
## households 138.31651
## housing_median_age 1840.12989
## median_income 42529.02057
## population -38.27331
## 5 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -30992.06593
## households 127.76978
## housing_median_age 1850.69829
## median_income 42405.62692
## population -34.23414
## 5 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -31294.43719
## households 129.08587
## housing_median_age 1854.46096
## median_income 42536.93263
## population -34.64433
## 5 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -31337.32359
## households 127.07192
## housing_median_age 1858.85890
## median_income 42660.55928
## population -34.44435
## 5 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -30832.32209
## households 128.23852
## housing_median_age 1861.90567
## median_income 42482.98412
## population -34.58924
## 5 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -31627.1339
## households 129.4626
## housing_median_age 1868.4524
## median_income 42517.2876
## population -34.8420
## 5 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -30903.75286
## households 128.65452
## housing_median_age 1856.94082
## median_income 42378.10489
## population -34.45528
## 5 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -31509.28194
## households 130.41265
## housing_median_age 1867.83395
## median_income 42464.36940
## population -34.94008
## 5 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -30213.32590
## households 130.27004
## housing_median_age 1848.87567
## median_income 42349.20879
## population -35.21356
## 5 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -31487.23989
## households 129.83991
## housing_median_age 1854.65213
## median_income 42550.80544
## population -34.78167
# RMSE for each individual fold and aggregate RMSE
measure2 = msr("regr.rmse")
rr$score(measure2)
## task_id learner_id resampling_id iteration regr.rmse
## 1: housing regr.glmnet cv 1 69928.84
## 2: housing regr.glmnet cv 2 69615.50
## 3: housing regr.glmnet cv 3 69473.69
## 4: housing regr.glmnet cv 4 68634.05
## 5: housing regr.glmnet cv 5 66691.90
## 6: housing regr.glmnet cv 6 72922.32
## 7: housing regr.glmnet cv 7 68128.01
## 8: housing regr.glmnet cv 8 66925.12
## 9: housing regr.glmnet cv 9 69619.14
## 10: housing regr.glmnet cv 10 70336.39
## Hidden columns: task, learner, resampling, prediction
rr$aggregate(measure2)
## regr.rmse
## 69227.5
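The aggregate can be checked against the per-fold scores: averaging the ten RMSE values printed above reproduces it. This sketch copies the fold values from the output and assumes the aggregate is a plain mean over folds, which is consistent with the numbers shown.

```r
# Per-fold RMSE values copied from the rr$score() output above
fold_rmse <- c(69928.84, 69615.50, 69473.69, 68634.05, 66691.90,
               72922.32, 68128.01, 66925.12, 69619.14, 70336.39)

agg <- mean(fold_rmse)   # plain average across the 10 folds
spread <- sd(fold_rmse)  # fold-to-fold variability

print(round(agg, 1))     # 69227.5
print(round(spread, 1))
```

The standard deviation across folds gives a rough sense of how much a single train/test split could move the RMSE.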
Comment on whether you think this new model is better or worse at predictions than the model you built in the lab.
I think this model is better because, as I stated above, it has a much higher R-squared value: 0.4984, compared with 0.05 for the linear model in the instructions. The overall p-value is still very low (< 2.2e-16), which is also good. The aggregate RMSE for the model is $69,227.50, which is over $20,000 better than the RMSE of the original model in the instructions. Overall, the plotted points cluster well enough around the line to also suggest reasonably low bias.