Lab 06: Intro to Machine Learning

Exercise 1

Remake the basic linear model with this new dataset. Provide the output from the summary() function.

## 
## Call:
## lm(formula = median_house_value ~ ., data = housing2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -572915  -46140   -9843   34424  661633 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -31075.059   2147.183  -14.47   <2e-16 ***
## housing_median_age   1856.289     42.277   43.91   <2e-16 ***
## population            -35.082      1.036  -33.87   <2e-16 ***
## households            130.079      3.105   41.89   <2e-16 ***
## median_income       42487.117    320.402  132.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69210 on 19670 degrees of freedom
## Multiple R-squared:  0.4984, Adjusted R-squared:  0.4983 
## F-statistic:  4887 on 4 and 19670 DF,  p-value: < 2.2e-16

The R-squared for this model (0.4984) is much closer to 0.5 than the R-squared in the first summary in the instructions. I also noticed that the residual min and max span a much wider range (-572,915 to 661,633).
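As a rough illustration of where the R-squared in a summary() table comes from, the sketch below recomputes it by hand. The housing2 data is not available here, so the built-in mtcars dataset and the formula mpg ~ wt + hp are stand-ins chosen only for this example.

```r
# Stand-in data: mtcars ships with base R; housing2 is not available here.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# R-squared = 1 - (residual sum of squares / total sum of squares)
rss <- sum(residuals(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_manual <- 1 - rss / tss

# Should agree with the value reported by summary()
r2_summary <- summary(fit)$r.squared
print(c(manual = r2_manual, summary = r2_summary))
```

Read this way, an R-squared of 0.4984 would mean the four predictors account for roughly half of the variation in median_house_value.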

Exercise 2

Make a new task for machine learning with this new dataset.

## <TaskRegr:housing> (19675 x 5)
## * Target: median_house_value
## * Properties: -
## * Features (4):
##   - dbl (4): households, housing_median_age, median_income, population

## <LearnerRegrGlmnet:regr.glmnet>
## * Model: -
## * Parameters: family=gaussian
## * Packages: mlr3, mlr3learners, glmnet
## * Predict Types:  [response]
## * Feature Types: logical, integer, numeric
## * Properties: weights

## 
## Call:  (if (cv) glmnet::cv.glmnet else glmnet::glmnet)(x = data, y = target,      alpha = 1, lambda = 0.001) 
## 
##   Df  %Dev Lambda
## 1  4 49.77  0.001

I tried this both with and without set.seed(). With set.seed(), the points were much farther apart, which suggests poorer performance, so I decided not to include it. The grey line follows the teal line (the linear model), meaning the model has low bias, though perhaps less so at the low and high ends.
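For context on what set.seed() changes, here is a minimal sketch in base R. The seed value 123, the 20 rows, and the 16-row training size are arbitrary choices for illustration and are not taken from the lab.

```r
# The seed value 123 is arbitrary and only for illustration.
n <- 20        # pretend dataset with 20 rows
n_train <- 16  # 80% of the rows go to training

set.seed(123)
train_a <- sample(n, size = n_train)  # row indices drawn after seeding

set.seed(123)
train_b <- sample(n, size = n_train)  # same seed, so the same rows are drawn

train_c <- sample(n, size = n_train)  # no reset, so a different draw is likely

print(identical(train_a, train_b))
```

The seed only fixes which random split is drawn, so a run with a seed and a run without one are scored on different train/test rows.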

## regr.rmse 
##  67929.78

## regr.mae 
## 51116.15

## regr.bias 
## -1711.226

## regr.rmse 
##  69544.44
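The three kinds of score printed above can be computed by hand from a vector of observed values and a vector of predictions. The sketch below uses made-up numbers, and it assumes bias is the mean of truth minus response, which is my understanding of the mlr3measures definition rather than something shown in the lab output.

```r
# Made-up observed values and predictions, for illustration only
truth    <- c(100, 200, 300, 400)
response <- c(110, 190, 330, 380)

err <- truth - response      # assumed sign convention: truth minus prediction

rmse <- sqrt(mean(err^2))    # root mean squared error
mae  <- mean(abs(err))       # mean absolute error
bias <- mean(err)            # mean signed error

print(c(rmse = rmse, mae = mae, bias = bias))
```

Under that sign convention, a negative bias such as -1711.226 would mean the predictions are, on average, somewhat higher than the observed house values.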

Exercise 3

Repeat the resampling strategy with the new task, but with a 10-fold cross validation (i.e. folds = 10). Provide the set of 10 RMSE values and the aggregate RMSE.

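# 10-fold cross-validation; this is the resampling used below
resampling = rsmp("cv", folds = 10)
# Not assigned, so this call only prints a holdout resampling; it is not used below
rsmp("holdout", ratio = 0.8)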

## <ResamplingHoldout>: Holdout
## * Iterations: 1
## * Instantiated: FALSE
## * Parameters: ratio=0.8

resampling$instantiate(task_housing2)
rr = resample(task_housing2, learner, resampling, store_models = TRUE)

## INFO  [20:19:14.519] [mlr3] Applying learner 'regr.glmnet' on task 'housing' (iter 1/10)
## INFO  [20:19:14.595] [mlr3] Applying learner 'regr.glmnet' on task 'housing' (iter 2/10)
## INFO  [20:19:14.631] [mlr3] Applying learner 'regr.glmnet' on task 'housing' (iter 3/10)
## INFO  [20:19:14.659] [mlr3] Applying learner 'regr.glmnet' on task 'housing' (iter 4/10)
## INFO  [20:19:14.693] [mlr3] Applying learner 'regr.glmnet' on task 'housing' (iter 5/10)
## INFO  [20:19:14.721] [mlr3] Applying learner 'regr.glmnet' on task 'housing' (iter 6/10)
## INFO  [20:19:14.748] [mlr3] Applying learner 'regr.glmnet' on task 'housing' (iter 7/10)
## INFO  [20:19:14.776] [mlr3] Applying learner 'regr.glmnet' on task 'housing' (iter 8/10)
## INFO  [20:19:14.811] [mlr3] Applying learner 'regr.glmnet' on task 'housing' (iter 9/10)
## INFO  [20:19:14.838] [mlr3] Applying learner 'regr.glmnet' on task 'housing' (iter 10/10)

print(rr)

## <ResampleResult> with 10 resampling iterations
##  task_id  learner_id resampling_id iteration warnings errors
##  housing regr.glmnet            cv         1        0      0
##  housing regr.glmnet            cv         2        0      0
##  housing regr.glmnet            cv         3        0      0
##  housing regr.glmnet            cv         4        0      0
##  housing regr.glmnet            cv         5        0      0
##  housing regr.glmnet            cv         6        0      0
##  housing regr.glmnet            cv         7        0      0
##  housing regr.glmnet            cv         8        0      0
##  housing regr.glmnet            cv         9        0      0
##  housing regr.glmnet            cv        10        0      0
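To show the idea behind the ten iterations, here is a hand-rolled 10-fold cross-validation in base R. It is only a sketch: mtcars, the lm() formula, and the seed are stand-ins, and mlr3 may assign rows to folds differently.

```r
set.seed(1)                 # arbitrary seed so the fold assignment is repeatable
k <- 10
n <- nrow(mtcars)

# Assign each row to one of k folds, in shuffled order
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_rmse <- numeric(k)
for (i in seq_len(k)) {
  train <- mtcars[fold_id != i, ]           # fit on the other k - 1 folds
  test  <- mtcars[fold_id == i, ]           # score on the held-out fold
  fit   <- lm(mpg ~ wt + hp, data = train)
  pred  <- predict(fit, newdata = test)
  fold_rmse[i] <- sqrt(mean((test$mpg - pred)^2))
}

print(fold_rmse)        # one RMSE per fold
print(mean(fold_rmse))  # simple average across folds
```

Each row is held out exactly once, so every observation contributes to exactly one of the ten test scores.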

for (i in 1:10) {
  lrn = rr$learners[[i]]
  print(coef(lrn$model))
}

## 5 x 1 sparse Matrix of class "dgCMatrix"
##                              s0
## (Intercept)        -30310.78489
## households            138.31651
## housing_median_age   1840.12989
## median_income       42529.02057
## population            -38.27331
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                              s0
## (Intercept)        -30992.06593
## households            127.76978
## housing_median_age   1850.69829
## median_income       42405.62692
## population            -34.23414
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                              s0
## (Intercept)        -31294.43719
## households            129.08587
## housing_median_age   1854.46096
## median_income       42536.93263
## population            -34.64433
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                              s0
## (Intercept)        -31337.32359
## households            127.07192
## housing_median_age   1858.85890
## median_income       42660.55928
## population            -34.44435
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                              s0
## (Intercept)        -30832.32209
## households            128.23852
## housing_median_age   1861.90567
## median_income       42482.98412
## population            -34.58924
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                             s0
## (Intercept)        -31627.1339
## households            129.4626
## housing_median_age   1868.4524
## median_income       42517.2876
## population            -34.8420
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                              s0
## (Intercept)        -30903.75286
## households            128.65452
## housing_median_age   1856.94082
## median_income       42378.10489
## population            -34.45528
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                              s0
## (Intercept)        -31509.28194
## households            130.41265
## housing_median_age   1867.83395
## median_income       42464.36940
## population            -34.94008
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                              s0
## (Intercept)        -30213.32590
## households            130.27004
## housing_median_age   1848.87567
## median_income       42349.20879
## population            -35.21356
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                              s0
## (Intercept)        -31487.23989
## households            129.83991
## housing_median_age   1854.65213
## median_income       42550.80544
## population            -34.78167

# RMSE for each individual fold, then the aggregate RMSE
measure2 = msr("regr.rmse")
rr$score(measure2)

##     task_id  learner_id resampling_id iteration regr.rmse
##  1: housing regr.glmnet            cv         1  69928.84
##  2: housing regr.glmnet            cv         2  69615.50
##  3: housing regr.glmnet            cv         3  69473.69
##  4: housing regr.glmnet            cv         4  68634.05
##  5: housing regr.glmnet            cv         5  66691.90
##  6: housing regr.glmnet            cv         6  72922.32
##  7: housing regr.glmnet            cv         7  68128.01
##  8: housing regr.glmnet            cv         8  66925.12
##  9: housing regr.glmnet            cv         9  69619.14
## 10: housing regr.glmnet            cv        10  70336.39
## Hidden columns: task, learner, resampling, prediction

rr$aggregate(measure2)

## regr.rmse 
##   69227.5
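The aggregate value is consistent with a plain average of the ten fold scores. The check below copies the fold RMSEs from the table above; that mlr3 averages the folds by default for this measure is my assumption, although the arithmetic agrees with the printed result.

```r
# Fold RMSE values copied from the rr$score() output
fold_rmse <- c(69928.84, 69615.50, 69473.69, 68634.05, 66691.90,
               72922.32, 68128.01, 66925.12, 69619.14, 70336.39)

agg <- mean(fold_rmse)    # simple average across the 10 folds
print(round(agg, 1))      # 69227.5
print(range(fold_rmse))   # lowest and highest fold RMSE
```

The fold scores run from about 66,692 to 72,922, so the estimate moves by a few thousand dollars depending on which rows are held out.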

Exercise 4

Comment on whether you think this new model is better or worse at predictions than the model you built in the lab.

I think this model is better because, as I stated above, it has a much better R-squared: 0.4984, compared with 0.05 for the linear model in the instructions. It also maintains a very low p-value of < 2.2e-16, which is good. The aggregate RMSE for this model is $69,227.50, which is over $20,000 better than the RMSE of the original model in the instructions. Overall, the points in the plot are clustered well enough to suggest reasonably low bias.

Anna Peterson

2023-03-19
