---
title: "Predicting house prices: a regression example"
output: 
  html_notebook: 
    theme: cerulean
    highlight: textmate
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```

```{r}
#rm(list=ls())
```

***
This notebook contains the code samples found in Chapter 3, Section 6 of [Deep Learning with R](https://www.manning.com/books/deep-learning-with-r). Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments.
***

In our two previous examples, we were considering classification problems, where the goal was to predict a single discrete label of an input data point. Another common type of machine learning problem is "regression", which consists of predicting a continuous value instead of a discrete label. For instance, predicting the temperature tomorrow, given meteorological data, or predicting the time that a software project will take to complete, given its specifications.

Do not mix up "regression" with the algorithm "logistic regression": confusingly, "logistic regression" is not a regression algorithm, it is a classification algorithm.

## The Boston Housing Price dataset


We will be attempting to predict the median price of homes in a given Boston suburb in the mid-1970s, given a few data points about the suburb at the time, such as the crime rate, the local property tax rate, etc.

The dataset we will be using has another interesting difference from our two previous examples: it has very few data points, only 506 in total, split between 404 training samples and 102 test samples, and each "feature" in the input data (e.g. the crime rate is a feature) has a different scale. For instance, some values are proportions, which take values between 0 and 1, others take values between 1 and 12, others between 0 and 100...

Let's take a look at the data:

```{r}
library(keras)
dataset <- dataset_boston_housing() # default test_split = 0.2
summary(dataset)
#      Length Class  Mode
#train 2      -none- list
#test  2      -none- list
c(c(train_data, train_targets), c(test_data, test_targets)) %<-% dataset

head(train_data)
head(train_targets)

nrow(train_data) #404
nrow(test_data) #102
```
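
Before moving on, here is a quick way to see those scale differences for yourself (a small sketch using base R only; not evaluated here):

```{r, eval=FALSE}
# Each column of train_data is one feature; range() makes the scale
# differences obvious -- compare the 0/1 dummy variable in column 4
# with the property-tax rate in column 10.
apply(train_data, 2, range)
```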

```{r}
str(train_data)
```

```{r}
str(test_data)
```

As you can see, we have 404 training samples and 102 test samples. The data comprises 13 features. The 13 features in the input data are as follows:

1. Per capita crime rate.
2. Proportion of residential land zoned for lots over 25,000 square feet.
3. Proportion of non-retail business acres per town.
4. Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
5. Nitric oxides concentration (parts per 10 million).
6. Average number of rooms per dwelling.
7. Proportion of owner-occupied units built prior to 1940.
8. Weighted distances to five Boston employment centres.
9. Index of accessibility to radial highways.
10. Full-value property-tax rate per $10,000.
11. Pupil-teacher ratio by town.
12. 1000 * (Bk - 0.63) ** 2 where Bk is the proportion of Black people by town.
13. % lower status of the population.

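The matrix returned by `dataset_boston_housing()` carries no column names. If you would like labelled output from `summary()` and friends, you can attach names yourself; a small sketch (the abbreviations below follow the usual Boston Housing convention and are our own addition, not part of the Keras API):

```{r, eval=FALSE}
# Conventional abbreviations for the 13 features, in the order listed above.
# These names are an assumption: the dataset itself ships unnamed.
boston_cols <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT")
colnames(train_data) <- boston_cols
colnames(test_data)  <- boston_cols
```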

```{r}
summary(train_data)
```

The targets are the median values of owner-occupied homes, in thousands of dollars:

```{r}
str(train_targets)
```

The prices are typically between \$10,000 and \$50,000. If that sounds cheap, remember this was the mid-1970s, and these prices are not inflation-adjusted.

## Preparing the data


It would be problematic to feed into a neural network values that all take wildly different ranges. The network might be able to automatically adapt to such heterogeneous data, but it would definitely make learning more difficult. A widespread best practice to deal with such data is to do feature-wise normalization: for each feature in the input data (a column in the input data matrix), you subtract the mean of the feature and divide by the standard deviation, so that the feature is centered around 0 and has a unit standard deviation. This is easily done in R using the `scale()` function.

```{r}
mean <- apply(train_data, 2, mean) # '2': by columns -> 13 values
std <- apply(train_data, 2, sd)
train_data <- scale(train_data, center = mean, scale = std)
test_data <- scale(test_data, center = mean, scale = std)

summary(train_data)
```

Note that the quantities used for normalizing the test data were computed on the training data. We should never use any quantity computed on the test data in our workflow, even for something as simple as data normalization.
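
A quick sanity check makes this point concrete; a small sketch (not evaluated here). The test-set columns come out close to, but not exactly, mean 0 and sd 1, precisely because they were scaled with the training statistics:

```{r, eval=FALSE}
# Training features: mean ~0, sd ~1 by construction.
round(colMeans(train_data), 3)
round(apply(train_data, 2, sd), 3)
# Test features: only approximately standardized -- which is expected,
# since we reused the training mean and sd.
round(colMeans(test_data), 3)
```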

## Building our network

Because so few samples are available, we will be using a very small network with two hidden layers, each with 64 units. In general, the less training data you have, the worse overfitting will be, and using a small network is one way to mitigate overfitting.

```{r}
# Because we will need to instantiate the same model multiple times,
# we use a function to construct it.
build_model <- function() {
  model <- keras_model_sequential() %>% 
    layer_dense(units = 64, activation = "relu", 
                input_shape = dim(train_data)[[2]]) %>% #13
    layer_dense(units = 64, activation = "relu") %>% 
    layer_dense(units = 1) 
    
  model %>% compile(
    optimizer = "rmsprop", 
    loss = "mse", 
    metrics = c("mae")
  )
}
```
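
Note that `compile()` modifies the model in place and returns it, so `build_model()` hands back a compiled, ready-to-train network. A quick sketch of instantiating one copy to inspect it:

```{r, eval=FALSE}
model <- build_model()
summary(model)  # two hidden layers of 64 units, one linear output unit
```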

Our network ends with a single unit and no activation (i.e. it will be a linear layer). This is a typical setup for scalar regression (i.e. regression where we are trying to predict a single continuous value). Applying an activation function would constrain the range that the output can take; for instance, if we applied a `sigmoid` activation function to our last layer, the network could only learn to predict values between 0 and 1. Here, because the last layer is purely linear, the network is free to learn to predict values in any range.

Note that we are compiling the network with the `mse` loss function -- Mean Squared Error, the mean of the squared differences between the predictions and the targets, a widely used loss function for regression problems.

We are also monitoring a new metric during training: `mae`. This stands for Mean Absolute Error, the mean of the absolute differences between the predictions and the targets. For instance, an MAE of 0.5 on this problem would mean that our predictions are off by \$500 on average.
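
To make both quantities concrete, here is how they would be computed by hand on toy vectors (a minimal sketch; the numbers are made up):

```{r, eval=FALSE}
# Toy predictions and targets, in thousands of dollars.
preds   <- c(21.0, 18.5, 33.2)
targets <- c(20.0, 19.0, 30.0)
mean((preds - targets)^2)   # MSE: the loss minimized during training
mean(abs(preds - targets))  # MAE: the metric we monitor
```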

## Validating our approach using K-fold validation

To evaluate our network while we keep adjusting its parameters (such as the number of epochs used for training), we could simply split the data into a training set and a validation set, as we were doing in our previous examples. However, because we have so few data points, the validation set would end up being very small (e.g. about 100 examples). A consequence is that our validation scores may change a lot depending on _which_ data points we choose to use for validation and which we choose for training, i.e. the validation scores may have a high _variance_ with regard to the validation split. This would prevent us from reliably evaluating our model.

The best practice in such situations is to use K-fold cross-validation. It consists of splitting the available data into K partitions (typically K=4 or 5), instantiating K identical models, and training each one on K-1 partitions while evaluating on the remaining partition. The validation score for the model is then the average of the K validation scores obtained.

In terms of code, this is straightforward:

```{r, echo=TRUE, results='hide'}
k <- 4
indices <- sample(1:nrow(train_data))
# Randomly assign each of the 404 samples to one of k equal-sized folds:
# cut() splits the range of the shuffled indices into k intervals, so the
# fold labels come out shuffled rather than in contiguous blocks.
folds <- cut(indices, breaks = k, labels = FALSE)

num_epochs <- 100
all_scores <- c()
for (i in 1:k) {
  cat("processing fold #", i, "\n")
  # Prepare the validation data: data from partition #i
  val_indices <- which(folds == i, arr.ind = TRUE) # positions assigned to fold i
  val_data <- train_data[val_indices,]
  val_targets <- train_targets[val_indices]
  
  # Prepare the training data: data from all other partitions
  partial_train_data <- train_data[-val_indices,]
  partial_train_targets <- train_targets[-val_indices]
  
  # Build the Keras model (already compiled)
  model <- build_model()
  
  # Train the model (in silent mode, verbose=0)
  model %>% fit(partial_train_data, partial_train_targets,
                epochs = num_epochs, batch_size = 1, verbose = 0)
                
  # Evaluate the model on the validation data
  results <- model %>% evaluate(val_data, val_targets, verbose = 0)
  all_scores <- c(all_scores, results$mean_absolute_error) # record this fold's MAE
}  
```

```{r}
all_scores
```

```{r}
mean(all_scores)
```

As you can see, the different runs do indeed show rather different validation scores, from roughly 2.0 to 2.7. Their average (about 2.46 in our run) is a much more reliable metric than any single one of these scores -- that's the entire point of K-fold cross-validation. In this case, we are off by roughly \$2,460 on average, which is still significant considering that the prices range from \$10,000 to \$50,000.

Let's try training the network for a bit longer: 500 epochs. To keep a record of how well the model did at each epoch, we will modify our training loop to save the per-epoch validation score log:

```{r}
# Some memory clean-up
k_clear_session()
```

```{r, echo=TRUE, results='hide'}
num_epochs <- 500
all_mae_histories <- NULL
for (i in 1:k) {
  cat("processing fold #", i, "\n")
  
  # Prepare the validation data: data from partition #i
  val_indices <- which(folds == i, arr.ind = TRUE)
  val_data <- train_data[val_indices,]
  val_targets <- train_targets[val_indices]
  
  # Prepare the training data: data from all other partitions
  partial_train_data <- train_data[-val_indices,]
  partial_train_targets <- train_targets[-val_indices]
  
  # Build the Keras model (already compiled)
  model <- build_model()
  
  # Train the model (in silent mode, verbose=0)
  history <- model %>% fit(
    partial_train_data, partial_train_targets,
    validation_data = list(val_data, val_targets),
    epochs = num_epochs, batch_size = 1, verbose = 0
  )
  mae_history <- history$metrics$val_mean_absolute_error
  all_mae_histories <- rbind(all_mae_histories, mae_history)
}
```

We can then compute the average of the per-epoch MAE scores for all folds:

```{r}
average_mae_history <- data.frame(
  epoch = seq_len(ncol(all_mae_histories)),
  validation_mae = apply(all_mae_histories, 2, mean) # average MAE across the k folds at each epoch
)
```

Let's plot this:

```{r}
library(ggplot2)
ggplot(average_mae_history, aes(x = epoch, y = validation_mae)) + geom_line()
```

It may be a bit hard to see the plot due to scaling issues and relatively high variance. Let's use `geom_smooth()` to try to get a clearer picture:

```{r}
ggplot(average_mae_history, aes(x = epoch, y = validation_mae)) + geom_smooth()
```
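
If the first few epochs still dominate the vertical scale, another option is simply to drop them before plotting; a small sketch (the cutoff of 10 epochs is an arbitrary choice):

```{r, eval=FALSE}
# Omit the first 10 epochs, whose large MAE values compress the rest of
# the curve. The cutoff of 10 is arbitrary, chosen only for readability.
ggplot(average_mae_history[-(1:10), ], aes(x = epoch, y = validation_mae)) +
  geom_line()
```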

According to these plots, validation MAE stops improving significantly after roughly 70 epochs. Past that point, we start overfitting.
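
Rather than reading the stopping point off the plot by hand, you could also let Keras halt training automatically once the validation MAE stalls; a sketch using `callback_early_stopping()` (the `patience` value of 10 is an arbitrary choice, not tuned here):

```{r, eval=FALSE}
model <- build_model()
model %>% fit(
  partial_train_data, partial_train_targets,
  validation_data = list(val_data, val_targets),
  epochs = 500, batch_size = 1, verbose = 0,
  # Stop once validation MAE has not improved for 10 epochs (an assumption).
  callbacks = list(callback_early_stopping(
    monitor = "val_mean_absolute_error", patience = 10
  ))
)
```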

Once we are done tuning other parameters of our model (besides the number of epochs, we could also adjust the size of the hidden layers), we can train a final "production" model on all of the training data, with the best parameters, then look at its performance on the test data:

```{r, echo=TRUE, results='hide'}
# Get a fresh, compiled model.
model <- build_model()

# Train it on the entirety of the data.
model %>% fit(train_data, train_targets,
          epochs = 80, batch_size = 16, verbose = 0)

result <- model %>% evaluate(test_data, test_targets)
```

```{r}
result
```

We are still off by about \$2,600.
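
Finally, to actually use the trained network on new data, `predict()` returns the estimated median prices in the target units, i.e. thousands of dollars; a minimal sketch:

```{r, eval=FALSE}
# Predictions come back in thousands of dollars, like the targets.
predictions <- model %>% predict(test_data)
predictions[1:5]  # first five predicted median home prices
```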

## Wrapping up

Here's what you should take away from this example:

* Regression is done using different loss functions from classification; Mean Squared Error (MSE) is a commonly used loss function for regression.
* Similarly, evaluation metrics used for regression differ from those used for classification; naturally, the concept of "accuracy" does not apply to regression. A common regression metric is Mean Absolute Error (MAE).
* When features in the input data have values in different ranges, each feature should be scaled independently as a preprocessing step.
* When there is little data available, using K-fold validation is a great way to reliably evaluate a model.
* When little training data is available, it is preferable to use a small network with very few hidden layers (typically only one or two), in order to avoid severe overfitting.
