Data605 - Final Assignment

0. Load libraries and initialize things

rm(list = ls())
library(tidyverse)
library(tidymodels)
library(skimr)
library(keras)
library(ggplot2)
library(matrixcalc)
library(pracma)

1. Playing with PageRank

1.1 - Form the A matrix. Then, introduce decay and form the B matrix as we did in the course notes.

#Create a transitions Matrix A
options(digits = 3)

r1 <- c(0,1/2,1/2,0,0,0)
r2 <- c(0,0,0,0,0,0)
r3 <- c(1/3,1/3,0,0,1/3,0)
r4 <- c(0,0,0,0,1/2,1/2)
r5 <- c(0,0,0,1/2,0,1/2)
r6 <- c(0,0,0,1,0,0)

n =6

A <- matrix(c(r1,r2,r3,r4,r5,r6),n,n,byrow = TRUE)
r <- c(1/6,1/6,1/6,1/6,1/6,1/6)

print(A)
##       [,1]  [,2] [,3] [,4]  [,5] [,6]
## [1,] 0.000 0.500  0.5  0.0 0.000  0.0
## [2,] 0.000 0.000  0.0  0.0 0.000  0.0
## [3,] 0.333 0.333  0.0  0.0 0.333  0.0
## [4,] 0.000 0.000  0.0  0.0 0.500  0.5
## [5,] 0.000 0.000  0.0  0.5 0.000  0.5
## [6,] 0.000 0.000  0.0  1.0 0.000  0.0
print(rowSums(A))
## [1] 1 0 1 1 1 1

We can see that matrix A is not row stochastic: row 2 sums to zero rather than one, since page 2 has no outgoing links. In order to continue, every row must be stochastic, so we replace row 2 with a row whose entries are all 1/6, giving every page equal probability and making the row sum equal 1.

r2 <- c(1/6,1/6,1/6,1/6,1/6,1/6)

A <- matrix(c(r1,r2,r3,r4,r5,r6),n,n,byrow = TRUE)
print(A)
##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
## [1,] 0.000 0.500 0.500 0.000 0.000 0.000
## [2,] 0.167 0.167 0.167 0.167 0.167 0.167
## [3,] 0.333 0.333 0.000 0.000 0.333 0.000
## [4,] 0.000 0.000 0.000 0.000 0.500 0.500
## [5,] 0.000 0.000 0.000 0.500 0.000 0.500
## [6,] 0.000 0.000 0.000 1.000 0.000 0.000
print(rowSums(A))
## [1] 1 1 1 1 1 1

Let's define the B matrix.

# Decay Factor
d <- 0.85

# Adjusted B matrix
B <- d * A + ((1-d)/n)
print(B)
##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
## [1,] 0.025 0.450 0.450 0.025 0.025 0.025
## [2,] 0.167 0.167 0.167 0.167 0.167 0.167
## [3,] 0.308 0.308 0.025 0.025 0.308 0.025
## [4,] 0.025 0.025 0.025 0.025 0.450 0.450
## [5,] 0.025 0.025 0.025 0.450 0.025 0.450
## [6,] 0.025 0.025 0.025 0.875 0.025 0.025

Let’s check it is row stochastic

print(rowSums(B))
## [1] 1 1 1 1 1 1

1.2 Start with a uniform rank vector r and perform power iterations on B until convergence. That is, compute the solution r = B^n * r. Attempt this for a sufficiently large n so that r actually converges.

# Lets calculate a large number of steps to check convergence
nsteps = 100

#Make sure you use the transpose of B, since you want the FROM pages as columns, while the matrix as defined in the question has the FROM pages as rows.

rf <- matrix.power(t(B),nsteps) %*% r

print(rf)
##            [,1]
## [1,] 0.05170475
## [2,] 0.07367926
## [3,] 0.05741241
## [4,] 0.34870369
## [5,] 0.19990381
## [6,] 0.26859608
# Just to make sure it converged, compare to nsteps +1
rf <- matrix.power(t(B),nsteps+1) %*% r

print(rf)
##            [,1]
## [1,] 0.05170475
## [2,] 0.07367926
## [3,] 0.05741241
## [4,] 0.34870369
## [5,] 0.19990381
## [6,] 0.26859608
# Yes, same vector: it has converged!

We have a new vector rf, and we verified that it converged. This is our PageRank vector!
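An alternative to picking a fixed power is to iterate until the vector stops changing. Here is a minimal sketch, reusing the B and r objects defined above (the 1e-12 tolerance is an arbitrary choice):

# Power iteration with an explicit convergence check (illustrative sketch)
r_iter <- r
repeat {
  r_next <- t(B) %*% r_iter                       # one iteration step
  if (max(abs(r_next - r_iter)) < 1e-12) break    # stop when the change is negligible
  r_iter <- r_next
}
print(r_iter)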

1.3 Compute the eigen-decomposition of B and verify that you indeed get an eigenvalue of 1 as the largest eigenvalue and that its corresponding eigenvector is the same vector that you obtained in the previous power iteration method.

This was a hard question in my opinion, but I was able to solve it correctly. Getting the eigenvalues is not the hard part; getting the eigenvectors can be confusing, because any nonzero scalar multiple of an eigenvector is an equally valid answer, so the vector a package returns may not be the one you expect.

So if you went ahead and just used the eigen function in R, you may not get the answer you expected.

eigen(B)$values[1]
## [1] 1
# Largest eigen value = 1

# To get the associated Eigenvector make sure you transpose it first
eigen(t(B))$vectors[,1]
## [1] 0.1044385+0i 0.1488249+0i 0.1159674+0i 0.7043472+0i 0.4037861+0i
## [6] 0.5425377+0i

We can see that the largest eigenvalue is one, but the associated eigenvector is not exactly what we expected. It is not equal to the PageRank vector we found before, and the “0i” terms show that R returns it as a complex vector. But let me assure you the vector is the same as the PageRank vector, just expressed with a different scaling.

To show this, we will derive the eigenvector a different way.

To get the same PageRank vector we obtained before by raising the matrix to a large power, we will instead find the null space of the transposed B matrix minus the identity matrix.

#First we transpose B and subtract the identity matrix. This is the same process we would follow to find eigenvectors manually.

Bt <- t(B) - diag(n)

Now we have a new matrix whose null space we can solve for. We will use the pracma library for this. The answer will be a scalar multiple of the final answer we want.

#Find the null space of the transformed B matrix
pracmaBnull <- pracma::nullspace(Bt)

# Let's find the scaling factor between the null space of B and the PageRank vector. One should be a multiple of the other

lambdaFactor <- pracmaBnull[1,1] / rf[1,1]

#Let's compare our new vector with the original PageRank vector
rf 
##            [,1]
## [1,] 0.05170475
## [2,] 0.07367926
## [3,] 0.05741241
## [4,] 0.34870369
## [5,] 0.19990381
## [6,] 0.26859608
pracmaBnull / lambdaFactor
##            [,1]
## [1,] 0.05170475
## [2,] 0.07367926
## [3,] 0.05741241
## [4,] 0.34870369
## [5,] 0.19990381
## [6,] 0.26859608

Solved. We can see now that the null space vector of the transformed B is the same eigenvector as the PageRank vector, just with a different scaling factor; it is nonetheless the same vector.

We just showed that our EIGENVECTOR for lambda equal to 1 is the SAME as the PageRank vector we found earlier.
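As a quicker sanity check (a small sketch, separate from the derivation above), you can also take the real parts of the vector returned by eigen() and rescale it so its entries sum to 1; the result matches the PageRank vector.

# Rescale the eigen() eigenvector so its entries sum to 1 (illustrative check)
v <- Re(eigen(t(B))$vectors[,1])   # the imaginary parts are all zero here
print(v / sum(v))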

1.4 Use the graph package in R and its page.rank method to compute the Page Rank of the graph as given in A.

#The graph library is no longer on CRAN, but I found another library (igraph) that does the trick

library(igraph)

A_Graph <- graph_from_adjacency_matrix(A, weighted = TRUE, mode = "directed")
plot(A_Graph)

Let’s see that library’s pagerank result.

pageRank <- page.rank(A_Graph)$vector

All three methods give the same answer.

# Pagerank vector by taking Matrix powers until convergence
print(t(rf))
##            [,1]       [,2]       [,3]      [,4]      [,5]      [,6]
## [1,] 0.05170475 0.07367926 0.05741241 0.3487037 0.1999038 0.2685961
# Pagerank vector by doing spectral decomposition of the B matrix
print(t(pracmaBnull/lambdaFactor))
##            [,1]       [,2]       [,3]      [,4]      [,5]      [,6]
## [1,] 0.05170475 0.07367926 0.05741241 0.3487037 0.1999038 0.2685961
# Page rank of matrix A by using page.rank function
print(pageRank)
## [1] 0.05170475 0.07367926 0.05741241 0.34870369 0.19990381 0.26859608

2. MNIST Dataset

2.1 Load files and Plot function

#setwd("Code/FinalExam/MNIST")
train_df <- read_csv("MNIST/train.csv")
test_df <- read_csv('MNIST/test.csv')
train_df[,2:ncol(train_df)] <- train_df[,2:ncol(train_df)] / 255 
head(train_df)
## # A tibble: 6 x 785
##   label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9
##   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1     1      0      0      0      0      0      0      0      0      0      0
## 2     0      0      0      0      0      0      0      0      0      0      0
## 3     1      0      0      0      0      0      0      0      0      0      0
## 4     4      0      0      0      0      0      0      0      0      0      0
## 5     0      0      0      0      0      0      0      0      0      0      0
## 6     0      0      0      0      0      0      0      0      0      0      0
## # ... with 774 more variables: pixel10 <dbl>, pixel11 <dbl>, pixel12 <dbl>,
## #   pixel13 <dbl>, pixel14 <dbl>, pixel15 <dbl>, pixel16 <dbl>, pixel17 <dbl>,
## #   pixel18 <dbl>, pixel19 <dbl>, pixel20 <dbl>, pixel21 <dbl>, pixel22 <dbl>,
## #   pixel23 <dbl>, pixel24 <dbl>, pixel25 <dbl>, pixel26 <dbl>, pixel27 <dbl>,
## #   pixel28 <dbl>, pixel29 <dbl>, pixel30 <dbl>, pixel31 <dbl>, pixel32 <dbl>,
## #   pixel33 <dbl>, pixel34 <dbl>, pixel35 <dbl>, pixel36 <dbl>, pixel37 <dbl>,
## #   pixel38 <dbl>, pixel39 <dbl>, pixel40 <dbl>, pixel41 <dbl>, ...

Let’s create a basic function that would take the matrix information of the MNIST digits and display it on screen.

#Function to display DIGITS

disp_digit <- function(digit_row, d_label) {
  
  img1 <- matrix(unlist(digit_row),nrow=28,ncol=28,byrow=TRUE)

  # reverse and transpose, if not image rotated 90 degrees
  img1 <- t(apply(img1, 2, rev))

  # To keep aspect ratio
  par(pty="s")

  image(1:28, 1:28, img1, col=gray((0:255)/255), 
        axes=FALSE, xlab = d_label, ylab = "")

}

Let’s test the function displaying the 10th row in our dataset

# Choose a row (digit image)
idnum=10
disp_digit(train_df[idnum,2:ncol(train_df)], train_df[idnum,1])

We should be looking at a number “3”

2.2 Basic information about dataset

Display the frequency of each number in the dataset

# count of all rows by Label (digit)
train_df %>%
  group_by(label) %>%
  summarise(Count = n()) %>%
  mutate(frequency = Count / sum(Count))
## # A tibble: 10 x 3
##    label Count frequency
##    <dbl> <int>     <dbl>
##  1     0  4132    0.0984
##  2     1  4684    0.112 
##  3     2  4177    0.0995
##  4     3  4351    0.104 
##  5     4  4072    0.0970
##  6     5  3795    0.0904
##  7     6  4137    0.0985
##  8     7  4401    0.105 
##  9     8  4063    0.0967
## 10     9  4188    0.0997

We can see here that, although not exactly equal, each digit 0 to 9 is represented almost uniformly across the dataset (roughly 10% each).

Just a few queries we may need later.

# 10 rows, one per digit, with the sum of every pixel (784)
train_sum_df <- train_df %>%
  group_by(label) %>%
  summarise(across(everything(),sum), .groups = 'drop') 
# 10 Rows, with sum of all columns
train_sum_columns <- train_df %>%
  group_by(label) %>%
  summarise(across(everything(),sum), .groups = 'drop') %>%
  summarise(label=label,pixel_sum = rowSums(.))
# 10 Rows, one per digit with the mean of every pixel (784)
# this just tell us for each digit, what is the mean of every specific pixel.

train_mean_df <- train_df %>%
  group_by(label) %>%
  summarise(across(everything(),mean), .groups = 'drop') 

2.3 Plot some numbers

Let’s plot the first ten numbers in the dataframe

# Plot First 10 numbers on dataframe
par(mfrow=c(2,5),mai = c(0.1, 0.1, 0.1, 0.1),mar=c(1, 0, 1, 0))

for (i in 1:10) {
  disp_digit(train_df[i,2:ncol(train_df)],"")

}

For fun let’s average all rows grouped by number and display it.

# Plot all 10 digits based on averages
par(mfrow=c(2,5),mai = c(0.1, 0.1, 0.1, 0.1),mar=c(1, 0, 1, 0))
for (i in 1:10) {
  #disp_digit(train_mean_df[i,2:ncol(train_mean_df)])
  disp_digit(train_mean_df[i,2:ncol(train_mean_df)],"")
}

2.4 More info about dataset

Display the mean pixel value for every digit. This average is equivalent to the amount of “white” (stroke) each digit contains. You can see this in how number 1 has a lower mean than number 8.

# Mean pixel intensity by digit
train_df %>%
  group_by(label) %>%
  summarise(across(everything(),sum), .groups = 'drop', count=n()) %>%
  summarise(label=label,pixel_sum = rowSums(.), count = count, avg_pix = pixel_sum/count/784)
## # A tibble: 10 x 4
##    label pixel_sum count avg_pix
##    <dbl>     <dbl> <int>   <dbl>
##  1     0   565313.  4132  0.175 
##  2     1   283676.  4684  0.0772
##  3     2   493479.  4177  0.151 
##  4     3   487574.  4351  0.143 
##  5     4   391039.  4072  0.122 
##  6     5   388299.  3795  0.131 
##  7     6   454101.  4137  0.140 
##  8     7   400174.  4401  0.116 
##  9     8   485005.  4063  0.152 
## 10     9   407456.  4188  0.124

2.5 PCA (Principal Component Analysis)

# Run PCA for all digits, keeping up to the maximum of 784 principal components

pca_result <- prcomp(train_df[,2:ncol(train_df)], rank = 784, center=TRUE) 
#summary(pca_result) 

The maximum number of principal components is the minimum of the number of columns and the number of rows. In this case we can get up to 784 principal components, although you may not need most of them. The idea is that you can capture most of the information in the dataset with a smaller subset of components, which can save space and speed up processing.

Check variance by principal component

# std deviation
sdev <- pca_result$sdev
percent_variation <- sdev^2 / sum(sdev^2)

var_df <- data.frame(PC=paste0("PC",1:length(sdev)),
                     var_explained=percent_variation,
                     stringsAsFactors = FALSE)

Let’s do some plotting.

# Function which will help later plot all PC's
every_nth = function(n) {
  return(function(x) {x[c(TRUE, rep(FALSE, n - 1))]})
}

Plot of the variance explained by each Principal Component

var_df %>%
  mutate(PC = fct_inorder(PC)) %>%
  ggplot(aes(x=PC,y=var_explained))+
  scale_x_discrete(breaks=every_nth(n=100)) +
  geom_point() + ggtitle("Variance explained by PC") + xlab("Number of PC's") + ylab("Variance Explained")

Plot the cumulative variance explained by the first n components, and select the number of components that explain 95% and 100% of the total variance.

cumvar <- cumsum(var_df['var_explained'])
comp_95 <- min(which(cumvar>=0.95))
comp_100 <- min(which(cumvar>=1))
print(comp_95)
## [1] 154
print(comp_100)
## [1] 704

We see here that 95% of the variance is explained by the first 154 PCs, while reaching 100% of the variance requires 704 PCs.
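To illustrate the space saving mentioned earlier, here is a small sketch: keeping only the scores of the first comp_95 components represents each image with 154 numbers instead of 784, while retaining 95% of the variance.

# Keep only the scores of the first 154 principal components (illustrative sketch)
scores_95 <- pca_result$x[, 1:comp_95]
dim(train_df[, -1])   # original pixel matrix: 42000 x 784
dim(scores_95)        # reduced representation: 42000 x 154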

Let’s plot the cumulative variance curve!

ggplot(var_df,aes(x=1:784, y=cumsum(var_explained))) + geom_line() + geom_point()+ 
  ggtitle("Cummulative Variance explained by PC") +
  xlab("Number of PC's") + ylab("Cum Variance Explained")

Plot first few PRINCIPAL COMPONENTS

# Plot 10 First COMPONENTS
par(mfrow=c(2,5),mai = c(0.1, 0.1, 0.1, 0.1),mar=c(1, 0, 1, 0))

for (i in 1:10) {
  disp_digit(pca_result$rotation[,i],"")
  }

Showing the first n principal components looks “noisy”. That is because these are the vectors which explain the most variance across the whole dataset, with all digits included, so what you are looking at is a mixture of the numbers 0 to 9.

Print first 10 numbers

We can also reconstruct the numbers from a chosen number of components and plot them. Let’s see how they look with only 10 PCs out of 784.

# Plot 10 First Number RECONSTRUCTED with 10 PCS
par(mfrow=c(2,5),mai = c(0.1, 0.1, 0.1, 0.1),mar=c(1, 0, 1, 0))

r <- 10
recon <- t(t(pca_result$x[,1:r] %*% t(pca_result$rotation[,1:r])) + pca_result$center)

for (i in 1:10) {
    disp_digit(recon[i,],"")
  }

Not too good; you can’t be sure which numbers are behind them. Let’s try with more PCs, this time 50.

# Plot 10 First Number RECONSTRUCTED with 50 PCS
par(mfrow=c(2,5),mai = c(0.1, 0.1, 0.1, 0.1),mar=c(1, 0, 1, 0))

r <- 50
recon <- t(t(pca_result$x[,1:r] %*% t(pca_result$rotation[,1:r])) + pca_result$center)

for (i in 1:10) {
    disp_digit(recon[i,],"")
  }

Better, but still not too clear. Let’s try now with 200 PC’s

# Plot 10 First Number RECONSTRUCTED with 200 PCS
par(mfrow=c(2,5),mai = c(0.1, 0.1, 0.1, 0.1),mar=c(1, 0, 1, 0))

r <- 200
recon <- t(t(pca_result$x[,1:r] %*% t(pca_result$rotation[,1:r])) + pca_result$center)

for (i in 1:10) {
    disp_digit(recon[i,],"")
  }

Much better. This time let’s use as many principal components as needed to capture 100% of the variance, that is, 704 PCs.

# Plot 10 First Number RECONSTRUCTED 100%
par(mfrow=c(2,5),mai = c(0.1, 0.1, 0.1, 0.1),mar=c(1, 0, 1, 0))

r <- comp_100
recon <- t(t(pca_result$x[,1:r] %*% t(pca_result$rotation[,1:r])) + pca_result$center)

for (i in 1:10) {
    disp_digit(recon[i,],"")
  }

Yes, we recovered the original digits exactly, and we used fewer than the maximum 784 PCs.
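As a numeric confirmation (a small sketch), the mean squared reconstruction error with these 704 components should be essentially zero:

# Mean squared error between the original pixels and the 704-PC reconstruction
mean((as.matrix(train_df[, 2:ncol(train_df)]) - recon)^2)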

Filter only 8’s. Let’s see what happens when we re-do the whole PCA analysis but include only the rows for the number 8.

train_all8_df <- train_df %>%
  filter(label==8)

Re-run PCA

# Run PCA for all the 8's in the dataset

pca_result8 <- prcomp(train_all8_df[,2:ncol(train_all8_df)], rank = 784, center=TRUE) 
#summary(pca_result8) 

Calculate variance only for the 8’s

# std deviation of each Principal Component
sdev <- pca_result8$sdev
percent_variation <- sdev^2 / sum(sdev^2)

#Variance of PC's
var_df <- data.frame(PC=paste0("PC",1:length(sdev)),
                     var_explained=percent_variation,
                     stringsAsFactors = FALSE)

#Cumulative variance of ordered PC's
cumvar <- cumsum(var_df['var_explained'])
comp_95 <- min(which(cumvar>=0.95))
comp_100 <- min(which(cumvar>=1))

#Variance explained up to 95% and 100%
print(comp_95)
## [1] 135
print(comp_100)
## [1] 537

Let’s Plot first few PRINCIPAL COMPONENTS

# Plot 10 First COMPONENTS
par(mfrow=c(2,5),mai = c(0.1, 0.1, 0.1, 0.1),mar=c(1, 0, 1, 0))

for (i in 1:10) {
  disp_digit(pca_result8$rotation[,i],"")
  }

We can see now that the first PCs capture only information about the number 8, which can now clearly be distinguished; they all refer to the number 8. Recall that we can later reconstruct the original images by combining these “kind of weird looking” component images. Let’s do it!

Display 10 Reconstructed number 8’s

# Plot 10 First Numbers RECONSTRUCTED at 95% variance captured
par(mfrow=c(2,5),mai = c(0.1, 0.1, 0.1, 0.1),mar=c(1, 0, 1, 0))

r <- comp_95
#Let's reconstruct the number
recon <- t(t(pca_result8$x[,1:r] %*% t(pca_result8$rotation[,1:r])) + pca_result8$center)

for (i in 1:10) {
    disp_digit(recon[i,],"")
  }

2.6 Multinomial Classification Model

Convert the labels to factors, since classification models in tidymodels expect a factor outcome rather than a numeric one.

train_df$label <- as.factor(train_df$label)

We will use TIDYMODELS to set up our basic multinomial classification model.

multinom_spec <-
  multinom_reg(penalty = 0, mixture=1) %>%
  set_mode("classification") %>%
  set_engine("glmnet")

Let’s fit our model!

doParallel::registerDoParallel()
multinom_fit <- multinom_spec %>%
  fit(label ~ .,
    data = train_df
  )

# multinom_fit

Let’s get our results from our fitting.

results_train <- multinom_fit %>%
  predict(new_data = train_df) %>%
  mutate(
    truth = train_df$label
  ) 
results_train %>%
  accuracy(truth, .pred_class)
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.952

We reached an accuracy of 95.2% on the training data.

Let's show the CONFUSION MATRIX.

results_train %>%
  conf_mat(truth, .pred_class)
##           Truth
## Prediction    0    1    2    3    4    5    6    7    8    9
##          0 4074    0    8    6    4   22   15    4   16    4
##          1    0 4627   16    8   14    7    1   10   41    7
##          2    6    5 3926   75   13   23   18   28   22    6
##          3    5   11   51 4021    3   92    3   10   85   40
##          4    3    0   30    2 3903   32   15   22    9   63
##          5   12    5   15  119    2 3491   35    2   78   22
##          6   13    2   23    4   20   37 4038    2   15    0
##          7    0    4   27   23    9    5    0 4217   11   88
##          8   17   24   69   62   12   69   11   11 3745   26
##          9    2    6   12   31   92   17    1   95   41 3932
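Since the 95.2% above was measured on the same rows the model was fit on, it may be optimistic. A minimal sketch of a held-out check, reusing the tidymodels objects defined above (the 80/20 split is an arbitrary choice):

# Hold-out evaluation of the multinomial model (illustrative sketch only)
set.seed(42)
mn_split <- initial_split(train_df, prop = 0.8)
mn_fit <- multinom_spec %>%
  fit(label ~ ., data = training(mn_split))

predict(mn_fit, new_data = testing(mn_split)) %>%
  mutate(truth = testing(mn_split)$label) %>%
  accuracy(truth, .pred_class)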

2.7 Keras CLASSIFICATION using a Feedforward Neural Network

Let’s split our data into training and testing sets.

mnist_nn_split <- initial_split(train_df)
train <- training(mnist_nn_split)
test <- testing(mnist_nn_split)

y_train <- as.matrix(train$label)
x_train <- as.matrix(train[,-1])
y_test <- as.matrix(test$label)
x_test <- as.matrix(test[,-1])

We need to one-hot encode the labels (convert them to categorical).

y_train <- to_categorical(y_train, num_classes = 10)
## Loaded Tensorflow version 2.3.0
y_test <- to_categorical(y_test, num_classes = 10)

Let’s define our dense feed-forward Neural Network

dense_model <- keras_model_sequential() 

dense_model %>% 
  layer_dense(units = 128, activation = 'relu', input_shape = c(784)) %>% 
  layer_batch_normalization() %>%
  layer_dropout(rate = 0.3) %>% 
  layer_dense(units = 32, activation = 'relu') %>%
  layer_batch_normalization() %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 10, activation = 'softmax')

summary(dense_model)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense_2 (Dense)                     (None, 128)                     100480      
## ________________________________________________________________________________
## batch_normalization_1 (BatchNormali (None, 128)                     512         
## ________________________________________________________________________________
## dropout_1 (Dropout)                 (None, 128)                     0           
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 32)                      4128        
## ________________________________________________________________________________
## batch_normalization (BatchNormaliza (None, 32)                      128         
## ________________________________________________________________________________
## dropout (Dropout)                   (None, 32)                      0           
## ________________________________________________________________________________
## dense (Dense)                       (None, 10)                      330         
## ================================================================================
## Total params: 105,578
## Trainable params: 105,258
## Non-trainable params: 320
## ________________________________________________________________________________

We need to compile the model

dense_model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_adam(),
  metrics = c('accuracy')
)

Let’s fit our model

history <- dense_model %>% fit(
  x_train, y_train, 
  epochs = 20, batch_size = 256, 
  validation_split = 0.1
)
plot(history)
## `geom_smooth()` using formula = 'y ~ x'

dense_model %>% evaluate(x_test, y_test)
##      loss  accuracy 
## 0.1039279 0.9708571

We reached an accuracy of about 97.1% on the held-out test set.

2.8 Keras CLASSIFICATION with a Convolutional Neural Network (1D)

# Let's reshape the X matrices
x_trainR <- array_reshape(x_train,c(31500,784,1))
x_testR <- array_reshape(x_test,c(10500,784,1))
# The 1D convolution layer expects an extra trailing dimension of size 1
cnn_model <- keras_model_sequential()

cnn_model %>% 
  layer_conv_1d(filters = 32, kernel_size = 3, activation = 'relu', 
                input_shape = c(784,1)) %>% 
  #layer_max_pooling_1d(pool_size = 3) %>% 
  layer_conv_1d(filters = 64, kernel_size = 3, activation = 'relu') %>% 
  #layer_global_average_pooling_1d() %>% 
  layer_max_pooling_1d(pool_size = 2) %>%
  layer_dropout(rate = 0.2) %>% 
  layer_flatten() %>%
  layer_dense(units=128,activation = 'relu') %>%
  layer_dropout(rate = 0.2) %>% 
  layer_dense(units = 10, activation = 'softmax') %>% 
  compile(
    loss = 'categorical_crossentropy',
    optimizer = optimizer_adam(),
    metrics = c('accuracy')
  )
summary(cnn_model)
## Model: "sequential_1"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## conv1d_1 (Conv1D)                   (None, 782, 32)                 128         
## ________________________________________________________________________________
## conv1d (Conv1D)                     (None, 780, 64)                 6208        
## ________________________________________________________________________________
## max_pooling1d (MaxPooling1D)        (None, 390, 64)                 0           
## ________________________________________________________________________________
## dropout_3 (Dropout)                 (None, 390, 64)                 0           
## ________________________________________________________________________________
## flatten (Flatten)                   (None, 24960)                   0           
## ________________________________________________________________________________
## dense_4 (Dense)                     (None, 128)                     3195008     
## ________________________________________________________________________________
## dropout_2 (Dropout)                 (None, 128)                     0           
## ________________________________________________________________________________
## dense_3 (Dense)                     (None, 10)                      1290        
## ================================================================================
## Total params: 3,202,634
## Trainable params: 3,202,634
## Non-trainable params: 0
## ________________________________________________________________________________
history <- cnn_model %>% fit(x_trainR, y_train, batch_size = 128, 
                   epochs = 10,validation_split = 0.1)
plot(history)
## `geom_smooth()` using formula = 'y ~ x'

cnn_model %>% evaluate(x_testR, y_test, batch_size = 128)
##      loss  accuracy 
## 0.1079240 0.9754286

We reached an accuracy of about 97.5%.

2.9 Keras Convolutional Neural Network (2D)

This time we re-shape the images to be 28x28

# Let's reshape the X matrices back into 28x28 images
x_trainR <- array_reshape(x_train,c(31500,28,28))
x_testR <- array_reshape(x_test,c(10500,28,28))
# The input_shape declared in the model below carries the single channel dimension
cnn_model <- keras_model_sequential()
## Loaded Tensorflow version 2.8.0
cnn_model %>% 
  layer_conv_2d(filters = 32, kernel_size = c(3,3), activation = 'relu', input_shape = c(28,28,1)) %>% 
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = 'relu') %>% 
  layer_max_pooling_2d(pool_size = c(2, 2)) %>% 
  layer_dropout(rate = 0.2) %>% 
  layer_flatten() %>% 
  layer_dense(units=128,activation = 'relu') %>%
  layer_dropout(rate = 0.2) %>% 
  layer_dense(units = 10, activation = 'softmax') %>% 
  compile(
    loss = 'categorical_crossentropy',
    optimizer = optimizer_adam(),
    metrics = c('accuracy')
  )
summary(cnn_model)
## Model: "sequential"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  conv2d_1 (Conv2D)                  (None, 26, 26, 32)              320         
##  max_pooling2d_1 (MaxPooling2D)     (None, 13, 13, 32)              0           
##  conv2d (Conv2D)                    (None, 11, 11, 64)              18496       
##  max_pooling2d (MaxPooling2D)       (None, 5, 5, 64)                0           
##  dropout_1 (Dropout)                (None, 5, 5, 64)                0           
##  flatten (Flatten)                  (None, 1600)                    0           
##  dense_1 (Dense)                    (None, 128)                     204928      
##  dropout (Dropout)                  (None, 128)                     0           
##  dense (Dense)                      (None, 10)                      1290        
## ================================================================================
## Total params: 225,034
## Trainable params: 225,034
## Non-trainable params: 0
## ________________________________________________________________________________
history <- cnn_model %>% fit(x_trainR, y_train, batch_size = 128, 
                   epochs = 10,validation_split = 0.1)
plot(history)

cnn_model %>% evaluate(x_testR, y_test, batch_size = 128)
##       loss   accuracy 
## 0.03959895 0.98876190

Accuracy in 2D was: 98.9%

3. House Prices Dataset

3.1 Load Dataset

#setwd("../HousePrices")
train_df <- read_csv("HousePrices/train.csv")
test_df <- read_csv('HousePrices/test.csv')

head(train_df)
## # A tibble: 6 × 81
##      Id MSSubClass MSZoning LotFr…¹ LotArea Street Alley LotSh…² LandC…³ Utili…⁴
##   <dbl>      <dbl> <chr>      <dbl>   <dbl> <chr>  <chr> <chr>   <chr>   <chr>  
## 1     1         60 RL            65    8450 Pave   <NA>  Reg     Lvl     AllPub 
## 2     2         20 RL            80    9600 Pave   <NA>  Reg     Lvl     AllPub 
## 3     3         60 RL            68   11250 Pave   <NA>  IR1     Lvl     AllPub 
## 4     4         70 RL            60    9550 Pave   <NA>  IR1     Lvl     AllPub 
## 5     5         60 RL            84   14260 Pave   <NA>  IR1     Lvl     AllPub 
## 6     6         50 RL            85   14115 Pave   <NA>  IR1     Lvl     AllPub 
## # … with 71 more variables: LotConfig <chr>, LandSlope <chr>,
## #   Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>, BldgType <chr>,
## #   HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>,
## #   YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>, Exterior1st <chr>,
## #   Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>, ExterQual <chr>,
## #   ExterCond <chr>, Foundation <chr>, BsmtQual <chr>, BsmtCond <chr>,
## #   BsmtExposure <chr>, BsmtFinType1 <chr>, BsmtFinSF1 <dbl>, …

3.2 Basic Descriptive Stats

First let’s use SKIMR

skimr::skim(train_df)
Data summary
Name train_df
Number of rows 1460
Number of columns 81
_______________________
Column type frequency:
character 43
numeric 38
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
MSZoning 0 1.00 2 7 0 5 0
Street 0 1.00 4 4 0 2 0
Alley 1369 0.06 4 4 0 2 0
LotShape 0 1.00 3 3 0 4 0
LandContour 0 1.00 3 3 0 4 0
Utilities 0 1.00 6 6 0 2 0
LotConfig 0 1.00 3 7 0 5 0
LandSlope 0 1.00 3 3 0 3 0
Neighborhood 0 1.00 5 7 0 25 0
Condition1 0 1.00 4 6 0 9 0
Condition2 0 1.00 4 6 0 8 0
BldgType 0 1.00 4 6 0 5 0
HouseStyle 0 1.00 4 6 0 8 0
RoofStyle 0 1.00 3 7 0 6 0
RoofMatl 0 1.00 4 7 0 8 0
Exterior1st 0 1.00 5 7 0 15 0
Exterior2nd 0 1.00 5 7 0 16 0
MasVnrType 8 0.99 4 7 0 4 0
ExterQual 0 1.00 2 2 0 4 0
ExterCond 0 1.00 2 2 0 5 0
Foundation 0 1.00 4 6 0 6 0
BsmtQual 37 0.97 2 2 0 4 0
BsmtCond 37 0.97 2 2 0 4 0
BsmtExposure 38 0.97 2 2 0 4 0
BsmtFinType1 37 0.97 3 3 0 6 0
BsmtFinType2 38 0.97 3 3 0 6 0
Heating 0 1.00 4 5 0 6 0
HeatingQC 0 1.00 2 2 0 5 0
CentralAir 0 1.00 1 1 0 2 0
Electrical 1 1.00 3 5 0 5 0
KitchenQual 0 1.00 2 2 0 4 0
Functional 0 1.00 3 4 0 7 0
FireplaceQu 690 0.53 2 2 0 5 0
GarageType 81 0.94 6 7 0 6 0
GarageFinish 81 0.94 3 3 0 3 0
GarageQual 81 0.94 2 2 0 5 0
GarageCond 81 0.94 2 2 0 5 0
PavedDrive 0 1.00 1 1 0 3 0
PoolQC 1453 0.00 2 2 0 3 0
Fence 1179 0.19 4 5 0 4 0
MiscFeature 1406 0.04 4 4 0 4 0
SaleType 0 1.00 2 5 0 9 0
SaleCondition 0 1.00 6 7 0 6 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Id 0 1.00 730.50 421.61 1 365.75 730.5 1095.25 1460 ▇▇▇▇▇
MSSubClass 0 1.00 56.90 42.30 20 20.00 50.0 70.00 190 ▇▅▂▁▁
LotFrontage 259 0.82 70.05 24.28 21 59.00 69.0 80.00 313 ▇▃▁▁▁
LotArea 0 1.00 10516.83 9981.26 1300 7553.50 9478.5 11601.50 215245 ▇▁▁▁▁
OverallQual 0 1.00 6.10 1.38 1 5.00 6.0 7.00 10 ▁▂▇▅▁
OverallCond 0 1.00 5.58 1.11 1 5.00 5.0 6.00 9 ▁▁▇▅▁
YearBuilt 0 1.00 1971.27 30.20 1872 1954.00 1973.0 2000.00 2010 ▁▂▃▆▇
YearRemodAdd 0 1.00 1984.87 20.65 1950 1967.00 1994.0 2004.00 2010 ▅▂▂▃▇
MasVnrArea 8 0.99 103.69 181.07 0 0.00 0.0 166.00 1600 ▇▁▁▁▁
BsmtFinSF1 0 1.00 443.64 456.10 0 0.00 383.5 712.25 5644 ▇▁▁▁▁
BsmtFinSF2 0 1.00 46.55 161.32 0 0.00 0.0 0.00 1474 ▇▁▁▁▁
BsmtUnfSF 0 1.00 567.24 441.87 0 223.00 477.5 808.00 2336 ▇▅▂▁▁
TotalBsmtSF 0 1.00 1057.43 438.71 0 795.75 991.5 1298.25 6110 ▇▃▁▁▁
1stFlrSF 0 1.00 1162.63 386.59 334 882.00 1087.0 1391.25 4692 ▇▅▁▁▁
2ndFlrSF 0 1.00 346.99 436.53 0 0.00 0.0 728.00 2065 ▇▃▂▁▁
LowQualFinSF 0 1.00 5.84 48.62 0 0.00 0.0 0.00 572 ▇▁▁▁▁
GrLivArea 0 1.00 1515.46 525.48 334 1129.50 1464.0 1776.75 5642 ▇▇▁▁▁
BsmtFullBath 0 1.00 0.43 0.52 0 0.00 0.0 1.00 3 ▇▆▁▁▁
BsmtHalfBath 0 1.00 0.06 0.24 0 0.00 0.0 0.00 2 ▇▁▁▁▁
FullBath 0 1.00 1.57 0.55 0 1.00 2.0 2.00 3 ▁▇▁▇▁
HalfBath 0 1.00 0.38 0.50 0 0.00 0.0 1.00 2 ▇▁▅▁▁
BedroomAbvGr 0 1.00 2.87 0.82 0 2.00 3.0 3.00 8 ▁▇▂▁▁
KitchenAbvGr 0 1.00 1.05 0.22 0 1.00 1.0 1.00 3 ▁▇▁▁▁
TotRmsAbvGrd 0 1.00 6.52 1.63 2 5.00 6.0 7.00 14 ▂▇▇▁▁
Fireplaces 0 1.00 0.61 0.64 0 0.00 1.0 1.00 3 ▇▇▁▁▁
GarageYrBlt 81 0.94 1978.51 24.69 1900 1961.00 1980.0 2002.00 2010 ▁▁▅▅▇
GarageCars 0 1.00 1.77 0.75 0 1.00 2.0 2.00 4 ▁▃▇▂▁
GarageArea 0 1.00 472.98 213.80 0 334.50 480.0 576.00 1418 ▂▇▃▁▁
WoodDeckSF 0 1.00 94.24 125.34 0 0.00 0.0 168.00 857 ▇▂▁▁▁
OpenPorchSF 0 1.00 46.66 66.26 0 0.00 25.0 68.00 547 ▇▁▁▁▁
EnclosedPorch 0 1.00 21.95 61.12 0 0.00 0.0 0.00 552 ▇▁▁▁▁
3SsnPorch 0 1.00 3.41 29.32 0 0.00 0.0 0.00 508 ▇▁▁▁▁
ScreenPorch 0 1.00 15.06 55.76 0 0.00 0.0 0.00 480 ▇▁▁▁▁
PoolArea 0 1.00 2.76 40.18 0 0.00 0.0 0.00 738 ▇▁▁▁▁
MiscVal 0 1.00 43.49 496.12 0 0.00 0.0 0.00 15500 ▇▁▁▁▁
MoSold 0 1.00 6.32 2.70 1 5.00 6.0 8.00 12 ▃▆▇▃▃
YrSold 0 1.00 2007.82 1.33 2006 2007.00 2008.0 2009.00 2010 ▇▇▇▇▅
SalePrice 0 1.00 180921.20 79442.50 34900 129975.00 163000.0 214000.00 755000 ▇▅▁▁▁

Let’s take a look at basic stats

summary(train_df)
##        Id           MSSubClass      MSZoning          LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   Length:1460        Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   Class :character   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   Mode  :character   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9                      Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0                      3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                      Max.   :313.00  
##                                                      NA's   :259     
##     LotArea          Street             Alley             LotShape        
##  Min.   :  1300   Length:1460        Length:1460        Length:1460       
##  1st Qu.:  7554   Class :character   Class :character   Class :character  
##  Median :  9478   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10517                                                           
##  3rd Qu.: 11602                                                           
##  Max.   :215245                                                           
##                                                                           
##  LandContour         Utilities          LotConfig          LandSlope        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Neighborhood        Condition1         Condition2          BldgType        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   HouseStyle         OverallQual      OverallCond      YearBuilt   
##  Length:1460        Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  Class :character   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
##  Mode  :character   Median : 6.000   Median :5.000   Median :1973  
##                     Mean   : 6.099   Mean   :5.575   Mean   :1971  
##                     3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000  
##                     Max.   :10.000   Max.   :9.000   Max.   :2010  
##                                                                    
##   YearRemodAdd   RoofStyle           RoofMatl         Exterior1st       
##  Min.   :1950   Length:1460        Length:1460        Length:1460       
##  1st Qu.:1967   Class :character   Class :character   Class :character  
##  Median :1994   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1985                                                           
##  3rd Qu.:2004                                                           
##  Max.   :2010                                                           
##                                                                         
##  Exterior2nd         MasVnrType          MasVnrArea      ExterQual        
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median :   0.0   Mode  :character  
##                                        Mean   : 103.7                     
##                                        3rd Qu.: 166.0                     
##                                        Max.   :1600.0                     
##                                        NA's   :8                          
##   ExterCond          Foundation          BsmtQual           BsmtCond        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median : 383.5   Mode  :character  
##                                        Mean   : 443.6                     
##                                        3rd Qu.: 712.2                     
##                                        Max.   :5644.0                     
##                                                                           
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF       Heating         
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Length:1460       
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   Class :character  
##  Median :   0.00   Median : 477.5   Median : 991.5   Mode  :character  
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4                     
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2                     
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0                     
##                                                                        
##   HeatingQC          CentralAir         Electrical           1stFlrSF   
##  Length:1460        Length:1460        Length:1460        Min.   : 334  
##  Class :character   Class :character   Class :character   1st Qu.: 882  
##  Mode  :character   Mode  :character   Mode  :character   Median :1087  
##                                                           Mean   :1163  
##                                                           3rd Qu.:1391  
##                                                           Max.   :4692  
##                                                                         
##     2ndFlrSF     LowQualFinSF       GrLivArea     BsmtFullBath   
##  Min.   :   0   Min.   :  0.000   Min.   : 334   Min.   :0.0000  
##  1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000  
##  Median :   0   Median :  0.000   Median :1464   Median :0.0000  
##  Mean   : 347   Mean   :  5.845   Mean   :1515   Mean   :0.4253  
##  3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000  
##  Max.   :2065   Max.   :572.000   Max.   :5642   Max.   :3.0000  
##                                                                  
##   BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr  
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.00000   Median :2.000   Median :0.0000   Median :3.000  
##  Mean   :0.05753   Mean   :1.565   Mean   :0.3829   Mean   :2.866  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :3.000   Max.   :2.0000   Max.   :8.000  
##                                                                    
##   KitchenAbvGr   KitchenQual         TotRmsAbvGrd     Functional       
##  Min.   :0.000   Length:1460        Min.   : 2.000   Length:1460       
##  1st Qu.:1.000   Class :character   1st Qu.: 5.000   Class :character  
##  Median :1.000   Mode  :character   Median : 6.000   Mode  :character  
##  Mean   :1.047                      Mean   : 6.518                     
##  3rd Qu.:1.000                      3rd Qu.: 7.000                     
##  Max.   :3.000                      Max.   :14.000                     
##                                                                        
##    Fireplaces    FireplaceQu         GarageType         GarageYrBlt  
##  Min.   :0.000   Length:1460        Length:1460        Min.   :1900  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.:1961  
##  Median :1.000   Mode  :character   Mode  :character   Median :1980  
##  Mean   :0.613                                         Mean   :1979  
##  3rd Qu.:1.000                                         3rd Qu.:2002  
##  Max.   :3.000                                         Max.   :2010  
##                                                        NA's   :81    
##  GarageFinish         GarageCars      GarageArea      GarageQual       
##  Length:1460        Min.   :0.000   Min.   :   0.0   Length:1460       
##  Class :character   1st Qu.:1.000   1st Qu.: 334.5   Class :character  
##  Mode  :character   Median :2.000   Median : 480.0   Mode  :character  
##                     Mean   :1.767   Mean   : 473.0                     
##                     3rd Qu.:2.000   3rd Qu.: 576.0                     
##                     Max.   :4.000   Max.   :1418.0                     
##                                                                        
##   GarageCond         PavedDrive          WoodDeckSF      OpenPorchSF    
##  Length:1460        Length:1460        Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Median :  0.00   Median : 25.00  
##                                        Mean   : 94.24   Mean   : 46.66  
##                                        3rd Qu.:168.00   3rd Qu.: 68.00  
##                                        Max.   :857.00   Max.   :547.00  
##                                                                         
##  EnclosedPorch      3SsnPorch       ScreenPorch        PoolArea      
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.00   Median :  0.00   Median :  0.000  
##  Mean   : 21.95   Mean   :  3.41   Mean   : 15.06   Mean   :  2.759  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :552.00   Max.   :508.00   Max.   :480.00   Max.   :738.000  
##                                                                      
##     PoolQC             Fence           MiscFeature           MiscVal        
##  Length:1460        Length:1460        Length:1460        Min.   :    0.00  
##  Class :character   Class :character   Class :character   1st Qu.:    0.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :    0.00  
##                                                           Mean   :   43.49  
##                                                           3rd Qu.:    0.00  
##                                                           Max.   :15500.00  
##                                                                             
##      MoSold           YrSold       SaleType         SaleCondition     
##  Min.   : 1.000   Min.   :2006   Length:1460        Length:1460       
##  1st Qu.: 5.000   1st Qu.:2007   Class :character   Class :character  
##  Median : 6.000   Median :2008   Mode  :character   Mode  :character  
##  Mean   : 6.322   Mean   :2008                                        
##  3rd Qu.: 8.000   3rd Qu.:2009                                        
##  Max.   :12.000   Max.   :2010                                        
##                                                                       
##    SalePrice     
##  Min.   : 34900  
##  1st Qu.:129975  
##  Median :163000  
##  Mean   :180921  
##  3rd Qu.:214000  
##  Max.   :755000  
## 

3.3 Correlation Plots

par(mfrow=c(1,1))
library(corrplot)
corr_df <- select_if(train_df, is.numeric) 

#Select 3 variables
corr_3var_df <- corr_df %>%
  dplyr::select(OverallQual,GrLivArea,GarageArea)

cormat <- round(cor(corr_3var_df,use = "complete.obs"),2)
corrplot(cormat, method="number",
         tl.cex = 0.8,
         number.cex = 0.8,
         cl.cex = 0.3)

Another flavor of Correlation Plots

# 
library(GGally)
corr_3var_df %>%
  ggscatmat(alpha = 0.7)
## Warning: The dot-dot notation (`..scaled..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(scaled)` instead.
## ℹ The deprecated feature was likely used in the GGally package.
##   Please report the issue at <https://github.com/ggobi/ggally/issues>.

3.4 Scatter Plots (Selected variables)

library("ggpubr")
ggscatter(corr_df, x = "SalePrice", y = "GrLivArea", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Sales Price $", ylab = "Gr Living Area")
## `geom_smooth()` using formula = 'y ~ x'

ggscatter(corr_df, x = "SalePrice", y = "OverallQual", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Sales Price $", ylab = "Overall Qualty")
## `geom_smooth()` using formula = 'y ~ x'

ggscatter(corr_df, x = "SalePrice", y = "GarageArea", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Sales Price $", ylab = "Garage Area")
## `geom_smooth()` using formula = 'y ~ x'

ggscatter(corr_df, x = "SalePrice", y = "TotalBsmtSF", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Sales Price $", ylab = "Basement SF")
## `geom_smooth()` using formula = 'y ~ x'

3.5 Calculate Shapiro-Wilk Test for Normality

for (i in colnames(corr_df)){       # for-loop over columns
  res <- shapiro.test(corr_df[[i]])$p
  print(paste0(i," ",res))
}
## [1] "Id 7.97201340256573e-21"
## [1] "MSSubClass 9.10819385467262e-39"
## [1] "LotFrontage 2.00169250796826e-29"
## [1] "LotArea 7.93365433408252e-58"
## [1] "OverallQual 2.68645715997021e-22"
## [1] "OverallCond 6.77422906906633e-37"
## [1] "YearBuilt 2.7702198520757e-26"
## [1] "YearRemodAdd 6.72028021924094e-34"
## [1] "MasVnrArea 6.5566451859357e-48"
## [1] "BsmtFinSF1 2.81385360889394e-35"
## [1] "BsmtFinSF2 1.85025360137009e-58"
## [1] "BsmtUnfSF 1.63991082180453e-25"
## [1] "TotalBsmtSF 1.61133243783591e-27"
## [1] "1stFlrSF 4.51322308307649e-26"
## [1] "2ndFlrSF 2.51488232422664e-41"
## [1] "LowQualFinSF 9.58924763161722e-64"
## [1] "GrLivArea 6.59761060721486e-26"
## [1] "BsmtFullBath 3.76066585731411e-47"
## [1] "BsmtHalfBath 1.46661594047236e-60"
## [1] "FullBath 4.23148773439281e-44"
## [1] "HalfBath 4.5815824129602e-48"
## [1] "BedroomAbvGr 4.11555125000648e-35"
## [1] "KitchenAbvGr 4.22120261568086e-61"
## [1] "TotRmsAbvGrd 2.00496395573929e-23"
## [1] "Fireplaces 4.83098036724715e-42"
## [1] "GarageYrBlt 2.8167825695788e-26"
## [1] "GarageCars 2.30168501177196e-36"
## [1] "GarageArea 4.01696328273393e-15"
## [1] "WoodDeckSF 3.22798531690171e-41"
## [1] "OpenPorchSF 1.13590473601951e-43"
## [1] "EnclosedPorch 4.84948460552283e-56"
## [1] "3SsnPorch 8.30726789752397e-64"
## [1] "ScreenPorch 3.3056878563652e-59"
## [1] "PoolArea 7.11153811656075e-65"
## [1] "MiscVal 1.52990697456972e-64"
## [1] "MoSold 3.17897279443885e-17"
## [1] "YrSold 3.42019425395422e-30"
## [1] "SalePrice 3.20614155057209e-33"

All of the Shapiro-Wilk p-values are essentially zero, so we reject normality for every one of the numeric features.

Pearson correlation to SalePrice. We will run the Pearson correlation of every numerical feature against the target variable SalePrice.

# Create list of Pearson Correlation coefficient
mylist <- list()

for (i in colnames(corr_df)){       # for-loop over columns
  res <- cor(corr_df[[i]],corr_df$SalePrice,method = "pearson", use = "complete.obs")
  mylist[[ i ]] <- res
}
#mylist
#Sort by value decreasing order
sorted_mylist <- mylist[order(unlist(mylist), decreasing=TRUE)]
sorted_mylist
## $SalePrice
## [1] 1
## 
## $OverallQual
## [1] 0.7909816
## 
## $GrLivArea
## [1] 0.7086245
## 
## $GarageCars
## [1] 0.6404092
## 
## $GarageArea
## [1] 0.6234314
## 
## $TotalBsmtSF
## [1] 0.6135806
## 
## $`1stFlrSF`
## [1] 0.6058522
## 
## $FullBath
## [1] 0.5606638
## 
## $TotRmsAbvGrd
## [1] 0.5337232
## 
## $YearBuilt
## [1] 0.5228973
## 
## $YearRemodAdd
## [1] 0.507101
## 
## $GarageYrBlt
## [1] 0.4863617
## 
## $MasVnrArea
## [1] 0.477493
## 
## $Fireplaces
## [1] 0.4669288
## 
## $BsmtFinSF1
## [1] 0.3864198
## 
## $LotFrontage
## [1] 0.3517991
## 
## $WoodDeckSF
## [1] 0.3244134
## 
## $`2ndFlrSF`
## [1] 0.3193338
## 
## $OpenPorchSF
## [1] 0.3158562
## 
## $HalfBath
## [1] 0.2841077
## 
## $LotArea
## [1] 0.2638434
## 
## $BsmtFullBath
## [1] 0.2271222
## 
## $BsmtUnfSF
## [1] 0.2144791
## 
## $BedroomAbvGr
## [1] 0.1682132
## 
## $ScreenPorch
## [1] 0.1114466
## 
## $PoolArea
## [1] 0.09240355
## 
## $MoSold
## [1] 0.04643225
## 
## $`3SsnPorch`
## [1] 0.04458367
## 
## $BsmtFinSF2
## [1] -0.01137812
## 
## $BsmtHalfBath
## [1] -0.01684415
## 
## $MiscVal
## [1] -0.02118958
## 
## $Id
## [1] -0.02191672
## 
## $LowQualFinSF
## [1] -0.02560613
## 
## $YrSold
## [1] -0.02892259
## 
## $OverallCond
## [1] -0.07785589
## 
## $MSSubClass
## [1] -0.08428414
## 
## $EnclosedPorch
## [1] -0.128578
## 
## $KitchenAbvGr
## [1] -0.1359074

The top 5 correlated variables were:

  1. OverallQual

  2. GrLivArea

  3. GarageCars

  4. GarageArea

  5. TotalBsmtSF

3.6 Test Pairwise Correlation Hypothesis at 80% CI

cor.test(corr_df$OverallQual, corr_df$GrLivArea, method = "pearson", conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  corr_df$OverallQual and corr_df$GrLivArea
## t = 28.121, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5708061 0.6143422
## sample estimates:
##       cor 
## 0.5930074
cor.test(corr_df$OverallQual, corr_df$GarageArea, method = "pearson", conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  corr_df$OverallQual and corr_df$GarageArea
## t = 25.946, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5386198 0.5845573
## sample estimates:
##       cor 
## 0.5620218
cor.test(corr_df$GarageArea, corr_df$GrLivArea, method = "pearson", conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  corr_df$GarageArea and corr_df$GrLivArea
## t = 20.276, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.4423993 0.4947713
## sample estimates:
##       cor 
## 0.4689975

The p-values are extremely small, so the probability of error in rejecting the hypothesis of no correlation between these variables is negligible; none of the 80% confidence intervals contain zero.

3.7 Precision Matrix and LU Decomposition

precision_mat <- solve(cormat)
precision_mat %*% cormat
##               OverallQual    GrLivArea   GarageArea
## OverallQual  1.000000e+00 5.551115e-17 0.000000e+00
## GrLivArea    5.551115e-17 1.000000e+00 1.110223e-16
## GarageArea  -1.110223e-16 0.000000e+00 1.000000e+00
cormat %*% precision_mat
##               OverallQual    GrLivArea   GarageArea
## OverallQual  1.000000e+00 2.775558e-17 0.000000e+00
## GrLivArea    0.000000e+00 1.000000e+00 1.110223e-16
## GarageArea  -1.110223e-16 5.551115e-17 1.000000e+00
lu.decomp <- function(M){
  rows <- dim(M)[1]
  cols <- dim(M)[2]
  
  if (rows != cols) {
    return(list(0,0))} #If is not square matrix return list of 0's
  else {
    L <- diag(x = 1, rows,cols) # L by definition has 1's in the diagonal
    U <- M # Upper will have its first row EQUAL to input matrix
    for (x in 1:(rows-1)){
      for (y in (x+1):rows){
          factor = U[y,x]/U[x,x] # Is the number used to eliminate
          L[y,x] = factor # We make the lower matrix that value
          U[y,] = U[y,]-factor*U[x,] # We use Gaussian elimination
      }
    }
    return(list("L"=L,"U"=U)) # We return our completed L and U matrices
  }
}
lu <- lu.decomp(cormat)
lu
## $L
##      [,1]      [,2] [,3]
## [1,] 1.00 0.0000000    0
## [2,] 0.59 1.0000000    0
## [3,] 0.56 0.2141433    1
## 
## $U
##             OverallQual GrLivArea GarageArea
## OverallQual           1    0.5900  0.5600000
## GrLivArea             0    0.6519  0.1396000
## GarageArea            0    0.0000  0.6565056

Let’s check the manual result against the lu.decomposition function in the matrixcalc library.

library(matrixcalc)
lu.decomposition(cormat)
## $L
##      [,1]      [,2] [,3]
## [1,] 1.00 0.0000000    0
## [2,] 0.59 1.0000000    0
## [3,] 0.56 0.2141433    1
## 
## $U
##      [,1]   [,2]      [,3]
## [1,]    1 0.5900 0.5600000
## [2,]    0 0.6519 0.1396000
## [3,]    0 0.0000 0.6565056

Same result both ways, which is comforting
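One more sanity check (a small sketch): multiplying L and U back together should reproduce the correlation matrix.

# Verify the decomposition: L %*% U should equal cormat
max(abs(lu$L %*% lu$U - cormat))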

3.8 Analysis of a Selected Skewed Variable

Plot Histogram of variable OpenPorchSF

# plot
p <- train_df %>%
  # filter( price<300 ) %>%
  ggplot( aes(x=OpenPorchSF)) +
    geom_histogram( binwidth=50, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
    ggtitle("Distribution of OpenPorchSF Variable")

p

Use fitdistr in MASS library

library(MASS)

Find the lambda parameter and use it to simulate an exponential distribution

resfit <- fitdistr(train_df$OpenPorchSF, "exponential")
simulated <- as.data.frame(rexp(1000,resfit$estimate))
colnames(simulated) <- c("estimate")

simulated %>%
  ggplot( aes(x=estimate)) +
    geom_histogram( binwidth=50, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
    ggtitle("Distribution of Simulate Exponential variable")

We can simulate results and see they are also skewed, but not quite the same distribution as the empirical data.
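One way to visualize how far the exponential fit is from the data (a small sketch using base R) is a quantile-quantile plot of the empirical values against the fitted exponential quantiles.

# Q-Q plot of OpenPorchSF against the fitted exponential (illustrative sketch)
qqplot(qexp(ppoints(nrow(train_df)), rate = resfit$estimate),
       train_df$OpenPorchSF,
       xlab = "Theoretical exponential quantiles",
       ylab = "Empirical OpenPorchSF quantiles")
abline(0, 1, col = "red")   # a perfect fit would follow this line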

3.9 Find 5th and 95th percentile of cumulative distribution

qexp(.05, rate=resfit$estimate)
## [1] 2.393359
qexp(.95, rate=resfit$estimate)
## [1] 139.7817

Generate a 95% confidence interval from the empirical data, assuming normality.

# Calculate the mean and standard error
l.model <- lm(OpenPorchSF ~ 1, train_df)

# Calculate the confidence interval
confint(l.model, level=0.95)
##                2.5 %   97.5 %
## (Intercept) 43.25888 50.06167

Let’s also compute the interval manually and compare the results.

#Manual way of calculating CI
sample.mean <- mean(train_df$OpenPorchSF)
print(sample.mean)
## [1] 46.66027
sample.n <- length(train_df$OpenPorchSF)
sample.sd <- sd(train_df$OpenPorchSF)
sample.se <- sample.sd/sqrt(sample.n)
print(sample.se)
## [1] 1.733999
alpha = 0.05
degrees.freedom = sample.n - 1
t.score = qt(p=alpha/2, df=degrees.freedom,lower.tail=F)
print(t.score)
## [1] 1.961591
margin.error <- t.score * sample.se
lower.bound <- sample.mean - margin.error
upper.bound <- sample.mean + margin.error
print(c(lower.bound,upper.bound))
## [1] 43.25888 50.06167
#Provide the empirical 5th percentile and 95th percentile of the data.

quantile(x=train_df$OpenPorchSF, probs=c(.05, .95))
##     5%    95% 
##   0.00 175.05

3.10 Modelling (Machine Learning)

We will use TIDYMODELS to prepare and fit a collection of models using a grid search over many hyper-parameters. We will fit Lasso regression (with regularization) and XGBoost for regression. We will check the cross-validation results and select the model and hyper-parameters that perform best on the validation data.

3.10.1 Prepare data

#convert all char to factors
train_df <- mutate(train_df, across(where(is.character), as.factor))
test_df <- mutate(test_df, across(where(is.character), as.factor))
#test_df2 <- mutate(test_df2, across(where(is.character), as.factor))

3.10.2 Split data training and test

set.seed(42)
sales_split <- initial_split(train_df, prop=9/10)
sales_train <- training(sales_split)
sales_test <- testing(sales_split)

3.10.3 Pre-process with RECIPES

# We will test 3 different feature-engineering options

sales_recipe1 <- recipe(SalePrice ~ ., data = sales_train) %>%
  step_impute_knn(all_predictors(), neighbors = 50) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep()

sales_prep1 <- prep(sales_recipe1)
#sales_prep1

sales_recipe2 <- recipe(SalePrice ~ ., data = sales_train) %>%
  step_impute_mode(all_nominal()) %>% 
  step_impute_mean(all_numeric_predictors()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep()

sales_prep2 <- prep(sales_recipe2)

sales_recipe3 <- recipe(SalePrice ~ ., data = sales_train) %>%
  step_impute_mode(all_nominal()) %>% 
  step_impute_mean(all_numeric_predictors()) %>%
  step_BoxCox(all_numeric_predictors()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep()

sales_prep3 <- prep(sales_recipe3)
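
To see what each recipe actually produces, we can peek at the engineered training data (a minimal sketch; limiting the view to the first 10 columns is just for readability):

# Sketch: inspect the engineered training data from recipe 1
baked1 <- bake(sales_prep1, new_data = NULL)
dim(baked1)
glimpse(baked1[, 1:10])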

3.10.4 Define Cross-Validation Folds

## Cross Validation Split of Training Data
set.seed(42)
sales_folds <- vfold_cv(
  data = sales_train, 
  v = 10
  ) 

#sales_folds

3.10.5 Define Models

# Linear Regression
lm_spec <- linear_reg() %>% 
            set_engine('lm') %>% 
            set_mode('regression')

#lm_spec
# Lasso Regression
lasso_spec <- linear_reg(mode = "regression",
                        penalty = tune(),
                        mixture = 1) %>% 
  set_engine("glmnet")

#lasso_spec
# XGBoost Regression
xgboost_spec <- boost_tree(
  trees = tune(),
  mtry = tune(),
  tree_depth = tune(),
  learn_rate = tune()
  ) %>%
  set_mode("regression") %>% 
  set_engine("xgboost")

3.10.6 Define workflow set

Let's stack the models and recipes we want to compare. Note that only the Lasso and XGBoost specifications go into the set; the plain lm specification above is left out.

workflow_set <-workflow_set(
  preproc = list(sales_recipe1,sales_recipe2, sales_recipe3),
  models = list(lasso_spec, xgboost_spec),
  cross = TRUE
  )

workflow_set
## # A workflow set/tibble: 6 × 4
##   wflow_id            info             option    result    
##   <chr>               <list>           <list>    <list>    
## 1 recipe_1_linear_reg <tibble [1 × 4]> <opts[0]> <list [0]>
## 2 recipe_1_boost_tree <tibble [1 × 4]> <opts[0]> <list [0]>
## 3 recipe_2_linear_reg <tibble [1 × 4]> <opts[0]> <list [0]>
## 4 recipe_2_boost_tree <tibble [1 × 4]> <opts[0]> <list [0]>
## 5 recipe_3_linear_reg <tibble [1 × 4]> <opts[0]> <list [0]>
## 6 recipe_3_boost_tree <tibble [1 × 4]> <opts[0]> <list [0]>

3.10.7 Fit the stack of Models and Recipes

Here we fit the models, tuning hyper-parameters over a grid of 150 candidate combinations for each workflow.

# The fitting took 5.5 hours on my PC. I saved the model results so the document can be knit quickly. To actually run the fit, change RUN to TRUE.

RUN = FALSE
if (RUN) {
doParallel::registerDoParallel()
start.time <- Sys.time()
start.time
fit_workflows <- workflow_set %>%  
  workflow_map(
    seed = 42,  
    fn = "tune_grid",
    grid = 150,
    resamples = sales_folds,
    verbose = TRUE
  )

end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
doParallel::stopImplicitCluster()
}

The fitting took 5.5 hours on an i7 Gen 11 machine with 32 GB of RAM.
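
For anyone who just wants to sanity-check the pipeline before committing to the full run, a much smaller grid finishes quickly. This is a hypothetical quick run, not the configuration that produced the saved results:

# Sketch: tiny grid for a fast end-to-end check of the workflow set
quick_fit <- workflow_set %>%
  workflow_map(
    seed = 42,
    fn = "tune_grid",
    grid = 5,
    resamples = sales_folds,
    verbose = TRUE
  )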

3.10.8 Save/Load resulting model (backup)

if (RUN) {
saved_housesales_modelset <- fit_workflows
saveRDS(saved_housesales_modelset, "HousePrices/housesales.rds")
}
########
if (!RUN) {
fit_workflows <- readRDS("HousePrices/housesales.rds")
}

3.10.9 Review Results of all models

Let’s see how our models performed and choose a winner.

autoplot(fit_workflows)

collect_metrics(fit_workflows)
## # A tibble: 1,800 × 9
##    wflow_id          .config preproc model .metric .esti…¹    mean     n std_err
##    <chr>             <chr>   <chr>   <chr> <chr>   <chr>     <dbl> <int>   <dbl>
##  1 recipe_1_linear_… Prepro… recipe  line… rmse    standa… 4.05e+4    10 8.50e+3
##  2 recipe_1_linear_… Prepro… recipe  line… rsq     standa… 7.75e-1    10 6.66e-2
##  3 recipe_1_linear_… Prepro… recipe  line… rmse    standa… 4.05e+4    10 8.50e+3
##  4 recipe_1_linear_… Prepro… recipe  line… rsq     standa… 7.75e-1    10 6.66e-2
##  5 recipe_1_linear_… Prepro… recipe  line… rmse    standa… 4.05e+4    10 8.50e+3
##  6 recipe_1_linear_… Prepro… recipe  line… rsq     standa… 7.75e-1    10 6.66e-2
##  7 recipe_1_linear_… Prepro… recipe  line… rmse    standa… 4.05e+4    10 8.50e+3
##  8 recipe_1_linear_… Prepro… recipe  line… rsq     standa… 7.75e-1    10 6.66e-2
##  9 recipe_1_linear_… Prepro… recipe  line… rmse    standa… 4.05e+4    10 8.50e+3
## 10 recipe_1_linear_… Prepro… recipe  line… rsq     standa… 7.75e-1    10 6.66e-2
## # … with 1,790 more rows, and abbreviated variable name ¹​.estimator
rank_results(fit_workflows, rank_metric = "rmse", select_best = TRUE)
## # A tibble: 12 × 9
##    wflow_id            .config .metric    mean std_err     n prepr…¹ model  rank
##    <chr>               <chr>   <chr>     <dbl>   <dbl> <int> <chr>   <chr> <int>
##  1 recipe_3_boost_tree Prepro… rmse    2.51e+4 2.33e+3    10 recipe  boos…     1
##  2 recipe_3_boost_tree Prepro… rsq     9.02e-1 1.55e-2    10 recipe  boos…     1
##  3 recipe_2_boost_tree Prepro… rmse    2.52e+4 2.42e+3    10 recipe  boos…     2
##  4 recipe_2_boost_tree Prepro… rsq     9.01e-1 1.64e-2    10 recipe  boos…     2
##  5 recipe_1_boost_tree Prepro… rmse    2.53e+4 2.24e+3    10 recipe  boos…     3
##  6 recipe_1_boost_tree Prepro… rsq     8.98e-1 1.34e-2    10 recipe  boos…     3
##  7 recipe_3_linear_reg Prepro… rmse    3.91e+4 7.10e+3    10 recipe  line…     4
##  8 recipe_3_linear_reg Prepro… rsq     7.81e-1 6.00e-2    10 recipe  line…     4
##  9 recipe_1_linear_reg Prepro… rmse    4.05e+4 8.50e+3    10 recipe  line…     5
## 10 recipe_1_linear_reg Prepro… rsq     7.75e-1 6.66e-2    10 recipe  line…     5
## 11 recipe_2_linear_reg Prepro… rmse    4.06e+4 8.57e+3    10 recipe  line…     6
## 12 recipe_2_linear_reg Prepro… rsq     7.74e-1 6.71e-2    10 recipe  line…     6
## # … with abbreviated variable name ¹​preprocessor

Extract the best workflow

metric <- "rmse"

best_workflow_id <- fit_workflows %>% 
  rank_results(
    rank_metric = metric,
    select_best = TRUE
  ) %>% 
  dplyr::slice(1) %>% 
  pull(wflow_id)

workflow_best <- extract_workflow(fit_workflows, id = best_workflow_id)

best_workflow_id
## [1] "recipe_3_boost_tree"

Extract the tuning results from the workflow set

We know which workflow performed best during training; now let's extract the tuning results that produced it.

workflow_best_tuned <- fit_workflows[fit_workflows$wflow_id == best_workflow_id,"result"][[1]][[1]]

workflow_best_tuned
## # Tuning results
## # 10-fold cross-validation 
## # A tibble: 10 × 4
##    splits             id     .metrics           .notes          
##    <list>             <chr>  <list>             <list>          
##  1 <split [1182/132]> Fold01 <tibble [300 × 8]> <tibble [1 × 3]>
##  2 <split [1182/132]> Fold02 <tibble [300 × 8]> <tibble [1 × 3]>
##  3 <split [1182/132]> Fold03 <tibble [300 × 8]> <tibble [1 × 3]>
##  4 <split [1182/132]> Fold04 <tibble [300 × 8]> <tibble [1 × 3]>
##  5 <split [1183/131]> Fold05 <tibble [300 × 8]> <tibble [1 × 3]>
##  6 <split [1183/131]> Fold06 <tibble [300 × 8]> <tibble [1 × 3]>
##  7 <split [1183/131]> Fold07 <tibble [300 × 8]> <tibble [1 × 3]>
##  8 <split [1183/131]> Fold08 <tibble [300 × 8]> <tibble [1 × 3]>
##  9 <split [1183/131]> Fold09 <tibble [300 × 8]> <tibble [1 × 3]>
## 10 <split [1183/131]> Fold10 <tibble [300 × 8]> <tibble [1 × 3]>
## 
## There were issues with some computations:
## 
##   - Warning(s) x10: Non-positive values in selected variable., Fewer than `num_unique...
## 
## Run `show_notes(.Last.tune.result)` for more information.
collect_metrics(workflow_best_tuned)
## # A tibble: 300 × 10
##     mtry trees tree_depth learn_…¹ .metric .esti…²    mean     n std_err .config
##    <int> <int>      <int>    <dbl> <chr>   <chr>     <dbl> <int>   <dbl> <chr>  
##  1     2   677          2  0.00175 rmse    standa… 8.93e+4    10 3.60e+3 Prepro…
##  2     2   677          2  0.00175 rsq     standa… 7.36e-1    10 2.32e-2 Prepro…
##  3     3   102          7  0.0562  rmse    standa… 3.09e+4    10 2.43e+3 Prepro…
##  4     3   102          7  0.0562  rsq     standa… 8.59e-1    10 1.60e-2 Prepro…
##  5     6  1851          9  0.0244  rmse    standa… 2.57e+4    10 2.30e+3 Prepro…
##  6     6  1851          9  0.0244  rsq     standa… 8.97e-1    10 1.60e-2 Prepro…
##  7     6   283         11  0.00579 rmse    standa… 5.48e+4    10 2.56e+3 Prepro…
##  8     6   283         11  0.00579 rsq     standa… 8.48e-1    10 1.89e-2 Prepro…
##  9     8   792          8  0.315   rmse    standa… 3.19e+4    10 2.55e+3 Prepro…
## 10     8   792          8  0.315   rsq     standa… 8.39e-1    10 1.82e-2 Prepro…
## # … with 290 more rows, and abbreviated variable names ¹​learn_rate, ²​.estimator
autoplot(workflow_best_tuned)

select_best(workflow_best_tuned, "rmse")
## # A tibble: 1 × 5
##    mtry trees tree_depth learn_rate .config               
##   <int> <int>      <int>      <dbl> <chr>                 
## 1    28  1755          5     0.0312 Preprocessor1_Model017
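
As an aside, the workflowsets package also provides extract_workflow_set_result(), which should return the same tuning results without the manual indexing used above (a sketch):

# Sketch: pull the tuning results via the workflowsets accessor
workflow_best_tuned2 <- extract_workflow_set_result(fit_workflows, id = best_workflow_id)
select_best(workflow_best_tuned2, "rmse")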

Fit the final model

Now let's finalize the workflow with the best hyper-parameters and evaluate it on the unseen test split.

workflow_best_final <- finalize_workflow(workflow_best, select_best(workflow_best_tuned, "rmse"))

doParallel::registerDoParallel()

workflow_best_final_fit <- workflow_best_final %>% 
  last_fit(
    split = sales_split
  )
## ! train/test split: preprocessor 1/1: Non-positive values in selected variable., Fewer than `num_unique` value...
doParallel::stopImplicitCluster()

workflow_best_final_fit
## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
##   splits             id               .metrics .notes   .predictions .workflow 
##   <list>             <chr>            <list>   <list>   <list>       <list>    
## 1 <split [1314/146]> train/test split <tibble> <tibble> <tibble>     <workflow>
## 
## There were issues with some computations:
## 
##   - Warning(s) x1: Non-positive values in selected variable., Fewer than `num_unique...
## 
## Run `show_notes(.Last.tune.result)` for more information.
workflow_best_final_fit %>% 
  collect_metrics()
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard   24853.    Preprocessor1_Model1
## 2 rsq     standard       0.906 Preprocessor1_Model1

The final model achieves an RMSE of about 24,853 and an R2 of 0.91 on the held-out test split.

Some visualizations of the selected winning model

fit_test <- workflow_best_final_fit %>% 
  collect_predictions()

fit_test
## # A tibble: 146 × 5
##    id                 .pred  .row SalePrice .config             
##    <chr>              <dbl> <int>     <dbl> <chr>               
##  1 train/test split 301319.     5    250000 Preprocessor1_Model1
##  2 train/test split 124986.    20    139000 Preprocessor1_Model1
##  3 train/test split 180091.    34    165500 Preprocessor1_Model1
##  4 train/test split 133426.    44    130250 Preprocessor1_Model1
##  5 train/test split 145016.    45    141000 Preprocessor1_Model1
##  6 train/test split 310751.    66    317000 Preprocessor1_Model1
##  7 train/test split 263161.    86    260000 Preprocessor1_Model1
##  8 train/test split 147370.    93    163500 Preprocessor1_Model1
##  9 train/test split 187506.    96    185000 Preprocessor1_Model1
## 10 train/test split 204675.   104    198900 Preprocessor1_Model1
## # … with 136 more rows
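
Before plotting, we can sanity-check the reported metrics directly from these collected predictions (a quick sketch; yardstick's rsq is the squared correlation):

# Sketch: recompute RMSE and R-squared by hand from the predictions
fit_test %>%
  summarise(
    rmse = sqrt(mean((.pred - SalePrice)^2)),
    rsq = cor(.pred, SalePrice)^2
  )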

Plot our predictions against the actual values from the unseen test split

fit_test %>%
  ggplot(aes(x=SalePrice, y=.pred)) +
  geom_abline(slope=1,intercept=0) +
  geom_point(alpha=0.3)

Boxplot of the prediction errors (predicted minus actual)

boxplot_data <- fit_test %>%
  mutate(difference = .pred - SalePrice)
boxplot_data$difference %>%
  boxplot(main = "Prediction Errors of the Selected Model",
          ylab = "Predicted minus Actual SalePrice",
          horizontal = FALSE)

3.10.10 Analysis of Regression Variables

Variable Importance

Here we use the vip library, which helps us identify the most important variables in our selected model. This importance ranking helps explain the model. Below, the most important variables are ranked from the top.

library(vip)
extract_workflow(workflow_best_final_fit) %>%
  extract_fit_parsnip() %>%
  vip(geom = "col")

3.10.11 Predict on NEW DATA and load to KAGGLE

KAGGLE username: juanfalck

my_predictions1 <- workflow_best_final_fit$.workflow[[1]] %>%
  predict(test_df)

Save the data in a .CSV file to upload to KAGGLE

my_predictions2 <- as_tibble(test_df$Id)
my_predictions2$pred <- my_predictions1$.pred
write.csv(my_predictions2,"JF_predictions.csv")
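
One caution, based on my assumption about the submission format: the Kaggle House Prices competition expects an Id and a SalePrice column with no extra row-name column, so a submission-ready file could be written as follows (the file name here is just illustrative):

# Sketch: submission file with Kaggle-style column names and no row names
submission <- tibble(Id = test_df$Id, SalePrice = my_predictions1$.pred)
write.csv(submission, "JF_predictions_submission.csv", row.names = FALSE)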

THANK YOU!

It has been a great class!!!