Data605 - Final Assignment
0. Load libraries and initialize things
rm(list = ls())
library(tidyverse)
library(tidymodels)
library(skimr)
library(keras)
library(ggplot2)
library(matrixcalc)
library(pracma)
1. Playing with PageRank
1.1 - Form the A matrix. Then, introduce decay and form the B matrix as we did in the course notes.
#Create a transitions Matrix A
options(digits = 3)
r1 <- c(0,1/2,1/2,0,0,0)
r2 <- c(0,0,0,0,0,0)
r3 <- c(1/3,1/3,0,0,1/3,0)
r4 <- c(0,0,0,0,1/2,1/2)
r5 <- c(0,0,0,1/2,0,1/2)
r6 <- c(0,0,0,1,0,0)
n =6
A <- matrix(c(r1,r2,r3,r4,r5,r6),n,n,byrow = TRUE)
r <- c(1/6,1/6,1/6,1/6,1/6,1/6)
print(A)## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.000 0.500 0.5 0.0 0.000 0.0
## [2,] 0.000 0.000 0.0 0.0 0.000 0.0
## [3,] 0.333 0.333 0.0 0.0 0.333 0.0
## [4,] 0.000 0.000 0.0 0.0 0.500 0.5
## [5,] 0.000 0.000 0.0 0.5 0.000 0.5
## [6,] 0.000 0.000 0.0 1.0 0.000 0.0
print(rowSums(A))## [1] 1 0 1 1 1 1
We can see that matrix A is not row stochastic: its rows do not all sum to one (row 2 sums to zero). To continue we need every row to be stochastic, so we will replace row 2 with a row where every entry is 1/6, giving each page equal probability and making the row sum to 1.
r2 <- c(1/6,1/6,1/6,1/6,1/6,1/6)
A <- matrix(c(r1,r2,r3,r4,r5,r6),n,n,byrow = TRUE)
print(A)## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.000 0.500 0.500 0.000 0.000 0.000
## [2,] 0.167 0.167 0.167 0.167 0.167 0.167
## [3,] 0.333 0.333 0.000 0.000 0.333 0.000
## [4,] 0.000 0.000 0.000 0.000 0.500 0.500
## [5,] 0.000 0.000 0.000 0.500 0.000 0.500
## [6,] 0.000 0.000 0.000 1.000 0.000 0.000
print(rowSums(A))## [1] 1 1 1 1 1 1
Let's define the B matrix.
# Decay Factor
d <- 0.85
# Adjusted B matrix
B <- d * A + ((1-d)/n)
print(B)## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.025 0.450 0.450 0.025 0.025 0.025
## [2,] 0.167 0.167 0.167 0.167 0.167 0.167
## [3,] 0.308 0.308 0.025 0.025 0.308 0.025
## [4,] 0.025 0.025 0.025 0.025 0.450 0.450
## [5,] 0.025 0.025 0.025 0.450 0.025 0.450
## [6,] 0.025 0.025 0.025 0.875 0.025 0.025
Let’s check it is row stochastic
print(rowSums(B))## [1] 1 1 1 1 1 1
1.2 Start with a uniform rank vector r and perform power iterations on B until convergence. That is, compute the solution r = B^n * r. Attempt this for a sufficiently large n so that r actually converges.
# Lets calculate a large number of steps to check convergence
nsteps = 100
#Make sure you use the transpose of B, since you want the FROM pages as columns; as defined in the question, the matrix has the FROM pages as rows.
rf <- matrix.power(t(B),nsteps) %*% r
print(rf)## [,1]
## [1,] 0.05170475
## [2,] 0.07367926
## [3,] 0.05741241
## [4,] 0.34870369
## [5,] 0.19990381
## [6,] 0.26859608
# Just to make sure it converged, compare to nsteps +1
rf <- matrix.power(t(B),nsteps+1) %*% r
print(rf)## [,1]
## [1,] 0.05170475
## [2,] 0.07367926
## [3,] 0.05741241
## [4,] 0.34870369
## [5,] 0.19990381
## [6,] 0.26859608
# Yes, same vector: it has converged!
We have a new vector rf, and we checked that it converged. This is our PageRank vector!
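As an extra sanity check, instead of picking a fixed power we can iterate until the vector stops changing. A minimal sketch (the 1e-10 tolerance is an arbitrary choice):
# Repeatedly apply t(B) until successive rank vectors differ by less than a tolerance
r_iter <- r
repeat {
  r_next <- t(B) %*% r_iter
  if (max(abs(r_next - r_iter)) < 1e-10) break
  r_iter <- r_next
}
print(r_iter) # should match rf up to the tolerance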
1.3 Compute the eigen-decomposition of B and verify that you indeed get an eigenvalue of 1 as the largest eigenvalue and that its corresponding eigenvector is the same vector that you obtained in the previous power iteration method.
This was a hard question in my opinion, but I was able to solve it correctly. The reason is that eigen-decompositions can be tricky to interpret across software packages: eigenvectors in particular are confusing because any scalar multiple of an eigenvector is an equally valid answer.
So if you went ahead and just used the eigen function in R, you may not get the answer you expected.
eigen(B)$values[1]## [1] 1
# Largest eigen value = 1
# To get the associated Eigenvector make sure you transpose it first
eigen(t(B))$vectors[,1]## [1] 0.1044385+0i 0.1488249+0i 0.1159674+0i 0.7043472+0i 0.4037861+0i
## [6] 0.5425377+0i
We can see that the largest eigenvalue is indeed 1, but the associated eigenvector does not look like what we expected. It is certainly not equal to the PageRank vector we found before, and it carries a "+0i", indicating R returned it as a complex vector (with zero imaginary part). It is, however, the same vector as the PageRank vector, just expressed with a different scaling.
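A quick way to see this is to rescale the eigenvector so that its entries sum to 1, since eigenvectors are only defined up to a scalar multiple. A minimal sketch:
# Take the dominant eigenvector of t(B), drop the zero imaginary part and rescale to sum to 1
v <- Re(eigen(t(B))$vectors[,1])
v / sum(v) # matches the PageRank vector rf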
For a more formal demonstration, we will derive the eigenvector ourselves rather than relying on eigen()'s scaling.
To recover the same PageRank vector we obtained by raising the matrix to a large power, we will find the null space of the transposed B matrix minus the identity matrix; any vector in that null space is an eigenvector for eigenvalue 1, and will be a scalar multiple of the answer we want.
#Fist we transpose and subtract the diagonal matrix. This is the same process we would follow to find Eigenvectors manually.
Bt <- t(B) - diag(n)
Now we have a new matrix whose null space we can solve for. We will use the pracma library for this. The answer will be a scalar multiple of the final answer we want.
#Find the null space of the transformed B matrix
pracmaBnull <- pracma::nullspace(Bt)
# Lets find a factor between the nullspace of B and the PageRank vector. One should be a multiple of the other
lambdaFactor <- pracmaBnull[1,1] / rf[1,1]
#Lets compare our new vector with the original Pagerank vector
rf ## [,1]
## [1,] 0.05170475
## [2,] 0.07367926
## [3,] 0.05741241
## [4,] 0.34870369
## [5,] 0.19990381
## [6,] 0.26859608
pracmaBnull / lambdaFactor## [,1]
## [1,] 0.05170475
## [2,] 0.07367926
## [3,] 0.05741241
## [4,] 0.34870369
## [5,] 0.19990381
## [6,] 0.26859608
Solved. We can now see that the null space of the transformed B matrix gives the same eigenvector as the PageRank vector, just with a different scaling factor, but nonetheless the same vector.
We just showed that our eigenvector for lambda equal to 1 is the SAME as the PageRank vector we found earlier.
1.4 Use the graph package in R and its page.rank method to compute the Page Rank of the graph as given in A.
#The graph library no longer exists on CRAN, but I found another library (igraph) to do the trick
library(igraph)
A_Graph <- graph_from_adjacency_matrix(A, weighted = TRUE, mode = "directed")
plot(A_Graph)
Let's see that library's PageRank result.
pageRank <- page.rank(A_Graph)$vector
All 3 methods give the same answer:
# Pagerank vector by taking Matrix powers until convergence
print(t(rf))## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.05170475 0.07367926 0.05741241 0.3487037 0.1999038 0.2685961
# Pagerank vector by doing spectral decomposition of the B matrix
print(t(pracmaBnull/lambdaFactor))## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.05170475 0.07367926 0.05741241 0.3487037 0.1999038 0.2685961
# Page rank of matrix A by using page.rank function
print(pageRank)## [1] 0.05170475 0.07367926 0.05741241 0.34870369 0.19990381 0.26859608
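As an aside, more recent igraph versions prefer the snake-case page_rank() over the legacy page.rank(); a minimal equivalent call (assuming a reasonably current igraph) would be:
# Same computation with the newer interface; damping = 0.85 mirrors our decay factor d
page_rank(A_Graph, damping = 0.85)$vector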
2. MNIST Dataset
2.1 Load files and Plot function
#setwd("Code/FinalExam/MNIST")
train_df <- read_csv("MNIST/train.csv")
test_df <- read_csv('MNIST/test.csv')
train_df[,2:ncol(train_df)] <- train_df[,2:ncol(train_df)] / 255
head(train_df)## # A tibble: 6 x 785
## label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0
## 3 1 0 0 0 0 0 0 0 0 0 0
## 4 4 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0
## # ... with 774 more variables: pixel10 <dbl>, pixel11 <dbl>, pixel12 <dbl>,
## # pixel13 <dbl>, pixel14 <dbl>, pixel15 <dbl>, pixel16 <dbl>, pixel17 <dbl>,
## # pixel18 <dbl>, pixel19 <dbl>, pixel20 <dbl>, pixel21 <dbl>, pixel22 <dbl>,
## # pixel23 <dbl>, pixel24 <dbl>, pixel25 <dbl>, pixel26 <dbl>, pixel27 <dbl>,
## # pixel28 <dbl>, pixel29 <dbl>, pixel30 <dbl>, pixel31 <dbl>, pixel32 <dbl>,
## # pixel33 <dbl>, pixel34 <dbl>, pixel35 <dbl>, pixel36 <dbl>, pixel37 <dbl>,
## # pixel38 <dbl>, pixel39 <dbl>, pixel40 <dbl>, pixel41 <dbl>, ...
Let’s create a basic function that would take the matrix information of the MNIST digits and display it on screen.
#Function to display DIGITS
disp_digit <- function(digit_row, d_label) {
img1 <- matrix(unlist(digit_row),nrow=28,ncol=28,byrow=TRUE)
# reverse and transpose, if not image rotated 90 degrees
img1 <- t(apply(img1, 2, rev))
# To keep aspect ratio
par(pty="s")
image(1:28, 1:28, img1, col=gray((0:255)/255),
axes=FALSE, xlab = d_label, ylab = "")
}Let’s test the function displaying the 10th row in our dataset
# Choose a row (digit image)
idnum=10
disp_digit(train_df[idnum,2:ncol(train_df)], train_df[idnum,1])
We should be looking at the number "3".
2.2 Basic information about dataset
Display the frequency of each number in the dataset
# count of all rows by Label (digit)
train_df %>%
group_by(label) %>%
summarise(Count = n()) %>%
mutate(frequency = Count / sum(Count))## # A tibble: 10 x 3
## label Count frequency
## <dbl> <int> <dbl>
## 1 0 4132 0.0984
## 2 1 4684 0.112
## 3 2 4177 0.0995
## 4 3 4351 0.104
## 5 4 4072 0.0970
## 6 5 3795 0.0904
## 7 6 4137 0.0985
## 8 7 4401 0.105
## 9 8 4063 0.0967
## 10 9 4188 0.0997
We can see here that although not exact, each digit 0 to 9 is represented almost uniformly across the dataset.
Just a few queries we may need later.
# 10 rows, one per digit, with the sum of every pixel (784)
train_sum_df <- train_df %>%
group_by(label) %>%
summarise(across(everything(),sum), .groups = 'drop')
# 10 rows, with the sum of all columns
train_sum_columns <- train_df %>%
group_by(label) %>%
summarise(across(everything(),sum), .groups = 'drop') %>%
summarise(label=label,pixel_sum = rowSums(.))
# 10 rows, one per digit, with the mean of every pixel (784)
# this just tell us for each digit, what is the mean of every specific pixel.
train_mean_df <- train_df %>%
group_by(label) %>%
summarise(across(everything(),mean), .groups = 'drop')
2.3 Plot some numbers
Let’s plot the first ten numbers in the dataframe
# Plot First 10 numbers on dataframe
par(mfrow=c(2,5),mai = c(0.1, 0.1, 0.1, 0.1),mar=c(1, 0, 1, 0))
for (i in 1:10) {
disp_digit(train_df[i,2:ncol(train_df)],"")
}
For fun, let's average all rows grouped by digit and display the result.
# Plot all 10 digits based on averages
par(mfrow=c(2,5),mai = c(0.1, 0.1, 0.1, 0.1),mar=c(1, 0, 1, 0))
for (i in 1:10) {
#disp_digit(train_mean_df[i,2:ncol(train_mean_df)])
disp_digit(train_mean_df[i,2:ncol(train_mean_df)],"")
}
2.4 More info about dataset
Display the mean pixel value for every digit. This average is roughly equivalent to the amount of "white" (ink) each digit contains; you can see, for example, that the digit 1 has a lower mean than the digit 8.
# Mean pixel intensity by digit
train_df %>%
group_by(label) %>%
summarise(across(everything(),sum), .groups = 'drop', count=n()) %>%
summarise(label=label,pixel_sum = rowSums(.), count = count, avg_pix = pixel_sum/count/784)## # A tibble: 10 x 4
## label pixel_sum count avg_pix
## <dbl> <dbl> <int> <dbl>
## 1 0 565313. 4132 0.175
## 2 1 283676. 4684 0.0772
## 3 2 493479. 4177 0.151
## 4 3 487574. 4351 0.143
## 5 4 391039. 4072 0.122
## 6 5 388299. 3795 0.131
## 7 6 454101. 4137 0.140
## 8 7 400174. 4401 0.116
## 9 8 485005. 4063 0.152
## 10 9 407456. 4188 0.124
2.5 PCA (Principal Component Analysis)
# Call PCA for all numbers, up to the maximum of 784 principal components
pca_result <- prcomp(train_df[,2:ncol(train_df)], rank = 784, center=TRUE)
#summary(pca_result)
The maximum number of principal components is the minimum of the number of rows and the number of columns. In this case we can get up to 784 principal components, although you may not need most of them. The idea is that you can capture most of the information in the dataset with a smaller subset of components, which saves space and speeds up processing.
Check variance by principal component
# std deviation
sdev <- pca_result$sdev
percent_variation <- sdev^2 / sum(sdev^2)
var_df <- data.frame(PC=paste0("PC",1:length(sdev)),
var_explained=percent_variation,
stringsAsFactors = FALSE)
Let's do some plotting.
# Function which will help later plot all PC's
every_nth = function(n) {
return(function(x) {x[c(TRUE, rep(FALSE, n - 1))]})
}
Plot of the variance explained by each Principal Component
var_df %>%
mutate(PC = fct_inorder(PC)) %>%
ggplot(aes(x=PC,y=var_explained))+
scale_x_discrete(breaks=every_nth(n=100)) +
geom_point() + ggtitle("Variance explained by PC") + xlab("Number of PC's") + ylab("Variance Explained")
Plot of the cumulative variance explained by the first n components. Select the numbers of components that explain 95% and 100% of the total variance.
cumvar <- cumsum(var_df['var_explained'])
comp_95 <- min(which(cumvar>=0.95))
comp_100 <- min(which(cumvar>=1))
print(comp_95)## [1] 154
print(comp_100)## [1] 704
We see here that 95% of the variance is explained by the first 154 PC's, while to get to 100% of the variance you need 704 PC's.
Let’s plot it!
ggplot(var_df,aes(x=1:784, y=cumsum(var_explained))) + geom_line() + geom_point()+
ggtitle("Cummulative Variance explained by PC") +
xlab("Number of PC's") + ylab("Cum Variance Explained")Plot first few PRINCIPAL COMPONENTS
# Plot 10 First COMPONENTS
par(mfrow=c(2,5),mai = c(0.1, 0.1, 0.1, 0.1),mar=c(1, 0, 1, 0))
for (i in 1:10) {
disp_digit(pca_result$rotation[,i],"")
}
The first few principal components look "noisy". That is because these are the directions that explain the most variance across the whole dataset, with all digits included, so what you are looking at is a mixture of the digits 0 to 9.
Print first 10 numbers
By keeping only the first r components we can also reconstruct the digits and plot them. Let's see how they look with only 10 PC's out of 784.
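Since we reconstruct the digits several times below with different numbers of components, a small helper (the name reconstruct_pca is ours, not part of any package) wraps the same formula in one place:
# Reconstruct the pixel matrix from the first r principal components,
# i.e. scores[,1:r] %*% t(rotation[,1:r]) with the centering means added back
reconstruct_pca <- function(pca, r) {
  t(t(pca$x[,1:r] %*% t(pca$rotation[,1:r])) + pca$center)
}
# Example: disp_digit(reconstruct_pca(pca_result, 10)[1,], "")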
# Plot 10 First Number RECONSTRUCTED with 10 PCS
par(mfrow=c(2,5),mai = c(0.1, 0.1, 0.1, 0.1),mar=c(1, 0, 1, 0))
r <- 10
recon <- t(t(pca_result$x[,1:r] %*% t(pca_result$rotation[,1:r])) + pca_result$center)
for (i in 1:10) {
disp_digit(recon[i,],"")
}
Not too good; you cannot be sure which digits these are. Let's try with more PC's, this time 50.
# Plot 10 First Number RECONSTRUCTED with 50 PCS
par(mfrow=c(2,5),mai = c(0.1, 0.1, 0.1, 0.1),mar=c(1, 0, 1, 0))
r <- 50
recon <- t(t(pca_result$x[,1:r] %*% t(pca_result$rotation[,1:r])) + pca_result$center)
for (i in 1:10) {
disp_digit(recon[i,],"")
}
Better, but still not too clear. Let's try now with 200 PC's.
# Plot 10 First Number RECONSTRUCTED with 200 PCS
par(mfrow=c(2,5),mai = c(0.1, 0.1, 0.1, 0.1),mar=c(1, 0, 1, 0))
r <- 200
recon <- t(t(pca_result$x[,1:r] %*% t(pca_result$rotation[,1:r])) + pca_result$center)
for (i in 1:10) {
disp_digit(recon[i,],"")
}
Much better. This time let's use as many principal components as needed to capture 100% of the variance, that is, 704 PC's.
# Plot 10 First Number RECONSTRUCTED 100%
par(mfrow=c(2,5),mai = c(0.1, 0.1, 0.1, 0.1),mar=c(1, 0, 1, 0))
r <- comp_100
recon <- t(t(pca_result$x[,1:r] %*% t(pca_result$rotation[,1:r])) + pca_result$center)
for (i in 1:10) {
disp_digit(recon[i,],"")
}
Yes, we got back exactly the original digits, and we used fewer than the maximum of 784 PC's.
Filter only the 8's. Let's see what happens when we redo the whole PCA analysis but include only the rows that refer to the digit 8.
train_all8_df <- train_df %>%
filter(label==8)
Re-run PCA
# Cal PCA for all 8's in the dataset
pca_result8 <- prcomp(train_all8_df[,2:ncol(train_all8_df)], rank = 784, center=TRUE)
#summary(pca_result8)
Calculate variance only for the 8's
# std deviation of each Principal Component
sdev <- pca_result8$sdev
percent_variation <- sdev^2 / sum(sdev^2)
#Variance of PC's
var_df <- data.frame(PC=paste0("PC",1:length(sdev)),
var_explained=percent_variation,
stringsAsFactors = FALSE)
#Cumulative variance of ordered PC's
cumvar <- cumsum(var_df['var_explained'])
comp_95 <- min(which(cumvar>=0.95))
comp_100 <- min(which(cumvar>=1))
#Variance explained up to 95% and 100%
print(comp_95)## [1] 135
print(comp_100)## [1] 537
Let’s Plot first few PRINCIPAL COMPONENTS
# Plot 10 First COMPONENTS
par(mfrow=c(2,5),mai = c(0.1, 0.1, 0.1, 0.1),mar=c(1, 0, 1, 0))
for (i in 1:10) {
disp_digit(pca_result8$rotation[,i],"")
}
We can see that the first PC's now capture only information about the digit 8, which can clearly be recognized in each of them. Recall that we can reconstruct the original images by combining these "kind of weird looking" components. Let's do it!
Display 10 Reconstructed number 8’s
# Plot 10 First Numbers RECONSTRUCTED at 95% variance captured
par(mfrow=c(2,5),mai = c(0.1, 0.1, 0.1, 0.1),mar=c(1, 0, 1, 0))
r <- comp_95
#Let's reconstruct the number
recon <- t(t(pca_result8$x[,1:r] %*% t(pca_result8$rotation[,1:r])) + pca_result8$center)
for (i in 1:10) {
disp_digit(recon[i,],"")
}
2.6 Multinomial Classification Model
Convert the labels to factors, since classification models expect a factor outcome rather than a numeric or character one.
train_df$label <- as.factor(train_df$label)
We will use TIDYMODELS to set up a basic multinomial classification model.
multinom_spec <-
multinom_reg(penalty = 0, mixture=1) %>%
set_mode("classification") %>%
set_engine("glmnet")Let’s fit our model!
doParallel::registerDoParallel()
multinom_fit <- multinom_spec %>%
fit(label ~ .,
data = train_df
)
# multinom_fit
Let's get the results from our fit.
results_train <- multinom_fit %>%
predict(new_data = train_df) %>%
mutate(
truth = train_df$label
)
results_train %>%
accuracy(truth, .pred_class)## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.952
We got to an accuracy of 95.2%
Let's show the CONFUSION MATRIX:
results_train %>%
conf_mat(truth, .pred_class)## Truth
## Prediction 0 1 2 3 4 5 6 7 8 9
## 0 4074 0 8 6 4 22 15 4 16 4
## 1 0 4627 16 8 14 7 1 10 41 7
## 2 6 5 3926 75 13 23 18 28 22 6
## 3 5 11 51 4021 3 92 3 10 85 40
## 4 3 0 30 2 3903 32 15 22 9 63
## 5 12 5 15 119 2 3491 35 2 78 22
## 6 13 2 23 4 20 37 4038 2 15 0
## 7 0 4 27 23 9 5 0 4217 11 88
## 8 17 24 69 62 12 69 11 11 3745 26
## 9 2 6 12 31 92 17 1 95 41 3932
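Beyond overall accuracy, yardstick (loaded with tidymodels) can report additional class metrics; a minimal sketch using a metric set, which macro-averages across the 10 digits:
# Accuracy plus macro-averaged sensitivity and specificity
multi_metrics <- metric_set(accuracy, sens, spec)
results_train %>% multi_metrics(truth = truth, estimate = .pred_class)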
2.7 Keras CLASSIFICATION using a Feedforward Neural Network
Let’s separate our data into train and testing.
mnist_nn_split <- initial_split(train_df)
train <- training(mnist_nn_split)
test <- testing(mnist_nn_split)
y_train <- as.matrix(train$label)
x_train <- as.matrix(train[,-1])
y_test <- as.matrix(test$label)
x_test <- as.matrix(test[,-1])
We need to convert the labels to one-hot (categorical) encoding.
y_train <- to_categorical(y_train, num_classes = 10)## Loaded Tensorflow version 2.3.0
y_test <- to_categorical(y_test, num_classes = 10)
Let's define our dense feed-forward Neural Network.
dense_model <- keras_model_sequential()
dense_model %>%
layer_dense(units = 128, activation = 'relu', input_shape = c(784)) %>%
layer_batch_normalization() %>%
layer_dropout(rate = 0.3) %>%
layer_dense(units = 32, activation = 'relu') %>%
layer_batch_normalization() %>%
layer_dropout(rate = 0.2) %>%
layer_dense(units = 10, activation = 'softmax')
summary(dense_model)## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## dense_2 (Dense) (None, 128) 100480
## ________________________________________________________________________________
## batch_normalization_1 (BatchNormali (None, 128) 512
## ________________________________________________________________________________
## dropout_1 (Dropout) (None, 128) 0
## ________________________________________________________________________________
## dense_1 (Dense) (None, 32) 4128
## ________________________________________________________________________________
## batch_normalization (BatchNormaliza (None, 32) 128
## ________________________________________________________________________________
## dropout (Dropout) (None, 32) 0
## ________________________________________________________________________________
## dense (Dense) (None, 10) 330
## ================================================================================
## Total params: 105,578
## Trainable params: 105,258
## Non-trainable params: 320
## ________________________________________________________________________________
We need to compile the model
dense_model %>% compile(
loss = 'categorical_crossentropy',
optimizer = optimizer_adam(),
metrics = c('accuracy')
)
Let's fit our model
history <- dense_model %>% fit(
x_train, y_train,
epochs = 20, batch_size = 256,
validation_split = 0.1
)
plot(history)## `geom_smooth()` using formula = 'y ~ x'
dense_model %>% evaluate(x_test, y_test)## loss accuracy
## 0.1039279 0.9708571
We got to an accuracy of 97%.
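If we also want the predicted digit for each individual test image (not just the aggregate metrics), a minimal sketch takes the arg-max of the softmax output:
# Predicted probabilities (one column per digit); the predicted digit is the column with the highest probability
probs <- dense_model %>% predict(x_test)
pred_digit <- max.col(probs) - 1 # subtract 1 because the digits run 0 to 9
head(pred_digit)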
2.8 Keras CLASSIFICATION with a Convolutional Neural Network (1D)
# Let's reshape the X's
x_trainR <- array_reshape(x_train,c(31500,784,1))
x_testR <- array_reshape(x_test,c(10500,784,1)) # Works now, but you need to reshape x to add one extra dimension of size 1
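The hard-coded 31500 and 10500 come from the default 3/4 split of the 42,000 training rows; a sketch that avoids relying on those exact counts:
# Derive the leading dimension from the data instead of hard-coding it
x_trainR <- array_reshape(x_train, c(nrow(x_train), 784, 1))
x_testR <- array_reshape(x_test, c(nrow(x_test), 784, 1))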
cnn_model <- keras_model_sequential()
cnn_model %>%
layer_conv_1d(filters = 32, kernel_size = 3, activation = 'relu',
input_shape = c(784,1)) %>%
#layer_max_pooling_1d(pool_size = 3) %>%
layer_conv_1d(filters = 64, kernel_size = 3, activation = 'relu') %>%
#layer_global_average_pooling_1d() %>%
layer_max_pooling_1d(pool_size = 2) %>%
layer_dropout(rate = 0.2) %>%
layer_flatten() %>%
layer_dense(units=128,activation = 'relu') %>%
layer_dropout(rate = 0.2) %>%
layer_dense(units = 10, activation = 'softmax') %>%
compile(
loss = 'categorical_crossentropy',
optimizer = optimizer_adam(),
metrics = c('accuracy')
)
summary(cnn_model)## Model: "sequential_1"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## conv1d_1 (Conv1D) (None, 782, 32) 128
## ________________________________________________________________________________
## conv1d (Conv1D) (None, 780, 64) 6208
## ________________________________________________________________________________
## max_pooling1d (MaxPooling1D) (None, 390, 64) 0
## ________________________________________________________________________________
## dropout_3 (Dropout) (None, 390, 64) 0
## ________________________________________________________________________________
## flatten (Flatten) (None, 24960) 0
## ________________________________________________________________________________
## dense_4 (Dense) (None, 128) 3195008
## ________________________________________________________________________________
## dropout_2 (Dropout) (None, 128) 0
## ________________________________________________________________________________
## dense_3 (Dense) (None, 10) 1290
## ================================================================================
## Total params: 3,202,634
## Trainable params: 3,202,634
## Non-trainable params: 0
## ________________________________________________________________________________
history <- cnn_model %>% fit(x_trainR, y_train, batch_size = 128,
epochs = 10,validation_split = 0.1)
plot(history)## `geom_smooth()` using formula = 'y ~ x'
cnn_model %>% evaluate(x_testR, y_test, batch_size = 128)## loss accuracy
## 0.1079240 0.9754286
We got to an accuracy of 97.5%.
2.9 Keras Convolutional Neural Network (2D)
This time we re-shape the images to be 28x28
# Let's reshape the X's
x_trainR <- array_reshape(x_train,c(31500,28,28))
x_testR <- array_reshape(x_test,c(10500,28,28))
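Since the first convolutional layer below declares input_shape = c(28, 28, 1), reshaping with an explicit channel dimension keeps the array shape and the declared input shape consistent (a sketch; depending on the Keras version, the 3-D arrays above may also be accepted and expanded automatically):
# Include the trailing channel dimension so the arrays are (n, 28, 28, 1)
x_trainR <- array_reshape(x_train, c(nrow(x_train), 28, 28, 1))
x_testR <- array_reshape(x_test, c(nrow(x_test), 28, 28, 1))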
cnn_model <- keras_model_sequential()## Loaded Tensorflow version 2.8.0
cnn_model %>%
layer_conv_2d(filters = 32, kernel_size = c(3,3), activation = 'relu',input_shape = input_shape <- c(28,28, 1)) %>%
layer_max_pooling_2d(pool_size = c(2, 2)) %>%
layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = 'relu') %>%
layer_max_pooling_2d(pool_size = c(2, 2)) %>%
layer_dropout(rate = 0.2) %>%
layer_flatten() %>%
layer_dense(units=128,activation = 'relu') %>%
layer_dropout(rate = 0.2) %>%
layer_dense(units = 10, activation = 'softmax') %>%
compile(
loss = 'categorical_crossentropy',
optimizer = optimizer_adam(),
metrics = c('accuracy')
)
summary(cnn_model)## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## conv2d_1 (Conv2D) (None, 26, 26, 32) 320
## max_pooling2d_1 (MaxPooling2D) (None, 13, 13, 32) 0
## conv2d (Conv2D) (None, 11, 11, 64) 18496
## max_pooling2d (MaxPooling2D) (None, 5, 5, 64) 0
## dropout_1 (Dropout) (None, 5, 5, 64) 0
## flatten (Flatten) (None, 1600) 0
## dense_1 (Dense) (None, 128) 204928
## dropout (Dropout) (None, 128) 0
## dense (Dense) (None, 10) 1290
## ================================================================================
## Total params: 225,034
## Trainable params: 225,034
## Non-trainable params: 0
## ________________________________________________________________________________
history <- cnn_model %>% fit(x_trainR, y_train, batch_size = 128,
epochs = 10,validation_split = 0.1)
plot(history)
cnn_model %>% evaluate(x_testR, y_test, batch_size = 128)## loss accuracy
## 0.03959895 0.98876190
Accuracy in 2D was: 98.9%
3. House Prices Dataset
3.1 Load Dataset
#setwd("../HousePrices")
train_df <- read_csv("HousePrices/train.csv")
test_df <- read_csv('HousePrices/test.csv')
head(train_df)## # A tibble: 6 × 81
## Id MSSubClass MSZoning LotFr…¹ LotArea Street Alley LotSh…² LandC…³ Utili…⁴
## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1 60 RL 65 8450 Pave <NA> Reg Lvl AllPub
## 2 2 20 RL 80 9600 Pave <NA> Reg Lvl AllPub
## 3 3 60 RL 68 11250 Pave <NA> IR1 Lvl AllPub
## 4 4 70 RL 60 9550 Pave <NA> IR1 Lvl AllPub
## 5 5 60 RL 84 14260 Pave <NA> IR1 Lvl AllPub
## 6 6 50 RL 85 14115 Pave <NA> IR1 Lvl AllPub
## # … with 71 more variables: LotConfig <chr>, LandSlope <chr>,
## # Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>, BldgType <chr>,
## # HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>,
## # YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>, Exterior1st <chr>,
## # Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>, ExterQual <chr>,
## # ExterCond <chr>, Foundation <chr>, BsmtQual <chr>, BsmtCond <chr>,
## # BsmtExposure <chr>, BsmtFinType1 <chr>, BsmtFinSF1 <dbl>, …
3.2 Basic Descriptive Stats
First let’s use SKIMR
skimr::skim(train_df)
| Name | train_df |
| Number of rows | 1460 |
| Number of columns | 81 |
| _______________________ | |
| Column type frequency: | |
| character | 43 |
| numeric | 38 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| MSZoning | 0 | 1.00 | 2 | 7 | 0 | 5 | 0 |
| Street | 0 | 1.00 | 4 | 4 | 0 | 2 | 0 |
| Alley | 1369 | 0.06 | 4 | 4 | 0 | 2 | 0 |
| LotShape | 0 | 1.00 | 3 | 3 | 0 | 4 | 0 |
| LandContour | 0 | 1.00 | 3 | 3 | 0 | 4 | 0 |
| Utilities | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
| LotConfig | 0 | 1.00 | 3 | 7 | 0 | 5 | 0 |
| LandSlope | 0 | 1.00 | 3 | 3 | 0 | 3 | 0 |
| Neighborhood | 0 | 1.00 | 5 | 7 | 0 | 25 | 0 |
| Condition1 | 0 | 1.00 | 4 | 6 | 0 | 9 | 0 |
| Condition2 | 0 | 1.00 | 4 | 6 | 0 | 8 | 0 |
| BldgType | 0 | 1.00 | 4 | 6 | 0 | 5 | 0 |
| HouseStyle | 0 | 1.00 | 4 | 6 | 0 | 8 | 0 |
| RoofStyle | 0 | 1.00 | 3 | 7 | 0 | 6 | 0 |
| RoofMatl | 0 | 1.00 | 4 | 7 | 0 | 8 | 0 |
| Exterior1st | 0 | 1.00 | 5 | 7 | 0 | 15 | 0 |
| Exterior2nd | 0 | 1.00 | 5 | 7 | 0 | 16 | 0 |
| MasVnrType | 8 | 0.99 | 4 | 7 | 0 | 4 | 0 |
| ExterQual | 0 | 1.00 | 2 | 2 | 0 | 4 | 0 |
| ExterCond | 0 | 1.00 | 2 | 2 | 0 | 5 | 0 |
| Foundation | 0 | 1.00 | 4 | 6 | 0 | 6 | 0 |
| BsmtQual | 37 | 0.97 | 2 | 2 | 0 | 4 | 0 |
| BsmtCond | 37 | 0.97 | 2 | 2 | 0 | 4 | 0 |
| BsmtExposure | 38 | 0.97 | 2 | 2 | 0 | 4 | 0 |
| BsmtFinType1 | 37 | 0.97 | 3 | 3 | 0 | 6 | 0 |
| BsmtFinType2 | 38 | 0.97 | 3 | 3 | 0 | 6 | 0 |
| Heating | 0 | 1.00 | 4 | 5 | 0 | 6 | 0 |
| HeatingQC | 0 | 1.00 | 2 | 2 | 0 | 5 | 0 |
| CentralAir | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
| Electrical | 1 | 1.00 | 3 | 5 | 0 | 5 | 0 |
| KitchenQual | 0 | 1.00 | 2 | 2 | 0 | 4 | 0 |
| Functional | 0 | 1.00 | 3 | 4 | 0 | 7 | 0 |
| FireplaceQu | 690 | 0.53 | 2 | 2 | 0 | 5 | 0 |
| GarageType | 81 | 0.94 | 6 | 7 | 0 | 6 | 0 |
| GarageFinish | 81 | 0.94 | 3 | 3 | 0 | 3 | 0 |
| GarageQual | 81 | 0.94 | 2 | 2 | 0 | 5 | 0 |
| GarageCond | 81 | 0.94 | 2 | 2 | 0 | 5 | 0 |
| PavedDrive | 0 | 1.00 | 1 | 1 | 0 | 3 | 0 |
| PoolQC | 1453 | 0.00 | 2 | 2 | 0 | 3 | 0 |
| Fence | 1179 | 0.19 | 4 | 5 | 0 | 4 | 0 |
| MiscFeature | 1406 | 0.04 | 4 | 4 | 0 | 4 | 0 |
| SaleType | 0 | 1.00 | 2 | 5 | 0 | 9 | 0 |
| SaleCondition | 0 | 1.00 | 6 | 7 | 0 | 6 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1.00 | 730.50 | 421.61 | 1 | 365.75 | 730.5 | 1095.25 | 1460 | ▇▇▇▇▇ |
| MSSubClass | 0 | 1.00 | 56.90 | 42.30 | 20 | 20.00 | 50.0 | 70.00 | 190 | ▇▅▂▁▁ |
| LotFrontage | 259 | 0.82 | 70.05 | 24.28 | 21 | 59.00 | 69.0 | 80.00 | 313 | ▇▃▁▁▁ |
| LotArea | 0 | 1.00 | 10516.83 | 9981.26 | 1300 | 7553.50 | 9478.5 | 11601.50 | 215245 | ▇▁▁▁▁ |
| OverallQual | 0 | 1.00 | 6.10 | 1.38 | 1 | 5.00 | 6.0 | 7.00 | 10 | ▁▂▇▅▁ |
| OverallCond | 0 | 1.00 | 5.58 | 1.11 | 1 | 5.00 | 5.0 | 6.00 | 9 | ▁▁▇▅▁ |
| YearBuilt | 0 | 1.00 | 1971.27 | 30.20 | 1872 | 1954.00 | 1973.0 | 2000.00 | 2010 | ▁▂▃▆▇ |
| YearRemodAdd | 0 | 1.00 | 1984.87 | 20.65 | 1950 | 1967.00 | 1994.0 | 2004.00 | 2010 | ▅▂▂▃▇ |
| MasVnrArea | 8 | 0.99 | 103.69 | 181.07 | 0 | 0.00 | 0.0 | 166.00 | 1600 | ▇▁▁▁▁ |
| BsmtFinSF1 | 0 | 1.00 | 443.64 | 456.10 | 0 | 0.00 | 383.5 | 712.25 | 5644 | ▇▁▁▁▁ |
| BsmtFinSF2 | 0 | 1.00 | 46.55 | 161.32 | 0 | 0.00 | 0.0 | 0.00 | 1474 | ▇▁▁▁▁ |
| BsmtUnfSF | 0 | 1.00 | 567.24 | 441.87 | 0 | 223.00 | 477.5 | 808.00 | 2336 | ▇▅▂▁▁ |
| TotalBsmtSF | 0 | 1.00 | 1057.43 | 438.71 | 0 | 795.75 | 991.5 | 1298.25 | 6110 | ▇▃▁▁▁ |
| 1stFlrSF | 0 | 1.00 | 1162.63 | 386.59 | 334 | 882.00 | 1087.0 | 1391.25 | 4692 | ▇▅▁▁▁ |
| 2ndFlrSF | 0 | 1.00 | 346.99 | 436.53 | 0 | 0.00 | 0.0 | 728.00 | 2065 | ▇▃▂▁▁ |
| LowQualFinSF | 0 | 1.00 | 5.84 | 48.62 | 0 | 0.00 | 0.0 | 0.00 | 572 | ▇▁▁▁▁ |
| GrLivArea | 0 | 1.00 | 1515.46 | 525.48 | 334 | 1129.50 | 1464.0 | 1776.75 | 5642 | ▇▇▁▁▁ |
| BsmtFullBath | 0 | 1.00 | 0.43 | 0.52 | 0 | 0.00 | 0.0 | 1.00 | 3 | ▇▆▁▁▁ |
| BsmtHalfBath | 0 | 1.00 | 0.06 | 0.24 | 0 | 0.00 | 0.0 | 0.00 | 2 | ▇▁▁▁▁ |
| FullBath | 0 | 1.00 | 1.57 | 0.55 | 0 | 1.00 | 2.0 | 2.00 | 3 | ▁▇▁▇▁ |
| HalfBath | 0 | 1.00 | 0.38 | 0.50 | 0 | 0.00 | 0.0 | 1.00 | 2 | ▇▁▅▁▁ |
| BedroomAbvGr | 0 | 1.00 | 2.87 | 0.82 | 0 | 2.00 | 3.0 | 3.00 | 8 | ▁▇▂▁▁ |
| KitchenAbvGr | 0 | 1.00 | 1.05 | 0.22 | 0 | 1.00 | 1.0 | 1.00 | 3 | ▁▇▁▁▁ |
| TotRmsAbvGrd | 0 | 1.00 | 6.52 | 1.63 | 2 | 5.00 | 6.0 | 7.00 | 14 | ▂▇▇▁▁ |
| Fireplaces | 0 | 1.00 | 0.61 | 0.64 | 0 | 0.00 | 1.0 | 1.00 | 3 | ▇▇▁▁▁ |
| GarageYrBlt | 81 | 0.94 | 1978.51 | 24.69 | 1900 | 1961.00 | 1980.0 | 2002.00 | 2010 | ▁▁▅▅▇ |
| GarageCars | 0 | 1.00 | 1.77 | 0.75 | 0 | 1.00 | 2.0 | 2.00 | 4 | ▁▃▇▂▁ |
| GarageArea | 0 | 1.00 | 472.98 | 213.80 | 0 | 334.50 | 480.0 | 576.00 | 1418 | ▂▇▃▁▁ |
| WoodDeckSF | 0 | 1.00 | 94.24 | 125.34 | 0 | 0.00 | 0.0 | 168.00 | 857 | ▇▂▁▁▁ |
| OpenPorchSF | 0 | 1.00 | 46.66 | 66.26 | 0 | 0.00 | 25.0 | 68.00 | 547 | ▇▁▁▁▁ |
| EnclosedPorch | 0 | 1.00 | 21.95 | 61.12 | 0 | 0.00 | 0.0 | 0.00 | 552 | ▇▁▁▁▁ |
| 3SsnPorch | 0 | 1.00 | 3.41 | 29.32 | 0 | 0.00 | 0.0 | 0.00 | 508 | ▇▁▁▁▁ |
| ScreenPorch | 0 | 1.00 | 15.06 | 55.76 | 0 | 0.00 | 0.0 | 0.00 | 480 | ▇▁▁▁▁ |
| PoolArea | 0 | 1.00 | 2.76 | 40.18 | 0 | 0.00 | 0.0 | 0.00 | 738 | ▇▁▁▁▁ |
| MiscVal | 0 | 1.00 | 43.49 | 496.12 | 0 | 0.00 | 0.0 | 0.00 | 15500 | ▇▁▁▁▁ |
| MoSold | 0 | 1.00 | 6.32 | 2.70 | 1 | 5.00 | 6.0 | 8.00 | 12 | ▃▆▇▃▃ |
| YrSold | 0 | 1.00 | 2007.82 | 1.33 | 2006 | 2007.00 | 2008.0 | 2009.00 | 2010 | ▇▇▇▇▅ |
| SalePrice | 0 | 1.00 | 180921.20 | 79442.50 | 34900 | 129975.00 | 163000.0 | 214000.00 | 755000 | ▇▅▁▁▁ |
Let’s take a look at basic stats
summary(train_df)## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig LandSlope
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Neighborhood Condition1 Condition2 BldgType
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HouseStyle OverallQual OverallCond YearBuilt
## Length:1460 Min. : 1.000 Min. :1.000 Min. :1872
## Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Mode :character Median : 6.000 Median :5.000 Median :1973
## Mean : 6.099 Mean :5.575 Mean :1971
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## Max. :10.000 Max. :9.000 Max. :2010
##
## YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1950 Length:1460 Length:1460 Length:1460
## 1st Qu.:1967 Class :character Class :character Class :character
## Median :1994 Mode :character Mode :character Mode :character
## Mean :1985
## 3rd Qu.:2004
## Max. :2010
##
## Exterior2nd MasVnrType MasVnrArea ExterQual
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 0.0 Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## ExterCond Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## HeatingQC CentralAir Electrical 1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
##
## EnclosedPorch 3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature MiscVal
## Length:1460 Length:1460 Length:1460 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 43.49
## 3rd Qu.: 0.00
## Max. :15500.00
##
## MoSold YrSold SaleType SaleCondition
## Min. : 1.000 Min. :2006 Length:1460 Length:1460
## 1st Qu.: 5.000 1st Qu.:2007 Class :character Class :character
## Median : 6.000 Median :2008 Mode :character Mode :character
## Mean : 6.322 Mean :2008
## 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
3.3 Correlation Plots
par(mfrow=c(1,1))
library(corrplot)
corr_df <- select_if(train_df, is.numeric)
#Select 3 variables
corr_3var_df <- corr_df %>%
dplyr::select(OverallQual,GrLivArea,GarageArea)
cormat <- round(cor(corr_3var_df,use = "complete.obs"),2)
corrplot(cormat, method="number",
tl.cex = 0.8,
number.cex = 0.8,
cl.cex = 0.3)
Another flavor of Correlation Plots
#
library(GGally)
corr_3var_df %>%
ggscatmat(alpha = 0.7)## Warning: The dot-dot notation (`..scaled..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(scaled)` instead.
## ℹ The deprecated feature was likely used in the GGally package.
## Please report the issue at <https://github.com/ggobi/ggally/issues>.
3.4 Scatter Plots (Selected variables)
library("ggpubr")
ggscatter(corr_df, x = "SalePrice", y = "GrLivArea",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "Sales Price $", ylab = "Gr Living Area")## `geom_smooth()` using formula = 'y ~ x'
ggscatter(corr_df, x = "SalePrice", y = "OverallQual",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "Sales Price $", ylab = "Overall Qualty")## `geom_smooth()` using formula = 'y ~ x'
ggscatter(corr_df, x = "SalePrice", y = "GarageArea",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "Sales Price $", ylab = "Garage Area")## `geom_smooth()` using formula = 'y ~ x'
ggscatter(corr_df, x = "SalePrice", y = "TotalBsmtSF",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "Sales Price $", ylab = "Basement SF")## `geom_smooth()` using formula = 'y ~ x'
3.5 Calculate Shapiro Test for normality.
for (i in colnames(corr_df)){ # for-loop over columns
res <- shapiro.test(corr_df[[i]])$p
print(paste0(i," ",res))
}## [1] "Id 7.97201340256573e-21"
## [1] "MSSubClass 9.10819385467262e-39"
## [1] "LotFrontage 2.00169250796826e-29"
## [1] "LotArea 7.93365433408252e-58"
## [1] "OverallQual 2.68645715997021e-22"
## [1] "OverallCond 6.77422906906633e-37"
## [1] "YearBuilt 2.7702198520757e-26"
## [1] "YearRemodAdd 6.72028021924094e-34"
## [1] "MasVnrArea 6.5566451859357e-48"
## [1] "BsmtFinSF1 2.81385360889394e-35"
## [1] "BsmtFinSF2 1.85025360137009e-58"
## [1] "BsmtUnfSF 1.63991082180453e-25"
## [1] "TotalBsmtSF 1.61133243783591e-27"
## [1] "1stFlrSF 4.51322308307649e-26"
## [1] "2ndFlrSF 2.51488232422664e-41"
## [1] "LowQualFinSF 9.58924763161722e-64"
## [1] "GrLivArea 6.59761060721486e-26"
## [1] "BsmtFullBath 3.76066585731411e-47"
## [1] "BsmtHalfBath 1.46661594047236e-60"
## [1] "FullBath 4.23148773439281e-44"
## [1] "HalfBath 4.5815824129602e-48"
## [1] "BedroomAbvGr 4.11555125000648e-35"
## [1] "KitchenAbvGr 4.22120261568086e-61"
## [1] "TotRmsAbvGrd 2.00496395573929e-23"
## [1] "Fireplaces 4.83098036724715e-42"
## [1] "GarageYrBlt 2.8167825695788e-26"
## [1] "GarageCars 2.30168501177196e-36"
## [1] "GarageArea 4.01696328273393e-15"
## [1] "WoodDeckSF 3.22798531690171e-41"
## [1] "OpenPorchSF 1.13590473601951e-43"
## [1] "EnclosedPorch 4.84948460552283e-56"
## [1] "3SsnPorch 8.30726789752397e-64"
## [1] "ScreenPorch 3.3056878563652e-59"
## [1] "PoolArea 7.11153811656075e-65"
## [1] "MiscVal 1.52990697456972e-64"
## [1] "MoSold 3.17897279443885e-17"
## [1] "YrSold 3.42019425395422e-30"
## [1] "SalePrice 3.20614155057209e-33"
All the p-values are far below 0.05, so none of the numeric variables can be considered normally distributed.
Pearson correlation to SalePrice
We will run a Pearson correlation of every numerical feature against the target variable SalePrice.
# Create list of Pearson Correlation coefficient
mylist <- list()
for (i in colnames(corr_df)){ # for-loop over columns
res <- cor(corr_df[[i]],corr_df$SalePrice,method = "pearson", use = "complete.obs")
mylist[[ i ]] <- res
}
#mylist
#Sort by value in decreasing order
sorted_mylist <- mylist[order(unlist(mylist), decreasing=TRUE)]
sorted_mylist## $SalePrice
## [1] 1
##
## $OverallQual
## [1] 0.7909816
##
## $GrLivArea
## [1] 0.7086245
##
## $GarageCars
## [1] 0.6404092
##
## $GarageArea
## [1] 0.6234314
##
## $TotalBsmtSF
## [1] 0.6135806
##
## $`1stFlrSF`
## [1] 0.6058522
##
## $FullBath
## [1] 0.5606638
##
## $TotRmsAbvGrd
## [1] 0.5337232
##
## $YearBuilt
## [1] 0.5228973
##
## $YearRemodAdd
## [1] 0.507101
##
## $GarageYrBlt
## [1] 0.4863617
##
## $MasVnrArea
## [1] 0.477493
##
## $Fireplaces
## [1] 0.4669288
##
## $BsmtFinSF1
## [1] 0.3864198
##
## $LotFrontage
## [1] 0.3517991
##
## $WoodDeckSF
## [1] 0.3244134
##
## $`2ndFlrSF`
## [1] 0.3193338
##
## $OpenPorchSF
## [1] 0.3158562
##
## $HalfBath
## [1] 0.2841077
##
## $LotArea
## [1] 0.2638434
##
## $BsmtFullBath
## [1] 0.2271222
##
## $BsmtUnfSF
## [1] 0.2144791
##
## $BedroomAbvGr
## [1] 0.1682132
##
## $ScreenPorch
## [1] 0.1114466
##
## $PoolArea
## [1] 0.09240355
##
## $MoSold
## [1] 0.04643225
##
## $`3SsnPorch`
## [1] 0.04458367
##
## $BsmtFinSF2
## [1] -0.01137812
##
## $BsmtHalfBath
## [1] -0.01684415
##
## $MiscVal
## [1] -0.02118958
##
## $Id
## [1] -0.02191672
##
## $LowQualFinSF
## [1] -0.02560613
##
## $YrSold
## [1] -0.02892259
##
## $OverallCond
## [1] -0.07785589
##
## $MSSubClass
## [1] -0.08428414
##
## $EnclosedPorch
## [1] -0.128578
##
## $KitchenAbvGr
## [1] -0.1359074
The top 5 correlated variables were:
OverallQual
GrLivArea
GarageCars
GarageArea
TotalBsmtSF
3.6 Test Pairwise Correlation Hypothesis at 80% CI
cor.test(corr_df$OverallQual, corr_df$GrLivArea, method = "pearson", conf.level = 0.80)##
## Pearson's product-moment correlation
##
## data: corr_df$OverallQual and corr_df$GrLivArea
## t = 28.121, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5708061 0.6143422
## sample estimates:
## cor
## 0.5930074
cor.test(corr_df$OverallQual, corr_df$GarageArea, method = "pearson", conf.level = 0.80)##
## Pearson's product-moment correlation
##
## data: corr_df$OverallQual and corr_df$GarageArea
## t = 25.946, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5386198 0.5845573
## sample estimates:
## cor
## 0.5620218
cor.test(corr_df$GarageArea, corr_df$GrLivArea, method = "pearson", conf.level = 0.80)##
## Pearson's product-moment correlation
##
## data: corr_df$GarageArea and corr_df$GrLivArea
## t = 20.276, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4423993 0.4947713
## sample estimates:
## cor
## 0.4689975
The p-values are extremely small: if these variables were truly uncorrelated, the probability of observing correlations this strong would be negligible, so we reject the hypothesis of no correlation.
3.7 Precision Matrix and LU Decomposition
precision_mat <- solve(cormat)
precision_mat %*% cormat## OverallQual GrLivArea GarageArea
## OverallQual 1.000000e+00 5.551115e-17 0.000000e+00
## GrLivArea 5.551115e-17 1.000000e+00 1.110223e-16
## GarageArea -1.110223e-16 0.000000e+00 1.000000e+00
cormat %*% precision_mat## OverallQual GrLivArea GarageArea
## OverallQual 1.000000e+00 2.775558e-17 0.000000e+00
## GrLivArea 0.000000e+00 1.000000e+00 1.110223e-16
## GarageArea -1.110223e-16 5.551115e-17 1.000000e+00
lu.decomp <- function(M){
rows <- dim(M)[1]
cols <- dim(M)[2]
if (rows != cols) {
return(list(0,0))} #If is not square matrix return list of 0's
else {
L <- diag(x = 1, rows,cols) # L by definition has 1's in the diagonal
U <- M # Upper will have its first row EQUAL to input matrix
for (x in 1:(rows-1)){
for (y in (x+1):rows){
factor = U[y,x]/U[x,x] # Is the number used to eliminate
L[y,x] = factor # We make the lower matrix that value
U[y,] = U[y,]-factor*U[x,] # We use Gaussian elimination
}
}
return(list("L"=L,"U"=U)) # We return our completed L and U matrices
}
}
lu <- lu.decomp(cormat)
lu## $L
## [,1] [,2] [,3]
## [1,] 1.00 0.0000000 0
## [2,] 0.59 1.0000000 0
## [3,] 0.56 0.2141433 1
##
## $U
## OverallQual GrLivArea GarageArea
## OverallQual 1 0.5900 0.5600000
## GrLivArea 0 0.6519 0.1396000
## GarageArea 0 0.0000 0.6565056
Lets’ check manual results vs function in matrixcalc library
library(matrixcalc)
lu.decomposition(cormat)## $L
## [,1] [,2] [,3]
## [1,] 1.00 0.0000000 0
## [2,] 0.59 1.0000000 0
## [3,] 0.56 0.2141433 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0.5900 0.5600000
## [2,] 0 0.6519 0.1396000
## [3,] 0 0.0000 0.6565056
Same result both ways, which is comforting
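A further sanity check on our manual decomposition: multiplying L and U back together should reproduce the correlation matrix, up to floating-point error. A minimal sketch:
# L %*% U should equal cormat; the largest absolute difference should be essentially zero
max(abs(lu$L %*% lu$U - cormat))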
3.8 Analysis Selected Skewed Variable
Plot Histogram of variable OpenPorchSF
# plot
p <- train_df %>%
# filter( price<300 ) %>%
ggplot( aes(x=OpenPorchSF)) +
geom_histogram( binwidth=50, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
ggtitle("Distribution of OpenPorchSF Variable")
p
Use fitdistr in the MASS library
library(MASS)
Find the lambda parameter and use it to simulate an exponential distribution.
resfit <- fitdistr(train_df$OpenPorchSF, "exponential")
simulated <- as.data.frame(rexp(1000,resfit$estimate))
colnames(simulated) <- c("estimate")
simulated %>%
ggplot( aes(x=estimate)) +
geom_histogram( binwidth=50, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
ggtitle("Distribution of Simulate Exponential variable")We can simulate results and see they are also skewed, but not quite the same distribution as the empirical data.
3.9 Find 5th and 95th percentile of cumulative distribution
qexp(.05, rate=resfit$estimate)## [1] 2.393359
qexp(.95, rate=resfit$estimate)## [1] 139.7817
Generate a 95% confidence interval from the empirical data, assuming normality.
# Calculate the mean and standard error
l.model <- lm(OpenPorchSF ~ 1, train_df)
# Calculate the confidence interval
confint(l.model, level=0.95)## 2.5 % 97.5 %
## (Intercept) 43.25888 50.06167
Let’s try also manually to see the results and compare them.
#Manual way of calculating CI
sample.mean <- mean(train_df$OpenPorchSF)
print(sample.mean)## [1] 46.66027
sample.n <- length(train_df$OpenPorchSF)
sample.sd <- sd(train_df$OpenPorchSF)
sample.se <- sample.sd/sqrt(sample.n)
print(sample.se)## [1] 1.733999
alpha = 0.05
degrees.freedom = sample.n - 1
t.score = qt(p=alpha/2, df=degrees.freedom,lower.tail=F)
print(t.score)## [1] 1.961591
margin.error <- t.score * sample.se
lower.bound <- sample.mean - margin.error
upper.bound <- sample.mean + margin.error
print(c(lower.bound,upper.bound))## [1] 43.25888 50.06167
#Provide the empirical 5th percentile and 95th percentile of the data.
quantile(x=train_df$OpenPorchSF, probs=c(.05, .95))## 5% 95%
## 0.00 175.05
3.10 Modelling (Machine Learning)
We will use TIDYMODELS to prepare and fit a collection of models using a grid search over many hyper-parameter combinations. We will fit Lasso (regularized linear) regression and XGBoost regression, check the cross-validation results, and select the model and hyper-parameters that perform best on the validation folds.
3.10.1 Prepare data
#convert all char to factors
train_df <- mutate(train_df, across(where(is.character), as.factor))
test_df <- mutate(test_df, across(where(is.character), as.factor))
#test_df2 <- mutate(test_df2, across(where(is.character), as.factor))
3.10.2 Split data into training and test
set.seed(42)
sales_split <- initial_split(train_df, prop=9/10)
sales_train <- training(sales_split)
sales_test <- testing(sales_split)
3.10.3 Pre-process with RECIPES
# will test with 3 different options to data engineering
sales_recipe1 <- recipe(SalePrice ~ ., data = sales_train) %>%
step_impute_knn(all_predictors(), neighbors = 50) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_zv(all_numeric_predictors()) %>%
step_normalize(all_numeric_predictors()) %>%
prep()
sales_prep1 <- prep(sales_recipe1)
#sales_prep1
sales_recipe2 <- recipe(SalePrice ~ ., data = sales_train) %>%
step_impute_mode(all_nominal()) %>%
step_impute_mean(all_numeric_predictors()) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_zv(all_numeric_predictors()) %>%
step_normalize(all_numeric_predictors()) %>%
prep()
sales_prep2 <- prep(sales_recipe2)
sales_recipe3 <- recipe(SalePrice ~ ., data = sales_train) %>%
step_impute_mode(all_nominal()) %>%
step_impute_mean(all_numeric_predictors()) %>%
step_BoxCox(all_numeric_predictors()) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_zv(all_numeric_predictors()) %>%
step_normalize(all_numeric_predictors()) %>%
prep()
sales_prep3 <- prep(sales_recipe3)
3.10.4 Define Cross-Validation Folds
## Cross Validation Split of Training Data
set.seed(42)
sales_folds <- vfold_cv(
data = sales_train,
v = 10
)
#sales_folds
3.10.5 Define Models
# Linear Regression
lm_spec <- linear_reg() %>%
set_engine('lm') %>%
set_mode('regression')
#lm_spec
# Lasso Regression
lasso_spec <- linear_reg(mode = "regression",
penalty = tune(),
mixture = 1) %>%
set_engine("glmnet")
#lasso_spec
# XGBoost Regression
xgboost_spec <- boost_tree(
trees = tune(),
mtry = tune(),
tree_depth = tune(),
learn_rate = tune()
) %>%
set_mode("regression") %>%
set_engine("xgboost")3.10.6 Define workflow set
Let’s stack the models we want to compare.
workflow_set <-workflow_set(
preproc = list(sales_recipe1,sales_recipe2, sales_recipe3),
models = list(lasso_spec, xgboost_spec),
cross = TRUE
)
workflow_set## # A workflow set/tibble: 6 × 4
## wflow_id info option result
## <chr> <list> <list> <list>
## 1 recipe_1_linear_reg <tibble [1 × 4]> <opts[0]> <list [0]>
## 2 recipe_1_boost_tree <tibble [1 × 4]> <opts[0]> <list [0]>
## 3 recipe_2_linear_reg <tibble [1 × 4]> <opts[0]> <list [0]>
## 4 recipe_2_boost_tree <tibble [1 × 4]> <opts[0]> <list [0]>
## 5 recipe_3_linear_reg <tibble [1 × 4]> <opts[0]> <list [0]>
## 6 recipe_3_boost_tree <tibble [1 × 4]> <opts[0]> <list [0]>
3.10.7 Fit the stack of Models and Recipes
Here we fit the models, tuning the hyper-parameters over a grid of 150 candidate combinations per model.
# The fitting took 5.5 hours on my PC. I saved the model results so the document knits quickly. To actually run the fit, change RUN to TRUE.
RUN = FALSE
if (RUN) {
doParallel::registerDoParallel()
start.time <- Sys.time()
start.time
fit_workflows <- workflow_set %>%
workflow_map(
seed = 42,
fn = "tune_grid",
grid = 150,
resamples = sales_folds,
verbose = TRUE
)
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
doParallel::stopImplicitCluster()
}
The fitting took 5.5 hours on an i7 Gen 11 machine with 32GB of RAM.
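If 5.5 hours is too long, a smaller grid gives a quicker, coarser search; a sketch assuming the same workflow set and folds (quick_fit is just an illustrative name):
# Coarser hyper-parameter search: 20 candidate points per model instead of 150
quick_fit <- workflow_set %>%
workflow_map(
seed = 42,
fn = "tune_grid",
grid = 20,
resamples = sales_folds,
verbose = TRUE
)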
3.10.8 Save/Load resulting model (backup)
if (RUN) {
saved_housesales_modelset <- fit_workflows
saveRDS(saved_housesales_modelset, "HousePrices/housesales.rds")
}
########
if (!RUN) {
fit_workflows <- readRDS("HousePrices/housesales.rds")
}
3.10.9 Review Results of all models
Let’s see how our models performed and choose a winner.
autoplot(fit_workflows)
collect_metrics(fit_workflows)## # A tibble: 1,800 × 9
## wflow_id .config preproc model .metric .esti…¹ mean n std_err
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
## 1 recipe_1_linear_… Prepro… recipe line… rmse standa… 4.05e+4 10 8.50e+3
## 2 recipe_1_linear_… Prepro… recipe line… rsq standa… 7.75e-1 10 6.66e-2
## 3 recipe_1_linear_… Prepro… recipe line… rmse standa… 4.05e+4 10 8.50e+3
## 4 recipe_1_linear_… Prepro… recipe line… rsq standa… 7.75e-1 10 6.66e-2
## 5 recipe_1_linear_… Prepro… recipe line… rmse standa… 4.05e+4 10 8.50e+3
## 6 recipe_1_linear_… Prepro… recipe line… rsq standa… 7.75e-1 10 6.66e-2
## 7 recipe_1_linear_… Prepro… recipe line… rmse standa… 4.05e+4 10 8.50e+3
## 8 recipe_1_linear_… Prepro… recipe line… rsq standa… 7.75e-1 10 6.66e-2
## 9 recipe_1_linear_… Prepro… recipe line… rmse standa… 4.05e+4 10 8.50e+3
## 10 recipe_1_linear_… Prepro… recipe line… rsq standa… 7.75e-1 10 6.66e-2
## # … with 1,790 more rows, and abbreviated variable name ¹.estimator
rank_results(fit_workflows, rank_metric = "rmse", select_best = TRUE)## # A tibble: 12 × 9
## wflow_id .config .metric mean std_err n prepr…¹ model rank
## <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <int>
## 1 recipe_3_boost_tree Prepro… rmse 2.51e+4 2.33e+3 10 recipe boos… 1
## 2 recipe_3_boost_tree Prepro… rsq 9.02e-1 1.55e-2 10 recipe boos… 1
## 3 recipe_2_boost_tree Prepro… rmse 2.52e+4 2.42e+3 10 recipe boos… 2
## 4 recipe_2_boost_tree Prepro… rsq 9.01e-1 1.64e-2 10 recipe boos… 2
## 5 recipe_1_boost_tree Prepro… rmse 2.53e+4 2.24e+3 10 recipe boos… 3
## 6 recipe_1_boost_tree Prepro… rsq 8.98e-1 1.34e-2 10 recipe boos… 3
## 7 recipe_3_linear_reg Prepro… rmse 3.91e+4 7.10e+3 10 recipe line… 4
## 8 recipe_3_linear_reg Prepro… rsq 7.81e-1 6.00e-2 10 recipe line… 4
## 9 recipe_1_linear_reg Prepro… rmse 4.05e+4 8.50e+3 10 recipe line… 5
## 10 recipe_1_linear_reg Prepro… rsq 7.75e-1 6.66e-2 10 recipe line… 5
## 11 recipe_2_linear_reg Prepro… rmse 4.06e+4 8.57e+3 10 recipe line… 6
## 12 recipe_2_linear_reg Prepro… rsq 7.74e-1 6.71e-2 10 recipe line… 6
## # … with abbreviated variable name ¹preprocessor
Extract the best workflow
metric <- "rmse"
best_workflow_id <- fit_workflows %>%
rank_results(
rank_metric = metric,
select_best = TRUE
) %>%
dplyr::slice(1) %>%
pull(wflow_id)
workflow_best <- extract_workflow(fit_workflows, id = best_workflow_id)
best_workflow_id## [1] "recipe_3_boost_tree"
Extract tuning results from workflowset
We know our best model in training, now let’s extract the tuning parameters used to generate it.
workflow_best_tuned <- fit_workflows[fit_workflows$wflow_id == best_workflow_id,"result"][[1]][[1]]
workflow_best_tuned## # Tuning results
## # 10-fold cross-validation
## # A tibble: 10 × 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [1182/132]> Fold01 <tibble [300 × 8]> <tibble [1 × 3]>
## 2 <split [1182/132]> Fold02 <tibble [300 × 8]> <tibble [1 × 3]>
## 3 <split [1182/132]> Fold03 <tibble [300 × 8]> <tibble [1 × 3]>
## 4 <split [1182/132]> Fold04 <tibble [300 × 8]> <tibble [1 × 3]>
## 5 <split [1183/131]> Fold05 <tibble [300 × 8]> <tibble [1 × 3]>
## 6 <split [1183/131]> Fold06 <tibble [300 × 8]> <tibble [1 × 3]>
## 7 <split [1183/131]> Fold07 <tibble [300 × 8]> <tibble [1 × 3]>
## 8 <split [1183/131]> Fold08 <tibble [300 × 8]> <tibble [1 × 3]>
## 9 <split [1183/131]> Fold09 <tibble [300 × 8]> <tibble [1 × 3]>
## 10 <split [1183/131]> Fold10 <tibble [300 × 8]> <tibble [1 × 3]>
##
## There were issues with some computations:
##
## - Warning(s) x10: Non-positive values in selected variable., Fewer than `num_unique...
##
## Run `show_notes(.Last.tune.result)` for more information.
collect_metrics(workflow_best_tuned)## # A tibble: 300 × 10
## mtry trees tree_depth learn_…¹ .metric .esti…² mean n std_err .config
## <int> <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 677 2 0.00175 rmse standa… 8.93e+4 10 3.60e+3 Prepro…
## 2 2 677 2 0.00175 rsq standa… 7.36e-1 10 2.32e-2 Prepro…
## 3 3 102 7 0.0562 rmse standa… 3.09e+4 10 2.43e+3 Prepro…
## 4 3 102 7 0.0562 rsq standa… 8.59e-1 10 1.60e-2 Prepro…
## 5 6 1851 9 0.0244 rmse standa… 2.57e+4 10 2.30e+3 Prepro…
## 6 6 1851 9 0.0244 rsq standa… 8.97e-1 10 1.60e-2 Prepro…
## 7 6 283 11 0.00579 rmse standa… 5.48e+4 10 2.56e+3 Prepro…
## 8 6 283 11 0.00579 rsq standa… 8.48e-1 10 1.89e-2 Prepro…
## 9 8 792 8 0.315 rmse standa… 3.19e+4 10 2.55e+3 Prepro…
## 10 8 792 8 0.315 rsq standa… 8.39e-1 10 1.82e-2 Prepro…
## # … with 290 more rows, and abbreviated variable names ¹learn_rate, ².estimator
autoplot(workflow_best_tuned)
select_best(workflow_best_tuned, "rmse")## # A tibble: 1 × 5
## mtry trees tree_depth learn_rate .config
## <int> <int> <int> <dbl> <chr>
## 1 28 1755 5 0.0312 Preprocessor1_Model017
Fit the final model
Now let’s test with the unseen test data
workflow_best_final <- finalize_workflow(workflow_best, select_best(workflow_best_tuned, "rmse"))
doParallel::registerDoParallel()
workflow_best_final_fit <- workflow_best_final %>%
last_fit(
split = sales_split
)## ! train/test split: preprocessor 1/1: Non-positive values in selected variable., Fewer than `num_unique` value...
doParallel::stopImplicitCluster()
workflow_best_final_fit## # Resampling results
## # Manual resampling
## # A tibble: 1 × 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>
## 1 <split [1314/146]> train/test split <tibble> <tibble> <tibble> <workflow>
##
## There were issues with some computations:
##
## - Warning(s) x1: Non-positive values in selected variable., Fewer than `num_unique...
##
## Run `show_notes(.Last.tune.result)` for more information.
workflow_best_final_fit %>%
collect_metrics()## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 24853. Preprocessor1_Model1
## 2 rsq standard 0.906 Preprocessor1_Model1
R2 of 0.91
Some visualizations of the selected winner model
fit_test <- workflow_best_final_fit %>%
collect_predictions()
fit_test## # A tibble: 146 × 5
## id .pred .row SalePrice .config
## <chr> <dbl> <int> <dbl> <chr>
## 1 train/test split 301319. 5 250000 Preprocessor1_Model1
## 2 train/test split 124986. 20 139000 Preprocessor1_Model1
## 3 train/test split 180091. 34 165500 Preprocessor1_Model1
## 4 train/test split 133426. 44 130250 Preprocessor1_Model1
## 5 train/test split 145016. 45 141000 Preprocessor1_Model1
## 6 train/test split 310751. 66 317000 Preprocessor1_Model1
## 7 train/test split 263161. 86 260000 Preprocessor1_Model1
## 8 train/test split 147370. 93 163500 Preprocessor1_Model1
## 9 train/test split 187506. 96 185000 Preprocessor1_Model1
## 10 train/test split 204675. 104 198900 Preprocessor1_Model1
## # … with 136 more rows
Plot our predictions vs actual unseen test data
fit_test %>%
ggplot(aes(x=SalePrice, y=.pred)) +
geom_abline(slope=1,intercept=0) +
geom_point(alpha=0.3)
Boxplot of predictions vs actual results
boxplot_data <- fit_test %>%
mutate(difference = .pred - SalePrice)
boxplot_data$difference %>%
boxplot(main = "Results of our Selected Model predicting",
xlab = "Difference of Predictions",
horizontal = FALSE)
3.10.10 Analysis of Regression Variables
Variable Importance
Here we use the vip library, which helps us identify the most important variables used in our selected model. This ranking of importance can help us explain the model. Below, the most important variables are ranked from the top.
library(vip)
extract_workflow(workflow_best_final_fit) %>%
extract_fit_parsnip() %>%
vip(geom = "col")3.10.11 Predict on NEW DATA and load to KAGGLE
KAGGLE username: juanfalck
my_predictions1 <- workflow_best_final_fit$.workflow[[1]] %>%
predict(test_df)
Save the data in a .CSV file to upload to KAGGLE.
my_predictions2 <- as_tibble(test_df$Id)
my_predictions2$pred <- my_predictions1$.pred
write.csv(my_predictions2,"JF_predictions.csv")
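Note that the Kaggle submission is normally expected to have columns named Id and SalePrice with no row names; a sketch of writing the file in that format (the exact column names are an assumption about the competition's required layout):
# Name the columns as the competition expects and drop the row names
my_submission <- tibble(Id = test_df$Id, SalePrice = my_predictions1$.pred)
write.csv(my_submission, "JF_predictions.csv", row.names = FALSE)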
THANK YOU!
It has been a great class!!!