Greetings! I’m going to attempt to replicate Rick Scavetta’s analysis of the Boston House Prices dataset, while also using methods of analysis I learned from Kaggle.
#Initializing Packages
library(corrplot)
## corrplot 0.84 loaded
library(keras)
## Warning: package 'keras' was built under R version 3.5.2
library(caret)
## Warning: package 'caret' was built under R version 3.5.2
## Loading required package: lattice
## Loading required package: ggplot2
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------- tidyverse 1.2.1 --
## v tibble 1.4.2 v purrr 0.2.5
## v tidyr 0.8.1 v dplyr 0.7.6
## v readr 1.1.1 v stringr 1.3.1
## v tibble 1.4.2 v forcats 0.3.0
## -- Conflicts ---------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.5.2
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(rpart)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 3.5.2
library(modelr)
# Loading the required data
Boston <- read.csv("https://raw.githubusercontent.com/Scavetta/conf_tensorflow_training_day1/master/1_Deep_Learning_Intro/data/boston_keras.csv")
summary(Boston)
## CRIM ZN INDUS CHAS
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## NOX RM AGE DIS
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## RAD TAX PTRATIO B
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## LSTAT MEDV
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Thankfully there aren’t any NAs in the dataset, which means I don’t need to clean it.
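As a quick sanity check (a minimal sketch, not part of the original output), the absence of missing values can be confirmed directly:
# Count the missing values in each column; every entry should be 0
colSums(is.na(Boston))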
I took these from Rick Scavetta’s description of the variables on his GitHub. These are the variables:
CRIM - Per capita Crime rate by town
ZN - Proportion of Residential land zoned for lots over 25k sq. ft
INDUS - Proportion of non-retail business acres per town
CHAS - Charles River dummy variable: 1 if the tract bounds the river, else 0
NOX - Nitric Oxides concentration (parts per 10 million)
RM - Average number of rooms per dwelling
AGE - Proportion of owner-occupied units built prior to 1940
DIS - Weighted distances to five Boston employment centers
RAD - Index of accessibility to radial highways
TAX - Full-value property tax rate per $10,000 (what is Massachusetts’ property tax rate anyway?)
PTRATIO - Student-teacher ratio per town
B - \(1000 * (Bk - 0.63)^2\) where \(Bk\) is the proportion of blacks by town
LSTAT - % lower status of the population
MEDV - Median value of owner-occupied homes in $1000’s
Assessing each predictor variable against the response variable
data.frame(r = cor(Boston[,-14], Boston$MEDV)) %>%
rownames_to_column() %>%
rename(variable = rowname) %>%
arrange(r) %>%
mutate(variable = as_factor(variable)) %>%
ggplot(aes(r, variable)) +
geom_vline(xintercept = 0, col = "red", linetype = 2) +
geom_point()+
scale_x_continuous("r", limits = c(-1, 1), expand = c(0,0))
I’m going to implement another correlation diagram using methods I learned at Kaggle, to figure out which numeric variables have a high correlation with MEDV.
numericVars <- which(sapply(Boston, is.numeric))
numericVarNames <- names(numericVars)
cat("There are", length(numericVars), "numeric variables")## There are 14 numeric variables
Boston_numVar <- Boston[, numericVars]
cor_numVar <- cor(Boston_numVar, use = "pairwise.complete.obs")
#sort on decreasing correlations with MEDV
cor_sorted <- as.matrix(sort(cor_numVar[,"MEDV"], decreasing = TRUE))
# Select only high correlations
CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x) > 0.15)))
cor_numVar <- cor_numVar[CorHigh, CorHigh]
corrplot.mixed(cor_numVar, tl.col = "black", tl.pos = "lt")
With a threshold of 0.15 we end up keeping all the variables.
So, RM has the highest correlation with MEDV, at about 0.7. Let’s plot that relationship.
a <- ggplot(data = Boston, aes(x = RM, y = MEDV)) + geom_point(col = "blue")
a <- a + geom_smooth(method = "lm", se = FALSE, color = "black", aes(group = 1))
a <- a + labs(x = "AVERAGE NUMBER OF ROOMS PER DWELLING")
a <- a + labs(y = "MEDIAN VALUE")
print(a)
# Train/test split as per the Keras dataset (the first 404 rows form the training set)
index <- 1:404
training <- Boston[index,]
testing <- Boston[-index,]
Just out of curiosity:
summary(training)
## CRIM ZN INDUS CHAS
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08144 1st Qu.: 0.00 1st Qu.: 5.13 1st Qu.:0.00000
## Median : 0.26888 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.74511 Mean : 11.48 Mean :11.10 Mean :0.06188
## 3rd Qu.: 3.67481 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## NOX RM AGE DIS
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4530 1st Qu.:5.875 1st Qu.: 45.48 1st Qu.: 2.077
## Median :0.5380 Median :6.199 Median : 78.50 Median : 3.142
## Mean :0.5574 Mean :6.267 Mean : 69.01 Mean : 3.740
## 3rd Qu.:0.6310 3rd Qu.:6.609 3rd Qu.: 94.10 3rd Qu.: 5.118
## Max. :0.8710 Max. :8.725 Max. :100.00 Max. :10.710
## RAD TAX PTRATIO B
## Min. : 1.000 Min. :188.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.23 1st Qu.:374.67
## Median : 5.000 Median :330.0 Median :19.10 Median :391.25
## Mean : 9.441 Mean :405.9 Mean :18.48 Mean :354.78
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.16
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## LSTAT MEDV
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.89 1st Qu.:16.68
## Median :11.39 Median :20.75
## Mean :12.74 Mean :22.40
## 3rd Qu.:17.09 3rd Qu.:24.80
## Max. :37.97 Max. :50.00
summary(testing)
## CRIM ZN INDUS CHAS
## Min. : 0.01311 Min. : 0.00 Min. : 1.220 Min. :0.00000
## 1st Qu.: 0.08484 1st Qu.: 0.00 1st Qu.: 5.455 1st Qu.:0.00000
## Median : 0.22901 Median : 0.00 Median : 9.795 Median :0.00000
## Mean : 3.09234 Mean :10.90 Mean :11.265 Mean :0.09804
## 3rd Qu.: 3.77944 3rd Qu.:16.25 3rd Qu.:18.100 3rd Qu.:0.00000
## Max. :25.04610 Max. :90.00 Max. :27.740 Max. :1.00000
## NOX RM AGE DIS
## Min. :0.3920 Min. :4.880 Min. : 6.00 Min. : 1.466
## 1st Qu.:0.4455 1st Qu.:5.966 1st Qu.: 42.45 1st Qu.: 2.117
## Median :0.5320 Median :6.229 Median : 73.75 Median : 3.325
## Mean :0.5442 Mean :6.354 Mean : 66.85 Mean : 4.012
## 3rd Qu.:0.6090 3rd Qu.:6.634 3rd Qu.: 92.97 3rd Qu.: 5.277
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## RAD TAX PTRATIO B
## Min. : 1.00 Min. :187.0 Min. :13.00 Min. : 24.65
## 1st Qu.: 4.00 1st Qu.:279.2 1st Qu.:17.40 1st Qu.:377.69
## Median : 5.00 Median :330.0 Median :18.90 Median :392.11
## Mean : 9.98 Mean :417.5 Mean :18.37 Mean :364.16
## 3rd Qu.:24.00 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.78
## Max. :24.00 Max. :711.0 Max. :21.20 Max. :396.90
## LSTAT MEDV
## Min. : 1.920 Min. : 5.60
## 1st Qu.: 7.305 1st Qu.:18.65
## Median :11.060 Median :21.95
## Mean :12.305 Mean :23.08
## 3rd Qu.:15.915 3rd Qu.:27.07
## Max. :31.990 Max. :50.00
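Note that this split simply takes the first 404 rows, mirroring the split used by the Keras dataset. If a random split were preferred instead, caret (already loaded) provides createDataPartition; a minimal sketch, with hypothetical object names:
# Hypothetical alternative: an 80/20 random split, stratified on MEDV
set.seed(42)
train_idx <- createDataPartition(Boston$MEDV, p = 0.8, list = FALSE)
training_alt <- Boston[train_idx, ]
testing_alt <- Boston[-train_idx, ]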
Z-score scaling converts all the predictors to a common scale with a mean of zero and a standard deviation of one.
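In other words, each value \(x\) in a column is replaced by \(z = (x - \bar{x}) / s\), where \(\bar{x}\) and \(s\) are that column’s mean and standard deviation. A minimal sketch of what scale() does for a single column (the z_score helper is purely illustrative):
# Manual z-score for one column; equivalent to as.numeric(scale(x))
z_score <- function(x) (x - mean(x)) / sd(x)
all.equal(z_score(training$CRIM), as.numeric(scale(training$CRIM)))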
#using dplyr to keep the data frame structure
training %>%
mutate_at(vars(-MEDV), scale) -> training
testing %>%
mutate_at(vars(-MEDV), scale) -> testing
We will start with a linear model, which assumes normally distributed errors.
fit_lm <- lm(MEDV~., data = training)
pred_lm <- predict(fit_lm, newdata = testing)
print(pred_lm)
## 1 2 3 4 5 6 7
## 5.616490 20.016104 20.563419 32.784647 25.635007 19.470491 29.257775
## 8 9 10 11 12 13 14
## 25.372916 18.783444 21.697507 18.913598 17.213187 14.792703 35.316987
## 15 16 17 18 19 20 21
## 16.961894 20.200898 24.807727 21.403860 18.747221 21.503880 9.062848
## 22 23 24 25 26 27 28
## 14.255007 21.770143 13.228636 22.922720 22.956482 32.270652 26.978844
## 29 30 31 32 33 34 35
## 10.341962 21.234822 22.748572 16.558519 35.902619 23.607041 16.920736
## 36 37 38 39 40 41 42
## 1.556541 11.950550 21.846205 15.011593 28.760403 23.443365 28.410403
## 43 44 45 46 47 48 49
## 15.568344 34.845202 30.906676 24.024280 30.914831 17.328732 21.299910
## 50 51 52 53 54 55 56
## 23.848394 32.513849 18.868407 7.446561 12.546256 35.270463 27.562223
## 57 58 59 60 61 62 63
## 15.251091 40.162409 37.261394 24.673126 24.578983 18.493035 17.352194
## 64 65 66 67 68 69 70
## 20.339537 24.503771 25.432265 15.205553 27.969090 1.692324 7.843751
## 71 72 73 74 75 76 77
## 21.991794 25.184413 22.643307 6.647681 29.144337 21.253034 20.921375
## 78 79 80 81 82 83 84
## 24.448553 34.541919 3.382762 24.004089 36.547875 16.645596 15.801178
## 85 86 87 88 89 90 91
## 19.891605 18.787292 20.114245 24.412498 22.408526 28.237361 17.160221
## 92 93 94 95 96 97 98
## 18.008340 27.402388 29.889201 35.475481 18.572919 35.917250 36.959306
## 99 100 101 102
## 25.161326 40.576035 33.406423 24.255649
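Since the linear model leans on that normality assumption, a quick residual diagnostic can be run as well (a sketch, not part of the original output):
# Q-Q plot of the training residuals; points close to the line suggest approximate normality
qqnorm(residuals(fit_lm))
qqline(residuals(fit_lm))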
# Mean absolute error over the 102 test observations
MAE_lm <- sum(abs(pred_lm - testing$MEDV))/102
print(MAE_lm)
## [1] 3.316274
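For reference, the mean absolute error is \(\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |\hat{y}_i - y_i|\). A small hypothetical helper (not used below, purely for illustration) computes the same value:
# Hypothetical helper: mean absolute error between predictions and observations
mae_vec <- function(pred, obs) mean(abs(pred - obs))
mae_vec(pred_lm, testing$MEDV)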
fit_rf <- randomForest(MEDV~., data = training)
pred_rf <- predict(fit_rf, testing)
print(pred_rf)
# Mean absolute error for the random forest predictions
MAE_rf <- sum(abs(pred_rf - testing$MEDV))/102
print(MAE_rf)
## [1] 2.493911
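The random forest also lets us see which predictors it relied on most (a sketch; varImpPlot ships with the randomForest package):
# Plot variable importance from the fitted random forest
varImpPlot(fit_rf, main = "Random forest variable importance")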
fit_rpart <- rpart(MEDV~.,training)
pred_rpart <- predict(fit_rpart, testing)
print(pred_rpart)
## 1 2 3 4 5 6 7 8
## 11.89828 17.07931 21.54333 21.54333 21.54333 21.54333 26.69143 21.54333
## 9 10 11 12 13 14 15 16
## 17.07931 17.07931 21.54333 21.54333 17.07931 45.58261 11.89828 21.54333
## 17 18 19 20 21 22 23 24
## 21.54333 21.54333 11.89828 21.54333 11.89828 11.89828 21.54333 17.07931
## 25 26 27 28 29 30 31 32
## 21.54333 21.54333 26.69143 26.69143 11.89828 21.54333 21.54333 11.89828
## 33 34 35 36 37 38 39 40
## 26.69143 21.54333 11.89828 11.89828 17.07931 21.54333 21.54333 26.69143
## 41 42 43 44 45 46 47 48
## 28.50435 26.69143 11.89828 45.58261 21.54333 21.54333 26.69143 17.07931
## 49 50 51 52 53 54 55 56
## 21.54333 21.54333 35.04286 17.07931 11.89828 17.07931 35.04286 26.69143
## 57 58 59 60 61 62 63 64
## 11.89828 45.58261 26.69143 21.54333 28.50435 11.89828 11.89828 21.54333
## 65 66 67 68 69 70 71 72
## 21.54333 21.54333 11.89828 21.54333 17.07931 11.89828 21.54333 35.04286
## 73 74 75 76 77 78 79 80
## 21.54333 17.07931 21.54333 21.54333 21.54333 21.54333 35.04286 11.89828
## 81 82 83 84 85 86 87 88
## 21.54333 45.58261 21.54333 11.89828 21.54333 17.07931 21.54333 17.07931
## 89 90 91 92 93 94 95 96
## 21.54333 28.50435 17.07931 21.54333 28.50435 45.58261 26.69143 17.07931
## 97 98 99 100 101 102
## 28.50435 45.58261 21.54333 45.58261 28.50435 21.54333
# modelr::mae() computes a model's mean absolute error on new data
MAE_rpart <- mae(model = fit_rpart, testing)
Let’s graph the rpart decision tree first.
plot(fit_rpart, uniform = TRUE)
text(fit_rpart, cex = 0.35)
I would like to use a loop to cut the decision tree down to a slightly more manageable size and see whether that reduces the average error.
That means looping over the maximum tree depth.
# Fit an rpart model with the given maxdepth and return its MAE on the test data
get_mae <- function(maxdepth, target, predictors, training_data, testing_data){
# Build a formula of the form MEDV~CRIM+ZN+... from the target and predictor names
predictors <- paste(predictors, collapse = "+")
formula <- as.formula(paste(target, "~", predictors, sep = ""))
model <- rpart(formula, data = training_data, control = rpart.control(maxdepth = maxdepth))
mae <- mae(model, testing_data)
return(mae)
}
target <- "MEDV"
predictors <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT")
for(i in 1:10){
mae <- get_mae(maxdepth = i, target = target, predictors = predictors, training_data = training, testing_data = testing)
print(glue::glue("Maxdepth:", i, "\t MAE:", mae ))
}
## Maxdepth:1 MAE:5.595210454969
## Maxdepth:2 MAE:4.1922597524767
## Maxdepth:3 MAE:3.8619657261989
## Maxdepth:4 MAE:3.50910833098857
## Maxdepth:5 MAE:3.50910833098857
## Maxdepth:6 MAE:3.50910833098857
## Maxdepth:7 MAE:3.50910833098857
## Maxdepth:8 MAE:3.50910833098857
## Maxdepth:9 MAE:3.50910833098857
## Maxdepth:10 MAE:3.50910833098857
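The MAE stops improving after a maxdepth of 4, so a tree of that depth gives the same test error as the deeper ones. Refitting at that depth is straightforward (a sketch using the same training/testing split):
# Refit the tree at the smallest depth that reaches the plateau
fit_rpart_d4 <- rpart(MEDV ~ ., data = training, control = rpart.control(maxdepth = 4))
mae(fit_rpart_d4, testing)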
data.frame(Actual = testing$MEDV,
`GLM` = pred_lm,
`Random Forest` = pred_rf,
`RPart` = pred_rpart) %>%
gather(Measure, Prediction, -Actual) %>%
ggplot(aes(Actual, Prediction)) +
geom_point(shape = 16, alpha = 0.65) +
geom_abline(slope = 1, intercept = 0, col = "dark red") +
coord_fixed() +
facet_grid(. ~ Measure) +
theme_classic() +
theme(axis.text = element_text(colour = "black"),
strip.background = element_rect(colour = NA, fill = "gray92"))
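To wrap up, the three test-set MAEs computed above can be put side by side (a sketch using the objects already in the workspace):
# Summarise the mean absolute error of each model
tibble(Model = c("Linear model", "Random forest", "rpart"),
       MAE = c(MAE_lm, MAE_rf, MAE_rpart))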