1 Introduction

Greetings! I'm going to attempt to broadly replicate Rick Scavetta's analysis of the Boston House Prices dataset, while also using analysis methods I learned from Kaggle.

# Initializing packages
library(corrplot)
## corrplot 0.84 loaded
library(keras)
## Warning: package 'keras' was built under R version 3.5.2
library(caret)
## Warning: package 'caret' was built under R version 3.5.2
## Loading required package: lattice
## Loading required package: ggplot2
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------- tidyverse 1.2.1 --
## v tibble  1.4.2     v purrr   0.2.5
## v tidyr   0.8.1     v dplyr   0.7.6
## v readr   1.1.1     v stringr 1.3.1
## v tibble  1.4.2     v forcats 0.3.0
## -- Conflicts ---------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x purrr::lift()   masks caret::lift()
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.5.2
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(rpart)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 3.5.2
library(modelr)

# Loading the required data
Boston <- read.csv("https://raw.githubusercontent.com/Scavetta/conf_tensorflow_training_day1/master/1_Deep_Learning_Intro/data/boston_keras.csv")

1.1 Getting a feel for the dataset

summary(Boston)
##       CRIM                ZN             INDUS            CHAS        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       NOX               RM             AGE              DIS        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       RAD              TAX           PTRATIO            B         
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      LSTAT            MEDV      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Thankfully there aren't any NAs in the dataset, which means I do not need to clean it.
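
As a quick confirmation, the missing-value count can also be checked directly; a minimal check on the data frame loaded above:

# Total number of missing values across the whole data frame
sum(is.na(Boston))

# Or a per-column breakdown, if preferred
colSums(is.na(Boston))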

1.1.1 Description of variables

I took these from Rick Scavetta's description of the variables on his GitHub:

CRIM - Per capita Crime rate by town

ZN - Proportion of Residential land zoned for lots over 25k sq. ft

INDUS - Proportion of non-retail business acres per town

CHAS - Charles River dummy variable: 1 if the tract bounds the river, else 0

NOX - Nitric Oxides concentration (parts per 10 million)

RM - Average number of rooms per dwelling

AGE - Proportion of owner-occupied units built prior to 1940

DIS - Weighted distances to five Boston employment centers

RAD - Index of accessibility to radial highways

TAX - Full-value property tax rate per $10,000 (what is Massachusetts' property tax rate anyway?)

PTRATIO - Student-teacher ratio per town

B - \(1000 * (Bk - 0.63)^2\) where \(Bk\) is the proportion of blacks by town

LSTAT - % lower status of the population

MEDV - Median value of owner-occupied homes in $1000’s
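
To check that the column types line up with these descriptions (for instance, that CHAS really is stored as a 0/1 indicator), a quick structural look can be taken with glimpse() from the tidyverse loaded above:

# Column types and a few example values for each variable
glimpse(Boston)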

1.2 Checking the correlation

Assessing each predictor variable against the response variable

data.frame(r = cor(Boston[,-14], Boston$MEDV)) %>%
  rownames_to_column() %>%
  rename(variable = rowname) %>%
  arrange(r) %>%
  mutate(variable = as_factor(variable)) %>%
  ggplot(aes(r, variable)) + 
  geom_vline(xintercept = 0, col = "red", linetype = 2) +
  geom_point()+
  scale_x_continuous("r", limits = c(-1, 1), expand = c(0,0))

I'm going to implement another correlation diagram using methods I learned on Kaggle, with the aim of figuring out which numeric variables have a high correlation with MEDV.

numericVars <- which(sapply(Boston, is.numeric))
numericVarNames <- names(numericVars)
cat("There are", length(numericVars), "numeric variables")
## There are 14 numeric variables
Boston_numVar <- Boston[, numericVars]
cor_numVar <- cor(Boston_numVar, use = "pairwise.complete.obs")

#sort on decreasing correlations with MEDV
cor_sorted <- as.matrix(sort(cor_numVar[,"MEDV"], decreasing = TRUE))
# Select only high correlations
CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x) > 0.15)))
cor_numVar <- cor_numVar[CorHigh, CorHigh]

corrplot.mixed(cor_numVar, tl.col = "black", tl.pos = "lt")

We'll keep all the variables.

So, RM has the highest correlation with MEDV, at about 0.7. I would like to plot that relationship.

a <- ggplot(data = Boston, aes(x = RM, y = MEDV)) + geom_point(col = "blue")
a <- a + geom_smooth(method = "lm", se = FALSE, color = "black", aes(group = 1))
a <- a + labs(x = "AVERAGE NUMBER OF ROOMS PER DWELLING")
a <- a + labs(y = "MEDIAN VALUE")

print(a)
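
For reference, the exact pairwise correlation behind the roughly 0.7 figure can be pulled out directly:

# Correlation between average rooms per dwelling and median home value
cor(Boston$RM, Boston$MEDV)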

2 Creating a Test and Train set

# As per the Keras Boston Housing split: first 404 rows for training
index <- 1:404

training <- Boston[index,]
testing <- Boston[-index,]
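
The split above keeps the fixed row order used by the Keras version of the dataset. If a randomised split were preferred instead, caret (already loaded) offers createDataPartition(); a hedged sketch of that alternative, not what is used below:

# Alternative: random 80/20 split stratified on the outcome, via caret
set.seed(42)
idx <- createDataPartition(Boston$MEDV, p = 0.8, list = FALSE)
training_alt <- Boston[idx, ]
testing_alt  <- Boston[-idx, ]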

2.1 Inspecting the Test and Train sets

Just out of curiosity

summary(training)
##       CRIM                ZN             INDUS            CHAS        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08144   1st Qu.:  0.00   1st Qu.: 5.13   1st Qu.:0.00000  
##  Median : 0.26888   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.74511   Mean   : 11.48   Mean   :11.10   Mean   :0.06188  
##  3rd Qu.: 3.67481   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       NOX               RM             AGE              DIS        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4530   1st Qu.:5.875   1st Qu.: 45.48   1st Qu.: 2.077  
##  Median :0.5380   Median :6.199   Median : 78.50   Median : 3.142  
##  Mean   :0.5574   Mean   :6.267   Mean   : 69.01   Mean   : 3.740  
##  3rd Qu.:0.6310   3rd Qu.:6.609   3rd Qu.: 94.10   3rd Qu.: 5.118  
##  Max.   :0.8710   Max.   :8.725   Max.   :100.00   Max.   :10.710  
##       RAD              TAX           PTRATIO            B         
##  Min.   : 1.000   Min.   :188.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.23   1st Qu.:374.67  
##  Median : 5.000   Median :330.0   Median :19.10   Median :391.25  
##  Mean   : 9.441   Mean   :405.9   Mean   :18.48   Mean   :354.78  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.16  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      LSTAT            MEDV      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.89   1st Qu.:16.68  
##  Median :11.39   Median :20.75  
##  Mean   :12.74   Mean   :22.40  
##  3rd Qu.:17.09   3rd Qu.:24.80  
##  Max.   :37.97   Max.   :50.00
summary(testing)
##       CRIM                ZN            INDUS             CHAS        
##  Min.   : 0.01311   Min.   : 0.00   Min.   : 1.220   Min.   :0.00000  
##  1st Qu.: 0.08484   1st Qu.: 0.00   1st Qu.: 5.455   1st Qu.:0.00000  
##  Median : 0.22901   Median : 0.00   Median : 9.795   Median :0.00000  
##  Mean   : 3.09234   Mean   :10.90   Mean   :11.265   Mean   :0.09804  
##  3rd Qu.: 3.77944   3rd Qu.:16.25   3rd Qu.:18.100   3rd Qu.:0.00000  
##  Max.   :25.04610   Max.   :90.00   Max.   :27.740   Max.   :1.00000  
##       NOX               RM             AGE              DIS        
##  Min.   :0.3920   Min.   :4.880   Min.   :  6.00   Min.   : 1.466  
##  1st Qu.:0.4455   1st Qu.:5.966   1st Qu.: 42.45   1st Qu.: 2.117  
##  Median :0.5320   Median :6.229   Median : 73.75   Median : 3.325  
##  Mean   :0.5442   Mean   :6.354   Mean   : 66.85   Mean   : 4.012  
##  3rd Qu.:0.6090   3rd Qu.:6.634   3rd Qu.: 92.97   3rd Qu.: 5.277  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       RAD             TAX           PTRATIO            B         
##  Min.   : 1.00   Min.   :187.0   Min.   :13.00   Min.   : 24.65  
##  1st Qu.: 4.00   1st Qu.:279.2   1st Qu.:17.40   1st Qu.:377.69  
##  Median : 5.00   Median :330.0   Median :18.90   Median :392.11  
##  Mean   : 9.98   Mean   :417.5   Mean   :18.37   Mean   :364.16  
##  3rd Qu.:24.00   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.78  
##  Max.   :24.00   Max.   :711.0   Max.   :21.20   Max.   :396.90  
##      LSTAT             MEDV      
##  Min.   : 1.920   Min.   : 5.60  
##  1st Qu.: 7.305   1st Qu.:18.65  
##  Median :11.060   Median :21.95  
##  Mean   :12.305   Mean   :23.08  
##  3rd Qu.:15.915   3rd Qu.:27.07  
##  Max.   :31.990   Max.   :50.00

2.1.1 Z-score scaling

Z-score scaling converts all the predictors to a common scale with a mean of zero and a standard deviation of one.
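
Concretely, each predictor x becomes (x - mean(x)) / sd(x), which is what scale() computes; a minimal manual sketch for a single column, purely for illustration:

# Manual z-score for one predictor; should match scale(training$RM)
z_rm <- (training$RM - mean(training$RM)) / sd(training$RM)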

# Using dplyr to keep the data frame structure

training %>%
  mutate_at(vars(-MEDV), scale) -> training

testing %>%
  mutate_at(vars(-MEDV), scale) -> testing

2.2 GLM and LM

We will start with linear modelling, which assumes normally distributed errors.

fit_lm <- lm(MEDV~., data = training)
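
Before predicting, the fitted coefficients and overall fit could be inspected with summary(); an optional check:

# Coefficients, R-squared, and residual standard error of the linear model
summary(fit_lm)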

2.2.1 Conducting predictions on the test set.

pred_lm <- predict(fit_lm, newdata = testing)
print(pred_lm)
##         1         2         3         4         5         6         7 
##  5.616490 20.016104 20.563419 32.784647 25.635007 19.470491 29.257775 
##         8         9        10        11        12        13        14 
## 25.372916 18.783444 21.697507 18.913598 17.213187 14.792703 35.316987 
##        15        16        17        18        19        20        21 
## 16.961894 20.200898 24.807727 21.403860 18.747221 21.503880  9.062848 
##        22        23        24        25        26        27        28 
## 14.255007 21.770143 13.228636 22.922720 22.956482 32.270652 26.978844 
##        29        30        31        32        33        34        35 
## 10.341962 21.234822 22.748572 16.558519 35.902619 23.607041 16.920736 
##        36        37        38        39        40        41        42 
##  1.556541 11.950550 21.846205 15.011593 28.760403 23.443365 28.410403 
##        43        44        45        46        47        48        49 
## 15.568344 34.845202 30.906676 24.024280 30.914831 17.328732 21.299910 
##        50        51        52        53        54        55        56 
## 23.848394 32.513849 18.868407  7.446561 12.546256 35.270463 27.562223 
##        57        58        59        60        61        62        63 
## 15.251091 40.162409 37.261394 24.673126 24.578983 18.493035 17.352194 
##        64        65        66        67        68        69        70 
## 20.339537 24.503771 25.432265 15.205553 27.969090  1.692324  7.843751 
##        71        72        73        74        75        76        77 
## 21.991794 25.184413 22.643307  6.647681 29.144337 21.253034 20.921375 
##        78        79        80        81        82        83        84 
## 24.448553 34.541919  3.382762 24.004089 36.547875 16.645596 15.801178 
##        85        86        87        88        89        90        91 
## 19.891605 18.787292 20.114245 24.412498 22.408526 28.237361 17.160221 
##        92        93        94        95        96        97        98 
## 18.008340 27.402388 29.889201 35.475481 18.572919 35.917250 36.959306 
##        99       100       101       102 
## 25.161326 40.576035 33.406423 24.255649
# Mean absolute error of the linear-model predictions on the test set
MAE_lm <- sum(abs(pred_lm - testing$MEDV)) / nrow(testing)
print(MAE_lm)
## [1] 3.316274
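
The same figure can be obtained with modelr's mae() helper (loaded above), which avoids hard-coding the test-set size; this should agree with the manual calculation:

# Equivalent mean absolute error via modelr
mae(model = fit_lm, data = testing)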

2.3 Random Forests

fit_rf <- randomForest(MEDV~., data = training)

2.3.1 Conducting predictions on the test set

pred_rf <- predict(fit_rf, testing)
print(pred_rf)
# Mean absolute error of the random-forest predictions on the test set
MAE_rf <- sum(abs(pred_rf - testing$MEDV)) / nrow(testing)
print(MAE_rf)
## [1] 2.493911
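
The fitted forest also carries a variable-importance measure, which can hint at which predictors are driving the lower error; an optional look using randomForest's built-in helpers:

# Variable importance from the random forest
importance(fit_rf)
varImpPlot(fit_rf, main = "Variable importance")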

2.4 RPart

fit_rpart <- rpart(MEDV ~ ., data = training)

pred_rpart <- predict(fit_rpart, testing)
print(pred_rpart)
##        1        2        3        4        5        6        7        8 
## 11.89828 17.07931 21.54333 21.54333 21.54333 21.54333 26.69143 21.54333 
##        9       10       11       12       13       14       15       16 
## 17.07931 17.07931 21.54333 21.54333 17.07931 45.58261 11.89828 21.54333 
##       17       18       19       20       21       22       23       24 
## 21.54333 21.54333 11.89828 21.54333 11.89828 11.89828 21.54333 17.07931 
##       25       26       27       28       29       30       31       32 
## 21.54333 21.54333 26.69143 26.69143 11.89828 21.54333 21.54333 11.89828 
##       33       34       35       36       37       38       39       40 
## 26.69143 21.54333 11.89828 11.89828 17.07931 21.54333 21.54333 26.69143 
##       41       42       43       44       45       46       47       48 
## 28.50435 26.69143 11.89828 45.58261 21.54333 21.54333 26.69143 17.07931 
##       49       50       51       52       53       54       55       56 
## 21.54333 21.54333 35.04286 17.07931 11.89828 17.07931 35.04286 26.69143 
##       57       58       59       60       61       62       63       64 
## 11.89828 45.58261 26.69143 21.54333 28.50435 11.89828 11.89828 21.54333 
##       65       66       67       68       69       70       71       72 
## 21.54333 21.54333 11.89828 21.54333 17.07931 11.89828 21.54333 35.04286 
##       73       74       75       76       77       78       79       80 
## 21.54333 17.07931 21.54333 21.54333 21.54333 21.54333 35.04286 11.89828 
##       81       82       83       84       85       86       87       88 
## 21.54333 45.58261 21.54333 11.89828 21.54333 17.07931 21.54333 17.07931 
##       89       90       91       92       93       94       95       96 
## 21.54333 28.50435 17.07931 21.54333 28.50435 45.58261 26.69143 17.07931 
##       97       98       99      100      101      102 
## 28.50435 45.58261 21.54333 45.58261 28.50435 21.54333
MAE_rpart <- mae(model = fit_rpart, data = testing)
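
For consistency with the earlier models, the same value can also be computed and printed manually; this should match the modelr call above:

# Manual MAE for the rpart predictions, same definition as before
mean(abs(pred_rpart - testing$MEDV))
print(MAE_rpart)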

2.4.1 Model improvement (Overfitting and underfitting)

Let us plot the rpart decision tree first.

plot(fit_rpart, uniform = TRUE)
text(fit_rpart, cex = 0.35)

I would like to use a loop to cut the decision tree down to a more viable size and, ideally, reduce the average error.

We will loop over increasing values of maxdepth:

# Fit an rpart tree at a given maximum depth and return its MAE on the test data
get_mae <- function(maxdepth, target, predictors, training_data, testing_data){
  predictors <- paste(predictors, collapse = "+")
  formula <- as.formula(paste(target, "~", predictors, sep = ""))
  model <- rpart(formula, data = training_data, control = rpart.control(maxdepth = maxdepth))
  mae(model, testing_data)
}
target <- "MEDV"
predictors <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT")

for(i in 1:10){
  mae <- get_mae(maxdepth = i, target = target, predictors = predictors, training_data = training, testing_data = testing)
  print(glue::glue("Maxdepth:", i, "\t MAE:", mae ))
}
## Maxdepth:1    MAE:5.595210454969
## Maxdepth:2    MAE:4.1922597524767
## Maxdepth:3    MAE:3.8619657261989
## Maxdepth:4    MAE:3.50910833098857
## Maxdepth:5    MAE:3.50910833098857
## Maxdepth:6    MAE:3.50910833098857
## Maxdepth:7    MAE:3.50910833098857
## Maxdepth:8    MAE:3.50910833098857
## Maxdepth:9    MAE:3.50910833098857
## Maxdepth:10   MAE:3.50910833098857
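
An alternative to capping maxdepth is rpart's complexity-parameter (cp) table: grow the tree, then prune back to the cp value with the lowest cross-validated error. A hedged sketch of that approach, not part of the original analysis:

# Prune the earlier tree at the cp with the lowest cross-validated error
printcp(fit_rpart)
best_cp <- fit_rpart$cptable[which.min(fit_rpart$cptable[, "xerror"]), "CP"]
fit_rpart_pruned <- prune(fit_rpart, cp = best_cp)
mae(model = fit_rpart_pruned, data = testing)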

3 Visualizing Output

data.frame(Actual = testing$MEDV,
           `GLM` = pred_lm,
           `Random Forest` = pred_rf,
           `RPart` = pred_rpart) %>% 
  gather(Measure, Prediction, -Actual) %>% 
  ggplot(aes(Actual, Prediction)) +
  geom_point(shape = 16, alpha = 0.65) +
  geom_abline(slope = 1, intercept = 0, col = "dark red") +
  coord_fixed() +
  facet_grid(. ~ Measure) +
  theme_classic() +
  theme(axis.text = element_text(colour = "black"),
        strip.background = element_rect(colour = NA, fill = "gray92"))
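
Finally, the three error figures can be gathered into one small comparison table; a sketch using kableExtra (loaded at the top but otherwise unused), with MAE_rpart coming from the modelr call above:

# Side-by-side MAE comparison of the three models
data.frame(Model = c("Linear model", "Random forest", "RPart"),
           MAE = c(MAE_lm, MAE_rf, MAE_rpart)) %>%
  knitr::kable(digits = 3) %>%
  kable_styling(full_width = FALSE)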