The goal of this second assignment is to train two distinct decision tree models and a random forest model and to compare their bias and variance. I will conclude the analysis with a summary of my results and insights. Note that I will continue to use the Boston Housing data set, which is accessible in R through the MASS and mlbench packages, as well as online.
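For reference, the same data set ships with the MASS package; a minimal alternative import (a sketch only, not executed in this report) would be:
# Alternative import (sketch only): use the copy bundled with the MASS package
Boston_alt <- MASS::Boston  # 506 x 14; the column called `b` below is named `black` there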
# Import the Boston Housing data from a GitHub repository
BostonHousing <- read_csv("https://raw.githubusercontent.com/Heleinef/Data-Science-Master_Heleine/refs/heads/main/BostonHousing.csv")
## Rows: 506 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (14): crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, ls...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View the first 3 rows of the dataset
head(BostonHousing, n = 3)
## # A tibble: 3 × 14
## crim zn indus chas nox rm age dis rad tax ptratio b
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.00632 18 2.31 0 0.538 6.58 65.2 4.09 1 296 15.3 397.
## 2 0.0273 0 7.07 0 0.469 6.42 78.9 4.97 2 242 17.8 397.
## 3 0.0273 0 7.07 0 0.469 7.18 61.1 4.97 2 242 17.8 393.
## # ℹ 2 more variables: lstat <dbl>, medv <dbl>
# View structure of BostonHousing
glimpse(BostonHousing)
## Rows: 506
## Columns: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad <dbl> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ b <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
# Check dimensions of the dataset
dim(BostonHousing)
## [1] 506 14
# Display column names of BostonHousing
names(BostonHousing)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "b" "lstat" "medv"
### Data Preparation: Handle Missing and Duplicated Values
# Remove rows with missing values
BostonHousing <- BostonHousing[complete.cases(BostonHousing),]
# Ensure no missing values remain
sum(is.na(BostonHousing))
## [1] 0
# Check for duplicated values
sum(duplicated(BostonHousing))
## [1] 0
# Display summary statistics
summary(BostonHousing)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio b
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# Calculate mean and median for each numeric variable
numeric_columns <- BostonHousing %>% select_if(is.numeric)
mean_values <- colMeans(numeric_columns, na.rm = TRUE)
median_values <- apply(numeric_columns, 2, median, na.rm = TRUE)
# Create a table with descriptive statistics
variables_table <- tibble(
Variable = names(numeric_columns),
Mean = mean_values,
Median = median_values
)
# Display the table
kable(variables_table, format = "markdown", caption = "Mean and Median of Variables in BostonHousing")
| Variable | Mean | Median |
|---|---|---|
| crim | 3.6135236 | 0.25651 |
| zn | 11.3636364 | 0.00000 |
| indus | 11.1367787 | 9.69000 |
| chas | 0.0691700 | 0.00000 |
| nox | 0.5546951 | 0.53800 |
| rm | 6.2846344 | 6.20850 |
| age | 68.5749012 | 77.50000 |
| dis | 3.7950427 | 3.20745 |
| rad | 9.5494071 | 5.00000 |
| tax | 408.2371542 | 330.00000 |
| ptratio | 18.4555336 | 19.05000 |
| b | 356.6740316 | 391.44000 |
| lstat | 12.6530632 | 11.36000 |
| medv | 22.5328063 | 21.20000 |
# Visualize distribution of 'medv'
ggplot(BostonHousing, aes(x = medv)) +
geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
labs(title = "Distribution of Median Value (medv)", x = "Median Value (medv)", y = "Frequency") +
theme_minimal()
# Select numeric columns excluding 'medv'
BostonHousing_long <- BostonHousing %>%
  dplyr::select(-medv) %>%
pivot_longer(cols = everything(), names_to = "feature", values_to = "value")
# Boxplot for each numeric feature
ggplot(BostonHousing_long, aes(x = feature, y = value)) +
geom_boxplot(fill = "skyblue", color = "black") +
labs(title = "Boxplots of Numeric Features in BostonHousing Dataset", x = "Features", y = "Value") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Select features for scatter plots
important_features <- c("crim", "rm", "zn", "lstat", "age", "dis")
# Create scatter plots of each feature against `medv`
for (feature in important_features) {
p <- ggplot(BostonHousing, aes(x = .data[[feature]], y = medv)) +
geom_point(color = "steelblue") +
labs(title = paste("Scatterplot of", feature, "vs. medv"), x = feature, y = "Median Value (medv)") +
theme_minimal()
print(p)
}
# Pairwise scatterplot matrix for selected features
selected_columns <- c("medv", "rm", "lstat", "dis", "zn", "crim")
subset_data <- BostonHousing %>% dplyr::select(all_of(selected_columns))
ggpairs(subset_data,
title = "Pairwise Plots of Selected Features",
lower = list(continuous = "points"),
upper = list(continuous = "cor"),
diag = list(continuous = "densityDiag")) +
theme_minimal()
library(corrplot)
numeric_data <- BostonHousing[sapply(BostonHousing, is.numeric)]
# Calculate the correlation matrix
cor_matrix <- cor(numeric_data, use = "complete.obs")
# Plot the correlation matrix using an explicit package call
corrplot::corrplot(cor_matrix, method = "color",
addCoef.col = "black", # Add correlation coefficients
tl.col = "black", # Text label color
tl.srt = 45, # Text label rotation
mar = c(1, 1, 2, 1)) # Adjust margins
# Add a title
title("Correlation Matrix of Boston Housing Data", line = 2.5, cex.main = 1.2)
# Heatmap
# Load necessary libraries
library(ggplot2)
library(viridis) # For color scales
library(reshape2) # For the melt function
# Select numeric columns from the dataset
numeric_columns <- BostonHousing[sapply(BostonHousing, is.numeric)]
# Calculate the correlation matrix
cor_matrix <- cor(numeric_columns)
# Melt the correlation matrix into long format
cor_melted <- melt(cor_matrix)
# Create the heatmap
ggplot(data = cor_melted, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_viridis(name = "Correlation", option = "C") +
labs(title = "Correlation Heatmap of Boston Housing Data",
x = "Features",
y = "Features") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Data Splitting
set.seed(25) # For reproducibility
training_indices <- sample(1:nrow(BostonHousing), size = 0.7 * nrow(BostonHousing))
training_data <- BostonHousing[training_indices, ]
testing_data <- BostonHousing[-training_indices, ]
# Build the first decision tree model (note: `.` already includes lstat, so this formula is equivalent to medv ~ .)
tree_model_lstat <- rpart(medv ~ lstat + ., data = training_data)
# Plot the decision tree
png(file = "decision_tree_lstat.png")
rpart.plot(tree_model_lstat, main = "Decision Tree for Median Value of Homes (medv) Including LSTAT")
dev.off()
## quartz_off_screen
## 2
# Print the model summary
print(tree_model_lstat)
## n= 354
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 354 32061.4300 22.16017
## 2) rm< 6.941 302 12763.3700 19.40364
## 4) lstat>=14.395 127 2657.0090 14.51575
## 8) nox>=0.607 83 1190.8810 12.57831
## 16) lstat>=19.645 46 417.3741 10.35435 *
## 17) lstat< 19.645 37 263.1308 15.34324 *
## 9) nox< 0.607 44 566.8716 18.17045 *
## 5) lstat< 14.395 175 4870.1570 22.95086
## 10) lstat>=9.545 89 517.1056 20.52472 *
## 11) lstat< 9.545 86 3287.0430 25.46163
## 22) age< 85.4 79 1313.5910 24.43165
## 44) rm< 6.5445 51 610.1016 22.63922 *
## 45) rm>=6.5445 28 241.1896 27.69643 *
## 23) age>=85.4 7 943.8086 37.08571 *
## 3) rm>=6.941 52 3676.2510 38.16923
## 6) rm< 7.445 31 1030.7990 33.50645 *
## 7) rm>=7.445 21 976.5324 45.05238 *
# Predict on the testing data
predictions_lstat <- predict(tree_model_lstat, newdata = testing_data)
# Display the first few predictions
head(predictions_lstat)
## 1 2 3 4 5 6
## 18.17045 18.17045 22.63922 18.17045 18.17045 18.17045
# Calculate and print model performance metrics
mse_lstat <- mean((predictions_lstat - testing_data$medv)^2)
cat("Mean Squared Error (MSE) with lstat as predictor:", mse_lstat, "\n")
## Mean Squared Error (MSE) with lstat as predictor: 19.42184
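As a hedged follow-up (not run in this report), the tree's complexity-parameter table shows how the cross-validated error changes with tree size, and pruning at the cp value with the lowest xerror trades a little bias for lower variance:
# Sketch: prune the tree at the cp value with the lowest cross-validated error
printcp(tree_model_lstat)
best_cp <- tree_model_lstat$cptable[which.min(tree_model_lstat$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(tree_model_lstat, cp = best_cp)
mean((predict(pruned_tree, newdata = testing_data) - testing_data$medv)^2)  # pruned test MSE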
set.seed(25) # For reproducibility
training_indices <- sample(1:nrow(BostonHousing), size = 0.7 * nrow(BostonHousing))
training_data <- BostonHousing[training_indices, ]
testing_data <- BostonHousing[-training_indices, ]
# Build the second decision tree model, explicitly naming age as a predictor (again, `.` already includes age, so this is equivalent to medv ~ . and yields the same tree as before)
tree_model_stat <- rpart(medv ~ age + ., data = training_data)
# Plot the decision tree
png(file = "decision_tree_stat.png")
rpart.plot(tree_model_stat, main = "Decision Tree for Median Value of Homes (medv) Including Age")
dev.off()
## quartz_off_screen
## 2
# Print the model summary
print(tree_model_stat)
## n= 354
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 354 32061.4300 22.16017
## 2) rm< 6.941 302 12763.3700 19.40364
## 4) lstat>=14.395 127 2657.0090 14.51575
## 8) nox>=0.607 83 1190.8810 12.57831
## 16) lstat>=19.645 46 417.3741 10.35435 *
## 17) lstat< 19.645 37 263.1308 15.34324 *
## 9) nox< 0.607 44 566.8716 18.17045 *
## 5) lstat< 14.395 175 4870.1570 22.95086
## 10) lstat>=9.545 89 517.1056 20.52472 *
## 11) lstat< 9.545 86 3287.0430 25.46163
## 22) age< 85.4 79 1313.5910 24.43165
## 44) rm< 6.5445 51 610.1016 22.63922 *
## 45) rm>=6.5445 28 241.1896 27.69643 *
## 23) age>=85.4 7 943.8086 37.08571 *
## 3) rm>=6.941 52 3676.2510 38.16923
## 6) rm< 7.445 31 1030.7990 33.50645 *
## 7) rm>=7.445 21 976.5324 45.05238 *
# Predict on the testing data
predictions_stat <- predict(tree_model_stat, newdata = testing_data)
# Display the first few predictions
head(predictions_stat)
## 1 2 3 4 5 6
## 18.17045 18.17045 22.63922 18.17045 18.17045 18.17045
# Calculate and print model performance metrics
mse_stat <- mean((predictions_stat - testing_data$medv)^2)
cat("Mean Squared Error (MSE) with age as predictor:", mse_stat, "\n")
## Mean Squared Error (MSE) with age as predictor: 19.42184
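Because both formulas expand to the same full predictor set, the two trees are in fact the same model, which is why the printed trees and the MSE values match exactly. A quick sanity check (a sketch, not run in this report) makes this explicit:
# Sanity check: the two formulas expand to the same predictors, so the fitted trees should match
all.equal(predictions_lstat, predictions_stat)            # expected TRUE
identical(tree_model_lstat$frame, tree_model_stat$frame)  # identical split structure expected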
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
# Create training and testing datasets (70/30 split) for the random forest
set.seed(13) # For reproducibility
training_indices <- sample(1:nrow(BostonHousing), size = 0.7 * nrow(BostonHousing))
training_data <- BostonHousing[training_indices, ]
testing_data <- BostonHousing[-training_indices, ]
# Build the Random Forest model
rf_model <- randomForest(medv ~ ., data = training_data, ntree = 500)
# Print the model summary
print(rf_model)
##
## Call:
## randomForest(formula = medv ~ ., data = training_data, ntree = 500)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 10.9022
## % Var explained: 86.65
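The summary shows that 4 variables were tried at each split (the default mtry of roughly one third of the predictors for regression). As a hedged sketch (not run here), mtry could be tuned against the out-of-bag error with tuneRF:
# Sketch: tune mtry against the out-of-bag error
set.seed(13)
tuned_mtry <- tuneRF(x = as.data.frame(training_data[, setdiff(names(training_data), "medv")]),
                     y = training_data$medv,
                     ntreeTry = 500, stepFactor = 1.5, improve = 0.01, trace = FALSE)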
# Predict on the testing data
predictions_rf <- predict(rf_model, newdata = testing_data)
# Display the first few predictions
head(predictions_rf)
## 1 2 3 4 5 6
## 34.64562 26.66950 18.63165 19.11830 18.60591 21.38240
# Calculate and print the model performance metrics
mse <- mean((predictions_rf - testing_data$medv)^2)
cat("Mean Squared Error (MSE):", mse, "\n")
## Mean Squared Error (MSE): 15.33977
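To put the three test errors side by side, the sketch below reuses the objects computed above (note that the forest was scored on a different 70/30 split, seed 13 rather than 25, so the comparison is indicative rather than exact):
# Collect the test MSEs computed above into a single table
mse_comparison <- tibble(
  Model = c("Decision tree (medv ~ lstat + .)", "Decision tree (medv ~ age + .)", "Random forest"),
  Test_MSE = c(mse_lstat, mse_stat, mse)
)
kable(mse_comparison, format = "markdown", caption = "Test MSE by Model")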
In this analysis, I explored the relationship between various features of the BostonHousing dataset and the median value of homes (medv) by employing decision trees and a random forest regression model. The goal was to assess how different predictors influence model performance and to compare the results obtained from these models, paying particular attention to bias and variance.
The first decision tree model was fit with the formula medv ~ lstat + ., which gives rpart access to all predictors. The algorithm chose rm (the average number of rooms per dwelling) for the first split, at a threshold of 6.941 rooms, producing two distinct branches. The deviance values show that the tree segmented the data effectively, with predicted median home values varying considerably with the number of rooms. Subsequent nodes split on lstat (percentage of lower-status population) and nox (nitric oxides concentration), showing how these predictors interact to refine the predictions. The terminal-node predictions range widely, with the lowest median values predicted for homes with fewer rooms and higher levels of lstat.
For the second decision tree model, I wrote the formula as medv ~ age + ., where age is the proportion of owner-occupied units built prior to 1940. Because the dot already includes every predictor (age as well as lstat), this formula expands to exactly the same model as the first one, and rpart returns an identical tree: the same splits on rm, lstat, nox, and age, the same terminal nodes, and the same test MSE of 19.42184. Re-ordering terms in the formula therefore changes neither the tree's bias nor its variance; to shift that trade-off one would have to restrict the predictor set or adjust the tree's complexity (for example its depth or the cp parameter). In general, a deeper, more flexible tree reduces bias, the error caused by overly simplistic assumptions, but becomes more sensitive to fluctuations in the training data, which raises variance.
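One way to make a single tree's variance concrete is to refit it on bootstrap resamples of the training data and measure how much the test-set predictions move between refits; the sketch below assumes 50 resamples:
# Sketch: estimate the prediction variance of a single tree via bootstrap refits
set.seed(25)
boot_preds <- replicate(50, {
  idx <- sample(nrow(training_data), replace = TRUE)
  fit <- rpart(medv ~ ., data = training_data[idx, ])
  predict(fit, newdata = testing_data)
})
# Average per-observation standard deviation across the 50 refits; averaging such trees,
# as a random forest does, is precisely what brings this spread down.
mean(apply(boot_preds, 1, sd))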
The random forest model, leveraging an ensemble of decision trees, produced more reliable predictions than the individual trees. The output reports a mean of squared residuals of 10.9022 and 86.65% of variance explained, and the test MSE of 15.33977 is clearly lower than the single trees' 19.42184 (bearing in mind that the forest was evaluated on a different random split). This model benefited from low bias while also managing variance by averaging many trees grown on bootstrap samples, reducing the risk of overfitting present in single-tree models.
The performance improvement observed in the random forest model highlights the advantage of combining multiple trees, which can enhance model robustness and reduce variance significantly. In contrast, individual decision trees showed greater susceptibility to overfitting, especially when influenced by specific features like rm, age, or lstat.
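To see which predictors the forest actually relies on, its variable importance can be inspected; a brief sketch follows (the permutation-based %IncMSE measure would additionally require importance = TRUE when fitting the forest):
# Node-impurity importance from the fitted forest
importance(rf_model)
varImpPlot(rf_model, main = "Variable Importance in the Random Forest Model")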
In addressing concerns related to changes and version control of decision trees, I would implement tools that facilitate continuous improvement. Version control systems, such as Git, can help track changes in model architecture and features over time, ensuring that the decision trees do not lose their effectiveness. Additionally, I would establish a standardized approach for creating and evaluating decision trees, utilizing established performance metrics and validation techniques to maintain model integrity.
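As one concrete example of such a standardized evaluation step, the sketch below runs a simple 5-fold cross-validation of the tree specification using only rpart and base R (k = 5 is an assumed choice), which could be rerun whenever the model is revised:
# Sketch: 5-fold cross-validated test error for the tree specification
set.seed(25)
folds <- sample(rep(1:5, length.out = nrow(BostonHousing)))
cv_mse <- sapply(1:5, function(k) {
  fit <- rpart(medv ~ ., data = BostonHousing[folds != k, ])
  preds <- predict(fit, newdata = BostonHousing[folds == k, ])
  mean((preds - BostonHousing$medv[folds == k])^2)
})
mean(cv_mse)  # average held-out MSE across the five folds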
To manage complexity, employing automated pipelines for data preprocessing, model training, and evaluation would streamline the process, ensuring consistency and repeatability. Regular audits of model performance against new data, alongside retraining protocols when necessary, would further enhance reliability. These strategies will help ensure that the decision tree models remain effective over time, adapting to new data and maintaining accuracy in predictions.