The goal of this second assignment is to train two distinct decision tree models and a random forest model and to compare their bias and variance. I will conclude the analysis with a summary of my results and insights. Note that I will continue to use the Boston Housing data set, which is accessible in R through the MASS and mlbench packages, as well as online.
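For reference, the same data set ships with the MASS package; a minimal alternative import (a sketch only, not executed in this report) would be:
# Alternative import (sketch only): use the copy bundled with the MASS package
Boston_alt <- MASS::Boston  # 506 x 14; the column called `b` below is named `black` there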
# Import the Boston Housing data from a GitHub repository
BostonHousing <- read_csv("https://raw.githubusercontent.com/Heleinef/Data-Science-Master_Heleine/refs/heads/main/BostonHousing.csv")
## Rows: 506 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (14): crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, ls...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View the first 3 rows of the dataset
head(BostonHousing, n = 3)
## # A tibble: 3 × 14
## crim zn indus chas nox rm age dis rad tax ptratio b
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.00632 18 2.31 0 0.538 6.58 65.2 4.09 1 296 15.3 397.
## 2 0.0273 0 7.07 0 0.469 6.42 78.9 4.97 2 242 17.8 397.
## 3 0.0273 0 7.07 0 0.469 7.18 61.1 4.97 2 242 17.8 393.
## # ℹ 2 more variables: lstat <dbl>, medv <dbl>
# View structure of BostonHousing
glimpse(BostonHousing)
## Rows: 506
## Columns: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad <dbl> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ b <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
# Check dimensions of the dataset
dim(BostonHousing)
## [1] 506 14
# Display column names of BostonHousing
names(BostonHousing)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "b" "lstat" "medv"
### Data Preparation: Handle Missing and Duplicated Values
# Remove rows with missing values
BostonHousing <- BostonHousing[complete.cases(BostonHousing),]
# Ensure no missing values remain
sum(is.na(BostonHousing))
## [1] 0
# Check for duplicated values
sum(duplicated(BostonHousing))
## [1] 0
# Display summary statistics
summary(BostonHousing)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio b
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# Calculate mean and median for each numeric variable
numeric_columns <- BostonHousing %>% select_if(is.numeric)
mean_values <- colMeans(numeric_columns, na.rm = TRUE)
median_values <- apply(numeric_columns, 2, median, na.rm = TRUE)
# Create a table with descriptive statistics
variables_table <- tibble(
Variable = names(numeric_columns),
Mean = mean_values,
Median = median_values
)
# Display the table
kable(variables_table, format = "markdown", caption = "Mean and Median of Variables in BostonHousing")
| Variable | Mean | Median |
|---|---|---|
| crim | 3.6135236 | 0.25651 |
| zn | 11.3636364 | 0.00000 |
| indus | 11.1367787 | 9.69000 |
| chas | 0.0691700 | 0.00000 |
| nox | 0.5546951 | 0.53800 |
| rm | 6.2846344 | 6.20850 |
| age | 68.5749012 | 77.50000 |
| dis | 3.7950427 | 3.20745 |
| rad | 9.5494071 | 5.00000 |
| tax | 408.2371542 | 330.00000 |
| ptratio | 18.4555336 | 19.05000 |
| b | 356.6740316 | 391.44000 |
| lstat | 12.6530632 | 11.36000 |
| medv | 22.5328063 | 21.20000 |
# Visualize distribution of 'medv'
ggplot(BostonHousing, aes(x = medv)) +
geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
labs(title = "Distribution of Median Value (medv)", x = "Median Value (medv)", y = "Frequency") +
theme_minimal()
# Select numeric columns excluding 'medv'
BostonHousing_long <- BostonHousing %>%
  dplyr::select(-medv) %>%
pivot_longer(cols = everything(), names_to = "feature", values_to = "value")
# Boxplot for each numeric feature
ggplot(BostonHousing_long, aes(x = feature, y = value)) +
geom_boxplot(fill = "skyblue", color = "black") +
labs(title = "Boxplots of Numeric Features in BostonHousing Dataset", x = "Features", y = "Value") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Select features for scatter plots
important_features <- c("crim", "rm", "zn", "lstat", "age", "dis")
# Create scatter plots of each feature against `medv`
for (feature in important_features) {
p <- ggplot(BostonHousing, aes(x = .data[[feature]], y = medv)) +
geom_point(color = "steelblue") +
labs(title = paste("Scatterplot of", feature, "vs. medv"), x = feature, y = "Median Value (medv)") +
theme_minimal()
print(p)
}
# Pairwise scatterplot matrix for selected features
selected_columns <- c("medv", "rm", "lstat", "dis", "zn", "crim")
subset_data <- BostonHousing %>% dplyr::select(all_of(selected_columns))
ggpairs(subset_data,
title = "Pairwise Plots of Selected Features",
lower = list(continuous = "points"),
upper = list(continuous = "cor"),
diag = list(continuous = "densityDiag")) +
theme_minimal()
library(corrplot)
numeric_data <- BostonHousing[sapply(BostonHousing, is.numeric)]
# Calculate the correlation matrix
cor_matrix <- cor(numeric_data, use = "complete.obs")
# Plot the correlation matrix using an explicit package call
corrplot::corrplot(cor_matrix, method = "color",
addCoef.col = "black", # Add correlation coefficients
tl.col = "black", # Text label color
tl.srt = 45, # Text label rotation
mar = c(1, 1, 2, 1)) # Adjust margins
# Add a title
title("Correlation Matrix of Boston Housing Data", line = 2.5, cex.main = 1.2)
# Heatmap
# Load necessary libraries
library(ggplot2)
library(viridis) # For color scales
library(reshape2) # For the melt function
# Select numeric columns from the dataset
numeric_columns <- BostonHousing[sapply(BostonHousing, is.numeric)]
# Calculate the correlation matrix
cor_matrix <- cor(numeric_columns)
# Melt the correlation matrix into long format
cor_melted <- melt(cor_matrix)
# Create the heatmap
ggplot(data = cor_melted, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_viridis(name = "Correlation", option = "C") +
labs(title = "Correlation Heatmap of Boston Housing Data",
x = "Features",
y = "Features") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Data Splitting
set.seed(25) # For reproducibility
training_indices <- sample(1:nrow(BostonHousing), size = 0.7 * nrow(BostonHousing))
training_data <- BostonHousing[training_indices, ]
testing_data <- BostonHousing[-training_indices, ]
# Build the first decision tree model (note: `.` already includes lstat, so this formula is equivalent to medv ~ .)
tree_model_lstat <- rpart(medv ~ lstat + ., data = training_data)
# Plot the decision tree
png(file = "decision_tree_lstat.png")
rpart.plot(tree_model_lstat, main = "Decision Tree for Median Value of Homes (medv) Including LSTAT")
dev.off()
## quartz_off_screen
## 2
# Print the model summary
print(tree_model_lstat)
## n= 354
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 354 32061.4300 22.16017
## 2) rm< 6.941 302 12763.3700 19.40364
## 4) lstat>=14.395 127 2657.0090 14.51575
## 8) nox>=0.607 83 1190.8810 12.57831
## 16) lstat>=19.645 46 417.3741 10.35435 *
## 17) lstat< 19.645 37 263.1308 15.34324 *
## 9) nox< 0.607 44 566.8716 18.17045 *
## 5) lstat< 14.395 175 4870.1570 22.95086
## 10) lstat>=9.545 89 517.1056 20.52472 *
## 11) lstat< 9.545 86 3287.0430 25.46163
## 22) age< 85.4 79 1313.5910 24.43165
## 44) rm< 6.5445 51 610.1016 22.63922 *
## 45) rm>=6.5445 28 241.1896 27.69643 *
## 23) age>=85.4 7 943.8086 37.08571 *
## 3) rm>=6.941 52 3676.2510 38.16923
## 6) rm< 7.445 31 1030.7990 33.50645 *
## 7) rm>=7.445 21 976.5324 45.05238 *
# Predict on the testing data
predictions_lstat <- predict(tree_model_lstat, newdata = testing_data)
# Display the first few predictions
head(predictions_lstat)
## 1 2 3 4 5 6
## 18.17045 18.17045 22.63922 18.17045 18.17045 18.17045
# Calculate and print model performance metrics
mse_lstat <- mean((predictions_lstat - testing_data$medv)^2)
cat("Mean Squared Error (MSE) with lstat as predictor:", mse_lstat, "\n")
## Mean Squared Error (MSE) with lstat as predictor: 19.42184
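As a hedged follow-up (not run in this report), the tree's complexity-parameter table shows how the cross-validated error changes with tree size, and pruning at the cp value with the lowest xerror trades a little bias for lower variance:
# Sketch: prune the tree at the cp value with the lowest cross-validated error
printcp(tree_model_lstat)
best_cp <- tree_model_lstat$cptable[which.min(tree_model_lstat$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(tree_model_lstat, cp = best_cp)
mean((predict(pruned_tree, newdata = testing_data) - testing_data$medv)^2)  # pruned test MSE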
set.seed(25) # For reproducibility
training_indices <- sample(1:nrow(BostonHousing), size = 0.7 * nrow(BostonHousing))
training_data <- BostonHousing[training_indices, ]
testing_data <- BostonHousing[-training_indices, ]
# Build the second decision tree model, explicitly naming age as a predictor (again, `.` already includes age, so this is equivalent to medv ~ . and yields the same tree as before)
tree_model_stat <- rpart(medv ~ age + ., data = training_data)
# Plot the decision tree
png(file = "decision_tree_stat.png")
rpart.plot(tree_model_stat, main = "Decision Tree for Median Value of Homes (medv) Including Age")
dev.off()
## quartz_off_screen
## 2
# Print the model summary
print(tree_model_stat)
## n= 354
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 354 32061.4300 22.16017
## 2) rm< 6.941 302 12763.3700 19.40364
## 4) lstat>=14.395 127 2657.0090 14.51575
## 8) nox>=0.607 83 1190.8810 12.57831
## 16) lstat>=19.645 46 417.3741 10.35435 *
## 17) lstat< 19.645 37 263.1308 15.34324 *
## 9) nox< 0.607 44 566.8716 18.17045 *
## 5) lstat< 14.395 175 4870.1570 22.95086
## 10) lstat>=9.545 89 517.1056 20.52472 *
## 11) lstat< 9.545 86 3287.0430 25.46163
## 22) age< 85.4 79 1313.5910 24.43165
## 44) rm< 6.5445 51 610.1016 22.63922 *
## 45) rm>=6.5445 28 241.1896 27.69643 *
## 23) age>=85.4 7 943.8086 37.08571 *
## 3) rm>=6.941 52 3676.2510 38.16923
## 6) rm< 7.445 31 1030.7990 33.50645 *
## 7) rm>=7.445 21 976.5324 45.05238 *
# Predict on the testing data
predictions_stat <- predict(tree_model_stat, newdata = testing_data)
# Display the first few predictions
head(predictions_stat)
## 1 2 3 4 5 6
## 18.17045 18.17045 22.63922 18.17045 18.17045 18.17045
# Calculate and print model performance metrics
mse_stat <- mean((predictions_stat - testing_data$medv)^2)
cat("Mean Squared Error (MSE) with age as predictor:", mse_stat, "\n")
## Mean Squared Error (MSE) with age as predictor: 19.42184
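Because both formulas expand to the same full predictor set, the two trees are in fact the same model, which is why the printed trees and the MSE values match exactly. A quick sanity check (a sketch, not run in this report) makes this explicit:
# Sanity check: the two formulas expand to the same predictors, so the fitted trees should match
all.equal(predictions_lstat, predictions_stat)            # expected TRUE
identical(tree_model_lstat$frame, tree_model_stat$frame)  # identical split structure expected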
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
# Create training and testing datasets (70/30 split) for the random forest
set.seed(13) # For reproducibility
training_indices <- sample(1:nrow(BostonHousing), size = 0.7 * nrow(BostonHousing))
training_data <- BostonHousing[training_indices, ]
testing_data <- BostonHousing[-training_indices, ]
# Build the Random Forest model
rf_model <- randomForest(medv ~ ., data = training_data, ntree = 500)
# Print the model summary
print(rf_model)
##
## Call:
## randomForest(formula = medv ~ ., data = training_data, ntree = 500)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 10.9022
## % Var explained: 86.65
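The summary shows that 4 variables were tried at each split (the default mtry of roughly one third of the predictors for regression). As a hedged sketch (not run here), mtry could be tuned against the out-of-bag error with tuneRF:
# Sketch: tune mtry against the out-of-bag error
set.seed(13)
tuned_mtry <- tuneRF(x = as.data.frame(training_data[, setdiff(names(training_data), "medv")]),
                     y = training_data$medv,
                     ntreeTry = 500, stepFactor = 1.5, improve = 0.01, trace = FALSE)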
# Predict on the testing data
predictions_rf <- predict(rf_model, newdata = testing_data)
# Display the first few predictions
head(predictions_rf)
## 1 2 3 4 5 6
## 34.64562 26.66950 18.63165 19.11830 18.60591 21.38240
# Calculate and print the model performance metrics
mse <- mean((predictions_rf - testing_data$medv)^2)
cat("Mean Squared Error (MSE):", mse, "\n")
## Mean Squared Error (MSE): 15.33977
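To put the three test errors side by side, the sketch below reuses the objects computed above (note that the forest was scored on a different 70/30 split, seed 13 rather than 25, so the comparison is indicative rather than exact):
# Collect the test MSEs computed above into a single table
mse_comparison <- tibble(
  Model = c("Decision tree (medv ~ lstat + .)", "Decision tree (medv ~ age + .)", "Random forest"),
  Test_MSE = c(mse_lstat, mse_stat, mse)
)
kable(mse_comparison, format = "markdown", caption = "Test MSE by Model")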
In this analysis, I explored the relationship between various features of the BostonHousing dataset and the median value of homes (medv) by employing decision trees and a random forest regression model. The goal was to assess how different predictors influence model performance and to compare the results obtained from these models, paying particular attention to bias and variance.
The first decision tree model was fit with the formula medv ~ lstat + ., which gives rpart access to all predictors. The algorithm chose rm (the average number of rooms per dwelling) for the first split, at a threshold of 6.941 rooms, producing two distinct branches. The deviance values show that the tree segmented the data effectively, with predicted median home values varying considerably with the number of rooms. Subsequent nodes split on lstat (percentage of lower-status population) and nox (nitric oxides concentration), showing how these predictors interact to refine the predictions. The terminal-node predictions range widely, with the lowest median values predicted for homes with fewer rooms and higher levels of lstat.
For the second decision tree model, I wrote the formula as medv ~ age + ., where age is the proportion of owner-occupied units built prior to 1940. Because the dot already includes every predictor (age as well as lstat), this formula expands to exactly the same model as the first one, and rpart returns an identical tree: the same splits on rm, lstat, nox, and age, the same terminal nodes, and the same test MSE of 19.42184. Re-ordering terms in the formula therefore changes neither the tree's bias nor its variance; to shift that trade-off one would have to restrict the predictor set or adjust the tree's complexity (for example its depth or the cp parameter). In general, a deeper, more flexible tree reduces bias, the error caused by overly simplistic assumptions, but becomes more sensitive to fluctuations in the training data, which raises variance.
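One way to make a single tree's variance concrete is to refit it on bootstrap resamples of the training data and measure how much the test-set predictions move between refits; the sketch below assumes 50 resamples:
# Sketch: estimate the prediction variance of a single tree via bootstrap refits
set.seed(25)
boot_preds <- replicate(50, {
  idx <- sample(nrow(training_data), replace = TRUE)
  fit <- rpart(medv ~ ., data = training_data[idx, ])
  predict(fit, newdata = testing_data)
})
# Average per-observation standard deviation across the 50 refits; averaging such trees,
# as a random forest does, is precisely what brings this spread down.
mean(apply(boot_preds, 1, sd))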
The random forest model, leveraging an ensemble of decision trees, produced more reliable predictions than the individual trees. The output reports a mean of squared residuals of 10.9022 and 86.65% of variance explained, and the test MSE of 15.33977 is clearly lower than the single trees' 19.42184 (bearing in mind that the forest was evaluated on a different random split). This model benefited from low bias while also managing variance by averaging many trees grown on bootstrap samples, reducing the risk of overfitting present in single-tree models.
The performance improvement observed in the random forest model highlights the advantage of combining multiple trees, which can enhance model robustness and reduce variance significantly. In contrast, individual decision trees showed greater susceptibility to overfitting, especially when influenced by specific features like rm, age, or lstat.
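To see which predictors the forest actually relies on, its variable importance can be inspected; a brief sketch follows (the permutation-based %IncMSE measure would additionally require importance = TRUE when fitting the forest):
# Node-impurity importance from the fitted forest
importance(rf_model)
varImpPlot(rf_model, main = "Variable Importance in the Random Forest Model")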
In addressing concerns related to changes and version control of decision trees, I would implement tools that facilitate continuous improvement. Version control systems, such as Git, can help track changes in model architecture and features over time, ensuring that the decision trees do not lose their effectiveness. Additionally, I would establish a standardized approach for creating and evaluating decision trees, utilizing established performance metrics and validation techniques to maintain model integrity.
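As one concrete example of such a standardized evaluation step, the sketch below runs a simple 5-fold cross-validation of the tree specification using only rpart and base R (k = 5 is an assumed choice), which could be rerun whenever the model is revised:
# Sketch: 5-fold cross-validated test error for the tree specification
set.seed(25)
folds <- sample(rep(1:5, length.out = nrow(BostonHousing)))
cv_mse <- sapply(1:5, function(k) {
  fit <- rpart(medv ~ ., data = BostonHousing[folds != k, ])
  preds <- predict(fit, newdata = BostonHousing[folds == k, ])
  mean((preds - BostonHousing$medv[folds == k])^2)
})
mean(cv_mse)  # average held-out MSE across the five folds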
To manage complexity, employing automated pipelines for data preprocessing, model training, and evaluation would streamline the process, ensuring consistency and repeatability. Regular audits of model performance against new data, alongside retraining protocols when necessary, would further enhance reliability. These strategies will help ensure that the decision tree models remain effective over time, adapting to new data and maintaining accuracy in predictions.