Assignment 1, Data 622

Author

Heleine Fouda

Introduction

In this analysis, I will use the Boston Housing dataset, which is accessible in R through the MASS and mlbench packages as well as online. After performing an exploratory data analysis (EDA) on the dataset and applying several models, I will present and discuss the rationale behind the chosen algorithms.

The data:

The Boston Housing dataset includes data from 506 census tracts in Boston based on the 1970 census. The BostonHousing data frame contains the original data by Harrison and Rubinfeld (1978), while BostonHousing2 offers a corrected version with additional spatial information. I chose this dataset because it includes features that influence the median value of owner-occupied homes, which I will use to predict housing prices and to assess the contribution of each feature to the overall price.

On the other hand, the palmerpenguins dataset contains size measurements, clutch observations, and blood isotope ratios for adult foraging Adélie, Chinstrap, and Gentoo penguins observed on islands in the Palmer Archipelago near Palmer Station, Antarctica. The data were collected and made available by Dr. Kristen Gorman and the Palmer Station Long Term Ecological Research (LTER) Program.

Load required libraries

knitr::opts_chunk$set(echo = TRUE)
#| include: false
#| message: false

set.seed(1310)

#| label: Load required libraries
library(viridis)
library(corrplot)  # for visualization of correlations

I. Exploratory data analysis of the Boston Housing data

1.1. Load the dataset

library(MASS)
library(mlbench)

#|label: Load the data

BostonHousing <- Boston

#|label: View the first 3 rows
head(BostonHousing, n = 3)
     crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
  medv
1 24.0
2 21.6
3 34.7

1.2. Import the data set from GitHub

#| label: Option 3: Import the Boston Housing data from a GitHub repository
library(readr)

BostonHousing <- read_csv("https://raw.githubusercontent.com/Heleinef/Data-Science-Master_Heleine/refs/heads/main/BostonHousing.csv")
#|label: View the first 3 rows
head(BostonHousing, n=3)
# A tibble: 3 × 14
     crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio     b
    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>
1 0.00632    18  2.31     0 0.538  6.58  65.2  4.09     1   296    15.3  397.
2 0.0273      0  7.07     0 0.469  6.42  78.9  4.97     2   242    17.8  397.
3 0.0273      0  7.07     0 0.469  7.18  61.1  4.97     2   242    17.8  393.
# ℹ 2 more variables: lstat <dbl>, medv <dbl>
glimpse(Boston)
Rows: 506
Columns: 14
$ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
$ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
$ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
$ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
$ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
$ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
$ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
$ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
$ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
$ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
$ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
$ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
$ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
dim(Boston)
[1] 506  14

1.3. Data size and structure

dim(BostonHousing)
[1] 506  14

1.4. Get a sense of the dataset by viewing the column names.

names(BostonHousing)
 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
 [8] "dis"     "rad"     "tax"     "ptratio" "b"       "lstat"   "medv"   

1.5. Data preparation:

Check for Missing Values

BostonHousing <- BostonHousing[complete.cases(BostonHousing),]
sum(is.na(BostonHousing))
[1] 0

Check for duplicated values

sum(duplicated(Boston))
[1] 0

1.6. Summary Statistics

summary(BostonHousing)
      crim                zn             indus            chas        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
 1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
      nox               rm             age              dis        
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
      rad              tax           ptratio            b         
 Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
 1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
 Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
 Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
 3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
 Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
     lstat            medv      
 Min.   : 1.73   Min.   : 5.00  
 1st Qu.: 6.95   1st Qu.:17.02  
 Median :11.36   Median :21.20  
 Mean   :12.65   Mean   :22.53  
 3rd Qu.:16.95   3rd Qu.:25.00  
 Max.   :37.97   Max.   :50.00  

The summary statistics suggest a diverse set of neighborhoods, with some showing extreme socioeconomic characteristics (e.g., high crime, low status, high taxes) while others represent more average conditions. For instance, crim (per capita crime rate by town) is highly skewed, with a median of 0.257 but a mean of 3.61, suggesting that a few areas have very high crime rates (up to 88.98); the maximum value indicates that some neighborhoods have exceptionally high crime rates and are potential outliers. Another example is nox, the nitric oxide concentration: the median (0.538) and mean (0.554) indicate that pollution levels are generally moderate, with some areas reaching relatively high levels (up to 0.871).
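To put numbers on these mean/median gaps, the short sketch below computes the sample skewness of every numeric column; it assumes the e1071 package is available for its skewness() function (that package is not loaded elsewhere in this report).

# Sketch: quantify the skewness suggested by the summary statistics
# (assumes e1071 is installed; crim, in particular, should stand out as strongly right-skewed)
library(e1071)

numeric_cols <- BostonHousing[sapply(BostonHousing, is.numeric)]
sort(sapply(numeric_cols, skewness, na.rm = TRUE), decreasing = TRUE)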

Descriptions, Mean, and Median of Variables in the BostonHousing Dataset

| Variable | Description | Mean | Median |
|----------|-------------|------|--------|
| crim | Per capita crime rate by town | 3.6135236 | 0.25651 |
| zn | Proportion of residential land zoned for lots over 25,000 sq. ft. | 11.3636364 | 0.00000 |
| indus | Proportion of non-retail business acres per town | 11.1367787 | 9.69000 |
| chas | Charles River dummy variable (1 if tract bounds river; 0 otherwise) | 0.06917 | 0.00000 |
| nox | Nitrogen oxides concentration (parts per 10 million) | 0.5546951 | 0.53800 |
| rm | Average number of rooms per dwelling | 6.2846344 | 6.20850 |
| age | Proportion of owner-occupied units built prior to 1940 | 68.5749012 | 77.50000 |
| dis | Weighted distances to five Boston employment centers | 3.7950427 | 3.20745 |
| rad | Index of accessibility to radial highways | 9.5494071 | 5.00000 |
| tax | Full-value property tax rate per $10,000 | 408.2371542 | 330.00000 |
| ptratio | Pupil-teacher ratio by town | 18.4555336 | 19.05000 |
| b | 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town | 356.6740316 | 391.44000 |
| lstat | Lower status of the population (percentage) | 12.6530632 | 11.36000 |
| medv | Median value of owner-occupied homes in $1000s | 22.5328063 | 21.20000 |

2. Visualization of the data

2.1. Histogram of numeric features

library(ggplot2)

#| label: Visualize the distribution of each numeric variable.

ggplot(BostonHousing, aes(x = medv)) +
  geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Median Value (medv)", x = "medv", y = "Frequency") +
  theme_minimal()

The histogram reveals right-skewed data, with the majority of values clustered between 15 and 30. There is a notable peak around 20, indicating a large number of homes within that price range. There are also a significant number of homes with a median value of 50, likely indicating an upper limit or ceiling effect in the data.
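Before moving on, the suspected ceiling effect can be confirmed directly; the quick check below is a sketch that simply counts how many tracts sit exactly at the capped value of 50.

# Sketch: how many tracts are recorded exactly at the apparent $50k cap?
sum(BostonHousing$medv == 50)
mean(BostonHousing$medv == 50) * 100  # as a percentage of all 506 tracts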

ggplot(BostonHousing, aes(x = medv)) +
  geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Median Value (medv) by rad" , x = "medv", y = "Frequency") +
  theme_minimal() +
  facet_wrap(~rad)  # Facet by 'rad' (can replace with any other categorical variable)

Important Features


library(reshape2)

# Melt the dataset to long format for ggplot
long_data <- melt(BostonHousing[, c('crim', 'zn', 'rm', 'dis', 'lstat')])
# Facet grid histograms
ggplot(long_data, aes(x = value)) +
  geom_histogram(fill='skyblue', color='black', bins=30) +
  facet_wrap(~variable, scales = "free_x") +
  labs(title = "Histograms of Most Significant Features", x = "Value", y = "Count") +
  theme_minimal()

2.2 Boxplots

ggplot(BostonHousing, aes(y = medv)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(title = "Boxplot of Median Value (medv)", 
       y = "Median Value (medv)") +
  theme_minimal()


The boxplot illustrates the distribution of the median value of homes (medv) in the dataset, showing a relatively narrow interquartile range (IQR) and several outliers on the upper end. The median is centered within the box, indicating that most home values cluster around this central value, while the outliers suggest that some homes have significantly higher values.
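The outliers flagged by the boxplot follow the usual 1.5 x IQR rule; the sketch below makes that cutoff explicit by recomputing the upper fence for medv and counting how many tracts exceed it.

# Sketch: reproduce the boxplot's upper fence for medv (Q3 + 1.5 * IQR)
q <- quantile(BostonHousing$medv, probs = c(0.25, 0.75))
upper_fence <- q[2] + 1.5 * (q[2] - q[1])
upper_fence
sum(BostonHousing$medv > upper_fence)  # tracts the boxplot marks as high-value outliers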

library(dplyr)
library(tidyr)
library(ggplot2)
library(ggthemes)

# Load the Boston dataset
data("Boston", package = "MASS")

#| label: Select only numeric columns excluding 'medv'
BostonHousing_numeric <- Boston %>%
 dplyr:: select(-medv) %>%
  dplyr::select_if(is.numeric)  # Use select_if to select numeric columns

#| label: Pivot the dataset to long format
BostonHousing_long <- BostonHousing_numeric %>%
  pivot_longer(cols = everything(),  # Reshape all other numeric columns
               names_to = "feature", 
               values_to = "value")

#| label: Create boxplots for each feature
ggplot(BostonHousing_long, aes(x = feature, y = value)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(title = "Boxplot of Numeric Features in BostonHousing Dataset", 
       x = "Features", 
       y = "Value") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  # Rotate x-axis labels for readability
  facet_wrap(~ feature, scales = "free") +                    # Facet each feature with its own scale
  theme_fivethirtyeight()

# 1. Plot box plot for a single variable (e.g., 'crim' by 'rad')
ggplot(Boston, aes(x = factor(rad), y = crim)) +
  geom_boxplot(outlier.color = "red", fill = "skyblue", color = "darkblue", width = 0.7) + 
  labs(title = "Box Plot of Crime Rate by Accessibility to Radial Highways",
       x = "Radial Highways (rad)",
       y = "Crime Rate (crim)") +
  theme_minimal(base_size = 15) +  # Adjusts text size
  theme(
    plot.title = element_text(hjust = 0.5),  # Centers the title
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotates x-axis labels for better visibility
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 14)
  )

# 2. To plot multiple box plots together (e.g., crim, nox, and rm by rad)
# Reshape data for better plotting (long format)
library(reshape2)
melted_data <- melt(Boston, id.vars = "rad", measure.vars = c("crim", "nox", "rm"))

ggplot(melted_data, aes(x = factor(rad), y = value)) +
  geom_boxplot(outlier.color = "red", fill = "lightgreen", color = "darkgreen", width = 0.7) +
  facet_wrap(~variable, scales = "free_y") +  # Creates separate box plots for each variable
  labs(title = "Box Plots of crim, nox, and rm by Radial Highways",
       x = "Radial Highways (rad)",
       y = "Value") +
  theme_minimal(base_size = 15) +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 14)
  )

2.3. Scatterplots of important features against the target (medv) to identify outliers

2.4. Pairwise relationships: a scatterplot matrix of BostonHousing to help identify outliers

# Load necessary libraries
library(GGally)
library(MASS)

# Load the Boston dataset
data("Boston", package = "MASS")

# Select a subset of numeric columns for clarity
selected_columns <- c("medv", "rm", "lstat", "dis", "zn", "crim")  # Make sure these columns exist
subset_data <- Boston[selected_columns]

# Create pairwise plots using ggpairs
ggpairs(subset_data,
        title = "Pairwise Plots of Selected Features",
        lower = list(continuous = "points"),  # Scatter plots for lower triangle
        upper = list(continuous = "cor"),  # Display correlation coefficients in the upper triangle
        diag = list(continuous = "densityDiag")) +  # Density plots on the diagonal
  theme_minimal()  # Use a minimal theme for better aesthetics

library(ggplot2)
library(gridExtra)  # For arranging multiple plots

# Select important features to plot against medv
important_features <- c("crim","rm", "zn", "lstat", "age","dis")

# Create a list to store scatter plots
scatter_plots <- lapply(important_features, function(feature) {
  ggplot(BostonHousing, aes(x = .data[[feature]], y = medv)) +  # tidy evaluation instead of the deprecated aes_string()
    geom_point(alpha = 0.5, color = "blue") +  # Add points with transparency
    geom_smooth(method = "lm", color = "red", se = FALSE) +  # Add a linear regression line
    labs(title = paste("Scatter Plot of", feature, "vs. Median Value (medv)"),
         x = feature,
         y = "Median Value (medv)") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5, size = 10, face = "bold"),  # Center the title
          axis.title = element_text(size = 9),  # Adjust axis titles
          axis.text = element_text(size = 9))   # Adjust axis text
})
# Arrange the scatter plots in a grid layout
grid.arrange(grobs = scatter_plots, ncol = 2)  # Change ncol for layout adjustment

3. Correlation Analysis

3.1 Correlation matrix:

library(corrplot)

#| label: Select only numeric columns
numeric_data <- BostonHousing[sapply(BostonHousing, is.numeric)]

#| label: Calculate the correlation matrix
cor_matrix <- cor(numeric_data, use = "complete.obs")

#|label: Plot the correlation matrix
corrplot(cor_matrix, method = "color", 
         addCoef.col = "black",  # Add correlation coefficients
         tl.col = "black",       # Text label color
         tl.srt = 45,            # Text label rotation
         title = "Correlation Matrix of Boston Housing Data",
         mar = c(0, 0, 1, 0))    # Adjust margins

library(corrplot)

#|label: Select only numeric columns
numeric_data <- BostonHousing[sapply(BostonHousing, is.numeric)]

#|label: Calculate the correlation matrix
cor_matrix <- cor(numeric_data, use = "complete.obs")

#|label: Select a subset of highly correlated features with medv
subset_features <- c("crim", "rm", "lstat", "tax", "zn", "age", "dis")
cor_matrix_subset <- cor_matrix[subset_features, subset_features]

#|label: Plot the correlation matrix
corrplot(cor_matrix_subset, method = "color", 
         addCoef.col = "black",  # Add correlation coefficients
         tl.col = "black",       # Text label color
         tl.srt = 45,            # Text label rotation
         title = "Correlation Matrix of Selected Features",
         mar = c(0, 0, 1, 0))    # Adjust margins

The matrix above provides a quick overview of how the different factors are interrelated, which can be crucial for understanding the dynamics in housing prices and for making informed decisions in further analysis or modeling. A value closer to 1 indicates a strong positive relationship, meaning that as one variable increases, the other tends to increase as well. For example, the variable rm (average number of rooms per dwelling) has a strong positive correlation (0.695) with medv (median value of homes), suggesting that more rooms are associated with higher home values. A value closer to -1 indicates a strong negative relationship, meaning that as one variable increases, the other tends to decrease. For instance, lstat (percentage of lower status of the population) has a strong negative correlation (-0.737) with medv, indicating that higher percentages of lower-status individuals are associated with lower home values. Values around 0 indicate little to no linear relationship between the variables.
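Because later feature selection leans on these relationships, a compact way to summarize them is to rank every predictor by its signed correlation with medv; the sketch below reuses the cor_matrix computed above.

# Sketch: rank predictors by their correlation with the target medv
medv_cor <- cor_matrix[, "medv"]
sort(medv_cor[names(medv_cor) != "medv"], decreasing = TRUE)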

3.2. Heatmap

# Select only numeric columns from the BostonHousing dataset
numeric_columns <- BostonHousing[sapply(BostonHousing, is.numeric)]

#|label: Calculate the correlation matrix
cor_matrix <- cor(numeric_columns)

#|label:  Melt the correlation matrix into long format
cor_melted <- melt(cor_matrix)

#|label: Create the heatmap
ggplot(data = cor_melted, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") + 
  scale_fill_viridis(name = "Correlation", option = "C") +
  labs(title = "Correlation Heatmap of Boston Housing Data",
       x = "Features",
       y = "Features") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

3.3. Outlier Detection by Z-Score

#| label: Select only numeric columns from BostonHousing
numeric_data <- BostonHousing[sapply(BostonHousing, is.numeric)]

#| label: Scale the numeric columns
z_scores <- scale(numeric_data)

#| label: Convert back to a data frame if needed
z_scores_df <- as.data.frame(z_scores)

head(z_scores_df)
        crim         zn      indus        nox        rm        age      dis
1 -0.4193669  0.2845483 -1.2866362 -0.1440749 0.4132629 -0.1198948 0.140075
2 -0.4169267 -0.4872402 -0.5927944 -0.7395304 0.1940824  0.3668034 0.556609
3 -0.4169290 -0.4872402 -0.5927944 -0.7395304 1.2814456 -0.2655490 0.556609
4 -0.4163384 -0.4872402 -1.3055857 -0.8344581 1.0152978 -0.8090878 1.076671
5 -0.4120741 -0.4872402 -1.3055857 -0.8344581 1.2273620 -0.5106743 1.076671
6 -0.4166314 -0.4872402 -1.3055857 -0.8344581 0.2068916 -0.3508100 1.076671
         rad        tax    ptratio         b      lstat       medv
1 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990  0.1595278
2 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239
3 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324  1.3229375
4 -0.7521778 -1.1050216  0.1129203 0.4157514 -1.3601708  1.1815886
5 -0.7521778 -1.1050216  0.1129203 0.4406159 -1.0254866  1.4860323
6 -0.7521778 -1.1050216  0.1129203 0.4101651 -1.0422909  0.6705582

Z-scores represent how many standard deviations a data point is from the mean: a value of 0 means the data point is at the mean, positive values indicate it is above the mean, and negative values mean it is below the mean. For instance, the z-scores for crim in the first few rows are all negative, ranging from about -0.42 to -0.41, indicating that the crime rate in these tracts is below the dataset average, though not drastically so since the values are close to 0.
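A common rule of thumb treats |z| > 3 as a potential outlier; the sketch below applies that threshold to the z_scores_df computed above to see how many tracts are affected and which features contribute the extreme values.

# Sketch: flag rows with at least one |z| > 3 as candidate outliers
extreme <- abs(z_scores_df) > 3
sum(apply(extreme, 1, any))  # number of tracts with at least one extreme value
colSums(extreme)             # count of extreme values per feature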

4. Feature Engineering: Apply log transformation for skewed data.

#| label: Log-Transformed Histogram

BostonHousing$log_medv <- log(BostonHousing$medv)

#|label: Plot the log-transformed 'medv'
ggplot(BostonHousing, aes(x = log_medv)) +
  geom_histogram(binwidth = 0.2, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Log-Transformed Median Value (log_medv)", 
       x = "log(medv)", 
       y = "Frequency") +
  theme_minimal()

II. Applicable Models:

1. Linear Regression Model / Multivariate Analysis

The goal of linear regression is to identify the relationship between the features and the target. In this case, the response variable is medv (median value of owner-occupied homes), and all other variables are used as predictors.

1.1. Train a baseline model using 10-fold cross-validation (caret, random forest)

# Load necessary libraries
library(caret)        # For training the model
library(MASS)         # For the Boston dataset
library(randomForest) # For the random forest algorithm

# Load the BostonHousing dataset
data("Boston", package = "MASS")

# Set seed for reproducibility
set.seed(1310)

# Split the dataset into training (75%) and test (25%)
trainIndex <- createDataPartition(Boston$medv, p = 0.75, list = FALSE)
train_data <- Boston[trainIndex, ]
test_data <- Boston[-trainIndex, ]

# Define the formula for the model
formula <- medv ~ .

# Set up the control for cross-validation
control <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation

# Train the Random Forest model using caret::train
rf_model <- caret::train(formula,               # Ensure we're calling caret::train()
                         data = train_data, 
                         method = "rf",         # Use random forest as the method
                         trControl = control,   # 10-fold cross-validation
                         tuneLength = 3         # Try different parameter combinations
)

# Print the model summary
print(rf_model)
Random Forest 

381 samples
 13 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 343, 343, 344, 344, 342, 344, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   MAE     
   2    3.760654  0.8576932  2.561179
   7    3.399720  0.8767099  2.330153
  13    3.404544  0.8714123  2.378158

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 7.
# Make predictions on the test data
predictions <- predict(rf_model, newdata = test_data)

# Evaluate the model performance (on test data)
test_rmse <- RMSE(predictions, test_data$medv)
test_r2 <- R2(predictions, test_data$medv)

# Print the performance metrics
cat("Test RMSE:", test_rmse, "\n")
Test RMSE: 3.066319 
cat("Test R-squared:", test_r2, "\n")
Test R-squared: 0.861139 

1.2 Fit the linear model

set.seed(1310)
# Load necessary library
library(MASS)

# Load the BostonHousing dataset
data("Boston")

# Fit a linear model
linear_model <- lm(medv ~ ., data = Boston)

# Summary of the model
summary(linear_model)

Call:
lm(formula = medv ~ ., data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.595  -2.730  -0.518   1.777  26.199 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288    
chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958229    
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
black        9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7338 
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

Model Fit Output Analysis: A multiple R-squared of 0.7406 indicates that about 74% of the variance in the response variable (medv) is explained by the model, which is a good level of fit. The adjusted R-squared of 0.7338 is slightly lower but still strong. The F-statistic of 108.1 with a p-value < 2.2e-16 means that the overall model is statistically significant. Several variables are highly significant (e.g., crime rate, nitrogen oxide concentration, number of rooms, accessibility to employment centers). However, some predictors, such as indus (proportion of non-retail business acres) and age, are not statistically significant in predicting housing prices; these less important features will be removed in later modeling steps.
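As a quick check before those later steps, a reduced model can be fit without indus and age and compared with the full fit; the sketch below uses a nested-model F-test, where linear_model is the full model fitted above and linear_model_reduced is a name introduced here for illustration.

# Sketch: drop the non-significant predictors and compare with the full model
linear_model_reduced <- lm(medv ~ . - indus - age, data = Boston)
summary(linear_model_reduced)$adj.r.squared  # adjusted R-squared of the reduced fit
anova(linear_model_reduced, linear_model)    # F-test: does dropping indus and age hurt the fit?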

1.3. Model Diagnostics

Let’s check the residuals to ensure the assumptions of linear regression (normality, homoscedasticity) are met.

library(MASS)
library(car)

#|label: Load data and fit the model
data("Boston")
model <- lm(medv ~ ., data = Boston)

#| label: Plot diagnostics
par(mfrow = c(1, 1))
plot(model)

#| label: Check multicollinearity (VIF)
vif_values <- vif(model)
print(vif_values)
    crim       zn    indus     chas      nox       rm      age      dis 
1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945 
     rad      tax  ptratio    black    lstat 
7.484496 9.008554 1.799084 1.348521 2.941491 

The Variance Inflation Factor (VIF) measures how much the variance of a regression coefficient is inflated by multicollinearity with the other predictors. A common rule of thumb is that a VIF above 5 (or 10) indicates multicollinearity, meaning a predictor is highly correlated with other predictors, which can affect the model's stability and reliability. In the output above, rad has a VIF of 7.48 and tax has a VIF of 9.01; these values indicate potential multicollinearity, so both variables will be removed to improve model performance. Other variables such as nox (VIF = 4.39) and indus (VIF = 3.99) also show moderate multicollinearity, but not as severe as rad and tax.
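One way to confirm that dropping rad and tax addresses the issue is to refit the model without them and recompute the VIFs; the sketch below does exactly that with the same vif() function from car (model_reduced is just an illustrative name).

# Sketch: refit without the high-VIF predictors and recheck multicollinearity
model_reduced <- lm(medv ~ . - rad - tax, data = Boston)
vif(model_reduced)  # the remaining VIFs should drop noticeably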

residuals <- residuals(model)
shapiro_test <- shapiro.test(residuals)
print(shapiro_test)

    Shapiro-Wilk normality test

data:  residuals
W = 0.90138, p-value < 2.2e-16

The Shapiro-Wilk test, which checks whether the residuals are normally distributed, shows that they are not, violating one of the assumptions of linear regression. This may affect the model's predictions and confidence intervals. In our case, W = 0.90138 indicates that the residuals deviate from a normal distribution, as evidenced by the very small p-value (< 2.2e-16).

1.4. Box-Cox transformation

library(MASS)      
library(ggplot2)   

# Load the BostonHousing dataset
data("Boston", package = "MASS")  # Specify the package to avoid confusion
train_data <- Boston

# 1. Extract the response variable (medv)
response <- train_data$medv

# 2. Apply Box-Cox transformation to find the optimal lambda
# The BoxCox function requires the response to be strictly positive
# Add a small constant if necessary
if (min(response) <= 0) {
  response <- response + abs(min(response)) + 1e-5  # Shift to be positive
}

# Find the optimal lambda using boxcox from MASS (plotit = FALSE suppresses the likelihood plot)
bc <- boxcox(lm(response ~ ., data = train_data), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# 3. Apply the Box-Cox transformation with the estimated lambda
transformed_response <- (response^lambda - 1) / lambda

# Update the dataset with the transformed response variable
train_data$medv_transformed <- transformed_response

# 4. Fit the linear model using the transformed response variable
linear_model_transformed <- lm(medv_transformed ~ ., data = train_data[, -which(names(train_data) == "medv")])

# 5. Check residuals of the transformed model
# Get residuals
residuals_transformed <- residuals(linear_model_transformed)

# Perform Shapiro-Wilk normality test on transformed residuals
shapiro_test_transformed <- shapiro.test(residuals_transformed)

# Print the Shapiro-Wilk test result
print(shapiro_test_transformed)

    Shapiro-Wilk normality test

data:  residuals_transformed
W = 0.90138, p-value < 2.2e-16
# Plot residuals to visually check normality
par(mfrow = c(1, 2))  # Set up the plotting area
hist(residuals_transformed, main = "Transformed Residuals Histogram", xlab = "Residuals", col = "lightblue", border = "black")
qqnorm(residuals_transformed, main = "QQ-Plot of Transformed Residuals")
qqline(residuals_transformed, col = "red")  # Add a reference line

The residuals still deviate from normality, so further adjustments, such as removing outliers, appear necessary.

1.5. Check for heteroscedasticity in the linear regression

library(MASS)      # For the Boston dataset
library(lmtest)    # For the Breusch-Pagan test
# Assuming linear_model_transformed is your fitted model

# 1. Residuals vs Fitted Plot
par(mfrow = c(1, 1))  # Reset plotting area
plot(linear_model_transformed$fitted.values, 
     residuals_transformed, 
     main = "Residuals vs Fitted",
     xlab = "Fitted Values",
     ylab = "Residuals")
abline(h = 0, col = "red", lty = 2)  # Add a horizontal line at 0

# 2 Breusch-Pagan Test
bp_test <- bptest(linear_model_transformed)


print(bp_test)

    studentized Breusch-Pagan test

data:  linear_model_transformed
BP = 65.122, df = 13, p-value = 6.265e-09

1.6. Remove Outliers Using the interquartile range (IQR) Method

# Load necessary libraries
library(MASS)      # For the Boston dataset
library(dplyr)     # For data manipulation

# Load the BostonHousing dataset
data("Boston", package = "MASS")
train_data <- Boston

# Function to remove outliers using IQR
remove_outliers_iqr <- function(data) {
  # Calculate the IQR for the entire dataset
  Q1 <- apply(data, 2, quantile, probs = 0.25, na.rm = TRUE)
  Q3 <- apply(data, 2, quantile, probs = 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  
  # Determine lower and upper bounds for outliers
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  
  # Identify outliers
  is_outlier <- apply(data, 1, function(row) {
    any(row < lower_bound | row > upper_bound)
  })
  
  # Return the data without outliers
  return(data[!is_outlier, ])
}

# Remove outliers from the numeric part of the dataset
train_data_no_outliers <- remove_outliers_iqr(train_data[, sapply(train_data, is.numeric)])

# View the dimensions of the dataset before and after outlier removal
cat("Original dimensions:", dim(train_data), "\n")
Original dimensions: 506 14 
cat("New dimensions after outlier removal:", dim(train_data_no_outliers), "\n")
New dimensions after outlier removal: 268 14 

1.7. Rechecking Linear Regression Assumptions: Normality and Homoscedasticity

# Load necessary libraries
library(MASS)  
library(ggplot2) 


# 1. Fit a new linear model after removing outliers
linear_model_no_outliers <- lm(medv ~ ., data = train_data_no_outliers)

# 2. Get residuals of the new model
residuals_no_outliers <- residuals(linear_model_no_outliers)

# 3. Perform the Shapiro-Wilk test on residuals
shapiro_test_no_outliers <- shapiro.test(residuals_no_outliers)

# Print the Shapiro-Wilk test result
print(shapiro_test_no_outliers)

    Shapiro-Wilk normality test

data:  residuals_no_outliers
W = 0.96002, p-value = 9.3e-07
# 4. Plot residuals vs. fitted values
fitted_values_no_outliers <- linear_model_no_outliers$fitted.values

# Create the residuals vs. fitted values plot
ggplot(data = NULL, aes(x = fitted_values_no_outliers, y = residuals_no_outliers)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Residuals vs Fitted Values (No Outliers)",
       x = "Fitted Values",
       y = "Residuals") +
  theme_minimal()

The linear model fit seems to have improved after outlier removal.

1.8. Let’s train a new lm model

linear_model_no_outliers <- lm(medv ~ ., data = train_data_no_outliers)
summary(linear_model_no_outliers)

Call:
lm(formula = medv ~ ., data = train_data_no_outliers)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.0083 -1.5149 -0.1994  1.0568 10.9166 

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 12.725605   7.259710   1.753  0.08082 .  
crim        -1.001797   0.210472  -4.760 3.25e-06 ***
zn          -0.031290   0.023353  -1.340  0.18148    
indus       -0.010376   0.039843  -0.260  0.79474    
chas               NA         NA      NA       NA    
nox         -1.038019   3.166085  -0.328  0.74329    
rm           5.127181   0.510435  10.045  < 2e-16 ***
age         -0.044874   0.009147  -4.906 1.66e-06 ***
dis         -0.688521   0.160898  -4.279 2.66e-05 ***
rad          0.429627   0.071049   6.047 5.23e-09 ***
tax         -0.013450   0.002595  -5.183 4.44e-07 ***
ptratio     -0.676678   0.101284  -6.681 1.48e-10 ***
black        0.002690   0.014621   0.184  0.85419    
lstat       -0.156437   0.049925  -3.133  0.00193 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.556 on 255 degrees of freedom
Multiple R-squared:  0.7194,    Adjusted R-squared:  0.7062 
F-statistic: 54.47 on 12 and 255 DF,  p-value: < 2.2e-16

The new model shows improved residual behavior (a smaller residual range and standard error) and focuses on more significant predictors after removing outliers. Note that chas is dropped entirely ("1 not defined because of singularities"): its quartiles are all 0, so the IQR rule treats every tract bordering the Charles River as an outlier, leaving chas constant in the reduced data. While the model captures a strong relationship between medv and several key variables, the overall explained variance (R-squared) is slightly lower. The initial model, while seemingly stronger in terms of explanatory power, may have been affected by the presence of outliers and less significant predictors, leading to potentially misleading interpretations.

The F-statistic in the initial model is higher, indicating stronger overall significance with more predictors included. However, the p-values for both models are very low, indicating strong evidence that at least one predictor is significantly related to medv.

1.9. Run Recursive Feature Elimination (RFE)

library(MASS)        # For the BostonHousing dataset
library(caret)       # For RFE
library(ggplot2)     # For visualization


#| label: Set a seed for reproducibility
set.seed(1310)

#| label: Split data into features and target variable
X <- Boston[, -14]  # Exclude the target variable 'medv'
y <- Boston$medv     # Target variable

#| label: Control for RFE
control <- rfeControl(functions = lmFuncs, # Using linear model for RFE
                      method = "cv",      # Cross-validation
                      number = 12)        # Number of folds

#| label: Run RFE
results <- rfe(X, y, sizes = c(1:13), rfeControl = control)

results

Recursive feature selection

Outer resampling method: Cross-Validated (12 fold) 

Resampling performance over subset size:

 Variables  RMSE Rsquared   MAE RMSESD RsquaredSD  MAESD Selected
         1 8.265   0.1994 5.983 1.0090    0.07826 0.6922         
         2 6.173   0.5502 4.257 1.2359    0.15372 0.5350         
         3 6.055   0.5716 4.223 1.2578    0.15682 0.5809         
         4 6.009   0.5768 4.188 1.2484    0.14971 0.6494         
         5 5.479   0.6415 3.828 1.3459    0.16013 0.6553         
         6 4.914   0.7126 3.500 0.9205    0.09695 0.5342         
         7 4.918   0.7129 3.531 0.8828    0.09344 0.5247         
         8 4.886   0.7176 3.517 0.8720    0.09121 0.5094         
         9 4.872   0.7185 3.495 0.8223    0.08614 0.4909         
        10 4.864   0.7190 3.483 0.8087    0.08362 0.4886         
        11 4.811   0.7267 3.449 0.8313    0.08045 0.5536         
        12 4.793   0.7301 3.397 0.8059    0.07569 0.5364        *
        13 4.796   0.7296 3.396 0.8092    0.07594 0.5456         

The top 5 variables (out of 12):
   nox, rm, chas, dis, ptratio
#| label: Plot the results
plot(results, type = c("g", "o"))

#| label: Get the selected features
selected_features <- predictors(results)
print(selected_features)
 [1] "nox"     "rm"      "chas"    "dis"     "ptratio" "lstat"   "rad"    
 [8] "crim"    "zn"      "indus"   "tax"     "black"  
#| label: Create a new dataset with only the selected features
reduced_data <- Boston[, c(selected_features, "medv")]

This plot shows the relationship between the number of variables used in the model and the corresponding root mean square error (RMSE) from cross-validation; lower RMSE values indicate better model performance. The RMSE starts at around 8 when only one variable is used. As more variables are added (from 2 to 4), the RMSE decreases substantially, indicating that additional features improve predictive performance. Beyond roughly 4 to 5 variables, however, the RMSE stabilizes and levels off, suggesting that additional variables contribute little to the model. We can therefore conclude that using more features than necessary (beyond about 5) may not improve performance and could lead to overfitting.
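To make that conclusion concrete, the sketch below fits a compact linear model on only the five variables ranked highest by RFE (nox, rm, chas, dis, ptratio) so its adjusted R-squared can be compared against the roughly 0.73 achieved by the full 13-variable model.

# Sketch: a compact model using only the top 5 RFE variables
top5_model <- lm(medv ~ nox + rm + chas + dis + ptratio, data = Boston)
summary(top5_model)$adj.r.squared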

2. Lasso Regression

2.1 Load libraries

library(glmnet)   # for Lasso regression (glmnet, cv.glmnet)

2.2 Prepare the Data
# Define the target variable and predictors
x <- as.matrix(Boston[, -which(names(Boston) == "medv")])  # Predictors
y <- Boston$medv  # Target variable

# Standardize the predictors (mean = 0, sd = 1)
x <- scale(x)

2.3. Split the Data into Training and Testing Sets

set.seed(1310)  # For reproducibility
index <- sample(1:nrow(Boston), 0.8 * nrow(Boston))  # 80% training data
train_x <- x[index, ]
train_y <- y[index]
test_x <- x[-index, ]
test_y <- y[-index]

2.4 Fit the Lasso Regression Model using the glmnet function

#Specify alpha = 1 for Lasso.

# Fit the Lasso regression model
lasso_model <- glmnet(train_x, train_y, alpha = 1)

# Plot the Lasso path
plot(lasso_model, xvar = "lambda", label = TRUE)

2.5. Cross-Validation to Select the Best Lambda

# Use cross-validation to find the optimal value of lambda.

# Perform cross-validation
cv_lasso <- cv.glmnet(train_x, train_y, alpha = 1)

# Plot the cross-validation curve
plot(cv_lasso)

# Get the best lambda
best_lambda <- cv_lasso$lambda.min
print(best_lambda)
[1] 0.04316211

2.6 Make Predictions on the Test Set

# Make predictions using the best lambda
predictions <- predict(lasso_model, s = best_lambda, newx = test_x)
head(predictions)
         s1
6  25.69031
9  11.11459
18 17.01076
21 12.30455
30 21.37877
31 11.38610

2.7 Evaluate the Model Performance

# Calculate the Root Mean Square Error (RMSE) to evaluate the performance.
# Calculate RMSE
rmse <- sqrt(mean((predictions - test_y)^2))
print(paste("Root Mean Square Error (RMSE):", rmse))
[1] "Root Mean Square Error (RMSE): 4.80147983912149"

The RMSE of approximately 4.80 indicates the average deviation of the predicted housing prices from the actual prices; in other words, the model's predictions are off by about $4,800 on average (medv is expressed in thousands of dollars). Lower values are desirable, as they indicate better predictive performance.
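Beyond the RMSE, it is informative to see which predictors the Lasso actually keeps at the chosen penalty; the sketch below extracts the non-zero coefficients from the cv_lasso fit at lambda.min.

# Sketch: coefficients retained by the Lasso at the cross-validated lambda
lasso_coefs <- as.matrix(coef(cv_lasso, s = "lambda.min"))
lasso_coefs[lasso_coefs[, 1] != 0, , drop = FALSE]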

3. Random Forest model.

library(randomForest)  # For Random Forest modeling
library(MASS)          # For the Boston dataset

# Load the BostonHousing dataset
data("Boston", package = "MASS")  # Specify the package to avoid confusion
train_data <- Boston

# Set the seed for reproducibility
set.seed(1310)

# Fit a Random Forest model
rf_model <- randomForest(medv ~ ., data = train_data, importance = TRUE, ntree = 500)

# Print the model summary
print(rf_model)

Call:
 randomForest(formula = medv ~ ., data = train_data, importance = TRUE,      ntree = 500) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 4

          Mean of squared residuals: 9.864995
                    % Var explained: 88.31

3.2. Model Evaluation: predictions on the test set

# Load necessary libraries
library(randomForest)
library(caret)  # For the confusionMatrix function

# Assuming train_data is your training dataset and test_data is your testing dataset
set.seed(1310)
train_data <- Boston
# Split the dataset into training and testing sets
index <- sample(1:nrow(train_data), 0.8 * nrow(train_data))
test_data <- train_data[-index, ]   # hold out the test rows first so train and test stay disjoint
train_data <- train_data[index, ]

# Fit the random forest model on the training data
rf_model <- randomForest(medv ~ ., data = train_data)

# Predict on the test set
predictions <- predict(rf_model, newdata = test_data)
head(predictions)
      91      128       14      176      359      224 
22.90143 16.19577 20.19669 28.79565 21.85850 28.36583 

3.3. Evaluate performance

# Calculate performance metrics
mae <- mean(abs(predictions - test_data$medv))
mse <- mean((predictions - test_data$medv)^2)
rmse <- sqrt(mse)
r_squared <- 1 - (sum((predictions - test_data$medv)^2) / sum((mean(test_data$medv) - test_data$medv)^2))

# Print performance metrics
cat("Mean Absolute Error (MAE):", mae, "\n")
Mean Absolute Error (MAE): 1.127824 
cat("Mean Squared Error (MSE):", mse, "\n")
Mean Squared Error (MSE): 2.504633 
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
Root Mean Squared Error (RMSE): 1.582603 
cat("R-squared:", r_squared, "\n")
R-squared: 0.9751725 

3.4. Feature importance

# Get feature importance
importance_values <- randomForest::importance(rf_model)

# Create a data frame with feature names and their importance values
importance_df <- data.frame(Feature = rownames(importance_values), 
                             Importance = importance_values[, 1])

# Order the data frame by importance in descending order
importance_df <- importance_df[order(importance_df$Importance, decreasing = TRUE), ]

# Print feature importance
print(importance_df)
        Feature Importance
lstat     lstat 11119.0152
rm           rm 10551.2633
crim       crim  2604.4510
nox         nox  2522.8696
indus     indus  2030.6634
dis         dis  1992.7843
ptratio ptratio  1938.3144
tax         tax  1121.7062
age         age   880.9758
black     black   656.9633
rad         rad   310.3773
zn           zn   240.7520
chas       chas   205.6359

3.4.1. Visualizing Feature Importance

library(randomForest)
library(ggplot2)

# Remove 'log_medv' from the dataset
BostonHousing_no_log_medv <- BostonHousing[, !names(BostonHousing) %in% "log_medv"]

# Fit random forest model excluding 'log_medv'
rf_model <- randomForest(medv ~ ., data = BostonHousing_no_log_medv)

# Check the class of the rf_model to ensure it's correct
print(class(rf_model))
[1] "randomForest.formula" "randomForest"        
# Extract variable importance
importance_values <- randomForest::importance(rf_model)

# Convert importance values to a data frame
importance_df <- as.data.frame(importance_values)

# Rename the importance column; without importance = TRUE the measure reported
# by randomForest is node purity (IncNodePurity), not % increase in MSE
colnames(importance_df)[1] <- "IncNodePurity"

# Add feature names for easier plotting
importance_df$Feature <- rownames(importance_df)

# Plot using ggplot2
ggplot(importance_df, aes(x = reorder(Feature, IncNodePurity), y = IncNodePurity)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Variable Importance in Predicting medv", x = "Feature", y = "Increase in Node Purity")

3.5 Hyperparameter Tuning: Optimize the Model

# Load necessary libraries
library(ranger)

# Set up grid search manually
best_model <- NULL
best_mse <- Inf

# Hyperparameter grid
mtry_values <- c(1, 2, 3, 4, 5)
min_node_size_values <- c(1, 5, 10)

# Train a model for each combination of mtry and min.node.size
for (mtry_val in mtry_values) {
  for (min_node_size_val in min_node_size_values) {
    # Train the ranger model
    rf_model <- ranger(medv ~ ., data = train_data, 
                       mtry = mtry_val, min.node.size = min_node_size_val, 
                       splitrule = "variance", importance = "impurity")
    
    # Evaluate model's performance using out-of-bag error
    mse <- rf_model$prediction.error
    
    # Keep track of the best model
    if (mse < best_mse) {
      best_mse <- mse
      best_model <- rf_model
    }
  }
}

# Print the best model's hyperparameters and MSE
print(paste("Best MSE:", best_mse))
[1] "Best MSE: 10.74783320717"
print(best_model)
Ranger result

Call:
 ranger(medv ~ ., data = train_data, mtry = mtry_val, min.node.size = min_node_size_val,      splitrule = "variance", importance = "impurity") 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      404 
Number of independent variables:  13 
Mtry:                             4 
Target node size:                 1 
Variable importance mode:         impurity 
Splitrule:                        variance 
OOB prediction error (MSE):       10.74783 
R squared (OOB):                  0.8823616 

Alternatively, we can also fit the random forest model directly.

library(ranger)

# Fit the random forest model directly
rf_model <- ranger(medv ~ ., data = train_data, mtry = 3, min.node.size = 5)

# Print model summary
print(rf_model)
Ranger result

Call:
 ranger(medv ~ ., data = train_data, mtry = 3, min.node.size = 5) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      404 
Number of independent variables:  13 
Mtry:                             3 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       12.11752 
R squared (OOB):                  0.86737 

3.6. Visualize predictions

set.seed(1310)
# Use the data without the log_medv column so the (log-transformed) target does not leak into the predictors
index <- sample(1:nrow(BostonHousing_no_log_medv), 0.8 * nrow(BostonHousing_no_log_medv))
train_data <- BostonHousing_no_log_medv[index, ]
test_data <- BostonHousing_no_log_medv[-index, ]

# Train the random forest model
rf_model <- randomForest(medv ~ ., data = train_data)

# Predict on the test set
predictions <- predict(rf_model, newdata = test_data)

# Ensure the predictions are stored correctly in test_data
test_data$predictions <- predictions

# Check that predictions were added correctly
print(nrow(test_data))        # Should be the same as length(predictions)
[1] 102
print(length(predictions))     # Check length of predictions
[1] 102

3.7. Model Tuning with caret: Cross-Validated Selection of mtry

# Load necessary libraries
library(caret)
library(randomForest)


BostonHousing <- Boston
# Ensure BostonHousing dataset is available
if (!exists("BostonHousing")) {
  stop("BostonHousing is not loaded.")
}

# Define your training control
control <- trainControl(method = "cv", number = 5)

# Split the dataset into training and testing sets
set.seed(1310)  # For reproducibility
index <- sample(1:nrow(BostonHousing), 0.8 * nrow(BostonHousing))
train_data <- BostonHousing[index, ]  # Training data
test_data <- BostonHousing[-index, ]    # Testing data

# Define a tuning grid
tune_grid <- expand.grid(mtry = c(1, 2, 3, 4, 5))

# Train the random forest model using caret
tuned_rf <- caret::train(medv ~ ., data = train_data, method = "rf", 
                         trControl = control, tuneGrid = tune_grid)

# Print best tuning parameters
print(tuned_rf$bestTune)
  mtry
5    5
This output indicates that the model we trained using caret has identified the best value for the mtry parameter as 5. This means that when building the random forest, the model will randomly select 5 features to consider at each split in the trees. This tuning process is crucial because selecting the right number of features can help improve the model’s performance by preventing overfitting and ensuring that it generalizes well to unseen data. This model is now ready for deployment.
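Before treating the tuned model as deployment-ready, its performance on the held-out test split should be checked; the sketch below scores tuned_rf on test_data with caret's RMSE() and R2() helpers.

# Sketch: evaluate the tuned random forest on the held-out test set
tuned_predictions <- predict(tuned_rf, newdata = test_data)
caret::RMSE(tuned_predictions, test_data$medv)
caret::R2(tuned_predictions, test_data$medv)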

4. XGBoost

4.1. Load the dataset

library(xgboost)
library(caret)  # For data splitting and evaluation
library(MASS)   # To access the Boston dataset
BostonHousing <- Boston
head(BostonHousing, n=3)
     crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
  medv
1 24.0
2 21.6
3 34.7

4.2. Data Preparation

# Set seed for reproducibility
set.seed(1310)

# Split the dataset into training and testing sets (80/20 split)
index <- sample(1:nrow(BostonHousing), 0.8 * nrow(BostonHousing))
train_data <- BostonHousing[index, ]
test_data <- BostonHousing[-index, ]

# Separate features and target variable
train_x <- as.matrix(train_data[, -which(names(train_data) == "medv")])  # Features
train_y <- train_data$medv  # Target variable

test_x <- as.matrix(test_data[, -which(names(test_data) == "medv")])  # Features
test_y <- test_data$medv  # Target variable

4.3. Train the XGBoost Model

# Train the XGBoost model
dtrain <- xgb.DMatrix(data = train_x, label = train_y)  # Create DMatrix for XGBoost
params <- list(objective = "reg:squarederror",   # Regression problem
               eval_metric = "rmse",           # Evaluation metric
               max_depth = 3,                  # Maximum tree depth
               eta = 0.1,                       # Learning rate
               nthread = 2)                     # Number of threads

# Train the model
xgb_model <- xgb.train(params = params,
                        data = dtrain,
                        nrounds = 100)  # Number of boosting rounds

4.4 Make predictions

# Create DMatrix for test data
dtest <- xgb.DMatrix(data = test_x)

# Make predictions
predictions <- predict(xgb_model, newdata = dtest)

4.5 Evaluate the Model

#Evaluate the model's performance by comparing the predicted values to the actual values.

# Calculate RMSE
rmse <- sqrt(mean((predictions - test_y)^2))
cat("Root Mean Square Error (RMSE):", rmse, "\n")
Root Mean Square Error (RMSE): 3.073606 

An RMSE of 3.073606 means that, on average, the model’s predictions are about $3,073.61 off from the actual median home values in the Boston housing dataset.
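Because the number of boosting rounds was fixed at 100 without validation, a cross-validated check is prudent; the sketch below reuses the params and dtrain objects from above with xgb.cv and early stopping to estimate how many rounds the data actually support.

# Sketch: cross-validate the number of boosting rounds with early stopping
cv_results <- xgb.cv(params = params,
                     data = dtrain,
                     nrounds = 200,
                     nfold = 5,
                     early_stopping_rounds = 10,
                     verbose = 0)
cv_results$best_iteration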

4.6 Feature Importance

# visualize the feature importance to understand which features contributed the most to the predictions.

importance_matrix <- xgb.importance(feature_names = colnames(train_x), model = xgb_model)

# Plot feature importance
xgb.plot.importance(importance_matrix)

III. Control and compare: Applying the XGBoost model to the penguins dataset

Load data

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
dim(penguins)
[1] 344   8

Preprocess the Data

# Load the data
data <- na.omit(penguins)

# Encode species as a factor and then as numeric labels
data$species <- as.numeric(as.factor(data$species)) - 1

# Convert categorical variables (island, sex) to dummy variables
data$island <- as.numeric(as.factor(data$island))
data$sex <- as.numeric(as.factor(data$sex))

# Remove any rows with NA values
data <- na.omit(data)

# Check the data structure
str(data)
tibble [333 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : num [1:333] 0 0 0 0 0 0 0 0 0 0 ...
 $ island           : num [1:333] 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:333] 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
 $ bill_depth_mm    : num [1:333] 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
 $ flipper_length_mm: int [1:333] 181 186 195 193 190 181 195 182 191 198 ...
 $ body_mass_g      : int [1:333] 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
 $ sex              : num [1:333] 2 1 1 1 2 1 2 1 2 2 ...
 $ year             : int [1:333] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
 - attr(*, "na.action")= 'omit' Named int [1:11] 4 9 10 11 12 48 179 219 257 269 ...
  ..- attr(*, "names")= chr [1:11] "4" "9" "10" "11" ...

Split the Data into Training and Testing Sets

# Set seed for reproducibility
set.seed(1310)

# Split the data (70% training, 30% testing)
trainIndex <- createDataPartition(data$species, p = 0.7, list = FALSE)
train_data <- data[trainIndex, ]
test_data <- data[-trainIndex, ]

# Separate features and labels
train_x <- as.matrix(train_data[, -1])  # Features
train_y <- train_data$species           # Labels

test_x <- as.matrix(test_data[, -1])    # Features
test_y <- test_data$species             # Labels

Train an XGBoost Model

# Create DMatrix for XGBoost
dtrain <- xgb.DMatrix(data = train_x, label = train_y)

# Set XGBoost parameters for classification
params <- list(
  objective = "multi:softmax", # Multiclass classification
  num_class = 3,               # Number of classes (Adelie, Chinstrap, Gentoo)
  eval_metric = "mlogloss"     # Multi-class log loss
)

# Train the XGBoost model
xgb_model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 100,              # Number of boosting rounds
  verbose = 0                 # Silent mode
)

Make Predictions and Evaluate the Model

# Create DMatrix for the test data
dtest <- xgb.DMatrix(data = test_x)

# Predict species using the trained model
preds <- predict(xgb_model, dtest)

# Convert predictions back to factor levels
preds <- as.factor(preds)

# Calculate accuracy
accuracy <- mean(preds == test_y)
print(paste("Accuracy: ", round(accuracy * 100, 2), "%"))
[1] "Accuracy:  100 %"

The XGBoost model has achieved 100% accuracy on the Palmer Penguins dataset. Such a result is often a sign of overfitting, especially if the model is performing too well on the test data.
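A single accuracy figure hides where the classes might be confused, so a per-class breakdown is a sensible first check on that perfect score; the sketch below builds a confusion matrix with caret after coercing both vectors to factors with the same 0/1/2 labels.

# Sketch: per-class breakdown of the penguin predictions
caret::confusionMatrix(factor(preds, levels = c(0, 1, 2)),
                       factor(test_y, levels = c(0, 1, 2)))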

IV.Findings & Discussion

In selecting algorithms for analyzing the Boston Housing dataset, I was guided by my exploratory data analysis (EDA) and my focus on prediction rather than classification. With its 14 features, the Boston Housing data set presents a rich landscape that benefits from robust modeling techniques. I therefore chose linear regression, lasso regression, random forest, and XGBoost due to their effectiveness in handling the complexities of this dataset. Conversely, I would hesitate to use the same algorithms on the Palmer Penguins dataset, given its smaller size and simpler structure.

In terms of correlation and labels, the EDA revealed significant correlations among many columns, particularly between crime rates and the number of rooms, which influence the value of houses. This multicollinearity underscores the importance of regularization methods like lasso regression, which prevent overfitting by penalizing the coefficients of correlated features. Additionally, the presence of a clear target variable, the median value (medv), guided my choice of algorithms suitable for regression tasks.

With regard to pros and cons, linear regression is simple to interpret and efficient for smaller datasets; however, it assumes a linear relationship, which can mean missing complex interactions. Lasso regression performs variable selection, effectively reducing dimensionality and the risk of overfitting, but it may underperform if the true relationship is not sparse. Random forest excels at handling non-linear relationships and reduces variance, but it is more complex, less interpretable, and requires greater computational resources. XGBoost offers high predictive performance and handles missing values effectively, yet it can be challenging to tune and may overfit if not properly managed.

In general, it is the characteristics of the dataset that dictate the choice of algorithms. In this analysis, for instance, I wanted each algorithm's strengths and weaknesses to align with the Boston dataset's features and structure.

In terms of trust, I would trust the results from the XGBoost model over the other models due to its robustness and predictive accuracy. While linear and lasso regression provide valuable insights, the complexity of relationships in the Boston Housing dataset suggests that advanced models like random forest and XGBoost yield more reliable predictions.

With regard to the potential for error in analysis, I would say that both excessive and minimal datasets can introduce errors. Overloading with too much data can obscure significant patterns, while using too little may fail to capture necessary variability, leading to inaccurate conclusions. A balanced approach, with a representative dataset, is crucial for robust analysis.

On the importance of comparing analyses across datasets: such comparisons highlight the importance of context. The Boston Housing dataset's multifaceted features and focus on prediction demand a more complex modeling approach, while simpler datasets like the Palmer Penguins lend themselves to straightforward analyses. This divergence reinforces the necessity of tailoring algorithm choices to the specific characteristics and requirements of each dataset.

Overall, the combination of EDA insights, the nature of the analysis, and the robustness of the models guided my selection of algorithms, ensuring that the generated insights are reliable and actionable.