In this blog post, we'll explore Random Forest analysis in R. We'll generate a random dataset and use the randomForest package to build a predictive model and evaluate the importance of the explanatory variables in predicting a binary categorical response variable.
Let’s start by creating a synthetic dataset for demonstration purposes.
set.seed(123)
n <- 1000

# Three independent standard-normal predictors and a randomly assigned
# binary target; by construction, there is no real signal to learn
random_data <- data.frame(
  var1 = rnorm(n),
  var2 = rnorm(n),
  var3 = rnorm(n),
  target = as.factor(sample(c(0, 1), n, replace = TRUE))
)
The dataset contains three explanatory variables (var1, var2, var3) and a binary response variable (target). Because target is sampled independently of the predictors, any model we fit should perform at roughly chance level; keep that in mind when we look at the results.
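Before plotting anything, a quick sanity check on the structure and class balance is worthwhile (a minimal look using base R):
# Inspect the structure and the class balance of the target
str(random_data)
table(random_data$target)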
Let's visualize two of the explanatory variables against each other, colored by the response, using a scatter plot.
library(ggplot2)

# Scatter plot of var1 vs. var2, colored by class
ggplot(random_data, aes(x = var1, y = var2, color = target)) +
  geom_point() +
  labs(title = "Scatter Plot of var1 vs. var2",
       x = "var1",
       y = "var2",
       color = "Target")
Next, we’ll split the dataset into training and testing sets.
library(caret)
## Loading required package: lattice
# Stratified 80/20 train/test split
set.seed(456)
split_index <- createDataPartition(random_data$target, p = 0.8, list = FALSE)
train_data <- random_data[split_index, ]
test_data <- random_data[-split_index, ]
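createDataPartition samples within each level of target, so the split is stratified. A quick check with base R confirms the class proportions match across the two sets:
# Class proportions should be (nearly) identical in both sets
prop.table(table(train_data$target))
prop.table(table(test_data$target))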
Now, let’s use the randomForest package to build a
Random Forest model.
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
# Fit a random forest of 100 trees; importance = TRUE stores
# permutation-based importance scores alongside the Gini measure
rf_model <- randomForest(target ~ var1 + var2 + var3, data = train_data,
                         ntree = 100, importance = TRUE)
# Print the model summary
print(rf_model)
##
## Call:
## randomForest(formula = target ~ var1 + var2 + var3, data = train_data, ntree = 100, importance = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 54.06%
## Confusion matrix:
## 0 1 class.error
## 0 190 208 0.5226131
## 1 225 178 0.5583127
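The default mtry for classification is floor(sqrt(p)), which with p = 3 predictors gives the single variable tried at each split reported above. To see how the out-of-bag error evolves as trees are added (it typically flattens out well before ntree), we can use the built-in plot method, sketched here:
# OOB and per-class error rates as a function of the number of trees
plot(rf_model, main = "Error Rates by Number of Trees")
legend("topright", legend = colnames(rf_model$err.rate), col = 1:3, lty = 1:3)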
We will evaluate the model's performance on the test set and explore the feature importance scores. Given the random target, the OOB error rate of roughly 54% above already sits at about chance level, and we should expect the same on the test set.
# Make predictions on the test set
predictions <- predict(rf_model, test_data)
# Evaluate accuracy
accuracy <- sum(predictions == test_data$target) / nrow(test_data)
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.4271357
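Raw accuracy hides per-class behavior. Since caret is already loaded, its confusionMatrix function gives a fuller breakdown (sensitivity, specificity, and a test against the no-information rate):
# Detailed test-set evaluation: confusion matrix plus summary statistics
confusionMatrix(predictions, test_data$target)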
# Extract feature importance scores
importance_scores <- importance(rf_model)
# Print feature importance scores
print(importance_scores)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## var1 -2.15170019 -0.4116576 -1.6022841 130.9438
## var2 0.08239299 1.1797478 0.8992514 135.2467
## var3 -1.46616151 -0.8518437 -1.5167370 133.8140
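The randomForest package also ships varImpPlot, a dot chart that shows both importance measures side by side. With a random target, MeanDecreaseAccuracy values hovering around or below zero, as in the table above, are exactly what we'd expect.
# Dot chart of MeanDecreaseAccuracy and MeanDecreaseGini for each predictor
varImpPlot(rf_model, main = "Variable Importance")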
Let's create a bar plot of the feature importance scores; we'll use the MeanDecreaseGini column, which is always non-negative and easiest to read on a bar chart.
# Bar plot of the mean decrease in Gini for each predictor
barplot(importance_scores[, "MeanDecreaseGini"],
        names.arg = rownames(importance_scores),
        main = "Feature Importance",
        ylab = "Mean Decrease in Gini",
        col = "skyblue")
In this blog post, we've covered the basics of Random Forest analysis in R. We generated a random dataset, split it into training and testing sets, built a Random Forest model, and evaluated its performance. As expected for a response with no real relationship to the predictors, both the OOB error and the test accuracy landed near chance, a useful reminder that accuracy and importance scores only mean something relative to a sensible baseline. Visualizations such as scatter plots and bar plots help us understand the data and interpret the model results effectively.