In this blog post, we'll explore Random Forest analysis in R. We'll generate a random dataset and use the randomForest package to build a predictive model and evaluate the importance of the explanatory variables in predicting a binary categorical response variable.
Let’s start by creating a synthetic dataset for demonstration purposes.
set.seed(123)
n <- 1000

# Three independent standard-normal predictors and a randomly assigned
# binary target; by construction, there is no real signal to learn
random_data <- data.frame(
  var1 = rnorm(n),
  var2 = rnorm(n),
  var3 = rnorm(n),
  target = as.factor(sample(c(0, 1), n, replace = TRUE))
)
The dataset contains three explanatory variables (var1, var2, var3) and a binary response variable (target). Because target is sampled independently of the predictors, any model we fit should perform at roughly chance level; keep that in mind when we look at the results.
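Before plotting anything, a quick sanity check on the structure and class balance is worthwhile (a minimal look using base R):
# Inspect the structure and the class balance of the target
str(random_data)
table(random_data$target)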
Let's visualize two of the explanatory variables against each other, colored by the response, using a scatter plot.
library(ggplot2)

# Scatter plot of var1 vs. var2, colored by class
ggplot(random_data, aes(x = var1, y = var2, color = target)) +
  geom_point() +
  labs(title = "Scatter Plot of var1 vs. var2",
       x = "var1",
       y = "var2",
       color = "Target")
Next, we’ll split the dataset into training and testing sets.
library(caret)
## Loading required package: lattice
# Stratified 80/20 train/test split
set.seed(456)
split_index <- createDataPartition(random_data$target, p = 0.8, list = FALSE)
train_data <- random_data[split_index, ]
test_data <- random_data[-split_index, ]
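createDataPartition samples within each level of target, so the split is stratified. A quick check with base R confirms the class proportions match across the two sets:
# Class proportions should be (nearly) identical in both sets
prop.table(table(train_data$target))
prop.table(table(test_data$target))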
Now, let’s use the randomForest package to build a
Random Forest model.
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
# Fit a random forest of 100 trees; importance = TRUE stores
# permutation-based importance scores alongside the Gini measure
rf_model <- randomForest(target ~ var1 + var2 + var3, data = train_data,
                         ntree = 100, importance = TRUE)
# Print the model summary
print(rf_model)
##
## Call:
## randomForest(formula = target ~ var1 + var2 + var3, data = train_data, ntree = 100, importance = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 54.06%
## Confusion matrix:
## 0 1 class.error
## 0 190 208 0.5226131
## 1 225 178 0.5583127
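The default mtry for classification is floor(sqrt(p)), which with p = 3 predictors gives the single variable tried at each split reported above. To see how the out-of-bag error evolves as trees are added (it typically flattens out well before ntree), we can use the built-in plot method, sketched here:
# OOB and per-class error rates as a function of the number of trees
plot(rf_model, main = "Error Rates by Number of Trees")
legend("topright", legend = colnames(rf_model$err.rate), col = 1:3, lty = 1:3)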
We will evaluate the model's performance on the test set and explore the feature importance scores. Given the random target, the OOB error rate of roughly 54% above already sits at about chance level, and we should expect the same on the test set.
# Make predictions on the test set
predictions <- predict(rf_model, test_data)
# Evaluate accuracy
accuracy <- sum(predictions == test_data$target) / nrow(test_data)
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.4271357
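Raw accuracy hides per-class behavior. Since caret is already loaded, its confusionMatrix function gives a fuller breakdown (sensitivity, specificity, and a test against the no-information rate):
# Detailed test-set evaluation: confusion matrix plus summary statistics
confusionMatrix(predictions, test_data$target)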
# Extract feature importance scores
importance_scores <- importance(rf_model)
# Print feature importance scores
print(importance_scores)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## var1 -2.15170019 -0.4116576 -1.6022841 130.9438
## var2 0.08239299 1.1797478 0.8992514 135.2467
## var3 -1.46616151 -0.8518437 -1.5167370 133.8140
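The randomForest package also ships varImpPlot, a dot chart that shows both importance measures side by side. With a random target, MeanDecreaseAccuracy values hovering around or below zero, as in the table above, are exactly what we'd expect.
# Dot chart of MeanDecreaseAccuracy and MeanDecreaseGini for each predictor
varImpPlot(rf_model, main = "Variable Importance")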
Let's create a bar plot of the feature importance scores; we'll use the MeanDecreaseGini column, which is always non-negative and easiest to read on a bar chart.
# Bar plot of the mean decrease in Gini for each predictor
barplot(importance_scores[, "MeanDecreaseGini"],
        names.arg = rownames(importance_scores),
        main = "Feature Importance",
        ylab = "Mean Decrease in Gini",
        col = "skyblue")
In this blog post, we've covered the basics of Random Forest analysis in R. We generated a random dataset, split it into training and testing sets, built a Random Forest model, and evaluated its performance. As expected for a response with no real relationship to the predictors, both the OOB error and the test accuracy landed near chance, a useful reminder that accuracy and importance scores only mean something relative to a sensible baseline. Visualizations such as scatter plots and bar plots help us understand the data and interpret the model results effectively.