ID: "20228034"

First, let’s generate some random data for demonstration purposes:

# Load required library
library(rpart)

# Set seed for reproducibility
set.seed(123)

# Generate random data
age <- sample(18:65, 100, replace = TRUE)
income <- sample(1000:5000, 100, replace = TRUE)
gender <- sample(c("Male", "Female"), 100, replace = TRUE)
purchase <- sample(c("Yes", "No"), 100, replace = TRUE)

# Create a data frame
data <- data.frame(age, income, gender, purchase)
head(data)
##   age income gender purchase
## 1  48   1164 Female      Yes
## 2  32   4719 Female       No
## 3  31   2074 Female      Yes
## 4  20   4145 Female      Yes
## 5  59   4249   Male      Yes
## 6  60   2385   Male      Yes

Build the decision tree using the rpart() function:

# Build the decision tree model
tree_model <- rpart(purchase ~ age + income + gender, data = data, method = "class")

Visualize the decision tree using the rpart.plot() function from the rpart.plot package:

# Load required library for plotting the decision tree
library(rpart.plot)
# Plot the decision tree
rpart.plot(tree_model)

The decision tree provides a visual representation of the rules, or splits, used to classify the data. Each internal node represents a decision based on a specific feature: a threshold for numeric variables such as age, or a set of categories for factors such as gender. The leaf nodes of the tree show the final predicted outcome.
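The same splits can also be read as plain-text rules. As a sketch, assuming rpart.plot version 3.0 or later (which provides the rpart.rules() function):

# Print one rule per leaf, ending with the predicted class
rpart.rules(tree_model)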

Print the complexity parameter (CP) table of the decision tree using the printcp() function:

# Print the decision tree information
printcp(tree_model)
## 
## Classification tree:
## rpart(formula = purchase ~ age + income + gender, data = data, 
##     method = "class")
## 
## Variables actually used in tree construction:
## [1] age    income
## 
## Root node error: 40/100 = 0.4
## 
## n= 100 
## 
##       CP nsplit rel error xerror    xstd
## 1 0.0875      0     1.000  1.000 0.12247
## 2 0.0500      2     0.825  1.275 0.12497
## 3 0.0100      4     0.725  1.150 0.12460

This shows the complexity parameter table, which helps in selecting an appropriate level of pruning to avoid overfitting. For each candidate subtree it reports the complexity parameter (CP), the number of splits, the error relative to the root node, and the cross-validated error (xerror) with its standard deviation (xstd).
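As a sketch of how this table is typically used, we can prune the tree at the CP value with the lowest cross-validated error (with purely random data like this, the selected tree may well collapse to the root, and the exact value depends on the random cross-validation folds):

# Select the CP value with the smallest cross-validated error
best_cp <- tree_model$cptable[which.min(tree_model$cptable[, "xerror"]), "CP"]

# Prune the tree at that complexity parameter
pruned_model <- prune(tree_model, cp = best_cp)

# Plot the pruned tree
rpart.plot(pruned_model)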

To predict new observations using the decision tree model, we can use the predict() function:

# Generate new hypothetical data for prediction
new_data <- data.frame(age = c(30, 40, 50), income = c(2000, 3000, 4000), gender = c("Male", "Female", "Male"))

# Predict the purchase outcome for new data
predictions <- predict(tree_model, newdata = new_data, type = "class")

# Print the predictions
print(predictions)
##   1   2   3 
##  No Yes Yes 
## Levels: No Yes

This will give us the predicted outcome (Yes/No) for each new observation based on the decision tree model.
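The predict() function can also return class probabilities instead of hard labels by setting type = "prob". A minimal sketch:

# Predict class probabilities for each new observation
probabilities <- predict(tree_model, newdata = new_data, type = "prob")
print(probabilities)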

Remember that this is just a basic example to illustrate the process of creating a decision tree in R. In practice, we might need to preprocess the data, handle missing values, and perform cross-validation for model evaluation, among other steps. Additionally, we can experiment with different parameter settings to improve the performance of the decision tree model.
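As one example of experimenting with parameter settings, rpart.control() lets us adjust the minimum node size, the complexity penalty, and the number of cross-validation folds. The values below are arbitrary choices for illustration, and ctrl and tuned_model are names of our own:

# Control parameters: allow splits on nodes with at least 10 observations,
# use a smaller complexity penalty, and 10-fold cross-validation
ctrl <- rpart.control(minsplit = 10, cp = 0.005, xval = 10)

# Refit the tree with the custom control settings
tuned_model <- rpart(purchase ~ age + income + gender, data = data,
                     method = "class", control = ctrl)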

For a more complex example, let's consider a dataset that contains information about patients and whether they have a specific medical condition. We want to build a decision tree to predict the presence or absence of the condition based on age, gender, blood pressure, cholesterol level, and smoking habits.

# Load required libraries
library(rpart)
library(rpart.plot)

# Set seed for reproducibility
set.seed(123)

# Generate random data
age <- sample(20:80, 1000, replace = TRUE)
gender <- sample(c("Male", "Female"), 1000, replace = TRUE)
blood_pressure <- sample(c("Normal", "High"), 1000, replace = TRUE, prob = c(0.8, 0.2))
cholesterol <- sample(c("Normal", "High"), 1000, replace = TRUE, prob = c(0.6, 0.4))
smoking <- sample(c("Yes", "No"), 1000, replace = TRUE, prob = c(0.3, 0.7))
medical_condition <- sample(c("Present", "Absent"), 1000, replace = TRUE)

# Create a data frame
data <- data.frame(age, gender, blood_pressure, cholesterol, smoking, medical_condition)

# Build the decision tree model
tree_model <- rpart(medical_condition ~ age + gender + blood_pressure + cholesterol + smoking, 
                    data = data, 
                    method = "class")

# Plot the decision tree
rpart.plot(tree_model)

# Print the decision tree information
printcp(tree_model)
## 
## Classification tree:
## rpart(formula = medical_condition ~ age + gender + blood_pressure + 
##     cholesterol + smoking, data = data, method = "class")
## 
## Variables actually used in tree construction:
## [1] age     gender  smoking
## 
## Root node error: 488/1000 = 0.488
## 
## n= 1000 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.057377      0   1.00000 1.00000 0.032391
## 2 0.024590      1   0.94262 1.00615 0.032395
## 3 0.012295      2   0.91803 0.98566 0.032377
## 4 0.011270      4   0.89344 0.99385 0.032386
## 5 0.010000      6   0.87090 0.97951 0.032369

# Generate new hypothetical data for prediction
new_data <- data.frame(age = c(45, 60, 35), 
                       gender = c("Male", "Female", "Female"), 
                       blood_pressure = c("High", "Normal", "High"), 
                       cholesterol = c("Normal", "High", "Normal"), 
                       smoking = c("Yes", "No", "No"))

# Predict the medical condition for new data
predictions <- predict(tree_model, newdata = new_data, type = "class")

# Print the predictions
print(predictions)
##       1       2       3 
## Present  Absent Present 
## Levels: Absent Present
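To get a rough sense of model quality, we can compare the tree's predictions on the training data against the true labels. This is only a sketch: training-set accuracy is an optimistic estimate, a proper evaluation would use a held-out test set or cross-validation, and the exact numbers depend on the seed.

# Confusion matrix of training-set predictions vs. actual labels
train_preds <- predict(tree_model, type = "class")
conf_mat <- table(Predicted = train_preds, Actual = data$medical_condition)
print(conf_mat)

# Training-set accuracy: proportion of correct predictions
sum(diag(conf_mat)) / sum(conf_mat)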