Assignment Prompt

  1. Create your own models:

     1. Based on the latest topics presented, select a dataset of your choice

     2. Create a Decision Tree where you can solve a classification or regression problem

     3. Predict the outcome of a particular feature or detail of the data used

     4. Switch variables to generate 2 decision trees, and compare the results

     5. Create a Random Forest for classification or regression, and analyze the results

  2. Consider the pros and cons of the approaches:

     1. Read ‘the bad & ugly’ aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees)

     2. Based on real cases where decision trees went wrong, how can you change this perception when using the decision tree you created to solve a real problem?

#prevent conflict with skimr and dlookr
options(kableExtra.auto_format = FALSE)

library(skimr)
## Warning: package 'skimr' was built under R version 4.0.5
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6      v purrr   0.3.4 
## v tibble  3.1.8      v dplyr   1.0.10
## v tidyr   1.2.0      v stringr 1.4.0 
## v readr   2.1.2      v forcats 0.5.2
## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'readr' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.0.5
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(rpart) #decision tree package rec'd by Practical ML in R textbook
library(rpart.plot) #decision tree display package rec'd by Practical ML in R textbook
## Warning: package 'rpart.plot' was built under R version 4.0.4
library(randomForest) #for random forest modeling
## Warning: package 'randomForest' was built under R version 4.0.4
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(caret) #for confusionMatrix()
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

Answer A

I chose a Kaggle dataset that shows the relationship between physiological parameters and an individual's stress level. The direct link to the dataset is below:

https://www.kaggle.com/datasets/laavanya/stress-level-detection?select=Stress-Lysis.csv

This dataset has 2001 cases and 4 variables. I’ve imported 3 variables as numeric: humidity, temp, and steps, each of which describes the individual. There are no missing values. I’ve imported my target variable, stress_lvl, as a factor with three levels that correspond to low (0), medium (1), and high (2).

All appear to be reasonably distributed, judging by the percentiles. I note that the average humidity value is 20, the average temp is 89, and the average steps value is about 100.

stress <- read.csv("Stress-Lysis.csv", 
                 col.names = c("humidity", "temp", "steps", "stress_lvl"),
                 colClasses = c("numeric", "numeric", "numeric", "factor"))

skim(stress)
Data summary

Name                     stress
Number of rows           2001
Number of columns        4
_______________________
Column type frequency:
  factor                 1
  numeric                3
________________________
Group variables          None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
stress_lvl             0              1  FALSE           3  1: 790, 2: 710, 0: 501

Variable type: numeric

skim_variable  n_missing  complete_rate    mean     sd  p0  p25  p50  p75  p100  hist
humidity               0              1   20.00   5.78  10   15   20   25    30  ▇▇▇▇▇
temp                   0              1   89.00   5.78  79   84   89   94    99  ▇▇▇▇▇
steps                  0              1  100.14  58.18   0   50  101  150   200  ▇▇▇▇▇

Check the class distribution before doing any modelling

  1. Before modeling, I split the dataset into training and test sets and check the class distribution of my target variable, stress_lvl.

  2. According to the class proportion tables, both the training and test sets have roughly the same proportion of stress levels as the original data, so I can proceed with model building.

#split into test/train set
set.seed(2000)
sample_set <- sample(nrow(stress), round(nrow(stress)*0.75), replace = FALSE)
stress_train <- stress[sample_set, ]
stress_test <- stress[-sample_set, ]


round(prop.table(table(select(stress, stress_lvl), exclude = NULL)), 4) * 100
## 
##     0     1     2 
## 25.04 39.48 35.48
round(prop.table(table(select(stress_train, stress_lvl), exclude = NULL)), 4) * 100
## 
##     0     1     2 
## 25.32 38.77 35.91
round(prop.table(table(select(stress_test, stress_lvl), exclude = NULL)), 4) * 100
## 
##    0    1    2 
## 24.2 41.6 34.2
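
The split above is a simple random sample, and its class proportions happen to land close to the original. If they had drifted further apart, one option would be a stratified split that samples 75% within each class. The following is only a minimal base R sketch of that alternative, not what the code above does; `toy`, `pct`, and `split_check` are hypothetical names, and the data is randomly generated as a stand-in for the stress data.

```r
# Hypothetical stand-in data: 200 rows with an imbalanced three-level target.
set.seed(2000)
toy <- data.frame(
  x = rnorm(200),
  stress_lvl = factor(sample(c(0, 1, 2), 200, replace = TRUE,
                             prob = c(0.25, 0.40, 0.35)))
)

# Sample 75% of the row indices within each class, then combine them.
idx_by_class <- split(seq_len(nrow(toy)), toy$stress_lvl)
train_idx <- unlist(
  lapply(idx_by_class, function(i) i[sample.int(length(i), round(length(i) * 0.75))]),
  use.names = FALSE
)
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class percentages side by side for the full, training, and test sets.
pct <- function(d) round(prop.table(table(d$stress_lvl)) * 100, 2)
split_check <- rbind(full = pct(toy), train = pct(toy_train), test = pct(toy_test))
print(split_check)
```

Because the sampling happens inside each class, the training proportions can differ from the full data only by rounding.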

First, we’ll build a model that predicts stress_lvl from all 3 predictor variables. From the plot, humidity appears to be the strongest predictor: a humidity measure of less than 15 corresponds to the low (0) stress level, 15 to 23 to the medium (1) level, and greater than 23 to the high (2) level. The other variables weren’t even needed to build the decision tree.

#build model via rpart package
stress_model_stress_lvl <- rpart(stress_lvl ~ .,
                         method = "class",
                         data = stress_train
                          )

#display decision tree
rpart.plot(stress_model_stress_lvl)
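
The claim that only one variable ended up in the tree can also be checked without reading the plot. The sketch below assumes the fitted object is a standard `rpart` model and uses generated stand-in data in which `humidity` fully determines the class and `steps` is noise; `toy`, `toy_tree`, and `used_vars` are hypothetical names, and the cut-offs 15 and 23 are only illustrative.

```r
library(rpart)

# Hypothetical stand-in data: `humidity` determines the class, `steps` is noise.
set.seed(2000)
n <- 600
toy <- data.frame(
  humidity = runif(n, 10, 30),
  steps    = sample(0:200, n, replace = TRUE)
)
toy$stress_lvl <- cut(toy$humidity, breaks = c(-Inf, 15, 23, Inf),
                      labels = c("0", "1", "2"))

toy_tree <- rpart(stress_lvl ~ ., data = toy, method = "class")

# Variables that appear as primary splits (terminal nodes are labelled "<leaf>").
used_vars <- setdiff(unique(as.character(toy_tree$frame$var)), "<leaf>")
print(used_vars)

# Relative importance; this can also credit variables used only as surrogates.
print(toy_tree$variable.importance)

# Text version of the tree, including the split thresholds.
print(toy_tree)
```

Applied to the model above, the same three calls on `stress_model_stress_lvl` would list the variables used and the exact humidity thresholds.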

To create the second decision tree the assignment requires, we’ll build another tree, this time explicitly using only the temp and steps variables as predictors. In this case, temp appears to be the next most valuable predictor, as the steps predictor didn’t even make it into the tree. A temperature of less than 84 predicts the low stress level, 84 to 92 the medium stress level, and greater than 92 the high stress level.

#build model via rpart package
stress_model_tree2 <- rpart(stress_lvl ~ temp + steps,
                            data = stress_train
                          )

#display decision tree
rpart.plot(stress_model_tree2)
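
The two trees are compared above by their plots. A numeric comparison on held-out data could look like the sketch below. It is only an illustration under stated assumptions: the data is generated, `temp` is made a noisy copy of `humidity` purely for the example, and `accuracy`, `tree_all`, `tree_temp`, and `tree_accuracy` are hypothetical names.

```r
library(rpart)

# Hypothetical stand-in data; the relationship between columns is invented.
set.seed(2000)
n <- 800
toy <- data.frame(humidity = runif(n, 10, 30))
toy$temp  <- toy$humidity + 69 + rnorm(n, sd = 1)
toy$steps <- sample(0:200, n, replace = TRUE)
toy$stress_lvl <- cut(toy$humidity, breaks = c(-Inf, 15, 23, Inf),
                      labels = c("0", "1", "2"))

# Same 75/25 split idea as in the report.
train_rows <- sample(n, round(n * 0.75))
toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]

tree_all  <- rpart(stress_lvl ~ ., data = toy_train, method = "class")
tree_temp <- rpart(stress_lvl ~ temp + steps, data = toy_train, method = "class")

# Share of test rows whose predicted class matches the true class.
accuracy <- function(model, newdata) {
  pred <- predict(model, newdata, type = "class")
  mean(pred == newdata$stress_lvl)
}

tree_accuracy <- c(all_predictors = accuracy(tree_all, toy_test),
                   temp_and_steps = accuracy(tree_temp, toy_test))
print(round(tree_accuracy, 3))
```

With the real data, the same `accuracy()` helper could be called on `stress_model_stress_lvl` and `stress_model_tree2` with `stress_test`.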

The Random Forest below is built to predict stress_lvl from the 3 predictor variables. After predicting values for the test set and looking at the confusion matrix, the Random Forest turns out to be nearly perfect: 99.8% accuracy, with only 1 of the 500 test cases misclassified (a high-stress case predicted as medium). This isn’t surprising, as the predictors are likely highly correlated with the stress level, which makes sense given the subject matter. Physical factors in the body both cause and indicate high stress levels.

stress_model_forest <- randomForest(stress_lvl ~ .,
                                    data = stress_train)

forest_pred <- predict(stress_model_forest, stress_test)

confusionMatrix(forest_pred, stress_test$stress_lvl)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2
##          0 121   0   0
##          1   0 208   1
##          2   0   0 170
## 
## Overall Statistics
##                                           
##                Accuracy : 0.998           
##                  95% CI : (0.9889, 0.9999)
##     No Information Rate : 0.416           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9969          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity             1.000   1.0000   0.9942
## Specificity             1.000   0.9966   1.0000
## Pos Pred Value          1.000   0.9952   1.0000
## Neg Pred Value          1.000   1.0000   0.9970
## Prevalence              0.242   0.4160   0.3420
## Detection Rate          0.242   0.4160   0.3400
## Detection Prevalence    0.242   0.4180   0.3400
## Balanced Accuracy       1.000   0.9983   0.9971
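
As a check on how the headline figures follow from the counts, the sketch below rebuilds the confusion matrix by hand in base R and recomputes accuracy, sensitivity, and specificity. The counts are copied from the output above; the variable names are hypothetical.

```r
# Counts copied from the confusion matrix above: rows are predictions,
# columns are the reference (true) classes.
cm <- matrix(c(121,   0,   0,
                 0, 208,   1,
                 0,   0, 170),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("0", "1", "2"),
                             Reference  = c("0", "1", "2")))

# Overall accuracy: correct predictions (the diagonal) over all test cases.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: true positives over all cases truly in that class.
sensitivity <- diag(cm) / colSums(cm)

# Specificity per class: true negatives over all cases truly in other classes.
true_neg    <- sum(cm) - rowSums(cm) - colSums(cm) + diag(cm)
specificity <- true_neg / (sum(cm) - colSums(cm))

print(round(accuracy, 4))     # 499 / 500
print(round(sensitivity, 4))  # class 2 is 170 / 171
print(round(specificity, 4))  # class 1 is 291 / 292
```

The single off-diagonal count is what lowers the sensitivity of class 2 and the specificity of class 1 slightly below 1.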

Answer B

After reading this article and thinking about the dataset I chose, I find the decision trees most helpful in this case. Because the data was very simple, had few variables, and was highly correlated, the most valuable result was learning that humidity was the biggest predictor of stress level and seeing the humidity cut-offs that correspond to the low, medium, and high stress levels.