library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(corrplot)
## corrplot 0.92 loaded
library(e1071)
library(ggplot2)
library(ggforce)
library(labelled)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(mlbench)
library(Metrics)
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
library(ModelMetrics)
##
## Attaching package: 'ModelMetrics'
## The following objects are masked from 'package:Metrics':
##
## auc, ce, logLoss, mae, mse, msle, precision, recall, rmse, rmsle
## The following objects are masked from 'package:caret':
##
## confusionMatrix, precision, recall, sensitivity, specificity
## The following object is masked from 'package:base':
##
## kappa
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following object is masked from 'package:ModelMetrics':
##
## auc
## The following object is masked from 'package:Metrics':
##
## auc
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:psych':
##
## outlier
## The following object is masked from 'package:ggplot2':
##
## margin
library(readr)
library(RColorBrewer)
library(rpart)
library(rpart.plot)
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.0
## ✔ dials 1.2.1 ✔ tibble 3.2.1
## ✔ dplyr 1.1.4 ✔ tidyr 1.3.1
## ✔ infer 1.0.6 ✔ tune 1.1.2
## ✔ modeldata 1.3.0 ✔ workflows 1.1.4
## ✔ parsnip 1.2.0 ✔ workflowsets 1.0.1
## ✔ purrr 1.0.2 ✔ yardstick 1.3.0
## ✔ recipes 1.0.10
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ psych::%+%() masks ggplot2::%+%()
## ✖ yardstick::accuracy() masks Metrics::accuracy()
## ✖ scales::alpha() masks psych::alpha(), ggplot2::alpha()
## ✖ dplyr::combine() masks randomForest::combine()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ✖ yardstick::mae() masks ModelMetrics::mae(), Metrics::mae()
## ✖ yardstick::mape() masks Metrics::mape()
## ✖ randomForest::margin() masks ggplot2::margin()
## ✖ yardstick::mase() masks Metrics::mase()
## ✖ yardstick::mcc() masks ModelMetrics::mcc()
## ✖ yardstick::npv() masks ModelMetrics::npv()
## ✖ rsample::permutations() masks e1071::permutations()
## ✖ yardstick::ppv() masks ModelMetrics::ppv()
## ✖ yardstick::precision() masks ModelMetrics::precision(), Metrics::precision(), caret::precision()
## ✖ dials::prune() masks rpart::prune()
## ✖ yardstick::recall() masks ModelMetrics::recall(), Metrics::recall(), caret::recall()
## ✖ yardstick::rmse() masks ModelMetrics::rmse(), Metrics::rmse()
## ✖ yardstick::sensitivity() masks ModelMetrics::sensitivity(), caret::sensitivity()
## ✖ yardstick::smape() masks Metrics::smape()
## ✖ yardstick::spec() masks readr::spec()
## ✖ yardstick::specificity() masks ModelMetrics::specificity(), caret::specificity()
## ✖ recipes::step() masks stats::step()
## ✖ tune::tune() masks parsnip::tune(), e1071::tune()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ psych::%+%() masks ggplot2::%+%()
## ✖ scales::alpha() masks psych::alpha(), ggplot2::alpha()
## ✖ scales::col_factor() masks readr::col_factor()
## ✖ dplyr::combine() masks randomForest::combine()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ stringr::fixed() masks recipes::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ✖ randomForest::margin() masks ggplot2::margin()
## ✖ yardstick::spec() masks readr::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Read this blog: https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees which shows some of the issues with decision trees.
Choose a dataset from a source in Assignment #1, or another dataset of your choice.
Based on the latest topics presented, choose a dataset of your choice and create a Decision Tree where you can solve a classification problem and predict the outcome of a particular feature or detail of the data used.
Switch variables* to generate 2 decision trees and compare the results. Create a random forest and analyze the results.
Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees), how can you change this perception when using the decision tree you created to solve a real problem?
Essay (minimum 500-word document): Write a short essay explaining your analysis and how you would address the concerns in the blog (listed in pre-work).
Exploratory Analysis using R or Python (submit code + errors + analysis as notebook or copy/paste to document)
* Note: 1. We are trying to train 2 different decision trees to compare bias and variance - so switch the features used for the first node (split) to force a different decision tree (How did the performance change?) 2. You will create 3 models: 2 x decision trees (to compare variance) and a random forest
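df_1k is used throughout the analysis below, but its load step is not shown. A minimal sketch, assuming the data come from a “1000 Sales Records” sample sales CSV (the file name is an assumption):
# load the raw sales data (file name hypothetical)
df_1k <- read.csv("1000 Sales Records.csv", stringsAsFactors = FALSE)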
head(df_1k)
describe(df_1k)
str(df_1k)
## 'data.frame': 1000 obs. of 14 variables:
## $ Region : chr "Middle East and North Africa" "North America" "Middle East and North Africa" "Asia" ...
## $ Country : chr "Libya" "Canada" "Libya" "Japan" ...
## $ Item.Type : chr "Cosmetics" "Vegetables" "Baby Food" "Cereal" ...
## $ Sales.Channel : chr "Offline" "Online" "Offline" "Offline" ...
## $ Order.Priority: chr "M" "M" "C" "C" ...
## $ Order.Date : chr "10/18/2014" "11/7/2011" "10/31/2016" "4/10/2010" ...
## $ Order.ID : int 686800706 185941302 246222341 161442649 645713555 683458888 679414975 208630645 266467225 118598544 ...
## $ Ship.Date : chr "10/31/2014" "12/8/2011" "12/9/2016" "5/12/2010" ...
## $ Units.Sold : int 8446 3018 1517 3322 9845 9528 2844 7299 2428 4800 ...
## $ Unit.Price : num 437.2 154.06 255.28 205.7 9.33 ...
## $ Unit.Cost : num 263.33 90.93 159.42 117.11 6.92 ...
## $ Total.Revenue : num 3692591 464953 387260 683335 91854 ...
## $ Total.Cost : num 2224085 274427 241840 389039 68127 ...
## $ Total.Profit : num 1468506 190526 145420 294296 23726 ...
summary(df_1k)
## Region Country Item.Type Sales.Channel
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Order.Priority Order.Date Order.ID Ship.Date
## Length:1000 Length:1000 Min. :102928006 Length:1000
## Class :character Class :character 1st Qu.:328074026 Class :character
## Mode :character Mode :character Median :556609714 Mode :character
## Mean :549681325
## 3rd Qu.:769694483
## Max. :995529830
## Units.Sold Unit.Price Unit.Cost Total.Revenue
## Min. : 13 Min. : 9.33 Min. : 6.92 Min. : 2043
## 1st Qu.:2420 1st Qu.: 81.73 1st Qu.: 56.67 1st Qu.: 281192
## Median :5184 Median :154.06 Median : 97.44 Median : 754939
## Mean :5054 Mean :262.11 Mean :184.97 Mean :1327322
## 3rd Qu.:7537 3rd Qu.:421.89 3rd Qu.:263.33 3rd Qu.:1733503
## Max. :9998 Max. :668.27 Max. :524.96 Max. :6617210
## Total.Cost Total.Profit
## Min. : 1417 Min. : 532.6
## 1st Qu.: 164932 1st Qu.: 98376.1
## Median : 464726 Median : 277226.0
## Mean : 936119 Mean : 391202.6
## 3rd Qu.:1141750 3rd Qu.: 548456.8
## Max. :5204978 Max. :1726181.4
glimpse(df_1k)
## Rows: 1,000
## Columns: 14
## $ Region <chr> "Middle East and North Africa", "North America", "Middl…
## $ Country <chr> "Libya", "Canada", "Libya", "Japan", "Chad", "Armenia",…
## $ Item.Type <chr> "Cosmetics", "Vegetables", "Baby Food", "Cereal", "Frui…
## $ Sales.Channel <chr> "Offline", "Online", "Offline", "Offline", "Offline", "…
## $ Order.Priority <chr> "M", "M", "C", "C", "H", "H", "H", "M", "H", "H", "M", …
## $ Order.Date <chr> "10/18/2014", "11/7/2011", "10/31/2016", "4/10/2010", "…
## $ Order.ID <int> 686800706, 185941302, 246222341, 161442649, 645713555, …
## $ Ship.Date <chr> "10/31/2014", "12/8/2011", "12/9/2016", "5/12/2010", "8…
## $ Units.Sold <int> 8446, 3018, 1517, 3322, 9845, 9528, 2844, 7299, 2428, 4…
## $ Unit.Price <dbl> 437.20, 154.06, 255.28, 205.70, 9.33, 205.70, 205.70, 1…
## $ Unit.Cost <dbl> 263.33, 90.93, 159.42, 117.11, 6.92, 117.11, 117.11, 35…
## $ Total.Revenue <dbl> 3692591.20, 464953.08, 387259.76, 683335.40, 91853.85, …
## $ Total.Cost <dbl> 2224085.18, 274426.74, 241840.14, 389039.42, 68127.40, …
## $ Total.Profit <dbl> 1468506.02, 190526.34, 145419.62, 294295.98, 23726.45, …
look_for(df_1k)
apply(df_1k, 2, function(x) sum(is.na(x)))
## Region Country Item.Type Sales.Channel Order.Priority
## 0 0 0 0 0
## Order.Date Order.ID Ship.Date Units.Sold Unit.Price
## 0 0 0 0 0
## Unit.Cost Total.Revenue Total.Cost Total.Profit
## 0 0 0 0
unique(df_1k$Region)
## [1] "Middle East and North Africa" "North America"
## [3] "Asia" "Sub-Saharan Africa"
## [5] "Europe" "Central America and the Caribbean"
## [7] "Australia and Oceania"
#unique(df_1k$Country)
length(unique(df_1k$Country))
## [1] 185
table(df_1k$Item.Type)
##
## Baby Food Beverages Cereal Clothes Cosmetics
## 87 101 79 78 75
## Fruits Household Meat Office Supplies Personal Care
## 70 77 78 89 87
## Snacks Vegetables
## 82 97
table(df_1k$Sales.Channel)
##
## Offline Online
## 520 480
unique(df_1k$Order.Priority)
## [1] "M" "C" "H" "L"
#select numeric columns 1k
df_1k_num <- df_1k %>%
keep(is.numeric)
#stats
describe(df_1k_num, fast=TRUE) %>%
select(c(-vars,-n))
#distributions
df_1k_num %>%
pivot_longer(cols = 1:6, names_to = "variable", values_to = "value") %>%
ggplot(aes(value)) +
facet_wrap(~variable, scales = "free") +
geom_density() +
geom_histogram(aes(y = after_stat(density)), bins = 40, alpha = 0.2, fill = "lightblue", color = "darkgreen")
From the initial EDA we see the following: the dataset is complete (no missing values in any of the 14 columns); the categorical fields have a manageable number of levels (7 regions, 185 countries, 12 item types, 2 sales channels, 4 order priorities); and the numeric sales figures span several orders of magnitude, suggesting skewed distributions. Next, the correlations among the numeric variables:
corr_matrix <- cor(df_1k_num)
corrplot(corr_matrix,
type = "lower",
order = "hclust",
tl.col = "blue",
addCoef.col = "white",
diag = FALSE,
title = "Corrplot",
mar = c(0, 0, 1, 0),
col = brewer.pal(10, "RdYlBu"))
Looking at the correlation plot, the monetary totals (Total.Revenue, Total.Cost, Total.Profit) and the unit figures (Unit.Price, Unit.Cost) are strongly correlated with one another, which is expected since revenue, cost, and profit are all derived from units sold and unit prices. I suspect multicollinearity, but will use an additional method (VIF) to confirm.
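caret also offers a quick programmatic screen for highly correlated columns; a minimal sketch using findCorrelation (the 0.9 cutoff is an assumption):
# names of numeric columns whose pairwise correlation exceeds the cutoff
findCorrelation(corr_matrix, cutoff = 0.9, names = TRUE)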
set.seed(321)
sample_1k_train <- df_1k_num$Total.Revenue %>%
createDataPartition(p = 0.8, list = FALSE)
df_train_1k <- df_1k_num[sample_1k_train, ]
df_test_1k <- df_1k_num[-sample_1k_train, ]
model <- lm(Total.Revenue ~ ., data = df_train_1k)
vif_values<-car::vif(model)
print(vif_values)
## Order.ID Units.Sold Unit.Price Unit.Cost Total.Cost Total.Profit
## 1.002970 3.041373 167.637725 167.273600 11.445468 14.149909
The VIF values can be interpreted as follows:
* Order.ID and Units.Sold show low multicollinearity
* Unit.Price and Unit.Cost show very high multicollinearity (a common rule of thumb flags VIF above 10)
* Total.Cost and Total.Profit show moderate multicollinearity
The only transformations needed are:
* parsing the date strings into Date objects (month/day/year format)
* converting the categorical variables to factors
* centering and scaling as pre-processing for modelling
* feature selection, excluding the identifier-like Order.ID and the high-cardinality Country (185 levels)
df_1k_norm <- df_1k %>%
mutate(
`Order.Date` = mdy(`Order.Date`), # Convert to Date using lubridate's mdy function
`Ship.Date` = mdy(`Ship.Date`),
`Sales.Channel` = as.factor(`Sales.Channel`),
`Order.Priority` = as.factor(`Order.Priority`),
`Item.Type` = as.factor(`Item.Type`),
`Region` = as.factor(`Region`),
`Country` = as.factor(`Country`),
`Order.ID` = as.character(`Order.ID`)
) %>%
predict(preProcess(., method = c("center", "scale")), .)
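The final pipe step relies on magrittr substituting the dot into both arguments of predict(); written out explicitly, it is equivalent to the two-step form below (pp_1k and df_1k_mut are hypothetical names for the fitted transform and the mutate() result):
# learn the centering/scaling parameters, then apply them
pp_1k <- preProcess(df_1k_mut, method = c("center", "scale"))
df_1k_norm <- predict(pp_1k, df_1k_mut)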
levels(df_1k_norm$Sales.Channel)
## [1] "Offline" "Online"
df_1k_norm %>%
keep(is.numeric) %>%
describe(fast=TRUE) %>%
select(-c(vars,n))
#distributions
df_1k_norm %>%
  keep(is.numeric) %>%
  pivot_longer(cols = 1:6, names_to = "variable", values_to = "value") %>%
  ggplot(aes(value)) +
  facet_wrap(~variable, scales = "free") +
  geom_density() +
  geom_histogram(aes(y = after_stat(density)), bins = 40, alpha = 0.2,
                 fill = "lightblue", color = "darkgreen", position = "identity")
df_1k_norm <- df_1k_norm %>%
  select(-c(Country, Order.ID))
set.seed(1234)
df_1k_norm_svm <- df_1k_norm
#split
training.samples <- df_1k_norm_svm$Total.Revenue %>%
createDataPartition(p = 0.8, list = FALSE)
train_df <- df_1k_norm_svm[training.samples, ]
test_df <- df_1k_norm_svm[-training.samples, ]
svm_model <- svm(formula = Total.Revenue ~ ., data = train_df,
                 type = 'eps-regression')
print(svm_model)
##
## Call:
## svm(formula = Total.Revenue ~ ., data = train_df, type = "eps-regression")
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.03448276
## epsilon: 0.1
##
##
## Number of Support Vectors: 70
Note that the radial kernel is the default for svm() when none is specified.
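The cost, gamma, and epsilon parameters were also left at their defaults. A tuning sketch using e1071’s grid search (the grid values are assumptions; the namespace is spelled out because tune::tune() masks e1071::tune(), as shown in the conflicts above):
# small illustrative grid search over cost and gamma with cross-validation
svm_tuned <- e1071::tune.svm(Total.Revenue ~ ., data = train_df,
                             cost = 2^(0:3), gamma = c(0.01, 0.05, 0.1))
summary(svm_tuned)  # CV error for each parameter combination
best_svm <- svm_tuned$best.model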
predictions_SVM <- predict(svm_model, newdata = test_df) %>%
  bind_cols(test_df)  # the unnamed prediction column gets the default name `...1`
## New names:
## • `` -> `...1`
predictions_SVM$...1 <- as.numeric(predictions_SVM$...1)
MAE <- MAE(predictions_SVM$Total.Revenue, predictions_SVM$...1)
RMSE <- RMSE(predictions_SVM$Total.Revenue, predictions_SVM$...1)
R2 <- R2(predictions_SVM$Total.Revenue, predictions_SVM$...1)
# Create a data frame to store the results
a_svm <- data.frame(Model = "SVM",
MAE = MAE,
RMSE = RMSE,
R2 = R2)
# Print the results
print(a_svm)
## Model MAE RMSE R2
## 1 SVM 0.04818865 0.06337936 0.9961623
df_1k_2 <- df_1k
preproc1 <-preProcess(df_1k_2, method=c("center", "scale"))
norm1 <- predict(preproc1,df_1k_2)
training.samples_slr <- norm1$Total.Revenue %>%
createDataPartition(p = 0.8, list = FALSE)
train.data_slr <- norm1[training.samples_slr, ]
test.data_slr <- norm1[-training.samples_slr, ]
model_slr <- lm(Total.Revenue ~ Units.Sold, data = train.data_slr)
# Make predictions
predictions_slr <- model_slr %>% predict(test.data_slr)
# Calculate the metrics
MAE_slr <- MAE(predictions_slr, test.data_slr$Total.Revenue)
RMSE_slr <- RMSE(predictions_slr, test.data_slr$Total.Revenue)
R2_slr <- R2(predictions_slr, test.data_slr$Total.Revenue)
# Create a data frame to store the results
a_slr <- data.frame(Model = "Simple Linear Regression",
MAE = MAE_slr,
RMSE = RMSE_slr,
R2 = R2_slr)
# Print the results
print(a_slr)
## Model MAE RMSE R2
## 1 Simple Linear Regression 0.6444445 0.8850026 0.2945233
norm2 <- norm1 %>%
select(-c(Country, Unit.Cost, Unit.Price, Total.Cost, Total.Profit, Order.ID, Ship.Date, Order.Date))
set.seed(146)
training.samples_mlr <- norm2$Total.Revenue %>%
createDataPartition(p = 0.8, list = FALSE)
train.data_mlr <- norm2[training.samples_mlr, ]
test.data_mlr <- norm2[-training.samples_mlr, ]
model_mlr <- lm(Total.Revenue ~ ., data = train.data_mlr)
test.data_mlr <- na.omit(test.data_mlr)
# Make predictions
predictions_mlr <- model_mlr %>% predict(test.data_mlr)
# Calculate the metrics
MAE_mlr <- MAE(predictions_mlr, test.data_mlr$Total.Revenue)
RMSE_mlr <- RMSE(predictions_mlr, test.data_mlr$Total.Revenue)
R2_mlr <- R2(predictions_mlr, test.data_mlr$Total.Revenue)
# Create a data frame to store the results
a_mlr <- data.frame(Model = "Multiple Linear Regression",
MAE = MAE_mlr,
RMSE = RMSE_mlr,
R2 = R2_mlr)
# Print the results
print(a_mlr)
## Model MAE RMSE R2
## 1 Multiple Linear Regression 0.3009253 0.4055539 0.8059421
set.seed(1234)
tree <- rpart(Total.Revenue ~., data = train_df, cp = 0.004, method = 'anova')
predictions <- predict(tree, newdata = test_df) %>%
bind_cols(test_df)
## New names:
## • `` -> `...1`
predictions$...1 <- as.numeric(predictions$...1)
a_tree <- data.frame(Model = "Decision Tree 1",
#mean absolute error
MAE = MAE(predictions$Total.Revenue, predictions$...1),
#rmse Root Mean Squared Error
RMSE = RMSE(predictions$Total.Revenue, predictions$...1),
#r squared
R2 = R2(predictions$Total.Revenue, predictions$...1)
)
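rpart.plot is loaded at the top but the fitted trees are never drawn; a quick sketch to inspect the splits of the first tree:
# visualize the splits and leaf predictions of Decision Tree 1
rpart.plot(tree, main = "Decision Tree 1")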
set.seed(123456)
#colnames(df1k_norm1)
df_1k_norm3 <- df_1k_norm %>%
select(-c("Unit.Price","Unit.Cost","Total.Cost", "Total.Profit"))
#split
training.samples3 <- df_1k_norm3$Total.Revenue %>%
createDataPartition(p = 0.8, list = FALSE)
train3 <- df_1k_norm3[training.samples3, ]
test3 <- df_1k_norm3[-training.samples3, ]
# Train with rpart. cp is the complexity parameter: a smaller value permits a
# more complex tree. method = 'anova' fits a regression tree.
tree3 <- rpart(Total.Revenue ~ ., data = train3, cp = 0.004, method = 'anova')
predictions3 <- predict(tree3, newdata = test3) %>%
bind_cols(test3)
## New names:
## • `` -> `...1`
predictions3$...1 <- as.numeric(predictions3$...1)
a_tree2 <- data.frame(Model = "Decision Tree 2",
#mean absolute error
MAE = MAE(predictions3$Total.Revenue, predictions3$...1),
#rmse Root Mean Squared Error
RMSE = RMSE(predictions3$Total.Revenue, predictions3$...1),
#r squared
R2 = R2(predictions3$Total.Revenue, predictions3$...1)
)
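Since the point of the exercise is to force a different first split, a quick way to verify that the two trees actually start from different root variables (the first row of an rpart frame is the root node):
tree$frame$var[1]   # root split of Decision Tree 1 (all predictors available)
tree3$frame$var[1]  # root split of Decision Tree 2 (price/cost/profit removed)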
set.seed(201)
rf <- randomForest(formula = Total.Revenue ~ .,
data = train_df, importance=TRUE)
predictions4 <- predict(rf, newdata = test_df) %>%
bind_cols(test_df)
## New names:
## • `` -> `...1`
predictions4$...1 <- as.numeric(predictions4$...1)
a_rf <- data.frame(Model = "Random Forest",
#mean absolute error
MAE = MAE(predictions4$Total.Revenue, predictions4$...1),
#rmse Root Mean Squared Error
RMSE = RMSE(predictions4$Total.Revenue, predictions4$...1),
#r squared
R2 = R2(predictions4$Total.Revenue, predictions4$...1)
)
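importance = TRUE is set above but never inspected; a quick look at which predictors drive the forest:
# permutation (%IncMSE) and node-purity importance for each predictor
importance(rf)
varImpPlot(rf, main = "Random Forest Variable Importance")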
set.seed(908)
train1 <- train_df %>%
dplyr::select(-Total.Revenue)
bestmtry <- tuneRF(train1,train_df$Total.Revenue, stepFactor = 2, improve = 0.01,
trace=T, plot= F, doBest=TRUE, importance=TRUE)
## mtry = 3 OOB error = 0.002839379
## Searching left ...
## mtry = 2 OOB error = 0.007202836
## -1.536765 0.01
## Searching right ...
## mtry = 6 OOB error = 0.00095211
## 0.6646766 0.01
## mtry = 11 OOB error = 0.002508913
## -1.635108 0.01
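Because doBest = TRUE, tuneRF returns a randomForest object refit at the best mtry found (6 here, per the OOB errors above) rather than just the error table, which is why bestmtry can be passed directly to predict() below. A quick check:
bestmtry$mtry  # mtry of the returned forest; 6 had the lowest OOB error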
predictions5 <- predict(bestmtry, newdata = test_df) %>%
bind_cols(test_df)
## New names:
## • `` -> `...1`
predictions5$...1 <- as.numeric(predictions5$...1)
a_trf <- data.frame(Model = "Tuned Random Forest",
#mean absolute error
MAE = MAE(predictions5$Total.Revenue, predictions5$...1),
#rmse Root Mean Squared Error
RMSE = RMSE(predictions5$Total.Revenue, predictions5$...1),
#r squared
R2 = R2(predictions5$Total.Revenue, predictions5$...1)
)
# compare all fitted models side by side
rbind(a_svm, a_slr, a_mlr, a_tree, a_tree2, a_rf, a_trf)
The article Decision Tree Ensembles to Predict Coronavirus Disease 2019 Infection: A Comparative Study discusses machine learning algorithms, specifically decision tree ensembles, for predicting COVID-19-positive cases based on common lab tests. The article highlights the importance of classifiers designed for imbalanced datasets, with supporting evidence that decision tree ensembles built with imbalance-aware classifiers outperform standard methods. In the examples given, age was a major feature in the tree ensemble, but future studies would incorporate other features to understand their impact. A novel approach to predict COVID-19 using support vector machine presents a method using SVM classification that predicts whether a person has COVID-19 based on the patient’s symptoms. The SVM classifier notably achieved an accuracy of 87% according to the article. The paper also highlighted challenges in adoption, alongside the need for ongoing analysis.
For my three academic papers I chose Decision tree learning through a Predictive Model for Student Academic Performance in Intelligent M-Learning environments, Decision Tree-Based Predictive Models for Academic Achievement Using College Students’ Support Networks, and Using Decision Trees and Random Forest Algorithms to Predict and Determine Factors Contributing to First-Year University Students’ Learning Performance. The common theme across the articles was the impact of the features chosen for the decision tree. In one use case, economic and demographic factors were used to identify high-risk students, which allowed the school to focus on ensuring their success. In another, e-learning tools and their impact were a major factor, while a third used decision trees alongside CHAID and cforest algorithms to examine gender and other demographic factors. Overall, what these papers highlight is that a thorough understanding of the use case helps in designing a model with the right features to drive results.
For the data used in this assignment, the Tuned Random Forest performed best, with an MAE of 0.01200354, an RMSE of 0.02333389, and an R^2 of 0.9995702. Much of this near-perfect fit is expected, however: Total.Revenue is by construction Total.Cost + Total.Profit, so models that see those columns can recover revenue almost exactly, whereas Decision Tree 2 (which excludes them) gives a more realistic picture of out-of-sample performance.