Pre-work

1. Read this blog: https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees which shows some of the issues with decision trees
2. Choose a dataset from a source in Assignment #1, or another dataset of your choice.

Assignment work

1. Based on the latest topics presented, choose a dataset and create a Decision Tree that solves a classification problem, predicting the outcome of a particular feature or detail of the data used.
2. Switch variables [See Note] to generate 2 decision trees and compare the results. Create a random forest and analyze the results.
3. Based on real cases where decision trees went wrong, and 'the bad & ugly' aspects of decision trees (https://decizone.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees), how can you change this perception when using the decision tree you created to solve a real problem?

Deliverable

1. Essay (minimum 500-word document)
2. Write a short essay explaining your analysis and how you would address the concerns in the blog (listed in pre-work)
3. Exploratory Analysis using R or Python (submit code + errors + analysis as a notebook, or copy/paste into a document)

Note *:

1. We are trying to train 2 different decision trees to compare bias and variance - so switch the features used for the first node (split) to force a different decision tree (How did the performance change?)
2. You will create 3 models: 2 x decision trees (to compare variance) and a random forest

I’ll be reusing the same dataset from my first assignment, along with the basic EDA:

Mobile Device Usage and User Behavior Dataset (11 fields, 700 observations) https://www.kaggle.com/datasets/valakhorasani/mobile-device-usage-and-user-behavior-dataset?resource=download

This dataset provides a comprehensive analysis of mobile device usage patterns and user behavior classification. It contains 700 samples of user data, including metrics such as app usage time, screen-on time, battery drain, and data consumption. Each entry is categorized into one of five user behavior classes, ranging from light to extreme usage, allowing for insightful analysis and modeling.

1. User ID: Unique identifier for each user.
2. Device Model: Model of the user's smartphone.
3. Operating System: The OS of the device (iOS or Android).
4. App Usage Time: Daily time spent on mobile applications, measured in minutes.
5. Screen On Time: Average hours per day the screen is active.
6. Battery Drain: Daily battery consumption in mAh.
7. Number of Apps Installed: Total apps available on the device.
8. Data Usage: Daily mobile data consumption in megabytes.
9. Age: Age of the user.
10. Gender: Gender of the user (Male or Female).
11. User Behavior Class: Classification of user behavior based on usage patterns (1 to 5).

Import potentially needed libraries for analysis

library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.92 loaded
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.3.3
## Loading required package: lattice
library(caTools)
## Warning: package 'caTools' was built under R version 4.3.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(foreign)
## Warning: package 'foreign' was built under R version 4.3.3
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.3.3
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.3.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggplot2)
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.3.3
require(tree)
## Loading required package: tree
## Warning: package 'tree' was built under R version 4.3.3
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(moments)
library(magrittr)
## Warning: package 'magrittr' was built under R version 4.3.2
library(matrixcalc)
library(nnet)
## Warning: package 'nnet' was built under R version 4.3.3
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(party)
## Warning: package 'party' was built under R version 4.3.3
## Loading required package: grid
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 4.3.2
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 4.3.3
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 4.3.3
## 
## Attaching package: 'party'
## The following object is masked from 'package:dplyr':
## 
##     where
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.3.3
library(rpart)
## Warning: package 'rpart' was built under R version 4.3.3
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.3
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ✔ readr     2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ stringr::boundary()     masks strucchange::boundary()
## ✖ randomForest::combine() masks dplyr::combine()
## ✖ tidyr::extract()        masks magrittr::extract()
## ✖ plotly::filter()        masks dplyr::filter(), stats::filter()
## ✖ dplyr::lag()            masks stats::lag()
## ✖ purrr::lift()           masks caret::lift()
## ✖ randomForest::margin()  masks ggplot2::margin()
## ✖ plotly::select()        masks MASS::select(), dplyr::select()
## ✖ purrr::set_names()      masks magrittr::set_names()
## ✖ party::where()          masks dplyr::where()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)

Import the dataset

Phone_Usage <- read.table(file="https://raw.githubusercontent.com/RonBalaban/CUNY-SPS/refs/heads/main/DATA622/user_behavior_dataset.csv", header=TRUE, sep=",")

View samples and summaries

head(Phone_Usage)
##   User.ID   Device.Model Operating.System App.Usage.Time..min.day.
## 1       1 Google Pixel 5          Android                      393
## 2       2      OnePlus 9          Android                      268
## 3       3   Xiaomi Mi 11          Android                      154
## 4       4 Google Pixel 5          Android                      239
## 5       5      iPhone 12              iOS                      187
## 6       6 Google Pixel 5          Android                       99
##   Screen.On.Time..hours.day. Battery.Drain..mAh.day. Number.of.Apps.Installed
## 1                        6.4                    1872                       67
## 2                        4.7                    1331                       42
## 3                        4.0                     761                       32
## 4                        4.8                    1676                       56
## 5                        4.3                    1367                       58
## 6                        2.0                     940                       35
##   Data.Usage..MB.day. Age Gender User.Behavior.Class
## 1                1122  40   Male                   4
## 2                 944  47 Female                   3
## 3                 322  42   Male                   2
## 4                 871  20   Male                   3
## 5                 988  31 Female                   3
## 6                 564  31   Male                   2
summary(Phone_Usage)
##     User.ID      Device.Model       Operating.System   App.Usage.Time..min.day.
##  Min.   :  1.0   Length:700         Length:700         Min.   : 30.0           
##  1st Qu.:175.8   Class :character   Class :character   1st Qu.:113.2           
##  Median :350.5   Mode  :character   Mode  :character   Median :227.5           
##  Mean   :350.5                                         Mean   :271.1           
##  3rd Qu.:525.2                                         3rd Qu.:434.2           
##  Max.   :700.0                                         Max.   :598.0           
##  Screen.On.Time..hours.day. Battery.Drain..mAh.day. Number.of.Apps.Installed
##  Min.   : 1.000             Min.   : 302.0          Min.   :10.00           
##  1st Qu.: 2.500             1st Qu.: 722.2          1st Qu.:26.00           
##  Median : 4.900             Median :1502.5          Median :49.00           
##  Mean   : 5.273             Mean   :1525.2          Mean   :50.68           
##  3rd Qu.: 7.400             3rd Qu.:2229.5          3rd Qu.:74.00           
##  Max.   :12.000             Max.   :2993.0          Max.   :99.00           
##  Data.Usage..MB.day.      Age           Gender          User.Behavior.Class
##  Min.   : 102.0      Min.   :18.00   Length:700         Min.   :1.00       
##  1st Qu.: 373.0      1st Qu.:28.00   Class :character   1st Qu.:2.00       
##  Median : 823.5      Median :38.00   Mode  :character   Median :3.00       
##  Mean   : 929.7      Mean   :38.48                      Mean   :2.99       
##  3rd Qu.:1341.0      3rd Qu.:49.00                      3rd Qu.:4.00       
##  Max.   :2497.0      Max.   :59.00                      Max.   :5.00

Check for Null/Missing Data

anyNA(Phone_Usage) #False
## [1] FALSE
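A per-column breakdown is a useful companion check in case anyNA() ever flips to TRUE; a one-line sketch:

# Count of missing values per column
colSums(is.na(Phone_Usage))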

Distribution of each variable

# Visualize the distributions for each variable
par(mfrow = c(3, 3))
hist(Phone_Usage$App.Usage.Time..min.day., main="Daily App Usage (Minutes)")
hist(Phone_Usage$Screen.On.Time..hours.day., main="Avg. Screen Activity (Hours)")
hist(Phone_Usage$Battery.Drain..mAh.day., main="Daily Battery Consumption (mAh)")
hist(Phone_Usage$Number.of.Apps.Installed, main="Apps Installed")
hist(Phone_Usage$Data.Usage..MB.day., main="Mobile Data Usage (MB)")
hist(Phone_Usage$Age, main="Age")
hist(Phone_Usage$User.Behavior.Class, main="Usage Pattern Classifier")
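The moments package loaded earlier can quantify what these histograms suggest; a quick sketch over the numeric columns (the original Kaggle column names are still in effect at this point):

# Skewness per numeric column; values above 0 indicate the right skew
# visible in the usage histograms
sapply(Phone_Usage[sapply(Phone_Usage, is.numeric)], moments::skewness)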

Column correlation and re-naming

# Rename columns
names(Phone_Usage) <- c("User_ID", "Device_Model", "Operating_System", "App_Usage_Mins", "Screen_Time_Hrs", "Battery_Drain_mAh", "Number_Apps", "Data_Usage_MB", "Age", "Gender", "User_Class")

# First, make copy of the original dataframe, but only with numeric fields
Phone_Usage_numeric <- Phone_Usage  %>%
  select_if(is.numeric)

# Remove the un-needed ID column
Phone_Usage_numeric <- subset(Phone_Usage_numeric, select = -User_ID)

# Corrplot
corrplot.mixed(cor(Phone_Usage_numeric), order = 'AOE')

# Since the input is a genuine correlation matrix, keep the default is.corr = TRUE
corrplot(cor(Phone_Usage_numeric), method = "square", order = 'alphabet', type='upper', diag=TRUE)

Rename the User_Class variable

# Replace the numeric values for better clarity
Phone_Usage$User_Class[Phone_Usage$User_Class == "1"] <- "Very Low Usage"
Phone_Usage$User_Class[Phone_Usage$User_Class == "2"] <- "Low Usage"
Phone_Usage$User_Class[Phone_Usage$User_Class == "3"] <- "Moderate Usage"
Phone_Usage$User_Class[Phone_Usage$User_Class == "4"] <- "High Usage"
Phone_Usage$User_Class[Phone_Usage$User_Class == "5"] <- "Very High Usage"
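A quick tabulation confirms the recode took effect and shows the class balance:

# Class balance after recoding
table(Phone_Usage$User_Class)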

One of the great strengths of decision trees is how easily interpretable they are, since they follow very human decision-making patterns. For this dataset, if I want to know what kind of mobile phone user a person is, I can scan the dataset’s features and look for natural dividing points (nodes) for classifying them. I anticipate that Data_Usage_MB and Battery_Drain_mAh will have a significant bearing on the classification.
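Before fitting anything, a group-wise summary can test that hunch; a minimal sketch using the already-loaded dplyr (the summary columns are my own choice, not part of the assignment):

# Mean of the two candidate features per class
Phone_Usage %>%
  group_by(User_Class) %>%
  summarise(mean_data_mb   = mean(Data_Usage_MB),
            mean_drain_mah = mean(Battery_Drain_mAh),
            n              = n())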

Building the decision tree model

set.seed(1234)

# First let's split our data
sample_data = createDataPartition(Phone_Usage$User_Class, p = 0.7, list=FALSE)
# use 70% of the data for training the models
train_data <- Phone_Usage[sample_data, ]
# hold out the remaining 30% for validation
test_data <- Phone_Usage[-sample_data, ]


# Create decision tree model
decision_tree_model <- rpart(User_Class ~ .-User_ID, data = train_data, method = "class")
# In case I want to exclude certain fields from the tree
# decision_tree_model <- rpart(User_Class ~ . -App_Usage_Mins, data = train_data, method = "class")

# Print the model summary
summary(decision_tree_model)
## Call:
## rpart(formula = User_Class ~ . - User_ID, data = train_data, 
##     method = "class")
##   n= 494 
## 
##          CP nsplit rel error     xerror        xstd
## 1 0.2583120      0 1.0000000 1.03580563 0.021846521
## 2 0.2506394      1 0.7416880 0.74680307 0.027946465
## 3 0.2455243      2 0.4910486 0.48849105 0.027682019
## 4 0.0100000      4 0.0000000 0.01023018 0.005094339
## 
## Variable importance
##    App_Usage_Mins Battery_Drain_mAh     Data_Usage_MB       Number_Apps 
##                20                20                20                20 
##   Screen_Time_Hrs               Age 
##                19                 1 
## 
## Node number 1: 494 observations,    complexity param=0.258312
##   predicted class=Low Usage        expected loss=0.791498  P(node) =1
##     class counts:    98   103   101    96    96
##    probabilities: 0.198 0.209 0.204 0.194 0.194 
##   left son=2 (295 obs) right son=3 (199 obs)
##   Primary splits:
##       App_Usage_Mins    < 180.5  to the right, improve=99.12084, (0 missing)
##       Battery_Drain_mAh < 1204   to the right, improve=99.12084, (0 missing)
##       Number_Apps       < 40     to the right, improve=99.12084, (0 missing)
##       Data_Usage_MB     < 611    to the right, improve=99.12084, (0 missing)
##       Screen_Time_Hrs   < 8.05   to the left,  improve=96.69432, (0 missing)
##   Surrogate splits:
##       Battery_Drain_mAh < 1204   to the right, agree=1.000, adj=1.000, (0 split)
##       Number_Apps       < 40     to the right, agree=1.000, adj=1.000, (0 split)
##       Data_Usage_MB     < 611    to the right, agree=1.000, adj=1.000, (0 split)
##       Screen_Time_Hrs   < 3.95   to the right, agree=0.994, adj=0.985, (0 split)
## 
## Node number 2: 295 observations,    complexity param=0.2506394
##   predicted class=Moderate Usage   expected loss=0.6576271  P(node) =0.597166
##     class counts:    98     0   101    96     0
##    probabilities: 0.332 0.000 0.342 0.325 0.000 
##   left son=4 (194 obs) right son=5 (101 obs)
##   Primary splits:
##       App_Usage_Mins    < 300    to the right, improve=99.63404, (0 missing)
##       Battery_Drain_mAh < 1796.5 to the right, improve=99.63404, (0 missing)
##       Number_Apps       < 60     to the right, improve=99.63404, (0 missing)
##       Data_Usage_MB     < 1004   to the right, improve=99.63404, (0 missing)
##       Screen_Time_Hrs   < 8.05   to the left,  improve=97.14634, (0 missing)
##   Surrogate splits:
##       Battery_Drain_mAh < 1796.5 to the right, agree=1.000, adj=1.00, (0 split)
##       Number_Apps       < 60     to the right, agree=1.000, adj=1.00, (0 split)
##       Data_Usage_MB     < 1004   to the right, agree=1.000, adj=1.00, (0 split)
##       Screen_Time_Hrs   < 6.05   to the right, agree=0.986, adj=0.96, (0 split)
##       Age               < 53.5   to the left,  agree=0.661, adj=0.01, (0 split)
## 
## Node number 3: 199 observations,    complexity param=0.2455243
##   predicted class=Low Usage        expected loss=0.4824121  P(node) =0.402834
##     class counts:     0   103     0     0    96
##    probabilities: 0.000 0.518 0.000 0.000 0.482 
##   left son=6 (103 obs) right son=7 (96 obs)
##   Primary splits:
##       App_Usage_Mins    < 90.5   to the right, improve=99.37688, (0 missing)
##       Battery_Drain_mAh < 598    to the right, improve=99.37688, (0 missing)
##       Number_Apps       < 20     to the right, improve=99.37688, (0 missing)
##       Data_Usage_MB     < 300    to the right, improve=99.37688, (0 missing)
##       Screen_Time_Hrs   < 2.05   to the right, improve=89.87193, (0 missing)
##   Surrogate splits:
##       Battery_Drain_mAh < 598    to the right, agree=1.000, adj=1.000, (0 split)
##       Number_Apps       < 20     to the right, agree=1.000, adj=1.000, (0 split)
##       Data_Usage_MB     < 300    to the right, agree=1.000, adj=1.000, (0 split)
##       Screen_Time_Hrs   < 2.05   to the right, agree=0.975, adj=0.948, (0 split)
##       Age               < 53.5   to the left,  agree=0.558, adj=0.083, (0 split)
## 
## Node number 4: 194 observations,    complexity param=0.2455243
##   predicted class=High Usage       expected loss=0.4948454  P(node) =0.3927126
##     class counts:    98     0     0    96     0
##    probabilities: 0.505 0.000 0.000 0.495 0.000 
##   left son=8 (98 obs) right son=9 (96 obs)
##   Primary splits:
##       App_Usage_Mins    < 479.5  to the left,  improve=96.98969, (0 missing)
##       Screen_Time_Hrs   < 8.05   to the left,  improve=96.98969, (0 missing)
##       Battery_Drain_mAh < 2400.5 to the left,  improve=96.98969, (0 missing)
##       Number_Apps       < 80     to the left,  improve=96.98969, (0 missing)
##       Data_Usage_MB     < 1519.5 to the left,  improve=96.98969, (0 missing)
##   Surrogate splits:
##       Screen_Time_Hrs   < 8.05   to the left,  agree=1.000, adj=1.000, (0 split)
##       Battery_Drain_mAh < 2400.5 to the left,  agree=1.000, adj=1.000, (0 split)
##       Number_Apps       < 80     to the left,  agree=1.000, adj=1.000, (0 split)
##       Data_Usage_MB     < 1519.5 to the left,  agree=1.000, adj=1.000, (0 split)
##       Age               < 25.5   to the left,  agree=0.557, adj=0.104, (0 split)
## 
## Node number 5: 101 observations
##   predicted class=Moderate Usage   expected loss=0  P(node) =0.2044534
##     class counts:     0     0   101     0     0
##    probabilities: 0.000 0.000 1.000 0.000 0.000 
## 
## Node number 6: 103 observations
##   predicted class=Low Usage        expected loss=0  P(node) =0.208502
##     class counts:     0   103     0     0     0
##    probabilities: 0.000 1.000 0.000 0.000 0.000 
## 
## Node number 7: 96 observations
##   predicted class=Very Low Usage   expected loss=0  P(node) =0.194332
##     class counts:     0     0     0     0    96
##    probabilities: 0.000 0.000 0.000 0.000 1.000 
## 
## Node number 8: 98 observations
##   predicted class=High Usage       expected loss=0  P(node) =0.1983806
##     class counts:    98     0     0     0     0
##    probabilities: 1.000 0.000 0.000 0.000 0.000 
## 
## Node number 9: 96 observations
##   predicted class=Very High Usage  expected loss=0  P(node) =0.194332
##     class counts:     0     0     0    96     0
##    probabilities: 0.000 0.000 0.000 1.000 0.000
# Plot
rpart.plot(decision_tree_model)

################################################################################
# Alternatively, using ISLR tree method

train_data$User_Class <- as.factor(train_data$User_Class)
train_data$Operating_System <- as.factor(train_data$Operating_System)
train_data$Gender <- as.factor(train_data$Gender)

tree.User_Class = tree(User_Class~.-User_ID, data=train_data)
## Warning in tree(User_Class ~ . - User_ID, data = train_data): NAs introduced by
## coercion
summary(tree.User_Class)
## 
## Classification tree:
## tree(formula = User_Class ~ . - User_ID, data = train_data)
## Variables actually used in tree construction:
## [1] "App_Usage_Mins"
## Number of terminal nodes:  5 
## Residual mean deviance:  0 = 0 / 489 
## Misclassification error rate: 0 = 0 / 494
plot(tree.User_Class)
text(tree.User_Class, pretty = 0)

This creates a decision tree that looks only at App_Usage_Mins: users at 180 minutes or less are split into Low and Very Low Usage, Moderate Usage covers more than 180 but less than 300 minutes, and High versus Very High is segmented at 480 minutes, with Very High users spending at least 480 minutes. Checking against the data, this looks accurate, and it is interesting that App_Usage_Mins carries more weight in the classification than any other variable. For example, several Very High Usage users have high App_Usage_Mins but considerably lower Screen_Time_Hrs (592 minutes with 9.4 hours, or 580 minutes with 8.2 hours). Most Very High Usage users have anywhere from 9.5 to 12 hours of screen time, so this oddity looks like users leaving their apps open in the background for extended periods of time.
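Those cutoffs can be checked directly against the raw data by looking at the per-class range of App_Usage_Mins; a one-line sketch:

# Min/max app-usage minutes within each class
tapply(Phone_Usage$App_Usage_Mins, Phone_Usage$User_Class, range)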

Let’s test predictions

# Predict on the full dataset (training and test combined); a held-out comparison follows in the next section
predictions <- predict(decision_tree_model, newdata = Phone_Usage, type = "class")

# Create a confusion matrix
confusion_matrix <- table(Predicted = predictions, Actual = Phone_Usage$User_Class)
print(confusion_matrix)
##                  Actual
## Predicted         High Usage Low Usage Moderate Usage Very High Usage
##   High Usage             139         0              0               0
##   Low Usage                0       146              0               0
##   Moderate Usage           0         0            143               0
##   Very High Usage          0         0              0             136
##   Very Low Usage           0         0              0               0
##                  Actual
## Predicted         Very Low Usage
##   High Usage                   0
##   Low Usage                    0
##   Moderate Usage               0
##   Very High Usage              0
##   Very Low Usage             136
# Calculate accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Accuracy:", accuracy, "\n")
## Accuracy: 1

This decision tree is incredibly accurate because it found the exact partitions/nodes on App_Usage_Mins, the feature with direct influence on how users are classified, so it functions essentially like a flowchart for classifying users. However, this perfect fit suggests overfitting, which reduces the model’s ability to generalize: what happens when a user has a very high app usage time but low screen time, perhaps because they leave all their background apps open constantly? Or when a user has very low data usage and relies on a lot of offline applications instead?
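To make that concern concrete, the tree can be probed with a fabricated edge case; the row below is hypothetical, not drawn from the dataset:

# A hypothetical user: very high app time but very low screen time
edge_case <- data.frame(
  User_ID = 9999, Device_Model = "Google Pixel 5", Operating_System = "Android",
  App_Usage_Mins = 592,    # very high app time
  Screen_Time_Hrs = 2.0,   # but very low screen time
  Battery_Drain_mAh = 800, Number_Apps = 30, Data_Usage_MB = 300,
  Age = 35, Gender = "Male"
)
# Because the tree splits only on App_Usage_Mins, this should land in Very High Usage
predict(decision_tree_model, newdata = edge_case, type = "class")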

Let’s examine Bias/Variance

# Predict on training data
train_predictions <- predict(decision_tree_model, newdata = train_data, type = "class")
train_confusion_matrix <- table(Predicted = train_predictions, Actual = train_data$User_Class)
train_accuracy <- sum(diag(train_confusion_matrix)) / sum(train_confusion_matrix)

# Predict on test data
test_predictions <- predict(decision_tree_model, newdata = test_data, type = "class")
test_confusion_matrix <- table(Predicted = test_predictions, Actual = test_data$User_Class)
test_accuracy <- sum(diag(test_confusion_matrix)) / sum(test_confusion_matrix)


# Check;
train_confusion_matrix
##                  Actual
## Predicted         High Usage Low Usage Moderate Usage Very High Usage
##   High Usage              98         0              0               0
##   Low Usage                0       103              0               0
##   Moderate Usage           0         0            101               0
##   Very High Usage          0         0              0              96
##   Very Low Usage           0         0              0               0
##                  Actual
## Predicted         Very Low Usage
##   High Usage                   0
##   Low Usage                    0
##   Moderate Usage               0
##   Very High Usage              0
##   Very Low Usage              96
train_accuracy
## [1] 1
test_confusion_matrix
##                  Actual
## Predicted         High Usage Low Usage Moderate Usage Very High Usage
##   High Usage              41         0              0               0
##   Low Usage                0        43              0               0
##   Moderate Usage           0         0             42               0
##   Very High Usage          0         0              0              40
##   Very Low Usage           0         0              0               0
##                  Actual
## Predicted         Very Low Usage
##   High Usage                   0
##   Low Usage                    0
##   Moderate Usage               0
##   Very High Usage              0
##   Very Low Usage              40
test_accuracy
## [1] 1

This model only considers App_Usage_Mins and is very simple; perhaps that is how User_Class was originally derived, meaning the dataset’s creators cared mainly about whether a user had high app usage or battery drain, and little about the other fields. Given that the model has both high training accuracy and high test accuracy, there is a balance between bias and variance, although it is not a very interesting model.
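That balance can be quantified by looking at how much accuracy fluctuates across resamples (a small spread suggests low variance); a minimal caret sketch, with fold and repeat counts chosen arbitrarily:

set.seed(1234)
cv_tree <- train(User_Class ~ . - User_ID, data = train_data,
                 method = "rpart",
                 trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3))
cv_tree$results  # accuracy, kappa, and their standard deviations across folds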

Now let’s build a model that predicts the device model of a user’s phone, something that should be far more irregular.

Building the second decision tree model

set.seed(1234)

# First let's split our data
sample_data_2 = createDataPartition(Phone_Usage$Device_Model, p = 0.7, list=FALSE)
# use 70% of the data for training the model
train_data_2 <- Phone_Usage[sample_data_2, ]
# select 30% of the data for validation
test_data_2 <- Phone_Usage[-sample_data_2, ]


# Create decision tree model
decision_tree_model_2 <- rpart(Device_Model ~ . -User_ID, data = train_data_2, method = "class")
# In case I want to exclude certain fields from the tree
# decision_tree_model <- rpart(User_Class ~ . -App_Usage_Mins, data = train_data, method = "class")

# Print the model summary
summary(decision_tree_model_2)
## Call:
## rpart(formula = Device_Model ~ . - User_ID, data = train_data_2, 
##     method = "class")
##   n= 494 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.26342711      0 1.0000000 1.0716113 0.02039846
## 2 0.01108269      1 0.7365729 0.7800512 0.02762744
## 3 0.01023018      7 0.6547315 0.8235294 0.02708020
## 4 0.01000000      8 0.6445013 0.8235294 0.02708020
## 
## Variable importance
##  Operating_System     Data_Usage_MB       Number_Apps Battery_Drain_mAh 
##                74                 5                 4                 4 
##   Screen_Time_Hrs               Age    App_Usage_Mins            Gender 
##                 4                 3                 3                 2 
##        User_Class 
##                 1 
## 
## Node number 1: 494 observations,    complexity param=0.2634271
##   predicted class=iPhone 12           expected loss=0.791498  P(node) =1
##     class counts:   100   103    94    94   103
##    probabilities: 0.202 0.209 0.190 0.190 0.209 
##   left son=2 (103 obs) right son=3 (391 obs)
##   Primary splits:
##       Operating_System  splits as  RL,         improve=101.937800, (0 missing)
##       App_Usage_Mins    < 573    to the right, improve=  2.492706, (0 missing)
##       Battery_Drain_mAh < 333.5  to the right, improve=  2.155008, (0 missing)
##       Data_Usage_MB     < 2427   to the left,  improve=  1.615551, (0 missing)
##       Number_Apps       < 11.5   to the right, improve=  1.446021, (0 missing)
##   Surrogate splits:
##       Screen_Time_Hrs < 1.05   to the left,  agree=0.796, adj=0.019, (0 split)
## 
## Node number 2: 103 observations
##   predicted class=iPhone 12           expected loss=0  P(node) =0.208502
##     class counts:     0   103     0     0     0
##    probabilities: 0.000 1.000 0.000 0.000 0.000 
## 
## Node number 3: 391 observations,    complexity param=0.01108269
##   predicted class=Xiaomi Mi 11        expected loss=0.7365729  P(node) =0.791498
##     class counts:   100     0    94    94   103
##    probabilities: 0.256 0.000 0.240 0.240 0.263 
##   left son=6 (31 obs) right son=7 (360 obs)
##   Primary splits:
##       Data_Usage_MB     < 195.5  to the left,  improve=1.974378, (0 missing)
##       App_Usage_Mins    < 40     to the right, improve=1.941595, (0 missing)
##       Number_Apps       < 11.5   to the right, improve=1.681154, (0 missing)
##       Battery_Drain_mAh < 348    to the right, improve=1.453962, (0 missing)
##       User_Class        splits as  RRLLL,      improve=1.396028, (0 missing)
##   Surrogate splits:
##       Number_Apps       < 14.5   to the left,  agree=0.928, adj=0.097, (0 split)
##       Battery_Drain_mAh < 309.5  to the left,  agree=0.923, adj=0.032, (0 split)
## 
## Node number 6: 31 observations
##   predicted class=Google Pixel 5      expected loss=0.5806452  P(node) =0.06275304
##     class counts:    13     0     7     2     9
##    probabilities: 0.419 0.000 0.226 0.065 0.290 
## 
## Node number 7: 360 observations,    complexity param=0.01108269
##   predicted class=Xiaomi Mi 11        expected loss=0.7388889  P(node) =0.7287449
##     class counts:    87     0    87    92    94
##    probabilities: 0.242 0.000 0.242 0.256 0.261 
##   left son=14 (352 obs) right son=15 (8 obs)
##   Primary splits:
##       Number_Apps       < 11.5   to the right, improve=2.155808, (0 missing)
##       Battery_Drain_mAh < 334.5  to the right, improve=1.740159, (0 missing)
##       App_Usage_Mins    < 217.5  to the right, improve=1.692458, (0 missing)
##       Data_Usage_MB     < 2342.5 to the left,  improve=1.544633, (0 missing)
##       Age               < 43.5   to the left,  improve=1.444438, (0 missing)
##   Surrogate splits:
##       App_Usage_Mins < 35     to the right, agree=0.981, adj=0.125, (0 split)
## 
## Node number 14: 352 observations,    complexity param=0.01108269
##   predicted class=Samsung Galaxy S21  expected loss=0.7471591  P(node) =0.7125506
##     class counts:    87     0    87    89    89
##    probabilities: 0.247 0.000 0.247 0.253 0.253 
##   left son=28 (28 obs) right son=29 (324 obs)
##   Primary splits:
##       Number_Apps       < 17.5   to the left,  improve=1.990400, (0 missing)
##       Battery_Drain_mAh < 749    to the left,  improve=1.634470, (0 missing)
##       Data_Usage_MB     < 2342.5 to the left,  improve=1.628301, (0 missing)
##       App_Usage_Mins    < 573    to the right, improve=1.524232, (0 missing)
##       Age               < 43.5   to the left,  improve=1.461364, (0 missing)
##   Surrogate splits:
##       Battery_Drain_mAh < 592    to the left,  agree=0.972, adj=0.643, (0 split)
##       Data_Usage_MB     < 284.5  to the left,  agree=0.972, adj=0.643, (0 split)
##       App_Usage_Mins    < 82.5   to the left,  agree=0.969, adj=0.607, (0 split)
##       Screen_Time_Hrs   < 1.95   to the left,  agree=0.969, adj=0.607, (0 split)
##       User_Class        splits as  RRRRL,      agree=0.969, adj=0.607, (0 split)
## 
## Node number 15: 8 observations
##   predicted class=Xiaomi Mi 11        expected loss=0.375  P(node) =0.01619433
##     class counts:     0     0     0     3     5
##    probabilities: 0.000 0.000 0.000 0.375 0.625 
## 
## Node number 28: 28 observations,    complexity param=0.01108269
##   predicted class=Google Pixel 5      expected loss=0.6428571  P(node) =0.05668016
##     class counts:    10     0     8     9     1
##    probabilities: 0.357 0.000 0.286 0.321 0.036 
##   left son=56 (12 obs) right son=57 (16 obs)
##   Primary splits:
##       Battery_Drain_mAh < 483    to the right, improve=2.339286, (0 missing)
##       Age               < 38     to the left,  improve=2.243525, (0 missing)
##       Gender            splits as  LR,         improve=1.703175, (0 missing)
##       App_Usage_Mins    < 57.5   to the right, improve=1.392063, (0 missing)
##       Data_Usage_MB     < 226    to the right, improve=1.257066, (0 missing)
##   Surrogate splits:
##       Screen_Time_Hrs < 1.25   to the left,  agree=0.750, adj=0.417, (0 split)
##       App_Usage_Mins  < 65     to the right, agree=0.679, adj=0.250, (0 split)
##       Data_Usage_MB   < 281    to the right, agree=0.679, adj=0.250, (0 split)
##       Age             < 30     to the left,  agree=0.679, adj=0.250, (0 split)
##       Number_Apps     < 16.5   to the right, agree=0.643, adj=0.167, (0 split)
## 
## Node number 29: 324 observations,    complexity param=0.01108269
##   predicted class=Xiaomi Mi 11        expected loss=0.7283951  P(node) =0.6558704
##     class counts:    77     0    79    80    88
##    probabilities: 0.238 0.000 0.244 0.247 0.272 
##   left son=58 (279 obs) right son=59 (45 obs)
##   Primary splits:
##       Data_Usage_MB     < 418    to the right, improve=2.243449, (0 missing)
##       App_Usage_Mins    < 217.5  to the right, improve=2.237246, (0 missing)
##       Number_Apps       < 33.5   to the right, improve=2.105514, (0 missing)
##       Battery_Drain_mAh < 604.5  to the right, improve=1.662156, (0 missing)
##       Age               < 53.5   to the left,  improve=1.614470, (0 missing)
##   Surrogate splits:
##       Screen_Time_Hrs   < 3.05   to the right, agree=0.904, adj=0.311, (0 split)
##       Battery_Drain_mAh < 826    to the right, agree=0.904, adj=0.311, (0 split)
##       App_Usage_Mins    < 89.5   to the right, agree=0.895, adj=0.244, (0 split)
##       Number_Apps       < 20     to the right, agree=0.895, adj=0.244, (0 split)
##       User_Class        splits as  LLLLR,      agree=0.895, adj=0.244, (0 split)
## 
## Node number 56: 12 observations
##   predicted class=Google Pixel 5      expected loss=0.4166667  P(node) =0.0242915
##     class counts:     7     0     3     1     1
##    probabilities: 0.583 0.000 0.250 0.083 0.083 
## 
## Node number 57: 16 observations
##   predicted class=Samsung Galaxy S21  expected loss=0.5  P(node) =0.03238866
##     class counts:     3     0     5     8     0
##    probabilities: 0.188 0.000 0.312 0.500 0.000 
## 
## Node number 58: 279 observations,    complexity param=0.01108269
##   predicted class=Google Pixel 5      expected loss=0.734767  P(node) =0.5647773
##     class counts:    74     0    66    68    71
##    probabilities: 0.265 0.000 0.237 0.244 0.254 
##   left son=116 (238 obs) right son=117 (41 obs)
##   Primary splits:
##       Age               < 53.5   to the left,  improve=2.2551930, (0 missing)
##       Data_Usage_MB     < 2342.5 to the left,  improve=1.6937570, (0 missing)
##       App_Usage_Mins    < 573    to the right, improve=1.4746010, (0 missing)
##       Battery_Drain_mAh < 2779   to the left,  improve=1.0130800, (0 missing)
##       Gender            splits as  RL,         improve=0.9274069, (0 missing)
##   Surrogate splits:
##       Data_Usage_MB < 2474   to the left,  agree=0.86, adj=0.049, (0 split)
## 
## Node number 59: 45 observations
##   predicted class=Xiaomi Mi 11        expected loss=0.6222222  P(node) =0.09109312
##     class counts:     3     0    13    12    17
##    probabilities: 0.067 0.000 0.289 0.267 0.378 
## 
## Node number 116: 238 observations
##   predicted class=Google Pixel 5      expected loss=0.7142857  P(node) =0.4817814
##     class counts:    68     0    56    51    63
##    probabilities: 0.286 0.000 0.235 0.214 0.265 
## 
## Node number 117: 41 observations,    complexity param=0.01023018
##   predicted class=Samsung Galaxy S21  expected loss=0.5853659  P(node) =0.08299595
##     class counts:     6     0    10    17     8
##    probabilities: 0.146 0.000 0.244 0.415 0.195 
##   left son=234 (21 obs) right son=235 (20 obs)
##   Primary splits:
##       Gender            splits as  LR,         improve=3.325552, (0 missing)
##       Data_Usage_MB     < 1150   to the left,  improve=2.150094, (0 missing)
##       Screen_Time_Hrs   < 6.3    to the left,  improve=1.868043, (0 missing)
##       Number_Apps       < 74.5   to the left,  improve=1.850949, (0 missing)
##       Battery_Drain_mAh < 1880   to the left,  improve=1.573171, (0 missing)
##   Surrogate splits:
##       Battery_Drain_mAh < 2193   to the right, agree=0.659, adj=0.30, (0 split)
##       Age               < 57.5   to the left,  agree=0.659, adj=0.30, (0 split)
##       App_Usage_Mins    < 389.5  to the right, agree=0.634, adj=0.25, (0 split)
##       Number_Apps       < 74.5   to the right, agree=0.610, adj=0.20, (0 split)
##       Screen_Time_Hrs   < 4.55   to the right, agree=0.585, adj=0.15, (0 split)
## 
## Node number 234: 21 observations
##   predicted class=OnePlus 9           expected loss=0.6190476  P(node) =0.04251012
##     class counts:     3     0     8     4     6
##    probabilities: 0.143 0.000 0.381 0.190 0.286 
## 
## Node number 235: 20 observations
##   predicted class=Samsung Galaxy S21  expected loss=0.35  P(node) =0.04048583
##     class counts:     3     0     2    13     2
##    probabilities: 0.150 0.000 0.100 0.650 0.100
# Plot
rpart.plot(decision_tree_model_2)

Let’s test predictions for the second tree

# Predict on the full dataset again; the train/test comparison follows below
predictions_2 <- predict(decision_tree_model_2, newdata = Phone_Usage, type = "class")

# confusion matrix
confusion_matrix_2 <- table(Predicted = predictions_2, Actual = Phone_Usage$Device_Model)
print(confusion_matrix_2)
##                     Actual
## Predicted            Google Pixel 5 iPhone 12 OnePlus 9 Samsung Galaxy S21
##   Google Pixel 5                120         0        95                 84
##   iPhone 12                       0       146         0                  0
##   OnePlus 9                       4         0         9                  5
##   Samsung Galaxy S21              9         0        13                 23
##   Xiaomi Mi 11                    9         0        16                 21
##                     Actual
## Predicted            Xiaomi Mi 11
##   Google Pixel 5              111
##   iPhone 12                     0
##   OnePlus 9                     8
##   Samsung Galaxy S21            2
##   Xiaomi Mi 11                 25
# Calculate accuracy
accuracy <- sum(diag(confusion_matrix_2)) / sum(confusion_matrix_2)
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.4614286

This model immediately detects at its root node that all the iPhone 12s run the iOS operating system and splits them off from the rest of the Android devices, but beyond that it does not perform well, with an overall classification accuracy of only 46%. It then tried to split on Data_Usage_MB, Number_Apps, Battery_Drain_mAh, and even Age and Gender, producing many decision nodes along the tree and making it difficult to interpret.
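Per-class statistics make that failure mode clearer than the single accuracy number; a short sketch using caret, with the reference levels aligned to the model's factor levels:

# Sensitivity/specificity per device model
caret::confusionMatrix(predictions_2,
                       factor(Phone_Usage$Device_Model, levels = levels(predictions_2)))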

Let’s examine Bias/Variance for this new tree

# Predict on training data
train_predictions_2 <- predict(decision_tree_model_2, newdata = train_data_2, type = "class")
# Compare predictions against the actual device models (the target of this tree)
train_confusion_matrix_2 <- table(Predicted = train_predictions_2, Actual = train_data_2$Device_Model)
train_accuracy_2 <- sum(diag(train_confusion_matrix_2)) / sum(train_confusion_matrix_2)

# Predict on test data
test_predictions_2 <- predict(decision_tree_model_2, newdata = test_data_2, type = "class")
test_confusion_matrix_2 <- table(Predicted = test_predictions_2, Actual = test_data_2$Device_Model)
test_accuracy_2 <- sum(diag(test_confusion_matrix_2)) / sum(test_confusion_matrix_2)


# Check;
train_confusion_matrix_2
##                     Actual
## Predicted            High Usage Low Usage Moderate Usage Very High Usage
##   Google Pixel 5             62        45             66              65
##   iPhone 12                  24        22             21              19
##   OnePlus 9                   6         3              8               4
##   Samsung Galaxy S21          7         2             10               1
##   Xiaomi Mi 11                0        34              0               0
##                     Actual
## Predicted            Very Low Usage
##   Google Pixel 5                 43
##   iPhone 12                      17
##   OnePlus 9                       0
##   Samsung Galaxy S21             16
##   Xiaomi Mi 11                   19
train_accuracy_2
## [1] 0.2267206
test_confusion_matrix_2
##                     Actual
## Predicted            High Usage Low Usage Moderate Usage Very High Usage
##   Google Pixel 5             30        19             24              32
##   iPhone 12                   5        10             11              12
##   OnePlus 9                   2         1              1               1
##   Samsung Galaxy S21          3         1              2               2
##   Xiaomi Mi 11                0         9              0               0
##                     Actual
## Predicted            Very Low Usage
##   Google Pixel 5                 24
##   iPhone 12                       5
##   OnePlus 9                       0
##   Samsung Galaxy S21              3
##   Xiaomi Mi 11                    9
test_accuracy_2
## [1] 0.2524272

So this model has neither high training accuracy nor high test accuracy, and overall manages a meager 46%, which to me indicates that attempting to predict a user’s mobile device model as a function of behavioral features is difficult and not really worthwhile. A user could own any type of mobile device and use it differently, installing different numbers of apps or producing varying amounts of battery drain and data usage, all of which can vary within the same kind of phone. A better basis for predicting a mobile device model would be data about the hardware itself, such as storage capacity, battery life, case and screen size, or the number of cameras, which would lead to better performance and prediction capabilities.
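As a sanity check on "not really worthwhile," the tree can be compared against a majority-class baseline; with five roughly balanced device models the baseline should sit near 20%, and a sizable share of the tree's 46% comes from the trivial iPhone/iOS split:

# Naive baseline: always guess the most common device model
prop.table(table(Phone_Usage$Device_Model))               # class balance
max(table(Phone_Usage$Device_Model)) / nrow(Phone_Usage)  # baseline accuracy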

Random Forest

set.seed(1234)

# First let's split our data
# Stratify the split on the response we are predicting (User_Class) and index with it
sample_data_rf = createDataPartition(Phone_Usage$User_Class, p = 0.7, list=FALSE)
train_data_rf <- Phone_Usage[sample_data_rf, ]
test_data_rf <- Phone_Usage[-sample_data_rf, ]

# Convert response variables to factors
train_data_rf$User_Class <- as.factor(train_data_rf$User_Class)
train_data_rf$Operating_System <- as.factor(train_data_rf$Operating_System)
train_data_rf$Gender <- as.factor(train_data_rf$Gender)

test_data_rf$User_Class <- as.factor(test_data_rf$User_Class)
test_data_rf$Operating_System <- as.factor(test_data_rf$Operating_System)
test_data_rf$Gender <- as.factor(test_data_rf$Gender)


# Convert User_Class to a factor so we can use it in the rf
Phone_Usage$User_Class <- as.factor(Phone_Usage$User_Class)

# Random forest model (Battery_Drain_mAh deliberately excluded; see the discussion below)
Phone_Usage_rf_model <- randomForest(User_Class ~ . -Battery_Drain_mAh, data = train_data_rf, importance = TRUE, ntree = 100, na.action = na.omit)

# Print the model summary
print(Phone_Usage_rf_model)
## 
## Call:
##  randomForest(formula = User_Class ~ . - Battery_Drain_mAh, data = train_data_rf,      importance = TRUE, ntree = 100, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 0%
## Confusion matrix:
##                 High Usage Low Usage Moderate Usage Very High Usage
## High Usage              98         0              0               0
## Low Usage                0       103              0               0
## Moderate Usage           0         0            101               0
## Very High Usage          0         0              0              96
## Very Low Usage           0         0              0               0
##                 Very Low Usage class.error
## High Usage                   0           0
## Low Usage                    0           0
## Moderate Usage               0           0
## Very High Usage              0           0
## Very Low Usage              96           0
# predictions on test data
predictions_rf <- predict(Phone_Usage_rf_model, newdata = test_data_rf)

# confusion matrix
confusion_matrix_rf <- table(Predicted = predictions_rf, Actual = test_data_rf$User_Class)
print(confusion_matrix_rf)
##                  Actual
## Predicted         High Usage Low Usage Moderate Usage Very High Usage
##   High Usage              41         0              0               0
##   Low Usage                0        43              0               0
##   Moderate Usage           0         0             42               0
##   Very High Usage          0         0              0              40
##   Very Low Usage           0         0              0               0
##                  Actual
## Predicted         Very Low Usage
##   High Usage                   0
##   Low Usage                    0
##   Moderate Usage               0
##   Very High Usage              0
##   Very Low Usage              40
# Calculate accuracy
accuracy_rf <- sum(diag(confusion_matrix_rf)) / sum(confusion_matrix_rf)
cat("Accuracy:", accuracy_rf, "\n")
## Accuracy: 1
# Feature importance: numeric table, then dot plot
importance(Phone_Usage_rf_model)
##                   High Usage   Low Usage Moderate Usage Very High Usage
## User_ID           0.26149427 -0.66326520     -1.4817073       1.0050378
## Device_Model     -0.02049374  0.07633477      1.0050378       0.0000000
## Operating_System -1.42857143 -1.00503782      0.0000000       1.0050378
## App_Usage_Mins   10.72376546  8.85911611     10.5842477      10.6165690
## Screen_Time_Hrs   7.06272943  5.64369159      6.7243530       7.4937651
## Number_Apps      10.33061554  9.55700665     10.6510958       9.0913554
## Data_Usage_MB    11.47557826  9.77870081     11.1497703       9.4401634
## Age              -0.87541820  2.18814262     -1.3946743      -0.2194524
## Gender            0.00000000  1.00503782     -0.1007637       0.0000000
##                  Very Low Usage MeanDecreaseAccuracy MeanDecreaseGini
## User_ID              0.09519059           -1.1010920       1.11491288
## Device_Model        -1.00503782           -0.2289235       0.13213994
## Operating_System     0.00000000           -0.8963739       0.07815643
## App_Usage_Mins       8.30033044           13.6089078     103.59720450
## Screen_Time_Hrs      5.17082599            9.5503275      61.14847093
## Number_Apps         10.23323253           13.8931620     109.74574344
## Data_Usage_MB       11.10183492           15.3490210     117.55542023
## Age                 -1.69761274           -0.3373112       0.90627967
## Gender              -1.00503782           -1.0050378       0.09628737
varImpPlot(Phone_Usage_rf_model)

# Plot the error vs the number of trees graph
plot(Phone_Usage_rf_model)

Given that I already know the User_Class response variable is almost entirely decided by Battery_Drain_mAh and Data_Usage_MB, as the plots above show, the forest was trained with Battery_Drain_mAh excluded. It seems rather boring and intuitive that a response variable can be determined by looking at just two fields, so a re-fit that also drops Data_Usage_MB is sketched below. After removing Battery_Drain_mAh, I noticed that App_Usage_Mins has the strongest effect on User_Class, followed by Number_Apps and only then Data_Usage_MB. As expected, Age, Operating_System, Gender, and Device_Model have little to no effect on the model’s accuracy, and they have no impact on the Gini measure, which gauges the purity of each leaf node.
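Here is a sketch of that re-fit, dropping both dominant fields plus the ID column (rf_reduced is my name for it; the results are not shown here):

set.seed(1234)
# Re-fit without either of the two dominant drivers
rf_reduced <- randomForest(User_Class ~ . - User_ID - Battery_Drain_mAh - Data_Usage_MB,
                           data = train_data_rf, importance = TRUE, ntree = 100)
print(rf_reduced)
varImpPlot(rf_reduced)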

Let’s address the following question:

Based on real cases where decision trees went wrong, and 'the bad & ugly' aspects of decision trees, how can you change this perception when using the decision tree you created to solve a real problem? How would you address the concerns in the blog?

The bad aspects of decision trees appear when the data is highly complicated, large and unmaintainable, subjective and opinionated, extremely complex and high-dimensional, or when the data changes and evolves over time. I can change these perceptions by addressing the potential causes of overfitting to the training data, and by being explicit about the ways the data itself might be flawed. For example, I would point out that the people who sourced this data did a sub-par job of defining User_Class, since the main deciding factor appears to be App_Usage_Mins thresholds for whether a user falls into a low, moderate, or high class.

Decision trees also require pruning and a maximum depth in order to remain interpretable. To solve real problems, such as understanding how different mobile devices are used, how many apps are installed, or how quickly batteries drain (all useful metrics for phone companies), I would like to expand this dataset with more features/dimensions to increase its usefulness. Where needed, decision trees can also be combined through ensemble methods such as bagging or boosting, as the random forest above demonstrates, to increase stability and reduce noise. To make decision trees easier to view, visual applications should present them like a flowchart, so that users only ever see two branches at a time and decide which branch to go further down; this would be especially useful for viewing a tree on a mobile phone.

Finally, decision trees perform poorly when they overfit the training data: they fit that data well but fail to generalize to new data that looks different. This is why datasets used for decision trees must be diverse and have a large enough sample size, and why models should consider multiple features instead of just one.
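As one concrete example of the pruning and depth-limiting mentioned above, a hedged sketch against the first training set: grow a depth-constrained tree, then prune it at the cross-validated complexity minimum (the control values are illustrative, not tuned):

# Constrain depth up front, then prune using the cross-validated cp table
shallow_tree <- rpart(User_Class ~ . - User_ID, data = train_data, method = "class",
                      control = rpart.control(maxdepth = 3, minsplit = 20, cp = 0.01))
printcp(shallow_tree)  # cross-validated error (xerror) by complexity parameter
best_cp <- shallow_tree$cptable[which.min(shallow_tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(shallow_tree, cp = best_cp)
rpart.plot(pruned_tree)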