── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rpart)
library(rpart.plot)
library(rattle)
Loading required package: bitops
Rattle: A free graphical interface for data science with R.
Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
# A tibble: 6 × 13
Column1 doc_index text Code1 Code2 Code3 Code4 Code5 Code6 Code7 Code8 Code9
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <lgl> <lgl>
1 1 Redacted Behav… Non-… <NA> <NA> <NA> <NA> <NA> NA NA NA
2 2 Redacted Q3. W… Non-… <NA> <NA> <NA> <NA> <NA> NA NA NA
3 3 Redacted Being… Stre… Coho… Netw… Lear… <NA> <NA> NA NA NA
4 4 Redacted Netwo… Stre… Coho… Netw… Lear… <NA> <NA> NA NA NA
5 5 Redacted I ben… Stre… Coho… Inte… <NA> <NA> <NA> NA NA NA
6 6 Redacted The p… Stre… Coho… Lear… Spea… <NA> <NA> NA NA NA
# ℹ 1 more variable: Code10 <lgl>
Step 2 - Clean data
The data isn’t organized in a way that my decision tree can understand, so I need to take the following cleaning steps:
| Steps | Rationale | Package and Functions Used |
|---|---|---|
| Remove rows with non-substantive and parking lot codes | Filters the data down to rows that include our codes of interest; non-substantive codes were used for comments that did not provide substantial feedback, and parking lot was used for text I was unsure about during the coding step | dplyr::filter (tidyverse) |
| Remove code columns with no coding data | There were more code columns than actual codes, so we can get rid of the columns we don’t need to clean up our dataset | dplyr::select, complete.cases |
| Pivot rows from wide to long | This gets ALL of our codes into one column | tidyr::pivot_longer |
| Filter out my text segments | This makes sure we only have codes | dplyr::filter |
| Cross tabulate, count frequencies, and save results | These frequencies will be the building blocks our algorithm uses for the decision tree | xtabs (base), dplyr::mutate, dplyr::if_else, rpart, rpart.plot, rattle (optional) |
# Clean the data, pivot wide to long, and calculate the frequency table
cohort.qda <- qda %>%
  select(-c(Code7:Code10, Column1)) %>%
  filter(Code1 != "Non-Substantive") %>%
  pivot_longer(!doc_index, names_to = "CodeType", values_to = "Code") %>%
  filter(CodeType != "text")

cohort.qda <- cohort.qda[complete.cases(cohort.qda), ]

code.freqs <- xtabs(~Code, data = cohort.qda) %>%
  as.data.frame()

head(code.freqs)
Code
1 Areas for Improvements
2 Barriers to Use
3 Coalition Building and Collaboration Strategies (Suggestions for Future RSV Sessions)
4 Cohort Sessions (Strengths)
5 Cohort-Specific Sessions and Grantee Interaction (Suggestions for Future RSV Sessions)
6 Data Capacity-Building
Freq
1 70
2 1
3 1
4 168
5 4
6 9
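To make the reshaping steps concrete, here is a minimal sketch on a made-up data frame. The `toy.qda` object, its three rows, and its code labels are hypothetical stand-ins rather than the real data; only the sequence of calls mirrors the pipeline above.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for qda: two coded segments and one non-substantive one
toy.qda <- data.frame(
  doc_index = c("A", "B", "C"),
  text      = c("first comment", "second comment", "third comment"),
  Code1     = c("Strengths", "Non-Substantive", "Strengths"),
  Code2     = c("Networking", NA, NA),
  stringsAsFactors = FALSE
)

toy.long <- toy.qda %>%
  filter(Code1 != "Non-Substantive") %>%       # drop non-substantive rows
  pivot_longer(!doc_index, names_to = "CodeType",
               values_to = "Code") %>%         # one row per document/column pair
  filter(CodeType != "text")                   # keep the codes, drop the text segments

# Drop the code slots that were never filled in
toy.long <- toy.long[complete.cases(toy.long), ]

# Count how often each code was applied (columns: Code, Freq)
toy.freqs <- xtabs(~Code, data = toy.long) %>%
  as.data.frame()
```

After the pivot every code sits in a single `Code` column, which is what lets `xtabs` count the codes in one pass.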
Note: We’ll use the average (mean) frequency as a threshold to create a second grouping variable for our tree.
summary(code.freqs$Freq) #mean is 24
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 2.00 9.00 23.94 19.50 168.00
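The code that adds the threshold column is not shown here. A minimal sketch of how it could be done with dplyr::mutate and dplyr::if_else follows, assuming the cut-off is the mean frequency. The `code.freqs.demo` data frame is rebuilt by hand from the six rows printed earlier, so its mean differs from the 23.94 of the full table, and `code.freqs.demo` and `avg.freq` are hypothetical names. The column name and labels are copied from the output that follows.

```r
library(dplyr)

# First six rows of the frequency table, copied from the head() output
code.freqs.demo <- data.frame(
  Code = c(
    "Areas for Improvements",
    "Barriers to Use",
    "Coalition Building and Collaboration Strategies (Suggestions for Future RSV Sessions)",
    "Cohort Sessions (Strengths)",
    "Cohort-Specific Sessions and Grantee Interaction (Suggestions for Future RSV Sessions)",
    "Data Capacity-Building"
  ),
  Freq = c(70, 1, 1, 168, 4, 9)
)

# Assumed threshold: the mean frequency of the rows at hand
avg.freq <- mean(code.freqs.demo$Freq)

# Label each code by whether its frequency exceeds the threshold
code.freqs.demo <- code.freqs.demo %>%
  mutate(CodeFreqTreshold = if_else(Freq > avg.freq,
                                    "Exceeds Average",
                                    "Meets or Below Average"))
```

Whether the original used the exact mean or the rounded value 24 is not stated; the two only disagree for a code whose frequency is exactly 24.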
Code
1 Areas for Improvements
2 Barriers to Use
3 Coalition Building and Collaboration Strategies (Suggestions for Future RSV Sessions)
4 Cohort Sessions (Strengths)
5 Cohort-Specific Sessions and Grantee Interaction (Suggestions for Future RSV Sessions)
6 Data Capacity-Building
Freq CodeFreqTreshold
1 70 Exceeds Average
2 1 Meets or Below Average
3 1 Meets or Below Average
4 168 Exceeds Average
5 4 Meets or Below Average
6 9 Meets or Below Average
Step 3 - Build Decision Tree
set.seed() fixes the starting point of R’s random number generator so that we get the same results every time, across different computers, scripts, and coding environments
show_engines() lists the engines available for the decision tree model in tidymodels’ parsnip package; it lets us know we have access to tidymodels documentation if we need help
We set method to “class” because our outcome (Code) is categorical, so rpart should fit a classification tree; trees of this kind belong to a family of machine learning algorithms called “classifiers”
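The seed and engine calls described above do not appear in the code shown, so here is a minimal sketch of what they might look like. The seed value 123 is an arbitrary assumption, and the show_engines() call only runs if the parsnip package (part of tidymodels) is installed.

```r
set.seed(123)                        # 123 is an arbitrary, assumed seed value
first.draw <- sample(1:100, 5)

set.seed(123)                        # resetting the seed replays the same random draws
second.draw <- sample(1:100, 5)

identical(first.draw, second.draw)   # TRUE

# List the engines parsnip knows for decision trees (skipped if parsnip is missing)
if (requireNamespace("parsnip", quietly = TRUE)) {
  parsnip::show_engines("decision_tree")
}
```

In rpart the randomness comes from the built-in cross-validation (the xval setting in rpart.control), so the seed mostly affects the cross-validated error estimates rather than the splits themselves.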
tree_model1 <- rpart(Code ~ ., data = code.freqs,
                     method = "class",
                     control = rpart.control(minsplit = 10, cp = 0.01))

rpart.plot(tree_model1, type = 2, extra = 104, under = FALSE,
           fallen.leaves = FALSE, box.palette = "auto", nn = TRUE,
           tweak = .8, legend.x = NA)
Step 4 - Interpreting the Tree
# Model level
# rpart(Code ~ ., data = code.freqs)
#   specifies which variables to include in the model and which dataset to pull them from
# method = "class"
#   fits a classification tree, because the outcome is categorical
# control = rpart.control(minsplit = 10, cp = 0.01)
#   controls when a node is split: minsplit is the minimum number of observations
#   a node needs before a split is attempted, and cp is the minimum improvement
#   in fit a split must deliver to be kept (I need to research this more)

# Graphic level
# rpart.plot(tree_model1, type = 2, extra = 104, under = TRUE,
#            fallen.leaves = FALSE, box.palette = "auto", nn = TRUE,
#            tweak = .8, legend.x = NA)
# In short, I want to plot my tree_model1 model, use the default node format
# (type = 2), show the per-class probabilities and the percentage of observations
# in each node, since I have more than two possible outcomes (extra = 104),
# place the extra text under the boxes (under = TRUE; the call I actually ran in
# Step 3 uses under = FALSE, which keeps that text inside the boxes), not force
# the leaf nodes to the bottom of the graph (fallen.leaves = FALSE), auto-select
# the box color palette (box.palette = "auto"), display the node numbers
# (nn = TRUE), make the text 80% of the normal size (tweak = .8), and remove the
# legend (legend.x = NA)

fancyRpartPlot(tree_model1)  # example with rattle, but it didn't change much for me
My codes are placed in the chart based on whether or not they exceed a frequency threshold determined by the decision tree. You can see the breakdown of codes by frequency, including the percentage of the decision space that each code shares with other codes based on its overall frequency in the analysis count.