Decomposition Trees for QDA

Creating a Decomposition Tree

This workflow will walk through how to create a decomposition tree to visualize qualitative data analysis coding results.

Dataset

We will be using a coded and labeled qualitative dataset.

Materials

We will be using:

tidyverse for data manipulation and piping
readxl to import the Excel workbook
rpart to create the decomposition tree
rpart.plot to visualize the decomposition tree
partykit, installed and loaded separately if you do not already have it
rattle (optional; may make your plot prettier)
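
Before loading anything, it can help to check which of these packages are already installed. The snippet below is a minimal sketch, not part of the original workflow; the names pkgs, installed, and missing_pkgs are made up for illustration, and the install line is left commented out so nothing is downloaded by accident.

```r
# Packages used in this workflow (readxl is loaded in Step 1)
pkgs <- c("tidyverse", "readxl", "rpart", "rpart.plot", "partykit", "rattle")

# TRUE for each package that is already installed
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need installing
missing_pkgs <- pkgs[!installed]
print(missing_pkgs)

# Uncomment to install whatever is missing
# install.packages(missing_pkgs)
```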

Workflow

Step 1 - Import Data

This dataset contains text segments that I’ve already coded in my text analysis.

#install.packages("readxl")
library(readxl)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(rpart)
library(rpart.plot)
library(rattle)

Loading required package: bitops
Rattle: A free graphical interface for data science with R.
Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.

library(partykit)

Loading required package: grid
Loading required package: libcoin
Loading required package: mvtnorm

qda  <- read_excel("Cohort-Specific Sessions QDATool.xlsx")
qda$doc_index <- "Redacted"
head(qda)

# A tibble: 6 × 13
  Column1 doc_index text   Code1 Code2 Code3 Code4 Code5 Code6 Code7 Code8 Code9
    <dbl> <chr>     <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <lgl> <lgl>
1       1 Redacted  Behav… Non-… <NA>  <NA>  <NA>  <NA>  <NA>  NA    NA    NA   
2       2 Redacted  Q3. W… Non-… <NA>  <NA>  <NA>  <NA>  <NA>  NA    NA    NA   
3       3 Redacted  Being… Stre… Coho… Netw… Lear… <NA>  <NA>  NA    NA    NA   
4       4 Redacted  Netwo… Stre… Coho… Netw… Lear… <NA>  <NA>  NA    NA    NA   
5       5 Redacted  I ben… Stre… Coho… Inte… <NA>  <NA>  <NA>  NA    NA    NA   
6       6 Redacted  The p… Stre… Coho… Lear… Spea… <NA>  <NA>  NA    NA    NA   
# ℹ 1 more variable: Code10 <lgl>

Step 2 - Clean data

The data isn’t organized in a way that my decision tree can understand. I need to take the following cleaning steps:

Steps	Rationale	Package and Functions Used
Remove rows with non-substantive and parking lot codes	Keeps only the rows that carry our codes of interest; non-substantive codes were used for comments that did not provide substantial feedback, and parking lot was used for text I was unsure about during the coding step	dplyr::filter (tidyverse)
Remove code columns with no coding data	There were more code columns than actual codes, so we can get rid of the columns we don’t need to clean up our dataset	dplyr::select, complete.cases
Pivot rows from wide to long	This gets ALL of our codes into one column	tidyr::pivot_longer
Filter out my text segments	This makes sure we only have codes	dplyr::filter
Cross-tabulate, count frequencies, and save results	These frequencies will be the building blocks our algorithm uses for the decision tree	xtabs (stats), dplyr::mutate, dplyr::if_else, rpart, rpart.plot, rattle (optional)

#Clean the data, pivot wide to long, and calculate the frequency table 


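cohort.qda <- qda %>%
  select(-c(Code7:Code10, Column1)) %>%      # drop empty code columns and the row index
  filter(Code1 != "Non-Substantive") %>%     # keep only substantive segments
  pivot_longer(!doc_index, names_to = "CodeType", values_to = "Code") %>%
  filter(CodeType != "text")                 # keep codes only, not the text segments

# drop the rows left with no code (NA) after pivoting
cohort.qda <- cohort.qda[complete.cases(cohort.qda), ]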
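
To see what the wide-to-long step does, here is a minimal sketch on a made-up three-row dataset. The toy values and the names toy and toy_long are hypothetical and only mirror the layout of qda; tidyr::drop_na() stands in for the complete.cases() line above.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in the same wide layout as qda
toy <- tibble(
  doc_index = c("A", "B", "C"),
  text  = c("great networking", "n/a", "learned a lot"),
  Code1 = c("Strengths", "Non-Substantive", "Strengths"),
  Code2 = c("Networking", NA, NA)
)

toy_long <- toy %>%
  filter(Code1 != "Non-Substantive") %>%   # drops document B
  pivot_longer(!doc_index, names_to = "CodeType", values_to = "Code") %>%
  filter(CodeType != "text") %>%           # keep only the code rows
  drop_na(Code)                            # drop the empty Code2 for document C

toy_long
```

Each remaining row is now one code applied to one document, which is the shape xtabs() needs to count frequencies.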

code.freqs <- xtabs(~Code, data = cohort.qda) %>% as.data.frame()

head(code.freqs)

                                                                                    Code
1                                                                 Areas for Improvements
2                                                                        Barriers to Use
3  Coalition Building and Collaboration Strategies (Suggestions for Future RSV Sessions)
4                                                            Cohort Sessions (Strengths)
5 Cohort-Specific Sessions and Grantee Interaction (Suggestions for Future RSV Sessions)
6                                                                 Data Capacity-Building
  Freq
1   70
2    1
3    1
4  168
5    4
6    9

Note: We’ll use the average frequency as a threshold value to create another grouping variable for our tree.

summary(code.freqs$Freq) #mean is 23.94, rounded to 24 for the threshold below

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    2.00    9.00   23.94   19.50  168.00

code.freqs <- code.freqs %>%
  mutate(CodeFreqTreshold = if_else(Freq > 24, "Exceeds Average", "Meets or Below Average"))

head(code.freqs)

                                                                                    Code
1                                                                 Areas for Improvements
2                                                                        Barriers to Use
3  Coalition Building and Collaboration Strategies (Suggestions for Future RSV Sessions)
4                                                            Cohort Sessions (Strengths)
5 Cohort-Specific Sessions and Grantee Interaction (Suggestions for Future RSV Sessions)
6                                                                 Data Capacity-Building
  Freq       CodeFreqTreshold
1   70        Exceeds Average
2    1 Meets or Below Average
3    1 Meets or Below Average
4  168        Exceeds Average
5    4 Meets or Below Average
6    9 Meets or Below Average
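
The threshold above is typed in by hand as 24. As a minimal sketch of an alternative, the threshold can be computed from the data so it stays in sync if the codes change. The toy frequencies and the names toy_freqs and threshold are assumptions for illustration; note that comparing against the unrounded mean can classify a code sitting exactly at the rounded value differently.

```r
library(dplyr)

# Hypothetical frequency table with the same columns as code.freqs
toy_freqs <- data.frame(
  Code = c("a", "b", "c", "d"),
  Freq = c(70, 1, 4, 9)
)

# Compute the threshold instead of hard-coding it
threshold <- mean(toy_freqs$Freq)

toy_freqs <- toy_freqs %>%
  mutate(CodeFreqTreshold = if_else(Freq > threshold,
                                    "Exceeds Average",
                                    "Meets or Below Average"))

toy_freqs
```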

Step 3 - Build Decision Tree

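set.seed() fixes the seed of R’s random number generator, so any step that relies on random numbers (such as rpart’s internal cross-validation) gives the same results every time across different computers, scripts, and coding environments.

show_engines() lists the engines (the underlying implementations) and modes available for a decision tree in the parsnip package, which is part of tidymodels rather than the tidyverse. It lets us know we have access to tidymodels documentation if we need help.

We set method to “class” because our outcome, Code, is categorical, so rpart should fit a classification tree. For a numeric outcome, rpart would fit a regression tree with method = “anova” instead.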

set.seed(123)
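
As a quick illustration of what the seed does (a minimal sketch, separate from the tree itself; first_draw and second_draw are made-up names), resetting the same seed before a random step reproduces the same result:

```r
set.seed(123)
first_draw <- sample(1:10)

set.seed(123)
second_draw <- sample(1:10)

# Same seed, same shuffle
identical(first_draw, second_draw)
```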


parsnip::show_engines("decision_tree")

# A tibble: 5 × 2
  engine mode          
  <chr>  <chr>         
1 rpart  classification
2 rpart  regression    
3 C5.0   classification
4 spark  classification
5 spark  regression

tree_model1 <- rpart(Code ~ ., data = code.freqs,
                    method = "class",
                    control = rpart.control(minsplit= 10, cp = 0.01))
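
If you want to try the same call on data you can run right away, the sketch below fits a classification tree to kyphosis, an example dataset that ships with rpart. It is not the QDA data; it only shows the same method and control arguments and two ways to inspect the fitted model as text. The name fit is made up for illustration.

```r
library(rpart)

# kyphosis ships with rpart; Kyphosis is a factor (absent/present)
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis,
             method = "class",
             control = rpart.control(minsplit = 10, cp = 0.01))

print(fit)    # text version of the splits
printcp(fit)  # complexity table: cp, number of splits, cross-validated error
```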

rpart.plot(tree_model1, type = 2,  extra = 104, under = FALSE, 
           fallen.leaves = FALSE, box.palette = "auto", nn = TRUE,
           tweak = .8, legend.x = NA)
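
To keep a copy of the plot, one option is to wrap the rpart.plot() call in a graphics device. This is a minimal sketch using the kyphosis example data from rpart rather than the QDA data; out_file is a hypothetical path in the session's temporary folder, and the width, height, and resolution are arbitrary choices.

```r
library(rpart)
library(rpart.plot)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Hypothetical output path; change it to wherever you keep figures
out_file <- file.path(tempdir(), "tree.png")

png(out_file, width = 1200, height = 800, res = 150)
rpart.plot(fit, type = 2, extra = 104, fallen.leaves = FALSE,
           box.palette = "auto", nn = TRUE, tweak = .8)
dev.off()

file.exists(out_file)
```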

Step 4 - Interpreting the Tree

#model level


#rpart(Code ~ ., data = code.freqs)
     
#this specifies which variables to include in the model and which dataset to pull those variables from 

#method = "class" 

#fit a classification tree, because the outcome (Code) is categorical

#control = rpart.control(minsplit= 10, cp = 0.01)) 

#controls when the tree is allowed to split: minsplit is the minimum number of observations a node needs before a split is attempted, and cp is the complexity parameter (a split is kept only if it improves the overall fit by at least that factor)
#graphic level 

#rpart.plot(tree_model1, type = 2,  extra = 104, under = FALSE, 
#           fallen.leaves = FALSE, box.palette = "auto", nn = TRUE,
#           tweak = .8, legend.x = NA)

#in short, I want to plot my tree_model1 model, use the default node format (type = 2), show the probability of each class in a node plus the percentage of observations in that node (extra = 104), keep the text inside the boxes rather than under them (under = FALSE), leave the leaves where they fall instead of lining them all up at the bottom of the graph (fallen.leaves = FALSE), auto-generate the box color palette, show the node numbers (nn = TRUE), make text 80% of the normal size (tweak = .8), and remove the legend (legend.x = NA)

fancyRpartPlot(tree_model1) #example with rattle, but it didn't change much for me

My codes are placed in the chart based on whether or not they exceed a frequency threshold determined by the decision tree. You can see the breakdown of codes by frequency, including the share of the decision space (the percentage shown in each node) that a code shares with other codes based on its overall frequency in the analysis.
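
If the plot is hard to read, the same information can be pulled out as text. The sketch below again uses the kyphosis example data from rpart, not the QDA data, and the names fit and rules are made up for illustration. rpart.rules() from rpart.plot prints one rule per leaf, and cover = TRUE adds the percentage of observations each rule covers.

```r
library(rpart)
library(rpart.plot)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# One row per leaf: fitted probability, the split conditions, and % of observations
rules <- rpart.rules(fit, cover = TRUE)
print(rules)

# The node table behind the plot: split variable, node size, predicted class
fit$frame[, c("var", "n", "yval")]
```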