Data Collection and Analysis Tree Project

Overview of the Data Collection Process

The data for the Narrative Tree Complexity Metric was collected through a web app built with the python web application flask. It is hosted on the heroku cloud application platform and can be found here. https://ctree-postgres.herokuapp.com/ .

A group of research participants were selected for their involvement in the interactive digital narrative field and were specifically requested to choose the tree that they believe was ‘more structurally complex.’ Because complexity is understood in a multitude of ways, a brief synopsis of what I mean by narrative tree structure complexity (as detailed in the literature review) was given. Other than that, no direction was given on how to determine which tree was more structurally complex. This is also detailed again in the “learn more about this app” page on the website.

When the participant selects which tree they intuit as more complex and submit, the winner and loser are recorded to a SQL database hosted on heroku, and presented with a new random pairing of trees.

After a sufficiently large sample of ratings, the data was pulled down from the server and was in the following form.

## # A tibble: 6 x 3
##      id w       l      
##   <dbl> <chr>   <chr>  
## 1     1 117.PNG 132.PNG
## 2     2 63.PNG  6.PNG  
## 3     3 111.PNG 130.PNG
## 4     4 132.PNG 107.PNG
## 5     5 165.PNG 130.PNG
## 6     6 165.PNG 75.PNG

The ‘winner’ is the tree that was selected to be more complex, and the ‘loser’ is the tree that was not selected.

These visualizations of the narrative trees are generated in the sherlock web application for interactive narrative design which is a react.js app. The visualizations are created from their design files.

These files take the following form:

unchanged image

The structure of the file and delimiters determine the structure of the tree.

For example: * ‘#’ denotes a new node * ‘[[]]’ denotes a branch, and its destination

These features were processed in python to calculate the following features of an individual tree:

number of nodes
number of branches
number of leaves (nodes with no posterior links)
number of non-leaf nodes (nodes minus leaves)
number of choice nodes (nodes with multiple exit branches)
maximum path length (path length is defined as the number of nodes connected to each other reaching from the origin to a leaf, and then the maximum is taken)
average path length (the average of all of the possible path’s lengths)
recursive branches (branches in a path that link back to a node that has already been traveled to)

These features were then processed automatically for all 300 trees in the original dataset and had the following output form:

## # A tibble: 6 x 9
##   name                nodes branches leaves   nln choice_nodes max_path avg_path
##   <chr>               <dbl>    <dbl>  <dbl> <dbl>        <dbl>    <dbl>    <dbl>
## 1 ./DesignTurns/017-~    17       16      5    12            4        8      7.4
## 2 ./DesignTurns/056-~    19       31      1    18            5        6      6  
## 3 ./DesignTurns/018-~     3        2      2     1            1        2      2  
## 4 ./DesignTurns/160-~    15        9      9     6            1        4      4  
## 5 ./DesignTurns/122-~     7        4      2     5            0        0      0  
## 6 ./DesignTurns/157-~    23       23      5    18            5       16     10  
## # ... with 1 more variable: recursive_branches <dbl>

These trees were then randomly sampled, and the pictures for the data collection phase were then collected.

After both the tree-file scraping and the complexity intuition rating stages of the project were completed, the rating events were then processed into an ELO score for every tree that was rated. The rationale for using an ELO score as a complexity metric is detailed in the literature review section. These scores were then joined to the features for analysis by a unique index That was tracked throughout this process.

The ELO scores post processing took the following form:

## # A tibble: 6 x 2
##   keyName value
##   <chr>   <dbl>
## 1 165.PNG  1694
## 2 153.PNG  1592
## 3 35.PNG   1583
## 4 115.PNG  1566
## 5 123.PNG  1547
## 6 171.PNG  1393

Intuitively the scores seem to track complexity fairly well. Here are the visualizations for one of the highest rated, and one of the lowest rated narrative tree structures:

elo score: 1694

elo score: 365

Overview of the Data

As detailed in the subsection dataset size below, the size of this data was particularly small. This combined with the fact that the model needed to be comprehensible and used analytically for researchers in the IDN space led me towards some form of regression. In the discussion section I will dig deeper into the possibility of launching a web tool that would allow researchers to upload twine files and have the features and complexity calculated automatically, but I would like to have an analytical form that a researcher could plug into a column in excel. This led me to focus on some form of linear regression.

The pair plot above shows that while a lot of the variables are positively correlated with the complexity metric ‘value’, they are also highly correlated with each other. Reducing the dimensionality of the data will both remove highly correlated predictors and follow the rule of thumb that there should be at least 10 observations per predictor in a linear regression.

Looking at the pairs plots of the data, it seems that all of our predictors have a right skew, and there are two datapoints in almost every pairs plots that are extreme outliers.

##   value index nodes branches leaves nln choice_nodes max_path avg_path
## 1  1694   165    43       61      3  40           19       32 24.03793
## 2  1393   171    50       56      6  44           11       17 13.27778
##   recursive_branches is.outlier is.extreme
## 1                639       TRUE      FALSE
## 2                  3       TRUE      FALSE

Tree 165 and 171 appear as outliers in most of the predictor columns, yet due to the way the ELO is calculated they are not outliers in terms of ELO. These datapoints could apply undue leverage on a predictive model without being accounted for.

## 
## Call:
## lm(formula = value ~ log10(nodes), data = temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -428.74  -96.16   16.26  125.30  275.70 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     84.53      80.06   1.056    0.297    
## log10(nodes)   892.45      74.21  12.026 1.18e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 170 on 45 degrees of freedom
## Multiple R-squared:  0.7627, Adjusted R-squared:  0.7574 
## F-statistic: 144.6 on 1 and 45 DF,  p-value: 1.184e-15

According to a simple linear model, 75% of the variance in the complexity metric alone can be accounted for by a transformation of its most predictive variable.

Recursive Branches

Recursion is a well established element in complexity metrics across fields. Lets look at the number of recursive branches vs the complexity metric

The two outliers in recursive_branches are clearly calculation errors. The method in which recursion is calculated will be detailed in its own section.

Recursivity could be included as a dummy variable for whether or not a tree contains a recursive element

There seems to be some difference between the distributions of the complexity scores, so this might be worth including in our analysis.

PCA

## Recipe
## 
## Inputs:
## 
##       role #variables
##         id          1
##    outcome          1
##  predictor          8
## 
## Training data contained 47 data points and no missing data.
## 
## Operations:
## 
## Centering and scaling for nodes, branches, leaves, nln, choice_nodes, max... [trained]
## PCA extraction with nodes, branches, leaves, nln, choice_nodes, max_... [trained]

percent variation explained by each PC

Over 70% of the variance can be explained by the first principal component. Recall that from our simple regression with the log of the predictor “node” we were able to get an R^2 of over 75%. The remaining several principle components do explain a decent amount of the variation left in the data however.

Note that the predictors in PC1 have almost identical loadings, This could be due to the high colinearity between the predictors.

This plot can be intuited by the two largest trees (most nodes) being highly negative, and the most positive being the smallest. PC2 we see highly negative values from trees with recursion and low leaf counts (compared to their size) and positive values without recursion and high relative leaf counts. An interesting example here is an incoherent tree (175) where the author included a bunch of extra nodes unconnected to the tree itself, which the scraping algorithm will identify as a leaf.

We will most likely not model off of the PCA, because the goal of this project is intelligibility and ease of use by researchers, both of which PCA hinders.

Data Collection Details

Collecting the Data

Collecting Tree Features

The first step in the data collection process was to programatically calculate the features of the narrative tree structures. This was done in python by creating a Tree class object and an internal Node class object, then iterating through the narrative tree text files and scraping the relevant information. The code can be found here.

One of the more complex processes was to calculate a trees “path length.” This is a method inside of the Tree object called get_longest_path. Once the nodes are defined inside a Tree object, get_longest_path will iterate through the nodes and compile a list of lists recording the iterators previous node locations. If there is a ‘choice node’, or node with more than one branch. The iterator will copy the list of previous destinations as a new element in the list of lists. Once the iterator lands in a node that has no branches out (leaf) it will record the length of its list in a 3rd list. If the iterator follows a branch that returns to a node it has previously visited, it will delete the path and increment recursive_branches for the Tree object. because The path lists are deleted once the iterator reaches a leaf or a recursive branch, the process will complete when the main list is empty.

The result of this process is a list of all of the possible paths and their lengths. From this list the maximum path length and the average path length are calculated.

The Tree object is initialized with a narrative tree text file output from the sherlock application and has a method called features() that will output the relevant information about the tree in a list.

These lists can then be compiled into a dataframe.

Selecting Narrative Trees for the Study

One of the outputs from the Tree object was a boolean called incoherent. Sherlock does not require the narratives to be functional, so the Tree object tested for things like branches that do not lead to nodes. After some sorting for coherence, 50 trees were randomly selected for analysis in the flask app. The reason why a sample of 50 trees were chosen will be discussed in the ELO section of Data Collection Details.

flask Application

The application that the experts used to compare the narrative tree structurs was developed with a flask framework in python. This application (linked above) is hosted on the heroku platform. The github for this application can be found here.

The basic operation of the application is to randomly select two non-identical trees from the static files and display them on screen. Once a selection is made, the server will write the relevant data to a postgres SQL server also hosted on heroku, and two new trees will be presented.

Possible Further Development

As will be detailed in the ELO section, the dataset was severely limited by the fact that the number of ratings necessary to get a particular tree’s ELO to stabilize is fairly high. A potential fix for this is to actively calculate the ELO for a tree on the heroku server and store this information along with the ratings. After one random tree is selected, the current ELO for all trees could then be filtered so a partner with a similar current ELO could be compared. This would cause winners to be compared with winners and vice versa.

ELO and its Implications on the study

ELO is calculated sequentially where every new ‘player’ (in our case narrative tree) competes against another player. Each player’s initial ELO (1000) is updated upward or downward by k depending if it wins the competition. the k value (set to the default 100) is penalized depending on the initial difference between the two competitors ELO. A high ELO winner against a low ELO loaser the k will be very small, but if the inverse occurs the k will be high.

This is a plot of the changing ELO over the course of the study

unchanged image

This study has a hierarchy stability rating of .98 over 1500 paired rating events. The number of possible players creates \(\frac{n(n-1)}{2}\) possible combinations of trees.

The exponential growth of the possible combinations made using more than 50 trees in the study (in order to complete the project within the deadline) with the random pairing method of comparison.

Possible Further Development

The average number of ratings per picture when using a purely random selection method was 63.5 with a standard deviation of 22.9.

If heirarchy stability was created with 63 ratings per tree, a new method that incorporated ratings where competitors only came from local ELO neighbors and the random selection was prioritized for trees with low rating counts, a much higher total number of trees could be analyzed with far less comparisons.

Data Collection and Analysis Tree Project

Jack Wright

4/15/2022

Overview of the Data Collection Process

Overview of the Data

Recursive Branches

PCA

Data Collection Details

Collecting the Data

Collecting Tree Features

Selecting Narrative Trees for the Study

flask Application

Possible Further Development

ELO and its Implications on the study

Possible Further Development