Analysis of the versatility of dog breeds

The American Kennel Club (AKC) is conducting a study to determine the versatility of dog breeds. In this article, we analyze the data from that study statistically.

The Dataset

We are dealing with 133 dog breeds and seven features: Agility, Obedience/Rally, Scent Work, Earned Therapy, Conformation, Coursing, and Good Manners. For each breed, the dataset contains the proportion (that is, the empirical probability) of dogs that earned a title in each of the seven features.

The response variable is the Versatility Index. Versatility is measured by the average number of sports in which the dogs of a given breed earned titles from 2014 through 2023. This per-breed average is called the Versatility Index.
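As a rough illustration of that definition, the sketch below computes a per-breed average number of sports titled in. The `dogs` table, its column names, and its values are invented for this example; they are not the AKC data, and the study's exact computation may differ.

```r
# Hypothetical per-dog records: one row per dog, one 0/1 column per sport
# (1 = the dog earned a title in that sport). Not the AKC data.
dogs <- data.frame(
  breed    = c("A", "A", "A", "B", "B"),
  Agility  = c(1, 0, 1, 0, 0),
  Rally    = c(1, 1, 0, 0, 1),
  Coursing = c(0, 1, 0, 1, 0)
)

sports <- setdiff(names(dogs), "breed")

# Number of sports each dog has titled in.
dogs$n_sports <- rowSums(dogs[, sports])

# Average of that count within each breed: the toy "Versatility Index".
versatility <- tapply(dogs$n_sports, dogs$breed, mean)
print(versatility)
```

Averaging the 0/1 columns by breed instead of the per-dog count would give per-sport proportions of the kind stored in the feature columns.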

Here is the head of the cleaned dataset:

head(df)

##   Varsititlity    Agility     Rally  ScentWork EarnedTherapy Conformation
## 1     2.095781 0.28481013 0.3919831 0.10506329    0.03417722    0.3759494
## 2     2.067227 0.11428571 0.2386555 0.15462185    0.06722689    0.5697479
## 3     2.030942 0.23909986 0.3675574 0.07266760    0.03187998    0.4219409
## 4     1.996201 0.25379939 0.3723404 0.09042553    0.03875380    0.4202128
## 5     1.996169 0.04469987 0.2681992 0.06385696    0.08812261    0.4533844
## 6     1.962009 0.23174335 0.3537358 0.04769945    0.02406079    0.5390460
##     Coursing GoodManners
## 1 0.14936709   0.4468354
## 2 0.23193277   0.5008403
## 3 0.13502110   0.5152368
## 4 0.15121581   0.4468085
## 5 0.25415070   0.6449553
## 6 0.07302659   0.3626003

The Question

Does a breed’s success in a particular feature among the seven features indicate that it is more likely to have a high Versatility Index?
What is the degree of association between success in a particular feature among the seven features and the breed being versatile?
For example, does a breed’s agility success indicate it is likely to be a versatile breed?

The Analysis

Descriptive Statistics

chart.Correlation(df, histogram=TRUE, pch=19)

The upper triangle of the plot shows the pairwise Pearson correlations between the features and the response variable. Pearson correlation is a measure of linear association. The first row shows the correlation between the Versatility Index and each of the seven features. The correlation between the Versatility Index and Obedience/Rally is the highest, so the strongest linear association is with that feature. Because Pearson correlation captures only linear association, we must also look for nonlinear associations. There are also significant pairwise correlations among the seven features, so we need to assess the interactions among them.

The diagonal of the plot shows the density/histogram of the Versatility Index and of each of the seven features.

The lower triangle of the plot shows the pairwise scatterplots. The first column shows the scatterplots of the Versatility Index against each of the seven features. The third scatterplot in the first column, for Obedience/Rally, shows the steepest curve, which suggests that breeds with more success in Obedience/Rally tend to have a higher Versatility Index. This is consistent with the Pearson correlation. With this information, we proceed to further analysis.
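The numbers in the first row of such a correlation plot can also be read off directly with `cor()`. The sketch below uses a small synthetic data frame (`toy`, with made-up coefficients) rather than the real `df`, checks the Pearson value against its definition, and adds a Spearman rank correlation as one simple way to look beyond linear association.

```r
set.seed(1)
n <- 133

# Synthetic stand-in for the dataset; the coefficients are arbitrary assumptions.
toy <- data.frame(Rally = runif(n, 0, 0.6), Agility = runif(n, 0, 0.5))
toy$Versatility <- 1 + 1.5 * toy$Rally + 0.3 * toy$Agility + rnorm(n, sd = 0.05)

# Pearson correlation of the response with each feature.
r <- cor(toy)["Versatility", c("Rally", "Agility")]
print(r)

# The same quantity for Rally, computed from the definition.
r_manual <- with(toy,
  sum((Rally - mean(Rally)) * (Versatility - mean(Versatility))) /
    sqrt(sum((Rally - mean(Rally))^2) * sum((Versatility - mean(Versatility))^2)))

# Spearman correlation measures monotone (not only linear) association.
rho <- cor(toy$Versatility, toy$Rally, method = "spearman")
```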

The Method

First, I compare different methods, namely Gradient Boosting, kNN, Neural Network, Random Forest, and Linear Regression. Here is the accuracy score from 5-fold cross-validation:

Linear Regression is the most accurate model under all of the loss functions considered. I then compare it with Bayesian Additive Regression Trees (BART). Here is the RMSE for both methods:

Linear Regression: \(0.05496718\)
BART: \(0.03221035\)

BART has the lower RMSE, so it predicts even more accurately than Linear Regression.
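For readers who want to see how a 5-fold cross-validated RMSE is obtained, here is a minimal base-R sketch. It runs on synthetic data and uses only `lm()`, so it does not reproduce the figures above; the helper `cv_rmse()` and the `toy` data are assumptions made for illustration, and any fitting function with a `predict()` method could be passed in instead.

```r
set.seed(42)
n <- 133

# Synthetic stand-in data; coefficients are arbitrary assumptions.
toy <- data.frame(Rally = runif(n), Agility = runif(n))
toy$Versatility <- 1 + 1.2 * toy$Rally + 0.4 * toy$Agility + rnorm(n, sd = 0.05)

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))  # random fold label for each row

# Hypothetical helper: out-of-fold RMSE for any fit function with a predict() method.
cv_rmse <- function(fit_fun, data, fold) {
  sq_err <- numeric(nrow(data))
  for (j in unique(fold)) {
    test <- fold == j
    model <- fit_fun(data[!test, ])                    # train on the other folds
    pred <- predict(model, newdata = data[test, ])     # predict the held-out fold
    sq_err[test] <- (data$Versatility[test] - pred)^2
  }
  sqrt(mean(sq_err))
}

rmse_full <- cv_rmse(function(d) lm(Versatility ~ ., data = d), toy, fold)
rmse_agility_only <- cv_rmse(function(d) lm(Versatility ~ Agility, data = d), toy, fold)
print(c(full = rmse_full, agility_only = rmse_agility_only))
```

Using the same `fold` vector for every model keeps the comparison fair, since each model is scored on exactly the same held-out rows.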

What is BART?

\(y=b(\mathbf{x})+\epsilon\) with \(\mathbf{x} \in \mathbb{R}^p\) and \(\epsilon \sim N(0,\sigma^2_\epsilon)\)
\(b:\mathbb{R}^p \to \mathbb{R}\) unknown regression function
\(b(\mathbf{x}) = \sum_{k=1}^{K}g(\mathbf{x}; \tau_k, \mathcal{M}_k)\)

\(\tau_k\) : topology and splitting rules of tree \(k\)

\(\mathcal{M}_k = (\mu_{k1}, \dots, \mu_{km_k})\): the set of predictions associated with the \(m_k\) terminal nodes of tree \(k\).

The Regression Tree

This is how a decision tree looks:

Every tree translates to a piece-wise constant function.
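To make the piecewise-constant idea concrete, the sketch below hand-codes two tiny regression trees as R functions and adds them, mirroring the sum \(b(\mathbf{x}) = \sum_k g(\mathbf{x}; \tau_k, \mathcal{M}_k)\) with \(K = 2\). The split points and leaf values are invented for illustration; in BART they are sampled from a posterior rather than fixed by hand.

```r
# Tree 1: a single split on Rally at 0.3, so two terminal nodes (mu_11, mu_12).
g1 <- function(x) ifelse(x$Rally < 0.3, 0.8, 1.1)

# Tree 2: split on Agility at 0.2, then on Rally at 0.4, so three terminal nodes.
g2 <- function(x) {
  ifelse(x$Agility < 0.2, 0.5,
         ifelse(x$Rally < 0.4, 0.7, 0.9))
}

# Sum-of-trees function b(x) with K = 2.
b <- function(x) g1(x) + g2(x)

grid <- data.frame(Rally = c(0.1, 0.35, 0.5), Agility = c(0.1, 0.25, 0.25))
print(b(grid))
```

Each tree is constant on every rectangle cut out by its splits, so the sum is constant on the intersections of those rectangles. Because the second tree splits on both Agility and Rally, the sum can represent an interaction between the two features.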

The main advantage of BART is that it is fully non-parametric and can capture interactions between the features.

The model is \(y=f(x)+\epsilon\) with \(\epsilon \sim N(0,\sigma^2)\), where \(f(\cdot) \sim \text{BART}\). We estimate \(f(\cdot)\) and call the estimate \(\hat{f}(\cdot)\).

Now, I will investigate the function \(\hat{f}(x)\) by plugging in different values of \(x\), where \(x\) is the vector of the seven features. In particular, we will plot \(\hat{f}(x)\) as a function of one feature at a time, fixing the other six features at some values.

For example, the following figure shows \(\hat{f}(x)\) as a function of one feature at a time, with the other six features fixed at their median values. It shows how the predicted Versatility Index behaves, on average, as a function of each feature while the other six are held fixed.
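The construction of such a curve can be sketched in a few lines. The code below is an assumption-laden stand-in: it uses synthetic data and a plain `lm()` fit in place of the fitted BART model, and `profile_curve()` is a hypothetical helper, not part of any package. The mechanics are the same for any model with a `predict()` method: build a grid for one feature, hold the others at a chosen quantile, and predict.

```r
set.seed(7)
n <- 133

# Synthetic stand-in data with three features; coefficients are arbitrary assumptions.
toy <- data.frame(Rally = runif(n, 0, 0.6),
                  Agility = runif(n, 0, 0.5),
                  Coursing = runif(n, 0, 0.3))
toy$Versatility <- 1 + 1.5 * toy$Rally + 0.3 * toy$Agility +
  0.2 * toy$Coursing + rnorm(n, sd = 0.05)

fit <- lm(Versatility ~ ., data = toy)  # stand-in for the fitted BART model

# Hypothetical helper: predictions along one feature, others fixed at quantile q.
profile_curve <- function(model, data, feature, q = 0.5, n_grid = 50) {
  features <- setdiff(names(data), "Versatility")
  fixed <- lapply(data[features], quantile, probs = q, names = FALSE)
  newdata <- as.data.frame(fixed)[rep(1, n_grid), , drop = FALSE]
  newdata[[feature]] <- seq(min(data[[feature]]), max(data[[feature]]),
                            length.out = n_grid)
  data.frame(x = newdata[[feature]], fhat = predict(model, newdata = newdata))
}

rally <- profile_curve(fit, toy, "Rally")      # others at their medians
agility <- profile_curve(fit, toy, "Agility")

plot(rally$x, rally$fhat, type = "l",
     xlab = "feature value", ylab = "predicted Versatility Index")
lines(agility$x, agility$fhat, lty = 2)
```

Changing `q` (for example to 0.25, 0.75, 0.01, or 0.99) fixes the other features at a different quantile while leaving the rest of the construction unchanged.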

From the figure above, it is clear that:

All curves become mostly flat after 0.4.
The Obedience/Rally curve is increasing, and its slope is greater than that of the other curves when all other features are fixed at their medians.

From this, we can conclude that greater success in Obedience/Rally raises the predicted Versatility Index when the other features are fixed at their medians. In other words, a breed's success in Obedience/Rally indicates that the breed is more likely to have a high Versatility Index when the other features are fixed at their median values.

Now, the question is, what happens if we fix the values of the other features at different values?

So, here is the figure with the other features fixed at their respective first quartiles.

We observe the same pattern as in the previous curves. From this, we can conclude that greater success in Obedience/Rally raises the predicted Versatility Index when the other features are fixed at their respective first quartiles. In other words, a breed's success in Obedience/Rally indicates that the breed is more likely to have a high Versatility Index when the other features are fixed at their first-quartile values.

Next is the figure with the other features fixed at their respective third quartiles.

We observe the same pattern as in the previous curves. From this, we can conclude that greater success in Obedience/Rally raises the predicted Versatility Index when the other features are fixed at their respective third quartiles. In other words, a breed's success in Obedience/Rally indicates that the breed is more likely to have a high Versatility Index when the other features are fixed at their third-quartile values.

Next is the figure with the other features fixed at their respective 1st percentiles.

We observe the same pattern as in the previous curves. From this, we can conclude that greater success in Obedience/Rally raises the predicted Versatility Index when the other features are fixed at their respective 1st percentiles. In other words, a breed's success in Obedience/Rally indicates that the breed is more likely to have a high Versatility Index when the other features are fixed at their 1st-percentile values.

Next is the figure with the other features fixed at their respective 99th percentiles.

We observe the same pattern as in the previous curves. From this, we can conclude that greater success in Obedience/Rally raises the predicted Versatility Index when the other features are fixed at their respective 99th percentiles. In other words, a breed's success in Obedience/Rally indicates that the breed is more likely to have a high Versatility Index when the other features are fixed at their 99th-percentile values.

At all of these “important values” of the other features, success in Obedience/Rally indicates that a breed is more likely to have a high Versatility Index.

To complete the analysis, we plot \(\hat{f}(x)\) as a function of each feature individually, fixing the other six features at different random values.

The patterns are similar in all of these plots. From this, we can conclude that greater success in Obedience/Rally raises the predicted Versatility Index when the other features are fixed at values spread all over the feature space. In other words, a breed's success in Obedience/Rally indicates that the breed is more likely to have a high Versatility Index wherever in the feature space the other features are fixed.
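A check of this kind can be sketched as follows, again on synthetic data with an `lm()` fit (including an interaction term) standing in for the fitted BART model. The helper `rise_for_random_point()` is hypothetical: it fixes the other features at values drawn uniformly within their observed ranges and records how much the predicted index rises as Obedience/Rally moves from its minimum to its maximum.

```r
set.seed(11)
n <- 133

# Synthetic stand-in data with a Rally-by-Agility interaction (arbitrary assumptions).
toy <- data.frame(Rally = runif(n, 0, 0.6),
                  Agility = runif(n, 0, 0.5),
                  Coursing = runif(n, 0, 0.3))
toy$Versatility <- 1 + 1.5 * toy$Rally + 0.3 * toy$Agility +
  0.5 * toy$Rally * toy$Agility + 0.2 * toy$Coursing + rnorm(n, sd = 0.05)

fit <- lm(Versatility ~ Rally * Agility + Coursing, data = toy)  # stand-in model

others <- c("Agility", "Coursing")
rally_grid <- seq(min(toy$Rally), max(toy$Rally), length.out = 50)

# Hypothetical helper: rise of the Rally curve at one random setting of the others.
rise_for_random_point <- function() {
  fixed <- lapply(toy[others], function(v) runif(1, min(v), max(v)))
  newdata <- data.frame(Rally = rally_grid, fixed)  # scalars are recycled over the grid
  fhat <- predict(fit, newdata = newdata)
  fhat[length(fhat)] - fhat[1]
}

rises <- replicate(200, rise_for_random_point())
print(summary(rises))
print(mean(rises > 0))  # share of random settings where the curve rises overall
```

A share close to 1 on the real fitted model would support the conclusion drawn from the plots. The endpoint difference is a coarse summary, so inspecting the full curves remains worthwhile.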