The American Kennel Club (AKC) is conducting a study to determine the versatility of dog breeds. In this article, we will analyze the study statistically.
The dataset covers 133 dog breeds and seven features: Agility, Obedience/Rally, Scent Work, Earned Therapy, Conformation, Coursing, and Good Manners. For each breed, it records the proportion (or probability) of dogs that earned a title in each of the seven features.
The response variable is the Versatility Index. Versatility is determined by the number of sports, on average, the dogs of a given breed have titled in during the period of 2014 through 2023. This average by breed is called the Versatility Index.
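As a small worked illustration of this definition, the sketch below computes the index for one breed from made-up per-dog records. The records, the variable names, and the choice of which dogs enter the average are assumptions for illustration only; the source does not describe how the AKC data were aggregated.

```python
# Hypothetical records for one breed: the set of sports each dog has titled in.
dogs = [
    {"Agility", "Rally"},
    {"Conformation"},
    {"Rally", "ScentWork", "GoodManners"},
    {"Conformation", "Coursing"},
]

# Versatility Index: average number of sports titled in, per dog of the breed.
versatility_index = sum(len(sports) for sports in dogs) / len(dogs)

# A feature value such as Rally: the proportion of the breed's dogs titled in that sport.
rally_share = sum("Rally" in sports for sports in dogs) / len(dogs)

print(versatility_index, rally_share)
```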
Here is the head of the cleaned dataset:
## Varsititlity Agility Rally ScentWork EarnedTherapy Conformation
## 1 2.095781 0.28481013 0.3919831 0.10506329 0.03417722 0.3759494
## 2 2.067227 0.11428571 0.2386555 0.15462185 0.06722689 0.5697479
## 3 2.030942 0.23909986 0.3675574 0.07266760 0.03187998 0.4219409
## 4 1.996201 0.25379939 0.3723404 0.09042553 0.03875380 0.4202128
## 5 1.996169 0.04469987 0.2681992 0.06385696 0.08812261 0.4533844
## 6 1.962009 0.23174335 0.3537358 0.04769945 0.02406079 0.5390460
## Coursing GoodManners
## 1 0.14936709 0.4468354
## 2 0.23193277 0.5008403
## 3 0.13502110 0.5152368
## 4 0.15121581 0.4468085
## 5 0.25415070 0.6449553
## 6 0.07302659 0.3626003
The upper triangle of the plot shows the pairwise Pearson correlations between the features and the response variable. Pearson correlation is a measure of linear association. The first row gives the correlation between the Versatility Index and each of the seven features. The correlation between the Versatility Index and Obedience/Rally is the highest, so a strong linear association exists between the two. Because Pearson correlation captures only linear association, we must also look for any nonlinear associations. There are also significant pairwise correlations among the seven features, so we need to assess the interactions among them.
The diagonal of the plot shows the density/histogram of Versatility Index and each of the seven features.
The lower triangle of the plot shows the pairwise scatterplots. The first column contains the scatterplots of the Versatility Index against each of the seven features. The third scatterplot in the first column shows the steepest trend, which means that breeds with more success in Obedience/Rally tend to have a higher Versatility Index. This is consistent with the Pearson correlation. With this information, we will proceed with further analysis.
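To make the point about linear versus nonlinear association concrete, here is a minimal sketch of the Pearson correlation applied to two made-up relationships. The data are illustrative and are not taken from the AKC dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [-2, -1, 0, 1, 2]
linear = [3 * x + 1 for x in xs]   # perfectly linear in x
squared = [x ** 2 for x in xs]     # fully determined by x, but not linearly

r_linear = pearson(xs, linear)
r_squared = pearson(xs, squared)
print(r_linear, r_squared)
```

The second relationship is completely determined by `x`, yet its Pearson correlation vanishes, which is why a high or low correlation alone does not settle the question of nonlinear association.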
First, I compare different methods, namely Gradient Boosting, kNN, Neural Network, Random Forest, and Linear Regression. Here is the accuracy score from 5-fold cross-validation:
Linear Regression is the most accurate model under all of the loss functions considered. I compare it with Bayesian Additive Regression Trees (BART). Here is the RMSE for both methods:
Linear Regression: \(0.05496718\)
BART: \(0.03221035\)
So, BART is even more accurate in prediction than Linear Regression.
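The comparison above rests on cross-validated RMSE. The following is a minimal sketch of how a 5-fold RMSE can be computed, assuming a one-feature least-squares line as a stand-in for the full models and synthetic data in place of the AKC dataset. The helper names `fit_line` and `kfold_rmse` are hypothetical and are not the code used in the study.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (a one-feature stand-in for the full model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def kfold_rmse(xs, ys, k=5, seed=0):
    """Shuffle the indices, split them into k folds, and pool held-out squared errors."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    squared_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        squared_errors += [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Synthetic data for illustration only: 133 points, one feature, noisy linear response.
rng = random.Random(1)
x = [rng.uniform(0.0, 0.6) for _ in range(133)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.05) for xi in x]

cv_rmse = kfold_rmse(x, y, k=5)
print(f"5-fold CV RMSE: {cv_rmse:.4f}")
```

Each observation is predicted exactly once by a model that never saw it during fitting, so the pooled RMSE estimates out-of-sample error rather than in-sample fit.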
BART models the response as \(y=b(\mathbf{x})+\epsilon\) with \(\mathbf{x} \in \mathbb{R}^p\) and \(\epsilon \sim N(0,\sigma^2_\epsilon)\), where:
\(b:\mathbb{R}^p \to \mathbb{R}\) is the unknown regression function,
\(b(\mathbf{x}) = \sum_{k=1}^{K}g(\mathbf{x}; \tau_k, \mathcal{M}_k)\) is a sum of \(K\) regression trees,
\(\tau_k\) is the topology and splitting rules of tree \(k\),
\(\mathcal{M}_k = (\mu_{k1}, \dots, \mu_{km_k})\) is the set of predictions associated with the \(m_k\) terminal nodes of tree \(k\).
The Regression Tree
This is how a decision tree looks:
Every tree translates to a piece-wise constant function.
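The sum-of-trees form \(b(\mathbf{x}) = \sum_{k=1}^{K}g(\mathbf{x}; \tau_k, \mathcal{M}_k)\) can be sketched as follows. The nested-dictionary tree encoding, the split points, and the \(\mu\) values are all made up for illustration. The sketch only evaluates fixed trees; it does not perform the Bayesian fitting that BART uses to learn them.

```python
def tree_predict(node, x):
    """Evaluate g(x; tau, M): follow the splitting rules to a terminal node, return its mu."""
    while "mu" not in node:
        branch = "left" if x[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node["mu"]

def sum_of_trees(trees, x):
    """b(x) = sum over k of g(x; tau_k, M_k)."""
    return sum(tree_predict(tree, x) for tree in trees)

# Two hypothetical trees on a 2-dimensional x.
tree_1 = {
    "feature": 0, "threshold": 0.3,
    "left": {"mu": 0.8},
    "right": {
        "feature": 1, "threshold": 0.5,
        "left": {"mu": 1.0},
        "right": {"mu": 1.2},
    },
}
tree_2 = {
    "feature": 1, "threshold": 0.2,
    "left": {"mu": -0.1},
    "right": {"mu": 0.1},
}

b_low = sum_of_trees([tree_1, tree_2], [0.25, 0.4])   # x[0] below the first split
b_high = sum_of_trees([tree_1, tree_2], [0.35, 0.4])  # x[0] above the first split
print(b_low, b_high)
```

Because every tree returns one constant per terminal node, the sum is itself piece-wise constant: the prediction changes only when `x` crosses one of the split thresholds.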
The main advantage of BART is that it is fully nonparametric and can capture the interactions between the features.
The model is \(y=f(x)+\epsilon\) with \(\epsilon \sim N(0,\sigma^2)\), where \(f(\cdot) \sim \text{BART}\). We estimate \(f(\cdot)\) and call the estimate \(\hat{f}(\cdot)\).
Now, I will investigate the function \(\hat{f}(x)\) by plugging in different values of \(x\), where \(x\) is the vector of the seven features. In particular, I will plot \(\hat{f}(x)\) as a function of one feature at a time, fixing the other six features at some values.
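The one-feature-at-a-time evaluation described above can be sketched as follows. The function `toy_predict` is a hypothetical stand-in for the fitted \(\hat{f}\), the rows are made-up data with two features instead of seven, and the helper names are assumptions. The quantile `q` selects the reference values at which the remaining features are held fixed.

```python
def column_quantile(values, q):
    """Linearly interpolated quantile of a list, with q in [0, 1]."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def profile_curve(predict, rows, feature, grid, q=0.5):
    """Evaluate predict(x) along `grid` for one feature, other features fixed at quantile q."""
    base = [column_quantile(col, q) for col in zip(*rows)]
    curve = []
    for value in grid:
        x = list(base)
        x[feature] = value   # vary only the feature of interest
        curve.append(predict(x))
    return curve

def toy_predict(x):
    """Hypothetical stand-in for the fitted model; not the BART estimate from the study."""
    return 1.0 + 2.0 * min(x[0], 0.4) + 0.5 * x[1]

rows = [[0.1, 0.2], [0.2, 0.4], [0.3, 0.6]]  # made-up feature rows
grid = [0.0, 0.2, 0.4, 0.6]

median_curve = profile_curve(toy_predict, rows, feature=0, grid=grid, q=0.5)
quartile_curve = profile_curve(toy_predict, rows, feature=0, grid=grid, q=0.25)
print(median_curve)
print(quartile_curve)
```

Changing `q` to 0.25, 0.75, 0.01, or 0.99 moves the reference point to the first quartile, third quartile, 1st percentile, or 99th percentile without changing anything else in the procedure.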
For example, the following figure shows \(\hat{f}(x)\) as a function of one feature at a time, fixing the other six features at their median values. It shows how the Versatility Index behaves, on average, as a function of each feature while the other six features are held fixed.
From the figure above, it is clear that:
All curves become mostly flat after 0.4.
The curve for Obedience/Rally is increasing, and its slope is larger than that of the other features when all other features are fixed at the median.
From this, we can conclude that an increase in success in Obedience/Rally increases the Versatility Index when the other features are fixed at their median values. In other words, among the seven features, success in Obedience/Rally indicates that a breed is more likely to have a high Versatility Index when the other features are fixed at their median values.
Now, the question is, what happens if we fix the values of the other features at different values?
So, here is the figure with the other features fixed at their respective first quartiles.
We observe the same pattern as in the previous curves. From this, we can conclude that an increase in success in Obedience/Rally increases the Versatility Index when the other features are fixed at their respective first quartiles. In other words, among the seven features, success in Obedience/Rally indicates that a breed is more likely to have a high Versatility Index when the other features are fixed at their respective first quartiles.
Next is the figure with the other features fixed at their respective third quartiles.
We observe the same pattern as in the previous curves. From this, we can conclude that an increase in success in Obedience/Rally increases the Versatility Index when the other features are fixed at their respective third quartiles. In other words, among the seven features, success in Obedience/Rally indicates that a breed is more likely to have a high Versatility Index when the other features are fixed at their respective third quartiles.
Next is the figure with the other features fixed at their respective 1st percentiles.
We observe the same pattern as in the previous curves. From this, we can conclude that an increase in success in Obedience/Rally increases the Versatility Index when the other features are fixed at their respective 1st percentiles. In other words, among the seven features, success in Obedience/Rally indicates that a breed is more likely to have a high Versatility Index when the other features are fixed at their respective 1st percentiles.
Next is the figure with the other features fixed at their respective 99th percentiles.
We observe the same pattern as in the previous curves. From this, we can conclude that an increase in success in Obedience/Rally increases the Versatility Index when the other features are fixed at their respective 99th percentiles. In other words, among the seven features, success in Obedience/Rally indicates that a breed is more likely to have a high Versatility Index when the other features are fixed at their respective 99th percentiles.
At all of these “important values” of the other features (the median, the first and third quartiles, and the 1st and 99th percentiles), success in Obedience/Rally indicates that a breed is more likely to have a high Versatility Index.
To complete the analysis, we plot the same curves for each feature individually, fixing the other six features at different random values.
The patterns are similar in all of these plots. From this, we can conclude that an increase in success in Obedience/Rally increases the Versatility Index when the other features are fixed at different values all over the feature space. In other words, among the seven features, success in Obedience/Rally indicates that a breed is more likely to have a high Versatility Index wherever the other features are fixed.
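The check at random reference values can be sketched in the same spirit. The source does not say how the random values were drawn, so drawing each feature uniformly between its observed minimum and maximum is an assumption here, as are `toy_predict` (a hypothetical stand-in for the fitted model) and the made-up rows.

```python
import random

def random_reference(rows, rng):
    """Draw each feature independently and uniformly between its observed min and max."""
    return [rng.uniform(min(col), max(col)) for col in zip(*rows)]

def curve_at(predict, base, feature, grid):
    """Vary one feature along `grid` while the others stay at the values in `base`."""
    return [predict(base[:feature] + [v] + base[feature + 1:]) for v in grid]

def toy_predict(x):
    """Hypothetical stand-in for the fitted model, with an interaction term."""
    return 1.0 + 2.0 * x[0] + 0.5 * x[1] + 0.3 * x[0] * x[1]

rows = [[0.1, 0.2], [0.2, 0.4], [0.3, 0.6], [0.5, 0.1]]  # made-up feature rows
grid = [i / 10 for i in range(7)]                         # 0.0, 0.1, ..., 0.6
rng = random.Random(0)

rises = []
for _ in range(20):
    base = random_reference(rows, rng)
    rise = []
    for j in range(len(base)):
        curve = curve_at(toy_predict, base, j, grid)
        rise.append(curve[-1] - curve[0])  # total change of the curve over the grid
    rises.append(rise)

steepest_is_first = sum(r[0] > r[1] for r in rises)
print(f"feature 0 has the largest rise in {steepest_is_first} of {len(rises)} draws")
```

Counting how often one feature has the largest rise summarizes many such plots in a single number, which is useful when there are too many random reference points to inspect by eye.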