Use the SIDS dataset to answer parts (a), (b), and (c), with Factor68 and birthweight as the predictors. Plot the function on a scatterplot of the data.

## Call:
## lda(Group ~ Factor68 + BW, data = sids_data, prior = c(0.5, 0.5))
##
## Prior probabilities of groups:
## 1 2
## 0.5 0.5
##
## Group means:
## Factor68 BW
## 1 -0.2561411 0.2070468
## 2 0.7844322 -0.6340808
##
## Coefficients of linear discriminants:
## LD1
## Factor68 0.9104587
## BW -0.7094956
Here is the LDA for Factor68 and birthweight in the SIDS dataset. We have to scale the data first: LDA's discriminant coefficients are only comparable when the predictors are on similar scales, and the original data scales are wildly different (some features range from -1 to 1, others are in the thousands).
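A minimal sketch of that preprocessing and fit, using the column names shown in the output; the use of `scale()` is my assumption about how the standardizing was done:

```r
library(MASS)

# Standardize the numeric predictors to mean 0, sd 1
num_cols <- c("HR", "BW", "Factor68", "Gesage")
sids_data[num_cols] <- scale(sids_data[num_cols])

# Two-predictor LDA with equal priors on the two groups
sids_lda <- lda(Group ~ Factor68 + BW, data = sids_data, prior = c(0.5, 0.5))
sids_lda
```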
As an aside, this code is actually interesting! We create a grid of values (essentially a Cartesian product) from sequences spanning the observed ranges of Factor68 and BW, then overlay that grid on the ggplot, coloured by the predicted Group from the LDA. This shows us cleanly where the LDA decision boundary is.
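A sketch of that grid-and-plot idea; the grid resolution (200 points per axis) and the use of `geom_tile()` for the shaded regions are my assumptions:

```r
library(ggplot2)

# Cartesian product of fine sequences over each predictor's observed range
grid <- expand.grid(
  Factor68 = seq(min(sids_data$Factor68), max(sids_data$Factor68), length.out = 200),
  BW       = seq(min(sids_data$BW), max(sids_data$BW), length.out = 200)
)

# Classify every grid point with the fitted LDA
grid$Group <- predict(sids_lda, newdata = grid)$class

# Shaded regions show the predicted class; points are the actual data
ggplot() +
  geom_tile(data = grid, aes(Factor68, BW, fill = Group), alpha = 0.3) +
  geom_point(data = sids_data, aes(Factor68, BW, colour = factor(Group)))
```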
## Call:
## lda(Group ~ HR + BW + Factor68 + Gesage, data = sids_data, prior = c(0.5,
## 0.5))
##
## Prior probabilities of groups:
## 1 2
## 0.5 0.5
##
## Group means:
## HR BW Factor68 Gesage
## 1 -0.05579459 0.2070468 -0.2561411 0.1696553
## 2 0.17087093 -0.6340808 0.7844322 -0.5195693
##
## Coefficients of linear discriminants:
## LD1
## HR 0.01873216
## BW -0.64585271
## Factor68 0.87152434
## Gesage -0.15005659
##
## 1 2
## 1 41 3
## 2 8 13
We can see that when we use all four predictors, \(\frac{8 + 3}{65} \approx 16.9\%\) of observations are misclassified.
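For reference, a sketch of how the confusion matrix and error rate above could be computed; `sids_lda4` is an assumed name for the four-predictor fit:

```r
sids_lda4 <- lda(Group ~ HR + BW + Factor68 + Gesage,
                 data = sids_data, prior = c(0.5, 0.5))

# Confusion matrix: rows are the true groups, columns the predictions
conf <- table(sids_data$Group, predict(sids_lda4)$class)
conf
1 - sum(diag(conf)) / sum(conf)  # misclassification rate, about 0.169
```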
Prior probabilities should be incorporated if it is believed that observations are more likely to belong to some classes than others. For instance, if you believed that lower birthweight or higher heart rate means a higher likelihood of SIDS (I don’t know whether that’s the case, but it could be factored in if desired).
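As a purely hypothetical illustration, unequal priors are passed straight to `lda()`; the 0.7/0.3 values here are made up:

```r
# Hypothetical: encode a prior belief that 70% of cases belong to group 1
lda(Group ~ HR + BW + Factor68 + Gesage,
    data = sids_data, prior = c(0.7, 0.3))
```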
The new observation to classify has MB = 133.0, BH = 130.0, BL = 95.0, and NH = 50.0.
## Call:
## lda(epoch ~ mb + bh + bl + nh, data = skull_data, prior = c(0.2,
## 0.2, 0.2, 0.2, 0.2))
##
## Prior probabilities of groups:
## c4000BC c3300BC c1850BC c200BC cAD150
## 0.2 0.2 0.2 0.2 0.2
##
## Group means:
## mb bh bl nh
## c4000BC -0.5329866 0.21325361 0.50329957 -0.1246909
## c3300BC -0.3285160 0.03104325 0.48470476 -0.2182091
## c1850BC 0.1008721 0.25374481 -0.07933786 -0.1143000
## c200BC 0.3121584 -0.04993914 -0.35826004 0.3221182
## cAD150 0.4484721 -0.44810253 -0.55040643 0.1350818
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3 LD4
## mb 0.6195332 0.1894544 0.45370028 0.727928092
## bh -0.1829143 1.0377453 -0.12135214 -0.002074942
## bl -0.7804603 -0.3663088 0.07931567 0.712568479
## nh 0.2657813 -0.2479501 -0.94502248 0.214478740
##
## Proportion of trace:
## LD1 LD2 LD3 LD4
## 0.8823 0.0809 0.0326 0.0042
## $class
## [1] cAD150
## Levels: c4000BC c3300BC c1850BC c200BC cAD150
##
## $posterior
## c4000BC c3300BC c1850BC c200BC cAD150
## 1 0.169083 0.1987253 0.212759 0.205385 0.2140477
##
## $x
## LD1 LD2 LD3 LD4
## 1 0.105565 -0.4011667 0.2256902 -0.3996537
We can see that the new observation is predicted to be in class 5, which is cAD150 (it has the highest posterior probability, about 0.214, though the posteriors are nearly uniform across the five epochs).
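A sketch of how the new skull could be scored; `skull_lda`, and the saved training means and standard deviations `skull_means` and `skull_sds` used to standardize it, are assumed names:

```r
# Raw measurements for the new skull, from the values quoted above
new_skull <- data.frame(mb = 133.0, bh = 130.0, bl = 95.0, nh = 50.0)

# Standardize with the training means/sds (skull_means and skull_sds are
# assumed to have been saved when skull_data was scaled)
new_skull_std <- as.data.frame(
  scale(new_skull, center = skull_means, scale = skull_sds)
)

predict(skull_lda, newdata = new_skull_std)  # returns $class, $posterior, $x
```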
## Response Sepal.Length :
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 63.212 31.606 119.26 < 2.2e-16 ***
## Residuals 147 38.956 0.265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Sepal.Width :
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 11.345 5.6725 49.16 < 2.2e-16 ***
## Residuals 147 16.962 0.1154
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Petal.Length :
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 437.10 218.551 1180.2 < 2.2e-16 ***
## Residuals 147 27.22 0.185
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Petal.Width :
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 80.413 40.207 960.01 < 2.2e-16 ***
## Residuals 147 6.157 0.042
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Ok, all four univariate ANOVAs are highly significant, which implies that our LDA model should include all of the predictors. So now we’ll make the training and test datasets with a \(\frac{3}{4}\)-to-\(\frac{1}{4}\) split, then train the model; a sketch of that workflow is below, followed by the confusion matrix.
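A minimal sketch of the split-and-fit workflow, assuming the built-in iris data; the seed and the `sample()`-based split are my assumptions, so the exact per-class counts depend on them:

```r
library(MASS)

set.seed(1)  # assumed seed; the exact split depends on it

# Randomly assign 3/4 of the rows to training, the rest to testing
train_idx  <- sample(seq_len(nrow(iris)), size = floor(0.75 * nrow(iris)))
iris_train <- iris[train_idx, ]
iris_test  <- iris[-train_idx, ]

# Fit LDA on all four predictors, as the ANOVAs suggest
iris_lda <- lda(Species ~ ., data = iris_train)

# Confusion matrix on the held-out quarter: rows true, columns predicted
table(iris_test$Species, predict(iris_lda, newdata = iris_test)$class)
```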
##
## setosa versicolor virginica
## setosa 12 0 0
## versicolor 0 16 1
## virginica 0 0 9