7.3: Work with the SIDS dataset to answer parts (a), (b), and (c)

(a): Construct the linear discriminant function using Factor68 and birthweight. Plot the function on a scatterplot of the data

## Call:
## lda(Group ~ Factor68 + BW, data = sids_data, prior = c(0.5, 0.5))
## 
## Prior probabilities of groups:
##   1   2 
## 0.5 0.5 
## 
## Group means:
##     Factor68         BW
## 1 -0.2561411  0.2070468
## 2  0.7844322 -0.6340808
## 
## Coefficients of linear discriminants:
##                 LD1
## Factor68  0.9104587
## BW       -0.7094956

Here is the LDA for Factor68 and birthweight in the SIDS dataset. We scale the data first because the predictors sit on wildly different scales (some features range from -1 to 1, others are in the thousands); standardizing puts the discriminant coefficients on a comparable footing and makes the boundary plot readable.
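A minimal sketch of that fit, assuming the raw data live in a data frame called `sids_raw` (an assumed name; only the output is shown here) with the column names from the `lda()` call:

```r
library(MASS)

# Assumed: raw data in `sids_raw`; the scaling step is a reconstruction,
# not the original code.
sids_data <- sids_raw
vars <- c("HR", "BW", "Factor68", "Gesage")
sids_data[vars] <- scale(sids_raw[vars])

lda_a <- lda(Group ~ Factor68 + BW, data = sids_data, prior = c(0.5, 0.5))
lda_a
```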

As an aside, the plotting code is actually interesting! We build a grid of values (essentially a Cartesian product of sequences spanning the observed ranges of Factor68 and BW), predict a Group for every grid point with the fitted LDA, and overlay that grid on the ggplot, colored by predicted class. This shows cleanly where the LDA decision boundary is; a sketch is below.
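Here is one way that plot could be built, reusing `lda_a` and `sids_data` from the sketch above; the grid resolution is arbitrary:

```r
library(ggplot2)

# Cartesian product of values spanning the two predictors
grid <- expand.grid(
  Factor68 = seq(min(sids_data$Factor68), max(sids_data$Factor68), length.out = 200),
  BW       = seq(min(sids_data$BW), max(sids_data$BW), length.out = 200)
)
# Color each grid point by its predicted class to reveal the boundary
grid$Group <- predict(lda_a, newdata = grid)$class

ggplot(sids_data, aes(Factor68, BW)) +
  geom_point(data = grid, aes(color = Group), alpha = 0.2, size = 0.5) +
  geom_point(aes(shape = factor(Group)), size = 2) +
  labs(color = "Predicted Group", shape = "Observed Group")
```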

(b): Use all four variables to construct an LDA and find the misclassification rate

## Call:
## lda(Group ~ HR + BW + Factor68 + Gesage, data = sids_data, prior = c(0.5, 
##     0.5))
## 
## Prior probabilities of groups:
##   1   2 
## 0.5 0.5 
## 
## Group means:
##            HR         BW   Factor68     Gesage
## 1 -0.05579459  0.2070468 -0.2561411  0.1696553
## 2  0.17087093 -0.6340808  0.7844322 -0.5195693
## 
## Coefficients of linear discriminants:
##                  LD1
## HR        0.01873216
## BW       -0.64585271
## Factor68  0.87152434
## Gesage   -0.15005659
##    
##      1  2
##   1 41  3
##   2  8 13

We can see that if we use all four predictors, \(\frac{8 + 3}{65} \approx 16.9\%\) of observations are misclassified (the off-diagonal counts in the confusion matrix above).
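For reference, a minimal sketch of how the table and rate above could be produced, with `sids_data` assumed from part (a):

```r
library(MASS)

lda_b <- lda(Group ~ HR + BW + Factor68 + Gesage, data = sids_data,
             prior = c(0.5, 0.5))
pred <- predict(lda_b)$class
tab  <- table(sids_data$Group, pred)  # resubstitution confusion matrix
tab
1 - sum(diag(tab)) / sum(tab)         # misclassification rate, ~0.169
```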

(c): How should prior probabilities be incorporated?

Prior probabilities should be incorporated when it is believed that observations are not equally likely to belong to each class. For instance, if SIDS cases are known to be much rarer in the population than non-SIDS controls, the prior on the SIDS group should be set well below 0.5 rather than using equal priors (I don't know the true rate, but it could be factored in if desired).
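Such a prior plugs straight into the `prior` argument of `lda()`; the 0.9/0.1 split below is purely hypothetical, not taken from the data:

```r
library(MASS)

# Hypothetical: if SIDS cases made up only ~10% of the relevant population,
# the prior could encode that instead of c(0.5, 0.5).
lda(Group ~ HR + BW + Factor68 + Gesage, data = sids_data,
    prior = c(0.9, 0.1))
```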

7.4: Find all classification functions for the Egyptian Skull data and use them to allocate a new skull with the following measurements: MB: 133.0, BH: 130.0, BL: 95.0, NH: 50.0

## Call:
## lda(epoch ~ mb + bh + bl + nh, data = skull_data, prior = c(0.2, 
##     0.2, 0.2, 0.2, 0.2))
## 
## Prior probabilities of groups:
## c4000BC c3300BC c1850BC  c200BC  cAD150 
##     0.2     0.2     0.2     0.2     0.2 
## 
## Group means:
##                 mb          bh          bl         nh
## c4000BC -0.5329866  0.21325361  0.50329957 -0.1246909
## c3300BC -0.3285160  0.03104325  0.48470476 -0.2182091
## c1850BC  0.1008721  0.25374481 -0.07933786 -0.1143000
## c200BC   0.3121584 -0.04993914 -0.35826004  0.3221182
## cAD150   0.4484721 -0.44810253 -0.55040643  0.1350818
## 
## Coefficients of linear discriminants:
##           LD1        LD2         LD3          LD4
## mb  0.6195332  0.1894544  0.45370028  0.727928092
## bh -0.1829143  1.0377453 -0.12135214 -0.002074942
## bl -0.7804603 -0.3663088  0.07931567  0.712568479
## nh  0.2657813 -0.2479501 -0.94502248  0.214478740
## 
## Proportion of trace:
##    LD1    LD2    LD3    LD4 
## 0.8823 0.0809 0.0326 0.0042
## $class
## [1] cAD150
## Levels: c4000BC c3300BC c1850BC c200BC cAD150
## 
## $posterior
##    c4000BC   c3300BC  c1850BC   c200BC    cAD150
## 1 0.169083 0.1987253 0.212759 0.205385 0.2140477
## 
## $x
##        LD1        LD2       LD3        LD4
## 1 0.105565 -0.4011667 0.2256902 -0.3996537

We can see that the new observation is allocated to the fifth class, cAD150, though only barely: the posterior probabilities range from about 0.169 to 0.214 across the five epochs, so the allocation is far from certain.
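A sketch of that allocation step, assuming the raw measurements live in a data frame `skull_raw` (an assumed name) and were standardized before fitting, since the group means above are clearly on a standardized scale:

```r
library(MASS)

# Assumed: raw skull measurements in `skull_raw`; save the training
# means/SDs so the new skull can be scaled the same way.
vars <- c("mb", "bh", "bl", "nh")
ctr  <- colMeans(skull_raw[vars])
sds  <- apply(skull_raw[vars], 2, sd)

skull_data <- skull_raw
skull_data[vars] <- scale(skull_raw[vars])

skull_lda <- lda(epoch ~ mb + bh + bl + nh, data = skull_data,
                 prior = rep(0.2, 5))

# New skull: MB = 133.0, BH = 130.0, BL = 95.0, NH = 50.0
new_skull <- data.frame(mb = 133, bh = 130, bl = 95, nh = 50)
new_skull[vars] <- scale(new_skull[vars], center = ctr, scale = sds)
predict(skull_lda, newdata = new_skull)
```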

Iris dataset: checking the predictors, then fitting an LDA

##  Response Sepal.Length :
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## Species       2 63.212  31.606  119.26 < 2.2e-16 ***
## Residuals   147 38.956   0.265                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Sepal.Width :
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## Species       2 11.345  5.6725   49.16 < 2.2e-16 ***
## Residuals   147 16.962  0.1154                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Petal.Length :
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## Species       2 437.10 218.551  1180.2 < 2.2e-16 ***
## Residuals   147  27.22   0.185                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Petal.Width :
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## Species       2 80.413  40.207  960.01 < 2.2e-16 ***
## Residuals   147  6.157   0.042                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Ok, every response differs significantly across species, which implies that our LDA model should include all of the predictors. So now we make training and test datasets with a \(\frac{3}{4}\)-to-\(\frac{1}{4}\) split and train the model; a sketch of that step and the resulting confusion matrix follow:
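The seed and exact split below are assumptions (the original code is not shown), so the counts may differ slightly from the matrix that follows:

```r
library(MASS)

set.seed(1)  # assumed; the original seed is not shown
train_idx <- sample(nrow(iris), size = round(3/4 * nrow(iris)))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

iris_lda <- lda(Species ~ ., data = train)   # all four predictors
pred <- predict(iris_lda, newdata = test)$class
table(test$Species, pred)                    # test-set confusion matrix
```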

##             
##              setosa versicolor virginica
##   setosa         12          0         0
##   versicolor      0         16         1
##   virginica       0          0         9
Only one of the 38 test observations is misclassified, giving a test error rate of \(\frac{1}{38} \approx 2.6\%\).