In RStudio, check the Packages tab (lower-right pane) to see whether the following packages are already installed; type a package name in the search box to find it quickly. Install any that are missing:
install.packages("MASS")
install.packages("car")
install.packages("caret")
install.packages("klaR")
install.packages("rattle.data")
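The same check can be scripted instead of done by hand in the Packages tab. The following is a minimal sketch, assuming you only want to know which of the packages above are not yet available; the names `pkgs`, `installed` and `missing_pkgs` are illustrative and not part of the original tutorial.

```r
# Packages used in this tutorial
pkgs <- c("MASS", "car", "caret", "klaR", "rattle.data")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

# Report the packages that still need to be installed
if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them
}
```

Leaving the `install.packages()` call commented out keeps the check free of side effects until you decide to install.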
Load the installed packages with library():
library(MASS)
library(car)
In this first case study, the wine data set contains 13 measurements from a chemical analysis of 178 wine samples from three cultivars. A cultivar is a plant variety developed from a natural species and maintained under cultivation.
# Load the wine data set that ships with the rattle.data package
data(wine, package = 'rattle.data')
attach(wine)
head(wine)
## Type Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids
## 1 1 14.23 1.71 2.43 15.6 127 2.80 3.06
## 2 1 13.20 1.78 2.14 11.2 100 2.65 2.76
## 3 1 13.16 2.36 2.67 18.6 101 2.80 3.24
## 4 1 14.37 1.95 2.50 16.8 113 3.85 3.49
## 5 1 13.24 2.59 2.87 21.0 118 2.80 2.69
## 6 1 14.20 1.76 2.45 15.2 112 3.27 3.39
## Nonflavanoids Proanthocyanins Color Hue Dilution Proline
## 1 0.28 2.29 5.64 1.04 3.92 1065
## 2 0.26 1.28 4.38 1.05 3.40 1050
## 3 0.30 2.81 5.68 1.03 3.17 1185
## 4 0.24 2.18 7.80 0.86 3.45 1480
## 5 0.39 1.82 4.32 1.04 2.93 735
## 6 0.34 1.97 6.75 1.05 2.85 1450
str(wine)
## 'data.frame': 178 obs. of 14 variables:
## $ Type : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ Alcohol : num 14.2 13.2 13.2 14.4 13.2 ...
## $ Malic : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
## $ Ash : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
## $ Alcalinity : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
## $ Magnesium : int 127 100 101 113 118 112 96 121 97 98 ...
## $ Phenols : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
## $ Flavanoids : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
## $ Nonflavanoids : num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
## $ Proanthocyanins: num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
## $ Color : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
## $ Hue : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
## $ Dilution : num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
## $ Proline : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
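The str() output shows that Type is a categorical variable (a factor) with three levels, one per cultivar. Linear discriminant analysis is appropriate here because the response is categorical and the predictors are numeric. It handles two or more groups, and with three groups it produces at most two discriminant functions.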
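A quick way to confirm the number of groups and their sizes is to tabulate Type. The sketch below assumes the rattle.data package is installed; the object names `type.counts` and `max.ld` are illustrative.

```r
# Load the wine data from rattle.data
data(wine, package = "rattle.data")

# Number of samples per cultivar
type.counts <- table(wine$Type)
type.counts

# Share of each cultivar in the data set
prop.table(type.counts)

# With g groups and p predictors, LDA yields at most min(g - 1, p)
# discriminant functions
n.groups <- nlevels(wine$Type)
n.predictors <- ncol(wine) - 1
max.ld <- min(n.groups - 1, n.predictors)
max.ld
```

By default, lda() uses these class proportions as the prior probabilities of the groups.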
scatterplotMatrix(wine[2:6])
The purpose of linear discriminant analysis is to find the linear combinations of the variables that give the best possible separation between the groups (wine cultivars) in our data set. The . in the formula argument means that all remaining variables in the data are used as predictors.
wine.lda <- lda(Type ~., data = wine)
wine.lda
## Call:
## lda(Type ~ ., data = wine)
##
## Prior probabilities of groups:
## 1 2 3
## 0.3314607 0.3988764 0.2696629
##
## Group means:
## Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids
## 1 13.74475 2.010678 2.455593 17.03729 106.3390 2.840169 2.9823729
## 2 12.27873 1.932676 2.244789 20.23803 94.5493 2.258873 2.0808451
## 3 13.15375 3.333750 2.437083 21.41667 99.3125 1.678750 0.7814583
## Nonflavanoids Proanthocyanins Color Hue Dilution Proline
## 1 0.290000 1.899322 5.528305 1.0620339 3.157797 1115.7119
## 2 0.363662 1.630282 3.086620 1.0562817 2.785352 519.5070
## 3 0.447500 1.153542 7.396250 0.6827083 1.683542 629.8958
##
## Coefficients of linear discriminants:
## LD1 LD2
## Alcohol -0.403399781 0.8717930699
## Malic 0.165254596 0.3053797325
## Ash -0.369075256 2.3458497486
## Alcalinity 0.154797889 -0.1463807654
## Magnesium -0.002163496 -0.0004627565
## Phenols 0.618052068 -0.0322128171
## Flavanoids -1.661191235 -0.4919980543
## Nonflavanoids -1.495818440 -1.6309537953
## Proanthocyanins 0.134092628 -0.3070875776
## Color 0.355055710 0.2532306865
## Hue -0.818036073 -1.5156344987
## Dilution -1.157559376 0.0511839665
## Proline -0.002691206 0.0028529846
##
## Proportion of trace:
## LD1 LD2
## 0.6875 0.3125
The first linear discriminant function (LD1), read from the coefficients above, is
\[\begin{aligned}
LD_1 = {} & -0.403 \cdot \text{Alcohol} + 0.165 \cdot \text{Malic} - 0.369 \cdot \text{Ash} + 0.155 \cdot \text{Alcalinity} - 0.002 \cdot \text{Magnesium} \\
& + 0.618 \cdot \text{Phenols} - 1.661 \cdot \text{Flavanoids} - 1.496 \cdot \text{Nonflavanoids} + 0.134 \cdot \text{Proanthocyanins} \\
& + 0.355 \cdot \text{Color} - 0.818 \cdot \text{Hue} - 1.158 \cdot \text{Dilution} - 0.003 \cdot \text{Proline}
\end{aligned}\]
The "proportion of trace" that is printed when you type wine.lda (the object returned by the lda() function) is the percentage of the between-group separation achieved by each discriminant function. For the wine data, the first discriminant function accounts for \(68.75\%\) of the separation and the second for \(31.25\%\).
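The proportion of trace can also be recomputed from the fitted object. This is a sketch that relies on the `svd` component of the object returned by `lda()`, which holds the singular values for the discriminant functions; the name `prop.trace` is illustrative.

```r
library(MASS)
data(wine, package = "rattle.data")
wine.lda <- lda(Type ~ ., data = wine)

# Squared singular values, normalised so that they sum to 1
prop.trace <- wine.lda$svd^2 / sum(wine.lda$svd^2)
round(prop.trace, 4)
```

The rounded values should agree with the "Proportion of trace" line in the printed model.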
A histogram is a convenient way to display the results of the linear discriminant analysis, and the ldahist() function draws one per group. First compute the discriminant scores with predict() and store them in an object. predict() evaluates the fitted model on the data and returns one row of scores per observation, so the result has as many rows as the data set (178 here).
wine.lda.values <- predict(wine.lda)
ldahist(wine.lda.values$x[,1], g = wine$Type)
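The scores returned by predict() can be reproduced by hand, which makes clear how the coefficients are applied. The sketch below assumes the centring used by MASS, namely the group means weighted by the prior probabilities; the names `X`, `centre` and `manual.scores` are illustrative.

```r
library(MASS)
data(wine, package = "rattle.data")
wine.lda <- lda(Type ~ ., data = wine)
wine.lda.values <- predict(wine.lda)

# Predictor matrix (all columns except Type)
X <- as.matrix(wine[, -1])

# Centre: group means weighted by the prior probabilities
centre <- colSums(wine.lda$prior * wine.lda$means)

# Centre the predictors, then apply the discriminant coefficients
manual.scores <- scale(X, center = centre, scale = FALSE) %*% wine.lda$scaling

# Compare with the scores returned by predict()
all.equal(manual.scores, wine.lda.values$x, check.attributes = FALSE)
```

Each column of `manual.scores` is one discriminant function evaluated on the centred data, so the first column holds the values plotted by ldahist() above.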
To see how well the second discriminant function separates the cultivars, make a stacked histogram of its values:
ldahist(wine.lda.values$x[,2], g = wine$Type)
Now we can produce a scatter plot of the two discriminant functions, labelling each point with its cultivar:
plot(wine.lda.values$x[,1], wine.lda.values$x[,2])
text(wine.lda.values$x[,1], wine.lda.values$x[,2], wine$Type, cex = 0.7, pos = 4, col = "red")
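For more advanced graphics, we use the ggplot2 package. ggplot2 works with data frames, so create a new data frame containing the Type variable from the wine data and the discriminant scores from wine.lda.values. Because the score matrix has columns LD1 and LD2, data.frame() names the resulting columns lda.LD1 and lda.LD2.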
#convert to data frame
newdata <- data.frame(type = wine[,1], lda = wine.lda.values$x)
library(ggplot2)
ggplot(newdata) + geom_point(aes(lda.LD1, lda.LD2, colour = type), size = 2.5)
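To assess the prediction accuracy of LDA, compare the classes predicted by the model with the actual classes. First load the caret package, then fit the model with the train() function. The confusionMatrix() function cross-tabulates predicted against actual classes and reports accuracy statistics. Note that the predictions here are made on the same data used to fit the model, so the reported accuracy is optimistic.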
library(caret)
## Loading required package: lattice
wine.lda.predict <- train(Type ~ ., method = "lda", data = wine)
confusionMatrix(predict(wine.lda.predict, wine), wine$Type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 59 0 0
## 2 0 71 0
## 3 0 0 48
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9795, 1)
## No Information Rate : 0.3989
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000
## Prevalence 0.3315 0.3989 0.2697
## Detection Rate 0.3315 0.3989 0.2697
## Detection Prevalence 0.3315 0.3989 0.2697
## Balanced Accuracy 1.0000 1.0000 1.0000
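Since the matrix above is computed on the training data, a cross-validated estimate gives a fairer picture of predictive accuracy. A minimal sketch using the `CV = TRUE` argument of `lda()` in MASS, which performs leave-one-out cross-validation; this is a swapped-in alternative to the caret workflow above, and the names `wine.lda.cv`, `cv.table` and `cv.accuracy` are illustrative.

```r
library(MASS)
data(wine, package = "rattle.data")

# CV = TRUE makes lda() return leave-one-out predictions
# (class and posterior) instead of a fitted model object
wine.lda.cv <- lda(Type ~ ., data = wine, CV = TRUE)

# Cross-tabulate leave-one-out predictions against the actual cultivars
cv.table <- table(Predicted = wine.lda.cv$class, Actual = wine$Type)
cv.table

# Leave-one-out accuracy
cv.accuracy <- mean(wine.lda.cv$class == wine$Type)
cv.accuracy
```

Each sample is classified by a model fitted without it, so any off-diagonal counts in `cv.table` indicate samples that the model does not generalise to.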
Another visualization tool is partimat() from the klaR package, which plots the classification regions for pairs of variables.
library(klaR)
partimat(Type ~ Alcohol + Alcalinity, data = wine, method = "lda")
You can add more variables to examine how other pairs of variables classify the wine type:
partimat(Type ~ Alcohol + Alcalinity + Ash + Magnesium, data = wine, method = "lda")