In the field of data analysis and machine learning, classification tasks play a pivotal role in deciphering patterns and making predictions based on labeled data. Linear Discriminant Analysis (LDA) stands out as a powerful technique for classification, particularly when dealing with multiple classes. In this article, we learn about LDA and demonstrate its application using RStudio, a popular integrated development environment (IDE) for R programming.
Before the analysis, make sure the necessary packages are installed and loaded into the R environment. This walkthrough uses MASS, which provides `lda()`, and caret, which provides `confusionMatrix()`.
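A minimal setup sketch follows. The package list is partly an assumption: the printed output later in the article points to `lda()` from MASS and to caret's `confusionMatrix()`, while ggplot2 is only a guess at what the plotting step might use. The installation line is left commented out so the chunk runs without internet access.

```r
# Packages assumed for this walkthrough:
#   MASS    - provides lda()
#   caret   - provides confusionMatrix()
#   ggplot2 - plotting (an assumption; base graphics also work)
pkgs <- c("MASS", "caret", "ggplot2")

# Report which packages are already installed, without installing anything
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Install whatever is missing (uncomment to run; needs internet access)
# install.packages(pkgs[!installed])

# MASS ships with standard R distributions, so it can be loaded directly
library(MASS)
```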
To illustrate the capabilities of LDA, we’ll employ the Iris dataset, a classic benchmark dataset in machine learning. Let’s load the dataset and inspect its structure and summary statistics:
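The calls below are the standard base R way to produce this kind of output. `iris` ships with R in the datasets package, so nothing needs to be downloaded.

```r
# Load the built-in Iris dataset
data(iris)

# Structure: 150 observations, four numeric measurements, one factor
str(iris)

# Five-number summary plus mean for each numeric column, counts for Species
summary(iris)
```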
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
The summary statistics describe the distribution of the four numerical variables (Sepal Length, Sepal Width, Petal Length, and Petal Width) pooled over all 150 flowers. The Species column confirms that the dataset is balanced, with 50 observations each of setosa, versicolor, and virginica.
The minimum and maximum values give the range of each variable. For example, Sepal Length ranges from 4.3 to 7.9 and Sepal Width from 2.0 to 4.4.
The quartiles (1st, median, and 3rd) describe the center and spread of the data. For instance, the median Sepal Length is 5.8, so about half of the observations have a sepal length at or below this value and half at or above it.
The mean is the average of each variable across all observations. For instance, the mean Sepal Length is approximately 5.84, close to the median of 5.8. For Petal Length, the mean (3.758) lies well below the median (4.35), a sign that its distribution is not symmetric.
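Because `summary()` pools all flowers together, it says nothing about how the species differ, and those differences are what LDA exploits. A small sketch with base R's `aggregate()` shows the per-species means; the name `species_means` is only illustrative.

```r
# Mean of every numeric column, computed separately for each species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)
```

Petal measurements differ far more between species than sepal measurements do, which hints at which variables will drive the discriminant functions.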
Before fitting the LDA model, it’s essential to preprocess the data by splitting it into training and testing sets:
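One way to do this is a stratified 80/20 split, sketched here in base R. The proportions are inferred from the output shown later (equal priors and 10 test flowers per species); the seed value and the object names `train_idx`, `train`, and `test` are assumptions, so the exact numbers will not match the printed results.

```r
set.seed(123)  # hypothetical seed; the original seed is not known

# Stratified split: draw 40 of the 50 row indices of each species
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(lapply(rows_by_species, sample, size = 40))

train <- iris[train_idx, ]   # 120 rows, 40 per species
test  <- iris[-train_idx, ]  # 30 rows, 10 per species

# Confirm that both sets are balanced
table(train$Species)
table(test$Species)
```

Stratifying keeps the class proportions identical in both sets, which is why the prior probabilities reported by the model come out at exactly one third each.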
Now, let’s proceed to build the LDA model using the training data:
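The call recorded in the output, `lda(Species ~ ., data = train)`, comes from the MASS package. The sketch below wraps it in a self-contained chunk; the split is a hypothetical reconstruction, and `lda_model` is an illustrative name.

```r
library(MASS)

# Hypothetical stratified 80/20 split; the seed is an assumption
set.seed(123)
train_idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                           sample, size = 40))
train <- iris[train_idx, ]

# Fit LDA with Species as the response and all four measurements as predictors
lda_model <- lda(Species ~ ., data = train)

# Prints priors, group means, coefficients, and proportion of trace
print(lda_model)
```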
## Call:
## lda(Species ~ ., data = train)
##
## Prior probabilities of groups:
##     setosa versicolor  virginica
##  0.3333333  0.3333333  0.3333333
##
## Group means:
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa           4.9800      3.3700       1.4650      0.2400
## versicolor       5.9400      2.7700       4.2325      1.3275
## virginica        6.6375      3.0125       5.6225      2.0700
##
## Coefficients of linear discriminants:
##                    LD1         LD2
## Sepal.Length  0.787581  0.05173815
## Sepal.Width   1.605418 -2.45346114
## Petal.Length -2.144011  0.80572094
## Petal.Width  -2.909670 -2.51645779
##
## Proportion of trace:
##    LD1    LD2
## 0.9902 0.0098
Prior Probabilities of Groups: The prior probabilities are the baseline class probabilities used by the model. By default, `lda()` estimates them from the class proportions in the training data; because the training set contains the same number of flowers from each species, each prior is 1/3 (about 33.33%).
Group Means: The group means are the average values of the four numerical variables (Sepal Length, Sepal Width, Petal Length, and Petal Width) for each species in the training set. For example, the mean Sepal Length is 4.98 for setosa, 5.94 for versicolor, and about 6.64 for virginica.
Coefficients of Linear Discriminants: The coefficients are the weights that combine the four variables into the discriminant scores LD1 and LD2. A positive coefficient means that the score increases with the variable, and a negative coefficient means that it decreases, holding the other variables fixed. The overall sign of each discriminant axis is arbitrary, so the relative signs and magnitudes are what matter; here the two petal measurements carry the largest weights (in absolute value) on LD1.
Proportion of Trace: The proportion of trace is the share of the between-group variance (the separation among the species) captured by each discriminant function, not the share of the total variance in the data. In this case, LD1 accounts for approximately 99.02% of that separation, while LD2 accounts for only 0.98%.
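To make the last two items concrete, the sketch below recomputes them from the fitted object. It assumes a hypothetical split and the illustrative name `lda_model`. The proportion of trace comes from the squared singular values stored in `$svd`, and the discriminant scores are the centered predictors multiplied by the coefficient matrix in `$scaling`.

```r
library(MASS)

# Hypothetical stratified 80/20 split; the seed is an assumption
set.seed(123)
train_idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                           sample, size = 40))
train <- iris[train_idx, ]
lda_model <- lda(Species ~ ., data = train)

# Proportion of trace: squared singular values, normalised to sum to 1
prop_trace <- lda_model$svd^2 / sum(lda_model$svd^2)
print(round(prop_trace, 4))

# Discriminant scores by hand: centre the predictors at the prior-weighted
# mean of the group means, then multiply by the coefficient matrix
X <- as.matrix(train[, colnames(lda_model$means)])
centre <- colSums(lda_model$prior * lda_model$means)
scores_manual <- scale(X, center = centre, scale = FALSE) %*% lda_model$scaling

# Compare with the scores returned by predict()
scores_predict <- predict(lda_model, newdata = train)$x
max_diff <- max(abs(scores_manual - scores_predict))
print(max_diff)
```

The two sets of scores should agree up to floating-point error, which confirms that the printed coefficients are simply the weights of a linear projection.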
With the LDA model in place, it’s time to evaluate its performance on the testing set:
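A sketch of the prediction step, again under a hypothetical split. `predict()` on an `lda` object returns the predicted class, the posterior probabilities, and the discriminant scores. The name `test_species` matches the header in the printed table; the other names are illustrative.

```r
library(MASS)

# Hypothetical stratified 80/20 split; the seed is an assumption
set.seed(123)
train_idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                           sample, size = 40))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]
lda_model <- lda(Species ~ ., data = train)

# Predict the species of the held-out flowers
pred <- predict(lda_model, newdata = test)
test_species <- test$Species

# Rows: predicted species; columns: true species
conf_tab <- table(pred$class, test_species)
print(conf_tab)

# Posterior probabilities show how confident each prediction is
head(round(pred$posterior, 3))
```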
##             test_species
##              setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         1
##   virginica       0          0         9
We can visualize the LDA model and its decision boundaries using custom colors for each species:
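The original plotting code is not shown, so the following is only one possible approach, using base graphics. It plots the training flowers in the LD1/LD2 plane and shades the decision regions by refitting LDA on the two discriminant scores, which should reproduce the classifier of the full model. The colors, grid resolution, and all object names are arbitrary choices.

```r
library(MASS)

# Hypothetical stratified 80/20 split; the seed is an assumption
set.seed(123)
train_idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                           sample, size = 40))
train <- iris[train_idx, ]
lda_model <- lda(Species ~ ., data = train)

# Discriminant scores of the training flowers
scores <- data.frame(predict(lda_model, newdata = train)$x,
                     Species = train$Species)

# Custom colours, one per species (arbitrary choices)
species_cols <- c(setosa = "#1b9e77", versicolor = "#d95f02",
                  virginica = "#7570b3")

# Refit LDA in the 2-D score space; used only to shade decision regions
lda_2d <- lda(Species ~ LD1 + LD2, data = scores)
grid <- expand.grid(
  LD1 = seq(min(scores$LD1) - 1, max(scores$LD1) + 1, length.out = 150),
  LD2 = seq(min(scores$LD2) - 1, max(scores$LD2) + 1, length.out = 150)
)
grid$region <- predict(lda_2d, newdata = grid)$class

# Share of training flowers on which the 2-D refit and the full model agree
agreement <- mean(predict(lda_2d, newdata = scores)$class ==
                  predict(lda_model, newdata = train)$class)

# Shaded regions first, then the observations on top
plot(grid$LD1, grid$LD2, pch = 15, cex = 0.4,
     col = adjustcolor(species_cols[as.character(grid$region)], alpha.f = 0.15),
     xlab = "LD1", ylab = "LD2", main = "LDA scores and decision regions")
points(scores$LD1, scores$LD2, pch = 19,
       col = species_cols[as.character(scores$Species)])
legend("topright", legend = names(species_cols), col = species_cols,
       pch = 19, bty = "n")
```

Because LD1 carries almost all of the separation, the region borders run mostly across the LD1 axis.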
In addition to prediction accuracy, we can compute various metrics to assess the performance of the LDA model:
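A sketch of how such metrics can be obtained: overall accuracy is the share of correct predictions, and caret's `confusionMatrix()` adds per-class statistics. The caret call is guarded so the chunk still runs when that package is missing; the split and the object names are assumptions.

```r
library(MASS)

# Hypothetical stratified 80/20 split; the seed is an assumption
set.seed(123)
train_idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                           sample, size = 40))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]
lda_model <- lda(Species ~ ., data = train)
pred_class <- predict(lda_model, newdata = test)$class

# Overall accuracy: proportion of test flowers classified correctly
accuracy <- mean(pred_class == test$Species)
print(accuracy)

# Detailed report, only if caret is installed
if (requireNamespace("caret", quietly = TRUE)) {
  print(caret::confusionMatrix(data = pred_class, reference = test$Species))
}
```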
## [1] 0.9666667
## Confusion Matrix and Statistics
##
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         1
##   virginica       0          0         9
##
## Overall Statistics
##
##                Accuracy : 0.9667
##                  95% CI : (0.8278, 0.9992)
##     No Information Rate : 0.3333
##     P-Value [Acc > NIR] : 2.963e-13
##
##                   Kappa : 0.95
##
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.9000
## Specificity                 1.0000            0.9500           1.0000
## Pos Pred Value              1.0000            0.9091           1.0000
## Neg Pred Value              1.0000            1.0000           0.9524
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3000
## Detection Prevalence        0.3333            0.3667           0.3000
## Balanced Accuracy           1.0000            0.9750           0.9500
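The reported statistics can be checked by hand from the confusion matrix. The sketch below re-enters the printed counts and recomputes accuracy, Cohen's kappa, and the per-class measures with base R; the variable names are illustrative.

```r
species <- c("setosa", "versicolor", "virginica")

# Counts copied from the printed matrix (rows: prediction, columns: reference)
cm <- matrix(c(10,  0, 0,
                0, 10, 1,
                0,  0, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = species, Reference = species))

n  <- sum(cm)
tp <- diag(cm)

accuracy    <- sum(tp) / n                  # 29 / 30
sensitivity <- tp / colSums(cm)             # recall per class
ppv         <- tp / rowSums(cm)             # positive predictive value
tn          <- n - rowSums(cm) - colSums(cm) + tp
specificity <- tn / (n - colSums(cm))

# Cohen's kappa: agreement beyond what the marginal totals give by chance
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, kappa = cohen_kappa), 4))
print(round(rbind(sensitivity, specificity, ppv), 4))
```

The single error is a virginica flower predicted as versicolor, which is why virginica's sensitivity drops to 0.90 and versicolor's positive predictive value to 10/11.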
In conclusion, Linear Discriminant Analysis (LDA) is a valuable tool for classification tasks, offering insight into the underlying structure of the data while producing accurate predictions: on the held-out Iris observations, the model classified 29 of 30 flowers correctly (96.67% accuracy). With R's rich ecosystem of packages and RStudio as a working environment, analysts can apply LDA to their own classification projects.
Linear Discriminant Analysis (LDA) is a statistical method used for dimensionality reduction and classification.
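Unlike discriminative methods such as logistic regression, which model the class probabilities directly, LDA models the distribution of the predictors within each class and separates the classes through the differences between their means.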
LDA assumes that the classes have identical covariance matrices and that the observations are normally distributed within each class.
The results of LDA include the discriminant functions and class centroids, which can be interpreted to understand how the classes are separated in the feature space.
LDA can tolerate moderate multicollinearity, but strongly correlated predictors make the pooled covariance matrix close to singular, which leads to unstable coefficient estimates.
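The number of discriminant functions is the smaller of the number of classes minus one and the number of predictors. With three species and four predictors, that gives the two functions, LD1 and LD2, reported above.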
LDA can be sensitive to outliers because it relies on class means and a pooled covariance matrix, both of which are easily distorted by extreme values.
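LDA can be applied to high-dimensional data, although it may suffer from the curse of dimensionality: when the number of predictors approaches or exceeds the number of observations, the covariance estimate becomes unstable or singular, and regularization or prior dimensionality reduction is usually needed.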
LDA finds applications in various fields, including pattern recognition, image processing, and bioinformatics.
Some limitations of LDA include its assumption of normality and equal covariance matrices across classes, which may not hold true in real-world datasets.
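The point about the number of discriminant functions can be checked directly. The sketch below fits three small models on Iris and counts the columns of the coefficient matrix; the object names are illustrative, and the full dataset is used because no train/test split is needed for this check.

```r
library(MASS)

# 3 classes, 4 predictors: min(3 - 1, 4) = 2 discriminant functions
fit_all <- lda(Species ~ ., data = iris)
n_all <- ncol(fit_all$scaling)

# 2 classes, 4 predictors: min(2 - 1, 4) = 1 discriminant function
two_species <- droplevels(subset(iris, Species != "setosa"))
fit_two <- lda(Species ~ ., data = two_species)
n_two <- ncol(fit_two$scaling)

# 3 classes, 1 predictor: min(3 - 1, 1) = 1 discriminant function
fit_one <- lda(Species ~ Petal.Length, data = iris)
n_one <- ncol(fit_one$scaling)

print(c(all = n_all, two_classes = n_two, one_predictor = n_one))
```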