class: center, middle, inverse, title-slide

# Mahalanobis Distance

---

### Mahalanobis Distance

* Mahalanobis Distance
* multivariate distance

#### How to Calculate Mahalanobis Distance in R

* The Mahalanobis distance is the distance between a data point and the centroid (mean) of the data in a multivariate space, scaled by the covariance of the variables.
* It's often used to identify outliers in multivariate statistical analyses.

---

### Mahalanobis Distance

```r
library(faraway)
data(cheddar)
```

---

### cheddar: Taste of Cheddar cheese

In **{faraway}**: Functions and Datasets for Books by Julian Faraway

**Description**

In a study of cheddar cheese from the LaTrobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition and were subjected to taste tests. Overall taste scores were obtained by combining the scores from several tasters.

* **taste** - a subjective taste score
* **Acetic** - concentration of acetic acid (log scale)
* **H2S** - concentration of hydrogen sulfide (log scale)
* **Lactic** - concentration of lactic acid

---

### Step 1: Load and inspect the dataset.

```r
head(cheddar)
```

```
##   taste Acetic   H2S Lactic
## 1  12.3  4.543 3.135   0.86
## 2  20.9  5.159 5.043   1.53
## 3  39.0  5.366 5.438   1.57
## 4  47.9  5.759 7.496   1.81
## 5   5.6  4.663 3.807   0.99
## 6  25.9  5.697 7.601   1.09
```

---

### Step 2: Calculate the Mahalanobis distance for each observation.

Next, we'll use the built-in <tt>mahalanobis()</tt> function in R to calculate the Mahalanobis distance for each observation. It uses the following syntax:

<pre><code>
mahalanobis(x, center, cov)
</code></pre>

where:

* <tt>x</tt>: matrix of data
* <tt>center</tt>: mean vector of the distribution
* <tt>cov</tt>: covariance matrix of the distribution

Note that <tt>mahalanobis()</tt> returns the *squared* distance.

---

#### Implementation

The following code shows how to implement this function for our dataset:

```r
library(magrittr)  # provides the %>% pipe

# keep the three chemical measurements (Acetic, H2S, Lactic)
df <- cheddar[, 2:4]

# calculate (squared) Mahalanobis distance for each observation
mahalanobis(df, colMeans(df), cov(df)) %>% head() %>% t()
```

```
##          [,1]     [,2]      [,3]     [,4]     [,5]     [,6]
## [1,] 4.115811 1.235341 0.7716917 1.593862 2.768398 5.713978
```

---

### Step 3: Calculate the p-value for each Mahalanobis distance.

We can see that some of the Mahalanobis distances are much larger than others. To determine whether any of the distances are statistically significant, we need to calculate their p-values.

The p-value for each distance is the upper-tail probability of a chi-square distribution evaluated at the squared Mahalanobis distance. This walkthrough uses k-1 degrees of freedom, where k = number of variables, so in this case df = 3-1 = 2.

Note that the usual large-sample result for multivariate normal data is a chi-square distribution with k degrees of freedom (df = 3 here), which would give larger p-values than df = 2.

---

### Step 3: Calculate the p-value for each Mahalanobis distance.

```r
# create new column in data frame to hold Mahalanobis distances
df$mahal <- mahalanobis(df, colMeans(df), cov(df))

# create new column in data frame to hold p-value for each Mahalanobis distance
df$p <- pchisq(df$mahal, df = 2, lower.tail = FALSE)
```

---

### Step 3: Calculate the p-value for each Mahalanobis distance.

```r
library(knitr)  # provides kable()

# view the first rows of the data frame
kable(head(df), format = "markdown")
```

| Acetic|   H2S| Lactic|     mahal|         p|
|------:|-----:|------:|---------:|---------:|
|  4.543| 3.135|   0.86| 4.1158108| 0.1277212|
|  5.159| 5.043|   1.53| 1.2353409| 0.5391991|
|  5.366| 5.438|   1.57| 0.7716917| 0.6798753|
|  5.759| 7.496|   1.81| 1.5938621| 0.4507101|
|  4.663| 3.807|   0.99| 2.7683980| 0.2505244|
|  5.697| 7.601|   1.09| 5.7139779| 0.0574415|

---

### Interpreting the output

* Typically, an observation whose p-value is less than some threshold (e.g. 0.001) is considered to be an outlier.
* In this case, all the p-values shown are greater than 0.001, so none of these observations would be flagged.
* Depending on the context of the problem, you may *omit* outlying observations from the dataset, as they could affect the results of the analysis. (Domain knowledge is vital.)

---
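### Sketch: checking the distance and flagging outliers

The interpretation rule above can be sketched in a few lines of base R. This is a minimal illustration, not the deck's own method: the names `X`, `d2`, `d2_manual`, `alpha`, `p_slides`, `p_k` and `outlier` are hypothetical, the cutoff 0.001 is the example threshold from the previous slide, and the `df = k` variant assumes the data are roughly multivariate normal.

```r
library(faraway)
data(cheddar)

# Chemical measurements only (Acetic, H2S, Lactic), as in the earlier slides
X <- cheddar[, 2:4]

# mahalanobis() returns the squared distance D^2
d2 <- mahalanobis(X, colMeans(X), cov(X))

# The same quantity from the definition: (x - mean)' S^{-1} (x - mean)
centered  <- sweep(as.matrix(X), 2, colMeans(X))
d2_manual <- rowSums((centered %*% solve(cov(X))) * centered)
all.equal(unname(d2), unname(d2_manual))

# Upper-tail chi-square p-values
p_slides <- pchisq(d2, df = 2, lower.tail = FALSE)        # k - 1, as in the slides
p_k      <- pchisq(d2, df = ncol(X), lower.tail = FALSE)  # k, the usual large-sample choice

# Flag observations whose p-value falls below the example threshold
alpha   <- 0.001
outlier <- p_slides < alpha
which(outlier)  # row indices of flagged observations, if any
```

Comparing `p_slides` with `p_k` shows how much the choice of degrees of freedom matters before any observation is dropped.

---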