The weighted Euclidean distance between two points is an extension of the standard Euclidean distance: it accounts for the relative importance of each dimension (or variable) by assigning a weight to each.

Given two points \(\mathbf{x} = (x_1, x_2, \ldots, x_n)\) and \(\mathbf{y} = (y_1, y_2, \ldots, y_n)\), and a vector of weights \(\mathbf{w} = (w_1, w_2, \ldots, w_n)\), the weighted Euclidean distance \(d_w(\mathbf{x}, \mathbf{y})\) is defined as:

\[ d_w(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2} \]
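As a minimal sketch, this formula translates directly into NumPy (the function name `weighted_euclidean` is just an illustration; `x`, `y`, and `w` are assumed to be equal-length numeric vectors):

```python
import numpy as np

def weighted_euclidean(x, y, w):
    """sqrt(sum_i w_i * (x_i - y_i)^2)"""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    return np.sqrt(np.sum(w * (x - y) ** 2))
```

If SciPy is available, `scipy.spatial.distance.euclidean(x, y, w=w)` computes the same weighted quantity.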

In the context of the given problem, let’s break it down step by step:

  1. Identify the points:
    • \(\mathbf{x}\) is a row in your dataset, representing an event.
    • \(\mathbf{y}\) is the mean vector of all columns (variables) in the dataset, excluding ‘List’ and ‘Event’.
  2. Define the weights:
    • Each variable has a weight.
    • For example, if we want the Familiarity variable to be weighted 10 times more than the others, we set its corresponding weight to 10 and all other weights to 1.
  3. Calculate the weighted Euclidean distance (a code sketch follows this list):
    • Subtract the mean value of each variable from the corresponding value in the event.
    • Square each of these differences.
    • Multiply each squared difference by the corresponding weight.
    • Sum these weighted squared differences.
    • Take the square root of the sum to get the weighted Euclidean distance.
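
Here is a sketch of these three steps with pandas, under the assumption that the data live in a DataFrame `df`, that the non-numeric identifier columns are named 'List' and 'Event' (as described above), and that the up-weighted variable is 'Familiarity'; the example data are made up:

```python
import numpy as np
import pandas as pd

# Illustrative data; in practice df would be your own dataset.
df = pd.DataFrame({
    "List": ["A", "A", "B"],
    "Event": ["e1", "e2", "e3"],
    "Familiarity": [3.0, 5.0, 4.0],
    "Valence": [2.0, 1.0, 3.0],
    "Arousal": [4.0, 4.0, 2.0],
})

# Step 1: numeric variables only (drop 'List' and 'Event') and their mean vector y.
values = df.drop(columns=["List", "Event"])
y = values.mean()

# Step 2: one weight per variable; 'Familiarity' weighted 10x the others.
w = pd.Series(1.0, index=values.columns)
w["Familiarity"] = 10.0

# Step 3: weighted Euclidean distance of every row (event) from the mean vector.
df["dist_to_mean"] = np.sqrt(((values - y) ** 2 * w).sum(axis=1))
print(df[["Event", "dist_to_mean"]])
```

Each entry of `dist_to_mean` is that event's weighted distance from the column-wise mean vector.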

Example

If we have:

- \(\mathbf{x} = (x_1, x_2, \ldots, x_n)\)
- Mean vector \(\mathbf{y} = (y_1, y_2, \ldots, y_n)\)
- Weights \(\mathbf{w} = (w_1, w_2, \ldots, w_n)\)

The distance is: \[ d_w(\mathbf{x}, \mathbf{y}) = \sqrt{w_1 (x_1 - y_1)^2 + w_2 (x_2 - y_2)^2 + \cdots + w_n (x_n - y_n)^2} \]

For the specific case where Familiarity (let’s assume it’s the first variable) is weighted 10 times more, the weights would be \(\mathbf{w} = (10, 1, 1, \ldots, 1)\).

So the formula becomes: \[ d_w(\mathbf{x}, \mathbf{y}) = \sqrt{10 (x_1 - y_1)^2 + 1 (x_2 - y_2)^2 + 1 (x_3 - y_3)^2 + \cdots + 1 (x_n - y_n)^2} \]
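
As a small numeric illustration (all values are made up), plugging such a weight vector into the formula directly:

```python
import numpy as np

x = np.array([3.0, 1.0, 2.0, 5.0])    # an event; Familiarity is the first value
y = np.array([1.0, 1.0, 4.0, 5.0])    # mean vector
w = np.r_[10.0, np.ones(x.size - 1)]  # weights (10, 1, 1, 1)

d = np.sqrt(np.sum(w * (x - y) ** 2))
print(d)  # sqrt(10*4 + 0 + 4 + 0) = sqrt(44) ≈ 6.63
```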

This formula ensures that differences in the Familiarity variable have a much larger impact on the overall distance compared to differences in the other variables.
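
To see the effect concretely with the made-up values above: the weighted distance is \(\sqrt{10(3-1)^2 + (1-1)^2 + (2-4)^2 + (5-5)^2} = \sqrt{44} \approx 6.63\), whereas the unweighted distance between the same points is only \(\sqrt{4 + 0 + 4 + 0} = \sqrt{8} \approx 2.83\); the weight of 10 makes the gap in Familiarity dominate the result.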