In a typical data set, each row can be thought of as a vector:
We have \(n\) 2-dimensional vectors:
Can you visualize the data set that would yield these plots?
Multiplying by a scalar \(c\) can affect both magnitude and direction:
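A minimal sketch of scalar multiplication in R, using an illustrative vector \(u = (3,2)\) chosen here (not necessarily the one in the original plot): a positive \(c\) only changes the length, while a negative \(c\) also reverses the direction.

```r
u <- c(3, 2)

2 * u     # same direction, twice the length: 6 4
0.5 * u   # same direction, half the length: 1.5 1.0
-1 * u    # same length, opposite direction: -3 -2
```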
To subtract vectors, e.g. \(u-v\):
Go in the \(u\) direction;
Stop, then go in the negative \(v\) direction.
Which of these shows \(u-v\)? What is the value of \(u-v\)? Of \(v-u\)?
An important characteristic of a vector is its norm.
Norm \(\equiv\) size, or length
There are many ways to measure the norm of a vector. Two important measures in data science include:
L1 norm (aka “Taxicab” or “Manhattan”)
L2 norm (aka “Euclidean” norm)
Norms are always \(\geq 0\)!
Consider the vector \(u = (3,2)\):
\[ ||u||_1 = |3| + |2| = 5 \]
Now consider \(v = (-4, -5)\):
\[ ||v||_1 = |-4| + |-5| = 9 \]
Given a \(p\)-vector \(u = (u_1, u_2, ...,u_p)\):
\[ ||u||_1 = \sum_{i=1}^p |u_i| \]
Taxicabs are inefficient ways of moving from \(A\) to \(B\)!
Recall again \(u = (3,2)\).
Imagine you are a helicopter: you can fly straight from the origin \((0,0)\) to \((3,2)\). How far did you fly?
Pythagorean!
\[ ||u||_2 = \sqrt{3^2 + 2^2} = \sqrt{13} \approx 3.61 \]
What is the L2 norm of \(v = (-4, -5)\)?
Given a \(p\)-vector \(u = (u_1, u_2, ...,u_p)\):
\[ ||u||_2 = \sqrt{\sum_{i=1}^p u_i^2} \]
The following R code can be used to create vectors \(u\) and \(v\) from the previous examples, and compute their norms:
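The original code is not reproduced here, so what follows is a minimal sketch of one way to do it in base R. The norms are computed directly from the two formulas, and base R's norm() is shown as an alternative; it expects a matrix, hence the as.matrix() conversion.

```r
u <- c(3, 2)
v <- c(-4, -5)

# L1 ("taxicab") norm: sum of absolute values
sum(abs(u))        # 5
sum(abs(v))        # 9

# L2 (Euclidean) norm: square root of the sum of squares
sqrt(sum(u^2))     # sqrt(13), about 3.61
sqrt(sum(v^2))     # sqrt(41), about 6.40

# Alternative: norm() works on matrices, so convert the vector to a 1-column matrix
norm(as.matrix(u), type = "O")   # one norm; equals the L1 norm for a single column
norm(as.matrix(u), type = "2")   # spectral norm; equals the L2 norm for a single column
```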
Scaling is an important concept in this class, and one important scaling ingredient is the mean vector.
Example: the cars data set.
Given \(n\) \(p\)-dimensional vectors \(u_i\), the mean vector \(m\) is simply:
\[m = \frac{1}{n} \cdot \sum_{i=1}^n u_i\]
There are two ways to find mean vectors in R:
dplyr approach (allows you to specify which columns to average):
Base R approach (requires all columns to be numeric):
Subtracting the mean vector from the data yields mean-centered data: all columns have mean 0.
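The two approaches to the mean vector, and the subtraction that centers the data, might look like the sketch below. It assumes the built-in cars data set (columns speed and dist) and uses sweep() for the subtraction, which is one option among several rather than necessarily the one used in the original slides.

```r
library(dplyr)

# dplyr approach: choose which columns to average
cars %>%
  summarize(across(c(speed, dist), mean))

# Base R approach: every column must be numeric
m <- colMeans(cars)
m

# Subtract the mean vector from every row (sweep() subtracts by default)
cars_centered <- sweep(cars, 2, m)
colMeans(cars_centered)   # both means are 0, up to rounding error
```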
The best way to mean-center columns in R is with the scale command.
By default, scale both centers and standardizes (more later); for now we just want centering:
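A minimal sketch of centering only, assuming the built-in cars data set; the name cars_centered is chosen here for illustration. Setting scale = FALSE turns off the division by the standard deviation.

```r
# center = TRUE subtracts the column means; scale = FALSE skips standardizing
cars_centered <- scale(cars, center = TRUE, scale = FALSE)

head(cars_centered)
colMeans(cars_centered)   # both means are 0, up to rounding error

# scale() returns a matrix; convert back when a data frame is needed (e.g. for ggplot2)
cars_centered <- as.data.frame(cars_centered)
```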
# Mean-centered cars data (defined here so the block runs on its own)
cars_centered <- as.data.frame(scale(cars, center = TRUE, scale = FALSE))

# Make the base scatterplot; the red diamond marks the mean vector at the origin
library(tidyverse)
p <- ggplot(cars_centered, aes(x = speed, y = dist)) +
  geom_point(color = "steelblue", size = 3, alpha = 0.7) +
  geom_point(color = "red", size = 4, aes(x = 0, y = 0), shape = 18) +
  geom_hline(aes(yintercept = 0), linetype = 2) +
  geom_vline(aes(xintercept = 0), linetype = 2) +
  theme_classic(base_size = 18) +
  labs(x = "Speed",
       y = "Stopping Distance")

# Add marginal density plots
library(ggExtra)
ggMarginal(p, type = "density", fill = "lightblue", alpha = 0.5)
dplyr approach:
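One way to center with dplyr is sketched below, assuming the built-in cars data set; the original slide's code is not shown, so treat this as an illustration rather than the exact version. across() applies the same subtraction to every column, and everything() can be swapped for a selection of columns.

```r
library(dplyr)

# Subtract each column's own mean from that column
cars_centered <- cars %>%
  mutate(across(everything(), ~ .x - mean(.x)))

colMeans(cars_centered)   # both means are 0, up to rounding error
```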
The complete scaling (centering and standardizing) can be done with:
Note that the attributes of the result hold the mean and standard deviation vectors:
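A minimal sketch of full scaling on the built-in cars data set (the name cars_scaled is chosen here for illustration). The attributes "scaled:center" and "scaled:scale" are attached by scale() itself.

```r
# center = TRUE and scale = TRUE are the defaults
cars_scaled <- scale(cars)

attr(cars_scaled, "scaled:center")   # mean vector used for centering
attr(cars_scaled, "scaled:scale")    # standard deviations used for standardizing

apply(cars_scaled, 2, sd)            # every column now has standard deviation 1
```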
Consider two vectors \(u = (3,1)\) and \(v = (1,4)\):
Subtracting \(u\) from \(v\) yields the vector \((-2, 3)\):
Note that this vector is exactly the right magnitude and direction for traveling from \(u\) to \(v\):
We define the distance between vectors \(u\) and \(v\) to be the norm (recall: length) of the vector \(u-v\) (or equivalently, the norm of the vector \(v-u\)).
Accordingly, distance can also be defined in L1 (“taxicab” or “Manhattan”) or L2 (“Euclidean”) form.
\[ ||u-v||_1 = \sum_{i=1}^p |u_i -v_i| \]
\[ ||u-v||_2 = \sqrt{\sum_{i=1}^p (u_i -v_i)^2} \]
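Applying both formulas to the vectors \(u = (3,1)\) and \(v = (1,4)\) from above gives a quick check, sketched here in base R:

```r
u <- c(3, 1)
v <- c(1, 4)

d <- u - v          # (2, -3)

sum(abs(d))         # L1 distance: |2| + |-3| = 5
sqrt(sum(d^2))      # L2 distance: sqrt(4 + 9) = sqrt(13), about 3.61

# The order of subtraction does not matter
sum(abs(v - u))
sqrt(sum((v - u)^2))
```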
Consider the USairpollution data set from the HSAUR2 package:
            SO2 temp manu popul wind precip predays
Albany       46 47.6   44   116  8.8  33.36     135
Albuquerque  11 56.8   46   244  8.9   7.77      58
Atlanta      24 61.5  368   497  9.1  48.34     115
Baltimore    47 55.0  625   905  9.6  41.31     111
Buffalo      11 47.1  391   463 12.4  36.11     166
Charleston   31 55.2   35    71  6.5  40.75     148
Pairs plots are useful ways of visualizing multidimensional data. The ggpairs function from GGally produces:
The dist function
The dist() function can be used to compute distances between \(n\) vectors of \(p\) dimensions, arranged in 1-row-per-p-vector data frames.
The default is method = 'euclidean' (L2 distances); see ?dist for other options.
Using the USairpollution data set from the HSAUR2 package:
By default, a distance matrix is lower-triangular. First 6 rows:
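The USairpollution output is not reproduced here. As a small self-contained sketch of how dist() behaves, the example below uses three hand-made 2-dimensional vectors: \(u\) and \(v\) from the earlier slides, plus a hypothetical third vector w at the origin added only for illustration.

```r
# Three 2-dimensional vectors, one per row
x <- rbind(u = c(3, 1),
           v = c(1, 4),
           w = c(0, 0))

dist(x)                         # Euclidean (L2) distances, printed lower-triangular
dist(x, method = "manhattan")   # L1 distances

as.matrix(dist(x))              # full symmetric matrix with zeros on the diagonal
```

The entry for the pair (u, v) is \(\sqrt{13}\) in the Euclidean case and 5 in the Manhattan case, matching the hand calculation with the distance formulas.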
A dist object can be converted to a matrix, and from there to a long data frame with one row per pair of cities:
library(HSAUR2)
library(tidyverse)
data("USairpollution")
# Assumed definition (not shown in the original): Euclidean distances between cities
pollution_dist <- dist(USairpollution)
distance_df <- pollution_dist %>%
  as.matrix() %>%                    # convert the dist object to a matrix
  data.frame() %>%                   # then to a data frame, one column per city
  mutate(CityA = rownames(.)) %>%    # keep the row city as its own column
  pivot_longer(cols = -CityA,        # reshape to one row per (CityA, CityB) pair
               names_to = "CityB",
               values_to = "Distance")
We’re primed to play!