1.1 - data as vectors

What is a vector?

A \(p\)-vector can be thought of as:
- a coordinate in \(p\)-dimensional space;
- a specification of magnitude and direction in \(p\)-dimensional space
Vectors can be expressed in row or column format.
- \(u = (3,1)\) is a 2-dimensional row vector
- \(v = \begin{pmatrix}4\\-1\\2\end{pmatrix}\) is a 3-dimensional column vector

What is a vector?

2-dimensional vectors are easy to visualize in a Cartesian plane.
For example, consider \(u = (3, 2)\):

Data as vectors

In a typical data set, each row can be thought of as a vector:

head(cars)

  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10

We have \(n\) 2-dimensional vectors:

The vector (4, 2);
The vector (4, 10);
etc.
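As a small illustration (not part of the original slides), a single row of `cars` can be pulled out as a plain numeric vector. This sketch uses only base R and the built-in `cars` data; the names `first_row`, `n`, and `p` are ours.

```r
# cars ships with base R (datasets package)
first_row <- unlist(cars[1, ])   # named numeric vector: speed = 4, dist = 2
first_row

# Number of row vectors (n) and the dimension of each (p)
n <- nrow(cars)
p <- ncol(cars)
c(n = n, p = p)
```

Here `unlist()` drops the data frame structure so the row behaves like an ordinary vector in later arithmetic.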

Higher dimensions

Of course, in practice we have more than 2 columns of data.
Below is a 3D scatter plot and vector plot of \(n=10\) vectors of \(p=3\) columns:

Can you visualize the data set that would yield these plots?

Scalar multiplication

Multiplying by a scalar \(c\) can affect both magnitude and direction:

If \(|c|>1\), \(cu\) lengthens the vector
If \(|c| < 1\), \(cu\) shortens the vector
If \(c < 0\), \(cu\) reverses the direction of the vector
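The three cases can be tried directly in R, where multiplying a numeric vector by a single number scales every element. A minimal sketch, assuming the example vector \(u = (3, 1)\) used elsewhere in these slides:

```r
u <- c(3, 1)

stretched <- 2 * u     # |c| > 1: same direction, longer     (6, 2)
shrunk    <- 0.5 * u   # |c| < 1: same direction, shorter    (1.5, 0.5)
reversed  <- -1 * u    # c < 0: opposite direction           (-3, -1)

stretched
shrunk
reversed
```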

Vector addition

Adding two vectors is as simple as adding the “steps” in each direction.
Consider \(u = (3,1)\) and \(v = (1,4)\), then \(u + v = (4,5)\):
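In R, `+` applied to two numeric vectors of the same length adds them element by element, which is exactly the "steps in each direction" idea. A quick check of the example, using only base R:

```r
u <- c(3, 1)
v <- c(1, 4)

w <- u + v   # adds the first elements, then the second elements
w            # (4, 5)
```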

Vector subtraction

To subtract vectors, e.g. \(u-v\):

Go in the \(u\) direction;
Stop and go in the negative \(v\) direction

Vector subtraction

Which of these shows \(u-v\)? What is the value of \(u-v\)? Of \(v-u\)?

Vector norms

An important characteristic of a vector is its norm.
Norm \(\equiv\) size, or length
There are many ways to measure the norm of a vector. Two important measures in data science include:
- L1 norm (aka “Taxicab” or “Manhattan”)
- L2 norm (aka “Euclidean” norm)
Norms are always \(\geq 0\)!

L1 (“taxicab”) norm example

Consider the vector \(u = (3,2)\):

\[ ||u||_1 = 5 \]

L1 (“taxicab”) norm example

Now consider \(v = (-4, -5)\):

\[ ||v||_1 = 9 \]

L1 norm, formally defined

Given a \(p\)-vector \(u = (u_1, u_2, ...,u_p)\):

\[ ||u||_1 = \sum_{i=1}^p |u_i| \]

L2 norm

Taxicabs are inefficient ways of moving from \(A\) to \(B\)!
Recall again \(u = (3,2)\).
Imagine you are a helicopter: you can fly straight from the origin \((0,0)\) to \((3,2)\). How far did you fly?

Pythagorean!

\[ ||u||_2 = \sqrt{3^2 + 2^2} = \sqrt{13} \approx 3.61 \]

L2 norm of \(v\)

What is the L2 norm of \(v = (-4, -5)\)?

L2 norm, formally defined

Given a \(p\)-vector \(u = (u_1, u_2, ...,u_p)\):

\[ ||u||_2 = \sqrt{\sum_{i=1}^p u_i^2} \]

Vectors in `R`

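The following code creates vectors \(u\) and \(v\) and computes their norms. Note that \(v\) is entered here as \((-4, 5)\) rather than \((-4, -5)\); flipping the sign of a component changes neither norm, so the results match the previous examples: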

u <- c(3,2)
v <- c(-4,5)

#L1 norms:
sum(abs(u))

[1] 5

sum(abs(v))

[1] 9

#L2 norms:
sqrt(sum(u^2))

[1] 3.605551

sqrt(sum(v^2))

[1] 6.403124

#NOT an L2 norm:
sum(sqrt(v^2))

[1] 9

#NOT an L2 norm:
sqrt(sum(v))^2

[1] 1
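For repeated use, the two norm calculations can be wrapped in small functions. This is a sketch rather than part of the original slides: the names `l1_norm` and `l2_norm` are ours, and \(v = (-4, -5)\) is taken from the earlier norm examples.

```r
# Hypothetical helper functions (names chosen for this example)
l1_norm <- function(x) sum(abs(x))     # sum of absolute values
l2_norm <- function(x) sqrt(sum(x^2))  # square root of the sum of squares

u <- c(3, 2)
v <- c(-4, -5)

l1_norm(u)   # 5
l1_norm(v)   # 9
l2_norm(u)   # sqrt(13), about 3.61
l2_norm(v)   # sqrt(41), about 6.40

# Apply a norm to every row vector of a data set (MARGIN = 1 means rows)
cars_l2 <- apply(cars, 1, l2_norm)
head(cars_l2, 3)
```

The last two lines treat each row of the built-in `cars` data as a 2-dimensional vector and return one norm per row.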

Mean vectors

Scaling is an important concept in this class, and one important scaling ingredient is the mean vector.

Consider the plot of the cars data set.
The complete data set has \(n=50\) rows (i.e., 2-dimensional vectors)
The red diamond in the middle is the mean vector
- Horizontal coordinate: mean of top density (Speed)
- Vertical coordinate: mean of right density (Stopping distance)

Mean vector formally defined

Given \(n\) \(p\)-dimensional vectors \(u_i\), the mean vector \(m\) is simply:

\[m = \frac{1}{n} \cdot \sum_{i=1}^n u_i\]
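The formula can be followed literally in R: add up the \(n\) row vectors, then multiply by \(1/n\). The sketch below does this for the built-in `cars` data and compares the result with base R's `colMeans()`; the names `row_vectors` and `m` are ours.

```r
n <- nrow(cars)

# One numeric vector per row of the data set
row_vectors <- lapply(seq_len(n), function(i) unlist(cars[i, ]))

# Sum the n vectors elementwise, then scale by 1/n
m <- (1 / n) * Reduce(`+`, row_vectors)
m

# The built-in column means give the same vector
colMeans(cars)
```

In practice the one-line `colMeans()` call is preferable; the longer version is only meant to mirror the summation in the definition.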

Calculating mean vectors in `R`

There are two ways to find mean vectors in R:

dplyr approach (allows you to specify which columns to average):

library(dplyr)
(cars
 %>% summarize(across(.cols = c(speed, dist), .fns = mean))
)

  speed  dist
1  15.4 42.98

Base R approach (requires all columns to be numeric):

apply(cars, 2, FUN = mean)

speed  dist 
15.40 42.98

Mean-centering

Subtracting the mean vector from the data yields mean-centered data: all columns have mean 0.
A convenient way to mean-center columns in R is the scale function.
scale can both center and standardize (more later); for now we just want centering:

cars_centered <- scale(cars, center = TRUE, scale = FALSE)
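A quick way to confirm the centering worked is to check that every column mean is zero (up to floating-point error) and that the result matches subtracting the mean vector by hand. A minimal sketch in base R; `manual_centered` is a name introduced here.

```r
cars_centered <- scale(cars, center = TRUE, scale = FALSE)

# Column means of the centered data: zero up to rounding error
colMeans(cars_centered)

# Manual version: subtract the mean vector from every row
manual_centered <- sweep(as.matrix(cars), 2, colMeans(cars), FUN = "-")

# Same numbers; the attributes added by scale() are ignored in the comparison
all.equal(manual_centered, cars_centered, check.attributes = FALSE)
```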

Code for previous plot

# Make the base scatterplot
# (scale() returns a matrix, so convert it back to a data frame for ggplot)
library(tidyverse)
p <- ggplot(as.data.frame(cars_centered), aes(x = speed, y = dist)) +
  geom_point(color = "steelblue", size = 3, alpha= .7) +
  geom_point(color="red", size = 4, aes(x = 0, y = 0), pch=18) + 
  geom_hline(aes(yintercept = 0), linetype = 2) + 
  geom_vline(aes(xintercept = 0), linetype = 2)+
  theme_classic(base_size = 18) +
  labs(x = "Speed",
       y = "Stopping Distance")

# Add marginal density plots
library(ggExtra)
ggMarginal(p, type = "density", fill = "lightblue", alpha = 0.5)

Std dev scaling

After mean-centering, dividing each column by its standard deviation yields \(p\)-vectors whose elements each have mean 0 and standard deviation 1 across the data set.
Important implications for measuring length and (later) distance between vectors.
Finding standard deviation vectors:

dplyr approach:

(cars
 %>% summarize(across(.cols = c(speed, dist), .fns = sd))
 )

     speed     dist
1 5.287644 25.76938

Base R approach:

apply(cars, 2, FUN = sd)

    speed      dist 
 5.287644 25.769377

Full scaling

Centering and scaling can be done in one step with:

cars_scaled <- scale(cars, center = TRUE, scale = TRUE)

Note that the attributes of the result store the mean and standard deviation vectors:

attr(cars_scaled, "scaled:center")

speed  dist 
15.40 42.98

attr(cars_scaled, "scaled:scale")

    speed      dist 
 5.287644 25.769377
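As a sanity check on the scaled data, each column should now have mean 0 and standard deviation 1, and any single column should match the hand calculation (value minus mean, divided by standard deviation). A sketch in base R; `speed_z` is a name introduced for this example.

```r
cars_scaled <- scale(cars, center = TRUE, scale = TRUE)

# Column means (zero up to rounding) and standard deviations (one)
round(colMeans(cars_scaled), 10)
apply(cars_scaled, 2, sd)

# Manual standardization of one column
speed_z <- (cars$speed - mean(cars$speed)) / sd(cars$speed)
all.equal(speed_z, as.numeric(cars_scaled[, "speed"]))
```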

Plot of scaled data

Distances between vectors

We’ve considered the norm (“length”) of a single vector.
Distance between vectors is an important concept in this class, and is related to vector norms.
We’ll talk a lot more about distances later; for now, we consider a more formal vector-based definition.

Distance visualized

Consider two vectors \(u = (3,1)\) and \(v = (1,4)\):

Distance visualized

Subtracting \(u\) from \(v\) yields the vector \((-2, 3)\):

Distance visualized

Note that this vector is exactly the right magnitude and direction for traveling from \(u\) to \(v\):
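The same statement can be checked numerically: starting at \(u\) and adding the difference vector \(v - u\) lands exactly on \(v\). A minimal base R sketch, where `d` is a name introduced here for the difference vector:

```r
u <- c(3, 1)
v <- c(1, 4)

d <- v - u   # difference vector: (-2, 3)
d

u + d        # start at u, travel along d: arrive at v = (1, 4)
```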

Distance defined

We define the distance between vectors \(u\) and \(v\) to be the norm (recall: length) of the vector \(u-v\) (or, equivalently, the norm of the vector \(v-u\)).
Accordingly, distance can also be defined in L1 (“taxicab” or “Manhattan”) or L2 (“Euclidean”) form.

\[ ||u-v||_1 = \sum_{i=1}^p |u_i -v_i| \]

\[ ||u-v||_2 = \sqrt{\sum_{i=1}^p (u_i -v_i)^2} \]

Calculating distance for example

L1 distance: \(|3-1| + |1-4| = 2 + 3 = 5\)
L2 distance: \(\sqrt{(3-1)^2 + (1-4)^2} = \sqrt{13} \approx 3.61\)
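The two distance formulas translate directly into R. The sketch below defines small helper functions (the names `l1_dist` and `l2_dist` are ours) and applies them to \(u = (3,1)\) and \(v = (1,4)\):

```r
# Hypothetical helpers: distance = norm of the difference vector
l1_dist <- function(a, b) sum(abs(a - b))
l2_dist <- function(a, b) sqrt(sum((a - b)^2))

u <- c(3, 1)
v <- c(1, 4)

l1_dist(u, v)   # |3 - 1| + |1 - 4| = 5
l2_dist(u, v)   # sqrt(4 + 9) = sqrt(13), about 3.61

# Order does not matter: the norm of u - v equals the norm of v - u
l1_dist(v, u) == l1_dist(u, v)
```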

Air pollution data

Consider the USairpollution data set from the HSAUR2 package:

library(HSAUR2)
data("USairpollution")
head(USairpollution)

            SO2 temp manu popul wind precip predays
Albany       46 47.6   44   116  8.8  33.36     135
Albuquerque  11 56.8   46   244  8.9   7.77      58
Atlanta      24 61.5  368   497  9.1  48.34     115
Baltimore    47 55.0  625   905  9.6  41.31     111
Buffalo      11 47.1  391   463 12.4  36.11     166
Charleston   31 55.2   35    71  6.5  40.75     148

Air pollution data

Pairs plots are useful ways of visualizing multidimensional data. The ggpairs function from GGally produces:

library(GGally)
ggpairs(data = USairpollution) + 
  theme_bw()

The `dist` function

The dist() function computes distances between \(n\) vectors of \(p\) dimensions, stored one vector per row in a data frame or matrix.
By default, method = 'euclidean' (L2 distances); see ?dist for other options.
The following code computes all pairwise distances between the 41 cities in the USairpollution data set from the HSAUR2 package:

pollution_dist <- dist(USairpollution)

By default, a distance matrix is lower-triangular. First 6 rows:

              Albany Albuquerque  Atlanta Baltimore  Buffalo Charleston
Albuquerque  155.825
Atlanta      501.437     415.667
Baltimore    980.193     881.700  482.856
Buffalo      492.975     423.745   69.447   504.518
Charleston    51.163     199.113  541.855  1022.395  530.326
Chicago     4634.251    4545.006 4136.752  3669.934 4144.490   4672.609
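To see what `dist()` is doing without a large data set, it can help to run it on a few hand-checkable points. The sketch below is an illustration with made-up inputs: it reuses \(u = (3,1)\) and \(v = (1,4)\) from the distance example and adds a third point `w` at the origin, which is our own assumption.

```r
# One vector per row; row names label the points
pts <- rbind(u = c(3, 1), v = c(1, 4), w = c(0, 0))

d_l2 <- dist(pts)                         # default: Euclidean (L2)
d_l1 <- dist(pts, method = "manhattan")   # L1 ("taxicab")

d_l2
d_l1

# as.matrix() gives the full symmetric matrix, indexable by name
as.matrix(d_l2)["u", "v"]   # sqrt(13)
as.matrix(d_l1)["u", "v"]   # 5
```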

Wrangling the `dist` object

The following code wrangles the dist object:

library(HSAUR2)
library(tidyverse)
data("USairpollution")
distance_df <- (pollution_dist
  %>% as.matrix                              # 1
  %>% data.frame                             # 2
  %>% mutate(CityA = rownames(.))            # 3
  %>% pivot_longer(cols = -CityA,
                   names_to = 'CityB',
                   values_to = 'Distance')   # 4
)

1: Convert the dist object to a matrix
2: Convert the \(n \times n\) matrix to a data frame
3: Create a new column that moves the row names into an actual variable
4: Pivot the data to long format, one row per pair of cities

Result

We’re primed to play!

head(distance_df)

# A tibble: 6 × 3
  CityA  CityB       Distance
  <chr>  <chr>          <dbl>
1 Albany Albany           0  
2 Albany Albuquerque    156. 
3 Albany Atlanta        501. 
4 Albany Baltimore      980. 
5 Albany Buffalo        493. 
6 Albany Charleston      51.2
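A long table like this makes it easy to sort and filter pairs. As a self-contained illustration that avoids extra packages, the sketch below reaches a similar long format with base R only and then ranks pairs by distance. It is an alternative route, not the pipeline above, and it uses three made-up points (`u`, `v`, `w`) rather than the city data.

```r
# Toy input: one vector per row (w at the origin is an assumption for the example)
pts <- rbind(u = c(3, 1), v = c(1, 4), w = c(0, 0))

# dist -> full matrix -> table -> long data frame (columns Var1, Var2, Freq)
long <- as.data.frame(as.table(as.matrix(dist(pts))),
                      stringsAsFactors = FALSE)
names(long) <- c("CityA", "CityB", "Distance")

# Drop the zero self-distances, then sort from closest to farthest
long <- long[long$CityA != long$CityB, ]
long <- long[order(long$Distance), ]
long
```

Each unordered pair appears twice (once in each direction), just as in the full symmetric distance matrix.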
