Activity 2.2 - multidimensional scaling

SUBMISSION INSTRUCTIONS

Submit both:

Your .qmd;
a pdf either:
- rendered directly by changing your yaml to format: pdf
- A print-to-pdf of your rendered html
PDFs of your source code will NOT be considered.

Question 1

Suppose we have 3 data points in 2 dimensions:

A)

Using taxicab as the distance metric, find all pairwise distances between points in the full dimension space.

$d(A,B) = 4$; $d(A,C) = 6$, $d(B,C) = 4$

B)

Consider three one-dimensional ordinations of this space. In the first ordination, the points are collapsed to their vertical coordinates; in the second, they are collapsed to their horizontal coordinates; and in the third, they are collapsed vertically to the y=x line:

Find the stress of each of these ordinations. Which best represents the full-dimensional distances between points?

First ordination:
- Distances: $d(A,B) = 3$; $d(A, C) = 4$; $d(B, C) = 1$
- Stress: $\sqrt{(4-3)^2 + (6-4)^2 + (4-1)^2} = 3.74$
Second ordination:
- Distances: $d(A,B) = 1$; $d(A, C) = 2$; $d(B, C) = 3$
- Stress: $\sqrt{(4-1)^2 + (6-2)^2 + (4-3)^2} = 5.1$
Third ordination:
- Distances: $d(A,B) = 2$; $d(A, C) = 4$; $d(B, C) = 6$
- Stress: $\sqrt{(4-2)^2 + (6-4)^2 + (4-6)^2} = 3.46$

So the third ordination is the best 1D representation of these 2D data, as it has the lowest stress.

Question 2

Consider the following MDS biplot of a 2-dimensional representation of 4-dimensional data:

Which of these most likely represents the correlation matrix between the four variables?

Properties of this plot:

Variables A and C are quite strongly (and positively) correlated with each other, as they point in similar directions.
Variables A and C are strongly negatively correlated with B, as these arrows point in opposite directions..
A, C and B all have weak correlations with D; in fact it looks like $Cor(B,D)$ and $Cor(A,D)$ might be near 0 due to the orthogonality (right angles) of these arrows.

## Matrix 1:
   A      B      C      D
A  1.00  
B  0.90   1.00  
C  1.00   0.39   1.00   
D  0.39  -0.06   0.06   1.00

We can rule out Matrix 1 right away because it has a positive correlation between B and C, while we know B and C are negatively correlated.

## Matrix 2:
   A      B      C      D
A  1.00  
B -1.00   1.00  
C  0.94  -0.94   1.00   
D  0.06   0.39  -0.06   1.00

Matrix 2 can be ruled out since $Cor(B,D) = 0.39$ while $Cor(C,D) = -0.06$: the biplot certainly is not consistent with B and D be more correlated than C and D.

## Matrix 3:
   A      B      C      D
A  1.00  
B -1.00   1.00  
C -0.94   0.94   1.00   
D  0.06  -0.06   0.39   1.00

We can rule out Matrix 3 right away due to $Cor(A,C) = -0.94$.

## Matrix 4:
   A      B      C      D
A  1.00  
B -1.00   1.00  
C  0.94  -0.94   1.00   
D  0.06  -0.06   0.39   1.00

Matrix 4 is the one! We have:

perfect negative correlation between A and B (those arrows point in opposite directions);
strong positive correlation between A and C (those arrows point in very similar directions);
Near 0 correlation between A and D (those arrows are almost at a right angle);
Near 0 correlation between B and D (those arrows are almost at a right angle);
Strong negative correlation between B and C (these arrows point in almost opposite directions);
Weak positive correlation between C and D.

Question 3

Consider the correlation matrix between three variables given by:

      A    B     C
A  1.00 0.08 -0.14
B  0.08 1.00  0.98
C -0.14 0.98  1.00

A and B are weakly correlated; these arrows should be pointing at near right angles.
A and C are weakly correlated these arrows should be pointing at near right angles.
B and C are strongly correlated these arrows should be pointing in similar directions.

Plot 4 is most consistent with this correlation matrix.

Now consider four biplots below showing the 3-dimensional data set reduced to 2 dimensions using NMDS:

Which plot does the correlation matrix belong to?

Question 4

The data for this problem come from Kaggle. We have various numeric characterstics of 777 universities, both public and private. See the Kaggle page for the variable descriptions.

college <- read.csv('Data/College.csv')

The goal is to understand how universities differ with respect to the numeric variables measured, and how these characteristics compare between private and public universities.

A)

What is the best way to measure dissimilarities between universities with respect to the measured variables? Is scaling important before measuring dissimilarity? Why/why not?

Euclidean or Manhattan distance is appropriate as we have numeric variables that aren’t counts. We’ll use Euclidean, which does require scaling to prevent large-scale variables from dominating the distance calculations.

B)

Perform non-metric multidimensional scaling using a 2-dimensional ordination. Add the ordination coordinates to the college data frame. What is the stress of the solution, and what does this indicate? Explain as if to someone who knows nothing about NMDS.

library(vegan)
library(tidyverse)
numeric_vars <- college %>% select(Apps:Grad.Rate) 
scaled_vars <-  scale(numeric_vars)
euc_dist <- vegdist(scaled_vars, method='euclidean')

college_mds <- metaMDS(euc_dist, k = 2, trymax = 20, trace = 0)
college_update <- bind_cols(college, college_mds$points)

college_mds$stress

[1] 0.1288207

The stress of this solution is 0.13, which is acceptable. That means we are getting a decent representation of the similarities and dissimilarities between these colleges with respect to all the numeric variables with a much simpler 2-dimensional representation.

C)

Use envfit() to fit all numeric variables to the MDS coordinates. Extract the arrow vector coordinates and $R^2$ and put them in their own data frame. Sort this data set by $R^2$. Which variables are most and least responsible for separating points in this lower dimensional space?

college_fit <- envfit(ord = college_mds, env = numeric_vars)

arrowdf <- college_fit$vectors$arrows %>% 
          data.frame %>% 
          rownames_to_column('variable') %>% 
          mutate(r2 = college_fit$vectors$r) %>% 
          mutate(NMDS1 = NMDS1*sqrt(r2)*4,NMDS2 = NMDS2*sqrt(r2)*4)%>% 
          arrange(desc(r2))
arrowdf

      variable      NMDS1      NMDS2         r2
1  F.Undergrad 3.10467086 -1.4524939 0.73429498
2         Apps 3.41007911 -0.1816351 0.72885193
3       Accept 3.32930616 -0.7536090 0.72826287
4       Enroll 3.19165925 -1.1958508 0.72604674
5    Top10perc 1.67347220  2.8342190 0.67708165
6     Outstate 0.41912613  3.2137883 0.65650636
7       Expend 1.11372444  2.9709186 0.62917122
8    Top25perc 1.83825751  2.4999951 0.60182287
9          PhD 2.54576556  1.6597989 0.57724092
10 P.Undergrad 1.91785108 -2.1655003 0.52297152
11    Terminal 2.33280441  1.7103009 0.52294410
12   Grad.Rate 0.57821178  2.6208953 0.45021380
13 perc.alumni 0.03576433  2.5936603 0.42052205
14   S.F.Ratio 0.18038266 -2.5792735 0.41782436
15  Room.Board 0.66769147  2.3431923 0.37102261
16    Personal 0.88161837 -1.5537450 0.19946091
17       Books 0.14728931  0.2519091 0.00532202

Most responsible for separating universities:

number of full time undergraduates;
instructional expenditure per student;
Number of applications received and accepted, and number of new students enrolled.

Least responsible for separating universities:

estimated book costs, estimated personal spending, graduation rate, percent of faculty with terminal degree, and percent of alumni who donate.

D)

Use ggplot to plot the 2 dimensional ordination and the labeled arrows. Color-code the points by public/private. Which variable(s) are primarily separating samples along the horizontal axis? Vertical axis? What does this plot tell you about how private/public universities differ with respect to the measured variables?

Plot below, note that I am filtering to variables that have at least a 60% $R^2$ with the ordination space:

important_arrows <- arrowdf %>% filter(r2 > 0.6)

library(ggrepel)

Warning: package 'ggrepel' was built under R version 4.4.3

ggplot(data = college_update) +
  geom_point(aes(x = MDS1, y = MDS2, color = Private)) +
  geom_segment(aes(x = 0, y = 0,
                   xend = NMDS1, yend = NMDS2),
               arrow = arrow(length = unit(.2, 'cm')),
               data = important_arrows) + 
  geom_text_repel(aes(x = NMDS1, y = NMDS2, label = variable), 
            data = important_arrows) + 
  theme_classic(base_size = 14)

It appears that number of undergraduates, applications, accepted, and enrolled students (basically, the size of the university) separates points along the horizontal dimension. The vertical dimension seems to be relying on the cost and selectivity of the university: out-of-state tuition, expenditure per student, and percent of new students from the top 10% or 25% of class is what’s represented by the vertical dimension.

To summarize this plot in a nutshell, the public universities tend to be large, while the private universities tend to be expensive and selective.

E)

Use ggparcoord from the GGally package to create a parallel coordinates plot of the universities (helpful resource page here). Color-code by private/public. Use transparency and/or faceting to deal with over-plotting. Discuss your findings.

library(GGally)

Warning: package 'GGally' was built under R version 4.4.3

toplot <- college_update %>% 
  select(important_arrows$variable, Private)
ggparcoord(data = toplot, 
           columns = 1:8,
           groupColumn = 'Private', 
           alphaLines = 0.5,
           scale='uniminmax') + 
  theme_classic()

As is the case with parallel coordinates plots, the overplotting here is for the mostp art unpleasant. However, we can see that clearly the public scools tend to be higher than the private schools with respect to the number of student variables, while public schools stand out as having lower out-of-state tuition than the private schools.

As an alternative representation, here are side-by-side boxplots for these variables, which I think paints a much clearer picture:

toplot_tall <- pivot_longer(toplot, cols = -Private, values_to = 'value', names_to='variable')
ggplot(data = toplot_tall) + 
  geom_boxplot(aes(x = Private, y = value, fill=Private)) + 
  facet_wrap(~variable, scales = 'free_y')

F)

Find 4 universities that are far apart with respect to the 2D ordination (hint: use geom_text_repel() from the ggrepel package to label the points). Using your ordination and parallel coordinates plot, what can you say about how these universities differ from each other? (Hint: you may want to color these universities differently in your parallel coordinates plot.)

important_arrows <- arrowdf %>% filter(r2 > 0.6)

library(ggrepel)
ggplot(data = college_update) +
  geom_point(aes(x = MDS1, y = MDS2, color = Private)) +
    geom_text_repel(aes(x = MDS1, y = MDS2, label=Name), max.overlaps = 20) +
  geom_segment(aes(x = 0, y = 0,
                   xend = NMDS1, yend = NMDS2),
               arrow = arrow(length = unit(.2, 'cm')),
               data = important_arrows) + 
  geom_text_repel(aes(x = NMDS1, y = NMDS2, label = variable), 
            data = important_arrows) + 
  theme_classic(base_size = 14)

Warning: ggrepel: 765 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

I’ll choose Johns Hopkins, MidAmerica Nazarene, Rutgers, and U of M Twin Cities.

library(GGally)
toplot <- college_update %>% 
  select(important_arrows$variable, Private, Name) %>% 
  mutate(callout = ifelse(Name %in% c('University of Minnesota Twin Cities',
                                      'Rutgers at New Brunswick',
                                      'Johns Hopkins University',
                                      'MidAmerica Nazarene College'),
                          Name,'Other')
         )%>% 
  arrange(callout)
ggparcoord(data = toplot %>% filter(callout!='Other'), 
           columns = 1:8,
           groupColumn = 'callout') + 
  labs(color='')+
  #facet_wrap(~Private)+
  theme_classic()

Rutgers is the largest university, with comparable cost to U of M. Johns Hopkins is the most expensive. MidAmerica Nazarene is the smallest, and for a private school not too expensive!

G)

Filter the data set to Winona State and 10 other universities of your choice. Find the dissimilarity between these universities in full-dimensional, and in the 2-dimensional space. How correlated are these dissimilarities, and is this a good or a bad thing?

set.seed(223)
college_sample <- 
  bind_rows(college_update %>% filter(Name=='Winona State University'),
            college_update %>% sample_n(10))

fulld <- vegdist(college_sample %>% 
                   select(Apps:Grad.Rate) %>% 
                   scale, method='euclidean')

lowd <- vegdist(college_sample 
                %>% select(MDS1,MDS2),
                method='euclidean')

bothd <- cbind(fulld, lowd)


cor(bothd)

          fulld      lowd
fulld 1.0000000 0.8834521
lowd  0.8834521 1.0000000

These distances have a correlation of 0.88 Strong, which is a good thing - we want the 2D representations to well-approximate the full-D ones.

Question 5

For this problem we will analyze the mushroom data, which are various categorical characteristics of poisonous and edible mushrooms. The variables used to measure distance are all except the class variable. See the data set Kaggle site for variable descriptions. Note that stalk.root has several unknown values indicated by ?, which we will remove. We’ll also de-select some variables that are mostly constant and sample to only 1000 mushrooms for simplicity.

set.seed(1112)
mushroomsc <- (read.csv('Data/mushrooms-with-columnnames.csv') 
               %>% filter(stalk.root!="?") 
               %>% dplyr::select(-veil.type, -veil.color,-gill.attachment)
               %>% sample_n(1000)
)

A)

Create an appropriate dissimilarity matrix for this data set, using all variables except class.

library(fastDummies)

Warning: package 'fastDummies' was built under R version 4.4.3

mushrooms_onehot <- mushroomsc %>% 
  select(-class) %>% 
  dummy_cols(., remove_selected_columns = TRUE)

jaccard_dist <- vegdist(mushrooms_onehot, method='jaccard')

B)

Use the dissimilarity matrix to perform NMDS. (NOTE: by default, metaMDS algorithm starts from 20 different random starts to try to make sure the optimal solution is found. This is a good idea in practice, but with large n, even of 1000, each try takes a while, so 20 will take forever! For now, set trymax=1 to speed things up.) Is there much difference in stress between a 2 and 3 dimensional configuration?

mush_mds2 <- metaMDS(jaccard_dist, k = 2, trymax=1)

Run 0 stress 0.1459574 
Run 1 stress 0.148828 
*** Best solution was not repeated -- monoMDS stopping criteria:
     1: scale factor of the gradient < sfgrmin

mush_mds3 <- metaMDS(jaccard_dist, k = 3, trymax=1)

Run 0 stress 0.09984206 
Run 1 stress 0.09992898 
... Procrustes: rmse 0.005100831  max resid 0.05699259 
*** Best solution was not repeated -- monoMDS stopping criteria:
     1: scale factor of the gradient < sfgrmin

mush_mds2


Call:
metaMDS(comm = jaccard_dist, k = 2, trymax = 1) 

global Multidimensional Scaling using monoMDS

Data:     jaccard_dist 
Distance: jaccard 

Dimensions: 2 
Stress:     0.1459574 
Stress type 1, weak ties
Best solution was not repeated after 1 tries
The best solution was from try 0 (metric scaling or null solution)
Scaling: centring, PC rotation, halfchange scaling 
Species: scores missing

mush_mds3


Call:
metaMDS(comm = jaccard_dist, k = 3, trymax = 1) 

global Multidimensional Scaling using monoMDS

Data:     jaccard_dist 
Distance: jaccard 

Dimensions: 3 
Stress:     0.09984206 
Stress type 1, weak ties
Best solution was not repeated after 1 tries
The best solution was from try 0 (metric scaling or null solution)
Scaling: centring, PC rotation, halfchange scaling 
Species: scores missing

The DD configuration has a stress of 0.0998, in the “very good” realm, while the 2D is 0.146, merely “acceptable.”

C)

Focus on the 2-dimensional ordination. Use envfit to obtain relationships between the mushroom characteristics and these MDS coordinates. (NOTE: since the variables are factors here, rather than regressing the variables on the coordinates envfit() simply averages the coordinates for each level of the factors; these are contained in the $factors$centroids object.) Create a data frame of these centroids. Then create a biplot with the following specifications:

The MDS coordinates on horizontal and vertical;
Points color-coded by poisonous/edibility;
Points of different shape and color for the centroids;
Labels for the centroids with measures taken to reduce overlapping labels

Is this plot readable?

#update the mushrooms data frame with MDS coordinates:
mushrooms_with_mds <- bind_cols(mushroomsc, mush_mds2$points)

#Fit mushroom characteristics on ordination:
mushrooms_noclass <- mushroomsc %>% select(-class)
mush_fit <- envfit(ord= mush_mds2, env = mushrooms_noclass)

#Extract centroids to data frame:
centroids <- mush_fit$factors$centroids %>%
  data.frame %>% 
  rownames_to_column('variable')

#Produce plot:
library(ggrepel)

ggplot(mushrooms_with_mds) +
  geom_point(aes(x = MDS1, y = MDS2, color = class),size = 3, alpha = 0.7) +
  geom_point(aes(x = NMDS1, y = NMDS2), data = centroids, shape =15 ) +
  geom_text_repel(aes(x = NMDS1, y = NMDS2, label = variable), data = centroids)+
  labs(x = "MDS Dimension 1", y = "MDS Dimension 2") +
  theme_classic(base_size = 14)

Warning: ggrepel: 59 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

There are way too many centroid points on this plot to make sense of it!

E)

We’ve got some filtering to do. Filter the centroids data frame to contain only variables with >0.5 $R^2$ with respect to the ordination space. Then re-create your plot. If you want to eat a mushroom and live to tell about it, what characteristics should you look for? What characteristics should you avoid? (For full credit, you should spend some time on this response.)

#Sort the variables by R2:
mush_fit$factors$r %>% 
  sort(., decreasing = TRUE)

               ring.type stalk.surface.above.ring stalk.surface.below.ring 
              0.81858434               0.68658045               0.67062433 
  stalk.color.below.ring   stalk.color.above.ring                     odor 
              0.60082927               0.59836672               0.58485776 
       spore.print.color               stalk.root               population 
              0.53025223               0.40341466               0.39448550 
                 bruises                  habitat               gill.color 
              0.34218841               0.31598609               0.30981461 
               cap.color             gill.spacing              cap.surface 
              0.26634394               0.26255209               0.16811246 
               cap.shape                gill.size              stalk.shape 
              0.07578453               0.07551151               0.06948195 
             ring.number 
              0.03935608

#Filter to the ones with R2 > 0.5, which is the first 7:
important <- mush_fit$factors$r %>% 
  sort(., decreasing = TRUE)%>% 
  head(7)

# Now filter the centroids data.
centroids_important <- centroids %>% 
  #Mutate to get rid of the level indicator at end of string:
  mutate(variable_only = str_sub(variable, start=1, end = -2)) %>% 
  # Filter to variables that are in important vector:
  filter(variable_only %in% names(important))

ggplot(mushrooms_with_mds) +
  geom_point(aes(x = MDS1, y = MDS2, color = class),size = 3, alpha = 0.7) +
  geom_point(aes(x = NMDS1, y = NMDS2), data = centroids_important, shape =15 ) +
  geom_text_repel(aes(x = NMDS1, y = NMDS2, label = variable), data = centroids_important, 
                  max.overlaps = 20)+
  labs(x = "MDS Dimension 1", y = "MDS Dimension 2") +
  theme_classic(base_size = 14)

If you want to eat a mushroom and not die, you should definitely avoid yellow, buff, or brown stalks both above and below the ring. Gray and pink is much safer! Also avoid foul, pungent, or musty-smelling mushrooms, or odorless mushrooms and go for the ones that smell like anise or almond! Don’t be fooled by the silky stalk surfaces: they’re deadly. Fibrous, smooth stalk surfaces are your friend,.