<- read.csv('Data/College.csv') college
Activity 2.2 - multidimensional scaling
SUBMISSION INSTRUCTIONS
Submit both:
- Your .qmd;
- a pdf either:
- rendered directly by changing your yaml to
format: pdf
- A print-to-pdf of your rendered html
- rendered directly by changing your yaml to
- PDFs of your source code will NOT be considered.
Question 1
Suppose we have 3 data points in 2 dimensions:
A)
Using taxicab as the distance metric, find all pairwise distances between points in the full dimension space.
- \(d(A,B) = 4\); \(d(A,C) = 6\), \(d(B,C) = 4\)
B)
Consider three one-dimensional ordinations of this space. In the first ordination, the points are collapsed to their vertical coordinates; in the second, they are collapsed to their horizontal coordinates; and in the third, they are collapsed vertically to the y=x line:
Find the stress of each of these ordinations. Which best represents the full-dimensional distances between points?
- First ordination:
- Distances: \(d(A,B) = 3\); \(d(A, C) = 4\); \(d(B, C) = 1\)
- Stress: \(\sqrt{(4-3)^2 + (6-4)^2 + (4-1)^2} = 3.74\)
- Second ordination:
- Distances: \(d(A,B) = 1\); \(d(A, C) = 2\); \(d(B, C) = 3\)
- Stress: \(\sqrt{(4-1)^2 + (6-2)^2 + (4-3)^2} = 5.1\)
- Third ordination:
- Distances: \(d(A,B) = 2\); \(d(A, C) = 4\); \(d(B, C) = 6\)
- Stress: \(\sqrt{(4-2)^2 + (6-4)^2 + (4-6)^2} = 3.46\)
So the third ordination is the best 1D representation of these 2D data, as it has the lowest stress.
Question 2
Consider the following MDS biplot of a 2-dimensional representation of 4-dimensional data:
Which of these most likely represents the correlation matrix between the four variables?
Properties of this plot:
- Variables A and C are quite strongly (and positively) correlated with each other, as they point in similar directions.
- Variables A and C are strongly negatively correlated with B, as these arrows point in opposite directions..
- A, C and B all have weak correlations with D; in fact it looks like \(Cor(B,D)\) and \(Cor(A,D)\) might be near 0 due to the orthogonality (right angles) of these arrows.
## Matrix 1:
A B C D
A 1.00
B 0.90 1.00
C 1.00 0.39 1.00
D 0.39 -0.06 0.06 1.00
We can rule out Matrix 1 right away because it has a positive correlation between B and C, while we know B and C are negatively correlated.
## Matrix 2:
A B C D
A 1.00
B -1.00 1.00
C 0.94 -0.94 1.00
D 0.06 0.39 -0.06 1.00
Matrix 2 can be ruled out since \(Cor(B,D) = 0.39\) while \(Cor(C,D) = -0.06\): the biplot certainly is not consistent with B and D be more correlated than C and D.
## Matrix 3:
A B C D
A 1.00
B -1.00 1.00
C -0.94 0.94 1.00
D 0.06 -0.06 0.39 1.00
We can rule out Matrix 3 right away due to \(Cor(A,C) = -0.94\).
## Matrix 4:
A B C D
A 1.00
B -1.00 1.00
C 0.94 -0.94 1.00
D 0.06 -0.06 0.39 1.00
Matrix 4 is the one! We have:
- perfect negative correlation between A and B (those arrows point in opposite directions);
- strong positive correlation between A and C (those arrows point in very similar directions);
- Near 0 correlation between A and D (those arrows are almost at a right angle);
- Near 0 correlation between B and D (those arrows are almost at a right angle);
- Strong negative correlation between B and C (these arrows point in almost opposite directions);
- Weak positive correlation between C and D.
Question 3
Consider the correlation matrix between three variables given by:
A B C
A 1.00 0.08 -0.14
B 0.08 1.00 0.98
C -0.14 0.98 1.00
- A and B are weakly correlated; these arrows should be pointing at near right angles.
- A and C are weakly correlated these arrows should be pointing at near right angles.
- B and C are strongly correlated these arrows should be pointing in similar directions.
Plot 4 is most consistent with this correlation matrix.
Now consider four biplots below showing the 3-dimensional data set reduced to 2 dimensions using NMDS:
Which plot does the correlation matrix belong to?
Question 4
The data for this problem come from Kaggle. We have various numeric characterstics of 777 universities, both public and private. See the Kaggle page for the variable descriptions.
The goal is to understand how universities differ with respect to the numeric variables measured, and how these characteristics compare between private and public universities.
A)
What is the best way to measure dissimilarities between universities with respect to the measured variables? Is scaling important before measuring dissimilarity? Why/why not?
Euclidean or Manhattan distance is appropriate as we have numeric variables that aren’t counts. We’ll use Euclidean, which does require scaling to prevent large-scale variables from dominating the distance calculations.
B)
Perform non-metric multidimensional scaling using a 2-dimensional ordination. Add the ordination coordinates to the college
data frame. What is the stress of the solution, and what does this indicate? Explain as if to someone who knows nothing about NMDS.
library(vegan)
library(tidyverse)
<- college %>% select(Apps:Grad.Rate)
numeric_vars <- scale(numeric_vars)
scaled_vars <- vegdist(scaled_vars, method='euclidean')
euc_dist
<- metaMDS(euc_dist, k = 2, trymax = 20, trace = 0)
college_mds <- bind_cols(college, college_mds$points)
college_update
$stress college_mds
[1] 0.1288207
The stress of this solution is 0.13, which is acceptable. That means we are getting a decent representation of the similarities and dissimilarities between these colleges with respect to all the numeric variables with a much simpler 2-dimensional representation.
C)
Use envfit()
to fit all numeric variables to the MDS coordinates. Extract the arrow vector coordinates and \(R^2\) and put them in their own data frame. Sort this data set by \(R^2\). Which variables are most and least responsible for separating points in this lower dimensional space?
<- envfit(ord = college_mds, env = numeric_vars)
college_fit
<- college_fit$vectors$arrows %>%
arrowdf %>%
data.frame rownames_to_column('variable') %>%
mutate(r2 = college_fit$vectors$r) %>%
mutate(NMDS1 = NMDS1*sqrt(r2)*4,NMDS2 = NMDS2*sqrt(r2)*4)%>%
arrange(desc(r2))
arrowdf
variable NMDS1 NMDS2 r2
1 F.Undergrad 3.10467086 -1.4524939 0.73429498
2 Apps 3.41007911 -0.1816351 0.72885193
3 Accept 3.32930616 -0.7536090 0.72826287
4 Enroll 3.19165925 -1.1958508 0.72604674
5 Top10perc 1.67347220 2.8342190 0.67708165
6 Outstate 0.41912613 3.2137883 0.65650636
7 Expend 1.11372444 2.9709186 0.62917122
8 Top25perc 1.83825751 2.4999951 0.60182287
9 PhD 2.54576556 1.6597989 0.57724092
10 P.Undergrad 1.91785108 -2.1655003 0.52297152
11 Terminal 2.33280441 1.7103009 0.52294410
12 Grad.Rate 0.57821178 2.6208953 0.45021380
13 perc.alumni 0.03576433 2.5936603 0.42052205
14 S.F.Ratio 0.18038266 -2.5792735 0.41782436
15 Room.Board 0.66769147 2.3431923 0.37102261
16 Personal 0.88161837 -1.5537450 0.19946091
17 Books 0.14728931 0.2519091 0.00532202
Most responsible for separating universities:
- number of full time undergraduates;
- instructional expenditure per student;
- Number of applications received and accepted, and number of new students enrolled.
Least responsible for separating universities:
- estimated book costs, estimated personal spending, graduation rate, percent of faculty with terminal degree, and percent of alumni who donate.
D)
Use ggplot
to plot the 2 dimensional ordination and the labeled arrows. Color-code the points by public/private. Which variable(s) are primarily separating samples along the horizontal axis? Vertical axis? What does this plot tell you about how private/public universities differ with respect to the measured variables?
Plot below, note that I am filtering to variables that have at least a 60% \(R^2\) with the ordination space:
<- arrowdf %>% filter(r2 > 0.6)
important_arrows
library(ggrepel)
Warning: package 'ggrepel' was built under R version 4.4.3
ggplot(data = college_update) +
geom_point(aes(x = MDS1, y = MDS2, color = Private)) +
geom_segment(aes(x = 0, y = 0,
xend = NMDS1, yend = NMDS2),
arrow = arrow(length = unit(.2, 'cm')),
data = important_arrows) +
geom_text_repel(aes(x = NMDS1, y = NMDS2, label = variable),
data = important_arrows) +
theme_classic(base_size = 14)
It appears that number of undergraduates, applications, accepted, and enrolled students (basically, the size of the university) separates points along the horizontal dimension. The vertical dimension seems to be relying on the cost and selectivity of the university: out-of-state tuition, expenditure per student, and percent of new students from the top 10% or 25% of class is what’s represented by the vertical dimension.
To summarize this plot in a nutshell, the public universities tend to be large, while the private universities tend to be expensive and selective.
E)
Use ggparcoord
from the GGally
package to create a parallel coordinates plot of the universities (helpful resource page here). Color-code by private/public. Use transparency and/or faceting to deal with over-plotting. Discuss your findings.
library(GGally)
Warning: package 'GGally' was built under R version 4.4.3
<- college_update %>%
toplot select(important_arrows$variable, Private)
ggparcoord(data = toplot,
columns = 1:8,
groupColumn = 'Private',
alphaLines = 0.5,
scale='uniminmax') +
theme_classic()
As is the case with parallel coordinates plots, the overplotting here is for the mostp art unpleasant. However, we can see that clearly the public scools tend to be higher than the private schools with respect to the number of student variables, while public schools stand out as having lower out-of-state tuition than the private schools.
As an alternative representation, here are side-by-side boxplots for these variables, which I think paints a much clearer picture:
<- pivot_longer(toplot, cols = -Private, values_to = 'value', names_to='variable')
toplot_tall ggplot(data = toplot_tall) +
geom_boxplot(aes(x = Private, y = value, fill=Private)) +
facet_wrap(~variable, scales = 'free_y')
F)
Find 4 universities that are far apart with respect to the 2D ordination (hint: use geom_text_repel()
from the ggrepel
package to label the points). Using your ordination and parallel coordinates plot, what can you say about how these universities differ from each other? (Hint: you may want to color these universities differently in your parallel coordinates plot.)
<- arrowdf %>% filter(r2 > 0.6)
important_arrows
library(ggrepel)
ggplot(data = college_update) +
geom_point(aes(x = MDS1, y = MDS2, color = Private)) +
geom_text_repel(aes(x = MDS1, y = MDS2, label=Name), max.overlaps = 20) +
geom_segment(aes(x = 0, y = 0,
xend = NMDS1, yend = NMDS2),
arrow = arrow(length = unit(.2, 'cm')),
data = important_arrows) +
geom_text_repel(aes(x = NMDS1, y = NMDS2, label = variable),
data = important_arrows) +
theme_classic(base_size = 14)
Warning: ggrepel: 765 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
I’ll choose Johns Hopkins, MidAmerica Nazarene, Rutgers, and U of M Twin Cities.
library(GGally)
<- college_update %>%
toplot select(important_arrows$variable, Private, Name) %>%
mutate(callout = ifelse(Name %in% c('University of Minnesota Twin Cities',
'Rutgers at New Brunswick',
'Johns Hopkins University',
'MidAmerica Nazarene College'),
'Other')
Name,%>%
)arrange(callout)
ggparcoord(data = toplot %>% filter(callout!='Other'),
columns = 1:8,
groupColumn = 'callout') +
labs(color='')+
#facet_wrap(~Private)+
theme_classic()
Rutgers is the largest university, with comparable cost to U of M. Johns Hopkins is the most expensive. MidAmerica Nazarene is the smallest, and for a private school not too expensive!
G)
Filter the data set to Winona State and 10 other universities of your choice. Find the dissimilarity between these universities in full-dimensional, and in the 2-dimensional space. How correlated are these dissimilarities, and is this a good or a bad thing?
set.seed(223)
<-
college_sample bind_rows(college_update %>% filter(Name=='Winona State University'),
%>% sample_n(10))
college_update
<- vegdist(college_sample %>%
fulld select(Apps:Grad.Rate) %>%
method='euclidean')
scale,
<- vegdist(college_sample
lowd %>% select(MDS1,MDS2),
method='euclidean')
<- cbind(fulld, lowd)
bothd
cor(bothd)
fulld lowd
fulld 1.0000000 0.8834521
lowd 0.8834521 1.0000000
These distances have a correlation of 0.88 Strong, which is a good thing - we want the 2D representations to well-approximate the full-D ones.
Question 5
For this problem we will analyze the mushroom data, which are various categorical characteristics of poisonous and edible mushrooms. The variables used to measure distance are all except the class
variable. See the data set Kaggle site for variable descriptions. Note that stalk.root
has several unknown values indicated by ?
, which we will remove. We’ll also de-select some variables that are mostly constant and sample to only 1000 mushrooms for simplicity.
set.seed(1112)
<- (read.csv('Data/mushrooms-with-columnnames.csv')
mushroomsc %>% filter(stalk.root!="?")
%>% dplyr::select(-veil.type, -veil.color,-gill.attachment)
%>% sample_n(1000)
)
A)
Create an appropriate dissimilarity matrix for this data set, using all variables except class
.
library(fastDummies)
Warning: package 'fastDummies' was built under R version 4.4.3
<- mushroomsc %>%
mushrooms_onehot select(-class) %>%
dummy_cols(., remove_selected_columns = TRUE)
<- vegdist(mushrooms_onehot, method='jaccard') jaccard_dist
B)
Use the dissimilarity matrix to perform NMDS. (NOTE: by default, metaMDS
algorithm starts from 20 different random starts to try to make sure the optimal solution is found. This is a good idea in practice, but with large n
, even of 1000, each try takes a while, so 20 will take forever! For now, set trymax=1
to speed things up.) Is there much difference in stress between a 2 and 3 dimensional configuration?
<- metaMDS(jaccard_dist, k = 2, trymax=1) mush_mds2
Run 0 stress 0.1459574
Run 1 stress 0.148828
*** Best solution was not repeated -- monoMDS stopping criteria:
1: scale factor of the gradient < sfgrmin
<- metaMDS(jaccard_dist, k = 3, trymax=1) mush_mds3
Run 0 stress 0.09984206
Run 1 stress 0.09992898
... Procrustes: rmse 0.005100831 max resid 0.05699259
*** Best solution was not repeated -- monoMDS stopping criteria:
1: scale factor of the gradient < sfgrmin
mush_mds2
Call:
metaMDS(comm = jaccard_dist, k = 2, trymax = 1)
global Multidimensional Scaling using monoMDS
Data: jaccard_dist
Distance: jaccard
Dimensions: 2
Stress: 0.1459574
Stress type 1, weak ties
Best solution was not repeated after 1 tries
The best solution was from try 0 (metric scaling or null solution)
Scaling: centring, PC rotation, halfchange scaling
Species: scores missing
mush_mds3
Call:
metaMDS(comm = jaccard_dist, k = 3, trymax = 1)
global Multidimensional Scaling using monoMDS
Data: jaccard_dist
Distance: jaccard
Dimensions: 3
Stress: 0.09984206
Stress type 1, weak ties
Best solution was not repeated after 1 tries
The best solution was from try 0 (metric scaling or null solution)
Scaling: centring, PC rotation, halfchange scaling
Species: scores missing
The DD configuration has a stress of 0.0998, in the “very good” realm, while the 2D is 0.146, merely “acceptable.”
C)
Focus on the 2-dimensional ordination. Use envfit
to obtain relationships between the mushroom characteristics and these MDS coordinates. (NOTE: since the variables are factors here, rather than regressing the variables on the coordinates envfit()
simply averages the coordinates for each level of the factors; these are contained in the $factors$centroids
object.) Create a data frame of these centroids. Then create a biplot with the following specifications:
- The MDS coordinates on horizontal and vertical;
- Points color-coded by poisonous/edibility;
- Points of different shape and color for the centroids;
- Labels for the centroids with measures taken to reduce overlapping labels
Is this plot readable?
#update the mushrooms data frame with MDS coordinates:
<- bind_cols(mushroomsc, mush_mds2$points)
mushrooms_with_mds
#Fit mushroom characteristics on ordination:
<- mushroomsc %>% select(-class)
mushrooms_noclass <- envfit(ord= mush_mds2, env = mushrooms_noclass)
mush_fit
#Extract centroids to data frame:
<- mush_fit$factors$centroids %>%
centroids %>%
data.frame rownames_to_column('variable')
#Produce plot:
library(ggrepel)
ggplot(mushrooms_with_mds) +
geom_point(aes(x = MDS1, y = MDS2, color = class),size = 3, alpha = 0.7) +
geom_point(aes(x = NMDS1, y = NMDS2), data = centroids, shape =15 ) +
geom_text_repel(aes(x = NMDS1, y = NMDS2, label = variable), data = centroids)+
labs(x = "MDS Dimension 1", y = "MDS Dimension 2") +
theme_classic(base_size = 14)
Warning: ggrepel: 59 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
There are way too many centroid points on this plot to make sense of it!
E)
We’ve got some filtering to do. Filter the centroids data frame to contain only variables with >0.5 \(R^2\) with respect to the ordination space. Then re-create your plot. If you want to eat a mushroom and live to tell about it, what characteristics should you look for? What characteristics should you avoid? (For full credit, you should spend some time on this response.)
#Sort the variables by R2:
$factors$r %>%
mush_fitsort(., decreasing = TRUE)
ring.type stalk.surface.above.ring stalk.surface.below.ring
0.81858434 0.68658045 0.67062433
stalk.color.below.ring stalk.color.above.ring odor
0.60082927 0.59836672 0.58485776
spore.print.color stalk.root population
0.53025223 0.40341466 0.39448550
bruises habitat gill.color
0.34218841 0.31598609 0.30981461
cap.color gill.spacing cap.surface
0.26634394 0.26255209 0.16811246
cap.shape gill.size stalk.shape
0.07578453 0.07551151 0.06948195
ring.number
0.03935608
#Filter to the ones with R2 > 0.5, which is the first 7:
<- mush_fit$factors$r %>%
important sort(., decreasing = TRUE)%>%
head(7)
# Now filter the centroids data.
<- centroids %>%
centroids_important #Mutate to get rid of the level indicator at end of string:
mutate(variable_only = str_sub(variable, start=1, end = -2)) %>%
# Filter to variables that are in important vector:
filter(variable_only %in% names(important))
ggplot(mushrooms_with_mds) +
geom_point(aes(x = MDS1, y = MDS2, color = class),size = 3, alpha = 0.7) +
geom_point(aes(x = NMDS1, y = NMDS2), data = centroids_important, shape =15 ) +
geom_text_repel(aes(x = NMDS1, y = NMDS2, label = variable), data = centroids_important,
max.overlaps = 20)+
labs(x = "MDS Dimension 1", y = "MDS Dimension 2") +
theme_classic(base_size = 14)
If you want to eat a mushroom and not die, you should definitely avoid yellow, buff, or brown stalks both above and below the ring. Gray and pink is much safer! Also avoid foul, pungent, or musty-smelling mushrooms, or odorless mushrooms and go for the ones that smell like anise or almond! Don’t be fooled by the silky stalk surfaces: they’re deadly. Fibrous, smooth stalk surfaces are your friend,.