Multivariate Comparisons
Standard Statistical Dimension Reduction
Clustering
Searching for Outliers
More Example Code!
Summer 2020
Heat maps display a 2D matrix of data, where values are shown as colored shapes in the matrix
Sometimes laid out in more complex tiling (e.g., hexagonal)
Sometimes used in conjunction with methods to visualize hierarchical data
Difficult to draw specific inferences about data with a heat map, but can be very useful for spotting differences
A heatmap is really not a map at all, but rather a visual representation of a table where numeric data in cells are encoded as a color
But we often use the word “heatmap” whenever we have data, a map, and colors
http://cartonerd.blogspot.com/2015/02/when-is-heat-map-not-heat-map.html
Example: Students in a class and their grades on quizzes
Suppose we want an overall impression of student performance on quizzes
Consider:
| Student | Quiz-1 | Quiz-2 | Quiz-3 |
|---|---|---|---|
| Oscar OneGuy | 73 | 72 | 80 |
| Sue Secondly | 92 | 85 | 91 |
| Theo Thirdman | 80 | 74 | 92 |
| Finn Fourthperson | 77 | 69 | 100 |
library(ggplot2)
library(reshape2)
mydata = data.frame(Student=c("Oscar OneGuy","Sue Secondly","Theo Thirdman","Finn Fourthperson"),
Quiz1=c(73, 92, 80, 77),
Quiz2=c(72, 85, 74, 69),
Quiz3=c(80, 91, 93, 100))
mylongdata = melt(mydata)
ggplot(mylongdata, aes(variable, Student)) +
geom_tile(aes(fill=value),color="white") +
scale_fill_gradient(low="white", high="black") +
xlab("") + ylab("Student") + ggtitle("Grades on Student Quizes") +
theme(text=element_text(size=18, family="Times"), axis.ticks = element_blank())
These are the differences we are spotting … the specific color matters less than dramatic changes in color
This was easy to see directly from the numbers; however, it would have been harder with more students and more quizzes
library(ggplot2)
library(reshape2)
library(plyr,quietly=TRUE, warn.conflicts=FALSE)
library(scales)
# Load the data and order by total number of points scored
bballData <- read.csv("http://datasets.flowingdata.com/ppg2008.csv",header=TRUE)
bballData$Name <- with(bballData, reorder(Name, PTS))
# Reformat the data and scale each variable in preparation
bball <- melt(bballData)
bball <- ddply(bball, .(variable), transform, rescaledValue = scale(value))
ggplot(bball, aes(variable, Name)) +
# Use geom_tile to draw the cells
geom_tile(aes(fill=rescaledValue),color="white") +
# Scale the colors from a light blue to a dark blue
scale_fill_gradient(low="#F7FBFF", high="#08306B", guide=FALSE) +
# Label the axes
xlab("NBA Players") + ylab("Statistic") + ggtitle("NBA Stats by Player, 2008-2009 Season") +
# Size and adjust the text to be more readable
theme(text=element_text(size=12, family="Times"),
axis.ticks = element_blank(),
axis.text.y = element_text(size=9, vjust=0.25, color="black"),
axis.text.x = element_text(angle=90, vjust=0.25, hjust = 1, color="black"))
crimeURL = "http://datasets.flowingdata.com/crimeRatesByState2005.csv"
crime = read.csv(crimeURL)
stars(crime[,2:7],
labels=as.character(crime[,1]),
flip.labels=FALSE,
key.loc=c(15,11.0),
ncol=13,
cex=1.25)
Decoding direct numerical results from a star plot is not really possible
Instead, we look for points that stand out, that are different
library(lattice) # Needed for parallelplot()
# Get the data
eduURL = "http://datasets.flowingdata.com/education.csv"
sat = read.csv(eduURL)
# Make a vector of colors, where the upper half of the SAT
# total distribution is green and the lower is blue
satTotal = sat$reading + sat$writing + sat$math
colorIndices = (satTotal > median(satTotal)) + 1
satColors = c("dodgerblue2","darkolivegreen4")[colorIndices]
# Plot the data
parallelplot(sat[,2:7], horizontal.axis=FALSE, col=satColors,lwd=1.5,
main="Average State SAT and Graduation Statistics")
This plot reveals a number of interesting things:
Reading, writing, and math scores are extremely well correlated within state averages (see the numeric check below)
There appears very little relationship between SAT scores and teacher-student ratio or dropout rate
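We can check the correlation claim numerically; a minimal sketch, reusing the sat data frame loaded for the plot above:
# Correlation matrix of the three SAT score components (state averages)
cor(sat[, c("reading", "writing", "math")])
# Correlation of reading score with the teacher-student ratio
cor(sat$reading, sat$pupil_staff_ratio)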
There are basically two ways to perform dimensionality reduction: select a subset of the original variables (e.g., stepwise regression), or derive a smaller set of new dimensions from the originals (e.g., MDS, PCA)
There are many such methods in statistics and machine learning, and we’ll only hit on a few traditional techniques
library(MASS)
# Get a linear model with Fertility as the response variable and *all*
# other variables included as explanatory variables
fullLMmodel <- glm(Fertility ~., data = swiss)
# Perform stepwise regression to get a *new* model, search in both
# directions (fwd & bkwd), and I don't need to see the trace
stepLMModel <- stepAIC(fullLMmodel, direction = "both", trace = FALSE)
# Show me the resulting model and ANOVA results:
print(stepLMModel$anova)
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## Fertility ~ Agriculture + Examination + Education + Catholic + 
##     Infant.Mortality
## 
## Final Model:
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
## 
## 
##            Step Df Deviance Resid. Df Resid. Dev      AIC
## 1                                  41   2105.043 326.0716
## 2 - Examination  1 53.02656        42   2158.069 325.2408
The stepwise regression decided that the Examination variable was not needed for the prediction
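A quick way to confirm this (a small sketch reusing the two model objects fit above) is to compare their AIC values directly; the reduced model should score at least as well:
# AIC of the full model vs. the stepwise-reduced model (lower is better)
AIC(fullLMmodel)
AIC(stepLMModel)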
A scatter plot of high-dimensional, numeric data is difficult to draw directly
But one way to visualize it is to first reduce this dimensionality somehow
library(ggplot2)
library(dplyr,quietly=TRUE, warn.conflicts=FALSE)
# Load the data, filter out stuff we have no abbreviations for
eduURL = "http://datasets.flowingdata.com/education.csv"
sat = filter(read.csv(eduURL),
state != "United States",
state != "District of Columbia")
# Compute the MDS for just the numeric variables
satDist = dist(sat[,2:7])
satMDS = cmdscale(satDist)
# Compute the percentiles for some statistic
highlightStat = sat$reading # <- Change the statistic, if you want
mu = mean(highlightStat)
sigma = sqrt(var(highlightStat))
percentiles = pnorm(highlightStat, mean=mu, sd=sigma)
# Get a new data set in the transformed coordinates
reducedSAT = data.frame(x=satMDS[,1],
y=satMDS[,2],
abb=state.abb,
highlight=(percentiles > 0.70))
ggplot(reducedSAT,aes(x,y)) +
geom_text(aes(label=abb,color=highlight,size=highlight)) +
scale_color_manual(values=c("gray49","firebrick")) +
scale_size_manual(values=c(3,4)) +
geom_text(x=55,y=15,label="Upper 30% in Reading",size=4,color="firebrick") +
ggtitle("MDS Average State SAT and Graduation Statistics") +
theme(text=element_text(size=18,family="Times"),legend.position="None")
One common statistical technique to reduce dimensions is to perform some kind of decomposition of the space
The principal components are the orthogonal directions that explain the variance in the data, in decreasing order
# Say X is our data
eg = eigen(cov(X))
pc1 = eg$vectors[,1] * sqrt(eg$values[1])
pc2 = eg$vectors[,2] * sqrt(eg$values[2])
# ...
# Say X is our data
prcomp(X)
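prcomp also reports how much variance each component captures, which helps decide how many components are worth keeping; a quick sketch, still assuming X is our numeric data matrix:
# Standard deviation, proportion of variance, and cumulative proportion
# for each principal component
pc = prcomp(X)
summary(pc)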
library(ggplot2)
library(dplyr,quietly=TRUE, warn.conflicts=FALSE)
# Get the data, but exclude those we don't have abbreviations for
crimeURL = "http://datasets.flowingdata.com/crimeRatesByState2005.csv"
crime = filter(read.csv(crimeURL),
state != "United States",
state != "District of Columbia")
# Deal with the data as a matrix
X = as.matrix(crime[,2:7])
pc = prcomp(X) # the PCA
M = cbind(pc$rotation[,1],pc$rotation[,2])
Y = X %*% M # the rotation
# Map the first two PCs and the state abbreviations into a new data frame
crimeReduced = data.frame(state=crime$state,
pc1=Y[,1],
pc2=Y[,2],
abb=state.abb)
# Find the 10 states that are the highest on the first PC axis
a = order(crimeReduced$pc1,decreasing=TRUE)[1:10]
# Find the 10 states that are the highest on the second PC axis
b = order(crimeReduced$pc2,decreasing=TRUE)[1:10]
# Highlight the states that are at the intersection of those
# (that is: Find the states on the Pareto surface of the first
# two PCs)
crimeReduced = mutate(crimeReduced,
highCrime=(1:50 %in% intersect(a,b)))
# Plot it!
ggplot(crimeReduced,aes(pc1,pc2)) +
geom_text(aes(label=abb,color=highCrime,size=highCrime)) +
scale_color_manual(values=c("darkgray","firebrick")) +
scale_size_manual(values=c(4,7)) +
ylim(c(-750,750)) +
xlab("First Principle Component") + ylab("Second Principle Component") +
theme(text=element_text(size=18,family="Times"),legend.position="None")
There are a lot of packages for clustering in R, but we’ll just look at mclust
mclust works like other modeling fitting methods, so we’ll use that one
mclust uses EM and can choose the best number of clusters, \(k\), via BIC
library(mclust)
# Let's just look at numerical variables in the Iris dataset
# Also, let's rescale all the data so everything's in the same scale range
mydata <- scale(iris[,1:4])
# Build a cluster model, letting the algorithm choose k
clusterModel <- Mclust(mydata)
# Provide a summary of the model:
summary(clusterModel)
# Plot the cluster classifications directly:
plot(clusterModel, what="classification")
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model 
## with 2 components: 
## 
##  log-likelihood   n df       BIC       ICL
##       -322.6936 150 29 -790.6956 -790.6969
## 
## Clustering table:
##   1   2 
##  50 100
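To see how Mclust settled on two components, you can plot the BIC values it computed for every candidate model; a one-line follow-up using the clusterModel object from above:
# BIC for each number of components and covariance structure considered
plot(clusterModel, what="BIC")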
This data set contains statistics, in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
library(cluster)
# Dissimilarity matrix
arrestDissimilarity <- dist(USArrests, method = "euclidean")
# Hierarchical clustering using Complete Linkage
hc1 <- hclust(arrestDissimilarity, method = "complete")
# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1)
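The dendrogram by itself does not assign cluster labels; to get an actual grouping, cut the tree at some number of clusters. A small sketch continuing from hc1 above (k = 4 is an arbitrary choice for illustration):
# Cut the dendrogram into 4 groups and count how many states fall in each
groups <- cutree(hc1, k = 4)
table(groups)
# List the states assigned to the first group
names(groups)[groups == 1]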
Extreme values that stand out in a distribution are called outliers
library(ggplot2)
library(dplyr,quietly=TRUE, warn.conflicts=FALSE)
# Load the data
crimeURL = "http://datasets.flowingdata.com/crimeRatesByState2005.csv"
crimeRaw = read.csv(crimeURL)
# Compute the differences in murder rate from the US overall rate
USMurders = crimeRaw[1,]$murder
StateDeltaMurders = crimeRaw[2:52,]$murder - USMurders
# Make a new data frame of just the states & DC difference from US
crimeDelta = data.frame(state = crimeRaw[2:52,]$state,
StateDeltaMurders,
highlight=(StateDeltaMurders>0))
leftStateLabels = rep("",51)
leftIdx = which(crimeDelta$highlight)
leftStateLabels[leftIdx] = as.character(crimeDelta$state[leftIdx])
rightStateLabels = rep("",51)
rightIdx = which(!crimeDelta$highlight)
rightStateLabels[rightIdx] = as.character(crimeDelta$state[rightIdx])
ggplot(crimeDelta,aes(reorder(state,StateDeltaMurders),StateDeltaMurders)) +
geom_bar(stat="identity",aes(fill=highlight)) +
geom_hline(yintercept=0,color="black",size=0.75) +
geom_text(y=-0.5,hjust=1,label=leftStateLabels, size=3) +
geom_text(y=0.5,hjust=0,label=rightStateLabels, size=3) +
coord_flip() +
scale_fill_manual(values=c("goldenrod3","firebrick")) +
scale_y_continuous(breaks = seq(-5,30,by=5)) +
xlab("") + ylab("2005 Difference in Murder Rate from US Average (murder/100K)") +
theme(text=element_text(size=18,family="Times"),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
panel.grid.major.y = element_blank(),
legend.position="None")
library(ggplot2)
library(dplyr,quietly=TRUE, warn.conflicts=FALSE)
# Load data, but exclude US totals and DC
crimeURL = "http://datasets.flowingdata.com/crimeRatesByState2005.csv"
crime = filter(read.csv(crimeURL),
state != "United States",
state != "District of Columbia")
# Setup the bounds for the outliers based on inter-quartile range
focusStat = crime$forcible_rape
quartiles = summary(focusStat)[c(2,5)]
IQR = (quartiles[2] - quartiles[1])
ub = 1.58*IQR + median(focusStat)
lb = median(focusStat) - 1.58*IQR
# Add a column to the data to highlight outliers
outlier = (focusStat < lb) | (focusStat > ub)
crime = mutate(crime, highlight=outlier)
ggplot(crime,aes(reorder(state,focusStat), focusStat)) +
geom_bar(stat="identity",aes(fill=highlight)) +
geom_hline(yintercept=seq(0,80,by=10), color="white", size=0.75) +
scale_y_continuous(breaks = seq(0,80,by=10)) +
coord_flip() +
scale_fill_manual(values=c("gray49","firebrick")) +
xlab("") + ylab("Rape Crimes Reported per 100K, 2005") +
theme_bw() +
theme(text=element_text(size=18,family="Times"),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.y = element_text(size=8),
legend.position="None")
library(ggplot2)
library(dplyr,quietly=TRUE, warn.conflicts=FALSE)
# Load the SAT data, but leave out the US totals and DC
eduURL = "http://datasets.flowingdata.com/education.csv"
sat = filter(read.csv(eduURL),
state != "United States",
state != "District of Columbia")
# Create the Outlier data for placing labels on the plot
outlierPositions = boxplot(sat$pupil_staff_ratio,plot=FALSE,range=1.3)$out
outlierStates = as.character(sat$state[which(sat$pupil_staff_ratio >= min(outlierPositions))])
outlierData = data.frame(outlierStates, outlierPositions)
ggplot(sat, aes(x=factor(1),y=pupil_staff_ratio)) +
geom_boxplot(outlier.size=5,outlier.colour="firebrick", fill="pink") +
geom_text(data=outlierData, x=1.05, aes(y=outlierPositions, label=outlierStates),
hjust=0, size=3, color="firebrick") +
xlab("") + ylab("Student-Teacher Ratio") +
theme(text=element_text(size=18,family="Times"),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank())
Small-sized visualizations, indexed by category and ordered over a numeric variable that is not otherwise visualized
Can be realized using lattice charts, trellis displays, panel displays, etc.
Reveals patterns and stand-out or anomalous data
# http://margintale.blogspot.in/2012/04/ggplot2-time-series-heatmaps.html
library(ggplot2)
library(plyr)
library(scales)
library(zoo)
df <- read.csv("https://raw.githubusercontent.com/selva86/datasets/master/yahoo.csv")
df$date <- as.Date(df$date) # format date
df <- df[df$year >= 2012, ] # filter reqd years
# Create Month Week
df$yearmonth <- as.yearmon(df$date)
df$yearmonthf <- factor(df$yearmonth)
df <- ddply(df,.(yearmonthf), transform, monthweek=1+week-min(week)) # compute week number of month
df <- df[, c("year", "yearmonthf", "monthf", "week", "monthweek", "weekdayf", "VIX.Close")]
# Plot
ggplot(df, aes(monthweek, weekdayf, fill = VIX.Close)) +
geom_tile(colour = "white") +
facet_grid(year~monthf) +
scale_fill_gradient(low="#e0f3db", high="#2ca25f") +
labs(x="Week of Month",
y="",
title = "Time-Series Calendar Heatmap",
subtitle="Yahoo Closing Price",
fill="Close")
We’ve covered most of this already: Week 3: Simple Effective Graphs
Almost all the plots discussed in Few’s chapter 10 are discussed there, and a few more (pun intended)
But maybe we’ll cover a couple of other things here briefly
There are two primary ways to describe (sample) distributions:
Summary statistics (numbers)
Visualization
Few mentions three key visual characteristics of distributions:
A distribution’s spread – range, IQR, standard deviation, etc.
A distribution’s center – mean, median, mode, etc.
A distribution’s shape – skew, kurtosis, etc.
Really there are more, and I would include as key:
When comparing multiple distributions, keep scales and intervals consistent
Choose histogram intervals so that individual bin sample error is not too large and the shape of the underlying distribution is relatively clear (see the sketch after this list)
Outliers … next slide
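To illustrate the binning point above, here is a minimal sketch using the built-in mtcars data; both binwidths are arbitrary choices:
library(ggplot2)
# Very narrow bins: each bin holds few samples, so the shape looks noisy
ggplot(mtcars, aes(mpg)) + geom_histogram(binwidth=0.5)
# Wider bins: fewer, better-populated bins give a clearer view of the shape
ggplot(mtcars, aes(mpg)) + geom_histogram(binwidth=3)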
Very often, when analyzing distributions, what we are really looking for is what is most unusual (outliers)
This is why some types of plots make seeing outliers explicit (e.g., boxplots and strip plots)
When there are extreme outliers, we often need to be more careful about other measures – for instance, median is a measure of center more resistant to the effects of outliers than mean
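A tiny numeric illustration of that resistance, with made-up values purely for demonstration:
x    = c(10, 11, 12, 13, 14)   # a well-behaved sample
xOut = c(10, 11, 12, 13, 140)  # the same sample with one extreme outlier
mean(x); mean(xOut)            # the mean jumps from 12 to 37.2
median(x); median(xOut)        # the median stays at 12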
Leading digits on the left
All points starting with those digits are binned
The remaining digits for each point are listed in a row on the right
stem(mtcars$mpg)
## 
##   The decimal point is at the |
## 
##   10 | 44
##   12 | 3
##   14 | 3702258
##   16 | 438
##   18 | 17227
##   20 | 00445
##   22 | 88
##   24 | 4
##   26 | 03
##   28 | 
##   30 | 44
##   32 | 49
library(ggplot2)
library(dplyr,quietly=TRUE, warn.conflicts=FALSE)
library(reshape2) # needed for melt()
eduURL = "http://datasets.flowingdata.com/education.csv"
sat = filter(melt(filter(read.csv(eduURL),
state != "United States",
state != "District of Columbia")),
variable %in% c("reading","writing","math"))
colnames(sat) <- c("State","Test", "Score")
ggplot(sat, aes(x=Test, y=Score, fill=Test)) +
geom_violin(color=NA, alpha=0.5) +
xlab("") + ylab("SAT Test Scores") +
ggtitle("Distribution of SAT Scores Across US States") +
scale_fill_brewer(palette="Set1") +
theme_bw() +
theme(text=element_text(size=18,family="Times"),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank())