Summer 2020

Outline

  1. Multivariate Comparisons

  2. Standard Statistical Dimension Reduction

  3. Clustering

  4. Searching for Outliers

  5. More Example Code!

Multivariate Comparisons

Heatmaps

  • A heat map is a 2D matrix of data, where values are displayed as colored shapes in the matrix

  • Typically this is a rectangular grid, and:
    • Rows correspond to observations, columns correspond to variables
    • Each cell is colored according to the variable value for that observation
  • Sometimes laid out in more complex tiling (e.g., hexagonal)

  • Sometimes used in conjunction with methods to visualize hierarchical data

  • Difficult to draw specific inferences about data with a heat map, but can be very useful for spotting differences

When is a Heatmap Not a Heatmap?

  • A heatmap is really not a map at all, but rather a visual representation of a table where numeric data in cells are encoded as a color

  • But we often use the word “heatmap” whenever we have data, a map, and colors

  • Some things that are called “heatmap” but are not include:
    • Kernel density plots
    • Hot spot maps
    • Choropleths

http://cartonerd.blogspot.com/2015/02/when-is-heat-map-not-heat-map.html

Heatmap Toy Example

  • Example: Students in a class vs. their grades on quizzes

  • Suppose we want an overall impression of student performance on the quizzes

  • Consider:

Student            Quiz-1  Quiz-2  Quiz-3
Oscar OneGuy           73      72      80
Sue Secondly           92      85      91
Theo Thirdman          80      74      92
Finn Fourthperson      77      69     100

Heatmap Toy Example, Visualized

Heatmap Toy Example, Source

library(ggplot2)
library(reshape2)

mydata = data.frame(Student=c("Oscar OneGuy","Sue Secondly","Theo Thirdman","Finn Fourthperson"),
                    Quiz1=c(73, 92, 80, 77),
                    Quiz2=c(72, 85, 74, 69),
                    Quiz3=c(80, 91, 93, 100))
mylongdata = melt(mydata)

ggplot(mylongdata, aes(variable, Student)) + 
  geom_tile(aes(fill=value),color="white") +
  scale_fill_gradient(low="white", high="black") +
  xlab("") + ylab("Student") + ggtitle("Grades on Student Quizes") +
  theme(text=element_text(size=18, family="Times"), axis.ticks = element_blank())

Heatmap Toy Example, Discussed

  • From the visualization, we can get a sense that:
    • Sue performed the best overall
    • Quiz 3 was the easiest quiz
    • Quiz 2 was the hardest quiz
  • These are differences we are spotting … the specific color is less important than dramatic changes in colors

  • This was easy to see directly from the numbers; however, it would have been harder with more students and more quizzes

Heatmap Real Example, Visualized

Heatmap Real Example, Source p.1

library(ggplot2)
library(reshape2)
library(plyr,quietly=TRUE, warn.conflicts=FALSE)
library(scales)

# Load the data and order by total number of points scored
bballData <- read.csv("http://datasets.flowingdata.com/ppg2008.csv",header=TRUE)
bballData$Name <- with(bballData, reorder(Name, PTS))

# Reformat the data and scale each variable in preparation
bball <- melt(bballData)
bball <- ddply(bball, .(variable), transform, rescaledValue = scale(value))

Heatmap Real Example, Source p.2

ggplot(bball, aes(variable, Name)) +
  # Use geom_tile to draw the cells
  geom_tile(aes(fill=rescaledValue),color="white") +
  # Scale the colors from a light blue to a dark blue
  scale_fill_gradient(low="#F7FBFF", high="#08306B", guide=FALSE) +
  # Label the axes
  xlab("NBA Players") + ylab("Statistic") + ggtitle("NBA Stats by Player, 2008-2009 Season") +
  # Size and adjust the text to be more readable
  theme(text=element_text(size=12, family="Times"),
        axis.ticks = element_blank(),
        axis.text.y = element_text(size=9,  vjust=0.25, color="black"),
        axis.text.x = element_text(angle=90, vjust=0.25, hjust = 1, color="black"))

Star Charts

  • A star chart is a way to visualize a numeric data point with multiple dimensions by wrapping the dimensions into sectors of a polar plot
  • Each variable is mapped to a separate spoke on the star, and the length encodes the relative value

Nathan Yau’s Crime Dataset:

Star Chart Real Example, Visualized

Star Chart Real Example, Source

crimeURL = "http://datasets.flowingdata.com/crimeRatesByState2005.csv"
crime = read.csv(crimeURL)
stars(crime[,2:7],
      labels=as.character(crime[,1]),
      flip.labels=FALSE,
      key.loc=c(15,11.0),
      ncol=13,
      cex=1.25)

Star Chart Real Example, Discussed

  • Decoding direct numerical results from a star plot is not really possible

  • Instead, we look for points that stand out, that are different

  • Can be useful with interactive visualizations since:
    • Raw values can “pop-up”
    • Comparisons and brushing are possible

Parallel Coordinates Plot

  • Many multi-variate, numerical points can be visualized in a single plot using a parallel coordinate plot
    • Each variable is scaled between its minimum and maximum values
    • Each variable is assigned a number axis
    • These axes are placed in parallel to one another
    • The value for each variable of each point is placed along its respective axis
    • A piecewise line is drawn across these axes and through these points, where each line represents a data point
  • Such plots rarely provide a detailed understanding of the data, but can sometimes show surprising things

Parallel Coordinates Real Example, Visualized

Parallel Coordinates Real Example, Source p.1

library(lattice) # Needed for parallelplot()

# Get the data
eduURL = "http://datasets.flowingdata.com/education.csv"
sat = read.csv(eduURL)

# Make a vector of colors, where the upper half of the SAT
# total distribution is green and the lower is blue
satTotal = sat$reading + sat$writing + sat$math
colorIndices = (satTotal > median(satTotal)) + 1
satColors = c("dodgerblue2","darkolivegreen4")[colorIndices]

# Plot the data
parallelplot(sat[,2:7], horizontal.axis=FALSE, col=satColors,lwd=1.5,
             main="Average State SAT and Graduation Statistics")

Parallel Coordinates Real Example, Discussed

This plot reveals a number of interesting things:

  • Reading, writing, and math scores are extremely well correlated within state averages

  • State average total SAT score is inversely correlated with the percentage of graduates who took the SAT
    • i.e., higher scores may be the result of a systematic sampling bias
  • There appears to be very little relationship between SAT scores and teacher-student ratio or dropout rate

Standard Statistical Dimension Reduction

Dimension Reduction

There are basically two ways to perform dimensionality reduction:

  1. Calculate new variables that somehow combine the old variables, and use these new variables instead (feature extraction)
  2. Consider only a subset of variables that are somehow “important” (feature selection)

There are many such methods in statistics and machine learning, and we’ll only hit on a few traditional techniques

Stepwise Regression

  • One way to determine what variables are important in a regression model is to step through changes in the model until it is no longer improved
  • Constructively adding variables to the model is called forward stepwise regression
  • Destructively removing variables from the model is called backward stepwise regression
  • Some methods combine these two to try to find something in the middle
  • Stepwise regression is typically very greedy and not necessarily optimal
  • It attempts to minimize the Akaike information criterion (AIC), which assesses how out-of-kilter the model is relative to other models
  • Sometimes this can be a way to determine when variables are not useful

Using stepAIC, Overview

  • In R, build a model with lm, glm, or loess
  • Then use the stepAIC function from the MASS library to find the best model
  • The anova results will be directly available to you

Using stepAIC, Source

library(MASS)

# Get a linear model with Fertility as the response variable and *all* 
# other variables included as explanatory variables
fullLMmodel <- glm(Fertility ~., data = swiss)

# Perform stepwise regression to get a *new* model, search in both
# directions (fwd & bkwd), and I don't need to see the trace
stepLMModel <- stepAIC(fullLMmodel, direction = "both", trace = FALSE)

# Show me the resulting model and ANOVA results:
print(stepLMModel$anova)

Using stepAIC, Results

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## Fertility ~ Agriculture + Examination + Education + Catholic + 
##     Infant.Mortality
## 
## Final Model:
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
## 
## 
##            Step Df Deviance Resid. Df Resid. Dev      AIC
## 1                                  41   2105.043 326.0716
## 2 - Examination  1 53.02656        42   2158.069 325.2408

The stepwise regression decided that the Examination variable was not needed for the prediction

Plotting Distances

  • A scatter plot of high-dimensional, numeric data is difficult

  • But one way to visualize it is to first reduce this dimensionality somehow

  • We can use the distances between points in the \(d\)-dimensional space to accomplish this:
    • Compute all the pairwise distances between points
    • Produce a new set of 2D points that are (approximately) the same relative distances from each other in 2D as the original points were in \(d\) dimensions
    • In R, we use the dist() and cmdscale() functions to do this (a minimal sketch follows this list)
    • This technique is called multidimensional scaling
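
A minimal sketch of that workflow, using the built-in USArrests dataset purely for illustration (the full, styled example follows on the next slides):

# Minimal MDS sketch: USArrests stands in for any numeric d-dimensional data
d = dist(USArrests)           # all pairwise (Euclidean) distances
xy = cmdscale(d, k=2)         # classical MDS down to 2 dimensions
plot(xy[,1], xy[,2], type="n", xlab="MDS 1", ylab="MDS 2")
text(xy[,1], xy[,2], labels=rownames(USArrests), cex=0.6)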

Multidimensional Scaling Real Example, Visualized

Multidimensional Scaling Real Example, Source p.1

library(ggplot2)
library(dplyr,quietly=TRUE, warn.conflicts=FALSE)

# Load the data, filter out stuff we have no abbreviations for
eduURL = "http://datasets.flowingdata.com/education.csv"
sat = filter(read.csv(eduURL),
               state != "United States",
               state != "District of Columbia")

# Compute the MDS for just the numeric variables
satDist = dist(sat[,2:7])
satMDS = cmdscale(satDist)

# Compute the percentiles for some statistic
highlightStat = sat$reading  # <- Change the statistic, if you want
mu = mean(highlightStat)
sigma = sqrt(var(highlightStat))
percentiles = pnorm(highlightStat, mean=mu, sd=sigma)

Multidimensional Scaling Real Example, Source p.2

# Get a new data set in the transformed coordinates
reducedSAT = data.frame(x=satMDS[,1],
                        y=satMDS[,2],
                        abb=state.abb,
                        highlight=(percentiles > 0.70))

ggplot(reducedSAT,aes(x,y)) +
  geom_text(aes(label=abb,color=highlight,size=highlight)) +
  scale_color_manual(values=c("gray49","firebrick")) +
  scale_size_manual(values=c(3,4)) +
  geom_text(x=55,y=15,label="Upper 30% in Reading",size=4,color="firebrick") +
  ggtitle("MDS Average State SAT and Graduation Statistics") +
  theme(text=element_text(size=18,family="Times"),legend.position="None")

Principal Component Analysis

  • One common statistical technique to reduce dimensions is to perform some kind of decomposition of the space

  • The simplest of these is called principal component analysis (PCA)
    • Imagine we have a matrix \(X\) containing \(m\) rows (observations) and \(d\) columns (variables)
    • Collect all the pairwise covariances of the variables, which gives us a \(d \times d\) matrix
    • Gather the eigenvectors and eigenvalues of this covariance matrix
    • Order and scale the eigenvectors by the square roots of their eigenvalues
    • Each scaled eigenvector is referred to as a principal component
  • The principal components are the orthogonal directions that explain the variance in the data, in decreasing order

Visualizing Principal Components

PCA, Continued

  • We can get the principle components in R the hard way:
# Say X is our data
eg = eigen(cov(X))
pc1 = eg$vectors[,1] * sqrt(eg$values[1])
pc2 = eg$vectors[,2] * sqrt(eg$values[2])
# ...
  • Or we can get them the easy way:
# Say X is our data
prcomp(X)
  • Principal components give us a way to reduce dimensions:
    • Rotate and scale the space to align the dimensions along the first two (say) principal components
    • Then plot a 2D scatterplot of the transformed data and look for outliers

PCA Crime Example, Visualized p.1

PCA Crime Example, Visualized p.2

PCA Crime Example, Source p.1

library(ggplot2)
library(dplyr,quietly=TRUE, warn.conflicts=FALSE)

# Get the data, but exclude those we don't have abbreviations for
crimeURL = "http://datasets.flowingdata.com/crimeRatesByState2005.csv"
crime = filter(read.csv(crimeURL),
               state != "United States",
               state != "District of Columbia")

# Deal with the data as a matrix
X = as.matrix(crime[,2:7])
pc = prcomp(X) # the PCA
M = cbind(pc$rotation[,1],pc$rotation[,2])
Y = X %*% M  # the rotation

# Put the first two PCs and the state abbreviations into a new data frame
crimeReduced = data.frame(state=crime$state,
                          pc1=Y[,1],
                          pc2=Y[,2],
                          abb=state.abb)

PCA Crime Example, Source p.2

# Find the 10 states that are the highest on the first PC axis
a = order(crimeReduced$pc1,decreasing=TRUE)[1:10]

# Find the 10 states that are the highest on the second PC axis
b = order(crimeReduced$pc2,decreasing=TRUE)[1:10]

# Highlight the states that are at the intersection of those
# (that is:  Find the states on the Pareto surface of the first
# two PCs)
crimeReduced = mutate(crimeReduced,
                      highCrime=(1:50 %in% intersect(a,b)))

# Plot it!
ggplot(crimeReduced,aes(pc1,pc2)) + 
  geom_text(aes(label=abb,color=highCrime,size=highCrime)) +
  scale_color_manual(values=c("darkgray","firebrick")) +
  scale_size_manual(values=c(4,7)) +
  ylim(c(-750,750)) + 
  xlab("First Principle Component") + ylab("Second Principle Component") +
  theme(text=element_text(size=18,family="Times"),legend.position="None")

Clustering

Clustering Overview

  • Another way to reduce the scope of large data is to use a clustering method
  • The most common is “Lloyd’s Algorithm” – often incorrectly called “k-means clustering”
  • Another common method is “Expectation Maximization” (EM)
  • These find an assignment of points to \(k\) clusters, and centers for those clusters, that minimizes the sum of squared distances from each point to the center it is assigned to
  • Clustering is a way to reduce data by considering smaller regions of space where points are similar
  • It induces a special kind of structure in the space by segmenting the space into Voronoi regions (a minimal sketch using base R’s kmeans() follows this list)
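
As a point of reference before the mclust example below, here is a minimal sketch of Lloyd’s-style clustering with base R’s kmeans(); the choice of \(k = 3\) and the use of the iris measurements are assumptions made only for illustration:

# Minimal k-means sketch; k = 3 is chosen only for illustration
set.seed(42)                          # the algorithm starts from random centers
X = scale(iris[,1:4])                 # numeric variables, rescaled
km = kmeans(X, centers=3, nstart=25, iter.max=100, algorithm="Lloyd")

km$size                               # how many points landed in each cluster
km$tot.withinss                       # total within-cluster sum of squared distances

# Quick look at the assignments in the first two (scaled) dimensions
plot(X[,1], X[,2], col=km$cluster, pch=19,
     xlab="Sepal.Length (scaled)", ylab="Sepal.Width (scaled)")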

Using mclust, Overview

  • There are a lot of packages for clustering in R, but we’ll just look at mclust

  • mclust works like other model-fitting methods, so we’ll use that one

  • mclust uses EM and is able to guess the best number of clusters, \(k\)

  • Bayesian Information Criterion (BIC) is used to assess the quality of the model:
    • BIC basically assesses how out-of-kilter this particular model is relative to other possible models
    • Lower BIC is better under the standard definition (note that mclust reports BIC with the opposite sign, so it selects the model with the largest reported BIC)

Using mclust, Source

library(mclust)

# Let's just look at numerical variables in the Iris dataset
# Also, let's rescale all the data so everything's in the same scale range
mydata <- scale(iris[,1:4])

# Build a cluster model, letting the algorithm choose k
clusterModel <- Mclust(mydata)

# Provide a summary of the model:
summary(clusterModel)

# Plot the cluster classifications directly:
plot(clusterModel, what="classification")

Using mclust, Summary

## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model
## with 2 components: 
## 
##  log-likelihood   n df       BIC       ICL
##       -322.6936 150 29 -790.6956 -790.6969
## 
## Clustering table:
##   1   2 
##  50 100

Using mclust, Classification Plot

Hierarchical Clustering, Overview

  • There are also a variety of methods to find clusters within clusters
  • The result of such searches is a kind of tree, or hierarchy
  • The visualization for this is typically called a dendrogram
  • Base R’s hclust (along with the R cluster package) performs this for us
  • Dendrograms give us a deeper sense of how the data are structured, which can be used to help focus on data we think is important

The USArrests Dataset

This data set contains statistics, in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.

head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

Hierarchical Clustering, Code

library(cluster)

# Dissimilarity matrix
arrestDissimilarity <- dist(USArrests, method = "euclidean")

# Hierarchical clustering using Complete Linkage
hc1 <- hclust(arrestDissimilarity, method = "complete" )

# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1)

Hierarchical Clustering, Dendrogram

Searching for Outliers

Outlier Detection

  • Extreme values that stand out in a distribution are called outliers

  • There are many ways to look for an outlier:
    • Use standard plots that visually show extreme values
    • Look for the statistical outliers and highlight them
    • Use a specialized visualization for this purpose (e.g., boxplot)
  • We often want to spot outliers when we’re looking at the difference between two things:
    • Which thing is most different?
    • Which thing is unusually different?
    • Etc.

Computing Differences Directly

Computing Differences Directly, Source p.1

library(ggplot2)
library(dplyr,quietly=TRUE, warn.conflicts=FALSE)

# Load the data
crimeURL = "http://datasets.flowingdata.com/crimeRatesByState2005.csv"
crimeRaw = read.csv(crimeURL)

# Compute the differences in murder rates from the US overall rate
USMurders = crimeRaw[1,]$murder
StateDeltaMurders = crimeRaw[2:52,]$murder - USMurders

# Make a new data frame of just the states & DC difference from US
crimeDelta = data.frame(state = crimeRaw[2:52,]$state,
                        StateDeltaMurders,
                        highlight=(StateDeltaMurders>0))

Computing Differences Directly, Source p.2

leftStateLabels = rep("",51)
leftIdx = which(crimeDelta$highlight)
leftStateLabels[leftIdx] = as.character(crimeDelta$state[leftIdx])

rightStateLabels = rep("",51)
rightIdx = which(!crimeDelta$highlight)
rightStateLabels[rightIdx] = as.character(crimeDelta$state[rightIdx])

Computing Differences Directly, Source p.3

ggplot(crimeDelta,aes(reorder(state,StateDeltaMurders),StateDeltaMurders)) +
  geom_bar(stat="identity",aes(fill=highlight)) +
  geom_hline(yintercept=0,color="black",size=0.75) +
  geom_text(y=-0.5,hjust=1,label=leftStateLabels, size=3) +
  geom_text(y=0.5,hjust=0,label=rightStateLabels, size=3) +
  coord_flip() +
  scale_fill_manual(values=c("goldenrod3","firebrick")) +
  scale_y_continuous(breaks = seq(-5,30,by=5)) +
  xlab("") + ylab("2005 Difference in Murder Rate from US Average (murder/100K)") +
  theme(text=element_text(size=18,family="Times"),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        panel.grid.major.y = element_blank(),
        legend.position="None")

Highlighting Statistical Outliers

Highlighting Statistical Outliers, Source p.1

library(ggplot2)
library(dplyr,quietly=TRUE, warn.conflicts=FALSE)

# Load data, but exclude US totals and DC
crimeURL = "http://datasets.flowingdata.com/crimeRatesByState2005.csv"
crime = filter(read.csv(crimeURL),
               state != "United States",
               state != "District of Columbia")

# Setup the bounds for the outliers based on inter-quartile range
focusStat = crime$forcible_rape
quartiles = summary(focusStat)[c(2,5)]
IQR = (quartiles[2] - quartiles[1])
ub = 1.58*IQR + median(focusStat)
lb = median(focusStat) - 1.58*IQR

# Add a column to the data to highlight outliers
outlier = (focusStat < lb) | (focusStat > ub)
crime = mutate(crime, highlight=outlier)

Highlighting Statistical Outliers, Source p.2

ggplot(crime,aes(reorder(state,focusStat), focusStat)) +
  geom_bar(stat="identity",aes(fill=highlight)) +
  geom_hline(yintercept=seq(0,80,by=10), color="white", size=0.75) +
  scale_y_continuous(breaks = seq(0,80,by=10)) +
  coord_flip() +
  scale_fill_manual(values=c("gray49","firebrick")) +
  xlab("") + ylab("Rape Crimes Reported per 100K, 2005") +
  theme_bw() +
  theme(text=element_text(size=18,family="Times"),
        panel.grid.major.y = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank(),
        axis.text.y = element_text(size=8),
        legend.position="None")

Boxplots Real Example, Visualized

Boxplots Real Example, Source p.1

library(ggplot2)
library(dplyr,quietly=TRUE, warn.conflicts=FALSE)

# Load the SAT data, but leave out the US totals and DC
eduURL = "http://datasets.flowingdata.com/education.csv"
sat = filter(read.csv(eduURL),
             state != "United States",
             state != "District of Columbia")

# Create the Outlier data for placing labels on the plot
outlierPositions = boxplot(sat$pupil_staff_ratio,plot=FALSE,range=1.3)$out
outlierStates = as.character(sat$state[which(sat$pupil_staff_ratio >= min(outlierPositions))])
outlierData = data.frame(outlierStates, outlierPositions)

Boxplots Real Example, Source p.2

ggplot(sat, aes(x=factor(1),y=pupil_staff_ratio)) +
  geom_boxplot(outlier.size=5,outlier.colour="firebrick", fill="pink") +
  geom_text(data=outlierData, x=1.05, aes(y=outlierPositions, label=outlierStates),
            hjust=0, size=3, color="firebrick") +
  xlab("") + ylab("Student-Teacher Ratio") +
  theme(text=element_text(size=18,family="Times"),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())

Small Multiples

  • Small, similarly sized visualizations, indexed by category and sequenced or ordered over a numeric variable not otherwise visualized

  • Can be implemented using lattice charts, trellis displays, panel displays, etc. (a minimal ggplot2 faceting sketch follows this list)

  • Reveals patterns and stand-out or anomalous data
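
A minimal sketch of the idea using ggplot2 faceting and its built-in mpg data; the variables faceted on here are assumptions chosen only for illustration:

library(ggplot2)

# One small scatterplot per vehicle class; every panel shares the
# same variables, scales, and axes
ggplot(mpg, aes(displ, hwy)) +
  geom_point(size=1) +
  facet_wrap(~class) +
  xlab("Engine Displacement (L)") + ylab("Highway MPG")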

Small Multiples

  • Pros:
    • Displays many variables with less risk of confusion and without overplotting
    • Learning to read one panel translates to the other panels
  • Take care to:
    • Place charts in logical order
    • Use same measures and scales
    • Keep charts simple

Better Know a Visualization: Small Multiples

Time-Series Calendar Small Multiple Heatmap!

Time-Series Calendar Small Multiple Heatmap, Source

# http://margintale.blogspot.in/2012/04/ggplot2-time-series-heatmaps.html
library(ggplot2)
library(plyr)
library(scales)
library(zoo)

df <- read.csv("https://raw.githubusercontent.com/selva86/datasets/master/yahoo.csv")
df$date <- as.Date(df$date)  # format date
df <- df[df$year >= 2012, ]  # filter reqd years

# Create Month Week
df$yearmonth <- as.yearmon(df$date)
df$yearmonthf <- factor(df$yearmonth)
df <- ddply(df,.(yearmonthf), transform, monthweek=1+week-min(week))  # compute week number of month
df <- df[, c("year", "yearmonthf", "monthf", "week", "monthweek", "weekdayf", "VIX.Close")]

# Plot
ggplot(df, aes(monthweek, weekdayf, fill = VIX.Close)) + 
  geom_tile(colour = "white") + 
  facet_grid(year~monthf) + 
  scale_fill_gradient(low="#e0f3db", high="#2ca25f") +
  labs(x="Week of Month",
       y="",
       title = "Time-Series Calendar Heatmap", 
       subtitle="Yahoo Closing Price", 
       fill="Close")

Visualizing Distributions (sort of)

We Already Covered This

  • We’ve covered most of this already in Week 3: Simple Effective Graphs

  • Almost all of the plots discussed in Few’s chapter 10 are discussed there, and a few more (pun intended)

  • But maybe we’ll cover a couple of other things here briefly

Describing Distributions

There are two primary ways to describe (sample) distributions:

  1. Summary statistics (numbers)

  2. Visualization

Key Visual Characteristics

Few mentions three key visual characteristics of distributions:

  1. A distribution’s spread – range, IQR, standard deviation, etc.

  2. A distribution’s center – mean, median, mode, etc.

  3. A distribution’s shape – skew, kurtosis, etc.

Really there are more, and I would include as key:

  4. A distribution’s modality – how many peaks it has
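
A short sketch of checking these characteristics in base R; the built-in faithful eruption times are used only because they are clearly bimodal, and since base R has no built-in skewness function, shape and modality are judged from a density plot:

# Spread, center, and a visual check of shape/modality for one variable
x = faithful$eruptions     # Old Faithful eruption durations (bimodal)

range(x)                   # spread: minimum and maximum
IQR(x)                     # spread: inter-quartile range
sd(x)                      # spread: standard deviation

mean(x)                    # center
median(x)                  # center

plot(density(x), main="Eruption Durations")  # shape and modality: two peaks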

Techniques and Best Practices

  • When comparing multiple distributions, keep scales and intervals consistent

  • Choose histogram intervals so that individual bin sampling error is not too large and the shape of the underlying distribution is relatively clear (a short sketch follows this list)

  • Outliers … next slide
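
As a sketch of the histogram-interval point above, here is the same data drawn with two different binwidths; the values 0.02 and 0.25 are arbitrary choices meant only to contrast a too-fine interval with a reasonable one:

library(ggplot2)

# Too-narrow bins: most bins hold only a handful of points, so the
# plot is dominated by bin-level sampling noise
ggplot(faithful, aes(eruptions)) + geom_histogram(binwidth=0.02)

# Wider bins: the two-peaked shape of the distribution is clear
ggplot(faithful, aes(eruptions)) + geom_histogram(binwidth=0.25)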

Outliers

  • Very often, when analyzing distributions, what we are really looking for is what is most unusual (outliers)

  • This is why some types of plots make seeing outliers explicit (e.g., boxplots and strip plots)

  • When there are extreme outliers, we often need to be more careful about other measures – for instance, the median is a measure of center that is more resistant to the effects of outliers than the mean (a quick demonstration follows)
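
For example, a quick demonstration of that resistance (the numbers are made up for illustration):

# One extreme value drags the mean but barely moves the median
x = c(10, 11, 12, 13, 14)
mean(x); median(x)     # both are 12

xOut = c(x, 200)       # add a single extreme outlier
mean(xOut)             # jumps to about 43.3
median(xOut)           # only moves to 12.5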

Stem And Leaf Charts

  • Leading digits are on the left

  • All points starting with those digits are binned

  • The remaining digits for each point are listed on the right in a row

Stem And Leaf Charts, Example

stem(mtcars$mpg)
## 
##   The decimal point is at the |
## 
##   10 | 44
##   12 | 3
##   14 | 3702258
##   16 | 438
##   18 | 17227
##   20 | 00445
##   22 | 88
##   24 | 4
##   26 | 03
##   28 | 
##   30 | 44
##   32 | 49

Violin Real Example, Visualized

Violin Real Example, Source

library(ggplot2)
library(dplyr,quietly=TRUE, warn.conflicts=FALSE)
library(reshape2)  # needed for melt() below

eduURL = "http://datasets.flowingdata.com/education.csv"
sat = filter(melt(filter(read.csv(eduURL),
                         state != "United States",
                         state != "District of Columbia")),
             variable %in% c("reading","writing","math"))
colnames(sat) <- c("State","Test", "Score")

ggplot(sat, aes(x=Test, y=Score, fill=Test)) +
  geom_violin(color=NA, alpha=0.5) +
  xlab("") + ylab("SAT Test Scores") +
  ggtitle("Distribution of SAT Scores Across US States") +
  scale_fill_brewer(palette="Set1") +
  theme_bw() +
  theme(text=element_text(size=18,family="Times"),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank())

More Example Code!

Two R Graph Galleries

  • There are a number of websites that provide great resources and examples
  • R Graph Gallery
    • Breaks down the types of plots to consider for different types of visualizations
    • Provides examples in R for all of these
  • Top 50 GGplot2 Visualizations
    • Lists 50 examples of different types of plots
    • Provides example code for all of these