An Introduction to Visualization in R and Gephi

Chris Bail
Computational Sociology

Beautiful aRt

 

In this class, I will teach you basic visualization techniques using R and Gephi.

But first, let's look at some beautiful aRt as motivation…

ggplot2

Many of these plots were produced using Hadley Wickham's legendary “ggplot2” package.

Let's install ggplot2:

install.packages("ggplot2")

Many R packages come with built-in datasets:

library(ggplot2)
data(diamonds)
head(diamonds, 3)
  carat     cut color clarity depth table price    x    y    z
1  0.23   Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21 Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23    Good     E     VS1  56.9    65   327 4.05 4.07 2.31

Anatomy of a basic ggplot

[Figure: anatomy of a basic ggplot]

Scatterplots

Let's try a basic scatterplot in ggplot2:

ggplot(diamonds, aes(x=carat, y=price)) + geom_point()

diamonds is the dataset we want to plot.

aes() maps variables to the x and y coordinates we want to plot. Note that we did not need to use the $ operator to specify the variable names.

geom_point() describes the type of plot. The + adds it to the plot as a “layer.”
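
To make these pieces concrete, here is a small sketch that builds the same plot in two steps and counts the layers along the way. The object names base_plot and scatter are my own, chosen only for illustration:

```r
library(ggplot2)

# Step 1: the data and the aesthetic mapping; no layer has been added yet
base_plot <- ggplot(diamonds, aes(x = carat, y = price))

# Step 2: adding a geom with + returns a new plot object with one layer
scatter <- base_plot + geom_point()

length(base_plot$layers)  # number of layers before adding the geom
length(scatter$layers)    # number of layers after adding the geom

# Printing the object is what actually draws the plot
print(scatter)
```

Storing the plot in an object like this is handy later on, because you can keep adding layers to it or pass it to a function that saves it to disk.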

We can add a factor variable to our “aesthetic,” here by mapping clarity to color:

ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_point()

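We can add even more information by mapping another variable to the size of the points. Because cut is a discrete variable, ggplot2 will warn that using size for a discrete variable is not advised, but it still draws the plot: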

ggplot(diamonds, aes(x=carat, y=price, color=clarity, size=cut)) + geom_point()

We can also add trend lines by adding a geom_smooth layer to our plot:

ggplot(diamonds, aes(x=carat, y=price)) + geom_point() + geom_smooth()

We can also add options to any layer. For example, let's tell R to use linear regression to draw the trend line:

ggplot(diamonds, aes(x=carat, y=price)) + geom_point() + geom_smooth(method="lm")

We can also work backwards and remove layers. For example, this plot drops the points and keeps only the trend lines, one for each level of clarity:

ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_smooth()

Facet Wraps and Other Types of Plots

When we want to put multiple plots next to each other, we use “facet wraps”:

ggplot(diamonds, aes(x=carat, y=price)) + geom_point() + facet_wrap(~ cut)

The ~ in R generally introduces a formula, in which one object or variable is “explained by” another. Here, ~ cut tells facet_wrap to draw one panel for each value of cut.
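
If the ~ still feels mysterious, it may help to see that a formula is just an ordinary R object. The sketch below uses the built-in mtcars data, and the variable names (f, fit, one_sided) are only illustrative:

```r
# A formula can be stored and inspected like any other object
f <- mpg ~ hp
class(f)     # "formula"
all.vars(f)  # "mpg" "hp"

# Modelling functions read it as "mpg explained by hp"
fit <- lm(f, data = mtcars)
coef(fit)

# A one-sided formula such as ~ cyl has no left-hand side,
# which is the form that facet_wrap() expects
one_sided <- ~ cyl
length(one_sided)  # 2: the ~ itself and the right-hand side
all.vars(one_sided)
```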

And of course, we could bring color and trend lines back in as follows:

ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_point() + geom_smooth() + facet_wrap(~ cut)

Thus far, we have only worked with scatterplots, but ggplot2 can produce many other great graphs as well. For example, let's try out a line graph:

ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_line()

Now, let's try a histogram:

ggplot(diamonds, aes(x=price)) + geom_histogram()

Once again, each layer has options, such as binwidth:

ggplot(diamonds, aes(x=price)) + geom_histogram(binwidth=100)

Boxplots are a useful way to compare distributions across groups at a glance because they show the median, the interquartile range, and any outliers very vividly:

ggplot(diamonds, aes(x=color, y=price)) + geom_boxplot()
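
To see the numbers behind each box, here is a rough sketch using base R's boxplot.stats(). This is an approximation of what the plot shows rather than ggplot2's own calculation: base R reports “hinges,” which are close to, but not always identical to, the quartiles. The name box_stats and the row labels are my own:

```r
library(ggplot2)  # provides the diamonds data

# Five numbers for each color: lower whisker, lower hinge, median,
# upper hinge, and upper whisker
box_stats <- sapply(split(diamonds$price, diamonds$color),
                    function(p) boxplot.stats(p)$stats)
rownames(box_stats) <- c("lower_whisker", "lower_hinge", "median",
                         "upper_hinge", "upper_whisker")
box_stats
```

Each column of the result corresponds to one box in the plot, so you can read the medians straight off the third row.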

We can also create “violin plots,” which combine boxplots with rotated kernel density plots:

ggplot(diamonds, aes(x=color, y=price)) + geom_violin()

Now You Try It:

1) Load the mtcars data.

2) Plot the relationship between the mpg and hp variables as a scatterplot, with facets for the gear variable.

Other Visualization Packages

ggplot2 is the most popular visualization package in R at present, but new ones become available all the time. One of my favorites is “tabplot”:

install.packages("tabplot")

This command lets you visualize your entire dataset at once:

library(tabplot)
tableplot(diamonds, fig.height=5)

Another popular visualization tool is the heatmap. We can make heatmaps in ggplot2, but we can also make them in base R.

Let's look at an example that comes from Nathan Yau's “FlowingData” website. First, we read in some data about NBA players:

nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv", sep=",")

Next, let's sort the data by points, use the player names as row names, and then drop the first column (the names) because we no longer need it:

nba <- nba[order(nba$PTS),]
row.names(nba) <- nba$Name
nba <- nba[,2:20]

The heatmap command requires the data to be in matrix format. Once we convert it, we can easily run the command as follows:

nba_matrix <- data.matrix(nba)
nba_heatmap <- heatmap(nba_matrix, Rowv=NA, Colv=NA, col = cm.colors(256), scale="column", margins=c(5,10))
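
The scale="column" option matters because the columns of nba_matrix are measured in very different units. As far as I understand it, heatmap() centers each column on its mean and divides by its standard deviation before choosing colors, which is the same arithmetic as base R's scale(). Here is a sketch with a made-up matrix (the players and numbers are hypothetical):

```r
# Hypothetical toy data: three players, two statistics on different scales
toy <- matrix(c(30, 25, 20,   # points per game
                 2,  8,  5),  # assists per game
              ncol = 2,
              dimnames = list(c("A", "B", "C"), c("PTS", "AST")))

# Subtract each column's mean and divide by its standard deviation
scaled <- scale(toy)
scaled

round(colMeans(scaled), 10)  # every column is now centered on zero
apply(scaled, 2, sd)         # and has a standard deviation of one
```

After scaling, a dark cell means “high relative to the other players on this statistic,” not “a large raw number.”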

Exporting Figures from R

If you are working with a ggplot object:

 

myplot<-ggplot(diamonds, aes(x=color, y=price)) + geom_violin()
# width and height are in inches by default
ggsave(filename="diamonds_violin.png", plot=myplot, width=5, height=5, dpi=300)

Most journals want you to use at least 300 dpi.
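
To see what 300 dpi means in practice, multiply the size in inches by the dots per inch. A quick check of the arithmetic for a 5 by 5 inch figure (the variable names are just for illustration):

```r
width_in  <- 5    # figure width in inches
height_in <- 5    # figure height in inches
dpi       <- 300  # dots (pixels) per inch

# Pixel dimensions of the saved image
pixels <- c(width_px = width_in * dpi, height_px = height_in * dpi)
pixels  # 1500 by 1500 pixels
```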

If you are working with another package...

 

png("myplot.png", width=4, height=4, units="in", res=300)
nba_heatmap <- heatmap(nba_matrix, Rowv=NA, Colv=NA, col = cm.colors(256), scale="column", margins=c(5,10))
dev.off()

If you get funky errors on a Mac, you may need to install XQuartz.

NETWORK DATA

Let's Run some Network Analysis on the Twitter Data we Scraped

 

load("Twitter Network Data.Rdata")

Fix Character Encoding

 

First, we need to clean up the character encoding, because there are some non-Roman characters in the Twitter handles we scraped. The calls below drop any character that cannot be represented in ASCII:

twitter_network_data$Source<-iconv(twitter_network_data$Source, "latin1", "ASCII", sub="")
twitter_network_data$Target<-iconv(twitter_network_data$Target, "latin1", "ASCII", sub="")
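
If you want to see what iconv() does before running it on the real data, try it on a couple of made-up handles. These handles are hypothetical, and the exact result can depend on your platform's iconv library, so treat this as a sketch:

```r
# Hypothetical handles: the first contains an accented character
handles <- c("caf\u00e9_fan", "plain_handle")

# Same call as above: anything that is not ASCII is replaced with sub
cleaned <- iconv(handles, "latin1", "ASCII", sub = "")
cleaned

# Check that only ASCII characters remain
all(utf8ToInt(paste(cleaned, collapse = "")) < 128)
```

The second handle should come back untouched, while the first should lose its accented character.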

The iGraph Package

 

install.packages("igraph")

Convert Data Frame to "iGraph" Object

 

library(igraph) 
# graph_from_data_frame() is the current name for graph.data.frame()
twitter_igraph <- graph_from_data_frame(twitter_network_data, directed=FALSE)

Note: igraph automatically generates a list of vertices from our edge list. igraph can also easily import network files from other software such as UCINET and Pajek.
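
Here is a minimal sketch of that conversion on a made-up edge list, so you can see the vertices being inferred. The data frame toy_edges is hypothetical, but it uses the same column names as our Twitter data:

```r
library(igraph)

# Hypothetical edge list: each row is one tie between two accounts
toy_edges <- data.frame(Source = c("a", "b", "a", "c"),
                        Target = c("b", "c", "c", "d"))

toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

V(toy_graph)$name  # vertices inferred from the edge list: "a" "b" "c" "d"
vcount(toy_graph)  # 4 vertices
ecount(toy_graph)  # 4 edges
```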

A Very Brief Tangent on Network Stats

 

Calculating network stats is extremely easy using igraph:

twitter_betweenness<-betweenness(twitter_igraph)
twitter_closeness<-closeness(twitter_igraph)
twitter_clustering_coefficient<-transitivity(twitter_igraph)

…and there are many, many more. Working with two-mode, weighted, and dynamic network data in R is also very easy because of its sophisticated data manipulation tools as well as a number of different packages such as sna, tnet, and SoNIA.
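
These statistics are easier to interpret on a network small enough to check by hand. The sketch below uses a made-up graph (a triangle a-b-c with one extra node d hanging off c); the names toy_edges and toy_graph are hypothetical:

```r
library(igraph)

# Hypothetical network: a triangle (a, b, c) plus d attached only to c
toy_edges <- data.frame(Source = c("a", "b", "a", "c"),
                        Target = c("b", "c", "c", "d"))
toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

degree(toy_graph)        # a = 2, b = 2, c = 3, d = 1
betweenness(toy_graph)   # only c sits on shortest paths between others: c = 2
closeness(toy_graph)     # c is closest to everyone else, d is farthest
transitivity(toy_graph)  # 3 * 1 triangle / 5 connected triples = 0.6
```

Node c scores highest on every measure because every path to d has to pass through it.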

To Plot the Network

 

plot(twitter_igraph)

In case you get an Error Message

 

If plot() stops with the error “figure margins too large”, shrink the margins and plot again:

par(mar = rep(2, 4))

Pruning the Network

 

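# drop every vertex with fewer than 20 ties (delete_vertices() is the current name for delete.vertices())
only_cool_kids<-delete_vertices(twitter_igraph,which(degree(twitter_igraph)<20))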
plot(only_cool_kids)
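
The same pruning idea on a made-up graph, with a threshold of 2 instead of 20 so the result is easy to check. The names toy_graph, keep_threshold, and pruned are hypothetical:

```r
library(igraph)

# Hypothetical network: a triangle (a, b, c) plus d attached only to c
toy_edges <- data.frame(Source = c("a", "b", "a", "c"),
                        Target = c("b", "c", "c", "d"))
toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

# Remove every vertex whose degree is below the threshold
keep_threshold <- 2
pruned <- delete_vertices(toy_graph,
                          which(degree(toy_graph) < keep_threshold))

V(pruned)$name  # "a" "b" "c": d had only one tie, so it is gone
degree(pruned)  # degrees are recomputed on the smaller graph
```

Note that degree is measured before anything is deleted, so some of the vertices that survive may end up with fewer ties than the threshold once their neighbors are removed.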

Remove Self References

 

# simplify() drops self-loops and merges duplicate edges
only_cool_kids<-simplify(only_cool_kids)
plot(only_cool_kids)
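
To check what simplify() removes, here is a sketch on a made-up graph that contains one self-reference and one duplicated tie. The names messy_edges, messy, and clean are hypothetical:

```r
library(igraph)

# Hypothetical edge list: a refers to itself once, and a-b appears twice
messy_edges <- data.frame(Source = c("a", "a", "a", "b"),
                          Target = c("a", "b", "b", "c"))
messy <- graph_from_data_frame(messy_edges, directed = FALSE)
ecount(messy)  # 4 edges, including the loop and the duplicate

# By default simplify() removes loops and collapses multiple edges
clean <- simplify(messy)
ecount(clean)               # 2 edges remain: a-b and b-c
any(which_loop(clean))      # FALSE
any(which_multiple(clean))  # FALSE
```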

Change the Layout (Arrangement)

 

# layout_as_tree is the current name for the Reingold-Tilford layout (layout.reingold.tilford)
plot(only_cool_kids, layout=layout_as_tree)

Circle Layout

 

# layout_in_circle is the current name for layout.circle
plot(only_cool_kids, layout=layout_in_circle)

Fine-tuning network graphics

 

As with ggplot2, it is possible to do quite a bit of customization with graphics in igraph (e.g., changing colors, sizes, and label fonts). For a great tutorial on this, check out Katya Ognyanova's RPubs page: https://rpubs.com/kateto/netviz.

Though I prefer R for creating and analyzing network data, I prefer Gephi for network visualization, and particularly for dynamic network visualization.

Gephi

Let's play with some Gephi data

 

1) Open Gephi
2) Select “Open Graph File”
3) Choose this file from our course dropbox:

Gephi Academic Networks.gexf

Gephi is Great for Dynamic Network Data!

 

Close Gephi, and open this file in our course dropbox:

Hypertext Conference.gexf

This dataset was collected during the ACM Hypertext 2009 conference, where the SocioPatterns project deployed the Live Social Semantics application. Conference attendees volunteered to wear radio badges that monitored their face-to-face proximity. The dataset published here represents the dynamical network of face-to-face proximity of ~110 conference attendees over about 2.5 days.

Importing Data into Gephi

 

One problem with Gephi is that importing data is not always very straightforward, though it is perhaps easier than in other network software such as Pajek.

There is a native Gephi file format called .gexf that is fairly stable and quite usable, but converting R data into .gexf takes a bit of practice and some rather complicated code. I prefer to move the data to .csv and then import it into Gephi, but if you'd like to try writing native Gephi files, check out the rgexf package.

Preparing .csv files for Gephi

 

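# make edges (note: columns must be titled "Source" and "Target")
# write.csv() returns nothing, so there is no need to assign its result
write.csv(twitter_network_data, file="edges.csv", row.names=FALSE)
# make nodes: every unique handle that appears as a source or a target
nodes<-c(twitter_network_data$Source, twitter_network_data$Target)
nodes<-as.data.frame(nodes)
nodes<-unique(nodes)
# Gephi also requires Ids and labels that are the same as the node names
nodes$Id<-nodes$nodes
nodes$Label<-nodes$nodes
# row.names=FALSE keeps R's row numbers out of the file
write.csv(nodes, file="nodes.csv", row.names=FALSE)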
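
Before opening Gephi, it can save time to read the files back in and check that the headers and node Ids line up. The sketch below does this on a made-up edge list written to a temporary folder; all of the toy_ names and file names are hypothetical:

```r
# Hypothetical edge list with the column names Gephi expects
toy_edges <- data.frame(Source = c("a", "b", "a"),
                        Target = c("b", "c", "c"),
                        stringsAsFactors = FALSE)

# One row per unique node, with matching Id and Label columns
toy_nodes <- data.frame(Id = unique(c(toy_edges$Source, toy_edges$Target)),
                        stringsAsFactors = FALSE)
toy_nodes$Label <- toy_nodes$Id

# Write both files to a temporary folder, without R's row numbers
out_dir    <- tempdir()
edges_path <- file.path(out_dir, "toy_edges.csv")
nodes_path <- file.path(out_dir, "toy_nodes.csv")
write.csv(toy_edges, edges_path, row.names = FALSE)
write.csv(toy_nodes, nodes_path, row.names = FALSE)

# Read them back and check the headers
edges_check <- read.csv(edges_path, stringsAsFactors = FALSE)
nodes_check <- read.csv(nodes_path, stringsAsFactors = FALSE)
names(edges_check)  # "Source" "Target"
names(nodes_check)  # "Id" "Label"

# Every endpoint in the edge file should appear in the node file
all(c(edges_check$Source, edges_check$Target) %in% nodes_check$Id)
```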

Navigating Gephi

Importing .csv

Common Pitfalls with Gephi

 

If you import attributes for nodes or edges, make sure you select the correct variable type (e.g., string or byte).

It can be helpful to save your work as a .gexf file from within Gephi in order to avoid repeating the steps of importing .csv files, which can be rather time consuming.

Some of the plug-ins have buttons that are hidden in the data laboratory, and they are not always well documented.

Gephi is great for large graphs, but you may need to change the memory settings in Gephi's “initialization” file.

Questions on Visualization/Networks?

Next Week: Linear Models

One of the great strengths of R's open-source format is that it provides the most comprehensive suite of packages for statistical analysis currently available. While you might consider using R or Python to collect and process big data and then Stata or SAS to analyze it, it is important for you to know the basics of statistical analysis in R in order to perform some of the more advanced techniques that are not yet available in conventional software packages. In this class, we will therefore cover very basic statistics (e.g., cross-tabs, ANOVA, and linear regression).