Chris Bail
Computational Sociology
In this class, I will teach you basic visualization techniques using R and Gephi.
But first, let's look at some beautiful aRt as motivation…
Beautiful aRt
ggplot2
Many of these plots were produced using Hadley Wickham's legendary “ggplot2” package.
Let's install ggplot2:
install.packages("ggplot2")
Many R packages come with built-in datasets:
library(ggplot2)
data(diamonds)
head(diamonds[1:3,])
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
Let's try a basic scatterplot in ggplot2:
ggplot(diamonds, aes(x=carat, y=price)) + geom_point()
diamonds is the data set we want to plot.
aes() maps variables to the x and y coordinates we want to plot. Note that we did not need to use the $ operator to specify the variable names.
geom_point() describes the type of plot. The + indicates that this is a “layer.”
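Because each + adds a layer, a plot can also be stored in an object and built up step by step. The following is only a minimal sketch, assuming ggplot2 is installed; the object names base_plot and scatter are illustrative and not part of the original slides.

```r
library(ggplot2)

# Start with the data and the aesthetic mapping only (no layers yet)
base_plot <- ggplot(diamonds, aes(x = carat, y = price))

# Adding a layer with + returns a new plot object
scatter <- base_plot + geom_point()

# Each added layer is stored inside the plot object
length(base_plot$layers)
length(scatter$layers)

# Printing the object draws the plot
print(scatter)
```

Storing the plot this way makes it easy to try different layers on the same base without retyping the data and aesthetics.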
We can add a factor variable (here, clarity) to our “aesthetic” using the color argument:
ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_point()
We can add even more information by mapping another variable to the size of the points in the graph, as follows:
ggplot(diamonds, aes(x=carat, y=price, color=clarity, size=cut)) + geom_point()
We can also add trend lines by adding a geom_smooth() layer to our plot:
ggplot(diamonds, aes(x=carat, y=price)) + geom_point() + geom_smooth()
We can also add options to any layer. For example, let's tell R to use linear regression to draw the trend line:
ggplot(diamonds, aes(x=carat, y=price)) + geom_point() + geom_smooth(method="lm")
We can also work backwards and remove layers. For example, this plot removes the points from the graph:
ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_smooth()
When we want to put multiple plots next to each other, we use “facet wraps”:
ggplot(diamonds, aes(x=carat, y=price)) + geom_point() + facet_wrap(~ cut)
The ~ in R generally refers to one object or variable being “explained by” another.
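To make the ~ notation concrete, here is a small sketch using only base R and the built-in mtcars data. The object names f and fit are illustrative assumptions.

```r
# A formula is an ordinary R object: the outcome goes on the left of ~,
# the explanatory variable(s) on the right
f <- mpg ~ hp
class(f)
all.vars(f)

# The same notation tells lm() to model mpg as "explained by" hp
fit <- lm(f, data = mtcars)
coef(fit)

# A one-sided formula such as ~ gear has no outcome; facet_wrap() uses
# this form to name the variable that defines the panels
all.vars(~ gear)
```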
And of course, we could bring back in color, and trendlines as follows:
ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_point() + geom_smooth() + facet_wrap(~ cut)
Thus far, we have only worked with scatterplots, but ggplot2 can produce many other great graphs as well. For example, let's try out a line graph:
ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_line()
Now, let's try a histogram:
ggplot(diamonds, aes(x=price)) + geom_histogram()
Once again, each layer has options, such as binwidth:
ggplot(diamonds, aes(x=price)) + geom_histogram(binwidth=100)
Boxplots are a useful way to quickly compare a variable across groups because they show the median, the interquartile range, and any outliers very vividly:
ggplot(diamonds, aes(x=color, y=price)) + geom_boxplot()
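To see the kind of numbers a boxplot draws, the sketch below uses base R's boxplot.stats() on the prices of one group (color "D"). This is an illustration rather than what geom_boxplot() runs internally: ggplot2 computes the box edges as quartiles, which can differ very slightly from the hinges reported here. The object names are assumptions.

```r
library(ggplot2)

# Prices for a single group of diamonds (color "D")
d_prices <- diamonds$price[diamonds$color == "D"]

# Five numbers behind the box:
# lower whisker, lower hinge, median, upper hinge, upper whisker
box_numbers <- boxplot.stats(d_prices)$stats
box_numbers

# Values beyond the whiskers are the ones drawn as individual points
n_outliers <- length(boxplot.stats(d_prices)$out)
n_outliers
```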
We can also create “violin plots,” which combine boxplots with rotated kernel density plots:
ggplot(diamonds, aes(x=color, y=price)) + geom_violin()
1) Load the mtcars data
2) Plot the relationship between the mpg and hp variables in the form of a scatterplot with facets for the gear variable
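One possible way to approach this exercise is sketched below. Putting hp on the x-axis and mpg on the y-axis is my assumption, since the exercise does not say which variable goes where; exercise_plot is an illustrative name.

```r
library(ggplot2)

# mtcars ships with base R
data(mtcars)

# Scatterplot of mpg against hp, with one panel per value of gear
exercise_plot <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  facet_wrap(~ gear)

print(exercise_plot)
```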
ggplot2 is the most popular visualization package in R at present, but many others are becoming available all the time. One of my favorites is “tabplot”:
install.packages("tabplot")
This command lets you visualize your entire dataset at once:
library(tabplot)
tableplot(diamonds, fig.height=5)
Another popular visualization tool is the heatmap. We can make heatmaps in ggplot2, but we can also make them in base R.
Let's look at an example that comes from Nathan Yau's “FlowingData” website. First, we read in some data about NBA players:
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv", sep=",")
Next, let's sort the data by points, use the player names as row names, and then drop the first column because we no longer need it:
nba <- nba[order(nba$PTS),]
row.names(nba) <- nba$Name
nba <- nba[,2:20]
The heatmap command requires that the data be in matrix format, but then we can easily run the command as follows:
nba_matrix <- data.matrix(nba)
nba_heatmap <- heatmap(nba_matrix, Rowv=NA, Colv=NA, col = cm.colors(256), scale="column", margins=c(5,10))
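The scale="column" option standardizes each column before the colors are assigned, so that statistics measured in very different units become comparable. A rough sketch of that idea with base R's scale(), using a made-up matrix (toy and its values are hypothetical, not the NBA data):

```r
# Hypothetical toy data: two columns on very different scales
toy <- matrix(c(10, 20, 30,
                0.1, 0.5, 0.9),
              ncol = 2,
              dimnames = list(c("A", "B", "C"), c("points", "pct")))

# Center each column at 0 and divide by its standard deviation
toy_scaled <- scale(toy)
toy_scaled

# After scaling, every column has mean 0 and standard deviation 1
round(colMeans(toy_scaled), 10)
apply(toy_scaled, 2, sd)
```

Without this step, the column with the largest raw values would dominate the color scale.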
myplot<-ggplot(diamonds, aes(x=color, y=price)) + geom_violin()
ggsave(filename="organ donation.png", plot=myplot, width=5, height=5, dpi=300)
Most journals want you to use at least 300 dpi.
png("myplot.png", width=4, height=4, units="in", res=300)
nba_heatmap <- heatmap(nba_matrix, Rowv=NA, Colv=NA, col = cm.colors(256), scale="column", margins=c(5,10))
dev.off()
If you get funky errors on a Mac, you may need to install XQuartz.
load("Twitter Network Data.Rdata")
First, we need to clean up some of the character encoding (there are some non-Roman characters in the Twitter handles we scraped):
twitter_network_data$Source<-iconv(twitter_network_data$Source, "latin1", "ASCII", sub="")
twitter_network_data$Target<-iconv(twitter_network_data$Target, "latin1", "ASCII", sub="")
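To see what this iconv() call does, here is a small sketch on two made-up handles (they are hypothetical and not from the Twitter data). The second handle contains the latin1 byte for an accented "e", which has no ASCII equivalent.

```r
# Hypothetical handles; "\xe9" is the latin1 byte for an accented e
handles <- c("plain_user", "caf\xe9_fan")

# Bytes that cannot be converted to ASCII are replaced by sub = "",
# which means they are simply dropped
clean_handles <- iconv(handles, "latin1", "ASCII", sub = "")
clean_handles
```

Note that dropping characters can make two different handles identical, so it is worth checking for new duplicates after cleaning.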
install.packages("igraph")
library(igraph)
twitter_igraph <- graph_from_data_frame(twitter_network_data, directed=FALSE)
Note: igraph automatically generates a list of vertices from our edge list. R can also easily import network files from other software such as UCINET and Pajek.
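A toy example shows how the vertex list is inferred from an edge list. This sketch assumes igraph is installed and uses graph_from_data_frame(), the current name for graph.data.frame(); the three people in toy_edges are made up.

```r
library(igraph)

# Hypothetical edge list with the same column names as our Twitter data
toy_edges <- data.frame(Source = c("ann", "ann", "bob"),
                        Target = c("bob", "cat", "cat"))

toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

# The vertices are the unique names that appear in the edge list
V(toy_graph)$name
vcount(toy_graph)
ecount(toy_graph)
```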
Calculating network stats is extremely easy using igraph:
twitter_betweenness<-betweenness(twitter_igraph)
twitter_closeness<-closeness(twitter_igraph)
twitter_clustering_coefficient<-transitivity(twitter_igraph)
…and there are many, many more.
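These measures are easier to read on a network small enough to check by hand. The sketch below builds a hypothetical three-person chain (ann, bob, cat), in which bob sits between the other two.

```r
library(igraph)

# Hypothetical path network: ann - bob - cat
toy_path <- graph_from_literal(ann - bob - cat)

# bob has two ties; ann and cat have one each
degree(toy_path)

# bob lies on the only shortest path between ann and cat
betweenness(toy_path)

# bob is closest, on average, to everyone else
closeness(toy_path)

# No triangles are closed, so the clustering coefficient is 0
transitivity(toy_path)
```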
Working with two-mode, weighted, and dynamic network data in R is also very easy because of its sophisticated data manipulation tools as well as a number of different packages such as sna, tnet, and SoNIA.
plot(twitter_igraph)
If R returns the error “figure margins too large,” reset the plot margins:
par(mar = rep(2, 4))
only_cool_kids<-delete_vertices(twitter_igraph,which(degree(twitter_igraph)<20))
plot(only_cool_kids)
only_cool_kids<-simplify(only_cool_kids)
plot(only_cool_kids)
plot(only_cool_kids, layout=layout_as_tree)
plot(only_cool_kids, layout=layout_in_circle)
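The degree filter and the simplify() step above can be sketched on a toy graph that is small enough to check by hand. The edge list is made up, and the cutoff of 2 stands in for the cutoff of 20 used with the Twitter data.

```r
library(igraph)

# Hypothetical edge list with a repeated tie (ann-bob) and a self-loop (cat-cat)
toy_edges <- data.frame(Source = c("ann", "ann", "ann", "cat", "dan"),
                        Target = c("bob", "bob", "cat", "cat", "ann"))
toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

# Drop vertices with fewer than 2 ties (only dan in this toy example)
toy_core <- delete_vertices(toy_graph, which(degree(toy_graph) < 2))

# simplify() removes repeated edges and self-loops
toy_simple <- simplify(toy_core)

ecount(toy_core)
ecount(toy_simple)
```

Simplifying before plotting keeps repeated ties from being drawn on top of each other.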
As with ggplot, it is possible to do quite a bit of customization with graphics in igraph (e.g. changing color, size, label fonts etc.). For a great tutorial on this check out Katya Ognyanova's Rpub: https://rpubs.com/kateto/netviz.
Though I prefer R for creating and analyzing network data, I prefer Gephi for network visualization, and particularly dynamic network visualization.
1) Open Gephi
2) Select “Open Graph File”
3) Choose this file from our course dropbox:
Gephi Academic Networks.gexf
Close Gephi, and open this file in our course dropbox:
Hypertext Conference.gexf
This dataset was collected during the ACM Hypertext 2009 conference, where the SocioPatterns project deployed the Live Social Semantics application. Conference attendees volunteered to wear radio badges that monitored their face-to-face proximity. The dataset published here represents the dynamical network of face-to-face proximity of ~110 conference attendees over about 2.5 days.
One problem with Gephi is that importing data is not always very straightforward, though it is perhaps easier than in other network software such as Pajek.
There is a native Gephi file format called .gexf that is fairly stable and quite usable, but converting R data into .gexf takes a bit of practice and some rather complicated code. I prefer to move the data to .csv and then import it into Gephi, but if you'd like to try writing native Gephi files, check out the rgexf package.
# make edges (note: columns must be titled "Source" and "Target")
write.csv(twitter_network_data, file="edges.csv", row.names=FALSE)
# make nodes
nodes<-c(twitter_network_data$Source, twitter_network_data$Target)
nodes<-as.data.frame(nodes)
nodes<-unique(nodes)
# Gephi also requires Ids and labels that are the same as the node names
nodes$Id<-nodes$nodes
nodes$Label<-nodes$nodes
# row.names=FALSE keeps R's row numbers out of the files
write.csv(nodes, file="nodes.csv", row.names=FALSE)
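The same export can be tried end to end on a toy edge list before running it on the full Twitter data. In this sketch the data frame toy_edges is a hypothetical stand-in for twitter_network_data, and the files are written to a temporary folder so nothing in the working directory is overwritten.

```r
# Hypothetical edge list standing in for twitter_network_data
toy_edges <- data.frame(Source = c("ann", "ann", "bob"),
                        Target = c("bob", "cat", "cat"),
                        stringsAsFactors = FALSE)

# One row per unique account, with matching Id and Label columns
node_names <- unique(c(toy_edges$Source, toy_edges$Target))
toy_nodes <- data.frame(Id = node_names, Label = node_names)

# Write both tables to a temporary folder without R's row-name column
out_dir <- tempdir()
write.csv(toy_edges, file.path(out_dir, "edges.csv"), row.names = FALSE)
write.csv(toy_nodes, file.path(out_dir, "nodes.csv"), row.names = FALSE)

# Read the files back to confirm the column headers that were written
edge_headers <- names(read.csv(file.path(out_dir, "edges.csv")))
node_headers <- names(read.csv(file.path(out_dir, "nodes.csv")))
edge_headers
node_headers
```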
If you import attributes for nodes or edges, make sure you select the correct variable type (e.g. string, byte, etc.).
It can be helpful to save your work as a .gexf file from within Gephi in order to avoid repeating the steps of importing .csv files, which can be rather time consuming.
Some of the plug-ins have buttons that are hidden in the Data Laboratory, and they are not always well documented.
Gephi is great for large graphs, but you may need to change the memory settings in Gephi's “initialization” file.
One of the great strengths of R's open-source format is that it provides the most comprehensive suite of packages for statistical analysis currently available. While you might consider using R or Python to collect and process big data and then Stata or SAS to analyze it, it is important for you to know the basics of statistical analysis in R in order to perform some of the more advanced techniques that are not yet available in conventional software packages. In this class, we will therefore cover very basic statistics (e.g. cross-tabs, ANOVA, and linear regression).
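As a preview, each of these three techniques takes roughly one line of base R. The sketch below uses the built-in mtcars data; the choice of variables is only an illustrative assumption and not the example used in class.

```r
data(mtcars)

# Cross-tab: number of cars by cylinder count and number of gears
cyl_by_gear <- table(mtcars$cyl, mtcars$gear)
cyl_by_gear

# One-way ANOVA: does average mpg differ across cylinder groups?
mpg_anova <- aov(mpg ~ factor(cyl), data = mtcars)
summary(mpg_anova)

# Linear regression: mpg explained by horsepower and weight
mpg_model <- lm(mpg ~ hp + wt, data = mtcars)
summary(mpg_model)
```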